Presentation: Gobblin: A Framework for Solving Big Data Ingestion Problem
Traditionally, a Big Data system has been defined by the sheer volume of the datasets it handles and the processing power behind it. Today, it also means ingesting and integrating large amounts of data at high velocity and with high quality. The first part of the Big Data problem has been the focus of recent innovation.
In practice, however, the second part often becomes a major pain point well before developers get to the problems that follow. Drawing on first-hand experience with these ingestion and integration pain points, we built Gobblin, a unified data ingestion framework that addresses the following challenges:
- Source integration: The framework provides out-of-the-box adaptors for all our commonly accessed data sources, such as Salesforce, MySQL, Google, Kafka, and Databus.
- Processing paradigm: The framework supports both standalone and scalable platforms, including Hadoop and YARN. Integration with YARN provides the ability to run scheduled batch ingestion or continuous ingestion.
- Data quality assurance: The framework exposes data metrics collectors and data quality checkers as first-class citizens, which can be used to power continuous data validation.
- Extensibility: Data pipeline developers can integrate their own adaptors with the framework and make them available to other developers in the community.
- Self-service: Data pipeline developers can compose a data ingestion and transformation flow in the form of a DAG, using a simple pipeline definition language or a UI.
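To make the ideas in the list above concrete, here is a minimal sketch of an ingestion flow with a pluggable source, transformation steps, and quality checks that also collects simple metrics. It is not Gobblin's API: the `Pipeline` class, its fields, and the helper functions are all hypothetical names chosen for illustration, and the flow is a simple linear chain (the most basic case of a DAG) rather than a general graph.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


@dataclass
class Pipeline:
    """Hypothetical flow: one source, ordered transforms, record-level checks."""
    source: Callable[[], Iterable[Record]]
    transforms: List[Callable[[Record], Record]] = field(default_factory=list)
    checks: List[Callable[[Record], bool]] = field(default_factory=list)

    def run(self) -> Tuple[List[Record], Dict[str, int]]:
        accepted: List[Record] = []
        rejected = 0
        for record in self.source():
            # Apply each transformation step in order.
            for transform in self.transforms:
                record = transform(record)
            # A record is accepted only if every quality check passes.
            if all(check(record) for check in self.checks):
                accepted.append(record)
            else:
                rejected += 1
        metrics = {
            "read": len(accepted) + rejected,
            "accepted": len(accepted),
            "rejected": rejected,
        }
        return accepted, metrics


# Illustrative adaptor, transform, and check (all hypothetical).
def in_memory_source() -> Iterable[Record]:
    yield {"id": 1, "email": "A@Example.com"}
    yield {"id": 2, "email": None}


def normalize_email(record: Record) -> Record:
    email = record["email"]
    return {**record, "email": email.lower() if email else email}


def has_email(record: Record) -> bool:
    return bool(record["email"])


pipeline = Pipeline(
    source=in_memory_source,
    transforms=[normalize_email],
    checks=[has_email],
)
accepted, metrics = pipeline.run()
```

In this sketch, swapping `in_memory_source` for another callable plays the role of plugging in a different source adaptor, and the `metrics` dictionary stands in for the kind of counters a quality-validation step could report.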
In this talk, we will cover Gobblin’s system architecture, key design decisions and tradeoffs, and lessons learned from operating disparate LinkedIn use cases in production.
Lin Qiao