Presentation: Gobblin: A Framework for Solving Big Data Ingestion Problem

Traditionally, a Big Data system was defined by the sheer volume of data it handles and the processing power behind it. Today it also means ingesting and integrating data at high velocity and with high quality. So far, the first part of the problem, scalable storage and processing, has received most of the attention, with many innovations built to tackle those challenges.

In reality, the latter part of the problem often becomes a major pain point long before developers get to the next set of problems. Drawing on first-hand experience with big data ingestion and integration pain points, we built Gobblin, a unified data ingestion framework that addresses the following challenges:

  • Source integration: the framework provides out-of-the-box adaptors for all of our commonly accessed data sources, such as Salesforce, MySQL, Google, Kafka, and Databus.
  • Processing paradigm: the framework supports both standalone and scalable platforms, including Hadoop and YARN. Integration with YARN makes it possible to run either scheduled batch ingestion or continuous ingestion.
  • Data quality assurance: the framework exposes data metrics collectors and data quality checkers as first-class citizens, which can be used to power continuous data validation (see the quality-check sketch after this list).
  • Extensibility: data pipeline developers can integrate their own adaptors with the framework and make them reusable by other developers in the community (see the adaptor sketch after this list).
  • Self-service: data pipeline developers can compose a data ingestion and transformation flow in the form of a DAG using a simple pipeline definition language or UI.
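
To make the quality-assurance idea concrete, here is a minimal sketch of a row-level check. The interface and class names below are hypothetical and are not Gobblin's actual API; they only illustrate how per-record checks can feed continuous validation.

```java
import java.util.Map;

// Hypothetical row-level quality check; Gobblin's real interfaces differ.
interface RowLevelPolicy {
    // Returns true if the record passes the check.
    boolean check(Map<String, Object> record);
}

// Example policy: a set of required fields must be present and non-null.
class RequiredFieldsPolicy implements RowLevelPolicy {
    private final String[] requiredFields;

    RequiredFieldsPolicy(String... requiredFields) {
        this.requiredFields = requiredFields;
    }

    @Override
    public boolean check(Map<String, Object> record) {
        for (String field : requiredFields) {
            if (record.get(field) == null) {
                return false; // fail on the first missing or null field
            }
        }
        return true;
    }
}
```

Records that fail such a check can be counted by a metrics collector and then dropped, quarantined, or used to fail the job, depending on policy.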
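
Similarly, the extensibility point can be pictured as a split between planning and extraction. The sketch below uses hypothetical interface names, not Gobblin's exact API: a source decides how to partition a pull into work units, and an extractor reads the records for one unit.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// Hypothetical adaptor interfaces; names and signatures are illustrative only.

// A work unit describes one partition of a pull, e.g. a table chunk or a Kafka partition.
class WorkUnit {
    final String partition;
    final long lowWatermark;
    final long highWatermark;

    WorkUnit(String partition, long lowWatermark, long highWatermark) {
        this.partition = partition;
        this.lowWatermark = lowWatermark;
        this.highWatermark = highWatermark;
    }
}

interface CustomSource<D> {
    // Plan the pull: split it into independently runnable work units.
    List<WorkUnit> getWorkUnits() throws IOException;

    // Create an extractor that reads the records for one work unit.
    CustomExtractor<D> getExtractor(WorkUnit workUnit) throws IOException;
}

interface CustomExtractor<D> extends AutoCloseable {
    // Stream the records belonging to the assigned work unit.
    Iterator<D> readRecords() throws IOException;
}
```

Once an adaptor exposes this planning/extraction split, the framework can schedule its work units on a standalone runner or on Hadoop/YARN, which is the processing-paradigm choice described in the list above.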

In this talk, we will cover Gobblin’s system architecture, key design decisions and tradeoffs, and lessons learned from operating disparate LinkedIn use cases in production.

Speaker: Lin Qiao
