Track: Modern Data Architectures

Location: Ballroom BC

Day of week: Tuesday

Data architecture is a fast-moving field. Yesterday's best practices can turn out to be inadequate for today's problems.  

We are looking to bring together data architects and engineers who have a deep understanding of the problems in the field today and a vision of what the future of data looks like: from modern solutions that have proved useful at scale to timeless design principles that remain relevant.

We'll explore the ideas and systems you need today to build data architectures that will still be useful in the future. 

Track Host: Gwen Shapira

Principal Data Architect @Confluent, PMC Member @Kafka, & Committer @Sqoop

Gwen is a principal data architect at Confluent, helping customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating microservices, relational databases, and big data technologies. She currently specializes in building real-time, reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka: The Definitive Guide” and “Hadoop Application Architectures”, and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects. When Gwen isn't coding or building data pipelines, you can find her pedaling on her bike, exploring the roads and trails of California and beyond.

10:35am - 11:25am

Future of Data Engineering

The current generation of data engineering has left us with data pipelines, data warehouses, and machine learning platforms that are largely batch-based and centrally managed. They are often manually operated, and integrating new systems can be cumbersome. Over the next few years, a number of trends are going to require us to rethink how and what we build. Data is now real time, companies are running many database technologies, teams are demanding more control of their data, and regulatory policy has begun dictating how and when we store data. This talk will present a vision of what it will take for data engineers to deliver a next-generation data ecosystem.

Chris Riccomini, Distinguished Engineer @WePay

11:50am - 12:40pm

Data Mesh Paradigm Shift in Data Platform Architecture

Many enterprises are investing in their next generation data platform, with the hope of democratizing data at scale to provide business insights and ultimately make automated intelligent decisions. Data platforms based on the data lake architecture have common failure modes that lead to unfulfilled promises at scale. 

In this talk Zhamak shares her observations on the failure modes of the centralized paradigm of the data lake, or its predecessor, the data warehouse.

She introduces Data Mesh, the next generation of data platforms, which shifts to a paradigm drawn from modern distributed architecture: treating domains as the first-class concern, applying platform thinking to create self-serve data infrastructure, and treating data as a product.

Zhamak Dehghani, Principal Technology Consultant @ThoughtWorks

1:40pm - 2:30pm

Taming Large State: Lessons From Building Stream Processing

Streaming engines like Apache Flink are redefining ETL and data processing. Data can be extracted, transformed, filtered, and written out in real time with an ease matching that of batch processing. However, the real challenge of matching the prowess of batch ETL remains in doing joins, maintaining state, and dynamically pausing or resting the data.

At Netflix, micro-services serve and record many different kinds of user interactions with the product. Some of these live services generate millions of events per second, all carrying meaningful but often partial information. Things start to get exciting when the company wants to combine the events coming from one high-traffic micro-service with those from another. Joining these raw events generates rich datasets that are used to train the machine learning models that serve Netflix recommendations.

Historically, Netflix has done this joining of large-volume datasets in batch. Recently, the company asked: if the data is being generated in real time, why can't it be processed downstream in real time? Why wait a full day for information from an event that was generated a few minutes ago?

This talk describes how we solved a complex join of two high-volume event streams at Netflix using Flink. You'll learn about:

  • Managing out-of-order events and processing late-arriving data
  • Using keyed state to maintain large state
  • Fault tolerance for stateful applications
  • Strategies for failure recovery
  • Schema evolution in a stateful real-time application
  • Data validation: batch vs. streaming
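The keyed-state join at the heart of this session can be sketched in miniature. The toy below is not Flink code and all names in it are hypothetical; it only illustrates the buffering idea the talk builds on: each side of the join keeps per-key state until the matching event from the other stream arrives, then emits the joined record and clears the state.

```python
class KeyedStateJoin:
    """Toy sketch of a keyed-state stream join (hypothetical names, not the
    Flink API): buffer each event per key until its counterpart arrives."""

    def __init__(self):
        # Per-key state for events still waiting for their counterpart.
        self.left_state = {}
        self.right_state = {}
        self.joined = []

    def on_left(self, key, event):
        if key in self.right_state:
            # Counterpart already arrived: emit the join and clear state.
            self.joined.append((key, event, self.right_state.pop(key)))
        else:
            self.left_state[key] = event

    def on_right(self, key, event):
        if key in self.left_state:
            self.joined.append((key, self.left_state.pop(key), event))
        else:
            self.right_state[key] = event

# Example: impressions and playbacks arriving out of order, keyed by a shared id.
join = KeyedStateJoin()
join.on_left("title-42", {"impression": "row-3"})
join.on_right("title-7", {"playback": "4k"})    # no impression yet: buffered
join.on_right("title-42", {"playback": "hd"})   # matches the buffered impression
```

In a real Flink job, the per-key dictionaries would be keyed state backed by a fault-tolerant state backend, with timers to expire entries whose counterpart never arrives — which is exactly where the talk's topics of large state, late data, and failure recovery come in.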

Sonali Sharma, Data Engineering and Analytics @Netflix
Shriya Arora, Senior Software Engineer @Netflix

2:55pm - 3:45pm

Kafka Needs No Keeper

We have been served well by ZooKeeper over the years, but it is time for Kafka to stand on its own. This is a talk on the ongoing effort to replace the use of ZooKeeper in Kafka: why we want to do it and how it will work. We will discuss the limitations we have found and how Kafka benefits in both stability and scalability by bringing consensus in-house. This effort will not be completed overnight, but we will discuss our progress, what work remains, and how contributors can help.

Colin McCabe, Software Engineer @confluentinc

4:10pm - 5:00pm

Modern Data Architectures Open Space

Session details to follow.

5:25pm - 6:15pm

Practical Change Data Streaming Use Cases With Apache Kafka & Debezium

Debezium (noun | de·be·zi·um | /dɪˈbiːziəm/) - Secret Sauce for Change Data Capture

Apache Kafka is a highly popular option for asynchronous event propagation between microservices. Things get challenging though when adding a service’s database to the picture: How can you avoid inconsistencies between Kafka and the database?

Enter change data capture (CDC) and Debezium. By capturing changes from the log files of the database, Debezium gives you both reliable and consistent inter-service messaging via Kafka and instant read-your-own-write semantics for services themselves.

In this session you’ll see how to leverage CDC for reliable microservices integration, e.g. using the outbox pattern, as well as many other CDC applications, such as maintaining audit logs, automatically keeping your full-text search index in sync, and driving streaming queries. We’ll also discuss practical matters, e.g. HA set-ups, best practices for running Debezium in production on and off Kubernetes, and the many use cases enabled by Kafka Connect's single message transformations.
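To make the outbox pattern above concrete, here is a minimal sketch of a Debezium connector registration payload, as it might be submitted to the Kafka Connect REST API. It assumes a Postgres source and an outbox table named public.outbox, and routes outbox rows with Debezium's outbox event router transformation; the hostname, credentials, and connector name are placeholders, not values from this session.

```json
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "change-me",
    "database.dbname": "orders",
    "database.server.name": "orders",
    "table.whitelist": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter"
  }
}
```

The EventRouter transformation reads each row captured from the outbox table and routes it to a Kafka topic derived from the row's aggregate type, so services never write to Kafka directly — they only insert into the outbox table inside their own database transaction.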

Gunnar Morling, Open Source Software Engineer @RedHat

Tuesday, 12 November