Data Engineering is becoming increasingly relevant to our highly connected, AI-driven world. In the past, software engineers focused their efforts on developing scalable web architectures, until they realized that their biggest headache was their data architecture. For most of us, data architecture simply meant running an RDBMS for all of our needs, from transactional read-write workloads to ad-hoc point and scan analytics loads. As our data grew, so did our use cases for data-driven products (e.g. fraud detection systems, recommender systems, personalization services); these two rising trends combined to stress our RDBMSs beyond their capabilities. Data engineers entered the field to solve these problems by introducing specialized data stores (e.g. search engines, graph engines, NoSQL stores, large-scale processing engines such as Spark, and stream processors such as Beam and Flink) and the machinery to glue them together (e.g. ETL pipelines, Kafka, Sqoop, Flume). Today, data architectures are as vast and varied as the use cases they support. What are some emerging technologies and trends in this space, and how are some cutting-edge companies solving their problems? Come to this track to learn more.
Track: Emerging Trends in Data Engineering
Location: Bayview AB
Day of week: Tuesday
Track Host: Sid Anand
Sid Anand currently serves as PayPal's Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions including Agari's Data Architect, a Technical Lead in Search @ LinkedIn, Netflix's Cloud Data Architect, Etsy's VP of Engineering, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. When not working, Sid spends time with his wife, Shalini, and their two kids.
11:50am - 12:40pm
Massively scaling MySQL using Vitess
Are you dealing with the challenges of rapid growth? Are you thinking about how to scale your database layer? Should you use NoSQL? Should you shard your relational database? If you are facing these kinds of problems, this session is for you. Vitess is a database solution for deploying, scaling and managing large clusters of MySQL instances. It's architected to run as effectively in a public or private cloud architecture as it does on dedicated hardware. It combines and extends many important MySQL features with the scalability of a NoSQL database. This session gives an overview of the salient features of Vitess, and at the end, we'll cover some advanced features with a demo.
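To make the scaling model concrete, here is a minimal sketch of how an application queries a sharded Vitess keyspace: clients speak the ordinary MySQL protocol to vtgate, which routes each statement to the right shard. The host, port, schema, and table names below are illustrative assumptions, not details from the talk.

import pymysql

# Minimal sketch: connect to a local vtgate (commonly on port 15306) exactly
# as if it were a single MySQL server; the "commerce" keyspace and "customer"
# table are placeholder names.
conn = pymysql.connect(host="127.0.0.1", port=15306,
                       user="app_user", password="app_password",
                       database="commerce")
try:
    with conn.cursor() as cur:
        # vtgate routes the query to the correct shard(s) based on the
        # keyspace's vindex (sharding key) configuration.
        cur.execute("SELECT customer_id, email FROM customer WHERE customer_id = %s", (42,))
        print(cur.fetchone())
finally:
    conn.close()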
1:40pm - 2:30pm
Transaction Processing in FoundationDB
FoundationDB provides users with strongly consistent transactions without a two-phase commit protocol. This talk will go through the architecture of FoundationDB and describe what happens in the internals of the database when a client commits a transaction.
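For reference, here is a minimal sketch of what a client-side transaction looks like with FoundationDB's Python bindings; the key names, API version, and counter logic are illustrative assumptions, not material from the talk.

import fdb

# Select an API version before opening the database; 600 is illustrative.
fdb.api_version(600)
db = fdb.open()  # uses the default cluster file

@fdb.transactional
def increment(tr, key):
    # All reads and writes in this function belong to a single transaction;
    # the decorator retries the whole function if the commit conflicts.
    current = tr[key]
    count = fdb.tuple.unpack(current)[0] if current.present() else 0
    tr[key] = fdb.tuple.pack((count + 1,))
    return count + 1

print(increment(db, b"counters/page_views"))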
2:55pm - 3:45pm
Patterns of Streaming Applications
Stream processing engines are becoming pivotal for analyzing data. They have evolved from simple data-transport and processing machinery into engines capable of complex processing. The necessary features and building blocks of these engines are well known, and most capable engines share a familiar Dataflow-based programming model.
As with any new paradigm, building streaming applications requires a different mindset and approach, so there is a need to identify and describe patterns and anti-patterns for building these applications. That shared knowledge is currently scarce.
Drawn from my experience working with several engineers within and outside of Netflix, this talk will present the following:
- A blueprint for streaming data architectures and a review of desirable features of a streaming engine
- Streaming Application patterns and anti-patterns
- Use cases and concrete examples using Flink
Attendees will come away with patterns that can be applied to any capable stream processing framework such as Apache Flink.
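As a flavor of one such pattern, here is a framework-agnostic sketch of a keyed, tumbling-window aggregation (counting events per key per window), which in an engine like Apache Flink would correspond to a keyBy/window/aggregate pipeline; the event shape and window size are illustrative assumptions.

from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_window_counts(events):
    """Count events per key per tumbling window.

    events: iterable of (event_time_seconds, key) tuples in event-time order.
    """
    counts = defaultdict(int)
    current_window = None
    for event_time, key in events:
        window_start = int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        if current_window is not None and window_start != current_window:
            # The previous window closed: emit its aggregates downstream.
            yield current_window, dict(counts)
            counts.clear()
        current_window = window_start
        counts[key] += 1
    if current_window is not None:
        yield current_window, dict(counts)

stream = [(3, "play"), (42, "pause"), (61, "play"), (75, "play"), (130, "stop")]
for window_start, per_key_counts in tumbling_window_counts(stream):
    print(window_start, per_key_counts)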
4:10pm - 5:00pm
Training Deep Learning Models at Scale on Kubernetes
Deep Learning has recently become very important for all kinds of AI applications, from conversational chatbots to self-driving cars. In this talk, we will cover how we use deep learning for natural language processing, train deep learning models with TensorFlow, run TensorFlow on top of Kubernetes, and make use of GPUs.
We need to train a deep learning model for each conversational bot that we deploy on our platform. Training individual bots on one-off systems using ad-hoc processes is no longer feasible, as it does not scale with the number of bots in our system. To address these requirements, we have built a framework for running long-running jobs that leverages our existing Kubernetes infrastructure. We have designed our jobs framework to provide the following key benefits:
- Jobs can be executed on a fixed schedule, by a manual trigger, or by an automated trigger (i.e., some other event in our system can trigger a job).
- High availability of job workers.
- Scale up (or down) the number of workers for each job type based on need.
- We can assign specific attributes to specific workers. For example, we ensure that our training workers are always executed on GPU nodes so that they can take full advantage of the GPU resources available in our infrastructure.
- Simplified job management. This includes the ability to monitor, audit and debug each job that was executed. Further, using our systems for centralized logging and monitoring, we can quickly understand key results from the job. For example, in case of model training jobs, we can quickly look at the confusion matrix to understand if the trained model should be promoted to our production systems.
In the talk, we will present how we have leveraged Kubernetes to realize each of the above benefits.
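To make the GPU-placement point concrete, here is a minimal sketch of submitting a training job with the Kubernetes Python client and requesting a GPU so the scheduler places it on a GPU node; the image, command, job name, and namespace are placeholder assumptions, not details of our platform.

from kubernetes import client, config

# Load cluster credentials (use config.load_incluster_config() when running
# inside a pod).
config.load_kube_config()

container = client.V1Container(
    name="trainer",
    image="registry.example.com/bot-trainer:latest",  # placeholder image
    command=["python", "train.py", "--bot-id", "42"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}  # a GPU limit forces GPU-node placement
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-bot-42"),
    spec=client.V1JobSpec(
        backoff_limit=2,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-jobs", body=job)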
5:25pm - 6:15pm
The Whys and Hows of Database Streaming
Batch-style ETL pipelines have long been the de facto method for getting data from OLTP to OLAP database systems. At WePay, when we first built our data pipeline from MySQL to BigQuery, we adopted this tried-and-true approach. However, as our company scaled and our business needs grew, we observed a stronger demand for making data available for analytics in real time. This led us to redesign our pipeline around a streaming approach using open-source technologies such as Debezium and Kafka.
This talk covers the central design pattern behind database streaming, change data capture (CDC), and its advantages over alternative approaches such as triggers or event sourcing. To solidify the concept, we will walk through our MySQL-to-BigQuery streaming pipeline in detail, explaining the core components involved and how we built the pipeline to be resilient to failure. Finally, we will expand on our ongoing work around the additional challenges we face when streaming peer-to-peer distributed databases (e.g. Cassandra), and what some potential solutions are.
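For readers unfamiliar with how a CDC pipeline is wired up, here is a minimal sketch of registering a Debezium MySQL source connector with a Kafka Connect cluster over its REST API, so that row-level changes flow into Kafka topics; the hostnames, credentials, and table names are placeholder assumptions, not WePay's configuration.

import json
import requests

# Placeholder Debezium MySQL connector configuration; every hostname,
# credential, and table name below is illustrative.
connector = {
    "name": "mysql-cdc-example",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.example.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "5400",
        "database.server.name": "example-db",  # prefix for change-event topics
        "table.whitelist": "payments.transactions",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.payments",
    },
}

# Register the connector with the Kafka Connect REST API.
resp = requests.post(
    "http://connect.example.internal:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())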
Tracks
Monday, 5 November
-
Microservices / Serverless Patterns & Practices
Evolving, observing, persisting, and building modern microservices
-
Practices of DevOps & Lean Thinking
Practical approaches using DevOps & Lean Thinking
-
JavaScript & Web Tech
Beyond JavaScript in the Browser. Exploring WebAssembly, Electron, & Modern Frameworks
-
Modern CS in the Real World
Thoughts pushing software forward, including consensus, CRDTs, formal methods, & probabilistic programming
-
Modern Operating Systems
Applied, practical, & real-world deep-dive into industry adoption of OS, containers and virtualization, including Linux on Windows, LinuxKit, and Unikernels
-
Optimizing You: Human Skills for Individuals
Better teams start with a better self. Learn practical skills for ICs
Tuesday, 6 November
-
Architectures You've Always Wondered About
Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, & more
-
21st Century Languages
Lessons learned from languages like Rust, Go, Swift, Kotlin, and more.
-
Emerging Trends in Data Engineering
Showcasing DataEng tech and highlighting the strengths of each in real-world applications.
-
Bare Knuckle Performance
Killing latency and getting the most out of your hardware
-
Socially Conscious Software
Building socially responsible software that protects users' privacy & safety
-
Delivering on the Promise of Containers
Runtime containers, libraries, and services that power microservices
Wednesday, 7 November
-
Applied AI & Machine Learning
Applied machine learning lessons for SWEs, including tech around TensorFlow, TPUs, Keras, PyTorch, & more
-
Production Readiness: Building Resilient Systems
More than just building software, building deployable, production-ready software
-
Developer Experience: Level up your Engineering Effectiveness
Improving the end-to-end developer experience: design, dev, test, deploy, operate/understand.
-
Security: Lessons Attacking & Defending
Security from the defender's AND the attacker's point of view
-
Future of Human Computer Interaction
IoT, voice, mobile: Interfaces pushing the boundary of what we consider to be the interface
-
Enterprise Languages
Workhorse languages found in modern enterprises. Expect Java, .NET, & Node in this track