Human-Centric Machine Learning Infrastructure @Netflix

What You’ll Learn

Learn the right questions to ask when moving heterogeneous models into production in a consistent, reliable way.
Hear about the choices the Netflix Machine Learning Infrastructure team made, and the reasoning behind them, when developing tooling for data scientists.
Understand some of the challenges of creating a paved road that takes machine learning models to production, and the solutions the team found.

Abstract

Netflix has over 100 data scientists applying machine learning to a wide range of business problems from title popularity predictions to quality of streaming optimizations. Our unique culture gives data scientists plenty of freedom to choose the modeling approach, libraries, and even the programming language that will make them productive at solving problems. However, we want to balance this freedom by providing a solid infrastructure for machine learning, ensuring models can be promoted quickly and reliably from prototype to production, and enabling reproducible and easily shareable results.

We started building this infrastructure a little over a year ago with a human-centric mindset. Many existing open-source machine learning frameworks are great at making advanced modeling possible. The job of our ML infrastructure is to make it remarkably easy to apply these frameworks to real business problems at Netflix. We have found that this requires an infrastructure that covers the day-to-day challenges of data scientists holistically, from understanding input data to building trust with consumers of models, not just the parts that are directly related to fitting and scoring models.

Come learn the techniques and underlying principles driving our approach, which you'll be able to adapt and apply to your own use cases.

Question:

What is the focus of the work you do at Netflix?

Answer:

I'm with the machine learning infrastructure team at Netflix. We work with about a hundred data scientists who solve all kinds of business problems. It’s not only video recommendations, but we also help answer many other questions to make Netflix an even better experience. We help all these data scientists to be more productive and make it easier for them to start prototyping their models to produce business value.

Question:

Netflix has a 'paved path' approach when it comes to software and microservices. Is it the same thing when it comes to machine learning?

Answer:

It is very much the same thing. We want to provide a 'paved path' so there's always a very clear, recommended way to do things. This is especially important for data science, since many people come from academia and are very adept at creating strong theoretical models, but when it comes to actually taking something to production and making it operationally solid, they typically need a lot of help. Integrating with a platform like Netflix can be non-trivial. At the same time, we want to balance that with the idea of freedom and responsibility, so people still have the freedom to choose the exact modeling approach they want to take. The platform has to be flexible.

Question:

From a high level, when you talk about a platform for machine learning are you talking about models that you cluster or are you talking about things that you stick into a container and then run across a cluster? What are we talking about for offering a platform for data scientists?

Answer:

One thing we have noticed is that the question of putting machine learning in production is far broader than how you train or score a model. It starts all the way from how you find the data you need, how you do the feature engineering, and how you scale it. Then obviously there is the training and scoring question, like how do you run it in a container management system.

After that, there are questions like how do you integrate the results of your machine learning models with downstream clients (the consumers). Finally, how do you operate the whole pipeline so that the people who consume the results can trust that they are always correct?

If you need to iterate on things, how can you go back to the drawing board and quickly release the next version? There are so many questions that are outside that narrow question of just training and scoring the model.
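
The stages listed in this answer can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not Netflix's tooling: every function name, the made-up input rows, the fixed "model" weights, and the version string are hypothetical, chosen to show how finding data, feature engineering, training, scoring, and publishing a versioned result fit together.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Result:
    """Scores published to downstream consumers, tagged with a version."""
    version: str
    scores: dict


def load_data():
    # Stage 1: find the data you need (hypothetical, hard-coded rows).
    return [
        {"title": "A", "views": [10, 12, 14]},
        {"title": "B", "views": [3, 2, 1]},
    ]


def build_features(rows):
    # Stage 2: feature engineering, kept separate so it can be iterated on.
    return {
        r["title"]: {
            "avg_views": mean(r["views"]),
            "trend": r["views"][-1] - r["views"][0],
        }
        for r in rows
    }


def train(features):
    # Stage 3: a stand-in for training. It ignores its input and returns
    # fixed weights; a real step would fit them with a modeling library.
    return {"avg_views": 1.0, "trend": 0.5}


def score(model, features):
    # Stage 4: apply the model (a weighted sum) to every title.
    return {
        title: sum(model[name] * value for name, value in f.items())
        for title, f in features.items()
    }


def run_pipeline(version):
    # Stage 5: publish a versioned result, so consumers know which
    # iteration of the pipeline produced the scores they are reading.
    features = build_features(load_data())
    model = train(features)
    return Result(version=version, scores=score(model, features))


result = run_pipeline("v1")
print(result.version, result.scores)
```

Keeping each stage behind its own function means a new idea in one stage, such as a different feature, can be released as the next version without touching the others.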

Question:

What do you use to manage the lifecycle, the CI/CD pipeline of machine learning? Is it custom software or is it open source tooling?

Answer:

It’s both. Obviously, Netflix has been investing in CI and CD for a long time and actually many of the tools they use, like Spinnaker and Titus, are open source. They are also really close to other open source tools like Kubernetes and whatever you want to use for CI.

Question:

Can you give me an example of some of the questions you get from data scientists when you are trying to deploy models?

Answer:

When it comes to common questions, as boring as it may sound, my experience is that machine learning infrastructure is much more about data than science. Most questions we get are related to data: how do I find the data I need, how do I set up the data pipeline, how do I handle somewhat non-trivial amounts of data in Python and R, can I use pandas, can I use R, how do I structure my feature engineering so I can quickly iterate on new ideas? We get many questions in that space.

We get questions about modeling as well, but usually, once you have the data in a beautiful data frame, the data scientists are totally happy to use the tools they know best, like scikit-learn or TensorFlow. So the questions we are seeing are mostly related to the data pipelines.

Speaker: Ville Tuulos

Machine Learning Infrastructure Engineer @Netflix

Ville Tuulos is a software architect in the Machine Learning Infrastructure team at Netflix. He has been building ML systems at various startups, including one that he founded, and large corporations for over 15 years. He enjoys exploring and building novel human-computer interfaces for complex domains, as well as low-level systems hacking.

Find Ville Tuulos at

@vtuulos

Track: Applied AI & Machine Learning

Location: Ballroom A

Duration: 10:35am - 11:25am

Day of week: Wednesday (7 November)

Level: Intermediate

Persona: Data Engineering, Data Scientist, ML Engineer
