Presentation: The Lego model for machine learning pipelines
Key Takeaways
- Learn real-world approaches to developing reusable machine learning models.
- Discover approaches to making data science accessible to everyone.
- Hear lessons learned from building a modular, reusable machine learning pipeline at Salesforce.
Abstract
80-90% of data science is data cleaning and feature engineering. However, if we were to plot a count of what all the data science tools are for, we would find that most innovation happens in data infrastructure and modeling. We want to change that and make data scientists much more productive while also improving the quality of their work.
In this talk I will describe the machine learning platform we wrote on top of Spark to modularize these steps. This allows easy reuse of components, which simplifies both building models and changing them. The framework simplifies the data preparation and feature building stages with reusable classes for each data source, making subsequent feature generation a matter of a few lines of code.
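As a rough illustration of the "reusable class per data source" idea, the sketch below keeps the parsing logic for one source in a single class, so that each new feature is a one-line definition against it. It is written in plain Python rather than on Spark, and every name in it (`DataSource`, `Feature`, `build_matrix`, the `accounts` source and its fields) is hypothetical; it is not the Salesforce platform's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


@dataclass(frozen=True)
class Feature:
    """A named function from one parsed record to a numeric value."""
    name: str
    extract: Callable[[Record], float]


class DataSource:
    """Hypothetical reusable reader: parsing for one source is written once."""

    def __init__(self, name: str, fields: List[str], sep: str = ","):
        self.name = name
        self.fields = fields
        self.sep = sep

    def read(self, lines: Iterable[str]) -> List[Record]:
        # Split each raw line and pair the values with the declared field names.
        return [dict(zip(self.fields, line.strip().split(self.sep)))
                for line in lines]

    def feature(self, name: str, extract: Callable[[Record], float]) -> Feature:
        # Prefixing with the source name keeps feature names unique across sources.
        return Feature(f"{self.name}.{name}", extract)


def build_matrix(records: List[Record], features: List[Feature]) -> List[List[float]]:
    # One row per record, one column per feature.
    return [[f.extract(r) for f in features] for r in records]


# The source is defined once and shared; each feature is then a single line.
accounts = DataSource("accounts", ["id", "employees", "country"])
features = [
    accounts.feature("employees", lambda r: float(r["employees"])),
    accounts.feature("is_us", lambda r: 1.0 if r["country"] == "US" else 0.0),
]

raw = ["a1,120,US", "a2,15,DE"]
matrix = build_matrix(accounts.read(raw), features)
```

Anyone building a different model on the same source would reuse `accounts` and add only the feature lines they need.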
Model selection classes wrap the functionality of pre-existing and custom algorithms to provide a uniform interface for modeling, allowing rapid iteration as models evolve. By breaking the machine learning process into a series of simple-to-implement, interchangeable pieces, we have democratized the process of building machine learning models.
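One way to picture a uniform modeling interface is a small protocol that every algorithm wrapper satisfies, so that a selection routine can score interchangeable candidates without knowing what is inside them. The sketch below is a plain-Python assumption of what that might look like; `Model`, `MajorityClass`, `ThresholdOnFirstFeature`, and `select_model` are illustrative names and toy algorithms, not the classes described in the talk.

```python
from typing import List, Protocol, Sequence, Tuple

Row = Sequence[float]
Dataset = Tuple[List[Row], List[int]]


class Model(Protocol):
    """The uniform interface every wrapped algorithm is assumed to expose."""
    def fit(self, X: List[Row], y: List[int]) -> "Model": ...
    def predict(self, X: List[Row]) -> List[int]: ...


class MajorityClass:
    """Baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


class ThresholdOnFirstFeature:
    """Toy custom algorithm: predicts 1 when the first feature exceeds the training mean."""

    def fit(self, X, y):
        self.cut = sum(row[0] for row in X) / len(X)
        return self

    def predict(self, X):
        return [1 if row[0] > self.cut else 0 for row in X]


def accuracy(truth: List[int], predicted: List[int]) -> float:
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def select_model(candidates: List[Model], train: Dataset, valid: Dataset):
    """Fit every candidate on train and return (score, model) for the best on valid."""
    X_train, y_train = train
    X_valid, y_valid = valid
    scored = [(accuracy(y_valid, m.fit(X_train, y_train).predict(X_valid)), m)
              for m in candidates]
    return max(scored, key=lambda pair: pair[0])


train = ([[1.0], [2.0], [8.0], [9.0], [10.0]], [0, 0, 1, 1, 1])
valid = ([[0.0], [3.0], [7.0], [12.0]], [0, 0, 1, 1])
best_score, best_model = select_model(
    [MajorityClass(), ThresholdOnFirstFeature()], train, valid)
```

Because the selector depends only on `fit` and `predict`, swapping in another algorithm means adding one wrapper class and one entry in the candidate list.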
Interview with Leah McGuire
QCon: Leah, it sounds like you’re going to be talking a lot about a product you’ve built at Salesforce. Is this a product talk?
Leah: I think I'll talk mostly about what the biggest challenges are in machine learning, so the time sinks and the way you build machine learning models. I mean, the way machine learning models are generally built is to start from scratch and build up your data set, feature design, and do model selection. For example, this was the case at LinkedIn. The data is being used by hundreds of other data scientists to build similar models, but none of that work is reusable. So I'll talk about how we made it possible to re-use a lot of the feature extractions, feature cleaning, etc. that you have to do in order to do machine learning.
QCon: Are the problems you plan to discuss specific to Salesforce or something applicable to everyone?
Leah: The parts of the system I'm going to describe will be generally useful for people who are thinking about the same issues, because they have to build different models based on the same data sources. I think it can help with that. I think the principles and techniques I'm going to describe will be very useful to people who want to make building machine learning models more efficient. These are designs built to solve those problems.
QCon: So I have to ask, why didn’t you just use the Spark Machine Learning Pipeline?
Leah: There were a couple of reasons. The first was that it treats everything sequentially. So, basically, if you wanted to do a lot of transformations on your data, you would have to do them all sort of chained up. The second was that it didn't allow for non-deterministic transformations. So for example, if you need to pivot your data, the only way you can do that with the Spark ML framework is if you know exactly what was supposed to come out at the end, which is not always the case.
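One reading of the pivot problem is that the set of output columns is not known until the data has been scanned. The plain-Python sketch below illustrates that reading with a hypothetical `Pivot` class that discovers its columns in a `fit` step and only then has a fixed output schema. It uses no Spark calls, and the class, the `plan` column, and the sample rows are all made up for illustration.

```python
from typing import Dict, List

Row = Dict[str, str]


class Pivot:
    """Hypothetical one-hot pivot whose output columns are discovered from the data."""

    def __init__(self, column: str):
        self.column = column
        self.categories: List[str] = []

    def fit(self, rows: List[Row]) -> "Pivot":
        # The output width is unknown until this point: it depends on
        # which distinct values actually occur in the input.
        self.categories = sorted({row[self.column] for row in rows})
        return self

    def output_columns(self) -> List[str]:
        return [f"{self.column}={c}" for c in self.categories]

    def transform(self, rows: List[Row]) -> List[List[int]]:
        # A value that was not seen during fit maps to an all-zero row.
        return [[1 if row[self.column] == c else 0 for c in self.categories]
                for row in rows]


rows = [{"plan": "pro"}, {"plan": "free"}, {"plan": "pro"}, {"plan": "trial"}]
pivot = Pivot("plan").fit(rows)
columns = pivot.output_columns()
encoded = pivot.transform(rows)
```

Running the same code on a data set with different `plan` values would yield a different set of columns, which is why the schema cannot be declared up front.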
QCon: What is the key point you'd like people to leave your talk with?
Leah: The key is: don't think of machine learning as a one-time project, because it's something that you want to integrate into your product in many different ways. If you're smart and you build it like you're building another architecture, you're going to save yourself a lot of time in the future. No one wants to write the same function over and over with slight changes, but that's what happens a lot in machine learning, and it's really not necessary if you think about it.
QCon: So will you dive into any code examples?
Leah: Undoubtedly, I will have code examples. This is all built on Spark and Scala.
I think the code will be more examples of how you can implement specific ideas; it's not going to be "this is the way you should code this up." It's more like: if you want to make reusable transformations, this is the kind of interface that you might write for that.