Presentation: The lego model for machine learning pipelines



10:35am - 11:25am

Key Takeaways

  • Learn real-world approaches to developing reusable machine learning models.
  • Discover approaches to making data science accessible to everyone.
  • Hear lessons learned from building a modular, reusable machine learning pipeline at Salesforce.


80-90% of data science is data cleaning and feature engineering. However, if we were to plot a count of what all the data science tools are for, we would find that most innovation happens in data infrastructure and modeling. We want to change that and make data scientists much more productive while also improving the quality of their work.

In this talk I will describe the machine learning platform we wrote on top of Spark to modularize these steps. This allows easy reuse of components, simplifying both model building and later changes. The framework simplifies the data preparation and feature building stages with reusable classes for each data source, making subsequent feature generation a matter of a few lines of code.
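As a rough illustration of "one reusable class per data source", here is a minimal sketch in plain Python. It is not the Salesforce platform (which is built on Spark and Scala), and every name in it (`DataSource`, `feature`, the `emails` records) is hypothetical. It only shows how, once reading and grouping live in one class, each new feature becomes a one-line call.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


class DataSource:
    """Hypothetical reusable reader: written once per data source."""

    def __init__(self, name: str, rows: Iterable[Record], key: str):
        self.name = name
        self.rows = list(rows)
        self.key = key  # column that identifies the entity being modeled

    def feature(self, column: str,
                aggregate: Callable[[List[object]], float]) -> Dict[object, float]:
        """Group `column` by the source key and reduce each group to one value."""
        grouped: Dict[object, List[object]] = {}
        for row in self.rows:
            grouped.setdefault(row[self.key], []).append(row[column])
        return {key: aggregate(values) for key, values in grouped.items()}


# Illustrative data only; a real source would read from storage.
emails = DataSource("emails", [
    {"account": "a1", "opened": 1},
    {"account": "a1", "opened": 0},
    {"account": "a2", "opened": 1},
], key="account")

# Each additional feature is a single line against the shared source.
open_rate = emails.feature("opened", lambda xs: sum(xs) / len(xs))
email_count = emails.feature("opened", len)
```

Any model that needs email data can reuse `emails` instead of rebuilding the extraction step.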

Model selection classes wrap the functionality of pre-existing and custom algorithms to provide a uniform interface for modeling, allowing rapid iteration as models evolve. By breaking the machine learning process into a series of simple-to-implement, interchangeable pieces, we have democratized the process of building machine learning models.
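The idea of a uniform modeling interface can be sketched as follows. This is an assumption-laden toy in plain Python, not the platform's actual classes: `Model`, `MeanModel`, `LinearModel`, and `select_best` are hypothetical names, and the two models stand in for whatever pre-existing or custom algorithms are being wrapped.

```python
from typing import List, Sequence, Tuple


class Model:
    """Hypothetical common interface that every wrapped algorithm implements."""

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> "Model":
        raise NotImplementedError

    def predict(self, xs: Sequence[float]) -> List[float]:
        raise NotImplementedError


class MeanModel(Model):
    """Baseline: always predicts the training mean."""

    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean for _ in xs]


class LinearModel(Model):
    """One-dimensional least-squares line."""

    def fit(self, xs, ys):
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        var = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, xs):
        return [self.intercept + self.slope * x for x in xs]


def mean_squared_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def select_best(candidates: Sequence[Model],
                train: Tuple[Sequence[float], Sequence[float]],
                valid: Tuple[Sequence[float], Sequence[float]]) -> Model:
    """Fit every candidate the same way and keep the lowest validation error."""
    train_x, train_y = train
    valid_x, valid_y = valid
    scored = [(mean_squared_error(m.fit(train_x, train_y).predict(valid_x), valid_y), m)
              for m in candidates]
    return min(scored, key=lambda pair: pair[0])[1]


best = select_best([MeanModel(), LinearModel()],
                   train=([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
                   valid=([4.0], [8.0]))
```

Because the selector only relies on `fit` and `predict`, trying a new algorithm means adding one class to the candidate list.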

Interview with Leah McGuire

QCon: Leah, it sounds like you’re going to be talking a lot about a product you’ve built at Salesforce. Is this a product talk?

Leah: I think I'll talk mostly about what the biggest challenges are in machine learning, so the time sinks and the way you build machine learning models. I mean, the way machine learning models are generally built is to start from scratch and build up your data set, feature design, and do model selection. For example, this was the case at LinkedIn. The data is being used by hundreds of other data scientists to build similar models, but none of that work is reusable. So I'll talk about how we made it possible to re-use a lot of the feature extractions, feature cleaning, etc. that you have to do in order to do machine learning.

QCon: Are the problems you plan to discuss specific to Salesforce or something applicable to everyone?

Leah: The parts of the system I'm going to describe will be generally applicable to people who are thinking about the same issues, because they have to build several different models based on the same data sources. I think the principles and techniques I'm going to describe will be very useful to people who want to make building machine learning models more efficient. These are designs for solving those problems.

QCon: So I have to ask, why didn’t you just use the Spark Machine Learning Pipeline?

Leah: There were a couple of reasons. The first was that it treats everything sequentially. So, basically, if you wanted to do a lot of transformations on your data, you would have to do them all sort of chained up. The second was that it didn't allow for non-deterministic transformations. So for example, if you need to pivot your data, the only way you can do that with the Spark ML framework is if you know exactly what was supposed to come out at the end, which is not always the case.
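To make the pivot point concrete, here is a small plain-Python sketch (not Spark code, and not the approach from the talk). The hypothetical `Pivot` class shows why the output columns of a pivot cannot be written down in advance: they are whatever category values happen to appear in the data, so they have to be learned in a `fit` step and then reused when transforming new rows.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


class Pivot:
    """Hypothetical pivot whose output columns are discovered from the data."""

    def __init__(self, key: str, category: str, value: str):
        self.key = key
        self.category = category
        self.value = value
        self.columns: List[str] = []  # unknown until fit() has seen the data

    def fit(self, rows: Iterable[Row]) -> "Pivot":
        # The output schema is the set of category values present in the data.
        self.columns = sorted({str(row[self.category]) for row in rows})
        return self

    def transform(self, rows: Iterable[Row]) -> Dict[object, Dict[str, float]]:
        wide: Dict[object, Dict[str, float]] = {}
        for row in rows:
            cells = wide.setdefault(row[self.key], {c: 0.0 for c in self.columns})
            column = str(row[self.category])
            if column in cells:  # categories unseen during fit are dropped
                cells[column] += float(row[self.value])
        return wide


# Illustrative records only.
training = [
    {"account": "a1", "channel": "email", "spend": 10.0},
    {"account": "a1", "channel": "phone", "spend": 5.0},
    {"account": "a2", "channel": "email", "spend": 2.0},
]

pivot = Pivot("account", "channel", "spend").fit(training)
wide = pivot.transform(training)
# A new row with an unseen channel keeps the schema learned during fit.
scored = pivot.transform([{"account": "a3", "channel": "chat", "spend": 7.0}])
```

Fixing the columns at fit time keeps the feature vector the same width at training and scoring time, even though nobody specified those columns by hand.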

QCon: What is the key point you'd like people to leave your talk with?

Leah: The key is don't think of machine learning as a one time project, because it's something that you want to integrate into your product in many different ways. If you're smart and you build it like you're building another architecture, you're going to save yourself a lot of time in the future. No one wants to write the same function over and over with slight changes, but that's what happens a lot in machine learning, and it's really not necessary if you think about it.

QCon: So will you dive into any code examples?

Leah: Undoubtedly, I will have code examples. This is all built on Spark and Scala.

I think the code will be more examples of how you can implement specific ideas; it's not going to be "this is the way you should code this up." It's more like: if you want to make reusable transformations, this is the kind of interface that you might write for that.
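In that spirit, one possible shape for a reusable-transformation interface is sketched below in plain Python. It is an assumption, not the interface shown in the talk (which is written in Scala on Spark); `Feature`, `transform`, `compute`, `ratio`, and `scale` are all hypothetical names. Each feature records its parents, so a transformation can take several inputs rather than sitting in one sequential chain.

```python
from typing import Callable, Dict, List, Optional, Sequence


class Feature:
    """A named column plus the recipe that derives it from parent features."""

    def __init__(self, name: str,
                 parents: Sequence["Feature"] = (),
                 fn: Optional[Callable[..., float]] = None):
        self.name = name
        self.parents = list(parents)
        self.fn = fn  # None marks a raw feature read directly from the data

    def transform(self, name: str, fn: Callable[..., float],
                  *others: "Feature") -> "Feature":
        """Derive a new feature from this one and any number of other features."""
        return Feature(name, [self, *others], fn)

    def compute(self, raw: Dict[str, List[float]]) -> List[float]:
        if self.fn is None:
            return raw[self.name]
        inputs = [parent.compute(raw) for parent in self.parents]
        return [self.fn(*values) for values in zip(*inputs)]


# Transformations written once and reused across features and models.
def ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def scale(factor: float) -> Callable[[float], float]:
    return lambda x: x * factor


clicks, views = Feature("clicks"), Feature("views")
ctr = clicks.transform("ctr", ratio, views)        # two inputs, one output
ctr_pct = ctr.transform("ctr_pct", scale(100.0))   # built on a derived feature

# Illustrative data only.
raw = {"clicks": [1.0, 0.0, 3.0], "views": [4.0, 0.0, 4.0]}
result = ctr_pct.compute(raw)
```

The same `ratio` and `scale` functions can be applied to any other pair of features without rewriting them with slight changes.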
