Track: Big Data Meets the Cloud


Big Data technology and best practices have seen widespread adoption over the past few years. Understandably, Big Data technology vendors primarily focus on the needs of enterprises, which means that most of their products are developed for deployment and use within private data centers. In these environments, network topology, compute and storage placement, and hardware specifications are all under the control of data center operations. During a similar period, public cloud providers such as AWS, Azure, and Google Cloud Platform have seen a migration of mostly smaller companies (and some notable larger ones) to their services. How do companies that want to leverage the cloud adapt their Big Data technologies to work efficiently? Come to this track to learn from companies that have implemented their Big Data use-cases in the Cloud.

Key Takeaways:

  • Gain practical insight from a track focused on moving and running Big Data infrastructures in a cloud environment.
  • Hear practical lessons from companies leveraging cloud providers such as AWS and GCP.
  • Gain a better understanding of the state of the art for data infrastructure that leverages the cloud.
Track Host:
Jeff Magnusson
Director of Data Platform @StitchFix
As Director of the Data Platform at Stitch Fix, Jeff Magnusson leads the team responsible for building robust and scalable infrastructure and data services that integrate with numerous interfaces across the business. By leveraging machine computation together with expert human judgement to generate recommendations and insights, these platforms unlock innovative ways to utilize data science and machine learning that optimize and differentiate the way the company operates the business. Prior to Stitch Fix, Jeff managed the Data Platform Architecture team at Netflix, where he helped design and open source many of the components of the Hadoop-based infrastructure and big data platform. Jeff holds a PhD from the University of Florida, specializing in database system implementation.

Track Host Interview

QCon: Stitch Fix does a bunch of interesting things with ML, don’t you?

Jeff: Yeah, we’re a heavily data focused company. A primary output of the data science team is a recommendation engine that assists our personal stylists in finding the best clothing for our clients, but data science at Stitch Fix runs far deeper than that. It’s trying to predict not just what to sell to whom, but what to buy and where to place it, how to retain customers. All of that is heavily influenced by the data science team over here.

QCon: What is your track at QCon going to look like?

Jeff: The track is called Big Data Meets the Cloud. Some of that is based on my experience just running Big Data infrastructures in the cloud and scaling them. What I found is that there are best practices and technologies that exist and are pretty ubiquitous in datacenter-based environments that just don’t work the same way in a public cloud-based environment. There is tons of exciting tech out there right now. There is a lot of migration from traditional Hadoop to Spark. There are lots of new file formats, in-memory caching techniques, containerization through Docker.

All of that is super cool, it’s all open source. Most of it is being developed by companies that run big datacenter-based deployments, and oftentimes there are assumptions baked into those technologies that just don’t work the same way in the cloud. We are trying to explore best practices for actually scaling out that infrastructure in cloud environments: what works and what doesn’t, what tricks and techniques can be used in a cloud environment, and how we can exploit properties of the cloud such as elasticity to get a lot of mileage out of it.

QCon: What do you want people that come to your track to walk away with?

Jeff: Primarily, a better understanding of the state of the art for cloud-based data infrastructure deployment and the upcoming things that are really exciting. Better knowledge of best practices that exist, especially when they differ from best practices in a datacenter-based deployment. Then, we want to feature people from different major cloud providers - AWS, Google, Azure. One thing that would be really exciting is people walking away with a comparison of the tradeoffs and efficiencies between running Big Data infrastructure on those different platforms. The final thing for me is just to give people an awareness of areas where more extensions or capabilities need to be developed by the community.

The great thing about Big Data infrastructure is that so much of it is open source, but that means it is up to the engineering community to make it into what we need it to be, and the first step to that is awareness of where focus needs to be put.

10:35am - 11:25am

by Nikhil Garg
Engineering Manager @Quora

by Chun-Ho Hung
Software Engineer @Quora

Hundreds of millions of people use Quora to find accurate, informative, and trustworthy answers to their questions. All our infrastructure is built on top of AWS.

In this talk, we will discuss Quanta, Quora's counting system, which powers our high-volume, near-real-time analytics and serves many applications such as ads, content views, and dashboards.

Quanta counters support/are:

  • High write throughput...
11:50am - 12:40pm

by Matti Pehrs
Software Engineer @Spotify

by Mārtiņš Kalvāns
Big Data Engineer @Spotify

Spotify is currently one of the most popular music streaming services in the world, with over 100 million monthly active users. Over the last few years we have had phenomenal growth that has now pushed our backend infrastructure out of our data centers and into the cloud. Earlier this year we announced that we are transitioning all of our backend to Google Cloud Platform (GCP).

In this talk we are going to give a brief overview of what our Data Infrastructure tribe...

1:40pm - 2:30pm

by Doug Daniels
Director of Engineering @Datadog

At Datadog, we collect almost a trillion metric data points per day from hosts, containers, services, and customers all over the world. We have built a highly elastic, cloud-based platform to power analytics, machine learning, and statistical analysis on this data at high scale.

In this talk, we will discuss the cloud-based platform we have built and how it differs from a traditional datacenter-based analytics stack. We will walk through the decisions we have made at each layer,...

2:55pm - 3:45pm

Open Space
4:10pm - 5:00pm

by Stefan Krawczyk
Algo Dev Platform Lead @StitchFix

Stitch Fix is an online clothing retailer that not only focuses on delivering personalized clothing recommendations for our customers, but also applies the output of data science to automate numerous other business functions through the delivery of forecasts, predictions, and analyses via a robust API layer. We rely heavily on the ability of applied mathematics & statistics and our human decision makers to work synergistically; doing this well requires us to merge art & science...

5:25pm - 6:15pm

by Dan Weeks
Leads Big Data Compute @Netflix

by Tom Gianos
Senior Software Engineer, Big Data Platform @Netflix

Netflix runs one of the largest big data analytics infrastructures in the public cloud. Our platform leverages the scalability, reliability, and flexibility of the cloud to move quickly and innovate.

In this talk, we will discuss the overall big data platform architecture and dive into the two key design choices that underpin our platform: Storage and Orchestration. We will discuss how we leverage S3 as our data warehouse storage layer. We rely on Parquet as our primary storage format...



Monday Nov 7

Tuesday Nov 8

Wednesday Nov 9