Presentation: Elastic Big Data Platform @ Datadog

Duration

1:40pm - 2:30pm

Key Takeaways

  • Learn strategies, tools, and techniques for building big data platforms leveraging cloud technologies.
  • Hear how the elasticity of the cloud meets the scaling needs of a data-heavy consumer.
  • Understand some of the tradeoffs that have to be made when choosing your cloud provider.

Abstract

At Datadog, we collect almost a trillion metric data points per day from hosts, containers, services, and customers all over the world. We have built a highly elastic, cloud-based platform to power analytics, machine learning, and statistical analysis on this data at high scale.

In this talk, we will discuss the cloud-based platform we have built and how it differs from a traditional datacenter-based analytics stack. We will walk through the decisions we have made at each layer, including leveraging S3 for data storage; isolating job families on their own ephemeral, spot-instance-powered clusters; tailoring hardware to the job family; optimizing the development cycle with git-powered deployment; and more. We'll cover the pros and cons of these decisions versus a traditional stack in detail.

We'll also discuss the tooling we have built to manage this level of dynamism and make it simple for data scientists and engineers to use. Finally, we will end with recommendations for folks getting started with their own analytics platform in the cloud: tools, frameworks, and platforms you can build upon.

Interview

Question: 
QCon: What is your role today?
Answer: 

Doug: I am the Director of Engineering at Datadog. I am responsible for a few different teams that span our online data business and our offline data work. So data science, data engineering, and the monitoring team (that deals with ensuring that our customers get alerted whenever there is an issue).

Question: 
QCon: So what is the space that Datadog operates in?
Answer: 

Doug: We are a monitoring service for large-scale cloud companies. We monitor all of the hosts and infrastructure, the applications that are running on those hosts, and the things that people connect to. We pull all of that in, in real time, and surface it through visualization and monitoring. It's a monitoring tool for a large, dynamic infrastructure.

Question: 
QCon: So you target multiple cloud environments, not just AWS, but also Azure and GCP. Is that true?
Answer: 

Doug: Yes, pretty much every major cloud environment, and in the datacenter as well. It works wherever folks have hosts or containers.

Question: 
QCon: What’s the motivation for your talk?
Answer: 

Doug: I want to talk about the specific things that you see when you build a platform for big data in the cloud. It differs from the more traditional platforms that run in a less dynamic datacenter environment. I think we’ve pushed the on-prem datacenter about as far as you can go.

We are using a pretty much elastic infrastructure for our processing, and we do everything in Amazon S3 for our data. We have a lot of systems built up to handle that dynamism. I want to talk about what you see at each layer when you deal with this environment, and how that contrasts with the more traditional systems that people might be used to for big data.

Question: 
QCon: When you talk about elasticity at Datadog, can you put that in perspective? What does it mean to be elastic at Datadog?
Answer: 

Doug: In this context, it means that the compute resources we are using, the hosts, are entirely scaled up and down for the jobs that we need to do in the cloud. In addition (at least for the high-latency big data portion of the platform), they are entirely using spot instances. We’re doing ephemeral instances and dynamic pricing. We are balancing all that together for the workload and using different types of hardware based on the given workload. That’s the sort of thing that you can only do in the cloud. The cloud offers a lot of advantages for cost and for flexibility, but it also presents challenges for managing that level of dynamism.
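
To make that concrete, here is a minimal boto3 sketch of requesting ephemeral spot capacity for a short-lived cluster. The AMI ID, instance type, bid, count, and key pair are hypothetical placeholders for illustration, not Datadog's actual configuration.

```python
# Sketch: request ephemeral spot capacity for a short-lived cluster.
# All values (AMI, instance type, bid, count, key pair) are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.50",          # maximum bid per instance-hour
    InstanceCount=50,          # sized up or down per job family
    Type="one-time",           # ephemeral request, nothing persistent
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "r4.2xlarge",        # hardware tailored to the job family
        "KeyName": "analytics-cluster",      # placeholder key pair
    },
)

request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]
print("Submitted spot requests:", request_ids)
```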

Question: 
QCon: What is that elasticity scale-wise? What are the lower and upper bounds? How much are you scaling?
Answer: 

Doug: It’s in the hundreds of nodes for this particular platform. Sometimes we are up at hundreds and hundreds of nodes; other times we are down at almost zero when we are not doing anything. It’s very dynamic, and it’s used both for ‘human-facing jobs’, the things that our data scientists are doing with the data, and for automated jobs that run on a schedule.

Question: 
QCon: What’s the framework that you are using? Is this streaming data? Are you talking Spark, Flink?
Answer: 

Doug: For this platform, I am talking mostly about batch data, so Hadoop and Spark mostly. We use Luigi for workflow management. In the talk, I want to consider other options for all of those things. For people in the audience who want to build something like this, I particularly want to focus on what the options are at every layer of the stack: what you are looking at, what’s out there for you, and what you are going to have to build. Some of this stuff we’ve custom-built ourselves, but other pieces are freely available as open source.
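
For readers who haven't used Luigi, a minimal task might look like the sketch below. The bucket, path, and rollup logic are hypothetical placeholders, not taken from Datadog's pipelines.

```python
# Sketch of a Luigi batch task that writes its output to S3.
# Bucket names, paths, and the "rollup" logic are hypothetical.
import luigi
from luigi.contrib.s3 import S3Target


class DailyMetricRollup(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Luigi checks whether this target exists to decide if the task must run.
        return S3Target("s3://example-analytics/rollups/{}.csv".format(self.date))

    def run(self):
        # Placeholder aggregation; a real job would kick off Hadoop/Spark work here.
        with self.output().open("w") as out:
            out.write("metric,count\nexample.metric,0\n")


if __name__ == "__main__":
    luigi.run()
```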

Question: 
QCon: When you use spot instances and you spin things up, do you have analytics on what and how you actually spin things up? Is it just a traditional kind of start-and-stop, demand-based elasticity?
Answer: 

Doug: We’ve done a lot of research about which types of instances we want to use, and Datadog actually measures spot-market pricing every minute, so we can see the fluctuations there. We don’t have research to offer along the lines of ‘this time of day versus that time of day’. I might do a little bit of looking into that, but mostly we’ve measured the volatility of the different instance types and choose our types based on that.
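
As a rough illustration of that kind of measurement, the EC2 API exposes spot price history that can be polled on a schedule. The instance types and one-hour window below are arbitrary choices for the sketch.

```python
# Sketch: sample recent spot prices to gauge volatility per instance type.
# The instance types and time window are arbitrary illustrative choices.
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["r4.2xlarge", "c4.4xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)

for point in history["SpotPriceHistory"]:
    print(point["Timestamp"], point["AvailabilityZone"],
          point["InstanceType"], point["SpotPrice"])
```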

Question: 
QCon: Can you give me an example of some of the lessons you’ve learned to give people an idea of what they’ll hear about in the talk?
Answer: 

Doug: One of the things we’ve learned is that getting your cluster management right and figuring out how to deal with all of these dynamic clusters is challenging, particularly when you are dealing with both automated jobs and humans. The system we landed on is a tagging-based system, which works pretty well for us. There are also lessons about using Amazon S3 as a data store instead of HDFS (which is more traditional). S3 looks like a file system, and it’s tempting to use it exactly like a file system. But, if you do, you’ll find places where that breaks down pretty substantially. There are workarounds, but they are pretty onerous in some ways. That’s another tradeoff we found that we can talk about. We still like S3 a lot, but it’s not free compared to some other file systems.
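
One concrete place the file-system illusion breaks down is renaming: S3 has no rename operation, so a "move" is a copy followed by a delete, which is neither atomic nor cheap for large outputs the way an HDFS rename is. A minimal boto3 sketch of that workaround, with hypothetical bucket and key names:

```python
# Sketch: "renaming" an S3 object is really copy-then-delete.
# Unlike an HDFS rename, this is neither atomic nor independent of object size.
# The bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

bucket = "example-analytics"
src_key = "jobs/output/_tmp/part-00000"
dst_key = "jobs/output/part-00000"

s3.copy_object(
    Bucket=bucket,
    Key=dst_key,
    CopySource={"Bucket": bucket, "Key": src_key},
)
s3.delete_object(Bucket=bucket, Key=src_key)
```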

Speaker: Doug Daniels

Director of Engineering @Datadog

Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, Doug was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
