Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Elastic Big Data Platform @ Datadog
Duration:
Level:
- Intermediate
Persona:
- Architect
Key Takeaways
- Learn strategies, tools, and techniques for building big data platforms leveraging cloud technologies.
- Hear how the elasticity of the cloud meets the scaling needs of a data-heavy consumer.
- Understand some of the tradeoffs that have to be made when choosing your cloud provider.
Abstract
At Datadog, we collect almost a trillion metric data points per day from hosts, containers, services, and customers all over the world. We have built a highly elastic, cloud-based platform to power analytics, machine learning, and statistical analysis on this data at high scale.
In this talk, we will discuss the cloud-based platform we have built and how it differs from a traditional datacenter-based analytics stack. We will walk through the decisions we have made at each layer, including leveraging S3 for data storage; isolating job families on their own ephemeral, spot-instance-powered clusters; tailoring hardware to the job family; optimizing the development cycle with git-powered deployment; and more. We'll cover the pros and cons of these decisions vs a traditional stack in detail.
We'll also discuss the tooling we have built to manage this level of dynamism and make it simple for data scientists and engineers to use. Finally, we will end with recommendations for folks getting started with their own analytics platform in the cloud: tools, frameworks, and platforms you can build upon.
Interview
Doug: I am the Director of Engineering at Datadog. I am responsible for a few different teams that span our online data business and our offline data work. So data science, data engineering, and the monitoring team (which ensures that our customers get alerted whenever there is an issue).
Doug: We are a monitoring service for large-scale cloud companies. We monitor all of the hosts and infrastructure, the applications running on those hosts, and the things that people connect to. We pull all of that in, in real time, and surface it through visualizations and monitoring. It's a monitoring tool for a large, dynamic infrastructure.
Doug: Yes, pretty much every major cloud environment and also in the datacenter as well. It works wherever folks have hosts or containers.
Doug: I want to talk about the specific things you see when you build a platform for big data in the cloud, and how it differs from the more traditional platforms built for a less dynamic datacenter environment. I think we've pushed the on-prem datacenter about as far as it can go.
We are using pretty much fully elastic infrastructure for our processing, and we keep all of our data in Amazon S3. We have a lot of systems built up to handle that dynamism. I want to talk about what you see at each layer when you deal with this environment, and how that contrasts with the more traditional systems that people might be used to for big data.
Doug: In this context, it means that the compute resources we are using, the hosts, are entirely scaled up and down for the jobs that we need to do in the cloud. In addition (at least for the high-latency big data portion of the platform), they are entirely using spot instances. We’re doing ephemeral instances and dynamic pricing. We are balancing all that together for the workload and using different types of hardware based on the given workload. That’s the sort of thing that you can only do in the cloud. The cloud offers a lot of advantages for cost and for flexibility, but it also presents challenges for managing that level of dynamism.
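The idea of tailoring hardware to the workload can be sketched as a small selection policy. This is a hypothetical illustration, not Datadog's actual logic; the family names and thresholds are invented for the example.

```python
# Hypothetical sketch: choose a broad EC2 instance family for a job
# family based on its resource profile. Thresholds are illustrative only.

def pick_instance_family(cpu_heavy: bool, memory_gb_per_core: float) -> str:
    """Map a workload profile to a coarse instance family."""
    if cpu_heavy and memory_gb_per_core < 4:
        return "c-family"   # compute-optimized
    if memory_gb_per_core > 8:
        return "r-family"   # memory-optimized
    return "m-family"       # general purpose

print(pick_instance_family(True, 2))    # c-family
print(pick_instance_family(False, 16))  # r-family
```

In a real system this decision would also factor in current spot-market prices and availability, since the cluster is rebuilt per job.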
Doug: It’s in the hundreds of nodes for this particular platform. Sometimes we are up in hundreds and hundreds of nodes. Other times, we are down at almost zero when we are not doing anything. It’s very dynamic, and it’s used both for human-facing jobs, the things our data scientists are doing with the data, and for automated jobs that run on a schedule.
Doug: For this framework, I am talking mostly about batch data, so Hadoop and Spark mostly. We use Luigi for batch workflow management. In the talk, I want to consider other options for all of those things. For people in the audience who want to build something like this, I particularly want to focus on what the options are at every layer of the stack: what you are looking at, what’s out there for you, and what you are going to have to build. Some of this we’ve custom built ourselves, but other pieces are freely available as open source.
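The batch-workflow style Luigi popularized, where each task declares its dependencies and an idempotent output, and the scheduler runs missing work dependencies-first, can be sketched in plain Python. This is an illustration of the pattern only, not Luigi's actual API.

```python
# Minimal sketch of a Luigi-style batch workflow: tasks declare
# "requires" (dependencies) and "complete" (is the output already there?);
# the runner walks the DAG and runs incomplete tasks, dependencies first.

class Task:
    def requires(self):  # upstream tasks this one depends on
        return []
    def complete(self):  # True if the task's output already exists
        return False
    def run(self):
        pass

def build(task, done=None):
    """Run `task` and its dependencies, each at most once."""
    done = set() if done is None else done
    for dep in task.requires():
        build(dep, done)
    name = type(task).__name__
    if name not in done and not task.complete():
        task.run()
        done.add(name)

log = []

class Extract(Task):
    def run(self):
        log.append("extract")

class Aggregate(Task):
    def requires(self):
        return [Extract()]
    def run(self):
        log.append("aggregate")

build(Aggregate())
print(log)  # ['extract', 'aggregate']
```

Real Luigi adds parameters, targets (e.g. S3 paths) for completeness checks, and retry handling, but the dependency-first execution order is the core idea.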
Doug: We’ve done a lot of research into which instance types we want to use, and Datadog actually measures spot-market pricing every minute, so we can see the fluctuations there. We don’t have research to offer on, say, ‘this time of day versus that time of day.’ I might do a little bit of looking into that, but, mostly, we’ve measured the volatility of the different instance types and chosen our types based on that.
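Choosing instance types by observed price volatility can be sketched as ranking types by the coefficient of variation of their recent spot prices. The price data below is made up for illustration; the instance names are just examples.

```python
# Hedged sketch: prefer spot instance types whose price history is stable.
# Prices here are invented; in practice you'd feed in measured data.

from statistics import mean, stdev

def volatility(prices):
    """Coefficient of variation: stdev relative to mean price."""
    return stdev(prices) / mean(prices)

observed = {  # hypothetical per-minute spot prices (USD/hour)
    "r4.2xlarge": [0.18, 0.18, 0.19, 0.18],  # steady
    "c4.2xlarge": [0.12, 0.30, 0.11, 0.45],  # volatile
}

ranked = sorted(observed, key=lambda t: volatility(observed[t]))
print(ranked[0])  # steadiest type first: r4.2xlarge
```

A lower coefficient of variation suggests fewer surprise price spikes, and so fewer spot interruptions mid-job.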
Doug: One of the things we’ve learned is that getting your cluster management right and figuring out how to deal with all of these dynamic clusters is challenging, particularly when you are dealing with both automated jobs and humans. The system we landed on is a tagging-based system, which works pretty well for us. There are also lessons about using Amazon S3 as a data store instead of HDFS (which is more traditional). S3 looks like a file system, and it’s tempting to use it exactly like one. But, if you do, you’ll find places where that breaks down pretty substantially. There are workarounds, but they are pretty onerous in some ways. That’s another tradeoff we found that we can talk about. We still like S3 a lot, but it’s not free compared to some other file systems.
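One concrete place the file-system analogy breaks down is rename: S3 has no atomic rename, so "moving" a directory means copying and deleting every object under the prefix. The sketch below models a bucket with a plain dict to show the cost; it is an illustration, not real S3 client code.

```python
# Sketch of why "rename" is expensive on an object store like S3:
# there is no atomic rename, so moving a prefix is a copy + delete
# per object. A dict stands in for the bucket here.

def rename_prefix(bucket, src, dst):
    """Move every key under `src` to `dst`. O(n) operations, not atomic."""
    ops = 0
    for key in [k for k in bucket if k.startswith(src)]:
        bucket[dst + key[len(src):]] = bucket[key]  # copy object
        del bucket[key]                             # delete original
        ops += 2
    return ops

bucket = {"logs/2017/a.gz": b"...", "logs/2017/b.gz": b"..."}
ops = rename_prefix(bucket, "logs/", "archive/")
print(sorted(bucket))  # ['archive/2017/a.gz', 'archive/2017/b.gz']
print(ops)             # 4 operations for 2 objects
```

On a POSIX file system or HDFS the same move is a single metadata operation; on S3 a job that "renames" its output directory pays per object, and a failure mid-move leaves the data half in each place.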
Tracks
Monday Nov 7
-
Architectures You've Always Wondered About
You know the names. Now learn lessons from their architectures
-
Distributed Systems War Stories
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
-
Containers Everywhere
State of the art in Container deployment, management, scheduling
-
Art of Relevancy and Recommendations
Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
-
Next Generation Web Standards, Frameworks, and Techniques
JavaScript, HTML5, WASM, and more... innovations targeting the browser
-
Optimize You
Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
-
Next Generation Microservices
What will microservices look like in 3 years? What if we could start over?
-
Java: Are You Ready for This?
Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
-
Big Data Meets the Cloud
Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
-
Evolving DevOps
Lessons/stories on optimizing the deployment pipeline
-
Software Engineering Softskills
Great engineers do more than code. Learn their secrets and level up.
-
Modern CS in the Real World
Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
-
Architecting for Failure
Your system will fail. Take control before it takes you with it.
-
Stream Processing
Stream Processing, Near-Real Time Processing
-
Bare Metal Performance
Native languages, kernel bypass, tooling - make the most of your hardware
-
Culture as a Differentiator
The why and how for building successful engineering cultures
-
//TODO: Security <-- fix this
Building security from the start. Stories, lessons, and innovations advancing the field of software security.
-
UX Reimagined
Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.