Presentation: Elastic Big Data Platform @ Datadog

Duration

1:40pm - 2:30pm

Key Takeaways

  • Learn strategies, tools, and techniques for building big data platforms leveraging cloud technologies.
  • Hear how the elasticity of the cloud meets the scaling needs of a data-heavy consumer.
  • Understand some of the tradeoffs that have to be made when choosing your cloud provider.

Abstract

At Datadog, we collect almost a trillion metric data points per day from hosts, containers, services, and customers all over the world. We have built a highly elastic, cloud-based platform to power analytics, machine learning, and statistical analysis on this data at high scale.

In this talk, we will discuss the cloud-based platform we have built and how it differs from a traditional datacenter-based analytics stack. We will walk through the decisions we have made at each layer, including leveraging S3 for data storage; isolating job families on their own ephemeral, spot-instance-powered clusters; tailoring hardware to the job family; optimizing the development cycle with git-powered deployment; and more. We'll cover the pros and cons of these decisions versus a traditional stack in detail.

We'll also discuss the tooling we have built to manage this level of dynamism and make it simple for data scientists and engineers to use. Finally, we will end with recommendations for folks getting started with their own analytics platform in the cloud: tools, frameworks, and platforms you can build upon.

Interview

Question: 
QCon: What is your role today?
Answer: 

Doug: I am the Director of Engineering at Datadog. I am responsible for a few different teams that span our online data business and our offline data work. So data science, data engineering, and the monitoring team (that deals with ensuring that our customers get alerted whenever there is an issue).

Question: 
QCon: So what is the space that Datadog operates in?
Answer: 

Doug: We are a monitoring service for large-scale cloud companies. We monitor all of the hosts and infrastructure, the applications that are running on those hosts, and the things that people connect to. We pull all of that in, in real time, and surface it through visualization and monitoring. It's a monitoring tool for a large, dynamic infrastructure.

Question: 
QCon: So you target multiple cloud environments, not just AWS, but also Azure and GCP. Is that true?
Answer: 

Doug: Yes, pretty much every major cloud environment, and in the datacenter as well. It works wherever folks have hosts or containers.

Question: 
QCon: What’s the motivation for your talk?
Answer: 

Doug: I want to talk about the specific things that you see when you build a platform for big data in the cloud. It differs from the more traditional platforms that run in a less dynamic datacenter environment. I think we’ve pushed the on-prem datacenter about as far as you can go.

We are using a pretty much elastic infrastructure for our processing, and we do everything in Amazon S3 for our data. We have a lot of systems built up to handle that dynamism. I want to talk about what you see at each layer when you deal with this environment, and how that contrasts with the more traditional systems that people might be used to for big data.

Question: 
QCon: When you talk about elasticity at Datadog, can you put that in perspective? What does it mean to be elastic at Datadog?
Answer: 

Doug: In this context, it means that the compute resources we are using, the hosts, are entirely scaled up and down for the jobs that we need to do in the cloud. In addition (at least for the high-latency big data portion of the platform), they are entirely using spot instances. We’re doing ephemeral instances and dynamic pricing. We are balancing all that together for the workload and using different types of hardware based on the given workload. That’s the sort of thing that you can only do in the cloud. The cloud offers a lot of advantages for cost and for flexibility, but it also presents challenges for managing that level of dynamism.
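
To make that concrete, here is a minimal boto3 sketch of requesting ephemeral spot capacity for a short-lived cluster. The AMI ID, instance type, bid, count, and key pair are hypothetical placeholders for illustration, not Datadog's actual configuration.

```python
# Sketch: request ephemeral spot capacity for a short-lived cluster.
# All values (AMI, instance type, bid, count, key pair) are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.50",          # maximum bid per instance-hour
    InstanceCount=50,          # sized up or down per job family
    Type="one-time",           # ephemeral request, nothing persistent
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder AMI
        "InstanceType": "r4.2xlarge",        # hardware tailored to the job family
        "KeyName": "analytics-cluster",      # placeholder key pair
    },
)

request_ids = [r["SpotInstanceRequestId"] for r in response["SpotInstanceRequests"]]
print("Submitted spot requests:", request_ids)
```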

Question: 
QCon: What is that elasticity scale-wise? What are the lower and upper bounds? How much are you scaling?
Answer: 

Doug: It’s in the hundreds of nodes for this particular platform. Sometimes we are up at hundreds and hundreds of nodes; other times we are down at almost zero when we are not doing anything. It’s very dynamic, and it’s used both for ‘human-facing jobs’, the things that our data scientists are doing with the data, and for automated jobs that run on a schedule.

Question: 
QCon: What’s the framework that you are using? Is this streaming data? Are you talking Spark, Flink?
Answer: 

Doug: For this platform, I am talking mostly about batch data, so Hadoop and Spark mostly. We use Luigi for workflow management. In the talk, I want to consider other options for all of those things. For people in the audience who want to build something like this, I particularly want to focus on what the options are at every layer of the stack: what you are looking at, what’s out there for you, and what you are going to have to build. Some of this stuff we’ve custom-built ourselves, but other pieces are freely available as open source.
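
For readers who haven't used Luigi, a minimal task might look like the sketch below. The bucket, path, and rollup logic are hypothetical placeholders, not taken from Datadog's pipelines.

```python
# Sketch of a Luigi batch task that writes its output to S3.
# Bucket names, paths, and the "rollup" logic are hypothetical.
import luigi
from luigi.contrib.s3 import S3Target


class DailyMetricRollup(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Luigi checks whether this target exists to decide if the task must run.
        return S3Target("s3://example-analytics/rollups/{}.csv".format(self.date))

    def run(self):
        # Placeholder aggregation; a real job would kick off Hadoop/Spark work here.
        with self.output().open("w") as out:
            out.write("metric,count\nexample.metric,0\n")


if __name__ == "__main__":
    luigi.run()
```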

Question: 
QCon: When you use spot instances and you spin things up, do you have analytics on what and how you actually spin things up? Is it just a traditional kind of start-and-stop, demand-based elasticity?
Answer: 

Doug: We’ve done a lot of research about which types of instances we want to use, and Datadog actually measures spot-market pricing every minute, so we can see the fluctuations there. We don’t have research to offer along the lines of ‘this time of day versus that time of day’. I might do a little bit of looking into that, but mostly we’ve measured the volatility of the different instance types and choose our types based on that.
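
As a rough illustration of that kind of measurement, the EC2 API exposes spot price history that can be polled on a schedule. The instance types and one-hour window below are arbitrary choices for the sketch.

```python
# Sketch: sample recent spot prices to gauge volatility per instance type.
# The instance types and time window are arbitrary illustrative choices.
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["r4.2xlarge", "c4.4xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)

for point in history["SpotPriceHistory"]:
    print(point["Timestamp"], point["AvailabilityZone"],
          point["InstanceType"], point["SpotPrice"])
```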

Question: 
QCon: Can you give me an example of some of the lessons you’ve learned to give people an idea of what they’ll hear about in the talk?
Answer: 

Doug: One of the things we’ve learned is that getting your cluster management right and figuring out how to deal with all of these dynamic clusters is challenging, particularly when you are dealing with both automated jobs and humans. The system we landed on is a tagging-based system, which works pretty well for us. There are also lessons about using Amazon S3 as a data store instead of HDFS (which is more traditional). S3 looks like a file system, and it’s tempting to use it exactly like a file system. But, if you do, you’ll find places where that breaks down pretty substantially. There are workarounds, but they are pretty onerous in some ways. That’s another tradeoff we found that we can talk about. We still like S3 a lot, but it’s not free compared to some other file systems.
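
One concrete place the file-system illusion breaks down is renaming: S3 has no rename operation, so a "move" is a copy followed by a delete, which is neither atomic nor cheap for large outputs the way an HDFS rename is. A minimal boto3 sketch of that workaround, with hypothetical bucket and key names:

```python
# Sketch: "renaming" an S3 object is really copy-then-delete.
# Unlike an HDFS rename, this is neither atomic nor independent of object size.
# The bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

bucket = "example-analytics"
src_key = "jobs/output/_tmp/part-00000"
dst_key = "jobs/output/part-00000"

s3.copy_object(
    Bucket=bucket,
    Key=dst_key,
    CopySource={"Bucket": bucket, "Key": src_key},
)
s3.delete_object(Bucket=bucket, Key=src_key)
```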

Speaker: Doug Daniels

Director of Engineering @Datadog

Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, Doug was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
