Presentation: 99.99% Availability via Smart Real-Time Alerting

Duration

5:25pm - 6:15pm

Key Takeaways

  • Learn how Uber is using real-time data streams for anomaly detection and how it handles huge growth in data volume.
  • Understand some of the machine learning challenges and solutions in developing a large-scale time series event system.
  • Understand the role of computational and statistical resilience in building real-time anomaly and outage detection systems.

Abstract

The Observability team at Uber focuses on providing intelligent real-time outage detection and root cause exploration at scale. This encompasses multiple building blocks: (i) a proprietary, scalable back-end store for application telemetry data that can service more than 500 million time series in real-time, (ii) a user-friendly and robust query language and UI for setting up alert configurations, (iii) the development of novel time series and machine learning models for fully automated, intelligent real-time outage and outlier detection, which have broken new ground in detection accuracy and speed whilst being sufficiently computationally tractable to be applied to hundreds of thousands of time series in real-time, and (iv) intelligent root cause exploration based on Zipkin-style distributed tracing.

Interview

Question: 
QCon: What is your role today?
Answer: 

Franziska: Currently, I am a Data Science Manager. I lead a team of six data scientists who are working on real-time anomaly and outage detection at Uber. Basically, what we are trying to do is find user-facing outages as quickly as possible. These are things like when people cannot sign in, cannot sign up, or cannot take a trip (or maybe where ETAs might be degraded).

Question: 
QCon: How is Uber applying real-time anomaly and outage detection with machine learning?
Answer: 

Franziska: It really comes back to a couple of challenging items. One is the size. We have about 500 million time series that we are currently tracking, and this space grows about 25% month over month. So let’s say, for example, 1% of these are business-critical metrics. That is still 5 million time series, and if you wanted to set even just static thresholds for those, you would need a whole group of engineers manually setting upper and lower bounds for them.

The second thing is Uber’s growth. As I said, the number of time series grows about 25% month over month, so even if you only wanted to run and maintain half of those static thresholds, you would be stuck re-setting and re-adjusting them over and over again. The third thing is the demand cycle that we experience. As you can imagine, if you look at a particular city, in the middle of the night the demand and number of trips will be much lower than during the peak hours of the morning rush-hour commute (or perhaps Saturday evening traffic). So what that means is that we get a sinusoidal pattern with 24-hour and 7-day cadences.

So a static threshold would not be very good at finding deviations in these time series that might indicate an outage. The team is building completely novel time series, statistical, and machine learning models to capture these different patterns and to find, as quickly as possible, when there are any deviations.
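
As a rough illustration of why a seasonally aware baseline works better than a fixed bound, here is a minimal Python sketch (not Uber’s actual models; the four-week history and the z-score cutoff are assumptions for illustration). It compares each hourly point with the same hour of the same weekday in previous weeks instead of with one static range.

    import numpy as np

    def static_threshold_alerts(series, lower, upper):
        # Flags every point outside fixed bounds; with a 24-hour / 7-day
        # demand cycle this either misses outages at peak or pages at night.
        return [i for i, x in enumerate(series) if x < lower or x > upper]

    def seasonal_alerts(series, period=24, history_weeks=4, z=4.0):
        # Compare each hourly point with the same hour of the same weekday
        # over the previous weeks, so the expected level tracks the cycle.
        week = period * 7
        alerts = []
        for i in range(week * history_weeks, len(series)):
            past = [series[i - week * k] for k in range(1, history_weeks + 1)]
            mu, sigma = np.mean(past), np.std(past) + 1e-9
            if abs(series[i] - mu) > z * sigma:
                alerts.append(i)
        return alerts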

Question: 
QCon: Are you using any particular frameworks and toolkits or is this something you are developing in-house?
Answer: 

Franziska: We are developing all of these things in-house. We typically prototype in Python and R. The company is moving towards Go, so we have integrated everything in Go. We have actually built functions that sit very close to the data. This puts a lot of constraints on the types of algorithms we can use and also on the architecture you have to consider for these systems.

Question: 
QCon: What do you want someone who comes to your talk to walk away with?
Answer: 

Franziska: The talk will have two aspects to it.

The first part outlines the monitoring stack that we have built over the last couple of months. The stack really showcases how to do monitoring at scale, from a practical perspective. Some of the things I’ll discuss are how we built our own in-house, Cassandra-based time series data store, which can handle 500 million time series and about 2 million writes per second (and also has to serve reads extremely frequently, because we have to run these checks on the data). I will also discuss, for example, how we built an openly available RPC tracing tool that helps us understand request flow across a highly distributed system of about 1,000 microservices and supports root cause exploration. So this will be one angle.

The second angle will be a deep dive into the various approaches and learnings we have from a data science perspective in solving the problems described in the first part. This is still very much an open research problem, and it is really interesting to be working with such cutting-edge technology in industry to solve such a crucial problem. We will go into the prerequisites for the algorithms, the kinds of approaches we are currently using, the outlook for the various tools, and how we can make the systems even more intelligent than they are today.

Question: 
QCon: Are you going to go into those models?
Answer: 

Franziska: Yes. That is basically a prerequisite. I won’t be discussing the details of the algorithms. If you were to build your own system, what are the kinds of things you would have to take into consideration? For example, you need computational resiliency (as we talked about), and you need statistical resilience. You have to be able to forget what happened in the past. For instance, if last week was Halloween (which is a very big day for us) and you just take a week-over-week approach, you would incorrectly flag something as low in volume, right? So you need to be robust to growth and to seasonality. Obviously, it is a streaming problem, so you have a lot of constraints in terms of how quickly the data comes in and how quickly you evaluate it (which again puts constraints on computational complexity).

So on a high level, I will be talking about all of these prerequisites and constraints that you would have to face in such a system. I will go into the ways we have accomplished this.
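
One generic way to get the “forget the past” property in a streaming setting is an exponentially weighted baseline. The sketch below is an illustration under assumed parameters (alpha, z), not the team’s actual model: old observations such as a Halloween spike are discounted geometrically, and the per-series state is just two numbers, which keeps the computational cost compatible with streaming evaluation.

    class EwmaDetector:
        # Streaming baseline that exponentially forgets old data; alpha and
        # the z cutoff are illustrative choices, not production values.
        def __init__(self, alpha=0.05, z=4.0):
            self.alpha, self.z = alpha, z
            self.mean = None
            self.var = 0.0

        def update(self, x):
            # Returns True if x deviates strongly from the forgetting baseline.
            if self.mean is None:
                self.mean = float(x)
                return False
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1.0 - self.alpha) * (self.var + diff * incr)
            return abs(diff) > self.z * (self.var ** 0.5 + 1e-9)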

Basically, we have a two-tiered system. The first tier takes a univariate approach, so every time series is looked at individually. This allows for an embarrassingly parallel approach. However, it might lead to some false positives: things that are outliers but not outages. You can imagine things like weather, concerts, and promotions. The Warriors games in San Francisco, for example, can cause very large spikes in demand and in our metrics. Obviously, an on-call engineer at 2:00am is not interested in being woken up because there is a Beyoncé concert happening in Australia right now, right? So we have a second filter that takes a multivariate approach. We take in multiple time series, events, etc., to get an enhanced signal and enhanced precision, and to really distinguish between what is an outlier and what is an outage.
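
A minimal sketch of that two-tier shape (with hypothetical names and a made-up corroboration rule; the actual multivariate model is not described here): tier one flags candidate anomalies per series, and tier two only escalates when related series agree.

    def tier_one(series_by_metric, detect):
        # Tier 1: run a univariate detector on every series independently,
        # which is embarrassingly parallel across metrics.
        return {name: detect(s) for name, s in series_by_metric.items()}

    def tier_two(flags, related_metrics, min_corroborating=3):
        # Tier 2 (illustrative): promote an outlier to a suspected outage
        # only if enough related series were flagged around the same time.
        outages = []
        for name, flagged in flags.items():
            if not flagged:
                continue
            corroborating = sum(1 for other in related_metrics.get(name, [])
                                if flags.get(other))
            if corroborating >= min_corroborating:
                outages.append(name)
        return outages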

Question: 
QCon: What do you feel is the most disruptive tech in IT right now?
Answer: 

Franziska: From a data science perspective, I think it’s definitely anomaly detection: doing it in a streaming fashion, with a highly scalable system, short time to detection, and high precision and recall, is something that is extremely important to a lot of tech companies, not only Uber. We know of many other tech companies that are interested in this and are currently working on it: Microsoft, for example, Airbnb, Yahoo, and many others. So I think this is a very interesting topic that matters to many, many people throughout the tech sector and even outside of it. The health sector, for example, has an interest in anomaly detection in event streams.

Speaker: Franziska Bell

Data Science Manager @Uber

Franziska Bell is lead data scientist of the Intelligent Decision Systems team at Uber, which focuses on developing new models for real-time outage and outlier detection. Since she joined Uber in late 2014, these models have broken new ground in detection accuracy and speed whilst being sufficiently computationally tractable to be applied to hundreds of thousands of time series in real-time. Before Uber, Franziska was a postdoc at Caltech, where she developed a novel, highly accurate approximate quantum molecular dynamics theory to calculate chemical reactions for large, complex systems such as enzymes. Franziska earned her Ph.D. in theoretical chemistry from UC Berkeley, focusing on developing highly accurate yet computationally efficient approaches that helped unravel the mechanism of non-silicon-based solar cells and the properties of organic conductors.
