Presentation: 99.99% Availability via Smart Real-Time Alerting

Duration

5:25pm - 6:15pm

Key Takeaways

  • Learn how Uber is using real-time data streams for anomaly detection and how it handles huge growth in data volume.
  • Understand some of the machine learning challenges and solutions in developing a large-scale time series event system.
  • Understand the role of computational and statistical resilience in building real-time anomaly and outage detection systems.

Abstract

The Observability team at Uber focuses on providing intelligent real-time outage detection and root cause exploration at scale. This encompasses multiple building blocks: (i) a proprietary, scalable back-end store for application telemetry data that can service more than 500 million time series in real-time, (ii) a user-friendly and robust query language and UI for setting up alert configurations, (iii) the development of novel time series and machine learning models for fully automated, intelligent real-time outage and outlier detection, which have broken new ground in detection accuracy and speed whilst being sufficiently computationally tractable to be applied to hundreds of thousands of time series in real-time, and (iv) intelligent root cause exploration based on Zipkin-style distributed tracing.

Interview

Question: 
QCon: What is your role today?
Answer: 

Franziska: Currently, I am a Data Science Manager. I lead a team of six data scientists who are working on real-time anomaly and outage detection at Uber. Basically, what we are trying to do is find user-facing outages as quickly as possible. These are things like when people cannot sign in, cannot sign up, or cannot take a trip (or maybe where ETAs might be degraded).

Question: 
QCon: How is Uber applying real-time anomaly and outage detection with machine learning?
Answer: 

Franziska: It really comes back to a couple of challenging items. One is the size. We have about 500 million time series that we are currently tracking, and this space grows about 25% month over month. So let’s say, for example, 1% of these are business-critical metrics. That is still 5 million time series, and if you wanted to set even just static thresholds for those, you would need a whole group of engineers manually setting upper and lower bounds for them.

The second thing is Uber’s growth. As I said, the number of time series grows about 25% month over month, so even if you only wanted to run and maintain half of those static thresholds, you would be stuck re-setting and re-adjusting them over and over again. The third thing is the demand cycle that we experience. As you can imagine, if you look at a particular city, in the middle of the night the demand and number of trips will be much lower than during the peak hours of the morning rush-hour commute (or perhaps Saturday evening traffic). So what that means is that we get a sinusoidal pattern with 24-hour and 7-day cadences.

So a static threshold would not be very good at finding deviations in these time series that might indicate an outage. The team is building completely novel time series, statistical, and machine learning models to capture these different patterns and to find, as quickly as possible, when there are any deviations.
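
As a rough illustration of why a seasonally aware baseline works better than a fixed bound, here is a minimal Python sketch (not Uber’s actual models; the four-week history and the z-score cutoff are assumptions for illustration). It compares each hourly point with the same hour of the same weekday in previous weeks instead of with one static range.

    import numpy as np

    def static_threshold_alerts(series, lower, upper):
        # Flags every point outside fixed bounds; with a 24-hour / 7-day
        # demand cycle this either misses outages at peak or pages at night.
        return [i for i, x in enumerate(series) if x < lower or x > upper]

    def seasonal_alerts(series, period=24, history_weeks=4, z=4.0):
        # Compare each hourly point with the same hour of the same weekday
        # over the previous weeks, so the expected level tracks the cycle.
        week = period * 7
        alerts = []
        for i in range(week * history_weeks, len(series)):
            past = [series[i - week * k] for k in range(1, history_weeks + 1)]
            mu, sigma = np.mean(past), np.std(past) + 1e-9
            if abs(series[i] - mu) > z * sigma:
                alerts.append(i)
        return alerts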

Question: 
QCon: Are you using any particular frameworks and toolkits or is this something you are developing in-house?
Answer: 

Franziska: We are developing all of these things in-house. We typically prototype in Python and R. The company is moving towards Go, so we have integrated everything in Go. We have actually built functions that sit very close to the data. This puts a lot of constraints on the types of algorithms we can use and also on the architecture you have to consider for these systems.

Question: 
QCon: What do you want someone who comes to your talk to walk away with?
Answer: 

Franziska: The talk will have two aspects to it.

The first part outlines the monitoring stack that we have built over the last couple of months. The stack really showcases how to do monitoring at scale, from a practical perspective. Some of the things I’ll discuss are how we built our own in-house, Cassandra-based time series data store, which can handle 500 million time series and about 2 million writes per second (and also has to serve reads extremely frequently, because we have to run these checks on the data). I will also discuss, for example, how we built an openly available RPC tracing tool that helps us understand request flow across a highly distributed system of about 1,000 microservices and supports root cause exploration. So this will be one angle.

The second angle will be a deep dive into the various approaches and learnings we have from a data science perspective in solving the problems described in the first part. This is still very much an open research problem, and it is really interesting to be working with such cutting-edge technology in industry to solve such a crucial problem. We will go into the prerequisites for the algorithms, the kinds of approaches we are currently using, the outlook for the various tools, and how we can make the systems even more intelligent than they are today.

Question: 
QCon: Are you going to go into those models?
Answer: 

Franziska: Yes. That is basically a prerequisite. I won’t be discussing the details of the algorithms. If you were to build your own system, what are the kinds of things you would have to take into consideration? For example, you need computational resiliency (as we talked about), and you need statistical resilience. You have to be able to forget what happened in the past. For instance, if last week was Halloween (which is a very big day for us) and you just take a week-over-week approach, you would incorrectly flag something as low in volume, right? So you need to be robust to growth and to seasonality. Obviously, it is a streaming problem, so you have a lot of constraints in terms of how quickly the data comes in and how quickly you evaluate it (which again puts constraints on computational complexity).

So on a high level, I will be talking about all of these prerequisites and constraints that you would have to face in such a system. I will go into the ways we have accomplished this.
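
One generic way to get the “forget the past” property in a streaming setting is an exponentially weighted baseline. The sketch below is an illustration under assumed parameters (alpha, z), not the team’s actual model: old observations such as a Halloween spike are discounted geometrically, and the per-series state is just two numbers, which keeps the computational cost compatible with streaming evaluation.

    class EwmaDetector:
        # Streaming baseline that exponentially forgets old data; alpha and
        # the z cutoff are illustrative choices, not production values.
        def __init__(self, alpha=0.05, z=4.0):
            self.alpha, self.z = alpha, z
            self.mean = None
            self.var = 0.0

        def update(self, x):
            # Returns True if x deviates strongly from the forgetting baseline.
            if self.mean is None:
                self.mean = float(x)
                return False
            diff = x - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1.0 - self.alpha) * (self.var + diff * incr)
            return abs(diff) > self.z * (self.var ** 0.5 + 1e-9)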

Basically, we have a two-tiered system. The first tier takes a univariate approach, so every time series is looked at individually. This allows for an embarrassingly parallel approach. However, it might lead to some false positives: things that are outliers but not outages. You can imagine things like weather, concerts, and promotions. The Warriors games in San Francisco, for example, can cause very large spikes in demand and in our metrics. Obviously, an on-call engineer at 2:00am is not interested in being woken up because there is a Beyoncé concert happening in Australia right now, right? So we have a second filter that takes a multivariate approach. We take in multiple time series, events, etc., to get an enhanced signal and enhanced precision, and to really distinguish between what is an outlier and what is an outage.
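
A minimal sketch of that two-tier shape (with hypothetical names and a made-up corroboration rule; the actual multivariate model is not described here): tier one flags candidate anomalies per series, and tier two only escalates when related series agree.

    def tier_one(series_by_metric, detect):
        # Tier 1: run a univariate detector on every series independently,
        # which is embarrassingly parallel across metrics.
        return {name: detect(s) for name, s in series_by_metric.items()}

    def tier_two(flags, related_metrics, min_corroborating=3):
        # Tier 2 (illustrative): promote an outlier to a suspected outage
        # only if enough related series were flagged around the same time.
        outages = []
        for name, flagged in flags.items():
            if not flagged:
                continue
            corroborating = sum(1 for other in related_metrics.get(name, [])
                                if flags.get(other))
            if corroborating >= min_corroborating:
                outages.append(name)
        return outages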

Question: 
QCon: What do you feel is the most disruptive tech in IT right now?
Answer: 

Franziska: From a data science perspective, I think it’s definitely anomaly detection: doing it in a streaming fashion, with a highly scalable system, short time to detection, and high precision and recall, is something that is extremely important to a lot of tech companies, not only Uber. We know of many other tech companies that are interested in this and are currently working on it: Microsoft, for example, Airbnb, Yahoo, and many others. So I think this is a very interesting topic that matters to many, many people throughout the tech sector and even outside of it. The health sector, for example, has an interest in anomaly detection in event streams.

Speaker: Franziska Bell

Data Science Manager @Uber

Franziska Bell is lead data scientist of the Intelligent Decision Systems team at Uber, which focuses on developing new models for real-time outage and outlier detection. Since she joined Uber in late 2014, these models have broken new ground in detection accuracy and speed whilst being sufficiently computationally tractable to be applied to hundreds of thousands of time series in real-time. Before Uber, Franziska was a postdoc at Caltech, where she developed a novel, highly accurate approximate quantum molecular dynamics theory to calculate chemical reactions for large, complex systems such as enzymes. Franziska earned her Ph.D. in theoretical chemistry from UC Berkeley, focusing on developing highly accurate yet computationally efficient approaches that helped unravel the mechanism of non-silicon-based solar cells and the properties of organic conductors.
