Azure Cosmos DB: Low Latency and High Availability at Planet Scale

Azure Cosmos DB is a fully managed, multi-tenant, distributed, shared-nothing, horizontally scalable database that provides planet-scale capabilities and multi-model support through APIs for Apache Cassandra, MongoDB, Gremlin, and Table, as well as the Core (SQL) API. It currently powers many mission-critical services both within Microsoft (such as Microsoft Teams and Active Directory) and across large-scale Fortune 500 organizations (such as Walmart and Adobe).

This talk covers the internal architecture of Azure Cosmos DB and how it achieves high availability, low latency, and scalability. We will first cover the design of the storage engine, with particular emphasis on ensuring high availability and scalability through partitioning and replication. Next, we will zoom in on the request routing gateway to see how it has evolved to solve the well-known multi-tenant cloud infrastructure challenges of containing noisy neighbors and limiting blast radius. Lastly, we will discuss performance as a feature and as a culture. We will cover what we measure and how we think about SLOs to achieve and maintain low latency. 

Building planet-scale services necessitates solving complex scalability challenges and making numerous tradeoffs across various components in the product. We look forward to sharing our experiences and lessons learned in building Azure Cosmos DB.


Speaker

Mei-Chin Tsai

Partner Director of Software Eng Manager @Microsoft, one of the original developers on .NET

Mei-Chin Tsai is an Engineering Director at Microsoft, responsible for the Azure Cosmos DB developer experience. She leads the effort to build a frictionless developer experience for Azure Cosmos DB, from the Software Development Kit and request routing gateway to OSS APIs and tooling (such as Notebook and Portal). She was previously the Development Manager for .NET Runtime and C# in Microsoft’s Developer Division. Mei-Chin graduated from the University of Illinois at Urbana-Champaign with a Ph.D. in Computer Science. She joined Microsoft in 1994 and was one of the original developers on .NET. She is passionate about scalability, performance, quality, and developer experience, and she is committed to growing and mentoring people. In her spare time, she loves to travel and is an avid tennis player.


Speaker

Vinod Sridharan

Principal Software Engineering Architect @Microsoft

Vinod Sridharan is a Principal Software Engineering Architect at Microsoft responsible for the Azure Cosmos DB APIs. He works on the design and architecture of the core components that power them: the gateway and the supporting distributed service infrastructure. Across various components including storage, transport, load balancing, and routing, Vinod drives low latency, high availability, and performance throughout the Azure Cosmos DB service. In his spare time, Vinod loves to travel, sing, and go hiking.


Date

Wednesday Oct 26 / 01:40PM PDT (50 minutes)

Location

Ballroom A

Topics

Architecture, High Availability, Low Latency, Scalability, Storage Engine, Partitioning and Replication, Request Routing Gateway, Cloud Infrastructure


From the same track

Session Architecture

Honeycomb: How We Used Serverless to Speed Up Our Servers

Wednesday Oct 26 / 11:50AM PDT

Honeycomb is the state of the art in observability: customers send us lots of data and then compose complex, ad-hoc queries. Most are simple, some are not. Some are REALLY not; this load is complex, spontaneous, and urgent.

Jessica Kerr

Principal Developer Evangelist @honeycombio

Session Architecture

From Zero to A Hundred Billion: Building Scalable Real Time Event Processing At DoorDash

Wednesday Oct 26 / 02:55PM PDT

At DoorDash, real-time events are an important data source for gaining insight into our business, but building a system capable of handling billions of real-time events is challenging.

Allen Wang

Software Engineer @DoorDash, previously Lead for real-time data infrastructure team @Netflix

Session Architecture

Magic Pocket: Dropbox’s Exabyte-Scale Blob Storage System

Wednesday Oct 26 / 04:10PM PDT

Magic Pocket is used to store all of Dropbox’s data.

Facundo Agriel

Software Engineer / Tech Lead @Dropbox, previously @Amazon

Session Architecture

Amazon DynamoDB: Evolution of a Hyper-Scale Cloud Database Service

Wednesday Oct 26 / 10:35AM PDT

Amazon DynamoDB is a cloud database service that provides consistent performance at any scale. Hundreds of thousands of customers rely on DynamoDB for its fundamental properties: consistent performance, availability, durability, and a fully managed serverless experience.

Akshat Vig

Principal Engineer NoSQL databases @awscloud