Track: Production Readiness: Building Resilient Systems

Location: Ballroom BC

A production readiness review is used by software companies to determine whether a system's design and implementation are ready to be released to customers. The process identifies and addresses the reliability of a service, the sufficiency of its privacy and security coverage, and its ease of operation. This track explores which aspects of software need to be prepared before a system starts taking on full production load with customer data. Topics include observability, emergency response, capacity planning, release processes, and SLOs for availability and latency.

Track Host: Michelle Brush

Engineering Manager, SRE @Google
Michelle Brush is a math geek turned computer geek with 20 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an SRE Manager for Google, she leads the teams of SREs that ensure GCP's APIs are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data processing platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm.

10:35am - 11:25am

Monitoring and Tracing @Netflix Streaming Data Infrastructure

Netflix's streaming data infrastructure transports trillions of events per day and supports hundreds of stream processing jobs. The team behind it is small, and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce the operational burden for everyone, we developed a set of tools that enables automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and deep insight into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improve the stability of brokers and stream processing jobs. For data streams with high consistency requirements, instead of relying purely on aggregated counts that may be misleading, we trace individual events along their transport path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates, and latency in near real time and with high accuracy. These results have helped us further improve our service quality and validate design trade-offs.

The talk will present the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active, targeted tool development for operational use can quickly pay off with improved product quality and overall agility.
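
As a rough, hypothetical illustration of the event-tracing idea described above (not Netflix's actual implementation; all names and data shapes are invented), comparing per-event traces captured at the source and sink of a pipeline is enough to estimate loss, duplicates, and end-to-end latency:

    from collections import Counter

    def trace_report(source_events, sink_events):
        """Hypothetical sketch of end-to-end tracing analysis.

        source_events: dict of event_id -> timestamp when the event entered the pipeline.
        sink_events:   list of (event_id, timestamp) pairs observed at the pipeline sink;
                       the same id may appear more than once if it was duplicated.
        """
        sink_counts = Counter(event_id for event_id, _ in sink_events)

        lost = [eid for eid in source_events if sink_counts[eid] == 0]
        duplicated = [eid for eid, count in sink_counts.items() if count > 1]

        # Latency per event: time from the source to its first arrival at the sink.
        first_arrival = {}
        for eid, ts in sorted(sink_events, key=lambda pair: pair[1]):
            first_arrival.setdefault(eid, ts)
        latencies = [first_arrival[eid] - source_events[eid]
                     for eid in first_arrival if eid in source_events]

        return {
            "loss_rate": len(lost) / len(source_events) if source_events else 0.0,
            "duplicate_rate": len(duplicated) / len(source_events) if source_events else 0.0,
            "p50_latency": sorted(latencies)[len(latencies) // 2] if latencies else None,
        }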

Allen Wang, Architect & Engineer on the Real Time Data Infrastructure Team @Netflix

11:50am - 12:40pm

Observability in the Development Process: Not Just for Ops Anymore

Monitoring has historically been considered an afterthought of the software development cycle: something owned by the ops side of the room. But instead of trying to predict, right before release, the various ways something might go sideways, what might it look like to learn about our production systems in order to figure out what to build, how to build it, and for whom?

Observability is all about asking new questions of your systems -- and is something that should be built into the process of crafting software from the very beginning. In this talk, we'll explore what it looks like in practice, so that production stops being just where our development code runs into issues: it becomes where part of our development process lives.
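
As a loose sketch of building observability in from the beginning (illustrative only; the event fields and the destination are assumptions, not a specific vendor's API), a service might emit one wide, structured event per request so that new questions can be asked of production later:

    import json
    import time

    def handle_request(request_id, user_id, do_work):
        """Hypothetical sketch: record one structured event per request."""
        event = {"request_id": request_id, "user_id": user_id, "start": time.time()}
        try:
            result = do_work()
            event["status"] = "ok"
            return result
        except Exception as exc:
            event["status"] = "error"
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = (time.time() - event["start"]) * 1000
            print(json.dumps(event))  # in practice, ship this to an observability backend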

Christine Yen, Cofounder @honeycombio

1:40pm - 2:30pm

Building Confidence in Healthcare Systems Through Chaos Engineering

Healthcare demands resilient software. Healthcare systems are resistant to change, as change can be viewed as a threat to system availability. To scale and modernize these systems, software engineers have to build confidence in how they can continually introduce change.    

This talk will cover how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. It will focus on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker. It will share how they standardized their service deployment to have consistent instrumentation to get deep insight into the overall behavior of their system. It will explain strategies for how they applied traffic management approaches to safely introduce chaos engineering experiments, improving their overall understanding of the system.
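
As a loose illustration of using traffic management to bound the blast radius of a chaos experiment (a hypothetical sketch, not Cerner's actual setup; the function names and fraction are invented), only a small, configurable slice of requests might be routed through a fault-injecting path:

    import random

    def route_request(request, experiment_fraction=0.01, inject_fault=None):
        """Hypothetical sketch: send a small fraction of traffic through a chaos
        experiment (e.g., added latency or an injected error); the rest takes
        the steady-state path."""
        if inject_fault is not None and random.random() < experiment_fraction:
            return inject_fault(request)   # experiment path: fault injected here
        return handle_normally(request)    # steady-state path

    def handle_normally(request):
        return {"request": request, "status": "ok"}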

Carl Chesser, Principal Engineer @Cerner

2:55pm - 3:45pm

How to Invest in Technical Infrastructure

Deciding what to work on is always difficult, and it is especially treacherous for folks working as infrastructure engineers and leaders. Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Will shares Stripe's approaches to prioritizing infrastructure as your company scales, justifying (and maybe even expanding) your company's spend on technical infrastructure, exploring the whole range of possible areas to invest in, adapting your approach between periods of firefighting and periods of innovation, and balancing investment between supporting existing products and enabling new product development.

Will Larson, Foundation Engineering @Stripe

4:10pm - 5:00pm

Stop Talking & Listen: Practices for Creating Effective Customer SLOs

In this data-driven age, we are constantly collecting and analyzing monumental quantities of data. We want to know everything about our product: how our customers use it, how long they use it, and, more importantly, whether the product is even working. With all this data, we should be able to answer all of these questions. But it turns out that's not always the case. In this talk, we'll discuss some of the common pitfalls that arise from collecting and analyzing service data, such as relying only on 'out-of-the-box' metrics and not having feedback loops. Then we'll cover practical tips for reducing noise, increasing effective customer signals with SLOs, and analyzing customer pain points.
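
To make the SLO discussion concrete, here is a small hypothetical sketch (not from the talk; the function and target are invented) of computing availability against an SLO target and the remaining error budget from request counts:

    def error_budget_report(successful_requests, total_requests, slo_target=0.999):
        """Hypothetical sketch: compute SLO compliance and remaining error budget.

        slo_target is the fraction of requests that must succeed (e.g., 99.9%).
        """
        if total_requests == 0:
            return {"availability": None, "budget_remaining": 1.0}

        availability = successful_requests / total_requests
        allowed_failures = (1.0 - slo_target) * total_requests
        actual_failures = total_requests - successful_requests
        budget_remaining = (1.0 - actual_failures / allowed_failures) if allowed_failures else 0.0

        return {
            "availability": availability,
            "slo_met": availability >= slo_target,
            "budget_remaining": budget_remaining,  # negative means the budget is exhausted
        }

    # Example: 999,200 successes out of 1,000,000 requests against a 99.9% SLO
    print(error_budget_report(999_200, 1_000_000))

In that example, 800 of the 1,000 allowed failures have been spent, so 20% of the error budget remains.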

Cindy Quach, Site Reliability Engineer @Google
