Track: Production Readiness: Building Resilient Systems

Day of week: Wednesday

A production readiness review is used by software companies to determine whether the design and implementation of a system are ready to be released to customers. The process identifies and addresses the reliability of a service, how well it covers privacy and security needs, and how easy it is to operate. This track explores which aspects of software need to be prepared before it takes on full production load with customers' data. Topics include observability, emergency response, capacity planning, release processes, and SLOs for availability and latency.

Track Host: Michelle Brush

Engineering Manager, SRE @Google

Michelle Brush is a math geek turned computer geek with over 15 years of software development experience. She has developed algorithms and data structures for pathfinding, search, compression, and data mining in embedded as well as distributed systems. In her current role as an SRE Manager for Google, she leads the team of SREs that ensures GCE's APIs are reliable. Previously, she served as the Director of HealtheIntent Architecture for Cerner Corporation, responsible for the data processing platform for Cerner’s Population Health solutions. Prior to her time at Cerner, she was the lead engineer for Garmin's automotive routing algorithm.

Monitoring and Tracing @Netflix Streaming Data Infrastructure

Netflix's streaming data infrastructure transports trillions of events per day and supports hundreds of stream processing jobs. The team behind it is small and there is no separate operations team. To efficiently manage and operate this huge infrastructure and reduce the operational burden for everyone, we developed a set of tools that enable automated operations and mitigations. Our Kafka monitoring tools provide comprehensive signals and insight into the health of our Kafka brokers and consumers, from which we derived ways to automate error handling that improve the stability of brokers and stream processing jobs. For data streams that have high consistency requirements, instead of relying purely on aggregated counts that may be misleading, we trace individual events along their transport path. Enabled by stream processing with minimal resources, tracing provides insight into end-to-end data loss, duplicates, and latency in near real time and with high accuracy. These results helped us further improve our service quality and validate design trade-offs.

The talk will give the design and implementation details of these dev/ops tools and highlight the critical roles they play in operating our data infrastructure. It will showcase how active and targeted tool development for operational use can quickly pay off with improved product quality and overall agility.

Allen Wang, Architect & Engineer in Real Time Data Infrastructure Team @Netflix

Observability in the Development Process: Not Just for Ops Anymore

Monitoring has historically been considered an afterthought of the software development cycle: something owned by the ops side of the room. But instead of trying to predict the various ways something might go sideways right before release, what might it look like to learn from our production systems in order to figure out what to build, how to build it, and for whom?

Observability is all about asking new questions of your systems -- and is something that should be built into the process of crafting software from the very beginning. In this talk, we'll explore what it looks like in practice, so that production stops being just where our development code runs into issues: it becomes where part of our development process lives.

Christine Yen, Cofounder @honeycombio

Tracks

Monday, 11 November

  • Ethics, Regulation, Risk, and Compliance

    With so much uncertainty, how do you bulkhead your organization and technology choices? Learn strategies for dealing with uncertainty.

  • Software Supply Chain

    Life of a software artifact from commit to deployment. Security, observability and provenance of the software supply chain.

  • Architectures You've Always Wondered About

    Next-gen architectures from the most admired companies in software, such as Netflix, Google, Facebook, Twitter, and more.

  • Languages of Infrastructure

    This track explores languages being used to code the infrastructure. Expect practices on toolkits and languages like CloudFormation, Terraform, Python, Go, Rust, and Erlang.

  • Building & Scaling High-Performing Teams

    To have a high-performing team, everybody on it has to feel and act like an owner. Organizational health and psychological safety are foundational underpinnings to support ownership.

  • Bare Knuckle Performance

    Killing latency and getting the most out of your hardware.

Tuesday, 12 November

Wednesday, 13 November