Presentation: Yes, I Test In Production (And So Do You)

Track: Production Readiness: Building Resilient Systems

Location: Ballroom A

Duration: 4:10pm - 5:00pm

Day of week: Wednesday

Level: Intermediate

Persona: Architect, CTO/CIO/Leadership, Developer, General Software, Technical Engineering Manager


What You’ll Learn

  • Learn strategies for approaching observability in production services.

  • Hear why context matters so much for service owners.

  • Understand approaches to improve the reliability of your systems.

Abstract

Testing in production has gotten a bad rap.  People seem to assume that you can only test before production *or* in production.  But while you can rule some things out before shipping, those are typically the easy things, the known unknowns.  For any real unknown-unknown, you're gonna be Testing in Production.
And this is a good thing!  Staging environments are a black hole for engineering cycles.  Most interesting problems will only manifest under real workloads, on real data, with real users and real concurrency.  So you should spend far fewer of your scarce engineering cycles poring over staging, and far more of them building guard rails for prod -- ways to ship code safely, absorb failures without impacting users, test various kinds of failures in prod, detect them, and roll back.
You can never trust a green diff until it's baked in prod.  So let's talk about the tools and principles of canarying software and gaining confidence in a build.  What types of problems are worth setting up a capture/replay apparatus to boost your confidence?  How many versions should you be able to run simultaneously in flight?  And finally we'll talk about the missing link that makes all these things possible: instrumentation and observability for complex systems.
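
As a concrete illustration of canarying, here is a minimal sketch of weight-based traffic splitting in Go. Everything in it (the canaryRouter type, the backend addresses, the 5% starting weight) is invented for the example, not a reference to any particular tool:

```go
package main

import (
	"fmt"
	"math/rand"
)

// Hypothetical sketch of weight-based canary routing: a small share
// of live traffic goes to the new build, and the fraction is ramped
// up only as confidence in the build grows.

type canaryRouter struct {
	stableAddr    string  // backend running the trusted build
	canaryAddr    string  // backend running the new build
	canaryPercent float64 // share of requests sent to the canary, 0-100
}

func (r *canaryRouter) chooseBackend() string {
	if rand.Float64()*100 < r.canaryPercent {
		return r.canaryAddr
	}
	return r.stableAddr
}

func main() {
	router := &canaryRouter{
		stableAddr:    "10.0.0.1:8080",
		canaryAddr:    "10.0.0.2:8080",
		canaryPercent: 5, // start small; raise as the build bakes
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[router.chooseBackend()]++
	}
	fmt.Println(counts) // roughly a 95/5 split between stable and canary
}
```

Capture/replay takes this a step further: record real production traffic and replay it against the new build out of band, so the riskiest requests can be exercised without touching users at all.
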
Question: 

What's the motivation for this talk?

Answer: 

The motivation for this talk is to help people understand that deploying software carries an irreducible element of uncertainty and risk.  Trying too hard to prevent failures will actually make your systems and your teams *more* vulnerable to failure and prolonged downtime. So what can you do about it?

First, you can shift the way you think about shipping code.  Don’t think of a deploy as a binary switch, where you can stop thinking about it as soon as it’s successfully started up in production.  Think of it more like turning on the heat in your oven. Deploys simply kick off the process of gaining confidence in your code over time. You should never trust new code until it’s been baked in production, with real users, real services, real data, real networking.  Your confidence should rise in proportion to the range of conditions and user behaviors that your code has handled in the wild over time.
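
To make the oven metaphor concrete, here is a hedged sketch of a "bake gate": a build is only promoted after it has handled enough real traffic, over enough wall-clock time, at an acceptable error rate. The buildStats type and the thresholds are assumptions for illustration, not a real deploy tool's API:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative bake gate: confidence in a build is a function of the
// traffic and time it has survived in production, not of a green diff.

type buildStats struct {
	deployedAt time.Time
	requests   int
	errors     int
}

func baked(s buildStats, minRequests int, minBake time.Duration, maxErrRate float64) bool {
	if s.requests < minRequests {
		return false // hasn't seen enough real traffic yet
	}
	if time.Since(s.deployedAt) < minBake {
		return false // hasn't been exposed to enough time-varying conditions
	}
	errRate := float64(s.errors) / float64(s.requests)
	return errRate <= maxErrRate
}

func main() {
	s := buildStats{deployedAt: time.Now().Add(-3 * time.Hour), requests: 250000, errors: 40}
	// Require 100k requests, 2 hours of bake time, and <= 0.05% errors.
	fmt.Println(baked(s, 100000, 2*time.Hour, 0.0005)) // true
}
```
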

Second, you can empower your software engineers and turn them into software owners.  Ownership over the full software life cycle means the same person can write code, deploy their code, and debug the code live in prod.  Why does this matter? It’s extremely inefficient to have different sets of people running, deploying and debugging, because the person with the most context from developing it is the best suited to spot problems or unintended consequences quickly and fix them before users are affected.  Software ownership leads to better services, happier users, and better lives for practitioners.

A key part of this ownership is the process of shepherding your code into that fully baked state. I call the talk “testing in production” (somewhat tongue in cheek) to embrace the side of it that means leaning into resilience rather than fixating on preventing failures.  Every single time you run a deploy, you are kicking off a test -- of that unique combination of point-in-time infra, deploy tools, environment and artifact.  So I want to talk about some of the techniques and practices that are worth spending some of your precious engineering cycles on, instead of the black sucking hole for engineering effort that is every staging environment I’ve ever seen.

Question: 

Martin Fowler famously had a blog post about how “you must be this tall for microservices” or organizational maturity before you’re really ready for microservices.  Is there a “you must be this tall” equivalent for being able to test in production?

Answer: 

Absolutely.  Starting with observability. Until you can drill down and explain any anomalous raw event, you should not attempt any advanced maneuvers -- you’ll just be irresponsibly screwing with production and starting fires right and left that you don’t even know about. And I mean observability in the technical sense -- traditional monitoring is not good enough.  You need instrumentation, an event-oriented perspective, ideally some tracing, etc. You need the tooling that lets you hunt down outliers and aggregate at read time by high-cardinality dimensions like request ID. No aggregation at write time. Etc.
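
For illustration, here is a minimal sketch of that event-oriented instrumentation in Go: one wide, structured event per request, emitted with its raw high-cardinality fields intact, so that all slicing and aggregation can happen later, at read time, over the raw events. The field names are illustrative:

```go
package main

import (
	"encoding/json"
	"os"
	"time"
)

// One wide event per request. Nothing is pre-aggregated at write
// time; high-cardinality fields like request ID stay queryable.

type requestEvent struct {
	Timestamp  time.Time `json:"timestamp"`
	RequestID  string    `json:"request_id"` // high-cardinality: unique per request
	UserID     string    `json:"user_id"`
	BuildID    string    `json:"build_id"` // lets you compare old vs. new code at read time
	Endpoint   string    `json:"endpoint"`
	StatusCode int       `json:"status_code"`
	DurationMS int64     `json:"duration_ms"`
}

func main() {
	enc := json.NewEncoder(os.Stdout)
	// In a real service this runs once per request, at the end of the
	// handler, carrying every field you might later need to query by.
	enc.Encode(requestEvent{
		Timestamp:  time.Now(),
		RequestID:  "req-7f3a9c",
		UserID:     "user-82764",
		BuildID:    "build-2024-11-07-1",
		Endpoint:   "/api/checkout",
		StatusCode: 500,
		DurationMS: 412,
	})
}
```
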

You also need a team of software engineers who care deeply about production quality and what happens when their code meets reality, who don’t consider their job finished and walk away the moment they merge a branch, and who understand how to instrument their code for understandability and debuggability.  If you’re lucky, you have some SREs or other operational experts to help everyone level up on best practices.

Question: 

A lot of times when I talk to people about CI/CD and things like getting code out fast in production, they push back or they have hesitancy. That's probably an indication that something else, like observability, is the problem, right?

Answer: 

If you’re terrified of shipping bad code to prod, well, you need to work on that first.  Because shipping new code shouldn’t be a terrifying experience. It should be pedestrian. But deploy tooling and deploy processes are systematically under-invested in everywhere.

Instead of building the guard rails and processes that build confidence and help to find problems early, people just write code, deploy it, and wait to get paged.  That’s INSANE. First of all, most anomalies don’t generate an alert, and they shouldn’t -- nobody would ever get any sleep because our systems exist in an eternal state of imperfection and partial degradation.  And that’s ok! That’s what SLOs are for!
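
To make the SLO point concrete, here is a sketch of the error-budget arithmetic behind it. The 99.9% target and the traffic numbers are invented for the example:

```go
package main

import "fmt"

// Minimal error-budget arithmetic: with a 99.9% availability target,
// 0.1% of requests are allowed to fail. You page a human when the
// budget is burning dangerously fast, not on every blip.

func main() {
	const sloTarget = 0.999 // 99.9% of requests succeed
	total, failed := 4_200_000.0, 2_900.0

	budget := (1 - sloTarget) * total // failures the SLO tolerates this window
	burned := failed / budget         // fraction of the budget already spent

	fmt.Printf("error budget: %.0f failures, burned: %.0f%%\n", budget, burned*100)
	// Output: error budget: 4200 failures, burned: 69%
	// Still inside budget -- degraded, imperfect, and fine. No page.
}
```
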

Once we know about a possible failure, we can monitor for that failure and write tests for that failure.  That means most of the failures we do ship to prod are going to be unknown-unknowns. We can’t predict them, and we shouldn’t try.  But we can often spot them.
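
As a sketch of turning a known failure into a pinned known-known: suppose a hypothetical SplitCents helper once divided by zero in production. The helper is invented here so the example is self-contained:

```go
package ledger

import (
	"fmt"
	"testing"
)

// Once an outage teaches you about a failure mode, it becomes a
// known-known: pin it with a regression test (and a monitor on the
// error) so it can never silently come back.

func SplitCents(totalCents, ways int) (int, error) {
	if ways <= 0 {
		return 0, fmt.Errorf("cannot split between %d recipients", ways)
	}
	return totalCents / ways, nil
}

func TestSplitCentsRejectsZeroWays(t *testing.T) {
	if _, err := SplitCents(100, 0); err == nil {
		t.Fatal("expected an error for a zero-way split, got nil")
	}
}
```
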

That’s why the best, most efficient, most practical, least damaging way to identify breaking changes is for every software developer to merge their code, deploy it, and immediately go look at it in production, comparing it to the last version.  You just wrote it, so it's fresh in your mind and you have all the context. Did you ship what you think you just shipped? Is it working the way you expected it to work? Do you see anything else that looks unexpected or suspicious? This simple checklist will catch a high percentage of the logic problems and other subtle bugs you ship.
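
That post-deploy look can be as simple as a read-time group-by over the raw events by build ID, comparing the version you just shipped against the previous one. A sketch with invented data:

```go
package main

import "fmt"

// Hypothetical post-deploy check: group recent raw events by build ID
// (read-time aggregation again) and compare the new build against the
// last one. The event shape matches the wide event sketched earlier.

type event struct {
	BuildID    string
	StatusCode int
	DurationMS int64
}

func main() {
	recent := []event{
		{"build-41", 200, 120}, {"build-41", 200, 131}, {"build-41", 500, 95},
		{"build-42", 200, 118}, {"build-42", 500, 900}, {"build-42", 500, 870},
	}

	totals := map[string]int{}
	errors := map[string]int{}
	for _, e := range recent {
		totals[e.BuildID]++
		if e.StatusCode >= 500 {
			errors[e.BuildID]++
		}
	}
	for build, n := range totals {
		fmt.Printf("%s: %d requests, %.0f%% errors\n",
			build, n, 100*float64(errors[build])/float64(n))
	}
	// If the new build looks worse than the old one, roll back before
	// users notice -- while the change is still fresh in your mind.
}
```
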

Question: 

What do you want someone who comes to this talk to be able to walk away with?

Answer: 

There's a lot of fear around production. There's a lot of scariness, a lot of association with the masochism and lack of sleep that operations has historically been known for (that’s on us: sorry!). But it doesn't have to be so scary.  You should want to do this.

Owning your own code in production should be something that you want to do, because it leads to better code, higher quality of services, happier users, stronger teams … even better quality of life for the engineers who own it.  There is plenty of science showing that autonomy and empowerment make people love their work. When ownership goes up, bugs go down and frustration goes down, and free time for product development goes up too. It’s a virtuous feedback loop: everyone wins.  So I want to show you the steps to get there.

Speaker: Charity Majors

Co-Founder @Honeycombio, formerly DevOps @ParseIT/@Facebook

Charity is a cofounder and engineer at Honeycomb.io, a startup that blends the speed of time series with the raw power of rich events to give you interactive, iterative debugging of complex systems. She has worked at companies like Facebook, Parse, and Linden Lab, as a systems engineer and engineering manager, but always seems to end up responsible for the databases too. She loves free speech, free software and a nice peaty single malt.
