
Track: The Art of Chaos Engineering

Location: Ballroom BC


Chaos Engineering is an emerging discipline, but the underlying concepts are not. Failure is going to happen. Are you ready? Put simply, Chaos Engineering is one approach to “breaking things on purpose” that teaches us new information about our systems through experimentation. By triggering incidents intentionally in a controlled way, we gain confidence that our systems can deal with those failures before they occur in production. Come learn from those just starting this journey as well as the experts pushing the state of the art. We will hear war stories from those putting out the fires in the middle of the night, as well as those starting the fires during the day! In the end we’ll learn how to build systems and organizations that improve in the face of failure.

Track Host: Kolton Andrus

Founder of Gremlin Inc, former Netflix

Kolton is the founder of Gremlin, which helps companies build more robust services. He was a Chaos Engineer at Netflix, focused on the resilience of the Edge services. He designed and built FIT: Netflix’s failure injection service. Before that, he improved the performance and reliability of the Amazon Retail website. At both companies he has served as a ‘Call Leader’, managing the resolution of company-wide incidents. Kolton is passionate about building resilient systems, primarily as it lets him break things for fun and profit.

Chaos Architecture

Perfectly engineered resilient systems may be broken by confused operators when those systems behave differently in response to underlying failures. Highly available applications need to be resilient to failures in infrastructure, networks, applications and operators. Chaos engineering is needed to exercise the incident handling mechanisms at every level, including people and processes. This talk will look at best practices and challenges in getting to a chaos architecture mindset.

Adrian Cockcroft, VP Cloud Architecture Strategy @AWSCloud & Microservices Pioneer

Chaos: The Last Stand Against Our Robot Overlords

As the complexity and criticality of our software systems rapidly increase, our ability and available methodologies to ensure their determinism and correctness are often nascent or sometimes even non-existent. We see the effects of this paradox as we advance the role and responsibility of software in society. Often the evidence is observed in service outages, security breaches, financial market "flash crashes", and now the ever-shortening length of time between the development and eventual production of autonomous vehicles.


The pursuit of automating aspects of our lives is often stifled simply by chaos: i.e. our best laid plans coming in contact with the unexpected. An essential element of working with the chaos present in every system is to first be able to effectively characterize it. Chaos Engineering and chaos experiments on the complex data, interfaces, and algorithms used in autonomous vehicles should be a minimum requirement in validating operational safety. Taking it a step further, Chaos Engineering could be the beginning of bringing to Software Engineering the kind of determinism, predictability, and assurance we often take for granted every day from disciplines like Structural, Mechanical, and Electrical Engineering. We need to begin to shift towards working with chaos instead of against it, in order to build safe, reliable, and increasingly deterministic complex systems. How we engineer software for large-scale consumption is shifting from "It might work, but I wouldn't bet my life on it" to "I know this will work, I'd bet my life on it."

Failure at Netflix Velocity

Netflix is a strong believer in Chaos Engineering and the Velocity of Innovation. Most of the time, our customers never notice the former and appreciate the latter. Occasionally however…

Can not connect to Netflix. You press play and it doesn't work. You can't log in. Nothing is on the screen and Stranger Things Season 2 just released!

A behind-the-scenes look at how Netflix engineering teams think about failure: the tools, techniques, and training we use to shorten the inevitable failures of our systems and their impact on our customers. Come hear why we believe chaos is your friend, failure is guaranteed, and why our organization is better off having both.

Dave Hahn, Sr SRE, Reliability and Chaos Engineering @Netflix

Chaos Engineering on a Budget

As the systems that support internet-scale services grow larger and ever more complex, chaos engineering has emerged as an industry best practice for ensuring system resiliency. Many companies maintain entire teams devoted to chaos testing their product. But what can you do if you don't have these kinds of resources to devote to the problem? How can you get started with chaos engineering without hiring an entire team of experts?

This is the story of implementing chaos testing on a small product, and how several small and targeted early investments in chaos engineering saved huge amounts of time and effort down the road.

Heather Nakama, Software Engineer @Microsoft - Azure Search

The Art of Chaos Engineering Panel

Kolton Andrus, Founder of Gremlin Inc, former Netflix
Willie Wheeler, Principal Application Engineer @Expedia
Sahar Samiei, Senior Product Manager @Expedia
Nathan Äschbacher
Dave Hahn, Sr SRE, Reliability and Chaos Engineering @Netflix
Adrian Cockcroft, VP Cloud Architecture Strategy @AWSCloud & Microservices Pioneer
Heather Nakama, Software Engineer @Microsoft - Azure Search

Expedia’s Journey Toward Site Resiliency

Those coming from product-driven organizations—where product features are often prioritized over resiliency-related concerns—will understand how challenging it can be to convince teams to do resiliency work. In this presentation we’ll share Expedia’s resiliency journey, starting with resiliency as an afterthought and progressing toward resiliency as a first-class concern. Attendees will learn about the importance of partnering with the teams experiencing operational struggles, and equipping them with the data to make the right investments at the right time.

Willie Wheeler, Principal Application Engineer @Expedia
Sahar Samiei, Senior Product Manager @Expedia

Last Year's Tracks

  • Monday, 16 November

  • Architecting for Confidence: Building Resilient Systems

    Your system will fail. Build systems with the confidence to know when they do and you won’t.

  • Remotely Productive: Remote Teams & Software

    More and more companies are moving to remote work. How do you build, work on, and lead teams remotely?

  • Operating Microservices

    Building and operating distributed systems is hard, and microservices are no different. Learn strategies for not just building services but operating them at scale.

  • Distributed Systems for Developers

    Computer science in practice. An applied track that fuses together the human side of computer science with the technical choices that are made along the way.

  • The Future of APIs

    Web-based APIs continue to evolve. This track provides the what, how, and why of future APIs, including GraphQL, Backend for Frontend, gRPC, & REST.

  • Resurgence of Functional Programming

    What was once a paradigm shift in how we thought of programming languages is now mainstream in nearly all modern languages. Hear how software shops are infusing concepts like pure functions and immutability into their architectures and design choices.

  • Tuesday, 17 November

  • Social Responsibility: Implications of Building Modern Software

    Software has an ever-increasing impact on individuals and society. Understanding these implications helps build software that works for all users.

  • Non-Technical Skills for Technical Folks

    Being an effective engineer requires more than great coding skills. Learn the subtle arts of the tech lead, including empathy, communication, and organization.

  • Clientside: From WASM to Browser Applications

    Dive into some of the technologies that can be leveraged to ultimately deliver a more impactful interaction between the user and client.

  • Languages of Infra

    More than just Infrastructure as a Service, today we have libraries, languages, and platforms that help us define our infra. Languages of Infra explores languages and libraries being used today to build modern cloud native architectures.

  • Mechanical Sympathy: The Software/Hardware Divide

    Understanding the Hardware Makes You a Better Developer

  • Paths to Production: Deployment Pipelines as a Competitive Advantage

    Deployment pipelines allow us to push to production at ever-increasing volume. Paths to Production looks at how some of software's best-known shops continuously deliver code.

  • Wednesday, 18 November

  • Java, The Platform

    Mobile, Micro, Modular: The platform continues to evolve and change. Discover how the platform continues to drive us forward.

  • Security for Engineers

    How to build secure, yet usable, systems from the engineer's perspective.

  • Modern Data Engineering

    The innovations necessary to build towards a fully automated decentralized data warehouse.

  • Machine Learning for the Software Engineer

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice by Software Engineers.

  • Inclusion & Diversity in Tech

    The road map to an inclusive and diverse tech organization. *Diversity & Inclusion is defined as the inclusion of all individuals within tech, regardless of gender, religion, ethnicity, race, age, sexual orientation, and physical or mental fitness.

  • Architectures You've Always Wondered About

    How do they do it? In QCon's marquee Architectures track, we learn what it takes to operate at large scale from well-known names in our industry. You will take away hard-earned architectural lessons on scalability, reliability, throughput, and performance.
