You are viewing content from a past/completed QCon

Presentation: Building Confidence in Healthcare Systems Through Chaos Engineering

Track: Production Readiness: Building Resilient Systems

Location: Ballroom BC

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Find out how a healthcare IT company uses Chaos Engineering to build confidence in their system.
  2. Learn which tools to use and which organizational challenges to overcome in order to run the tests needed to build confidence.

Abstract

Healthcare demands resilient software. Healthcare systems are resistant to change, as change can be viewed as a threat to system availability. To scale and modernize these systems, software engineers have to build confidence in how they can continually introduce change.    

This talk will cover how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. It will focus on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker. It will share how they standardized their service deployment to have consistent instrumentation to get deep insight into the overall behavior of their system. It will explain strategies for how they applied traffic management approaches to safely introduce chaos engineering experiments, improving their overall understanding of the system.

Question: 

What is the work you're doing today?

Answer: 

I'm an engineer at Cerner Corporation, and I work within our Service Platform Engineering group. We focus on the common service needs that support our electronic medical record called Millennium. This includes how we build, deploy, and operate the services supporting these clinical workflows within the Millennium system.

Question: 

What are the goals for the talk?

Answer: 

I try to bring a practitioner's perspective on how we started changing an existing system, including the technology choices that helped with that change. We had a lot of existing enterprise Java services running on technologies like IBM WebSphere Application Server, which introduced challenges in how it is configured, tuned, and optimized. Those environments became very crystallized around how we build and operate something. As we started introducing newer technologies that changed how these service workloads operated, we had to take a lot of different approaches to build confidence in the newer system. This required bringing different groups of our organization together as a single team, to ensure that all the parts of the infrastructure supporting our system were part of these tests. As we identified scenarios to test, we wanted to make sure these experiments were based on our production systems, not on an isolated lab that could have different characteristics. As part of this talk, I will share the tactics of how we applied traffic management patterns so we could shadow or replay traffic in these environments to safely introduce these experiments in the production environment without impacting existing user workflows.
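
The shadowing idea mentioned in this answer can be illustrated with a minimal sketch. It is not Cerner's implementation: the names `make_shadowing_router`, `stable_service`, and `experimental_service` are hypothetical, and the sketch assumes a simple in-process router where a real system would typically mirror traffic asynchronously at a proxy or service mesh layer. The point it shows is that the user's response always comes from the existing service, while a copy of the request exercises the system under experiment.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Handler = Callable[[Request], str]


def make_shadowing_router(
    primary: Handler, shadow: Handler, mismatches: List[Request]
) -> Handler:
    """Build a handler that answers from `primary` and mirrors requests to `shadow`."""

    def route(request: Request) -> str:
        # The user-facing response always comes from the existing service.
        response = primary(request)
        try:
            # The shadow receives a copy of the request; its result is only compared.
            if shadow(dict(request)) != response:
                mismatches.append(request)
        except Exception:
            # A failure in the shadow (for example, one injected by an
            # experiment) is recorded but never reaches the user.
            mismatches.append(request)
        return response

    return route


# Hypothetical services used only for this illustration.
def stable_service(request: Request) -> str:
    return "patient:" + request["id"]


def experimental_service(request: Request) -> str:
    if request["id"] == "42":
        raise TimeoutError("injected failure")
    return "patient:" + request["id"]


mismatches: List[Request] = []
route = make_shadowing_router(stable_service, experimental_service, mismatches)
print(route({"id": "7"}))
print(route({"id": "42"}))
print(len(mismatches))
```

In this sketch the second request still gets a normal response even though the shadow fails, and the failure is kept in `mismatches` for later review.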

Question: 

I'm imagining doing these chaos experiments in a healthcare system might be rather different from a media platform like Netflix or Spotify.

Answer: 

Right, we were actually inspired by Netflix on these approaches, but we knew we had to show how we could safely introduce them in our environments. This required us to think about our existing change control policies, how we manage live traffic into these environments, and how closely aligned we had to be as a team when doing these tests. We found that we were building confidence and safety in the system by introducing these types of experiments within our infrastructure, since we were proactively identifying and learning about different behaviors of the system. As others have shared on chaos engineering, we weren’t trying to just break things in production and then see what would happen. We planned experiments in which we expected the system to handle the introduced failure successfully, but we would often learn about new compounding failures or about something that was more sensitive to the failure than expected. Hopefully, as I share some of our stories and how we approached introducing these experiments, it can illustrate how we worked through some of the barriers in a regulated environment to get this included and how the system was improved as a result.
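
The shape of such a planned experiment can be sketched as follows. This is a toy illustration under stated assumptions, not the tooling described in the talk: `run_experiment`, `ExperimentResult`, and the two-replica `replicas` model are all hypothetical. It shows the pattern of checking a steady state first, injecting one failure the system is expected to tolerate, checking again, and always rolling back.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExperimentResult:
    ran: bool            # False if the system was unhealthy before injection
    steady_during: bool  # Did the steady state hold while the failure was active?


def run_experiment(
    check_steady_state: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    # Do not start an experiment against a system that is already unhealthy.
    if not check_steady_state():
        return ExperimentResult(ran=False, steady_during=False)
    inject_failure()
    try:
        steady_during = check_steady_state()
    finally:
        # Always undo the injected failure, even if the check itself raises.
        rollback()
    return ExperimentResult(ran=True, steady_during=steady_during)


# Toy system: a service that stays healthy while at least one replica is up.
replicas: Dict[str, bool] = {"a": True, "b": True}


def healthy() -> bool:
    return any(replicas.values())


def stop_replica_a() -> None:
    replicas["a"] = False


def restore_replica_a() -> None:
    replicas["a"] = True


result = run_experiment(healthy, stop_replica_a, restore_replica_a)
print(result)
```

The hypothesis here is that losing one replica does not break the service. When `steady_during` comes back `False`, that is the kind of unexpected sensitivity the answer describes.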

Question: 

What do you want people to leave the talk with?

Answer: 

I would like to give them some tangible examples of how we introduced this approach and share the benefits we found as a result. In addition, I plan to share specifics on how we used certain technologies to do these types of tests. These experiments were rich with learning and always introduced surprises, which forces you to focus upfront on observability in your system. My hope is to share how these learnings helped us prioritize what we viewed as important capabilities in our delivery system, so that we could incrementally build the system and get insight into how it handled these planned failures in earlier iterations. My goal is that people can walk away with a toolkit of examples and guidance on how we approached it, and that they can apply these ideas within their own organizations.

Speaker: Carl Chesser

Principal Engineer @Cerner

Carl is a principal engineer supporting the service platform at Cerner Corporation, a global leader in healthcare information technology. The majority of his career has been focused on evolving and scaling the service infrastructure for Cerner's core electronic medical record platform called Millennium. He is passionate about growing a positive engineering culture at Cerner and contributes by organizing hackathons and meetups and by giving technical talks. In his spare time, he enjoys blogging about engineering-related topics and sharing his poorly made illustrations at https://che55er.io.

2020 Tracks

  • Non-Technical Skills for Technical Folks

    Being an effective engineer requires more than great coding skills. Learn the subtle arts of the tech lead, including empathy, communication, and organization.

  • Clientside: From WASM to Browser Applications

    Dive into some of the technologies that can be leveraged to ultimately deliver a more impactful interaction between the user and client.

  • Languages of Infra

    More than just Infrastructure as a Service, today we have libraries, languages, and platforms that help us define our infra. Languages of Infra explores languages and libraries being used today to build modern cloud native architectures.

  • Mechanical Sympathy: The Software/Hardware Divide

    Understanding the Hardware Makes You a Better Developer

  • Paths to Production: Deployment Pipelines as a Competitive Advantage

    Deployment pipelines allow us to push to production at ever increasing volume. Paths to Production looks at how some of software's most well-known shops continuously deliver code.

  • Java, The Platform

    Mobile, Micro, Modular: The platform continues to evolve and change. Discover how the platform continues to drive us forward.

  • Security for Engineers

    How to build secure, yet usable, systems from the engineer's perspective.

  • Modern Data Engineering

    The innovations necessary to build towards a fully automated decentralized data warehouse.

  • Machine Learning for the Software Engineer

    AI and machine learning are more approachable than ever. Discover how ML, deep learning, and other modern approaches are being used in practice by Software Engineers.

  • Inclusion & Diversity in Tech

    The road map to an inclusive and diverse tech organization. *Diversity & Inclusion is defined as the inclusion of all individuals within tech, regardless of gender, religion, ethnicity, race, age, sexual orientation, and physical or mental fitness.

  • Architectures You've Always Wondered About

    How do they do it? In QCon's marquee Architectures track, we learn what it takes to operate at large scale from well-known names in our industry. You will take away hard-earned architectural lessons on scalability, reliability, throughput, and performance.

  • Architecting for Confidence: Building Resilient Systems

    Your system will fail. Build systems with the confidence to know when they do and you won’t.

  • Remotely Productive: Remote Teams & Software

    More and more companies are moving to remote work. How do you build, work on, and lead teams remotely?

  • Operating Microservices

    Building and operating distributed systems is hard, and microservices are no different. Learn strategies for not just building a service but operating them at scale.

  • Distributed Systems for Developers

    Computer science in practice. An applied track that fuses together the human side of computer science with the technical choices that are made along the way.

  • The Future of APIs

    Web-based APIs continue to evolve. The track provides the what, how, and why of future APIs, including GraphQL, Backend for Frontend, gRPC, & ReST.

  • Resurgence of Functional Programming

    What was once a paradigm shift in how we thought of programming languages is now mainstream in nearly all modern languages. Hear how software shops are infusing concepts like pure functions and immutability into their architectures and design choices.

  • Social Responsibility: Implications of Building Modern Software

    Software has an ever increasing impact on individuals and society. Understanding these implications helps build software that works for all users.