Presentation: Building Confidence in Healthcare Systems Through Chaos Engineering

Track: Production Readiness: Building Resilient Systems

Location: Ballroom BC

Duration: 1:40pm - 2:30pm

Day of week: Wednesday

Share this on:

What You’ll Learn

  1. Find out how a healthcare IT company uses Chaos Engineering to build confidence in their system.
  2. Learn what are some of the tools to use and organizational challenges to overcome to perform the tests needed to build confidence

Abstract

Healthcare demands resilient software. Healthcare systems are resistant to change, as change can be viewed as a threat to system availability. To scale and modernize these systems, software engineers have to build confidence in how they can continually introduce change.    

This talk will cover how Cerner evolved their service workloads and applied gameday exercises to improve their resiliency. It will focus on how they transitioned their Java services from traditional enterprise application servers to a container deployment on Kubernetes using Spinnaker. It will share how they standardized their service deployment to have consistent instrumentation to get deep insight into the overall behavior of their system. It will explain strategies for how they applied traffic management approaches to safely introduce chaos engineering experiments, improving their overall understanding of the system.

Question: 

What is the work you're doing today?

Answer: 

I'm an engineer at Cerner Corporation, and I work within our Service Platform Engineering group. We focus on the common service needs that support our electronic medical record called Millennium. This includes how we build, deploy, and operate the services supporting these clinical workflows within the Millennium system.

Question: 

What are the goals for the talk?

Answer: 

I try to bring a basic practitioner's perspective of how we started changing an existing system which includes the technology choices to help in that change. We had a lot of existing services that were labeled as these enterprise Java-based services running within technologies like IBM WebSphere Application Server which introduced challenges, how it is configured, how is tuned and optimized. Those environments became very crystallized around how we build and operate something. As we started introducing newer technologies that was changing how these service workloads operated, we had to take a lot of different approaches to build confidence in the newer system. This required us bringing different groups of our organization together as a single team, to ensure that all the parts of our infrastructure supporting our system was part of these tests. As we identified scenarios to test, we wanted to make sure these experiments were based on our production systems, not on an isolated lab that could have different characteristics. As part of this talk, I will share the tactics of how we applied traffic management patterns so we could shadow or replay traffic in these environments to safely introduce these experiments in the production environment, while not impacting existing user workflows.

Question: 

I'm imagining doing these chaos experiments in a healthcare system might be rather different from a media platform like Netflix or Spotify.

Answer: 

Right, we actually were inspired by Netflix on these approaches, but knew we had to show how we could safely introduce this approach in our environments. This required us to think about our existing change control policies, how we manage live traffic into these environments, and how we had to be closely aligned as a team when doing these tests. We found that we were building confidence and safety in the system by introducing these types of experiments within our infrastructure, since we were proactively identifying and learning about different behaviors of the system. As others have shared on chaos engineering, we weren’t trying to just break things in production and then see what would happen. We were planning experiments that were expecting to successfully handle the introduction of the system failure, but we often would learn about newer compounding failures or something that was more sensitive to the failure than expected. Hopefully, as I share some of our stories and how we approach introducing these experiments, it can illustrate how we worked through some of the barriers in a regulated environment to get this included and how the system was improved as a result.

Question: 

What do you want people to leave the talk with?

Answer: 

I would like to give them some tangible things of how we introduced this approach and share the benefits we found as a result. In addition, I plan to share specifics on how we used certain technologies to do the types of tests. These types of experiments were rich with learning and always introduced surprises which forces you to focus upfront on observability in your system. My hope is to share how these learnings helped us prioritize what we viewed were important capabilities in our delivery system, so that we could incrementally build the system and get insight to how it handled these planned failures in earlier iterations. As a result, my goal is that people can walk away having a toolkit of examples and guidance on how we approached it, and hopefully they can leverage these ideas within their own organizations.

Speaker: Carl Chesser

Principal Engineer @Cerner

Carl is a principal engineer supporting the service platform at Cerner Corporation, a global leader in healthcare information technology. The majority of his career has been focused on evolving and scaling the service infrastructure for Cerner's core electronic medical record platform called Millennium. He is passionate about growing a positive engineering culture at Cerner and contributes as an organizer of hackathons, meetups, and giving technical talks. In his spare time, he enjoys blogging about engineering related topics and sharing his poorly made illustrations at https://che55er.io.

Find Carl Chesser at

Tracks

Monday, 11 November

Tuesday, 12 November

Wednesday, 13 November