Did the Chaos Test Pass?

People used to ask me all the time how to figure out whether their chaos test had “passed,” and I’d always say, “Well, that’s a loaded question.” To confirm that a chaos test “passed,” we need to verify a hypothesis: sometimes you’re trying to prove that some system behavior occurred in response to a stimulus, while other times you’re trying to prove the absence of a change in system behavior. Take this already nebulous concept, and now think about making it generic enough that the core validation logic can be reused by any engineer running any kind of experiment on any one of our products. Then try to do all of this in a complex distributed environment where it’s hard enough just to determine whether an application was healthy in the first place! That’s exactly the problem the chaos engineering team at Vanguard has been tackling with the recent addition of automated assertions to its internal chaos tooling. In this talk, you’ll learn when it’s appropriate to define “pass” and “fail” for a chaos experiment and when it might not be, and you’ll get a peek under the hood at how Vanguard engineers automatically verify their hypotheses in the context of chaos experiments.
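As a rough illustration of the two kinds of hypothesis described above (a behavior should occur, or no change should occur), here is a minimal sketch in Python. It is not Vanguard's tooling: the `Assertion` class, the `evaluate` function, the metric names, and the relative-tolerance rule are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Assertion:
    """Hypothetical assertion on one metric, comparing baseline vs. fault window."""
    metric: str
    expect_change: bool  # True: behavior should change; False: it should hold steady
    tolerance: float     # relative change that still counts as "no change"


def evaluate(assertion: Assertion,
             baseline: Sequence[float],
             during: Sequence[float]) -> Optional[bool]:
    """Return True (pass), False (fail), or None when no verdict can be given."""
    if not baseline or not during:
        return None  # missing data: the experiment stays exploratory
    base = mean(baseline)
    if base == 0:
        return None  # relative change is undefined for a zero baseline
    relative_change = abs(mean(during) - base) / abs(base)
    changed = relative_change > assertion.tolerance
    # The hypothesis holds when the observed outcome matches the expected one.
    return changed == assertion.expect_change


# Hypothesis 1: latency stays steady while the fault is injected.
steady = Assertion("p99_latency_ms", expect_change=False, tolerance=0.10)
print(evaluate(steady, [100, 102, 98], [101, 99, 100]))

# Hypothesis 2: the instance count changes in response to the stimulus.
scaled = Assertion("instance_count", expect_change=True, tolerance=0.10)
print(evaluate(scaled, [4, 4, 4], [6, 6, 6]))
```

Returning `None` rather than forcing a verdict is a deliberate choice in this sketch: it leaves room for exploratory experiments where “pass” and “fail” are not meaningful.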

What's the focus of your work these days?

I just started a new role in which I operate in more of an architect capacity than before, when I was very narrowly scoped to Site Reliability Engineering. I'm focusing on the architecture that supports our entire developer experience platform and on enabling software engineering excellence across our whole IT organization.

Can you tell me what the motivation behind your talk is?

In this talk I hope to build on what I've shared about Vanguard's chaos engineering strategy in prior talks, and to talk a little about the level of maturity we've now reached, where we're not just experimenting in an exploratory way but using the results of our chaos experiments to make assertions about the reliability of our systems. I also hope to make sure that others understand how to do the same with their own chaos experiments.

How would you describe the persona and level of the target audience for this session?

I think that anyone in a technical role will take a lot away from this talk, especially those who work in large enterprises, because that's the environment I'm working in at Vanguard. It's aimed at anyone who has run chaos experiments in the past or is interested in running them in their organization. So anyone with a site reliability engineering background, or just some experience with chaos experimentation, will really enjoy this talk.

Is there anything specific that you would like these folks to walk away with after watching your presentation?

When they walk away from the presentation, they'll certainly have a feel for the architecture and what we've built, in case they want to do something similar in their own organizations. But I don't expect that's what most will take away. Primarily, I hope people will walk away with the idea to put assertions around some of their chaos experiments, to also do some exploratory testing without assertions, and to determine when the right time is for each.


Speaker

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Christina is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University with an undergraduate degree in Computer Science. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, with a specialization in Site Reliability Engineering. She has earned several Amazon Web Services certifications, including the Solutions Architect - Professional. Christina has also worked closely with the Women's Initiative for Leadership Success at Vanguard, both internally at the company and externally in the local community, to further the career advancement of women and girls - in particular within the tech industry. In her spare time (and when it is safe to do so!), Christina is passionate about traveling; she has visited over 20 different countries and 25 U.S. states so far!

Date

Wednesday Oct 26 / 11:50AM PDT (50 minutes)

Location

Bayview

Topics

SRE Chaos Experiment

From the same track

Session SRE

The Endgame of SRE

Wednesday Oct 26 / 10:35AM PDT

The containers are deployed and the builds are green. YAML flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great but teams still struggle to maintain error budgets.

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Session SRE

Rethinking Reliability: What You Can (and Can't) Learn From Incidents

Wednesday Oct 26 / 02:55PM PDT

This talk presents research collected from the VOID—an open database of public incident reports. Containing over 2,000 reports for almost 700 organizations, the database allows for more structured review and research about software-related incident reporting.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Session SRE

The Eternal Sunshine of the Toil-Less Prod

Wednesday Oct 26 / 04:10PM PDT

One of the most important decisions in building an SRE practice is what kind of work should be assigned to the SRE team, and in what percentages.

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat

Session

[Panel] SRE: Is it Working?

Wednesday Oct 26 / 01:40PM PDT

How does SRE mature from a craft with a wide range of skills and levels of expertise to a mature discipline?

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat