Effective SRE

In theory, a solid SRE program is required for successful cloud IT services. In practice, not all SRE programs are created equal, and many attempts to establish one have failed or even backfired. What distinguishes a good SRE program from a bad one, and is there a framework for preventing catastrophe? This isn't an idle discipline. Many of the practices mentioned in the original SRE materials, such as MTTR, root cause analysis, and runbook automation, are now hotly contested.

SRE deserves a critical and optimistic review. For example, are qualitative methods better suited than quantitative methods for evaluating the success of SRE? As a practice, SRE may be the software industry's best hope of holding off governmental regulations around availability, uptime, and certification of software engineers. That hope can only be fulfilled if we have effective SRE practices.


From this track

Session SRE

Did the Chaos Test Pass?

Wednesday Oct 26 / 11:50AM PDT

People used to ask me all the time how to figure out if their chaos test has “passed,” and I’d always say “well, that’s a loaded question.” To confirm that a chaos test “passed,” we need to do verification of hypotheses - sometimes you’re trying to prove some system behavior occurred in response

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Session SRE

The Endgame of SRE

Wednesday Oct 26 / 10:35AM PDT

The containers are deployed and the builds are green. YAML flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great, but teams still struggle to maintain error budgets.

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Session SRE

Rethinking Reliability: What You Can (and Can't) Learn From Incidents

Wednesday Oct 26 / 02:55PM PDT

This talk presents research collected from the VOID—an open database of public incident reports. Containing over 2,000 reports for almost 700 organizations, the database allows for more structured review and research about software-related incident reporting.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Session SRE

The Eternal Sunshine of the Toil-Less Prod

Wednesday Oct 26 / 04:10PM PDT

One of the most important decisions in building an SRE practice is what kind of work should be assigned to the SRE team, and in what percentages.

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat

Session

[Panel] SRE: Is it Working?

Wednesday Oct 26 / 01:40PM PDT

How does SRE mature from a craft with a wide range of skills and levels of expertise into a mature discipline? In this panel, we bring together all of today’s speakers to revisit the original assumptions about what SRE is and to discuss how we believe SRE is evolving.

Courtney Nash

Internet Incident Librarian & Senior Research Analyst at Verica, previously @Holloway @Fastly @O’Reilly Media @Microsoft & @Amazon

Amy Tobey

Senior Principal Engineer and SRE practice Leader @Equinix

Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Sasha Rosenbaum

Director of the Cloud Services Black Belt Team @RedHat

Date

Wednesday Oct 26 / 09:00AM PDT


Unable to make QCon San Francisco?

You can attend this track and more, online at QCon Plus from Nov 30 - Dec 8, 2022.


Track Host

Casey Rosenthal

CEO, Co-Founder @verica_io

Casey Rosenthal is CEO and co-founder of Verica, and was formerly the Engineering Manager of the Chaos Engineering Team at Netflix. He has experience with distributed systems, artificial intelligence, translating novel algorithms and academia into working models, and selling a vision of the possible to clients and colleagues alike. His superpower is transforming misaligned teams into high-performance teams, and his personal mission is to help people see that something different, something better, is possible. For fun, he models human behavior using personality profiles in Ruby, Erlang, Elixir, and Prolog.
