Designing Fault-Tolerant Software with Control System Transparency

Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.

As society depends more heavily on software, the need for that software to be reliable grows too. At the same time, the systems we build as practitioners keep growing more complex, with many moving parts and unexpected, emergent behavior. How can those systems become as reliable and robust as the Voyager 2 deep space probe, which made news recently and is still in service 46 years after its launch?

In this talk, we'll be drawing inspiration from an architectural paper that captures experiences from the aerospace industry. "GN&C [Guidance, Navigation, and Control] Fault Protection Fundamentals" by Robert D. Rasmussen describes four key principles for building fault-tolerant software, involving transparency of (1) objectives; (2) models; (3) knowledge; and (4) control. For each principle, we'll identify use cases from well-known software or protocols (e.g., HTTP) where we can see the principle in action; or, if the principle is *not* commonly applied, we'll describe how it might look in familiar settings like invoking RESTful or gRPC APIs.
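As a rough illustration of what "transparency of objectives" might look like when invoking an API, the sketch below has the caller state the deadline by which it needs an answer, so the handler can decide up front whether the work is still worth doing. This is an assumption-laden example and is not taken from the talk or from Rasmussen's paper: the `Request` type, the `handle` function, and the `estimated_work_seconds` parameter are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request that carries the caller's objective explicitly."""
    payload: str
    deadline: float  # absolute time (in seconds) by which the caller needs an answer


def handle(request: Request, estimated_work_seconds: float, now: float) -> str:
    """Hypothetical handler that uses the stated deadline to accept or reject work.

    `now` is passed in rather than read from a clock so the example is
    deterministic and easy to test.
    """
    remaining = request.deadline - now
    if remaining <= 0:
        # The caller has already given up; doing the work would be wasted effort.
        return "rejected: deadline already passed"
    if estimated_work_seconds > remaining:
        # Fail fast instead of returning an answer that arrives too late.
        return "rejected: cannot finish before deadline"
    return f"accepted: {request.payload}"


if __name__ == "__main__":
    print(handle(Request("charge-card", deadline=10.0), estimated_work_seconds=2.0, now=5.0))
```

Because the objective travels with the request, a handler that is overloaded or already late can shed work early rather than compute a result nobody is waiting for.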

Interview:

What's the focus of your work these days?

I'm currently working on broad refactoring projects that update parts of our architecture to gain more engineering leverage (for example, identifying and extracting common components for reuse).

What's the motivation for your talk at QCon San Francisco 2023?

I'm passionate about fault tolerance and reliability, and have been for a long time. I will admit to enjoying putting a service under rigorous load or stress testing, seeing it fall over, and then fixing it so it no longer falls over that way! But in general, I see a great opportunity for the software industry to continue updating our practices to build more of that reliability and resilience in by design. There are groups, like NASA, who have been focused on fault-tolerant software for a long time: after all, a small mistake could turn a deep space probe into a $100M brick. They've developed a lot of rigor and techniques to be successful at this, and the rest of the industry can probably learn a thing or two!

How would you describe your main persona and target audience for this session?

This talk is probably most useful to software architects or tech leads, as we will be discussing techniques and patterns that an organization would want to apply broadly to gain the benefits. That said, it should be accessible to a wider engineering audience, as we'll explain the principles as we go and illustrate them with examples.

Is there anything specific that you'd like people to walk away with after watching your session?

I'd primarily hope that attendees gain a new way of thinking about fault tolerance and reliability; being able to learn about a new point of view on a familiar problem is often the most valuable takeaway from talks.


Speaker

Jon Moore

Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry

Over his career, Jon Moore has been a researcher, management consultant, network engineer, small business owner, tech lead, architect, and technology executive. He is equally comfortable leading and managing teams and personally writing production-ready code. His current interests include distributed systems, fault tolerance, refactoring, building healthy and engaging engineering cultures, and Texas Hold'em. Jon received his Ph.D. in Computer and Information Science from the University of Pennsylvania and currently resides in West Philadelphia, although he was neither born there nor raised there and does not spend most of his days on playgrounds.

Date

Tuesday Oct 3 / 11:45AM PDT (50 minutes)

Location

Ballroom BC

Topics

Reliability, Fault Tolerance, Architecture

From the same track

Session Database

How Netflix Ensures Highly-Reliable Online Stateful Systems

Tuesday Oct 3 / 02:45PM PDT

Under most stateless services are stateful databases, caches, and systems which form the bedrock applications are built on.

Joseph Lynch

Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions

Session Resiliency

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Tuesday Oct 3 / 01:35PM PDT

As a system increases in inevitable complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in the system. Understanding the system requires a joint effort between teammates and technology.

Nora Jones

Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference

Session Architecture

Disaster Recovery Across a Million Pieces

Tuesday Oct 3 / 10:35AM PDT

Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts.

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"

Session Resiliency

Orchestrating Resilience: Building Modern Asynchronous Systems

Tuesday Oct 3 / 03:55PM PDT

Building asynchronous, event-driven systems can be daunting. Managing states, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself.

Sai Pragna Etikyala

Technical Lead @Twilio

Session

Unconference: Designing for Resilience

Tuesday Oct 3 / 05:05PM PDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.