Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.
As our society continues to depend more and more heavily on software, the need for that software to be reliable also increases. At the same time, the software systems we build as practitioners continue to become more and more complex, with many moving parts and unexpected, emergent behavior. How can those systems become as reliable and robust as the Voyager 2 deep space probe that made news recently and is still in service 46 years after its launch?
In this talk, we'll be drawing inspiration from an architectural paper that captures experiences from the aerospace industry. "GN&C [Guidance, Navigation, and Control] Fault Protection Fundamentals" by Robert D. Rasmussen describes four key principles for building fault-tolerant software, involving transparency of (1) objectives; (2) models; (3) knowledge; and (4) control. For each principle, we'll identify use cases from well-known software or protocols (e.g., HTTP) where we can see the principle in action; or, if the principle is *not* commonly applied, we'll describe how it might look in familiar settings like invoking RESTful or gRPC APIs.
What's the focus of your work these days?
I'm currently working on broad architectural refactoring projects that update parts of our architecture to gain more engineering leverage (for example, by identifying and extracting common components for reuse).
What's the motivation for your talk at QCon San Francisco 2023?
I'm passionate about fault tolerance and reliability, and have been for a long time. I will admit to enjoying putting a service under rigorous load or stress testing, seeing it fall over, and then fixing it so it no longer falls over that way! But in general, I see a great opportunity for the software industry to continue updating our practices so that more of that reliability and resilience is built in by design. There are groups, like NASA, who have been focused on fault-tolerant software for a long time: after all, a small mistake could turn a deep space probe into a $100M brick. They've developed a lot of rigor and techniques to be successful at this, and the rest of the industry can probably learn a thing or two!
How would you describe your main persona and target audience for this session?
This talk is probably most useful to software architects or tech leads, as we will be discussing techniques and patterns that an organization would want to apply broadly to gain the benefits. That said, the talk should be broadly accessible to a wider engineering audience, as we'll be explaining the principles in the talk as we go, and illustrating them with examples.
Is there anything specific that you'd like people to walk away with after watching your session?
I'd primarily hope that attendees gain a new way of thinking about fault tolerance and reliability; being able to learn about a new point of view on a familiar problem is often the most valuable takeaway from talks.
Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry
Over his career, Jon Moore has been a researcher, management consultant, network engineer, small business owner, tech lead, architect, and technology executive. He is equally comfortable leading and managing teams and personally writing production-ready code. His current interests include distributed systems, fault tolerance, refactoring, building healthy and engaging engineering cultures, and Texas Hold'em. Jon received his Ph.D. in Computer and Information Science from the University of Pennsylvania and currently resides in West Philadelphia, although he was neither born there nor raised there and does not spend most of his days on playgrounds.