Cellular service architectures are a conceptually simple way for highly available online services to limit the impact of cascading failures and improve scale-out. So why aren't we all using them? And how do they even work in practice?
In this talk, we'll explore Slack's 18-month journey from a traditional multi-AZ architecture to a cellular architecture. Triggered by a critical incident in June 2021, this architectural shift changed Slack's approach to system resilience and failure mitigation. We'll cover the motivation behind the work, the operational characteristics of our particular implementation, the challenges and complexities posed by the migration, and the limitations of our approach compared with others.
Attendees will gain insights into designing for failure in large-scale distributed systems, techniques for graceful degradation and traffic management, balancing consistency requirements with availability in cellular architectures, and strategies for executing major architectural changes while maintaining service quality.
Speaker
Cooper Bethea
Formerly Senior Staff Engineer and Technical Lead @Slack, Previously SRE Lead and SRE Workbook Author @Google
Cooper is a software engineer and site reliability expert with 17 years of experience improving the reliability of large-scale distributed systems. Most recently, as Senior Staff Software Engineer at Slack, Cooper led the Cellular Slack project, a major rearchitecting initiative that significantly enhanced the platform's fault tolerance and disaster recovery capabilities.
Previously at Google, Cooper served as Reliability Lead for the Global Cloud Load Balancer and was the lead author of the “Managing Load” chapter of the SRE Workbook. His career also includes roles at Foursquare and Sift, where he was responsible for the availability of all user-facing infrastructure.
Cooper is passionate about building scalable, resilient systems and sharing knowledge within the tech community. His talks draw from nearly two decades of hands-on experience with some of the industry's most demanding infrastructure environments.