Slack's Migration to a Cellular Architecture

Cellular service architectures are a conceptually simple way for highly available online services to limit the impact of cascading failures and improve scale-out. So why aren't we all using them? And how do they even work in practice? 

In this talk, we'll explore Slack's transformative 18-month journey from a traditional multi-AZ architecture to a robust cellular architecture. Triggered by a critical incident in June 2021, this architectural shift revolutionized Slack's approach to system resilience and failure mitigation. We'll delve into the motivation behind the work, the operational characteristics of our particular implementation, challenges and complexities posed by the migration, and limitations of our approach as compared to others.

Attendees will gain insights into designing for failure in large-scale distributed systems, techniques for graceful degradation and traffic management, balancing consistency requirements with availability in cellular architectures, and strategies for executing major architectural changes while maintaining service quality.


Speaker

Cooper Bethea

Formerly Senior Staff Engineer and Technical Lead @Slack, Previously SRE Lead and SRE Workbook Author @Google

Cooper is a software engineer and site reliability expert with 17 years of experience improving the reliability of large-scale distributed systems. Most recently, as Senior Staff Software Engineer at Slack, Cooper led the Cellular Slack project, a major rearchitecting initiative that significantly enhanced the platform's fault tolerance and disaster recovery capabilities.

Previously at Google, Cooper served as Reliability Lead for the Global Cloud Load Balancer and was the lead author for the “Managing Load” chapter of the SRE Workbook. His career spans roles at Foursquare and Sift, where he held responsibility for the availability of all user-facing infrastructure.

Cooper is passionate about building scalable, resilient systems and sharing knowledge within the tech community. His talks draw from nearly two decades of hands-on experience with some of the industry's most demanding infrastructure environments.

Date

Wednesday Nov 20 / 02:45PM PST ( 50 minutes )

Location

Ballroom A

Topics

Architecture Traffic Routing Resilience Redundancy Failure Recovery

From the same track

Session Architecture

One Network: Cloud-Agnostic Service and Policy-Oriented Network Architecture

Wednesday Nov 20 / 11:45AM PST

In this age of an interconnected world, One Network helps customers to simplify deployment of their products and services by providing a unified service and policy oriented network architecture that breaks down the boundaries of public and private clouds, different runtimes and tr

Anna Berenberg

Engineering Fellow, Foundation Services, Service Networking, @Google Cloud, Co-Author of "Deployment Archetypes for Cloud Applications"

Session

Thinking Like an Architect

Wednesday Nov 20 / 10:35AM PST

Are architects supposed to be the smartest people on the team, making all the important decisions for developers to fill in the blanks? Certainly not. Rather, architects make everyone else smarter, for example by sharing decision models or revealing blind spots.

Gregor Hohpe

Author of "Enterprise Integration Patterns" and "The Software Architect Elevator", Cloud Architect, Member of IEEE Software Advisory Board, Previously @AWS, @Google, and @Allianz

Session Architecture

Renovate to Innovate: Fundamentals of Transforming Legacy Architecture

Wednesday Nov 20 / 01:35PM PST

Renovating old buildings and homes is commonplace, but why is technological renovation often overlooked? Just like a big home renovation adds to the quality of life, a successful architectural renovation has an outsized impact on the pace of innovation.

Rashmi Venugopal

Product Engineering @Netflix, Speaker, Previously Product Engineer @Uber & @Microsoft, Building and Operating Reliable Distributed Systems at Scale

Session Legacy Code

Building Tomorrow’s Legacy Code, Today

Wednesday Nov 20 / 03:55PM PST

Confronting legacy code and managing technical debt are inevitable aspects of building sustainable systems. Often, when we’re building new code, we don’t keep that inevitable future in mind: the code we’re building today is the legacy code of tomorrow.

Shawna Martell

Senior Staff Engineer @Carta