Building Resilient Asynchronous and Event-Driven Systems

Asynchronous and event-driven architectures promise scalability and durability, but they often mask complex failure modes that surface only under unpredictable conditions such as large traffic spikes or dependency failures. These problems become even more complicated in multi-tenant systems, where you serve multiple customers, which is often the case. This training dives into real-world design challenges, drawing on over a decade of experience operating multi-tenant, high-throughput systems. We will explore less-discussed architecture patterns such as shuffle sharding, back-pressure, and concurrency limiting, along with failure modes such as retry storms, so that systems degrade gracefully instead of collapsing under pressure. Attendees will leave with mental models, metrics, and playbooks for building systems that are prepared to fail, and recover, predictably.

By the end of the training session, you will have the confidence to build resilient asynchronous and event-driven systems that don’t collapse when unexpected situations hit, and that instead degrade and recover gradually.

 

Key Takeaways

1 The availability of asynchronous systems is measured very differently from that of synchronous APIs and requires different observability metrics.

2 We will dive into how systems behave under load, so you can stay ahead of the moment a traffic spike hits your system.

3 How to design intelligent retries and backoffs that avoid the cascading failures that naive retries often cause.

4 We will discuss how to anticipate failure modes before they happen.
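
As an illustration of the retry and backoff takeaway, here is a minimal sketch of capped exponential backoff with full jitter, a common way to keep many clients from retrying in lockstep and amplifying an outage. It is not taken from the training material. The names backoff_delay and call_with_retries, and the default values for base, cap, and max_attempts, are assumptions chosen for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Return a full-jitter delay in seconds for a zero-based attempt number.

    The ceiling doubles with each attempt and is capped; the actual delay is
    drawn uniformly from [0, ceiling] so that clients spread out their retries.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on any exception up to max_attempts times.

    Hypothetical helper: a real system would retry only errors known to be
    transient and would usually bound total retry volume as well.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise ValueError("max_attempts must be at least 1")
```

The sleep parameter is injectable so the retry loop can be exercised in tests without waiting. Bounding the number of attempts matters as much as the jitter: unbounded retries are one of the ways a brief dependency failure turns into a retry storm.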


Speaker

Tejas Ghadge

Engineering Head @AWS Amplify, AWS Lambda Event Driven Applications and AWS Lambda Developer Experience where he leads an organization of 100+ engineers/managers

Tejas Ghadge is the engineering head for AWS Amplify, AWS Lambda Event Driven Applications and AWS Lambda Developer Experience, where he leads an organization of 100+ engineers and managers across multiple sites in the US and Canada.

With over 14 years of experience at AWS, Tejas brings deep operational and architectural experience from operating large-scale event-driven systems (millions of requests per second), leading and analyzing hundreds of operational incidents, and successfully launching dozens of delightful features for AWS Lambda and AWS Amplify customers.
