Orchestrating Resilience: Building Modern Asynchronous Systems

Building asynchronous, event-driven systems can be daunting. Managing state, ensuring resilience, maintaining traceability, and handling a myriad of other challenges often require more effort than building the functionality itself. This talk will dive into my journey at Twilio, where I faced these complexities firsthand and turned to workflow orchestrators as a solution. I'll share practical examples from our projects, the challenges we faced, and how we overcame them.

We will explore the landscape of various workflow orchestrators, using our experience to highlight why we chose Temporal over other options like Apache Airflow and AWS Step Functions. Moreover, we'll examine strategies for handling partial and known failures, and how orchestrators can simplify and expedite these processes.

In the end, the audience will take away an in-depth understanding of workflow orchestrators' transformative role in asynchronous systems, empowering developers to build reliable, efficient, and easy-to-navigate systems.

Key Takeaways:

  1. Gaining insights into the challenges encountered while developing asynchronous systems, including state management, resilience, traceability, observability, and maintainability.
  2. Understanding the value of workflow orchestrators in building reliable and efficient asynchronous systems, and how they can significantly reduce the development overhead.
  3. Understanding the different workflow orchestrators available today, their unique features, and their suitability for various use cases.

Speaker

Sai Pragna Etikyala

Technical Lead @Twilio

Sai Pragna Etikyala is a Technical Lead at Twilio, currently leading the team responsible for A2P 10DLC compliance for messaging. Utilizing her extensive experience with asynchronous systems, she has efficiently re-architected Twilio's complex compliance systems, leading to notable improvements in manageability and operational efficiency. Before joining Twilio, she worked at Amazon Web Services, Yahoo, and Cerner. Throughout her tenure at these companies, she developed robust end-to-end solutions and successfully managed complex operations. This has enriched her expertise not only in asynchronous computing but also in software development, cloud computing, and healthcare IT solutions. She holds a Master's degree in Computer Science from Arizona State University. Her innovative and agile approach to software engineering and leadership distinguishes her as a significant contributor to the telecommunications realm and beyond.

Date

Tuesday Oct 3 / 03:55PM PDT (50 minutes)

Location

Ballroom BC

Topics

Resiliency, Architecture, Asynchronous Systems

From the same track

Session Database

How Netflix Ensures Highly-Reliable Online Stateful Systems

Tuesday Oct 3 / 02:45PM PDT

Under most stateless services are stateful databases, caches, and systems that form the bedrock on which applications are built.

Joseph Lynch

Distributed Systems Engineer @Netflix Working on Online Datastores and Data Abstractions

Session Resiliency

How Do We Talk to Each Other? How Surfacing Communication Patterns in Organizations Can Help You Understand and Improve Your Resilience

Tuesday Oct 3 / 01:35PM PDT

As a system increases in inevitable complexity, it becomes impossible for a single operator to have a clear, unambiguous understanding of what's happening in the system. Understanding the system requires a joint effort between teammates and technology.

Nora Jones

Founder and CEO @jeli_io, Founder of Learning From Incidents (LFI) Online Community and Conference

Session Architecture

Disaster Recovery Across a Million Pieces

Tuesday Oct 3 / 10:35AM PDT

Data recovery is more than just backing up and restoring a data store. The goal of any disaster recovery effort is getting the system back to working as expected across all of its parts.

Michelle Brush

Engineering Director, SRE @Google, Previously Director of HealtheIntent Architecture @Cerner Corporation & Lead Engineer @Garmin, Author of "2 out of the 97 Things Every SRE Should Know"

Session Reliability

Designing Fault-Tolerant Software with Control System Transparency

Tuesday Oct 3 / 11:45AM PDT

Teams at NASA and JPL that create mission-critical software for spacecraft take a principled approach to fault tolerance. Let's see how those same principles, centered around a concept of transparency, can help us achieve reliability in pragmatic, modern software delivery settings.

Jon Moore

Staff Software Engineer @Stripe with over 35 years of software engineering experience across both academia and industry

Session

Unconference: Designing for Resilience

Tuesday Oct 3 / 05:05PM PDT

What is an unconference? An unconference is a participant-driven meeting. Attendees come together, bringing their challenges and relying on the experience and know-how of their peers for solutions.