QCon San Francisco 2021, November 1-5, 2021

This presentation is now available to view on InfoQ.com

What You’ll Learn

Hear how Netflix moved from a three-tier architecture to microservices.
Hear what the Netflix Edge Gateway team does.
Learn, among other things, how they determine which microservice is generating an error when one occurs.

Abstract

Being the gatekeepers for the company is a very exciting and enlightening experience - we are trusted to build and operate bridges between the Internet and the data centers. With such a centralized location in the ecosystem comes great power and great responsibility - we have a unique insight into the company’s traffic and access patterns and can often adjust them in order to provide a better user experience, reduce cost or improve some other parameter.

In this talk we’ll review Netflix’s edge gateway ecosystem - multiple traffic gateways performing different functions, deployed around the world. We’ll touch upon the motivation behind such a topology and highlight the challenges it introduces. We’ll see where and how value is added, what its operational footprint is and what happens when things go wrong.

Question:

Are you going to talk about how the three-tier architecture was broken down into microservices at Netflix?

Answer:

Yes, I can definitely talk about that. We could talk about such a journey in general terms - how you start as a new company, usually with several engineers, a three-tier architecture and most likely a monolithic application. Over time the company, if successful, grows, adding functionality to the app as well as engineering capacity. At some point you have a little too much in a single app and might see engineers stepping on each other’s toes - time to introduce isolation between domain components and potentially break the monolith apart. While the monolith is being separated into services, there are at least two questions to answer: how do you route requests to the right backends, and how do you replicate common functionality (such as logging, authz, rate-limiting, DDoS protection) in all the services?

Leaving routing alone for now, one could say - let’s implement common logic as a base server, a library that is used as the foundation of all the services in the company. It’s absolutely possible to go this way, but it comes with challenges around keeping the base library up-to-date across deployments and supporting polyglot development (multiple programming languages or stacks). Alternatively, an API Gateway can be introduced to implement such common functionality, keeping the logic external to the services and ensuring a uniform implementation (version) across the board. As the organization grows, being able to perform infrastructure updates without involving service owners is essential. The API Gateway team can iterate easily, add features or push security updates without being blocked on other teams updating a library or sidecar or even restarting their applications. Plus, it solves the routing issue as well.
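To make the idea concrete, here is a minimal sketch of a gateway that applies shared filters before routing. It is an illustration only, not Netflix's implementation: the `Request` type, the filter functions, the route prefixes and the backend names are all hypothetical, and the "backends" are plain strings rather than real network calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Request:
    path: str
    headers: dict = field(default_factory=dict)
    client_id: str = "anonymous"


# A filter returns None to let the request through,
# or a (status, body) tuple to reject it.
Filter = Callable[[Request], Optional[tuple]]


def require_auth(req: Request) -> Optional[tuple]:
    """Hypothetical authz check: only looks for the header's presence."""
    if "Authorization" not in req.headers:
        return (401, "missing credentials")
    return None


def make_rate_limiter(max_requests: int) -> Filter:
    """Hypothetical per-client counter (no time window, for brevity)."""
    counts: dict = {}

    def limit(req: Request) -> Optional[tuple]:
        counts[req.client_id] = counts.get(req.client_id, 0) + 1
        if counts[req.client_id] > max_requests:
            return (429, "rate limit exceeded")
        return None

    return limit


class Gateway:
    def __init__(self, filters: list, routes: dict):
        self.filters = filters
        # Try the longest prefix first so specific routes win.
        self.routes = sorted(routes.items(), key=lambda kv: len(kv[0]), reverse=True)

    def handle(self, req: Request) -> tuple:
        # Common functionality lives here, once, for every backend.
        for check in self.filters:
            rejection = check(req)
            if rejection is not None:
                return rejection
        # Routing: pick the backend that owns this path.
        for prefix, backend in self.routes:
            if req.path.startswith(prefix):
                return (200, backend)
        return (404, "no route")


gateway = Gateway(
    filters=[require_auth, make_rate_limiter(max_requests=2)],
    routes={"/api": "api-service", "/api/playback": "playback-service"},
)
```

With this setup, `gateway.handle(Request("/api/playback/start", {"Authorization": "token"}, "c1"))` returns `(200, "playback-service")`, and a request without the header is rejected with a 401 before it reaches any backend. Changing a filter changes behavior for every service at once, which is the property the answer above relies on.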

Question:

Are you going to talk about pushing that application gateway beyond the cloud provider and closer to the user?

Answer:

Absolutely. One of the things that I want to touch upon is what happens when the company goes multi-region (multiple data centers). One needs to answer an important question - how do you route clients, how do they get to your ecosystem? There are several ways of doing global traffic management, but chances are you’ll end up using a DNS-based approach. Keep in mind that with multiple data centers you’ll occasionally need to evacuate one of them - shift all of the traffic away from it. Usually, this is done with a DNS flip: adjust the DNS records, then wait for the changes to propagate and for traffic to start flowing to the new locations. Unfortunately, many clients and resolvers do not honor DNS TTLs, so traffic might still be hitting the “unhealthy” DC after the evacuation. In such cases, the API Gateway can help by routing requests from one region to another with cross-region traffic proxying - anything to ensure customer satisfaction. Since I’m touching on client routing to the data center, I will briefly talk about a service Netflix built recently that is used as a last-mile API gateway. This service is deployed in ISP and IX locations and is the first point of contact for many of our clients. The reasoning behind introducing such a layer is to reduce the time needed for the initial TLS handshake and also to reduce the time to recover from TCP errors. While doing that, it also provides more granular and responsive traffic routing capabilities and improves the failover experience - traffic can be controlled without relying on DNS propagation.
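The cross-region proxying decision described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the region names, the `FALLBACKS` preference table and the `choose_region` function are hypothetical, and a real gateway would also weigh capacity, latency and health signals.

```python
from typing import Optional

# Hypothetical preference order: where to proxy traffic that still
# arrives in a region after it has been evacuated.
FALLBACKS = {
    "us-east": ["us-west", "eu-west"],
    "us-west": ["us-east", "eu-west"],
    "eu-west": ["us-east", "us-west"],
}


def choose_region(arrival_region: str, evacuated: set,
                  fallbacks: dict = FALLBACKS) -> Optional[str]:
    """Pick the region that should serve a request.

    A request that lands in a healthy region is served locally.
    A request that lands in an evacuated region (for example because a
    resolver ignored the DNS TTL) is proxied to the first healthy fallback.
    Returns None when no healthy region is available.
    """
    if arrival_region not in evacuated:
        return arrival_region
    for candidate in fallbacks.get(arrival_region, []):
        if candidate not in evacuated:
            return candidate
    return None
```

For example, with `evacuated = {"us-east"}`, a request arriving in `us-east` is sent to `us-west`, while a request arriving in `eu-west` stays where it is. The decision is made per request at the gateway, so it takes effect without waiting for DNS changes to propagate.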

Question:

What is a listener going to learn from your talk about how to orchestrate their microservices?

Answer:

I hope engineers and architects who attend this talk will learn about a set of potential problems that might lie ahead as the business grows. It’s useful to have an example of how other companies approached similar issues and to understand the trade-offs, even if you don’t take the same steps. It’s not a pure engineering problem, it’s a problem of scaling the organization. As the number of teams grows, a lot of potential communication channels between them are introduced, so it’s better to design your ecosystem to reduce the number of required connections. An API Gateway helps with that.
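A rough back-of-the-envelope calculation shows why the number of channels matters. The sketch below assumes a deliberately simplified model that is not from the talk: without a shared gateway, every pair of teams may need to coordinate directly, while with one, each team coordinates only with the gateway team on cross-cutting concerns. The function names are illustrative.

```python
def pairwise_channels(teams: int) -> int:
    """Potential direct channels when every team may talk to every other."""
    return teams * (teams - 1) // 2


def hub_channels(teams: int) -> int:
    """Channels when each team coordinates only with one central team."""
    return teams


for n in (5, 10, 50):
    print(n, pairwise_channels(n), hub_channels(n))
```

Under this model, 10 teams have 45 potential pairwise channels versus 10 through a hub, and 50 teams have 1225 versus 50. Pairwise coordination grows quadratically while the hub model grows linearly.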

Question:

Give me an example of one of the lessons that you might share.

Answer:

Last year the team built a new service - Raju. The goal was to detect anomalous behavior of backend services and alert service owners to such events. That was done without changing a single line of code on the backends. The idea is that the API Gateway has a centralized understanding of the health of the ecosystem - it observes all the requests going through and the responses sent to clients. By applying some statistical methods to such data, we were able to identify issues at scale. Having a holistic view of your ecosystem is very powerful and it can be leveraged. Another one, maybe less positive: if you make your API Gateway config self-serviceable and allow engineers to adjust routing rules globally, provide tools to assess the impact, otherwise unexpected things may happen.
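The interview does not say which statistical methods are used, so the following is only a sketch of the general idea with a simple stand-in technique: a z-score test on each backend's error rate as observed at the gateway. The function names, the threshold and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history: list, current: float, threshold: float = 3.0) -> bool:
    """Flag `current` if it sits more than `threshold` standard deviations
    above the mean of recent observations."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > threshold


def find_anomalies(histories: dict, current_rates: dict) -> list:
    """Return the names of backends whose current error rate looks anomalous.

    Both inputs are keyed by backend name; the gateway can build them from
    the responses it already sees, with no changes on the backends.
    """
    return sorted(
        name
        for name, rate in current_rates.items()
        if is_anomalous(histories.get(name, []), rate)
    )


# Hypothetical error rates (fraction of 5xx responses per interval).
histories = {
    "playback": [0.010, 0.012, 0.011, 0.009, 0.010],
    "search": [0.020, 0.022, 0.019, 0.021, 0.020],
}
current_rates = {"playback": 0.050, "search": 0.021}
```

Here `find_anomalies(histories, current_rates)` returns `["playback"]`: its error rate jumped well above its recent baseline, while `search` stayed within its usual range.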

Speaker: Vasily Vlasov

Engineering Leader @Netflix

Vasily works at Netflix, where he leads the Cloud Gateway team. The team provides API Gateway and push notification services for Netflix’s customers and employees. Prior to Netflix, Vasily designed and built the gateways and software load-balancers for iCloud and iTunes at Apple.

Before becoming an adept of Edge, Vasily worked on various challenges ranging from building UI tools for managing Informix databases to software that converts code from PL/I to Java.