You are viewing content from a past/completed QCon

Presentation: Controlled Chaos: Taming Organic, Federated Growth of Microservices

Track: Microservices Patterns & Practices

Location: Ballroom A

Duration: 2:55pm - 3:45pm

Day of week: Tuesday

This presentation is now available to view on InfoQ.com

What You’ll Learn

  1. Hear about some of the complexity of deploying and operating microservices.
  2. Learn some patterns to apply in a microservices architecture to preserve stability and security over time.

Abstract

The success with which enterprises execute on microservice strategies and the degree to which cloud-native technologies boost developer productivity leave operations and security teams with an organically growing landscape of federated services that is increasingly difficult to control. As a result, failures mount, resilience declines, and innovation dies.

In this talk, I focus on the challenges that result from organic, federated growth as well as the patterns that can be applied to monitor and control these dynamic systems, like bulkheads, backpressure, and quarantines, from both an operational and security perspective. I illustrate how visibility and control of the surrounding environment become more important than the observability of individual threads of execution and how the nature of this organic architectural style necessitates a shift to remediating behaviors in real-time.

Question: 

What is the work you're doing today?

Answer: 

I am the co-founder and CEO of Glasnostic, a startup that provides an operations solution for rapidly evolving networks of services that lets enterprises gain control over the unpredictable behaviors that such environments exhibit. My background is in tech. I was the co-founder of the company that became Red Hat OpenShift and, like everybody else, have always focused on how to optimally support the building of applications until I realized that successful applications don’t typically become successful because they are well-engineered. They become successful because they are operated well. And this is what I’ve set out to do at Glasnostic: help fix the massive operations crisis that we are facing today.

Question: 

What are the goals for the talk?

Answer: 

I want to raise awareness of how deeply enterprise agility and microservices change the way we architect and operate service architectures. On the business side, there is this new, agile operating model with small, self-managing teams that work in rapid decision and learning cycles. On the technical side, businesses execute a microservices-first strategy, developer productivity is buoyed by cloud-native technologies and development itself is scaled out with many teams deploying in parallel. These factors leave architects and operators with a continually evolving landscape of sprawling and increasingly connected services that, while being of great benefit to the business, is inherently unstable and insecure and thus difficult to control. And, as we all know, if we can’t control, we can’t operate, and if we can’t operate, innovation dies. That’s the problem that we are solving.

Now, the key difference between applications and such federated, evolving service landscapes is that failure in service landscapes occurs overwhelmingly not because of code defects in individual threads of execution but due to environmental factors. This is also the reason why service landscapes truly represent a new technological paradigm. The old model of looking at code execution to find a “root cause” that will fix the current failure is irrelevant in a world where my code is significantly more likely to be impacted by noisy neighbors, grey failures and other random occurrences. And because I, as a developer, am ultimately responsible for only a handful of services within a much larger landscape, it becomes evident how monitoring, tracing, and in general, any local “observability” is much less useful than we’ve been trained to think. We can all engineer stand-alone, single-blueprint applications. The difficulties arise once you string decomposed applications together to form a federated landscape of services.

Because service landscapes represent a new paradigm where the impact of environmental factors outweighs code defects and because these environmental factors are inherently unpredictable, we need to change how we operate these architectures. We need to be able to remediate in real time, with near-zero mean time to repair (MTTR). This means, for instance, that, from an observability perspective, we can’t be looking at petabytes of machine data, with or without ML. We need to look at “golden signals” such as requests, latencies, concurrencies and bandwidth that are relevant to the environment and that are universally observable so we can quickly detect and identify failures, grey or not. And on the remediation side, we need to have a playbook of predictable operational patterns that we can apply in real-time such as backpressure, bulkheads, or quarantines.
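As a rough illustration of two of the patterns named above, the following sketch combines a bulkhead with a simple form of backpressure. It is only an in-process analogy, not the speaker's implementation: the talk is about applying such patterns operationally across a landscape, and the names `Bulkhead`, `BulkheadFull` and `max_concurrent` are hypothetical. The bulkhead caps concurrency toward one dependency, and rejecting excess calls immediately is what pushes back on callers.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the bulkhead has no free slot (hypothetical name)."""


class Bulkhead:
    """Caps the number of concurrent calls into one downstream dependency."""

    def __init__(self, max_concurrent: int) -> None:
        if max_concurrent < 1:
            raise ValueError("max_concurrent must be at least 1")
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: when every slot is taken, the request is
        # rejected right away instead of being queued. That rejection is
        # the backpressure signal the caller has to handle.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            # Always free the slot, even if the wrapped call raised.
            self._slots.release()
```

Giving each dependency its own `Bulkhead` keeps one slow neighbor from consuming every worker, which is the isolation property the pattern is named for.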

Question: 

Can you give me an example of quarantining, how that would work in practice?

Answer: 

Sure. If you run a service landscape and deploy a new service, it will be connected to a number of existing services and, because it has an API, you don’t know who will call you an hour, a day or a week from now. So, fundamentally, you won’t be able to engage in big-design-up-front architecture. And because you can’t architect, you can’t be sure that this new service will work within the landscape. As a result, you’ll want to ease the service into the landscape. Also, you can’t stage a service landscape in any meaningful way due to its complexity, and staging without production traffic and production scale is moot anyway. So, because you can’t stage, you’ll want to ease the deployment into production. And one operational pattern you can apply to that effect is the quarantine pattern.

At its most basic level, the quarantine pattern involves restricting the upstream traffic of a service or a group of services, keeping it very limited, and then removing that restriction slowly so that the operations team can observe the effects that the deployment has on the wider architecture. So, curiously, the quarantine pattern is often a way to implement a governor pattern. It is a powerful pattern because the vast majority of failures are triggered when changes are introduced in the landscape. As a result, because quarantining is so useful in reducing deployment risk, we often see this deployed automatically.
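The restrict-then-relax idea described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how any particular product implements quarantining: the class name `Quarantine`, the per-window request budget, and the explicit `relax()` step are all hypothetical, and a real deployment would enforce the limit in the network path rather than in application code.

```python
class Quarantine:
    """Limits traffic to a newly deployed service and relaxes the limit in stages.

    `limits` lists how many requests are admitted per observation window at
    each stage, from most to least restrictive; None means unrestricted.
    """

    def __init__(self, limits):
        self._limits = list(limits)
        if not self._limits:
            raise ValueError("at least one stage is required")
        self._stage = 0
        self._admitted = 0

    @property
    def limit(self):
        """Request budget of the current stage (None if unrestricted)."""
        return self._limits[self._stage]

    def admit(self) -> bool:
        """Return True if one more request may pass in the current window."""
        limit = self.limit
        if limit is not None and self._admitted >= limit:
            return False
        self._admitted += 1
        return True

    def next_window(self) -> None:
        """Start a new observation window with a fresh budget."""
        self._admitted = 0

    def relax(self) -> None:
        """Move to the next, less restrictive stage, if there is one."""
        if self._stage < len(self._limits) - 1:
            self._stage += 1


# Example: admit 2 requests per window at first, then 5, then lift the limit.
gate = Quarantine([2, 5, None])
first_window = [gate.admit() for _ in range(4)]
```

In this sketch, an operator (or automation) would call `relax()` only after observing that the wider landscape stayed healthy during the previous stage.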

Question: 

What do you want people to leave the talk with?

Answer: 

I want people to leave with two insights. First, that service landscapes genuinely represent a new paradigm. If you are looking to become a digital enterprise, and if you, therefore, strive for agility and execute a microservices-first strategy, then you will find yourself running a service landscape. It is not a matter of “if” but “when.” You simply can’t continue to build distributed applications. In fact, distributed systems engineering is probably the worst antipattern today because it is so slow and expensive and involves waterfall-like big design up-front.

Second, I want people to leave with the insight that failures in service landscapes, i.e. stability and security issues, are overwhelmingly due to complex environmental behaviors, not individual threads of execution, which is how we as engineers have been brought up to think. And that these environmental behaviors are fundamentally unpredictable, which necessitates an entirely different, “mission control” mindset when it comes to operating such landscapes. We need to remediate in real time, which means we’ll have to look at golden signals, not get lured into the abyss that is high-cardinality observability. And armed with such actionable visibility, we need to be able to apply predictable operational patterns. Finally, we need to be able to actually do something in real time, not run a half-day incident response process.

It is this mission control operational mindset that enables enterprises to innovate rapidly and successfully in the digital domain. Apollo 13 didn’t make it back to Earth because the mission was well engineered. It made it back because the mission was operated well.

Speaker: Tobias Kunze

Co-founder and CEO @glasnostic

As CEO of Glasnostic, Tobias Kunze is on a mission to help enterprises manage their rapidly evolving microservice architectures. Before co-founding Glasnostic, he was the co-founder of Makara, an enterprise PaaS that became Red Hat OpenShift. Before that, he ran development and operations for Lycos Europe's shopping arm and held various other engineering positions at startups and enterprises alike.
