Presentation: Migrating to a Fault Tolerant System with Spanner

Duration

11:50am - 12:40pm

Key Takeaways

  • Learn from the experience of building a distributed system using Google’s Spanner.
  • Hear about steps and decisions made early in a project’s lifecycle that can save developers pain later.
  • Understand the high-scale design decisions Google needed to consider while developing custom tooling for its developer infrastructure.

Abstract

Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.

This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for applications in addition to distributed transactions, replication, and automated backups. Previously Guitar relied upon a hodge-podge of Bigtables, in-memory data structures, and custom mechanisms for recovering state upon failure. This previous architecture resulted in a system that was unscalable, unreliable, and an impediment to developing new features that were sorely needed.

We will first go into details of the previous architecture and how the mechanisms in place for dealing with failure were inadequate and not properly thought out. We will then discuss the transition plan from the old to the new system, which could be generally applicable to other migrations, and the lessons we learned for how to deal with failure properly. In addition, we will discuss how we transitioned our heavily used production system for all of our clients with no downtime and in a fully controlled and gradual manner.

Interview

Question: 
QCon: What is your role today?
Answer: 

Edwin: I am a software engineer at Google. I work in Developer Infrastructure, where we build tools for all engineers inside Google. Many of the tools used internally are completely custom, so there is a large group that maintains them. My team specifically works on an integration testing tool, used by a large majority of Googlers, that stands up environments and runs integration tests against them. It mimics a production environment as much as possible.

Question: 
QCon: Can you explain your talk title to me?
Answer: 

Edwin: Sure. With our integration testing system, we had a central server that we called the registry. The registry acts as a gateway between all the different users running integration tests. It acts like an index that keeps track of all the projects and facilitates communication. This registry has an older architecture. It uses older technology within Google and, because of that, has a lot of scalability issues. It was also initially written for a much lower scale than the one we are running the tool at now. There are a host of issues we’ve been having with it over the past year or two as we’ve been maintaining this project.

Edwin: This talk is about how we’re transitioning it to use Spanner, Google’s scalable, multi-version, globally distributed, and synchronously replicated database. Spanner, as the core technology here, allows us to scale much more easily and write this service in a distributed manner. The new tooling lets us scale horizontally and gets rid of many of the scaling issues we had before.
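The failure-handling contrast behind this migration — ad-hoc recovery across multiple independent stores versus atomic, transactional writes — can be sketched with a toy model. This is purely illustrative and assumes nothing about Guitar’s or Spanner’s real APIs; every name here is hypothetical:

```python
# Toy sketch (hypothetical names; not Guitar's or Spanner's actual code)
# contrasting ad-hoc multi-store updates with transactional, all-or-nothing
# writes, which is the failure-handling difference described above.

class CrashMidWrite(Exception):
    """Simulates the server dying partway through an update."""

# Old-style approach: two independent stores, updated separately.
index = {}   # stands in for an in-memory index
table = {}   # stands in for a Bigtable-like store

def register_no_txn(name, owner, crash=False):
    index[name] = owner           # first write lands...
    if crash:
        raise CrashMidWrite()     # ...then the process dies
    table[name] = {"owner": owner}

# Transactional approach: buffer writes, apply them all atomically.
class Txn:
    def __init__(self, store):
        self.store, self.buffer = store, {}

    def put(self, key, value):
        self.buffer[key] = value  # nothing visible until commit()

    def commit(self):
        self.store.update(self.buffer)  # all writes apply together

db = {}

def register_txn(name, owner, crash=False):
    txn = Txn(db)
    txn.put(("index", name), owner)
    if crash:
        raise CrashMidWrite()     # crash before commit: nothing applied
    txn.put(("table", name), {"owner": owner})
    txn.commit()

# Simulate a crash in each version.
for fn in (register_no_txn, register_txn):
    try:
        fn("proj-a", "alice", crash=True)
    except CrashMidWrite:
        pass

# Non-transactional: inconsistent state that custom recovery must repair.
assert "proj-a" in index and "proj-a" not in table
# Transactional: no partial state at all.
assert not db
```

The property the sketch isolates — a crash before commit leaves no partial state, so recovery is simply “retry” — is the kind of guarantee that removes the need for the custom recovery mechanisms the old architecture relied on.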

Question: 
QCon: Can you give me an example of one of the lessons you will specifically talk about?
Answer: 

Edwin: The old architecture was designed with much less load in mind. It was almost like a prototype. It was the classic thing where you build a prototype, you start adding onto it, and the prototype sort of becomes the real thing. It was just fine for a long time, but once we got to a certain scale, it became really difficult to expand upon. So one of the lessons learned I plan to discuss is thinking long term: what are the little things you can do now that will make it way easier to scale later on?

Question: 
QCon: Who is this talk targeted for?
Answer: 

Edwin: I would say distributed systems engineers, but more so architects. I want to focus on high level aspects in terms of decision making. People who are determining what sort of technologies to use. This talk is about some of the things to keep in mind when designing a distributed system. That’s the focus.

Question: 
QCon: What are your key takeaways for this talk?
Answer: 

Edwin: One thing is, when you are starting out building your system, even when you are prototyping, think about the decisions you can make so that when you need to scale, it won’t be as difficult. There are two competing drives: you can over-optimize early and waste time, or you can completely ignore optimization and make your life miserable later on. There is a fine line between those two concerns, so the takeaway is: here are easy things you can always do, regardless of whether you are going to need to scale to millions of users or just a couple thousand at a time, that will make things easier to maintain overall. Eventually, if you do need to scale, it won’t be such a headache.

Speaker: Edwin Fuquen

Software Engineer @Google

