Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: Migrating to a Fault Tolerant System with Spanner
Key Takeaways
- Learn from the experience of building a distributed system using Google’s Spanner.
- Hear about steps and decisions that can be made early in a project’s lifecycle to save developers from pain later.
- Understand the high-scale design decisions that Google needed to consider while developing custom tooling for developer infrastructure.
Abstract
Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.
This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for applications in addition to distributed transactions, replication, and automated backups. Previously Guitar relied upon a hodge-podge of Bigtables, in-memory data structures, and custom mechanisms for recovering state upon failure. This previous architecture resulted in a system that was unscalable, unreliable, and an impediment to developing new features that were sorely needed.
We will first go into details of the previous architecture and how the mechanisms in place for dealing with failure were inadequate and not properly thought out. We will then discuss the transition plan from the old to the new system, which could be generally applicable to other migrations, and the lessons we learned for how to deal with failure properly. In addition, we will discuss how we transitioned our heavily used production system for all of our clients with no downtime and in a fully controlled and gradual manner.
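One common way to get a controlled, gradual cutover of the kind described above is to route a stable, adjustable percentage of clients to the new backend. The sketch below illustrates that general idea only; it is not the mechanism used for Guitar, and the names `bucket`, `use_new_backend`, and `rollout_percent` are hypothetical.

```python
import hashlib


def bucket(client_id: str, buckets: int = 100) -> int:
    """Map a client ID to a stable bucket in [0, buckets).

    Hashing (rather than random choice) keeps each client on the same
    side of the migration from one request to the next.
    """
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def use_new_backend(client_id: str, rollout_percent: int) -> bool:
    """Return True if this client should be served by the new backend.

    rollout_percent is a hypothetical operator-controlled setting:
    0 sends everyone to the old system, 100 sends everyone to the new one.
    """
    return bucket(client_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical client IDs, used only to show the share at a 25% rollout.
    clients = [f"project-{i}" for i in range(1000)]
    migrated = sum(use_new_backend(c, 25) for c in clients)
    print(f"{migrated} of {len(clients)} clients routed to the new backend")
```

Because a client's bucket never changes, raising `rollout_percent` only ever moves clients from the old system to the new one, and lowering it rolls them back in the reverse order.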
Interview
Edwin: I am a software engineer at Google. I work in Developer Infrastructure, where we build tools for all engineers inside Google. Many tools used internally are completely custom, so there is a large group that maintains all of them. My team specifically works on an integration testing tool that is used by a large majority of Googlers to stand up environments and run integration tests against them. The tool mimics a production environment as closely as possible.
Edwin: Sure. With our integration testing system, we had a central server that we called the registry. The registry acts as a gateway between all the different users running integration tests. It acts like an index to keep track of all the projects and facilitates communication. This registry has an older architecture. It uses older technology within Google and, because of that, has a lot of scalability issues. It was also initially written for a much lower scale than what we are currently running the tool at. There are a host of issues that we’ve been having with it over the past year or two as we’ve been maintaining this project.
This talk is about how we’re transitioning it to use Spanner, Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. Spanner, as a core technology here, allows us to scale much more easily and write this service in a distributed manner. The new tooling allows us to scale horizontally and get rid of many of the scaling issues we had before.
Edwin: In terms of the old architecture, it was designed in a way where they had anticipated less load. So it was almost like a prototype. It was the classic thing where you build a prototype and then you start adding onto it. The prototype sort of becomes the real thing. It was just fine for a long time, but once we got to a certain scale, it became really difficult to expand upon that. So one of the lessons learned I plan to discuss is thinking long term. I will discuss questions like: what are the little things you can do now that will make it way easier to scale later on?
Edwin: I would say distributed systems engineers, but more so architects. I want to focus on high level aspects in terms of decision making. People who are determining what sort of technologies to use. This talk is about some of the things to keep in mind when designing a distributed system. That’s the focus.
Edwin: One thing is that when you are starting out building your system, even when you are prototyping, you should think about the decisions you can make so that when you need to scale, it won’t be as difficult. There are two competing drives: you can over-optimize early and waste time, or you can completely ignore any sort of optimization and make your life miserable later on. There is a fine line in balancing those two concerns, so the takeaway is: here are easy things you can always do, regardless of whether you are going to need to scale to millions or just to a couple thousand users at a time, that will make it easier to maintain things overall. Then, if you eventually do need to scale, it won’t be such a headache.
Tracks
Monday Nov 7
- Architectures You've Always Wondered About
  You know the names. Now learn lessons from their architectures
- Distributed Systems War Stories
  “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
- Containers Everywhere
  State of the art in Container deployment, management, scheduling
- Art of Relevancy and Recommendations
  Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
- Next Generation Web Standards, Frameworks, and Techniques
  JavaScript, HTML5, WASM, and more... innovations targeting the browser
- Optimize You
  Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
- Next Generation Microservices
  What will microservices look like in 3 years? What if we could start over?
- Java: Are You Ready for This?
  Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
- Big Data Meets the Cloud
  Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
- Evolving DevOps
  Lessons/stories on optimizing the deployment pipeline
- Software Engineering Softskills
  Great engineers do more than code. Learn their secrets and level up.
- Modern CS in the Real World
  Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
- Architecting for Failure
  Your system will fail. Take control before it takes you with it.
- Stream Processing
  Stream Processing, Near-Real Time Processing
- Bare Metal Performance
  Native languages, kernel bypass, tooling - make the most of your hardware
- Culture as a Differentiator
  The why and how for building successful engineering cultures
- //TODO: Security <-- fix this
  Building security from the start. Stories, lessons, and innovations advancing the field of software security.
- UX Reimagined
  Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.