Presentation: Migrating to a Fault Tolerant System with Spanner

Duration

11:50am - 12:40pm

Key Takeaways

  • Learn from the experience of building a distributed system using Google’s Spanner.
  • Hear about steps and decisions made early in a project’s lifecycle that can save developers pain later.
  • Understand the high-scale design decisions Google needed to consider while developing custom tooling for its developer infrastructure.

Abstract

Designing systems that take failure into account from the start is hard. Sometimes it’s very tempting to take shortcuts that will adversely affect your system in the long run. However, there are steps one can take to avoid these shortcuts and build a fault-tolerant system.

This talk is a case study in transitioning Guitar, an internal integration testing framework, to Spanner. Spanner is a database developed internally at Google that provides a fast, distributed data store for applications in addition to distributed transactions, replication, and automated backups. Previously Guitar relied upon a hodge-podge of Bigtables, in-memory data structures, and custom mechanisms for recovering state upon failure. This previous architecture resulted in a system that was unscalable, unreliable, and an impediment to developing new features that were sorely needed.

We will first go into details of the previous architecture and how the mechanisms in place for dealing with failure were inadequate and not properly thought out. We will then discuss the transition plan from the old to the new system, which could be generally applicable to other migrations, and the lessons we learned for how to deal with failure properly. In addition, we will discuss how we transitioned our heavily used production system for all of our clients with no downtime and in a fully controlled and gradual manner.

Interview

Question: 
QCon: What is your role today?
Answer: 

Edwin: I am a software engineer at Google. I work in Developer Infrastructure, where we build tools for all engineers inside Google. Many of the tools used internally are completely custom, so there is a large group that maintains them. My team specifically works on an integration testing tool, used by a large majority of Googlers, that stands up environments and runs integration tests against them. It mimics a production environment as much as possible.

Question: 
QCon: Can you explain your talk title to me?
Answer: 

Edwin: Sure. With our integration testing system, we had a central server that we called the registry. The registry acts as a gateway between all the different users running integration tests. It acts like an index that keeps track of all the projects and facilitates communication. This registry has an older architecture. It uses older technology within Google and, because of that, has a lot of scalability issues. It was also initially written for a much lower scale than the one we are running the tool at now. There are a host of issues we’ve been having with it over the past year or two as we’ve been maintaining this project.

Edwin: This talk is about how we’re transitioning it to use Spanner, Google’s scalable, multi-version, globally distributed, and synchronously replicated database. Spanner, as the core technology here, allows us to scale much more easily and write this service in a distributed manner. The new tooling lets us scale horizontally and gets rid of many of the scaling issues we had before.
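The failure-handling contrast behind this migration — ad-hoc recovery across multiple independent stores versus atomic, transactional writes — can be sketched with a toy model. This is purely illustrative and assumes nothing about Guitar’s or Spanner’s real APIs; every name here is hypothetical:

```python
# Toy sketch (hypothetical names; not Guitar's or Spanner's actual code)
# contrasting ad-hoc multi-store updates with transactional, all-or-nothing
# writes, which is the failure-handling difference described above.

class CrashMidWrite(Exception):
    """Simulates the server dying partway through an update."""

# Old-style approach: two independent stores, updated separately.
index = {}   # stands in for an in-memory index
table = {}   # stands in for a Bigtable-like store

def register_no_txn(name, owner, crash=False):
    index[name] = owner           # first write lands...
    if crash:
        raise CrashMidWrite()     # ...then the process dies
    table[name] = {"owner": owner}

# Transactional approach: buffer writes, apply them all atomically.
class Txn:
    def __init__(self, store):
        self.store, self.buffer = store, {}

    def put(self, key, value):
        self.buffer[key] = value  # nothing visible until commit()

    def commit(self):
        self.store.update(self.buffer)  # all writes apply together

db = {}

def register_txn(name, owner, crash=False):
    txn = Txn(db)
    txn.put(("index", name), owner)
    if crash:
        raise CrashMidWrite()     # crash before commit: nothing applied
    txn.put(("table", name), {"owner": owner})
    txn.commit()

# Simulate a crash in each version.
for fn in (register_no_txn, register_txn):
    try:
        fn("proj-a", "alice", crash=True)
    except CrashMidWrite:
        pass

# Non-transactional: inconsistent state that custom recovery must repair.
assert "proj-a" in index and "proj-a" not in table
# Transactional: no partial state at all.
assert not db
```

The property the sketch isolates — a crash before commit leaves no partial state, so recovery is simply “retry” — is the kind of guarantee that removes the need for the custom recovery mechanisms the old architecture relied on.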

Question: 
QCon: Can you give me an example of one of the lessons you will specifically talk about?
Answer: 

Edwin: The old architecture was designed with much less load in mind. It was almost like a prototype. It was the classic thing where you build a prototype, you start adding onto it, and the prototype sort of becomes the real thing. It was just fine for a long time, but once we got to a certain scale, it became really difficult to expand upon. So one of the lessons learned I plan to discuss is thinking long term: what are the little things you can do now that will make it way easier to scale later on?

Question: 
QCon: Who is this talk targeted for?
Answer: 

Edwin: I would say distributed systems engineers, but more so architects. I want to focus on high level aspects in terms of decision making. People who are determining what sort of technologies to use. This talk is about some of the things to keep in mind when designing a distributed system. That’s the focus.

Question: 
QCon: What are your key takeaways for this talk?
Answer: 

Edwin: One thing is, when you are starting out building your system, even when you are prototyping, think about the decisions you can make so that when you need to scale, it won’t be as difficult. There are two competing drives: you can over-optimize early and waste time, or you can completely ignore optimization and make your life miserable later on. There is a fine line between those two concerns, so the takeaway is: here are easy things you can always do, regardless of whether you are going to need to scale to millions of users or just a couple thousand at a time, that will make things easier to maintain overall. Eventually, if you do need to scale, it won’t be such a headache.

Speaker: Edwin Fuquen

Software Engineer @Google

