Track:

Architecting for Failure

Location:

Ballroom B/C

Duration:

1:40pm - 2:30pm

Key Takeaways

Failure happens: design distributed systems to handle failure gracefully.
Hear lessons learned from building a distributed NoSQL database.
Learn principles and architectural patterns for building a distributed database.

Abstract

Running distributed systems in production can be tremendously challenging. In this session, we will cover common problems and failures seen with distributed systems, and discuss design patterns that can be used to maintain data integrity and availability when everything goes wrong. We will use Druid as a real-world case study of how these patterns are implemented in an open source technology.

Attendees will learn firsthand about the multitude of software, hardware, network, and data center problems that can arise when running distributed systems, and the features that are required for availability and survivability. To provide real-world examples, we will examine the architecture of Druid and how the system is designed to power applications that need to be up 24/7. We will also cover common pitfalls of running distributed systems in various environments, including the trade-offs between on-premises and cloud deployments.

Finally, we will cover best practices around properly instrumenting monitoring and alerting for distributed systems. We will examine various open source technologies that can be used for efficient monitoring, and how these technologies can be used to maintain the availability of your cluster.

We hope this session will help you better use, design, and monitor your data systems.

Additional Links:

Intro reading list about distributed systems:

http://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/

More information about Druid: www.druid.io

Interview with Fangjin Yang

QCon: Your talk is called "Architecting Distributed Databases for Failure." Can you tell me about it?

Fangjin: The talk is about the architectural patterns involved in building distributed databases and some of the lessons learned building an open source distributed database. All distributed databases have a lot of principles in common (such as the ways that distributed systems can survive failure). Academia has been talking about this for ages. What I want to do is talk about some of these main principles and use an open source project called Druid as a real-world case study in which these principles are implemented.

QCon: You mention Druid as an example of building a distributed database, what is Druid?

Fangjin: Druid is an open source, column-oriented distributed database. It's primarily designed for analytics and business intelligence. If you're familiar with the data warehousing world, it's a database that's really good for OLAP queries and for streaming your data into the database directly.

QCon: So how will your background developing Druid affect your talk?

Fangjin: One thing I was hoping to do was talk a little bit about how you do a rolling update on a distributed system, and why it makes sense to have a management component be a part of rolling updates.

From the very beginning with Druid, we decided that we wanted to be able to do rolling deployments. Druid has various components built in that let you take down one process at a time without any loss of data. If you bring that process back up, Druid has a very quick way of restoring state. There is a management piece that says, "Hey, one of my servers went down; I might need to start redistributing load and replicating missing data." If the server that went down comes back pretty fast, then the management piece says, "OK, the server has been restored, so I don't have to do anything." That's part of the reason why Druid has many different components, including dedicated components that monitor the state of the cluster. I will draw on stories from these kinds of experiences in my talk.
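
The "wait before reacting" behavior described above can be illustrated with a small sketch. This is not Druid's actual coordinator logic; the Coordinator class, the grace_period parameter, and the segment and server names are all hypothetical, and the sketch simply reassigns a lost server's data to the least-loaded live server rather than modeling real replication. Time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Hypothetical management component (an illustration, not Druid's code)."""
    grace_period: float  # seconds to wait before treating a down server as lost
    replicas: dict = field(default_factory=dict)    # server -> set of segment ids
    down_since: dict = field(default_factory=dict)  # server -> time it went down

    def server_down(self, server, now):
        # Remember only the first time we saw the server go down.
        self.down_since.setdefault(server, now)

    def server_up(self, server):
        # The server came back: forget the outage, no data needs to move.
        self.down_since.pop(server, None)

    def tick(self, now):
        """Reassign data from servers that stayed down past the grace period.

        Returns a dict mapping each moved segment to its new server.
        """
        moves = {}
        for server, since in list(self.down_since.items()):
            if now - since < self.grace_period:
                continue  # still inside the grace period; keep waiting
            live = [s for s in self.replicas if s not in self.down_since]
            if not live:
                continue  # nowhere to put the data yet; retry on a later tick
            for segment in sorted(self.replicas.pop(server, set())):
                # Pick the live server currently holding the fewest segments.
                target = min(live, key=lambda s: len(self.replicas[s]))
                self.replicas[target].add(segment)
                moves[segment] = target
            del self.down_since[server]
        return moves
```

In this sketch, a server that restarts during a rolling update and calls server_up before the grace period expires causes no data movement at all; only a server that stays down longer triggers redistribution on the next tick.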

QCon: How is your talk going to be structured?

Fangjin: First, I thought I would discuss the different types of failures that can occur (from the small failures to the "oh crap, everything is broken" failures). From there, I want to cover the main principles that distributed systems implement to handle these various classes of failures and the typical patterns that systems follow. The third part would be more of a case study where I've implemented some of these principles, like the rolling deployment example above. The last part is really about how all of this actually functions in a working system.

QCon: What is the main take away someone might have coming to your talk?

Fangjin: I think the main takeaway that I hope people get out of this talk is that things are going to break. It's not the end of the world. There are a lot of things you can do to survive various types of failures.

Similar Talks

Co-creator of Apache Kafka, Co-founder & Head of Engineering @Confluent

Neha Narkhede

How to have your Causality and Wall Clocks Too

Senior Fellow @Comcast

Jon Moore

Preparing PayPal for Launch

VP of Global Platform and Infrastructure @PayPal

Sri Shivananda

Beyond the Hype: 4 years of Go in Production

CTO & Iron.io Co-founder

Travis Reeder

Dino DNA! Health identity from the wrist @Jawbone

Director, Head of Data Science and Engineering @Jawbone

Brian Wilt

How NOT to measure Latency

CTO and co-founder @AzulSystems

Gil Tene

Personalization in the Pinterest Homefeed

Discovery Team Engineer @Pinterest

Dmitry Chechik

Flying faster with Heron

Engineering Manager and Technical Lead for Real Time Analytics @Twitter

Karthik Ramasamy

Debugging Microservices in Production

CTO @Joyent

Bryan Cantrill

Tracks

Covering innovative topics

Monday Nov 16

Architectures You've Always Wondered About

Silicon Valley to Beijing: Exploring some of the world's most intriguing architectures
Applied Machine Learning

How to start using machine learning and data science in your environment today. Latest and greatest best practices.
Browser as a platform (Realizing HTML5)

Exciting new standards like Service Workers, Push Notifications, and WebRTC are making the browser a formidable platform.
Modern Languages in Practice

The rise of 21st century languages: Go, Rust, Swift
Org Hacking

Our most innovative companies reimagining the org structure
Design Thinking

Level up your approach to problem solving and leave everything better than you found it.

Tuesday Nov 17

Containers in Practice

Build resilient, reactive systems one service at a time.
Architecting for Failure

Your system will fail. Take control before it takes you with it.
Modern CS in the Real World

Real-world Industry adoption of modern CS ideas
The Amazing Potential of .NET Open Source

From language design in the open to Rx.NET, there is amazing potential in an Open Source .NET
Optimizing You

Keeping life in balance is always a challenge. Learning lifehacks
Unlearning Performance Myths

Lessons on the reality of performance, scale, and security

Wednesday Nov 18

Streaming Data @ Scale

Real-time insights at Cloud Scale & the technologies that make them happen!
Taking Java to the Next Level

Modern, lean Java. Focuses on topics that push Java beyond how you currently think about it.
The Dark Side of Security

Lessons from your enemies
Taming Distributed Architecture

Reactive architectures, CAP, CRDTs, consensus systems in practice
JavaScript Everywhere!

JavaScript is everywhere. Learn why.
Culture Reimagined

Lessons on building highly effective organizations


Presentation: Architecting Distributed Databases for Failure
