Presentation: How Slack Works

Duration: 10:35am - 11:25am

Key Takeaways

  • Learn some of the practices that enable Slack to handle the challenges of synchronous large-group communication.
  • Hear practical approaches to how Slack leverages MySQL for eventual consistency.
  • Gain an understanding of the problem space around high-quality persistent group messaging.

Abstract

Slack is a persistent group messaging app for teams. Slack's 3.4 million active users expect high levels of reliability, low latency, and extraordinarily rich client experiences across a wide variety of devices and network conditions. In this talk, we'll take a tour of Slack's infrastructure, from native and web clients, through the edge, into the Slack datacenter, and around the various services that provide real-time messaging, search, voice calls, and custom emoji.

Interview

Question: 
QCon: You are the Chief Architect at Slack. What are some of the unique challenges that Slack is solving that people might not expect?
Answer: 

Keith: I think people see Slack as a site that serves a lot of people, has a lot of customers, and also as something that does real-time messaging. They feel like they have seen those things before. Real-time messaging is pretty crowded; WhatsApp, Facebook, and Google serve a lot more users than us. All that is true, but what is not true is how those messages are served to those users. While Facebook and Google have real-time messaging products, the actual number of people in a given conversation (in a given virtual space) is usually quite small.

For example, in most of those messaging products you curate the list of recipients by hand. Even if there are group threads, those group threads usually have to be assembled one or two people at a time. Having a good, real-time conversation where thousands of participants are able to contribute, feel like they have some sense of virtual place, and also consume the information is, so far, an unsolved problem on the synchronous side. People have done this with message boards and other asynchronous things where you have threads, replies, and comments (like Facebook wall posts). But doing that in real time is different, in the same way that video conferencing is a different discipline than video streaming.

Slack’s challenges are around maintaining low latency and high reliability. Low latency, for example, is important to making the product actually feel good. This is mostly below the threshold of conscious perception, but, if you mess this up, the product feels bad. It’s hard to say why it feels bad. It just does. In essence, these are textual equivalents to stops and starts in bad video conferencing.

In a video or voice call, you may end up saying, ‘Oh, I was just going to say,’ or ‘No, you go ahead.’ It gets worse as you add participants, especially when participants have high latency.

In a nutshell, the challenge for Slack is synchronous group communication with large numbers of participants.

Question: 
QCon: You talk about a feel with real-time messaging. What type of response times are you talking about with Slack?
Answer: 

Keith: For the sake of the talk, I am going to gather some data that we are comfortable sharing. I don’t have it at my fingertips at the moment, so I want to make sure there is a little bit of an asterisk by anything I say here.

With that said, we monitor 99th percentile ping time back and forth through our real-time messaging server from connected clients. We try to keep that under 100 milliseconds. That includes users connected throughout the world over crummy connections.
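To make that concrete, here is a minimal sketch of that kind of measurement, assuming a client-side probe over a WebSocket connection with a fixed sample window; the endpoint and the window size are placeholders, not Slack's actual setup. The probe sends ping frames, times the pongs, and reports the 99th percentile over a bounded sliding window.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.WebSocket;
    import java.nio.ByteBuffer;
    import java.util.concurrent.CompletionStage;
    import java.util.concurrent.ConcurrentLinkedDeque;

    // Hypothetical client-side probe: send a WebSocket ping every few seconds,
    // time the pong, and report the 99th-percentile round trip over a bounded
    // window of samples.
    public class PingProbe implements WebSocket.Listener {
        private final ConcurrentLinkedDeque<Long> rttNanos = new ConcurrentLinkedDeque<>();
        private volatile long lastPingSentNanos;

        public void sendPing(WebSocket ws) {
            lastPingSentNanos = System.nanoTime();
            ws.sendPing(ByteBuffer.allocate(0));
        }

        @Override
        public CompletionStage<?> onPong(WebSocket ws, ByteBuffer message) {
            rttNanos.add(System.nanoTime() - lastPingSentNanos);
            while (rttNanos.size() > 1_000) rttNanos.pollFirst();  // keep the window bounded
            ws.request(1);                                         // keep receiving frames
            return null;
        }

        public double p99Millis() {
            long[] sorted = rttNanos.stream().mapToLong(Long::longValue).sorted().toArray();
            if (sorted.length == 0) return 0;
            return sorted[(int) Math.ceil(0.99 * sorted.length) - 1] / 1_000_000.0;
        }

        public static void main(String[] args) throws Exception {
            PingProbe probe = new PingProbe();
            WebSocket ws = HttpClient.newHttpClient().newWebSocketBuilder()
                    .buildAsync(URI.create("wss://example.invalid/rtm"), probe)  // placeholder endpoint
                    .join();
            for (int i = 0; i < 10; i++) {
                probe.sendPing(ws);
                Thread.sleep(3_000);
            }
            System.out.printf("p99 RTT: %.1f ms%n", probe.p99Millis());
        }
    }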

Question: 
QCon: From a high level, what’s the architecture of Slack? I know it’s a LAMP stack, but I think you have a Java messaging layer, right?
Answer: 

Keith: There is a piece of Slack that resembles the previous generation of LAMP stack apps: the part that remembers your email address, your avatar, the name of your team, custom images, and so forth. That is recognizable. It is the LAMP stack app of people who have been doing this since the early days. It has had a lot of features, growth, and change, so it has had to adapt to a more complicated infrastructure than your usual LAMP stack app does, but it is recognizable as a LAMP stack app.

There is a whole chunk of the technology that doesn’t have any relationship to LAMP: the whole real-time part of it, which is a network of backend services written in Java. It started out with a single process (we called it the Message Server), a Java application that one of the founders (Serguei Mourachov) wrote. It was a WebSocket server. When you talked to the LAMP stack app, one of the things you would ask it was ‘Where is my WebSocket server?’, and it would give you back the hostname and port number you should use to talk to your team’s message server. It was that simple. That Java process acted like a message bus. You speak a WebSocket protocol that is documented on our website; it is the same protocol you see in the RTM (real-time messaging) section of api.slack.com.
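The shape of that two-step handshake looks roughly like the sketch below. This is an illustration rather than Slack's client code: it calls the public rtm.connect web API method documented on api.slack.com, reads a token from an environment variable, and uses a crude string match where a real client would use a JSON library.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.net.http.WebSocket;
    import java.util.concurrent.CompletionStage;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the two-step RTM handshake: ask the web API where the message
    // server is, then open a WebSocket to the URL it hands back.
    public class RtmHandshake {
        public static void main(String[] args) throws Exception {
            String token = System.getenv("SLACK_TOKEN"); // assumed to be set
            HttpClient http = HttpClient.newHttpClient();

            // Step 1: "Where is my WebSocket server?" (rtm.connect in the public API)
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("https://slack.com/api/rtm.connect?token=" + token)).build();
            String body = http.send(req, HttpResponse.BodyHandlers.ofString()).body();

            // Crude extraction of the "url" field; a real client would parse the JSON properly.
            Matcher m = Pattern.compile("\"url\"\\s*:\\s*\"([^\"]+)\"").matcher(body);
            if (!m.find()) throw new IllegalStateException("no WebSocket URL in response");
            String wsUrl = m.group(1).replace("\\/", "/");

            // Step 2: connect to the message server and print events as they arrive.
            http.newWebSocketBuilder().buildAsync(URI.create(wsUrl), new WebSocket.Listener() {
                @Override
                public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
                    System.out.println("event: " + data);
                    ws.request(1);
                    return null;
                }
            }).join();

            Thread.sleep(60_000); // listen for a minute, then exit
        }
    }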

That’s no longer a single machine doing this function. We are busy diversifying that into pieces of the infrastructure that are oriented towards presence. For instance, as groups get larger you start spending more and more communication bandwidth on just communicating who has a green dot next to their name and who doesn’t. 

We’re in the middle of a redesign of the message server architecture. We have goals related to reliability that we think the current architecture limits. 

A quick side note (I am going to try to be honest about this in the talk): Slack’s architecture is a moving target right now. We know that the architecture we are using today is going to change over the next couple of years. I’m going to be really open about these things.

Question: 
QCon: The way you are describing things, it sounds like you have got both a monolith and microservices. Is that fairly accurate?
Answer: 

Keith: I would say monolithic application logic that is in LAMP and then services. I wouldn’t go so far as to use the term microservices. There are services that I can count on my fingers and toes. In the LAMP stack, there are all these things that we don’t call services even though structurally they are. We don’t call MySQL a service because we don’t think of it that way, but we use MySQL to persist data.

BTW, I will never apologize for the choice to use MySQL to persist data. If a nuclear weapon hit the datacenter in Sweden, MySQL is what Facebook would go scrambling to restore from tape, and the same is true for us.

The world just has too many millions of years of server operation on MySQL without it losing data for the value proposition of these other stores to be plausible to me, at least so far.

Question: 
QCon: On that front, someone is inevitably going to say “Why not Cassandra? Why not NoSQL?”
Answer: 

Keith: MySQL is worth a mention, in a similar vein as PHP. It’s an old sword that has killed a lot of orcs. Other businesses might be more directly impacted by the returns of trying something new, but for us this is a relative commodity. The win from being on Cassandra or something like that would be a small, incremental constant factor over what MySQL can do right now.

In return for that small incremental improvement, we don’t know how to operate it any more. There are fewer books, blog posts, and operator years of experience that we can call on to use it. It just isn’t interesting. None of the things that it claims to do are interesting to us in a way that completely prevents us from being able to use MySQL right now. 

For example, we use master-master replication in MySQL. If you ask a lot of people what master-master replication is, they will look at you like you have two heads.

We do statement-based replication, and we actually have architected the app in such a way that the eventual consistency from writing to both sides of the replica is OK. It’s possible to have differences in the byte-wise representation of some rows, but they are not going to be semantically different. So every write goes to two places. It’s going to two places through MySQL’s SBR, and we do have two write heads available all of the time. 
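A minimal sketch of what two write heads can look like from the application's side, under assumed hostnames, schema, and failover policy (this is not Slack's code): the app writes to its preferred master and falls back to the other if that write fails, and statement-based replication carries each write to the opposite side, so the two copies converge even when their byte-wise row images differ slightly.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Sketch of application-side writes against a master-master pair. The app
    // tries its preferred write head first and retries against the other master
    // if that fails, which preserves write availability through one failure.
    public class DualMasterWriter {
        private static final String[] MASTERS = {
                "jdbc:mysql://db-master-a.example.internal/slack",   // illustrative hostnames
                "jdbc:mysql://db-master-b.example.internal/slack",
        };

        public void saveMessage(String channelId, String userId, String text) throws SQLException {
            SQLException lastFailure = null;
            for (String url : MASTERS) {                 // preferred master first, then the other
                try (Connection conn = DriverManager.getConnection(url, "app", "secret");
                     PreparedStatement stmt = conn.prepareStatement(
                             "INSERT INTO messages (channel_id, user_id, body, ts) VALUES (?, ?, ?, NOW())")) {
                    stmt.setString(1, channelId);
                    stmt.setString(2, userId);
                    stmt.setString(3, text);
                    stmt.executeUpdate();
                    return;                              // accepted by one head; SBR copies it to the other
                } catch (SQLException e) {
                    lastFailure = e;                     // this head is down or unreachable; try the other
                }
            }
            throw lastFailure;                           // both write heads failed
        }
    }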

There is some interesting stuff there, like which one do you back up? There are simple operations, such as reading the same row twice and expecting a byte-wise identical result, that might actually be subtly broken by this. We do a little bit of hygiene on this. Imagine you have the master-master pair there, basically writing back and forth to each other. The way that we rotate these things in and out of their lifecycle is that we decide, ‘Okay, you are a month old. We’re worried you have gotten stale somehow.’ So we attach something that will eventually be the new master to the other side of the master-master pair, set it up as a slave, and replicate to it. Once it has caught up to the master it is attached to, we cut off the old master, so it is like a little snaking tail that moves around. It’s an interesting little process that gives us write availability in the presence of one failure. We know how we’d go to three if we had to get to three. And we are not buying some fancy product that costs money; our customers would have to pay for that.
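That rotation could be orchestrated along the lines of the rough sketch below. The hostnames, credentials, and polling policy are assumptions, not Slack's tooling; real automation would also re-establish replication back toward the new box to restore the master-master pair and repoint client traffic.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Rough sketch of the rotation described above: attach a fresh box as a
    // replica of the surviving side of the pair, wait for it to catch up, then
    // retire the stale master.
    public class MasterRotation {

        void rotate(String survivingMaster, String staleMaster, String freshBox) throws Exception {
            // 1. Point the fresh box at the surviving master and start replicating.
            try (Connection fresh = connect(freshBox); Statement s = fresh.createStatement()) {
                s.execute("CHANGE MASTER TO MASTER_HOST='" + survivingMaster
                        + "', MASTER_USER='repl', MASTER_PASSWORD='secret'");
                s.execute("START SLAVE");
            }

            // 2. Wait until replication lag drops to zero.
            while (secondsBehindMaster(freshBox) > 0) Thread.sleep(5_000);

            // 3. Cut the stale master out of the pair; the fresh box is the new write head.
            try (Connection old = connect(staleMaster); Statement s = old.createStatement()) {
                s.execute("STOP SLAVE");
            }
        }

        long secondsBehindMaster(String host) throws SQLException {
            try (Connection c = connect(host); Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SHOW SLAVE STATUS")) {
                if (!rs.next()) return Long.MAX_VALUE;      // replication not configured yet
                long lag = rs.getLong("Seconds_Behind_Master");
                return rs.wasNull() ? Long.MAX_VALUE : lag; // NULL means the replication thread is not running
            }
        }

        Connection connect(String host) throws SQLException {
            return DriverManager.getConnection("jdbc:mysql://" + host + "/", "admin", "secret");
        }
    }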

Question: 
QCon: What do you want people that come to your talk to walk away with?
Answer: 

Keith: Honestly, I am partly there to sell Slack, or rather the problems we face at Slack. It’s the notion that high quality persistent group messaging is a more complicated problem than it seems on the surface. It’s probably not something you want to solve from scratch.

Speaker: Keith Adams

Chief Architect @Slack, previously @Facebook

Keith Adams is Chief Architect at Slack. Before Slack, he was an engineer at Facebook, where he contributed to search infrastructure, led work on the HipHop Virtual Machine, and helped start Facebook AI Research. He was also an early engineer at VMware. He holds an ScB in Computer Science from Brown University (2000).
