Conference: Nov 13-15, 2017
Workshops: Nov 16-17, 2017
Presentation: How Slack Works
Duration
Level:
- Intermediate
Persona:
- Architect
Key Takeaways
- Learn some of the practices that enable Slack to handle the challenges of synchronous large group communication.
- Hear practical approaches of how Slack leverages MySQL for eventual consistency.
- Gain an understanding of the problem space around high quality persistent group messaging.
Abstract
Slack is a persistent group messaging app for teams. Slack's 3.4 million active users expect high levels of reliability, low latency, and extraordinarily rich client experiences across a wide variety of devices and network conditions. In this talk, we'll take a tour of Slack's infrastructure, from native and web clients, through the edge, into the Slack datacenter, and around the various services that provide real-time messaging, search, voice calls, and custom emoji.
Interview
Keith: I think people see Slack as a site that serves a lot of people, has a lot of customers, and also as something that does real-time messaging. They feel like they have seen those things before. Real-time messaging is pretty crowded. There has been WhatsApp, Facebook, and Google serving a lot more users than us. All that is true, but what is not true is how those messages are served to those users. While Facebook and Google have real-time messaging products, the actual number of people in a given conversation (in a given virtual space) is usually quite small.
For example, you curate the list of recipients from most of those messaging products by hand. Even if there are group threads, those group threads usually have to be applied in one-sies and two-sies. Having a good, real-time conversation where thousands of participants are able to contribute, feel like they have some sense of virtual place, and are also able to consume the information is so far an unsolved problem on the synchronous side. People have done this with message boards and other asynchronous things where you have threads, replies, and comments (like Facebook wall posts). But doing that in real time is different in the same way that video conferencing is a different discipline than video streaming.
Slack’s challenges are around maintaining low latency and high reliability. Low latency, for example, is important to making the product actually feel good. This is mostly below the threshold of conscious perception, but, if you mess this up, the product feels bad. It’s hard to say why it feels bad. It just does. In essence, these are textual equivalents to stops and starts in bad video conferencing.
In a video or voice call, you may end up saying, ‘Oh, I was just going to say,’ or ‘No, you go ahead.’ It gets worse as you add participants, especially when participants have high latency.
In a nutshell, the challenge for Slack is synchronous group communication with large numbers of participants.
Keith: For the sake of the talk, I am going to gather some data that we are comfortable sharing. I don’t have it at my fingertips at the moment, so I want to make sure there is a little bit of an asterisk by anything I say here.
With that said, we monitor 99th percentile ping time back and forth through our real-time messaging server from connected clients. We try to keep that under 100 milliseconds. That includes users connected throughout the world over crummy connections.
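As a rough illustration of the kind of check described above, the sketch below computes a nearest-rank 99th percentile over a batch of round-trip ping samples and compares it to a 100 millisecond budget. The function names and the nearest-rank method are assumptions made for this example; the interview does not say how Slack computes or aggregates the figure.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so pct=0 still returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(ping_ms, budget_ms=100.0):
    """True when the 99th percentile ping time is under the budget.
    The 100 ms default mirrors the target mentioned in the interview."""
    return percentile(ping_ms, 99) < budget_ms
```

With 100 samples, a single slow ping does not move the 99th percentile, but two slow pings do, which is why a tail percentile is a stricter target than an average.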
Keith: There is this piece of Slack that resembles the previous generation LAMP stack apps: the part of Slack that remembers your email address, your avatar, the name of your team, custom images and so forth. That is recognizable. It is the LAMP stack app of people who have been doing this since the early days. It has had a lot of features, growth, and change. So it’s had to adapt to a more complicated infrastructure than your usual LAMP stack app does, but it’s recognizable as a LAMP stack app.
There is a whole chunk of the technology that doesn’t have any relationship to LAMP: the whole real-time part of it, which connects with a network of services. It is a network of backends that are in Java. It started out with a single process (the Message Server, we called it) that was a Java application that one of the founders (Serguei Mourachov) wrote. It was a WebSocket server. When you talk to the LAMP stack app, one of the things you would ask it was ‘Where is my WebSocket server?’, and it would give you back some hostname and port number that you should go to talk to your team’s message server. It was that simple. That Java process acted like a message bus. You speak a WebSocket protocol that is documented on our website. It is the same protocol you use if you look at api.slack.com (under the RTM, or real-time messaging, section).
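The lookup step described above can be sketched as follows. This is a minimal sketch under stated assumptions: the JSON shape (an `ok` flag and a `url` field), the example hostname, and the function name are hypothetical and chosen for illustration. The interview only says that the web app hands back a hostname and port for the team's message server.

```python
import json
from urllib.parse import urlsplit


def message_server_endpoint(response_body):
    """Extract (hostname, port) from a hypothetical lookup response.

    Assumed response shape: {"ok": true, "url": "wss://host:port/path"}.
    """
    payload = json.loads(response_body)
    if not payload.get("ok"):
        raise RuntimeError("lookup failed: %r" % payload.get("error"))
    parts = urlsplit(payload["url"])
    # Fall back to the conventional port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "wss" else 80)
    return parts.hostname, port
```

A client would then open its WebSocket connection to the returned host and port and speak the real-time messaging protocol over it.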
That’s no longer a single machine doing this function. We are busy diversifying that into pieces of the infrastructure that are oriented towards presence. For instance, as groups get larger you start spending more and more communication bandwidth on just communicating who has a green dot next to their name and who doesn’t.
We’re in the middle of a redesign of the message server architecture. We have goals related to reliability that we think the current architecture limits.
A quick side note (I am going to try and be honest about this in the talk), Slack’s architecture is a moving target right now. We know that the architecture we are using today is going to change over the next couple of years. I’m going to be really open about these things.
Keith: I would say monolithic application logic that is in LAMP and then services. I wouldn’t go so far as to use the term microservices. There are services that I can count on my fingers and toes. In the LAMP stack, there are all these things that we don’t call services even though structurally they are. We don’t call MySQL a service because we don’t think of it that way, but we use MySQL to persist data.
BTW, I will never apologize for the choice to use MySQL to persist data. If a nuclear weapon hit the datacenter in Sweden, MySQL is what Facebook would go scrambling to tape for and the same is true for us.
The world just has too many millions of years of server operation on MySQL without it losing data for the value proposition of these other stores to be plausible to me, at least so far.
Keith: MySQL is worth a mention, in a similar vein as PHP. It’s an old sword that has killed a lot of orcs. Other businesses might be more directly impacted by the returns of trying something new. But for us, this is a relative commodity thing. The win from being on Cassandra or something like that would be an incremental small constant factor over what MySQL can do right now.
In return for that small incremental improvement, we don’t know how to operate it any more. There are fewer books, blog posts, and operator years of experience that we can call on to use it. It just isn’t interesting. None of the things that it claims to do are interesting to us in a way that completely prevents us from being able to use MySQL right now.
For example, we use master-master replication in MySQL. If you ask a lot of people what master-master replication is, they will look at you like you have two heads.
We do statement-based replication, and we actually have architected the app in such a way that the eventual consistency from writing to both sides of the replica is OK. It’s possible to have differences in the byte-wise representation of some rows, but they are not going to be semantically different. So every write goes to two places. It’s going to two places through MySQL’s SBR, and we do have two write heads available all of the time.
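One common way to make writes safe to apply on both sides in either order is to give every write a version and keep only the newest one per key. The sketch below illustrates that idea with a last-writer-wins upsert. It is an assumption for illustration only: the interview does not say which technique Slack's application uses, and the `(timestamp, origin)` version tuple is a hypothetical choice.

```python
def apply_write(table, write):
    """Apply one write as a last-writer-wins upsert.

    write is (key, value, version), where version is a comparable
    (timestamp, origin) tuple. Older or duplicate versions are ignored,
    so the operation is idempotent and order-independent.
    """
    key, value, version = write
    current = table.get(key)
    if current is None or version > current[1]:
        table[key] = (value, version)


def replay(writes):
    """Build a table by applying writes in the order given."""
    table = {}
    for write in writes:
        apply_write(table, write)
    return table


writes = [
    ("channel_topic", "lunch plans", (1, "side-a")),
    ("channel_topic", "release plans", (2, "side-b")),
    ("team_name", "infra", (1, "side-a")),
]
# Each side may see the statements in a different order.
side_a = replay(writes)
side_b = replay(list(reversed(writes)))
```

Both sides end up with the same rows even though they applied the writes in opposite orders, which is the property that makes two write heads tolerable.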
There is some interesting stuff there because which one do you back up? There are simple tasks you might write, such as reading the same row twice in a row and comparing the results byte-wise, that might actually be subtly broken by this. We do a little bit of hygiene on this.
Imagine you have got the master-master pair in there. They are basically writing back and forth to each other. The way that we rotate these things in and out of their lifecycle is that we decide, ‘Okay, you are a month old. We’re worried you have gotten stale somehow.’ So we will attach something that will eventually be a new master to the other side of the master-master pair and then set it up as a slave and replicate to it. Once it has caught up to the master it is attached to, we cut off the old master, and it is like a little snaking tail that moves around. It’s an interesting little process that gives us write availability in the presence of one failure. We know how we’d go to three if we had to get to three. We are not buying some fancy product that costs money. Our customers would have to pay for that.
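The rotation described above can be modelled as a small state change on a pair of hosts. The sketch below is a toy model, not Slack's tooling: the `Host` class, the `applied` replication position, and the 30 day threshold are all assumptions made to show the order of the steps.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    age_days: int
    applied: int = 0  # hypothetical replication position


def rotate(pair, fresh, max_age_days=30):
    """Replace the older member of a two-master pair with a fresh host.

    Steps: pick the older master, attach the fresh host as a replica of
    the other master, wait until it has caught up, then cut off the old
    master. Returns the new pair, or the original pair if no rotation
    is due.
    """
    old, keep = sorted(pair, key=lambda h: h.age_days, reverse=True)
    if old.age_days < max_age_days:
        return pair
    # Stand-in for replication: copy the kept master's position.
    fresh.applied = keep.applied
    if fresh.applied != keep.applied:
        raise RuntimeError("replica has not caught up; keep the old master")
    return (keep, fresh)
```

Calling `rotate` repeatedly as hosts age always retires the older member, which gives the "snaking tail" effect while the pair keeps two writable masters throughout.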
Keith: Honestly, I am partly there to sell Slack or the problems we face at Slack. It’s the notion that high quality persistent group messaging is a more complicated problem than it seems on the surface. It’s probably not something you want to solve from scratch.
Tracks
Monday Nov 7
-
Architectures You've Always Wondered About
You know the names. Now learn lessons from their architectures
-
Distributed Systems War Stories
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” - Lamport.
-
Containers Everywhere
State of the art in Container deployment, management, scheduling
-
Art of Relevancy and Recommendations
Lessons on the adoption of practical, real-world machine learning practices. AI & Deep learning explored.
-
Next Generation Web Standards, Frameworks, and Techniques
JavaScript, HTML5, WASM, and more... innovations targeting the browser
-
Optimize You
Keeping life in balance is a challenge. Learn lifehacks, tips, & techniques for success.
Tuesday Nov 8
-
Next Generation Microservices
What will microservices look like in 3 years? What if we could start over?
-
Java: Are You Ready for This?
Real world lessons & prepping for JDK9. Reactive code in Java today, Performance/Optimization, Where Unsafe is heading, & JVM compile interface.
-
Big Data Meets the Cloud
Overviews and lessons learned from companies that have implemented their Big Data use-cases in the Cloud
-
Evolving DevOps
Lessons/stories on optimizing the deployment pipeline
-
Software Engineering Softskills
Great engineers do more than code. Learn their secrets and level up.
-
Modern CS in the Real World
Applied, practical, & real-world dive into industry adoption of modern CS ideas
Wednesday Nov 9
-
Architecting for Failure
Your system will fail. Take control before it takes you with it.
-
Stream Processing
Stream Processing, Near-Real Time Processing
-
Bare Metal Performance
Native languages, kernel bypass, tooling - make the most of your hardware
-
Culture as a Differentiator
The why and how for building successful engineering cultures
-
//TODO: Security <-- fix this
Building security from the start. Stories, lessons, and innovations advancing the field of software security.
-
UX Reimagined
Bots, virtual reality, voice, and new thought processes around design. The track explores the current art of the possible in UX and lessons from early adoption.