**************************************** The following talk transcript was recorded at QConSF 2017. The text and its copyright remain the sole property of the speaker. ****************************************

>> We're going to start in about one minute.

>> Everyone in a food coma? (Laughter). All right, welcome back. Next up is a topic that is near and dear to my heart, the architecture of Reddit, given by our infrastructure architect, Neil Williams. (Applause).

>> Hello, hello. A quick point of order: I was asked to tell everyone that there are cards on your seats, please fill them in and turn them in at the end of the talk. I will get started. If you have used Reddit before, you have probably seen this a few times. I am hoping you will see it less and less, but you have definitely seen it, and this talk will hopefully help you understand why. If you have not used Reddit, then how about a quick explanation? Reddit is the frontpage of the internet, it is a community hub, and it is a place for people to talk about everything that they are interested in. But, more importantly for this topic, Reddit is a really big website. We are currently the 4th largest in the U.S., according to Alexa, and we serve 320 million users every month, doing all sorts of stuff, like posting a million times a day and casting 75 million votes. It kind of adds up.

So, let's dig into what the site looks like. This is a very high-level overview of the architecture of Reddit, focused only on the parts involved in the core experience of the site. So I'm leaving out some really interesting stuff, like all of our data analysis and the ad stack, all that kind of thing. But this is the core of the Reddit experience. The other thing to know about this diagram is that it is very much a work in progress. I made a diagram like this a year ago, and it looked nothing like this. And this tells you a whole lot about our engineering organization, as much as it tells you about the tech that we use. So that's actually really interesting here.

In the middle here, the giant blob, is R2. That's the original monolithic application that is Reddit and has been Reddit since 2008. That's a big Python blob, and we will talk about it in a bit more detail. The front-end engineers at Reddit got tired of the pretty outdated stuff we have in R2, so they are building out these modern front-end applications. These are all in Node and they share code between the server and the client. They each act as an API client themselves, so they talk to APIs that are provided by our API gateway, or by R2 itself, and they act just like your mobile phone or whatever other API client is out there. We are also starting to split up R2 into various back-end services, and these are all highlighted here. The core thing here is that they are kind of the focus of individual teams. So you can imagine that there's an API team, there's a listing team, there's a Thing team. I will explain a little bit more what those mean later. They are written in Python, which has helped us in splitting stuff out of the existing Python monolith, and they are built on a common library that allows us to not reinvent the wheel every time we do this, and it comes with monitoring and tracing and that kind of stuff built in.
On the back-end, we use Thrift; Thrift gives us nice, strong schemas. And we allow for HTTP on the front-end for the API gateway so we can still talk to them from the outside. Finally, we have the CDN in front, which is Fastly, and if you saw the talk earlier, they do some pretty cool stuff. One thing that we use it for is being able to do a lot of decision logic out at the edge and figure out which stack we're going to send that request to, based on the domain that is coming in, the path on the site, and any of the cookies that the user has, including, perhaps, experiment bucketing. So that's how we can have all of these multiple stacks that we're starting to split out, and still have one Reddit.com.

So, since R2 is the big blob, and it is complicated and old, we will dig into its details. The giant monolith here is a very complicated beast with its own weird diagram. We run the same code on every one of the servers here; it is monolithic, and each server might run different parts of that code, but the same stuff is deployed everywhere. The load balancer is in the front -- we use HAProxy -- and the point of that is to take the request that the user has made and split it into various pools of application servers. We do that to isolate different kinds of request paths, so that if, say, a comments page is going slow today because of something going on, it doesn't affect the front page for other people. That has been very useful for us for containing these weird issues that happen.

We also do a lot of expensive operations when a user does things like vote or submit a link, and we defer those to an asynchronous job queue via RabbitMQ: we put the message in the queue and the processors handle it later, usually pretty quickly. In the memcache and Postgres section, we have a core data model I will talk about, Thing, and that is what you would consider most of the guts of Reddit. Accounts, links, subreddits, comments -- all of that is stored in this data model, called Thing, which is based in Postgres with memcache in front of it. And finally, we use Cassandra very heavily. It has been in the stack for seven years now, it is used for a lot of the new features ever since it came on board, and it has been very nice for its ability to stay up with one node going down, that kind of thing.

Cool. So, that was a bit about the structure of the site itself. Let's talk about how some of the parts of the site work, starting with listings. A listing is kind of the foundation of Reddit. It is an ordered list of links. You could naïvely think about this as selecting links out of the database with a sort. You will see these on the front page, in subreddits, etc. The way that we do it is not actually by running that select on every request. Instead, initially, what would happen is the select would run once and be cached as a list of IDs in memcache. That way you can fetch the list of IDs easily, and you can look up the links by primary key, which is very easy as well. So that was a nice system, and it worked great. Those listings needed to be invalidated whenever you change something in them, which happens when you submit something but, most frequently, happens when you vote on something. So the vote queues are something that update the listings very frequently. We also have to do some other stuff in the vote processors, such as anti-cheat processing. So, it turns out that running that select query, even occasionally when you invalidate it, is expensive.
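To make that concrete, here is a minimal sketch of the original cached-listing approach. The key format and the db helper methods are hypothetical, not Reddit's actual code: the sorted select runs only on a cache miss, its result is stored as a plain list of IDs, and rendering becomes cheap primary-key lookups.

```python
# A minimal sketch of the cached-listing idea described above.
# The key format and the db helper methods are hypothetical.
import pylibmc  # assuming a memcache client such as pylibmc

cache = pylibmc.Client(["127.0.0.1"])

def get_hot_listing(subreddit_id, db):
    key = "listing:hot:%s" % subreddit_id
    link_ids = cache.get(key)
    if link_ids is None:
        # Cache miss: run the expensive sorted SELECT once and cache only the IDs.
        link_ids = db.select_hot_link_ids(subreddit_id, limit=100)
        cache.set(key, link_ids)
    # The links themselves are cheap to fetch by primary key (and are cached too).
    return [db.get_link_by_id(link_id) for link_id in link_ids]
```

Submitting or voting invalidates that cached list, which is why the vote queues end up touching listings so often.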
So when you do something like voting, you have all the information you need to go and update that cached listing; you don't really need to re-run the query. What we do instead is store not just the ID, but the ID paired with the sort information related to that thing. Then, when we do something like process a vote, we fetch down the current cached listing, we modify it -- in this example you will see that we vote up on link 125, which moves it up in that list and changes the score it has in that list -- and then we write it back. That is kind of an interesting read/mutate/write operation, which has the potential for race conditions, so we lock around it. You will notice that once we are doing that, we are not actually running the queries anymore. It is not really a cache anymore; it is its own first-class thing that we are storing. It is a persisted index, really, a denormalized index. At that point, they started being stored elsewhere -- originally in other things, and nowadays in Cassandra.

So, I'm going to talk a little bit about something that went wrong. I mentioned that the queues usually process pretty quickly. Not always. Back in the middle of 2012, we started seeing that the vote queues would get really backed up in the middle of the day, particularly at peak traffic when everybody is around. This would delay the processing of those votes, which is visible to users on the site, because a submission would not get its score updated properly: it would be sitting on the front page, but its score is going up very slowly, and it should be much higher than it is, and it would only catch up many hours later when the queue processed. So, why not add more scale, more processors? That made it worse. We had to dig in. We didn't have great observability at the time. We couldn't figure out what was going on; we saw the whole processing time for a vote was longer than before but, beyond that, who knows. So we started adding a bunch of timers, and when we narrowed it down, we realized it was the lock I mentioned that was causing the problems. The very popular subreddits on the site are getting a lot of votes -- it makes sense, they are popular. When you have a bunch of votes happening at the same time, you are trying to update the same listing from a bunch of votes at the same time, and they are all just waiting on the lock. So adding more processors just added more waiters on the lock and actually didn't help at all.

So what we did to fix this is we partitioned the vote queues. This is really just dead simple. We took the subreddit ID of the link being voted on, did, like, modulo 10 on it, and put the vote into one of 10 different queues. So if you are voting on a link in one subreddit, you are going to vote queue 1, and in another subreddit, vote queue 7. We had the same total number of processors in the end, but they are divided into different partitions and there are fewer vying for any single lock at the same time. It worked really well, smooth sailing forever. (Laughter). Not really.

So, late 2012, just a few months of respite, and we saw the vote queues slowing down again. The lock contention and processing time looked okay on average, but then we looked at the P99s, the 99th percentile of the timers, and we saw there were some votes going really poorly, and that was interesting. So we had to dig in and just start putting print statements in to see what was going on when a vote took over a certain amount of time. Pretty dumb, but it worked. (Laughter).
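Stepping back for a moment, the two mechanisms just described -- mutating the denormalized listing in place under a lock, and partitioning vote messages by subreddit ID -- might look roughly like this sketch. The lock service, cache client, and queue names are hypothetical:

```python
# Sketch only: a vote mutates the cached (ID, score) pairs under a lock,
# and votes are routed to one of 10 queues by subreddit ID modulo 10.
NUM_VOTE_QUEUES = 10

def queue_for_vote(subreddit_id):
    # Dead-simple partitioning so fewer processors contend for the same lock.
    return "vote_queue_%d" % (subreddit_id % NUM_VOTE_QUEUES)

def apply_vote_to_listing(cache, locks, listing_key, link_id, new_score):
    # Read/mutate/write on the persisted listing; locked because two vote
    # processors could otherwise race on the same listing.
    with locks.acquire("lock:" + listing_key):
        entries = cache.get(listing_key) or []                 # [(link_id, score), ...]
        entries = [(lid, s) for (lid, s) in entries if lid != link_id]
        entries.append((link_id, new_score))
        entries.sort(key=lambda pair: pair[1], reverse=True)   # keep the sort order
        cache.set(listing_key, entries)
```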
And what we found is that there was a domain listing. We have listings on the site for all of the links that are submitted to a given domain -- sorry, the dangers of a touch screen. And the domain listing was the point of contention now. So, all of these votes were partitioned and not vying for the same subreddit lock, but they were all still vying for the same domain listing. This was not great, and it was causing a lot of issues. So, long story short, much later we comprehensively fixed this by splitting up the job altogether. Instead of processing one vote in an entire single job processor, we now do a little bit of up-front stuff, and then we make a bunch of messages that deal with different parts of the job. Those all work on the right partitions for themselves, so they are not vying across partitions.

The interesting stuff here is that you really need to have timers in your code, nice granular timers, and they also give you a cross-section. You get a lot of info from your P99s, and tracing, or some other way of getting info out of the P99 cases, is really important as well for figuring out what's going on in the weird cases. This is also kind of obvious, but locks are really bad news for throughput. If you have to use them, then you should be partitioning on the right thing. And so, going forward, we've got some new data models that we're trying out for storing those cached queries in a lockless way. It is pretty interesting, and it has been promising so far, but we haven't committed fully to it yet. And, more importantly, we're starting to split out listings altogether. We have this listing service, and the goal of it -- we have a team working on it -- is to make relevant listings for users. The relevant listings can come from all sorts of sources, which includes the data analysis pipeline, machine learning, and these normal old listings like the rest of the site. So that's kind of the future here, where it is extracted out into its own service and R2 doesn't even need to know where a listing is coming from anymore.

Cool. So you have listings of things, but what about the things themselves? As I said earlier, Thing is in Postgres and memcache. This is the oldest data model in R2, and it is a pretty interesting data model. (Laughter). I'm getting some smiles. The thing to know about it is that it is designed to take away certain pain points, and also to make it so you can't accidentally do something like expensive joins. It is vaguely schemaless and very key-value. There's one Thing type per noun on the site, like a subreddit Thing, and each is represented by a pair of tables in Postgres. The Thing table looks like this -- it is abbreviated -- and the idea is that there is one row for each Thing, or object, that exists. There is a set of fixed columns there, which covered everything that Reddit needed in the original days to do the basic select queries to run the site. So that's all the stuff that you would sort and filter on to make a listing back in the day. The data table, however, has many rows per Thing object. They each have a key and a value, and this makes up kind of a bag of properties for that Thing. This has been pretty neat for Reddit in terms of the ability to make changes and new additions to the site without having to go in and alter a table in production. It is cool in that way, but there are a lot of performance issues. So it is interesting.
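To make the Thing/data pair more concrete, here is a hypothetical sketch of what such a table pair and a read path could look like. The column names and the db helper functions are illustrative only, not Reddit's actual schema:

```python
# Illustrative only: one fixed-column row per object in the "thing" table,
# plus a bag of key/value rows in the companion "data" table.
THING_SCHEMA = """
CREATE TABLE thing_link (
    thing_id  BIGINT PRIMARY KEY,
    ups       INTEGER NOT NULL,
    downs     INTEGER NOT NULL,
    deleted   BOOLEAN NOT NULL,
    spam      BOOLEAN NOT NULL,
    date      TIMESTAMP NOT NULL
);
CREATE TABLE data_link (
    thing_id  BIGINT NOT NULL,
    key       TEXT NOT NULL,
    value     TEXT NOT NULL,   -- everything stored as text, hence "vaguely schemaless"
    PRIMARY KEY (thing_id, key)
);
"""

def load_link(db, thing_id):
    # Hypothetical db helpers: one row of fixed columns, many key/value rows,
    # merged into a single property bag for the application.
    fixed = db.fetch_one("SELECT * FROM thing_link WHERE thing_id = %s", thing_id)
    props = dict(db.fetch_all(
        "SELECT key, value FROM data_link WHERE thing_id = %s", thing_id))
    return {**fixed, **props}
```

The fixed columns are the ones you sort and filter on; everything else lives in the key/value bag, which is why adding a new property doesn't require altering a table in production.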
Thing in Postgres is done as a set of tables living in a single database cluster. Each primary in the database cluster handles writes, and then we have a number of read-only replicas which we replicate to asynchronously. R2 connects to the databases and prefers to use the replicas for read operations so that they scale out better. At the time, it would also do this thing where, if a query failed, it would guess that the server was down and try not to use it again in the future. Thing also works with memcache, which reduces load on the read-only replicas. The object is serialized and popped into memcache, R2 reads from memcache first and hits the database on a miss, and we write to memcache when making a change, rather than deleting the key and allowing it to be re-populated on the next read.

So in 2011, we were waking up a lot with these errors. We would wake up suddenly with an alert saying that the replication to one of the secondaries had crashed. This meant that that database was getting more and more out of date as time went on, and we had to fix it. The immediate thing to do is to take it out of replication and out of usage on the site, start rebuilding it, and go back to bed. But then, the next day when we woke up, we started seeing that some of the cached listings were referring to items that didn't exist in Postgres, which is a little terrifying. You can see the cached listing here has 1, 2, 3, and 4, but the Thing table only has 1, 2, and 4. Not a great thing to see. This caused the pages that needed that data to crash: they were looking for the data, it wasn't there, and the page just died. We built a lot of tooling at the time to clean up the listings, to clean up the bad data that shouldn't be there and remove it from them. This was obviously really painful, and we were looking into a lot of things going on there. And there were not a whole lot of us, either -- about five people at the time.

The issue started with a primary saturating its disk; it was running out of IOPS and something was going slow. What did we do about that? Got beefier hardware. That's pretty good, right? (Laughter). Problem solved. Not really. So, a few months later, everything had been nice and quiet for a while, but we were doing a pretty routine maintenance and accidentally bumped the primary offline, and suddenly we see the replication alert firing. Looking at the logs from the application at the time, a lightbulb went off. I mentioned that R2 will try to remove a dead database from the connection pools. The code looks like this, very pseudo-code-y: we have in configuration a list of databases, and we consider the first in the list to be the primary. So when we are going to decide which database to use for a query, we take the list of databases, we filter it down to the ones that are alive, we take the first one off that list as the primary and the rest as the secondaries, and then we choose, based on the query type, which server to use and go for it. There's a bug there. What happens when it thinks the primary is down? It takes the primary out of the list, and now we are using a secondary as the first item in the list, and we try writing to a secondary. Well, we didn't have proper permissions set up, so that worked. You can write to the secondary. (Laughter). Oops. (Laughter). So, you wrote to the secondary, it created the thing, we write it to the cached listing, all is good. Then we take the secondary out and rebuild it, and the data is gone. Not good. So, yeah, use permissions.
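In rough pseudo-code, that selection logic and its bug might look like this; the names and structure are hypothetical:

```python
# A sketch of the buggy database-selection logic described above.
import random

DATABASES = ["db-primary", "db-replica-1", "db-replica-2"]  # first entry assumed primary

def choose_database(is_write, is_alive):
    alive = [db for db in DATABASES if is_alive(db)]
    # BUG: the "alive" filter runs before the primary assumption is applied.
    # If the real primary is marked dead, a read-only replica silently becomes
    # the "primary" here, and writes go to it (permissions permitting).
    primary, secondaries = alive[0], alive[1:]
    if is_write:
        return primary
    return random.choice(secondaries) if secondaries else primary
```

With proper permissions on the replicas, that misdirected write would have failed loudly instead of silently creating data that later disappeared.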
They are really useful. (Laughter). They are not just there to annoy people. They are very helpful. (Laughter). If you do denormalize things like the cached listings, it is really important to have tooling for healing them. And, going forward, some changes we're making: new services use our service discovery system, which is pretty standard across our whole stack, to find databases. So they don't have to implement all this logic themselves, which helps with reducing complexity and makes it a battle-tested component. And finally, we're starting to move that whole Thing model out into its own service. The initial reason for this was that we had other new services coming online, and they wanted to know about the core data that's in Reddit. So we had this service that started out by just being able to read the data, and now it is starting to take over the writes as well and take it all out of R2. A huge upside to this is that all of that code in R2 had a lot of legacy and a lot of weird, twisted ways it had been used. So by pulling it out and doing this exercise, it is now going to be separate and clean and something that we can completely re-think as necessary.

Cool. So, another major thing. I said that Reddit is a place for people to talk about stuff. They do that in comment trees. An important thing to know about comments on Reddit is that they are threaded. This means that replies are nested so you can see the structure of a conversation. You can also link deep within that structure. This makes it a bit more complicated to render these trees. It is pretty expensive to go and say, okay, there are 10,000 comments on this thread, I need to look up all 10,000 comments, find out what their parents are, etc. So we store -- hey, another denormalized listing -- the parent relationships of that whole tree in one place, so we can figure out ahead of time, okay, this is the subset of comments we're going to show right now, and only look at those comments. This is also kind of expensive to do, so we defer it to offline job processing. One advantage in the comment tree stuff is that we can batch up messages and mutate the trees in one big batch, which allows for more efficient operations on them. A more important thing to note is that the tree structure is sensitive to ordering. If I insert a comment and its parent doesn't exist, that is weird in a tree structure, right? So that needs to be watched out for, because things happen, and the system had some stuff to try and heal itself in that situation, where it will re-compute the tree, or try to fix it up.

The processing for that also has an issue: sometimes you end up with a megathread on the site -- some news event is happening, the Super Bowl is happening, whatever. People like to comment. You have 50,000 comments in one thread, and that thread is now processing pretty slowly, and that is affecting the rest of the site. So we developed a thing that allows us to manually mark a thread and say, this thread gets dedicated processing. It just goes off to its own queue, called the fast lane. Well, that caused issues. (Laughter). In early 2016, there was a major news event happening -- pretty sad stuff -- and a lot of people were talking about it. The thread was making the processing of comments pretty slow on the site, and so we fast-laned it. And then everything died.
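As an aside, the two mechanisms described above -- a denormalized parent map mutated in batches by an offline job, and routing a manually flagged megathread to its own fast-lane queue -- might look roughly like this sketch, with entirely hypothetical names:

```python
# Sketch only: comment trees as a denormalized {comment_id: parent_id} map.
def apply_comment_batch(tree_store, thread_id, new_comments):
    # new_comments: list of (comment_id, parent_id) pairs pulled off the queue.
    tree = tree_store.get(thread_id)
    for comment_id, parent_id in new_comments:
        if parent_id is not None and parent_id not in tree:
            # Out-of-order insert: the parent hasn't been processed yet, so
            # ask for the whole tree to be recomputed from the source data.
            return tree_store.schedule_recompute(thread_id)
        tree[comment_id] = parent_id
    tree_store.set(thread_id, tree)

def queue_for_thread(thread_id, fastlaned_threads):
    # Manually flagged megathreads get their own dedicated processing queue.
    return "comments_fastlane" if thread_id in fastlaned_threads else "comments_default"
```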
Basically, what happened immediately then was that the fast lane queue started filling up with messages very, very quickly, and rapidly filled up all of the memory on the message broker. This meant that we couldn't add any new messages anymore, which is not great. In the end, the only thing that we could do was restart the message broker, and lose all those messages, to get back to steady state. So this meant that all of the other actions on the site that are deferred to queues were also messed up. What turned out to be the cause was that stuff for dealing with a missing parent. When we fast-laned the thread, the earlier comments in the thread that had happened before the fast-laning were still sitting in the other queue, not yet processed. The new messages in the fast lane skipped ahead of them, the site recognized that the tree was inconsistent, and every page view of that thread was putting a new message onto the queue: please recompute me, I'm broken. Not good. So we restarted Reddit, everything went back to normal, and now we use queue quotas, which are very nice. Resource limits allow you to prevent one thing from hogging all of your resources, and it is important to use things like quotas. If we had turned on quotas at the time, the fast lane queue would have started dropping messages, which would have meant that thread was more inconsistent, or comments wouldn't show up there for a little while, but the rest of the site would have kept working.

Cool, all right. This is a little more meta. We have a bunch of servers to power all of this stuff, and we need to scale them up and down. This is kind of what traffic to Reddit looks like over the course of a week. It is definitely very seasonal, so you can see that it is about half at night what it is in the day. There are a couple of weird humps there for different time zones, but it is pretty consistent overall. So what we want to do with the autoscaler is to save money off-peak and deal with situations where we need to scale up because something crazy is going on. It does a pretty good job of that. What it does is watch the utilization metrics reported by our load balancers, and it automatically increases or decreases the number of servers that we are requesting from AWS. We offloaded the actual logic of terminating instances in AWS to auto-scaling groups, and that works pretty well. The way the autoscaler knows what is going on out there is that each host has a daemon on it that registers its existence in a ZooKeeper cluster. This is a rudimentary health check. And, of note, we were also using this system for memcache. It was not really auto-scaled in the sense that we were ever going up and down, but it was there so that if a server died, it would be replaced automatically.

So, in mid-2016, something went pretty wrong with the autoscaler that taught us a lot. We were in the midst of making our final migration from EC2 Classic into VPC. It had a huge number of benefits for us in networking and security and that kind of stuff, and the final component to move was the ZooKeeper cluster. This was the ZooKeeper cluster being used for the autoscaler, so it was pretty important.
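A rough sketch of that utilization-driven scaling decision is below; the thresholds, the metric source, and the auto-scaling group name are all hypothetical, and the actual launching and terminating of instances is delegated to AWS auto-scaling groups:

```python
# Sketch of an autoscaler loop body: watch a utilization metric from the load
# balancers and adjust the desired capacity of an AWS auto-scaling group.
import boto3

asg = boto3.client("autoscaling")

def adjust_capacity(group_name, utilization, current_capacity):
    # utilization: e.g. an average busy-worker ratio reported by the load balancers.
    desired = current_capacity
    if utilization > 0.75:        # scale up when the fleet is running hot
        desired = current_capacity + max(1, current_capacity // 10)
    elif utilization < 0.40:      # scale down off-peak to save money
        desired = max(1, current_capacity - 1)
    if desired != current_capacity:
        asg.set_desired_capacity(
            AutoScalingGroupName=group_name, DesiredCapacity=desired)
    return desired
```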
The plan for this migration was to launch the new cluster into VPC, stop the autoscaler services so they don't mess with anything during the migration, re-point the autoscaler agents on each of the servers out in the fleet at the new ZooKeeper cluster, then re-point the autoscaler services themselves, restart them, and act like nothing ever happened. What actually happened was a little different, unfortunately. We launched the new ZooKeeper cluster: working great. We stopped the autoscaler services: cool. We started re-pointing things, and we got about a third of the way through that when suddenly about a third of our servers got terminated. It took us a moment to realize why, and then we all facepalmed rather heavily. (Laughter).

What happened is that Puppet was running on the autoscaler server, and when it did its half-hourly run, it decided to restart the autoscaler daemons. They were still pointing at the old cluster, so they saw all of the servers that had migrated to the new cluster as being unhealthy, and terminated them. (Laughter). Very helpful, thanks autoscaler. (Laughter). And so, what can we do about that? Well, just wait a minute, the autoscaler will bring a whole bunch of new servers up. That was relatively easy, except that the cache servers were also managed by the same system. They have state, and the new ones that come up don't have that state. So what this meant was that suddenly all of the production traffic was hammering the Postgres replicas, which could not handle it; they were not used to that many caches being gone at the same time. Yeah. (Laughter). Pretty reasonable. (Laughter).

So what we learned from that was that things that do destructive actions, like terminating servers, should have sanity checks in them. Am I terminating a large percentage of the servers? I should not do that. We needed to improve the process around the migration itself, and having a peer-reviewed checklist to make sure that we had extra layers of defense there would have been really helpful. The other important thing is that stateful services are pretty different from stateless services, and you should treat them differently. Using the same autoscaler technology for both is probably not the best idea. The next-gen autoscaler we are building has a lot of this stuff. It has tooling to recognize that it is affecting a larger number of servers than it should and stop itself, and it uses our service discovery system to determine health instead of just an agent on the boxes. So we actually get a little more fine-grained detail, like: is the actual service okay, not just the host? It will refuse to take action on large numbers of servers at once.

Cool! So, in summary, observability is key here. People make mistakes, so use multiple layers of safeguards. Like, if we'd turned off the autoscaler and also put an exit command at the top of the script, nothing would have happened when Puppet brought it back up. Little extra things like that mean no single failure can cause new issues. And it is really important to make sure that your system is simple and easy to understand. Cool, thank you. So this is just the beginning for us. We are building a ton of stuff, and hiring. And also, on Thursday, the entire infraops team, many of whom are right there, are doing an AMA if you would like to ask some questions. (Applause).

>> Questions?
>> You mentioned the monolithic application, and that you are removing the data model from it. Is there any reason to split up the entire thing instead of just removing the heart of the application?

>> Um, yeah. There absolutely is -- as I mentioned, the services that we are spinning out are kind of based on our organizational structure. So, for example, you didn't see a comments service, because we don't really have a team that is focused on comments right now. The Thing service exists because the other services existed and it needed to exist for them. And the other services exist because there are teams that want autonomy.

>> You mentioned that you are using a queue. How is that working for your scale?

>> Rabbit has treated us pretty decently. We are running a single host, which is pretty interesting. You wouldn't notice immediately if that went away -- it is not a take-the-site-down thing -- because individual servers will queue on the local host if Rabbit is down. But we haven't really explored the options there too much. A lot of the stuff going forward is based on Kafka, but there are cases where we still consider Rabbit because something like a job queue is probably better suited to Rabbit than Kafka.

>> You mentioned that some of the issues you had were with speed when you are flooded with a lot of requests for threads. Do you have issues with depth, for threads like r/counting that go millions and millions in?

>> (Laughter). Well, yes, we have had problems with r/counting and the depth of threads. I believe that there is actually a depth limit these days, but we have also just asked r/counting in particular -- if you are not familiar, r/counting is a subreddit where people literally just count. I don't know why. (Laughter). One person says one, and then the next person replies two, and they get up into the millions for some reason. (Laughter). But we have asked them to start splitting up into multiple threads, and they are using live threads now. (Laughter).

>> What are you using for service discovery?

>> Airbnb's SmartStack. There's an agent that runs a health check on the host and registers it in ZooKeeper, and there's an agent on the host that wants to talk to that service which updates the local HAProxy configuration from ZooKeeper. So you talk to a single IP and port on localhost, and that HAProxy knows where the remote hosts are and connects to them for you.

>> How do live threads fit into the architecture?

>> Live threads were built explicitly to be very separate. They are pretty much entirely stored in Cassandra, and they were the first users of the WebSocket system that we have. A big part of it is that you don't have to refresh and re-load all of that data; it is constantly pushed down over WebSockets. But it is in R2, and pretty Cassandra-based.

>> What are you using for the API gateway?

>> We currently have a home-grown service that is doing that. I'm not sure what the plans are going forward, but the current version is definitely serving a few different goals at the same time, and probably needs to be re-thought a bit.

>> Going once, going twice... thanks, Neil. (Applause). Okay, we will reconvene here in about 30 minutes to hear Bing Wei talk about Slack. Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx