Presentation: Parsing JSON Really Quickly: Lessons Learned

Track: Bare Knuckle Performance

Location: Pacific DEKJ

Duration: 5:25pm - 6:15pm

Abstract

Our disks and networks can load gigabytes of data per second; we feel strongly that our software should follow suit. Thus we wrote what might be the fastest JSON parser in the world, simdjson. It can parse typical JSON files at speeds of over 2 GB/s on a single commodity Intel core, with full validation; it is several times faster than conventional parsers.

How did we go so fast? We started with the insight that we should make full use of the SIMD instructions available on commodity processors. These instructions are everywhere, from the ARM chip in your smartphone all the way to server processors. SIMD instructions work on wide registers (e.g., spanning 32 bytes): they are faster because they process more data using fewer instructions. To our knowledge, nobody had ever attempted to produce a full parser for something as complex as JSON by relying primarily on SIMD instructions, and many people were skeptical that it could be done fruitfully. We had to develop interesting new strategies that are generally applicable. In the end, we learned several lessons. Maybe the most important is the value of a nearly obsessive focus on performance metrics: we constantly measure the impact of the choices we make.
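As a rough illustration of what a SIMD instruction buys you here (a minimal sketch, not simdjson's actual code), the AVX2 snippet below compares 32 input bytes against the quote character at once and returns a bitmask with one bit per matching byte, instead of looping and branching byte by byte:

#include <immintrin.h>   // AVX2 intrinsics; compile with -mavx2
#include <cstdint>
#include <cstdio>

// Returns a 32-bit mask with bit i set when block[i] == '"'.
uint32_t find_quotes(const char* block) {
    __m256i input  = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
    __m256i quotes = _mm256_set1_epi8('"');
    __m256i eq     = _mm256_cmpeq_epi8(input, quotes);        // 0xFF where bytes match
    return static_cast<uint32_t>(_mm256_movemask_epi8(eq));   // one bit per byte
}

int main() {
    const char block[33] = "{\"name\":\"simdjson\",\"fast\":true} ";   // exactly 32 bytes
    std::printf("quote mask: %08x\n", find_quotes(block));
}

The same pattern, with more elaborate classification, is how a SIMD parser can locate quotes, structural characters, and escapes across an entire input without examining bytes one at a time.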

Question: 

What is the work you're doing today?

Answer: 

I work on software performance from what I call a data engineering point of view. These are the kinds of problems that interest me: you start with large volumes of data, and processing the data itself, indexing it, crunching it, whatever you want to do with it, is the bottleneck.

This is a little bit different from other types of performance problems. If you are working in gaming, you have performance issues, but they're not necessarily derived from the fact that you have a large volume of data coming in; they are more numerical. So what do people do to solve these kinds of problems? Today, they basically throw more processors at it, and we've become really good at that with the cloud and all the architectures we've been building on top of it. Part of the problem, and I think there is low-hanging fruit there, is that very often we end up with what I would describe as low-efficiency, high-complexity, high-latency solutions that achieve good throughput. We do process the data and we do get it processed, but there's a cost. My own interest is in trying to get more bang per core, so we get lower latency, simpler systems and so forth.

Question: 

What are the goals for the talk?

Answer: 

We built what is probably the fastest JSON parser in the world. JSON is pretty much a standard data interchange format. Most web services use the JSON format, and lots of databases use it, so it's fairly standard. We realized that ingesting JSON could itself become a performance problem, and we've been talking with lots of engineers in the industry who confirm this. Lots of benchmarks end up being bottlenecked by the JSON parsing. So what we would like to do is explain what we did, because this JSON parser, I think, took some people by surprise.

Obviously, people can verify that it works and does what we say it does. Some people would have thought something like this was not possible, or at least not plausible. The techniques and tricks we use are not unique, but they're not widely known, so I'd like to describe them.

We've built experience in how to exploit current hardware, and that's something not everyone knows how to do. I would like to discuss this a little bit, but I'd also like to cover the methodological angle and discuss how we started with the problem and managed to get this performance out of the system. Some of it is engineering, some of it is based more on experience, but I think our general strategy is something people can adapt.

Question: 

Can you tell me how the JSON parser project got started? I think you said there was a paper published by Microsoft.

Answer: 

There was a paper that came out, I think at a big academic conference, by Microsoft engineers who work on parsing JSON, and they made the point that this was a bottleneck, which is something I was aware of. I turned to a friend of mine and said, "Well, yeah, but could we do better?" So we discussed it and we thought we could do better. We did the work and then we got good results. But prior to submitting the paper for peer review, I just posted the code on GitHub, and within a few days it was trending on GitHub, and within a few weeks it had thousands and thousands of stars. It went crazy. This suggests to me that there's a lot of interest in this type of work.

It's a bit nerdy, and it's almost obsessive-compulsive work: you take something and try to grind it down and make it really, really fast. But even though you might point out that it's a little bit technical and maybe not as accessible as, say, talking about microservices, my experience so far has been pretty good, in the sense that a surprising number of people are actually very interested in understanding how this stuff works and how we did it.

Question: 

You're using SIMD (single instruction, multiple data) instructions as a way of optimizing the parser. I presume you're doing that because it allows you to perform the same operation on multiple data points simultaneously, but it's unusual in this context. Can you talk a bit more about that?

Answer: 

SIMD instructions are commonplace. They're well known; they go back to the Pentium 4, many, many years ago. There's nothing very surprising about them. So what are they typically used for? They are used for number crunching: if you're doing deep learning on the main CPU, you're using these instructions. If you're doing signal processing or image compression, it's par for the course.

If you're doing crypto, you might use them as well; this is standard. But for things like parsing strings it's a little bit more original, because when you're taking something like JSON, the temptation when you receive the data is to do lots of branching and process the data basically byte by byte: you stream through it, look at each byte, and then branch. “Oh, I see an object here,” so I'm going to go down the object path; and here I see what looks like an integer, so I'm going to parse the integer, and so forth. It's really not intuitive that you could use SIMD instructions for problems like this. But you can, obviously. Hopefully I'll get to talk about that and explain a little bit how it's done, and you don't need a PhD to understand any of it.
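To make that contrast concrete, here is a minimal sketch (my own illustration, assuming AVX2; this is not simdjson's implementation): the first function branches on every single byte, while the second flags all the structural characters in a 32-byte block with a handful of branch-free instructions.

#include <immintrin.h>        // AVX2 intrinsics; compile with -mavx2
#include <initializer_list>
#include <cstdint>
#include <cstddef>

// Conventional approach: one comparison and one branch per input byte.
size_t count_structural_scalar(const char* p, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        char c = p[i];
        if (c == '{' || c == '}' || c == '[' || c == ']' || c == ':' || c == ',')
            count++;          // branches are hard to predict on irregular JSON
    }
    return count;
}

// SIMD approach: 32 bytes at a time, no data-dependent branches.
// Bit i of the result is set when block[i] is a structural character.
uint32_t structural_mask(const char* block) {
    __m256i in = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(block));
    uint32_t mask = 0;
    for (char c : {'{', '}', '[', ']', ':', ','}) {
        __m256i eq = _mm256_cmpeq_epi8(in, _mm256_set1_epi8(c));
        mask |= static_cast<uint32_t>(_mm256_movemask_epi8(eq));
    }
    return mask;
}

A production parser can classify all of these characters with fewer instructions (for example, a table lookup via byte shuffles), but the idea of turning bytes into bitmasks and branching rarely is the same.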

Question: 

What do you want people to leave the talk with?

Answer: 

One of the insights I'd like people to take home is the idea that you have to know your constraints when you're doing data engineering, and I mean here your performance constraints. When mechanical or electrical engineers build a bridge or run power lines, they start from the physical constraints, right? They know from first principles how much weight, in theory, the bridge could sustain.

They know from first principles how much power can go through the lines, and so forth. But often in software, people don't know how fast things could be. Things run at a certain speed, and you're lucky if people even know the current speed; very often they don't even measure it. But even if they were to measure it, they don't know how it relates to how fast it could be, so they don't know whether they're 10 or 100 times slower than what is possible. In the work we do, we try to actually find our bounds. We try to measure how fast we could possibly go, and see whether in the best case we could maybe be two times faster. This gives us a bound.
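As a minimal sketch of that idea (the trivial byte-sum pass and the 1 GB buffer are my own illustrative choices, not the speaker's method), you can time how fast the machine can merely touch every byte of the input; no parser of that input can go faster, so the number gives you a ceiling to compare your real parser against.

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> data(1ull << 30, 'x');   // 1 GB of dummy input
    auto start = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (uint8_t b : data) sum += b;              // minimal work: touch every byte once
    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();
    double gbps = (data.size() / 1e9) / seconds;
    // print the checksum so the loop is not optimized away
    std::printf("touched every byte at %.2f GB/s (checksum %llu)\n",
                gbps, static_cast<unsigned long long>(sum));
}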

So this is one insight I'd like to drive home, and hopefully we'll also share some programming tricks that we've been using that I think are of general use.

Question: 

What do you think is the next big disruption in software?

Answer: 

A prediction I (well, not really me) made many years ago was that basically we would get to a point where storage would be infinite. You can make this statement very precise in the following sense: for example, how much data does it take to record everything you see in a year? We achieved this goal of basically having infinite storage, and we have in some respects also achieved a similar goal of having nearly infinite compute power.

My own take is that the new challenge lies at the boundary where we don't have constraints. Clearly, in a cloud setting, neither storage nor compute power is much of a constraint, except if you're talking about efficiency. So this is one angle: I don't know that it will be a big disruption, but I think there could be a renewed focus on efficiency, because right now we're not there yet.

Speaker: Daniel Lemire

Professor and Department Chair @TELUQ - Université du Québec

Daniel Lemire is a computer science professor at the Université du Québec (TELUQ).  He has written over 70 peer-reviewed publications, including more than 40 journal articles.  He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences. During the 2016-2017 NSERC Discovery Grant competition, he received a rating of outstanding for the excellence of the researcher.  

He is a long-time social media user: his blog has thousands of readers and was featured on Slashdot, Reddit and Hacker News.  He was one of the first Twitter users: @lemire.
