
Making RethinkDB hard durability a little bit faster

I decided to run RethinkDB YCSB benchmarks the other day, and much to my dismay, it was really slow! Actually, no, the benchmark itself was fine. What was slow was loading the data into the database before the benchmark. It created a table at a rate of, like, 140 documents per second, on an SSD! Holy crap that's slow.

Why so slow? It used your dumb newbie data-loading technique: one client, with hard durability turned on. That's gonna be slow, because the client will sit there twiddling its thumbs while waiting for the server, the server will sit there twiddling its thumbs while waiting for the OS, which'll wait for the disk... it's like a pre-deregulation airline.

And think about it this way. On a 7200 RPM rotational drive, you've got 120 rotations per second. A reasonable hard durability workload on a rotational drive, if you're not smart about writing to a quickly accessible part of the spindle, would be 120 operations per second. That's assuming, y'know, no filesystem overhead. That's one write (with an fdatasync call) per rotation. That's one write every 8.333 ms. (You could actually get it down to one write per 1.5 ms – depending on the hardware – if you're clever about how you choose what sector to write to.)
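Just to spell that arithmetic out, here it is as a throwaway program (nothing RethinkDB-specific, just the numbers above):

    #include <cstdio>

    int main() {
        // Back-of-the-envelope math: a 7200 RPM drive, one synced write per rotation.
        const double rpm = 7200.0;
        const double rotations_per_sec = rpm / 60.0;                // 120
        const double ms_per_rotation = 1000.0 / rotations_per_sec;  // ~8.333 ms
        printf("%.0f rotations/sec -> one write every %.3f ms\n",
               rotations_per_sec, ms_per_rotation);
        return 0;
    }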

140 per second is barely an improvement. Sad!

It turns out that an fdatasync on some Samsung SSD (an 840 Pro?) generally takes about 1.5-2 ms on ext4. And it's worse on btrfs. It's about twice as fast on an Intel 520.
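If you want to get that sort of number on your own hardware, here's a minimal sketch of the kind of microbenchmark I mean. It's plain C++/POSIX, not RethinkDB code; the file name, block size, and iteration count are arbitrary.

    // Minimal fdatasync latency microbenchmark (not RethinkDB code).
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        int fd = open("fsync_bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        const int iterations = 1000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; ++i) {
            // Overwrite the same 4 KB block, then force it to stable storage.
            if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) { perror("pwrite"); return 1; }
            if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
        }
        auto end = std::chrono::steady_clock::now();

        double total_ms = std::chrono::duration<double, std::milli>(end - start).count();
        printf("%.3f ms per write+fdatasync\n", total_ms / iterations);
        close(fd);
        return 0;
    }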

Who cares?

After all, when loading data into the database, it makes more sense to do soft-durability writes, and then run an r.table('foo').sync() query to make sure all outstanding writes have really gotten written. Or if you've got a sufficiently distributed cluster, getting stuff on disk isn't really much of a concern in the first place.

This matters because it's the workload you get when some newbie downloads the software and first tries to insert some data into it. And, you know, generally speaking latency does matter in other circumstances.

How RethinkDB starts up

To understand what's going on, you need to know how RethinkDB starts up. The first thing RethinkDB does is read a bunch of "metablocks" (in a "journal") that have

  1. a CRC checksum of their contents,
  2. a version number,
  3. information about how to access other information in the file.

The version number increments every time we write a new metablock. RethinkDB considers the latest valid metablock to be the one with the largest version number that has a valid checksum. This metablock has the info we need to load the rest of the file. (That info consists of file ranges of other information that we use to load the state of the database.)

It is possible for a metablock to have an invalid CRC checksum – if it was partially written before a power failure.
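In code, you can picture the startup scan as something like this. It's a simplified sketch with made-up names and a placeholder checksum, not RethinkDB's actual on-disk format or implementation.

    #include <cstdint>
    #include <vector>

    struct metablock_t {
        uint64_t crc;      // checksum of the rest of the metablock
        uint64_t version;  // incremented on every metablock write
        // ... file ranges telling us where the rest of the database state lives ...
    };

    // Placeholder: the real thing computes a proper CRC over everything but `crc`.
    uint64_t compute_crc(const metablock_t &mb) { return mb.version ^ 0x9e3779b97f4a7c15ull; }

    // Pick the metablock with the largest version whose checksum holds up; a
    // metablock torn by a power failure simply fails its CRC and gets skipped.
    const metablock_t *latest_valid_metablock(const std::vector<metablock_t> &journal) {
        const metablock_t *best = nullptr;
        for (const metablock_t &mb : journal) {
            if (compute_crc(mb) == mb.crc &&
                (best == nullptr || mb.version > best->version)) {
                best = &mb;
            }
        }
        return best;
    }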

So why was it slow?

So why was RethinkDB so slow? Because here's what happened when you did a hard-durability write:

  1. A bunch of data containing the new document, the new B-tree node, or whatever, gets written somewhere, in some as-yet-unused region of the file.
  2. An fdatasync() happens, to ensure that info gets written to disk.
  3. A new "metablock" gets written in the metablock extent, with references to the new data.
  4. An fdatasync() happens, to ensure that the metablock has been written.

Now the write is complete! But that's so slow, because you need two fdatasyncs. Why do we have the first fdatasync? The reason is, we need to ensure that the metablock is written after the data it refers to.
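In other words, the old write path looked roughly like this. The types and helpers are made up for illustration (the real code obviously does actual I/O); the ordering of the syncs is the point.

    #include <unistd.h>

    struct write_batch_t { /* new documents, B-tree nodes, ... */ };
    struct metablock_t { /* version, CRC, references to the newly written data */ };

    // Placeholders: the real versions pwrite() into the file.
    void write_data_blocks(int /*fd*/, const write_batch_t &) {}
    void write_metablock(int /*fd*/, const metablock_t &) {}

    void old_hard_durability_write(int fd, const write_batch_t &batch,
                                   const metablock_t &mb) {
        write_data_blocks(fd, batch);  // 1. new data goes into unused file regions
        fdatasync(fd);                 // 2. make sure that data is on disk...
        write_metablock(fd, mb);       // 3. ...before the metablock that points at it
        fdatasync(fd);                 // 4. make sure the metablock is on disk too
    }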

Here's a faster way to do it:

  1. A bunch of data containing a new document, the new B-tree node, or whatever, gets written somewhere, in some as-yet-unused region of the file. And checksums of those blocks are computed. And a "metablock" gets written in the metablock extent, with references to the data, and checksums of the file ranges that are being written simultaneously.
  2. An fdatasync() happens, to ensure that everything has been written.
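In the same made-up terms as the sketch above, the new path collapses to this:

    #include <cstdint>
    #include <unistd.h>
    #include <vector>

    struct write_batch_t { /* new documents, B-tree nodes, ... */ };

    struct range_checksum_t {
        uint64_t offset, length;  // file range being written in this batch
        uint64_t checksum;        // checksum of that range's new contents
    };

    struct metablock_t {
        uint64_t crc, version;
        std::vector<range_checksum_t> pending_ranges;  // ranges to verify at startup
        // ... references to the newly written data ...
    };

    // Placeholders: the real versions pwrite() into the file.
    void write_data_blocks(int /*fd*/, const write_batch_t &) {}
    void write_metablock(int /*fd*/, const metablock_t &) {}

    void new_hard_durability_write(int fd, const write_batch_t &batch,
                                   const metablock_t &mb) {
        write_data_blocks(fd, batch);  // the data blocks and the metablock (which
        write_metablock(fd, mb);       // now carries their checksums) go out together
        fdatasync(fd);                 // one sync covers all of it
    }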

Now, when starting up, we have to find the latest metablock (with a valid CRC checksum), and then, if it has a list of file ranges and checksums to verify, we have to load those file ranges and verify their checksums.

If those file range checksums fail, yeowch! Use the previous metablock instead.
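Continuing the sketch (same caveats: made-up names, placeholder helpers), startup recovery then looks roughly like this:

    #include <cstdint>
    #include <vector>

    struct range_checksum_t { uint64_t offset, length, checksum; };
    struct metablock_t {
        uint64_t crc, version;
        std::vector<range_checksum_t> pending_ranges;
    };

    // Placeholders: the real versions verify the metablock's own CRC, and
    // pread()-and-checksum a file range, respectively.
    bool crc_ok(const metablock_t &) { return true; }
    uint64_t checksum_file_range(int /*fd*/, uint64_t /*off*/, uint64_t /*len*/) { return 0; }

    // Walk the metablocks from newest to oldest; the first one whose own CRC and
    // whose pending data ranges all check out is the one we recover from.
    const metablock_t *recover(int fd, const std::vector<metablock_t> &newest_first) {
        for (const metablock_t &mb : newest_first) {
            if (!crc_ok(mb)) continue;  // torn metablock write: skip it
            bool ranges_ok = true;
            for (const range_checksum_t &r : mb.pending_ranges) {
                if (checksum_file_range(fd, r.offset, r.length) != r.checksum) {
                    ranges_ok = false;  // torn data write: this metablock is no good either
                    break;
                }
            }
            if (ranges_ok) return &mb;
        }
        return nullptr;  // nothing usable (shouldn't happen with an intact file)
    }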

RethinkDB only bothers with the checksumming path for small writes. If enough dirty data is getting written at once, it falls back to using two fdatasyncs.
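Concretely, picture a decision like the following. The cutoff and the names are completely made up; RethinkDB's real heuristic is its own thing.

    #include <cstddef>

    // Made-up cutoff for illustration only.
    const size_t kChecksumWriteLimit = 64 * 1024;

    bool use_single_sync_path(size_t dirty_bytes) {
        // Small writes: checksum the ranges and do one fdatasync.
        // Big flushes: checksumming everything isn't worth it, so do the classic
        // data / sync / metablock / sync dance instead.
        return dirty_bytes <= kChecksumWriteLimit;
    }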

What checksum function?

I wound up going with a Fletcher-64 checksum. The reason is, it's a checksum function that composes under concatenation: if X and Y are strings, then checksum(concat(X, Y)) can easily be computed from checksum(X) and checksum(Y) (you also need to know length(Y)).

This makes it easy to compute checksums on a piecemeal basis and combine them later. And you get more flexibility in the future to run checksums, or partial checksums, at a more convenient time, like when the data's fresh in cache.

(One problem: plain Fletcher-64 can't distinguish between a 0xFFFFFFFF block and an 0x00000000 block. To fix this, every 32-bit word is xored with 1 before going into the checksum. That eliminates a very plausible failure mode.)
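Here's a minimal sketch of the idea, with the XOR-with-1 tweak included. The names are made up and this is not RethinkDB's actual implementation; it just shows the composition property.

    #include <cstdint>

    struct fletcher64_t {
        // Both sums are kept modulo 2^32 - 1, which is what makes 0xFFFFFFFF and
        // 0x00000000 collide in the plain version.
        static constexpr uint64_t MOD = 0xFFFFFFFFull;  // 2^32 - 1
        uint64_t sum1 = 0;  // running sum of the words
        uint64_t sum2 = 0;  // running sum of the sum1 values

        void add_word(uint32_t w) {
            // XOR with 1 so an all-zeros word and an all-ones word hash differently.
            sum1 = (sum1 + (w ^ 1u)) % MOD;
            sum2 = (sum2 + sum1) % MOD;
        }

        uint64_t value() const { return (sum2 << 32) | sum1; }
    };

    // checksum(concat(X, Y)) from checksum(X), checksum(Y), and length(Y) in words.
    fletcher64_t combine(const fletcher64_t &x, const fletcher64_t &y, uint64_t y_words) {
        fletcher64_t r;
        r.sum1 = (x.sum1 + y.sum1) % fletcher64_t::MOD;
        // Once Y is appended, every word of X contributes to sum2 an extra
        // y_words times, hence the cross term.
        uint64_t cross = (y_words % fletcher64_t::MOD) * x.sum1 % fletcher64_t::MOD;
        r.sum2 = (x.sum2 + y.sum2 + cross) % fletcher64_t::MOD;
        return r;
    }

Checksumming a buffer in two pieces and combine()-ing the results gives the same answer as feeding all the words through add_word in order, which is exactly what lets you compute checksums piecemeal and stitch them together later.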

So uh... the results?

Let's see how fast we can insert 1000 small documents in sequence from a JavaScript client, with hard durability. These numbers are what you get on a ThinkPad W520 with an Intel 520 SSD or a 7200 RPM Western Digital Black drive. Filesystems are mounted with whatever default options Debian Jessie gives them. (Except the last row, which is for an Ivy Bridge rMBP.) Measurements were taken by inserting 1000 documents and taking what looks like the best time measurement we'll get. (Real scientific, I know.)

FS / Device          Old RethinkDB   New RethinkDB
ext4 / Intel 520     3.422s          2.932s
ext3 / Intel 520     3.378s          2.842s
ext2 / Intel 520     2.933s          1.860s
xfs / Intel 520      3.679s          2.920s
btrfs / Intel 520    4.376s          3.207s
ext4 / 7200 RPM      30.865s         26.748s
ext3 / 7200 RPM      31.664s         30.037s
ext2 / 7200 RPM      28.696s         27.417s
xfs / 7200 RPM       52.551s         32.881s
btrfs / 7200 RPM     69.724s         41.957s
rMBP                 19.029s         10.538s

But here's the thing. The client overhead was a significant factor. Typically the client (written in JavaScript) spent 0.7s of CPU time.

And make no mistake, the server burns quite a few gratuitous CPU cycles, too.

This is available in my fork of RethinkDB. (Eventually it'll make it into RethinkDB proper, but pull requests are taking a while to get merged in right now.)

Addendum: How does Postgres do?

On ext4/Intel 520, Postgres finishes 1000 writes in under 0.7 seconds, with 0.3 seconds spent by the JavaScript client. It does 1000 or 1001 fdatasyncs. So yeah, there's a lot of overhead going on with RethinkDB there.

- Sam

(posted January 2 '17)