SH:Blog:

Bidirectional Varint Encoding

Suppose you want a file format to be scannable forwards or backwards. To do that, you might need to scan past an unsigned 64-bit number in either direction. If you don't care about wasting space, you could do this by using 8 bytes to encode your 64-bit number. (Then, scanning is just a matter of reading 8 bytes, in whichever direction you're going.)

For a compressed representation, you can use a modification of the base-128 varint encoding, which normally can only be scanned in one direction.

The trick is, treat values less than 128, that are represented with one byte, the same way: the msb (most significant bit) is 0. For bigger values, instead of having the bytes' msb's follow the pattern 111...1110, have them follow the pattern 1000...001.

To scan the number, read a byte. If the msb is 0, you're done. If it's 1, then keep reading bytes until you have read another byte whose msb is 1.

As with varint encoding, each byte's 7 least significant bits represent a base-128 digit of the number in little-endian order.

Examples of representations, with the msb's underlined.

Number (decimal) Varint Encoding (binary) Bidirectional Encoding (binary)
0 [0000 0000] [0000 0000]
87 [0101 0111] [0101 0111]
4099 [1000 0011] [0010 0000] [1000 0011] [1010 0000]
1234567 [1000 0111] [1010 1101] [0100 1011] [1000 0111] [0010 1101] [1100 1011]

(Explanation: 4099 = 3 + 32 × 128.)

The downside of this encoding, relative to the regular varint encoding, is that you can't look at the middle of a stream of bytes and pick out all the integer values, when trying to debug something. You don't know whether the msb sequence ...1011101001111001... breaks up into ...[101][11][0][1001][11][1001]... or into ...1][0][11][101][0][0][11][11][0][0][1.... With regular varint encoding, you'd have ...10][1110][10][0][11110][0][1... for sure.

- Sam

(posted January 16 '18)