Smart ways of storing data

Let’s talk about bit packing, deduplication and a few more techniques

Maxim Zaks
Oct 28, 2017 · 7 min read


Bit packing is about making the data representation fit your data like a glove. That’s a very visual metaphor, but it might be a bit confusing, so let’s just jump into the first and simplest example:

00000001 00000000 00000001

What you see above is the binary representation of a boolean array: true, false, true. In most programming languages a boolean is stored in 8 bits, where 0 is false and anything non-zero is true. That makes sense given how the CPU reads data, but if we know that we are going to store an array of boolean values, we can use a bit array:

00000101

In this case a bool array of n elements can be represented as a byte array of size ceil(n/8). When we ask for the element at index i we need to derive two indices from i:

  1. byte index: floor(i/8)
  2. bit index: i mod 8, or i - byteIndex * 8 if you want to avoid the modulo operation.

In order to find out whether the value at index i is true or false, we create a bit mask from the bit index: 1 << bitIndex, and then apply the following expression: bitArray[byteIndex] & mask != 0.
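Here is a minimal sketch of both packing and reading in Python (the function names are mine, made up for illustration):

```python
def get_bit(bit_array, i):
    """Read the boolean at index i from a packed bytearray."""
    byte_index = i // 8
    bit_index = i % 8
    mask = 1 << bit_index
    return bit_array[byte_index] & mask != 0

def pack_bools(bools):
    """Pack a list of booleans into a bytearray, 8 values per byte."""
    packed = bytearray((len(bools) + 7) // 8)
    for i, value in enumerate(bools):
        if value:
            packed[i // 8] |= 1 << (i % 8)
    return packed

bits = pack_bools([True, False, True])
assert [get_bit(bits, i) for i in range(3)] == [True, False, True]
```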

As we can see, reading boolean values from a bit array requires a few extra computations, but it can reduce the size of the stored data by 87.5% in the best case.

Now let’s talk about numbers. Numbers are stored in bytes:

  • 1 byte: 0…255
  • 2 bytes: 0…65,535
  • 4 bytes: 0…4,294,967,295
  • 8 bytes: 0…18,446,744,073,709,551,615

But what if we know that our data consists of smaller numbers?

A quarter of a byte can represent the numbers 0…3 and half of a byte 0…15.

We could use the same technique as in the bit array: compute the two indices and get the values with a few bit-shifting operations.

00000001 00000010 00000011 00000000

The values above are 1, 2, 3, 0. They can be packed into:

00111001

That’s a reduction of 75%. I will leave it to the reader to think about how an array of 1, 10, 12, 3 can be stored in a half-byte representation.
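For the quarter-byte case, a sketch in Python; it assumes index 0 lives in the least significant bits, which is what the byte shown above implies (the half-byte exercise stays open for the reader):

```python
def pack_2bit(values):
    """Pack numbers in range 0..3 into a bytearray, 4 values per byte."""
    packed = bytearray((len(values) + 3) // 4)
    for i, value in enumerate(values):
        assert 0 <= value <= 3
        packed[i // 4] |= value << ((i % 4) * 2)
    return packed

def get_2bit(packed, i):
    """Read the 2-bit number at index i."""
    return (packed[i // 4] >> ((i % 4) * 2)) & 0b11

packed = pack_2bit([1, 2, 3, 0])
assert packed[0] == 0b00111001  # matches the byte shown above
```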

Does it make sense to store a 6-bit number as well?

If we store a number in an amount of bits that is not a power of two (1, 2, 4, 8), we have to face the fact that a number can end up stored across two bytes, meaning we have to read the first byte and then the second. So it is generally possible, but to be honest I have never seen people do it. What I have seen, however, is storing a tuple of two small numbers together in one byte (a small sketch follows the list). For example, the FlexBuffers type definition is:

  • 2 lower bits representing the bit-width of the child (8, 16, 32, 64)
  • 6 bits representing the actual type
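Here is how such a tuple could be packed and unpacked, following the layout described above (the helper names are made up for illustration):

```python
def pack_type(bit_width_code, type_code):
    """Pack a 2-bit width code and a 6-bit type code into one byte."""
    assert 0 <= bit_width_code <= 3 and 0 <= type_code <= 63
    return (type_code << 2) | bit_width_code

def unpack_type(packed):
    """Split the packed byte back into (bit_width_code, type_code)."""
    return packed & 0b11, packed >> 2

# width codes 0..3 stand for 8/16/32/64 bits
assert unpack_type(pack_type(2, 5)) == (2, 5)
```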

What if I don’t know how big my numbers are going to be?

In this case, you can either go with the largest possible size, or you can apply the variable-length quantity technique.

With this technique, we take the bit stream which represents the number and partition it into groups of 7 bits. The 8th bit of each byte becomes a “flag” bit: it is set to 1 if another 7-bit group follows. If you are interested in the details, have a look at the Wikipedia article.
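A rough sketch of the idea in Python; I use the variant where the low 7-bit groups come first (the one Protocol Buffers uses, as mentioned below):

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low groups first."""
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # flag bit: more groups follow
        else:
            out.append(group)
            return bytes(out)

def decode_varint(data):
    """Decode a varint from the start of data; returns (value, bytes used)."""
    value = shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

assert encode_varint(300) == b"\xac\x02"  # 2 bytes instead of 4
assert decode_varint(encode_varint(300)) == (300, 2)
```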

The variable-length quantity technique is used in Protocol Buffers; it is so important to the format that it is the first thing the encoding documentation explains.

Interestingly, UTF-8 encoding is basically the same technique, just with a small tweak that makes it easy to scan through the bytes. The first byte of the byte sequence which encodes a Unicode character tells you how long the sequence is. If the byte “starts” (reading left to right) with a 0, the sequence is only this one byte. If the Unicode character needs two bytes, the first byte starts with 110; if it needs three, with 1110; and so on.
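For illustration, here is how the sequence length can be read from the first byte; a sketch, not a full UTF-8 decoder:

```python
def utf8_length(first_byte):
    """How many bytes the UTF-8 sequence starting with this byte uses."""
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("continuation byte, not the start of a sequence")

assert [utf8_length(s.encode()[0]) for s in "aé€😀"] == [1, 2, 3, 4]
```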

There is one more technique I want to discuss before we switch to deduplication: storing deltas.

Imagine you are storing a sequence of timestamps, represented as Unix time. If we stored the timestamps as is, every entry would consume 4 bytes. However, if we store only the first entry as is, and all the others as deltas to their predecessor, we can use a much smaller number representation for the following entries. In the best-case scenario the events are always at most 255 seconds (4 minutes 15 seconds) apart, so we can store each delta in 1 byte, saving approximately 75% of the space.
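A small sketch of delta encoding and decoding; in practice you would combine it with the variable-length quantities from above, so the occasional large gap still fits:

```python
def encode_deltas(timestamps):
    """Store the first timestamp as is, the rest as gaps to the previous one."""
    deltas = [timestamps[0]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        deltas.append(cur - prev)
    return deltas

def decode_deltas(deltas):
    """Rebuild absolute timestamps by summing the gaps back up."""
    timestamps = [deltas[0]]
    for d in deltas[1:]:
        timestamps.append(timestamps[-1] + d)
    return timestamps

stamps = [1509174000, 1509174007, 1509174013, 1509174042]
assert decode_deltas(encode_deltas(stamps)) == stamps
# the deltas after the first entry are 7, 6, 29 -- each fits in one byte
```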

Now let’s talk about deduplication.

All data has some kind of repetition in it. Deduplication techniques try to find those repetitions and “eliminate” them.

Elimination is a strong word; we can’t just remove a value because it is stored twice. What we can do is store values indirectly. A good old example of deduplication is the GIF format. In GIF, a color is represented with 24 bits (3 bytes) and is stored in a palette. A palette can hold 256 entries (3 * 256 = 768 bytes). Every pixel in the stored image is then stored as a 1-byte palette index.
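The general palette idea can be sketched in a few lines; note that real GIF encoders also quantize similar colors so the palette fits into 256 entries, which this sketch does not do:

```python
def palette_encode(pixels):
    """Replace each value with an index into a palette of unique values."""
    palette = []
    seen = {}
    indices = []
    for p in pixels:
        if p not in seen:
            seen[p] = len(palette)
            palette.append(p)
        indices.append(seen[p])
    return palette, indices

# 3-byte RGB colors, many repeated pixels
pixels = [(255, 0, 0), (255, 0, 0), (0, 255, 0), (255, 0, 0)]
palette, indices = palette_encode(pixels)
assert palette == [(255, 0, 0), (0, 255, 0)] and indices == [0, 0, 1, 0]
```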

Let’s do some math

We have an overhead of 768 bytes and a win of 66% per pixel. So once we have more than 384 pixels (384 * 3 = 384 * 1 + 768), we have covered the cost of the palette and start our journey towards 66% less space.

The same technique can be applied to big numbers, text strings or any other data which repeats throughout our data set. The only problem with this technique is that we need to know how many distinct values we will have. With GIF, the color palette is created in a way that some colors might get “squished” into one. That is acceptable for an image, but you probably would not want to do the same with text.

And what if the palette grows beyond a fixed-size index? If you read the previous sections carefully, you might already know the answer: we could store the palette indices as variable-length quantities.

There is another smart thing we could do: exploit the law of large numbers. In a large data set we can assume that some entries occur much more often than others. In this case we can create a small palette just for the X most frequent values, keep another palette/array which stores the infrequent values, and store the sequence itself as a bit-packed array of indices which indicate whether each number is a frequent one or not. OK, this is getting very abstract, so let’s make a small example.

Imagine we are storing numbers, and we know that 7, 13 and 42 are the three most frequent numbers we store. Now let’s look at an example data sequence:

Initial: 7, 7, 7, 13, 13, 13, 2345, 42, 42, 7, 13, 6543, 7
Index: 1, 1, 1, 2, 2, 2, 0, 3, 3, 1, 2, 0, 1
Frequent: 7, 13, 42
Other: 2345, 6543

So we have an index with 13 entries, which stores numbers from 0 to 3. This is the sequence of values we stored. If a value is bigger than 0, it refers to a “Frequent” number: 1 -> 7, 2 -> 13, 3 -> 42. If it is 0, we need to take the next number out of the “Other” array.

This means that we cannot access values by index, we can only iterate over them. But as each index entry is a number between 0 and 3, we can use a quarter of a byte per entry to store the 13 index values, and 2 bytes each for the 5 values in the “Frequent” and “Other” arrays. Let’s do the math again:

  • Index: ceil(13 / 4) = 4 bytes
  • Frequent + Other: 5 * 2 = 10 bytes

In total 14 bytes compared to 13 * 2 = 26 bytes.
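A sketch of this encoding; the index stays a plain list for readability, but it is exactly what you would bit-pack with the quarter-byte technique from earlier:

```python
def encode_frequent(values, frequent):
    """Index 0 means 'take the next value from other'; 1..n point into frequent."""
    index, other = [], []
    rank = {v: i + 1 for i, v in enumerate(frequent)}
    for v in values:
        index.append(rank.get(v, 0))
        if v not in rank:
            other.append(v)
    return index, other

def decode_frequent(index, frequent, other):
    """Walk the index, pulling infrequent values out of other in order."""
    it = iter(other)
    return [frequent[i - 1] if i else next(it) for i in index]

frequent = [7, 13, 42]
values = [7, 7, 7, 13, 13, 13, 2345, 42, 42, 7, 13, 6543, 7]
index, other = encode_frequent(values, frequent)
assert index == [1, 1, 1, 2, 2, 2, 0, 3, 3, 1, 2, 0, 1]
assert other == [2345, 6543]
assert decode_frequent(index, frequent, other) == values
```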

It seems like it can be worth the hassle if you have lots of entries and the law of large numbers is on your side.

One use case where the law of large numbers is definitely on your side is a sparse value sequence. By sparse value sequence I mean a sequence where lots of values are “default” or null (absence of value). In this case, our index can be just a bit array indicating the absence of a value. We don’t need the “Frequent” array, we only need the “Other”/values array.

Initial: 23, 45, null, null, null, null, null, 3
Index: 1, 1, 0, 0, 0, 0, 0, 1
Values: 23, 45, 3
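The sparse variant is even simpler to sketch; the presence flags would be stored with the bit-array technique from the beginning of the article:

```python
def encode_sparse(values):
    """Presence flags say which slots hold a real value; None slots cost ~1 bit."""
    present = [v is not None for v in values]
    return present, [v for v in values if v is not None]

def decode_sparse(present, values):
    """Rebuild the full sequence, re-inserting None where the flag is off."""
    it = iter(values)
    return [next(it) if p else None for p in present]

present, vals = encode_sparse([23, 45, None, None, None, None, None, 3])
assert present == [True, True, False, False, False, False, False, True]
assert vals == [23, 45, 3]
```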

Another technique which is kind of based on the law of large numbers is Huffman coding. I think the best way to understand the concept behind it is to look at Morse code. In Morse code the most frequent characters are represented by the shortest sequences, and the less frequent ones by the longest. This way, when you communicate in Morse code, you will probably spend less time transmitting. Huffman coding is a bit more complicated, as it is a binary format, whereas Morse code is ternary (short, long and silence), but the idea is very close.
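For completeness, a compact sketch that derives a Huffman code table for a piece of text; it only builds the codes, the actual bit serialization is left out:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # heap entries: (frequency, tiebreaker, {symbol: code_so_far})
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
assert len(codes["a"]) < len(codes["c"])  # most frequent symbol, shortest code
```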

There are probably more smart ways of storing data, but sadly this is all I’ve got. I would be happy to see more techniques in the comments though. You may clap if you wish.
