# On Optimally Squishing Large Datasets

*This blog post, primarily written by my third year PhD student Yihan Gao, is a high level overview of our KDD’16 paper, titled “Squish: Near-Optimal Compression for Archival of Relational Datasets”, with **code on github**, and **paper on arXiv**.*

Due to ever-increasing rate at which data is being generated, from conversational bots and the internet of things, to large-scale physical simulations and high-throughput genomic sequencing, *data compression*, especially for the purpose of archival, remains extremely crucial.

Traditional algorithms for compression are designed for binary strings, and identify opportunities for compression by finding repeated patterns within these strings. Since files are essentially stored as binary strings, these algorithms are universally applicable to all types of data.

However, by operating on a generic binary string representation, these algorithms lose out on fine-grained compression opportunities. One such setting is the über-important setting of *relational* or tabular datasets: here, generic compression algorithms end up performing very poorly. We present **Squish**: an *asymptotically optimal algorithm for compressing relational datasets*. Compared to other algorithms for relational datasets, as well as more traditional algorithms, our algorithm achieves a *superior compression rate*, and is *customizable to special purpose datasets and needs*, as we will describe later on.

### Motivation

Consider a local population census dataset as an example, with name, gender, age, weight, and height as attributes. Relational dataset compression algorithms proposed in prior work are capable of utilizing domain constraints (e.g., names are strings; gender has only two possible values; age, weight, and height are numeric fields). But there are other opportunities for compression:

1. There are soft *dependencies* between attribute values. For example, gender can usually be deduced from the name, and the gender of an individual often influences the height and weight.

2. Numeric values may be *correlated*. For example, taller individuals usually weigh more. Children (age<12) are usually much shorter than adults.

By exploiting these opportunities, we can use less space to store the same information. However, this is not a trivial task since we do not know what kinds of datasets we will be dealing with. We need to develop an algorithm that can capture the dependencies without assuming any specific structure of the dataset, and find a way to utilize these dependencies to compress the dataset.

**Using Bayesian Networks to Capture Dependencies**

Let us deal with the first task: how can we capture the dependencies in the dataset without assuming any structure? Since the dependencies are soft or “fuzzy”, we need a probabilistic model to capture these dependencies. For this purpose, we use a *Bayesian network*.

In short, we use each random variable in the Bayesian network to represent one of the attribute values of the tuple, such that each tuple in the dataset corresponds to a random sample from the Bayesian network.

Thus, the first step of of our compression algorithm is to learn a Bayesian network from the relational dataset. Then, our question becomes: how do we use a Bayesian network to compress the dataset? Before answering this question, let us review some basic concepts of arithmetic coding.

**A Basic Review of Arithmetic Coding**

Arithmetic coding is an adaptive compression scheme for strings. It requires a conditional probabilistic model for each of the characters in the string, and generates a uniquely decodable code for all possible fixed-length strings. We use an example to illustrate this coding scheme.

In this example, the right hand side is the probabilistic model for the string. To begin with, arithmetic coding assigns a probability interval to each string with length equal to the probability of the string. For example, *aaa* corresponds to [*0*, *0.048*], and *aba* corresponds to [*0.12*, *0.204*]. Next, given the probability interval [*l*, *r*], we find the smallest integer *k*, and another integer *M* such that, *l* <= *2*^{-*k*}*M* < *2*^{-*k*}(*M*+*1*) <= *r*. The binary representation of *M* is the code for the string.

**Combining Bayesian Networks with Arithmetic Coding**

Even though it was developed for strings, the arithmetic coding procedure can actually be used on Bayesian networks with little modification. The construction only requires two things: an ordering among random variables, and a conditional probability distribution for each random variable given all of the preceding ones. Interestingly, Bayesian networks provides us with both. Here is an example of how to apply the same procedure on Bayesian network. Note that the computation is essentially the same as before.

**Dealing with Complex Attributes**

The above example illustrates how we can use Bayesian networks and arithmetic coding to compress relational datasets with only categorical attributes. Now, here comes the last missing piece of Squish: how can we deal with complex attributes like numbers and strings?

In order to be able to handle complex attributes, we developed an abstract interface for all data types, called SquID (short for “Squish Interface for Data types”). SquID is used to encode attributes in such a way that they behave like categorical attributes, and then we can use arithmetic coding to compress them.

A SquID is tree-style model of the probability distribution. Essentially, it is a decision tree with probabilities associated with edges. Here is an example SquID, which describes the distribution Pr(*X* in (*k*-*1*,*k*]) = *0.9*^{*k*-*1*}**0.1*.

To encode an attribute using SquID, we simply construct a sequence of decisions, such that at the end of last decision, we would know exactly what the attribute value is. For example, to encode a numerical attribute using SquID, we can simulate the bisecting procedure, and determine the value of the attribute within [*log n*] steps.

**Supporting User-Defined Data Types**

The SquID interface also enables us to be able to support user-defined attributes. Even if the dataset contains attributes that does not belong to one of the primitive data types (categorical, numerical and string) we implemented, the user can still use Squish to compress the dataset by implementing a SquID interface for the new data type. It is relatively simple, and offers great opportunities to include domain-specific knowledge into Squish to achieve even better compression results.

**Takeaways**

1. We developed ** Squish**, a compression algorithm for relational datasets. It uses a

*combination of Bayesian networks and arithmetic coding*to exploit soft dependencies and correlations between attributes.

2. SquID, the data type interface of Squish, allows us to apply arithmetic coding on complex attributes. SquID also enables the user to define new types of attributes, using which Squish can compress datasets with non-primitive attribute types.

3. Squish supports lossy compression with a user-specified error threshold.

4. Squish is provably asymptotically optimal for all datasets that can be described efficiently using Bayesian networks, and performs much better — *with up to 50% reduction in storage* — than the state-of-the-art relational dataset compression algorithms on 4 realistic datasets.