
Huffman Encoding & Python Implementation

An old but efficient compression technique, with a Python implementation

Huffman Encoding is a lossless compression algorithm used to compress data. It was developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". [1]

As the phrase "compression technique" suggests, the aim is to encode the same data in a way that takes up less space. Accordingly, when data is encoded with Huffman Coding, we get a unique code for each symbol in the data. For example, the string "ABC" occupies 3 bytes (24 bits) without any compression. Suppose the encoding assigns the code 00 to the character A, 01 to B, and 10 to C. Then we would need only 6 bits instead of 3 bytes to store the same data. Before examining the working principle of Huffman Encoding, I hope what I mean by compression is now clear!

"Image by Author"
"Image by Author"

The Algorithm

Huffman Encoding is an algorithm that uses the frequencies (or probabilities) of symbols together with a binary tree structure. It consists of the following 3 steps:

  • Probability Calculation & Ordering the Symbols
  • Binary Tree Transformation
  • Assigning Codes to the Symbols

Probability Calculation & Ordering the Symbols

We count the occurrences of each symbol in the whole data, then calculate the probability of each symbol by dividing its count by the total number of characters in the data. Since the algorithm is driven by these probabilities, more common symbols, the ones with higher probability, are generally represented using fewer bits than less common symbols. This is one of the advantages of Huffman Encoding.

As an example, take data containing 5 different symbols, A B C D E. We count each symbol and divide by the total length to get its probability:
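For instance, a quick sketch in Python (the string here is a made-up 21-character stand-in for the example data):

from collections import Counter

data = "AAAAAAABBBBBCCCCDDDEE"   # hypothetical: 7 A, 5 B, 4 C, 3 D, 2 E
counts = Counter(data)          # Counter({'A': 7, 'B': 5, 'C': 4, 'D': 3, 'E': 2})
probabilities = {symbol: count / len(data) for symbol, count in counts.items()}
print(probabilities)            # {'A': 0.333..., 'B': 0.238..., 'C': 0.190..., 'D': 0.142..., 'E': 0.095...}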

"Image by Author"
"Image by Author"

Then we order the symbols according to their probabilities, representing each symbol as a node, and call that our "collection". Now we are ready to move on to the next step.

"Image by Author"
"Image by Author"

Binary Tree Transformation

  1. From the collection, we pick out the two nodes with the smallest probabilities and combine them into a new tree whose root has a probability equal to their sum.
  2. We add the new tree back into the collection.
  3. We repeat this process until a single tree covering all the input symbols remains (a minimal sketch of this loop follows the list).
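A minimal sketch of this loop, where a leaf is a plain symbol and each merge produces a (left, right) pair:

import heapq
from itertools import count

def build_tree(probabilities):
    # Unique ids from count() keep heapq from ever comparing two subtrees.
    ids = count()
    heap = [(p, next(ids), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # node with the smallest probability
        p2, _, right = heapq.heappop(heap)   # node with the second-smallest
        heapq.heappush(heap, (p1 + p2, next(ids), (left, right)))
    return heap[0][2]   # nested (left, right) pairs, with symbols at the leaves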
"Image by Author"
"Image by Author"

Assigning Codes to the Symbols

After obtaining this binary tree, the so-called Huffman Tree, the only thing left to do is to assign 1 each time we go to a right child and 0 each time we go to a left child, as in the sketch below.
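A minimal sketch of that traversal over the tuple-based tree built above (internal nodes are (left, right) pairs, leaves are symbols):

def assign_codes(node, prefix="", codes=None):
    # Append '0' when taking the left branch and '1' when taking the right.
    if codes is None:
        codes = {}
    if isinstance(node, tuple):              # internal node: (left, right)
        assign_codes(node[0], prefix + "0", codes)
        assign_codes(node[1], prefix + "1", codes)
    else:                                    # leaf: a symbol
        codes[node] = prefix or "0"          # a lone symbol still gets one bit
    return codes

# Hypothetical tree: C and B under the left branch, A alone on the right.
print(assign_codes((("C", "B"), "A")))       # {'C': '00', 'B': '01', 'A': '1'}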

Finally, we have the symbols and their codes obtained by Huffman Encoding!

"Image by Author"
"Image by Author"

A quick look shows that even for only 21 characters, the difference between the compressed and the non-compressed data is nontrivial.

"Image by Author"
"Image by Author"

Implementation with Python

To implement Huffman Encoding, we start with a Node class, which represents a node of the binary Huffman Tree. In that sense, each node has a symbol and its probability, left and right children, and a code variable. The code variable is set to 0 or 1 as we travel through the Huffman Tree, according to the side we pick (left 0, right 1).
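A minimal sketch of such a class, with attribute names assumed to match the description:

class Node:
    def __init__(self, probability, symbol, left=None, right=None):
        self.probability = probability   # probability of the symbol
        self.symbol = symbol             # the symbol itself (None for merged nodes)
        self.left = left                 # left child
        self.right = right               # right child
        self.code = ''                   # set to '0' or '1' during tree construction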

"Image by Author"
"Image by Author"

We have 3 helper functions: the first calculates the probabilities of the symbols in the given data, the second obtains the codes of the symbols once the Huffman Tree is built, and the last one produces the output (the encoded data).
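A sketch of how these three helpers could look; Print_Encoded is the name used later in this article, while Calculate_Probability and Calculate_Codes are assumed names:

from collections import Counter

def Calculate_Probability(data):
    # Count each symbol and divide by the total number of characters.
    return {symbol: count / len(data) for symbol, count in Counter(data).items()}

def Calculate_Codes(node, value='', codes=None):
    # Walk the tree, accumulating each node's code ('0' for left, '1' for right).
    if codes is None:
        codes = {}
    value = value + node.code
    if node.left is None and node.right is None:   # leaf: the path is the symbol's code
        codes[node.symbol] = value or '0'
    else:
        Calculate_Codes(node.left, value, codes)
        Calculate_Codes(node.right, value, codes)
    return codes

def Print_Encoded(data, codes):
    # Concatenate the code of every symbol in the data to build the output.
    encoded = ''.join(codes[symbol] for symbol in data)
    print(encoded)
    return encoded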

"Image by Author"
"Image by Author"
"Image by Author"
"Image by Author"
"Image by Author"
"Image by Author"

Additionally, we have a Total_Gain function, which takes the initial data and the dictionary of symbols and their codes coming from Calculate_Codes. That function calculates the difference between the bit sizes of the compressed and the non-compressed data.
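A sketch of Total_Gain, assuming 8 bits per symbol for the non-compressed size:

def Total_Gain(data, codes):
    before = len(data) * 8                              # 8 bits per non-compressed symbol
    after = sum(len(codes[symbol]) for symbol in data)  # bits in the Huffman encoding
    print('Bits used before compression:', before)
    print('Bits used after compression:', after)
    return before - after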

"Image by Author"
"Image by Author"

Finally, we have the Huffman_Encoding function, which takes only the data as a parameter and, using these 4 functions, gives us the resulting encoding and the total gain.
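A sketch of Huffman_Encoding built on the Node class and the helpers sketched above:

def Huffman_Encoding(data):
    probabilities = Calculate_Probability(data)
    nodes = [Node(p, symbol) for symbol, p in probabilities.items()]

    # Repeatedly merge the two least-probable nodes until one root remains.
    while len(nodes) > 1:
        nodes.sort(key=lambda node: node.probability)
        left, right = nodes[0], nodes[1]
        left.code, right.code = '0', '1'   # record which side each child hangs on
        merged = Node(left.probability + right.probability, None, left, right)
        nodes = [merged] + nodes[2:]

    codes = Calculate_Codes(nodes[0])
    encoded = Print_Encoded(data, codes)
    gain = Total_Gain(data, codes)
    return encoded, gain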

"Image by Author"
"Image by Author"

Let’s see how the code works and what it outputs for the case we already examined!
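With the sketch above and the made-up 21-character string from earlier, a run would look roughly like this (the exact bit patterns depend on how ties between equal probabilities are broken, but the totals do not):

encoding, gain = Huffman_Encoding("AAAAAAABBBBBCCCCDDDEE")
# Prints the encoded bit string, then something like:
#   Bits used before compression: 168
#   Bits used after compression: 47
print("Gain:", gain)   # Gain: 121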

"Image by Author"
"Image by Author"

Here is another example, far bigger than our simple case. I created a demo_file.txt and copied into it some random text about data compression, 1579 words in total. For that example, I commented out the Print_Encoded call, since it was not possible to capture the whole output in one screenshot.

"Image by Author"
"Image by Author"

In conclusion, we see that the compression ratio does not change as the amount of data grows, and for text like this it stays close to 2:1. We can say that Huffman Encoding compresses such data to about half its size. Although it is old, it is still an effective compression algorithm!

You can reach my code through the GitHub link and try it with your own examples.

Also, if you would like to learn how to decompress Huffman-encoded data to obtain the initial data back, click here!


[1] Huffman coding, Wikipedia, https://en.wikipedia.org/wiki/Huffman_coding

