Implementing the Huffman Compression Algorithm in C++

The final code is on GitHub here.

Da Vinci is quoted as saying, “Art is never finished, only abandoned”. I don’t see why it should be any different for code. With that said, I’d like to declare my latest project, an implementation of Huffman’s algorithm, abandoned. It works well as it is, but it can be made a lot better.

So anyway, I’ve been working on it for about a week. I started a day before my Data Communication assessment, inspired by a chapter on information theory. There I was at 11 at night, having read about Huffman’s algorithm for the first time in my life, and I was thrilled! I decided then, on the spur of the moment, to pull an all-nighter and write code for the Huffman algorithm. Then I boiled some water and made myself a cup of strong coffee.

I thought I could finish it that night. But it took a lot longer than I had estimated. Why? Well,

  1. The code doesn't use C++ templates. I implemented all the necessary data structures myself, since it seemed like a nice learning experience.
  2. I didn't know that istream's formatted extraction (>>) skips the bytes that represent ASCII whitespace, even when the file is opened in binary mode!
  3. I was using unsigned chars instead of ints as counters in short for-loops (a habit carried over from programming AVRs), and one of them overflowed in certain edge cases, costing me days of debugging.

Huffman Compression

For those of you who don’t know, Huffman’s algorithm takes a very simple idea and finds an elegant way to implement it. At its heart is the observation that the more often a thing is mentioned, the shorter its name should be. This idea manifests itself in daily life too. For example, we use nicknames for people we call often, we have abbreviations for long words we frequently use, and even the word “OK” was once an abbreviation for “all correct”.

Translated to the world of computers, it means that if a chunk of data repeats itself more often than another chunk in the same group, it is best to encode the more frequent chunk with a shorter codeword. Practically, what this means is that the compressed file has two parts. First there is a dictionary of codewords and their corresponding chunks of data. Then there is the data itself, encoded using the codewords (often called the payload in compression jargon). During decompression, the dictionary is read and the codewords are sequentially translated back into the actual data.

Of course, implementing this idea requires us to first solve a few problems. For one, uncompressed data is easy to read. The specification of the file format is all you need to extract the relevant information from uncompressed files. In uncompressed audio, say, a fixed number of bytes unambiguously makes up a sample, i.e. a single amplitude in the sound wave. Simply by reading one sample at a time and setting a proportional voltage across the speaker, you convert the digital representation into real sound. And if you time it right, you might even hear the music. But once you compress that raw file with the scheme I mentioned above, things start to get tricky.

Say this is the data you want to compress:

BANANA

The ASCII codes for these letters, in binary, are:

B : 01000010

A : 01000001

N : 01001110

Thus the ASCII encoding of BANANA is:

01000010 01000001 01001110 01000001 01001110 01000001

Keep in mind that when the computer reads ASCII text files, it always does so in chunks of 8 bits per character. That’s the standard. So the computer always knows that every 8 bits read from the file can be matched against the ASCII table already in its memory to find which character was typed.

Let’s try to compress this with the aforementioned scheme of calling more frequent chunks of data by shorter nicknames. We will start with a naive interpretation.

A : Occurs thrice. Let’s call it a 0

N : Occurs twice. Let’s call it a 1

B : Occurs once. Let’s call it a 10

The compressed data thus turns out to be:

1001010

Let’s look at the uncompressed data again.

01000010 01000001 01001110 01000001 01001110 01000001

That’s a massive shrinkage. But when we try to decompress it, we run into a serious problem. There are multiple ways to decode the compressed string, depending on how you divide it into codewords. Here are a few:

10-0-1-0-1-0 : BANANA

1-0-0-10-10 : NAABB

1-0-0-1-0-1-0 : NAANANA

See the problem? We wanted to say BANANA but started singing Havana instead. Of what use is a compression scheme if you can’t decompress deterministically? Think about it. It is mathematically impossible to recover the original message using only the compressed message and the dictionary, without also sending extra information about the message (which would end up making the compressed file larger than the original).

This is where we introduce the concept of a prefix code (also called a prefix-free code). The idea is simple: we add a constraint to our naive compression scheme. No codeword can start with another codeword. In other words, no codeword in the system is a prefix of another codeword. So as you read the bitstream from left to right, whenever you come across a valid entry in the code dictionary, it can be safely translated to the original symbol. We essentially scan the compressed data sequentially, recognize codewords, translate them, and carry on.

Applied to the BANANA example, the prefix constraint results in the following correction:

A : Occurs thrice. Let’s call it a 0

N : Occurs twice. Let’s call it a 10

B : Occurs once. Let’s call it a 11

Notice how, in this system, no codeword is a prefix of another codeword. BANANA now compresses to 110100100, which can be deterministically recovered as BANANA.
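
Here is a toy sketch (not part of the project code) of that left-to-right scan, decoding the bitstream above with the prefix-free dictionary we just built:

#include <iostream>
#include <map>
#include <string>

int main() {
  // prefix-free dictionary from the BANANA example
  std::map<std::string, char> dictionary = {{"0", 'A'}, {"10", 'N'}, {"11", 'B'}};
  std::string compressed = "110100100";

  std::string candidate, output;
  for (char bit : compressed) {
    candidate += bit;                        // collect bits left to right
    auto it = dictionary.find(candidate);
    if (it != dictionary.end()) {            // a complete codeword: emit its symbol
      output += it->second;
      candidate.clear();
    }
  }
  std::cout << output << '\n';               // prints BANANA
}

Because no codeword is a prefix of another, the first match is always the right one, so the scan never has to backtrack.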

And this is what Huffman compression does. Given a set of symbols and their respective frequencies, it churns out a set of prefix codewords with the smallest possible expected length to represent them all. That is Huffman compression. There are other algorithms to do this, most notably the Shannon–Fano algorithm (which they also expect me to memorize for exams), but as Wikipedia says, “Shannon–Fano is almost never used; Huffman coding is almost as computationally simple and produces prefix codes that always achieve the lowest expected code word length [as opposed to Shannon-Fano]”.

How Huffman’s algorithm does that is actually pretty interesting too. I am far too lazy to describe it all in this post, but this video by Computerphile (along with all their other videos related to compression) is great if you’re just starting out.

The Implementation

The implementation is really the simplest I could find. It is very slow and not optimized for anything except coding efficiency. The alphabets are always bytes, for the sake of simplicity. This means that the “chunk of data” that I talked about is always a single byte.

Follow this link to see the full code on GitHub.

Here is how it compresses a file:

  1. count_frequency() returns an array of size 256, with the byte alphabet as the index and its corresponding frequency as the value. If, for instance, the byte 0x25 (37 in decimal) occurs 50 times in the file, then array[37] has a value of 50.
  2. The tree::node class represents one single alphabet, with a data and a frequency property. 256 new tree::node objects are created, one for each alphabet, and pushed into a priority queue. The priority queue takes in elements in an arbitrary order but dequeues them in a fixed order; in this case, the lowest-frequency alphabets are always popped first.
  3. The make_huffman_code() function pops two elements from the priority queue, makes a new element (say X) with the two popped elements as its children, and puts X back in the queue. X is an internal node of the Huffman tree. This goes on in a loop until only one element is left in the priority queue. That element is the root of the Huffman tree, and all other elements are its descendants.
  4. The Huffman tree is traversed for every alphabet and a codeword is generated by the generate_code() function. This function returns an array of objects of class huffman_code. The huffman_code class encodes a Huffman code, which needs two pieces of information to be correctly understood: the code itself, and the length of the code, since that varies too. (A sketch of what such a traversal might look like follows this list.)
  5. Using this array, we encode the original file. A class called bitstream is used here to pack the variable-bit-length Huffman codes. The bitstream class uses a small buffer in memory to pack the codes. Once the buffer overflows, it signals that the buffer should be flushed to the file and then restores the buffer with the overflowed contents. This keeps the memory requirements independent of the size of the file being compressed. The details of the file format are in the README.md.
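
Step 4 is easiest to picture as a recursive walk of the tree: going left appends a 0 to the codeword, going right appends a 1, and every leaf records the bits accumulated on the way down. Below is a rough sketch of that idea, using simplified, hypothetical stand-ins for the repo’s classes; the actual generate_code() and tree::node may differ in their details.

// Hypothetical, simplified versions of the classes described above.
struct huffman_code {
  unsigned code;    // the bits themselves, right-aligned
  unsigned length;  // how many of those bits are valid
};

struct node {
  unsigned char data;
  node * left;      // by convention, going left appends a 0 ...
  node * right;     // ... and going right appends a 1
};

// Walk the Huffman tree; every leaf gets the codeword spelled out by the
// path taken to reach it. 'table' must point to 256 entries.
void generate_code(const node * n, unsigned code, unsigned length,
                   huffman_code * table) {
  if (!n) return;
  if (!n->left && !n->right) {          // leaf: record the codeword
    table[n->data].code = code;
    table[n->data].length = length;
    return;
  }
  generate_code(n->left,  (code << 1) | 0, length + 1, table);
  generate_code(n->right, (code << 1) | 1, length + 1, table);
}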

The soul of this entire code base is the following function.

tree * make_huffman_tree(input_param options) {
  //open the file for inspection
  ifstream in_file;
  in_file.open(options.input_file, ios::binary | ios::in);

  // count the frequency of all 256 bytes
  unsigned * count = count_frequency(in_file, options.input_file_size);

  priority_queue < tree::node > * x = generate_alphabets(count);

  //count[] is no longer needed
  delete[] count;

  tree::node * a, * b, element, c;

  while (x -> element_count() > 1) {
    element = x -> dequeue();
    a = new tree::node(element);
    element = x -> dequeue();
    b = new tree::node(element);
    c = tree::node(0, a -> get_frequency() + b -> get_frequency(), a, b);
    x -> enqueue(c);
  }

  //there is only one element in the priority queue now: The root.
  element = x -> dequeue();
  tree::node * root = new tree::node(element);
  tree * huffman_tree = new tree(root);

  in_file.close();
  return huffman_tree;
}

Since the Huffman codes are of variable length, I had to write a special bit-stream packing class. It can work with arbitrarily sized buffers and has a mechanism to avoid buffer overflows. I had the most fun working on that class. The header is below:

typedef unsigned char byte;

class bitstream {
protected:
  byte * buffer;
  unsigned buffer_size;    // in bytes
  unsigned bit_pos;        // in bits

  unsigned remainder;      // to store overflowing data
  unsigned remainder_size;

  inline unsigned get_byte_pos();
  inline unsigned get_bit_offset();
  inline unsigned get_free_bits();
  int micropack(byte, unsigned);

  // after the buffer has been emptied,
  // pack the remainder inside the buffer
  bool add_remainder(unsigned, unsigned);

public:
  bitstream(unsigned buffer_size_in_bytes);
  ~bitstream();

  unsigned get_occupied_bytes();

  // packs data into the buffer up to the specified length
  // returns true if okay
  // returns false if the buffer has filled up
  // if pack returns false, flush then reset the buffer
  bool pack(unsigned huffman_code, unsigned code_length);

  // gives you a pointer to the buffer
  const byte * flush_buffer();

  // clears the buffer, then packs the remainder in
  int reset_buffer();
};
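
And here is a rough sketch of how the class might be driven during compression, going by the comments in the header above; the flush-to-disk policy and the helper shown here are my own illustration, not code from the repo.

#include <fstream>

// Pack one codeword; when pack() reports a full buffer, write the buffer to
// disk, then let reset_buffer() re-pack the bits that overflowed.
void pack_symbol(bitstream & bs, std::ofstream & out_file,
                 unsigned code, unsigned code_length) {
  if (!bs.pack(code, code_length)) {
    out_file.write(reinterpret_cast<const char *>(bs.flush_buffer()),
                   bs.get_occupied_bytes());
    bs.reset_buffer();
  }
}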

During decompression:

  1. The compressed file is checked for the file signature.
  2. The table is read and stored in memory as an array (let's call it the dictionary).
  3. The pluck_bit() function plucks successive bits from the file. Every bit from the file is appended to a variable, and the variable is then checked against all 256 entries of the dictionary. If it matches, the corresponding byte is appended to the decompressed file and the variable is cleared. (A sketch of this loop follows the list.)
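
A minimal sketch of that loop, assuming pluck_bit() returns one bit at a time (its real signature may differ) and the dictionary is kept as two parallel 256-entry arrays of codeword bits and codeword lengths; the repo's actual layout may differ, but the idea is the same.

#include <fstream>

unsigned pluck_bit(std::ifstream & in); // assumed signature: returns the next bit (0 or 1)

void decode(std::ifstream & in, std::ofstream & out,
            const unsigned (& codes)[256], const unsigned (& lengths)[256],
            unsigned long long total_bits) {
  unsigned candidate = 0;  // bits collected so far
  unsigned length = 0;     // how many of them are valid
  for (unsigned long long i = 0; i < total_bits; ++i) {
    candidate = (candidate << 1) | pluck_bit(in);  // append the next bit
    ++length;
    for (int sym = 0; sym < 256; ++sym) {          // check all 256 dictionary entries
      if (lengths[sym] == length && codes[sym] == candidate) {
        out.put(static_cast<char>(sym));           // emit the decoded byte
        candidate = 0;                             // clear and start the next codeword
        length = 0;
        break;
      }
    }
  }
}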

This is the most naive way I can think of to decompress the file. But one good thing about slow decompression is that I got to make a progress bar. This might be the first time I have written a program that takes long enough to make a progress bar mandatory (apart from file uploads, of course).

Results

The resulting program can compress and decompress files from the command line.

The details of usage are in the README.md file on GitHub, but here’s a summary:

  1. To compress, run ./huff <filename>. The compressed file will be named a.huff by default. To specify the output file name, use ./huff -o <output filename> <filename>.
  2. To decompress, ./huff -v -d <compressed file>. The -v flag turns on a progress bar, because decompression is slow with this naive implementation.

The compression performance varies wildly with the nature of the file, but text files generally compress to about half their size. In a representative test, a sample 6.3 MB text file compressed to 3.6 MB. Trying to compress the already-compressed file brought the size down to 3.5 MB.

By comparison, gzip compressed the same file down to 2.3 MB, and bzip2 compressed it to 1.7 MB, the smallest of the compared compression schemes.

Closing remarks

When I set out to implement Huffman’s algorithm, I had two main objectives:

  1. Compare it with gzip to see how well it fares. Obviously gzip is better at compressing, but I found out just how much better: there is a huge difference. If you want to try it for yourself, clone the repo from GitHub. But that is not the whole story. Gzip is itself a combination of LZ77 followed by Huffman coding; it is built with Huffman compression as one of its pillars (see also: DEFLATE). So I suppose Huffman wins in the end.
  2. See if it is feasible on memory-constrained embedded systems (like the ATmega or AT89C2051). It seems doable. Huffman decompression doesn't require a lot of RAM (with the compression table stored in PROGMEM), and the slow decompression doesn't matter for my use case, because it's fine to decode one sample at a time, every 1/8000th of a second (for storing and playing sounds). This might be my next project after the exams end.

I learnt a lot of things from this project. I got a glimpse of the immense mathematical complexity that is hidden underneath the abstractions and tools of compression. I learnt a lot more about gzip compression and the LZ77 algorithm. I learnt how code can be slow, and how to make it faster. But new things were not all that I learnt. I also got to know good old C++ better: its various idiosyncrasies, its weaknesses and strengths, memory management, code organisation, and better use of test programs. I know I still have a long way to go, but this project has been an important step towards that destination.