HUFFMAN COMPRESSION


 

Huffman compression reduces the average code length used to represent the symbols of an alphabet. Symbols of the source alphabet which occur frequently are assigned with short length codes.The general strategy is to allow the code length to vary from character to character and to ensure that the frequently occurring character have shorter codes.

Huffman compression is performed by constructing a binary tree using a simple example set. This is done by arranging the symbols of the alphabets in descending order of probability. The two lowest probabilties are repeatedly added and resorted. This process continues until the sum of probabilities of the last two symbols is 1. Once this process is complete, a Huffman binary tree can be generated. If a probability of 1 in the last two symbols is not obtained, there is most likely a mistake in the process. This probability of 1 forms the last symbol in the root of the binary tree.

The resultant codewords are then formed by tracing the tree path from the root node to the end node codewords after assigning 0s and 1s to the branches.

 



The Algorithm: the process used to create this tree is simple yet elegant:
  1. First count the amount of times each character appears. This is the frequency of each character.
  2. Create a collection of n small, one-node trees (where n is the number of distinct characters in the input stream). Each of these n trees represents a distinct input character and has a weight corresponding to its count tallied in the analysis step.
  3. Pick out the two trees with the smallest weights and remove them from the collection. Combine them into a new tree whose root has a weight equal to the sum of the weights of the two trees and with the two trees as its left and right subtrees. Add the newly combined tree back into the collection.
  4. Continue this process - select the two trees (with anywhere from 1 to (n - 1) nodes) with lowest weight, join these by a new root node, set the root node's weight, and place the new tree back into the pool. Repeat this process until one tree encompassing all the input weights has been constructed.