Huffman
compression reduces the average code length used to represent the
symbols of an alphabet. Symbols of the source alphabet which occur
frequently are assigned with short length codes.The general strategy is
to allow the code length to vary from character to character and to
ensure that the frequently occurring character have shorter codes.
Huffman
compression is performed by constructing a binary tree using a simple
example set. This is done by arranging the symbols of the alphabets in
descending order of probability. The two lowest probabilties are
repeatedly added and resorted. This process continues until the sum of
probabilities of the last two symbols is 1. Once this process is
complete, a Huffman binary tree can be generated. If a
probability of 1 in the last two symbols is not obtained, there is most
likely a
mistake in the process. This probability of 1 forms the last symbol in
the root of the binary tree.
The resultant codewords are
then formed by tracing the tree path from the root node to the end node
codewords after assigning 0s and 1s to the branches.
The Algorithm: the process used to create this tree is simple yet
elegant:
First
count the amount of times each character appears. This is the frequency
of each character.
Create
a collection of n small, one-node trees (where n is
the number of distinct characters in the input stream). Each of these n
trees represents a distinct input character and has a weight
corresponding to its count tallied in the analysis step.
Pick
out the two trees with the smallest weights and remove them from the collection. Combine them into a new tree whose
root has a weight equal to the sum of the weights of the two trees and
with the two trees as its left and right subtrees. Add the newly
combined
tree back into the collection.
Continue this process - select the two trees (with
anywhere from 1 to (n - 1) nodes) with lowest weight, join these
by a new root node, set the root node's weight, and place the new tree
back into the pool. Repeat this process until one treeencompassing all the input weights has been constructed.