|
Introduction A canonical Huffman code is a particular type of Huffman code which has the property that it can be very compactly described. In computer science, Huffman coding is an entropy encoding algorithm used for data compression that finds the optimal system of encoding strings based on the relative frequency of each character. ...
Data compressors generally work in one of two ways. Either the decompressor can infer what codebook the compressor has used from previous context, or the compressor must tell the decompressor what the codebook is. Since a canonical Huffman codebook can be stored especially efficiently, most compressors start by generating a "normal" Huffman codebook, and the convert it to canonical Huffman before using it.
Algorithm The normal Huffman coding algorithm assigns a particular codeword to every symbol in the alphabet according to the weights assigned to each such symbol. To make the code a canonical Huffman code we need only renumber the codewords. For example, suppose we have the following non-canonical codebook: A = 11 B = 0 C = 101 D = 100 Here the letter A has been assigned 2 bits, B has 1 bit, and C and D both have 3 bits. We now sort the list first by codeword length, and then by letter: C = 101 D = 100 A = 11 B = 0 We now replace each of the codewords with a new one of the same length using the following algorithm: - The first symbol in the list gets assigned the codeword 000...000 (same length as the original code).
- Each subsequent symbol is assigned the next binary number in sequence.
- When you reach a shorter codeword, snip digits from the right of the codeword before incrimenting to generate the next code.
By these rules, the canonical version of the above codebook is C = 000 D = 001 A = 01 B = 1 Encoding the Codebook The whole advantage of canonical Huffman is that one can write the codebook in fewer bits. Let us take our origina Huffman codebook again: A = 11 B = 0 C = 101 D = 100 There are several ways we could encode this and write it into a compressed data file. For example, we could write each symbol followed by the codeword: "A",2,11,"B",1,0,"C",3,101,"D",3,100 Since we are listing the symbols in order, we can probably omit them, which gives 2,11,1,0,3,101,3,100 With our canonical version, C = 000 D = 001 A = 01 B = 1 we need only say how many bits each symbol has. The canonical Huffman algorithm allows us to recreate the entire table using only that. So we may now write 2,1,3,3 And that is all. |