FACTOID # 135: The Pitcairn Islands have the world’s shortest highway system, with only 6.4 kilometers of road. They also have the fourth-fewest main phone lines.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

SEARCH ALL

FACTS & STATISTICS    Advanced view

Search encyclopedia, statistics and forums:

 

 

(* = Graphable)

 

 


Encyclopedia > Huffman coding

In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes." Huffman became a member of the MIT faculty upon graduation and was later the founding member of the Computer Science Department at the University of California, Santa Cruz, now a part of the Baskin School of Engineering. Computer science, or computing science, is the study of the theoretical foundations of information and computation and their implementation and application in computer systems. ... A bundle of optical fiber. ... An entropy encoding is a coding scheme that assigns codes to symbols so as to match code lengths with the probabilities of the symbols. ... In mathematics, computing, linguistics, and related disciplines, an algorithm is a procedure (a finite set of well-defined instructions) for accomplishing some task which, given an initial state, will terminate in a defined end-state. ... Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. ... A variable length code is a data compression code in which the length of the output of the code is dependent not only on the length of the input of the code, but also the content of the input. ... Professor David A. Huffman (August 9, 1925 - October 7, 1999) was a pioneer in the Computer Science field. ... Doctor of Philosophy (from Greek , meaning Teacher of Philosophy), typically abbreviated Ph. ... The Massachusetts Institute of Technology (MIT) is a private, coeducational research university located in Cambridge, Massachusetts. ... The University of California, Santa Cruz, also known as UCSC or UC Santa Cruz, is one of the ten campuses of the University of California. ...


Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix-free code (sometimes called "prefix codes") (that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. A method was later found to do this in linear time if input probabilities (also known as weights) are sorted. A prefix code, also known as a prefix-free code or comma-free code, is a code constructed so that any partial code word, beginning at the start of a full code word but terminating prior to the end of that code word, is not itself a valid code word. ... In computational complexity, an algorithm is said to take linear time, or O(n) time, if the time it requires is proportional to the size of the input, which is usually denoted n. ...


For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding, e.g., ASCII coding. Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code is not produced by Huffman's algorithm. In mathematics, a power of two is any of the nonnegative integer powers of the number two; in other words, two times itself a certain number of times. ... The Railway Block Code is a system of bells used to send simple messages about train operations from one signalbox to another. ... There are 95 printable ASCII characters, numbered 32 to 126. ...


Although Huffman coding is optimal for a symbol-by-symbol coding with a known input probability distribution, its optimality can sometimes accidentally be over-stated. For example, arithmetic coding and LZW coding often have better compression capability. Both these methods can combine an arbitrary number of symbols for more efficient coding, and generally adapt to the actual input statistics, the latter of which is useful when input probabilities are not precisely known. Arithmetic coding is a method for lossless data compression. ... LZW (Lempel-Ziv-Welch) is an implementation of a lossless data compression algorithm created by Abraham Lempel and Jacob Ziv. ...

Contents

History

In 1951, David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient. Robert Mano Fano (1917- ) is professor emeritus of Electrical Engineering and Computer Science at Massachusetts Institute of Technology. ... In computer science, a binary tree is a tree data structure in which each node has at most two children. ...


In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of the suboptimal Shannon-Fano coding by building the tree from the bottom up instead of from the top down. A bundle of optical fiber. ... Claude Elwood Shannon (April 30, 1916 _ February 24, 2001) has been called the father of information theory, and was the founder of practical digital circuit design theory. ... In the field of data compression, Shannon-Fano coding is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). ...


Problem definition

Informal description

Given. A set of symbols and their weights (usually probabilities).
Find. A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length).
This article needs to be cleaned up to conform to a higher standard of quality. ... In probability theory the expected value (or mathematical expectation) of a random variable is the sum of the probability of each possible outcome of the experiment multiplied by its payoff (value). Thus, it represents the average amount one expects as the outcome of the random trial when identical odds are...


Formalized description

Input.
Alphabet A = left{a_{1},a_{2},cdots,a_{n}right}, which is the symbol alphabet of size n.
Set W = left{w_{1},w_{2},cdots,w_{n}right}, which is the set of the (positive) symbol weights (usually probabilities), i.e. w_{i} = mathrm{weight}left(a_{i}right), 1leq i leq n.

Output.
Code Cleft(A,Wright) = left{c_{1},c_{2},cdots,c_{n}right}, which is the set of (binary) codewords, where ci is the codeword for a_{i}, 1 leq i leq n.

Goal.
Let Lleft(Cright) = sum_{i=1}^{n}{w_{i}timesmathrm{length}left(c_{i}right)} be the weighted path length of code C. Condition: Lleft(Cright) leq Lleft(Tright) for any code Tleft(A,Wright).


Samples

Input (A, W) Symbol (ai) a b c d e Sum
Weights (wi) 0.10 0.15 0.30 0.16 0.29 = 1
Output C Codewords (ci) 000 001 10 01 11  
Codeword length (in bits)
(li)
3 3 2 2 2
Weighted path length
(li wi )
0.30 0.45 0.60 0.32 0.58 L(C) = 2.25
Optimality Probability budget
(2-li)
1/8 1/8 1/4 1/4 1/4 = 1.00
Information content (in bits)
(−log2 wi) ≈
3.32 2.74 1.74 2.64 1.79  
Entropy
(−wi log2 wi)
0.332 0.411 0.521 0.423 0.518 H(A) = 2.205

For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code. If this is not the case, you can always derive an equivalent code by adding extra symbols (with associated null probabilities), to make the code complete while keeping it biunique.


As defined by Shannon (1948), the information content h (in bits) of each symbol ai with non-null probability is A Mathematical Theory of Communication, published in 1948 by mathematician and computer scientist Claude Shannon, was one of the founding works of the field of information theory. ...

h(a_i) = log_2{1 over w_i}

The information content of symbols with null probability is not defined, but in practice can be defined as any finite value because this information will be absent from the encoded message (unless the message A has an infinite symbol length, but in this case these symbols have an infinitesimal positive probability 0+).


The entropy H (in bits) is the weighted sum, across all symbols ai with non-zero probability wi, of the information content of each symbol: Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function Entropy is a concept in thermodynamics (see entropy), statistical mechanics and information theory. ...

H(A) = sum_{w_i > 0} w_i h(a_i) = sum_{w_i > 0} w_i log_2{1 over w_i} = - sum_{w_i > 0} w_i log_2{w_i}

All symbols with zero probability have in theory a positive infinite entropy, but as they are necessarily absent from the original message to be encoded, they don't contribute to the entropy of the encoded message (unless the message is infinite) ; so they could be made equivalent to a zero entropy within the sum above, removing the restriction on the suitable indices.


As a consequence of Shannon's Source coding theorem, the entropy is a measure of the smallest codeword length that is theoretically possible for the given alphabet with associated weights. In this example, the weighted average codeword length is 2.25 bits per symbol, only slightly larger than the calculated entropy of 2.205 bits per symbol. So not only is this code optimal in the sense that no other feasible code performs better, but it is very close to the theoretical limit established by Shannon. In information theory, the source coding theorem (Shannon 1948) informally states that: N i. ...


Note that, in general, a Huffman code need not be unique, but it is always one of the codes minimizing L(C).


Basic technique

The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols(N). A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has N leaf nodes and N−1 internal nodes. In computer science, a binary tree is a tree data structure in which each node has at most two children. ... In computer programming, a group of homogeneous elements of a specific data type is known as an array, one of the simplest data structures. ... 9, 14, 19, 67 and 76 are leaf nodes. ... In computer science, an internal node or inner node is any node of a tree data structure that is not a leaf node. ...


A linear-time* method to create a Huffman tree is to use two queues, the first one containing the initial weights (along with pointers to the associated leaves), and combined weights (along with pointers to the trees) being put in the back of the second queue. This assures that the lowest weight is always kept at the front of one of the two queues. In computational complexity, an algorithm is said to take linear time, or O(n) time, if the time it requires is proportional to the size of the input, which is usually denoted n. ... In providing services in computer science, transport, and operations research a queue (pronounced kyew) is a buffer where various entities such as data, objects, persons, or events are stored and waiting to be processed. ...


Creating the tree:

  1. Start with as many leaves as there are symbols.
  2. Enqueue all leaf nodes into the first queue (by probability in decreasing order).
  3. While there is more than one node in the queues:
    1. Dequeue the two nodes with the lowest weight.
    2. Create a new internal node, with the two just-removed nodes as children (either node can be either child) and the sum of their weights as the new weight.
    3. Enqueue the new node into the rear of the second queue.
  4. The remaining node is the root node; the tree has now been generated.

It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To minimize variance, simply break ties between queues by choosing the item in the first queue. This modification will retain the mathematical optimality of the Huffman coding while both minimizing variance and minimizing the length of the longest character code.


* This method is linear time assuming that you already have the leaf nodes sorted by initial weight. If not, sorting them will take O(nlogn) time. In computer science and mathematics, a sorting algorithm is an algorithm that puts elements of a list in a certain order. ...


Main properties

The frequencies used can be generic ones for the application domain that are based on average experience, or they can be the actual frequencies found in the text being compressed. (This variation requires that a frequency table or other hint as to the encoding must be stored with the compressed text; implementations employ various tricks to store tables efficiently.)


Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix-free codes tend to have slight inefficiency on small alphabets, where probabilities often fall between these optimal points. "Blocking", or expanding the alphabet size by coalescing multiple symbols into "words" of fixed or variable length before Huffman coding, usually helps, especially when adjacent symbols are correlated (as in the case of natural language text). The worst case for Huffman coding can happen when the probability of a symbol exceeds 2-1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding. Run-length encoding (RLE) is a very simple form of data compression in which runs of data (that is, sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. ...


Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have seldom been large enough to offset arithmetic coding's higher computational complexity and patent royalties. (As of July 2006, IBM owns patents on many methods of arithmetic coding in several jurisdictions; see US patents on arithmetic coding.) Arithmetic coding is a method for lossless data compression. ... A patent is a set of exclusive rights granted by a state to a patentee (the inventor or assignee) for a fixed period of time in exchange for the regulated, public disclosure of certain details of a device, method, process or composition of matter (substance) (known as an invention) which... A royalty is a sum paid to the creator of performance art for the use of that art. ... For the Manfred Mann album, see 2006 (album). ... International Business Machines Corporation (known as IBM or Big Blue; NYSE: IBM) is a multinational computer technology corporation headquartered in Armonk, New York, USA. The company is one of the few information technology companies with a continuous history dating back to the 19th century. ... Arithmetic coding is a method for lossless data compression. ...


Variations

Many variations of Huffman coding exist, some of which use a Huffman-like algorithm, and others of which find optimal prefix codes (while, for example, putting different restrictions on the output). Note that, in the latter case, the method need not be Huffman-like, and, indeed, need not even be polynomial time. An exhaustive list of papers on Huffman coding on its variations is given by "Code and Parse Trees for Lossless Source Encoding"[1]. In computational complexity theory, polynomial time refers to the computation time of a problem where the time, m(n), is no greater than a polynomial function of the problem size, n. ...


n-ary Huffman coding

The n-ary Huffman algorithm uses the {0, 1, ..., n − 1} alphabet to encode message and build an n-ary tree. This approach was considered by Huffman in his original paper.


Adaptive Huffman coding

A variation called adaptive Huffman coding calculates the frequencies dynamically based on recent actual frequencies in the source string. This is somewhat related to the LZ family of algorithms. Adaptive Huffman coding is an adaptive coding technique based on Huffman coding, building the code as the symbols are being transmitted, having no initial knowledge of source distribution, that allows one-pass encoding and adaptation to changing conditions in data. ... LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. ...


Huffman template algorithm

Most often, the weights used in implementations of Huffman coding represent numeric probabilities, but the algorithm given above does not require this; it requires only a way to order weights and to add them. The Huffman template algorithm enables one to use any kind of weights (costs, frequencies, pairs of weights, non-numerical weights) and one of many combining methods (not just addition). Such algorithms can solve other minimization problems, such as minimizing max_ileft[w_{i}+mathrm{length}left(c_{i}right)right] , a problem first applied to circuit design[2].


Length-limited Huffman coding

Length-limited Huffman coding is a variant where the goal is still to achieve a minimum weighted path length, but there is an additional restriction that the length of each codeword must be less than a given constant. The package-merge algorithm solves this problem with a simple greedy approach very similar to that used by Huffman's algorithm. Its time complexity is O(nL), where L is the maximum length of a codeword. No algorithm is known to solve this problem with the same efficiency as conventional Huffman coding, The Package-Merge Algorithm is an O(nL)-time algorithm for finding an optimal length-limited Huffman code for a given distribution on a given alphabet of size n, where no code string is permitted to have length more than L. It is greedy algorithm which is a generalization of... Greedy Algorithms are algorithms which follow the problem solving meta-heuristic of making the locally optimum choice at each stage with the hope of finding the global optimum. ...


Huffman coding with unequal letter costs

In the standard Huffman coding problem, it is assumed that each symbol in the set that the code words are constructed from has an equal cost to transmit: a code word whose length is N digits will always have a cost of N, no matter how many of those digits are 0s, how many are 1s, etc. When working under this assumption, minimizing the total cost of the message and minimizing the total number of digits are the same thing.


Huffman coding with unequal letter costs is the generalization in which this assumption is no longer assumed true: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message. No algorithm is known to solve this in the same manner or with the same efficiency as conventional Huffman coding. 1922 Chart of the Morse Code Letters and Numerals Morse code is a method for transmitting information by using standardized sequences of variously spaced short and long elements for the characters and words in a message. ...


Optimal alphabetic binary trees (Hu-Tucker coding and the canonical Huffman code)

In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, A = left{a,b,cright} could not be assigned code Hleft(A,Cright) = left{00,1,01right}, but instead should be assigned either Hleft(A,Cright) =left{00,01,1right} or Hleft(A,Cright) = left{0,10,11right}. This is also known as the Hu-Tucker problem, after the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem, which has some similarities to Huffman algorithm, but is not a variation of this algorithm. These optimal alphabetic binary trees are often used as binary search trees. If weights corresponding to the alphabetically ordered inputs are in numerical order, the Huffman code has the same lengths as the optimal alphabetic code, which can be found from calculating these lengths. The resulting alphabetic code is sometimes called the canonical Huffman code and is often the code used in practice, due to ease of encoding/decoding. The technique for finding this code is sometimes called Huffman-Shannon-Fano coding, since it is optimal like Huffman coding, but alphabetic in weight probability, like Shannon-Fano coding. The Huffman-Shannon-Fano code corresponding to the example is {000,001,01,10,11}, which, having the same codeword lengths as the original solution, is also optimal. In computer science, a function is called linearithmic if it is of the form n · log n  (i. ... A binary search tree of size 9 and depth 3, with root 8 and leaves 1, 4, 7 and 13. ... Introduction A canonical Huffman code is a particular type of Huffman code which has the property that it can be very compactly described. ... In the field of data compression, Shannon-Fano coding is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). ...


Applications

Arithmetic coding can be viewed as a generalization of Huffman coding; indeed, in practice arithmetic coding is often preceded by Huffman coding, as it is easier to find an arithmetic code for a binary input than for a nonbinary input. Also, although arithmetic coding offers better compression performance than Huffman coding, Huffman coding is still in wide use because of its simplicity, high speed and lack of encumbrance by patents. Arithmetic coding is a method for lossless data compression. ... A patent is a set of exclusive rights granted by a state to a patentee (the inventor or assignee) for a fixed period of time in exchange for the regulated, public disclosure of certain details of a device, method, process or composition of matter (substance) (known as an invention) which...


Huffman coding today is often used as a "back-end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding. DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding. ... PKZIP is an archiving tool originally written by the late Phil Katz, and marketed by his company PKWARE, Inc. ... A Codec is a device or program capable of performing encoding and decoding on a digital data stream or signal. ... In computing, JPEG (pronounced JAY-peg; IPA: ) is a commonly used standard method of compression for photographic images. ... MPEG-1 Audio Layer 3, more commonly referred to as MP3, is a popular digital audio encoding and lossy compression format and algorithm, designed to greatly reduce the amount of data required to represent audio, yet still sound like a faithful reproduction of the original uncompressed audio to most listeners. ... Quantized signal Digital signal In digital signal processing, quantization is the process of approximating a continuous range of values (or a very large set of possible discrete values) by a relatively-small set of discrete symbols or integer values. ...


See also

Modified Huffman coding is used in fax machines to encode black on white images (bitmaps). ... In the field of data compression, Shannon-Fano coding is a technique for constructing a prefix code based on a set of symbols and their probabilities (estimated or measured). ... In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ...

References

Scientific American is a popular-science magazine, published (first weekly and later monthly) since August 28, 1845, making it the oldest continuously published magazine in the United States. ... Thomas H. Cormen is the co-author of Introduction to Algorithms, along with Charles Leiserson, Ron Rivest, and Cliff Stein. ... Charles E. Leiserson is a computer scientist, specializing in the theory of parallel computing and distributed computing, and particularly practical applications thereof; as part of this effort, he developed the Cilk multithreaded language. ... Professor Ron Rivest Professor Ronald Linn Rivest (born 1947, Schenectady, New York) is a cryptographer, and is the Viterbi Professor of Computer Science at MITs Department of Electrical Engineering and Computer Science. ... Clifford Stein is a computer scientist, currently working as a professor at Columbia University in New York, NY. He earned his BSE from Princeton University in 1987, a MS from Massachusetts Institute of Technology in 1989, and a PhD from Massachusetts Institute of Technology in 1992. ... Cover of the second edition Introduction to Algorithms is a book by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. ...

External links


  Results from FactBites:
 
PlanetMath: Huffman coding (659 words)
Huffman coding is a method of lossless data compression, and a form of entropy encoding.
The mapping is obtained by the path from the root of the Huffman tree to the leaf associated with a symbol's weight.
This is version 4 of Huffman coding, born on 2002-03-08, modified 2003-09-04.
Huffman coding - Wikipedia, the free encyclopedia (2061 words)
Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code is not produced by Huffman's algorithm.
Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.
Huffman coding with unequal letter costs is the generalization in which this assumption is no longer assumed true: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium.
  More results at FactBites »


 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments
Please enter the 5-letter protection code

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.