|
Introduction
The chain rule for Kolmogorov complexity is an analogue of the chain rule for Information entropy, which states: In computer science, the Kolmogorov complexity (also known as descriptive complexity, Kolmogorov-Chaitin complexity, stochastic complexity, algorithmic entropy, or program-size complexity) of an object such as a piece of text is a measure of the computational resources needed to specify the object. ...
Claude Shannon In information theory, the Shannon entropy or information entropy is a measure of the uncertainty associated with a random variable. ...
- H(X,Y) = H(X) + H(Y | X)
That is, the combined randomness of two sequences X and Y is the sum of the randomness of X plus whatever randomness is left in Y once we know X. This follows immediately from the definitions of conditional and joint entropy fact from probability theory that the joint probability is the product of the marginal and conditional probability: The conditional entropy is an entropy measure used in information theory. ...
The joint entropy is an entropy measure used in information theory. ...
Probability theory is a branch of mathematics concerned with analysis of random phenomena. ...
This article defines some terms which characterize probability distributions of two or more variables. ...
This article defines some terms which characterize probability distributions of two or more variables. ...
This article defines some terms which characterize probability distributions of two or more variables. ...
- P(X,Y) = P(X)P(Y,X)
The equivalent statement for Kolmogorov complexity does not hold exactly; it is only true up to a logarithmic factor: - K(X,Y) = K(X) + K(Y | X) + O(log(K(X,Y)))
It states that the shortest program to reproduce X and Y is using a program to reproduce X and a program to reproduce Y given X, plus at most a logarithmic factor. Using this statement one can define an analogue of mutual information for Kolmogorov complexity. The Big O notation is a mathematical notation used to describe the asymptotic behavior of functions. ...
In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ...
Proof The direction is obvious: we can write a program to produce x and y by concatenating a program to produce x, a program to produce y given access to x, and (whence the log term) the length of one of the programs, so that we know where to separate the two programs for x and y | x (log(K(x,y)) upper bounds this length). The direction is rather more difficult. The key to the proof is the construction of the set ; that is, the construction of the set of all pairs (u,z) such that the shortest input (for a universal Turing machine) that produces (u,z) (and some way to distinguish u from z) is shorter than the shortest producing (x,y). We will also need , the subset of A where z = u. Enumerating A is not hard (although not fast!). In parallel, simulate all 2K(x,y) programs with length . Output each u,z as the simulated program terminates; eventually, any particular such u,z will be produced. We can restrict to Ax simply by ignoring any output where . We will assume that the inequality does not hold and derive a contradiction by finding a short description for K(x) in terms of its index in the enumeration of a subset of A. The Turing machine is an abstract machine introduced in 1936 by Alan Turing to give a mathematically precise definition of algorithm or mechanical procedure. As such it is still widely used in theoretical computer science, especially in complexity theory and the theory of computation. ...
In computability theory, often less suggestively called recursion theory, a countable set S is called recursively enumerable, computably enumerable, semi-decidable or provable if There is an algorithm that, when given an input — typically an integer or a tuple of integers or a sequence of characters — eventually halts...
In particular, we can describe y by giving the index in which y is output when enumerating Ax (which is less than or equal to | Ax | ) along with x and the value of K(x,y). Thus, where the O(1) term is from the overhead of the enumerator, which does not depend on x or y. Assume by way of contradiction - K(y | x) > K(x,y) − K(x) + O(log(K(x,y)))
for some x,y. Combining this with the above, we have So for any constant c and some x,y, Let e = K(x,y) − K(x) + clogK(x,y); . Now, we can enumerate a set of possible values for x by finding all u such that | Au | > 2e -- list only the u such that y is output after more than 2e other values when enumerating . x is one such u. Let U be the set of all such u. We can enumerate U by enumerating each Au (in parallel, as usual) and outputting a u only when y would be output for Au and more than 2e other values for the Au would be output. Note that given K(x,y), e and the index of x in U we can enumerate U to recover x. Note that is a subset of A, and A has at most 2K(x,y) elements, since there are only that many programs of length K(x,y). So we have: where the first inequality follows since each element of U corresponds to an Au with strictly more than 2e elements. But this leads to a contradiction: which when we substitute in e gives - K(x) < log(K(x,y)) + log(K(x,y) + K(x) + log(K(x,y)) + K(x,y) − K(x,y) + K(x) − clogK(x,y)
which for large enough c gives K(x) < K(x).
Reference - Li, Ming; Paul Vitányi (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag. ISBN 0387948686.
|