|
In probability theory and information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, when logarithms to the base 2 are used. Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. ...
A bundle of optical fiber. ...
A random variable is a mathematical function that maps outcomes of random experiments to numbers. ...
Measurement is the determination of the size or magnitude of something. ...
This article is about the unit of information. ...
Definition of the mutual information
Formally, the mutual information of two discrete random variables X and Y can be defined as:  where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively. Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ...
This article defines some terms which characterize probability distributions of two or more variables. ...
In the continuous case, we replace summation by a definite double integral: Look up continuum in Wiktionary, the free dictionary. ...
In mathematical analysis, there is a serious distinction between a double integral and an iterated integral. ...
 where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively. It should be noted that these definitions are ambiguous because the base of the log function is not specified. To disambiguate, the function I() could be parameterized as I(X,Y,b) where b is the base. Alternatively, since the most common unit of measurement of mutual information is the bit, a base of 2 could be specified:  and  Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa. Mutual information quantifies the distance between the joint distribution of X and Y and what the joint distribution would be if X and Y were independent. Mutual information is a measure of dependence in the following sense: I(X; Y) = 0 iff X and Y are not dependent, i.e. independent, random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) × p(y), and therefore: Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ...
IFF, Iff or iff can stand for: Interchange File Format - a computer file format introduced by Electronic Arts Identification, friend or foe - a radio based identification system utilizing transponders iff - the mathematics concept if and only if International Flavors and Fragrances - a company producing flavors and fragrances International Freedom Foundation...
 Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)). Sphere symmetry group o. ...
The mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X: clearly if X and Y are identical they have equal entropy). Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. ...
Relation to other quantities Mutual information can be equivalently expressed as  where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y. Since H(X) ≥ H(X|Y), this characterization is consistent with the nonnegativity property stated above. Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. ...
The conditional entropy is an entropy measure used in information theory. ...
The joint entropy is an entropy measure used in information theory. ...
Intuitively, if entropy H(X) is regarded as a measure of uncertainty about a random variable, then H(X|Y) can be seen as "the amount of uncertainty remaining about X after Y is known", and thus the right side of the first of these equalities can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other. Note that H(X|X) = 0 and therefore H(X) = I(X;X). Thus I(X;X) ≥ I(X;Y), and one can formulate the basic principle that a variable contains more information about itself than any other variable can provide. Mutual information can also be expressed as a Kullback-Leibler divergence, of the product p(x) × p(y) of the marginal distributions of the two random variables X and Y, from p(x,y) the random variables' joint distribution: In probability theory and information theory, the Kullback-Leibler divergence (or information divergence, or information gain, or relative entropy) is a natural distance measure from a true probability distribution P to an arbitrary probability distribution Q. Typically P represents data, observations, or a precise calculated probability distribution. ...
In probability theory, given two jointly distributed random variables X and Y, the marginal distribution of X is simply the probability distribution of X ignoring information about Y, typically calculated by summing or integrating the joint probability distribution over Y. For discrete random variables, the marginal probability mass function can...
Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ...
 Furthermore, let p(x|y) = p(x, y) / p(y). Then  Thus mutual information can thus also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x), the greater the information gain. In probability theory the expected value (or mathematical expectation) of a random variable is the sum of the probability of each possible outcome of the experiment multiplied by its payoff (value). Thus, it represents the average amount one expects as the outcome of the random trial when identical odds are...
Given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X (written Y | X) is the probability distribution of Y when X is known to be a particular value. ...
Variations of the mutual information Several variations on the mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.
Normalized variants Normalized variants of the mutual information are provided by the coefficients of constraint (Coombs, Dawes & Tversky 1970) or coefficients of uncertainty (Press & Flannery 1988)  The two coefficients are obviously not equal. A more useful and symmetric scaled information measure is the redundancy  which obtains a minimum of zero when the variables are independent and a maximum value of  when one variable is completely redundant with the other. Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by  which represents a weighted average of the two uncertainty coefficients (Press & Flannery 1988). Other normalized versions are provided the following expressions (Yao 2003, Strehl & Ghosh 2002).  Weighted variants In the traditional formulation of the mutual information,  each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others. For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation — showing agreement on all variable values — be judged stronger than the later relation, then it is possible to use the following weighted mutual information (Guiasu 1977)  which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like. There has been little mathematical work done on the weighted mutual information and its properties, however.
Multiple-variable variants Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation and interaction information, but a widely agreed on definition has not yet emerged. One high-dimensional generalization scheme that maximizes the mutual information between the joint distribution and other target variables is found be useful in feature selection. The total correlation (Watanabe 1960) is one of several generalizations of the mutual information. ...
Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. ...
Absolute mutual information Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution: In computer science, the Kolmogorov complexity (also known as descriptive complexity, Kolmogorov-Chaitin complexity, stochastic complexity, algorithmic entropy, or program-size complexity) of an object such as a piece of text is a measure of the computational resources needed to specify the object. ...
- IK(X;Y) = K(X) − K(X | Y)
To establish that this quantity is symmetric up to a logarithmic factor ( ) requires the chain rule for Kolmogorov complexity (Li 1997). Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences (Cilibrasi 2005). In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ...
In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ...
In mathematics a metric or distance is a function which assigns a distance to elements of a set. ...
Data clustering is a common technique for data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. ...
In computing, domain knowledge is the knowledge and skills that software programs encode. ...
In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ...
Applications of mutual information In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include: - The Channel capacity is equal to the mutual information, maximized over all input distributions.
- Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
- RNA secondary structure prediction from a multiple sequence alignment.
- Mutual information has been used as a criterion for feature selection and feature transformations in machine learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection.
- Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
- Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.
Channel capacity, is the amount of discrete information that can be reliably transmitted over a channel. ...
State transitions in a hidden Markov model (example) x â hidden states y â observable outputs a â transition probabilities b â output probabilities A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine...
Ribonucleic acid (RNA) is a nucleic acid polymer consisting of nucleotide monomers. ...
A representation of the 3D structure of the Myoglobin protein. ...
First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. ...
Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. ...
As a broad subfield of artificial intelligence, Machine learning is concerned with the development of algorithms and techniques that allow computers to learn. At a general level, there are two types of learning: inductive, and deductive. ...
This article or section does not cite its references or sources. ...
Corpus linguistics is the study of language as expressed in samples (corpora) or real world text. ...
Medical imaging designates the ensemble of techniques and processes used to create images of the human body (or parts thereof) for clinical purposes (medical procedures seeking to reveal, diagnose or examine disease) or medical science (including the study of normal anatomy and function). ...
In computer vision, sets of data acquired by sampling the same scene or object at different times, or from different perspectives, will be in different coordinate systems. ...
In mathematics as applied to geometry, physics or engineering, a coordinate system is a system for assigning a tuple of numbers to each point in an n-dimensional space. ...
References - Cilibrasi, R.; Paul Vitányi (2005). "Clustering by compression" (PDF). IEEE Transactions on Information Theory 51 (4): 1523-1545.
- Coombs, C. H., Dawes, R. M. & Tversky, A. (1970), Mathematical Psychology: An Elementary Introduction, Prentice-Hall, Englewood Cliffs, NJ.
- Cronbach L. J. (1954). On the non-rational application of information measures in psychology, in H Quastler, ed., Information Theory in Psychology: Problems and Methods, Free Press, Glencoe, Illinois, pp. 14—30.
- Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989.
- Guiasu, Silviu (1977), Information Theory with Applications, McGraw-Hill, New York.
- Li, Ming; Paul Vitányi (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag. ISBN 0387948686.
- Lockhead G. R. (1970). Identification and the form of multidimensional discrimination space, Journal of Experimental Psychology 85(1), 1-10.
- Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
- Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1988), Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge.
- Strehl, Alexander and Ghosh, Joydeep (2002). Cluster ensembles -- A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583-617.
- Witten, Ian H. & Frank, Eibe (2005), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Amsterdam.
- Yao, Y. Y. (2003) Information-theoretic measures for knowledge discovery and data mining, in Entropy Measures, Maximum Entropy Principle and Emerging Applications , Karmeshu (ed.), Springer, pp. 115-136.
- Peng, H.C., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp.1226-1238, 2005. Program
|