FACTOID # 67: Nearly a quarter of people in Monaco are over 65.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

SEARCH ALL

FACTS & STATISTICS    Advanced view

Search encyclopedia, statistics and forums:

 

 

(* = Graphable)

 

 


Encyclopedia > Average Mutual Information

In probability theory and information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. The most common unit of measurement of mutual information is the bit, when logarithms to the base 2 are used. Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. ... A bundle of optical fiber. ... A random variable is a mathematical function that maps outcomes of random experiments to numbers. ... Measurement is the determination of the size or magnitude of something. ... This article is about the unit of information. ...

Contents

Definition of the mutual information

Formally, the mutual information of two discrete random variables X and Y can be defined as:

I(X;Y) = sum_{y in Y} sum_{x in X} p(x,y) log frac{p(x,y)}{p(x),p(y)}, !

where p(x,y) is the joint probability distribution function of X and Y, and p(x) and p(y) are the marginal probability distribution functions of X and Y respectively. Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ... This article defines some terms which characterize probability distributions of two or more variables. ...


In the continuous case, we replace summation by a definite double integral: Look up continuum in Wiktionary, the free dictionary. ... In mathematical analysis, there is a serious distinction between a double integral and an iterated integral. ...

I(X;Y) = int_Y int_X p(x,y) log frac{p(x,y)}{p(x),p(y)} ; dx ,dy, !

where p(x,y) is now the joint probability density function of X and Y, and p(x) and p(y) are the marginal probability density functions of X and Y respectively.


It should be noted that these definitions are ambiguous because the base of the log function is not specified. To disambiguate, the function I() could be parameterized as I(X,Y,b) where b is the base. Alternatively, since the most common unit of measurement of mutual information is the bit, a base of 2 could be specified:

I(X;Y) = sum_{y in Y} sum_{x in X} p(x,y) log_2 frac{p(x,y)}{p(x),p(y)}, !

and

I(X;Y) = int_Y int_X p(x,y) log_2 frac{p(x,y)}{p(x),p(y)} ; dx ,dy, !


Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces our uncertainty about the other. For example, if X and Y are independent, then knowing X does not give any information about Y and vice versa, so their mutual information is zero. At the other extreme, if X and Y are identical then all information conveyed by X is shared with Y: knowing X determines the value of Y and vice versa.


Mutual information quantifies the distance between the joint distribution of X and Y and what the joint distribution would be if X and Y were independent. Mutual information is a measure of dependence in the following sense: I(X; Y) = 0 iff X and Y are not dependent, i.e. independent, random variables. This is easy to see in one direction: if X and Y are independent, then p(x,y) = p(x) × p(y), and therefore: Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ... IFF, Iff or iff can stand for: Interchange File Format - a computer file format introduced by Electronic Arts Identification, friend or foe - a radio based identification system utilizing transponders iff - the mathematics concept if and only if International Flavors and Fragrances - a company producing flavors and fragrances International Freedom Foundation...

log frac{p(x,y)}{p(x),p(y)} = log 1 = 0. !

Moreover, mutual information is nonnegative (i.e. I(X;Y) ≥ 0; see below) and symmetric (i.e. I(X;Y) = I(Y;X)). Sphere symmetry group o. ...


The mutual information is the same as the uncertainty contained in Y (or X) alone, namely the entropy of Y (or X: clearly if X and Y are identical they have equal entropy). Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. ...


Relation to other quantities

Mutual information can be equivalently expressed as

begin{matrix} I(X;Y) & = & H(X) - H(X|Y)    & = & H(Y) - H(Y|X)   & = & H(X) + H(Y) - H(X,Y) end{matrix}


where H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X,Y) is the joint entropy of X and Y. Since H(X) ≥ H(X|Y), this characterization is consistent with the nonnegativity property stated above. Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function Entropy is a concept in thermodynamics (see thermodynamic entropy), statistical mechanics and information theory. ... The conditional entropy is an entropy measure used in information theory. ... The joint entropy is an entropy measure used in information theory. ...


Intuitively, if entropy H(X) is regarded as a measure of uncertainty about a random variable, then H(X|Y) can be seen as "the amount of uncertainty remaining about X after Y is known", and thus the right side of the first of these equalities can be read as "the amount of uncertainty in X, minus the amount of uncertainty in X which remains after Y is known", which is equivalent to "the amount of uncertainty in X which is removed by knowing Y". This corroborates the intuitive meaning of mutual information as the amount of information (that is, reduction in uncertainty) that knowing either variable provides about the other.


Note that H(X|X) = 0 and therefore H(X) = I(X;X). Thus I(X;X) ≥ I(X;Y), and one can formulate the basic principle that a variable contains more information about itself than any other variable can provide.


Mutual information can also be expressed as a Kullback-Leibler divergence, of the product p(x) × p(y) of the marginal distributions of the two random variables X and Y, from p(x,y) the random variables' joint distribution: In probability theory and information theory, the Kullback-Leibler divergence (or information divergence, or information gain, or relative entropy) is a natural distance measure from a true probability distribution P to an arbitrary probability distribution Q. Typically P represents data, observations, or a precise calculated probability distribution. ... In probability theory, given two jointly distributed random variables X and Y, the marginal distribution of X is simply the probability distribution of X ignoring information about Y, typically calculated by summing or integrating the joint probability distribution over Y. For discrete random variables, the marginal probability mass function can... Given two random variables X and Y, the joint probability distribution of X and Y is the probability distribution of X and Y together. ...

I(X;Y) = D_{mathrm{KL}}(p(x,y)|p(x)p(y)).


Furthermore, let p(x|y) = p(x, y) / p(y). Then

begin{matrix} I(X;Y) & = & sum_y p(y) sum_x p(x|y) log_2 frac{p(x|y)}{p(x)}   & = & sum_y p(y) ; D_{mathrm{KL}}(p(x|y)|p(x))   & = & mathbb{E}_Y{D_{mathrm{KL}}(p(x|y)|p(x))} end{matrix}


Thus mutual information can thus also be understood as the expectation of the Kullback-Leibler divergence of the univariate distribution p(x) of X from the conditional distribution p(x|y) of X given Y: the more different the distributions p(x|y) and p(x), the greater the information gain. In probability theory the expected value (or mathematical expectation) of a random variable is the sum of the probability of each possible outcome of the experiment multiplied by its payoff (value). Thus, it represents the average amount one expects as the outcome of the random trial when identical odds are... Given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X (written Y | X) is the probability distribution of Y when X is known to be a particular value. ...


Variations of the mutual information

Several variations on the mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.


Normalized variants

Normalized variants of the mutual information are provided by the coefficients of constraint (Coombs, Dawes & Tversky 1970) or coefficients of uncertainty (Press & Flannery 1988)

C_{XY}=frac{I(X;Y)}{H(Y)} ~~~~mbox{and}~~~~ C_{YX}=frac{I(X;Y)}{H(X)}

The two coefficients are obviously not equal. A more useful and symmetric scaled information measure is the redundancy

R= frac{I(X;Y)}{H(X)+H(Y)}


which obtains a minimum of zero when the variables are independent and a maximum value of

R_{max }=frac{min (H(X),H(Y))}{H(X)+H(Y)}


when one variable is completely redundant with the other. Another symmetrical measure is the symmetric uncertainty (Witten & Frank 2005), given by

U(X,Y) = 2R = 2 frac{I(X;Y)}{H(X)+H(Y)}


which represents a weighted average of the two uncertainty coefficients (Press & Flannery 1988).


Other normalized versions are provided the following expressions (Yao 2003, Strehl & Ghosh 2002).

frac{I(X;Y)}{mbox{min}(H(X),H(Y))}, ~~~~~~~ frac{I(X;Y)}{H(X,Y)}, ~~~~~~~ frac{I(X;Y)}{sqrt{H(X)H(Y)}}

Weighted variants

In the traditional formulation of the mutual information,

I(X;Y) = sum_{y in Y} sum_{x in X} p(x,y) log frac{p(x,y)}{p(x),p(y)},

each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.


For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation — showing agreement on all variable values — be judged stronger than the later relation, then it is possible to use the following weighted mutual information (Guiasu 1977)

I(X;Y) = sum_{y in Y} sum_{x in X} w(x,y) p(x,y) log frac{p(x,y)}{p(x),p(y)},

which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like. There has been little mathematical work done on the weighted mutual information and its properties, however.


Multiple-variable variants

Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation and interaction information, but a widely agreed on definition has not yet emerged. One high-dimensional generalization scheme that maximizes the mutual information between the joint distribution and other target variables is found be useful in feature selection. The total correlation (Watanabe 1960) is one of several generalizations of the mutual information. ... Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. ...


Absolute mutual information

Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution: In computer science, the Kolmogorov complexity (also known as descriptive complexity, Kolmogorov-Chaitin complexity, stochastic complexity, algorithmic entropy, or program-size complexity) of an object such as a piece of text is a measure of the computational resources needed to specify the object. ...

IK(X;Y) = K(X) − K(X | Y)

To establish that this quantity is symmetric up to a logarithmic factor (I_K(X;Y) approx I_K(Y;X)) requires the chain rule for Kolmogorov complexity (Li 1997). Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences (Cilibrasi 2005). In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ... In computer science and information theory, data compression or source coding is the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use through use of specific encoding schemes. ... In mathematics a metric or distance is a function which assigns a distance to elements of a set. ... Data clustering is a common technique for data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. ... In computing, domain knowledge is the knowledge and skills that software programs encode. ... In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ...


Applications of mutual information

In many applications, one wants to maximize mutual information (thus increasing dependencies), which is often equivalent to minimizing conditional entropy. Examples include:

  • The Channel capacity is equal to the mutual information, maximized over all input distributions.
  • Discriminative training procedures for hidden Markov models have been proposed based on the maximum mutual information (MMI) criterion.
  • RNA secondary structure prediction from a multiple sequence alignment.
  • Mutual information has been used as a criterion for feature selection and feature transformations in machine learning. It can be used to characterize both the relevance and redundancy of variables, such as the minimum redundancy feature selection.
  • Mutual information is often used as a significance function for the computation of collocations in corpus linguistics.
  • Mutual information is used in medical imaging for image registration. Given a reference image (for example, a brain scan), and a second image which needs to be put into the same coordinate system as the reference image, this image is deformed until the mutual information between it and the reference image is maximized.

Channel capacity, is the amount of discrete information that can be reliably transmitted over a channel. ... State transitions in a hidden Markov model (example) x — hidden states y — observable outputs a — transition probabilities b — output probabilities A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine... Ribonucleic acid (RNA) is a nucleic acid polymer consisting of nucleotide monomers. ... A representation of the 3D structure of the Myoglobin protein. ... First 90 positions of a protein multiple sequence alignment of instances of the acidic ribosomal protein P0 (L10E) from several organisms. ... Feature selection, also known as subset selection or variable selection, is a process commonly used in machine learning, wherein a subset of the features available from the data are selected for application of a learning algorithm. ... As a broad subfield of artificial intelligence, Machine learning is concerned with the development of algorithms and techniques that allow computers to learn. At a general level, there are two types of learning: inductive, and deductive. ... This article or section does not cite its references or sources. ... Corpus linguistics is the study of language as expressed in samples (corpora) or real world text. ... Medical imaging designates the ensemble of techniques and processes used to create images of the human body (or parts thereof) for clinical purposes (medical procedures seeking to reveal, diagnose or examine disease) or medical science (including the study of normal anatomy and function). ... In computer vision, sets of data acquired by sampling the same scene or object at different times, or from different perspectives, will be in different coordinate systems. ... In mathematics as applied to geometry, physics or engineering, a coordinate system is a system for assigning a tuple of numbers to each point in an n-dimensional space. ...

References

  • Cilibrasi, R.; Paul Vitányi (2005). "Clustering by compression" (PDF). IEEE Transactions on Information Theory 51 (4): 1523-1545.
  • Coombs, C. H., Dawes, R. M. & Tversky, A. (1970), Mathematical Psychology: An Elementary Introduction, Prentice-Hall, Englewood Cliffs, NJ.
  • Cronbach L. J. (1954). On the non-rational application of information measures in psychology, in H Quastler, ed., Information Theory in Psychology: Problems and Methods, Free Press, Glencoe, Illinois, pp. 14—30.
  • Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989.
  • Guiasu, Silviu (1977), Information Theory with Applications, McGraw-Hill, New York.
  • Li, Ming; Paul Vitányi (1997). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag. ISBN 0387948686. 
  • Lockhead G. R. (1970). Identification and the form of multidimensional discrimination space, Journal of Experimental Psychology 85(1), 1-10.
  • Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill, 1984. (See Chapter 15.)
  • Press, W. H., Flannery, B. P., Teukolsky, S. A. & Vetterling, W. T. (1988), Numerical Recipes in C: The Art of Scientific Computing, Cambridge University Press, Cambridge.
  • Strehl, Alexander and Ghosh, Joydeep (2002). Cluster ensembles -- A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 3, 583-617.
  • Witten, Ian H. & Frank, Eibe (2005), Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, Amsterdam.
  • Yao, Y. Y. (2003) Information-theoretic measures for knowledge discovery and data mining, in Entropy Measures, Maximum Entropy Principle and Emerging Applications , Karmeshu (ed.), Springer, pp. 115-136.
  • Peng, H.C., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp.1226-1238, 2005. Program

  Results from FactBites:
 
Deborah Vakas Duong and John Grefenstette: SISTER (11186 words)
The average mutual information of 0.36 for the individual recognition is not significantly different from the average mutual information of 0.22 for the role recognition treatment.
The average mutual information in the role recognition treatment almost doubles, from 0.22 in the simple scenario to 0.40 in the composite good scenario, while the mutual information in the individual recognition treatment halves, from 0.36 in the simple to 0.17 in the composite good scenario.
The correlation between utility and mutual information in the individual treatment is 0.37, and is significant at the 94% level, while the correlation in the role treatment is.72.
Mutual information (269 words)
Note that when talking about mutual information, we use the continuous definition of entropy i.e., differential entropy, introduced in the previous section.
Mutual information is a measure of the information that members of a set of random variables have on the other random variables in the set.
An analogy is this: Upon arriving at the airport, we receive less information when the ticket agent tells us that (1) we missed the plane, and (2) the plane left 15 minutes ago, than either of those pieces of information would provide individually.
  More results at FactBites »


 
 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms, 1022, m