Estimation of covariance matrices

In multivariate statistics, the importance of the Wishart distribution stems in part from the fact that it is the probability distribution of the maximum likelihood estimator of the covariance matrix of a multivariate normal distribution. Although no one is surprised that the estimator of the population covariance matrix is simply the sample covariance matrix, the mathematical derivation is perhaps not widely known and is surprisingly subtle and elegant.

The multivariate normal distribution

A random vector X ∈ R^{p×1} (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix Σ precisely if Σ ∈ R^{p×p} is a positive-definite matrix and the probability density function of X is

$$f(x)=[\mathrm{constant}]\cdot \det(\Sigma)^{-1/2} \exp\left(-{1 \over 2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

where μ ∈ R^{p×1} is the expected value. The matrix Σ is the higher-dimensional analog of what in one dimension would be the variance.
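As a minimal sketch of how this density might be evaluated numerically, the following works on the log scale and takes the constant to be (2π)^{-p/2}, the standard normalizing constant that the text leaves unspecified. The function name `mvn_logpdf` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def mvn_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x, for positive-definite sigma."""
    p = mu.shape[0]
    diff = x - mu
    # slogdet returns (sign, log|det|); the sign is +1 for positive-definite sigma
    _, logdet = np.linalg.slogdet(sigma)
    # Solve sigma @ z = diff rather than forming the inverse explicitly
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + quad)

# Standard bivariate normal evaluated at its mean
mu = np.zeros(2)
sigma = np.eye(2)
value = mvn_logpdf(np.zeros(2), mu, sigma)
```

Working with the logarithm avoids underflow when p is large, and solving a linear system is generally preferable to inverting Σ.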


Maximum-likelihood estimation

Suppose now that X_1, ..., X_n are independent and identically distributed with the distribution above. Based on the observed values x_1, ..., x_n of this sample, we wish to estimate Σ (we adhere to the convention of writing random variables as capital letters and data as lower-case letters).


First steps

It is fairly readily shown that the maximum-likelihood estimate of the expected value μ is the "sample mean"

$$\overline{x}=(x_1+\cdots+x_n)/n.$$

See the section on estimation in the article on the normal distribution for details; the process here is similar.


Since the estimate of μ does not depend on Σ, we can just substitute it for μ in the likelihood function

$$
\begin{aligned}
L(\mu,\Sigma) &= [\mathrm{constant}]\cdot \prod_{i=1}^n \det(\Sigma)^{-1/2} \exp\left(-{1 \over 2} (x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right) \\
&\propto \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \sum_{i=1}^n (x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right)
\end{aligned}
$$

and then seek the value of Σ that maximizes this.


We have

$$L(\overline{x},\Sigma) \propto \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \sum_{i=1}^n (x_i-\overline{x})^T \Sigma^{-1} (x_i-\overline{x})\right).$$

The trace of a 1 × 1 matrix

Now we come to the first surprising step.


Regard the scalar $(x_i-\overline{x})^T \Sigma^{-1} (x_i-\overline{x})$ as the trace of a 1×1 matrix!


This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get

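$$
\begin{aligned}
L(\overline{x},\Sigma) &\propto \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \sum_{i=1}^n \operatorname{tr}\left((x_i-\overline{x})^T \Sigma^{-1} (x_i-\overline{x})\right) \right) \\
&= \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \sum_{i=1}^n \operatorname{tr}\left((x_i-\overline{x}) (x_i-\overline{x})^T \Sigma^{-1}\right) \right)
\end{aligned}
$$

(so now we are taking the trace of a p×p matrix!)

$$
\begin{aligned}
&= \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \operatorname{tr} \left( \sum_{i=1}^n (x_i-\overline{x}) (x_i-\overline{x})^T \Sigma^{-1} \right) \right) \\
&= \det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \operatorname{tr} \left( S \Sigma^{-1} \right) \right)
\end{aligned}
$$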

where

$$S=\sum_{i=1}^n (x_i-\overline{x}) (x_i-\overline{x})^T \in \mathbf{R}^{p\times p}.$$
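The trace step can be checked numerically. The sketch below, on simulated data with an arbitrary positive-definite candidate for Σ (both are assumptions made only for illustration), compares the sum of the n scalar quadratic forms with the trace of the p×p matrix S Σ^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
x = rng.normal(size=(n, p))          # rows are the observations x_i
xbar = x.mean(axis=0)
centered = x - xbar

# An arbitrary positive-definite candidate for Sigma
a = rng.normal(size=(p, p))
sigma = a @ a.T + p * np.eye(p)
sigma_inv = np.linalg.inv(sigma)

# Left side: sum over i of (x_i - xbar)^T Sigma^{-1} (x_i - xbar)
quad_sum = sum(c @ sigma_inv @ c for c in centered)

# Right side: tr(S Sigma^{-1}) with S = sum_i (x_i - xbar)(x_i - xbar)^T
s = centered.T @ centered
trace_form = np.trace(s @ sigma_inv)
```

The two quantities agree up to floating-point rounding, which is what the identity tr(AB) = tr(BA) predicts.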

Using the spectral theorem

It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix S has a unique positive-definite symmetric square root S^{1/2}. We can again use the "cyclic property" of the trace to write

$$\det(\Sigma)^{-n/2} \exp\left(-{1 \over 2} \operatorname{tr} \left( S^{1/2} \Sigma^{-1} S^{1/2} \right) \right).$$

Let B = S^{1/2} Σ^{-1} S^{1/2}. Then the expression above becomes

$$\det(S)^{-n/2} \det(B)^{n/2} \exp\left(-{1 \over 2} \operatorname{tr} (B) \right).$$
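A minimal sketch of this change of variables, assuming random positive-definite stand-ins for S and Σ: the square root is built from the eigendecomposition, and the determinant and trace identities are then checked on the log scale. The helper `sym_sqrt` is a hypothetical name introduced here.

```python
import numpy as np

def sym_sqrt(s):
    """Symmetric positive-definite square root via the spectral decomposition."""
    eigvals, eigvecs = np.linalg.eigh(s)          # s = V diag(eigvals) V^T
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(1)
p, n = 3, 20
a = rng.normal(size=(p, p))
s = a @ a.T + np.eye(p)                           # stand-in for S
c = rng.normal(size=(p, p))
sigma = c @ c.T + np.eye(p)                       # stand-in for Sigma

root = sym_sqrt(s)
b = root @ np.linalg.inv(sigma) @ root            # B = S^{1/2} Sigma^{-1} S^{1/2}

def logdet(m):
    return np.linalg.slogdet(m)[1]

# log of det(Sigma)^{-n/2} versus log of det(S)^{-n/2} det(B)^{n/2}
log_lhs = -0.5 * n * logdet(sigma)
log_rhs = -0.5 * n * logdet(s) + 0.5 * n * logdet(b)

# tr(S Sigma^{-1}) versus tr(B)
trace_lhs = np.trace(s @ np.linalg.inv(sigma))
trace_rhs = np.trace(b)
```

Both pairs agree up to rounding, so the likelihood depends on Σ only through B once S is fixed.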

The positive-definite matrix B can be diagonalized, and then the problem of finding the value of B that maximizes

$$\det(B)^{n/2} \exp\left(-{1 \over 2} \operatorname{tr} (B) \right)$$

reduces to the problem of finding, for each i, the value of the diagonal entry λ_i (among λ_1, ..., λ_p) that maximizes

$$\lambda_i^{n/2} \exp(-\lambda_i/2).$$

This is just a calculus problem and we get λ_i = n, so that B = n I_p, i.e., n times the p×p identity matrix.
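The calculus step can be written out as follows. Since the determinant is the product of the eigenvalues and the trace is their sum, the objective factors over the λ_i, and each factor can be maximized on the log scale. The function name g is introduced here only for this worked step.

```latex
\det(B)^{n/2}\exp\left(-\tfrac{1}{2}\operatorname{tr}(B)\right)
  = \prod_{i=1}^p \lambda_i^{n/2}\exp(-\lambda_i/2),
\qquad
g(\lambda) = \frac{n}{2}\ln\lambda - \frac{\lambda}{2},
\\[6pt]
g'(\lambda) = \frac{n}{2\lambda} - \frac{1}{2} = 0
  \;\Longrightarrow\; \lambda = n,
\qquad
g''(\lambda) = -\frac{n}{2\lambda^{2}} < 0 .
```

The negative second derivative confirms that λ = n is a maximum rather than a minimum.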


Concluding steps

Finally we get

$$\Sigma=S^{1/2} B^{-1} S^{1/2}=S^{1/2}((1/n)I_p)S^{1/2}=S/n,$$

i.e., the p×p "sample covariance matrix"

$${S \over n} = {1 \over n}\sum_{i=1}^n (X_i-\overline{X})(X_i-\overline{X})^T$$

is the maximum-likelihood estimator of the "population covariance matrix" Σ. At this point we are using a capital X rather than a lower-case x because we are thinking of it "as an estimator rather than as an estimate", i.e., as something random whose probability distribution we could profit by knowing. This random matrix can be shown to have a Wishart distribution with n − 1 degrees of freedom.
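The estimator itself is short to compute. The sketch below assumes the data are stored with one observation per row; `mle_covariance` is a hypothetical helper name, and the mean vector and covariance used to simulate data are arbitrary illustrative values.

```python
import numpy as np

def mle_covariance(x):
    """Return (xbar, S/n) for data x with one observation per row."""
    n = x.shape[0]
    xbar = x.mean(axis=0)
    centered = x - xbar
    s = centered.T @ centered        # S = sum_i (x_i - xbar)(x_i - xbar)^T
    return xbar, s / n

rng = np.random.default_rng(2)
true_sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
x = rng.multivariate_normal(mean=[1.0, -1.0], cov=true_sigma, size=500)
xbar, sigma_hat = mle_covariance(x)

# NumPy's own estimator with the 1/n normalization, for comparison
sigma_np = np.cov(x, rowvar=False, bias=True)
```

Note that `np.cov` divides by n − 1 by default; passing `bias=True` selects the 1/n normalization that matches the maximum-likelihood estimator.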


Alternative derivation

An alternative derivation of the maximum likelihood estimator can be performed via matrix calculus formulae (see also differential of a determinant and differential of the inverse matrix). It also verifies the aforementioned fact about the maximum likelihood estimate of the mean. Re-write the likelihood in the log form using the trace trick:

$$\ln L(\mu,\Sigma) = \operatorname{const} -{n \over 2} \ln \det(\Sigma) -{1 \over 2} \operatorname{tr} \left[ \Sigma^{-1} \sum_{i=1}^n (x_i-\mu) (x_i-\mu)^T \right].$$

The differential of this log-likelihood is

$$d \ln L(\mu,\Sigma) = -{n \over 2} \operatorname{tr} \left[ \Sigma^{-1} \left\{ d \Sigma \right\} \right] -{1 \over 2} \operatorname{tr} \left[ - \Sigma^{-1} \{ d \Sigma \} \Sigma^{-1} \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^T - 2 \Sigma^{-1} \sum_{i=1}^n (x_i - \mu) \{ d \mu \}^T \right].$$

It naturally breaks down into a part related to the estimation of the mean and a part related to the estimation of the covariance matrix. The first order condition for a maximum, d ln L(μ,Σ) = 0, is satisfied when the terms multiplying dμ and dΣ are identically zero. Assuming (the maximum likelihood estimate of) Σ is non-singular, the first order condition for the estimate of the mean vector is

$$\sum_{i=1}^n (x_i - \mu) = 0,$$

which leads to the maximum likelihood estimator

$$\widehat \mu = \bar X = {1 \over n} \sum_{i=1}^n X_i.$$

This lets us simplify $\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^T = \sum_{i=1}^n (x_i-\bar x)(x_i-\bar x)^T = S$ as defined above. Then the terms involving dΣ in d ln L can be combined as

$$-{1 \over 2} \operatorname{tr} \left( \Sigma^{-1} \left\{ d \Sigma \right\} \left[ nI_p - \Sigma^{-1} S \right] \right).$$

The first order condition d ln L(μ,Σ) = 0 will hold when the term in the square bracket is (matrix-valued) zero. Pre-multiplying the latter by Σ and dividing by n gives

$$\widehat \Sigma = {1 \over n} S,$$

which of course coincides with the canonical derivation given earlier.
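Both derivations can be sanity-checked numerically. The sketch below, on simulated data (an assumption for illustration), evaluates the bracketed first-order term n I_p − Σ^{-1} S at Σ = S/n and confirms that small symmetric perturbations of S/n do not increase the log-likelihood. The helper `profile_loglik` is a hypothetical name.

```python
import numpy as np

def profile_loglik(sigma, s, n):
    """ln L(xbar, sigma) up to an additive constant."""
    _, logdet = np.linalg.slogdet(sigma)
    # tr(sigma^{-1} S), computed without forming the inverse
    return -0.5 * n * logdet - 0.5 * np.trace(np.linalg.solve(sigma, s))

rng = np.random.default_rng(3)
n, p = 40, 3
x = rng.normal(size=(n, p))
centered = x - x.mean(axis=0)
s = centered.T @ centered
sigma_hat = s / n

# First-order condition: n I_p - Sigma^{-1} S should vanish at Sigma = S/n
foc = n * np.eye(p) - np.linalg.solve(sigma_hat, s)

# Small symmetric perturbations of S/n should only lower the log-likelihood
best = profile_loglik(sigma_hat, s, n)
perturbed = []
for _ in range(100):
    e = rng.normal(size=(p, p))
    perturbed.append(profile_loglik(sigma_hat + 0.01 * (e + e.T) / 2, s, n))
```

This is a spot check on random directions, not a proof; the derivations above are what establish the maximum.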


Shrinkage estimation

If the sample size n is small and the number of considered variables p is large, the above empirical estimators of covariance and correlation are very inefficient. Specifically, it is possible to furnish estimators that improve considerably upon the maximum likelihood estimate in terms of mean squared error. Moreover, for n ≤ p, the empirical estimate of the covariance matrix becomes singular, i.e. it cannot be inverted to compute the precision matrix.
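The singularity is easy to observe on simulated data. Because the observations are centered at the sample mean, the empirical estimate has rank at most min(p, n − 1). The sketch below uses arbitrary illustrative values of n and p, and an explicit tolerance chosen here only to make the rank computation robust to rounding.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5
ranks = {}
for n in (3, 5, 6, 50):
    x = rng.normal(size=(n, p))
    centered = x - x.mean(axis=0)
    sigma_hat = centered.T @ centered / n
    # Rank of the p x p empirical covariance estimate
    ranks[n] = int(np.linalg.matrix_rank(sigma_hat, tol=1e-10))
```

With p = 5, the estimate only reaches full rank once n − 1 is at least p, so the cases n = 3 and n = 5 give a singular matrix.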


As an alternative, many methods have been suggested to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage. This is implicit in Bayesian methods and in penalized maximum likelihood methods, and explicit in the Stein-type shrinkage approach.


A simple version of a shrinkage estimator of the covariance matrix is constructed as follows. One considers a convex combination of the empirical estimator with some suitably chosen target, e.g., a diagonal matrix. Subsequently, the mixing parameter is selected to maximize the expected accuracy of the shrunken estimator. This can be done by cross-validation, or by using an analytic estimate of the shrinkage intensity. The resulting regularized estimator can be shown to outperform the maximum likelihood estimator for small samples. For large samples, the shrinkage intensity will reduce to zero, hence in this case the shrinkage estimator will be identical to the empirical estimator. Apart from increased efficiency, the shrinkage estimate has the additional advantage that it is always positive definite and well conditioned.
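A minimal sketch of the convex combination, under two assumptions not fixed by the text: the target is taken to be the diagonal of the empirical estimate itself, and the shrinkage intensity is a hand-picked constant rather than a cross-validated or analytic value. The function `shrink_covariance` is a hypothetical helper and is not the interface of the corpcor package.

```python
import numpy as np

def shrink_covariance(sigma_hat, intensity):
    """Convex combination of sigma_hat with its own diagonal as target."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    target = np.diag(np.diag(sigma_hat))      # keep variances, drop covariances
    return intensity * target + (1.0 - intensity) * sigma_hat

rng = np.random.default_rng(5)
n, p = 10, 20                                 # fewer observations than variables
x = rng.normal(size=(n, p))
centered = x - x.mean(axis=0)
sigma_hat = centered.T @ centered / n         # singular, since n <= p

shrunk = shrink_covariance(sigma_hat, intensity=0.2)   # 0.2 is an arbitrary choice

min_eig_mle = np.linalg.eigvalsh(sigma_hat).min()
min_eig_shrunk = np.linalg.eigvalsh(shrunk).min()
```

The smallest eigenvalue of the empirical estimate is zero up to rounding, while that of the shrunken estimate is strictly positive, so the shrunken matrix can be inverted. An intensity of zero returns the empirical estimate unchanged.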


A review on this topic is given, e.g., in [1].


A covariance shrinkage estimator is implemented in the R package "corpcor".


The Wikipedia article included on this page is licensed under the GFDL.