FACTOID # 61: Indonesia contains the most known mammal species - and the most mammal species under threat.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

FACTS & STATISTICS    Simple view

  1. Select countries to view: (hold down Control key and click to select several)

     

     

    Compare:

     

     

  1. Select fact or statistic: (* = graphable)

     

     

     

  2. (OPTIONAL) Compare to statistic: (both need to be graphable)

     

     

     

  3. View result as:

     

       
(OR) SEARCH ALL encyclopedia, stats & forums:   

Encyclopedia > Mahalanobis distance

In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e. not dependent on the scale of measurements. A graph of a normal bell curve showing statistics used in educational assessment and comparing various grading methods. ... Distance is a numerical description of how far apart objects are at any given moment in time. ... Prasanta Chandra Mahalanobis (born June 29, 1893, died June 28, 1972) was an Indian scientist and applied statistician. ... 1936 (MCMXXXVI) was a leap year starting on Wednesday (link will take you to calendar). ... Positive linear correlations between 1000 pairs of numbers. ... This article is about statistical samples. ... In mathematics, Euclidean geometry is the familiar kind of geometry on the plane or in three dimensions. ... A data set (or dataset) is a collection of data, usually presented in tabular form. ...


Formally, the Mahalanobis distance from a group of values with mean mu = ( mu_1, mu_2, mu_3, dots , mu_p )^T and covariance matrix Ρ for a multivariate vector x = ( x_1, x_2, x_3, dots, x_p )^T is defined as: In statistics and probability theory, the covariance matrix is a matrix of covariances between elements of a vector. ...

D_M(x) = sqrt{(x - mu)^T Rho^{-1} (x-mu)}.,

Mahalanobis distance can also be defined as dissimilarity measure between two random vectors  vec{x} and  vec{y} of the same distribution with the covariance matrix Ρ : A multivariate random variable or random vector is a vector X=(X1,...,Xn) whose components are scalar-valued random variables on the same probability space (Ω, P). ... In mathematics and statistics, a probability distribution is a function of the probabilities of a mutually exclusive and exhaustive set of events. ...

 d(vec{x},vec{y})=sqrt{(vec{x}-vec{y})^T Rho^{-1} (vec{x}-vec{y})}.,

If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance: In mathematics, the Euclidean distance or Euclidean metric is the ordinary distance between two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. ...

 d(vec{x},vec{y})= sqrt{sum_{i=1}^p {(x_i - y_i)^2 over sigma_i^2}},

where σi is the standard deviation of the xi over the sample set. In probability and statistics, the standard deviation of a probability distribution, random variable, or population or multiset of values is a measure of the spread of its values. ...

Contents

Intuitive explanation

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the average or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set. Around 300 BC, the Greek mathematician Euclid laid down the rules of what has now come to be called Euclidean geometry, which is the study of the relationships between angles and distances in space. ...


However, we also need to know how large the set is. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we conclude that it is highly probable that the test point belongs to the set. The farther away it is, the more likely that the test point should not be classified as belonging to the set. In probability and statistics, the standard deviation of a probability distribution, random variable, or population or multiset of values is a measure of the spread of its values. ...


This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be  {x - mu} over sigma . By plugging this into the normal distribution we get the probability of the test point belonging to the set.


The drawback of the above approach was that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.


Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.


Relationship to leverage

Mahalanobis distance is closely related to the leverage statistic h. The Mahalanobis distance of a data point from the centroid of a multivariate data set is (N − 1) times the leverage of that data point, where N is the number of data points in the set.


Applications

Mahalanobis distance is widely used in cluster analysis and other classification techniques. It is closely related to Hotelling's T-square distribution used for multivariate statistical testing. Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. ... Statistical classification is a type of supervised learning problem in which labeled training data is used to create a function that will correctly predict the label of future data. ... In statistics, Hotellings T-square statistic, named for Harold Hotelling, is a generalization of Students t statistic that is used in multivariate hypothesis testing. ...


In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal. Using the probabilistic interpretation given above, this is equivalent to selecting the class with the highest probability.


Also, Mahalanobis distance and leverage are often used to detect outliers especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation. Figure 1. ... In statistics, linear regression is a regression method that models the relationship between a dependent variable Y, independent variables Xp, and a random term ε. The model can be written as where β1 is the intercept (constant term), the βis are the respective parameters of independent variables, and p is the...


References

  • P.C. Mahalanobis, On the generalised distance in statistics, Proceedings of the National Institute of Science of India 12 (1936) 49-55

  Results from FactBites:
 
Mahalanobis distance - Wikipedia, the free encyclopedia (736 words)
The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.
The Mahalanobis distance of a data point from the centroid of a multivariate data set is (N − 1) times the leverage of that data point, where N is the number of data points in the set.
A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation.
Thermo Scientific - Algorithms - Discriminant Analysis, Mahalanobis Distance (1588 words)
The Mahalanobis distance is a very useful way of determining the "similarity" of a set of values from an "unknown: sample to a set of values measured from a collection of "known" samples.
In addition, since the Mahalanobis distance is measured in terms of standard deviations from the mean of the training samples, the reported matching values give a statistical measure of how well the spectrum of the unknown sample matches (or does not match) the original training spectra.
The Mahalanobis distance constructs a space that weights the variation in the sample along the axis of elongation less than in the shorter axis of the group ellipse.
  More results at FactBites »


 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments
Please enter the 5-letter protection code

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.