FACTOID # 175: Canadians drink more fruit juice than the citizens of any other nation - more than one litre each, every week.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

FACTS & STATISTICS    Simple view

  1. Select countries to view: (hold down Control key and click to select several)

     

     

    Compare:

     

     

  1. Select fact or statistic: (* = graphable)

     

     

     

  2. (OPTIONAL) Compare to statistic: (both need to be graphable)

     

     

     

  3. View result as:

     

       
(OR) SEARCH ALL encyclopedia, stats & forums:   

Encyclopedia > Numerical taxonomy

Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Statistics is a type of data analysis whose practice includes the planning, summarizing, and interpreting of observations of a system possibly followed by predicting or forecasting of future events based on a mathematical model of the system being observed. ... Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to learn. More specifically, machine learning is a method for creating computer programs by the analysis of data sets. ... Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. ... For the William Gibson novel, see: Pattern Recognition (novel). ... Image analysis is the extraction of useful information from images; mainly from digital images by means of digital image processing techniques. ... Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. ... Classification may refer to: Taxonomic classification Statistical classification Hint: Language use may refer to a taxonomic classification that is used for statistical purposes also as a statistical classification (like International Statistical Classification of Diseases and Related Health Problems). ... A partition of U into 6 blocks: a Venn diagram representation. ... A is a subset of B If X and Y are sets and every element of X is also an element of Y, then we say or write: X is a subset of (or is included in) Y; X ⊆ Y; Y is a superset of (or includes) X; Y ⊇ X... In mathematics a metric or distance is a function which assigns a distance to elements of a set. ...


Besides the term data clustering, there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.


Data clustering algorithms can be hierarchical or partitional. With hierarchical algorithms, successive clusters are found using previously established clusters, whereas partitional algorithms determine all clusters in one go. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. The divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. A hierarchy (in Greek hieros = sacred, arkho = rule) is a system of ranking and organizing things. ... A partition of U into 6 blocks: a Venn diagram representation. ...

Contents


Hierarchical clustering

Introduction

Hierarchical clustering builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.)


Traditional representation Wikipedia does not have an article with this exact name. ...


Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.


Agglomerative hierarchical clustering

This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance d(element1,element2) between elements. personal space, proxemics. ...


Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following:

  • the maximum distance between elements of each cluster (also called complete linkage clustering)
  • the minimum distance between elements of each cluster (also called single linkage clustering)
  • the mean distance between elements of each cluster (also called average linkage clustering)
  • the sum of all intra cluster variance
  • the increase in variance for the cluster being merged (Ward's criterion)

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).


Partitional clustering

k-means and derivatives

k-means clustering

The k-means algorithm assigns each point to the cluster whose center (or centroid) is nearest. The centroid is the point generated by computing the arithmetic mean for each dimension separately for all the points in the cluster.

Example: The data set has three dimensions and the cluster has two points: X = (x1, y1, z1) and Y = (x2, y2, z2). Then the centroid Z becomes Z = (x3, y3, z3), where x3 = (x1 + x2)/2 and y3 = (y1 + y2)/2 and z3 = (z1 + z2)/2

This is the basic structure of the algorithm (J. MacQueen, 1967):

  • Randomly generate k clusters and determine the cluster centers or directly generate k seed points as cluster centers
  • Assign each point to the nearest cluster center.
  • Recompute the new cluster centers.
  • Repeat until some convergence criterion is met (usually that the assignment hasn't changed).

The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.


Fuzzy c-means clustering

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster.


Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients has to be one, so that uk(x) denotes a probability of belonging to a certain cluster: Fuzzy logic is an extension of Boolean logic dealing with the concept of partial truth. ...

With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is

The degree of being in a certain cluster is the inverse of the distance to the cluster

then the coefficients are normalized and fuzzyfied with a real parameter m > 1 so that their sum is 1. So


The fuzzy c-means algorithm is greatly similar to the k-means algorithm; when , algorithms are equivalent:

  • Choose a number of clusters
  • Assign randomly to each point coefficients for being in the clusters
  • Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
    • Compute the centroid for each cluster, using the formula above
    • For each point, compute its coefficients of being in the clusters, using the formula above

The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.


Applications

In biology has two main applications in the fields of computational biology and bioinformatics. Main article: Life There are many universal units and common processes that are fundamental to the known forms of life. ... Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. ...

The Transcriptome is the set of all mRNA molecules (or transcripts) in one or a population of biological cells for a given set of environmental circumstances. ... This stylistic schematic diagram shows a gene in relation to the double helix structure of DNA and to a chromosome (right). ... Neuraminidase ribbon diagram An enzyme (in Greek en = in and zyme = leaven) is a protein, or protein complex, that catalyzes a chemical reaction and also controls the 3D orientation of the catalyzed substrates. ... In biochemistry, a metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes, and resulting in either the formation of a metabolic product to be used or stored by the cell (metabolic sink), or the initiation of another metabolic pathway (then called a flux generating... An expressed sequence tag or EST is a short sub-sequence of a protein-coding DNA sequence. ... A DNA microarray, the different colours indicate relative expression of different genes. ... Genome annotation is related to: Human Genome Project Genome project Sequence alignment Functional annotation Sequence annotation Protein family Categories: Biology stubs ... Genomics is the study of an organisms genome and the use of the genes. ... In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. ... This is a list of gene families or gene complexes, that is sets of genes which occur across a number of different species which often serve similar biological functions. ... Evolutionary biology is a subfield of biology concerned with the origin and descent of species, as well as their change over time, i. ... Categories: Stub | Molecular genetics | Evolutionary biology ...

External links

  • Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv., 1999. Available here
  • for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering. Also has an explanation on mixture of Gaussians.
  • Eruditionhome Great directory site for data clustering.
  • http://members.tripod.com/asim_saeed/paper.htm

The normal distribution, also called Gaussian distribution, is an extremely important probability distribution in many fields, especially in physics and engineering. ...

See also

k-means, ANN, Self-organizing map, other clustering and mixture model links. The k-means algorithm is a variant of the Expectation-Maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions. ... ANN may mean: Artificial Neural Network All Night Nippon All-Nippon News Network This is a disambiguation page — a navigational aid which lists other pages that might otherwise share the same title. ... The self-organising map (SOM) is a method for unsupervised learning, based on a grid of artificial neurons whose weights are adapted to match input vectors in a training set. ...


  Results from FactBites:
 
Taxonomy - Wikipedia, the free encyclopedia (533 words)
Taxonomy (from Greek verb tassein = "to classify" and nomos = law, science, cf "economy") was once only the science of classifying living organisms (alpha taxonomy), but later the word was applied in a wider sense, and may also refer to either a classification of things, or the principles underlying the classification.
Such taxonomies as those analyzed by Durkheim and Lévi-Strauss are sometimes called folk taxonomies to distinguish them from scientific taxonomies that claim to be disembedded from social relations and thus objective and universal.
The field of solving or best-fitting of numerical equations that characterize all measurable quantities of a set of objects is called cluster analysis; this is a form of taxonomy called numerical taxonomy or taximetrics.
Publication list (1679 words)
Rohlf, F. A numerical taxonomic study of the genus Aedes (Diptera: Culicidae) with emphasis on the congruence of larval and adult classifications.
Coefficients of correlation and distance in numerical taxonomy.
Numerical taxonomy of soils from nine orders by cluster and centroid-component analyses.
  More results at FactBites »


 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments
Please enter the 5-letter protection code

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.