|
Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Statistics is a type of data analysis whose practice includes the planning, summarizing, and interpreting of observations of a system possibly followed by predicting or forecasting of future events based on a mathematical model of the system being observed. ...
Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to learn. More specifically, machine learning is a method for creating computer programs by the analysis of data sets. ...
Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. ...
For the William Gibson novel, see: Pattern Recognition (novel). ...
Image analysis is the extraction of useful information from images; mainly from digital images by means of digital image processing techniques. ...
Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. ...
Classification may refer to: Taxonomic classification Statistical classification Hint: Language use may refer to a taxonomic classification that is used for statistical purposes also as a statistical classification (like International Statistical Classification of Diseases and Related Health Problems). ...
A partition of U into 6 blocks: a Venn diagram representation. ...
A is a subset of B If X and Y are sets and every element of X is also an element of Y, then we say or write: X is a subset of (or is included in) Y; X ⊆ Y; Y is a superset of (or includes) X; Y ⊇ X...
In mathematics a metric or distance is a function which assigns a distance to elements of a set. ...
Besides the term data clustering, there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis. Data clustering algorithms can be hierarchical or partitional. With hierarchical algorithms, successive clusters are found using previously established clusters, whereas partitional algorithms determine all clusters in one go. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them in successively larger clusters. The divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. A hierarchy (in Greek hieros = sacred, arkho = rule) is a system of ranking and organizing things. ...
A partition of U into 6 blocks: a Venn diagram representation. ...
Hierarchical clustering Introduction Hierarchical clustering builds a hierarchy of clusters. The traditional representation of this hierarchy is a tree, with individual elements at one end and a single cluster with every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.)
Wikipedia does not have an article with this exact name. ...
Cutting the tree at a given height will give a clustering at a selected precision. In the above example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, but with fewer clusters.
Agglomerative hierarchical clustering This method builds the hierarchy from the individual elements by progressively merging clusters. Again, we have six elements {a} {b} {c} {d} {e} and {f}. the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance d(element1,element2) between elements. personal space, proxemics. ...
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. But to do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following: - the maximum distance between elements of each cluster (also called complete linkage clustering)
- the minimum distance between elements of each cluster (also called single linkage clustering)
- the mean distance between elements of each cluster (also called average linkage clustering)
- the sum of all intra cluster variance
- the increase in variance for the cluster being merged (Ward's criterion)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion).
Partitional clustering k-means and derivatives k-means clustering The k-means algorithm assigns each point to the cluster whose center (or centroid) is nearest. The centroid is the point generated by computing the arithmetic mean for each dimension separately for all the points in the cluster. - Example: The data set has three dimensions and the cluster has two points: X = (x1, y1, z1) and Y = (x2, y2, z2). Then the centroid Z becomes Z = (x3, y3, z3), where x3 = (x1 + x2)/2 and y3 = (y1 + y2)/2 and z3 = (z1 + z2)/2
This is the basic structure of the algorithm (J. MacQueen, 1967): - Randomly generate k clusters and determine the cluster centers or directly generate k seed points as cluster centers
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Yet it does not systematically yield the same result with each run of the algorithm. Rather, the resulting clusters depend on the initial assignments. The k-means algorithm maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the solution given is not a local minimum of variance.
Fuzzy c-means clustering One of the problems of the k-means algorithm is that it gives a hard partitioning of the data, that is to say that each point is attributed to one and only one cluster. But points on the edge of the cluster, or near another cluster may not be as much in the cluster as point in the center of cluster. Therefore, in fuzzy clustering, each point does not pertain to a given cluster, but has a degree of belonging to a certain cluster, as in fuzzy logic. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients has to be one, so that uk(x) denotes a probability of belonging to a certain cluster: Fuzzy logic is an extension of Boolean logic dealing with the concept of partial truth. ...
With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is The degree of being in a certain cluster is the inverse of the distance to the cluster then the coefficients are normalized and fuzzyfied with a real parameter m > 1 so that their sum is 1. So The fuzzy c-means algorithm is greatly similar to the k-means algorithm; when , algorithms are equivalent: - Choose a number of clusters
- Assign randomly to each point coefficients for being in the clusters
- Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
- Compute the centroid for each cluster, using the formula above
- For each point, compute its coefficients of being in the clusters, using the formula above
The fuzzy c-means algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is local minimum, and the results depend on the initial choice of weights.
Applications In biology has two main applications in the fields of computational biology and bioinformatics. Main article: Life There are many universal units and common processes that are fundamental to the known forms of life. ...
Bioinformatics or computational biology is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. ...
The Transcriptome is the set of all mRNA molecules (or transcripts) in one or a population of biological cells for a given set of environmental circumstances. ...
This stylistic schematic diagram shows a gene in relation to the double helix structure of DNA and to a chromosome (right). ...
Neuraminidase ribbon diagram An enzyme (in Greek en = in and zyme = leaven) is a protein, or protein complex, that catalyzes a chemical reaction and also controls the 3D orientation of the catalyzed substrates. ...
In biochemistry, a metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes, and resulting in either the formation of a metabolic product to be used or stored by the cell (metabolic sink), or the initiation of another metabolic pathway (then called a flux generating...
An expressed sequence tag or EST is a short sub-sequence of a protein-coding DNA sequence. ...
A DNA microarray, the different colours indicate relative expression of different genes. ...
Genome annotation is related to: Human Genome Project Genome project Sequence alignment Functional annotation Sequence annotation Protein family Categories: Biology stubs ...
Genomics is the study of an organisms genome and the use of the genes. ...
In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. ...
This is a list of gene families or gene complexes, that is sets of genes which occur across a number of different species which often serve similar biological functions. ...
Evolutionary biology is a subfield of biology concerned with the origin and descent of species, as well as their change over time, i. ...
Categories: Stub | Molecular genetics | Evolutionary biology ...
External links - Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv., 1999. Available here
- for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering. Also has an explanation on mixture of Gaussians.
- Eruditionhome Great directory site for data clustering.
- http://members.tripod.com/asim_saeed/paper.htm
The normal distribution, also called Gaussian distribution, is an extremely important probability distribution in many fields, especially in physics and engineering. ...
See also k-means, ANN, Self-organizing map, other clustering and mixture model links. The k-means algorithm is a variant of the Expectation-Maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions. ...
ANN may mean: Artificial Neural Network All Night Nippon All-Nippon News Network This is a disambiguation page — a navigational aid which lists other pages that might otherwise share the same title. ...
The self-organising map (SOM) is a method for unsupervised learning, based on a grid of artificial neurons whose weights are adapted to match input vectors in a training set. ...
|