|
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Statistical classification is a type of supervised learning problem in which labeled training data is used to create a function that will correctly predict the label of future data. ...
A partition of U into 6 blocks: a Venn diagram representation. ...
A data set (or dataset) is a collection of data, usually presented in tabular form. ...
A is a subset of B, and B is a superset of A. In mathematics, especially in set theory, the terms, subset, superset and proper (or strict) subset or superset are used to describe the relation, called inclusion, of one set being contained inside another set. ...
In mathematics a metric or distance function is a function which defines a distance between elements of a set. ...
A graph of a Normal bell curve showing statistics used in educational assessment and comparing various grading methods. ...
Data analysis is the act of transforming data with the aim of extracting useful information and facilitating conclusions. ...
As a broad subfield of artificial intelligence, Machine learning is concerned with the development of algorithms and techniques that allow computers to learn. At a general level, there are two types of learning: inductive, and deductive. ...
Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc. ...
Pattern recognition is a field within the area of machine learning. ...
Image analysis is the extraction of meaningful information from images; mainly from digital images by means of digital image processing techniques. ...
Map of the human X chromosome (from the NCBI website). ...
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis. Types of clustering
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. A hierarchy (in Greek hieros = sacred, arkho = rule) is a system of ranking and organizing things. ...
A partition of U into 6 blocks: a Venn diagram representation. ...
Two-way clustering, co-clustering or bi-clustering are the names for clusterings where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the row and columns are clustered simultaneously. Data Matrix code example, encoding Wikipedia, the free encyclopedia A Data Matrix code is a two-dimensional matrix barcode consisting of black and white square modules arranged in either a square or rectangular pattern. ...
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. Around 300 BC, the Greek mathematician Euclid laid down the rules of what has now come to be called Euclidean geometry, which is the study of the relationships between angles and distances in space. ...
Distance measure An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance. Distance is a numerical description of how far apart objects are at any given moment in time. ...
Common distance functions: - the Euclidean distance (also called distance as the crow flies or 2-norm distance). A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.
- the Manhattan distance (also called taxicab norm or 1-norm)
- the maximum norm
- the Mahalanobis distance corrects data for different scales and correlations in the variables
- the angle between two vectors can be used as a distance measure when clustering high dimensional data. See Inner product space.
In mathematics, the Euclidean distance or Euclidean metric is the ordinary distance between the two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. ...
As the Crow Flies is a novel by Jeffrey Archer. ...
Taxicab geometry, considered by Hermann Minkowski in the 19th century, is a form of geometry in which the usual metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the (absolute) differences of their coordinates. ...
In mathematical analysis, the uniform norm assigns to real- or complex-valued functions f the nonnegative number This norm is also called the supremum norm or the Chebyshev norm. ...
In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. ...
In mathematics, an inner product space is a vector space with additional structure, an inner product (also called a scalar product), which allows us to introduce geometrical notions such as angles and lengths of vectors. ...
Hierarchical clustering Creating clusters Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the bottom. (In the figure, the arrows indicate an agglomerative clustering.) In computer science, a tree is a widely-used computer data structure that emulates a tree structure with a set of linked nodes. ...
A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by a clustering algorithm (see cluster analysis). ...
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
Agglomerative hierarchical clustering For example, suppose this data is to be clustered, and the euclidean distance is the distance metric. In mathematics, the Euclidean distance or Euclidean metric is the ordinary distance between the two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem. ...
In mathematics a metric or distance function is a function which defines a distance between elements of a set. ...
Raw data The hierarchical clustering dendrogram would be as such: Image File history File links Clusters. ...
A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by a clustering algorithm (see cluster analysis). ...
Traditional representation This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance. Wikipedia does not have an article with this exact name. ...
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of catching distances between clusters. In mathematics, a distance matrix is a matrix (two-dimensional array) containing the distances, taken pairwise, of a set of points. ...
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters and is one of the following: - The maximum distance between elements of each cluster (also called complete linkage clustering):
-
 - The minimum distance between elements of each cluster (also called single linkage clustering):
-
 - The mean distance between elements of each cluster (also called average linkage clustering):
-
 - The sum of all intra-cluster variance
- The increase in variance for the cluster being merged (Ward's criterion)
- The probability that candidate clusters spawn from the same distribution function (V-linkage)
Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). Wcdriscoll 15:22, 20 May 2007 (UTC)
Concept clustering Another variation of the agglomerative clustering approach is conceptual clustering. Conceptual clustering is a machine learning paradigm for unsupervised classification. ...
Partitional clustering k-means and derivatives K-means clustering. The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster... The K-means algorithm is an algorithm to cluster objects based on attributes into k partitions. ...
- Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.
The algorithm steps are (J. MacQueen, 1967): - Choose the number of clusters, k.
- Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
Fuzzy c-means clustering In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients is defined to be 1: Fuzzy clustering is a class of algorithm in computer science. ...
Fuzzy logic is derived from fuzzy set theory dealing with reasoning that is approximate rather than precisely deduced from classical predicate logic. ...
 With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:  The degree of belonging is related to the inverse of the distance to the cluster  then the coefficients are normalized and fuzzyfied with a real parameter m > 1 so that their sum is 1. So  For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means. The fuzzy c-means algorithm is very similar to the k-means algorithm: - Choose a number of clusters.
- Assign randomly to each point coefficients for being in the clusters.
- Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold) :
- Compute the centroid for each cluster, using the formula above.
- For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy-k-means. In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. ...
QT clustering algorithm QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is: - The user chooses a maximum diameter for clusters.
- Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold.
- Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
- Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters). A visual form of recursion known as the Droste effect. ...
Graph-theoretic methods Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970). A concept lattice for objects consisting of the integers from 1 to 10, and attributes composite, even, odd, prime, and square. ...
In the mathematical field of graph theory, a bipartite graph is a special graph where the set of vertices can be divided into two disjoint sets and such that no edge has both end-points in the same set. ...
In mathematics, a cover of a set X is a collection of subsets C of X whose union is X. In symbols, if C = {Uα : α â A} is an indexed family of subsets of X, then C is a cover if More generally, if Y is a subset of X and...
A partition of U into 6 blocks: a Venn diagram representation. ...
Elbow criterion The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering. It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance. Thus, it is appropriate to re-run the cluster analysis multiple times. A rule of thumb is an easily learned and easily applied procedure for approximately calculating or recalling some value, or for making some determination. ...
The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). This elbow can not always be unambigiously identified. On the following graph, the elbow is indicated by the red circle. The number of clusters chosen should therefore be 4. Percent of variance explained or percent of total variance is the ratio of within-group variance to total variance.
Image File history File links DataClustering_ElbowCriterion. ...
Spectral clustering Given a set of data points A, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points . Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to cluster the points. Sometimes such techniques are also used to perform dimensionality reduction for clustering in fewer dimensions. A similarity matrix is a matrix of scores which express the similarity between two data points. ...
In linear algebra, the eigenvectors (from the German eigen meaning inherent, characteristic) of a linear operator are non-zero vectors which, when operated on by the operator, result in a scalar multiple of themselves. ...
In statistics, dimensionality reduction is mapping a multidimensional space into a space of fewer dimensions. ...
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix In image analysis, segmentation is the partition of a digital image into multiple regions (sets of pixels), according to some criterion. ...
In linear algebra, the eigenvectors (from the German eigen meaning own) of a linear operator are non-zero vectors which, when operated on by the operator, result in a scalar multiple of themselves. ...
In mathematics, a number is called an eigenvalue of a matrix if there exists a nonzero vector such that the matrix times the vector is equal to the same vector multiplied by the eigenvalue. ...
In the mathematical field of graph theory the admittance matrix or Laplacian matrix is a matrix representation of a graph. ...
- L = I − D − 1 / 2SD − 1 / 2
of S, where D is the diagonal matrix -
This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering, by repeatedly partitioning the subsets in this fashion. A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = SD − 1 for some k, and then invokes another (e.g. k-means) to cluster points by their respective k components in these eigenvectors. In linear algebra, the eigenvectors (from the German eigen meaning own) of a linear operator are non-zero vectors which, when operated on by the operator, result in a scalar multiple of themselves. ...
In mathematics, a number is called an eigenvalue of a matrix if there exists a nonzero vector such that the matrix times the vector is equal to the same vector multiplied by the eigenvalue. ...
Applications Biology In biology clustering has many applications This article or section does not cite any references or sources. ...
- In imaging, data clustering may take different form based on the data dimensionality. For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computaitonal libraries.
- In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes
- In computational biology and bioinformatics:
The Statistics Online Computational Resource (SOCR) is a suite of online tools and interactive aids for hands-on learning and teaching concepts in statistical analyses and probability theory. ...
Divisions Green algae Chlorophyta Charophyta Land plants (embryophytes) Non-vascular plants (bryophytes) Marchantiophytaâliverworts Anthocerotophytaâhornworts Bryophytaâmosses Vascular plants (tracheophytes) â Rhyniophytaârhyniophytes â Zosterophyllophytaâzosterophylls Lycopodiophytaâclubmosses â Trimerophytophytaâtrimerophytes Pteridophytaâferns and horsetails Seed plants (spermatophytes) â Pteridospermatophytaâseed ferns Pinophytaâconifers Cycadophytaâcycads Ginkgophytaâginkgo Gnetophytaâgnetae Magnoliophytaâflowering plants...
Animalia redirects here. ...
This article or section does not cite any references or sources. ...
Biological systematics is the study of the diversity of life on the planet earth, both past and present, and the relationships among living things through time. ...
In biology, phylogenetics (Greek: phylon = tribe, race and genetikos = relative to birth, from genesis = birth) is the study of evolutionary relatedness among various groups of organisms (e. ...
Map of the human X chromosome (from the NCBI website). ...
The Transcriptome is the set of all mRNA molecules (or transcripts) in one or a population of biological cells for a given set of environmental circumstances. ...
This stylistic schematic diagram shows a gene in relation to the double helix structure of DNA and to a chromosome (right). ...
Ribbon diagram of the enzyme TIM, surrounded by the space-filling model of the protein. ...
In biochemistry, a metabolic pathway is a series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway (then called a flux generating step). ...
An expressed sequence tag or EST is a short sub-sequence of a transcribed spliced nucleotide sequence (either protein-coding or not). ...
Example of an approximately 40,000 probe spotted oligo microarray with enlarged inset to show detail. ...
Genome annotation is the process of attaching biological information to sequences. ...
Genomics is the study of an organisms entire genome. ...
The term sequence analysis in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence databases, repeated sequence searches, or other bioinformatics methods on a computer. ...
This is a list of gene families or gene complexes, that is sets of genes which occur across a number of different species which often serve similar biological functions. ...
This article or section does not cite any references or sources. ...
Schematic of a region of a chromosome before and after a duplication event Gene duplication occurs when an error in homologous recombination, a retrotransposition event, or duplication of an entire chromosome leads to the duplication of a region of DNA containing a gene [1]. The significance of this process for...
The genotype is the specific genetic makeup (the specific genome) of an individual, usually in the form of DNA. It codes for the phenotype of that individual. ...
Market research Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. Market research is the process of systematic gathering, recording and analyzing of data about customers, competitors and the market. ...
Statistical surveys are used to collect quantitative information about items in a population. ...
In economics, consumers are individuals or households that consume goods and services generated within the economy. ...
A customer is someone who purchases or rents something from an individual or organisation. ...
Target market is the market segment to which a particular product is marketed. ...
A products position is how potential buyers see the product. ...
In business and engineering, new product development is the complete process of bringing a new product to market. ...
Experimental research designs are used for the controlled testing of causal processes. ...
Other applications Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people. A social network is a map of the relationships between individuals, indicating the ways in which they are connected through various social familiarities ranging from casual acquaintance to close familial bonds. ...
Community is a set of people (or agents in a more abstract sense) with some shared element. ...
Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition. A digital system is one that uses discrete values (often electrical voltages), especially those representable as binary numbers, or non-numeric symbols such as letters or icons, for input, processing, transmission, storage, or display, rather than a continuous spectrum of values (ie, as in an analog system). ...
In common usage, an image (from Latin imago) or picture is an artifact that reproduces the likeness of some subjectâusually a physical object or a person. ...
The goal of edge detection is to mark the points in a digital image at which the luminous intensity changes sharply. ...
Computer vision is the study of methods which allow computers to understand images, or multidimensional data in general. ...
Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres. Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc. ...
WWWs historical logo designed by Robert Cailliau The World Wide Web (or the Web) is a system of interlinked, hypertext documents that runs over the Internet. ...
Slippy map optimization: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. This makes it both faster and reduces the amount of visual clutter. Flickr is a photo sharing website and web services suite, and an online community platform, which is generally considered an early example of a Web 2. ...
Comparisons between data clusterings There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Many of these measures are derived from the matching matrix (aka confusion matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures. A confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). ...
Description A confusion matrix is a visualization tool typically used in supervised learning. ...
The Rand index or Rand measure is a technique for measure of similarity between two data clusters. ...
Marina Meila's Variation of Information metric is a more recent approach for comparing distance between clusterings. It uses mutual information and entropy to approximate the distance between two clusterings across the lattice of possible clusterings. In probability theory and, in particular, information theory, the mutual information, or transinformation, of two random variables is a quantity that measures the mutual dependence of the two variables. ...
Ice melting - classic example of entropy increasing[1] described in 1862 by Rudolf Clausius as an increase in the disgregation of the molecules of the body of ice. ...
Algorithms In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are CLARANS (Ng and Han,1994), DBSCAN (Ester et al., 1996) and BIRCH (Zhang et al., 1996).
See also An artificial neural network (ANN), often just called a neural network (NN), is an interconnected group of artificial neurons that uses a mathematical model or computational model for information processing based on a connectionist approach to computation. ...
In statistics, and especially in biostatistics, cophenetic correlation (more precisely, the cophenetic correlation coefficient) is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. ...
In statistical computing, an expectation-maximization (EM) algorithm is an algorithm for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. ...
The k-means algorithm is a variant of the Expectation-Maximization algorithm in which the goal is to determine the k means of data generated from Gaussian distributions. ...
The technique is also used in marketing, see Multidimensional scaling in marketing Multidimensional scaling (MDS) are a set of related statistical techniques often used in data visualisation for exploring similarities or dissimilarities in a given data set. ...
The self-organizing map (SOM) is a subtype of artificial neural networks. ...
Bibliography - Abdi, H. "(1994). Additive-tree representations (with an application to face processing) Lecture Notes in Biomathematics, 84, 43-59.".
- Clatworthy, J., Buick, D., Hankins, M., Weinman, J., & Horne, R. (2005). The use and reporting of cluster analysis in health psychology: A review. British Journal of Health Psychology 10: 329-358.
- Cole, A. J. & Wishart, D. (1970). An improved algorithm for the Jardine-Sibson method of generating overlapping clusters. The Computer Journal 13(2):156-163.
- Ester, M., Kriegel, H.P., Sander, J., and Xu, X. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, USA: AAAI Press, pp. 226–231.
- Heyer, L.J., Kruglyak, S. and Yooseph, S., Exploring Expression Data: Identification and Analysis of Coexpressed Genes, Genome Research 9:1106-1115.
- Huang, Z. (1998). Extensions to the K-means Algorithm for Clustering Large Datasets with Categorical Values. Data Mining and Knowledge Discovery, 2, p. 283-304.
- Jardine, N. & Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications. The Computer Journal 11:177.
- The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay includes chapters on k-means clustering, soft k-means clustering, and derivations including the E-M algorithm and the variational view of the E-M algorithm.
- MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations, Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297
- Ng, R.T. and Han, J. 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the 20th VLDB Conference, Santiago, Chile, pp. 144–155.
- Prinzie A., D. Van den Poel (2006), Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM. Decision Support Systems 42 (2): 508-526.
- Romesburg, H. Clarles, Cluster Analysis for Researchers, 2004, 340 pp. ISBN 1-4116-0617-5 or publisher, reprint of 1990 edition published by Krieger Pub. Co... A Japanese language translation is available from Uchida Rokakuho Publishing Co., Ltd., Tokyo, Japan.
- Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: An efficient data clustering method for very large databases. Proceedings of ACM SIGMOD Conference, Montreal, Canada, pp. 103–114.
For spectral clustering : David J.C. MacKay (born April 22, 1967) is the professor of natural philosophy in the department of Physics at the University of Cambridge. ...
- Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000. Available on Jitendra Malik's homepage
- Marina Meila and Jianbo Shi, "Learning Segmentation with Random Walk", Neural Information Processing Systems, NIPS, 2001. Available from Jianbo Shi's homepage
- see referenced articles here
For estimating number of clusters: - Can, F., Ozkarahan, E. A. (1990) "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. 15 (4) 483-517.
For discussion of the elbow criterion: - Aldenderfer, M.S., Blashfield, R.K, Cluster Analysis, (1984), Newbury Park (CA): Sage.
External links - P. Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, 2002.
- Jain, Murty and Flynn: Data Clustering: A Review, ACM Comp. Surv., 1999.
- for another presentation of hierarchical, k-means and fuzzy c-means see this introduction to clustering. Also has an explanation on mixture of Gaussians.
- David Dowe, Mixture Modelling page - other clustering and mixture model links.
- a tutorial on clustering [1]
- The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David J.C. MacKay includes chapters on k-means clustering, soft k-means clustering, and derivations including the E-M algorithm and the variational view of the E-M algorithm.
- An overview of non-parametric clustering and computer vision
- "The Self-Organized Gene", tutorial explaining clustering through competitive learning and self-organizing maps.
- kernlab - R package for kernel based machine learning (includes spectral clustering implementation)
- Tutorial - Tutorial with introduction of Clustering Algorithms (k-means, fuzzy-c-means, hierarchical, mixture of gaussians) + some interactive demos (java applets)
|