Cluster analysis, also called segmentation analysis or taxonomy analysis, creates groups, or clusters, of data. Clusters are formed in such a way that objects in the same cluster are similar and objects in different clusters are distinct. Measures of similarity depend on the application.
Hierarchical Clustering groups
data over a variety of scales by creating a cluster tree or dendrogram.
The tree is not a single set of clusters, but rather a multilevel
hierarchy, where clusters at one level are joined as clusters at the
next level. This allows you to decide the level or scale of clustering
that is most appropriate for your application. The Statistics and Machine
Learning Toolbox™ function
clusterdata performs all of the necessary
steps for you. It incorporates the pdist, linkage, and cluster functions,
which may be used separately for more detailed analysis. The
dendrogram function plots the cluster tree.
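As a minimal sketch, the following lines build and plot a cluster tree and then cut it into three clusters; the random data matrix X and the choice of three clusters are illustrative assumptions, not part of the toolbox workflow itself:

    rng(0);                       % for reproducibility
    X = rand(20,3);               % hypothetical 20-by-3 data matrix
    Y = pdist(X);                 % pairwise distances between observations
    Z = linkage(Y,'average');     % build the hierarchical cluster tree
    dendrogram(Z)                 % plot the tree
    T = cluster(Z,'maxclust',3);  % cut the tree into 3 clusters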
k-Means Clustering is a
partitioning method. The function kmeans partitions
data into k mutually exclusive clusters, and returns
the index of the cluster to which it has assigned each observation.
Unlike hierarchical clustering, k-means clustering
operates on actual observations (rather than the larger set of dissimilarity
measures), and creates a single level of clusters. These distinctions
mean that k-means clustering is often more suitable
than hierarchical clustering for large amounts of data.
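A minimal sketch, assuming a numeric data matrix X and k = 3 (both illustrative values):

    rng(0);                  % for reproducibility
    X = rand(100,2);         % hypothetical data matrix
    [idx,C] = kmeans(X,3);   % idx: cluster index per observation; C: centroid locations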
DBSCAN is a density-based
algorithm that identifies arbitrarily shaped clusters and outliers (noise) in data. The
dbscan function performs clustering on an input data matrix or on pairwise distances between
observations. dbscan returns the cluster indices and a vector
indicating the observations that are core points, which are points that have at least a
minimum number of neighbors (minpts) in
their epsilon neighborhood (epsilon).
Unlike k-means clustering, the DBSCAN algorithm does not require
prior knowledge of the number of clusters, and clusters are not necessarily spheroidal.
DBSCAN is also useful for density-based outlier detection, because it identifies points
that do not belong to any cluster.
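A minimal sketch, with illustrative (assumed) values for epsilon and minpts:

    rng(0);                                    % for reproducibility
    X = rand(200,2);                           % hypothetical data matrix
    epsilon = 0.1;                             % neighborhood radius (assumed value)
    minpts = 5;                                % neighbors required for a core point (assumed value)
    [idx,corepts] = dbscan(X,epsilon,minpts);  % idx is -1 for noise points; corepts flags core points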
Gaussian mixture models form clusters by representing the
probability density function of observed variables as a mixture of multivariate normal
densities. Mixture models of the
gmdistribution class use an expectation maximization (EM) algorithm to fit data, which assigns posterior
probabilities to each component density with respect to each observation. Clusters are
assigned by selecting the component that maximizes the posterior probability. Clustering
using Gaussian mixture models is sometimes considered a soft clustering method. The
posterior probabilities indicate that each data point has some
probability of belonging to each cluster. Like k-means clustering,
Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum.
Gaussian mixture modeling may be more appropriate than k-means
clustering when clusters have different sizes and correlation within them.
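A minimal sketch, assuming two hypothetical well-separated clusters: fitgmdist fits a gmdistribution object via EM, cluster makes the hard assignments, and posterior returns the soft assignments.

    rng(0);                              % for reproducibility
    X = [randn(100,2); randn(100,2)+3];  % hypothetical two-cluster data
    gm = fitgmdist(X,2);                 % fit a 2-component Gaussian mixture via EM
    idx = cluster(gm,X);                 % hard assignment: component with maximum posterior
    P = posterior(gm,X);                 % soft assignment: posterior probability per component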