Package: clustering.evaluation
Superclasses: clustering.evaluation.ClusterCriterion
Calinski-Harabasz criterion clustering evaluation object
clustering.evaluation.CalinskiHarabaszEvaluation is an object consisting of sample data, clustering data, and Calinski-Harabasz criterion values used to evaluate the optimal number of clusters. Create a Calinski-Harabasz criterion clustering evaluation object using evalclusters.
eva = evalclusters(x,clust,'CalinskiHarabasz') creates a Calinski-Harabasz criterion clustering evaluation object.
eva = evalclusters(x,clust,'CalinskiHarabasz',Name,Value) creates a Calinski-Harabasz criterion clustering evaluation object using additional options specified by one or more name-value pair arguments.

Clustering algorithm used to cluster the input data, stored
as a valid clustering algorithm name or function handle. If the clustering
solutions are provided in the input, 

Name of the criterion used for clustering evaluation, stored as a valid criterion name. 

Criterion values corresponding to each proposed number of clusters
in 

Distance measure used for clustering data, stored as a valid distance measure name. 

List of the number of proposed clusters for which to compute criterion values, stored as a vector of positive integer values. 

Logical flag for excluded data, stored as a column vector of
logical values. If 

Number of observations in the data matrix 

Optimal number of clusters, stored as a positive integer value. 

Optimal clustering solution corresponding to 

Data used for clustering, stored as a matrix of numerical values. 
addK  Evaluate additional numbers of clusters 
compact  Compact clustering evaluation object 
plot  Plot clustering evaluation object criterion values 
The Calinski-Harabasz criterion is sometimes called the variance ratio criterion (VRC). The Calinski-Harabasz index is defined as
$$VRC_{k}=\frac{SS_{B}}{SS_{W}}\times \frac{\left(N-k\right)}{\left(k-1\right)},$$
where SS_{B} is the overall between-cluster variance, SS_{W} is the overall within-cluster variance, k is the number of clusters, and N is the number of observations.
The overall between-cluster variance SS_{B} is defined as
$$SS_{B}=\sum_{i=1}^{k} n_{i}{\Vert m_{i}-m\Vert}^{2},$$
where k is the number of clusters, n_{i} is the number of observations in cluster i, m_{i} is the centroid of cluster i, m is the overall mean of the sample data, and $$\Vert m_{i}-m\Vert $$ is the L^{2} norm (Euclidean distance) between the two vectors.
The overall within-cluster variance SS_{W} is defined as
$$SS_{W}=\sum_{i=1}^{k}\sum_{x\in c_{i}}{\Vert x-m_{i}\Vert}^{2},$$
where k is the number of clusters, x is a data point, c_{i} is the ith cluster, m_{i} is the centroid of cluster i, and $$\Vert x-m_{i}\Vert $$ is the L^{2} norm (Euclidean distance) between the two vectors.
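As a rough illustration of the three definitions above, the following plain-Python sketch computes SS_{B}, SS_{W}, and VRC_{k} for a given partition. It is not the evalclusters implementation; the helper names (centroid, sq_dist, calinski_harabasz) and the four-point sample are hypothetical and chosen only for illustration.

```python
# Sketch only: evaluates the formulas above directly.
# The function names and the sample data are illustrative assumptions.

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean (L2) distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def calinski_harabasz(points, labels):
    """Return (SS_B, SS_W, VRC_k) for one clustering solution."""
    # Group the observations by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    k, n = len(clusters), len(points)
    if k < 2 or k >= n:
        raise ValueError("VRC_k is only defined for 2 <= k < N")

    m = centroid(points)  # overall mean of the sample data
    ss_b = 0.0
    ss_w = 0.0
    for members in clusters.values():
        m_i = centroid(members)                         # centroid of cluster i
        ss_b += len(members) * sq_dist(m_i, m)          # n_i * ||m_i - m||^2
        ss_w += sum(sq_dist(x, m_i) for x in members)   # sum of ||x - m_i||^2

    # Note: ss_w is zero when every point coincides with its centroid,
    # in which case the ratio below is undefined.
    return ss_b, ss_w, (ss_b / ss_w) * (n - k) / (k - 1)


# Two tight groups of two points each, far apart along the first axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [1, 1, 2, 2]
ss_b, ss_w, vrc = calinski_harabasz(points, labels)
print(ss_b, ss_w, vrc)  # 100.0 4.0 50.0
```

Here each centroid lies 5 units from the overall mean (5, 1), so SS_{B} = 2(25) + 2(25) = 100, every point lies 1 unit from its centroid, so SS_{W} = 4, and VRC_{2} = (100/4)(4 - 2)/(2 - 1) = 50.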
Well-defined clusters have a large between-cluster variance (SS_{B}) and a small within-cluster variance (SS_{W}). The larger the VRC_{k} ratio, the better the data partition. To determine the optimal number of clusters, maximize VRC_{k} with respect to k. The optimal number of clusters is the solution with the highest Calinski-Harabasz index value.
The Calinski-Harabasz criterion is best suited for k-means clustering solutions with squared Euclidean distances.
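The selection rule can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not what evalclusters does internally: the helper vrc, the six one-dimensional points, and the candidate label vectors for k = 2, 3, and 4 are all hypothetical, with the label vectors standing in for solutions that a clustering algorithm would normally produce.

```python
# Sketch only: pick the number of clusters with the highest VRC_k.
# The data and the candidate partitions below are illustrative assumptions.

def vrc(points, labels):
    """Calinski-Harabasz index VRC_k of one partition of the points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k, dims = len(points), len(clusters), range(len(points[0]))

    def mean(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in dims]

    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    m = mean(points)
    ss_b = sum(len(c) * sq(mean(c), m) for c in clusters.values())
    ss_w = sum(sq(x, mean(c)) for c in clusters.values() for x in c)
    return (ss_b / ss_w) * (n - k) / (k - 1)


# Three visually obvious groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,), (20.0,), (21.0,)]

# Hypothetical clustering solutions, one label vector per proposed k.
candidates = {
    2: [1, 1, 1, 1, 2, 2],
    3: [1, 1, 2, 2, 3, 3],
    4: [1, 2, 3, 3, 4, 4],
}

scores = {k: vrc(points, labels) for k, labels in candidates.items()}
optimal_k = max(scores, key=scores.get)  # k with the largest VRC_k
print(optimal_k)  # 3
```

For this data the index is roughly 11.8 at k = 2, 400 at k = 3, and 267 at k = 4, so the three-cluster solution is selected.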
[1] Calinski, T., and J. Harabasz. "A dendrite method for cluster analysis." Communications in Statistics. Vol. 3, No. 1, 1974, pp. 1–27.