How to Work Out the Number of Clusters in Cluster Analysis
Most cluster analysis techniques require users to specify the number of clusters. There are six broad classes of approaches to choosing the number of clusters: penalized fit heuristics, statistical tests, the extent of association with other data, replicability, no small classes, and domain-usefulness.
Penalized fit heuristics
Provided there are no technical errors, it should always be the case that the more clusters you have, the better the clusters will fit the data. At some point, however, adding more clusters will overfit the data. Penalized fit heuristics are metrics that start with a computation of fit and then penalize it based on the number of clusters.
Dozens and perhaps hundreds of penalized fit heuristics have been developed, such as the Bayesian information criterion (BIC), the gap statistic, and the elbow method (where the penalty factor is based on the perceptions of the analyst rather than a cut-and-dried rule).
A practical challenge with all penalized fit heuristics is that they tend to be optimized to work well for a very specific problem but work poorly in other contexts. As a result, such heuristics are not in widespread use.
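As a minimal sketch of the fit-then-penalize structure described above, the Python below scores each number of clusters as the proportion of variance explained minus a fixed penalty per cluster. The data, the penalty value of 0.05, and the function names are all assumptions made for illustration; this is not the BIC, the gap statistic, or any other published criterion. To stay self-contained, it finds the best partition of a tiny one-dimensional data set by trying every contiguous split of the sorted values, which is only feasible for very small samples.

```python
from itertools import combinations


def wcss(segment):
    """Within-cluster sum of squares for one group of numbers."""
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def best_wcss(values, k):
    """Smallest total WCSS over all splits of the sorted values into k contiguous groups."""
    xs = sorted(values)
    n = len(xs)
    best = float("inf")
    # Choose k - 1 cut points between neighbouring sorted values.
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        total = sum(wcss(xs[a:b]) for a, b in zip(bounds, bounds[1:]))
        best = min(best, total)
    return best


def penalized_scores(values, max_k, penalty):
    """Fit (share of variance explained) minus an assumed fixed penalty per cluster."""
    total_ss = wcss(list(values))
    scores = {}
    for k in range(1, max_k + 1):
        fit = 1.0 - best_wcss(values, k) / total_ss
        scores[k] = fit - penalty * k
    return scores


# Hypothetical toy data with three visually obvious groups.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
scores = penalized_scores(data, max_k=5, penalty=0.05)
chosen_k = max(scores, key=scores.get)
print(chosen_k)
```

With the penalty set to zero, the largest number of clusters considered scores best on this data, and other penalty values can change the chosen number of clusters. That sensitivity is one reason a heuristic tuned for one problem may transfer poorly to another.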
Statistical tests
Statistical tests, such as likelihood ratio tests, can also be used to compare solutions with different numbers of clusters. In practice, these tests make very strong and difficult-to-justify assumptions, and none of these tests has ever been widely adopted.
The extent of association with other data
This approach involves assessing the extent to which each cluster solution (i.e., the two-cluster solution, the three-cluster solution, etc.) is associated with other data. The basic idea is that the stronger the association with other data, the greater the likelihood that the solution is valid, rather than just reflecting noise.
A practical challenge with this approach is that any truly novel and interesting finding is one that does not relate strongly to existing classifications.
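One way to put a number on "association with other data" is Cramér's V between the cluster labels and an external categorical variable. The sketch below is an illustration under stated assumptions: the choice of Cramér's V, the external variable (a hypothetical yes/no purchase flag), and both sets of cluster labels are invented for the example, and other association measures would serve equally well.

```python
from collections import Counter
from math import sqrt


def cramers_v(cluster_labels, external_labels):
    """Cramér's V between two categorical variables (0 = no association, 1 = perfect)."""
    n = len(cluster_labels)
    if n != len(external_labels):
        raise ValueError("both label lists must have the same length")
    joint = Counter(zip(cluster_labels, external_labels))
    rows = Counter(cluster_labels)
    cols = Counter(external_labels)
    # Pearson chi-square statistic from observed and expected cell counts.
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(rows), len(cols)) - 1
    if min_dim == 0:
        return 0.0  # one variable is constant, so no association can be measured
    return sqrt(chi2 / (n * min_dim))


# Hypothetical external data: whether each of 12 respondents purchased.
purchased = ["y", "y", "y", "y", "n", "n", "n", "n", "y", "y", "n", "n"]

# Hypothetical cluster labels for the same 12 respondents.
solutions = {
    2: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
    3: [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
}

association = {k: cramers_v(labels, purchased) for k, labels in solutions.items()}
strongest_k = max(association, key=association.get)
print(association, strongest_k)
```

In this made-up example the three-cluster solution is more strongly associated with the purchase flag than the two-cluster solution, so this approach would favor three clusters.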
Replicability
Replicability is computed by either randomly sampling with replacement (bootstrap replication) or splitting a sample into two groups. Cluster analysis is conducted in the replication samples. The number of classes that gives the most consistent results (i.e., consistent between the samples) is considered to be the best. This approach can also be viewed as a form of cross-validation.
Two challenges with this approach are that local optima may be more replicable than global optima (i.e., it may be easier to replicate a poor solution than a better solution), and that replicability declines as the number of clusters increases, all else being equal.
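Consistency between replication samples needs a measure of agreement that ignores how the clusters happen to be numbered. The sketch below uses the adjusted Rand index for this, which is one common choice and an assumption here rather than something the approach requires. The label lists stand in for the cluster assignments that two replication runs gave to the same observations; they are hypothetical and are not produced by an actual clustering run.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same observations."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both label lists must have the same length")
    # Count pairs of observations grouped together in both, in a, and in b.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (pairs_joint - expected) / (max_index - expected)


# Hypothetical assignments of the same 8 observations from two replication runs.
replications = {
    2: ([0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0]),
    3: ([0, 0, 0, 1, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2, 2, 2]),
}

consistency = {k: adjusted_rand_index(a, b) for k, (a, b) in replications.items()}
most_replicable_k = max(consistency, key=consistency.get)
print(consistency, most_replicable_k)
```

Here the two-cluster runs agree perfectly once the arbitrary numbering is ignored, while the three-cluster runs disagree on some observations, so two clusters would be judged the more replicable solution.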
No small classes
The basic idea of this approach is that you choose the highest number of classes such that none of the classes are small (e.g., less than 5% of the sample). This rule has long been used in practice as a part of the idea of domain-usefulness but has recently been discovered to also have some theoretical justification (Nasserinejad, K., van Rosmalen, J., de Kort, W., & Lesaffre, E. (2017). Comparison of criteria for choosing the number of classes in Bayesian finite mixture models. PLoS ONE, 12).
A weakness of this approach is the difficulty of specifying the cutoff value.
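The rule is simple enough to sketch directly. In the Python below, the function names, the cluster sizes, and the sample of 40 observations are hypothetical, and each solution is assumed to contain only non-empty classes. The 5% cutoff comes from the example given above.

```python
from collections import Counter


def smallest_class_share(labels):
    """Share of the sample that falls in the smallest class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def largest_k_without_small_classes(solutions, cutoff=0.05):
    """Highest number of classes whose smallest class is at least `cutoff` of the sample."""
    eligible = [
        k for k, labels in solutions.items()
        if smallest_class_share(labels) >= cutoff
    ]
    return max(eligible) if eligible else None


# Hypothetical cluster labels for a sample of 40 observations.
solutions = {
    2: [0] * 20 + [1] * 20,
    3: [0] * 15 + [1] * 15 + [2] * 10,
    4: [0] * 15 + [1] * 14 + [2] * 10 + [3] * 1,  # smallest class is 2.5% of the sample
}

print(largest_k_without_small_classes(solutions, cutoff=0.05))
print(largest_k_without_small_classes(solutions, cutoff=0.02))
```

The two calls return different answers for the same solutions, which illustrates how much the result depends on the cutoff.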
Domain-usefulness
Perhaps the most widely used approach is to choose the solution that appears, to the analyst, to be the most interesting.