What is k-Means Cluster Analysis?
k-means cluster analysis is an algorithm that sorts similar objects into groups called clusters. The end result of cluster analysis is a set of clusters in which each cluster is distinct from the others, and the objects within each cluster are broadly similar to one another.
The required data for k-means cluster analysis
Typically, k-means cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects. These quantitative characteristics are called clustering variables. For example, in the table below there are 18 objects, and there are two clustering variables, x and y. In a real-world application there will typically be many more objects and more variables. For example, in market segmentation, where k-means is used to find groups of consumers with similar needs, each object is a person and each variable is commonly a rating of how important a particular attribute is to that person (e.g., quality, price, customer service, convenience).
How k-means cluster analysis works
Step 1: Specify the number of clusters (k). The first step in k-means is to specify the number of clusters, which is referred to as k. Traditionally, researchers run k-means several times, exploring different numbers of clusters (e.g., from 2 through 10).
Step 2: Allocate objects to clusters. The most straightforward approach is to randomly assign objects to clusters, but there are many other approaches (e.g., using hierarchical clustering). In the diagram below, the 18 objects have been represented by dots on a scatterplot, where x is shown by the horizontal position of each object and y by the vertical. The objects have been randomly assigned to the two clusters (k = 2), where one cluster is shown with filled dots and the other with unfilled dots.
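The random allocation described in step 2 can be sketched in a few lines. This is an illustration only: the function name random_assignment and the seed parameter are assumptions made for the example, and the labels it produces are not the ones shown in the article's diagram.

```python
import random

def random_assignment(n_objects, k, seed=0):
    """Give each object a random cluster label in 0..k-1 (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.randrange(k) for _ in range(n_objects)]

# 18 objects and k = 2, matching the sizes used in the article's example.
labels = random_assignment(18, 2)
print(labels)
```

Note that a purely random allocation can, by chance, leave a cluster with no objects, which is one reason other starting approaches are sometimes preferred.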
Step 3: Compute cluster means. For each cluster, the average value is computed for each of the variables. In the plot below, the average value of the filled dots for the variable represented by the horizontal position (x) is around 15; for the variable on the vertical dimension (y) it is around 12. These two means are represented by the filled cross. Stated slightly differently, the filled cross is in the middle of the filled dots. Similarly, the unfilled cross is in the middle of the unfilled dots. These crosses are variously referred to as the cluster centers, cluster means, and cluster centroids.
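As a rough sketch of step 3, the mean of each cluster on each variable can be computed as follows. The three points and their labels are made up for the example (they are not the article's 18 objects), and cluster_means is a hypothetical helper name.

```python
def cluster_means(points, labels, k):
    """Return one mean vector per cluster, or None for a cluster with no objects."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for point, label in zip(points, labels):
        counts[label] += 1
        for d in range(dims):
            sums[label][d] += point[d]
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]

# Made-up data: two objects in cluster 0 and one in cluster 1.
points = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0)]
labels = [0, 0, 1]
means = cluster_means(points, labels, k=2)
print(means)  # [[2.0, 3.0], [10.0, 10.0]]
```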
Step 4: Allocate each observation to the closest cluster center. In the plot above, some of the filled dots are closer to the unfilled cross and some of the unfilled dots are closer to the filled cross. When we reallocate the observations to the closest cluster centers we get the plot below.
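The reallocation in step 4 might look like the sketch below. It assumes that "closest" means smallest Euclidean distance, which is the usual choice for k-means but is not stated explicitly above. The centers and points are made up, and the function names are hypothetical.

```python
def closest_center(point, centers):
    """Index of the center with the smallest squared Euclidean distance to point."""
    def sq_dist(center):
        return sum((p - c) ** 2 for p, c in zip(point, center))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))

def reallocate(points, centers):
    """Assign every point to its closest center."""
    return [closest_center(p, centers) for p in points]

# Made-up centers and points for illustration.
centers = [(2.0, 3.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 11.0), (5.0, 5.0)]
new_labels = reallocate(points, centers)
print(new_labels)  # [0, 1, 0]
```

Squared distance is used because it gives the same ordering as the distance itself and avoids a square root.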
Step 5: Repeat steps 3 and 4 until the solution converges. Looking at the plot above, we can see that the crosses (the cluster means) are no longer accurate. In the following plot they have been recomputed using step 3. In this example the cluster analysis has converged (i.e., reallocating observations and updating means cannot improve the solution). In examples with more data, a few more iterations are typically required (i.e., steps 3 and 4 are repeated until no objects change clusters).
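Putting steps 2 through 5 together gives a loop along the following lines. This is a minimal sketch rather than a reference implementation. It assumes Euclidean distance and a random starting allocation, and it re-seeds an empty cluster at a randomly chosen object, which is a choice made for this example and not something the steps above prescribe. The six points are made up.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Batch k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: random initial allocation of objects to clusters.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iter):
        # Step 3: compute the mean of each cluster on every variable.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster is re-seeded at a random object.
                members = [rng.choice(points)]
            centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        # Step 4: allocate each object to its closest center.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step 5: stop once no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    # If max_iter is reached first, centers are those from the last update.
    return labels, centers

# Made-up data: two well-separated groups of three objects each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Which group ends up labeled 0 and which 1 depends on the random start, so only the grouping itself is meaningful.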
The algorithm described above is known as the batch algorithm, because all objects are reallocated before the means are recomputed. Many other variants of k-means have been developed. Perhaps the most popular of these moves objects to a cluster one at a time, updating the mean each time.
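To illustrate the one-at-a-time idea, the sketch below updates two cluster means when a single object moves between clusters, without recomputing either mean from scratch. It is an assumption-laden illustration of the bookkeeping only and does not correspond to any particular named variant. The function name move_object and the example numbers are hypothetical.

```python
def move_object(point, old, new, means, counts):
    """Move one object from cluster old to cluster new, updating both means in place."""
    n_old = counts[old]
    if n_old > 1:
        # Remove the object's contribution from its old cluster mean.
        means[old] = [(m * n_old - p) / (n_old - 1) for m, p in zip(means[old], point)]
    # If the old cluster becomes empty, its mean is left unchanged (an arbitrary choice).
    counts[old] -= 1
    n_new = counts[new]
    # Add the object's contribution to its new cluster mean.
    means[new] = [(m * n_new + p) / (n_new + 1) for m, p in zip(means[new], point)]
    counts[new] += 1

# Made-up state: cluster 0 holds (1, 2) and (3, 4); cluster 1 holds (10, 10).
means = [[2.0, 3.0], [10.0, 10.0]]
counts = [2, 1]
move_object((3.0, 4.0), old=0, new=1, means=means, counts=counts)
print(means, counts)  # [[1.0, 2.0], [6.5, 7.0]] [1, 2]
```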
The outputs from k-means cluster analysis
The main output from k-means cluster analysis is a table showing the mean values of each cluster on the clustering variables. The table of means for the data examined in this article is shown below.
A second output shows which object has been classified into which cluster, as shown below. Other outputs include plots and diagnostics designed to assess how much variation exists within and between clusters.
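One common way to quantify variation within and between clusters is with sums of squares, sketched below. This is only one possible diagnostic and is not necessarily the one any given software package reports. The function name and the four points are made up for illustration.

```python
def sums_of_squares(points, labels, k):
    """Return (within-cluster, between-cluster) sums of squared deviations."""
    dims = range(len(points[0]))
    grand_mean = [sum(p[d] for p in points) / len(points) for d in dims]
    within = 0.0
    between = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        mean = [sum(p[d] for p in members) / len(members) for d in dims]
        # Spread of the cluster's objects around their own mean.
        within += sum((p[d] - mean[d]) ** 2 for p in members for d in dims)
        # Distance of the cluster mean from the overall mean, weighted by cluster size.
        between += len(members) * sum((mean[d] - grand_mean[d]) ** 2 for d in dims)
    return within, between

# Made-up data: two tight clusters that are far apart.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
within, between = sums_of_squares(points, labels, k=2)
print(within, between)  # 4.0 100.0
```

A small within-cluster value relative to the between-cluster value suggests compact, well-separated clusters.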