How to Perform K-Means Clustering in R

K-means clustering is a widely used unsupervised learning algorithm that partitions a dataset into a pre-defined number of groups, or clusters, denoted K. It seeks to minimize the variance within each cluster, so that data points grouped together are similar to one another and distinct from observations in other clusters. In R, the method is available through the built-in kmeans() function. Its main arguments are the data matrix, the desired number of clusters (centers), and, optionally, the number of random initial configurations to try (nstart). The function returns, among other components, the computed cluster centroids and the cluster assignment of every data point.


The process of clustering is a core technique within machine learning and data mining, focusing on uncovering inherent structures and natural groupings within complex datasets. It systematically attempts to discover cohesive “clusters” of observations where the observations share high degrees of similarity based on their feature values. This methodology is vital when exploratory data analysis reveals potential non-obvious patterns that can inform decision-making or model development.

The fundamental objective of clustering is high intra-cluster similarity and high inter-cluster dissimilarity. Put simply, observations within a single cluster should be very similar to one another, typically as measured by their proximity in the feature space, while observations in different clusters should be markedly different. This separation allows for meaningful categorization and targeted analysis of the derived groups.

Clustering is inherently classified as a form of unsupervised learning. Unlike supervised methods, which rely on labeled training data to predict a specific response variable, unsupervised methods operate solely by seeking underlying patterns, structures, or relationships within the input data itself without requiring any prior knowledge of class labels. The absence of a designated target variable defines its unsupervised nature and makes it suitable for exploratory data analysis where ground truth labels are missing.

A prime example of clustering application lies in market segmentation. Businesses frequently employ clustering algorithms when they possess detailed demographic and behavioral data about their customer base. This information might include various key metrics relevant to purchasing power and lifestyle:

  • Household income demographics
  • Household size and composition
  • Occupation category of the head of household
  • Geographical factors, such as distance from the nearest urban area

By leveraging these variables, clustering techniques allow companies to identify homogeneous groups of households that are similar in ways relevant to consumer behavior. For instance, a cluster analysis might reveal groups that are more likely to purchase premium products or to respond favorably to specific types of advertising campaigns, which supports more targeted marketing strategies.

What is K-Means Clustering?

K-means clustering is arguably the most recognized partitioning algorithm, functioning as an iterative refinement process. The core concept involves partitioning $N$ observations into exactly $K$ clusters, where $K$ is a user-defined integer representing the target number of groups. The algorithm seeks to assign each observation to the cluster whose mean (or centroid) is closest to it, effectively minimizing the within-cluster sum of squares (WCSS). The ultimate objective remains consistent: to produce $K$ clusters where the inherent similarity among observations within any single cluster is maximized, while simultaneously maximizing the difference between observations belonging to separate clusters.
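Written out, the objective described above can be stated as follows. The notation is introduced here for illustration and is an assumption of this sketch: $C_k$ is the set of observations assigned to cluster $k$, $x_i$ is the feature vector of observation $i$, and $\mu_k$ is the centroid of cluster $k$.

```latex
\text{WCSS} = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
```

K-means searches for the assignment of observations to clusters that makes this total as small as possible.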

Because the random initial assignment or choice of starting centroids can significantly influence the final cluster configuration, K-means can converge to different solutions on different runs. The algorithm is therefore typically run multiple times with different random starting points (controlled by the nstart parameter in R), and the run with the smallest total WCSS is kept. This does not guarantee the global optimum, but it substantially reduces the risk of settling on a poor local optimum.
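As a rough illustration of what multiple random starts accomplish, the sketch below runs kmeans() 25 times from a single start each and keeps the best run by hand. The simulated data and the object names (x, runs, wcss, best) are assumptions made for this example only.

```r
# Illustrative sketch: repeated single-start runs, keeping the best one.
# The simulated data below is hypothetical and only serves the example.
set.seed(123)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 3), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2)
)

# Run k-means 25 times, each time from one random starting configuration
runs <- lapply(1:25, function(i) kmeans(x, centers = 3, nstart = 1))

# Total within-cluster sum of squares reached by each run
wcss <- sapply(runs, function(fit) fit$tot.withinss)
range(wcss)

# Keep the run with the smallest total WCSS
best <- runs[[which.min(wcss)]]
best$tot.withinss
```

Calling kmeans(x, centers = 3, nstart = 25) performs a comparable search internally and returns only the best run.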

In practice, performing K-means clustering involves a well-defined sequence of steps that guide the iterative optimization process. Understanding these steps is crucial for proper implementation and interpretation of the results, especially when dealing with high-dimensional data where visual inspection is impossible.

The standard procedure for executing K-means clustering is outlined sequentially:

  1. Choose a Value for K (The Number of Clusters). This initial step is non-trivial as the selection of $K$ dictates the granularity of the resulting partitions. Since there is no single predetermined correct value for $K$, data scientists often employ diagnostic tools (like the Elbow Method or Gap Statistic) to test several different values for $K$. The optimal value is determined by analyzing the resulting cluster separation, compactness, and overall interpretability for the specific business or scientific problem context.
  2. Randomly Initialize Cluster Assignments or Centers. The algorithm begins by arbitrarily assigning each observation to one of the $K$ clusters, or by randomly selecting $K$ data points to serve as the initial cluster centers (centroids).
  3. Iterative Refinement until Convergence. The core of K-means involves repeating the following two sub-steps until the cluster assignments stabilize, meaning no observation changes its cluster membership between consecutive iterations, or until a predefined maximum number of iterations is reached.
    • Calculate Cluster Centroids: For each of the $K$ clusters, the new cluster centroid is computed. This centroid is the geometric center, represented by the vector of the mean values for all $p$ features (dimensions) across all observations currently assigned to that specific $k$th cluster.
    • Reassign Observations: Every observation in the dataset is then reassigned to the cluster whose newly computed centroid is closest to it. “Closest” is typically defined using the squared Euclidean distance, that is, the square of the straight-line distance between two points in the feature space. Assigning each observation to its nearest centroid can lower, but never raise, the within-cluster sum of squares.
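The steps above can be sketched from scratch in a few lines of R. This is a minimal illustration under stated assumptions, not the algorithm that kmeans() runs by default (base R uses the Hartigan-Wong variant). The function simple_kmeans() is a hypothetical helper written for this example; it starts from randomly chosen observations as centroids and uses R's built-in USArrests data purely for convenience.

```r
# Minimal from-scratch sketch of the iterative procedure (Lloyd-style).
# simple_kmeans() is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick k distinct observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every centroid
    d2 <- vapply(
      seq_len(k),
      function(j) rowSums(sweep(x, 2, centers[j, ])^2),
      numeric(nrow(x))
    )

    # Reassign each observation to its nearest centroid
    new_cluster <- max.col(-d2, ties.method = "first")

    # Stop once no observation changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster

    # Recompute each centroid as the mean of its assigned observations
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) centers[j, ] <- colMeans(x[members, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(1)
fit <- simple_kmeans(scale(USArrests), k = 4)
table(fit$cluster)
```

A cluster that loses all of its members simply keeps its previous centroid here; production implementations handle that case more carefully.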

K-Means Clustering Implementation in R

R provides excellent capabilities for performing K-means clustering, leveraging built-in functions and powerful extension packages specifically designed for cluster analysis and visualization. The following tutorial demonstrates a step-by-step process for executing, validating, and visualizing a K-means model using a classic dataset available in the R environment.

Step 1: Load the Necessary Packages

To efficiently conduct and interpret K-means clustering in R, we rely on two key external packages that enhance the capabilities of the base kmeans() function. We load factoextra for enhanced visualization tools, specifically used for determining the optimal number of clusters and plotting the final results clearly. We also load the cluster package, which contains essential functions related to cluster validation metrics, such as the Gap Statistic.

library(factoextra)
library(cluster)

Step 2: Load and Prepare the Data

For this demonstration, we use the USArrests dataset built into R. It contains statistics for all 50 U.S. states in 1973 across four variables: arrests per 100,000 residents for three crimes (Murder, Assault, and Rape), plus the percentage of the population living in urban areas (UrbanPop). The dataset is well suited to demonstrating K-means because the states must be grouped on several variables at once.

Data preparation is a critical step before applying K-means. Since K-means relies on distance measures, variables measured on vastly different scales (e.g., Assault rates are much higher than Murder rates) can unfairly dominate the clustering process. Therefore, it is imperative to normalize or scale the variables so they all possess equal importance in the distance calculation. The code below illustrates the necessary pre-processing steps: loading the data, handling missing values, and scaling the features.

  • Loading the Dataset: We begin by assigning the USArrests data to a new data frame called df.
  • Handling Missing Data: We ensure data integrity by removing any rows containing missing values (NA) using na.omit().
  • Scaling Features: We standardize each variable using the scale() function, resulting in variables that have a mean of 0 and a standard deviation of 1. This ensures all variables are weighted equally when computing distances.

#load data
df <- USArrests

#remove rows with missing values (if any exist)
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#view first six rows of dataset to confirm scaling
head(df)

               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207
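To see why scaling matters here, one can compare the spread of the raw variables and then confirm what scale() produced. The sketch below is only a sanity check; the object names raw_var and scaled are chosen for this example.

```r
# Variances of the raw variables differ by orders of magnitude,
# so unscaled distances would be dominated by Assault
raw_var <- sapply(USArrests, var)
round(raw_var, 1)

# After scale(), every column has mean 0 and standard deviation 1
scaled <- scale(na.omit(USArrests))
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```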

Step 3: Determining the Optimal Number of Clusters (K)

As established, K-means clustering requires the number of clusters, $K$, to be specified in advance. An inappropriate $K$ can produce groupings that are either too fine or too coarse, obscuring meaningful patterns. To guide the selection, the factoextra and cluster packages provide tools that evaluate cluster compactness (WCSS) for varying values of $K$ and compare it with what would be expected from data with no cluster structure (Gap Statistic). Recall the basic call pattern of the R function:

kmeans(data, centers, nstart)

  • data: A numeric matrix or data frame (this argument is named x in the function itself).
  • centers: The intended number of clusters, $K$.
  • nstart: The number of random initial configurations to try. A value such as 25 is commonly used to make it more likely that the algorithm finds a good, stable solution.

We still need a value for the centers argument ($K$). We will use two standard diagnostic techniques to estimate it from our scaled data.

Method 1: The Elbow Method (Total Within Sum of Squares)

The Elbow Method evaluates the total within-cluster sum of squares (WCSS) as $K$ increases. WCSS quantifies the compactness of the clusters; the lower the WCSS, the closer the observations are to their respective cluster centroids. We plot $K$ against WCSS. Since adding more clusters always reduces WCSS, we look for the point of diminishing returns—the “elbow”—where the marginal gain from adding an extra cluster drops sharply, suggesting that the optimal balance between compactness and simplicity has been reached.

We utilize the fviz_nbclust() function from the factoextra package, setting the method parameter to "wss" (Within Sum of Squares) to generate this diagnostic plot:

fviz_nbclust(df, kmeans, method = "wss")

Optimal number of clusters in k-means clustering

Analyzing the resulting plot, we look for the distinctive “bend” where the WCSS curve begins to level off. In this visualization of the USArrests data, a notable elbow is visually apparent when the number of clusters is set to $K=4$. This suggests that partitioning the data into four distinct groups provides the best trade-off between minimizing within-cluster variation and maintaining model parsimony.
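The quantity behind the elbow plot can also be computed by hand, which makes it easier to inspect the exact values. The following is a minimal sketch in base R; the object names (x, k_values, wss) are chosen for this example, and the numbers it prints may differ slightly from the plot because of random initialization.

```r
# Total within-cluster sum of squares for K = 1..10, computed directly
x <- scale(USArrests)
set.seed(1)
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Tabulate K against total WCSS to look for the point of diminishing returns
data.frame(k = k_values, wss = round(wss, 2))
```

Plotting the result with plot(k_values, wss, type = "b") gives a base-graphics version of the same diagnostic.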

Method 2: The Gap Statistic

While the Elbow Method is intuitive, locating the elbow can be subjective. A more formal approach is the Gap Statistic. This metric compares the total intra-cluster variation observed in the real data for different values of $K$ against the variation expected under a reference null distribution, that is, data generated with no inherent clustering structure. In the simplest reading, the optimal number of clusters is the value of $K$ with the largest gap statistic, which marks the greatest improvement over what randomness alone would produce. Stricter selection rules also take the simulation standard error into account.

We calculate the gap statistic using the clusGap() function from the cluster package, specifying the maximum $K$ to test (K.max = 10) and setting the number of bootstrap samples (B = 50). The results are then visualized using the fviz_gap_stat() function:

#calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

Gap statistic for optimal number of clusters

The Gap Statistic plot agrees with the Elbow Method: the gap statistic reaches its highest value at $K=4$ clusters. Since both methods point to the same value, we proceed with $K=4$ as the number of clusters for segmenting the U.S. states based on crime and urbanization rates.
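The numbers behind the plot can be inspected directly. The sketch below assumes the same settings as above and reads the Tab component returned by clusGap(); it then applies two selection rules, the plain maximum and one of the standard-error-based rules offered by maxSE() in the cluster package. The seed and the object names are choices made for this example, and the selected values can vary between runs.

```r
library(cluster)

x <- scale(USArrests)
set.seed(1)
gap_stat <- clusGap(x, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50, verbose = FALSE)

# One row per K: log(W), its expectation under the reference, gap, and SE
tab <- gap_stat$Tab
round(tab, 3)

# Rule 1: K with the largest gap statistic
k_max <- which.max(tab[, "gap"])

# Rule 2: smallest K whose gap is within one SE of the next value
k_rule <- maxSE(tab[, "gap"], tab[, "SE.sim"], method = "Tibs2001SEmax")

c(global_max = k_max, tibshirani_rule = k_rule)
```

When the two rules disagree, the more conservative rule typically selects the smaller $K$.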

Step 4: Performing K-Means Clustering with Optimal K

With the optimal number of clusters determined to be $K=4$, we now execute the final K-means clustering algorithm on our scaled dataset. To ensure the reproducibility of our findings, particularly due to the random initialization component of the K-means algorithm, we first set a seed value. We use nstart=25 to maximize the chance of finding the global optimum by trying 25 different initial configurations and selecting the one with the lowest overall WCSS.

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)

#view results summary
km

K-means clustering with 4 clusters of sizes 16, 13, 13, 8

Cluster means:
      Murder    Assault   UrbanPop        Rape
1 -0.4894375 -0.3826001  0.5758298 -0.26165379
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3  0.6950701  1.0394414  0.7226370  1.27693964
4  1.4118898  0.8743346 -0.8145211  0.01927104

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California       Colorado 
             4              3              3              4              3              3 
   Connecticut       Delaware        Florida        Georgia         Hawaii          Idaho 
             1              1              3              4              1              2 
      Illinois        Indiana           Iowa         Kansas       Kentucky      Louisiana 
             3              1              2              1              2              4 
         Maine       Maryland  Massachusetts       Michigan      Minnesota    Mississippi 
             2              3              1              3              2              4 
      Missouri        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              2              2              3              2              1 
    New Mexico       New York North Carolina   North Dakota           Ohio       Oklahoma 
             3              1              2              1              1              1 
        Oregon   Pennsylvania   Rhode Island South Carolina   South Dakota      Tennessee 
             1              1              1              4              2              4 
         Texas           Utah        Vermont       Virginia     Washington  West Virginia 
             3              1              2              1              1              2 
     Wisconsin        Wyoming 
             2              1 

Within cluster sum of squares by cluster:
[1] 16.212213 11.952463 19.922437  8.316061
 (between_SS / total_SS =  71.2 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"         

The output summary provides comprehensive insight into the final cluster configuration. Key results include the cluster sizes, confirming how the 50 states were distributed across the four partitions:

  • 16 states were allocated to the first cluster.
  • 13 states were allocated to the second cluster.
  • 13 states were allocated to the third cluster.
  • 8 states were allocated to the fourth cluster.

Furthermore, the output displays the scaled means (centroids) for each cluster, which describe each group's profile in standard deviations from the overall mean. The line “between_SS / total_SS = 71.2 %” indicates that 71.2% of the total variability in the data is accounted for by differences between the clusters, which suggests reasonably good separation between the groups.
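The individual pieces of this summary can be pulled out of the fitted object by name. The sketch below refits the same model so that it runs on its own, then extracts the cluster sizes, recomputes the between/total ratio, and converts the scaled centroids back to original units using the centering and scaling attributes that scale() stores. The object names ratio and centers_raw are chosen for this example.

```r
x <- scale(USArrests)
set.seed(1)
km <- kmeans(x, centers = 4, nstart = 25)

# Number of states in each cluster
km$size

# Share of the total sum of squares that lies between clusters
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)

# Convert centroids back to original units: multiply by sd, then add mean
centers_raw <- sweep(km$centers, 2, attr(x, "scaled:scale"), "*")
centers_raw <- sweep(centers_raw, 2, attr(x, "scaled:center"), "+")
round(centers_raw, 2)
```

Because a centroid is just the mean of its members, the back-transformed centroids equal the per-cluster means of the original variables.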

Step 5: Visualizing and Interpreting the Clusters

To gain a clearer spatial understanding of how the states are grouped, we can project the multidimensional cluster structure onto a two-dimensional plane. This is typically achieved using Principal Component Analysis (PCA) to find the dimensions that capture the maximum variance. The fviz_cluster() function plots the clusters based on the first two principal components, visually illustrating the quality of separation achieved by the K-means algorithm.

#plot results of final k-means model, projected onto the first two principal components
fviz_cluster(km, data = df)

K-means clustering plot in R

The visual representation confirms that the four clusters are reasonably well-separated in the feature space, aligning with the large “between SS” value observed previously. Cluster 3 and Cluster 4 show some proximity, indicating shared characteristics, while Cluster 2 and Cluster 1 are distinct and highly separated from the others.
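For readers who want to see the projection itself, the sketch below computes the first two principal components with base R's prcomp() and attaches the cluster labels. It is an illustration of the idea rather than a reproduction of the plotting function's internals; the object names pca and scores are chosen for this example.

```r
x <- scale(USArrests)
set.seed(1)
km <- kmeans(x, centers = 4, nstart = 25)

# Principal components of the scaled data
pca <- prcomp(x)

# Coordinates of each state on the first two components, with its cluster
scores <- data.frame(pca$x[, 1:2], cluster = factor(km$cluster))
head(scores)

# Proportion of total variance carried by the first two components
summary(pca)$importance["Proportion of Variance", 1:2]
```

The closer the two proportions sum to 1, the more faithfully the two-dimensional picture reflects distances in the full four-variable space.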

For practical interpretation, analyzing the scaled cluster means provides information on relative differences, but examining the means of the original, unscaled variables provides immediate, real-world context. We use the aggregate() function to calculate the mean of the original USArrests variables for all states belonging to each cluster. This allows us to characterize the profiles of the four distinct state groups based on actual crime rates and urbanization percentages:

#find means of each cluster using the original data values
aggregate(USArrests, by=list(cluster=km$cluster), mean)

cluster    Murder    Assault  UrbanPop      Rape

1         5.65625  138.87500  73.87500  18.78125
2         3.60000   78.53846  52.07692  12.17692
3        10.81538  257.38462  76.00000  33.19231
4        13.93750  243.62500  53.75000  21.41250

By reviewing the means of the original variables, we can characterize each cluster. The cluster numbers are arbitrary labels; they follow the numbering of the Step 4 summary (clusters of 16, 13, 13, and 8 states):

  • Cluster 1: States with moderate crime rates (e.g., 5.7 murders and 138.9 assaults per 100,000) and high urbanization (73.9%), forming an intermediate group.
  • Cluster 2: States with the lowest crime rates (e.g., 3.6 murders per 100,000) and the lowest urbanization (52.1%).
  • Cluster 3: States with high crime rates across all categories (e.g., 10.8 murders, 257.4 assaults) and the highest urbanization (76.0%).
  • Cluster 4: States with the highest murder rate (13.9) and a high assault rate (243.6), yet low urbanization (53.8%): a profile of high violent-crime arrest rates in less urbanized states.

Finally, for subsequent analysis, mapping, or reporting, it is useful to append the cluster assignment to the original, unscaled dataset. This adds the structure found by K-means directly to the state records.

#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = km$cluster)

#view final data structure
head(final_data)

           Murder Assault UrbanPop Rape cluster

Alabama      13.2     236       58 21.2       4
Alaska       10.0     263       48 44.5       3
Arizona       8.1     294       80 31.0       3
Arkansas      8.8     190       50 19.5       4
California    9.0     276       91 40.6       3
Colorado      7.9     204       78 38.7       3
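With the labels attached, the combined data frame can be summarized or split by cluster. The sketch below rebuilds final_data so that it runs on its own, then counts the states per cluster and lists their names; states_by_cluster is a name chosen for this example.

```r
x <- scale(USArrests)
set.seed(1)
km <- kmeans(x, centers = 4, nstart = 25)
final_data <- cbind(USArrests, cluster = km$cluster)

# How many states fall into each cluster
table(final_data$cluster)

# Names of the states in each cluster
states_by_cluster <- split(rownames(final_data), final_data$cluster)
states_by_cluster
```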

Pros and Cons of K-Means Clustering

While K-means clustering is a powerful and popular tool for partitioning data, it is essential to understand its inherent strengths and limitations before deployment in a production environment. Its popularity in big data applications is largely attributed to its computational efficiency and simplicity, especially when handling large volumes of data.

The primary advantages offered by K-means include:

  • Efficiency and Speed: It is quick and computationally inexpensive. For a fixed number of features, its running time is roughly $O(nkt)$, where $n$ is the number of data points, $k$ is the number of clusters, and $t$ is the number of iterations, so the cost grows linearly with the number of observations and the method scales well to large datasets.
  • Simplicity and Ease of Use: The underlying algorithm is easy to understand and straightforward to implement, requiring minimal complex configuration beyond specifying the number of clusters $K$.

However, analysts must be aware of its potential drawbacks, which often dictate whether K-means is the most appropriate algorithm for a given task:

  • Dependency on K Specification: The necessity of manually specifying the number of clusters $K$ beforehand is a significant operational limitation, often requiring preliminary heuristic methods to determine the optimal value.
  • Sensitivity to Outliers: K-means relies heavily on the mean (centroid) calculation. Consequently, it is highly susceptible to the influence of outliers, which can dramatically skew the cluster centers and lead to inaccurate segmentation.
  • Assumption of Spherical Clusters: K-means inherently assumes that clusters are convex and roughly spherical in shape, performing poorly when clusters have complex, non-linear boundaries or unequal sizes and densities.

For scenarios where these limitations become prohibitive, such as numerous outliers or non-spherical clusters, alternative clustering techniques may perform better. Two common alternatives are K-medoids clustering, which uses actual observations (medoids) rather than means as cluster centers and is therefore more robust to outliers, and hierarchical clustering, which does not require $K$ to be specified upfront.
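Both alternatives are available in R with little extra effort. The sketch below is a starting point rather than a tuned analysis: it applies pam() from the cluster package and base R's hclust() to the same scaled data. The choice of four groups and of complete linkage are assumptions made here for comparability with the example above.

```r
library(cluster)

x <- scale(USArrests)

# K-medoids: each cluster center is an actual state (a medoid)
pam_fit <- pam(x, k = 4)
rownames(pam_fit$medoids)
table(pam_fit$clustering)

# Hierarchical clustering: build the full tree first, choose K afterwards
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 4)
table(hc_groups)
```

With hierarchical clustering, the same tree can be cut at a different number of groups by calling cutree() again, without refitting.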



Cite this article

stats writer (2025). How to Perform K-Means Clustering in R. PSYCHOLOGICAL SCALES. Retrieved from https://scales.arabpsychology.com/stats/how-to-perform-k-means-clustering-in-r/

stats writer. "How to Perform K-Means Clustering in R." PSYCHOLOGICAL SCALES, 17 Dec. 2025, https://scales.arabpsychology.com/stats/how-to-perform-k-means-clustering-in-r/.
