How to Determine the Number of Clusters in K-Means using Python

Have you ever been faced with a dataset and wondered how to group its data points effectively? Clustering, an unsupervised learning technique, groups similar data points together and can reveal patterns within a dataset. One of the most popular clustering algorithms is K-Means, which is widely used in domains such as finance, healthcare, and social media. In this article, we will explore K-Means clustering and discuss how to determine the number of clusters in K-Means using Python.

What is K-Means Clustering?

K-Means clustering is a widely used algorithm that partitions a dataset into K clusters. It starts by randomly initializing K centroids (the representative point of each cluster), then alternates between two steps: assign each data point to its closest centroid, and update each centroid to the mean of the data points assigned to it. These steps repeat until the centroids no longer move significantly, at which point the algorithm has converged.
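The assign-and-update loop described above can be sketched from scratch with NumPy. This is a minimal illustration, not Scikit-Learn's optimized implementation; the function name and toy dataset are made up for this example:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop when the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; the algorithm should recover them
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans(X, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)  # approximately [0.1, 0.1] and [5.1, 5.1]
```

In practice you would use Scikit-Learn's KMeans (as in the examples below), which adds smarter initialization and multiple restarts, but the core loop is exactly these two alternating steps.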

The Elbow Method

One of the most popular methods to determine the optimal number of clusters in K-Means is the elbow method. The elbow method computes the sum of squared distances between each data point and its closest centroid for different values of K. The sum of squared distances is then plotted against the number of clusters (K), and the plot resembles an arm with an elbow. The optimal number of clusters is the point where the rate of decrease in the sum of squared distances slows down, and the plot starts to flatten out, forming the "elbow."

To illustrate this, let’s consider a dataset with two features: x and y. We can use Python’s Scikit-Learn library to perform K-Means clustering and plot the sum of squared distances against the number of clusters using the following code:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('data.csv')
X = data.iloc[:, [0, 1]].values

wcss = []
for i in range(1, 11):
    # k-means++ spreads out the initial centroids; n_init=10 restarts guard against bad starts
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    # inertia_ is the within-cluster sum of squared distances (WCSS)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

In this code, we import the necessary libraries, read a dataset from a CSV file, and store its two feature columns (x and y) in the variable "X." We then loop over K = 1 to 10, fit a K-Means model for each value, and record the within-cluster sum of squared distances (WCSS, exposed by Scikit-Learn as the inertia_ attribute) in the list "wcss." Finally, we plot WCSS against the number of clusters K using Matplotlib.

The resulting plot should resemble an arm with an elbow.

In this example, the optimal number of clusters is three since the rate of decrease in the sum of squared distances slows down after K=3, and the plot starts to flatten out.
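If you prefer not to eyeball the plot, one simple heuristic (a common shortcut, not part of the elbow method itself) is to look at the second difference of the WCSS curve: the elbow is the K at the middle of the sharpest bend, where the rate of decrease changes most abruptly. A sketch with hypothetical WCSS values chosen to have an elbow at K=3:

```python
import numpy as np

# Hypothetical WCSS values for K = 1..10 (made up for this example)
wcss = [1000.0, 600.0, 250.0, 220.0, 200.0, 185.0, 172.0, 161.0, 151.0, 142.0]
ks = list(range(1, 11))

drops = np.diff(wcss)   # decrease in WCSS from each K to the next
bend = np.diff(drops)   # change in the rate of decrease (second difference)
# bend[i] measures the kink at the middle point ks[i + 1]
elbow_k = ks[int(np.argmax(bend)) + 1]
print(elbow_k)  # 3 for this hypothetical curve
```

This heuristic is only as good as the curve: on noisy or gently bending WCSS plots it can pick a spurious kink, so it is best used as a sanity check alongside the visual inspection described above.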

The Silhouette Method

Another method that can be used to determine the optimal number of clusters in K-Means is the Silhouette method. The Silhouette method computes a score for each data point that measures how well it fits into its assigned cluster compared to other clusters. The Silhouette score ranges from -1 to 1, with values closer to 1 indicating that a data point is well-clustered, and values closer to -1 indicating that a data point may be assigned to the wrong cluster.

The average Silhouette score for each value of K is then plotted against the number of clusters (K), and the optimal number of clusters is the one that maximizes the average Silhouette score.
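Concretely, each point's score is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The toy 1-D dataset below (made up for this example) verifies the formula against Scikit-Learn's silhouette_samples:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy dataset: two clusters of three 1-D points each
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# For the first point (0.0):
# a = mean distance to the other points in its own cluster (1.0 and 2.0)
a = np.mean([1.0, 2.0])
# b = mean distance to the points of the nearest other cluster (10, 11, 12)
b = np.mean([10.0, 11.0, 12.0])
s0 = (b - a) / max(a, b)

scores = silhouette_samples(X, labels)
print(s0, scores[0])  # the hand-computed score matches sklearn's
```

silhouette_score, used in the code below, is simply the mean of these per-point values.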

To illustrate this, let’s consider the same dataset with two features: x and y. We can use Python’s Scikit-Learn library to perform K-Means clustering and compute the Silhouette score using the following code:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

data = pd.read_csv('data.csv')
X = data.iloc[:, [0, 1]].values

silhouette_scores = []
for i in range(2, 11):  # the Silhouette score requires at least 2 clusters
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    labels = kmeans.labels_
    # average Silhouette score over all points for this K
    silhouette_scores.append(silhouette_score(X, labels))

print(silhouette_scores)

In this code, we import the necessary libraries, read the same CSV dataset, and extract the two features (x and y) into the variable "X." We then loop over K = 2 to 10 (the Silhouette score requires at least two clusters), fit a K-Means model for each value, and append the average Silhouette score of the resulting labels to the list "silhouette_scores."

The output of this code should be a list of Silhouette scores for each value of K:

[0.5712149405242419, 0.42237379094615565, 0.3728591964565877, 0.34018022052802216, 0.3384704665991873, 0.3353996759038333, 0.3072237767347577, 0.30904774232339256, 0.2956646523776192]

In this example, the optimal number of clusters is two: the first score in the list corresponds to K=2 and is the largest.
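Rather than reading the best K off the printed list, you can select it programmatically. A small sketch using the scores printed above:

```python
import numpy as np

# Scores produced by the loop above, one per K from 2 to 10
silhouette_scores = [0.5712149405242419, 0.42237379094615565, 0.3728591964565877,
                     0.34018022052802216, 0.3384704665991873, 0.3353996759038333,
                     0.3072237767347577, 0.30904774232339256, 0.2956646523776192]

ks = range(2, 11)
# The best K is the one whose average Silhouette score is highest
best_k = ks[int(np.argmax(silhouette_scores))]
print(best_k)  # 2: the first score in the list is the largest
```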

Conclusion

In this article, we explored K-Means clustering and discussed two widely used methods for determining the optimal number of clusters: the Elbow method and the Silhouette method. Python's Scikit-Learn library provides a convenient, easy-to-use API for performing K-Means clustering and for computing the WCSS and Silhouette scores these methods rely on. By leveraging them, you can effectively group similar data points into clusters and gain insight into your data.
