This article covers the most important use cases of data clustering methods, how to use these methods in practice, and how they can also serve as a dimensionality reduction technique.
First of all, let’s discuss the use cases that make these methods so popular.
Customer Segmentation
Online stores use clustering to group their customers according to their purchasing patterns, day of purchase, age, income, and many other factors. This helps the store understand its customers better and make decisions that improve profitability.
To make this concrete, let's take an example. Suppose that after clustering customers by day of purchase and age, we find that customers younger than 22 spend noticeably less during the last days of the month. Most of this group are likely students, and since the end of the month is financially tight for many of them, they may be reluctant to visit the store then. The store can turn this to its advantage by running end-of-month discount sales, which could draw the student crowd in more frequently than before, even at the end of the month.
The store might never have found this opportunity without clustering its customers.
You can find one more example of clustering in this Tableau Dashboard.
For data analysis
Sometimes we find more interesting results when we analyze each cluster of the data separately rather than analyzing the whole dataset at once.
As a dimensionality reduction technique
Clustering methods can also be used as dimensionality reduction methods. We will see how to do this at the end of the article.
Semi-supervised Learning
We can increase the accuracy of our machine-learning model by first clustering the data and then training a separate model for each cluster.
Sometimes, when we use this semi-supervised approach for a classification problem, one of the clusters may contain instances that all share the same label. To handle such a cluster, we can use a trivial model that returns this label for every instance it is given, as in the sketch below.
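Here is a minimal sketch of this idea, assuming a KMeans clustering step and a LogisticRegression classifier per cluster; both are illustrative choices, and the function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

def fit_per_cluster_models(X, y, n_clusters=5, random_state=42):
    """Cluster the data, then train one classifier per cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    cluster_ids = kmeans.fit_predict(X)
    models = {}
    for c in range(n_clusters):
        mask = cluster_ids == c
        X_c, y_c = X[mask], y[mask]
        if len(np.unique(y_c)) == 1:
            # Every instance in this cluster shares one label, so use a
            # constant predictor that always returns that label.
            model = DummyClassifier(strategy="constant", constant=y_c[0])
        else:
            model = LogisticRegression(max_iter=1000)
        models[c] = model.fit(X_c, y_c)
    return kmeans, models

def predict_per_cluster(kmeans, models, X_new):
    """Route each new instance to the model of its cluster."""
    clusters = kmeans.predict(X_new)
    return np.array([models[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(clusters, X_new)])
```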
You can find this approach used in one of my projects. The source code for the project is available on my GitHub account.
Other use cases
Beyond the use cases mentioned above, clustering is also quite helpful for tasks such as image segmentation.
Now let’s see some of the most famous clustering methods.
KMeans Clustering
KMeans is one of the most famous clustering methods. It tries to find the center of each blob of instances and then assigns every instance to its closest center.
Let’s see how this method works.
Start by placing the centroids randomly and giving each one a label. Then assign every instance the label of its closest centroid. Next, update each centroid to the mean of the instances assigned to it. Repeat the assignment and update steps until the centroids stop changing.
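A from-scratch sketch of this loop, using only NumPy, might look like the following (illustrative code, not the implementation used later in the article):

```python
import numpy as np

def kmeans_from_scratch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Place the centroids randomly (here: k randomly chosen instances).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign every instance the label of its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update each centroid to the mean of its assigned instances.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```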
Although this algorithm is guaranteed to converge, it may not converge to the optimal solution. Whether it reaches a good solution depends on the centroid initialization, i.e., the centroid coordinates used at the start of the algorithm.
One solution to this problem is to run the algorithm multiple times with different random centroid initializations and keep the best solution. The best solution is chosen using a performance measure known as inertia: the sum of squared distances between each instance and its closest centroid (lower is better).
A more popular solution is the KMeans++ algorithm. It introduces a smarter initialization step that tends to select centroids that are distant from one another, which makes KMeans much less likely to converge to a sub-optimal solution.
There are further variations of the KMeans algorithm as well, such as accelerated KMeans and mini-batch KMeans.
We can easily implement this algorithm using the Scikit-Learn library's built-in classes. The real challenge, however, is finding the optimal number of clusters that separates the data well.
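As a rough sketch, here is what the Scikit-Learn version might look like on a synthetic blob dataset; the data, the choice of 4 clusters, and the random seed are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 4 blobs (illustrative only).
X, _ = make_blobs(n_samples=1000, centers=4, cluster_std=1.0, random_state=42)

kmeans = KMeans(
    n_clusters=4,       # the number of clusters must be chosen up front
    init="k-means++",   # the smarter initialization discussed above
    n_init=10,          # run 10 random initializations and keep the best
    random_state=42,
)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the estimated centroids
print(kmeans.inertia_)          # sum of squared distances to the closest centroid
```

For very large datasets, Scikit-Learn's MiniBatchKMeans offers much the same interface.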
Finding the optimal number of clusters
There are two methods that can be used to find the optimal number of clusters:
Using the elbow method and a silhouette score
Using the kneed Python library
Using the elbow method and silhouette score
The elbow method comes down to finding the "elbow" in the plot of inertia versus the number of clusters. We want an inertia value that is neither too high nor too low, and such a value is generally located at the elbow of that plot.
We can compute the silhouette score using the Scikit-Learn library. The silhouette score lies between -1 and 1.
A score close to 1 means the instance is well inside its own cluster and far from other clusters, a score close to 0 means it lies near a cluster boundary, and a score close to -1 means the instance may have been assigned to the wrong cluster.
So we look for the number of clusters at which the inertia is neither too high nor too low and the silhouette score is also decent.
Let’s see how to do this.
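A sketch of that analysis might look as follows, reusing X from the previous snippet and sweeping an assumed range of cluster counts:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

ks = range(2, 10)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                         # elbow method input
    silhouettes.append(silhouette_score(X, km.labels_))  # silhouette per k

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(list(ks), inertias, "o-")
ax1.set(xlabel="number of clusters", ylabel="inertia")
ax2.plot(list(ks), silhouettes, "o-")
ax2.set(xlabel="number of clusters", ylabel="silhouette score")
plt.show()
```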
According to the plot, the optimal number of clusters should be 4 or 5. The inertia value at these two values is neither too high nor too low.
According to the two plots above, with the number of clusters equal to 4 we get a decent silhouette score as well as a good inertia value. So we can use 4 clusters and expect good clustering performance.
Using the kneed library
The kneed library gives the same number of clusters as the first method. Now let's use 4 clusters to separate the data and then visualize the result.
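A sketch with kneed's KneeLocator, followed by fitting and plotting the 4-cluster solution, could look like this; it reuses ks, inertias, and X from the snippets above, and the plot styling is illustrative.

```python
import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.cluster import KMeans

kl = KneeLocator(list(ks), inertias, curve="convex", direction="decreasing")
print(kl.elbow)  # the elbow found automatically from the inertia curve

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Visualize the clusters (assumes 2D data, as in the blob example above).
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100)
plt.show()
```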
DBSCAN Clustering
This algorithm defines clusters as continuous regions of high density separated by regions of low density. Because of this, the clusters found by DBSCAN can take any shape, unlike KMeans, which produces convex-shaped clusters.
The most important concept in DBSCAN is that of core samples: instances that lie in high-density regions. A cluster in DBSCAN is therefore a set of core samples that are close to one another, together with the non-core samples that are close to those core samples. We can easily implement DBSCAN using Scikit-Learn's DBSCAN class, which has two important parameters, min_samples and eps, that define what we mean by "dense".
For each instance, the algorithm counts how many instances are located within a small distance eps from it. This region is called the instance's eps-neighborhood.
If an instance has at least min_samples instances in its eps-neighborhood (including itself), it is considered a core instance. All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; hence, a long sequence of neighboring core instances forms a single cluster.
Let's see how to perform clustering with Scikit-Learn's DBSCAN class on the same data that we used for KMeans clustering.
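A hedged example might look like this; eps and min_samples are illustrative values and usually need tuning for your own data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)

print(np.unique(labels))                 # cluster labels; -1 marks noise
print(len(dbscan.core_sample_indices_))  # number of core instances found
```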
Note that the noisy data samples are given the label -1.
There are many other clustering algorithms, such as agglomerative clustering, mean-shift clustering, affinity propagation, and spectral clustering.
Now that we have learned how to implement clustering algorithms, let’s see how we can use these methods for the dimensionality reduction problem.
Using clustering as a dimensionality reduction method
Once the clustering is complete, we can find the affinity of each instance with each cluster.
Affinity is a measure of how well an instance fits into a given cluster; for KMeans, the distance between the instance and each centroid can play this role.
Once we have the affinity vector of each instance, we can replace the original instance with it. If the affinity vector is k-dimensional, the transformed data will have only k dimensions.
It doesn't matter how many dimensions the original data has; after this step, the data has as many dimensions as the number of clusters into which it is divided.
Let us use an iris flower dataset for this demonstration.
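Here is a sketch of that idea: KMeans acts as a transformer whose transform step maps each instance to its distances to the centroids (serving as the affinity vector), and a LogisticRegression classifier is trained on top. The number of clusters, the classifier, and the train/test split are illustrative choices, so the exact scores will vary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: classify on the original 4 features.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("original features:", baseline.score(X_test, y_test))

# Cluster-based reduction: each instance becomes its distances to 3 centroids.
pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
print("cluster features:", pipeline.score(X_test, y_test))
```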
Here we can see that there are 2 more inaccurate predictions than before. This is because the dimensionality reduction step loses some information. However, having only 2 inaccurate predictions still indicates a high level of accuracy.
I hope you like the article. If you have any thoughts on the article then please let me know. Any constructive feedback is highly appreciated.
Have a great day!