Clustering algorithms are unsupervised algorithms: the training data is not labeled. Instead, the algorithm groups the data points based on common characteristics. There are two main techniques for clustering data: K-Means clustering and Hierarchical clustering. In this project, you will use K-Means clustering for customer segmentation. Before you implement the actual code, let's first briefly review what K-Means clustering is.
K-means Clustering Algorithms
K-Means clustering is one of the most frequently used algorithms for clustering unlabeled data. In K-Means clustering, K refers to the number of clusters that you want your data to be grouped into, and this number has to be defined before the algorithm can be applied to the data points.
Steps for K-Means Clustering
Following are the steps needed to perform K-Means clustering of data points.
- Randomly initialize the centroid values for each cluster.
- Calculate the distance (Euclidean or Manhattan) between each data point and the centroid of every cluster.
- Assign each data point to the cluster whose centroid is the shortest distance away.
- Recalculate each centroid as the mean of the coordinates of all the data points assigned to that cluster.
- Repeat steps 2-4 until the centroid values no longer change between iterations.
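The steps above can be sketched directly in NumPy. This is a minimal illustration of the algorithm, not the scikit-learn implementation used later in the project; the function name, the toy data, and K = 2 are assumptions made for the example:

```python
import numpy as np

def kmeans_sketch(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: use k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: two well-separated groups of points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```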
Why Use K-Means Clustering?
K-Means clustering is popular for several reasons:
- It is simple to implement.
- It can be applied to large datasets.
- It scales well to unseen data points.
- It generalizes to clusters of various sizes and shapes.
Disadvantages of K-Means Clustering Algorithms
The following are some of the disadvantages of the K-Means clustering algorithm.
- The value of K has to be chosen manually.
- Convergence, and hence training time, depends on the initial centroid values.
- Clustering performance is affected greatly by outliers.
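The outlier sensitivity is easy to demonstrate. In the sketch below (toy one-dimensional data, an assumption for illustration), a single extreme value drags the centroid far away from the compact group of points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Five tightly grouped values plus one extreme outlier (toy data)
X = np.array([[1.0], [1.1], [0.9], [1.2], [0.8], [100.0]])

# With a single cluster, the centroid is simply the mean of all points,
# so one outlier drags it far away from the compact group near 1.0
km = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # roughly [[17.5]], nowhere near 1.0
```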
Enough of theory. Let’s see how to use K-Means clustering for customer segmentation.
The first step is importing the required libraries, as shown in the following script:
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
Importing the Dataset
The CSV file for the dataset, Mall_Customers.csv, is freely available online and can also be downloaded from the Datasets folder of the GitHub and SharePoint repositories.
The following script imports the dataset:
dataset = pd.read_csv(r'E:\Datasets\Mall_Customers.csv')
The output below shows that the dataset has five columns: CustomerID, Genre, Age, Annual Income (k$), and Spending Score (1-100). The spending score is assigned to each customer based on previous spending habits; customers who spent more in the past have higher scores.
Let's see the shape of the dataset.
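Both the first rows and the shape can be checked with head() and shape. Here is a self-contained sketch; the file path follows the earlier script, and the fallback rows are made-up placeholders used only when the CSV is not available:

```python
import pandas as pd

try:
    # Path is an assumption; adjust it to your local copy of the file
    dataset = pd.read_csv(r'E:\Datasets\Mall_Customers.csv')
except FileNotFoundError:
    # Made-up placeholder rows with the same five columns
    dataset = pd.DataFrame({
        'CustomerID': [1, 2, 3],
        'Genre': ['Male', 'Male', 'Female'],
        'Age': [19, 21, 20],
        'Annual Income (k$)': [15, 15, 16],
        'Spending Score (1-100)': [39, 81, 6],
    })

print(dataset.head())   # first rows and column names
print(dataset.shape)    # (rows, columns); the full file has 200 rows and 5 columns
```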
The output below shows that the dataset contains 200 records and 5 columns.
Before we do actual customer segmentation, let’s briefly analyze the dataset. Let’s plot a histogram showing the annual income of the customers.
sns.histplot(dataset['Annual Income (k$)'], bins=50)
The output shows that most of the customers have incomes between 60 and 90K per year.
Similarly, we can plot a histogram for the spending scores of the customers, as well.
sns.histplot(dataset['Spending Score (1-100)'], bins=50, color='red')
The output shows that most of the customers have a spending score between 40 and 60.
We can also plot a regression line between annual income and spending score to see if there is any linear relationship between the two or not.
sns.regplot(x='Annual Income (k$)', y='Spending Score (1-100)', data=dataset)
From the nearly flat regression line in the output below, you can infer that there is no linear relationship between annual income and spending score.
Finally, you can also plot a linear regression line between the Age column and the spending score.
sns.regplot(x='Age', y='Spending Score (1-100)', data=dataset)
The downward-sloping regression line in the output suggests an inverse linear relationship between age and spending score: younger customers tend to spend more than older ones.
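To quantify what the plot suggests, you can compute Pearson's correlation coefficient with pandas' corr() method. The frame below uses made-up values that merely mimic the downward trend, purely for illustration; on the real data you would call dataset['Age'].corr(dataset['Spending Score (1-100)']):

```python
import pandas as pd

# Made-up values that mimic the downward age-vs-spending trend
df = pd.DataFrame({
    'Age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],
    'Spending Score (1-100)': [75, 70, 68, 60, 55, 50, 42, 40, 35, 30],
})

# Pearson's r: values near -1 indicate a strong inverse linear relation
r = df['Age'].corr(df['Spending Score (1-100)'])
print(round(r, 3))
```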
Enough of the data analysis. We are now ready to perform customer segmentation on our data using the K-Means algorithm.
K-Means Clustering for Customer Segmentation
We will apply K-Means clustering to the annual income and spending score columns, because we want to target the customer base with high incomes and high spending scores. Therefore, we will keep these two columns and remove the remaining columns from our dataset. Here is the script to do so:
dataset = dataset.filter(['Annual Income (k$)', 'Spending Score (1-100)'], axis=1)
dataset.head()
The output shows that we now have only the annual income and spending score columns in our dataset.
To implement K-Means clustering, you can use the KMeans class from the sklearn.cluster module of the scikit-learn library. You pass the number of clusters to the KMeans constructor via the n_clusters parameter. To train the model, simply pass the dataset to the fit() method, as shown below.
# performing k-means clustering using the KMeans class
km_model = KMeans(n_clusters=4)
km_model.fit(dataset)
Once the model is trained, you can print the cluster centers using the cluster_centers_ attribute of the KMeans object.
# printing centroid values
print(km_model.cluster_centers_)
The four cluster centers predicted by our K-Means model have the following values:
[[48.26       56.48      ]
 [86.53846154 82.12820513]
 [87.         18.63157895]
 [26.30434783 20.91304348]]
In addition to finding cluster centers, the KMeans class also assigns a cluster label to each data point. The cluster labels are numbers that serve as cluster IDs. For instance, in the case of four clusters, the cluster IDs are 0, 1, 2, and 3. To print the cluster IDs for all the data points, you can use the labels_ attribute of the KMeans object, as shown below.
# printing predicted label values
print(km_model.labels_)
[3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1]
The following script plots the clusters in different colors, along with the cluster centers as black data points.
# plotting the data points
plt.scatter(dataset.values[:, 0], dataset.values[:, 1], c=km_model.labels_, cmap='rainbow')

# plotting the centroids
plt.scatter(km_model.cluster_centers_[:, 0], km_model.cluster_centers_[:, 1], s=100, c='black')
So far in this project, we have picked the value of K, the number of clusters, arbitrarily. However, we do not know exactly how many customer segments there are in our dataset. To find the optimal number of customer segments, we need to find the optimal value of K, because K defines the number of clusters. A standard way to do this is known as the elbow method.
Elbow Method for Finding K value
In the elbow method, K-Means models are trained for a range of K values, and the inertia obtained from each model is plotted against K.
Inertia is the sum of squared distances between each data point and the centroid of its assigned cluster. Smaller inertia means the predicted clusters are compact, with data points close to their centroids.
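This definition can be verified by hand: a trained model's inertia_ equals the sum of squared distances from each sample to its assigned centroid. Here is a small sketch on toy data (the data values themselves are an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two compact groups (values are an assumption)
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute inertia manually: sum of squared distances from each
# point to the centroid of its assigned cluster
manual = sum(np.sum((x - km.cluster_centers_[lbl]) ** 2)
             for x, lbl in zip(X, km.labels_))

print(manual, km.inertia_)  # the two values match
```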
To retrieve the inertia value, you can use the inertia_ attribute of a trained KMeans object. The following script computes inertia values for K = 1 to 10 and plots them.
# training KMeans for K values from 1 to 10
loss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(dataset)
    loss.append(km.inertia_)

# plotting loss against the number of clusters
plt.plot(range(1, 11), loss)
plt.title('Finding Optimal Clusters via Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Loss')
plt.show()
From the output below, it can be seen that the value of inertia didn’t decrease much after five clusters.
Let’s now segment our customer data into five groups by creating five clusters.
# performing k-means clustering using the KMeans class
km_model = KMeans(n_clusters=5)
km_model.fit(dataset)
# plotting the data points
plt.scatter(dataset.values[:, 0], dataset.values[:, 1], c=km_model.labels_, cmap='rainbow')

# plotting the centroids
plt.scatter(km_model.cluster_centers_[:, 0], km_model.cluster_centers_[:, 1], s=100, c='black')

When K is 5, the clusters predicted by the K-Means clustering algorithm are as follows:
From the above output, you can see that the customers are divided into five segments. The customers in the middle of the plot (in purple) are the customers with an average income and average spending. The customers belonging to the red cluster are the ones with a low income and low spending. You need to target the customers who belong to the top right cluster (sky blue). These are the customers with high incomes and high spending in the past, and they are more likely to spend in the future, as well. So any new marketing campaigns or advertisements should be directed at these customers.
Finding Customers to Target for Marketing
The last step is to find the customers that belong to the sky blue cluster. To do so, we will first print the centers of the clusters.
# printing centroid values
print(km_model.cluster_centers_)
Here is the output. The coordinates of the centroid of the top-right cluster are approximately 86.54 and 82.13. This centroid is located at index 1, which is also the ID of the cluster.
[[55.2962963  49.51851852]
 [86.53846154 82.12820513]
 [25.72727273 79.36363636]
 [88.2        17.11428571]
 [26.30434783 20.91304348]]
To fetch all the records from the cluster with id 1, we will first create a dataframe containing index values of all the records in the dataset and their corresponding cluster labels, as shown below.
cluster_map = pd.DataFrame()
cluster_map['data_index'] = dataset.index.values
cluster_map['cluster'] = km_model.labels_
cluster_map
Next, we can simply filter all the records from the cluster_map data frame, where the value of the cluster column is 1. Execute the following script to do so.
cluster_map = cluster_map[cluster_map.cluster == 1]
cluster_map.head()
Here are the first five records that belong to cluster 1. These are the customers that have high incomes and high spending.
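Since the CustomerID column was dropped before clustering, the data_index values can be used to look the target customers up in the original file. The sketch below re-runs the pipeline end to end: the file path follows the earlier scripts, the small fallback frame contains made-up placeholder values used only when the CSV is not available, and K = 2 is used only because the fallback frame is tiny (with the full dataset you would keep K = 5):

```python
import pandas as pd
from sklearn.cluster import KMeans

try:
    # Path is an assumption; adjust it to your local copy of the file
    original = pd.read_csv(r'E:\Datasets\Mall_Customers.csv')
except FileNotFoundError:
    # Made-up placeholder rows with the same five columns
    original = pd.DataFrame({
        'CustomerID': range(1, 13),
        'Genre': ['Male', 'Female'] * 6,
        'Age': [19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, 35],
        'Annual Income (k$)': [15, 15, 16, 16, 17, 17, 78, 78, 80, 80, 81, 81],
        'Spending Score (1-100)': [39, 81, 6, 77, 40, 76, 89, 88, 90, 87, 92, 86],
    })

# Cluster on the two columns used throughout the project
features = original[['Annual Income (k$)', 'Spending Score (1-100)']]
km_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# The target cluster is the one whose centroid has the largest
# combined income and spending score
target = km_model.cluster_centers_.sum(axis=1).argmax()

# Map the cluster's row positions back to CustomerID values
target_ids = original.loc[km_model.labels_ == target, 'CustomerID']
print(target_ids.tolist())
```

Because the clustering operates on row positions while CustomerID lives in the original frame, this lookup is what turns cluster membership into an actionable list of customers for a marketing campaign.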