
K-means Clustering Algorithm With Numerical Example

K-means clustering is one of the most widely used clustering algorithms in machine learning. In this article, we will discuss the concept, examples, advantages, and disadvantages of the k-means clustering algorithm. We will also work through a numerical example on k-means clustering to understand the algorithm better.

What is K-means Clustering?

K-means clustering is an unsupervised machine learning algorithm used to group a dataset into k clusters. It is an iterative algorithm that starts by randomly selecting k centroids in the dataset. After selecting the centroids, the entire dataset is divided into clusters based on the distance of the data points from the centroids. Then, the centroid of each new cluster is calculated by taking the mean of the data points in that cluster.

With the new centroids, we regroup the dataset into new clusters. This process continues until we get stable clusters. K-means clustering is a partition clustering algorithm: we call it partition clustering because it partitions the entire dataset into mutually exclusive clusters.

K-means Clustering Algorithm

To understand the process of clustering using the k-means clustering algorithm and solve the numerical example, let us first state the algorithm. Given a dataset of N entries and a number K of clusters to be formed, we will use the following steps to find the clusters (a short code sketch follows the list).

  1. First, we will select K random entries from the dataset and use them as centroids.
  2. Now, we will find the distance of each entry in the dataset from the centroids. You can use any distance metric, such as Euclidean distance, Manhattan distance, or squared Euclidean distance.
  3. After finding the distance of each data entry from the centroids, we will start assigning the data points to clusters. We will assign each data point to the cluster whose centroid is nearest to it.
  4. After assigning the points to clusters, we will calculate the new centroid of each cluster as the mean of the data points in that cluster. If the newly calculated centroids are the same as the centroids in the previous iteration, we will consider the current clusters to be final and stop the execution of the algorithm. If any of the newly calculated centroids differs from the previous iteration, we will go back to step 2.
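The following NumPy sketch mirrors these steps. It is illustrative only: it assumes the data is a 2-D array of points and that no cluster ever becomes empty.

```python
import numpy as np

def kmeans(points, k, rng=np.random.default_rng(0)):
    # Step 1: select k random entries as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2: Euclidean distance of every point from every centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each point to the cluster with the nearest centroid.
        labels = distances.argmin(axis=1)
        # Step 4: the new centroid of each cluster is the mean of its points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change; otherwise repeat from step 2.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids
```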

K-means Clustering Numerical Example with Solution

Now that we have discussed the algorithm, let us solve a numerical problem on k-means clustering. The problem is as follows. You are given 15 points in the Cartesian coordinate system:

| Point | Coordinates |
| --- | --- |
| A1 | (2, 10) |
| A2 | (2, 6) |
| A3 | (11, 11) |
| A4 | (6, 9) |
| A5 | (6, 4) |
| A6 | (1, 2) |
| A7 | (5, 10) |
| A8 | (4, 9) |
| A9 | (10, 12) |
| A10 | (7, 5) |
| A11 | (9, 11) |
| A12 | (4, 6) |
| A13 | (3, 10) |
| A14 | (3, 8) |
| A15 | (6, 11) |

We are also told to form 3 clusters, i.e., we are given K = 3. We will solve this k-means clustering problem using the approach discussed below.

First, we will randomly choose 3 centroids from the given data. Let us consider A2 (2,6), A7 (5,10), and A15 (6,11) as the centroids of the initial clusters. Hence, we will consider that

  • Centroid 1=(2,6) is associated with cluster 1.
  • Centroid 2=(5,10) is associated with cluster 2.
  • Centroid 3=(6,11) is associated with cluster 3.

Now we will find the Euclidean distance between each point and the centroids. Based on the minimum distance of each point from the centroids, we will assign the points to clusters. The distances of the given points from the centroids are tabulated below.

| Point | Distance from Centroid 1 (2, 6) | Distance from Centroid 2 (5, 10) | Distance from Centroid 3 (6, 11) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A1 (2, 10) | 4 | 3 | 4.123106 | Cluster 2 |
| A2 (2, 6) | 0 | 5 | 6.403124 | Cluster 1 |
| A3 (11, 11) | 10.29563 | 6.082763 | 5 | Cluster 3 |
| A4 (6, 9) | 5 | 1.414214 | 2 | Cluster 2 |
| A5 (6, 4) | 4.472136 | 6.082763 | 7 | Cluster 1 |
| A6 (1, 2) | 4.123106 | 8.944272 | 10.29563 | Cluster 1 |
| A7 (5, 10) | 5 | 0 | 1.414214 | Cluster 2 |
| A8 (4, 9) | 3.605551 | 1.414214 | 2.828427 | Cluster 2 |
| A9 (10, 12) | 10 | 5.385165 | 4.123106 | Cluster 3 |
| A10 (7, 5) | 5.09902 | 5.385165 | 6.082763 | Cluster 1 |
| A11 (9, 11) | 8.602325 | 4.123106 | 3 | Cluster 3 |
| A12 (4, 6) | 2 | 4.123106 | 5.385165 | Cluster 1 |
| A13 (3, 10) | 4.123106 | 2 | 3.162278 | Cluster 2 |
| A14 (3, 8) | 2.236068 | 2.828427 | 4.242641 | Cluster 1 |
| A15 (6, 11) | 6.403124 | 1.414214 | 0 | Cluster 3 |
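For example, the distance of A1 (2, 10) from centroid 1 (2, 6) is √((2 − 2)² + (10 − 6)²) = 4, while its distance from centroid 2 (5, 10) is √((5 − 2)² + (10 − 10)²) = 3. Since centroid 2 is the nearest, A1 is assigned to cluster 2.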

At this point, we have completed the first iteration of the k-means clustering algorithm and assigned each point to a cluster.

In the above table, you can observe that the point that is closest to the centroid of a given cluster is assigned to the cluster. 

Now, we will calculate the new centroid for each cluster.

  • In cluster 1, we have 6 points, i.e., A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), and A14 (3,8). To calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of the points in the cluster (worked out below). Hence, the new centroid for cluster 1 is (3.833, 5.167).
  • In cluster 2, we have 5 points, i.e., A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), and A13 (3,10). Hence, the new centroid for cluster 2 is (4, 9.6).
  • In cluster 3, we have 4 points, i.e., A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new centroid for cluster 3 is (9, 11.25).
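For example, the new centroid of cluster 1 is ((2 + 6 + 1 + 7 + 4 + 3)/6, (6 + 4 + 2 + 5 + 6 + 8)/6) = (23/6, 31/6) ≈ (3.833, 5.167).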

Now that we have calculated new centroids for each cluster, we will calculate the distance of each data point from the new centroids. Then, we will assign the points to clusters based on their distance from the centroids. The results for this process have been given in the following table.

| Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A1 (2, 10) | 5.169 | 2.040 | 7.111 | Cluster 2 |
| A2 (2, 6) | 2.013 | 4.118 | 8.750 | Cluster 1 |
| A3 (11, 11) | 9.241 | 7.139 | 2.016 | Cluster 3 |
| A4 (6, 9) | 4.403 | 2.088 | 3.750 | Cluster 2 |
| A5 (6, 4) | 2.461 | 5.946 | 7.846 | Cluster 1 |
| A6 (1, 2) | 4.249 | 8.171 | 12.230 | Cluster 1 |
| A7 (5, 10) | 4.972 | 1.077 | 4.191 | Cluster 2 |
| A8 (4, 9) | 3.837 | 0.600 | 5.483 | Cluster 2 |
| A9 (10, 12) | 9.204 | 6.462 | 1.250 | Cluster 3 |
| A10 (7, 5) | 3.171 | 5.492 | 6.562 | Cluster 1 |
| A11 (9, 11) | 7.792 | 5.192 | 0.250 | Cluster 3 |
| A12 (4, 6) | 0.850 | 3.600 | 7.250 | Cluster 1 |
| A13 (3, 10) | 4.904 | 1.077 | 6.129 | Cluster 2 |
| A14 (3, 8) | 2.953 | 1.887 | 6.824 | Cluster 2 |
| A15 (6, 11) | 6.223 | 2.441 | 3.010 | Cluster 2 |

Now, we have completed the second iteration of the k-means clustering algorithm and assigned each point to an updated cluster. In the above table, you can observe that the point closest to the new centroid of a given cluster is assigned to that cluster.

Now, we will calculate the new centroid for each cluster for the third iteration.

  • In cluster 1, we have 5 points, i.e., A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of the points in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
  • In cluster 2, we have 7 points, i.e., A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
  • In cluster 3, we have 3 points, i.e., A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3 is (10, 11.333).

At this point, we have calculated new centroids for each cluster. Now, we will calculate the distance of each data point from the new centroids. Then, we will assign the points to clusters based on their distance from the centroids. The results for this process have been given in the following table.

| Point | Distance from Centroid 1 (4, 4.6) | Distance from Centroid 2 (4.143, 9.571) | Distance from Centroid 3 (10, 11.333) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A1 (2, 10) | 5.758 | 2.186 | 8.110 | Cluster 2 |
| A2 (2, 6) | 2.441 | 4.165 | 9.615 | Cluster 1 |
| A3 (11, 11) | 9.485 | 7.004 | 1.054 | Cluster 3 |
| A4 (6, 9) | 4.833 | 1.943 | 4.631 | Cluster 2 |
| A5 (6, 4) | 2.088 | 5.872 | 8.353 | Cluster 1 |
| A6 (1, 2) | 3.970 | 8.197 | 12.966 | Cluster 1 |
| A7 (5, 10) | 5.492 | 0.958 | 5.175 | Cluster 2 |
| A8 (4, 9) | 4.400 | 0.589 | 6.438 | Cluster 2 |
| A9 (10, 12) | 9.527 | 6.341 | 0.667 | Cluster 3 |
| A10 (7, 5) | 3.027 | 5.390 | 7.008 | Cluster 1 |
| A11 (9, 11) | 8.122 | 5.063 | 1.054 | Cluster 3 |
| A12 (4, 6) | 1.400 | 3.574 | 8.028 | Cluster 1 |
| A13 (3, 10) | 5.492 | 1.221 | 7.126 | Cluster 2 |
| A14 (3, 8) | 3.544 | 1.943 | 7.753 | Cluster 2 |
| A15 (6, 11) | 6.705 | 2.343 | 4.014 | Cluster 2 |

Now, we have completed the third iteration of the k-means clustering algorithm and assigned each point to an updated cluster. In the above table, you can observe that the point that is closest to the new centroid of a given cluster is assigned to that cluster.

Here, you can observe that no point has changed its cluster compared to the previous iteration, so the centroids also remain unchanged. Therefore, we say that the clusters have stabilized: the clusters obtained after the third iteration are the final clusters for the given dataset. If we plot the clusters on a graph, the result looks as follows.

[Figure: K-means clustering plot]

In the above plot, points in the clusters have been plotted using red, blue, and black markers. The centroids of the clusters have been marked using green circles.

Applications of K-means Clustering in Machine Learning

The k-means clustering algorithm finds applications in various domains. Following are some popular applications of k-means clustering.

  • Document classification: Using k-means clustering, we can divide documents into clusters based on their content, topics, and tags.
  • Customer segmentation: Supermarkets and e-commerce websites divide their customers into clusters based on their transaction data and demographics. This helps the business target the right customers with relevant products to increase sales.
  • Cyber profiling: In cyber profiling, we collect data from individuals as well as groups to identify their relationships. With k-means clustering, we can easily cluster people based on their connections to each other to identify patterns.
  • Image segmentation: We can use k-means clustering to perform image segmentation by grouping similar pixels into clusters.
  • Fraud detection in banking and insurance: Using historical fraud data, banks and insurance agencies can detect potentially fraudulent transactions by applying k-means clustering.

Apart from these examples, there are various other applications of k-means clustering such as ride-share data analysis, social media profiling, identification of crime localities, etc. 

Suggested Reading: Advanced Coding Concepts

Advantages of K-means Clustering Algorithm

Following are some of the advantages of the k-means clustering algorithm.

  • Easy to implement: K-means clustering is an iterative and relatively simple algorithm. In fact, we can even perform k-means clustering manually, as we did in the numerical example.
  • Scalability: We can use k-means clustering on a dataset of 10 records or 10 million records, and it will produce results in both cases.
  • Convergence: The k-means clustering algorithm is guaranteed to converge, so the execution of the algorithm always terminates with a result.
  • Generalization: K-means clustering is not tied to a specific problem. From numerical data to text documents, you can apply the k-means clustering algorithm to almost any dataset. It can also be applied to datasets of different sizes with entirely different distributions.
  • Choice of centroids: You can easily warm-start the algorithm with your own choice of centroids. Hence, the algorithm allows you to choose and assign centroids that fit the dataset well.

Disadvantages of K-means Clustering Algorithm

With all its advantages, the k-means algorithm has certain disadvantages too, which are discussed below.

  • Deciding the number of clusters: In k-means clustering, you must specify the number of clusters in advance; choosing a good value typically requires a technique such as the elbow method.
  • Choice of initial centroids: The number of iterations in the clustering process depends heavily on the choice of the initial centroids. Hence, you need to choose the initial centroids carefully to maximize the efficiency of the algorithm.
  • Effect of outliers: During the execution of the k-means clustering algorithm, we use all the points in a cluster to determine the centroid for the next iteration. If there are outliers in the dataset, they strongly distort the positions of the centroids, making the clustering inaccurate. To avoid this, try to identify and remove outliers during data cleaning.
  • Curse of dimensionality: As the number of dimensions in the dataset increases, the distances of the data points from a given point start converging to a specific value. Because of this, k-means clustering, which computes clusters based on the distances between points, becomes inefficient. To overcome this problem, you can use advanced clustering algorithms like spectral clustering. Alternatively, you can reduce the dimensionality of the dataset during data preprocessing.

In this article, we have explained the k-means clustering algorithm with a numerical example. We have also discussed the applications, advantages, and disadvantages of the k-means clustering algorithm. 

To learn more about machine learning, you can read this article on regression in machine learning. You might also like this article on polynomial regression using sklearn in Python.

To read about other computer science topics, you can read this article on dynamic role-based authorization using ASP.NET. You can also read this article on user activity logging using ASP.NET.



K-Means Clustering Algorithm | Examples

Practice Problems Based on K-Means Clustering Algorithm

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as d(a, b) = |x2 – x1| + |y2 – y1|, i.e., the Manhattan distance.

Use the k-means algorithm to find the three cluster centers after the second iteration, taking the initial cluster centers as A1(2, 10), A4(5, 8), and A7(1, 2).

Solution: We follow the k-means clustering algorithm discussed above.

Iteration-01:

Calculating distance between A1(2, 10) and C1(2, 10):

= |2 – 2| + |10 – 10| = 0

Calculating distance between A1(2, 10) and C2(5, 8):

= |5 – 2| + |8 – 10| = 3 + 2 = 5

Calculating distance between A1(2, 10) and C3(1, 2):

= |1 – 2| + |2 – 10| = 1 + 8 = 9

In a similar manner, we calculate the distance of every point from each of the three centers:

| Point | Distance from C1(2, 10) | Distance from C2(5, 8) | Distance from C3(1, 2) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A1(2, 10) | 0 | 5 | 9 | C1 |
| A2(2, 5) | 5 | 6 | 4 | C3 |
| A3(8, 4) | 12 | 7 | 9 | C2 |
| A4(5, 8) | 5 | 0 | 10 | C2 |
| A5(7, 5) | 10 | 5 | 9 | C2 |
| A6(6, 4) | 10 | 5 | 7 | C2 |
| A7(1, 2) | 9 | 10 | 0 | C3 |
| A8(4, 9) | 3 | 2 | 10 | C2 |

Cluster-01 contains: A1(2, 10)

Cluster-02 contains: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)

Cluster-03 contains: A2(2, 5), A7(1, 2)

Now, we recompute each cluster center as the mean of the points in that cluster.

For Cluster-01: the center remains (2, 10), since the cluster contains only A1.

For Cluster-02:

= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)

= (6, 6)

For Cluster-03:

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

Iteration-02:

The following illustration shows the calculation of the distance between point A1(2, 10) and each of the three new cluster centers:

Calculating distance between A1(2, 10) and C1(2, 10):

= |2 – 2| + |10 – 10| = 0

Calculating distance between A1(2, 10) and C2(6, 6):

= |6 – 2| + |6 – 10| = 4 + 4 = 8

Calculating distance between A1(2, 10) and C3(1.5, 3.5):

= |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7

| Point | Distance from C1(2, 10) | Distance from C2(6, 6) | Distance from C3(1.5, 3.5) | Assigned Cluster |
| --- | --- | --- | --- | --- |
| A1(2, 10) | 0 | 8 | 7 | C1 |
| A2(2, 5) | 5 | 5 | 2 | C3 |
| A3(8, 4) | 12 | 4 | 7 | C2 |
| A4(5, 8) | 5 | 3 | 8 | C2 |
| A5(7, 5) | 10 | 2 | 7 | C2 |
| A6(6, 4) | 10 | 2 | 5 | C2 |
| A7(1, 2) | 9 | 9 | 2 | C3 |
| A8(4, 9) | 3 | 5 | 8 | C1 |

From here, the new clusters are:

First cluster contains: A1(2, 10), A8(4, 9)

Second cluster contains: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4)

Third cluster contains: A2(2, 5), A7(1, 2)

Again, we recompute the cluster centers.

For the first cluster:

= ((2 + 4)/2, (10 + 9)/2)

= (3, 9.5)

For the second cluster:

= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)

= (6.5, 5.25)

For the third cluster:

= ((2 + 1)/2, (5 + 2)/2)

= (1.5, 3.5)

Hence, after the second iteration, the three cluster centers are C1(3, 9.5), C2(6.5, 5.25), and C3(1.5, 3.5).

Problem-02:

Given five points A(2, 2), B(3, 2), C(1, 1), D(3, 1), and E(1.5, 0.5), use the k-means algorithm with the Euclidean distance and the initial cluster centers C1(2, 2) and C2(1, 1) to assign each point to a cluster.

Calculating distance between A(2, 2) and C1(2, 2):

= √((2 – 2)² + (2 – 2)²) = 0

Calculating distance between A(2, 2) and C2(1, 1):

= √((1 – 2)² + (1 – 2)²) = √2 ≈ 1.41

In a similar manner, we calculate the distances for the remaining points:

| Point | Distance from C1(2, 2) | Distance from C2(1, 1) | Assigned Cluster |
| --- | --- | --- | --- |
| A(2, 2) | 0 | 1.41 | C1 |
| B(3, 2) | 1 | 2.24 | C1 |
| C(1, 1) | 1.41 | 0 | C2 |
| D(3, 1) | 1.41 | 2 | C1 |
| E(1.5, 0.5) | 1.58 | 0.71 | C2 |

Hence, Cluster-01 contains the points A, B, and D, and Cluster-02 contains the points C and E.

K-Means Clustering in Python: A Practical Guide


Table of Contents

  • Overview of Clustering Techniques
      • Partitional Clustering
      • Hierarchical Clustering
      • Density-Based Clustering
  • Understanding the K-Means Algorithm
  • Writing Your First K-Means Clustering Code in Python
  • Choosing the Appropriate Number of Clusters
  • Evaluating Clustering Performance Using Advanced Techniques
  • Building a K-Means Clustering Pipeline
  • Tuning a K-Means Clustering Pipeline

The k -means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k -means is one of the oldest and most approachable. These traits make implementing k -means clustering in Python reasonably straightforward, even for novice programmers and data scientists.

If you’re interested in learning how and when to implement k -means clustering in Python, then this is the right place. You’ll walk through an end-to-end example of k -means clustering using Python, from preprocessing the data to evaluating results.

In this tutorial, you’ll learn:

  • What k -means clustering is
  • When to use k -means clustering to analyze your data
  • How to implement k -means clustering in Python with scikit-learn
  • How to select a meaningful number of clusters

Click the link below to download the code you’ll use to follow along with the examples in this tutorial and implement your own k -means clustering pipeline:

Download the sample code: Click here to get the code you’ll use to learn how to write a k-means clustering pipeline in this tutorial.

What Is Clustering?

Clustering is a set of techniques used to partition data into groups, or clusters. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters. In practice, clustering helps identify two qualities of data:

  • Meaningfulness
  • Usefulness

Meaningful clusters expand domain knowledge. For example, in the medical field, researchers applied clustering to gene expression experiments. The clustering results identified groups of patients who respond differently to medical treatments.

Useful clusters, on the other hand, serve as an intermediate step in a data pipeline. For example, businesses use clustering for customer segmentation. The clustering results segment customers into groups with similar purchase histories, which businesses can then use to create targeted advertising campaigns.

Note: You’ll learn about unsupervised machine learning techniques in this tutorial. If you’re interested in learning more about supervised machine learning techniques, then check out Logistic Regression in Python .

There are many other applications of clustering , such as document clustering and social network analysis. These applications are relevant in nearly every industry, making clustering a valuable skill for professionals working with data in any field.

You can perform clustering using many different approaches—so many, in fact, that there are entire categories of clustering algorithms. Each of these categories has its own unique strengths and weaknesses. This means that certain clustering algorithms will result in more natural cluster assignments depending on the input data.

Note: If you’re interested in learning about clustering algorithms not mentioned in this section, then check out A Comprehensive Survey of Clustering Algorithms for an excellent review of popular techniques.

Selecting an appropriate clustering algorithm for your dataset is often difficult due to the number of choices available. Some important factors that affect this decision include the characteristics of the clusters, the features of the dataset, the number of outliers, and the number of data objects.

You’ll explore how these factors help determine which approach is most appropriate by looking at three popular categories of clustering algorithms:

  • Partitional clustering
  • Hierarchical clustering
  • Density-based clustering

It’s worth reviewing these categories at a high level before jumping right into k -means. You’ll learn the strengths and weaknesses of each category to provide context for how k -means fits into the landscape of clustering algorithms.

Partitional clustering divides data objects into nonoverlapping groups. In other words, no object can be a member of more than one cluster, and every cluster must have at least one object.

These techniques require the user to specify the number of clusters, indicated by the variable k . Many partitional clustering algorithms work through an iterative process to assign subsets of data points into k clusters. Two examples of partitional clustering algorithms are k -means and k -medoids.

These algorithms are both nondeterministic , meaning they could produce different results from two separate runs even if the runs were based on the same input.

Partitional clustering methods have several strengths :

  • They work well when clusters have a spherical shape .
  • They’re scalable with respect to algorithm complexity.

They also have several weaknesses :

  • They’re not well suited for clusters with complex shapes and different sizes.
  • They break down when used with clusters of different densities .

Hierarchical clustering determines cluster assignments by building a hierarchy. This is implemented by either a bottom-up or a top-down approach:

Agglomerative clustering is the bottom-up approach. It merges the two points that are the most similar until all points have been merged into a single cluster.

Divisive clustering is the top-down approach. It starts with all points as one cluster and splits the least similar clusters at each step until only single data points remain.

These methods produce a tree-based hierarchy of points called a dendrogram . Similar to partitional clustering, in hierarchical clustering the number of clusters ( k ) is often predetermined by the user. Clusters are assigned by cutting the dendrogram at a specified depth that results in k groups of smaller dendrograms.

Unlike many partitional clustering techniques, hierarchical clustering is a deterministic process, meaning cluster assignments won’t change when you run an algorithm twice on the same input data.

The strengths of hierarchical clustering methods include the following:

  • They often reveal the finer details about the relationships between data objects.
  • They provide an interpretable dendrogram .

The weaknesses of hierarchical clustering methods include the following:

  • They’re computationally expensive with respect to algorithm complexity.
  • They’re sensitive to noise and outliers .

Density-based clustering determines cluster assignments based on the density of data points in a region. Clusters are assigned where there are high densities of data points separated by low-density regions.

Unlike the other clustering categories, this approach doesn’t require the user to specify the number of clusters. Instead, there is a distance-based parameter that acts as a tunable threshold. This threshold determines how close points must be to be considered a cluster member.

Examples of density-based clustering algorithms include Density-Based Spatial Clustering of Applications with Noise, or DBSCAN , and Ordering Points To Identify the Clustering Structure, or OPTICS .

The strengths of density-based clustering methods include the following:

  • They excel at identifying clusters of nonspherical shapes .
  • They’re resistant to outliers .

The weaknesses of density-based clustering methods include the following:

  • They aren’t well suited for clustering in high-dimensional spaces .
  • They have trouble identifying clusters of varying densities .

How to Perform K-Means Clustering in Python

In this section, you’ll take a step-by-step tour of the conventional version of the k -means algorithm. Understanding the details of the algorithm is a fundamental step in the process of writing your k -means clustering pipeline in Python. What you learn in this section will help you decide if k -means is the right choice to solve your clustering problem.

Conventional k -means requires only a few steps. The first step is to randomly select k centroids, where k is equal to the number of clusters you choose. Centroids are data points representing the center of a cluster.

The main element of the algorithm works by a two-step process called expectation-maximization . The expectation step assigns each data point to its nearest centroid. Then, the maximization step computes the mean of all the points for each cluster and sets the new centroid. Here’s what the conventional version of the k -means algorithm looks like:

[Figure: the conventional k-means algorithm]

The quality of the cluster assignments is determined by computing the sum of the squared error (SSE) after the centroids converge , or match the previous iteration’s assignment. The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of k -means is to try to minimize this value.

The figure below shows the centroids and SSE updating through the first five iterations from two different runs of the k -means algorithm on the same dataset:

The purpose of this figure is to show that the initialization of the centroids is an important step. It also highlights the use of SSE as a measure of clustering performance. After choosing a number of clusters and the initial centroids, the expectation-maximization step is repeated until the centroid positions reach convergence and are unchanged.

The random initialization step causes the k -means algorithm to be nondeterministic , meaning that cluster assignments will vary if you run the same algorithm twice on the same dataset. Researchers commonly run several initializations of the entire k -means algorithm and choose the cluster assignments from the initialization with the lowest SSE.

Thankfully, there’s a robust implementation of k -means clustering in Python from the popular machine learning package scikit-learn . You’ll learn how to write a practical implementation of the k -means algorithm using the scikit-learn version of the algorithm .

Note: If you’re interested in gaining a deeper understanding of how to write your own k -means algorithm in Python, then check out the Python Data Science Handbook .

The code in this tutorial requires some popular external Python packages and assumes that you’ve installed Python with Anaconda. For more information on setting up your Python environment for machine learning in Windows, read through Setting Up Python for Machine Learning on Windows .

Otherwise, you can begin by installing the required packages:
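```console
$ python -m pip install kneed matplotlib pandas scikit-learn seaborn
```

This package list is inferred from the imports used later in the tutorial; install whichever packages you're missing.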

The code is presented so that you can follow along in an IPython console or Jupyter Notebook. You can also download the source code used in this article by clicking on the link below:

This step will import the modules needed for all the code in this section:
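```python
# Imports inferred from the functions and classes used in this section.
import matplotlib.pyplot as plt
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
```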

You can generate the data from the above GIF using make_blobs() , a convenience function in scikit-learn used to generate synthetic clusters. make_blobs() uses these parameters:

  • n_samples is the total number of samples to generate.
  • centers is the number of centers to generate.
  • cluster_std is the standard deviation of each cluster.

make_blobs() returns a tuple of two values:

  • A two-dimensional NumPy array with the x- and y-values for each of the samples
  • A one-dimensional NumPy array containing the cluster labels for each sample

Note: Many scikit-learn algorithms rely heavily on NumPy in their implementations. If you want to learn more about NumPy arrays, check out Look Ma, No for Loops: Array Programming With NumPy .

Generate the synthetic data and labels:
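One plausible call, with parameter values assumed for illustration:

```python
features, true_labels = make_blobs(
    n_samples=200,      # total number of samples (assumed value)
    centers=3,          # number of synthetic clusters (assumed value)
    cluster_std=2.75,   # standard deviation of the clusters (assumed value)
    random_state=42,    # fixed so you can reproduce the results
)
```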

Nondeterministic machine learning algorithms like k -means are difficult to reproduce. The random_state parameter is set to an integer value so you can follow the data presented in the tutorial. In practice, it’s best to leave random_state as the default value, None .

Here’s a look at the first five elements for each of the variables returned by make_blobs() :
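```python
print(features[:5])
print(true_labels[:5])
```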

Data sets usually contain numerical features that have been measured in different units, such as height (in inches) and weight (in pounds). A machine learning algorithm would consider weight more important than height only because the values for weight are larger and have higher variability from person to person.

Machine learning algorithms need to consider all features on an even playing field. That means the values for all features must be transformed to the same scale.

The process of transforming numerical features to use the same scale is known as feature scaling . It’s an important data preprocessing step for most distance-based machine learning algorithms because it can have a significant impact on the performance of your algorithm.

There are several approaches to implementing feature scaling. A great way to determine which technique is appropriate for your dataset is to read scikit-learn’s preprocessing documentation .

In this example, you’ll use the StandardScaler class. This class implements a type of feature scaling called standardization . Standardization scales, or shifts, the values for each numerical feature in your dataset so that the features have a mean of 0 and standard deviation of 1:

Take a look at how the values have been scaled in scaled_features :
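```python
# Standardize so each feature has a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

print(scaled_features[:5])
```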

Now the data are ready to be clustered. The KMeans estimator class in scikit-learn is where you set the algorithm parameters before fitting the estimator to the data. The scikit-learn implementation is flexible, providing several parameters that can be tuned.

Here are the parameters used in this example:

init controls the initialization technique. The standard version of the k -means algorithm is implemented by setting init to "random" . Setting this to "k-means++" employs an advanced trick to speed up convergence, which you’ll use later.

n_clusters sets k for the clustering step. This is the most important parameter for k -means.

n_init sets the number of initializations to perform. This is important because two runs can converge on different cluster assignments. The default behavior for the scikit-learn algorithm is to perform ten k -means runs and return the results of the one with the lowest SSE.

max_iter sets the number of maximum iterations for each initialization of the k -means algorithm.

Instantiate the KMeans class with the following arguments:
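A sketch of the instantiation, with random_state assumed for reproducibility:

```python
kmeans = KMeans(
    init="random",    # the conventional (non "k-means++") initialization
    n_clusters=3,     # the k in k-means
    n_init=10,        # number of independent initializations
    max_iter=300,     # cap on iterations per initialization
    random_state=42,  # assumed value, for reproducibility
)
```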

The parameter names match the language that was used to describe the k -means algorithm earlier in the tutorial. Now that the k -means class is ready, the next step is to fit it to the data in scaled_features . This will perform ten runs of the k -means algorithm on your data with a maximum of 300 iterations per run:
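```python
kmeans.fit(scaled_features)
```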

Statistics from the initialization run with the lowest SSE are available as attributes of kmeans after calling .fit() :
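```python
kmeans.inertia_          # lowest SSE across the ten initializations
kmeans.cluster_centers_  # final centroid locations
kmeans.n_iter_           # iterations required to converge
```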

Finally, the cluster assignments are stored as a one-dimensional NumPy array in kmeans.labels_ . Here’s a look at the first five predicted labels:
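```python
kmeans.labels_[:5]
```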

Note that the order of the cluster labels for the first two data objects was flipped. The order was [1, 0] in true_labels but [0, 1] in kmeans.labels_, even though those data objects are still members of their original clusters in kmeans.labels_.

This behavior is normal, as the ordering of cluster labels is dependent on the initialization. Cluster 0 from the first run could be labeled cluster 1 in the second run and vice versa. This doesn’t affect clustering evaluation metrics.

In this section, you’ll look at two methods that are commonly used to evaluate the appropriate number of clusters:

  • The elbow method
  • The silhouette coefficient

These are often used as complementary evaluation techniques rather than one being preferred over the other. To perform the elbow method , run several k -means, increment k with each iteration, and record the SSE:
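A sketch of that loop, reusing the parameters from above:

```python
kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
}

# A list holds the SSE value for each k.
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    sse.append(kmeans.inertia_)
```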

The previous code block made use of Python’s dictionary unpacking operator ( ** ). To learn more about this powerful Python operator, check out How to Iterate Through a Dictionary in Python .

When you plot SSE as a function of the number of clusters, notice that SSE continues to decrease as you increase k . As more centroids are added, the distance from each point to its closest centroid will decrease.

There’s a sweet spot where the SSE curve starts to bend known as the elbow point . The x-value of this point is thought to be a reasonable trade-off between error and number of clusters. In this example, the elbow is located at x=3 :
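```python
plt.plot(range(1, 11), sse)
plt.xticks(range(1, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
```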

The above code produces the following plot:

[Figure: SSE as a function of the number of clusters, with the elbow at k = 3]

Determining the elbow point in the SSE curve isn’t always straightforward. If you’re having trouble choosing the elbow point of the curve, then you could use a Python package, kneed , to identify the elbow point programmatically:
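```python
# KneeLocator finds the point of maximum curvature in the SSE curve.
kl = KneeLocator(range(1, 11), sse, curve="convex", direction="decreasing")
kl.elbow  # 3
```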

The silhouette coefficient is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors:

  • How close the data point is to other points in the cluster
  • How far away the data point is from points in other clusters

Silhouette coefficient values range between -1 and 1 . Larger numbers indicate that samples are closer to their clusters than they are to other clusters.

In the scikit-learn implementation of the silhouette coefficient, the average silhouette coefficient of all the samples is summarized into one score. The silhouette_score() function needs a minimum of two clusters, or it will raise an exception.

Loop through values of k again. This time, instead of computing SSE, compute the silhouette coefficient:
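```python
# The silhouette coefficient needs at least two clusters, so start at k = 2.
silhouette_coefficients = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    score = silhouette_score(scaled_features, kmeans.labels_)
    silhouette_coefficients.append(score)
```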

Plotting the average silhouette scores for each k shows that the best choice for k is 3 since it has the maximum score:
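```python
plt.plot(range(2, 11), silhouette_coefficients)
plt.xticks(range(2, 11))
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Coefficient")
plt.show()
```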

[Figure: average silhouette coefficient as a function of the number of clusters]

Ultimately, your decision on the number of clusters to use should be guided by a combination of domain knowledge and clustering evaluation metrics.

The elbow method and silhouette coefficient evaluate clustering performance without the use of ground truth labels . Ground truth labels categorize data points into groups based on assignment by a human or an existing algorithm. These types of metrics do their best to suggest the correct number of clusters but can be deceiving when used without context.

Note: In practice, it’s rare to encounter datasets that have ground truth labels.

When comparing k -means against a density-based approach on nonspherical clusters, the results from the elbow method and silhouette coefficient rarely match human intuition. This scenario highlights why advanced clustering evaluation techniques are necessary. To visualize an example, import these additional modules:
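```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
```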

This time, use make_moons() to generate synthetic data in the shape of crescents:
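A plausible call, with the sample count and noise level assumed for illustration:

```python
features, true_labels = make_moons(n_samples=250, noise=0.05, random_state=42)
scaled_features = scaler.fit_transform(features)
```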

Fit both a k -means and a DBSCAN algorithm to the new data and visually assess the performance by plotting the cluster assignments with Matplotlib :
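A sketch with assumed parameter values: two clusters for k-means and a small eps for DBSCAN:

```python
kmeans = KMeans(n_clusters=2)
dbscan = DBSCAN(eps=0.3)

kmeans.fit(scaled_features)
dbscan.fit(scaled_features)
```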

Print the silhouette coefficient for each of the two algorithms and compare them. A higher silhouette coefficient suggests better clusters, which is misleading in this scenario:
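```python
kmeans_silhouette = silhouette_score(scaled_features, kmeans.labels_).round(2)
dbscan_silhouette = silhouette_score(scaled_features, dbscan.labels_).round(2)
print(kmeans_silhouette, dbscan_silhouette)
```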

The silhouette coefficient is higher for the k -means algorithm. The DBSCAN algorithm appears to find more natural clusters according to the shape of the data:

[Figure: k-means and DBSCAN cluster assignments on the crescent-shaped data]

This suggests that you need a better method to compare the performance of these two clustering algorithms.


To learn more about plotting with Matplotlib and Python, check out Python Plotting with Matplotlib (Guide) . Here’s how you can plot the comparison of the two algorithms in the crescent moons example:
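A minimal sketch of the side-by-side comparison, reusing the fitted models and scores from above:

```python
# Two panels sharing axes: k-means on the left, DBSCAN on the right.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)

ax1.scatter(scaled_features[:, 0], scaled_features[:, 1], c=kmeans.labels_)
ax1.set_title(f"k-means\nSilhouette: {kmeans_silhouette}")

ax2.scatter(scaled_features[:, 0], scaled_features[:, 1], c=dbscan.labels_)
ax2.set_title(f"DBSCAN\nSilhouette: {dbscan_silhouette}")

plt.show()
```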

Since the ground truth labels are known, it’s possible to use a clustering metric that considers labels in its evaluation. You can use the scikit-learn implementation of a common metric called the adjusted rand index (ARI) . Unlike the silhouette coefficient, the ARI uses true cluster assignments to measure the similarity between true and predicted labels.

Compare the clustering results of DBSCAN and k -means using ARI as the performance metric:
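```python
ari_kmeans = adjusted_rand_score(true_labels, kmeans.labels_)
ari_dbscan = adjusted_rand_score(true_labels, dbscan.labels_)
print(round(ari_kmeans, 2), round(ari_dbscan, 2))
```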

The ARI output values range between -1 and 1 . A score close to 0.0 indicates random assignments, and a score close to 1 indicates perfectly labeled clusters.

Based on the above output, you can see that the silhouette coefficient was misleading. ARI shows that DBSCAN is the best choice for the synthetic crescents example as compared to k -means.

There are several metrics that evaluate the quality of clustering algorithms. Reading through the implementations in scikit-learn will help you select an appropriate clustering evaluation metric.

How to Build a K-Means Clustering Pipeline in Python

Now that you have a basic understanding of k -means clustering in Python, it’s time to perform k -means clustering on a real-world dataset. These data contain gene expression values from a manuscript authored by The Cancer Genome Atlas (TCGA) Pan-Cancer analysis project investigators.

There are 881 samples (rows) representing five distinct cancer subtypes. Each sample has gene expression values for 20,531 genes (columns). The dataset is available from the UC Irvine Machine Learning Repository , but you can use the Python code below to obtain the data programmatically.

To follow along with the examples below, you can download the source code by clicking on the following link:

In this section, you’ll build a robust k -means clustering pipeline. Since you’ll perform multiple transformations of the original input data, your pipeline will also serve as a practical clustering framework.

Assuming you want to start with a fresh namespace , import all the modules needed to build and evaluate the pipeline, including pandas and seaborn for more advanced visualizations:
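```python
# Imports inferred from the modules and classes referenced in this section.
import tarfile
import urllib.request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
```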

Download and extract the TCGA dataset from UCI:
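A sketch of the download step. The archive name and UCI URL here are assumptions; adjust them to wherever you obtain the dataset:

```python
uci_tcga_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00401/"
archive_name = "TCGA-PANCAN-HiSeq-801x20531.tar.gz"

# Fetch the archive and unpack it into the working directory.
urllib.request.urlretrieve(uci_tcga_url + archive_name, archive_name)
with tarfile.open(archive_name, "r:gz") as tar:
    tar.extractall()
```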

After the download and extraction is completed, you should have a directory that looks like this:

The KMeans class in scikit-learn requires a NumPy array as an argument. The NumPy package has a helper function to load the data from the text file into memory as NumPy arrays:
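A sketch using np.genfromtxt(). The file names are assumed to follow the extracted archive's layout:

```python
datafile = "TCGA-PANCAN-HiSeq-801x20531/data.csv"
labels_file = "TCGA-PANCAN-HiSeq-801x20531/labels.csv"

# Skip the header row and the first (ID) column of each file.
data = np.genfromtxt(datafile, delimiter=",", usecols=range(1, 20532), skip_header=1)
true_label_names = np.genfromtxt(
    labels_file, delimiter=",", usecols=(1,), skip_header=1, dtype="str"
)
```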

Check out the first three columns of data for the first five samples as well as the labels for the first five samples:
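```python
print(data[:5, :3])
print(true_label_names[:5])
```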

The data variable contains all the gene expression values from 20,531 genes. The true_label_names are the cancer types for each of the 881 samples. The first record in data corresponds with the first label in true_label_names.

The labels are strings containing abbreviations of cancer types:

  • BRCA : Breast invasive carcinoma
  • COAD : Colon adenocarcinoma
  • KIRC : Kidney renal clear cell carcinoma
  • LUAD : Lung adenocarcinoma
  • PRAD : Prostate adenocarcinoma

To use these labels in the evaluation methods, you first need to convert the abbreviations to integers with LabelEncoder :
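```python
label_encoder = LabelEncoder()
true_labels = label_encoder.fit_transform(true_label_names)
```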

Since the label_encoder has been fitted to the data, you can see the unique classes represented using .classes_ . Store the length of the array to the variable n_clusters for later use:
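```python
label_encoder.classes_                # the five cancer-type abbreviations
n_clusters = len(label_encoder.classes_)
```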

In practical machine learning pipelines, it’s common for the data to undergo multiple sequences of transformations before it feeds into a clustering algorithm. You learned about the importance of one of these transformation steps, feature scaling, earlier in this tutorial. An equally important data transformation technique is dimensionality reduction , which reduces the number of features in the dataset by either removing or combining them.

Dimensionality reduction techniques help to address a problem with machine learning algorithms known as the curse of dimensionality . In short, as the number of features increases, the feature space becomes sparse. This sparsity makes it difficult for algorithms to find data objects near one another in higher-dimensional space. Since the gene expression dataset has over 20,000 features, it qualifies as a great candidate for dimensionality reduction.

Principal Component Analysis (PCA) is one of many dimensionality reduction techniques. PCA transforms the input data by projecting it into a lower number of dimensions called components . The components capture the variability of the input data through a linear combination of the input data’s features.

Note: A full description of PCA is out of scope for this tutorial, but you can learn more about it in the scikit-learn user guide .

The next code block introduces you to the concept of scikit-learn pipelines . The scikit-learn Pipeline class is a concrete implementation of the abstract idea of a machine learning pipeline.

Your gene expression data aren’t in the optimal format for the KMeans class, so you’ll need to build a preprocessing pipeline . The pipeline will implement an alternative to the StandardScaler class called MinMaxScaler for feature scaling. You use MinMaxScaler when you do not assume that the shape of all your features follows a normal distribution.

The next step in your preprocessing pipeline will implement the PCA class to perform dimensionality reduction:
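```python
# random_state is an assumed value, fixed for reproducibility.
preprocessor = Pipeline(
    [
        ("scaler", MinMaxScaler()),
        ("pca", PCA(n_components=2, random_state=42)),
    ]
)
```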

Now that you’ve built a pipeline to process the data, you’ll build a separate pipeline to perform k -means clustering. You’ll override the following default arguments of the KMeans class:

init: You’ll use "k-means++" instead of "random" to ensure centroids are initialized with some distance between them. In most cases, this will be an improvement over "random" .

n_init: You’ll increase the number of initializations to ensure you find a stable solution.

max_iter: You’ll increase the number of iterations per initialization to ensure that k -means will converge.

Build the k -means clustering pipeline with user-defined arguments in the KMeans constructor:
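A sketch of the clustering pipeline; the n_init and max_iter values are assumptions chosen to match the goals described above:

```python
clusterer = Pipeline(
    [
        (
            "kmeans",
            KMeans(
                n_clusters=n_clusters,
                init="k-means++",
                n_init=50,     # assumed: more initializations for stability
                max_iter=500,  # assumed: more iterations per run
                random_state=42,
            ),
        ),
    ]
)
```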

The Pipeline class can be chained to form a larger pipeline. Build an end-to-end k -means clustering pipeline by passing the "preprocessor" and "clusterer" pipelines to Pipeline :
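```python
pipe = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("clusterer", clusterer),
    ]
)
```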

Calling .fit() with data as the argument performs all the pipeline steps on the data :
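```python
pipe.fit(data)
```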

The pipeline performs all the necessary steps to execute k -means clustering on the gene expression data! Depending on your Python REPL , .fit() may print a summary of the pipeline. Objects defined inside pipelines are accessible using their step name.

Evaluate the performance by calculating the silhouette coefficient:
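```python
# Score the clusters in the PCA-transformed feature space.
preprocessed_data = pipe["preprocessor"].transform(data)
predicted_labels = pipe["clusterer"]["kmeans"].labels_

silhouette_score(preprocessed_data, predicted_labels)
```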

Calculate ARI, too, since the ground truth cluster labels are available:
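```python
adjusted_rand_score(true_labels, predicted_labels)
```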

As mentioned earlier, the scale for each of these clustering performance metrics ranges from -1 to 1. A silhouette coefficient of 0 indicates that clusters are significantly overlapping one another, and a silhouette coefficient of 1 indicates clusters are well-separated. An ARI score of 0 indicates that cluster labels are randomly assigned, and an ARI score of 1 means that the true labels and predicted labels form identical clusters.

Since you specified n_components=2 in the PCA step of the k -means clustering pipeline, you can also visualize the data in the context of the true labels and predicted labels. Plot the results using a pandas DataFrame and the seaborn plotting library:
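A sketch of the plot; the DataFrame column names are illustrative:

```python
pcadf = pd.DataFrame(
    pipe["preprocessor"].transform(data),
    columns=["component_1", "component_2"],
)
pcadf["predicted_cluster"] = pipe["clusterer"]["kmeans"].labels_
pcadf["true_label"] = label_encoder.inverse_transform(true_labels)

plt.figure(figsize=(8, 8))
scat = sns.scatterplot(
    x="component_1",
    y="component_2",
    data=pcadf,
    hue="predicted_cluster",
    style="true_label",
)
scat.set_title("Clustering results from TCGA Pan-Cancer\nGene Expression Data")
plt.show()
```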

Here’s what the plot looks like:

[Figure: PCA scatter plot of the predicted clusters against the true labels]

The visual representation of the clusters confirms the results of the two clustering evaluation metrics. The performance of your pipeline was pretty good. The clusters only slightly overlapped, and cluster assignments were much better than random.

Your first k -means clustering pipeline performed well, but there’s still room to improve. That’s why you went through the trouble of building the pipeline: you can tune the parameters to get the most desirable clustering results.

The process of parameter tuning consists of sequentially altering one of the input values of the algorithm’s parameters and recording the results. At the end of the parameter tuning process, you’ll have a set of performance scores, one for each new value of a given parameter. Parameter tuning is a powerful method to maximize performance from your clustering pipeline.

By setting the PCA parameter n_components=2 , you squished all the features into two components, or dimensions. This value was convenient for visualization on a two-dimensional plot. But using only two components means that the PCA step won’t capture all of the explained variance of the input data.

Explained variance measures the discrepancy between the PCA-transformed data and the actual input data. The relationship between n_components and explained variance can be visualized in a plot to show you how many components you need in your PCA to capture a certain percentage of the variance in the input data. You can also use clustering performance metrics to evaluate how many components are necessary to achieve satisfactory clustering results.

In this example, you’ll use clustering performance metrics to identify the appropriate number of components in the PCA step. The Pipeline class is powerful in this situation. It allows you to perform basic parameter tuning using a for loop .

Iterate over a range of n_components and record evaluation metrics for each iteration:
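```python
# Re-fit the pipeline for each candidate number of components
# and record both evaluation metrics.
silhouette_scores = []
ari_scores = []
for n in range(2, 11):
    pipe["preprocessor"]["pca"].n_components = n
    pipe.fit(data)

    silhouette = silhouette_score(
        pipe["preprocessor"].transform(data),
        pipe["clusterer"]["kmeans"].labels_,
    )
    ari = adjusted_rand_score(true_labels, pipe["clusterer"]["kmeans"].labels_)

    silhouette_scores.append(silhouette)
    ari_scores.append(ari)
```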

Plot the evaluation metrics as a function of n_components to visualize the relationship between adding components and the performance of the k -means clustering results:
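```python
plt.plot(range(2, 11), silhouette_scores, label="Silhouette Coefficient")
plt.plot(range(2, 11), ari_scores, label="ARI")
plt.xlabel("n_components")
plt.legend()
plt.title("Clustering Performance as a Function of n_components")
plt.show()
```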

The above code generates a plot showing the performance metrics as a function of n_components:

[Figure: silhouette coefficient and ARI as a function of n_components]

There are two takeaways from this figure:

The silhouette coefficient decreases linearly. The silhouette coefficient depends on the distance between points, so as the number of dimensions increases, the sparsity increases.

The ARI improves significantly as you add components. It appears to start tapering off after n_components=7 , so that would be the value to use for presenting the best clustering results from this pipeline.

Like most machine learning decisions, you must balance optimizing clustering evaluation metrics with the goal of the clustering task. In situations when cluster labels are available, as is the case with the cancer dataset used in this tutorial, ARI is a reasonable choice. ARI quantifies how accurately your pipeline was able to reassign the cluster labels.

The silhouette coefficient, on the other hand, is a good choice for exploratory clustering because it helps to identify subclusters. These subclusters warrant additional investigation, which can lead to new and important insights.

You now know how to perform k -means clustering in Python. Your final k -means clustering pipeline was able to cluster patients with different cancer types using real-world gene expression data. You can use the techniques you learned here to cluster your own data, understand how to get the best clustering results, and share insights with others.

In this tutorial, you learned:

  • What the popular clustering techniques are and when to use them
  • What the k -means algorithm is
  • How to implement k -means clustering in Python
  • How to evaluate the performance of clustering algorithms
  • How to build and tune a robust k -means clustering pipeline in Python
  • How to analyze and present clustering results from the k -means algorithm

You also took a whirlwind tour of scikit-learn, an accessible and extensible tool for implementing k -means clustering in Python. If you’d like to reproduce the examples you saw above, then be sure to download the source code by clicking on the following link:

You’re now ready to perform k -means clustering on datasets you find interesting. Be sure to share your results in the comments below!

Note: The dataset used in this tutorial was obtained from the UCI Machine Learning Repository. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository . Irvine, CA: University of California, School of Information and Computer Science.

The original dataset is maintained by The Cancer Genome Atlas Pan-Cancer analysis project.



k-Means Clustering


K-means clustering is a traditional, simple machine learning algorithm that is trained on a test data set and is then able to classify a new data set using a number of clusters \(k\) defined a priori.

Data mining can produce incredible visuals and results. Here, the k-means algorithm was used to assign items to 1000 clusters, each represented by a color [1].

An objective of machine learning is to derive techniques for unsupervised learning on data. This kind of data analysis is very helpful in many applications that require classification of data, such as identifying cancerous cells within a large sample, clustering words with similar definitions for better search engine accuracy, identifying outliers in students' academic performance for better refinement of habits, or even detecting landmines in a battlefield [2].

Building a Baseball Team Using Classification

Imagine a high school baseball coach wants to use data analysis to predict whether potential new recruits will be spectacular, mediocre, or dismal. He has data on the players currently on his team: position, years of experience, batting average, on-base percentage, and number of stolen bases per game. He also has some data on the players he is considering recruiting. How might he go about selecting a new player who fills any gaps in skill currently on his team?

K-Means Algorithm

This clustering algorithm separates data into the best-suited group based on the information the algorithm already has. The data is separated into \(k\) different clusters, which are usually chosen to be far enough apart from each other spatially, in Euclidean distance, to produce effective data mining results. Each cluster has a center, called the centroid, and a data point is clustered into a certain cluster based on how close its features are to the centroid.

A computer-generated program showing k-means clustering [3]

K-means algorithm iteratively minimizes the distances between every data point and its centroid in order to find the most optimal solution for all the data points.

  1. \(k\) random points of the data set are chosen to be centroids.
  2. Distances between every data point and the \(k\) centroids are calculated and stored.
  3. Based on the distance calculations, each point is assigned to its nearest cluster.
  4. New cluster centroid positions are updated, similar to finding the mean of the point locations in each cluster.
  5. If the centroid locations changed, the process repeats from step 2, until the newly calculated centers stay the same, which signals that the clusters' members and centroids are now set.

Finding the minimal distances between all the points implies that data points have been separated in order to form the most compact clusters possible, with the least variance within them. In other words, no other iteration could have a lower average distance between the centroids and the data points found within them.

A graphical look at k-means clustering [4]

Technical Analysis

The k-means algorithm defined above aims at minimizing an objective function, which in this case is the squared error function.

The objective function for the k-means clustering algorithm is the squared error function: \(J = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - v_j \right\|^2,\) where \(\|x_i^{(j)} - v_j\|\) is the Euclidean distance between a point \(x_i^{(j)}\) and a centroid \(v_j\), summed over the \(n_j\) points in the \(j^\text{th}\) cluster, for all \(k\) clusters. [5]

In simpler terms, the objective function attempts to pick centroids that minimize the distance to all points belonging to its respective cluster so that the centroids are more symbolic of the surrounding cluster of data points.

Why use the squared error function rather than, say, the absolute error? The reason is that the squared error has nicer mathematical properties than the absolute error. For further reference on these properties, check out this blog post by Ben Kuhn.

Although the algorithm seems quite simple, finding the optimal solution to the problem for observations in \(d\) dimensions and \(k\) clusters is NP-hard. However, if \(k\) and \(d\) are fixed, the problem can be solved in time \(O(n^{dk+1}\log n)\) using Lloyd's algorithm, a common k-clustering algorithm, where \(n\) is the number of entities to be clustered [6]. That said, the running time of this algorithm on nicely clustered data can be quite small, as minimization of the objective function will occur quickly.

Because the algorithm computes the distances between each of the \(k\) cluster centers and their respective data points every single iteration, a few iterations are enough to make further adjustments not worth the time-complexity trade-off. In other words, because further iterations won't change the assignments of the majority of data points but their distances still need to be calculated, the algorithm becomes inefficient if convergence is the goal. For this reason, several variations of Lloyd's k-means clustering algorithm have been developed to speed up the process at later stages; these variations include using the triangle inequality, among others.

Before applying k-means clustering to a data set, the characteristics of each object have to be converted into numerical data that can be analyzed.

Categorizing Baseball Players

How would the baseball coach use k-means clustering to predict whether new recruits will be good? Each trait of a player can be represented as a feature vector . By converting characteristics into numbers in a feature vector, the players become comparable in a vector space so that their differences can be better quantified. The coach has the same type of information on both current players and new potential ones. Using k-means clustering on the entire set of data points, he can figure out which of his current, known-level (remarkable, mediocre, dismal) players are closest to the new ones, and which of the new players would fill the most voids within his team.

K-Means clustering is a fast, robust, and simple algorithm that gives reliable results when data sets are distinct or well separated from each other in a linear fashion. It is best used when the number of cluster centers is specified due to a well-defined list of types shown in the data. However, it is important to keep in mind that K-Means clustering may not perform well if the data contains heavily overlapping clusters, if the Euclidean distance does not measure the underlying factors well, or if the data is noisy or full of outliers [7].

  • [1] Brickley, D. 1k_overview. Retrieved from https://www.flickr.com/photos/danbri/6233990550/
  • [2] Naik, A. Clustering Algorithm Applications. Retrieved from https://sites.google.com/site/dataclusteringalgorithms/clustering-algorithm-applications
  • [3] Petersen, J. K-means. Retrieved from http://pypr.sourceforge.net/kmeans.html#k-means-example
  • [4] Wikipedia. k-means clustering. Retrieved May 02, 2016, from https://en.wikipedia.org/wiki/K-means_clustering
  • [5] Matteucci, M. K-Means Clustering. Retrieved June 14, 2016, from http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
  • [6] Inaba, M. (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. ACM Symposium on Computational Geometry, 10th, 332-339.
  • [7] Naik, A. k-Means Clustering Algorithm. Retrieved June 14, 2016, from https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm


Definitive Guide to K-Means Clustering with Scikit-Learn


  • Introduction

K-Means clustering is one of the most widely used unsupervised machine learning algorithms that form clusters of data based on the similarity between data instances.

In this guide, we will first take a look at a simple example to understand how the K-Means algorithm works before implementing it using Scikit-Learn. Then, we'll discuss how to determine the number of clusters (Ks) in K-Means, and also cover distance metrics, variance, and K-Means pros and cons.

Imagine the following situation. One day, when walking around the neighborhood, you noticed there were 10 convenience stores and started to wonder which stores were similar - closer to each other in proximity. While searching for ways to answer that question, you've come across an interesting approach that divides the stores into groups based on their coordinates on a map.

For instance, if one store was located 5 km West and 3 km North - you'd assign (5, 3) coordinates to it, and represent it in a graph. Let's plot this first point to visualize what's happening:
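The plotting code isn't reproduced in this extract; a minimal version with Matplotlib would be:

```python
import matplotlib.pyplot as plt

plt.scatter(5, 3)
plt.xlabel('km West')
plt.ylabel('km North')
plt.show()
```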


This is just the first point, so we can get an idea of how we can represent a store. Say we already have 10 coordinates to the 10 stores collected. After organizing them in a numpy array, we can also plot their locations:
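A sketch of that array; the exact store coordinates are an assumption, chosen to be consistent with the reference points and centroids quoted later in this guide:

```python
import numpy as np
import matplotlib.pyplot as plt

points = np.array([
    [5, 3], [10, 15], [15, 12], [24, 10], [30, 45],
    [85, 70], [71, 80], [60, 78], [55, 52], [80, 91],
])

plt.scatter(points[:, 0], points[:, 1])
plt.show()
```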


  • How to Manually Implement K-Means Algorithm

Now we can look at the 10 stores on a graph, and the main problem is to find whether there is a way they could be divided into different groups based on proximity. Just by taking a quick look at the graph, we'll probably notice two groups of stores - one is the lower points to the bottom-left, and the other one is the upper-right points. Perhaps we can even differentiate those two points in the middle as a separate group - therefore creating three different groups .

In this section, we'll go over the process of manually clustering points - dividing them into the given number of groups. That way, we'll essentially carefully go over all steps of the K-Means clustering algorithm . By the end of this section, you'll gain both an intuitive and practical understanding of all steps performed during the K-Means clustering. After that, we'll delegate it to Scikit-Learn.

What would be the best way of determining if there are two or three groups of points? One simple way would be to simply choose one number of groups - for instance, two - and then try to group points based on that choice.


Let's say we have decided there are two groups of our stores (points). Now, we need to find a way to understand which points belong to which group. This could be done by choosing one point to represent group 1 and one to represent group 2 . Those points will be used as a reference when measuring the distance from all other points to each group.

In that manner, say point (5, 3) ends up belonging to group 1, and point (79, 60) to group 2. When trying to assign a new point (6, 3) to a group, we need to measure its distance to those two points. In this case, the point (6, 3) is closer to (5, 3) , so it belongs to the group represented by that point - group 1 . This way, we can easily group all points into corresponding groups.

In this example, besides determining the number of groups ( clusters ) - we are also choosing some points to be a reference of distance for new points of each group.

That is the general idea to understand similarities between our stores. Let's put it into practice - we can first choose the two reference points at random . The reference point of group 1 will be (5, 3) and the reference point of group 2 will be (10, 15) . We can select both points of our numpy array by [0] and [1] indexes and store them in g1 (group 1) and g2 (group 2) variables:
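In code, continuing from the points array sketched above:

```python
g1 = points[0]  # (5, 3)  - reference point for group 1
g2 = points[1]  # (10, 15) - reference point for group 2
```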

After doing this, we need to calculate the distance from all other points to those reference points. This raises an important question - how do we measure that distance? We can essentially use any distance measure, but, for the purpose of this guide, let's use Euclidean distance.

Advice: If you want to learn more about Euclidean distance, you can read our "Calculating Euclidean Distances with NumPy" guide.

It can be useful to know that Euclidean distance measure is based on Pythagoras' theorem:

$$ c^2 = a^2 + b^2 $$

When adapted to points in a plane - (a1, b1) and (a2, b2) , the previous formula becomes:

$$ c^2 = (a2-a1)^2 + (b2-b1)^2 $$

The distance will be c , the square root of that sum, so we can also write the formula as:

$$ euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2} $$

Note: You can also generalize the Euclidean distance formula for multi-dimensional points. For example, in a three-dimensional space, points have three coordinates - our formula reflects that in the following way: $$ euclidean_{dist} = \sqrt{(a2 - a1)^2 + (b2 - b1)^2 + (c2 - c1)^2} $$ The same principle is followed no matter the number of dimensions of the space we are operating in.

So far, we have picked the points to represent groups, and we know how to calculate distances. Now, let's put the distances and groups together by assigning each of our collected store points to a group.

To better visualize that, we will declare three lists. The first one to store points of the first group - points_in_g1 . The second one to store points from the group 2 - points_in_g2 , and the last one - group , to label the points as either 1 (belongs to group 1) or 2 (belongs to group 2):

We can now iterate through our points and calculate the Euclidean distance between them and each of our group references. Each point will be closer to one of two groups - based on which group is closest, we'll assign each point to the corresponding list, while also adding 1 or 2 to the group list:
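One way to write that loop (a sketch; the helper function name is ours):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two points
    return np.sqrt(np.sum((np.array(a) - np.array(b)) ** 2))

points_in_g1 = []
points_in_g2 = []
group = []

for p in points:
    if euclidean(p, g1) < euclidean(p, g2):
        points_in_g1.append(p)
        group.append(1)
    else:
        points_in_g2.append(p)
        group.append(2)
```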

Let's look at the results of this iteration to see what happened.

We can also plot the clustering result, with different colors based on the assigned groups, using Seaborn's scatterplot() with the group as a hue argument:
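For example:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=group)
plt.show()
```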


It's clearly visible that only our first point is assigned to group 1, and all other points were assigned to group 2. That result differs from what we had envisioned in the beginning. Considering the difference between our results and our initial expectations - is there a way we could change that? It seems there is!

One approach is to repeat the process and choose different points to be the references of the groups. This will change our results, hopefully putting them more in line with what we envisioned in the beginning. This second time, we could choose them not at random as we did previously, but by taking the mean of all our already grouped points. That way, those new points could be positioned in the middle of their corresponding groups.

For instance, if the second group had only the points (10, 15) and (30, 45) , the new central point would be ((10 + 30)/2, (15 + 45)/2) , which is equal to (20, 30) .

Since we have put our results in lists, we can convert them first to numpy arrays, select their xs, ys and then obtain the mean :
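Something along these lines:

```python
g1_points = np.array(points_in_g1)
g2_points = np.array(points_in_g2)

# New reference points: the mean x and mean y of each group
g1 = (g1_points[:, 0].mean(), g1_points[:, 1].mean())
g2 = (g2_points[:, 0].mean(), g2_points[:, 1].mean())
```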

Advice: Try to use NumPy arrays as much as possible. They are optimized for performance and simplify many linear algebra operations. Whenever you are trying to solve a linear algebra problem, you should definitely take a look at the numpy documentation to check if there is a method designed to solve it. The chances are that there is!

To help repeat the process with our new center points, let's transform our previous code into a function, execute it and see if there were any changes in how the points are grouped:
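A sketch of that function, wrapping the assignment loop from before:

```python
def assign_to_groups(points, g1, g2):
    points_in_g1, points_in_g2, group = [], [], []
    for p in points:
        if euclidean(p, g1) < euclidean(p, g2):
            points_in_g1.append(p)
            group.append(1)
        else:
            points_in_g2.append(p)
            group.append(2)
    return points_in_g1, points_in_g2, group
```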

Note: If you notice you keep repeating the same code over and over again, you should wrap that code into a separate function. It is considered a best practice to organize code into functions, especially because they facilitate testing. It is easier to test an isolated piece of code than a full code without any functions.

Let's call the function and store its results in points_in_g1 , points_in_g2 , and group variables:
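That call looks like:

```python
points_in_g1, points_in_g2, group = assign_to_groups(points, g1, g2)
```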

And we can plot the scatter plot with the colored points again to visualize the groups' division.


It seems the clustering of our points is getting better . But still, there are two points in the middle of the graph that could be assigned to either group when considering their proximity to both groups. The algorithm we've developed so far assigns both of those points to the second group.

This means we can probably repeat the process once more by taking the means of the Xs and Ys, creating two new central points (centroids) to our groups and re-assigning them based on distance.

Let's also create a function to update the centroids. The whole process now can be reduced to multiple calls of that function:
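A possible shape for it (names again ours):

```python
def update_centroids(points_in_g1, points_in_g2):
    g1 = np.array(points_in_g1).mean(axis=0)
    g2 = np.array(points_in_g2).mean(axis=0)
    return g1, g2

# One full iteration: update the centroids, then re-assign every point
g1, g2 = update_centroids(points_in_g1, points_in_g2)
points_in_g1, points_in_g2, group = assign_to_groups(points, g1, g2)
```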


Notice that after this third iteration, the two middle points now belong to different clusters - one to each group. It seems the results are getting better - let's do it once again and go on to the fourth iteration of our method.


This fourth time, we got the same result as the previous one. So it seems our points won't change groups anymore; our result has reached some kind of stability - it has reached an unchangeable state, or converged . Besides that, we have exactly the same result as we had envisioned for the 2 groups. We can also check whether this final division makes sense.

Let's just quickly recap what we've done so far. We've divided our 10 stores geographically into two sections - ones in the lower southwest regions and others in the northeast. It can be interesting to gather more data besides what we already have - revenue, the daily number of customers, and many more. That way we can conduct a richer analysis and possibly generate more interesting results.

Clustering studies like this can be conducted when an already established brand wants to pick an area to open a new store. In that case, there are many more variables taken into consideration besides location.
  • What Does All This Have To Do With K-Means Algorithm?

While following these steps you might have wondered what they have to do with the K-Means algorithm. The process we've conducted so far is the K-Means algorithm . In short, we've determined the number of groups/clusters, randomly chosen initial points, and updated centroids in each iteration until clusters converged. We've basically performed the entire algorithm by hand - carefully conducting each step.

The K in K-Means comes from the number of clusters that need to be set prior to starting the iteration process. In our case K = 2 . This characteristic is sometimes seen as negative considering there are other clustering methods, such as Hierarchical Clustering, which don't need to have a fixed number of clusters beforehand.

Due to its use of means, K-means is also sensitive to outliers and extreme values - they increase the variability and make it harder for our centroids to play their part. So, be conscious of the need to perform extreme-value and outlier analysis before conducting a clustering using the K-Means algorithm.

Also, notice that our points were segmented by straight boundaries; K-Means does not create curves when forming the clusters. That can also be a disadvantage of the K-Means algorithm.

Note: When you need clustering that is more flexible and adaptable to ellipses and other shapes, try using a Gaussian Mixture model , a generalization of K-means. This model can adapt to elliptical segmentation clusters.

K-Means also has many advantages ! It performs well on large datasets which can become difficult to handle if you are using some types of hierarchical clustering algorithms. It also guarantees convergence , and can easily generalize and adapt . Besides that, it is probably the most used clustering algorithm.

Now that we've gone over all the steps performed in the K-Means algorithm, and understood all its pros and cons, we can finally implement K-Means using the Scikit-Learn library.

  • How to Implement K-Means Algorithm Using Scikit-Learn

To double check our result, let's do this process again, but now using 3 lines of code with sklearn :
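Roughly, assuming the points array from earlier:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(points)
print(kmeans.labels_)  # 0/1 instead of our 1/2, but the same grouping
```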

Here, the labels are the same as our previous groups. Let's just quickly plot the result:
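For instance:

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=kmeans.labels_)
plt.show()
```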


The resulting plot is the same as the one from the previous section.


Note: Just looking at how we've performed the K-Means algorithm using Scikit-Learn might give you the impression that this is a no-brainer and that you don't need to worry too much about it. Just 3 lines of code perform all the steps we've discussed in the previous section when we've gone over the K-Means algorithm step-by-step. But, the devil is in the details in this case! If you don't understand all the steps and limitations of the algorithm, you'll most likely face the situation where the K-Means algorithm gives you results you were not expecting.

With Scikit-Learn, you can also initialize K-Means for faster convergence by setting the init='k-means++' argument. K-Means++ chooses the first cluster center uniformly at random from the data. Each subsequent center is then chosen from the remaining data points with probability proportional to its squared distance from the nearest already-chosen center. Weighting the choice this way spreads the initial centers out, which speeds up convergence and is helpful when dealing with very large datasets.

Advice: You can learn more about K-Means++ details by reading the "K-Means++: The Advantages of Careful Seeding" paper, proposed in 2007 by David Arthur and Sergei Vassilvitskii.

  • The Elbow Method - Choosing the Best Number of Groups

So far, so good! We've clustered 10 stores based on the Euclidean distance between points and centroids. But what about those two points in the middle of the graph that are a little harder to cluster? Couldn't they form a separate group as well? Did we actually make a mistake by choosing K=2 groups? Maybe we actually had K=3 groups? We could even have more than three groups and not be aware of it.

The question being asked here is how to determine the number of groups (K) in K-Means . To answer that question, we need to understand if there would be a "better" cluster for a different value of K.

The naive way of finding that out is by clustering points with different values of K , so, for K=2, K=3, K=4, and so on :
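A sketch of that brute-force loop:

```python
from sklearn.cluster import KMeans

for k in range(1, 5):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(points)
    print(k, kmeans.labels_)
```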

But, clustering points for different Ks alone won't be enough to understand if we've chosen the ideal value for K . We need a way to evaluate the clustering quality for each K we've chosen.

  • Manually Calculating the Within Cluster Sum of Squares (WCSS)

Here is the ideal place to introduce a measure of how close our clustered points are to each other. It essentially describes how much variance we have inside a single cluster. This measure is called Within Cluster Sum of Squares , or WCSS for short. The smaller the WCSS is, the closer our points are, therefore we have a more well-formed cluster. The WCSS formula can be used for any number of clusters:

$$ WCSS = \sum_{i}\left(P_i^{(1)} - Centroid_1\right)^2 + \cdots + \sum_{i}\left(P_i^{(n)} - Centroid_n\right)^2 $$

Note: In this guide, we are using the Euclidean distance to obtain the centroids, but other distance measures, such as Manhattan, could also be used.

Now we can assume we've opted to have two clusters and try to implement the WCSS to understand better what the WCSS is and how to use it. As the formula states, we need to sum up the squared differences between all cluster points and centroids. So, if our first point from the first group is (5, 3) and our last centroid (after convergence) of the first group is (16.8, 17.0) , the WCSS will be:

$$ WCSS = \sum\left((5,3) - (16.8, 17.0)\right)^2 $$

$$ WCSS = (5 - 16.8)^2 + (3 - 17.0)^2 $$

$$ WCSS = (-11.8)^2 + (-14.0)^2 $$

$$ WCSS = 139.24 + 196.00 $$

$$ WCSS = 335.24 $$

This example illustrates how we calculate the WCSS for a single point of a cluster. But a cluster usually contains more than one point, and we need to take all of them into consideration when calculating the WCSS. We'll do that by defining a function that receives a cluster of points and its centroid, and returns the sum of squares:
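A minimal version of such a function:

```python
import numpy as np

def sum_of_squares(cluster_points, centroid):
    cluster_points = np.array(cluster_points)
    centroid = np.array(centroid)
    # Squared Euclidean distances to the centroid, summed over the cluster
    return np.sum((cluster_points - centroid) ** 2)
```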

Now we can get the sum of squares for each cluster:

And sum up the results to obtain the total WCSS :
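Continuing the sketch:

```python
g1_wcss = sum_of_squares(points_in_g1, g1)
g2_wcss = sum_of_squares(points_in_g2, g2)

total_wcss = g1_wcss + g2_wcss
print(total_wcss)  # the guide reports 2964.39 for K=2
```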

So, in our case, when K is equal to 2, the total WCSS is 2964.39 . Now, we can switch Ks and calculate the WCSS for all of them. That way, we can get an insight into what K we should choose to make our clustering perform the best.

  • Calculating WCSS Using Scikit-Learn

Fortunately, we don't need to manually calculate the WCSS for each K . After performing the K-Means clustering for the given number of clusters, we can obtain its WCSS by using the inertia_ attribute. Now, we can go back to our K-Means for loop, use it to switch the number of clusters, and list corresponding WCSS values:
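For example:

```python
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42).fit(points)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this K
```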

Notice that the second value in the list is exactly the same as we've calculated before for K=2.

To visualize those results, let's plot our Ks along with the WCSS values:
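Using Matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
```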


There is a bend in the plot at x = 2 , a low point in the line, and an even lower one at x = 3 . Notice that it reminds us of the shape of an elbow . By plotting the Ks along with the WCSS, we are using the Elbow Method to choose the number of Ks: the chosen K is exactly the elbow point, so in our case it would be 3 instead of 2 .


We can run the K-Means clustering algorithm again, to see how our data would look with three clusters :
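Along these lines:

```python
kmeans = KMeans(n_clusters=3, random_state=42).fit(points)
sns.scatterplot(x=points[:, 0], y=points[:, 1], hue=kmeans.labels_)
plt.show()
```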


We were already happy with two clusters, but according to the elbow method, three clusters would be a better fit for our data. In this case, we would have three kinds of stores instead of two. Before using the elbow method, we thought about southwest and northeast clusters of stores, now we also have stores in the center. Maybe that could be a good location to open another store since it would have less competition nearby.

  • Alternative Cluster Quality Measures

There are also other measures that can be used when evaluating cluster quality:

  • Silhouette Score - analyzes not only the distance between intra-cluster points but also between clusters themselves
  • Between Clusters Sum of Squares (BCSS) - metric complementary to the WCSS
  • Sum of Squares Error (SSE)
  • Maximum Radius - measures the largest distance from a point to its centroid
  • Average Radius - the sum, over clusters, of the largest distance from a point to its centroid, divided by the number of clusters.

It's recommended to experiment and get to know each of them since depending on the problem, some of the alternatives can be more applicable than the most widely used metrics (WCSS and Silhouette Score) .

In the end, as with many data science algorithms, we want to reduce the variance inside each cluster and maximize the variance between different clusters, so that we have more defined and separable clusters.

  • Applying K-Means on Another Dataset

Let's use what we have learned on another dataset. This time, we will try to find groups of similar wines.

Note: You can download the dataset here .

We begin by importing pandas to read the wine-clustering CSV (Comma-Separated Values) file into a Dataframe structure:
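Something like the following; the file name is whatever your downloaded copy is called:

```python
import pandas as pd

df = pd.read_csv('wine-clustering.csv')  # assumed file name
```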

After loading it, let's take a peek at the first five records of data with the head() method:

We have many measurements of substances present in wines. Here, we also won't need to transform categorical columns because all of them are numerical. Now, let's take a look at the descriptive statistics with the describe() method:


By looking at the table it is clear that there is some variability in the data - for some columns such as Alcohol there is more, and for others, such as Malic_Acid , less. Now we can check if there are any null , or NaN values in our dataset:
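For instance:

```python
df.isna().sum()  # every column comes back 0 - no missing values
```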

There's no need to drop rows or impute values, considering there aren't empty values in the dataset. We can use a Seaborn pairplot() to see the data distribution and to check if the dataset forms pairs of columns that could be interesting for clustering:
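A one-liner suffices:

```python
import seaborn as sns

sns.pairplot(df)
```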


By looking at the pair plot, two columns seem promising for clustering purposes - Alcohol and OD280 (which is a method for determining the protein concentration in wines). It seems that there are 3 distinct clusters on plots combining two of them.

There are other columns that seem to be correlated as well - most notably Alcohol and Total_Phenols , and Alcohol and Flavonoids . They have strong linear relationships that can be observed in the pair plot.

Since our focus is clustering with K-Means, let's choose one pair of columns, say Alcohol and OD280 , and test the elbow method for this dataset.

Note: When using more columns of the dataset, you will need either to plot in 3 dimensions or to reduce the data to principal components (via PCA) . This is a valid, and more common, approach; just make sure to choose the principal components based on how much variance they explain, and keep in mind that reducing the data dimensions loses some information - the plot is then an approximation of the real data, not how it really is.

Let's plot the scatter plot with those two columns set to be its axis to take a closer look at the points we want to divide into groups:
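Assuming the column names match the text:

```python
sns.scatterplot(x=df['OD280'], y=df['Alcohol'])
```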


Now we can define our columns and use the elbow method to determine the number of clusters. We will also initiate the algorithm with kmeans++ just to make sure it converges more quickly:
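A sketch of that elbow loop (column names assumed from the text):

```python
from sklearn.cluster import KMeans

X = df[['OD280', 'Alcohol']].values

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', random_state=42).fit(X)
    wcss.append(kmeans.inertia_)
```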

We have calculated the WCSS, so we can plot the results:
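Same plotting pattern as before:

```python
import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('WCSS')
plt.show()
```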


According to the elbow method, we should have 3 clusters here. For the final step, let's cluster our points into 3 clusters and plot those clusters identified by colors:
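For example:

```python
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42).fit(X)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=kmeans.labels_)
plt.show()
```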


We can see clusters 0 , 1 , and 2 in the graph. Based on our analysis, group 0 has wines with higher protein content and lower alcohol, group 1 has wines with higher alcohol content and low protein, and group 2 has both high protein and high alcohol in its wines.

This is a very interesting dataset and I encourage you to go further into the analysis by clustering the data after normalization and PCA - also by interpreting the results and finding new connections.

K-Means clustering is a simple yet very effective unsupervised machine learning algorithm for data clustering. It clusters data based on the Euclidean distance between data points. K-Means clustering algorithm has many uses for grouping text documents, images, videos, and much more.


K-means clustering algorithm from scratch.

  • April 26, 2020
  • Venmani A D

K-Means Clustering is an unsupervised learning algorithm that aims to group the observations in a given dataset into clusters. The number of clusters is provided as an input. It forms the clusters by minimizing the sum of the distance of points from their respective cluster centroids.

  • Basic Overview
  • Introduction to K-Means Clustering
  • Steps Involved
  • Maths Behind K-Means Clustering
  • Implementing K-Means from Scratch
  • Elbow Method to Find the Optimal Number of Clusters
  • Grouping Mall Customers Using K-Means

Basic overview of clustering.

Clustering is a type of unsupervised learning which is used to split unlabeled data into different groups.

Now, what does unlabeled data mean?

Unlabeled data means we don’t have a dependent variable (response variable) for the algorithm to compare as the ground truth. Clustering is generally used in Data Analysis to get to know about the different groups that may exist in our dataset.

We try to split the dataset into different groups, such that the data points in the same group have more similar characteristics than data points in different groups.

Now how to find whether the points are similar?

Use a good distance metric to compute the distance between a point and every other point. The points that have a smaller distance between them are more similar. Euclidean distance is the most commonly used metric.


Clustering algorithms are generally used in network traffic classification and in customer and market segmentation. They can be used on any tabular dataset where you want to know which rows are similar to each other, to form meaningful groups out of the dataset. First, I am going to install the libraries that I will be using.

Let's look into one of the most common clustering algorithms, K-Means, in detail.

K-Means follows an iterative process in which it tries to minimize the distance of the data points from the centroid points. It’s used to group the data points into k number of clusters based on their similarity.

Euclidean distance is used to calculate the similarity.

Let's see a simple example of how K-Means clustering can be used to segregate the dataset. In this example, I am going to use the make_blobs command to generate isotropic Gaussian blobs, which can be used for clustering.

I passed in the number of samples to be generated to be 100 and the number of centers to be 5.
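A sketch of that call (random_state is our addition, for reproducibility):

```python
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, y = make_blobs(n_samples=100, centers=5, random_state=42)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
```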


If you look at the value of y, you can see that the points are already labeled by their clusters, but I won't be using this to compute the clusters; I will be using it only for evaluation purposes. For using K-Means, you need to import KMeans from the sklearn.cluster library.

For using KMeans, you need to specify the number of clusters as an argument. In this case, we can see from the graph that there are 5 clusters, so I will be passing 5 as the argument. But in general cases, you should use the Elbow Method to find the optimal number of clusters. I will discuss this method in detail in the upcoming sections.

After passing the arguments, I have fitted the model and predicted the results. Now let’s visualize our predictions in a scatter plot.
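Roughly:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=42)
y_pred = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()
```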


There are 3 important steps in K-Means Clustering.

  • 1. Initialize centroids – This is done by randomly choosing K points; they can be points present in the dataset or arbitrary random points.
  • 2. Assign clusters – Each point in the dataset is assigned to a cluster by calculating its distance from every centroid and assigning it to the centroid with the minimum distance.
  • 3. Re-calculate the centroids – Each centroid is updated by calculating the centroid (mean) of the cluster we have created.


Maths Behind K-Means Working

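In brief, K-Means searches for the cluster assignments and centroids that minimize the within-cluster sum of squared distances; a minimal statement of the objective:

$$ J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \left\lVert x_i - \mu_j \right\rVert^2 $$

where \(C_j\) is the set of points currently assigned to cluster \(j\) and \(\mu_j\) is its centroid. The assignment step and the centroid-update step each decrease \(J\) (or leave it unchanged), which is why the iteration converges.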

One of the few things that you need to keep in mind is that, as we are using Euclidean distance as the main parameter, it will be better to standardize your dataset if the x and y scales differ greatly - for example, 10 versus 100.

Also, it's recommended to try a wide range of initial centroid positions and check whether we get the same output; there is a possibility that you may get stuck at a local minimum rather than the global minimum.

Let’s use the same make_blobs example we used at the beginning. We will try to do the clustering without using the KMeans library.

I have set the K value to be 5 as before and also initialized the centroids randomly at first using the random.randint() function.

Then I am going to find the distance between the points. Euclidean distance is most commonly used for finding the similarity.

I have also stored all the minimum values in a variable minimum . Then I regrouped the dataset based on the minimum values we got and calculated the centroid value.

Then we need to repeat the above two steps over and over again until we reach convergence.
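A consolidated sketch of the scratch implementation described above (vectorized with NumPy; assumes no cluster ends up empty):

```python
import random
import numpy as np

K = 5
# Initial centroids: K random rows of X, picked with random.randint()
centroids = X[[random.randint(0, len(X) - 1) for _ in range(K)]]

for _ in range(100):
    # Euclidean distance from every point to every centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # 'minimum' holds the index of the nearest centroid for each point
    minimum = distances.argmin(axis=1)
    # Regroup and recompute each centroid as the mean of its points
    new_centroids = np.array([X[minimum == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centroids, centroids):
        break  # converged
    centroids = new_centroids
```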

Let’s plot it.
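One way:

```python
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=minimum)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', color='red')
plt.show()
```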


Elbow method to find the optimal number of clusters

One of the important steps in K-Means clustering is to determine the optimal number of clusters we need to give as an input. This can be done by iterating over a range of n values and then finding the optimal n value.

For finding this optimal n, the Elbow Method is used.

You have to plot the loss values against the n values and find the point where the graph flattens; this point is considered the optimal n value.

Let's look at the example we saw at first, to see the working of the elbow method. I am going to iterate through a series of n values ranging from 1-20 and then plot their loss values.
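A sketch, reusing inertia_ as the loss:

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

losses = []
for n in range(1, 21):
    model = KMeans(n_clusters=n, random_state=42).fit(X)
    losses.append(model.inertia_)

plt.plot(range(1, 21), losses, marker='o')
plt.xlabel('Number of clusters (n)')
plt.ylabel('Loss (inertia)')
plt.show()
```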


I am going to be using the Mall_Customers dataset. You can download the dataset from the given link.


Let’s try to find if there are certain clusters between the customers based on their Age and Spending Score.
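A sketch of that workflow; the file and column names below are assumptions based on the common version of this dataset, and the cluster count would come from the elbow method above:

```python
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

mall_df = pd.read_csv('Mall_Customers.csv')  # assumed file name
X_mall = mall_df[['Age', 'Spending Score (1-100)']].values  # assumed column names

labels = KMeans(n_clusters=4, random_state=42).fit_predict(X_mall)
plt.scatter(X_mall[:, 0], X_mall[:, 1], c=labels)
plt.show()
```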



What Is the K-Means Clustering Algorithm?

K-means is a simple but powerful clustering algorithm in machine learning. Here, our expert explains how it works and its plusses and minuses.

Noah Topper

K-means is a very simple clustering algorithm used in machine learning. Clustering is an unsupervised learning task. Learning is unsupervised when it requires no labels on its data. Such algorithms can find inherent structure and patterns in unlabeled data. Contrast this with supervised learning , where a model learns to match inputs to corresponding outputs, which requires the data to be labeled.

Clustering is perhaps the simplest unsupervised learning task to understand. In a data set, it’s possible to see that certain data points cluster together and form a natural group.

For example, suppose we have a lot of data on the behavior of YouTube viewers. We might identify one set of users who visit the site frequently, stay on for a long time, and typically reach videos through the subscription feed. Call these “power users.” We might also see a group of “casual users” who visit the site only occasionally, don’t stick around for very long, and typically reach videos from the home page, recommendation bar or directly from links.

Clustering techniques can identify such groups of users automatically, without any human having to label some users as power users and others as casual. Such an algorithm won’t be able to tell us what these clusters mean, but if there are groups of users with different behavior, it can at least detect that they exist and identify their key qualities. We can then decide what to do from there.

Now let’s get into it!

K-Means Algorithm

K-means is a simple clustering algorithm in machine learning. In a data set, it's possible to see that certain data points cluster together and form a natural group. The goal of k-means is to locate the centroids around which data is clustered; they are the “means” in “k-means.” If we know where these points are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster center.


How Does the K-Means Algorithm Work?

Consider the following unlabeled data:

It was randomly generated to cluster around five central points, called centroids. The goal of k-means is to locate these centroids; they are the “means” in “k-means.” If we know where these points are, the intuition behind the algorithm is that we can then classify each point by assigning it to its closest cluster center.

Okay, but how do we find these centroids? First, we must tell the algorithm how many there are. In this case, there are five. The algorithm then makes a guess, choosing five random points as its initial centroids, and assigns each data point to the closest one. Next, it looks at all the points assigned to the same cluster. It's likely that the centroid we initially picked is actually not at the center of this cluster (as you can see in the visualization below). So, we update our guess by moving the centroid to the middle of the cluster and do the same for every other centroid. We then repeat this process until the centers stop moving.

As an example, here’s what the first three steps of (one particular run of) k-means looks like on this data set:

In the top left, we have our initial choice of five random centroids. In the top right, we assign each point to its closest centroid and color the regions accordingly. In the middle left, we then shift our centroids to the middle of our identified clusters, and so on. Running the full algorithm, we eventually get the following clustering:

Looks pretty good! These five centroids now lie right at the centers of the five clusters in our data.

If you’re looking to use this algorithm, the machine learning library Scikit-learn has an easy-to-use, built-in k-means class . It comes with some extra optimizations by default. For example, it biases the initial choice of centroids to be more spread out than would typically occur from a purely random choice. It also repeats the algorithm a number of times from scratch and returns the best clustering it finds.

Without this, it’s easy for k-means to get stuck on a clustering that’s too densely packed in one area. For example, look back at the second figure above. We can see that k-means initially has a lot more centroids in the bottom-left than the top-right. If we get an unlucky run, the algorithm may never realize that the “cluster” in the top-right is actually two clusters.

Advantages of K-Means

The main advantages of k-means are its simplicity, speed and scalability. Understanding and implementing k-means is simple, and it’s always good to understand your tools. In practice, it also runs quite fast and even works well for large data sets.  It’s also guaranteed to terminate (although not necessarily to the optimal answer), so the algorithm won’t keep moving its centers back and forth forever.

Disadvantages of K-Means (and How to Fix Them)

K-means has some disadvantages, however. First, we must choose the number of clusters k in advance; the algorithm does not determine it for us. If you have some prior knowledge about how many there should be, all well and good. Otherwise, you’ll have to tune the choice of k . This can be tricky, but there are a few potential solutions.

What Is Inertia in K-Means Clustering?

The inertia of a k-means model is the mean squared distance from each point to its assigned centroid.

It's not enough to just minimize inertia, since making k larger always lowers it. If we set k to 10 in our last example, then points would be much closer to their centers, but it wouldn't be a very natural clustering. Often, however, if you try increasing values of k , you will see an initial sharp dropoff in inertia with diminishing returns as k gets larger. Values of k just before these diminishing returns set in are good candidates.

Another potential solution is to use the silhouette score , a metric which rewards the model for having points near their centroids but also penalizes it for points being too close to other centroids. This can disincentivize having too many clusters. Looking for k s that score high on this metric is another good way to find candidate values. Additionally, if your k-means model is part of a larger, supervised learning system, as I discuss more later, then you can tune k in a straightforward way in order to minimize loss of the overall system.
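A sketch of tuning k with the silhouette score, given some feature matrix X:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```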

K-means is also sensitive to the positions of the initial centroids. It can get stuck at a local optimum and end up with a crummy answer. But we’ve already discussed how to address this: initialize centroids to be spread out, and run the algorithm multiple times. Since k-means is fairly fast, this isn’t too much of a problem.

Next, k-means is sensitive to the scale of the data. Distance in each direction is treated as equally important, so if one of your features is on a larger scale (e.g., a weight of 100 to 500 pounds versus length of five to 10 feet), k-means will treat it as more important. You can address this problem relatively easily by rescaling your data.
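With scikit-learn, rescaling is one call:

```python
from sklearn.preprocessing import StandardScaler

# Each feature rescaled to mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)
```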

Harder to overcome is that k-means is sensitive to the shape of the data. The algorithm only learns spherical clusters, so if your data has clusters of a different shape, it will struggle to learn them. K-means is also sensitive to outliers and struggles with higher-dimensionality data. For example, k-means would have a hard time clustering 1024 by 1024 images since our data points would have over a million dimensions.


Applications of K-Means Algorithm in Machine Learning

K-means has lots of applications. One is customer segmentation, which we discussed in the YouTube example at the start. This means identifying groups of similar users and can be useful to better understand your userbase. What sorts of people are using your product, and what for and in what way? This may inform overall design decisions and business strategies. 

Customer segmentation is also useful from the more computational side. Continuing with the YouTube example, we can recommend videos or channels to users based on what other people in their cluster are watching.

Another use case is anomaly detection. If some new data point is very far from all your existing clusters, then something strange is going on and may warrant investigation. Suppose we find some YouTube account which clicks on a video, immediately likes it, then moves on and repeats. This is probably a bot that you want to deactivate. Fortunately, its behavior likely does not fit into your natural clusters of users, and we can detect this automatically.

Yet another use is dimensionality reduction and feature engineering. Our data may have dozens of dimensions but only, say, five significant clusters. If we transform each point by representing it with its distance to each centroid, we can drastically reduce the dimensionality of our data. This can speed up learning if we then feed our transformed data into a supervised learning algorithm. Alternatively, we could add these distances as new features to our data set to aid learning, rather than throwing out the original features entirely.
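Scikit-learn's KMeans exposes this directly via transform(); a sketch with an assumed feature matrix X:

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
# Each row becomes its vector of distances to the 5 centroids,
# reducing the data to 5 features
X_dist = kmeans.transform(X)
```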

Finally, consider semi-supervised learning, when only a small fraction of our data is labeled. Suppose we have a set of handwritten digits, like the classic MNIST problem . If we have 100,000 such pictures, but only 10 percent of them are labeled, supervised learning might not do very well on just that 10 percent. Is there some way we can use the other 90 percent of the data to our advantage?

Yes! If we cluster the data, we can identify groups of pictures that are probably the same digit without any labeling required. If we then find the labeled picture closest to each centroid, we can propagate its label to the other pictures in the cluster or possibly just the pictures closest to the centroid. We then have a much larger amount of labeled data to train on.

K-means can be a particularly good strategy for semi-supervised classification since we know something about the number of clusters in advance: It’s tied to the number of classes. Usually, we’ll want to set k to be some small multiple of the number of classes. Here, we might have 50 clusters since there are multiple ways to write each digit.

And that’s k-means! Remember to go check out the Scikit-learn k-means class to start using the algorithm right away.


What is K Means Clustering? With an Example

By Jim Frost

What is K Means Clustering?

The K means clustering algorithm divides a set of n observations into k clusters. Use K means clustering when you don’t have existing group labels and want to assign similar data points to the number of groups you specify (K).

In general, clustering is a method of assigning comparable data points to groups using data patterns. Clustering algorithms find similar data points and allocate them to the same set. K means clustering is one such algorithm. A cluster is a group or set of similar data points, and I’ll use those terms synonymously.

“K means” refers to the following:

  • The number of clusters you specify (K).
  • The process of assigning observations to the cluster with the nearest center (mean).

K means clustering forms the groups in a manner that minimizes the variances between the data points and the cluster’s centroid. Learn more about Variances .

Imagine you have a large dataset of observations, but there is no grouping information or labels for the data points. Use K means clustering to generate groups comprised of observations with similar characteristics. For example, if you have customer data, you might want to create sets of similar customers and then target each group with different types of marketing.

K means clustering is a popular machine learning algorithm. It’s an unsupervised method because it starts without labels and then forms and labels groups itself. K means clustering is not a supervised learning method because it does not attempt to predict existing or known group labels.

K means clustering usage has grown recently thanks to machine learning taking off. But it started as a statistical grouping method for signal processing in 1957.

Read on to learn how the K means clustering algorithm works and see an example of it.

How the K Means Clustering Algorithm Works

The K Means Clustering algorithm finds observations in a dataset that are like each other and places them in a set. The process starts by randomly assigning each data point to an initial group and calculating the centroid for each one. A centroid is the center of the group. Note that some forms of the procedure allow you to specify the initial sets.

Then the algorithm continues as follows:

  • It evaluates each observation, assigning it to the closest cluster. The definition of “closest” is that the Euclidean distance between a data point and a group’s centroid is shorter than the distances to the other centroids.
  • When a cluster gains or loses a data point, the K means clustering algorithm recalculates its centroid.
  • The algorithm repeats until it can no longer assign data points to a closer set.

When the K means clustering algorithm finishes, all groups have the minimum within-cluster variance, which keeps them as small as possible. Sets with minimum variance and size have data points that are as similar as possible. There is variability amongst the characteristics in each cluster, but the algorithm minimizes it.

In short, the observations within a set should share characteristics. Be sure to assess the final groups to be sure they make sense and satisfy your goals! In some cases, the analysts might need to specify different numbers of groups to determine which value of K produces the most useful results. For example, you might find that some sets are too large or too small to be helpful.

K Means Clustering Example

Imagine you’re studying businesses in a specific industry and documenting their information. Specifically, you record the variables shown in the dataset snippet below.

Dataset for K means clustering algorithm example.

Download the full CSV dataset: KMeansClustering .

Now you want to group them into three clusters of similar businesses using these four variables.

Let’s run the analysis!

Interpreting the Results

The downloadable dataset contains the K mean clustering assignments for each business. We’ll look at some of the output to understand the groups.

Statistical output for the K means clustering example.

The statistical output shows that K means clustering has created the following three sets with the indicated number of businesses in each:

  • Cluster1: 6
  • Cluster2: 10
  • Cluster3: 6

We know each set contains similar businesses, but how do we characterize them? To do that, we need to look at the Cluster Centroids section. The output shows that Cluster 1 contains businesses with more clients, a higher rate of return, higher sales, and existed for more years. Cluster 3’s centroid has the lowest values. Cluster 2 is between them. You can describe the groups as the following:

  • 1: Established industry leaders
  • 2: Mid-growth businesses
  • 3: Newer businesses

Frequently, examples of K means clustering use two variables that produce two-dimensional groups, which makes graphing easy. This example uses four variables, making the groups four-dimensional. You can’t graph them all on a single plot! This complexity highlights the value of using an algorithm to create the sets. Of course, larger datasets can have many more variables.

I can plot a pair of variables on a scatterplot to help you see the clusters. Keep in mind that K means clustering used the other two variables as well. Consequently, this graph is a two-dimensional slice of a four-dimensional space. Learn more about Scatterplots .

Scatterplot displaying the groups.

The scatterplot displays companies by their years of existence and sales. The graph color codes the groups in the data.

Reader Interactions

January 10, 2023 at 5:36 pm

Hi Prof and thank you !

Can p-values be used to determine the significance of the number of similar images being clustered together as thumbnails on a canvas vs. the number of outliers on a canvas? In other words, the ratio of similar thumbnails to thumbnail outliers on a canvas.

The traditional k means performance tests for inter cluster and intra cluster are computationally too expensive for this large dataset of visual images. I chose the quality of the visual clustering output for my use case.

Warm regards


September 6, 2022 at 6:15 pm

Can k-means be used with categorical data?


September 6, 2022 at 8:52 pm

No, you can’t use K means clustering with categorical data. K means minimizes distances between data points and centroids. Categorical data cannot be placed on a scale with distances between observations. Hence, it doesn’t work with that data type. In terms of nominal (categorical), ordinal, interval, and ratio scales , you really need at least an interval scale (for the distances) to be able to use K means clustering.


September 6, 2022 at 7:33 am

Thanks for your post. I’ve been following you from here in Brazil, and I have your book. K means is important for analyzing the mass of data we manipulate.

September 6, 2022 at 4:58 pm

Hi José! Greetings to you in lovely Brazil! Thanks so much for buying one of my books! 🙂

Thanks for writing. I think you’re correct that K means clustering has grown in importance due to the volume of data we now produce.


September 6, 2022 at 4:36 am

Hello. Thank you for sharing your experience. Could you share some titles of good books on the analysis and design of surveys where those methods that you mentioned are also available?


September 6, 2022 at 3:36 am

In the Survey research world, K-means has a very chequered reputation amongst those who have used it extensively. It is exceptionally sensitive to changes in input variables and to the choice of the number of clusters. The first is a particular problem unless there is an underlying theory about the relevant variables. Adding one variable can change the entire solution dramatically. I stopped using it years ago in favour of latent class analysis and latent continua, especially via the use of correspondence analysis. K-means is also not great with non-continuous variables.

September 6, 2022 at 4:57 pm

Thanks for sharing your experiences with K means clustering. It’s great when readers write about their real-world usage of statistical procedures. I think you raise some great points in your comments. First of all, not all tools work in all contexts. Secondly, it’s important to assess the results and determine whether they make sense given your subject-area knowledge. Don’t treat any procedure as a black box that just spits out an answer that you assume is correct. Verify!

Think of it this way. Classifying any set of observations using a set of variables can be tricky even when you have experts in the subject area doing the classification. There can be disagreements between experts on how to weight the different variables and where to draw lines between different categories. Now imagine doing that with a statistical algorithm that doesn’t use any subject-area knowledge but does assess the data using an objective approach that considers a large number of variables. There are bound to be pros and cons. Various difficulties. And so on. There is also the question of how many groups are optimal.

In short, I’m sure how well it works varies by area.

I think K means clustering has its place, particularly when you think of the big data arena and where the risk of misclassification is small. Maybe you classify people into different groups and then, based on the groups, either target them with different types of advertising or product recommendations.


September 5, 2022 at 7:22 pm

Thank you Prof. Clearly K means clustering as a statistical tool can help create meaning out of a cacophony of data just as your given example clearly demonstrated it.

September 5, 2022 at 10:26 pm

Hi Funsho, thanks! And I love how you phrase it, “creating meaning out of a cacophony of data!” That’s exactly what it does!



Lesson 17 of 39 By Mayank Banoula

K-Means Clustering Algorithm: Applications, Types, Demos and Use Cases


Every Machine Learning engineer wants to achieve accurate predictions with their algorithms. Such learning algorithms are generally broken down into two types - supervised and unsupervised . K-Means clustering is one of the unsupervised algorithms where the available input data does not have a labeled response.

Types of Clustering

Clustering is a type of unsupervised learning wherein data points are grouped into different sets based on their degree of similarity.

The various types of clustering are:

  • Hierarchical clustering
  • Partitioning clustering 

Hierarchical clustering is further subdivided into:

  • Agglomerative clustering
  • Divisive clustering

Partitioning clustering is further subdivided into:

  • K-Means clustering 
  • Fuzzy C-Means clustering 


Hierarchical Clustering

Hierarchical clustering uses a tree-like structure, like so:

[Figure: hierarchical clustering tree]

In agglomerative clustering, there is a bottom-up approach. We begin with each element as a separate cluster and merge them into successively more massive clusters, as shown below:

[Figure: agglomerative clustering, merging elements bottom-up]

Divisive clustering is a top-down approach. We begin with the whole set and proceed to divide it into successively smaller clusters, as you can see below:

[Figure: divisive clustering, splitting the set top-down]

Partitioning Clustering 

Partitioning clustering is split into two subtypes - K-Means clustering and Fuzzy C-Means.

In k-means clustering, the objects are divided into K clusters, where K is a number you specify. So if we say K = 2, the objects are divided into two clusters, c1 and c2, as shown:

[Figure: objects partitioned into two clusters, c1 and c2]


Here, the features or characteristics are compared, and all objects having similar characteristics are clustered together. 

Fuzzy c-means is very similar to k-means in the sense that it clusters objects that have similar characteristics together. In k-means clustering, a single object cannot belong to two different clusters. But in c-means, objects can belong to more than one cluster, as shown. 

[Figure: fuzzy c-means, with objects belonging to more than one cluster]

What is Meant by the K-Means Clustering Algorithm?

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster. 

The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is also a way of finding out the best or optimum value of K for a given dataset.

For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by each player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.

Let's take a look at the steps to create these clusters.

Assign data points

Here, we have our data set plotted on ‘x’ and ‘y’ coordinates. The information on the y-axis is about the runs scored, and on the x-axis about the wickets taken by the players. 

If we plot the data, this is how it would look:

[Figure: scatter plot of runs scored vs. wickets taken]

Perform Clustering

We need to create the clusters, as shown below:

[Figure: the data grouped into two clusters]

Considering the same data set, let us solve the problem using K-Means clustering (taking K = 2).

The first step in k-means clustering is the allocation of two centroids randomly (as K=2). Two points are assigned as centroids. Note that the points can be anywhere, as they are random points. They are called centroids, but initially, they are not the central point of a given data set.

[Figure: two randomly assigned centroids]

The next step is to determine the distance of each data point from the randomly assigned centroids. For every point, the distance is measured from both centroids, and the point is assigned to the centroid to which its distance is smaller. You can see the data points attached to the centroids and represented here in blue and yellow.

[Figure: data points assigned to the nearer centroid, shown in blue and yellow]

The next step is to determine the actual centroid for these two clusters. The original randomly allocated centroid is to be repositioned to the actual centroid of the clusters.

[Figure: centroids repositioned to the actual centers of the two clusters]

This process of calculating the distance and repositioning the centroid continues until we obtain our final cluster. Then the centroid repositioning stops.

[Figure: final clusters after centroid repositioning converges]

As seen above, the centroids need no further repositioning, which means the algorithm has converged and we have our two final clusters, each with its centroid.

Advantages of k-means

  • Simple and easy to implement: The k-means algorithm is easy to understand and implement, making it a popular choice for clustering tasks.
  • Fast and efficient: K-means is computationally efficient and can handle large datasets with high dimensionality.
  • Scalability: K-means can handle large datasets with a large number of data points and can be easily scaled to handle even larger datasets.
  • Flexibility: K-means can be easily adapted to different applications and can be used with different distance metrics and initialization methods.

Disadvantages of K-Means:

  • Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.
  • Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.
  • Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

Applications of K-Means Clustering

K-Means clustering is used in a variety of examples or business cases in real life, like:

  • Academic performance 
  • Diagnostic systems 
  • Search engines
  • Wireless sensor networks

Academic Performance

Based on the scores, students are categorized into grades like A, B, or C. 

Diagnostic systems

The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.

Search engines

Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and the search engines very often use clustering to do this. 

Wireless Sensor Networks

In wireless sensor networks, the clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.

Distance Measure 

Distance measure determines the similarity between two elements and influences the shape of clusters.

K-Means clustering supports various kinds of distance measures, such as: 

  • Euclidean distance measure
  • Manhattan distance measure 
  • Squared Euclidean distance measure
  • Cosine distance measure 

Euclidean Distance Measure 

The most common case is determining the distance between two points. If we have a point P and a point Q, the Euclidean distance is the length of the ordinary straight line connecting them, i.e., the distance between the two points in Euclidean space.

The formula for distance between two points is shown below:

d(P, Q) = √((q1 − p1)² + (q2 − p2)² + … + (qn − pn)²)

Squared Euclidean Distance Measure

This is identical to the Euclidean distance measurement but does not take the square root at the end. The formula is shown below:

d(P, Q) = (q1 − p1)² + (q2 − p2)² + … + (qn − pn)²

Manhattan Distance Measure

The Manhattan distance is the simple sum of the horizontal and vertical components or the distance between two points measured along axes at right angles.

Note that we are taking the absolute value so that the negative values don't come into play. 

The formula is shown below:

d(P, Q) = |q1 − p1| + |q2 − p2| + … + |qn − pn|

Cosine Distance Measure

In this case, we take the angle between the two vectors formed by joining the points to the origin. The formula is shown below:

cos θ = (P · Q) / (‖P‖ ‖Q‖), and the cosine distance is 1 − cos θ
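As a quick check of these four measures in code, here is a minimal sketch using SciPy's distance helpers (the example points are made up for illustration):

import numpy as np
from scipy.spatial import distance

p = np.array([2.0, 10.0])  # example points, arbitrary values
q = np.array([5.0, 8.0])

print(distance.euclidean(p, q))    # straight-line distance
print(distance.sqeuclidean(p, q))  # the same, without the square root
print(distance.cityblock(p, q))    # Manhattan distance
print(distance.cosine(p, q))       # 1 minus the cosine similarity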


K-means on Geyser's Eruptions Segmentation

K-means can be used to segment the Geyser's Eruptions dataset, which records the duration and waiting time between eruptions of the Old Faithful geyser in Yellowstone National Park. The algorithm can be used to cluster the eruptions based on their duration and waiting time and identify different patterns of eruptions.

K-means on Image Compression

K-means can also be used for image compression, where it can be used to reduce the number of colors in an image while maintaining its visual quality. The algorithm can be used to cluster the colors in the image and replace the pixels with the centroid color of each cluster, resulting in a compressed image.
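As a sketch of how this color-quantization step might look with scikit-learn (the sample image, the palette size of 16, and the other parameters are assumptions, not from the original demo):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image('china.jpg')       # an H x W x 3 image bundled with scikit-learn
pixels = image.reshape(-1, 3).astype(float)  # one row per pixel

# fit on a subsample of pixels for speed, then quantize every pixel
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), 10_000, replace=False)]
kmeans = KMeans(n_clusters=16, n_init=4, random_state=0).fit(sample)

labels = kmeans.predict(pixels)  # nearest palette color for each pixel
compressed = kmeans.cluster_centers_[labels].reshape(image.shape).astype(np.uint8)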

Evaluation Methods

Evaluation methods are used to measure the performance of clustering algorithms. Common evaluation methods include:

Sum of Squared Errors (SSE): This measures the sum of the squared distances between each data point and its assigned centroid.

Silhouette Coefficient: This measures the similarity of a data point to its own cluster compared to other clusters. A high silhouette coefficient indicates that a data point is well-matched to its own cluster and poorly matched to neighboring clusters.

Silhouette Analysis

Silhouette analysis is a graphical technique used to evaluate the quality of the clusters generated by a clustering algorithm. It involves calculating the silhouette coefficient for each data point and plotting the values, usually grouped by cluster. Wide silhouette plots with uniformly high values indicate that the clusters are well separated and distinct, while narrow plots with low values indicate that the clusters are poorly separated and may overlap.
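A minimal sketch of computing both measures with scikit-learn (the synthetic data is made up for the example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(kmeans.inertia_)                      # SSE of the fitted clustering
print(silhouette_score(X, kmeans.labels_))  # mean silhouette coefficient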

How Does K-Means Clustering Work?

The flowchart below shows how k-means clustering works:

[Figure: flowchart of the k-means clustering process]

The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple of ways to accomplish this. We can use the trial and error method by specifying the value of K (e.g., 3, 4, 5). As we progress, we keep changing the value until we get the best clusters.

Another method is to use the Elbow technique to determine the value of K. Once we get the value of K, the system will assign that many centroids randomly and measure the distance of each of the data points from these centroids. Accordingly, it assigns those points to the corresponding centroid from which the distance is minimum, so each data point will be assigned to the centroid which is closest to it. Thereby, we have K initial clusters.

For the newly formed clusters, it calculates the new centroid position. The position of the centroid moves compared to the randomly allocated one.

Once again, the distance of each point is measured from this new centroid point. If required, the data points are relocated to the new centroids, and the mean position or the new centroid is calculated once again. 

If the centroid moves, the iteration continues indicating no convergence. But once the centroid stops moving (which means that the clustering process has converged), it will reflect the result.

Let's use a visualization example to understand this better. 

We have a data set for a grocery shop, and we want to find out how many clusters this has to be spread across. To find the optimum number of clusters, we break it down into the following steps:

The Elbow method is the most popular way to find the number of clusters. It consists of running K-Means clustering on the dataset for a range of values of K.

Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters for a given dataset. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.

[Figure: computing the WSS for each value of K]

The WSS is measured for each value of K. Note that WSS keeps decreasing as K grows, so the optimum value is not the K with the least WSS; rather, it is the point beyond which increasing K no longer produces a substantial drop in WSS.

Now, we draw a curve between WSS and the number of clusters. 

[Figure: elbow curve, with WSS on the y-axis and number of clusters on the x-axis]

Here, WSS is on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2. 

So, you can take the elbow point as the optimal value of K; here it should be either two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS; it stabilizes.

Let's assume that these are our delivery points:

[Figure: scatter plot of the delivery points]

We can randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly. 

Now the distance of each location from the centroid is measured, and each data point is assigned to the centroid, which is closest to it.

This is how the initial grouping is done:

[Figure: initial grouping of the delivery points around C1 and C2]

Compute the actual centroid of data points for the first group.

Reposition the random centroid to the actual centroid.

[Figure: random centroid of the first group repositioned to the actual centroid]

Compute the actual centroid of data points for the second group.

[Figure: centroid of the second group repositioned to the actual centroid]

Once the cluster becomes static, the k-means algorithm is said to be converged. 

The final cluster with centroids c1 and c2 is as shown below:

[Figure: final clusters with centroids c1 and c2]

K-Means Clustering Algorithm 

Let's say we have x1, x2, x3……… x(n) as our inputs, and we want to split this into K clusters. 

The steps to form clusters are:

Step 1: Choose K random points as cluster centers called centroids. 

Step 2: Assign each x(i) to the closest cluster using the Euclidean distance (i.e., calculating its distance to each centroid)

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Keep repeating step 2 and step 3 until convergence is achieved

Let's take a detailed look at each of these steps.

We randomly pick K points (the centroids). We name them c1, c2, …, ck, and we can say that

C = {c1, c2, …, ck}

Where C is the set of all centroids.

We assign each data point to its nearest center, which is accomplished by calculating the euclidean distance.

cluster(x) = arg min over ci ∈ C of dist(ci, x)²

Where dist() is the Euclidean distance.

Here, we calculate each x value's distance from each c value, i.e. the distance between x1-c1, x1-c2, x1-c3, and so on. Then we find which is the lowest value and assign x1 to that particular centroid.

Similarly, we find the minimum distance for x2, x3, etc. 

We identify the actual centroid by taking the average of all the points assigned to that cluster. 

ci = (1 / |Si|) ∑ xj, summing over all xj ∈ Si

Where Si is the set of all points assigned to the ith cluster.     

It means the original point, which we thought was the centroid, will shift to the new position, which is the actual centroid for each of these groups. 

Keep repeating step 2 and step 3 until convergence is achieved.
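Putting the four steps together, here is a minimal NumPy sketch of the algorithm described above (the function and variable names are our own, and empty clusters are not handled; a fuller implementation would re-seed them):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid is the mean of the points in each cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels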

How to Choose the Value of "K number of clusters" in K-Means Clustering?

Although there are many choices available for choosing the optimal number of clusters, the Elbow Method is one of the most popular and appropriate methods. The Elbow Method uses the idea of the WCSS value, which is short for Within Cluster Sum of Squares. WCSS measures the total variation within the clusters. This is the formula used to calculate the value of WCSS (for three clusters), provided courtesy of Javatpoint:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

Python Implementation of the K-Means Clustering Algorithm

Here’s how to use Python to implement the K-Means Clustering Algorithm. These are the steps you need to take:

  • Data pre-processing
  • Finding the optimal number of clusters using the elbow method
  • Training the K-Means algorithm on the training data set
  • Visualizing the clusters

1. Data Pre-Processing. Import the libraries, datasets, and extract the independent variables.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
x = dataset.iloc[:, [3, 4]].values

2. Find the optimal number of clusters using the elbow method. Here’s the code you use:

# finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans

wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()

3. Train the K-means algorithm on the training dataset. Use the same two lines of code used in the previous section. However, instead of using i, use 5, because there are 5 clusters that need to be formed. Here’s the code:

# training the K-means model on the dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

4. Visualize the Clusters. Since this model has five clusters, we need to visualize each one.

# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=100, c='blue', label='Cluster 1')     # first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=100, c='green', label='Cluster 2')    # second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=100, c='red', label='Cluster 3')      # third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=100, c='cyan', label='Cluster 4')     # fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=100, c='magenta', label='Cluster 5')  # fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Coding provided by Javatpoint.

Demo: K-Means Clustering

Problem Statement - Walmart wants to open a chain of stores across the state of Florida, and it wants to find the optimal store locations to maximize revenue.

The issue here is if they open too many stores close to each other, they will not make a profit. But, if the stores are too far apart, they do not have enough sales coverage. 

Solution - An organization like Walmart is an e-commerce giant. They already have the addresses of their customers in their database, so they can use this information to perform K-Means Clustering on customer locations and place stores near the resulting cluster centers.

Conclusion 

Considered the job of the future, Machine Learning engineers are in demand as well as highly paid. A report by Marketwatch predicts a machine learning growth rate of over 45% for the period between 2017 and 2025. So, why restrict your learning to merely the k-means clustering algorithm? Enroll in Simplilearn's Machine Learning Course and expand your knowledge in the broader concepts of Machine Learning. Get certified and become a part of the Artificial Intelligence talent that companies constantly look for.


About the Author

Mayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in Machine Learning and Artificial Intelligence with Python.


Understanding K-means Clustering with Examples

K-Means is one of the most important algorithms in machine learning. In this blog, we will understand the K-Means clustering algorithm with the help of examples.

A Hospital Care chain wants to open a series of Emergency-Care wards within a region. We assume that the hospital knows the locations of all the accident-prone areas in the region. It has to decide the number of Emergency Units to open and the locations of these Emergency Units, so that all the accident-prone areas are covered in the vicinity of these Emergency Units.

The challenge is to decide the location of these Emergency Units so that the whole region is covered. Here is when K-means Clustering comes to rescue!

Before getting to K-means Clustering, let us first understand what Clustering is.

A cluster refers to a small group of objects. Clustering is grouping those objects into clusters. In order to learn clustering, it is important to understand the scenarios that lead us to cluster different objects. Let us identify a few of them.

What is Clustering?

Clustering is dividing data points into homogeneous classes or clusters:

  • Points in the same group are as similar as possible
  • Points in different groups are as dissimilar as possible

When a collection of objects is given, we put them into groups based on similarity.

Application of Clustering:

Clustering is used in almost every field. You can infer some ideas from Example 1 to come up with many clustering applications that you may have come across.

Listed here are few more applications, which would add to what you have learnt.

  • Clustering helps marketers improve their customer base and work on the target areas. It helps group people (according to different criteria such as willingness to buy, purchasing power, etc.) based on their similarity in many ways related to the product under consideration.
  • Clustering helps in the identification of groups of houses on the basis of their value, type, and geographical location.
  • Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region, clustering can help analyse the next probable location where an earthquake can occur.

Clustering Algorithms:

A Clustering Algorithm tries to analyse natural groups of data on the basis of some similarity. It locates the centroid of each group of data points. To carry out effective clustering, the algorithm evaluates the distance between each point and the centroid of the cluster.

The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.

What is K-means Clustering?

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

K-means Clustering – Example 1:

A pizza chain wants to open its delivery centres across a city. What do you think would be the possible challenges?

  • They need to analyse the areas from where pizza is ordered frequently.
  • They need to understand how many pizza stores have to be opened to cover delivery in the area.
  • They need to figure out the locations for the pizza stores within all these areas in order to keep the distance between the store and the delivery points to a minimum.

Resolving these challenges involves a lot of analysis and mathematics. We will now learn how clustering can provide a meaningful and easy method of sorting out such real-life challenges.

K-means Clustering Method:

If k is given, the K-means algorithm can be executed in the following steps:

  • Partition the objects into k non-empty subsets.
  • Identify the cluster centroids (mean points) of the current partition.
  • Assign each point to a specific cluster.
  • Compute the distance from each point to each centroid, and allot points to the cluster where the distance from the centroid is minimum.
  • After re-allotting the points, find the centroid of each newly formed cluster.

The step-by-step process:

Now, let’s consider the problem in Example 1 and see how we can help the pizza chain to come up with centres based on K-means algorithm.


Similarly, for opening Hospital Care Wards:

K-means Clustering will group the locations of the accident-prone areas into clusters and define a cluster center for each cluster, which will be the location where an Emergency Unit opens. These cluster centers are the centroids of each cluster and are at a minimum distance from all the points of a particular cluster; hence, the Emergency Units will be at a minimum distance from all the accident-prone areas within a cluster.

Here is another example for you, try and come up with the solution based on your understanding of K-means clustering.

K-means Clustering – Example 2:

Let's consider data on drug-related crimes in Canada. The data covers crimes due to various drugs, ranging from heroin and cocaine to prescription drugs, especially by underage people. The crimes resulting from this substance abuse can be brought down by starting de-addiction centres in the areas most afflicted by this kind of crime. With the available data, different objectives can be set. They are:

  • Classify the crimes based on the abused substance to detect the prominent cause.
  • Classify the crimes based on age groups.
  • Analyze the data to determine what kind of de-addiction centres are required.
  • Find out how many de-addiction centres need to be set up to reduce the drug-related crime rate.

The K-means algorithm can be used to determine any of the above scenarios by analyzing the available data.

Following the K-means Clustering method used in the previous example, we can start off with a given k, followed by the execution of the K-means algorithm.

Mathematical Formulation for K-means Algorithm:

D = {x1, x2, …, xi, …, xm} → the data set of m records

xi = (xi1, xi2, …, xin) → each record is an n-dimensional vector

Finding Cluster Centers that Minimize Distortion:

The solution can be found by setting the partial derivative of the distortion with respect to each cluster center to zero.
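Spelling this step out (a short derivation in the notation above, added here for completeness): the distortion of a clustering with centers c1, …, ck and clusters S1, …, Sk is

Distortion = ∑j ∑xi ∈ Sj ‖xi − cj‖²

Setting the partial derivative with respect to cj to zero gives

∂Distortion/∂cj = ∑xi ∈ Sj −2(xi − cj) = 0, so cj = (1/|Sj|) ∑xi ∈ Sj xi

That is, each optimal cluster center is simply the mean of the points assigned to it, which is exactly the centroid-update step of the algorithm.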

For any clustering, the value of k should be chosen such that increasing it further leaves the distortion roughly constant. The point where this happens is called the “Elbow”, and it is the ideal value of k for the clusters created.


Hi everyone, I have an eyetracking dataset and want to use it to predict group membership. So, given x and y coordinates, can I predict whether someone is male or female? I haven't used the k-means algorithm before and was wondering if, and how, it can be used to answer my question. Thank you for your response.

Sir, will you please provide k-means MapReduce in R?

what is the difference between plain and iterative mapreduce?

“If k is given, the K-means algorithm can be executed in the following steps” but you don’t say where “k” in ‘if k is given’ comes from.

But k is the number of clusters; how can you say it comes from the data set?

You are welcome, Rahul!! Please check out other posts as well.



K-means Clustering – Introduction

K-Means Clustering is an unsupervised machine learning algorithm which groups an unlabeled dataset into different clusters. This article aims to explore the fundamentals and working of k-means clustering, along with its implementation.

What is K-means Clustering?

Unsupervised Machine Learning is the process of teaching a computer to use unlabeled, unclassified data and enabling the algorithm to operate on that data without supervision. Without any previous data training, the machine’s job in this case is to organize unsorted data according to parallels, patterns, and variations. 

K-means clustering assigns data points to one of K clusters depending on their distance from the centers of the clusters. It starts by randomly placing the cluster centroids in the space. Then each data point is assigned to a cluster based on its distance from the cluster centroids. After assigning each point to one of the clusters, new cluster centroids are computed. This process runs iteratively until it finds good clusters. In the analysis, we assume that the number of clusters is given in advance and we have to put points into one of the groups.

In some cases, K is not clearly defined, and we have to think about the optimal value of K. K-means clustering performs best when the data is well separated; when data points overlap, this clustering is not suitable. K-means is fast compared to other clustering techniques and provides strong coupling between the data points. However, K-means does not provide clear information regarding the quality of the clusters, different initial assignments of the cluster centroids may lead to different clusters, and the algorithm is sensitive to noise and may get stuck in local minima.

The goal of clustering is to divide the population or set of data points into a number of groups so that the data points within each group are more comparable to one another and different from the data points within the other groups. It is essentially a grouping of things based on how similar and different they are to one another. 

We are given a data set of items, with certain features, and values for these features (like a vector). The task is to categorize those items into groups. To achieve this, we will use the K-means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm represents the number of groups/clusters we want to classify our items into.

(It will help if you think of items as points in an n-dimensional space). The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity, we will use the Euclidean distance as a measurement.

The algorithm works as follows:  

  • First, we randomly initialize k points, called means or cluster centroids.
  • We categorize each item to its closest mean, and we update the mean’s coordinates, which are the averages of the items categorized in that cluster so far.
  • We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means because they are the mean values of the items categorized in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means at random items in the data set. Another method is to initialize the means at random values between the boundaries of the data set (if for a feature x, the items have values in [0,3], we will initialize the means with values for x at [0,3]).

The above algorithm in pseudocode is as follows:  
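A minimal pure-Python sketch of that loop (the helper names are our own; a fixed iteration count is used, as described above):

import math
import random

def distance(a, b):
    # ordinary Euclidean distance between two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(items, k, iterations=100):
    # initialize the means at k random items from the data set
    means = random.sample(items, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # categorize each item to its closest mean
        for item in items:
            nearest = min(range(k), key=lambda j: distance(item, means[j]))
            clusters[nearest].append(item)
        # move each mean to the average of the items in its cluster
        for j, cluster in enumerate(clusters):
            if cluster:
                means[j] = [sum(coord) / len(cluster) for coord in zip(*cluster)]
    return means, clusters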

Import the necessary Libraries

We are importing Numpy for statistical computations, Matplotlib to plot the graph, and make_blobs from sklearn.datasets.
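A plausible version of those imports (the np and plt aliases are the usual conventions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs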

Create the custom dataset with make_blobs and plot it
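A sketch of this step (the sample size, number of centers, and random_state are assumed values):

# create a synthetic dataset of three blobs
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=23)

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
plt.show()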

[Figure: scatter plot of the clustering dataset]

Initialize the random centroids

The code initializes three clusters for K-means clustering. It sets a random seed and generates random cluster centers within a specified range, and creates an empty list of points for each cluster.
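A sketch matching that description (the seed value is assumed):

k = 3
np.random.seed(42)  # assumed seed

clusters = {}
for idx in range(k):
    # a random center within the range of the data, plus an empty list of points
    center = np.random.uniform(X.min(axis=0), X.max(axis=0))
    clusters[idx] = {'center': center, 'points': []}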

Plot the randomly initialized centers with the data points
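Continuing the sketch, the plot described below can be produced like this:

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    plt.scatter(*clusters[i]['center'], marker='*', color='red')
plt.show()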

[Figure: data points with randomly initialized centers marked as red stars]

The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the initial cluster centers (red stars) generated for K-means clustering.

Define Euclidean distance
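A one-line helper for this, used by the steps below:

def distance(p1, p2):
    # Euclidean distance between two points
    return np.sqrt(np.sum((p1 - p2) ** 2))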

Create the function to assign and update the cluster center.

The E-step assigns data points to the nearest cluster center, and the M-step updates cluster centers based on the mean of assigned points in K-means clustering.
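A sketch of these two functions (the names are our own):

def assign_clusters(X, clusters):
    # E-step: attach every data point to its nearest cluster center
    for idx in clusters:
        clusters[idx]['points'] = []
    for x in X:
        nearest = min(clusters, key=lambda i: distance(x, clusters[i]['center']))
        clusters[nearest]['points'].append(x)
    return clusters

def update_clusters(X, clusters):
    # M-step: move every center to the mean of its assigned points
    for idx in clusters:
        if clusters[idx]['points']:
            clusters[idx]['center'] = np.mean(clusters[idx]['points'], axis=0)
    return clusters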

Create the function to predict the cluster for the data points

Assign, update, and predict the cluster center, plot the data points with their predicted cluster center.
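Putting it together (the number of E-M rounds is an assumed value):

def pred_cluster(X, clusters):
    # label each point with the index of its nearest center
    return [min(clusters, key=lambda i: distance(x, clusters[i]['center'])) for x in X]

for _ in range(10):  # a few E-M rounds
    clusters = assign_clusters(X, clusters)
    clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)

plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    plt.scatter(*clusters[i]['center'], marker='^', color='red')
plt.show()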

[Figure: data points colored by predicted cluster, with updated centers in red]

The plot shows data points colored by their predicted clusters. The red markers represent the updated cluster centers after the E-M steps in the K-means clustering algorithm.

Import the necessary libraries

Load the dataset and apply the elbow method

Finding the ideal number of groups to divide the data into is a basic stage in any unsupervised algorithm. One of the most common techniques for figuring out this ideal value of k is the elbow approach.

Plot the Elbow graph to find the optimum number of cluster
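A self-contained sketch of this second example on the Iris dataset (the range of k is assumed):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # four numeric features per flower

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squares for this k

plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('SSE')
plt.show()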

[Figure: elbow method plot of SSE vs. number of clusters]

From the above graph, we can observe an elbow-like situation at k=2 and k=3. So, we consider K=3.

Build the K-means clustering model

Find the cluster centers, predict the cluster group for each point, and plot the cluster centers with the data points.
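Continuing the Iris sketch (the feature pairs match the two subplots described below):

km = KMeans(n_clusters=3, n_init=10, random_state=2).fit(X)
labels = km.predict(X)
centers = km.cluster_centers_

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X[:, 2], X[:, 3], c=labels)  # petal length vs. petal width
ax1.scatter(centers[:, 2], centers[:, 3], marker='*', color='red')
ax2.scatter(X[:, 0], X[:, 1], c=labels)  # sepal length vs. sepal width
ax2.scatter(centers[:, 0], centers[:, 1], marker='*', color='red')
plt.show()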

[Figure: Iris clusters; left, petal length vs. petal width; right, sepal length vs. sepal width, with red cluster centers]

The subplot on the left displays petal length vs. petal width with data points colored by cluster, and red markers indicate the K-means cluster centers. The subplot on the right shows sepal length vs. sepal width similarly.

In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data points part of the same group. The algorithm initializes cluster centroids and iteratively assigns data points to the nearest centroid, updating centroids based on the mean of points in each cluster.

Frequently Asked Questions (FAQs)

1. What is k-means clustering for data analysis?

K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.

2. What is an example of k-means in real life?

Customer segmentation in marketing, where k-means groups customers based on purchasing behavior, allowing businesses to tailor marketing strategies for different segments.

3. What type of data suits the k-means clustering model?

K-means works well with numerical data, where the concept of distance between data points is meaningful. It’s commonly applied to continuous variables.

4. Is K-means used for prediction?

K-means is primarily used for clustering and grouping similar data points. It does not predict labels for new data; it assigns them to existing clusters based on similarity.

5. What is the objective of k-means clustering?

The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance. It seeks to form groups where data points within each cluster are more similar to each other than to those in other clusters.



Mathematics > Optimization and Control

Title: A Cutting Plane Algorithm for Globally Solving Low Dimensional K-means Clustering Problems

Abstract: Clustering is one of the most fundamental tools in data science and machine learning, and k-means clustering is one of the most common such methods. There is a variety of approximate algorithms for the k-means problem, but computing the globally optimal solution is in general NP-hard. In this paper we consider the k-means problem for instances with low dimensional data and formulate it as a structured concave assignment problem. This allows us to exploit the low dimensional structure and solve the problem to global optimality within reasonable time for large data sets with several clusters. The method builds on iteratively solving a small concave problem and a large linear programming problem. This gives a sequence of feasible solutions along with bounds which we show converges to zero optimality gap. The paper combines methods from global optimization theory to accelerate the procedure, and we provide numerical results on their performance.
Comments: 12 pages, 3 figures
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
MSC classes: 90C26 (Primary), 90C27 (Secondary)
ACM classes: G.1.6
Cite as: [math.OC]


