Customer Segmentation using K-Means Clustering


  • Research Note
  • Open access
  • Published: 12 May 2024

An inversion-based clustering approach for complex clusters

  • Mohammad Mahdi Barati Jozan 1 ,
  • Aynaz Lotfata 2 ,
  • Howard J. Hamilton 3 &
  • Hamed Tabesh 1  

BMC Research Notes, volume 17, Article number: 133 (2024)


Abstract

The choice of an appropriate similarity measure plays a pivotal role in the effectiveness of clustering algorithms. However, many conventional measures rely solely on feature values to evaluate the similarity between objects to be clustered. Furthermore, the assumption of feature independence, while valid in certain scenarios, does not hold true for all real-world problems. Hence, considering alternative similarity measures that account for inter-dependencies among features can enhance the effectiveness of clustering in various applications.

In this paper, we present the Inv measure, a novel similarity measure founded on the concept of inversion. The Inv measure considers the significance of features, the values of all object features, and the feature values of other objects, leading to a comprehensive and precise evaluation of similarity. To assess the performance of our proposed clustering approach that incorporates the Inv measure, we evaluate it on simulated data using the adjusted Rand index.

The simulation results strongly indicate that inversion-based clustering outperforms other methods in scenarios where clusters are complex, i.e., apparently highly overlapped. This showcases the practicality and effectiveness of the proposed approach, making it a valuable choice for applications that involve complex clusters across various domains.

Conclusions

The inversion-based clustering approach may hold significant value in the healthcare industry, offering possible benefits in tasks like hospital ranking, treatment improvement, and high-risk patient identification. In social media analysis, it may prove valuable for trend detection, sentiment analysis, and user profiling. E-commerce may be able to utilize the approach for product recommendation and customer segmentation. The manufacturing sector may benefit from improved quality control, process optimization, and predictive maintenance. Additionally, the approach may be applied to traffic management and fleet optimization in the transportation domain. Its versatility and effectiveness make it a promising solution for diverse fields, providing valuable insights and optimization opportunities for complex and dynamic data analysis tasks.


Introduction

Clustering is a fundamental technique in data mining and machine learning, aiming to group objects into distinct clusters [ 1 , 2 , 3 , 4 , 5 , 6 , 7 ]. Objects within a cluster show high similarity to each other and low similarity to objects in other clusters, determined by a similarity measure [ 8 , 9 , 10 , 11 ].

Selecting an appropriate similarity measure is crucial for clustering algorithms, and studies confirm its significance for algorithm performance [8, 9, 10, 11, 12]. Various studies assess similarity measures across disciplines such as web clustering [13], trajectory clustering [14], and chemical databases [15]. Some research has further examined performance variations based on the type of data being analyzed, differentiating between categorical data [16] and continuous data [8]. Consequently, some scholars [17] have advocated distinguishing two fundamental kinds of measures: "similarity measures" for qualitative data and "distance measures" for quantitative data. In this study, we refer to both types as "similarity measures."

Table 1 lists similarity functions for qualitative data, and Table  2 shows distance functions for quantitative data.

A key limitation of similarity measures (e.g., Euclidean distance [ 17 ] and Hamming similarity [ 18 ]) is their exclusive reliance on feature values. Consequently, when two objects or entities exhibit similar feature values, they are considered more similar, regardless of any other pertinent factors. This oversimplification may overlook crucial aspects of the data.

Secondly, similarity measures often assume feature independence, neglecting their interdependence and potential influence on each other's values. This oversight may result in incomplete representations of data relationships. Moreover, most measures overlook feature prioritization, disregarding the varying importance of features in determining similarity. These assumptions do not fully align with real-world complexities, potentially limiting applicability and accuracy. To address these challenges, researchers explore inversion as a promising approach, investigating its theoretical and practical aspects [ 19 , 20 , 21 , 22 ].

In one study [ 19 ], a new constructive bijection connects permutations with a specific number of inversions to those with a particular major index, facilitating exploration of mathematical connections. Another work [ 20 ] introduces a probability distribution on the group of permutations of the set of integers, providing insights into inversion's probabilistic aspects for permutation-based data analysis. Furthermore, [ 21 ] presents six bijections linking a specific type of polyominoes called deco polyominoes with permutations, establishing connections between classical statistics and permutation-related analyses. Moreover, [ 22 ] proposes an efficient solution for counting interior edge crossings in bipartite graphs, relevant for layered graph drawing and data visualization enhancement.

This study introduces " Inv ," an inversion-based similarity measure addressing previous challenges. It forms the basis of a new clustering approach grouping objects by inversion-based similarity. The primary goal is to optimize clustering by maximizing intra-cluster similarity and minimizing inter-cluster similarities. By incorporating this measure, we anticipate achieving more meaningful clustering results that better reflect underlying data patterns and relationships.

Data and method

The flowchart of the activities undertaken to assess the new inversion-based clustering approach is presented in Fig. 1.

Fig. 1: Flowchart of the implementation and evaluation of the inversion-based clustering approach

Formulation of challenges related to similarity criteria

Several key terms used in simulation examples are defined below.

Object: Any entity that we intend to cluster is an object . Each object \({obj}_{i}\) has n features and can be represented by a vector: \({obj}_{i}= \langle {f}_{i1}, {f}_{i2}, \dots , {f}_{in}\rangle\) .

Feature space: Each feature has a valid range of values known as its feature space , represented by \({S}_{f}\) for feature \(f\) .

Universe: The set of all objects we want to cluster is known as the universe , represented by \(U= \left\{{obj}_{1}, {obj}_{2},\dots ,{obj}_{k}\right\}.\)

Similarity Measure: A similarity measure is a measure that takes two objects as input and outputs a numeric value representing their similarity.

Clustering problem: A problem of partitioning k objects into m groups \(\langle {C}_{1}, {C}_{2}, \dots , {C}_{m}\rangle\) according to a specified similarity measure, so that the objects in each group are as similar as possible to each other and as different as possible from objects in other groups. These clusters adhere to two constraints [1–7]:

$${C}_{i}\cap {C}_{j}=\varnothing \;\;\text{for}\; i\ne j, \qquad {\bigcup }_{i=1}^{m}{C}_{i}=U.$$

In the following paragraphs, the challenges stated in the introduction section are formulated using the defined terms.

Symmetry challenge

A drawback of distance-based measures is called the Symmetry challenge , where adding or subtracting a value to a feature has the same distance effect. Given that Euclidean distance is widely used in clustering algorithms [ 23 ], it is used as a representative of distance-based measures in the following practices. Practice 1 demonstrates the weakness of similarity criteria when adding or subtracting a specific value to a feature.

Practice 1: Consider the following three objects:

$${obj}_{1}=\langle {f}_{11}, {f}_{12}, \dots , {f}_{1n}\rangle , \quad {obj}_{2}=\langle {f}_{11}-v, {f}_{12}, \dots , {f}_{1n}\rangle , \quad {obj}_{3}=\langle {f}_{11}+v, {f}_{12}, \dots , {f}_{1n}\rangle$$

The first feature of \({obj}_{2}\) is v units less than the first feature of \({obj}_{1}\), and the first feature of \({obj}_{3}\) is v units more than the first feature of \({obj}_{1}\); all other features are equal across the three objects. Using the Euclidean distance \(E\), the distance after adding \(v\) to the first feature is

$$E\left({obj}_{1},{obj}_{3}\right)=\sqrt{{\left({f}_{11}-\left({f}_{11}+v\right)\right)}^{2}}=\left|v\right|,$$

and the distance after subtracting \(v\) from the first feature is

$$E\left({obj}_{1},{obj}_{2}\right)=\sqrt{{\left({f}_{11}-\left({f}_{11}-v\right)\right)}^{2}}=\left|v\right|.$$

This challenge can arise for any feature \({f}_{a}\). In general,

$$E\left({obj}_{i},{obj}_{i}+v\cdot {e}_{a}\right)=E\left({obj}_{i},{obj}_{i}-v\cdot {e}_{a}\right),$$

where \({e}_{a}\) is the unit vector for feature \(a\).

The equality of the distances is mathematically correct, but in the real world, the significance of adding a specific value to a feature may be different from that of subtracting the same amount. An example based on student scores can be found as Supplementary Example 1 (see Additional file 1 ).
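The symmetry is easy to verify numerically. A minimal Python check (the feature values here are hypothetical, chosen only for illustration):

    import numpy as np

    obj1 = np.array([5.0, 3.0, 7.0])   # hypothetical object
    v = 2.0
    obj2 = obj1.copy(); obj2[0] -= v   # subtract v from the first feature
    obj3 = obj1.copy(); obj3[0] += v   # add v to the first feature
    print(np.linalg.norm(obj1 - obj2))  # 2.0
    print(np.linalg.norm(obj1 - obj3))  # 2.0 -- identical, the Symmetry challenge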

Place symmetry challenge

The Place Symmetry challenge is a generalized form of the previous challenge. In this challenge, increasing or decreasing a value can occur for each feature. In Practice 2, the weakness of the similarity criteria when adding or subtracting a certain value to possibly different features of objects is shown.

Practice 2: Consider the following three objects:

$${obj}_{1}=\langle \dots ,{V}_{i},\dots ,{V}_{j},\dots \rangle , \quad {obj}_{2}=\langle \dots ,{V}_{i}+v,\dots ,{V}_{j},\dots \rangle , \quad {obj}_{3}=\langle \dots ,{V}_{i},\dots ,{V}_{j}-v,\dots \rangle$$

If \({V}_{i}\) represents age and \({V}_{j}\) represents weight, both normalized to their respective feature spaces, adding \(v\) units to \({V}_{i}\) (as in \({obj}_{2}\)) and subtracting \(v\) units from \({V}_{j}\) (as in \({obj}_{3}\)) yield the same Euclidean distance from \({obj}_{1}\). However, the implications of increasing or decreasing \(v\) units in weight differ from those in age due to varying value distributions. Hence, altering their values by \(v\) holds distinct meanings for each feature.

Feature independence challenge

The Feature Independence challenge refers to the treatment of features as if they are independent of each other when calculating distance-based measures, whereas in practice they are often interdependent.

Definition of the concept of inversion

To address the challenges defined in section " Formulation of challenges related to similarity criteria ", a similarity measure named Inv , which is based on the concept of inversion, is proposed. The mathematical definition of inversion is as follows:

Inversion: In a sequence \(S= \langle {a}_{1}, {a}_{2}, \dots , {a}_{n}\rangle\) of pairwise comparable elements \({a}_{i}\) \((i=1,2,\dots ,n)\), a pair \(({a}_{i},{a}_{j})\) is called an inversion if \(i<j\) and \({a}_{i}>{a}_{j}\) [22].

The concept of inversion is illustrated by Practice 3.

Practice 3: Suppose we have five movies and we ask two people to rate them on a scale of 1 to 10 according to their preferences. In this example, an object is a person and it is represented by a vector with 5 features [ 24 ].

Scores are given in Table  3 .

The definition of inversion for only one person is presented in Supplementary Example 1 (see Additional file 1 ). Let's now explore a more indirect application of the definition to determine the inversions between the scores of the two people. We arrange the movies for each person in a rank sequence based on preferences, with the highest-scoring movie first. Ties are resolved by ranking the movie with the lower movie number first. Person 1's sequence is \(\langle {Movie}_{1}, {Movie}_{5},{Movie}_{2},{Movie}_{3},{Movie}_{4}\rangle\) , while Person 2's is \(\langle {Movie}_{1}, {Movie}_{3},{Movie}_{4},{Movie}_{2},{Movie}_{5}\rangle\) . Renaming the movies based on Person 1's rank sequence, we get 〈1,2,3,4,5〉 for Person 1 and 〈1,4,5,3,2〉 for Person 2. Applying the inversion definition to the second sequence, we find 4 is inverted compared to 3 and 2 (two inversions), 5 is inverted compared to 3 and 2 (two inversions), and 3 is inverted compared to 2 (one inversion), totaling five inversions.

A visual method for counting inversions involves organizing movies for each person by their scores, highest to lowest, and connecting corresponding movies. The intersections of these lines indicate the number of inversions between the two sequences.

In Fig. 2, the five inversions are indicated by the five points where the lines intersect. The more different the sequences, the greater the number of inversions. The minimum and maximum numbers of inversions between two vectors with n features are 0 and \(\frac{n\left(n-1\right)}{2}\), respectively [25]. Supplementary Code 1 contains the inversion calculation algorithm (see Additional file 1).

Fig. 2: Each point of intersection between lines represents an inversion (Practice 3)
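For concreteness, here is a direct O(n²) transcription of the inversion definition in Python (the paper's own algorithm appears in Supplementary Code 1; this sketch is ours), applied to the renamed sequence from Practice 3:

    def count_inversions(seq):
        """Count pairs (i, j) with i < j and seq[i] > seq[j].
        O(n^2); a merge-sort based variant achieves O(n log n)."""
        return sum(1 for i in range(len(seq))
                     for j in range(i + 1, len(seq))
                     if seq[i] > seq[j])

    print(count_inversions([1, 4, 5, 3, 2]))  # 5, matching Practice 3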

Notice from Practice 3 that the way ties are broken affects the number of inversions. The proposed way of resolving ties is to consider a default sequence for features, and when the values of two or more features are the same, their order will be based on the default order. For instance, in the movies database, priority is given to movies with smaller numbers. Thus, as Person 2 rated Movie 1 and Movie 3 equally at 9, Movie 1 takes precedence in the rank order.

Practice 3 illustrates the importance of feature prioritization in cases where values are equal. Depending on specific requirements, certain features may need to be prioritized over others, which can have a considerable impact on the resulting number of inversions.
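Continuing the sketch above, the rank sequences and the tie-breaking rule can be made explicit as follows. The score vectors are hypothetical values chosen only to reproduce the rank orders stated in Practice 3 (the text confirms only that Person 2 rated Movies 1 and 3 equally at 9); count_inversions is the function from the previous sketch:

    def rank_sequence(scores):
        """Feature indices ordered from highest to lowest score;
        ties broken by the default order (lower index first)."""
        return sorted(range(len(scores)), key=lambda f: (-scores[f], f))

    person1 = [10, 7, 6, 5, 8]  # hypothetical: yields movies 1, 5, 2, 3, 4
    person2 = [9, 6, 9, 8, 4]   # hypothetical: yields movies 1, 3, 4, 2, 5
    rank = {f: r + 1 for r, f in enumerate(rank_sequence(person1))}
    renamed = [rank[f] for f in rank_sequence(person2)]
    print(renamed)                    # [1, 4, 5, 3, 2]
    print(count_inversions(renamed))  # 5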

Introduction of the Inv similarity measure based on the notion of inversion

In this subsection, the Inv inversion-based similarity measure is introduced. The Inv measure of the similarity of two objects is defined as the number of inversions that exist between two objects according to the Count_Inversions function. Formally, the measure is defined as follows:

$$Inv\left({obj}_{i},{obj}_{j}\right)=Count\_Inversions\left({obj}_{i},{obj}_{j}\right)$$

According to the Inv measure, as the inversion count between two sequences rises, their similarity decreases, and vice versa. Although we refer to inversion as a similarity measure, it actually yields a measure of dissimilarity, because the greater the similarity between two objects, the lower the number of inversions.

Three advantages of the inversion-based similarity measure are as follows [ 22 ]: the number of inversions is affected by the rank positions of the features in the vector (Feature Independence challenge); the number of inversions when v is added to feature \({f}_{i}\) can be different from the number of inversions when it is subtracted from \({f}_{i}\) (Symmetry challenge); and the number of inversions when v is added to \({f}_{i}\) may not be the same as the number of inversions when v is added to \({f}_{j}\) (Place Symmetry challenge).

The Symmetry challenge in distance-based measures results from the equal impact of adding or subtracting a value from a feature on the distance. In calculating inversions, not only is the value of a particular feature considered, but also the values of other features become integral. The feature's rank position within the object's vector significantly affects the inversion count. Altering a feature's rank position by adding a value may yield different outcomes compared to subtracting the same value, influenced by other features. Consequently, the inversion count varies depending on specific features, emphasizing the nuanced nature of inversion calculations.

The Place Symmetry challenge expands upon the Symmetry challenge by allowing the increase or decrease of a value for each feature individually. As previously noted, the count of inversions is influenced by both the value of a specific feature and other features, so it addresses this challenge as well as the previous challenge.

Regarding the Feature Independence challenge, determining the number of inversions requires the values and relative order of all features, so the features are not treated as independent of one another. How the proposed measure addresses each of the outlined challenges is illustrated in Supplementary practice 1 (see Additional file 1).

The proposed similarity measure offers an additional advantage by allowing adjustments to consider feature importance in distance calculation. This enables the consideration of feature relevance or priority, enhancing the measure's utility. To accommodate priorities, the inversion measure can be modified to incorporate both feature value and assigned priority. By default, all variables are assumed to have equal priority. Practice 4 demonstrates a method for factoring feature priority into similarity calculation.

Practice 4 : Consider two objects, each with three features, depicted by colored circles in Fig.  3 . Let the orange feature have priority 1 (lowest), the red feature priority 2, and the blue feature priority 3 (highest). An adjusted inversion function is employed, where each inversion detected is weighted by the product of the relevant features' priorities.

Fig. 3: Prioritizing features. A: all features have the same priority, namely one. B: the priority of each feature is displayed within the corresponding circle (Practice 4)

According to Fig. 3A, there are 2 inversions because there are 2 points where the lines intersect (OB and RB). Since the priority of every feature is one, the adjusted inversion count is calculated as:

$$adjusted\_inv=\left(1\times 1\right)+\left(1\times 1\right)=2$$

To calculate the adjusted inversion for Fig. 3B, the products of the priorities of the two features involved in each inversion (each represented by a crossing) are considered:

$$adjusted\_inv=\left(1\times 3\right)+\left(2\times 3\right)=9,$$

where the first term corresponds to the orange-blue crossing and the second to the red-blue crossing.
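A short sketch of this adjusted count follows. The two feature orderings are assumptions chosen to produce exactly the OB and RB crossings described for Fig. 3, since the values behind the figure are not given:

    def adjusted_inversions(order_a, order_b, priority):
        """Each crossing contributes the product of the crossing features' priorities."""
        pos = {f: i for i, f in enumerate(order_a)}
        seq = [pos[f] for f in order_b]
        feat = {i: f for f, i in pos.items()}
        return sum(priority[feat[seq[i]]] * priority[feat[seq[j]]]
                   for i in range(len(seq))
                   for j in range(i + 1, len(seq))
                   if seq[i] > seq[j])

    a = ["orange", "red", "blue"]  # assumed ordering for the first object
    b = ["blue", "orange", "red"]  # assumed ordering for the second object
    print(adjusted_inversions(a, b, {"orange": 1, "red": 1, "blue": 1}))  # 2 (Fig. 3A)
    print(adjusted_inversions(a, b, {"orange": 1, "red": 2, "blue": 3}))  # 9 (Fig. 3B)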

Introduction of the clustering approach based on the Inv measure

In this section, we formulate an algorithmic framework for an inversion-based clustering approach as a type of partitioning clustering method, and then we instantiate the framework with different measures to specify two inversion-based clustering algorithms. The main steps of a partitioning clustering method are as follows. First, the number of desired clusters is chosen and initial centroids in the range of the feature values are randomly selected. Next, every input object is assigned to the nearest centroid, as determined by the similarity measure. Every centroid is then moved to the mean location of its assigned cluster, and this process is repeated until convergence is reached, as indicated by no further changes in assignments [25]. The pseudo-code of the proposed algorithmic framework can be found as Supplementary Code 2 (see Additional file 1).

Algorithmic framework for inversion-based clustering

The steps of the algorithmic framework are described below.

Input: \(U=\{{obj}_{1}, {obj}_{2}, \dots , {obj}_{k}\}\) , number of clusters ( \(m\) )

Output: \(m\) clusters \(\left({C}_{1}, {C}_{2}, \dots , {C}_{m}\right)\)

Step 1) Normalize each feature based on the Min-Max feature scaling method. The normalized feature value \({F}_{ia}\) for feature \({f}_{a}\) in object \({obj}_{i}\) is calculated as follows:

$${F}_{ia}=\frac{{f}_{ia}-\text{min}\left({f}_{a}\right)}{\text{max}\left({f}_{a}\right)-\text{min}\left({f}_{a}\right)},$$

where \({f}_{ia}\) is the original feature value for \({obj}_{i}\) , \({\text{min}}({f}_{a})\) is the minimum value and \({\text{max}}\left({f}_{a}\right)\) is the maximum value of the feature across all objects in universe U , and \({F}_{ia}\) is the normalized value. The feature space for every normalized feature is equal to [0, 1].

Step 2) Initialize the centroids \({c}_{1}, {c}_{2}, \dots , {c}_{m}\) of clusters \({C}_{1}, {C}_{2}, \dots , {C}_{m}\) randomly.

Step 3) Calculate the number of inversions between every combination of an object \({obj}_{i}\) and a centroid \({c}_{j}\) and assign every object to the cluster for the centroid with the fewest inversions.

Step 4) For each cluster \({C}_{j}\), calculate the new centroid \({c}_{j}\) as the mean of the objects assigned to the cluster:

$${c}_{j}=\frac{1}{{N}_{j}}\sum_{{obj}_{i}\in {C}_{j}}{obj}_{i},$$

where \({N}_{j}\) is the number of objects assigned to cluster \({C}_{j}\), and the sum is taken over all objects in this cluster.

Step 5) Repeat Steps 3 and 4 until the clusters do not change, or the number of iterations reaches a predetermined value.

For the third step of the algorithmic framework, which is to assign an object to a cluster using a measure, two measures are proposed, which lead to the Inversion-Based Clustering Algorithm (ICA) and the Regulator Inversion Euclidean distance based Clustering Algorithm (RIECA).

Approach 1) Inversion-Based Clustering Algorithm (ICA)

In the first round of ICA, the centroids are randomly initialized, then inversions between every combination of an object and a centroid are calculated, and finally every object is assigned to the centroid with the fewest inversions. In subsequent rounds, inversions between every combination of an object and an updated centroid are calculated. In other words, the measure function is defined as follows:

$$Measure\left({obj}_{i},{C}_{c}\right)=Inv\left({obj}_{i},{c}_{c}\right),$$

where \({obj}_{i}\) is the \({i}^{th}\) object, \({C}_{c}\) is \({c}^{th}\) cluster (which has centroid \({c}_{c}\) ), and \(Inv\left({obj}_{i},{c}_{c}\right)\) is the number of inversions between \({obj}_{i}\) and \({c}_{c}\) .

Approach 2) Regulator Inversion Euclidean distance based Clustering Algorithm (RIECA)

With RIECA, inversion is used as a regulator for the Euclidean distance [17], i.e., the Euclidean distance is multiplied by the number of inversions. If the number of inversions is high, the value of the measure grows more than if the number of inversions is low. In this approach, the Measure function is defined as follows:

$$Measure\left({Obj}_{i},{C}_{c}\right)=Inv\left({Obj}_{i},{c}_{c}\right)\times E\left({Obj}_{i},{c}_{c}\right),$$

where \({Obj}_{i}\) is the \({i}^{th}\) object, \({c}_{c}\) is the \({c}^{th}\) cluster, \(Inv\left({Obj}_{i},{c}_{c}\right)\) is the number of inversions between \({Obj}_{i}\) and \({c}_{c}\) , and \(E\left({Obj}_{i},{c}_{c}\right)\) is the Euclidean distance between \({Obj}_{i}\) and \({c}_{c}\) .
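To make the framework concrete, the following is a minimal Python sketch of Steps 1-5 with a pluggable measure, instantiated for both ICA and RIECA. The paper's implementation is in R (pseudo-code in Supplementary Code 2); the O(n²) inversion counter, the tie-breaking by feature index, and the exact convergence test here are our assumptions:

    import numpy as np

    def count_inversions(seq):
        """Pairs (i, j) with i < j and seq[i] > seq[j]."""
        return sum(1 for i in range(len(seq))
                     for j in range(i + 1, len(seq))
                     if seq[i] > seq[j])

    def inv(x, y):
        """Inv measure: inversions of y's feature ranking relative to x's,
        ties broken by lower feature index (the default order)."""
        order_x = sorted(range(len(x)), key=lambda f: (-x[f], f))
        rank = {f: r for r, f in enumerate(order_x)}
        order_y = sorted(range(len(y)), key=lambda f: (-y[f], f))
        return count_inversions([rank[f] for f in order_y])

    def ica_measure(x, c):
        return inv(x, c)

    def rieca_measure(x, c):
        # Inversions act as a regulator on the Euclidean distance.
        return inv(x, c) * float(np.linalg.norm(x - c))

    def cluster(objects, m, measure, max_iter=100, seed=None):
        rng = np.random.default_rng(seed)
        X = np.asarray(objects, dtype=float)
        # Step 1: min-max normalize each feature to [0, 1].
        span = X.max(axis=0) - X.min(axis=0)
        X = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
        # Step 2: random initial centroids in the normalized feature space.
        centroids = rng.random((m, X.shape[1]))
        assign = None
        for _ in range(max_iter):
            # Step 3: assign each object to the centroid with the smallest measure.
            new_assign = np.array([min(range(m), key=lambda c: measure(x, centroids[c]))
                                   for x in X])
            if assign is not None and np.array_equal(new_assign, assign):
                break  # Step 5: assignments stable, stop.
            assign = new_assign
            # Step 4: move each centroid to the mean of its assigned objects.
            for c in range(m):
                members = X[assign == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        return assign

    # Usage on hypothetical data: 200 objects, 6 features, 5 clusters requested.
    labels = cluster(np.random.rand(200, 6), m=5, measure=rieca_measure, seed=0)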

This section describes the generation of synthetic data and the evaluation of the proposed approach, which correspond to the final two steps of the flowchart given in section " Data and method ".

Generation of synthetic data

Synthetic datasets were generated using the MixSim package [26]. The average pairwise overlap parameter (denoted ῶ) in the package allows for the creation of clusters with varying degrees of complexity, ranging from well-separated (ῶ = 0.001) to highly overlapped (ῶ = 0.4) [27]. A summary of the package documentation can be found as Supplementary Documentation 1 (see Additional file 1).

To evaluate the performance of the ICA and RIECA algorithms, we generated two types of random data samples for each value of ῶ (0.4, 0.3, 0.2, 0.05, and 0.001). The first type of data sample (referred to as Simulated Dataset 1) comprised 200 random data values with 6 features, forming 5 clusters. The second type (referred to as Simulated Dataset 2) included 500 random data values with 7 features, forming 10 clusters. These datasets enabled a comprehensive evaluation of algorithm performance across diverse clustering complexities.
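MixSim is an R package. For readers working in Python, a loose stand-in for producing clustered samples of the same shape is sketched below; note that cluster_std only roughly controls overlap and is not equivalent to MixSim's average pairwise overlap parameter ῶ:

    from sklearn.datasets import make_blobs

    # 200 samples, 6 features, 5 clusters -- the shape of Simulated Dataset 1.
    # cluster_std loosely controls overlap; it is NOT MixSim's ῶ parameter.
    X, y = make_blobs(n_samples=200, n_features=6, centers=5,
                      cluster_std=2.0, random_state=0)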

Evaluation of the proposed approach

The proposed ICA and RIECA algorithms are compared with EM clustering [28] from the mclust package [29], k-means [3] and hierarchical clustering [30] from the stats package, and k-medoids clustering [31] from the cluster package [32]. For the comparison, 1000 data samples of Simulated Dataset 1 and 1000 of Simulated Dataset 2 were prepared, and each algorithm was run on each sample separately. The ICA and RIECA algorithms are implemented in R version 3.4.2 [33]. This study employs the adjusted Rand index implemented in the MixSim package to evaluate the algorithms [26, 34, 35, 36]. The values obtained for the adjusted Rand index for the algorithms when applied to Simulated Dataset 1 are shown in Table 4.
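For reference, the adjusted Rand index compares a predicted partition against ground-truth labels, where 1 indicates identical partitions and values near 0 indicate chance-level agreement. The paper uses the MixSim (R) implementation; a hedged Python equivalent, with hypothetical label vectors, is:

    from sklearn.metrics import adjusted_rand_score

    truth = [0, 0, 0, 1, 1, 2]  # hypothetical ground-truth cluster labels
    pred = [1, 1, 0, 0, 0, 2]   # hypothetical labels produced by an algorithm
    print(adjusted_rand_score(truth, pred))  # 1.0 only for identical partitions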

Table 4 shows that for complex cluster structures (ῶ = 0.4, 0.3, and 0.2), the RIECA algorithm outperforms the other algorithms in cluster formation. However, for moderately or highly separated clusters (ῶ = 0.05, 0.001), both the ICA and RIECA algorithms perform poorly compared to the other algorithms. This was anticipated, as the distance between objects outweighs the influence of inversions in such scenarios.

To evaluate the statistical significance of the adjusted Rand index differences among algorithms, we conducted a one-way ANOVA test. Further details are available in Supplementary statistical test result 1 (see Additional file 1 ).

Tables 5 and 6 identify the optimal algorithm for each ῶ value, offering insights into algorithm performance across diverse clustering scenarios. These results elucidate the efficacy of the ICA and RIECA algorithms amidst varying clustering complexities and separations.

In highly or moderately separated clusters, distance becomes more influential, diminishing the role of the number of inversions. Consequently, the proposed algorithms are less effective in such scenarios than other types of clustering. RIECA excelled in complex structures, while no algorithm consistently outperformed the others in highly or moderately separated clusters. When ῶ = 0.05, k-means was most effective in both simulations. Conversely, with ῶ = 0.001 and a small number of clusters and samples (Simulated Dataset 1), the k-means, EM, k-medoids, and hierarchical algorithms were all highly effective. However, with a large number of clusters and samples (Simulated Dataset 2), the EM algorithm proved most effective. These findings highlight algorithm strengths under diverse clustering conditions; the selection of the optimal algorithm depends on the specific characteristics and nature of the data.

Clustering techniques encompass partitioning, hierarchical, density-based, grid-based, and model-based methods [25]. The proposed algorithms fall under partitioning clustering. ICA is similar to k-means but utilizes an inversion-based similarity measure instead of a distance measure, whereas RIECA utilizes inversion as a form of regulation for the Euclidean distance. In RIECA, the Euclidean distance is multiplied by the number of inversions; as a result, when there are more inversions, the measure value increases more prominently than in scenarios with fewer inversions. This enriches the algorithm's ability to handle intricate clusters and enhances the insights available from data analysis.

One strength of this study is that it identified three major challenges for Euclidean distance measures: the Symmetry challenge, the Place Symmetry challenge, and the Feature Independence challenge. To address these challenges, we introduced the inversion-based measure "Inv". Our findings show the significance of both Euclidean distance and inversions for similarity measures. In particular, RIECA, which multiplies inversions by Euclidean distance, outperforms ICA, which relies solely on inversions. The effectiveness of RIECA, which employs a hybrid measure combining inversions and Euclidean distance, suggests potential for developing other hybrid measures considering both factors. Such measures could prioritize inversions for complex clusters and emphasize Euclidean distance for well-separated clusters, aligning with k-means and other clustering methods.

Another strength of this study is that it benefits from applying statistical testing techniques like one-way ANOVA to rigorously assess the effectiveness of the proposed algorithms in comparison to existing algorithms. Additionally, simulating diverse data samples covering various clustering complexities enhances the findings' robustness and generalizability.

One strength of the proposed inversion-based algorithm is its ability to prioritize features differently. This prioritization operates at two levels. Firstly, during inversion calculation, equal feature values are sorted based on predetermined priority, influencing the number of inversions. Secondly, priority can be introduced as a coefficient in inversion calculations, where each intersection reflects the multiplication of feature priority values. As a result, the distance between two objects will increase with higher priority values, allowing more flexible and nuanced clustering.

For evaluation, the proposed algorithms' performance was compared with classical clustering methods: k-means [ 3 ], EM clustering [ 28 ], hierarchical clustering [ 30 ], and k-medoids [ 31 ]. Results indicate RIECA outperforms other algorithms with complex cluster structures [ 37 , 38 ]. However, in scenarios of moderate or high cluster separation, the proposed algorithms are less effective, consistent with Berikov [ 39 ] and Cupertino [ 40 ].

A limitation of the inversion-based clustering approach is the sensitivity of its effectiveness to the initial centroid selection, a common challenge in partition-based algorithms [25]. Mitigation strategies, such as repeated runs and the k-means++ initialization method [41], can be adapted to address this issue. In this study, we adopted the solution of running the algorithm several times with random initial centroids [25].
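For comparison, mainstream k-means implementations expose both mitigations directly. A brief scikit-learn illustration (the parameter values are arbitrary):

    from sklearn.cluster import KMeans

    # k-means++ seeding plus 10 random restarts; the best run (lowest inertia) is kept.
    km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)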

Another limitation is the lack of evaluation on real datasets. Assessing an algorithm's performance on actual data is challenging due to unknown cluster complexity. To address this, one could compare synthetic datasets with known complexity to real databases to gain an understanding of the complexity, then evaluate the algorithms under conditions similar to practical tasks. This approach offers insights into real-world performance, establishing their relevance and reliability for diverse data analysis scenarios.

This paper emphasizes the significant impact of similarity measures on clustering algorithm performance. Existing measures, often reliant on feature values alone and assuming feature independence, may not yield optimal results in practice. To address this, we introduced the innovative inversion-based Inv measure, which considers the values of other objects and features through inversion. We proposed two algorithms (ICA and RIECA) based on the Inv measure and evaluated their performance using simulated data. Results showed that inversion-based clustering outperformed traditional techniques for complex cluster structures.

Future studies can explore practical applications of the Inv measure in real-world problems to improve clustering performance across domains. Further research could also investigate other hybrid measures combining inversions with distance-based measures.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the first author on reasonable request.

Jain AK, Narasimha Murty M, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.


Lingras P, Huang X. Statistical, evolutionary, and neurocomputing clustering techniques: cluster-based vs object-based approaches. Artif Intell Rev. 2005;23:3–29.

MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. 1967;1(14).

Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.


Rodriguez SIR, de Carvalho FAT. Fuzzy clustering algorithms with distance metric learning and entropy regularization. Appl Soft Comput. 2021;113: 107922.

Jothi R, Mohanty SK, Ojha A. Gene expression clustering using local neighborhood-based similarity measures. Comput Electr Eng. 2021;91: 107032.

Nozad SAN, Haeri MA, Folino G. SDCOR: Scalable density-based clustering for local outlier detection in massive-scale datasets. Knowl Based Syst. 2021;228: 107256.

Shirkhorshidi AS, Saeed A, Wah TY. A comparison study on similarity and dissimilarity measures in clustering continuous data. PLoS ONE. 2015;10(12): e0144059.


Boriah S, Chandola V, Kumar V. Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM international conference on data mining. Society for Industrial and Applied Mathematics; 2008.

Yan Q, et al. A discriminated similarity matrix construction based on sparse subspace clustering algorithm for hyperspectral imagery. Cogn Syst Res. 2019;53:98–110.

Renjith S, Sreekumar A, Jathavedan M. Performance evaluation of clustering algorithms for varying cardinality and dimensionality of data sets. Mater Today Proc. 2020;27:627–33.

Shrifan NHMM, Akbar MF, Isa NAM. An adaptive outlier removal aided k-means clustering algorithm. J King Saud Univ Comput Inf Sci. 2022;34(8):6365–76.


Strehl A, Ghosh J, Mooney R. Impact of similarity measures on web-page clustering. Workshop on artificial intelligence for web search (AAAI 2000). 2000;58.

Zhang Z, Huang K, Tan T. Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. In: 18th International Conference on Pattern Recognition (ICPR'06). IEEE; 2006;3.

Khalifa AA, Haranczyk M, Holliday J. Comparison of nonbinary similarity coefficients for similarity searching, clustering and compound selection. J Chem Inf Model. 2009;49(5):1193–201.

Lourenço F, Lobo V, Bacao F. Binary-based similarity measures for categorical data and their application in Self-Organizing Maps. 2004:1–18.

Xu D, Tian Y. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2:165–93.

Taheri R, et al. Similarity-based Android malware detection using Hamming distance of static binary features. Future Gener Comput Syst. 2020;105:230–47.

Vajnovszki V. A new Euler-Mahonian constructive bijection. Discret Appl Math. 2011;159(14):1453–9.

Gnedin A, Olshanski G. The two-sided infinite extension of the Mallows model for random permutations. Adv Appl Math. 2012;48(5):615–39.

Deutsch E, Pergola E, Pinzani R. Six bijections between deco polyominoes and permutations. arXiv preprint arXiv:0810.2876; 2008.

Barth W, Jünger M, Mutzel P. Simple and efficient bilayer cross counting. Graph Drawing: 10th International Symposium, GD 2002 Irvine, CA, USA, August 26–28, 2002 Revised Papers 10. Springer Berlin Heidelberg, 2002

Grabusts P. The choice of metrics for clustering algorithms. Environ Technol Resourc Proc Int Sci Pract Confer. 2011;2:20.

Jozan MM, Taghiyareh F, Faili H. An inversion-based genetic algorithm for grouping of students. In: Proc 7th Int Conf Virtual Learn. 2012;1(1).

Han J, Pei J, Tong H. Data mining: concepts and techniques. Burlington: Morgan Kaufmann; 2022.

Melnykov V, Chen W-C, Maitra R. MixSim: an R package for simulating data to study performance of clustering algorithms. J Stat Softw. 2012;51:1–25.

Maitra R, Melnykov V. Simulating data to study performance of finite mixture modeling and clustering algorithms. J Comput Graph Stat. 2010;19(2):354–76.

Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc: Ser B (Methodol). 1977;39(1):1–22.

Fraley C, Raftery AE. MCLUST version 3 for R: normal mixture modeling and model-based clustering. Technical report 504; 2006.

Ding C, He X. Cluster merging and splitting in hierarchical clustering algorithms. In: Proceedings of the 2002 IEEE International Conference on Data Mining. IEEE; 2002.

Kaufman L, Rousseeuw PJ. Partitioning around medoids (Program PAM). Wiley Series in Probability and Statistics, Hoboken, NJ, USA: John Wiley & Sons, Inc., 1990, pp. 68–125, https://doi.org/10.1002/9780470316801.ch2

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. Cluster: cluster analysis basics and extensions. R package version 1.14.3; 2012.

R Core Team. R: a language and environment for statistical computing; 2013.

Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. J Intell Inf Syst. 2001;17:107–45.

Agustín-Blas LE, et al. A new grouping genetic algorithm for clustering problems. Expert Syst Appl. 2012;39(10):9695–703.

Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66(336):846–50.

Thong PH. Picture fuzzy clustering for complex data. Eng Appl Artif Intell. 2016;56:121–30.

Khan L, Luo F. Hierarchical clustering for complex data. Int J Artif Intell Tools. 2005;14(05):791–809.

Berikov V. Weighted ensemble of algorithms for complex data clustering. Pattern Recogn Lett. 2014;38:99–106.

Cupertino TH, Huertas J, Zhao L. Data clustering using controlled consensus in complex networks. Neurocomputing. 2013;118:132–40.

Hämäläinen J, Kärkkäinen T, Rossi T. Improving scalable K-means++. Algorithms. 2020;14(1):6.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran

Mohammad Mahdi Barati Jozan & Hamed Tabesh

Department of Pathology, Microbiology, and Immunology, School of Veterinary Medicine, University of California, Davis, USA

Aynaz Lotfata

Department of Computer Science, University of Regina, Regina, SK, Canada

Howard J. Hamilton


Contributions

M.M.B.J.: Conceptualization, methodology, formal analysis, investigation, writing—original draft, draft preparation. A.L.: Methodology, validation, formal analysis, review, supervision. H.J.H.: Methodology, validation, formal analysis, review, supervision. H.T.: Conceptualization, methodology, validation, formal analysis, review, supervision.

Corresponding author

Correspondence to Hamed Tabesh .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors submitting this manuscript have no conflict of interest to declare.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information


Additional file 1. Supplementary Figures, Tables and Codes. This file contains supplementary figures, tables and codes that provide further insights into the experimental setup and results discussed in the article.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Barati Jozan, M.M., Lotfata, A., Hamilton, H.J. et al. An inversion-based clustering approach for complex clusters. BMC Res Notes 17 , 133 (2024). https://doi.org/10.1186/s13104-024-06791-y


Received : 08 August 2023

Accepted : 30 April 2024

Published : 12 May 2024

DOI : https://doi.org/10.1186/s13104-024-06791-y


Keywords

  • Clustering algorithm
  • Inversion-based similarity measure
  • Overlapping clusters
  • Adjusted Rand index



Customer Segmentation Using RFM Model and a New Approach to Encourage Loyal Customers

  • Conference paper
  • First Online: 11 May 2024


  • A. S. Uthara 12 ,
  • S. Sajikumar 13 &
  • R. C. Jisha 12  

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 941))

Included in the following conference series:

  • International Conference on Information and Communication Technology for Competitive Strategies

This paper focuses on customer segmentation using the recency, frequency, and monetary (RFM) model and implements strategies to reward and notify loyal customers through email notifications. We analyze customer data using the RFM model, which evaluates customer behavior based on three factors: recency (how recently a customer made a purchase), frequency (how often they make purchases), and monetary value (the amount spent). By segmenting customers based on these metrics, we can identify loyal customers who have made recent, frequent, and high-value purchases. Once loyal customers are identified, the proposed work implements a reward system to encourage their continued loyalty. Rewards include granting reward points to loyal customers based on their position in the loyalty ranking. To notify loyal customers about their rewards and keep them informed, the paper integrates an email notification system. Personalized emails are sent to loyal customers, notifying them of their rewards and expressing gratitude for their loyalty. By utilizing the RFM model for customer segmentation, businesses can identify and target their most loyal customers effectively. The paper provides a framework to leverage customer data, implement reward strategies, and utilize email notifications to optimize customer loyalty management.
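As an illustration of the RFM computation described in the abstract, here is a minimal pandas sketch. The transactions table, its column names, and the rank-based scoring are our assumptions, not taken from the paper:

    import pandas as pd

    # Hypothetical transactions; customer_id, order_date, and amount are assumed names.
    tx = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 3],
        "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2023-11-20",
                                      "2024-02-14", "2024-03-10"]),
        "amount": [120.0, 80.0, 40.0, 55.0, 300.0],
    })

    now = tx["order_date"].max() + pd.Timedelta(days=1)
    rfm = tx.groupby("customer_id").agg(
        recency=("order_date", lambda d: (now - d.max()).days),  # days since last purchase
        frequency=("order_date", "count"),                       # number of purchases
        monetary=("amount", "sum"),                              # total spend
    )
    # Rank-based score: recent (low recency), frequent, high-spending -> high score.
    rfm["score"] = (rfm["recency"].rank(ascending=False)
                    + rfm["frequency"].rank()
                    + rfm["monetary"].rank())
    loyal = rfm.sort_values("score", ascending=False).head(1)  # top customer(s)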



Author information

Authors and Affiliations

Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, Kerala, India

A. S. Uthara & R. C. Jisha

College of Engineering Trivandrum, Thiruvananthapuram, Kerala, India

S. Sajikumar


Corresponding author

Correspondence to A. S. Uthara .

Editor information

Editors and Affiliations

Jahangirnagar University, Dhaka, Bangladesh

M. Shamim Kaiser

School of Computer Science, Shaanxi Normal University, China

Juanying Xie

Jaipur Engineering College and Research, Jaipur, India

Vijay Singh Rathore


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Uthara, A.S., Sajikumar, S., Jisha, R.C. (2024). Customer Segmentation Using RFM Model and a New Approach to Encourage Loyal Customers. In: Kaiser, M.S., Xie, J., Rathore, V.S. (eds) Intelligent Strategies for ICT. ICTCS 2023. Lecture Notes in Networks and Systems, vol 941. Springer, Singapore. https://doi.org/10.1007/978-981-97-1260-1_10

Download citation

DOI : https://doi.org/10.1007/978-981-97-1260-1_10

Published : 11 May 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-97-1259-5

Online ISBN : 978-981-97-1260-1



