Subjective Segmentation and K Means Clustering

Segmentation is a structured and iterative process to group objects – customers, accounts or transactions – into similar segments.

Objective Segmentation requires labels or tags for the input objects. For example, in the case of credit card applications, each input application is labeled as good or bad. The segmentation techniques then help to differentiate good customers from bad customers based on certain variables or attributes.

Subjective Segmentation does not require labeling of the objects. Segmentation techniques such as K Means clustering iteratively group the objects into different segments based on input variables. Subjective segmentation is also called unsupervised segmentation.

Some of the important factors to be considered in a clustering or segmentation mechanism are:

    • Initial or optimal number of segments: What initial number of clusters should be used? What is the optimal number of clusters for the input dataset or objects?
    • Clustering variables or attributes: Which object attributes or variables will be used to compare objects?
    • Similarity measures: How will two objects be judged similar?
    • Clustering approach: Divisive or Agglomerative

In Divisive clustering, the full input dataset is regarded as a single cluster to start with. Then, based on the input list of attributes and the similarity measures, the objects are recursively split into multiple segments until the required conditions are met. In Agglomerative clustering, by contrast, each object is initially regarded as its own cluster, and these clusters are recursively merged based on the similarity measures until the required stopping criteria are met.
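
The agglomerative approach can be illustrated with a minimal Python sketch. It makes several assumptions that the text above does not fix: similarity is measured with Euclidean distance, the distance between two clusters is taken as the distance between their closest members (single linkage), and the stopping criterion is simply a target number of clusters. The function name agglomerate and the sample points are hypothetical.

```python
import math


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]

    def single_link(c1, c2):
        # Assumed linkage: distance between the closest members of two clusters.
        return min(math.dist(a, b) for a in c1 for b in c2)

    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i (j > i, so removing j keeps i valid).
        clusters[i].extend(clusters.pop(j))
    return clusters


# Hypothetical two-variable objects, for illustration only.
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
print(agglomerate(points, target_clusters=3))
```

A divisive procedure would run in the opposite direction, starting from one cluster that holds all five points and splitting it repeatedly.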

Subjective Segmentation using K Means Clustering

K Means clustering creates segments by iteratively assigning objects so that the objects within a segment are similar to each other but different from the objects in other segments.

The input variables used for clustering have to be interval or ordinal variables. There are multiple measures for computing similarity among objects, and one of the most commonly used is Euclidean Distance.

Euclidean Distance
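
For two objects x = (x_1, ..., x_p) and y = (y_1, ..., y_p) measured on p input variables, the standard definition of the Euclidean distance is given below. The symbols x, y, p and d are notation assumed here for illustration.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
```

For instance, with x = (1, 2) and y = (4, 6), the distance is sqrt(3^2 + 4^2) = 5.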

Algorithm Used in K Means Clustering

Step 1: Decide and input the initial number of clusters, K

Step 2: Initialize the cluster centers, or centroids, randomly (SAS uses an initialization mechanism based on the initial observations in the input dataset)

Step 3: Calculate the Euclidean Distance of each object from each cluster centroid

Step 4: Assign the object to the nearest cluster centroid (the one with the minimum Euclidean Distance)

Step 5: Recalculate the cluster centroids and the membership of the objects in the clusters

Step 6: Repeat the assignment of objects, recalculation of cluster centroids, and reassignment of objects until all objects are assigned to distinct clusters and the change in the cluster centroids is no longer significant
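
The six steps can be sketched in plain Python as follows. This is an illustrative sketch, not the SAS procedure: the names k_means, euclidean, init, max_iter, tol and seed are assumptions, "not significant" is interpreted as "no centroid moves by more than tol", and a cluster that loses all its members simply keeps its previous centroid.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, init=None, max_iter=100, tol=1e-6, seed=0):
    """Return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 2: use the supplied seeds, or pick k observations at random.
    centroids = [tuple(c) for c in (init or rng.sample(points, k))]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Steps 3-4: assign each object to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumed handling: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        # Step 6: stop once no centroid moves by more than tol.
        shift = max(euclidean(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Step 1: K = 2, applied to hypothetical two-variable objects.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Passing init=[points[0], points[3]] instead of relying on the random draw starts the algorithm from chosen seeds, which makes it easy to see how the starting centroids affect the result.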

Some of the practical considerations and assumptions in K Means clustering

  • The initial number of clusters can have a significant impact. We recommend that users try different values for the input number of clusters, or use hierarchical clustering first to understand the structure of the data.
  • K Means clusters can be influenced by the initial observations if the dataset is small.
  • Outliers can influence the clusters formed. One way to deal with this issue is to estimate a larger number of cluster centroids and provide these as input cluster seeds or initial centroids.
  • The scale of the input variables affects the Euclidean Distance and therefore the clusters, so it is recommended to standardize the input variables.
  • K Means clustering is applied when the input variables are interval or ordinal variables.
  • K Means clusters are typically elliptical in shape because of the Euclidean Distance similarity measure.
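
The point about scale can be made concrete with a short Python sketch. It assumes z-score standardization (subtract the column mean, divide by the column standard deviation), which is one common choice among several. The standardize helper and the customer figures are hypothetical.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column (stdev 0) is mapped to 0.0 to avoid dividing by zero.
        tuple((x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical customers: (annual income in dollars, number of cards held).
customers = [(30000.0, 1.0), (32000.0, 5.0), (90000.0, 1.0)]

# Distances from customer 0 on the raw scale: income dominates.
raw_01 = math.dist(customers[0], customers[1])
raw_02 = math.dist(customers[0], customers[2])

# The same distances after standardization.
scaled = standardize(customers)
std_01 = math.dist(scaled[0], scaled[1])
std_02 = math.dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(std_01, std_02)
```

On the raw scale the difference in cards held barely registers next to the difference in income. After standardization both variables contribute on a comparable footing.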

Building Subjective Segmentation using K Means Clustering on Transcript Data