Skip to main content icon/video/no-internet

Many different cluster analysis methods are available to the researcher for classifying observations and variables. Most of these methods fall into one of two categories: hierarchical and nonhierarchical. The nonhierarchical methods are occasionally referred to as iterative partitioning methods. Hierarchical methods usually begin the clustering process by forming a matrix of similarities between entries. Clusters are formed by putting together those entries that are the most similar. This method is an agglomerative process that yields the results in the form of hierarchical trees or denograms. The nonhierarchical methods of iterative partitioning usually begin with an arbitrary classification, and through an iterative revision process, they attempt to find a classification that minimizes the within-cluster variation or equivalently maximizes the between-cluster variation.

Each of the two categories of method has its appropriate domain of applications and corresponding advantages and disadvantages. The nonhierarchical method appears to be especially appropriate in the reduction of large databases, in the analyses of group similarities, in nonlinear predictions, and in estimation of multivariate distributions for high-dimensional metric data often found in the social sciences.

The purpose of this summary is to describe the algorithm for one particular method of nonhierarchical clustering. The method is known as k-means. Computer programs such as SPSS and SAS are set up to perform this type of clustering. The algorithm presented here is the one developed by James B. MacQueen. Also, we will discuss proposed enhancements to the algorithm that would provide the user of this method better information about the cluster solution.

The k-means algorithm was originally developed as a method of computing an optimal partitioning in the sense of within-class variance of an n-dimensional population on the basis of a sample. The k-means procedure is faster than most nonhierarchical methods, but it does not, in general, converge to a global optimum partition. There are, however, special cases for which convergence to an optimal partition can be achieved. There does not seem to be any feasible, general method that will always yield an optimum cluster solution except in one dimension. The k-means procedure appears to give clusters that are reasonably efficient in a within-class variance sense. The efficiency is dependent to some extent on the researchers' intuition in picking the “correct” number of clusters (k) corroborated by mathematical analysis and practical computational experience.

Features of the Clustering Algorithm

To sort m data units into k clusters, the following steps are used:

1. The first k data units are taken as the first k clusters with one point in each cluster.

2. The following (or remaining) mk data units are assigned (one after another) to one of the k clusters on the basis of the shortest distance between the data unit and the centroid (mean) of the cluster. The distance can be computed using a number of formulas. The most popular is the Euclidean distance squared:

None

This formula is the squared Euclidean distance between the centroid of cluster k and the jth data unit in that cluster.

...

  • Loading...
locked icon

Sign in to access this content

Get a 30 day FREE TRIAL

  • Watch videos from a variety of sources bringing classroom topics to life
  • Read modern, diverse business cases
  • Explore hundreds of books and reference titles

Sage Recommends

We found other relevant content for you on other Sage platforms.

Loading