K-Means Cluster Analysis

Neil J.Salkind

doi:10.4135/9781412952644

Entry
Reader's guide
Entries A-Z
Subject index

Return to Entries

K-Means Cluster Analysis

Edited by:
Neil J. Salkind
In:Encyclopedia of Measurement and Statistics
Chapter DOI:https://doi.org/10.4135/9781412952644.n237
Subject:Anthropology, Business and Management, Criminology and Criminal Justice, Communication and Media Studies, Counseling and Psychotherapy, Economics, Education, Geography, Health, History, Marketing, Nursing, Political Science and International Relations, Psychology, Social Policy and Public Policy, Social Work, Sociology, Science, Technology, Computer Science, Engineering, Mathematics, Medicine

Request Permissions

Show page numbers Hide page numbers

Many different cluster analysis methods are available to the researcher for classifying observations and variables. Most of these methods fall into one of two categories: hierarchical and nonhierarchical. The nonhierarchical methods are occasionally referred to as iterative partitioning methods. Hierarchical methods usually begin the clustering process by forming a matrix of similarities between entries. Clusters are formed by putting together those entries that are the most similar. This method is an agglomerative process that yields the results in the form of hierarchical trees or denograms. The nonhierarchical methods of iterative partitioning usually begin with an arbitrary classification, and through an iterative revision process, they attempt to find a classification that minimizes the within-cluster variation or equivalently maximizes the between-cluster variation.

Each of the two categories of method has its appropriate domain of applications and corresponding advantages and disadvantages. The nonhierarchical method appears to be especially appropriate in the reduction of large databases, in the analyses of group similarities, in nonlinear predictions, and in estimation of multivariate distributions for high-dimensional metric data often found in the social sciences.

The purpose of this summary is to describe the algorithm for one particular method of nonhierarchical clustering. The method is known as k-means. Computer programs such as SPSS and SAS are set up to perform this type of clustering. The algorithm presented here is the one developed by James B. MacQueen. Also, we will discuss proposed enhancements to the algorithm that would provide the user of this method better information about the cluster solution.

The k-means algorithm was originally developed as a method of computing an optimal partitioning in the sense of within-class variance of an n-dimensional population on the basis of a sample. The k-means procedure is faster than most nonhierarchical methods, but it does not, in general, converge to a global optimum partition. There are, however, special cases for which convergence to an optimal partition can be achieved. There does not seem to be any feasible, general method that will always yield an optimum cluster solution except in one dimension. The k-means procedure appears to give clusters that are reasonably efficient in a within-class variance sense. The efficiency is dependent to some extent on the researchers' intuition[Page 504] in picking the “correct” number of clusters (k) corroborated by mathematical analysis and practical computational experience.

Features of the Clustering Algorithm

To sort m data units into k clusters, the following steps are used:

1. The first k data units are taken as the first k clusters with one point in each cluster.

2. The following (or remaining) m – k data units are assigned (one after another) to one of the k clusters on the basis of the shortest distance between the data unit and the centroid (mean) of the cluster. The distance can be computed using a number of formulas. The most popular is the Euclidean distance squared:

This formula is the squared Euclidean distance between the centroid of cluster k and the jth data unit in that cluster.

...

Sign in to access this content

Get a 30 day FREE TRIAL

Watch videos from a variety of sources bringing classroom topics to life
Read modern, diverse business cases
Explore hundreds of books and reference titles

No internet connection.

All search filters on the page have been cleared.

Your search has been saved.

Entry

Reader's guide

Entries A-Z

Subject index

K-Means Cluster Analysis

Features of the Clustering Algorithm

Sign in to access this content

Get a 30 day FREE TRIAL

Read next

More like this

Sage Recommends