Cluster Analysis

Douglas Steinley

doi:10.4135/9781506326139

In: The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation
Chapter DOI: https://doi.org/10.4135/9781506326139.n116
Subject: Education

Generally, cluster analysis refers to the goal of identifying or discovering groups within the data, with the primary caveat that the groups are not known a priori. Before discussing methods for identifying clusters, it is helpful to consider the fundamental question: What is a cluster? For an N × P data matrix X, containing measurements on N observations across P variables, each observation can be thought of as a point in P-dimensional space. Clusters, then, are groups of points in P-dimensional space that are similar in some fashion. After elaborating on what constitutes a cluster, this entry lists and then examines the seven steps of cluster analysis. Those steps include determining which observations are to be clustered, which variables are to be used, and whether those variables should be standardized. Subsequent steps include selecting an appropriate measurement, choosing the clustering method, and then determining the number of clusters. The final step focuses on interpreting, testing, and replicating the results of the cluster analysis.

In an early, and still excellent, review of the field of cluster analysis presented to the Royal Statistical Society, Richard Melville Cormack advanced the notion that clusters have to be externally isolated and internally cohesive. Geometrically, internal cohesion indicates that the observations within a cluster are “clumped” together in the multivariate P-dimensional space, whereas external isolation indicates that the clusters are well separated from one another. Alternatively, clusters can be thought of as dense regions of the multivariate space separated by regions of “sparseness,” leading to a natural conceptualization of clusters as corresponding to multiple modes in the multivariate space. In attempting to capture this dual notion of isolation and cohesion, researchers have developed several different metrics of “clusteriness” and algorithms to uncover such structure.
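The two notions can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: it measures cohesion as the mean Euclidean distance between observations in the same cluster and isolation as the smallest distance between observations in different clusters. These two summaries, the function name `cohesion_and_isolation`, and the toy data are all hypothetical choices for this example, not measures proposed by Cormack.

```python
from itertools import combinations
from math import dist


def cohesion_and_isolation(points, labels):
    """Return (mean within-cluster distance, minimum between-cluster distance).

    Assumes at least two observations share a cluster and at least two
    clusters are present; otherwise one of the lists below would be empty.
    """
    within, between = [], []
    # Visit every unordered pair of observations exactly once.
    for (p, a), (q, b) in combinations(zip(points, labels), 2):
        (within if a == b else between).append(dist(p, q))
    return sum(within) / len(within), min(between)


# Toy data: N = 6 observations on P = 2 variables, forming two tight groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = [0, 0, 0, 1, 1, 1]

mean_within, min_between = cohesion_and_isolation(points, labels)
print(f"mean within-cluster distance: {mean_within:.3f}")
print(f"minimum between-cluster distance: {min_between:.3f}")
```

In this toy example the within-cluster distances are small relative to the gap between the groups, which is the pattern that cohesive and isolated clusters would be expected to show.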

The vast majority of methods, whether hierarchical or nonhierarchical, have the goal of obtaining a clustering solution in which the clusters are mutually exclusive and collectively exhaustive; that is, each observation is assigned to one and only one cluster, and all observations are assigned to at least one cluster. Initially, it seems that the best approach would be to evaluate all possible cluster solutions (i.e., look at all possible assignments of observations to clusters); however, the number of possible solutions (i.e., partitions) is enormous. Specifically, for N observations and K clusters, the number of possible ways to assign the N observations to the K clusters is given by the Stirling number of the second kind:

$S(N, K) = \frac{1}{K!} \sum_{i = 0}^{K} (-1)^{i} \binom{K}{i} (K - i)^{N},$

a quantity that can be approximated by $\frac{K^{N}}{K!}$, which increases rapidly with increases in both N and K. For instance, the number of possible partitions of 20 observations into five clusters is approximately 7.95 × 10^11; modestly increasing the sample size to 100 observations results in 6.27 × 10^81, a situation in which it is impossible to evaluate all possible partitions. As such, the goal has been to develop approaches that give good solutions without evaluating all possible partitions. Because not all possible solutions are evaluated, the ability to state definitively that the best solution (often referred to as the globally optimal solution) has been found is lost. That is, these approaches are heuristic in nature: a set of rules is established that defines how the approach identifies the resultant clusters, and because not all possible solutions are evaluated, it is impossible to know whether the final set of clusters is the best of all the S(N, K) partitions. Given the monumental task of selecting a candidate partition as “the best” of all the possibilities, it is necessary to have some guidelines about how to proceed.
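To see how quickly the count grows, the formula for S(N, K) and its approximation can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names `stirling2` and `approx_partitions` are hypothetical labels chosen for this illustration.

```python
from math import comb, factorial


def stirling2(n, k):
    """Stirling number of the second kind, S(n, k), from the alternating sum."""
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    # The alternating sum is always an exact multiple of k!.
    return total // factorial(k)


def approx_partitions(n, k):
    """The approximation K^N / K! to S(N, K)."""
    return k ** n / factorial(k)


exact_20_5 = stirling2(20, 5)
approx_20_5 = approx_partitions(20, 5)
print(f"S(20, 5) exact:  {exact_20_5:,}")
print(f"K^N / K! approx: {approx_20_5:.3e}")

# Growth in N for a fixed K = 5.
for n in (10, 20, 50, 100):
    print(n, f"{stirling2(n, 5):.3e}")
```

For 20 observations and five clusters the exact count is 749,206,090,500 (about 7.49 × 10^11), and the approximation gives about 7.95 × 10^11. Because Python integers have arbitrary precision, the exact sum can also be evaluated for much larger N, although the resulting counts quickly become far too large for exhaustive enumeration.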

...
