Cluster Sampling

Cluster sampling is a probability sampling technique in which all population elements are categorized into mutually exclusive and exhaustive groups called clusters. Clusters are selected for sampling, and all or some elements from selected clusters comprise the sample. This method is typically used when natural groups exist in the population (e.g., schools or counties) or when obtaining a list of all population elements is impossible or impractically costly. As compared to simple random sampling, cluster sampling can reduce travel cost for in-person data collection by using geographically concentrated clusters. At the same time, cluster sampling is generally less precise than simple random or stratified sampling; therefore, it is typically used when it is economically justified (i.e., when a dispersed population would be expensive to survey). This entry discusses selecting clusters with equal and unequal probability and provides a comparison of cluster sampling to other sampling methods.

Selecting Clusters With Equal Probability

Cluster sampling can be applied in one or more stages but, regardless of the number of stages, the first step is to select the clusters (primary sampling units) from which sample elements (secondary sampling units) will be drawn. A basic one-stage design takes a simple random sample of clusters and includes every element within those clusters in the sample. For example, a researcher could select schools and collect data about every student in the selected schools. This design is rarely used in practice: because elements within a cluster are often similar (a phenomenon called a cluster effect), it may be redundant and inefficient to sample a large proportion of the elements within a cluster.

Large-scale studies typically use a multistage cluster sampling method. A basic implementation is a two-stage cluster sample that selects clusters by simple random sampling and then independently subsamples elements within each selected cluster, using the same sampling fraction across clusters. The downside of this simple approach is that, when clusters differ in size, it results in differing sample sizes per cluster, making it less attractive than other designs. Designs with more than two stages may also be useful; a three-stage statewide survey, for example, could sample school districts, then schools within selected districts, then teachers within selected schools.
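The two-stage design with a common sampling fraction can be sketched as follows. This is a minimal illustration, not a production sampler: the function name `two_stage_sample`, the school names, the enrollments, and the 10% fraction are all hypothetical, and rounding the within-cluster sample size (with a floor of one element) is an assumption made only to keep the example simple.

```python
import random


def two_stage_sample(clusters, num_clusters, fraction, seed=0):
    """Draw a two-stage cluster sample with equal-probability stages.

    clusters: dict mapping a cluster id to the list of its elements.
    num_clusters: number of clusters drawn by simple random sampling.
    fraction: sampling fraction applied within every selected cluster.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of clusters (primary sampling units).
    chosen = rng.sample(sorted(clusters), num_clusters)
    sample = {}
    for cid in chosen:
        elements = clusters[cid]
        # Stage 2: the same sampling fraction in each selected cluster.
        # Rounding and the floor of 1 are simplifying assumptions.
        k = max(1, round(fraction * len(elements)))
        sample[cid] = rng.sample(elements, k)
    return sample


# Hypothetical population: six schools with different enrollments.
enrollments = [20, 40, 60, 80, 100, 200]
schools = {
    f"school_{i}": [f"student_{i}_{j}" for j in range(size)]
    for i, size in enumerate(enrollments)
}

sample = two_stage_sample(schools, num_clusters=3, fraction=0.1)
sizes = {cid: len(students) for cid, students in sample.items()}
print(sizes)  # per-cluster sample sizes differ because enrollments differ
```

Because the fraction is fixed, a school with 200 students contributes ten times as many sampled students as a school with 20, which is the unequal-workload drawback described above.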

In multistage sampling, the variance of the estimated quantities depends on within-cluster and between-cluster variance. Within-cluster variance is related to the intraclass correlation coefficient (ICC), which measures the degree of homogeneity of the variable of interest for elements within a cluster. ICC is typically interpreted as the correlation between the responses of individuals in the same cluster. Using schools as clusters and students’ test scores as an outcome, an ICC of 0.2 would mean that 20% of the variation in the student test scores is accounted for by the school a student attends, and 80% is accounted for by variation across students within schools.
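The interpretation of the ICC as a share of variance can be made concrete with a short calculation. The sketch below assumes the common variance-components definition, in which the ICC is the between-cluster variance divided by the sum of the between-cluster and within-cluster variances; the function name `icc_from_components` and the variance values are hypothetical, chosen to reproduce the 0.2 example.

```python
def icc_from_components(between_var, within_var):
    """ICC as the share of total variance that lies between clusters.

    Assumes the variance-components definition:
    ICC = between / (between + within).
    """
    total = between_var + within_var
    if between_var < 0 or within_var < 0 or total == 0:
        raise ValueError("variances must be non-negative and not both zero")
    return between_var / total


# Hypothetical test-score variances: 20 between schools, 80 within schools.
icc = icc_from_components(between_var=20.0, within_var=80.0)
print(f"ICC = {icc:.2f}")
print(f"share between schools = {icc:.0%}, within schools = {1 - icc:.0%}")
```

An ICC of 0 would indicate that schools do not differ on average, while an ICC of 1 would indicate that all students within a school have identical scores.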

Selecting Clusters With Unequal Probability

Clusters may also be selected with probability proportional to size. This means that clusters with a larger measure of size (e.g., the number of population elements) are more likely to be included in the sample than clusters with a smaller one. Such a sampling scheme would, for instance, be more likely to select a college dormitory where 100 students live than one where 20 live. Specifically, the probability of selection for cluster $c$ is $m \times N_c / N$, where $m$ clusters are selected from the population of clusters, $N_c$ is the measure of size (e.g., the number of secondary sampling units) in cluster $c$, and $N$ is the sum of the measure of size across all clusters (e.g., the number of elements in the population). If the cluster sample is stratified, then the numbers in this probability formula reflect those in a particular stratum. The second-stage sample could select an equal number of elements, $n$, from each cluster. This creates an equal workload in each cluster, which is preferable if data collection involves face-to-face communication, and in expectation results in a self-weighting sample. If the probability of selection within cluster $c$ is $n / N_c$, then the cumulative probability of selection for each secondary sampling unit reduces to $(m N_c / N) \times (n / N_c) = mn / N$, a constant.

The variance of an estimator of the population mean is a function of the number of clusters selected, the sample size within each cluster, and the ICC. Higher ICC values increase the variance for a given sample size, and increasing the ratio of the number of clusters selected to the sample size within a cluster reduces the overall variance and increases the precision of the final estimates.
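The selection probabilities described above can be checked numerically. The sketch below uses hypothetical dormitory sizes and assumes that no cluster is so large that m times its size exceeds N (otherwise the first-stage expression would exceed 1). It also includes the widely used design effect approximation, 1 + (n - 1) × ICC, for equal within-cluster sample sizes; that formula is not given in this entry and is added here only to illustrate how the ICC and the allocation of the sample affect variance.

```python
from fractions import Fraction


def pps_first_stage(sizes, m):
    """First-stage selection probability m * N_c / N for each cluster."""
    total = sum(sizes.values())
    probs = {c: Fraction(m * size, total) for c, size in sizes.items()}
    # The formula is a valid probability only when m * N_c <= N.
    if any(p > 1 for p in probs.values()):
        raise ValueError("a cluster is too large for this value of m")
    return probs


def overall_probability(sizes, m, n):
    """Cumulative probability (m * N_c / N) * (n / N_c) for each cluster."""
    first = pps_first_stage(sizes, m)
    return {c: first[c] * Fraction(n, size) for c, size in sizes.items()}


def design_effect(n, icc):
    """Approximate variance inflation relative to simple random sampling.

    Assumes equal within-cluster sample sizes n (a standard approximation,
    not taken from this entry).
    """
    return 1 + (n - 1) * icc


# Hypothetical dormitories and their numbers of residents.
dorms = {"A": 100, "B": 20, "C": 60, "D": 70, "E": 50}
m, n = 2, 10  # select 2 dormitories, then 10 students in each

first = pps_first_stage(dorms, m)
overall = overall_probability(dorms, m, n)
print(first["A"], first["B"])   # the larger dormitory is more likely chosen
print(set(overall.values()))    # a single value, m * n / N, for all clusters

# Same total sample of 200 elements, allocated two ways, with ICC = 0.2.
print(design_effect(n=10, icc=0.2))  # 20 clusters of 10
print(design_effect(n=5, icc=0.2))   # 40 clusters of 5: smaller inflation
```

Exact fractions are used so that the constant overall probability is visible without floating-point noise; the first-stage probabilities also sum to m, the expected number of selected clusters.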

...
