In this guide, you will learn how to produce estimates, standard errors, and confidence intervals for population parameters from a sample drawn in two sampling stages (clustering). The example focuses on means, but the same principles apply to a variety of estimators, including those used in statistical modeling. The example assumes that you have already opened the data file in Stata.
A clustered random sample is obtained by dividing the population units into a set of mutually exclusive and exhaustive groups (clusters), drawing a sample of clusters, and then sampling units of observation within each selected cluster. The clusters are called primary sampling units (PSUs), and the units of observation are called final sampling units (FSUs). Cluster sampling is typically preferred over simple random sampling because, for a fixed cost, we can obtain far more FSUs with cluster sampling than with a simple random sample. Regardless of the motivation for using cluster sampling, parameter estimates, standard errors, and hypothesis test results are all obtained in the same way, and that way differs from methods that assume simple random sampling.
This example presents mean estimation, with standard errors and confidence limits, using a dataset with a cluster sampling design. Statistics are obtained for 13 variables that record what respondents reported about their monthly expenditure. The items include common expenses such as food, utilities, loans, and rent. For each item, the respondent gave a value in Kenyan shillings (KSh). Cluster membership is indicated by the variable a8_1, and the response variables of interest are r17_1 through r17_12 and r17tot.
In Stata, information about the sampling design is made part of the dataset through the svyset command. The svyset command gives Stata the four pieces of information necessary to produce proper point estimates and standard errors from survey data: the primary sampling unit, the sampling weight, the stratum, and the finite population correction.
The syntax for the svyset command is:
svyset psuvar [pweight=wgtvar], strata(stratvar) fpc(fpcvar)
where psuvar is the primary sampling unit indicator variable, wgtvar is the sampling weight variable, stratvar is the stratum indicator variable, and fpcvar is the population total (within strata) variable. psuvar is required; all others are optional.
This dataset uses two stages of sampling with equal selection probability among all units, so there are no weights, no stratification, and no finite population correction. The svyset command therefore needs only the PSU variable, a8_1.
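Given the design described above, the declaration reduces to naming the cluster variable. The following is a minimal sketch, assuming the data file is already open and that a8_1 is the only design variable needed; the svydescribe step is an optional check added here and is not part of the original walkthrough.

```stata
* Declare the design: a8_1 identifies the cluster (PSU).
* No weights, strata, or FPC are given, matching the design described above.
svyset a8_1

* Optional check: summarize the number of PSUs and the observations per PSU.
svydescribe
```

The design settings remain attached to the dataset in memory, so later svy: commands pick them up without repeating the declaration.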
A wide variety of commands that use survey data are available in Stata. To make a command use the complex sampling design information, precede it with svy:. Otherwise, the syntax is as usual for the command. For this example, we use the mean command:
svy: mean r17_1
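The same pattern extends to the remaining expenditure variables. As a sketch, assuming the design has already been declared with svyset and that the variables are named r17_1 through r17_12 and r17tot as described earlier, a short loop requests each mean in turn:

```stata
* Assumes svyset has already been run for this dataset.
* Estimate the mean of each of the twelve expenditure items, one at a time.
forvalues i = 1/12 {
    svy: mean r17_`i'
}

* Estimate the mean of the total expenditure variable.
svy: mean r17tot
```

Estimating one variable at a time keeps each estimate based on all respondents who answered that item.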
A list of Stata commands that support the svy: prefix is available at: https://www.stata.com/manuals13/svysvyestimation.pdf
When the svy: prefix is invoked, the output looks very similar to what users are accustomed to. For example, the output for the mean command without using svy: is:
With the svy: prefix:
Note that the standard errors are substantially higher when the clustering information is used.
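The increase arises because units within the same cluster tend to resemble one another, so a clustered sample carries less independent information than a simple random sample of the same size. One way to quantify the difference is the design effect (DEFF), the ratio of the design-based variance to the variance that would be expected under simple random sampling. The sketch below assumes the design has been declared with svyset; the use of estat effects is a suggestion added here, not part of the original example.

```stata
* Naive estimate: treats the data as a simple random sample.
mean r17_1

* Design-based estimate: uses the clustering declared with svyset.
svy: mean r17_1

* Report design effects (DEFF and DEFT) for the most recent svy estimate.
estat effects
```

A DEFF above 1 indicates that clustering has inflated the variance relative to simple random sampling; DEFT is the corresponding ratio on the standard error scale.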
You can download this sample dataset along with a guide showing how to obtain mean estimates and standard errors using clustering information. The sample includes nominal and quantitative variables, so a variety of statistical models may be tried. See whether you can reproduce the results presented here as well as obtain estimates using other Stata commands with which you are familiar.