How-to Guide for Stata
Introduction

In this guide, you will learn how to produce estimates, standard errors, and confidence intervals for some population parameters using a sample that has been drawn using stratification and differing sampling rates between strata. The example focuses on means and proportions, but the principles demonstrated apply to a variety of estimators, including statistical modeling. The example assumes that you have already opened the data file in Stata.

Contents
  • 1 Stratification
  • 2 An Example in Stata: Time Spent Doing Some Common Tasks in Sierra Leone
    • 2.1 The Stata Commands
    • 2.2 Exploring the Stata Output
  • 3 Your Turn
1 Stratification

A stratified random sample is obtained by dividing the population units into a set of mutually exclusive and exhaustive groups (strata), then drawing a sample of units from each stratum. Stratified sampling is preferred over simple random sampling for one (or more) of four reasons:

  • Estimators are more efficient when stratification is used appropriately.
  • Equally efficient estimators for each stratum can be guaranteed.
  • Variance homogeneity need not be assumed between strata.
  • Cost per unit of information can be reduced using stratification.

Regardless of the motivation for using stratification, parameter estimates, standard errors, and hypothesis test results are obtained in the same way, and that way differs from methods that assume simple random sampling.
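As a rough illustration of how the design enters the calculations, the standard textbook estimator of a population mean under stratified random sampling can be written as below. The notation is introduced here for illustration and does not come from the dataset: H is the number of strata, N_h and n_h are the population and sample sizes in stratum h, N is the sum of the N_h, and \bar{y}_h and s_h^2 are the sample mean and sample variance in stratum h. Stata's svy: estimators use a linearization approach by default, so this is a sketch of the idea rather than a statement of the exact expressions Stata evaluates.

```latex
\bar{y}_{st} = \sum_{h=1}^{H} \frac{N_h}{N}\,\bar{y}_h,
\qquad
\widehat{\operatorname{Var}}\left(\bar{y}_{st}\right)
  = \sum_{h=1}^{H} \left(\frac{N_h}{N}\right)^{2}
    \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^{2}}{n_h}
```

Each stratum contributes its own mean and its own variance, which is why variance homogeneity between strata need not be assumed. The factor (1 - n_h/N_h) is the finite population correction; it drops out when no correction is applied.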

2 An Example in Stata: Time Spent Doing Some Common Tasks in Sierra Leone

This example presents mean and proportion estimation, with standard errors and confidence limits, using a dataset with a stratified sampling design. Statistics for three variables are obtained:

  • How long does it take someone on average to reach the nearest source of drinking water from your household? (c2)
  • How long does it take someone on average to reach the nearest market from your household? (c5)
  • How long does it take someone on average to reach the nearest motorable road from your household? (c8)

Each item uses a response scale with five options:

  • Less than 15 minutes
  • 15 to 30 minutes
  • 30 minutes to 1 hour
  • Between 1 and 2 hours
  • Over 2 hours

for which we obtain proportions.

For each item, we also created a numeric variable valued in minutes (c2n, c5n, c8n), assigning each response option, in the order listed above, one of the following values:

  • 7.5
  • 22.5
  • 45
  • 90
  • 150

from which we obtain means. The sampling design is stratified with unequal selection probabilities between strata, so the dataset includes a weight variable (inv_prob) and a stratum variable (localcouncil).
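The mapping from response options to minutes can be sketched as follows. This is a minimal illustration on toy data, assuming the five options are stored as integer codes 1 to 5 in the order listed; the actual coding of c2, c5, and c8 in the dataset may differ, and the dataset already contains c2n, c5n, and c8n.

```stata
* Toy data: hypothetical integer codes 1-5 for the five response options
clear
input c2 c5 c8
1 2 3
2 3 4
3 4 5
4 5 1
5 1 2
end

* Build minute-valued versions (c2n, c5n, c8n) from the category codes
foreach v of varlist c2 c5 c8 {
    generate double `v'n = .
    replace `v'n = 7.5  if `v' == 1
    replace `v'n = 22.5 if `v' == 2
    replace `v'n = 45   if `v' == 3
    replace `v'n = 90   if `v' == 4
    replace `v'n = 150  if `v' == 5
}

* Cross-tabulate to check that each code maps to one minute value
tabulate c2 c2n
```

The first four values are the midpoints of their intervals; 150 is the value the guide uses for the open-ended "Over 2 hours" option.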

2.1 The Stata Commands

In Stata, information about sampling design is made part of the dataset through the svyset command. The svyset command gives Stata the four pieces of information necessary to produce proper point estimates and standard errors from survey data:

  • The name of the sampling weight variable
  • The name of the variable (or variables) that indicate stratum membership
  • The name of the variable (or variables) that indicate sampling stages
  • The name of the variable (or variables) that indicate within-stratum or within-cluster population sizes

The syntax for the svyset command is:

svyset psuvar [pweight=wgtvar], strata(stratvar) fpc(fpcvar)

where psuvar is the primary sampling unit indicator variable, wgtvar is the sampling weight variable, stratvar is the stratum indicator variable, and fpcvar is the variable holding the population total within each stratum. psuvar is required; all others are optional. In this dataset, the primary and final sampling units are the same (household), and there is no finite population correction. The stratum indicator variable is localcouncil and the sampling weight variable is inv_prob. The svyset command is:

svyset _n [pweight=inv_prob], strata(localcouncil)

Because there is one line of data per household, _n, Stata’s built-in observation-number variable, serves as the psuvar.
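To see what svyset records, the following sketch declares the same design on a small toy dataset and then inspects it. The six rows and their weights are invented for illustration; only the variable names localcouncil, inv_prob, and c2n are taken from the guide.

```stata
* Toy data: two hypothetical strata with different sampling weights
clear
input localcouncil inv_prob c2n
1 10 7.5
1 10 22.5
1 10 7.5
2 50 45
2 50 90
2 50 45
end

* Declare the design: each row is its own PSU, weighted and stratified
svyset _n [pweight=inv_prob], strata(localcouncil)

* Typing svyset alone redisplays the current design settings
svyset

* Summarize the strata and the number of sampled units in each
svydescribe
```

The design settings are stored with the data, so saving the dataset after svyset means the declaration does not need to be repeated in later sessions.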

A wide variety of commands that use survey data are available in Stata. For this example, we use the mean and proportion commands. To run a command so that it uses the complex sampling design information, precede it with svy:. Otherwise, the syntax is the same as usual for the command. For example, to obtain means, we use the mean command:

svy: mean c2n c5n c8n

For proportions, we use the proportion command:

svy: proportion c2 c5 c8

A list of Stata commands that support the svy: prefix is available at https://www.stata.com/manuals13/svysvyestimation.pdf
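As a small, self-contained illustration of the svy: prefix, the sketch below runs svy: proportion on invented data. The rows, weights, and category codes are hypothetical; only the variable names follow the guide.

```stata
* Toy data: hypothetical category codes for c2 in two strata
clear
input localcouncil inv_prob c2
1 10 1
1 10 1
1 10 2
2 50 2
2 50 3
2 50 3
end

* Declare the stratified design with unequal weights
svyset _n [pweight=inv_prob], strata(localcouncil)

* Design-based proportions for each category of c2
svy: proportion c2

* The point estimates are stored in e(b)
matrix list e(b)
```

With these toy values the weighted proportions are 20/180, 60/180, and 100/180, because each household in the second stratum counts five times as much as a household in the first.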

2.2 Exploring the Stata Output

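When the svy: prefix is invoked, the output looks very similar to what users are accustomed to. For example, the output for the mean command without using svy: is:

[Stata output: mean, without the svy: prefix]

With the svy: prefix:

[Stata output: mean, with the svy: prefix]

For proportions, without the svy: prefix:

[Stata output: proportion, without the svy: prefix]

With the svy: prefix:

[Stata output: proportion, with the svy: prefix]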
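The difference between the two kinds of output can be reproduced on a toy dataset. The sketch below uses invented values, not the Sierra Leone data, and compares the mean command with and without the svy: prefix.

```stata
* Toy data: two hypothetical strata with different sampling weights
clear
input localcouncil inv_prob c2n
1 10 7.5
1 10 22.5
1 10 7.5
2 50 45
2 50 90
2 50 45
end

svyset _n [pweight=inv_prob], strata(localcouncil)

* Estimate that ignores the design (treats the data as a simple random sample)
mean c2n

* Design-based estimate using the weights and strata declared by svyset
svy: mean c2n

* Design effects (DEFF and DEFT) for the design-based estimate
estat effects
```

With these toy values the unweighted mean is 36.25 minutes, while the weighted mean is 9375/180, or about 52.08 minutes, because the second stratum carries larger weights. The layout of the two output tables is nearly the same; the svy: version adds a header describing the number of strata and PSUs.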

3 Your Turn

You can download this sample dataset along with a guide showing how to obtain mean and percent estimates using stratification information. The sample includes ordered, nominal, and quantitative variables, so a variety of statistical models may be tried. See whether you can reproduce the results presented here as well as obtain estimates using other Stata commands with which you are familiar.
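As one possible starting point for trying other commands, the sketch below applies the svy: prefix to a linear regression. The data are invented and the choice of model (c5n regressed on c8n) is only an assumption made for illustration; it is not an analysis suggested by the dataset's authors.

```stata
* Toy data: hypothetical minute values for two items in two strata
clear
input localcouncil inv_prob c5n c8n
1 10 22.5 7.5
1 10 45 22.5
1 10 22.5 22.5
1 10 90 45
2 50 90 45
2 50 150 90
2 50 45 22.5
2 50 150 150
end

svyset _n [pweight=inv_prob], strata(localcouncil)

* Design-based linear regression: the same syntax as regress, preceded by svy:
svy: regress c5n c8n
```

The same pattern applies to other estimation commands that support the svy: prefix: declare the design once with svyset, then precede the command with svy:.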