Selection refers to a situation in which data are not representative of the underlying population of interest. In particular, selection occurs when unobserved factors that determine whether an observation is in the data set also help determine the value of the quantity of interest. For example, a study of willingness to volunteer that included only people who volunteered to participate would say little about the population as a whole. An analysis of data that suffer from selection generally produces biased estimates and renders statistical inference problematic. Solutions to the problem commonly involve modeling the selection process and the outcome of interest at the same time to account for how selection influences the observed values.

Types of Missing Data

Selection generally refers to a particular type of missing data: nonrandom missingness. Also referred to as nonignorable (NI) missingness, it is one of three types of missing data; the other two are missing at random (MAR) and missing completely at random (MCAR). Data that are MCAR are missing in a way that is completely unrelated to any information in the data set: observations or values of variables are missing with equal probability. Data that are MAR have values that are missing in a way that is related to the observed values of other variables in the data set, so the probability that a value is missing can be predicted with observed variables alone. Data exhibit NI missingness when the value of the missing variable helps explain its missingness above and beyond what can be explained with the information contained in observed variables. By the same logic, the values that do appear in the data set helped determine their own inclusion.
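
The three mechanisms can be illustrated with a small simulation. The sketch below is not from the original entry: the data-generating process (Y = 1 + Z + u, with Z observed and u unobserved), the missingness probabilities, and the function name simulate_missingness are all assumptions chosen for illustration. It compares the mean of Y in the full data with the mean among observed cases, both overall and within the stratum Z > 0.

```python
import random
import statistics


def simulate_missingness(n=20000, seed=0):
    """Compare observed-case means of Y under MCAR, MAR, and NI missingness.

    Assumed (hypothetical) model: Y = 1 + Z + u, where Z is an observed
    covariate and u is an unobserved factor, both standard normal.
    """
    rng = random.Random(seed)
    full, mcar, mar, ni = [], [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)   # observed covariate
        u = rng.gauss(0, 1)   # unobserved factor
        y = 1.0 + z + u       # quantity of interest
        full.append((z, y))
        # MCAR: every value is observed with the same probability.
        if rng.random() < 0.5:
            mcar.append((z, y))
        # MAR: the probability of observation depends only on observed Z.
        if rng.random() < (0.8 if z > 0 else 0.2):
            mar.append((z, y))
        # NI: the probability of observation depends on unobserved u,
        # and therefore on Y itself beyond what Z explains.
        if rng.random() < (0.8 if u > 0 else 0.2):
            ni.append((z, y))

    def mean_y(rows, z_positive_only=False):
        return statistics.fmean(
            [y for z, y in rows if z > 0 or not z_positive_only]
        )

    datasets = {"full": full, "mcar": mcar, "mar": mar, "ni": ni}
    return {
        name: {
            "overall": mean_y(rows),
            "given_z_positive": mean_y(rows, z_positive_only=True),
        }
        for name, rows in datasets.items()
    }


results = simulate_missingness()
for name, means in results.items():
    print(name, round(means["overall"], 2), round(means["given_z_positive"], 2))
```

Under these assumptions, the MCAR means should track the full-data means, the MAR mean should be off overall but close once Z is held fixed, and the NI mean should remain off even within the Z > 0 stratum.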

In its common usage, selection refers to a specific type of NI missingness. A typical situation involves complete data for all covariates Z, but missing values of the quantity of interest Y. The values of Y might be missing for a variety of reasons. For example, schools might withhold tests from students who are expected to score relatively low, rendering observation of their scores impossible, or respondents might refuse to answer survey questions about sensitive activities. In general, nonrandom sample selection occurs when observation of Y for an individual (or other unit of analysis) depends on unobserved information that helps determine its value.
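
One way to write this situation down, offered here only as a sketch, is a linear outcome equation paired with a threshold rule for observation. Only Y and Z come from the text above; the linear form and the symbols S, beta, gamma, epsilon, and u are assumptions introduced for illustration.

```latex
\begin{align}
Y_i &= Z_i \beta + \varepsilon_i, \\
S_i &= \mathbf{1}\{ Z_i \gamma + u_i > 0 \},
  \qquad Y_i \text{ is observed only if } S_i = 1, \\
\mathbb{E}[\, Y_i \mid Z_i,\, S_i = 1 \,]
  &= Z_i \beta + \mathbb{E}[\, \varepsilon_i \mid Z_i,\, u_i > -Z_i \gamma \,].
\end{align}
```

In this notation, nonrandom sample selection corresponds to the last term being nonzero, which happens when the unobserved factor u that drives observation is related to the unobserved factor epsilon that drives Y.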

The best way to avoid nonrandom sample selection is careful study design and implementation. In survey contexts, allocating effort to additional follow-ups to avoid unit and item nonresponse can reduce or eliminate selection. Phrasing questions in ways that reduce sensitivity can help as well (perhaps by substituting scales for exact responses). In program evaluation, the focus should be on ensuring random assignment and high compliance rates. In real-world situations, however, these remedies might be impossible, often leaving only statistical approaches to correct for selection.

Consequences

The value of Y might depend on a combination of observed factors, which are measured in Z, and unobserved factors. The critical element that determines the consequences of selection, however, is whether observation of Y depends, at least in part, on unobserved factors. Selection that depends only on observed factors does not lead to unrepresentative values of Y given the observed covariates. If one is interested in the average for the entire population, however, one must still adjust for differences in the distribution of covariates between the selected sample and the population of interest.
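
The adjustment for selection on observed factors can be sketched as a reweighting of stratum means. The example below is illustrative only: the binary covariate, the selection probabilities, and the helper name poststratified_mean are assumptions, and post-stratification is one simple adjustment among several that could be used.

```python
import random
import statistics


def poststratified_mean(sample, population_shares):
    """Weight each stratum's sample mean of Y by its population share."""
    total = 0.0
    for z, share in population_shares.items():
        ys = [y for zi, y in sample if zi == z]
        total += share * statistics.fmean(ys)
    return total


rng = random.Random(1)

# Hypothetical population: binary observed covariate Z, with Y = 2*Z + noise.
population = []
for _ in range(20000):
    z = 1 if rng.random() < 0.5 else 0
    y = 2.0 * z + rng.gauss(0, 1)
    population.append((z, y))

# Selection depends only on the observed covariate Z, not on the noise in Y.
sample = [
    (z, y) for z, y in population
    if rng.random() < (0.9 if z == 1 else 0.3)
]

# Population shares of each stratum, assumed known (e.g., from a census).
shares = {
    z: sum(1 for zi, _ in population if zi == z) / len(population)
    for z in (0, 1)
}

true_mean = statistics.fmean([y for _, y in population])
naive_mean = statistics.fmean([y for _, y in sample])
adjusted_mean = poststratified_mean(sample, shares)

print(round(true_mean, 2), round(naive_mean, 2), round(adjusted_mean, 2))
```

Because units with Z = 1 are overrepresented in the selected sample, the naive mean overstates the population average here, while the reweighted mean should land close to it. The same reweighting would not help if selection also depended on the unobserved part of Y.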

...
