Missing Data Methods

Virtually all epidemiologic studies suffer from some degree of missing or incomplete data. This means that some of the cases have data missing on some (but not all) of the variables. For example, one patient in a study may have values present for all variables except age. Another may have missing data on blood pressure and years of schooling. Missing data create problems for most statistical methods because those methods presume that every case has measured values on all variables in whatever model is being estimated. This entry surveys some of the many methods that have been developed to deal with these problems.

The most common method for handling missing data is complete case analysis (also known as listwise or casewise deletion). In this method, cases are deleted from the analysis if they have missing data on any of the variables under consideration, thereby using only complete cases. Because of its simplicity and because it can be used with any kind of statistical analysis, complete case analysis is the default in nearly all statistical packages. Unfortunately, complete case analysis can also result in the deletion of a very large fraction of the original data set, leading to wide confidence intervals and low power for hypothesis tests. It can also lead to biased estimates if data are not missing completely at random (to be defined below).
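To make the cost of listwise deletion concrete, here is a minimal sketch in plain Python. The five records, the variable names (age, sbp, school_years), and the helper complete_cases are hypothetical and chosen only for illustration; None stands for a missing value.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 54, "sbp": 132, "school_years": 12},
    {"age": None, "sbp": 141, "school_years": 16},
    {"age": 61, "sbp": None, "school_years": None},
    {"age": 47, "sbp": 118, "school_years": 14},
    {"age": 39, "sbp": 125, "school_years": None},
]


def complete_cases(rows, variables):
    """Keep only rows with an observed value on every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


variables = ["age", "sbp", "school_years"]
kept = complete_cases(records, variables)
fraction_kept = len(kept) / len(records)

# Per-variable missingness, for comparison with the overall loss of cases.
missing_rate = {
    v: sum(r[v] is None for r in records) / len(records) for v in variables
}

print(f"{len(kept)} of {len(records)} cases retained ({fraction_kept:.0%})")
print(missing_rate)
```

In this toy data set no single variable is missing for more than two of the five cases, yet only two cases survive deletion, because the missing values fall on different cases.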

To avoid these difficulties and to salvage at least some of the discarded information, many different methods have been developed. Most of these methods are crude and ad hoc, and some may only make things worse. Although more principled and effective methods have appeared in the past 20 years, they are still woefully underutilized.

Assumptions

Before examining various missing data methods, it is important to explain some key assumptions that are often used to justify the methods. The definitions given here are intended to be informal and heuristic.

Missing Completely at Random (MCAR)

Many missing data methods are valid only if the data are missing completely at random. Suppose that only one variable Y has missing data and that there are other variables (represented by the vector X) that are always observed. We say that data on Y are missing completely at random if the probability of missing data on Y is completely unrelated either to X or to Y itself. Symbolically, this is expressed as

Pr(Y missing|X, Y) = Pr(Y missing).

Note that missingness on Y can depend on other unobserved variables. It just can't depend on variables that are observed for the model under consideration. This definition can easily be extended to situations in which more than one variable has missing data, but the representation gets more complicated. Note also that MCAR is not violated if missingness on one variable is related to missingness in another variable. In an extreme but common situation, two variables may be always missing together or always present together; this does not violate MCAR.
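The MCAR condition can be illustrated with a small simulation. This is only a sketch under assumed settings: the data-generating model (Y equal to X plus noise), the 30% missingness rate, the sample size, and all variable names are illustrative choices, not part of the entry.

```python
import random
from statistics import fmean

random.seed(1)
n = 20000

# Assumed toy model: X is always observed, Y depends on X plus noise.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# MCAR: every case loses Y with the same probability, whatever X or Y is.
p_missing = 0.3
observed = [random.random() >= p_missing for _ in range(n)]

y_obs = [yi for yi, o in zip(y, observed) if o]
x_when_obs = [xi for xi, o in zip(x, observed) if o]
x_when_missing = [xi for xi, o in zip(x, observed) if not o]

# Under MCAR the observed cases are a random subsample, so both gaps
# should be close to zero apart from sampling noise.
mean_gap = fmean(y_obs) - fmean(y)
x_gap = fmean(x_when_obs) - fmean(x_when_missing)

print(f"complete-case mean of Y minus full-data mean: {mean_gap:+.3f}")
print(f"mean of X (Y observed) minus mean of X (Y missing): {x_gap:+.3f}")
```

Comparing X between cases with and without Y is something that can be done with real data, and a large gap would be evidence against MCAR. A small gap is not proof of MCAR, though, because dependence of missingness on the unobserved values of Y cannot be checked this way.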

Missing at Random (MAR)

MCAR is a very strong assumption. Some of the newer missing data methods are valid under the weaker assumption of missing at random. Again, let's suppose that only one variable Y has missing data, and that there are other variables X that are always observed. We say that data on Y are missing at random if the probability of missingness on Y may depend on X, but does not depend on Y itself after adjusting or controlling for X. In symbols, we say Pr(Y missing|X, Y) = Pr(Y missing|X). Here's an example of a violation of this assumption: People with high income may be less likely to report their income, even after adjusting for other observed variables.
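The difference between MAR and a violation of MAR can also be sketched by simulation. Everything here is an assumption made for illustration: the toy model (Y equal to X plus noise), the missingness probabilities of 0.1 and 0.6, the split at zero, and the helper names observed_mask, overall_gap, and stratum_gap.

```python
import random
from statistics import fmean

random.seed(0)
n = 20000

# Assumed toy model: X is always observed, Y depends on X plus noise.
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]


def observed_mask(driver, p_low=0.1, p_high=0.6):
    """True where Y is observed; Y is more often missing when driver > 0."""
    return [random.random() >= (p_high if d > 0 else p_low) for d in driver]


mar_mask = observed_mask(x)      # missingness depends only on observed X
violation_mask = observed_mask(y)  # missingness depends on Y itself


def overall_gap(mask):
    """Complete-case mean of Y minus the full-data mean of Y."""
    return fmean([yi for yi, m in zip(y, mask) if m]) - fmean(y)


def stratum_gap(mask):
    """Same comparison, restricted to cases with X > 0."""
    full = [yi for xi, yi in zip(x, y) if xi > 0]
    obs = [yi for xi, yi, m in zip(x, y, mask) if xi > 0 and m]
    return fmean(obs) - fmean(full)


for label, mask in [("MAR", mar_mask), ("MAR violated", violation_mask)]:
    print(f"{label}: overall gap {overall_gap(mask):+.3f}, "
          f"gap within X > 0 {stratum_gap(mask):+.3f}")
```

In this setup the complete-case mean of Y is biased in both scenarios. When missingness depends only on X, the bias disappears once the comparison is made among cases with X > 0, which is what "after adjusting or controlling for X" means here. When missingness depends on Y itself, the bias remains even within that group.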

...
