Multiple Imputation For Missing Data

Neil J. Salkind

doi:10.4135/9781412952644

Edited by:
Neil J. Salkind
In: Encyclopedia of Measurement and Statistics
Chapter DOI: https://doi.org/10.4135/9781412952644.n301
Subject: Anthropology, Business and Management, Criminology and Criminal Justice, Communication and Media Studies, Counseling and Psychotherapy, Economics, Education, Geography, Health, History, Marketing, Nursing, Political Science and International Relations, Psychology, Social Policy and Public Policy, Social Work, Sociology, Science, Technology, Computer Science, Engineering, Mathematics, Medicine

In many, if not most, studies, some data that were meant to be collected are missing. For example, in a survey, some people may not respond to all the questions. Or, in a randomized experiment, some units' outcomes may not be measured because of equipment failure. Multiple imputation is one principled method for handling such missing data. The general idea is to fill in the missing data with plausible values, analyze the completed data set, and repeat the process multiple times. The analyses from each completed data set are combined to result in inferences that properly account for the missing data. Multiple imputation has been used in large governmental surveys such as the National Health and Nutrition Examination Survey and the Survey of Consumer Finances, and in numerous studies by individual researchers.

Before we review multiple imputation, it is worthwhile to consider the most common and convenient approach to handling missing data: analyze only the cases that have complete data for the variables of interest. This available cases approach can lead to inaccurate estimates. For a simple illustration of this point, consider the hypothetical data in Table 1 for a random sample of five people. Suppose that the weights of all people 6 feet (72 inches) or taller are missing, so that the observed weights are 130, 140, and 150, because the height/weight instrument is unable to record information for people of that height. Researchers interested in estimating the population average weight are in trouble if they use only the three available cases: their sample average of 140 pounds is a severe underestimate.

Table 1 Hypothetical Data for Illustrating Multiple Imputation
Height (inches)	Weight (pounds)
65	130
68	140
70	150
72	160
75	170
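
The underestimate can be checked with a few lines of arithmetic. The sketch below is only an illustration built on the Table 1 values; the 72-inch cutoff is an assumption chosen to reproduce the three observed weights (130, 140, and 150) named in the text, and the variable names are hypothetical.

```python
# Table 1: a hypothetical random sample of five people.
heights = [65, 68, 70, 72, 75]       # inches
weights = [130, 140, 150, 160, 170]  # pounds

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Assumed missingness rule: the instrument records no weight for anyone
# 72 inches or taller, which leaves the three weights named in the text.
CUTOFF_INCHES = 72
observed = [w for h, w in zip(heights, weights) if h < CUTOFF_INCHES]

full_mean = mean(weights)        # average if nothing were missing
available_mean = mean(observed)  # average of the available cases only

print("all five cases:      ", full_mean)
print("three available cases:", available_mean)
```

The available cases average falls 10 pounds below the average of the full sample because the missing weights all belong to the tallest, and here heaviest, people.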

Many times, researchers are interested in relationships among variables, such as regression coefficients. In the hypothetical example, the fitted regression of weight on height obtained using the three available cases results in reasonable (unbiased) estimates of the slope and intercept, because the regression holds for all heights. However, in data sets with many variables and complicated missing data patterns, using only the available cases might exclude a large fraction of the observations, which could dramatically increase the variability of the estimates. Additionally, different specifications of models may use different units for estimation, making theoretical properties of resulting inferences nearly impossible to understand and practical comparisons of different models difficult.
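
The contrast between the mean and the regression slope can be seen by fitting the line twice, once to all five rows of Table 1 and once to the three available cases. This is a minimal sketch using a hand-written least squares helper (`ols_fit` is a hypothetical name, not part of the entry); the 72-inch cutoff is again an assumption that reproduces the three observed weights.

```python
# Table 1: a hypothetical random sample of five people.
heights = [65, 68, 70, 72, 75]       # inches
weights = [130, 140, 150, 160, 170]  # pounds

def ols_fit(x, y):
    """Simple least squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Available cases under the assumed rule: weight missing at 72 inches or more.
pairs = [(h, w) for h, w in zip(heights, weights) if h < 72]
avail_heights = [h for h, _ in pairs]
avail_weights = [w for _, w in pairs]

slope_full, intercept_full = ols_fit(heights, weights)
slope_avail, intercept_avail = ols_fit(avail_heights, avail_weights)

print("slope, all five cases:       ", round(slope_full, 2))
print("slope, three available cases:", round(slope_avail, 2))
```

The two slopes differ slightly because the five points are not exactly collinear and three points carry little information, but the available cases slope is not pulled systematically in one direction the way the available cases mean is.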

Illustration of Multiple Imputation

In contrast to available cases analyses, multiple imputation uses all records for estimation, which takes advantage of the information from partially completed records. To illustrate multiple imputation, we again use the hypothetical example. We first demonstrate how to analyze a set of multiply imputed data sets, and then discuss methods of generating imputations.

Suppose that five plausible values for each missing weight have been generated to create five completed data sets. These are displayed in Table 2, along with the estimated slope and its variance obtained from fitting standard linear regression in each completed data set. Inferences for the population regression slope β are based on three quantities. First, compute the average of the five estimated slopes, which equals 4.01. Second, compute the variance of these five estimated slopes, which equals 0.0523. Third, compute the average of the variances of the slopes, which equals 0.0209. The point estimate of the population slope is 4.01, and the variance associated with this point estimate is (1 + 1/5)(0.0523) + 0.0209 = 0.0836. An approximate 95% confidence interval for β is 4.01 ± 1.96√0.0836.
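
The combining step described above can be sketched in code. The function names (`combine`, `pool`) are hypothetical, and the sketch assumes that "the variance of these five estimated slopes" means the usual sample variance with divisor m - 1; it uses the normal multiplier 1.96 as in the text. Because Table 2 is not reproduced here, the example call works from the three summary quantities quoted in the paragraph rather than from the individual completed-data estimates.

```python
import math
from statistics import mean, variance

def combine(q_bar, b, u_bar, m, z=1.96):
    """Combine summary quantities from m completed data sets.

    q_bar: average of the m point estimates
    b:     variance of the m point estimates (between-imputation)
    u_bar: average of the m estimated variances (within-imputation)
    Returns (point estimate, total variance, (lower, upper) interval).
    """
    total = (1 + 1 / m) * b + u_bar
    half_width = z * math.sqrt(total)
    return q_bar, total, (q_bar - half_width, q_bar + half_width)

def pool(estimates, variances, z=1.96):
    """Same combination, starting from per-data-set estimates and variances."""
    m = len(estimates)
    q_bar = mean(estimates)
    b = variance(estimates)  # sample variance, divisor m - 1 (assumption)
    u_bar = mean(variances)
    return combine(q_bar, b, u_bar, m, z)

# Summary quantities quoted in the text for the five completed data sets.
estimate, total_variance, interval = combine(4.01, 0.0523, 0.0209, m=5)
print(estimate, round(total_variance, 3), [round(x, 2) for x in interval])
```

Given the five slopes and variances from Table 2, `pool(slopes, variances)` would carry out all three steps in one call. The factor (1 + 1/m) inflates the between-imputation variance to reflect that only a finite number of imputations was used.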

...
