
Imputation replaces missing values, or missings, with estimated values. In a sense, imputation is a prediction solution. It is one of three options for handling missing data; the other two are deletion and segmentation. The general principle is to delete when the data are expendable, impute when the data are precious, and segment in the less common situation in which a large data set has a large fissure. Imputation is measured against deletion; it is advantageous when it yields the more accurate analysis of the two. This entry discusses the differences between imputing and deleting, the types of missings, the criteria for preferring imputation, and various imputation techniques. It closes with application suggestions.

Figure 1 Missing Data Structure

Impute or Delete

The trade-off is between inconvenience and bias. There are two choices for deletion (casewise or pairwise) and several approaches to imputation. Casewise deletion omits entire observations (or cases) with a missing value from all calculations. Pairwise deletion omits observations on a variable-by-variable basis. Casewise deletion sacrifices partial information either for convenience or to accommodate certain statistical techniques. Techniques such as structural equation modeling may require complete data for all the variables, so only casewise deletion is possible for them. For techniques such as calculating correlation coefficients, pairwise deletion will leverage the partial information of the observations, which can be advantageous when one is working with small sample sizes and when missings are not random.
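The difference between the two deletion choices can be sketched in a few lines of Python. This is a minimal illustration on hypothetical toy data, not a reference implementation: the variable names `x`, `y`, `z` and the helper functions `casewise`, `pairwise_pairs`, and `pearson` are assumptions introduced here.

```python
from math import sqrt

# Hypothetical toy data: each dict is one observation; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0, "z": 1.0},
    {"x": 2.0, "y": 4.0, "z": None},
    {"x": 3.0, "y": None, "z": 2.0},
    {"x": 4.0, "y": 8.0, "z": 5.0},
    {"x": 5.0, "y": 10.0, "z": None},
]


def casewise(rows, variables):
    """Casewise deletion: keep only observations complete on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def pairwise_pairs(rows, u, v):
    """Pairwise deletion: keep observations complete on just the two variables used."""
    return [(r[u], r[v]) for r in rows if r[u] is not None and r[v] is not None]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)


# Casewise deletion leaves only the observations with no missing value at all.
n_casewise = len(casewise(rows, ["x", "y", "z"]))

# Pairwise deletion for the x-y correlation ignores missings on z.
xy_pairs = pairwise_pairs(rows, "x", "y")
n_pairwise_xy = len(xy_pairs)
r_xy = pearson(xy_pairs)

print(n_casewise, n_pairwise_xy, r_xy)
```

In this toy data, casewise deletion retains two of the five observations, whereas the pairwise x-y correlation is computed from four, which is the partial information that casewise deletion sacrifices.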

Imputation is the more advantageous technique when (a) the missings are not random, (b) the missings represent a large proportion of the data set, or (c) the data set is small or otherwise precious. If the missings do not occur at random, which is the most common situation, then deleting can create significant bias. For some situations, it is possible to repair the bias through weighting—as in poststratification for surveys. If the data set is small or otherwise precious, then deleting can severely reduce the statistical power or value of the data analysis.
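The bias from deleting nonrandom missings, and its repair through poststratification weights, can be illustrated with a small worked example. The sketch below rests on stated assumptions: the data are hypothetical, the strata `low` and `high` and the population shares are invented for illustration, and nonresponse within each stratum is assumed not to depend on the missing value itself. Without that last assumption, the weighting would not remove the bias.

```python
# Hypothetical survey sample: (stratum, value); None marks a nonresponse.
# Nonresponse is concentrated in the "high" stratum, so it is not random overall.
sample = [
    ("low", 10.0), ("low", 12.0), ("low", 14.0), ("low", 16.0),
    ("high", None), ("high", 32.0), ("high", 34.0), ("high", None),
]

# Assumed known population shares of each stratum (illustrative values).
population_share = {"low": 0.5, "high": 0.5}

# Deletion: drop nonrespondents and average what is left.
observed = [(s, v) for s, v in sample if v is not None]
deleted_mean = sum(v for _, v in observed) / len(observed)

# Poststratification: average within each stratum, then weight each
# stratum mean by its known population share.
stratum_means = {}
for stratum in population_share:
    values = [v for s, v in observed if s == stratum]
    stratum_means[stratum] = sum(values) / len(values)

weighted_mean = sum(
    population_share[s] * stratum_means[s] for s in population_share
)

print(round(deleted_mean, 2), weighted_mean)
```

Here the unweighted mean after deletion is pulled toward the overrepresented low stratum (about 19.67), while the poststratified mean is 23.0, the value implied by equal stratum shares.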

Imputation can repair the missing data by creating one or more versions of how the data set should appear. By leveraging external knowledge, good technique, or both, it is possible to reduce bias due to missing values. Some techniques offer a quick improvement over deletion. Software is making these techniques faster and sharper; however, the techniques should be conducted by those with appropriate training.

Categorizing Missingness

Missingness can be categorized in two ways: the physical structure of the missings and the underlying nature of the missingness. First, the structure of the missings can be due to item or unit missingness, the merging of structurally different data sets, or barriers attributable to the data collection tools. Item missingness refers to the situation in which a single value is missing for a particular observation, and unit missingness refers to the situation in which all the values for an observation are missing. Figure 1 provides an illustration of missingness.
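The distinction between item and unit missingness can be expressed as a simple check on each observation. The following is a sketch on hypothetical records; the field names and the `classify` helper are assumptions made for illustration. As a simplification, any observation that is missing some but not all of its values is labeled as item missingness, even when more than one value is missing.

```python
# Hypothetical records: None marks a missing value.
records = [
    {"age": 34, "income": 52000, "region": "N"},
    {"age": None, "income": 48000, "region": "S"},
    {"age": None, "income": None, "region": None},
    {"age": 51, "income": None, "region": None},
]


def classify(record):
    """Label one observation as 'complete', 'item', or 'unit' missingness."""
    missing = [name for name, value in record.items() if value is None]
    if not missing:
        return "complete"
    if len(missing) == len(record):
        return "unit"  # every value for the observation is missing
    return "item"      # some, but not all, values are missing


labels = [classify(r) for r in records]
print(labels)
```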

Table 1 Underlying Nature of Missingness

Second, missings can be categorized by the underlying nature of the missingness. The three categories are (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), summarized in Table 1 and discussed below.

...
