In this guide, you will learn the basics of assessing and managing missingness in survey datasets in SPSS, using a practical example to illustrate the process. Links to the example dataset are provided, and readers are encouraged to replicate the example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in SPSS.
Missing data are found in almost every dataset. Simply put, data are missing when information that is present in variables for some cases is not present for others. Missing data can pose a problem for the analysis of the data and for drawing appropriate inferences. With less data, one’s inferences are less powerful, and if the data are missing in patterns that relate to what is being measured in the study, then inferences can be biased as a result.
How you deal with missing data depends on the type of missingness, how much of the data are missing, and the reasons for the missingness. The first task is to assess the amount of missing data and how it is distributed across the variables in your dataset. Second, you need to understand the reasons for each type of missingness: is it planned, a refusal, or a non-substantive response? Third, you will need to decide how to prepare the data for analysis.
From the point of view of this introductory guide, the main options are, broadly, to discard any case with a missing value on any of the variables used in a set of analyses (complete-case analysis, or listwise deletion) or, alternatively, to use all cases available for each estimate, thereby using a different subset of cases for each analysis (available-case analysis, or pairwise deletion). In both approaches, cases without data on the variable of interest are not used to derive estimates related to that variable. Different statistical packages treat missing values slightly differently for different computations and procedures, but the principles remain the same.
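The difference between discarding incomplete cases and using all available cases for each estimate can be seen in a small sketch. The Python below is only an illustration of the logic, not part of the SPSS procedure; the item names x, y, and z and the data values are hypothetical, and None stands in for a missing response.

```python
from itertools import combinations

# Hypothetical toy data: three items per case; None marks a missing response.
cases = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 2, "y": None, "z": 4},
    {"x": None, "y": 3, "z": 5},
    {"x": 4, "y": 4, "z": None},
    {"x": 5, "y": 5, "z": 1},
]
items = ["x", "y", "z"]

# Discard any case with a missing value on any item (complete cases only).
complete_cases = [c for c in cases if all(c[v] is not None for v in items)]

# Alternatively, for each pair of items keep every case observed on both,
# so each estimate can rest on a different subset of cases.
available_n = {
    (a, b): sum(1 for c in cases if c[a] is not None and c[b] is not None)
    for a, b in combinations(items, 2)
}

print("complete cases:", len(complete_cases))
print("cases available per pair:", available_n)
```

In this toy dataset, fewer cases survive when every item must be observed than are available for any single pair of items, which is the trade-off described above.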
This guide provides a basic introduction to dealing with missing data in survey research. The data come from the British Social Attitudes (BSA) Survey 2016.
This example uses five variables from the 2016 BSA Survey:
The first task is to examine each variable of interest to see how it is coded and which values represent missing data of which type. This can be done in SPSS by selecting from the menu:
Analyze → Reports → Codebook
In the “Codebook” dialog box that opens, move the questionnaire version, children in household, and opinion about plane travel variables into the “Codebook Variables:” box as shown in Figure 1. Click OK to produce variable information.
When using variables as ordinal or quantitative, values should be on a continuum. Often “missing” responses are given numerical values which do not meet this need. In the case of the variable plnallow, for example, values of −2, −1, 8, and 9 do not locate the respondent on a continuum of attitude to plane travel between “agreeing strongly” and “disagreeing strongly” that people should be able to travel by plane as much as they like. We do not know the opinion of these respondents, or whether they have one at all.
In this case, 1,619 observations are coded “Skip version off route,” meaning that the question was not asked of the recipients of Versions B and C of the questionnaire. A further 542 respondents failed to return their self-completion questionnaire (A, B, or C). Values 8 and 9 are not assigned as missing in the dataset, but there are 29 responses of “can’t choose” and 25 of “not answered.” We therefore need to recode the variable before we can use it in analysis.
To recode the variable so that these values will be treated as missing in analysis, we select the following SPSS menu options:
Transform → Recode into Same Variables
In the “Recode into Same Variables” dialog box that opens, move the opinion about plane travel variable into the “Numeric Variables:” text box at the top right as shown in Figure 2.
Click on “Old and New Values,” and in the dialog box that opens, type “-2” in the “Value” box in the “Old Value” section at the top left. In the “New Value” section, choose “System-missing.” Then press Add and the changes will appear in the “Old–>New:” box. Once you have done this for values −1, 8, and 9 as well, as shown in Figure 3, press Continue to return to the previous box and then press OK to make the changes.
Repeat the process for plnenvt and plnterm. (Note that you can add all three variables to the “Numeric Variables:” text box at once if you are making the same changes to all of them.)
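The logic of this recode can be sketched outside SPSS. The Python below is a minimal illustration, assuming the four codes named above (−2, −1, 8, and 9) are the only non-substantive values; the function name recode_missing and the sample responses are hypothetical, and None plays the role of SPSS's system-missing value.

```python
# Codes that do not place a respondent on the 1-5 agreement continuum.
MISSING_CODES = {-2, -1, 8, 9}


def recode_missing(value, missing_codes=MISSING_CODES):
    """Return None (standing in for system-missing) for non-substantive codes."""
    return None if value in missing_codes else value


# Hypothetical raw responses for one item such as plnallow.
raw = [1, 5, -1, 8, 3, 9, -2, 2]
recoded = [recode_missing(v) for v in raw]

print(recoded)
```

Substantive responses pass through unchanged, so the same function could be applied to each of the three opinion items in turn.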
To combine survey items into a summative scale, we first want to examine the correlation between survey items to see whether they might measure the same underlying construct. Having recoded the missing or non-substantive responses to missing as above, we can estimate the correlation between items in SPSS by selecting from the menu:
Analyze → Correlate → Bivariate
In the “Bivariate Correlations” dialog box that opens, move the three opinion variables into the “Variables:” window. Figure 4 shows what this looks like in SPSS.
To select listwise deletion, which will include only cases where responses are available for all three survey items, select “Options” from the “Bivariate Correlations” dialog box. In the “Missing Values” section of the “Bivariate Correlations: Options” box that opens, check “Exclude cases listwise” as shown in Figure 5. Press Continue and then OK.
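To see why the two missing-value options can give different coefficients and different numbers of observations, consider the following sketch. It is an illustration in Python rather than SPSS output: the six rows of data are hypothetical, the helper functions pearson and correlate are written here for the example, and None stands for a missing response.

```python
from itertools import combinations
from math import sqrt

ITEMS = ("plnallow", "plnenvt", "plnterm")

# Hypothetical recoded responses; each row is one case, None is missing.
rows = [
    (1, 2, 1),
    (2, None, 2),
    (3, 3, None),
    (4, 5, 4),
    (5, 4, 5),
    (None, 1, 2),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def correlate(rows, i, j, listwise=False):
    """Return (r, n) for items i and j under pairwise or listwise exclusion."""
    if listwise:
        # Drop every case with a missing value on any item.
        rows = [r for r in rows if None not in r]
    # Keep the cases observed on both items of this pair.
    pairs = [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)


for i, j in combinations(range(len(ITEMS)), 2):
    r_pw, n_pw = correlate(rows, i, j)
    r_lw, n_lw = correlate(rows, i, j, listwise=True)
    print(f"{ITEMS[i]} x {ITEMS[j]}: pairwise r={r_pw:.2f} (n={n_pw}), "
          f"listwise r={r_lw:.2f} (n={n_lw})")
```

Under listwise exclusion every coefficient is based on the same cases, whereas under pairwise exclusion the number of cases can vary from one coefficient to the next.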
Next, we create the summative scale. For each case with at least one valid response across the three items, the scale score is the mean of the responses that are available. To do this, select the following options:
Transform → Compute Variable
Give the new variable the name protectenv in the “Target Variable:” box at the top left. Select “Statistical” from the “Function group:” box and “Mean” from the options that appear below. Double click on it, and it will appear in the “Numeric Expression:” box at the top. Enter the three opinion variables (plnallow, plnenvt, and plnterm), separated by commas, within the brackets as shown in Figure 6.
Press OK to create the summative scale.
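The scale computation just described, a mean over whichever items are available for each case, can be sketched as follows. This is a Python illustration under stated assumptions rather than the SPSS implementation: the function name mean_available and its min_valid parameter are hypothetical, and None stands for a missing response.

```python
def mean_available(values, min_valid=1):
    """Mean of the non-missing values, or None if fewer than min_valid exist."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical cases: responses to plnallow, plnenvt, plnterm after recoding.
cases = [
    (2, 4, 3),           # all three items answered
    (2, None, 4),        # one item missing: mean of the remaining two
    (None, None, 5),     # only one item answered: still receives a score
    (None, None, None),  # no items answered: scale score stays missing
]

protectenv = [mean_available(c) for c in cases]
print(protectenv)
```

Raising min_valid to 3 would reproduce the complete-case version of the scale, in which only cases that answered all three items receive a score.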
It is useful to compile a variable that is a count of missing values for each case. We could then, for example, select cases with a value of zero (no missing items) to use in complete-case analysis.
To create a variable which is the count of missing responses across all five variables used in our analysis, choose the following menu options:
Transform → Count Values within Cases
In the dialog box that opens, choose a name and label in the Target Variable and Target Label text boxes at the top. In this case, we are calling the variable totmiss. Move all five variables into the “Numeric Variables:” box as shown in Figure 7.
Select Define Values… and in the box that opens, check “System- or user-missing” and press Add as shown in Figure 8.
Press Continue to return to the previous box and then press OK.
Finally, we create a frequency response table for the new variable totmiss. Select the following menu options:
Analyze → Descriptive Statistics → Frequencies
Move totmiss to the “Variable(s):” box and press OK to produce the frequency distribution.
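The counting and tabulating steps above amount to the following logic, shown here as a Python sketch rather than SPSS output. The five-value rows are hypothetical stand-ins for the five analysis variables, and None stands for a system- or user-missing value.

```python
from collections import Counter

# Hypothetical cases: one value per analysis variable; None is missing.
cases = [
    (1, 0, 2, 3, 1),
    (2, 1, None, None, None),
    (1, 0, 4, None, 5),
    (3, 0, None, 2, None),
    (1, 1, 5, 5, 4),
]

# Count of missing values within each case (the role played by totmiss).
totmiss = [sum(v is None for v in case) for case in cases]

# Frequency distribution of the count.
freq = Counter(totmiss)
for n_missing in sorted(freq):
    print(n_missing, freq[n_missing])

# Cases with a count of zero are the complete cases.
complete = [case for case, m in zip(cases, totmiss) if m == 0]
```

Filtering on a count of zero gives the same set of cases that listwise deletion would use.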
The variable ABCVer identifies which questionnaire version (A, B, or C) each respondent received. In addition to the three versions, coded 1, 2, or 3, there are other values that could be regarded as missing. Values −2 and −1 denote missing data by design; in this case, they would mean that respondents were not allocated to Version A, B, or C.
The count column shows, though, that no values are thus coded, meaning that everyone was assigned to one of the three versions of the questionnaire. Values 8 and 9 indicate “Don’t know” and “Refusals.” Although in the dataset these values are coded, they are not relevant for this variable because it is completed by the interviewer on giving the self-completion questionnaire to the respondent. Unsurprisingly, there are no instances of these values in the dataset. For this variable, then, there are no actual missing observations, even though the precoded dataset allows for various categories of potential missings.
The same situation pertains for whether the respondent has a child in the household, as shown in Figure 11. No respondent refused the question or said don’t know, so there are complete observations for this variable, with 938 who have a child at home and 2,004 who do not.
Figure 12 presents a different kind of example. The question was asked only in Version A of the self-completion questionnaire. Respondents were asked whether people should be allowed to travel by plane as much as they like. Values 1–5 capture the substantive responses to this question (including “neither agree nor disagree” as a middle option). Again, negative values denote missing data. In this case, 1,619 observations are coded “Skip version off route,” meaning that the question was not asked of the recipients of Versions B and C of the questionnaire. A further 542 respondents failed to return their self-completion questionnaire (A, B, or C). Values 8 and 9 are not assigned as missing in the dataset, but there are 29 responses of “can’t choose” and 25 of “not answered.”
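The counts quoted above can be combined to check how many observations will be treated as missing once values 8 and 9 are recoded. The short Python sketch below does the arithmetic; it assumes that the dataset contains the 938 + 2,004 cases reported for the child-in-household variable, which is an inference from this guide rather than a figure taken from the SPSS output.

```python
# Assumption: every case appears in the child-in-household table (938 + 2,004).
total_cases = 938 + 2004

not_asked = 1619     # "Skip version off route" (Versions B and C)
not_returned = 542   # self-completion questionnaire not returned
cant_choose = 29     # value 8
not_answered = 25    # value 9

missing_after_recode = not_asked + not_returned + cant_choose + not_answered
valid_after_recode = total_cases - missing_after_recode

print("missing after recoding:", missing_after_recode)
print("valid responses remaining:", valid_after_recode)
```

If the assumption about the total holds, the second figure should agree with the number of valid responses in the recoded frequency table.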
Figure 13 shows the same table but after recoding these cases as missing.
Figure 14 shows the pairwise correlation matrix of these items after recoding, while Figure 15 shows the listwise deleted matrix. In Figure 14, there is a different number of observations used for each of the three estimated coefficients. The bottom row shows the total number of observations for each of the three items. In Figure 15, by contrast, the number of observations for all three estimated correlation coefficients is 701 after listwise deletion. Any case that has a missing value on any of the three items is excluded from the analysis. It can also be seen that the coefficients are slightly different between the pairwise and listwise versions of the correlation matrix because a different set of observations is used for each estimate.
Figure 16 shows the frequency response table for the scale protectenv, computed using the mean of the three items, accepting a minimum of one item for a valid scale score. This retains the maximum number of observations possible: 733, compared with listwise or “complete-case” computation, which, as we learned from Figure 15, would yield only 701 cases. Whatever decisions are made, they should be accurately reported, so that results are transparent and reproducible.
Figure 17 shows the frequency response table from a variable totmiss that has been computed as a count of the number of missing values on all of the variables mentioned in the guide so far. 701 cases have no missing observations, while the remaining cases have between one and three missing observations on the variables in question.
You can download this sample dataset along with a guide showing how to carry out the procedures using statistical software. The sample dataset also includes three further variables, mobdsafe, mobddang, and mobdban, that assess how concerned people are about the use of cell phones while driving. See whether you can reproduce the tables in this guide, recode these variables, and compute a scale to measure concern about driving while on the phone, as shown for the plane travel example above.
IBM® SPSS® Statistics software (SPSS) screenshots Republished Courtesy of International Business Machines Corporation, © International Business Machines Corporation. SPSS Inc. was acquired by IBM in October, 2009. IBM, the IBM logo, ibm.com, and SPSS are trademarks or registered trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “IBM Copyright and trademark information” at http://www.ibm.com/legal/copytrade.shtml.