How-to Guide for IBM® SPSS® Statistics Software
Introduction

In this guide you will learn how to detect heteroscedasticity following a linear regression model in IBM® SPSS® Statistics Software (SPSS), using a practical example to illustrate the process. You will find links to the example dataset, and you are encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in SPSS.

Contents
• 1 Heteroscedasticity
• 2 An Example in SPSS: Blood Pressure and Age in China
• 2.1 The SPSS Procedure
• 2.2 Exploring the SPSS Output
1 Heteroscedasticity

Linear regression models estimated via Ordinary Least Squares (OLS) rest on several assumptions, one of which is that the variance of the residuals from the model is constant and unrelated to the independent variable(s). Constant variance is called homoscedasticity, while non-constant variance is called heteroscedasticity. This example illustrates how to detect heteroscedasticity following the estimation of a simple linear regression model.
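To make the distinction concrete, the following Python sketch simulates two datasets that share the same underlying line but differ in their error term, then compares the residual spread in the upper and lower halves of the independent variable. Everything in it is illustrative: the data are simulated, and the function names (fit_line, spread_ratio) and the chosen parameter values are assumptions made for this sketch, not part of the SPSS example.

```python
import random


def fit_line(x, y):
    """Return (intercept, slope) of the OLS line for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_square(values):
    """Average of the squared values (a simple measure of spread around zero)."""
    return sum(v * v for v in values) / len(values)


def spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half."""
    intercept, slope = fit_line(x, y)
    # Pair each x with its residual, then sort by x.
    pairs = sorted((xi, yi - (intercept + slope * xi)) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = [resid for _, resid in pairs[:half]]
    high = [resid for _, resid in pairs[half:]]
    return mean_square(high) / mean_square(low)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [rng.uniform(1, 10) for _ in range(2000)]

# Same line in both cases; only the error term differs.
y_homo = [2 + 3 * xi + rng.gauss(0, 5) for xi in x]     # constant spread
y_hetero = [2 + 3 * xi + rng.gauss(0, xi) for xi in x]  # spread grows with x

ratio_homo = spread_ratio(x, y_homo)
ratio_hetero = spread_ratio(x, y_hetero)
print("homoscedastic ratio:  ", round(ratio_homo, 2))    # expected to be near 1
print("heteroscedastic ratio:", round(ratio_hetero, 2))  # expected to be well above 1
```

A ratio near 1 is what constant variance looks like in this crude comparison; a ratio far from 1 suggests that the spread of the residuals depends on the independent variable.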

2 An Example in SPSS: Blood Pressure and Age in China

This example uses two variables from the 2006 China Health and Nutrition Survey:

• A person’s systolic blood pressure (systolic).
• A person’s age, measured in years (age).

There are 9178 respondents in this survey. Systolic blood pressure measures the pressure in a person’s arteries when their heart beats (contracts and pumps blood). In this dataset, this variable ranges from 70 to 240 with a mean of about 122 and a standard deviation of 18.14. Age is measured in years, and in this dataset it ranges from 17 to 95 with a mean of about 49 and a standard deviation of 15.19. Both of these variables are continuous, making them appropriate for simple regression.

2.1 The SPSS Procedure

Before producing the simple regression model, it is a good idea to look at each variable separately. However, in the interest of space, we forgo doing so here. Readers should explore the SAGE Research Methods Dataset examples associated with Simple Regression and Multiple Regression for more information.

You estimate a simple regression model in SPSS by selecting from the menu:

Analyze → Regression → Linear

In the Linear Regression dialog box that opens, move the systolic blood pressure variable (systolic) into the Dependent: window and move the age variable (age) into the Independent(s): window.

Figure 1 shows what this looks like in SPSS.

Figure 1: Selecting simple regression from the Analyze menu in SPSS.

Because we want to explore whether there is evidence of heteroscedasticity among the residuals of this regression, we also want to produce a scatter plot that plots the standardized residuals on the Y-axis and the standardized predicted values of the dependent variable on the X-axis.

First, click the “Plots…” button on the right-hand side of the Linear Regression dialog box. That opens a second dialog box. In this second dialog box, move *ZRESID into the open box under Y: and *ZPRED into the open box under X:, as shown in Figure 2.

Figure 2: Producing a two-way scatter plot of standardized residuals and standardized predicted values for a regression model in the Linear Regression: Plots dialog box in SPSS.

Once you are done, click Continue in this dialog box, and then click OK to perform the analysis.
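For readers who want to see what the two plotted quantities represent, here is a minimal Python sketch of how standardized predicted values and standardized residuals can be computed for a simple regression. It assumes the usual definitions: each predicted value minus the mean prediction, divided by the standard deviation of the predictions (for *ZPRED), and each residual divided by the standard error of the estimate (for *ZRESID). Check the SPSS documentation if exact agreement matters. The six data points and the function name are hypothetical and are not taken from the survey.

```python
import math


def standardized_pred_and_resid(x, y):
    """Return (zpred, zresid) for a one-predictor OLS regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    pred = [intercept + slope * xi for xi in x]
    resid = [yi - pi for yi, pi in zip(y, pred)]

    # Standardized predicted values: centre on the mean prediction and divide
    # by the sample standard deviation of the predictions (assumed n - 1).
    mean_pred = sum(pred) / n
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred) / (n - 1))
    zpred = [(p - mean_pred) / sd_pred for p in pred]

    # Standardized residuals: divide by the standard error of the estimate,
    # sqrt(SSE / (n - 2)) for a model with one predictor and an intercept.
    see = math.sqrt(sum(e * e for e in resid) / (n - 2))
    zresid = [e / see for e in resid]
    return zpred, zresid


# Hypothetical toy data, NOT values from the China Health and Nutrition Survey.
age = [20, 30, 40, 50, 60, 70]
systolic = [110, 112, 121, 119, 130, 138]

zpred, zresid = standardized_pred_and_resid(age, systolic)
for zp, zr in zip(zpred, zresid):
    print(round(zp, 3), round(zr, 3))
```

Plotting zresid against zpred gives the same kind of picture that the Plots dialog box requests: if the vertical scatter of points widens or narrows as you move along the horizontal axis, the residual variance is not constant.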

2.2 Exploring the SPSS Output

Figure 3 presents five tables of results that are produced by the simple linear regression procedure in SPSS. The fifth table is produced because we asked SPSS to produce plots using the standardized residuals. The fourth table in Figure 3, outlined in red, includes the results of the regression model itself.

Figure 3: Simple regression of systolic blood pressure on age, 2006 China Health and Nutrition Survey.

The first three tables in Figure 3 report the independent variable(s) entered into the model, some summary fit statistics for the regression model, and an analysis of variance for the model as a whole. While detailed examination of these tables is beyond the scope of this example, we note in the second table that R Square measures the proportion of the variance in the dependent variable explained by the model, which in this case consists of a single independent variable. An R Square of 0.157 means that approximately 15.7% of the variance in systolic blood pressure is accounted for by age.

The fourth table in Figure 3, outlined in red, presents the estimates of the intercept, or constant, and the slope coefficient. The intercept is estimated to be approximately 98.56. The constant of a simple regression model can be interpreted as the expected value of the dependent variable when the independent variable equals zero. In this dataset, age ranges from 17 to 95 and is never zero, so the constant by itself does not tell us much.

The slope coefficient linking age to systolic blood pressure is estimated to be approximately 0.47. This represents the average marginal effect of age on systolic blood pressure, and can be interpreted as the expected change, on average, in the dependent variable for a one-unit increase in the independent variable. For this example, that means that every increase in age of 1 year is associated with an average increase of about 0.47 in systolic blood pressure. The fourth table in Figure 3 reports that this estimate is statistically significantly different from zero, with a p-value well below 0.001. This leads us to reject the null hypothesis of no relationship and conclude that there does appear to be a positive relationship between a person’s age and their systolic blood pressure in China.
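The interpretation of the intercept and slope can be checked with a little arithmetic. The Python sketch below plugs the rounded estimates reported above (98.56 and 0.47) into the fitted line; the function name is hypothetical, and because the coefficients are rounded, the results are approximate.

```python
# Rounded estimates reported in the fourth table of Figure 3.
INTERCEPT = 98.56
SLOPE = 0.47


def predicted_systolic(age):
    """Predicted systolic blood pressure from the fitted line (rounded coefficients)."""
    return INTERCEPT + SLOPE * age


# Youngest respondent, approximate mean age, and oldest respondent in the data.
for age in (17, 49, 95):
    print(age, round(predicted_systolic(age), 2))

# A 10-year difference in age corresponds to 10 * 0.47 = 4.7 units of predicted pressure.
ten_year_gap = predicted_systolic(60) - predicted_systolic(50)
print(round(ten_year_gap, 2))
```

At the approximate mean age of 49, the predicted value is about 121.6, which is in line with the reported mean systolic blood pressure of about 122, as expected for a line fitted by OLS.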

Figure 4 presents a plot with the standardized residuals of this regression on the Y-axis and the standardized predicted values of the dependent variable on the X-axis. Figure 4 shows that the vertical spread of the residuals is relatively low for respondents with lower predicted levels of systolic blood pressure. However, as we move left to right and the predicted level of systolic blood pressure increases, we see the vertical spread of the residuals also increasing. This spread appears to shrink somewhat at the very highest predicted values for systolic blood pressure. Overall, Figure 4 shows a pattern in the variance of the residuals, meaning that we appear to have evidence of heteroscedasticity.

Figure 4: Two-way scatter plot of standardized residuals from the regression shown in the fourth table of Figure 3 on the Y-axis and standardized predicted values of the dependent variable from that regression on the X-axis, 2006 China Health and Nutrition Survey.

Unfortunately, SPSS does not include any formal tests of heteroscedasticity. Users can create macros within SPSS to perform specific functions not built into the software, but that process is beyond the scope of this example. Example code for a macro that includes the Breusch–Pagan test, and a tutorial video on how to use it, can be found at the following links:

Applying the steps of the Breusch–Pagan test to this example results in a test statistic of 652.33. When compared to a Chi-squared distribution with 1 degree of freedom, the resulting p-value falls well below the standard 0.05 level. Thus we have clear evidence to reject the null hypothesis of homoscedasticity and conclude that the residuals of this regression model are heteroscedastic.
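The logic of the test can be sketched outside SPSS. The Python code below is a minimal illustration, not the macro referred to above: it assumes the studentized (Koenker) form of the Breusch–Pagan test, in which the squared residuals are regressed on the independent variable and the statistic is the sample size times the R Square of that auxiliary regression. This guide does not state which variant produced 652.33, so that choice, the function names, and the toy data are all assumptions. The final lines convert the reported statistic into a p-value using the Chi-squared distribution with 1 degree of freedom.

```python
import math


def ols(x, y):
    """Return (intercept, slope) of the OLS line for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Proportion of the variance in y explained by the OLS line on x."""
    intercept, slope = ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """Studentized (Koenker) LM statistic: n * R^2 of squared residuals on x."""
    intercept, slope = ols(x, y)
    sq_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)


def chi2_sf_1df(stat):
    """Upper-tail probability of a Chi-squared variable with 1 degree of freedom."""
    # With 1 df, P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical toy data whose errors grow with x (not the survey data).
toy_x = list(range(1, 11))
toy_y = [2 * xi + ((-1) ** xi) * xi for xi in toy_x]
toy_lm = breusch_pagan_lm(toy_x, toy_y)
print("toy LM statistic:", round(toy_lm, 3))

# p-value for the statistic reported in this guide.
p_value = chi2_sf_1df(652.33)
print("p-value below 0.05:", p_value < 0.05)
```

Because the model has a single independent variable, the auxiliary regression has one slope, which is why the statistic is compared to a Chi-squared distribution with 1 degree of freedom.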