In this guide, you will learn how to detect heteroscedasticity following a linear regression model in Stata using a practical example to illustrate the process. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in Stata.
Linear regression models estimated via Ordinary Least Squares (OLS) rest on several assumptions, one of which is that the variance of the residuals from the model is constant and unrelated to the independent variable(s). Constant variance is called homoscedasticity, while non-constant variance is called heteroscedasticity. This example illustrates how to detect heteroscedasticity following the estimation of a simple linear regression model.
This example uses a subset of data from the first wave of the Early Childhood Longitudinal Study, Kindergarten (ECLSK) dataset. This extract includes data from 11,933 students who were in kindergarten in the 1998–99 academic year. The two variables we examine are:
- c1r4rscl: the student’s reading performance score, measured in the Fall of 1998
- c1r4mscl: the student’s math performance score, measured in the Fall of 1998
Both of these performance score measures are scales based on student responses to a large number of test items in each area. Each scale was built using item response theory, which is a common method of measuring performance based on multiple test items. The reading score variable ranges from about 21 to just over 138, with a mean of 36 and a standard deviation of 10. The math score variable ranges from about 10 to almost 116, with a mean of 27 and a standard deviation of 9. Both variables are continuous measures, making them appropriate for simple regression.
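As a quick check before modeling, you can reproduce these summary statistics in Stata with the summarize command (a sketch assuming the variable names above):

summarize c1r4rscl c1r4mscl

Stata reports the number of observations, mean, standard deviation, minimum, and maximum for each variable, which should match the figures described above.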
Before producing the simple regression model, it is a good idea to look at each variable separately. However, in the interest of space, we forgo doing so here. Readers should explore the SAGE Research Methods Dataset examples associated with Simple Regression and Multiple Regression for more information.
You estimate a simple regression model in Stata by entering the regress command in the Command window, followed first by the dependent variable, c1r4rscl, and then the independent variable, c1r4mscl. The command is as follows:
regress c1r4rscl c1r4mscl
Press Enter to run the analysis.
Entering the command as above into the Stata Command window is the simplest way to carry out this estimation. However, the simple regression model can also be estimated by using the menu options as follows:
Statistics → Linear models and related → Linear regression
In the “regress - Linear Regression” dialog box that opens, two text boxes are provided to specify the dependent and independent variables to include in the model. In the “Dependent variable” box, select c1r4rscl from the drop-down menu. In the “Independent variables” text box, select c1r4mscl.
Once you are done, click OK to perform the analysis.
Figure 1 shows what the dialog box looks like in Stata.
We want to explore whether there is evidence of heteroscedasticity among the residuals of this regression, so next, we produce a scatterplot that plots the residuals on the Y-axis and the predicted values of the dependent variable on the X-axis. To do this in Stata, enter the following command in the Command window, after running the regression:

rvfplot
Press Enter to produce a scatterplot of the residuals versus predicted values.
For further clarity, you can ask Stata to add a line at y = 0. This will provide a stronger visual sense of whether the residual values are evenly distributed around zero for all predicted values. To do this, use the following Stata command:

rvfplot, yline(0)
Press Enter to produce a scatterplot with a line at y = 0.
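If you prefer to build this plot manually, you can save the fitted values and residuals with the predict command and then draw the scatterplot yourself. The following is a sketch assuming the regression has just been estimated; yhat and ehat are arbitrary names for the new variables:

predict yhat, xb
predict ehat, residuals
scatter ehat yhat, yline(0)

This produces the same residuals-versus-fitted display as rvfplot, with a reference line at zero, and leaves the residuals available as a variable for further inspection.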
You can also produce a scatterplot using the Stata menu options as follows:
Statistics → Linear models and related → Regression diagnostics → Residual-versus-fitted plot
A dialog box named “rvfplot - Residual-versus-fitted plot” will open. Simply click OK to produce the scatterplot.
Figure 2 shows what the dialog box looks like in Stata.
To add a line at y = 0, select the “Y axis” tab at the top of the dialog box and click on “Reference lines” as shown in Figure 3.
This opens the “Reference lines (y axis)” dialog box. Tick the box next to “Add lines to graph at specified y values” by clicking on it. In the text box below, enter “0” as shown in Figure 4.
Click Accept to return to the previous dialog box, then click OK to produce the scatterplot with a line at y = 0.
There are several formal tests for heteroscedasticity that can be carried out in Stata. In this example, we will use the Breusch–Pagan test. Following the regression, enter the following command in the Command window:

estat hettest
Press Enter to produce the Breusch–Pagan test statistic.
To do this using the menu options, select the following options from the Stata menu:
Statistics → Postestimation
In the “Postestimation Selector” dialog box that opens, click on the plus control next to “Specification, diagnostic, and goodness-of-fit analysis” to expand the content.
Figure 5 shows what this looks like in Stata.
Click on “Tests for heteroskedasticity” and press Launch to produce a second dialog box, “estat - Postestimation statistics for regress.” In the box at the top, “Tests for heteroskedasticity (hettest)” should be highlighted. Directly beneath that, select “Breusch-Pagan/Cook-Weisberg” from the drop-down options. Ensure that the button next to “Use fitted values of the regression” is checked.
Press OK to run the command.
Figure 6 shows what this looks like in Stata.
Figure 7 presents a table of results that are produced by the simple linear regression procedure in Stata.
The top section of the table provides an analysis of variance for the model as a whole. While these results are not the focus of this example, we note that the R-Squared figure reported to the upper right of the table measures the proportion of the variance in the dependent variable explained by the model. In this case, the model consists of a single independent variable. An R-Squared of .498 means that almost 50% of the variance in reading scores is accounted for by math scores.
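If you want to retrieve this figure programmatically rather than reading it off the output, Stata stores the R-Squared from the most recent estimation in e(r2):

display e(r2)

This displays the same proportion of variance explained that appears in the upper right of the regression table.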
The bottom part of the table presents the estimates of the intercept, or constant (_cons), and the slope coefficient. The results report an estimate of the intercept (or constant) as equal to approximately 13.971. The constant of a simple regression model can be interpreted as the average expected value of the dependent variable when the independent variable equals zero. In this case, our independent variable, c1r4mscl, can never be zero, so the constant by itself does not tell us much.
The slope coefficient linking math scores to reading scores is estimated to be approximately 0.81. This represents the average marginal effect of the math score on the reading score and can be interpreted as the expected change, on average, in the dependent variable for a one-unit increase in the independent variable. For this example, that means that every one-point increase in a student’s math score is associated with an average increase of 0.81 of a point in a kindergarten student’s reading score. The bottom section of the table in Figure 7 reports that this estimate is statistically significantly different from zero, with a p value well below .001. This leads us to reject the null hypothesis and conclude that there does appear to be a positive relationship between math scores and reading scores among kindergarten students.
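To see what these estimates imply, you can compute a predicted reading score directly from the stored coefficients. This is a sketch assuming the regression has just been run; the math score of 30 is an arbitrary illustrative value:

display _b[_cons] + _b[c1r4mscl]*30

Stata stores the estimated intercept in _b[_cons] and the slope in _b[c1r4mscl], so this displays the expected reading score for a student with a math score of 30 (roughly 13.97 + 0.81 × 30 ≈ 38.3).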
Figure 8 presents a plot with the residuals of this regression on the Y-axis and the predicted values of the dependent variable on the X-axis. Figure 8 shows that the vertical spread of the residuals is relatively low among students with lower predicted reading scores. However, as we move left to right and the predicted reading scores increase, we see the spread of the residuals also increasing. The resulting image appears like a cone or fan that is spreading out as we move from left to right in the figure. This means that the variance of the residuals is not constant and, thus, we appear to have evidence of heteroscedasticity.
Figure 9 presents the results of the Breusch–Pagan test for heteroscedasticity, with a test statistic of 11,543.55. When compared to a Chi-Squared distribution with one degree of freedom, the resulting p value falls well below the standard .05 level. Thus, we have clear evidence to reject the null hypothesis of homoscedasticity and conclude that we do in fact have heteroscedasticity in the residuals of this regression model.
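You can confirm the reported p value from the test statistic yourself using Stata’s chi2tail() function (assuming the test statistic of 11,543.55 reported above):

display chi2tail(1, 11543.55)

chi2tail(df, x) returns the upper-tail probability of a Chi-Squared distribution with df degrees of freedom. Here that probability is vanishingly small, consistent with rejecting the null hypothesis of homoscedasticity.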
Download this sample dataset and see whether you can replicate these results. Then repeat the analysis, replacing the Fall 1998 reading and math scores used here with the reading score (c2r4rscl) and math score (c2r4mscl) measured in the Spring of 1999, and explore whether there is evidence of heteroscedasticity in the residuals of that regression.
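A sketch of the commands for this practice exercise, assuming the same dataset is open and the Spring 1999 variable names given above:

regress c2r4rscl c2r4mscl
rvfplot, yline(0)
estat hettest

These estimate the new model, plot its residuals against the fitted values with a reference line at zero, and run the Breusch–Pagan test, mirroring the steps followed in the worked example.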