How-to Guide for Stata
Introduction

In this guide, you will learn how to estimate a multiple regression model in Stata using a practical example to illustrate the process. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in Stata.

Contents
• 1 Multiple Regression
• 2 An Example in Stata: Knowledge and Attitudes About Science
• 2.1 The Stata Procedure
• 2.2 Exploring the Stata Output

1 Multiple Regression

Multiple regression expresses a dependent, or response, variable as a linear function of two or more independent variables. It extends simple regression, which is limited to a single independent variable. Estimating the model means estimating an intercept (often called a constant) and one slope for each independent variable; together, these describe the expected value of the dependent variable at any particular combination of values of the independent variables. Most attention is typically focused on the slope estimates because they capture the relationship between the dependent variable and each independent variable, holding the other independent variables constant. The dependent variable should be continuous. This example focuses on independent variables that are also continuous, though the model can also accommodate categorical independent variables (see Regression with dummy variables).
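
For two independent variables, the model described above can be written as a single equation. The symbols are illustrative notation and do not appear in the Stata output: Y is the dependent variable, X1 and X2 are the independent variables, beta_0 is the intercept, beta_1 and beta_2 are the partial slopes, and epsilon is the error term for respondent i.

```latex
\[
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i
\]
```

Adding further independent variables adds one slope term each, without changing the basic form.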

2 An Example in Stata: Knowledge and Attitudes About Science

This example uses three variables from the 2005 Eurobarometer (EB 63.1):

• Score on a science “quiz” composed of 13 true/false items (kstot)
• Attitude to science and faith (question wording: “We rely too much on science and not enough on faith”; responses on a five-point scale from strongly disagree to strongly agree) (toomuchscience)
• Age measured in years (age)

The science knowledge quiz has a range of 0–13. Its mean is about 8.7. The attitude to science and faith question has five categories, ranging from 0 to 4, with a mean of about 2.5. Age has a range of 15–93, with a mean of about 45. For the purposes of this example, we treat all variables as continuous.

2.1 The Stata Procedure

When conducting a multiple regression, it is often wise to examine each variable in isolation first. Summary statistics for each variable can be compiled using the summarize command, followed by the variables of interest. Enter the following command in the Stata Command window:

summarize toomuchscience kstot age

Press Enter to produce summary statistics detailing the number of observations, mean, standard deviation, and range for each variable.
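
If the basic summary is not detailed enough, two standard variations may help. The following is a minimal sketch, assuming the dataset is already open. The detail option adds percentiles, skewness, and kurtosis, and tabulate lists how many respondents chose each category of the five-point attitude item.

```stata
* Extended summary statistics, including skewness and kurtosis
summarize toomuchscience kstot age, detail

* Frequency table for the five-point attitude item
tabulate toomuchscience
```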

Next, we present histograms of science knowledge, attitude, and age. The histogram for age can be created in Stata by entering the following command in the Command window:

histogram age

Press Enter to produce a histogram. By default, Stata plots density on the vertical axis. To plot frequencies instead, enter the following command:

histogram age, frequency

Alternatively, you can create a histogram by selecting options from the Menu as follows:

Graphics → Histogram

In the histogram dialog box that opens, you will see a textbox labelled “Variable” in the upper left-hand corner. Use the drop-down menu to select age from the list of variables. To the right of the “Variable” box, you will see two buttons asking you to specify whether data are discrete or continuous. Ensure that the “Data are continuous” option has been selected. In the lower right-hand corner under “Y axis”, select “Frequency”. Click OK to perform the analysis.

The same procedure can be followed to produce histograms of science knowledge and attitude, replacing the variable age with kstot and with toomuchscience. In the case of variables kstot and toomuchscience, the data should be treated as discrete. The Stata command for science knowledge, for example, is as follows:

histogram kstot, discrete frequency

(If using the menu options, ensure the “Data are discrete” option in the dialog box is selected.)
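
The three histograms can also be produced in one pass and viewed together. This is a sketch rather than part of the original procedure; the graph names hist_age, hist_kstot, and hist_att are illustrative labels chosen here and can be replaced with any valid names.

```stata
* Store each histogram in memory under its own name
* (hist_age, hist_kstot, and hist_att are illustrative names)
histogram age, frequency name(hist_age, replace)
histogram kstot, discrete frequency name(hist_kstot, replace)
histogram toomuchscience, discrete frequency name(hist_att, replace)

* Display the three stored graphs side by side
graph combine hist_age hist_kstot hist_att
```

Naming the graphs keeps each one in memory, so that a later histogram command does not overwrite it.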

Screenshots of the procedure for producing histograms in Stata are available in the How-to Guides for the Dispersion of a Continuous Variable topic, which is part of SAGE Research Methods Datasets.

You estimate a multiple regression model in Stata by entering the regress command in the Command window, followed first by the dependent variable toomuchscience and then by the independent variables kstot and age. The command is as follows:

regress toomuchscience kstot age

Press Enter to run the analysis.

Entering the command as above into the Stata Command window is the simplest way to carry out this estimation. However, the model can also be estimated by using the menu options as follows:

Statistics → Linear models and related → Linear regression

In the “regress - Linear regression” dialog box that opens, two boxes are provided for specifying the dependent and independent variables to be included in the model. In the “Dependent variable” box, select toomuchscience from the drop-down menu. In the “Independent variables” box, select kstot and age.

Once you are done, click OK to perform the analysis.

Figure 1 shows what the dialog box looks like in Stata.

Figure 1: Selecting Multiple Regression From the Statistics Menu in Stata.
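
Whichever route is used, Stata keeps the estimates in memory until another model is estimated, so individual results can be retrieved without reading them off the table. A minimal sketch, assuming the model has just been run as described:

```stata
* Re-estimate the model so that the results are in memory
regress toomuchscience kstot age

* Individual estimates: the intercept and the two partial slopes
display _b[_cons]
display _b[kstot]
display _b[age]

* Number of observations used and the R-squared
display e(N)
display e(r2)
```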

2.2 Exploring the Stata Output

Figures 2, 3, and 4 present histograms for each variable.

Figure 2 shows that the mean age in the sample is about 45 years with a standard deviation of just over 17 years. The distribution looks approximately normal, with a slight positive skew.

Figure 3 shows that the majority of values on the science knowledge quiz score cluster between 5 and 11. There is a slight negative skew to the distribution. Overall, there is little reason for concern as to the appropriateness of the variable for inclusion.

Figure 4 shows that the mean score on the science and faith attitude variable is just over 2. Only five discrete values are possible, given the response options available. However, OLS regression is relatively robust to violations of the assumption of normality, and in fact the distribution looks approximately normal, with a slight negative skew.

Figure 2: Histogram Showing the Distribution of Age in Years, 2005 Eurobarometer (EB 63.1).

Figure 3: Histogram Showing the Distribution of Science Knowledge Quiz Scores, 2005 Eurobarometer (EB 63.1).

Figure 4: Histogram Showing the Distribution of Science Attitude Scores, 2005 Eurobarometer (EB 63.1).

It is also useful to explore the possible correlation between your independent variables. In this case, the Pearson correlation coefficient between age and kstot is −0.116. It is statistically significantly different from zero, but the correlation is relatively small in absolute terms, and we therefore have little concern about multicollinearity influencing this regression analysis.
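
The guide does not show how this correlation was obtained; one way to compute it, together with a common multicollinearity check, is sketched below. Note that pwcorr uses every case with valid values on both variables, which may differ slightly from the set of cases used in the regression.

```stata
* Pairwise Pearson correlation, with its significance level and number of observations
pwcorr kstot age, sig obs

* Variance inflation factors, available once the model has been estimated
regress toomuchscience kstot age
estat vif
```

With only two weakly correlated independent variables, the variance inflation factors would be expected to sit close to 1.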

Figure 5 presents a table of results that are produced by the multiple regression procedure in Stata.

Figure 5: Multiple Regression of Attitude to Science on Science Knowledge and Age, 2005 Eurobarometer (EB 63.1).

The top section of the table provides an analysis of variance for the model as a whole. While these results are not the focus of this example, we note that the R-Squared figure reported to the upper right of the table measures the proportion of the variance in the dependent variable explained by the model. In this case, the model consists of two independent variables. An R-Squared of .032 means that only about 3.2% of the variance in attitudes is accounted for by knowledge and age.
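
The link between the analysis of variance and the R-Squared can be checked directly from the results Stata stores after regress. A brief sketch, assuming the same model:

```stata
regress toomuchscience kstot age

* Model sum of squares divided by total sum of squares
display e(mss) / (e(mss) + e(rss))

* The same quantity as stored by Stata
display e(r2)
```

The two display commands should return the same value, since the R-Squared is the model sum of squares as a share of the total.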

The bottom section of the table presents the estimates of the intercept, or constant (_cons), and the slope coefficients. The estimate for the constant is approximately 2.8. The constant of a multiple regression model can be interpreted as the average expected value of the dependent variable when all of the independent variables equal zero. In this case, only a handful of respondents score zero on science knowledge, and no one is aged zero, so the constant by itself does not tell us much. Researchers do not often have predictions based on the intercept, so it often receives little attention.
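
One optional way to give the constant a more useful meaning, not used in this example, is to centre each independent variable on its mean before estimating the model. The sketch below assumes the same dataset; kstot_c and age_c are illustrative variable names introduced here. Each mean is taken over all cases with a valid value on that variable, which may differ slightly from the cases used in the regression.

```stata
* Centre each independent variable on its sample mean
* (kstot_c and age_c are illustrative variable names)
summarize kstot, meanonly
generate kstot_c = kstot - r(mean)
summarize age, meanonly
generate age_c = age - r(mean)

* The slopes are unchanged; the constant is now the expected attitude score
* for a respondent with average knowledge and average age
regress toomuchscience kstot_c age_c
```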

The estimated value for the partial slope coefficient linking knowledge to attitude is approximately −0.08. This represents the average marginal effect of knowledge on attitude and can be interpreted as the expected change in the dependent variable, on average, for a one-unit increase in the independent variable, controlling for age. It is called a partial coefficient because it represents the unique association with the dependent variable, not that which is shared with the other independent variable(s). For this example, that means that each one-point increase in quiz score is associated with a decrease in attitude score of about 0.08, adjusted for age. Bearing in mind the valence of the question wording, this means that those who are more knowledgeable tend to be more favourable towards science, i.e., they tend to disagree with the statement.
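
The meaning of "a one-unit increase, controlling for age" can be made concrete by asking Stata for predicted attitude scores at two adjacent quiz scores. This is a sketch; the quiz scores 5 and 6 are arbitrary illustrative choices.

```stata
regress toomuchscience kstot age

* Expected attitude score at quiz scores of 5 and 6, holding age at its mean
margins, at(kstot=(5 6)) atmeans
```

The difference between the two predicted values equals the slope on kstot, and it would be the same for any other pair of adjacent scores because the model is linear.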

The estimated value for the partial slope coefficient linking age to attitude is approximately 0.002. This represents the average marginal effect of each additional year of age on attitude and can be interpreted as the expected change in the dependent variable, on average, for a one-unit increase in the independent variable, controlling for science knowledge. For this example, that means that for every year older a person is, their attitude score is expected to increase by 0.002, controlling for science knowledge. This may seem like a very small effect, but remember that it is the effect of only one additional year. Bearing in mind the valence of the question wording, this means that older people tend to be less favourable towards science, i.e., they tend to agree with the statement. The table also reports that both slope estimates are statistically significantly different from zero. This leads us to reject both null hypotheses and conclude that both science knowledge and age appear to be related to attitudes.
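
Because a single year is a small unit, it can help to rescale the age effect. Using the rounded slope of 0.002, a ten-year difference corresponds to roughly 10 × 0.002 = 0.02 points, and the full observed age range of 78 years (15 to 93) to roughly 0.16 points. The sketch below asks Stata for the ten-year figure from the unrounded estimate, and adds a joint test of the two slopes as an optional extra.

```stata
regress toomuchscience kstot age

* Expected difference in attitude score for a ten-year age difference,
* with a standard error and confidence interval
lincom 10*age

* Joint test that both slopes are zero
test kstot age
```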

There are multiple diagnostic tests researchers might perform following the estimation of a multiple regression model to evaluate whether the model appears to violate any of the OLS assumptions or whether there are other kinds of problems, such as particularly influential cases. Describing all of these diagnostic tests is beyond the scope of this example.
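
As a starting point only, a few commonly used post-estimation commands are sketched below. This is not a complete diagnostic workflow, and cooks_d is an illustrative variable name introduced here.

```stata
regress toomuchscience kstot age

* Residuals plotted against fitted values, with a reference line at zero
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Cook's distance for each case, to help flag influential observations
predict cooks_d, cooksd
summarize cooks_d, detail
```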