In this guide, you will learn how to estimate a simple regression model in Stata using a practical example to illustrate the process. Readers are provided links to the example dataset and encouraged to replicate this example. An additional practice example is suggested at the end of this guide. The example assumes you have already opened the data file in Stata.
Simple regression expresses a dependent, or response, variable as a linear function of an independent variable. This requires estimating an intercept (often called a constant) and a slope that describes the expected value of the dependent variable at any particular value of the independent variable. Most attention is typically focused on the slope estimate because it captures the relationship between the dependent and independent variables. The dependent variable should be continuous. This example will focus on using an independent variable that is also continuous, though the model can also accommodate a categorical independent variable (see Regression with dummy variables).
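The linear function described above can be written out explicitly. The notation here is an assumption added for illustration and does not come from the original guide: y is the dependent variable, x is the independent variable, the betas are the intercept and slope, and epsilon is the error term for observation i.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\mathrm{E}[\,y_i \mid x_i\,] = \beta_0 + \beta_1 x_i
```

Read this way, the intercept is the expected value of y when x equals zero, and the slope is the expected change in y for a one-unit increase in x.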
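This example uses two variables from the 2012 U.S. Statistical Abstracts, both of which are actually measured in 2007:
infantmort: the number of infant deaths per 1,000 births in each state
poverty: the percentage of each state’s population living in poverty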
The poverty rate variable has a mean of 12.67, a standard deviation of 3.12, a minimum value of 7.1, and a maximum value of 20.6. The infant mortality variable has a mean of 7.07, a standard deviation of 1.50, a minimum value of 4.8, and a maximum value of 13.1.
When conducting a simple regression, it is often wise to examine each variable in isolation first. Summary statistics for each variable can be compiled using the summarize command, followed by the variables of interest. Enter the following command in the Stata Command window:
summarize infantmort poverty
Press Enter to produce summary statistics detailing the number of observations, mean, standard deviation, and range for each variable.
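If you want more than the default output, the following is a minimal sketch of two optional checks, assuming the variables are named infantmort and poverty as in this example. The detail option adds percentiles, skewness, and kurtosis, and the count command reports how many observations would be dropped from a regression because of missing values.

```stata
* Detailed summaries: adds percentiles, skewness, and kurtosis
summarize infantmort poverty, detail

* Count observations with a missing value on either variable
count if missing(infantmort, poverty)
```

Comments beginning with an asterisk can be typed in the Command window or saved in a do-file.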
Next, we present histograms of infant mortality and poverty rates across the states. The histogram for infant mortality can be created in Stata by entering the following command in the Command window:
histogram infantmort
Press Enter to produce a histogram. By default, Stata will produce a density histogram. To select frequency, enter the following command instead:
histogram infantmort, frequency
Alternatively, you can create a histogram by selecting options from the Menu as follows:
Graphics → Histogram
In the histogram dialog box that opens, you will see a text box labeled “Variable” in the upper left-hand corner. Use the drop-down menu to select infantmort from the list of variables. To the right of the “Variable” box, you will see two buttons asking you to specify whether data are discrete or continuous. Ensure that the “Data are continuous” option has been selected. In the lower right-hand corner under “Y axis,” select “Frequency.” Click OK to perform the analysis.
The same procedure can be followed to produce a histogram of poverty, replacing the variable infantmort with poverty.
Screenshots for the procedure to produce histograms in Stata are available in the How-to Guides for the Dispersion of a Continuous Variable topic that is part of SAGE Research Methods Datasets.
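For readers working from the Command window, the poverty histogram and an optional normal-curve overlay might look like the sketch below. The graph names hist_infantmort and hist_poverty are hypothetical labels chosen here so that the second graph does not overwrite the first.

```stata
* Frequency histogram of state poverty rates
histogram poverty, frequency

* Optional: overlay a normal curve to help judge skew by eye,
* storing each graph under its own (hypothetical) name
histogram infantmort, frequency normal name(hist_infantmort, replace)
histogram poverty, frequency normal name(hist_poverty, replace)
```

The normal option is only a visual aid; it does not test whether a variable is normally distributed.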
You estimate a simple regression model in Stata by entering the regress command in the Command window, followed first by the dependent variable, infantmort, and then by the independent variable, poverty. The command is as follows:
regress infantmort poverty
Press Enter to run the analysis.
Entering the command as above into the Stata Command window is the simplest way to carry out this estimation. However, the simple regression model can also be estimated by using the menu options as follows:
Statistics → Linear models and related → Linear regression
In the “regress - Linear regression” dialog box that opens, two boxes are provided to specify the dependent and independent variables to include in the model. In the “Dependent variable” box, select infantmort from the drop-down menu. In the “Independent variables” box, select poverty.
Once you are done, click OK to perform the analysis. Figure 1 shows what the dialog box looks like in Stata.
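It can also help to look at the relationship the model is estimating. The following sketch, which is an optional addition rather than part of the original procedure, draws a scatterplot of the two variables and overlays the fitted regression line.

```stata
* Scatterplot of infant mortality against poverty with the fitted line overlaid
twoway (scatter infantmort poverty) (lfit infantmort poverty)
```

The line drawn by lfit is the one implied by the intercept and slope that regress reports for the same two variables.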
Figure 2 shows that the majority of values for infant mortality cluster near the mean of about 7. However, the distribution does have a slight positive skew, meaning there are a few more extreme values above the mean than below it. Further diagnostics might be warranted to make sure those observations are not unduly influencing the results of the regression.
Figure 3 shows that the values for state poverty rates cluster somewhat near their mean of about 12.7. There is a slight positive skew, but there are also a fair number of observations below the mean. Further diagnostics might be warranted to make sure the few observations with larger values are not unduly influencing the results of the regression.
Figure 4 presents a table of results that are produced by the simple linear regression procedure in Stata.
The top section of the table provides an analysis of variance for the model as a whole. While these results are not the focus of this example, we note that the R-Squared figure reported to the upper right of the table measures the proportion of the variance in the dependent variable explained by the model. An R-Squared of .289 means that almost 29% of the variance in the infant mortality rate across the states is accounted for by the poverty rate.
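The R-Squared figure can be recovered from results that Stata stores after estimation. The sketch below assumes the regress command from this example has just been run. It shows three equivalent routes: the stored value, the ratio of the model sum of squares to the total sum of squares, and the squared correlation between the two variables. The last route works only in simple regression.

```stata
* Run immediately after: regress infantmort poverty

* R-squared as stored by regress
display e(r2)

* Model sum of squares divided by total sum of squares
display e(mss) / (e(mss) + e(rss))

* In simple regression, R-squared equals the squared correlation
correlate infantmort poverty
display r(rho)^2
```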
The bottom section of the table presents the estimates of the intercept, or constant (_cons), and the slope coefficient. The intercept is estimated at approximately 3.78. The constant of a simple regression model can be interpreted as the expected value of the dependent variable when the independent variable equals zero. In this case, no state has a poverty rate anywhere near zero (the observed minimum is 7.1), so the constant by itself does not tell us much. Researchers rarely base predictions on the intercept alone, so it often receives little attention.
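One way to get more interpretable quantities is to evaluate the model at poverty rates that actually occur in the data, or to re-center the independent variable. The following is a minimal sketch, assuming the regression above has just been run. The values 7.1, 12.67, and 20.6 are the minimum, mean, and maximum of poverty reported earlier, and poverty_c is a hypothetical variable name introduced here.

```stata
* Run after: regress infantmort poverty
* Expected infant mortality at the minimum, mean, and maximum poverty rates
margins, at(poverty = (7.1 12.67 20.6))

* Center poverty at its sample mean (poverty_c is a hypothetical name)
summarize poverty, meanonly
generate poverty_c = poverty - r(mean)

* The slope is unchanged; the constant is now the expected value
* of infantmort for a state with an average poverty rate
regress infantmort poverty_c
```

Using the rounded estimates reported in this guide, 3.78 + 0.26 × 12.67 ≈ 7.07, which matches the mean infant mortality rate. This is the expected pattern, because a least-squares line passes through the means of both variables.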
The estimated value for the slope coefficient linking state poverty rate to state infant mortality rate is approximately 0.26. This represents the average marginal effect of the poverty rate on the infant mortality rate and can be interpreted as the expected change, on average, in the dependent variable for a one-unit increase in the independent variable. For this example, this means that every increase of one percentage point in the share of a state’s population living in poverty is associated with an average increase of 0.26 infant deaths per 1,000 births. The table also reports that this estimate is statistically significantly different from zero. This leads us to reject the null hypothesis that the slope equals zero and conclude that there does appear to be a positive relationship between poverty rates and infant mortality across the U.S. states.
There are multiple diagnostic tests researchers might perform following the estimation of a simple regression model to evaluate whether the model appears to violate any of the OLS assumptions or whether there are other kinds of problems such as particularly influential cases. Describing all of these diagnostic tests is beyond the scope of this example.
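As a starting point only, a few standard post-estimation commands can be run right after the regression. This is a sketch under stated assumptions, not a complete diagnostic workflow: cooksd is a hypothetical variable name, and the 4/n cutoff is a common rule of thumb rather than a formal test.

```stata
* Run after: regress infantmort poverty
* Residuals plotted against fitted values, with a reference line at zero
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Cook's distance for each observation (cooksd is a hypothetical name)
predict cooksd, cooksd

* List cases above the common 4/n rule of thumb
list infantmort poverty cooksd if cooksd > 4/e(N) & !missing(cooksd)
```

Any cases flagged this way deserve a closer look, but being flagged is not by itself a reason to drop an observation.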
Download this sample dataset to see whether you can replicate these results. Then, repeat the analysis, this time replacing the infant mortality variable with the variable named fsperrecip, which measures how much a state spent on food stamps per recipient in 2007.
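The practice exercise can follow the same sequence of commands used in this guide. The sketch below assumes the dataset is already open and that fsperrecip is the dependent variable, with poverty kept as the independent variable.

```stata
* Summary statistics for the new dependent variable and poverty
summarize fsperrecip poverty

* Frequency histogram of food stamp spending per recipient
histogram fsperrecip, frequency

* Simple regression of food stamp spending per recipient on the poverty rate
regress fsperrecip poverty
```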