- 00:11
BEN GOODRICH: Hi. I'm Ben Goodrich, a lecturer at Columbia University. [Ben Goodrich, Lecturer, Quantitative Methods in the Social Sciences] And I teach primarily in the Quantitative Methods in the Social Sciences master's program. In this tutorial, I'll be discussing the best practices for diagnosing whether the assumptions of your linear regression model are met using graphical and numeric output from that regression model.

- 00:32
BEN GOODRICH [continued]: The key points that I want to talk about in this tutorial are five. First, what are the assumptions of the ordinary least squares estimator? Second, what are outliers and high leverage points? Third, what graphical diagnostics can we use to assess the assumptions of the ordinary least squares estimator? Fourth, what statistical tests can

- 00:54
BEN GOODRICH [continued]: we use to assess those assumptions? And fifth, if some of those assumptions appear to be violated, what alternative estimators might we use? So why is this important? Regression diagnostics indicate whether an assumption of your regression model has been violated. If one of those assumptions is violated, then the estimates are compromised.

- 01:15
BEN GOODRICH [continued]: Maybe the estimates of the coefficients. Maybe the estimates of their uncertainty, like their standard errors and confidence intervals. Either way, you would want to know when those assumptions are violated. And maybe take alternative action. [What are the assumptions of the ordinary least squares estimator?]

- 01:36
BEN GOODRICH [continued]: What are the assumptions of the ordinary least squares estimator? Some books will phrase the following assumptions a little bit differently. The ones I'm going to be talking about now are paraphrased from an introductory econometrics book by Jeffrey Wooldridge, whose citation you'll see at the end. The first assumption is that the linear model can be written for each observation

- 01:57
BEN GOODRICH [continued]: as an outcome y is equal to a constant or intercept beta 0 plus beta 1 times x1, plus beta 2 times x2, et cetera. Where the betas are called coefficients and the observations on the x's are predictors of that outcome. Finally, there is an error term epsilon

- 02:18
BEN GOODRICH [continued]: that is added on to the right-hand side of the equation to complete this linear model. It's linear because it only involves addition and multiplication, no trigonometric functions, or squaring, or anything like that. The second assumption is that the data are a random sample from a well-defined population. For example, we may have a subset of 1,000 people who

- 02:41
BEN GOODRICH [continued]: are asked public opinion polls. Those are a sample from the population of all adults or all registered voters in the United States. The third assumption is that none of the predictors are an exact linear function of the other predictors for all observations. That means for at least one of the observations,

- 03:01
BEN GOODRICH [continued]: you can't write x1 as a linear function of the other predictors. This assumption is almost always satisfied in practice if you have a simple random sample from a well-defined population. The fourth assumption is that you expect the error term epsilon to be zero, no matter what

- 03:22
BEN GOODRICH [continued]: the values of the predictors, the x's, are for that observation or any other observation in your data set. These four assumptions are sufficient for the ordinary least squares estimator to give unbiased estimates for the coefficients beta. The fifth assumption that the errors are distributed normally

- 03:42
BEN GOODRICH [continued]: with a constant variance, sigma squared, is necessary for the OLS estimator to be optimal in some sense, and also for the usual standard errors of our estimates to be valid. [Outliers & High Leverage Points] Let's denote an estimate for the k-th coefficient beta

- 04:03
BEN GOODRICH [continued]: with what's known as beta hat k, with a little hat on top of it. The prediction of our model for y is denoted y hat, where we just plug in estimates for the coefficients in our linear model but omit the error term. So we have beta hat 0 plus beta hat 1 times x1,

- 04:24
BEN GOODRICH [continued]: plus beta hat 2 times x2, et cetera. Then a residual is defined as the difference between the outcome for that observation y and the model's prediction for that outcome, y hat. So our residual is simply y minus y hat. An outlier can be loosely defined

- 04:44
BEN GOODRICH [continued]: as a point that is relatively far away from the other points in your data set, either vertically or horizontally. A high leverage point is defined as a point that is horizontally far away from the rest of the points in the data set. And a residual that is both far away from zero and has high leverage means that that observation

- 05:06
BEN GOODRICH [continued]: can have an outsized influence on the coefficients that you estimated. And you might need to take alternative action that we'll discuss at the end of this video. As you can see on this plot, we have a scatterplot for one predictor x on the horizontal axis, and one outcome variable y on the vertical axis.
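In symbols, the model, two of its assumptions, the prediction, and the residual described so far can be written as below. The observation index $i$, the number of predictors $K$, and the symbol $e_i$ for the residual are notation assumed here for illustration; the talk describes these quantities only in words.

$$
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \epsilon_i && \text{(assumption 1: linear model)} \\
\mathbb{E}\left[\epsilon_i \mid x_{i1}, \dots, x_{iK}\right] &= 0 && \text{(assumption 4: zero expected error)} \\
\epsilon_i &\sim \mathcal{N}\left(0, \sigma^2\right) && \text{(assumption 5: normal errors, constant variance)} \\
\hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \dots + \hat{\beta}_K x_{iK} && \text{(prediction)} \\
e_i &= y_i - \hat{y}_i && \text{(residual)}
\end{aligned}
$$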

- 05:28
BEN GOODRICH [continued]: Most of the points in black constitute a cloud in the middle of the plot. But we have three unusual points. In the bottom left, we have a point in red. This is a high leverage point because it's far to the left of the main cloud of points. It also has a fairly large negative residual in

- 05:51
BEN GOODRICH [continued]: that it's off the gray dotted line, indicating the estimated regression relationship. So that is a residual that is large and has high leverage. In the top right, we have a point in purple that is also high leverage because it's far to the right of the main cloud of points in the data set.

- 06:14
BEN GOODRICH [continued]: However, it is pretty close to the regression line in gray. So it's not a large outlier. This is what's known as a good leverage point. Finally, we have a point in the top middle in blue that is an outlier because it is vertically separated from the main cloud of points.

- 06:36
BEN GOODRICH [continued]: But it's not high leverage because it's pretty much in the center of the points horizontally. So the blue point is not causing any undue influence to our regression estimates. Only the red point might be. [Graphical Diagnostics to Test Assumptions of the OLS Estimator]

- 06:57
BEN GOODRICH [continued]: Now I'm going to be talking about a regression model where the prestige of various occupations is predicted by the average income that those occupations earn and the average level of education that people in those occupations have, in addition to the type of occupation it is, whether blue collar, white collar, or professional.

- 07:20
BEN GOODRICH [continued]: On the horizontal axis, we have a measure of leverage, which is how influential a point can be. On the vertical axis, we have standardized residuals, which are residuals divided by their estimated standard error sigma. Here we see that most of the points have leverage between 0 and 0.1, which is good. Individually, those points don't have very much leverage.
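The two axes of this plot can be sketched numerically. The following is a minimal illustration, assuming a model with a single predictor (so leverage has a simple closed form) and made-up data rather than the occupational prestige data from the talk. The function names are hypothetical. Note one assumption beyond the talk: the residuals here are scaled by the estimated sigma times the square root of one minus the leverage, which is a common definition of a standardized residual; dividing by sigma alone is a cruder version.

```python
import math

def simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares; return (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def leverage(x):
    """Hat values h_i = 1/n + (x_i - x_bar)^2 / Sxx for a one-predictor model."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def standardized_residuals(x, y):
    """Residuals divided by sigma_hat * sqrt(1 - h_i)."""
    b0, b1 = simple_ols(x, y)
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    h = leverage(x)
    n, p = len(x), 2  # p counts the intercept and the slope
    sigma_hat = math.sqrt(sum(e ** 2 for e in resid) / (n - p))
    return [e / (sigma_hat * math.sqrt(1 - hi)) for e, hi in zip(resid, h)]

# Made-up data: the last point sits far to the right of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
h = leverage(x)
r = standardized_residuals(x, y)
for xi, hi, ri in zip(x, h, r):
    print(f"x={xi:5.1f}  leverage={hi:.3f}  std_resid={ri:+.2f}")
```

In this sketch the point that is horizontally far from the rest receives by far the largest leverage, which matches the definition of a high leverage point given earlier.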

- 07:44
BEN GOODRICH [continued]: There's a handful of points that have greater leverage, as indicated by going farther toward the right of the plot. But most of those points do not have dramatically large residuals. However, there is one point that has high leverage and a residual of about 3 in standardized units.

- 08:06
BEN GOODRICH [continued]: And that is ministers, which is labeled at the top of the plot. This makes sense because ministers are thought of as a very prestigious position, but they don't earn a lot of income in their job. Thus the relationship that tends to hold for most occupations where higher income is associated with higher prestige

- 08:29
BEN GOODRICH [continued]: doesn't necessarily pertain to religious leaders. Thus the observation in our data set for ministers may be adversely affecting the estimates that we get, biasing the coefficient on income closer to 0. What's known as Cook's distance is a combined measure of leverage and having

- 08:54
BEN GOODRICH [continued]: a high or low residual. These are indicated in a variety of computer software packages. And in this plot, they are indicated by the dotted red lines. So here we see minister, having both high leverage and a large residual, has a large Cook's distance. Cook's distance is, loosely speaking,

- 09:14
BEN GOODRICH [continued]: an indication of how much the coefficients that you estimate would change if that point were excluded from your data set. So if we were to leave out ministers, would our estimates change a lot or a little? For observations with a large Cook's distance, they would change a lot. This next plot is another one that you'll see often.
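The leave-one-out idea behind Cook's distance can be checked directly. The sketch below assumes a one-predictor model and made-up data, and the function names are hypothetical. It computes Cook's distance two ways: from the usual closed form that combines the residual and the leverage, and from the definition, by refitting without each point and measuring how far the fitted values move.

```python
def fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def error_variance(x, y, p=2):
    """Estimated error variance: sum of squared residuals over (n - p)."""
    b0, b1 = fit(x, y)
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - p)

def cooks_distance(x, y, p=2):
    """Closed form: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)."""
    n = len(x)
    b0, b1 = fit(x, y)
    s2 = error_variance(x, y, p)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    out = []
    for xi, yi in zip(x, y):
        e = yi - b0 - b1 * xi                    # residual
        h = 1 / n + (xi - x_bar) ** 2 / sxx      # leverage
        out.append(e ** 2 * h / (p * s2 * (1 - h) ** 2))
    return out

def cooks_distance_by_refitting(x, y, i, p=2):
    """Same quantity from its definition: refit without point i and measure
    how far all n fitted values move, scaled by p * s^2."""
    b0, b1 = fit(x, y)
    c0, c1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    moved = sum(((b0 + b1 * xj) - (c0 + c1 * xj)) ** 2 for xj in x)
    return moved / (p * error_variance(x, y, p))

# Made-up data: the last point has high leverage and falls below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]
d = cooks_distance(x, y)
for i, di in enumerate(d):
    print(f"point {i}: closed form={di:.4f}  refit={cooks_distance_by_refitting(x, y, i):.4f}")
```

The two columns agree, which is why software can report Cook's distance without actually refitting the model once per observation.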

- 09:36
BEN GOODRICH [continued]: On the horizontal axis, we have the fitted values, y hat. On the vertical axis, we have the residuals. Here we're looking for whether there is any indication that some observations have different error variance than other observations. In this particular case, it basically

- 09:57
BEN GOODRICH [continued]: just looks like a sea of points. There's a couple that have a large residual. But there's no real pattern to any of the dots here. If we saw a curvilinear relationship like a U-shape, or a situation where all the points were clustered close to 0 when y hat is small

- 10:18
BEN GOODRICH [continued]: but they get farther away from 0 when y hat is large, then we would be concerned that the fifth assumption does not hold. However, in this case, there's not a whole lot going on. So we wouldn't be particularly worried about that assumption. This next plot is designed to assess whether the errors of our regression model are normally distributed.

- 10:40
BEN GOODRICH [continued]: On the horizontal axis, we have the theoretical quantiles of the standard normal distribution. On the vertical axis, we have the empirical quantiles of the residuals, which are intended to be an estimate of our errors. Here we see that most of the points line up on a straight line, which suggests

- 11:02
BEN GOODRICH [continued]: that the normality assumption is perhaps appropriate for the bulk of the data set. However, there's a couple of points in the top right that are well off the gray line. And this indicates that the distribution of the errors for extreme right-tail values is not consistent with a normal distribution.
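The coordinates of a normal quantile plot like this one can be sketched with the standard library alone. This is an illustration under assumptions: the residuals are made up, the function name is hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several that software packages use.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted residual with a standard normal theoretical quantile."""
    n = len(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    empirical = sorted(residuals)
    return list(zip(theoretical, empirical))

# Made-up residuals with one extreme value in the right tail.
residuals = [-1.2, -0.7, -0.3, -0.1, 0.0, 0.2, 0.4, 0.8, 1.1, 4.5]
points = qq_points(residuals)
for theo, emp in points:
    print(f"theoretical={theo:+.3f}  empirical={emp:+.3f}")
```

If the residuals were roughly normal, the empirical values would grow about linearly with the theoretical ones. Here the last pair breaks that pattern, which is the numeric counterpart of a point sitting well off the line in the top right of the plot.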

- 11:25
BEN GOODRICH [continued]: And again, that calls into question the fifth assumption that we made for the ordinary least squares estimator. These next plots are called added variable plots. On the vertical axis, we have the outcome variable prestige predicted by all the other variables except

- 11:45
BEN GOODRICH [continued]: for the one on the horizontal axis. So in the top left plot, we have an added variable plot of prestige, given the other variables, conditional on income, given the other variables. What this means is that you do a regression of income on all the variables except prestige and take the residuals, and plot those on the horizontal axis.

- 12:08
BEN GOODRICH [continued]: And then regress prestige on all the other variables except income, take those residuals, and plot them on the vertical axis. The estimated coefficient on the income predictor is the slope of the resulting line, which is given here in red. Again, you would want to look for points that are both high leverage and high residual.
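The construction just described can be verified numerically. The sketch below is a simplified illustration: it assumes only one other predictor (education) instead of the full set in the talk, uses made-up numbers rather than the actual prestige data, and the function names are hypothetical. It builds the two sets of residuals and checks that the slope between them equals the income coefficient from the regression on both predictors.

```python
def simple_fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def residuals(x, y):
    """Residuals from regressing y on an intercept and x."""
    b0, b1 = simple_fit(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def two_predictor_slope(x1, x2, y):
    """Coefficient on x1 from regressing y on an intercept, x1, and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Made-up occupation-level data (not the data set used in the talk).
income = [20.0, 35.0, 50.0, 42.0, 80.0, 15.0, 60.0, 28.0]
education = [8.0, 12.0, 14.0, 10.0, 16.0, 9.0, 15.0, 11.0]
prestige = [30.0, 45.0, 60.0, 40.0, 85.0, 28.0, 70.0, 50.0]

x_axis = residuals(education, income)    # income given education
y_axis = residuals(education, prestige)  # prestige given education
_, av_slope = simple_fit(x_axis, y_axis)
full_slope = two_predictor_slope(income, education, prestige)
print(f"added variable slope = {av_slope:.6f}")
print(f"multiple regression  = {full_slope:.6f}")
```

The two printed slopes match, so an unusual point in the added variable plot is unusual with respect to that one coefficient, after accounting for the other predictors.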

- 12:32
BEN GOODRICH [continued]: In the case of the top left, we see a point that is far to the left. It has high leverage. And it's pretty far off the red line, indicating high residual. I believe this point is, again, for the minister occupation. You can do the same thing for the other predictors

- 12:52
BEN GOODRICH [continued]: in the model. For example, in the top right we have the added variable plot for prestige versus the education of the occupation. And in this case, it doesn't look as if there are too many points that have both high leverage and a big residual. There are also some added variable plots

- 13:14
BEN GOODRICH [continued]: at the bottom for the predictors indicating whether the job is professional class or working class, relative to white collar occupations. [Statistical Test to Assess Assumptions] Moving on to statistical tests, first we need to define a null hypothesis.

- 13:35
BEN GOODRICH [continued]: A null hypothesis is something that we assume to be true. And ask, what is the probability of observing a statistic at least as extreme as the one we got if that hypothesis were true? This probability is represented by what is known as a P value. And we usually in the social sciences reject a null hypothesis if the P value is less than 0.05.

- 13:59
BEN GOODRICH [continued]: However, failure to reject the null hypothesis doesn't constitute affirmative evidence that the null hypothesis is true because we're assuming the null hypothesis is true in order to derive the following test statistics. Different statistical packages may support a variety of null hypotheses that can be used to assess the assumptions of your regression

- 14:20
BEN GOODRICH [continued]: model. Tests known as the RESET test and the Rainbow test are for the null hypothesis that the relationship between the outcome and the predictors is linear, which is the first assumption that we made for the ordinary least squares estimator. Against the alternative hypothesis that the relationship between the outcome and the predictors

- 14:40
BEN GOODRICH [continued]: is nonlinear. If you do one of these tests and get a P value that's extremely low, you may choose to reject that null hypothesis in favor of the alternative hypothesis that the relationship is nonlinear, and then turn to an alternative approach. The Breusch-Pagan test is a generalization of the White test, both of which have

- 15:01
BEN GOODRICH [continued]: the null hypothesis that the variance of the errors is a constant and is thus unrelated to any of the predictors. Again, if you receive a P value that is extremely small, you may choose to reject that null hypothesis. And if so, you're rejecting the fifth assumption made in the ordinary least squares estimator.
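A test of constant error variance can be sketched in a few lines. The version below is an assumption-laden illustration, not the exact procedure of any particular package: it uses the studentized (Koenker) form of the Breusch-Pagan statistic, which is the sample size times the R-squared from regressing the squared residuals on the predictor, and it is restricted to one predictor so that the reference distribution is chi-squared with one degree of freedom. The data are made up and the function names are hypothetical.

```python
import math

def fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def r_squared(x, y):
    """Share of the variation in y explained by a regression on x."""
    b0, b1 = fit(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def breusch_pagan(x, y):
    """Studentized Breusch-Pagan statistic and P value for one predictor."""
    b0, b1 = fit(x, y)
    squared_resid = [(yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, squared_resid)
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value

# Made-up data whose noise grows with x, so the variance is not constant.
x = [float(i) for i in range(1, 21)]
y = [2 + 0.5 * i + 0.1 * i * (-1) ** i for i in range(1, 21)]
lm, p_value = breusch_pagan(x, y)
print(f"LM statistic = {lm:.2f}, P value = {p_value:.5f}")
```

Because the spread of the residuals in this example widens as x grows, the P value comes out small, and the null hypothesis of constant variance would be rejected at the usual 0.05 level.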

- 15:25
BEN GOODRICH [continued]: [Alternative Estimators] If the relationship between an outcome variable and one of the predictors or multiple of the predictors seems to be nonlinear, you can use an alternative estimator such as nonlinear least squares. Or try including quadratic or interaction terms

- 15:45
BEN GOODRICH [continued]: in your regression model, which you can read about, but are beyond the scope of this tutorial. If you think your model is essentially correct, or at least the first assumption of linearity is correct, but the errors are not normally distributed with a constant variance, which is the fifth assumption, then you can use what are known as White-corrected or robust standard errors, which give, in a manner of speaking,

- 16:10
BEN GOODRICH [continued]: good estimates of the standard errors even if the fifth assumption does not hold, provided that the first four assumptions do hold. Finally, if your estimates seem to be adversely affected by points that are both high leverage and have a large residual, such as the minister in our running example, you can use an alternative estimator

- 16:33
BEN GOODRICH [continued]: of the linear model that minimizes the absolute value of the residuals or some other function of the residuals that gives less weight to extreme points than the ordinary least squares estimator, which penalizes the square of the residuals. [Conclusion]

- 16:55
BEN GOODRICH [continued]: In this tutorial, we've learned about tools that can help you tell when an assumption of the ordinary least squares regression estimator has been violated. If one of the five assumptions is violated, then you can't trust something about the estimates. Perhaps the coefficient estimates, perhaps their standard errors, depending on which assumption is violated.

- 17:16
BEN GOODRICH [continued]: But knowing which assumption is suspect can point you in the direction of an alternative estimator that would tend to produce better estimates. For more information, please read chapter 6 of John Fox and Sanford Weisberg's book, An R Companion to Applied Regression, Second Edition, published by Sage in 2011.

- 17:37
BEN GOODRICH [continued]: Or check out Jeffrey Wooldridge's book, Introductory Econometrics, any of the more recent editions.
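As a closing illustration of the White-corrected standard errors mentioned under alternative estimators, the sketch below compares the classical standard error of a slope with a robust one. It rests on several assumptions: a single predictor, made-up data, a hypothetical function name, and the simplest robust variant (often labeled HC0), which uses the squared residuals directly with no small-sample adjustment.

```python
import math

def slope_standard_errors(x, y):
    """Return (slope, classical SE, White HC0 robust SE) for y = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dev = [xi - x_bar for xi in x]
    sxx = sum(d ** 2 for d in dev)
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dev, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Classical formula: assumes one common error variance for every point.
    classical = math.sqrt(sum(e ** 2 for e in resid) / (n - 2) / sxx)
    # HC0 formula: lets each point contribute its own squared residual.
    robust = math.sqrt(sum(d ** 2 * e ** 2 for d, e in zip(dev, resid)) / sxx ** 2)
    return b1, classical, robust

# Made-up data whose noise grows with x, violating constant variance.
x = [float(i) for i in range(1, 21)]
y = [2 + 0.5 * i + 0.1 * i * (-1) ** i for i in range(1, 21)]
b1, classical_se, robust_se = slope_standard_errors(x, y)
print(f"slope = {b1:.4f}")
print(f"classical SE = {classical_se:.4f}, robust (HC0) SE = {robust_se:.4f}")
```

The coefficient estimate is the same under both approaches; only the estimate of its uncertainty changes, which is consistent with the point that the first four assumptions are what the coefficients themselves rely on.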

### Video Info

**Publisher:** SAGE Publications Ltd

**Publication Year:** 2017

**Video Type:** Tutorial

**Methods:** Regression analysis, Ordinary least squares, Cook's distance

**Keywords:** clergy; income; mathematical concepts; ministers; occupations; prestige; white collar

### Segment Info

**Segment Num.:** 1

## Abstract

Professor Ben Goodrich discusses regression models in quantitative research. He demonstrates how to tell when an assumption of the ordinary least squares regression estimator has been violated and how to navigate around violations to get better estimates.