- 00:00
[MUSIC PLAYING]

- 00:11
BEN LAMBERT: In this video, I want to talk about the assumptions of ANOVA and how we can go about testing whether those assumptions are upheld in the data. So imagine that we have an example where we're looking at the income of, perhaps, individual families, perhaps one year after a particular type of intervention.

- 00:31
BEN LAMBERT [continued]: And so individual crosses here represent individual family data. And we're supposing that there are three different treatments here. So some individuals just remain self-financed. They receive no type of intervention. Some have access to microfinance. And some received aid.

- 00:52
BEN LAMBERT [continued]: So the question we'd like to answer here is does the type of finance scheme, to which a family has access, actually influence their income in the longer run? So in this example, here, you might think that this would be a good circumstance to use ANOVA. So what we need to do, first of all, is write down the model according to ANOVA.

- 01:13
BEN LAMBERT [continued]: So what we have here is that the income of family i, within group j (so j, here, just refers to the type of finance scheme to which the family had access), is equal to the group-specific mean, mu j, plus an error term, epsilon ij, which is idiosyncratic.
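
The model just described can be written out as follows. This is a sketch of the spoken description; the symbol Y for a family's income is notation assumed here, since the transcript does not name it.

```latex
Y_{ij} = \mu_j + \varepsilon_{ij},
\qquad j \in \{\text{self-financed},\ \text{microfinance},\ \text{aid}\}
```

Here Y_{ij} is the income of family i in group j, \mu_j is the group-specific mean, and \varepsilon_{ij} is the idiosyncratic error term.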

- 01:36
BEN LAMBERT [continued]: So before we run ANOVA here, we need to check that certain assumptions are upheld. The first assumption that we need to check is the assumption of homoscedasticity. So what exactly does this mean? Well, the idea here is that, if we think about this error term as having some sort of variance,

- 01:58
BEN LAMBERT [continued]: so we imagine that the variance of epsilon ij is equal to, let's say, sigma j squared, so there is a group-specific variance. Then the assumption of homoscedasticity just says that sigma 1 squared equals sigma 2 squared --

- 02:20
BEN LAMBERT [continued]: and in this example, there are just 3 groups -- equals sigma 3 squared, which equals the sort of overall variance. So what does this mean? It just means that the variance doesn't vary by group. The second assumption, that we need to make sure is upheld, is the assumption of normality.

- 02:41
BEN LAMBERT [continued]: So typically, what we do in ANOVA is assume that the error term, epsilon ij, is normally distributed, with a mean of 0 and a variance of sigma squared. The final assumption that we need to think about is the assumption of independence.
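
As a compact restatement of the first two assumptions for the three-group example, the following sketch uses the same symbols as the spoken description (sigma j squared for the group-specific variance, sigma squared for the common variance).

```latex
% Homoscedasticity: the error variance is the same in every group
\operatorname{Var}(\varepsilon_{ij}) = \sigma_j^2,
\qquad \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma^2

% Normality: errors are normal with mean zero and the common variance
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```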

- 03:04
BEN LAMBERT [continued]: So that just means that our data are independent of one another. And typically, this is a hard thing to check. And typically, it's the thing which the majority of analyses are criticized for. So what are the consequences if we have a violation of each of these assumptions? Well, if we have a violation of independence,

- 03:25
BEN LAMBERT [continued]: this is typically very bad. And what that normally means is that we need to change our model to account for the sort of dependent structure that we actually see within our data. So an example of the type of model that you might use here might be repeated measures ANOVA, where you have observations for the same individuals over time.

- 03:49
BEN LAMBERT [continued]: And in particularly bad circumstances, this can mean that it's very difficult to carry out inference in any way. So violations of independence are very serious. Violations of normality are much less serious. The reason for that is that there are ANOVA tests which

- 04:09
BEN LAMBERT [continued]: are robust to non-normal data. So even if you have non-normal residuals or non-normal errors, technically, then you can still carry out ANOVA using these types of tests. And also, if you have enough data points, then the central limit theorem kicks in, which says that, overall, your errors start

- 04:30
BEN LAMBERT [continued]: looking like they're normally distributed anyway. But nonetheless, we still need to be aware of whether or not our data are normally distributed, because all of the inferential tests, that we've spoken about before, are based on this assumption of normality. So, if we have a violation of normality, then it can impair our ability to carry out inference

- 04:52
BEN LAMBERT [continued]: correctly. Finally, if we think about violations of homoscedasticity, in other words, we have heteroscedastic data, then we find that this is actually also pretty serious. And in particular, the presence of heteroscedasticity actually makes the effect of having non-normal data even

- 05:12
BEN LAMBERT [continued]: more detrimental for inference. However, we shall see in the next video, on remedial measures for violations of each of the assumptions, that you can actually correct for heteroscedasticity much of the time. So it's not as serious a problem as a violation of independence. So, now going through the assumptions in turn,

- 05:34
BEN LAMBERT [continued]: we're going to talk about how we can actually test for each of these conditions. So starting off with the assumption of homoscedasticity, the idea here is that the first thing you should always do is plot your data. This is going to be a common theme across at least these first two tests

- 05:56
BEN LAMBERT [continued]: and actually even in the test that we have for independence. And what we're looking for, when we plot our data, are differences in the variance by group. So you can either plot the data, as we have down below, or you can plot the residuals. Both of them basically show the same thing.
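
As a rough illustration of what such a plot summarizes, the sketch below computes residuals (each observation minus its group mean) and two measures of spread per group. The income figures and the helper names `group_residuals` and `spread_summary` are invented for this example; they are not from the video.

```python
from statistics import mean, quantiles, variance

# Invented incomes per finance scheme, for illustration only.
incomes = {
    "self-financed": [10, 11, 9, 10, 12, 8],
    "microfinance": [14, 9, 18, 11, 20, 12],
    "aid": [13, 5, 25, 9, 30, 8],
}


def group_residuals(values):
    """Residuals: each observation minus the estimated group mean."""
    m = mean(values)
    return [v - m for v in values]


def spread_summary(values):
    """Sample variance and interquartile range of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    return {"variance": variance(values), "iqr": q3 - q1}


for scheme, values in incomes.items():
    # Clearly different spreads across groups hint at heteroscedasticity.
    print(scheme, spread_summary(group_residuals(values)))
```

Subtracting the group mean shifts each group without changing its spread, which is why plotting the data and plotting the residuals show essentially the same thing.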

- 06:19
BEN LAMBERT [continued]: And just to remind you, the residuals are just the difference between the actual data, for individuals, within each group, and the estimated group-specific mean. And a nice way to plot your data or your residuals is to use box and whisker plots, that we've spoken about before. So perhaps, what we might do here

- 06:39
BEN LAMBERT [continued]: is draw box and whisker plots on all of our data. And this is very easy to do using statistical software. So by drawing these box and whisker plots, we're able to see, very quickly, for the data that we have here, that there are potentially differences

- 07:01
BEN LAMBERT [continued]: in the variance by group. So for our data set here, it looks like we've got a violation of this assumption of homoscedasticity. In practice, it's difficult to determine where the cutoff is. When do we decide that the variances of the groups are sufficiently different such that you conclude that you have a violation

- 07:22
BEN LAMBERT [continued]: of homoscedasticity? So we need to carry out some sort of statistical test. But by graphing our data first, we're not carrying out those tests blindly. And we should always be wary of a statistical test that contradicts what we saw when we actually plotted the data. So the second way we can actually test this assumption

- 07:45
BEN LAMBERT [continued]: is to carry out some sort of statistical test. One such test is known as the Levene test. And it works in a very intuitive way. It calculates something which is known as the W statistic, which is given by, on the numerator, the sum across all j groups of something, which I'm going to define later,

- 08:09
BEN LAMBERT [continued]: Z bar j minus Z bar. And then the numerator is just divided through by q minus 1, where q is the number of groups. And then on the denominator, we have the sum across both i and j of Z ij minus Z bar j, again, all squared.

- 08:35
BEN LAMBERT [continued]: I should have put squared, up here, on the top, as well. All divided through by n minus q. So, now I need to define exactly what I mean by Z ij. Well, Z ij is just equal to the absolute deviation of Y ij from the group-specific mean.
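
Written out, the statistic just described looks roughly as follows. Two points here are assumptions rather than quotations from the video: the data are written as Y_{ij} (income of family i in group j), and the numerator carries the group-size weight n_j, which is part of the standard definition of Levene's statistic but is not read aloud in the transcript.

```latex
W = \frac{\dfrac{1}{q-1} \displaystyle\sum_{j=1}^{q} n_j \left( \bar{Z}_{j} - \bar{Z} \right)^2}
         {\dfrac{1}{n-q} \displaystyle\sum_{j=1}^{q} \sum_{i=1}^{n_j} \left( Z_{ij} - \bar{Z}_{j} \right)^2},
\qquad
Z_{ij} = \left| Y_{ij} - \bar{Y}_{j} \right|
```

Here q is the number of groups, n_j is the number of observations in group j, n is the total number of observations, \bar{Z}_{j} is the mean of the Z_{ij} within group j, and \bar{Z} is the mean of all the Z_{ij}.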

- 08:56
BEN LAMBERT [continued]: So the numerator is essentially comparing the group average absolute deviation with the overall average deviation. So intuitively, if there is a difference between these two things that's pretty big, then that indicates that there are deviations and, hence, variances that vary by group.

- 09:19
BEN LAMBERT [continued]: So if this numerator is big, then that's indicative of the fact that we've probably got a problem. But we need to compare the numerator with something. So we compare it with the spread of the individual absolute deviations around the within-group mean absolute deviation. So if the numerator is big relative to the denominator,

- 09:41
BEN LAMBERT [continued]: in other words, we're getting W which is much greater than 1, then that's indicative of the fact that we probably have unequal variances across groups. Just how big does this thing have to be before we would conclude that we have a violation of homoscedasticity? Well, the idea is that, under the null hypothesis of the fact that we have homoscedastic errors,

- 10:03
BEN LAMBERT [continued]: then this whole thing happens to be F distributed with q minus 1 degrees of freedom for its first input, which is just the thing that we divide the numerator by, and n minus q degrees of freedom for its second input. So if we get a value of W which is above a critical value for this F distribution, then we reject the null of homoscedastic errors.
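
A minimal sketch of the W calculation, using only the Python standard library, is given below. The function name `levene_w` and the income figures are invented for illustration. The numerator includes the group-size weight from the standard definition of Levene's statistic, and the `center` argument is a design choice of this sketch so that a different notion of group center can be swapped in. Comparing W with an F critical value needs tables or a statistics library; for example, `scipy.stats.levene` with `center="mean"` reports this statistic together with a p-value.

```python
from statistics import mean


def levene_w(groups, center=mean):
    """Levene-type W statistic for a list of groups (each a list of numbers)."""
    q = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    # Z_ij: absolute deviation of each observation from its group's center.
    z = []
    for g in groups:
        c = center(g)
        z.append([abs(y - c) for y in g])
    z_bar_j = [mean(zj) for zj in z]     # mean absolute deviation per group
    z_bar = sum(sum(zj) for zj in z) / n  # overall mean absolute deviation
    # Numerator: between-group variation in the deviations, over q - 1.
    between = sum(
        len(zj) * (zbj - z_bar) ** 2 for zj, zbj in zip(z, z_bar_j)
    ) / (q - 1)
    # Denominator: within-group variation in the deviations, over n - q.
    within = sum(
        (v - zbj) ** 2 for zj, zbj in zip(z, z_bar_j) for v in zj
    ) / (n - q)
    return between / within


# Invented incomes for the three finance schemes, for illustration only.
self_financed = [10, 11, 9, 10, 12, 8]
microfinance = [14, 9, 18, 11, 20, 12]
aid = [13, 5, 25, 9, 30, 8]
print(levene_w([self_financed, microfinance, aid]))
```

A value of W well above 1 points towards unequal variances, but the decision itself rests on the comparison with the F distribution with q minus 1 and n minus q degrees of freedom.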

- 10:25
BEN LAMBERT [continued]: There are many types of tests for homoscedasticity. Another one that I just want to talk about here, that's commonly used, is the Brown-Forsythe test. And it works in exactly the same way as the Levene test, except, instead of using the sample mean,

- 10:46
BEN LAMBERT [continued]: we replace this with the sample median. And the reason that some people prefer the Brown-Forsythe test is that it is robust to non-normal data. So that's particularly important if you look at your data and it doesn't look particularly normal. Now, if we think about testing for normality of our data,

- 11:11
BEN LAMBERT [continued]: then the first thing we should do is exactly the same as that which we spoke about for the test of homoscedasticity, and that is to plot your data. There are a number of different plots that you can do to check visually whether your data looks like it's normally distributed. The first and most obvious of these

- 11:32
BEN LAMBERT [continued]: is to draw a histogram of potentially your residuals. And here, what you might see, if your data look reasonably normal, is something which, at least, in sort of histogram terms, looks like it's normally distributed. So here, the shape of the graph isn't particularly

- 11:53
BEN LAMBERT [continued]: different to normal. And you can imagine fitting a normal distribution to that. So visually, it looks like, for our particular case, that the data is normally distributed. An alternative plot, that you can do, that is frequently used is something which is known as a q-q plot. And that just stands for quantile-quantile plot.

- 12:15
BEN LAMBERT [continued]: And the idea here is that, on the x-axis, you will plot the theoretical quantiles of a normal distribution against the empirical quantiles on the y-axis that you actually get from your residuals. So the idea is, if the data were actually normal,

- 12:35
BEN LAMBERT [continued]: then it would follow the y equals x line here. If your data are very non-normal, then you might see significant deviations from the y equals x line. After we've plotted our data, we may want to carry out a test for normality. Because sometimes it is difficult visually

- 12:56
BEN LAMBERT [continued]: to understand exactly where there will be a cutoff between normality and non-normality. Again, there are many tests that we can carry out. Some of the ones that are quite popular: the first is the Jarque-Bera test. And the way in which this test works is by looking at the skewness and the kurtosis of your data

- 13:17
BEN LAMBERT [continued]: and comparing those values with those of a theoretical normal distribution. Another popular test is the Shapiro-Wilk test. And this is a test which is based on the theoretical order statistics for a normal distribution. Another popular test is the Kolmogorov-Smirnov test,

- 13:39
BEN LAMBERT [continued]: which is a general non-parametric test, which basically compares the CDF of, in this case, a normal distribution with the empirical CDF of your data. So in summary, to test for normality, we plot the data using a histogram or a q-q plot.
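
Two of the ideas above, the q-q plot and the Jarque-Bera test, can be sketched with the Python standard library alone. The residual values and the function names `qq_points` and `jarque_bera` are invented for illustration. The sketch assumes the theoretical quantiles come from a normal distribution fitted with the sample mean and standard deviation, so that normal data sit near the y equals x line, and it uses the usual large-sample result that the Jarque-Bera statistic follows a chi-squared distribution with 2 degrees of freedom under normality.

```python
import math
from statistics import NormalDist, mean


def qq_points(residuals):
    """Return (theoretical normal quantile, empirical quantile) pairs."""
    n = len(residuals)
    m = mean(residuals)
    s = math.sqrt(sum((r - m) ** 2 for r in residuals) / (n - 1))
    fitted = NormalDist(m, s)  # normal with the sample mean and std deviation
    ordered = sorted(residuals)
    # Plotting positions (k - 0.5) / n are one common convention (assumed).
    return [
        (fitted.inv_cdf((k - 0.5) / n), x)
        for k, x in enumerate(ordered, start=1)
    ]


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(residuals)
    m = mean(residuals)
    m2 = sum((r - m) ** 2 for r in residuals) / n
    m3 = sum((r - m) ** 3 for r in residuals) / n
    m4 = sum((r - m) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5        # 0 for a normal distribution
    kurtosis = m4 / m2 ** 2          # 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)      # upper tail of chi-squared with 2 df
    return jb, p_value


# Invented residuals, for illustration only.
residuals = [-2.1, -1.3, -0.8, -0.2, 0.0, 0.3, 0.9, 1.2, 2.0]
for theoretical, empirical in qq_points(residuals):
    print(f"{theoretical:7.3f} {empirical:7.3f}")
print(jarque_bera(residuals))
```

The chi-squared approximation is a large-sample result, so with as few points as in this toy example the p-value should be read with caution.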

- 14:00
BEN LAMBERT [continued]: And there is an array of different tests, which allow us to test whether or not our data are in fact non-normal. Finally, we discuss the concept of independence and whether it is, in fact, possible to test whether or not the data are independent. Well, the majority of the time, this

- 14:21
BEN LAMBERT [continued]: is a very difficult thing to explicitly test. And that's because whether or not your data are independent is typically dictated by the way in which the study was designed. So this is really something we should be thinking about when you're actually planning your study. However, there are a few ways, one of which we'll discuss here, that

- 14:41
BEN LAMBERT [continued]: allow you to kind of visually check, under certain circumstances, whether or not you have a violation of independence. The one we're going to talk about here is something which is known as a sequence plot. So what we imagine we're doing is, perhaps for our example, we're collecting the income, over time,

- 15:03
BEN LAMBERT [continued]: for particular families. So perhaps we're repeating the same measurements for the same family, or perhaps we're just collecting the data for different families at different points in time. And what we might do is we might draw a graph of our residuals over time. So here, if we draw a graph and there is a systematic component

- 15:28
BEN LAMBERT [continued]: to our residuals, then that is indicative of the fact that there might be some sort of systematic factor that means that our data are no longer independent. In this case, we have a dependent structure over time. And note that this particular plot doesn't necessarily have to be carried out with time as being

- 15:51
BEN LAMBERT [continued]: the variable on the x-axis. Perhaps you're collecting data on individuals geographically, and so you have a measure of how far those individuals are from, let's say, a capital city. Then what you could do is you could plot the residuals against the distance from the capital city. And again, if there was a persistent pattern, then

- 16:11
BEN LAMBERT [continued]: that might indicate that there is some sort of dependence within your data. So in summary, we've seen that there are a variety of ways of testing the assumptions of ANOVA. And violation of these assumptions has some consequences for ANOVA. In the next video, we're going to discuss what we can actually do if we do find that there are these violations

- 16:33
BEN LAMBERT [continued]: of these assumptions.
[MUSIC PLAYING]

### Video Info

**Series Name:** ANOVA

**Episode:** 8

**Publisher:** SAGE Publications Ltd

**Publication Year:** 2017

**Video Type:** Tutorial

**Methods:** Analysis of variance, Heteroscedasticity

**Keywords:** independence; mathematical concepts; mathematics; microfinance; welfare

### Segment Info

**Segment Num.:** 1

## Abstract

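Professor Ben Lambert presents episode 8 of the ANOVA series. This segment explains the assumptions of ANOVA (homoscedasticity, normality, and independence), the consequences of violating them, and ways to test them, using a demonstration problem of families that are self-financed, have access to microfinance, or receive aid.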