  • 00:14

    RICHARD PARKER: Hello, my name is Richard Parker. I'm a senior statistician at the Edinburgh Clinical Trials Unit, University of Edinburgh. My expertise is in applied medical statistics. So in this video, I provide guidance on multiple testing and also clarify when and how we might adjust for this

  • 00:35

    RICHARD PARKER [continued]: in clinical research studies. This is based on my own experience as a practicing medical statistician. [Multiple testing and family-wise error rate] So multiple testing, sometimes referred to more generally as multiplicity or multiple

  • 00:55

    RICHARD PARKER [continued]: comparisons, is common in medical research. So this happens whenever we are testing more than one hypothesis. Now it is often claimed that multiple testing increases the type I error rate. But what does that actually mean? And do we always have to adjust for multiple testing

  • 01:15

    RICHARD PARKER [continued]: in the statistical analysis? Well, these are the questions we will try to answer in this video. Firstly, when people say that multiple testing increases the type I error rate, what do they actually mean, or rather, what should they mean? So what they should mean is that multiple testing increases

  • 01:36

    RICHARD PARKER [continued]: the probability of a false significant result among all the hypotheses being tested, or among all comparisons being performed. Formally, this is called the family-wise error rate. [Family-wise error rate FWER] So the family-wise error rate is the probability of making at least one type I error among

  • 01:58

    RICHARD PARKER [continued]: all the hypothesis tests that we are performing. Now it's important to note that the family-wise error rate always increases with multiple testing. If we have performed multiple hypothesis tests or multiple comparisons, then the probability of at least one false significant result

  • 02:18

    RICHARD PARKER [continued]: is always higher than the type I error of the individual tests. So that's not in doubt. So the question is not, does it increase the overall type I error rate? The question is, does it matter? For example, if we're doing an exploratory study where we're trying to identify interesting correlations

  • 02:39

    RICHARD PARKER [continued]: between variables or differences between groups that are worthy of further investigation, there is no need to adjust for the increased probability of a type I error due to multiple testing, because it's not important. A false significant result does not carry with it the same negative consequences as for a definitive study,

  • 03:03

    RICHARD PARKER [continued]: because further work will be done. Equally, if we're interested in each specific hypothesis test, then controlling the overall probability of at least one false significant result makes no sense. In this case, we are interested in the individual hypothesis tests, and the individual error rates.

  • 03:27

    RICHARD PARKER [continued]: And our focus is on controlling the individual per-comparison error rates. To make these ideas concrete, suppose we have conducted a clinical trial to investigate if a vitamin D intervention improves outcomes for people with multiple sclerosis, or MS,

  • 03:48

    RICHARD PARKER [continued]: we could, for example, have an indicator of disease progression as our primary outcome, and for secondary outcomes we could measure, for example, 20 other outcomes including pain scores, blood pressure, quality of life, and so on. Now, suppose we found a statistically significant difference in the quality of life measure

  • 04:09

    RICHARD PARKER [continued]: but not for the other secondary outcomes or the primary outcome. Then can we validly conclude that the intervention is effective? The answer is no. Because the probability of at least one false significant result across all outcomes is very high in this situation.
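This probability can be computed directly. A minimal sketch, assuming 20 independent tests each performed at a 5% significance level:

```python
# Family-wise error rate (FWER) for m independent tests at level alpha:
# FWER = 1 - (1 - alpha)^m, the probability of at least one false positive.
def family_wise_error_rate(alpha: float, m: int) -> float:
    return 1 - (1 - alpha) ** m

fwer = family_wise_error_rate(0.05, 20)
print(round(fwer, 2))  # 0.64: at least one false positive is very likely

# Expected number of false significant results under the null: alpha * m
print(0.05 * 20)  # 1.0: on average, one false significant result in 20 tests
```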

  • 04:30

    RICHARD PARKER [continued]: So it's around 64% for independent tests. And we would expect on average to see one false significant result if we conducted 20 tests. So in a study design, if we're going to conclude that an intervention is effective on the basis of one significant result, when

  • 04:52

    RICHARD PARKER [continued]: we have performed multiple testing, then the family-wise error rate is highly relevant. And we need to use a multiple testing adjustment in this situation. However, if we have a specific interest in the effect of vitamin D on quality of life, then the family-wise error rate doesn't matter.

  • 05:14

    RICHARD PARKER [continued]: And we're most interested in the individual per-comparison error rate for quality of life. In this case, we don't need to perform a multiple testing adjustment, but we do need to make a very specific and precise interpretation of the result.

  • 05:35

    RICHARD PARKER [continued]: So for example, we make reference to the outcome quality of life, and we say that there's a significant effect of vitamin D on quality of life. As another example, suppose we have designed a multi-arm trial to look at the effectiveness of various interventions to reduce the duration and symptoms of the common cold.

  • 05:58

    RICHARD PARKER [continued]: [Multi-arm trial] If these interventions are distinct, and we're interested in the individual effectiveness of each of them, then we don't need to adjust for multiple testing. Our main focus in this case would be on the per-comparison error rate. As a third example, suppose we are interested in the long-term

  • 06:19

    RICHARD PARKER [continued]: effectiveness of a complex exercise-based intervention in hypertensive patients. And our primary outcomes are myocardial infarction, stroke, transient ischemic attack, heart failure, and coronary heart disease. If we're going to conclude that the intervention works based

  • 06:40

    RICHARD PARKER [continued]: on a significant improvement in one of these outcomes, then we need to adjust for multiple testing. This is because we have multiple opportunities to conclude that the intervention works. Alternatively, we could form a composite outcome from the different outcomes.

  • 07:01

    RICHARD PARKER [continued]: This means, for example, we could redefine our outcome into a single outcome which compares at least one of the cardiovascular outcomes versus no cardiovascular outcomes. And in this case, we would avoid the multiple testing issue.
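This redefinition can be sketched as a single binary composite endpoint. The field names below are hypothetical, purely for illustration:

```python
# Composite outcome: a patient counts as having the outcome if they experienced
# at least one of the component cardiovascular events.
# The dictionary keys are hypothetical field names, not from the video.
CARDIOVASCULAR_EVENTS = [
    "myocardial_infarction", "stroke", "transient_ischemic_attack",
    "heart_failure", "coronary_heart_disease",
]

def composite_outcome(patient: dict) -> bool:
    return any(patient.get(event, False) for event in CARDIOVASCULAR_EVENTS)

print(composite_outcome({"stroke": True}))  # True: at least one event occurred
print(composite_outcome({}))                # False: no cardiovascular events
```

A single hypothesis test on this composite then replaces the five separate tests, which is why the multiple testing issue disappears.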

  • 07:21

    RICHARD PARKER [continued]: It's also worth noting that if we demand that all hypothesis tests are statistically significant for us to conclude that the intervention is effective, then it's true we're performing multiple testing, but we don't need to use a multiple testing adjustment procedure in this case.

  • 07:41

    RICHARD PARKER [continued]: Because we need all of the hypothesis tests to be significant for us to make an overall conclusion that the intervention is effective. [Methods for adjusting for the family-wise error rate] So what happens if we decide that the family-wise error

  • 08:05

    RICHARD PARKER [continued]: rate does matter? Well, in this case, we need to decide how to adjust for multiple testing, and various techniques are available. So the most frequently used multiple testing correction procedure is the Bonferroni method. [Bonferroni Method]

  • 08:25

    RICHARD PARKER [continued]: So this simply involves dividing the overall significance level by the number of tests performed. So for example, if we perform five tests and our overall significance level is 5%, then it's just 5% divided by 5.

  • 08:48

    RICHARD PARKER [continued]: So for each individual test in this case, we use a 1% significance threshold. Alternatively, if we wanted, we could divide the overall error rate allocation in a different way, such that the individual error rates all add up to 5%.

  • 09:10

    RICHARD PARKER [continued]: But the most common way to do it is simply to divide the overall level by the number of tests performed. So in the exercise-based intervention example, if we are testing the five separate outcomes, we divide the significance level, for example 5%, by 5.

  • 09:34

    RICHARD PARKER [continued]: So this means that we actually use an adjusted significance level of 1% to compare the p-values against. This procedure will then control the overall family-wise error rate at 5%. So you can see on the screen that we've run the various tests on all of the outcomes.

  • 09:56

    RICHARD PARKER [continued]: And we get these p-values out. And if we compare each of these p-values to the 1% significance threshold, we can see that only one of the hypothesis tests was significant. So this was the test based on myocardial infarction, which gave a p-value of 0.002.
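A minimal sketch of this Bonferroni comparison. Only three p-values (0.002, 0.012, and 0.03) are quoted in this video; the other two are hypothetical placeholders:

```python
# Bonferroni: compare each p-value against alpha / m (here 0.05 / 5 = 0.01).
alpha, m = 0.05, 5

# 0.002, 0.012, and 0.03 are quoted in the video; 0.20 and 0.45 are
# hypothetical placeholders for the two outcomes whose p-values aren't given.
p_values = {
    "myocardial infarction": 0.002,
    "ischemic heart disease": 0.012,
    "stroke": 0.03,
    "transient ischemic attack": 0.20,  # hypothetical
    "heart failure": 0.45,              # hypothetical
}

threshold = alpha / m  # 0.01
significant = [name for name, p in p_values.items() if p < threshold]
print(significant)  # ['myocardial infarction']
```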

  • 10:19

    RICHARD PARKER [continued]: So that was the only outcome significant after using the Bonferroni method. So this method of adjustment, the Bonferroni adjustment, is very common and very simple to use. One of its main criticisms, though, is that it's too conservative. In other words, in some cases we over-adjust

  • 10:42

    RICHARD PARKER [continued]: for multiple testing. This happens if the test statistics are positively correlated, which is likely if the outcomes are related. So sometimes it is too conservative; it's not a very powerful test. However, the Bonferroni adjustment does provide strong error rate

  • 11:02

    RICHARD PARKER [continued]: control, which means that the family-wise error rate is controlled regardless of the true effect sizes and regardless of the configuration of null and alternative hypotheses. Some multiple comparison procedures only provide weak control, which means that they're only

  • 11:23

    RICHARD PARKER [continued]: guaranteed to control the error rate under certain conditions, for example, only when all null hypotheses are true. Another procedure is called the Holm method. [Holm Method] It's sometimes called the Holm-Bonferroni procedure. And this is a method that's more powerful than the Bonferroni

  • 11:44

    RICHARD PARKER [continued]: method, and also provides strong error rate control. So this method involves ordering the p-values according to their magnitude from lowest to highest. It begins at the same significance level as the Bonferroni procedure, and then tests the other hypotheses

  • 12:04

    RICHARD PARKER [continued]: at successively higher levels. So this method is best illustrated by an example. So we see here on the screen that we've ordered the different p-values for all of the outcomes. And then once we've done that, we compare each of these p-values with the adjusted significance levels,

  • 12:25

    RICHARD PARKER [continued]: which are based on the number of remaining tests. So for example, our smallest p-value is for myocardial infarction. The result of this test is 0.002, so that's lower than the 0.01 significance level. And we calculate that by dividing

  • 12:47

    RICHARD PARKER [continued]: the overall significance level by the number of remaining tests. In this case, we've just started, so there are five tests remaining, including the test that we're currently comparing against. So in this case, it is statistically significant. So we then move on to the next highest

  • 13:08

    RICHARD PARKER [continued]: p-value, which corresponds to ischemic heart disease. And we have a p-value in this case of 0.012. And using the Holm adjusted significance level, this is also statistically significant. And this simply involves dividing

  • 13:29

    RICHARD PARKER [continued]: the overall significance level by the number of remaining tests. In this case, it's four. So we get an adjusted significance level of 0.0125. So after testing the ischemic heart disease outcome, we conclude that it is statistically significant. We then move on to the stroke outcome.

  • 13:51

    RICHARD PARKER [continued]: And for this one we get a p-value of 0.03. And this is higher than the Holm adjusted significance level, which is 0.0167. And in this case, we fail to reject the null hypothesis

  • 14:13

    RICHARD PARKER [continued]: of no difference between groups for the stroke outcome. So once we've done that, we cannot then continue to test the other outcomes. So the transient ischemic attack and heart failure hypotheses are retained without testing.
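A sketch of this Holm step-down procedure on the same five outcomes (as before, two of the p-values are hypothetical placeholders, since only three are quoted in the video):

```python
# Holm step-down: sort p-values ascending; compare the i-th smallest against
# alpha / (m - i); stop at the first non-significant test and retain the rest.
def holm(p_values: dict, alpha: float = 0.05) -> list:
    rejected = []
    ordered = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ordered)
    for i, (name, p) in enumerate(ordered):
        if p < alpha / (m - i):  # m - i tests remain, including this one
            rejected.append(name)
        else:
            break  # testing stops; remaining hypotheses retained without testing
    return rejected

p = {"myocardial infarction": 0.002, "ischemic heart disease": 0.012,
     "stroke": 0.03,  # these three p-values are quoted in the video
     "transient ischemic attack": 0.20, "heart failure": 0.45}  # hypothetical
print(holm(p))  # ['myocardial infarction', 'ischemic heart disease']
```

The thresholds reproduce the video's worked example: 0.002 < 0.05/5 = 0.01, then 0.012 < 0.05/4 = 0.0125, then 0.03 > 0.05/3 ≈ 0.0167, so testing stops at stroke.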

  • 14:34

    RICHARD PARKER [continued]: So the way this procedure works is that it's conditional on the previous test being significant for us to proceed. And since stroke was not significant, then we can't proceed to test the other outcomes. But you can see in this case that we've concluded that both myocardial infarction and ischemic heart

  • 14:57

    RICHARD PARKER [continued]: disease are statistically significant. So that's the Holm method. So another procedure that we could use is called the fixed-sequence method. [Fixed-sequence method] So another way of describing this is performing hierarchical testing. So in this case, we would formally

  • 15:19

    RICHARD PARKER [continued]: specify a hierarchical testing structure. And that means that outcomes are only tested in a prespecified order. And testing proceeds in sequence until a hypothesis test is non-significant, in which case testing stops and the remaining hypotheses

  • 15:42

    RICHARD PARKER [continued]: are not tested. So returning to our example, if we've prespecified the order of the hypothesis tests, then it could proceed like this. So first of all, the first variable is myocardial infarction. So we test that at the 5% level.

  • 16:04

    RICHARD PARKER [continued]: And then if it's significant, then we proceed to the next hypothesis test, which is stroke. And again that's statistically significant. And then we proceed to the next one, which is heart failure. But in this case, the p-value is greater than 0.05, so it's not significant. And in this case testing stops, and we don't

  • 16:27

    RICHARD PARKER [continued]: test the other hypotheses. So we can see for this method we're not using an adjusted significance level. We're still using the overall 5% significance level. But again, testing is conditional on previous tests being significant. So it's very important--

  • 16:49

    RICHARD PARKER [continued]: pre-specifying the order of hypotheses. And this should be done at the study design stage, before we've done the analysis, so that it's free from bias. And great care should be taken to order the hypotheses carefully.

  • 17:10

    RICHARD PARKER [continued]: Because otherwise, we could end up in a situation where we don't end up performing many hypothesis tests. So those are just three different methods we could use to correct for multiple testing.
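The fixed-sequence (hierarchical) procedure just described can be sketched as follows. The prespecified order and the heart failure p-value are illustrative assumptions; the transcript only says heart failure's p-value exceeds 0.05:

```python
# Fixed-sequence testing: test hypotheses in a prespecified order, each at the
# full alpha level; stop as soon as one test is non-significant.
def fixed_sequence(ordered_tests: list, alpha: float = 0.05) -> list:
    rejected = []
    for name, p in ordered_tests:
        if p < alpha:
            rejected.append(name)
        else:
            break  # testing stops; remaining hypotheses are not tested
    return rejected

# Prespecified order following the example: myocardial infarction, then stroke,
# then heart failure. p-values 0.002 and 0.03 are quoted in the video;
# the rest (0.45, 0.20) are hypothetical placeholders.
order = [("myocardial infarction", 0.002), ("stroke", 0.03),
         ("heart failure", 0.45), ("transient ischemic attack", 0.20),
         ("ischemic heart disease", 0.012)]
print(fixed_sequence(order))  # ['myocardial infarction', 'stroke']
```

Note that each test uses the unadjusted 5% level; the error rate control comes entirely from the prespecified ordering and the stopping rule.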

  • 17:32

    RICHARD PARKER [continued]: But there are many other procedures. The advantage of all the above procedures is that they're all non-parametric, which means they don't require any normality assumption. [Advantages of the methods discussed] And they also provide strong error rate control, which means that the family-wise error rate is controlled regardless of the true effect sizes.

  • 17:55

    RICHARD PARKER [continued]: Many other procedures will be more powerful. But note that some of these will only provide weak error rate control, and some may rely on strong untestable assumptions. It's impossible to cover all the different procedures in this video. But for further information about the different types

  • 18:18

    RICHARD PARKER [continued]: of procedures, I recommend that students and researchers consult the excellent book by Dmitrienko, which covers many different multiple testing procedures with a focus on their application in clinical trials. People often use the word "problem" with multiple testing.

  • 18:39

    RICHARD PARKER [continued]: They say "multiple testing problem," which may imply that multiple testing is a drawback to be avoided at all costs. However, multiple testing is only a problem if it is not considered carefully. In fact, multiple testing is to be encouraged if it means that full use is made of the data collected from patients, and that the value of a study

  • 19:01

    RICHARD PARKER [continued]: can be maximized through this. Indeed, when we say "multiple testing problem," this should be understood in the same sense as a mathematical problem that needs solving, not a problem to be avoided. Multiple testing may not always require a multiple testing adjustment, but one should at least be considered.

  • 19:24

    RICHARD PARKER [continued]: Many people skip consideration of whether a multiple testing adjustment is needed, and immediately proceed to multiple testing procedures, or else make apologetic statements in their final report as to why they didn't apply a multiple testing procedure. However, this is a question which should be considered carefully

  • 19:44

    RICHARD PARKER [continued]: before analysis. So we should really consider carefully, before we perform any analysis: do we actually need to use a multiple testing adjustment? And if we do need to adjust for multiple testing, what procedure will we use? Ideally, we need to use a multiple testing

  • 20:07

    RICHARD PARKER [continued]: procedure that provides strong error rate control. So methods which control the error rate in the strong sense, such as the Bonferroni method, are usually recommended over those that do not. Thank you for watching this video, and I hope you found it helpful.


Richard Parker, Senior Statistician at the University of Edinburgh, discusses multiple testing and methods for error rate control, including family-wise error rates, the Bonferroni method, the Holm method, and the fixed-sequence method.


Introduction to Multiple Testing and Methods for Error Rate Control

