  • 00:00

    [MUSIC PLAYING] [An Introduction to Missing Data in Health Research]

  • 00:14

    LAURA GRAY: My name is Laura Gray. I'm a Research Fellow at the School of Health and Related Research at the University of Sheffield. [Laura Gray, Research Fellow] I'm a health econometrician, which means I spend lots of time dealing with large observational data sets which contain information on individuals' health, medical records, and other personal characteristics.

  • 00:37

    LAURA GRAY [continued]: Missing data can be a problem in most types of data set, but in some types of data, it's worse than in others. In health and medicine, some questions aren't answered by participants because they might feel a bit uncomfortable answering questions, or they might feel embarrassed to answer certain things. There might be questions where answers

  • 00:57

    LAURA GRAY [continued]: were unknown at the time, or where they were incorrectly measured or recorded, and so they were subsequently removed. If a study involves following participants over time, some participants might be what we call lost to follow-up, where they might, for example, move house, or the interviewer may not be able to contact them for another reason.

  • 01:18

    LAURA GRAY [continued]: This tutorial will outline what missing data is, discuss the different types of missing data and how to identify which type of missing data you might have. We'll look at the assumptions that you make when using data sets with missing data, and why these missing data can be problematic. This tutorial will point users in the direction of appropriate techniques that can be used to overcome the problems of missing data

  • 01:41

    LAURA GRAY [continued]: by briefly outlining some of the methods that we can use to overcome these common problems. [What is missing data?] Missing data describes a situation where any data point in a data set is missing. And this could be because it was never recorded

  • 02:01

    LAURA GRAY [continued]: or because it was recorded incorrectly and subsequently deleted. This table shows some data containing information on six individuals and their age, sex, weight, height and BMI. It could be that a single variable is missing for an individual. It could be that multiple variables

  • 02:22

    LAURA GRAY [continued]: are missing for an individual. Or it could even be that all data for such an individual is missing. So in the context of health and medicine, missing data could be data which is missing on a patient's age or sex, their disease status for a certain disease, measures of general health, or measures such as blood pressure or BMI, or it could be that all of this information

  • 02:44

    LAURA GRAY [continued]: on a certain patient is missing. [Why is missing data problematic?] Missing data causes a loss of statistical power, and this loss of statistical power is irreversible. As long as the data is missing, there'll always be a loss of power.

  • 03:04

    LAURA GRAY [continued]: So the only way to get this power back is to go back and recover the data. So, to go back to the participants and ask them the questions that are missing. When performing analysis with missing data, you have to make assumptions about the data. Most of these assumptions are untestable. For example, we might assume that any missing data follows

  • 03:26

    LAURA GRAY [continued]: a specific pattern. But there's often no statistical test to determine whether this is definitely true or not, so we have to rely on the data user's best judgment. The biggest problem with missing data is that, if it's not correctly accounted for, it can cause biased estimates and standard errors. And this might mean that you've got incorrect point

  • 03:47

    LAURA GRAY [continued]: estimates, as well as incorrect p-values, confidence intervals, and so on. If inappropriate analysis is used, then missing data can also cause inefficient estimates, meaning that any analysis will not make best use of the data that is included in the data set, the non-missing data. There are a number of techniques that

  • 04:08

    LAURA GRAY [continued]: can be used to deal with this missing data, and the appropriate technique will depend on the type of missing data that you have. We say that data is missing completely at random when the probability that it's missing is not related to any personal characteristics, either observed or unobserved. For example, if BMI is being recorded at a GP surgery

  • 04:31

    LAURA GRAY [continued]: but the scales stop working for the last five patients of the day, we would say that the missing BMI values are missing completely at random. And they're not missing as a result of anything related specifically to those five individuals. It's just bad luck. Although this type of missing data is less common than other types of missing data,

  • 04:52

    LAURA GRAY [continued]: it is often possible to say with reasonable confidence when we think our data is missing completely at random, assuming we know the reason for the missing data. We say that data is missing at random when the probability that it's missing is dependent on observable characteristics, so on data that has already been collected and we have in our data set,

  • 05:13

    LAURA GRAY [continued]: but that it's not dependent on any unobserved variables that we haven't collected or that are just not able to be observed. And that includes the value of BMI itself that we would have collected had it not been missing. For example, the probability that BMI is missing might depend on an individual's age, because young people tend

  • 05:36

    LAURA GRAY [continued]: to visit their GP less often, and they're therefore less likely to have a recent BMI value recorded at their GP's. If we are confident that there are no other unobserved variables that were also influencing the probability that the data are missing, then we would claim that the data is missing at random.
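
The two mechanisms described so far can be sketched with a small simulation. This is only an illustration under made-up assumptions: the population, the link between age and BMI, the 30% / 60% / 10% missingness rates and the helper `observe` are all hypothetical and are not taken from the tutorial.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: BMI drifts upward with age (made-up coefficients).
people = []
for _ in range(20000):
    age = rng.uniform(20, 80)
    bmi = 22 + 0.08 * (age - 20) + rng.gauss(0, 3)
    people.append((age, bmi))


def observe(prob_missing):
    """Keep the (age, bmi) pairs whose BMI was recorded.

    prob_missing(age) is the chance that a person's BMI is missing.
    It only ever sees age, never the BMI value itself.
    """
    return [(age, bmi) for age, bmi in people if rng.random() >= prob_missing(age)]


# Missing completely at random: everyone has the same 30% chance of a missing BMI.
mcar = observe(lambda age: 0.3)
# Missing at random: under-40s are far more likely to have no BMI on record.
mar = observe(lambda age: 0.6 if age < 40 else 0.1)

true_mean = mean(bmi for _, bmi in people)
mcar_mean = mean(bmi for _, bmi in mcar)
mar_mean = mean(bmi for _, bmi in mar)

# Within one age band, the MAR missingness no longer depends on anything else.
young_all = mean(bmi for age, bmi in people if age < 40)
young_mar = mean(bmi for age, bmi in mar if age < 40)

print(f"mean BMI, everyone:            {true_mean:.2f}")
print(f"mean BMI, observed under MCAR: {mcar_mean:.2f}")
print(f"mean BMI, observed under MAR:  {mar_mean:.2f}")
print(f"under-40s, everyone vs MAR:    {young_all:.2f} vs {young_mar:.2f}")
```

In this sketch the observed mean under missing completely at random stays close to the full-population mean, while under missing at random it drifts upward because younger people, who tend to have lower BMI here, are under-represented. Once you look within a single age band, the observed and full means agree again, which is what makes the missing at random assumption usable.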

  • 05:58

    LAURA GRAY [continued]: We say that data is missing not at random when the probability that data is missing depends on things that we cannot observe and that are not included in the data, and that includes the missing data itself. So going back to the example of general practice BMI values, BMI might be less likely to be missing for individuals with very high or very low BMI, as they

  • 06:21

    LAURA GRAY [continued]: might be more likely to have visited their GP and have had their BMI recorded, regardless of their age or other observable characteristics. [How can you identify types of missing data?] If you've got missing data in your data set, the first thing to do is to look closely at the data.

  • 06:41

    LAURA GRAY [continued]: Group the data into different sections. Is there a different pattern to the missingness in males and females, for example? Try to find out what might predict the missingness. To do this, you could create a new variable which indicates whether or not a variable is missing, and use a logistic regression to work out what might predict that missingness.
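
A minimal sketch of that first look, assuming a tiny hypothetical data set of `(sex, age, bmi)` records in which `None` marks a missing BMI. The records and the helper names are invented for illustration. With a single binary predictor such as sex, the slope that a logistic regression would estimate equals the log odds ratio from the two-by-two table, so the sketch computes that directly rather than fitting a model; with several predictors you would fit the regression in a statistics package instead.

```python
from math import log

# Hypothetical records: (sex, age, bmi); None marks a missing BMI.
records = [
    ("F", 34, 22.1), ("F", 51, 27.4), ("F", 29, None), ("F", 62, 30.2),
    ("F", 45, 24.8), ("F", 38, 21.9),
    ("M", 41, None), ("M", 27, None), ("M", 58, 28.3), ("M", 33, None),
    ("M", 66, 26.0), ("M", 49, 31.5),
]

# New variable: 1 if BMI is missing, 0 if it was recorded.
flagged = [(sex, age, 1 if bmi is None else 0) for sex, age, bmi in records]


def missing_rate(sex):
    """Share of people of the given sex whose BMI is missing."""
    flags = [flag for s, _, flag in flagged if s == sex]
    return sum(flags) / len(flags)


def missing_odds(sex):
    """Odds (missing : recorded) of a missing BMI for the given sex."""
    flags = [flag for s, _, flag in flagged if s == sex]
    return sum(flags) / (len(flags) - sum(flags))


rates = {sex: missing_rate(sex) for sex in ("F", "M")}

# Log odds ratio of missingness, males versus females. This is the value a
# logistic regression of the missingness flag on sex alone would estimate.
log_odds_ratio = log(missing_odds("M") / missing_odds("F"))

print(rates)
print(f"log odds ratio (M vs F): {log_odds_ratio:.3f}")
```

A log odds ratio well away from zero suggests that sex helps to predict missingness in this made-up sample, which would argue against treating the data as missing completely at random.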

  • 07:02

    LAURA GRAY [continued]: You could also look at previous studies which have used similar data or similar variables, and see what they found predicted missing values. [What are the methods for dealing with missing data?] The best way to handle missing data will depend on the type of missing data that you have.

  • 07:22

    LAURA GRAY [continued]: But in most cases, the first thing to try is to see if you can recover any missing data. And this is often not possible, but worth a try. The most common analysis that is used as the default in most statistical software packages is complete case analysis. Complete case analysis can produce unbiased estimates

  • 07:46

    LAURA GRAY [continued]: and unbiased standard errors if data are missing completely at random. However, if data are not missing completely at random, if they're missing at random or missing not at random, then complete case analysis is inefficient and it will provide biased estimates and standard errors.
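
As a rough illustration of what complete case analysis does to a data set, the sketch below uses a made-up table of six individuals with age, sex, weight, height and BMI, in the spirit of the table described earlier in the tutorial. All of the values are hypothetical, and `None` marks a missing value.

```python
# Hypothetical table of six individuals; None marks a missing value.
rows = [
    {"age": 34, "sex": "F", "weight": 62.0, "height": 1.65, "bmi": 22.8},
    {"age": 51, "sex": "M", "weight": None, "height": 1.80, "bmi": None},
    {"age": 47, "sex": "M", "weight": 95.0, "height": 1.75, "bmi": 31.0},
    {"age": None, "sex": "F", "weight": 70.0, "height": 1.70, "bmi": 24.2},
    {"age": None, "sex": None, "weight": None, "height": None, "bmi": None},
    {"age": 29, "sex": "F", "weight": 55.0, "height": 1.60, "bmi": 21.5},
]

# Complete case analysis keeps only rows with no missing value at all.
complete_cases = [
    row for row in rows if all(value is not None for value in row.values())
]

# Rows that at least have a BMI value, whatever else is missing.
bmi_observed = [row for row in rows if row["bmi"] is not None]

print(f"rows in the table:       {len(rows)}")
print(f"rows with a BMI value:   {len(bmi_observed)}")
print(f"complete cases retained: {len(complete_cases)}")
```

The fourth individual has a recorded BMI but no age, so an analysis that uses every column discards that BMI value along with the row. This is one way to see why complete case analysis does not make best use of the non-missing data.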

  • 08:06

    LAURA GRAY [continued]: If you believe that your data is missing at random -- that is, any missing data is dependent on other variables that are observed in your data set and it doesn't depend on anything that's unobserved -- then there are two main options for analysis, and these are inverse probability weighting or imputation. Inverse probability weighting aims

  • 08:28

    LAURA GRAY [continued]: to make the most of the complete cases that are available in the data set, making them more representative of all the cases, including those with missing data. This method weights each complete case by the inverse of its probability of being a complete case. So we need to have all the predictors of missingness included in the data, and this is the reason

  • 08:49

    LAURA GRAY [continued]: that we need data to be missing at random. It also requires there to be no missing data for all variables which influence the probability of being a complete case. If performed correctly, this method can produce unbiased estimates and standard errors when data is missing at random. It often remains inefficient because it doesn't use all the data from the non-complete cases.
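
A minimal sketch of the weighting idea, assuming a hypothetical data set in which age group is fully observed and is the only predictor of missingness. Here the probability of being a complete case is estimated simply as the observed proportion in each age group. That is a simplification chosen for the sketch; in practice the probabilities would more usually come from a model such as a logistic regression.

```python
# Hypothetical data: (age_group, bmi); None marks a missing BMI.
data = (
    [("young", b) for b in (21, 22, 23, 22)] + [("young", None)] * 6
    + [("old", b) for b in (26, 27, 25, 26, 27, 25, 26, 26)] + [("old", None)] * 2
)

# Step 1: estimate each group's probability of being a complete case.
p_complete = {}
for group in ("young", "old"):
    values = [bmi for g, bmi in data if g == group]
    p_complete[group] = sum(v is not None for v in values) / len(values)

# Step 2: weight every complete case by 1 / P(complete | age group).
complete = [(g, bmi) for g, bmi in data if bmi is not None]
weights = [1 / p_complete[g] for g, _ in complete]

# Unweighted complete case mean versus the inverse probability weighted mean.
cc_mean = sum(bmi for _, bmi in complete) / len(complete)
ipw_mean = sum(w * bmi for w, (_, bmi) in zip(weights, complete)) / sum(weights)

print(f"P(complete) by group:     {p_complete}")
print(f"complete case mean BMI:   {cc_mean:.2f}")
print(f"inverse-weighted mean BMI: {ipw_mean:.2f}")
```

The young group makes up half of the sample but only a third of the complete cases, so the unweighted mean is pulled towards the older group. Giving each young complete case a larger weight restores the balance between the two groups.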

  • 09:10

    LAURA GRAY [continued]: Another way of accounting for missing data which is missing at random is called imputation. The simplest form of imputation is mean imputation, where missing data is replaced with the mean of that variable using observations which are observed. However, this method can still cause biased estimates and standard errors. Regression imputation uses regression to estimate

  • 09:34

    LAURA GRAY [continued]: missing values, which are then imputed into the data set. This is an improvement on mean imputation because the estimated data to be imputed is dependent on other personal characteristics. But that can lead to overfitting the relationship between the variables. A more complex form of imputation, known as multiple imputation, involves

  • 09:56

    LAURA GRAY [continued]: imputing multiple data sets. In each data set, imputations are taken repeatedly from estimated distributions rather than just once, meaning that data retains its uncertainty. Subsequent analysis is performed on each of the imputed data sets, and then pooled, or combined, into a single output.
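
The tutorial does not say how the pooling is done. One widely used approach is Rubin's rules, sketched below for the pooling step only. The function name `pool` and the five sets of results are hypothetical, and the imputation itself is assumed to have been carried out already.

```python
from math import sqrt
from statistics import mean, variance


def pool(estimates, variances):
    """Combine results from m imputed data sets using Rubin's rules.

    estimates: the point estimate from each imputed data set
    variances: the squared standard error from each imputed data set
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)        # average of the m point estimates
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # extra spread caused by the imputations
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical mean-BMI results from five imputed data sets.
estimates = [24.1, 24.5, 23.9, 24.3, 24.2]
variances = [0.04, 0.05, 0.04, 0.05, 0.045]

pooled_estimate, pooled_se = pool(estimates, variances)

print(f"pooled estimate:       {pooled_estimate:.2f}")
print(f"pooled standard error: {pooled_se:.3f}")
```

The pooled standard error is larger than the average standard error from any single imputed data set, because it also counts how much the estimates vary from one imputation to the next. That is how the uncertainty about the missing values is carried through to the final result.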

  • 10:20

    LAURA GRAY [continued]: Multiple imputation, when done correctly, can produce unbiased estimates and standard errors when data are missing at random. If you believe that your data is missing not at random, then any analysis would require a more complex model which jointly models the data and the missing data itself, the missingness itself. If this is the case, you might want

  • 10:40

    LAURA GRAY [continued]: to try collecting more data so that data which was previously unobserved becomes observed in order to make the assumption of missing at random more convincing. [Conclusion] After watching this tutorial, I hope that you now

  • 11:01

    LAURA GRAY [continued]: feel confident to assess your data and identify which type of missing data you have, and also to determine the extent to which missing data might be causing you a problem in any analysis. If you're collecting your own data, then I hope that what we've covered in this tutorial might help you to consider how to minimize the problems of missing data when you design your study.

  • 11:24

    LAURA GRAY [continued]: For example, how might you prevent incorrect data from being recorded, or which variables should be included to help predict the probability of missing data if it does appear? If you find that missing data is indeed a problem within the data set that you have, what we've covered in this tutorial should point you towards the appropriate method

  • 11:46

    LAURA GRAY [continued]: to deal with your missing data. Thank you for listening to this tutorial today, and I hope you found it useful. [Further Reading]


Laura Gray, Research Fellow at the University of Sheffield School of Health and Related Research, discusses missing data in health research, including why missing data is problematic, identifying types of missing data, and methods for dealing with it.
