  • 00:10

    NICHOLAS BIDDLE: So thanks for watching this video as part of the Sage Research Methods Series. My name is Dr. Nicholas Biddle. I'm the Deputy Director of the ANU Center for Social Research and Methods, in the Research School of Social Sciences at the Australian National University. [Dr. Nicholas Biddle] What I'm going to talk about during this video is different types of data which we can use for policy and evaluation.

  • 00:31

    NICHOLAS BIDDLE [continued]: In particular, I'm going to talk about administrative and linked data sets. And I'm going to use the example of an indigenous identification change to highlight the benefits of linking data through time. So what do we mean by administrative data? Well, government departments and agencies build up data collections during their day-to-day activities.

  • 00:54

    NICHOLAS BIDDLE [continued]: They routinely gather information when registering people who are carrying out transactions, or for record-keeping purposes. In essence, it's information which is collected whilst delivering services. Now this information isn't really collected for research purposes, but it can be used to answer key research questions. And in particular, key research questions

  • 01:17

    NICHOLAS BIDDLE [continued]: about policy issues, policy evaluations, and service delivery issues. So within Australia, there's a range of different administrative data sets. There's inbound and outbound passenger data. So when you enter Australia, or uniquely in Australia, when you leave, you have to fill out a range of information about your destination

  • 01:38

    NICHOLAS BIDDLE [continued]: or your origin, your intention, as well as some characteristics about yourself, including occupation. The Medicare and pharmaceutical benefits schemes, they collect information when people access health services, when they purchase or use pharmaceuticals, or in a sense,

  • 02:00

    NICHOLAS BIDDLE [continued]: any transactions as part of the hospital system. There's tax office data. When people fill out tax returns, they have to fill out a range of information about themselves, as well as detailed information about their earnings. And finally there's income support data. So when people receive government payments,

  • 02:22

    NICHOLAS BIDDLE [continued]: their information is stored as part of very large data sets. And what's quite interesting about income support data is it's one of the few longitudinal data sets we have on individuals in receipt of government support. So there's a number of strengths and limitations

  • 02:43

    NICHOLAS BIDDLE [continued]: of administrative and linked data, and using those for research purposes. So administrative data collections already exist. So there's no real additional cost for collecting that information. Administrative data sets are frequently collected in the same way over many years, and subject to very strict departmental standards.

  • 03:05

    NICHOLAS BIDDLE [continued]: Which means they're quite comparable through time. Researchers can use historical information to cover the same data over different time to help uncover trends. Administrative data collections have very large sample sizes, and they tend to cover the whole population being studied.

  • 03:26

    NICHOLAS BIDDLE [continued]: Most importantly, the collection process is not intrusive. So that data is being collected anyhow. Which means there's no additional respondent burden for research purposes. Of course there's limitations. Administrative data sets can be very complicated to analyze, and they require a lot of detailed knowledge

  • 03:47

    NICHOLAS BIDDLE [continued]: of the system which the data is being collected as part of, as well as how the data is collected. While there's a reduction in respondent burden, that burden is placed on service providers. Service providers often need to provide additional information, if they're knowing that their information, their data

  • 04:08

    NICHOLAS BIDDLE [continued]: is going to be used for research purposes. As an outside researcher, as opposed to someone researching within government, administrative data sets can be very hard to access, in particular, accessing metadata, or the data about the data. There's privacy concerns. People who provide that information don't necessarily consent to the data being used

  • 04:29

    NICHOLAS BIDDLE [continued]: for administrative purposes. And finally, the data, while interesting and useful, isn't always perfect. It's not designed-- it's not specific to the research questions. And often, we have second best or third best data measures for the things which we're ultimately interested in.

  • 04:52

    NICHOLAS BIDDLE [continued]: And another limitation, or another aspect of the data where it's not necessarily suitable for research purposes, is that we often don't have information on those who don't use the service, or the counter-factual. So for example, we know who might be-- we know information about those who might be participating in preschool. But we don't know information about those who aren't

  • 05:14

    NICHOLAS BIDDLE [continued]: participating in preschool. Which makes it very hard to understand the preschool decision. So that's administrative data sets. What about data linkage? Well, data linkage is basically when you combine two different sources of data into one single data set. That can be done at the aggregate level,

  • 05:34

    NICHOLAS BIDDLE [continued]: and that's been done for many, many years. But more interesting and more recently, we undertake data linkage at the individual level. So we have information on the same individual from different data sets. And that data linkage can be administrative data to administrative data. So you might have, as I mentioned before, data from the Australian tax office, so our tax data,

  • 05:57

    NICHOLAS BIDDLE [continued]: linked with social security data. Or passenger card data linked with health data, as an example. We often link survey data to survey data. And I'm going to talk about an example later on in this video. Or we can link administrative data to survey data. So we might have very rich data on a survey,

  • 06:20

    NICHOLAS BIDDLE [continued]: but we don't have longitudinal data. So what we might do is take the survey as a baseline, and then follow individuals through time, using the administrative data sets. Now there's two main different types of linkage at the individual level. There's what we might call deterministic linkage. That's where we know exactly who those individuals are

  • 06:43

    NICHOLAS BIDDLE [continued]: on the two different data sets. That's their unique identifier. So for example, in a lot of schools data sets, those individual students have a unique identifier. And so therefore, we can link a range of schools data sets to each other. Sometimes though, we don't have a unique identifier.
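
A minimal sketch of deterministic linkage in Python, offered as an illustration rather than as the method used in any particular collection. It assumes two hypothetical school data sets that share a unique `student_id`; the field names, records, and the `link_deterministic` helper are all invented for the example.

```python
# Hypothetical records: two school data sets that share a unique student ID.
enrolments = [
    {"student_id": "S001", "school": "North Primary", "year_level": 3},
    {"student_id": "S002", "school": "North Primary", "year_level": 5},
    {"student_id": "S003", "school": "East Primary", "year_level": 3},
]
test_results = [
    {"student_id": "S001", "reading_score": 412},
    {"student_id": "S003", "reading_score": 455},
    {"student_id": "S004", "reading_score": 390},
]


def link_deterministic(left, right, key):
    """Join two record lists on an exact match of a unique identifier."""
    # Index the right-hand data set by the identifier for direct lookup.
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Combine the fields from both sources into one linked record.
            linked.append({**row, **match})
    return linked


linked_records = link_deterministic(enrolments, test_results, "student_id")
for record in linked_records:
    print(record)
```

Only students who appear in both data sets are linked here. Records without a partner (S002 and S004 in this toy data) drop out, which is one reason linked samples can be smaller than either source.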

  • 07:04

    NICHOLAS BIDDLE [continued]: What we have though, is information which is not necessarily unique, but for which an individual-- which is quite uncommon for many individuals to have exactly the same information. And when we link those data sets through time,

  • 07:26

    NICHOLAS BIDDLE [continued]: it's called probabilistic linkage. So what we do is we take an individual in one data set, look at their information, look at the other data set, and see, OK, which individual in another data set is the most likely person to link to? Now obviously, there's limitations of that.
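
The "most likely person" idea can be sketched in Python as follows. This is a deliberately simplified agreement-score scheme, not the procedure any statistical agency actually uses; the fields, the `WEIGHTS` values, the `threshold`, and the helper names are assumptions made up for the illustration.

```python
# Hypothetical agreement weights: fields that separate people better count for more.
WEIGHTS = {"birth_date": 5.0, "area": 3.0, "country_of_birth": 2.0, "sex": 1.0}


def match_score(a, b):
    """Sum the weights of the fields on which two records agree."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if a.get(field) is not None and a.get(field) == b.get(field)
    )


def best_match(record, candidates, threshold=6.0):
    """Return (candidate, score) for the most likely link, or (None, score) if too weak."""
    scored = [(match_score(record, c), c) for c in candidates]
    score, candidate = max(scored, key=lambda pair: pair[0])
    if score < threshold:
        return None, score  # no candidate is similar enough to link
    return candidate, score


# Hypothetical records from the two data sets being linked.
person = {"birth_date": "1980-03-14", "area": "A12",
          "country_of_birth": "Australia", "sex": "F"}
second_file = [
    {"id": "B1", "birth_date": "1980-03-14", "area": "A12",
     "country_of_birth": "Australia", "sex": "F"},
    {"id": "B2", "birth_date": "1980-03-14", "area": "C07",
     "country_of_birth": "Australia", "sex": "M"},
    {"id": "B3", "birth_date": "1975-11-02", "area": "A12",
     "country_of_birth": "Australia", "sex": "F"},
]
stranger = {"birth_date": "1990-01-01", "area": "Z99",
            "country_of_birth": "India", "sex": "F"}

match, score = best_match(person, second_file)
no_match, weak_score = best_match(stranger, second_file)
print(match["id"], score)
print(no_match, weak_score)
```

The threshold is the design choice to watch: set it too low and different people get linked together, set it too high and genuine matches with a data-entry error in one field are lost.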

  • 07:47

    NICHOLAS BIDDLE [continued]: Especially when the information on one or the other data sets is entered incorrectly, or when it doesn't separate individuals too much. But probabilistic linking is very useful for data sets which aren't designed to be linked through time. An example of that is the Australian Census Longitudinal

  • 08:09

    NICHOLAS BIDDLE [continued]: data set or the ACLD. So the ACLD is basically a linked data set of two population censuses through time. Now many countries link their censuses through time. And Australia recently linked theirs from the 2006 census

  • 08:29

    NICHOLAS BIDDLE [continued]: to the 2011 census. So in a sense, what the ABS, the Australian Bureau of Statistics did, is it took a sample of about 5% of the 2006 census. Now remember, the census is supposed to have information on everyone. So that 5% is about 1,000,000 records. And that's linked probabilistically, based on a most likely match to all individuals on the 2011

  • 08:54

    NICHOLAS BIDDLE [continued]: census. Now one of the limitations of this trans-census longitudinal data set is it's linked-- the linkage is conducted without using names and addresses. For privacy reasons, those names and addresses from the 2006 census weren't retained through time. But there was very detailed area level information.

  • 09:18

    NICHOLAS BIDDLE [continued]: So in 2011, people were asked where they lived in 2011, as well as where they lived five years ago. People were asked about their date of birth, so month, date, and year, as well as a range of other information which is quite unique to different individuals, so their highest level of qualifications.

  • 09:41

    NICHOLAS BIDDLE [continued]: What country their parents were born in. What country they were born in. So using all that information, individuals can be linked probabilistically through time. So what we have is two waves of data, a very, very large sample of the Australian population. So the Australian Bureau of Statistics

  • 10:01

    NICHOLAS BIDDLE [continued]: was able to link about 80%, about 82% of the 2006 sample. So we can use the Australian Census Longitudinal Dataset, the ACLD, to look at a very important policy question within Australia. And that's the size of the indigenous population. So like a number of settler countries,

  • 10:23

    NICHOLAS BIDDLE [continued]: Australia has an indigenous population, the original [INAUDIBLE] population, which is very important for policy purposes. It's important for policy purposes because indigenous Australians have unique rights to land and other government services, based on their ongoing attachment to their land,

  • 10:45

    NICHOLAS BIDDLE [continued]: or because of efforts of the government to make up for previous policy failures. The indigenous population is disadvantaged relative to the rest of the population, which means there's quite a big government focus on improving outcomes of indigenous Australians,

  • 11:06

    NICHOLAS BIDDLE [continued]: and therefore a focus on checking whether those outcomes are changing through time. So we need to know the size, as well as the characteristics of the indigenous population. Now indigenous Australians made up about 2.7% of the population in 2006. So that's not really large enough

  • 11:26

    NICHOLAS BIDDLE [continued]: to be a large sample of longitudinal survey data. So we're reliant on the longitudinal census to understand the characteristics of the indigenous population. One aspect of the indigenous population is it's a very rapidly growing population. So between 2006 and 2011, there were about 20

  • 11:50

    NICHOLAS BIDDLE [continued]: and 1/2% additional indigenous Australians counted as part of the census. Now part of that growth was due to higher fertility rates. But the higher fertility rates do not explain anywhere near all of that additional growth. One of the potential sources of growth

  • 12:10

    NICHOLAS BIDDLE [continued]: is individuals who previously didn't identify as being indigenous, but did identify as such-- so didn't identify as being indigenous in 2006, but did identify as being indigenous in 2011. But without two waves of data with that individual's

  • 12:31

    NICHOLAS BIDDLE [continued]: information being revealed at more than one point in time, we can't tell whether individuals are changing their status, who's changing their status, and what that means for the size of the population. So we can use the ACLD. And this linked data set, or this type of analysis is a great example of how we can use two linked data sets

  • 12:54

    NICHOLAS BIDDLE [continued]: to answer new policy questions. So about 10% of those who identified as being indigenous in 2006 were not identified as such in 2011. So they were previously identified as being indigenous, but were no longer identified as such. Now that's about 40,000 people, when you sum that up

  • 13:15

    NICHOLAS BIDDLE [continued]: to represent the population. So there's people who are going in the opposite direction, from non-indigenous to indigenous. Of those who identified as being non-indigenous in 2006, or not stated, about 0.4% were identified as being indigenous in 2011. So that's about 64,000 people. So you can see from the linked census

  • 13:35

    NICHOLAS BIDDLE [continued]: data sets, where we have information in 2006 on someone's indigenous status, and information in 2011, is that there was an excess number of people who are changing from being non-indigenous to indigenous, rather than the other way around. In a sense, there's a net increase

  • 13:56

    NICHOLAS BIDDLE [continued]: in the indigenous population due to changing status. And this linked data set is the only way in which we can identify that information. We can also use that linked data set to look at the characteristics of that population. So what we know is that the newly identified indigenous population is much more socio-economically advantaged than the previously identified indigenous

  • 14:17

    NICHOLAS BIDDLE [continued]: population. We also know that they're much more likely to live in urban areas than the indigenous population that was identified in 2006. We know they're more likely to be employed. The newly identified population is more likely to be employed. So what that means is those who were newly

  • 14:37

    NICHOLAS BIDDLE [continued]: identified look more like the non-indigenous population in 2006. Which means that when you're trying to compare characteristics of the indigenous population in 2006, relative to the non-indigenous population in 2006, and the indigenous population in 2011 relative to the non-indigenous population in 2011,

  • 14:59

    NICHOLAS BIDDLE [continued]: some of that change is likely to be due to changes in identification.
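
The net effect of identification change can be worked through in a short Python sketch. The two population figures (about 40,000 and about 64,000) are the ones quoted in the talk; the net figure is simple subtraction and is not stated by the speaker. The toy `linked` records and the variable names are hypothetical, included only to show how flows are counted once the same person is observed in both censuses.

```python
from collections import Counter

# Figures quoted in the talk (population-weighted estimates).
left_indigenous_category = 40_000    # identified as indigenous in 2006, not in 2011
joined_indigenous_category = 64_000  # identified as indigenous in 2011, not in 2006

# Net growth in the indigenous count attributable to identification change.
net_change = joined_indigenous_category - left_indigenous_category
print(net_change)

# Hypothetical linked records: (status in 2006, status in 2011) for the same person.
linked = [
    ("indigenous", "indigenous"),
    ("indigenous", "non-indigenous"),
    ("non-indigenous", "indigenous"),
    ("non-indigenous", "indigenous"),
    ("non-indigenous", "non-indigenous"),
    ("not stated", "indigenous"),
]
transitions = Counter(linked)

# Inflow: anyone not identified as indigenous in 2006 who is in 2011.
inflow = sum(
    n for (before, after), n in transitions.items()
    if before != "indigenous" and after == "indigenous"
)
# Outflow: identified as indigenous in 2006 but not in 2011.
outflow = sum(
    n for (before, after), n in transitions.items()
    if before == "indigenous" and after != "indigenous"
)
print(inflow, outflow, inflow - outflow)
```

Two separate cross-sections would show only the change in totals. The pairs of statuses for the same person are what make the inflow and outflow visible, which is the point the talk makes about linked data.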

Dr. Nicholas Biddle discusses the different types of data used for policy and evaluation, including administrative and linked data sets. The example of indigenous identification change is explained to highlight the benefits of linking data through time.

Data Types for Policy and Evaluation

