- 00:03
[Partha Lahiri Discusses Big Data for Small Areas]

- 00:11
PARTHA LAHIRI: I am Partha Lahiri. I am a faculty member of the Joint Program in Survey Methodology, commonly referred to as JPSM, and I am also a faculty member of the Department of Mathematics at the University of Maryland at College Park. I work on different topics in statistics and survey

- 00:34
PARTHA LAHIRI [continued]: methodology. Right now, my focus is on small area estimation, specifically with big data. [What is meant by the term big data?] Big data may mean different things to different people and different researchers. It's really a relative term.

- 00:54
PARTHA LAHIRI [continued]: For a social scientist, it may mean social media data like Twitter data, administrative records, data that you get from cell phone sensors, and so on. It's usually highly unstructured data. It's not designed for any scientific experiment,

- 01:18
PARTHA LAHIRI [continued]: but it is just there to be used. And according to Professor Robert Groves, this is also called organic data. This is undesigned data that one can potentially use to improve estimation and prediction, perhaps with the help of some high quality data

- 01:41
PARTHA LAHIRI [continued]: like poll data or survey data. So for social scientists, oftentimes big data is like administrative records. And these data may not be big data for computer scientists in terms of computing simple descriptive statistics

- 02:01
PARTHA LAHIRI [continued]: like average, standard deviation, and so on. But this could be a big data problem if social scientists are interested in doing some specialized analyses. For example, mixed model analysis where they may have to evaluate the procedures using simulations, and so on.

- 02:23
PARTHA LAHIRI [continued]: And suddenly, it becomes a big data problem because the regular statistical methods and software may not be applied and we have to design something. You know [INAUDIBLE] computing and statistical procedures to deal with such data. [What is meant by the term small area estimation?]

- 02:47
PARTHA LAHIRI [continued]: Small area estimation refers to a domain, which could be a geographical region. Something like a state, or district, or counties where we have inadequate data to make good inference using standard survey methodology.

- 03:10
PARTHA LAHIRI [continued]: And in most of the national surveys, for example the American Community Survey, the design is such that it can produce good estimates and good standard error estimates for a large area. For example, national level or regional level. But researchers and public policy analysts,

- 03:33
PARTHA LAHIRI [continued]: they are interested in using the data for producing similar statistics with similar precision for the same kind of quantities at a much lower geography, for the state level, county level, and so on. Sometimes a small area could be a domain. For example, domains formed by cross-classification

- 03:56
PARTHA LAHIRI [continued]: of different demographic groups and sometimes in conjunction with geography. And the main problem here is the standard method that we know of from survey methodology doesn't work here because of this small sample size. So a small area is not an area whose size in terms

- 04:21
PARTHA LAHIRI [continued]: of population size is small, but it is small in terms of the sample size. [What are the problems with using survey methods to produce estimates for small areas?] There are a lot of problems in using survey methods for producing estimates for small geography or small cells formed

- 04:43
PARTHA LAHIRI [continued]: by different demographic groups. The main problem is that most of the national surveys are designed to produce estimates at the large area level. And they are not designed to produce estimates at the small geography level or for small groups. And the main problem is the sample size.

- 05:06
PARTHA LAHIRI [continued]: And sometimes the sample size is very small. Sometimes you don't even have any sample for certain small cells or small areas. And when you don't have any sample in a group, there is no direct estimate available. And when there is a small sample size

- 05:28
PARTHA LAHIRI [continued]: you can still get some estimate, but they will be highly unreliable. And on top of that, it can be very difficult to give a measure of uncertainty of those estimates because they will be highly variable and highly unstable. And one can observe some peculiar situations. For example, for estimating a proportion, which we are all

- 05:52
PARTHA LAHIRI [continued]: interested in in survey methodology, like in public opinion, or we are interested in what percentage of people will vote for a certain candidate, we might get very strange results. For example, if all the observations are the same, you might get an estimate of 1 or 0 depending on the observation.
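
The case where all observations are the same can be made concrete with a minimal sketch. It assumes simple random sampling with no survey weights and uses the textbook Wald standard error and 95% interval. The function name `direct_proportion` and the sample values are illustrative assumptions, not taken from the interview.

```python
import math


def direct_proportion(responses):
    """Direct estimate, Wald standard error, and 95% Wald interval
    for a list of 0/1 responses (assumes simple random sampling)."""
    n = len(responses)
    p_hat = sum(responses) / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    interval = (p_hat - 1.96 * se, p_hat + 1.96 * se)
    return p_hat, se, interval


# Hypothetical small cell where every respondent gives the same answer.
unanimous = direct_proportion([1, 1, 1])

# Hypothetical small cell where four of five respondents agree.
lopsided = direct_proportion([1, 1, 1, 1, 0])

print(unanimous)
print(lopsided)
```

Under these assumptions, three identical responses give an estimate of 1 with a standard error of 0, and four of five in favor gives an interval whose upper limit exceeds 1, which is outside the valid range for a proportion.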

- 06:16
PARTHA LAHIRI [continued]: And more disturbing is the situation that in that case you can get a measure of uncertainty of 0, giving misleading information. And sometimes when you produce estimates of a confidence interval, for example, it might go out of the range and it will give a non-informative situation. [What opportunities has the availability of big data

- 06:38
PARTHA LAHIRI [continued]: offered to statisticians producing public opinion statistics?] Different types of big data are available these days, giving us a lot of opportunity to improve small area statistics. And there are different kinds of big data

- 06:58
PARTHA LAHIRI [continued]: that I would like to talk about. But first of all, let me say that small area statistics, if we use the traditional way, we just use the survey data. And as I explained earlier, the problem is the sample size. We don't have enough sample size to produce reliable estimates. Now we have big data and those data sources are big.

- 07:23
PARTHA LAHIRI [continued]: And they are available for smaller geography and we also have data for different demographic groups. And thus we have a lot of opportunity to make use of the data cleverly using sophisticated statistical and computing tools. So let me first explain how this can

- 07:45
PARTHA LAHIRI [continued]: be done using a larger survey data, which can be thought of as big data in the polling context because most of the polls have small sample size ranging from maybe 500 to 2000. And they are very small when we try

- 08:06
PARTHA LAHIRI [continued]: to estimate, for example, state level estimation or different demographic groups. And now, if we combine these data with large survey data like American Community Survey data, or Current Population Survey data, we can increase the effective sample size by borrowing strength from these larger survey data.

- 08:30
PARTHA LAHIRI [continued]: So for example, if we want to estimate a president's approval rating on the economy, then we can use poll data, which is small in size, and combine it with the Current Population Survey data, which contains a lot of information about labor force statistics,

- 08:53
PARTHA LAHIRI [continued]: and use statistical models and sophisticated computing tools to produce reliable estimates. I have already said how to use bigger survey data, like American Community Survey data or Current Population Survey data. But there could be other kinds of big data that can be used for research purposes.

- 09:15
PARTHA LAHIRI [continued]: For example, there are a lot of administrative data available. Say for instance we are interested in some welfare program and we want to do a public opinion research poll to understand how people think about that program. So you can have this survey, but then we

- 09:36
PARTHA LAHIRI [continued]: can use administrative data. For example, we can take the example of IRS data, Internal Revenue Service data, or SNAP data, formerly known as food stamp data, in order to borrow strength. Now other than the administrative data, there are other kinds of big data available

- 09:57
PARTHA LAHIRI [continued]: which are huge in volume. For example, say we are interested in taking public opinion about traffic conditions in different counties in the US. And here we can take again a survey of the US population about what they think about the traffic conditions

- 10:19
PARTHA LAHIRI [continued]: and we can combine this with some big data coming from vehicle probe data, where we get data continuously, by minute, from different GPS and smartphone devices in the car. And we can use this data in conjunction with poll data

- 10:42
PARTHA LAHIRI [continued]: to get reliable statistics. And there are other kinds of big data we may be interested in. For example, in the US presidential election context, there are a number of polls going on all the time. And even when we have so many polls,

- 11:05
PARTHA LAHIRI [continued]: we still need more information. We are hungry for information. And we need, for example, information every day. By state, by county, and so on. And there are simply not enough data to support all this information. And we can use the social media data,

- 11:25
PARTHA LAHIRI [continued]: for example, Twitter data, and combine it with all these polling data to have some information on all the statistics we are interested in. [How can survey data be combined with big data to improve small area estimation statistics?] We have talked about the survey data, that is, poll data, which are small in size.

- 11:46
PARTHA LAHIRI [continued]: And we have big data and there are various kinds of big data. It can be a bigger survey data, administrative data, or sensor data, for example, or social media data. So there are different types of data here and our task would be to combine all this information

- 12:07
PARTHA LAHIRI [continued]: in producing the statistic that we want. And the question is, how do we produce such statistics? There are a number of methods available in the small area estimation literature. And the methods vary widely. And so the most promising is explicit modeling.

- 12:28
PARTHA LAHIRI [continued]: And one reason we may want to go for a purely model-based method is because we explicitly state the model so that you can check all the assumptions we are making and make sure the models are reasonable. Now there are a variety of models one can use.

- 12:49
PARTHA LAHIRI [continued]: And I'll give one example of how the modeling will go. For example, consider the situation where we are combining different data sources where data are available at different levels of geography. For example, the poll survey data is available at the personal level

- 13:10
PARTHA LAHIRI [continued]: and we are measuring some characteristic, like income, expenditure, health care, and so on. And then we have data available at, say, county level. And that may be, like, the SNAP, or food stamp, data that I talked about.

- 13:31
PARTHA LAHIRI [continued]: And then, there would be some data at the state level, maybe coming from the IRS data, or Internal Revenue Service data. So there are data at various stages and the question is, how do you combine? And one commonly used method is to use a multi-level, or hierarchical, model

- 13:52
PARTHA LAHIRI [continued]: because it can extract information from different types of data at different levels and it can also ascertain the variability at different levels. And once we have the modeling done, the next step would be how to process the model.

- 14:15
PARTHA LAHIRI [continued]: How to do the estimation using the model. Again, there are various ways of doing it, and the Bayesian, or approximate Bayesian, method is a useful tool because it can process the prior information that we have on certain research questions with the data

- 14:36
PARTHA LAHIRI [continued]: to produce the best possible result. And in a lot of situations it will produce a statistic which is like a weighted combination of a direct estimator, that we get from the small poll data, and a synthetic part, which may be the regression kind of estimator.

- 14:57
PARTHA LAHIRI [continued]: And the way it combines these different sources of data is also reasonable. For example, it may produce an estimate which may be very close to the direct poll estimate if the sample size for the area is large.
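
The weighted combination described here can be sketched as follows. This is a simplified illustration in the spirit of an area-level shrinkage estimator, not necessarily the exact model used in Professor Lahiri's work. The function name `composite_estimate`, the variance values `model_var` and `unit_var`, and the example numbers are all assumptions made for illustration.

```python
def composite_estimate(direct, synthetic, n, model_var=0.01, unit_var=0.25):
    """Weighted combination of a direct poll estimate and a synthetic
    (regression-type) estimate for one small area.

    model_var : assumed between-area variance of the model (hypothetical value)
    unit_var  : assumed person-level variance (hypothetical value)
    Returns the combined estimate and the weight given to the direct estimate.
    """
    if n == 0 or direct is None:
        # No poll respondents in the area: rely entirely on the synthetic part.
        return synthetic, 0.0
    sampling_var = unit_var / n  # the direct estimate gets noisier as n shrinks
    weight = model_var / (model_var + sampling_var)
    return weight * direct + (1 - weight) * synthetic, weight


# Hypothetical areas with the same direct (0.60) and synthetic (0.50) estimates.
large_est, large_w = composite_estimate(0.60, 0.50, n=2000)
small_est, small_w = composite_estimate(0.60, 0.50, n=5)
empty_est, empty_w = composite_estimate(None, 0.50, n=0)

print(large_est, large_w)
print(small_est, small_w)
print(empty_est, empty_w)
```

Under these assumed variances, the area with 2000 respondents puts a weight of about 0.99 on its direct estimate, while the area with 5 respondents puts only about 0.17 on it and leans mostly on the synthetic part.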

- 15:19
PARTHA LAHIRI [continued]: But on the other hand, if the poll data doesn't have a lot of information for that area, most of the weight will be going to the regression part. Now this method may also be effective in situations when we don't have any information from the poll data. That is, we don't have any information

- 15:42
PARTHA LAHIRI [continued]: on the characteristics we are interested in. So in this case, what will happen is all the information will be coming from other data sources and it is called the synthetic data. Now this is a simple way of addressing the problem when we don't have any data for certain areas. But other sophisticated statistical models

- 16:05
PARTHA LAHIRI [continued]: can be applied. For example, we can consider the spatial temporal model. And the advantage of this kind of model is that, rather than combining information from ordinary big data sources, it can actually use the poll data from neighboring areas

- 16:25
PARTHA LAHIRI [continued]: or from previous time points in producing better estimates. [Does big data analysis require interdisciplinary partnerships?] In my view, big data analysis requires a multidisciplinary team effort. We need researchers from the domain, the particular discipline

- 16:48
PARTHA LAHIRI [continued]: we are interested in. We need statisticians and we need computer scientists. First of all, a collaboration between a researcher in the domain discipline and statisticians would be very beneficial. For example, in formulating the problem, which is very important.

- 17:08
PARTHA LAHIRI [continued]: And this will help us in designing surveys, both the sample design and the questionnaire design. And this will help us in identifying alternative data sources. I mean the big data sources. The kind of bigger survey we have available, administrative data, other kinds

- 17:30
PARTHA LAHIRI [continued]: of big data like Twitter data and other social media data. And once we have that, we can develop statistical models and statistical methodology, like the Bayesian method, to produce the estimate and the related quantities like standard error and confidence interval.

- 17:53
PARTHA LAHIRI [continued]: But a problem could be that when you have a large data set, computing could be an issue. And so, I think it would be beneficial to have computer scientists included in some of the big data projects. When we had developed this statistical method

- 18:14
PARTHA LAHIRI [continued]: with big data, one issue would be the computing because the data set is huge. And we can borrow strength in the team by including computer scientists because they can guide us how to manage the data, how to develop a computer algorithm,

- 18:35
PARTHA LAHIRI [continued]: and also computing using parallel computing methodology, for example. And so this will be a good team effort if we include domain scientists, statisticians, and computer scientists. [What does the future hold for survey methods and small area estimation in the era of big data?]

- 18:56
PARTHA LAHIRI [continued]: Survey methodology is based on data which is typically small in size. One reason for that is we're going to make it very high quality data. And it's well-designed and it costs a lot of money. And inside of it, typically, we may have a lot of variables. One issue here is that these days, people don't

- 19:21
PARTHA LAHIRI [continued]: respond to the questionnaire. And so, although initially we design for a lot of samples, due to the non-response, the effective sample size becomes smaller. And this is a big problem in surveys. So now we have big data and the question

- 19:45
PARTHA LAHIRI [continued]: is whether this is good news or bad news. And in my view, this is good news because we can supplement the high quality data that we collect from the survey and use this big data to supplement it. And that would require some new methodology

- 20:07
PARTHA LAHIRI [continued]: based on the models. And so I think the future of survey methodology in the presence of big data is great. And so we don't have to fear the big data world. We can embrace it and try to do a better job. Now the question is, where does small area estimation fit in?

- 20:30
PARTHA LAHIRI [continued]: And small area estimation would be a great way to combine information and it can go side by side with the survey data and big data to do high-quality statistics. So I think this is an exciting time for both survey methodologists and the researchers working on small area

- 20:52
PARTHA LAHIRI [continued]: estimation problems.

### Video Info

**Publisher:** SAGE Publications Ltd

**Publication Year:** 2017

**Video Type:** Interview

**Methods:** Small area estimation, Big data

**Keywords:** collaboration; computer science; domain; geographic regions; geographical areas; gps tracking; inferences; practices, strategies, and tools; Size perception; Software

### Segment Info

**Segment Num.:** 1

**Persons Discussed:**

**Events Discussed:**

**Keywords:**

## Abstract

Professor Partha Lahiri explains small area estimation and why it is difficult for researchers to gather reliable information about a small area solely using survey methods. Big data sources like county or state records can supplement survey data to give a more nuanced picture of what is happening in a particular area.