  • 00:01


  • 00:10

    KATERINA METSA: My name is Katerina Metsa. I'm an associate director of research at the Pew Research Center. I've been with the Center for almost eight years, and I'm an expert on journalism and media. Particularly, I'm focusing on media consumption habits, attitudes towards the media, and especially the role

  • 00:31

    KATERINA METSA [continued]: of technology in use. So by training, I'm a political scientist. That was my bachelor's degree. And then I did a master's in European politics, and then I did another master's on communication and technology and news. So after a couple of other iterations in the job search,

  • 00:55

    KATERINA METSA [continued]: I joined the Center in 2010. And I'm also pursuing a PhD right now. I'm on my fourth year, writing my dissertation at the American University on communication and technology.

  • 01:19

    KATERINA METSA [continued]: We decided to study the water crisis in Flint, Michigan as a case study in understanding how news and information was flowing while the crisis was unfolding. This was not a study in understanding the water

  • 01:40

    KATERINA METSA [continued]: crisis itself, but it was more of understanding how residents and the audience in Flint, in Michigan as well, were using search, like Google, to find more information about their personal health, but also public health, and how all this fitted with events that were happening

  • 02:04

    KATERINA METSA [continued]: and that were unfolding while the crisis was happening, and how also that related to media coverage of the crisis. So part of our research agenda in the past five years is understanding, as I mentioned, the digital environment where news is flowing and where people are exposed with news.

  • 02:26

    KATERINA METSA [continued]: Part of that are search engines, social media. We know from our research that a majority of Americans use Google as a search engine to search information. So first, there were three components in this exploration. One was to use Google search data,

  • 02:49

    KATERINA METSA [continued]: and I can explain how we did that. The second was to audit the media coverage that was related to the Flint water crisis. And then third was to use Twitter data. So starting with the Google data,

  • 03:11

    KATERINA METSA [continued]: we gained access to the Google Health API, which is a separate tool, an API, than the regular public Google Trends API, where we were able to research and get data on about 2,700 search terms specific to the Flint water

  • 03:32

    KATERINA METSA [continued]: crisis or related to water and the conditions of water. The second element, the media coverage, what we did was collect all the stories for the same time period that we did this research, which was about 2 and 1/2 years, collect all the coverage and the news articles and TV

  • 03:55

    KATERINA METSA [continued]: segments that were related to the Flint crisis. And that resulted in about 4,500 stories about the crisis. And the third one, which was the Twitter data, what we did was we used a computational software called Crimson Hexagon, where we collected

  • 04:15

    KATERINA METSA [continued]: all the related tweets about the Flint water crisis using what we call a Boolean search of terms to gather basically all the tweets related to the crisis. And finally, all this data came together

  • 04:35

    KATERINA METSA [continued]: to help us understand what was a news environment related to that very big news event that was happening locally, initially, in Flint, but then spread nationally later when the story was unfolding. So the Google Health API, as I mentioned, is a very different tool than the public Google Trends API

  • 05:00

    KATERINA METSA [continued]: that most people are familiar with, or at least they have done some kind of search to look for some terms that they are interested in. So we entered an agreement with Google to be able to access the Google Health API, and that happened through a proposal process.

  • 05:22

    KATERINA METSA [continued]: We presented a couple of projects that we were interested in exploring and that fitted our agenda, and we were granted access. We basically took it from there and started doing the research and a lot of the exploration initially. Following that, the first step was to develop search terms,

  • 05:46

    KATERINA METSA [continued]: as we say. What were the types of things that we were interested or that we could investigate and explore that people were searching at the time of the crisis over that period of 2 and 1/2 years? That was a very thorough process. We used several techniques, from brainstorming,

  • 06:06

    KATERINA METSA [continued]: which is a simple way of all of us researchers coming together trying to throw some ideas to each other on what people might be searching, to other methods, like reviewing all the related media coverage and how people were talking about the crisis, to see there are terms that we saw repeated that people might be using while searching.

  • 06:28

    KATERINA METSA [continued]: And that's our two examples of our process in developing these terms. So what we did then eventually is develop this list of about 2,700 search terms, and these also included terms that we actually even provided different types of those terms.
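
Illustration (not from the talk): one way to produce "different types" of a term is to list the alternative word forms for each position in a phrase and expand every combination. This is only a minimal sketch; the function name `expand_variants` and the slot layout are hypothetical, and the talk does not say how the team actually generated its variants.

```python
from itertools import product


def expand_variants(slots):
    """Return every phrase formed by picking one word from each slot."""
    return [" ".join(words) for words in product(*slots)]


# Two grammatical forms of the same concept.
variants = expand_variants([["filtered", "filtering"], ["water"]])
print(variants)  # ['filtered water', 'filtering water']
```

Keeping each concept as a list of slots makes it easy to review which grammatical forms were included before any data request is made.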

  • 06:56

    KATERINA METSA [continued]: To give you an example to make this clearer, we used terms like filtering water or filtered water. That way, we're trying to capture different ways that people might be searching for the concept, but in different grammatical ways. So this was the first step in actually identifying

  • 07:17

    KATERINA METSA [continued]: what those terms would be in order to be able to ask the data from the API. After we developed that list that was tested and all these iterations of getting to that list, eventually what we did is we spent months in trying to get samples of the data and tested the data

  • 07:37

    KATERINA METSA [continued]: and apply statistical modelings on the data to see what works and what doesn't work, and also collaborate with Google to ask all our questions on things that we were seeing or we didn't understand or why things were happening,

  • 07:57

    KATERINA METSA [continued]: and getting to understand what the data-- how it was structured and all that. So after all these months of exploring all this, we decided what the path would be in order to analyze this data. So what we did is basically we got from the API data that were on a weekly basis,

  • 08:21

    KATERINA METSA [continued]: starting from January 5 of 2014, when the crisis hadn't really started, but it was right before the issue with the water started to arise. And then after we got all these weekly data, there were a couple of things that we saw that we needed to address.

  • 08:44

    KATERINA METSA [continued]: Some of it was basically zero values, which is a very common thing in our world, that because of privacy issues, the Google API won't give you values for a certain week because they want to maintain privacy of their users. So to deal with that, what we did is basically take 50 samples for each week, which then if there were zero values,

  • 09:06

    KATERINA METSA [continued]: we did a statistical method called imputing-- imputing those values in order to get the data for that week. Then we had our data, basically, for all these data points, and what we did then is actually put them in a visual,

  • 09:28

    KATERINA METSA [continued]: see how this search was developing throughout that 2 and 1/2 year period. And what we saw was a lot of ups and downs and spikes and peaks. So briefly, what we did is we applied a smoothing

  • 09:49

    KATERINA METSA [continued]: technique on the data. So we removed the edges around those peaks and spikes. And then we used the change point method with an R package to understand where differences in these areas, in these peaks,

  • 10:10

    KATERINA METSA [continued]: were actually meaningful so we could understand where there was change. So that's a brief take of our journey with the search data. After we had our main source of data,

  • 10:30

    KATERINA METSA [continued]: which was the Google data, which we hadn't even explored before so that was an interesting methodological opportunity for us, we felt that we needed to complement this source of data with other data to help us understand even better what was happening with information and the news during that time.

  • 10:55

    KATERINA METSA [continued]: So the second thing we did was to collect all the stories that appeared locally and nationally about the Flint water crisis for the same time period. So we collected about 4,500 stories for that 2 and 1/2 year period from local newspapers, regional papers,

  • 11:19

    KATERINA METSA [continued]: online outlets, as well as national newspapers and the network evening broadcast. This way, we were able to see also, in relation to Google searches, whether a story that appeared on a news outlet was related to the interest

  • 11:41

    KATERINA METSA [continued]: we were seeing in Google search. So that was the intention of gathering that information to see if there was a relationship between the two.

  • 12:02

    KATERINA METSA [continued]: The third component, as I already described, was using Twitter data to also help us, again, understand what was happening with the other two data sources. We know from our research that a sizable portion of Americans

  • 12:22

    KATERINA METSA [continued]: get news from social media-- 67% of Americans get news from social media from 2017, and especially on Twitter. About 7 in 10 Twitter users get news there, which is actually one of the highest shares of users of a social media platform

  • 12:43

    KATERINA METSA [continued]: that they get news there. So from our research, we know that Twitter is a space that people go for news, and that was one of the reasons we decided to include Twitter data in this project. So what we did is basically develop, again, way fewer terms, because they were not

  • 13:03

    KATERINA METSA [continued]: needed for the purpose of capturing all the tweets that were related to the Flint water crisis, and we used the computational software platform called Crimson Hexagon. What it allows you to do is train what is called a monitor.

  • 13:25

    KATERINA METSA [continued]: So using these terms, you basically get all the related tweets, that they appear, including these terms. So then what we did is we had researchers go through and train a monitor with relevant tweets that we knew for a fact were about the crisis. Because, for instance, there could be some other Flint term

  • 13:50

    KATERINA METSA [continued]: that wasn't about the crisis, but it was about something else. So we coded those, as we say in our language, as non-relevant tweets and developed a monitor which was exclusively about the crisis. And after we were able to develop and train that monitor fully, that was used as a sample to apply those rules

  • 14:15

    KATERINA METSA [continued]: in the whole Twitter conversation for that same 2 and 1/2 year period with all the tweets that were related to the crisis.
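
Illustration (not from the talk): the first stage described here, a Boolean search of terms, can be sketched as a plain keyword filter. This is not Crimson Hexagon's interface, and it does not reproduce the trained monitor; the function `matches_query` and the example terms are hypothetical and only show how required, optional, and excluded terms might combine.

```python
def matches_query(text, all_of=(), any_of=(), none_of=()):
    """Boolean keyword match on lowercased text.

    all_of  -> every term must appear (AND)
    any_of  -> at least one term must appear (OR); ignored when empty
    none_of -> no term may appear (NOT)
    """
    lowered = text.lower()
    has_all = all(term in lowered for term in all_of)
    has_any = not any_of or any(term in lowered for term in any_of)
    has_none = not any(term in lowered for term in none_of)
    return has_all and has_any and has_none


# Hypothetical query: "flint" AND ("water" OR "lead") NOT "flintstones".
query = dict(all_of=["flint"], any_of=["water", "lead"], none_of=["flintstones"])
tweets = [
    "Flint water is still unsafe to drink",
    "Flintstones marathon tonight, bring water",
]
relevant = [t for t in tweets if matches_query(t, **query)]
```

A filter like this only narrows the pool. The relevance coding by researchers that the talk describes is what separates tweets about the crisis from other uses of the word Flint.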

  • 14:38

    KATERINA METSA [continued]: The majority of Americans use search engines, specifically Google, to search for information on the internet. However, Google search data is not a representation of public opinion, but it can be a good proxy of understanding people's

  • 15:00

    KATERINA METSA [continued]: intentions, what they are interested in at the time being. And the benefit of that is it helps us understand whether a topic or a specific search term portrays an interest from people.

  • 15:22

    KATERINA METSA [continued]: But as many have mentioned, there is no way for us to know whether searching, for instance, 'why my water is brown' is because people are interested in the Flint crisis or, is it because they are actually seeing brown water on their tap and their kitchen, or if it is just something

  • 15:46

    KATERINA METSA [continued]: that they heard in the news and they want more information about. So we don't have that kind of detail or the motivations in understanding why people started searching for it. However, as I mentioned, it is a good tool and a complementary tool in understanding their intentions on what they were doing and what is the conversation that

  • 16:09

    KATERINA METSA [continued]: is happening in that specific search engine around those terms that we decided to research for. I think our approach to this project was to try to understand this big crisis that was happening

  • 16:30

    KATERINA METSA [continued]: at the time, and one of the reasons that we didn't focus exclusively on Google search data was to import also data on media coverage, data from social media, but also do a very detailed timeline of the things that were happening on the ground and also nationally, eventually, as the story became national to inform

  • 16:54

    KATERINA METSA [continued]: all these data sources and how they speak to each other. From our research, we know that a majority of Americans go to Google as a search engine to get information online. So this was one of the reasons we engaged with this research

  • 17:18

    KATERINA METSA [continued]: and we decided to use Google search data as a data source. However, Google search data cannot be a representation of people's opinions. It's not possible to ask them through a search engine or have data through a search engine that speak to what people believe or what they want.

  • 17:43

    KATERINA METSA [continued]: However, for this particular case study, we believed that Google search data can be a very good proxy in understanding people's intentions and interests as they are showcased in search data.

  • 18:04

    KATERINA METSA [continued]: For the Google search data, we identified about 2,700 terms. Then what we did is group these terms into five categories-- five conceptual categories. For instance, terms related to public health or terms related to personal health. Some examples would be 'why my hair is falling' or 'why is

  • 18:31

    KATERINA METSA [continued]: my water brown.' That was more about personal health issues. But then there were terms that were environment or about the city that were more about the public health aspect of the story. So we created these five categories. They were about news related terms, public health related terms, personal health related terms.

  • 18:53

    KATERINA METSA [continued]: Then a category about contaminants, so specific terminology that included things like lead and e-coli, very specific terminology about chemical contaminants that could be in the water. And the fifth category was about politics and government, so terms that were related with the political scene.

  • 19:15

    KATERINA METSA [continued]: For instance, Governor Snyder-- that was one of the key actors while the crisis was happening. So after we grouped all these terms into these five categories, we started a period of testing to see what kind of data we're getting back from the API.
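
Illustration (not from the talk): the five-category grouping can be pictured as a lookup table from category to terms. The sketch below uses the example terms mentioned in the talk where they exist; the entries for the news and public health categories are hypothetical placeholders, and `category_of` is an assumed helper name.

```python
# Category -> example search terms. Entries marked "hypothetical" are
# placeholders; the talk gives no examples for those two categories.
CATEGORIES = {
    "news": ["flint water news"],              # hypothetical
    "public health": ["flint water quality"],  # hypothetical
    "personal health": ["why is my water brown", "why my hair is falling"],
    "contaminants": ["lead", "e-coli"],
    "politics and government": ["governor snyder"],
}

# Reverse index so each term resolves to exactly one category.
TERM_TO_CATEGORY = {
    term: category for category, terms in CATEGORIES.items() for term in terms
}


def category_of(term):
    """Return the category for a search term, or None if it is not listed."""
    return TERM_TO_CATEGORY.get(term.strip().lower())
```

Storing the grouping as data rather than logic keeps the roughly 2,700 terms reviewable and lets the same table drive every later comparison.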

  • 19:36

    KATERINA METSA [continued]: And following that, we made decisions on the time period that we wanted to study. That's when we decided to start January 5 of 2014 all the way through the first week of July of 2016. So the data were basically based on a weekly frame.
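
Illustration (not from the talk): the weekly frame can be sketched as a list of week start dates. January 5, 2014 comes from the talk and falls on a Sunday; taking Sunday, July 3, 2016 as the start of "the first week of July" is an assumption, and `weekly_starts` is a hypothetical helper name.

```python
from datetime import date, timedelta


def weekly_starts(first, last):
    """List week start dates from `first` through `last`, seven days apart."""
    weeks = []
    current = first
    while current <= last:
        weeks.append(current)
        current += timedelta(days=7)
    return weeks


# End date is an assumption; see the note above.
weeks = weekly_starts(date(2014, 1, 5), date(2016, 7, 3))
print(len(weeks))  # 131
```

Under that assumption the frame holds 131 weekly data points per term group and geography, which matches the "about 2 and 1/2 years" mentioned throughout.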

  • 20:03

    KATERINA METSA [continued]: And then we decided that we are going to look into three geographies, the Flint DMA-- and what that stands for is Designated Market Area, so it is the city of Flint, including some neighboring counties-- the state of Michigan, and the US--

  • 20:23

    KATERINA METSA [continued]: basically, the whole territory of the US. Our goal was to create and understand three types of comparisons. So one, we wanted to compare within each category through time. So, let's say, how news related terms developed through the 2 and 1/2 year period,

  • 20:45

    KATERINA METSA [continued]: whether there were changes, et cetera. The second type of comparison we wanted to do was to be able to compare among categories. So compare whether public health news related terms were different from personal health related terms or from political terms. And the third type of comparison that we

  • 21:07

    KATERINA METSA [continued]: were interested in exploring was comparisons and differences among regions. So how did searches in Flint-- were they different in the way they were unfolding from searches in Michigan or nationally? After we created this analytical plan and frame,

  • 21:28

    KATERINA METSA [continued]: we got the data. We realized that for privacy reasons, some of the instances that we were getting back from the API were zero values, which means that the Google Health API does not give a value for that specific week or that specific group of terms because they want to maintain

  • 21:50

    KATERINA METSA [continued]: privacy of their users. For that reason, we decided to sample 50 samples for each data point, as we say, and then take the mean, the average in simple words, of those 50 samples. In the case where we saw that there were zero values, what we did is use a statistical method

  • 22:14

    KATERINA METSA [continued]: called imputing the data. So what you're doing, in very simple words, is take information from all the data points you've seen of those 50 samples for that specific week, let's say, in Flint for that search term. And what you do is you're imputing the value that would

  • 22:35

    KATERINA METSA [continued]: replace eventually that zero. The reason we did that, it was because we can't know that there were no searches about that search term. Most likely, because of those privacy thresholds, we were getting zero values, but that didn't mean that that was a real true zero value

  • 22:55

    KATERINA METSA [continued]: since we had other samples that were returning values for that data point. And that was the reason for imputing, so we could have a better representation of the data. After we did that and we basically created data points for all our categories and geographic regions,

  • 23:16

    KATERINA METSA [continued]: we plotted the data into visuals-- into these very long fever lines. And what we did is apply a smoothing technique to eliminate noise, as we say in our world. That was basically removing spikes or peaks

  • 23:38

    KATERINA METSA [continued]: and smoothing them out in order to understand and be able to see patterns that were beyond just momentary changes through the weeks. And after we did that, another step was to basically--

  • 23:59

    KATERINA METSA [continued]: our question was, how do we understand that a change in a given week x is different from a change in a given week y? Is that a meaningful change? Is that what we are seeing, a change that is worth talking about? And what we did there is use a statistical package

  • 24:20

    KATERINA METSA [continued]: and technique called Change Point using R, and that is actually the name of the package, too, Change Point. And what this technique is doing is basically trying to identify whether a peak or a change during a period is different from a change

  • 24:41

    KATERINA METSA [continued]: from the neighboring periods-- what was happening in the neighboring periods. Just to rephrase that, to explain a bit better, is that this statistical technique, the Change Point technique, is identifying distinct periods of activity where it can show you if a period, period x,

  • 25:06

    KATERINA METSA [continued]: is different from the period before or the period after. So after we created all that and we tested it, we were able to understand and identify changes and meaningful change that we could talk about. In the end, because we wanted to be conservative

  • 25:29

    KATERINA METSA [continued]: and we didn't want to exaggerate any changes we were seeing in the data, what we did is also apply one more threshold, which was a 30% change between a peak of period a from a peak in period b. And if that change was more than 30%, then we decided that that's a significant change

  • 25:51

    KATERINA METSA [continued]: that we could safely talk about in the data. So after this path and all these steps of testing and applying different statistical models-- and we did more testing with other statistical models that I didn't describe here.
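
Illustration (not from the talk): the four steps just described, averaging 50 samples with imputed zeros, smoothing, change point detection, and the 30% threshold, can be sketched in plain Python. Every method choice below is an assumption made for illustration: zeros are filled with the mean of the non-zero samples, smoothing is a centered moving average, the change point is a single least-squares split in the mean (a simplified stand-in, not the R package named in the talk), and the change is measured relative to the earlier peak. All function names are hypothetical.

```python
from statistics import mean


def impute_and_average(samples):
    """Average repeated samples for one data point, imputing zeros first.

    Assumption: a zero is a privacy-suppressed value, so it is replaced by
    the mean of the non-zero samples before the overall mean is taken.
    """
    observed = [s for s in samples if s > 0]
    if not observed:
        return 0.0  # every sample was zero; nothing to impute from
    fill = mean(observed)
    return mean([s if s > 0 else fill for s in samples])


def smooth(series, window=3):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    return [
        mean(series[max(0, i - half):i + half + 1])
        for i in range(len(series))
    ]


def single_change_point(series):
    """Index that best splits the series into two segments by mean.

    Picks the split with the smallest total squared error around the two
    segment means. Returns None for series shorter than two points.
    """
    def squared_error(segment):
        center = mean(segment)
        return sum((x - center) ** 2 for x in segment)

    best_index, best_cost = None, float("inf")
    for k in range(1, len(series)):
        cost = squared_error(series[:k]) + squared_error(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index


def exceeds_threshold(peak_a, peak_b, threshold=0.30):
    """True when peak_b differs from peak_a by more than `threshold`.

    Assumes peak_a is positive; the change is relative to peak_a.
    """
    return abs(peak_b - peak_a) / peak_a > threshold
```

For example, `single_change_point([1, 1, 1, 1, 5, 5, 5, 5])` returns 4, the index where the higher level begins, and `exceeds_threshold(100, 140)` is true because the second peak is 40% above the first.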

  • 26:12

    KATERINA METSA [continued]: I'm just describing the ones that we eventually used in this research. We also complement all that with the other pieces of data that we had-- the news coverage data and the Twitter data, which were all organized and structured in the same manner on that same weekly basis to help us

  • 26:34

    KATERINA METSA [continued]: inform what we were seeing in all of these changes in the search data. And finally, the qualitative part of this was to review all the big events and other events that were happening while we're seeing these changes in searches in Flint about the crisis.
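
Illustration (not from the talk): because the three sources share the same weekly basis, they can be lined up by week for side-by-side reading. A minimal sketch, assuming each source is a mapping from week to a value; `align_weekly` and the sample numbers are hypothetical.

```python
def align_weekly(search, stories, tweets):
    """Join three {week: value} mappings on the weeks they all share."""
    shared = sorted(set(search) & set(stories) & set(tweets))
    return [(week, search[week], stories[week], tweets[week]) for week in shared]


# Made-up values, keyed by ISO week start date.
rows = align_weekly(
    search={"2014-01-05": 12.5, "2014-01-12": 14.0},
    stories={"2014-01-05": 3, "2014-01-12": 5},
    tweets={"2014-01-12": 40},
)
print(rows)  # [('2014-01-12', 14.0, 5, 40)]
```

Joining only on shared weeks avoids silently treating a missing week in one source as zero activity.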

  • 27:05

    KATERINA METSA [continued]: For the Twitter data, all the volume and the numbers of tweets we got for this research was anonymized. It was not in the interest of this research to be able to track people or to have specific examples of tweets that they go back to the user.

  • 27:29

    KATERINA METSA [continued]: Our goal for this is to understand the volume of tweeting about the Flint water crisis. So in the way we did it, which was by training a monitor to be able to capture the related tweets to the crisis, and we didn't have any noise or unrelated Twitter conversation

  • 27:51

    KATERINA METSA [continued]: that was not part of the water crisis, we then just received and downloaded from that computational platform just the volume of tweets, which was daily. And then we basically combined it and aggregated

  • 28:15

    KATERINA METSA [continued]: to that week level. So we do have a great sense and care for privacy and how we are using this type of data. And not only this type of data, but also our survey data.
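
Illustration (not from the talk): rolling daily tweet volume up to the week level can be sketched as below. It assumes weeks start on Sunday, which is consistent with the January 5, 2014 start date; the helper names and the counts are hypothetical. Only volumes are handled, no tweet text or user information.

```python
from collections import defaultdict
from datetime import date, timedelta


def week_start(day):
    """Return the Sunday that opens the week containing `day`."""
    return day - timedelta(days=(day.weekday() + 1) % 7)


def to_weekly(daily_counts):
    """Sum a {date: count} mapping into a {week_start: count} mapping."""
    weekly = defaultdict(int)
    for day, count in daily_counts.items():
        weekly[week_start(day)] += count
    return dict(weekly)


# Made-up daily volumes.
daily = {date(2014, 1, 5): 3, date(2014, 1, 6): 2, date(2014, 1, 12): 7}
weekly = to_weekly(daily)
```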

  • 28:36

    KATERINA METSA [continued]: We have extensive steps even releasing a data set from a report that used survey data and eliminating any personal information or any information that could go back to that respondent. So our mission is to serve people with good data

  • 29:00

    KATERINA METSA [continued]: and information so they can make informed decisions, but also maintaining people's privacy and providing data that can be used for research purposes, but not to harm people.

  • 29:26

    KATERINA METSA [continued]: This was a methodological opportunity. We were very excited to work with data we haven't worked before and apply new statistical models in exploring and analyzing this type of data. It was an opportunity to work cross-team. Our journalism team worked very closely

  • 29:46

    KATERINA METSA [continued]: with our methods team and our labs team in analyzing this type of data, and we learned from each other. So in a sense, it was a great opportunity to apply this new type of data in a very important topic and news event that was happening. So what did we find?

  • 30:09

    KATERINA METSA [continued]: So as someone may expect, overall, throughout this 2 and 1/2 year period, we saw that search activity coincided and is very closely related to news events and news coverage. But there was an exception to that. In the beginning of the crisis--

  • 30:30

    KATERINA METSA [continued]: so we're talking about the spring to summer of 2014. When the crisis had started unfolding was when the water supply changed in Flint and corrosion of the pipes started happening. What we saw there were two elevated periods of increased search activity that we

  • 30:51

    KATERINA METSA [continued]: didn't see media coverage at that level increasing, and that was interesting. So basically, what this showed us-- what the data showed is that people were searching about news related to water and personal health

  • 31:12

    KATERINA METSA [continued]: before there was even a government action. So the first action that we saw from the government in terms of an official notice was in August of 2014, and way before news has turned to this issue in an elevated manner. There were some coverage in the local level,

  • 31:33

    KATERINA METSA [continued]: but it was not to the point as we saw later in the crisis. And that was one of the biggest spikes that we saw throughout that period in Flint locally. So the second finding after that was that the searching

  • 31:56

    KATERINA METSA [continued]: and what was happening around this crisis was local first. So we saw a lot of elevated activity in mostly all terms from the political related terms in Flint and Michigan, and basically, the national search activity only

  • 32:17

    KATERINA METSA [continued]: rose when the topic of the Flint water crisis was a national topic of news coverage and news outlets were covering that at the national level. We are committed in doing this type of work, specifically in journalism, because the way news is

  • 32:41

    KATERINA METSA [continued]: traveling and the way people engage with news has changed a lot, especially online. And we see big differences between young and older people. So this is an area we're really interested. And we traditionally, in our line of research-- in the journalism research,

  • 33:02

    KATERINA METSA [continued]: we are committed in using other types of data to inform our survey data. I found it-- and this was a very personal project because I was very involved from the beginning of its conception to delivering it-- I found it very valuable and I actually really enjoyed it,

  • 33:22

    KATERINA METSA [continued]: to work with very different people that I don't usually work with on a daily basis, but also with people that they have a very different skill than mine. In our team, we had the computational scientists. I myself have been trained to do this type of work.

  • 33:46

    KATERINA METSA [continued]: So we worked very closely together. And then we brought one of our senior methodologists from our methodology team to help us test all these different statistical models to make sure that this is the right path and even share all the questions we had and try to get to the--

  • 34:06

    KATERINA METSA [continued]: is that the right approach? Or, why are we seeing this and not that? And is there another way we could test this or another model or statistical method that we could apply? And then with our labs team, it was fascinating because they bring a skill set from computer science. So even developing our Python scripts and our R scripts

  • 34:32

    KATERINA METSA [continued]: and trying to hit the API, as we say, and understand the differences from all the tools that Google offers. Because we use the Google Health API, but we wanted to be ready to understand why we couldn't do this on the Google Trends API or what was different-- not that we couldn't do it,

  • 34:53

    KATERINA METSA [continued]: but what was different from these different offerings that Google offers. So I think it was a very valuable experience. It was one that I enjoyed a lot and it was one that I learned a lot. This is the fascinating part of working

  • 35:13

    KATERINA METSA [continued]: at a place like the Center. Each day, every day you learn something from your colleagues or from someone that is interested in your work. And you have a conversation and you really feel that you are accomplishing, but also you are evolving as a researcher.

  • 35:38



Katerina Metsa, Associate Director of Research at the Pew Research Center, discusses her research using Google, news media, and Twitter data to study information diffusion in a public health crisis, including insights that social media habits can provide; collecting Google search, news media, and Twitter data; what the different data sources revealed and whether they complemented each other; ethical considerations; and the findings of the research.


Analyzing Information Diffusion in a Public Health Crisis Using Google, News Media & Twitter Data
