  • 00:00

    [MUSIC PLAYING] [Researching Human Rights Using Big Data Analysis]

  • 00:11

    PATRICK BALL: I'm Patrick Ball, executive director and co-founder of the Human Rights Data Analysis Group. [Patrick Ball, Executive Director and Co-founder] For almost 25 years, my colleagues and I have built databases and conducted statistical analysis for big human rights projects, including truth commissions in nine countries, the United Nations missions in four countries, five domestic

  • 00:34

    PATRICK BALL [continued]: and international criminal tribunals, and dozens and dozens of non-governmental human rights organizations all over the world. Our work in all of those projects is to analyze data. And recently, people have said, what about big data? The problem here is that big data is an idea from industry,

  • 00:55

    PATRICK BALL [continued]: from companies that make their livings, do their business, tracking their customers, figuring out what their customers want, and giving customers what they wanted in the past. In industry, they can do all kinds of fancy analysis on vast troves of information, because they know all the information.

  • 01:15

    PATRICK BALL [continued]: Nothing is hidden from them. For example, an internet service provider knows every packet that flows through their network. A manufacturer knows every widget that they ship, every little thing they make, every customer they sell it to. An internet retailer knows every click that every customer has ever made and whether or not those clicks resulted in a purchase.

  • 01:36

    PATRICK BALL [continued]: They have all the data. [Human Rights Data Collection] In human rights data collection, we don't know what we don't know, and the reason that's important is that we don't know if what we don't know is systematically different from what we do know.

  • 01:57

    PATRICK BALL [continued]: I know that's a lot of complicated do's and don'ts on knowing and not knowing. But through the course of this talk, I'm going to explain why the data that we're missing may be very different from the data that we have, and what that means is that the data that we have is a bad representation of the whole world. So if we use the data that we have

  • 02:18

    PATRICK BALL [continued]: to make conclusions, particularly statistical conclusions, we get the answers wrong. This is a big deal. [Coverage Rate] Imagine that you collect and combine three databases. So imagine a Venn diagram in which we have three

  • 02:38

    PATRICK BALL [continued]: interlocking white circles. And each of those circles represents a database that we observe by going into the world and conducting fieldwork. We talk to people. We look at newspapers or internet sites. We look at cadavers. We conduct exhumations. We find out from morgues what people have come through,

  • 03:01

    PATRICK BALL [continued]: and all kinds of ways of determining people who've died in a conflict. And in this example, we've done that three times. We have three interlocking white circles. The question is, what's missing? Have we gotten nearly all the cases? So if we have a gray circle around the three white circles

  • 03:22

    PATRICK BALL [continued]: in a Venn diagram, is the gray only a tiny fraction of the white circles, or have we missed most of it? That is, does the encircling gray circle actually have more area even than the white circles? Which picture is accurate in this Venn diagram? The problem is that even if we don't care how many people,

  • 03:45

    PATRICK BALL [continued]: even if we're not interested in the total magnitude of the number of deaths, we're always interested in comparisons. For example, in the Truth and Reconciliation Commission in Peru, one of the crucial questions we were addressing was, what was the relative proportion of responsibility? We were comparing the killings committed by the Peruvian army

  • 04:07

    PATRICK BALL [continued]: to killings committed by the guerrillas of the Shining Path, the Sendero Luminoso. Which group committed more killings? Well, of the data that we had of stories that people told to us, we had about the same number. About the same number of people were killed by the Peruvian army as by Sendero Luminoso in the data

  • 04:30

    PATRICK BALL [continued]: that we had captured. But when we did some statistical analysis, that I'll explain later in this talk, we discovered that in fact many, many more people had been killed by Sendero Luminoso, but that it happened in secret. We've never been able to find out about it. What that means is that the relationship between what is observed, roughly equal numbers of people

  • 04:52

    PATRICK BALL [continued]: killed by both parties, and what is true, what statisticians call the population, this relationship is called the coverage rate. OK, think of dividing the white circles by the total size of the gray circle. When you've covered almost all the killings, you have a coverage rate around one. But when you've covered only a small fraction of it,

  • 05:13

    PATRICK BALL [continued]: that coverage rate may be 0.3 or 0.4. The only way to figure out the relationship between what you observe and what the total is, or that is how do you calculate the coverage rate, the only way to do that is with a probability-based model. I'll explain that in a moment. Here's the real problem.

  • 05:34

    PATRICK BALL [continued]: Coverage rates are not the same for different places and different times. So here we can look at 25 little maps of Syria, and each section of the map of each one of those 25 maps represents a different location in Syria. Each of these maps is one month between 2012 and 2013,

  • 06:01

    PATRICK BALL [continued]: a two year period, 25 months. What we're showing in the relative coloration in the map is how much we don't know. This is a measure of the opposite of coverage rate. And the point is that the darker the area, the darker red color on each map, the less we know.

  • 06:25

    PATRICK BALL [continued]: And we've calculated that using this probability model that I'll discuss in a moment. The point of that is that, you can see, as we go over months and as we move around to different places, in some months, and in some places, we know almost everything. In other months, in other places, we know almost nothing. This is a big problem, because if we look at patterns,

  • 06:49

    PATRICK BALL [continued]: trends over time or trends over space, those patterns and those trends are determined, not by how much violence there was, which is what we think we're looking at, those trends, instead, are determined by how much information we got. And that information is not the same as how much violence there was.
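
As a rough illustration of the coverage rate described above (documented cases divided by the true total), the following Python sketch uses invented monthly figures; none of these numbers come from the talk, and `coverage_rate` and `months` are hypothetical names. In real work the true totals are unknown and have to be estimated with a probability model, which is the point the speaker is making. The sketch only shows how a falling coverage rate can make documented killings decline while true killings rise.

```python
def coverage_rate(observed: int, total: int) -> float:
    """Fraction of the true total that was documented."""
    if total <= 0:
        raise ValueError("total must be positive")
    return observed / total


# Hypothetical figures, invented for illustration only:
# each entry is (true killings, documented killings).
months = {
    "month 1": (100, 90),
    "month 2": (200, 80),
    "month 3": (400, 40),
}

coverage = {m: coverage_rate(obs, true) for m, (true, obs) in months.items()}

for m, (true, obs) in months.items():
    # True violence doubles each month, yet the documented count keeps falling.
    print(f"{m}: true={true} documented={obs} coverage={coverage[m]:.2f}")
```

Read only from the documented column, the trend would suggest violence is easing, which is the documentation dynamic rather than the conflict dynamic.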

  • 07:10

    PATRICK BALL [continued]: If somebody is writing down all the names of people killed and the person doing the work, the documentation expert, if that person has to leave, well, the number of people that we document will go to zero. It will look as though no one is being killed, but in fact, many more people may be being killed when no one is writing it down.

  • 07:30

    PATRICK BALL [continued]: That may be why the person had to leave, because violence went up. So the two words, the two phrases that we use here are the conflict dynamics, that is what are the changes over time and space of violence? Is violence going up? Is it worse in the north, worse in the south? Those are conflict dynamics. We want to understand conflict dynamics,

  • 07:51

    PATRICK BALL [continued]: but what we get when we look at data are the documentation dynamics. We get a picture of who's writing stuff down. We get a picture of what we can capture, not what's true. [Selection Bias] Let's turn now to talk a little bit about Iraq, and I'll talk about why documentation dynamics have

  • 08:14

    PATRICK BALL [continued]: been so dangerous there. In Iraq, we had an awful lot of media coverage. Tremendous numbers of journalists went into Iraq starting in 2003 with the invasion by the US and the US allies to overthrow Saddam Hussein. And then, in the ensuing years, journalists stayed in Iraq

  • 08:38

    PATRICK BALL [continued]: and did really good research. They reported every death they could find. A project in the United Kingdom, the Iraq Body Count, did a terrific job reading the world's media and capturing information about people killed in their database. So they have a large database of tens of thousands

  • 08:59

    PATRICK BALL [continued]: of reported deaths. How many did they miss, and what kinds of deaths were they that they missed? Well, it's very hard to tell, because the only data that we have are media reports. But what we can do is look at each event, each individual incident and ask two questions about each of those incidents.

  • 09:20

    PATRICK BALL [continued]: The first question we ask is, how big was it? Was it an incident in which one person was killed, were two people killed, were three people killed, and so forth? That's the size of event. The next question we can ask is, how much reporting was there about that event? Was there one story about it, two stories about it, three stories about it? How many different media companies covered that event?

  • 09:43

    PATRICK BALL [continued]: This graph explores the relationship between the event size and the number of stories about each event. So the first bar on the left shows the distribution of the number of sources for events, where only one person was killed. And what you can see in that bar is

  • 10:05

    PATRICK BALL [continued]: that about a third, a little bit more than a third, of all the events that have only one victim had only one story written about them. A little more than a third, about 40% of the people, who were killed one at a time had only two stories written about them. And then about a quarter of the people who were killed in events of size one

  • 10:26

    PATRICK BALL [continued]: had three or more stories written about them. Now let's turn to events of 15 or more people being killed. And in that case, we can see that about 85% of the incidents in which 15 or more people were killed were reported by 15 or more sources. So we can tell that the larger the event,

  • 10:47

    PATRICK BALL [continued]: the more stories were reported about it. Let's return to our original question. What kinds of events are likely to have zero stories written about them? That is, what kinds of events will be unreported by the world's media and therefore invisible to this database? It seems that events of size one generally

  • 11:11

    PATRICK BALL [continued]: have only one or two sources that cover them. So it's easy to imagine that events of size one may often have zero sources about them, whereas events that are very large have many, many sources about them. And this makes sense. If you think a little bit about large events,

  • 11:32

    PATRICK BALL [continued]: they're harder to conceal. They happen. Lots of people are affected. Lots of people know about it, and so they're harder to hide. Furthermore, we find out looking into these events a little bit more deeply, that they're newsworthy in another way. By using another set of methods, that I won't describe in this talk, we can estimate very generally.

  • 11:57

    PATRICK BALL [continued]: We don't have a very precise estimate yet, but we have very generally the sense that the probability that an event is reported in the world's media is relatively small for small events. So it may be between 20% and 30%, 20% to 30% of the events having only one victim appear in the world's media,

  • 12:19

    PATRICK BALL [continued]: whereas almost all of the events of 15 or more people appear in the world's media. Now why is this a problem? Why is it a problem that we have a bias, that is, the world's media preferentially reports big events and rarely reports small events? Why is this a problem? Well, it's a problem, because it turns out

  • 12:40

    PATRICK BALL [continued]: that small events and large events are very different kinds of events. Small events between 2008 and 2012 were perpetrated largely by Shia militias, whereas the large events were perpetrated by Al Qaeda in Iraq and other international insurgent groups, or they were perpetrated by the US military or coalition

  • 13:02

    PATRICK BALL [continued]: partners in collateral damage events, that is civilians that were bombed by accident or were bombed as the coalition was pursuing people that they thought were their enemies. So we have different perpetrators. We also have different weapons. The small events were largely committed by firearms, people got shot, whereas in the large events,

  • 13:24

    PATRICK BALL [continued]: we see improvised explosive devices, bombs, or air strikes being the more common weapons. Similarly, small events and large events have different victims. The small events were almost all adult men. The victims were all adult men, whereas in large events, we have a random selection of the Iraqi population, women, men, old people, children, and so forth.

  • 13:47

    PATRICK BALL [continued]: Finally, the political goals of the small events and large events are different. The small events were directed toward ethnic cleansing. Shia militias were trying to drive out Sunni people, drive them out of, particularly, Baghdad and the Baghdad suburbs, whereas the larger events were either destabilization or control depending on who the perpetrator was.
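
The effect of size-dependent reporting on comparisons can be sketched with simple expected-value arithmetic. In the Python example below, the event counts and the event size of 20 are invented for illustration. The 0.25 reporting probability for single-victim events is an assumed midpoint of the 20% to 30% range mentioned in the talk, and 1.0 for large events is a simplification of "almost all". All variable and function names are hypothetical.

```python
# Hypothetical true number of events, keyed by victims per event.
true_events = {1: 1000, 20: 50}

# Assumed probability that an event of each size appears in the media:
# 0.25 is a midpoint of the 20%-30% range from the talk; 1.0 simplifies "almost all".
report_prob = {1: 0.25, 20: 1.0}


def victims_by_size(events):
    """Total victims contributed by events of each size."""
    return {size: size * count for size, count in events.items()}


def share_from_small_events(victims):
    """Fraction of all victims who died in single-victim events."""
    return victims[1] / sum(victims.values())


# Expected number of events that make it into a media-based database.
expected_reported = {size: n * report_prob[size] for size, n in true_events.items()}

true_share = share_from_small_events(victims_by_size(true_events))
observed_share = share_from_small_events(victims_by_size(expected_reported))

print(f"true share of victims in small events:     {true_share:.2f}")
print(f"observed share of victims in small events: {observed_share:.2f}")
```

Under these assumptions, half of all victims die in single-victim events, but the media-based record attributes only a fifth of victims to them, so any comparison of perpetrators, weapons, or victims tied to event size would be skewed in the same direction.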

  • 14:08

    PATRICK BALL [continued]: We have two completely different stories going on here, and by covering only the large events we missed the small events. The world failed to recognize the unfolding ethnic cleansing that drove the Sunni Arabs out of Baghdad. So when we look now, we look back at history

  • 14:29

    PATRICK BALL [continued]: from the perspective of 2015, one of the things we can ask is, why are so many Sunni Arabs in Western Iraq welcoming the Islamic State? Well, one of the reasons they're welcoming Islamic State is because they have been very violently persecuted by Shia militias, and they look to the Islamic State

  • 14:51

    PATRICK BALL [continued]: to protect them, because the Iraqi state failed to protect them. So it's really important to understand the true conflict dynamics and not be misled as we interpret them simply by documentation dynamics. A lot's at stake here. We have to get the answer right. [Raw Data]

  • 15:15

    PATRICK BALL [continued]: Everywhere my team looks, we find this kind of bias in homicide reporting. We've studied this and written about it in Colombia, in Syria, in Kosovo, El Salvador, Guatemala, Sierra Leone, the Democratic Republic of the Congo, and Peru. Everywhere we've been able to look, we find this bias. The bias is everywhere.

  • 15:37

    PATRICK BALL [continued]: So it doesn't matter how much data we have, the bias actually persists until you have all the data, which in practice never happens or almost never. There's a tiny number of projects that take decades. But in practice, when we're trying to assess conflict dynamics for policy questions,

  • 15:59

    PATRICK BALL [continued]: we always have to assume that the data is very profoundly biased. No matter how big data gets, it doesn't get that close to completeness. So it's never really a reliable basis for understanding patterns by itself. And here is a small list of the kinds of data sources we've used to assess mass homicides and conflict,

  • 16:22

    PATRICK BALL [continued]: including truth commission testimonies, United Nations investigations, press articles, crowdsourced data, NGO (non-governmental organization) documentation, social media feeds, perpetrator records, government archives, state agency registries (those are all slightly different from each other), refugee camp records, and surveys that don't sample randomly. Raw data is very good for a case,

  • 16:44

    PATRICK BALL [continued]: but it's an inadequate basis to understand patterns. The point of statistics is to be right or at least understand how imprecise we are. And if we don't care if we're right, why bother with any rigor or even evidence? But if you do care if you're right, if it's important to get the answer right, we need to use some kind of statistical model.

  • 17:07

    PATRICK BALL [continued]: And you have to use that model in order to account for the fact that you only have some of the data. You want to use the data to make a statement about the whole population, about all the people, or about all the victims of homicide, but you only have some of it. So in order to go from the data you have to a conclusion about everyone, you need some kind of model.

  • 17:29

    PATRICK BALL [continued]: We should also keep in mind that statistics is primarily about comparisons. And we think about-- when we say what are the figures, we're often asking a question about the total magnitude of an event. But more substantially interesting questions have to do with comparisons. Is violence worse in the north? Is it worse in the south? Are victims primarily of ethnicity A or ethnicity B?

  • 17:51

    PATRICK BALL [continued]: Is the violence affecting men or women? Is it affecting adults, or is it affecting children, and so forth? Those kinds of comparisons tell us what's happening in the violence. And those comparisons are especially sensitive to bias. So we need to make sure that we've gotten the answer just right. Let's be clear. This is not a technology problem.

  • 18:13

    PATRICK BALL [continued]: This is not something we can build an app for. Statistical modeling is something we have to do for each particular project. We have to think through all the specific aspects of that project. We have to think about how all these little complicated probability relationships are working in order to get the model right. There's no technology which can substitute for this problem.

  • 18:36

    PATRICK BALL [continued]: Instead, we turn to statistics, not to software engineering. [What can we do with raw data?] We can do things with raw data. Raw data is useful for specific purposes. In particular, we can use raw data to affirm the existence of a case. We can say here's a bunch of evidence that

  • 18:57

    PATRICK BALL [continued]: shows that this thing happened. That's good. It's often very, very important to document a specific murder or a specific massacre. This gives you case knowledge. We can learn about each case. We can learn about a massacre that happened in March and compare it to a massacre that happened in April. Comparing cases is very, very useful.

  • 19:20

    PATRICK BALL [continued]: We can know the size of a given case if we investigate that case in great detail. We might be able to say, well, at least 100 victims died in the market bombing. We can be specific about what we know. We may not know everything, but we can say at least this number of people died. What we cannot do is make comparisons.

  • 19:41

    PATRICK BALL [continued]: We cannot say something like this case is the biggest massacre so far this year. We don't know. We don't know if other massacres might have been bigger, but we didn't investigate them very carefully, or there might have been massacres that we don't know anything about at all. Similarly, we cannot make claims about patterns. We cannot say where a hotspot of violence is, because maybe that's just a hotspot of reporting.

  • 20:04

    PATRICK BALL [continued]: Maybe that's just where all of our friends with cell phones are taking good videos. We don't know where another hotspot might be, where there aren't people taking videos or the cell towers are all destroyed. Similarly, we cannot make claims about patterns over time. We cannot say there are fewer victims in March than in April, or that violence is getting worse,

  • 20:26

    PATRICK BALL [continued]: or that things are worse in the north or the south. So we can do some things with raw data, but we cannot make comparisons or understand patterns. For those, we need statistical models. There are only three ways to do rigorous statistics. First, we might have a perfect census.

  • 20:46

    PATRICK BALL [continued]: A census means we have all the data. We often are familiar with censuses of people, of households, in a country. We use that to adjust voting patterns so that we make sure that we have enough representation for people in different parts of the country, or to deliver health services, or infrastructure, or something else. Censuses are very important.

  • 21:08

    PATRICK BALL [continued]: A census in statistical terms means that you have all the possible data. If you're, you know every click that every customer has ever made. That's a census, and this is what big data should mean. If you have all the data, you can make any analysis you like. As I mentioned earlier, there are a very small number

  • 21:28

    PATRICK BALL [continued]: of projects in human rights in which we have all the data. But even there, the only way we can prove that we have all the data is through some sort of statistical modeling. A second way to do rigorous statistics is to have a random sample of the population. A random sample is not an arbitrary sample. A random sample does not mean that you just walk out on the street and start asking people,

  • 21:49

    PATRICK BALL [continued]: whoever will talk to you. That's an arbitrary sample. You don't know who avoided you, who trusted you. So a random sample is hard to do. It requires actually doing some math. You have to figure out, you have to develop some kind of random pattern that a computer or a random number table will generate so that you can make sure that the people you talk to in your sample

  • 22:12

    PATRICK BALL [continued]: are not people that you chose, because you liked them, or they trusted you, or you thought they were attractive, or they had good clothing, or whatever. Instead, you need to rely on some method to select people at random. This is hard to do. There are many challenging technical issues. And once you have the sample, it can be really hard to figure out how the sample relates

  • 22:33

    PATRICK BALL [continued]: to the population. However, we have decades and decades of research that tell us how these pieces work and how they fit together. So a random sample is probably the best way to make an estimate of what's going on in a population. There's a third approach called the post hoc modeling of the sampling process. And there are several different ways

  • 22:54

    PATRICK BALL [continued]: that you can do post hoc modeling. It's sometimes called post-stratification. It only works when you have exactly the right data, and it requires an awful lot of math and computing capacity. This is the approach that my colleagues and I take here. But I want to be clear, it's not the only way to do it. In fact, it's not even the best way.

  • 23:15

    PATRICK BALL [continued]: However, if the data you have is a bunch of lists that people write down, well, this may be the only way to go from the data you have to an estimate of the total population. [Conclusion] If you want to do a study with data,

  • 23:36

    PATRICK BALL [continued]: you might want to ask yourself three questions. First, what is your data? Is it complete? Next, what is your question? Are you asking about some pattern in the underlying population, or are you really asking questions about the data itself, about the process of knowledge? And third, ask yourself what are your capabilities?

  • 23:57

    PATRICK BALL [continued]: Can you design your own study, or are you stuck with using only the data that's available to you? If you can collect more data, you might be able to collect just the right data that will allow you to do something even more useful combining two data sets.
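
The talk does not spell out the probability model it relies on. As a textbook illustration of how the overlap between two lists can inform an estimate of what both lists missed, the Python sketch below uses the two-list Lincoln-Petersen estimator, which is swapped in here for illustration and is not presented as the speaker's method. It assumes the two lists are independent and that every case is equally likely to be recorded, assumptions that the size-dependent reporting described in the talk would violate. The record identifiers and the function name `two_list_estimate` are hypothetical.

```python
def two_list_estimate(list_a: set, list_b: set) -> float:
    """Lincoln-Petersen estimate of the total population from two overlapping lists.

    Assumes independent lists and equal recording probability for every case.
    """
    overlap = len(list_a & list_b)
    if overlap == 0:
        raise ValueError("the lists must share at least one record")
    return len(list_a) * len(list_b) / overlap


# Hypothetical victim identifiers recorded by two documentation projects.
list_a = set(range(0, 60))    # 60 records: ids 0 to 59
list_b = set(range(40, 100))  # 60 records: ids 40 to 99 (20 also appear in list_a)

estimated_total = two_list_estimate(list_a, list_b)
documented = len(list_a | list_b)
estimated_coverage = documented / estimated_total

print(f"documented on either list: {documented}")
print(f"estimated total:           {estimated_total:.0f}")
print(f"estimated coverage rate:   {estimated_coverage:.2f}")
```

The smaller the overlap relative to the list sizes, the larger the estimated number of undocumented cases. With more than two lists, or when the independence assumption fails, more elaborate models are needed.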

Video Info

Publisher: SAGE Publications Ltd

Publication Year: 2017

Video Type: Video Case

Methods: Big data, Quantitative data analysis

Keywords: awareness; capabilities; comparison; conflict; estimation; ethnic cleansing, wars, and conflicts; homicide; human rights; human rights abuses; Islamic state; media coverage of the Iraq War; motivation; perpetrators; reporting crimes and victimization; Sensitivity; Shi'a islam; Sunni Islam; victims; weapons

Segment Info

Segment Num.: 1

Persons Discussed:

Events Discussed:



Dr. Patrick Ball describes the challenges of researching human rights abuses. He points out that because large-scale events draw more attention and media coverage, small-scale incidents and associated trends may be overlooked. Human rights researchers must use statistical methods to make comparisons and draw accurate conclusions about data.
