The data revolution is upon us. It’s said that in the last two years, more data has been created than has ever been created before. And in two years’ time, we’ll be able to say the same thing.
For Gary King, the Albert J. Weatherhead III University Professor and director of the Institute for Quantitative Social Science at Harvard University, it is not the growth in the volume of data which is changing the world, it is the ability to use it. King describes big data as “the change in the world in which lots of things produce data.” This data itself is not inherently useful; the question is whether you can make it useful.
In this conversation King uses text analysis as an example of big data analytics. Social media has likely brought with it the largest increase in the expressive capacity of the human race in the history of the world. Roughly 650 million social media messages are produced every day. So, to someone trying to make statements about what those messages contain, would having 750 million messages make anything better? “Having bigger data,” King says, “only makes things more difficult.” The real innovation is in the ways of analysing those data.
King goes on to discuss the development of a sophisticated technique for analysing social media posts according to the needs of social scientists, valuing the trend of what people are saying over individual categorisations. He describes how a “mathematically similar” project, which utilised a database of Chinese social media posts, led to an insight into the how and why of Chinese government censorship, and to a further surprising revelation.
After discussing the supposed failure of polls in predicting Trump’s presidency, King concludes with a rumination on big data’s characterisation as a democratising, or manipulative, force.
DE: DAVID EDMONDS
GK: GARY KING
DE: This is Social Science Bites with me, David Edmonds. Social Science Bites is a series of interviews with leading social scientists and is made in association with SAGE. A revolution in data is underway. But it’s not really about the scale of data we now have available to us. Gary King is a Harvard professor who specializes in working out methods to analyze data. His results have enabled him to do some surprising things including tracking the censorship tactics of the Chinese government. Gary King, welcome to Social Science Bites.
GK: Thanks for having me. I appreciate it.
DE: We’re talking today about big data. What is big data?
GK: Big data is the change in the world in which lots of things produce data. If you’re in a company, and the company changes its HR system, or finance system, or air conditioning system, or sprinkler system, likely, there’s going to be a little spigot attached to it. And that spigot is going to spew out data. The data itself is not particularly useful. The question is whether you can make it useful.
DE: So we have a lot of numbers, a lot of bits of information. So what?
GK: Well, it provides an opportunity. We can make the data actionable. The revolution is big data. I mean, I love the idea of big data. The media has discovered this term, and they’ve enabled the public to understand what the heck it is we do. However, big data is not about the data. It’s not even about the big. It’s mostly about the analytics. There is a revolution going on. The revolution is about the new methods of being able to provide actionable information from existing data.
DE: So we’ll get onto the analytics in a minute. Let’s just concentrate on the big. Give an illustration of just how much data is now out there.
GK: The number that people keep giving is the rate at which data is collected and exists is expanding so fast that, in the last 2 years, there is more data than has ever existed on the planet, right? In the last 2 years, more data has been collected than has ever been collected in the history of the world. And that will be true 2 years from now as well, and then 2 years from then, just because the rate keeps increasing. Frankly, I don’t know whether that number is exactly right. But the rate is increasing very, very fast.
DE: And there’s data from my credit card. There’s data from social media. Quite soon, there’ll be data, presumably, from my fridge and kettle.
GK: Absolutely. It used to be that almost all the data in the world was created here, inside universities, by doing surveys and other kinds of experiments. Now, most of the data is created out there in the world. It’s created, actually, incidentally, as part of other things. So your cell phone produces an enormous amount of data. The GPS system produces incredible amounts of data. Almost everything produces tons of data. And so the question is whether we can do something with that.
DE: Well, let’s get on to that. One of the things you’ve done is text analysis. Explain what text analysis is.
GK: So today, there were 650 million social media posts that were written, and put on the web, and made available for you to read, or for us to download and analyze. How are you going to read all that? And when it’s bigger, when there’s 750 million a day, is that going to help you any more? No. Having bigger data only makes things more difficult. So the real innovation is the ways of analyzing those data. A colleague was downloading data every week and analyzing it. And the data, of course, became bigger and bigger. At some point, there was so much data that it didn’t fit on his computer. And so he asked our IT team to spec out a new computer for him that could deal with a much larger set of data. And the answer came back that it was a big computer that he needed. Actually, he needed to spend about $2 million on this computer. Now, that’s a beefy computer. It’s not impossible to get a computer like that, but you have to really gear up to get a very big grant to buy something like that and justify it. So we got a hold of this request. And we realized, wait a second. This is a little ridiculous. And we got 2 graduate students and myself, and we spent almost an afternoon. And now he runs the same analysis on much more data in 20 minutes on his laptop.
DE: But to get back to the text analysis-- we’ve got the 650 million social media posts every day. And you have found various ways of analyzing what people are saying.
GK: Right. So the 650 million social media posts, it’s probably the largest increase in the expressive capacity of the human race in the history of the world. It enables 1 person to write a social media post and, potentially anyway, for billions of people to read it. But what it doesn’t provide is the ability of any 1 person to understand what millions of other people are saying. It’s not possible to do by fully human means. Is there a way, the question is, to analyze the text in a more automated fashion? That’s what automated text analysis does.
DE: I can see how this would be extremely useful for private corporations. If I want to know how my new brand of orange juice is doing, I could analyze what everybody is saying about it on social media.
GK: Absolutely. If you want to know what people are saying about your products, then you certainly want to know what people are saying on social media. Because those are people speaking in a way that other people can hear. Before social media, people were saying things in hallway conversations and on their soapbox in the town square and at the water cooler and things like that. And now, quite a large fraction of this is available for research. It’s on the web. It can be downloaded. It can be analyzed.
DE: So it sounds like just a giant opinion poll.
GK: Yes. Although, there’s different notions of what public opinion is. It’s quite like the classic notion of public opinion, 100 years or more ago, in which the definition of public opinion was the people that chose to express themselves. If you were sitting in your home, by yourself, your opinion was not particularly relevant. And then, in the 1950s and ’60s, when modern public opinion came into existence, the definition of public opinion changed with the measure. And then it was pop questions of randomly selected people from different geographic locations, and we saw what the answer was. And sometimes that was tremendously useful. And sometimes not so much. But now, it turns out, we can get back to the classic notion of public opinion, which is activated public opinion of people who wish to express themselves, or influence policy, or politics, or product usage, or anything else. And a very large fraction of those people are actually saying things so that everyone else can hear, and so that researchers can hear also. So yes. Absolutely. This is tremendously valuable for companies and corporations. In fact, some of the methods that we developed have been patented by Harvard University and licensed to a startup company, Crimson Hexagon, and now, they have offices all around the world.
DE: So you’ve come up with a way of analyzing people’s contributions to social media in various forms. How do you come up with that analysis?
GK: The analysis itself is a method of statistics where we take all the text, and we process it in particular ways, and build specialized algorithms that can understand what people are saying, millions of people are saying. A very babyish version of this is we choose a keyword and count the number of posts-- although, that method tends to work very badly. So we have much more sophisticated methods. And there’s a whole literature of academics working on this. We came up with a particular method that was very successful at estimating the fraction of people speaking about any subject that you wish to choose. And you can get very deep into the meaning. So you asked me how we came upon this method. We actually tried to develop a method for more than a year, and every single one of them failed. So we were applying the methods that the computer scientists had developed for what they call natural language processing. And it turns out that none of the methods worked. Why is that? Well, because the computer scientists cared about something different than a social scientist. What they cared about was individual classification. So they’ll take a social media post and try to classify it into a set of categories, whether it’s about politics or about business. If it’s about politics-- whether it’s pro-Obama or pro-Trump or whatever it is. And they try to increase the percent of these posts that are correctly predicted into categories. That worked fine for what they’re trying to do. But what we cared about was not any individual post. Actually, the fact is nobody cares what Stat Pumpkin 222 says on Twitter. The only thing we care about is what everybody is saying-- the percent of people that were speaking in each of these categories, not any 1 post in the categories. We had been working on a completely unrelated project for over a year that, mathematically, was the same thing, even though, substantively, it was completely different. So what was that project?
Well, it was trying to understand what people are dying from in the developing world. In fact, for the World Health Organization, they want to know what people are dying from all over the world so that they could direct public health dollars, so that we could catch emerging diseases. In the United States, and most of the developed world, when someone dies, there’s a death certificate. There’s a medical personnel of some sort that sees the body, perhaps does an autopsy, does some tests, and decides what the person died from. And that’s how you figure out what the cause of death is or the distribution of the causes of death. But in the developing world, people go off to the bush and die, and they’re never heard from again. There’s no autopsy. There’s nothing, so how are you supposed to do it? So they do verbal autopsies. They ask the next of kin a series of uncomfortable questions about the symptoms the person had. Yes, the person had back pain. No, the person wasn’t bleeding, et cetera. You give those to a physician, and then the physician would determine what the cause of death was. The problem was that they realized, if you gave the list of symptoms to a different physician, they would tell you a different cause of death. And so the physicians were useless in this context. They needed to see the body.
DE: So you had a better way of analyzing the verbal descriptions of how people’s family members had died.
GK: Turns out, not. Not the verbal descriptions, no. The quantitative descriptions-- that’s why this application was completely different than the social media analysis. So literally, the questions were yes or no questions. Did the person who died have this symptom, did they have this symptom, did they have this symptom? They’d ask about 50 of these yes or no questions-- no discussion, no paragraphs written. That, people tried to automate, by individually classifying, by automated means, each of those sequences of 0’s and 1’s into 1 of the causes of death. And that didn’t work well either. The human physicians didn’t work well, and the individual classifications didn’t work well. And then we realized that, in public health, nobody cares about you. They only care about everybody, right? They only care what everybody dies from-- the percent of people that die from tuberculosis, not what any 1 person dies from. So we developed a method that didn’t classify individuals. It only estimated the percent in the category. It only estimated the percent of people dying from each of these causes. So at the same time, we were working on what seemed like a completely unrelated problem, which is how to understand what people are saying in social media. Nobody was dying. There were no causes of death. There were no physicians. There was nothing. However, we realized, after about a year of working on this, that in both verbal autopsies, and in social media, what we were doing in both cases, at first, was individually classifying. We were taking an individual death and putting it into a bucket and an individual social media post and putting it into a bucket, one of the categories of interest. And in both cases, that was a very error-prone process. And also, we didn’t care about it at all. We only cared about the percent of people dying from cancer and the percent of social media posts that were about a politician that didn’t like their foreign policy.
And so we realized that the second problem was actually the same as the first problem. Mathematically, the 2 were the same. And so we used the analytical statistical methods we had developed to solve the verbal autopsy problem. We applied it to social media, and it worked very well.
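The contrast King draws, between classifying each item and estimating only the share in each category, can be illustrated with a small sketch. It is not the method he and his colleagues developed: it assumes a hand-labeled set from which the probability of each yes/no answer profile within a category can be counted, and it swaps in a standard technique (expectation-maximization over mixture weights) to recover the category shares in an unlabeled population. The categories `tb` and `other`, the single yes/no symptom, and all the counts are hypothetical.

```python
from collections import Counter

FLOOR = 1e-9  # stand-in probability for profiles never seen in a category


def profile_probs(labeled):
    """Estimate P(profile | category) by counting in a hand-labeled set."""
    counts = {}
    for profile, category in labeled:
        counts.setdefault(category, Counter())[profile] += 1
    return {
        category: {p: n / sum(c.values()) for p, n in c.items()}
        for category, c in counts.items()
    }


def estimate_shares(unlabeled, probs, iterations=200):
    """Estimate category shares directly, without labeling any single item.

    The unlabeled profiles are treated as a mixture of the per-category
    profile distributions; EM updates only the mixture weights.
    """
    categories = list(probs)
    shares = {c: 1 / len(categories) for c in categories}
    for _ in range(iterations):
        totals = dict.fromkeys(categories, 0.0)
        for profile in unlabeled:
            weights = {c: shares[c] * probs[c].get(profile, FLOOR) for c in categories}
            norm = sum(weights.values())
            for c in categories:
                totals[c] += weights[c] / norm
        shares = {c: totals[c] / len(unlabeled) for c in categories}
    return shares


def classify_and_count(unlabeled, probs):
    """Baseline: put each item in its most likely bucket, then count buckets."""
    picks = Counter(
        max(probs, key=lambda c: probs[c].get(profile, FLOOR)) for profile in unlabeled
    )
    return {c: picks[c] / len(unlabeled) for c in probs}


# Hypothetical labeled set: the symptom is present in 8 of 10 "tb" cases
# and in 2 of 10 "other" cases.
labeled = (
    [((1,), "tb")] * 8 + [((0,), "tb")] * 2
    + [((1,), "other")] * 2 + [((0,), "other")] * 8
)
# Hypothetical population in which the true "tb" share is 30%:
# 30 * 0.8 + 70 * 0.2 = 38 people report the symptom, 62 do not.
unlabeled = [(1,)] * 38 + [(0,)] * 62

probs = profile_probs(labeled)
shares = estimate_shares(unlabeled, probs)
naive = classify_and_count(unlabeled, probs)
print("direct estimate:", round(shares["tb"], 3))
print("classify and count:", round(naive["tb"], 3))
```

In this toy setup the direct estimate recovers the 30% share, while classify-and-count reports 38%, because every individual misclassification leaks into the total. That is the sense in which individual classification can be error-prone for a question that is only about percentages.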
DE: So that’s fascinating. You’ve also used your analytic methods to find out the motives of an autocratic government that we know little about. That’s the Chinese government. Tell me a bit about how that came about.
GK: So we were working on these methods of automated text analysis-- technical, statistical, analytical methods, mathematical methods of automated text analysis. And we thought, let’s push these forward until they break. And that way, we will understand the flaws, and we’ll figure out how to improve them. Why don’t we try it in a language for which we didn’t develop it? We thought, well, what language could we try? Well, Chinese works very differently. Let’s try Chinese. So we called up Crimson Hexagon, which was this firm that I founded. And they go around the world and collect all social media posts that are public. And we said, send us a database of social media posts from China, and we’ll analyze those. So we got a database. And the database had individual social media posts and the URL from which each one came. We analyzed it, we found flaws in our methods, we improved our methods, we were pushing things forward. We were going to write a great paper on automated text analysis in Chinese, and then also, how it improved our techniques in English. And at some point, I said to my 2 terrific graduate students, Jen Pan, who now teaches at Stanford, and Molly Roberts, who now teaches at UCSD-- they both happen to speak Chinese, by the way. And I said go back to the websites from which these came. Let’s understand what happens in context. They came back to my office, and they said, there must be something wrong with the data we’re getting from Crimson Hexagon. Because sometimes we click on the posts, and it goes back to the website from which it came. And other times, we click on a URL that came from Crimson Hexagon, and it doesn’t go anywhere. And so I said show me. And we walk over to my computer in my office, where we’re sitting right now, and we’re clicking on posts, and we see both of those examples. And we click on something else, and all of a sudden it says, this post is being investigated. And we say, investigated? It’s a social media post.
Who investigates social media posts? Well, of course, the Chinese government does. And they are not embarrassed about censorship. And that’s what they did. So we happened to discover the way that the Chinese government censors. And we also came upon the fact that we were able to download all Chinese language social media posts before the Chinese government could read and censor them. And so we had the entire corpus of censored Chinese language social media posts that the Chinese people were not allowed to read, but we could.
DE: So you are able to understand what the Chinese were trying to censor and what tactics they were using to cover up bad news, as they saw it.
GK: Well, that’s what everybody thought. What everybody thought was that they were censoring criticism of the government. And it turns out that they were not censoring criticism of the government. In fact, you can say the nastiest things you want about the government and the leaders and their policies. You can say the leaders of this town, they’re all stealing money, here’s how much, these are the overseas bank accounts in which they have it. And by the way, they all have mistresses, and here are their names, and that won’t be censored. But if you say, “and let’s go protest,” then, they’ll censor you. In fact, if you say, the leaders of this other town are doing such a great job, let’s go have a rally in their favor-- they’ll censor you also. They don’t care what you think about them. They only care what you can do.
DE: So you discovered what they care about. You also discovered that they were putting on social media, fake messages.
GK: Yes. So it had long been rumored that the Chinese government fabricates social media posts and posts them on the web in the name of ordinary people. People have written quite a lot about it. So who has written about it? Well, journalists, scholars, activists-- people on social media accusing other people of doing this-- all of these categories of people, they all thought the same thing, that the Chinese government is arguing against the people who argue against the government. If you grew up in the United States, in third grade, you may have learned the word antidisestablishmentarianism. That word actually means against the people who are against the government. So thank you, Mrs. McNeil. I got to use that word in a paper for the first time. But it turns out that what these people are doing is not antidisestablishmentarianism. They are not arguing against the people who argue against the government. In fact, the Chinese government, when they fabricate social media posts, they’re not arguing against anybody. They’re not arguing. What they’re doing is distracting. They’ll post something that says it’s a beautiful day today, or I woke up this morning thinking about how important our martyrs were for the history of China. Well, they use these in giant bursts of activity, just as distraction at particular moments, like when there’s some type of protest or collective action or event or they’re worried about them. They’ll fire this giant cannon.
DE: To work this out, you have to identify which were the fake messages and which were the genuine ones.
GK: Right. So there was a small leak of emails to a local propaganda department in China in 1 county. It was a small leak, but a giant pile of data. We extracted from these data, 47,000 known social media posts, that were fabricated by the Chinese government. And we extracted from them, not only what they were, what they were saying, but who was saying them, when they were saying them, why they were saying them, where they were posting them. So then, what we did is, we extrapolated to all the other counties in China-- they’re called 50 cent party posts, by the way. Because the Chinese government had been rumored to pay 50 cents to individual Chinese people to write these posts. We found out, by the way, that that was false. But in any event, we then predicted who the 50 cent party people were in all the rest of the country and which posts were 50 cent party posts. So that wasn’t good enough, just a bunch of predictions, because how do you know if they’re right? So it turns out, one of the things social scientists do is they ask very, very sensitive questions of people. So there’s people that ask details of your sexual history or asking questions in countries in which you can’t really speak against the government for fear for your life. There’s ways of asking questions that we used. So we posed these questions in a special way on social media in private to the people we predicted to be 50 cent party people. 59% of those people admitted to fabricating social media posts. That also wasn’t quite good enough. Because well, maybe we just asked the wrong question. So it turns out, we were able to validate our validation. So what we did is we went back to 1 county where we had known 50 cent party members, we applied the same survey to these people, and we asked them, and 57% of those people admitted to fabricating social media posts. So then we had our validation.
DE: I can imagine that people in the State Department reading your papers will be fascinated by this because it gives an insight into what the Chinese government care about, what motivates them, what riles them.
GK: Potentially, governments would be interested in what we’re doing. We’re, of course, scholars trying to just understand the nature of autocracies, the nature of this particular autocracy. We’re obviously not a government.
DE: We now have President Donald Trump-- and the polls told us that wasn’t going to happen. They got it wrong. How come?
GK: So what the polls do is they’re trying to estimate the percent of people that would vote for Trump versus vote for Clinton. The polls are not actually forecasting the outcome of the election. We do have methods that forecast the outcome of the election, and they’re actually quite good. With information available at the time of the conventions, like in August, we can predict the popular vote results within plus or minus 5 or 6 percentage points. If the prediction is that it’s going to be a tie, then plus or minus 5 or 6 percentage points isn’t that helpful. But it is, nevertheless, incredibly informative. It’s not going to be 70% for 1 of the candidates or 60% for the candidates. Actually, let’s put this in context. You say the polls are not doing well. It used to be that we would take a random sample of Americans, not a haphazard sample, but a random sample. Everybody has to have the same probability of selection. And then we would call them up, and we would have a long conversation with them, ask them who they’re going to vote for, and lots of other questions. And then we would average those in particular ways, and we would report the answer. There’s a serious scientific basis for that. If we take a random sample of 1,000 people, we can tell you what’s going to happen with quite high reliability, and the error rate, we know very well. So what happened this year? Well, cell phones happened this year. It’s been happening for a while. But if someone calls you up on your cell phone, where you’re paying for the minutes, and I say I’m going to ask you some questions. Please, will you answer my questions so that I can report this in the media, and-- I don’t know. Make money or whatever-- people don’t do that. More than 90% of people, when asked to participate in a survey, refuse. And so the average of the answers to a random collection of people is not random. That’s a disaster for survey research. So how close did the polls do?
Well, Clinton won the popular vote by more than a percentage point. The polls predicted that she would win by 3 or 4 percentage points. So they were off by 2 or 3 percentage points. That’s incredibly good. It’s actually a miracle they did so well, even though the scientific foundation of what they’re doing has completely crumbled. We have to understand that the polls are good at doing what it is they seek to do, which is to estimate the popular vote. That’s a little different than the electoral college.
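The statistical point behind a random sample of 1,000 people can be made concrete with the standard margin-of-error formula for a simple random sample, set beside a sketch of what unequal response rates do to the estimate. The formula is textbook; the response rates in the second part are hypothetical numbers chosen only to illustrate the mechanism, not figures from the interview.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)


def observed_share(true_share, response_rate_a, response_rate_b):
    """Share of candidate A among respondents when A's and B's supporters
    agree to be surveyed at different rates."""
    a = true_share * response_rate_a
    b = (1 - true_share) * response_rate_b
    return a / (a + b)


# With a true random sample of 1,000, the uncertainty is known in advance.
moe = margin_of_error(1000)

# Hypothetical: the electorate is split 50/50, but 10% of A's supporters
# and 8% of B's supporters agree to take the survey.
biased = observed_share(0.50, 0.10, 0.08)

print(f"margin of error: +/- {moe * 100:.1f} points")
print(f"share for A among respondents: {biased * 100:.1f}%")
```

With n = 1,000 the margin is roughly 3 percentage points either way. In the hypothetical nonresponse case the respondents show about 55.6% for A in a tied electorate, and no increase in sample size removes that gap, which is why a 90% refusal rate undermines the scientific basis King describes.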
DE: Is there a downside to this wonderful revolution in big data? I’m thinking about things like manipulation. Is it becoming easier to manipulate individuals?
GK: So reporters call, and they say, so there is a revolution in how citizens can speak in authoritarian countries and have everyone else hear, and it’s an incredible democratizing force. And then, a few hours later, a completely different reporter will call and say, so social media is being completely manipulated, and governments are having their way with people. And the interesting thing here is that both are right. The relevant actor in all of these cases is, basically, the politicians. And by politicians, I mean the people in government, but also the people on the ground-- the citizens. And then there’s a playing field. The playing field is changing. It’s not tilting, necessarily favoring one or the other, but it’s changing. It’s a different game that everybody’s playing. And so what happens is, the politicians do everything they can, given the hurdles in front of them, to get their goals. So you have ordinary citizens trying to create a revolution, let’s say, in an autocracy. And you have autocratic leaders trying to keep them in check. And in democracies, we have quite analogous things going on. We have 1 party trying to influence what’s going on and change the nature of the conversation. When social media comes in, it completely changes the ability of somebody to speak to a wide audience. It also does the opposite, right? It changes the ability of someone to change the nature of social media. So it isn’t the technology that produces the change, it is the creative politicians, the creative people on both ends of the continuum. We’ll have to see how that works out. I want everybody to know that the potential of data, given the revolution in analytics, is extraordinary. We are all worried about invasions of our privacy. It’s actually back to a couple of hundred years ago when we all lived in villages when, basically, everybody knew everything that was happening at all times. So we’re getting a little closer to that.
And you may feel good about that, or you may not feel good about that. And I can totally understand it. But please add to your considerations, 1 thought. How much privacy would you be willing to give up to live, let’s say, 10 years longer than your life expectancy? How much privacy would you be willing to give up to live a more comfortable, fun, engaging, wealthy life? Those are the kinds of questions we actually have to ask. If we manage the data appropriately, the population of the world will let us develop all of these cool things that will make our lives much better.
DE: One final question. You’ve described yourself a couple of times in this interview as a social scientist. Do you think you’re an odd kind of social scientist? I’m thinking of the anthropologist or the sociologist who goes down and talks to people in a village or talks to people in a distant city and has a notebook and so on. You’re not doing any of that, because you’re analyzing data, you’re analyzing numbers.
GK: We all work together. We all have different pieces of the puzzle. It’s nice for people to be well-rounded. But when you go to your surgeon, you don’t ask them how good a piano player they are. So we have specialties. Yes, of course, I talk to people. Yes, I want to understand the context of what I’m doing. The particular thing I work on is developing methods to extract new types of information from new types of data. That’s what I like to contribute. But absolutely-- I need the anthropologists, and the sociologists, and the other political scientists, and the economists, and everybody else. The interesting thing about the social sciences, collectively, is that we used to be a bunch of separate fiefdoms. And more and more, we’re, for example, training social scientists at large to basically have the experience in these different fields. There’s no chance that I would have done the interviews of people in China who are fabricating social media posts, and done it well, without their research on people asking sensitive questions in very difficult situations, one on one. You have to understand the context of anything you study.
DE: Gary King, thank you very much.
GK: Thanks a lot. It was a lot of fun.
[MUSIC PLAYING]
Social Science Bites is made in association with SAGE. For more interviews, go to socialsciencespace.com.
[MUSIC PLAYING]