  • 00:01

    [MUSIC PLAYING] [An Introduction to Big Data for Social Science Research]

  • 00:09

    MIHALY FAZEKAS: My name is Mihaly Fazekas. I'm an Assistant Professor at the Central European University School of Public Policy. [Mihaly Fazekas] [Assistant Professor, School of Public Policy, Central European University] And I am an expert in government administrative data and big data methods, with a particular focus on corruption and good government research. I'm going to talk about big data for social science research,

  • 00:29

    MIHALY FAZEKAS [continued]: and we are going to cover two main topics, two emerging trends in this field. One is about what big data means for new datasets and data sources and data structures. [Trends in Big Data for Social Science] [What big data means for new data sets, data sources, and data structures] [What big data means for data analytic methods and techniques] And the second, what big data means for data analytic methods and techniques. [What is big data?]

  • 00:55

    MIHALY FAZEKAS [continued]: I'm sure you have been wondering what big data really means. This is a hip term a lot of people are using in policy and research. But it's actually not that easy to nail it down. There are very many conflicting views of what it means. Is it about the data itself? Is it the methods? Is it both?

  • 01:16

    MIHALY FAZEKAS [continued]: I'm not going to give you one best answer, because probably it doesn't yet exist. But I will definitely cover a few dimensions, a few characteristics which make big data research different from traditional, typically survey-based research. And by and large, we can talk about four

  • 01:38

    MIHALY FAZEKAS [continued]: distinct characteristics which define big data as a new data source. The first one is the frequency of data available. [Characteristics of Big Data] [Frequency of data available] So how often we can access the data, how often the data refreshes. So in the social media big data domain,

  • 01:59

    MIHALY FAZEKAS [continued]: the data is basically updated every second, right? This is very different from, say, an annual survey or another survey-based data collection method. The second important difference or novelty of big data is the detail of the data we have. If you have survey data as your predominant tool for collecting

  • 02:22

    MIHALY FAZEKAS [continued]: data, there are a certain number of things you can ask people, because they remember certain things, but wouldn't be able to give you reliable answers to other things. However, in big data, you often have behavioral data. You often have minute detail of what people are doing, right? So in some cases, you can even have their heartbeats,

  • 02:42

    MIHALY FAZEKAS [continued]: their body temperature. You can have each of their tweets. You can have all their comments on Facebook. So you can understand that this is a lot more detailed, fine-grained picture of what people are doing. The third aspect of big data and datasets is the coverage, the scope of things the datasets are describing.

  • 03:04

    MIHALY FAZEKAS [continued]: [Detail of data] [Coverage of data] Again, if you think about a survey, you can ask people about what they have done, their usual life, what they feel, what their perceptions are. But big data methods can give you insights into aspects which were not accessible before. For example, the exact location of people at the level of every second, right?

  • 03:26

    MIHALY FAZEKAS [continued]: So that gives you a far more detailed, if you can say, intrusive insight into what people are doing. That brings both opportunities and risks. And the fourth aspect of big data research is the availability of this data. [Availability of data] And availability is very different in big data. It comes in online web pages or large online datasets.

  • 03:51

    MIHALY FAZEKAS [continued]: And these datasets are typically available on a real-time basis, or at least with very frequent access to these datasets. So in a way, you can think of your research as just using one copy of the data. And then in a day's or a week's time, you can update your dataset, and you can use the broader dataset for doing your research.

  • 04:11

    MIHALY FAZEKAS [continued]: [What new data sources are there, and what can they help you achieve?] Understanding these abstract characteristics of big data datasets still might make you wonder what really these new data sources are. So let me give you a few examples of typically

  • 04:32

    MIHALY FAZEKAS [continued]: used big data and datasets. One group of data sources is social media data, which a lot of people are using. So social media can cover Facebook, Twitter, Instagram, or other data sources, other platforms.

  • 04:52

    MIHALY FAZEKAS [continued]: So these social media platforms provide data on people's interactions with each other, and what people share, how people describe themselves, the connections people have. Often, there is location data attached to it. So it's a very rich dataset. However, it only concerns a certain type of interaction, right? So you don't necessarily know what people's jobs are

  • 05:14

    MIHALY FAZEKAS [continued]: or how much they earn. You know about what photos they share. There are a number of data sources which you would consider traditional data sources, such as newspaper texts, such as legislative texts, such as government administrative data like tax records or spending items. We have had data on these, but typically, these datasets

  • 05:36

    MIHALY FAZEKAS [continued]: or this data has been on paper. So accessing it, putting it in a database, and making it readily analyzable with quantitative methods was very costly, very labor-intensive. And now with the electronic recording of text, legislative text, for example, or newspaper content, our capacity to gather this data and create a structured dataset from it has greatly improved.

  • 06:02

    MIHALY FAZEKAS [continued]: So there is a big group of data sources which are traditional by nature. But because they're electronic, they present new possibilities for analysis. And there are a couple of additional data sources which people less often associate with big data, but in research, they represent classic

  • 06:23

    MIHALY FAZEKAS [continued]: and widely used analysis.And these relate to online activity, which is notnecessarily social media.So for example, Google search terms, Google search trends.That's something people have used in the past.Or location, which is linked to your phone and smartphone.And also, the usage of certain apps.

  • 06:45

    MIHALY FAZEKAS [continued]: In particular health-related apps; these datasets can provide valuable data for research. One example of how even very narrow data, such as Twitter data, can be used to reveal complex structures quite far removed from Twitter itself

  • 07:05

    MIHALY FAZEKAS [continued]: is the use of Twitter networks for identifying where users are, where they're clustered, and, interestingly, revealing the geographical structure of Twitter users. So one example you can see on this slide

  • 07:26

    MIHALY FAZEKAS [continued]: is simply plotting the network of Twitter users, who is following whom, with colors highlighting clusters within this user network. And the clusters are named by geographic location, Spanish cities in this example. And it just shows, if you just try

  • 07:48

    MIHALY FAZEKAS [continued]: to understand who is following whom and how people are clustered in these networks, you can actually get to the geographical location of these people, because people living close to each other are a lot more likely to follow each other. Hence, the clustering in the network often corresponds to geographical location. [How does big data methodology compare with traditional

  • 08:11

    MIHALY FAZEKAS [continued]: research methodology?] I'm sure many of you are familiar with traditional survey methodology, survey data, and how you can use that for social sciences research. Hence, I think it would make sense to situate big data analysis and data sources in comparison to traditional survey methodology.

  • 08:33

    MIHALY FAZEKAS [continued]: Let us compare these two data sources and groups of methods with each other so that we better understand what big data represents. The first dimension along which we can compare survey data with big data is the population

  • 08:54

    MIHALY FAZEKAS [continued]: these methods used.In survey research, we traditionallyhave a sampling frame from which we sampleour units of observation.However, in big data, our populationis typically open-ended, right?Facebook users come online every second, right?So your eventual population from which

  • 09:17

    MIHALY FAZEKAS [continued]: you draw your observations changes over time. It's open-ended. The second dimension is the data source. In surveys, we typically have a single survey or, at best, waves of surveys if you have a repeated survey methodology. However, in big data, we typically

  • 09:39

    MIHALY FAZEKAS [continued]: have multiple online sources. We typically like to link different datasets, like Twitter data to Facebook data. Or even if you just think about Facebook data, you might want to link the photos people share with the text of the comments they share, linking that to the profile, how they describe themselves,

  • 10:00

    MIHALY FAZEKAS [continued]: so the text of that profile. The third dimension is sampling, how we draw the sample from the identified population. In survey research, we typically take the full population, a sampling frame, and draw a random sample from it. However, in big data, we typically don't take samples. We take all the data, the full population.

  • 10:22

    MIHALY FAZEKAS [continued]: So this implies that we don't have a sampling error at all, because we take the full population. The fourth dimension is data collection. Typically with surveys, we use a carefully crafted and tested questionnaire to solicit people's self-reported opinions or even

  • 10:46

    MIHALY FAZEKAS [continued]: self-reported experience or facts about themselves.But the key is that you always ask peopleto reflect on themselves.So there is a translation going on here,what people think of themselves or company managers thinkof their company.In big data, on the other hand, we typicallyuse a computer algorithm for data collection,

  • 11:09

    MIHALY FAZEKAS [continued]: and we collect the data that is stored electronically. This typically implies that we are collecting behavioral data. We don't ask people what their heartbeat is, right? We don't do that. Instead, we collect the data from the app which measures their heartbeat. We don't ask people how many people's posts

  • 11:32

    MIHALY FAZEKAS [continued]: they read today on Facebook. Instead, we go to Facebook and collect their online behavioral data. The fifth dimension is the data generation process, and it directly follows from the data collection method. In survey methodology, we put a lot of effort into coming up with the best questions, the best

  • 11:54

    MIHALY FAZEKAS [continued]: formulated questions or the right order of questions. It's a highly controlled data generation process. However, in big data, we often-- not always-- but often face the problem of found data. And found data in our context means that data is generated by someone else for purposes

  • 12:17

    MIHALY FAZEKAS [continued]: typically different from research. So it can be a profit-making enterprise such as Google. It runs the search engine, and it has its own goals in collecting that data and influencing user behavior. I'm sure many of you are familiar with seeing how Google provides keyword search term recommendations as you

  • 12:42

    MIHALY FAZEKAS [continued]: type, right? And that algorithm is optimized for the purposes of Google. It's not optimized for objective social research. So if you use Google search terms, you have to understand: this algorithm changes over time. This algorithm serves different purposes, not

  • 13:02

    MIHALY FAZEKAS [continued]: necessarily the same as social sciences research. The next dimension we should mention is the data structure. If you collect data through a survey, your typical unit of observation will be straightforward. If you ask people, it's the individual. If you ask companies or organizations, your unit of observation will be the particular organization.

  • 13:24

    MIHALY FAZEKAS [continued]: However, if you collect behavioral data from an online platform, you typically face a multilevel data structure, which tends to be complex and not necessarily orderly. Think about a single tweet on Twitter. As you can see on this slide, very

  • 13:46

    MIHALY FAZEKAS [continued]: straightforward for the user.A couple of words and the picture."Four more years" from Barack Obamais represented as data in a very complex way.You have information on the user itself.What's the username?how many followers that user has.Give information, textual information

  • 14:08

    MIHALY FAZEKAS [continued]: on the tweet itself."Four more years."You have a photograph.And you have, in addition, a geotagging link to the tweet.So you can see that's very simple, verystraightforward to understand as a tweet.But if you want to represent it as a data,you already have multiple levels,different types of variables.Some textual, some as a picture, which

  • 14:32

    MIHALY FAZEKAS [continued]: represents its unique opportunities and challengesfor big data research.The next dimension is the standards of validity.What you are familiar with in survey research--the p-values, the stars we are often after--is merely representing sampling error,this is our predominant theory.

  • 14:53

    MIHALY FAZEKAS [continued]: You have a full population, you draw a sample from it,and you want to make inferences to the population basedon your sample.Now if you have your full population, you didn't sample,you collected all your data, millions of observations,then you don't have the challenge of sampling error.However, you have challenge of measurement there

  • 15:16

    MIHALY FAZEKAS [continued]: or measurement bias.By the way, survey research also has this problem.It tends not to focus on it that much.But in big data research, this comes to the forefrontbecause of what I mentioned already,which is often we work with found data,data generated for other purposes than research.In addition, when you understand big data's

  • 15:39

    MIHALY FAZEKAS [continued]: a set of analytical methods, the standard of validitytends to shift as well, shift to predictive poweras opposed to sampling error as opposedto statistical significance.And predictive power-- I'll talk about it in a bit--

  • 16:00

    MIHALY FAZEKAS [continued]: is the yardstick by which we assess the quality, the goodness of models. The final dimension which is worth mentioning is the data analytical methods or techniques you have. You have seen plenty of great methods in the different tutorials and other case studies.

  • 16:23

    MIHALY FAZEKAS [continued]: These are used on survey data. These methods don't lose their validity, their applicability in the big data domain. However, because we have a lot of data, because we often take predictive power as the ultimate benchmark by which to measure models, we have a whole new arsenal

  • 16:46

    MIHALY FAZEKAS [continued]: of analytical methods, a few of which we are going to mention in a bit. [How can big data be collected?] You might be wondering how you can get hold of these new, exciting datasets, right? I wondered myself quite a few years ago.

  • 17:06

    MIHALY FAZEKAS [continued]: And there are generally two main ways to tap into these datasets. One is so-called webscraping. The other one is using APIs, or Application Programming Interfaces. Webscraping is a technique which we use to access large volumes of HTML pages or web pages.

  • 17:27

    MIHALY FAZEKAS [continued]: So basically, every page you see in your browserhas a code behind it of an HTML code.And this is structure to some degree in every case,and exploit this structure for extractingparticular bits of information from the websites.

  • 17:50

    MIHALY FAZEKAS [continued]: Now how you can do this, indeed, that would require a different tutorial. I'm sure you can find it in this collection. But there are dedicated packages in R or Python where you can pre-process these HTML pages and then extract information from them.
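
To give a flavor of what such packages do, here is a minimal sketch using only Python's standard-library html.parser; the page, its CSS classes, and the extracted fields are all hypothetical, and a real project would fetch live pages over the network and more likely use a dedicated package such as BeautifulSoup (Python) or rvest (R).

```python
from html.parser import HTMLParser

# A hypothetical page; real webscraping would fetch such HTML over the network.
PAGE = """
<html><body>
  <div class="notice"><span class="buyer">Ministry of Health</span>
  <span class="value">120000</span></div>
  <div class="notice"><span class="buyer">City of Szeged</span>
  <span class="value">45000</span></div>
</body></html>
"""

class NoticeParser(HTMLParser):
    """Exploits the page's structure: each notice wraps a buyer span and a value span."""
    def __init__(self):
        super().__init__()
        self.field = None      # which labeled span we are currently inside, if any
        self.records = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("buyer", "value"):
            self.field = cls

    def handle_data(self, data):
        if self.field == "buyer":
            self.records.append({"buyer": data.strip()})
        elif self.field == "value":
            self.records[-1]["value"] = int(data.strip())
        self.field = None      # only capture text directly inside a labeled span

parser = NoticeParser()
parser.feed(PAGE)
print(parser.records)
```

The key step mirrors what he describes: the parser does nothing clever, it simply relies on knowing how the website is written.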

  • 18:13

    MIHALY FAZEKAS [continued]: It's very important to understandthat the starting point is always understandinghow the website is written.What's the structure of that HTML code or other code?And then you can explore the structureto extract information.When we use the second way of accessing such datasets,

  • 18:33

    MIHALY FAZEKAS [continued]: APIs, typically our task is easier,because we are tapping into a structured dataset.If there is documentation in addition, it's even better,so you can understand what is in that dataset.[How can big data be analyzed?]

  • 18:58

    MIHALY FAZEKAS [continued]: As we have already seen, big data brings with it a whole new arsenal of analytical methods, while you can also use the more traditional, already widely used statistical methods. The new big data methods have a few characteristics which set them apart from traditional methods,

  • 19:22

    MIHALY FAZEKAS [continued]: and that refers both to standards of validity assessingmodels, as well as how these models workand how we can use them.In general, when you use big data analytical methods,we typically assess the quality of models.We typically choose between models

  • 19:45

    MIHALY FAZEKAS [continued]: using goodness of fit measures or predictive power measures.One problem which doesn't come up in traditional surveyresearch but comes up a lot with big data methodsis overfitting, that your model predicts your actual observed

  • 20:05

    MIHALY FAZEKAS [continued]: data too well.It becomes too specific to the data you have.That's why what we often do when doing big dataresearch is a so-called test train split.So you split up your data into subsamples.And you train your model, you build your models,

  • 20:27

    MIHALY FAZEKAS [continued]: you estimate your parameters in the training dataset.And then you assess the goodness of fitof your model on the test dataset,so dataset not seen by your model.So in this way, if a model is very, very goodin predicting your observed training datasets,

  • 20:50

    MIHALY FAZEKAS [continued]: but it fails in a similar but still different best dataset,then this test train split would revealthat your model overlearned one particular realizationof your data and underperforms on new datasets.And that's very important, because if youwant to use your model for a longer time period,you will get new data, right?

  • 21:10

    MIHALY FAZEKAS [continued]: This is the domain of real-time data and constant increasingdatasets.So very important to keep this test train split.Now a variant of test train splitis the so-called k-fold cross validation,and the underlying idea is very similar.You just split up your data into subsamples.

  • 21:31

    MIHALY FAZEKAS [continued]: But when you do k-fold cross validation,you do this multiple times so that you have multiple trainingsets and test sets as well.In addition to assessing overall model fit,we are still often interested in the statistical significanceof individual parameters.

  • 21:51

    MIHALY FAZEKAS [continued]: Because we don't have a problem of sampling error, sostatistical significance and the p values and the starsyou are familiar with, then we haveto apply a different standard.You can also understand that if youhave a dataset of three million observations, pretty muchevery parameter will be statistically

  • 22:12

    MIHALY FAZEKAS [continued]: significant in a traditional sense.To solve this problem, one approachis to use random permutations, or youcan use Monte Carlo random permutation simulationswhere you randomly reassign, say, your dependent variable,right?You had a binary dependent variable, zeros and one.

  • 22:34

    MIHALY FAZEKAS [continued]: You randomly reassign them.And you reran your model 100 times, 1,000 times.And then this multiple rerun of your model will create--will lead to a distribution of your parameter.And then you can compare this truly random distribution

  • 22:57

    MIHALY FAZEKAS [continued]: of your parameter of interest to the actual observed parametervalue.And if these parameter values very close to the randomlyfound parameter values, then the likelihoodis high that your actual observed relationship isproduced by random processes.However, if it's very rare, very unlikely to arise

  • 23:18

    MIHALY FAZEKAS [continued]: from a truly random distribution of your parameter values,then we can conclude that it's an actual relationship, whichis produced by some sort of causal relationshiprather than a causal mechanism, rather than a random process.Without comprehensively reviewing every single method

  • 23:38

    MIHALY FAZEKAS [continued]: used in the big data domain, let us just mentiona few which would help you and guide youthrough the different further tutorialsin these diverse field of methods.By and large, there are two groups of big data methods.One we call supervised methods.The other one is called unsupervised methods.

  • 24:00

    MIHALY FAZEKAS [continued]: The supervised methods are those wherewe have an existing dependent variable, 0 and 1,or you have someone's earnings in continuous variables.So you want to predict something.Unsupervised methods, however, don'thave such a dependent variable.You can call these methods pattern recognition methods

  • 24:20

    MIHALY FAZEKAS [continued]: where you really try to unearth, uncoverthe underlying structure in the datawithout saying which model, which variable should predictwhich dependent variable.Big data methods also vary by how flexible they are,how many restrictions we impose on the models themselves.

  • 24:43

    MIHALY FAZEKAS [continued]: And they're also very diverse in the maximization algorithm,which underlines their use.For example, there are methods whichare closer to the traditional regression methods like suchas ordinary least squares.These methods are polynomials or local regression,regressions splines.

  • 25:06

    MIHALY FAZEKAS [continued]: It might be also familiar to thosewith more advanced traditional statistical methods.But there are completely new methodswhich we haven't seen in traditional statisticssuch as decision trees and random forest or neuralnetworks.Just to give you a little flavor of how differentthese big data methods are from more familiar regression

  • 25:27

    MIHALY FAZEKAS [continued]: methods, let us look at a simple decision tree algorithm.A decision tree algorithm works as a supervised method,supervised machine learning or statistical learning method.So we have a dependent variable.We want to predict something.We might want to predict earnings

  • 25:48

    MIHALY FAZEKAS [continued]: of a particular individual.How a single decision tree works isthat it takes particular nodes, particular decisionpoints where the tree branches.At each of these nodes, we take a variable.We take a cut point, which sets at zero or one,

  • 26:10

    MIHALY FAZEKAS [continued]: yes or no, fulfilling the condition or not.And then the tree branches, depending on--will fulfill or not a particular condition.So a condition, a typical conditionrelevant for predicting earnings wouldbe the years of experience.

  • 26:31

    MIHALY FAZEKAS [continued]: Here, the algorithm tries to find a cut point underlyinga condition which would minimize the error of predictionin the two branches of the tree.Say, if earnings really skyrockets after, say,

  • 26:52

    MIHALY FAZEKAS [continued]: 10 years of experience, then the algorithmwill find 10 years as a cut point,giving a single average predictionfor those individuals who have between zero and 10years of experience.That will be the average of that group.And give another prediction for all those peoplewho have more than 10 years of experience.

  • 27:13

    MIHALY FAZEKAS [continued]: And the power of this algorithm isthat you can cut, slice your sample into multiple smallergroups, even along the same variable.Say you first cut by 10.So zero and 10, 10-plus.And then you can go further.Say, 10, 15, 15-plus, right?

  • 27:33

    MIHALY FAZEKAS [continued]: So even using the same variable, youcan refine your algorithm step by step,refine your prediction step by step.You can combine different variables.Say, gender and years of experience, education and yearsof experience.And all these in subsequent steps, reallylike branches of trees.And you arrive at the final best model where

  • 27:56

    MIHALY FAZEKAS [continued]: you've minimized your error.[Conclusion]Just to sum up what we have heard and achieved so far,big data, it does bring amazing amount of new data.

  • 28:17

    MIHALY FAZEKAS [continued]: These datasets can be accessed in very different waysthan traditional survey or traditional administrativedatasets.They also come in different structureswith often different ways of data generation processes.However, this is only one part of the story.We have also new methods, machine learning methods, which

  • 28:39

    MIHALY FAZEKAS [continued]: are very popular nowadays.But it doesn't mean that our traditional regressionor clustering methods are not applicable.They remain applicable.Also, we have seen that while you have new data,you might have some new standards of liability,some new methods.It doesn't mean that the researchprocess changes fundamentally.

  • 29:01

    MIHALY FAZEKAS [continued]: It still remains the same.We still need theory.We still need a measurement model,and we need to understand what data represents in our dataset.We still need to carefully assess models.And even though statistical significance representingsampling error is less of a problem with big data datasets,

  • 29:23

    MIHALY FAZEKAS [continued]: we still need to assess significanceof individual predictors and assess goodnessof fit of the overall model.[Further Reading][Fazekas, M. (2014).The Use of 'Big Data' for Social Sciences Research,An Application to Corruption Research.] [ /big-data-for-social-sciences-re search-an-application-to-corruption-research][Gromeland, G., Witten, D., Hastie, T., & Tibshirani,R. (2015).An Introduction to Statistical Learning,With Applications in R. 6th edition.London, United Kingdom, Springer.] [for data & R codesee: http://www-bcf.u][James, G. & Wickham, H. (2016).R for Data Science.Sebastapol, CA, O'Reilly Media.] [] [Fullcourse syllabus: sing-big-data-social-science-research][MUSIC PLAYING]


Mihály Fazekas, PhD, Assistant Professor at the School of Public Policy, Central European University, discusses using big data for social science research including, new data sources and what they can help achieve, the difference between big data and traditional research methodology, and the collection and analysis of big data.

Video Info

Publication Info

SAGE Publications Ltd
Publication Year: 2019
SAGE Research Methods Video: Data Science, Big Data Analytics, and Digital Methods
Publication Place:
London, United Kingdom
SAGE Original Production Type:
SAGE Tutorials
Copyright Statement:
(c) SAGE Publications Ltd., 2019

