  • 00:05

    Using Big Data to Measure Formidable Concepts: The Case of Government Contracts & Corruption Measurement.

  • 00:09

    MIHALY FAZEKAS: My name is Mihaly Fazekas. [Mihaly Fazekas, Assistant Professor, School of Public Policy, Central European University] I'm an assistant professor at the Central European University, School of Public Policy. I'm an expert in big data methods and especially administrative data, government-sensitive data. And my research expertise is in corruption and quality of government research, and I am the lead researcher

  • 00:32

    MIHALY FAZEKAS [continued]: of the DIGIWHIST Project and the Follow-on Sustainability Project. How did you become interested in studying corruption? I find corruption notoriously hard to measure, and most research, to date, has had to use perceptions data, which I regard

  • 00:55

    MIHALY FAZEKAS [continued]: like many colleagues of mine, as being of questionable quality because there are all sorts of biases in corruption perception indicators, such as people attributing low GDP or bad economic fortunes to corruption

  • 01:16

    MIHALY FAZEKAS [continued]: or people disconnecting their personal experiences from their corruption perceptions, or often corruption is perceived in contexts where people have no actual experience. Think about highway contracts being awarded to a group of cronies who then steal, say,

  • 01:36

    MIHALY FAZEKAS [continued]: 5 centimeters out of 10 centimeters of the concrete of the foundation. Then we know within five years that the foundation is weak. It has been weakened by corruption. We know that there is a lot of money which is stolen, but who knows better? I mean, if you have a general population survey, people will only reflect on these in the perceptions

  • 01:58

    MIHALY FAZEKAS [continued]: through the media and not actual experience. So I find there is a general problem with measuring corruption, and if you want to understand it, if you want to fight it, if you want to see trends over time, you really need reliable and consistent measurement. So when I started my PhD, I was looking for new data sources,

  • 02:18

    MIHALY FAZEKAS [continued]: new objective data where objective measurements of at least corruption risk could be developed. And public procurement or government contracting is one of those areas of government spending. There we have a lot of very detailed data, very, very fine-grained, large volumes of data,

  • 02:41

    MIHALY FAZEKAS [continued]: and we perceive this area of government spending as very corrupt. So I thought, OK, there is a chance here to gather new data, develop new measurements in an area where I am likely to find something very fishy. How did this interest lead to the DIGIWHIST project?

  • 03:03

    MIHALY FAZEKAS [continued]: Building on my initial PhD research focusing on Hungary, Czech Republic, and Slovakia, I developed a more comprehensive, large-scale research project together with a couple of colleagues, where I have expanded my PhD methodology to 35 European jurisdictions,

  • 03:25

    MIHALY FAZEKAS [continued]: so that means 34 European countries plus the European Commission, following the same data structure, similar measurement models, and very similar research questions and methodologies. And this Sage case study will be predominantly focusing on DIGIWHIST and using big data methods to measure corruption

  • 03:49

    MIHALY FAZEKAS [continued]: risks in government contracts. Given the scale and methodological complexity of this DIGIWHIST project, we had quite a few challenges in building a consortium of organizations and individuals

  • 04:12

    MIHALY FAZEKAS [continued]: with the right skill set and making them work together. So that you understand a bit better what these challenges really meant, first, we had to understand legally very complex environments. In government contracting, every single transaction is regulated in great detail.

  • 04:33

    MIHALY FAZEKAS [continued]: How many days you have to advertise, where, what you have to put on the website, what you can ask for, what you cannot. It can be discriminating against some companies versus not. It's legally very challenging. On the other hand, we have technical complexities because public procurement regulations and laws are then

  • 04:56

    MIHALY FAZEKAS [continued]: translated into government websites and IT systems, which gather data, publish data, receive data from non-government parties, typically companies. And these data sets vary country by country and, typically, from one period to another. A typical European country would have had two to three IT systems

  • 05:18

    MIHALY FAZEKAS [continued]: in the last 10 years, and they would have changed their procurement law like two, three times per year. So you can imagine the legal complexity and the technical, IT-level complexity. And understanding and tackling both was needed, brought together, to be able to create a data set which is comparable across countries and also comparable over time.

  • 05:42

    MIHALY FAZEKAS [continued]: And only then, when we had created these very complex data sets, did we start to look at measuring corruption, which is very challenging on its own. Many of my researcher colleagues would definitely agree. And then, as the third step, start using the data, using the new corruption risk indicators to do research.

  • 06:03

    MIHALY FAZEKAS [continued]: In that sense, the DIGIWHIST project has been an unusual project. It was designed to set up a research infrastructure, a data and indicator infrastructure, with publicly available, freely available data sets with fully replicable indicators,

  • 06:24

    MIHALY FAZEKAS [continued]: but still feeding into future research in public policy, political science, economics, and a long range of other fields. So facing these challenges and the solutions we envisaged, we had to put together a large consortium led by the University of Cambridge with the participation

  • 06:45

    MIHALY FAZEKAS [continued]: of other very good universities in Europe, such as the Hertie School of Governance in Berlin, and the EU Horizon 2020 Program was funding us for three years. It was a large-scale research project, which has come to an end in February. And since then we are running a smaller sustainability phase

  • 07:08

    MIHALY FAZEKAS [continued]: of the project for at least two more years to be able to refine our data set and measurement. And also we have been lucky enough to gather some additional funding to expand this approach, this methodology, to developing countries. Data collection and challenges.

  • 07:35

    MIHALY FAZEKAS [continued]: The first step in our research project was to understand what administrative data on government contracts is out there currently, devise a feasible strategy to collect and standardize as much as possible of these various data sets from 34 countries plus the European Commission,

  • 07:55

    MIHALY FAZEKAS [continued]: and bring it together into a single structure, which is efficient enough to be able to run on our servers and allow users to tap into it at high speed. How we did that. First, we had to carry out a detailed legal mapping of standard publication formats.

  • 08:16

    MIHALY FAZEKAS [continued]: So these are formats governments impose on every single public entity spending public money. And these forms define what information should be published. And that is, for example, if you award a contract, we have to know who the winner is, what's

  • 08:36

    MIHALY FAZEKAS [continued]: the name of the winner, what's the address, what's the contract value, when the contract was signed, so very basic information. But interestingly, even these basic bits of information are often put on the web in different formats, with slightly different wordings, in slightly different parts

  • 08:57

    MIHALY FAZEKAS [continued]: of the HTML code. So understanding the legally required format and understanding what those legally defined bits of information mean in our data structure had to be linked to the technical understanding of how those bits of information are represented on government

  • 09:20

    MIHALY FAZEKAS [continued]: websites. And that was very challenging, going a lot back and forth. You can imagine just very little things. OK, contract value, this is how it's written in an announcement. There is a number. Does it include VAT? Does it include potential overruns? Typically, you can increase contract value by, let's say, 5% without telling anyone.

  • 09:41

    MIHALY FAZEKAS [continued]: Does it include it, does it not, all sorts of variants. And these all have a meaning for what number we actually put in the contract value column in our data set. So the programmers had to work very closely together with those who understand the legal framework, typically lawyers, and the two talking together

  • 10:03

    MIHALY FAZEKAS [continued]: to define our data points in the data set. Technically, how we collected the data was also very challenging, and we had to overcome a great number of problems on the go. We faced problems such as changing data publication formats. Some governments switched from a structured data

  • 10:25

    MIHALY FAZEKAS [continued]: dump to an HTML publication or the other way around. Or they introduced a new procurement law, so the whole website was changed and the terminology changed, typically slightly, but still it meant we had to rewrite our algorithms. So what we have done, we wrote extensive web-scraping algorithms, some collecting website information,

  • 10:47

    MIHALY FAZEKAS [continued]: collecting the computer code representing the websites, and then writing parsing algorithms which take out distinct bits of information from the HTML code and put them into our predefined data structure. Now, in this process, we already had to implement a great number of standardization processes.
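A minimal sketch of what such first-instance standardization of numbers and dates might look like. This is an illustration only, not the project's code: the function names, the list of date layouts, and the rules for deciding which separator is the decimal mark are all assumptions.

```python
from datetime import date, datetime
from decimal import Decimal

# Hypothetical list of date layouts seen across portals (assumption).
# Day-first order is assumed whenever the layout is ambiguous.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%d-%m-%Y")


def parse_amount(raw: str) -> Decimal:
    """Normalize '1.234.567,89', '1,234,567.89' or '1 234 567,89' to a Decimal."""
    text = raw.strip().replace("\u00a0", "").replace(" ", "")
    last_dot, last_comma = text.rfind("."), text.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: whichever comes last is the decimal mark.
        if last_comma > last_dot:
            text = text.replace(".", "").replace(",", ".")
        else:
            text = text.replace(",", "")
    elif last_comma != -1:
        # Only commas: a single comma followed by 1-2 digits is read as decimal.
        head, _, tail = text.rpartition(",")
        if text.count(",") == 1 and len(tail) in (1, 2):
            text = head + "." + tail
        else:
            text = text.replace(",", "")
    elif text.count(".") > 1 or (last_dot != -1 and len(text) - last_dot - 1 == 3):
        # Only dots, read as thousands separators (assumption: '12.500' is 12500).
        text = text.replace(".", "")
    return Decimal(text)


def parse_date(raw: str) -> date:
    """Try each known layout in turn and return the first successful parse."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

In practice the ambiguous cases (for example a lone dot followed by three digits) would need a per-source rule rather than a global guess, which is one reason the legal and technical mapping has to be done source by source.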

  • 11:07

    MIHALY FAZEKAS [continued]: You can think about how dates are represented in all 35 European countries, with all sorts of different nomenclatures also for representing numbers. Is it dots or commas, zeros or not? So this first instance standardization

  • 11:29

    MIHALY FAZEKAS [continued]: had to happen at the early stage so that we have a real data set, which is already to some degree comparable. Using mixed methods to successfully measure corruption. Many people would agree with me that trying to measure corruption objectively is madness.

  • 11:49

    MIHALY FAZEKAS [continued]: It's very hard. So many tried, and many failed. So what did we do differently, why do we think we succeeded in this? First, we used poorly-used mixed methods. Even though we had the amazing data sets I just described, which captured between 5% and 9% of GDPs

  • 12:13

    MIHALY FAZEKAS [continued]: in Europe on the contract level, and for each contract, we would have hundreds of variables. It's great data. But how do we get from this to reliable and valid corruption risk measurement? Our approach was a genuine mixed-method approach. And in spite of having these amazing quantitative data,

  • 12:35

    MIHALY FAZEKAS [continued]: we started from the qualitative. We started from stories of how corruption is done in this particular domain. So we collected a great amount of interview evidence, stories of people who bid, or lawyers active in this field who have worked for government. They have seen such cases.

  • 12:55

    MIHALY FAZEKAS [continued]: We looked at court cases, proven cases, and also cases reported by the investigative media, so not as reliable as a court judgment, but with a lot of details and strong reason to suspect corruption has happened. And from this rich set of qualitative cases, these stories, so to say, we extracted typical situations,

  • 13:19

    MIHALY FAZEKAS [continued]: typical scenarios when corruption takes place in public procurement. And then what was common in all of these cases is the pretense of open-and-fair competition, something which laws require and ordinary citizens would agree

  • 13:40

    MIHALY FAZEKAS [continued]: that the contract should go to those who are the cheapest, the best quality, most reliable work, rather than to friends. So in each of these cases, the required open-and-fair competition, the required norms of open-and-fair competition had been

  • 14:00

    MIHALY FAZEKAS [continued]: subverted in different ways. And these typical cases which we extracted from the case evidence represented different variations on the same thing. They violated the open and fair norms of competition to benefit a particular set of connected firms

  • 14:21

    MIHALY FAZEKAS [continued]: and individuals. You can call them an old boys' network. You can call them a set of cronies or oligarchs. There was always someone owning or running a company personally connected to a politician or a bureaucrat who were colluding with each other to subvert these norms of open-and-fair competition.

  • 14:41

    MIHALY FAZEKAS [continued]: Now, for you to understand how our qualitative and quantitative measurement work tightly together, let me give you an example. So one story I frequently hear, and my colleagues frequently heard all around Europe, was when there was a public call for tenders.

  • 15:02

    MIHALY FAZEKAS [continued]: The companies could submit their bids. But then this advertisement was live for a very short period, say four days, including the weekends and, say, a national holiday. So in effect, people, companies would have two days or even less to put together a credible bid

  • 15:23

    MIHALY FAZEKAS [continued]: for an otherwise large tender. So anyone in the industry looking at this public tender would see this is impossible. Now, what was happening is where the corruption lies. The corruption lies in the informal information transmission between the public buyer

  • 15:43

    MIHALY FAZEKAS [continued]: and the private bidder, the connected bidder. So someone has shown the tendering terms well in advance to this connected company, so he or she could start preparing the bids, and everyone else who just sees the public announcement has an impossibly small amount of time to put together the bid.

  • 16:04

    MIHALY FAZEKAS [continued]: So that's a classic scheme. All other competitors are disadvantaged, the connected company has an unfair advantage. So it can submit a bid which is higher than market price, lower quality, so that corruption rents can be extracted. So that's the story. It repeats itself from Sweden to Romania.

  • 16:24

    MIHALY FAZEKAS [continued]: Now, how do we measure that? How do we go from a story to a reliable indicator or a set of indicators in a big data set, representing 5% to 9% of GDP, so enormous, millions of observations? Getting from the story to measurement is through the help of modeling.

  • 16:44

    MIHALY FAZEKAS [continued]: We basically tried to model how a corruption situation evolved and tried to represent the logic of a corruption situation in the quantitative data. So in this example, we tried to identify a very short advertisement period. It's a couple of days. So you can imagine a distribution of the number of days for advertisement,

  • 17:07

    MIHALY FAZEKAS [continued]: and the example on the slide you can see is from Portugal. You can see that there is a long tail to the left, very few days, 5, 10, 12 days to advertise bids. Now, the challenge for the quantitative measurement is to understand when short is really short enough for disadvantaging some companies relative to others,

  • 17:28

    MIHALY FAZEKAS [continued]: and this really depends on the market logic. In some markets, companies are very quick. In five days, they put together the bids, and that's OK. In some others, even one month would be too short. We don't have, and no one has, the detailed knowledge of these different cut points, when it's short, when it's not short. So we tried to take this number of days

  • 17:50

    MIHALY FAZEKAS [continued]: for the advertisement, controlled for contract value, market, part of the year, so the standard predictors, and we tried to predict the incidence of single bidding. And single bidding, so one bid submitted on a competitive market, is part of the story

  • 18:12

    MIHALY FAZEKAS [continued]: because why would a corrupt group set a short deadline? The story is very simple. They want one company to be the only bidder and all other competitors in the market not being able to bid. So basically, we need an empirically regular relationship, so to say a correlation, between a small number of days for advertisement

  • 18:35

    MIHALY FAZEKAS [continued]: and the probability of single bidding. And this is exactly what you have here, the probability of single bidding compared to the market norm as a function of different lengths of advertisement periods. And hence, in the case of this slide on Portugal, we could find that a couple of days of advertisement increases

  • 18:55

    MIHALY FAZEKAS [continued]: the single bidding probability drastically. So you have two variables, two risk indicators together, a small number of days and single bidding. So these together proxy the story I just described. This is how we move from a story to reliable measurement of corruption risk in large data sets of millions of contracts.
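The relationship described in this segment, single bidding relative to the market norm as a function of advertisement length, can be illustrated with a much simpler descriptive calculation than the regression with controls that the speaker mentions. The sketch below is an assumption-laden simplification: the function name, the tuple layout of the input, and the bin edges are all hypothetical, and it controls only for market, not for contract value or time of year.

```python
from bisect import bisect_right
from collections import defaultdict


def excess_single_bidding(contracts, bin_edges):
    """Average single-bidding rate minus the market norm, per advertisement-length bin.

    contracts: iterable of (market, advert_days, n_bids) tuples (hypothetical layout).
    bin_edges: sorted day thresholds, e.g. [10] gives bins '<10 days' and '>=10 days'.
    """
    contracts = list(contracts)

    # Market norm: share of single-bid contracts within each market.
    totals, singles = defaultdict(int), defaultdict(int)
    for market, _days, n_bids in contracts:
        totals[market] += 1
        singles[market] += int(n_bids == 1)
    norm = {m: singles[m] / totals[m] for m in totals}

    # Average deviation from the market norm within each advertisement-length bin.
    sums, counts = defaultdict(float), defaultdict(int)
    for market, days, n_bids in contracts:
        b = bisect_right(bin_edges, days)  # index of the bin this contract falls in
        sums[b] += int(n_bids == 1) - norm[market]
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}


# Toy data, invented purely for illustration.
toy = [
    ("roads", 3, 1), ("roads", 4, 1), ("roads", 30, 3), ("roads", 45, 4),
    ("it", 5, 1), ("it", 40, 2), ("it", 50, 1), ("it", 60, 5),
]
result = excess_single_bidding(toy, bin_edges=[10])
```

A clearly positive value in the shortest bin, alongside negative values in longer bins, is the kind of pattern that would motivate flagging very short advertisement periods as a risk indicator.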

  • 19:16

    MIHALY FAZEKAS [continued]: Data analysis and data cleaning. Now, you might be thinking that, OK, nice, we have some stories, some empirical quantitative regularities, but is it really a reliable measurement of corruption? And I would be agreeing with these points. Hence, we had to look at a number of external validity

  • 19:42

    MIHALY FAZEKAS [continued]: tests of the corruption risk indicators identified in the way I just described. And these external validity tests always had to be done by linking new data or linking new variables to the existing corruption risk indicators. So for example, something we have done in many countries,

  • 20:02

    MIHALY FAZEKAS [continued]: we looked at the prices. You expect corrupt contracts to be more pricey. So you basically look at risk indicators such as a tight, short advertisement period and single bidding. And you see whether contracts awarded in such a way, in such a scenario, are more expensive than their otherwise similar counterparts.

  • 20:23

    MIHALY FAZEKAS [continued]: And luckily, we found this relationship throughout Europe, so that's very reassuring that a similar scheme is existing, as I said, from Sweden to Romania. Another way of linking different data sets for validation purposes is looking at political connections of companies. So if political connections are used for favoritism,

  • 20:46

    MIHALY FAZEKAS [continued]: high-level corruption, then you would expect the kind of red flags from the tendering process, such as single bidding, to correlate with the presence of political connections of bidders. And this is, again, what we have found, that companies linked to politicians tend to have a lot fewer competitors.

  • 21:07

    MIHALY FAZEKAS [continued]: They're a lot more likely to be single bidders. Analyzing such a high volume of complex data is no easy challenge. We have over 17 million government contracts in our data set, and mining for these suspicious relationships,

  • 21:32

    MIHALY FAZEKAS [continued]: these scenarios, these stories of corruption, requires you to zip through all these 17 million contracts and also to look at a great number of variables within the same model. So running it on your own laptop is out of the question.

  • 21:53

    MIHALY FAZEKAS [continued]: What is possible is to use high-capacity servers and clusters of servers. And even then, the most efficient way of using our computational capacity is to break down the data into smaller subsections, such as, say, country, year, and rerun the same analysis over

  • 22:18

    MIHALY FAZEKAS [continued]: and over again, and then piece together the results later on. A related challenge for our analysis has been that even though we have a lot of detailed information on the companies bidding and companies winning, such as company name, street address, city, and the date at which this information was

  • 22:39

    MIHALY FAZEKAS [continued]: valid, still, identifying uniquely each organization was not straightforward. Because you can have IBM, I dot B dot M, International Business Machines. You can have different legal entities in different countries but trading under the same name, just incorporated in different countries.

  • 23:00

    MIHALY FAZEKAS [continued]: So identifying organizations so that you can have consistent tracking of organizational performance over time, that was a big challenge and that continues to be a big challenge for us. There are some basic string cleaning approaches, which we used for standardizing, for example, company names.

  • 23:23

    MIHALY FAZEKAS [continued]: You can lowercase them. You can change the company form to the national form, standardize them: limited, LTD. Or on the public entity side, you can have university, [? universidad, ?] uni dot. So you can standardize these.
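A minimal sketch of this kind of name cleaning, assuming a tiny hand-written lookup of legal-form variants. The `LEGAL_FORMS` table and the `normalize_name` function are hypothetical illustrations, not the project's actual rules, which would need a per-country list of company forms.

```python
import re

# Hypothetical, deliberately tiny mapping of legal-form variants (assumption).
LEGAL_FORMS = {
    "limited": "ltd",
    "ltd": "ltd",
    "incorporated": "inc",
    "inc": "inc",
    "gmbh": "gmbh",
}


def normalize_name(raw: str) -> str:
    """Lowercase, collapse dotted abbreviations, drop punctuation, unify legal forms."""
    text = raw.lower()
    # Collapse dotted abbreviations such as 'i.b.m.' into 'ibm'.
    text = re.sub(r"\b(?:[a-z]\.){2,}", lambda m: m.group(0).replace(".", ""), text)
    # Replace remaining punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    # Map each token to its standardized legal form where one is known.
    tokens = [LEGAL_FORMS.get(token, token) for token in text.split()]
    return " ".join(tokens)
```

With these rules, "I.B.M. Limited" and "IBM Ltd." both reduce to the same string, while a spelled-out variant such as "International Business Machines" would still need the similarity matching or manual linking discussed next.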

  • 23:44

    MIHALY FAZEKAS [continued]: You take the most frequent words in the strings, and you say, OK, this is the same, this is the same. So this is part of the preprocessing. But then you still have the challenge of linking different organization names, and just taking the perfect matches is not going to take you far. So you have to define a certain range of dissimilarity

  • 24:07

    MIHALY FAZEKAS [continued]: where you would say, OK, these are still the same even though there is some degree of dissimilarity. There is a range of string similarity measures, such as the Levenshtein distance. And what we have done, we calculated a big matrix of these distances across all our strings.

  • 24:28

    MIHALY FAZEKAS [continued]: And then for identifying the suitable cut point at which you say, OK, this is the same entity, this is not the same entity, you have to apply manual checks. So basically, you have to look at the concrete examples, search for those companies, and say whether they are really the same or not.
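The distance matrix and cut point described here can be sketched as follows. The Levenshtein distance itself is standard; the normalized similarity score, the function names, and the example cut point of 0.8 are assumptions for illustration, and the all-pairs loop is only workable for small lists (a real data set would need blocking or indexing first).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # deletion
                current[j - 1] + 1,                  # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def candidate_matches(names, cut_point):
    """Return all pairs whose similarity reaches the cut point, for manual review."""
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            score = similarity(left, right)
            if score >= cut_point:
                pairs.append((left, right, round(score, 3)))
    return pairs


# Toy names, invented for illustration; 0.8 is an arbitrary example cut point.
matches = candidate_matches(["siemens ag", "siemens a g", "acme ltd"], cut_point=0.8)
```

The pairs just above and just below the cut point are the ones worth checking by hand, which is how the manual review feeds back into the choice of threshold.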

  • 24:49

    MIHALY FAZEKAS [continued]: And with this combination of quantitative and qualitative manual checks, we could refine our cut points. Now, things have moved on, and now there is a very useful Python package called Dedupe, which helps you use machine learning based

  • 25:10

    MIHALY FAZEKAS [continued]: on a small set of manually hand-coded examples to define similarity and dissimilarity and assign entities to clusters of strings. New analysis and insight. Having assembled such a wide data

  • 25:33

    MIHALY FAZEKAS [continued]: set with such a great scope and developed corruption risk indicators allowed us to revisit scientific questions and hypotheses which have been looked at maybe qualitatively or using perceptions data, which I described as much less reliable than objective risk indicators.

  • 25:57

    MIHALY FAZEKAS [continued]: In addition, it also allowed us to refine and test existing theories of corruption and anti-corruption. But on top of testing theories, which have a particular importance for researchers like myself, we also could reveal new descriptive evidence, and descriptive evidence can be particularly

  • 26:18

    MIHALY FAZEKAS [continued]: powerful for policy purposes. For example, as you can see on the slide, we can cut this data into very small groups, providing a lot more specific evidence than was possible before. With perceptions data, you can compare countries. However, with objective corruption

  • 26:39

    MIHALY FAZEKAS [continued]: risk indicators based on government contracting data, you can, for example, select a sector, compare like with like. Let's look at infrastructure, say transport infrastructure investment, and let's compare regions, not countries, but regions. And this reveals a great diversity within countries,

  • 27:01

    MIHALY FAZEKAS [continued]: and if we wanted to look at other sectors, say education, you could see there is a great diversity across sectors within the country and within the region. So basically, big data combined with our novel measurement reveals a new picture. Countries are relevant, but often,

  • 27:22

    MIHALY FAZEKAS [continued]: very often, within-country regional and within-country sectoral variance is as great as the variance across countries. So looking at countries is useful, but it has its limitations. Looking at what you're really interested in, like little, smaller groups of observations, which are more homogeneous, that gives you the right unit of measurement.

  • 27:45

    MIHALY FAZEKAS [continued]: Findings. Revisiting research questions and hypotheses has been one of our goals from the outset, and it continues to be one of our goals, and we are very happy to see that other people, independently of us, download our data sets and use them for scientific research.

  • 28:09

    MIHALY FAZEKAS [continued]: One of the things we have looked at with our new data and indicators was the role of political party financing in corruption risks. This had been looked at from a historical perspective using case studies. There are very few quantitative papers, but we could look at, in particular,

  • 28:30

    MIHALY FAZEKAS [continued]: political party finance regulations and corruption in government contracts. And this gives us a lot tighter mechanisms to evaluate because often what happens is that one company donates to a particular candidate or party, which then gets elected, gets into power,

  • 28:51

    MIHALY FAZEKAS [continued]: and then, in return, we get a contract awarded to that company. That's the payback. So we really have both a dependent and an independent side, something which are a lot closer than generic corruption perceptions of the population and then some sort of generic party finance indicator.

  • 29:12

    MIHALY FAZEKAS [continued]: And to our surprise, we found that the restrictiveness of party finance regulations, such as whether companies are allowed to donate, whether there are limitations on how much an individual can donate, so these restrictions have absolutely no impact on corruption risks. It might come as a surprise to some of you. It was definitely very surprising for us,

  • 29:33

    MIHALY FAZEKAS [continued]: but some others would say, well, actually, this is not surprising. There are so many ways one can circumvent regulations of party finance. So just because we have a new clause, an additional restriction, doesn't mean that there's anything changing underground. In addition to looking at questions researchers have been looking at for a long time, we also

  • 29:55

    MIHALY FAZEKAS [continued]: looked at new questions and new problems. And one of these was a particular question of European Union funds and corruption risks in Eastern European countries and southern European countries, so recipients of EU funds. And this is a crucial question for the integrity of the EU.

  • 30:16

    MIHALY FAZEKAS [continued]: It's a crucial question for economic development in the recipient countries. Qualitative evidence, so far, has been very mixed. Some have said that flooding Europe's least developed regions with money will lead to a great increase in corruption because these public administrations and control mechanisms will just not be up to the task

  • 30:39

    MIHALY FAZEKAS [continued]: of controlling such volumes of money coming in from outside. Some others have said, no, this is not at all the case. The EU has such a strong framework for controlling corruption and fraud that this will not be the case at all. So having anecdotal or qualitative evidence

  • 31:00

    MIHALY FAZEKAS [continued]: going both ways, as well as theoretical arguments going in either direction, it was really up to quantitative analysis to decide what's going on. We implemented a more traditional matching approach, comparing like with like, so comparing a road construction project funded by the EU

  • 31:21

    MIHALY FAZEKAS [continued]: with a road construction project funded by the national government, very similar in every respect other than their funding source. The same contract value, happening in the same region. So really, the wider environment would be the same, except for the funding source. And to our surprise, we found that there

  • 31:44

    MIHALY FAZEKAS [continued]: is a very diverse landscape. In some countries, EU funding increases corruption risks compared to the nationally funded procurement projects. And in some others, there is no such relationship, so the two really are very similar. Yet in others, we have lower corruption.

  • 32:04

    MIHALY FAZEKAS [continued]: So it seems that the EU's additional controls on top of national controls lead to lower corruption. So we really could come up with a very detailed mapping of these corruption risk differences between EU funding and national funding, based on over 1.5 million contracts.
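The like-with-like comparison in this segment can be illustrated with a simple exact-matching sketch on region, sector and a coarse contract-value band. This is not necessarily the matching procedure the project used: the dictionary keys, the `risk` score, the band width and the function name are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def matched_risk_gap(contracts, value_band=100_000):
    """Mean EU-minus-national risk difference across matched cells.

    contracts: dicts with keys 'region', 'sector', 'value', 'eu_funded', 'risk'
    (a hypothetical layout; 'risk' stands for any composite risk score).
    Contracts are compared only within the same region, sector and value band.
    """
    cells = defaultdict(lambda: {True: [], False: []})
    for c in contracts:
        key = (c["region"], c["sector"], int(c["value"] // value_band))
        cells[key][bool(c["eu_funded"])].append(c["risk"])

    # Keep only cells that contain both EU-funded and nationally funded contracts.
    gaps = [
        mean(group[True]) - mean(group[False])
        for group in cells.values()
        if group[True] and group[False]
    ]
    return mean(gaps) if gaps else None


# Toy data, invented purely for illustration.
toy = [
    {"region": "A", "sector": "roads", "value": 150_000, "eu_funded": True, "risk": 0.6},
    {"region": "A", "sector": "roads", "value": 180_000, "eu_funded": False, "risk": 0.4},
    {"region": "A", "sector": "roads", "value": 120_000, "eu_funded": False, "risk": 0.2},
    {"region": "B", "sector": "roads", "value": 150_000, "eu_funded": True, "risk": 0.9},
    {"region": "A", "sector": "schools", "value": 50_000, "eu_funded": True, "risk": 0.1},
    {"region": "A", "sector": "schools", "value": 60_000, "eu_funded": False, "risk": 0.3},
]
gap = matched_risk_gap(toy)
```

Running the same function separately for each country would give the kind of country-by-country picture described above, where the gap can be positive, close to zero, or negative.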

  • 32:26

    MIHALY FAZEKAS [continued]: Recommendations for similar research and sustainability. Any of you thinking about such a mad data collection exercise have to bear in mind that if this is a one-off project, then the public value and research value

  • 32:49

    MIHALY FAZEKAS [continued]: greatly decreases. So really, you have to have a credible system sustainability plan. And what happened in our case is we designed the framework, which is very expensive, very hard to create at the outset. But once you map the legislation,

  • 33:10

    MIHALY FAZEKAS [continued]: once those mappings are translated into computer code, then basically, you can just run your algorithms at regular intervals. It still will be costly because we are talking about large volumes of data. But sustainability is assured as long as the underlying source

  • 33:30

    MIHALY FAZEKAS [continued]: data structure is unchanged. Now of course, governments tend to change procurement law two, three times a year, and then governments are not particularly good at updating their IT systems in ways that keep them consistent with past systems, so there is a need for maintenance.

  • 33:52

    MIHALY FAZEKAS [continued]: But creating a robust infrastructure at the outset makes system sustainability relatively feasible and less expensive. Still, we had to go for additional funding, additional research funding, and look for funding from foundations such as the Open Society

  • 34:13

    MIHALY FAZEKAS [continued]: Institute, which is helping us to keep the lights on, running the servers, and updating our algorithms. Conclusion. Our work, I hope, has shown that even crazy data collection and research infrastructure projects

  • 34:37

    MIHALY FAZEKAS [continued]: can bring great benefits. They can be realized. They can be made feasible if they are based on thoroughly thought-through and executed pilots and generous research funding. They can be made sustainable if there is enough policy and academic interest going forward.

  • 34:59

    MIHALY FAZEKAS [continued]: And the regular and frequent downloads of our database by researchers, journalists, and government analysts show that there is interest in these. It's worth doing it and worth continuing, and in our case, we were lucky enough that this is not just

  • 35:20

    MIHALY FAZEKAS [continued]: a European enterprise, but we can expand our work to Latin America, the United States, and quite a big number of Southeast Asian and African countries.


Mihaly Fazekas, PhD, Assistant Professor, School of Public Policy at the Central European University, discusses the use of big data to measure government contracts and corruption, including what prompted this study of corruption; the DIGIWHIST Project; data collection and challenges encountered; using mixed methods to measure corruption; data cleaning and analysis; new analysis and insights; research findings; and recommendations for similar research and sustainability.

Video Info

Publication Info

SAGE Publications Ltd
Publication Year:
SAGE Research Methods Video: Data Science, Big Data Analytics, and Digital Methods
Publication Place:
London, United Kingdom
SAGE Original Production Type:
SAGE Case Studies
Copyright Statement:
(c) SAGE Publications Ltd


Mihaly Fazekas


Methods Map

Secondary data analysis

The analysis of pre-existing data in social research.