With the advent of ‘Big Data’, that is, the online availability of large volumes of electronic data characterising all aspects of social life, new avenues of research have opened up, with both promises and caveats. Working with ‘Big Data’ requires a new research mindset for obtaining and analysing data. This methodology case study uses the example of an ongoing research project looking into high-level corruption in public procurement in Central and Eastern Europe. The project collects hundreds of thousands of official procurement announcements available online, such as contract award announcements. As no readily available database of public procurement announcements exists in any of the Central and Eastern European countries, it uses computer algorithms to download announcement texts, from which useful information, or ‘variables’, is then extracted. The resulting database sheds novel light on the process of corruption in public procurement and allows for testing well-established theories of corruption.
By the end of the case, you should
- Be able to assess the usefulness and feasibility of using large unstructured information sources for social sciences research
- Be familiar with the basics of some of the methods and computer programs available for data collection
- Be familiar with elementary approaches available for creating a database from large amounts of unstructured information
- Be able to understand the basic analytical challenges of using large-scale databases and some typical solutions to them
Context of the Research
This section briefly spells out the general direction of new opportunities presented for social sciences by ‘Big Data’. It also discusses the main goals and methodological approach of the case study research project.
The United Nations report Big Data for Development: Challenges and Opportunities aptly spells out that today's world is experiencing a data revolution: both the speed and frequency of data creation are increasing at an accelerating pace, covering virtually the full spectrum of social life in ever greater detail. Moreover, much of these data are increasingly readily available, making real-time data analysis feasible. However, many large and research-relevant data sources are not publicly available; they have to be obtained from data holders such as Facebook, Google or national governments.
Parallel to increasing data availability, new analytical methods and technologies have developed too. Under the umbrella term data mining, a range of methods for data capture and analysis have developed, many of which closely resemble well-known statistical methods such as regression or clustering techniques. They are nevertheless responsive to specific issues of analysing ‘Big Data’, such as the much higher computational requirements of analysing millions of observations or the limited usefulness of traditional concepts of statistical significance.
Taken together, increasing data availability and more adequate data analysis tools unlock a whole new universe for social sciences research which can lead to new indicators and insights. One particularly troubled area of social sciences research is the study of corruption where the purposefully hidden nature of actor behaviour makes it particularly hard to come by reliable data and indicators.
This methodology case uses the example of an ongoing multi-year research project (http://anticorrp.eu/) looking into many forms of corruption across the globe and also focusing on high-level corruption in public spending in Central and Eastern Europe, such as European Union (EU) funds disbursement. One of the primary purposes of the research project is to develop new indicators of corruption (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2331980) which can be used to adequately test established theories and to develop new insights. This research is motivated by the inadequacy of current survey-based measures of corruption, which are both potentially biased and insensitive to change over time.
The research project has set out to identify new data sources for measuring high-level corruption in Central and Eastern European countries and to develop corruption indicators anew. Due to the availability of large volumes of unexplored raw data and the central role public procurement plays in state–society relations, the research team chose to research public procurement specifically. This choice implied that actor behaviour in procurement processes had to be understood qualitatively first, then described quantitatively based on official data sources. This was made possible by extensive transparency laws throughout Europe, which require governments to publish a great deal of information on each public procurement procedure, such as the deadline for submission, the number and names of bidders or the final contract value. We chose Hungary as a pilot case due to the ease of access to online public procurement announcements and other administrative data sources, and the country's suspected high prevalence of corruption.
This section briefly reviews the most important novel challenges presented by ‘Big Data’ in general as well as in the particular context of the research case study.
Working with ‘Big Data’ presents many usual social science research challenges in new forms and also adds new items to the researchers’ checklist. Novel methodological challenges are discussed in this section under four main headings by contrasting them with standard challenges of survey research.
Data Source Identification
In survey research, data sources are typically individuals or organisations whose full population has to be identified (i.e. developing a sampling frame) and then a representative sample has to be drawn from the known population (i.e. sampling). When working with ‘Big Data’, sources are typically online content or metadata1 such as articles of online news portals or official electronic records. Typically, the goal is to obtain the full set of available documents rather than sampling. These documents may be located in one place or in multiple places; often, there is no central register of documents of interest. By implication, working with ‘Big Data’ requires either laboriously verifying the completeness of information sources or relying on data providers such as Facebook to verify completeness.
In the case study project looking at corruption in public procurement, identifying relevant data sources was very challenging, as the Hungarian government spends more than 10% of annual gross domestic product (GDP) on public procurement and the potential sources are many. In addition, lack of transparency of public spending means that a considerable, but unknown, portion of this spending is not publicly announced and hence unlikely to be part of our research (although submitting freedom of information requests is also an option for those with time and patience at hand!). The resources available for research and the need for timely data collection made it impossible to collect paper-based sources and made the research team focus only on online material. As many potential data sources could be quickly identified, ranging from official government portals to commercial data providers, assessing the completeness and reliability of these sources was essential for high-quality research. Completeness was understood not only as a simple ratio of observed to estimated total public procurement but also as the differences in the characteristics of observed and non-observed transactions (e.g. average transaction size, distribution of contract values across main actors).
In survey research, data are collected through the means of a questionnaire administered in person, online, over the phone or by another method. This gives rise to well-known problems such as non-response or item non-response bias, which have a large literature. When working with ‘Big Data’, data are typically collected with computer algorithms or sometimes with manual download of online content, unless a well-structured database is available, which is the exception rather than the norm. The equivalent of non-response is when some documents are missing, for example, because they were not uploaded to the site or they were uploaded in a different format that is difficult to collect. The equivalent of item non-response is missing or uninterpretable/erroneous information in the documents, which may or may not be systematic (in corruption research, missing and erroneous administrative records are typically ‘suspicious’). Addressing both of these requires developing adequate procedures for identifying the magnitude of the problems and for obtaining the missing or erroneous information from other sources.
In the case study project, we had to face the simple fact that despite the importance of public procurement data for holding governments accountable, there is no database which could be readily used for research in Hungary (and in the rest of Central and Eastern Europe for that matter). Even after identifying the most complete and reliable online data sources, obtaining public procurement information was challenging. As principal data holders were unwilling to share their data, online documents had to be obtained from their websites directly, which is a technically complex and laborious task.
As gaps in the public procurement data obtained directly from official announcements were suspected to be large, alternative data collection methods had to be sought simultaneously, for example, to impute at least total procurement spending value per public organisation using national budget statistics.
In survey research, data structuring typically doesn't arise as a distinct research challenge, as one questionnaire refers to one unit of observation and variables follow directly from questionnaire questions, although data quality checking and enhancement are relevant challenges widely discussed in the survey methodology literature. However, data structuring arises as a major issue when using ‘Big Data’ for social sciences research, for multiple reasons. First, the obtained online content or documents often don't stand in any straightforward relationship to units of analysis (individuals or groups): for example, some individuals can have many observed documents or actions associated with them while others have hardly any. Hence, observed documents need to be linked to units of analysis (e.g. tying together multiple online journal articles reporting on the same public organisation). Once the relevant links between observed units of data collection are established, information has to be aggregated by the units of observation of scientific interest (e.g. counting the number of likes between two persons on Facebook to create a relational database of the kind often used in social network analysis). Second, raw information collected from online sources is often unstructured and requires extensive data structuring efforts. For example, it is relatively straightforward to download all the textual information from the homepage of a parliament, corresponding to the texts of laws, the timing of their introduction to parliament and the name of the ministry introducing them, but creating a database which contains this information in a structured and analysable format is hard work. Database creation from raw information represents one of the key new skills required for working with ‘Big Data’.
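The linking-and-aggregation step described above can be sketched minimally in Python. The records and field names below are hypothetical stand-ins, not the project's actual schema:

```python
from collections import defaultdict

# Hypothetical raw records: each downloaded document carries an identifier
# of the unit of analysis it belongs to (here, a public organisation).
documents = [
    {"doc_id": "a1", "organisation": "Ministry of Health", "value": 120},
    {"doc_id": "a2", "organisation": "Ministry of Health", "value": 80},
    {"doc_id": "b1", "organisation": "City of Pecs", "value": 200},
]

def aggregate_by_unit(docs):
    """Link documents to their unit of analysis and aggregate their values."""
    totals = defaultdict(lambda: {"n_documents": 0, "total_value": 0})
    for doc in docs:
        unit = totals[doc["organisation"]]
        unit["n_documents"] += 1
        unit["total_value"] += doc["value"]
    return dict(totals)

aggregated = aggregate_by_unit(documents)
```

The same pattern scales to counting announcements per procedure or likes per pair of persons; only the key and the aggregated fields change.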
Even if the data provider maintains a structured database as, for example, Facebook or Google do, restructuring information may be necessary to avoid privacy concerns and to make the data fit for analysis. Note that most data holders’ needs and criteria for high-quality data are quite different from those of social sciences.
Key challenges for our research project were to extract meaningful information from the raw text files of each procurement announcement and to link them in a meaningful way for analysis, as no structured database is held by the principal national data provider, the Hungarian Public Procurement Authority. First, automated data extraction had to rely on a strict legal framework which defined the format and information content of each type of announcement. Even though legal rules defined the format of announcements precisely, the relevant regulation changed frequently in Hungary (i.e. multiple times a year). Second, creating a database ready for analysis required linking announcements with each other, as the prime unit of analysis was the public procurement procedure. Typically, each procurement procedure was associated with more than one type of announcement, such as a call for tenders and a contract award announcement. In addition, a procurement procedure may also have multiple announcements of the same type if, for example, there was an error and the same announcement had to be published again with corrected content.
In survey data analysis, sampling error, that is, random error arising from only observing a small sample of the full population, plays a central role. The probability of hypotheses being correct for the full population in light of our sample is a key parameter (i.e. significance level). In ‘Big Data’ analysis, the full or nearly full population is typically observed, which renders the question of sampling error obsolete. Other sources of error and randomness take centre stage. Measurement error (which is also a big issue in survey data analysis) becomes a crucial problem which can be directly estimated in many cases. For example, the same information can be extracted with multiple methods, and from the discrepancies between the resulting variables, an estimate of measurement error can be obtained. A further problem with statistical significance tests is that large sample sizes shrink standard errors (which scale roughly with 1/√N, where N is the sample size), so even substantively negligible effects easily reach statistical significance, rendering traditional significance testing a less useful guide in judging hypotheses. Standard tests of statistical significance are, nevertheless, widely used, at least partially due to their wide acceptance as a gold standard in social science research. ‘Big Data’ also presents an often neglected but very crucial challenge concerning computing capacity, as millions of records cannot easily be analysed with the computers and statistical programs most often used by social sciences researchers.
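Estimating measurement error from discrepancies between extraction methods can be illustrated as follows. The announcement snippet, labels and both extraction rules are hypothetical stand-ins, not the project's actual code:

```python
import re

# Hypothetical announcement snippet; both methods target the contract value.
text = "II.2.1) Total final value of contract: 1 250 000 HUF excluding VAT"

def extract_by_regex(t):
    """Method 1: pull the digit groups (allowing spaces) after the label."""
    m = re.search(r"value of contract:\s*([\d\s]+)\s*HUF", t)
    return int(m.group(1).replace(" ", "")) if m else None

def extract_by_split(t):
    """Method 2: split on the first colon and keep only the digits."""
    tail = t.split(":", 1)[1]
    digits = "".join(ch for ch in tail if ch.isdigit())
    return int(digits) if digits else None

v1, v2 = extract_by_regex(text), extract_by_split(text)
# Across many documents, the share of disagreements between the two
# methods gives a direct estimate of extraction-induced measurement error.
agreement = v1 == v2
```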
In the public procurement corruption research project, we obtained information on more than 100,000 procurement procedures in Hungary, making traditional statistical significance tests very easy to pass. Furthermore, we noticed that public procurement announcements vary greatly in quality in terms of the completeness and clarity of the information reported, despite uniform and clear legal prescriptions on the content of each type of announcement.
Methods in Action: How to Work with ‘Big Data’
This section outlines in detail the main methodological solutions the case study research project applied to address the research challenges presented by working with ‘Big Data’. The focus lies primarily on the specific solutions applied instead of general discussion of textbook alternatives primarily because there are few widely accepted and codified approaches to these new problems.
The research team had to search for online content relating to public procurement in Hungary using standard search engines, key informant interviews and reviews of the relevant government reports (there was virtually no academic research using such data in Hungary).2 While the research project was made possible by identifying the most relevant source, this search never fully ended, as we had to remain open to incorporating additional data at any point of the research project.
Identifying a suitable and high-coverage data source was made possible by the Hungarian Public Procurement Law, which obliges every issuer of tenders to publish information on a national portal (http://www.kozbeszerzes.hu/), the Public Procurement Bulletin, if it meets certain criteria such as exceeding a minimum contract size. While this online source covered the bulk of Hungarian public procurement, we had to estimate the proportion of missing data to reliably establish the value of the data source. This was done by collating the total public procurement spending reported on this portal with the total public procurement spending estimated from aggregate budget figures following a European Commission methodology. This established that, for example, 62% of total spending was reported in the database in 2009, although the figure varied by year.
Having established that a large portion of total public procurement spending is reported in our online data source, the research team precisely scoped the systematic bias in the sample by analysing the rules governing which contracts do and do not have to be reported in the national Public Procurement Bulletin. This work revealed that, with the exception of some special sectors such as defence, the online portal contained information on the largest procurement contracts. As the research project's main aim was to study high-level corruption, the missing smaller contracts proved to be a relatively small problem in scientific terms. Nevertheless, further efforts were made to collect additional procurement information and to precisely gauge the magnitude of missing information at the micro-level.
The next step was to verify that the online portal does indeed contain all the documents it claims to contain. Upon verifying the completeness of this online material (e.g. checking whether there are gaps in the publication dates suggesting that some periods are not covered), the research team realised that even though the national portal is required to publish everything online in a standard format (i.e. HTML pages), a great number of public procurement announcements are published in a different format at a different URL (i.e. scanned PDF documents not amenable to automated text search). The identified differences in file format implied different data collection methods, which are discussed in the next subsection. An example of the standard document format can be found at http://www.kozbeszerzes.hu/ertesito/megtekint/portal_35472_2010/, which shows how the HTML page of an example public procurement announcement is structured.
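The publication-date gap check mentioned above can be sketched like this; the dates are invented for illustration:

```python
from datetime import date

# Hypothetical publication dates collected from the portal's index pages.
published = [date(2010, 3, 1), date(2010, 3, 2), date(2010, 3, 5)]

def find_gaps(dates, max_gap_days=1):
    """Return (start, end) pairs where consecutive publication dates lie
    further apart than expected, hinting at periods the source may not cover."""
    ordered = sorted(dates)
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if (curr - prev).days > max_gap_days:
            gaps.append((prev, curr))
    return gaps

gaps = find_gaps(published)  # one suspicious gap: 2010-03-02 to 2010-03-05
```

In practice, the expected publication rhythm (daily, weekly, excluding holidays) determines a sensible `max_gap_days`, and each flagged gap has to be investigated manually.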
In the public procurement corruption research project, confidentiality of information was not a problem as we used only publicly available data. Nevertheless, in many research projects working with ‘Big Data’, confidentiality of data is a major concern. This is the case, for example, using Facebook data or individual health records.
Once it became clear that there is no well-maintained database underlying the Hungarian Public Procurement Bulletin and that it was not possible to obtain even the unstructured data from the Public Procurement Authority, a direct data collection method had to be explored. This meant that most data collection had to be carried out with the help of computer algorithms, so-called crawlers. These downloaded all the text files from the online portal and recorded some elementary information necessary for identifying individual documents, such as the unique URL and the date of appearance on the webpage. Such data collection methods are widely available by now, and researchers are strongly advised to consult programming specialists who can carry out this task efficiently and reliably (e.g. many webpages have protection against high volumes of data download; hence, the crawler algorithm's download speed has to be adjusted accordingly). It is essential to document the crawler's specification in detail to meet the usual scientific standards of peer review and replicability. The complexity of the website where the information can be found is the main driver of crawler reliability (i.e. the simpler the webpage, the easier and more reliable the download procedure). An example of a crawler used in the research project, written in the PHP programming language, can be found in Figure 1.
Figure 1. Example of a crawler (PHP computer code) downloading announcements.
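For readers working in Python rather than PHP, the same basic logic can be sketched as follows. The URL pattern is an assumption modelled on the example URL cited above, the delay parameter is illustrative, and a production crawler would also need logging, retries and detailed documentation of its specification:

```python
import time
import urllib.request

# Hypothetical URL pattern for announcement pages; the real portal's
# structure differs and has to be mapped before crawling.
BASE_URL = "http://www.kozbeszerzes.hu/ertesito/megtekint/portal_{id}_2010/"

def crawl(announcement_ids, delay_seconds=1.0):
    """Download each announcement page, recording the URL and retrieval
    time needed to identify individual documents later. The delay keeps
    the request rate low enough not to trigger download protection."""
    records = []
    for ann_id in announcement_ids:
        url = BASE_URL.format(id=ann_id)
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            html = None  # failed downloads are logged and retried separately
        records.append({"url": url, "retrieved_at": time.time(), "html": html})
        time.sleep(delay_seconds)
    return records
```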
As not all the relevant online content could be found at the same URL and in the same format (see the previous subsection for details), remaining documents which were in PDF format had to be collected manually.
To complement the public procurement data and to cross-check vital characteristics of our database, such as total procurement spending value, the research team has also collected data from official data providers in more ‘traditional’ ways such as requesting company registry data from the central registry or obtaining public sector organisations’ annual budgets from the National Treasury.
Data structuring represented the single most laborious and resource-intensive part of the research project. The primary activities we carried out were to extract textual information exploiting the strict structure of each announcement and to recode this information into analysable variables.
To extract relevant information precisely, the prescribed announcement formats and the implementation of each new format had to be mapped. This was no straightforward task as the legal framework changed several times a year and the implementation of new rules was uneven and intermittent. In addition, the webpage reporting the announcements changed structure multiple times during our research project adding and removing relevant information from the web!
Mapping announcement formats helped not only to precisely extract information but also to fully define the list of variables available for data extraction. As the legal basis of most sections in the announcements changed over time, the list and content of variables to be extracted had to be harmonised to arrive at a database which allows for time-series analysis. For example, the list of procedure types available to issuers of tenders, such as an open procedure according to certain paragraphs of the Public Procurement Law, changed several times during our study period. This made it necessary to track definitional changes and to create categories consistent over time. This was only possible by using four higher level categories whose content remained relatively stable even across many years.
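Harmonising year-specific procedure types into stable higher-level categories can be sketched as a simple lookup. All labels and category names below are hypothetical stand-ins for the real legal categories:

```python
# Hypothetical mapping from year-specific legal procedure-type labels to
# stable higher-level categories; the project's actual labels and its
# four categories differ from these invented examples.
HARMONISED_CATEGORIES = {
    ("2006", "open procedure per par. 28"): "open",
    ("2010", "open procedure per par. 35"): "open",
    ("2006", "invitation procedure per par. 40"): "invitation",
    ("2010", "negotiated procedure with notice"): "negotiated",
    ("2010", "negotiated procedure without notice"): "other",
}

def harmonise(year, raw_label):
    """Map a year-specific label to a time-consistent category."""
    return HARMONISED_CATEGORIES.get((year, raw_label), "unclassified")
```

Unclassified labels are then reviewed manually and added to the mapping, so the lookup table grows as new legal formats are encountered.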
Once the mapping of announcement formats was done, automated text search algorithms were used to extract the relevant information. Many widely available computer programs, such as R, Python or PHP, are capable of carrying out these tasks effectively. These algorithms utilised the fixed format of announcements by type and time period. For example, the name and address of the issuer of tenders had to appear at the top of a contract award announcement following exactly the same wording (with minor deviations according to year of publication). Hence, the computer algorithm could extract issuer names and addresses from the raw text once the researchers defined where each section starts and ends. See, for example, how we did this when extracting information on electronic auctioning from Hungarian contract award announcements in Figure 2.
Figure 2. Example of a computer algorithm extracting information from contract award announcements relating to the use of electronic auctioning in the tendering process.
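The fixed-section extraction logic can be sketched as follows. The announcement fragment and section labels are hypothetical stand-ins for the legally prescribed wording, not the project's actual templates:

```python
# Hypothetical announcement fragment with fixed section markers.
announcement = (
    "I.1) Name and address of the contracting authority\n"
    "Ministry of Health, Budapest, Arany Janos u. 1.\n"
    "IV.2.2) An electronic auction has been used: no\n"
    "V.1) Date of contract award: 2010-05-12\n"
)

def extract_section(text, start_marker, end_marker):
    """Return the text between two fixed section markers, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = text.find(end_marker, start)
    return text[start:end if end != -1 else len(text)].strip()

issuer = extract_section(
    announcement,
    "I.1) Name and address of the contracting authority\n",
    "IV.2.2)",
)
e_auction = extract_section(
    announcement, "An electronic auction has been used:", "\n"
)
```

Because the markers follow the legal template, one pair of markers per announcement type and time period suffices; when the template changes, the markers have to be remapped.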
Extracted information turned out to be erroneous or missing in a surprisingly high proportion of cases. The research team's first response was to establish whether wrong and missing variable values resulted from deficiencies of the data extraction method. This involved manually checking a sample of erroneous and missing values and correcting the extraction algorithm where needed. This procedure revealed that actual publication practice was often far from legal prescriptions. Moreover, detailed analysis often revealed that the underlying text files lacked a sufficiently standardised format for automated data extraction to be efficient; hence, manual data extraction had to be applied. Manual data extraction carries a high risk of error; hence, every step of manual collection was logged in a closed online platform, and results were manually checked by a quality assurance team. The remaining errors were treated as errors in the announcements themselves and flagged with dedicated variables used later in the analysis. This is because hiding information by reporting it erroneously, for example, making typos in the name of the winning firm, is a simple and effective way of avoiding public scrutiny and lowering the probability of detecting corruption.
Data extraction followed initial research needs by collecting information on those variables which seemed to be most relevant for research at the outset. However, as new findings emerged, the need for additional variables also arose, creating a dialogue between analytical work and data collection. Hence, the research team extracted additional variables from the available raw text files well into the research phase dedicated to analysis. This was made possible by the large amount of information available in the announcements. This approach is in stark contrast with traditional survey research where a single data collection moment or a small number of data collection moments limits the capacity of researchers to fit data exactly to research needs. Social scientists working with ‘Big Data’ should be prepared for a more iterative and flexible data collection approach with its great opportunities and frustrations (e.g. think about the possibility of bringing in additional variables at any time of the research to refute an already tested hypothesis).
As the unit of analysis and the unit of data collection did not fully match, linking units of data collection and aggregating information were essential for any meaningful analysis. At the micro-level, individual public procurement announcements (the data observed) had to be linked to each other if they belonged to the same tendering procedure (the unit of analysis). As some procedures were highly complex, involving multiple repetitions of the same procedural step, additional and highly relevant variables were created to characterise each procedure by the number and types of announcements observed. For example, a court ruling ordering the issuer of a tender to rerun the tendering procedure was taken as an indicator of corruption, especially when the same company was awarded the contract in the repeated procedure as in the prior one.
Even though the primary unit of analysis was the public procurement tender, organisations could also be characterised by the sum of the procedures they conducted. This carried the additional possibility of linking organisation-level statistical data, such as the annual budgets of public organisations, to public procurement data. Hence, aggregation of procedure-level data was carried out. Furthermore, aggregation to the country level was also done to facilitate international comparisons. Nevertheless, the method of aggregation is far from straightforward, as multiple aggregation methods (e.g. taking a simple average or extreme values) and bases of aggregation (e.g. number of procedures or contract value) can be applied.
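How the choice of aggregation method changes the resulting organisation-level indicator can be illustrated with a toy example; all figures are invented:

```python
# Hypothetical procedure-level corruption-risk scores for one organisation,
# each with a contract value. Different aggregation choices yield quite
# different organisation-level indicators from the same underlying data.
procedures = [
    {"risk": 0.1, "value": 1_000_000},
    {"risk": 0.9, "value": 100_000_000},
]

# Simple average over procedures: each procedure counts equally.
simple_average = sum(p["risk"] for p in procedures) / len(procedures)

# Value-weighted average: large contracts dominate the indicator.
value_weighted = (
    sum(p["risk"] * p["value"] for p in procedures)
    / sum(p["value"] for p in procedures)
)

# Extreme value: flags the organisation by its riskiest procedure.
maximum = max(p["risk"] for p in procedures)
```

Here the simple average is 0.5 while the value-weighted average is close to 0.9, because almost all of the organisation's spending sits in the risky contract; which aggregation is appropriate depends on the research question.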
Large volumes of data caused a considerable headache to the research team. The most computationally demanding task had to be carried out by a high-capacity server instead of the personal computers and laptops of the research team.
As our data did not meet the assumptions underlying standard significance testing, we had to look for alternative methods for testing significance and accounting for randomness.
When deciding whether a variable is significant in regression models, we used significance values from Monte Carlo random permutation simulations. Permutation tests are widely used in the natural as well as the social sciences, for example, in social network analysis, where data typically relate to full populations and observations are not independent of each other. The Monte Carlo random permutation simulation randomly reassigns the outcome variable to observations multiple times and calculates the parameters of interest, such as regression coefficients, each time. By doing so, it obtains the distribution of each parameter when the outcome is truly random. The proportion of permuted statistics at least as extreme as the actually observed one therefore estimates the probability of observing the relationship by chance alone. A low significance level indicates that it is highly unlikely that the observed parameter is the result of a random process – a very intuitive interpretation. Researchers are urged to look for further alternatives, such as bootstrapping and Bayesian estimation methods, and to collate findings. Reporting standard significance tests is advisable for the sake of statistical tradition.
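A minimal version of such a permutation test might look like this in Python. The data are invented, and a real application would permute outcomes within a full regression model rather than a simple correlation:

```python
import random

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def permutation_p_value(x, y, n_permutations=2000, seed=42):
    """Randomly reassign the outcome to observations many times; the share
    of permuted statistics at least as extreme as the observed one
    estimates the probability of the relationship arising by chance."""
    rng = random.Random(seed)
    observed = abs(correlation(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(correlation(x, shuffled)) >= observed:
            extreme += 1
    return extreme / n_permutations

# Invented, strongly linear data: the permutation p-value comes out tiny.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 14.1, 15.9]
p = permutation_p_value(x, y)
```

Because the test builds the null distribution from the data themselves, it needs no assumption about sampling from a larger population, which is exactly the situation when a full population is observed.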
This methodology case study reviewed the principal challenges of working with ‘Big Data’ and provided a snapshot of the particular solutions used in a research project looking into high-level corruption in Hungary.
The main areas of novel challenges were
- data identification,
- data collection,
- data structuring,
- data analysis.
The research project has delivered a completely new database which allows for looking into high-level corruption in Hungary, and comparative research has already started in countries where a similar data collection approach could be advanced quickly, such as the Czech Republic and Slovakia. The new indicators developed using the public procurement database have a long way to go until they become a standard part of our corruption diagnosis toolkit. Nevertheless, the value of developing indicators from the bottom up, using rich administrative databases combined with thorough qualitative evidence, is already clear.
While the research presented here is specific to the project at hand, its overall approach and new methods are applicable to a wide range of phenomena and data sources. With further advances in information and communication technologies, the scope of applicability is expected to increase further, opening up new scientific frontiers.
Exercises and Discussion Questions
- Identify a data source which can be harnessed in line with this case study's approach. Discuss its strengths and weaknesses in terms of access, content and value for social sciences research.
- Which other data sources could you link to it to increase database scope and cross-check database content?
- How would you design a data collection exercise for the data source you identified?
- Why are traditional significance tests inadequate for analysis using ‘Big Data’? What are the alternatives?
- Discuss the strengths and weaknesses of using readily available data sources such as Google compared to data collected directly by a research team from the Internet.
1. Metadata describes the objects’ most generic characteristics as opposed to the nature of behaviour, for example, the time and location of short message service (SMS) texts sent rather than the actual content of messages.
2. In fact, the very first step of designing the research project was to identify at least one good data source with good coverage. No serious research planning was done until that point as without a new data source, no new insights could be obtained.