Skip to main content

Cleaning a Data Set to Make Ready for Analysis

Case
By: Published: 2014 | Product: SAGE Research Methods Cases
Methods:
+- LessMore information
Search form
No results
Not Found
Download Case PDF

Abstract

In 2007, I was asked to undertake the updated statistical analysis of a historic cohort of British cotton mill workers who had been medically examined in the late 1960s and early 1970s and had been traced and flagged for a mortality study. It was necessary to acquire an understanding of the content of the data and an interpretation of the variable values appearing in the data and to create a variable catalogue. Inevitably, with data collected over such a long period of time, a certain amount of cleaning was required, to get the data set ready for the proposed analysis. These activities turned out to be far more interesting and time consuming than I would have thought possible beforehand. Cleaning and updating the data proved to be a type of detective work that involved searching within the data for inconsistencies, hunting through the paper records that were available and cross-referencing with several other sources of information. The cleaned data have resulted in two publications and will support further work.

Learning Outcomes

By the end of the case, students should have

  • An understanding of why it is important to clean data before analysis
  • A knowledge, via example, of the types of techniques used to detect errors
  • An appreciation of the need to keep a good record of corrections made
  • An awareness that cleaning data is not always an easy task and can take a long time
  • An anticipation that cleaning data can be interesting
  • An appreciation that cleaning data does lead to a deeper understanding and is a prerequisite for a valid analysis
History of the UK Cotton Workers' Cohort

In the spring of 2008, I was asked by a colleague at the University of Central Lancashire (UCLan), School of Health, if I would be interested in doing an updated statistical analysis of the UK cotton workers' cohort. A cohort is a group of people all of whom have something in common and who are entered into a study and then followed-up for a long period of time to see what happens to them, often to study their health. From the time when cotton first came to Europe from the American South and up until early in the 1970s, Lancashire had been a massive producer of cotton fibre and Lancashire exported cotton products throughout the world. Anyone who knows Lancashire knows that Lancashire is very damp and so Lancashire is ideal for the production of cotton fibre. The towns near to UCLan at Preston, such as Oldham and Rochdale, had hundreds of mills in the past and the cotton industry was a major employer of the local working people. I happily agreed to my colleague's request to update the analysis. The focus of this cohort study was largely medical and health related but cohort studies are widely used to study social phenomena such as educational outcomes (see, for example, the Millennium Birth Cohort) and so this case study is equally relevant to studies in the social sciences.

In the early 1960s, work by Her Majesty's Inspectorate of Factories suggested that there was a need for a large study of the lung disease byssinosis which is caused by breathing in cotton dust in the mill working environment. Between 1966 and 1972, about 4000 cotton workers in 52 mills were recruited to a study to look at their respiratory health. Workers were medically examined and completed a health questionnaire which included details about how much they smoked and what respiratory problems they had, such as breathlessness and coughing. The medical examination also included lung function tests known as ‘forced expiratory volume in 1 second’, FEV1 for short, and ‘forced vital capacity’, FVC for short. Three types of mills were included – those working with coarse cotton, and those working with medium and also fine cotton because the cotton dust particle size might influence the respiratory health problems. In order to get enough mills of these three types, two Scottish mills and two Yorkshire mills were included. Between 1972 and 1974, many of the workers were medically examined again, each worker up to three or four times in total.

In 1972, it was decided to continue the work as a mortality study called MR14 (medical research study number 14). The National Health Service (NHS) kept a central register of all people in Britain who were registered with a family doctor, and this register listed every citizen by their NHS number and their date of birth. The register was at a place called Smedley Hydro in Southport on the Lancashire coast. Smedley Hydro has an interesting history since it was where the British government held a register of all British citizens during World War II. This was necessary in order to provide people with rationing stamp books since food and clothing was rationed. This site is now the General Register and NHS Information Centre. The Register also recorded the date of death of everyone and kept the death certificate, which recorded the person's main cause of death. In 1972, the Register was kept in very large books with information written in by hand. Coincidentally, in the early 1980s, I had visited the Register with the Lancashire and Cumbria local group of the Royal Statistical Society and we had toured the site and were shown the books. The names, dates of birth, study identification numbers, and NHS numbers of the cotton workers were sent to Smedley Hydro for ‘tracing and flagging’. First, the person was found in the register books and then a note was put on their record which flagged that they were in the study called MR14. At any future date, the study researchers could then ask for a ‘Members and Postings’ listing for MR14 and this would report who had died and when. The researchers could then buy a copy of the person's death certificate and find out the cause of the person's death.

Up until this time, about 1973 or 1974, all the information from the study was collected by the Factories Inspectorate on paper records. By 1976, the original medical examination data for each worker, covering all their medical examinations, had been transcribed onto a single summary sheet for each worker and the original questionnaires were destroyed. The summary sheets were sent away to be entered onto Hollerith punched cards ready for computerisation. By 1979, the punched cards had been used to enter the data into the computer program SPSS, and the SPSS Variable View contained the definition of the coding of the variables. At this point, it was realised that the medical examination data from 452 workers were missing from the SPSS file. It is possible that the punch card operators were in a hurry and often turned over two summary sheets at a time as they entered the data, or thought they had entered a sheet when they had not done so. Remember that they did not have a screen to look at, but worked at a punch card machine. The Inspectorate checked very thoroughly and found that the missing subset had no particular characteristic that would bias future analyses. Statisticians would say that these workers' medical records were ‘missing at random’.

Restarting the Cotton Workers' Cohort Study

In 1990, an interesting scientific article was published (Hodgson & Jones, 1990) which looked at the mortality in the cohort caused by lung cancer for the period 1968 to 1984. This compared the lung cancer death rates in the cohort with what you would expect from knowing the national death rate from lung cancer for the population of Britain for the same time period. The findings were that fewer workers died of lung cancer than expected even after taking account of the amount of tobacco that the workers smoked. The important point is that cotton is an organic dust and a number of different studies on people working with organic dusts (such as cereal farmers and people working with flax) had found that exposure to organic dust protected people from getting lung cancer. The theory was that the body's immune system is put on high alert status by certain particles in an organic dust and so the body attacked and destroyed lung cancer cells before they could form a tumour.

The Lancashire cotton industry had collapsed by the mid-1970s, and in 2008, virtually no one in Lancashire worked in a cotton dust environment. The Chinese city of Shanghai then became the centre of the world's cotton industry, and results from the British cotton workers' cohort would be relevant to the cotton workers of Shanghai. The Inspectorate of Factories, together with several other organisations, had changed to become the UK Health and Safety Executive (HSE) during the intervening years. Just after 2000, the HSE took the decision to discontinue the mortality study. However, since 1984, a further 23 years had gone by, and it would be of value to see whether the result concerning lung cancer still held true. For this reason, it was decided to bring the mortality study up to date at UCLan to the end of 2007.

During 2008, UCLan received all the paper-based records which remained for the cohort, and these filled three filing cabinets. The files were catalogued and it became clear that some paper records had been lost in the intervening years although a substantial amount of information regarding the study was still available to provide background reference material. When it came to the electronic records, however, there was a setback because the final master data set for 2000 could not be successfully read from the magnetic tape where it had been archived. However, by diligent searching of their file base, our HSE partner found the master data file for 1997 and also the SPSS syntax file that converted this into an SPSS data set. In August 2008, the electronic files arrived on a secure pendrive (flash memory), and we were able to upload these to UCLan's secure servers.

Creating the SPSS Data File

We created the SPSS data file using the master data file and the SPSS syntax file and this worked successfully. The master data file was a dense text file of digits without any separators between the numbers; the SPSS syntax file interpreted these digits and set up a ‘.sav’ file which contained the variables in the Variable View and the cases in the Data View. It was helpful that I had worked with SPSS during the 1970s, in the days before Windows operating systems existed. I understood how the digits were formatted and I understood the SPSS syntax language. I could see that the SPSS syntax created new variables from the existing ones in the master data file. Some of the new variables were 4-digit years, such as ‘1982’, which were created from the 2-digit year ‘82’ in the master data file. Towards the end of the last millennium, as the world approached the year 2000, there was much concern about what was called the ‘millennium bug’. There was a fear that many computer programs would stop working when the calendar turned to 2000 because year was stored as a 2-digit number. Clearly, work had been carried out in the SPSS syntax to counter the year 2000 millennium bug.

Why Do We Need to Check and Clean Data?

As a statistician with over 35 years' experience, I would never start work on an analysis until I have cleaned and checked the data that I am going to be working with. Why is doing this so important?

First, we have to remember the saying ‘garbage in, garbage out’. If the data are full of errors, then the analysis will produce unreliable results. If we publish these results, then we may be embarrassed at a future date when someone else shows that we were wrong. Second, if the data have errors, it is quite possible that we will uncover these during the data analysis. For example, at one point, I needed to use in my analysis the age at which a worker started work in cotton. I used their date of birth and the date they started work to calculate the age that they started work. When checking this, I found a person who had started work at the age of 8 years but I knew that this was impossible. If I found this during the analysis stage, then I might find that I needed to repeat previous analyses with the corrected data – a very inefficient way of working. Third, I feel that we owe a duty to the participants in a study to get their data right. The cotton mill workers have given their personal information freely to benefit future generations by the knowledge gained from their life experience. Getting their data right is our way, as researchers, of thanking them and of respecting and valuing the contribution that our study participants have made.

Creating a Variable Catalogue and Checking the Data

With a large and unfamiliar data set, it is important to start out work by creating a variable catalogue. This lists all the variables in the data together with the codes used for categorical variables and their interpretation and the units used for quantitative variables. For quantitative variables, you would also decide which smallest and which largest values are realistic. Next, you proceed through the variables one at a time to check for possible errors. This should involve tabulating frequencies of values, drawing box plots and checking whether any values are beyond the range of what is reasonable. Once individual variables have been checked one at a time, it is then important to check the internal consistency of the data across related variables within the data set. If there are additional sources of data outside the data set (such as the Members and Postings listing), the next step is to check that these are consistent with the variables in the data set. During these stages, it is advisable to note the errors down and not make corrections until later.

Once all the checking is complete and you have a good understanding of the data, then it is time to correct the errors as best you can. If you cannot correct the error, then you have to think what you are going to do. Can you leave this record out? This really depends on the type of study and the extent and importance of the error, and no hard and fast rules can be applied. If the study is a randomised controlled trial (which this one is not), then there will be a number of protocols or rules to be followed, so you need to be careful. At this stage, you need to keep good notes of what you have corrected and why, in case you want to reverse the corrections. It may be that you have misunderstood something and that the original data were correct after all.

This was the procedure we followed in checking and cleaning the cotton workers' cohort data. Often the expression ‘dirty data’ is used for the data when you first receive it to work on. The checking and correcting is called ‘cleaning the data’ and once this is complete you have ‘clean data’. The expression ‘big data’ is very popular at present, now that very large data sets are becoming available for statistical analysis. One of the problems with big data is that they are often ‘big and dirty’. Although we would not call the cotton workers' cohort a big data set, the principles for cleaning the data apply equally well to big data.

The following sections will explain some of the more interesting features we found when cleaning the cotton workers' cohort data, and these will illustrate the ideas outlined above in more detail. By seeing examples, you will get ideas of how to approach the cleaning of your data.

Workers Born in the 19th Century: Cross-Checking Previously Computed Values

We produced a frequency table of decade of birth for the workers by using the year of birth. Two workers had been born between 1880 and 1889 and 51 were born between 1890 and 1899. The earliest medical examinations were carried out in 1966, which meant that some of the workers were still working well into their 70s. We checked with the paper records and this was correct. What was not correct, however, was the calculation of lung function for 12 of these workers. Age at medical examination had been calculated by hand and entered onto the paper summary sheet, and lung function was calculated in the SPSS syntax using age, gender and the value of FEV1 percentage from the master data file. We cross-checked year of birth with age at first medical examination and found that, for example, a person born in 1896 was shown as having an age of 30 years at their first medical in 1966. Somebody in the past had subtracted 66 from 96 and got 30 years, instead of subtracting 1896 from 1966 to get 70 years of age. The 70-year-old worker was then compared with the expected lung capacity of a 30-year-old and found to have very poor function. The age at medical examination was corrected and the lung function recalculated for the 12 workers.

Dates of Medical Examinations: Checking That Dates Are in the Right Order

Each worker had up to four medical examinations. However, the data file had room to hold data from only three medical examinations. We resolved this by checking in the paper records and found that the file contained data for the first, second and third examinations, except for 11 workers who had a fourth medical. These people had data for the first, second and fourth. We checked that everyone who had a third medical also had data for a second medical and that everyone who had a second medical also had data for a first medical. We also checked that the date of the last medical was later than the second medical and that the second medical had a later date than the first medical. There were no inconsistencies found.

The Twice Dead: Checking for Duplicates

As you may recall, the Members and Postings list from the Register of births and deaths listed all people who had died and who had been flagged for MR14. The listing gave the study identifier for the worker, so this would allow us to match the deceased person with the record in the SPSS data file. However, it was found that the Members and Postings listing contained repeated identifiers, so it would seem that some workers had died more than once! The cause of this was that, during the intervening years after the workers had been flagged in the register books, the books had been computerised. Workers who had an MR14 flag had the flag transferred to their computer registry record along with their study identifier. However, the study identifier had been written by hand and clearly there had been some difficulty in reading the handwritten digits. For example, the digit 5 and the digit 8 were sometimes mistaken for each other. So, for example, the worker identifier 288 appeared twice, when in fact one should have been 285 and the other 288. In total, 17 identifiers were repeated, but it was possible to find for each repeat who the original worker was and who the repeat was, by using the dates of birth and names supplied in the Members and Postings list and checking in the paper records.

Mysterious Mill 53: Checking for Values Out of Expected Range

Each worker's record contained the mill number where they had worked. This was usually from 1 to 52 since 52 mills took part. Each worker's record also recorded whether they worked with fine, medium or coarse cotton. We cross-tabulated mill number with type of cotton and it was clear that each mill worked with only one type of cotton since the vast majority of workers in a single mill had the same cotton type. However, there were some exceptions. A few workers apparently worked with a different type of cotton than their co-workers at the same mill. This clearly indicated that they had had an incorrect type of cotton recorded for them, and this was corrected.

A number of workers were shown to be working at mill 53, but searching the paper records showed that mill 53 did not exist. However, the paper records contained a list of study identifiers used at each mill and this allowed us to allocate most of the mill 53 workers to a mill in the range 1 to 52. What seems likely was that workers were recruited to the study at the mill where they worked, but medical examinations were carried out over a period of time and those workers may have left that mill by the time of their medical. We do know that every effort was made to medically examine all those recruited even if this was done at their own home. We could not allocate a few of the workers listed as working at mill 53 to a real mill and so we had to exclude these workers from some of the analyses because we did not know what type of cotton they had worked with.

It's Twins: Checking for Sequential Repeats

Another way to check for internal consistency is to work down a variable in sequence and check whether there are repeated values, one after the other. We checked the dates of birth of the workers in the sequence of their study identifiers and found that several (fewer than 20) consecutive dates of birth were identical. This is unlikely, except of course in the case of twins who work together in the same mill. We were able to check back with the paper records and found that only two of the occurrences were for two sets of twins. The others were errors, and we were able to correct dates of birth for the repeats. However, this had the unfortunate effect that sometimes the wrong person had been traced (or the person had failed to be traced) because a wrong date of birth had been supplied to the Register, and as a consequence some workers had to be removed from the analysis (about 17).

However, we had the good fortune to find 16 missing summary sheets for medical examinations in the paper records in a large brown envelope. Looking at the dates of the last examination for these people, it appeared that the summary sheets had been updated with the last examination data but were returned too late to be sent off for putting onto punched cards. These people had been successfully traced and so we were able to enter their medical data into the data set.

We also looked at sequentially entered dates of death; not everyone had died, so we had to be careful doing this. We found a number of occurrences where sequential workers had the same date of death. Twins could not provide an explanation for this. More likely, the person entering the death data had made a mistake. So we checked with the Members and Postings listing and the death certificates, and we were able to correct the repeated entries. Before analysis commenced, we also selected a 10% random samples of deaths and checked date of death and cause of death against the death certificates as a general test of quality of the mortality data.

Gender: Cross-Checking Internal Data with External Sources

For many studies, you would not consider checking gender, but we did for our study. The data file contained a variable for gender, but the information which the Register supplied also contained gender. We cross-referenced these and found two inconsistencies. For the age of the individuals concerned, this would not likely to be due to gender reassignment issues. There was a simpler explanation; the individuals had names which could be used for either gender. For example, the name Ashley could be for a man or a woman (this is not one of the real names in the study). A typed list of participants was provided by each mill, probably by the mill office, and gender was given on this. It is likely that the office worker just made a mistake because they did not know the mill worker and that mistake had never been corrected.

Scottish Workers Never Die: Are Variables Similar in Different Subsets

As noted earlier, two Scottish mills were included in the study in order to include mills that worked with fine cotton. When looking at the number of deaths for each mill, it was clear that very few Scottish workers had died compared with English workers. We did not think this was due to the fresh Scottish air but required another explanation. In a mortality study, it is very important that you do not have individuals who seem as if they are living forever simply because you had no way of finding out whether they had died. You need to check who is still alive, calculate their age and consider whether you may have failed to record their death. The Scottish workers were traced and flagged at the Register at Southport, Lancashire. However, in the intervening years, Scotland had gained more devolved independence, and deaths in Scotland were now recorded in Scotland. Hence, we obtained a Members and Postings listing for MR14 from the Registrar General for Scotland. Unfortunately, this listing did not contain the study identifiers to match against in the data file. In the transfer to Scotland, the flag for MR14 had been transferred but not the participant identifier. Again the paper records proved invaluable because we were able to access the names of the Scottish mill workers and find their study numbers from typed listings of participants which had been archived at the Public Records Office.

How the Changes Were Made Using Word

As you may recall, the master data file consisted of densely packed digits which an SPSS syntax file decoded and turned into an SPSS ‘.sav’ file. Also, it seems worth repeating that it is important to keep notes of what you have changed in the data and the reason why you have done it. We kept the notes and made the changes using Microsoft Word and the Review facility. Most people know how to use Word for text documents, but Word is equally useful for text data files. We made our updates using Word Review so we could see the old values and the new. We also inserted a comment next to the change that documented the change and why we were making the change. Then, when the changes were complete, we saved the Word version of the file in ‘Final showing markup’ state as a record of the changes. Next, we removed all the comments, accepted all the changes, and moved from ‘Final showing markup’ to view ‘Final’ and then also saved this as plain text. The plain text file was then put through the SPSS syntax to create the corrected SPSS data file. This way, we had a permanent record of all the changes made to the original data file and we could reverse any change at a later date should we wish to do so. We were then ready to start the statistical analysis.

Postscript

I have always been fascinated by statistics ever since I started statistics at school in 1970 when I was 16. By 1970, the cotton mill workers had just completed their first medical examinations and the study had collected only 20 deaths. By 2007, when I first came to be involved, the study had accumulated 2159 deaths among 3549 traced participants. I found this cohort study particularly interesting because the study seemed to have run parallel to my career and to have reflected the changing face of statistics and statistical computing over my many working years. It has been a privilege towards the end of my career to work with such a valuable data set and to have divined some of its secrets.

Our analyses did confirm that working with cotton provided some protection to the mill workers from lung cancer and you can read the article in the British Journal of Cancer (McElvenny et al., 2008). In addition to this, it became clear that light smoking at least predicted higher mortality for women than for men (an unexpected outcome of the cohort study), and an article on this has appeared in BMC Public Health (Hurley, 2014). Further work will look at the role of lung function in predicting longevity.

Exercises and Discussion Questions
  • What is meant by the expression ‘data mining’ and why is data cleaning important for data mining?
  • How big do data have to be to qualify as ‘big data’ and why is data cleaning important for big data. Think about where ‘big data’ come from to answer this question.
  • What is meant by ‘statistical imputation’ and how does this differ from data correction? When is statistical imputation justified and reasonable?
  • What are the inherent dangers in data cleaning? Are the results based on cleaned data more or less valid and reliable than results based on the original, dirty data?
  • When a data set is accumulated over many decades, what other problems arise in the cleaning of such data? For example, think about cause of death and how it is recorded and whether date of birth will always be accurate.
  • What data-protection issues do you have to think about when you are cleaning a data set for analysis? For example, did the individuals who provided the data agree to its use and how do you know that? Were the data obtained in compliance with the law? Are any of the data considered to be sensitive data according to the law?
Further Reading
Burlingame, N., & Nielsen, L. (2012). A simple introduction to data science. Wickford, RI: New Street Communications.
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. Hoboken, NJ: John Wiley & Sons. http://dx.doi.org/10.1002/0471448354
Ganti, V., & Dassarma, A. (2013). Data cleaning – A practical perspective. San Rafael, CA: Morgan & Claypool.
Granville, V. (2014). Developing analytical talent – Becoming a data scientist. Hoboken, NJ: John Wiley & Sons.
References
Hodgson, J. T., & Jones, R. D. (1990). Mortality of workers in the British cotton industry in 1968–1984. Scandinavian Journal of Work, Environment & Health, 16, 113–120.
Hurley, M. A. (2014). Light smoking at base-line predicts a higher mortality risk to women than to men; evidence from a cohort with long follow-up. BMC Public Health, 14, 95. doi: http://dx.doi.org/10.1186/1471-2458-14-95http://dx.doi.org/10.1186/1471-2458-14-95
McElvenny, D. M., Hurley, M. A., Lenters, V., Heederik, D., Wilkinson, S., & Coggon, D. (2011). Lung cancer mortality in a cohort of UK cotton workers: An extended follow-up. British Journal of Cancer, 105, 1054–1060. doi: http://dx.doi.org/10.1038/bjc.2011.312http://dx.doi.org/10.1038/bjc.2011.312

Copy and paste the following HTML into your website