Data cleaning, also called data cleansing or data scrubbing, is the process of improving the quality of data by correcting or removing inaccurate records in a record set. The term specifically refers to detecting and then modifying, replacing, or deleting incomplete, incorrect, improperly formatted, duplicated, or irrelevant records, otherwise referred to as “dirty data,” within a database.

Data provided for communication research often rely on manual data entry and are therefore subject to human error, so the data require cleaning. The need for cleaning increases when data come from multiple sources that do not share a standard schema. The goal of data cleaning is to provide a data set that is consistent enough to allow for accurate analysis. Cleaning does not alter the original intent and meaning of the information provided by the participant; rather, it addresses inconsistencies caused by data transmission problems, the use of different definitions in different data stores, and user entry errors, so that the inconsistent data are removed or managed. This entry introduces the data cleansing process, including its manual and computer-assisted approaches, and further discusses the difference between data cleansing and data validation.

The Data Cleansing Process

Cleaning data begins with reviewing the data to identify inconsistencies. Inconsistent or incorrect data could be caused by typographical errors, misspellings, or incomplete answers. The inconsistent data are validated against a known list of options. With strict validation, any records containing invalid responses are removed completely from the record set. If fuzzy validation is acceptable, the data are corrected when a close match or known answer is available. For example, under fuzzy validation, if a research participant provides an e-mail address such as JSmith@college.edw, the researcher changes the e-mail address to JSmith@college.edu, knowing that the original e-mail address contained an invalid suffix. Under strict validation, the record containing the invalid e-mail address would be removed from the record set.
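The difference between strict and fuzzy validation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of valid suffixes, the function name validate_email, and the similarity cutoff of 0.6 are all hypothetical choices, and the close-match lookup uses difflib.get_close_matches from the Python standard library rather than any method named in this entry.

```python
import difflib

# Hypothetical list of known-valid e-mail suffixes (an assumption for this sketch).
VALID_SUFFIXES = ["edu", "com", "org", "net", "gov"]


def validate_email(address, fuzzy=False):
    """Return a cleaned address, or None if the record should be removed."""
    local, at, domain = address.rpartition("@")
    host, dot, suffix = domain.rpartition(".")
    if not (at and local and dot and host and suffix):
        return None  # not shaped like local@host.suffix at all
    if suffix.lower() in VALID_SUFFIXES:
        return address  # already valid, leave untouched
    if fuzzy:
        # Fuzzy validation: correct the suffix when a close match is available.
        match = difflib.get_close_matches(
            suffix.lower(), VALID_SUFFIXES, n=1, cutoff=0.6
        )
        if match:
            return f"{local}@{host}.{match[0]}"
    # Strict validation (or no close match): the record is removed.
    return None


strict_result = validate_email("JSmith@college.edw")
fuzzy_result = validate_email("JSmith@college.edw", fuzzy=True)
```

Under these assumptions, the strict call returns None, so the record would be dropped, while the fuzzy call returns the corrected address JSmith@college.edu. The cutoff controls how aggressive the correction is; a lower value repairs more entries but raises the risk of changing what the participant actually meant.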

Common Practices and Approach

A typical approach to cleaning data starts at the broadest possible level. The initial effort focuses on detecting and removing all major inconsistencies in the data. As the major errors are corrected, it becomes easier to perform a more detailed analysis of the remaining dirty data. The first step in data cleaning is analysis (i.e., detecting errors and inconsistencies that require attention). The second step is to determine the codes to be used to map the source data to the common or standard codes. The third step entails testing the transformation on a subset of the data, using the standard codes, to confirm that it will produce the expected results when applied to the entire data set. The final step is the transformation of the full data set using the standard codes. During this final step, the faulty data are either removed or transformed, changing incorrect values to the correct values based on the standard data model.
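The four steps above can be sketched as follows. The field name consent, the standard codes Y and N, the mapping table, and the sample records are all hypothetical and chosen only to make the sequence concrete; a real project would derive its own codes from its standard data model.

```python
from collections import Counter

# Hypothetical standard codes and source-to-standard mapping (assumptions for
# illustration; step 2 of the process is deciding on a table like this one).
STANDARD_CODES = {"Y", "N"}
CODE_MAP = {"y": "Y", "yes": "Y", "n": "N", "no": "N"}


def analyze(records, field):
    """Step 1: count the values that are not already standard codes."""
    return Counter(r[field] for r in records if r[field] not in STANDARD_CODES)


def transform(records, field, code_map):
    """Steps 3 and 4: map values to standard codes and drop unmappable records."""
    cleaned = []
    for record in records:
        key = str(record[field]).strip().lower()
        if key in code_map:
            cleaned.append({**record, field: code_map[key]})
    return cleaned


# Made-up sample records.
records = [
    {"id": 1, "consent": "Yes"},
    {"id": 2, "consent": "n"},
    {"id": 3, "consent": "maybe"},
    {"id": 4, "consent": "Y"},
]

problems = analyze(records, "consent")                # step 1: what needs attention
trial = transform(records[:2], "consent", CODE_MAP)   # step 3: test on a subset
assert all(r["consent"] in STANDARD_CODES for r in trial)
cleaned = transform(records, "consent", CODE_MAP)     # step 4: full transformation
```

In this sketch the record with the value "maybe" has no standard equivalent and is removed, while the other three are transformed to standard codes. Returning new dictionaries rather than editing records in place keeps the source data available for comparison if the trial run in step 3 gives unexpected results.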

One of the primary data cleaning practices is the deduplication of records. In practical terms, if the data exist in a spreadsheet or a data table, this entails sorting the data and scanning for multiple rows with the same data. If any duplicates exist, the researcher removes the duplicate entries, leaving only one copy of each record in the data set.
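A minimal sketch of the sort-and-scan approach to deduplication, assuming each row is represented as a tuple of comparable values. The function name deduplicate and the sample rows are hypothetical.

```python
def deduplicate(rows):
    """Sort the rows, then keep one copy from each run of identical rows."""
    unique = []
    for row in sorted(rows):
        # After sorting, duplicates sit next to each other, so comparing
        # against the last kept row is enough to detect them.
        if not unique or row != unique[-1]:
            unique.append(row)
    return unique


# Made-up sample rows: (e-mail address, consent code).
rows = [
    ("JSmith@college.edu", "Y"),
    ("ALee@college.edu", "N"),
    ("JSmith@college.edu", "Y"),
]
deduplicated = deduplicate(rows)
```

This only catches rows that match exactly. Rows that differ in capitalization or stray whitespace would survive as separate records, so standardizing the values first makes the scan more effective. Note also that the result comes back in sorted order rather than the original entry order.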

...
