Researchers in the social sciences and beyond increasingly deal with massive quantities of text data requiring analysis, from historical letters to the constant stream of content on social media. Traditional texts on statistical analysis have focused on numbers; this book instead provides a practical introduction to the quantitative analysis of textual data. Using up-to-date R methods, it takes readers through the text analysis process, from text mining and pre-processing to final analysis. It includes two major case studies, using historical and more contemporary text data, that demonstrate the practical applications of these methods. Currently, there is no introductory how-to book on textual data analysis with R that is up-to-date and applicable across the social sciences. Code and a variety of additional resources are available on an accompanying website for the book.

Preparing Text for Analysis: Text Cleaning and Formatting

3.1 Text Cleaning

Often the most difficult and time-consuming part of text analysis is cleaning the text, which must be done prior to the analysis.

The texts of our case studies were produced by scanning printed materials. Optical character recognition (OCR) has improved over the past decade, but it is still not completely reliable. In fact, OCR usually creates “dirty” text, so the text must be cleaned.
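To make the idea of cleaning OCR output concrete, here is a minimal sketch in base R. It is not the procedure used for the case studies: the sample strings, the `clean_ocr` function name, and the `corrections` lookup are all hypothetical and invented for illustration. It shows three common kinds of repair: rejoining words split across line breaks, replacing a known misreading, and normalizing whitespace.

```r
# Hypothetical OCR output; these errors are invented for illustration only.
dirty <- c(
  "The  harveft was   good this year.",
  "We arrived in Lon-\ndon on the 3rd."
)

# Hypothetical lookup of known misreadings (wrong form = corrected form).
corrections <- c("harveft" = "harvest")

clean_ocr <- function(x, corrections) {
  # Rejoin words hyphenated across a line break.
  # Caution: this also joins genuine compounds that happen to break at a line end.
  x <- gsub("-[ \t]*\n[ \t]*", "", x)
  # Turn any remaining line breaks into spaces.
  x <- gsub("\n", " ", x, fixed = TRUE)
  # Apply each known correction as a literal (non-regex) replacement.
  for (wrong in names(corrections)) {
    x <- gsub(wrong, corrections[[wrong]], x, fixed = TRUE)
  }
  # Collapse runs of whitespace and trim the ends.
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

clean <- clean_ocr(dirty, corrections)
print(clean)
```

In practice, a lookup like `corrections` would be built up gradually by reading through the raw text and noting which errors recur.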

Tip: We have found that researchers, like gardeners, need to get their hands dirty by delving into the unrefined text. Since OCR mistakes tend to follow patterns and since the language systems themselves tend to be patterned in terms of where the ...
