Researchers in the social sciences and beyond are dealing more and more with massive quantities of text data requiring analysis, from historical letters to the constant stream of content in social media. Traditional texts on statistical analysis have focused on numbers, but this book will provide a practical introduction to the quantitative analysis of textual data. Using up-to-date R methods, this book will take readers through the text analysis process, from text mining and pre-processing the text to final analysis. It includes two major case studies using historical and more contemporary text data to demonstrate the practical applications of these methods. Currently, there is no introductory how-to book on textual data analysis with R that is up-to-date and applicable across the social sciences. Code and a variety of additional resources are available on an accompanying website for the book.

Word Distributions: Document-Term Matrices of Word Frequencies and the “Bag of Words” Representation

4.1 Document-Term Matrices of Frequencies

After data cleaning, a further preprocessing step is usually necessary: (a) stripping away excess white space (that is, collapsing any run of multiple spaces into a single space), (b) converting capital letters to lowercase letters, and (c) omitting special characters, symbols, and numbers.
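As a minimal sketch of steps (a) through (c), the following uses base R only. The two example strings are hypothetical and not taken from the book's case studies, and the object names `docs` and `clean` are illustrative. The steps run in the order (b), (c), (a), because dropping characters can itself leave behind runs of spaces.

```r
# Hypothetical example documents (illustration only)
docs <- c("The  Quick brown fox, aged 3, JUMPED!",
          "A   lazy dog   slept @ home.")

# (b) convert capital letters to lowercase
clean <- tolower(docs)

# (c) replace every character that is not a lowercase letter or a space
#     with a space, so that neighbouring words are not glued together
clean <- gsub("[^a-z ]", " ", clean)

# (a) collapse runs of spaces into one and trim the ends
clean <- trimws(gsub(" +", " ", clean))

print(clean)
```

Replacing unwanted characters with a space rather than deleting them outright is a design choice made here: it keeps strings such as "fox,aged" from merging into a single word. The pattern `[^a-z ]` assumes English text without accented letters; other languages would need a wider character class.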

Next, the preprocessed text is broken into discrete words, or tokens in technical language: consecutive sequences of characters that are separated from other tokens by a blank space.
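The tokenization step, and the document-term matrix of frequencies that this section is named for, can be sketched in base R as follows. The two short documents and the names `clean`, `tokens`, `vocab`, and `dtm` are assumptions made for illustration; dedicated text analysis packages offer their own tokenizers and matrix classes, which are not shown here.

```r
# Hypothetical preprocessed documents: lowercase, letters only, single spaces
clean <- c(d1 = "the quick brown fox",
           d2 = "the lazy dog saw the fox")

# Split each document into tokens at every blank space
tokens <- strsplit(clean, " ", fixed = TRUE)

# Vocabulary: the sorted set of distinct tokens across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document;
# vapply returns terms in rows, so transpose to get documents in rows
dtm <- t(vapply(tokens,
                function(tok) as.integer(table(factor(tok, levels = vocab))),
                integer(length(vocab))))
colnames(dtm) <- vocab

print(dtm)
```

Each row of `dtm` is one document and each column one term, so word order within a document is discarded and only the counts remain, which is the "bag of words" idea.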

At this stage of the preprocessing, stop words and other very short words are omitted. Remember that stop ...
