Researchers in the social sciences and beyond increasingly work with massive quantities of text data that require analysis, from historical letters to the constant stream of social media content. Traditional texts on statistical analysis have focused on numbers; this book provides a practical introduction to the quantitative analysis of textual data. Using up-to-date R methods, it takes readers through the text analysis process, from text mining and pre-processing to final analysis. It includes two major case studies, one using historical and one using more contemporary text data, to demonstrate the practical applications of these methods. Currently, there is no introductory how-to book on textual data analysis with R that is up-to-date and applicable across the social sciences. Code and a variety of additional resources are available on an accompanying website for the book.

Classification of Documents

8.1 Introduction

The clustering that we discussed in Chapter 7 is an unsupervised learning tool. Remember that the term unsupervised expresses the fact that the researcher has no outside information about group membership before carrying out the analysis. The documents are unclassified, and clustering looks for homogeneous groupings among them by empirically grouping documents on the basis of their word similarities. Its results may suggest hypotheses that can then be studied.
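
To make the idea of grouping by word similarity concrete, here is a minimal sketch. It is not the book's method: it uses plain Python rather than R, the four toy documents are invented for illustration, and the function names `cosine` and `cluster` and the threshold of 0.3 are assumptions. Documents are compared by the cosine similarity of their word counts, and any two documents that reach the threshold end up in the same group. No class labels are used at any point.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Group documents: any pair at or above the threshold shares a group id."""
    counts = [Counter(d.lower().split()) for d in docs]
    group = list(range(len(docs)))  # each document starts in its own group
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(counts[i], counts[j]) >= threshold:
                # Merge the group of document j into the group of document i.
                old, new = group[j], group[i]
                group = [new if g == old else g for g in group]
    return group


# Hypothetical toy corpus; no labels are supplied.
docs = [
    "tax budget vote senate",
    "senate vote tax reform",
    "goal match team score",
    "team score goal league",
]
labels = cluster(docs)
print(labels)
```

The group ids are arbitrary. What matters is which documents share one, and it is up to the researcher to interpret what each group has in common.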

In this chapter, we consider classification, which differs in that it utilizes the documents’ known categorical metainformation. Remember, because you have already coded these metavariables, you have already assigned classification markers. Classification assumes that you know the unique group (or class) that characterizes each document. ...
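
By contrast, a classifier starts from documents whose class is already known. The following is a minimal sketch under stated assumptions, not the procedure the chapter goes on to describe: it uses plain Python rather than R, the labelled toy documents are invented, and `train`, `classify`, and the nearest-profile rule are illustrative choices. Each class is summarized by the combined word counts of its labelled documents, and a new document is assigned to the class whose profile it most resembles.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(labelled_docs):
    """Build one word-count profile per known class label."""
    profiles = {}
    for text, label in labelled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def classify(text, profiles):
    """Assign the class whose profile is most similar to the new document."""
    counts = Counter(text.lower().split())
    return max(profiles, key=lambda label: cosine(counts, profiles[label]))


# Hypothetical training set: every document carries a known class label.
training = [
    ("tax budget vote senate", "politics"),
    ("senate vote tax reform", "politics"),
    ("goal match team score", "sport"),
    ("team score goal league", "sport"),
]
profiles = train(training)
predicted = classify("the senate will vote on the budget", profiles)
print(predicted)
```

The difference from clustering is in the input: the labels "politics" and "sport" are supplied by the researcher in advance, and the output for a new document is one of those labels.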
