Researchers in the social sciences and beyond increasingly work with massive quantities of text data that require analysis, from historical letters to the constant stream of content on social media. Traditional texts on statistical analysis have focused on numbers; this book provides a practical introduction to the quantitative analysis of textual data. Using up-to-date R methods, it takes readers through the text analysis process, from text mining and pre-processing to final analysis. It includes two major case studies, one using historical and one using more contemporary text data, to demonstrate the practical applications of these methods. Currently, there is no introductory how-to book on textual data analysis with R that is up-to-date and applicable across the social sciences. Code and a variety of additional resources are available on the book's accompanying website.

Clustering of Documents

7.1 Clustering Documents

A large corpus may contain thousands of documents, and you may want to know which of them are similar. Can you identify groupings of documents such that the documents in each group are reliably similar to each other but dissimilar to documents in other groups? Clustering addresses this question by measuring document similarity. Discovering that such clusters exist, and being able to identify the documents in each group, are useful first steps toward further analysis.
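To make the idea of "similar within a group, dissimilar across groups" concrete, here is a minimal sketch. It is written in Python using only the standard library, not in R as the book's own examples are, and it is not the book's method. It assumes a crude whitespace tokenizer, raw term counts, cosine similarity, and an arbitrary similarity threshold of 0.5. The function names (`term_counts`, `cosine_similarity`, `group_documents`) and the four sample documents are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lowercase, split on whitespace, and count terms (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0 = no shared terms)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_documents(docs, threshold=0.5):
    """Link any two documents whose similarity meets the threshold,
    then return the connected groups as sorted lists of document indices."""
    vectors = [term_counts(d) for d in docs]
    parent = list(range(len(docs)))  # union-find forest, one node per document

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample documents: two about legislation, two about weather.
docs = [
    "the senate passed the budget bill",
    "the senate debated the budget bill",
    "rain and wind are expected tomorrow",
    "heavy rain and wind expected tonight",
]
clusters = group_documents(docs, threshold=0.5)
print(clusters)
```

The threshold is the main design choice in this sketch: raising it splits the corpus into more, tighter groups, while lowering it merges groups. Real analyses typically add pre-processing (stop-word removal, stemming, term weighting) before measuring similarity.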

Cluster analysis is the task of grouping a set of documents in such a way that documents in the same group are more similar to each other than to those in other groups. Groups are referred to as clusters; a ...
