Teach students how to construct a viable research project based on online sources. Gabe Ignatow and Rada Mihalcea’s An Introduction to Text Mining: Research Design, Data Collection, and Analysis provides a foundation for readers seeking a solid introduction to mining text data. The book covers the most critical issues that must be taken into consideration for research projects, including web scraping and crawling, strategic data selection, data sampling, use of specific text analysis methods, and report writing. In addition to covering technical aspects of various approaches to contemporary text mining and analysis, the book covers ethical and philosophical dimensions of text-based research and social science research design.
Chapter 8: Basic Text Processing
The goals of Chapter 8 are to help you to do the following:
- Define basic text processing steps, such as tokenization, stop word removal, stemming, and lemmatization.
- Explain text statistics and laws that govern the distribution of words in text.
- Explore the basics of language models, and evaluate their applications.
- Discuss the main goals of more advanced text processing steps.
Text analysis almost invariably requires some form of text processing. Consider the following example of a tweet: "Today’s the day, ladies and gents. Mr. K will land in U.S. :)". To use information from this piece of text for any form of text mining or other text analysis, one must first determine what the tokens in this text are: today, ’s, the, ...
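To make the tokenization problem concrete, here is a minimal sketch of a rule-based tokenizer applied to the tweet. It is an illustration only, not the book's method: the pattern list, the `TOKEN_PATTERN` name, and the `tokenize` helper are assumptions introduced here, and the rules cover only the cases that appear in this one example (a clitic ’s, a title, a dotted abbreviation, and an emoticon). The sketch keeps the original capitalization; lowercasing would be a separate step.

```python
import re

# Alternatives are tried left to right at each position,
# so the special cases are listed before the generic ones.
TOKEN_PATTERN = re.compile("|".join([
    r"[:;=][-^]?[)(DPp]",    # simple emoticons such as :) or ;-(
    r"(?:Mr|Mrs|Ms|Dr)\.",   # a few titles that keep their period
    r"(?:[A-Za-z]\.){2,}",   # dotted abbreviations such as U.S.
    r"[’']s\b",              # possessive or contracted 's
    r"\w+",                  # runs of letters, digits, and underscores
    r"[^\w\s]",              # any other single non-space character
]))


def tokenize(text):
    """Return the tokens of `text` in order, skipping whitespace."""
    return TOKEN_PATTERN.findall(text)


tweet = "Today’s the day, ladies and gents. Mr. K will land in U.S. :)"
tokens = tokenize(tweet)
print(tokens)
# ['Today', '’s', 'the', 'day', ',', 'ladies', 'and', 'gents', '.',
#  'Mr.', 'K', 'will', 'land', 'in', 'U.S.', ':)']
```

Splitting on whitespace alone would give "gents." and "day," with punctuation attached, while splitting on every punctuation mark would break "Mr.", "U.S.", and ":)" apart. The ordered alternatives are one simple way to handle both problems for this example; real text needs far more rules or a trained tokenizer.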