Natural Language Processing

Abstract

Natural language processing (NLP) is the use of computer technology to assist in, or complete, tasks that involve processing, categorizing, analyzing, or interpreting the meaning of human language. NLP is an interdisciplinary body of research drawing on linguistics, computer science, artificial intelligence, machine learning, and the social sciences. This entry introduces NLP, with emphasis on its strengths and weaknesses for social scientific applications. It reviews major concepts in NLP, such as the representation of natural language as strings, and then discusses commonly used data collection strategies for rapidly building large databases of natural language for analysis. The entry also introduces major techniques for efficiently processing natural language with computational routines, including counting strings and substrings, case manipulation, string substitution, tokenization, stemming and lemmatizing, part-of-speech tagging, chunking, named entity recognition, feature extraction, and sentiment analysis. It further discusses a variety of dimensionality reduction and modeling techniques widely used in NLP-based research projects, including principal components analysis, topic models, hidden Markov models, and support vector machines. The entry then discusses Python and R, two computer programming languages well suited for NLP, with specific recommendations for add-on packages designed to streamline and simplify research. It concludes with a more general discussion of the strengths and weaknesses of NLP along with important directions for future research.
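
As a rough illustration of the simplest string-level routines named above (case manipulation, substring counting, string substitution, and tokenization), the following minimal sketch uses only the Python standard library. The sample sentence, the variable names, and the regular-expression tokenizer are assumptions made for illustration; they are not taken from the entry, and dedicated NLP packages typically provide more robust tokenizers.

```python
import re
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
text = "The cat sat on the mat. The mat was flat."

# Case manipulation: normalize to lowercase so "The" and "the" match.
lowered = text.lower()

# Counting substrings: number of non-overlapping occurrences of "the".
the_count = lowered.count("the")

# String substitution: replace the whole word "mat" with "rug".
substituted = re.sub(r"\bmat\b", "rug", lowered)

# Tokenization: a naive regex tokenizer that keeps runs of letters
# and apostrophes and drops punctuation and whitespace.
tokens = re.findall(r"[a-z']+", lowered)

# Counting strings: frequency of each token.
token_counts = Counter(tokens)

print(the_count)
print(substituted)
print(token_counts.most_common(2))
```

Note that `str.count` matches substrings rather than whole words, so it would also count "the" inside a word such as "other"; counting tokens after tokenization avoids that.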
