
Classification

Classification refers to a broad set of statistical methods used in many different applications. In a classification problem, we have a categorical response variable that we wish to investigate in relation to one or more input variables. Classification methods can be applied to problems in a wide variety of settings; applications in education include analyzing patterns of responses to standardized exams, inferring which middle school students will benefit from a drug prevention program, and predicting which graduating high school seniors will choose to attend a particular university if they are offered admission.

Common classification methods include logistic regression, support vector machines, decision trees, random forests, neural networks, and k-nearest neighbors. This entry discusses a few general issues in classification that should be considered when choosing a method and the differences between classification and the related problem of clustering.

General Issues in Classification

Classification problems include both prediction and inference. In an inference problem, the goal is to describe the relationship between the response variable and the explanatory variables, whereas in a prediction problem, the goal is to predict the value of an unobserved response variable for a new data point based on observed predictor variables. For example, if we wish to examine the relationship between a person’s diet and whether the person later gets cancer, this is an inference problem because the question of which foods put a person at risk is of paramount importance. In contrast, if we wish to classify the content of an image based on features extracted from the digital representation of the image, this is a prediction problem because knowing which features are useful for making the classification is not important.

Logistic regression and decision trees are examples of methods that are appropriate for inference because they provide easy-to-interpret information about the relationship between the response variable and the explanatory variables. As with any statistical methodology, however, making causal claims based on the results of a classification analysis relies on proper experimental design. K-nearest neighbors, support vector machines, and random forests may provide accurate predictions but can be more challenging to interpret, and are therefore more appropriate for prediction problems than for inference problems.
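
To illustrate why logistic regression is considered easy to interpret, the following minimal sketch fits a one-predictor model in plain Python and reads the slope as an odds ratio. The data set (hours of exam preparation and a pass/fail outcome), the function names, and the learning-rate and step-count settings are all hypothetical choices made for this illustration; a statistics package would normally be used instead of hand-written gradient ascent.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical toy data: hours of preparation and whether the student passed (1) or not (0).
# The classes overlap, so the maximum-likelihood estimates are finite.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
odds_ratio = math.exp(b1)  # multiplicative change in the odds of passing per extra hour
print(f"intercept={b0:.2f}, slope={b1:.2f}, odds ratio per hour={odds_ratio:.2f}")
```

The slope has a direct reading: each additional unit of the predictor multiplies the estimated odds of the positive class by exp(slope). That reading describes an association in the data; as noted above, a causal interpretation depends on how the data were collected.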

Any problem with a categorical response variable may be deemed a classification problem, but methods differ based on how many levels the categorical response has. Logistic regression is most often used as a binomial method for a binary response variable; by contrast, multinomial logistic regression, k-nearest neighbors, and linear discriminant analysis can easily handle any number of classes.
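
As a rough sketch of why k-nearest neighbors handles any number of classes without modification, the plain-Python example below classifies by majority vote among the nearest training points. The three-class toy data, the function name `knn_predict`, and the choice of k = 3 are assumptions made for illustration only.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with three classes; nothing in knn_predict depends on that count.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.3), "B"),
    ((1.0, 5.0), "C"), ((0.9, 5.2), "C"), ((1.1, 4.7), "C"),
]

print(knn_predict(train, (1.1, 0.9)))  # A
print(knn_predict(train, (4.9, 5.1)))  # B
print(knn_predict(train, (1.0, 4.9)))  # C
```

The vote is taken over whatever labels appear among the neighbors, so adding a fourth or fifth class requires only more labeled training points. In this sketch, ties are broken in favor of the label encountered first among the nearest neighbors, which is an arbitrary choice.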

Decision Boundaries

Decision boundaries separate the space of input variables into regions, each labeled with a class. One of the key elements determining the complexity of a classification problem is the shape of these boundaries. Figure 1 shows two classification problems with two classes (Δ, +) and two predictor variables (X1, X2). The solid line shows the Bayes optimal decision boundary, whereas the dotted line is the decision boundary estimated with logistic regression. Figure 1A shows a case where the Bayes optimal decision boundary is linear, whereas in Figure 1B, the boundary is nonlinear. If the input variables describe a space best partitioned using a nonlinear decision boundary, it is important to choose a method that can estimate such a boundary, particularly for inference problems.
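
The following short derivation sketches why logistic regression with two untransformed predictors can estimate only a straight-line boundary. It assumes the two classes are coded 0 and 1, that a point is assigned to class 1 when its estimated probability exceeds 1/2, and that no polynomial or interaction terms are added; the coefficient symbols are notation introduced here for illustration.

```latex
% Logistic regression model for two predictors X_1 and X_2
\[
P(Y = 1 \mid X_1, X_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}}
\]

% The decision boundary is the set of points where both classes are equally likely
\[
P(Y = 1 \mid X_1, X_2) = \tfrac{1}{2}
\;\iff\;
\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0
\]
```

The second condition is the equation of a line in the (X1, X2) plane. The Bayes optimal boundary is defined by the same equal-probability condition, but under the true class probabilities rather than the fitted model, so it can take any shape.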

...
