
Classification and regression tree (CART) is a machine learning algorithm that constructs a tree-structured classifier to assign a group label to each case based on its attributes. The resulting tree is usually easy to interpret and well suited to decision making. The algorithm requires two variables for each case in the data. The first is the group variable to be classified and predicted (such as disease status or treatment), and the second is the attribute variable, which can be multidimensional and can contain numerical or categorical data (such as smoking status, sex, or abundance of various enzymes in blood). Normally, the method is applied to a set of training cases to learn a classifier. The classifier is then applied to an independent test set to estimate how well its classification accuracy generalizes.

CART analysis is performed as a binary recursive partition tree. An impurity measure is defined to describe the purity (concentration in a single group) of the cases in a node. At each node, the algorithm searches for the attribute criterion that partitions the data into two parts with the largest decrease in the impurity measure, and it then repeats the search recursively within each part. Normally, the Gini impurity index is used in CART analysis:

$$I(T) = 1 - \sum_{i=1}^{m} p_i^2$$

where T is the data set in the node, m is the number of groups, and p_i is the proportion of group i in the node.

When all cases in the node belong to the same group (the purest case), I(T) is minimized at 0, and when each group has the same proportion (the most impure case), I(T) is maximized at 1 - 1/m. Partitioning of the hierarchical tree continues until either the number of cases in a node is too small or the decrease in the impurity index is not statistically significant. Additional pruning rules may be applied to decide when tree growth terminates and thereby prevent overfitting.
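As a minimal sketch of the impurity measure described above, the Gini index of a single node can be computed directly from the group labels of its cases. The function name `gini_impurity` and the toy labels are illustrative assumptions, not part of any particular CART package.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity I(T) = 1 - sum_i p_i^2 for the group labels in one node."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumed convention for an empty node
    counts = Counter(labels)
    # p_i is the proportion of group i among the cases in the node
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical toy nodes for illustration only
pure = ["economic"] * 4                     # every case in the same group
balanced = ["economic", "inefficient"] * 2  # two groups in equal proportion

print(gini_impurity(pure))      # 0.0
print(gini_impurity(balanced))  # 0.5
```

The pure node gives the minimum of 0, and the balanced two-group node gives 1 - 1/2 = 0.5, the maximum for m = 2.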

CART has a number of merits compared with other classification algorithms. The method is inherently nonparametric and makes no distributional assumption about the data (in contrast to methods such as linear discriminant analysis). It is thus more robust against skewed or otherwise ill-behaved data. Learning in CART is a “white box,” and the learned classification criteria are easy to interpret. CART is therefore most applicable when interpreting the classifier and identifying the attributes that contribute most to classification are the major goals. By its nature, CART can easily handle categorical and ordinal data in addition to numerical data. Finally, the exhaustive search for the best partition is computationally fast, making the method feasible for large data sets.

Software Package

Because CART is a relatively modern statistical technique, it is not implemented in most major statistical software (e.g., SAS and S-PLUS). SPSS contains an add-on module, “SPSS Classification Trees.” A commercial software package, “Classification and Regression Tree,” designed specifically for CART analysis, is also available. In the following discussion, the extension package “tree” for the free software R is used to implement CART.

Table 1 MPG for a Select Group of Cars
Car Efficiency Cylinders Displacement Horsepower Weight Acceleration
1 Inefficient 4 151 85 2855 17.6
2 Economic 4 98 76 2144 14.7
3 Economic 5 121 67 2950 19.9
4 Inefficient 6 250 105 3353 14.5
5 Inefficient 4 151 88 2740 16
6 Inefficient 6 250 88 3021 16.5
7 Economic 4 71 65 1836 21
8 Economic 4 112 88 2395 18
9 Economic 4 141 71 3190 24.8
10 Inefficient 8 350 155 4360 14.9
11 Inefficient 4 98 60 2164 22.1
12 Economic 6 262 85 3015 17
13 Inefficient 6 200 85 3070 16.7
14 Inefficient 6 258 110 2962 13.5
15 Inefficient 4 116 75 2158 15.5
16 Inefficient 4 140 72 2401 19.5
17 Inefficient 8 350 180 4499 12.5
18 Inefficient 8 307 200 4376 15
19 Inefficient 8 318 140 3735 13.2
20 Economic 4 78 52 1985 19.4
21 Economic 4 89 71 1990 14.9
22 Economic 4 97 75 2265 18.2
23 Inefficient 6 250 98 3525 19
24 Economic 4 83 61 2003 19
25 Inefficient 8 302 140 3449 10.5

An Example

The classification of car fuel efficiency serves as an example. The data shown in Table 1 are a random subsample of 25 cars from the “auto-mpg” data set from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). The group variable for classification is “efficiency,” which has two possible values: inefficient (mpg < 25) and economic (mpg ≥ 25). Five attributes are available for classifying and predicting fuel efficiency: cylinders, displacement, horsepower, weight, and acceleration.
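To make the first partition step concrete, the following sketch runs an exhaustive search over the Table 1 data for the single binary split that most reduces the weighted Gini impurity, I(T) - (n_L/n) I(T_L) - (n_R/n) I(T_R). It is a hand-written illustration, not the output or the internals of the R “tree” package; the names `CARS`, `ATTRIBUTES`, `gini`, and `best_split` are hypothetical, and placing each candidate threshold midway between adjacent observed values is an assumption.

```python
from collections import Counter

# Table 1 rows: (efficiency, cylinders, displacement, horsepower, weight, acceleration)
ATTRIBUTES = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]
CARS = [
    ("Inefficient", 4, 151, 85, 2855, 17.6), ("Economic", 4, 98, 76, 2144, 14.7),
    ("Economic", 5, 121, 67, 2950, 19.9), ("Inefficient", 6, 250, 105, 3353, 14.5),
    ("Inefficient", 4, 151, 88, 2740, 16.0), ("Inefficient", 6, 250, 88, 3021, 16.5),
    ("Economic", 4, 71, 65, 1836, 21.0), ("Economic", 4, 112, 88, 2395, 18.0),
    ("Economic", 4, 141, 71, 3190, 24.8), ("Inefficient", 8, 350, 155, 4360, 14.9),
    ("Inefficient", 4, 98, 60, 2164, 22.1), ("Economic", 6, 262, 85, 3015, 17.0),
    ("Inefficient", 6, 200, 85, 3070, 16.7), ("Inefficient", 6, 258, 110, 2962, 13.5),
    ("Inefficient", 4, 116, 75, 2158, 15.5), ("Inefficient", 4, 140, 72, 2401, 19.5),
    ("Inefficient", 8, 350, 180, 4499, 12.5), ("Inefficient", 8, 307, 200, 4376, 15.0),
    ("Inefficient", 8, 318, 140, 3735, 13.2), ("Economic", 4, 78, 52, 1985, 19.4),
    ("Economic", 4, 89, 71, 1990, 14.9), ("Economic", 4, 97, 75, 2265, 18.2),
    ("Inefficient", 6, 250, 98, 3525, 19.0), ("Economic", 4, 83, 61, 2003, 19.0),
    ("Inefficient", 8, 302, 140, 3449, 10.5),
]


def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a non-empty list of group labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return (attribute, threshold, weighted child impurity) of the best binary split."""
    n = len(rows)
    best = None
    for j, name in enumerate(ATTRIBUTES, start=1):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between adjacent distinct values (assumption)
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r[0] for r in rows if r[j] < threshold]
            right = [r[0] for r in rows if r[j] >= threshold]
            weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
            if best is None or weighted < best[2]:
                best = (name, threshold, weighted)
    return best


root_impurity = gini([r[0] for r in CARS])
attribute, threshold, child_impurity = best_split(CARS)
print(f"root impurity: {root_impurity:.3f}")
print(f"best split: {attribute} < {threshold}")
print(f"impurity decrease: {root_impurity - child_impurity:.3f}")
```

With 10 economic and 15 inefficient cars, the root node has impurity 1 - (0.4^2 + 0.6^2) = 0.48. A full CART fit would apply the same search recursively to each child node until a stopping rule is met.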

...
