Classification and Regression Tree

Neil J. Salkind

doi:10.4135/9781412952644


Edited by:
Neil J. Salkind
In: Encyclopedia of Measurement and Statistics
Chapter DOI: https://doi.org/10.4135/9781412952644.n83
Subject: Anthropology, Business and Management, Criminology and Criminal Justice, Communication and Media Studies, Counseling and Psychotherapy, Economics, Education, Geography, Health, History, Marketing, Nursing, Political Science and International Relations, Psychology, Social Policy and Public Policy, Social Work, Sociology, Science, Technology, Computer Science, Engineering, Mathematics, Medicine


Classification and regression tree (CART) is a machine learning (or classification) algorithm that constructs a tree-structured classifier to assign a group label to each case based on its attributes. The resulting tree-structured classifier is usually easy to interpret and well suited to decision making. The algorithm requires two variables for each case in the data. The first is the group variable to be classified and predicted (such as disease status or treatment), and the second is the attribute variable, which can be multidimensional and can hold numerical or categorical data (such as smoking status, sex, or abundance of various enzymes in blood). Normally, the method is applied to a set of training cases to learn a classifier. The classifier is then applied to an independent test set to evaluate how well its classification accuracy generalizes.
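The train-then-test workflow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the entry: the names `holdout_accuracy` and `fit_majority`, the 30% test fraction, and the fixed random seed are all assumptions, and the placeholder learner stands in for whatever classifier (such as a CART tree) is actually fitted.

```python
import random

def holdout_accuracy(cases, fit, test_fraction=0.3, seed=0):
    """Fit on a training subset and report accuracy on the held-out test subset.

    cases: list of (attributes, group) pairs.
    fit:   function that takes training cases and returns predict(attributes).
    The test fraction and seed are illustrative assumptions.
    """
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(test_fraction * len(shuffled))))
    test, train = shuffled[:n_test], shuffled[n_test:]
    predict = fit(train)
    correct = sum(1 for attributes, group in test if predict(attributes) == group)
    return correct / len(test)

def fit_majority(train):
    """Placeholder learner: always predicts the most common training group."""
    groups = [group for _, group in train]
    majority = max(set(groups), key=groups.count)
    return lambda attributes: majority
```

Any learner with the same `fit` interface can be passed in place of `fit_majority`; the test cases are never seen during fitting, which is what makes the reported accuracy an estimate of generalization.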

CART analysis is performed as a binary recursive partition tree. An impurity measure is defined to describe the purity (concentration in a single group) of cases in a node. The algorithm recursively searches for the attribute criterion that partitions data into two parts with the largest decrease of impurity measure. Normally, the Gini impurity index is used in CART analysis:

$$I(T) = 1 - \sum_{i=1}^{m} p_i^2,$$

where T is the data set in the node, m the number of groups, and p_i the proportion of group i in the node.

When all cases in the node belong to the same group (the purest case), I(T) is minimized at 0, and when each group has the same proportion (the most impure case), I(T) reaches its maximum of 1 − 1/m. The partitioning of the tree continues until either the number of cases in a node is too small or the decrease of the impurity index is not statistically significant. Additional pruning rules may be applied to decide when tree growth terminates, in order to prevent overfitting.
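The recursive partitioning described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the entry's implementation: attributes are numerical, candidate thresholds are midpoints between adjacent observed values, and growth stops on a minimum node size (`min_cases`) or a fixed minimum impurity decrease (`min_decrease`) instead of a statistical significance test. No pruning is included. All function and parameter names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity I(T) = 1 - sum_i p_i^2 of a non-empty list of group labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Search every attribute and midpoint threshold for the largest impurity decrease."""
    n, parent = len(rows), gini(labels)
    best = None  # (decrease, attribute index, threshold)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            # Impurity after the split: size-weighted average of the two children.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or parent - child > best[0]:
                best = (parent - child, j, t)
    return best

def grow(rows, labels, min_cases=5, min_decrease=1e-9):
    """Recursively partition until a node is pure, too small, or no split helps."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(rows) < min_cases or gini(labels) == 0.0:
        return majority  # leaf: predict the most common group in the node
    split = best_split(rows, labels)
    if split is None or split[0] <= min_decrease:
        return majority
    _, j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "attribute": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], min_cases, min_decrease),
        "right": grow([r for r, _ in right], [y for _, y in right], min_cases, min_decrease),
    }

def predict(tree, row):
    """Follow the splits from the root down to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["attribute"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree
```

For example, `grow([[1], [2], [3], [10], [11], [12]], ["a", "a", "a", "b", "b", "b"], min_cases=2)` produces a single split, because both children are already pure.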

CART has a number of merits compared with other classification algorithms. The method is inherently nonparametric and makes no distributional assumption about the data (in contrast to methods like linear discriminant analysis). It is thus more robust against skewed or otherwise ill-behaved data distributions. Learning in CART is a “white box,” and the learned classification criteria are easy to interpret. Thus CART is most applicable when interpretation and the identification of important attributes contributing to classification are the major goals of the application. CART by its nature can easily handle categorical and ordinal data in addition to numerical data. Finally, computation of an exhaustive search for the best partition is very fast, making the method feasible for large data sets.
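The entry does not spell out how a categorical attribute is partitioned. One common convention, assumed here, is to send a subset of the category levels to one branch and the remaining levels to the other. The hypothetical helper below enumerates those candidate binary partitions; an ordinal attribute can instead be coded as ordered numbers and split by threshold.

```python
from itertools import combinations

def categorical_splits(levels):
    """Yield each distinct binary partition (left, right) of a set of category levels."""
    levels = sorted(set(levels))
    if len(levels) < 2:
        return  # nothing to split
    first, rest = levels[0], levels[1:]
    # Keeping the first level on the left avoids listing each partition twice.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(levels) - left
            yield left, right
```

An attribute with k levels yields 2^(k−1) − 1 candidate partitions under this convention, so three levels give three candidates. Each candidate can then be scored by its impurity decrease in the same way as a numerical threshold.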

Software Package

Since CART is a relatively modern statistical technique, it is not implemented in most major statistical software (e.g., SAS and S-PLUS). SPSS contains an add-on module, “SPSS Classification Trees.” A commercial software package, “Classification and Regression Tree,” specifically for CART analysis, is also available. In the following discussion, the extension package “tree” for the free software R is used to implement CART.

Table 1 MPG for a Select Group of Cars
Car	Efficiency	Cylinders	Displacement	Horsepower	Weight	Acceleration
1	Inefficient	4	151	85	2855	17.6
2	Economic	4	98	76	2144	14.7
3	Economic	5	121	67	2950	19.9
4	Inefficient	6	250	105	3353	14.5
5	Inefficient	4	151	88	2740	16
6	Inefficient	6	250	88	3021	16.5
7	Economic	4	71	65	1836	21
8	Economic	4	112	88	2395	18
9	Economic	4	141	71	3190	24.8
10	Inefficient	8	350	155	4360	14.9
11	Inefficient	4	98	60	2164	22.1
12	Economic	6	262	85	3015	17
13	Inefficient	6	200	85	3070	16.7
14	Inefficient	6	258	110	2962	13.5
15	Inefficient	4	116	75	2158	15.5
16	Inefficient	4	140	72	2401	19.5
17	Inefficient	8	350	180	4499	12.5
18	Inefficient	8	307	200	4376	15
19	Inefficient	8	318	140	3735	13.2
20	Economic	4	78	52	1985	19.4
21	Economic	4	89	71	1990	14.9
22	Economic	4	97	75	2265	18.2
23	Inefficient	6	250	98	3525	19
24	Economic	4	83	61	2003	19
25	Inefficient	8	302	140	3449	10.5

An Example

An example of classifying car fuel efficiency is demonstrated as follows. The data shown in Table 1 are a random subsample of 25 cars from the “auto-mpg” data set from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). The group variable for classification is “efficiency,” which has two possible values: inefficient (mpg < 25) and economic (mpg ≥ 25). Five attributes are available for classifying and predicting fuel efficiency: cylinders, displacement, horsepower, weight, and acceleration.
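As a rough illustration of the first partitioning step on the Table 1 data, the Python sketch below computes the Gini impurity of the root node and searches the five attributes for the single split with the largest impurity decrease. It is not the entry's R analysis with the “tree” package. The function name `best_first_split`, the use of midpoint thresholds, and the treatment of cylinders as a numerical attribute are assumptions made for this sketch.

```python
from collections import Counter

ATTRIBUTES = ["cylinders", "displacement", "horsepower", "weight", "acceleration"]

# Rows of Table 1: (efficiency, cylinders, displacement, horsepower, weight, acceleration)
TABLE_1 = [
    ("Inefficient", 4, 151, 85, 2855, 17.6), ("Economic", 4, 98, 76, 2144, 14.7),
    ("Economic", 5, 121, 67, 2950, 19.9), ("Inefficient", 6, 250, 105, 3353, 14.5),
    ("Inefficient", 4, 151, 88, 2740, 16), ("Inefficient", 6, 250, 88, 3021, 16.5),
    ("Economic", 4, 71, 65, 1836, 21), ("Economic", 4, 112, 88, 2395, 18),
    ("Economic", 4, 141, 71, 3190, 24.8), ("Inefficient", 8, 350, 155, 4360, 14.9),
    ("Inefficient", 4, 98, 60, 2164, 22.1), ("Economic", 6, 262, 85, 3015, 17),
    ("Inefficient", 6, 200, 85, 3070, 16.7), ("Inefficient", 6, 258, 110, 2962, 13.5),
    ("Inefficient", 4, 116, 75, 2158, 15.5), ("Inefficient", 4, 140, 72, 2401, 19.5),
    ("Inefficient", 8, 350, 180, 4499, 12.5), ("Inefficient", 8, 307, 200, 4376, 15),
    ("Inefficient", 8, 318, 140, 3735, 13.2), ("Economic", 4, 78, 52, 1985, 19.4),
    ("Economic", 4, 89, 71, 1990, 14.9), ("Economic", 4, 97, 75, 2265, 18.2),
    ("Inefficient", 6, 250, 98, 3525, 19), ("Economic", 4, 83, 61, 2003, 19),
    ("Inefficient", 8, 302, 140, 3449, 10.5),
]

def gini(labels):
    """Gini impurity 1 - sum_i p_i^2 of a non-empty list of group labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_first_split(table):
    """Return the root impurity and the best (decrease, attribute, threshold)."""
    labels = [row[0] for row in table]
    root = gini(labels)
    best = None
    for j, name in enumerate(ATTRIBUTES, start=1):
        values = sorted(set(row[j] for row in table))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # assumed convention: midpoint between adjacent values
            left = [row[0] for row in table if row[j] <= t]
            right = [row[0] for row in table if row[j] > t]
            after = (len(left) * gini(left) + len(right) * gini(right)) / len(table)
            if best is None or root - after > best[0]:
                best = (root - after, name, t)
    return root, best

root_impurity, (decrease, attribute, threshold) = best_first_split(TABLE_1)
print(f"root impurity = {root_impurity:.3f}")
print(f"best split: {attribute} <= {threshold} (decrease {decrease:.3f})")
```

The root node holds 10 economic and 15 inefficient cars, so its impurity is 1 − (0.4² + 0.6²) = 0.48. Under the assumptions of this sketch, the search selects displacement with a threshold of 146 (the midpoint between 141 and 151), for an impurity decrease of about 0.226. This need not match the tree reported in the entry, which may use different splitting and stopping rules.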

...
