Stepwise Regression

Tze Leung Lai; Ching-Kang Ing

doi:10.4135/9781412961288

By: Tze Leung Lai & Ching-Kang Ing
In: Encyclopedia of Research Design
Chapter DOI: https://doi.org/10.4135/9781412961288.n444
Subject:Anthropology, Business and Management, Criminology and Criminal Justice, Communication and Media Studies, Counseling and Psychotherapy, Economics, Education, Geography, Health, History, Marketing, Nursing, Political Science and International Relations, Psychology, Social Policy and Public Policy, Social Work, Sociology, Technology, Medicine

Stepwise, also called stagewise, methods in fitting regression models have been extensively studied and applied in the past 50 years, and they remain an active area of research. In many study designs, one has a large number K of input variables, and the number n of input–output observations (x_{i1}, …, x_{iK}, y_i), 1 ≤ i ≤ n, is often of the same order of magnitude as K or smaller. Examples include gene expression studies, where the number K of genomic locations is typically larger than the number n of subjects, and signal or image reconstruction, where the number K of basis functions to be considered exceeds the number n of measurements or pixels. Stepwise methods are perhaps the only computationally feasible ways to tackle these problems, and certain versions of these methods have recently been shown to have many desirable statistical properties as well.

Stepwise regression carries out two tasks sequentially to fit a regression model

$$y_i = \beta^T x_i + \varepsilon_i, \qquad 1 \le i \le n, \tag{1}$$

where β = (β_1, …, β_K)^T is a vector of regression parameters, x_i = (x_{i1}, …, x_{iK})^T is a vector of regressors (input variables), ε_i represents unobservable noise, and y_i is the observed output. The first task is to choose regressors sequentially, and the second task is to refit the regression model by least squares after a regressor has been added to the model. For notational simplicity, assume that the y_i and x_{ij} in Equation 1 have been centered at their sample means so that Equation 1 does not have an intercept term.

To begin, stepwise regression chooses the regressor that is most correlated with the output variable (i.e., such that the sample correlation coefficient between y_i and x_{ij} is the largest among the K regressors). One then performs least squares regression of y_i on the selected regressor x_{ij}, yielding the least squares fit ŷ_i and the residuals y_i − ŷ_i. A variable selection criterion is then applied to determine whether the chosen regressor should indeed be included. If the criterion accepts the chosen regressor, then the stepwise procedure is repeated on the remaining regressors but with y_i − ŷ_i in place of y_i. More generally, after the regressors labeled j_1, …, j_k have been included in the model and the residuals e_i have been computed, one chooses j_{k+1} such that the correlation coefficient between e_i and x_{i,j_{k+1}} is the largest among the remaining K − k input variables, and performs least squares regression of e_i on x_{i,j_{k+1}}, yielding a new set of residuals, which are used in the criterion to determine whether the regressor labeled j_{k+1} should be included. If the criterion rejects the regressor, then it is not included in the model, and the stepwise regression procedure terminates with the set of input variables x_{i,j_1}, …, x_{i,j_k}.
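The selection loop described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the use of the absolute value of the correlation, and the `accept` callback are all hypothetical choices made for the example, and `accept_min_drop` is a placeholder criterion rather than the criterion used in the entry.

```python
import numpy as np

def stepwise_on_residuals(X, y, accept):
    """Select regressors one at a time by correlation with the current residuals.

    `accept(rss_old, rss_new, n, k)` is a hypothetical user-supplied criterion
    returning True if the candidate regressor should enter the model.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)      # center each regressor at its sample mean
    e = y - y.mean()            # centered output; e holds the current residuals
    n, K = X.shape
    selected = []
    while len(selected) < K:
        remaining = [j for j in range(K) if j not in selected]
        # |x_j' e| / ||x_j|| ranks regressors by absolute sample correlation
        # with e (the common factor ||e|| does not change the ranking).
        inner = np.abs(X[:, remaining].T @ e)
        norms = np.linalg.norm(X[:, remaining], axis=0)
        score = np.divide(inner, norms, out=np.zeros_like(inner), where=norms > 0)
        if not np.any(score > 0):
            break                        # nothing left is correlated with e
        j = remaining[int(np.argmax(score))]
        xj = X[:, j]
        coef = (xj @ e) / (xj @ xj)      # simple least squares of e on x_j
        e_new = e - coef * xj            # new residuals
        if not accept(e @ e, e_new @ e_new, n, len(selected)):
            break                        # criterion rejects: procedure terminates
        selected.append(j)
        e = e_new
    return selected, e

def accept_min_drop(rss_old, rss_new, n, k, min_drop=0.10):
    """Illustrative criterion only: require the RSS to fall by at least `min_drop`."""
    return rss_new <= (1.0 - min_drop) * rss_old
```

Calling `stepwise_on_residuals(X, y, accept_min_drop)` returns the list of selected column indices in order of entry, together with the final residuals. Passing the criterion as a callback keeps the selection loop separate from the decision rule, so a test-based rule can be substituted without changing the loop.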

A traditional variable selection criterion is based on the F test of H0: β_j = 0 in the regression model

$$y_i = \beta_{j_1} x_{i,j_1} + \cdots + \beta_{j_k} x_{i,j_k} + \beta_j x_{ij} + \varepsilon_i$$

to determine whether the regressor labeled j should be added to the model that already contains the regressors labeled j_1, …, j_k. If the F test rejects H0 at significance level α, which is often chosen to be 5%, then the regressor labeled j is included in the model. Otherwise, β_j is deemed to be not significantly different from 0, and the corresponding regressor is excluded from the model. Note that because this test-based procedure carries out a sequence of F tests, the overall significance level can differ substantially from α. Such an F test of whether a particular regressor in a larger set of input variables has regression coefficient 0 is called a partial F test, and the corresponding test statistic is called a partial F statistic.
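As a rough illustration of the partial F statistic, the sketch below compares the residual sum of squares of the model with and without the candidate regressor. It assumes that `X` and `y` have already been centered at their sample means, so one degree of freedom is charged for the implicit intercept; the function names and the `f_crit` argument (a critical value the caller looks up for the chosen α) are hypothetical and introduced only for this example.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of least squares of y on the columns `cols` of X."""
    cols = np.asarray(cols, dtype=int)
    if cols.size == 0:
        return float(y @ y)              # empty model: residuals are y itself
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    r = y - X[:, cols] @ beta
    return float(r @ r)

def partial_f(X, y, included, j):
    """Partial F statistic for H0: beta_j = 0, given the regressors in `included`.

    Assumes X and y are centered, so the residual degrees of freedom of the
    larger model are n - (k + 1) - 1, where k = len(included).
    """
    included = list(included)
    n = len(y)
    rss_reduced = rss(X, y, included)        # model without regressor j
    rss_full = rss(X, y, included + [j])     # model with regressor j added
    df_resid = n - len(included) - 2
    return (rss_reduced - rss_full) / (rss_full / df_resid)

def add_regressor(X, y, included, j, f_crit):
    """Include regressor j if its partial F statistic exceeds `f_crit`."""
    return partial_f(X, y, included, j) > f_crit
```

Here `f_crit` would be the upper α quantile of the F distribution with 1 and n − k − 2 degrees of freedom, taken from tables or a statistics library. Because the decision is repeated at every step, the caveat in the text about the overall significance level applies to any loop built on `add_regressor`.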

...
