
Multicollinearity (or collinearity) is a statistical phenomenon in multiple linear regression analysis in which two or more independent (predictor) variables are highly correlated with each other, or intercorrelated. The presence of multicollinearity violates one of the core assumptions of multiple linear regression analysis and is therefore problematic: the estimated regression coefficients are no longer reliable.

This entry discusses the issue of multicollinearity and why it might be problematic. It also outlines symptoms and diagnostics to determine whether or not multicollinearity is present. Finally, this entry discusses several ways of dealing with multicollinearity.

Context

In multiple linear regression analysis, several independent or predictor variables (usually denoted by X and sometimes referred to as regressors) are modeled to predict or estimate one dependent variable (usually denoted by Y). When constructing such a multiple regression model, six general assumptions have to be met in order to obtain reliable and robust parameter estimates. First, a linear relationship between the variables of interest is assumed. Other types of regression are specifically tailored to estimate regression coefficients when this first assumption is not met (e.g., log-linear regression). Second, multivariate normality is assumed. This entails that the modeled variables (strictly speaking, the error terms of the model) are normally distributed. Violations of this assumption can sometimes be addressed by transforming the nonnormally distributed variables, for example with a log transformation (although in some cases this may itself lead to multicollinearity). Third, the error terms should not be correlated with one another; in other words, there should be no autocorrelation. Fourth, the variance of the error terms is assumed to be constant along the regression line; that is, homoscedasticity is assumed. Fifth, there should be no influential observations, meaning no substantial outliers. Finally, in order to estimate reliable and robust regression coefficients, there should be no (or very little) multicollinearity in the proposed regression model.

Multicollinearity occurs when there is a strong linear relationship between two or more independent (predictor) variables in a multiple regression model. This means that the independent variables (X) are not independent of each other. In social science research, some multicollinearity is often present, but it becomes problematic when the independent variables are too strongly intercorrelated (as a general rule of thumb, correlations above .80).
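The .80 rule of thumb can be checked directly on the pairwise correlations between predictors. The following is a minimal sketch, assuming NumPy is available; the function name `high_correlation_pairs`, the variable names, and the simulated data are hypothetical and serve only as illustration. Note that pairwise correlations can miss multicollinearity that involves three or more predictors jointly.

```python
import numpy as np

def high_correlation_pairs(X, names, threshold=0.80):
    """Return (name_i, name_j, r) for predictor pairs with |r| above threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns of X are the predictors
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Simulated (hypothetical) predictors: two media-use measures built to overlap
# strongly, plus one unrelated predictor.
rng = np.random.default_rng(0)
n = 500
tv_news = rng.normal(size=n)
online_news = tv_news + 0.3 * rng.normal(size=n)
age = rng.normal(size=n)

X = np.column_stack([tv_news, online_news, age])
flagged = high_correlation_pairs(X, ["tv_news", "online_news", "age"])
print(flagged)
```

In this simulated data only the pair of media-use measures is flagged; the threshold is a parameter so that a stricter or looser cutoff can be applied.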

Strong multicollinearity is problematic for several reasons. First, it becomes difficult to interpret the estimated regression coefficients. One of the aims of multiple regression analysis is to isolate the effect that one independent variable has on the dependent variable while holding the other predictors constant. This becomes impossible when two or more independent variables are strongly correlated, because it is no longer meaningful to describe the model in terms of increasing one predictor while holding all others constant. Furthermore, multicollinearity leads to unreliable estimates of the regression coefficients. It also leads to more Type II errors: because the standard errors of the regression coefficients are inflated, it becomes harder to find statistically significant effects. Another effect of multicollinearity is an underestimation of R².
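The inflation of standard errors can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a definitive procedure: the data are simulated, the helper names `ols_standard_errors` and `simulate` are hypothetical, and the standard errors are computed from the usual ordinary least squares formula with NumPy.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Standard errors of the OLS slope coefficients (model includes an intercept)."""
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = residuals @ residuals / (n - design.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)          # coefficient covariance
    return np.sqrt(np.diag(cov))[1:]                         # drop the intercept

def simulate(rho, n=200, seed=1):
    """Two predictors with population correlation rho and identical true effects."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 + 0.5 * x1 + 0.5 * x2 + rng.normal(size=n)
    return ols_standard_errors(np.column_stack([x1, x2]), y)

se_low = simulate(rho=0.0)    # uncorrelated predictors
se_high = simulate(rho=0.95)  # strongly correlated predictors
print("SE with rho = 0.00:", se_low)
print("SE with rho = 0.95:", se_high)
```

The true effects and the sample size are the same in both runs; only the correlation between the predictors changes, and with it the standard errors of both coefficients.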

In general, problematic forms of multicollinearity are not encountered very often in social science research, but they do occur in certain situations. For example, including interaction effects in the regression model carries a risk of multicollinearity. In addition, multicollinearity can cause greater problems in smaller samples than in larger ones because of inflated standard errors. In the fields of communication, media studies, and journalism specifically, some level of multicollinearity is highly likely between measures of media and news use. Arguably, this is why such variables should be included in the model as latent variables (so that they can be accounted for as a whole in the regression model), but this is still a topic of debate.
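The risk that comes with interaction effects can be sketched as follows: when two predictors have non-zero means, their product term tends to correlate strongly with each of them. Mean-centering the predictors before forming the product is one commonly used way to reduce that correlation; it is not named in the text above and is shown here only as an assumption-laden illustration with simulated data and hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Two independent (hypothetical) predictors measured on a scale with a non-zero mean.
x1 = rng.normal(loc=5.0, scale=1.0, size=n)
x2 = rng.normal(loc=5.0, scale=1.0, size=n)

# Correlation between a predictor and the raw interaction term.
raw_r = np.corrcoef(x1, x1 * x2)[0, 1]

# Same correlation after mean-centering both predictors.
x1_c = x1 - x1.mean()
x2_c = x2 - x2.mean()
centered_r = np.corrcoef(x1_c, x1_c * x2_c)[0, 1]

print("r(x1, x1*x2), raw:     ", round(raw_r, 2))
print("r(x1, x1*x2), centered:", round(centered_r, 2))
```

Centering changes how the lower-order coefficients are interpreted (they become effects at the mean of the other predictor), so whether it is appropriate depends on the research question.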

...
