Skip to main content icon/video/no-internet

Dummy Coding

Dummy coding, also known as indicator coding, provides a means for researchers to represent a categorical variable as a set of independent quantitative variables. The resulting dummy variables take on values of 0 and 1 and can be used as predictors in regression analysis. Given a categorical variable that can take on k values, it is possible to create k − 1 dummy variables without any loss of information. Dummy variables are often included in regression models to estimate the effects of categorical variables such as race, marital status, diagnostic group, and treatment setting.

When constructing a set of dummy variables, one level of the original categorical variable is selected as a reference category and is excluded from analysis. Each remaining level becomes a single dummy variable on which observations receive a value of 1 if they fall into the category and 0 if they do not. For example, given a four-level psychiatric diagnosis variable (bipolar disorder, major depressive disorder, other mood disorder, no mood disorder), the “no mood disorder” group might be selected as the reference category. Three dummy variables would then be constructed: “BIPOLAR” would be scored 1 for persons with this diagnosis and 0 for those in the other three diagnostic groups; similarly, “MDD” would be scored 1 for persons with a diagnosis of major depressive disorder and 0 for those with another diagnosis; and “OTHMOOD” would be scored 1 for persons with a diagnosis of an “other mood disorder” and 0 for those diagnosed with bipolar disorder, major depressive disorder, or no mood disorder. Each dummy variable would have 1 df.

In regression analyses, the coefficients for each of the k − 1 dummy variables quantify the estimated effect on the outcome variable of membership in the group in question versus membership in the reference group. For example, in a logistic regression analysis predicting diagnosis of bloodborne infection, a coefficient (b) for the “BIPOLAR” variable of 1.12 would represent the difference in the log-odds of infection between persons with bipolar disorder and those with no mood disorder. Expressed in terms of an odds ratio (eb = 3.06), persons with bipolar disorder would have slightly more than three times the odds of infection compared with persons in the reference group. Dummy variables are used in a wide variety of regression analyses, including, but not limited to, ordinary least-squares, logistic, Cox, and Poisson regression.

The choice of reference category should be made based on substantive scientific considerations, and the category should be selected to support meaningful contrasts. In the case of race, this often results in the selection of “White” as the reference category. In experimental studies, the control group is often an appropriate choice. Miscellaneous or “other” categories containing a heterogeneous mix of observations (e.g., “other race”) usually do not provide researchers with the ability to make useful inferences and are therefore rarely suitable as reference categories. To ensure stable estimates, the number of observations in the reference category should not be too small relative to the size of the other categories.

...

  • Loading...
locked icon

Sign in to access this content

Get a 30 day FREE TRIAL

  • Watch videos from a variety of sources bringing classroom topics to life
  • Read modern, diverse business cases
  • Explore hundreds of books and reference titles

Sage Recommends

We found other relevant content for you on other Sage platforms.

Loading