# Logistic Regression

Logistic regression is a statistical method to test for associations, or relationships, between variables. Like all regression analyses, logistic regression is a predictive analysis where a model is tested to find out whether the value of one variable, or the combination of values of multiple variables, can predict the value of another variable. The distinguishing feature of logistic regression is that the dependent (also called outcome or response) variable is categorical. This entry first describes the method and the concepts of causal inference and biological plausibility. It then discusses positive and negative associations and the odds ratio and provides an example of the use of logistic regression analysis to determine whether depression increases the risk of older people needing home help. The entry concludes by reviewing some assumptions and sources of error in logistic regression.

In binary logistic regression, which is the most common type of logistic regression, the dependent variable is binary or dichotomous. That means that there can only be two options for its value, for example, yes/no, pass/fail, alive/dead, or satisfied/unsatisfied. Where there are more than two categories for the dependent variable, the less common multinomial logistic regression is needed.

The dependent variable is the thing you are trying to explain or predict. There can be one or multiple independent (also called predictor or explanatory) variables tested in your model, and these can be either discrete variables (including dichotomous or ordinal), or they can be continuous (interval) variables. The term dependent suggests that this variable is dependent upon the status of the independent or predictor variable(s). As with all regression analyses, when there are multiple independent variables in a model, you are testing the predictive ability of each independent variable while controlling for the effects of other predictors. In logistic regression, the results lead to an estimation of the change in probability or odds of the outcome event occurring with a change in the value of the independent variable(s) relative to the probability or odds of the outcome event occurring given no change in the predictor variables. The results are not as easily interpreted as the results of a linear regression analysis, where the level of the outcome can be predicted from the predictor variables.

In logistic regression, the odds of the outcome of interest occurring for one unit change in the predictor variables is given in relation to the null hypothesis or equal odds. Equal odds is represented by an odds ratio value of 1.0. An increase in odds of the outcome occurring is indicated by an odds ratio value of greater than 1.0, and a decrease in the odds of the outcome occurring is indicated by an odds ratio value of less than 1.0. Statistically significant odds ratios are an indication of an association existing between the variables. The further the odds ratio number is from 1.0, the greater or stronger the association.
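The link between probabilities, odds, and the odds ratio can be sketched in a few lines of Python. The function names and the probabilities below are hypothetical and serve only as an illustration.

```python
def odds(probability: float) -> float:
    """Convert a probability (0 <= p < 1) into odds, p / (1 - p)."""
    return probability / (1.0 - probability)


def odds_ratio(p_group: float, p_reference: float) -> float:
    """Odds of the outcome in one group divided by the odds in a reference group."""
    return odds(p_group) / odds(p_reference)


# Hypothetical outcome probabilities, chosen only for illustration.
print(odds_ratio(0.20, 0.20))  # equal odds
print(odds_ratio(0.30, 0.20))  # above 1.0: increased odds
print(odds_ratio(0.10, 0.20))  # below 1.0: decreased odds
```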

An example of a logistic regression inquiry can be: Does the value of x (independent variable) change the likelihood of y (dependent variable) being “yes” (rather than “no”)? For example, does eating bread crusts increase the likelihood of having curly hair (rather than straight hair)? In this case, a statistically significant odds ratio of greater than 1.0 would indicate that eating bread crusts does increase the chance of hair being curly.

Logistic regression can also indicate the strength of this predictive relationship by providing a value for the increased or decreased odds of the outcome occurring for a given change in the predictor variable. In our example, if the odds ratio is only a little bit greater than 1.0, then eating crusts only slightly increases the likelihood of having curly hair, and other factors are probably more important. However, if the odds ratio is a lot greater than 1.0, then eating bread crusts makes a really big difference to your chances.

In another example, the odds ratio for being obese among children watching 11–20 hours of TV per week compared with children watching ≤10 hours of TV per week is around 1.4. That is, the odds of being obese are 1.4 times higher in the 11–20 hours group than in the ≤10 hours group, a rather modest 40% increase in odds. However, for children watching more than 30 hours of TV per week, the odds ratio is around 3.6; that is, the odds of being obese are 3.6 times the odds for children watching TV for ≤10 hours. Note that the dependent variable is obese versus not obese, and the independent variable is TV watching per week in hours categorized into ≤10 hours, 11–20 hours, 21–30 hours, and >30 hours. The reference group in the model is the ≤10 hours per week group, and therefore the odds ratio for children watching ≤10 hours of TV per week is fixed at 1.0. The odds of obesity in each of the other groups are given relative to the odds for the reference group, hence the term odds ratio. So far the examples have only one independent variable. With multiple independent variables, you can also see the relative importance of the predictors, that is, which of the variables in the model is the strongest predictor of the outcome.
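As a rough illustration of how odds ratios relate to a reference group, the sketch below computes them from a table of counts. The counts are invented so that the results land near the values quoted above; they are not data from any study, and the variable names are hypothetical.

```python
# Hypothetical counts (obese, not obese) per TV-watching group; not real study data.
counts = {
    "<=10 h": (100, 900),
    "11-20 h": (135, 865),
    "21-30 h": (180, 820),
    ">30 h": (285, 715),
}
REFERENCE = "<=10 h"


def group_odds(obese: int, not_obese: int) -> float:
    """Odds of obesity within one group."""
    return obese / not_obese


reference_odds = group_odds(*counts[REFERENCE])
odds_ratios = {
    group: group_odds(*cells) / reference_odds for group, cells in counts.items()
}

for group, ratio in odds_ratios.items():
    print(f"{group}: odds ratio = {ratio:.2f}")
```

With these made-up counts, the proportion obese rises from 10.0% in the reference group to 13.5% in the 11–20 hours group, a ratio of proportions of 1.35, while the odds ratio is about 1.40. The two measures are close here but are not the same quantity.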

Note that the odds ratio from a logistic regression does not tell you the actual likelihood or odds of an outcome in an individual. It gives the odds of the outcome occurring when the predictor variable is 1 unit higher, relative to the odds at the original value of the variable. Nor can logistic regression be used to determine whether a variable causes an increased or decreased probability of the outcome.

### Causal Inference

The term dependent does not suggest that the independent variable(s) cause the outcome. This is a very important concept to understand when interpreting the results of regression analyses. In our example, if you found an association between eating bread crusts and curly hair, you cannot conclude that eating bread crusts causes curly hair. Similarly, it cannot be known from the TV watching data whether it is the increased TV watching that causes the increased likelihood of obesity. In fact, in this example, the causal relationship is likely to be complex, multifactorial, and possibly bidirectional. That is, there may be an element of higher body mass index (BMI) causing children to choose more sedentary behaviors. The possible reasons for a finding of increased odds include

• A direct causal relationship exists. Eating crusts does in fact make your hair grow curly.
• A reverse causal relation exists. Having curly hair makes you eat more bread crusts.
• An indirect causal pathway. People who eat more bread crusts are more likely to have curly hair, but the causal pathway is more complex. For example, eating more bread crusts makes you drink more water and drinking more water makes your hair go curly.
• A third factor is associated with both predictor and outcome variables. For example, bread crust eating tends to be higher in people who eat more bread, and it is bread that causes hair to be curly.
• There is no relationship between eating bread crusts and having curly hair and the finding was purely coincidence. This is a false positive or a Type I error.

The point is that finding an association does not tell you which of these possible reasons for the association is true. If you find an association between two variables, you cannot assume that the predictor variable caused the increased odds of the outcome occurring. From the results, you may be able to suggest an explanation, but you need to test your new hypothesis with another type of experimental design. A significant association in a regression analysis does not necessarily indicate a causal relationship.

### Biological Plausibility

This leads on to the concept of biological plausibility, or “Does this explanation make reasonable or logical sense?” Of course, in reality you should not find an association between eating bread crusts and curly hair because there is no biologically plausible rationale why eating bread crusts would make your hair curly. If you use a p value cutoff of <.05 for statistical significance in your logistic regression analyses, then, on average, about 5 in every 100 relationships tested where no relationship exists will be statistically significant simply by random chance. For this reason, it is important to use logistic regression to test only biologically plausible theories rather than to analyze all the combinations available in the data and then try to subsequently explain the significant relationships found. In other words, it is important to have a research question with a stated hypothesis or expectation before any analyses are carried out.
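The chance findings described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it swaps in a pooled two-proportion z test for a full logistic regression to stay short, and the number of tests, group size, and outcome rate are arbitrary, hypothetical settings.

```python
import math
import random


def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided p value from a pooled two-proportion z test."""
    pooled = (events_a + events_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (events_a / n_a - events_b / n_b) / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(0)  # fixed seed so the run is repeatable
N_TESTS, GROUP_SIZE, TRUE_RATE = 2000, 200, 0.30  # hypothetical settings

false_positives = 0
for _ in range(N_TESTS):
    # Both groups share the same outcome rate, so no relationship exists.
    events_a = sum(rng.random() < TRUE_RATE for _ in range(GROUP_SIZE))
    events_b = sum(rng.random() < TRUE_RATE for _ in range(GROUP_SIZE))
    if two_proportion_p_value(events_a, GROUP_SIZE, events_b, GROUP_SIZE) < 0.05:
        false_positives += 1

false_positive_rate = false_positives / N_TESTS
print(f"Share of null relationships flagged as significant: {false_positive_rate:.3f}")
```

The printed share should sit close to .05, which is why testing every available combination of variables is expected to turn up some significant but meaningless results.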

### Positive and Negative Associations and Increased or Decreased Odds

Whenever a logistic regression analysis identifies an association, the association may be either positive or negative. This tells you about the direction of the association or whether the factor increases or decreases the likelihood of the outcome of interest. In terms of odds, a positive association produces an odds ratio of greater than 1.0, and a negative association produces an odds ratio of less than 1.0. For this reason, it is important to be aware of how you code your outcome variable for the analysis.

Typically, statistical software packages will provide the odds for the outcome coded with the higher value compared with the outcome coded with the lower value. Thus, if you coded obesity with “one” and normal weight with “zero,” the reported odds ratio will describe the odds of having obesity. Consider, for example, higher grades at school increasing the odds of the student going on to tertiary education. If enrolling in tertiary education is coded “one” and not enrolling in tertiary education is coded “zero,” the association would be positive and the odds ratio would be >1.0. However, if enrolling in tertiary education is coded “one” and not enrolling in tertiary education is coded “two,” the association would be negative and the odds ratio would be <1.0. The results have the same interpretation. The odds ratio values are simply the inverse of each other. The odds of enrolling in tertiary education are better for students with higher grades and worse for students with lower grades. The difference in direction of the association and value of the odds ratio is simply due to the coding.
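The effect of the coding can be checked with a small table of counts. The numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts: each tuple is (enrolled, not enrolled) for one grade group.
higher_grades = (80, 20)
lower_grades = (50, 50)


def odds_ratio(event_index: int) -> float:
    """Odds ratio for higher vs. lower grades, treating one column as the event."""
    other = 1 - event_index
    odds_higher = higher_grades[event_index] / higher_grades[other]
    odds_lower = lower_grades[event_index] / lower_grades[other]
    return odds_higher / odds_lower


or_enrolled = odds_ratio(0)      # enrolling is the event (given the higher code)
or_not_enrolled = odds_ratio(1)  # not enrolling is the event (given the higher code)

print(or_enrolled, or_not_enrolled)
```

Swapping which outcome carries the higher code turns the odds ratio into its reciprocal, so the two values multiply to 1.0 and describe the same association.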

### Example of Logistic Regression Analysis

Let’s use the following example to help explain the results of a logistic regression analysis. An analysis tested the hypothesis that depression increases the risk of older people needing home help. The model has one independent (predictor) variable, depression, and a dichotomous dependent (outcome) variable, home help. Depression scores can range from 0 (no depression) to 21. Home help can either be 1 (yes) or 0 (no).
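To make the setup concrete, the sketch below fits a one-predictor logistic regression by Newton-Raphson maximum likelihood on simulated data. Everything in it is an assumption made for illustration: the data are generated at random rather than taken from the study, the intercept of −2.0 is invented, and only the slope of .15 echoes the coefficient discussed in the next section. In practice a statistical package would do the fitting.

```python
import math
import random


def sigmoid(value: float) -> float:
    """Logistic function: maps log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-value))


def fit_logistic(x, y, iterations=25):
    """Estimate intercept and slope by Newton-Raphson maximum likelihood."""
    intercept, slope = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(intercept + slope * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to the intercept
            g1 += (yi - p) * xi     # gradient with respect to the slope
            h00 += w                # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        intercept += (h11 * g0 - h01 * g1) / det
        slope += (h00 * g1 - h01 * g0) / det
    return intercept, slope


# Simulated data: depression scores 0-21 and a 0/1 home help indicator.
rng = random.Random(42)
TRUE_INTERCEPT, TRUE_SLOPE = -2.0, 0.15  # hypothetical generating values
scores = [rng.randint(0, 21) for _ in range(5000)]
home_help = [
    1 if rng.random() < sigmoid(TRUE_INTERCEPT + TRUE_SLOPE * s) else 0
    for s in scores
]

b0, b1 = fit_logistic(scores, home_help)
print(f"B = {b1:.3f}, Exp(B) = {math.exp(b1):.3f}")
```

With 5,000 simulated cases the estimated slope should fall close to the generating value of .15, though it will not match it exactly.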

### Unstandardized Coefficient or B Value

In this model, the unstandardized coefficient (B value) was .15. The B value is similar to the B value in a linear regression analysis and can be used in a predictive equation. However, in logistic regression, the equation predicts the probability of a case falling into the desired category rather than the value for the outcome variable. In this case, the B value is positive; therefore, higher depression scores (if significant) are associated with greater likelihood of needing home help.
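A predictive equation of this kind might look like the sketch below. The B value of .15 comes from the example, but the entry does not report the constant (intercept) of the model, so the value of −2.0 used here is purely hypothetical and the resulting probabilities are illustrative only.

```python
import math

B = 0.15          # unstandardized coefficient reported in the example
INTERCEPT = -2.0  # hypothetical: the entry does not report the constant term


def predicted_probability(depression_score: float) -> float:
    """Probability of needing home help under the assumed equation."""
    log_odds = INTERCEPT + B * depression_score
    return 1.0 / (1.0 + math.exp(-log_odds))


for score in (0, 10, 21):
    print(score, round(predicted_probability(score), 3))
```

Because B is positive, the predicted probability rises with the depression score, but it always stays between 0 and 1.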

### Odds Ratio, Exp(B), or β Value

The β value is the exponential of the B value, or the odds ratio. It is expressed per one-unit change in the predictor, so its magnitude can be compared directly with the β value(s) for other variable(s) in the model, or for variable(s) in other models, only when those variables are measured on comparable scales. The β value is the point estimate of the strength of the association. The further away the β value is from 1.0, the stronger the association. In the depression versus home help example, the β value is 1.16 for depression. That means the odds of needing home help are 1.16 times as high for someone reporting one point more on the depression scale as for a person with a depression score one point lower.
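The arithmetic can be checked directly. The sketch below exponentiates the reported B value of .15 and also shows how the odds ratio compounds over a larger difference in depression score; the five-point difference is an arbitrary choice for illustration.

```python
import math

B = 0.15  # unstandardized coefficient from the example, as reported (rounded)

odds_ratio_per_point = math.exp(B)
odds_ratio_per_5_points = math.exp(5 * B)  # same as odds_ratio_per_point ** 5

print(round(odds_ratio_per_point, 2))
print(round(odds_ratio_per_5_points, 2))
```

Odds ratios multiply rather than add, so a five-point difference corresponds to the one-point odds ratio raised to the fifth power.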

### Significance

A p value of .05 is most commonly selected as the cutoff level to signify the statistical significance of an odds ratio. The cutoff value doesn’t have to be .05, and there may be reasons why you choose a cutoff value that is more (e.g., <.01) or less (e.g., <.1) stringent. A p value of <.05 means that, if no true association existed, an association at least as strong as the one observed would arise by chance or coincidence in fewer than 5% of samples. Thus, if there is an association between two variables with a p value of .08, a result at least this strong would be expected 8% of the time when no true association exists, which is generally considered unacceptably high.

The p value for the depression versus home help example was <.001. Therefore, an association this strong would be very unlikely to arise by chance alone if depression and home help were unrelated (although we cannot assume that depression causes people to need home help).

### Confidence Interval

The confidence interval is another way of expressing how likely it is that an association truly exists. The β value is the point estimate of the odds ratio, whereby an odds ratio of 1.0 means that a one increment change in the independent variable does not increase or decrease the probability that the dependent variable will be in the category of interest. If the 95% confidence interval includes 1.0, the association is not statistically significant at the .05 level. For example, a 95% confidence interval of [0.93, 3.76] includes an odds ratio of 1.0 and therefore we cannot say with confidence that a true association exists. However, confidence intervals give more information than just statistical significance and therefore more information than p values.

The 95% confidence interval for the β values is the range within which we can be 95% confident that the true β value lies for your population of interest based on the information from your sample. Thus, while the β value gives the point estimate of the odds ratio and therefore an indication of how much greater or lesser the odds of the outcome are, the 95% confidence interval provides an estimation of the precision of your point estimate. In the example, the true odds ratio is likely to be somewhere between 0.93 and 3.76. This is a wide range of possible values, so the estimation of the odds ratio is considered imprecise. And while the data do not support there being an association, it would be foolish to conclude that no association exists.

The 95% confidence interval for the odds ratio for home help with an increase in depression score was [1.12, 1.19]. That means we can be 95% confident that the true value for the population is between 1.12 and 1.19. This is only a small increase in odds, but a very precise finding thanks to the large sample size available.
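The reported interval can be related back to the p value with a short calculation. This sketch assumes the interval is a Wald interval, that is, symmetric around the estimate on the log odds scale; the entry does not say how the interval was computed, so the standard error recovered here is only an approximation.

```python
import math

ODDS_RATIO = 1.16          # point estimate reported in the example
LOWER, UPPER = 1.12, 1.19  # 95% confidence interval reported in the example
Z_95 = 1.96                # normal critical value for a 95% interval

# Assumption: a Wald interval, so its width on the log scale is 2 * 1.96 * SE.
standard_error = (math.log(UPPER) - math.log(LOWER)) / (2 * Z_95)

z_statistic = math.log(ODDS_RATIO) / standard_error
p_value = math.erfc(abs(z_statistic) / math.sqrt(2))  # two-sided normal p value

includes_one = LOWER <= 1.0 <= UPPER
print(round(standard_error, 4), includes_one, p_value < 0.001)
```

Under that assumption, the interval excluding 1.0 and the p value falling below .001 are two views of the same result.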

The p value and width of the confidence interval are highly influenced by the sample size and homogeneity of the sample. In other words, if you have a very large sample with a wide spread of values, then your p value is more likely to be smaller, your confidence interval narrower, and your point estimate closer to the true value for the population. In the depression and home help study, there were data available from over 6,000 people, which enabled such a precise estimate of the odds ratio.

The sample size can influence the p value and the precision estimate (confidence interval) but does not influence the strength of the association (point estimate) apart from the possibility of it being closer to the true population value with greater sample sizes.
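A small calculation illustrates this point. The sketch below uses a two-by-two table with hypothetical counts and the usual large-sample (Woolf) standard error for a log odds ratio, then repeats the calculation with every count multiplied by 10. The function name and the counts are assumptions made for the example.

```python
import math


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and 95% CI from a 2x2 table.

    a, b = outcome yes/no in the exposed group; c, d = yes/no in the reference group.
    """
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


small = odds_ratio_with_ci(30, 70, 20, 80)      # hypothetical small sample
large = odds_ratio_with_ci(300, 700, 200, 800)  # same proportions, 10 times the cases

for label, (estimate, lower, upper) in (("small", small), ("large", large)):
    print(f"{label}: OR = {estimate:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

The point estimate is identical in both tables, but only the larger sample yields an interval narrow enough to exclude 1.0.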

### Assumptions and Sources of Error

There are a number of assumptions and sources of error in a logistic regression analysis that should be considered. Logistic regression can handle ordinal and nominal data as independent variables as well as continuous (interval or ratio scaled) data. Binary logistic regression requires the dependent variable to be binary. Ordinal or interval data can be reduced to a dichotomous level but doing this loses a lot of information, which may make this test inferior compared to ordinal logistic regression or linear regression in these cases.

In regression analyses, it is good to have a wide range of values of the independent variable(s) in the analysis sample. If the sample includes only a small portion of the range of possible values for one or more of the independent variables, you might not get a very accurate indication of their relationship with the dependent variable. Certainly you will have limited generalizability of the results.

Models do not need to have linear relationships between the dependent and independent variables. Logistic regression can handle all sorts of relationships because it applies a nonlinear logit (log odds) transformation to the predicted probability. The independent variables do not need to be normally distributed, although multivariate normality yields a more stable solution. Also, the error terms (the residuals) do not need to be normally distributed.

As explained earlier, because logistic regression reports the odds of the event coded with the higher value occurring given a change in the independent variable, it is necessary that the dependent variable is coded accordingly for the event of interest. That is, for a binary regression, the higher factor level of the dependent variable should represent the desired outcome or outcome of interest.

Adding independent variables to a logistic regression model will always improve its apparent fit because each addition explains a bit more variance of the outcome. However, adding more and more variables to the model makes it inefficient, and overfitting can occur. Only include as many variables as needed for your research question or hypothesis. That is, only the meaningful variables should be included. But you should try to include all meaningful variables. This requires a good knowledge of the field of inquiry and deep consideration of the research question and hypothesis, and it is likely to be the most challenging part of a logistic regression analysis.

Logistic regression requires each observation to be independent, that is, that the data points should not be from any dependent samples design, such as before-after measurements, or matched pairings. The model should have little or no multicollinearity. That is, the independent variables should be pretty much independent from each other. As long as correlation coefficients among independent variables are less than .90, the assumption can be considered met. There is, however, the option to include interaction effects of categorical variables in the analysis.
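One simple way to screen for multicollinearity is to compute pairwise correlation coefficients among the predictors and flag any pair at or above .90. The sketch below does this with a hand-written Pearson correlation; the predictor names and values are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical predictor values for six people.
predictors = {
    "bmi": [18, 22, 25, 27, 30, 34],
    "weight_kg": [50, 62, 70, 77, 85, 98],
    "age": [70, 66, 81, 74, 68, 79],
}
THRESHOLD = 0.90

flagged = []
for name_a, name_b in combinations(predictors, 2):
    r = pearson(predictors[name_a], predictors[name_b])
    if abs(r) >= THRESHOLD:
        flagged.append((name_a, name_b, round(r, 3)))

print(flagged)
```

In this made-up data only BMI and weight are flagged, which suggests keeping one of the two rather than entering both in the same model.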

Logistic regression assumes linearity of independent variables and log odds. Although it does not require the dependent and independent variables to be related linearly, it requires that the independent variables are linearly related to the log odds. Otherwise, the test underestimates the strength of the relationship and dismisses it too easily (i.e., it returns a nonsignificant result and fails to reject the null hypothesis) when a relationship does exist. A possible solution to this problem is the categorization of the independent variables, that is, transforming interval variables to ordinal level and then including them in the model. An example of this is to transform BMI values into ordinal categories of underweight (BMI < 20), normal weight (BMI = 20–25), overweight (BMI >25 but ≤30), and obese (BMI >30).
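The BMI categorization just described could be coded along these lines. The function name and the sample values are hypothetical; the cutoffs follow the ones given above.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to the ordinal categories used in the text."""
    if bmi < 20:
        return "underweight"
    if bmi <= 25:
        return "normal weight"
    if bmi <= 30:
        return "overweight"
    return "obese"


# Ordinal codes that preserve the order of the categories.
ORDINAL_CODE = {"underweight": 0, "normal weight": 1, "overweight": 2, "obese": 3}

sample_bmis = [18.4, 22.0, 25.0, 27.5, 30.0, 33.1]  # hypothetical values
coded = [ORDINAL_CODE[bmi_category(value)] for value in sample_bmis]
print(coded)
```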

Large sample sizes are important. Maximum likelihood estimates are less powerful than ordinary least squares (used for simple and multivariable linear regression). Ordinary least squares analysis needs at least five cases per independent variable in the analysis; however, maximum likelihood estimates need at least 10 cases per independent variable, and some statisticians recommend at least 30 cases for each parameter to be estimated. Odds ratios are most accurate if the outcome rate in the sample closely approximates the outcome rate in the population. There should be no outliers in the data. The presence of outliers can be assessed by converting the continuous predictors to standardized, or z scores, and removing values below −3.29 or greater than 3.29.
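The outlier screen described above can be sketched as follows. The values are hypothetical, and the use of the sample standard deviation (rather than the population version) is an assumption, since the entry does not specify which to use.

```python
import statistics

values = [48, 49, 50, 51, 52] * 6 + [120]  # hypothetical continuous predictor
CUTOFF = 3.29

mean = statistics.mean(values)
sd = statistics.stdev(values)  # sample standard deviation (an assumption)

z_scores = [(value - mean) / sd for value in values]
kept = [value for value, z in zip(values, z_scores) if abs(z) <= CUTOFF]
outliers = [value for value, z in zip(values, z_scores) if abs(z) > CUTOFF]

print(outliers)
```

Note that in very small samples no value can reach a z score of 3.29, so this screen is only informative once the sample is reasonably large.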

See also Multiple Linear Regression; Odds Ratio

10.4135/9781506326139.n403
