This case study introduces the reader to the basic concept of regression analysis by using research we conducted into solutions to gun violence as an example. We explain conceptually why regression is used by researchers and how to understand some of the most important numbers generated by a regression analysis, including p values, regression coefficients, R-squared values, and interaction terms. We make these ideas and numbers concrete by describing their use in our research about gun violence, specifically whether strict gun control laws or access to mental health care for those in need is more effective at preventing gun deaths. Finally, we explain some limitations to the use of regression analysis. This case study is intended for college students with no background in statistics. We strive to explain regression conceptually rather than mathematically and, within reason, to limit our use of jargon and technical language.
By the end of this case, students should be able to
- Understand why regression analysis is used by researchers to answer particular research questions
- Explain what some important numbers generated by regression analysis mean conceptually, including p values, regression coefficients, R-squared values, and interaction terms
- Recognize some limitations in the use of regression analysis
America is plagued by gun violence. More than 30,000 Americans die in shootings every year according to the Centers for Disease Control and Prevention (CDC; 36,252 in 2015 alone, see Murphy, Xu, Kochanek, Curtin, & Arias, 2017, p. 12). Recent mass shootings in Parkland, Las Vegas, and Orlando have focused attention on the issue (for now), and as a result, politicians from across the political spectrum have felt the need to offer solutions to alleviate the violence. These solutions vary widely, from alleviating poverty to banning violent video games. However, the two most popular and prominent solutions are tough gun control legislation and some kind of policy targeting those with mental illness. The frequency with which these two solutions are proposed as the fix for gun violence would seem to suggest that there is something to these policies, that if only they were enacted gun violence would be significantly reduced, if not banished altogether. But is this true? And if it is, how effective is each solution relative to the other one (i.e., is one solution relatively more effective than the other)? And crucially, how can we know with reasonable confidence the answers to these questions?
At first blush, it can be tempting to think there is a straightforward way to go about addressing these questions. You say you want to know how effective gun control laws are? Well, just take a look at a state that passed stricter gun control laws and see if the number of gun deaths went down afterward. Easy. Same with a mental illness solution. Look at places that increased mental health spending and check to see if the number of gun deaths went down afterward. Compare and contrast and you know the relative effectiveness of these two solutions and can move on to advocating for the more effective one. Not so fast. As you probably guessed, things are not that easy (and political scientists are not that useless). Your first instinct to look at what happened to gun deaths after a particular solution was implemented is good but needs to account for the inherent complexity of the real world. How do you know that new gun control law actually caused that reduction in gun deaths? What if both gun control laws and mental health spending reduce gun violence? Which solution is responsible for which averted death and how do you know?
These questions get at one of the most pervasive methodological problems in the social sciences, that of determining causality from mere association. In other words, it is the problem of knowing whether or not an outcome would still occur in the absence of something else. Would there have still been fewer gun deaths in the absence of that strict new gun control law? Or did the gun control law cause those fewer deaths? If those fewer deaths would have still occurred in the absence of the new law, why? What actually caused the reduction? As you can imagine, telling the difference between something that causes another thing, and two things that simply look like they are associated with one another but are not, is extremely important in the political and public policy worlds. Practically any question of public importance runs into this problem. Is union membership declining because fewer Americans work blue-collar jobs? Or could it be because there is more legislation hostile to unions? Both of these things are happening, but are both causing the decline? Is economic inequality increasing because of declining union membership, lower income taxes, or both? These are not just academic curiosities. As the example of gun violence illustrates, solutions to all kinds of real-world problems are implemented (or not) based on arguments about what causes what. Every time a politician advocates this solution or that fix, she is implying something about what is causing the problem in the first place. If she advocates a mental illness solution to the problem of gun violence, for example, she is suggesting that those with mental illness cause at least some of the problems of gun violence.
Getting what causes what right, then, is obviously critically important but, alas, very difficult. There are many techniques for addressing this problem, some more appropriate than others depending on the question being asked. While no method is bulletproof on the question of what causes what, one of the most widely used in the social sciences is called regression analysis. Understanding and appreciating the basic intuition behind this method, why it is necessary, and its limitations are the aims of this case study. Rather than make this a wholly abstract exercise, we approach introducing the reader to regression analysis through the example of a research project we conducted about solutions to gun violence that made use of regression (Smith & Spiegler, in press). It is our hope that this will make the intuition behind regression more concrete and enable the reader to imagine how this versatile method might be applied to other problems. This case study is designed for college students with no prior background in statistics. We do our best to refrain from using unnecessary amounts of technical language or jargon (although some are unavoidable) and strive to explain regression analysis conceptually rather than mathematically. This explanation will not allow the reader to conduct a regression analysis or indeed understand the technical aspects of it reported in academic papers. What it will do is give the reader an intuitive understanding of why researchers find it necessary to use this method, what kind of questions regression analysis can be deployed to answer, and some of its limitations.
Why Regression for Gun Violence?
When we sat down to figure out how best to approach the problem of gun violence in America, what struck us was a lack of basic agreement surrounding the causes of gun violence. While there are many suspected causes, two stand out from the rest in terms of the frequency with which they are blamed: easy access to guns in the United States and violence committed by some of those who suffer from a mental illness. This fundamental lack of agreement about causes meant that there was even less agreement about solutions to gun violence, whether there should be stricter gun control laws or more funding for mental health services, or both. And somewhat surprisingly for such a politically contentious issue, there was little research that directly examined these two solutions side by side to determine which was more or less effective relative to the other one. We decided to fill in this gap. But how exactly? It became immediately clear that determining which solution was more effective would involve more than a simple comparison. First of all, what if both solutions are effective? How do we tell which one is actually responsible for which averted gun death? And second, what if there are other things influencing gun deaths beyond access to guns and violence committed by those with a mental illness?
The problem of knowing what causes what is the main reason to make use of a technique like regression analysis. The unfortunate truth for researchers is that most things of political importance are not caused or affected by any one thing but by a range of different factors of which it is fiendishly difficult to know the individual effects of any particular one. And to complicate it further, some things may look like they do affect something else, but in fact do not. In other words, the researcher may think there is a causal relationship between two things because it looks as though when one thing changes, the other changes in response to it, but in reality these two things are unrelated to each other. To clarify terminology here, by causal relationship, we mean that one thing alters the behavior of something else (i.e., in its absence, the other thing would behave differently).
This can be difficult to understand in the abstract, so a small example: People generally eat more ice cream during the summer at the same time that they go swimming outside more often. One could take this as evidence that swimming outside in general causes people to eat more ice cream. Perhaps one would theorize that the extra exercise associated with swimming might make people who are watching their waistlines more inclined to eat a calorie-intensive dessert, or alternatively swimming might simply make them hungrier. Therefore, an advocate concerned about America’s obesity epidemic might conclude from this that policies designed to discourage swimming would improve public health. This of course is the wrong conclusion to draw (and violates common sense). People both eat more ice cream and swim more often outdoors during the summer months because the temperature is warmer. While both swimming and ice cream consumption increased in the summer, and so perhaps swimming appeared to affect ice cream consumption, they were in fact not causally related. The hot weather was the thing that actually caused both swimming and ice cream consumption to increase at the same time. If a number of outdoor pools closed for some reason during the summer, but the weather remained hot, we would not expect that the amount of ice cream consumed would decrease even as people swam less because the two things are not causally related (although admittedly if the pools also sold ice cream, the closing of those pools might actually cause a small decrease in the amount of ice cream consumed; this gets at the complexity of the real world and the fact that most things have multiple causes).
Therefore, there are two primary and related reasons to use a method like regression analysis when trying to determine what causes what:
- Almost everything is affected by multiple other things and it can therefore be difficult to know what causal influence should be attributed to one thing versus another.
- Some things may appear to be causally related to one another but in reality are not.
Now that we have illustrated the why of regression analysis, let us spend a little bit of time with the how. This is not intended to teach the reader to perform regression analyses on her own, but merely to provide the intuition behind the method. To simplify greatly, regression analysis stripped down to its essentials is a statistical technique that tells the researcher the likelihood that two things are associated with, or related to, one another (i.e., they “move” together in some way but are not necessarily in a causal relationship [more on this below]), and if so, how. This is difficult to understand in the abstract, so let us return to the example of gun violence. After deciding that we wanted to study two particular proposed solutions to gun violence—stricter gun control and a mental illness solution—we decided that the two problems we discussed above necessitated using regression. Our hypothesis—that is, our explanation about how things are associated (or not) with each other based on theory—was that stricter gun control laws and increased access to mental health services for those in need of them are both causally related to the number of gun-related deaths and that both of these relationships are negative (i.e., stricter gun control laws would cause a decrease in the number of gun deaths and an increase in access to mental health services for those in need of them would also cause a decrease in the number of gun deaths independently of gun control laws and vice versa; note here that how to form theories and decide what to measure—for example, what is meant by “gun violence” exactly?—are both extremely important parts of the research process but are glossed over here because they are not the subject of this case study).
Therefore, we have a thing that is being affected or caused by something else (i.e., gun violence), and two things that are, according to our hypothesis, doing the causing (i.e., gun control laws and access to mental health services for those in need). The thing that is in theory being affected, the number of gun deaths per state in our case, is called the dependent variable. The things that in theory affect it in some way are called independent variables. In our example, using regression analysis tells us whether stricter gun control laws are associated with the number of gun deaths per state, and if so, how they are related (i.e., as gun control laws become stricter, does the number of gun deaths go up or down, and by how many?—note that this does not tell you whether there is actually a causal relationship, more on this below). It also tells us the same thing about the effect of increasing access to mental health services for those in need on gun deaths.
Crucially, regression analysis allows for the effects of each independent variable to be statistically distinguished from one another (this is perhaps the most important thing to take away from this discussion). In other words, regression allows for the effect of one independent variable on the dependent variable to be statistically removed as it were from the effects of other independent variables on that dependent variable. In our example, the individual effect of just stricter gun control laws on gun deaths can be distinguished from the effect of just access to mental health services on gun deaths because both are included in the regression analysis. This last point is important: Only those things that are actually included in the analysis can be statistically distinguished from each other. Therefore, if you think something has an important effect on the dependent variable, it is best to include it as an independent variable.
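This idea of statistically separating effects can be sketched in a few lines of code. The example below is purely illustrative: the data are invented, the variable names are hypothetical stand-ins rather than anything from our study, and the fitting routine is a bare-bones ordinary least squares solver. It builds a toy outcome driven by two known influences, fits a regression that includes both, and recovers each influence's separate effect.

```python
# Toy illustration of how regression separates the effects of two
# independent variables. All numbers are invented for illustration;
# they are not from the gun violence study.

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved with Gaussian elimination. Each row of `rows` already
    includes a leading 1 for the intercept."""
    k = len(rows[0])
    # Build X'X and X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution.
    beta = [0.0] * k
    for r in reversed(range(k)):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Two hypothetical influences: x1 pushes the outcome down, x2 pushes it up.
x1 = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
x2 = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
y = [10 - 2 * a + 3 * b for a, b in zip(x1, x2)]  # true effects: -2 and +3

X = [[1, a, b] for a, b in zip(x1, x2)]
intercept, effect_x1, effect_x2 = fit_ols(X, y)
print(round(effect_x1, 6), round(effect_x2, 6))  # recovers -2.0 and 3.0
```

Because both influences are included in the fit, each coefficient reflects only its own variable's contribution, which is exactly the kind of statistical separation described above.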
In practice, regression analysis spits out two really important numbers for each independent variable. The first one, the p value, tells you roughly how likely it would be to observe an association as strong as the one found purely by chance if no real relationship existed between the independent variable and the dependent variable. If that probability is low, the independent variable is labeled as statistically significant. The second important number, the regression coefficient, tells the researcher in what direction that relationship runs (i.e., whether as the independent variable goes up, the dependent variable goes up or down), and how large that relationship is after accounting for the effects of the other independent variables on the dependent variable (e.g., how many deaths are averted on average as a result of stronger gun control laws after removing the effects of any change in access to mental health services on the number of gun deaths).
In this way, the researcher begins to address the problem of knowing what exactly affects what and how. An observant reader may, however, object by saying that this still does not overcome one of the problems we identified above, that of knowing whether something actually causes something else rather than simply creating the illusion of a causal relationship. Indeed, a regression coefficient can only tell the researcher about associations between variables; it can only suggest causality but never prove it definitively. After all, there is no such thing as a “causal coefficient.” The unfortunate truth is that there is no exact science to establishing causality, not even using regression. What regression analysis does, however, is allow the researcher to add in many different independent variables to try to statistically account for as many important effects on the dependent variable as possible (e.g., many different influences on the rate of gun deaths). By accounting for all of these variables, one can hopefully say that there is a good chance that a causal relationship in fact does exist between a particular independent variable and the dependent variable (if it makes sense theoretically and the numbers bear it out) rather than just an association that may be coincidental, even if one cannot say there is causality with absolute certainty.
For our research into gun violence, we included variables that we thought may have an effect on the number of gun deaths in addition to the gun control and mental health care variables. These included the gender balance in a particular state, the percentage of a state that was White, the percentage who owned a gun, and the median age of the state, among others, in an attempt to show that if our two main solutions are indeed effective according to the analysis, this effectiveness was not the result of something else important we had failed to consider. In this way, the problem of establishing causality rather than mere association was addressed, if not entirely overcome. Next, we examine two other important aspects of regression analysis beyond the basics discussed above: multiplicative interaction terms and the R-squared statistic.
A Closer Look: Interaction Terms and R-Squared
We have included two main variables of interest in our regression, one measuring the strictness of gun control and the other measuring the percentage of people in a state who lack access to mental health care. By including each of these variables in the regression, we can determine the independent effect of each on the dependent variable (i.e., number of gun deaths per state). But what if the effectiveness of increasing access to mental health care on the number of gun deaths in a state depends on whether or not there are strict gun control laws in that state already (or vice versa)? To address this potential circumstance, researchers can use what are known as multiplicative interaction terms in the regression analysis.
Interaction terms are constructed by multiplying two variables by one another. Researchers can use interaction terms to help determine whether the effect of one independent variable on the dependent variable depends on the level of some other independent variable. We chose to make use of an interaction term in our study of gun violence solutions for a very specific reason: In addition to exalting the individual virtues of gun control and mental health care, politicians often suggest that employing both at the same time would be especially effective. Specifically, the case is often made that stricter gun control laws and increased access to mental health care would result in even fewer gun deaths than if only one of these approaches is employed because these strategies would complement one another somehow. For example, stricter gun control laws might allow mental health treatment to have the time to positively affect a potentially homicidal or suicidal person by making it more difficult for that person to obtain a gun. In addition, increased access to mental health services might allow authorities to be notified by health care providers that a firearm should be taken away from a potentially dangerous person before that person becomes violent.
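A small sketch may help make this concrete. All of the numbers below are invented for illustration; they are not the coefficients from our study. The sketch shows how an interaction term is literally the product of two variables, and how, once such a term is in the model, the effect of one variable depends on the level of the other.

```python
# How an interaction term is built and read. All numbers are invented
# for illustration; they are not estimates from the study.

gun_control = [0, 1, 2, 3]       # hypothetical strictness score per state
lack_of_care = [10, 20, 30, 40]  # hypothetical % lacking mental health access

# The interaction term is simply the product of the two variables;
# it enters the regression as one more independent variable.
interaction = [g * m for g, m in zip(gun_control, lack_of_care)]
print(interaction)  # [0, 20, 60, 120]

# Once fitted, the effect of one variable depends on the other. With
# hypothetical coefficients b_gun and b_int, a one-unit increase in
# strictness changes the outcome by b_gun + b_int * lack_of_care:
b_gun, b_int = -2.0, 0.04
for care in [0, 25, 50]:
    print(f"lack_of_care={care}%: effect of one unit of strictness = "
          f"{b_gun + b_int * care:+.2f}")
```

The hypothetical signs here are chosen so that the gun control effect weakens as more people lack access to care; in real research, the coefficients would of course come from the fitted regression itself.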
Finally, we have told you about two important numbers that apply to each independent variable separately (the p value and the regression coefficient). But what if you want a sense for how well all of the independent variables taken together affect or explain changes in the dependent variable? You are in luck! There is a widely used number that applies to the regression model as a whole: the R-squared value. The R-squared value tells you how well the independent variables altogether explain variation in the dependent variable (e.g., why the number of gun deaths moves up or down in a particular state). R-squared values range from 0 to 1; when you have a low R-squared value, the independent variables in your model do not do a very good job of explaining why the dependent variable changes, and you might want to include some additional independent variables. How high is high enough? There is not a hard and fast rule (although there are usually rough conventions determined by one’s field of study), but if your value is especially close to 0 you will want to think about whether you are missing anything potentially important. Sometimes (as is the case for our research project) you will see something called an adjusted R-squared reported. The only difference between this and a regular R-squared value is that a regular R-squared increases (or at least never decreases) every time a new variable is included, while an adjusted R-squared administers a small penalty for each new variable (the idea here is that it incentivizes researchers to include only independent variables that really do have an important effect on the dependent variable, rather than just anything that increases the R-squared value by a little bit).
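The "small penalty" applied by the adjusted R-squared has a standard formula, sketched below. The R-squared values and sample sizes are invented for illustration only.

```python
# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n is the number of observations and k the number of
# independent variables. The inputs below are invented examples.

def adjusted_r_squared(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R-squared, but more variables means a bigger penalty:
print(round(adjusted_r_squared(0.70, 50, 4), 3))   # 0.673
print(round(adjusted_r_squared(0.70, 50, 10), 3))  # 0.623
```

Notice how the second model, with ten variables instead of four, is penalized more even though its raw R-squared is identical; this is the incentive against padding a model with marginal variables.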
And the Winner Is …
Now that you know why and how we used regression in our study, what did we find? In our first model (one that did not include the interaction term), we found that stricter gun control laws, but not increased access to mental health services for those in need of them, appeared to have a relationship with the number of gun deaths in a state. This is because the p value for the gun control variable suggested that it was statistically significant (recall our discussion of p values above). In contrast, the p value for the mental health care variable suggested that it was not statistically significant. As the mental health care variable was not statistically significant, its regression coefficient (the second important number discussed previously) provides no valuable information, because lack of statistical significance means we cannot conclude that any relationship with the dependent variable (i.e., number of gun deaths) exists. However, as the gun control variable is statistically significant, we can interpret the direction and size of its relationship with the dependent variable using the regression coefficient. The value of this coefficient in this model is –1.78; this means that a one-unit increase in gun control strictness (which is measured on a scale from 0 to 4) is associated with, on average, 1.78 fewer gun deaths per 100,000 people in a state. Finally, the adjusted R-squared for this model was .667, meaning that about 66.7% of the variation in state gun deaths was explained by the variables in our model (which in the fields of political science and public policy is considered very high!).
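As a back-of-the-envelope reading of that coefficient (assuming, as the linear model itself does, that the association holds evenly across the whole strictness scale):

```python
# Reading the reported coefficient of -1.78: each one-unit increase in
# gun control strictness is associated with 1.78 fewer gun deaths per
# 100,000 people, holding the other variables constant. This sketch
# assumes the linear association holds across the whole 0-to-4 scale.
coef = -1.78

for increase in [1, 2, 4]:
    change = coef * increase
    print(f"strictness +{increase}: {change:+.2f} gun deaths per 100,000")
```

Moving from the least strict score (0) to the most strict (4) would thus be associated with roughly 7 fewer gun deaths per 100,000 people, all else equal.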
In the second model that did include the interaction term, the gun control variable was still statistically significant, the mental health care access variable was still not significant, and the interaction term itself was statistically significant. However, it should be noted that interpreting the regression coefficient for an interaction term is much more complicated than for an ordinary variable. Without getting into the details of how to interpret one, our results provided support for the hypothesis that these two solutions to gun violence can in fact work in concert with one another to decrease the number of gun deaths more effectively than one solution alone. In states where the fewest people lacked access to mental health care, increased strictness in gun control laws translated to fewer gun deaths. This effect was less pronounced (or disappeared altogether) in states where relatively few people in need had access to mental health services.
In addition to examining the combined effects of gun control and access to mental health services on all gun deaths, we also included analyses that divided these deaths into gun suicides—which constitute the majority of gun deaths—and non-suicides (i.e., homicides and accidental gun deaths). For non-suicides, our findings were broadly the same as for all gun deaths. However, we had a surprising result for the suicide model: Our interaction term was no longer statistically significant. In other words, the effect of gun control strictness on the number of gun suicides was similar regardless of access to mental health services in a state. Our explanation for this perhaps surprising result centers on the fact that federal law requires one to be “adjudicated as a mental defective” or “committed to a mental institution” to have a firearm removed (National Conference of State Legislatures, 2013). As those who are homicidal are disproportionately more likely than those who are suicidal to fall under one of these categories that allow one’s firearm to be taken away by law enforcement, we would expect the combination approach to have more success in preventing non-suicides than suicides (which would explain why the interaction term was no longer statistically significant for just suicides).
Limitations of Regression
So, regression analysis has helped us answer some very important questions about possible solutions to gun violence. We can now go on to advocate for stricter gun control laws confident in the knowledge that the empirical evidence strongly backs us up. Or does it? Not to denigrate our own research and results, but it is necessary to point out that regression analysis inevitably suffers from several limitations that researchers ignore at their own risk. The first and perhaps most serious limitation is one that we have already discussed, the problem of determining what is truly a causal relationship, and what simply looks like a causal relationship but is not. We have spent a good part of this essay making the case for using regression analysis precisely to overcome this difficulty, and now we are saying actually regression analysis is limited on this front? Frustrating, we know. But this is so important an issue that we feel the need to circle back around and warn the reader again. While regression analysis helps to suggest that a particular independent variable might causally affect the dependent variable, it is by no means definitive. It is generally impossible to account for every independent variable that affects a particular dependent variable (the real world is simply too complex to be simplified without losing something), leaving even the best studies open to charges of incompleteness and therefore inaccuracy. As a result, every regression analysis should be taken with a grain of salt and a hefty dose of skepticism. That is not to say that researchers should not endeavor to make their regression analyses as reflective of the real world as possible. Ultimately, however, the final decision about whether an independent variable causally affects the dependent variable should be made by first considering the numbers and then comparing them with what theory and common sense tell us.
Other methods besides regression analysis may suggest causation with more precision and confidence (although never with certainty) and so should be considered in cases in which they are feasible. Experiments, for example, in which the researcher forms two similar groups, applies a treatment to one but not the other, and then compares the groups to determine the effect of that treatment (e.g., randomized drug trials), are often considered the “gold standard” of research. Unfortunately, while the use of experiments is growing in the social sciences, most questions of political importance cannot be answered ethically using an experiment (e.g., giving a group of people health care while denying it to another group to study its effects would not pass muster with a review board). As a result, observational studies that employ methods like regression analysis and do not directly apply treatments (i.e., they only observe what is happening in the world) will likely continue to be the method of choice for social scientists.
A second limitation can be described as “researcher malpractice.” You may have heard a common complaint about statistics along the lines of “they can be made to say whatever you want.” While the reality is perhaps not quite this bad, there is undoubtedly room for shenanigans in statistics generally and regression analysis in particular. One common issue concerns the p values we discussed earlier. Recall that the p value tells you whether or not an independent variable is statistically significant, that is, whether the association observed between the independent variable and the dependent variable is unlikely to be due to chance alone. The devil here though is very much in the details. Exactly how small does a p value need to be to reach statistical significance? Independent variables are either statistically significant or they are not, so a cutoff is simply specified by the researcher (a decision that is usually highly influenced by the conventions of her field of study). The problem is that this (essentially arbitrary) cutoff for statistical significance creates an enormous incentive on the part of the researcher to get over the hurdle any way possible, even if just slightly over it. Can’t reach statistical significance for your main independent variables? Maybe just leave out a less important variable that is “messing up” your results. Maybe collecting a bit more data will do it. Anything to get into the promised land of statistical significance where dreams of publishing your results can live to see another day. The problem, of course, is that these practices distort research. The researcher is using a desired result to dictate the statistical analysis, rather than allowing the analysis to dictate the results. We are not arguing that this funny business inevitably afflicts all or even most studies that make use of regression analysis. But it is prevalent enough that one should bear in mind that a particular regression analysis may be biased in this way or indeed in other ways as well.
Finally, there is the problem of data. The sad truth for researchers is that data are simply not available for everything. Some things are too difficult to measure with much reliability, and other things that could be measured simply have not been due to lack of money or interest. The precise number of guns in the United States, for example, is not currently known with certainty (although there are estimates available). This can create difficulties for researchers using regression. They might wish to account for something important, but if the data are not available, that potential influence on the dependent variable may simply not be included in the analysis. And beyond this obvious data problem, there is a subtler issue as well. How exactly do researchers decide what questions to study? Ideally, researchers address the most “important” problems because we expect research to be “relevant” in some way (these are of course somewhat subjective standards to use, but useful nonetheless). The problem is that researchers may be more tempted to address those questions for which data are easily available rather than ones that may be more important. Call this the “low-hanging fruit” bias. This is a broader problem than the other two and does not necessarily affect the quality or results of a particular regression analysis. We mention it, however, because it likely affects the types of questions that are being asked and answered in the social sciences. The increased use of regression is more likely to lead to questions that lend themselves to empirical analysis being taken up, while others that may be equally important (but less quantifiable) languish. Students of the social sciences should therefore bear in mind that the most important questions for their own research may not always be the ones that can be answered through regression analysis.
So, what are the broad takeaways from this discussion of regression analysis and gun violence? First, that regression analysis is a useful tool because it helps to address the problem of knowing what causes what in the real world, something that we have hopefully demonstrated through the example of testing the effectiveness of solutions to gun violence. Second, that while regression can provide some evidence to suggest there is a causal relationship between two things, it does not by itself ever prove that one thing causes another. Rather, a researcher must use her own judgment to decide if the evidence warrants such an assertion. And finally, that regression analysis might not always be the right method for a particular question. Sometimes an experiment would work better or a question simply requires a non-empirical method of analysis. It is our hope that readers of this case study keep these broad ideas and problems in mind as they continue to explore regression and perhaps begin to use it to answer their own questions.
Exercises and Discussion Questions
- Please describe the two main reasons discussed in this case study for why one might consider using regression analysis.
- What is the difference between a dependent and an independent variable?
- What information do p values and regression coefficients tell you?
- In what circumstance would it be appropriate to use an interaction term?
- Please describe two limitations of regression analysis.