Summary
For more than 40 years, SAGE has been one of the leading international publishers of works on quantitative research methods in the social sciences. This new collection provides readers with a representative sample of the best articles in quantitative methods that have appeared in SAGE journals as chosen by W. Paul Vogt, editor of other successful major reference collections such as Selecting Research Methods (2008) and Data Collection (2010).
The volumes and articles are organized by theme rather than by discipline. Although some methods are discipline-specific, most quantitative research methods cut across disciplinary boundaries.
Editor's Introduction
For more than 40 years, SAGE has been a leading international publisher of works on quantitative research methods in the social sciences. This four-volume set provides readers with a sample of the best articles in quantitative methods that have appeared in SAGE journals. The articles and volumes are organized by theme rather than by discipline. Although there are some methods that are used mainly in one or a few disciplines, most quantitative research methods cut across disciplinary boundaries. Any major method tends to be used in all fields of research, despite the fact that the methods’ names sometimes vary between, for instance, sociology, economics, and psychology. For example, the terms hierarchical linear models, random coefficient models, and multilevel models all refer to the same set of analytic techniques.
To select the articles to include as chapters in this work, I surveyed SAGE journals from the origins of SAGE's entry into the field in the 1960s through the middle of 2010. The criteria for selecting chapters were the quality and influence of the articles. Recent decades are naturally represented more fully because the number of SAGE journals has expanded steadily since the 1960s, and very rapidly more recently. I first searched in the obviously methodological journals on SAGE's list, such as: Sociological Methods and Research, Educational and Psychological Measurement, Organizational Research Methods, Statistical Methods for Medical Research, Journal of Educational and Behavioral Statistics, and Clinical Trials. Next I turned to disciplinary journals that often publish methodological work, such as: Sociology, Political Research Quarterly, Review of Educational Research, and Personality and Social Psychology Bulletin. When searching for good exemplars of SAGE quantitative methods, I prejudged journal content as little as possible and at least reviewed the tables of contents of nearly every SAGE journal.
The selected articles come from a wide range of journals, often specialist journals, but I have chosen articles that would have a general appeal – in terms of methodology, not necessarily in topic of application. Highly technical and specialized articles frequently have a shorter “shelf life;” they are more likely to become obsolete. In brief, the articles included are those most likely to have enduring value to research methodologists. Thus the sample is not necessarily representative in the way a random sample would be. Rather, it is representative more in the way a quota sample is representative: the choices were made using select categories in order to cover the major topics in the field and to draw from a range of journals representing the scope of quantitative methods disciplines in which SAGE publishes.
The following collection of 70 chapters is drawn from 23 different SAGE journals. The earliest chapter was originally published in 1968, the latest in 2010. The collection is a microcosm of developments in a rapidly expanding field. The four volumes are arranged in 17 categories.1 Here is a brief overview. Volume 1, “Fundamental Issues in Quantitative Research,” begins with general orientations to quantitative research. Then the focus moves to the two types of research in which quantitative research has been most fully developed: experiments and surveys. It concludes with a problem that all researchers face: missing data – with special attention to methods recently developed to correct for that universal problem.
Volume 2, “Measurement for Causal and Statistical Inference,” begins with chapters on data coding and measurement, which are necessary steps before the researcher can begin analyzing the data. Drawing causal inferences is one of the chief goals of such analyses. One kind of causal inference concerns whether a program or social intervention has had an effect, and one way to assess the reality of causal effects is to use hypothesis testing and statistical inference. Each of those topics is addressed in Volume 2. Null hypothesis testing is very effective for a particular type of research, the kind in which the key issue is whether a difference exists and when the researchers’ concern is to determine this. Many other kinds of investigation can benefit from different approaches. The first of these discussed in Volume 3, “Alternatives to Hypothesis Testing,” is confidence intervals and effect sizes. Meta-analysis also tends to focus more on the size of an effect than on whether it is significantly different from chance. One type of effect size is the correlation which, with its analytic cousin regression, provides the analyst with ways to measure the size of the effect of variables on one another. Varieties of regression (logit and probit), used when dependent variables are categorical, are of growing importance. Also of growing importance are techniques that may be employed when all variables are categorical; these techniques fall under the rubric of categorical data analysis.
Volume 4, called “Complex Designs for a Complex World,” examines approaches to complicated topics that require more advanced methods of analysis – all of them heavily computer intensive. First studied is Structural Equation Modeling (SEM), which is used to analyze causal models with multiple indicators of latent variables and structural relations among latent variables. Then Multi-Level Modeling (MLM) is discussed. This alternative to OLS regression is used when data are in nested categories or levels. Using MLM it is possible to explain the separate effects of different levels of analysis, such as the effects of a classroom, a school, and a neighborhood on an outcome such as student learning. Since many variables of interest in social research are both latent and nested, the integration of these two methods (SEM and MLM) holds much promise. Human activity also takes place in space and time, and our final two sections in volume 4 highlight ways of investigating that obvious but methodologically complicated fact.
Looking at the collection as a whole, two features are particularly noteworthy. First is the breadth, depth, and diversity of quantitative methodological knowledge available in SAGE journals. Second is the fact that a large majority of the chapters in this collection – and in publications on the topic of quantitative research methods more generally – attempt in various ways to make progress on one overriding issue, the problem of valid causal inference. How can researchers draw justifiable causal conclusions? What research designs, methods of measurement, and techniques of analysis further that goal? How can consumers of research judge competing claims of causality? These questions are highly salient in some chapters and latent but influential in most others. Of course, investigators using quantitative research methods also have goals other than causality; description and forecasting are important examples. But justifiable causal inference is the chief problem with which quantitative research methodologists struggle.
Volume 1: Fundamental Issues in Quantitative Research
The first volume begins with articles that aim to discuss quantitative methods generally and to see them as a more or less integrated whole rather than a series of discrete techniques (Section 1.1). Experimental methods also have a general import because they tend to set the standards for excellence in research, even in disciplines that rarely conduct experiments; they are discussed in Section 1.2. Survey research is also ubiquitous. It is rare to find a field of social research that does not make use of this method; its sampling and measurement techniques are discussed in Section 1.3. Volume 1 concludes with an issue that all researchers face: missing data. The problems arising from missing data and the methods recently developed to correct them are reviewed in Section 1.4.
1.1. General Orientations
Publications in scholarly journals about quantitative methods are usually focused on highly specific problems and ways to deal with them. But some of the most oft-cited and influential articles take a more general look at the field. Our first four chapters are essays. They provide provocative overviews of quantitative methods and issues related to them. The “whence” and “whither” of quantitative methods – thoughts about where we have come from and suggestions about where we are going – are discussed in Chapters 1 and 2. First come Wright's reflections on ten influential statisticians and their contributions to our discipline. This chapter was originally published in Perspectives on Psychological Science, one of SAGE's newer journals. In Chapter 2 Wainer predicts future areas of research that will be promising and suggests that a few others have been played out. This contribution was originally published in the Journal of Educational and Behavioral Statistics, one of SAGE's more long-standing quantitative journals. The two authors agree on many topics and themes and provide a good overview of the issues and problems discussed in the remainder of these four volumes – including causation, missing data problems, coding and measurement, experimental design, statistical inference, alternatives to hypothesis testing, meta-analysis, categorical data analysis, structural equation modeling, multi-level modeling, and other computer-intensive methods. After these insightful reviews of the breathtaking abundance of methods available to us, we are reminded by Peterson, in Chapter 3, that it is often wise to be minimalists in our selection of methods. Little is gained by using an elaborate method merely because we can, despite a certain pressure, which most of us have probably felt at one time or another, to demonstrate that we are conversant with cutting-edge techniques for analyzing quantitative data.
And what exactly are quantitative data? How do they differ from graphic and qualitative data? Sandelowski and colleagues in Chapter 4 scrutinize these questions in an article on “quantitizing.” This is the process of converting qualitative data into quantitative – usually words into numbers. This practice is particularly salient in mixed methods research, and this chapter was originally published in one of SAGE's newest journals, the Journal of Mixed Methods Research. In mixed methods, transformations of data of one type into another make possible the merger of data sets and their joint analysis. Quantitizing data or its opposite, qualitizing data, raises more complicated issues than might at first seem likely. First there is the question: what are data? They are not inherently quantitative or qualitative. We obtain data by converting systematic observation into symbols (numbers, words, or graphics) that we can analyze. If we think about it, our assumption that we know what the terms “quantitative” and “qualitative” mean may not always seem so well grounded. For example, merely counting the number of instances of something, the most elementary of quantitative acts, begins with qualitative categorizing of cases. If something is in the category, then we count it; if it is not, we don't. After we count instances of a phenomenon, and analyze the resulting data, we then often conclude by qualitizing, by judging for example that the data indicate a “significant” relationship.
In sum, this first section of SAGE QUANTITATIVE RESEARCH METHODS addresses general orientations toward methodological issues. That means that it often raises questions of when or whether to use particular methods and why one might do so. The emphasis in the remaining chapters shifts to issues that are often of greater concern to practicing researchers: given that one has decided to employ a particular method, how does one do so – and do so well? We begin with experiments, what many researchers consider the focal method in social and behavioral research as well as in the natural sciences.
1.2. Experimental Methods
Experiments are widely considered the premier quantitative method, the point of reference for all others. (Of course, experiments are not inherently quantitative, since they are regularly used to study qualitative variables.) Experiments, particularly randomized controlled trials (RCTs), are often referred to as the “gold standard.” This seems an odd way to denote excellence since the gold standard has long been defunct. Be that as it may, the reason for the honored place of experiments in the menu of research methods is their overwhelming superiority in drawing justifiable causal inferences. That superiority stems from the two key definitional features of the experimental method: researcher control over (1) how the independent variable is administered and (2) which cases receive treatments. These features allow techniques such as double-blind procedures and random assignment to control and experimental groups. Together, they reduce, further than any other class of method, the number and influence of confounding variables. They thereby minimize, more than any other class of method, competing explanations for the cause of an outcome.
When experimenters can manipulate independent variables and can randomly assign cases to treatment and control groups, this provides an unusually high level of certainty about “internal validity,” which refers to making accurate causal inferences about the relations among variables. When researchers cannot manipulate the variables of interest and, for practical or ethical reasons, cannot randomly assign cases, then, by definition, experimental methods cannot be used. Many of the other methods discussed in these volumes are attempts to find statistical alternatives to random assignment and control over independent variables. That has long been the orientation of quantitative methodological work. This does not mean, however, that the experimental methods have no weaknesses. Chief among them is “external validity,” or the ability to generalize from experimental samples and settings to broader populations and contexts.
External validity, or generalizability to a population, could be achieved were experimental cases first randomly sampled from the population and then randomly assigned to experimental conditions. This is rarely possible, which means that experimenters have long devoted considerable attention to the external validity problem. In a classic article originally published in 1968 and still widely cited, Bracht and Glass review (in Chapter 5) the main threats to the external validity of experimental conclusions. In Chapter 6, Baker and Kramer emphasize the need for experimenters to continue to be vigilant about threats to validity and about the application of experiments in the real world. Textbook ideals rarely pertain. In addition to problems of external validity, the authors also discuss, and provide statistical techniques to address, missing outcomes: when subjects drop out or when they do not comply with experimental protocols.
When the necessary conditions cannot be met – experimenter control of the independent variables and random assignment of cases – there are numerous alternatives. Among the most important of these are natural experiments and regression discontinuity designs (RDDs). They are discussed in our next two chapters. Dunning in Chapter 7 examines natural experiments in which the natural or social world provides conditions that approximate randomized control trials (RCTs). “Treatments” are administered, more or less at random, to some cases and not others. Examples could include natural events (hurricanes, earthquakes, viruses) that influence some communities more than others – or social and political events that do so. By comparing the two sets of cases, the researcher can draw causal inferences about the effect of the “treatments.” Dunning suggests that there are “probably more natural experiments waiting to be discovered” than most researchers would imagine. He reviews several compelling examples and concludes that the key criterion for evaluating natural experiments is the degree to which the treatment can plausibly be argued to have been at random.
The regression discontinuity design (RDD) is another option when cases cannot be assigned at random, but one that is much more under the control of the researcher than is the natural experiment. It is particularly appropriate when random assignment runs the risk of assigning people most in need of treatment to the no-treatment control group. Rather than use random assignment, the researcher uses a cutoff score on a pretest variable (measured continuously) to determine who is put into the treatment group. The neediest, as determined by being below the cutoff, are included in the treatment group; those above the cutoff go into the control group. The computations for implementing the RDD are somewhat complicated, but methodologists agree that it is comparable in internal validity to the RCT. Mandell's contribution in Chapter 8 is to propose and demonstrate a hybrid version of the two designs – experiments and RDDs – which are often considered mutually exclusive. The proposed hybrid improves the statistical precision of the RDD, and it can be used with assignment variables coded categorically.
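To make the RDD logic concrete, here is a minimal sketch (not drawn from Mandell's chapter) of the basic sharp-RDD estimate: regress the outcome on a treatment indicator plus the centered assignment variable. The data, cutoff, and variable names are simulated and hypothetical.

```python
# Minimal sharp-RDD sketch on simulated data: the coefficient on the treatment
# indicator estimates the treatment effect at the cutoff.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
pretest = rng.normal(50, 10, n)            # continuous assignment variable
cutoff = 45
treated = (pretest < cutoff).astype(int)   # the neediest (below the cutoff) get the treatment
outcome = 20 + 0.6 * pretest + 5.0 * treated + rng.normal(0, 5, n)  # true effect = 5

X = sm.add_constant(np.column_stack([treated, pretest - cutoff]))
fit = sm.OLS(outcome, X).fit()
print(fit.params)   # second coefficient is the estimated treatment effect at the cutoff
```

In practice analysts would also check for nonlinearity near the cutoff and often restrict the estimation to a bandwidth of cases close to it.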
1.3. Survey Research
In survey research the emphasis switches from internal to external validity – from random assignment to random selection of cases. The aim is often less to discover a causal link between variables (internal validity) and more to generalize a finding from a sample to a population (external validity). The best known example is in election surveying. The population of likely voters is identified, a random sample is taken using more or less complicated sampling techniques, and respondents are asked about their voting plans – whether they intend to vote and for whom. The responses are tallied and various statistical techniques are used to estimate what the responses would have been had the entire population been polled. The accuracy of the estimates and the margins of error around the estimates vary in predictable ways with the size of the sample. None of the steps in this process are easy. But the techniques are well known and well honed. The four chapters of this section on survey research examine several problems, the solutions to which are still under development: surveying hard to access populations; aggregating survey responses into more general indicators; the age-period-cohort conundrum; and making surveys that use the World Wide Web more representative of the general population.
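As a rough illustration of how precision depends on sample size in the election-polling example above, the following sketch computes the familiar 95% margin of error for a sample proportion under simple random sampling; the numbers are illustrative only.

```python
# 95% margin of error for a sample proportion under simple random sampling
# (normal approximation); illustrative values, worst case p = 0.5.
import math

def margin_of_error(p, n, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (400, 1000, 2500):
    print(n, round(margin_of_error(0.5, n), 3))
# n = 1000 gives roughly +/- 0.031, the familiar "3 points" of election polls
```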
Attempting to survey samples from populations whose members are rare or unknown raises complicated uncertainties. One cannot very easily sample from a population when it is difficult to find or when one does not know what it is. So the first step becomes defining or identifying the population rather than sampling from it. Among the kinds of unknown populations important to estimate and sample for reasons of public policy and public health are homeless people, those who use illegal drugs, and those who are infected with a difficult-to-control disease, such as HIV/AIDS. Both national and international public health agencies have sponsored numerous studies using new techniques to identify such populations. In Chapter 9 Hay and colleagues provide one of the most rigorous examples of the application of the so-called “capture-recapture” method to estimate the size of a population (injecting drug users in England) that would be virtually impossible to estimate with any accuracy using conventional survey methods.
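Hay and colleagues' application is far more elaborate, but the core capture-recapture idea can be conveyed with the simplest two-sample (Lincoln-Petersen) estimator, shown here with the Chapman correction; the counts below are hypothetical, not from their study.

```python
# Two-sample capture-recapture sketch: estimate an unknown population size
# from two overlapping data sources (Chapman-corrected Lincoln-Petersen).
def chapman_estimate(n1, n2, m):
    """n1: cases seen in source 1; n2: cases seen in source 2; m: seen in both."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# hypothetical counts from, say, a treatment register and an arrest register
print(chapman_estimate(n1=600, n2=450, m=90))   # estimated size of the hidden population
```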
Surveys usually ask respondents questions with forced-choice answers. Researchers often seek to combine discrete responses to seemingly unrelated questions into summary indices of a broader construct. One type of these summary indices, important for social policy, is the quality-of-life index (QOL), which aggregates numerous social indicators into a more general measure. The debates about whether such aggregating can be meaningfully done and, if so, how to do it are quite intense. Hagerty and Land in Chapter 10 situate the discussion in the context of one of the more promising approaches to this kind of work, one that takes into account the social heterogeneity and the diversity of respondents. When using cross-sectional data and even cohort data to study differences over time, it has been all but impossible to separate period effects, age effects, and generational or cohort effects. Age, period, and generation are confounded. Because of this confounding, it has not been possible to study the effects of all three on a fourth variable. In Chapter 11 Smith provides an overview of four new articles by top researchers in the field that provide increasingly promising means of dealing with this problem, one that has long plagued survey researchers. A newer problem has emerged with the availability of the Internet as a survey tool. No one thinks that people who can be reached for surveys on the Internet are a representative sample of the broader population. Thus, while the web-based survey has much to recommend it, using one always produces a biased sample of a general population. Schonlau and colleagues (in Chapter 12) use an exceptionally good database (a national probability sample containing an Internet sub-sample) for addressing the extent of that bias and proposing solutions (based on propensity scores) to correct it, at least partially.
1.4. Methods for Missing Data
All research projects have to deal with the problem of missing data. Consequently, many techniques have been developed to deal with the difficulties it causes. Unfortunately, the best known and most widely used techniques, including the typical default methods in most software packages, are not very good – or worse. The effective techniques are much more complicated than the poor ones, which are often models of simplicity. Missing data are a problem for many reasons. Most statistical analysis techniques assume complete data sets, and using those techniques with incomplete data runs the risk of serious misinterpretation. By reducing sample size, missing data reduce statistical power, and except in those rare cases when the data are missing at random, they introduce bias. The four chapters in this section address aspects of the missing data problem. They all discuss the promise and prospects of the multiple imputation (MI) approach pioneered by Donald Rubin and others. In simplest terms, MI uses computer randomization techniques (Markov Chain Monte Carlo or MCMC) to create several data sets (usually 5 to 10) in which the missing data have been imputed. The estimates obtained from each of these completed data sets are then combined, by taking their means, to make a final estimate. As the following chapters make clear, MI is not one technique, but a family of related techniques, and there is considerable discussion about which members of the family are most acceptable and whether related techniques should be admitted.
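As a concrete illustration of that final pooling step (often called Rubin's rules), the sketch below combines hypothetical estimates from five imputed data sets: the point estimate is the mean of the per-imputation estimates, and the total variance combines within- and between-imputation variance. The numbers are invented for illustration, and the variance formula is standard MI practice rather than a method from any one of the following chapters.

```python
# Pooling step of multiple imputation (Rubin's rules) with made-up inputs.
import numpy as np

estimates = np.array([2.10, 2.35, 1.95, 2.20, 2.05])   # one estimate per imputed data set
variances = np.array([0.16, 0.15, 0.17, 0.16, 0.18])   # squared standard errors, one per data set

m = len(estimates)
pooled = estimates.mean()                     # combined point estimate (mean of the estimates)
within = variances.mean()                     # average within-imputation variance
between = estimates.var(ddof=1)               # variance of the estimates across imputations
total_var = within + (1 + 1 / m) * between    # Rubin's total variance
print(pooled, total_var ** 0.5)               # pooled estimate and its standard error
```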
In Chapter 13 Zhang & Rubin talk about the type of data gaps caused when outcome data are truncated by “death.” The concept originated in medical research in studies of the effect of treatments on quality of life when some patients die before the study is completed. The problem is actually quite widespread in many types of research. For example, in experimental studies of high school education programs, students who drop out before the final outcome measure is taken are also said to produce outcomes with truncation by “death.” Whatever the field, such truncation raises intricate analysis problems, to which Zhang and Rubin propose a range of solutions adapted to varying situations and statistical assumptions.
In Chapter 14 Allison reviews multiple imputation (MI) techniques, which have been highly attractive to researchers for many reasons. The main limitation is that for MI to work most effectively, missing data must be assumed to be missing at random. Missing data are rarely missing at random, and it is hard to be sure whether or when they are. The complications in computations of more elaborate methods of MI are efforts to work around this unmet assumption. Choosing among the work-arounds also involves choosing among types of software to implement them. And, as Allison shows, a poor choice can have nasty consequences. Kenward and Carpenter in Chapter 15 address many of the same issues in this rapidly developing field, with specific emphasis on medical research, but with conclusions that are of much broader interest. While no method of missing data analysis is as good as designing research protocols to reduce missing data in the first place, MI has distinct advantages over other techniques and certainly over doing nothing. Still, the field is open and rapidly developing and, as is usual in such cases, increasing in technical complexity. Beunckens and colleagues in Chapter 16 discuss parallel questions and problems with MI, particularly when using it to analyze hierarchical data from a large, multi-site experimental trial. Despite important recent improvements in the methods for handling them, missing data remain a persistent and fundamental problem in research methodology.
Volume 2: Measurement for Causal and Statistical Inference
After selecting designs, collecting data with them, and handling missing data problems, the researcher's attention typically turns to topics discussed in Volume 2, which begins with issues in data coding and measurement (in Section 2.1). After coding and measurement, analysis comes to the fore. Perhaps the chief analytic problem of interest to researchers is justifiable causal inference; causality is addressed in Section 2.2. One type of causal inference concerns the evaluation of programs and interventions, specifically whether they have had a discernable effect (Section 2.3). And a favored way to determine causal effects is hypothesis testing and statistical inference (discussed in Section 2.4).
2.1. Measurement and Coding
The purpose of coding and measurement is to prepare data so that they reliably and validly express key concepts. When data are coded and measured well, researchers are able to analyze the data in reliable and valid ways. Decisions about how to code data emerge from the character of the concepts being investigated. Coding decisions, in turn, then importantly shape the analysis options open to the investigator. Quantitative methods in coding and measurement aim to provide researchers with means of consistently and accurately handling the concepts under investigation. The four chapters in this section deal with: dichotomizing continuous data, developing indices for measuring fidelity of implementation, reducing error in multiple comparisons, and validating surrogate outcome measures. All four chapters deal with ways that measurement techniques can improve the conclusions drawn from data.
Dichotomizing or otherwise categorizing continuous data is remarkably widespread in the social and health sciences despite the fact that there has been no good reason for it since the invention of the computer. In a classic explanation of why “this practice of our grandparents” (they have become our great grandparents since those words were written in 1983) should immediately be abandoned, Cohen (in Chapter 17) persuasively demonstrates that dichotomizing routinely results in huge losses in statistical power, losses that are equivalent to discarding the majority of one's data. False null or negative findings are often due to the fact that researchers have thus mutilated their data. Erroneous conclusions are also often due to a failure to properly implement a research protocol or a program intervention. Mowbray and colleagues (in Chapter 18) discuss the importance of fidelity of implementation (or the degree to which the independent variable was executed as planned), including the need to develop indices for measuring implementation fidelity, as a key step in developing evidence-based practice.
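Cohen's point about dichotomization in Chapter 17 is easy to verify with a small simulation: median-splitting a continuous predictor visibly attenuates its observed correlation with the outcome, and with it statistical power. The sketch below uses simulated data and illustrative numbers only.

```python
# Quick simulation of the cost of dichotomizing: median-splitting a continuous
# predictor shrinks its correlation with the outcome.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)              # true continuous relationship
x_split = (x > np.median(x)).astype(float)    # the median split ("practice of our grandparents")

print(np.corrcoef(x, y)[0, 1])        # roughly .45 with the continuous predictor
print(np.corrcoef(x_split, y)[0, 1])  # noticeably smaller after the median split
```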
In Chapter 19, Williams and colleagues discuss how to reduce error in inferences made using multiple comparisons. They use a huge database and apply techniques developed by Tukey and others to show the clear need for adjusting estimates of statistical significance and confidence intervals to account for the bias introduced by making multiple comparisons of the effects of independent variables on outcome variables. Sometimes an outcome is undesirable or distant or rare (such as death in medical research) and thus hard to study directly. For example, rather than waiting to compute death rates – in the more-or-less distant future – of patients receiving a new treatment versus those receiving a placebo, researchers can sometimes find a “surrogate endpoint.” This is an outcome measure that correlates with the actual outcome or endpoint of interest. But mere correlation is not sufficient. Ideally, the surrogate would be a mediating variable necessary to attain the actual endpoint. This approach to measuring outcomes has been developed primarily in medical research, but has considerable applicability in other realms. Green and colleagues in Chapter 20 discuss how to validate surrogate endpoint measures and to find consistency among several such measures.
2.2. Causation
When attempting to make causal inferences, researchers aim to control for covariates and thereby to eliminate other possible explanations for a causal connection that is suggested by an association between an independent and a dependent variable. In experiments this control of covariates is accomplished very effectively by random assignment of cases to conditions. Random assignment tends to take care of covariates “automatically,” by equalizing, at least within the limits of probability, control and experimental groups. Various supplements to random assignment – such as matched pairs and Latin squares – can be used to improve upon chance for important covariates.
When researchers do not have the experimentalists’ control over administration of the independent variable and assignment of cases, the need for statistical control of covariates is greatly increased. The further one moves from random assignment, the greater the need becomes. Controlling covariates is needed more in quasi-experiments than in clinical trials. And, in archival designs, in which researchers investigate variables they do not generate, and which are in no sense randomly assigned to treatment conditions, controlling for covariates becomes the entire analysis strategy. One then moves from random assignment to causal modeling.
When discussing these issues, we still almost always begin with John Stuart Mill's three conditions for causation. To conclude that two variables (call them A & B) are causally linked the researcher must demonstrate three things. First, the presumed cause (A) must precede the effect (B), since “after” cannot cause “before.” Second, the hypothesized cause and effect (A & B) must covary, because if they do not occur together, they cannot be causally linked. And, third, no other explanation accounts as well for the covariance of A & B, the postulated cause and effect. The last condition is the truly difficult one. It is not too much of an exaggeration to say that all research methods concerned with causality are attempts to deal with it, that is, to improve the degree to which we can claim that no rival explanation better accounts for the one posited in the researcher's theory. In the five chapters of this section, the authors address various aspects of what is arguably the most central of all questions in research methodology.
We begin with philosophical roots in Chapter 21 by Reiss. He first reviews the main accounts of causation adduced by social scientists, finds exception to each, and then moves to a pluralist conception. Each of the main accounts (counterfactual, regularity, mechanistic, and interventionist) is correct in what it asserts, but wrong in what it denies – specifically when it denies that other accounts can be coherent. Reiss qualifies his pluralistic position by explaining that it can be less help than one might think. That is because pluralism itself is pluralistic – there is a plurality of different types of pluralism. The rigorous conceptual hygiene provided in this chapter helps set the stage for, and helps keep our thinking clear during, the somewhat more technical discussions in the subsequent chapters of this section.
The comparative effectiveness of different statistical models – specifically those associated with experimental and with observational research – is the subject of Freedman's Chapter 22. Most of what we think we know in the medical sciences, and even more of what we believe to be true in the social sciences, has been learned through observational research. But, Freedman says, if causal claims based on experimental research should clash with those derived from observational research, one should believe the experimental claims. And such clashes occur. One of the best known recent examples is the reversal of advice about hormone replacement therapy (HRT) for postmenopausal women. Based on extensive observational studies with thousands of cases, it was concluded that HRT protects against heart disease. More recent experimental studies indicated that HRT had adverse effects. Because of that experimental research, physicians changed their advice. And millions of women stopped HRT after 2001. Excellent observational studies were overturned by excellent experimental studies. Experiments should trump observational studies – so say virtually all methodologists – and they are probably correct in most cases. But why should we believe it? Stopping HRT was probably not the correct decision for all women in 2001, specifically younger women. The mean age of the sample in the experimental study was 63, but for women younger than 60, HRT was actually effective in reducing heart disease. The observational study, it turns out, was based on a more representative sample. The HRT story is even more complex than this brief account suggests, and new evidence continues to appear at the time these words are being written (late 2010). But the complexities help clarify what is involved in some of our beliefs about the relative effectiveness of different designs.
The reason for the widespread belief in the superiority of experiments for causal inference probably goes back to Mill's third condition for causation. In experiments it is harder than in observational research to think of other plausible explanations for an apparent causal effect. For example, if a mean difference between the control and experimental group were significant with an exact p-value of .02, this would mean that an effect that large or larger in a sample that size would occur only 2% of the time – if the null hypothesis were true. It does not mean it is impossible, just that it is unlikely – a probability of 2% that the randomization process produced non-equivalent control and experimental groups, groups that were different enough to account for the mean difference between them on the outcome variable. There are also other possible explanations, of course: researchers may have followed the protocol imperfectly. Or subjects who dropped out may have been different in some unknown ways that biased the result (if the ways were known, statistical adjustments would have been made). These explanations are not farfetched, but they do seem to most researchers less likely than the kinds of things that could go wrong in an observational study. For example, the women who chose HRT in the real-world observational study could have been different in some unknown ways from women who opted not to follow HRT (again, known differences would be adjusted for). Our preference for the experimental result over the multivariate observational result is not statistical. In a clash between an experiment and an observational study – even if the exact p-value for the mean difference were identical to the exact p-value for the multiple regression coefficient – most researchers would choose the experiment and its mean difference. That choice is based on beliefs about probable behavior of cases (usually people) under different design conditions (usually laboratory versus real-world), and has little to do with quantitative reasoning or methods.
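The logic behind that 2% figure can be made concrete with a small randomization test on simulated data (unrelated to the HRT example): re-shuffle the group labels many times and count how often a shuffled assignment produces a mean difference at least as large as the one observed. That proportion is the randomization p-value, the probability of a difference that size under the hypothesis of no treatment effect.

```python
# Randomization-test sketch of what an experimental p-value means; data simulated.
import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 50)
treated = rng.normal(0.5, 1.0, 50)            # a genuine treatment effect of 0.5 SD
observed = treated.mean() - control.mean()

pooled = np.concatenate([control, treated])
reps, count = 10_000, 0
for _ in range(reps):
    rng.shuffle(pooled)                                   # re-randomize the group labels
    diff = pooled[50:].mean() - pooled[:50].mean()
    if abs(diff) >= abs(observed):
        count += 1
print(observed, count / reps)                             # two-sided randomization p-value
```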
In our next three chapters, the discussion of quantitative techniques for causal inference is focused on three problems of causal inference rather than on the general grounding of causal beliefs: mediating, matching, and suppressor variables. Sobel in Chapter 23 discusses the key role of mediating variables, the links between the predictor and the outcome variables, which are frequently studied using structural equation models (SEMs). Researchers often mistakenly assume that randomization means that mediating variables can be causally interpreted, but randomization is directly relevant only to treatment (not mediating) variables. Sobel describes some techniques for partially overcoming this limitation. In Chapter 24 Morgan & Harding discuss promising uses of matching to improve estimates of causal effects, particularly in sociological research, most of which is non-experimental. Matching has mostly been developed for use in experiments, but recent innovations have made it increasingly useful for observational research. Suppressor variables can bedevil attempts to draw causal conclusions, and they are relatively common in SEMs. Maassen & Bakker (in Chapter 25) discuss some of the controversies surrounding them, and provide advice about how to interpret path models (SEMs) when an unforeseen suppressor makes its appearance.
2.3. Program Evaluation and Individual Assessment
Program evaluation investigates the impact of interventions in social programs. As such, it is a form of causal analysis – what are the effects of the program? Since program evaluation takes place in applied field settings, the number of confounding variables, and the threats to validity they pose, can be formidable; not least, the effect is generally measured as change over lengthy periods of time. While the goal in program evaluation is to ascertain program outcomes, they are often measured by assessing the impact of the program on individuals; participants are assessed as individuals, but with the aim of evaluating the overall effect of the program.
The first question for program evaluators is logically: What is the outcome of interest? This will vary depending on the audience for the evaluation – either potential individual clients or decision makers rating program performance. Once the outcome and audience are established, the next analytic questions become: how to measure the outcome and how to establish the causal link between the outcome and the program elements. These questions can raise considerable technical difficulties for quantitative researchers. As in many fields of data analysis, there is a debate between those who think that older simpler methods still have much to offer, versus those who believe that, in order to draw valid causal conclusions, older techniques must be replaced by more advanced methods. This issue is addressed in Chapters 26 and 27, on gain scores – also called difference scores – in educational and psychological research (by Williams and Zimmerman) and in the study of organizational behavior (by Edwards). Williams and Zimmerman argue that these simple scores can be more useful than researchers often believe, while Edwards claims that they are widely used but have so many problems that they should rarely be employed by serious researchers. These two chapters provide key arguments readers can use to form their judgments.
The most important set of more advanced methods to measure change goes under the name of value-added models (VAMs). Originating in economics, the approach is increasingly used in program evaluation. The basic idea is to gauge the effect of additional resources (inputs) on outputs such as products. In educational evaluation the outputs are usually measures of student achievement. In health fields, a typical outcome measure is patient survival years. VAMs have improved the typical educational evaluation by shifting the analytic focus away from mean differences across groups such as classrooms, and toward the growth in achievement of individual students over time. In Chapter 28 Raudenbush provides an overview of the potential and the as yet unresolved difficulties of the VAM approach. One key difficulty when taking VAM methods from manufacturing to education is that the key inputs are hard to define and observe directly. As they are currently conducted in education, VAMs are much more effective for providing information that consumers can use to decide among programs, teachers, or schools. They are less helpful for producing information that officials can use to hold institutions accountable for outcomes. Holding administrators accountable for outcomes is also the main goal of a system of performance ratings inaugurated in 2001 in England. As described by Bevan in Chapter 29, the “star rating system,” in which health care providers are ranked from zero to three stars, brings accountability to a system that, because it is public, is not regulated by market forces. While the star system allows the “naming and shaming” of the incompetent, and the rating system is easy to understand, neither the inputs nor the outputs are measured very well. Evaluation researchers often face a tradeoff between measures of inputs and outputs that are valid but complicated versus ratings that are of dubious accuracy but comprehensible by the lay public and politicians.
2.4. Statistical Inference
Statistical inference is also a branch of causal analysis. The causal inferences run from sample to population. Or inferences are made about differences between experimental and control groups, which represent populations of those receiving versus not receiving the treatment of interest. As is true of survey sampling, the basic outlines of statistical inference have long been well established. Current research focuses mainly on what to do in specialized situations.
We begin with a critique of null hypothesis significance testing (NHST), which has long been the predominant method of making statistical inferences in social research. Almost any general discussion of NHST written since the 1990s is likely to be a critique or a response to critiques. The set of methods that once comprised the standard paradigm has long been under attack, but as Gill (in Chapter 30) and others have pointed out, it retains considerable prominence among publishing researchers. Gill's chapter provides a very informative overview, which sets the stage for the more technical discussions in the remaining chapters of this section and introduces some of the alternatives to be discussed in Sections 3.1 and 3.2.
Experimenters working with real-world rather than laboratory subjects often assign intact groups (clinics, schools, agencies) to treatment and control conditions. These cluster-randomized designs are necessary when individuals cannot be randomly assigned, as is often the case in educational and social policy research. However, standard tests of statistical significance, such as the t test, assume individual random assignment. Using these tests for clusters seriously overstates statistical significance; inflated effect sizes and confidence intervals that are too narrow are other misleading results. In Chapter 31 Hedges proposes solutions, importantly based on the intraclass correlation, that will be adopted by researchers seeking greater precision in their estimates.
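The heart of the correction Hedges discusses can be conveyed by the familiar design effect: with m members per cluster and intraclass correlation rho, individual-level variances must be inflated by 1 + (m - 1) * rho, which shrinks the effective sample size. The sketch below uses illustrative values and is not Hedges's own derivation.

```python
# Design-effect sketch for cluster-randomized data: naive individual-level
# variances must be inflated, reducing the effective number of observations.
def design_effect(m, rho):
    return 1 + (m - 1) * rho

n_individuals = 1000          # e.g. 40 schools of 25 students each
m, rho = 25, 0.10             # illustrative cluster size and intraclass correlation
deff = design_effect(m, rho)
print(deff, n_individuals / deff)   # 3.4, so roughly 294 "effective" observations
```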
Tests of statistical significance are also used as criteria for equating score data from two alternative versions of a measure, such as two forms of a test of knowledge. The goal is to detect and eliminate any differences in the difficulty of the two versions, which were designed to be equivalent. In Chapter 32, Moses reviews several significance tests that have been proposed for choosing equating functions for different versions of a measure. He shows that the most accurate significance tests for selecting equating functions were, under several conditions examined in his study, likelihood ratio tests for comparing loglinear models.
The strength of a statistical inference is directly related to the size of the sample from which the inference is made. Hence determining sample size is one of the most consequential decisions that quantitative researchers make. As in many statistical choices, there are two broad categories of approach: classical frequentist and Bayesian. Bayesian methods have been a theoretical possibility since the 18th century, but have become practical for broad ranges of applications only much more recently. Some of the calculation problems associated with doing Bayesian analyses were intractable until they were surmounted by the availability of methods of computer simulation, specifically MCMC. Most researchers find Bayesian methods more difficult than the frequentist alternatives. But the advantages of Bayesian methods – chiefly that they formally combine prior knowledge with that gained from research to make inferences – mean that they are used with increasing frequency in statistical decision making. Chapter 33, by Pezeshk and colleagues, provides an excellent example of comparing Bayesian methods with classical frequentist approaches to calculating sample size and to using an optimal sample size to determine the benefits of a treatment.
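For the classical frequentist side of that comparison, the textbook calculation for comparing two means is compact: the per-group n is roughly 2 * (z_alpha/2 + z_beta)^2 / d^2 for a standardized effect size d. The sketch below is a generic illustration with arbitrary values, not the approach taken by Pezeshk and colleagues.

```python
# Classical power-based sample-size sketch for a two-group comparison of means.
from scipy.stats import norm
import math

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(d=0.5))   # about 63 per group for a "medium" standardized effect
```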
Volume 3: Alternatives to Hypothesis Testing
Null hypothesis testing is very effective for a particular type of research, the kind in which the key issue is whether a difference exists and in which the researchers’ chief concern is to establish this. Many other kinds of investigation can benefit from alternative approaches. The first of these discussed in Volume 3 is confidence intervals and effect sizes (Section 3.1). Meta-analysis (Section 3.2) also tends to focus more on the size of an effect than on whether it is significantly different from chance. One type of effect size is the correlation and regression family of techniques, which provides the analyst with ways to measure the size of the effect of variables on one another; these are discussed in Section 3.3. Varieties of regression – logit and probit – used when dependent variables are categorical are of growing importance (Section 3.4), as are techniques to be used when all variables are categorical, techniques that fall under the rubric of categorical data analysis (Section 3.5).
Together the quantitative methods discussed in Volume 3 form a suite of statistical techniques that focus on the magnitude of effects, rather than emphasizing, as does statistical significance, the probability that estimated effects could be due to sampling error. This approach is not opposed to measures of statistical significance, since estimates of effect size are routinely tested for statistical significance. Indeed, measures of statistical significance entail measures of effect size. The basic question answered by a p-value is: what is the probability of a result this size or larger in a sample of this size, if the null is true? In the examples that follow, “effect size” means standardized measures of effect size that enable meaningful comparison across studies. The three main categories of these are: standardized mean differences, correlations (which are standardized covariances), and, for dichotomous variables, odds ratios. These effect size measures, in conjunction with confidence intervals, enable one not only to consider the probability that an effect exists, but also to draw conclusions about any effect's comparative magnitude and the margins of error with which the effect has been estimated.
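For concreteness, the sketch below computes one member of each of the three families named above (a pooled-SD standardized mean difference, a correlation, and an odds ratio) from invented summary data; none of the numbers come from the chapters themselves.

```python
# The three main standardized effect sizes, computed from made-up data.
import numpy as np

# standardized mean difference (pooled-SD version of Cohen's d)
m1, m2, sd1, sd2, n1, n2 = 105.0, 100.0, 15.0, 14.0, 60, 60
sd_pooled = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sd_pooled

# correlation (a standardized covariance), from simulated bivariate data
x, y = np.random.default_rng(3).multivariate_normal([0, 0], [[1, .4], [.4, 1]], 500).T
r = np.corrcoef(x, y)[0, 1]

# odds ratio for a dichotomous outcome, from a hypothetical 2x2 table
a, b, c, d2 = 40, 60, 25, 75            # successes/failures in two groups
odds_ratio = (a / b) / (c / d2)

print(round(d, 2), round(r, 2), round(odds_ratio, 2))
```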
3.1. Confidence Intervals and Effect Sizes
Quantitative researchers over the past two decades have put decreasing emphasis on hypothesis testing and have instead stressed the importance of the size of an effect rather than its statistical significance. Concomitantly, confidence intervals around effect sizes (and other point estimates) have been increasingly recommended and practiced.
This is, in general, good statistical practice, but it seems especially important in policy research, where the size of an effect and the confidence with which we can estimate that size are naturally of great interest, particularly for comparing alternative policies in social programs, medicine, education, and business applications. In Chapter 34, Harris adds one further recommendation to supplement effect size research when comparing policies. It is important not only to standardize measures of the size of policy outcomes but also to discuss effect sizes relative to their comparative cost. The basic idea is simple: for each policy outcome, divide the effect size by the cost; then make comparative judgments about the effectiveness of policies or programs. But the implementation of this simple idea is far from easy, as the study of the cost-effectiveness of medical treatments has shown (see also Chapter 33). Cost-effectiveness can also be controversial. For example, patients might understandably prefer the best treatment as opposed to the treatment that is the best value for the money.
A rather systematic assault on the ubiquitous p-value has occurred since the 1990s. The historical origins are much more distant, but a widespread recognition of the limits of the p-value and its overuse would be hard to miss today. Quite a bit of the discussion focuses on how researchers have misused and misinterpreted the p-value. Of course, the statistic is hardly to blame if researchers do not know how to use it properly. It is more important to focus on the appropriate uses and correct interpretations of the p-value and how these compare to alternatives – the main ones being the range of effect size measures and confidence intervals. Cumming in Chapter 35 makes head-to-head comparisons of confidence intervals (CIs) and p-values specifically for replication of findings. He concludes that “if you repeat an experiment, you are likely to obtain a p value quite different from the p in your original experiment.” That is because p is an unstable, or unreliable, measure even in studies with large sample sizes. By contrast, confidence intervals are much more stable and are therefore much more useful as a prediction of what will happen in the next experiment. In a study of mean differences between a control and an experimental group, a 95% CI “is also an 83% prediction interval for where the next … [mean difference] will fall.” Head-to-head comparisons such as this one are essential to guide researchers in making effective methodological choices.
Editors of SAGE journals have increasingly recommended, and even required, that authors include effect sizes, confidence intervals, and confidence intervals for effect sizes in their reports of quantitative research. This editorial trend is fostered by the realization that all statistics are affected by sampling error, and CIs display this, but p-values do not. Chapter 36 is an instructive example of the steps editors have been taking. In that chapter Fan and Thompson strongly recommend that authors submitting articles to Educational and Psychological Measurement report CIs for score reliabilities. The general point is that if you can compute a statistic you can, at least in theory, compute a CI, which indicates a range of plausible values for the statistic. I say that in theory you can compute a CI because the techniques for doing so with particular statistics can be complicated and some have not been fully developed. Fan and Thompson conclude that, because there are several ways reliability coefficients can be computed, “intervals for reliability coefficients can [also] be estimated in various ways.” And they illustrate how authors may do so. While most statistical packages contain options for computing CIs, the default options are not always the best choice. This is a point made also by Curran and colleagues in Chapter 37, specifically in regard to the root mean squared error of approximation (RMSEA), which is a measure of goodness-of-fit in structural equation models (SEMs). One of the advantages of RMSEA as compared with many other measures of fit is that one can compute confidence intervals around RMSEA estimates. This is an important advantage, although the authors note that in small samples (less than 200) the CIs are notably less accurate.
3.2. Meta-analysis
Meta-analysis uses standardized effect sizes to facilitate the integration of research findings. While methodological authors frequently recommend using effect size measures (see Section 3.1), in meta-analysis the research would be impossible without them. Of course recommendations to review the literature before undertaking one's own research have long been routine. What meta-analysis adds to that recommendation is ways to integrate or synthesize findings quantitatively, mostly using effect sizes. A research synthesis may be especially appropriate when the research literature is extensive, but there is confusion about what it says or perhaps the research reports actually contradict one another. Conflicting reports can often be due to differences in the studies, differences that the synthesizer is uniquely positioned to discover. And a synthesis can have more external validity than the studies it summarizes, because it synthesizes data pertaining to different methods, groups, and contexts. Examples from medical research have frequently shown that syntheses can help reduce previous uncertainty about a treatment or intervention. This advantage is perhaps most relevant when the studies being synthesized have mainly been conducted using small groups (as is common in experiments). With small studies it can be the case that only when the results are pooled does the sample become large enough to have sufficient statistical power to detect an effect or to clarify apparently contradictory results.
One limitation to meta-analysis is that some data are not amenable to its main techniques. Despite recent advances in the field, meta-analysis works best when the outcome data from the studies are fairly simple. Meta-analysis has been most successful summarizing the results of experiments with one dependent variable, one independent variable, and few if any covariates or control variables. That is because the common tools for meta-analysis (standardized mean differences, correlations, and odds ratios) work best with relatively simple data. With more complex models it is harder to find appropriate tools for synthesis. For example, standardized multiple regression coefficients cannot be properly compared across studies that include different variables in the models – as they routinely do.
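The basic synthesis step, in its simplest fixed-effect form, weights each study's standardized effect size by the inverse of its sampling variance. The sketch below uses hypothetical study values and is a generic illustration of that inverse-variance pooling, not the method of any particular chapter.

```python
# Fixed-effect meta-analysis sketch: inverse-variance weighted average of
# standardized mean differences from a handful of hypothetical studies.
import numpy as np

effects = np.array([0.30, 0.45, 0.10, 0.55, 0.25])     # per-study effect sizes
variances = np.array([0.04, 0.09, 0.02, 0.12, 0.05])   # per-study sampling variances

weights = 1 / variances
pooled = np.sum(weights * effects) / np.sum(weights)
se_pooled = np.sqrt(1 / np.sum(weights))
print(pooled, pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)  # estimate and 95% CI
```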
Despite such limitations, meta-analysis has become hugely important since it was pioneered in the 1970s. One of the pioneers was Gene Glass, whose contribution is reproduced as Chapter 38. Originally published in 1977, this is the most cited article in the Review of Research in Education. It is easy to see why. Although there have been many technical developments since Glass published this piece, it clearly and persuasively lays out the basic case for meta-analysis, and his arguments have not been superseded. As Glass insisted, he did not devise new quantitative techniques; indeed, the methods of meta-analysis “are part of the stock-in-trade of any methodologist and most empirical researchers.” Rather he made the case for applying these methods to the integration or synthesis of research findings. The rest is history – or footnotes. For example, Vacha-Haase in another widely cited article (appearing here as Chapter 39), applies meta-analytic techniques to comparing not only effect sizes but the reliability of scores across studies. In fields where measurement is the main concern, such as psychological and educational testing, the approaches of “reliability generalization” outlined by Vacha-Haase have much to offer.
A superb example of using meta-analytic techniques to investigate findings across studies that have used different methods is provided by Slavin and Smith's Chapter 40. They note that an inverse relation between effect size (ES) and sample size (N) has been observed in more than one field, most often in medical and educational research. Studies with small samples tend to have large effect sizes. The fact is clear, but it is curious, and the explanation is somewhat elusive. Two possibilities come to mind. First, a form of publication bias may arise in which small experimental studies, which are often preliminary pilot studies, get published only if they have large effect sizes. Small preliminary studies, which are often of lower quality, tend to have a great deal of variability. As a group they yield a disproportionate number of very large and very small ESs. Publication bias means that only the large ones get published and are thus easily available for meta-analysis. A second possible explanation is that with a small N it is possible for researchers to enact "super-realization," which is the implementation of the treatment at a level of intensity and effectiveness "that could never be replicated on a large scale." This leads to a threat to external validity (i.e., limited generalizability). This kind of threat is particularly consequential in program evaluation, where the goal is less to test a hypothesis about the relation between variables and more to find successful programs with large effect sizes that can be widely implemented. Effect sizes are much more reliable in studies with large Ns, even when they are not RCTs, and that reliability is a key to making intelligent policy and selecting effective programs.
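The publication-bias mechanism described above is easy to see in a small simulation. The sketch below (hypothetical, with a true standardized effect fixed at 0.20) keeps only the simulated studies that reach p < .05 and shows that the average "published" effect size is inflated most when per-group samples are small.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    TRUE_D = 0.20

    def mean_published_es(n_per_group, n_studies=2000):
        """Mean effect size among simulated studies that reach p < .05."""
        kept = []
        for _ in range(n_studies):
            treated = rng.normal(TRUE_D, 1.0, n_per_group)
            control = rng.normal(0.0, 1.0, n_per_group)
            _, p = stats.ttest_ind(treated, control)
            d = (treated.mean() - control.mean()) / np.sqrt(
                (treated.var(ddof=1) + control.var(ddof=1)) / 2)
            if p < 0.05:
                kept.append(d)
        return np.mean(kept)

    for n in (20, 100, 500):
        print(f"n per group = {n:3d}: mean significant ES = {mean_published_es(n):.2f}")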
In a review of 185 studies of school mathematics programs, Slavin and Smith further found sample size to be closely related to research design. True randomized experiments or randomized controlled trials (RCTs) had a mean sample size of 212; for randomized quasi-experiments the figure was 558; and in studies in which matching was used rather than randomization, the average sample was 2,570. The negative association between sample size and effect size occurred most strongly in the RCTs (r = –.40). Therefore, making RCTs the “gold standard,” even when small numbers of participants are randomly assigned, seems to be an unwise policy. Random assignment is important to use whenever possible, but sample size matters too. Slavin and Smith conclude that their “findings argue against the policy … of placing extraordinary value on randomized experiments even when they have very small sample sizes.” Such studies can be hard to replicate in a broader context, and replication is the key issue in evaluation of policies and programs.
No quantitative analysis can be better than the quality of the data, and this includes meta-analyses. Our next two chapters address the issue of data quality, which in meta-analysis means the quality of the research reports examined. Ioannidis and Trikalinos in Chapter 41 discuss how to test for bias in retrieved articles, while Kostoff, in Chapter 42, examines the benefits of different forms of searching for retrieving articles and the data within them. Both chapters focus on published medical research, in part because the number of publications and meta-analyses of those publications is very large, but their conclusions can be adapted to meta-analysis in other fields. When summarizing the results and integrating the findings of research reports, the reports have to be examined for quality, specifically for biases. As an indicator of bias, Ioannidis and Trikalinos suggest a test that can be used to detect an excess (beyond what would be expected by probability) of statistically significant findings.
Kostoff focuses on finding articles and the information within them, rather than on assessing them for bias. The most typical methods of searching, in such databases as Science Citation Index and Medline, use titles, keywords, and abstracts, but not the full texts of the articles. Using ScienceDirect allows researchers to search over 8 million full-text articles. The advantages of full-text searching seem overwhelming as compared to the typical, more limited searching of abstracts. Kostoff found the retrieval of articles to be 10 times greater, and the articles found were published in a wider range of journals. Most important, perhaps, although difficult to demonstrate quantitatively, are the expanded opportunities in full-text searching for discovering relevant information that would otherwise be missed.
3.3 Correlation and Regression
Correlation and regression are not necessarily alternatives to hypothesis testing, since they can be used in hypothesis testing problems. For example, in a simple experiment the mean difference in the dependent variable between the control and treatment groups is often tested for statistical significance (the null hypothesis is usually no difference) and the resulting p-value interpreted. And/or the correlation between score on the outcome variable and being in the treatment group can likewise be tested (the null is usually a correlation of zero), also leading to a p-value, which will be identical to that for the mean difference. The fact that correlation and regression are not often used to analyze experimental data is mostly a matter of tradition. Confusion is added to tradition when textbooks contrast experimental designs with so-called "correlational" designs. This is a false dichotomy that limits analytical options. Correlations are useful for describing the strength of the effect of experimental treatments. And as has often been demonstrated, virtually all analytic methods are at base correlational. All are based on variances and covariances, and most multivariate techniques are erected upon analysis of correlation and/or covariance matrices. These statistics are all versions of the same basic algorithm, the same general linear model. Correlation's analytic cousin, regression, is the heart of quantitative analysis in sociology, political science, economics, and related disciplines such as social work and business. Regression is the all but universal method for explaining or predicting variance in an outcome by the variance in one or more predictor variables while controlling for multiple covariates. As we will see in the following sections, different types of regression must be used to analyze different types of data. Despite data differences, regression always answers a more or less elaborate version of the same basic question: for every 1-unit increase in a predictor variable, what happens to an outcome variable? The answer is expressed in the regression coefficient.
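The equivalence noted above, that testing a two-group mean difference and testing the point-biserial correlation between the outcome and group membership give the same p-value, can be checked directly. A minimal sketch with simulated data (all names and values hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group = np.repeat([0, 1], 30)                 # 0 = control, 1 = treatment
    y = 0.5 * group + rng.normal(size=60)         # outcome with a treatment effect

    # p-value from the equal-variance t-test on the mean difference ...
    p_ttest = stats.ttest_ind(y[group == 1], y[group == 0]).pvalue
    # ... and from testing the correlation between outcome and group membership.
    _, p_corr = stats.pearsonr(group, y)

    print(round(p_ttest, 6), round(p_corr, 6))    # identical apart from rounding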
Like all statistical techniques, correlation and regression can be misused, and debate about appropriate use and potential abuse is sometimes vigorous. A striking example is Chapter 43, in which Vul and colleagues discuss the "puzzlingly high" correlations (they also say "impossibly" high and that these "voodoo correlations" really "should not be believed") between areas of brain activation as indicated by functional magnetic resonance imaging (fMRI) and psychological measures of personality and emotion. Not surprisingly, the authors whose works were thus criticized have replied vigorously. The fMRI produces an image of oxygen levels of the blood in the brain while laboratory subjects respond to questions, view images, and so on (studies must be conducted in a laboratory, since portable fMRI machines do not yet exist). Some critics have questioned whether anything is gained by learning that certain sections of the brain sometimes use more oxygen when a laboratory subject makes hypothetical decisions or experiences hypothetical emotions. Still, the popularity of this research has been very great among researchers, the general public, and funding agencies. But the perils of using correlations to draw causal conclusions about the relation of brain activation and personality measures are also great, particularly when the reported reliabilities of the measures being associated are low and, very often, go unreported. This chapter and the controversy it has engendered are very instructive for researchers pondering the appropriate uses of correlational analyses.
Similar issues arise when correlating variations in DNA with variations in social and individual outcomes such as delinquency and depression. Explanations built on such data as DNA base pairs are considered by many to be more fundamental and enduring than social, historical, environmental, or cultural data. DNA analyses are more advanced than fMRI analyses, but many of the same analytical issues pertain. The eagerness of the general public and popular press for this kind of research is if anything surpassed by the enthusiasm of research funding agencies. Guo and Adkins in Chapter 44 provide an overview of DNA research and a discussion of making statistical associations between genetic variation and human traits and behaviors. As with all research using existing conditions, rather than assigning cases to treatment and control conditions, the possibility of correlations being due to confounding variables may be high. And, given the millions of DNA base pairs that differ between humans (they have billions in common), the chances are strong of finding random associations and spurious correlations between them and human traits. The reader will recall Mill's criteria for causation (Section 2.2, above). The hypothesized cause and effect must covary, and the covariance cannot be better explained by a different variable. These criteria are still relevant for, as Guo and Adkins point out, while statistical associations are necessary to demonstrate a causal link, they are seldom if ever sufficient. Sociologists in particular are likely to find social explanations to be more persuasive than genetic ones, particularly explanations for social problems. And given the virtual impossibility of manipulating genetic causes, social explanations also have higher potential for practical applicability. If, for example, crime is caused by social conditions, there is some possibility of adjusting those conditions in ways that would reduce crime. But if it is caused by genetic makeup, the options are much more limited.
Basic methods of correlation and regression work most smoothly when the variables being associated are continuous rather than categorical or ranked. When the data are categorical, as they often are in social research, more specialized techniques need to be used. For example, in multiple regressions, when one of the predictor (explanatory) variables is categorical, especially when it has multiple categories, the standard technique is dummy coding, in which all of the categories except one are compared to that one, which is usually called the excluded or reference category. But, as Gayle and Lambert explain in Chapter 45, there is a problem with using a reference category. While each of the categories can be compared to the reference, they cannot be directly compared to one another. For example, if one of the predictor variables is the six regions of a nation, one can make, say, Region 6 the reference category and compare each of the other regions to it. But this does not allow one to compare regions to one another – e.g., Region 2 to Region 5, Region 1 to Region 4, etc. This vexing problem can be circumvented by using the "quasi-variance," which is easily calculated using the variance-covariance matrix and a free on-line calculator to which the authors provide a link.
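The reference-category problem, and the role the variance-covariance matrix plays in solving it, can be illustrated with ordinary regression output. The sketch below (hypothetical data and variable names; the statsmodels package is assumed) computes the exact standard error of a contrast between two non-reference regions from the coefficient covariance matrix, which is the quantity quasi-variances are designed to approximate when the full matrix is not reported.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: outcome y and a six-region categorical predictor.
    rng = np.random.default_rng(2)
    df = pd.DataFrame({"region": rng.choice([f"R{i}" for i in range(1, 7)], size=600)})
    df["y"] = rng.normal(size=600) + df["region"].map(
        {"R1": 0.0, "R2": 0.3, "R3": 0.1, "R4": -0.2, "R5": 0.5, "R6": 0.0})

    # R6 as the reference category; each coefficient compares a region to R6 only.
    fit = smf.ols("y ~ C(region, Treatment(reference='R6'))", data=df).fit()

    # Comparing two non-reference regions (R2 vs R5) requires the coefficient covariance.
    b = fit.params
    V = fit.cov_params()
    r2 = "C(region, Treatment(reference='R6'))[T.R2]"
    r5 = "C(region, Treatment(reference='R6'))[T.R5]"
    diff = b[r2] - b[r5]
    se = np.sqrt(V.loc[r2, r2] + V.loc[r5, r5] - 2 * V.loc[r2, r5])
    print(f"R2 - R5 = {diff:.3f} (SE {se:.3f})")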
Another technical problem in regression analyses that can cause interpretive problems in research on social programs occurs when assignment to programs is not at random – and most often it is not. One solution used by many researchers is propensity score matching (PSM). PSM is a form of matched pairs that combines scores on multiple variables related to treatment assignment and the outcome into an overall propensity score (the estimated probability of receiving the treatment). That score is used to match participants in order to assess their response to treatment. This can be an effective way to deal with confounding variables, but Freedman and Berk, in Chapter 46, urge caution. It is far too easy, they demonstrate in extensive simulations, for weighting regressions with propensity scores to increase random error substantially; this leads to estimates of the standard errors that are too low. Rather than this risky practice, they conclude, "Investigators who have a causal model that they believe in should probably just fit the equation to the data" and not use propensity score weighting.
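As a concrete, deliberately simple illustration of what propensity-score weighting does, the following sketch simulates an observational study in which a confounder drives both selection into treatment and the outcome, then estimates the treatment effect with inverse-probability-of-treatment weights. It is a minimal sketch using statsmodels, not Freedman and Berk's procedure, and it ignores the standard-error problems they warn about.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical observational data: confounder x affects both treatment and outcome.
    rng = np.random.default_rng(3)
    n = 2000
    x = rng.normal(size=n)
    treat = rng.binomial(1, 1 / (1 + np.exp(-x)))      # selection depends on x
    y = 0.5 * treat + x + rng.normal(size=n)            # true treatment effect = 0.5

    # Step 1: estimate propensity scores with a logistic regression of treatment on x.
    ps_model = sm.Logit(treat, sm.add_constant(x)).fit(disp=0)
    ps = ps_model.predict(sm.add_constant(x))

    # Step 2: inverse-probability-of-treatment weights.
    w = np.where(treat == 1, 1 / ps, 1 / (1 - ps))

    # Weighted difference in means as the treatment-effect estimate.
    ate = (np.average(y[treat == 1], weights=w[treat == 1])
           - np.average(y[treat == 0], weights=w[treat == 0]))
    print(round(ate, 2))   # close to the true 0.5; the naive raw difference is not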
3.4 Logit and Probit Regression
When the dependent variable is categorical, one cannot properly use ordinary least squares (OLS) regression. Categorical predictor variables are not without problems for the analyst, but the needed adjustments are minor in comparison to the transformations that have to be made for categorical outcome or dependent variables. Probit and logit (or "logistic") regression are the two main ways to handle categorical outcome variables in regression analyses. Because categorical outcome variables are common in most fields, probit and logit regression are used in a wide range of disciplines. This is illustrated in our next four chapters, which come from sociology, psychology and education, medicine, and political science – which, incidentally, also illustrates the wide range of disciplines in which SAGE journals are published.
One issue in logit and probit models when predictor variables are also categorical is the problem of residual variation or unobserved heterogeneity in the categories of the predictor variable. In Chapter 47, Allison proposes and illustrates new methods that can adjust for unequal residual variation. Another drawback of logit and probit regression as compared to OLS methods is that there is as yet no good equivalent of the R2 statistic, with which the researcher can estimate the overall effect size, or the percentage of variance in the outcome explained by the entire model. Several analogues of the R2 statistic, so-called pseudo-R2s, have been proposed, but their estimates vary considerably and there is no clear reason to prefer one over another. Allen and Le, in Chapter 48, provide an alternative without the disadvantages of the pseudo-R2s: the overall odds ratio (OOR). Mediating variables (also called "intermediate" or "surrogate" endpoints in medical research) are also more complicated to estimate in logit and probit regressions than in OLS models. In Chapter 49, MacKinnon and colleagues propose methods that provide a better alternative to the methods currently in use, although these methods have limited applicability with small samples. In each of the three examples above, the proposed alternative solutions were argued for and illustrated by comparing the alternatives while analyzing the same (often simulated) data. A similar approach to methodological research is taken by Grofman and Schneider in Chapter 50. But rather than comparing alternative approaches to logit regression, they compare it with a quite distinct method, one which also addresses dichotomous outcome variables: qualitative comparative analysis (QCA). Pioneered by Charles Ragin in the 1980s, QCA is based on Boolean logic rather than statistical theory. It is widely used in comparative research in political science and sociology. The authors of this chapter conclude that while QCA and logit regression ostensibly deal with the same class of problems, the methods will lead to different conclusions when analyzing the same data set.
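For readers who want to see what a pseudo-R2 is, the sketch below fits a logistic regression to simulated data (hypothetical variables; statsmodels assumed) and computes McFadden's pseudo-R2, one of the several analogues mentioned above, by comparing the fitted model's log-likelihood to that of an intercept-only model. It illustrates the general idea, not Allen and Le's overall odds ratio.

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data with a binary outcome and two continuous predictors.
    rng = np.random.default_rng(4)
    n = 500
    x = rng.normal(size=(n, 2))
    p = 1.0 / (1.0 + np.exp(-(0.8 * x[:, 0] - 0.5 * x[:, 1])))
    y = rng.binomial(1, p)

    model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    null = sm.Logit(y, np.ones((n, 1))).fit(disp=0)

    # McFadden's pseudo-R2: 1 - (model log-likelihood / null log-likelihood).
    pseudo_r2 = 1 - model.llf / null.llf
    print(model.params.round(2), round(pseudo_r2, 3))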
3.5 Categorical Data Analysis
Log-linear models are used for problems in which all variables are categorical. They are called log-linear because the natural logarithm of the expected cell counts is modeled as a linear function of the variables and their interactions. The distinction between dependent and independent variables does not apply in this kind of analysis. Rather, the researcher attempts to explain the count in each cell by examining the main effects and interactions of the variables. Log-linear techniques make it possible to conduct multivariate analyses of categorical data, and they are capable of handling several nominal variables and their relations in a way that approximates structural equation models (see Section 4.1). Even though contingency tables of categorical variables (discrete distributions) are quite common in social research, log-linear models are used less often than they might be because employing them involves many technical difficulties that researchers find challenging, especially as compared to older chi-squared methods. Holland and Thayer in Chapter 51 provide a very useful overview of applicable models and their use in fitting data. Cheng and Long in Chapter 52 address another vexing problem in logit models: what to do about the assumption of independence of irrelevant alternatives (IIA). There are various statistical tests for this assumption. The authors review these and conclude that they are all wanting. In the absence of good statistical tests, the authors recommend using multinomial and conditional logit models only when it can be reasonably assumed on theoretical grounds that violations of the IIA assumption are highly unlikely.
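A minimal sketch of a log-linear analysis, for orientation only: the cell counts of a hypothetical two-way table are modeled with a Poisson generalized linear model, and the deviance of the main-effects-only (independence) model serves as the test of association. Variable names are invented and the statsmodels package is assumed.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # A hypothetical 2 x 2 contingency table in "long" form.
    table = pd.DataFrame({
        "gender": ["F", "F", "M", "M"],
        "voted":  ["yes", "no", "yes", "no"],
        "count":  [340, 160, 290, 210],
    })

    # Independence (main-effects-only) log-linear model: log(count) = gender + voted.
    indep = smf.glm("count ~ gender + voted", data=table,
                    family=sm.families.Poisson()).fit()

    # Its deviance, on one degree of freedom here, tests the gender x voted association.
    print(round(indep.deviance, 2))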
As discussed above (Section 3.4, Chapter 50), qualitative comparative analysis (QCA) can provide, for some problems, a Boolean logic alternative for analyzing categorical data. The degree to which researchers using the suite of logit-based methods and their colleagues using QCA can find common ground is a matter of some dispute. In Chapter 53, Eliason and Stryker demonstrate how tools used for model testing in quantitative analyses (goodness-of-fit tests) can be applied to QCA, specifically to the fuzzy-set version of QCA. The idea of fuzzy sets was originally developed in computer science and refers to sets or categories whose boundaries contain some uncertainty; in QCA they are often rank ordered rather than strictly categorical. Eliason and Stryker argue that they have demonstrated that a probabilistic approach can be applied to fuzzy-set QCA and "established a firm inferential foundation for fuzzy-set methodology."
Optimal matching (OM) is a method for categorical variables in the sense that data are gathered for distinct individuals or other cases, and the data are sequences of distinct steps or stages. OM was first developed for DNA sequencing; it was imported into the social sciences by Andrew Abbott in the 1980s. Several competing versions of the method have been used, mostly to study careers or work histories with one goal being to develop typologies of occupational careers. Hollister in Chapter 54 compares some of the competing approaches to sequence analysis. While the steps in the sequences in work histories are relatively clear – e.g., education, part-time employed, employed, unemployed, employed, part-time employed, and retired – analyzing the sequences by comparing them to one another can be exceptionally complex, in part because the number of sequences in a large sample can be very high. The basic idea of OM is to compare sequences by calculating the number of steps (insertions and deletions, called “indels”) it would take to change one sequence into another. The number of steps is called the “cost.” The attraction of sequence analysis is that it promises a way to study aspects of social life that have been intractable. But, as Hollister shows, the methods are still under development, and there is no definitive way to judge the competing versions.
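The core of optimal matching, turning one sequence into another by insertions, deletions, and substitutions and recording the total cost, is essentially an edit-distance computation. The following is a bare-bones sketch with invented career sequences and arbitrary costs; real applications weight the operations much more carefully and compare many sequences pairwise.

    def om_distance(seq_a, seq_b, indel=1.0, sub=2.0):
        """Minimum total cost of insertions, deletions, and substitutions
        needed to turn seq_a into seq_b (a basic optimal-matching distance)."""
        n, m = len(seq_a), len(seq_b)
        cost = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = i * indel
        for j in range(1, m + 1):
            cost[0][j] = j * indel
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub
                cost[i][j] = min(cost[i - 1][j] + indel,      # deletion
                                 cost[i][j - 1] + indel,      # insertion
                                 cost[i - 1][j - 1] + same)   # substitution / match
        return cost[n][m]

    career_a = ["education", "part-time", "employed", "unemployed", "employed"]
    career_b = ["education", "employed", "employed", "part-time", "retired"]
    print(om_distance(career_a, career_b))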
Volume 4: Complex Designs for a Complex World
While complex issues and analytic problems occur in the course of implementing the methods discussed in the first three volumes, in Volume 4 the levels of analytic complexity tend to be even higher – and so is the dependency on modern high-speed computers and computer programs. We begin, in Section 4.1, with Structural Equation Modeling (SEM), which is used to analyze causal models with multiple indicators of latent variables and structural relations among latent variables. In Section 4.2, Multilevel Modeling (MLM) is discussed. MLM is used when the research variables are nested, hierarchical, or contextual – and it can be argued that most phenomena in the social and behavioral sciences are thus nested. Prior to the availability of MLM, researchers studying nested problems used OLS regression despite the fact that doing so violated the assumption of independent observations. Variables in social research tend to be latent in their measurement and nested in fact. Hence the use of SEM and MLM is widespread and rapidly growing, as are calls for their integration. Human activity also takes place in space and time, and our final two sections highlight ways of investigating that obvious but methodologically complicated fact. In Section 4.3, techniques for examining time are studied, including a version of MLM, event history analysis, survival analysis, and models for long-term economic trends. Finally, methods of modeling the spatial dynamics of residential patterns and terrorist networks are coupled (in Section 4.4) with methods for using computer-intensive techniques to search for data, including the contents of publications.
4.1 Structural Equation Modeling
Structural equation modeling (SEM) incorporates virtually all more elementary statistical techniques, to the point that one can consider them all to be special cases of SEM. SEM is a sophisticated statistical method for testing complex causal models in which the dependent and independent variables are latent. Since latent variables cannot be observed directly, they are constructs inferred from patterns of relations among observable variables. SEM combines a measurement model, built using factor analysis, with a causal model built using path and multiple regression analyses. This combination allows researchers to study the effects of latent variables on each other. The complexity of the analyses means that they are always done with dedicated computer programs such as LISREL, EQS, AMOS – or with programs available in R's freeware environment.
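For readers who want to see what a specified SEM looks like in practice, here is a minimal sketch assuming the third-party Python package semopy (one freeware alternative to the commercial programs named above); the data, variable names, and the simple two-latent-variable model are all invented for illustration.

    import numpy as np
    import pandas as pd
    from semopy import Model   # assumes the semopy package is installed

    # Hypothetical data: two latent variables, each measured by three indicators.
    rng = np.random.default_rng(6)
    n = 400
    eta1 = rng.normal(size=n)
    eta2 = 0.6 * eta1 + rng.normal(scale=0.8, size=n)      # structural relation
    df = pd.DataFrame({name: source + rng.normal(scale=0.5, size=n)
                       for name, source in [("x1", eta1), ("x2", eta1), ("x3", eta1),
                                            ("y1", eta2), ("y2", eta2), ("y3", eta2)]})

    # Measurement model (=~) plus structural model (~), in lavaan-style syntax.
    desc = """
    eta1 =~ x1 + x2 + x3
    eta2 =~ y1 + y2 + y3
    eta2 ~ eta1
    """
    model = Model(desc)
    model.fit(df)
    print(model.inspect())   # loadings, the structural coefficient, and variances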
The four chapters of this section address various aspects of SEM's problems and prospects. In Chapter 55, Graham proposes building on the fact that all techniques in the general linear model (from the t-test through canonical correlation) are special cases of SEM. He uses this fact to structure a curriculum for a general statistics course, persuasively arguing that this would be an improvement over teaching each statistical method as though it were independent and unrelated to the others; his chapter also provides a good overall introduction to SEM. Hayton and colleagues in Chapter 56 address a more specific technical problem with exploratory factor analysis (EFA). EFA is often a preliminary step researchers take before they are ready to commit to a measurement model needed to conduct a confirmatory factor analysis (CFA), which is a stage in the development of a full SEM. The CFA is how the measurement part of the SEM is developed. In CFA the researcher must specify the structure in advance; in EFA, by contrast, the goal is to discover the structure or pattern of relations among the observed variables that add up to a latent variable. Deciding which among the factors generated by an EFA to retain has always been a big issue for researchers. Hayton and colleagues review the methods and propose an alternative (parallel analysis) to the most commonly used ones, such as scree plots.
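Parallel analysis, the retention rule Hayton and colleagues advocate, compares the eigenvalues observed in the data with those obtained from random data of the same dimensions. A minimal numpy sketch, with hypothetical items driven by two underlying factors:

    import numpy as np

    def parallel_analysis(data, n_iter=200, seed=0):
        """Retain factors whose observed correlation-matrix eigenvalues exceed
        the average eigenvalues from random data of the same dimensions."""
        rng = np.random.default_rng(seed)
        n, p = data.shape
        obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
        rand = np.zeros((n_iter, p))
        for i in range(n_iter):
            noise = rng.normal(size=(n, p))
            rand[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
        return int(np.sum(obs > rand.mean(axis=0)))

    # Hypothetical example: 300 respondents, 6 items driven by two latent factors.
    rng = np.random.default_rng(1)
    f = rng.normal(size=(300, 2))
    items = np.hstack([f[:, [0]] + 0.5 * rng.normal(size=(300, 3)),
                       f[:, [1]] + 0.5 * rng.normal(size=(300, 3))])
    print("factors to retain:", parallel_analysis(items))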
Confirmatory factor analysis (CFA) is one of the chief methods researchers use to establish measurement invariance, or the consistency or equivalence of measurements over time as well as across populations and modes of measurement. Item response theory (IRT) addresses many of the same measurement problems, and Meade and Lautenschlager in Chapter 57 compare the effectiveness of these two sets of methods, concluding that at least for some problems (particularly in organizational research) IRT outperforms CFA. The two are built on different models. As a component of SEM, CFA is built on the general linear model, whereas IRT addresses measurement equivalency using the generalized linear model. Specifically, IRT employs a type of logit regression; the log of the odds (or logit) of an observed outcome (e.g., the answer to a question) is the dependent measure. This is used to describe the relation between the observed responses to questions and the underlying latent variable. The CFA component of SEM is also Thompson's topic in Chapter 58. He discusses the means of, and the necessity for, interpreting the output of a CFA using both factor pattern coefficients and factor structure coefficients. Many researchers focus mainly or exclusively on the pattern coefficients, but this, Thompson demonstrates, is to ignore important analytic aids available by using the structure coefficients as well.
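The logit core of IRT mentioned above can be written in a single line. The sketch below evaluates a two-parameter logistic item characteristic curve, in which the probability of endorsing an item depends on the latent trait theta, the item's discrimination a, and its difficulty b (all values invented for illustration):

    import numpy as np

    def icc_2pl(theta, a, b):
        """Two-parameter logistic item characteristic curve:
        P(endorsing the item) given trait theta, discrimination a, difficulty b."""
        return 1.0 / (1.0 + np.exp(-a * (theta - b)))

    theta = np.linspace(-3, 3, 7)
    print(icc_2pl(theta, a=1.5, b=0.0).round(2))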
4.2 Multilevel Modeling
Multilevel modeling (MLM) is a set of analytic techniques for studying nested variables and observations. The nesting may be contextual, where cases are nested in contexts, for example: individual students are nested in classrooms and classrooms are nested in schools; each of these is a level that can influence an outcome variable, such as student achievement. MLM allows researchers to separate the variance into components explaining, for example, the effects of schools (Level 3) and classrooms (Level 2) on individual students (Level 1). The nesting may also be observational, where observations are nested in cases. For example, in growth-curve models, observations of students’ progress are nested within individual students. MLM is very widely used in part because the realities social researchers study are often in fact nested and contextual, but the regression techniques for studying nested variables have become available only comparatively recently. Before the 1980s, using typical methods, such as OLS regression, the problems of separating level effects were computationally intractable. By requiring the assumption of the independence of observations, researchers using OLS regression had to assume that potentially crucial determinants of outcome variables (contexts) had no effect. Since the 1980s researchers in several disciplines have developed a range of analytic techniques for addressing this problem of non-independent, nested observations. Because there has been a good deal of simultaneous invention, the techniques go by various names. Multilevel modeling (MLM) is probably the most generic term, but several others are commonly used, and each describes a feature of these methods of analysis: mixed effects models, random coefficient models, and hierarchical linear models (HLM). The latter is also the name of a popular software package.
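A minimal two-level example may help fix ideas. The sketch below simulates students nested in schools and fits a random-intercept multilevel model with the MixedLM routine in statsmodels; all variable names and values are hypothetical.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical two-level data: students (level 1) nested in schools (level 2).
    rng = np.random.default_rng(5)
    n_schools, n_students = 40, 25
    school = np.repeat(np.arange(n_schools), n_students)
    school_effect = rng.normal(0, 0.5, n_schools)[school]   # level-2 variation
    ses = rng.normal(size=school.size)                      # level-1 predictor
    score = 50 + 2.0 * ses + school_effect + rng.normal(0, 1, school.size)
    df = pd.DataFrame({"score": score, "ses": ses, "school": school})

    # Random-intercept multilevel model: students nested in schools.
    mlm = smf.mixedlm("score ~ ses", data=df, groups=df["school"]).fit()
    print(mlm.summary())   # fixed effect of ses plus the school-level variance component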
One final point important to make about MLM is that many of the analytical problems researchers can address using MLM can also be investigated with SEM – and vice versa (see Section 4.1 above). There is a tendency for MLM to be more frequently used by economists and sociologists (sociology is an inherently multi-level discipline), while SEM seems more widespread in the psychology-based disciplines. But the two methods are complementary in many ways: multi-level components can be incorporated into SEMs and latent-variable measurement components can be included in MLMs. Although the methods were originally developed and continue to be developed by different researchers in separate disciplines, there is also quite a lot of work being done that focuses on integrating them and that may lead to an understanding of how they are special cases of a more general model. Be that as it may, by allowing researchers to examine the structures of relations among variables and to study nested variables without violating regression assumptions, SEM and MLM have greatly enhanced researchers’ abilities to address real-world complexities in their data analysis.
An overview of the methodological issues that most typically arise in multilevel modeling (MLM) is provided by Dedrick and colleagues in Chapter 59. They analyze a sample of 99 recent articles in peer-reviewed journals and provide recommendations for practice based on their review – most importantly, that authors need to provide sufficient information for readers to judge the quality of the model. As for the kinds of research problems addressed with MLM, the authors found that it was more common to study individuals within contexts than observations within individuals, and that most studies were neither experimental nor based on probability samples. Most studies were two-level studies; higher levels are more complicated and require sample sizes that can be difficult to achieve at the third and higher levels. Scherbaum and Ferreter focus on the issue of sample size for MLM in Chapter 60. Their examples are drawn from organizational research, but their guidelines are broadly relevant for researchers determining MLM sample sizes so as to maximize statistical power. What is unique about MLM is that the question "How many cases?" has to be answered at each level of analysis. Potential problems with an underpowered study are multiplied by each level of analysis included.
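One widely used rule of thumb in this territory (a textbook approximation, not necessarily the one used in Chapter 60) is the design effect: clustering inflates sampling variance by roughly 1 + (group size − 1) × ICC, so the effective sample size available for power calculations can be far smaller than the raw count of level-1 cases.

    def effective_sample_size(n_groups, group_size, icc):
        """Approximate effective N for a two-level design, via the design effect."""
        design_effect = 1 + (group_size - 1) * icc
        return (n_groups * group_size) / design_effect

    # 40 classrooms of 25 students with a modest intraclass correlation of .10:
    print(round(effective_sample_size(40, 25, 0.10), 1))   # about 294, not 1,000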
These potential problems are why Klein and Kozlowski's review (in Chapter 61) of critical steps in conceptualizing and measuring multilevel constructs is especially apt. While in some respects MLM can be seen as nothing more than a logical development of ordinary multiple regression, that extension of regression methods to multiple levels raises conceptual and analytical problems that researchers may be apt to gloss over. These problems are more complicated when MLM is used to study observations within cases, rather than cases within contexts. That is because, as Bliese and Ployhart discuss in Chapter 62, studying observations within individual cases almost always means working with longitudinal data. The observations are not independent both because they are within cases and because they are longitudinally autocorrelated. This kind of non-independence raises more conceptual problems than cross-sectional contextual non-independence. The authors specifically compare the multilevel modeling approach, which they refer to as random coefficient modeling (RCM), to structural equation modeling (SEM), which, as discussed above, can address many of the same analysis problems as RCM/MLM. Each method has advantages and disadvantages. RCM is often more robust in the face of missing data, is more flexible when modeling time, and can more easily include multiple levels of nesting. SEM's advantages are not inconsequential, mainly its ability to account for measurement error in the observed or manifest data. Thus, while the authors think that, on the whole, the RCM approach is more productive for the kinds of longitudinal problems they address, they conclude that "researchers are likely to benefit from being familiar with both RCM and SEM." And, in a nice example of good practice, the authors provide code in the open-source R environment that readers can use to replicate their results.
4.3 Event History, Survival, and Longitudinal Analyses
Modeling time has always been complicated for quantitative analysts. Models for one-time, cross-sectional data rarely encounter as many conceptual and technical problems. Some of the analytic problems stem from the fact that the passage of time is a continuous process, but it is often studied as a series of discrete states. An effective and oft-used method is event-history analysis, which is the subject of the first two chapters (63 and 64) of this section. Event-history analysis (EHA) is a suite of methods for studying the movement over time of subjects through successive states or conditions. Some designs ask subjects to remember biographical data concerning the timing and duration of events in their lives. In other designs, researchers collect the event data directly. In either case, the object is to study change from one state to the next and to estimate the length of time in each state. Events are changes from one categorical state to another. For example, marital status could be studied with the states or conditions being: unmarried, married, divorced, remarried, and widowed. Survival analysis is a variety of event history analysis focusing on the question of how long subjects remain ("survive") in a categorical state until they reach another one-time state (the paradigm case being death). Survival analysis is best known from medical research, where it is used to study the duration of illnesses; it often uses Cox regression, which is a variety of logit regression that incorporates the time to the event as well as whether or not it occurred.
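A minimal Cox regression sketch, assuming the third-party lifelines package and entirely hypothetical data, shows the basic ingredients of a survival analysis: a duration, an event indicator (with 0 marking right-censored cases), and a covariate whose hazard ratio is estimated.

    import pandas as pd
    from lifelines import CoxPHFitter   # assumes the lifelines package is installed

    # Hypothetical survival data: time to event, event indicator (1 = occurred,
    # 0 = right-censored), and a single treatment covariate.
    df = pd.DataFrame({
        "time":    [5, 8, 12, 3, 9, 15, 7, 11, 2, 14],
        "event":   [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
        "treated": [0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    cph.print_summary()   # hazard ratio for 'treated' and its confidence interval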
Andersen and Keiding examine event history analysis in a medical context in Chapter 63. They provide an overview of the method and some of its more challenging though typical problems. For example, in survival analysis, incomplete observations occur because some individuals have yet to contract a disease at the end of the study (right-censoring) and others have contracted it before the study begins (left-truncation). Muthén and Masyn in Chapter 64 discuss more advanced methods, particularly methods of adding latent class analyses to event-history indicators. Latent Class Analysis (LCA) has goals similar to those of factor analysis, but unlike factor analysis it is used with categorical data. The goal is to discover latent variables that may underlie discrete observations. LCA is used to find discrete categories or "classes" of latent variables, based on observations of manifest variables that have categorical, not continuous, values. LCA is computationally very complex (the authors show how to use Muthén's Mplus software to do the computations), but because so many event history/survival analyses include categorical variables, the additional analytic complexity has to be undertaken for a considerable class of problems.
Time can also be an important factor in the pacing of methods of data collection. Even when they are collected simultaneously at two or more levels and are thus nested, data can have different structures. Nezlek in Chapter 65 focuses on structures that are very important in the study of naturally occurring phenomena, particularly day-to-day social interaction. For example, in event-contingent data, subjects describe the specific events that occurred over a given period of time (the parallels with event-history analysis are close). Nezlek provides an overview of how to apply random coefficient modeling and, within RCM, different modeling techniques to data thus structured. Massman and colleagues in Chapter 66 also address different measurement models for temporal data, specifically in relation to business cycles and economic turning points. Several methods are commonly used to identify trends and cycles in economic time series data. The authors discuss these and demonstrate that one's choice of method can strongly affect the conclusions one draws about economic events, which in turn can lead to highly consequential policy decisions.
4.4 Computer-Intensive and Hi-Tech Spatial Analysis Methods
The theme in this section is computer-intensive methods. The theme is illustrated by a topic that cuts across all four chapters: networks of widely varying types – how to identify networks, use them for data collection, visualize them, and model them. Networks of knowledge dissemination and patterns of citations of academic researchers are the topics of Najman and Hewitt in Chapter 67. The study of these subjects, interesting in themselves, is given much of its impetus by a desire to measure research productivity and to rank individuals and institutions accordingly. While current measures of productivity are inadequate in many ways, even these approximate measures reveal distinct citation networks and publication patterns among different disciplines. In Thelwall's Chapter 68 we move from a consideration of using computer networks to gather information about external matters, such as researchers' productivity, to using computer networks to study computer networks, specifically the World Wide Web. Thelwall explains how a web crawler (an automated method of finding and extracting data from web pages without human intervention) can be constructed and implemented using a "distributed analyzer." The distributed system spreads the computational tasks among many computers, using machines when they would otherwise be idle. This makes possible large-scale data mining tasks that no lone researcher using a single computer could undertake.
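For orientation, the following is a single-machine, standard-library sketch of the crawling step Thelwall describes (fetch a page, extract its links, queue the new ones), without the distributed analyzer that makes large-scale work feasible. The start URL is a placeholder, and real crawlers must also respect robots.txt and rate limits.

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=20):
        """Breadth-first crawl: fetch a page, extract its links, queue new ones."""
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
            except Exception:
                continue
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))
        return seen

    print(crawl("https://example.org"))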
Social networks are the topic of our last two chapters – terrorist social networks and neighborhood residential patterns. In Chapter 69, Yang and Sageman address the problem of how to visualize complex network data, using as their test cases information about terrorist networks. Space constraints quickly become a problem for analysts when networks are large and/or complex. Methods of visualization and depiction are limited by the centimeters on the page or the pixels on the screen. Yang and Sageman use a form of "fractal views" for data reduction and to moderate the complexity of the networks they study. This enables them to find hidden relationships among terrorists and to identify the key persons in terrorist groups. In Chapter 70, Benenson and colleagues study residential neighborhood dynamics. They extend Schelling's classic descriptions of how big social consequences can result from the interaction of relatively small individual choices, actions, and preferences. Census data are increasingly being collected using high-resolution satellite mapping of areas of interest to census takers. The result is GIS (geographic information science) databases that can be used to study residential dynamics. Building on Schelling's ideas, it is becoming possible to model human social interaction in space and over time. In these last two chapters we see the increasingly common merger of quantitative data with spatial data and graphic representations of them.
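A toy version of Schelling's mechanism makes the point concrete: even a mild preference for same-type neighbors, applied locally, produces marked clustering. The grid, threshold, and update rule below are invented for illustration and are far simpler than the GIS-based models the chapter describes.

    import numpy as np

    def schelling_step(grid, threshold=0.3, rng=None):
        """One round of a minimal Schelling model: agents (1 or 2) move to a random
        empty cell (0) if too few of their neighbors share their type."""
        rng = rng or np.random.default_rng()
        rows, cols = grid.shape
        for r in range(rows):
            for c in range(cols):
                agent = grid[r, c]
                if agent == 0:
                    continue
                neigh = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
                same = np.sum(neigh == agent) - 1          # exclude the agent itself
                other = np.sum((neigh != agent) & (neigh != 0))
                if (same + other) > 0 and same / (same + other) < threshold:
                    empties = np.argwhere(grid == 0)
                    dest = empties[rng.integers(len(empties))]
                    grid[dest[0], dest[1]], grid[r, c] = agent, 0
        return grid

    rng = np.random.default_rng(7)
    grid = rng.choice([0, 1, 2], size=(15, 15), p=[0.2, 0.4, 0.4])
    for _ in range(10):
        grid = schelling_step(grid, rng=rng)
    print(grid)   # mild preferences yield visibly clustered neighborhoods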
Conclusion
Reviewing this sample of articles in quantitative research methods that have appeared over the past four decades in SAGE Publications' journals provides the reader with an overview of the field of quantitative research methods – or the set of related fields that together constitute quantitative research methods. Quantitative methods are developed in statistics and mathematics departments, but also by researchers in every substantive field in the social, behavioral, and medical sciences, which means that important research in quantitative methods can be found in a very wide range of journals. While the breadth of topics reviewed in these volumes is great, several themes can be detected. And most of those themes can be seen as sub-themes of one overriding issue: the question of causal inference. How can researchers make justifiable causal inferences? What methods – designs, measurement techniques, and analytical approaches – are most useful, and in what circumstances, for moving toward valid causal inferences? While other goals are also present in quantitative research methods – such as description in census work and prediction in election polling – causality overwhelms other research aims.
This predominance of causality is true not only of the methods associated with experimental research but also of those typically used in observational and archival research, which tend to be correlation- and regression-based methods. It is often said that "correlation does not imply cause," but this is surely wrong. A correlation does not prove a cause, and it does not explain a causal relation, but it is a necessary condition for causation (it is Mill's second criterion), and it definitely implies the presence of a causal relation. If the variables A and B are validly correlated, this means that one causes the other, or that both are caused by a third variable, or, more remotely perhaps, that each is caused by other variables (M and N, say) that are themselves related causally. These broader issues of causality and correlation aside, it is a clear empirical fact that questions of causation predominate in methodological discussions in quantitative research, both in these volumes and, more generally, in the field of research methods.
Note
1. The categories are rarely mutually exclusive; for example, an article discussing statistical inference in survey research could be grouped either with articles on inference or on surveys.