Skip to main content icon/video/no-internet

A scatterplot is a two-dimensional coordinate graph showing the relationship between two quantitative variables, with each observation in a data set plotted as a point in the graph. Although they are employed less frequently in publications than some other statistical graphs, scatterplots are arguably the most important graphs for data analysis. Scatterplots are used not only to graph variables directly, but also to examine derived quantities, such as residuals from a regression model.

Precursors to scatterplots date at least to the 17th century, such as Pierre de Fermat's and René Descartes's development of the coordinate plane, and Galileo Galilei's studies of motion. The first statistical use of scatterplots is probably from Sir Francis Galton in the 19th century, and is closely tied to his pioneering work on correlation, regression, and the bivariate-normal distribution.

Several versions of an illustrative scatterplot in Figure 1 show the relationship between fertility and contraceptive use in 50 developing countries. The total fertility rate is the number of expected live births to a woman surviving her child-bearing years at current age-specific levels of fertility. As is conventional, the explanatory variable (contraceptive use) appears on the horizontal axis, and the response variable (fertility) on the vertical axis.

None

Figure 1 Four Versions of a Scatterplot of the Total Fertility Rate (Number of Expected Births per Woman) Versus the Percentage of Contraceptors Among Married Women of Childbearing Age, for 50 Developing Nations

Robey, Rutstein, and Morris (1992).
  • A traditional scatterplot is shown in Panel A.
  • In Panel B, the names of the countries are graphed in place of points. The names are drawn at random angles to minimize overplotting, but many are still hard to read; if the graph were enlarged without proportionally enlarging the plotted text, this problem would be decreased.
  • Panel C displays a more modern version of the scatterplot: Grid lines are suppressed; tick marks do not project into the graph; the data are plotted as open, rather than filled, circles to make it easier to discern partially overplotted points; and the axes are drawn just to enclose the data. The solid curve in the graph represents a robust nonparametric regression smooth; the two broken curves are smoothes based on the positive and negative residuals from the nonparametric regression, and are robust estimates of the standard deviation of the residuals in each direction. The relationship between fertility and contraceptive use is nearly linear; the residuals have near-constant spread and appear symmetrically distributed around the regression curve. More information on smoothing scatterplots may be found in Fox (2000).
  • Panel D illustrates how the values of a third, categorical variable may be encoded on a scatterplot. Here, region is encoded using discriminable letters. Similar encodings employ different colors or plotting symbols.

Beyond the illustrations in Figure 1, there are a number of important variations on and extensions of the simple bivariate scatterplot: Three dimensional dynamic scatterplots may be rotated on a computer screen to examine the relationship among three variables; scatterplot matrices show all pairwise marginal relationships among a set of variables; and interactive statistical graphics permit the identification of points on a scatterplot, and the linking of scatterplots to one another and to other graphical displays. These and other strategies are discussed in Cleveland (1993) and Jacoby (1997).

...

locked icon

Sign in to access this content

Get a 30 day FREE TRIAL

  • Watch videos from a variety of sources bringing classroom topics to life
  • Read modern, diverse business cases
  • Explore hundreds of books and reference titles

Sage Recommends

We found other relevant content for you on other Sage platforms.

Loading