
Exploratory data analysis (EDA) looks at data to see what they seem to say. The distribution of the observed data is examined without imposing an arbitrary probability model on it. We look for trends, such as patterns and linear or nonlinear relationships between variables, and deviations from the trends, such as local anomalies, outliers, or clusters. This facilitates discovering the unexpected as well as confirming suspicions, rather like detective work.

EDA is sometimes viewed as a grab bag of tools, but this is a misconception. It is more accurate to view EDA as a procedure for data analysis. We start from a set of expectations or specific questions arising from the data context and explore the data with these in mind, while remaining open to observing unexpected patterns. The approach involves making many plots and numerical models of the data. Plots allow us to examine the distribution of the data without an imposed probability model; thus, statistical graphics form the backbone of EDA. Plots provide simple, digestible summaries of complex information that enable discovering unexpected structure. With the assistance of techniques such as bootstrapping, permutation, and model selection methods, we can assess whether the observed patterns in the data are more than random noise.
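As an illustration of how a permutation method can check whether an observed pattern is more than random noise, the following is a minimal sketch, not a prescribed procedure. The function name `permutation_pvalue`, the two small samples, and the choice of the difference in means as the statistic are all assumptions made for the example. Group labels are shuffled many times, and the observed difference is compared with the differences that arise by chance alone.

```python
import random
import statistics


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.

    Hypothetical helper for illustration; x and y are two samples.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled values breaks any real link between
        # group label and measurement.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_x]) - statistics.fmean(pooled[n_x:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Made-up measurements for two groups (illustrative data only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2]
group_b = [4.2, 4.4, 4.0, 4.8, 4.5, 4.1, 4.6, 4.3]

p_value = permutation_pvalue(group_a, group_b)
print(f"approximate p-value: {p_value:.4f}")
```

A small value suggests that the separation between the groups would rarely arise from random relabeling. In keeping with the exploratory spirit, such a result is a prompt for follow-up analysis rather than a final verdict.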

EDA is different from confirmatory statistical analysis. In confirmatory analysis, we start from a hypothesis and work to confirm or reject it. EDA, by contrast, is a hypothesis discovery process. It aims for approximate answers to the questions of interest rather than exact answers to the wrong questions. In the process of exploration, the data may suggest hypotheses, leading to follow-up confirmatory analysis with new data.

Methods in common usage that have arisen from EDA include the boxplot, stem-and-leaf plot, median polish, and projection pursuit.
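To give a feel for one of these methods, the sketch below shows a simplified median polish of a two-way table. It is an illustrative version under stated assumptions, not a reference implementation: the function name `median_polish`, the fixed number of sweeps, and the small table are hypothetical. The table is decomposed into an overall level, row effects, column effects, and residuals by repeatedly subtracting row and column medians.

```python
import statistics


def median_polish(table, n_iter=10):
    """Decompose a two-way table into overall + row + column + residual.

    Simplified sketch: runs a fixed number of sweeps instead of
    checking for convergence.
    """
    n_rows, n_cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols

    for _ in range(n_iter):
        # Sweep row medians out of the residuals into the row effects.
        for i in range(n_rows):
            m = statistics.median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        # Move the median of the column effects into the overall level.
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]

        # Sweep column medians out of the residuals into the column effects.
        for j in range(n_cols):
            m = statistics.median([resid[i][j] for i in range(n_rows)])
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        # Move the median of the row effects into the overall level.
        m = statistics.median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]

    return overall, row_eff, col_eff, resid


# Made-up additive table with one cell (last row, last column) inflated by 50.
table = [
    [10, 20, 30],
    [11, 21, 31],
    [12, 22, 82],
]

overall, row_eff, col_eff, resid = median_polish(table)
print("overall:", overall)
print("row effects:", row_eff)
print("column effects:", col_eff)
print("residuals:", resid)
```

Because medians resist the influence of a single unusual value, the inflated cell should be left standing in the residuals while the row and column effects recover the additive structure. This is the sense in which the method helps reveal local anomalies.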

History

The term exploratory data analysis was coined by John W. Tukey, and it is the title of his landmark book, published in 1977. It is a very idiosyncratic book, jam-packed with ways to make calculations on and draw pictures of data with paper and pencil. It is full of opinions, such as the following:

Pictures based on the exploration of data should force their messages upon us. Pictures that emphasize what we already know—“security blankets” to reassure us—are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. The greatest value of a picture is when it forces us to notice what we never expected to see.

Such opinions communicate a wisdom learned from experience with data. The intensity of the writing, emphasized by bold and italic typeface, conveys practical advice on working with data. This integral component of Tukey's conceptualization of EDA is unfortunately missing from later treatments, which tend to make EDA look like a loose collection of ad hoc methods. In 2001, Salsburg published an easy-reading account of Tukey's contributions to EDA in the context of other major statistical developments of the previous century.

...
