Exploratory data analysis (EDA) is a data-driven conceptual framework for analysis that is based primarily on the philosophical and methodological work of John Tukey and colleagues, which dates back to the early 1960s. Tukey developed EDA in response to psychology's overemphasis on hypothetico-deductive approaches to gaining insight into phenomena, whereby researchers focused almost exclusively on the hypothesis-driven techniques of confirmatory data analysis (CDA). EDA was not developed as a substitute for CDA; rather, its application is intended to satisfy a different stage of the research process. EDA is a bottom-up approach that focuses on the initial exploration of data; a broad range of methods is used to develop a deeper understanding of the data, generate new hypotheses, and identify patterns in the data. In contrast, CDA techniques are of greater value at a later stage when the emphasis is on testing previously generated hypotheses and confirming predicted patterns. Thus, EDA offers a different approach to analysis that can generate valuable information and provide ideas for further investigation.

Ethos

A core goal of EDA is to develop a detailed understanding of the data and to consider the processes that might produce such data. Tukey used the analogy of EDA as detective work because the process involves the examination of facts (data) for clues, the identification of patterns, the generation of hypotheses, and the assessment of how well tentative theories and hypotheses fit the data.

EDA is characterized by flexibility, skepticism, and openness. Flexibility is encouraged as it is seldom clear which methods will best achieve the goals of the analyst. EDA encourages the use of statistical and graphical techniques to understand data, and researchers should remain open to unanticipated patterns. However, as summary measures can conceal or misrepresent patterns in data, EDA is also characterized by skepticism. Analysts must be aware that different methods emphasize some aspects of the data at the expense of others; thus, the analyst must also remain open to alternative models of relationships.

If an unexpected data pattern is uncovered, the analyst can suggest plausible explanations, which are then investigated further using confirmatory techniques. EDA and CDA can supplement each other: Where the abductive approach of EDA is flexible and open, allowing the data to drive subsequent hypotheses, the more ambitious and focused approach of CDA is hypothesis-driven and facilitates probabilistic assessments of predicted patterns. Thus, a balance is required between the exploratory and confirmatory lenses applied to data; EDA comes first, and ideally, any given study should combine both.

Methods

EDA techniques are often classified in terms of the four Rs: revelation, residuals, reexpression, and resistance. However, it is not the use of a technique per se that determines whether it is EDA, but the purpose for which it is used—namely, to assist the development of rich mental models of the data.

Revelation

EDA encourages the examination of different ways of describing the data to understand inherent patterns and to avoid being fooled by unwarranted assumptions.

Data Description

The use of summary descriptive statistics offers a concise representation of data. EDA relies on resistant statistics, which are less affected by deviant cases. However, such statistics involve a tradeoff between concision and precision; therefore, an analyst should never rely exclusively on statistical summaries. EDA encourages analysts to examine data for skewness, outliers, gaps, and multiple peaks, as these can present problems for numerical measures of spread and location. Visual representations of data are required to identify such features and to inform subsequent analyses. For example, based on their relationship to the rest of the data, outliers may be omitted or may become the focus of the analysis, a distribution with multiple peaks may be split into different distributions, and skewed data may be reexpressed. Inadequate exploration of the data distribution through visual representations can result in the use of descriptive statistics that are not characteristic of the entire set of values.
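The difference between resistant and nonresistant summaries can be illustrated with a minimal sketch. The data values and the function name resistant_summary are hypothetical and chosen only for illustration. The 1.5 × IQR rule used to flag outliers is a common convention associated with Tukey's boxplot and is an assumption here, not something the entry specifies.

```python
import statistics


def resistant_summary(values):
    """Return classical and resistant summaries of location and spread.

    Hypothetical helper for illustration only.
    """
    # Quartiles computed with the "inclusive" method of the standard library.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: flag values beyond 1.5 * IQR from the quartiles.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),      # nonresistant location
        "stdev": statistics.stdev(values),    # nonresistant spread
        "median": statistics.median(values),  # resistant location
        "iqr": iqr,                           # resistant spread
        "outliers": [v for v in values if v < lower or v > upper],
    }


# Made-up scores; the second set replaces one value with a deviant case.
clean = [12, 13, 13, 14, 15, 15, 16, 17]
contaminated = clean[:-1] + [170]

summary_clean = resistant_summary(clean)
summary_contaminated = resistant_summary(contaminated)

for label, summary in (("clean", summary_clean),
                       ("contaminated", summary_contaminated)):
    print(label, summary)
```

In this sketch, the single deviant case moves the mean and standard deviation substantially, whereas the median and interquartile range are unchanged. The summaries alone do not show why the value is deviant, which is why the entry pairs them with visual inspection.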

...
