Exploratory Data Analysis

Maria M. Pertl; David Hevey

doi:10.4135/9781412961288

In: Encyclopedia of Research Design
Chapter DOI: https://doi.org/10.4135/9781412961288.n143
Subject: Anthropology, Business and Management, Criminology and Criminal Justice, Communication and Media Studies, Counseling and Psychotherapy, Economics, Education, Geography, Health, History, Marketing, Nursing, Political Science and International Relations, Psychology, Social Policy and Public Policy, Social Work, Sociology, Technology, Medicine


Exploratory data analysis (EDA) is a data-driven conceptual framework for analysis that is based primarily on the philosophical and methodological work of John Tukey and colleagues, which dates back to the early 1960s. Tukey developed EDA in response to psychology's overemphasis on hypothetico-deductive approaches to gaining insight into phenomena, whereby researchers focused almost exclusively on the hypothesis-driven techniques of confirmatory data analysis (CDA). EDA was not developed as a substitute for CDA; rather, its application is intended to satisfy a different stage of the research process. EDA is a bottom-up approach that focuses on the initial exploration of data; a broad range of methods is used to develop a deeper understanding of the data, generate new hypotheses, and identify patterns in the data. In contrast, CDA techniques are of greater value at a later stage, when the emphasis is on testing previously generated hypotheses and confirming predicted patterns. Thus, EDA offers a different approach to analysis that can generate valuable information and provide ideas for further investigation.

Ethos

A core goal of EDA is to develop a detailed understanding of the data and to consider the processes that might produce such data. Tukey used the analogy of EDA as detective work because the process involves the examination of facts (data) for clues, the identification of patterns, the generation of hypotheses, and the assessment of how well tentative theories and hypotheses fit the data.

EDA is characterized by flexibility, skepticism, and openness. Flexibility is encouraged as it is seldom clear which methods will best achieve the goals of the analyst. EDA encourages the use of statistical and graphical techniques to understand data, and researchers should remain open to unanticipated patterns. However, as summary measures can conceal or misrepresent patterns in data, EDA is also characterized by skepticism. Analysts must be aware that different methods emphasize some aspects of the data at the expense of others; thus, the analyst must also remain open to alternative models of relationships.

If an unexpected data pattern is uncovered, the analyst can suggest plausible explanations, which are then investigated further using confirmatory techniques. EDA and CDA can supplement each other: whereas the abductive approach of EDA is flexible and open, allowing the data to drive subsequent hypotheses, the more ambitious and focused approach of CDA is hypothesis-driven and facilitates probabilistic assessments of predicted patterns. Thus, a balance is required between applying an exploratory lens and a confirmatory lens to the data; EDA comes first, and ideally, any given study should combine both.

Methods

EDA techniques are often classified in terms of the four Rs: revelation, residuals, reexpression, and resistance. However, it is not the use of a technique per se that determines whether it is EDA, but the purpose for which it is used—namely, to assist the development of rich mental models of the data.

Revelation

EDA encourages the examination of different ways of describing the data to understand inherent patterns and to avoid being fooled by unwarranted assumptions.

Data Description

Summary descriptive statistics offer a concise representation of data. EDA relies on resistant statistics, which are less affected by deviant cases. However, such statistics involve a tradeoff between concision and precision; therefore, an analyst should never rely exclusively on statistical summaries. EDA encourages analysts to examine data for skewness, outliers, gaps, and multiple peaks, as these can present problems for numerical measures of spread and location. Visual representations of data are required to identify such features and to inform subsequent analyses. For example, depending on their relationship to the rest of the data, outliers may be omitted or may become the focus of the analysis, a distribution with multiple peaks may be split into different distributions, and skewed data may be reexpressed. Inadequate exploration of the data distribution through visual representations can result in the use of descriptive statistics that are not characteristic of the entire set of values.
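To make the idea of resistance concrete, the following minimal sketch compares non-resistant summaries (mean, standard deviation) with resistant ones (median, interquartile range, median absolute deviation) before and after a single deviant case is introduced. The entry does not name specific statistics at this point, so the choice of median, IQR, and MAD is an illustrative assumption, as are the invented data values and the helper names classical_summary and resistant_summary.

```python
import statistics


def classical_summary(values):
    """Non-resistant summaries: a single extreme value can shift both."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


def resistant_summary(values):
    """Resistant summaries: based on ranks, so extreme values matter less."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    med = statistics.median(values)
    # Median absolute deviation: the median distance from the median.
    mad = statistics.median([abs(v - med) for v in values])
    return {"median": med, "iqr": q3 - q1, "mad": mad}


# Hypothetical scores, invented for illustration only.
clean = [12, 13, 13, 14, 15, 15, 16, 17, 18]
# Same scores, but the last one is recorded as 180 (e.g., a data-entry slip).
with_outlier = clean[:-1] + [180]

for label, data in (("clean", clean), ("with outlier", with_outlier)):
    print(label)
    print("  classical:", classical_summary(data))
    print("  resistant:", resistant_summary(data))
```

In this sketch the mean and standard deviation change markedly once the deviant case is added, whereas the median, IQR, and MAD are identical for both versions of the data. That stability is also why resistant summaries alone would not reveal the outlier, which is consistent with the entry's point that numerical summaries should be paired with visual inspection of the distribution.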

...
