
Difficulty Index

When designing and developing educational and psychological tests, questionnaires, and assessments, measurement specialists attend directly to the qualities of the items making up the test or questionnaire. For example, it is important to know whether an item is too easy or too difficult for the intended audience and uses of the test. Thus, accurately estimating an item’s difficulty index is important for good measurement and high-quality test design.

Estimating Item Difficulty

Contemporary test development practices rely on two broad statistical approaches for estimating item difficulty: classical test theory (CTT) and item response theory (IRT). The CTT approach draws on traditional statistical methods for estimating item difficulty. In the CTT framework, the proportion of examinees answering an item correctly, or endorsing a particular response option on a questionnaire, serves as the difficulty index. This is referred to as an item’s p value, and it ranges from 0.0 to 1.0, with higher values indicating a greater proportion of examinees responding correctly to (or endorsing) the item. Despite the name "difficulty index," a higher p value therefore indicates an easier item. Depending on the design, say a criterion-referenced test versus a norm-referenced test, measurement specialists may create a test using items having a range of p values, seeking an appropriate mix of easy, moderately difficult, and very difficult test items. Thus, an item’s p value is one of the most useful, and most frequently reported, item statistics.
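The CTT calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `item_p_values` and the small `scores` matrix are hypothetical, and the sketch assumes each response has already been scored 1 (correct or endorsed) or 0.

```python
def item_p_values(responses):
    """Return the proportion of examinees scoring 1 on each item.

    `responses` is a list of examinee rows; each row holds one 0/1 score
    per item. (Hypothetical helper, written for illustration only.)
    """
    if not responses:
        raise ValueError("responses must contain at least one examinee")
    n_examinees = len(responses)
    n_items = len(responses[0])
    # For each item (column), divide the number of 1s by the sample size.
    return [
        sum(row[j] for row in responses) / n_examinees
        for j in range(n_items)
    ]


# Made-up data: four examinees (rows) by three items (columns).
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

p_values = item_p_values(scores)
print(p_values)  # [1.0, 0.5, 0.25]
```

In this made-up sample, every examinee answers the first item correctly (p = 1.0, a very easy item), whereas only one in four answers the third item correctly (p = 0.25, a harder item).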

Item p values, however, are highly sample dependent. The underlying or latent ability levels of the sample examinees interact with estimates of the difficulty of the test items. An item may appear to be much harder (or easier) in one sample of examinees than in another. As a consequence, CTT methods often lead to perplexing and unintended shifts in item difficulty estimates from sample to sample. Estimates of item difficulty that are independent of the ability levels of the sample examinees would be more helpful.
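A small simulation can make this sample dependence concrete. The sketch below is illustrative only and rests on stated assumptions: abilities in each sample are drawn from a normal distribution, and the chance of a correct answer rises with ability along a logistic curve. The function name `simulate_p_value` and all numeric settings are hypothetical.

```python
import math
import random


def simulate_p_value(mean_ability, difficulty, n_examinees=2000, seed=0):
    """Simulate one item's p value in a sample with the given mean ability.

    Assumption (illustration only): the probability of a correct answer
    is a logistic function of (ability - difficulty).
    """
    rng = random.Random(seed)
    n_correct = 0
    for _ in range(n_examinees):
        ability = rng.gauss(mean_ability, 1.0)  # one examinee's latent ability
        p_correct = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
        if rng.random() < p_correct:
            n_correct += 1
    return n_correct / n_examinees


# The same item (difficulty fixed at 0.0) given to two different samples.
p_low = simulate_p_value(mean_ability=-1.0, difficulty=0.0, seed=1)
p_high = simulate_p_value(mean_ability=1.0, difficulty=0.0, seed=2)
print(f"lower-ability sample:  p = {p_low:.2f}")
print(f"higher-ability sample: p = {p_high:.2f}")
```

Although the item itself is unchanged, its p value comes out lower in the lower-ability sample and higher in the higher-ability sample, which is the kind of shift the paragraph above describes.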

To address this problem, psychometric specialists working largely during the latter half of the 20th century developed a series of statistical methods referred to collectively as IRT. The IRT framework rests on the idea that measurement specialists are interested largely in measuring cognitive abilities, personality traits, and other psychological characteristics that are not directly observable, that is, latent. From this perspective, a test is simply a collection of items designed to measure a person’s level or standing on the latent trait. Thus, when designing the test, the developer is interested in how each individual item relates to the latent trait and how the group of items relates to that trait or ability. IRT models make the study of these relationships more tractable.

The IRT framework assumes the relationship between item performance and the latent ability can be modeled by a one-, two-, or three-parameter logistic function. For simplicity, the focus here is on the one-parameter model, whose single item parameter is the difficulty index. Typically, two assumptions underpin an IRT model. The first is that the test data are unidimensional (the test measures one primary construct or latent ability). The second concerns the mathematical (logistic) form of the item characteristic function, or item characteristic curve (ICC). Figure 1 shows the general form of item characteristic functions for the one-parameter logistic model.
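As a rough sketch of what a one-parameter logistic ICC looks like numerically, the Python snippet below evaluates the curve at a few ability levels. The symbols `theta` (latent ability) and `b` (item difficulty) and the function name `icc_1pl` are assumptions introduced here for illustration. The sketch also assumes the common form in which discrimination is fixed at 1 and no scaling constant is applied; other presentations of the model may parameterize it differently.

```python
import math


def icc_1pl(theta, b):
    """Probability of a correct response under a one-parameter logistic model.

    theta: examinee's latent ability; b: item difficulty on the same scale.
    Assumption: discrimination fixed at 1 and no scaling constant.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Evaluate three hypothetical items (easy, medium, hard) at five ability levels.
for b in (-1.0, 0.0, 1.0):
    curve = [round(icc_1pl(theta, b), 2) for theta in (-2, -1, 0, 1, 2)]
    print(f"b = {b:+.1f}: {curve}")
```

Under this form, an examinee whose ability equals the item's difficulty has a 0.5 probability of answering correctly, and a larger `b` shifts the whole curve toward higher ability levels without changing its shape.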

...
