Reliability

According to the Standards for Educational and Psychological Testing, reliability (also referred to as measurement precision) refers to the consistency of assessment results over independent administrations of the testing procedure. The assessment results can be examinees’ scores or raters’ ratings of examinees’ performances on an assessment. Reliability is a central concept in measurement and a necessary condition when building a validity argument. Indeed, if an assessment fails to yield consistent results, it is imprudent to make any inferences about what a score signifies. Reliability is high if the scores or ratings for each examinee are consistent over replications of the testing procedure. Reliability coefficients range from 0 to 1, with 0 indicating no reliability and 1 representing perfect reliability. There is no absolute critical value for acceptable reliability, as the need for precision depends on the stakes of the assessment. Typically, high-stakes assessments (e.g., college admission tests) necessitate higher reliability standards than low-stakes assessments (e.g., classroom examinations). This entry describes the most popular methods for estimating reliability as well as factors impacting reliability from both the classical and modern test theory perspectives.

Methods to Estimate Reliability

In classical test theory, the consistency of test scores is evaluated mainly through reliability coefficients, which are defined as the correlation between scores derived from replications of the test procedure on a sample of test takers. There are four broad categories of reliability coefficients: stability coefficients, equivalence coefficients, internal consistency coefficients, and coefficients based on interrater agreement. Each type of coefficient reflects the variability associated with a different data-collection design and a different interpretation or use of scores.

Stability Coefficients: The Test–Retest Method

The test–retest method, a measure of stability, is used to determine the consistency of the examinees’ scores on a test over time. The test–retest coefficient is obtained by correlating the scores from the same test administered twice to the same examinees under similar testing conditions. Carry-over effects and the interval of time between the two test administrations can influence the test–retest coefficient, so this method is most appropriate for tests measuring traits that are not susceptible to carry-over effects and that are stable across time intervals. In practice, the longer the time interval between administrations, the lower the estimated reliability tends to be.
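As a minimal sketch of the computation described above, the test–retest coefficient can be obtained as the Pearson correlation between the two sets of scores. The function name `pearson_r` and the score lists are hypothetical and chosen only for illustration; they are not taken from the entry.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))   # sum of cross-products
    ss_x = sum(a * a for a in dx)                # sum of squares for x
    ss_y = sum(b * b for b in dy)                # sum of squares for y
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary within each administration")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for six examinees on the same test given twice.
first_administration = [12, 15, 9, 20, 17, 11]
second_administration = [13, 14, 10, 19, 18, 10]

test_retest = pearson_r(first_administration, second_administration)
print(round(test_retest, 3))
```

The same correlation applies to the alternate forms design, with the second list holding scores on the parallel form rather than on a repeated administration.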

Equivalence Coefficients: The Alternate Forms Method

The alternate forms method, a measure of equivalence, is used to examine the consistency of two sets of scores on two parallel forms of a test. The alternate form coefficient is obtained by correlating the scores on parallel (or equivalent) forms of a test administered to the same examinees under similar conditions in close succession. That is, one form is administered to a group of examinees, followed (at a well-chosen close time point) by the administration of an alternate form. The quality or similarity of the parallel forms can influence the alternate form coefficient. In practice, if the forms are not parallel, the alternate form method produces low estimates of reliability.

Internal Consistency Coefficients: Split-Half, KR-20, and Coefficient α Methods

Measures of both stability and equivalence require two administrations of a test (or of its parallel forms), but administering two tests can be impractical or unnecessary. Internal consistency coefficients, which require a single test administration, are used to assess the consistency of the examinees’ responses to the items within a test. There are two broad classes of methods for estimating internal consistency coefficients. The first class is generally denoted as split-half procedures. The second class of methods requires an analysis of the variance–covariance structure of the item responses. With respect to the split-half methods, a test is administered to a group of examinees, then the test is split into two parallel halves, and the two sets of scores from the two halves are correlated. This half-test reliability estimate is then used to calculate the full-test reliability using the Spearman–Brown prophecy formula, which is written as

...
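As a rough sketch of the split-half procedure, the following assumes the commonly stated two-half form of the Spearman–Brown formula, full-test reliability = 2r / (1 + r), where r is the correlation between the two half-test scores. The odd/even item split, the function names, and the item responses are all hypothetical choices made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cross = sum(a * b for a, b in zip(dx, dy))
    return cross / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def spearman_brown(r_half):
    """Step a half-test correlation up to a full-test estimate (assumed form)."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Return (half-test correlation, stepped-up full-test estimate).

    item_scores holds one row per examinee and one column per item.
    The halves are formed from odd- and even-numbered items, which is
    only one of several possible ways to split a test.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_half, even_half)
    return r_half, spearman_brown(r_half)


# Hypothetical right/wrong (1/0) responses: six examinees, six items.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print(round(r_half, 3), round(r_full, 3))
```

For any positive half-test correlation below 1, the stepped-up estimate is larger than the half-test correlation, reflecting that the full test is twice as long as either half.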
