# Reliability

According to the Standards for Educational and Psychological Testing, reliability (also referred to as measurement precision) refers to the consistency of assessment results over independent administrations of the testing procedure. The assessment results can be examinees’ scores or raters’ ratings of examinees’ performances on an assessment. Reliability is a central concept in measurement and a necessary condition when building a validity argument. Indeed, if an assessment fails to yield consistent results, it is imprudent to make any inferences about what a score signifies. Reliability is high if the scores or ratings for each examinee are consistent over replications of the testing procedure. Reliability coefficients range from 0 to 1, with 0 indicating a complete lack of consistency and 1 representing perfect reliability. There is no absolute critical value for acceptable reliability, as the need for precision depends on the stakes of the assessment. Typically, high-stakes assessments (e.g., college admission tests) necessitate higher reliability standards than low-stakes assessments (e.g., classroom examinations). This entry describes the most popular methods for estimating reliability as well as factors impacting reliability from both the classical and modern test theory perspectives.

### Methods to Estimate Reliability

In classical test theory, the consistency of test scores is evaluated mainly in terms of reliability coefficients, defined as the correlation between scores derived from replications of the testing procedure on a sample of test takers. There are four broad categories of reliability coefficients: stability coefficients, equivalence coefficients, internal consistency coefficients, and coefficients based on interrater agreement. Each type of coefficient reflects the variability associated with different data-collection designs and interpretations or uses of scores.

### Stability Coefficients: The Test–Retest Method

The test–retest method, a measure of stability, is used to determine the consistency of the examinees’ scores on a test over time. The test–retest coefficient is obtained by correlating the scores of identical tests administered to the same examinees twice under similar testing conditions. Carry-over effects and the interval of time between the two test administrations can influence the test–retest coefficient, so this method is most appropriate for tests measuring traits that are not susceptible to carry-over effects and that are stable across time intervals. In practice, the longer the time interval between administrations, the lower the estimated reliability.

### Equivalence Coefficients: The Alternate Forms Method

The alternate forms method, a measure of equivalence, is used to examine the consistency of two sets of scores on two parallel forms of a test. The alternate form coefficient is obtained by correlating the scores on parallel (or equivalent) forms of a test administered to the same examinees under similar conditions in close succession. That is, one form is administered to a group of examinees, followed after a short interval by the administration of an alternate form. The quality or similarity of the parallel forms can influence the alternate form coefficient. In practice, if the forms are not parallel, the alternate forms method produces low estimates of reliability.

### Internal Consistency Coefficients: Split-Half, KR-20, and Coefficient α Methods

Both measures of stability and equivalence require two administrations of (or parallel forms of) a test, but the administration of two tests can be impractical or unnecessary in reality. Internal consistency coefficients, which require a single test administration, are used to assess the consistency of the examinees’ responses to the items within a test. There are two broad classes of methods for estimating internal consistency coefficients. The first class is generally denoted as split-half procedures. The second class of methods requires an analysis of the variance–covariance structure of the item responses. With respect to the split-half methods, a test is administered to a group of examinees, then the test is split into two parallel halves, and the two sets of scores from the two split halves are correlated. This half-test reliability estimate is then used to calculate the full test reliability using the Spearman-Brown prophecy formula, which is written as follows:

$${\rho}_{XX^{\prime}}=\frac{2{\rho}_{AB}}{1+{\rho}_{AB}},$$

where ${\rho}_{XX^{\prime}}$ is the reliability projected for the full-length test, and ${\rho}_{AB}$ is the correlation between the half-tests A and B.
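The split-half projection above is a one-line computation; a minimal sketch (the function name is illustrative):

```python
def spearman_brown_double(r_half):
    """Project full-test reliability from the correlation between two half-tests."""
    return 2 * r_half / (1 + r_half)

# A half-test correlation of 0.75 projects to roughly 0.86 for the full test.
print(round(spearman_brown_double(0.75), 2))
```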

When calculating reliability based on item covariances, the two most widely used procedures are KR-20 (Kuder–Richardson Formula 20) and coefficient α (often referred to as Cronbach’s α). Coefficient α $\left(\widehat{\alpha}\right)$ is computed by

$$\widehat{\alpha}=\frac{k}{k-1}\left(1-\frac{{\displaystyle \sum {\widehat{\sigma}}_{i}^{2}}}{{\widehat{\sigma}}_{X}^{2}}\right),$$

where k is the number of items on the test, ${\widehat{\sigma}}_{i}^{2}$ is the variance of item i, and ${\widehat{\sigma}}_{X}^{2}$ is the total test variance. KR-20, a special case of coefficient α for dichotomously scored items only, is also based on the proportion of persons passing each item and the standard deviation of the scores.
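Coefficient α can be computed directly from an examinees-by-items score matrix. A minimal sketch using NumPy (the function name is illustrative); note that with dichotomous (0/1) item scores, this same computation yields KR-20:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha from an examinees-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total test scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Perfectly consistent items yield alpha = 1.0.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```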

### Coefficients Based on Interrater Agreement: Interrater Method

The interrater method, a measure of consistency of ratings, is used to examine the consistency of observed performances over different raters or observers. It is obtained by having two or more observers rate a performance of any kind and calculating the percentage of agreement between observations. The interrater approach is the preferred method when calculating the reliability of assessments/performances such as constructed responses, speeches, debates, or musical performances. Variation among raters and variability in the interpretation of assessment results are the two potential sources of error influencing interrater reliability.
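The percentage-of-agreement computation described above amounts to counting matched ratings; a minimal sketch (the function name is illustrative):

```python
def percent_agreement(ratings_a, ratings_b):
    """Proportion of performances on which two raters assign the same rating."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

# Two raters agree on 3 of 4 essays -> 0.75 agreement.
print(percent_agreement([3, 2, 4, 1], [3, 2, 3, 1]))
```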

### Factors Affecting Reliability

In this section, the factors that impact the reliability of assessment results are discussed. Although individual characteristics (e.g., motivation, fatigue, health, and ability) as well as the quality of assessment itself (e.g., clarity of instructions and test difficulty) inevitably impact all reliability estimates, here, the focus is on the three most widely cited sources of error with respect to reliability.

### Test Length

Generally speaking, the longer the measure is, the more reliable the measure is. As test length increases, the proportion of the student’s score that can likely be attributed to error decreases. For example, low-ability students may answer a single item correctly, even if guessing; however, it is much less likely that low-ability students will correctly answer all items on a 20-item test via guessing. The use of longer measures minimizes the impact of any single error or lucky guess. Other test characteristics being equal (e.g., item quality), a measure with 40 items should have higher reliability than one with 20 items. The relationship between reliability and test length can be mathematically shown in the Spearman-Brown prophecy formula mentioned previously. The formula is based on the assumption that, when tests are shortened or lengthened, items of comparable content and statistics to those already in the test are deleted or added. For example, if the reliability of a 20-item test is determined to be 0.75, and the length of the test is doubled by adding items of comparable content and statistics, then the predicted reliability of the new test would be

$$\frac{2\times 0.75}{1+0.75}\approx 0.86.$$
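The doubling case extends to any length change via the general form of the Spearman-Brown prophecy formula, in which the factor n is the ratio of new to old test length. A minimal sketch (the function name is illustrative):

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by the given factor
    (general Spearman-Brown: n*rho / (1 + (n - 1)*rho))."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Doubling a 20-item test with reliability 0.75 predicts about 0.86.
print(round(spearman_brown(0.75, 2), 2))
```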

### Spread of Scores

Because reliability is sample dependent, all other factors being equal, the greater the spread of scores, the higher the reliability estimate. Indeed, larger reliability coefficients result when examinees remain in the same relative position in a group across multiple administrations of an assessment. To be sure, errors of measurement have less influence on the relative position of individuals when the differences among group members are large (when there is a large spread of scores). Consequently, anything that reduces the possibility of shifting positions in the group (e.g., a heterogeneous sample of examinees) also contributes to larger reliability coefficients.

### Objectivity of Scoring

The objectivity of scoring influences reliability in the sense that the error introduced by the scoring procedure varies with respect to the extent that human judgment is required. With objective items such as multiple-choice or matching items, the scoring presents little opportunity for the introduction of human error. Constructed response items and performance assessments, however, often involve the subjective judgments of human raters or scorers. Consequently, they are subject to different degrees of scoring error, depending on the nature of the question and the scoring procedures. For example, short-answer constructed response items tend to be more objectively scoreable than longer, more complex student responses (e.g., essays) and products (e.g., projects).

### Standard Error of Measurement (SEM)

Within a classical test theory framework, an examinee’s observed test score (X) is composed of two parts: the true score (T) and the error score (E):

$$X=T+E$$

The true score can be interpreted as the average of the observed scores obtained over an infinite number of repeated administrations with the same test or parallel forms of the test. The error score is the difference between the observed test score and the true score.

The SEM is an estimate of the extent to which an examinee’s scores vary across administrations. For example, for a group of examinees, each individual has a true score and several possible observed scores around the individual’s true score. Theoretically, each examinee’s personal distribution of possible observed scores around the examinee’s true score has a standard deviation. The SEM is the average of these individual error standard deviations for the group.

Another way of thinking about reliability is that it refers to the extent to which students’ scores are free from errors of measurement. Assuming errors are random and independent, the observed score variance $\left({\sigma}_{X}^{2}\right)$ can be decomposed into the variance in true scores $\left({\sigma}_{T}^{2}\right)$ and the variance in the errors of measurement $\left({\sigma}_{E}^{2}\right)$. The reliability coefficient (or the correlation between two measures of the same trait) can also be mathematically defined as the ratio of true score variance to observed score variance. The SEM (σE) is a function of the standard deviation of observed scores (σX) and the reliability coefficient $\left({\rho}_{XX^{\prime}}\right)$:

$${\sigma}_{E}={\sigma}_{X}\sqrt{1-{\rho}_{XX^{\prime}}}.$$

Note that as the reliability coefficient increases, the SEM decreases.
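The relationship among the observed-score standard deviation, the reliability coefficient, and the SEM is a direct computation; a minimal sketch (the function name is illustrative):

```python
import math

def standard_error_of_measurement(sd_observed, reliability):
    """Classical SEM: sigma_E = sigma_X * sqrt(1 - rho_XX')."""
    return sd_observed * math.sqrt(1 - reliability)

# With sd = 10 and reliability = 0.91, the SEM is 3.0.
print(standard_error_of_measurement(10, 0.91))
```

As the usage shows, even a fairly reliable test retains nontrivial measurement error on the score scale.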

### Classification Consistency and Accuracy

Decision consistency (DC) refers to the extent to which classifications of examinees agree across two independent administrations of the same exam or two parallel forms of an exam. Decision accuracy (DA) refers to the extent to which the actual classifications based on observed scores agree with the “true” classifications. DC and DA are important for assessments whose purpose is to classify examinees into performance categories (as is often the purpose of criterion-referenced tests). Just as classical reliability concerns the consistency of overall assessment results, consistency of students’ classifications is also a necessary condition when building a validity argument for criterion-referenced tests. Without certain confidence in the consistency of students’ classifications, any inferences based on the classifications would be dubious.

### Methods to Estimate DC and DA

When estimating DC and DA, the two most common indices are the agreement index P and Cohen’s κ. The agreement index P is defined as the proportion of times that the same decision would be made based on two parallel forms of a test. It can be expressed as

$$P={\displaystyle \sum _{j=1}^{J}{P}_{jj}},$$

where J is the number of performance categories, and Pjj is the proportion of examinees consistently classified into the jth category across the two administrations or forms of a test. If Form 1 is one set of observed scores, and Form 2 is replaced with the true scores or another criterion measure, then P becomes the DA index. To get a more interpretable measure of decision-making consistency, Cohen’s κ can be computed as follows:

$$\begin{array}{c}\kappa =\frac{{P}_{0}-{P}_{C}}{1-{P}_{C}},\\ {P}_{0}={\displaystyle \sum _{j=1}^{J}{P}_{jj}},\\ {P}_{C}={\displaystyle \sum _{j=1}^{J}{P}_{j\cdot}{P}_{\cdot j}},\end{array}$$

where P0 is the observed proportion of agreement, PC is the proportion of agreement expected by chance, Pjj is the proportion of examinees consistently classified into the jth category, and ${P}_{j\cdot}$ and ${P}_{\cdot j}$ are the marginal proportions of examinees falling in the jth category on the first and second administrations of the test, respectively.

κ can be thought of as the proportion of agreement above and beyond that which can be expected by chance alone. κ has a value between −1 and 1. A value of 0 indicates that the decisions are no more consistent than decisions based on two tests that are statistically independent; in other words, the consistency is at chance level, and negative values indicate even less consistency than chance. A value of 1 indicates that the decisions are as consistent as decisions based on two tests that have perfect agreement.
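Both indices can be computed from a J × J table of joint classification proportions (or counts) across the two administrations; a minimal sketch using NumPy (the function name is illustrative):

```python
import numpy as np

def decision_indices(joint):
    """Agreement index P and Cohen's kappa from a J x J joint classification table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalize counts to proportions
    p0 = np.trace(joint)                        # observed agreement (sum of P_jj)
    pc = float(joint.sum(axis=1) @ joint.sum(axis=0))  # chance agreement from marginals
    kappa = (p0 - pc) / (1 - pc)
    return p0, kappa

# 100 examinees, 80 classified the same way on both forms.
print(decision_indices([[40, 10], [10, 40]]))   # P = 0.8, kappa = 0.6
```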

### Reliability From Item Response Theory (IRT) Perspective

Unlike classical reliability, which uses a single value to describe a measure’s average reliability, reliability in IRT is not uniform across the entire range of proficiency levels. Scores at both ends of the proficiency distribution generally have more error associated with them than scores at its center. IRT emphasizes the examination of item and test information in lieu of classical reliability. In mathematical statistics, the term (Fisher) information conveys a similar, but more technical, meaning: it expresses the precision with which a parameter can be estimated, defined as the reciprocal of the variance of the estimate. For instance, in IRT, an interest is in estimating the value of the ability parameter (θ) of an examinee, which is denoted by $\widehat{\theta}$. An ability estimate has a variance ${\sigma}^{2}(\widehat{\theta})$, which reflects the precision with which a given ability level can be estimated. The amount of information (I) at a given ability level is the reciprocal of this variance and can be shown as follows:

$$I(\theta )=\frac{1}{{\sigma}^{2}(\widehat{\theta}\,|\,\theta )}.$$

The higher the information at a given ability level, the more precisely ability is estimated at that level.

Under IRT, each item on a test measures the proficiency level or ability of an examinee. Therefore, the amount of information for any single item can be computed at any ability level. The mathematical definition of the amount of item information depends upon the particular IRT model employed. For the one-parameter logistic and Rasch models, the item information is a function of the item difficulty parameter. For the two-parameter logistic model, the item information is a function of the item discrimination and item difficulty parameters, whereas for the three-parameter logistic model, the item information is a function of item discrimination, item difficulty, and pseudo-guessing parameters. Generally speaking, item information functions tend to have a bell shape. Highly discriminating items have tall, narrow information functions that provide considerable information but over a narrow range (Figure 1), whereas less discriminating items provide less information over a wider range (Figure 2). The highest item information of Item 1 is 1, whereas the highest item information of Item 2 is 0.25.

### Figure 1 Item information function for Item 1

Note: This item is simulated using a 2PL model with an item discrimination parameter of 2.0 and an item difficulty parameter of 1.0 on the logistic scale.

### Figure 2 Item information function for Item 2

Note: This item is simulated using a 2PL model with an item discrimination parameter of 1.0 and an item difficulty parameter of 1.0 on the logistic scale.
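The maximum information values quoted for these two items can be checked numerically. A minimal sketch (the function name is illustrative) using the standard 2PL information function on the logistic metric, $I(\theta) = a^{2}P(\theta)\left(1-P(\theta)\right)$, which peaks at θ = b with maximum a²/4:

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item on the logistic metric: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta = np.linspace(-4, 4, 801)                      # ability grid in 0.01 steps
print(item_information_2pl(theta, 2.0, 1.0).max())   # Item 1: ~1.0 (a^2/4 = 1.0)
print(item_information_2pl(theta, 1.0, 1.0).max())   # Item 2: ~0.25 (a^2/4 = 0.25)
```

Both maxima occur at θ = 1.0, the common item difficulty, matching Figures 1 and 2.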

Because items are conditionally independent of one another given an individual’s ability, the test information function (TIF) is simply the sum of the information of all items on a test. For a test composed of the two items above, the TIF looks like that shown in Figure 3.

### Figure 3 Test information function

Note: The TIF of this test is composed of two items: one with item discrimination of 2.0 and item difficulty of 1.0, and the other one with item discrimination of 1.0 and item difficulty of 1.0.

The maximum of the TIF is 1.25 (the sum of the maximum item information of Items 1 and 2), and the TIF is modal around θ = 1.0, the item difficulty of both items.

The conditional SEM, obtained from the test information at a given trait level (θ), is computed as follows:

$${\sigma}_{E}\,|\,\theta =\sqrt{\frac{1}{\text{TIF}(\theta )}}.$$

The aggregate SEM, which is analogous to the SEM from the CTT perspective, is obtained as follows:

$${\sigma}_{E}=\sqrt{\frac{1}{\text{TIF}}}.$$

That is, the measurement error is equal to the square root of the reciprocal of the test information and it is interpreted in the same way as the traditional SEM. With a large item bank, TIFs can be manipulated to control measurement error very precisely because the TIF shows the degree of precision at each individual proficiency level.
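For the two-item example, the TIF and the conditional SEM it implies can be sketched numerically (function and variable names are illustrative; 2PL information on the logistic metric is assumed):

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item on the logistic metric: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

theta = np.linspace(-4, 4, 801)
# TIF = sum of the two item information functions (a = 2.0 and a = 1.0, both b = 1.0).
tif = item_information_2pl(theta, 2.0, 1.0) + item_information_2pl(theta, 1.0, 1.0)
csem = 1.0 / np.sqrt(tif)                     # conditional SEM at each theta
print(tif.max())                              # ~1.25, peaking near theta = 1.0
print(csem.min())                             # ~0.894, i.e., 1/sqrt(1.25)
```

Note how the conditional SEM is smallest exactly where the test delivers the most information.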

### Final Thoughts

The reliability of a measure, as a precursor to establishing test score validity, is a critical consideration. Reliability and the SEM can be obtained from both classical and IRT perspectives, and they are conceptually the same. The choice of method for establishing an assessment’s reliability should be determined in light of the data-collection design (e.g., two test administrations or a single administration, the same test or parallel forms) and the intended interpretation and/or use of scores (e.g., stability, equivalence, internal consistency, or classification consistency). The level of precision required depends on both the purpose and the stakes of the assessment. To ensure reliable results when designing assessments, one should encourage test takers to perform their best, make scoring criteria readily available to test takers and raters (when appropriate), allow enough time, and include enough items. Ultimately, the purpose of any assessment is to provide meaningful feedback about what examinees know and are able to do. Well-developed assessments yielding consistent results are key to this goal.

See also Classical Test Theory; Internal Consistency; Item Response Theory; Split-Half Reliability; Test Information Function; Test–Retest Reliability; Validity
