A sparse table is a cross-classification of observations by two or more discrete variables that has many cells with small or zero frequencies. Sparse contingency tables occur most often when the total number of observations is small relative to the number of cells. For example, consider a table with N = 4 cells and n = 12 observations. If the observations are spread evenly over the four cells, then every cell frequency is small (i.e., n/N = 3). Furthermore, if the occurrence of observations in one of the four cells is a rare event, then a large sample would be needed to obtain observations in this cell. As a second example, consider a cross-classification of four variables, each with seven categories, which has N = 7^4 = 2,401 cells. A sample size considerably larger than 2,401 is required to ensure that all cells contain nonzero frequencies, and a much larger sample size is needed to ensure that all frequencies are large enough for statistical tests and modeling.
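
The arithmetic in these two examples can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the four-cell table is assumed to be 2 x 2, and the empty-cell calculation assumes that every cell is equally likely, which real data rarely satisfy.

```python
from math import prod

def cell_count(levels):
    """Number of cells N in a cross-classification, given categories per variable."""
    return prod(levels)

def mean_cell_frequency(n_obs, levels):
    """Average frequency per cell, n/N."""
    return n_obs / cell_count(levels)

def expected_empty_cells(n_obs, levels):
    """Expected number of empty cells, ASSUMING all cells are equally likely.

    Each cell is missed by a single observation with probability 1 - 1/N,
    so it is empty after n independent observations with probability (1 - 1/N)**n.
    """
    n_cells = cell_count(levels)
    return n_cells * (1 - 1 / n_cells) ** n_obs

# First example: N = 4 cells (assumed here to be a 2 x 2 table), n = 12.
print(cell_count([2, 2]), mean_cell_frequency(12, [2, 2]))

# Second example: four variables with seven categories each, N = 7^4 = 2,401.
seven_way = [7, 7, 7, 7]
print(cell_count(seven_way))

# Even with n = N observations, many cells are expected to stay empty.
print(round(expected_empty_cells(2401, seven_way)))
```

Under the equal-probability assumption, roughly a third of the 2,401 cells are still expected to be empty when n = N, which is consistent with the point that the sample must be considerably larger than the number of cells.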

Table 1 Example of a Two-Way Table With Sampling Zeros

                                        Genetic Testing: More Harm or Good?
How Much Know About Genetic Tests?   More Good   More Harm   It Depends   Total
Great Deal                                  54          10            0      64
Not Very Much                              170          88            0     258
Nothing at All                              17          14            0      31
Total                                      241         112            0     353

Sparseness invalidates standard statistical hypothesis tests, such as chi-square tests of independence or of model goodness of fit. The justification for comparing test statistics (e.g., Pearson's chi-square statistic or the likelihood ratio statistic) to a chi-square distribution depends critically on having “large” samples, where “large” conventionally means that the expected values for the cells are greater than or equal to 5. Without large samples, the probability distribution with which test statistics should be compared is unknown. Possible solutions to this problem include using an alternative statistic, performing exact tests, or approximating the probability distribution of test statistics by resampling or Monte Carlo methods.
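
As a rough illustration of the expected-value check and of the resampling idea, the following Python sketch uses the counts from Table 1. The function names are hypothetical, skipping cells whose expected value is zero is an assumed convention, and the permutation scheme (shuffling column labels with both margins held fixed) is just one of several possible Monte Carlo approaches, not necessarily the one a given software package uses.

```python
import random

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Pearson's chi-square statistic; cells with expected count 0 are skipped
    (an assumed convention, since (obs - exp)^2 / exp is undefined there)."""
    stat = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for obs, exp in zip(obs_row, exp_row):
            if exp > 0:
                stat += (obs - exp) ** 2 / exp
    return stat

def monte_carlo_p_value(table, n_resamples=500, seed=0):
    """Approximate the null distribution of the statistic by permuting
    column labels across observations, which keeps both margins fixed."""
    rng = random.Random(seed)
    # Expand the table into one (row label, column label) pair per observation.
    rows = [i for i, row in enumerate(table) for count in row for _ in range(count)]
    cols = [j for row in table for j, count in enumerate(row) for _ in range(count)]
    observed = pearson_chi_square(table)
    n_rows, n_cols = len(table), len(table[0])
    at_least_as_large = 0
    for _ in range(n_resamples):
        rng.shuffle(cols)
        resampled = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            resampled[i][j] += 1
        if pearson_chi_square(resampled) >= observed - 1e-12:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (n_resamples + 1)

# Counts from Table 1 (rows: Great Deal, Not Very Much, Nothing at All).
table1 = [[54, 10, 0], [170, 88, 0], [17, 14, 0]]

# Flag cells whose expected value falls below the conventional threshold of 5.
small_cells = [(i, j)
               for i, row in enumerate(expected_counts(table1))
               for j, value in enumerate(row) if value < 5]
print(small_cells)
print(monte_carlo_p_value(table1))
```

For Table 1, the flagged cells are the three in the “It Depends” column, whose expected values are zero because the column margin is zero.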

Sparse tables often contain zero frequencies, which can cause estimation problems, including biased descriptive statistics (e.g., the odds ratio), difficulty estimating log-linear model parameters, and difficulties for computational algorithms that fit models to data. Whether an estimation problem exists depends on the pattern of zero frequencies in the data and the particular model being estimated. Parameters cannot be estimated when there are zeros in the corresponding margins. For example, Table 1 consists of the cross-classification of responses to the following two questions from the 1996 General Social Survey (http://www.icpsr.umich.edu:8080/GSS/homepage.htm): (a) “Based on what you know, do you think genetic testing will do more good than harm?” with possible responses of “more good than harm,” “more harm than good,” and “it depends”; and (b) “How much would you say that you have heard or read about genetic screening?” with possible responses of “a great deal,” “not very much,” and “nothing at all.” None of the respondents answered “It depends” to the first question. For the independence log-linear model, a parameter for the marginal effect for “It depends” therefore cannot be estimated: the information needed to estimate this parameter is the column marginal value, which has no observations.
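
The effect of the empty column can be sketched numerically. Under the independence model, the fitted value for a cell is the row total times the column total divided by n, so the column effects are tied to the logarithms of the column margins, and the logarithm of a zero margin has no finite value. The Python below is a minimal illustration using the Table 1 counts; the function names are hypothetical, and returning None to signal "no finite estimate" is an assumed convention.

```python
import math

# Counts from Table 1 (rows: Great Deal, Not Very Much, Nothing at All).
table1 = [[54, 10, 0], [170, 88, 0], [17, 14, 0]]
column_labels = ["More Good", "More Harm", "It Depends"]

def log_column_margins(table):
    """Log of each column's share of the sample; None where the margin is zero."""
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(col_totals)
    return [math.log(c / n) if c > 0 else None for c in col_totals]

def sample_odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for a 2 x 2 table [[a, b], [c, d]].

    Returns None when the denominator is zero, i.e., no finite value exists.
    """
    if b * c == 0:
        return None
    return (a * d) / (b * c)

for label, value in zip(column_labels, log_column_margins(table1)):
    print(label, value)

# Rows "Great Deal" and "Not Very Much", columns "More Good" and "More Harm".
print(sample_odds_ratio(54, 10, 170, 88))

# Same rows, columns "More Harm" and "It Depends": zeros leave it undefined.
print(sample_odds_ratio(10, 0, 88, 0))
```

The first odds ratio is computable because all four cells are nonzero; the second is not, which illustrates how zero frequencies can undermine even simple descriptive statistics.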

...
