Record linkage, or exact matching, refers to the activity of linking together two or more databases on a single population. The U.S. Bureau of the Census uses record linkage in its efforts to estimate the population undercount of the decennial census. The two files that the Census Bureau links together are a sample of the decennial census and a second, independent enumeration of the population areas covered by the sample. Some individuals are counted in both the census and the second enumeration, whereas others are absent from one or both of the canvasses. Suppose that the numbers of individuals enumerated are as given in Table 1. The question marks indicate counts that are not known.

Table 1 Counts From Two Enumerations Based on Record Linkage
                              Second Enumeration
                              Yes         No      Total
Census Enumeration   Yes      n_yy        n_yn    n_census
                     No       n_ny        ?       ?
                     Total    n_second    ?       ?

The total size of the population can be estimated if assumptions about the two enumeration efforts and the population are made. Under the standard assumptions of capture-recapture models, the total size of the population can be estimated as (n_census × n_second) / n_yy, where n_census and n_second are the totals counted in the census sample and the second enumeration, and n_yy is the number of individuals counted in both. If 250 people were counted in the census sample, 200 were counted in the second enumeration, and 125 were common to both lists, then the population size would be estimated as 250(200)/125 = 400. However, if only 100 people were common to both lists, then one would estimate the population size to be 250(200)/100 = 500.
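The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the ratio given in the text (often called the Lincoln-Petersen estimator); the function name `estimate_population` and its argument names are hypothetical and chosen here for readability.

```python
def estimate_population(n_census: int, n_second: int, n_both: int) -> float:
    """Estimate total population size as n_census * n_second / n_both.

    n_census: individuals counted in the census sample
    n_second: individuals counted in the second enumeration
    n_both:   individuals linked to both lists (n_yy in Table 1)
    """
    if n_both <= 0:
        # With no linked records the ratio is undefined.
        raise ValueError("n_both must be positive")
    return n_census * n_second / n_both


# The two worked examples from the text.
print(estimate_population(250, 200, 125))  # 400.0
print(estimate_population(250, 200, 100))  # 500.0
```

Note how sensitive the estimate is to the number of linked records: missing 25 true matches moves the estimate from 400 to 500, which is why the quality of the linkage matters.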

Record linkage is challenging when the sizes of the files being linked are very large and unique identifying information on every individual is not available. Examples of unique identifiers (IDs) include Social Security numbers (SSNs); U.S. passport numbers; state driver's license numbers; and, except for identical twins, a person's genetic code. The decennial census does not collect SSNs or any other unique ID number. The number of people in the census undercount sample is a few hundred thousand. Thus, record linkage in this context needs to be computerized and automated.

Entries in the two databases are compared on the fields of information common to the two files. Consider the following hypothetical records in two files, called File A and File B:

File A Record            File B Record
Wayne Feller             W. A. Fuller
Male, Married, Age 70    Male, Married, Age 71
202 Snedecor Rd.         202 Snedecor, Apt. 3
Ames, Iowa               Aimes, IA

These records, although they contain clear differences, could correspond to the same person. Alternate versions of names and addresses, nicknames and abbreviations, and misspellings and typographical errors are frequently encountered in large population databases. The U.S. Bureau of the Census and other U.S. and foreign statistical agencies use sophisticated methods to address these and other challenges.
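To make the idea of field-by-field comparison concrete, here is a minimal sketch that scores the File A and File B records above. It is not the method used by the Census Bureau or any statistical agency; it simply uses the generic string similarity ratio from Python's standard `difflib` module as a stand-in. The field names, the helper functions `similarity` and `compare_records`, and the one-year age tolerance are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1], ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def compare_records(rec_a: dict, rec_b: dict, age_tolerance: int = 1) -> dict:
    """Compare two records on their shared fields and return per-field scores."""
    scores = {}
    for field in rec_a.keys() & rec_b.keys():
        a, b = rec_a[field], rec_b[field]
        if field == "age":
            # Assumption: ages within one year count as agreeing,
            # since the two canvasses may be taken at different times.
            scores[field] = 1.0 if abs(a - b) <= age_tolerance else 0.0
        else:
            scores[field] = similarity(a, b)
    return scores


# The hypothetical records from the text, split into illustrative fields.
record_a = {"name": "Wayne Feller", "sex": "Male", "marital": "Married",
            "age": 70, "street": "202 Snedecor Rd.", "city": "Ames, Iowa"}
record_b = {"name": "W. A. Fuller", "sex": "Male", "marital": "Married",
            "age": 71, "street": "202 Snedecor, Apt. 3", "city": "Aimes, IA"}

scores = compare_records(record_a, record_b)
for field in sorted(scores):
    print(f"{field:8s} {scores[field]:.2f}")
```

Fields that agree exactly score 1.0, while the name, street, and city fields receive partial scores. An automated system would still need a rule for combining such scores into a link or non-link decision, which this sketch does not attempt.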

...
