Automated Linking of Historical Data

Working Paper: NBER ID: w25825

Authors: Ran Abramitzky; Leah Platt Boustan; Katherine Eriksson; James J. Feigenbaum; Santiago Pérez

Abstract: The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms use the same linking variables, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
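The tradeoff the abstract describes, between the false positive rate and the (true) match rate, can be made concrete with a small calculation. The sketch below is illustrative only and is not the authors' code. It assumes the match rate is the share of base records that receive any link, the true match rate is the share of base records linked correctly, and the false positive rate is the share of proposed links that disagree with a benchmark (for example, hand links). The names `LinkMetrics` and `evaluate_links` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Optional


@dataclass(frozen=True)
class LinkMetrics:
    match_rate: float           # share of base records given any link
    true_match_rate: float      # share of base records linked correctly
    false_positive_rate: float  # share of proposed links that are wrong


def evaluate_links(
    proposed: Dict[Hashable, Hashable],
    benchmark: Dict[Hashable, Optional[Hashable]],
    n_base: int,
) -> LinkMetrics:
    """Compare proposed links against a benchmark (assumed definitions).

    proposed  maps a base-record id to the target id the method chose.
    benchmark maps a base-record id to its benchmark target id, or None
              when the benchmark says the record has no match.
    n_base    is the number of base records the method tried to link.
    """
    if n_base <= 0:
        raise ValueError("n_base must be positive")

    n_links = len(proposed)
    # A link is correct only when the benchmark names the same target.
    n_correct = sum(
        1
        for base_id, target_id in proposed.items()
        if benchmark.get(base_id) is not None
        and benchmark[base_id] == target_id
    )
    n_wrong = n_links - n_correct

    return LinkMetrics(
        match_rate=n_links / n_base,
        true_match_rate=n_correct / n_base,
        false_positive_rate=(n_wrong / n_links) if n_links else 0.0,
    )


# Toy data: 10 base records, 4 proposed links, 1 of which is wrong.
toy_proposed = {1: "a", 2: "b", 3: "c", 4: "d"}
toy_benchmark = {1: "a", 2: "b", 3: "x", 4: "d"}
toy_metrics = evaluate_links(toy_proposed, toy_benchmark, n_base=10)
```

Under these assumed definitions, a more conservative method would shrink `proposed`, lowering both the match rate and, typically, the false positive rate. That is the frontier the abstract refers to.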

Keywords: automated linking; historical data; record linkage; census data

JEL Codes: C81


Causal Claims Network Graph

[Figure: network graph of the causal claims. Edges evidenced by causal inference methods are in orange; the rest are in light blue.]


Causal Claims

Cause | Effect
automated methods generate very low false positive rates (C52) | accuracy of automated methods (C52)
automated methods generate very low false positive rates (C52) | true match rates (C52)
same linking variables used by human linkers and algorithms (C45) | little disagreement in matches (C78)
stability of coefficient estimates across linked samples (C20) | reliability of automated methods for economic analyses (C80)
automated methods can effectively replicate results from hand-linked data (C80) | utilization of automated methods in economic research (C80)
