Combining Family History and Machine Learning to Link Historical Records

Working Paper: NBER ID: w26227

Authors: Joseph Price; Kasey Buckles; Jacob Van Leeuwen; Isaac Riley

Abstract: A key challenge for research on many questions in the social sciences is that it is difficult to link historical records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we develop a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. First, we use these “true” links to inform the decisions one needs to make when using traditional linking methods. Second, we use the links to construct a training data set for use in supervised machine learning methods. We describe the procedure we use and illustrate the potential of our approach by linking individuals across the 100% samples of the US decennial censuses from 1900, 1910, and 1920. We obtain an overall match rate of about 70 percent, with a false positive rate of about 12 percent. This combination of high match rate and accuracy represents a point beyond the current frontier for record linking methods.
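The two uses of family-tree links described in the abstract (as labeled training data and as a benchmark for evaluating links) can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Record` fields, the feature set, the helper names, and the toy data are all hypothetical. It also assumes that "match rate" means the share of base-census records that receive a link, and that "false positive rate" means the share of predicted links that disagree with the known link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A simplified census record; the fields are illustrative assumptions."""
    record_id: str
    first_name: str
    last_name: str
    birth_year: int
    birthplace: str


def pair_features(a: Record, b: Record) -> dict:
    """Compare two records from different censuses on a few simple features."""
    return {
        "first_name_match": int(a.first_name.lower() == b.first_name.lower()),
        "last_name_match": int(a.last_name.lower() == b.last_name.lower()),
        "birth_year_gap": abs(a.birth_year - b.birth_year),
        "birthplace_match": int(a.birthplace.lower() == b.birthplace.lower()),
    }


def build_training_pairs(census_a, census_b, tree_links):
    """Label each candidate pair 1 if the family tree links it, else 0.

    tree_links is a set of (id_in_a, id_in_b) tuples standing in for the
    contributor-created links.
    """
    rows = []
    for a in census_a:
        for b in census_b:
            label = int((a.record_id, b.record_id) in tree_links)
            rows.append((pair_features(a, b), label))
    return rows


def evaluate(predicted_links, true_links, n_base_records):
    """Return (match_rate, false_positive_rate) under the assumed definitions."""
    match_rate = len(predicted_links) / n_base_records
    wrong = len(predicted_links - true_links)
    fp_rate = wrong / len(predicted_links) if predicted_links else 0.0
    return match_rate, fp_rate


# Toy data (invented for illustration only).
census_1900 = [
    Record("a1", "John", "Smith", 1880, "Ohio"),
    Record("a2", "Mary", "Jones", 1875, "Utah"),
    Record("a3", "Peter", "Olsen", 1890, "Iowa"),
]
census_1910 = [
    Record("b1", "John", "Smith", 1881, "Ohio"),
    Record("b2", "Mary", "Jones", 1875, "Utah"),
    Record("b3", "Peter", "Olson", 1890, "Iowa"),
]
tree_links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

# Labeled pairs that a supervised classifier could be trained on.
training = build_training_pairs(census_1900, census_1910, tree_links)

# Hypothetical classifier output: one correct link, one wrong link, one miss.
predicted = {("a1", "b1"), ("a2", "b3")}
match_rate, fp_rate = evaluate(predicted, tree_links, len(census_1900))
```

In this toy setup every 1900 record has a known link, so a predicted link that disagrees with the tree is counted as a false positive. With real data, evaluation would presumably be restricted to records for which a tree link exists.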

Keywords: record linking; machine learning; historical data; genealogy; socioeconomic status

JEL Codes: C81; J1; N01


Causal Claims Network Graph

Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.


Causal Claims

Cause | Effect
user-generated genealogical data + machine learning (C55) | higher match rate (C52)
traditional methods (C90) | lower match rate (C78)
training data reliability (C52) | better outcomes (I14)
preprocessing of birth years and birthplaces (J19) | improved accuracy of machine learning models (C45)
xgboost algorithm (C52) | accurate matches (C52)
integration of family history research + automated methods (N01) | enhanced quality of data for social science research (C81)
