Working Paper: NBER ID: w26227
Authors: Joseph Price; Kasey Buckles; Jacob Van Leeuwen; Isaac Riley
Abstract: A key challenge for research on many questions in the social sciences is that it is difficult to link historical records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we develop a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. First, we use these “true” links to inform the decisions one needs to make when using traditional linking methods. Second, we use the links to construct a training data set for use in supervised machine learning methods. We describe the procedure we use and illustrate the potential of our approach by linking individuals across the 100% samples of the US decennial censuses from 1900, 1910, and 1920. We obtain an overall match rate of about 70 percent, with a false positive rate of about 12 percent. This combination of high match rate and accuracy represents a point beyond the current frontier for record linking methods.
Keywords: record linking; machine learning; historical data; genealogy; socioeconomic status
JEL Codes: C81; J1; N01
Cause | Effect |
---|---|
user-generated genealogical data + machine learning (C55) | higher match rate (C52) |
traditional methods (C90) | lower match rate (C78) |
training data reliability (C52) | better outcomes (I14) |
preprocessing of birth years and birthplaces (J19) | improved accuracy of machine learning models (C45) |
XGBoost algorithm (C52) | accurate matches (C52) |
integration of family history research + automated methods (N01) | enhanced quality of data for social science research (C81) |
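
The abstract reports two headline figures, a match rate and a false positive rate, both assessed against the family-tree links. The sketch below shows one way such figures could be computed. It is an illustration under stated assumptions, not the authors' procedure: the function name `link_quality`, the record identifiers, and the exact definitions (match rate as links made divided by base records, false positive rate as the share of checkable links that disagree with the tree) are all hypothetical choices made here.

```python
def link_quality(predicted, truth, n_base):
    """Summarize a set of record links against known-correct links.

    predicted: dict mapping base-census id -> linked id in the later census
    truth:     dict mapping base-census id -> linked id taken from the
               family tree (treated as correct; an assumption of this sketch)
    n_base:    number of base-census records the method attempted to link
    """
    if n_base <= 0:
        raise ValueError("n_base must be positive")

    # Assumed definition: share of base records that received any link.
    match_rate = len(predicted) / n_base

    # Only links whose base record also has a tree link can be checked.
    checkable = [base_id for base_id in predicted if base_id in truth]
    wrong = sum(1 for base_id in checkable
                if predicted[base_id] != truth[base_id])

    # Assumed definition: share of checkable links that disagree with the tree.
    false_positive_rate = wrong / len(checkable) if checkable else None
    return match_rate, false_positive_rate


# Toy data with made-up identifiers, for illustration only.
predicted = {"a1": "b1", "a2": "b9", "a3": "b3", "a4": "b4"}
truth = {"a1": "b1", "a2": "b2", "a3": "b3"}
match_rate, fpr = link_quality(predicted, truth, n_base=5)
print(f"match rate = {match_rate:.2f}, false positive rate = {fpr:.2f}")
```

In the toy data, four of five base records are linked, and one of the three links that can be checked against the tree is wrong. Links without a tree counterpart (here `a4`) count toward the match rate but cannot be scored as right or wrong.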