Breakthroughs in Historical Record Linking Using Genealogy Data: The Census Tree Project

Working Paper: NBER ID: w31671

Authors: Kasey Buckles; Adrian Haws; Joseph Price; Haley EB Wilbert

Abstract: The Census Tree is the largest-ever database of record links among the historical U.S. censuses, with over 700 million links for people living in the United States between 1850 and 1940. These high-quality links allow researchers in the social sciences and other disciplines to construct a longitudinal dataset that is highly representative of the population. In this paper, we describe our process for creating the Census Tree, beginning with a collection of over 317 million links contributed by the users of a free online genealogy platform. We then use these links as training data for a machine learning algorithm to make new matches, and incorporate other recent efforts to link the historical U.S. censuses. Finally, we introduce a procedure for filtering the links and adjudicating disagreements. Our complete Census Tree achieves match rates between adjacent censuses that are between 69 and 86% for men, and between 58 and 79% for women. The Census Tree includes women and Black Americans at unprecedented rates, containing 314 million links for the former and more than 41 million for the latter.
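The abstract mentions filtering links, adjudicating disagreements, and reporting match rates, but does not spell out the rules used. The sketch below is therefore only an illustration under stated assumptions: it uses a simple unanimity rule (keep a link only when every contributing source agrees and the target is claimed once), which is not necessarily the authors' procedure. The function names, source labels, and record IDs are all hypothetical.

```python
from collections import defaultdict


def adjudicate(links):
    """Resolve candidate links with a simple unanimity rule (an assumption,
    not the paper's documented procedure).

    links: iterable of (source_id, target_id, method) tuples, where method
    names the hypothetical origin of the link (e.g. user-contributed or ML).
    """
    targets = defaultdict(set)
    for source_id, target_id, _method in links:
        targets[source_id].add(target_id)

    # Keep only records whose candidate targets all agree.
    agreed = {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}

    # Enforce one-to-one links: drop targets claimed by several records.
    claims = defaultdict(int)
    for target_id in agreed.values():
        claims[target_id] += 1
    return {s: t for s, t in agreed.items() if claims[t] == 1}


def match_rate(resolved, population_ids):
    """Share of the base-census population that has a resolved link."""
    population = set(population_ids)
    return len(population & set(resolved)) / len(population)


# Toy data: four people in the earlier census, five candidate links.
links = [
    ("a1", "b1", "family_tree"),
    ("a1", "b1", "ml_model"),
    ("a2", "b2", "family_tree"),
    ("a2", "b3", "ml_model"),  # sources disagree, so a2 is dropped
    ("a3", "b4", "ml_model"),
]
population = ["a1", "a2", "a3", "a4"]

resolved = adjudicate(links)
rate = match_rate(resolved, population)
```

On this toy input, two of the four people keep a link, so the match rate is 0.5. A real pipeline would likely weigh sources differently rather than discard every disagreement.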

Keywords: genealogy; historical data; record linking; census data; machine learning

JEL Codes: C81; J10; N01


Causal Claims Network Graph

[Figure: causal claims network graph. Edges evidenced by causal inference methods are shown in orange; all other edges are shown in light blue.]


Causal Claims

Cause | Effect
machine learning techniques (C45) | match rates (C52)
census tree dataset (C80) | match rates for women and Black Americans (J79)
application of machine learning techniques (C45) | improvement in record linking accuracy (C52)
systematic process (C90) | quality of links (L15)
machine learning model training using user-generated links (C45) | match rates (C52)
