Working Paper: NBER ID: w31671
Authors: Kasey Buckles; Adrian Haws; Joseph Price; Haley EB Wilbert
Abstract: The Census Tree is the largest-ever database of record links among the historical U.S. censuses, with over 700 million links for people living in the United States between 1850 and 1940. These high-quality links allow researchers in the social sciences and other disciplines to construct a longitudinal dataset that is highly representative of the population. In this paper, we describe our process for creating the Census Tree, beginning with a collection of over 317 million links contributed by the users of a free online genealogy platform. We then use these links as training data for a machine learning algorithm to make new matches, and incorporate other recent efforts to link the historical U.S. censuses. Finally, we introduce a procedure for filtering the links and adjudicating disagreements. Our complete Census Tree achieves match rates between adjacent censuses that are between 69 and 86% for men, and between 58 and 79% for women. The Census Tree includes women and Black Americans at unprecedented rates, containing 314 million links for the former and more than 41 million for the latter.
Keywords: genealogy; historical data; record linking; census data; machine learning
JEL Codes: C81; J10; N01
Cause | Effect
---|---
machine learning techniques (C45) | match rates (C52) |
census tree dataset (C80) | match rates for women and Black Americans (J79) |
application of machine learning techniques (C45) | improvement in record linking accuracy (C52) |
systematic process (C90) | quality of links (L15) |
machine learning model training using user-generated links (C45) | match rates (C52) |
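
The abstract mentions a procedure for filtering links and adjudicating disagreements, and reports match rates between adjacent censuses, but this record does not spell out either step. The following is a minimal sketch under stated assumptions, not the authors' procedure: it assumes each candidate link is a `(source, id_earlier, id_later)` tuple, keeps a link only when no source disagrees and the link is one-to-one, and defines the match rate as the share of a reference population with a retained link. The function names `adjudicate` and `match_rate`, the record identifiers, and the source labels are all hypothetical.

```python
from collections import defaultdict


def adjudicate(links):
    """Keep only undisputed, one-to-one links.

    `links` is an iterable of (source, id_earlier, id_later) tuples.
    This conservative rule is an illustrative assumption, not the
    rule used to build the Census Tree.
    """
    forward = defaultdict(set)   # earlier record -> proposed later records
    backward = defaultdict(set)  # later record -> proposed earlier records
    for _source, earlier, later in links:
        forward[earlier].add(later)
        backward[later].add(earlier)

    kept = {}
    for earlier, candidates in forward.items():
        if len(candidates) != 1:
            continue  # sources disagree about this person
        (later,) = candidates
        if len(backward[later]) == 1:
            kept[earlier] = later  # no competing claim on the later record
    return kept


def match_rate(kept_links, population_ids):
    """Share of the reference population that has a retained link (assumed definition)."""
    population = set(population_ids)
    if not population:
        return 0.0
    matched = sum(1 for person in kept_links if person in population)
    return matched / len(population)


# Hypothetical candidate links from three sources.
links = [
    ("users", "a1", "b1"),
    ("ml", "a1", "b1"),      # agrees with the user-contributed link
    ("users", "a2", "b2"),
    ("ml", "a2", "b3"),      # disagreement: a2 is dropped
    ("ml", "a3", "b4"),
    ("other", "a4", "b4"),   # two people claim b4: both are dropped
]
kept = adjudicate(links)
rate = match_rate(kept, ["a1", "a2", "a3", "a4", "a5"])
```

Dropping every disputed link trades coverage for precision; a less strict rule, such as preferring one source over another, would retain more links at the cost of more potential false matches.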