Working Paper: NBER ID: w25657
Authors: Sarah Tahamont; Zubin Jelveh; Aaron Chalfin; Shi Yan; Benjamin Hansen
Abstract: Objective: The increasing availability of large administrative datasets has led to a particularly exciting innovation in criminal justice research, that of the “low-cost” randomized trial in which administrative data are used to measure outcomes in lieu of costly primary data collection. In this paper, we point out that randomized experiments that make use of administrative data have an unfortunate consequence: the destruction of statistical power. Linking data from an experimental intervention to administrative records that track outcomes of interest typically requires matching datasets without a common unique identifier. In order to minimize mistaken linkages, researchers will often use “exact matching” (retaining an individual only if all their demographic variables match exactly in two or more datasets) in order to ensure that speculative matches do not lead to errors in an analytic dataset. Methods: In this paper, we derive an analytic result for the consequences of linking errors on statistical power and show how the problem varies across different combinations of relevant inputs, including the matching error rate, the outcome density and the sample size. Results: We show that this seemingly conservative approach leads to underpowered experiments and potentially to the failure of entire experimental literatures. For marginally powered studies, which are common in empirical social science, exact matching is particularly problematic. Conclusions: We conclude on an optimistic note by showing that simple machine learning-based probabilistic matching algorithms allow criminal justice researchers to recover a considerable share of the statistical power that is lost to errors in data linking.
Keywords: administrative data; randomized experiments; statistical power; linking errors; probabilistic matching
JEL Codes: C1; C12; K42
Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.
Cause | Effect |
---|---|
linking errors (Y80) | attenuation of treatment effect estimates (C22) |
linking errors (Y80) | increased likelihood of type II errors (C92) |
linking errors (Y80) | reduced statistical power (C20) |
exact matching (C52) | increased false negative rates (C52) |
increased false negative rates (C52) | higher total error rate (C83) |
higher total error rate (C83) | reduced statistical power (C20) |
machine learning-based probabilistic matching algorithms (C45) | mitigated effects of linking errors (Y80) |
mitigated effects of linking errors (Y80) | recovered statistical power (C59) |
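
The chain from linking errors to attenuated estimates to reduced power can be illustrated with a small calculation. The sketch below is not the paper's derivation. It assumes a binary outcome, two equal-sized arms, a two-sided two-proportion z-test under the normal approximation, and a simple error model in which a true outcome record is missed with probability `fnr` (false negative link) and a person with no true outcome is wrongly linked with probability `fpr` (false positive link). All function names, parameter names, and the numeric inputs are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def observed_rate(p, fnr, fpr):
    """Outcome rate seen in the linked data under the assumed error model.

    A true outcome survives linking with probability (1 - fnr); a person
    with no true outcome is wrongly linked with probability fpr.
    """
    return p * (1.0 - fnr) + (1.0 - p) * fpr


def power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    nd = NormalDist()
    se = math.sqrt(
        p_control * (1.0 - p_control) / n_per_arm
        + p_treat * (1.0 - p_treat) / n_per_arm
    )
    z = abs(p_treat - p_control) / se
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting in either tail.
    return nd.cdf(z - z_crit) + nd.cdf(-z - z_crit)


def power_with_linking_errors(p_control, p_treat, n_per_arm, fnr, fpr, alpha=0.05):
    """Power when outcomes are measured through error-prone record linkage."""
    return power_two_proportions(
        observed_rate(p_control, fnr, fpr),
        observed_rate(p_treat, fnr, fpr),
        n_per_arm,
        alpha,
    )


if __name__ == "__main__":
    # Hypothetical inputs: 30% control outcome rate, 24% treated, 1,000 per arm.
    p_c, p_t, n = 0.30, 0.24, 1000
    for fnr in (0.0, 0.1, 0.2, 0.3):
        power = power_with_linking_errors(p_c, p_t, n, fnr=fnr, fpr=0.0)
        print(f"fnr={fnr:.1f}  power={power:.3f}")
```

Under this assumed error model the observed treatment-control difference equals the true difference multiplied by `1 - fnr - fpr`, so any linking error shrinks the estimated effect toward zero. With the hypothetical inputs above, raising `fnr` lowers power even though the true effect and sample size are unchanged, which matches the "higher total error rate" to "reduced statistical power" edge in the table.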