Adjusting Imperfect Data: Overview and Case Studies

Working Paper: NBER ID: w12977

Abstract: Research users of large administrative data have to adjust their data for quirks, problems, and issues that are inevitable when working with these kinds of datasets. Not all solutions to these problems are identical, and how they differ may affect how the data are to be interpreted. Some elements of the data, such as the unit of observation, remain fundamentally different, and it is important to keep that in mind when comparing data across countries. In this paper (written for Lazear and Shaw, 2007), we focus on the differences in the underlying data for a selection of country datasets. We describe two data elements that remain fundamentally different across countries -- the sampling or data collection methodology, and the basic unit of analysis (establishment or firm) -- and the extent to which they differ. We then document some of the problems that affect longitudinally linked administrative data in general, describe some of the solutions analysts and statistical agencies have implemented, and explore, through a select set of case studies, how each adjustment or absence thereof might affect the data.

Keywords: No keywords provided

JEL Codes: C81; C82; J0

Causal Claims Network Graph

Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.

Causal Claims

Cause	Effect
Differences in data collection methodologies (C83)	Different interpretations of the data (Y10)
Sampling schemes (worker-based versus firm-based) (C83)	Discrepancies in findings (C90)
Unit of analysis (establishment versus firm) (L25)	Discrepancies in findings (C90)
Coding errors in person identifiers (C83)	Spurious job histories (J63)
Spurious job histories (J63)	Upward-biased flow statistics (C46)
Systematic and random errors in identifiers (C83)	Accuracy of employment statistics (J68)
Adjustments made (or lack thereof) (F32)	Estimates of employment and wage dynamics (J39)
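
The link from coding errors in person identifiers to spurious job histories and upward-biased flow statistics can be illustrated with a small sketch. This is not the paper's method or data: the record layout `(person_id, employer_id, period)`, the function name `count_flows`, and the identifiers `A1`, `A7`, and `F` are all hypothetical, and periods are assumed to be consecutive.

```python
from collections import defaultdict


def count_flows(records):
    """Count accessions and separations across adjacent periods.

    records: iterable of (person_id, employer_id, period) tuples (hypothetical layout).
    An accession is a (person, employer) pair present in period t but not in t-1;
    a separation is a pair present in t-1 but not in t.
    """
    jobs = defaultdict(set)
    for person_id, employer_id, period in records:
        jobs[period].add((person_id, employer_id))

    periods = sorted(jobs)
    accessions = 0
    separations = 0
    # Compare each observed period with the next one (assumed consecutive).
    for prev, curr in zip(periods, periods[1:]):
        accessions += len(jobs[curr] - jobs[prev])
        separations += len(jobs[prev] - jobs[curr])
    return accessions, separations


# One worker continuously employed at one firm for four periods.
clean = [("A1", "F", t) for t in range(1, 5)]

# The same history, but the identifier is miscoded as "A7" in period 3.
miscoded = [("A1", "F", 1), ("A1", "F", 2), ("A7", "F", 3), ("A1", "F", 4)]

clean_flows = count_flows(clean)        # no flows: the job never changes
miscoded_flows = count_flows(miscoded)  # spurious exits and entries around period 3

print("clean:", clean_flows)
print("miscoded:", miscoded_flows)
```

In this toy example a single miscoded identifier turns one uninterrupted job spell into two spurious separations and two spurious accessions, which is the direction of bias the table describes. Whether and how such breaks are repaired (for example by probabilistic matching on other fields) is exactly the kind of adjustment that may differ across datasets.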
