Imputation in US Manufacturing Data and Its Implications for Productivity Dispersion

Working Paper: NBER ID: w22569

Authors: T. Kirk White; Jerome P. Reiter; Amil Petrin

Abstract: In the U.S. Census Bureau's 2002 and 2007 Censuses of Manufactures, 79% and 73% of observations, respectively, have imputed data for at least one variable used to compute total factor productivity (TFP). The Bureau primarily imputes missing values using mean-imputation methods, which can reduce the true underlying variance of the imputed variables. For every variable entering TFP in 2002 and 2007, we show that dispersion is significantly smaller in the Census mean-imputed data than in the Census non-imputed data. As an alternative to mean imputation, we show how to use classification and regression trees (CART) to allow for a distribution of multiple possible impute values based on other plants that the CART algorithm determines to be similar in terms of other observed variables. For 90% of the 473 industries in 2002 and 84% of the 471 industries in 2007, we find that TFP dispersion increases as we move from Census mean-imputed data to Census non-imputed data to CART-imputed data.
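
Illustration: the variance-shrinking effect of mean imputation described in the abstract can be seen in a small simulation. The sketch below is not the authors' procedure and uses no Census data. It assumes a hypothetical variable (log output), two made-up size classes standing in for the groups of similar plants that a regression tree would form, and values missing completely at random. It compares filling each missing value with its cell mean against filling it with a random draw from observed plants in the same cell.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: log output for 2,000 plants in two size classes.
# The size class is an assumed stand-in for the groups of similar plants
# that a fitted tree would produce; no tree is actually estimated here.
plants = []
for _ in range(2000):
    size_class = random.choice(["small", "large"])
    center = 2.0 if size_class == "small" else 4.0
    plants.append((size_class, random.gauss(center, 1.0)))

# Mark roughly 60% of values as missing, completely at random (an assumption).
missing = [random.random() < 0.6 for _ in plants]
observed = [y for (_, y), m in zip(plants, missing) if not m]

def donors(size_class):
    """Observed values from plants in the same size class."""
    return [y for (c, y), m in zip(plants, missing) if not m and c == size_class]

donor_pool = {c: donors(c) for c in ("small", "large")}
cell_mean = {c: statistics.mean(v) for c, v in donor_pool.items()}

# Mean imputation: every missing value in a cell receives the cell mean,
# so all within-cell variation among imputed records is removed.
mean_imputed = [cell_mean[c] if m else y for (c, y), m in zip(plants, missing)]

# Donor-draw imputation: every missing value receives a randomly chosen
# observed value from the same cell, which keeps within-cell spread.
draw_imputed = [random.choice(donor_pool[c]) if m else y
                for (c, y), m in zip(plants, missing)]

sd_true = statistics.pstdev([y for _, y in plants])
sd_observed = statistics.pstdev(observed)
sd_mean = statistics.pstdev(mean_imputed)
sd_draw = statistics.pstdev(draw_imputed)

print(f"complete data (unobservable in practice): {sd_true:.3f}")
print(f"non-imputed records only:                 {sd_observed:.3f}")
print(f"mean-imputed:                             {sd_mean:.3f}")
print(f"donor-draw imputed:                       {sd_draw:.3f}")
```

Under these assumptions the mean-imputed series shows a smaller standard deviation than either the non-imputed records or the donor-draw series. Because missingness here is purely random, the sketch does not attempt to reproduce the paper's finding that CART-imputed dispersion exceeds dispersion in the non-imputed data.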

Keywords: No keywords provided

JEL Codes: C80; L11; L60

Causal Claims Network Graph

Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.

Causal Claims

Cause	Effect
mean imputation (C36)	variance of imputed variables (C36)
nonimputed data (C80)	variance of imputed variables (C36)
CART imputed data (Y10)	variance of imputed variables (C36)
mean imputed data (C80)	TFP dispersion (F16)
nonimputed data (C80)	TFP dispersion (F16)
CART imputed data (Y10)	TFP dispersion (F16)
census completed data (C80)	TFP dispersion (F16)
plant exit (Y60)	TFP (F16)
