Working Paper: NBER ID: w31844
Authors: Donna K. Ginther; Carlos Zambrana; Patricia Oslund; Wanying Chang
Abstract: This paper examines whether publication data matched to the Survey of Doctorate Recipients can be used for research purposes. We use Gold Standard data created to validate the publication match quality and compare these measures to publications assigned by a machine-learning algorithm developed by Thomson Reuters (now Clarivate). Our econometric model demonstrates that publications likely suffer from non-classical measurement error. Using horse race and instrumental variable models, we confirm that the Gold Standard data are relatively free from measurement error but show that the Clarivate data suffer from non-classical measurement error. We employ a variety of methods to adjust the Clarivate data for false negatives and false positives and demonstrate that with these adjustments the data produce estimates very similar to the Gold Standard. However, these adjustments are not as useful when publications are used as a dependent variable. We recommend using subsamples of the data that have better match quality when using the Clarivate data as a dependent variable.
Keywords: No keywords provided
JEL Codes: C26; J40; O30
Cause | Effect |
---|---|
Clarivate data (Y10) | non-classical measurement error (C20) |
gold standard data (Y10) | free from measurement error (C20) |
adjustments for false negatives and false positives in Clarivate data (C80) | estimates similar to gold standard (C13) |
publication counts (A14) | career outcomes (salaries and likelihood of receiving federal research funding) (I23) |
Clarivate data (with adjustments) (Y10) | reliable estimates for career outcomes (J24) |
publication data as dependent variable (C29) | inadequate adjustments (F32) |
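
The contrast the abstract draws between an error-free Gold Standard count and a machine-matched count with false negatives and false positives can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' model or data: the Poisson publication process, the `recall` and `fp_rate` parameters, the true slope `beta`, and the helper names `simulate` and `ols` are all assumptions made for illustration. It shows that when matches are missed in proportion to the true count, the matching error is correlated with the truth (non-classical), that the slope on the mismeasured count departs from the true slope, and that in a "horse race" regression containing both measures the error-free one absorbs the effect.

```python
import numpy as np


def simulate(n=50_000, beta=0.5, recall=0.7, fp_rate=1.0, seed=0):
    """Draw synthetic outcomes plus two publication measures (all hypothetical)."""
    rng = np.random.default_rng(seed)
    true_pubs = rng.poisson(5.0, size=n)            # assumed true publication counts
    outcome = 1.0 + beta * true_pubs + rng.normal(0.0, 1.0, size=n)
    gold = true_pubs.astype(float)                  # error-free benchmark measure
    found = rng.binomial(true_pubs, recall)         # false negatives: some true pubs missed
    spurious = rng.poisson(fp_rate, size=n)         # false positives: wrongly assigned pubs
    matched = (found + spurious).astype(float)      # stand-in for an algorithmic match
    return outcome, gold, matched


def ols(y, *regressors):
    """Least-squares coefficients; the intercept comes first."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


outcome, gold, matched = simulate()

# Non-classical error: the matching error co-varies with the true count.
error = matched - gold
corr_error_truth = np.corrcoef(error, gold)[0, 1]

# Slopes from regressing the outcome on each measure separately.
slope_gold = ols(outcome, gold)[1]
slope_matched = ols(outcome, matched)[1]

# Horse race: both measures in one regression.
race = ols(outcome, gold, matched)

print("corr(error, truth):", round(corr_error_truth, 3))
print("slope on gold:     ", round(slope_gold, 3))
print("slope on matched:  ", round(slope_matched, 3))
print("horse race (gold, matched):", np.round(race[1:], 3))
```

In this setup the mismeasured regressor understates the true slope, but that direction is a property of the assumed error process; other forms of non-classical error can bias estimates in either direction.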