Working Paper: CEPR ID: DP15852
Authors: Morgane Laouenan; Palaash Bhargava; Jean-Benoît Eymoud; Olivier Gergaud; Guillaume Plique; Etienne Wasmer
Abstract: We add to the literature on notable individuals (famous, prominent, distinguished) by first collecting a massive amount of data from various editions of Wikipedia and Wikidata, applying deduplication techniques, and then using these partially overlapping sources to cross-verify each piece of retrieved information. This strategy results in a cross-verified database of 2.2 million individuals, including a third who are not present in the English edition of Wikipedia. An extension to 4.7 million entries is currently not recommended given the inaccuracy of the information and discrepancies between Wikidata and other sources. A non-negligible fraction of newly added individuals were collected from non-English editions of Wikipedia. We adopt a social science approach: data collection is driven by specific social questions on gender, economic and cultural development, and quantitative exploration of cultural trends, which we document in this paper. A sample of 100,000 individuals is available at http://medialab.github.io/bhht-datascape, together with the most recent version of this paper.
Keywords: notable individuals; creative class; urban economics; economic history
JEL Codes: N01; N9; R00
Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.
Cause | Effect |
---|---|
Construction of a cross-verified database of notable individuals (B31) | More nuanced understanding of historical and cultural trends (Z19) |
Inclusion of diverse language editions of Wikipedia (Y90) | More comprehensive view of historical figures (B31) |
Enhanced database (Y10) | Better statistical analyses of sociohistorical facts (C80) |
Enhanced database (Y10) | Understanding the dynamics of gender, economic, and cultural development (F63) |
Methodology employed (deduplication and cross-verification) (C83) | Minimizes errors in the data (C83) |
Minimizes errors in the data (C83) | More accurate representation of the notability of individuals across different cultures and periods (B31) |
Findings reveal a non-trivial error rate among less documented individuals (C83) | Suggests that manual corrections or statistical treatments may be required to improve data quality (C80) |
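
The cross-verification idea named in the abstract and in the table above can be illustrated with a small sketch. This is not the authors' procedure: the majority-agreement rule, the function name `cross_verify`, the `min_agree` threshold, and the source labels (`wikidata`, `wikipedia_en`, `wikipedia_fr`) are all assumptions chosen for illustration. The sketch accepts a field value, such as a birth year, only when enough partially overlapping sources report the same value and no other value is reported equally often.

```python
from collections import Counter
from typing import Optional


def cross_verify(values: dict[str, Optional[int]], min_agree: int = 2) -> Optional[int]:
    """Return the value that at least `min_agree` sources agree on, else None.

    `values` maps a (hypothetical) source label to the value that source
    reports, or None when the source has no entry for the field.
    """
    # Ignore sources that report nothing for this field.
    reported = [v for v in values.values() if v is not None]

    # Take the two most frequent values so that ties can be detected.
    counts = Counter(reported).most_common(2)
    if not counts:
        return None

    top_value, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count

    # Reject the field when too few sources agree or when two values tie.
    if top_count < min_agree or tied:
        return None
    return top_value


# Hypothetical birth-year records for one individual.
birth_year = cross_verify(
    {"wikidata": 1452, "wikipedia_en": 1452, "wikipedia_fr": 1451}
)
```

A field left as `None` by such a rule would be a natural candidate for the manual corrections or statistical treatments mentioned in the last row of the table.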