Stability of Experimental Results: Forecasts and Evidence

Working Paper: NBER ID: w25858

Authors: Stefano DellaVigna; Devin Pope

Abstract: How robust are experimental results to changes in design? And can researchers anticipate which changes matter most? We consider a specific context, a real-effort task with multiple behavioral treatments, and examine the stability along six dimensions: (i) pure replication; (ii) demographics; (iii) geography and culture; (iv) the task; (v) the output measure; (vi) the presence of a consent form. We use rank-order correlation across the treatments as a measure of stability, and compare the observed correlation to the one under a benchmark of full stability (which allows for noise), and to expert forecasts. The academic experts expect that the pure replication will be close to perfect, that the results will differ sizably across demographic groups (age/gender/education), and that changes to the task and output will make a further impact. We find near-perfect replication of the experimental results, and full stability of the results across demographics, significantly higher than the experts expected. The results are quite different across task and output change, mostly because the task change adds noise to the findings. The results are also stable to the lack of consent. Overall, the full stability benchmark is an excellent predictor of the observed stability, while expert forecasts are not that informative. This suggests that researchers' predictions about external validity may not be as informative as they expect. We discuss the implications of both the methods and the results for conceptual replication.
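
The stability measure described in the abstract can be illustrated with a minimal sketch. It computes a Spearman rank-order correlation between treatment-level results from two versions of an experiment, and simulates one possible reading of a "full stability" benchmark: the correlation expected when both versions share the same true treatment effects and differ only by sampling noise. The function names, the normal-noise assumption, and the numbers in the usage note are hypothetical illustrations, not the authors' procedure or data.

```python
import random
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def full_stability_benchmark(true_means, se_a, se_b, draws=2000, seed=0):
    """Average rank correlation when two versions share the same true
    treatment means and differ only by independent normal noise
    (an assumed noise model, chosen for illustration)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        version_a = [rng.gauss(m, s) for m, s in zip(true_means, se_a)]
        version_b = [rng.gauss(m, s) for m, s in zip(true_means, se_b)]
        total += rank_correlation(version_a, version_b)
    return total / draws
```

As a usage example with made-up numbers, `rank_correlation([1800, 1950, 2100, 2000], [1750, 1900, 2150, 2050])` gives the observed stability for four hypothetical treatments, and `full_stability_benchmark([1800, 1950, 2100, 2000], [40] * 4, [40] * 4)` gives the value to compare it against. An observed correlation close to the benchmark would suggest that the remaining instability is attributable to noise rather than to the design change.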

Keywords: No keywords provided

JEL Codes: C9; C91; C93

Causal Claims Network Graph

Edges that are evidenced by causal inference methods are in orange, and the rest are in light blue.

Causal Claims

Cause	Effect
stability of experimental results under different design changes (C90)	near-perfect replication of the experimental results (C59)
design changes (O30)	stability across demographic groups (J19)
geographic comparisons between US and Indian subjects (R12)	lower correlation (C10)
task changes (J62)	instability (C62)
changes in output measures (C67)	greater instability (C62)
presence or absence of a consent form (C90)	no significant effect on results (C20)
