Last week I was treated to a great workshop titled “Reproducibility and Integrity in Scientific Research” at the University of Canterbury where I presented my article (joint with Benjamin D.K. Wood), “Which tests not witch hunts: A diagnostic approach for conducting replication research.” The article provides tips and resources for researchers seeking a neutral approach to replication research. In honor of the workshop and Halloween, I thought I’d scare up a blog post summarizing the article.
Why conduct replication research?
Suppose you’ve read a study that you consider to be innovative or influential. Why might you want to conduct a replication study of it? Here when I say ‘replication study’, I mean internal replication (or desk replication), for which the researcher uses the study’s original data to reassess the study’s findings. There are three reasons you might want to conduct such a study: to prove it right, to learn from it, or to prove it wrong. We rarely see the first reason stated, making it a bit of a phantom. However, I am a big fan of conducting replication research to validate a study’s findings for the purpose of policy making or program design. We see the second reason – to learn from it – more often, although mostly in the context of graduate school courses on quantitative methods.
Instead, many fear that most replication studies are conducted with the desire to prove a study wrong. Zimmerman (2015) considers “turning replication exercises into witch hunts” to be an easy pitfall of replication research. Gertler, Galiani, and Romero (2018) report that unnamed third parties “speculated” that researchers for a well-known replication study sought to overturn results. The specter of speculation aside, why might replication researchers look for faults in a study?
To address this challenge, Ben Wood and I set out to develop a neutral approach to replication research based on elements of quantitative analysis and using examples from 3ie-funded replication studies. This approach is intended for researchers who want to dissect a study beyond a pure replication (that is, using the study’s methods and original data simply to reproduce the results in the published article). The diagnostic approach includes four categories: assumptions, data transformations, estimation, and heterogeneous outcomes.
Assumptions
In the Whitney, Cameron, and Winters (2018) replication study of the Galiani and Schargrodsky (2010) impact evaluation of a property rights policy change in Buenos Aires, the replication researchers note that the original authors provide balance tables for the full sample of 1,082 parcels but only conduct their analysis on a subset of 300 parcels. Whitney, et al. test the pre-program balance between program and comparison parcels on four key characteristics for the households in the analysis subset and find statistically significant differences for three of the four. Their further tests reveal that these imbalances do not change the ultimate findings of the study, however.
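As a rough illustration of this kind of balance check, here is a minimal Python sketch. It is not the code Whitney, et al. used. The data are synthetic, and the function name `welch_t`, the sample sizes, and the way the subset is drawn are all assumptions made up for the example. It computes Welch's t statistic for the difference in one baseline characteristic between program and comparison groups, first on a full sample and then on an analysis subset.

```python
import math
import random
import statistics


def welch_t(program, comparison):
    """Welch's t statistic for a difference in means (unequal variances)."""
    diff = statistics.mean(program) - statistics.mean(comparison)
    se = math.sqrt(
        statistics.variance(program) / len(program)
        + statistics.variance(comparison) / len(comparison)
    )
    return diff / se


random.seed(0)
# Synthetic baseline characteristic (hypothetical, not the study's data).
full_program = [random.gauss(50, 10) for _ in range(500)]
full_comparison = [random.gauss(50, 10) for _ in range(500)]

# Hypothetical analysis subset, drawn non-randomly from the program group
# on purpose so that the subset is imbalanced.
subset_program = sorted(full_program)[-150:]
subset_comparison = full_comparison[:150]

print("full sample t:", round(welch_t(full_program, full_comparison), 2))
print("subset t:     ", round(welch_t(subset_program, subset_comparison), 2))
```

A large absolute t statistic on the subset (roughly above 2) flags an imbalance that a balance table for the full sample would not necessarily reveal.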
Data transformations
There is a lot of hocus pocus that goes into getting data ready for analysis. These spells determine what data are used, including decisions about whether to kill outliers, how to bring missing values back from the dead, and how to weight observations. We also often engage in potion making when we use data to construct new variables, including variables like aggregates (e.g., income and consumption) and indexes (e.g., empowerment and participation). Replication researchers can use the study data and sometimes outside data in order to answer questions about whether these choices are well supported and whether they make a difference to the analysis.
Kuecken and Valfort (2018) question the decision by Reinikka and Svensson (2005) to exclude certain schools from the analysis dataset used for their study of how an anti-corruption newspaper campaign affects enrollment and learning. The original study includes a footnote that the excluded schools experienced reductions in enrollment due to “idiosyncratic shocks”, which the original authors argue should not be systematically correlated with the explanatory variable. Kuecken and Valfort resurrect the excluded schools and find that the published statistical significance of the findings is sensitive to the exclusion.
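A sensitivity check of this sort can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the records, the `excluded` flag, and the helper names `ols_slope` and `slope_for` are hypothetical, and a simple one-variable regression stands in for whatever specification a real study uses. The idea is to estimate the same relationship twice, once on the published sample and once with the excluded observations put back.

```python
def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic records (hypothetical): (exposure, outcome, excluded)
schools = [
    (0.0, 0.0, False),
    (1.0, 1.0, False),
    (2.0, 2.0, False),
    (3.0, 3.0, False),
    (4.0, 0.0, True),  # dropped in the hypothetical published analysis
]


def slope_for(records):
    """Estimate the slope on whichever records are passed in."""
    return ols_slope([r[0] for r in records], [r[1] for r in records])


published = slope_for([r for r in schools if not r[2]])
resurrected = slope_for(schools)
print("slope without excluded schools:", published)
print("slope with excluded schools:   ", resurrected)
```

If the two estimates differ materially, the exclusion rule deserves a closer look and a clear justification.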
Estimation
There are two sets of replication questions around estimation methods. One is whether different methods developed for similar statistical tasks produce the same results. A well-known example is the replication study conducted by epidemiologists Aiken, Davey, Hargreaves, and Hayes (2015) (published as two articles) of an impact evaluation of a health intervention conducted by economists Miguel and Kremer (2004). This replication study combined with systematic review evidence resulted in the worm wars, which were indeed spine-chilling. The second set of questions is how sensitive (or robust) the results are to parameters or other choices made when applying estimation methods. Many published studies include some sensitivity tests, but there are sometimes additional sensitivity tests that can be conducted.
Korte, Djimeu, and Calvo (2018) do the converse of worm wars – they apply econometric methods to data from an epidemiology trial by Bailey, et al. (2007) testing whether male circumcision reduces incidence of HIV infection. For example, Korte, et al. exploit the panel nature of the data, that is, repeated observations of the same individuals over time, by running a fixed effects model, which controls for unobserved individual differences that don’t change over time. They find that the econometric methods produce results very similar to those of the biostatistical methods for the HIV infection outcome, but produce some different results for the tests of whether male circumcision increases risky sexual behavior.
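To show what a fixed effects model does mechanically, here is a minimal Python sketch of the linear "within" estimator on a made-up panel. It is not the specification Korte, et al. estimated. The panel, the true slope of 2, and the function names `pooled_slope` and `within_slope` are assumptions for illustration. Each individual has an unobserved constant that is correlated with the regressor, so a pooled regression is biased while demeaning by individual removes the constant.

```python
from collections import defaultdict


def pooled_slope(x, y):
    """Simple regression slope of y on x (with an intercept), ignoring the panel."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def within_slope(ids, x, y):
    """Fixed effects (within) slope: demean x and y by individual, then regress."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum of x, sum of y, count
    for i, a, b in zip(ids, x, y):
        totals[i][0] += a
        totals[i][1] += b
        totals[i][2] += 1
    x_dm = [a - totals[i][0] / totals[i][2] for i, a in zip(ids, x)]
    y_dm = [b - totals[i][1] / totals[i][2] for i, b in zip(ids, y)]
    # Demeaned series have mean zero, so no intercept is needed.
    return sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)


# Synthetic panel (hypothetical): 5 individuals observed in 3 periods.
# Outcome = individual constant (10 * person) + 2 * x, with no noise.
ids, x, y = [], [], []
for person in range(5):
    for period in range(3):
        ids.append(person)
        x.append(float(person + period))
        y.append(10.0 * person + 2.0 * (person + period))

print("pooled slope:       ", pooled_slope(x, y))
print("fixed effects slope:", within_slope(ids, x, y))
```

In this toy panel the within estimator recovers the slope built into the data, while the pooled estimate absorbs the individual constants and overstates it.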
Heterogeneous outcomes
Wood and Dong (2018) re-examine an agricultural commercialization impact evaluation conducted by Ashraf, Giné, and Karlan (2009). The commercialization program included promoting certain export crops and making it easier to sell all crops. The original study explores heterogeneous outcomes by whether the sample farmers grew the export crops before the intervention or not and finds that those who did not grow these crops are more likely to benefit. Wood and Dong use value chain theory to hypothesize that the benefits of the program come from bringing farmers to the market, that is, getting them to sell any crops (domestic or export). They look at heterogeneous outcomes by whether farmers grew any cash crops before the program and find that only those who did not grow cash crops benefit from the program.
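The logic of re-cutting subgroups can be sketched with a toy example in Python. The six farmer records, their outcome values, and the function name `effect_by_subgroup` are invented for illustration and have nothing to do with the actual study data. The sketch compares the treated-minus-control difference in mean outcomes under two subgroup definitions: prior export crop growers versus prior cash crop growers of any kind.

```python
from statistics import mean


def effect_by_subgroup(records, subgroup_key):
    """Difference in mean outcome (treated minus control) within each subgroup."""
    effects = {}
    for group in sorted({r[subgroup_key] for r in records}):
        members = [r for r in records if r[subgroup_key] == group]
        treated = [r["outcome"] for r in members if r["treated"]]
        control = [r["outcome"] for r in members if not r["treated"]]
        effects[group] = mean(treated) - mean(control)
    return effects


# Synthetic farmers (hypothetical). Only farmers with no prior cash crops gain.
farmers = [
    {"treated": True, "grew_export": False, "grew_cash": False, "outcome": 9.0},
    {"treated": False, "grew_export": False, "grew_cash": False, "outcome": 5.0},
    {"treated": True, "grew_export": False, "grew_cash": True, "outcome": 6.0},
    {"treated": False, "grew_export": False, "grew_cash": True, "outcome": 6.0},
    {"treated": True, "grew_export": True, "grew_cash": True, "outcome": 8.0},
    {"treated": False, "grew_export": True, "grew_cash": True, "outcome": 8.0},
]

print("by prior export crops:", effect_by_subgroup(farmers, "grew_export"))
print("by prior cash crops:  ", effect_by_subgroup(farmers, "grew_cash"))
```

In this made-up data, splitting by export crops blends farmers who gain with farmers who do not, while splitting by any cash crops isolates the group that benefits. A real analysis would also need standard errors and a theory-based reason for the chosen split.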
Internal replication research provides validation of published results, which is especially important when those results are used for policy making and program design (Brown, Cameron, and Wood, 2014). It doesn’t need to be scary, and original authors don’t need to be spooked. The “which tests not witch hunts” paper provides tips and resources for each of the topics described above. The paper also provides a list of “don’ts” for replication research, which I’ll summarize in a separate post. Happy Halloween!
Cross-published on The Replication Network blog.