How to conduct a replication study: Which tests not witch hunts

Last week I was treated to a great workshop titled “Reproducibility and Integrity in Scientific Research” at the University of Canterbury where I presented my article (joint with Benjamin D.K. Wood), “Which tests not witch hunts: A diagnostic approach for conducting replication research.” The article provides tips and resources for researchers seeking a neutral approach to replication research. In honor of the workshop and Halloween, I thought I’d scare up a blog post summarizing the article.

Why conduct replication research?

Suppose you’ve read a study that you consider to be innovative or influential. Why might you want to conduct a replication study of it? Here when I say ‘replication study’, I mean internal replication (or desk replication), in which the researcher uses the study’s original data to reassess the study’s findings. There are three reasons you might want to conduct such a study: to prove it right, to learn from it, or to prove it wrong. We rarely see the first reason stated, making it a bit of a phantom. However, I am a big fan of conducting replication research to validate a study’s findings for the purpose of policymaking or program design. We see the second reason – to learn from it – more often, although typically in the context of graduate school courses on quantitative methods.

The third reason is the one many fear: that most replication studies are conducted with the desire to prove a study wrong. Zimmerman (2015) considers “turning replication exercises into witch hunts” to be an easy pitfall of replication research. Gertler, Galiani, and Romero (2018) report that unnamed third parties “speculated” that researchers for a well-known replication study sought to overturn results. The specter of speculation aside, why might replication researchers look for faults in a study?

One reason is publication bias. Replication studies that question the results of original studies are more likely to be published: Gertler, Galiani, and Romero (2018) provide evidence from a survey of editors of economics journals showing that editors are much more likely to publish a replication study that overturns results than one that confirms them. Regardless of publication bias, however, my experience funding replication studies while working at the International Initiative for Impact Evaluation (3ie) is that not all replication researchers carry torches and pitchforks. Many just don’t know where to start when conducting replication research. Without some kind of template or checklist to work from, these researchers are often haunted by the academic norm of critical review and approach their replication work from that standpoint.

To address this challenge, Ben Wood and I set out to develop a neutral approach to replication research based on elements of quantitative analysis and using examples from 3ie-funded replication studies. This approach is intended for researchers who want to dissect a study beyond a pure replication (that is, using the study’s methods and original data simply to reproduce the results in the published article). The diagnostic approach includes four categories: assumptions, data transformations, estimation methods, and heterogeneous outcomes.

Assumptions
The application of methods and models in conducting empirical research always involves making assumptions. Often these assumptions can be tested using the study data or other data. Because my focus is development impact evaluation, the assumptions I see most often are those supporting a study’s identification strategy. Examples include assuming no randomization failure in the case of random-assignment designs or assuming unobservables are time invariant in the case of difference-in-differences designs. Many other assumptions may be necessary depending on the context of the research. For example, when looking at market interventions, researchers often assume that agents are small relative to the market (i.e., price takers). Even if the study data cannot be used to shed light on these assumptions, there may be other data that can.

In the Whitney, Cameron, and Winters (2018) replication study of the Galiani and Schargrodsky (2010) impact evaluation of a property rights policy change in Buenos Aires, the replication researchers note that the original authors provide balance tables for the full sample of 1,082 parcels but only conduct their analysis on a subset of 300 parcels. Whitney, et al. test the pre-program balance between program and comparison parcels on four key characteristics for the households in the analysis subset and find statistically significant differences for three of the four. Their further tests reveal that these imbalances do not change the ultimate findings of the study, however.
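For readers who want a concrete starting point, here is a minimal sketch of that kind of balance check in Python. The file name and variable names are hypothetical placeholders, not the variables from Galiani and Schargrodsky’s data; the idea is simply to compare program and comparison units on each baseline characteristic within the analysis subsample.

```python
# Minimal sketch of a baseline balance check on an analysis subsample.
# The file name, the flags, and the covariates are hypothetical placeholders.
import pandas as pd
from scipy import stats

df = pd.read_csv("parcels.csv")                  # hypothetical data file
subset = df[df["in_analysis_subset"] == 1]       # restrict to the analysis sample

covariates = ["lot_size", "household_size", "head_age", "head_education"]
for var in covariates:
    treated = subset.loc[subset["treated"] == 1, var].dropna()
    control = subset.loc[subset["treated"] == 0, var].dropna()
    t, p = stats.ttest_ind(treated, control, equal_var=False)  # Welch t-test
    print(f"{var}: diff = {treated.mean() - control.mean():.3f}, p = {p:.3f}")
```

A Welch t-test is used here so the two groups are not assumed to have equal variances; any sensible two-sample comparison would serve the same diagnostic purpose.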

Data transformations

There is a lot of hocus pocus that goes into getting data ready for analysis. These spells determine what data are used, including decisions about whether to kill outliers, how to bring missing values back from the dead, and how to weight observations. We also often engage in potion making when we use data to construct new variables, such as aggregates (e.g., income and consumption) and indexes (e.g., empowerment and participation). Replication researchers can use the study data, and sometimes outside data, to answer questions about whether these choices are well supported and whether they make a difference to the analysis.
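To make this concrete, here is a minimal sketch (with hypothetical file and variable names) of checking whether one such choice – how to handle outliers – changes a simple treatment effect estimate, by comparing the raw outcome with a winsorized version of it.

```python
# Minimal sketch of an outlier-handling sensitivity check.
# File and variable names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("households.csv").dropna(subset=["consumption", "treated"])

# Winsorize consumption at the 1st and 99th percentiles instead of dropping outliers.
lower, upper = df["consumption"].quantile([0.01, 0.99])
df["consumption_w"] = df["consumption"].clip(lower=lower, upper=upper)

for outcome in ["consumption", "consumption_w"]:
    fit = smf.ols(f"{outcome} ~ treated", data=df).fit(cov_type="HC1")
    print(outcome, fit.params["treated"], fit.pvalues["treated"])
```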

Kuecken and Valfort (2018) question the decision by Reinikka and Svensson (2005) to exclude certain schools from the analysis dataset used for their study of how an anti-corruption newspaper campaign affects enrollment and learning. The original study includes a footnote that the excluded schools experienced reductions in enrollment due to “idiosyncratic shocks”, which the original authors argue should not be systematically correlated with the explanatory variable. Kuecken and Valfort resurrect the excluded schools and find that the published statistical significance of the findings is sensitive to the exclusion.
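A sketch of that kind of exclusion-sensitivity check might look like the following, again with hypothetical file and variable names rather than the actual Reinikka and Svensson data: estimate the same specification on the published sample and on the full sample with the excluded observations restored, and compare the coefficient and p-value of interest.

```python
# Minimal sketch of an exclusion-sensitivity check: re-estimate the same
# regression with and without a set of dropped observations and compare.
# The file, the variables, and the exclusion flag are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("schools.csv")                       # hypothetical data file
formula = "enrollment ~ campaign_exposure + baseline_enrollment"

published_sample = df[df["excluded_in_original"] == 0]
full_sample = df                                      # excluded schools resurrected

for label, data in [("published sample", published_sample), ("full sample", full_sample)]:
    fit = smf.ols(formula, data=data).fit(cov_type="HC1")  # robust standard errors
    print(label, fit.params["campaign_exposure"], fit.pvalues["campaign_exposure"])
```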

Estimation methods

There are two sets of replication questions around estimation methods. One is whether different methods developed for similar statistical tasks produce the same results. A well-known example is the replication study conducted by epidemiologists Aiken, Davey, Hargreaves, and Hayes (2015) (published as two articles) of an impact evaluation of a health intervention conducted by economists Miguel and Kremer (2004). This replication study, combined with systematic review evidence, resulted in the worm wars, which were indeed spine-chilling. The second set of questions is how sensitive (or robust) the results are to parameters or other choices made when applying estimation methods. Many published studies include some sensitivity tests, but there are sometimes additional sensitivity tests that can be conducted.

Korte, Djimeu, and Calvo (2018) do the converse of the worm wars – they apply econometric methods to data from an epidemiology trial by Bailey, et al. (2007) testing whether male circumcision reduces the incidence of HIV infection. For example, Korte, et al. exploit the panel nature of the data, that is, repeated observations of the same individuals over time, by running a fixed effects model, which controls for unobserved individual differences that don’t change over time. They find that the econometric methods produce results very similar to those from the biostatistical methods for the HIV infection outcome, but produce some different results for the tests of whether male circumcision increases risky sexual behavior.
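As an illustration of the kind of re-estimation involved, here is a minimal sketch (hypothetical file and variable names, not the trial’s actual data) comparing pooled OLS with a least-squares-dummy-variable fixed effects model, which absorbs time-invariant individual differences by including a dummy for each person.

```python
# Minimal sketch: pooled OLS versus individual fixed effects on panel data.
# File and variable names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel data: one row per person per study round.
df = pd.read_csv("trial_panel.csv").dropna(
    subset=["risky_behavior", "circumcision_status", "round", "person_id"]
)

# Pooled OLS ignores the panel structure (round dummies only).
pooled = smf.ols("risky_behavior ~ circumcision_status + C(round)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["person_id"]}
)

# Least-squares-dummy-variable fixed effects: a dummy per person absorbs
# unobserved individual differences that do not change over time.
fe = smf.ols(
    "risky_behavior ~ circumcision_status + C(round) + C(person_id)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["person_id"]})

print("pooled:", pooled.params["circumcision_status"])
print("fixed effects:", fe.params["circumcision_status"])
```

Standard errors are clustered by individual in both models since the same people appear in multiple rounds.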

Heterogeneous outcomes
Understanding whether the data from a published study point to heterogeneous outcomes can be important for using the study’s findings for program design or policy targeting. These further tests on a study’s data are likely to be exploratory rather than confirmatory. For example, one might separate a random-assignment sample into men and women for heterogeneous outcomes analysis even if the randomization did not occur for these two groups separately. Exploration of heterogeneous outcomes in a replication study should be motivated by theoretical or clinical considerations.

Wood and Dong (2018) re-examine an agricultural commercialization impact evaluation conducted by Ashraf, Giné, and Karlan (2009). The commercialization program included promoting certain export crops and making it easier to sell all crops. The original study explores heterogeneous outcomes by whether the sample farmers grew the export crops before the intervention and finds that those who did not grow these crops are more likely to benefit. Wood and Dong use value chain theory to hypothesize that the benefits of the program come from bringing farmers to the market, that is, getting them to sell any crops (domestic or export). They look at heterogeneous outcomes by whether farmers grew any cash crops before the program and find that only those who did not grow cash crops benefit from the program.
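For exploratory heterogeneity analysis of this kind, a minimal sketch might interact treatment with a pre-program characteristic; the file and variable names below are hypothetical placeholders, not the variables from Ashraf, Giné, and Karlan’s data.

```python
# Minimal sketch of an exploratory heterogeneity test using an interaction
# between treatment and a pre-program characteristic.
# File and variable names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("farmers.csv").dropna(
    subset=["income", "treated", "grew_cash_crops_baseline"]
)

# The coefficient on the interaction term indicates whether the treatment
# effect differs for farmers who already grew cash crops before the program.
fit = smf.ols("income ~ treated * grew_cash_crops_baseline", data=df).fit(cov_type="HC1")
print(fit.summary().tables[1])
```

Because this kind of split is exploratory rather than confirmatory, the motivation (here, value chain theory) should carry more weight than the p-value on the interaction.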

Internal replication research provides validation of published results, which is especially important when those results are used for policymaking and program design (Brown, Cameron, and Wood, 2014). It doesn’t need to be scary, and original authors don’t need to be spooked. The “which tests not witch hunts” paper provides tips and resources for each of the topics described above. The paper also provides a list of “don’ts” for replication research, which I’ll summarize in a separate post. Happy Halloween!


Cross-published on The Replication Network blog.
