Null results should produce answers, not excuses

By: Annette N. Brown

I recently served on a Center for Global Development (CGD) panel to discuss a new study of the effects of community-based education on learning outcomes in Afghanistan (Burde, Middleton, and Samii 2016). This exemplary randomized evaluation finds some important positive results. But the authors do one thing in the study that almost all impact evaluation researchers do: where they have null results, they make what I will call, for the sake of argument, excuses.

With the investment that goes into individual impact evaluations, especially large field trials, null results should produce answers, not excuses. We are quick to conclude that something works when we get positive results, but loath to come right out and say that something doesn’t work when we get null results. Instead, we say things like: perhaps the implementation lacked fidelity; or maybe the design of the intervention didn’t fully match the theory of change; or we just didn’t have enough observations. The U.S. Department of Education commissioned this useful brief to help people sort through the possible explanations for null results.

These explanations then lead to the conclusion that more research is needed. Sure, more research is always needed, but if expensive studies give us more questions than answers, it is harder to make the case for funding them in the first place. We need to design experiments and their evaluations from the outset so they are more likely to give us answers, even from null results.

My argument here applies mostly to evaluations of experimental interventions (which may or may not be experimental, or randomized, evaluations). That is, I am talking about evaluations of new interventions that we want to test to see if they work. In the case of the community-based education evaluation, the study included experiments on two within-treatment variations: one to look at the effects of adding a community participation component to the education intervention, and another to look at the effects of requiring minimum qualifications in teacher selection when that requirement leads to teachers being hired from outside the community.

The authors motivate the first variation, community participation, by explaining that it is a concept commonly incorporated into NGO-administered programs, although the evidence of its effectiveness is mixed. In Afghanistan, they implemented community participation using community libraries, adult reading groups and a poster campaign. Their test of community participation produces “no statistically significant effect on access, learning outcomes, measures of parental support to education, or trust in service providers.” The authors go on to explain, “these weak results could be due to one of two possible factors: the enhanced package provided did not adequately address the reasons why some parents persistently do not send their children to school, or the activities were not implemented at the appropriate ‘dosage’ to be effective.”

What we would like the authors to be able to say, given the investment in conducting this experiment, is: “the finding of no statistically significant effect demonstrates that community participation is not an effective enhancement for community-based education in Afghanistan.” But there are indeed open questions about activity selection and implementation. So I understand why the authors do not feel comfortable drawing this definitive conclusion.

What can we do to make sure that null results do tell us what doesn’t work? I recommend three steps.

First, we should make sure that the intervention we are evaluating is the best possible design based on the underlying theory of change and for the context where we’re working. This recommendation points to the need for formative research. Formative research can include collection of quantitative and/or qualitative data, which can be analyzed using quantitative and/or qualitative methods, and it should produce the information necessary to design the best possible intervention. In this Afghanistan case, formative research would ask why parents are not sending their children to school and would explore with these parents and other parents in Afghanistan what services or activities they would like to see.

Put differently, a hypothesis is an educated guess, and formative research is the education behind the guess.
A strong hypothesis should give us answers both when it is rejected and when it is not rejected. Some examples of formative research conducted to inform intervention design come from the International Initiative for Impact Evaluation’s (3ie’s) HIV Self-testing Thematic Window. In both Kenya (formative research summary here) and Zambia (formative study to be published soon, overview here) 3ie provided grants for formative research in advance of funding studies of pilot interventions. This blog post summarizes two formative research studies conducted by the mSTAR project of FHI 360 to inform the design of a mobile money salary payments program in Liberia.

The second step to help ensure answers from null results is to make sure that the intervention as designed can be implemented with fidelity in the context for which it was designed. This step often requires formative evaluation. A formative evaluation involves implementing the intervention on a small scale and collecting data on the implementation itself and about those who receive the intervention. A formative evaluation is not the same as piloting the intervention to test whether the intervention causes the desired outcome or impact—that measurement of an attributable effect requires a counterfactual.

A formative evaluation addresses two questions:

  1. can the intervention be implemented, and
  2. will the target beneficiaries take it up?

The first question often focuses on implementation process, and the formative evaluation uses process evaluation methods. Sometimes, though, it is more of a science question, and the formative evaluation is more like a feasibility study looking at whether the technology actually works in the field. Here is an example of a formative evaluation of hearing screening procedures in Ecuadorian schools designed to learn how well a new screening technology can be implemented.

The second formative evaluation question, whether recipients take up the intervention, should not be in too much doubt if the formative research was good. But we often see impact evaluations of new or pilot programs where the main finding is that the participants did not “use” the intervention. In the case of the Afghanistan study, for example, “only about 35% of household respondents and community leaders in enhancement villages report [the presence of] libraries and adult reading groups.” Those take-up data do suggest that the null result from the impact evaluation reflects that community participation didn’t happen more than it reflects that community participation doesn’t work. But we shouldn’t need a big study with a counterfactual to test whether participants will take up a new intervention. We can do that with a formative evaluation.

Why don’t we see more formative research and formative evaluation? They take a lot of time and a lot of money. Often at the point that impact evaluators are brought in, there is already a timeline for starting a project and no patience to spend a year collecting and analyzing formative research data and conducting a formative evaluation of the intervention before the full study is launched. Skipping these steps is not as much of a problem when the study yields positive results, but it leaves us with excuses when the study yields null results.

The third step to getting answers from null results is ensuring there is enough statistical power to estimate a meaningful minimum detectable effect. This step is well known, and it is fairly common now to see power calculations in impact evaluation proposals or protocols, at least for the primary outcome of interest.

What we rarely see are ex-post power calculations to explore whether the study as implemented still had enough power to measure something meaningful.
If the final sample size turns out to be smaller than planned, or other assumed parameters differ in reality (e.g. the intra-cluster correlation is higher), a null effect could be the result of too little power. This finding doesn’t give us the answer that the intervention doesn’t work, but at least we know why we don’t know. Ex-post power calculations can also be useful for understanding null results in tests of heterogeneous outcomes, where the applicable sample is often smaller than the full sample determined by the ex-ante power calculations. Examples of how ex-post power calculations can be used are here and here.
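To make the arithmetic concrete, here is a minimal sketch of the kind of back-of-the-envelope check I have in mind, assuming a simple two-arm comparison of means with equal-sized clusters. The function name and all of the numbers (sample sizes, cluster size, ICC values) are hypothetical illustrations, not figures from the Afghanistan study.

```python
from scipy.stats import norm

def minimum_detectable_effect(n_per_arm, sd, cluster_size=1, icc=0.0,
                              alpha=0.05, power=0.80):
    """MDE for a two-arm comparison of means, adjusted for clustering."""
    deff = 1 + (cluster_size - 1) * icc           # design effect
    se = sd * (2 * deff / n_per_arm) ** 0.5       # SE of the difference in means
    z_alpha = norm.ppf(1 - alpha / 2)             # critical value, two-sided test
    z_power = norm.ppf(power)                     # z-value for the desired power
    return (z_alpha + z_power) * se

# Planned design: 1,000 children per arm, clusters of 20, assumed ICC of 0.05.
planned = minimum_detectable_effect(1000, sd=1.0, cluster_size=20, icc=0.05)

# As implemented: attrition leaves 800 per arm and the realized ICC is 0.15.
realized = minimum_detectable_effect(800, sd=1.0, cluster_size=20, icc=0.15)

print(f"planned MDE:  {planned:.2f} SD")   # ≈ 0.17 SD
print(f"realized MDE: {realized:.2f} SD")  # ≈ 0.27 SD
```

In this illustration, the study as implemented can only reliably detect an effect of about 0.27 standard deviations rather than the roughly 0.17 it was designed for, so a null result on a smaller true effect tells us about power, not about the intervention.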

As I said at the beginning, there are many features that make the Burde, Middleton, and Samii evaluation exemplary, features not seen in all (or even most) development impact evaluations:

  • First, the study overall is highly policy relevant. The primary intervention being evaluated, community-based education, is something the Afghan government is looking to take over from the NGOs eventually. The second of the two within-treatment variations is also highly policy relevant. If and when the government does take over community-based education, the minimum qualifications requirement for teachers would likely apply, so how this requirement affects recruitment is extremely relevant.
  • Second, the authors pay particular attention to the ecological validity of the experimental interventions, which improves the usefulness of the pilot for informing the intervention at scale.
  • Third, the study includes a detailed cost effectiveness analysis allowing a comparison of the different community-based education packages in terms of effect per dollar invested.
  • Fourth, the study includes analysis of the influence of other NGO activities in these communities. In spite of random assignment, the presence of related NGO activities is not perfectly balanced across the study arms, so understanding the attributable impact of the intervention requires analysis of other NGO activities.
  • Fifth, the study was conducted in close collaboration with the Afghanistan Ministry of Education, which benefits the policy relevance and the policy take-up.

The high quality of this study means that the authors can make specific policy recommendations based on the positive results that they do find, and this quality was appreciated by the Afghan Deputy Minister of Education in his comments at CGD. But null results happen. They happened here and they happen in many studies. To get the most out of our impact evaluations, we need to set them up so that null results give us answers.


Photo credit: FHI 360


2 Responses to "Null results should produce answers, not excuses"
  1. Rick Homan says:

    Your point about ex post effect size calculations along with the minimal detectable effect at 80% power is very important. I recall sitting at an STI meeting in South Africa and researcher after researcher presented results in a positive light but then said they lacked statistical power to detect a difference between intervention and control groups. At some point one wonders if spending multiple years conducting an underpowered clinical trial could be approaching the equivalent of research malpractice.

  2. Mario Chen says:

    The blog makes a lot of great points. One point I’d like to add a warning about is the use of ex-post power (sometimes also called post-hoc power) based on observed effect sizes. The use of post-hoc power is not recommended, as it simply restates what is already provided by the p-value, and it is usually misinterpreted. A non-significant result will always lead to low post-hoc power. Confidence intervals may be a better way to express the uncertainty in the sample. Post-hoc power should be discouraged for the analysis of a single study. However, it may be useful for comparing studies in a systematic review, as done in the cited references. The use of the minimum detectable effect size is probably a better approach in such cases anyway. All of this should not be taken as discouraging the discussion of what meaningful effect sizes should be.
