Turning lemons into lemonade, and then drinking it: Rigorous evaluation under challenging conditions

In early 2014, USAID came to the ASPIRES project with a challenge. They requested that our research team design and implement a prospective quantitative evaluation of a USAID-funded intervention in Mozambique, one that combined social and economic support for girls at high risk of HIV infection. As a research-heavy USAID project focused on the integration of household economic strengthening and HIV prevention/treatment, ASPIRES was prepared for the task.

The challenges, however, came in the particulars of the evaluation scenario. Among them:

  1. The intervention was in the fourth year of a five-year term, with no baseline or evaluation planning prior to our involvement.
  2. The number of girls to be enrolled prospectively in the intervention’s final year was relatively small (just a few hundred), and many would begin the intervention before our team could collect baseline data.
  3. Implementation and rollout strategies among the communities involved were non-standard and ad hoc.
  4. There was little opportunity and zero appetite among the implementers to modify their plans in the final year to suit an evaluation.
  5. The intervention lacked a clearly articulated theory of change.

With these constraints in mind, the research team set out to identify the best possible design to fulfill the client’s request; in other words, we sought a recipe for lemonade amid decidedly lemony conditions. The team worked through an intuitive “rigor decision tree,” ruling each option in or out based on feasibility, and landed on what we felt was the best design, highlighted in green in Figure 1 below.

Figure 1: Rigor decision tree

In sum, the research entailed two rounds of data collection in a clustered, non-equivalent (two-stage) cohort trial. Maximizing rigor in our context meant pairing multi-level exact matching with difference-in-differences estimation of intervention effects on the outcomes of interest.
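
For readers curious about the mechanics, here is a minimal sketch of the matching-plus-difference-in-differences logic in Python. The data file and variable names (community, age_band, intervention, post, outcome) are hypothetical stand-ins for illustration, not the project’s actual data or analysis code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical analysis file: one row per girl per survey round, with a 0/1
# intervention indicator, a 0/1 "post" indicator (first vs. second round),
# exact-matching strata (community, age band), and an outcome score.
df = pd.read_csv("cohort_data.csv")

# Multi-level exact matching (sketch): keep only covariate cells that contain
# both intervention and comparison girls, so the groups are compared within
# identical strata.
strata = ["community", "age_band"]
cells = df.groupby(strata)["intervention"].nunique().reset_index()
matched = df.merge(cells.loc[cells["intervention"] == 2, strata], on=strata)

# Difference-in-differences: the coefficient on the interaction term estimates
# the intervention effect; standard errors are clustered by community to
# respect the clustered design.
did = smf.ols("outcome ~ intervention * post", data=matched).fit(
    cov_type="cluster", cov_kwds={"groups": matched["community"]}
)
print(did.summary())
```

The interaction coefficient is the difference-in-differences estimate: the change over time in the intervention group minus the change over time in the comparison group.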

Certainly this design was not ideal, but the team wasn’t operating in an ideal evaluation world. We needed to find the point where maximum rigor and client expectations met, and this was it.

As the team dove into the study, our plans evolved further to suit the context. We had planned to examine impact on seven outcomes related to HIV vulnerability, but preliminary analysis of first-round data from the intervention group revealed that the study could not obtain accurate measures for five of them. The team therefore focused on evidence of impact on the two outcomes we could measure: girls’ knowledge related to gender-based violence (GBV) and school attendance.

Other unanticipated challenges came up during the study period. Power, for example, became more of an issue than expected. Although the target sample size was 300 girls in the intervention group and 650 girls in the comparison group, the effective sample sizes after matching and post-stratification weighting were 174 and 323, respectively, well below the targets.
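
To see why weighting erodes power, the Kish approximation is a handy rule of thumb: unequal weights make a sample behave as if it were smaller. The sketch below uses invented weights purely for illustration; it is not the study’s actual weighting scheme.

```python
import numpy as np

def kish_effective_sample_size(weights):
    """Kish approximation: n_eff = (sum of weights)^2 / sum of squared weights.
    Equal weights give n_eff == n; the more variable the weights, the smaller
    the effective sample size, and hence the lower the statistical power."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Illustration only: 300 nominal observations with fairly unequal
# post-stratification weights behave like roughly half as many effective
# observations.
rng = np.random.default_rng(42)
illustrative_weights = rng.gamma(shape=1.0, scale=1.0, size=300)
print(kish_effective_sample_size(illustrative_weights))
```

A combination of this kind of shrinkage and the loss of unmatched cases is how nominal samples of 300 and 650 can end up with effective sizes of 174 and 323.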

Despite the constraints, the research team implemented the most robust design possible and followed procedures consistent with a confirmatory study, including the use of a comparison group and data collection at two time points. We employed robust analytic methods to adjust for the non-equivalence we anticipated, specified analytic decisions a priori, and thoroughly examined the validity and reliability of key measures before testing the main hypotheses. Furthermore, we minimized the number of outcomes involved in hypothesis testing so as not to inflate the Type I error rate, and we fully reported all of our findings. Collectively, these factors increased the validity of our conclusions. Additionally, we collected extensive qualitative data to help explain the impact of the intervention.
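
As a concrete illustration of two of those safeguards, the sketch below checks the internal consistency of a multi-item scale before hypothesis testing and splits the significance threshold across the pre-specified outcomes. The scale composition and the Bonferroni-style split are illustrative assumptions, not the study’s actual pre-specified analysis plan.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Internal-consistency reliability of a scale: rows are respondents,
    columns are individual items (e.g., GBV-knowledge questions)."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1).sum()
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# cronbach_alpha(gbv_items) would be computed on the respondent-by-item
# matrix before the scale score enters any hypothesis test.

# One simple (illustrative) way to hold the family-wise Type I error at 0.05
# across the two pre-specified outcomes (GBV knowledge, school attendance).
FAMILY_ALPHA = 0.05
alpha_per_outcome = FAMILY_ALPHA / 2
```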

The results in the end? From the quantitative study, the team found no evidence of impact of the intervention on girls’ knowledge related to GBV or on school attendance. These findings were affirmed to a considerable extent by the qualitative data, which pointed to challenges in the depth of knowledge participants retained and in the sustainability of their economic gains. The results also highlighted the value of mixed-methods research: without the qualitative component, the investigation would have had very little to say about this intervention beyond “we observed no significant differences.” Instead, the team was able to identify roadblocks along some potential pathways to change.

So, from a challenging design we were left with a challenging result, which is to say, null results. Understandably, USAID would have preferred a different outcome from the only evaluation attached to an intervention it had supported.

But we commend USAID for seeking out and enlisting our project to deliver the best possible assessment of this intervention. In the end, USAID received valid data that shed light on the intervention’s effects. Null results are important results (see here), and there is a great deal to be learned from them, especially when paired with qualitative findings that help explain “the why.”

All learning, like all good lemonade, should be sweet. USAID continues to digest the implications of this study for its future programming.

Photo credit: Background vector created by Photoroyalty – Freepik.com
