Do I write like a girl? New evidence on gendered outcomes in grant proposal writing and scoring

Several weeks ago I was on a review team (a Red Team, for those in the know) for a proposal my colleagues were developing for the Bill & Melinda Gates Foundation. One comment I kept coming back to was that they needed to be more specific so that the reviewers would have a clear mental picture of what they were proposing. Shortly thereafter, a new working paper [gated*] that seemingly challenges this advice hit the streets, or more specifically, hit the tweets. In this study, Julian Kolev, Yuly Fuentes-Medel, and Fiona Murray analyze 6,794 proposals submitted to the Gates Foundation and find that “narrow” words are associated with lower proposal scores and “broad” words with higher proposal scores. More to the point for this post, women are more likely to use these narrow words and men are more likely to use the broad ones.

I was hit by a pang of guilt! Had I given the wrong advice? I don’t think so, but let me tell you more about this very interesting study so that I can explain why. (Hint: it has to do with how narrow and broad are measured and what that might mean for innovation.)

The study setup
Kolev et al. (2019) are interested in the interplay between diversity and innovation. The primary source of demographic diversity in their data is gender, so they look only at gender in this study. They set out to test whether a blinded proposal review process – one designed to eliminate identity-based bias – also reduces gender disparities in outcomes and thus promotes diversity. Put simply, if the reviewers have no idea what the genders of the applicants are, are women scored as highly as men, all else being equal?

Their dataset includes all the proposals submitted to the Gates Foundation Grand Challenges Explorations (GCE) program for infectious disease research from 2008 through 2017 by U.S.-based applicants with academic or non-profit affiliations, plus the scores for all those proposals. For this grant program, the proposal reviewers are blinded, meaning they see nothing about the applicants beyond the proposal itself. You might think that reviewers can easily guess who applicants are, as often happens with blinded journal referees, but the GCE reviewers come from many fields and backgrounds, including the private sector and government, so unmasking applicants from the proposal text alone seems unlikely.

Not only did Kolev et al. code information from each of these proposals and about each of the reviewers and their scores, they also collected information about career length and publication history for all the applicants and subsequent career outcomes for a subset of applicants. This is an impressive dataset. The data also allow them to do some useful things methodologically. First, because each proposal is scored by multiple reviewers, Kolev et al. are able to control for applicant quality and the proposal idea by comparing scores from different reviewers for the same proposal. Second, they use a regression discontinuity approach to look for a differential effect of receiving funding on male and female applicants, comparing later outcomes for applicants who scored just above and just below the funding cut-off.
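To make the first of those ideas concrete, here is a minimal sketch, in Python, of how a within-proposal comparison can be set up. This is not the authors' code: the variable names, the simulated data, and the simple score model are all hypothetical, and the regression discontinuity piece is not sketched at all. The point is only that proposal fixed effects absorb everything constant about a proposal (its idea, its quality, its applicant's gender), so what remains to be estimated is how reviewers of different genders score the same proposal.

```python
# A minimal sketch (not the authors' code): simulate reviewer-level scores and
# estimate the reviewer-gender / applicant-gender interaction with proposal
# fixed effects. All names and the data-generating process are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for p in range(200):                                  # 200 hypothetical proposals
    applicant_female = rng.integers(0, 2)
    quality = rng.normal()                            # unobserved proposal quality
    for _ in range(4):                                # 4 reviewers per proposal
        reviewer_male = rng.integers(0, 2)
        # toy assumption: male reviewers score female-led proposals 0.3 lower
        score = (quality - 0.3 * reviewer_male * applicant_female
                 + rng.normal(scale=0.5))
        rows.append(dict(proposal=p, applicant_female=applicant_female,
                         reviewer_male=reviewer_male, score=score))
df = pd.DataFrame(rows)

# C(proposal) absorbs anything constant within a proposal, including the
# applicant's gender; the interaction asks whether male reviewers score
# female applicants' proposals differently than female reviewers do.
fit = smf.ols("score ~ reviewer_male + reviewer_male:applicant_female + C(proposal)",
              data=df).fit()
print(fit.params.filter(like="reviewer_male"))
```

The actual paper's specifications are of course richer (rating categories, reviewer experience, and so on), but the fixed-effects logic is the same.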

The findings
After controlling for many possible explanations, Kolev et al. find that proposals submitted by women are scored lower than proposals submitted by men. According to one specification, women are 15% less likely to receive a “silver” rating and 20% less likely to receive a “gold” rating. Kolev et al. also show that this disparity appears to come exclusively from male reviewers’ scores. For a good summary of how Kolev et al. test a variety of possible explanations for this disparity and what they find for each, see Markus Goldstein’s blog post about the same paper.

Unable to explain away the disparity with more conventional factors, Kolev et al. analyze the text of the proposals submitted. In particular, they look at word choice. They code words as “narrow” or “broad” and then look for associations between word type and gender as well as between word type and reviewers’ scores. For me the most interesting figure in the working paper is figure 6. It plots words along two dimensions – gender-based use (whether a word appears more frequently in proposals by men or by women) and score association (whether it appears more frequently in high- or low-scoring proposals) – and marks each word as narrow or broad, yielding four quadrants. A quick look at the figure shows that men are more likely to use broad words and broad words appear more in high-scoring proposals, while women are more likely to use narrow words, which appear more in low-scoring proposals. Kolev et al. back these patterns up econometrically.
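For readers who like to see the mechanics, here is a rough sketch of how a quadrant chart of this kind could be built. It is not the authors' figure code, and the per-word usage numbers are invented; the real analysis would work from the full proposal texts and normalize for proposal length.

```python
# Rough sketch of a "figure 6"-style quadrant chart: for each word, compare its
# use in female- vs. male-authored proposals (x-axis) and in high- vs.
# low-scoring proposals (y-axis). All numbers below are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

def log_ratio(count_a, count_b, smoothing=1.0):
    """Smoothed log ratio of a word's frequency in group A vs. group B."""
    return np.log((count_a + smoothing) / (count_b + smoothing))

# (word, uses by female applicants, by male applicants, in high scorers, in low scorers)
words = [
    ("bacteria",  40, 80, 90, 30),
    ("detection", 30, 60, 70, 20),
    ("community", 70, 30, 25, 75),
    ("health",    90, 50, 40, 100),
]

x = [log_ratio(f, m) for _, f, m, _, _ in words]   # > 0: used more by women
y = [log_ratio(h, l) for _, _, _, h, l in words]   # > 0: more common in high scorers

fig, ax = plt.subplots()
ax.scatter(x, y)
for (word, *_), xi, yi in zip(words, x, y):
    ax.annotate(word, (xi, yi))
ax.axhline(0, linewidth=0.5)
ax.axvline(0, linewidth=0.5)
ax.set_xlabel("log ratio: female vs. male use")
ax.set_ylabel("log ratio: high- vs. low-scoring proposals")
plt.show()
```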

Which words matter?

There are three broad words in the quadrant of words used more by men and appearing more frequently in high-scoring proposals: bacteria, detection, control. There are five narrow words in the quadrant of words used more by women and appearing less frequently in high-scoring proposals: contraceptive**, brain, oral, health, community. Looking closely at this figure, my first response was, “wait, what?! How is it that bacteria is a broad word and health is a narrow word?”

Not surprisingly (to me, anyway), the crux of the matter is measurement. Kolev et al. measure the narrowness or broadness of words by looking at the distribution of word choice in the sample proposals across the 10 topics within infectious disease research; examples of the topics are HIV, malaria, and diarrhea. If a word appears at about the same rate in proposals across all the topics, it is considered “broad”. If a word appears significantly more often in proposals under some topics than others, it is considered “narrow”. So proposals in only some of the 10 topics use the word health a lot, while proposals across all 10 topics use the word bacteria at about the same rate.
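Here is how I read that measure, sketched in code. This is my own rough reconstruction, not the paper's exact procedure: the toy corpus, the tokenization, and the chi-square test are all stand-ins for whatever Kolev et al. actually do, but the idea – compare a word's usage rate across the 10 topics and flag words whose use is concentrated in only some of them – is the same.

```python
# Rough reconstruction of the narrow/broad idea (not the paper's procedure):
# a word is "narrow" if its rate of use differs significantly across topics,
# "broad" if it is spread roughly evenly across them.
from collections import Counter
from scipy.stats import chi2_contingency

# hypothetical toy corpus of (topic, proposal text) pairs
proposals = [
    ("malaria",  "community health workers detect parasite transmission"),
    ("hiv",      "oral prophylaxis plus community health outreach"),
    ("diarrhea", "bacteria detection in rural water supplies"),
    # ... thousands more in the real data
]

def word_rates_by_topic(proposals, word):
    """Occurrences of `word` and total word counts, per topic."""
    hits, totals = Counter(), Counter()
    for topic, text in proposals:
        tokens = text.lower().split()
        hits[topic] += tokens.count(word)
        totals[topic] += len(tokens)
    return hits, totals

def is_narrow(proposals, word, alpha=0.05):
    """Call a word 'narrow' if its usage rate varies significantly by topic."""
    hits, totals = word_rates_by_topic(proposals, word)
    table = [[hits[t], totals[t] - hits[t]] for t in totals]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha   # small p-value: usage concentrated in some topics

print(is_narrow(proposals, "community"))
```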

I would not label these distinctions broad and narrow; I would label them common and specific. (Later in the paper, Kolev et al. do use “general” and “topic-specific”.) Words are common if the rate of their appearance is the same across topics, while words are specific if they appear more, or less, frequently depending on the topic. According to the Kolev et al. analysis, women applicants are more likely to use specific words while male reviewers are more likely to reward common words.

What does this mean for innovation?

I hypothesize that truly innovative proposals use more specific words. How can you describe something that is new and different within a topic if you are using words common across topics? According to this study, many people doing infectious disease research take detection into account, but not many take community into account. That suggests to me that proposals that consider community are more likely to be innovative.

Kolev et al.’s analysis of outcomes after the program supports my hypothesis, at least via the contrapositive. Their table 7 takes the subset of applicants who were funded under the program and remain research-active and tests for the determinants of these applicants’ later outcomes. Those whose proposals made heavy use of broad words go on to have fewer top-journal articles, fewer new co-authors, fewer NIH grants, and fewer NIH R01 grants.

What I would love to see as follow-on research is an in-depth assessment of the true innovativeness of a subset of proposals and then an analysis of proposal word choice by innovativeness.

What does this mean for proposal writing and review?

It is important to recognize that the Gates GCE review process is different from most. Remember from above that for GCE, Gates enlists reviewers from broad backgrounds, including from outside science. This contrasts with a funder like NIH, which selects reviewers with expertise in the specific topic they will be reviewing. I suspect the Gates Foundation expects that a more diverse group of reviewers is better able to identify innovation. A priori, that makes some sense. But the opposite seems to be true, at least for male reviewers: the evidence suggests that the men among these reviewers are “overly credulous to the broad claims” of proposals.

So, should people write proposals differently? I don’t think so. I do think Kolev et al.’s research emphasizes the importance of knowing your audience and choosing your words wisely. But I would argue, based on this paper and six years working for a research grant-making organization, that the better solution is to select and train proposal reviewers more carefully.

Did I give my colleagues the wrong advice? No, because the proposal they were submitting was in response to a request for concepts in a specific area of work, and thus, I expect, scored by people working in that area. These reviewers, like NIH reviewers, should be able to spot innovation.

Do I write like a girl? I hope so!

*I’m surprised that the Gates Foundation is allowing researchers using foundation data to publish in a gated working paper series.

**My own hypothesis for the word contraceptive is simply that women are more likely to propose innovations related to contraception, and men are less likely to care about it. This hypothesis might apply to the word oral as well, as it often appears along with contraceptive.
