Sneaky in a good way: The use of survey and assessment metadata in soft skills measurement

I often do not manage to divorce myself from my desk to venture out into the wider DC world of brown bags, workshops and conferences, but I recently attended a panel discussion at the APPAM research conference on “Measuring soft skills for evaluation and policy: Challenges and innovations.” My FOMO (fear of missing out) was higher than usual: I was generally familiar with and compelled by most of the panelists’ work in some shape or form; moreover, FHI 360’s Kristin Brady was serving as discussant for the panel.

All four panelists started from two common assumptions: 1) soft skills, AKA life skills or social-emotional skills, are important for work, health and life; and 2) they are notoriously difficult to measure. This gnarliness of soft skills measurement is something we’ve experienced firsthand through the USAID-funded YouthPower Action activity to develop a soft skills assessment tool for international youth development programs.

In this blog post, I first provide an overview of some approaches to soft skills measurement and then zoom in on one approach from the panel discussion that I find particularly promising.

Measuring soft skills
Through both desk research and trial and error, we’ve learned that a range of soft skills measurement approaches exist and that they each have their own promises and pitfalls.

On one end of the spectrum are self-reports that ask students to rate their perceptions of their own skill levels. While cheap and easy to use, self-reported measures are susceptible to several sources of error: reference bias, or differences across respondents’ frames of reference for survey constructs, such as “being a hard worker”; social desirability bias, or perceived pressure to answer questions in the “right” or socially desirable way; and differences in how respondents interpret response options, such as “often” or “strongly agree.”

On the other end are performance tasks, like the classic Marshmallow Test, which are complex to both administer and analyze. These tasks may help reduce measurement error, but they also cost more and are difficult to administer consistently.

Somewhere in the middle, we find relatively new, experimental assessment strategies like anchoring vignettes, forced choice methods, and situational judgment tests that may help address some of the sources of measurement error associated with self-reports, but can also increase the burden on the respondent (Kyllonen and Bertling, 2014).

A promising new approach

Having studied the advantages and disadvantages of various approaches myself, I felt a sense of camaraderie with the pursuits of the panelists, who all presented various strategies for addressing some of the challenges facing the soft skills measurement world. One measurement approach in particular jumped out at me.

This came from a presentation by Gema Zamarro, a professor in the Department of Education Reform at the University of Arkansas, who discussed the potential use of metadata from student assessments, or data about the data (to borrow her explanation), as proxies for soft skills. Zamarro focused on two assessment behaviors: item non-response and careless answering. Item non-response refers to a phenomenon whereby students skip questions even when they have the knowledge to answer them (Hitt et al., 2016). Careless answering refers to students’ response patterns, which may be flagged as “careless” if students repeatedly use the same response category on a Likert scale or select the same response for items that measure oppositional constructs (Zamarro et al., 2018).
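To make this concrete, here is a minimal sketch of how careless answering might be flagged from raw Likert-scale responses. This is an illustration only, not the procedure Zamarro and her colleagues use; the item names, the reverse-keyed pairing, and the flagging thresholds are all hypothetical.

```python
import pandas as pd

# Toy Likert-scale responses (1-5). Item names, the reverse-keyed pairing,
# and the thresholds below are hypothetical, chosen only to illustrate.
responses = pd.DataFrame(
    {
        "grit_1": [5, 4, 2, 5],
        "grit_2_reversed": [5, 2, 4, 1],  # worded opposite to grit_1
        "selfcontrol_1": [5, 4, 3, 5],
        "selfcontrol_2": [5, 3, 3, 4],
    },
    index=["student_a", "student_b", "student_c", "student_d"],
)

def longest_same_response_run(row):
    """Length of the longest streak of identical consecutive answers."""
    longest = current = 1
    values = row.tolist()
    for prev, curr in zip(values, values[1:]):
        current = current + 1 if curr == prev else 1
        longest = max(longest, current)
    return longest

# Crude flag 1: long streaks of the same response category.
streaks = responses.apply(longest_same_response_run, axis=1)

# Crude flag 2: endorsing both an item and its reverse-keyed twin.
# After reverse-coding, the two answers should be close, not far apart.
recoded = 6 - responses["grit_2_reversed"]  # reverse-code a 1-5 item
inconsistency = (responses["grit_1"] - recoded).abs()

careless_flag = (streaks >= 3) | (inconsistency >= 3)
print(pd.DataFrame({"streak": streaks, "inconsistency": inconsistency, "flagged": careless_flag}))
```

In this toy example, student_a answers “5” to everything, including the reverse-worded item, and gets flagged on both counts; the other students do not.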

In fact, this approach has been accumulating some validity evidence. In a recent working paper, Zamarro and her colleagues present evidence from several studies suggesting that careless answering and item non-response rates demonstrate low but significant correlations with various skills and personality factors (for example, Barry and Finney, 2016; Zamarro et al., 2018) as well as education outcomes (for example, Hitt et al., 2016).

It is not difficult to spot pitfalls in this approach. I’m reminded of my high school friend Florida who, instead of answering the questions on her AP calculus exam, doodled wacky caricatures. I think that was her way of saying either “I can’t answer these questions” or “These questions are poorly worded” or some combination thereof, but she is also one of the most soft-skilled people I know. Certainly, not all tests are created equal, and neither are all test administration circumstances. Put my friend Gideon in a test room where someone is loudly crunching their way through an apple, then put his clone in another room where no one is eating an apple, and you’ll probably see wildly different results. Zamarro and her colleagues do not shy away from pointing out these and other potential limitations.

However, I find the use of test and survey metadata as potential proxies for soft skills to be a promising approach for a few different reasons.

  1. It is cheap. We already have the data, so why not use it?
  2. It seems pretty doable. Item non-response rates are calculated as the percentage of skipped items out of the total number of items (see the sketch after this list). Careless answering is calculated in a multi-step process that is a bit more complicated, but still not quite rocket science (described in Zamarro et al., 2018).
  3. It is sneaky, but in a good way. Since respondents aren’t aware that this data is being captured (hopefully), self-report bias can be avoided.
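As a concrete illustration of the calculation in item 2, here is a minimal sketch of an item non-response rate computed from assessment responses, assuming skipped items are recorded as missing values; the data and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical assessment data: NaN marks a skipped item.
items = pd.DataFrame(
    {
        "q1": [4, np.nan, 5],
        "q2": [3, np.nan, np.nan],
        "q3": [np.nan, 2, 4],
        "q4": [5, np.nan, 3],
    },
    index=["student_a", "student_b", "student_c"],
)

# Item non-response rate: share of skipped items out of all items presented.
non_response_rate = items.isna().mean(axis=1)
print(non_response_rate)
# Expected: student_a 0.25 (1 of 4 skipped), student_b 0.75, student_c 0.25
```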
Of course, as Zamarro and her colleagues acknowledge, we won’t fully understand the potential uses or non-uses of student assessment metadata until we collect more validity data. For now, she advises thinking of metadata as “crude proxies” for soft skills rather than valid measures. And like well-meaning parents withholding the true identity of Santa Claus, we just have to make sure to keep this sneaky data collection tool a secret from our children until they’re old enough to understand. The only possible repercussion is that our theoretical children may one day come to resent us for having lied to them for years on end, but surely, it will have all been worth it.
