This is a reflection on a specific MICER17 conference session; for an overview of the conference, start reading here.
Dr Stewart Kirton continued the theme of proper statistical handling of data, set by Fraser Scott last year. We were specifically looking at Likert scales, and ways of developing and handling them. The importance of piloting your study was raised for the first (but not the last) time: try it out on both fellow education researchers and students to find out whether each question is as understandable as you think. I do not do enough of this!
We then discussed several implementation dos and don’ts around Likert scales: I was wary of mixing questions with different types of responses, but this is fine – as is using questions with even numbers of responses. Although it wasn’t explicitly mentioned, I wonder if people prefer to give Likert scales an odd number of values in order to provide a neutral option. The neutral option itself may be desirable precisely because Likert scales usually do not provide a “do not know” option – something Stewart encouraged us to include if appropriate!
Two caveats do apply around responses, however. I was unaware of both, and have frequently broken them:
- Don’t mix positive and negative wording in questions
- Possible responses shouldn’t be clustered at the extreme ends of the scale (Endorsability).
Negatively worded questions can potentially influence participants whereas positive questions don’t, but the far bigger source of bias comes from mixing the two wordings together – this should be avoided even at the cost of a universally-negative questionnaire! For the questionnaire itself, Stewart advocated writing somewhere around 10-12 questions, and then keeping the best 6-8 of these. I imagine that if you needed more than this, then your research question may be inappropriately broad (with reference to Suzanne’s session).
The meat of this session, which spilled over into post-conference discussion, was around the issue of averaging Likert scale data. In brief, the numbers associated with Likert scale data are more or less arbitrary labels for ordered categories (ordinal data), but we frequently treat them as if they were measurements on an evenly spaced scale (interval data). The gap between 1 and 2 may not be the same as the gap between 2 and 3 – so applying mountains of statistics is often inappropriate and time-consuming (and the bane of certain peer reviewers’ lives!). Rather, Stewart suggests strategies around data binning into binary choices – “Very NSS”, as Simon Lancaster put it on the day.
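To make the binning idea concrete for myself, here is a minimal sketch of collapsing a five-point scale into a binary "agree" / "not agree" split and reporting percentages instead of a mean. The responses are invented, and both the function name `bin_agreement` and the choice to count the neutral midpoint as "not agree" are my own assumptions rather than anything Stewart prescribed.

```python
from collections import Counter

# Hypothetical responses on a 5-point scale
# (1 = strongly disagree ... 5 = strongly agree). Invented for illustration.
responses = [1, 2, 2, 3, 4, 4, 4, 5, 5, 3]


def bin_agreement(responses, agree_threshold=4):
    """Collapse ordinal responses into two bins and return percentages.

    Assumption: anything at or above `agree_threshold` counts as "agree";
    everything else, including the neutral midpoint, is "not agree".
    """
    counts = Counter(
        "agree" if r >= agree_threshold else "not agree" for r in responses
    )
    total = len(responses)
    return {label: 100 * counts[label] / total for label in ("agree", "not agree")}


percentages = bin_agreement(responses)
print(percentages)
```

Only the ordering of the categories is used here, so nothing depends on the gap between 1 and 2 matching the gap between 2 and 3. Where the threshold sits is a judgment call that should probably be stated alongside the results.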
We frequently want to compare pre- and post-intervention data, so another strategy might be to look at percentage shifts in each response, or to look at how individual student responses changed over time. I’ve committed the sin of Likert averages more than once, and had previously wrestled with standard deviations as a method of conveying answer distributions, without understanding why it felt “wrong”. Now I do!
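Both of those pre/post strategies can be sketched without ever averaging the scale values. The snippet below is only an illustration under my own assumptions: the paired data are made up, the student identifiers are hypothetical, and the helper names (`percentages_by_point`, `percentage_shift`, `individual_direction`) are mine.

```python
from collections import Counter

SCALE = [1, 2, 3, 4, 5]  # assumed 5-point scale

# Hypothetical paired responses to one question: student id -> response.
pre = {"s1": 2, "s2": 3, "s3": 3, "s4": 4, "s5": 1}
post = {"s1": 3, "s2": 3, "s3": 4, "s4": 4, "s5": 3}


def percentages_by_point(responses):
    """Percentage of respondents choosing each point on the scale."""
    counts = Counter(responses.values())
    total = len(responses)
    return {point: 100 * counts[point] / total for point in SCALE}


def percentage_shift(pre, post):
    """Change in percentage points for each response option (post minus pre)."""
    before, after = percentages_by_point(pre), percentages_by_point(post)
    return {point: after[point] - before[point] for point in SCALE}


def individual_direction(pre, post):
    """Count students who moved up, stayed the same, or moved down.

    Only the direction of change is used, not its size, so the scale is
    still treated as ordinal.
    """
    tally = {"up": 0, "same": 0, "down": 0}
    for student in pre.keys() & post.keys():  # students present in both surveys
        if post[student] > pre[student]:
            tally["up"] += 1
        elif post[student] < pre[student]:
            tally["down"] += 1
        else:
            tally["same"] += 1
    return tally


shift = percentage_shift(pre, post)
direction = individual_direction(pre, post)
print(shift)
print(direction)
```

Tracking individuals needs responses that can be paired across the two surveys, which has obvious implications for anonymity and ethics approval; the per-option shift works with fully anonymous data.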
I had multiple takeaways from this session, but my favourite is probably endorsability as a way to answer the same sort of inherent question that leads people to average Likert scale data. For example, when testing the effectiveness of an intervention in the past, I may have looked for a numerical change in the average response to a single question, whereas the principle of endorsability would have me providing several questions, assessing student comfort in low-, medium- and high-stakes situations.
For a far more elegant summary of the talk, Dr Kristy Turner was also at the conference and sketched several of the talks; her tweet is embedded below with permission, gratefully received!