The Reification of the Student Evaluation Score

Something I’ve been thinking a fair amount about recently is the use of student evaluation scores as a tool in faculty evaluation. There are a variety of reasons for this, which I won’t go into at this point, but one thing I’ve come to find particularly interesting, and important, is their reification. I think it’s worth exploring that process in order to better understand how these tools are used and what they don’t actually tell us.

I started re-visiting the concept of “reification” when I read Adorno’s “Free Time” for a guest lecture I gave in a colleague’s class earlier this semester, and it re-surfaced, as I said, in thinking about student evaluation scores. While it has a deep history in the critical theoretical tradition represented by Adorno, it’s not a term folks are likely to hear much in everyday life, so it’s worthwhile to provide an idea of what I’m talking about before moving forward. I like to refer to “reification” as “thingification.” I suppose I get this from Adorno himself:

“For all reification is a forgetting: objects become purely thing-like the moment they are retained for us without the continued presence of their other aspects: when something of them has been forgotten.”

In this piece, I’m going to discuss the process of forgetting that turns student evaluation scores into “things” that have “objective meaning.” There are two specific issues I will focus on here. The first is the issue of innumeracy, or mathematical illiteracy. There is a certain irony in the fact that the fetishization of numerical scores involves their inappropriate and illegitimate use, but the second issue, managerial expedience, goes some way toward explaining it.

In most of the schools I have worked at, student evaluations of teaching have a similar format. There is a series of Likert-scale questions asking about various aspects of the course, where students are given a statement and asked to respond with “Strongly Agree, Agree, No Opinion, Disagree, Strongly Disagree.” Each of those options has a corresponding numerical value. A score for each question is obtained by taking the “mean” of those values, and these question means are then averaged to arrive at a final score, which is supposedly a measure of teaching effectiveness. Generally, these evaluations also contain a section where students may write out longer comments, but it is the numerical scores generated from the Likert-scale questions that are given the most weight institutionally.
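
To make that arithmetic concrete, here is a rough sketch of the calculation in Python, assuming a five-point scale coded 5 (Strongly Agree) down to 1 (Strongly Disagree). The questions and responses are invented for illustration, not drawn from any actual form.

```python
# A rough sketch of the usual calculation, assuming a five-point coding.
# The questions and responses below are invented for illustration.

LIKERT_CODES = {
    "Strongly Agree": 5,
    "Agree": 4,
    "No Opinion": 3,
    "Disagree": 2,
    "Strongly Disagree": 1,
}

# Hypothetical responses: one list of labels per Likert-scale question.
responses = {
    "The instructor was well prepared": ["Strongly Agree", "Agree", "Agree", "Disagree"],
    "The instructor communicated clearly": ["Agree", "Agree", "No Opinion", "Strongly Agree"],
}

# Step 1: take the "mean" of each question's coded responses.
question_means = {
    question: sum(LIKERT_CODES[answer] for answer in answers) / len(answers)
    for question, answers in responses.items()
}

# Step 2: the overall "effectiveness" score is simply the mean of those means.
overall_score = sum(question_means.values()) / len(question_means)

print(question_means)  # per-question means: 3.75 and 4.0
print(overall_score)   # 3.875
```

Nothing in that sketch asks whether the means are meaningful; it only shows how mechanically they are produced.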

The production of these scores brings us to the issue of innumeracy, and to the reifying activity of forgetting. It is in this process that the technical limitations of Likert-scale questions are erased in order to produce “objective” measures of teaching effectiveness. Only forgetting allows us to do so, though.

To explain those technical limitations, I’ll turn it over to a Professor of Statistics to describe the issues involved with Likert scales and “levels of measurement”:

“Effectiveness ratings are what statisticians call an “ordinal categorical” variable: The ratings fall in categories with a natural order (7 is better than 6 is better than … is better than 1), but the numbers 1, 2, …, 7 are really labels of categories, not quantities of anything. We could replace the numbers with descriptive words and no information would be lost: The ratings might as well be “not at all effective”, “slightly effective,” “somewhat effective,” “moderately effective,” “rather effective,” “very effective,” and “extremely effective.”

“Does it make sense to take the average of “slightly effective” and “very effective” ratings given by two students? If so, is the result the same as two “moderately effective” scores? Relying on average evaluation scores does just that: It equates the effectiveness of an instructor who receives two ratings of 4 and the effectiveness of an instructor who receives a 2 and a 6, since both instructors have an average rating of 4. Are they really equivalent?

“They are not, as this joke shows: Three statisticians go hunting. They spot a deer. The first statistician shoots; the shot passes a yard to the left of the deer. The second shoots; the shot passes a yard to the right of the deer. The third one yells, “we got it!”

“Even though the average location of the two misses is a hit, the deer is quite unscathed: Two things can be equal on average, yet otherwise utterly dissimilar. Averages alone are not adequate summaries of evaluation scores.”
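
The statistician’s point is easy to demonstrate with a toy comparison (the ratings below are invented): two instructors whose averages are identical but whose ratings are nothing alike.

```python
# Toy illustration: two invented sets of ratings on a 1-7 scale that share a
# mean of 4 but have completely different distributions.
from collections import Counter
from statistics import mean

instructor_a = [4, 4, 4, 4, 4, 4]  # every student chose the middle category
instructor_b = [1, 1, 2, 6, 7, 7]  # students split between the extremes

print(mean(instructor_a), mean(instructor_b))  # both averages are 4
print(Counter(instructor_a))                   # six ratings of 4
print(Counter(instructor_b))                   # clusters at the bottom and top of the scale
```

An average of 4 says nothing about which of those two patterns produced it.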

I like to think of it this way: if we were using ordinal measures like this to measure temperature, our scores would be Hot, Warm, Moderate, Cool, and Cold. What is the “average” of Hot (5), Warm (4), Warm (4), Moderate (3), and Cold (1)? Using the numbers usually assigned, we would arrive at a score of 3.4. However, those numbers are labels of categories, not quantities. Only by erasing this fact can we even try to average them, and then treat that average as though it were numerically meaningful. The problem is compounded when we average several of these scores to create an overall effectiveness score, and then combine those scores to compare individuals to group averages. These scores are the product of multiplicative meaninglessness.
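
Worked out in the same sketch form, with the usual arbitrary coding, the temperature example looks like this; the ordinal-friendly summaries at the end are one alternative, not a full fix.

```python
# The temperature example, worked out with the usual arbitrary codes.
from collections import Counter

codes = {"Hot": 5, "Warm": 4, "Moderate": 3, "Cool": 2, "Cold": 1}
observations = ["Hot", "Warm", "Warm", "Moderate", "Cold"]

average = sum(codes[label] for label in observations) / len(observations)
print(average)  # 3.4 -- but no thermometer reads "3.4 warm"

# Summaries that respect the ordinal character of the data need no such arithmetic:
print(Counter(observations))                  # the full distribution of labels
ranked = sorted(observations, key=codes.get)  # order by the scale, not alphabetically
print(ranked[len(ranked) // 2])               # the middle (median) category: 'Warm'
```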

Despite their meaninglessness, we can see that these scores have been imbued with meaning. Perhaps that shouldn’t be surprising. After all, one of the basic assumptions of symbolic interactionist sociology is that “people act toward objects based on the meaning they have for them.” However, in this case, that meaning comes from the process of erasing the history and technical limitations of ordinal-level data, a process aided by widespread, even institutionalized, innumeracy.

In Paulos’s classic formulation, innumeracy is a form of mathematical illiteracy, and that is what’s at work here. There is a basic technical incompetence involved in transforming these responses into scores by simply taking an average, and then averaging the averages. But no one seems to notice. They have either forgotten or never understood the limitations of these data and what they are actually capable and incapable of measuring. Procedures and policies are established to produce and utilize these scores. Technical incompetence becomes institutional practice as we collectively agree to pretend these scores are valid.

And that’s where the issue of managerial expedience comes into play. The scores themselves become tools that managers can use in the performance of their duties. It is far easier to use simple numerical scores than to engage in an in-depth review of the various forms of assessment necessary to adequately evaluate faculty performance. Not only is it easier, it is far less time-consuming, which becomes important as the number of faculty increases. Note, I didn’t say the number of full-time faculty. The adjunctification of the academic labor force has resulted in a greater number of instructional faculty to be evaluated.

Even aside from evaluating individual faculty, these scores are used by administrators to study groups of faculty, to compare individuals to their departmental or college colleagues, to measure year-to-year “improvements,” and the like. These scores become objective things, independent of the history of their creation and its limitations. They take on a life of their own as tools used in the managerial enterprise. Managerial expedience leads to the institutionalization of innumeracy, but that is only possible due to the reification of these scores, to their thingification through the erasure of what they actually are.
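
A minimal sketch of that next layer of use, with invented numbers, shows how each managerial comparison simply piles more arithmetic on top of values that were never quantities to begin with.

```python
# Invented overall scores for one department, used the way administrators often
# use them: a department average to compare against, and a year-to-year delta.
department_overall_scores = {
    "Instructor A": 4.1,
    "Instructor B": 3.6,
    "Instructor C": 4.4,
}

department_average = sum(department_overall_scores.values()) / len(department_overall_scores)
print(round(department_average, 2))  # 4.03 -- an average of averages of averages

# A year-to-year "improvement" for one instructor:
last_year, this_year = 3.6, 3.7
print(round(this_year - last_year, 2))  # 0.1 -- treated as meaningful movement
```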

This raises a more significant question: In the era of the managerial-corporate academy, how do we interrupt and resist these processes? Remembering is a necessary feature of de-thingifying these scores, and de-legitimizing their illegitimate use, but I’m not completely sure the management class cares about its own incompetence in such matters. That’s a problem for another, and every, day.
