University Student Evaluations of Teaching (SETs) Are Probably Bad, But Not For the Reason Many Think.
Are SETs biased against women and ethnic minorities? Let's have a look.
The end of each semester brings a familiar ritual. Faculty hand out student evaluations of teaching (SETs) to their classes while simultaneously grousing that they are useless in evaluating teaching effectiveness. Although it feels as if some momentum is growing to reconsider their use, there is little indication they’ll vanish in the near future. And controversies about them remain: how well do they reflect teaching effectiveness; do they tell us more about professors or their students; are they biased against women and ethnic minorities; and are there significant downsides to their use?
A new review of empirical evidence regarding the effectiveness of SETs by psychology professor Bob Uttl at Mount Royal University is damning, albeit not always in the way people may expect. Put simply, there are good reasons to view SETs as more pernicious than helpful when it comes to university teaching. However, some of the reasons many people believe SETs are bad aren’t empirically supported. Thus, arguments against SETs must remain cautiously grounded in empirical evidence, lest they run afoul of the data.
A Brief History of SETs
Most histories of SETs begin approximately a century ago, in the 1920s. At the time, there was widespread dissatisfaction in the US with the quality of university instruction. Ironically, this appears to echo current significant declines in the public’s perception of universities and their value to society, so SETs apparently don’t shield universities from public opprobrium. There was little in the way of systematic evaluation of teaching effectiveness, with rumor and gossip among professors playing a major role.
In the early 20th century, psychologists Herman Remmers and Edwin Guthrie developed some of the first systematic SETs. These initial SETs were designed to be formative; that is, to help professors get a sense of student perceptions and improve their own teaching. They were not intended for evaluative use by university administrators.
Until the 1970s, they largely remained formative. In 1973, only 29% of universities mandated the use of SETs. However, their use soon skyrocketed: 68% of universities used them by 1983, rising further to 86% by 1993. That this growth tracked the broader bureaucratization of higher ed, and the introduction of buzzwords ranging from learning outcomes to diversity, was probably no accident. Most of us have grown up in an era in which the use of SETs is essentially ubiquitous.
However, their increasing use in evaluative decisions, namely merit raises, promotion, and tenure, raised serious questions about whether SETs were warranted for such purposes. Did they actually measure teaching effectiveness? Further, as diversity issues moved to the center of academia, both in good and, arguably, bad faith ways, questions were raised about whether SETs were fair. Is it possible that students’ racial and gender biases might influence SET scores?
Are SETs a Valid Measure of Teaching Effectiveness?
To use SETs effectively for evaluation, we must have some understanding of what we want SETs to measure, and evidence that they actually measure that thing. Presumably, SETs are meant to provide some estimate of how much students actually learn in a course. SETs are often the only quantitative measure professors can rely on when addressing teaching effectiveness. Being quantitative, they may be more persuasive than the qualitative data from peer evaluations. In fact, a strong argument can be made that good quantitative data should take precedence. But do SETs, in fact, provide reliable and valid evidence of teaching effectiveness?
Uttl’s review of decades of empirical evidence reaches a clear conclusion: no. The primary means of examining the usefulness of SETs are what are called multisection studies. That is: multiple sections of the same course, using the same syllabus, tests, reading materials, etc., with only the instructor differing between sections. In the best such studies, and controlling for prior student learning, Uttl found that SETs predicted around 1% of the variance in student learning and concluded “the estimated SET/learning correlations were not significantly different from zero.”
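To put that 1% figure in concrete terms: variance explained is the square of the correlation, so 1% of variance corresponds to a correlation of roughly 0.10. A minimal simulation (hypothetical data, not Uttl’s; numpy assumed) shows how weak such a relationship is:

```python
import numpy as np

rng = np.random.default_rng(0)

# "1% of variance explained" implies a correlation of about r = 0.10,
# since variance explained is r^2 (0.10^2 = 0.01).
r = 0.10
n = 10_000

# Simulate standardized "learning" and SET scores sharing that correlation.
# These are invented data, purely for illustration.
learning = rng.standard_normal(n)
sets = r * learning + np.sqrt(1 - r**2) * rng.standard_normal(n)

obs_r = np.corrcoef(learning, sets)[0, 1]
print(f"simulated SET/learning correlation: {obs_r:.3f}")
print(f"variance in learning 'explained' by SETs: {obs_r**2:.1%}")
```

At that strength of association, knowing a professor’s SET score tells you almost nothing about how much their students learned.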
Part of the problem is that SETs tap into multiple other factors, such as students’ actual intelligence as well as how intelligent they think they are (the tendency to overestimate one’s own expertise or knowledge, known as the Dunning-Kruger Effect). In other words, if students think they deserve a higher grade than they are getting at the time of evaluation, even if they do not, they evaluate professors more poorly. Courses in the hard sciences tend to be rated more poorly than other subjects, and physically attractive professors are rated more highly (good news for me!...or maybe not).
Surprising probably no one, more difficult courses are rated more poorly. And students apparently can be bribed…offering them chocolates or cookies on the day of in-person evaluations raises SETs.
To be fair, not all the evidence is of the highest quality (some studies use Rate My Professor ratings rather than actual SETs, although Uttl argues the two data sources tend to correlate highly). Nonetheless, even if we consider only those studies that look at SETs directly, the evidence that they predict teaching effectiveness is poor, whereas the evidence that they are susceptible to other influences is strong.
Do SETs Tell Us More About Professors or Students?
Since SETs became widely popular as administrative evaluation tools in the 1970s and 80s, the student body has changed significantly. A higher proportion of high school students attend college of some sort. Yet, in more recent years, the population of graduating high school seniors has stabilized and possibly begun to shrink (due to declining births and a reduced number of males entering college), leading universities to dig deeper into applicant pools (read: less qualified applicants) to fill incoming classes. Several straightforward changes have occurred in recent cohorts of college students.
As many instructors have likely witnessed, incoming students have far higher rates of mental health problems than previous cohorts. Contrary to popular thinking, this isn’t a problem unique to teens and young adults (suicides, for instance, are actually much higher among middle-aged adults). However, it does mean that recent cohorts of students are generally more stressed, unhappy, and, indeed, neurotic than prior generations.
The intellectual readiness of college students has also declined in recent cohorts. This is likely due to at least two issues. First, as the number of graduating high school seniors moving on to university has declined, colleges need to admit less prepared students. Second, the impacts of the Covid-19 pandemic are still being felt among students who experienced shutdowns. Thus, we’ve seen a decline in standardized test scores among recent cohorts of students. Data on intelligence also suggest a “reverse Flynn effect” of decreasing average intelligence scores, including among college students (but not college graduates, presumably due to dropouts). Intelligence does predict the likelihood of graduating from college.
Taken together, just these two factors suggest incoming pools of students who are less well prepared for the rigors of university, both emotionally and intellectually. That the unhappiness of these students may be reflected in SETs should be rather straightforward.
Are SETs Biased Against Women or Ethnic Minorities?
Any debate on the use of SETs inevitably will lead to claims that they are systemically biased against women or ethnic minorities. This is a difficult subject given the emotional valence of race and sex/gender issues, as well as the taboo in progressive academic circles against challenging such beliefs. Thus, these claims have an “It is known…” vibe. Yet, surprisingly, this is one of the weakest challenges to SETs.
As Uttl notes in his evaluation of studies regarding sex/gender and SETs, the evidence for biasing effects is weak. Mean differences between women and men are actually very tiny and inconsistent across studies. In a review of this literature I conducted several years back, I came to the same conclusion.
To be fair, a recent review by other scholars suggested there may be small sex effects favoring men. This created something of a stir, likely because it supported the preexisting narrative, but I find the evidence provided underwhelming. For instance, as they note from one study they review, the difference between men and women was “by 0.046 on a scale from 1 to 5.” Though “statistically significant,” such effects are clearly trivial, and often driven by methodological noise or even the scholars’ preexisting beliefs. We would never accept such a low standard of evidence for correlations between SETs and teaching effectiveness, yet far too many people relax the standards of evidence when the results support academics’ preexisting beliefs.
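It is easy to show why “statistically significant” and “meaningful” come apart here: with large samples, even a 0.046 gap on a 5-point scale yields a tiny p-value while the standardized effect size remains negligible. A hedged sketch with invented sample sizes and spread (only the 0.046 gap comes from the review; numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical illustration: a 0.046 mean gap on a 1-5 scale, with an
# assumed within-group SD of 0.8 and a large sample per group. Nothing
# here besides the 0.046 gap comes from the review itself.
n, sd, gap = 50_000, 0.8, 0.046
group_a = rng.normal(4.0, sd, n)
group_b = rng.normal(4.0 - gap, sd, n)

t, p = stats.ttest_ind(group_a, group_b)
d = (group_a.mean() - group_b.mean()) / sd  # standardized effect size
print(f"p-value: {p:.1e} (statistically 'significant')")
print(f"Cohen's d: {d:.3f} (a trivially small effect)")
```

Under these assumptions the test returns a vanishingly small p-value, yet the standardized difference is around 0.06 standard deviations, far below the conventional threshold for even a “small” effect.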
The evidence base regarding race/ethnicity bias is smaller. Though some scholars hold passionate beliefs in these effects, effect sizes appear to be weak and inconsistent (in some cases black professors are rated higher than white professors, etc.). One study of 224 business professors demonstrates the complexities. Asian professors were rated highest, with white and Latino professors about equal, and black professors rated lowest (albeit sample sizes were tiny and obviously non-representative). There were no sex/gender differences in ratings. When the authors combined all non-white groups together, white professors scored higher than non-white professors, but this masked significant outcome discrepancies between subgroups of non-white professors. Any group differences were specific to black professors, not other non-whites. A toy calculation below shows how that kind of pooling can mislead.
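A minimal sketch of the aggregation problem, using invented means and group sizes (only the ordering of groups, Asian highest, white and Latino about equal, black lowest, follows the study described above):

```python
# Hypothetical numbers illustrating how pooling masks subgroup differences.
# The means and group sizes here are invented purely for illustration.
ratings = {"Asian": 4.3, "white": 4.1, "Latino": 4.1, "black": 3.6}
sizes = {"Asian": 15, "white": 120, "Latino": 25, "black": 20}

nonwhite = ["Asian", "Latino", "black"]
pooled = sum(ratings[g] * sizes[g] for g in nonwhite) / sum(
    sizes[g] for g in nonwhite
)

print(f"pooled 'non-white' mean: {pooled:.2f} vs. white mean: {ratings['white']}")
# The pooled gap is driven almost entirely by one subgroup, even though
# another pooled subgroup (Asian) is rated *above* white professors.
```

With these numbers the pooled “non-white” mean (about 3.98) sits below the white mean (4.1), yet the summary conceals that one pooled subgroup outscores white professors while the gap traces to a single subgroup.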
Yet, even if we were to conclude that demographic group differences were clear and robust, despite the murky state of the actual evidence, that is not enough to indicate bias. Taboo though it is to say, it is possible that any differences reflect real group differences. It may indeed be the case that white male professors are (slightly) better teachers than other groups. By contrast, white males may be pussycats who inflate grades and reduce rigor to coddle students. Either way, we simply can’t conclude that group differences, even if they exist, indicate bias. It would be necessary to demonstrate that SETs predict teaching effectiveness differently for different groups, and that evidence is simply lacking.
As such, this argument is probably a problematic one. First, its moralistic and identitarian nature may mainly serve to create division between different groups rather than foster unity in opposition to SETs (after all, if SETs favor one group…and we observe humans are inherently selfish…that group may be convinced to support, not oppose, them). Second, it’s an easily rebuttable argument. Thus, it should be dropped in favor of better supported arguments, namely that SETs are ineffective for all groups of professors, regardless of sex or race.
Are There Significant Downsides to the Use of SETs?
SETs can be useful when used formatively; that is, when seen only by instructors themselves as a way to gather helpful suggestions from students. Indeed, I’ve sometimes changed my courses after receiving constructive suggestions from students. However, used evaluatively, they create a perverse incentive to do whatever it takes to keep students happy, even if those decisions aren’t the best for student learning.
Much has been lamented about the customer mentality that has evolved in higher education over the past few decades. Undoubtedly, SETs contribute to this to the degree that they reinforce the mentality that students must be kept happy as a primary goal, whether or not that is associated with their learning.
The result, as noted by Dr. Uttl, has been considerable grade inflation over the past few decades. For instance, at Harvard, arguably the leading educational institution in the US, the median grade across all classes has been an A- since at least 2013. This is associated with reductions in expectations, workload, and study time among students.
Coupled with a focus on retention at all costs, SETs have contributed to a reduction in academic standards and an inflation of grades. Given that they are also a poor measure of teaching effectiveness, they may misidentify which teachers are effective and which are not.
Concluding Thoughts
There are very good reasons to consider ending the use of SETs. However, it is important that efforts to end the use of SETs in evaluation remain focused on good data and not become distracted by moralistic and divisive arguments where the evidence is weaker. Indulging in identitarian moralization may actually undermine efforts to reduce the reliance on SETs.
There is an argument for retaining SETs purely as a formative measure, seen only by instructors themselves. However, I suspect if they are retained there will always be a creeping urge by administrators to convert them back into evaluative measures, however ineffective they may be.
There is also a fair concern about preserving a channel through which students can identify abusive teachers. However, there is little evidence that SETs are particularly effective in ensuring student protection from abusive professors. Other mechanisms for reporting and complaint, which already exist at most universities, may be more effective in dealing with professors who violate codes of conduct.
Ultimately, SETs probably should be discontinued in their entirety until a reliable and valid system for their use is developed.