Questioning student evaluations



Most colleges and universities are nearing the end of their fall terms.  Hence it’s time not only for final examinations but also for collecting student evaluations of teaching (SET).  Student ratings can be high-stakes.  They’re used to evaluate teaching when faculty members are considered for tenure or promotion.  They’re also employed in decisions about whether to reappoint faculty on contingent contracts.  In fact, they’re often the only method a university uses to monitor the quality of teaching, and their misuse can lead to disastrous violations of academic freedom and faculty rights.

Recently, a number of scholars have published research demonstrating that student evaluations may not be the best way to measure either student learning or instructor effectiveness.  Two studies are especially revealing.

Philip Stark, chair of the statistics department at the University of California, Berkeley, has been teaching at Berkeley since 1988, and says “the reliance on teaching evaluations has always bothered me.”  He is the co-author, with Richard Freishtat of Berkeley’s Center for Teaching and Learning, of “An Evaluation of Course Evaluations.”  Here are some choice excerpts from their study:

First, on response rates:

Some students do not fill out SET surveys. The response rate will be less than 100%. The lower the response rate, the less representative the responses might be: there’s no reason nonresponders should be like responders–and good reasons they might not be. For instance, anger motivates people to action more than satisfaction does. Have you ever seen a public demonstration where people screamed “we’re content!”? Nonresponse produces uncertainty: Suppose half the class responds, and that they rate the instructor’s handwriting legibility as 2. The average for the entire class might be as low as 1.5, if all the “nonresponders” would also have rated it 1. Or it might be as high as 4.5, if the nonresponders would have rated it 7. . . .

The point: Response rates themselves say little about teaching effectiveness. In reality, if the response rate is low, the data should not be considered representative of the class as a whole.
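Stark and Freishtat’s arithmetic is easy to verify.  Here is a short sketch of their handwriting-legibility example (the 1-to-7 scale and the 50% response rate come from their excerpt; the function name is my own):

```python
# Bounds on a class-wide average rating under nonresponse, following
# Stark and Freishtat's example: half the class responds, with an
# average rating of 2 on a 1-to-7 scale.

def average_bounds(responder_mean, response_rate, scale_min=1, scale_max=7):
    """Best- and worst-case class averages if nonresponders had answered."""
    nonresponse = 1 - response_rate
    low = responder_mean * response_rate + scale_min * nonresponse
    high = responder_mean * response_rate + scale_max * nonresponse
    return low, high

low, high = average_bounds(responder_mean=2, response_rate=0.5)
print(low, high)  # 1.5 4.5 -- the true class average could lie anywhere between
```

The lower the response rate, the wider this interval becomes, which is exactly why a low-response average says so little about the class as a whole.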

What about averages?

Personnel reviews routinely compare instructors’ average scores to departmental averages. Such comparisons make no sense, as a matter of Statistics. They presume that the difference between 3 and 4 means the same thing as the difference between 6 and 7. They presume that the difference between 3 and 4 means the same thing to different students. They presume that 5 means the same thing to different students and to students in different courses.  They presume that a 3 “balances” a 7 to make two 5s. For teaching evaluations, there’s no reason any of those things should be true. . . .

Comparing an individual instructor’s average with the average for a course or a department is meaningless: Suppose that the departmental average for a particular course is 4.5, and the average for a particular instructor in a particular semester is 4.2. The instructor’s rating is below average. How bad is that?  If other instructors get an average of exactly 4.5 when they teach the course, 4.2 might be atypically low. On the other hand, if other instructors get 6s half the time and 3s half the time, 4.2 is well within the spread of scores.
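Their point about spread can be made concrete with a few lines of Python (the class of ten instructors is my own assumption; the 4.5 average and the half-3s, half-6s split are from their example):

```python
import statistics

# Two departments with the same average rating (4.5) but very
# different spreads.
consistent = [4.5] * 10          # every other instructor scores exactly 4.5
spread_out = [3] * 5 + [6] * 5   # half get 3s, half get 6s; the mean is still 4.5

for name, scores in [("consistent", consistent), ("spread out", spread_out)]:
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    print(f"{name}: mean={mean}, sd={sd}")

# Against the consistent department, an instructor's 4.2 is atypically low;
# against the spread-out one (sd = 1.5), it sits well inside the range of
# ordinary scores.  The average alone cannot tell you which case you are in.
```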

These points relate to quantitative evaluations, but what of student comments?  Stark and Freishtat write:

Students are ideally situated to comment about their experience [italics in original] of the course, including factors that influence teaching effectiveness, such as the instructor’s audibility, legibility, and perhaps the instructor’s availability outside class.  They can comment on whether they feel more excited about the subject after taking the class, and—for electives—whether the course inspired them to take a follow-up course. They might be able to judge clarity, but clarity may be confounded with the difficulty of the material. While some student comments are informative, one must be quite careful interpreting the comments: faculty and students use the same vocabulary quite differently, ascribing quite different meanings to words such as “fair,” “professional,” “organized,” “challenging,” and “respectful.”

So what do student evaluations actually measure?

This is what we do with SET.  We don’t measure teaching effectiveness. We measure what students say, and pretend it’s the same thing.  We calculate statistics, report numbers, and call it a day. . . . SET may be reliable, in the sense that students often agree.  But that’s an odd focus. We don’t expect instructors to be equally effective with students with different background, preparation, skill, disposition, maturity, and “learning style.” Hence, if ratings are extremely consistent, they probably don’t measure teaching effectiveness: If a laboratory instrument always gives the same reading when its inputs vary substantially, it’s probably broken. . . .

Students are in a good position to observe some aspects of teaching, such as clarity, pace, legibility, audibility, and their own excitement (or boredom).  SET can measure these things . . . .  But students cannot rate effectiveness–regardless of their intentions.  Calling SET a measure of effectiveness does not make it one, any more than you can make a bathroom scale measure height by relabeling its dial “height.”  Averaging “height” measurements made with 100 different scales would not help.

Another study conducted by Michele Pellizzari, an economics professor at the University of Geneva in Switzerland, makes a more troubling claim: that course evaluations may in fact measure, and thus motivate, the opposite of good teaching.  His experiment took place with students at the Bocconi University Department of Economics in Milan, Italy.  The paper compared the student evaluations of a particular professor to another measure of teacher quality: how those students performed in a subsequent course.  The paper is highly technical statistically (at least for me), but the basic conclusion is clear:  The better the professors were, as measured by their students’ grades in later classes, the lower their ratings from students. “If you make your students do well in their academic career, you get worse evaluations from your students,” Pellizzari concluded,

teachers who are more effective in promoting future performance receive worse evaluations from their students. This relationship is statistically significant for all items (but logistics), and is of sizable magnitude. . . .

These results clearly challenge the validity of students’ evaluations of professors as a measure of teaching quality. Even abstracting from the possibility that professors strategically adjust their grades to please the students (a practice that is made difficult by the timing of the evaluations, that are always collected before the exam takes place), it might still be possible that professors who make the classroom experience more enjoyable do that at the expense of true learning or fail to encourage students to exert effort.  Alternatively, students might reward teachers who prepare them for the exam, that is teachers who teach to the test, even if this is done at the expenses of true learning. This interpretation is consistent with the results in Weinberg et al. (2009), who provide evidence that students are generally unaware of the value of the material they have learned in a course.  Of course, one may also argue that students’ satisfaction is important per se and, even, that universities should aim at maximizing satisfaction rather than learning, especially private institutions like Bocconi. We doubt that this is the most common understanding of higher education policy. . . .

The interpretation of the students’ evaluations as measures of the quality of teaching rests on the – explicit or implicit – view that the students observe the quality of teaching in the classroom and, when asked to report it in the questionnaire, they do so truthfully. Our results, however, contradict this view and seem more consistent with the idea that students evaluate teachers on the basis of their enjoyment of the course or, in the words of economists, on the basis of their realized utility. Good teachers – those who provide their students with knowledge that is useful in future learning – presumably require their students to exert effort by paying attention and being concentrated in class and by doing demanding homework. As it is commonly assumed in economic models, agents dislike exerting effort and, if the students’ questionnaires reflect utility, it is very possible that good teachers are badly evaluated by their students.

So what do we do about all this?  Pellizzari writes:

Overall, our results cast serious doubts on the validity of students’ evaluations of professors as measures of teaching quality or effort. At the same time, the strong effects of teaching quality on students’ outcomes suggest that improving the quantity or the quality of professors’ inputs in the education production function can lead to large gains.

Stark and Freishtat are more specific:

If we want to assess and improve teaching, we have to pay attention to the teaching, not the average of a list of student-reported numbers with a troubled and tenuous relationship to teaching. Instead, we can watch each other teach and talk to each other about teaching. We can look at student comments. We can look at materials created to design, redesign, and teach courses, such as syllabi, lecture notes, websites, textbooks, software, videos, assignments, and exams. We can look at faculty teaching statements. We can look at samples of student work. We can survey former students, advisees, and graduate instructors. We can look at the job placement success of former graduate students.

We can ask: Is the teacher putting in appropriate effort? Is she following practices found to work in the discipline? Is she available to students? Is she creating new materials, new courses, or new pedagogical approaches? Is she revising, refreshing, and reworking existing courses? Is she helping keep the curriculum in the department up to date? Is she trying to improve? Is she supervising undergraduates for research, internships, and honors theses? Is she advising graduate students? Is she serving on qualifying exams and thesis committees? Do her students do well when they graduate?

Stark and Freishtat go on to recommend a model of instructional assessment based on teaching portfolios and peer evaluation, and they hold up Berkeley’s Statistics Department as a possible model. But that kind of effort often takes a good deal of time and energy and usually doesn’t provide the quick and dirty “measurable outcomes” that have become so fashionable in too many institutions.

The conclusions that I’ve reported here are, of course, already well known to most of us who have actually taught, even if we aren’t always cognizant of the kind of rigorous and scientific investigation found in these two studies (and in many more as well).  Unfortunately, however, I somehow doubt that many administrators or those in the burgeoning assessment racket will pay much attention to these studies.  For too many of them, our colleges and universities are just businesses and the students are our “customers.”  And, as we all know, in business the customer is always right.