Judgment and Decision Making, vol. 1, no. 1, July 2006, pp. 86-90.
 
The availability heuristic in the classroom:
How soliciting more criticism can boost your course ratings 
Craig R. Fox1
  
  UCLA Anderson School and Department of Psychology  
Abstract
  This paper extends previous research showing that experienced
  difficulty of recall can influence evaluative judgments (e.g.,
  Winkielman & Schwarz, 2001) to a field study of university
  students rating a course.  Students completed a mid-course
  evaluation form in which they were asked to list either 2 ways in
  which the course could be improved (a relatively easy task) or
  10 ways in which the course could be improved (a relatively
  difficult task).  Respondents who had been asked for 10
  critical comments subsequently rated the course more favorably
  than respondents who had been asked for 2 critical comments.
  An internal analysis suggests that the number of critiques
  solicited provides a frame against which accessibility of
  instances is evaluated.  The paper concludes with a discussion
  of implications of the present results and possible directions
  for future research.
Keywords: availability, course evaluations, accessibility, ease
of retrieval
 
1  Introduction
According to Tversky and Kahneman's (1973) availability heuristic,
people sometimes judge the frequency of events in the world by the ease
with which examples come to mind.  This process has generally been
demonstrated by asking participants to assess the relative likelihood
of two categories in which instances of the first category are more
difficult to recall than instances of the second category, despite the
fact that instances of the first category are more common in the world.
 For instance, Tversky and Kahneman (1973) found that most people think
the letter R appears more often in English words as the
first letter than as the third letter, presumably
because the first letter provides a better cue for recalling instances
of words than does the third letter.  In fact, it turns out that
R appears more often as the third than first letter in English
words.
Schwarz et al. (1991) observed that the classic studies
demonstrating the availability heuristic failed to distinguish an
interpretation based on ease of retrieval from an alternative
interpretation based on content of retrieval in which an event
is judged more common when a larger number of examples come to mind. 
To tease apart these accounts, Schwarz et al. (1991) asked
participants in one
study to list either 6 or 12 examples of assertive
or unassertive behavior that they have exhibited and then rate
themselves on their overall degree of assertiveness.  Participants
rated themselves as more assertive after they had listed 6 examples of
assertive behavior (a relatively easy task) rather than 12 examples (a
relatively difficult task); similarly, they rated themselves as less
assertive (i.e., more unassertive) after they had listed 6 rather than
12 examples of unassertive behavior.  Similar patterns of
results have been observed in many other studies of frequency-related
judgments, including the rate at which a particular letter occurs in
various positions of words (Wänke, et al., 1995), the quality of
one's own memory (Winkielman, et al., 1998), the frequency of one's own
past behaviors (Aarts & Dijksterhuis, 1999), one's susceptibility to
heart disease (Rothman & Schwarz, 1998) and one's susceptibility to
sexual assault (Grayson & Schwarz, 1999).  For a review of this
literature see Schwarz (1998; 2004).  Thus, an abundance of data
supports the original interpretation of the availability heuristic:
categories are judged to be more common when instances more
easily come to mind, even when a smaller absolute
number of instances are generated.
This program has been extended from frequency-based judgments to 
evaluative judgments of such targets as public transportation (Wänke,
et al., 1996), luxury automobiles (Wänke, et al., 1997), and one's
own childhood (Winkielman & Schwarz, 2001).  For instance, Winkielman
and Schwarz (2001) asked participants to recall either 4 childhood
events (an easy task) or 12 childhood events (a difficult task).  Some
participants were then led to believe that memories from pleasant
periods tend to fade, while others were led to believe that memories
from unpleasant periods tend to fade.  When later asked to
evaluate their childhood, participants who believed that pleasant
memories fade rated their childhood more favorably when they completed
the difficult task (12 events) than the easy task (4 events);
participants who believed that unpleasant memories fade rated their
childhood more favorably when they completed the easy rather than
difficult task.
Previous studies of the availability heuristic using the paradigm of
Schwarz et al. (1991) have turned up impressive and robust
results.  However, these demonstrations have been restricted primarily
to laboratory surveys in which the task of recalling examples and then making
an overall assessment may seem somewhat artificial to participants and
the responses of little consequence.  More important, most participants
in previous studies presumably had little prior experience with the
particular Likert scale that served as the dependent measure (e.g.,
most had never before rated their childhood or public transportation on
a 7-point scale).  Hence, respondents' ratings may be especially
susceptible to superficial cues, such as the accessibility of
instances, when they map their beliefs and attitudes onto an unfamiliar
response scale.  
The present investigation overcomes these limitations through a "field
study" of students evaluating a course.  First, evaluations are a
normal facet of most university courses in which students are commonly
asked to list specific suggestions and also provide a global
assessment.  Moreover, course evaluations are consequential: they
can influence future course offerings, course staffing, and promotion
and tenure decisions, and they provide information to prospective
students of the target course.  Second, students at universities
quickly become familiar with standard course evaluation scales and how
ratings are distributed across classes, often relying on these scores
in choosing among elective courses.  
The study of course evaluations is also interesting in its own right.  A
number of recent papers have questioned the validity of these ratings,
and a lively debate appeared some years ago in the American
Psychologist (1997, pp. 1182-1225; 1998, pp. 1223-1231).  Thus far,
questions of discriminant validity have mainly focused on the
correlation between teaching ratings and apparently irrelevant factors
such as the students' expected grades or the course workload.  To date
there have been few published investigations of the relationship
between the design of course feedback forms and summary course
evaluations.  The present study attempts to answer the following
provocative question: can one paradoxically obtain higher
course ratings by soliciting a greater number of critical
comments from students?
 
2  Method
Participants were 64 business students enrolled in two sections
of a course on negotiation at Duke University.  Three weeks into
a six-week term, students were asked to complete a one-page
mid-course evaluation form, as they do for most classes in the
business school at Duke.  Students in both class sections were
randomly assigned to one of two course evaluation forms that
differed by a single item.  The first ten items on both forms
were neutrally valenced short-answer and multiple choice
questions (e.g., "How do you view the pacing of the class,"
with response options ranging from "much too slow" to "much
too fast") and neutrally valenced open-ended questions (e.g.,
"What do you think of the lectures and class discussion so
far?").  For item #11, half the students (n = 32) were
asked to "List 2 ways in which you feel the course could be
improved" whereas the remaining students (n = 32) were
asked to list 10 ways in which they felt the course could be
improved.  Directly below this question, the relevant number of
numbered spaces (2 or 10) was provided.  For item #12, all
participants were asked to list their 2 favorite aspects of the
course.  Finally, all participants were asked, "Overall, how
would you rate the course so far on a 1-7 scale."2 This scale
is used for all final course evaluations, with 1 denoting the
lowest possible score and 7 the highest possible score.  The
large majority of students enrolled in the class (84%) were in
the midst of their second year of the MBA program (their seventh
six-week term), and therefore had an abundance of prior
experience using this scale to rate classes.
 Table 1: Results of linear regressions predicting course evaluation scores.
| Independent variable | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 |
| Critiques solicited | 0.0749* (0.0334) |  |  | 0.0855* (0.0334) | -0.00172 (0.0446) |  | -0.00558 (0.0632) |
| Critiques produced |  | -0.118 (0.0957) |  | -0.162† (0.0928) |  | 0.00344 (0.0971) | 0.012 (0.138) |
| Ratio of produced to solicited |  |  | -1.070** (0.312) |  | -1.081* (0.439) | -1.075** (0.343) | -1.12† (0.667) |
| Adj. R² | .066 | .009 | .159 | .099 | .144 | .144 | .128 |
† .05 < p < .10, * p < .05, ** p < .005
Note: Rows correspond to independent variables.  Regression
models are numbered by column.  Cell entries list regression
coefficients, with standard errors in parentheses.  Adjusted
R² values for each regression are listed in the bottom row of the
relevant column.
 
3  Results
Responses did not differ significantly between the two class
sections, so the data were combined.  Results supported the major
theoretical prediction: course evaluations were higher among
students who were asked to list 10 ways in which the course could
be improved (M = 5.52, SD = 0.88) than among those who were asked to
list 2 ways in which the course could be improved (M = 4.92, SD =
1.12).  Median scores were 5.5 and 5.0, respectively.  This
difference was statistically significant (t(56) = 2.24, p = .03).
Based on the distribution of final
scores for all courses offered that term in the business school
at Duke University, this difference between experimental
conditions is approximately 3/4 of a standard
deviation.3
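As a rough check on this arithmetic, the standardized difference can be sketched as follows. The condition means are those reported above and the school-wide standard deviation (0.80) comes from footnote 3; the variable names are illustrative only and are not part of the original analysis.

```python
# Condition means reported in the Results section.
mean_10_slot = 5.52
mean_2_slot = 4.92

# SD of final ratings across all business-school courses that term (footnote 3).
school_sd = 0.80

raw_difference = mean_10_slot - mean_2_slot
standardized_difference = raw_difference / school_sd

print(f"Raw difference: {raw_difference:.2f} scale points")
print(f"Difference in school-wide SD units: {standardized_difference:.2f}")
```

Note that this expresses the gap relative to the spread of course-level final ratings, not relative to the within-sample spread of individual student ratings.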
Not surprisingly, participants produced a greater number of suggestions
in the 10-slot condition (M = 2.1, SD = 1.84) than in the 2-slot
condition (M = 1.6, SD = 0.72); however, this difference was small and
not statistically significant (t(56) = 1.52, p =
.14).  In fact, only 31% of participants in the 10-slot condition
provided more than two suggestions for ways in which the course could
be improved.  One might therefore surmise that the observed effects on
course ratings were more often driven by artificially
extending the natural retrieval process of students in the
10-slot condition (thereby improving subsequent course ratings) than by
artificially truncating the natural retrieval process of
students in the 2-slot condition (thereby depressing subsequent course
ratings).  
In previous studies using similar methods, the vast majority of
participants did manage to produce the number of examples that
were solicited by the experimenter so that the number of examples
solicited and the number provided were perfectly (or nearly
perfectly) correlated.  In the present study, the variability in
the number of critiques provided by respondents in both
conditions offers a unique opportunity to explore the independent
effects of the number provided and the number
solicited.4  When both variables are entered into an ordinary
least-squares regression, the number solicited is
positively related to course ratings (b = .09,
t(55) = 2.56, p = .01), and the number
produced is negatively related to course ratings
(b = -.16, t(55) = -1.74, p = .09;
see model 4, Table 1).
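The t statistics quoted for model 4 can be approximately recovered from Table 1 by dividing each coefficient by the value printed beneath it in parentheses. The sketch below assumes that those parenthesized values are standard errors of the coefficients; the dictionary and function names are hypothetical.

```python
# Model 4 entries from Table 1: (coefficient, parenthesized value).
# Assumption: the parenthesized values are standard errors.
model_4 = {
    "critiques solicited": (0.0855, 0.0334),
    "critiques produced": (-0.162, 0.0928),
}


def t_ratio(coefficient: float, standard_error: float) -> float:
    """Return the t statistic for testing that a coefficient equals zero."""
    return coefficient / standard_error


t_values = {name: t_ratio(b, se) for name, (b, se) in model_4.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```

The ratios come out at roughly 2.56 and -1.75, close to the values reported in the text; the small discrepancy in the second figure presumably reflects rounding of the tabled coefficients.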
One might infer on the basis of this result that the number of
critiques solicited provides a frame against which the number recalled
is evaluated.  If so, one would expect course ratings to be
predicted quite well by the ratio of critical comments
produced to critical comments solicited.  Indeed, there is a
strong negative correlation between this ratio and course ratings
(r = -.42, t(56) = -3.43, p ≈ .001;
see also model 3, Table 1).  Moreover, when we
enter both the number of critiques solicited and the ratio of
critiques produced to solicited into a regression (model 5, Table
1), the independent effect of the number solicited is wiped out
(b = 0.00, t(55) = 0.04, p = .97), but
the association between course ratings and the ratio of
criticisms produced to solicited remains strong and significant
(b = -1.08, t(55) = -2.47, p =
.02).5  Similarly, when we enter both the number of critical
comments produced and the ratio of critiques produced to
solicited into a regression (model 6, see Table 1), the
independent effect of the number of critiques produced is wiped
out (b = 0.00, t(55) = -0.04, p =
.97), but the independent association between course ratings and the
ratio of critiques produced to solicited remains strong and
significant (b = -1.07, t(55) = -3.13,
p = .003).6 Taken together, these
results suggest that the ratio of criticisms produced to
solicited mediates the relationships between both of these
variables and course ratings (cf. Baron & Kenny, 1986).
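To make the "frame" interpretation concrete, the following sketch computes the produced-to-solicited ratio for two hypothetical respondents who each list two critiques, one on each form, and applies the model 3 slope from Table 1. The respondents and all names in the code are illustrative assumptions, not data from the study.

```python
# Slope on the produced/solicited ratio from model 3 in Table 1.
RATIO_SLOPE = -1.070


def critique_ratio(produced: int, solicited: int) -> float:
    """Share of the solicited critiques that a respondent actually listed."""
    if solicited <= 0:
        raise ValueError("solicited must be positive")
    return produced / solicited


# Hypothetical respondents who each list 2 critiques under the two forms.
ratio_2_slot = critique_ratio(produced=2, solicited=2)    # fills every space
ratio_10_slot = critique_ratio(produced=2, solicited=10)  # fills 2 of 10 spaces

# Model 3's intercept is not reported, so only differences in predicted
# ratings can be computed, not the predicted ratings themselves.
predicted_gap = RATIO_SLOPE * (ratio_10_slot - ratio_2_slot)

print(f"Ratios: {ratio_2_slot:.1f} vs {ratio_10_slot:.1f}")
print(f"Predicted rating gap (10-slot minus 2-slot): {predicted_gap:+.2f}")
```

Under these assumptions the same two critiques fill the whole frame on one form and only a fifth of it on the other, and the fitted slope translates that difference into a higher predicted rating on the 10-slot form.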
 
4  Discussion
The present study provides strong evidence that course evaluations can
be improved, paradoxically, by soliciting a larger number of critical
comments from students.  Previous laboratory studies of the
availability heuristic have found that a target category is sometimes
judged less common after a greater number of examples are solicited. 
The present study extends this program to evaluative judgments in a
naturalistic context in which participants have ample prior experience
mapping their attitudes to the relevant response scale.
Most previous studies determined through pilot testing that it would be
difficult for participants to provide the total number of examples
solicited.  However, most studies were either designed so that
all participants would be able to eventually produce the total
number of examples solicited (e.g., Schwarz, et al., 1991) or
did not ask participants to explicitly produce any examples but merely
ponder the task of producing the number of examples solicited
(Wänke, Bohner & Jukowitsch, 1997).  The present study provides a
more direct measure of how difficult each respondent found the task:
the number of critiques that the respondent produced.  
The finding that the ratio of critiques produced to solicited
mediates the relationship between the number solicited and course
ratings suggests that the number solicited may be adopted as a norm
against which the retrieval of critiques is evaluated.  That is, the
number solicited may be accepted as a "reasonable" or "expected"
quantity of criticism, consistent with the rules of conversational
implicature (Grice, 1975).  Indeed, Winkielman et al. (1998)
observed the usual pattern of results when participants were told that
most people find the task of recalling a large number of examples
easy (which implies that the number solicited is an
appropriate norm), but the reverse pattern when participants
were told that most people find the task difficult (which
implies that the number solicited is not an appropriate norm).  Future
research might test the "norm" hypothesis more directly by examining
whether the association between frequency ratings and number of
examples solicited is moderated by the perceived normative
diagnosticity of the number of examples solicited.  For instance, if
participants learn that the number of examples solicited was determined
by the roll of dice, one might expect the effect size to diminish; if
participants are explicitly told that the task is designed so that an
average person can complete it with modest effort, one might expect the
effect size to be augmented.
The present investigation demonstrates that a minor variation in the
format of course evaluation forms (in this case, changing a single
word, "two" to "ten", and changing the number of spaces provided
for responses) can have a pronounced effect on global course
evaluations that are made on a familiar rating scale.  One might expect
that other superficial manipulations of format, such as the order in
which the global evaluation versus constituent judgments are solicited,
may also affect measured course ratings (see, e.g., Sudman, et al.,
1996).  On a practical level, the present results underscore the
importance of standardization of course evaluation forms when summary
scores are compared across classes or departments.  Additionally, the
lability of course evaluations reminds us of the limitations of summary
evaluation scores as a measure of teaching performance.
References
Aarts, H. & Dijksterhuis, A. (1999).  How often did I do it?
  Experienced ease of retrieval and frequency estimates of past
  behavior.  Acta Psychologica, 103, 77-89.
Baron, R. M. & Kenny, D. A. (1986).  The moderator-mediator variable
distinction in social psychology research: Conceptual, strategic, and
statistical considerations.  Journal of Personality and Social
Psychology, 51, 1173-1182.
Grayson, C. E., & Schwarz, N. (1999).  Beliefs influence information
processing strategies: Declarative and experiential information on risk
assessment.  Social Cognition, 17, 1-18.
Grice, H. P. (1975).  Logic and conversation. In P. Cole & J. L. Morgan
(Eds.), Speech Acts, pp. 41-58. London: Academic Press.
Kahneman, D., & Tversky, A. (1974).  Judgment under uncertainty:
Heuristics and biases.  Science, 185, 1124-1131.
Menon, G., Raghubir, P., & Schwarz, N. (1995).  Behavioral frequency
judgments: An accessibility-diagnosticity framework.  Journal
of Consumer Research, 22, 212-228.
Rothman, A. J., & Schwarz, N. (1998).  Constructing perceptions of
vulnerability: Personal relevance and the use of experiential
information in health judgments. Personality and Social
Psychology Bulletin, 24, 1053-1064.
Schwarz, N. (1998).  Accessible content and accessibility experiences:
The interplay of declarative and experiential information in judgment. 
Personality and Social Psychology Review, 2, 87-99.
Schwarz, N. (2004).  Metacognitive experiences in consumer judgment and
decision making.  Journal of Consumer Psychology, 14, 332-348.
Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H.,
& Simmons, A. (1991).  Ease of retrieval as information: Another look
at the availability heuristic.  Journal of Personality and
Social Psychology, 61, 195-202.
Sudman, S., Bradburn, N. M. & Schwarz, N. (1996).  Thinking
About Answers: The Application of Cognitive Processes to Survey
Methodology.  San Francisco: Jossey-Bass.
Tversky, A., & Kahneman, D.  (1973).  Availability: A heuristic for
judging frequency and probability.  Cognitive Psychology, 5,
207-232.
Wänke, M., Bless, H. & Biller, B. (1996).  Subjective experience
versus content of information in the construction of attitude
judgments.  Personality and Social Psychology Bulletin, 22,
1105-1113.
Wänke, M., Bohner, G., & Jukowitsch, A. (1997).  There are many
reasons to drive a BMW: Does imagined ease of argument generation
influence attitudes?  Journal of Consumer Research, 24, 170-177.
Wänke, M., Schwarz, N. & Bless, H. (1995).  The availability
heuristic revisited: Experienced ease of retrieval in mundane frequency
estimates.  Acta Psychologica, 89, 83-90.
Winkielman, P., Schwarz, N., & Belli, R. F. (1998).  The role of ease
of retrieval and attribution in memory judgments: Judging your memory
as worse despite recalling more events.  Psychological Science, 9,
124-126.
Winkielman, P., & Schwarz, N. (2001).  How pleasant was your childhood?
 Beliefs about memory shape inferences from experienced difficulty of
recall.  Psychological Science, 12, 176-179.
Footnotes:
1I thank Jim Bettman, Rick Larrick,
    Patty Linville and Yael Zemack for helpful comments and
    suggestions on an earlier draft of this paper.  Address
    correspondence to: Craig R. Fox, UCLA Anderson School, 110
    Westwood Plaza #D511, Los Angeles, CA 90095-1481,
    craig.fox@anderson.ucla.edu
  
2Six
  respondents provided course critiques but no summary course
  rating; hence their responses were dropped from the sample for
  analyses in which the latter variable entered.
3The average final course rating among all
  classes taught in the business school that term was 5.34, SD =
  0.80.  The course reported in this study received a final
  rating of 5.82.
4Participants in the studies of Grayson and
  Schwarz (1999) also failed to report all the examples
  solicited; however, these investigators report no such internal
  analysis.
5Multicollinearity does not seem to be a problem.
  The correlation between number of critiques produced and the
  ratio of critiques produced to solicited is 0.38, and the
  Variance Inflation Factor (VIF) in the multiple regression is
  1.2.
6Again, multicollinearity does not
  seem to be a problem.  The correlation between number of
  critiques solicited and the ratio of critiques produced to
  solicited is -0.72, and the VIF is 1.9.