Judgment and Decision Making, vol. 5, no. 7, December 2010, pp. 467-476

## A simple remedy for overprecision in judgment

Uriel Haran

Overprecision is the most robust type of overconfidence. We present a new method that significantly reduces this bias and offers insight into its underlying cause. In three experiments, overprecision was significantly reduced by forcing participants to consider all possible outcomes of an event. Each participant was presented with the entire range of possible outcomes divided into intervals, and estimated each interval’s likelihood of including the true answer. The superiority of this Subjective Probability Interval Estimate (SPIES) method is robust to range widths and interval grain sizes. Its carryover effects are observed even in subsequent estimates made using the conventional, 90% confidence interval method: judges who first made SPIES judgments considered a broader range of values in subsequent conventional interval estimates as well.

Keywords: overconfidence, overprecision, subjective
probability, interval estimates, judgment and decision making.

The Federal Home Loan Mortgage Corporation, otherwise known as Freddie
Mac, provides an online calculator on its website
(`www.freddiemac.com`) to help potential clients determine
whether they should buy a home or rent one. Among the factors
included in this calculation is the estimated appreciation value of
the home in question, defined by the website as “the yearly
percentage rate that an asset increases in value”. The user has to
enter a percentage value by which, according to her best judgment, her
potential home will increase or decrease. However, when a negative
value (i.e., a forecast that the house’s value will go down) was
entered, it was followed by an error message: “Please fix the
following errors: Appreciation rate must be a number between 0.00 and
100.00.” The design of this online calculator conveyed Freddie Mac’s
belief that housing prices can change only between 0% and +100%,
with any rate outside this range being improbable. However, according
to the Federal Housing Finance Agency (2010), the average yearly
appreciation rate of houses in the United States was consistently
outside this range from the second quarter of 2007 through the first
quarter of 2010, falling as low as −12.03% (and even lower than
−28% in some states). This forecasting error, among others,
resulted in Freddie Mac’s near failure, before its take-over by the
U.S. government in 2008. (More than two years later, in December
2010, Freddie Mac finally changed its on-line calculator to account
for house value depreciation.)

The failure of Freddie Mac to anticipate a depreciation of U.S. house prices is but one of many examples of overprecision in judgment. Overprecision is a form of overconfidence, found to be both prevalent and particularly impervious to debiasing (Moore & Healy, 2008). Also referred to as overconfidence in interval estimates (e.g., Soll & Klayman, 2004), overprecision is the excessive certainty that one knows the truth. Among its documented consequences are errors in clinical diagnosis (Christensen-Szalanski & Bushyhead, 1981; Oskamp, 1965), excessive market trading (Daniel, Hirshleifer, & Subrahmanyam, 1998; Odean, 1999), and excessive conviction by individual climate scientists that they know the future trajectory of climate change (Morgan & Keith, 1995; Zickfeld, Morgan, Frame, & Keith, 2010). Overprecision is typically measured by eliciting a confidence interval — a range of values that the judge is confident, to a certain degree, will include the true value in question (Alpert & Raiffa, 1982). Research has repeatedly found that the confidence people have in their beliefs exceeds their accuracy, meaning that the confidence intervals they produce are too narrow (i.e., overly precise, see Soll & Klayman, 2004). This pattern is observed in novice as well as expert judgments (Clemen, 2001; Henrion & Fischhoff, 1986; Juslin, Winman, & Hansson, 2007; McKenzie, Liersch, & Yaniv, 2008; Morgan & Keith, 2008).

Attempts to debias overprecision have had limited success. Koriat, Lichtenstein, and Fischhoff (1980) argued that people’s excessive confidence in their beliefs is driven by the more extensive search they conduct for supporting evidence than for evidence that contradicts their beliefs. In their experiments, participants were presented with two possible answers to a question, chose the answer they thought was correct, and reported their confidence in the accuracy of their chosen answer. They were grossly overconfident when they expressed very high confidence. However, when asked to consider evidence contradicting their answers before reporting their confidence, participants reported lower confidence levels. Soll and Klayman (2004) manipulated this search for evidence by asking their participants to specify the fractile cutoffs at the top and bottom ends of the range of possible values. So instead of asking their participants to specify the ends of an 80% confidence interval, they asked their participants (1) for a number low enough that there was a 90% chance the true answer was above it; and (2) for a number so high that there was a 90% chance the true answer was below it. Using this approach, Soll and Klayman were able to modestly reduce overprecision.

Other research has tried to reduce overconfidence by focusing on the format of the question (Juslin, Wennerholm, & Olsson, 1999; Seaver, von Winterfeldt, & Edwards, 1978; Teigen & Jørgensen, 2005). This research found that interval evaluation, based on probability judgments of fixed intervals, produces less overconfidence than interval production. For example, participants asked to create 90% confidence intervals produce excessively-narrow intervals, but other participants, who subsequently estimate their confidence for the participant-created intervals, report less than 90% confidence in their accuracy. Building on these findings, Winman, Hansson, & Juslin (2004) proposed the adaptive interval assessment (ADINA) method of eliciting judgments. Using this method, a desired confidence level for an interval is determined in advance. An interval is produced around a specific value (generated either by the judge, a peer, or at random), and the judge estimates the probability that this interval contains the correct answer. If this probability is higher than the desired confidence level, a narrower interval is presented next, and, similarly, if the initial probability is lower than the desired probability, then a wider interval is presented. This procedure is repeated until the probability assigned to the interval matches the desired confidence level. The authors found that the resulting intervals from this procedure displayed less overprecision than intervals that were produced directly. Unfortunately, this reduction in overprecision appeared to be tied to the assessment format: subsequent assessments made with a different response format (e.g., confidence intervals) reverted to their old, overly precise form, suggesting that the change in methods did not affect the cognitive process by which estimates were produced. 
In short, no method has been found that both reduces overprecision and trains judges to consider a wider range of values when making subsequent estimates in a different format.
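
The adaptive loop of the ADINA procedure described above can be sketched in a few lines. This is only an illustration and not Winman et al.'s implementation: the text does not say how each new interval width is chosen, so the rule used here (double the width until the target is bracketed, then bisect), the tolerance, the name `adina_interval`, and the simulated judge with normally distributed beliefs are all assumptions.

```python
from statistics import NormalDist


def adina_interval(judge, center, target=0.90, start_half_width=5.0,
                   tol=0.005, max_rounds=50):
    """Widen or narrow an interval around `center` until the judged
    probability is within `tol` of `target`.

    `judge(low, high)` returns the judge's probability (0..1) that the
    true value lies in [low, high]. The width-update rule (double until
    the target is bracketed, then bisect) is an assumption.
    """
    narrow, wide = 0.0, None  # half-widths known to be too narrow / too wide
    half = start_half_width
    for _ in range(max_rounds):
        p = judge(center - half, center + half)
        if abs(p - target) <= tol:
            break
        if p > target:
            wide = half      # judged probability too high: present a narrower interval
        else:
            narrow = half    # judged probability too low: present a wider interval
        half = half * 2 if wide is None else (narrow + wide) / 2
    return center - half, center + half


# Hypothetical judge whose beliefs follow a normal distribution (mean 70, SD 8).
belief = NormalDist(mu=70, sigma=8)
judge = lambda low, high: belief.cdf(high) - belief.cdf(low)

low, high = adina_interval(judge, center=70)
print(round(low, 2), round(high, 2), round(judge(low, high), 3))
```

The sketch stops as soon as the judged probability is close enough to the desired confidence level, which mirrors the stopping rule in the text; a real elicitation would replace `judge` with the participant's answers.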

We propose a novel method of producing interval estimates for
quantitative values that has the potential to significantly reduce
overprecision. Our method, *Subjective Probability Interval
Estimates* (SPIES), works by forcing judges to consider the entire
range of possible answers to a question. The judge sees the full range
of possible outcomes, divided into a series of intervals. For each
interval, the judge estimates the probability that it includes the
correct answer, with the sum of these probabilities constrained to
equal 100% (e.g., Figure 1).
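
As a rough illustration of the response format, a SPIES answer can be treated as one probability per interval, accepted only when the probabilities sum to 100%. The function name `validate_spies` and the example probabilities are hypothetical; the interval labels are borrowed from the temperature task reported later in the paper.

```python
def validate_spies(labels, probabilities, tolerance=1e-6):
    """Return a {label: probability} mapping if the response is admissible."""
    if len(labels) != len(probabilities):
        raise ValueError("one probability is required per interval")
    if any(p < 0 or p > 100 for p in probabilities):
        raise ValueError("each probability must lie between 0 and 100")
    total = sum(probabilities)
    if abs(total - 100) > tolerance:
        raise ValueError(f"probabilities sum to {total}, not 100")
    return dict(zip(labels, probabilities))


# Interval labels follow the temperature example; the probabilities are made up.
labels = ["below 40", "40-49", "50-59", "60-69", "70-79",
          "80-89", "90-99", "100-109", "110 or above"]
response = validate_spies(labels, [0, 5, 15, 35, 30, 10, 5, 0, 0])
print(response["60-69"])  # 35
```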

We expect SPIES will be superior to previously instantiated methods for two reasons. First, SPIES forces judges to consider all possible values, including extreme values, which they may otherwise fail to consider spontaneously. Overlooking these extreme values may account, at least in part, for the overprecision observed in interval estimates. By requiring the judge to consider all values and assign each of them some probability of being correct, even if this probability is zero, SPIES may significantly reduce this bias.

Second, the SPIES method includes features found to be instrumental in reducing overprecision. This method makes use of multiple judgments, which, as Soll & Klayman (2004) found, produce lower overprecision than single interval estimates. Also, building on the findings of research on format dependence (e.g., Juslin et al., 2007; Teigen & Jørgensen, 2005), SPIES is based on probability judgments, which appear to induce less overprecision than do interval estimates. This reduction may be further enhanced by constraining the summed probability assigned to outcomes to equal 100%, limiting the tendency to overstate subjective probabilities (Tversky & Koehler, 1994).

We report three experiments that tested our approach. Experiment 1 compared overprecision levels produced by SPIES to those produced by other methods of eliciting quantitative predictions. Experiment 2 tested the robustness of SPIES to different range widths and interval grain sizes. Experiment 3 tested the robustness of SPIES to ranges with defined bounds, and examined whether SPIES can increase the accuracy of estimates of extreme values, as well as of values that lie closer to the middle of the range. In addition, Experiment 3 measured the carryover effects of SPIES on subsequent estimates made using a different method.

Experiment 1 tested whether SPIES can reduce overprecision, relative to
two other methods of estimating intervals — 90% confidence intervals,
the most widely used method of interval production, and
5^{th} and 95^{th} fractile estimates,
which together imply a 90% confidence interval.

103 Pittsburgh residents responded to an email solicitation, sent to past participants in studies of the Center for Behavioral Decision Research, inviting them to participate in an online study. One of the participants was randomly selected to receive a $100 prize.

Participants estimated the high temperature in Pittsburgh one month
from the day on which they completed the survey, in three different
formats. In a *90% confidence interval* format, participants
entered two values, between which they were 90% sure the actual
temperature would fall. In a *fractile* format,
participants specified their estimated distribution’s
5^{th} fractile (i.e., a number sufficiently low that
they were 95% sure it would be below the actual temperature),
and the 95^{th} fractile (i.e., a number they were 95%
sure would fall above the actual temperature). In addition,
participants made *Subjective Probability Interval Estimates*
(SPIES) — they were presented with the following temperature
intervals: below 40°F, 40–49, 50–59, 60–69, 70–79,
80–89, 90–99, 100–109, and 110°F or above. They then
estimated, for each interval, the probability that it would contain
the actual temperature. The web page required the participants to adjust
the probabilities so that they summed to 100% before proceeding.
Presentation order of the three formats was randomly determined, and
not recorded.

Because the assigned confidence level for intervals produced by the
first two methods was 90%, we chose this level as our target
confidence for intervals produced by SPIES. We calculated these
confidence intervals with an algorithm that identifies the temperature
interval with the highest subjective probability and adds its
neighboring intervals until the sum of probabilities comes as close as
possible to 90% without exceeding it. The algorithm then adds the
proportion of the adjacent interval with the next highest probability
(or of the two intervals on both sides of the aggregated interval, when
they are assigned equal probabilities) needed to reach 90%. We refer to the
resulting confidence interval as 90% SPIES.^{1} This is
a conservative calculation of 90% SPIES, designed to produce a
confidence interval out of the fewest possible subjective probability
intervals. In cases where an extreme interval (i.e., below
40°F, 110°F or above) was included in a
participant’s 90% SPIES, we calculated that interval’s width as
10°F.
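
The calculation just described can be sketched as follows. This is one reading of the algorithm, not the authors' code. Several details are assumptions: the neighbor with the higher probability is added first, tied neighbors are extended symmetrically, a peak interval that already exceeds the target is kept whole, and the intervals are given as continuous edges (40, 50, ...) rather than 40–49, 50–59. The outer edges of 30 and 120 follow the 10°F width assigned to the extreme intervals.

```python
def spies_interval(edges, probs, target=90.0):
    """Grow an interval outward from the most probable bin until it
    holds `target` percent of the judged probability.

    edges: n + 1 increasing bin boundaries; probs: n percentages summing to 100.
    """
    if len(edges) != len(probs) + 1:
        raise ValueError("need exactly one more edge than probabilities")
    if any(p < 0 for p in probs) or abs(sum(probs) - 100.0) > 1e-6:
        raise ValueError("probabilities must be non-negative and sum to 100")

    eps = 1e-9
    n = len(probs)
    peak = max(range(n), key=lambda i: probs[i])
    first = last = peak                      # bins fully inside the interval
    low, high = edges[peak], edges[peak + 1]
    mass = probs[peak]                       # assumption: a peak above target is kept whole

    while mass < target - eps and (first > 0 or last < n - 1):
        need = target - mass
        p_left = probs[first - 1] if first > 0 else -1.0
        p_right = probs[last + 1] if last < n - 1 else -1.0

        if abs(p_left - p_right) < eps:      # tie: extend both sides symmetrically
            if mass + p_left + p_right <= target + eps:
                first, last = first - 1, last + 1
                low, high = edges[first], edges[last + 1]
                mass += p_left + p_right
            else:                            # take an equal share from each side
                frac = (need / 2) / p_left
                low = edges[first] - frac * (edges[first] - edges[first - 1])
                high = edges[last + 1] + frac * (edges[last + 2] - edges[last + 1])
                mass = target
        elif p_left > p_right:
            if mass + p_left <= target + eps:
                first -= 1
                low = edges[first]
                mass += p_left
            else:                            # only part of the left neighbor is needed
                low = edges[first] - (need / p_left) * (edges[first] - edges[first - 1])
                mass = target
        else:
            if mass + p_right <= target + eps:
                last += 1
                high = edges[last + 1]
                mass += p_right
            else:                            # only part of the right neighbor is needed
                high = edges[last + 1] + (need / p_right) * (edges[last + 2] - edges[last + 1])
                mass = target
    return low, high


# Experiment 1 intervals, with the two extreme intervals treated as 10°F wide.
edges = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
print(spies_interval(edges, [0, 5, 15, 35, 30, 10, 5, 0, 0]))   # (50, 90)
print(spies_interval(edges, [0, 0, 10, 40, 40, 10, 0, 0, 0]))   # (55.0, 85.0)
```

In the second call the two neighbors of the aggregated 60–80 block carry equal probability, so half of each is added to reach 90%.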

The true temperatures on the days for which participants made their
estimates were between 67°F and 73°F. A
repeated-measures ANOVA comparing the accuracy of participants’
estimates across the three methods revealed a significant difference,
*F*(2, 101) = 4.98, *p* = .009, η^{2} = .090. 90% confidence intervals and
intervals produced by the 5^{th} and
95^{th} fractiles did not differ in their accuracy,
both including the correct answer 73.79% of the time (*SD* =
44.19).^{2} 90% SPIES, however, included the correct answer in
88.35% of the estimates (*SD* = 32.24), a significantly higher
hit rate than both 90% confidence intervals, *t*(102) = 2.88,
*p* = .005, *d* = 0.57, and fractiles, *t*(102) =
2.69, *p* = .008, *d* = 0.53. Whereas 90% confidence
intervals and fractiles displayed significant overprecision of
16.21 percentage points, *t*s(102) = 3.72, *p* < .0005,
*d* = 0.74, the accuracy level produced by SPIES was not
significantly different from the 90% confidence level assigned to
them, *t*(102) = 0.52, *p* = .60, meaning that these
estimates did not exhibit overprecision (see Figure 2).
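
The overprecision figures above can be retraced from the summary numbers. The sketch below assumes that overprecision is the target confidence minus the hit rate and that the test is a one-sample *t*-test against 90%. The hit counts of 76 and 91 out of 103 are not stated in the text; they are inferred from the reported percentages.

```python
from math import sqrt


def calibration_test(hits, n, target=0.90):
    """One-sample t statistic comparing a hit rate with the target confidence."""
    rate = hits / n
    sd = sqrt(rate * (1 - rate) * n / (n - 1))   # sample SD of a 0/1 hit indicator
    t = (target - rate) / (sd / sqrt(n))
    return rate, sd, t


# Hit counts are inferred from the reported percentages (assumption).
for label, hits in [("90% confidence intervals / fractiles", 76), ("90% SPIES", 91)]:
    rate, sd, t = calibration_test(hits, 103)
    print(f"{label}: hit rate {rate:.2%}, SD {sd:.2%}, "
          f"overprecision {0.90 - rate:.2%}, t(102) = {t:.2f}")
```

Under these assumptions the sketch should come out close to the reported values: hit rates of 73.79% and 88.35%, and *t* statistics of about 3.72 and 0.52.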

The SPIES method does not seem to have improved participants’
intuition regarding the precise temperature, as measured by the
distance between an interval’s midpoint and the true answer. A
repeated-measures ANOVA revealed a significant method effect,
*F*(2, 101) = 3.49, *p* = .034, η^{2} = .065, but the midpoints of 90%
SPIES intervals were not significantly closer to the true answer than either
those of 90% confidence intervals, *t*(102) = 1.39, *p* = .167,
or those between the 5^{th} and
95^{th} fractiles, *t* < 1.

We also compared the widths of the intervals generated by the three
methods. A repeated-measures ANOVA revealed a significant effect of
method on interval width, *F*(2, 101) = 21.71, *p* < .0005,
η^{2} = .301.
Within-subject contrasts show that 90% SPIES intervals were significantly wider
(*M* = 31.81, *SD* = 11.96) than 90% confidence
intervals (*M* = 23.58, *SD* = 14.42), *t*(102) =
5.73, *p* < .0005, *d* = 0.62, but slightly,
and non-significantly, narrower than fractiles (*M* = 33.15,
*SD* = 22.48), *t* < 1. The fractile
estimates’ relatively large mean width, as well as
their high variability, can be accounted for by the fact that eight of
these estimates reached either below 30°F or above
119°F (the boundary values we set for calculating 90%
SPIES), and resulted in relatively wide intervals.^{3}

Of the three methods tested in this experiment, the SPIES method was the only one in which confidence was correctly calibrated with accuracy. Although 90% confidence intervals and fractile estimates produced a higher hit rate than that typically found in prior research (Klayman, Soll, Gonzalez-Vallejo, & Barlas, 1999), the accuracy of SPIES was significantly higher than both of these methods. Moreover, the SPIES method not only produced better accuracy, it eliminated overprecision.

Another noteworthy finding is that SPIES produced a significantly higher hit rate than did fractile estimates. This result suggests that, although in both methods judges unpack their estimates into multiple judgments, this feature is not the primary driver of the superior calibration found in SPIES.

The results of this experiment are not conclusive regarding why SPIES were more accurate. On the one hand, interval midpoints did not differ between the three estimation formats in their distance from the true value, suggesting the better hit-rate is due to the estimates made using SPIES being more inclusive. On the other hand, 90% SPIES achieved a higher hit rate than fractile estimates without being significantly wider. As noted, we believe this is due to the constraint put on including extreme values in the SPIES intervals, but not in the other estimates. This issue was addressed in Experiment 3. First, we wanted to test whether the improved performance produced by SPIES holds for different configurations of intervals. This is an important issue because the SPIES method necessitates two choices: how big to make the range of possible responses and into how many intervals to divide that range. These variations may influence the amount of attention given by the judge to the values she considers, and, subsequently, affect the quality of the estimates produced. Therefore, we sought to test the robustness of the results obtained in Experiment 1 to these variations.

In Experiment 2, we varied the width of the range of subjective probability intervals for which estimates were made and the number of intervals into which this range was divided. We expected that SPIES would be better calibrated than 90% confidence intervals, regardless of the width of the range or the number of intervals in the SPIES task.

The study was conducted online, using participants from Amazon
Mechanical Turk (described by Paolacci et al., 2010). 116 U.S.-based
participants (63 women, *M*_{age} = 36.78)
completed a survey for 5¢ each.

Participants estimated the day’s high temperature in Washington, DC
exactly one month after the day on which they took the survey. In a 2
x 2 between-subjects design, participants specified SPIES intervals with a
narrow range (−15°F to 84°F)^{4} or with a wide range (−65°F to
134°F), which were divided into either ten or twenty
intervals. These divisions resulted in three interval grain-sizes:
fine (5°F), medium (10°F) and coarse
(20°F). Two intervals of extreme values were added, one at
each end of these ranges: “−16°F or lower” or
“−66°F or lower” at the low end, and “85°F or
higher” or “135°F or higher” at the high end (see Table
1). To compare SPIES with conventional interval estimates, an
additional group of participants produced a 90% confidence interval.
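
To make the four SPIES configurations concrete, the sketch below generates the interval labels for each combination of range and number of intervals. The function name `build_spies_labels` and the label wording are illustrative assumptions; only the ranges, interval counts, and extreme intervals come from the design described above.

```python
def build_spies_labels(low, high, n_intervals):
    """Split the integer range low..high into equal intervals and add one
    open-ended extreme interval at each end."""
    span = high - low + 1
    if span % n_intervals:
        raise ValueError("range does not divide evenly into the requested intervals")
    grain = span // n_intervals
    labels = [f"{low - 1}°F or lower"]
    for start in range(low, high + 1, grain):
        labels.append(f"{start} to {start + grain - 1}°F")
    labels.append(f"{high + 1}°F or higher")
    return grain, labels


for name, (low, high) in {"narrow": (-15, 84), "wide": (-65, 134)}.items():
    for n_intervals in (10, 20):
        grain, labels = build_spies_labels(low, high, n_intervals)
        print(name, n_intervals, f"grain {grain}°F", labels[0], "...", labels[-1])
```

The narrow range yields 10°F or 5°F intervals and the wide range 20°F or 10°F intervals, so the 10°F grain size appears once in each range.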

Actual temperatures on the days for which participants provided their
estimates fell between 31°F and 40°F. First, we
compared the accuracy of 90% confidence intervals to that of estimates
made using SPIES. Similar to Experiment 1, 90% SPIES achieved a
significantly higher hit rate (*M* = 73.91%, *SD* =
44.15) than 90% confidence intervals (*M* = 29.17%, *SD* = 46.43),
*t*(114) = 4.38, *p* < .0005,
*d* = 0.99. As expected, 90% SPIES of all four configurations
produced accurate estimates at a significantly higher rate than 90%
confidence intervals, *t*s ≥ 2.28, *p*s ≤
.027, *d*s ≥ 0.68 (see Figure 3).

Second, we tested whether the different configurations of the SPIES
task affected participants’ estimates. A 2 (range width:
100°F, 200°F) x 2 (number of intervals: 10,
20) between-subjects ANOVA on the *hit rates* of 90% SPIES
revealed no significant effects of either range width, *F* < 1,
or number of intervals, *F*(1, 88) = 3.23,
*p* = .08; nor was there a significant interaction, *F* < 1
(see Table 2). In order to perform a more conservative
test of the effect of range width on participants’ estimates, we
compared the two conditions in which participants made SPIES judgments with a
medium, 10°F grain size (see Table 1). These two
conditions differed only in range width: one group was presented with
a 100°F range, whereas for the other group, the
SPIES task spanned 200°F. The comparison between
these two groups revealed no significant effect of range width on hit
rates (100°F range: *M* = 80.95%, *SD* = 40.24%;
200°F range: *M* = 69.23%,
*SD* = 47.07%), *t* < 1.

Table 1. SPIES task configurations in Experiment 2.

| Range width | Number of intervals | Grain size | Extreme intervals |
|---|---|---|---|
| Narrow (100°F) | 20 + 2 extreme intervals | Fine (5°F) | −16°F or lower; 85°F or higher |
| Narrow (100°F) | 10 + 2 extreme intervals | Medium (10°F) | −16°F or lower; 85°F or higher |
| Wide (200°F) | 20 + 2 extreme intervals | Medium (10°F) | −66°F or lower; 135°F or higher |
| Wide (200°F) | 10 + 2 extreme intervals | Coarse (20°F) | −66°F or lower; 135°F or higher |

We did, however, find that the width of 90% SPIES was affected by the
configuration of the task. We conducted a similar ANOVA on
*estimate width*, which revealed significant main effects of the
overall SPIES’ range width and the number of intervals
it included, *F*(1, 88) = 12.52, *p* = .001, η^{2} = .125,
and *F*(1, 88) = 12.25,
*p* = .001, η^{2} = .122,
respectively, with no interaction, *F* < 1 (see Table
3). However, a comparison of the two 10°F grain size
groups found no effect of range width on estimate width
(100°F range: *M* = 33.40, *SD* =
16.58; 200°F range: *M* = 33.50, *SD* =
12.93), *t* < 1.

As in Experiment 1, the estimated intervals’ midpoints
were not affected by our manipulations. The distances of 90%
SPIES’ midpoints from their respective true values did
not vary with range width, *F*(1, 88) = 1.47, *p* = .228,
or with grain size, *F* < 1, nor was there an
interaction, *F* < 1. No significant difference in
midpoint accuracy was found between 90% SPIES and 90% confidence
intervals, either, *t* < 1.

In light of the significant effects on estimate width and the large,
though only marginally-significant effect of number of intervals on
hit rates, we sought to examine the extent to which participants were
sensitive to the different SPIES configurations. We tested this by
measuring the number of intervals to which participants assigned some
probability higher than zero in their estimates. A 2 (range width) x 2
(number of intervals) ANOVA found a significant effect of interval
number, wherein participants for whom the SPIES task consisted of
twenty intervals assigned non-zero probabilities to significantly more
intervals (*M* = 6.36, *SD* = 3.81) than those who were
presented with only ten intervals (*M* = 4.24, *SD* =
2.30), *F*(1, 88) = 14.94, *p* < .0005. The
ANOVA also found a significant range width effect, *F*(1, 88) =
22.69, *p* < .0005, but the direct comparison of the
two 10°F grain size groups found no effect of range
width on the number of intervals with non-zero probabilities
(100°F range: *M* = 5.14, *SD* = 2.83;
200°F range: *M* = 4.62, *SD* = 1.79),
*t* < 1. Together, these results suggest that
participants who made estimates with the finer-grained SPIES were
aware of the need to use a larger number of intervals and adjusted
their estimates, but not sufficiently to fully equate their estimates’
width to those made with coarser-grained intervals.

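Table 2. Hit rates of 90% SPIES in Experiment 2, *M* (*SD*), by range width and number of intervals.

| Number of intervals | Narrow range | Wide range |
|---|---|---|
| 10 | 80.95% (40.24%) | 83.33% (38.07%) |
| 20 | 61.90% (49.76%) | 69.23% (47.07%) |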

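Table 3. Estimate widths of 90% SPIES in Experiment 2, *M* (*SD*), by range width and number of intervals.

| Number of intervals | Narrow range | Wide range |
|---|---|---|
| 10 | 33.40 (16.58) | 44.95 (11.80) |
| 20 | 25.48 (11.12) | 33.50 (12.93) |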

As in Experiment 1, SPIES had a significantly higher hit rate than standard 90% confidence interval estimates. More important, this difference was consistent across the various range widths and interval grains.

One common feature of the first two experiments is that both included estimates of values on an unbounded scale (i.e., temperatures), for which we did not specify a minimum or a maximum value. In the absence of such explicit bounds, the highest and lowest intervals in the SPIES task may be perceived by the judge as reasonable bounds, between which the experimenters expect the true answer to lie. Because these intervals were included in the SPIES tasks, but not in confidence interval estimates, they may account for some of the difference in performance between the two methods. Also, in both experiments, the true values eventually fell closer to the middle of the scale than to any one of its ends. In Experiment 3 we sought to address these issues, by specifying the high and low bounds of the range in all conditions, as well as by varying the true values being estimated.

Another question which remains open is whether the difference between SPIES and confidence interval production is solely due to the different elicitation format, or whether SPIES enact a change in the underlying process by which estimates are generated. We hypothesize that training judges to consider the entire range of possible values, using SPIES, will have effects beyond the current elicitation method, and will affect subsequent estimates made in different formats. We tested this hypothesis in Experiment 3.

In Experiment 3, participants estimated a series of values, using
confidence intervals for half of their estimates and SPIES for the
others. Participants estimated the years in which the
20^{th} Century U.S. presidents were first elected to
office. These years were therefore on a bounded scale, ranging from
1900 to 1999. In addition, because participants estimated the election
years of all of these presidents, the true values fell at various points on
the scale, both near the ends and closer to the middle.

We varied the elicitation method within-subjects. Participants produced 90% confidence intervals for half of their estimates, and SPIES for the other half, the order of which was counterbalanced. This design enabled us to test for the influence of SPIES on subsequent confidence interval estimates, by measuring differences in 90% interval widths between confidence intervals produced before SPIES and those produced after. If format dependence is solely responsible for the reduction in overprecision exhibited in SPIES, then, similar to the findings of Winman et al. (2004), confidence intervals will not be affected after switching from SPIES. If, as we suggest, SPIES change the process by which judges make confidence estimates, then 90% confidence intervals should include a wider range of values if made after SPIES than when made beforehand.

334 Pittsburghers (169 women, *M*_{age} = 22.6, *SD* = 6.79)
completed a survey in the lab, in exchange for $3 or course credit.

Participants answered a 16-item quiz, estimating the years in which
all 20^{th} Century U.S. presidents were first elected
to office^{5}. For
each president, participants estimated either a 90% confidence
interval or SPIES. The SPIES task included all years from 1900 to
1999, divided into ten intervals, each representing a decade, with no
end intervals for more extreme values. Similarly, in the confidence
interval production condition, any estimate that included years
outside the 20^{th} century could not be submitted, and
the participant was instructed to revise it. Half of the participants
provided 90% confidence intervals for the first eight estimates and
SPIES for the last eight; for the other half, this order was reversed.
Items appeared in a different random order for each participant.

We calculated 90% SPIES the same way as in Experiments 1 and 2. Next,
we conducted a 2 (elicitation method: SPIES, confidence intervals) x 2
(elicitation order: first eight estimates, last eight estimates) mixed
ANOVA^{6} on *hit rates*, which
showed that 90% SPIES had a significantly higher hit rate than 90%
confidence intervals. SPIES included the correct answer 76.91% of
the time (*SD* = 20.17), compared with 54.34% (*SD* =
26.26%) in 90% confidence intervals, *F*(1, 332) = 192.34,
*p* < .001, η^{2} = .367. This result supported our
prediction that SPIES would provide greater accuracy for estimated
values in bounded ranges, regardless of where on the range the true
value eventually falls. As in Experiments 1 and 2, we found no
significant effect of elicitation method on interval midpoint
accuracy, *F*(1, 332) = 1.11, *p* = .29.

A similar ANOVA on *estimate width* yielded a significant effect
of SPIES on subsequent confidence interval width. SPIES produced
significantly wider estimates (*M* = 36.27, *SD* = 20.09) than
90% confidence intervals (*M* = 18.17, *SD* = 14.84),
but there was also a significant Elicitation method x Elicitation order
interaction, *F*(1, 332) = 3.97, *p* = .047, η^{2} = .012.
Simple effects tests revealed
that 90% confidence intervals produced after having taken the SPIES
task were significantly wider (*M* = 20.77 years,
*SD* = 16.13) than those produced in the first set of estimates
(*M* = 15.57, *SD* = 12.95), *t*(332) = 3.25,
*p* = .001, *d* = 0.36, whereas 90% SPIES did not differ
between the two groups, *t* < 1. This result suggests
that SPIES had a carryover effect on subsequent confidence interval
estimates, leading judges to consider a wider range of values in their
estimates. To rule out learning and time effects, we conducted a
repeated-measures ANOVA on confidence interval widths for each item
participants estimated. The last confidence interval estimate in each
set was not, on average, wider than the first estimate in the set,
*F* < 1, suggesting the greater width of
confidence intervals made after SPIES than of those made before SPIES
was not due to a simple improvement with experience or time within the
same elicitation method (see Figure 4).
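
The carryover comparison above can be checked from the reported means and standard deviations. The sketch assumes a pooled-variance independent-samples *t*-test and an even 167/167 split of the 334 participants between the two orders; neither detail is stated in the text.

```python
from math import sqrt


def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t (pooled variance) and Cohen's d from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = sqrt(pooled_var)
    d = (m1 - m2) / pooled_sd
    t = (m1 - m2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    return t, d


# Reported means and SDs; the 167/167 split of the 334 participants is an assumption.
t, d = pooled_t_and_d(20.77, 16.13, 167, 15.57, 12.95, 167)
print(f"t(332) = {t:.2f}, d = {d:.2f}")
```

Under these assumptions the result should land close to the reported *t*(332) = 3.25 and *d* = 0.36.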

The results of this experiment confirm that the increased accuracy observed in SPIES does not depend on features of the possible range of values being estimated, or on where on this range the true value actually falls. Furthermore, the carryover effect of SPIES on subsequent confidence interval estimates suggests that the reduced bias in SPIES is not due to format dependence alone. It also demonstrates a change in the process by which judgments are made. The more extensive consideration of values in SPIES prompted judges to generate wider confidence intervals in later estimates.

Overprecision in judgment continues to be a robust and intriguing phenomenon with potentially profound and harmful consequences in domains as diverse as corporate investment and scientific progress (e.g., Henrion & Fischhoff, 1986; Malmendier & Tate, 2005; Morgan & Keith, 2008). SPIES appears to be a practical and simple method of producing interval estimates that effectively reduces overprecision. Across three experiments that elicited different estimates, SPIES led to greater accuracy than other elicitation methods — and in some cases completely eliminated overprecision. The results further suggest that SPIES may affect the process by which people make quantitative estimates, as confidence interval estimates produced after SPIES included a wider range of values than estimates produced before this intervention.

Future research is needed to elucidate the underlying mechanism by which SPIES results in reduced overprecision. SPIES may evoke a more extensive search for information, which puts the judge in a more inquisitive mindset (e.g., Galinsky, Moskowitz, & Skurnik, 2000), leading to a better and more deliberate estimation process. Alternatively, considering all subjective probability intervals may work by increasing the amount of available estimate-relevant information in memory, forcing a fuller consideration of alternative hypotheses (Hirt & Markman, 1995; McKenzie, 1997, 1998; Morewedge & Kahneman, 2010).

In addition to our laboratory findings, we believe SPIES can easily be used for producing estimates in real-world settings. As Experiment 2 shows, SPIES provides superior results to confidence interval estimates, regardless of how the SPIES task is presented. The structure of the method, which utilizes the entire range of possible values, allows the production of intervals of virtually any target width or confidence level from the same estimate, and even allows changing the target width or confidence without having to estimate the same value multiple times. Furthermore, SPIES appears to have positive carryover effects, suggesting that the method may help train judges to improve their estimates when the range of possible outcomes of an event is uncertain and traditional confidence interval estimates are required.
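
As a sketch of how one SPIES response can yield intervals at several confidence levels, the code below reads equal-tailed intervals off the cumulative distribution. This is a simpler, swapped-in technique and not the peak-outward algorithm used to compute 90% SPIES in the experiments. It assumes probability is spread evenly within each interval, and the judge's probabilities are hypothetical.

```python
def quantile(edges, probs, q):
    """Value below which q percent of the judged probability lies,
    assuming probability is spread evenly within each interval."""
    cumulative = 0.0
    for i, p in enumerate(probs):
        if p > 0 and cumulative + p >= q:
            return edges[i] + (q - cumulative) / p * (edges[i + 1] - edges[i])
        cumulative += p
    return edges[-1]


def central_interval(edges, probs, level):
    """Equal-tailed interval holding `level` percent of the probability."""
    tail = (100 - level) / 2
    return quantile(edges, probs, tail), quantile(edges, probs, 100 - tail)


# Hypothetical judge's probabilities for the ten decades 1900-1999.
edges = list(range(1900, 2001, 10))
probs = [0, 0, 5, 10, 20, 30, 20, 10, 5, 0]

for level in (50, 80, 90):
    print(level, central_interval(edges, probs, level))
```

The same ten probabilities produce the 50%, 80%, and 90% intervals, so the judge does not have to estimate the value again when the target confidence changes.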

Another useful feature of SPIES is the added information it provides about the judge’s sense of uncertainty regarding the estimated value. Traditional confidence intervals provide information only about the two values beyond which the judge thinks the true value has a very low chance of being, but not which values within the confidence interval are perceived as more probable than others. Point estimates and probability judgments, which are widely used in industry, provide very little information about the judge’s sense of the extent to which the true value may vary. SPIES, on the other hand, provides information on the values which the judge estimates as the most probable, as well as her sense of the variability in her estimate. This information can be highly valuable in cases such as estimates of future product demands which affect present stock, production and pricing.
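
The extra information in a SPIES response can be summarized with a few descriptive statistics. In the sketch below, the product-demand intervals and probabilities are hypothetical, and placing each interval's probability at its midpoint is a simplifying assumption.

```python
from math import sqrt


def summarize_spies(edges, probs):
    """Most probable interval, mean, and SD of a SPIES response,
    treating each interval's probability as sitting at its midpoint."""
    mids = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    weights = [p / 100 for p in probs]
    mean = sum(w * m for w, m in zip(weights, mids))
    sd = sqrt(sum(w * (m - mean) ** 2 for w, m in zip(weights, mids)))
    peak = max(range(len(probs)), key=lambda i: probs[i])
    return {"most_probable": (edges[peak], edges[peak + 1]), "mean": mean, "sd": sd}


# Hypothetical forecast of next month's demand, in units sold.
demand_edges = [0, 100, 200, 300, 400, 500]
demand_probs = [5, 20, 40, 25, 10]
summary = summarize_spies(demand_edges, demand_probs)
print(summary["most_probable"], round(summary["mean"], 1), round(summary["sd"], 1))
```

A single confidence interval would report only two endpoints; here the same response also shows which interval the judge considers most likely and how spread out her beliefs are.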

One limitation of the experiments reported in this paper is that they tested the SPIES method on only one type of estimate plagued by overprecision, namely interval estimates. Future research should test whether this method is applicable to forecasts of discrete events (e.g., the chances that a building will withstand an earthquake; which candidate will win an election). Another limitation is that, despite its simplicity for the judge, the SPIES method is too complex and time consuming for many everyday estimates. The use of SPIES is recommended in contexts where the consequences are large and ample time or a computer is available to calculate a confidence interval, but it is hardly a panacea for all estimates and forecasts. Nevertheless, we believe expert judges and professionals who make estimates of uncertain quantities may benefit from adopting SPIES.

Alpert, M., & Raiffa, H. (1982). A progress report on the training of
probability assessors. In D. Kahneman, P. Slovic, & A. Tversky (Eds.),
*Judgment under Uncertainty: Heuristics and Biases*. Cambridge:
Cambridge University Press.

Christensen-Szalanski, J. J., & Bushyhead, J. B. (1981).
Physicians’ use of probabilistic information in a real
clinical setting. *Journal of Experimental Psychology: Human
Perception and Performance, 7*, 928–935.

Clemen, B. (2001). Assessing 10–50–90s: a surprise. *Decision
Analysis, 20*, 2.

Daniel, K., Hirshleifer, D., & Subrahmanyam, A. (1998). Investor
Psychology and Security Market Under- and Overreactions. *The
Journal of Finance, 53*, 1839–1885.

Federal Housing Finance Agency. (2010). *Quarterly average and
median prices for states and U.S.: 2000Q1 - Present*. Retrieved
December 10, 2010, from http://www.fhfa.gov/Default.aspx?Page=87

Galinsky, A. D., Moskowitz, G. B., & Skurnik, I. (2000).
Counterfactuals as self-generated primes: The effect of prior
counterfactual activation on person perception judgments.
*Social Cognition, 18*, 252–280.

Henrion, M., & Fischhoff, B. (1986). Assessing uncertainty in physical
constants. *American Journal of Physics, 54*, 791–798.

Hirt, E. R., & Markman, K. D. (1995). Multiple explanation: A
consider-an-alternative strategy for debiasing judgments.
*Journal of Personality and Social Psychology, 69*, 1069–1086.

Juslin, P., Wennerholm, P., & Olsson, H. (1999). Format dependence in
subjective probability calibration. *Journal of Experimental
Psychology: Learning, Memory, and Cognition, 25*, 1038–1052.

Juslin, P., Winman, A., & Hansson, P. (2007). The Naïve Intuitive
Statistician: A Naïve Sampling Model of Intuitive Confidence
Intervals. *Psychological Review, 114*, 678–703.

Klayman, J., Soll, J. B., Gonzalez-Vallejo, C., & Barlas, S. (1999).
Overconfidence: It depends on how, what, and whom you ask.
*Organizational Behavior and Human Decision Processes, 79*, 216–247.

Koriat, A., Lichtenstein, S., & Fischhoff, B. (1980). Reasons for
confidence. *Journal of Experimental Psychology: Human Learning
and Memory, 6*, 107–118.

Malmendier, U., & Tate, G. (2005). CEO overconfidence and corporate
investment. *The Journal of Finance, 60*, 2661–2700.

McKenzie, C. (1997). Underweighting alternatives and overconfidence.
*Organizational Behavior and Human Decision Processes, 71*,
141–160.

McKenzie, C. R. (1998). Taking into account the strength of an
alternative hypothesis. *Journal of Experimental Psychology:
Learning, Memory, and Cognition, 24*, 771–792.

McKenzie, C., Liersch, M., & Yaniv, I. (2008). Overconfidence in
interval estimates: What does expertise buy you? *Organizational
Behavior and Human Decision Processes, 107*, 179–191.

Moore, D. A., & Healy, P. J. (2008). The trouble with overconfidence.
*Psychological Review, 115*, 502–517.

Morewedge, C. K., & Kahneman, D. (2010). Associative processes in
intuitive judgment. *Trends in Cognitive Sciences, 14*, 435–440.

Morgan, M. G., & Keith, D. W. (1995). Subjective judgments by climate
experts. *Environmental Science & Technology, 29*, 468–476.

Morgan, M. G., & Keith, D. W. (2008). Improving the way we think about
projecting future energy use and emissions of carbon dioxide.
*Climatic Change, 90*, 189–215.

Odean, T. (1999). Do investors trade too much? *The American
Economic Review, 89*, 1279–1298.

Oskamp, S. (1965). Overconfidence in case-study judgments.
*Journal of Consulting Psychology, 29*, 261–265.

Paolacci, G., Chandler, J., & Stern, L. N. (2010). Running experiments
on Amazon Mechanical Turk. *Judgment and Decision Making, 5*,
411–419.

Seaver, D. A., von Winterfeldt, D., & Edwards, W. (1978). Eliciting
subjective probability distributions on continuous variables.
*Organizational Behavior & Human Performance, 21*, 379–391.

Soll, J., & Klayman, J. (2004). Overconfidence in interval estimates.
*Journal of Experimental Psychology: Learning, Memory, and
Cognition, 30*, 299–314.

Teigen, K. H., & Jørgensen, M. (2005). When 90% confidence
intervals are 50% certain: on the credibility of credible intervals.
*Applied Cognitive Psychology, 19*, 455–475.

Tversky, A., & Koehler, D. J. (1994). Support theory: a nonextensional
representation of subjective probability. *Psychological Review,
101*, 547–567.

Winman, A., Hansson, P., & Juslin, P. (2004). Subjective probability
intervals: How to reduce overconfidence by interval evaluation.
*Journal of Experimental Psychology: Learning, Memory, &
Cognition, 30*, 1167–1175.

Zickfeld, K., Morgan, M. G., Frame, D. J., & Keith, D. W. (2010).
Expert judgments about transient climate response to alternative future
trajectories of radiative forcing. *Proceedings of the National
Academy of Sciences, 107*, 12451–12456.

    %Your input data file should be in a .csv file, and include only the
    %data entered in the SPIES task, without column headers or participant
    %ID's. The output file will be a text file, which will include the
    %subjective probabilities incorporated in the result interval, as well
    %as the interval's low and high bounds.
    function [] = SPIES(filename)
    filenamenew = [filename(1:(end - 4)) 'out.txt'];
    data = importdata(filename);
    %The four lines below this comment are for configuring your SPIES task:
    %rangeMin = the SPIES task's low bound.
    %rangeMax = the SPIES task's high bound.
    %intervalGrainSize = the width of the SPIES' intervals.
    %targetConfidence = the result confidence interval's desired level of
    %                   confidence.
    rangeMin = 0;
    rangeMax = 100;
    intervalGrainSize = 10;
    targetConfidence = 90;
    minColRange = (rangeMin:intervalGrainSize:rangeMax - intervalGrainSize);
    maxColRange = (rangeMin+intervalGrainSize:intervalGrainSize:rangeMax);
    result = [];
    for ii = 1:size(data, 1)
        dataRow = data(ii, :);
        len = length(dataRow);
        matrix = zeros(len, len);
        for i1 = 1:len
            for j1 = i1:len
                matrix(i1, j1) = sum(dataRow(i1:j1));
            end
        end
        maxValueIndex = 1;
        for k = 1:len
            if (dataRow(k) >= dataRow(maxValueIndex))
                maxValueIndex = k;
            end
        end
        bottom = maxValueIndex;
        top = maxValueIndex;
        while (sum(dataRow(bottom:top)) <= targetConfidence)
            if (bottom > 1)
                if (top == len)
                    bottom = bottom - 1;
                else
                    if (dataRow(bottom - 1) > dataRow(top + 1))
                        bottom = bottom - 1;
                    else
                        if (dataRow(bottom - 1) < dataRow(top + 1))
                            top = top + 1;
                        else
                            if (dataRow(bottom - 1) == dataRow(top + 1))
                                top = top + 1;
                                bottom = bottom - 1;
                            end
                        end
                    end
                end
            else
                if (top < len)
                    top = top + 1;
                else
                    % This should not happen
                end
            end
        end
        includedSPIES = zeros(1, len);
        includedSPIES(bottom:top) = dataRow(bottom:top);
        startRange = minColRange(bottom);
        endRange = maxColRange(top);
        while (sum(includedSPIES(bottom:top)) > targetConfidence || ...
                includedSPIES(bottom) == 0 || includedSPIES(top) == 0 )
            extra = sum(dataRow(bottom:top)) - targetConfidence;
            if (extra == 0)
                while (includedSPIES(bottom) == 0)
                    bottom = bottom + 1;
                end
                while (includedSPIES(top) == 0)
                    top = top - 1;
                end
                startRange = minColRange(bottom);
                endRange = maxColRange(top);
                continue;
            end
            if (dataRow(bottom) + dataRow(top) <= extra)
                includedSPIES(bottom) = 0;
                includedSPIES(top) = 0;
                bottom = bottom + 1;
                top = top - 1;
                startRange = minColRange(bottom);
                endRange = maxColRange(top);
                continue;
            end
            diff = dataRow(bottom) - dataRow(top);
            if (diff == 0)
                valuePerUnit = dataRow(bottom) / intervalGrainSize;
                unitToUse = dataRow(bottom) - (extra / 2);
                startRange = maxColRange(bottom) - ...
                    (unitToUse / valuePerUnit);
                endRange = minColRange(top) + ...
                    (unitToUse / valuePerUnit);
                includedSPIES(top) = includedSPIES(top) - (extra / 2);
                includedSPIES(bottom) = includedSPIES(bottom) - (extra / 2);
                continue;
            end
            if (diff > 0)
                if (dataRow(top) <= extra)
                    includedSPIES(top) = 0;
                    top = top - 1;
                    endRange = maxColRange(top);
                    continue;
                end
                valuePerUnit = dataRow(top) / intervalGrainSize;
                unitToUse = dataRow(top) - (extra);
                endRange = minColRange(top) + ...
                    (unitToUse / valuePerUnit);
                includedSPIES(top) = includedSPIES(top) - (extra);
                continue;
            end
            if (diff < 0)
                if (dataRow(bottom) <= extra)
                    includedSPIES(bottom) = 0;
                    bottom = bottom + 1;
                    startRange = minColRange(bottom);
                    continue;
                end
                valuePerUnit = dataRow(bottom) / intervalGrainSize;
                unitToUse = dataRow(bottom) - (extra);
                startRange = maxColRange(bottom) - ...
                    (unitToUse / valuePerUnit);
                includedSPIES(bottom) = includedSPIES(bottom) - (extra);
                continue;
            end
        end
        result(ii, 1:(len + 2)) = [includedSPIES,startRange, endRange];
    end
    dlmwrite(filenamenew, result, ' ');

- *
- The authors wish to thank Nir Kerem, Dafna Shahaf and Lior Lipshitz for help in developing the SPIES calculation algorithm, and the staff and facilities of the Center for Behavioral Decision Research at CMU. Correspondence concerning this paper should be addressed to Uriel Haran, Tepper School of Business, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, 15213, email: uharan@cmu.edu.
- 1
- The full algorithm used to calculate 90% SPIES intervals is in the Appendix.
- 2
- The identical result for 90% confidence intervals and the fractile method appears to be coincidental, as 63 participants were accurate in both their 90% confidence intervals and fractile estimates, whereas 26 were accurate in only one of the two formats.
- 3
- Only one 90% confidence interval exceeded these boundary values.
- 4
- The highest and lowest temperatures, respectively, ever recorded in Washington, DC in February, the target month for participants’ forecasts.
- 5
- We excluded William McKinley, who was first elected in 1896, and Gerald Ford, who was never elected president.
- 6
- Since we counterbalanced elicitation order between the two groups (i.e., one group estimated SPIES intervals for the first eight estimates and confidence intervals for the last eight, whereas the other group made estimates in the reverse order), the group main effect is equivalent to the method x order interaction.
In formal terms, the group main effect is:

$$H_0: (\mathrm{SPIES}_1 + \mathrm{Conf.\,Int}_2) - (\mathrm{Conf.\,Int}_1 + \mathrm{SPIES}_2) = \mathrm{SPIES}_1 + \mathrm{Conf.\,Int}_2 - \mathrm{Conf.\,Int}_1 - \mathrm{SPIES}_2 = 0.$$

The method x order interaction is:

$$H_0: (\mathrm{SPIES}_1 - \mathrm{SPIES}_2) - (\mathrm{Conf.\,Int}_1 - \mathrm{Conf.\,Int}_2) = \mathrm{SPIES}_1 - \mathrm{SPIES}_2 - \mathrm{Conf.\,Int}_1 + \mathrm{Conf.\,Int}_2 = 0.$$

Note that these two equations are the same. Therefore, a difference in the estimates of the two groups would imply a significant interaction between the elicitation method and order (i.e., first eight estimates vs. last eight).
