A note on determining the number of cues used in judgment
analysis studies: The issue of type II error
Jason W. Beckstead1
College of Nursing
University of South Florida
Judgment and Decision Making, vol. 2, no. 5, October 2007, pp. 317-325
Abstract
Many judgment analysis studies employ multiple regression procedures to
estimate the importance of cues. Some studies test the
significance of regression coefficients in order to decide
whether or not specific cues are attended to by the judge or decision
maker. This practice is dubious because it ignores type II error. The
purposes of this note are (1) to draw attention to this issue,
specifically as it appears in studies of self-insight, (2) to
illustrate the problem with examples from the judgment literature, and
(3) to provide a simple method for calculating post-hoc power in
regression analyses in order to facilitate the reporting of type II
errors when regression models are used.
Keywords: judgment analysis, self-insight, multiple regression, post-hoc
power.
1 Introduction
For decades judgment analysts have successfully used multiple regression
to model the organizing cognitive principles underlying many types of
judgments in a variety of contexts (see Brehmer & Brehmer, 1988;
Cooksey, 1996; Dhami, et al., 2004, for reviews). Most often these
models depict the individual judge or decision maker as combining
multiple differentially weighted pieces of information (cues) in a
compensatory manner to arrive at a judgment. Further, these analyses
portray those who have acquired expertise on a judgment task as
applying their judgment model or "policy"
with regular, although less than perfect, consistency. The ability of
linear regression models to accurately reproduce such expert judgments
under various conditions has been discussed in detail (e.g.,
Dawes, 1979; Dawes & Corrigan, 1974; Einhorn & Hogarth, 1975). If one
accepts the proposition that people's judgments can be
modeled as though they are multiple regression equations, questions
arise such as: 1) How many of the available cues does the individual
use? and 2) How should the number of cues used be determined?
Too many researchers blindly apply statistical significance tests to
inform them, in a kind of deterministic manner, whether judges
did or did not attend to specific cues. If the t-test
calculated on a cue's weight is significant, then the
cue is counted as being attended to by the judge. Relying on p
values in this way is a problem because these values are affected by
the number of cues and number of cases presented to the judge during
the task and by how well the overall regression equation fits the total
set of responses.
This issue is discussed in this note, which is organized as follows:
First, examples from the judgment literature are reviewed to illustrate
the existence of the problem. Second, notation commonly used by
judgment analysts when describing regression procedures is introduced.
Third, using this notation, a method for calculating the post-hoc power
of t-tests on regression coefficients based on the
noncentral t distribution is described. Fourth, this method is
applied to estimate the number of cases necessary for statistical
significance in order to illustrate how the
investigator's conclusions about the number of
cues attended to in a judgment task should be informed by
considerations of type II error. Finally, an SPSS program for
performing the calculations is described and provided in the Appendix.
2 Some examples in the judgment literature
Although it is reasonable to conclude that a
"significant" cue is important to the judge
and reliably used as he or she makes judgments, the converse does not
follow. When a cue's weight (regression coefficient,
standardized regression coefficient, or squared semipartial
correlation) is not significant, it does not necessarily mean that the
cue is unimportant; there may simply be insufficient statistical power
to produce a significant test result. Determining the number of cues to
which an individual attends is an important issue from both practical
and theoretical viewpoints. In a practical sense, informing poorly
performing judges that they should attend to more (or different) cues
than they apparently do can improve their accuracy (see Balzer, et al.,
1989, for review of cognitive feedback). Theories of cognitive
functioning have long considered determining the amount of information
we process to be a relevant question (e.g., Gigerenzer & Goldstein,
1996; Hammond, 1966; Miller 1956).
In the typical judgment analysis the problem of type II error is
overlooked. I know of no studies in the judgment analysis literature
that report the power of the significance tests on cue weights when
these tests are relied upon to determine the number of cues being used
by a judge. While an exhaustive review of the empirical literature is
beyond the scope of this note, a few examples are presented to
illustrate the problem.
Phelps and Shanteau (1978) purportedly determined the
number of cues used by expert livestock judges in making decisions
using two different experimental (“controlled” and
“naturalistic”) designs. The same seven livestock
judges rated the breeding quality of gilts (female breeding pigs) in
two completely within-subject experiments. The controlled experiment used a partial
factorial design in which each judge made 128 judgments of gilts
described by 11 orthogonal cues. The naturalistic experiment used eight photographs of gilts. In
this experiment the judges first rated the breeding quality of the gilt
in each photo and then rated each photo on the same 11 cues used in the
controlled design. This procedure was repeated, resulting in a total of 16
judgments per judge. The authors then used significance tests to
determine whether specific cues were being used by each judge in the
two experiments. An important finding was that the judges used far more
cues (mean = 10.1) in the controlled design than they did in the
naturalistic design (mean = 0.9). The relevant data are summarized in
Table 1. Using the F statistics reported in their Tables 1 and
2 to calculate estimates of effect sizes (η²)
reveals some paradoxical results; many of the cues showed stronger
relationships to judgments in the naturalistic design. Because of the lower
statistical power in the naturalistic design (the controlled design presented 128 cases
whereas the naturalistic design presented only 16), fewer cues were counted as
significant, and it was concluded that less information was being used
by all judges under the naturalistic design.
Table 1: Summary of results from Phelps and Shanteau (1978) with
addition of effect size estimates.
Judge | No. of significant cues (Controlled) | No. of significant cues (Naturalistic) | Median η² (Controlled) | Median η² (Naturalistic) | No. cues with larger η² in naturalistic
1 | 10 | 2 | 0.205 | 0.365 | 5
2 | 9 | 0 | 0.321 | 0.310 | 7
3 | 10 | 0 | 0.158 | 0.024 | 3
4 | 9 | 3 | 0.264 | 0.333 | 8
5 | 11 | 1 | 0.177 | 0.200 | 5
6 | 11 | 0 | 0.376 | 0.167 | 2
7 | 11 | 0 | 0.162 | 0.184 | 5
When comparing the results of the two experiments the authors attributed
the difference in the amount of information used by the experts to the
stimulus configuration, "...the source of the discrepancy
seems to be in the intercorrelations among the characteristics and not
in the statistical analysis" (Phelps & Shanteau, 1978,
p.218). Although Phelps and Shanteau pointed out that the F
statistics they report could easily be expressed as estimates of effect
size, they did not do so. If they had, they might have come to a different
conclusion about the influences of naturalistic and controlled cue
configurations in their judgment tasks.
One area of research particularly sensitive to the problem at hand is
the study of self-insight into decisions. The assessment of
self-insight in social judgment studies has traditionally compared
statistical weights (derived via regression equations) with subjective
weights. A widely accepted finding is that people have relatively poor
insight into their judgment policies (see Brehmer & Brehmer, 1988;
Harries, et al., 2000; Slovic & Lichtenstein, 1971, for reviews). In
most studies assessing insight, judges are required to produce
subjective weights (e.g., distributing 100 points among the cues).
"It was the comparison of statistical and subjective
weights that produced the greatest evidence for the general lack of
self-insight" (Reilly, 1996, p. 214). Another robust
finding from this literature is that people report using more cues than
are revealed by regression models. "A cue is considered
used if its standardized regression coefficient is
significant" (Harries, et al., 2000, p. 461).
Two influential studies on insight by Reilly and Doherty (1989, 1992)
asked student judges to recognize their judgment policies
among those from several other judges. In the first study seven of
eleven judges were able to identify their own policies. In contrasting
this finding to previous studies the authors noted "These
data reflect an astonishing degree of insight" (Reilly &
Doherty, 1989, p. 125). In the second study the number of cues and the
stimulus configuration were manipulated. Overall, 35 of 77 judges were
able to identify their own policies. The authors reconciled this
encouraging finding with the prevailing literature on methodologic
grounds, arguing that the lack of insight shown in previous studies
might be related to people's inability to articulate
their policies. "There is the distinct possibility that
while people have reasonable self-insight on judgment tasks, they do
not know how to express that insight. Or pointing the finger the other
way round, while people do have insight we do not know how to measure
it" (1992, p. 305).
In both these studies, when judges were presented with policies, each
judge's set of cue weights (squared semipartial correlations in this
case) was rescaled to sum to 100, and importantly, cues which did not
account for significant (p < .01) variance were represented as zeros.
The authors noted the majority of judges (in both studies) indicated
that they had relied on the presence or absence of zeros as part of
the search strategy used to recognize their own policies. The use of
significance tests to assign specific cues a rescaled value of zero in
these studies is problematic for two reasons. First, the power of a
significance test on a squared semipartial correlation in multiple
regression is affected by the value of the multiple R². As R²
increases, smaller weights are more likely to be significant. Second,
the power of these significance tests is affected by the number of
predictors in the regression equation. The net result was that the
criterion used to assign zero to a specific cue was not constant
across judges. Only when all judges are presented with the same number
of cues and all have equal values of R² for their resultant policy
equations could the criterion be consistently applied.
To illustrate, Reilly and Doherty (1989) presented 160 cases
containing 19 cues to each judge. Consider two judges with different
values of R² based on 18 of the cues, say .90 and .50. The minimum
detectable effect (i.e., the smallest weight that the
19th cue could take and still be significant)
is .008 for the first judge but .039 for the second. The same
problem exists in the 1992 study, which used 100 cases, and is compounded
by the fact that the authors manipulated the number of cues presented
to the judges; half the sample rated cases described by six cues and
the other half rated cases described by twelve cues. In the
recognition portion of both studies the useful pattern of zeros in the
cue profiles was an artifact introduced arbitrarily by the use of
significance tests. Had the authors used p < .05 rather than p < .01
to assign zeros, their conclusions about insight might have been
astonishingly different.
Harries et al. (2000, Study 1), examining the prescription decisions of
a sample of 32 physicians, replicated the finding that people are able
to select (recognize) their policies among those from several others.
This study followed up on the participants in a decision-making task
(Evans, et al., 1995) in which 100 cases constructed from 13 cues were
judged and regression analysis was used to derive decision policies.
Judges also provided subjective cue weights, first indicating the
direction (sign) of influence, then rating how much (0-10 scale) the
cue had bearing on their decisions. When comparing tacit to stated
policies (i.e., regression weights to subjective weights) Harries et
al. (2000) described a "triangular pattern of
self-insight": a) cues that had significant weights were
the ones that the judge indicated he or she used, b) where the judge
indicated that a cue was not important it did not have a significant
weight, and c) there were cues that the judge indicated were important
but which did not have significant weights. The
authors' choice of p value for determining
whether a cue was attended to in the tacit policies influenced
all three sides of this triangular pattern.
Approximately 10 months following the decision-making task, participants
were presented with sets of decision policies in the form of bar charts
rather than tables of numbers. Cues with statistically significant
weights were presented as darker bars. With only four cues having
significant effects on decisions (Harries et al., 2000, p. 457), it is
possible that physicians used the presence or absence of lighter bars
in the same way that Reilly and Doherty's students
made use of zeros in their recognition strategies. Had more cues been
classified and presented as significant, the policy recognition task
might have proved more difficult.
Other examples exist in the applied medical judgment literature. Gillis
et al. (1981) relied extensively on p values of beta weights
for describing the judgment policies of 26 psychiatrists making
decisions to prescribe haloperidol based on 8 symptoms (see their Table
4). Averaged across judges, the number of cues used was 2.4, 1.9, or
1.0 depending on the p value employed (.05, .01, or .001,
respectively). Had the investigators chosen to compare the number of
cues used with self-reported usage, which of the three p
values ought they have relied upon? Had the investigators rescaled and
presented policies to participants for recognition (as Reilly &
Doherty did), their choice of p value could have affected the
difficulty of the recognition task.
More recently, in a judgment analysis of 20 prescribing decisions made
by 40 physicians and four medical guideline experts, Smith et al.
(2003) reported "The number of significant cues ... varied between
doctors, ranging from 0 to 5" (p. 57), and among the experts "The
mean number of significant cues was 1.25" (p. 58). It is noteworthy
that this study presented doctors with a relatively small number of
cases thus leaving open the meaning of "significant." Had Smith et
al. presented more than 20 cases, they may have concluded (based on
p values) that doctors and guideline experts attended to more
information when making prescribing decisions.
Other models of judgment, known as "fast and frugal heuristics," have
recently been proposed as alternatives to regression models (see
Gigerenzer, 2004; Gigerenzer & Kurzenhäuser, 2005; Gigerenzer, et
al., 1999). A hallmark of fast and frugal models is that they are
purported to rely on far fewer cues than do judgment models described
by regression procedures. When comparing these classes of models, the
number of cues the judge uses is one way of differentiating the
psychological plausibility of these models (see Gigerenzer, 2004).
Studies comparing regression models with fast and frugal models have
implied that significance testing is the method of
determining the number of cues used despite the fact that the
developers of these methods (e.g., Stewart, 1988) made no such claim
and currently advise against it (Stewart, personal communication, July
2, 2007).
In a study comparing regression with fast and frugal heuristics, Dhami
and Harries (2001) fitted both types of models to 100 decisions made by
medical practitioners. They reported that the number of cues attended to was
significantly greater when modeled by regression than by the matching
heuristic. According to the regression models the average number of
cues used was 3.13 and the average for the fast and frugal models was
1.22. "In the regression model a cue was classified as
being used if its Wald statistic was significant (p < .05) ..."
(Dhami & Harries, 2001, p. 19). In
the heuristic model, the number of cues used was determined by the
percentage of cases correctly predicted by the model; significance
tests were not used. At issue is not the fact that different criteria
were used to count the cues used under the two types of models
(although this is a problem when evaluating their results), but rather,
that the authors relied on a significance test known for some time to
be dubious2,
and their choice of p value for counting cues may have biased
their data to favor the psychological plausibility of the fast and
frugal model. Had they used p < .01 rather than
p < .05, the average number of cues used according
to the regression procedure would presumably have been lower, and
perhaps not different than the average found for the matching
heuristic.
In the last few paragraphs, examples from the literature have been
presented that highlight the problems associated with using
significance tests to determine the number of cues used in judgment
tasks. Tests of significance on regression coefficients or
R² are not very enlightening for
distinguishing the "best" judgment model
from among a set of competing models. The true test of which model
(among a set of contenders) is the best is the ability of the equation
to predict the judgments made in some future sample of cases, the data
from which were not used to estimate the regression equation. The
remaining sections of this note formally present the regression model
as used in judgment analysis and discuss a method for assessing the
power of significance tests so as to provide more information to
judgment analysts who use them.3
3 Notation
Following Cooksey (1996), let the k cues be denoted by
subscripted X's (e.g., X1 to
Xk). In a given judgment analysis a series of
m profiles or cases is constructed where each case is
comprised of k cues. The judge or subject makes m
responses Ys to these cases. The resulting multiple
regression equation representing the subject's
judgment policy is of the general form
Ys = b0 + b1X1 + b2X2 + ... + bkXk + e     (1)
where b0 represents the regression constant and
the remaining bi represent regression
coefficients for each cue where each coefficient indicates the amount
by which the prediction of Ys would change if its
associated cue value changed by one unit while holding all other cue
values constant, and e represents residual or unmodeled
influences.
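For readers who estimate policy equations in SPSS (the package used for the program in the Appendix), Eq. (1) corresponds to an ordinary REGRESSION command. The following minimal sketch assumes hypothetical variable names: the judgments in Y and six cue values in X1 to X6 in the active dataset. The coefficients table it produces supplies the bi, SEbi, and t values discussed below.

REGRESSION
  /STATISTICS COEFF R
  /DEPENDENT Y
  /METHOD=ENTER X1 X2 X3 X4 X5 X6.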
Tests of significance may be employed to assess the null hypothesis
that the value of bi in the population is zero, thus
H0: bi = 0 against the alternative
H1: bi ≠ 0. The ratio bi / SEbi is
distributed as a t statistic with degrees of freedom (df) =
m - k - 1. The SEbi is found as

SEbi = (sdYs / sdXi) × √[(1 − R²Ys) / ((1 − R²Xi)(m − k − 1))]     (2)

where sdYs and sdXi are, respectively, the standard
deviations of the judgments and of the ith cue's values;
R²Ys is the squared multiple correlation for the judgment
equation; and R²Xi is the squared multiple correlation from a
regression analysis predicting the ith cue's values from the
values of the remaining k - 1 cues. In standard multiple regression it
can be shown that the significance test of bi (t = bi / SEbi)
is equivalent to testing the significance of the standardized regression
coefficient βi and of the squared semipartial correlation
associated with Xi (see Pedhazur, 1997). This
is fortunate because most commercially available statistics packages
routinely print values for SEbi but not for
SEβi.4
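Eq. (2) is simple to compute directly. The following sketch, written in the same style as the program in the Appendix, evaluates Eq. (2) from the summary statistics it requires; all of the input values here are hypothetical and would be replaced with those from one's own analysis.

NEW FILE.
INPUT PROGRAM.
COMPUTE sdY = 1.50 /*sd of the judgments (hypothetical) */.
COMPUTE sdX = 1.00 /*sd of the ith cue's values (hypothetical) */.
COMPUTE R2Y = 0.60 /*squared multiple correlation for judgment equation */.
COMPUTE R2X = 0.25 /*squared multiple correlation of cue i on remaining cues */.
COMPUTE k = 6 /*number of cues */.
COMPUTE m = 30 /*number of cases */.
COMPUTE SEb = (sdY/sdX)*SQRT((1-R2Y)/((1-R2X)*(m-k-1))) /*Eq. (2) */.
END CASE.
END FILE.
END INPUT PROGRAM.
FORMAT SEb (F8.3).
LIST sdY sdX R2Y R2X k m SEb.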
4 Post-hoc power analysis on t-test of regression coefficients
Having analyzed data from a judgment analysis using multiple
regression it is rather simple to calculate the statistical power
associated with the t-test of each regression coefficient. All that
is needed from the analysis is the observed value of t, its
df, and the a priori specified value of α. To obtain
the power of the t-test that H0: bi = 0 for
α = .05, one may employ the noncentral distribution
of the t statistic (see Winer et al., 1991, pp. 863-865), here
denoted t′, which is actually a family of distributions
defined by df and a noncentrality parameter δ, hence
t′(df; δ). In the present context
δ = bi / SEbi. The power of the t-test on the
regression coefficient may then be determined as

Prob(t′ > tdf, 1−α/2 | δ = bi / SEbi) = 1 − Prob(type II error)     (3)
Thus the probability that the noncentral t′
will be greater than the critical value of t, given the
observed value of t = bi /
SEbi, is equal to the power of the
test that H0: bi = 0
for α = .05. For example, consider the following result from
an illustrative judgment analysis involving k = 6 cues and
m = 30 cases provided by Cooksey (1996, p.175). The
unstandardized regression coefficient for a particular cue is
b = 0.267 (β = .295) and its standard error is 0.146;
thus t = 0.267/0.146 = 1.829. The critical value for
t with df = 30 - 6 - 1 = 23, and α = .05 for
a two-tailed test is 2.069; consequently the null hypothesis is not
rejected and it might be concluded that this cue is unimportant to the
judge. Using the information from this significance test and the
noncentral distribution of t′(df =
23; δ = 1.829) we find that the probability of type II error =
.582, and thus the power to reject the null is only .418. To claim that
this cue is “unimportant to the judge,” or
“is not being attended to by the judge”
does not seem justifiable in light of the rather high probability of
type II error.
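The figures in this example are easily verified. The following sketch extracts the power calculation from the program in the Appendix and applies it to the Cooksey example above; it should reproduce t = 1.829, the critical value 2.069, and power = .418.

NEW FILE.
INPUT PROGRAM.
COMPUTE b = 0.267 /*coefficient from Cooksey (1996, p. 175) */.
COMPUTE SEb = 0.146 /*its standard error */.
COMPUTE df = 30-6-1 /*df = m - k - 1 = 23 */.
COMPUTE t = b/SEb /*observed t, used as the noncentrality parameter */.
COMPUTE tcrit = IDF.T(1-.05/2,df) /*critical value of t for alpha = .05 */.
COMPUTE power = 1-NCDF.T(tcrit,df,t) /*post-hoc power from noncentral t */.
END CASE.
END FILE.
END INPUT PROGRAM.
FORMAT df (F3.0) t tcrit power (F8.3).
LIST b SEb df t tcrit power.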
5 Estimating the number of cases necessary for a significant t-test of regression coefficients
Faced with such a nonsignificant result, as in the example presented
above, the judgment analyst may wish to know the extent to which this
outcome was related to the study design. In particular, how was the
nonsignificant t-test of the cue weight affected by his or
her decision to present m cases to the judge instead of some
larger number m*? To address this question we must first
clarify the types of stimuli used in judgment studies.
Brunswik (1955) argued for preserving the substantive properties
(content) of the environment to which the investigator wishes to
generalize in the stimuli presented during the experimental task.
Hammond (1966), in attempting to overcome the difficulties inherent in
such representative designs, distinguished between
“substantive” and
“formal” sampling of stimuli. Formal
stimulus sampling concerns the relationships among environmental
stimuli (with content ignored). The following discussion is limited to
studies employing formal stimulus sampling. When taking the formal
approach to stimulus sampling, the investigator's
focus is on maintaining the statistical characteristics of the task
environment (e.g., k, sdXi, and R²Xi) in the
sample of stimuli presented to the participant. These characteristics
of the environment may be summarized as a covariance matrix,
Σ. If the investigator obtains a sample of m
stimuli from the environment, the covariance matrix
Sm, may be computed from the sample and
compared with Σ. The basic assumption of formal
stimulus sampling may then be stated as Sm
≈ Σ. Whether probability or nonprobability
sampling is used, it is possible for the investigator to construct an
alternative set of m* cases such that
Sm* = Sm. Under the
condition that Sm* =
Sm ≈ Σ, it is
possible to estimate SEbi*, the
standard error of the regression coefficient based on the larger sample
of cases m*. Inspection of Eq. (2) reveals that
SEbi becomes smaller as the number of
cases m becomes larger. Holding all other terms in Eq. (2)
constant, SEbi* may be found as
Substituting SEbi* in place of
SEbi when calculating the
t-test on bi allows us to
estimate the impact of increasing m to m* on type I
error in the same judgment analysis. Making the same substitution in
Eq. (3) allows us to estimate the impact of this change on type II
error and power.
Stewart (1988) has discussed the relationships among k,
R²Xi, and
m and recommends m = 50 as a minimum for reliable
estimates of cue weights when k ranges from 4 to 10 and
R²Xi = 0. He points
out that as the intercorrelations among the cues increase, the number
of cases will need to be increased in order to maintain reasonably
small values of SEbi. Of course the
investigator's choice of m should also be
influenced by his or her sense of subject burden. Stewart notes from
empirical evidence that most judges can deal with making between
"40 to 75 judgments in an hour, but the number varies
with the judge and the task" (Stewart, 1988, p.46). In
discussing the design of judgment analysis studies Cooksey (1996) has
suggested that the optimal number of cases may be closer to 80 or 90.
Reilly and Doherty (1992) reported the average time for 77 judges to
complete 100 12-cue cases was 1.25 hours. In a recent study by
Beckstead and Stamp (2007) 15 judges took on average 32 minutes (range
20-47) to respond to 80 cases constructed from 8 cues.
For the example given in the previous section, if the investigator had
used m* = 40, rather than m = 30, Eq. (4) indicates
that SEbi* would have been 0.122 and
the resulting value for the t-test would have been 2.191
with p = .036. The point here is that had the investigator
presented 10 more cases (sampled from the same population), he or she
might have come to a different conclusion about the number of cues
attended to by this judge.
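This estimate can be checked with a few lines of syntax implementing Eq. (4). The sketch below performs, for a single value of m*, the same calculation that the Appendix program performs in a loop, and should reproduce SEbi* = 0.122, t = 2.191, and p = .036.

NEW FILE.
INPUT PROGRAM.
COMPUTE b = 0.267 /*coefficient from the example in Section 4 */.
COMPUTE SEb = 0.146 /*its standard error with m = 30 cases */.
COMPUTE k = 6 /*number of cues */.
COMPUTE N = 30 /*original number of cases m */.
COMPUTE newN = 40 /*proposed larger number of cases m* */.
COMPUTE SEbStar = SEb/SQRT((newN-k-1)/(N-k-1)) /*Eq. (4) */.
COMPUTE tstar = b/SEbStar /*estimated t under m* */.
COMPUTE t_probN = 2*(1-CDF.T(tstar,newN-k-1)) /*estimated p value */.
END CASE.
END FILE.
END INPUT PROGRAM.
FORMAT newN (F5.0) SEbStar tstar t_probN (F8.3).
LIST newN SEbStar tstar t_probN.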
6 An SPSS program for calculating post-hoc power in regression
analysis
The calculations for determining post-hoc power for tests of regression
coefficients as used in judgment analysis studies and estimating
SEbi* are straightforward and based
on statistical theory; however, detailed tables of noncentral t
distributions are hard to come by. The author has written an SPSS
program for performing these calculations that is provided in the
Appendix. To illustrate the program, consider another cue taken from
the same example found in Cooksey (1996, p.175) where b =
-0.423, SEb = 0.386, and k = 6 for m = 30.
Inserting these values into the program and specifying that the number
of cases increase to 90 by increments of 10, produces the result shown
in Table 2.
Table 2: Illustration of the influence of the number of cases m* on
the t-test of a regression coefficient.
m* | SEb* | t-test | p-value |
40 | 0.322 | 1.313 | .198 |
50 | 0.282 | 1.498 | .141 |
60 | 0.254 | 1.664 | .102 |
70 | 0.233 | 1.814 | .074 |
80 | 0.217 | 1.952 | .055 |
90 | 0.203 | 2.082 | .040 |
As m* increases, the estimated values of SEb* decrease
and the values of the t-statistic increase. According to
these estimates, the t-test on this cue would have been
significant had approximately 85 cases been used in the judgment task.
The program can be "rerun" specifying a
smaller increment in order to refine this estimate. The results
provided by such an analysis could also be very useful in the planning
of subsequent judgment studies.
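For example, to locate the crossing point near m* = 85 more precisely, only the last two lines of the @STUFF macro definition need to change so that the table advances one case at a time (the loop will then run from m + 1 up to 90):

COMPUTE maxN = 90 /*maximum value of N for table of estimates */.
COMPUTE incN = 1 /*increment in N for table of estimates */.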
7 Summary and recommendations
In this note the issue of type II error has been raised in the context
of determining whether or not a cue is important to a judge in judgment
analysis studies. Some of the potential pitfalls of relying on
significance tests to determine cue utilization have been pointed out
and a simple method for calculating post-hoc power of such tests has
been presented. A short computer program has been provided to
facilitate these analyses and encourage the calculation (and reporting)
of statistical power when judgment analysts rely on significance tests
to inform them as to the number of cues attended to in judgment tasks.
As a tool for understanding the individual's cognitive
functioning, regression analysis has proved to be quite useful to
judgment researchers for over 40 years. In this role I believe that its
true value lies in its descriptive, not its inferential, facility. Like
any good tool, if we are to continue our reliance upon it we must
ensure that it is in proper working order and not misuse it.
There are alternative models of judgment being advocated (e.g.,
probabilistic models proposed by Gigerenzer and colleagues) that do not
fall prey to the problems associated with regression analysis. However,
as judgment researchers develop, test, and apply these models,
questions about the amount of information (i.e., the number of cues)
individuals use when forming judgments and making decisions are bound
to arise. The strongest evidence for the veracity of any judgment model
is its ability to predict the outcomes of future decisions.
The practice of post-hoc power calculations as an aid in the
interpretation of nonsignificant experimental results is not without
its critics (e.g., Hoenig & Heisey, 2001; Nakagawa & Foster, 2004).
Hypothesis testing is easily misunderstood, but when applied with good
judgment it can be an effective aid to the interpretation of
experimental data (Nickerson, 2000). Higher observed power does not
imply stronger evidence for a null hypothesis that is not rejected
(see Hoenig & Heisey, 2001 for discussion of the power approach
paradox). Some researchers have argued for abandoning the use of
hypothesis testing altogether and relying instead on the confidence
interval estimation approach (Armstrong, 2007; Rozeboom, 1960). I tend
to agree with Gigerenzer and colleagues who put it succinctly, "As
long as decisions based on conventional levels of significance are
given top priority ... theoretical conclusions based on significance
or nonsignificance remain unsatisfactory without knowledge about
power" (Sedlmeier & Gigerenzer, 1989, p. 315).
References
Armstrong, J. S. (2007). Significance tests harm progress in
forecasting. International Journal of Forecasting, 23,
321-327.
Balzer, W. K., Doherty, M. E., & O'Connor, R. Jr.
(1989). Effects of cognitive feedback on performance.
Psychological Bulletin, 106, 410-433.
Beckstead, J. W., & Stamp, K. D. (2007). Understanding how nurse
practitioners estimate patients' risk for coronary
heart disease: A judgment analysis. Journal of Advanced
Nursing, 60, 436-446.
Brehmer, A., & Brehmer, B. (1988). What have we learned about human
judgment from thirty years of policy capturing? In B. Brehmer & C. R.
B. Joyce (Eds.), Human judgment: The SJT view, (pp. 75-114).
Amsterdam: Elsevier Science Publishers.
Brunswik, E. (1955). Representative design and probabilistic theory in
functional psychology. Psychological Review, 62, 193-217.
Cooksey, R. W. (1996). Judgment analysis: Theory, methods, and
applications. San Francisco: Academic Press.
Dawes, R. M. (1979). The robust beauty of improper linear models in
decision making. American Psychologist, 34, 571-582.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making.
Psychological Bulletin, 81, 95-106.
Dhami, M. K., Hertwig, R., & Hoffrage, U. (2004). The role of
representative design in an ecological approach to cognition.
Psychological Bulletin, 130, 959-988.
Dhami, M. K., & Harries, C. (2001). Fast and frugal versus regression
models of human judgment. Thinking and Reasoning, 7, 5-27.
Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for
decision making. Organizational Behavior and Human Performance,
13, 171-192.
Evans, J. St. B. T., Harries, C., Dennis, I., & Dean, J. (1995). General
practitioners' tacit and stated policies in the
prescription of lipid lowering agents. British Journal of
General Practice, 45, 15-18.
Gigerenzer, G. (2004). Fast and frugal heuristics: The tools of bounded
rationality. In D. J. Koehler and N. Harvey (Eds.), Blackwell
handbook of judgment and decision making, (pp. 62-88). Oxford:
Blackwell Publishing.
Gigerenzer, G. & Goldstein, D. G. (1996). Reasoning the fast and frugal
way: Models of bounded rationality. Psychological Review, 103,
650-669.
Gigerenzer, G., & Kurzenhäuser, S. (2005). Fast and frugal heuristics
in medical decision making. In R. Bibace, J. D. Laird, K. D. Noller,
and J. Valsiner (Eds.), Science and Medicine in Dialogue:
Thinking through Particulars and Universals (pp. 3-15). Westport, CT:
Praeger.
Gigerenzer, G., Todd, P. M., & the ABC Research Group (Eds.) (1999).
Fast and frugal heuristics: The adaptive toolbox.
Gillis, J. S., Lipkin, J. O., & Moran, T. J. (1981). Drug therapy
decisions. Journal of Nervous and Mental Disease, 169,
439-437.
Hammond, K. R. (1966). Probabilistic functionalism: Egon
Brunswik's integration of the history, theory, and
method of psychology. In K. R. Hammond (Ed.), The psychology of
Egon Brunswik (pp. 15-80). New York: Holt, Rinehart & Winston.
Harries, C., Evans, J. St. B. T., & Dennis, I. (2000). Measuring
doctors' self-insight into their treatment decisions.
Applied Cognitive Psychology, 14, 455-477.
Hauck, W. W. & Donner, A. (1977). Wald's test as
applied to hypotheses in logit analysis. Journal of the
American Statistical Association, 72, 851-853.
Hoenig, J. M. & Heisey, D. M. (2001). The abuse of power: The pervasive
fallacy of power calculations for data analysis. American
Statistician, 55, 19-24.
Hosmer, D. W. & Lemeshow, S. (2000). Applied logistic
regression, 2nd ed. New York: John
Wiley & Sons, Inc.
Jennings, D. E. (1986). Judging inference adequacy in logistic
regression. Journal of the American Statistical Association,
81, 987-990.
Miller, G. A. (1956). The magical number seven plus or minus two: Some
limits on our capacity for processing information.
Psychological Review, 63, 81-97.
Nakagawa, S., & Foster, T. M. (2004). The case against retrospective
statistical power analyses with an introduction to power analysis.
Acta Ethologica, 7, 103-108.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review
of an old and continuing controversy. Psychological Methods,
5, 241-301.
Pedhazur, E. J. (1997). Multiple regression in behavioral
research: Explanation and prediction, 3rd ed. Fort
Worth: Harcourt Brace College Publishers.
Phelps, R. H., & Shanteau, J. (1978). Livestock judges: How much
information can an expert use? Organizational Behavior and
Human Performance, 21, 209-219.
Reilly, B. A. (1996). Self-insight, other-insight, and their relation to
interpersonal conflict. Thinking and Reasoning, 2, 213-222.
Reilly, B. A., & Doherty, M. E. (1989). A note on the assessment of
self-insight in judgment research. Organizational Behavior and
Human Decision Processes, 44, 123-131.
Reilly, B. A., & Doherty, M. E. (1992). The assessment of self-insight
in judgment policies. Organizational Behavior and Human
Decision Processes, 53, 285-309.
Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance
test. Psychological Bulletin, 57, 416-428.
Sedlmeier, P. & Gigerenzer, G. (1989). Do studies of statistical power
have an effect on the power of studies? Psychological Bulletin,
105, 309-316.
Slovic, P. & Lichtenstein, S. (1971). Comparison of Bayesian and
regression approaches to the study of information processing in
judgment. Organizational Behavior and Human Performance, 6, 649-744.
Smith, L., Gilhooly, K., & Walker, A. (2003). Factors influencing
prescribing decisions in the treatment of depression: A Social Judgment
Theory approach. Applied Cognitive Psychology, 17, 51-63.
Stewart, T. R. (1988). Judgment analysis: Procedures. In B. Brehmer &
C. R. B. Joyce (eds.) Human judgment: The SJT view,
(pp. 41-74). Amsterdam: Elsevier Science Publishers.
Winer, B. J., Brown, D. R., & Michels, K. M. (1991).
Statistical principles in experimental design,
3rd ed. New York: McGraw-Hill, Inc.
Appendix
The following is an SPSS program to calculate post-hoc power of
t-test on regression coefficients and to estimate sample
size needed for significance of such tests. After typing the commands
into a syntax window and supplying information specific to your
analysis, simply run the program to obtain results similar to those
found in Table 2.
In the listing below, text between /* and */ is a comment; the values
assigned within the @STUFF macro definition are the information to be
supplied by the user.
**------------------------------------------.
**ENTER NECESSARY INFORMATION FROM MULTIPLE REGRESSION ANALYSIS HERE*.
DEFINE @STUFF ().
COMPUTE b = -0.423 /*unstandardized regression coefficient */.
COMPUTE SEb = 0.386 /*standard error of regression coefficient */.
COMPUTE k = 6 /*number of predictors in regression equation */.
COMPUTE N = 30 /*number of observations or cases */.
COMPUTE alpha = .05 /*type I error criterion */.
COMPUTE maxN = 90 /*maximum value of N for table of estimates */.
COMPUTE incN = 10 /*increment in N for table of estimates */.
!ENDDEFINE .
**------------------------------------------.
**CALCULATING POST-HOC POWER for t-TEST of REGRESSION COEFFICIENT.
NEW FILE.
INPUT PROGRAM.
@STUFF.
COMPUTE t = ABS(b/SEb) /*absolute value of t-test on b */.
COMPUTE df = N-k-1 /*degrees of freedom for t-test on b */.
COMPUTE tcrit = IDF.T(1-(alpha/2),df) /*critical value of t for desired alpha */.
COMPUTE t_prob = 2*(1-CDF.T(t,df)) /*this is obs p value for t-test on b */.
COMPUTE Power = 1-NCDF.T(tcrit,df,t) /*post-hoc power for obs t-test on b */.
END CASE.
END FILE.
END INPUT PROGRAM.
FORMAT N k DF (F3.0) t_prob t b SEb Power (F8.3).
LIST b SEb t k N t_prob power.
**ESTIMATING SAMPLE SIZE NECESSARY FOR t-TEST OF b TO BE SIGNIFICANT.
NEW FILE.
INPUT PROGRAM.
@STUFF.
LOOP newN = N+incN TO maxN BY incN.
COMPUTE SEbStar = SEb/SQRT((newN-k-1)/(N-k-1)) /*est of SEb under new N */.
COMPUTE tcritN = IDF.T(1-(alpha/2),newN-k-1) /*crit t value for desired α */.
COMPUTE tstar = ABS(b/SEbStar) /*est of t under new N */.
COMPUTE t_probN = 2*(1-CDF.T(tstar,newN-k-1)) /*est of p-value for tstar */.
COMPUTE powerN = 1-NCDF.T(tcritN,newN-k-1,tstar) /*estd power of test under new N */.
END CASE.
LEAVE b SEb k N alpha.
END LOOP.
END FILE.
END INPUT PROGRAM.
FORMAT newN (F5.0) SEbStar powerN tstar t_probN b (F5.3).
LIST newN b SEbStar tstar t_probN powerN.
Footnotes:
1Address: Jason W. Beckstead,
University of South Florida College of Nursing, 12901 Bruce B.
Downs Boulevard MDC22, Tampa, Florida 33612. Email:
jbeckste@health.usf.edu
2Hauck and Donner (1977) found that the Wald test
behaves in an aberrant manner. Jennings (1986) has also questioned the
adequacy of the Wald test for making statistical inferences. Hosmer and
Lemeshow (2000) recommend using the likelihood-ratio test instead.
3The utility of statistical
significance and hypothesis testing as a general approach has been
questioned by researchers in the social sciences (e.g., Armstrong,
2007; Nickerson, 2000; Rozeboom, 1960). I believe that many of us are
likely to continue to rely on this approach for some time. It is
therefore important that we fully understand the assumptions,
mechanics, and limitations of this approach.
4The method presented here is also directly
applicable to standardized regression coefficients when their
corresponding standard errors are available.