Statistical policies of Judgment and Decision Making

Statistics is controversial, and we do not want to dictate when authors have a well-reasoned disagreement with conventional views or with what follows. But here is the starting point. The following checklist concerns "red flags". These are not necessarily problems, but they will require examination before review. It will help if you make sure in advance that these issues will not cause problems, by reading the associated text below.

Checklist for some statistical red flags: Details below

  1. Small samples, low power, and "surprising" hypotheses.
  2. Statistical control (probably the single most common issue).
  3. Multiple regression (sometimes called "econometrics").
  4. Mediation.
  5. Treating ordered variables as if they were names (un-ordered categories or groups).
  6. Dichotomizing continuous measures.
  7. Claiming interactions without testing them, or drawing conclusions from comparisons of significant vs. non-significant results.
  8. Showing that your results hold for subjects without also examining whether they hold for items.

Significance tests are useful

Despite all the criticism of null-hypothesis significance testing, a journal just cannot publish articles when the main results can be explained by a simple and plausible alternative hypothesis, namely, that the results arise from random variation. Thus, authors must do something to reject this hypothesis. The standard thing is a null-hypothesis significance test. This is not the only thing. Other methods involve credible intervals, confidence intervals, Bayes factors, and posterior p-values (in which the probability of the null hypothesis given the data is estimated from a Bayesian posterior distribution). These all have their place. But something must be done.

Significance tests are most useful for well-controlled experiments, which are usually designed, often with great care, to make the null hypothesis essentially true when the experimental manipulation has no effect. For population studies involving correlations of measured variables, it is arguable that the null hypothesis is always false, so significance testing is only of descriptive value in these cases. Other useful descriptions include means, standard errors, confidence intervals, or effect sizes in meaningful units (e.g., "the conditions differed by 1.5 points on the 7-point response scale"). Please limit the number of such descriptive statistics, especially in the middle of sentences, where they interrupt the text.

For JDM, the convention is that "significant" means that the p-level is .05 or less. Note that "not significant" is not the same as "no effect".

Report exact p-levels, even for non-significant results. (See, for example, Wilkinson et al., 1999, American Psychologist, 54, 594-604.) Exact p-levels have descriptive value, even when we use .05 as the cutoff for claiming that a result is significant. (If all we know is that p < .05, then in fact our expectation of the unreported true p-level will decrease as sample size increases.)

P-levels alone are not sufficient to establish the reliability of a result. When power is lower (e.g., because the sample is smaller), the probability of a significant result given a true effect is lower, but the probability of a significant result given no effect remains the p-level. Thus, the probability of a true effect given a significant result (at a given p-level) is lower when power is lower (assuming that the prior probability of a true effect does not depend on power).
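
A rough illustration of this point in R (the prior probability of .5 and the power values are arbitrary assumptions, chosen only for illustration):

ppv <- function(power, alpha = .05, prior = .5) {
  (power * prior) / (power * prior + alpha * (1 - prior))  # P(true effect | significant result)
}
ppv(power = .80)  # about .94: with high power, most significant results reflect true effects
ppv(power = .20)  # about .80: with low power, a larger share are false positives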

More generally, p-values often result from the garden of forking paths, and often this is apparent from the paper itself, e.g., when the main result depends on an analysis other than the most obvious one.

Use the most powerful test

Suppose you do a study of child development using ages 5, 9, 13, and 17 years, and you think that the dependent variable should increase with age. If you treat age as a categorical variable (a factor, as if the numbers were merely names), then the hypothesis of an age effect is consistent with 24 different orderings (excluding ties). Because you are casting a wide net for so many different results, most of which are really part of your null hypothesis, your test is less powerful for detecting the effect of interest, and it can be significant even when your hypothesis is false. If you instead test for a linear trend, you will test your hypothesis directly, without losing power. (See Hale, G. A. (1977). On use of ANOVA in developmental research. Child Development, 48, 1101-1106.)
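
For example, a sketch in R with simulated data (the effect size and noise level are arbitrary assumptions):

set.seed(1)
age <- rep(c(5, 9, 13, 17), each = 20)      # simulated developmental study
y <- 0.1 * age + rnorm(length(age))         # the true effect is a linear increase with age
anova(lm(y ~ factor(age)))                  # ages treated as names: a 3-df omnibus test
summary(lm(y ~ age))$coefficients["age", ]  # ages treated as numbers: a 1-df linear trend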

You should also look for reasons why your model is inappropriate, such as quadratic trends in the data. Note, however, that a linear model may still be more powerful than an ordinal one, even if its assumptions are not quite met. For example, in the developmental study just described, you might find a ceiling effect, so that ages 13 and 17 do not differ. This would violate the assumption of linearity, but you may still have no reason to think that the true function is an inverted U, and a simple linear regression may still be the most powerful test.

If you have a one-tailed hypothesis, such as "increase with age" rather than "increase or decrease", then you lose power by doing a two-tailed test. It is no more "conservative" to do a two-tailed test for a one-tailed hypothesis than to treat ages as names when they are really numbers. If you want to be conservative, use a lower p-level to call something significant. Most experimental hypotheses in JDM are one-tailed. For example, when we do a de-biasing experiment, the hypothesis of interest is that the bias is reduced. If it increases, our hypothesis is just wrong, just as if the bias failed to change at all.
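
In R, the directional test can be requested explicitly; here is a sketch with simulated (hypothetical) bias scores:

set.seed(2)
bias_control  <- rnorm(40, mean = 1.0)      # hypothetical bias scores, control condition
bias_debiased <- rnorm(40, mean = 0.6)      # hypothetical bias scores, de-biasing condition
t.test(bias_control, bias_debiased, alternative = "greater")  # one-tailed: control > de-biased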

Don't throw away good data

When you split a continuous variable at the median, or anywhere, you add error. For example, if the median is 50 and the numbers are uniformly distributed from 0 to 100, a median split will effectively add an error that ranges between 0 and 25 to every number, with an average of 12.5. This reduces power. See this article (html version) for references.
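
The loss of power is easy to see in a small simulation (a sketch; the sample size, effect size, and number of replications are arbitrary assumptions):

set.seed(3)
sig <- replicate(2000, {
  x <- runif(100, 0, 100)                   # continuous predictor
  y <- .01 * x + rnorm(100)                 # weak true linear relation
  split <- ifelse(x > median(x), 1, 0)      # dichotomized version of x
  c(continuous   = summary(lm(y ~ x))$coefficients["x", 4] < .05,
    median.split = summary(lm(y ~ split))$coefficients["split", 4] < .05)
})
rowMeans(sig)  # proportion significant: lower for the median split than for the continuous predictor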

But a split does not always lead to a higher p-value. By chance, a median split, or some other split, will occasionally lead to a lower p-value (as in the article just mentioned). It is thus possible to get a significant result with a median split that would not be significant with the most powerful test. Choices in data analysis are inevitable, and this fact allows "p-hacking", that is, trying out different sets of choices until one yields the desired result. Such a procedure, sometimes easy to rationalize in hindsight, subverts the test of significance. The requirement that researchers report what could be argued to be the single most powerful test limits the possible range of such snooping, and reduces readers' suspicion, justified or not.

The same argument applies to other sorts of categorization.

Outliers are a different matter. Sometimes you must do something with them. It depends on where you think they come from. Sometimes (as discussed in the next paragraph) they go away with an appropriate transformation of the data. Sometimes they result from obvious typos, which you can correct before you do any other analyses. Often it makes sense to trim them so that they equal the closest non-outlier value, winsorize them, or, as a last resort, use a rank order test. Sometimes they are just nonsense, and it is best to omit them. Whatever you do, say what you did. And if possible, say whether it matters or not.
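
Some of these options, sketched in R with an invented reaction-time vector and an arbitrary outlier rule:

rt  <- c(412, 530, 388, 467, 9500, 502)       # invented reaction times with one extreme value
out <- rt > 3000                              # an arbitrary rule for flagging outliers
rt_trimmed <- ifelse(out, max(rt[!out]), rt)  # set outliers equal to the closest non-outlier value
wilcox.test(rt, mu = 500)                     # a rank-order test, as a last resort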

Transformations of data are reasonable and perhaps under-used. For example, a log transform is often appropriate for measures that are bounded at zero such as willingness to pay or reaction time. (For willingness to pay, it makes sense to add 1 so that the minimum after the transformation is zero rather than minus infinity.) Of course, many other transforms are sensible. Sometimes a sensible transformation can eliminate the need to remove or trim outliers.
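
For example (hypothetical willingness-to-pay responses):

wtp     <- c(0, 1, 5, 10, 50, 200)          # hypothetical willingness-to-pay responses
log_wtp <- log(wtp + 1)                     # equivalently log1p(wtp); a response of 0 maps to 0, not -Inf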

Interactions, and comparisons of significant vs. non-significant

Many claims are of the form "Variable X has a greater effect on Y in condition C1 than in C2", or "Variable X affects Y in C1 but not in C2", or "X2 moderates the effect of X1 on Y." These involve interactions. Tests of interactions must be reported. Failure to report such tests is a common error.

The fact that one effect is significant while another is not does not imply that the two effects differ from each other; the difference itself must be tested. The same problem arises with comparisons of correlations; differences between correlations must be tested.
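
A sketch of the explicit test in R, with simulated data (the variable names and effect sizes are invented); for comparing two independent correlations, a Fisher z test (e.g., r.test in the psych package) is one option:

set.seed(4)
condition <- rep(c("C1", "C2"), each = 100)
x <- rnorm(200)
y <- ifelse(condition == "C1", .8, .2) * x + rnorm(200)  # the effect of x differs by condition
summary(lm(y ~ x * condition))              # the x:condition coefficient tests whether the effects differ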

Some interactions are difficult to interpret because of ceiling effects, floor effects, or, more generally, scaling effects. Apparent interactions could be removable by a reasonable transformation. Most dependent variables are more sensitive to any manipulation in some parts of their range than in other parts. Measures are generally less sensitive near their limits (floor or ceiling), but this is not the only possibility. If we transform the measure so that it is equally sensitive everywhere, an interaction might disappear. This problem cannot account for cross-over interactions and some others.
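
A small numerical illustration with invented cell means, showing an interaction on the raw scale that disappears after a log transformation:

means <- matrix(c(1, 2, 4, 8), nrow = 2,    # invented cell means: rows are levels of X, columns are conditions
                dimnames = list(X = c("low", "high"), condition = c("C1", "C2")))
diff(means)        # raw scale: the effect of X is 1 in C1 but 4 in C2 (an apparent interaction)
diff(log2(means))  # log scale: the effect of X is 1 in both conditions (the interaction is removed)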

If you report interactions and main effects from the same regression model (or Type III ANOVA), be careful about the effect of interaction terms on other estimates. Estimates of lower-order interactions and main effects may differ as a function of (1) whether higher-order interactions are included and (2) how the variables are coded. See this paper.
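
A sketch of the coding issue in R (simulated data; with an interaction in the model, the coefficients labeled as "main effects" change when the predictors are centered, even though the model fits identically):

set.seed(5)
a <- rep(c(0, 1), times = 50)
b <- rep(c(0, 1), each  = 50)
y <- a + b + 2 * a * b + rnorm(100)         # simulated data with a true interaction
coef(lm(y ~ a * b))                         # dummy coding: "a" is the simple effect of a at b = 0
coef(lm(y ~ I(a - .5) * I(b - .5)))         # centered coding: the same term is the effect of a averaged over b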

Third variables: mediation and statistical control

We often find that X is correlated with Y. X may be an experimental manipulation or a measured variable, and Y is a dependent variable of interest. And we want to assess the role of a third variable, M, which might be correlated with both X and Y. Sometimes we want to ask whether M mediates the relationship between X and Y. Sometimes we want to show that M does not account for the relation between X and Y; that is, we want to control for M, statistically.

In statistical control, we usually regress Y on X and M, and we seek to show that the coefficient for X is still significant when M is included in the model. We want to conclude that M does not explain the correlation between X and Y. Statistical control often yields misleading results. The problem is that M is usually intended as a measure of some underlying variable M*, which is the true variable whose effect we want to remove. If we want to remove the full variance due to M*, we must measure it perfectly, without error. Any error can be expected to reduce the coefficient for M in the model, thus increasing the coefficient for X. To take an extreme example, suppose M* is "cognitive ability" and M is "head circumference". Although we can measure M with great accuracy (and reliability), it does not correlate very highly with M*, so we do not remove much of the variance in M* by including M in the model. The validity of M as a measure of M* is low.1 (We can think of "validity" as the correlation between M and M*.) Statistical control can be useful when we measure M* without error, e.g., when it is gender or age, or when the X coefficient is not reduced at all by the inclusion of M in the model, and when M is reasonably valid.2
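
A simulation of this problem (a sketch; the sample size and the amount of measurement error are arbitrary assumptions). Here X has no effect on Y at all: the X-Y correlation is entirely due to M*, but controlling for a noisy measure M leaves X with a clearly positive coefficient.

set.seed(6)
Mstar <- rnorm(5000)                        # the true third variable M*
X <- Mstar + rnorm(5000)                    # X and Y are related only through M*
Y <- Mstar + rnorm(5000)
M <- Mstar + rnorm(5000, sd = 2)            # M: a noisy, low-validity measure of M*
coef(lm(Y ~ X + Mstar))["X"]                # controlling for M* itself: near zero, as it should be
coef(lm(Y ~ X + M))["X"]                    # controlling for the noisy M: clearly positive (about .4)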

Mediation tests can be informative. But such tests can show spurious mediation when M has no causal effect on Y but some other (possibly un-measured) variable Z correlates positively with both M and Y, or when X has no effect on M but Z correlates with both X and M. Sometimes these spurious effects are implausible or even impossible. If X is an experimental manipulation, for example, then Z cannot affect X. True causal mediation can be missed when (for example) Z correlates in opposite directions with M and Y, or when M is measured poorly.

Regression vs. correlation

Multiple regression should not be relied upon alone when the questions concern the relation between each of several variables and a dependent variable. Simple correlations (or univariate regressions), including correlations of predictors with each other, are usually necessary to provide a full picture of the results. When the predictors examined are an arbitrary selection from those that might have been examined (had they been measured), the procedure itself becomes arbitrary, since each coefficient can be affected by what else is included in the model. It can even produce misleading conclusions.3 Multiple regression is, however, helpful or necessary for some purposes, such as removing nuisance variables, assessing the extent to which a correlation results from variation between or within sub-groups, statistical control (when coefficients are not reduced by the inclusion of a covariate or when accuracy of measurement is not a problem), or practical problems in prediction.

Subjects and items

Some studies test differences between responses to two types of items, such as personal and impersonal moral dilemmas. They present several dilemmas of each type to each of several subjects. A common error here is to consider subject variance but not item variance. Both subjects and items are sampled from presumably larger populations, and we want to know something about these populations, not just the specific set of items that we looked at. This issue is too complex to discuss fully here, as it is often reasonable to use only a single well-controlled pair of items (e.g., the Asian disease problem) or even a single subject, if the goal is simply to show that some effect exists somewhere. Classic reviews are here and here.
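
One common approach is a mixed model with crossed random effects for subjects and items; here is a sketch using the lme4 package with simulated data (the design and variance components are invented):

library(lme4)                               # assumes the lme4 package is installed
set.seed(7)
d <- expand.grid(subject = factor(1:40), item = factor(1:16))
d$type <- ifelse(as.integer(d$item) <= 8, "personal", "impersonal")    # item type, varying between items
d$rating <- .5 * (d$type == "personal") +                              # a true type effect, plus
  rnorm(40)[d$subject] + rnorm(16)[d$item] + rnorm(nrow(d))            # subject, item, and residual variation
summary(lmer(rating ~ type + (1 | subject) + (1 | item), data = d))    # type is tested against both sources of variance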

Final note: The Psychonomic Society has a policy on statistics for all its journals, which is fully compatible with the policies described here, although somewhat different in what it covers.

1 Mathematically, various measures used for statistical control, such as partial correlation and multiple regression, amount to asking whether the correlation between X and Y can be explained by the correlations of both variables with M. If so, then r(X,Y) = r(X,M)*r(M,Y). So we ask whether r(X,Y) is larger than this product. The numerator of the formula for partial correlation is thus r(X,Y)-r(X,M)*r(M,Y). If we add error variance to M, then both r(X,M) and r(M,Y) are reduced, and so is their product, and this difference becomes (more) positive.

2 This problem was noted by Daniel Kahneman in 1965, although his argument applies only to reliability, and the problem also exists for validity, as in the case of head circumference. A broader analysis, with some possible solutions, is here, although the solutions proposed are limited when validity is an issue. Another general statement, with extensions, is here. When validity is not an issue (e.g., when a test consists of problems of a certain type and the variable of interest is "ability to solve that type of problem"), a tolerable solution is to "disattenuate" a regression model starting with a raw correlation matrix M (dependent variable and all predictors) and then correcting all correlations using a reliability measure (such as omega; alpha might over-correct) for each variable, as follows in R code:

M <- M/sqrt(R %*% t(R)) # correct all correlations; R is a vector of reliabilities in the same order
diag(M) <- 1            # set the diagonal of the corrected matrix to 1
Predictors <- M[-1,-1]  # remove the dependent variable (DV), here the first
DVcors <- M[1,-1]       # disattenuated correlations of the DV with each predictor, a vector
CorrectedCoefficients <- solve(Predictors) %*% DVcors # invert the matrix and multiply
(Thanks to Andrew Meyer for this solution and code.)

3 For example, suppose x1=[-1,-2,-3,-4,-3], x2=[1,4,9,16,25], x3=[1,4,9,16,16], and y=[0,2,6,12,20]. Suppose the simple correlations with y are the most theoretically relevant results: y correlates negatively with x1 but positively with x2 and x3. But, if you regress y on x1 and x2, or on x1 and x3, or on x1, x2 and x3, then the coefficient for x1 is positive, despite the negative correlation between y and x1. The coefficient for x3 is positive in the regressions that include it, as it should be. However, if you regress y on x2 and x3, the x3 coefficient is slightly negative. Problems like these can result from nonlinearity, and from correlations among some predictors. See this article for a more general discussion of benefits as well as costs of such effects.
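
The example can be reproduced in R:

x1 <- c(-1, -2, -3, -4, -3);  x2 <- c(1, 4, 9, 16, 25)
x3 <- c(1, 4, 9, 16, 16);     y  <- c(0, 2, 6, 12, 20)
c(cor(y, x1), cor(y, x2), cor(y, x3))  # simple correlations: negative, positive, positive
coef(lm(y ~ x1 + x2))                  # the x1 coefficient is positive, reversing its simple correlation
coef(lm(y ~ x2 + x3))                  # the x3 coefficient is slightly negative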


Jonathan Baron