Statistical policies of Judgment and Decision Making

Statistics is controversial, and we do not want to dictate everything when authors have a well-reasoned disagreement with conventional views or with what follows. But here is the starting point.

Significance tests are useful

Despite all the criticism of null-hypothesis significance testing, a journal just cannot publish articles when the main results can be explained by a simple and plausible alternative hypothesis, namely, that the results arise from random variation. Thus, authors must do something to reject this hypothesis. The standard thing is a null-hypothesis significance test. This is not the only thing. Other methods involve confidence intervals, and posterior p-values (in which the probability of the null hypothesis given the data is estimated from a Bayesian posterior distribution). These all have their place. But something must be done.

Significance tests are most useful for well-controlled experiments, which are usually designed, often with great care, to make the null hypothesis essentially true when the experimental manipulation has no effect. For large population studies involving correlations of measured variables, it is arguable that the null hypothesis is always false, so significance testing is only of descriptive value in these cases. Measures of effect size are more useful here, but less useful for experiments, which often accept random noise as inevitable and can aggregate it in many different ways.

For JDM, the convention is that "significant" means that the p-level is .05 or less. If it is .06, we say "almost significant".

Report exact p-levels. (See, for example, Wilkinson et al., 1999, American Psychologist, 54, 594-604.) Exact p-levels have descriptive value, even when we use .05 as the cutoff for claiming that a result is real. (If all we know is that p<.05, then in fact our expectation of the unreported true p-level will decrease as sample size increases. But the probability of replication does not depend on sample size once we know the exact p-value.)

Use the most powerful test

Suppose you do a study of child development using ages 5, 9, 13, and 17 years, and you think that the dependent variable should increase with age. If you treat age as a categorical variable (a factor, as if the numbers were merely names), then the hypothesis of an age effect is consistent with 24 different orderings (excluding ties). Because you are casting a wide net for so many different results, most of which are really part of your null hypothesis, your test is less powerful for detecting the effect of interest, and it can be significant even when your hypothesis is false. If you instead test for a linear trend, you will test your hypothesis directly, without losing power. (See Hale, G. A. (1977). On use of ANOVA in developmental research. Child Development, 48, 1101-1106.)
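The count of 24 orderings, and the idea of a directed linear test, can be illustrated with a short sketch in Python. The group means below are hypothetical, and the contrast (centered ages used as weights) is one common way to express a linear trend, not necessarily the analysis any particular author would choose.

```python
from itertools import permutations

ages = (5, 9, 13, 17)

# Every way the four age groups could be ranked on the dependent
# variable, excluding ties.
orderings = list(permutations(ages))
n_orderings = len(orderings)

# Only one of those rankings is "increases with age".
n_increasing = sum(1 for o in orderings if o == tuple(sorted(ages)))


def linear_contrast(means, xs=ages):
    """Weighted sum of group means, using centered ages as weights.

    A positive value indicates an increasing linear trend; a value near
    zero indicates no linear trend.
    """
    center = sum(xs) / len(xs)
    weights = [x - center for x in xs]
    return sum(w * m for w, m in zip(weights, means))


# Hypothetical group means, for illustration only.
hypothetical_means = [2.0, 3.1, 3.9, 5.0]

print(n_orderings, n_increasing)
print(linear_contrast(hypothetical_means))
```

An omnibus test on age as a factor responds to any of the 24 orderings, while the contrast responds only to the pattern that the hypothesis predicts.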

You should also look for reasons why your model is inappropriate, such as quadratic trends in the data. Note, however, that a linear model may still be more powerful than an ordinal one, even if its assumptions are not quite met. For example, in the developmental study just described, you might find a ceiling effect, so that ages 13 and 17 do not differ. This would violate the assumption of linearity, but you may still have no reason to think that the true function is an inverted U, and a simple linear regression may still be the most powerful test.

If you have a one-tailed hypothesis, such as "increase with age" rather than "increase or decrease", then you lose power by doing a two-tailed test. It is no more "conservative" to do a two-tailed test for a one-tailed hypothesis than to treat ages as names when they are really numbers. If you want to be conservative, use a lower p-level to call something significant. Most experimental hypotheses in JDM are one-tailed. For example, when we do a de-biasing experiment, the hypothesis of interest is that the bias is reduced. If it increases, our hypothesis is just wrong, just as if the bias failed to change at all.
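The loss of power from a two-tailed test can be made concrete under a simplifying assumption: the test statistic is normally distributed with unit variance and a mean (called delta here, a hypothetical value) in the predicted direction. This is a sketch of the general point, not a power analysis for any real design.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def power_one_tailed(delta, alpha=0.05):
    """Probability of rejecting when the statistic is N(delta, 1) and
    the predicted direction is positive."""
    crit = Z.inv_cdf(1 - alpha)
    return 1 - Z.cdf(crit - delta)


def power_two_tailed(delta, alpha=0.05):
    """Probability of rejecting in either tail when the statistic is
    N(delta, 1)."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(crit - delta)) + Z.cdf(-crit - delta)


delta = 2.0  # hypothetical expected value of the test statistic
print(power_one_tailed(delta))
print(power_two_tailed(delta))
```

Both tests reject a true null hypothesis 5% of the time, but the one-tailed test rejects more often when the effect is in the predicted direction.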

Don't throw away good data

When you split a continuous variable at the median, or anywhere, you add error. For example, if the median is 50 and the numbers are uniformly distributed from 0 to 100, a median split will effectively add an error that ranges between 0 and 25 to every number, with an average of 12.5. This reduces power. See this article (html version) for references.
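The figures in the uniform example can be checked numerically. The sketch below uses an evenly spaced grid as a stand-in for a uniform distribution (an assumption made for simplicity) and treats each value, after the split, as if it were replaced by the mean of its half.

```python
# Evenly spaced values from 0 to 100 stand in for a uniform distribution.
values = [i / 10 for i in range(1001)]
median = 50.0

low = [v for v in values if v < median]
high = [v for v in values if v >= median]
low_mean = sum(low) / len(low)
high_mean = sum(high) / len(high)

# After the split, each value is known only by its half, so it is
# effectively replaced by the mean of that half.
errors = [abs(v - (low_mean if v < median else high_mean)) for v in values]

mean_error = sum(errors) / len(errors)
max_error = max(errors)

print(low_mean, high_mean)
print(mean_error, max_error)
```

The half means come out close to 25 and 75, the largest error close to 25, and the average error close to 12.5.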

But a split does not always lead to a higher p-value. By chance, a median split, or some other split, will occasionally lead to a lower p-value (as in the article just mentioned). It is thus possible to get a significant result with a median split that would not be significant with the "most powerful test". Choices in data analysis are inevitable, and this fact allows "data snooping", that is, trying out different sets of choices until one yields the desired result. Such a procedure, sometimes easy to rationalize in hindsight, subverts the test of significance. The requirement that researchers report what could be argued to be the single most powerful test limits the possible range of such snooping, and reduces readers' suspicion, justified or not.

The same argument applies to other sorts of categorization.

Outliers are a different matter. Sometimes you must do something with them. It depends on where you think they come from. Sometimes (as discussed in the next paragraph) they go away with an appropriate transformation of the data. Sometimes they result from obvious typos, which you can correct before you do any other analyses. Often it makes sense to trim them so that they equal the closest non-outlier value. Or, as a last resort, use a rank order test. Sometimes they are just nonsense, and it is best to omit them. Whatever you do, say what you did. And if possible, say whether it matters or not.
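Trimming outliers to the closest non-outlier value might look like the following sketch. The cutoffs that define an outlier (`lower` and `upper`) and the response values are hypothetical; how to choose the cutoffs is a judgment that the code does not make for you. The function also counts how many values it changed, so that the change can be reported.

```python
def trim_to_nearest(values, lower, upper):
    """Replace values outside [lower, upper] with the closest observed
    value that lies inside that range.

    Returns the trimmed list and the number of values changed.
    """
    inside = [v for v in values if lower <= v <= upper]
    if not inside:
        raise ValueError("no values fall inside the allowed range")
    lo, hi = min(inside), max(inside)
    trimmed = [min(max(v, lo), hi) for v in values]
    n_changed = sum(1 for a, b in zip(values, trimmed) if a != b)
    return trimmed, n_changed


# Hypothetical responses with one extreme value in each direction.
responses = [3, 5, 4, 6, 5, 250, 4, -40]
trimmed, n_changed = trim_to_nearest(responses, lower=0, upper=20)

print(trimmed)
print(n_changed)
```

Running the main analysis on both `responses` and `trimmed` is one simple way to say whether the treatment of outliers matters.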

Transformations of data are reasonable and perhaps under-used. For example, a log transform is often appropriate for measures that are bounded at zero such as willingness to pay or reaction time. (For willingness to pay, it makes sense to add 1 so that the minimum after the transformation is zero rather than minus infinity.) Of course, many other transforms are sensible. Sometimes a sensible transformation can eliminate the need to remove or trim outliers.
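As an illustration of the log transform with 1 added, the sketch below uses made-up willingness-to-pay responses that include zeros and one very large value. The numbers are hypothetical and chosen only to show the effect on the shape of the distribution.

```python
import math
from statistics import mean, median

# Hypothetical willingness-to-pay responses, in dollars.
wtp = [0, 5, 10, 10, 20, 25, 50, 2000]

# Adding 1 before taking logs maps a response of 0 to 0 rather than
# to minus infinity.
log_wtp = [math.log(x + 1) for x in wtp]

print(mean(wtp), median(wtp))
print(mean(log_wtp), median(log_wtp))
```

On the raw scale the mean is pulled far above the median by the single large response; after the transform the two are much closer, and the large response no longer dominates.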

Interactions

Many claims are of the form "Variable X has a greater effect on Y in condition C1 than in C2", or "Variable X affects Y in C1 but not in C2", or "X2 moderates the effect of X1 on Y." These involve interactions. Tests of interactions must be reported.
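A claim such as "X affects Y in C1 but not in C2" is not established by one significant and one non-significant simple effect. The sketch below uses hypothetical cell means, a common standard deviation, and equal cell sizes, with a normal approximation for the p-values (an assumption made to keep the example short; a t test or a regression model would be the usual choice).

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(estimate, se):
    """Two-tailed p-value from a normal approximation."""
    z = estimate / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical summary statistics for a 2 x 2 design.
n_per_cell = 50
sd = 1.0
cell_means = {
    ("C1", "control"): 0.00,
    ("C1", "treated"): 0.45,
    ("C2", "control"): 0.00,
    ("C2", "treated"): 0.30,
}

var_of_cell_mean = sd ** 2 / n_per_cell

# Simple effect of X within each condition.
effect_c1 = cell_means[("C1", "treated")] - cell_means[("C1", "control")]
effect_c2 = cell_means[("C2", "treated")] - cell_means[("C2", "control")]
se_effect = sqrt(2 * var_of_cell_mean)

# Interaction: the difference between the two simple effects, which
# involves all four cell means.
interaction = effect_c1 - effect_c2
se_interaction = sqrt(4 * var_of_cell_mean)

p_c1 = two_tailed_p(effect_c1, se_effect)
p_c2 = two_tailed_p(effect_c2, se_effect)
p_interaction = two_tailed_p(interaction, se_interaction)

print(p_c1, p_c2, p_interaction)
```

With these numbers the effect is significant in C1 and not in C2, yet the interaction is far from significant, so the data do not support the claim that the effect differs between conditions.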

Beyond this, some interactions are difficult to interpret because of ceiling effects, floor effects, or, more generally, scaling effects. See Loftus, G. R. (1978). On interpretation of interactions. Memory and Cognition, 6, 312–319.

Third variables: mediation and statistical control

We often find that X is correlated with Y. X may be an experimental manipulation or a measured variable, and Y is a dependent variable of interest. And we want to assess the role of a third variable, M, which might be correlated with both X and Y. Sometimes we want to ask whether M mediates the relationship between X and Y. Sometimes we want to show that M does not account for the relation between X and Y; that is, we want to control for M, statistically.

Mediation tests are often informative in correlational studies, but they are most useful when X is an experimental manipulation. In this case we can infer causality. Although the Sobel test is commonly recommended for tests of mediation, and is acceptable here, a simpler test may even be better. Specifically, if we find that X predicts M in a simple univariate regression (or correlation), and M predicts Y in a regression of Y on both M and X, we can infer mediation. The test of mediation is thus the maximum p-value for these two simple regression models. (See MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7, 83–104.)
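The simpler test described above (the larger of the two p-values) can be sketched with ordinary least squares written out by hand. The data are simulated with made-up effect sizes, the function names are illustrative, and the p-values use a normal approximation rather than the t distribution, which is an assumption that is reasonable only for fairly large samples.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def p_from_z(z):
    """Two-tailed p-value from a normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def centered(v):
    m = fmean(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def path_a_p(x, m):
    """p-value for X in the simple regression of M on X."""
    xc, mc = centered(x), centered(m)
    sxx, sxm, smm = dot(xc, xc), dot(xc, mc), dot(mc, mc)
    slope = sxm / sxx
    resid_var = (smm - slope * sxm) / (len(x) - 2)
    return p_from_z(slope / sqrt(resid_var / sxx))


def path_b_p(x, m, y):
    """p-value for M in the regression of Y on both M and X."""
    xc, mc, yc = centered(x), centered(m), centered(y)
    sxx, smm, smx = dot(xc, xc), dot(mc, mc), dot(mc, xc)
    smy, sxy, syy = dot(mc, yc), dot(xc, yc), dot(yc, yc)
    det = smm * sxx - smx ** 2
    b_m = (sxx * smy - smx * sxy) / det
    b_x = (smm * sxy - smx * smy) / det
    resid_var = (syy - b_m * smy - b_x * sxy) / (len(x) - 3)
    return p_from_z(b_m / sqrt(resid_var * sxx / det))


def mediation_p(x, m, y):
    """Joint test: the larger of the two path p-values."""
    return max(path_a_p(x, m), path_b_p(x, m, y))


# Simulated data: X is a two-level manipulation, M depends on X,
# and Y depends on M. Effect sizes are made up.
random.seed(1)
n = 200
x = [float(i % 2) for i in range(n)]
m = [0.8 * xi + random.gauss(0, 1) for xi in x]
y = [0.5 * mi + random.gauss(0, 1) for mi in m]

p_mediation = mediation_p(x, m, y)
print(p_mediation)
```

Mediation is inferred only if both paths are significant, which is why the reported p-value is the maximum of the two.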

In statistical control, we usually regress Y on X and M, and we seek to show that the coefficient for X is still significant when M is included in the model. We want to conclude that M does not explain the correlation between X and Y. Statistical control is over-used and is rarely informative. The problem is that M is usually intended as a measure of some underlying variable M*, which is the true variable whose effect we want to remove. If we want to remove the full variance due to M*, we must measure it perfectly, without error. Any error can be expected to reduce the coefficient for M in the model, thus increasing the coefficient for X. To take an extreme example, suppose M* is "cognitive ability" and M is "head circumference". Although we can measure M with great accuracy (and reliability), it does not correlate very highly with M*, so we do not remove much of the variance in M* by including M in the model. The validity of M as a measure of M* is low. Statistical control can be useful when we measure M without error, e.g., when it is gender or age, or when the X coefficient is not reduced at all by the inclusion of M in the model (and when M is reasonably valid).
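The effect of an imperfect measure can be seen in a small simulation. In the sketch below, M* influences both X and Y, and X has no effect of its own on Y. All variable names, effect sizes, and noise levels are assumptions chosen for illustration.

```python
import random
from statistics import fmean


def coef_x_controlling(x, m, y):
    """Coefficient for X in the regression of Y on X and M (with an
    intercept), computed from centered sums of squares and products."""
    mx, mm, my = fmean(x), fmean(m), fmean(y)
    xc = [a - mx for a in x]
    mc = [a - mm for a in m]
    yc = [a - my for a in y]
    sxx = sum(a * a for a in xc)
    smm = sum(a * a for a in mc)
    smx = sum(a * b for a, b in zip(mc, xc))
    sxy = sum(a * b for a, b in zip(xc, yc))
    smy = sum(a * b for a, b in zip(mc, yc))
    det = smm * sxx - smx ** 2
    return (smm * sxy - smx * smy) / det


random.seed(2)
n = 5000

# M* is the true underlying variable; it drives both X and Y.
m_star = [random.gauss(0, 1) for _ in range(n)]
x = [ms + random.gauss(0, 1) for ms in m_star]
y = [ms + random.gauss(0, 1) for ms in m_star]  # no direct effect of X

# M is a measure of M* with low validity (much unrelated variance).
m_noisy = [ms + random.gauss(0, 2) for ms in m_star]

b_x_true_control = coef_x_controlling(x, m_star, y)
b_x_noisy_control = coef_x_controlling(x, m_noisy, y)

print(b_x_true_control)
print(b_x_noisy_control)
```

Controlling for M* itself leaves a coefficient for X near zero, as it should. Controlling for the low-validity measure leaves a clearly positive coefficient for X, even though X has no effect on Y in the simulated data.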

Regression vs. correlation

Multiple regression has its uses, but it should not be used routinely when the questions concern the relation between several variables and some dependent variable. Simple correlations are often more informative. Multiple regression tells you the contribution of each predictor with all the others statistically controlled. When the predictors examined are some relatively arbitrary selection from those that might have been examined, this procedure has little point.
Jonathan Baron