- Small samples, low power, and "surprising" hypotheses.
- Statistical control (probably the single most common issue).
- Multiple regression (sometimes called "econometrics").
- Mediation.
- Treating ordered variables as if they were names (un-ordered categories or groups).
- Dichotomizing continuous measures.
- Reporting interactions, or drawing conclusions from significant vs. non-significant.
- Showing that your results are generally true of subjects but not looking at items.

Significance tests are most useful for well-controlled experiments, which are usually designed, often with great care, to make the null hypothesis essentially true when the experimental manipulation has no effect. For population studies involving correlations of measured variables, it is arguable that the null hypothesis is always false, so significance testing is only of descriptive value in these cases. Other useful descriptions include means, standard errors, confidence intervals, or effect sizes in meaningful units (e.g., "the conditions differed by 1.5 points on the 7-point response scale"). Please limit the number of descriptors you use, especially in the middle of sentences.

For JDM, the convention is that "significant" means that the p-level is .05 or less. Note that "not significant" is not the same as "no effect".

Report exact p-levels, even for non-significant results. (See, for example,
Wilkinson et al., *American Psychologist, 54*, 594-604.) Exact
p-levels have descriptive value, even when we use .05 as the cutoff for
claiming that a result is significant. (If all we know is that p<.05, then
in fact our expectation of the unreported true p-level will decrease as
sample size increases.)

P-levels alone are not sufficient to establish the reliability of a result. When power is lower (e.g., from smaller sample size), the probability of a significant result given a true effect is lower, but the probability of a significant result given no effect stays at the p-level. Thus, the probability of a true effect given a significant result (at a given p-level) is lower (assuming that the prior probability of a true effect is independent of power).
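The argument above is an application of Bayes' rule, and a short sketch may make it concrete. The function name, the prior of .5, and the power values of .80 and .20 are illustrative assumptions, not figures from any study.

```python
def prob_true_given_significant(power, alpha, prior):
    """Bayes' rule: P(true effect | significant result).

    power: P(significant | true effect)
    alpha: P(significant | no effect), i.e., the p-level used as the cutoff
    prior: prior probability of a true effect (assumed independent of power)
    """
    true_positive = power * prior          # significant and the effect is real
    false_positive = alpha * (1 - prior)   # significant although there is no effect
    return true_positive / (true_positive + false_positive)


# Hypothetical numbers: same cutoff and same prior, different power.
high_power = prob_true_given_significant(power=0.80, alpha=0.05, prior=0.5)
low_power = prob_true_given_significant(power=0.20, alpha=0.05, prior=0.5)
print(round(high_power, 3), round(low_power, 3))
```

With these assumed inputs, a significant result indicates a true effect with probability of about .94 when power is .80, but only .80 when power is .20.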

More generally, p-values often result from the garden of forking paths, and often this is apparent from the paper itself, e.g., when the main result depends on an analysis other than the most obvious one.

You should also look for reasons why your model is inappropriate, such as quadratic trends in the data. Note, however, that a linear model may still be more powerful than an ordinal one, even if its assumptions are not quite met. For example, in the developmental study just described, you might find a ceiling effect, so that ages 13 and 17 do not differ. This would violate the assumption of linearity, but you may still have no reason to think that the true function is an inverted U, and a simple linear regression may still be the most powerful test.

If you have a one-tailed hypothesis, such as "increase with age" rather than "increase or decrease", then you lose power by doing a two-tailed test. It is no more "conservative" to do a two-tailed test for a one-tailed hypothesis than to treat ages as names when they are really numbers. If you want to be conservative, use a lower p-level to call something significant. Most experimental hypotheses in JDM are one-tailed. For example, when we do a de-biasing experiment, the hypothesis of interest is that the bias is reduced. If it increases, our hypothesis is just wrong, just as if the bias failed to change at all.
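The loss of power can be seen by computing both p-levels from the same test statistic. This is a minimal sketch that assumes a z statistic with a standard normal null distribution and a sign convention in which positive z is the predicted direction; the value 1.8 is hypothetical.

```python
from statistics import NormalDist


def p_levels_from_z(z):
    """Return (one_tailed, two_tailed) p-levels for a z statistic.

    Assumes positive z is the direction the hypothesis predicts.
    """
    standard_normal = NormalDist()
    one_tailed = 1 - standard_normal.cdf(z)             # upper tail only
    two_tailed = 2 * (1 - standard_normal.cdf(abs(z)))  # both tails
    return one_tailed, two_tailed


# Hypothetical result in the predicted direction.
one_tailed, two_tailed = p_levels_from_z(1.8)
print(round(one_tailed, 3), round(two_tailed, 3))
```

Here the one-tailed p-level is below .05 and the two-tailed p-level is above it. A result in the wrong direction (negative z under this convention) gives a one-tailed p-level above .5, which is the sense in which the hypothesis is "just wrong".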

But a split does not always lead to a higher p-value. By chance, a median split, or some other split, will occasionally lead to a lower p-value (as in the article just mentioned). It is thus possible to get a significant result with a median split that would not be significant with the most powerful test. Choices in data analysis are inevitable, and this fact allows "p-hacking", that is, trying out different sets of choices until one yields the desired result. Such a procedure, sometimes easy to rationalize in hindsight, subverts the test of significance. The requirement that researchers report what could be argued to be the single most powerful test limits the possible range of such snooping, and reduces readers' suspicion, justified or not.

The same argument applies to other sorts of categorization.

**Outliers** are a different matter. Sometimes you must do something
with them. It depends on where you think they come from. Sometimes (as
discussed in the next paragraph) they go away with an appropriate
transformation of the data. Sometimes they result from obvious typos,
which you can correct before you do any other analyses. Often it makes
sense to trim them so that they equal the closest non-outlier value,
winsorize them, or, as a last resort, use a rank order test. Sometimes
they are just nonsense, and it is best to omit them. *Whatever you do,
say what you did. And if possible, say whether it matters or not.*
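One of the options above, setting each outlier equal to the closest non-outlier value, can be sketched as follows. The function name, the cutoffs, and the data are hypothetical; how you decide what counts as an outlier is a separate judgment that you should report.

```python
def clamp_to_nearest_inlier(values, lower, upper):
    """Replace values outside [lower, upper] with the closest value inside it.

    lower and upper are the (hypothetical) cutoffs that define an outlier.
    """
    inliers = [v for v in values if lower <= v <= upper]
    if not inliers:
        raise ValueError("no values fall inside the cutoffs")
    smallest, largest = min(inliers), max(inliers)
    # Low outliers move up to the smallest inlier, high ones down to the largest.
    return [min(max(v, smallest), largest) for v in values]


# Hypothetical responses with one extreme value.
raw = [2, 3, 4, 5, 250]
cleaned = clamp_to_nearest_inlier(raw, lower=0, upper=100)
print(cleaned)  # [2, 3, 4, 5, 5]
```

Running the analysis on both `raw` and `cleaned` is one way to say whether the treatment of outliers matters.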

**Transformations** of data are reasonable and perhaps under-used. For
example, a log transform is often appropriate for measures that are
bounded at zero such as willingness to pay or reaction time. (For
willingness to pay, it makes sense to add 1 so that the minimum after
the transformation is zero rather than minus infinity.) Of course,
many other transforms are sensible. Sometimes a sensible
transformation can eliminate the need to remove or trim outliers.
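As a small illustration of the log transform with 1 added, here is a sketch using hypothetical willingness-to-pay responses. `math.log1p(x)` computes log(1 + x).

```python
import math

# Hypothetical willingness-to-pay responses, bounded at zero.
wtp = [0, 1, 10, 100, 1000]

# log(1 + x): a response of zero maps to zero rather than minus infinity.
log_wtp = [math.log1p(x) for x in wtp]
print([round(v, 2) for v in log_wtp])
```

The response of 1000, which is extreme on the raw scale, is much closer to the other responses after the transformation, so it may no longer need to be removed or trimmed.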

The two facts that one effect is significant and another is not do not together imply that the effects are different. This problem also arises with comparisons of correlations; differences of correlations must be tested.
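For correlations, the point can be illustrated with one common approach, the Fisher z test for two correlations from independent samples. This is a sketch under that assumption (correlations measured on the same subjects need a different test), and the correlations and sample sizes are hypothetical.

```python
import math
from statistics import NormalDist


def two_tailed_p(z):
    """Two-tailed p-level for a standard normal statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def p_single_correlation(r, n):
    """Approximate p-level for one correlation, via Fisher's z transform."""
    return two_tailed_p(math.atanh(r) * math.sqrt(n - 3))


def p_difference(r1, n1, r2, n2):
    """Approximate p-level for r1 vs. r2 from two independent samples."""
    standard_error = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return two_tailed_p((math.atanh(r1) - math.atanh(r2)) / standard_error)


# Hypothetical correlations from two independent groups of 50.
p_first = p_single_correlation(0.40, 50)
p_second = p_single_correlation(0.20, 50)
p_diff = p_difference(0.40, 50, 0.20, 50)
print(round(p_first, 3), round(p_second, 3), round(p_diff, 3))
```

In this example the first correlation is significant at .05 and the second is not, yet the test of their difference is far from significant.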

Some interactions are difficult to interpret because of ceiling effects,
floor effects, or, more generally, scaling effects. Apparent interactions
could be *removable* by a reasonable transformation. Most dependent
variables are more
sensitive to *any* manipulation in some parts of their range than
in other parts. Measures are generally less sensitive near their
limits (floor or ceiling), but this is not the only possibility. If we
transform the measure so that it is equally sensitive everywhere, an
interaction might disappear. This problem cannot account for
cross-over interactions and some others.
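A removable interaction can be illustrated with a 2x2 design whose cell means are multiplicative. The cell means below are hypothetical, chosen so that the arithmetic is exact, and a log transform stands in for "a reasonable transformation".

```python
import math


def interaction_contrast(cells):
    """Difference of differences for a 2x2 table of cell means.

    cells maps (level of A, level of B) to a mean.
    """
    effect_of_b_at_a2 = cells[(2, 2)] - cells[(2, 1)]
    effect_of_b_at_a1 = cells[(1, 2)] - cells[(1, 1)]
    return effect_of_b_at_a2 - effect_of_b_at_a1


# Hypothetical cell means: each factor doubles the response.
raw_means = {(1, 1): 2, (1, 2): 4, (2, 1): 4, (2, 2): 8}
log_means = {cell: math.log2(mean) for cell, mean in raw_means.items()}

raw_interaction = interaction_contrast(raw_means)
log_interaction = interaction_contrast(log_means)
print(raw_interaction, log_interaction)
```

The contrast is 2 on the raw scale and 0 on the log scale, so the apparent interaction depends entirely on the choice of scale. A cross-over interaction would change sign across levels and could not be removed by any monotonic transformation of this kind.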

In statistical control, we usually regress Y on X and M, and we
seek to show that the coefficient for X is still significant when M is
included in the model. We want to conclude that M does not explain
the correlation between X and Y. Statistical control often yields
misleading results. The problem is that M is usually intended as a
measure of some underlying variable M*, which is the true variable
whose effect we want to remove. If we want to remove the full
variance due to M*, we must measure it perfectly, without error. Any
error can be expected to reduce the coefficient for M in the model,
thus increasing the coefficient for X. To take an extreme example,
suppose M* is "cognitive ability" and M is "head circumference".
Although we can measure M with great accuracy (and reliability), it
does not correlate very highly with M*, so we do not remove much of
the variance in M* by including M in the model. The validity of M as
a measure of M* is low.^{1} (We
can think of "validity" as the correlation between M and M*.)
Statistical control can be useful when we measure M* without error,
e.g., when it is gender or age, or when the X coefficient is not
reduced at all by the inclusion of M in the model, and when M is
reasonably valid.^{2}
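The effect of low validity can be worked out from the standard formula for a standardized regression coefficient with two predictors. This sketch assumes that X and Y are related only because both correlate with M*, and that the error in M is unrelated to X and Y; the correlations of .5 and .6 and the validities are hypothetical.

```python
def x_coefficient_after_control(r_x_mstar, r_y_mstar, validity):
    """Standardized coefficient for X when Y is regressed on X and M.

    Assumes X and Y correlate only through M*, and that M correlates with
    X and Y only through M*, so each of its correlations is scaled by validity.
    """
    r_xy = r_x_mstar * r_y_mstar   # spurious correlation produced by M*
    r_xm = r_x_mstar * validity    # attenuated by imperfect measurement
    r_ym = r_y_mstar * validity
    return (r_xy - r_xm * r_ym) / (1 - r_xm ** 2)


# Hypothetical values: X and Y correlate .30 only because of M*.
perfect_measure = x_coefficient_after_control(0.5, 0.6, validity=1.0)
poor_measure = x_coefficient_after_control(0.5, 0.6, validity=0.3)
print(round(perfect_measure, 3), round(poor_measure, 3))
```

With a perfect measure the X coefficient is zero, as it should be. With validity of .3 it is about .28, almost as large as the uncontrolled correlation of .30, even though X has no effect on Y in this setup.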

Mediation tests can be informative. But such tests can show spurious mediation when M has no causal effect on Y but some other (possibly un-measured) variable Z correlates positively with both M and Y, or when X has no effect on M but Z correlates with both X and M. Sometimes these spurious effects are implausible or even impossible. If X is an experimental manipulation, for example, then Z cannot affect X. True causal mediation can be missed when (for example) Z correlates in opposite directions with M and Y, or when M is measured poorly.

^{2} This problem was brought to the attention of psychologists by
Daniel Kahneman in 1965, although he applies his argument only to
reliability, and it also applies to validity, as in the case of head
circumference. A broader statement, with some possible solutions, is
here, although the solutions proposed may be limited in their applicability
when validity is an issue. Another general statement, with extensions, is
here.

^{3} For example,
suppose x1=[-1,-2,-3,-4,-3], x2=[1,4,9,16,25], x3=[1,4,9,16,16], and
y=[0,2,6,12,20]. Suppose the simple correlations with y are the most
theoretically relevant results: y correlates negatively with x1 but
positively with x2 and x3. But, if you regress y on x1 and x2, or on
x1 and x3, or on x1, x2 and x3, then the coefficient for x1 is
positive, despite the negative correlation between y and x1. The
coefficient for x3 is positive for these three regressions, as it
should be. However, if you regress y on x2 and x3, the x3 coefficient
is slightly negative. Problems like these can result from
nonlinearity, and from correlations among some predictors. See
this article for a more general discussion of benefits as well as costs of such effects.
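The sign reversals in this example can be checked with a short sketch that uses the closed-form least-squares solution for two predictors. The helper names are illustrative, and the three-predictor regression is not covered here.

```python
def cross_product(u, v):
    """Sum of products of deviations from the means of u and v."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def two_predictor_slopes(xa, xb, y):
    """Least-squares slopes (for xa, for xb) when y is regressed on both."""
    saa, sbb, sab = cross_product(xa, xa), cross_product(xb, xb), cross_product(xa, xb)
    say, sby = cross_product(xa, y), cross_product(xb, y)
    determinant = saa * sbb - sab ** 2
    return ((sbb * say - sab * sby) / determinant,
            (saa * sby - sab * say) / determinant)


x1 = [-1, -2, -3, -4, -3]
x2 = [1, 4, 9, 16, 25]
x3 = [1, 4, 9, 16, 16]
y = [0, 2, 6, 12, 20]

# The sign of the cross-product is the sign of the simple correlation.
print(cross_product(x1, y))             # negative
print(two_predictor_slopes(x1, x2, y))  # x1 slope is positive
print(two_predictor_slopes(x1, x3, y))  # x1 slope is positive
print(two_predictor_slopes(x2, x3, y))  # x3 slope is slightly negative
```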

Jonathan Baron