Significance tests are most useful for well-controlled experiments, which are usually designed, often with great care, to make the null hypothesis essentially true when the experimental manipulation has no effect. For large population studies involving correlations of measured variables, it is arguable that the null hypothesis is always false, so significance testing is only of descriptive value in these cases. Measures of effect size are more useful here, but less useful for experiments, which often accept random noise as inevitable and can aggregate it in many different ways.
For JDM, the convention is that "significant" means that the p-level is .05 or less. If it is .06, we say "almost significant".
Report exact p-levels. (See, for example, Wilkinson et al., American Psychologist, 54, 594-604.) Exact p-levels have descriptive value, even when we use .05 as the cutoff for claiming that a result is real. (If all we know is that p<.05, then in fact our expectation of the unreported true p-level will decrease as sample size increases. But the probability of replication does not depend on sample size once we know the exact p-value.)
You should also look for reasons why your model is inappropriate, such as quadratic trends in the data. Note, however, that a linear model may still be more powerful than an ordinal one, even if its assumptions are not quite met. For example, in the developmental study just described, you might find a ceiling effect, so that ages 13 and 17 do not differ. This would violate the assumption of linearity, but you may still have no reason to think that the true function is an inverted U, and a simple linear regression may still be the most powerful test.
If you have a one-tailed hypothesis, such as "increase with age" rather than "increase or decrease", then you lose power by doing a two-tailed test. It is no more "conservative" to do a two-tailed test for a one-tailed hypothesis than to treat ages as names when they are really numbers. If you want to be conservative, use a lower p-level to call something significant. Most experimental hypotheses in JDM are one-tailed. For example, when we do a de-biasing experiment, the hypothesis of interest is that the bias is reduced. If it increases, our hypothesis is just wrong, just as if the bias failed to change at all.
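The power lost by a two-tailed test can be seen in a small sketch. It assumes a z statistic (a normal test statistic) and a hypothesis that predicts a positive effect; the function name and the value 1.80 are illustrative only.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (one_tailed, two_tailed) p-values for a z statistic.

    Assumption: the one-tailed hypothesis predicts a positive z.
    """
    std_normal = NormalDist()
    one_tailed = 1.0 - std_normal.cdf(z)                # P(Z >= z)
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))   # P(|Z| >= |z|)
    return one_tailed, two_tailed

# Illustrative statistic in the predicted direction.
one_tailed, two_tailed = z_test_p_values(1.80)
print(f"z = 1.80: one-tailed p = {one_tailed:.3f}, two-tailed p = {two_tailed:.3f}")
```

With z = 1.80 the one-tailed p is about .036 and the two-tailed p about .072, so the same data pass the .05 cutoff only under the one-tailed test. A result in the wrong direction (z = -1.80) gives a one-tailed p above .95, which matches the point that an increase in bias simply counts against the hypothesis.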
But a split does not always lead to a higher p-value. By chance, a median split, or some other split, will occasionally lead to a lower p-value (as in the article just mentioned). It is thus possible to get a significant result with a median split that would not be significant with the "most powerful test". Choices in data analysis are inevitable, and this fact allows "data snooping", that is, trying out different sets of choices until one yields the desired result. Such a procedure, sometimes easy to rationalize in hindsight, subverts the test of significance. The requirement that researchers report what could be argued to be the single most powerful test limits the possible range of such snooping, and reduces readers' suspicion, justified or not.
The same argument applies to other sorts of categorization.
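Both points about splits, the usual loss of power and the occasional chance reversal, can be illustrated with a small simulation. This is a sketch under stated assumptions: normally distributed predictor and noise, a true slope of 0.25, samples of 100, and a normal approximation to the t distribution for the p-values. All names and settings are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, median

def slope_p_value(xs, ys):
    """Two-tailed p for the slope of y on x (normal approximation to t)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

def simulate(n=100, slope=0.25, reps=1000, alpha=0.05, seed=1):
    """Compare a continuous predictor with its median split."""
    rng = random.Random(seed)
    hits_continuous = hits_split = split_lower = 0
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [slope * x + rng.gauss(0, 1) for x in xs]
        cut = median(xs)
        halves = [1.0 if x > cut else 0.0 for x in xs]  # the median split
        p_continuous = slope_p_value(xs, ys)
        p_split = slope_p_value(halves, ys)
        hits_continuous += p_continuous < alpha
        hits_split += p_split < alpha
        split_lower += p_split < p_continuous  # chance reversal
    return hits_continuous / reps, hits_split / reps, split_lower / reps

power_continuous, power_split, share_split_lower = simulate()
print(f"power, continuous predictor: {power_continuous:.2f}")
print(f"power, median split:         {power_split:.2f}")
print(f"share of samples where the split gave the lower p: {share_split_lower:.2f}")
```

Under these assumptions the continuous predictor rejects the null more often, yet in a minority of samples the split yields the smaller p-value. That minority is what makes it possible to shop among analyses.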
Outliers are a different matter. Sometimes you must do something with them. It depends on where you think they come from. Sometimes (as discussed in the next paragraph) they go away with an appropriate transformation of the data. Sometimes they result from obvious typos, which you can correct before you do any other analyses. Often it makes sense to trim them so that they equal the closest non-outlier value. Or, as a last resort, use a rank order test. Sometimes they are just nonsense, and it is best to omit them. Whatever you do, say what you did. And if possible, say whether it matters or not.
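Trimming to the closest non-outlier value might look like the following sketch. The text does not say how to decide what counts as an outlier, so the rule used here (values more than 1.5 interquartile ranges beyond the quartiles) is an assumption, as are the function name and the data.

```python
from statistics import quantiles

def trim_to_nearest_inlier(values, k=1.5):
    """Replace each outlier with the closest non-outlier value.

    Assumption: an outlier lies more than k interquartile ranges
    below the first quartile or above the third quartile.
    """
    q1, _, q3 = quantiles(values, n=4)
    low_fence = q1 - k * (q3 - q1)
    high_fence = q3 + k * (q3 - q1)
    inliers = [v for v in values if low_fence <= v <= high_fence]
    lowest, highest = min(inliers), max(inliers)
    # Clamp every value into the range spanned by the inliers.
    return [min(max(v, lowest), highest) for v in values]

responses = [12, 15, 14, 13, 16, 15, 14, 95]  # made-up data with one extreme value
trimmed = trim_to_nearest_inlier(responses)
print(trimmed)
```

Here the value 95 is pulled in to 16, the largest remaining response, and nothing else changes. Running the analysis on both `responses` and `trimmed` is one way to report whether the treatment of outliers matters.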
Transformations of data are reasonable and perhaps under-used. For example, a log transform is often appropriate for measures that are bounded at zero such as willingness to pay or reaction time. (For willingness to pay, it makes sense to add 1 so that the minimum after the transformation is zero rather than minus infinity.) Of course, many other transforms are sensible. Sometimes a sensible transformation can eliminate the need to remove or trim outliers.
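As a brief illustration of the add-1 log transform, the sketch below applies it to made-up willingness-to-pay responses. The function name and the data are hypothetical.

```python
import math

def log_plus_one(values):
    """Return log(1 + v) for non-negative values, so that 0 maps to 0."""
    return [math.log1p(v) for v in values]

wtp = [0, 5, 10, 20, 50, 1000]  # made-up willingness-to-pay responses
log_wtp = log_plus_one(wtp)
print([round(v, 2) for v in log_wtp])
```

The response of 0 stays at 0 rather than going to minus infinity, and the response of 1000, which would dominate a mean of the raw values, becomes about 6.9, not far from the rest of the distribution.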
Beyond this, some interactions are difficult to interpret because of ceiling effects, floor effects, or, more generally, scaling effects. See Loftus, G. R. (1978). On interpretation of interactions. Memory and Cognition, 6, 312–319.
Mediation tests are often informative in correlational studies, but they are most useful when X is an experimental manipulation. In this case we can infer causality. Although the Sobel test is commonly recommended for tests of mediation, and is acceptable here, a simpler test may even be better. Specifically, if we find that X predicts M in a simple univariate regression (or correlation), and M predicts Y in a regression of Y on both M and X, we can infer mediation. The test of mediation is thus the maximum p-value for these two simple regression models. (See MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7, 83–104.)
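The simpler test described above, taking the larger of the two p-values, can be sketched as follows. The sketch makes several assumptions that are not in the text: two-tailed p-values, a normal approximation to the t distribution (reasonable only for fairly large samples), simulated data, and hypothetical function names. The coefficient of M in the regression of Y on M and X is tested through the partial correlation of Y and M given X, which yields the same t statistic.

```python
import math
import random
from statistics import NormalDist, mean

def corr(xs, ys):
    """Pearson correlation."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

def two_tailed_p(t):
    """Assumption: normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

def joint_significance(x, m, y):
    """Return (p for X -> M, p for M -> Y given X, their maximum)."""
    n = len(x)
    r_xm, r_xy, r_my = corr(x, m), corr(x, y), corr(m, y)
    # X predicts M in a simple regression.
    t_a = r_xm * math.sqrt((n - 2) / (1.0 - r_xm ** 2))
    # M predicts Y with X in the model: partial correlation of Y and M given X.
    r_partial = (r_my - r_xy * r_xm) / math.sqrt((1.0 - r_xy ** 2) * (1.0 - r_xm ** 2))
    t_b = r_partial * math.sqrt((n - 3) / (1.0 - r_partial ** 2))
    p_a, p_b = two_tailed_p(t_a), two_tailed_p(t_b)
    return p_a, p_b, max(p_a, p_b)

# Simulated experiment: X is the manipulation (0 or 1), M the mediator.
rng = random.Random(7)
x = [float(i % 2) for i in range(200)]
m = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [0.5 * mi + rng.gauss(0, 1) for mi in m]

p_a, p_b, p_mediation = joint_significance(x, m, y)
print(f"X -> M: p = {p_a:.4f}; M -> Y given X: p = {p_b:.4f}; mediation: p = {p_mediation:.4f}")
```

For small samples the exact t distribution should replace the normal approximation, and a one-tailed version may be appropriate when the direction of both paths is predicted in advance.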
In statistical control, we usually regress Y on X and M, and we seek to show that the coefficient for X is still significant when M is included in the model. We want to conclude that M does not explain the correlation between X and Y. Statistical control is over-used and is rarely informative. The problem is that M is usually intended as a measure of some underlying variable M*, which is the true variable whose effect we want to remove. If we want to remove the full variance due to M*, we must measure it perfectly, without error. Any error can be expected to reduce the coefficient for M in the model, thus increasing the coefficient for X. To take an extreme example, suppose M* is "cognitive ability" and M is "head circumference". Although we can measure M with great accuracy (and reliability), it does not correlate very highly with M*, so we do not remove much of the variance in M* by including M in the model. The validity of M as a measure of M* is low. Statistical control can be useful when we measure M without error, e.g., when it is gender or age, or when the X coefficient is not reduced at all by the inclusion of M in the model (and when M is reasonably valid).
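The effect of an imperfect measure of M* can be illustrated with a simulation. In this sketch, which rests entirely on assumed settings, X and Y are related only because both depend on M*, so a perfect control should drive the coefficient for X to zero. The names, sample size, and noise levels are hypothetical.

```python
import random
from statistics import mean

def coef_x_controlling_m(x, m, y):
    """OLS coefficient of X in the regression of Y on X and M."""
    mx, mm, my = mean(x), mean(m), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    smm = sum((b - mm) ** 2 for b in m)
    sxm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    smy = sum((b - mm) * (c - my) for b, c in zip(m, y))
    # Solution of the two normal equations for the X coefficient.
    return (sxy * smm - smy * sxm) / (sxx * smm - sxm ** 2)

rng = random.Random(3)
n = 5000
m_star = [rng.gauss(0, 1) for _ in range(n)]      # the true variable M*
x = [v + rng.gauss(0, 1) for v in m_star]         # X depends on M*
y = [v + rng.gauss(0, 1) for v in m_star]         # Y depends on M* only, not on X
m_noisy = [v + rng.gauss(0, 2) for v in m_star]   # M: a low-validity measure of M*

b_perfect = coef_x_controlling_m(x, m_star, y)
b_noisy = coef_x_controlling_m(x, m_noisy, y)
print(f"X coefficient controlling for M* itself: {b_perfect:.3f}")
print(f"X coefficient controlling for noisy M:   {b_noisy:.3f}")
```

Under these settings the coefficient for X is near zero when M* itself is in the model, but its expected value is about 0.44 when the noisy M is used, close to the 0.5 expected with no control at all. A "still significant" coefficient for X would therefore say little about whether M* explains the correlation.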