Marginally Significant


I Tweeted this graph last week. I had been messing around with density plots in R and it seemed a neat illustration of the phrase ‘marginally significant’ being used to mean ‘nearly, but not actually, significant’: the phrase is rare below p=0.05, peaks at p=0.06, then declines sharply before another peak at p=0.1.

The ensuing discussion highlighted a couple of good points: (a) where did the data come from? and (b) it should have been a histogram.

Where did the data come from? A very cursory search of Google Scholar for the phrase “marginally significant (p=x)”, where x is 0.01, 0.02..0.15 in steps of 0.01. That is probably good enough for a quick Tweet, but not enough for sustained discussion.

Should it have been a histogram? Yes, if only because the smoothed peaks of a density plot misrepresent the data: there are no intermediate values between, say, 0.05 and 0.06.

So I re-did the Google Scholar search. This time I looked for statements of the form “marginally significant (p=x)” where x is every written variant of 0.001, 0.002..0.200 in steps of 0.001. So, for example, p=0.01 might appear as 0.01, 0.010, .01 or .010.
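The list of search strings could be generated along these lines. This is a minimal sketch, not the script actually used: the function names `p_value_variants` and `search_phrases` are made up for illustration, and the resulting queries would still have to be run on Google Scholar by hand.

```python
def p_value_variants(thousandths, max_decimals=3):
    """Ways of writing a p-value given in thousandths, e.g. 10 -> 0.01, .01, 0.010, .010."""
    digits = f"{thousandths:0{max_decimals}d}"  # 10 -> "010"
    shortest = digits.rstrip("0")               # "010" -> "01"
    variants = []
    # Pad with trailing zeros from the shortest form up to max_decimals places,
    # and write each form with and without the leading zero.
    for n_decimals in range(len(shortest), max_decimals + 1):
        decimals = digits[:n_decimals]
        variants.append("0." + decimals)
        variants.append("." + decimals)
    return variants


def search_phrases(phrase="marginally significant", first=1, last=200):
    """Build one quoted search string per written variant of each p-value."""
    queries = []
    for thousandths in range(first, last + 1):
        for variant in p_value_variants(thousandths):
            queries.append(f'"{phrase} (p={variant})"')
    return queries


print(p_value_variants(10))
print(len(search_phrases()), "search strings in total")
```

Working in whole thousandths rather than floats avoids rounding surprises such as 0.1 + 0.2 not being exactly 0.3.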

Here are the data: 

And here is the histogram:


It’s still not perfect, since the search misses examples if the p-value isn’t cited directly after the phrase. But until automated searches on Google Scholar are possible, it’s probably the best I can do for now.


Still Not Significant


What to do if your p-value is just over the arbitrary threshold for ‘significance’ of p=0.05?

You don’t need to play the significance testing game – there are better methods, like quoting the effect size with a confidence interval – but if you do, the rules are simple: the result is either significant or it isn’t.
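For anyone who hasn’t quoted an effect size before, here is a rough sketch of the idea: Cohen’s d for two groups with an approximate 95% confidence interval. The data are invented purely for illustration, the function name `cohens_d_with_ci` is hypothetical, and the standard error uses the usual large-sample approximation, so treat the interval as indicative rather than exact.

```python
import math
from statistics import mean, stdev


def cohens_d_with_ci(group1, group2, z=1.96):
    """Cohen's d (pooled SD) with an approximate confidence interval."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * stdev(group1) ** 2 +
                  (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2)
    d = (mean(group1) - mean(group2)) / math.sqrt(pooled_var)
    # Large-sample approximation to the standard error of d (an assumption;
    # exact intervals are based on the noncentral t distribution).
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, d - z * se, d + z * se


# Made-up scores, for illustration only.
treatment = [14.1, 12.3, 15.8, 13.9, 16.2, 12.7, 15.0, 14.4]
control = [12.9, 11.8, 14.2, 12.1, 13.7, 11.5, 13.9, 12.6]

d, low, high = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The interval tells the reader how large or small the effect could plausibly be, which is more informative than which side of 0.05 the p-value happened to fall.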

So if your p-value remains stubbornly higher than 0.05, you should call it ‘non-significant’ and write it up as such. The problem for many authors is that this just isn’t the answer they were looking for: publishing so-called ‘negative results’ is harder than ‘positive results’.

The solution is to apply the time-honoured tactic of circumlocution to disguise the non-significant result as something more interesting. The following list is culled from peer-reviewed journal articles in which (a) the authors set themselves the threshold of 0.05 for significance, (b) failed to achieve that threshold value for p and (c) described it in such a way as to make it seem more interesting.

As well as being statistically flawed (results are either significant or not and can’t be qualified), the wording is linguistically interesting, often describing an aspect of the result that just doesn’t exist. For example, “a trend towards significance” expresses non-significance as some sort of motion towards significance, which it isn’t: there is no ‘trend’, in any direction, and nowhere for the trend to be ‘towards’.

Some further analysis will follow, but for now here is the list in full (UPDATE: now in alpha-order):

(barely) not statistically significant (p=0.052)
a barely detectable statistically significant difference (p=0.073)
a borderline significant trend (p=0.09)
a certain trend toward significance (p=0.08)
a clear tendency to significance (p=0.052)
a clear trend (p<0.09)
a clear, strong trend (p=0.09)
a considerable trend toward significance (p=0.069)
a decreasing trend (p=0.09)
a definite trend (p=0.08)
a distinct trend toward significance (p=0.07)
a favorable trend (p=0.09)
a favourable statistical trend (p=0.09)
a little significant (p<0.1)
a margin at the edge of significance (p=0.0608)
a marginal trend (p=0.09)
a marginal trend toward significance (p=0.052)
a marked trend (p=0.07)
a mild trend (p<0.09)
a moderate trend toward significance (p=0.068)
a near-significant trend (p=0.07)
a negative trend (p=0.09)
a nonsignificant trend (p<0.1)
a nonsignificant trend toward significance (p=0.1)
a notable trend (p<0.1)
a numerical increasing trend (p=0.09)
a numerical trend (p=0.09)
a positive trend (p=0.09)
a possible trend (p=0.09)
a possible trend toward significance (p=0.052)
a pronounced trend (p=0.09)
a reliable trend (p=0.058)
a robust trend toward significance (p=0.0503)
a significant trend (p=0.09)
a slight slide towards significance (p<0.20)
a slight tendency toward significance (p<0.08)
a slight trend (p<0.09)
a slight trend toward significance (p=0.098)
a slightly increasing trend (p=0.09)
a small trend (p=0.09)
a statistical trend (p=0.09)
a statistical trend toward significance (p=0.09)
a strong tendency towards statistical significance (p=0.051)
a strong trend (p=0.077)
a strong trend toward significance (p=0.08)
a substantial trend toward significance (p=0.068)
a suggestive trend (p=0.06)
a trend close to significance (p=0.08)
a trend significance level (p=0.08)
a trend that approached significance (p<0.06)
a very slight trend toward significance (p=0.20)
a weak trend (p=0.09)
a weak trend toward significance (p=0.12)
a worrying trend (p=0.07)
all but significant (p=0.055)
almost achieved significance (p=0-065)
almost approached significance (p=0.065)
almost attained significance (p<0.06)
almost became significant (p=0.06)
almost but not quite significant (p=0.06)
almost clinically significant (p<0.10)
almost insignificant (p>0.065)
almost marginally significant (p>0.05)
almost non-significant (p=0.083)
almost reached statistical significance (p=0.06)
almost significant (p=0.06)
almost significant tendency (p=0.06)
almost statistically significant (p=0.06)
an adverse trend (p=0.10)
an apparent trend (p=0.286)
an associative trend (p=0.09)
an elevated trend (p<0.05)
an encouraging trend (p<0.1)
an established trend (p<0.10)
an evident trend (p=0.13)
an expected trend (p=0.08)
an important trend (p=0.066)
an increasing trend (p<0.09)
an interesting trend (p=0.1)
an inverse trend toward significance (p=0.06)
an observed trend (p=0.06)
an obvious trend (p=0.06)
an overall trend (p=0.2)
an unexpected trend (p=0.09)
an unexplained trend (p=0.09)
an unfavorable trend (p<0.10)
appeared to be marginally significant (p<0.10)
approached acceptable levels of statistical significance (p=0.054)
approached but did not quite achieve significance (p>0.05)
approached but fell short of significance (p=0.07)
approached conventional levels of significance (p<0.10)
approached near significance (p=0.06)
approached our criterion of significance (p>0.08)
approached significant (p=0.11)
approached the borderline of significance (p=0.07)
approached the level of significance (p=0.09)
approached trend levels of significance (p0.05)
approached, but did reach, significance (p=0.065)
approaches but fails to achieve a customary level of statistical significance (p=0.154)
approaches statistical significance (p>0.06)
approaching a level of significance (p=0.089)
approaching an acceptable significance level (p=0.056)
approaching borderline significance (p=0.08)
approaching borderline statistical significance (p=0.07)
approaching but not reaching significance (p=0.53)
approaching clinical significance (p=0.07)
approaching close to significance (p<0.1)
approaching conventional significance levels (p=0.06)
approaching conventional statistical significance (p=0.06)
approaching formal significance (p=0.1052)
approaching independent prognostic significance (p=0.08)
approaching marginal levels of significance (p<0.107)
approaching marginal significance (p=0.064)
approaching more closely significance (p=0.06)
approaching our preset significance level (p=0.076)
approaching prognostic significance (p=0.052)
approaching significance (p=0.09)
approaching the traditional significance level (p=0.06)
approaching to statistical significance (p=0.075)
approaching, although not reaching, significance (p=0.08)
approaching, but not reaching, significance (p<0.09)
approximately significant (p=0.053)
approximating significance (p=0.09)
arguably significant (p=0.07)
as good as significant (p=0.0502)
at the brink of significance (p=0.06)
at the cusp of significance (p=0.06)
at the edge of significance (p=0.055)
at the limit of significance (p=0.054)
at the limits of significance (p=0.053)
at the margin of significance (p=0.056)
at the margin of statistical significance (p<0.07)
at the verge of significance (p=0.058)
at the very edge of significance (p=0.053)
barely below the level of significance (p=0.06)
barely escaped statistical significance (p=0.07)
barely escapes being statistically significant at the 5% risk level (0.1>p>0.05)
barely failed to attain statistical significance (p=0.067)
barely fails to attain statistical significance at conventional levels (p<0.10)
barely insignificant (p=0.075)
barely missed statistical significance (p=0.051)
barely missed the commonly acceptable significance level (p<0.053)
barely outside the range of significance (p=0.06)
barely significant (p=0.07)
below (but verging on) the statistical significant level (p>0.05)
better trends of improvement (p=0.056)
bordered on a statistically significant value (p=0.06)
bordered on being significant (p>0.07)
bordered on being statistically significant (p=0.0502)
bordered on but was not less than the accepted level of significance (p>0.05)
bordered on significant (p=0.09)
borderline conventional significance (p=0.051)
borderline level of statistical significance (p=0.053)
borderline significant (p=0.09)
borderline significant trends (p=0.099)
close to a marginally significant level (p=0.06)
close to being significant (p=0.06)
close to being statistically significant (p=0.055)
close to borderline significance (p=0.072)
close to the boundary of significance (p=0.06)
close to the level of significance (p=0.07)
close to the limit of significance (p=0.17)
close to the margin of significance (p=0.055)
close to the margin of statistical significance (p=0.075)
closely approaches the brink of significance (p=0.07)
closely approaches the statistical significance (p=0.0669)
closely approximating significance (p>0.05)
closely not significant (p=0.06)
closely significant (p=0.058)
close-to-significant (p=0.09)
did not achieve conventional threshold levels of statistical significance (p=0.08)
did not exceed the conventional level of statistical significance (p<0.08)
did not quite achieve acceptable levels of statistical significance (p=0.054)
did not quite achieve significance (p=0.076)
did not quite achieve the conventional levels of significance (p=0.052)
did not quite achieve the threshold for statistical significance (p=0.08)
did not quite attain conventional levels of significance (p=0.07)
did not quite reach a statistically significant level (p=0.108)
did not quite reach conventional levels of statistical significance (p=0.079)
did not quite reach statistical significance (p=0.063)
did not reach the traditional level of significance (p=0.10)
did not reach the usually accepted level of clinical significance (p=0.07)
difference was apparent (p=0.07)
direction heading towards significance (p=0.10)
does not appear to be sufficiently significant (p>0.05)
does not narrowly reach statistical significance (p=0.06)
does not reach the conventional significance level (p=0.098)
effectively significant (p=0.051)
equivocal significance (p=0.06)
essentially significant (p=0.10)
extremely close to significance (p=0.07)
failed to reach significance on this occasion (p=0.09)
failed to reach statistical significance (p=0.06)
fairly close to significance (p=0.065)
fairly significant (p=0.09)
falls just short of standard levels of statistical significance (p=0.06)
fell (just) short of significance (p=0.08)
fell barely short of significance (p=0.08)
fell just short of significance (p=0.07)
fell just short of statistical significance (p=0.12)
fell just short of the traditional definition of statistical significance (p=0.051)
fell marginally short of significance (p=0.07)
fell narrowly short of significance (p=0.0623)
fell only marginally short of significance (p=0.0879)
fell only short of significance (p=0.06)
fell short of significance (p=0.07)
fell slightly short of significance (p>0.0167)
fell somewhat short of significance (p=0.138)
felt short of significance (p=0.07)
flirting with conventional levels of significance (p>0.1)
heading towards significance (p=0.086)
highly significant (p=0.09)
hint of significance (p>0.05)
hovered around significance (p = 0.061)
hovered at nearly a significant level (p=0.058)
hovering closer to statistical significance (p=0.076)
hovers on the brink of significance (p=0.055)
in the edge of significance (p=0.059)
in the verge of significance (p=0.06)
inconclusively significant (p=0.070)
indeterminate significance (p=0.08)
indicative significance (p=0.08)
is just outside the conventional levels of significance
just about significant (p=0.051)
just above the arbitrary level of significance (p=0.07)
just above the margin of significance (p=0.053)
just at the conventional level of significance (p=0.05001)
just barely below the level of significance (p=0.06)
just barely failed to reach significance (p<0.06)
just barely insignificant (p=0.11)
just barely statistically significant (p=0.054)
just beyond significance (p=0.06)
just borderline significant (p=0.058)
just escaped significance (p=0.07)
just failed significance (p=0.057)
just failed to be significant (p=0.072)
just failed to reach statistical significance (p=0.06)
just failing to reach statistical significance (p=0.06)
just fails to reach conventional levels of statistical significance (p=0.07)
just lacked significance (p=0.053)
just marginally significant (p=0.0562)
just missed being statistically significant (p=0.06)
just missing significance (p=0.07)
just on the verge of significance (p=0.06)
just outside accepted levels of significance (p=0.06)
just outside levels of significance (p<0.08)
just outside the bounds of significance (p=0.06)
just outside the conventional levels of significance (p=0.1076)
just outside the level of significance (p=0.0683)
just outside the limits of significance (p=0.06)
just outside the traditional bounds of significance (p=0.06)
just over the limits of statistical significance (p=0.06)
just short of significance (p=0.07)
just shy of significance (p=0.053)
just skirting the boundary of significance (p=0.052)
just tendentially significant (p=0.056)
just tottering on the brink of significance at the 0.05 level
just very slightly missed the significance level (p=0.086)
leaning towards significance (p=0.15)
leaning towards statistical significance (p=0.06)
likely to be significant (p=0.054)
loosely significant (p=0.10)
marginal significance (p=0.07)
marginally and negatively significant (p=0.08)
marginally insignificant (p=0.08)
marginally nonsignificant (p=0.096)
marginally outside the level of significance
marginally significant (p>=0.1)
marginally significant tendency (p=0.08)
marginally statistically significant (p=0.08)
may not be significant (p=0.06)
medium level of significance (p=0.051)
mildly significant (p=0.07)
missed narrowly statistical significance (p=0.054)
moderately significant (p>0.11)
modestly significant (p=0.09)
narrowly avoided significance (p=0.052)
narrowly eluded statistical significance (p=0.0789)
narrowly escaped significance (p=0.08)
narrowly evaded statistical significance (p>0.05)
narrowly failed significance (p=0.054)
narrowly missed achieving significance (p=0.055)
narrowly missed overall significance (p=0.06)
narrowly missed significance (p=0.051)
narrowly missed standard significance levels (p<0.07)
narrowly missed the significance level (p=0.07)
narrowly missing conventional significance (p=0.054)
near limit significance (p=0.073)
near miss of statistical significance (p>0.1)
near nominal significance (p=0.064)
near significance (p=0.07)
near to statistical significance (p=0.056)
near/possible significance (p=0.0661)
near-borderline significance (p=0.10)
near-certain significance (p=0.07)
nearing significance (p<0.051)
nearly acceptable level of significance (p=0.06)
nearly approaches statistical significance (p=0.079)
nearly borderline significance (p=0.052)
nearly negatively significant (p<0.1)
nearly positively significant (p=0.063)
nearly reached a significant level (p=0.07)
nearly reaching the level of significance (p<0.06)
nearly significant (p=0.06)
nearly significant tendency (p=0.06)
nearly, but not quite significant (p>0.06)
near-marginal significance (p=0.18)
near-significant (p=0.09)
near-to-significance (p=0.093)
near-trend significance (p=0.11)
nominally significant (p=0.08)
non-insignificant result (p=0.500)
non-significant in the statistical sense (p>0.05)
not absolutely significant but very probably so (p>0.05)
not as significant (p=0.06)
not clearly significant (p=0.08)
not completely significant (p=0.07)
not completely statistically significant (p=0.0811)
not conventionally significant (p=0.089), but..
not currently significant (p=0.06)
not decisively significant (p=0.106)
not entirely significant (p=0.10)
not especially significant (p>0.05)
not exactly significant (p=0.052)
not extremely significant (p<0.06)
not formally significant (p=0.06)
not fully significant (p=0.085)
not globally significant (p=0.11)
not highly significant (p=0.089)
not insignificant (p=0.056)
not markedly significant (p=0.06)
not moderately significant (P>0.20)
not non-significant (p>0.1)
not numerically significant (p>0.05)
not obviously significant (p>0.3)
not overly significant (p>0.08)
not quite borderline significance (p>=0.089)
not quite reach the level of significance (p=0.07)
not quite significant (p=0.118)
not quite within the conventional bounds of statistical significance (p=0.12)
not reliably significant (p=0.091)
not remarkably significant (p=0.236)
not significant by common standards (p=0.099)
not significant by conventional standards (p=0.10)
not significant by traditional standards (p<0.1)
not significant in the formal statistical sense (p=0.08)
not significant in the narrow sense of the word (p=0.29)
not significant in the normally accepted statistical sense (p=0.064)
not significantly significant but..clinically meaningful (p=0.072)
not statistically quite significant (p<0.06)
not strictly significant (p=0.06)
not strictly speaking significant (p=0.057)
not technically significant (p=0.06)
not that significant (p=0.08)
not to an extent that was fully statistically significant (p=0.06)
not too distant from statistical significance at the 10% level
not too far from significant at the 10% level
not totally significant (p=0.09)
not unequivocally significant (p=0.055)
not very definitely significant (p=0.08)
not very definitely significant from the statistical point of view (p=0.08)
not very far from significance (p<0.092)
not very significant (p=0.1)
not very statistically significant (p=0.10)
not wholly significant (p>0.1)
not yet significant (p=0.09)
not strongly significant (p=0.08)
noticeably significant (p=0.055)
on the border of significance (p=0.063)
on the borderline of significance (p=0.0699)
on the borderlines of significance (p=0.08)
on the boundaries of significance (p=0.056)
on the boundary of significance (p=0.055)
on the brink of significance (p=0.052)
on the cusp of conventional statistical significance (p=0.054)
on the cusp of significance (p=0.058)
on the edge of significance (p>0.08)
on the limit to significant (p=0.06)
on the margin of significance (p=0.051)
on the threshold of significance (p=0.059)
on the verge of significance (p=0.053)
on the very borderline of significance (0.05<p<0.06)
on the very fringes of significance (p=0.099)
on the very limits of significance (0.1>p>0.05)
only a little short of significance (p>0.05)
only just failed to meet statistical significance (p=0.051)
only just insignificant (p>0.10)
only just missed significance at the 5% level
only marginally fails to be significant at the 95% level (p=0.06)
only marginally nearly insignificant (p=0.059)
only marginally significant (p=0.9)
only slightly less than significant (p=0.08)
only slightly missed the conventional threshold of significance (p=0.062)
only slightly missed the level of significance (p=0.058)
only slightly missed the significance level (p=0·0556)
only slightly non-significant (p=0.0738)
only slightly significant (p=0.08)
partial significance (p>0.09)
partially significant (p=0.08)
partly significant (p=0.08)
perceivable statistical significance (p=0.0501)
possible significance (p<0.098)
possibly marginally significant (p=0.116)
possibly significant (0.05<p>0.10)
possibly statistically significant (p=0.10)
potentially significant (p>0.1)
practically significant (p=0.06)
probably not experimentally significant (p=0.2)
probably not significant (p>0.25)
probably not statistically significant (p=0.14)
probably significant (p=0.06)
provisionally significant (p=0.073)
quasi-significant (p=0.09)
questionably significant (p=0.13)
quite close to significance at the 10% level (p=0.104)
quite significant (p=0.07)
rather marginal significance (p>0.10)
reached borderline significance (p=0.0509)
reached near significance (p=0.07)
reasonably significant (p=0.07)
remarkably close to significance (p=0.05009)
resides on the edge of significance (p=0.10)
roughly significant (p>0.1)
scarcely significant (0.05<p>0.1)
significant at the .07 level
significant tendency (p=0.09)
significant to some degree (0<p>1)
significant, or close to significant effects (p=0.08, p=0.05)
significantly better overall (p=0.051)
significantly significant (p=0.065)
similar but not nonsignificant trends (p>0.05)
slight evidence of significance (0.1>p>0.05)
slight non-significance (p=0.06)
slight significance (p=0.128)
slight tendency toward significance (p=0.086)
slightly above the level of significance (p=0.06)
slightly below the level of significance (p=0.068)
slightly exceeded significance level (p=0.06)
slightly failed to reach statistical significance (p=0.061)
slightly insignificant (p=0.07)
slightly less than needed for significance (p=0.08)
slightly marginally significant (p=0.06)
slightly missed being of statistical significance (p=0.08)
slightly missed statistical significance (p=0.059)
slightly missed the conventional level of significance (p=0.061)
slightly missed the level of statistical significance (p<0.10)
slightly missed the margin of significance (p=0.051)
slightly not significant (p=0.06)
slightly outside conventional statistical significance (p=0.051)
slightly outside the margins of significance (p=0.08)
slightly outside the range of significance (p=0.09)
slightly outside the significance level (p=0.077)
slightly outside the statistical significance level (p=0.053)
slightly significant (p=0.09)
somewhat marginally significant (p>0.055)
somewhat short of significance (p=0.07)
somewhat significant (p=0.23)
somewhat statistically significant (p=0.092)
strong trend toward significance (p=0.08)
sufficiently close to significance (p=0.07)
suggestive but not quite significant (p=0.061)
suggestive of a significant trend (p=0.08)
suggestive of statistical significance (p=0.06)
suggestively significant (p=0.064)
tailed to insignificance (p=0.1)
tantalisingly close to significance (p=0.104)
technically not significant (p=0.06)
teetering on the brink of significance (p=0.06)
tend to significant (p>0.1)
tended to approach significance (p=0.09)
tended to be significant (p=0.06)
tended toward significance (p=0.13)
tendency toward significance (p approaching 0.1)
tendency toward statistical significance (p=0.07)
tends to approach significance (p=0.12)
tentatively significant (p=0.107)
too far from significance (p=0.12)
trend bordering on statistical significance (p=0.066)
trend in a significant direction (p=0.09)
trend in the direction of significance (p=0.089)
trend significance level (p=0.06)
trend toward (p>0.07)
trending towards significance (p>0.15)
trending towards significant (p=0.099)
uncertain significance (p>0.07)
vaguely significant (p>0.2)
verged on being significant (p=0.11)
verging on significance (p=0.056)
verging on the statistically significant (p<0.1)
verging-on-significant (p=0.06)
very close to approaching significance (p=0.060)
very close to significant (p=0.11)
very close to the conventional level of significance (p=0.055)
very close to the cut-off for significance (p=0.07)
very close to the established statistical significance level of p=0.05 (p=0.065)
very close to the threshold of significance (p=0.07)
very closely approaches the conventional significance level (p=0.055)
very closely brushed the limit of statistical significance (p=0.051)
very narrowly missed significance (p<0.06)
very nearly significant (p=0.0656)
very slightly non-significant (p=0.10)
very slightly significant (p<0.1)
virtually significant (p=0.059)
weak significance (p>0.10)
weakened..significance (p=0.06)
weakly non-significant (p=0.07)
weakly significant (p=0.11)
weakly statistically significant (p=0.0557)
well-nigh significant (p=0.11)

Infographic: how to interpret Kappa values for DSM-5 inter-rater reliability


An old talk on mediation analysis

A talk I gave at KCL in 2006, obviously hoping that the visual allusion to both William Blake and Isaac Newton would rub off on my own reputation.


Why change scores are generally a bad idea for PROMs data: the maths part

So, thanks to Math for the Masses I can now do equations using LaTeX, a typesetting language I have avoided learning to date, if only because so many people have told me that I simply must learn LaTeX.

To return to the problem at hand, why do we get a negative correlation between preoperative score & health gain?

The reason is that ‘health gain’ is a change score, calculated from the postoperative score minus preoperative score:

Health Gain = Postoperative Score – Preoperative Score

I haven’t worked out how to make LaTeX do ‘friendly’ equations so we will have to simplify this to:

c = b – a

The correlation between preoperative score (a) and health gain (c) is clearly the same as the correlation of a with b-a. We can write the equation for this as follows:

r_{ac}= r_{a(b-a)}= \frac{r_{ab}\times sd_b-sd_a}{\sqrt{sd_{a}^{2}+sd_b^2-2r_{ab}\times sd_a\times sd_b}}

The important thing about this equation is the top part (the numerator). The standard deviations of pre- and postoperative scores are likely to be very similar: we have the same people filling in the same questionnaire on both occasions. Similarly, r_{ab} is likely to be positive, because it’s the correlation between pre- and postoperative scores, and less than 1.0.

This means that r_{ab} × sd_b will almost always be smaller than sd_a, so the numerator will be negative, making the correlation between a and c negative.

Note also that the correlation between pre- and postoperative scores might be quite small if the questionnaire is unreliable, so the greater the measurement error, the greater the negative bias.
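As a quick check on the sign, take the special case where the two standard deviations are exactly equal (an assumption made only to simplify the algebra) and write sd_a = sd_b = s. The equation then collapses to:

r_{ac}= \frac{r_{ab}\times s-s}{\sqrt{2s^{2}-2r_{ab}\times s^{2}}}= \frac{r_{ab}-1}{\sqrt{2(1-r_{ab})}}= -\sqrt{\frac{1-r_{ab}}{2}}

This is negative for any r_{ab} below 1.0 and moves further from zero as r_{ab} falls. With r_{ab} = 0.7, for instance, it gives roughly -0.39.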


Why change scores are generally a bad idea for PROMs data: a worked example

Let’s imagine we have a Patient-Reported Outcome Measure (PROM) for surgical outcomes and we give it to patients pre- and post- operatively to see if health status improves after surgery. In common with many questionnaire measures, our PROM has reliability of 0.7.

Let’s also suppose that the surgery is equally effective for everyone, increasing health status by 3 points. In this example, that’s an effect size of d=0.36.

We calculate ‘health gain’ as the difference between pre- and post- operative score:

Health Gain = Postoperative Score – Preoperative Score

We know this should be around 3 points for each person, give or take a bit due to measurement error.

What happens when we plot health gain against preoperative score?

Scatterplot of negative correlation

The correlation between preoperative score and health gain is negative and significant (r = -0.46, p<0.05). Can we conclude that surgery is more effective for patients with the lowest initial health status?

NO: because we know that everyone improved by the same amount, 3 points. The correlation should be zero.
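The example is easy to simulate. The sketch below is not the code behind the scatterplot, just one way to set up the same scenario. The total SD of 3/0.36 (about 8.3) is inferred from the stated effect size; the mean of 50, the sample size, the normal distributions and the function names are all assumptions for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_prom(n=10_000, reliability=0.7, gain=3.0,
                  total_sd=3.0 / 0.36, seed=1):
    """Return (pre scores, change scores) when everyone truly improves by `gain`."""
    rng = random.Random(seed)
    # Reliability = true-score variance / observed-score variance.
    true_sd = total_sd * math.sqrt(reliability)
    error_sd = total_sd * math.sqrt(1.0 - reliability)
    pre, change = [], []
    for _ in range(n):
        true_score = rng.gauss(50.0, true_sd)  # 50 is an arbitrary scale mean
        pre_observed = true_score + rng.gauss(0.0, error_sd)
        post_observed = true_score + gain + rng.gauss(0.0, error_sd)
        pre.append(pre_observed)
        change.append(post_observed - pre_observed)
    return pre, change


pre, change = simulate_prom()
print(f"mean health gain: {sum(change) / len(change):.2f}")
print(f"r(pre, gain):     {pearson_r(pre, change):.2f}")
```

Every simulated patient’s true improvement is exactly 3 points, yet in a large sample the correlation between preoperative score and health gain settles at around -0.39 rather than zero.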

Why this happens is quite easy to explain but will have to wait until I work out how to do equations in WordPress.

UPDATE: just in case you were wondering if this sort of thing happens in real life, this graph is taken from a BMJ article on PROMs (1) in which it is concluded that “better preoperative health tends to be associated with smaller, not larger, health gains”.

Look familiar?

1. J. Appleby, “Patient reported outcome measures: how are we feeling today?” BMJ 344 (January 11, 2012): d8191.

Why change scores are generally a bad idea for PROMs data

Patient-reported outcome measures (PROMs) are essentially questionnaire scales designed to measure the treatment outcomes that matter to patients. These are usually characterised as ‘health status’ or ‘health-related quality of life’, but really there’s no reason to limit the scope of PROMs to these domains: you could argue, for example, that the numerous measures of severity of depression (PHQ-9, BDI etc.) are PROMs in their own right.

The NHS Information Centre states that:

“The health status information collected from patients by way of PROMs questionnaires before and after an intervention provides an indication of the outcomes or quality of care delivered to NHS Patients”

A key feature of PROMs analysis is the reliability of the data: the degree of measurement error entailed in collecting data using a PROM. Concerns about reliability can be dismissed because reliability is an index of unbiased measurement error and, well, that just cancels out with large samples, doesn’t it? Well, not always. And especially not if you use ‘change scores’.

The natural thing to do with a set of before and after scores is to subtract the ‘before’ from the ‘after’ and call that a ‘change score’ or ‘health gain’:

change score = health gain = [post-treatment score] – [pre-treatment score]

There are two related problems when you do this with PROMs data:

  1. The reliability of the change score is usually much lower than the reliability of the pre- and post- scores
  2. The correlation between pre-score and change score is usually strongly biased in the negative direction

What this means in practice is:

  1. Change scores tend to have greater measurement error than the scores they were derived from
  2. They tend also to have moderate (but spurious) negative correlations with pre-scores
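Point (1) can be made concrete with the standard classical test theory formula for the reliability of a difference score. The sketch below assumes that measurement errors on the two occasions are uncorrelated; the function name `change_score_reliability` and the input values are illustrative, not taken from any real PROM.

```python
def change_score_reliability(rel_a, rel_b, r_ab, sd_a=1.0, sd_b=1.0):
    """Reliability of (b - a), assuming uncorrelated measurement errors.

    rel_a, rel_b: reliabilities of the pre- and post-scores
    r_ab:         observed correlation between pre- and post-scores
    sd_a, sd_b:   observed standard deviations of pre- and post-scores
    """
    covariance = r_ab * sd_a * sd_b
    true_variance = rel_a * sd_a ** 2 + rel_b * sd_b ** 2 - 2 * covariance
    observed_variance = sd_a ** 2 + sd_b ** 2 - 2 * covariance
    return true_variance / observed_variance


# A PROM with reliability 0.7 on both occasions and equal SDs.
for r_ab in (0.3, 0.5, 0.6):
    rel = change_score_reliability(0.7, 0.7, r_ab)
    print(f"pre-post correlation {r_ab:.1f} -> change score reliability {rel:.2f}")
```

The more strongly the pre- and post-scores correlate, the less reliable the change score becomes: most of what the two scores share is true score, and subtraction removes it while leaving the error behind.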

Point (2) is interesting because it can be misinterpreted as an interaction effect: the treatment appears most effective for patients with the lowest pre-treatment scores. But the effect is entirely spurious.

What’s also interesting is that the size of the (spurious) negative correlation between pre-treatment score and change score increases as the reliability of the PROM decreases: the greater the measurement error, the larger the (spurious) correlation, and the more likely it is to be significant.
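To get a feel for the size of the problem, the expected correlation can be tabulated for a range of reliabilities using the formula for the correlation of a with b-a. This is a sketch under two simplifying assumptions: the pre- and post-scores have equal standard deviations, and the treatment effect is the same for everyone, so that the pre-post correlation equals the reliability. The function name is made up for illustration.

```python
import math


def pre_change_correlation(r_ab, sd_a, sd_b):
    """Correlation between pre-score a and change score c = b - a."""
    numerator = r_ab * sd_b - sd_a
    denominator = math.sqrt(sd_a ** 2 + sd_b ** 2 - 2 * r_ab * sd_a * sd_b)
    return numerator / denominator


# Assumption: constant treatment effect, so r_ab equals the PROM's reliability.
for reliability in (0.9, 0.8, 0.7, 0.6, 0.5):
    r_ac = pre_change_correlation(reliability, sd_a=1.0, sd_b=1.0)
    print(f"reliability {reliability:.1f} -> spurious correlation {r_ac:.2f}")
```

Under these assumptions the spurious correlation grows from about -0.22 at a reliability of 0.9 to -0.50 at a reliability of 0.5, even though the true relationship between pre-score and improvement is zero throughout.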

So, to conclude: change scores are generally a bad idea for PROMs data, and the more unreliable the PROMs data, the worse the problem is.