Time for a look at the literature.
I'll start with a mild disclaimer: the purpose of this comment is not to have a go at (or embarrass) people who have confused confidence intervals with credible intervals. Indeed, I've only recently started to think more clearly about what is going on here myself - and I certainly wouldn't promise that my current understanding is complete and correct. Moreover, given that almost everyone has been getting it wrong for years, it would not be reasonable to single out a handful of individuals for blame. Really, the purpose of this comment is just to point out how completely ubiquitous this confusion is in the literature.
First the results of a bit of web-surfing. It's actually really hard to find a good definition of a confidence interval on the web.
Here's a typical faulty version which I found linked from Wikipedia:
The confidence interval defines a band around the sample mean within which the true population [mean?] will lie, to some degree of confidence:
For example, there is a 95% probability that the true population mean will lie within the 95% confidence interval of the sample mean.
As I've already demonstrated, this is false in general.
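To see why, consider a standard textbook counterexample (not the author's earlier demonstration - just a minimal simulation sketch in Python, with invented numbers): take two observations drawn uniformly from (theta - 0.5, theta + 0.5) and report the interval between them. Across repeated samples this interval covers theta exactly half the time, so it is a perfectly valid 50% confidence interval. But once you have seen the data, the coverage conditional on what you observed can be anything from close to 0 (when the two observations nearly coincide) to exactly 1 (when they are more than 0.5 apart) - so "50% probability that the true value lies in this particular interval" is simply not what the procedure delivers.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = 0.0          # the true parameter (unknown in practice)
    n_trials = 200_000

    # Two observations drawn uniformly from (theta - 0.5, theta + 0.5);
    # the reported interval is simply [min, max] of the pair.
    x = rng.uniform(theta - 0.5, theta + 0.5, size=(n_trials, 2))
    lo, hi = x.min(axis=1), x.max(axis=1)
    spread = hi - lo
    covered = (lo < theta) & (theta < hi)

    print("overall coverage:", covered.mean())                          # ~0.50, as designed
    print("coverage when spread > 0.5:", covered[spread > 0.5].mean())  # exactly 1.0
    print("coverage when spread < 0.1:", covered[spread < 0.1].mean())  # ~0.05, far below 0.5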
The content of the Wikipedia page itself was misleading until recently - I have had a go at fixing it, rather clumsily. More editing is welcome...
The very first google hit for confidence interval is an interesting case. It is a Lancaster Uni mirror of some widely distributed educational material, which contains the commendably careful definition:
If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter.
which makes it clear that the probability is based on the frequentist concept of repeated sampling to generate a population of confidence intervals. So far so good. However, their definition actually started off with the at best ambiguous:
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
and concludes with the rather unfortunate:
Confidence intervals ... provide a range of plausible values for the unknown parameter.
(That's not to say that confidence intervals never provide such a range - but it is not necessarily what they are designed to do, and they may well fail to achieve it.) Despite the careful description of repeated sampling in the middle of their text, it seems quite possible that some readers will end up with a rather misleading impression of what confidence intervals are.
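For what it's worth, that repeated-sampling definition is easy to make concrete with a small simulation (a sketch only - the population values below are invented for illustration). The 95% attaches to the procedure, i.e. the long-run fraction of intervals that cover the true mean, not to any one realised interval:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n = 10.0, 2.0, 25     # made-up population mean, s.d. and sample size
    n_samples = 100_000
    z = 1.96                         # 97.5th percentile of the standard normal

    # Repeatedly sample and compute the usual 95% interval for the mean (sigma known)
    data = rng.normal(mu, sigma, size=(n_samples, n))
    xbar = data.mean(axis=1)
    half = z * sigma / np.sqrt(n)
    contains_mu = (xbar - half < mu) & (mu < xbar + half)

    print("fraction of intervals covering mu:", contains_mu.mean())  # ~0.95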
The first edition of the highly-regarded book by Wilks ("Statistical methods in the atmospheric sciences") also equated confidence and credible intervals (here), and said things like "H0 [the null hypothesis] is rejected as too unlikely to have been true" - despite the frequentist paradigm explicitly forbidding the attachment of probabilities to hypotheses. I was pleased to find that Prof Wilks quickly agreed with me that this was misleading, and stated that in fact the recently-published second edition (which I have not seen) does not contain the comment about credible intervals.
Perhaps most surprisingly, the mistake is committed by people even when they are railing against the limitations of standard frequentist hypothesis testing! Eg, in this comment, Nicholls recommends reporting confidence intervals:
The reporting of confidence intervals would allow readers to address the question 'Given these data and the correlation calculated with them, what is the probability that H0 is true?'
Oops.
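The trouble, of course, is that no confidence interval or p-value answers that question. To get P(H0|data) you have to put a prior probability on H0 and turn the Bayesian handle, and the answer can be a long way from the p-value. A toy illustration (every number here, including the priors, is invented purely for the example):

    from scipy.stats import norm

    x = 1.96                       # an observation just "significant" at the 5% level
    p_value = 2 * norm.sf(abs(x))  # two-sided p-value under H0: theta = 0, x ~ N(theta, 1)

    # P(H0 | x) needs a prior.  Suppose (purely for illustration) P(H0) = 0.5
    # and, under H1, theta ~ N(0, 1), so the marginal density of x under H1 is N(0, 2).
    f0 = norm.pdf(x, loc=0, scale=1)
    f1 = norm.pdf(x, loc=0, scale=2 ** 0.5)
    post_H0 = 0.5 * f0 / (0.5 * f0 + 0.5 * f1)

    print("p-value:", round(p_value, 3))    # ~0.05
    print("P(H0 | x):", round(post_H0, 3))  # ~0.35 under these particular assumptions

A result that is "significant at the 5% level" can quite happily coexist with a posterior probability for H0 of a third or more, depending on the prior - which is exactly why reporting a confidence interval cannot, on its own, answer Nicholls' question.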
So, how well does climate science come out of this? Well, the confusion seems to pop up just about everywhere that you see these words "likely" and "very likely" (eg throughout the IPCC TAR). The one exception would seem to be the climate sensitivity work, the bulk of which is explicitly Bayesian right from the start, with clearly stated priors. One could probably argue that many of the TAR judgements were based on experts carefully weighing up the evidence, but in many cases (especially the D&A chapter), it seems entirely routine to directly interpret a confidence interval (say, from a regression analysis) as a credible interval. I've seen an absolutely explicit occurrence of this in a recent D&A paper, co-authored by 2 prominent figures in the field.
I promised TCO I would say something about the Hockey Stick. Of course, it's the same story here. On the basis of regression coefficients, MBH (specifically in their 1999 GRL paper, which is repeated in the TAR) make a statement about how "likely" it is that the 1990s were warmer than the previous millennium. I can't help but be amused to note that with all the "auditing" and peer-review, including by professional statisticians, none of them seems to have noticed this little detail.
Of course an important question to consider is, how much does this all matter? And the answer is... it depends. In many cases, the answers you get probably won't turn out too different. There was some explicitly Bayesian estimation briefly mentioned in the D&A chapter of the TAR, and broadly speaking it seemed to give results that were similar to (in fact perhaps even stronger than) the more mainstream stuff. Moreover, there is a lot of slack in terms like "likely", so changing the probability might not invalidate such statements anyway. So I am not by any means suggesting that the TAR needs to be thrown out, and therefore maybe some people will claim this is all a fuss about nothing.

However, I would argue that it is still surely a good thing to at least understand what is going on, so that people can consider how important an issue it is in each particular case. For instance, equating confidence intervals and credible intervals seems to assume (inter alia) the choice of a uniform prior: this decision can by no means be an automatic one, and equating it with "no prior" or "initial ignorance" is definitely wrong. Eg, no-one, not even the most rabid septic, has actually ever believed that CO2 is as likely to cool as warm the planet (at least not since Arrhenius), so assigning an equal prior probability to positive and negative effects would surely be hard to defend. Like the example of the "negative mass" apple, a confidence interval that covers a wide range does not mean that we think the parameter has a significant probability of taking an extreme value! At an absolute minimum, it certainly needs to be stated clearly that this choice of prior was made.
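To put a number on the prior-dependence point (a sketch only - the regression output and the priors below are invented, not taken from any of the papers mentioned): with a conjugate normal update, an effectively flat prior just reproduces the confidence interval as the credible interval, whereas a prior that already favours a positive effect shifts the interval and shrinks the probability of a negative value by an order of magnitude.

    from scipy.stats import norm

    # Invented example: a regression gives beta_hat = 1.0 with standard error 1.0,
    # so the usual 95% confidence interval is roughly (-0.96, 2.96).
    beta_hat, se = 1.0, 1.0

    def posterior(prior_mean, prior_sd):
        # Conjugate normal update: N(prior_mean, prior_sd^2) prior, N(beta_hat, se^2) likelihood
        prec = 1 / prior_sd**2 + 1 / se**2
        mean = (prior_mean / prior_sd**2 + beta_hat / se**2) / prec
        return mean, prec ** -0.5

    priors = {"near-flat prior": (0.0, 1000.0),
              "prior favouring a positive effect": (2.0, 1.0)}
    for label, (m0, s0) in priors.items():
        m, s = posterior(m0, s0)
        lo, hi = norm.ppf([0.025, 0.975], loc=m, scale=s)
        print(f"{label}: 95% credible interval ({lo:.2f}, {hi:.2f}), "
              f"P(beta < 0) = {norm.cdf(0, m, s):.3f}")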
There are a number of other technical issues, such as model error, which are intimately related (eg, what does it mean to determine a parameter or regression coefficient to n decimal places, in a model that is an incomplete representation of the real world?). We can try a bit of hand-waving and claim it doesn't matter too much, but ultimately I think that if we are going to make useful and credible estimates then there is little alternative but to deal with these things more coherently and consistently, even if it does mean more work for the statisticians!