Time for a look at the literature.
I'll start with a mild disclaimer: the purpose of this comment is not to have a go at (or embarrass) people who have confused confidence intervals with credible intervals. Indeed I've only recently started to think more clearly about what is going on here myself - and I certainly wouldn't promise that my current understanding is complete and correct. Moreover, given that almost everyone has been getting it wrong for years, it would not be reasonable to single out a handful of individuals for blame. Really, the purpose of this comment is just to point out how completely ubiquitous this confusion is in the literature.
First the results of a bit of web-surfing. It's actually really hard to find a good definition of a confidence interval on the web. Here's a typical faulty version which I found linked from Wikipedia:
"The confidence interval defines a band around the sample mean within which the true population [mean?] will lie, to some degree of confidence: For example, there is a 95% probability that the true population mean will lie within the 95% confidence interval of the sample mean."

As I've already demonstrated, this is false in general.
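To see concretely why, here is a minimal sketch (my illustration, not taken from the earlier post referred to): a quantity known to be non-negative - a mass, say - is measured with Gaussian noise, and the textbook 95% interval is reported. The procedure has the advertised long-run coverage, yet a particular realised interval can lie entirely below zero, in which case it certainly does not contain the true value "with 95% probability".

import random

# A quantity known to be non-negative, measured with Gaussian noise of known sd.
random.seed(0)
true_mass, sigma, trials = 0.1, 1.0, 100_000
covered = impossible = 0

for _ in range(trials):
    x = random.gauss(true_mass, sigma)
    lo, hi = x - 1.96 * sigma, x + 1.96 * sigma   # textbook 95% interval
    covered += lo <= true_mass <= hi
    impossible += hi < 0                          # whole interval below zero

print(f"long-run coverage: {covered / trials:.3f}")               # ~0.95, as advertised
print(f"entirely-negative intervals: {impossible / trials:.3f}")  # not zero (roughly 2% with these numbers)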
The content of the Wikipedia page itself was misleading until recently - I have had a go at fixing it, rather clumsily. More editing is welcome...
The very first google hit for confidence interval is an interesting case. It is a Lancaster Uni mirror of some widely distributed educational material, which contains the commendably careful definition:
"If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter."

This makes it clear that the probability is based on the frequentist concept of repeated sampling to generate a population of confidence intervals. So far so good. However, their definition actually started off with the at best ambiguous:

"A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data."

and concludes with the rather unfortunate:

"Confidence intervals ... provide a range of plausible values for the unknown parameter."

(That's not to say that confidence intervals never provide such a range - but it is not necessarily what they are designed to do, and they may well fail to achieve this.) Despite the careful description of repeated sampling in the middle of their text, it seems quite possible that some readers will end up with a rather misleading impression of what confidence intervals are.
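For what it's worth, the repeated-sampling definition is easy to check numerically. A toy sketch (my own, purely illustrative numbers): draw many independent samples from the same population, form the usual 95% interval for the mean from each, and count how often the intervals cover the true mean - the "95%" is a property of that long-run frequency, not of any single realised interval.

import random
import statistics

random.seed(0)
true_mean, true_sd, n, trials = 10.0, 2.0, 30, 20_000
hits = 0

for _ in range(trials):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se     # normal approximation
    hits += lo <= true_mean <= hi

print(f"fraction of intervals covering the true mean: {hits / trials:.3f}")
# ~0.95 (slightly under, since 1.96 ignores the t-correction for n = 30)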
The first edition of the highly-regarded book by Wilks ("Statistical methods in the atmospheric sciences") also equated confidence and credible intervals (here), and said things like "H0 [the null hypothesis] is rejected as too unlikely to have been true" - despite the frequentist paradigm explicitly forbidding the attachment of probabilities to hypotheses. I was pleased to find that Prof Wilks quickly agreed with me that this was misleading, and stated that in fact the recently-published second edition (which I have not seen) does not contain the comment about credible intervals.
Perhaps most surprisingly, the mistake is committed by people even when they are railing against the limitations of standard frequentist hypothesis testing! Eg, in this comment, Nicholls recommends reporting confidence intervals:
"The reporting of confidence intervals would allow readers to address the question 'Given these data and the correlation calculated with them, what is the probability that H0 is true?'"

Oops.
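To spell out why that question cannot be answered from the frequentist output alone, here is a small sketch (my illustration, not Nicholls's): simulate studies in which H0 (zero effect) holds with some prior probability, test each at the 5% level, and look only at the studies that reject H0. The fraction of those in which H0 was nonetheless true is not 5% - it depends on the prior probability of H0 and on the power of the test, which is exactly the information a confidence interval or p-value does not supply.

import random

random.seed(0)
n, effect, trials = 25, 0.3, 50_000   # illustrative numbers only
rejected = rejected_but_h0_true = 0

for _ in range(trials):
    h0_true = random.random() < 0.5                 # assumed prior P(H0) = 0.5
    mu = 0.0 if h0_true else effect
    sample_mean = random.gauss(mu, 1.0 / n ** 0.5)  # known sd = 1
    z = sample_mean * n ** 0.5
    if abs(z) > 1.96:                               # "significant at the 5% level"
        rejected += 1
        rejected_but_h0_true += h0_true

print(f"P(H0 true | H0 rejected) ~= {rejected_but_h0_true / rejected:.3f}")
# not 0.05: the answer depends on the prior on H0 and on the power of the test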
So, how well does climate science come out of this? Well, the confusion seems to pop up just about everywhere that you see these words "likely" and "very likely" (eg throughout the IPCC TAR). The one exception would seem to be the climate sensitivity work, the bulk of which is explicitly Bayesian right from the start, with clearly stated priors. One could probably argue that many of the TAR judgements were based on experts carefully weighing up the evidence, but in many cases (especially the D&A chapter), it seems entirely routine to directly interpret a confidence interval (say, from a regression analysis) as a credible interval. I've seen an absolutely explicit occurrence of this in a recent D&A paper, co-authored by 2 prominent figures in the field.
I promised TCO I would say something about the Hockey Stick. Of course, it's the same story here. On the basis of regression coefficients, MBH (specifically in their 1999 GRL paper, which is repeated in the TAR) make a statement about how "likely" it is that the 1990s were warmer than the previous millennium. I can't help but be amused to note that with all the "auditing" and peer-review, including by professional statisticians, none of them seems to have noticed this little detail.
Of course an important question to consider is, how much does this all matter? And the answer is...it depends. In many cases, the answers you get probably won't turn out too different. There was some explicitly Bayesian estimation briefly mentioned in the D&A chapter of the TAR, and broadly speaking it seemed to give results that were similar to (in fact perhaps even stronger than) the more mainstream stuff. Moreover, there is a lot of slack in terms like "likely", so changing the probability might not invalidate such statements anyway. So I am not by any means suggesting that the TAR needs to be thrown out, and therefore maybe some people will claim this is all a fuss about nothing. However, I would argue that it is still surely a good thing to at least understand what is going on, so that people can consider how important an issue it is in each particular case. For instance, equating confidence intervals and credible intervals seems to assume (inter alia) the choice of a uniform prior: this decision can by no means be an automatic one, and equating it with "no prior" or "initial ignorance" is definitely wrong. Eg, no-one, not even the most rabid septic, has actually ever believed that CO2 is as likely to cool as warm the planet (at least not since Arrhenius), so assigning an equal prior probability to positive and negative effects would surely be hard to defend. Like the example of the "negative mass" apple, a confidence interval that covers a wide range does not mean that we think the parameter has a significant probability of taking an extreme value! At an absolute minimum, it certainly needs to be stated clearly that this choice of prior was made.
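To make the prior-dependence point concrete, here is a sketch with a conjugate Gaussian model (illustrative numbers only, nothing to do with any particular paper discussed above). With an essentially flat prior, the 95% credible interval coincides numerically with the usual 95% confidence interval; with an informative prior - say one reflecting the belief that the effect is very probably positive - the credible interval is different. Treating a confidence interval as a credible interval therefore amounts to quietly adopting the uniform prior.

def posterior_normal(prior_mean, prior_var, obs, obs_var):
    # Conjugate update: Gaussian prior times Gaussian likelihood gives a Gaussian posterior.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

obs, obs_sd = 1.0, 1.0   # a single noisy estimate of some effect (hypothetical numbers)
print(f"95% confidence interval:               ({obs - 1.96 * obs_sd:.2f}, {obs + 1.96 * obs_sd:.2f})")

# Near-flat prior (huge variance): the credible interval reproduces the CI.
m, v = posterior_normal(0.0, 1e6, obs, obs_sd ** 2)
print(f"95% credible interval, flat prior:     ({m - 1.96 * v ** 0.5:.2f}, {m + 1.96 * v ** 0.5:.2f})")

# Informative prior (mean 2, sd 1): the credible interval no longer matches.
m, v = posterior_normal(2.0, 1.0, obs, obs_sd ** 2)
print(f"95% credible interval, informed prior: ({m - 1.96 * v ** 0.5:.2f}, {m + 1.96 * v ** 0.5:.2f})")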
There are a number of other technical issues such as model error which are intimately related (eg, what does it mean to determine a parameter or regression coefficient to n decimal places, in a model that is an incomplete representation of the real world?). We can try a bit of hand-waving and claim it doesn't matter too much, but ultimately I think if we are going to try to make useful and credible estimates then there is little alternative but to try to deal with these things more coherently and consistently, even if it does mean more work for the statisticians!
6 comments:
Browsing the web for the low-down on Charles Lyell, I came across this paper:
Max Albert, "Should Bayesians Bet Where Frequentists Fear to Tread?" Philosophy of Science, 72 (October 2005) pp. 584-593.
Abstract:
Probability theory is important not least because of its relevance for decision making, which also means: its relevance for the single case. The frequency theory of probability on its own is irrelevant in the single case. However, Howson and Urbach argue that Bayesianism can solve the frequentist's problem: frequentist-probability information is relevant to Bayesians (although to nobody else). The present paper shows that Howson and Urbach's solution cannot work, and indeed that no Bayesian solution can work. There is no way to make frequentist probability relevant.
http://www.journals.uchicago.edu/PHILSCI/journal/issues/v72n4/720401/brief/720401.abstract.html
You see I was actually being kind to you by not accepting your bet :-)
Thanks for the link. I'm not sure that the conclusion is that startling though.
AIUI, the paper shows that frequentist (long-run) probability can never by itself dictate rational odds to a Bayesian making a finite number of bets, since (in the example presented) the Bayesian can always choose a prior that generates a particular initial set of outcomes that is not representative of the long term.
The example is instructive and reminds me of some issues in complexity theory and cryptography. The author uses the example of a pseudorandom number generator to generate coin tosses. Even though it has the correct long-run properties, an arbitrarily long initial series of H can be generated by initialising it appropriately. So if you are betting against an adversary who is using this machine, it may indeed be completely rational to refuse a series of bets on T even when offered "favourable" odds of better than 1:1. TBH, I don't find anything particularly unexpected or puzzling about this. Maybe no-one else had said it though!
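For anyone who wants to see the trick in miniature, here is a toy sketch (my own illustration of the paper's idea, not the author's code): a perfectly ordinary PRNG has the right long-run frequency of heads, yet an adversary can simply pick a seed whose first dozen tosses all come up heads, so refusing early bets on tails at "favourable" odds is entirely rational.

import random

def tosses(seed, n):
    rng = random.Random(seed)
    return ["H" if rng.random() < 0.5 else "T" for _ in range(n)]

# Brute-force a seed whose first 12 tosses are all heads (expect ~4096 tries).
run = 12
seed = next(s for s in range(10 ** 6) if all(t == "H" for t in tosses(s, run)))
print(f"seed {seed} starts: {''.join(tosses(seed, run))} ...")

# Run long enough from that same seed, the generator still looks fair overall.
long_run = tosses(seed, 100_000)
print(f"long-run frequency of heads: {long_run.count('H') / len(long_run):.3f}")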
What is "the Bayesian fallacy"? Was googling and came across this term. I have no stats training. Tone seemed to imply competing schools of thought and that Bayesians were like Albigensians fit only for the fire.
"the Bayesian fallacy"
Doesn't ring a bell, but there are various paradoxes and problems associated with Bayesian reasoning, like the Doomsday argument. Since the biggest issue is generally in the unthinking adoption of a uniform prior as representing "ignorance", you'll not be surprised to find that I don't find them very convincing!
Thanks for the nice write-up.
If you are not already familiar with it, you might be interested in reading Jaynes's comprehensive article on this topic - published in 1976!
Ooh, thanks for the great ref! I do think I've read it before but it's worth another look, and anyone else who finds this post is likely to find it interesting too.