There was I, minding my own business as usual, when I chanced upon the
Talk page for Confidence interval - Wikipedia. There's some odd stuff going on there...
I freely admit that I was confused about Bayesian and frequentist probability a few years ago. In fact I wince whenever I re-read a particular statement I made in a paper published as recently as 2005 - no, I'm not telling you where it is. In my defence, a lot of stuff I had read concerning probability in climate science (and beyond) is at best misleading and sometimes badly wrong - and hey, the referees didn't pick up on it either! But really, given some time to think and some clear descriptions (of which there are plenty on the web) it is really not that difficult to get a handle on it.
A confidence interval is a frequentist concept, based on repeated sampling from a distribution. Perhaps it is best illustrated with a simple example. Say X is an unknown but fixed parameter (eg the speed of light, or the amount of money in my wallet), and we can sample x_i = X + e_i, where e_i is a random draw from the distribution N(0,1) - that is, x_i is an observation of X with that given uncertainty. Then there is a 25% probability that e_i will lie in the interval [-0.32,0.32] and therefore 25% of the intervals [x_i-0.32,x_i+0.32] will contain the unknown X. Or to put it another way, P(x_i-0.32 lt X lt x_i+0.32)=25% (and incidentally, I hate that Blogger can't even cope with a less than sign without swallowing text).
Note that nothing in the above depends on anything at all about the value of X. The statements are true whatever value X takes, and are just as true if we actually know X as if we don't.
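To make the repeated-sampling claim concrete, here is a rough simulation sketch. The function name `coverage`, the trial count and the seed are my own illustrative choices, not part of the argument above. It draws many observations x_i = X + e_i for several arbitrary values of X and counts how often the interval [x_i-0.32,x_i+0.32] contains X.

```python
import math
import random


def coverage(true_x, half_width=0.32, n_trials=100_000, seed=0):
    """Fraction of intervals [x_i - h, x_i + h] that contain true_x."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        x_i = true_x + rng.gauss(0.0, 1.0)  # one noisy observation of X
        if x_i - half_width < true_x < x_i + half_width:
            hits += 1
    return hits / n_trials


# Exact probability that a standard normal draw lies in [-0.32, 0.32].
exact = math.erf(0.32 / math.sqrt(2))
print(f"exact coverage: {exact:.4f}")

# The values of X below are arbitrary; the coverage should not depend on them.
for true_x in (0.0, 25.0, 299792.458):
    print(f"X = {true_x}: simulated coverage = {coverage(true_x):.4f}")
```

The simulated fractions should all sit close to the exact value of roughly 25%, whichever X is used, which is the sense in which the statement does not depend on X.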
The confusion comes in once we have a specific observation x_i = 25.55 (say) and construct the appropriate 25% CI [25.23,25.87]. Does it follow that [25.23,25.87] contains X with probability 25%? Well, apparently some people on Wikipedia who call themselves professional statisticians (including a university lecturer) think it does. And there are some apparently authoritative references (listed on that page) which are sufficiently vague and/or poorly worded that such an idea is perhaps excusable at first. But what is the repeated sample here for which the 25% statistic applies? We originally considered repeatedly drawing the x_i from their sampling distribution and creating the appropriate CIs. 25% of these CIs will contain X, but they will have different endpoints. If we only keep the x_i which happen to take the value 25.55, then all the resulting CIs will be the same [25.23,25.87], but (obviously) either all of them will contain X, or none of them will! So neither of these approaches can help to define P(25.23 lt X lt 25.87) in a nontrivial frequentist sense.
In fact in order for it to make sense to talk of P(25.23 lt X lt 25.87) we have to consider X in some probabilistic way (since the other values in that expression are just constants). If X is some real-world parameter like the speed of light, that requires a Bayesian interpretation of probability as a degree of belief. Effectively, by considering the range of different width confidence intervals, we are making a statement of the type P(X|x_i=25.55) (where this is now a distribution for X). The probability axioms tell us that
P(X|x_i=25.55) = P(x_i=25.55|X) P(X) / P(x_i=25.55)
(which is Bayes' Theorem of course) and you can see that on the right hand side we have P(X), which is a prior distribution for X. [As for the other terms: the likelihood P(x_i=25.55|X) is trivial to calculate, as we have already said that x_i is an observation of X with Gaussian uncertainty, and the denominator P(x_i=25.55) is a normalisation constant that makes the probabilities integrate to 1.] So not only do we need to consider X probabilistically, but its prior distribution will affect the posterior P(X|x_i=25.55). Therefore, before one has started to consider that, it is clearly untenable to simply assert that P(25.23 lt X lt 25.87) = 25%. If I told you that X was an integer uniformly chosen from [0,100], you would immediately assign zero probability to it being in that short confidence interval! (That's not a wholly nonsensical example - eg I could place a bag-full of precise 1g masses on a mass balance that has error given by the standard normal distribution, and ask you how many were in the bag.) And probably you would think it was most likely to be 25 or 26, and less likely to be more distant values. But maybe I thought of an integer, and squared it...in which case the answer is almost certainly 25. Maybe I thought of an integer and cubed it... In all these cases, I'm describing an experiment where the prior has a direct intuitive frequentist interpretation (we can repeat the experiment with different X sampled from its prior). That's not so clear (to put it mildly) when X is a physical parameter like the speed of light, or climate sensitivity.
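For the integer examples the posterior can be written down directly, since the prior is just a finite list of equally likely candidates. The sketch below is only illustrative: the helper names `posterior` and `prob_in_interval` are mine, and I am assuming that the squared and cubed cases mean "all squares (or cubes) of non-negative integers that fall in [0,100], each equally likely", which is one possible reading of the text.

```python
import math


def posterior(candidates, x_obs, sigma=1.0):
    """Posterior over equally likely candidate values of X, given one
    observation x_obs = X + e with e drawn from N(0, sigma^2)."""
    # Uniform prior over the candidates, so posterior is proportional to likelihood.
    weights = [math.exp(-0.5 * ((x_obs - c) / sigma) ** 2) for c in candidates]
    total = sum(weights)
    return {c: w / total for c, w in zip(candidates, weights)}


def prob_in_interval(post, lo, hi):
    """Posterior probability that X lies strictly between lo and hi."""
    return sum(p for c, p in post.items() if lo < c < hi)


x_obs = 25.55
# Assumed candidate sets (my reading of the three thought experiments).
priors = {
    "integer": list(range(0, 101)),
    "squared integer": [n * n for n in range(0, 11)],
    "cubed integer": [n ** 3 for n in range(0, 5)],
}

for name, candidates in priors.items():
    post = posterior(candidates, x_obs)
    best = max(post, key=post.get)  # most probable candidate
    inside = prob_in_interval(post, 25.23, 25.87)
    print(f"{name}: most probable X = {best} (p = {post[best]:.4f}), "
          f"P(25.23 lt X lt 25.87 | data) = {inside:.4f}")
```

Under each of these priors the interval [25.23,25.87] contains no candidate value at all, so its posterior probability is zero, while the most probable value of X shifts with the prior even though the data are identical.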
But anyway, the important point is, the answer necessarily depends on the prior. And once you've observed the data and calculated the end-points of your confidence interval, your selected confidence level no longer automatically gives you the probability that your particular interval contains the parameter in question. The probability P(x_i-0.32 lt X lt x_i+0.32) is fundamentally different from P(25.23 lt X lt 25.87) - the former has a straightforward frequency interpretation irrespective of anything we know about X, but the latter requires a Bayesian approach to probability, and a prior for X (and will vary depending on what prior is used).
The way people routinely come unstuck is that for simple examples, those two probabilities actually can be numerically the same, if we use a uniform prior for X. Moreover, the Bayesian version (probability of X given the data) is what people actually want in practical applications, and so the statement routinely gets turned round in people's heads. But there are less trivial examples where this equivalence comes badly unstuck, and of course there are also numerous cases where a uniform prior is hardly reasonable in the first place. [In fact I would argue that a uniform prior is rarely reasonable (eg at a minimum, the real world is physically bounded in various ways, and many parameters are defined to be non-negative), but sometimes the results are fairly insensitive to a wide range of choices.]
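A small numerical sketch of that coincidence, and of how it can fail, might look like the following. Everything here is an illustrative assumption on my part: the function name `posterior_prob`, the crude grid integration, and in particular the second case, where I suppose X is known to be non-negative and the observation happens to be 0.2 (a made-up value, not one from the example above).

```python
import math


def posterior_prob(x_obs, lo, hi, prior, n=20_001, span=10.0):
    """Approximate P(lo < X < hi | x_obs) by a Riemann sum on a grid,
    for one observation x_obs = X + e with e drawn from N(0, 1)."""
    grid_lo = x_obs - span
    step = 2.0 * span / (n - 1)
    inside = total = 0.0
    for k in range(n):
        x = grid_lo + k * step
        w = prior(x) * math.exp(-0.5 * (x_obs - x) ** 2)  # prior times likelihood
        total += w
        if lo < x < hi:
            inside += w
    return inside / total


def flat(x):
    """Uniform prior, at least over the range the grid covers."""
    return 1.0


def non_negative(x):
    """Uniform prior restricted to X >= 0 (hypothetical constraint)."""
    return 1.0 if x >= 0.0 else 0.0


# Uniform prior: the posterior probability of the 25% CI is also about 25%.
p_flat = posterior_prob(25.55, 25.23, 25.87, flat)
print(f"flat prior, x_i = 25.55: {p_flat:.3f}")

# Same 25% CI recipe near a physical bound: the posterior probability differs.
p_bounded = posterior_prob(0.2, 0.2 - 0.32, 0.2 + 0.32, non_negative)
print(f"non-negative prior, x_i = 0.2: {p_bounded:.3f}")
```

With the flat prior the two numbers agree, as described above. Once the prior rules out negative values and the observation lands near the boundary, the same interval recipe still has 25% coverage in the repeated-sampling sense, but the posterior probability for the particular interval comes out noticeably higher.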
Fortunately a number of people who do seem to know what they are talking about have weighed in on the Wikipedia page...