Saturday, May 03, 2008

Train wreck on Wikipedia: Confidence interval

There was I, minding my own business as usual, when I chanced upon the Talk page for Confidence interval - Wikipedia. There's some odd stuff going on there...

I freely admit that I was confused about Bayesian and frequentist probability a few years ago. In fact I wince whenever I re-read a particular statement I made in a paper published as recently as 2005 - no, I'm not telling you where it is. In my defence, a lot of stuff I had read concerning probability in climate science (and beyond) is at best misleading and sometimes badly wrong - and hey, the referees didn't pick up on it either! But really, given some time to think and some clear descriptions (of which there are plenty on the web) it is really not that difficult to get a handle on it.

A confidence interval is a frequentist concept, based on repeated sampling from a distribution. Perhaps it is best illustrated with a simple example. Say X is an unknown but fixed parameter (eg the speed of light, or amount of money in my wallet), and we can sample xi = X+ei where ei is a random draw from the distribution N(0,1) - that is, xi is an observation of X with that given uncertainty. Then there is a 25% probability that ei will lie in the interval [-0.32,0.32] and therefore 25% of the intervals [xi-0.32,xi+0.32] will contain the unknown X. Or to put it another way, P(xi-0.32 lt X lt xi+0.32)=25% (and incidentally, I hate that Blogger can't even cope with a less than sign without swallowing text).

Note that nothing in the above depends on anything at all about the value of X. The statements are true whatever value X takes, and are just as true if we actually know X as if we don't.
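For anyone who prefers to check this numerically, here is a minimal simulation sketch of the repeated-sampling claim. The function name coverage, the trial count and the three trial values of X are my own illustrative choices, not anything from the Wikipedia discussion; the only inputs taken from the example above are the N(0,1) error and the half-width 0.32.

```python
import math
import random

def coverage(true_x, half_width=0.32, trials=100_000, seed=0):
    """Fraction of intervals [xi - hw, xi + hw] that contain true_x."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One observation of X with standard normal error.
        xi = true_x + rng.gauss(0.0, 1.0)
        if xi - half_width < true_x < xi + half_width:
            hits += 1
    return hits / trials

# Exact value of P(|e| < 0.32) for e ~ N(0,1), for comparison.
exact = math.erf(0.32 / math.sqrt(2.0))
print("exact:", round(exact, 4))

# The simulated coverage should be close to the exact value
# whatever X happens to be (these three values are arbitrary).
for true_x in (0.0, 25.55, 299792.458):
    print(true_x, coverage(true_x))
```

The coverage comes out at roughly a quarter for every choice of true_x, which is the whole point: the frequency statement is about the procedure, not about any particular value of X.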

The confusion comes in once we have a specific observation xi = 25.55 (say) and construct the appropriate 25% CI [25.23,25.87]. Does it follow that [25.23,25.87] contains X with probability 25%? Well, apparently some people on Wikipedia who call themselves professional statisticians (including a university lecturer) think it does. And there are some apparently authoritative references (listed on that page) which are sufficiently vague and/or poorly worded that such an idea is perhaps excusable at first. But what is the repeated sample here for which the 25% statistic applies? We originally considered repeatedly drawing the xi from their sampling distribution and creating the appropriate CIs. 25% of these CIs will contain X, but they will have different endpoints. If we only keep the xi which happen to take the value 25.55, then all the resulting CIs will be the same [25.23,25.87], but (obviously) either all of them will contain X, or none of them will! So neither of these approaches can help to define P(25.23 lt X lt 25.87) in a nontrivial frequentist sense.

In fact in order for it to make sense to talk of P(25.23 lt X lt 25.87) we have to consider X in some probabilistic way (since the other values in that expression are just constants). If X is some real-world parameter like the speed of light, that requires a Bayesian interpretation of probability as a degree of belief. Effectively, by considering the range of different width confidence intervals, we are making a statement of the type P(X|xi=25.55) (where this is now a distribution for X). The probability axioms tell us that

P(X|xi=25.55)= P(xi=25.55|X)P(X)/P(xi=25.55)

(which is Bayes' Theorem of course) and you can see that on the right hand side we have P(X), which is a prior distribution for X. [As for the other terms; the likelihood P(xi=25.55|X) is trivial to calculate, as we have already said that xi is an observation of X with Gaussian uncertainty, and the denominator P(xi=25.55) is a normalisation constant that makes the probabilities integrate to 1.] So not only do we need to consider X probabilistically, but its prior distribution will affect the posterior P(X|xi=25.55). Therefore, before one has started to consider that, it is clearly untenable to simply assert that P(25.23 lt X lt 25.87) = 25%. If I told you that X was an integer uniformly chosen from [0,100], you would immediately assign zero probability to it being in that short confidence interval! (That's not a wholly nonsensical example - eg I could place a bag-full of precise 1g masses on a mass balance that has error given by the standard normal distribution, and ask you how many were in the bag.) And probably you would think it was most likely to be 25 or 26, and less likely to be more distant values. But maybe I thought of an integer, and squared it, in which case the answer is almost certainly 25. Maybe I thought of an integer and cubed it... In all these cases, I'm describing an experiment where the prior has a direct intuitive frequentist interpretation (we can repeat the experiment with different X sampled from its prior). That's not so clear (to put it mildly) when X is a physical parameter like the speed of light, or climate sensitivity.
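The integer examples are easy to work through by applying Bayes' Theorem directly on a discrete set of candidate values. What follows is only a sketch under assumptions of my own: the helper name posterior is hypothetical, and for the squared case I assume the underlying integer was chosen uniformly from 0 to 10 so that X stays within [0,100].

```python
import math

def posterior(candidates, xi, sigma=1.0):
    """Posterior over equally likely candidate values of X, given one
    observation xi = X + e with e ~ N(0, sigma^2)."""
    # Gaussian likelihood P(xi | X = x); the uniform prior is a constant
    # factor that cancels when we normalise.
    weights = {x: math.exp(-0.5 * ((xi - x) / sigma) ** 2) for x in candidates}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def prob_in_interval(post, lo, hi):
    """Posterior probability that lo < X < hi."""
    return sum(p for x, p in post.items() if lo < x < hi)

xi = 25.55

# Prior 1: X is an integer uniformly chosen from [0, 100].
integers = posterior(range(0, 101), xi)

# Prior 2 (assumed): X = n*n with n uniform on 0..10.
squares = posterior([n * n for n in range(0, 11)], xi)

print("P(25.23 < X < 25.87), integer prior:", prob_in_interval(integers, 25.23, 25.87))
print("P(X = 25), P(X = 26), integer prior:", integers[25], integers[26])
print("P(X = 25), squared prior:", squares[25])
```

Under the integer prior the interval [25.23,25.87] gets probability zero, with 25 and 26 each taking roughly a third of the posterior mass; under the squared prior essentially all the mass lands on 25. Same observation, same confidence interval, quite different answers.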

But anyway, the important point is, the answer necessarily depends on the prior. And once you've observed the data and calculated the end-points of your confidence interval, your selected confidence level no longer automatically gives you the probability that your particular interval contains the parameter in question. That predicate P(xi-0.32 lt X lt xi+0.32) is fundamentally different from P(25.23 lt X lt 25.87) - the former has a straightforward frequency interpretation irrespective of anything we know about X, but the latter requires a Bayesian approach to probability, and a prior for X (and will vary depending on what prior is used).

The way people routinely come unstuck is that for simple examples, those two probabilities actually can be numerically the same, if we use a uniform prior for X. Moreover, the Bayesian version (probability of X given the data) is what people actually want in practical applications, and so the statement routinely gets turned round in people's heads. But there are less trivial examples where this equivalence comes badly unstuck, and of course there are also numerous cases where a uniform prior is hardly reasonable in the first place. [In fact I would argue that a uniform prior is rarely reasonable (eg at a minimum, the real world is physically bounded in various ways, and many parameters are defined so as to be non-negative), but sometimes the results are fairly insensitive to a wide range of choices.]
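To see the numerical coincidence, and how easily it breaks, here is a short sketch for the Gaussian example used throughout. With a flat prior on X the posterior is N(xi,1); with a Gaussian prior the posterior is again Gaussian by the standard conjugate update. The specific informative prior N(25, 0.5) is purely my own illustrative assumption, as are the function names.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_prob(mean, sd, lo, hi):
    """P(lo < X < hi) for X ~ N(mean, sd^2)."""
    return normal_cdf((hi - mean) / sd) - normal_cdf((lo - mean) / sd)

def gaussian_posterior(xi, prior_mean, prior_sd, obs_sd=1.0):
    """Posterior mean and sd for X given one observation xi, using the
    conjugate normal-normal update (precisions add)."""
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * xi) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

xi, lo, hi = 25.55, 25.23, 25.87

# Flat prior: the posterior is N(xi, 1), so the answer matches the
# 25% confidence level.
flat = interval_prob(xi, 1.0, lo, hi)

# Hypothetical informative prior N(25, 0.5): the answer changes.
mean, sd = gaussian_posterior(xi, prior_mean=25.0, prior_sd=0.5)
informative = interval_prob(mean, sd, lo, hi)

print("flat prior:       ", round(flat, 3))
print("informative prior:", round(informative, 3))
```

The flat-prior answer reproduces the 25% figure, which is exactly why the two statements get conflated; the informative prior gives a noticeably different probability for the very same interval.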

Fortunately a number of people who do seem to know what they are talking about have weighed in on the Wikipedia page...


Luboš Motl said...

It seems that I mostly agree with you. One must always view the variable as uncertain, with a stochastic distribution. Bayesian and frequentist interpretations only differ in the philosophy of how this distribution is reconstructed.

More importantly, various "theorems" that one can move the role of the boundaries of the confidence interval and switch the measured variable with the original center of the interval - without changing the confidence level etc. - are clearly incorrect. Integrals of the normal distribution are tough and do not satisfy these identities.

My feeling is, nevertheless, that the Wikipedia article is more coherent and meaningful than the comments of the average participant of the Talk debate there. Fortunately, the people who want the article to be "corrected" in a certain way - even though they don't understand the ways themselves - are unable to write a viable text to the article which is how Wikipedia microscopically preserves its high enough quality.

P. Lewis said...


Use SGML/HTML/XML entities:
< > ≥ ≤

These can be obtained by placing lt, gt, ge and le between "&" and ";" (no quotes)

∫ ≠ :)

Google something like HTML character codes for a full list

James Annan said...

Ah, I tried "&" next to the "lt" but forgot the closing ";" which may explain things...thanks. Still a pain compared to just typing the key next to >...


Yoram Gat said...

Hi James,

I am a statistician, and have gone through several discussions of this kind (taking place mainly in the comments section of Deltoid in the context of the Lancet studies of mortality in Iraq).

I was considering writing an online tutorial on the subject. If, however, there already exists such a tutorial I would not bother. Can you cite some of the "clear descriptions" that you referred to?

James Annan said...

Hi Yoram,

I remember reading your comments on the "p-rep" idea.

I think a clear exposition that tackles this confusion head-on would be of great value. I found this useful, and also this and this are interesting, but I'm not sure that either of them focus directly enough on the issue of frequentist confidence intervals versus bayesian credible intervals. I've tried blogging about it here in the past, as you may have noticed if you have visited before.

Yoram Gat said...

Hi James,

Thanks for the links.

The first three documents you link to are unsatisfactory in my opinion. They are examples of a sophomoric genre, the Bayes advocacy literature.

Your own post that you linked to does a much better job at addressing the confidence interval issue. It does however leave some room for additional comments, so I may yet embark on my own attempt.

Maybe a separate article on the absurdities of Bayes advocacy (not the same, of course, as Bayes methodology - which often is useful) is also in order.

James Annan said...

Hi Yoram,

I'm sure you are being too gentle on me but thanks anyway :-)

I would be interested in reading what you come up with, so please let me know when it's done.