Or, why Nic Lewis is wrong.
Long time no post, but I've been thinking recently about climate sensitivity (about which more soon) and was provoked into writing something by this post, in which Nic Lewis sings the praises of so-called "objective Bayesian" methods.
Firstly, I'd like to acknowledge that Nic has made a significant contribution to research on climate sensitivity, both through identifying a number of errors in the work of others (eg here, here and most recently here) and through his own contributions in the literature and elsewhere. Nevertheless, I think that what he writes about so-called "objective" priors and Bayesian methods is deeply misleading. No prior can encapsulate no knowledge, and underneath the use of these bold claims there is always a much more mealy-mouthed explanation in terms of a prior having "minimal" influence, and then you need to have a look at what "minimal" really means, and so on. Well, such a prior may or may not be a good thing, but it is certainly not what I understand "no information" to mean. I suggest that "automatic" is a less emotive term than "objective" and would be less likely to mislead people as to what is really going on. Nic is suggesting ways of automatically choosing a prior, which may or may not have useful properties.
[As a somewhat unrelated aside, it seems strange to me that the authors of the corrigendum here, concerning a detail of the method, do not also correct their erroneous claims concerning "ignorant" priors. It's one thing to let errors lie in earlier work - no-one goes back and corrects minor details routinely - but it is unfortunate that when actually writing a correction about something they state does not substantially affect their results, they didn't take the opportunity to also correct a horrible error that has seriously misled much of the climate science community and which continues to undermine much work in this area. I'm left with the uncomfortable conclusion that they still don't accept that this aspect of the work was actually in error, despite my paper which they are apparently trying to ignore rather than respond to. But I'm digressing.]
All this stuff about "objective priors" is just rhetoric - the term simply does not mean what a lay-person might expect (including a climate scientist not well-versed in statistical methodology). The posterior P(S|O) is equal to the (normalised) product of prior and likelihood - it makes no more sense to speak of a prior not influencing the posterior than it does to talk of the width of a rectangle not influencing its area (= width x height). Attempts to get round this by then footnoting a vaguer "minimal effect, relative to the data" are just shifting the pea around under the thimble.
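To make the rectangle analogy concrete, here's a minimal sketch in Python (toy numbers of my own choosing, not taken from anyone's analysis): the same likelihood combined with two different priors gives two visibly different posteriors.

```python
# Minimal sketch: same likelihood, two priors, two posteriors.
import numpy as np
from scipy import stats

S_grid = np.linspace(-10, 10, 2001)              # grid of candidate values of S
O = 1.5                                          # a single observation, O = S + e with e ~ N(0,1)
lik = stats.norm.pdf(O, loc=S_grid, scale=1.0)   # likelihood of O for each S

for label, prior in [("N(0,1) prior", stats.norm.pdf(S_grid, 0, 1)),
                     ("flat prior  ", np.ones_like(S_grid))]:
    post = prior * lik
    post /= post.sum()                           # normalise on the grid
    print(label, "posterior mean =", round(float((S_grid * post).sum()), 2))
# prints ~0.75 for the N(0,1) prior and ~1.5 for the flat prior
```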
In his blog post, Nic also extolls the virtue of probabilistic coverage as a way of evaluating methods. This initially sounds very attractive - the idea being that your 95% intervals should include reality 95% of the time (and similarly for other intervals). There is however a devil in the detail here, because such a probabilistic evaluation implies some sort of (infinitely) repeated sampling, and it's critical to consider what is being sampled, and how. If you consider only a perfect repetition in which both the unknown parameter(s) and the uncertain observational error(s) take precisely the same values, then any deterministic algorithm will return the same answer, so the coverage in this case will be either 100% or 0%! Instead of this, Nic considers repetition in which the parameter is fixed and the uncertain observations are repeated. Perfect coverage in this case sounds attractive, but it's trivial to think of examples where it is simply wrong, as I'll now demonstrate.
Let's assume Alice picks a parameter S (we'll consider her sampling distribution in a minute) and conceals it from Bob. Alice also samples an "error" e from the simple Gaussian N(0,1). Alice provides the sum O=S+e to Bob, who knows the sampling distribution for e. What should Bob infer about S? Frequentists have a simple answer that does not depend on any prior belief about S - their 95% confidence interval will be (O-2,O+2) (yes I'm approximating negligibly throughout the post). This has probabilistically perfect coverage if S is held fixed and e is repeatedly sampled. Note that even this approach, which basically every scientist and statistician in the world will agree is the correct answer to the situation as stated, does not have perfect coverage if instead e is held fixed and S is repeatedly sampled! In this case, coverage will be 100% or 0% (depending on whether |e| is less than 2), regardless of the sampling distribution of S. But never mind about that.
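A quick Monte Carlo check of that frequentist coverage claim (my own sketch, using 1.96 rather than the rounded 2):

```python
# Hold S fixed, resample the error e, and count how often (O-1.96, O+1.96) covers S.
import numpy as np

rng = np.random.default_rng(0)
S = 2.0                                  # any fixed value gives the same result
e = rng.standard_normal(100_000)
O = S + e
covered = (O - 1.96 < S) & (S < O + 1.96)
print(covered.mean())                    # ~0.95, whatever S we picked
```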
As for Bayesians, well they need a prior on S. One obvious choice is a uniform prior and this will basically give the same answer as the frequentist approach. But now let's consider the case that Alice picks S from the standard Normal N(0,1), and tells Bob that she is doing so. The frequentist interval still works here (i.e., ignoring this prior information about S), but Bayesian Bob can do "better", in the sense of generating a shorter interval. Using the prior N(0,1) - which I assert is the only prior anyone could reasonably use - his Bayesian posterior estimate for S is the Normal N(O/2,0.7) (standard deviation 0.7 = 1/sqrt(2)), giving a 95% probability interval of (O/2-1.4,O/2+1.4). It is easy to see that for a fixed S, and repeated observational errors e, Bob will systematically shrink his central estimates towards the prior mean 0, relative to the true value of S. Let's say S=2, then (over a set of repeated observations) Bob's posterior estimates will be centred on 1 (since the mean of all the samples of e is 0) and far more than 5% of his 95% intervals (including the roughly 21% of cases where e is more negative than -0.8) will fail to include the true value of S. Conversely, if S=0, then far too many of Bob's 95% intervals will include S. In particular, all cases where e lies in (-2.8,2.8) - which is about 99.5% of them - will generate posteriors that include 0. So coverage - or probability matching, as Nic calls it - varies from far too generous, when S is close to 0, to far too rare, for extreme values of S.
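Here's a short simulation of Bob's intervals for fixed S (again my own sketch), which reproduces those numbers:

```python
# Bob's interval is O/2 +/- 1.4; check its coverage for a fixed S over repeated e.
import numpy as np

rng = np.random.default_rng(1)
for S in (2.0, 0.0):
    e = rng.standard_normal(100_000)
    centre = (S + e) / 2                               # posterior mean O/2
    covered = np.abs(S - centre) < 1.4
    print(f"S = {S}: coverage of the 95% interval = {covered.mean():.3f}")
# prints ~0.79 for S=2 (i.e. ~21% misses) and ~0.995 for S=0
```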
I don't think that any rational Bayesian could possibly disagree with Bob's analysis here. I challenge Nic to present any other approach, based on "objective" priors or anything else, and defend it as a plausible alternative to the above. Or else, I hope he will accept that probability matching is simply not (always) a valid measure of performance. These Bayesian intervals are unambiguously and indisputably the correct answer in the situation as described, and yet they do not provide the correct coverage conditional on a fixed value for S.
Just to be absolutely clear in summarising this - I believe Bayesian Bob is providing the only acceptable answer given the information as provided in this situation. No rational person could support a different belief about S, and therefore any alternative algorithm or answer is simply wrong. Bob's method does not provide matching probabilities, for a fixed S and repeated observations. Nothing in this paragraph is open to debate.
Therefore, I conclude that matching probabilities (in this sense, i.e. repeated sampling of obs for a fixed parameter) is not an appropriate test or desirable condition in general. There may be cases where it's a good thing, but this would have to be argued for explicitly.
13 comments:
Kass and Wasserman have written a paper, The Selection of Prior Distributions by Formal Rules. That seems a rather good description of the approach Nic Lewis has taken, though even that with reservations.
Papers of Jewson, Rowlands, and Allen (2009, 2010) have also referred to objective priors. These papers contain sections where some derivation of the methods is presented. Going through that in detail shows that the priors are actually determined by
- the climate model used, and
- the assumption that prior distributions are uniform in the space defined by variables used for representing empirical data.
Neither of these is unique: other models lead to different priors from the same data, and making a nonlinear transformation of the empirical variables would also lead to different priors.
I am not that familiar with the term "objective prior". Objectivist Bayesianism I am familiar with, which just means that the interpretation of a probability is not a subjective belief, but a state of information. Thus you can have priors that incorporate information (i.e. an informative prior) without the analysis falling outside an objective Bayes framework. It would be easier if the term were used in that sense, rather than to suggest that an uninformative (or minimally informative) prior is in some way more "scientifically" objective than some other. It isn't: if you have information, it should be included in the analysis, and choosing not to is itself a subjective choice.
No prior encodes no information at all, because you are at least encoding the information that you know you don't know anything about the value of some parameter.
I am impressed that Radford Neal commented on what is a pretty obscure (from a statistical perspective) blog.
Thanks, I see that Radford Neal has made similar points but rather better; the comment is here. I was also going to point out that Nic's prior is nonsensical - I'd much rather trust the judgement of a scientist about plausible (prior) dates for a sample, than some automatic calculation that gives self-evidently ridiculous answers.
A significant problem with this approach is that it underweights the probability of surprises in just the same way as the overbroad Frame uniform prior does, especially in cases where whatever you are constructing the prior from is sparse.
Sorry, of course what was meant was in the same way as the Frame approach overweights the probability of surprises.
From Nic Lewis via email (anyone else having trouble commenting on blogger?):
===
James,
"Nic considers repetition in which the parameter is fixed and the uncertain observations are repeated."
No I don't. You've misread my article.
"id much rather trust the judgement of a scientist about plausible (prior) dates for a sample, than some automatic calculation"
So you think that use in OxCal of a flat prior over the whole real line represents such a judgement, and a valid one at that?
===
me:
As for the first part, I'll have another look at the post, but it is abundantly clear that Nic's approach is wholly unacceptable in some cases, about which I'll post again later (holiday here).
As for the second, I don't claim any particular expertise or experience in this area, but suspect that a uniform prior was probably selected as a somewhat lazy/naive but generally acceptable approximation.
The "uniform prior" is by no means the "obvious choice".
There is no obvious choice. This requires an advanced understanding of probability and statistics.
Jeffreys' prior is the only choice that is truly uninformative.
It is obviously popular with many scientists, and in many applications it is not badly wrong (but I make no particular judgement for the specific case of carbon dating; it seems to me that a vaguely informative prior would be more plausible and not too hard to select). As for "truly uninformative", sadly that is only true (at best) using some technical definition of "uninformative" that does not correspond to common English usage. Which is more or less where we came in...
In many cases the Jeffreys' prior is clearly very informative. The post of Nic Lewis at Climate Audit presents a perfect example of that as noted by Radford Neal and others in the discussion.
Jeffreys' prior is rule-based, but it's not uninformative.
The information content of Jeffreys' prior depends on the particular empirical method used, the quantities chosen to represent the empirical observations, and the methods used to convert empirical data into the final results of the analysis. There's absolutely no fundamental reason to believe that all that would result in an uninformative prior.
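As one concrete illustration of the dependence on the empirical method (the standard binomial versus negative-binomial example - my own sketch, not anything discussed in this thread): the two sampling schemes share the same likelihood kernel yet give different Jeffreys priors for the same parameter.

```python
# Jeffreys prior ~ sqrt(Fisher information); the Fisher information depends on
# the sampling scheme, so the same theta gets different "uninformative" priors.
import sympy as sp

theta, n, k = sp.symbols('theta n k', positive=True)
loglik = k*sp.log(theta) + (n - k)*sp.log(1 - theta)   # common likelihood kernel

def jeffreys(expectation):
    """sqrt of the expected information, given the expectation of the data."""
    info = -sp.diff(loglik, theta, 2)       # observed information
    return sp.simplify(sp.sqrt(info.subs(expectation)))

# Binomial sampling: n fixed, k random with E[k] = n*theta
print(jeffreys({k: n*theta}))               # proportional to theta^-1/2 (1-theta)^-1/2
# Negative binomial sampling: k fixed, n random with E[n] = k/theta
print(jeffreys({n: k/theta}))               # proportional to theta^-1 (1-theta)^-1/2
```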
At the risk of being deemed an irrational Bayesian, it seems that if Bob's estimates converge to a fixed number independent of the observations, he's not doing a very good job of updating.
Bob's first update is from prior S~N(0,1) with O[1]~N(S,1), to N(O[1]/2,.7). But subsequent updates should be towards the observations, not toward the original prior mean 0. So Bob's next update would be precision-weighted: N( (2*(O[1]/2) + O[2])/(2+1), 1/sqrt(3) ) = N( (O[1]+O[2])/3, 1/sqrt(3) ).
Bob's posterior approaches the true S after enough repetitions, with ~N((0+sum(O))/(n+1),1/sqrt(n+1)). Since all the variances are 1, this is exactly the same as augmenting a frequentist's data set with a point representing the prior, O={0,O[1],O[2]...O[n]}. There's always a residual shrinkage toward the prior, but for large n it's swamped by the data. If Alice picks 2, it approaches 2, and if Alice picks 0, it approaches 0.
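A quick sketch of that sequential updating (my own code, assuming the same unit variances as above):

```python
# Fold in observations O[i] = S + e[i] one at a time, starting from prior N(0,1).
import numpy as np

rng = np.random.default_rng(2)
S_true = 2.0
mean, prec = 0.0, 1.0                        # prior N(0,1): mean 0, precision 1
for i in range(100):
    O = S_true + rng.standard_normal()       # one new observation, error ~ N(0,1)
    mean = (prec * mean + O) / (prec + 1)    # precision-weighted average
    prec += 1                                # posterior precision = n + 1
print(mean, 1 / np.sqrt(prec))               # mean -> ~2, sd = 1/sqrt(101)
```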
There's still a problem with coverage as a metric, because it's essentially a smoothing procedure, so it's serially correlated in any one Bob/Alice experiment. Probably better to look at something less binary, like a P value.
Am I missing something?
Tom, I was trying to talk about the situation of repeated experiments where Bob gets one obs each time, not adding new obs to the existing set.
Ahh ... OK, that clears it up.
Actually, I think that raises a new objection ... S=0 or S=2 is not the same as S~N(0,1). As long as Alice picks S as advertised, the shrinkage in Bob's posterior should be OK over repeated trials.
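A companion check of that last point (my sketch again): when Alice really does draw a fresh S from N(0,1) on each trial, Bob's interval covers at the advertised rate.

```python
# Draw S ~ N(0,1) and e ~ N(0,1) each trial; Bob's interval is O/2 +/- 1.4.
import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal(100_000)             # Alice's parameter draws
O = S + rng.standard_normal(100_000)         # one observation per trial
covered = np.abs(S - O / 2) < 1.4
print(covered.mean())                        # ~0.95, as advertised
```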