Thursday, August 05, 2010

"IPCC Experts" New Clothes

You may recall not so long ago I blogged about our paper in which we argued that the (standard outside climate science) paradigm of a statistically indistinguishable ensemble - where reality is just another sample from the distribution - is a much more natural and plausible interpretation of the AR4 multi-model ensemble than the alternative "truth-centred" paradigm - where the models are assumed to be scattered around with reality lying exactly at the centre of their sampling distribution. The latter has no theoretical basis or practical support as far as I can tell; it appears to have been plucked out of thin air by a process of wishful thinking, and is strongly refuted by an analysis of the ensemble. But this post isn't really about that.

Immediately after that paper was published, the IPCC held a closed meeting which we were of course not permitted to attend. The purpose of the meeting was to generate a "best practice guidance paper" for the use of the multi-model ensemble. Jules predicted that our work would get misinterpreted somehow, but I thought our paper was fairly straightforward and hard to misunderstand. Well, I hadn't reckoned on the unique skills of the "IPCC Experts". Eventually this meeting report and summary appeared on their web site.

Regarding the interpretation of the multi-model ensemble, they say:
Alternatively, a method may assume:

b. that each of the members is considered to be ‘exchangeable’ with the other members and with the real system (e.g., Murphy et al., 2007; Perkins et al., 2007; Jackson et al., 2008; Annan and Hargreaves, 2010). In this case, observations are viewed as a single random draw from an imagined distribution of the space of all possible but equally credible climate models and all possible outcomes of Earth’s chaotic processes.

What? What is "the space of all possible but equally credible climate models" and what does this have to do with anything? Of the papers they cite, only ours actually mentions exchangeability and statistical indistinguishability, and what we wrote is that this means that "the truth is drawn from the same distribution as the ensemble members, and thus no statistical test can reliably distinguish one from the other". We also cited Toth et al 2003 (good book by famous NWP people) who wrote equivalently "the ensemble members and the verifying observation are mutually independent realizations of the same probability distribution".

Note that there is no reference to the "space of all possible models". All that matters is that the sampling distributions of models and truth are the same.
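To make the distinction concrete, here is a minimal sketch of the two paradigms using a toy Gaussian ensemble. Everything in it is an illustrative assumption rather than anything taken from the paper or the meeting report: the function names, the standard normal sampling distribution, and the ensemble size of 9. If truth is drawn from the same distribution as the members, its rank among them should be equally likely to take any value; if truth sits exactly at the centre of the sampling distribution, its rank piles up in the middle.

```python
import random


def rank_of_truth(members, truth):
    """Count how many ensemble members fall below the truth (0..len(members))."""
    return sum(m < truth for m in members)


def rank_histogram(paradigm, n_members=9, n_trials=20000, seed=0):
    """Simulate many toy ensembles and record where the truth ranks in each.

    paradigm == "indistinguishable": truth is drawn from the same N(0, 1)
        distribution as the members.
    paradigm == "truth_centred": truth lies exactly at the centre (0.0) of
        the members' sampling distribution.
    Returns the relative frequency of each rank.
    """
    rng = random.Random(seed)
    counts = [0] * (n_members + 1)
    for _ in range(n_trials):
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        if paradigm == "indistinguishable":
            truth = rng.gauss(0.0, 1.0)
        elif paradigm == "truth_centred":
            truth = 0.0
        else:
            raise ValueError(f"unknown paradigm: {paradigm}")
        counts[rank_of_truth(members, truth)] += 1
    return [c / n_trials for c in counts]


if __name__ == "__main__":
    for paradigm in ("indistinguishable", "truth_centred"):
        freqs = rank_histogram(paradigm)
        print(paradigm, [round(f, 3) for f in freqs])
```

Note that nothing in this sketch needs a "space of all possible models": the only ingredient is that members and truth share (or do not share) a sampling distribution.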

This may appear at first to be a rather pedantic and minor complaint. However, it doesn't take long to realise that the "space of all possible models" is a "colourless green idea", that is, a syntactically valid but completely meaningless phrase. This isn't just my assertion, it is agreed by all the previous authors who have used this terminology! (If you wish to disagree, feel free to explain in the comments what a "possible model" is, and how it can be distinguished from an impossible one. Good luck with that.)

In fact as far as we can tell this phrase has only ever been used to denigrate the use of the multi-model ensemble. The argument goes, that in order to understand how to use this ensemble, we have to first understand the "space of all possible models" from which they are sampled. This phrase is meaningless, therefore the use of the ensemble is theoretically ill-founded. Supporting quotes are appended below - quotes which many attendees of the meeting were well aware of, because they wrote them. Well, we don't mind people writing gibberish in their own papers, but we object strongly to them linking such nonsense to our work. Our analysis does not depend in any way on this meaningless concept, and to claim that it does (with the corollary that our analysis is philosophically ill-founded) is a flat-out lie.

In fact the multi-model ensemble can be very naturally interpreted as sampling our collective uncertainties about how best to represent the climate system. The question of reliability of the ensemble then simply amounts to asking whether these uncertainties are well-calibrated or not - which as we have shown, is an eminently testable hypothesis (at least in respect of current and historical data) and does not require anyone to "imagine" such bizarre and spurious constructions as the "space of all possible models".

We complained to the authors of this piece of nonsense, and they replied with the remarkable claim that despite being listed as the authors, they were not in fact responsible for the accuracy of anything they wrote, as they were merely reporting "the definition as determined and agreed by the attendees", and would not countenance any correction of this mistake. Yes, they really used those words I have placed in quotes. Apparently it didn't occur to any of these "experts" present that this concept of statistical indistinguishability was an established term of art that already had a perfectly adequate definition, and that this existing definition is the only one that has ever been presented in the context of climate science. Their decision to reinvent the definition of statistical indistinguishability apparently has the full support of the IPCC hierarchy. I'm utterly gobsmacked that they place their duty to defend this "consensus" of a private clique above their duty to ensure that this "consensus" is honest, accurate, and useful to potential readers, let alone providing a fair representation of the work of those who are prohibited from participation in this process. It's as if the WG2 authors had simply proclaimed that 2035 was the date the experts had agreed that all Himalayan glaciers would vanish, and that was the end of the matter.

We have various manuscripts at different stages of writing and review, and can probably correct this mistake somehow (assuming that reviewers allow us to dissent from the newly-established "consensus"), but it's unlikely that what we write will ever have the circulation and influence that the IPCC bully pulpit affords. And of course, it is pretty hard to proof our work against spurious criticism when these "experts" are prepared to simply pluck arbitrary nonsense out of thin air. It's a shame that no-one there actually stood up and said "But these words have no meaning, how can they be used in a definition?"

Some references to the "space of all possible models", which make the nonsensical nature of this phrase clear, and how it has been used to argue against the use of the multi-model ensemble:

Allen et al 2002:

"the distribution of all possible models is undefined"

Collins 2007:

"Is the collection of the world’s climate models an adequate sample of the space of all possible models (and, indeed, is it even possible to define such a space)?"

Murphy et al 2007:

"Specifically, it is not clear how to define a space of possible model configurations of which the MME members are a sample. This creates the need to make substantial assumptions in order to obtain probabilistic predictions from their results"

Stainforth et al 2007:

"The lack of any ability to produce useful model weights, and to even define the space of possible models, rules out the possibility of producing meaningful PDFs for future climate based simply on combining the results from multi-model or perturbed physics ensembles; or emulators thereof."


Nick Barnes said...

Never ask a mathematician or computer scientist questions like this. It's pretty easy to define a meaningful superset of "the space of all possible models" (e.g. all legal Fortran programs which, when given input Xi, produce output in the set Yi, for all i in some index set I). Constrain by suitable choices of Xi and Yi. I would agree that this definition isn't very useful.

James Annan said...

Hmmm...generating the superset is clear enough, but it seems to me you may have rather glossed over the crux of how to winnow the superset down to "possible models", given that no-one admits to having any way of defining what these are!

(And if one takes the alternative approach that *everything* is a possible model, then that doesn't explain how we managed to sample 25 times and each time miraculously find something that actually looks like a model.)

Dikran Marsupial said...

I think this is one of the most worrying things I have ever read about the IPCC process. It is a little reminiscent of Abrams pointing out to Monckton that he had misinterpreted the findings of a number of papers. In both cases, it damages reputations if such misinterpretations are propagated once they have been pointed out.

The truth centered interpretation is just bizarre though, it is hard to see how it came to be. The best possible model would be an infinite set of parallel Earths in alternate universes, which were similar enough to have the same forcings, but different initial states for the atmosphere (them butterflies and their pesky fluttering wings! ;o). In that case the real Earth truly is statistically exchangeable, and you wouldn't expect the mean of the ensemble to exactly match the observed climate on our Earth; only that the observed climate would lie within the spread of the ensemble.

Basically the truth centered interpretation is saying that the chaotic component of the climate we observe on our Earth just happens to be exactly average (over all of the ways it could possibly have worked out). Which of course is a rather large assumption!

admin said...

You are protesting far too much. As far as I could tell there was not much opposition (if any) to your framing of this issue and Claudia took on the task of explaining things very well.


(PS can you go back to allowing name/email commenting?)

Martin Vermeer said...

Hmmm, isn't this just another case of trying to force a bayesian idea into a frequentist mental frame? Isn't it un-operational rather than wrong?

Steve Bloom said...

Well, it does look like they were *trying* to agree with you. But now what do they plan to do with this observation?

Also, does "equally credible" mean that they've taken the first step down the road of decertifying (in some fashion) the known crap models? That seems good if so.

Finally, would Jules agree that she has now been shown to be the better Bayesian? :)

jules said...

You seem to be missing the point. The problem is what has been written down is wrong. We aren't suggesting that people were opposed at the meeting (obviously we have no idea what happened at the meeting), but are unhappy they don't want to correct a simple mistake in what they have written. After all if they had defined pi to be 3, even though it was the agreed definition of the meeting, you presumably wouldn't think it was acceptable for the untruth to be propagated further.

James Annan said...


I think I understand how the truth-centred thing arose - a combination of convenience, sloppy thinking, and a lack of critical evaluation - and I can't even be too harsh on people for not dropping it immediately we pointed out its silliness, as it takes time for new ideas to sink in.

OTOH, there is no sign in the meeting report that anyone actually considered whether the truth-centred paradigm had any value (it is presented as the case "a" that precedes the quoted bit in my post). If the same people uncritically regurgitate the literature, and then self-referentially use this review as an excuse to keep on doing the same old same old, that will be disappointing.


If Claudia or anyone else explained things in terms of a space of all possible models, then I'm afraid they did not do it very well, but instead misunderstood an important element of our work. If this imaginary space of possible models is actually considered to be a real obstacle to the use of the MME (as several authors have clearly indicated) then our interpretation would represent a (minor) breakthrough in eliminating this problem.

Anonymous said...

How long before your comments on "consensus" are misinterpreted (intentionally or otherwise)?


ac said...

James, when you write

"the truth is drawn from the same distribution as the ensemble members"

what do you mean by distribution, and what does it mean to 'draw' from it?

How is the problem of defining a sampling distribution for the models and truth any more well-posed than defining a 'space of possible credible models' from which to draw? How can we possibly demonstrate that the models chosen to form an ensemble are independent samples from either the 'distribution' or the 'space'? (Where I'm going with this is that I don't think independence is demonstrable or even important, and certainly not required for the MME to be valid).

James Annan said...

Collectively, we have uncertainty about how best to model the climate system - people use different grids, numerical schemes, but most importantly, things like cloud parameterisations vary widely because we don't agree on the best way of doing it. This is the distribution that the models can (IMO) be naturally viewed as sampling.

The question is then whether reality "looks like" a typical sample or whether it stands out as particularly unusual. According to many standard model-data comparisons (such as other people have used and published), it looks pretty typical to us.

I agree that independence is a somewhat different issue, and we did not talk about it in the Reliability paper. But coincidentally I'm just revising a manuscript on that specific topic. More later (perhaps months...)

ac said...

I think the phrase 'space of possible but equally credible models' means more or less the same thing as 'distribution of models that describes our collective uncertainties about how best to represent the climate system'.

Both suggest taking a finite number of plausible models from the much larger set/distribution/space of possible plausible models. The key word here is 'credible'.

James Annan said...

Well, it seems to me that they are trying to imbue the distribution with some objective status whereas I'm presenting it in overtly subjective terms. But ultimately, this is not really relevant from the POV of defining or understanding what statistically indistinguishable means. Someone else could come up with a *different* distribution of models, that was *also* statistically indistinguishable - or even just think of a different way of interpreting the existing ensemble. The concept is not tied in any way to this interpretation of the CMIP3 models.

Also, when I complained to the authors, they did not argue that their definition was equivalent; in fact they specifically acknowledged that mine was different.

ac said...

Yes I think I see your point now. The randomness or otherwise of the sample of models has nothing to do with the statistical indistinguishability of model output.

PolyisTCOandbanned said...

Can you comment on the recent MMH trend paper? I worry looking at the very tight error bars for the model trends that what they did was some sort of definitional game. Probably wrong. And whether wrong or not, that they didn't explain the difference in concept of what they're doing versus what Santer did.

But I don't know for sure. And you are good with logic and trends and trend difference significance.

so please comment.

James Annan said...

Happy to take a look, but I don't know what paper you are talking about.

James Annan said...

Ah, I've found it...

it looks like they are comparing the obs to the multi-model mean and finding that they differ.

Well....duh. This is the same old "ensemble is not truth-centred" thing I've been banging on about for some time now. Of course any set of obs will differ from the ensemble mean, and if you measure with enough precision, the difference will be significant. All of the *models* differ from the ensemble mean too!

It's a completely bogus approach to model validation.
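A toy sketch of why, under assumptions that are mine for illustration only (Gaussian trends, 25 ensemble members, a crude threshold of 2 standard units, and hypothetical function names; this is not a reconstruction of the method in the paper under discussion). Here the obs are drawn from exactly the same distribution as the models, so a sensible test should reject only rarely. Testing against the ensemble mean with the standard error of the mean rejects far more often than that, while testing against the ensemble spread behaves as one would hope.

```python
import math
import random
import statistics


def rejects_mean_test(members, obs, threshold=2.0):
    """Truth-centred style test: is obs within the standard error of the ensemble mean?"""
    std_err = statistics.stdev(members) / math.sqrt(len(members))
    return abs(obs - statistics.mean(members)) / std_err > threshold


def rejects_spread_test(members, obs, threshold=2.0):
    """Indistinguishable style test: is obs within the spread of the ensemble?"""
    n = len(members)
    # sqrt(1 + 1/n) accounts for sampling error in the ensemble mean itself.
    spread = statistics.stdev(members) * math.sqrt(1.0 + 1.0 / n)
    return abs(obs - statistics.mean(members)) / spread > threshold


def rejection_rates(n_members=25, n_trials=5000, seed=1):
    """Fraction of trials each test rejects when obs is just another sample."""
    rng = random.Random(seed)
    mean_rejections = 0
    spread_rejections = 0
    for _ in range(n_trials):
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        obs = rng.gauss(0.0, 1.0)  # same distribution as the members
        mean_rejections += rejects_mean_test(members, obs)
        spread_rejections += rejects_spread_test(members, obs)
    return mean_rejections / n_trials, spread_rejections / n_trials


if __name__ == "__main__":
    mean_rate, spread_rate = rejection_rates()
    print(f"mean-based test rejects:   {mean_rate:.2f}")
    print(f"spread-based test rejects: {spread_rate:.2f}")
```

Swap any single ensemble member in for `obs` and the mean-based test would "reject" most of the models as well.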

(Having said that, these obs are only tenuously consistent with the models, as we have shown in the infamous Heartland presentation. But this new bogus comparison adds nothing to the debate.)

Should blog this if I have time, travelling soon though.

PolyisTCOandbanned said...

That's what I feared. We've seen this movie before. What a bunch of duffers. Or deliberately misleading?

Even if they favor their interpretation, don't they need to highlight that it's this interpretation that drives the high error bars?

P.s. It just flabbergasts me that they think it's meaningful in any normal physical sense (a bayesian bettor say) to think that IPCC estimates a century's worth of change on the surface at something like x plus or minus 50%, but they say "the models" calculate a trend with plus or minus less than 10% (granted in atmosphere) at a time with LESS forcing and over a period with much less time (so chance plays more of a role).

James Annan said...

Well, I've put it in a new blog post now. Thanks for the tip.

It's amusing that there has been so much hot air spent over the precise details of the statistical testing (obviously I'm aware of the Douglass/Santer back-story) while no-one bothered to question whether the fundamental premise was meaningful in the first place. Red faces are deserved all round!

crandles said...

Probably a silly question, but I have a different interpretation of what the quoted authors are trying to say. Could they be trying to say something like:

The distribution of all possible models is undefined in the sense that we cannot say variable x is y times more important than variable z. Hence we cannot arrive at an objective measure/probability of which models are most like reality nor of distances in parameter space. This doesn't mean an ensemble is completely useless.

If that is part of what they are trying to say, then it seems that part is sensible even if I think it is badly expressed.

I agree what is written seems wrong and shouldn't be propagated.

I also agree with the "New Clothes" description of "distribution of all possible models". Whether this is because it is so near meaningless that it is useless or actually arriving at colourless green idea level of meaninglessness seems fairly unimportant; the important thing is its complete uselessness.

James Annan said...

Well, I don't think that being sufficiently unclear that no-one can make out quite what it means is really a sufficient defence :-)

I've also seen quite explicitly incorrect interpretations in terms of "the distribution of truth". OK, this is perhaps a slightly different error, but all related by the problem of woolly thinking.