Friday, January 15, 2010

Reliability of the IPCC AR4 (CMIP3) ensemble

So, our paper has now been accepted, and should be published in a week or two [update: here]. We think it poses a strong challenge to the "consensus" that has emerged in recent years.

If you are thinking this sounds like deja vu all over again, you'd be right. But the subject is a little different this time. Rather than estimates of climate sensitivity, this time we are talking about the interpretation of the "ensemble of opportunity" provided by the IPCC AR4 (formally CMIP3, but here I will use the popular name). Those who have been closely following this somewhat esoteric subject may have seen numerous assertions that the ensemble is likely biased, too narrow, doesn't cover an appropriate range of uncertainty etc etc. Thus, we should all be worried that there is a large probability that climate change may be even worse than the models imply.

Fortunately, it's all based on some analysis methods that are fundamentally flawed.

Although we'd been vaguely aware of this field for some time, the story really starts a couple of years ago at a workshop, when my attention was piqued by a slide which has subsequently been written up as part of a multi-author review paper and forms the motivation for Figure 1 in our paper. The slide presented an analysis of the multi-model ensemble, the main claim being that the multi-model ensemble mean did not converge sufficiently rapidly to the truth, as more models were added to it. Thus, the argument went, the models are not independent, their mean is biased, and we need to take some steps to correct for these problems when we try to interpret the ensemble.

The basic paradigm under which much of the ensemble analysis work in recent years has operated is based on the following superficially appealing logic: (1) all model builders are trying to simulate reality, (2) a priori, we don't know if their errors are positive or negative (with respect to any observables), (3) if we assume that the modellers are "independent", then the models should be scattered around in space with the truth lying at the ensemble mean. Like so:
where the truth is the red star and the models are the green dots.

However, this paradigm is completely implausible for a number of reasons. First, since we don't know the truth (in the widest sense) we have no possible way of generating models that scatter evenly about it. Second, this paradigm leads to absurd conclusions like a 90% "very likely" confidence interval for climate sensitivity of 2.7C - 3.4C, based on the sensitivities reported by the AR4 models (this comes from a simple combinatorial argument based on the number of models you expect to be higher and lower than the truth, if they lie independently and equiprobably on either side). Third, it implies that all we would need to do to get essentially perfect predictions is to build enough models and take the average, without any new theoretical insights or observations regarding the climate system.
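For anyone who wants to see the shape of that combinatorial argument, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the calculation behind the numbers quoted above: it assumes each model falls independently above or below the truth with probability 1/2, so the number of models below the truth is binomial, and it only returns which order statistics bound the interval. The function name and the example ensemble sizes are my own.

```python
from math import comb

def truth_centred_interval_rank(n_models, level=0.9):
    """Return (k, coverage) for the narrowest interval, running from the k-th
    lowest to the k-th highest model, that contains the truth with probability
    at least `level`, assuming each model lies independently above or below
    the truth with probability 1/2. Returns None if no such interval exists."""
    best = None
    for k in range(1, n_models // 2 + 1):
        # Probability that fewer than k models lie below the truth.
        tail = sum(comb(n_models, i) for i in range(k)) / 2 ** n_models
        # The truth is inside the interval unless one of the two tails occurs.
        coverage = 1 - 2 * tail
        if coverage >= level:
            best = (k, coverage)
        else:
            break
    return best

# Hypothetical ensemble size, for illustration only.
print(truth_centred_interval_rank(19))
```

For a hypothetical 19-member ensemble this picks the 6th lowest and 6th highest values, discarding the five lowest and five highest models entirely, which is why the resulting interval is so implausibly tight.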

Lastly, it is robustly refuted by simple analyses of the ensemble itself, as observations (of anything) are routinely found to lie some way from the ensemble mean. As has been demonstrated in several papers including the multi-author review paper mentioned above.

So you might think this paradigm should have been still-born and never caught on. However, people have persevered with it over a number of years, trying to fix it with various additional "bias" terms or ensemble inflation methods, and generally worrying that the ensemble isn't as good as they had hoped.

So along we came to have a look. Actually, although this issue had been sitting uneasily at the back of my mind for some time, we were finally prompted into looking into it properly earlier this year when Jules was asked to write something else concerning model evaluation.

It didn't take long to work out what was going on. As explained above, the truth-centred paradigm is theoretically implausible and observationally refuted. However, there is a much more widely-used (indeed all-but ubiquitous) way of interpreting ensembles, in which the ensemble members are assumed to be exchangeable with the truth, or statistically indistinguishable from it. So in contrast to the picture above, we might expect to see something like this:

Here the red isolines describe the distribution defined by the models. Note that the truth (red star) is not at the ensemble mean, but just some "typical" place in the ensemble range.

In contrast to the truth-centred paradigm, it is easy to understand how such an ensemble might arise - all we need to do is make a range of decisions when building models, that reflect our honestly-held (but uncertain) beliefs about how the climate system operates. So long as our uncertainty is commensurate with the actual errors of our models, there is no particular need to assume that our beliefs are unbiased in their mean, and indeed they will not be.
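The difference between the two pictures shows up directly in how the ensemble mean behaves as models are added. The sketch below is a toy illustration, not code from the paper: it assumes Gaussian scatter with a common standard deviation sigma, and all names are made up for the example. Under the truth-centred assumption the RMS error of the mean of n models falls like sigma/sqrt(n); under the exchangeable assumption the truth is just another draw, so the error levels off at sigma*sqrt(1 + 1/n) and never approaches zero.

```python
import math
import random

def rms_error_truth_centred(n, sigma=1.0):
    # Models scatter independently about the truth.
    return sigma / math.sqrt(n)

def rms_error_exchangeable(n, sigma=1.0):
    # Truth and models are all drawn from the same distribution.
    return sigma * math.sqrt(1.0 + 1.0 / n)

def simulate_rms_error(n, exchangeable, trials=20000, sigma=1.0, seed=0):
    """Monte Carlo check of the two closed forms above."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        truth = rng.gauss(0.0, sigma) if exchangeable else 0.0
        mean = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        total += (mean - truth) ** 2
    return math.sqrt(total / trials)

for n in (1, 4, 16):
    print(n, rms_error_truth_centred(n), rms_error_exchangeable(n))
```

So an ensemble mean that stops improving after the first handful of models is exactly what the exchangeable picture predicts, rather than a sign that something has gone wrong.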

I can't emphasise too strongly that this is the basic paradigm under which pretty well all ensemble methods have always operated, apart from one small little corner of climate science. It underpins the standard probabilistic interpretation, that if a proportion p% of the ensemble has property X, we say the probability of X is p%. A corollary is that if we apply this interpretation to the climate sensitivity estimates, we find a "very likely" confidence interval of 2.1C - 4.4C. Now I'm sure some would argue that this interval is too narrow, but I would say it is pretty reasonable, though this is somewhat fortuitous as with such a small sample the endpoints are determined entirely by the outliers. The implied 70% confidence interval of 2.3C - 4.3C is more robust, and would be hard to criticise. What is certainly clear is that these ranges are not completely horrible in the way that the one provided by the truth-centred interpretation was.
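The arithmetic behind intervals of this kind is simple enough to sketch. If the truth is exchangeable with n ensemble members, it is equally likely to fall in any of the n+1 gaps defined by the sorted members, so the chance that it lies between the j-th lowest and j-th highest member is (n+1-2j)/(n+1). The function below is my own illustration of that counting argument, and the ensemble sizes are hypothetical examples, not a statement of how many models went into the numbers above.

```python
from fractions import Fraction

def central_coverage(n_members, trim=0):
    """Probability that an exchangeable truth lies inside the ensemble range
    after discarding the `trim` lowest and `trim` highest members."""
    if trim < 0 or 2 * trim >= n_members:
        raise ValueError("trim must leave at least one member")
    # The truth occupies each of the n+1 rank positions with equal probability;
    # n - 1 - 2*trim of those positions fall inside the trimmed range.
    return Fraction(n_members - 1 - 2 * trim, n_members + 1)

# Hypothetical ensemble sizes, for illustration only.
print(central_coverage(19))          # full range of 19 members
print(central_coverage(19, trim=2))  # drop two from each end
```

With a hypothetical 19 members the full range has coverage 9/10 and trimming two from each end gives 7/10, which also shows why the endpoints of the wider interval depend entirely on the single most extreme model at each end.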

With this statistically interchangeable paradigm being central to all sorts of ensemble methods, notably including numerical weather prediction, it is no surprise that there is a veritable cornucopia of analysis tools already available to investigate and validate such ensembles. The most basic property that most people are interested in is "reliability", which means that an event occurs on p% of the occasions that it has been predicted to occur with probability p%. This is the meaning of "reliability" used in the subject line of this post and title of our paper. A standard test of reliability is that the rank histogram of the observations in the ensemble is uniform. So this is what we tested, using basically the same observations that others had used to show that the ensemble was inadequate.
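For readers who haven't met a rank histogram before, here is a minimal sketch of how one is built. It uses synthetic Gaussian data rather than anything from the paper, ignores ties and spatial correlation, and all the names and parameter values are assumptions chosen for illustration. For each case we count how many ensemble members fall below the observation, and then tally those ranks.

```python
import random

def rank_histogram(observations, ensembles):
    """Tally the rank of each observation within its own ensemble.
    With n members there are n + 1 possible ranks (0 .. n)."""
    n_members = len(ensembles[0])
    counts = [0] * (n_members + 1)
    for obs, members in zip(observations, ensembles):
        rank = sum(1 for m in members if m < obs)
        counts[rank] += 1
    return counts

def synthetic_histogram(ensemble_sd, n_members=9, n_cases=5000, seed=1):
    """Observations have unit spread; members have spread `ensemble_sd`."""
    rng = random.Random(seed)
    observations = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    ensembles = [[rng.gauss(0.0, ensemble_sd) for _ in range(n_members)]
                 for _ in range(n_cases)]
    return rank_histogram(observations, ensembles)

flat = synthetic_histogram(1.0)      # reliable ensemble: roughly uniform
domed = synthetic_histogram(2.0)     # ensemble too broad: peaked in the middle
u_shaped = synthetic_histogram(0.5)  # ensemble too narrow: piled up at the ends
print(flat, domed, u_shaped, sep="\n")
```

A dome means the observations sit in the middle of the ensemble more often than chance would suggest (spread too broad), a U-shape means they keep falling outside it (spread too narrow), and a slope to one side indicates a bias.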

And what we found is....

...the rank histograms (of surface temperature, precipitation and sea level pressure from top to bottom) aren't quite uniform, but they are pretty good. The non-uniformity is statistically significant (click on the pic for bigger, and the numbers are explained in the paper), but the magnitude of the errors in mean and bias are actually rather small. What's more, the ensemble spread is if anything too broad (as indicated by the domed histograms), rather than too narrow as has been frequently argued.

So our conclusion is that all this worry about the spread of the ensemble being too small is actually a mirage caused by a misinterpretation of how ensembles normally behave. Of course, we haven't actually shown that the future predictions are good, merely that the available evidence gives us no particular cause for concern. Quite the converse, in fact - the models sample a wide range of physical behaviours and the truth is, as far as we can tell, towards the centre of their spread. This supports the simple "one member one vote" analysis as a pretty reasonable starting point, but also allows for further developments such as skill-based weighting.

This paper seems particularly timely with the IPCC having an "Expert Meeting on Assessing and Combining Multi-Model Climate Projections" in a couple of weeks. In fact it was partly hearing about that meeting that prompted us to finish off the paper quickly last November, although we had, as I mentioned, been thinking about it for some time before then. I should give due praise to GRL, since I've grumbled about them in the past. This time, the paper raced through the system taking about 3 weeks from submission to acceptance - it might have been even quicker but the GRL web-site was borked for part of that. It is nice when things happen according to theory :-) Not forgetting the helpful part played by the reviewers too, who made some minor suggestions and were very enthusiastic overall.

Unfortunately, hoi polloi like Jules and myself are not allowed to appear in such rarefied company as the IPCC Expert Meeting - I did ask, with the backing of the Japanese Support Unit for the IPCC, but was refused. So we will just have to wait with bated breath to see what, if anything, the "IPCC Experts" make of it. While the list of invitees is very worthy, it is disappointing to see that so many of them are members of the same old cliques, with no fewer than 4 participants from the Hadley Centre, and three each from NCAR, CSIRO and PCMDI, and vast numbers of multiply co-authored papers linking many of the attendees together. Those 4 institutes alone provide almost a quarter of the scientists invited. Coincidentally (or not), staff from these institutes also filled 5 of the 7 places on the organising committee... Shame they couldn't find space for even one person from Japan's premier climate science institute.


admin said...

I'll be there - so I'll put in a word... ;-)


Jesús R. said...

Very interesting, I hope you are right! I'm not qualified to assess the merits of your work, how do you think your colleagues are receiving this approach to constrain the upper bound of climate sensitivity? Skill-based weighting really seems an obvious thing to do! (eg. Tamino has recently shown that CCCMA models seem to be overestimating 20th-century warming). Your view, Gavin, would also be very welcome.

How about having a chat with some attendee and convincing him of your view? :D So that he conveys your thoughts to the meeting... :) Go for Gavin! ;-))) :)

Thanks for informing. Cheers!

Anonymous said...

"Lastly, it is robustly refuted by simple analyses of the ensemble itself, as observations (of anything) are routinely found to lie some way from the ensemble mean. As has been demonstrated in several papers including the multi-author review paper mentioned above."

I don't immediately see why this robustly refutes the paradigm. The ensemble mean is an estimate of the forced climate change, and the observations are (as I understand it) the combination of the forced and unforced change, so there is no reason to expect the observations to lie close to the ensemble mean, even if the "standard" paradigm were valid. Am I missing something?

Martin Vermeer said...

Looks interesting... congratulations!

Some nits / questions:

Page X-3: Reliabilty -> Reliability

For example, if the models are indeed sampled from a distribution centered on the truth, then the biases of different models should have, on average, near-zero pair-wise correlations.

But isn't this a result from the correlation between the models in the distribution, not so much from its centering on the truth?

Or do you mean that the distribution for an added model conditional on already included models, is non-centered wrt the truth?

This could perhaps be stated clearer?

Tom C said...

"I can't emphasise too strongly that this is the basic paradigm under which pretty well all ensemble methods have always operated, apart from one small little corner of climate science."

Not trying to be difficult here but do you have any ideas why this is?

Jesús R. said...

"observations (of anything) are routinely found to lie some way from the ensemble mean. As has been demonstrated in several papers including the multi-author review paper mentioned above."

I don't know what those papers are (not even your multi-author review paper), could you please give us some reference?


crandles said...

>"If you are thinking this sounds like deja vu all over again"

You don't really specify what 'this' is. If 'this' is a narrower range in the distribution, the cause could well be attributed to your views on the narrowness of the range. That seems rather uninteresting.

However you appear to present similarities that go further than the narrowness of the range. You seem to indicate an avoidance of the usual/standard methods.

So if I create a scale of 0 to 10 where do you think such similarity can be attributed?

0 represents incompetence without the slightest hint of intention or even confirmation bias.

1 represents completely unintentional but some confirmation bias where people who become climate researchers tend to be people who consider the climate situation to be serious.

Greater than 1 are not completely unintentional. So 2 might represent some vague consideration of different methods but either drifted into what seemed easiest and gave sensible answer perhaps tinged with confirmation bias or was an honestly held belief.

10 represents deliberate seeking out of unusual/dubious methods and selecting to give alarmist uncertainty.

FWIW I wrote this before seeing Tom C comment. Same question or different? I am not very sure.

James Annan said...

Short comments first:

Thanks Gavin, if you are a good boy and do your recommended reading you'll find it listed there :-)

Jesus - I'm not really presenting this as a rigorous way to estimate sensitivity - we didn't put it in the paper, but thought it was interesting to note the implication.

gcc - I think you have just presented another reason why the truth-centred approach cannot work. However for the work described here, we are generally talking about climatological averages where natural variability is a very minor issue compared to the model biases.

Martin - the correlation being talked about is between the model errors, eg (m_i - O) and (m_j - O), where the m_i,j are models and O is the observations, so if the obs were at the ensemble mean, the average pairwise correlation would be zero.
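A toy simulation makes the point about error correlations concrete. This is only a sketch under simple assumptions (independent Gaussian scatter of unit variance at each grid point, with hypothetical names and sizes), not the analysis in the paper. If the obs sit at the centre of the distribution the errors m_i - O are just the independent model deviations, so their pairwise correlation averages zero; if the obs are instead one more draw from the same distribution, every error shares the same -O term and the expected correlation is 0.5.

```python
import random
from itertools import combinations

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_error_correlation(exchangeable, n_models=6, n_points=2000, seed=2):
    """Average pairwise correlation of the model errors (m_i - O)."""
    rng = random.Random(seed)
    errors = [[] for _ in range(n_models)]
    for _ in range(n_points):
        # Truth-centred: obs at the centre (zero). Exchangeable: obs is a draw.
        obs = rng.gauss(0.0, 1.0) if exchangeable else 0.0
        for model_errors in errors:
            model_errors.append(rng.gauss(0.0, 1.0) - obs)
    pairs = list(combinations(errors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

print(mean_error_correlation(exchangeable=False))  # close to 0
print(mean_error_correlation(exchangeable=True))   # close to 0.5
```

So positively correlated errors are what you should expect from a perfectly good exchangeable ensemble, and are not in themselves evidence that the models lack independence.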

Jesus again - the refs are in the paper which is free to read...I didn't want to pick on one or two individuals too openly. OK, it is Knutti et al, but it draws on previous work by various authors.

James Annan said...

Tom C and Chris, I firstly just mean the same sentence was a direct quote from the previous post :-) However, there does seem to be something more systematic going on.

In both this case and the previous one involving uniform priors, we seem to have climate scientists building some edifice around some intuitively appealing rhetoric which rapidly falls apart under critical analysis. In this case, I tried to trace back the origins, and say in the paper:

This truth-centred paradigm appears to have arisen as a post-hoc interpretation of the ad-hoc weighting procedure known as “Reliabilty Ensemble Averaging” or REA

It wouldn't be fair to exclude myself entirely from criticism, as (again, in both cases) I was aware of what people were saying, and even accepted it myself, before getting directly involved. OTOH it is worrying how quickly and easily we were able to knock holes in theories that many influential people had been using and accepting for some years.

It is not difficult to conclude that the IPCC process has played a part here, in focussing power in private committees and favouring consensus-forming over debate. Some people appear to be rewarded more for echoing the majority view rather than for actually coming up with anything new.

Wikipedia has an interesting page on groupthink which seems highly relevant.

Hence my comments in the last paragraph of the post.

EliRabett said...

If the effect is systematic looking at climate models is a bad tactic because there is only one realization. How about you do the same thing with local weather models. Lots of models, lots of realizations.

Ian said...

Very interesting! I hadn't thought about this before, but your description above makes sense. So, hypothetically, under what conditions would you have some confidence that models (climate-related or not) really are randomly scattered in some model-space, allowing inferences from an ensemble mean?

Deep Climate said...

Tutorial on rank histogram (using simple weather prediction example):

Martin Vermeer said...

James, thanks!

So, are (m_i - E(O)) and (m_j - E(O)) correlated? Your second pic seems to suggest so (ellipses). Or is this unintentional?

Deep Climate said...

Second time around, it all made more sense to me.

The x-axis is percentage of binned observations. 23 models = 24 bins = 1.67% each on average.

Wrt surface temps the model ensemble looks to have a bit of a cool bias though, right?

Sorry about posting the same rank histogram explanation you did. In my defence, the links don't stand out very much, so honestly I just missed it.

James Annan said...


Theoretically, it's just as I said, as long as our uncertainties are honestly reflected in the range of models produced, we should be ok. Many have argued that this is not the case and that the model range is too small, but I don't believe the evidence supports them.

Beyond the rank histogram, there are a range of other ways of investigating whether what we can observe of reality is interchangeable with the models, and we are playing around with some of them right now. I'm expecting to write a longer paper significantly extending this work in the next month or two (or four, you know how it goes).


As independent samples their average correlation is necessarily zero. I think you may be confusing this with the correlation of x and y (labelling the diagram in the obvious way).

DC, yes the obs are a bit hot, ie the models are cool. It's only about 0.5C on average though. We normalised the graphs to sum to 40 for reasons discussed in the paper - there are really 72x36 data points (5 degree grid) but with substantial spatial correlation.

Roger Jones said...


glad you've written this. All the papers/reports I have done using model-derived data in risk analyses since 1999 have used the assumption that the truth is more likely than not to be somewhere in there, but not truth centred.
By the time various uncertainties are combined, uniform or rather flat beta distributions of a rather wide scatter are fine. Actually, if you do this for regional climate, as we have found out in SE Australia, climate variability not represented in the consensus model set and probably not in any of the individual runs, can deliver a climate not in the sample.

Also, when climate policy uncertainties are added, the precision afforded by erroneous (hubristic?) assumptions becomes even less relevant. In most cases, policy does not need such high precision.

The application of probabilities needs to be mindful of what information is most needed to manage risk.

Deep Climate said...

Of course, I meant:

The y-axis is percentage of binned observations. 23 models = 24 bins = 1.67 each on average.

OK, there are 24 models = 25 bins (not 23 models). Normalizing to 40 d.f. gives an average score of 1.6. I'll get there.

I see that 0.5C is not a huge bias, given the spread. But the direction of the bias does argue against the meme put forth in some quarters that climate models have a warm bias in general.

guthrie said...

I'm afraid that going from 2.7 to 3.4 to 2.1 to 4.4C somehow doesn't make any difference to me, being a simple member of the public. Is it going to be 2.1 or 4.4? Either doesn't sound nice given the change a mere 0.5C makes to the climate I was used to.

How long until we see this crop up in the denialosphere?
(Not that they should get too much attention)

crandles said...

I noted that the 2.7 to 3.4 simply got called absurd and completely horrible. A range of 0.7C - could we be that sure? Obviously not even James thinks the range could be that narrow.

"Is it going to be 2.1 or 4.4? Either doesn't sound nice"

Remember this is climate sensitivity which is not directly observable/feelable. Even if CO2 does get to 560ppm it would be some time before full effects are felt. So probably not in our lifetime unless we get a big methane burp. Maybe 'either doesn't sound nice for our children' would be better?

In the absence of a methane burp, which end of the temp rise we get depends on policy as well as uncertainty in climate sensitivity. So AFAICS although transitory rate of change is more certain than climate sensitivity, the uncertainty of policy reintroduces the uncertainty. However if effects are severe, policy will likely curtail the worse end scenarios.

James Annan said...


Thanks. Of course with 23 models, even in the best case you will still see reality outside the model range one time in 12, which means a lot of these cases if you are looking at a lot of predictands - and it's well known that in short-term weather prediction mode, models tend to underestimate extreme cases. However multimodel ensembles are generally far better than single model (initial condition) ensembles in that respect.

Guthrie, I don't think any denialist will read the paper, still less understand it if they do. Besides, the main message is that the models are more reliable than others have argued, which is hardly a point they will want to promote...

(and, what Chris says)

Hank Roberts said...

You'll be shocked to hear that NASA has been caught averaging numbers; this arithmetrickery is apparently being explained in a full hour of television programming. I wonder who's sponsoring this one.

James Annan said...


That is just too funny...I'd expect to see that on the Onion or Daily Mash!

Rattus Norvegicus said...

I, unfortunately, went and read the blog mentioned in the presser. My favorite gripe: "why does GISS throw away data before 1880?!".

I hope I don't have to answer that question for anyone reading here.

Hank Roberts said...

PS, a cross-reference for anyone who finds this later and doesn't recognize what it's about, the bunk I mentioned is debunked in a paragraph here:

The relevant part begins:
"Update: Some comments on the John Coleman/KUSI/Joe D’Aleo/E. M. Smith accusations about the temperature records. Their claim is apparently that coastal station absolute temperatures are being used to estimate the current absolute temperatures in mountain regions and that the anomalies there are warm because the coast is warmer than the mountain. This is simply wrong. What is actually done is that temperature anomalies are calculated locally from local baselines, and these anomalies can be interpolated over quite large distances. This is perfectly fine and checkable...."

Deep Climate said...

I'll see your GISS averaging nonsense, and raise you with the latest from the National Post's Lawrence Solomon - "Google censors climategate search results". I kid you not.

Tom C said...

James -

Thanks for this writeup and for answering my question. I don't think that your answer gets to the heart of the matter, though.

I have very minimal training in statistics, yet I was able to detect this flawed approach as I read about ensemble averaging. How can it be that specialists with strings of letters after their names, who spend their lives doing this stuff, could have pursued this line of reasoning?

I think there is something else going on here, namely thinking that a model output has the same status as a measurement vis-a-vis the truth. How else can people talk about "random errors" in a model? Measurements have random errors, models are "always wrong but sometimes useful".

It's probably not an accident that the meteorologists, who have dealt with these concepts for decades, are more cautious about the model outputs.

James Annan said...

Well, putting on my cynical hat for a minute (did I ever take it off?) there does seem to be a bit of a culture in some parts of reliance on nice rhetoric rather than a sound mathematical basis. They write something that sounds nice and uncontentious, then try to translate it somehow into a mathematical form, and assume that the resulting methods and numbers therefore have some validity.

I certainly noted this in the promotion of uniform priors, which I believe almost everyone now accepts was a mistake (I still don't know what Myles Allen thinks).

The "averaging the constraints" thing is yet another example, from members of the same clique. Sounds good in words, but has absolutely no mathematical foundation and in fact directly contradicts the axioms of probability.

You might find it hard to believe, but at one point they even wrote that where their intuition ran contrary to the axioms of probability, they would rather use their intuition!

However, returning to the point of this post, I note that the paper is now published (post is updated with the link at the top) and I will wait to hear how the "experts" react to it at their meeting next week.

Deep Climate said...

For what it's worth, I left a comment at RC in Unforced Variations, with a pointer to the paper and this post, along with the abstract.

I added the following comment:
Definitely merits a head post update, and I would suggest, a guest post from Annan and Hargreaves.

Hank Roberts said...

So, um, I followed a post of Eli's about your pajamas and ended up reading this:

How ... um ... what ... um
Well? What's the overlap here?

James Annan said...

Thanks DC, I would suggest waiting to see what the IPCC Experts advise as "Best Practice" for the use of the ensemble. I think they may actually be aiming to complete this during the workshop next week, though that may be a stretch. Probably Gavin will want to say something about that anyway.

Hank, is a sort of hosting site for scientists - we know the guy who runs it, through the GENIE project. However in practice these things tend to be just another monkey to feed, or rather, neglect :-)

Deep Climate said...

Inline comment on RC on my pointer to the Reliability of CMIP3 Ensemble paper:

[Response: This will be topic of the week, next week.... - gavin]

So I guess he will discuss the IPCC Expert meeting and the paper.

James Annan said...


There is certainly a much broader discussion to be had than merely focussing on the contents of this one short paper.