So, our paper has now been accepted, and should be published in a week or two [update: here].
here]. We think it poses a strong challenge to the "consensus" that has emerged in recent years.
If you are thinking this sounds like deja vu all over again, you'd be
right. But the subject is a little different this time. Rather than estimates of climate sensitivity, this time we are talking about the interpretation of the "ensemble of opportunity" provided by the IPCC AR4 (formally known as CMIP3, but here I will use the popular name). Those who have been closely following this somewhat esoteric subject may have seen numerous assertions that the ensemble is likely biased, too narrow, doesn't cover an appropriate range of uncertainty, etc etc. Thus, we should all be worried that there is a large probability that climate change may be even worse than the models imply.
Fortunately, it's all based on some analysis methods that are fundamentally flawed.
Although we'd been vaguely aware of this field for some time, the story really starts a couple of years ago at a
workshop, when my interest was piqued by a slide which has subsequently been written up as part of a multi-author review paper and forms the motivation for Figure 1 in our paper. The slide presented an analysis of the multi-model ensemble, the main claim being that the multi-model ensemble mean did not converge sufficiently rapidly to the truth as more models were added to it. Thus, the argument went, the models are not independent, their mean is biased, and we need to take some steps to correct for these problems when we try to interpret the ensemble.
The basic paradigm under which much of the ensemble analysis work in recent years has operated is based on the following superficially appealing logic: (1) all model builders are trying to simulate reality, (2) a priori, we don't know whether their errors are positive or negative (with respect to any observables), and (3) if we assume that the modellers are "independent", then the models should be scattered evenly around the truth, which therefore lies at the ensemble mean. Like so:
where the truth is the red star and the models are the green dots.
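Just to make the convergence claim concrete, here is a quick toy simulation (entirely made-up numbers, nothing to do with the real model output): if the models really did scatter independently about the truth, the error of the ensemble mean would shrink roughly like 1/sqrt(N) as models are added, which is precisely the behaviour the slide was checking for.

```python
# Toy sketch of the truth-centred paradigm's prediction (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
truth = 0.0      # hypothetical "true" value of some observable
sigma = 1.0      # assumed spread of model errors about the truth
n_trials = 2000  # synthetic ensembles generated per ensemble size

for n_models in (2, 5, 10, 20, 50):
    # each trial: n_models independent models scattered around the truth
    models = rng.normal(truth, sigma, size=(n_trials, n_models))
    rmse_of_mean = np.sqrt(np.mean((models.mean(axis=1) - truth) ** 2))
    print(f"N={n_models:2d}  RMSE of ensemble mean = {rmse_of_mean:.3f}  "
          f"(1/sqrt(N) prediction = {sigma / np.sqrt(n_models):.3f})")
```

The claim in the slide was that the real ensemble mean does not home in on the observations anywhere near this quickly.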
However, this paradigm is completely implausible for a number of reasons. First, since we don't know the truth (in the widest sense), we have no possible way of generating models that scatter evenly about it. Second, this paradigm leads to absurd conclusions like a 90% "very likely" confidence interval for climate sensitivity of 2.7C - 3.4C, based on the sensitivities reported by the AR4 models (this comes from a simple combinatorial argument about how many models you would expect to lie above and below the truth, if each falls independently and equiprobably on either side; a sketch of the calculation follows below). Third, it implies that all we would need to do to get essentially perfect predictions is to build enough models and take the average, without any new theoretical insights or observations regarding the climate system.
Lastly, it is robustly refuted by simple analyses of the ensemble itself: observations (of anything) are routinely found to lie some way from the ensemble mean, as has been demonstrated in several papers, including the multi-author review paper mentioned above.
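As promised, here is a back-of-envelope version of that combinatorial argument (I'm assuming N = 20 models purely for illustration; the actual calculation uses the real list of AR4 sensitivities, which I won't reproduce here). If each model falls independently and equiprobably above or below the truth, then the number of models below the truth is Binomial(N, 1/2), and ~90% coverage is already achieved by the few models bracketing the median - which is how you end up with something as narrow as 2.7C - 3.4C.

```python
# Back-of-envelope combinatorics behind the truth-centred interval (illustrative N).
from math import comb

N = 20  # assumed number of models reporting a climate sensitivity

def coverage_between_order_stats(k):
    """P(truth lies between the k-th lowest and k-th highest model), assuming each
    model falls independently and equiprobably above or below the truth."""
    # the number of models below the truth is Binomial(N, 1/2); the truth is above
    # the k-th lowest model iff at least k models lie below it, and below the k-th
    # highest iff at most N - k do
    pmf = [comb(N, j) / 2**N for j in range(N + 1)]
    return sum(pmf[k:N - k + 1])

for k in range(4, 9):
    print(f"between the {k}-th lowest and {k}-th highest of {N} models: "
          f"coverage = {coverage_between_order_stats(k):.2f}")
# with N = 20, ~90% coverage is reached between roughly the 7th lowest and 7th
# highest models, i.e. just the handful of values straddling the median
```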
So you might think this paradigm would have been stillborn and never caught on. However, people have persevered with it over a number of years, trying to fix it with various additional "bias" terms or ensemble inflation methods, and generally worrying that the ensemble isn't as good as they had hoped.
So along we came to have a look. Actually, although this issue had been sitting uneasily at the back of my mind for some time, we were finally prompted to look into it properly earlier this year, when Jules was asked to write something else concerning model evaluation.
It didn't take long to work out what was going on. As explained above, the truth-centred paradigm is theoretically implausible and observationally refuted. However, there is a much more widely-used (indeed all-but ubiquitous) way of interpreting ensembles, in which the ensemble members are assumed to be exchangeable with the truth, or statistically indistinguishable from it. So in contrast to the picture above, we might expect to see something like this:
Here the red isolines describe the distribution defined by the models. Note that the truth (red star) is not at the ensemble mean, but just some "typical" place in the ensemble range.
In contrast to the truth-centred paradigm, it is easy to understand how such an ensemble might arise - all we need to do is make a range of decisions when building models that reflect our honestly-held (but uncertain) beliefs about how the climate system operates. So long as our uncertainty is commensurate with the actual errors of our models, there is no particular need to assume that our beliefs are unbiased in their mean, and indeed they will not be.
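For concreteness, here is a toy version of that story (arbitrary numbers, intended only to illustrate the paradigm, not to represent any real quantity): if the truth and the models are all draws from the same honestly-held belief distribution, then the truth is just a typical member of the ensemble, and piling on more models does not drive the ensemble mean towards it.

```python
# Toy sketch of the statistically-indistinguishable (exchangeable) paradigm.
import numpy as np

rng = np.random.default_rng(1)
belief_mean, belief_sd = 3.0, 0.8  # hypothetical shared-but-uncertain belief
n_trials = 5000

for n_models in (5, 20, 100):
    truth = rng.normal(belief_mean, belief_sd, size=n_trials)
    models = rng.normal(belief_mean, belief_sd, size=(n_trials, n_models))
    rmse_of_mean = np.sqrt(np.mean((models.mean(axis=1) - truth) ** 2))
    print(f"N={n_models:3d}  RMSE of ensemble mean vs truth = {rmse_of_mean:.2f}")
# the error of the ensemble mean no longer shrinks towards zero as models are
# added; it levels off at the irreducible spread of our shared beliefs (belief_sd)
```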
I can't emphasise too strongly that this is the basic paradigm under which pretty well all ensemble methods have always operated, apart from one small corner of climate science. It underpins the standard probabilistic interpretation: if a proportion p% of the ensemble has property X, we say the probability of X is p%. A corollary is that if we apply this interpretation to the climate sensitivity estimates, we find a "very likely" confidence interval of 2.1C - 4.4C. Now I'm sure some would argue that this interval is too narrow, but I would say it is pretty reasonable, though this is somewhat fortuitous, as with such a small sample the endpoints are determined entirely by the outliers. The implied 70% confidence interval of 2.3C - 4.3C is more robust, and would be hard to criticise. What is certainly clear is that these ranges are not completely horrible in the way that the one provided by the truth-centred interpretation was.
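For anyone wondering where coverages of that sort come from under this interpretation, here is my own back-of-envelope reasoning (not necessarily the precise calculation behind the numbers quoted above, and N = 20 is again just an illustrative choice): if the truth is statistically indistinguishable from the models, it is equally likely to fall into any of the N+1 slots defined by the N ranked models, so the full model range covers it roughly 90% of the time and a lightly trimmed range roughly 70%.

```python
# Coverage of order-statistic ranges when the truth is exchangeable with the models.
N = 20  # illustrative number of models with a reported sensitivity

for k in (1, 2, 3):
    # the truth falls in any of the N + 1 slots with equal probability, and there
    # are N + 1 - 2k slots between the k-th lowest and k-th highest model
    coverage = (N + 1 - 2 * k) / (N + 1)
    print(f"from the {k}-th lowest to the {k}-th highest model: coverage = {coverage:.0%}")
```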
With this statistically indistinguishable paradigm being central to all sorts of ensemble methods, notably including numerical weather prediction, it is no surprise that there is a veritable cornucopia of analysis tools already available to investigate and validate such ensembles. The most basic property that most people are interested in is "reliability", which means that an event occurs on p% of the occasions on which it has been predicted to occur with probability p%. This is the meaning of "reliability" used in the subject line of this post and the title of our paper. A standard test of reliability is that the
rank histogram of the observations in the ensemble is uniform. So this is what we tested, using basically the same observations that others had used to show that the ensemble was inadequate.
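For those who like nuts and bolts, here is a stripped-down sketch of such a rank-histogram check on synthetic data (this is just an illustration of the general idea; the analysis in the paper handles the real gridded observations and the significance testing rather more carefully):

```python
# Simplified rank-histogram reliability check on synthetic data.
import numpy as np
from scipy.stats import chisquare

def rank_histogram(obs, ensemble):
    """obs: array (n_points,); ensemble: array (n_points, n_members).
    Returns the histogram of the observation's rank (0..n_members)."""
    ranks = np.sum(ensemble < obs[:, None], axis=1)  # members below the obs at each point
    return np.bincount(ranks, minlength=ensemble.shape[1] + 1)

# synthetic test case: observations drawn from the same distribution as the
# ensemble members, so the rank histogram should be statistically flat
rng = np.random.default_rng(2)
n_points, n_members = 1000, 24
ensemble = rng.normal(size=(n_points, n_members))
obs = rng.normal(size=n_points)

counts = rank_histogram(obs, ensemble)
stat, pval = chisquare(counts)  # H0: all n_members + 1 ranks are equally likely
print(counts)
print(f"chi-squared = {stat:.1f}, p = {pval:.2f}")
# a real analysis also has to allow for spatial correlation between grid points,
# which this toy chi-squared test ignores
```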
And what we found is....
...the rank histograms (of surface temperature, precipitation and sea level pressure, from top to bottom) aren't quite uniform, but they are pretty good. The non-uniformity is statistically significant (click on the pic for a bigger version; the numbers are explained in the paper), but the magnitude of the errors in mean and bias is actually rather small. What's more, the ensemble spread is, if anything, too broad (as indicated by the domed histograms), rather than too narrow as has been frequently argued.
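To see why the domed shape points to an over-broad ensemble (again a toy with synthetic numbers, not the real fields): when the members are scattered more widely than the observations typically sit relative to them, the observations pile up in the middle ranks, whereas a too-narrow ensemble pushes them out to the extreme ranks and gives a U shape.

```python
# Toy illustration: rank-histogram shape vs ensemble spread.
import numpy as np

rng = np.random.default_rng(3)
n_points, n_members = 20000, 24
obs = rng.normal(size=n_points)  # "observations" with unit spread

for label, spread in (("spread too broad  (expect domed)   ", 2.0),
                      ("spread too narrow (expect U-shaped)", 0.5)):
    ensemble = rng.normal(scale=spread, size=(n_points, n_members))
    ranks = np.sum(ensemble < obs[:, None], axis=1)
    counts = np.bincount(ranks, minlength=n_members + 1)
    print(label, counts)
```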
So our conclusion is that all this worry about the spread of the ensemble being too small is actually a mirage caused by a misinterpretation of how ensembles normally behave. Of course, we haven't actually shown that the future predictions are good, merely that the available evidence gives us no particular cause for concern. Quite the converse, in fact - the models sample a wide range of physical behaviours and the truth is, as far as we can tell, towards the centre of their spread. This supports the simple "one member one vote" analysis as a pretty reasonable starting point, but also allows for further developments such as skill-based weighting.
This paper seems particularly timely with the IPCC having an "
Expert Meeting on Assessing and Combining Multi-Model Climate Projections" in a couple of weeks. In fact it was partly hearing about that meeting that prompted us to finish off the paper quickly last November, although we had, as I mentioned, been thinking about it for some time before then. I should give due praise to GRL, since I've grumbled about them in the past. This time, the paper raced through the system taking about 3 weeks from submission to acceptance - it might have been even quicker but the GRL web-site was borked for part of that. It is nice when things happen according to theory :-) Not forgetting the helpful part played by the reviewers too, who made some minor suggestions and were very enthusiastic overall.
Unfortunately, hoi polloi like Jules and myself are not allowed to appear in such rarefied company as the IPCC Expert Meeting - I did ask, with the backing of the Japanese Support Unit for the IPCC, but was refused. So we will just have to wait with bated breath to see what, if anything, the "IPCC Experts" make of it. While the list of invitees is very worthy, it is disappointing to see that so many of them are members of the same old cliques, with no fewer than four participants from the Hadley Centre, three each from NCAR, CSIRO and PCMDI, and vast numbers of multiply co-authored papers linking many of the attendees together. Those four institutes alone provide almost a quarter of the scientists invited. Coincidentally (or not), staff from these institutes also filled five of the seven places on the organising committee... Shame they couldn't find space for even one person from Japan's premier climate science institute.