This CPDN paper doesn't seem to have attracted much comment, perhaps because the results aren't actually very far off what the IPCC already said (just a touch higher). But Chris (and Carrick) commented on it down here so I think it is worth a post.

It's the results from a large ensemble of transient simulations running from the mid-20th to the mid-21st century, analysed to produce a "likely" range of warming by 2050.

Here is the main result:

In the figure, the vertical dotted lines demarcate their "likely" range, and the horizontal line is the threshold for goodness of fit (such that only the results below this line actually contribute to the final answer). The grey triangles represent models that are thrown out due to large radiative imbalance.

I am puzzled by a few aspects of this research. Firstly, on a somewhat philosophical point, I don't have much of a feel for what "likelihood profiling" is or how/why/if it works, and that's even after having obtained the book that they cite on the method. The authors are quite emphatic about not adopting a Bayesian interpretation of probability as a degree of belief, so the results are presented as a confidence interval (remember, this is not the same thing as a credible interval). Therefore, I don't really think the comparison with the IPCC "likely" range is meaningful, since the latter is surely intended as a Bayesian credible interval. Whatever this method does, it certainly does not generate an interval that anyone can credibly believe in!

Secondly, on a more practical point, it seems a bit fishy to use the range of results achieved by 2050, relative to 1961-90, without accounting for the fact that almost all of their models have already over-estimated the warming by 2010, many by quite a large margin (albeit an acceptable level according to their statistical test of model performance). The point is, given that we currently enjoy 0.5C of warming relative to the baseline, reaching 3C by 2050 implies an additional warming of 2.5C over the next 40 years. However, as far as I can see none of the models in their sample warms by this much. Certainly the two highest values in their sample - which are the only ones that lie outside the IPCC range, and which can be clearly identified in both the panels of the figure above - were already far too warm by 2010, by about 0.3-0.4C. So although they present a warming of 3C by 2050 as the upper bound of their "likely" range, none of their models actually warmed over the next 40 years by as much as the real world would have to do to reach this level.

Finally, on a fundamental point about the viability of the method, the authors clearly state (in the SI) that they "assume that our sample of ensemble members is sufficient to represent a continuum (i.e. infinite number)". They also use a Gaussian statistical model of natural variability in their statistical method (which is entirely standard and non-controversial, I should point out - if anything, optimistic in its lack of long tails). Their "likely" range is defined as the extremal values from their ensemble of acceptable models. This seems to imply that as the ensemble size grows, the range will also grow without limit. (Most people would of course use a quantile range, and not have this problem.) So I don't understand how this method can work at all in this sort of application where there is a formally unbounded (albeit probabilistically small) component of internal variability. In a mere 10^23 samples or so, their bounds would have been as wide as ± 10 sigma of the natural variability alone - which, based on the left hand panel of the figure, would have been rather wider than what they actually found.

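As a rough check on that 10^23 figure, here is a minimal sketch of the arithmetic, assuming nothing more than a standard Gaussian for the internal variability. The function names are made up for illustration and have nothing to do with the paper's own code.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal deviate Z."""
    return math.erfc(k / math.sqrt(2))


def samples_for_one_exceedance(k):
    """Ensemble size at which about one member beyond +/- k sigma is expected."""
    return 1.0 / two_sided_tail(k)


# Print the ensemble size needed before a k-sigma excursion is expected.
for k in (2, 4, 6, 8, 10):
    print(f"{k:2d} sigma: ~{samples_for_one_exceedance(k):.1e} samples")
```

For 10 sigma the required ensemble size comes out at somewhat under 10^23, which is where the "10^23 samples or so" comes from. The point is only that the number is finite, so an extremal range keeps widening as the ensemble grows.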
## 18 comments:

D'oh it would help if I read what the grey triangles were.

10^23 is a rather large increase in the sample size. But, improve the quality control from ~1.15 to ~0.9 and the range changes to about 76% of the range they decided upon. (But of course they wouldn't decide on the result they wanted then set the quality control to achieve that result.)

On the contrary, 10^23 is actually very very small indeed compared to an infinite ensemble :-)

Even ignoring internal variability, I don't know how they can make any statements about having (nearly) found the extrema of the forced response in their region of parameter space. With a Monte Carlo approach, this is obvious and easy (even 100 samples probably covers the 5-95% range of the underlying continuum distribution). But with no sampling distribution, only a multivariate space, there could be a very small region with very high response and their method simply won't find it.
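The "100 samples" remark can be checked with a short calculation. This is a sketch under the assumption of independent draws from some continuous sampling distribution. The small-region mass `eps` in the second function is a purely hypothetical number, since the ensemble itself has no such distribution defined.

```python
def prob_extremes_cover(n, lower=0.05, upper=0.95):
    """P(min of n draws <= lower quantile AND max >= upper quantile)."""
    miss_low = (1 - lower) ** n        # no draw falls below the lower quantile
    miss_high = upper ** n             # no draw falls above the upper quantile
    miss_both = (upper - lower) ** n   # every draw lands between the two
    return 1 - miss_low - miss_high + miss_both


def prob_region_missed(n, eps):
    """P(none of n draws lands in a region carrying probability mass eps)."""
    return (1 - eps) ** n


print(prob_extremes_cover(100))        # roughly 0.988
print(prob_region_missed(100, 1e-4))   # hypothetical eps; roughly 0.99
```

So 100 random draws almost certainly span the 5-95% range, while a small high-response pocket (here given an assumed mass of 1e-4) is almost certainly missed.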

Interesting to see you criticising them for their sample sizes being too small. :o)

Not quite sure what you mean by "no sampling distribution". Which models they get back is rather random AFAICS. Compared to telling each model to generate random parameters in the ranges set, the distribution they get back will definitely be more widely spaced. Is that undesirable compared to the clumpier distribution you would get from generating random parameters in the ranges set?

How would you go about getting a good distribution of parameter combinations?

Chris, the essential point is that they are claiming to have actually found the most extreme model (or at least, a close approximation to it) within their high-dimensional parameter space. Standard Bayesian approaches would only care about finding the percentiles (eg 95%ile or 99 etc) of a distribution, which is likely to be a substantially simpler problem (though it still may be quite hard, depending on the details).

In their approach, the actual distribution of sensitivity (etc) across their ensemble has no particular importance, it is just the range covered that counts. A Bayesian would have to choose (eg) uniform, gaussian, or some other distribution for each parameter, and the results would depend on this choice.

If CPDN doesn't have the computing power to do a large enough sample then no climate modelling group does. So the question is what is the better input to some form of emulator? Is it the wide distribution that CPDN get with their triple peak distribution, or is it some distribution that samples quite heavily 'near' the centre but is very poor at sampling the extremes?

I would rather extrapolate within the range tested rather than outside the range tested. (I also wouldn't want all at the extremes with nothing in the middle so I wouldn't go more widely spaced than CPDN have used.)

The other thing is, if you specifically distribute your sample to get a full range of the sensitivity, then you haven't got the best sample of the best models at median expected sensitivity.

Anyway, I cannot see anything in what you have said that makes me think there is a valid criticism of the distribution used.

I do completely agree that they should have calculated a 5-95% range or a 16-83% range (or both or others).

I also think they should have shown this for various different quality control levels. I think that is likely to show curves with lower uncertainty for higher quality control. Though whether that really points to lower uncertainty in the way I am tending to think may well be rather dubious.

FWIW is there an open version of the paper? From the description their figure of merit was not the global temperature anomaly but depended on the geographical distribution of the warming, which would be a step forward.

Eli, I think as a co-author I have the right to digitally share it, so here it is: https://docs.google.com/open?id=0B9HdfZpD8H7vX2VIQ2hUdTZSN0NhSm5BWWVZRktfUQ

(If I'm violating some policy please let me know and I'll remove the link.)

I'm not going to comment on the science since it's been years since I was involved on CPDN, my role was that of computer geek, and I forget everything by now anyway! ;-) But I think it would be great if other groups explore these huge datasets from CPDN, I know there's a nice portal one of my colleagues at Oxford set up (Milo Thurston) to this end. There's much more data than Oxford boffins & postdocs to handle it all anyway IMHO.

Then perhaps some of the other things James & crandles bring up could probably be explored? There probably wasn't enough room in Nature to explore things more, considering the author list took up half the space! ;-)

I don't have a Nature subscription any more unfortunately - did the Isaac Held "forward" or "commentary" or whatever say anything useful for you?

Carl, given your name and address on the paper, I was wondering if you had been tempted back... Isaac Held's blurb was fairly anodyne...nothing negative, but "it is nevertheless important to think of the results as a work in progress". I suppose it's the sort of freebie you get in return for writing a positive review :-)

Chris, what they have actually generated *is* a 66% confidence interval, which is basically the frequentist analogue of a 17-83% probability range. Given their method, it does not make sense (AIUI) to consider percentiles of their "acceptable" models, the theory is based on using the extreme solutions. My questions are not so much whether there would be a better way of performing their particular calculation, but firstly whether their method gives an accurate solution to their problem as stated, and secondly whether their interpretation of the problem is actually a useful one.

Sorry, I could do with an explanation of that reply (particularly the 'it does not make sense' part).

I do accept that questioning whether their method gives an accurate answer and whether their interpretation is a useful one should come first. But to a certain extent, whether their method gives an accurate answer surely might involve exploring whether it is better than alternatives?

I am thinking, for simplicity, that for an example problem you can create 1000 good models which have answers in the range 1.5 to 4.5. To generate those 1000 models you can hand out 2000 models and the bad models have answers in the range 5 to 10. Alternately you could look around more by allowing more extreme parameters. Suppose this generates 4000 models, of which 1000 are good as before and 3000 models with answers in the 5 to 20 range. (Consequently it is now clear that we didn't need to look for more extreme parameters. If only real problems were this simple.)

The question then is whether the 1.5-4.5 range is a 50% confidence interval, a 25% confidence interval or (possibly?) a 100% confidence interval.

On this simple problem, you might be able to justify all three with the difference depending on how exactly you define your method. This simply shows that a confidence interval is highly dependent on method and is therefore not a reliable number nor the credible interval that we want.
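The arithmetic of this toy example can be sketched in a few lines. The ensembles below are hypothetical stand-ins that match the numbers in the comment. The evenly spaced answers are an assumption made only so the sketch is deterministic.

```python
def summarize(good, bad):
    """Return the range spanned by accepted models and the accepted fraction."""
    accepted_range = (min(good), max(good))
    fraction = len(good) / (len(good) + len(bad))
    return accepted_range, fraction


# 1000 "good" models with answers spread evenly over 1.5 to 4.5 (assumed spacing).
good = [1.5 + 3.0 * i / 999 for i in range(1000)]
# First design: 2000 models handed out, 1000 bad ones with answers in 5 to 10.
narrow_bad = [5.0 + 5.0 * i / 999 for i in range(1000)]
# Second design: 4000 models handed out, 3000 bad ones with answers in 5 to 20.
wide_bad = [5.0 + 15.0 * i / 2999 for i in range(3000)]

print(summarize(good, narrow_bad))
print(summarize(good, wide_bad))
```

Both designs return the same 1.5 to 4.5 range from the accepted models, while the accepted fraction is one half in the first case and one quarter in the second.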

I said I think they should have calculated a 5-95% range. I should accept that this range isn't a confidence interval per *their* method. Neither is it a credible interval but perhaps just a confidence interval for what I see as a potentially better method. Obviously it would need much more consideration before it could be considered to be a better method. There is likely to be some reason why it isn't a better method or it would have been used. However, without discussing it, I will probably never know why it isn't a better method.

With this information, would you end up with a credible interval of more than 75% for 1.5 to 4.5 even if the answer and prior was for number of years before arctic sea ice practically disappears for 1 day a year? (assume you believe these models are the best around.)

(BTW did you think these numbers were for climate sensitivity?)

Hopefully, with that example the comments on the distribution being widely spaced being sensible for setting up an emulator seem more relevant.

Though good for setting up an emulator, the distribution is almost certainly too widely spaced for practical use. The way I am thinking, when this is corrected for using an emulator, a sensible 66% likely range is likely to turn out to be much narrower than the one they have shown in this paper.

(They may well be aware of this but in order to have more impact they want to show a wide range now (shock, horror, worse than we thought) so a later paper has different dramatic effect (wow, startling improvement from these techniques). But of course, in reality, it is just an unfortunate situation that space constraints stop them from including a discussion of likely differences between their confidence interval and a credible interval.

This sort of bias couldn't be why some people turn septic could it?)

Chris, using their method, the way to get a range different to the 66% interval, would be to change the severity of their constraint (horizontal line on fig 2a), and see how this different set of acceptable models propagates.

I'm sure they tested this and it's obvious from the diagram that the result would be that the 95% CI is not so different from the 66% CI (the upper bound would only rise to 3.2C), which is probably why they published the latter :-)

>"Chris, using their method, the way to get a range different to the 66% interval, would be to change the severity of their constraint"

Yes, I can see that is what you do according to their method. But what I am wondering, through my example above, is whether this is reaching a 'through the looking glass' level, where a confidence level means whatever the authors intend it to mean.

If it is so crazily easy to manipulate yet irrefutable as the answer is true by definition of the method in that way, is it time to begin to question whether this method of theirs is scientific or not?

Chris, it's a general weakness of many statistical methods that the results are only valid if the method was defined prior to the data being collected. However, it might be harsh to pick on this particular occasion to complain.

That sounds reasonable, thanks James.

received by email from Nic Lewis who was having trouble getting commenting to work - reply to follow shortly:

"James, I'm not sure why their "likely" range would grow with the number of ensemble members. As the range would include only the 66% of ensemble members that passed goodness-of-fit test, I would expect it to remain largely unchanged with ensemble size, assuming a close link between goodness-of-fit and forecast warming. Isn't this a similar position to randomly sampling a (say) Gaussian distribution of fixed variance, where the more samples are taken the more extreme values will be reached, but the 66% central CI will be little affected? Even if the link between goodness-of-fit and forecast warming is weak, I would expect only random fluctuations in the "likely" range.

However, I don't think that the study's "likely" range of 2050 warming relates closely to how likely warming in that range actually is. The study didn't explore anything like full ranges of key climate parameters: equilibrium climate sensitivities below 2 K were not included in the ensemble, only a limited range of ocean heat uptake levels appears to have been considered, and it is unclear to me to what extent the possibility of aerosol forcing being small was represented. So it looks to me as if the lower bound of the study's "likely" range is probably significantly biased upwards.

Also, can you clarify why you don't expect Bayesian credible intervals (of the one sided variety) to be the same as the corresponding frequentist confidence interval? Is that because you think Bayesian credible intervals are not valid as objective measures of probability? Most statisticians involved with these issues seem to view checking the matching of one sided credible intervals against frequentist CIs as an important way of testing the validity of Bayesian inference, and in particular of the ability of candidate noninformative priors to generate objective posterior probability densities, a position that I agree with."
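The Gaussian sampling analogy raised in this comment is easy to try out. The following is a minimal sketch assuming independent standard normal draws. The seed, the sample sizes and the 17-83% index convention are arbitrary choices for illustration.

```python
import random


def range_and_central_width(samples, lower=0.17, upper=0.83):
    """Return (max - min, width of the central 17-83% range) for a sample."""
    ordered = sorted(samples)
    n = len(ordered)
    full_range = ordered[-1] - ordered[0]
    central = ordered[int(upper * n)] - ordered[int(lower * n)]
    return full_range, central


rng = random.Random(0)  # fixed seed so reruns give the same draws
draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

# Nested sub-ensembles: each larger one contains all of the smaller ones.
results = {n: range_and_central_width(draws[:n])
           for n in (100, 1_000, 10_000, 100_000)}
for n, (full, central) in results.items():
    print(f"n={n:>7}: extremal range {full:.2f}, central 66% width {central:.2f}")
```

In a run like this the central 66% width settles near 1.9 sigma, while the extremal range can only widen as members are added. The disagreement in the thread is over which of the two the paper's "likely" range behaves like.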

Nic,

considering the first part of your comment, let's write the response of a model over the hindcast and forecast periods as something like (A+e,B+d) where A and B are the forced response over the two intervals (which depends on the parameter choices) and e and d are Gaussian deviates due to internal variability (which depends on random initial conditions). Now, it doesn't matter how you select/constrain over the hindcast interval, the range of forecast warming still has no supremum because even if B is bounded, d is unbounded due to being a Gaussian. Gaussian here is a very natural (perhaps optimistic) choice and I'm certainly not aware of any strict limit on the magnitude of internal variability. Averaging over a finite-size initial condition ensemble (as they did) only makes the distribution narrower in variance, it doesn't bound it strictly.
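The last point, that averaging narrows the distribution without bounding it, can be put in numbers. This is a sketch assuming d is Gaussian with standard deviation `sigma` and independent of e, so that selection on the hindcast leaves it untouched. The function names and the choice of four initial-condition members are illustrative assumptions.

```python
import math


def exceedance_prob(x, sigma=1.0, m=1):
    """P(mean of m independent N(0, sigma^2) deviates exceeds x)."""
    return 0.5 * math.erfc(x * math.sqrt(m) / (sigma * math.sqrt(2)))


def prob_any_exceeds(x, n_models, sigma=1.0, m=1):
    """P(at least one of n_models independent averaged deviates exceeds x)."""
    p = exceedance_prob(x, sigma, m)
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_models * math.log1p(-p))


# Averaging over m = 4 members shrinks the tail but never removes it.
print(exceedance_prob(3.0, m=1), exceedance_prob(3.0, m=4))
# Given enough models, some member exceeds any fixed threshold.
for n_models in (10**3, 10**6, 10**9, 10**12):
    print(n_models, prob_any_exceeds(3.0, n_models, m=4))
```

For any threshold the per-model probability stays strictly positive, so the chance that some ensemble member exceeds it tends to one as the ensemble grows.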

[Strictly speaking, there is no absolute guarantee even that B is bounded, and it certainly doesn't seem from the diagram that they have sampled densely around the high end of their range - I only see two samples in their acceptable ensemble that are above 2.8C or so.]

I agree with you about the lower bound, it seems particularly unreasonable for them to criticise the lower end of the IPCC range, especially as the rather small IPCC ensemble includes models with a lower forecast than their full range, which also satisfy their statistical criterion. There's a heap of evidence that single model ensembles simply don't generate as diverse a range of behaviour as structurally different models can do.

I don't expect Bayesian intervals to match frequentist ones because they address a different problem and require different inputs. Of course there are some quite common situations where they do coincide numerically, but I don't see why this should be one of them. Furthermore, there seem to be a bunch of ways of approaching this frequentist likelihood profiling thing, which do not agree with each other, so they can't possibly all agree with a specific Bayesian analysis.
