Saturday, August 14, 2010

Once more into the breach

Hmm...I've used that title before. Anyway, I've run out of excuses and have been tasked by my manager with addressing some of the issues people have raised in comments, while she addresses the current shortage of chocolate brownies in our apartment. Thanks to those who have already stepped in and provided some answers and rebuttals.

Rather than just working through the comments in detail, I think it may be more sensible to first reiterate some basic points underlying the interpretation of model ensembles, as this seems to be the basis of the disagreement. I'm not going to delve into the fine details of the statistical analyses but rather consider the broad issue of what an ensemble of model results can reasonably be expected to represent.

I'm actually finding it difficult to criticise MMH too strongly - it's not that they are correct; on the contrary, it is clear that their analysis is wrong-headed and fundamentally irrelevant - but rather that so many climate scientists have also got confused over this. MMH have compared the obs to the ensemble mean and found that they differ. As I said before, big deal - no-one in their right mind would expect them to match anyway. But the "IPCC Experts"™ did effectively endorse such an interpretation of the ensemble in their recent Expert Guidance (hmmm...dare I call it "The Dummies Guide to Ensembles"? Better not go there, I'm sure I've done enough damage already.) So clearly they are also rather confused.

Of course it would be great if it were true. No more worries over climate sensitivity, for example. Given the current model spread of about 3±1.5°C, all we need to do is put together 10 models into an ensemble and the range shrinks to 3±0.5°C. 100 models would give us 3±0.15°C. Who needs all those pesky new observations and theories anyway? Just write some more code and the answer will pop out. Sadly, real life doesn't work this way.
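To see why, here is a minimal sketch with made-up numbers (none of these values come from any real model archive). The standard error of the ensemble mean shrinks as 1/√N, but that tells us nothing about whether the truth lies anywhere near the mean:

```python
# Minimal sketch, purely synthetic numbers: the standard error of the
# ensemble mean shrinks as 1/sqrt(N), but an ever-tighter interval around
# the mean says nothing about its distance from an off-centre truth.
import numpy as np

rng = np.random.default_rng(0)
truth = 2.2                        # hypothetical truth, deliberately off-centre
model_mean, model_spread = 3.0, 1.5

for n in (10, 100, 1000):
    models = rng.normal(model_mean, model_spread, n)
    se = models.std(ddof=1) / np.sqrt(n)      # standard error of the ensemble mean
    excluded = abs(models.mean() - truth) > 2 * se
    print(f"N={n:4d}  mean = {models.mean():.2f} +/- {se:.2f}  "
          f"truth excluded at 2 s.e.? {excluded}")
```

With 10 models the truth may still sit inside the 2-standard-error band; with 100 or 1000 it almost certainly does not, even though nothing about the models has improved.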

To put it in a nutshell, there is simply no theoretical, philosophical, or practical basis for the hope that the ensemble mean may coincide with the truth. I have demonstrated that the mean is always a better estimate - in terms of having a lower RMS error - than most of the constituent models, and explained why it is often better than all of the models - but that doesn't make it the truth. In fact irrespective of how good the models are, the mean of the models is not a plausible climate simulation at all in many respects. The mean is a mathematical object that doesn't even look like a model - it has far too little variability in both time and space, and numerous other completely unrealistic properties. This idea is completely routine and well-known, but Reto Knutti sometimes presents a nice analogy demonstrating this (I see he attributes the idea to Doug Nychka):


It's a set of 16 photos of aeroplanes (which take the place of models aiming to simulate a hypothetical "real plane") and the "average photo" composite of the images. The planes all have 2 wings and recognisable shapes. The average plane is a vague splodge, with no recognisable shape. It simply doesn't look like a plane at all. Surely no-one in their right mind would be surprised if the real plane fails to end up looking precisely like this "average plane". Except MMH and perhaps the IPCC Experts. Sorry, but however much I would like to just have a go at MMH I can't get around the fact that the IPCC Experts explicitly endorsed this idea.
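To put numbers on the earlier point about RMS error, here is a toy illustration with purely synthetic series (not output from any actual GCM): the average of a set of noisy "models" beats the typical member on RMS error, yet has far too little variability to pass for a single realisation of the climate.

```python
# Toy illustration, synthetic data only: the multi-model mean has a lower
# RMS error than the typical member, but its variability is much smaller
# than that of any individual realisation.
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)                              # 20 "years" of monthly steps
truth = (0.001 * t + 0.2 * np.sin(2 * np.pi * t / 12)
         + rng.normal(0, 0.1, t.size))

# 16 pseudo-models: same forced trend, but shifted phases, noise and biases
models = np.array([
    0.001 * t + 0.2 * np.sin(2 * np.pi * t / 12 + rng.uniform(0, 2 * np.pi))
    + rng.normal(0, 0.1, t.size) + rng.normal(0, 0.05)
    for _ in range(16)
])
ens_mean = models.mean(axis=0)

def rmse(x):
    return np.sqrt(np.mean((x - truth) ** 2))

print("median member RMSE :", round(float(np.median([rmse(m) for m in models])), 3))
print("ensemble-mean RMSE :", round(float(rmse(ens_mean)), 3))
print("typical member std :", round(float(np.median(models.std(axis=1))), 3))
print("ensemble-mean std  :", round(float(ens_mean.std()), 3))   # much smoother
```

The mean wins on RMS error precisely because the independent errors partially cancel, and for the same reason it is much smoother than anything the real system could produce.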

In our GRL paper we wrote:

"This truth-centred paradigm appears to have arisen as a post-hoc interpretation of the ad-hoc weighting procedure known as ‘‘Reliability Ensemble Averaging’’ or REA"

and

"We suggest that in place of the truth-centred approach, future research into the use of the CMIP3 and other multi-model ensembles of opportunity should be based on the paradigm of a statistically indistinguishable ensemble, as this is both intuitively plausible and reasonably compatible with observational evidence."

which was intended as a strong condemnation, but perhaps it was too cryptic, or perhaps the IPCC Experts simply weren't prepared to take our word for it. It seems that Jules and I will have to keep writing about this in a variety of ways until the idea sinks in. The belief that an ensemble should or even could be centred on an unknown truth is a complete nonsense. End of story. There is really nothing to debate here.
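For concreteness, here is what the statistically indistinguishable paradigm implies in practice; the sketch below uses synthetic numbers only. If the observation behaves like just another member drawn from the same distribution, its rank within the ensemble is roughly uniform, whereas a truth-centred ensemble would pile the observation up in the middle ranks.

```python
# Minimal sketch of the "statistically indistinguishable" check, using
# synthetic numbers: the rank of the observation within the ensemble should
# be roughly uniformly distributed if obs and members are exchangeable.
import numpy as np

rng = np.random.default_rng(2)
n_cases, n_models = 1000, 24

ensemble = rng.normal(0.0, 1.0, (n_cases, n_models))   # ensemble members
obs = rng.normal(0.0, 1.0, n_cases)                     # "truth" from the same distribution

ranks = (ensemble < obs[:, None]).sum(axis=1)           # rank of obs: 0..n_models
print("rank histogram (should be roughly flat):")
print(np.bincount(ranks, minlength=n_models + 1))
```

Under the truth-centred view the observation would land near the middle rank almost every time, and the histogram would be sharply peaked rather than flat.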

39 comments:

PolyisTCOandbanned said...

Dude:

I don't even know all that fancy stats and math and stuff, but just thinking about what you're doing, would seem to get you to this insight.

I also fault MMH for doing a bunch of intricate equations and such, but not clearly discussing what they were doing and how it differed from other approaches and why they thought it correct. At first, I figured they were just trying to be sneaky/tendentious. Now, I'm kinda scared that they don't think. I'm not sure which would be worse...

Seriously, just stepping back and thinking about how we use insights from the various models for the century long rise, would show how it differs from MMH. (The point I made to Steve, which he censored.)

Heck, I realize this stuff is subtle and complicated and it's easy to trip yourself up. Heck, so are medium-hard word problems on a college-level basic Probability and Statistics exam. You can't just grab the nearest formula and plug and chug.

I'm actually open to having MMH proved right, to learning more about how to think about this stuff. But so far, they haven't gotten it done. Bunch of squid ink. Some arguments of the "well OK, we might be wrong, but then something else would be too" variety (which is a pretty silly attitude...you ought to have ideas that you think are correct in your science papers, not blog debating points that you can't even back up on their own). McI flails around with new ad hoc analyses (after 2 years of wait time) and messes them up. Then closes comments after his mistake is shown (what a coward; he should take his whipping like a man). And these are our best skeptics! :(

Deep Climate said...

James,
I think it would be wise to place MMH and Santer et al 2008 in context.

As you pointed out, Santer et al ran a similar test of observations against the ensemble mean, corrected from the abysmal Douglass et al (Santer's H2). But they also did a pairwise analysis (H1) that seems roughly equivalent to the ensemble spread approach.

MMH purport to do a better, cleaner and more up-to-date H2 than Santer. Although I'm convinced some of the details have been mishandled, their findings are not surprising, given that tropical tropospheric trends went down in the observations and up in the models for 1979-2009 relative to 1979-1999 (plus more d.o.f.).

I agree with you that the MMH (and Santer H2) analysis is misplaced. But one also has to ask why MMH went backward and didn't consider ensemble spread or pairwise analysis, even though that had been done by Santer et al (as Gavin Schmidt pointed out in comments).

I also think that much of the discrepancy between tropospheric obs on one hand and surface and models on the other, is down to problems with the satellite data analysis (calibration biases). The forgotten Fu et al and emerging Zou et al data sets are considerably warmer than UAH and RSS, especially in the tropics. Zou et al also appears to be more consistent with less spurious effects - very promising.

[h/t Gavin's Pussycat for bringing Zou to my attention.]

http://deepclimate.org/2010/08/12/open-thread-%C2%A05/#comment-5002

TimG said...

As far as I can tell your GRL paper can be summarized as follows:

"It is wrong to use the ensemble average as a best estimate of reality and our approach is better yet it produces basically the same result".

The fact that you come up with basically the same answer seems to suggest the differences in the methods are not as large as you claim and your protests about the IPCC expert way of doing things may be overblown.

A good parallel may be the numerous recent replications of the temperature reconstructions, which demonstrated rather conclusively that the choice of algorithm for gridding/combining temperature records does not really matter much, but that does not mean the raw data is useful.

I think the bigger problem here is the entire concept of testing models as a group is flawed and that each model should be tested on its own.

Jim Crimmins said...

I guess what people are searching for is the answers to these questions:

--how do we evaluate the models?
--are some models better than others?
--how do we decide if a model belongs in the "ensemble" - ie what quality metric or meta-structure standards must it pass?
--since these models have no predictive skill on a 1 yr time horizon, which means we can't run normal stats on them, and the mean drifts don't match reality over the long term, what good are the models - (seriously?)
--it seems like the idea of the "ensemble" is a band-aid to generate enough spread to cover the actual observations, since the models themselves defy evaluation by conventional means - not a confidence-inspiring situation

Jim Crimmins said...

Another point:

--it's clear that an individual model's separate runs can generate many different potential futures, so you wouldn't necessarily expect the model mean to match the single realised future in the observations if the variance of the simulations was wide. However, if the variance of the individual model simulations is not very wide, and the monthly or annual SD is not similar to observations, then there is a problem.

--the idea that the different models are drawn from some "model population" with a large variance that covers the observation is really hard to swallow. They are all trying to model the same plane in your example, not different planes.

--so, the different models are not modeling different futures. The individual runs of each model are modeling different futures. I think this is the difference in interpretation that is going on.

Deep Climate said...

James,
I think your example of increasing the number of models is a bit exaggerated.

No one (not the IPCC, not Santer, not MMH) is suggesting that the more models you have, the more you can constrain the confidence interval around a mean trend value.

With regard to model-obs comparison (not possible for your climate sensitivity example), the test is really whether the model mean is consistent with a given observation trend. (I'll leave aside the thorny issue of how MMH go about combining obs sets, because I haven't a clue, so let's stick with just one.)

It seems to me that as the number of models grows, this converges to a test of whether the model mean falls within the CI of the observed series trend (which is pretty wide for the tropical troposphere).

Unless you're Douglass et al. They actually did require that the observation be within the model mean trend C.I. Of course the models were found wanting.

But maybe I've misconstrued your argument. Anyway, this is more of a quibble - in general, we are in violent agreement.

gp2 said...

I'm surprised that scientists are ignoring satellite reconstructions with higher tropical trends compared to the regularly updated UAH and RSS time series; indeed, if the Zou et al. approach turns out to be correct, not only does the discrepancy between satellite reconstructions and models not exist, but even papers like Klotzbach et al., which claim that the discrepancy is due to biases in the surface temperature record, would be wrong.
How can they make such an assumption when the satellite temperature reconstructions already have such a large spread in the tropics? (This implies that at least two of the satellite reconstructions must be wrong.)

James Annan said...

DC, perhaps you are right. I find it hard to make sense of, because it makes no sense...

TimG, but we do come to very different conclusions from the others. In particular, analysing the same data according to the two different paradigms leads to the conclusion either that the ensemble is "likely too narrow and not capturing the full range of plausible models" (Knutti et al) or else close to perfect (what we found).

Jim Crimmins, you say "the mean drifts don't match reality over the long term" but this is not at all clear. We have shown the real warming trend is indeed contained in the spread of model predictions, albeit currently close to the lower edge. We don't know if this is the onset of a clear discrepancy or merely due to random chance.

Anonymous said...

James
I don't normally agree with you but I do on this. MMH seem to be demanding that the observations fit the mean of the models to within a standard error, which as you say becomes increasingly impossible as the number of models increases. So this is the wrong test - too strict.
But equally I think that merely requiring them to fit within one SD is too lenient, since we can always consider a wider spread of models to satisfy this.
So I don't know what the answer is. Perhaps the question of whether "the models" fit the observations is a meaningless one.

PaulM

Jim Crimmins said...

JA- I've read your paper and understand the idea it is based on. I think it comes down to a difference in these two approaches (same plane/different plane in your visual example):

--(A) should the **individual** models be run enough times, with enough of a stochastic component added to the models to generate a plausibly wide distribution of future outcomes, then evaluated individually against whatever metrics are deemed important, and the best models selected (models simulate the same plane)

OR

--(B) should the individual models be run a few times, with no stochastic component, and the collection of all model runs be viewed as a distribution of future outcomes, which is then evaluated against whatever metrics are deemed important (models simulate different planes)

I think a lot of us believe (A) is the better approach. It's possible that the circulation models are not the right tool for the job of constructing plausible 100 yr future distributions - if not, fine - let's use them coupled with satellite data to nail down forcing sensitivities and feedbacks and then construct a much simpler model using CO2, solar, etc. scenarios to construct 100 yr distributions we can all make sense of.

crandles said...

>"I think a lot of us believe (A) is the better approach."

Jim, sorry but why?

There are many ways for a model to be wrong (many variables that could be biased in several different ways).

"evaluated individually against whatever metrics are deemed important, and the best models selected" Doesn't this selection of best models mean you fail to see as many of the ways in which the model could be wrong compared to an approach of insisting that all the model types are included?

Ideally the models simulate one thing - reality. This makes "models simulate same plane" sound good. However we know we cannot possibly achieve that so it is more a matter of seeing how wrong we can be.

Suppose model type 1 simulates the plan view of a plane well but is hopeless about vertical profile while model type 2 is good at vertical profile but poor at plan view then I would rather use both model types than decide whether model type 1 was better or worse than model type 2.

I think this sounds like:
The individual models should be run a few times and the collection of all model runs be viewed as a distribution of how wrong we could be without using the models. Evaluate against whatever metrics are deemed important to downweight unlikely badly wrong possibilities. Use remaining distribution to indicate how wrong we could be, bearing in mind that both model type1 and 2 could be badly biased in same way about some other aspect (say cross-section) which gives other ways in which we could be wrong which are not adequately addressed by the models.

This sounds nearer to your B than your A.

Jim Crimmins said...

crandles -

In your example the things the plane models are trying to simulate are somewhat orthogonal, so the ensemble approach makes some sense.

My understanding of the climate models (which may be incorrect) is that they are all relatively "complete" models, or at least that's the intention.

Further, the modeling community (at least in the voice of RC/Gavin S) has maintained that the models are relatively complete and skillful at this point.

I could understand the ensemble approach if the models are simulating relatively orthogonal (and uncoupled) facets of the climate. That's just not my understanding of what they are up to.

In the case where they are essentially competing models, I think we need a different method of evaluating them individually. I'm not sure I know what that is in the context of temperature anomalies, since they apparently have little short/medium term predictive ability there.

Again, my thinking is that they are probably over-complex for the purpose for which they are being sold (aggressively) - that of providing 100 yr terminal distributions of potential global temperature anomalies. That's the real problem I think. It doesn't make them bad, just ill-suited.

Jim Crimmins said...

And as to why B in the original example is to be avoided - well the reasons have been pointed out by many: it's easy to have a distribution of models with zero predictive skill that are not eliminated statistically. Since they are not eliminated statistically (even if just on the edge), they are then used for 100 yr predictions. Moreover, since they have no stochastic component, the terminal range fails to include one standard deviation of downside natural variability. Those two effects (edge of the observed range, no stochastic component) add about 1C to the bottom range of the terminal distribution. That's a lot of bias.

crandles said...

>"Moreover, since they have no stochastic component"

Huh?? Strictly speaking, an initial-condition ensemble or a perturbed-parameter ensemble is not a stochastic component per the strict definition of stochastic, but I am not sure I see any way in which these fail to fulfil the purposes of a stochastic component (and they have other benefits/uses as well).

Jim Crimmins said...

Uncertainty in model parameters is not the same as natural climate variability, unless the parameters are stochastic time series.

PolyisTCOandbanned said...

McIntyre rapidly did an analysis and got a bunch of his allies pointing to it. Then it turned out he screwed it up (to his benefit...funny how the cashier always makes errors in one direction). Funny how they spent 2 years working on this thing and still didn't have explanations in there, and flailed around and made mistakes first thing out of the gate.

He's disappeared the graphic and locked the thread. Will get back to it in a few weeks. But he had time to get a new thread up for the next hs paper. And he doesn't mind that staying open without him to attend it.

I guess it is just coincidence that locking discussion on something where he made a mistake (and disappearing part of the post) helps minimize him having to take his lumps.

Rattus Norvegicus said...

This might be part of the answer:

http://www.star.nesdis.noaa.gov/smcd/emb/mscat/mscat_files/Zou.2009.ErrorStructure.pdf

h/t to Steve Bloom.

Anonymous said...

Curious analogy with the planes, but somewhat topsy turvy I would have thought.

16 companies come to you with 16 machines they claim are airplanes. If you blend them together to get a smudgy image which doesn't look inconsistent with the shape of a plane, it doesn't mean that you can state that either individually or collectively the 16 objects are likely to be planes.

James Annan said...

I certainly agree with that (and don't think I stated anything to the contrary), but MMH are effectively using the fact that the smudge doesn't look like a plane to argue that the individual constituent planes won't fly, which is also clearly illogical. The mean doesn't tell you that much either way, frankly (well, if it looked very much like the obs it would probably be generally encouraging, but this is not a necessary condition).

Tony Sidaway said...

Hello James, as usual your explanation is lucid and easy to follow. But as I'm not good at guessing what odd bits of alphabet stand for (this happens to me often enough for me to suspect that there may be some cognitive dysfunction on my part), could you explain which publication you're referring to as MMH? I'm sufficiently clued up to guess that the two Ms may stand for Stephen McIntyre and Ross McKitrick, but that doesn't get me very far.

James Annan said...

Sorry, that was an obvious error on my part. It's this paper.

PolyisTCOandbanned said...

Even leaving aside the issue of model-to-model variation, I have a sense that the MMHers have gotten themselves so confused with fancy shmancy "panel regressions" and time-series shtuff that they don't even do normal sampling statistics. Like, if you have 15 trends, how to take the standard deviation of that set of data. I wonder how they get a standard deviation for a single run of a model. Then I wonder WHAT they use to do the standard deviation of the trend with multiple runs (after all, if they do something funky with a single run, are they also being funky with multiple runs?). Then of course, they probably lack any concept of how to handle different models (most of which are single run).

I don't know any of this. I'm just worried about it and want someone else to check for me.

BTW, if they are abusing basic ideas like standard deviation, who knows what they are doing with the averages! Are those still what I learned in grade school about adding all the numbers and dividing by the number of numbers?

ac said...

As long as there's a single clear mode you're still betting on the ensemble mean though, right?

Paul said...

JeffID is still waiting for an answer! See his and John N_G's comments on Curry's latest meander.

http://judithcurry.com/2010/09/27/no-consensus-on-consensus/

Sorry I can't link to a specific comment. You will just have to wade through the whole morass.

Paul Middents

James Annan said...

Unfortunately, Jeff Id does not show sufficient competence to understand the answer.

Jeff said...

James,

This is my first reading of your post. Claims of my competence to follow your argument aside, I actually agree with you about the ensemble mean.

However, you are missing the little detail that so often gets lost in climate science. The magnitude of the various model trends was shown to be 2-4 times the observation. I believe Lucia covered this point in a comment where she said, in paraphrase, that saying something is statistically significant can be nearly meaningless. If you have x +/- 0.001 and the trend is 0.002 different, perhaps nobody cares. However, when the trend is TWO to FOUR times different, then you have something to discuss.

Now you can belittle me all you want, but if you want to pass that off as a non-issue, I may have to return the favor.

As an engineer, it would lead me to question the data and the models. As I've already looked in depth at the data, my opinion is that the problem is likely in the models, but it should be thoroughly examined.

James Annan said...

Jeff,

Thanks for the comment. If you agree with me about the method of analysis, you should also agree that the observations fall within the range of model estimates (e.g. the Pat Michaels et al analysis, on which I am a co-author). So I don't know where your 2-4 times comes from.

Just to be clear, I don't think the models are perfect, and I would not be surprised if more concrete evidence of a mismatch were to mount up over time. However, it hasn't happened yet - and when it does, it may be partly due to obs error too, of course.

Jeff Id said...

James,

According to the MMH authors, the model mean runs 2-4 times the observation.

Now I'm sure that 'some' models have far lower trends, and some individual models have wide enough error bars to overlap the temp trend, but when the mean runs so much higher than observation, it does call into question the underlying assumptions which are so similar in many of the models.

Your point about the accuracy of the model mean for determining significance has merit, but only to the extent that the accuracy of the result is uncertain (yeah, I said it). But when you average a hundred different images of simulated planes, and the blob looks like an ocean liner, you need to consider the possibility of a systematic bias.

I'm certain that in a less entrenched field where there was less value placed on the result, problems shown by MMH10 would be taken far more seriously.

----

Here is a question on a different topic. Since models are modeling temp, and we're comparing models to temp, should the temp-model residual be used to determine the error bars? Or should we just look at the uncertainty of each trend individually?

Also interesting, if not examining residuals, would you use the error bars of one trend and the actual trend of the other to determine significance? Or would you use the error bars of both?

James Annan said...

IMO, in a less entrenched field, no-one would have taken MMH seriously in the first place. It would just have been ignored as obviously wrong.

The fact that the ensemble mean is some factor larger than the obs in itself means nothing about the ensemble's reliability. In fact it merely tells us in this case that using these methods we cannot hope to give accurate predictions of trend over a short (say 10y) interval, because we know that natural variability is large relative to the forced response over this time period. (There is ongoing work to try to improve on this situation, but that is not relevant to this analysis.)

Consider a large ensemble of fair coins which are each tossed 6 times. The ensemble mean is 3H. If a single coin (representing "reality") only gets 1H, then you could say that the ensemble mean is 3x greater than obs. However, this does not provide strong evidence that the real coin has a bias against H. The probability of 0H or 1H is over 10% even for a fair coin, and the distribution of the large ensemble will show this. Similarly, the probability of a very low temperature trend (over 10y say) is non-negligible, and the models show this. You can read it straight off the Michaels et al graphs.

If you wait for another 6 tosses and reality again ends up with only 1H versus the typical 3H from the ensemble, then you have much stronger evidence for a discrepancy...but until that happens there simply isn't much to go on. This is basically the situation we are in now.
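For anyone who wants to check the arithmetic, the binomial numbers behind the coin analogy are:

```python
# Binomial arithmetic for the coin analogy: 6 tosses of a fair coin.
from math import comb

p_low = sum(comb(6, k) * 0.5**6 for k in (0, 1))        # P(0 or 1 heads)
print(f"P(0 or 1 heads in 6 tosses) = {p_low:.3f}")      # ~0.109, i.e. over 10%
print(f"P(the same again next time) = {p_low**2:.4f}")   # ~0.012 for two 'decades'
```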

Jeff Id said...

James,

I understand the bulk concept you are expressing, but there are ways to quantify the statistical validity of the claim that you are making.

For instance, the standard deviation of the number of heads in 6 flips of a fair coin is about 1.22. Two sigma is 2.44; a mean of 3 heads minus 2.44 leaves about 0.6 heads at two sigma.

So it isn't that big a deal to see 1 once in a while.

In MMH 10 they showed that the mean is well outside of the two sigma 95% level for the model coin tosses. Again, the magnitude of the difference is the key.

You shouldn't ignore this quote:

"Over the interval 1979 to 2009, model-projected temperature trends are two to four times larger than observed trends in both the lower and mid-troposphere and the differences are statistically significant at the 99% level."

This means that you already have your additional coin tosses. Despite the fact that an average of models may or may not be physically realistic, the fact that their average and error bars all run so much higher than observation, and are so statistically significant, should not be overlooked with a hand wave.

James Annan said...

Jeff, the fundamental error of MMH is that the confidence interval they worked out is not based on the standard deviation of the ensemble, but rather the standard error of the ensemble mean. Thus re-quoting their claim is irrelevant - it is simply an incorrect calculation.

Using their calculation, you could argue that many of the models fail to be predicted by the ensemble of models...surely you can see this is a crazy claim, as the models can hardly fail to predict themselves. A valid and correctly-formulated statistical test should reject 5% of models at the 5% level, etc - this is what the 5% significance threshold *means*.
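To illustrate the difference with entirely made-up numbers (these are not the MMH or Santer values), compare a test against the ensemble spread with a test against the standard error of the ensemble mean:

```python
# Sketch with hypothetical trends: the spread (SD) test versus the
# standard-error-of-the-mean (SE) test.
import numpy as np

rng = np.random.default_rng(3)
trends = rng.normal(0.25, 0.12, 23)   # hypothetical model trends
obs = 0.10                            # hypothetical observed trend

mean, sd = trends.mean(), trends.std(ddof=1)
se = sd / np.sqrt(trends.size)

print(f"inside 2*SD spread?      {abs(obs - mean) < 2 * sd}")
print(f"inside 2*SE of the mean? {abs(obs - mean) < 2 * se}")

# By the SE criterion, the models fail to 'predict' themselves:
rejected = np.abs(trends - mean) > 2 * se
print(f"models rejected by their own ensemble mean: {rejected.sum()} of {trends.size}")
```

With a spread of that size the observation will almost always pass the first test and fail the second, and by the SE criterion most of the models themselves are "rejected" by their own ensemble, which is exactly the absurdity described above.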

MikeN said...

How hard would it be to just collect source code for the various models, and test them against different input parameters, as well as newer observations, and see which physics is likely to be more realistic?
See which models predicted a slowdown in warming, etc.
Some of the models are a bit explicit in the input variables like cloud sensitivity, while others have them built a little deeper into the code.

Jeff Id said...

James,

Despite the rocky start, I've appreciated the discussion thus far. You almost had me convinced for a bit, to the point where I began writing to say so. From that experience I can say that my mind is open enough but after some consideration, I am convinced with certainty that you are missing the point. It comes down to the hypothesis you are working to test.

If you wish to test that observed trends are within model distribution, you would use the standard deviation of the trends of models and you would have the wide confidence intervals you expected. Visually it is easy to say that you are correct, observed trends are within the edge of model distribution. The narrow CI of MMH10 does not encompass all model trends (BTW, you don't need to explain what CI means).

If you wish to test that observed trend is within our understanding of average trend over models, the CI you need is completely different.

The trend is the trend. A perfectly instrumented/measured temperature series with a trend of 0.25 +/- 1 is still greater than a series of 0.20 +/- 1. What that means is that even though we haven't nailed down the certainty of the trend long term, and even though both are dramatically within the CI, the first is actually a higher trend. Of course, statistically, neither trend is known to be separable due to the variance of the short-term signal. The assumption of Santer and MMH10 is that the trend from an average of models having different but allegedly realistic assumptions should resolve to something close to the observed data. After all, models are mathematical representations of the climate. I also agree that the model average is something non-physical because of differing assumptions.

So from the average model trend we find a value 3 times the measured. The first question then becomes the uncertainty in that trend. In paleoclimate, if you want to know the certainty of the average trend in the blade of the stick, you wouldn't take the extremes of all the inputs to calculate uncertainty; you take the variance in the output and use some method, e.g. Monte Carlo or a DOF estimate.

So with all that preamble you state:

"Using their calculation, you could argue that many of the models fail to be predicted by the ensemble of models...surely you can see this is a crazy claim, as the models can hardly fail to predict themselves. A valid and correctly-formulated statistical test should reject 5% of models at the 5% level, etc - this is what the 5% significance threshold *means*."

to which I reply softly:

**By your method, more data never ever improves the certainty to which you know the trend.**

Please consider the bold for some time.

The post by Chad Herman which I reviewed that drives this point home is here:

http://treesfortheforest.wordpress.com/2009/12/11/ar4-model-hypothesis-tests-results-now-with-tas/

It was a bit sobering while I spent an hour in the process of figuring out why you were right. The models are running high, or the data is running low or a little of both.

That is probably enough.

Jeff Id said...

James,

I tried to reply several times. Something was going wrong but the error message was unintelligible. From a guy who has programmed for almost 30 years, that's not good.

I've left my reply at tAV. Please continue the conversation as it has been enjoyable. Here or there doesn't matter to me.

http://noconsensus.wordpress.com/2010/10/01/conversation-with-a-climatologist/

James Annan said...

Hi Jeff, your comment seemed to work ok (in triplicate :-)) but this post is so old that all comments need manual approval. If I think of anything substantive to add maybe I'll put up a new post...

Chad's article doesn't add anything, and in fact the "is consistent with" formulation exemplifies the confusion. It presupposes a definition of "consistent" that is at best disputed and (IMO) obviously inappropriate. For what is the potential source of a disagreement? Even assuming the models are a perfect characterisation of the forced response and natural variability of the climate system (in statistical terms), his calculation will (with high probability) find that the obs are not consistent with the mean.

A more transparent approach to hypothesis testing would be to use the form of Wilks which I wrote about some time ago. The relationship of Chad's hypothesis to a more standard null hypothesis is not clear.

Your comment that more data does not improve the accuracy of the trend is dead wrong and does not follow from anything I wrote. Continuing from my hypothetical example given previously, if the models say 0.2+-0.18 over a decade, they will (as a first approximation) say 0.4+-0.25 over 20 years meaning 0.2+-0.13 for the decadal trend. Equivalently, a longer data series will give less uncertainty on the underlying trend. So although an obs trend of 0.05 is not inconsistent over one decade, a second decade the same would be.
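Spelled out, assuming the ±0.18 represents independent variability in each decade so that variances add in quadrature:

```python
# The scaling in the example above: independent decadal variability adds in
# quadrature over a 20-year record, then halves when converted back to a
# per-decade rate.
from math import sqrt

per_decade = 0.18
change_20yr = sqrt(2) * per_decade      # ~0.25 uncertainty on the 20-year change
decadal_rate = change_20yr / 2          # ~0.13 uncertainty on the decadal rate
print(round(change_20yr, 2), round(decadal_rate, 2))   # 0.25 0.13
```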

James Annan said...

Mike, in practice it is hard to separate out the choice of parameter values from the overall structure of the model. Whole components have been added in the past 20 years (since Hansen's 1988 forecast) in light of observational evidence. Also, the old code has not been maintained and does not run, or at least does not give quite the same results - we actually tried this with Hansen's model.

Jeff Id said...

James,

Thanks again for the reply.

The MMH method tests something different than you are testing. It's not a zero value method by any means, it's just not testing the same thing.

Once the model setup and parameters are picked, they are known to perfect certainty. Several individual models fail individual tests, but the group as a whole fails based on its model-mean variance, using a CI based only on knowledge of the model mean and not the full range of models.

If you wish to test whether the model ensemble trends are significantly different from reality, the spread is enough that the answer is no, as you have discussed here.

If you wish to test whether the model mean is known to be significantly different from reality, the answer is yes. The inputs and responses are already fixed to a zero uncertainty in each run so the question answered is a different one.

Whether you consider that something useful is another question but the test itself is perfectly valid.

I think it's clear that the uncertainty in inputs and responses creates enough spread that we can say the models are not able to predict trend well enough to be separable from reality.

What we can also say from MMH is that the parameters and responses chosen and utilized are biased heavily to the high side of observation. This is also visible in the Santer rebuttal which got stuck in review.

It's a subtle difference but the statement that the MMH test is invalid is incorrect.

--------------

This quote is from John Christy on the same topic:

“In this study we use the results from 21 IPCC AR4 models all of which portrayed the surface trend at or above +0.08 °C decade⁻¹ (minimizing the problem of instability due to small denominators in the SR.). Some of the 21 models were represented by multiple runs which we then averaged together to represent a single simulation for that particular model. With 21 model values of SR, we will have a fairly large sample from which to calculate such variations created by both the structural differences among the models as well as their individual realizations of interannual variability. From our sample of 21 models (1979–1999) we determine the SR median and 95% C.I. as 1.38 ± 0.38. We shall refer to this error range as the "spread" of the SRs as it encompasses essentially 95% of the results. We may then calculate the standard error of the mean and determine that the 95% C.I. for the central value of the 21 models sampled here as 1.38 ± 0.08, and refer to this as the error range which defines our ability to calculate the "best estimate" of the central value of the models' SR. Thus, the first error range or "spread" is akin to the range of model SRs, and the second error range describes our knowledge of the "best estimate" representing the confidence in determining the central value of a theoretical complete population of the model SRs.”

and later:

“With the exception of one SR case (RSS TLT) out of 18, none of the directly-measured observational datasets is consistent with the "best estimate" of the IPCC AR4 [12] model-mean. Based on our assumptions of observational values, we conclude the AR4 model-mean or "best estimate" of the SR (1.38 ± 0.08) is significantly different from the SRs determined by observations as described above. Note that the SRs from the thermal wind calculations are significantly larger than model values in all cases, which provides further evidence that TWE trends contain large errors.”

Link to paper is here:

http://www.mdpi.com/2072-4292/2/9/2148/pdf
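A quick check of the arithmetic in that passage (assuming both ranges are computed from the same 21 SR values): the ±0.08 is just the ±0.38 spread divided by √21, i.e. the standard error of the mean - which is exactly the spread-versus-standard-error distinction being argued about here.

```python
# Arithmetic check on the quoted passage (assumption: the +/-0.08 "best
# estimate" range is the +/-0.38 spread scaled by 1/sqrt(21)).
from math import sqrt

spread_95, n_models = 0.38, 21
print(round(spread_95 / sqrt(n_models), 2))   # ~0.08
```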

Jeff Id said...

My last comment was stuck in moderation or something.

James Annan said...

Hi Jeff,

Well everything on old posts gets stuck when I am away for a few days.

To be honest, though I do not mind engaging you in conversation, I'm going to duck out on this now. It is quite clear that even (some) working climate scientists are hopelessly confused on this - I've just seen an absolute howler in a paper in press - so there seems little point in trying to persuade you. I'll have to play the long game through the literature.