This came up again in the comments to this post, and I've been meaning to write about it for some time.

It has been repeatedly observed that the multi-model mean is considerably better than a typical model (and sometimes - not always - better than all models), in terms of RMS error when compared to data. The first explicit observation of this I'm aware of in IPCC-related climate science is from the analysis of the CMIP1 models by Lambert and Boer (2001) and it has been a feature of many subsequent analyses of more recent multi-model ensembles such as Gleckler et al (2008)'s analysis of CMIP3.

Several authors (including the IPCC AR4 page 608, as Lucia mentioned in the above comment thread) have talked about the errors of different models cancelling, and this seems to have motivated the concept of models being sampled from some distribution centred on the truth - a notion which unfortunately is quite clearly bogus, having no theoretical support and being comprehensively refuted by the data. So a sound understanding of this phenomenon still seems to be absent.

It was only recently that I really realised that this was considered an open question in the field - it was actually a reviewer's comment on the Reliability paper that put me on to it, following which I took a careful look at the literature to see exactly what had been said. My 3-line reply probably didn't get to the reviewer as that paper was accepted immediately after minor revision. I wasn't allowed to attend the IPCC Expert Meeting where I hear this puzzle was mentioned, and I was a bit worried that we might take the secret to our graves if we got run over by a bus while tandeming to work, so I quickly wrote a manuscript which contains the explanation (among other things - tempting though it was to try, I didn't think I could really get away with publishing a single page, and anyway, I had some related points to make). I don't think this has yet got as far as being sent out for review, and so, as far as I can tell, the question still remains a mystery to the entire climate science community. But at least if I get run over now, someone should get to see the answer :-)

I'll give you all a chance to think about it before posting the solution. Knowing that it has a simple solution, mathematically-minded readers should have a good chance of working it out - and will surely kick themselves when they see it. (If you do know the/an answer, feel free to say so in the comments without giving it away too quickly.)

So, why is the multi-model mean so good?

## 18 comments:

Well, two reasons seem obvious. First, all the models used match the known temperature record pretty well, and we haven't gone too far in the future so that temperatures are close to what they were ten years ago. In the far future, models with the right sensitivity number will do much better than the mean, if the right number differs significantly from it.

But second, the RMS variation of a mean of several time series is much less (reduced by the usual square root of number) than the RMS variation of the individual time series. So the RMS difference between observations and model would, for an otherwise accurate model, come from the sum of the two individual variations - so since the multi-model mean variation is less, it will naturally be closer to the data by that measure.

It's sort of a cancellation of errors effect, but only temporarily until the "signal" gets to be bigger than weather noise. Right?
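A minimal sketch of the square-root-of-N claim in the comment above, assuming each model's "weather noise" is independent Gaussian white noise with unit variance (an assumption made purely for illustration; real model output is neither independent nor white). The names `rms`, `N_MODELS` and `N_STEPS` are hypothetical.

```python
import math
import random

def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

random.seed(0)
N_MODELS, N_STEPS = 16, 5000

# Assumption: one independent unit-variance noise series per model.
noise = [[random.gauss(0.0, 1.0) for _ in range(N_STEPS)]
         for _ in range(N_MODELS)]

# Average across models at every time step.
mean_series = [sum(column) / N_MODELS for column in zip(*noise)]

single_rms = rms(noise[0])
mean_rms = rms(mean_series)
ratio = single_rms / mean_rms  # should sit near sqrt(N_MODELS) = 4

print(f"single: {single_rms:.3f}  mean: {mean_rms:.3f}  ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out close to the square root of the number of models. It says nothing about any error the models share, which averaging cannot remove.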

LLN?

Is the question why are ensemble means in general better than any individual realisation, or why multi-model means in particular are better? Are you only interested in RMS error?

ac,

I'm deliberately being a bit vague, in line with what the scientists have observed. It is the multi-model mean that has been investigated most intensively, and most people use some variant of RMSE (sometimes just mean squared error and usually area-weighted).

The other comments don't seem to be along quite the right lines...

If utility is a function of squared error, the expectation value of a well calibrated pdf is utility maximising. I think that's from Jaynes' chapter on decision theory. We 'expect' truth to coincide with the ensemble mean only insofar as it's the expectation value of the pdf.

Yes, that's very similar to what we wrote in the reliability paper:

"in the latter case the mathematical expectation of the truth is still given by the mean of the sampling distribution, but we no longer expect the truth to be at, or even close to, this location."

But it isn't quite what I was looking for...

If a model is included in the ensemble, this likely means that the model already performs reasonably well in a wide variety of contexts. Those models which do not are generally eliminated.

That being the case a model is likely to perform reasonably well on issues that are closely related to what the models within the ensemble and the mean of the ensemble itself are being tested on. As such the individual models taken as an ensemble should center on what is our best estimate of what is being estimated.

This will be in no small part due to the fact that the models are already including what is thought to be the most relevant factors for estimation -- according to the expert minds of those who created them. This being the case the mean of the ensemble should perform well -- and will likely perform better than any of the models that compose it as in a probabilistic fashion the ensemble takes into account a context that is likely larger than what is taken into account by any one of the models regarding the parameter being estimated.

Note that there is nothing in the above explanation to suggest that we should simply include as many models as possible in order to increase n in the hope that this will drive the mean of the ensemble closer to the true value. Quite the opposite -- if this means that the additional models that would be included generally performed much more poorly than those that are already included in the ensemble.

Likewise it in no way suggests that the mean will necessarily be close to our best estimate a year from now. However, as it is likely close to our best estimate given the evidence that we have available today it is also likely close to our current best estimate of what will be our best estimate a week, decade or century from now.

The mean is going to be close to the middle of the range of model estimations at every step. The observations are likely to be within the range. This gives a maximum error of half the range. An error of 0 up to half the range will likely average less than a quarter of the range.

Trend errors are likely to be small compared to weather noise at least for a period in which observations are compared to models which is likely to be short.

Even when the trend error is small there are likely to be times when the obs are not close to the mean. Suppose the obs average a quarter of the range away from the mean (which would be a poor result for the mean). Then the error in an individual model will vary from 0 up to 3/4 of the range. An error of 0 to 3/4 of the range is likely to average more than the 1/4 range calculated for the MMM.

So how does the MMM compare with initial condition ensemble mean for each model?

Timothy Chase makes the point that poor climate models are naturally selected out. Then the Darwinian principles of variation and natural selection apply. That alone does not imply that the ensemble mean is more fit than any member of the evolving population of climate models, as such models are too new and too few.

But it does suggest there are no surviving outliers; the models all cluster rather closely to one another, at least for the data available to date.

So posit that each model contains a systematic error, wrong climate sensitivity, as well as random errors (in comparison to the data) due to constitutive or even statistical components describing internal variability. Unless we are most unlucky (all models with a climate sensitivity which is too low for instance), the ensemble mean cancels (some of the) systematic error and hence is closer to truth.

Well, I know lots of maths, but not so much statistics. Not sure the above is worth more than a gentlemanly C grade. (Honors is the old British system, yes?).

Maybe only D (Pass).

I'm not sure that some special feature of climate models is important here.

I just generated 1000 random time series. The RMS difference between series 1 and the average of the other 999 series is smaller than the RMS difference between series 1 and any of the other individual series.
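The experiment described above can be sketched roughly as follows. The comment does not say how the series were generated, so this assumes independent Gaussian white noise, and it uses 200 series rather than 1000 to keep a pure-Python run quick; `rms_diff` and the constants are hypothetical names.

```python
import math
import random

def rms_diff(a, b):
    """Root-mean-square difference between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

random.seed(42)
N_SERIES, N_STEPS = 200, 1000  # scaled down from the 1000 series in the comment

# Assumption: every series is independent unit-variance white noise.
series = [[random.gauss(0.0, 1.0) for _ in range(N_STEPS)]
          for _ in range(N_SERIES)]

target, others = series[0], series[1:]
mean_of_others = [sum(column) / len(others) for column in zip(*others)]

diff_to_mean = rms_diff(target, mean_of_others)
best_individual = min(rms_diff(target, s) for s in others)

print(f"RMS difference to the mean of the others: {diff_to_mean:.3f}")
print(f"smallest RMS difference to any one series: {best_individual:.3f}")
```

For noise of this kind the difference to the mean should be near 1 and the difference to any single series near the square root of 2, so the mean wins even though series 1 is in no sense "centred" on anything.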

Nebuchadnezzar

I think one could also make a selectionist-type argument to the effect that all surviving models will tend to do well in some areas but poorly in others. If some did well in all areas those which do poorly in some areas but well in others would be selected out -- by the same sort of artificial selection that I appealed to earlier. But in this context it should always be remembered that doing poorly or well is relative -- something that does well does well compared to something else.

Likewise remembering that the selection involved is artificial selection such as that which has directed the evolution of domesticated dogs, grains and recipes. It places the responsibility for the process of selection by which some models survive while others are eliminated squarely on us.

We have to choose, and despite the fact that some will better fit the evidence that exists in various contexts better than others. We have to determine the relative weights and importance of different context. Yet there may often be good criteria for determining those weights.

But oddly enough, my reasoning has been more a matter of verbal logic than mathematical reasoning, and at no point have I actually made an appeal to the law of large numbers or cancellation of errors. And please note that I am not arguing that this is "the correct approach" but simply making an observation which to some extent has taken me by surprise.

Does it go something like:

1. Write out the expression for RMS error of the ensemble mean in terms of individual members.

2. Write out the expression for the mean RMS error over all members.

3. Without assuming anything about the sampling distribution of the perturbations, show that 1. is strictly less than or equal to 2.

?
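A quick numerical check of the three steps above, as a sketch only. To avoid leaning on any "centred on the truth" assumption, the synthetic models here are deliberately skewed: every one of them carries a positive bias relative to the synthetic observations. All names and numbers are made up for illustration.

```python
import math
import random

def rmse(model, obs):
    """Root-mean-square error of one model series against observations."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

random.seed(1)
N_MODELS, N_POINTS = 8, 500

obs = [random.gauss(0.0, 1.0) for _ in range(N_POINTS)]

# Assumption: each model = obs + a shared 0.5 offset + skewed positive noise,
# so the ensemble is NOT centred on the observations.
models = [[o + 0.5 + random.expovariate(1.0) for o in obs]
          for _ in range(N_MODELS)]

# Step 1: RMS error of the ensemble mean.
ensemble_mean = [sum(column) / N_MODELS for column in zip(*models)]
rmse_of_mean = rmse(ensemble_mean, obs)

# Step 2: mean RMS error over all members.
mean_of_rmses = sum(rmse(m, obs) for m in models) / N_MODELS

# Step 3: compare the two.
print(f"RMSE of mean: {rmse_of_mean:.3f}  mean of RMSEs: {mean_of_rmses:.3f}")
```

Changing the bias or the noise distribution changes the size of the gap but should never reverse the ordering.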

QUICK CORRECTION, third paragraph of my most recent comment...

I wrote, "We have to choose, and despite the fact that some will better fit the evidence that exists in various contexts better than others. We have to determine the relative weights and importance of different context. Yet there may often be good criteria for determining those weights."

I will often write long sentences and long paragraphs -- then try and break up the sentences and paragraphs to make things more readable. This is an instance of where I did a rather botched job of it.

As the word "despite" in the first sentence suggests, the first and the second sentence of this paragraph really should have been left together as a single sentence. They should have been separated not by a period but a comma.

My apologies.

Nebuchadnezzar and ac are getting warmer...

James, somehow I doubted that you were thinking along the same lines as I was.

Regarding RMS...

I like what I have seen so far, but one point that still needs to be dealt with is the question of a subset of the ensemble that does better than the ensemble itself, and as James pointed out earlier on you don't simply want to include more models in the hope that this will reduce ensemble error by increasing n.

Interested in seeing where this goes. Carry on.

Did I forget to mention the triangle inequality in part 3? That's pretty much qed as far as my working out says: abs(sum(obs - f[i])) <= sum(abs(obs - f[i])).
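Written out for whole fields rather than single numbers, the triangle-inequality argument might look like this. The notation is an assumption introduced here: $o$ is the vector of observations at $M$ points, $f_i$ is model $i$ of $N$, $\bar f$ is the multi-model mean, and $\|\cdot\|$ is the RMS norm.

```latex
\|x\| = \sqrt{\frac{1}{M}\sum_{j=1}^{M} x_j^2},
\qquad
\bar f = \frac{1}{N}\sum_{i=1}^{N} f_i

\|\bar f - o\|
  = \left\| \frac{1}{N}\sum_{i=1}^{N} (f_i - o) \right\|
  \le \frac{1}{N}\sum_{i=1}^{N} \|f_i - o\|
```

The left-hand side is the RMS error of the mean and the right-hand side is the average RMS error of the members. Equality needs every error vector $f_i - o$ to point in the same direction, and the same argument goes through with positive area weights inside the norm. Note that this bounds the mean only against the average member, not against the best one.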

Is it due somehow to the MMM being smoother than any of the individual members? As it is the squared error, if you have a random upward perturbation in the observations you will get a bigger squared error if it coincides with a downward random perturbation in the model output than you would if the random perturbations in the model output were averaged out?

In other words the difference is visible if you look at the bias-variance-noise decompositions?
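One way to make that decomposition concrete, offered as a sketch rather than as the answer James has in mind: at a single point with observation $o$, model values $f_1, \dots, f_N$ and ensemble mean $\bar f$ (notation assumed here), the mean squared error of the members splits exactly into two parts.

```latex
\frac{1}{N}\sum_{i=1}^{N} (f_i - o)^2
  = (\bar f - o)^2 + \frac{1}{N}\sum_{i=1}^{N} (f_i - \bar f)^2
```

The cross term vanishes because $\sum_i (f_i - \bar f) = 0$. Averaged over all points, this says the average squared error of the members equals the squared error of the mean plus the ensemble spread, so the mean can only lose to the average member when the spread is zero.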

James gives the answer...

Why the multi-model mean is so good!

THURSDAY, JUNE 17, 2010

http://julesandjames.blogspot.com/2010/06/why-multi-model-mean-is-so-good.html
