Thursday, June 17, 2010

Why the multi-model mean is so good!

Well, some of my commenters went off in not quite the right direction, but a few (especially Nebuchadnezzar and ac) got very close. As I mentioned in the comments myself, I had deliberately left the problem statement a little vague. The reason I did this is partly because the challenge is not just to be able to solve mathematical puzzles, but also to first create a mathematical framework in which to interpret the sort of messy and imprecise data that tends to obtain in the real world. Life is not a Martin Gardner puzzle but bits of his puzzles (generally rather simpler) can often be found in life if you know where to look. And partly, I was coy because if I'd stated the problem in unambiguous mathematical terms, it more-or-less answers itself :-)

So, without further ado, on with the show.

Let's write m_i for the models (i runs from 1 to n), M = 1/n Σ m_i for the multi-model mean, and O for the observations. All sums are over i. We start with the mean of the squared distances from the models to the observations:

1/n Σ |m_i - O|^2

and write it equivalently as 1/n Σ |(m_i - M) - (O - M)|^2

where all I have done is added and subtracted M and placed some brackets for convenience.

Use of the dot product identity a.a = |a|^2 allows us to multiply out this expression as

1/n Σ |m_i - M|^2 - 2/n Σ (m_i - M).(O - M) + |O - M|^2.

The final trick, such as it is, is to observe that the central sum of dot product terms is identically zero. Since (O - M) does not depend on i, it factors out of the sum, and Σ (m_i - M) = Σ m_i - nM = n(1/n Σ m_i - M) = 0 (zero, not O) by definition of M.

This leaves us with:

1/n Σ |m_i - O|^2 = 1/n Σ |m_i - M|^2 + |O - M|^2

and thus the squared distance from the obs to the multi-model mean is less than the mean of the squared distances from the individual models to the obs, by an amount which equals the average of the squared distances from the models to their mean.

Note that in the above there is absolutely no dependence on the details of the distribution of the models, or on the specific values that the observations take. This equation has nothing whatsoever to do with the model ensemble being centred on the truth. The multi-model mean also provides a better estimate of the climate of Mars than the individual models do! (M is the only such point that is guaranteed to be better than the models in this sense, whatever the obs are.)
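For anyone who prefers to see the identity checked numerically, here is a minimal sketch in Python. The function names (sq_dist, decompose), the lopsided exponential ensemble and the wildly placed obs are all illustrative assumptions, chosen only to make the point that nothing depends on the ensemble being centred on the observations.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def decompose(models, obs):
    """Return (mean squared model error, ensemble spread, squared error of the mean)."""
    n = len(models)
    dim = len(obs)
    mean = [sum(m[k] for m in models) / n for k in range(dim)]
    mean_model_err = sum(sq_dist(m, obs) for m in models) / n
    spread = sum(sq_dist(m, mean) for m in models) / n
    mean_err = sq_dist(mean, obs)
    return mean_model_err, spread, mean_err

rng = random.Random(0)
# Hypothetical ensemble: 8 models, 5 variables, skewed and nowhere near the obs.
models = [[rng.expovariate(1.0) for _ in range(5)] for _ in range(8)]
obs = [rng.uniform(-100.0, 100.0) for _ in range(5)]

lhs, spread, mean_err = decompose(models, obs)
# The two printed numbers should agree up to floating-point rounding.
print(lhs, spread + mean_err)
```

Changing the seed, the distributions or the dimensions should leave the two printed values equal to within rounding, since the identity is purely algebraic.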

A further point to note is that the result does not even depend on the oft-cited "cancellation of errors". Consider a 2-variable, 2-model system where the models take the values (1,3) and (3,1) and the obs are at (0,0). In this case, the models are both biased positive for both variables. Yet the mean (2,2) has a squared error of 4+4=8 and each model has a squared error of 1+9 = 10. The average squared model deviation from the mean is 1+1=2, balancing the equation. So all attempts to deduce properties of the ensemble based solely on the result presented here are utterly futile. There are more complex and subtle questions relating to when the multi-model mean is better than all of the models, but that's for another time.
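The arithmetic of this two-model example can be reproduced in a few lines. This is only a sketch of the numbers quoted above; the helper name sq_err is an assumption made for illustration.

```python
def sq_err(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

models = [(1, 3), (3, 1)]
obs = (0, 0)
n = len(models)

# Multi-model mean, component by component.
mean = tuple(sum(m[k] for m in models) / n for k in range(len(obs)))

model_errs = [sq_err(m, obs) for m in models]        # [10, 10]
mean_err = sq_err(mean, obs)                         # 8.0
spread = sum(sq_err(m, mean) for m in models) / n    # 2.0

# Every model is biased positive in every variable, so no errors cancel.
all_positive = all(x > o for m in models for x, o in zip(m, obs))

print(mean, model_errs, mean_err, spread, all_positive)
```

The mean of the model errors (10) equals the error of the mean (8) plus the spread (2), even though all the biases have the same sign.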

If you want an intuitive way to think of the result, it's a sort of Pythagoras' Theorem on average. m_i - O is the hypotenuse here, with the other vertex at M. Also, the full equation is just the Law of Cosines summed over each triangle with vertices m_i, M and O, since the dot product a.b = |a||b|cos θ, and the sum of all those terms cancels to zero.
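Spelled out, the Law of Cosines reading looks something like the following. The angle symbol θ_i (the angle at M between m_i - M and O - M) is notation introduced here for illustration; everything else follows the definitions above.

```latex
% Law of Cosines for the triangle with vertices m_i, M and O,
% where \theta_i is the angle at M:
\[
|m_i - O|^2 = |m_i - M|^2 + |O - M|^2 - 2\,|m_i - M|\,|O - M|\cos\theta_i
\]
% Average over i, using |a||b|\cos\theta = a \cdot b:
\[
\frac{1}{n}\sum_i |m_i - O|^2
  = \frac{1}{n}\sum_i |m_i - M|^2 + |O - M|^2
  - \frac{2}{n}\,(O - M)\cdot\sum_i (m_i - M),
\qquad \sum_i (m_i - M) = 0 .
\]
```

No individual triangle needs to have a right angle at M; it is only the average of the cosine terms that vanishes.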

I've written this out in very long-winded fashion, but it's such an elementary point and so fundamental to the analysis of ensembles that I still find it rather mind-boggling that no-one else in climate science seems to know about it. I wait with interest to see what the reviewers have to say.


crandles said...

This just seems to show that the first commentator, Arthur, got it right (albeit without the reconciliation that no-one but you provided).

The errors in the individual models (1, 3, 3 and 1) average to 2, whereas the MMM errors of 2 and 2 also average to 2. The variability in the errors of the MMM is less, so when you do an RMS calculation it appears better. This is similar to the way that, with a given length of fencing, you can enclose a larger area with a square than with a rectangle.

Anonymous said...

Hmmm.... but what you have shown is that the average error across all the models must be larger than the error in the multi-model mean. This is *not* the same as explaining why the multi-model mean is often better than *any* single model.

Anonymous said...

So model averaging is a form of stochastic cooling.

But why would this be mysterious?


William M. Connolley said...

> mean of the squared distances from the individual obs to the multi-model mean

"mean of the squared distances from the individual models to the obs"?

Timothy Chase said...

crandles wrote, "Hmmm.... but what you have shown is that the average error across all the models must be larger than the error in the multi-model mean. This is *not* the same as explaining why the multi-model mean is often better than *any* single model."

No it doesn't show that the multi-model mean is often better than any single model. But how exactly would you show that? As a matter of fact, the word "often" in this mathematical context is highly ambiguous. What he can do, presumably, is identify the conditions under which the multi-model will be better than any single model -- and James apparently believes that he can show that as well.

But what James has shown so far is a strict equality that has evidently escaped everyone -- until now -- that explains why, under all circumstances, the error of the multi-model will necessarily be smaller than the average error of the models that make up its ensemble -- and that is most certainly a step forward.

crandles said...

Ignore the last sentence I wrote - I am talking rubbish again.

Moving quickly OT to divert attention:

The IRI ENSO forecasts have moved significantly towards La Nina with the average of the dynamic models for JJA & JAS now at -0.8 and -1 (a month ago they were -.4 and -.5). 3 of the 15 dynamic models forecast -1.6 and below. Mid June Nino3.4 is already at -0.5.

If you assume a 7 month lag maybe this doesn't affect your chance of winning a bet with a record global average temperature this year. However I believe it will affect temperatures in some regions before the end of the year. As you are not using GISS, I think your chances may be diminishing. Any comments?

James Annan said...

Chris, I didn't get that out of what Arthur said, but I might have misunderstood it (the first part of his comment may have put me off the track).

Belette, fixed, but not exactly as you suggest!

Anon #1, well if the mean is better than the typical model (in practice by a significant margin), it is surely going to be better than *most* models (for reasonable distributions). As Tim says, the question of the mean being better than *all* models is not amenable to a direct proof. In fact in the paper we consider it in a probabilistic form - that is, what is the probability of a single model being better than the mean? It turns out that this probability varies widely depending on a handful of factors.
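To make the probabilistic framing concrete, here is a toy Monte Carlo sketch. It is not the calculation from the paper: the function name prob_model_beats_mean, the independent standard Gaussian models, and the obs placed at a fixed offset (bias) from the ensemble centre are all assumptions made purely for illustration.

```python
import random

def prob_model_beats_mean(n_models, dim, bias, trials=2000, seed=0):
    """Estimate P(one given model has a smaller squared error than the ensemble mean).

    Toy assumptions (not from the post): every model component is an
    independent N(0, 1) draw, and the obs sit at (bias, ..., bias).
    """
    rng = random.Random(seed)
    obs = [bias] * dim
    wins = 0
    for _ in range(trials):
        models = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_models)]
        mean = [sum(m[k] for m in models) / n_models for k in range(dim)]
        mean_err = sum((a - o) ** 2 for a, o in zip(mean, obs))
        # By symmetry any single model will do; take the first.
        model_err = sum((a - o) ** 2 for a, o in zip(models[0], obs))
        if model_err < mean_err:
            wins += 1
    return wins / trials

# A few hypothetical configurations: (number of models, dimensions, bias).
for n_models, dim, bias in [(10, 1, 0.0), (10, 1, 3.0), (10, 50, 0.0)]:
    print(n_models, dim, bias, prob_model_beats_mean(n_models, dim, bias))
```

Even in this toy setup the estimate moves a lot with the number of dimensions and with how far the obs sit from the ensemble centre, which is at least consistent with the remark that the probability depends on a handful of factors.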

James Annan said...

Chris, an ENSO post is long overdue, and on my list. In brief, yes I suspect you are right...

Alastair said...

What if the errors are (-1, 3) and (-3, 1)?

ac said...

Neat. I recall reading a similar analysis in the seasonal prediction literature, but the focus was correlation, not RMSE, and the theme was potential predictability. Could have been a tech report or just a set of slides. If it's of interest I might be able to scrape my memory for an author or other unique identifier.

Alastair said...

Answering my own question:

-1, 3 -> 1 + 9 = 10
-3, 1 -> 9 + 1 = 10
-2, 2 -> 4 + 4 = 8

so that works, but what about

2, 4 -> 4 + 16 = 20
4, 6 -> 16 + 36 = 52
3, 5 -> 9 + 25 = 34

Mean of 34 > 1st model of 20!

James Annan said...

ac, I'd definitely be interested in seeing any other written version of this. I suspect it has "folklore" status and is rarely written down as people are usually working on more advanced topics. I certainly can't believe it is unknown in the broader community, indeed it jumped into my mind immediately when a reviewer asked about it so I'm sure I must have seen it in a different context, but I don't know where. The multimodel NWP stuff I've read doesn't seem to quite state it though.

Alastair, we can easily create ensembles where one model is closer to the obs than the mean is - just add a perfect model (ie one that matches the obs) to the ensemble! Also, error cancelling will often occur in practice, but it is not necessary for the formula to work.
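Both points can be checked against Alastair's second example. The sketch below uses hypothetical helper names (sq_err, summarise) and shows that the identity still balances when one model beats the mean, and again after a perfect model is added to the ensemble.

```python
def sq_err(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def summarise(models, obs):
    """Return (per-model squared errors, squared error of the mean, spread)."""
    n = len(models)
    mean = tuple(sum(m[k] for m in models) / n for k in range(len(obs)))
    model_errs = [sq_err(m, obs) for m in models]
    spread = sum(sq_err(m, mean) for m in models) / n
    return model_errs, sq_err(mean, obs), spread

obs = (0, 0)

# Alastair's example: the first model (20) beats the mean (34),
# yet the average model error is 36 = 34 + 2.
errs, mean_err, spread = summarise([(2, 4), (4, 6)], obs)
print(errs, mean_err, spread)

# Add a "perfect" model that matches the obs exactly.
errs_p, mean_err_p, spread_p = summarise([(2, 4), (4, 6), obs], obs)
print(errs_p, mean_err_p, spread_p)
```

In both cases the average of the per-model errors equals the error of the mean plus the spread, so the formula is untouched by any single model happening to do better than the mean.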

Steve Bloom said...

Chris, remember that 1998 dropped off sharply at the end of the year, so all is not lost. All else equal I suppose we can expect this year to drop off less sharply. So for now the odds look to me to be no worse than even, probably a bit better.

Also, which source was being used for temps?

Steve Bloom said...

Looking at GISTEMP, I was surprised to see that Jan-May 2010 is a net 0.54 degrees ahead of 1998, and indeed after August 1998 anomalies dropped off to levels that haven't been seen for 10 years, so at this point I'd have to say I'll be surprised if 2010 isn't a new record.

James Annan said...

Hey, the ENSO thread is that way ^

(I know it's my fault for not writing it yet!)

crandles said...

>Hey, the ENSO thread is that way ^


>It turns out that this probability varies widely depending on a handful of factors.

Clearly the greater the variation between models, the greater the advantage the MMM has.

The nearer the obs trend is to the MMM trend, the more likely the MMM is to beat all models.

If the individual models have very similar trends then the MMM is more likely to beat all models.

A model having very similar deviations from trend to the obs may make it more likely to beat the MMM.

I doubt I am getting up to the 'handful' yet.

Oops, that also was "for another time".

Dikran Marsupial said...

The result that the sum-of-squared error (SSE) of the mean of a committee of models is less than the mean SSE of the individual committee members is quite well known in machine learning, where committees of models are often used as a variance reduction technique. For example, see pages 365-6 of "Neural networks for pattern recognition" by Chris Bishop (an excellent book on the topic - 14,000 cites according to Google Scholar!), which obtains the same result by a slightly different method.

The reason this happens is because the MMM is smoother and towards the middle of the distribution of the individual models. The SSE punishes large errors much more harshly than small errors (due to the squaring). This means that the average of the errors for the individual models will be dominated by the models with the greatest errors. This happens for two reasons, either they lie on the other side of the mean to the observations (high bias) or they are very noisy and the noise is badly correlated with the observations (high variance).

James's proof is still neat though, and easier to understand without the use of expectations.

James Annan said...

Thanks DM, I'm sure this idea pops up all over the place where people use ensembles. The book is here for anyone interested. I agree that my version is neater :-)

Arthur said...

Yes, the second part of my comment was essentially James' point - however he hadn't exactly specified the problem completely, and I was thinking more in terms of the question of whether observations remain within the bounds set by the variance in the "model" - well slightly confused on the point anyway. Within the context of RMS differences, this post does nicely explain why the MMM is always "better".

Chip Knappenberger said...


I came across a reference to a similar topic in the forecasting literature recently. Perhaps you may find it to be of interest:


James Annan said...

Chip, thanks for that.

Coincidentally I saw something related a while back...this Armstrong article on significance testing criticises a paper KFHS which states "Finally, we find that the M3 conclusion that a combination of methods is better than that of the methods being combined was not proven."

Well it has been now :-)

(The Armstrong paper is unaware of the proof, but focuses on the inappropriate use of significance testing - another of my frequent complaints.)

I can't help but be amused that these self-appointed experts in forecasting don't know this result (though as DM showed, it is definitely well known in some quarters). Anyway I can hardly be too snarky about Armstrong when the climate scientists were unaware of it too!

David B. Benson said...

Oh my, that was easy!

But I've not seen exactly that before, as best as I can recall.

Chip Knappenberger said...


Armstrong does seem to rely on empirical evidence rather than a proof like yours.

Looking through his Principles of Forecasting (available to browse at Amazon), especially Armstrong's section on Combining Forecasts in Chapter 13, it seems dominated by empirical arguments.

You could always ask him if the mathematical solution is well-known... :^)


James Annan said...

Wow, there's decades of literature on combining forecasts surveyed there, and it seems that none of them know this result.

William M. Connolley said...

I'm not convinced your result explains things. Yes it is very neat and all, but it only explains what you say: why |avg(m)-O| is better than avg(|m-O|) (or whatever :-).

But the "observation" is that avg(m) is generally better than most if not all of the models. I don't think you've explained that.

Go on, write a paper dissing Armstrong and his ilk.

Anonymous said...

Closest I can find are these:

From a reference in an ECMWF paper, which also discusses the issue.

James Annan said...


I'll post about the mean being better than all models, but this result explains precisely the "thing" that I set out to explain, which is that the mean outperforms most models (though not necessarily all) by a significant margin.

I can't really bring myself to dis Armstrong when it seems that no-one else knows this result (even though some can at least prove that the mean always outperforms a typical model). I'm relieved that my co-author vetoed some slightly sarcastic comments in our manuscript about climate scientists being unaware of this well-known result!

crandles said...

First of two papers linked by anonymous referenced Gauss 1809 so obviously your search won't be complete until you have gone back at least that far ;)

James Annan said...

I think those papers were talking more about something like the standard way of combining Gaussian estimates, ie the x1*s2^2/(s1^2+s2^2) + x2*s1^2/(s1^2+s2^2) formula (or something like that). But I haven't yet had time to read them very carefully. They are also only talking about combining two things, which may be a special case.
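For reference, the formula quoted above is the usual inverse-variance weighting of two independent Gaussian estimates. Here is a minimal sketch; the function name combine_gaussian is hypothetical, and the combined standard deviation is the standard textbook result for independent estimates rather than something stated in the comment.

```python
def combine_gaussian(x1, s1, x2, s2):
    """Inverse-variance weighted combination of two independent Gaussian estimates.

    x1, x2 are the estimates and s1, s2 their standard deviations.
    Returns (combined estimate, combined standard deviation).
    """
    v1, v2 = s1 ** 2, s2 ** 2
    # Each estimate is weighted by the *other* estimate's variance.
    mean = (x1 * v2 + x2 * v1) / (v1 + v2)
    # Assumes the two estimates are independent.
    sd = (v1 * v2 / (v1 + v2)) ** 0.5
    return mean, sd

# Equal uncertainties reduce to a plain average.
print(combine_gaussian(1.0, 1.0, 3.0, 1.0))
# Unequal uncertainties pull the result towards the more certain estimate.
print(combine_gaussian(0.0, 1.0, 10.0, 3.0))
```

Unlike the multi-model mean identity, this weighting needs an uncertainty attached to each estimate, which is one reason it reads as a different result.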

Anyway, if Eugenia Kalnay says she hasn't seen it before (and she does), that's good enough for me! I'm still baffled by the idea that this equation can really be new though. It seems like it might have a bit of an impact.

crandles said...

>"It seems like it might have a bit of an impact."

An equation allows the effect to be quantified. Do you foresee people calculating the RMSE for the MMM, then increasing this answer by a factor to reflect the advantage it has, before comparing with the other models' RMSE?

Or something similar, or some other quantitative use, or is the impact you refer to something completely different?

James Annan said...

Well I don't think it will really cause a revolution, but since it seems like it may be new in fields as diverse as NWP and "Forecasting" (financial) I would expect quite a few to at least take note. One thing it will do is more securely justify the use of the MMM. It (or a modified version) may help with unequal weighting too, but I haven't thought about that yet.