Saturday, January 14, 2006

Probability, prediction and verification II: Forecast skill

Forecast skill is generally defined as the performance of a particular forecast system in comparison to some other reference technique. For example, from the AMS:

skill—A statistical evaluation of the accuracy of forecasts or the effectiveness of detection techniques. Several simple formulations are commonly used in meteorology. The skill score (SS) is useful for evaluating predictions of temperatures, pressures, or the numerical values of other parameters. It compares a forecaster's root-mean-squared or mean-absolute prediction errors, Ef, over a period of time, with those of a reference technique, Erefr, such as forecasts based entirely on climatology or persistence, which involve no analysis of synoptic weather conditions:
SS = 1 - Ef/Erefr

If SS > 0, the forecaster or technique is deemed to possess some skill compared to the reference technique.
(The UK Met Office gives essentially the same definition.)

For short-term weather forecasting, persistence (tomorrow's weather will be like today's) will usually be a suitable reference technique. For seasonal forecasting, climatology (July will be like a typical July) would be more appropriate. Note that although skill is essentially a continuous comparative measure between two alternative forecast systems, there is also a discrete boundary between positive and negative skill, which represents the point below which the output from the forecast system under consideration is worse than the reference technique. When the reference technique is some readily available null hypothesis such as the two examples suggested, it is common to say that a system with negative skill is skill-free. But as IRI states, it is also possible to compare two sophisticated forecast systems directly (e.g. the skill of the UKMO forecast using NCAR as a reference, and vice versa). In that case, of course, it would be unreasonable to describe the poorer as skill-free - it would merely be less skillful.
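As a minimal sketch of the definition quoted above, assuming RMS error as the error measure, the skill score could be computed along the following lines. The function names (`rmse`, `skill_score`) and all the temperature values are invented purely for illustration; they do not come from the AMS, the UKMO or any real forecast.

```python
import math

def rmse(forecast, observed):
    """Root-mean-squared difference between two equal-length sequences."""
    if len(forecast) != len(observed):
        raise ValueError("sequences must have the same length")
    total = sum((f - o) ** 2 for f, o in zip(forecast, observed))
    return math.sqrt(total / len(observed))

def skill_score(forecast, reference, observed):
    """SS = 1 - Ef/Erefr, with both errors measured against the observations."""
    e_f = rmse(forecast, observed)
    e_ref = rmse(reference, observed)
    return 1.0 - e_f / e_ref

# Hypothetical daily maximum temperatures in degrees C (invented numbers).
observed = [12.0, 14.0, 13.0, 15.0, 16.0]
# Persistence: each day's forecast is simply the previous day's observation.
persistence = [11.0, 12.0, 14.0, 13.0, 15.0]
# A made-up model forecast for the same five days.
model = [12.5, 13.5, 13.5, 14.5, 15.5]

# Skill of the model relative to persistence: positive here.
ss_model = skill_score(model, persistence, observed)
# Swapping the roles gives the poorer system a negative score.
ss_persistence = skill_score(persistence, model, observed)

print(f"model vs persistence: {ss_model:.2f}")
print(f"persistence vs model: {ss_persistence:.2f}")
```

Note that the observations enter both error terms: the reference is a competing forecast, and it is scored against the same data as the system being tested.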

With that in mind, let's turn our attention to what Roger Pielke Snr has to say on the subject of forecast skill in climate modelling. Recently, he published a guest post from Hank Tennekes, about which more later. But first, his assertions regarding model skill in the comments caught my eye.

Roger repeated his claim that the climate models have no skill (an issue he's repeatedly raised in earlier posts). Eli Rabett asked him to explain what he meant by this statement. I found his reply rather unsatisfactory, and you can see our exchanges further down that page.

Incredibly, it turns out that Roger is claiming it is appropriate to use the data themselves as the reference technique! If the model fails to predict the trend, or variability, or nonlinear transitions, shown by the data (to within the uncertainty of the data themselves) then in his opinion it has no skill. Of course, given the above introduction, it will be clear why this is a wholly unsuitable reference technique. These data are not available at the time of making the forecast, and so cannot reasonably be considered a reference for determining the threshold between skillful and skill-free. In fact, there is no realistic way that any forecast system could ever be expected to match the data themselves in this way - certainly, one cannot expect a weather forecast to predict the future observations to such a high degree of precision. Will the predicted trend of 6C in temperature (from 3am to 3pm today) match the observations to within their own precision (perhaps 0.5C in my garden, but I could easily measure more accurately if I tried)? It's possible on any given day, but it doesn't usually happen - the error in the forecast is usually about 1-2C. Will the predicted trend in seasonal change from January to July be more accurate than the measurements which are yet to be made? Of course not. It's a ridiculous suggestion. According to his definition, virtually all models are skill-free, all the time. By using an established term from the meteorological community in such an idiosyncratic way, he is misleading his readers, and with his background, it's hard to believe that he doesn't realise this.

I've asked him several times to find any examples from the peer-reviewed literature where the future data are used as the reference technique for determining whether a forecast system has skill or not. There's been no substantive reply, of course.

For a climate forecast, I think that a sensible starting point would be to use persistence (the next 30 years will look like the last) as a reference. By that measure, I am confident that model forecasts have skill, at least for temperature on broad scales (I've not looked in detail at much else). And as you all know, I've been prepared to put money on it.


Well, I thought that Roger was coming round to realising his error, but in fact he's just put up another post re-iterating the same erroneous point of view. So I'll go over it in a bit more detail.

A skill test is a comparison of two competing hypotheses - usually a model forecast and some null hypothesis such as persistence - to see which has lower errors. Every source I have seen (AMS, UKMO, IRI) uses essentially the same definition, and Roger specifically cited the AMS version when asked what he meant by his use of the term. To those who disingenuously or naively say "Without a foundation in the real world (i.e. using observed data), the skill of a model forecast cannot be determined", I reply - of course, that is what "lower errors" in the above sentence means - the error is the difference between the hypothesis and observed reality! I'll write it out in more detail for any who still haven't got the point.

Repeating the formula above:

SS = 1 - Ef/Erefr

where Ef = E(m-o) is the RMS difference between model forecast m and observations o, and Erefr = E(r-o) is the RMS difference between reference hypothesis r and observations o. Do I really need to spell out why one cannot use the observations themselves as the reference hypothesis? Do I actually have to write down that in this case the formula becomes

SS = 1 - E(m-o)/E(r-o) = 1 - E(m-o)/E(o-o)?

I repeat again my challenge to Roger, or anyone else, to find any example from the peer-reviewed literature where the target data have been used as the reference hypothesis for the purposes of determining whether a forecast has skill. I am confident that most readers will draw the appropriate conclusion from the fact that no such example has been forthcoming, even if they've never encountered a divide-by-zero error.
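For anyone who would like to see the degenerate case run, here is a small sketch, again assuming RMS error and using invented numbers. The names `rms_error` and `skill_score` and the explicit zero check are my own illustrative choices, not part of any published definition.

```python
import math

def rms_error(hypothesis, observed):
    """RMS difference E(h-o) between a hypothesis and the observations."""
    total = sum((h - o) ** 2 for h, o in zip(hypothesis, observed))
    return math.sqrt(total / len(observed))

def skill_score(forecast, reference, observed):
    """SS = 1 - E(m-o)/E(r-o); undefined when the reference error is zero."""
    e_ref = rms_error(reference, observed)
    if e_ref == 0.0:
        raise ZeroDivisionError("E(r-o) is zero, so SS is undefined")
    return 1.0 - rms_error(forecast, observed) / e_ref

# Hypothetical temperature anomalies (invented numbers).
observed = [0.1, 0.3, 0.2, 0.5]
model = [0.2, 0.2, 0.3, 0.4]

try:
    # Using the observations as their own reference gives E(o-o) = 0.
    skill_score(model, observed, observed)
    outcome = "defined"
except ZeroDivisionError:
    outcome = "undefined"

print("SS with the observations as reference is", outcome)
```

Any model that is not perfect has E(m-o) > 0, so with this "reference" the ratio is unbounded and no forecast could ever register positive skill.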


Anonymous said...

John Fleck says -

Thanks. This is very helpful. But I am still confused. It sounds like, for a long-range climate model of the next thirty years, the only way we can evaluate the "skill," as defined here, is to wait thirty years and then compare the model's output with persistence as you've defined it.

Should we, for purposes of evaluating the model today, use the last thirty years, comparing model output, a persistence forecast and actual data?

Anonymous said...

>I think that a sensible starting point would be to use persistence (the next 30 years will look like the last) as a reference.

That seems like a very low reference point to me. I would have thought a better reference point would be:

Any 30 year period will look like an extrapolation of any trend or cycles that are apparent from looking at the previous 100 years of data.


Anonymous said...

Should have added that I am assuming that my reference level has skill compared to yours. If it doesn't then your reference would be the one to use.


Lumo said...

It has been explained in quite a detail on that your opinion that the GCMs have skill with respect to simpler models is patently false.

It is known that you get more accurate results on average if you replace any of the current GCMs with a simple linear function of forcings. And there are probably many other procedures that give better results than GCMs. Moreover, they are more transparent.

If you read my blog, you would see that people like William Connolley who are unable to think without computers end up incapable of solving simple probabilistic problems from elementary school - with or without the help of computers.

James Annan said...


"with respect to simpler models": at the level of globally-averaged surface temperature, I'm sure that a simple input-output relationship works well. Would you care to speculate over what such a model implies for climate sensitivity, and predicts for the future, BTW? I guess most of the CA crowd haven't thought about that too hard...

Of course we really want useful regional forecasts, including precipitation changes, and changes in distribution including extremes, not just the global mean surface temperature. Nevertheless, the simple approaches do help to confirm that the physics in the GCMs is reasonably well understood. There's certainly value in both top-down, data-based and bottom-up, fundamental physical modelling. When their predictions coincide (as they do here), it's particularly encouraging.


You're getting ahead of me :-)


We could also use a more sophisticated reference technique for weather forecasts, such as looking at the recent trend in pressure, or extrapolating from weather maps...soon, you find yourself building a numerical model to calculate your reference forecast :-) I'm not sure that any simple extrapolation would work well for climate forecasting over the historical record (especially at regional scales) but it might be interesting to look at this - one would need to formalise your "trends and cycles" into a simple formula. Of course, extrapolating the current linear trend happens to give us a pretty good forecast right now, but we only know that cos all models (and simple analyses) tell us so...

EliRabett said...

To ride my hobby horse again, mostly because I am seriously interested in whether it has any validity.

Divide the data set in half (say take half the observation stations). Call that the data (d). Call the other half the test set (t). Call the model prediction (m). First find E(t-d). Now divide the data set up again, and again, until you have a distribution of E(t-d). Now find SS = 1 - E(m-d)/E(t-d). It would be a really tough test tho.

Anonymous said...

I appreciate your point about looking at recent trends in pressure becoming a weather model itself. However I think what I am saying about climate is more relevant.

I agree I would have to formalise trends and cycles into a simpler formula. I wouldn't want to overfit a pattern so as a first stab in the dark, I would suggest finding a best fit allowing 4 variables to change: sine wave amplitude, sine wave frequency, sine wave offset and trend growth.

I think the difference between trend analysis, which needs no knowledge of cause and effect, and a climate model which does claim to understand cause and effect is significant.

I am concerned that modelers have an incentive to choose a low reference point so that they can claim their model has skill. If you choose the reference point you are advocating then I think you are claiming as model skill some of the skill attributable to a 'no knowledge of cause and effect' technique.

An ordinary person might have cause to be unimpressed at this bias and become more skeptical of the abilities of climate models.

Just for the record, I would think models would still have skill using the reference point I am suggesting. I also agree that Roger Pielke Snr is setting the reference point way too high.


James Annan said...


In general terms, this sort of cross-validation is certainly a good idea - here is one I did earlier :-) I'm not sure exactly that your particular plan is a good way of going about it. I'm going to blog more about this.


Well the main point of a skill test is not really to boast that the model has some skill, but rather to provide a reference point for comparisons. We can compare the skill of modern weather forecasts to those of decades ago without ever needing to run both systems in parallel. The only point at which the "no skill" threshold really has any meaning is if you want to throw away the model entirely, at which point you'd better make sure there really is a null hypothesis that performs as well. There are certainly lots of examples where simple statistical analyses do outperform complex models, but on the other hand they cannot be expected to extrapolate reliably outside the historical regime unless they are built upon an understanding of the underlying reality. Often they are built as an approximation of the model output itself, to facilitate the use of ensemble methods at low computational cost.

EliRabett said...

Hi, I'm more interested in the principle than the implementation. I had no expectation that the particular method I had thought up in 0.1 sec was going to work (actually at the 0.2 sec mark I thought of a few problems but I had hit the post button by then), and really thought it unlikely that no one else had thought of anything similar. What I was scratching for was a previous example. Thanks.

Anne van der Bom said...

In the links to Pielke Sr's blog, replace "" with ""