Forecast skill is generally defined as the performance of a particular forecast system in comparison to some reference technique. For example, from
the AMS:
skill—A statistical evaluation of the accuracy of forecasts or the effectiveness of detection techniques. Several simple formulations are commonly used in meteorology. The skill score (SS) is useful for evaluating predictions of temperatures, pressures, or the numerical values of other parameters. It compares a forecaster's root-mean-squared or mean-absolute prediction errors, E_f, over a period of time, with those of a reference technique, E_refr, such as forecasts based entirely on climatology or persistence, which involve no analysis of synoptic weather conditions:
SS = 1 - E_f/E_refr
If SS > 0, the forecaster or technique is deemed to possess some skill compared to the reference technique.
(The
UK Met Office gives essentially the same definition.)
For short-term weather forecasting, persistence (tomorrow's weather will be like today's) will usually be a suitable reference technique. For seasonal forecasting, climatology (July will be like a typical July) would be more appropriate. Note that although skill is essentially a continuous comparative measure between two alternative forecast systems, there is also a discrete boundary between positive and negative skill, below which the output of the forecast system under consideration is worse than that of the reference technique. When the reference technique is some readily available null hypothesis such as the two examples suggested, it is common to say that a system with negative skill is skill-free. But as
IRI states, it is also possible to compare two sophisticated forecast systems directly (e.g. the skill of the UKMO forecast using NCAR as the reference, and vice versa). In that case, of course, it would be unreasonable to describe the poorer system as skill-free - it would merely be less skillful.
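For concreteness, here is a minimal sketch (in Python, with entirely invented numbers rather than real data) of how such a skill score might be computed for a short-term temperature forecast, using persistence as the reference technique:

import math

# Minimal sketch of the skill score SS = 1 - E_f/E_refr using RMS errors,
# with persistence (tomorrow's value = today's) as the reference technique.
# All numbers below are invented purely for illustration.

obs      = [14.0, 15.5, 13.0, 16.0, 17.5]   # verifying observations
forecast = [13.5, 15.0, 14.0, 16.5, 17.0]   # the forecast being evaluated
persist  = [13.0, 14.0, 15.5, 13.0, 16.0]   # persistence: the previous day's observation

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

E_f    = rmse(forecast, obs)    # forecast error
E_refr = rmse(persist, obs)     # reference (persistence) error

SS = 1.0 - E_f / E_refr
print(SS)    # SS > 0: the forecast beats persistence, so it has skill;
             # SS < 0: it does worse than the reference.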
With that in mind, let's turn our attention to what Roger Pielke Snr has to say on the subject of forecast skill in climate modelling. Recently, he published a
guest post from Hank Tennekes, about which more later. But first, his assertions regarding model skill in the comments caught my eye.
Roger repeated his claim that the climate models have no skill (an issue he has raised in several earlier posts). Eli Rabett
asked him to explain what he meant by this statement. I found his reply rather unsatisfactory, and you can see our exchanges further down that page.
Incredibly, it turns out that Roger is claiming it is appropriate to use the data themselves as the reference technique! If the model fails to predict the trend, variability, or nonlinear transitions shown by the data (to within the uncertainty of the data themselves), then in his opinion it has no skill. Of course, given the above introduction, it will be clear why this is a wholly unsuitable reference technique. These data are not available at the time the forecast is made, and so cannot reasonably be considered a reference for determining the threshold between skillful and skill-free. In fact, there is no realistic way that
any forecast system could
ever be expected to match the data themselves in this way - certainly, one cannot expect a weather forecast to predict the future observations to such a high degree of precision. Will the predicted trend of 6C in temperature (from 3am to 3pm today) match the observations to within their own precision (perhaps 0.5C in my garden, but I could easily measure more accurately if I tried)? It's possible on any given day, but it doesn't usually happen - the error in the forecast is usually about 1-2C. Will the predicted trend in seasonal change from January to July be more accurate than the measurements which are yet to be made? Of course not. It's a ridiculous suggestion. According to his definition, virtually all models are skill-free, all the time. By using an established term from the meteorological community in such an idiosyncratic way, he is misleading his readers, and with his background, it's hard to believe that he doesn't realise this.
I've asked him several times to find any examples from the peer-reviewed literature where the future data are used as the reference technique for determining whether a forecast system has skill or not. There's been no substantive reply, of course.
For a climate forecast, I think that a sensible starting point would be to use persistence (the next 30 years will look like the last) as a reference. By that measure, I am confident that model forecasts have skill, at least for temperature on broad scales (I've not looked in detail at much else). And as you all know, I've been prepared to
put money on it.
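As an illustration of that kind of test, here is a sketch along the same lines, with invented anomaly numbers standing in for real data: the forecast is a steady warming trend, and the reference is persistence of the previous 30-year mean.

import math

# Hypothetical illustration only: the anomalies below are invented, not real data.
# Forecast: a steady warming trend over the next 30 years.
# Reference: persistence, i.e. the next 30 years look like the mean of the last 30.

past_mean = 0.10                                   # mean anomaly of the previous 30 years (C)
obs       = [0.15 + 0.02 * t for t in range(30)]   # the eventual "observations", with a trend
forecast  = [0.12 + 0.02 * t for t in range(30)]   # projection: similar trend, small offset
persist   = [past_mean] * 30                       # persistence reference: no change

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

SS = 1.0 - rmse(forecast, obs) / rmse(persist, obs)
print(SS)    # positive here, because a trending forecast easily beats a no-change reference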
Update: Well, I thought that Roger was coming round to realising his error, but in fact he's just put up
another post reiterating the same erroneous point of view. So I'll go over it in a bit more detail.
A skill test is a comparison of two competing hypotheses - usually a model forecast and some null hypothesis such as persistence - to see which has lower errors. Every definition I have seen (
AMS,
UKMO,
IRI) uses essentially the same definition, and Roger specifically cited the AMS version when asked what he meant by his use of the term. To those who disingenuously or naively say "Without a foundation in the real world (i.e. using observed data), the skill of a model forecast cannot be determined", I reply - of course, that is what
lower errors in the above sentence means - the
error is the difference between the hypothesis and observed reality! I'll write it out in more detail for any who still haven't got the point.
Repeating the formula above:
SS = 1 - E_f/E_refr
where E_f = E(m-o) is the RMS difference between model forecast m and observations o, and E_refr = E(r-o) is the RMS difference between reference hypothesis r and observations o. Do I really need to spell out why one cannot use the observations themselves as the reference hypothesis? Do I actually have to write down that in this case the formula becomes
SS = 1 - E(m-o)/E(r-o) = 1 - E(m-o)/E(o-o)?
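For anyone who wants to see it happen, here is the same calculation sketched in Python (again with made-up numbers): setting the reference equal to the observations themselves makes the denominator identically zero.

import math

# Why the observed data cannot serve as their own reference technique:
# with r = o, the denominator E(r-o) = E(o-o) is exactly zero.
# Numbers are invented purely for illustration.

obs      = [14.0, 15.5, 13.0, 16.0, 17.5]
forecast = [13.5, 15.0, 14.0, 16.5, 17.0]

def rmse(pred, truth):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

E_f    = rmse(forecast, obs)    # E(m-o): small, but positive
E_refr = rmse(obs, obs)         # E(o-o): exactly zero

SS = 1.0 - E_f / E_refr         # raises ZeroDivisionError - the skill score is undefined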
I repeat again my challenge to Roger, or anyone else, to find any example from the peer-reviewed literature where the target data have been used as the reference hypothesis for the purposes of determining whether a forecast has skill. I am confident that most readers will draw the appropriate conclusion from the fact that no such example has been forthcoming, even if they've never encountered a divide-by-zero error.