...or, how to perform the most elementary null hypothesis significance tests.
Roger Pielke has been saying some truly bizarre and nonsensical things recently. The pick of the bunch IMO is this post. The underlying question is:
Are the models consistent with the observations over the last 8 years?
So Roger takes the ensemble of model outputs (8 year trend as analysed in this RC post), and then plots some observational estimates (about which more later), which clearly lie well inside the 95% range of the model predictions, and apparently without any shame or embarrassment adds the obviously wrong statement:
"one would conclude that UKMET, RSS, and UAH are inconsistent with the models".
Um....no, one would not:

Update: OK, there are a number of things wrong with this picture. First, these "Observed temperature trends" stated on the left, calculated by Lucia, are actually per century not per decade, although I think they have been plotted in the right place. When OLS is used on the 8-year trends (to be consistent with the model analysis), the various obs give results of around 0.13 - 0.26C/decade, with my HadCRU analysis actually being at the lower end of this range. Second, the pale blue lines purporting to show "95% spread across model realizations" are in the wrong place. Roger seems to have done a 90% spread (5-95% coverage) which is about 20% too narrow, in terms of the range it implies.
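For anyone wanting to check the "about 20% too narrow" remark, here is a minimal sketch. It assumes the model spread is Gaussian (as argued further down), so the half-widths of the 90% and 95% ranges are just fixed multiples of the standard deviation. The variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Gaussian, mean 0 and sd 1

# Half-width of a central range, in units of the standard deviation.
z90 = std_normal.inv_cdf(0.95)    # 5-95% coverage
z95 = std_normal.inv_cdf(0.975)   # 2.5-97.5% coverage

ratio = z90 / z95          # how wide the 90% range is relative to the 95% range
widening = z95 / z90 - 1   # how much the 90% range must grow to become 95%

print(f"90% half-width: {z90:.3f} sd, 95% half-width: {z95:.3f} sd")
print(f"90% range is {ratio:.0%} of the 95% range; needs widening by {widening:.0%}")
```

Under that Gaussian assumption the 90% range has to be widened by a little under a fifth to get a proper 95% range.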
I challenged this obvious absurdity and repeatedly asked him to back it up with a calculation. After a lot of ducking and weaving, about the 30th comment under the post, he eventually admits "I honestly don't know what the proper test is". Isn't thinking about the proper test a prerequisite for confidently asserting that the models fail it? Anyway, I'll walk through it here very slowly for the hard of understanding. I'll use Wilks "Statistical methods in the atmospheric sciences" (I have the 1st edition), and in particular Chapter 5: "Hypothesis testing". It opens:
Formal testing of hypotheses, also known as significance testing, is generally covered extensively in introductory courses in statistics. Accordingly, this chapter will review only the basic concepts behind formal hypothesis tests...[cut]
and then continues with:
5.1.3 The elements of any hypothesis test
Any hypothesis test proceeds according to the following five steps:
1. Identify a test statistic that is appropriate to the data and question at hand.
This is a gimme. Obviously, the question that Roger has posed is about the 8-year trend of observed mean surface temperature. I'm going to use an ordinary least squares (OLS) fit because that is what is already available for the models, and it is also by far the most commonly used method for trend estimation and has well understood properties. For some unstated reason, Roger chose to use Cochrane-Orcutt estimates for the observed data that he plotted in his picture, but I do not know how well that method performs for such a short time series or how it compares to OLS. Anyone who wishes to repeat the analysis using C-O should find it easy enough in principle: they will need to get the raw model output (freely available) and analyse it in that manner. I would bet a large sum of money that this will not change the results qualitatively.
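For concreteness, the test statistic can be sketched as follows. This is just the textbook OLS slope formula, scaled to a decadal trend. The function name is mine, and the input series is synthetic (an exact straight line), purely to show that the function recovers a known trend. It is not observational data.

```python
def ols_trend_per_decade(years, temps):
    """Ordinary least squares slope of temps against years, in units per decade."""
    n = len(years)
    mean_year = sum(years) / n
    mean_temp = sum(temps) / n
    sxy = sum((y - mean_year) * (t - mean_temp) for y, t in zip(years, temps))
    sxx = sum((y - mean_year) ** 2 for y in years)
    return 10 * sxy / sxx  # slope per year, times 10 years

# Synthetic example only: 8 annual values rising at exactly 0.02 per year.
years = list(range(2000, 2008))
synthetic = [0.4 + 0.02 * (y - 2000) for y in years]

trend = ols_trend_per_decade(years, synthetic)
print(f"recovered trend: {trend:.2f} per decade")
```

Feeding in the last 8 annual means of any of the observational data sets gives the corresponding 8-year trend.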
2. Define a null hypothesis.
Easy enough, the null hypothesis H0 here that I wish to test is that the models correctly predict the planetary temperature trend over 2000-2007. If anyone has any other suggestion for what null hypothesis makes sense in this situation, I'm all ears.
3. Define an alternative hypothesis.
"H0 is false". This all seems too easy so far....there must be something scary around the corner.
4. Obtain the null distribution, which is simply the sampling distribution of the test statistic given that the null hypothesis is true.
OK, now the real work - such as it is - starts. First, we have the distribution of trends predicted by the models. As RC have shown, this is well approximated by a Gaussian N(0.19,0.21), that is, a mean of 0.19 and a standard deviation of 0.21, both in C/decade. (I am going to stick with decadal trends throughout rather than using a mix of time scales to give me less chance of embarrassingly dropping factors of 10 as Roger has done in several places in his post. He has also plotted his blue "95%" lines in the wrong place, but I've got bigger fish to fry.) There are firm theoretical reasons why we should expect a Gaussian to provide a good fit (basically the Central Limit Theorem). This distribution isn't quite what we need, however. The model output (as analysed) uses perfect knowledge of the model temperature, whereas the observed estimate for the planet is calculated from limited observational coverage. In fact, CRU estimate their observational errors at about 0.025 for each year's mean (at one standard deviation). This introduces a small additional uncertainty of about 0.04 on the decadal trend. That is, if the true planetary trend is X, say, then the observational analysis will give us a number in the range [X-0.08,X+0.08] with 95% probability.
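The post does not spell out how the 0.04 is obtained, so the following is only one plausible route, and it rests on an assumption: that the annual observational errors are independent, each with standard deviation 0.025. In that case the standard error of an OLS slope is the annual error divided by the square root of the summed squared deviations of the time points. The function name is mine.

```python
import math

def trend_std_error(sigma_annual, n_years):
    """Std error of an OLS slope (per year) from independent annual errors."""
    years = range(n_years)
    mean_year = sum(years) / n_years
    sxx = sum((y - mean_year) ** 2 for y in years)  # equals 42 for 8 annual points
    return sigma_annual / math.sqrt(sxx)

per_year = trend_std_error(0.025, 8)
per_decade = 10 * per_year          # convert to a decadal trend
half_width_95 = 1.96 * per_decade   # Gaussian 95% half-width

print(f"trend uncertainty: {per_decade:.3f} per decade (1 sd)")
print(f"95% range: +/- {half_width_95:.3f} per decade")
```

That comes out at roughly 0.04 at one standard deviation and roughly 0.08 for the 95% half-width, in line with the figures above. If the annual errors were correlated, the numbers would differ somewhat.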
Putting that together with the model output, we get the result that if the null hypothesis is true and the models' prediction of N(0.19,0.21) for the true planetary trend is correct, then the sampling distribution for the observed trend should also be N(0.19,0.21). I calculated 0.21 for the standard deviation there by adding the two uncertainties of 0.21 and 0.04 in quadrature (ie squaring, adding, taking the square root). This is the correct formula under the assumption that the observational error is independent of the true planetary temperature, which seems natural enough.
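The quadrature sum is a one-liner. The sketch below just repeats the arithmetic described above with the two standard deviations quoted in the post. The variable names are mine.

```python
import math

model_sd = 0.21  # spread of the model trends, C/decade
obs_sd = 0.04    # observational uncertainty on the trend, C/decade

# Independent errors add in quadrature: sqrt(a^2 + b^2).
combined_sd = math.hypot(model_sd, obs_sd)

print(f"combined sd: {combined_sd:.4f}, rounded: {round(combined_sd, 2)}")
```

The combined value is a shade under 0.214, which rounds back to 0.21.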
So, as I had guessed in my comments to Roger's post, considering observational uncertainty here has a negligible effect (is rounded off completely), so we could have simply used the existing model spread as the null distribution. Using this approach generally makes such tests stiffer than they should be, but it is often a small effect.
5. Compare the observed test statistic to the null distribution. If the test statistic falls in a sufficiently improbable region of the null distribution, H0 is rejected as too unlikely to have been true given the observed evidence. If the test statistic falls within the range of "ordinary" values described by the null distribution, the test statistic is seen as consistent with H0 which is then not rejected. [my emphasis]
OK, let's have a look at the test statistic. For HADCRU, the least squares trend is....0.11C/decade. That is a simple least squares to the last 8 full year values of these data. (I generally use the variance-adjusted version, on the ground that if they think there is a reason to adjust the variance, I see no reason to presume that this harms their analysis. It doesn't affect the conclusions of course.)
So, where does 0.11 lie in the null distribution N(0.19,0.21)? Just about slap bang in the middle, that's where. OK, it is marginally lower than the mean (by a whole 0.38 standard deviations), but actually closer to the mean than one could generally hope for, even if the null is true. In fact the probability of a sample statistic from the null distribution being worse than the observed test statistic is a whopping 70% (this value being 1 minus the integral of a Gaussian from -0.38 to +0.38 standard deviations)!
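Here is that final comparison as a short sketch, using only the numbers quoted in the post and Python's standard library. The two-sided p-value is the probability mass outside plus or minus |z| in a standard Gaussian. The variable names are mine.

```python
from statistics import NormalDist

null = NormalDist(mu=0.19, sigma=0.21)  # null distribution, C/decade
observed = 0.11                         # HadCRU 8-year OLS trend, C/decade

# Distance of the observation from the null mean, in standard deviations.
z = (observed - null.mean) / null.stdev

# Two-sided p-value: chance of a sample at least this far from the mean.
p_value = 2 * NormalDist().cdf(-abs(z))

reject_at_5_percent = p_value < 0.05

print(f"z = {z:.2f}, two-sided p = {p_value:.2f}, reject H0: {reject_at_5_percent}")
```

With a p-value of about 0.7, the null hypothesis is nowhere near being rejected at the 5% level.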
So what do we conclude from this?
First, that the data are obviously not inconsistent with the models at the 5% level.
Second...well I leave readers to draw their own conclusions about Roger "I honestly don't know what the proper test is" Pielke.