...or, how to perform the most elementary null hypothesis significance tests.

Roger Pielke has been saying some truly bizarre and nonsensical things recently. The pick of the bunch IMO is this post. The underlying question is: Are the models consistent with the observations over the last 8 years?

So Roger takes the ensemble of model outputs (8 year trend as analysed in this RC post), and then plots some observational estimates (about which more later), which clearly lie well inside the 95% range of the model predictions, and apparently without any shame or embarrassment adds the obviously wrong statement:

I challenged this obvious absurdity and repeatedly asked him to back it up with a calculation. After a lot of ducking and weaving, about the 30th comment under the post, he eventually admits "I honestly don't know what the proper test is". Isn't thinking about the proper test a prerequisite for confidently asserting that the models fail it? Anyway, I'll walk through it here very slowly for the hard of understanding. I'll use Wilks "Statistical methods in the atmospheric sciences" (I have the 1st edition), and in particular Chapter 5: "Hypothesis testing". It opens:

and then continues with:

This is a gimme. Obviously, the question that Roger has posed is about the 8-year trend of observed mean surface temperature. I'm going to use an ordinary least squares (OLS) fit because that is what is already available for the models, and it is also by far the most commonly used method for trend estimation and has well understood properties. For some unstated reason, Roger chose to use Cochrane-Orcutt estimates for the observed data that he plotted in his picture, but I do not know how well that method performs for such a short time series or how it compares to OLS. Anyone who wishes to repeat the analysis using C-O should find it easy enough in principle, they will need to get the raw model output (freely available) and analyse it in that manner. I would bet a large sum of money that this will not change the results qualitatively.

Easy enough, the null hypothesis H0 here that I wish to test is that the models correctly predict the planetary temperature trend over 2000-2007. If anyone has any other suggestion for what null hypothesis makes sense in this situation, I'm all ears.

"H0 is false". This all seems too easy so far....there must be something scary around the corner.

OK, now the real work - such as it is - starts. First, we have the distribution of trends predicted by the models. As RC have shown, this is well approximated by a Gaussian N(0.19,0.21). (I am going to stick with decadal trends throughout rather than using a mix of time scales to give me less chance of embarassingly dropping factors of 10 as Roger has done in several places in his post. He has also plotted his blue "95%" lines in the wrong place too, but I've got bigger fish to fry.) There are firm theoretical reasons why we should expect a Gaussian to provide a good fit (basically the Central Limit Theorem). This distribution isn't quite what we need, however. The model output (as analysed) uses perfect knowledge of the model temperature, whereas the observed estimate for the planet is calculated from limited observational coverage. In fact, CRU estimate their observational errors at about 0.025 for each year's mean (at one standard deviation). This introduces a small additional uncertainty of about 0.04 on the decadal trend. That is, if the true planetary trend is X, say, then the observational analysis will give us a number in the range [X-0.08,X+0.08] with 95% probability.

Putting that together with the model output, we get the result that if the null hypothesis is true and the models' prediction of N(0.19,0.21) for the true planetary trend is correct, then the sampling distribution for the observed trend should also be N(0.19,0.21). I calculated 0.21 for the standard deviation there by adding the two uncertainties of 0.21 and 0.04 in quadrature (ie squaring, adding, taking the square root). This is the correct formula under the assumption that the observational error is independent of the true planetary temperature, which seems natural enough.

So, as I had guessed in my comments to Roger's post, considering observational uncertainty here has a negligible effect (is rounded off completely), so we could have simply used the existing model spread as the null distribution. Using this approach generally makes such tests stiffer than they should be, but it is often a small effect.

OK, let's have a look at the test statistic. For HADCRU, the least squares trend is....0.11C/decade. That is a simple least squares to the last 8 full year values of these data. (I generally use the variance-adjusted version, on the ground that if they think there is a reason to adjust the variance, I see no reason to presume that this harms their analysis. It doesn't affect the conclusions of course.)

So, where does 0.11 lie in the null distribution N(0.19,0.21)? Just about slap bang in the middle, that's where. OK, it is marginally lower than the mean (by a whole 0.38 standard deviations), but actually closer to the mean than one could generally hope for, even if the null is true. In fact the probability of a sample statistic from the null distribution being worse than the observed test statistic is a whopping 70% (this value being 1 minus the integral of a Gaussian from -0.38 to +0.38 standard deviations)!

So what do we conclude from this?

First, that the data are obviously not inconsistent with the models at the 5% level.

Second...well I leave readers to draw their own conclusions about Roger "I honestly don't know what the proper test is" Pielke.

Roger Pielke has been saying some truly bizarre and nonsensical things recently. The pick of the bunch IMO is this post. The underlying question is: Are the models consistent with the observations over the last 8 years?

So Roger takes the ensemble of model outputs (8 year trend as analysed in this RC post), and then plots some observational estimates (about which more later), which clearly lie well inside the 95% range of the model predictions, and apparently without any shame or embarrassment adds the obviously wrong statement:

"one would conclude that UKMET, RSS, and UAH are inconsistent with the models".Um....no one would not:

Update: OK, there are a number of things wrong with this picture. First, these "Observed temperature trends" stated on the left, calculated by Lucia, are actually per century not per decade, although I think they have been plotted in the right place. When OLS is used on the 8-year trends (to be consistent with the model analysis), the various obs give results of around 0.13 - 0.26C/decade, with my HadCRU analysis actually being at the lower end of this range. Second, the pale blue lines purporting to show "95% spread across model realizations" are in the wrong place. Roger seems to have done a 90% spread (5-95% coverage) which is about 20% too narrow, in terms of the range it implies.

I challenged this obvious absurdity and repeatedly asked him to back it up with a calculation. After a lot of ducking and weaving, about the 30th comment under the post, he eventually admits "I honestly don't know what the proper test is". Isn't thinking about the proper test a prerequisite for confidently asserting that the models fail it? Anyway, I'll walk through it here very slowly for the hard of understanding. I'll use Wilks "Statistical methods in the atmospheric sciences" (I have the 1st edition), and in particular Chapter 5: "Hypothesis testing". It opens:

Formal testing of hypotheses, also know as significance testing, is generally covered extensively in introductory courses in statistics. Accordingly, this chapter will review only the basic concepts behind formal hypothesis tests...[cut]

and then continues with:

5.1.3 The elements of any hypothesis test

Any hypothesis test proceeds according to the following five steps:

1. Identify a test statistic that is appropriate to the data and question at hand.

This is a gimme. Obviously, the question that Roger has posed is about the 8-year trend of observed mean surface temperature. I'm going to use an ordinary least squares (OLS) fit because that is what is already available for the models, and it is also by far the most commonly used method for trend estimation and has well understood properties. For some unstated reason, Roger chose to use Cochrane-Orcutt estimates for the observed data that he plotted in his picture, but I do not know how well that method performs for such a short time series or how it compares to OLS. Anyone who wishes to repeat the analysis using C-O should find it easy enough in principle, they will need to get the raw model output (freely available) and analyse it in that manner. I would bet a large sum of money that this will not change the results qualitatively.

2. Define a null hypothesis.

Easy enough, the null hypothesis H0 here that I wish to test is that the models correctly predict the planetary temperature trend over 2000-2007. If anyone has any other suggestion for what null hypothesis makes sense in this situation, I'm all ears.

3. Define an alternative hypothesis.

"H0 is false". This all seems too easy so far....there must be something scary around the corner.

4. Obtain the null distribution, which is simply the sampling distribution of the test statistic given that the null hypothesis is true.

OK, now the real work - such as it is - starts. First, we have the distribution of trends predicted by the models. As RC have shown, this is well approximated by a Gaussian N(0.19,0.21). (I am going to stick with decadal trends throughout rather than using a mix of time scales to give me less chance of embarassingly dropping factors of 10 as Roger has done in several places in his post. He has also plotted his blue "95%" lines in the wrong place too, but I've got bigger fish to fry.) There are firm theoretical reasons why we should expect a Gaussian to provide a good fit (basically the Central Limit Theorem). This distribution isn't quite what we need, however. The model output (as analysed) uses perfect knowledge of the model temperature, whereas the observed estimate for the planet is calculated from limited observational coverage. In fact, CRU estimate their observational errors at about 0.025 for each year's mean (at one standard deviation). This introduces a small additional uncertainty of about 0.04 on the decadal trend. That is, if the true planetary trend is X, say, then the observational analysis will give us a number in the range [X-0.08,X+0.08] with 95% probability.

Putting that together with the model output, we get the result that if the null hypothesis is true and the models' prediction of N(0.19,0.21) for the true planetary trend is correct, then the sampling distribution for the observed trend should also be N(0.19,0.21). I calculated 0.21 for the standard deviation there by adding the two uncertainties of 0.21 and 0.04 in quadrature (ie squaring, adding, taking the square root). This is the correct formula under the assumption that the observational error is independent of the true planetary temperature, which seems natural enough.

So, as I had guessed in my comments to Roger's post, considering observational uncertainty here has a negligible effect (is rounded off completely), so we could have simply used the existing model spread as the null distribution. Using this approach generally makes such tests stiffer than they should be, but it is often a small effect.

5. Compare the observed test statistic to the null distribution. If the test statistic falls in a sufficiently improbable region of the null distribution, H0 is rejected as too unlikely to have been true given the observed evidence. If the test statistic falls within the range of "ordinary" values described by the null distribution, the test statistic is seen as consistent with H0 which is then not rejected. [my emphasis]

OK, let's have a look at the test statistic. For HADCRU, the least squares trend is....0.11C/decade. That is a simple least squares to the last 8 full year values of these data. (I generally use the variance-adjusted version, on the ground that if they think there is a reason to adjust the variance, I see no reason to presume that this harms their analysis. It doesn't affect the conclusions of course.)

So, where does 0.11 lie in the null distribution N(0.19,0.21)? Just about slap bang in the middle, that's where. OK, it is marginally lower than the mean (by a whole 0.38 standard deviations), but actually closer to the mean than one could generally hope for, even if the null is true. In fact the probability of a sample statistic from the null distribution being worse than the observed test statistic is a whopping 70% (this value being 1 minus the integral of a Gaussian from -0.38 to +0.38 standard deviations)!

So what do we conclude from this?

First, that the data are obviously not inconsistent with the models at the 5% level.

Second...well I leave readers to draw their own conclusions about Roger "I honestly don't know what the proper test is" Pielke.

## 15 comments:

IIRC the Cochrane-Orcutt thing was Lucia's idea. BTW, don't miss Roger's fawning post about her final triumph. Apparently she has proved that Swedes are tall, and having done so will now just sit back and watch the thermometers. One can only hope.

We are quite tall, and handsome in a certain angel and a certain light ;)

I’m getting more and more interested in what drives this people, ok for Roger it could be a little fame and money but all the other bloggers that puts down so much time on something they really don’t want to understand... crazy... well guess every one wants to try to get there 15 minutes of fame.

Magnus- I think for many of them, understanding is not even on th radar, they want it to go away and stop bothering them, so will pick up any old bit of rubbish that proves their case. A local denialist did that recently by claiming CO2 levels had been around 400ppm in the 1940's, so obviously things aren't as the IPCC etc claimed. He'd obviously been reading Beck, but didn't mentiont that.

Lucia, well, I think she and some of the others do want to understand, but they are perpetually stuck in a kind of sceptical manner, plus they want to understand it in their own way, i.e. things will get bent until they fit their preconceptions or abilities, eg the use of COchrane-Orcutt.

Steve,

That post of Lucia's is nothing short of

nutty. And Roger calling it "wonderfully clear" just takes the biscuit. I wonder if he's on a sort of scientific suicide mission.While I agree with most of the post, I think there is a good reason for using the Cochrane-Orcutt method (there are other methods, too) for estimating the trend and especially its uncertainty. Tamino has a good explanation about the autocorrelation of the residuals, and how to take it into account.

If the residuals are autocorrelated, OLS may yield too narrow confidence intervals for the trend. In the IPCC report, for example, autocorrelation is taken into account, with the assumption that the residuals have AR1-structure.

"I honestly don't know what the proper test is"

Wow...

I also like how Pielke pretends not to see any falsifiable climate predictions,

right below an entire blog posthe just wrote regarding one such prediction.Yeah, who's paying him to write this crap?

-- bi,

International Journal of InactivismSins Roger is sneaky with his blog comments I say it here... delete if you want...

The thing Roger, is that you make so obvious errors and still it sounds like you know it all... that is why people get angry.

Not because you ask questions...

John V is making a sterling effort to pin down Roger in the comments. Roger is still ducking and weaving but John V is reducing it to a simple choice that Roger will find it hard to wriggle out of. Its fun to watch!

Mika,

Yes, I can accept that C-O may be a reasonable choice in principle. I am not convinced that it will give robust results for 8 data points, since it is very hard to give a good estimate of an AR(1) model from a short time series. And if one is using C-O on the data, one needs to work out the null distribution of C-O analyses from the model outputs, which would require reanalysis of the model output. Not that any of this could significantly change the results in this case.

(The formal uncertainty estimates from least squares on the 8 year interval are probably pretty meaningless here too - we have a fairly good handle on interannual variability from the much longer obs. record, so it is a bit silly to only use the residuals from a very short interval, which may happen to be unusually large or small.)

I suppose that is something I could look at in a blog post...what range of results C-O gives when you feed it 8 years of trend plus white noise (or AR(1) of known properties).

James,

I think that the trends shown in Pielke's post were calculated based on monthly temperatures, so the number of data points may not be that small.

Of course, if there is a better way to estimate the uncertainty of the trend, using Cochrane-Orcutt instead of OLS seems pretty irrelevant.

(Hopefully I didn't post this twice, as I had some difficulty posting)

Mika.

OK, yes with monthly data there are a lot of points. OTOH it is still unclear to me how well such a short interval can characterise an autoregressive system if there is a time scale of more than a few months or so.

> John V is reducing it to a simple

> choice that Roger will find it hard

> to wriggle out of.

Eh?

Yeah, Roger can wriggle out of anything...

Jim, why do you think that post of Lucia´s is exactly ´nutty´?

Assuming you mean me, the whole elaborate story about tall Swedes and other random races is just such a huge and elaborate diversion from the original question - which was, as I have already explained, very elementary and straightforward, not at all difficult to answer. She spent several

thousandwords missing (or deliberately evading) the point. Roger puts the icing on the cake by calling it "wonderfully clear", when he has repeatedly shown he doesn't understand the subject at all. But I guess you can fool plenty of people most of the time, and those are the ones they are catering for.Post a comment