Answer: when the 23% refers to the proportion of models that are rejected by Roger Pielke's definition of "consistent with the models at the 95% level".

By the normal definition, this simple null hypothesis significance test should reject roughly 5% of the models. Eg Wilks Ch 5 again "the null hypothesis is rejected if the probability (as represented by the null distribution) of the test statistic, and all other results at least as unfavourable to the null hypothesis, is less than or equal to the test level. [...] Commonly the 5% level is chosen" (which we are using here).

I asked Roger what range of observed trends would pass his consistency test and he replied with -0.05 to 0.45C/decade. I then asked Roger how many of the models would pass his proposed test, and instead of answering, he ducked the question, sarcastically accusing me of a "nice switch" because I asked him about the finite sample rather than the fitted Gaussian. They are the same thing, Roger (to within sampling error). Indeed, the Gaussian was specifically constructed to agree with the sample data (in mean and variance, and it visually matches the whole shape pretty well).

The answer that Roger refused to provide is that 13/55 = 24% of the discrete sample lies outside his "consistent with the models at the 95% level" interval (from the graph, you can read off directly that it is at least 10, which is 18% of the sample, and at most 19, or 35%).

But that's only with my sneaky dishonest switch to the finite sample of models. If we use the fitted Gaussian instead, then roughly 23% of it lies outside Roger's proposed "consistent at the 95% level" interval. So that's entirely different from the 24% of models that are outside his range, and supports his claims...
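For anyone who wants to check the sums, here is a minimal sketch in Python. The mean (0.19C/decade) and standard deviation (0.21C/decade) of the fitted Gaussian are assumptions on my part, taken from the figures quoted in the comments below rather than stated in this post; the interval and the 13/55 count are as given above. The function names are purely illustrative.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def fraction_outside(lo, hi, mean, sd):
    """Probability mass of the Gaussian lying outside the interval [lo, hi]."""
    inside = normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)
    return 1.0 - inside

# Assumed parameters of the fitted Gaussian, in C/decade (not stated in the post).
MEAN = 0.19
SD = 0.21

# Roger's proposed "consistent at the 95% level" interval, in C/decade.
LO, HI = -0.05, 0.45

gaussian_outside = fraction_outside(LO, HI, MEAN, SD)

# Fraction of the discrete 55-member sample outside the same interval.
sample_outside = 13 / 55

# For comparison: a conventional two-sided 95% interval, mean +/- 1.96 sd.
nominal_outside = fraction_outside(MEAN - 1.96 * SD, MEAN + 1.96 * SD, MEAN, SD)

print(f"Gaussian outside Roger's interval: {gaussian_outside:.1%}")
print(f"Sample outside Roger's interval:   {sample_outside:.1%}")
print(f"Gaussian outside mean +/- 1.96 sd: {nominal_outside:.1%}")
```

Under those assumed parameters, the first two numbers agree to within sampling error, and only the last one is anywhere near 5%.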

I guess if you squint and wave your hands, 23% is pretty close to 5%. Close enough to justify a sequence of patronising posts accusing me and RC of being wrong, and all climate scientists of politicising climate science and trying to shut down the debate, anyway. These damn data and their pro-global-warming agenda. Life would be so much easier if we didn't have to do all these difficult sums.

I'm actually quite enjoying the debate over whether the temperature trend is inconsistent with the models at the "Pielke 5%" level :-) And so, apparently, are a large number of readers.

## 28 comments:

This image seems prescient.

-- bi,

International Journal of Inactivism

This may help. Or not.

http://marketing.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf.

"Why We Don’t Really Know What 'Statistical Significance' Means:

A Major Educational Failure

ABSTRACT

The Neyman–Pearson theory of hypothesis testing, with the Type I error rate, α, as the significance level, is widely regarded as statistical testing orthodoxy. Fisher’s model of significance testing, where the evidential p value denotes the level of significance, nevertheless dominates statistical testing practice. This paradox has occurred because these two incompatible theories of classical statistical testing have been anonymously mixed together, creating the false impression of a single, coherent model of statistical inference. We show that this hybrid approach to testing, with its misleading p < α statistical significance criterion, is common in marketing research textbooks, as well as in a large random sample of papers from twelve marketing journals. That is, researchers attempt the impossible by simultaneously interpreting the p value as a Type I error rate and as a measure of evidence against the null hypothesis. The upshot is that many investigators do not know what our most cherished, and ubiquitous, research desideratum—'statistical significance'—really means. This, in turn, signals an educational failure of the first order. ..."

Brokerage is after all marketing, and they're all subprime now.

I find arguments about P-values interesting for a couple of reasons. One is that strict adherence to alpha=5% is only a convention, and it is one that puts all the stress on avoiding Type I errors; use of a different alpha could be justified (as long as the justification was consistently applied) on the basis of balancing the Type II error. The second reason is that with heaps of people like Pielke Jr running these tests at every conceivable time step, etc, the probability of rejecting Ho increases, so there has to be some sort of correction for multiple tests. (Problematically, the tests are not independent.) A third reason for my interest is that the first two observations come from a frequentist's point of view; I have no idea how a Bayesian would examine this information, but I'd be excited to learn.

Now he's conceded, without quite conceding it, that what he said about consistency was wrong, so he's decided to talk about skill instead. I give up.

"I give up."

I lied.

I try to be Bayesian, of a rather simple kind:

http://en.wikipedia.org/wiki/Bayes_factor

wherein two hypotheses are treated symmetrically with regard to the weight of the evidence. However, this needs some caution with regards to the fact that a more complex hypothesis (more adjustable parameters) is 'more likely' to fit the data. So some criterion which applies Ockham's Razor seems also to be required:

http://en.wikipedia.org/wiki/Akaike_information_criterion

http://en.wikipedia.org/wiki/Bayesian_information_criterion

Mark,

Like a moth to a flame :-)

David,

I agree that Bayesian approaches are probably better in general - there's some irony in my defending a purely frequentist method after all the nasty things I've said about them. But a standard NHST certainly has its place as a basic check, and if one is going to perform one, it should at least be done correctly!

James Annan wrote "... if one is going to perform one, it should at least be done correctly!"

Absolutely!

In case anyone needed proof that RP Jr.'s problem is in part genetic, here's more.

When the answer is 14 +/- 10?

It seems every time someone has a problem with statistics, they blame it on p-values, hypothesis testing, Fisher or Neyman and Pearson, and pull out the trump card of how life would be so much better if we were all Bayesians.

No - these problems usually reflect a lack of understanding of statistics, and adding prior distributions would only make matters worse.

As for the matter at hand: I think part of the problem, James, is that you never made a rigorous statement of the probabilistic model you are working with.

I offer a first pass at defining such a model:

We assume that the temperature at year n+1 is T(n + 1) = T(n) + X(n + 1).

The annual increments X(n) are independent and are normally distributed with mean m and std. dev. s.

The mean of the annual increments, m, is dependent on the concentration of CO2 in the atmosphere. Its current value is 0.019C.

The current value of the standard deviation s is 0.21/sqrt(10) = 0.066C.

The "models" (or "model predictions", i.e., points in the ensemble used by IPCC) are sample paths from this distribution, as is the measured temperature record.

How does this description of things seem to you?

(Clearly additional elements need to be added to the model to account for extreme events like El Ninos, volcano eruptions, etc., but supposedly these are only perturbations to the general trend as described by the model above.)
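A minimal sketch of the model as stated, for anyone who wants to try it: it simulates many sample paths of the random walk with drift and checks how the spread across paths grows with time. The values of m and s are the ones given above; the number of paths, the 40-year horizon, the random seed and all function names are arbitrary choices for illustration.

```python
import math
import random
import statistics

def simulate_random_walk(n_years, drift, step_sd, rng):
    """One sample path of T(n+1) = T(n) + X(n+1), X ~ N(drift, step_sd^2), T(0) = 0."""
    t = 0.0
    path = []
    for _ in range(n_years):
        t += rng.gauss(drift, step_sd)
        path.append(t)
    return path

DRIFT = 0.019                    # m, mean annual increment (C)
STEP_SD = 0.21 / math.sqrt(10)   # s, std. dev. of the annual increment (C)
N_PATHS = 5000                   # illustrative ensemble size
N_YEARS = 40                     # illustrative horizon

rng = random.Random(0)           # fixed seed so the run is repeatable
paths = [simulate_random_walk(N_YEARS, DRIFT, STEP_SD, rng) for _ in range(N_PATHS)]

def spread_at(year):
    """Standard deviation across paths of the temperature after `year` years."""
    return statistics.stdev([p[year - 1] for p in paths])

for year in (10, 20, 40):
    theory = STEP_SD * math.sqrt(year)  # independent increments: sd grows as sqrt(n)
    print(f"year {year:2d}: simulated sd = {spread_at(year):.3f}, s*sqrt(n) = {theory:.3f}")
```

With independent increments the variance across paths grows linearly in n, so the spread keeps widening for as long as the walk is run.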

Yoram,

No, I don't think that is a reasonable model. It seems to me (although I may be mistaken) that it generates a random walk that will spread in an unbounded manner over time even if forcing is fixed. In fact, all reasonable theory and models suggest that over the decadal/century time scale, the temperature will converge to a quasi-equilibrium state where there is some natural variability but not an unbounded walk. The simple zero-dimensional energy-balance model (c.dT/dt = F - L.T + e) is a reasonable place to start. This is the model used by Schwartz, but of course it predates him - his error was not in the model itself, but rather in claiming it could be used together with limited observational data to precisely diagnose the behaviour of the real climate system, which is not the case.

When the forcing is increasing linearly, this will generate a trend plus AR(1) 'natural variability'. AR(1) is not really right to describe the unforced variability, but it's not too silly a starting point so long as one bears in mind its limitations.
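As a rough illustration of that claim, the sketch below steps the energy-balance equation forward one year at a time (a simple Euler discretisation, which is my assumption about how to discretise it) and tracks the mean and variance of T exactly rather than by simulation. All parameter values are hypothetical and chosen only to make the behaviour visible; `lam` stands for the feedback parameter L in the equation above.

```python
def ebm_moments(n_years, c, lam, ramp, noise_sd):
    """Mean and variance of T for the yearly Euler step of c*dT/dt = F - lam*T + e.

    F(n) = ramp * n is a linearly increasing forcing and e(n) is independent
    Gaussian noise with standard deviation noise_sd. The update is
    T(n+1) = phi*T(n) + (F(n) + e(n))/c with phi = 1 - lam/c, i.e. an AR(1) process.
    """
    phi = 1.0 - lam / c
    mean, var = 0.0, 0.0
    means, variances = [], []
    for n in range(n_years):
        mean = phi * mean + ramp * n / c
        var = phi * phi * var + (noise_sd / c) ** 2
        means.append(mean)
        variances.append(var)
    return means, variances

# Hypothetical, illustrative values only.
C = 15.0        # effective heat capacity, giving a relaxation time c/lam of 15 years
LAM = 1.0       # feedback parameter
RAMP = 0.02     # forcing increase per year
NOISE_SD = 1.0  # std. dev. of the noise term e

means, variances = ebm_moments(300, C, LAM, RAMP, NOISE_SD)

PHI = 1.0 - LAM / C
stationary_var = (NOISE_SD / C) ** 2 / (1.0 - PHI ** 2)  # AR(1) stationary variance

print(f"late-time trend per year: {means[-1] - means[-2]:.4f} (ramp/lam = {RAMP / LAM:.4f})")
print(f"variance after 10 years:  {variances[9]:.4f}")
print(f"variance after 300 years: {variances[-1]:.4f} (stationary = {stationary_var:.4f})")
```

The mean settles onto a linear trend of ramp/lam per year, while the variance levels off at the AR(1) stationary value instead of growing without bound as it would for a random walk.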

According to the model you suggest, given the forcing, the uncertainty in temperature does not grow as you go farther into the future. That is, you should be able to predict the average temperature, say, 50 years from now with more or less the same accuracy as you can predict the temperature 10 years from now (assuming the smoothing factor of the AR is shorter than 10 years).

Does this seem right?

Yoram,

a 50 year prediction would only be the same if there was no error in the predictions of what the forcing will be at that point. Since the future behavior of civilization is pretty uncertain, it doesn't work.

Ok - but then what is the uncertainty about the forcing? If the uncertainty in the forcing over 10, 30, or 50 years is on the same order of magnitude as the noise term e, then that would have an important impact on the uncertainty of the entire prediction.

One more point - before, when comparing the uncertainty of a 10 year prediction to that of a 30 year prediction, James indicated that the variance in the prediction grows linearly with time - how does this follow from the model he suggests?

Yoram,

There is uncertainty in the forcing, feedback, and effective capacity (even assuming that heat balance model is considered adequate), on top of the natural variability "noise".

If we have known forcing, then over the longer term, there will be some increased spread due to the uncertainty in the forced response. I think the IPCC figure 10.4 (also in the SPM) shows the effect fairly well - eg looking just at the A2 scenario, the uncertainty fuzz is reasonably constant for up to about 30 years (where natural variability dominates) but grows thereafter as the uncertainty in the forced response becomes relatively larger. So for 30 years, linear trend plus some noise is a reasonable model - and we have uncertainty in both components.

"when comparing the uncertainty of a 10 year prediction to that of a 30 year prediction, James indicated that the variance in the prediction grows linearly with time"

I don't recall what you are referring to - can you remind me?

Carl Wunsch used an AR(2) model for paleoclimate (to replace orbital forcing).

James,

I gave the matter some more thought and I now understand that for a time interval significantly shorter than the relaxation time of the system (i.e., delta-T << c / L) the system as you specified has an approximately fixed mean time derivative and thus exhibits the random walk-with-drift behavior that I suggested.

What is the relaxation time c / L we are talking about here? That is, given fixed forcing, how long would it take the atmosphere to reach the equilibrium temperature?

By the way, is there an online resource discussing atmospheric modeling at the level of abstraction we are discussing here?

Yoram,

Part of the problem is that there isn't really a unique time scale. Fig 1 of Reto's comment shows the response to an instantaneous change in forcing; no simple exponential looks like a good fit to that. 15y may be a reasonable compromise but it doesn't really represent either short- or long-term responses very well.

Don't know of any on-line resources, sorry. There are a number of papers which have some relevance though, eg.

Smith, R. L., T. M. L. Wigley and B. D. Santer (2003), A Bivariate Time Series Approach to Anthropogenic Trend Detection in Hemispheric Mean Temperatures, Journal of Climate, 16(8), 1228-1240.

There is a paper by Hansen et al. (2007), available on the web site, which contains a graph that I interpret as indicating a temperature rise continuing for at least 1300 years.

Thanks to James for the link to the comment paper.

In an attempt to produce something like Figure 1 of the Reto N. et al. paper using AR(1) ideas, the following system of air, A(n) at step n, and water, W(n) at step n, occurred to me. I'm posting it mainly to see if I am badly off in understanding a little about the climate system.

The CO2 concentration in the air is treated as an exogenous variable, X(n) at step n, so that different values can be tried without the labor of figuring out carbon cycle sources and sinks. E(n) is the noise term at step n. Lower case letters are constants:

A(n+1) = a + bA(n) + cW(n) + dX(n) + E(n)

W(n+1) = f + gW(n) + hA(n)

The idea is to set the constants so that the A series responds immediately to the additional CO2 and then slowly heats up all the water, providing a positive feedback.
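A minimal sketch of this two-box system, to see whether it gives the fast-then-slow response. The constants are hypothetical values picked only so that the air responds within a few steps and the water takes of order a hundred steps; they are not fitted to anything. X is a unit step applied at n = 0, and the noise E(n) is switched off by default.

```python
import random

def simulate_two_box(x_series, a, b, c, d, f, g, h, noise_sd=0.0, seed=0):
    """Iterate A(n+1) = a + b*A(n) + c*W(n) + d*X(n) + E(n) and
    W(n+1) = f + g*W(n) + h*A(n), starting from A(0) = W(0) = 0.

    E(n) is Gaussian noise with standard deviation noise_sd (zero by default).
    Returns the lists of A and W values, including the initial state.
    """
    rng = random.Random(seed)
    air, water = [0.0], [0.0]
    for x in x_series:
        e = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
        a_next = a + b * air[-1] + c * water[-1] + d * x + e
        w_next = f + g * water[-1] + h * air[-1]
        air.append(a_next)
        water.append(w_next)
    return air, water

# Hypothetical constants: fast air box, slow water box, equilibrium A = W = X.
CONSTANTS = dict(a=0.0, b=0.5, c=0.3, d=0.2, f=0.0, g=0.98, h=0.02)

N_STEPS = 1000
step_forcing = [1.0] * N_STEPS  # unit step change in X at n = 0

air, water = simulate_two_box(step_forcing, **CONSTANTS)

for n in (1, 5, 50, 200, 1000):
    print(f"step {n:4d}: A = {air[n]:.3f}, W = {water[n]:.3f}")
```

With these constants the air series makes roughly 40% of its eventual rise within the first handful of steps and then creeps up as the water warms, so there is no single well-defined time scale.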

David,

Yes, that sort of thing is reasonable. One can split the system into land, upper ocean, deep ocean and so on. Any reasonable choice beyond a one-box system will not have a single well-defined time scale. Eventually if you make the boxes small enough, it may converge to a GCM :-)

Of course selecting the exchange coefficients is key with these box models. For GCMs, in principle the fluxes are determined by the physics, although even then there is room for tuning.

From Reto's figure it looks like there are two effects involved, one with a very short relaxation time - about 5 years - and the other with a very long one - maybe 200 years. Any trend that is fixed for 10 - 30 years, such as the +.2C/decade that you refer to, must be caused by the latter effect.

Unfortunately it's a bit more subtle than that. The 0.2C/decade is not absolutely fixed, but is predicated on plausible emissions scenarios. If we kept atmospheric composition fixed at the current state then the future warming rate would be much less (eyeballing Figure 10.4 of the IPCC suggests something like 0.3C over 30y). But this would require an immediate 50% cut in emissions, which isn't going to happen. Even then, this more gradual warming would still continue for O(100) years and more.

Ok, so about half of the projected increase (.1C/decade) is based on the long term effect of the present level of atmospheric CO2 and half is based on the short term effect of the expected increase in CO2.

I make it about 60% short term temperature rise and 40% long term.

Yes, about half.

Yes, that's probably about right. There is uncertainty over these proportions, of course.
