The work presented is a straightforward comparison of temperature trends, both observed and modelled. The goal is to check the consistency of the two - ie, asking the question "are the observations inconsistent with the models?"

This is approached through a standard null hypothesis significance test, which I've talked about at some length before. The null hypothesis is that the observations are drawn from the distribution defined by the model ensemble. We are considering whether or not this null hypothesis can be rejected (and at what confidence level). If so, this would tend to cast doubts on either or both of the forced response and the internal variability of the models.
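For readers who want the test spelled out, here is a minimal sketch of the one-sided check in Python. It is an illustration, not the code used for the analysis: the function name `one_sided_p` and the trend values are made up, and it simply counts the fraction of ensemble trends at or below the observed one.

```python
def one_sided_p(ensemble_trends, observed_trend):
    """Fraction of ensemble trends at or below the observed trend."""
    if not ensemble_trends:
        raise ValueError("ensemble_trends must not be empty")
    at_or_below = sum(1 for t in ensemble_trends if t <= observed_trend)
    return at_or_below / len(ensemble_trends)


# Made-up trends (degrees C per decade), purely for illustration.
ensemble = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
observed = 0.10

p = one_sided_p(ensemble, observed)
print(p)  # 2 of the 8 ensemble trends are <= 0.10, so this prints 0.25
```

A small value means the observed trend sits in the lower tail of the ensemble; whether that justifies rejecting the null depends on the threshold chosen beforehand.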

It may be worth emphasising right at the outset that our analysis is almost identical in principle to that presented by Gavin on RC some time ago. In that post, he formed the distribution of model results (over two different intervals) and used this to assess how likely a negative trend would be. Here is his main picture:

He argued (correctly) that if the models described the forced and natural behaviour adequately, a negative 8-year trend was not particularly unlikely, but over 20 years it would be very unlikely, though not impossible (1% according to his Gaussian fit).
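A rough sketch of that kind of Gaussian estimate, assuming a list of model trends is available. The function `prob_negative_trend` and the sample values are hypothetical, and this is not Gavin's code; it just fits a normal distribution to the trends and reads off the probability mass below zero.

```python
from statistics import NormalDist


def prob_negative_trend(trends):
    """Fit a Gaussian to the trends and return P(trend < 0)."""
    fitted = NormalDist.from_samples(trends)  # needs at least two values
    return fitted.cdf(0.0)


# Made-up trends (degrees C per decade), for illustration only.
sample = [-0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
print(round(prob_negative_trend(sample), 3))
```

The Gaussian fit matters most in the tails: with a limited number of simulations, an empirical count cannot resolve probabilities near 1%, whereas the fitted curve can (at the cost of assuming the shape).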

We have extended that basic calculation in a few ways, firstly by considering a more complete range of intervals (to avoid accusations of cherry-picking on the start date). Also, rather than using an arbitrary threshold of zero trend, we have specifically looked at where the observed trends actually lie (well, we also show where zero lies in the distributions). I don't believe there is anything remotely sneaky or underhand in the basic premise or method. One subtle difference, which I believe to be appropriate, is to use an equal weighting across models rather than across simulations (which is what I believe Gavin did). I don't think there is any reason to give one model more weight just because more simulations were performed with it. In practice this barely affects the results. Another clever trick (not mine, so I can praise it without a hint of boastfulness) is to use not just the exactly matching time intervals from the models to compare to the data, but also to consider other intervals of equal length but different start months. It so happens that the mean trend of the models is very much constant up to 2020 and of course there were no exciting external events like volcanoes, so this gives a somewhat larger sample size with which to characterise the model ensemble. For longer trends, these intervals are largely overlapping, so it's not entirely clear how much better this approach is quantitatively, but it's still a nice idea.
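The two refinements can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the analysis code: the helper names (`ols_trend`, `windowed_trends`, `model_weighted_cdf`) and all the numbers are invented. The first two functions compute a least-squares trend for every start month of a fixed-length window; the third gives each model the same total weight regardless of how many trends it contributes.

```python
def ols_trend(values):
    """Least-squares slope of values against their index (units per step)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def windowed_trends(series, window):
    """Trends over every window of the given length, one per start month."""
    return [ols_trend(series[s:s + window])
            for s in range(len(series) - window + 1)]


def model_weighted_cdf(trends_by_model, threshold):
    """P(trend <= threshold), with each model given equal total weight."""
    per_model = [
        sum(1 for t in trends if t <= threshold) / len(trends)
        for trends in trends_by_model.values()
    ]
    return sum(per_model) / len(per_model)


# Invented example: model "A" contributes four trends, model "B" only one.
trends_by_model = {"A": [-0.1, 0.1, 0.2, 0.3], "B": [-0.2]}

# Equal weight per model: (1/4 + 1/1) / 2
print(model_weighted_cdf(trends_by_model, 0.0))  # 0.625

# Equal weight per simulation, for comparison: 2 of 5 pooled trends are <= 0
pooled = [t for trends in trends_by_model.values() for t in trends]
print(sum(1 for t in pooled if t <= 0.0) / len(pooled))  # 0.4
```

The toy numbers exaggerate the difference between the two weightings on purpose. Note also that trends from overlapping windows are strongly correlated, so the larger sample is not as informative as its size suggests.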

Anyway, without further ado, here are the results. First the surface observations, plotted as their trend overlaying the model distribution:

You should note that our results agree pretty well with Gavin's - over 8 years, the probability of a negative trend is around 15% on this graph, and we don't go to 20y but it's about 1% at 15y and changing very slowly. So I don't think there is any reason to doubt the analysis.

Then the satellite analyses (compared to the appropriate tropospheric temps, so the y axis is a little different):

And finally a summary of all obs plotted as the cumulative probability (ie one-sided p-level):

As you can see, the surface obs are mostly lowish (all in the lower half), and for several of the years the satellite analyses are really very near the edge indeed.

Note that the observational data points are certainly not independent realisations of the climate trend - they all use overlapping intervals which include the most recent 5 years. Really it's just a lot of different ways of looking at the same system. (If each trend length were independent, then the disagreement would be striking, as it's not plausible that all 11 different values would lie so close to the edge, even with the GISS analysis. But no-one is making that argument.)

It is also worth pointing out that this analysis method contradicts the confused and irrelevant calculations that some have previously presented elsewhere in the blogosphere. Contrary to the impression you might get from those links, the surface obs are certainly not outside the symmetric 95% interval (ie below the 2.5% threshold on the above plots), though you can get just past 5% for HadCRU for particular lengths of trend and a couple of the satellite data points do go below 2.5%, particularly those affected by the super-El-Nino of 1998.

As for the interpretation...well this is where it gets debatable, of course. People may not be entitled to their own facts, but they are entitled to reasonable interpretations of these facts. Clearly, over this time interval, the observed trends lie towards the lower end of the modelled range. No-one disputes that. But at no point do they go outside it, and the lowest value for any of the surface obs is only just outside the cumulative 5% level. (Note this would only correspond to a 10% level on a two-sided test.) So it would be hard to argue directly for a rejection of the null hypothesis. On the other hand, it is probably not a good idea to be too blasé about it. If the models were wrong, this is exactly what we'd expect to see in the years before the evidence became indisputable. Another point to note is that the satellite data shows worse agreement with the models, right down to the 1% level at one point, and I find it hard to accept that this issue has really been fully reconciled.
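The one-sided versus two-sided arithmetic is simple enough to write down. A minimal sketch, with a function name (`two_sided_level`) of my own choosing: a symmetric two-sided test counts both tails, so the level is twice the smaller tail probability.

```python
def two_sided_level(one_sided_p):
    """Convert a one-sided cumulative probability to a symmetric two-sided level."""
    if not 0.0 <= one_sided_p <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    return 2.0 * min(one_sided_p, 1.0 - one_sided_p)


# A cumulative probability of 5% corresponds to the 10% two-sided level,
# and the symmetric 95% interval is only left below a cumulative 2.5%.
print(two_sided_level(0.05))
print(two_sided_level(0.025))
```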

A shopping list of possible reasons for the results includes:

- Natural variability - the obs aren't really that unlikely anyway, they are still within the model range
- Incorrect forcing - eg some of the models don't include solar effects, but some of them do (according to Gavin on that post - I haven't actually looked this up). I don't think the other major forcings can be wrong enough to matter, though missing mechanisms such as stratospheric water vapour certainly could be a factor, let alone "unknown unknowns"
- Models (collectively) over-estimating the forced response
- Models (collectively) under-estimating the natural variability
- Problems with the obs

"However, the trend in global surface temperatures has been nearly flat since the late 1990s despite continuing increases in the forcing due to the sum of the well-mixed greenhouse gases (CO2, CH4, halocarbons, and N2O), raising questions regarding the understanding of forced climate change, its drivers, the parameters that define natural internal variability (2), and how fully these terms are represented in climate models."

That wasn't some sceptic diatribe, but rather Solomon et al, writing in Science (stratospheric water vapour paper). And there was also the Easterling and Wehner paper (which incidentally also uses a very similar underlying methodology for the model ensemble). Knight et al as well: "Observations indicate that global temperature rise has slowed in the last decade"

So all those who are hoping to burn me at the stake, please put away your matches.