Ok, having got various things out of the way, on with the show.
I liked this letter which appeared in Nature recently. Not just because I'd done something similar myself with the earlier Hansen forecast :-) In general, I think it's important to revisit historical statements to see how well they held up. Allen et al have gone back to a forecast they made about 10 years ago, and checked how well it matches up to reality. The answer is...
really well. On the left is the original forecast with new data added, and on the right is the same result re-expressed relative to a 1986-96 baseline. The forecast was originally expressed in terms of decadal means, so I don't think there is anything untoward in the smoothing. The solid black line in the left plot is the original HadCM2 output, with the dashed line and grey region representing the adjusted result after fitting to recent (at that time) obs using standard detection and attribution techniques.
They also compared their forecast to a couple of alternative approaches:
This plot shows the HadCM2 forecast (black), CMIP5 models (blue) and another possible forecast of no forced trend, just a random walk (green). The red line is the observed temperature. They point out that their forecast performed better than the alternatives, in the sense that it assigned higher probability (density) to the observations.
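To make that comparison criterion concrete, here is a minimal sketch of density scoring, assuming for illustration that each forecast can be summarised as a Gaussian. The CMIP5 mean and width and the observed anomaly are the rough values I read off the graph (discussed below); the random-walk spread is just an illustrative number, not one quoted anywhere here.

```python
from scipy.stats import norm

obs = 0.27  # observed decadal anomaly (rough graph-read value)

# Gaussian summaries of two of the forecasts.  The CMIP5 mean and width
# are rough values read off the graph; the random-walk spread is a
# purely illustrative guess, not a number from the paper or the figure.
forecasts = {
    "CMIP5":       norm(loc=0.41, scale=0.10),
    "random walk": norm(loc=0.00, scale=0.50),  # hypothetical width
}

# Score each forecast by the probability density it assigns to what
# actually happened: higher is better.
for name, dist in forecasts.items():
    print(f"{name:12s} density at obs = {dist.pdf(obs):.2f}")
```

Whatever width you give the random walk, it can't score much above 0.9 at the obs (that's the best a Gaussian centred on zero can manage at 0.27), so the CMIP5 forecast comes out ahead for any spread you care to assign to the green curve.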
So far, so good. However, I disagree with their statement that "the CMIP5 forecast also clearly outperforms the random walk, primarily because it has better sharpness" (my emphasis). Actually, the CMIP5 forecast outperforms the random walk simply because it is clearly much closer to the data. The CMIP5 mean is about 0.41 in these units (all these numbers are just read off the graph, and may not be precise), the random walk is of course 0, and the observed anomaly is 0.27. The only ways a forecast based on the CMIP5 mean could have underperformed the random walk would have been if it was either so sharp that it excluded the obs (which in practice would mean a standard deviation of 0.06 or less, giving a 90% range of 0.31-0.51), or so diffuse that it assigned low probability across a huge range of values (i.e. a standard deviation of 0.75 or greater, with an associated 90% range of -0.8 to 1.6). The actual CMIP5 width here seems to be close to 0.1, well within those sharpness bounds.
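Here is the same kind of check applied to the sharpness question: a sketch that holds the CMIP5 mean at 0.41 and the obs at 0.27 (the rough graph-read values above) and varies only the forecast width, reporting the 90% range and the density assigned to the obs in each case.

```python
from scipy.stats import norm

cmip5_mean, obs = 0.41, 0.27  # rough graph-read values

# Density assigned to the obs by a Gaussian forecast centred on the
# CMIP5 mean, as the forecast width varies.  The widths tried are the
# two crossover values quoted above (0.06 and 0.75), the apparent
# actual CMIP5 width (~0.1), and the width that maximises the score,
# which for a Gaussian is simply |obs - mean| = 0.14.
for sd in (0.06, 0.10, 0.14, 0.75):
    f = norm(loc=cmip5_mean, scale=sd)
    lo, hi = f.interval(0.90)
    print(f"sd={sd:.2f}: 90% range {lo:+.2f} to {hi:+.2f}, "
          f"density at obs = {f.pdf(obs):.2f}")
```

If my crossover estimates are about right, both extremes score around 0.4-0.5, which must be roughly the density the random walk assigned to the obs; the actual width of about 0.1 scores roughly three times higher than that.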
I do think I know what the authors are trying to say, which is that if you are going to be at the 5th percentile of a distribution, it's better to be at the 5th percentile of a sharp forecast than a broad one. But changing the sharpness of the forecast based on CMIP5 would obviously mean the obs were no longer at the 5th percentile! In fact, despite not quite hitting the obs, the CMIP5 forecast is not that much worse than the tuned forecast (black curve), thanks to being quite sharp. And according to the authors' own estimation of how much uncertainty they had in their original forecast, they obviously got extraordinarily lucky to hit the data so precisely. With their forecast width, it would have been almost impossible to miss the 90% interval - this would have required a very large decadal jump in temperature. I don't think it is reasonable to say that one method is intrinsically better than the other, on the basis of a single verification point that both methods actually forecast correctly. If the obs had come in at say 0.4 - which they forecast with high probability - I hardly think they would have been saying that the CMIP5 ensemble of opportunity was a superior approach.
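To put a number on that: if the obs sits exactly at the 5th percentile of a Gaussian forecast, the density it receives is the standard normal density at 1.645 divided by the forecast's standard deviation, i.e. roughly 0.1/width, so at a fixed percentile halving the width doubles the score. But of course the percentile isn't fixed: with the mean stuck at 0.41, broadening the forecast moves the obs towards the middle of the distribution, and as the numbers above show, the score only drops below the random walk's once the width reaches 0.75 or so.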
(For what it's worth, I think the method used in this forecast intrinsically has exaggerated uncertainty, but that's another story entirely.)