Wednesday, September 10, 2008

How not to combine multiple constraints...

Some time ago, I mentioned in passing a "howler" in press from CPDN. Since I was so sarcastic about it back then, it is only fair to provide an update. However, this post may be a bit long and technical for some...and things have changed enough since my previous statement that I probably wouldn't bother posting this but for that earlier comment. But anyway, there isn't much else going on in climate science right now, so here goes...

The story starts way back with the IPCC AR4 drafts, actually. The very first semi-public version of this (ie the first version that people like me were allowed to see and comment on) attempted to address the issue of combining the different probabilistic estimates of climate sensitivity. They did this by averaging different pdfs that various authors have generated. It was immediately obvious to me (and possibly many others) that this was a completely invalid approach, and I said as much in my comments. In fact this episode - combined with the growing realisation that people were actually taking these outlandish "pdfs of climate sensitivity" seriously - is largely what prompted me to quickly write the "multiple constraints" paper. Anyway, this "averaging" result was removed from the next draft and I had hoped that our paper (which as far as I'm aware, no-one has argued against in the literature) might have put an end to that whole idea.

Fast forward a few months and I saw an AGU presentation (by Reto Knutti) which appeared to again address the question of combining constraints. Its punchline was an analysis where adding more observations appeared to result in an increase in uncertainty. Now, while it is certainly true that additional information can lead to an increase in uncertainty, such situations tend to be somewhat pathological and uncommon in practice, so I had a closer look. It only took a couple of emails back and forth with Reto to work out that there was something badly wrong in what they had done - they had "combined" their constraints by averaging them, so I pointed out that this was invalid. Reto didn't reply to this point, and I assumed he had gone away and re-worked things. So, I was more than a little surprised that the next time I saw him speak, in the autumn of 2007, he presented the same result and mentioned that the paper (Sanderson et al) was now in press!

Time for a slight digression into why averaging the constraints is wrong. There is a standard theorem in probability theory that one always expects to learn from new information (and this somewhat woolly statement can be formalised in a rigorous manner). This general result crops up in all sorts of specific applications, and its correctness is not in doubt. Now, the sort of quadratic cost function that Sanderson et al use as their basic measure of model-data mismatch is pretty much ubiquitous in estimation, and it has a natural identification with the (negative) logarithm of the likelihood function corresponding to Gaussian uncertainties. It is no exaggeration to say that these quadratic log-likelihood cost functions underpin almost all practical probabilistic prediction and estimation in geosciences, all the way from a simple least-squares linear regression up to the advanced data assimilation methods used in modern numerical weather prediction. As a simple illustration, let's say we have an unknown variable x, and we make one observation o of it which has (Gaussian) observational uncertainty e. The resulting likelihood function P(o|x) is the Gaussian N(o,e), and taking the negative logarithm of this we reach the quadratic cost function 0.5*((x-o)/e)^2. If we have more than one observation, things are a little more complicated because we need to account not only for the uncertainties on each observation, but also for the covariances between them. The resulting multivariate Gaussian likelihood N(O,E) naturally gives rise to the cost function 0.5*(X-O)^T E^-1 (X-O), where X and O are now vectors of the estimated and observed variables, and E is the covariance matrix. In the simple case where we have two observations of a single variable and the covariance matrix of the errors E is diagonal, this simplifies to 0.5*[((x-o1)/e1)^2 + ((x-o2)/e2)^2] - and this is the direct sum of the cost functions that arise from taking each observation individually. And whatever the form of the covariance matrix E, the log-likelihood is never equivalent to 0.25*[((x-o1)/e1)^2 + ((x-o2)/e2)^2], which is the result of the "averaging the constraints" procedure that Sanderson et al invented.
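
For anyone who prefers code to algebra, here is a minimal sketch of the above in Python/NumPy; the function names and numbers are mine, purely for illustration:

import numpy as np

# Single observation o of an unknown x, with Gaussian error e:
# the negative log-likelihood (up to an additive constant) is the quadratic cost.
def cost_single(x, o, e):
    return 0.5 * ((x - o) / e) ** 2

# Several observations O of the same quantity, with error covariance matrix E:
# the joint negative log-likelihood is the usual quadratic form.
def cost_joint(x, O, E):
    r = x - np.asarray(O)                  # residual vector (X - O)
    return 0.5 * r @ np.linalg.solve(E, r)

# With two independent observations (diagonal E) the joint cost is the SUM
# of the individual costs - not their average.
o1, e1, o2, e2 = 3.0, 0.5, 3.4, 1.0
E = np.diag([e1**2, e2**2])
x = 3.2
print(cost_joint(x, [o1, o2], E))                        # 0.1
print(cost_single(x, o1, e1) + cost_single(x, o2, e2))   # 0.1, identical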

The (normal, correct) cost function above corresponds to the Gaussian N((o1*e2^2+o2*e1^2)/(e1^2+e2^2), sqrt(e1^2*e2^2/(e1^2+e2^2))), this being the well-known formula for the optimal interpolation of two uncertain estimates. It is easy to see that the width of this Gaussian is lower than for either of the single-observation Gaussians, since sqrt(e1^2*e2^2/(e1^2+e2^2)) is less than the smaller of e1 and e2. (The same result holds if we use a full covariance matrix E with off-diagonal elements, but it's not as easy to show that within Blogger's limited maths capability!)

If, however, we use the average of the two single-obs cost functions, the resulting Gaussian is now N((o1*e2^2+o2*e1^2)/(e1^2+e2^2), sqrt(2*e1^2*e2^2/(e1^2+e2^2))) and it is immediate that its width, sqrt(2*e1^2*e2^2/(e1^2+e2^2)), lies between e1 and e2. So, if we start with the more accurate observation, and then "combine" it with a less accurate one through this "averaging" methodology, then the final answer will always have greater uncertainty than we started with, which is contrary to the well-established theorem I mentioned previously. Effectively, this method is asserting that a weak observation with wide uncertainty should cause us to forget what we already knew!
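
Again in code, a quick numerical check of those two widths (the values e1 = 0.5 and e2 = 1 are just illustrative):

import numpy as np

e1, e2 = 0.5, 1.0   # a tighter and a looser observation

# Width implied by SUMMING the two quadratic cost terms (the correct combination):
w_sum = np.sqrt(e1**2 * e2**2 / (e1**2 + e2**2))

# Width implied by AVERAGING them, i.e. summing and dividing by N=2,
# which inflates the implied width by sqrt(2):
w_avg = np.sqrt(2 * e1**2 * e2**2 / (e1**2 + e2**2))

print(w_sum)   # ~0.447: narrower than both e1 and e2, as it should be
print(w_avg)   # ~0.632: between e1 and e2, i.e. wider than the better observation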

So in summary, averaging constraints is obviously nonsense that has no possible probabilistic interpretation. I spent a lot of time late last year trying to explain this in simple terms to several of the authors of the Sanderson et al paper, and in response they gave every appearance of trying to argue that averaging was a valid approach, although they also accused me of reading too much into the paper. To be fair, they had not directly generated any posterior pdfs, and did not explicitly state that these cost functions were log-likelihoods, but given (a) the theoretical background I have briefly outlined above, and (b) the fact that the supposed increase in uncertainty as more observations were considered had been repeatedly presented as the main result of this work, I don't think my reading of it was unreasonable. Moreover, if they had really meant to disavow this interpretation, then it is puzzling that they continued to defend the methodology as valid. Myles Allen even challenged me to write a comment if I thought their paper was wrong, and given the potential impact of this erroneous "averaging the constraints" meme if it is widely adopted, I was originally tempted to do so. But I have just looked for the paper on-line, and am surprised (but pleased) to see several changes that have been made subsequent to our discussions. For starters, they have removed the most contentious claim - which was previously made in the abstract as well as the main text - that their cost function provides a constraint on climate sensitivity that weakens as more observations are used. In fact the paper now contains an entirely new paragraph that was not there before, which explicitly states that their results cannot be interpreted in terms of log-likelihoods. They could have gone a little further and admitted that the widths of their cost functions cannot be directly related to uncertainty in any way whatsoever, but I suppose that would have been rather embarrassing given that the explicit goal of the paper is to explore this issue. They have also added a little discussion of their reasoning behind the averaging method (which was not justified at all in the earlier version of the paper). It is now explicitly presented as an alternative to the "incorrect" approach of adding the individual terms, which (as they state) is wrong if the uncertainties are not independent. It is disappointing that they didn't find the space to mention here that the obvious alternative to adding the terms, if they are not independent, is simply to use the full covariance matrix as I have outlined above. For while the addition of individual constraints is of course incorrect unless the uncertainties are judged to be independent, it may at least be a reasonable approximation if the dependencies are small (and it could in principle be biased either way), whereas their curious "averaging" method is guaranteed to be wrong for all possible covariance structures, with the resulting uncertainty guaranteed to be too wide. Indeed it is trivial to observe that averaging is algebraically equivalent to addition, but for an arbitrary and unjustifiable multiplicative scaling factor of sqrt(N) on the width of the implied joint likelihood function, where N is the number of individual terms.

I guess the authors would claim that these last-minute changes to the paper are nothing more than cosmetic - just a matter of explaining things in simple enough terms for people like me to understand, rather than representing any substantive modifications to their previous claims. (They certainly don't acknowledge my help in fixing the problems in the original version.) The real proof of the pudding will be in whether they and others continue to use this "averaging the constraints" meme...I will watch with interest to see if and how it spreads...

7 comments:

crandles said...

Cannot comment much without reading the paper.

The abstract includes:

"The use of annual mean or seasonal differences on top-of-atmosphere radiative fluxes as an observational error metric results in the most clearly defined minimum in error as a function of sensitivity, with consistent but less well-defined results when using the seasonal cycles of surface temperature or total precipitation."

I get the impression that an "unjustifiable multiplicative scaling factor of sqrt(N) on the width of the implied joint likelihood function" would not have any effect on where the minimum is or how well defined it is. However that assumes the observations are independent or so close as to not matter. If there are correlations, would you consider that this would be likely/very likely/etc to have much(?) effect on where in parameter space the minimum is or how well defined it is?

If it is likely to have a noticeable effect, is this likely to be intractable or could estimates be made?

I imagine errors in the neural emulation might (also?) be quite significant.

James Annan said...

I get the impression that an "unjustifiable multiplicative scaling factor of sqrt(N) on the width of the implied joint likelihood function" would not have any effect on where the minimum is or how well defined it is.

It won't affect the location, but in context "how well defined it is" is precisely the "sharpness" of the minimum. Try plotting x^2, 0.01x^2 and 100x^2 (well you hardly need to go to the trouble...) and of course they all have the same minimum, which in mathematical terms is perfectly well defined, but in the context of the paper this phrasing refers to the spread in x for which the cost does not vary very much (for some rather vague definition of "very much") and thus the minimum of 100x^2 is much more tightly defined than 0.01x^2. The main point remains that the operation of averaging the constraints cannot correspond to learning from new observations.
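
To put that in code (a trivial Python sketch, with a grid and coefficients of my own choosing):

import numpy as np

# The three cost curves mentioned above: c*x**2 for c = 0.01, 1, 100. They share
# the same minimum, but read as negative log-likelihoods (c*x**2 = 0.5*(x/e)**2
# with e = 1/sqrt(2c)) they imply very different spreads around it.
x = np.linspace(-3, 3, 13)
for c in (0.01, 1.0, 100.0):
    cost = c * x**2
    print(c, x[np.argmin(cost)], 1.0 / np.sqrt(2 * c))   # same minimum (x=0); implied widths ~7.1, ~0.71, ~0.071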

I'm surprised the final version is not on the CPDN web site, as they usually put their stuff up pretty quickly.

The NN stuff is potentially interesting, although there are alternative (standard) methods for emulation which have a much more rigorous statistical foundation. It is not clear to me why they chose this approach, or how its performance compares to the standard method.

crandles said...

>"Try plotting x^2, 0.01x^2 and 100x^2 (well you hardly need to go to the trouble...) and of course they all have the same minimum, which in mathematical terms is perfectly well defined, but in the context of the paper this phrasing refers to the spread in x for which the cost does not vary very much (for some rather vague definition of "very much") and thus the minimum of 100x^2 is much more tightly defined than 0.01x^2."

Yes, I expressed that badly; it needed a relative comparison (ie relatively how well defined it is). If you plot x^2+30 and y^2+30 and find x is better defined than y for some vague 'not very much' (a%?) variation in cost, then if you instead plot 100x^2+3000 and 100y^2+3000 you will still end up with the same conclusion that x is relatively better defined than y.

So if I have now understood correctly (but demonstrated I have failed to express that clearly in writing), the conclusion in the abstract that I quoted remains unaffected in the idealised case of independent observations.

>"The main point remains that the operation of averaging the constraints cannot correspond to learning from new observations."

I don't, at present, see any reason to dispute this, or that it is the main point of your post. So I have nothing much to say. I would prefer to discuss the extent to which the problem you have identified and explained remains a problem for the conclusions that have been left in. ie could co-varying observations change a well defined relative comparison if your correct full covariance matrix method was used instead of the paper's averaging approach?

I am just wondering if no answer means the full matrix solution could well be either an intractable problem, which might make it not an available alternative, or that the effects are so negligible they are drowned out by other issues.

Either of these would appear to make some sense of your decision not to write a comment ... but wait until the method is copied somewhere else in a way that can affect a paper's conclusions?

Probably too soon for such pondering, absence of evidence ...

James Annan said...

Their paper really discussed methods and generalities with no solid numerical answers, so there is nothing really to "correct" in that sense. The difficulty in real applications is probably in deciding what the matrix should be rather than actually solving the resulting problem (it may depend on the size of the problem too).

I'd be very happy if the whole sorry mess died a death, but the worst case scenario of course is that some poor sod unconnected with the original work reads: "we have chosen to average the quadratic cost functions rather than adding them. ... to combine errors in this [latter] way would be incorrect", takes this to mean that averaging is correct, and does some completely bogus calculation as a result. Of course I may be fussing over nothing. We'll have to wait and see...but it seems they have gone to some lengths to promote this averaging method while maintaining plausible deniability. That is, they have not actually made any explicit claims that averaging is ever correct in any situation at all, but it is easy to see how this paper - and indeed that specific section I quoted - could be interpreted as supporting the approach.

Hank Roberts said...

Making no pretense at understanding the math -- this showed up in "related items" searches for climate articles, and vice versa. It appears from this that the economists have rediscovered Catton's "Overshoot" caution about removing redundancy from the human economy to the point where one hiccup causes a crash. They are warning one another about risks of low probability events -- like whatever happened to the credit system.

Brief excerpt in case it tempts any of the climate scientists to comment:

http://www.edge.org/3rd_culture/taleb08/taleb08_index.html

"3) Beware the "atypicality" of remote events. There is a sucker's method called "scenario analysis" and "stress testing"—usually based on the past (or some "make sense" theory). Yet I show in the appendix how past shortfalls that do not predict subsequent shortfalls. Likewise, "prediction markets" are for fools. They might work for a binary election, but not in the Fourth Quadrant. Recall the very definition of events is complicated: success might mean one million in the bank ...or five billions!

4) Time. It takes much, much longer for a times series in the Fourth Quadrant to reveal its property. At the worst, we don't know how long. Yet compensation for bank executives is done on a short term window, causing a mismatch between observation window and necessary window. They get rich in spite of negative returns. But we can have a pretty clear idea if the "Black Swan" can hit on the left (losses) or on the right (profits).

The point can be used in climatic analysis. Things that have worked for a long time are preferable—they are more likely to have reached their ergodic states.

5) Beware Moral Hazard. Is optimal to make series of bonuses betting on hidden risks in the Fourth Quadrant, then blow up and write a thank you letter. Fannie Mae and Freddie Mac's Chairmen will in all likelihood keep their previous bonuses (as in all previous cases) and even get close to 15 million of severance pay each.

6) Metrics. Conventional metrics based on type 1 randomness don't work. Words like "standard deviation" are not stable and does not measure anything in the Fourth Quadrant. So does "linear regression" (the errors are in the fourth quadrant), "Sharpe ratio", Markowitz optimal portfolio, ANOVA shmnamova, Least square, etc. Literally anything mechanistically pulled out of a statistical textbook.

My problem is that people can both accept the role of rare events, agree with me, and still use these metrics, which is leading me to test if this is a psychological disorder.

The technical appendix shows why these metrics fail: they are based on "variance"/"standard deviation" and terms invented years ago when we had no computers. One way I can prove that anything linked to standard deviation is a facade of knowledge: There is a measure called Kurtosis that indicates departure from "Normality". It is very, very unstable and marred with huge sampling error: 70-90% of the Kurtosis in Oil, SP500, Silver, UK interest rates, Nikkei, US deposit rates, sugar, and the dollar/yet currency rate come from 1 day in the past 40 years, reminiscent of figure 3. This means that no sample will ever deliver the true variance. It also tells us anyone using "variance" or "standard deviation" (or worse making models that make us take decisions based on it) in the fourth quadrant is incompetent.

7) Where is the skewness? Clearly the Fourth Quadrant can present left or right skewness. If we suspect right-skewness, the true mean is more likely to be underestimated by measurement of past realizations, and the total potential is likewise poorly gauged. A biotech company (usually) faces positive uncertainty, a bank faces almost exclusively negative shocks. I call that in my new project "concave" or "convex" to model error....

crandles said...

>"The NN stuff is potentially interesting, although there are alternative (standard) methods for emulation which have a much more rigorous statistical foundation. It is not clear to me why they chose this approach, or how its performance compares to the standard method."

What do you consider "the standard method"?


I now find out there is work going on to use "a technique called Random Forests as an emulator, which benefits from requiring very little tuning to get good results and it also provides a measure of uncertainty in each prediction.....
sending out some verification runs on CPDN .... each with an initial condition ensemble of 4 members, based on a continuous sampling of parameter space (rather than the discrete sampling used in the original ensemble), and am currently analysing the results. [student] plans to compare different emulation techniques, namely neural nets v random forest v gaussian process emulators and highlight the strengths and weaknesses of each.

(I expect the above will be made publicly available soon so I hope there isn't too much harm in posting this now in this obscure location.)

James Annan said...

I mean the Bayesian emulator ideas introduced by Kennedy and O'Hagan, now being developed in the climate context by Jonty Rougier and people at the Hadley Centre. I think this is probably what "Gaussian process emulators" refers to in your paragraph.

The random forests abstract is already up on the EGU meeting site, so I don't think you need to worry about breaching confidences! Interestingly, the poster seems to be withdrawn, which is a shame.