The error of equating P(Data|Hypothesis) with P(Hypothesis|Data) is known as the Prosecutor's Fallacy, due to its frequent appearance in criminal trials (misinterpretation of DNA evidence, etc.). (The dispute on the Wikipedia page for the fallacy seems to concern the details of its applicability to a particular legal case, not the underlying theory.) Typically, it is illustrated via a simple discrete yes/no question along the following lines: if the probability of a random person matching a DNA sample from a crime scene is 1 in 1,000,000, then what is the probability that a suspect is guilty, given only that their DNA matches? The fallacious answer is 999,999 in 1,000,000. An easy way to see the flaw is to note that there are about 60,000,000 people in the UK, so about 60 of them will have matching DNA, only 1 of whom is the guilty one. (Note the various other assumptions I've made, including that a crime actually took place at all, that it was committed by one person, and that there is no other evidence concerning the suspect.)
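The head-count argument can also be checked by plugging the numbers into Bayes' Theorem directly. The following is only a sketch under stated assumptions: the uniform prior over the whole population and the certainty that the guilty person matches are simplifications added here, and the function name `posterior_guilt` is purely illustrative.

```python
# Posterior probability of guilt given only a DNA match.
# Assumptions (simplifications, not established facts): exactly one culprit
# in the population, every person equally likely a priori, and the guilty
# person's DNA always matches.

population = 60_000_000                  # rough UK population used in the text
p_match_given_innocent = 1 / 1_000_000   # random-match probability from the text
p_match_given_guilty = 1.0               # assumption: no false negatives

prior_guilty = 1 / population


def posterior_guilt(prior, p_match_guilty, p_match_innocent):
    """Bayes' Theorem for the two hypotheses 'guilty' and 'innocent'."""
    evidence = p_match_guilty * prior + p_match_innocent * (1 - prior)
    return p_match_guilty * prior / evidence


p_guilty = posterior_guilt(prior_guilty, p_match_given_guilty, p_match_given_innocent)
expected_innocent_matches = (population - 1) * p_match_given_innocent

print(f"expected innocent matches: {expected_innocent_matches:.1f}")
print(f"P(guilty | match)        = {p_guilty:.4f}")
print(f"fallacious answer        = {1 - p_match_given_innocent:.6f}")
```

Under these assumptions the posterior comes out at under 2%, nowhere near the fallacious 999,999 in 1,000,000. It is about 1 in 61 rather than exactly 1 in 60 because the culprit is counted alongside the roughly 60 expected innocent matches.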

Formally, the exact calculation of P(H|D) when we are given P(D|H) requires Bayes' Theorem:

P(H|D) = P(D|H) P(H) / P(D)

which requires the specification of a "prior" P(H) (P(D) is a normalisation constant which poses no real theoretical difficulty, although it might be hard to calculate in practice). It must be understood that this equation does not depend on "being a Bayesian" or "being a frequentist". It is simply a law of probability, which follows directly from the axioms (in particular, P(D,H)=P(D|H)P(H)=P(H|D)P(D)). So it's not something we can choose to obey or not - at least, not without abandoning any pretence that we are talking about probability as it is usually understood.
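For completeness, the step from the product rule to Bayes' Theorem can be written out as follows. The last line expands P(D) for the special case of a yes/no hypothesis (H or not-H), which is an assumption made here for illustration; with more than two hypotheses the sum runs over all of them. The division requires P(D) > 0.

```latex
\begin{align}
P(D, H) &= P(D \mid H)\,P(H) = P(H \mid D)\,P(D) \\
\Rightarrow\quad P(H \mid D) &= \frac{P(D \mid H)\,P(H)}{P(D)} \\
P(D) &= P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)
\end{align}
```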

Although the prosecutor's fallacy is generally demonstrated through discrete probability, Bayes' Theorem applies equally to continuous probability density functions, with f(h|d) being related to f(d|h) via f(h|d)=f(d|h)f(h)/f(d). This explains the distinction between confidence intervals and credible intervals demonstrated in the last post, since an experimental observation gives us f(d|h) (a likelihood function), and in order to turn it into a posterior pdf f(h|d) we need to use a prior f(h).

For example, given the previous apple-weighing example, we might have a prior belief that the apple will weigh about 100g, plus or minus 20g at 1 standard deviation (and strictly speaking, the prior should be truncated at 0). The likelihood function arising from the measurement is itself a Gaussian shape centred on the observed 40g, with a width of 50g - this function does extend to negative values, as a hypothetical negative-mass apple would have a nonzero probability of returning a 40g measurement. Applying Bayes' Theorem formally gives us the well-known result of optimal interpolation between two Gaussians, which in this case works out to 91.7 ± 18.6g. In this case, the observation is so poor that it hardly affects our prior belief, but if our scales had an error of only 5g we'd obviously depend far more on their output. In no case would we end up believing that the apple's mass was negative!
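The numbers above can be reproduced with a few lines of Python. This is a sketch, not a definitive implementation: `combine_gaussians` is an illustrative helper using the standard precision-weighted formula for the product of two Gaussians, and the grid check applies f(h|d)=f(d|h)f(h)/f(d) numerically with the prior truncated at 0. The 0 to 300g grid range and the 0.1g step are arbitrary choices made here.

```python
import math


def combine_gaussians(mean_a, sd_a, mean_b, sd_b):
    """Product of two Gaussians: precision-weighted mean and combined sd."""
    w_a = 1 / sd_a**2
    w_b = 1 / sd_b**2
    mean = (w_a * mean_a + w_b * mean_b) / (w_a + w_b)
    sd = math.sqrt(1 / (w_a + w_b))
    return mean, sd


def gaussian(x, mean, sd):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


prior_mean, prior_sd = 100.0, 20.0   # prior belief about the apple's mass (g)
observed = 40.0                      # reading from the scales (g)

# Closed-form posterior for poor scales (50g error) and good scales (5g error).
poor_mean, poor_sd = combine_gaussians(prior_mean, prior_sd, observed, 50.0)
good_mean, good_sd = combine_gaussians(prior_mean, prior_sd, observed, 5.0)
print(f"50g scales: {poor_mean:.1f} +- {poor_sd:.1f} g")
print(f" 5g scales: {good_mean:.1f} +- {good_sd:.1f} g")

# Numerical check on a grid of non-negative masses, i.e. with the prior
# truncated at 0 (grid range and step are arbitrary illustration choices).
step = 0.1
grid = [i * step for i in range(3001)]                     # 0g to 300g
unnormalised = [gaussian(observed, h, 50.0) * gaussian(h, prior_mean, prior_sd)
                for h in grid]                             # f(d|h) f(h)
evidence = sum(unnormalised) * step                        # f(d)
posterior = [u / evidence for u in unnormalised]           # f(h|d)
grid_mean = sum(h * p for h, p in zip(grid, posterior)) * step
print(f"grid posterior mean (truncated prior): {grid_mean:.1f} g")
```

With the 50g scales this gives the 91.7 ± 18.6g quoted above. With 5g scales the posterior moves to roughly 43.5 ± 4.9g, close to the measurement. Because the grid contains no negative masses, the truncated posterior assigns them zero probability by construction, and in this example the truncation barely changes the mean.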

Next, and perhaps last (for now at least): what the literature says.
