Sunday, April 12, 2020

Reporting delays etc

Made this GIF overnight. It shows how my forecast evolves as more data are added to the system. Initially it doesn't assume any control will be imposed, and then on the appropriate day (23rd March) it assumes a change in R (value to be estimated), which feeds through into deaths after a couple of weeks according to the model dynamics.

But something about it bugged me. Why did the model fail in mid-late March? If it's supposed to be a decent forecasting system, it should be predicting where the future data will lie. The late March data were not affected by the lockdown; it's just that the model is overestimating the pace of growth from the early data. I played around with a few ideas and really didn't manage to fix it. I did, however, notice that the data jumped sharply from the 13th to the 14th of March, from 1-2 deaths per day to 10+ deaths per day, without a single day in between. This is actually pretty implausible from a statistical point of view if the underlying death rate is growing steadily in an exponential way, as theory and practice both lead us to expect.

So I went and had a more careful look at the data. 

These data are not, as people might assume, the number of people who have died on a given day. They are actually the number of reports collected in a given 24h period, which may represent deaths at any earlier time. I already knew this and also knew that it shouldn't affect the growth rate estimate, so long as the delays are fairly consistent over time. This can be checked with an alternative data set in which deaths are actually collated by true date of death rather than date of report, and this is what the next plot does:

Oops, I didn't label the axes properly: it's deaths vs dates. The red/pink circles are the same data as before, that is to say, the number in the daily report that features on the news each day. The blue/cyan triangles, on the other hand, are the deaths that are known to have actually occurred on each day. The first thing to note is that the blue/cyan points are for the most part higher, and that's despite these numbers only relating to England, so the UK as a whole will be another 10-20% higher still. There is a drop-off towards the present day, where the totals are probably not yet complete and some to-be-added deaths will turn out to have occurred on these days. This is specifically warned about in the data. Ignoring these points, the slopes of the two data sets are strikingly similar, just as theory expects (which is good, as I was relying on this for my analyses to make sense). The fits are the dotted pink and cyan lines, with the extent of each line showing the points I used to derive it. So far, so good.
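The theory being relied on here is that a fixed reporting delay rescales an exponentially growing series without changing its growth rate. A minimal sketch of that with synthetic numbers (the growth rate, starting level and delay weights below are made up for illustration, not estimated from the UK or England data):

```python
import math

def log_linear_fit(days, counts):
    """Least-squares fit of log(count) = intercept + slope * day."""
    logs = [math.log(c) for c in counts]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def apply_reporting_delay(deaths, delay_weights):
    """Spread each day's deaths over later report days with fixed weights."""
    reported = []
    for day in range(len(deaths)):
        total = 0.0
        for lag, weight in enumerate(delay_weights):
            if day - lag >= 0:
                total += weight * deaths[day - lag]
        reported.append(total)
    return reported

# Hypothetical inputs: noise-free exponential growth and a fixed delay profile.
growth_rate = 0.24                          # per day, assumed
days = list(range(40))
by_date_of_death = [0.5 * math.exp(growth_rate * d) for d in days]
delay_weights = [0.1, 0.3, 0.3, 0.2, 0.1]   # fraction reported 0..4 days late

by_date_of_report = apply_reporting_delay(by_date_of_death, delay_weights)

# Skip the first few days, where the reported series is still filling up.
warm_up = len(delay_weights) - 1
slope_death, intercept_death = log_linear_fit(
    days[warm_up:], by_date_of_death[warm_up:])
slope_report, intercept_report = log_linear_fit(
    days[warm_up:], by_date_of_report[warm_up:])

print(slope_death, slope_report)            # growth rates per day
print(intercept_death, intercept_report)    # log-scale offsets
```

In this idealised setting both fitted slopes equal the assumed growth rate and only the intercept drops, so the by-date-of-report series sits below the by-date-of-death series but runs parallel to it. If the delay profile changes over time, that argument no longer holds.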

So now look at the initial segment, where I have drawn a couple of bolder lines in red and dark blue. These are linear fits to the darker blue and red data points, and their slopes are quite different. The blue one agrees with the cyan: I also extended it with the thin solid cyan line, and it's coincidentally (confusingly) identical to the dotted one. The red one, on the other hand, extends as the solid pink line and clearly misses the future data, just as my slightly more complex model fit did. You can also see that the blue dots show no fewer than 5 days that actually had 3-9 deaths inclusive, despite there being none of these in the red data. My fit using the full model is not actually just a linear regression, but it's rather similar and creates the same effect.

My conclusion is that my prediction struggles here because the red data were just particularly poor at this point. This wasn't due either to bad luck with the randomness of death or to the model failing to represent the underlying dynamics (the blue data are perfectly linear), but almost certainly to some weird reporting errors being significantly larger than I had allowed for in my estimation. The chance of not getting a single day in that range of intermediate values is extremely low in a world where we had roughly a whole week with the expected death rate in that range. I don't know if they actually changed the system around that time or not - might do a bit more digging.
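The claim about the missing intermediate values can be checked roughly as follows. This is only a sketch: it assumes daily counts are independent Poisson draws around an exponentially growing mean, and the starting mean, growth factor and number of days are illustrative guesses rather than fitted values.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson variable."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def prob_in_range(mean, low=3, high=9):
    """Probability that a Poisson count lands in [low, high] inclusive."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))

def prob_no_day_in_range(start_mean, daily_growth, n_days, low=3, high=9):
    """Probability that none of n_days independent counts lands in the range."""
    p_none = 1.0
    for day in range(n_days):
        mean = start_mean * daily_growth ** day
        p_none *= 1.0 - prob_in_range(mean, low, high)
    return p_none

# Hypothetical scenario: expected deaths rise from 1.5/day by 25% per day,
# so the mean spends several days inside the 3-9 band.
p = prob_no_day_in_range(start_mean=1.5, daily_growth=1.25, n_days=10)
print(f"P(no day with 3-9 deaths) = {p:.2e}")
```

With these made-up inputs the probability comes out well below 1%, which is the sense in which a jump straight from 1-2 to 10+ is hard to blame on chance alone.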


Everett F Sargent said...

A lot of backdating would appear to be in order ...
Particularly over the past 2-3 weeks.

Germany managed to reanimate 31 zombies on Saturday!

Everett F Sargent said...

IMHO particularly daunting is the so-called 'do nothing' alternative. Did they actually expect individuals to 'do nothing' when they do something wrt the common cold and flu shots? It is like saying those individuals are fated to die regardless, and the individuals also say to themselves that they are fated to die, so why bother to do anything at all.

Another confounding issue is days-to-weeks of extended end-of-life due to placing individuals on ventilators and into ICUs.

A 3rd issue is the population density demography as most deaths occur in highly urbanized areas (where one might expect R0 to be higher as opposed to largely rural areas).

A 4th issue is location, location, location, as in, different model constants for different countries, for example.

A 5th issue is why hasn't the pandemic spread to large parts of the World, e.g. Africa, India, South America, Australia, New Zealand. Sure, you can explain a few of these, but can you explain them for well over 100 countries? I perceive a pattern of mobility, latitude and seasonal effects (NH vs SH).

Phil said...

New Zealand and Australia have fairly effective governments.
Even to the point of declaring the Easter Bunny to be an essential worker, who might not make it to every house this year.

South America, perhaps you are behind on reading the news.

India is blaming the Muslim minority.

Africa may be partly protected by heat and humidity, but spread is reported.

The poorest countries will follow the 'do nothing' alternative. They can't afford to pause the economy, or many would starve. Even if they had a low number of cases, testing, case tracking and isolation would be unaffordable. The death rate in the poorest of countries will be lower due to younger aged populations.

Everett F Sargent said...


The question was ... Sure, you can explain a few of these, but can you explain them for well over 100 countries?

So you answered the 'few of these' part somewhat (I'm pretty sure Ecuador isn't SA, Brazil would be better, blaming Muslims is irrelevant and the WHO Africa report is a bit of a non sequitur). Already knew what AU, NZ, JP, TW and CA were doing, as in the 'few of these' part above.

Doubling time is now ~6.5 days (2020-04-09) for the RoW (Rest of World = World-EUUSCNIRKRJPDPCAAUNZTW), having bottomed out at ~2.5 days on 2020-03-28 (pretty much a straight line in log-normal space, but subject to chance as more time passes).

But thanks anyway.

Everett F Sargent said...

"2020-03-28" should be "2020-03-19" above (or 22 elapsed days through 2020-04-09).

Everett F Sargent said...

"chance" should be "change" above.

PeteB said...


I guess you have seen this - breakdown of the deaths announced by day

Nic Lewis said...

Hi James
Nice work. And I agree with you that it is absolutely fine for non-specialists to have a go at epidemiological modelling of the COVID-19 pandemic.

However, apart from the limitations of the data, I think that there is a model identification problem. This conclusion is currently based on an analytical asymptotic solution to the equations of the Thomas House model rather than actually running your code with different central Lp and Ip parameter values. I will illustrate the problem using your UK modelling in Fig. 2(b).

Based on your values of latent period Lp= 4 days and infectious period Ip= 2 days, and the initial slope of the fitted red line (which I estimate corresponds to a daily growth factor of 1.268x), my analytical solution gives an R0 value of 2.98, exactly the same as you derive. Likewise, the final slope of the red line (daily growth factor 0.8925) gives Rt= 0.497, very close to your 0.49 figure.

However, your Ip= 2 days value seems very low in the light of the evidence available. Suppose instead that you had used Thomas House's original values of Lp= 5 days and Ip= 7 days. My estimates, to match your red lines, then become R0= 6.02 and Rt= 0.23. Not only are the absolute values very different, but the ratio of Rt to R0 is hugely different.

If I select Lp= 1 day and Ip= 7 days, which is consistent with pre-symptomatic cases being infectious, the analytical R0 value to match the 1.268x growth factor becomes 2.96 - indistinguishable from your 2.98 estimate. But the analytical Rt value is then 0.40, quite a lot lower than your 0.49. And if I take an extreme case where a person becomes infectious as soon as they are infected, and remains so for 10 days (Lp= 0, Ip= 10 days), my solution is R0= 3.00, which again matches your 2.98, but my Rt value becomes 0.26.

So, if my solution is correct, selecting realistic Lp and Ip values is critical to deriving realistic values for R0 and Rt, and for their ratio.
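The comment does not state its analytical solution. One standard relation between asymptotic growth rate and reproduction number, for a model whose latent and infectious periods are each split into equal exponential sub-stages, does seem to reproduce the quoted figures when two sub-stages are assumed for each. The sketch below uses that relation; the choice of two sub-stages and the function name are assumptions for illustration, not something confirmed above.

```python
import math

def reproduction_number(daily_growth_factor, latent_period, infectious_period,
                        latent_stages=2, infectious_stages=2):
    """R implied by an asymptotic daily growth factor.

    Assumes latent and infectious periods built from equal exponential
    sub-stages (2 each by default, which is a guess, not a quoted value).
    Periods are mean durations in days.
    """
    r = math.log(daily_growth_factor)   # continuous growth rate per day
    if r == 0.0:
        return 1.0                      # no growth or decline means R = 1
    latent_term = (1.0 + r * latent_period / latent_stages) ** latent_stages
    infectious_term = 1.0 - (
        1.0 + r * infectious_period / infectious_stages
    ) ** (-infectious_stages)
    return r * infectious_period * latent_term / infectious_term

# Growth factors and (Lp, Ip) pairs taken from the comment above.
for lp, ip in [(4, 2), (5, 7), (1, 7), (0, 10)]:
    r0 = reproduction_number(1.268, lp, ip)
    rt = reproduction_number(0.8925, lp, ip)
    print(f"Lp={lp} Ip={ip}: R0={r0:.2f} Rt={rt:.2f} ratio={rt / r0:.2f}")
```

The same two growth factors map to quite different R0 and Rt values, and a different Rt/R0 ratio, depending on the assumed Lp and Ip, which is the identification problem being described.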