Monday, March 23, 2020

Uncertainty in the COVID-19 model

Time for James to Speak to the Nation. Well, Speak to my Reader at least. Rather than subjecting myself to the waffling windbag, here are some fun experiments in parameter estimation that I've been doing.

The setup is very simple. The SEIR model I've been using has 3 main internal parameters, being the reproduction number R0, the latent period, and the infectious period. They can all be estimated to some extent from case studies, but the appropriate numbers to use in a model are not precisely known. Beyond these parameters, the only remaining tunable value is the number of infected people at the start date (this date isn't worth changing itself as you just go up and down the exponential curve). It seems that Ferguson et al picked the biological numbers from the recent literature and then chose the start value to hit a given number of deaths from their simulations on a date in mid-March. No uncertainties are reported in their analysis, though they have used a few different R0 values in some scenarios. So I think it's worth investigating how much the parameters could reasonably vary, and how much difference this might make.

I've just done a very simple tuning, fitting to values at the start of Feb and 20th March. Since the data are essentially exponential there is little point in using more intermediate data. All it can tell me is the rate of growth and its current level.

Priors for the model parameters are chosen rather arbitrarily by me, based on my previous tuning.
R0 ~ N(2.4,0.3)
Latent_p ~ N(4,1)
Infectious_p ~ N(4.5,1)
Start ~ exp(N(-20,1))

All uncertainties quoted at one standard deviation.

Sampling from this prior gives a wide range of possible epidemics, as shown below. The blue plume is the 5-95% range with the median line also drawn. This plume may be a bit misleading as it seems to include the possibility of no epidemic at all! Actually what is happening here is the different parameter sets lead to a different timing of the peak, so at any moment in time quite a lot of them have either not started, or ended. I've plotted a handful of the individual curves as red lines to show that. In fact all parameter sets do lead to a large epidemic, which is an immediate and unavoidable consequence of the R0 value.

Now for the tuning.

For the data....I assume a significant amount of under-reporting in the official case numbers, as discussed by many. Kucharski today suggested the true value (and note this is even for symptomatic cases) was about 16x the reported values, though I'm not sure if this applies from day1. Vallance recently made the heroic claim that the number of cases might be 1000 times the number of deaths at any given time, presumably in a desperate attempt to pretend that the mortality rate is really low as that's what he staked his advice on. I guess if you include newly infected, it's reasonable. My constraint on March 20 (which I'm taking as relating to the total symptomatic and recovered cases) doesn't include his value, but ranges from 22k-162k (median of 60k) on the day when the official number was 4k and deaths were 177. My early Feb constraint is centred on 20 with a range of 7-55. Mostly just because the log of 20 is very close to 3, I guess on reflection that range is probably a bit high but note that it has to account also for further imported cases (which the model does not simulate) rather than just home-grown ones. The initial growth is exponential so it only really makes sense to do these constraints in log space. A little simplistically, I assume uncorrelated errors on these two data points. In reality, the degree of under-reporting might be expected to be somewhat related.

The below plot shows the prior and posterior spread over the initial segment on a log scale. The two red Hs are the constraints I've used with red circles at their means, the dark circles below are the actual reported numbers. My pale blue prior has a broad spread, and the green posterior (just tuned by rejection sampling, nothing fancy) focusses nicely on the plausible range.

And what about the projected epidemic? Here we have it:

Compared to the prior, it's greatly constrained with the timing of the peak ranging from about mid-April to the end of May. Again the lower limit in that plume is made up of the start of late epidemics and end of early ones, it doesn't represent a realistic trajectory in itself.

I think one thing that can probably be concluded is that the parametric uncertainty isn't hugely important for the broad policy advice. Rather like climate science, in fact :-) In the absence of strong action, we are getting an epidemic in the next couple of months, and that's really all we need to know. Though the timing is somewhat uncertain...I may investigate that this space...

Edit: I realised overnight that the way I have calculated the percentiles was a bit clumsy and misleading and will change for future work. Basically I calculated the percentiles of each model compartment first, with the various diagnostics being generated from these, whereas I should have calculate the diagnostics first, and then presented the percentiles of them. A bit of a silly error really. I'll leave it up as a testament to the limitations of doing stuff quickly.


...and Then There's Physics said...

Is your model available somewhere? I'm just not sure what you're calling the latent period refers to.

James Annan said...

you didn't get back far enough in my posts :-)

Link to Thomas House from this one:

James Annan said...

SEIR model has an "exposed" ie infected but not infectious box

...and Then There's Physics said...

Thanks. Got it now.