Simple Bayesian analysis of the coronavirus infection rate from the Stanford study in Santa Clara County

tl;dr: Their 95% interval for the infection rate, given the data available, is [0.7%, 1.8%]. My Bayesian interval is [0.3%, 2.4%]. Most of what makes my interval wider is the possibility that the specificity and sensitivity of the tests can vary across labs. To get a narrower interval, you'd need additional assumptions regarding the specificity and sensitivity of the particular experiment done for that report.

1. Background

A Stanford research team administered coronavirus antibody tests to a sample of people from Santa Clara County, California. 1.5% of the people in the sample tested positive, but there were concerns about false positives as well as about possible sampling bias. A couple of weeks later, the Stanford team produced a new report with additional information on false positive and false negative rates. They also produced some new confidence intervals, but their procedure is fairly complicated, as it has to handle uncertainties in three different probabilities simultaneously.

This seems like a good case for Bayesian analysis, so I decided to do it.

2. The Stan model and R script

First I created a file, santaclara.stan:

data {
  int y_sample;
  int n_sample;
  int y_spec;
  int n_spec;
  int y_sens;
  int n_sens;
}
parameters {
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> spec;
  real<lower = 0, upper = 1> sens;
}
model {
  real p_sample;
  p_sample = p*sens + (1-p)*(1-spec);
  y_sample ~ binomial(n_sample, p_sample);
  y_spec ~ binomial(n_spec, spec);
  y_sens ~ binomial(n_sens, sens);
}

It's easy to do this in RStudio: just click on File, then New File, and there's a Stan File option!
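
The key line in the model block is the mixture formula, p_sample = p*sens + (1-p)*(1-spec): a positive test can come either from an infected person whose test correctly reads positive or from an uninfected person whose test falsely reads positive. Here's a quick sanity check of that formula in R, using illustrative values rather than estimates from the study:

p <- 0.01; sens <- 0.80; spec <- 0.995
p*sens + (1 - p)*(1 - spec)   # 0.01295: false positives account for over a third of the expected positive rate here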

Then I ran this R script:

library("cmdstanr")
library("rstan")
stanfit 

The hardest part here was figuring out what numbers to use from the Bendavid et al. papers, as they present several different sources of data for sensitivity and specificity. I tried to find numbers that were as close as possible to what they were using in their reports.
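
For reference, here's the shape of the data list I'm passing to stan(); the counts themselves are left as placeholders here, to be filled in from whichever report you're working from:

data_list <- list(
  y_sample = NA,  # positive tests in the Santa Clara sample
  n_sample = NA,  # number of people tested in the sample
  y_spec = NA,    # negative results among known-negative samples
  n_spec = NA,    # number of known-negative samples tested
  y_sens = NA,    # positive results among known-positive samples
  n_sens = NA     # number of known-positive samples tested
)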

3. Results

And here's what came out for the model fit to the data from the first report:

         mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p       0.010   0.000 0.005    0.001    0.007    0.011    0.014    0.019  1198 1.005
spec    0.993   0.000 0.003    0.986    0.991    0.994    0.996    0.999  1201 1.004
sens    0.836   0.001 0.035    0.758    0.814    0.838    0.861    0.897  1731 1.000
lp__ -338.297   0.054 1.449 -342.043 -338.895 -337.900 -337.258 -336.730   721 1.005

And for the model fit to the data from the second report:

         mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p       0.013   0.000 0.003    0.007    0.010    0.012    0.015    0.019  2598 1.000
spec    0.995   0.000 0.001    0.992    0.994    0.995    0.996    0.997  2563 1.001
sens    0.822   0.001 0.031    0.759    0.802    0.823    0.844    0.879  2887 1.000
lp__ -446.187   0.031 1.307 -449.546 -446.810 -445.841 -445.221 -444.718  1827 1.001

The parameter p is the infection rate in the population for which the data can be considered a random sample.

In the model fit to the data from the first report, the 95% interval for p is [0.001, 0.019]---that is, the data are consistent with an underlying infection rate of between 0 and 2%. The lower bound of the interval does not go all the way to zero, but that's just an artifact of summarizing inferences by a central interval. For example, we can plot the posterior simulation draws:

draws_1 <- extract(stanfit)$p   # posterior simulation draws of the infection rate p
hist(draws_1)

The resulting histogram shows that the posterior simulation draws of p go all the way down to zero.

In the model fit to the data from the second report, the 95% interval for p is [0.007, 0.015]---that is, the data are consistent with an underlying infection rate of between 0.7% and 1.5%.

4. Notes

a. As discussed in my earlier post, this is all based on the assumption that the sensitivity and specificity data represent independent tests with constant probability---that's implicit in the binomial model in the above Stan program. If you start allowing the specificity to vary from one experiment to another, then you can't just pool---you'd want to fit a hierarchical model---and that will end up giving you a wider uncertainty interval for p.

b. Also as discussed in my earlier post, I'm ignoring the weighting done by Bendavid et al. to adjust for differences between the sample and population. I just have too many questions about what they did, and their data are not available. I think the best way to do this adjustment would be using multilevel regression and poststratification. It would be easy to do this with a slight expansion of the above Stan program, so if someone has these data, they could do it. More to the point, this is the sort of adjustment that could and should be done in future studies with nonrepresentative samples.
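
As a rough sketch of what the poststratification step would look like: suppose the expanded Stan program produced posterior draws of the infection rate within each demographic cell, and we had census counts for those same cells (neither of which I have here). The adjustment is then just a population-weighted average, computed separately for each posterior draw:

poststratify <- function(p_cell_draws, N_cell) {
  # p_cell_draws: matrix of posterior draws (rows) by demographic cells (columns)
  # N_cell: vector of census counts for the same cells
  as.vector(p_cell_draws %*% N_cell) / sum(N_cell)   # one poststratified infection rate per draw
}
# quantile(poststratify(p_cell_draws, N_cell), c(0.025, 0.975))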

c. I'm also ignoring the possibility, which we've also discussed, that people with symptoms or at higher risk of infection were more likely to respond to the survey.

d. The Stan model was super easy to write. It did take a few minutes to debug, though. In addition to some simple typos, I had an actual mistake. Originally, I had "p_sample = p*sens + (1-p)*(1-spec)" as "p_sample = p*sens + (1-p)*spec". That's not right at all! When I ran that model, it had terrible mixing and the inferences made no sense. I wasn't sure what was wrong, so to debug I did the following simplifications:

- First, I simply set p_sample = p;. This ran fine and it gave the correct inference for p (ignoring the corrections for sensitivity and specificity) and the right inferences for sens and spec as well.

- Then I got rid of the second term in the expression, so just p_sample = p*sens. This worked too. So the sensitivity adjustment did not seem to be causing the problem.

- I was then trying to figure out how to check the specificity correction, when I noticed that my formula was wrong so I fixed it.

The folk theorem wins again!

e. The statistical analysis is not magic. It's only as good as its assumptions, and points a, b, and c above suggest problems with the assumptions. The point of the analysis is to summarize the uncertainty given the data, if the assumptions were true. So in practice it should be an understatement of our actual uncertainty.

5. Hierarchical model

I was curious about the variation in sensitivity and specificity among the different datasets (13 specificity studies and 3 sensitivity studies) listed in the updated report. So I fit a hierarchical model allowing these two parameters to vary. I set up the model so that the specificities from the Santa Clara experiment and the 13 other studies were drawn from a common distribution, and the same for the sensitivities from the Santa Clara experiment and the 3 other studies.

Here's the Stan program, which I put in the file santaclara_hierarchical.stan:

data {
  int y_sample;
  int n_sample;
  int J_spec;
  int y_spec [J_spec];
  int n_spec [J_spec];
  int J_sens;
  int y_sens [J_sens];
  int n_sens [J_sens];
}
parameters {
  real<lower = 0, upper = 1> p;
  vector[J_spec] e_spec;
  vector[J_sens] e_sens;
  real mu_spec;
  real<lower = 0> sigma_spec;
  real mu_sens;
  real<lower = 0> sigma_sens;
}
transformed parameters {
  vector[J_spec] spec;
  vector[J_sens] sens;
  spec = inv_logit(mu_spec + sigma_spec*e_spec);
  sens = inv_logit(mu_sens + sigma_sens*e_sens);
}
model {
  real p_sample;
  p_sample = p*sens[1] + (1-p)*(1-spec[1]);
  y_sample ~ binomial(n_sample, p_sample);
  y_spec ~ binomial(n_spec, spec);
  y_sens ~ binomial(n_sens, sens);
  e_spec ~ normal(0, 1);
  e_sens ~ normal(0, 1);
  sigma_spec ~ normal(0, 1);
  sigma_sens ~ normal(0, 1);
}

This model has weakly informative half-normal priors on the between-experiment standard deviations of the specificity and sensitivity parameters.

In R, I compiled, fit, and printed the results:

sc_model_hierarchical <- cmdstan_model("santaclara_hierarchical.stan")
sc_fit <- sc_model_hierarchical$sample(data = data_hierarchical)   # data_hierarchical is described below
print(rstan::read_stan_csv(sc_fit$output_files()), digits = 3)

The data are taken from page 19 of the second report.
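
Again I'm not reproducing the counts here, but the data list has this shape, with the Santa Clara experiment's own validation data entered as the first element of each array:

data_hierarchical <- list(
  y_sample = NA, n_sample = NA,    # Santa Clara sample: positives and total tested
  J_spec = 14,                     # Santa Clara's own specificity data plus the 13 other studies
  y_spec = rep(NA_integer_, 14),
  n_spec = rep(NA_integer_, 14),
  J_sens = 4,                      # Santa Clara's own sensitivity data plus the 3 other studies
  y_sens = rep(NA_integer_, 4),
  n_sens = rep(NA_integer_, 4)
)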

And here's the result:

               mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p             0.030   0.003 0.060    0.002    0.012    0.017    0.023    0.168   295 1.024
mu_spec       5.735   0.017 0.658    4.654    5.281    5.661    6.135    7.199  1583 1.001
sigma_spec    1.699   0.013 0.492    0.869    1.348    1.647    2.003    2.803  1549 1.003
mu_sens       1.371   0.021 0.686   -0.151    1.009    1.408    1.794    2.615  1037 1.004
sigma_sens    0.962   0.016 0.493    0.245    0.602    0.885    1.232    2.152   983 1.008
spec[1]       0.996   0.000 0.004    0.986    0.994    0.998    0.999    1.000  1645 1.002
spec[2]       0.993   0.000 0.004    0.983    0.990    0.993    0.996    0.998  4342 1.001
spec[3]       0.995   0.000 0.008    0.972    0.994    0.998    0.999    1.000  3636 1.000
spec[4]       0.996   0.000 0.005    0.982    0.995    0.998    0.999    1.000  3860 1.001
spec[5]       0.999   0.000 0.001    0.997    0.999    0.999    1.000    1.000  4101 1.001
spec[6]       0.998   0.000 0.002    0.993    0.998    0.999    1.000    1.000  3600 1.001
spec[7]       0.998   0.000 0.002    0.993    0.998    0.999    1.000    1.000  4170 1.001
spec[8]       0.999   0.000 0.001    0.995    0.998    0.999    1.000    1.000  3796 1.001
spec[9]       0.991   0.000 0.006    0.976    0.989    0.993    0.996    0.999  4759 0.999
spec[10]      0.997   0.000 0.004    0.985    0.996    0.998    0.999    1.000  3729 0.999
spec[11]      0.962   0.000 0.031    0.880    0.948    0.970    0.984    0.996  4637 1.000
spec[12]      0.978   0.000 0.012    0.949    0.972    0.980    0.986    0.995  4445 1.000
spec[13]      0.978   0.000 0.013    0.944    0.972    0.981    0.988    0.996  3862 1.000
spec[14]      0.975   0.000 0.020    0.921    0.966    0.980    0.989    0.997  5021 1.001
sens[1]       0.688   0.010 0.250    0.064    0.584    0.775    0.866    0.978   639 1.010
sens[2]       0.899   0.001 0.034    0.822    0.878    0.903    0.924    0.955  3014 1.000
sens[3]       0.744   0.001 0.068    0.597    0.700    0.750    0.794    0.862  2580 1.001
sens[4]       0.734   0.001 0.071    0.585    0.688    0.739    0.786    0.857  3286 1.001
lp__       -428.553   0.125 4.090 -437.842 -431.121 -428.203 -425.621 -421.781  1064 1.006

Now there's more uncertainty about the underlying infection rate. In the previous model fit, with all the specificity data and sensitivity data aggregated, the 95% posterior interval was [0.007, 0.019]; with the hierarchical model, it's [0.002, 0.168]. Huh? 0.168? That's because of the huge uncertainty in the sensitivity for the new experiment, as can be seen from the wiiiiide uncertainty interval for sens[1], which in turn comes from the possibility that sigma_sens is very large.

The trouble is, with only 3 sensitivity experiments, we don't have enough information to precisely estimate the distribution of sensitivities across labs. There's nothing we can do here without making some strong assumptions.

One possible strong assumption is that sigma_sens is some small value. It makes sense to consider this assumption here, as it's a relaxation of the implicit assumption of Bendavid et al. that sigma_sens = 0.

So I'll replace the weakly informative half-normal(0, 1) prior on sigma_sens with something stronger:

sigma_sens ~ normal(0, 0.2);

To get a sense of what this means, start with the above estimate of mu_sens, which is 1.4. Combining with this new prior implies that there's a roughly 2/3 chance that the sensitivity of the assay in a new experiment is in the range invlogit(1.4 +/- 0.2), which is (0.77, 0.83). This seems reasonable.
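
That arithmetic is easy to check in R (plogis is the inverse logit):

round(plogis(1.4 + c(-0.2, 0.2)), 2)   # 0.77 0.83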

Changing the prior, recompiling, and refitting on the same data yields:

               mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
p             0.015   0.000 0.005    0.003    0.012    0.015    0.018    0.024   925 1.006
mu_spec       5.744   0.020 0.656    4.640    5.282    5.684    6.141    7.160  1062 1.004
sigma_spec    1.681   0.012 0.468    0.875    1.352    1.650    1.981    2.702  1632 1.004
mu_sens       1.509   0.006 0.283    0.937    1.329    1.510    1.701    2.060  2111 1.002
sigma_sens    0.263   0.003 0.146    0.016    0.148    0.261    0.365    0.562  1774 1.000
spec[1]       0.996   0.000 0.004    0.987    0.995    0.998    0.999    1.000   791 1.007
spec[2]       0.992   0.000 0.004    0.982    0.990    0.993    0.996    0.998  4350 1.000
spec[3]       0.995   0.000 0.009    0.971    0.994    0.998    0.999    1.000  3517 1.000
spec[4]       0.996   0.000 0.005    0.982    0.995    0.998    0.999    1.000  3536 1.000
spec[5]       0.999   0.000 0.001    0.997    0.999    0.999    1.000    1.000  3852 1.001
spec[6]       0.998   0.000 0.002    0.993    0.998    0.999    1.000    1.000  3484 1.000
spec[7]       0.998   0.000 0.002    0.993    0.998    0.999    1.000    1.000  3737 1.000
spec[8]       0.999   0.000 0.001    0.995    0.998    0.999    1.000    1.000  3529 1.000
spec[9]       0.991   0.000 0.006    0.977    0.989    0.993    0.996    0.999  4969 1.001
spec[10]      0.997   0.000 0.004    0.985    0.996    0.998    0.999    1.000  3405 1.000
spec[11]      0.962   0.000 0.031    0.880    0.949    0.971    0.985    0.996  4105 1.000
spec[12]      0.978   0.000 0.012    0.951    0.971    0.980    0.987    0.994  4783 1.001
spec[13]      0.978   0.000 0.013    0.944    0.972    0.981    0.988    0.996  4508 1.000
spec[14]      0.975   0.000 0.019    0.925    0.966    0.981    0.990    0.997  4635 1.000
sens[1]       0.808   0.001 0.069    0.629    0.779    0.820    0.853    0.907  2483 1.000
sens[2]       0.859   0.001 0.036    0.783    0.835    0.860    0.884    0.924  3375 1.001
sens[3]       0.792   0.001 0.052    0.670    0.763    0.799    0.829    0.876  3516 1.001
sens[4]       0.788   0.001 0.054    0.664    0.755    0.795    0.827    0.876  3142 1.000

Our 95% interval for p is now [0.003, 0.024]; that is, the estimated infection rate is somewhere between 0.3% and 2.4%.

That's what I'd go with right now, given the information available.

Also, I checked my results a bit, but I did not check my entire procedure using fake-data simulation, so I can't be sure of all the above. Feel free to check the data, re-fit, and alter the model on your own.
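
If you'd like to run that check, here's a minimal sketch of a fake-data simulation for the simple (non-hierarchical) model, using arbitrary assumed true values:

set.seed(123)
true_p <- 0.015; true_spec <- 0.995; true_sens <- 0.80
p_sample <- true_p*true_sens + (1 - true_p)*(1 - true_spec)
fake_data <- list(y_sample = rbinom(1, 3000, p_sample), n_sample = 3000,
                  y_spec = rbinom(1, 400, true_spec), n_spec = 400,
                  y_sens = rbinom(1, 100, true_sens), n_sens = 100)
fake_fit <- stan("santaclara.stan", data = fake_data)
print(fake_fit, digits = 3)   # check that the posterior intervals cover the assumed true values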

P.S. Is it possible that Zad's cats, pictured above, are named Specificity and Sensitivity? I'd like to think so.