tl;dr: Their 95% interval for the infection rate, given the data available, is [0.7%, 1.8%]. My Bayesian interval is [0.3%, 2.4%]. Most of what makes my interval wider is the possibility that the specificity and sensitivity of the tests can vary across labs. To get a narrower interval, you’d need additional assumptions regarding the specificity and sensitivity of the particular experiment done for that report.

**1. Background**

A Stanford research team gave coronavirus tests to a sample of people from Santa Clara County, California. 1.5% of the people in the sample tested positive, but there were concerns about false positives, as well as about possible sampling bias. A couple weeks later, the Stanford team produced a new report with additional information on false positive and false negative rates. They also produced some new confidence intervals, but their procedure is pretty complicated, as it has to simultaneously handle uncertainties in three different probabilities.

This seems like a good case for Bayesian analysis, so I decided to do it.

**2. The Stan model and R script**

First I created a file, santaclara.stan:

```stan
data {
  int y_sample;
  int n_sample;
  int y_spec;
  int n_spec;
  int y_sens;
  int n_sens;
}
parameters {
  real<lower=0, upper=1> p;
  real<lower=0, upper=1> spec;
  real<lower=0, upper=1> sens;
}
model {
  real p_sample;
  p_sample = p*sens + (1-p)*(1-spec);
  y_sample ~ binomial(n_sample, p_sample);
  y_spec ~ binomial(n_spec, spec);
  y_sens ~ binomial(n_sens, sens);
}
```
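As a quick sanity check on that likelihood (my own check, not part of the original analysis), here is the same correction computed in Python with illustrative round numbers near the posterior means reported below:

```python
# Illustrative values near the posterior means (my round numbers, not the paper's):
p, sens, spec = 0.012, 0.82, 0.995

# Probability that a randomly sampled person tests positive:
# true positives plus false positives.
p_sample = p * sens + (1 - p) * (1 - spec)
print(round(p_sample, 4))  # about 0.015, i.e. the observed 1.5% positive rate
```

So an observed positive-test rate of 1.5% is consistent with a true infection rate near 1.2% once false positives and false negatives are accounted for.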

It’s easy to do this in RStudio: just click on File, New File, and then there’s a Stan File option!

Then I ran this R script:

```r
library("cmdstanr")
library("rstan")

# Data from the first Bendavid et al. report, as I read them:
# 50 positive tests out of 3330; specificity data 399/401;
# sensitivity data 103/122.
sc_data <- list(y_sample=50, n_sample=3330,
                y_spec=399, n_spec=401,
                y_sens=103, n_sens=122)
sc_model <- cmdstan_model(stan_file="santaclara.stan")
fit <- sc_model$sample(data=sc_data, refresh=0)
stanfit <- rstan::read_stan_csv(fit$output_files())
print(stanfit, digits=3)
```

The hardest part here was figuring out what numbers to use from the Bendavid et al. papers, as they present several different sources of data for sensitivity and specificity. I tried to find numbers that were as close as possible to what they were using in their reports.

**3. Results**

And here's what came out for the model fit to the data from the first report:

```
         mean     se_mean  sd     2.5%     25%      50%      75%      97.5%    n_eff  Rhat
p        0.010    0.000    0.005  0.001    0.007    0.011    0.014    0.019    1198   1.005
spec     0.993    0.000    0.003  0.986    0.991    0.994    0.996    0.999    1201   1.004
sens     0.836    0.001    0.035  0.758    0.814    0.838    0.861    0.897    1731   1.000
lp__     -338.297 0.054    1.449  -342.043 -338.895 -337.900 -337.258 -336.730 721    1.005
```

And for the model fit to the data from the second report:

```
         mean     se_mean  sd     2.5%     25%      50%      75%      97.5%    n_eff  Rhat
p        0.013    0.000    0.003  0.007    0.010    0.012    0.015    0.019    2598   1.000
spec     0.995    0.000    0.001  0.992    0.994    0.995    0.996    0.997    2563   1.001
sens     0.822    0.001    0.031  0.759    0.802    0.823    0.844    0.879    2887   1.000
lp__     -446.187 0.031    1.307  -449.546 -446.810 -445.841 -445.221 -444.718 1827   1.001
```

The parameter p is the infection rate in the population for which the data can be considered a random sample.

In the model fit to the data from the first report, the 95% interval for p is [0.001, 0.019]---that is, the data are consistent with an underlying infection rate of between 0 and 2%. The lower bound of the interval does not go all the way to zero, but that's just an artifact of summarizing inferences by a central interval. For example, we can plot a histogram of the posterior simulation draws of p:

[histogram of the posterior draws of p]

In the model fit to the data from the second report, the 95% interval for p is [0.007, 0.019]---that is, the data are consistent with an underlying infection rate of between 0.7% and 1.9%.
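For a rough cross-check without Stan (my own quick approximation, not the authors' method), one can invert the correction formula using independent Beta draws. This ignores the small feedback from the sample data onto sens and spec, so it only roughly reproduces the Stan interval; the counts in the comments are the first-report numbers as I read them:

```python
import numpy as np

rng = np.random.default_rng(123)
n = 100_000

# First-report counts (assumed here): 50/3330 positive tests,
# specificity data 399/401, sensitivity data 103/122.
p_sample = rng.beta(50 + 1, 3330 - 50 + 1, n)  # positive-test rate
spec = rng.beta(399 + 1, 401 - 399 + 1, n)     # specificity
sens = rng.beta(103 + 1, 122 - 103 + 1, n)     # sensitivity

# Invert p_sample = p*sens + (1-p)*(1-spec), truncating to [0, 1]
p = np.clip((p_sample - (1 - spec)) / (sens + spec - 1), 0, 1)
print(np.round(np.quantile(p, [0.025, 0.975]), 3))
```

The resulting 95% interval is close to the Stan fit's [0.001, 0.019], with a pile-up of draws at zero from the truncation.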

**4. Notes**

a. As discussed in my earlier post, this is all based on the assumption that the sensitivity and specificity data represent independent tests with constant probability---that's implicit in the binomial model in the above Stan program. If you start allowing the specificity to vary from one experiment to another, then you can't just pool---you'd want to fit a hierarchical model---and that will end up giving you a wider uncertainty interval for p.

b. Also as discussed in my earlier post, I'm ignoring the weighting done by Bendavid et al. to adjust for differences between the sample and population. I just have too many questions about what they did, and their data are not available. I think the best way to do this adjustment would be using multilevel regression and poststratification. It would be easy to do this with a slight expansion of the above Stan program, so if someone has these data, they could do it. More to the point, this is the sort of adjustment that could and should be done in future studies with nonrepresentative samples.

c. I'm also ignoring the possibility, which we've also discussed, that people with symptoms or at higher risk of infection were more likely to respond to the survey.

d. The Stan model was super easy to write. It did take a few minutes to debug, though. In addition to some simple typos, I had an actual mistake. Originally, I had "p_sample = p*sens + (1-p)*(1-spec)" as "p_sample = p*sens + (1-p)*spec". That's not right at all! When I ran that model, it had terrible mixing and the inferences made no sense. I wasn't sure what was wrong, so to debug I did the following simplifications:

- First, I simply set p_sample = p;. This ran fine and it gave the correct inference for p (ignoring the corrections for sensitivity and specificity) and the right inferences for sens and spec as well.

- Then I got rid of the second term in the expression, so just p_sample = p*sens. This worked too. So the sensitivity adjustment did not seem to be causing the problem.

- I was then trying to figure out how to check the specificity correction, when I noticed that my formula was wrong so I fixed it.
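For the record, here is a quick numerical illustration (in Python, with made-up values near the posterior means) of why the buggy formula misbehaves:

```python
p, sens, spec = 0.012, 0.82, 0.995  # illustrative values

correct = p * sens + (1 - p) * (1 - spec)
buggy = p * sens + (1 - p) * spec  # the mistaken version

print(round(correct, 3))  # 0.015: matches the observed positive-test rate
print(round(buggy, 3))    # 0.993: implies nearly everyone tests positive
```

With the buggy formula, p_sample is close to spec no matter what p is, so the model is trying to reconcile a predicted positive rate near 99% with 1.5% observed positives. No wonder the mixing was terrible.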

The folk theorem wins again!

e. The statistical analysis is not magic. It's only as good as its assumptions, and points a, b, and c above suggest problems with the assumptions. The point of the analysis is to summarize the uncertainty given the data, if the assumptions were true. So in practice the resulting interval should be an understatement of our actual uncertainty.

**5. Hierarchical model**

I was curious about the variation in sensitivity and specificity among the different datasets (13 specificity studies and 3 sensitivity studies) listed in the updated report. So I fit a hierarchical model allowing these two parameters to vary. I set up the model so that the specificities from the Santa Clara experiment and the 13 other studies were drawn from a common distribution, and the same for the sensitivities from the Santa Clara experiment and the 3 other studies.

Here's the Stan program, which I put in the file santaclara_hierarchical.stan:

```stan
data {
  int y_sample;
  int n_sample;
  int J_spec;
  int y_spec[J_spec];
  int n_spec[J_spec];
  int J_sens;
  int y_sens[J_sens];
  int n_sens[J_sens];
}
parameters {
  real<lower=0, upper=1> p;
  vector[J_spec] e_spec;
  vector[J_sens] e_sens;
  real mu_spec;
  real<lower=0> sigma_spec;
  real mu_sens;
  real<lower=0> sigma_sens;
}
transformed parameters {
  vector[J_spec] spec;
  vector[J_sens] sens;
  spec = inv_logit(mu_spec + sigma_spec*e_spec);
  sens = inv_logit(mu_sens + sigma_sens*e_sens);
}
model {
  real p_sample;
  p_sample = p*sens[1] + (1-p)*(1-spec[1]);
  y_sample ~ binomial(n_sample, p_sample);
  y_spec ~ binomial(n_spec, spec);
  y_sens ~ binomial(n_sens, sens);
  e_spec ~ normal(0, 1);
  e_sens ~ normal(0, 1);
  sigma_spec ~ normal(0, 1);
  sigma_sens ~ normal(0, 1);
}
```

This model has weakly-informative half-normal priors on the between-experiment standard deviations of the specificity and sensitivity parameters.

In R, I compiled, fit, and printed the results:

```r
# Fit the hierarchical model; the specificity and sensitivity counts
# (14 and 4 studies, with the Santa Clara experiment first) go into
# the data list, which is not reproduced here.
sc_model_hierarchical <- cmdstan_model(stan_file="santaclara_hierarchical.stan")
fit_hier <- sc_model_hierarchical$sample(data=sc_data_hierarchical, refresh=0)
print(rstan::read_stan_csv(fit_hier$output_files()), digits=3)
```

The data are taken from page 19 of the second report.

And here's the result:

```
            mean     se_mean  sd     2.5%     25%      50%      75%      97.5%    n_eff  Rhat
p           0.030    0.003    0.060  0.002    0.012    0.017    0.023    0.168    295    1.024
mu_spec     5.735    0.017    0.658  4.654    5.281    5.661    6.135    7.199    1583   1.001
sigma_spec  1.699    0.013    0.492  0.869    1.348    1.647    2.003    2.803    1549   1.003
mu_sens     1.371    0.021    0.686  -0.151   1.009    1.408    1.794    2.615    1037   1.004
sigma_sens  0.962    0.016    0.493  0.245    0.602    0.885    1.232    2.152    983    1.008
spec[1]     0.996    0.000    0.004  0.986    0.994    0.998    0.999    1.000    1645   1.002
spec[2]     0.993    0.000    0.004  0.983    0.990    0.993    0.996    0.998    4342   1.001
spec[3]     0.995    0.000    0.008  0.972    0.994    0.998    0.999    1.000    3636   1.000
spec[4]     0.996    0.000    0.005  0.982    0.995    0.998    0.999    1.000    3860   1.001
spec[5]     0.999    0.000    0.001  0.997    0.999    0.999    1.000    1.000    4101   1.001
spec[6]     0.998    0.000    0.002  0.993    0.998    0.999    1.000    1.000    3600   1.001
spec[7]     0.998    0.000    0.002  0.993    0.998    0.999    1.000    1.000    4170   1.001
spec[8]     0.999    0.000    0.001  0.995    0.998    0.999    1.000    1.000    3796   1.001
spec[9]     0.991    0.000    0.006  0.976    0.989    0.993    0.996    0.999    4759   0.999
spec[10]    0.997    0.000    0.004  0.985    0.996    0.998    0.999    1.000    3729   0.999
spec[11]    0.962    0.000    0.031  0.880    0.948    0.970    0.984    0.996    4637   1.000
spec[12]    0.978    0.000    0.012  0.949    0.972    0.980    0.986    0.995    4445   1.000
spec[13]    0.978    0.000    0.013  0.944    0.972    0.981    0.988    0.996    3862   1.000
spec[14]    0.975    0.000    0.020  0.921    0.966    0.980    0.989    0.997    5021   1.001
sens[1]     0.688    0.010    0.250  0.064    0.584    0.775    0.866    0.978    639    1.010
sens[2]     0.899    0.001    0.034  0.822    0.878    0.903    0.924    0.955    3014   1.000
sens[3]     0.744    0.001    0.068  0.597    0.700    0.750    0.794    0.862    2580   1.001
sens[4]     0.734    0.001    0.071  0.585    0.688    0.739    0.786    0.857    3286   1.001
lp__        -428.553 0.125    4.090  -437.842 -431.121 -428.203 -425.621 -421.781 1064   1.006
```

Now there's more uncertainty about the underlying infection rate. In the previous model fit, with all the specificity data and sensitivity data aggregated, the 95% posterior interval was [0.007, 0.019]; with the hierarchical model, it's [0.002, 0.168]. Huh? 0.168? That's because of the huge uncertainty in the sensitivity for the new experiment, as can be seen from the wiiiiide uncertainty interval for sens[1], which in turn comes from the possibility that sigma_sens is very large.

The trouble is, with only 3 sensitivity experiments, we don't have enough information to precisely estimate the distribution of sensitivities across labs. There's nothing we can do here without making some strong assumptions.
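To see how a loosely constrained sigma_sens translates into huge uncertainty about a new lab's sensitivity, plug the rough posterior means from the fit above into the logistic model (an illustrative Python check, not part of the analysis):

```python
from math import exp

def inv_logit(x):
    return 1 / (1 + exp(-x))

mu, sigma = 1.37, 0.96  # rough posterior means of mu_sens, sigma_sens
# A new experiment's sensitivity, within +/- 2 sd on the logit scale:
print(round(inv_logit(mu - 2 * sigma), 2), round(inv_logit(mu + 2 * sigma), 2))
# roughly 0.37 to 0.96: almost no constraint on the new lab's sensitivity
```

That spread matches the wide posterior interval for sens[1] in the table above.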

One possible strong assumption is to assume that sigma_sens is some small value. In the current context, it makes sense to consider this assumption, as we can consider it a relaxation of the assumption of Bendavid et al. that sigma_sens = 0.

So I'll replace the weakly informative half-normal(0, 1) prior on sigma_sens with something stronger:

```stan
sigma_sens ~ normal(0, 0.2);
```

To get a sense of what this means, start with the above estimate of mu_sens, which is 1.4. Combining it with this new prior implies that there's roughly a 2/3 chance that the sensitivity of the assay in a new experiment is in the range inv_logit(1.4 +/- 0.2), which is (0.77, 0.83). This seems reasonable.
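Checking that arithmetic in Python (illustrative):

```python
from math import exp

def inv_logit(x):
    return 1 / (1 + exp(-x))

# With mu_sens around 1.4 and a one-sd perturbation of 0.2 on the
# logit scale, the implied sensitivity range for a new experiment is:
print(round(inv_logit(1.4 - 0.2), 2), round(inv_logit(1.4 + 0.2), 2))  # 0.77 0.83
```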

Changing the prior, recompiling, and refitting on the same data yields:

```
            mean     se_mean  sd     2.5%     25%      50%      75%      97.5%    n_eff  Rhat
p           0.015    0.000    0.005  0.003    0.012    0.015    0.018    0.024    925    1.006
mu_spec     5.744    0.020    0.656  4.640    5.282    5.684    6.141    7.160    1062   1.004
sigma_spec  1.681    0.012    0.468  0.875    1.352    1.650    1.981    2.702    1632   1.004
mu_sens     1.509    0.006    0.283  0.937    1.329    1.510    1.701    2.060    2111   1.002
sigma_sens  0.263    0.003    0.146  0.016    0.148    0.261    0.365    0.562    1774   1.000
spec[1]     0.996    0.000    0.004  0.987    0.995    0.998    0.999    1.000    791    1.007
spec[2]     0.992    0.000    0.004  0.982    0.990    0.993    0.996    0.998    4350   1.000
spec[3]     0.995    0.000    0.009  0.971    0.994    0.998    0.999    1.000    3517   1.000
spec[4]     0.996    0.000    0.005  0.982    0.995    0.998    0.999    1.000    3536   1.000
spec[5]     0.999    0.000    0.001  0.997    0.999    0.999    1.000    1.000    3852   1.001
spec[6]     0.998    0.000    0.002  0.993    0.998    0.999    1.000    1.000    3484   1.000
spec[7]     0.998    0.000    0.002  0.993    0.998    0.999    1.000    1.000    3737   1.000
spec[8]     0.999    0.000    0.001  0.995    0.998    0.999    1.000    1.000    3529   1.000
spec[9]     0.991    0.000    0.006  0.977    0.989    0.993    0.996    0.999    4969   1.001
spec[10]    0.997    0.000    0.004  0.985    0.996    0.998    0.999    1.000    3405   1.000
spec[11]    0.962    0.000    0.031  0.880    0.949    0.971    0.985    0.996    4105   1.000
spec[12]    0.978    0.000    0.012  0.951    0.971    0.980    0.987    0.994    4783   1.001
spec[13]    0.978    0.000    0.013  0.944    0.972    0.981    0.988    0.996    4508   1.000
spec[14]    0.975    0.000    0.019  0.925    0.966    0.981    0.990    0.997    4635   1.000
sens[1]     0.808    0.001    0.069  0.629    0.779    0.820    0.853    0.907    2483   1.000
sens[2]     0.859    0.001    0.036  0.783    0.835    0.860    0.884    0.924    3375   1.001
sens[3]     0.792    0.001    0.052  0.670    0.763    0.799    0.829    0.876    3516   1.001
sens[4]     0.788    0.001    0.054  0.664    0.755    0.795    0.827    0.876    3142   1.000
```

Our 95% interval for p is now [0.003, 0.024], that is, the estimated infection rate is somewhere between 0.3% and 2.4%.

That's what I'd go with right now, given the information available.

Also, I checked my results a bit, but I did not check my entire procedure using fake-data simulation, so I can't be sure of all the above. Feel free to check the data, re-fit, and alter the model on your own.

P.S. Is it possible that Zad's cats, pictured above, are named Specificity and Sensitivity? I'd like to think so.