Here are some things people have sent me lately. They are in no particular order, except that I put the last item last so we could end with some humor. After this, I’ll write a few more blog posts, then it’ll be time to do some real work.

**Table of contents**

1. Suspicious coronavirus numbers from Turkey

2. Sensitivities and specificities

3. Unfrozen caveman law professor

4. Putting some model in your curve fitting

5. Why not give people multiple tests?

6. Fat or skinny: which is better?

7. Yet another forecasting competition

8. Kaiser goes medieval

9. A new way to do causal inference from observational data?

10. One of the worst studies I’ve ever seen in my life

**1. Suspicious coronavirus numbers from Turkey**

Abdullah Aydogan shares this short article he wrote on coronavirus data comparing Turkey’s strange fixed ratio of 0.021 for a 10-day period with the rest of the world. Now that the virus has taken root in the U.S., we haven’t been talking so much about international statistics.

In his article, Aydogan concludes:

The analysis proves it to be problematic to claim there is nothing unusual about Turkey’s 10-day long stability in total death to total case ratio. Simply put, excluding China and Iran’s trajectory after the 45th day since their first deaths, there is no country that has experienced variation as low as Turkey’s trajectory between the 19th and 28th days since its first death.

There may be a plausible explanation for this outcome besides allegations of data manipulation in official reports. But we will only learn about it if Turkey makes significant steps in improving transparency and accountability in data sharing.

I’m reminded of the suspiciously smooth time series of General Electric’s reported stock earnings.
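For anyone who wants to try this kind of check on their own download of the numbers, here is a minimal sketch. The measure of variation (max minus min of the death-to-case ratio within each 10-day window) is an assumption chosen for illustration and may not be the one Aydogan used; the two series are made up to show the contrast and are not real country data.

```python
def ratio_spread(cum_cases, cum_deaths, window=10):
    """Max minus min of cumulative deaths / cumulative cases in each window."""
    ratio = [d / c for c, d in zip(cum_cases, cum_deaths)]
    return [
        max(ratio[i:i + window]) - min(ratio[i:i + window])
        for i in range(len(ratio) - window + 1)
    ]


# Made-up cumulative case counts, for illustration only.
cases = [100, 150, 220, 300, 410, 520, 650, 800, 960, 1100, 1300, 1500]

# Series 1: deaths pinned at a fixed 0.021 share of cases (the suspicious pattern).
flat_deaths = [0.021 * c for c in cases]

# Series 2: the death share drifts upward by 0.001 per day.
drifting_deaths = [(0.015 + 0.001 * i) * c for i, c in enumerate(cases)]

flat_spread = ratio_spread(cases, flat_deaths)
drifting_spread = ratio_spread(cases, drifting_deaths)
print("fixed ratio:   ", [f"{s:.4f}" for s in flat_spread])
print("drifting ratio:", [f"{s:.4f}" for s in drifting_spread])
```

With real data, the comparison of interest would be how a country's smallest window spread ranks against the same quantity for other countries.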

**2. Sensitivities and specificities**

Joseph Candelora writes:

While looking for the new LA County serological study paper, I stumbled across this preprint, Estimating SARS-CoV-2 seroprevalence and epidemiological parameters with uncertainty from serological surveys, by Daniel Larremore et al., which looked interesting.

As I [Candelora] read it, my initial reaction was surprise at the statement on page 5 that for a simulated study with “sensitivity (93%) and specificity (97.5%) … when seroprevalence is 10% or lower, around 1000 samples are necessary to estimate seroprevalence to within two percentage points (Fig. 2)”.

That sounded great, like the Santa Clara County and LA County studies had a lot more to teach us than I thought, that even though seroprevalence in those studies is in the low single digits and the test’s specificity was below 100%, we could still bound the actual seroprevalence within a reasonable interval. But then I read it again, and realized they were estimating within “two percentage points” and not “two percent”. Seriously, who cares? If the test seroprevalence is 1.5% and you’re trying to get a reasonable estimate of number of infections in the population, bounding that seroprevalence within 2 percentage points is pretty worthless. Am I missing something? Why would they focus on percentage points?

My reply: if your test has a specificity of 97.5% and the underlying seroprevalence rate is, say, 1%, then you’re drawing dead. Your test can give you an upper bound of the rate in the population, but any point estimate or lower bound will be extremely sensitive to the specificity, which you don’t really know. Indeed, just as “seroprevalence” isn’t really a single number (it varies by geography and demographics), we can also say that “specificity” isn’t really a single number either, as there’s some reason for the false positives; it’s not pure randomness. The point about estimating the rate within 2 percentage points is that this is the best you’re gonna be able to do anyway.
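To see how touchy the point estimate is, here is a minimal sketch using the standard correction for test error (the Rogan-Gladen formula). It takes the 93% sensitivity from the quoted simulation setup and a true prevalence of 1%, then asks what estimate you would report if your assumed specificity were off by half a point or a full point. The grid of assumed specificities is my own choice for illustration.

```python
def expected_positive_rate(prevalence, sensitivity, specificity):
    """Share of tests expected to come back positive."""
    return prevalence * sensitivity + (1.0 - prevalence) * (1.0 - specificity)


def rogan_gladen(positive_rate, sensitivity, specificity):
    """Prevalence estimate corrected for test error (not clipped to [0, 1])."""
    return (positive_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)


sensitivity = 0.93       # from the quoted simulation setup
true_prevalence = 0.01   # the "say, 1%" case

# Positive rate we would expect to observe if specificity really were 97.5%.
observed = expected_positive_rate(true_prevalence, sensitivity, 0.975)

# Re-estimate prevalence while assuming a slightly different specificity.
for assumed_specificity in (0.965, 0.970, 0.975, 0.980, 0.985):
    estimate = rogan_gladen(observed, sensitivity, assumed_specificity)
    print(f"assumed specificity {assumed_specificity:.3f} -> estimate {estimate:+.4f}")
```

In this setup, moving the assumed specificity by one percentage point moves the estimate by roughly one percentage point, which is as large as the quantity being estimated. At the low end the estimate goes negative.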

**3. Unfrozen caveman law professor**

Daniel Hemel writes:

I’m a law professor—not an epidemiologist or a statistician. Just a note on the Santa Clara County and LA County tests that seems to have been overlooked. (Caveat: This isn’t remotely close to my area of expertise. Then again, teaching students how to poke holes in evidence is basically what we law professors do . . . . ):

— Tl;dr: The manufacturer of the test used in both studies appears to be reporting a negative predictive value meaningfully lower than 99.5% (i.e., a false positive rate meaningfully higher than 0.5%). If this is correct, then infection rate estimates based on an assumption of an 0.5% false positive rate will be too high.

More detail:

— As best I can tell, the prevalence-rate estimates in the Stanford study are based on the total number of positive cases by either IgG or IgM (see p. 6 of preprint: “The total number of positive cases by either IgG or IgM in our unadjusted sample was 50, a crude prevalence rate of 1.50%”). I.e.: If you come back positive on one, you’re counted as positive. The manufacturer’s package insert (https://imgcdn.mckesson.com/CumulusWeb/Click_and_learn/COVID19_Package_Insert_Rapid.pdf) reports a 99.5% negative predictive value for the IgG version (369/371) and a 99.2% negative predictive value for the IgM version (368/371). As one of your readers notes in the comments, it’s unclear whether these probabilities are independent. If independent, that suggests a negative predictive value of 98.7% if I’m doing the multiplication right.

— But that’s not all. The manufacturer also reports that when testing 150 known-negative samples at Jiangsu Provincial Center for Disease Control and Research, 146 tests were negative and 4 were positive: https://imgcdn.mckesson.com/CumulusWeb/Click_and_learn/COVID19_CDC_Evaluation_Report.pdf. (All four false positives appear to have been false positive for IgM; one of those four was also false positive for IgG.) 146/150 = 97.3%. The same study finds a positive predictive value of 95%.

— If we use those parameters and 50/3,330 samples come back positive, and if I’m remembering the Rogan-Gladen equation from my undergrad stats class correctly, p=(t+β-1)/(α+β-1); t = 0.015; α = 0.95; β = 0.973; p = -0.013. Well, of course that can’t be the true prevalence rate, but you get the point. The crude prevalence rate reported in the study is less than what we’d expect if the false positive rate were 2.7% and the true prevalence rate were zero.

— For LA County, if the 4.1% figure is their positive test frequency (and I can’t figure out what their 4.1% figure actually is), then we’re talking about an estimated prevalence using the JPCDCR parameters and the Rogan-Gladen equation of 1.5%. Which, given 617 deaths in LA County and a population of 10.04 million, would be a crude fatality rate of 0.4% — not so far off what we’ve seen from China.

— But of course, you know much more than I do about all this. The main point is just that if you dig around the manufacturer materials on the McKesson website, then it looks like our point estimate of the false positive rate should maybe be 1.3% or maybe 2.7% but no apparent reason to think it’s 0.5%.

Then again, people who do this for a living are representing that it’s 0.5%, so who am I to say that they’re wrong?

My reply: I’m a statistician, not a biologist, and I don’t have any idea how these assays work. Indeed, the last time I did any research on lab assays was back in 2004. One thing I learned in that project is that the usual statistical analyses of assays—including the analyses done by expert biologists—are often crude oversimplifications of the measurement process.
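For readers who want to check the arithmetic in Hemel's note, here is a minimal sketch. It uses the rounded inputs exactly as he wrote them (t = 0.015, α = 0.95, β = 0.973) and, as the note does, treats the "negative predictive value" figures as one minus the false positive rate. The variable names are mine.

```python
def rogan_gladen(t, alpha, beta):
    """p = (t + beta - 1) / (alpha + beta - 1), as written in the note."""
    return (t + beta - 1.0) / (alpha + beta - 1.0)


# Combined figure if the IgG and IgM results are independent.
combined = (369 / 371) * (368 / 371)

alpha = 0.95   # the 95% figure from the Jiangsu evaluation
beta = 0.973   # 146 of 150 known negatives tested negative, rounded as in the note

santa_clara = rogan_gladen(0.015, alpha, beta)   # 50 of 3,330 positive
la_county = rogan_gladen(0.041, alpha, beta)     # if 4.1% is the raw positive rate

# Crude fatality rate implied for LA County: deaths / estimated infections.
la_deaths = 617
la_population = 10.04e6
la_fatality = la_deaths / (la_county * la_population)

print(f"combined IgG and IgM figure: {combined:.3f}")
print(f"Santa Clara estimate:        {santa_clara:.3f}")
print(f"LA County estimate:          {la_county:.3f}")
print(f"LA crude fatality rate:      {la_fatality:.3f}")
```

The four numbers come out to 0.987, -0.013, 0.015, and 0.004, matching the 98.7%, -0.013, 1.5%, and 0.4% in the note.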

**4. Putting some model in your curve fitting**

Paul Cuff writes:

IHME did some useful footwork, collecting data on social distancing interventions by geography, adjusting deaths for demographics, etc. Then they fed it into a brain-dead model. As a consequence, instead of providing simple, intuitive, and actionable conclusions, they provide predictions that are of very little use or accuracy and that mainly serve to reinforce misunderstandings about the ebb and flow of infections.

The intention of the IHME model is to understand the consequences of social distancing. They focused on four interventions, like closing schools, etc. They are modeling in the regime of no herd immunity. In this regime, R_t is memoryless. They could/should have provided a mapping from intervention cocktail to R_t. A likely good model would be multiplicative. Each intervention multiplies R_t. The last thing to estimate would be the delay between intervention and when it appears in the death data. Perhaps one additional set of parameters would be a mapping from population density to R0, or simply allow each locality to have an R0 floating parameter. Instead of giving “predictions,” they would be saying: “closing schools decreases R by a factor of x.”

The IHME are not the only ones to look at spread in an abstruse way. They did their sigmoid curve fitting in the wrong domain, but it’s also the domain that most graphics are presented in, including the popular NYT-style graphic that Phil referred to when analyzing Sweden in his post on Monday. That is, cumulative deaths are presented on a log-scale. The log-scale serves a purpose for exponential growth, but that purpose is mostly lost when trying to track a time-varying R_t from cumulative numbers. As you know (in time-units of incubation periods):

current infections = exp (int log(R_t) dt)

log(current infections) = int log(R_t) dt

So the logarithm of current infections reveals R_t in such a straightforward way that it can even be done by eye through the slope. Daily deaths serve as a delayed proxy for infections.

On the other hand, the logarithm of cumulative deaths (or infections) does not yield any cancellation. Curves end up looking like sigmoids instead of straight lines, and you end up tempted to do really dumb curve fits. There is merit to looking at cumulative quantities, especially if we are trying to make inferences about immunity or give report cards to each locality. Also, by luck, if R_t is constant, then you still get straight lines. But there’s really no sense in looking at recent slopes of cumulative quantities on a log scale, or trying to understand the R_t dynamics in that way. I understand that daily quantities are noisier and less pleasant to look at, but stats handles that just fine.

A least squares fit (in the log-daily-death domain) with each intervention getting a coefficient would be a quick and dirty way to start, and a thousand times more useful than the IHME model. I would be tempted to do it myself, but I don’t know how to access the data. Also, now that I’m not in academia anymore, my time is a bit limited. I did throw together some plots here for my own sanity.

I did not read this message in detail, but I’m supportive of the general point that, even if your forecast is just curve fitting with no latent structure, it still makes sense to put some subject-matter information into the model where you can. This point may seem obvious, but not everyone gets it. I’ve often seen analyses by what might be called “regression purists” who seem to feel that once you put your predictors in your model that you can’t look back, a sort of poor man’s preregistration that as a byproduct can destroy your ability to learn.
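Cuff's quick and dirty suggestion can be sketched on simulated data. This is a minimal sketch, not IHME's method or anything Cuff ran: the intervention names, start times, delay, and coefficients are all made up, each time step is treated as one unit in which infections multiply by R_t, and the delay is assumed known rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)

n_steps = 40
start = {"close_schools": 10, "stay_home": 20}   # made-up start times
delay = 3                                          # made-up lag from intervention to deaths

# Design matrix: intercept (log R0) plus one delayed on/off column per intervention.
t = np.arange(n_steps)
X = np.column_stack(
    [np.ones(n_steps)] + [(t >= s + delay).astype(float) for s in start.values()]
)

# Simulate: log R_t is additive in the interventions, so R_t is multiplicative.
true_coef = np.array([np.log(1.5), np.log(0.7), np.log(0.6)])
log_r = X @ true_coef
noise = rng.normal(scale=0.05, size=n_steps)
log_daily_deaths = np.log(5.0) + np.cumsum(log_r) + noise

# Fit: the one-step change in log daily deaths is log R_t, regressed on the indicators.
y = np.diff(log_daily_deaths)
coef, *_ = np.linalg.lstsq(X[1:], y, rcond=None)

multipliers = np.exp(coef)
for name, m in zip(["R0"] + list(start), multipliers):
    print(f"{name}: multiplies R by about {m:.2f}")
```

The output is in the form Cuff asks for: one multiplier on R per intervention. Real daily death counts can be zero, where the log is undefined, so an actual analysis would need a count model or some smoothing first.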

**5. Why not give people multiple tests?**

Andrea Panizza writes:

Before the Stanford study came out, people on Twitter were already claiming that antibody tests cannot reliably tell whether a single individual has contracted COVID-19 or not. Example.

This is a different topic, but somewhat related, to issues with serological studies on a sample of N individuals, like your coverage of the Stanford study.

I get your point about the possibility of the Stanford estimate being a statistical artifact, but I have more difficulties understanding why the test wouldn’t be reliable at an individual level. Basically, my reasoning is as follows: assume that in Italy (my country) the prevalence of the disease is 4.0%, with a [3.2%-5.1%] 95% CrI, as estimated here.

Estimates from other research groups are similar. Now, if we consider the Cellex test mentioned in the tweet thread I linked above, we have a sensitivity (true positive rate) of 93.8% and a specificity (true negative rate) of 95.6%. Suppose I get a test which results positive. Then a straightforward application of Bayes’ rule gives a posterior probability of me having contracted COVID-19 of 0.47, with a CrI [0.41, 0.53]. This is definitely not strong evidence of me having contracted the infection. However, can’t I just fix this by taking another test? If this second test also results positive, the posterior probability now becomes 0.95, with a CrI [0.94, 0.96]. What am I missing here? Is the inference wrong, because the same person is being tested twice, and thus I cannot consider the results independent? I got this objection, but it doesn’t seem right to me at all.

For example, if we consider the classical “fair coin” problem, then even if we a priori have a strong belief that the coin is loaded (e.g., a Beta prior on p centered on 0.05 and quite tight, with p being the probability that we get “heads”), then launching the coin a few times and obtaining “heads” every time provides strong evidence against the “tails” face of the coin being loaded. [Actually, you can load a die but you can’t bias a coin. — ed.]

Do you think my analysis is sound?

My reply: I don’t know how the tests work, but it’s not clear to me that the positive or negative result is a purely random event. If you are negative and test positive, there could be a reason for this positive test (maybe you have some antibody that is similar to coronavirus) so it could show up on the test again.
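To make the arithmetic concrete, here is a minimal sketch of the Bayes update Panizza describes, followed by the same calculation when false positives are allowed to repeat. The `repeat_false_pos` parameter and the value 0.9 are hypothetical, chosen only to illustrate the concern in the reply; nothing here says how often false positives actually recur on retest.

```python
def posterior_after_positive(prior, sensitivity, specificity):
    """P(infected | one positive test) by Bayes' rule."""
    true_pos = prior * sensitivity
    false_pos = (1.0 - prior) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


def posterior_after_two_positives(prior, sensitivity, specificity, repeat_false_pos):
    """P(infected | two positive tests).

    repeat_false_pos is a hypothetical parameter: the chance that an uninfected
    person who tested positive once tests positive again. Setting it to
    1 - specificity recovers the independent case. Infected people are assumed
    to test positive independently each time with probability `sensitivity`.
    """
    true_pos = prior * sensitivity ** 2
    false_pos = (1.0 - prior) * (1.0 - specificity) * repeat_false_pos
    return true_pos / (true_pos + false_pos)


sensitivity, specificity = 0.938, 0.956   # Cellex figures quoted in the email
for prior in (0.032, 0.040, 0.051):       # interval endpoints and point estimate
    once = posterior_after_positive(prior, sensitivity, specificity)
    twice = posterior_after_two_positives(
        prior, sensitivity, specificity, 1.0 - specificity
    )
    sticky = posterior_after_two_positives(prior, sensitivity, specificity, 0.9)
    print(f"prior {prior:.3f}: once {once:.2f}, twice {twice:.2f}, sticky {sticky:.2f}")
```

Under independence this reproduces the 0.47 and 0.95 in the email, along with the interval endpoints. If a false positive recurs 90% of the time, the second positive test adds almost nothing: at the 4% prior the posterior is about 0.48.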

**6. Fat or skinny: which is better?**

Gustavo Novoa writes:

I’ve recently come across an ongoing debate on the effect of obesity on covid-19 health outcomes. Researchers at NYU published a report arguing that people who died or suffered serious complications from the virus were disproportionately obese—which was covered in the NYT.

An op-ed response was written on Wired arguing that the reports were dubious because they were not adjusting for socioeconomic factors, nor the discrimination that obese people face in hospitals (the latter is obviously significantly more difficult to adjust for).

I [Novoa] looked at one of the studies, and they were only presenting descriptive data, without running any kind of model.

Do you have data that can be analyzed or any existing fitted models that can speak to the effect of obesity on covid outcomes?

My reply: I have no idea! But it seems that our esteemed Columbia University colleague Dr. Oz is on the case (link courtesy of Paul Alper).

**7. Yet another forecasting competition**

Igor Grossmann writes:

Many people today are wondering about the societal changes that would follow the current crisis. What do scientists think about the societal changes in the months to come? Moreover, how accurate are social scientists in their forecasts? To address these questions, in collaboration with several colleagues (Phil Tetlock, Cendri Hutcherson, Michael Varnum, Lyle Ungar and others) I am organizing a Forecasting Collaborative. I am inviting you and/or your students (alone or in a team with others) to participate to investigate the accuracy of forecasts for complex societal phenomena. If you know of others who may be interested, please share this note.

We plan to investigate the accuracy of forecasts made by social scientists about critical social issues in the US over the next 12 months: well-being, affect on social media, prejudice, gender bias, political ideology and polarization.

The basic idea is simple: Participants in this study will receive past monthly data for the domains they would like to participate in. They can use their expertise and/or data modeling to estimate points for each of the next 12 months, then answer a few questions about their rationale/model. After six months, we will contact you again to obtain possible updates on your forecasts based on new data. You can participate by yourself or in a team.

We are aiming to produce a registered report style manuscript for publication in an interdisciplinary high-impact journal, such that results from this initiative will be accepted in principle, irrespective of the outcomes. Our goal is also to summarize the forecasts to the general public.

A direct benefit from participation in this study is that participants will obtain greater insight into the accuracy of their forecasts (which can be used to inform future forecasts).

Participants have an opportunity to contribute to the journal article as an author.

If you are interested in learning more about the Behavioral Science Forecasting Collaborative, please, click here.

Boldface phrases were boldface in the original email. If you want to join, I guess you can just click on the link.

**8. Kaiser goes medieval**

Kaiser Fung points us to a post he wrote “about the terrible charts in that Oxford study that claimed half of UK got infected by mid/late March.” He adds:

Also, previously, I [Kaiser] used that study to write an explainer for understanding statistical modeling. I’ve always found it a challenge to explain Bayesian models in a non-mathematical way to the general audience – and that Oxford study finally made me do it.

**9. A new way to do causal inference from observational data?**

Bill Harris writes:

I saw this timely “Method for estimating effects of COVID-19 treatments outside randomized trials” in today’s ASA Connect Digest and wondered if you were familiar with the approach and had any observations.

The slides make it sound like an attractive approach, if it works. My first concern was that it might be too facile. My second, more substantive concern was that it seemed to ignore randomness in the data. That is, in their fake data example, the 13% in cohort 1 (and, for that matter, the 16% in cohort 2) are clearly treated as fixed (that part sounds a bit Bayesian), but pages 14 and 15 (slides 13 and 14) seem to bring in variation at the end.

If it is credible, it could at least provide a tool to assess claims made about observational study results.

Here’s the description, from David Wulf:

I [Wulf] share below an overview of a simple new method my team has developed that may help identify the causal effects of experimental COVID-19 treatments, in the hopes that (i) you have suggestions or criticisms, (ii) you have access to relevant data on which to run it, or (iii) you can connect us/the method to those who do.

A method my advisor (Dr Chad Hazlett, Asst. Prof. in Statistics and Political Science, UCLA) and I (PhD student, UCLA Statistics) have developed uses clinical, observational data to estimate the causal effect of new treatments that are given outside of randomized trials, and that works no matter how much unobserved confounding/selection bias there is. It accomplishes this using a different, more flexible assumption, which places limits on the change in outcomes we could have expected over time in the absence of the new treatment. We hope it can complement ongoing RCTs on COVID-19 treatments being used under emergency/expanded-access provisions (like hydroxychloroquine) by investigating a different population (those who MDs choose to treat) and giving answers without waiting for trials to complete. Clearly the stakes are high – failing to learn from these experiences would be tragic, while allowing invalid and possibly misleading comparisons to influence medical or policy decisions would be just as problematic.

We made a slide presentation detailing this application, and that presentation includes links to a paper that introduced the method, and to an applied paper currently R&R. Importantly, the only data needed to run the analysis are patient subpopulation counts, which we hope helps avoid some issues about PHI/data access.

Harris follows up:

I found Hazlett and Wulf’s ResearchGate project that contains other interesting papers, including “Estimating causal effects of new treatments despite self-selection: The case of experimental medical treatments” and “Inference without randomization or ignorability: A stability-controlled quasi-experiment on the prevention of tuberculosis.”

I have not had a chance to read this, so I’m just forwarding it to all of you in case you’re interested.

**10. One of the worst studies I’ve ever seen in my life**

Jonathan Bydlak writes:

I wonder if you could comment on this take by a political scientist that there is no statistical evidence for the lockdowns when you regress state death rates on a simple lockdown dummy and a handful of other population characteristics.

While this particular article has just begun making the rounds, many have been making this argument. My instinct is that the analysis is very incomplete, as it ignores 1) the path the virus has taken in spreading, 2) the likely correlation between low case rates and the lack of a lockdown in the six states, and 3) the time between the first case and when the measures were adopted. It would be very helpful to hear your take.

My reply: This is one of the worst studies I’ve ever seen in my life. It’s a master class in stupid.