People have been (correctly) mocking my 1990s-style code. They’re right to mock! My coding style works for me, kinda, but it does have problems. Here’s an example from a little project I’m working on right now. I was motivated to write this partly as a followup to Bob’s post yesterday about coding practices.
I fit two models to some coronavirus testing data, with the goal of estimating the ratio of prevalence between two groups. First I fit the simple model implicitly assuming specificity and sensitivity of 1:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat p 0.023 0.000 0.004 0.015 0.020 0.023 0.026 0.033 33939 1 p 0.020 0.000 0.005 0.012 0.017 0.020 0.023 0.031 31022 1 ratio 1.218 0.002 0.406 0.632 0.933 1.152 1.427 2.191 28751 1 lp__ -202.859 0.008 0.999 -205.567 -203.246 -202.554 -202.146 -201.879 17156 1
The huge effective sample sizes came because I ran Stan for 10,000 iterations in warmup and 10,000 in sampling, getting a ton of draws to get good precision on the 2.5% and 97.5% quantiles, which should not be a big deal for the application—it’s hard to think of a case when we should ever care about a super-precise estimate of the endpoints of an uncertainty interval—but is convenient for getting a single set of numbers to report in the published article. Anyway, that’s not my main point. I just wanted to let you know why I was straying from my usual practice.
To continue, I then re-fit the model with the data on specificity and sensitivity. Prior tests were perfect on specificity—this had to do with the stringent cutoff they were using to declare a positive result—and also had high sensitivity, so the results shouldn’t change much, especially given that we’re focusing on the ratio. But the uncertainty in the ratio should go up a bit, I’d think.
Here were the results:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat p 0.023 0.000 0.005 0.015 0.020 0.023 0.026 0.033 34764 1 p 0.020 0.000 0.005 0.012 0.017 0.020 0.023 0.031 34133 1 ratio 1.218 0.002 0.405 0.626 0.932 1.151 1.431 2.185 30453 1 lp__ -202.868 0.008 1.011 -205.582 -203.255 -202.560 -202.148 -201.878 17880 1
Not much different! The lower bound of the interval has gone up from 0.626 to 0.632, and the upper bound has remained the same. Interesting. Let’s check the inferences for the two probabilities individually . . . they’re exactly the same . . . Wait a minute!
Let me check the code.
OK, I have a bug in my R code. Here are the lines of code fitting that second model:
Do you see it? In the second line, I took the old file, compare_1.stan, and put it in the object for the new model, compare_2. Here's what I need to do instead:compare_2
And now it works! Well, almost. First it turned out I had some little errors, such as not correctly defining a vector variable, which were caught during compilation, then when the model finally ran it gave a "variable does not exist" error, and I went back and caught the typo in the above data definition, where I mistakenly had defined n_spec twice in the copy and pasting procedure. (The repeat of 130 was not a mistake, though; their specificity testing data really were 130 negative tests out of 130.) So I fixed to:data_2
And then the code really did run, yielding:mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat p 0.019 0.000 0.007 0.004 0.014 0.019 0.024 0.032 5808 1.000 p 0.016 0.000 0.007 0.002 0.011 0.016 0.021 0.030 8207 1.000 spec 0.994 0.000 0.005 0.983 0.991 0.995 0.998 1.000 4998 1.000 sens 0.913 0.000 0.023 0.862 0.898 0.915 0.929 0.952 19939 1.000 ratio 1.693 0.052 3.859 0.453 0.901 1.206 1.656 4.917 5457 1.001 lp__ -254.658 0.020 1.562 -258.611 -255.451 -254.309 -253.501 -252.688 5840 1.001
The 95% interval for the ratio is now super-wide (and the estimated interval endpoint is itself noisy, even after 10,000 iterations; re-fitting the model with a new random seed yields an upper interval of 4.5), which makes sense because of the small possibility from the data that the false positive rate is high enough to mess things up.
Anyway, the point of this post is just to demonstrate how, with sloppy code, you really have to keep your hands on the wheel at all times. Even such simple fitting is hard to do if you're not continually checking results against your expectations.
I expect that the particular things that happened to me here would not be so bad for people who use more modern coding practices and workflows. From the other direction, there are lots of researchers whose code and workflow are much sloppier than mine, and they'll have to be especially eagle-eyed to avoid tripping up and even publishing erroneous results.
To return to the title of this post: Ugly code is buggy code not just because it's easier to introduce bugs and harder to find them if your code is poorly structured. It's also that, for code to be truly non-buggy, it doesn't just have to be free of bugs. You also have to know it's free of bugs, so that you can use it confidently.
Also it took me about 20 minutes to write this post. My blogging workflow is efficient (or perhaps inefficient in a larger sense, in that blogging is so easy that I do lots of it, perhaps taking the place of more productive activities).