Drunk-under-the-lamppost testing

I’m writing a response here to Abraham Mathews’s post, Best practices for code review, R edition, because my comment there didn’t show up and I think the topic’s important. Mathews’s post starts out on the right track, then veers away from best practices in the section “What code should be reviewed?” where he says,

…In general, we would never want to review four or five different files at a time. Instead, code review should be more targeted and the focus must be on the code that requires more prudent attention. The goal should be to review files or lines of code that contain complex logic and may benefit from a glance from other team members.

Given that guidance, the single file from the above example that we should never consider for code review is basic_eda.R. It’s just a simple file with procedural code and will only be run once or twice. …

The standard for code review in industry and large open-source projects is to review every piece of code before it’s merged. The key to this strategy is ensuring that every line of code has been viewed by at least the author and one trusted team member. Sampling-based code review that’s biased toward where the group thinks errors may be has the drunk-under-the-lamppost problem: it leaves a large chunk of code uncovered. Software developers obsess over test coverage, but full coverage is very challenging and we haven’t been able to get there with Stan. If we were developing flight control or pacemaker or transactional banking software, the standards would be much higher.

Typically, APIs are designed top-down from a client’s perspective (the client being a human or another piece of code that calls the API), then coded bottom up. Each component is reviewed and unit tested before being merged. The key to this strategy is being able to develop high-level modules with the confidence that the low-level pieces work. It may sound like it’s going to take longer to unit test as you go, but the net is a huge time savings with the upside of having more reliable code.
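
As a minimal sketch of what coding bottom up with unit tests can look like, here is a small Python example (Python rather than R, purely for illustration). The function names mean, sample_sd, and standardize are hypothetical and not taken from any project mentioned here. The two low-level pieces get their own tests before the higher-level piece that depends on them is written.

```python
import math


def mean(xs):
    """Low-level piece: arithmetic mean of a non-empty sequence."""
    if len(xs) == 0:
        raise ValueError("mean of empty sequence")
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Low-level piece: sample standard deviation (n - 1 denominator)."""
    if len(xs) < 2:
        raise ValueError("need at least two values")
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def standardize(xs):
    """Higher-level piece, built on the two tested functions above."""
    m, s = mean(xs), sample_sd(xs)
    if s == 0:
        raise ValueError("cannot standardize constant data")
    return [(x - m) / s for x in xs]


# Unit tests, one per component, written alongside the code they cover.
def test_mean():
    assert mean([1.0, 2.0, 3.0]) == 2.0


def test_sample_sd():
    assert math.isclose(sample_sd([1.0, 2.0, 3.0]), 1.0)


def test_standardize():
    z = standardize([1.0, 2.0, 3.0])
    assert all(math.isclose(a, b) for a, b in zip(z, [-1.0, 0.0, 1.0]))
```

If test_standardize fails while the two lower-level tests pass, the bug is almost certainly in standardize itself, which is the confidence the bottom-up approach is meant to buy.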

It’s also critical to keep the three key components of software development in sync: documenting (i.e., design), testing, and coding. In larger projects, a feature of any size always starts with a functional spec outlining how it works from the client’s point of view. That spec is usually written the way the eventual documentation will be written, because the documentation is what says what the code does. Even with just the doc in hand, the key is to make sure the API being delivered is both easy to document and easy to test. For example, large functions with intertwined, dependent arguments, as often found in REPL languages like R, Python, and Julia, produce what programmers call a “bad smell”, precisely because such functions are hard to document and test.

Consider the rgamma function in R. It takes three parameter arguments, shape, rate, and scale. Experienced statisticians might know that scale and rate parameters are conventionally inverses, yet this isn’t mentioned in the doc anywhere other than implicitly with the values of the default arguments. What happens if you supply both scale and rate? The doc doesn’t say, so I just tried it. It does not return an error, as one might expect from languages that try to keep their arguments coherent, but rather uses the rate and ignores the scale (order doesn’t matter). At the point someone proposed the rgamma function’s API, someone else should’ve piped up and said, “Whoa, hang on there a second, cowpoke; this function’s going to be a mess to test and document because of the redundant arguments.” With shape not getting a default and rate and scale being inverses, the tests need to cover behavior for all 8 possible input patterns (each of the three arguments either supplied or omitted). The doc should really say what happens when both scale and rate are specified. Instead, it just says “Invalid arguments will result in return value ‘NaN’, with a warning.” That implies that inconsistent rate and scale arguments (e.g., rate = 10, scale = 10) aren’t considered invalid arguments.
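
One way to keep such arguments coherent can be sketched as follows. This is a Python illustration with a hypothetical function gamma_draw. It is not R’s rgamma and makes no claim about how rgamma behaves. It assumes the design choice that supplying both rate and scale is an error, and it relies on the standard library’s random.gammavariate, whose second argument is a scale. The hypothetical helper input_patterns enumerates the 8 supplied-or-omitted combinations a test suite would need to cover.

```python
import itertools
import random


def gamma_draw(shape=None, rate=None, scale=None, rng=random):
    """Draw one gamma variate (hypothetical API, not R's rgamma).

    shape is required. At most one of rate or scale may be given;
    if neither is given, scale defaults to 1.
    """
    if shape is None:
        raise ValueError("shape is required")
    if rate is not None and scale is not None:
        # Assumed design choice: redundant arguments are rejected outright.
        raise ValueError("specify rate or scale, not both")
    if rate is not None:
        if rate <= 0:
            raise ValueError("rate must be positive")
        scale = 1.0 / rate
    elif scale is None:
        scale = 1.0
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    # random.gammavariate(alpha, beta) takes shape alpha and scale beta.
    return rng.gammavariate(shape, scale)


def input_patterns():
    """Try all 2**3 supplied/omitted patterns for (shape, rate, scale)."""
    values = {"shape": 2.0, "rate": 4.0, "scale": 0.25}
    patterns = []
    for mask in itertools.product([False, True], repeat=3):
        kwargs = {k: v for (k, v), keep in zip(values.items(), mask) if keep}
        try:
            gamma_draw(**kwargs)
            outcome = "ok"
        except ValueError:
            outcome = "error"
        patterns.append((tuple(sorted(kwargs)), outcome))
    return patterns
```

Under this assumed design, only three of the eight patterns succeed (shape alone, shape with rate, shape with scale), and the doc can state the rule in a single sentence.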

I should also say that my comments above are intended for API design, such as an R package one might want to distribute or a piece of infrastructure a lab or company wants to support. I wouldn’t recommend this style of functional design and doc and testing for exploratory research code, because such code is much harder to design up front and isn’t intended to be portable or distributed beyond a tiny group of collaborators. I’m not saying don’t test such code; I’m just saying the best practices there would be different from those for designing APIs for public consumption. For example, no need to test Windows and Linux and Mac if you only ever target one platform, no reason to test all the boundary conditions if they’re never going to be used, and so on. It absolutely still helps to design top down and write modular reusable components bottom up. It’s just usually not apparent what these modules will be until after many iterations.

P.S. I highly recommend Hunt and Thomas’s book, The Pragmatic Programmer. It’s a breeze to read and helped me immensely when I was making the move from a speech recognition researcher writing experimental code to an industrial programmer. Alternatives I’ve read suffer from being too long and pedantic, too dogmatic, and/or too impractical.

P.P.S. I’ve been meaning to write a blog post on the differences in best practices in research versus publicly distributed code. I know they’re different, but haven’t been able to characterize what I’d recommend in the way of methodology for research code. Maybe that’s because I spend about one day a month on personal or group research code (for example, the code Andrew and I developed for an analysis of SARS-CoV-2 seroprevalence), and nineteen days a month working on Stan API code. I’d be curious as to what other people do to test and organize their research code.