How many Stan users are there?

This is an interesting sampling or measurement problem that came up in a Discourse thread started by Simon Maskell:

It seems we could look at a number of pre-existing data sources (eg discourse views and contributors, papers, StanCon attendance etc) to inform an inference of how many people use Stan (and/or use things that use Stan). We could also generate new data (eg via surveys etc). Do we know the answer and/or how best to work it out?

The cleanest way to do this would be to start with a list of the population of possible Stan users, then survey a random sample of them, ask if they use Stan, and extrapolate to the population. But we can’t do this because no such list exists. We could count Stan downloads, but that’s not the same as counting Stan users: we assume that lots of the downloads are automatic, and people might download Stan and then use it only once, or not at all.
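To make the extrapolation step concrete, here is a minimal sketch of what the calculation would look like if such a list did exist and we could draw a simple random sample from it. The function name `estimate_users` and all the numbers are made up for illustration; they are not estimates of anything. The interval uses the usual normal approximation with a finite-population correction, which is only one of several reasonable choices.

```python
import math


def estimate_users(population_size, sample_size, sample_users, z=1.96):
    """Extrapolate a simple random sample to the full population.

    population_size: size of the (hypothetical) list of possible users
    sample_size:     number of people sampled from that list
    sample_users:    number of sampled people who say they use Stan
    z:               normal quantile for the interval (1.96 for roughly 95%)
    """
    p_hat = sample_users / sample_size

    # Finite-population correction: sampling without replacement from a list
    # of known size reduces the variance slightly.
    fpc = (population_size - sample_size) / (population_size - 1)
    se = math.sqrt(p_hat * (1 - p_hat) / sample_size * fpc)

    estimate = population_size * p_hat
    margin = z * population_size * se

    lower = max(0.0, estimate - margin)
    upper = min(float(population_size), estimate + margin)
    return estimate, lower, upper


# Hypothetical inputs, chosen only to show the arithmetic.
est, lo, hi = estimate_users(population_size=1_000_000,
                             sample_size=2_000,
                             sample_users=30)
print(f"estimated users: {est:.0f} (approx. interval {lo:.0f} to {hi:.0f})")
```

Even in this idealized setting the interval is wide when users are a small fraction of the list, which is one reason the choice of starting population matters so much.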

Lauren Kennedy suggests doing a snowball or network sample using contributors to the Stan Forums as a starting point.

Snowball sampling could work. There could be other ideas too. Please offer your suggestions in comments.

Here are my thoughts:

1. A natural first step in any research project is to read the literature. There must be some estimates of the numbers of users of other programming languages such as Python, R, C++, Julia, Bugs, Stata, etc. I don’t know where these estimates come from, but looking at them would be a start.

2. If we’re gonna do a survey to estimate the number of Stan users, it perhaps makes sense to expand the project and simultaneously estimate the number of users of some other programming languages too, both for efficiency (with little more effort we can get information that will be of interest to others) and to get comparisons: comparing the use of different languages in our survey and also comparing our estimates to estimates that have been obtained by others.

3. We should also think about how the survey could be done again in the future. If we have a good estimate of the number of users, we might want to repeat the procedure every year or two to get a sense of trends.

4. How many Stan users are there? What’s a “Stan user”? Does this include users of rstanarm and brms? What about people who use Stan only through Prophet? Do they count? Do we want to count ever-users or current users? How often must you use Stan to count as a user? What if you took a class that used Stan? Etc.

The point of this last set of questions is not that we need a precise definition of Stan user, but rather that we should ask a battery of questions to get at mode and frequency of use. Also, we should consider how we might want to summarize and interpret the results: we should think about this before we conduct the survey (rather than doing the usual thing of gathering a bunch of data and then deciding what to do with it all).
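As a rough sketch of why a battery of questions helps, suppose each respondent reports whether they have ever used Stan, roughly how many times a year they use it, and through which interfaces. Then the same data can be summarized under several definitions of “user” at once. The field names, the thresholds, and the five responses below are all hypothetical, invented only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    ever_used: bool        # has the respondent ever used Stan in any form?
    uses_per_year: int     # self-reported frequency of use
    interfaces: frozenset  # e.g. {"stan", "brms", "rstanarm", "prophet"}


# Several candidate definitions of "Stan user"; the thresholds are assumptions.
DEFINITIONS = {
    "ever used": lambda r: r.ever_used,
    "current (used in past year)": lambda r: r.uses_per_year >= 1,
    "regular (monthly or more)": lambda r: r.uses_per_year >= 12,
    "writes Stan code directly": lambda r: r.uses_per_year >= 1
    and "stan" in r.interfaces,
}


def count_by_definition(responses):
    """Count respondents who qualify as users under each definition."""
    return {
        name: sum(1 for r in responses if rule(r))
        for name, rule in DEFINITIONS.items()
    }


# Made-up responses, for illustration only.
responses = [
    Response(True, 50, frozenset({"stan"})),     # writes Stan programs weekly
    Response(True, 3, frozenset({"brms"})),      # occasional brms user
    Response(True, 0, frozenset({"stan"})),      # took a class once
    Response(True, 20, frozenset({"prophet"})),  # uses Stan only via Prophet
    Response(False, 0, frozenset()),             # never used Stan
]

counts = count_by_definition(responses)
for name, n in counts.items():
    print(f"{name}: {n}")
```

Writing down the summaries like this before fielding the survey is one way to check that the questions actually support the comparisons we care about.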