The typical set and its relevance to Bayesian computation

tl;dr

The typical set (at some level of coverage) is the set of parameter values for which the log density (the target function) is close to its expected value. As has been much discussed, this is not the same as the posterior mode. In a d-dimensional unit normal distribution with a high value of d, the 99% typical set looks like an annulus or sphere or doughnut, corresponding to the points whose distance from the origin is approximately sqrt(d). This intuition is important because it helps understand the challenges of HMC and similar algorithms, which essentially have to cycle around the sphere rather than directly cutting through the center.

The term “typical set” is a standard term in information theory and is more general than the “typical set” concept discussed in Stan (as in the above paragraph). There are some differences between the official definition of typical set and the use of the term in Stan discussions. We can continue to work with the concept of typical set, and it will help for us to clarify the definition when we use the term.

Background

The concept of the “typical set” comes up a lot when we’re discussing Hamiltonian Monte Carlo and Stan. For example, in his case study, “Typical Sets and the Curse of Dimensionality,” Bob Carpenter writes, “Roughly speaking, the typical set is where draws from a given distribution tend to lie. That is, it covers most of the mass of the distribution. This particular choice of volume not only covers most of the mass of a distribution, its members reflect the properties of draws from that distribution.” He also writes that the typical set “is roughly defined as the central log density band into which almost all random draws from that distribution will fall.”

Bob also quotes Bernhard Schölkopf as saying, “a high-dimensional space is a lonely place.” which reminds me of my own line that I use when teaching this stuff: “In high-dimensional space, no one can hear you scream.” The funny thing is, I never saw that movie—I don’t like horror movies—but I remember the slogan.

In his article, “A Conceptual Introduction to Hamiltonian Monte Carlo,” Michael Betancourt refers to “the neighborhood between these two extremes [the neighborhood immediately around the mode, and the complementary neighborhood far away from the mode] known as the typical set” and writes, “In high-dimensional parameter spaces probability mass . . . and hence the dominant contributions to expectations, concentrates in a neighborhood called the typical set.” Michael also writes, “because probability densities and volumes transform oppositely under any reparameterization, the typical set is an invariant object that does not depend on the irrelevant details of any particular choice of parameters,” which I don’t think is correct (see below).

There’s a log of consensus that the typical set is important. But what exactly is it?

Notation and definition

For consistency and clarity I’ll first define some notation. Assume we have a d-dimensional continuous probability density p(theta) defined on some space Theta, so that the integral over Theta of p(theta) d theta is 1. In Bayesian notation we would write p(theta|y), but here I’m suppressing the conditioning on y. So it’s just a math problem now. I’ll also assume we can compute p(theta) and that we can use Stan to get random draws theta from the probability distribution with density p(theta). And I’ll define the target function, u(theta) = log p(theta), the log density function. Finally, the entropy is defined as H = – E(log(p(theta))) = integral log(p(theta)) p(theta) d theta.

I’m being a little sloppy here because typically we can only compute p(theta) up to an arbitrary multiplicative constant, so we we can only compute log p(theta) up to an arbitrary additive constant, and we usually define the target function only up to an arbitrary additive constant too. But for our purposes here we are only working with relative values of the target function, so for simplicity I’ll just pretend the normalizing constant is known. It won’t matter.

The official definition of typical set comes from information theory. Here’s what it currently says on wikipedia (no Wegman version available; sorry): “In information theory, the typical set is a set of sequences whose probability is close to two raised to the negative power of the entropy of their source distribution.” The set is defined in the space of sequences theta_1,…,theta_n. Thus, the typical set is a subset of Theta^n. Wikipedia’s definition using x where we’re using theta, and it’s using log_2 where we’re using log_e, otherwise its notation is consistent with ours.

In Stan, however, people seem to be talking about the typical set as a set of values of theta, that is, the typical set is a subset of Theta, not a subset of Theta^n.

But then we can just take the information-theory definition and use the special case n=1. Then the definition of typical set is: all the values of theta for which log p(theta) is within some epsilon of its expectation. That is, the set of values of theta for which | log p(theta) + H | < epsilon. (OK, the wikipedia definition is less then or equal, but that doesn’t matter here because we’re working with continuous distributions.) 

10-dimensional normal example

I’ll now illustrate with a 10-dimensional normal example in R:

library("mvtnorm")
n_sims <- 1e4
d <- 10
theta <- rmvnorm(n_sims, rep(0, d), diag(d))
u <- dmvnorm(theta, rep(0, d), diag(d), log=TRUE)
H <- - mean(u)

H is the entropy, which can be calculated analytically for this example as 0.5*d*log(2*pi*e). But just to keep things simple and general, I estimated it based on 10,000 simulations; it’s just the negative of the expectation of the log density, and it comes to 14.18. The mode has density -(d/2)*log(2*pi) = -9.19. But, as we well know, the mode is not a typical value of the distribution! The maximum of u in my 10,000 simulations is max(u) = -9.62.

Using the wikipedia definition, the typical set for any epsilon corresponds to the values of theta for which u(theta) is within epsilon of – H. For example, if epsilon = 1, it’s the values for which u(theta) is in the range [-15.18, -13.18].

We can figure out coverage with a little R function:

coverage <- function(epsilon){
mean(abs(u + H) < epsilon)
}

Then we can try some values:

coverage(0)
[1] 0
coverage(1)
[1] 0.3403
coverage(2)
[1] 0.6372
coverage(3)
[1] 0.8475
coverage(4)
[1] 0.9413
coverage(5)
[1] 0.971

As epsilon increases, the coverage approaches 100%.

But this official definition of typical set is not quite what we’ve been talking about with Stan! The problem is that, to get high enough coverage using this interval centered at – H, you’ll end up including the mode after all! Check it out: – H is -14.18, and the maximum value of u is at -9.19. So if we set epsilon = 5, the typical set will include the mode.

Because of the law of large numbers, this problem does not arise with the classical definition of typical set as n gets large. But since we’re working with n=1, it’s a concern. And the definitions implicitly used by the Stan people are clearly working within Theta, not Theta^n. See for example the quotes by Bob and Michael above.

Differences between the official definition of “typical set” and some of how it’s discussed in the Stan community

1. The first difference is that the official definition is in Theta^n, while in Stan it’s used to refer to a subset of Theta. So the use in Stan is a special case of the general definition.

2. The second difference is that, as noted above, the 99% typical set for the 10-dimensional normal as officially defined includes the mode, which is not quite how we’ve been thinking about things. The Stan intuition is not all wrong—as dimensionality increases, 99% typical sets will ultimately exclude the mode. Everything’s fine if d=100, for example. But for low dimensions, such as d=10, the skew in the posterior distribution of log(p(theta)) makes a difference. 10 is not a high number of dimensions, but 10-dimensional problems do come up in applications.

3. The third difference is that the typical set is not invariant to nonlinear transformations of theta. This can be seen as follows: if you transform theta nonlinearly, you’ll need to divide the density by the absolute value of the determinant of the Jacobian of the transformation, that is, you’re adding -log |det(Jacobian)| to the log density. The typical set of theta is the inverse image of a set defined on the log density (that is, you take the values of theta for which log(p(theta)) is in the range -H +/- epsilon), and changing log(p(theta)) by subtracting a log abs det Jacobian will change the mapping and thus change the inverse image. This is not a big deal, nor is it a problem—if you nonlinearly transform theta, you’re changing the geometry of the problem, so it makes sense that the typical set changes too. We should just be clear on it.

You can also see this by continuing with the above simulation in R. Just define phi_j = exp(theta_j) for j=1,…,10; the determinant of the Jacobian is then prod_{j=1}^{10} phi_j, so in R:

phi <- exp(theta)
jacobian <- apply(phi, 1, prod)
u_phi <- - log(jacobian) + u

Under the new parameterization, we can compute typical sets of phi by working with u_phi. The typical sets of phi will be inverse images of ranges of u_phi near its expectation. The point is that the inverse image of a particular value of u_phi will not be the same as the inverse image of a particular value of phi. So the typical sets will be different. The issue here is not about cancellation of density and area in computing total probability, nor is it about E(u_phi) being different from E(u). Rather, the issue is that sets of constant u_phi are different from sets of constant u. The rule for being in the typical set of theta is different from the rule for being in the typical set of phi, because the inverse image sets are different.

Of the three items listed above, I think the one that could be most confusing is point #2.

There are various ways to handle point #2:

– We can just accept that in moderate dimensions such as d=10, the 99% typical set can contain the mode.

– We can talk about 50% typical sets instead of typical sets more generally. (Somehow from the discussions with in Stan of typical sets, I’ve been picturing something like 99%, but that’s never really been specified.) But that won’t work if we’re using “typical set” to include most of the domain of integration.

– We can work with a different summary of the distribution of log p(theta). Instead of an interval centered at E(log p(theta)), we could use a central interval or shortest interval of log p(theta) that contains specified probability. Take the inverse image of either of such sets and you’ll get an alternative “typical set” that is doughnut-shaped and excludes the mode in that 10-dimensional normal example. Bob’s case study would work fine if you take the typical set to be the central or shortest posterior interval of log p(theta).

– We can talk less about the typical set and more about the distribution of the log density or target function. For example, instead of saying, “If we start our algorithm far from the typical set, then Nuts takes you toward the typical set, and not until then do you want to start sampling,” we can say, “If we start our algorithm with the log density far from its expectation, then Nuts takes the log density toward that expectation, and not until you are close to that expectation do you want to start sampling.”

I kind of like that last solution. Often when we talk about typical sets we can just talk directly about the log density or target function. I think this could be helpful for users (or developers) because this is the function that Stan is computing anyway!

Also, this last solution reinforces the message that the mode is not “typical.” If we want our simulations to be in the areas of Theta where log(p(theta)) is near its expectation, then we don’t particularly want to be at the mode. By definition, the log density is at an extreme value, not at its expectation, when theta is at its mode.

Again, I think most of the things that people in the Stan community have been saying about typical sets are vaguely correct. It’s just been hard for me to follow some of it without the precise definition, hence this post.