Short story: Full dummy coding in frequentist regression analyses leads to non-identifiability. In Bayesian analyses that’s not an issue, but a different problem arises: full dummy coding is computationally problematic. The group indicators in a hierarchical model are a special case that avoids these computational problems.
Long story: (PDF)
There are various choices for encoding a nominal input variable for use in a regression analysis. One of these is full dummy coding, in which a -level nominal variable is replaced by a subvector of indicator variables:
where the entry is at position .
In frequentist analyses this encoding is not used if there is a separate intercept, because it yields a non-identifiable model: subtracting a constant amount from the regression coefficient for each indicator variable and adding it to the intercept leaves the likelihood unchanged. The same problem occurs if there are multiple nominal variables and each uses full dummy coding. Instead, a leave-one-out dummy coding is used ( is dropped, so that level is encoded as all zeroes) or an effects coding is used ( is dropped, but for all when ).
In principle, non-identifiability is not a problem for a Bayesian analysis: we are interested in obtaining a posterior distribution for the regression coefficients, not a point estimate, and as long as we use a proper prior distribution we are guaranteed a proper posterior distribution. Thus we could, if desired, use a full dummy coding.
In practice, things are less straightforward. Use of full dummy coding in a Bayesian analysis can be problematic because it yields a highly correlated posterior distribution that poses computational difficulties.
To see this, consider a simple problem: a linear regression with only a single input variable, a nominal variable with levels. is the row vector of predictor variables derived from , and is the value of the dependent variable for data point . We have
together with independent normal priors on the regression coefficients
The predictor matrix is defined as
If is the proportion of rows for which , then one can show (details
here) that the posterior distribution for is a multivariate normal distribution with covariance matrix , where
But the left summand is singular: is an eigenvector with eigenvalue of 0. Defining
we see that if for all then itself is nearly singular.
A nearly singular posterior precision matrix is problematic even in this simple scenario, as it leads to numerical instabilities when inverting the matrix. It also implies very high posterior correlations among the regression coefficients. In generalized linear models where we have to use Markov chain Monte Carlo methods to explore the posterior distribution—say, logistic regression or Poisson regression—this leads to very slow mixing of the Markov chain; in the case of Hamiltonian Monte Carlo it can lead to very small step sizes. In either case the result is very slow estimation.
As an example, consider a data set with observations and a weak prior, . Then , and for we find that the posterior correlation between and () is
and the posterior correlation between and , , , is
How large does need to be to avoid extreme correlations? Assuming for all , and hence for all , the following table gives the minimum value that (for ) may have and still yield posterior correlations that never exceed in absolute value. These figures assume , i.e., a weak prior on the intercept (they are not sensitive to the specific value of used.)
Notice that as the number of levels increases, smaller values for become acceptable.
Hierarchical models can be thought of as regression models involving a nominal variable with many distinct levels. The simplest case is a varying-intercept model:
where is the group to which observation belongs, and we have the hierarchical prior
The equation for is equivalent to
So why don’t we see a problem with high posterior correlations among the coefficients in this case? We have two mitigating factors in play:
- is large. As previously noted, this allows smaller values without getting extreme posterior correlations.
- is relatively small, yielding larger values for . The whole point of a hierarchical model is that we expect to find the posterior distribution for to be concentrated on values much smaller than what we would assign for a weakly informative prior on the coefficients .
In summary, we have the following guidelines for encoding nominal variables in a Bayesian regression analysis:
- Do not use full dummy coding, as it will generally lead to severe computational difficulties due to extreme posterior correlations in the regression coefficients.
- Group indicators for hierarchical models are a special case that avoids these difficulties.