Short story: Full dummy coding in frequentist regression analyses leads to non-identifiability. In Bayesian analyses that’s not an issue, but a different problem arises: full dummy coding is computationally problematic. The group indicators in a hierarchical model are a special case that avoids these computational problems.
Long story: (PDF)
There are various choices for encoding a nominal input variable for use in a regression analysis. One of these is full dummy coding, in which a K-level nominal variable is replaced by a subvector of K indicator variables w_{1},\ldots,w_{K}:
\text{level }k \Rightarrow (w_{1},\ldots,w_{K}) = (0, \ldots, 0, 1, 0, \ldots, 0),
where the 1 entry is at position k.
In frequentist analyses this encoding is not used if there is a separate intercept, because it yields a non-identifiable model: subtracting a constant amount from the regression coefficient for each indicator variable and adding it to the intercept leaves the likelihood unchanged. The same problem occurs if there are multiple nominal variables and each uses full dummy coding. Instead, either a leave-one-out dummy coding is used (w_{K} is dropped, so that level K is encoded as all zeroes) or an effects coding is used (w_{K} is dropped, but w_{k}=-1 for all 1\leq k\leq K-1 when the variable takes level K).
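For concreteness, here is a small numpy sketch of the three encodings; the toy variable, its number of levels, and the observed values are made up purely for illustration.

```python
import numpy as np

# A hypothetical K = 4 level nominal variable observed for 6 data points,
# with levels coded as integers 1..K (toy data, for illustration only).
K = 4
v = np.array([1, 3, 2, 4, 4, 2])

# Full dummy coding: one indicator column per level.
full = (v[:, None] == np.arange(1, K + 1)).astype(int)

# Leave-one-out dummy coding: drop the indicator for level K,
# so level K is encoded as all zeroes.
leave_one_out = full[:, :-1]

# Effects coding: drop the indicator for level K, but encode level K
# as -1 in every remaining column.
effects = full[:, :-1].copy()
effects[v == K] = -1

print(full)
print(leave_one_out)
print(effects)
```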
In principle, non-identifiability is not a problem for a Bayesian analysis: we are interested in obtaining a posterior distribution for the regression coefficients, not a point estimate, and as long as we use a proper prior distribution we are guaranteed a proper posterior distribution. Thus we could, if desired, use a full dummy coding.
In practice, things are less straightforward. Use of full dummy coding in a Bayesian analysis can be problematic because it yields a highly correlated posterior distribution that poses computational difficulties.
To see this, consider a simple problem: a linear regression with only a single input variable, a nominal variable v with K levels. X_i is the row vector of predictor variables derived from v, and y_i is the value of the dependent variable for data point i. We have
y_{i}\sim\text{Normal}\left(X_{i}\beta,\,\sigma_y\right),\quad 1\leq i\leq n,
together with independent normal priors on the regression coefficients
\beta_{j}\sim\text{Normal}\left(0,\sigma_{j}\right).
The predictor matrix X is defined as
\begin{array}{lcl}X_{i0} &=& 1 \quad\text{(so that }\beta_{0}\text{ is the intercept)} \\ X_{ij} &=& 1 \quad\text{if }j\neq 0\text{ and }v_i=j \\ X_{ij} &=& 0 \quad\text{if }j\neq 0\text{ and }v_i\neq j. \end{array}
If p_{j} is the proportion of rows X_{i} for which X_{ij}=1, then one can show (details here) that the posterior distribution for \beta is a multivariate normal distribution with covariance matrix \Lambda^{-1}, where
\Lambda = n\sigma_y^{-2}L+\operatorname{diag}\left(\sigma_{0}^{-2},\sigma_{1}^{-2},\ldots,\sigma_{K}^{-2}\right)
and L=\tfrac{1}{n}X^{\prime}X is the matrix with entries L_{00}=1, L_{0j}=L_{j0}=L_{jj}=p_{j} for j\neq 0, and L_{jk}=0 for distinct nonzero j,k.
But the left summand n\sigma_y^{-2}L is singular: (1,-1,\cdots,-1)^{\prime} is an eigenvector with eigenvalue 0. Defining
r_{j}=\frac{\sigma_y^{2}}{n\sigma_{j}^{2}},
we see that if r_{j}\ll 1 for all j then \Lambda itself is nearly singular.
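As a quick numerical check, here is a minimal numpy sketch; the sample size, group sizes, and prior scales below are arbitrary illustrative choices, not values taken from the text.

```python
import numpy as np

# Illustrative design: n = 90 observations of a K = 3 level nominal variable,
# split evenly across levels, with an intercept and full dummy coding.
n, K = 90, 3
v = np.repeat(np.arange(1, K + 1), n // K)
X = np.column_stack([np.ones(n),
                     (v[:, None] == np.arange(1, K + 1)).astype(float)])

sigma_y = 1.0                        # residual standard deviation
sigma_j = np.full(K + 1, 1e3)        # weak independent priors, so r_j << 1

L = X.T @ X / n
Lambda = n * sigma_y**-2 * L + np.diag(sigma_j**-2.0)

u = np.array([1.0] + [-1.0] * K)     # the direction (1, -1, ..., -1)'
print(L @ u)                             # ~ all zeros: u is in the null space of L
print(np.linalg.eigvalsh(Lambda).min())  # ~ 1e-6, the prior precision: Lambda is nearly singular
```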
A nearly singular posterior precision matrix \Lambda is problematic even in this simple scenario, as it leads to numerical instabilities when inverting the matrix. It also implies very high posterior correlations among the regression coefficients. In generalized linear models where we have to use Markov chain Monte Carlo methods to explore the posterior distribution—say, logistic regression or Poisson regression—this leads to very slow mixing of the Markov chain; in the case of Hamiltonian Monte Carlo it can lead to very small step sizes. In either case the result is very slow estimation.
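To illustrate the slow-mixing point, here is a small self-contained sketch (not from the text): random-walk Metropolis applied to a toy bivariate normal with correlation 0.9999999, standing in for the kind of posterior produced by full dummy coding with weak priors. The step size, chain length, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: a bivariate normal with correlation 0.9999999.
rho = 0.9999999
cov = np.array([[1.0, rho], [rho, 1.0]])
prec = np.linalg.inv(cov)

def log_density(x):
    return -0.5 * x @ prec @ x

# Random-walk Metropolis with an isotropic proposal whose scale is matched
# to the narrow direction of the ridge (otherwise almost nothing is accepted).
step = np.sqrt(1 - rho**2)           # ~ conditional standard deviation
x = np.zeros(2)
samples = np.empty((20000, 2))
accepted = 0
for t in range(len(samples)):
    prop = x + step * rng.standard_normal(2)
    if np.log(rng.uniform()) < log_density(prop) - log_density(x):
        x, accepted = prop, accepted + 1
    samples[t] = x

print("acceptance rate:", accepted / len(samples))
print("sample sd of x[0]:", samples[:, 0].std())  # true marginal sd is 1; the chain falls far short
```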
As an example, consider a data set with n=100 observations and a weak prior, \sigma_{j}=10^{3}\sigma_y. Then r_{j}=10^{-8}, and for K=3 we find that the posterior correlation between \beta_{0} and \beta_{j} (j\neq 0) is
\rho_{0j}=-0.99999994
and the posterior correlation between \beta_{j} and \beta_{k}, j,k\neq 0, j\neq k, is
\rho_{jk}=0.9999999.
How large does r_{j} need to be to avoid extreme correlations? Assuming \sigma_{j}=\sigma_{k} for all j,k\neq 0, and hence r_{j}=r_{k} for all j,k\neq 0, the following table gives the minimum value that r_{j} (for j\neq 0) may have and still yield posterior correlations that never exceed 0.9 in absolute value. These figures assume r_{0}=10^{-8}, i.e., a weak prior on the intercept (they are not sensitive to the specific value of r_{0} used).
K | min r_j |
---|---|
3 | 2.6\times10^{-2} |
4 | 1.5\times10^{-2} |
5 | 9.4\times10^{-3} |
6 | 6.5\times10^{-3} |
8 | 3.7\times10^{-3} |
10 | 2.3\times10^{-3} |
13 | 1.4\times10^{-3} |
16 | 9.2\times10^{-4} |
19 | 6.5\times10^{-4} |
Notice that as the number of levels K increases, smaller values for r_{j} become acceptable.
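Here is a numpy sketch of how both the correlations quoted above and thresholds like those in the table could be computed. It assumes a balanced design (p_{j}=1/K) and r_{0}=10^{-8}; the function names and the bisection scheme are my own choices, and the printed thresholds should be read as an approximate reconstruction rather than the exact figures above.

```python
import numpy as np

def posterior_corr(K, r, r0=1e-8):
    """Posterior correlation matrix of (beta_0, ..., beta_K) for a balanced
    full-dummy design; correlations depend only on r_0 and r_j = r."""
    p = 1.0 / K
    # L = X'X / n for the balanced design: L_00 = 1, L_0j = L_j0 = L_jj = p.
    L = np.zeros((K + 1, K + 1))
    L[0, 0] = 1.0
    L[0, 1:] = L[1:, 0] = p
    L[1:, 1:] = p * np.eye(K)
    # Lambda is proportional to L + diag(r_0, r, ..., r); rescaling does not
    # change correlations, so this matrix can be inverted directly.
    Sigma = np.linalg.inv(L + np.diag(np.r_[r0, np.full(K, r)]))
    sd = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(sd, sd)

# Correlations for the example in the text: K = 3, r_j = 1e-8.
corr = posterior_corr(K=3, r=1e-8)
print(corr[0, 1], corr[1, 2])        # ~ -0.9999999 and ~ +0.9999999

def min_r(K, target=0.9):
    """Smallest r for which no posterior correlation exceeds `target`
    in absolute value (simple bisection on a log scale)."""
    lo, hi = 1e-10, 1.0
    for _ in range(100):
        mid = np.sqrt(lo * hi)
        off = posterior_corr(K, mid)[~np.eye(K + 1, dtype=bool)]
        if np.abs(off).max() > target:
            lo = mid
        else:
            hi = mid
    return hi

for K in (3, 4, 5, 6, 8, 10, 13, 16, 19):
    print(K, f"{min_r(K):.2g}")
```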
Hierarchical models can be thought of as regression models involving a nominal variable with many distinct levels. The simplest case is a varying-intercept model:
\begin{array}{rcl}y_{i} &\sim& \text{Normal}\left(\upsilon_{i},\sigma_{y}\right) \\ \upsilon_{i} &=& \alpha+a_{k[i]}+\beta x_{i}, \end{array}
where k[i] is the group to which observation i belongs, and we have the hierarchical prior
\begin{array}{rcl}a_{k}&\sim&\text{Normal}\left(0,\sigma_{a}\right)\\ \sigma_{a}&\sim&\text{some diffuse prior}. \end{array}
The equation for \upsilon_{i} is equivalent to
\begin{array}{rcl}\upsilon_{i}&=&\alpha+\sum_{k=1}^{K}a_{k}w_{ik}+\beta x_{i}\\ w_{i}&=&\text{full dummy coding of level }k[i]. \end{array}
So why don’t we see a problem with high posterior correlations among the coefficients a_{k} in this case? We have two mitigating factors in play:
- K is large. As previously noted, this allows smaller r_{k} values without getting extreme posterior correlations.
- \sigma_{a} is relatively small, yielding larger values for r_{k}. The whole point of a hierarchical model is that we expect the posterior distribution for \sigma_{a} to concentrate on values much smaller than the scale we would choose for a weakly informative prior on the coefficients a_{k} (see the numerical sketch below).
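To make these two factors concrete, here is a numpy sketch in the same style as the earlier ones. The group count, group sizes, and the value used for \sigma_{a} are illustrative assumptions, and the slope term \beta x_{i} is dropped to keep the design matrix simple.

```python
import numpy as np

def max_abs_corr(sigma_a, counts, sigma_y=1.0, sigma_alpha=1e3):
    """Largest |posterior correlation| among (alpha, a_1, ..., a_K) for a
    varying-intercept design with full-dummy group indicators."""
    K, n = len(counts), counts.sum()
    g = np.repeat(np.arange(1, K + 1), counts)
    X = np.column_stack([np.ones(n),
                         (g[:, None] == np.arange(1, K + 1)).astype(float)])
    prior_sd = np.r_[sigma_alpha, np.full(K, sigma_a)]
    Lambda = sigma_y**-2 * (X.T @ X) + np.diag(prior_sd**-2.0)
    Sigma = np.linalg.inv(Lambda)
    sd = np.sqrt(np.diag(Sigma))
    corr = Sigma / np.outer(sd, sd)
    return np.abs(corr[~np.eye(K + 1, dtype=bool)]).max()

counts = np.full(20, 10)   # K = 20 groups with 10 observations each (illustrative)

# Weak fixed priors on the group coefficients: extreme correlations, as before.
print(max_abs_corr(sigma_a=1e3, counts=counts))
# A smaller, hierarchical-scale prior (sigma_a = 0.5 here): moderate correlations.
print(max_abs_corr(sigma_a=0.5, counts=counts))
```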
In summary, we have the following guidelines for encoding nominal variables in a Bayesian regression analysis:
- Do not use full dummy coding, as it will generally lead to severe computational difficulties due to extreme posterior correlations among the regression coefficients; use leave-one-out dummy coding or effects coding instead.
- Group indicators for hierarchical models are a special case that avoids these difficulties.