Full Dummy Coding for Nominal Variables

Short story: Full dummy coding in frequentist regression analyses leads to non-identifiability. In Bayesian analyses that’s not an issue, but a different problem arises: full dummy coding is computationally problematic. The group indicators in a hierarchical model are a special case that avoids these computational problems.

Long story: (PDF)

There are various choices for encoding a nominal input variable for use in a regression analysis. One of these is full dummy coding, in which a $K$ -level nominal variable is replaced by a subvector of $K$ indicator variables:

\text{level }k \Rightarrow (0, \ldots, 0, 1, 0, \ldots, 0)

where the $1$ entry is at position $k$ .

In frequentist analyses this encoding is not used if there is a separate intercept, because it yields a non-identifiable model: subtracting a constant amount from the regression coefficient for each indicator variable and adding it to the intercept leaves the likelihood unchanged. The same problem occurs if there are multiple nominal variables and each uses full dummy coding. Instead, a leave-one-out dummy coding is used ( $w_{K}$ is dropped, so that level $K$ is encoded as all zeroes) or an effects coding is used ( $w_{K}$ is dropped, but $w_{k}=-1$ for all $1\leq k\leq K-1$ when $v_{j}=K$ ).

In principle, non-identifiability is not a problem for a Bayesian analysis: we are interested in obtaining a posterior distribution for the regression coefficients, not a point estimate, and as long as we use a proper prior distribution we are guaranteed a proper posterior distribution. Thus we could, if desired, use a full dummy coding.

In practice, things are less straightforward. Use of full dummy coding in a Bayesian analysis can be problematic because it yields a highly correlated posterior distribution that poses computational difficulties.

To see this, consider a simple problem: a linear regression with only a single input variable, a nominal variable $v$ with $K$ levels. $X_i$ is the row vector of predictor variables derived from $v$ , and $y_i$ is the value of the dependent variable for data point $i$ . We have

y_{i}\sim\text{Normal}\left(X_{i}\beta,\,\sigma_y\right),\quad1\leq i\leq n

together with independent normal priors on the regression coefficients

\beta_{j}\sim\text{Normal}\left(0,\sigma_{j}\right).

The predictor matrix $X$ is defined as

\begin{array}{lcl}X_{i0} &=& 1 \quad\text{(so that }\beta_{0}\text{ is the intercept)} \\ X_{ij} &=& 1 \quad\text{if }j\neq 0\text{ and }v_i=j \\ X_{ij} &=& 0 \quad\text{if }j\neq 0\text{ and }v_i\neq j \end{array}

If $p_{j}$ is the proportion of rows $X_{i}$ for which $X_{ij}=1$ , then one can show (details
here) that the posterior distribution for $\beta$ is a multivariate normal distribution with covariance matrix $\Lambda^{-1}$ , where

\begin{array}{rcl}\Lambda &=& n\sigma_y^{-2}L + P\\[1em] L &=& \begin{pmatrix}1 & p_{1} & p_{2} & \cdots & p_{K}\\ p_{1} & p_{1} & 0 & \cdots & 0\\ p_{2} & 0 & p_{2} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ p_{K} & 0 & 0 & \cdots & p_{K} \end{pmatrix} \\[3em] M &=& \begin{pmatrix}\sigma_{0}^{-2} & 0 & \cdots & 0\\ 0 & \sigma_{1}^{-2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_{K}^{-2} \end{pmatrix}\end{array}

But the left summand $n\sigma_y^{-2}L$ is singular: $(1,-1,\cdots,-1)^{\prime}$ is an eigenvector with eigenvalue of 0. Defining

r_{j}=\frac{\sigma_y^{2}}{n\sigma_{j}^{2}}

we see that if $r_{j}\ll 1$ for all $j$ then $\Lambda$ itself is nearly singular.

A nearly singular posterior precision matrix $\Lambda$ is problematic even in this simple scenario, as it leads to numerical instabilities when inverting the matrix. It also implies very high posterior correlations among the regression coefficients. In generalized linear models where we have to use Markov chain Monte Carlo methods to explore the posterior distribution—say, logistic regression or Poisson regression—this leads to very slow mixing of the Markov chain; in the case of Hamiltonian Monte Carlo it can lead to very small step sizes. In either case the result is very slow estimation.

As an example, consider a data set with $n=100$ observations and a weak prior, $\sigma_{j}=10^{3}\sigma_y$ . Then $r_{j}=10^{-8}$ , and for $K=3$ we find that the posterior correlation between $\beta_{0}$ and $\beta_{j}$ ( $j\neq 0$ ) is

\rho_{0j}=-0.99999994

and the posterior correlation between $\beta_{j}$ and $\beta_{k}$ , $j,k\neq 0$ , $j\neq k$ , is

\rho_{jk}=0.9999999.

How large does $r_{j}$ need to be to avoid extreme correlations? Assuming $\sigma_{j}=\sigma_{k}$ for all $j,k\neq 1$ , and hence $r_{j}=r_{k}$ for all $j,k\neq 0$ , the following table gives the minimum value that $r_{j}$ (for $j\neq 0$ ) may have and still yield posterior correlations that never exceed $0.9$ in absolute value. These figures assume $r_{0}=10^{-8}$ , i.e., a weak prior on the intercept (they are not sensitive to the specific value of $r_{0}$ used.)

$K$	min $r_j$
$3$	$2.6\times10^{-2}$
$4$	$1.5\times10^{-2}$
$5$	$9.4\times10^{-3}$
$6$	$6.5\times10^{-3}$
$8$	$3.7\times10^{-3}$
$10$	$2.3\times10^{-3}$
$13$	$1.4\times10^{-3}$
$16$	$9.2\times10^{-4}$
$19$	$6.5\times10^{-4}$

Notice that as the number of levels $K$ increases, smaller values for $r_{j}$ become acceptable.

Hierarchical models can be thought of as regression models involving a nominal variable with many distinct levels. The simplest case is a varying-intercept model:

\begin{array}{rcl}y_{i} &\sim& \text{Normal}\left(\upsilon_{i},\sigma_{y}\right) \\ \upsilon_{i} &=& \alpha+a_{k[i]}+\beta x_{i} \end{array}

where $k[i]$ is the group to which observation $i$ belongs, and we have the hierarchical prior

\begin{array}{rcl}a_{k}&\sim&\text{Normal}\left(0,\sigma_{a}\right)\\ \sigma_{a}&\sim&\text{some diffuse prior}. \end{array}

The equation for $\upsilon_{i}$ is equivalent to

\begin{array}{rcl}\upsilon_{i}&=&\sum_{k=1}^{K}a_{k}w_{ik}+\beta x_{i}\\ w_{i}&=&\text{full dummy coding of level }k[i]. \end{array}

So why don’t we see a problem with high posterior correlations among the coefficients $a_{k}$ in this case? We have two mitigating factors in play:

$K$ is large. As previously noted, this allows smaller $r_{k}$ values without getting extreme posterior correlations.
$\sigma_{a}$ is relatively small, yielding larger values for $r_{k}$ . The whole point of a hierarchical model is that we expect to find the posterior distribution for $\sigma_{a}$ to be concentrated on values much smaller than what we would assign for a weakly informative prior on the coefficients $a_{k}$ .

In summary, we have the following guidelines for encoding nominal variables in a Bayesian regression analysis:

Do not use full dummy coding, as it will generally lead to severe computational difficulties due to extreme posterior correlations in the regression coefficients.
Group indicators for hierarchical models are a special case that avoids these difficulties.