Short story:
Given a K-level nominal input variable in a Bayesian regression model, let \alpha_k be the effect of level k. To achieve a symmetric prior on the effects, with a prior mean of 0 and a prior variance of \sigma^{2} for each \alpha_{k}, the prior on the vector of regression coefficients \beta should have the form
\beta \sim \text{Normal}\left(\overline{0},\, X^{-1}\Sigma \left(X^{-1}\right)^{\prime}\right)
where
- \Sigma is a square matrix of K-1 rows/columns:
\begin{array}{rcl}\Sigma_{jk} & = & \left\{ \begin{array}{ll} \sigma^{2} & \text{if }j=k \\ \rho\sigma^{2} & \text{if }j\neq k \end{array} \right. \\ \rho & = & -\frac{1}{K-1} \end{array}
- X is a square matrix of K-1 rows/columns giving the level encodings:
  - X_k is the encoding of level k for k\neq K.
  - The encoding of level K is -\sum_{k=1}^{K-1}X_{k}.
  - No row of X is a linear combination of the other rows.
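For concreteness, here is a minimal numpy sketch of this recipe; the helper function effect_prior_cov and the illustrative values are my own choices, not part of the write-up proper:

```python
import numpy as np

def effect_prior_cov(X, sigma):
    """Prior covariance for beta, given a (K-1) x (K-1) full-rank encoding
    matrix X whose rows are the encodings of levels 1..K-1."""
    K = X.shape[0] + 1
    rho = -1.0 / (K - 1)
    # Sigma: sigma^2 on the diagonal, rho * sigma^2 off the diagonal.
    Sigma = sigma**2 * ((1.0 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    X_inv = np.linalg.inv(X)
    return X_inv @ Sigma @ X_inv.T

# With effects coding X is the identity, so the prior covariance is Sigma itself.
print(effect_prior_cov(np.eye(3), sigma=1.0))
```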
Long story: (Full write-up)
Effects
Suppose that we have a K-level nominal input variable used in a Bayesian regression analysis, with each level k encoded as a row vector
X_{k}=\left(x_{k1},\ldots,x_{kp}\right).
Let \beta_{i} be the regression coefficient corresponding to the i-th element of the encoding, so that level k contributes the term
\alpha_{k}=X_{k}\beta=\sum_{i}\beta_{i}x_{ki}
to the overall regression sum. We call \alpha_{k} the effect of level k.
Any prior on \beta defines a corresponding joint prior on the effects \alpha_{k} via the above equation. Our goal is to construct an appropriate prior distribution for \beta using as our only prior information some notion of how large any of the effects may plausibly be: we want the prior mean for each \alpha_{k} to be 0, and the prior variance to be some given value \sigma^{2}. Since this information makes no distinction between the levels, the joint prior for the effects should be symmetric: reordering the levels should leave this joint prior unchanged.
We would like the effects to indicate the differences between levels, and not include any constant (independent of level) contribution to the overall regression sum; thus we require that
\sum_{k=1}^{K}\alpha_{k}=0.
This implies that the joint distribution for the effects is degenerate. In the remainder of this note we therefore define the vector \alpha to be the first K-1 effects,
\alpha^{\prime}=\left(\alpha_{1},\ldots,\alpha_{K-1}\right),
and use
\alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}.
Encodings
Using \alpha_{k}=X_{k}\beta we have
0=\sum_{k=1}^{K}\alpha_{k}=\left(\sum_{k=1}^{K}X_{k}\right)\beta
and so, assuming a non-degenerate (full-dimensional) prior over \beta, the level encodings must satisfy
\sum_{k=1}^{K}X_{k}=\overline{0}.
We therefore define X to be the matrix whose rows are the first K-1 level encodings X_{k}:
X = \begin{pmatrix}X_{1}\\ \vdots\\ X_{K-1}\end{pmatrix}
and use
X_{K}=-\sum_{k=1}^{K-1}X_{k}.
Our equation defining the effects then becomes
\alpha=X\beta
and so, to have a one-to-one correspondence between effects vectors \alpha and regression-coefficient vectors \beta, we require that X be invertible. That is,
- X must be a square matrix (we require p=K-1);
- no level encoding X_{k}, k\neq K, may be expressible as a linear combination of the remaining level encodings (excluding X_{K}).
One example of an encoding satisfying these requirements is effects coding:
\begin{array}{rcl}x_{ki} & = & \left\{\begin{array}{ll}1 & \text{if }i=k\\ 0 & \text{if }i\neq k\end{array}\right. \quad\text{for }k\neq K\\x_{Ki} & = & -1\quad\text{for all }i.\end{array}
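For example, with K=4 the effects coding gives
X = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}, \qquad X_{4}=\left(-1,-1,-1\right),
so that \alpha_{k}=\beta_{k} for k\neq 4 and \alpha_{4}=-\left(\beta_{1}+\beta_{2}+\beta_{3}\right).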
An obvious prior that doesn’t work
With effects coding the obvious symmetric prior for \beta,
\beta_{k}\sim\text{Normal}\left(0,\sigma\right),
leads to a very asymmetric prior for the effects \alpha_{k}: for k\neq K we have
\alpha_{k}\sim\text{Normal}\left(0,\sigma\right)
independently (\text{Cov}\left(\alpha_{j},\alpha_{k}\right)=0 for j\neq k with j,k\neq K), but for k=K we have
\alpha_{K} \sim \text{Normal}\left(0,\,\sqrt{K-1}\,\sigma\right)
and for k\neq K the covariance between \alpha_{k} and \alpha_{K} is -\sigma^{2}.
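As a quick sanity check, a short simulation reproduces these variances and covariances; this is a throwaway numpy sketch with illustrative values K=4 and \sigma=1, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, n = 4, 1.0, 200_000

# The "obvious" prior: independent Normal(0, sigma) on each of the K-1 coefficients.
beta = rng.normal(0.0, sigma, size=(n, K - 1))

# Effects coding: alpha_k = beta_k for k < K, and alpha_K = -sum of the others.
alpha = np.column_stack([beta, -beta.sum(axis=1)])

# Empirical covariance of the K effects: roughly sigma^2 on the first K-1
# diagonal entries, (K-1)*sigma^2 for alpha_K, and -sigma^2 in the last
# row/column off the diagonal.
print(np.round(np.cov(alpha, rowvar=False), 2))
```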
Solution strategy
We find an appropriate prior for \beta by first constructing a symmetric prior for the effects themselves, then solving for the corresponding prior on \beta. The prior we derive for \alpha turns out to be a multivariate normal with mean vector \overline{0} and a covariance matrix \Sigma defined later. Since
\beta=X^{-1}\alpha
the required prior for \beta is
\beta \sim \text{Normal}\left(\overline{0},\, X^{-1}\Sigma \left(X^{-1}\right)^{\prime}\right).
For the effects coding, X is just the identity matrix (remember that X only has K-1 rows, omitting X_{K}), and so the prior covariance matrix for \beta is just \Sigma itself.
We seek to construct the most diffuse, least informative prior distribution for \alpha satisfying
\begin{array}{rcl} \text{E}\left(\alpha_{k}\right) & = & 0\\ \text{Var}\left(\alpha_{k}\right) & = & \sigma^{2} \end{array}
for all k, 1\leq k\leq K. We do so using the method of maximum entropy: our prior will be the maximum-entropy distribution satisfying the given constraints. (See references 1, 2, and 3.)
The entropy of a distribution measures how much uncertainty the distribution leaves about the variable(s) in question: the greater the entropy, the greater the uncertainty and the less informative the distribution. The entropy of a distribution with pdf p(\alpha) is defined as
-\int p(\alpha)\log\left(p(\alpha)/m(\alpha)\right)d\alpha=-\text{E}\left(\log\left(p(\alpha)/m(\alpha)\right)\right)
where m(\alpha) is a reference measure chosen to coincide with some notion of maximal ignorance. Note that the entropy is invariant under a change of variables because both the density and the reference measure transform in the same way.
Form of the maximum-entropy solution
In general, the maximum-entropy distribution satisfying a set of n constraints
\text{E}\left(f_{i}(\alpha)\right)=C_{i}
has a pdf of the form
p(\alpha)=\frac{m(\alpha)}{Z}\exp\left(-\sum_{i=1}^{n}\lambda_{i}f_{i}\left(\alpha\right)\right)
for some n-vector of parameter values \lambda and corresponding normalizing constant Z. Applying this to the problem at hand, and using the uniform measure m(\alpha)=1, we find that the pdf for the maximum-entropy distribution on \alpha having \text{E}\left(\alpha_{k}\right)=0 and \text{E}\left(\alpha_{k}^{2}\right)=\sigma^{2} for all k is
p\left(\alpha\right) = Z^{-1}\exp\left(-\sum_{k=1}^{K}\nu_{k}\alpha_{k}-\sum_{k=1}^{K}\lambda_{k}\alpha_{k}^{2}\right) \quad (1)
for some choice of parameters \nu_{k} and \lambda_{k}, and corresponding normalizing constant Z.
Rather than directly solving for the parameters \nu_{k} and \lambda_{k}, we note the following:
- Since \log p(\alpha) is quadratic in \alpha we can complete the square to re-express p(\alpha) as a multivariate normal density with some mean \mu and covariance matrix \Sigma.
- Since \text{E}\left(\alpha\right)=\overline{0} we know that \mu=\overline{0}.
- Our constraints are symmetric: if \tilde{\alpha} is any vector obtained from \alpha by reordering its elements, the constraints on \alpha are equivalent to identical constraints on \tilde{\alpha}. Therefore the maximum-entropy distribution for \alpha is also symmetric: the distributions for \tilde{\alpha} and \alpha are identical. That is, \Sigma must remain unchanged after any permutation of its rows and columns.
This last observation implies that
- the diagonal elements of \Sigma are all the same; and
- the off-diagonal elements of \Sigma are all the same.
Combining this with the requirement that \text{Var}\left(\alpha_{k}\right)=\sigma^{2} for all k, we see that we must have
\Sigma_{jk}=\left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array} \right.
for some value \rho.
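For example, with K=4 this says that \Sigma is the 3\times 3 matrix
\Sigma=\sigma^{2}\begin{pmatrix}1 & \rho & \rho\\ \rho & 1 & \rho\\ \rho & \rho & 1\end{pmatrix}.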
The full write-up shows that a solution of this form can be written in the maximum-entropy form of equation (1).
Solving for the common covariance
At this point we have satisfied all of the constraints except for \text{Var}\left(\alpha_{K}\right)=\sigma^{2}, and we choose \rho accordingly. Since \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}, its variance is the sum of the K-1 individual variances plus the \left(K-1\right)\left(K-2\right) pairwise covariances:
\text{Var}\left(\alpha_{K}\right) = \left(K-1\right)\sigma^{2}+\left(K-1\right)\left(K-2\right)\rho\sigma^{2}
and so we require
\left(K-1\right)\left(1+\left(K-2\right)\rho\right)=1.
A bit of algebra then gives
\rho=-\frac{1}{K-1}.
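For example, with K=4 this gives \rho=-\frac{1}{3}, and indeed
\text{Var}\left(\alpha_{4}\right)=3\sigma^{2}+3\cdot 2\cdot\left(-\tfrac{1}{3}\right)\sigma^{2}=\sigma^{2}.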
Final notes
The full write-up verifies that
- \Sigma as we have defined it is positive definite, and hence a legitimate covariance matrix; and
- this solution is symmetric: \text{Cov}\left(\alpha_j, \alpha_k\right) = \rho\sigma^2 for any j \neq k, even when one of j or k is K.
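Both points are easy to check numerically. The following numpy sketch (with the illustrative choices K=5 and \sigma=1) computes the eigenvalues of \Sigma and the induced covariance of all K effects:

```python
import numpy as np

K, sigma = 5, 1.0          # illustrative values
rho = -1.0 / (K - 1)

# Sigma: covariance of the first K-1 effects.
Sigma = sigma**2 * ((1.0 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))

# Positive definite: all eigenvalues are strictly positive.
print(np.linalg.eigvalsh(Sigma))

# Covariance of all K effects, with alpha_K = -(alpha_1 + ... + alpha_{K-1}):
# sigma^2 on the diagonal and rho * sigma^2 everywhere off the diagonal.
A = np.vstack([np.eye(K - 1), -np.ones(K - 1)])
print(np.round(A @ Sigma @ A.T, 6))
```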
I first derived this prior circa 2005, but did not publish it. Lenk and Orme independently propose an “effects prior” using the same covariance matrix \Sigma described here, in the context of a hierarchical regression model. Their derivation assumes an effects coding and proceeds from different premises than those used herein.
References
- Jaynes, Edwin T. (1957). “Information Theory and Statistical Mechanics,” Physical Review, Series II 106 (4): 620–630.
- Jaynes, Edwin T. (1957). “Information Theory and Statistical Mechanics II,” Physical Review, Series II 108 (2): 171–190.
- Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science, Cambridge University Press, pp. 351–355.
- Lenk, Peter and Bryan Orme (2009). “The Value of Informative Priors in Bayesian Inference with Sparse Data,” Journal of Marketing Research 46 (6): 832–845.