Short story:
Given a K-level nominal input variable in a Bayesian regression model, let \alpha_k be the effect of level k. To achieve a symmetric prior on the effects, with a prior mean of 0 and a prior variance of \sigma^{2} for each \alpha_{k}, the prior on the vector of regression coefficients \beta should have the form
\beta \sim \text{Normal}\left(\overline{0},\, X^{-1}\Sigma \left(X^{-1}\right)^{\prime}\right)
where
- \Sigma is a square matrix of K-1 rows/columns:
\begin{array}{rcl}\Sigma_{jk} & = & \left\{ \begin{array}{ll} \sigma^{2} & \text{if }j=k \\ \rho\sigma^{2} & \text{if }j\neq k \end{array} \right. \\ \rho & = & -\frac{1}{K-1} \end{array}
- X is a square matrix of K-1 rows/columns giving the level encodings:
  - X_k is the encoding of level k for k\neq K.
  - The encoding of level K is -\sum_{k=1}^{K-1}X_{k}.
  - No row of X is a linear combination of the other rows.
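For concreteness, here is a minimal numpy sketch of this recipe; the helper function effect_prior_cov and the illustrative values are my own choices, not part of the write-up proper:

```python
import numpy as np

def effect_prior_cov(X, sigma):
    """Prior covariance for beta, given a (K-1) x (K-1) full-rank encoding
    matrix X whose rows are the encodings of levels 1..K-1."""
    K = X.shape[0] + 1
    rho = -1.0 / (K - 1)
    # Sigma: sigma^2 on the diagonal, rho * sigma^2 off the diagonal.
    Sigma = sigma**2 * ((1.0 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    X_inv = np.linalg.inv(X)
    return X_inv @ Sigma @ X_inv.T

# With effects coding X is the identity, so the prior covariance is Sigma itself.
print(effect_prior_cov(np.eye(3), sigma=1.0))
```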
Long story: (Full write-up)
Effects
Suppose that we have a K-level nominal input variable used in a Bayesian regression analysis, with each level k encoded as a row vector
X_{k}=\left(x_{k1},\ldots,x_{kp}\right).
Let \beta_{i} be the regression coefficient corresponding to the i-th element of the encoding, so that level k contributes the term
\alpha_{k}=X_{k}\beta=\sum_{i}\beta_{i}x_{ki}
to the overall regression sum. We call \alpha_{k} the effect of level k.
Any prior on \beta defines a corresponding joint prior on the effects \alpha_{k} via the above equation. Our goal is to construct an appropriate prior distribution for \beta using as our only prior information some notion of how large any of the effects may plausibly be: we want the prior mean for each \alpha_{k} to be 0, and the prior variance to be some given value \sigma^{2}. Since this information makes no distinction between the levels, the joint prior for the effects should be symmetric: reordering the levels should leave this joint prior unchanged.
We would like the effects to indicate the differences between levels, and not include any constant (independent of level) contribution to the overall regression sum; thus we require that
\sum_{k=1}^{K}\alpha_{k}=0.
This implies that the joint distribution for the effects is degenerate. In the remainder of this note we therefore define the vector \alpha to be the first K-1 effects,
\alpha^{\prime}=\left(\alpha_{1},\ldots,\alpha_{K-1}\right),
and use
\alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}.
Encodings
Using \alpha_{k}=X_{k}\beta we have
0=\sum_{k=1}^{K}\alpha_{k}=\left(\sum_{k=1}^{K}X_{k}\right)\beta
and so, assuming a non-degenerate (full-dimensional) prior over \beta, the level encodings must satisfy
\sum_{k=1}^{K}X_{k}=\overline{0}.
We therefore define X to be the matrix whose rows are the first K-1 level encodings X_{k}:
X = \begin{pmatrix}X_{1}\\ \vdots\\ X_{K-1}\end{pmatrix}
and use
X_{K}=-\sum_{k=1}^{K-1}X_{k}.
Our equation defining the effects then becomes
\alpha=X\beta
and so, to have a one-to-one correspondence between effects vectors \alpha and regression-coefficient vectors \beta, we require that X be invertible. That is,
- X must be a square matrix (we require p=K-1);
- no level encoding X_{k}, k\neq K, may be expressible as a linear combination of the remaining level encodings (excluding X_{K}).
One example of an encoding satisfying these requirements is effects coding:
\begin{array}{rcl}x_{ki} & = & \left\{\begin{array}{ll}1 & \text{if }i=k\\ 0 & \text{if }i\neq k\end{array}\right. \quad\text{for }k\neq K\\x_{Ki} & = & -1\quad\text{for all }i.\end{array}
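For example, with K=4 the effects coding gives
X = \begin{pmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}, \qquad X_{4}=\left(-1,-1,-1\right),
so that \alpha_{k}=\beta_{k} for k\neq 4 and \alpha_{4}=-\left(\beta_{1}+\beta_{2}+\beta_{3}\right).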
An obvious prior that doesn’t work
With effects coding the obvious symmetric prior for \beta,
\beta_{k}\sim\text{Normal}\left(0,\sigma\right),
leads to a very asymmetric prior for the effects \alpha_{k}: for k\neq K we have
\alpha_{k}\sim\text{Normal}\left(0,\sigma\right)
independently (\text{Cov}\left(\alpha_{j},\alpha_{k}\right)=0 for j\neq k with j,k\neq K), but for k=K we have
\alpha_{K} \sim \text{Normal}\left(0,\,\sqrt{K-1}\,\sigma\right)
and for k\neq K the covariance between \alpha_{k} and \alpha_{K} is -\sigma^{2}.
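As a quick sanity check, a short simulation reproduces these variances and covariances; this is a throwaway numpy sketch with illustrative values K=4 and \sigma=1, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, n = 4, 1.0, 200_000

# The "obvious" prior: independent Normal(0, sigma) on each of the K-1 coefficients.
beta = rng.normal(0.0, sigma, size=(n, K - 1))

# Effects coding: alpha_k = beta_k for k < K, and alpha_K = -sum of the others.
alpha = np.column_stack([beta, -beta.sum(axis=1)])

# Empirical covariance of the K effects: roughly sigma^2 on the first K-1
# diagonal entries, (K-1)*sigma^2 for alpha_K, and -sigma^2 in the last
# row/column off the diagonal.
print(np.round(np.cov(alpha, rowvar=False), 2))
```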
Solution strategy
We find an appropriate prior for \beta by first constructing a symmetric prior for the effects themselves, then solving for the corresponding prior on \beta. The prior we derive for \alpha turns out to be a multivariate normal with mean vector \overline{0} and a covariance matrix \Sigma defined later. Since
\beta=X^{-1}\alpha
the required prior for \beta is
\beta \sim \text{Normal}\left(\overline{0},\, X^{-1}\Sigma \left(X^{-1}\right)^{\prime}\right).
For the effects coding, X is just the identity matrix (remember that X only has K-1 rows, omitting X_{K}), and so the prior covariance matrix for \beta is just \Sigma itself.
We seek to construct the most diffuse, least informative prior distribution for \alpha satisfying
\begin{array}{rcl} \text{E}\left(\alpha_{k}\right) & = & 0\\ \text{Var}\left(\alpha_{k}\right) & = & \sigma^{2} \end{array}
for all k, 1\leq k\leq K. We do so using the method of maximum entropy: our prior will be the maximum-entropy distribution satisfying the given constraints. (See references 1, 2, and 3.)
The entropy of a distribution measures how much uncertainty the distribution leaves about the variable(s) in question: the greater the entropy, the greater the uncertainty and the less informative the distribution. The entropy of a distribution with pdf p(\alpha) is defined as
-\int p(\alpha)\log\left(p(\alpha)/m(\alpha)\right)d\alpha=-\text{E}\left(\log\left(p(\alpha)/m(\alpha)\right)\right)
where m(\alpha) is a reference measure chosen to coincide with some notion of maximal ignorance. Note that the entropy is invariant under a change of variables because both the density and the reference measure transform in the same way.
Form of the maximum-entropy solution
In general, the maximum-entropy distribution satisfying a set of n constraints
\text{E}\left(f_{i}(\alpha)\right)=C_{i}
has a pdf of the form
p(\alpha)=\frac{m(\alpha)}{Z}\exp\left(-\sum_{i=1}^{n}\lambda_{i}f_{i}\left(\alpha\right)\right)
for some n-vector of parameter values \lambda and corresponding normalizing constant Z. Applying this to the problem at hand, and using the uniform measure m(\alpha)=1, we find that the pdf for the maximum-entropy distribution on \alpha having \text{E}\left(\alpha_{k}\right)=0 and \text{E}\left(\alpha_{k}^{2}\right)=\sigma^{2} for all k is
p\left(\alpha\right) = Z^{-1}\exp\left(-\sum_{k=1}^{K}\nu_{k}\alpha_{k}-\sum_{k=1}^{K}\lambda_{k}\alpha_{k}^{2}\right) \quad (1)
for some choice of parameters \nu_{k} and \lambda_{k}, and corresponding normalizing constant Z.
Rather than directly solving for the parameters \nu_{k} and \lambda_{k}, we note the following:
- Since \log p(\alpha) is quadratic in \alpha we can complete the square to re-express p(\alpha) as a multivariate normal density with some mean \mu and covariance matrix \Sigma.
- Since \text{E}\left(\alpha\right)=\overline{0} we know that \mu=\overline{0}.
- Our constraints are symmetric: if \tilde{\alpha} is any vector obtained from \alpha by reordering its elements, the constraints on \alpha are equivalent to identical constraints on \tilde{\alpha}. Therefore the maximum-entropy distribution for \alpha is also symmetric: the distributions for \tilde{\alpha} and \alpha are identical. That is, \Sigma must remain unchanged after any permutation of its rows and columns.
This last observation implies that
- the diagonal elements of \Sigma are all the same; and
- the off-diagonal elements of \Sigma are all the same.
Combining this with the requirement that \text{Var}\left(\alpha_{k}\right)=\sigma^{2} for all k, we see that we must have
\Sigma_{jk}=\left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array} \right.
for some value \rho.
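For example, with K=4 this says that \Sigma is the 3\times 3 matrix
\Sigma=\sigma^{2}\begin{pmatrix}1 & \rho & \rho\\ \rho & 1 & \rho\\ \rho & \rho & 1\end{pmatrix}.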
The full write-up shows that a solution of this form can be written in the maximum-entropy form of equation (1).
Solving for the common covariance
At this point we have satisfied all of the constraints except for \text{Var}\left(\alpha_{K}\right)=\sigma^{2}, and we choose \rho accordingly. Since \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}, its variance is the sum of the K-1 individual variances plus the \left(K-1\right)\left(K-2\right) pairwise covariances:
\text{Var}\left(\alpha_{K}\right) = \left(K-1\right)\sigma^{2}+\left(K-1\right)\left(K-2\right)\rho\sigma^{2}
and so we require
\left(K-1\right)\left(1+\left(K-2\right)\rho\right)=1.
A bit of algebra then gives
\rho=-\frac{1}{K-1}.
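For example, with K=4 this gives \rho=-\frac{1}{3}, and indeed
\text{Var}\left(\alpha_{4}\right)=3\sigma^{2}+3\cdot 2\cdot\left(-\tfrac{1}{3}\right)\sigma^{2}=\sigma^{2}.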
Final notes
The full write-up verifies that
- \Sigma as we have defined it is positive definite, and hence a legitimate covariance matrix; and
- this solution is symmetric: \text{Cov}\left(\alpha_j, \alpha_k\right) = \rho\sigma^2 for any j \neq k, even when one of j or k is K.
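Both points are easy to check numerically. The following numpy sketch (with the illustrative choices K=5 and \sigma=1) computes the eigenvalues of \Sigma and the induced covariance of all K effects:

```python
import numpy as np

K, sigma = 5, 1.0          # illustrative values
rho = -1.0 / (K - 1)

# Sigma: covariance of the first K-1 effects.
Sigma = sigma**2 * ((1.0 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))

# Positive definite: all eigenvalues are strictly positive.
print(np.linalg.eigvalsh(Sigma))

# Covariance of all K effects, with alpha_K = -(alpha_1 + ... + alpha_{K-1}):
# sigma^2 on the diagonal and rho * sigma^2 everywhere off the diagonal.
A = np.vstack([np.eye(K - 1), -np.ones(K - 1)])
print(np.round(A @ Sigma @ A.T, 6))
```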
I first derived this prior circa 2005, but did not publish it. Lenk and Orme independently propose an “effects prior” using the same covariance matrix \Sigma described here, in the context of a hierarchical regression model. Their derivation assumes an effects coding and proceeds from different premises than those used herein.
References
- Jaynes, Edwin T. (1957). “Information Theory and Statistical Mechanics,” Physical Review, Series II 106 (4): 620–630.
- Jaynes, Edwin T. (1957). “Information Theory and Statistical Mechanics II,” Physical Review, Series II 108 (2): 171–190.
- Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science, Cambridge University Press, pp. 351–355.
- Lenk, Peter and Bryan Orme (2009). “The Value of Informative Priors in Bayesian Inference with Sparse Data,” Journal of Marketing Research 46 (6): 832–845.