(The full write-up is here.)
In a previous note I proposed a prior on the regression coefficients for a nominal input variable that leads to a symmetric prior on the effects:
\begin{array}{rcl} \beta & \sim & \text{Normal}\left(\overline{0},W\Sigma W^{\prime}\right)\\ W & = & X^{-1}\\ \Sigma_{jk} & = & \left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array}\right. \\ \rho & = & -1/(K-1) \end{array}
where \beta is the vector of regression coefficients, K is the number of levels, X and \Sigma are square matrices with K-1 rows/columns, row k of X is the encoding for level k\neq K, and level K is encoded as -\sum_{k=1}^{K-1}X_{k}. This gives an induced prior on the K effects \alpha_{k}=X_{k}\beta that is symmetric in the effects, has a mean of 0 and variance of \sigma^{2} for each \alpha_{k}, and is degenerate, in that the sum of the effects is 0.
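For example, with K=3 (so \rho=-1/2) the induced prior on \left(\alpha_{1},\alpha_{2}\right)^{\prime}=X\beta has covariance matrix
\Sigma=\sigma^{2}\begin{pmatrix}1 & -\frac{1}{2}\\ -\frac{1}{2} & 1\end{pmatrix},\qquad\alpha_{3}=-\left(\alpha_{1}+\alpha_{2}\right),
so each effect has variance \sigma^{2} (in particular \text{Var}\left(\alpha_{3}\right)=\sigma^{2}+\sigma^{2}-2\cdot\frac{1}{2}\sigma^{2}=\sigma^{2}) and the three effects sum to 0.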
It may be desirable to avoid actually constructing the matrix \Sigma, either for uniformity in the implementation of a Bayesian regression model (e.g., independent normal priors for each regression coefficient over the entire set of predictor variables) or because K is large. In this note I discuss two ways of achieving this goal:
- choose an encoding of the levels that turns independent normal priors on the regression coefficients into the desired symmetric prior on the effects; or
- directly evaluate the probability density without constructing the matrix \Sigma.
Low to moderate number of levels
When K is small or moderate, we may achieve independent normal distributions for each of the regression coefficients \beta_{k} by careful choice of encoding. If X is chosen such that W\Sigma W^{\prime}=\sigma^{2}I, then the prior
\beta_{k}\sim\text{Normal}\left(0,\sigma\right)\quad\text{for }1\leq k\leq K-1
gives us the desired symmetric prior on the effects.
To find such an encoding X we begin with an eigendecomposition of \Sigma: eigenvalues \lambda_{1},\ldots,\lambda_{K-1} and corresponding orthonormal eigenvectors u_{1},\ldots,u_{K-1} of \Sigma. Then
\Sigma=U\Lambda U^{-1}
where U is the matrix obtained by stacking the eigenvectors side by side, and \Lambda is the diagonal matrix whose k-th diagonal element is \lambda_{k}:
\begin{array}{rcl} U & = & \left(u_{1},\ldots,u_{K-1}\right)\\ \Lambda & = & \text{diagonal}\left(\lambda_{1},\ldots,\lambda_{K-1}\right). \end{array}
In the full write-up I show that choosing
\begin{array}{rcl} X & = & UD\\ D & = & \text{diagonal}\left(d_{1},\ldots,d_{K-1}\right)\\ d_{k} & = & \frac{\lambda_{k}^{1/2}}{\sigma} \end{array}
yields W\Sigma W^{\prime}=\sigma^{2}I as desired.
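To see why, note that the columns of U are orthonormal, so U^{-1}=U^{\prime} and W=X^{-1}=D^{-1}U^{\prime}; then
W\Sigma W^{\prime}=D^{-1}U^{\prime}\left(U\Lambda U^{\prime}\right)UD^{-1}=D^{-1}\Lambda D^{-1},
which is diagonal with k-th diagonal element \lambda_{k}/d_{k}^{2}=\sigma^{2}.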
The orthonormal eigenvectors and corresponding eigenvalues of \Sigma are
\begin{array}{rcl} u_{1} & = & \left(K-1\right)^{-1/2}\left(1,\ldots,1\right)^{\prime}\\ \lambda_{1} & = & \frac{\sigma^{2}}{K-1}\\ d_{1} & = & \left(K-1\right)^{-1/2} \end{array}
and, for 2\leq k\leq K-1,
\begin{array}{rcl} u_{ki} & = & \left(k(k-1)\right)^{-1/2}\cdot\left\{\begin{array}{ll} -1 & \text{if }i<k\\ k-1 & \text{if }i=k\\ 0 & \text{if }i>k \end{array}\right. \\ \lambda_{k} & = & \frac{K\sigma^{2}}{K-1}\\ d_{k} & = & \left(\frac{K}{K-1}\right)^{1/2}. \end{array}
Using X=UD, which multiplies column k of U by d_{k}, we have
X_{ik}=d_{k}u_{ki};
in particular,
\begin{array}{rcl} X_{i1} & = & \frac{1}{K-1}\\ X_{ik} & = & 0\text{ if }1<k<i\\ X_{ii} & = & \left(\frac{K\left(i-1\right)}{i\left(K-1\right)}\right)^{1/2}\text{ if }i\neq1\\ X_{ik} & = & -\left(\frac{K}{k(k-1)\left(K-1\right)}\right)^{1/2}\text{ if }k>i. \end{array}
This gives the required encodings for the first K-1 levels. The encoding for level K is the negative sum of the encodings for levels 1 to K-1, which works out to
X_{K}=\left(-1,0,\ldots,0\right).
As a check, the K\times K covariance matrix for the full effects vector
\tilde{\alpha}=\begin{pmatrix}\alpha\\ \alpha_{K}\end{pmatrix}
is \sigma^{2}\tilde{X}\tilde{X}^{\prime}, where \tilde{X} is obtained by appending X_{K} to X as an additional row. Using Mathematica, I have verified that
\tilde{X}\tilde{X}^{\prime}=\begin{pmatrix}1 & \rho & \cdots & \rho & \rho\\ \rho & 1 & \cdots & \rho & \rho\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \rho & \rho & \cdots & 1 & \rho\\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}\quad\text{where }\rho=-\frac{1}{K-1}
for all K from 3 to 100, as expected.
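The same verification can be sketched numerically; the following NumPy sketch (the helper name encode_levels is mine) builds X from the closed-form entries above and checks both W\Sigma W^{\prime}=\sigma^{2}I and the structure of \tilde{X}\tilde{X}^{\prime} for K from 3 to 100:

import numpy as np

def encode_levels(K):
    # Encoding matrix X (rows = levels 1..K-1), using the closed-form entries above.
    X = np.zeros((K - 1, K - 1))
    for i in range(1, K):          # level i
        for k in range(1, K):      # coefficient k
            if k == 1:
                X[i - 1, k - 1] = 1.0 / (K - 1)
            elif k < i:
                X[i - 1, k - 1] = 0.0
            elif k == i:
                X[i - 1, k - 1] = np.sqrt(K * (i - 1) / (i * (K - 1)))
            else:                  # k > i
                X[i - 1, k - 1] = -np.sqrt(K / (k * (k - 1) * (K - 1)))
    return X

sigma = 1.7
for K in range(3, 101):
    X = encode_levels(K)
    rho = -1.0 / (K - 1)
    Sigma = sigma**2 * ((1 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    # Independent Normal(0, sigma) coefficients induce the symmetric prior on the effects...
    W = np.linalg.inv(X)
    assert np.allclose(W @ Sigma @ W.T, sigma**2 * np.eye(K - 1))
    # ...and the full K x K effects covariance has the expected correlation structure.
    X_tilde = np.vstack([X, -X.sum(axis=0)])
    target = (1 - rho) * np.eye(K) + rho * np.ones((K, K))
    assert np.allclose(X_tilde @ X_tilde.T, target)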
Large number of levels
If K is large, as can occur in a multilevel model where \sigma itself is given a prior, the above approach may be inefficient. Rather than taking the dot product of a level's encoding with the vector of regression coefficients, it may be preferable to instead
- directly create a vector of effects \alpha_{k}, 1\leq k\leq K-1, with prior
\alpha\sim\text{Normal}\left(\overline{0},\Sigma\right);
- compute \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}; then
- use the level k to index into the full effects vector \left(\alpha,\alpha_{K}\right).
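For example (a minimal NumPy sketch with illustrative values; in the model \alpha would be drawn from its prior rather than generated like this):

import numpy as np

K = 5
rng = np.random.default_rng(1)
alpha = rng.normal(size=K - 1)                 # stand-in for the K-1 sampled effects
alpha_full = np.append(alpha, -alpha.sum())    # alpha_K = -sum of the first K-1 effects
level = np.array([2, 0, 4, 4, 1])              # observed level (0-based) for each data row
effect_per_row = alpha_full[level]             # effect entering each row's linear predictor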
If we are estimating the model using Hamiltonian Monte Carlo or similar methods, such as the NUTS sampler used in Stan, then there is the question of efficiently computing the log of the prior density for \alpha. We would like to avoid actually constructing the large matrix \Sigma. First note the following:
- As shown in the full write-up (a short sketch is also given just after this list),
\alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}
where
\tilde{\sigma}^{2}=\frac{K}{K-1}\sigma^{2}
- Since \Sigma=\sigma^{2}S, where S is a matrix that is a function only of K and not of \sigma, we have
\det\Sigma=\sigma^{2(K-1)}\det S
where \det S has no dependence on \sigma.
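To sketch the first of these identities: writing J for the \left(K-1\right)\times\left(K-1\right) matrix of all ones, we have \Sigma=\tilde{\sigma}^{2}\left(I-J/K\right), and one can check that \left(I-J/K\right)^{-1}=I+J, so
\alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\left(\sum_{k=1}^{K-1}\alpha_{k}^{2}+\left(\sum_{k=1}^{K-1}\alpha_{k}\right)^{2}\right)=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2},
since \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}.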
Then, writing \equiv for “equal except for an additive term that has no dependence on \alpha or \sigma,” we have
\begin{array}{rcl} \log\left(\text{Normal}\left(\alpha\mid\overline{0},\Sigma\right)\right) & \equiv & -\frac{1}{2}\log\det\Sigma-\frac{1}{2}\alpha^{\prime}\Sigma^{-1}\alpha\\ & \equiv & -\left(K-1\right)\log\sigma-\frac{1}{2} \tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}\\ & \equiv & \sum_{k=1}^{K-1} \log\left(\text{Normal}\left(\alpha_{k}\mid0,\tilde{\sigma}\right)\right)-\frac{1}{2}\tilde{\sigma}^{-2}\alpha_{K}^{2}. \end{array}
In a Stan model the implementation would look something like this:
transformed parameters {
  ...
  vector[K] alpha1;
  ...
  // full effects vector: the K-1 free effects plus the sum-to-zero level K
  for (k in 1:(K-1))
    alpha1[k] <- alpha[k];
  alpha1[K] <- -sum(alpha);
}
model {
  ...
  // log prior density for alpha, up to an additive constant:
  // -(K-1)*log(sigma) - (1/2)/sigma_tilde^2 * dot_self(alpha1),
  // with sigma_tilde^2 = K/(K-1) * sigma^2
  increment_log_prob(
    -(K-1) * log(sigma)
    - (K-1) / (2 * K * sigma^2) * dot_self(alpha1));
}
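As a sanity check on the algebra (a sketch assuming NumPy and SciPy; not part of the Stan model), the simplified expression agrees with the full multivariate normal log density up to an additive constant that has no dependence on \alpha or \sigma:

import numpy as np
from scipy.stats import multivariate_normal

def simplified_logp(alpha, sigma, K):
    # -(K-1)*log(sigma) - (K-1)/(2*K*sigma^2) * sum of all K squared effects
    alpha_full = np.append(alpha, -alpha.sum())
    return (-(K - 1) * np.log(sigma)
            - (K - 1) / (2 * K * sigma**2) * np.dot(alpha_full, alpha_full))

K = 7
rho = -1.0 / (K - 1)
rng = np.random.default_rng(2)

# The difference between the two log densities should be the same constant
# for every draw of alpha and sigma.
diffs = []
for _ in range(5):
    sigma = rng.uniform(0.5, 2.0)
    alpha = rng.normal(size=K - 1)
    Sigma = sigma**2 * ((1 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    full = multivariate_normal(mean=np.zeros(K - 1), cov=Sigma).logpdf(alpha)
    diffs.append(full - simplified_logp(alpha, sigma, K))
assert np.allclose(diffs, diffs[0])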