(The full write-up is here.)
In a previous note I proposed a prior on the regression coefficients for a nominal input variable that leads to a symmetric prior on the effects:
\begin{array}{rcl} \beta & \sim & \text{Normal}\left(\overline{0},W\Sigma W^{\prime}\right)\\ W & = & X^{-1}\\ \Sigma_{jk} & = & \left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array}\right. \\ \rho & = & -1/(K-1) \end{array}
where \beta is the vector of regression coefficients, K is the number of levels, X and \Sigma are square matrices with K-1 rows/columns, row k of X is the encoding for level k\neq K, and level K is encoded as -\sum_{k=1}^{K-1}X_{k}. This gives an induced prior on the K effects \alpha_{k}=X_{k}\beta that is symmetric in the effects, has a mean of 0 and variance of \sigma^{2} for each \alpha_{k}, and is degenerate, in that the sum of the effects is 0.
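For example, with K=3 (so \rho=-1/2) the induced prior on \left(\alpha_{1},\alpha_{2}\right)^{\prime}=X\beta has covariance matrix
\Sigma=\sigma^{2}\begin{pmatrix}1 & -\frac{1}{2}\\ -\frac{1}{2} & 1\end{pmatrix},\qquad\alpha_{3}=-\left(\alpha_{1}+\alpha_{2}\right),
so each effect has variance \sigma^{2} (in particular \text{Var}\left(\alpha_{3}\right)=\sigma^{2}+\sigma^{2}-2\cdot\frac{1}{2}\sigma^{2}=\sigma^{2}) and the three effects sum to 0.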
It may be desirable to avoid actually constructing the matrix \Sigma, either for uniformity in the implementation of a Bayesian regression model (e.g., independent normal priors for each regression coefficient over the entire set of predictor variables) or because K is large. In this note I discuss two ways of achieving this goal:
- choose an encoding of the levels that turns independent normal priors on the regression coefficients into the desired symmetric prior on the effects; or
- directly evaluate the probability density without constructing the matrix \Sigma.
Low to moderate number of levels
When K is small or moderate, we may achieve independent normal distributions for each of the regression coefficients \beta_{k} by careful choice of encoding. If X is chosen such that W\Sigma W^{\prime}=\sigma^{2}I, then the prior
\beta_{k}\sim\text{Normal}\left(0,\sigma\right)\quad\text{for }1\leq k\leq K-1
gives us the desired symmetric prior on the effects.
To find such an encoding X we begin with an eigendecomposition of \Sigma: eigenvalues \lambda_{1},\ldots,\lambda_{K-1} and corresponding orthonormal eigenvectors u_{1},\ldots,u_{K-1} of \Sigma. Then
\Sigma=U\Lambda U^{-1}
where U is the matrix obtained by stacking the eigenvectors side by side, and \Lambda is the diagonal matrix whose k-th diagonal element is \lambda_{k}:
\begin{array}{rcl} U & = & \left(u_{1},\ldots,u_{K-1}\right)\\ \Lambda & = & \text{diagonal}\left(\lambda_{1},\ldots,\lambda_{K-1}\right). \end{array}
In the full write-up I show that choosing
\begin{array}{rcl} X & = & UD\\ D & = & \text{diagonal}\left(d_{1},\ldots,d_{K-1}\right)\\ d_{k} & = & \frac{\lambda_{k}^{1/2}}{\sigma} \end{array}
yields W\Sigma W^{\prime}=\sigma^{2}I as desired.
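To see why, note that the columns of U are orthonormal, so U^{-1}=U^{\prime} and W=X^{-1}=D^{-1}U^{\prime}; then
W\Sigma W^{\prime}=D^{-1}U^{\prime}\left(U\Lambda U^{\prime}\right)UD^{-1}=D^{-1}\Lambda D^{-1},
which is diagonal with k-th diagonal element \lambda_{k}/d_{k}^{2}=\sigma^{2}.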
The orthonormal eigenvectors and corresponding eigenvalues of \Sigma are
\begin{array}{rcl} u_{1} & = & \left(K-1\right)^{-1/2}\left(1,\ldots,1\right)^{\prime}\\ \lambda_{1} & = & \frac{\sigma^{2}}{K-1}\\ d_{1} & = & \left(K-1\right)^{-1/2} \end{array}
and, for 2\leq k\leq K-1,
\begin{array}{rcl} u_{ki} & = & \left(k(k-1)\right)^{-1/2}\cdot\left\{\begin{array}{ll} -1 & \text{if }i<k\\ k-1 & \text{if }i=k\\ 0 & \text{if }i>k \end{array}\right. \\ \lambda_{k} & = & \frac{K\sigma^{2}}{K-1}\\ d_{k} & = & \left(\frac{K}{K-1}\right)^{1/2}. \end{array}
Using X=UD, which multiplies column k of U by d_{k}, we have
X_{ik}=d_{k}u_{ki};
in particular,
\begin{array}{rcl} X_{i1} & = & \frac{1}{K-1}\\ X_{ik} & = & 0\text{ if }1<k<i\\ X_{ii} & = & \left(\frac{K\left(i-1\right)}{i\left(K-1\right)}\right)^{1/2}\text{ if }i\neq1\\ X_{ik} & = & -\left(\frac{K}{k(k-1)\left(K-1\right)}\right)^{1/2}\text{ if }k>i. \end{array}
This gives the required encodings for the first K-1 levels. The encoding for level K is the negative sum of the encodings for levels 1 to K-1, which works out to
X_{K}=\left(-1,0,\ldots,0\right).
As a check, the K\times K covariance matrix for the full effects vector
\tilde{\alpha}=\begin{pmatrix}\alpha\\ \alpha_{K}\end{pmatrix}
is \sigma^{2}\tilde{X}\tilde{X}^{\prime}, where \tilde{X} is obtained by appending X_{K} to X as an additional row. Using Mathematica, I have verified that
\tilde{X}\tilde{X}^{\prime}=\begin{pmatrix}1 & \rho & \cdots & \rho & \rho\\ \rho & 1 & \cdots & \rho & \rho\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \rho & \rho & \cdots & 1 & \rho\\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}\quad\text{where }\rho=-\frac{1}{K-1}
for all K from 3 to 100, as expected.
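The same verification can be sketched numerically; the following NumPy sketch (the helper name encode_levels is mine) builds X from the closed-form entries above and checks both W\Sigma W^{\prime}=\sigma^{2}I and the structure of \tilde{X}\tilde{X}^{\prime} for K from 3 to 100:

import numpy as np

def encode_levels(K):
    # Encoding matrix X (rows = levels 1..K-1), using the closed-form entries above.
    X = np.zeros((K - 1, K - 1))
    for i in range(1, K):          # level i
        for k in range(1, K):      # coefficient k
            if k == 1:
                X[i - 1, k - 1] = 1.0 / (K - 1)
            elif k < i:
                X[i - 1, k - 1] = 0.0
            elif k == i:
                X[i - 1, k - 1] = np.sqrt(K * (i - 1) / (i * (K - 1)))
            else:                  # k > i
                X[i - 1, k - 1] = -np.sqrt(K / (k * (k - 1) * (K - 1)))
    return X

sigma = 1.7
for K in range(3, 101):
    X = encode_levels(K)
    rho = -1.0 / (K - 1)
    Sigma = sigma**2 * ((1 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    # Independent Normal(0, sigma) coefficients induce the symmetric prior on the effects...
    W = np.linalg.inv(X)
    assert np.allclose(W @ Sigma @ W.T, sigma**2 * np.eye(K - 1))
    # ...and the full K x K effects covariance has the expected correlation structure.
    X_tilde = np.vstack([X, -X.sum(axis=0)])
    target = (1 - rho) * np.eye(K) + rho * np.ones((K, K))
    assert np.allclose(X_tilde @ X_tilde.T, target)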
Large number of levels
If K is large, as can occur in a multilevel model where \sigma itself is given a prior, the above approach may be inefficient. Rather than taking the dot product of a level's encoding with the vector of regression coefficients, it may be preferable to instead
- directly create a vector of effects \alpha_{k}, 1\leq k\leq K-1, with prior
\alpha\sim\text{Normal}\left(\overline{0},\Sigma\right);
- compute \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}; then
- use the level k to index into the full effects vector \left(\alpha,\alpha_{K}\right).
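For example (a minimal NumPy sketch with illustrative values; in the model \alpha would be drawn from its prior rather than generated like this):

import numpy as np

K = 5
rng = np.random.default_rng(1)
alpha = rng.normal(size=K - 1)                 # stand-in for the K-1 sampled effects
alpha_full = np.append(alpha, -alpha.sum())    # alpha_K = -sum of the first K-1 effects
level = np.array([2, 0, 4, 4, 1])              # observed level (0-based) for each data row
effect_per_row = alpha_full[level]             # effect entering each row's linear predictor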
If we are estimating the model using Hamiltonian Monte Carlo or similar methods, such as the NUTS sampler used in Stan, then there is the question of efficiently computing the log of the prior density for \alpha. We would like to avoid actually constructing the large matrix \Sigma. First note the following:
- As shown in the full write-up (a short sketch is also given just after this list),
\alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}
where
\tilde{\sigma}^{2}=\frac{K}{K-1}\sigma^{2}
- Since \Sigma=\sigma^{2}S, where S is a matrix that is a function only of K and not of \sigma, we have
\det\Sigma=\sigma^{2(K-1)}\det S
where \det S has no dependence on \sigma.
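To sketch the first of these identities: writing J for the \left(K-1\right)\times\left(K-1\right) matrix of all ones, we have \Sigma=\tilde{\sigma}^{2}\left(I-J/K\right), and one can check that \left(I-J/K\right)^{-1}=I+J, so
\alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\left(\sum_{k=1}^{K-1}\alpha_{k}^{2}+\left(\sum_{k=1}^{K-1}\alpha_{k}\right)^{2}\right)=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2},
since \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}.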
Then, writing \equiv for “equal except for an additive term that has no dependence on \alpha or \sigma,” we have
\begin{array}{rcl} \log\left(\text{Normal}\left(\alpha\mid\overline{0},\Sigma\right)\right) & \equiv & -\frac{1}{2}\log\det\Sigma-\frac{1}{2}\alpha^{\prime}\Sigma^{-1}\alpha\\ & \equiv & -\left(K-1\right)\log\sigma-\frac{1}{2} \tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}\\ & \equiv & \sum_{k=1}^{K-1} \log\left(\text{Normal}\left(\alpha_{k}\mid0,\tilde{\sigma}\right)\right)-\frac{1}{2}\tilde{\sigma}^{-2}\alpha_{K}^{2}. \end{array}
In a Stan model the implementation would look something like this:
transformed parameters {
  ...
  vector[K] alpha1;
  ...
  // full effects vector: the K-1 free effects plus the sum-to-zero level K
  for (k in 1:(K-1))
    alpha1[k] <- alpha[k];
  alpha1[K] <- -sum(alpha);
}
model {
  ...
  // log prior density for alpha, up to an additive constant:
  // -(K-1)*log(sigma) - (1/2)/sigma_tilde^2 * dot_self(alpha1),
  // with sigma_tilde^2 = K/(K-1) * sigma^2
  increment_log_prob(
    -(K-1) * log(sigma)
    - (K-1) / (2 * K * sigma^2) * dot_self(alpha1));
}
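As a sanity check on the algebra (a sketch assuming NumPy and SciPy; not part of the Stan model), the simplified expression agrees with the full multivariate normal log density up to an additive constant that has no dependence on \alpha or \sigma:

import numpy as np
from scipy.stats import multivariate_normal

def simplified_logp(alpha, sigma, K):
    # -(K-1)*log(sigma) - (K-1)/(2*K*sigma^2) * sum of all K squared effects
    alpha_full = np.append(alpha, -alpha.sum())
    return (-(K - 1) * np.log(sigma)
            - (K - 1) / (2 * K * sigma**2) * np.dot(alpha_full, alpha_full))

K = 7
rho = -1.0 / (K - 1)
rng = np.random.default_rng(2)

# The difference between the two log densities should be the same constant
# for every draw of alpha and sigma.
diffs = []
for _ in range(5):
    sigma = rng.uniform(0.5, 2.0)
    alpha = rng.normal(size=K - 1)
    Sigma = sigma**2 * ((1 - rho) * np.eye(K - 1) + rho * np.ones((K - 1, K - 1)))
    full = multivariate_normal(mean=np.zeros(K - 1), cov=Sigma).logpdf(alpha)
    diffs.append(full - simplified_logp(alpha, sigma, K))
assert np.allclose(diffs, diffs[0])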