Bayesium Analytics

Practical Bayesian Data Analysis

July 14, 2015 by kevin@ksvanhorn.com

Implementing the Symmetric Effects Prior

(The full write-up is here.)

In a previous note I proposed a prior on the regression coefficients for a nominal input variable that leads to a symmetric prior on the effects:

\begin{array}{rcl} \beta & \sim & \text{Normal}\left(\overline{0},W\Sigma W^{\prime}\right)\\ W & = & X^{-1}\\ \Sigma_{jk} & = & \left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array}\right. \\ \rho & = & -1/(K-1) \end{array}

where \beta is the vector of regression coefficients, K is the number of levels, X and \Sigma are square matrices with K-1 rows/columns, row k of X is the encoding for level k\neq K, and level K is encoded as -\sum_{k=1}^{K-1}X_{k}. This gives an induced prior on the K effects \alpha_{k}=X_{k}\beta that is symmetric in the effects, has a mean of 0 and variance of \sigma^{2} for each \alpha_{k}, and is degenerate, in that the sum of the effects is 0.
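
To make the prior concrete, here is a minimal NumPy sketch (illustrative only, not from the original post; the function name effects_prior_cov is arbitrary) that constructs \Sigma for a given K and \sigma:

import numpy as np

def effects_prior_cov(K, sigma):
    # Sigma is (K-1) x (K-1), with sigma^2 on the diagonal and
    # rho * sigma^2 off the diagonal, where rho = -1/(K-1).
    rho = -1.0 / (K - 1)
    Sigma = np.full((K - 1, K - 1), rho * sigma**2)
    np.fill_diagonal(Sigma, sigma**2)
    return Sigma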

It may be desirable to avoid actually constructing the matrix \Sigma, either for uniformity in the implementation of a Bayesian regression model (e.g., independent normal priors for each regression coefficient over the entire set of predictor variables) or because K is large. In this note I discuss two ways of achieving this goal:

  • choose an encoding of the levels that turns independent normal priors on the regression coefficients into the desired symmetric prior on the effects; or
  • directly evaluate the probability density without constructing the matrix \Sigma.

Low to moderate number of levels

When K is small or moderate, we may achieve independent normal distributions for each of the regression coefficients \beta_{k} by careful choice of encoding. If X is chosen such that W\Sigma W^{\prime}=\sigma^{2}I, then the prior

\beta_{k}\sim\text{Normal}\left(0,\sigma\right)\quad\text{for }1\leq k\leq K-1

gives us the desired symmetric prior on the effects.

To find such an encoding X we begin with an eigendecomposition of \Sigma: eigenvalues \lambda_{1},\ldots,\lambda_{K-1} and corresponding orthonormal eigenvectors u_{1},\ldots,u_{K-1} of \Sigma. Then

\Sigma=U\Lambda U^{-1}

where U is the matrix obtained by stacking the eigenvectors side by side, and \Lambda is the diagonal matrix whose k-th diagonal element is \lambda_{k}:

\begin{array}{rcl} U & = & \left(u_{1},\ldots,u_{K-1}\right)\\ \Lambda & = & \text{diagonal}\left(\lambda_{1},\ldots,\lambda_{K-1}\right). \end{array}

In the full write-up I show that choosing

\begin{array}{rcl} X & = & UD\\ D & = & \text{diagonal}\left(d_{1},\ldots,d_{K-1}\right)\\ d_{k} & = & \frac{\lambda_{k}^{1/2}}{\sigma} \end{array}

yields W\Sigma W^{\prime}=\sigma^2 I as desired.
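
This recipe is easy to check numerically. The sketch below (illustrative only, not part of the original implementation) builds \Sigma directly, eigendecomposes it with numpy.linalg.eigh, forms X = UD, and confirms W\Sigma W^{\prime}=\sigma^{2}I. Note that eigh may return a different orthonormal basis for the repeated eigenvalue than the closed-form vectors given below, but the identity holds for any valid choice:

import numpy as np

K, sigma = 5, 2.0
rho = -1.0 / (K - 1)
Sigma = sigma**2 * (rho * np.ones((K - 1, K - 1)) + (1 - rho) * np.eye(K - 1))

lam, U = np.linalg.eigh(Sigma)       # eigenvalues and orthonormal eigenvectors
D = np.diag(np.sqrt(lam) / sigma)    # d_k = lambda_k^{1/2} / sigma
X = U @ D
W = np.linalg.inv(X)

print(np.allclose(W @ Sigma @ W.T, sigma**2 * np.eye(K - 1)))   # True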

The orthonormal eigenvectors and corresponding eigenvalues of \Sigma, together with the resulting scale factors d_{k}, are

\begin{array}{rcl} u_{1} & = & \left(K-1\right)^{-1/2}\left(1,\ldots,1\right)^{\prime}\\ \lambda_{1} & = & \frac{\sigma^{2}}{K-1}\\ d_{1} & = & \left(K-1\right)^{-1/2} \end{array}

and, for 2\leq k\leq K-1,

\begin{array}{rcl} u_{ki} & = & \left(k(k-1)\right)^{-1/2}\cdot\left\{\begin{array}{ll} -1 & \text{if }i<k\\ k-1 & \text{if }i=k\\ 0 & \text{if }i>k \end{array}\right. \\ \lambda_{k} & = & \frac{K\sigma^{2}}{K-1}\\ d_{k} & = & \left(\frac{K}{K-1}\right)^{1/2}. \end{array}
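
These closed forms can also be verified numerically. The following sketch (again illustrative, with arbitrary variable names) builds U and the eigenvalues from the formulas above and checks that the columns are orthonormal eigenvectors of \Sigma:

import numpy as np

K, sigma = 6, 1.5
rho = -1.0 / (K - 1)
Sigma = sigma**2 * (rho * np.ones((K - 1, K - 1)) + (1 - rho) * np.eye(K - 1))

U = np.zeros((K - 1, K - 1))
lam = np.zeros(K - 1)
U[:, 0] = (K - 1) ** -0.5                    # u_1: constant vector
lam[0] = sigma**2 / (K - 1)
for k in range(2, K):                        # u_k for 2 <= k <= K-1
    c = (k * (k - 1)) ** -0.5
    U[:k - 1, k - 1] = -c                    # components i < k
    U[k - 1, k - 1] = (k - 1) * c            # component i = k (components i > k stay 0)
    lam[k - 1] = K * sigma**2 / (K - 1)

print(np.allclose(U.T @ U, np.eye(K - 1)))   # True: columns are orthonormal
print(np.allclose(Sigma @ U, U * lam))       # True: Sigma U = U Lambda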

Using X=UD, which multiplies column k of U by d_{k}, we have

X_{ik}=d_{k}u_{ki};

in particular,

\begin{array}{rcl} X_{i1} & = & \frac{1}{K-1}\\ X_{ik} & = & \left(\frac{K}{(K-1)\,k(k-1)}\right)^{1/2}\cdot\left\{\begin{array}{ll} -1 & \text{if }i<k\\ k-1 & \text{if }i=k\\ 0 & \text{if }i>k \end{array}\right.\quad\text{for }1<k\leq K-1. \end{array}

This gives the required encodings for the first K-1 levels. The encoding for level K is the negative sum of the encodings for levels 1 to K-1, which works out to

X_{K}=\left(-1,0,\ldots,0\right).

As a check, the K\times K covariance matrix for the full effects vector

\tilde{\alpha}=\begin{pmatrix}\alpha\\ \alpha_{K}\end{pmatrix}

is \sigma^{2}\tilde{X}\tilde{X}^{\prime}, where \tilde{X} is obtained by appending X_{K} to X as an additional row. Using Mathematica, I have verified that

\tilde{X}\tilde{X}^{\prime}=\begin{pmatrix}1 & \rho & \cdots & \rho & \rho\\ \rho & 1 & \cdots & \rho & \rho\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \rho & \rho & \cdots & 1 & \rho\\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}\quad\text{where }\rho=-\frac{1}{K-1}

for all K from 3 to 100, as expected.
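
The same check is easy to reproduce outside Mathematica. The sketch below (illustrative; the function name symmetric_effects_encoding is arbitrary) builds \tilde{X} from the closed-form entries, appends the level-K row as the negative sum of the others, and confirms the covariance structure for K from 3 to 100:

import numpy as np

def symmetric_effects_encoding(K):
    # rows 1..K-1 of X, built from X_{ik} = d_k u_{ki}
    X = np.zeros((K - 1, K - 1))
    X[:, 0] = 1.0 / (K - 1)                          # first column
    for k in range(2, K):                            # columns 2..K-1
        c = (K / ((K - 1) * k * (k - 1))) ** 0.5
        X[:k - 1, k - 1] = -c                        # rows i < k
        X[k - 1, k - 1] = (k - 1) * c                # row i = k (rows i > k stay 0)
    X_K = -X.sum(axis=0)                             # encoding for level K
    return np.vstack([X, X_K])                       # X-tilde, shape (K, K-1)

for K in range(3, 101):
    Xt = symmetric_effects_encoding(K)
    rho = -1.0 / (K - 1)
    expected = rho * np.ones((K, K)) + (1 - rho) * np.eye(K)
    assert np.allclose(Xt @ Xt.T, expected)
print("X-tilde X-tilde' has 1 on the diagonal and rho off the diagonal for K = 3..100")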

Large number of levels

If K is large, as occurs in a multilevel model with a prior on \sigma, the above approach may be inefficient. Rather than doing a dot product of a level encoding with a vector of regression coefficients, it may be preferable to instead

  1. directly create a vector of effects \alpha_{k}, 1\leq k\leq K-1, with prior

    \alpha\sim\text{Normal}\left(\overline{0},\Sigma\right);

  2. compute \alpha_{K}=-\sum_{k=1}^{K-1}\alpha_{k}; then
  3. use the level k to index into the full effects vector \left(\alpha,\alpha_{K}\right).

If we are estimating the model using Hamiltonian Monte Carlo or a similar method, such as the NUTS sampler used in Stan, then there is the question of how to efficiently compute the log of the prior density for \alpha. We would like to avoid actually constructing the large matrix \Sigma. First note the following:

  1. As shown in the full write-up, \alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}

    where

    \tilde{\sigma}^{2}=\frac{K}{K-1}\sigma^{2}

  2. Since \Sigma=\sigma^{2}S, where S is a matrix that is a function only of K and not of \sigma, we have \det\Sigma=\sigma^{2(K-1)}\det S

    where \det S has no dependence on \sigma.

Then, writing \equiv for “equal except for an additive term that has no dependence on \alpha or \sigma,” we have

\begin{array}{rcl} \log\left(\text{Normal}\left(\alpha\mid\overline{0},\Sigma\right)\right) & \equiv & -\frac{1}{2}\log\det\Sigma-\frac{1}{2}\alpha^{\prime}\Sigma^{-1}\alpha\\ & \equiv & -\left(K-1\right)\log\sigma-\frac{1}{2} \tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}\\ & \equiv & \sum_{k=1}^{K-1} \log\left(\text{Normal}\left(\alpha_{k}\mid0,\tilde{\sigma}\right)\right)-\frac{1}{2}\tilde{\sigma}^{-2}\alpha_{K}^{2}. \end{array}
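
As a sanity check on this identity (a sketch, not part of the original post; SciPy is used only to evaluate the exact density), the difference between the exact log density and the simplified expression should be the same constant for any choice of \alpha and \sigma:

import numpy as np
from scipy.stats import multivariate_normal

K = 7
rng = np.random.default_rng(0)

def exact_minus_simplified(alpha, sigma):
    rho = -1.0 / (K - 1)
    Sigma = sigma**2 * (rho * np.ones((K - 1, K - 1)) + (1 - rho) * np.eye(K - 1))
    exact = multivariate_normal(mean=np.zeros(K - 1), cov=Sigma).logpdf(alpha)
    alpha_full = np.append(alpha, -alpha.sum())          # append alpha_K
    sigma_tilde2 = K * sigma**2 / (K - 1)
    simplified = -(K - 1) * np.log(sigma) - 0.5 / sigma_tilde2 * np.sum(alpha_full**2)
    return exact - simplified

d1 = exact_minus_simplified(rng.normal(size=K - 1), 0.7)
d2 = exact_minus_simplified(rng.normal(size=K - 1), 2.3)
print(np.isclose(d1, d2))   # True: the difference depends on neither alpha nor sigma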

In a Stan model the implementation would look something like this:

transformed parameters {
  ...
  vector[K] alpha1;  // full effects vector: alpha plus the implied effect for level K

  ...
  // copy the K-1 free effects, then set alpha1[K] so that the effects sum to zero
  for (k in 1:(K-1))
    alpha1[k] <- alpha[k];
  alpha1[K] <- -sum(alpha);
}
model {
  ...
  // log prior density for alpha, dropping terms that depend on neither alpha nor sigma:
  // -(K-1)*log(sigma) - dot_self(alpha1) / (2 * sigma_tilde^2),
  // where sigma_tilde^2 = K * sigma^2 / (K-1)
  increment_log_prob(
    -(K-1) * log(sigma)
    - (K-1) / (2 * K * sigma^2) * dot_self(alpha1));
}
