(The full write-up is here.)

In a previous note I proposed a prior on the regression coefficients for a nominal input variable that leads to a symmetric prior on the effects:

$$\beta \sim \mathrm{Normal}\!\left(0,\; \sigma^2 S\right), \qquad S = \frac{K I - J}{K - 1},$$

where $\beta$ is the vector of regression coefficients, $K$ is the number of levels, and $I$ and $J$ are square matrices with $K-1$ rows/columns ($I$ the identity matrix and $J$ the all-ones matrix); row $k$ of $I$ is the encoding for level $k$, and level $K$ is encoded as $-(1,\dots,1)$. This gives an induced prior on the effects $\alpha_1,\dots,\alpha_K$ that is symmetric in the effects, has a mean of 0 and variance of $\sigma^2$ for each $\alpha_k$, and is degenerate, in that the sum of the effects is 0.
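As a quick numerical check (my own sketch, not part of the original note; the variable names are mine), the following NumPy snippet builds $S$ and the encoding matrix and confirms the claimed properties of the induced prior on the effects:

```python
import numpy as np

# Prior covariance of the regression coefficients is sigma^2 * S,
# with S = (K*I - J)/(K-1) of size (K-1) x (K-1).
K, sigma = 5, 2.0
I = np.eye(K - 1)
J = np.ones((K - 1, K - 1))
S = (K * I - J) / (K - 1)

# Encoding matrix: row k is the encoding for level k (rows of I for
# levels 1..K-1, and -(1,...,1) for level K).
X = np.vstack([I, -np.ones(K - 1)])      # shape (K, K-1)

# Covariance of the effects alpha = X beta.
cov_effects = sigma**2 * X @ S @ X.T

print(np.diag(cov_effects))              # each variance equals sigma^2
print(cov_effects.sum(axis=1))           # rows sum to 0: effects sum to 0
```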

It may be desirable to avoid actually constructing the matrix $S$, either for uniformity in the implementation of a Bayesian regression model (e.g., independent normal priors for each regression coefficient over the entire set of predictor variables) or because $K$ is large. In this note I discuss two ways of achieving this goal:

- choose an encoding of the levels that turns independent normal priors on the regression coefficients into the desired symmetric prior on the effects; or
- directly evaluate the probability density without constructing the matrix $S$.

## Low to moderate number of levels

When $K$ is small or moderate, we may achieve independent normal distributions for each of the regression coefficients by careful choice of encoding. If the encoding matrix $B$, whose row $k$ is the encoding for level $k$ ($1 \le k \le K-1$), is chosen such that $B B^\top = S$, then the prior

$$\beta_k \sim \mathrm{Normal}(0, \sigma^2), \qquad 1 \le k \le K-1,$$

gives us the desired symmetric prior on the effects.

To find such an encoding we begin with an eigendecomposition of $S$: eigenvalues $\lambda_k$ and corresponding orthonormal eigenvectors $u_k$ of $S$. Then

$$S = U \Lambda U^\top,$$

where $U$ is the matrix obtained by stacking the eigenvectors side by side, and $\Lambda$ is the diagonal matrix whose $k$-th diagonal element is $\lambda_k$:

$$U = \begin{pmatrix} u_1 & \cdots & u_{K-1} \end{pmatrix}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_{K-1}).$$
In the full write-up I show that choosing

$$B = U \Lambda^{1/2}$$

yields $B B^\top = S$, as desired.
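A short NumPy sketch of this construction (variable names are my own): eigendecompose $S$, form $B = U\Lambda^{1/2}$, and check that $BB^\top = S$.

```python
import numpy as np

# S = (K*I - J)/(K-1), the target covariance (up to sigma^2) of the
# regression coefficients under the original encoding.
K = 7
S = (K * np.eye(K - 1) - np.ones((K - 1, K - 1))) / (K - 1)

lam, U = np.linalg.eigh(S)       # eigenvalues, orthonormal eigenvectors
B = U * np.sqrt(lam)             # multiplies column k of U by sqrt(lambda_k)

# With this encoding, independent Normal(0, sigma^2) coefficients
# induce the desired symmetric prior on the effects.
print(np.allclose(B @ B.T, S))   # True
```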

The orthonormal eigenvectors and corresponding eigenvalues of $S$ are

$$u_1 = \frac{1}{\sqrt{K-1}}\,(1, \dots, 1)^\top, \qquad \lambda_1 = \frac{1}{K-1},$$

and, for $2 \le k \le K-1$,

$$u_k = \frac{1}{\sqrt{k(k-1)}}\,\bigl(\underbrace{1, \dots, 1}_{k-1},\; -(k-1),\; 0, \dots, 0\bigr)^\top, \qquad \lambda_k = \frac{K}{K-1}.$$

Using $B = U \Lambda^{1/2}$, which multiplies column $k$ of $U$ by $\sqrt{\lambda_k}$, we have

$$B = \begin{pmatrix} \sqrt{\lambda_1}\, u_1 & \cdots & \sqrt{\lambda_{K-1}}\, u_{K-1} \end{pmatrix};$$

in particular,

$$b_{j1} = \frac{1}{K-1} \quad \text{and} \quad b_{jk} = \sqrt{\frac{K}{K-1}}\, u_{jk} \quad \text{for } 2 \le k \le K-1,$$

where $b_{jk}$ is element $k$ of the encoding for level $j$. This gives the required encodings for the first $K-1$ levels. The encoding for level $K$ is the negative sum of the encodings for levels 1 to $K-1$, which works out to

$$b_K = (-1, 0, \dots, 0)^\top.$$
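These closed forms can be checked numerically. The sketch below builds $B$ column by column from a Helmert-style eigenbasis of $S$ (this particular basis and the variable names are my own choices) and confirms both $BB^\top = S$ and the level-$K$ encoding:

```python
import numpy as np

K = 6
B = np.zeros((K - 1, K - 1))
B[:, 0] = 1.0 / (K - 1)              # column 1: sqrt(lambda_1) * u_1
for k in range(2, K):                # columns 2..K-1 (1-based index k)
    u = np.zeros(K - 1)
    u[:k - 1] = 1.0                  # k-1 ones ...
    u[k - 1] = -(k - 1)              # ... then -(k-1), then zeros
    u /= np.sqrt(k * (k - 1))        # normalize: Helmert-style eigenvector
    B[:, k - 1] = np.sqrt(K / (K - 1)) * u

S = (K * np.eye(K - 1) - np.ones((K - 1, K - 1))) / (K - 1)
print(np.allclose(B @ B.T, S))       # True
print(-B.sum(axis=0))                # level-K encoding: (-1, 0, ..., 0)
```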
As a check, the covariance matrix for the full effects vector

$$\tilde{\alpha} = (\alpha_1, \dots, \alpha_K)^\top$$

is $\sigma^2 \tilde{B} \tilde{B}^\top$, where $\tilde{B}$ is obtained by appending $b_K$ to $B$ as an additional row. Using Mathematica, I have verified that

$$\tilde{B} \tilde{B}^\top = \frac{K I_K - J_K}{K-1}$$

for all $K$ from 3 to 100, as expected, where $I_K$ and $J_K$ are the identity and all-ones matrices with $K$ rows/columns.
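The same check can be reproduced in NumPy (a sketch under my own naming; the original verification used Mathematica):

```python
import numpy as np

# Rebuild the full encoding matrix for a given K: eigendecompose S,
# form B = U * Lambda^{1/2}, and append the level-K row (the negative
# sum of the other rows).
def full_encoding(K):
    S = (K * np.eye(K - 1) - np.ones((K - 1, K - 1))) / (K - 1)
    lam, U = np.linalg.eigh(S)
    B = U * np.sqrt(lam)
    return np.vstack([B, -B.sum(axis=0)])

# Check that the full covariance (up to sigma^2) is (K*I_K - J_K)/(K-1).
ok = all(
    np.allclose(full_encoding(K) @ full_encoding(K).T,
                (K * np.eye(K) - np.ones((K, K))) / (K - 1))
    for K in range(3, 101)
)
print(ok)   # True
```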

## Large number of levels

If $K$ is large, as occurs in a multilevel model with a prior on $\sigma$, the above approach may be inefficient. Rather than doing a dot product of a level encoding with a vector of regression coefficients, it may be preferable to instead

- directly create a vector of effects $\alpha_k$, $1 \le k \le K-1$, with prior $\alpha \sim \mathrm{Normal}(0, \sigma^2 S)$;
- compute $\alpha_K = -\sum_{k=1}^{K-1} \alpha_k$; then
- use the level $k$ to index into the full effects vector $\tilde{\alpha} = (\alpha_1, \dots, \alpha_K)^\top$.
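The steps above can be sketched in NumPy as follows (sampling from the prior; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma = 6, 1.5
S = (K * np.eye(K - 1) - np.ones((K - 1, K - 1))) / (K - 1)

# Step 1: draw the first K-1 effects from their joint prior.
alpha = rng.multivariate_normal(np.zeros(K - 1), sigma**2 * S)

# Step 2: complete the vector with the negative sum.
alpha_full = np.append(alpha, -alpha.sum())

# Step 3: look effects up by level index instead of by dot product.
level = 4                          # 1-based level index
effect = alpha_full[level - 1]

print(alpha_full.sum())            # ~0: the sum-to-zero constraint holds
```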

If we are estimating the model using Hamiltonian Monte Carlo or similar methods, such as the NUTS sampler used in Stan, then there is the question of efficiently computing the log of the prior density for $\alpha$. We would like to avoid actually constructing the large matrix $S$ or its inverse. First note the following:

- As shown in the full write-up,
  $$\alpha^\top S^{-1} \alpha = \frac{K-1}{K}\, \tilde{\alpha}^\top \tilde{\alpha},$$
  where
  $$S^{-1} = \frac{K-1}{K}\,(I + J).$$
- Since the prior covariance matrix is $\sigma^2 S$, where $S$ is a matrix that is a function only of $K$ and not of $\sigma$, we have
  $$\log\left|\sigma^2 S\right| = 2(K-1)\log\sigma + c,$$
  where $c$ has no dependence on $\sigma$.

Then, writing $\doteq$ for "equal except for an additive term that has no dependence on $\alpha$ or $\sigma$," we have

$$\log p(\alpha \mid \sigma) \doteq -(K-1)\log\sigma \;-\; \frac{K-1}{2K\sigma^2}\, \tilde{\alpha}^\top \tilde{\alpha}.$$
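As a sanity check (my own sketch), the shortcut can be compared against the exact log density of $\alpha$ under $\mathrm{Normal}(0, \sigma^2 S)$; the two should differ only by a term that is constant in both $\alpha$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
S = (K * np.eye(K - 1) - np.ones((K - 1, K - 1))) / (K - 1)
S_inv = np.linalg.inv(S)
log_det_S = np.linalg.slogdet(S)[1]

def exact(alpha, sigma):
    # Full multivariate normal log density of alpha, covariance sigma^2 * S.
    quad = alpha @ S_inv @ alpha / sigma**2
    return (-(K - 1) / 2 * np.log(2 * np.pi)
            - 0.5 * (2 * (K - 1) * np.log(sigma) + log_det_S)
            - 0.5 * quad)

def shortcut(alpha, sigma):
    # -(K-1)*log(sigma) - (K-1)/(2*K*sigma^2) * ||alpha_full||^2
    alpha_full = np.append(alpha, -alpha.sum())
    return (-(K - 1) * np.log(sigma)
            - (K - 1) / (2 * K * sigma**2) * alpha_full @ alpha_full)

diffs = [exact(a, s) - shortcut(a, s)
         for a in rng.normal(size=(4, K - 1)) for s in (0.5, 1.0, 3.0)]
print(np.allclose(diffs, diffs[0]))   # True: difference is a constant
```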
In a Stan model the implementation would look something like this:

```stan
transformed parameters {
  ...
  vector[K] alpha1;
  ...
  for (k in 1:(K-1))
    alpha1[k] <- alpha[k];
  alpha1[K] <- -sum(alpha);
}
model {
  ...
  increment_log_prob(
    -(K-1) * log(sigma)
    - (K-1) / (2 * K * sigma^2) * dot_self(alpha1));
}
```