## Introduction

A generalized linear model for binary response data has the form

\Pr\left(y=1\mid x\right)=g^{-1}\left(x^{\prime}\beta\right)

where y is the 0/1 response variable, x is the n-vector of predictor variables, \beta is the vector of regression coefficients, and g is the link function. In the Stan modeling language this would be written as

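p = g_inv(dot_product(x, beta)); y ~ bernoulli(p);

with g_inv replaced by the name of the inverse link function (inv_logit, Phi, or inv_cloglog). The BUGS modeling language instead puts the link function on the left-hand side of the assignment, as in logit(p) <- ...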

The most common choices for the link function are

- logit: g(p)=\log\left(\frac{p}{1-p}\right);
- probit: g^{-1}(\eta)=\Phi(\eta), where \Phi is the cumulative distribution function for the standard normal distribution; and

- complementary log-log (cloglog): g(p)=\log\left(-\log\left(1-p\right)\right).

All three of these are strictly increasing, continuous functions with g(0)=-\infty and g(1)=+\infty.
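As a quick numerical illustration of these three links, here is a minimal Python sketch that uses only the standard library. The function names (`logit`, `inv_probit`, and so on) are our own choices for illustration rather than names taken from Stan or BUGS, and the code makes no attempt to guard against overflow for extreme arguments.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def logit(p):
    """g(p) = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse of logit: exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """g(p) = Phi^{-1}(p), the standard normal quantile function."""
    return _STD_NORMAL.inv_cdf(p)


def inv_probit(eta):
    """Inverse of probit: Phi(eta), the standard normal cdf."""
    return _STD_NORMAL.cdf(eta)


def cloglog(p):
    """g(p) = log(-log(1 - p))."""
    return math.log(-math.log1p(-p))


def inv_cloglog(eta):
    """Inverse of cloglog: 1 - exp(-exp(eta))."""
    return -math.expm1(-math.exp(eta))


if __name__ == "__main__":
    # Each link maps a probability to the real line and back again.
    for p in (0.05, 0.5, 0.95):
        print(p, logit(p), probit(p), cloglog(p))
```

Each g maps the open interval (0, 1) onto the whole real line, so each inverse turns any value of x^{\prime}\beta into a valid probability.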

In this note we’ll discuss when to use each of these link functions.

## Probit

The probit link function is appropriate when it makes sense to think of y as obtained by thresholding a normally distributed latent variable z:

\begin{array}{rcl} z & = & x^{\prime}\beta^{*}+\varepsilon\\ \varepsilon & \sim & \text{Normal}\left(0,\sigma\right)\\ y & = & \begin{cases} 1 & \text{if }z\geq0\\ 0 & \text{otherwise}. \end{cases} \end{array}

Defining \beta=\beta^{*}/\sigma, this yields

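\begin{array}{rcl} \Pr\left(y=1\mid x\right) & = & \Pr\left(x^{\prime}\beta^{*}+\varepsilon\geq0\right)\\ & = & \Pr\left(-\varepsilon\leq x^{\prime}\beta^{*}\right)\\ & = & \Pr\left(\varepsilon\leq x^{\prime}\beta^{*}\right)\\ & = & \Pr\left(\varepsilon/\sigma\leq x^{\prime}\beta^{*}/\sigma\right)\\ & = & \Phi\left(x^{\prime}\beta\right). \end{array}

The third equality holds because the distribution of \varepsilon is symmetric about zero, and the last because \varepsilon/\sigma has a standard normal distribution.

## Logit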

Logit is the default link function to use when you have no specific reason to choose one of the others. There is a specific technical sense in which use of logit corresponds to minimal assumptions about the relationship between y and x. Suppose that we describe the joint distribution for x and y by giving

- the marginal distribution for x, and
- the expected value of x_{i}y for each predictor variable x_{i}.

Then the maximum-entropy (most spread-out, diffuse, least concentrated) joint distribution for x and y satisfying the above description has a pdf of the form

p\left(x,y\right)=\frac{1}{Z}f(x)\exp\left(\sum_{i=1}^{n}\beta_{i}x_{i}y\right)

for some function f, coefficient vector \beta and normalizing constant Z. The conditional distribution for y is then

\begin{array}{rcl} p\left(y\mid x\right) & = & \frac{p(x,y)}{p(x,0)+p(x,1)}\\ & = & \frac{\exp\left(\left(x^{\prime}\beta\right)y\right)}{1+\exp\left(x^{\prime}\beta\right)} \end{array}

and so

\begin{array}{rcl} \Pr\left(y=1\mid x\right) & = & \frac{\exp\left(x^{\prime}\beta\right)}{1+\exp\left(x^{\prime}\beta\right)}\\ & = & \text{logit}^{-1}\left(x^{\prime}\beta\right). \end{array}

## Cloglog

The complementary log-log link function arises when

y=\begin{cases} 1 & \text{if }z > 0\\ 0 & \text{if }z=0 \end{cases}

where z is a count having a Poisson distribution:

\begin{array}{rcl} z & \sim & \text{Poisson}\left(\lambda\right)\\ \lambda & = & \exp\left(x^{\prime}\beta\right). \end{array}

To see this, let

p=\Pr\left(z > 0\mid x\right).

Then

\begin{array}{rcl} p & = & 1-\text{Poisson}\left(0\mid\lambda\right)\\ & = & 1-\exp\left(-\lambda\right)\\ & = & 1-\exp\left(-\exp\left(x^{\prime}\beta\right)\right) \end{array}

and so

\begin{array}{rcl} \text{cloglog}\left(p\right) & = & \log\left(-\log\left(1-p\right)\right)\\ & = & \log\left(-\log\left(\exp\left(-\exp\left(x^{\prime}\beta\right)\right)\right)\right)\\ & = & x^{\prime}\beta. \end{array}

## Conclusion

In summary, here is when to use each of the link functions:

- Use probit when you can think of y as obtained by thresholding a normally distributed latent variable.
- Use cloglog when y indicates whether a count is nonzero, and the count can be modeled with a Poisson distribution.
- Use logit if you have no specific reason to choose some other link function.
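The two latent-variable characterizations in this summary can be checked by simulation. The following is a rough Python sketch, not a definitive implementation: the helper names (`dot`, `simulate_probit`, `sample_poisson`, `simulate_cloglog`) and the example values of x, \beta^{*}, \sigma and \beta are all made up for illustration. Since the standard library has no Poisson sampler, the sketch draws a Poisson count by counting exponential arrivals in a unit time interval, which is one standard construction among several.

```python
import math
import random
from statistics import NormalDist


def dot(x, beta):
    """Inner product x' beta for two equal-length sequences."""
    return sum(xi * bi for xi, bi in zip(x, beta))


def simulate_probit(x, beta_star, sigma, n_draws, rng):
    """Fraction of draws with z = x' beta_star + eps >= 0, eps ~ Normal(0, sigma)."""
    mean = dot(x, beta_star)
    hits = sum(1 for _ in range(n_draws) if mean + rng.gauss(0.0, sigma) >= 0.0)
    return hits / n_draws


def sample_poisson(lam, rng):
    """Draw a Poisson(lam) count as the number of rate-lam arrivals in [0, 1]."""
    count = 0
    t = rng.expovariate(lam)  # time of the first arrival
    while t <= 1.0:
        count += 1
        t += rng.expovariate(lam)  # waiting time until the next arrival
    return count


def simulate_cloglog(x, beta, n_draws, rng):
    """Fraction of draws with z > 0, where z ~ Poisson(exp(x' beta))."""
    lam = math.exp(dot(x, beta))
    hits = sum(1 for _ in range(n_draws) if sample_poisson(lam, rng) > 0)
    return hits / n_draws


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 0.5]            # hypothetical predictor vector
    beta_star = [0.4, -1.0]   # hypothetical unscaled probit coefficients
    sigma = 2.0               # hypothetical noise standard deviation
    beta = [0.2, -0.3]        # hypothetical cloglog coefficients

    print("probit, simulated: ", simulate_probit(x, beta_star, sigma, 100_000, rng))
    print("probit, formula:   ", NormalDist().cdf(dot(x, beta_star) / sigma))
    print("cloglog, simulated:", simulate_cloglog(x, beta, 100_000, rng))
    print("cloglog, formula:  ", 1.0 - math.exp(-math.exp(dot(x, beta))))
```

With enough draws, the simulated frequencies should agree with \Phi\left(x^{\prime}\beta^{*}/\sigma\right) and 1-\exp\left(-\exp\left(x^{\prime}\beta\right)\right) up to Monte Carlo error.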

Allen Downey says

This is very good, so thanks! You might also mention one nice property of logit as a link function: it makes the parameters interpretable in terms of log odds ratios.