A generalized linear model for binary response data has the form
$$\Pr(y = 1 \mid x) = g^{-1}(x^\top \beta),$$
where $y$ is the 0/1 response variable, $x$ is the $k$-vector of predictor variables, $\beta$ is the vector of regression coefficients, and $g$ is the link function. In the Stan modeling language this would be written as
p = g_inv(dot_product(x, beta)); y ~ bernoulli(p);
with g_inv replaced by the name of the inverse link function (in Stan, inv_logit, Phi, or inv_cloglog), and similarly for the BUGS modeling language.
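The most common choices for the link function are
- logit: $\operatorname{logit}(p) = \log\dfrac{p}{1 - p}$;
- probit: $\operatorname{probit}(p) = \Phi^{-1}(p)$, where $\Phi$ is the cumulative distribution function for the standard normal distribution; and
- complementary log-log (cloglog): $\operatorname{cloglog}(p) = \log(-\log(1 - p))$.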
All three of these are strictly increasing, continuous functions with $g(0) = -\infty$ and $g(1) = +\infty$.
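As a rough illustration, the three links and their inverses can be sketched in Python using only the standard library. The function names and the `LINKS` table below are hypothetical choices for this sketch, not part of Stan or BUGS; `statistics.NormalDist` supplies the standard normal cdf and its inverse.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def logit(p):
    """Log-odds of p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Standard normal quantile of p, for 0 < p < 1."""
    return _STD_NORMAL.inv_cdf(p)


def cloglog(p):
    """Complementary log-log of p, for 0 < p < 1."""
    return math.log(-math.log(1.0 - p))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)


def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))


# Hypothetical lookup table: link name -> (link, inverse link).
LINKS = {
    "logit": (logit, inv_logit),
    "probit": (probit, inv_probit),
    "cloglog": (cloglog, inv_cloglog),
}

if __name__ == "__main__":
    # Each link maps probabilities near 0 to large negative values and
    # probabilities near 1 to large positive values.
    for name, (g, g_inv) in LINKS.items():
        print(name, [round(g(p), 3) for p in (0.01, 0.25, 0.5, 0.75, 0.99)])
```

Logit and probit are symmetric about $p = 0.5$ (both map it to 0), while cloglog is not, which is one practical way the three differ.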
In this note we’ll discuss when to use each of these link functions.
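The probit link function is appropriate when it makes sense to think of $y$ as obtained by thresholding a normally distributed latent variable $z$ with mean $x^\top \gamma$ and standard deviation $\sigma$:
$$z \sim \mathrm{Normal}(x^\top \gamma, \sigma), \qquad y = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Defining $\beta = \gamma / \sigma$, this yields
$$\Pr(y = 1 \mid x) = \Pr(z > 0 \mid x) = \Phi\!\left(\frac{x^\top \gamma}{\sigma}\right) = \Phi(x^\top \beta) = \operatorname{probit}^{-1}(x^\top \beta).$$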
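The thresholding argument can be checked numerically with a small Monte Carlo sketch. It assumes the latent variable is normal with mean equal to the dot product of `x` and a coefficient vector, here called `gamma`, and standard deviation `sigma`, and that the response is 1 when the latent variable exceeds zero. The names `gamma`, `sigma`, and both function names are hypothetical, chosen only for this illustration.

```python
import random
from statistics import NormalDist


def simulate_threshold_rate(x, gamma, sigma, n_draws=200_000, seed=0):
    """Fraction of latent normal draws that exceed the threshold 0."""
    rng = random.Random(seed)
    mean = sum(xi * gi for xi, gi in zip(x, gamma))
    hits = sum(1 for _ in range(n_draws) if rng.gauss(mean, sigma) > 0.0)
    return hits / n_draws


def probit_probability(x, gamma, sigma):
    """Probability of y = 1 under a probit model with beta = gamma / sigma."""
    beta = [gi / sigma for gi in gamma]
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return NormalDist().cdf(eta)


if __name__ == "__main__":
    x = [1.0, 0.5]        # first entry acts as an intercept term
    gamma = [0.4, -1.2]   # assumed latent-scale coefficients
    sigma = 2.0           # assumed latent standard deviation
    print("simulated:", simulate_threshold_rate(x, gamma, sigma))
    print("probit:   ", probit_probability(x, gamma, sigma))
```

The two numbers should agree up to simulation noise. Only the ratio of the coefficients to the standard deviation matters, which is why the latent scale cannot be identified from binary data alone.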
Logit is the default link function to use when you have no specific reason to choose one of the others. There is a specific technical sense in which use of logit corresponds to minimal assumptions about the relationship between $y$ and $x$. Suppose that we describe the joint distribution for $x$ and $y$ by giving
- the marginal distribution for $x$, and
- the expected value of $x_i y$ for each predictor variable $x_i$.
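Then the maximum-entropy (most spread-out, diffuse, least concentrated) joint distribution for $x$ and $y$ satisfying the above description has a pdf of the form
$$p(x, y) = \frac{1}{Z} \exp\left(f(x) + y \, x^\top \beta\right)$$
for some function $f$, coefficient vector $\beta$, and normalizing constant $Z$. The conditional distribution for $y$ is then
$$\Pr(y = 1 \mid x) = \frac{p(x, 1)}{p(x, 0) + p(x, 1)} = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)} = \operatorname{logit}^{-1}(x^\top \beta).$$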
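The step from the joint distribution to the conditional one can be sketched numerically. The code below assumes the joint density is proportional to $\exp(f(x) + y \, x^\top \beta)$ and checks that the conditional probability of $y = 1$ does not depend on the choice of $f$. All function names here are hypothetical, and the normalizing constant is omitted because it cancels in the ratio.

```python
import math


def joint_density_unnormalized(x, y, beta, f):
    """exp(f(x) + y * x.beta), the assumed joint density up to 1/Z."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(f(x) + y * eta)


def conditional_prob_y1(x, beta, f):
    """Pr(y = 1 | x), obtained by normalizing over y in {0, 1}."""
    p1 = joint_density_unnormalized(x, 1, beta, f)
    p0 = joint_density_unnormalized(x, 0, beta, f)
    return p1 / (p0 + p1)


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


if __name__ == "__main__":
    x = [1.0, -2.0, 0.5]
    beta = [0.3, 0.7, -1.1]
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    # Two arbitrary choices of f give the same conditional probability.
    print(conditional_prob_y1(x, beta, lambda v: 0.0))
    print(conditional_prob_y1(x, beta, lambda v: -0.5 * sum(t * t for t in v)))
    print(inv_logit(eta))
```

Because $f(x)$ appears in both the $y = 0$ and $y = 1$ terms, it cancels, leaving the inverse logit of the linear predictor.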
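The complementary log-log link function arises when
$$y = \begin{cases} 1 & \text{if } n > 0 \\ 0 & \text{if } n = 0, \end{cases}$$
where $n$ is a count having a Poisson distribution:
$$n \sim \mathrm{Poisson}(\lambda), \qquad \log \lambda = x^\top \beta.$$
To see this, let $p = \Pr(y = 1 \mid x)$. Then $1 - p = \Pr(n = 0 \mid x) = e^{-\lambda}$, so
$$\operatorname{cloglog}(p) = \log(-\log(1 - p)) = \log \lambda = x^\top \beta.$$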
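A short sketch can confirm the connection between a Poisson count and the cloglog link. It assumes the count has mean $\lambda = \exp(\eta)$, where $\eta$ stands for the linear predictor $x^\top \beta$, and that the response is 1 exactly when the count is nonzero. The function names are hypothetical, and the sampler uses Knuth's multiplication method, which is adequate for the small means used here.

```python
import math
import random


def cloglog(p):
    return math.log(-math.log(1.0 - p))


def prob_nonzero_count(eta):
    """Pr(count > 0) for a Poisson count with mean exp(eta)."""
    lam = math.exp(eta)
    return 1.0 - math.exp(-lam)


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's method (small lam only)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_nonzero_rate(eta, n_draws=100_000, seed=0):
    """Fraction of simulated Poisson counts that are nonzero."""
    rng = random.Random(seed)
    lam = math.exp(eta)
    nonzero = sum(1 for _ in range(n_draws) if sample_poisson(lam, rng) > 0)
    return nonzero / n_draws


if __name__ == "__main__":
    eta = -0.5  # assumed value of the linear predictor
    p = prob_nonzero_count(eta)
    print("cloglog(p):", cloglog(p), "eta:", eta)
    print("simulated:", simulate_nonzero_rate(eta), "exact:", p)
```

Applying cloglog to the probability of a nonzero count recovers the linear predictor, up to floating-point error.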
In summary, here is when to use each of the link functions:
- Use probit when you can think of $y$ as obtained by thresholding a normally distributed latent variable.
- Use cloglog when $y$ indicates whether a count is nonzero, and the count can be modeled with a Poisson distribution.
- Use logit if you have no specific reason to choose some other link function.