
March 21, 2016 by kevin@ksvanhorn.com

It’s All About Jensen’s Inequality

A recent paper proves something that runners have long suspected: GPS overestimates the distance you have traveled. This isn’t due to any algorithmic error; it is instead an unavoidable consequence of two facts:

The position measurements that GPS makes are noisy — there is some degree of random error to them.
The distance between two points is a convex function of the coordinates of the points.

A convex function is one that curves upwards. Here are some examples:

For a function of one argument (such as the above examples), convexity means that the function has a nonnegative second derivative wherever that derivative exists. A convex function $f$ of several arguments $x_1, \ldots, x_n$ never curves downward no matter what direction you follow; that is, the directional second derivative is nonnegative no matter what direction you choose.

Jensen’s Inequality states that

if $f$ is a convex function
and $x$ is a (possibly vector-valued) random variable

then

E[f(x)] > f(E[x]).

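(Strictly speaking, the inequality is $\geq$ . For a strictly convex $f$ , equality holds only if the probability distribution for $x$ is concentrated at a single point; for the distance function below, equality would require the measured displacement to always point in the same direction, which noisy measurements do not do.)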

In this case, $x$ is the vector $(x_1,y_1,x_2,y_2)$ , where $(x_1,y_1)$ are the measured GPS coordinates for the starting point and $(x_2,y_2)$ are the measured GPS coordinates for the ending point, and

f(x) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

is the calculated distance between the two points. It is straightforward to show that this distance function $f(x)$ is convex.
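One way to see the convexity, sketched here under the notation that $d(x)$ denotes the displacement vector (a symbol introduced only for this argument): the displacement is linear in $x$ , and the Euclidean norm satisfies the triangle inequality.

```latex
% d(x) = (x_1 - x_2,\; y_1 - y_2) is linear in x, and f(x) = \|d(x)\|.
% For any two measurement vectors u, v and any 0 \le t \le 1:
\begin{array}{rcl}
f\left(tu+(1-t)v\right) & = & \left\Vert t\,d(u)+(1-t)\,d(v)\right\Vert \\
 & \leq & t\left\Vert d(u)\right\Vert +(1-t)\left\Vert d(v)\right\Vert \\
 & = & t\,f(u)+(1-t)\,f(v).
\end{array}
```

The last line is exactly the definition of convexity for $f$ .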

Note that $(x_1,y_1)$ and $(x_2,y_2)$ are noisy measurements, not the actual (imperfectly known) coordinates. If we assume that the GPS measurements, although noisy, are at least unbiased, then

E\left[\left(x_i,y_i\right)\right] = \left(x^*_i, y^*_i\right)

where $(x^*_1,y^*_1)$ and $(x^*_2,y^*_2)$ are the actual coordinates. The calculated distance is $f(x)$ , the actual distance is $f(x^*)$ , and Jensen’s inequality guarantees that

E[f(x)] > f(x^*).
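The effect is easy to check numerically. The following is a minimal simulation sketch, not anything taken from the paper: it assumes independent zero-mean Gaussian noise on every coordinate, and the function name, the coordinates, and the noise level are all made up for illustration.

```python
import math
import random

def mean_measured_distance(true_start, true_end, noise_sd, n_trials=50_000, seed=0):
    """Average distance computed from noisy but unbiased position fixes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Assumption: independent zero-mean Gaussian error on each coordinate.
        x1 = true_start[0] + rng.gauss(0.0, noise_sd)
        y1 = true_start[1] + rng.gauss(0.0, noise_sd)
        x2 = true_end[0] + rng.gauss(0.0, noise_sd)
        y2 = true_end[1] + rng.gauss(0.0, noise_sd)
        total += math.hypot(x1 - x2, y1 - y2)   # f(x) for this pair of fixes
    return total / n_trials

# Illustrative values: true points 10 units apart, noise sd of 3 units.
true_distance = math.hypot(10.0 - 0.0, 0.0 - 0.0)            # f(x*)
avg_measured = mean_measured_distance((0.0, 0.0), (10.0, 0.0), noise_sd=3.0)
print(true_distance, avg_measured)
```

The second printed number estimates $E[f(x)]$ and should come out larger than the first, which is $f(x^*)$ .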

March 17, 2016 by kevin@ksvanhorn.com

Analysis of a Nootropics Survey

The Problem

Scott Alexander’s blog Slate Star Codex recently carried the results of a survey of over 850 users of nootropics (cognitive enhancers) such as caffeine, Adderall, and Modafinil. The survey asked respondents to subjectively rate each substance on a scale of 0 to 10, with 0 meaning useless, 1-4 meaning subtle effects, 5-9 meaning strong effects, and 10 meaning life-changing.

There are several difficulties in analyzing this kind of data:

1. Discretization. Actual effects vary over a continuum, but respondents are asked to choose from a discrete set of choices.
2. Heterogeneous scale usage. People vary in how they use the scale. Some may spread their responses out more than others. Some may tend to give higher answers than others for the same underlying effect. There may be nonlinearities in how people use the scale.
3. Meaning. Just what do these ratings mean? How do they translate into specific effects on a person’s mind?

In this note I tackle problems (1) and (2), which are purely technical problems; for (3) I have no answer. On (2) I restrict my attention to bias and scaling of responses.

The analytic approach used here is a simplified version of the methodology described in this paper.

One final caveat: the survey subjects are a self-selected sample, and hence may not be representative of the general populace. One way of dealing with that issue would be to regress the nootropic effects on various subject characteristics that might be predictive of nootropic effect. I did not do this, although the survey includes questions that could be used for this purpose.

Summary of Results

I did a Bayesian analysis that used a hierarchical prior for the nootropic effects and accounted for scale usage heterogeneity and discretization of responses. The first figure shows estimates for the population mean effect ( $\alpha+\beta_{k}$ from the model described below) for each nootropic. The black point is the posterior median, the red line is the 80% posterior credible interval, and the black line is the 90% interval.

Figure: Posterior credible intervals for nootropic effects.

The picture is considerably murkier if you look at the posterior predictive distribution for each nootropic. The effect for an individual is $\alpha+\beta_{k}+\varepsilon_{k}$ , where $\varepsilon_{k}$ is an individual deviation from the population mean, with nootropic-specific variance $\sigma_{k}^{2}$ . The next figure shows credible intervals for this individual effect, for each nootropic. These individual effects are quite uncertain: the values $\sigma_{k}$ vary from around 1.9 to around 2.8.

The Data

The data are provided as a table with one row per subject (survey respondent), and one column per nootropic. There were 36 nootropics mentioned in the survey, but subjects only gave ratings to those they had actually used. The first step is to reshape the data from this “wide” format into a “long” format with columns for subject, nootropic, and response, with each case being some subject’s experience with some nootropic. Then

$i$ indexes cases;
$\mathrm{subj}_{i}$ is the subject for case $i$ ;
$\mathrm{noo}_{i}$ is the nootropic for case $i$ ;
$r_{i}$ is the rating subject $\mathrm{subj}_{i}$ gave for nootropic $\mathrm{noo}_{i}$ .
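The reshaping step can be sketched as follows. This is an illustration in Python rather than the R code actually used, and the function name and the toy data are hypothetical; unrated nootropics are assumed to be marked with `None`.

```python
def wide_to_long(wide_rows):
    """Turn one-row-per-subject data into (subject, nootropic, rating) cases.

    wide_rows[s][n] is subject s's rating of nootropic n, or None if the
    subject never used that nootropic.
    """
    cases = []
    for subj, row in enumerate(wide_rows):
        for noo, rating in enumerate(row):
            if rating is not None:          # keep only nootropics actually rated
                cases.append((subj, noo, rating))
    return cases

# Toy data (made up): 2 subjects, 3 nootropics.
wide = [
    [7, None, 3],     # subject 0 rated nootropics 0 and 2
    [None, 5, None],  # subject 1 rated nootropic 1 only
]
cases = wide_to_long(wide)
subj = [c[0] for c in cases]   # subj_i
noo = [c[1] for c in cases]    # noo_i
r = [c[2] for c in cases]      # r_i
print(cases)
```

Each tuple in `cases` is one case $i$ , so the three lists line up index by index.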

Discretization

Take each rating $r_{i}$ to be the binned version of a continuous latent variable $y_{i}$ . For example, a rating of $r_{i}=5$ means that $4.5\leq y_{i}\leq5.5$ . Similarly, a rating of $r_{i}=0$ means $y_{i}\leq0.5$ , and a rating of $r_{i}=10$ means $y_{i}\geq9.5$ .

This approach uses a fixed, equally-spaced set of breakpoints $\theta_{i}=i+0.5$ , $0\leq i < 10$ ; a further refinement, which I did not explore, would be to infer the breakpoints themselves, perhaps restricting them to some parametric form such as a quadratic in $i$ .
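As a small sketch of the binning with fixed breakpoints at 0.5, 1.5, ..., 9.5 (the function name is hypothetical, and a value landing exactly on a breakpoint is assigned to the upper bin, a choice that has probability zero under a continuous distribution):

```python
import math

def rating_from_latent(y):
    """Bin a continuous latent rating y into the 0..10 survey scale."""
    # Round to the nearest integer, then clamp: anything below 0.5 becomes 0
    # and anything at or above 9.5 becomes 10.
    return min(10, max(0, math.floor(y + 0.5)))

print([rating_from_latent(y) for y in (-2.0, 0.2, 4.7, 9.6, 12.0)])
```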

Scale Usage

One can expect that survey respondents will vary in how they translate their response to a nootropic into a continuous rating $y_{i}$ . Assume that an underlying continuous response $z_{i}$ gets translated into a continuous rating $y_{i}$ via an individual bias term and scaling factor:

\begin{array}{rcl} y_{i} & = & \mathrm{bias}_{j}+\mathrm{scale}_{j}\cdot z_{i}\\ j & = & \mathrm{subj}_{i}. \end{array}

Hierarchical priors for the scale-usage parameters are appropriate:

\begin{array}{rcl} \mathrm{bias}_{j} & \sim & \mathrm{Normal}\left(\alpha,\sigma_{\mathrm{bias}}\right)\\ \sigma_{\mathrm{bias}} & \sim & \mathrm{HalfNormal}\left(7.0\right)\\ \log\left(\mathrm{scale}_{j}\right) & \sim & \mathrm{Normal}\left(0,\sigma_{\mathrm{scale}}\right)\\ \sigma_{\mathrm{scale}} & \sim & \mathrm{HalfNormal}\left(2.0\right)\\ \alpha & \sim & \mathrm{Normal}\left(5.0,5.0\right) \end{array}

It probably would have made sense to use a bivariate normal prior on $\mathrm{bias}_{j}$ and $\log\left(\mathrm{scale}_{j}\right)$ to allow for correlations between them in the population, but I did not explore this option.

The priors on $\alpha$ , $\sigma_{\mathrm{bias}}$ and $\sigma_{\mathrm{scale}}$ are weakly informative, based on the 0 to 10 scale used:

Values of $\alpha$ outside the range 0 to 10 are implausible.
A value of $\sigma_{\mathrm{bias}} > 7$ is implausible, as it allows quite extreme values for $\mathrm{bias}_{j}$ to be common.
A value of $\sigma_{\mathrm{scale}} > 2$ is implausible, as it means that it would be common to see a factor of $\exp\left(2\sqrt{2}\right)\approx17$ difference in the scaling used by two different subjects.
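The $\sqrt{2}$ in the last item comes from comparing two independent subjects, here labeled $a$ and $b$ purely for illustration: the difference of their log scales has a standard deviation $\sqrt{2}$ times larger than either one.

```latex
\begin{array}{rcl}
\log\left(\mathrm{scale}_{a}\right)-\log\left(\mathrm{scale}_{b}\right) & \sim & \mathrm{Normal}\left(0,\,\sigma_{\mathrm{scale}}\sqrt{2}\right)\\
\sigma_{\mathrm{scale}}=2 & \Rightarrow & \text{a one-standard-deviation difference is }2\sqrt{2}\approx2.83\\
\mathrm{scale}_{a}/\mathrm{scale}_{b} & = & \exp\left(2\sqrt{2}\right)\approx16.9
\end{array}
```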

Nootropic effects

The effectiveness of a nootropic will vary over the population; letting $\beta_{k}$ be the mean effect of nootropic $k$ , and $\sigma_{k}^{2}$ its variance over the population, we have

\begin{array}{rcl} z_{i} & \sim & \mathrm{Normal}\left(\beta_{k},\sigma_{k}\right)\\ k & = & \mathrm{noo}_{i}. \end{array}

Since there are 36 different nootropics in the study, I used hierarchical priors for the mean effects and variances:

\begin{array}{rcl} \beta_{k} & \sim & \mathrm{Normal}\left(0,\sigma_{\mathrm{noo}}\right)\\ \sigma_{\mathrm{noo}} & \sim & \mathrm{HalfNormal}\left(7.0\right)\\ \log\left(\sigma_{k}\right) & \sim & \mathrm{Normal}\left(\mu_{\mathrm{lse}},\sigma_{\mathrm{lse}}\right)\\ \mu_{\mathrm{lse}} & \sim & \mathrm{Normal}\left(0,2\right)\\ \sigma_{\mathrm{lse}} & \sim & \mathrm{HalfNormal}\left(2.0\right) \end{array}

The priors for $\sigma_{\mathrm{noo}}$ , $\mu_{\mathrm{lse}}$ , and $\sigma_{\mathrm{lse}}$ are again intended to be weakly informative—given the 10-point scale, values of $\sigma_{\mathrm{noo}}$ larger than 7, values of $\mu_{\mathrm{lse}}$ larger than 2 $\left(\approx\log7.4\right)$ , and values of $\sigma_{\mathrm{lse}}$ larger than 2 all seem implausibly extreme.

Likelihood

The normal distribution for $z_{i}$ induces a normal distribution for $y_{i}$ , conditional on the other model variables:

\begin{array}{rcl} y_{i} & \sim & \mathrm{Normal}\left(\mu_{i},\, s_{i}\right)\\ \mu_{i} & = & \mathrm{bias}_{j}+\mathrm{scale}_{j}\cdot\beta_{k}\\ s_{i} & = & \mathrm{scale}_{j}\cdot\sigma_{k}\\ j & = & \mathrm{subj}_{i}\\ k & = & \mathrm{noo}_{i}. \end{array}
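A quick simulation can confirm the induced mean and standard deviation. This is only a sketch: the function name is hypothetical and the parameter values are arbitrary, chosen for illustration rather than taken from the fitted model.

```python
import random
import statistics

def simulate_continuous_ratings(bias, scale, beta, sigma, n=50_000, seed=3):
    """Draw z ~ Normal(beta, sigma) and map it to y = bias + scale * z."""
    rng = random.Random(seed)
    return [bias + scale * rng.gauss(beta, sigma) for _ in range(n)]

# Arbitrary illustrative values for one subject j and one nootropic k.
bias_j, scale_j, beta_k, sigma_k = 4.0, 1.5, 0.8, 2.0
ys = simulate_continuous_ratings(bias_j, scale_j, beta_k, sigma_k)

mu_i = bias_j + scale_j * beta_k   # predicted mean of y_i
s_i = scale_j * sigma_k            # predicted standard deviation of y_i
print(statistics.fmean(ys), mu_i)
print(statistics.stdev(ys), s_i)
```

Each printed pair should agree to within simulation error.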

The likelihood for case $i$ in the data set is then given by

\begin{array}{rcl} \Pr\left(r_{i}=0\right) & = & \Phi\left(\frac{0.5-\mu_{i}}{s_{i}}\right)\\ \Pr\left(r_{i}=m\right) & = & \Phi\left(\frac{m+0.5-\mu_{i}}{s_{i}}\right)-\Phi\left(\frac{m-0.5-\mu_{i}}{s_{i}}\right)\\ & & \text{for }0 < m < 10\\ \Pr\left(r_{i}=10\right) & = & 1-\Phi\left(\frac{9.5-\mu_{i}}{s_{i}}\right)\\ & = & \Phi\left(\frac{\mu_{i}-9.5}{s_{i}}\right) \end{array}

where $\Phi$ is the CDF for the standard normal distribution.
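The likelihood for a single case can be sketched in a few lines. This is an illustrative Python version, not the code from the estimation scripts; the function names are hypothetical and the values of $\mu_{i}$ and $s_{i}$ in the usage line are arbitrary.

```python
import math

def std_normal_cdf(x):
    """Phi, the standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rating_probabilities(mu, s):
    """Return [Pr(r = 0), ..., Pr(r = 10)] for y ~ Normal(mu, s) binned at 0.5, ..., 9.5."""
    # CDF of y at each breakpoint m + 0.5, for m = 0, ..., 9.
    cdf_at_breaks = [std_normal_cdf((m + 0.5 - mu) / s) for m in range(10)]
    probs = [cdf_at_breaks[0]]                                   # r = 0: y below 0.5
    probs += [cdf_at_breaks[m] - cdf_at_breaks[m - 1] for m in range(1, 10)]
    probs.append(1.0 - cdf_at_breaks[9])                         # r = 10: y above 9.5
    return probs

probs = rating_probabilities(mu=6.2, s=1.5)   # arbitrary illustrative values
print([round(p, 3) for p in probs])
```

The eleven probabilities sum to one, and the log-likelihood of the data set is the sum over cases of the log of the entry matching each observed rating.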

Estimation Scripts

The R code I used to run the estimation and produce the plots is in these three scripts, which I ran one after the other: nootropics.R, nootropics2.R, and nootropics3.R.

August 14, 2015 by kevin@ksvanhorn.com

Which Link Function — Logit, Probit, or Cloglog?

(PDF)

Introduction

A generalized linear model for binary response data has the form

\Pr\left(y=1\mid x\right)=g^{-1}\left(x^{\prime}\beta\right)

where $y$ is the 0/1 response variable, $x$ is the $n$ -vector of predictor variables, $\beta$ is the vector of regression coefficients, and $g$ is the link function. In the Stan modeling language this would be written as

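  p = inv_g(dot_product(x, beta));  // inv_g stands for the inverse link function
  y ~ bernoulli(p);

with inv_g replaced by the name of an inverse link function (inv_logit, Phi, or inv_cloglog in Stan). The BUGS modeling language instead applies the link function itself on the left-hand side of the assignment, as in logit(p) <- followed by the linear predictor.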

The most common choices for the link function are

logit: $g(p)=\log\left(\frac{p}{1-p}\right);$
probit: $g^{-1}(\eta)=\Phi(\eta)$
where $\Phi$ is the cumulative distribution function for the standard normal distribution; and
complementary log-log (cloglog): $g(p)=\log\left(-\log\left(1-p\right)\right).$

All three of these are strictly increasing, continuous functions with $g(0)=-\infty$ and $g(1)=+\infty$ .
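For concreteness, here is a sketch of the three link functions and their inverses using only the Python standard library. The function names are my own choice for this illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def probit(p):
    return _STD_NORMAL.inv_cdf(p)       # inverse of Phi

def inv_probit(eta):
    return _STD_NORMAL.cdf(eta)         # Phi

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

# Compare how each inverse link maps the same linear predictor to a probability.
for name, g, g_inv in [("logit", logit, inv_logit),
                       ("probit", probit, inv_probit),
                       ("cloglog", cloglog, inv_cloglog)]:
    print(name, [round(g_inv(eta), 3) for eta in (-2.0, 0.0, 2.0)])
```

Logit and probit are symmetric about $p=0.5$ , while cloglog is not: its inverse maps $\eta=0$ to $1-e^{-1}$ rather than to $0.5$ .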

In this note we’ll discuss when to use each of these link functions.

Probit

The probit link function is appropriate when it makes sense to think of $y$ as obtained by thresholding a normally distributed latent variable $z$ :

\begin{array}{rcl} z & = & x^{\prime}\beta^{*}+\varepsilon\\ \varepsilon & \sim & \text{Normal}\left(0,\sigma\right)\\ y & = & \begin{cases} 1 & \text{if }z\geq0\\ 0 & \text{otherwise}. \end{cases} \end{array}

Defining $\beta=\beta^{*}/\sigma$ , this yields

\begin{array}{rcl} \Pr\left(y=1\mid x\right) & = & \Pr\left(x^{\prime}\beta^{*}+\varepsilon\geq0\right)\\ & = & \Pr\left(-\varepsilon\leq x^{\prime}\beta^{*}\right)\\ & = & \Pr\left(\varepsilon\leq x^{\prime}\beta^{*}\right)\\ & = & \Pr\left(\varepsilon/\sigma\leq x^{\prime}\beta^{*}/\sigma\right)\\ & = & \Phi\left(x^{\prime}\beta\right). \end{array}
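The thresholding story can be checked by simulation. The sketch below uses a hypothetical function name and arbitrary values for $x^{\prime}\beta^{*}$ and $\sigma$ ; it compares the simulated frequency of $y=1$ with $\Phi\left(x^{\prime}\beta\right)$ .

```python
import random
from statistics import NormalDist

def simulate_threshold_rate(xb_star, sigma, n_trials=100_000, seed=1):
    """Fraction of draws with z = x'beta* + eps >= 0, where eps ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials)
               if xb_star + rng.gauss(0.0, sigma) >= 0.0)
    return hits / n_trials

xb_star, sigma = 1.2, 2.0                        # arbitrary illustrative values
simulated = simulate_threshold_rate(xb_star, sigma)
predicted = NormalDist().cdf(xb_star / sigma)    # Phi(x'beta) with beta = beta*/sigma
print(simulated, predicted)
```

Only the ratio $\beta^{*}/\sigma$ matters, which is why $\sigma$ cannot be identified separately and is absorbed into $\beta$ .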

Logit

Logit is the default link function to use when you have no specific reason to choose one of the others. There is a specific technical sense in which use of logit corresponds to minimal assumptions about the relationship between $y$ and $x$ . Suppose that we describe the joint distribution for $x$ and $y$ by giving

the marginal distribution for $x$ , and
the expected value of $x_{i}y$ for each predictor variable $x_{i}$ .

Then the maximum-entropy (most spread-out, diffuse, least concentrated) joint distribution for $x$ and $y$ satisfying the above description has a pdf of the form

p\left(x,y\right)=\frac{1}{Z}f(x)\exp\left(\sum_{i=1}^{n}\beta_{i}x_{i}y\right)

for some function $f$ , coefficient vector $\beta$ and normalizing constant $Z$ . The conditional distribution for $y$ is then

\begin{array}{rcl} p\left(y\mid x\right) & = & \frac{p(x,y)}{p(x,0)+p(x,1)}\\ & = & \frac{\exp\left(\left(x^{\prime}\beta\right)y\right)}{1+\exp\left(x^{\prime}\beta\right)} \end{array}

and so

\begin{array}{rcl} \Pr\left(y=1\mid x\right) & = & \frac{\exp\left(x^{\prime}\beta\right)}{1+\exp\left(x^{\prime}\beta\right)}\\ & = & \text{logit}^{-1}\left(x^{\prime}\beta\right). \end{array}
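The cancellation of $f(x)/Z$ is the key step, and it is simple enough to verify numerically. In this sketch the function names are hypothetical and `eta` stands for $x^{\prime}\beta$ .

```python
import math

def conditional_from_maxent(eta, y):
    """p(y | x) from normalizing exp(eta * y) over y in {0, 1}, with eta = x'beta."""
    # The common factor f(x)/Z appears in both p(x,0) and p(x,1), so it cancels.
    weights = {0: math.exp(eta * 0), 1: math.exp(eta * 1)}
    return weights[y] / (weights[0] + weights[1])

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

for eta in (-3.0, 0.0, 1.7):
    print(eta, conditional_from_maxent(eta, 1), inv_logit(eta))
```

The last two columns agree for every value of `eta`.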

Cloglog

The complementary log-log link function arises when

y=\begin{cases} 1 & \text{if }z > 0\\ 0 & \text{if }z=0 \end{cases}

where $z$ is a count having a Poisson distribution:

\begin{array}{rcl} z & \sim & \text{Poisson}\left(\lambda\right)\\ \lambda & = & \exp\left(x^{\prime}\beta\right). \end{array}

To see this, let

p=\Pr\left(z > 0\mid x\right).

Then

\begin{array}{rcl} p & = & 1-\text{Poisson}\left(0\mid\lambda\right)\\ & = & 1-\exp\left(-\lambda\right)\\ & = & 1-\exp\left(-\exp\left(x^{\prime}\beta\right)\right) \end{array}

and so

\begin{array}{rcl} \text{cloglog}\left(p\right) & = & \log\left(-\log\left(1-p\right)\right)\\ & = & \log\left(-\log\left(\exp\left(-\exp\left(x^{\prime}\beta\right)\right)\right)\right)\\ & = & x^{\prime}\beta. \end{array}
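This too can be checked by simulation. The sketch below draws Poisson counts with Knuth's multiplication method (a standard textbook sampler, used here only because the Python standard library has none), records whether each count is nonzero, and applies cloglog to the observed frequency. The value of $x^{\prime}\beta$ is arbitrary.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small rate used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def cloglog(p):
    return math.log(-math.log(1.0 - p))

eta = 0.3                       # x'beta, arbitrary illustrative value
lam = math.exp(eta)             # Poisson rate
rng = random.Random(2)
n_trials = 100_000
nonzero = sum(1 for _ in range(n_trials) if poisson_draw(lam, rng) > 0)
p_hat = nonzero / n_trials      # estimate of Pr(z > 0 | x)
print(cloglog(p_hat), eta)
```

The two printed numbers should agree to within simulation error.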

Conclusion

In summary, here is when to use each of the link functions:

Use probit when you can think of $y$ as obtained by thresholding a normally distributed latent variable.
Use cloglog when $y$ indicates whether a count is nonzero, and the count can be modeled with a Poisson distribution.
Use logit if you have no specific reason to choose some other link function.