David Chapman has an essay entitled Probability Theory Does Not Extend Logic. A friend asked me about it, and I wrote up my response here.

## It’s All About Jensen’s Inequality

A recent paper proves something that runners have long suspected: GPS overestimates the distance you have traveled. This isn’t due to any algorithmic error; it is instead an unavoidable consequence of two facts:

- The position measurements that GPS makes are noisy — there is some degree of random error to them.
- The distance between two points is a convex function of the coordinates of the points.

A convex function is one that curves upwards. Some examples: x^{2}, e^{x}, and x^{4}.

For a twice-differentiable function of one argument (such as the above examples), convexity means that the function has a nonnegative second derivative. A convex function f of several arguments x_1, \ldots, x_n curves upward no matter what direction you follow; that is, the *directional* second derivative is nonnegative no matter what direction you choose.

Jensen’s Inequality states that

- if f is a convex function
- and x is a (possibly vector-valued) random variable

then

E\left[f(x)\right] > f\left(E[x]\right).

(Strictly speaking, you could have = instead of >, but only if the probability distribution for x is concentrated at a single point.)
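A quick numerical illustration of the inequality, using the convex function f(x)=x^{2} and an arbitrarily chosen normal random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100_000)  # random variable with E[x] = 3

f = lambda v: v ** 2  # a convex function

lhs = np.mean(f(x))   # E[f(x)], approximately 3^2 + 1^2 = 10 for this normal
rhs = f(np.mean(x))   # f(E[x]), approximately 9

print(lhs > rhs)      # strictly greater, since x is not concentrated at a point
```

The gap lhs − rhs is exactly the variance of x here, which is one way to remember why noise always pushes E[f(x)] above f(E[x]) for a convex f.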

In this case, x is the vector (x_1,y_1,x_2,y_2), where (x_1,y_1) are the *measured* GPS coordinates for the starting point and (x_2,y_2) are the *measured* GPS coordinates for the ending point, and

f(x) = \sqrt{\left(x_2-x_1\right)^2 + \left(y_2-y_1\right)^2}

is the calculated distance between the two points. It is straightforward to show that this distance function f(x) is convex.
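The convexity of the distance function can be sanity-checked numerically via the midpoint inequality f\left(\frac{a+b}{2}\right) \leq \frac{f(a)+f(b)}{2}, which every convex function satisfies. A sketch (the random test points are arbitrary):

```python
import numpy as np

def dist(p):
    # p = (x1, y1, x2, y2): distance between the two points
    x1, y1, x2, y2 = p
    return np.hypot(x2 - x1, y2 - y1)

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    a, b = rng.normal(size=4), rng.normal(size=4)
    # convexity implies f at the midpoint is at most the midpoint of f
    ok &= dist((a + b) / 2) <= (dist(a) + dist(b)) / 2 + 1e-12
print(ok)
```

This is not a proof, of course; the actual proof follows from the distance being a norm of a linear function of the coordinates, and norms are convex.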

Note that (x_1,y_1) and (x_2,y_2) are noisy measurements, not the actual (imperfectly known) coordinates. If we assume that the GPS measurements, although noisy, are at least unbiased, then

E\left[\left(x_i,y_i\right)\right] = \left(x^*_i, y^*_i\right)

where (x^*_1,y^*_1) and (x^*_2,y^*_2) are the *actual coordinates*. The calculated distance is f(x), the actual distance is f(x^*) = f\left(E[x]\right), and Jensen's inequality guarantees that

E\left[f(x)\right] > f\left(x^*\right):

on average, the calculated distance exceeds the actual distance.
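The overestimate is easy to see in a simulation. The noise level and distances below are made up for illustration; only the direction of the bias matters:

```python
import numpy as np

rng = np.random.default_rng(2)

true_start = np.array([0.0, 0.0])
true_end = np.array([100.0, 0.0])   # true distance is exactly 100
true_dist = np.linalg.norm(true_end - true_start)

sigma = 5.0                          # assumed GPS noise level
n = 200_000
# unbiased noisy fixes for each endpoint
start = true_start + rng.normal(scale=sigma, size=(n, 2))
end = true_end + rng.normal(scale=sigma, size=(n, 2))

measured = np.linalg.norm(end - start, axis=1)
print(measured.mean())  # exceeds 100: measured distance overestimates on average
```

Averaging many measured segments does not cancel the error, because every segment is biased in the same direction.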

## Analysis of a Nootropics Survey

## The Problem

Scott Alexander’s blog Slate Star Codex recently carried the results of a survey of over 850 users of nootropics (cognitive enhancers) such as caffeine, Adderall, and Modafinil. The survey asked respondents to subjectively rate each substance on a scale of 0 to 10, with 0 meaning useless, 1-4 meaning subtle effects, 5-9 meaning strong effects, and 10 meaning life-changing.

There are several difficulties in analyzing this kind of data:

1. *Discretization*. Actual effects vary over a continuum, but respondents are asked to choose from a discrete set of choices.
2. *Heterogeneous scale usage*. People vary in how they use the scale. Some may spread their responses out more than others. Some may tend to give higher answers than others for the same underlying effect. There may be nonlinearities in how people use the scale.
3. *Meaning*. Just what do these ratings *mean*? How do they translate into specific effects on a person's mind?

In this note I tackle problems (1) and (2), which are purely technical problems; for (3) I have no answer. On (2) I restrict my attention to bias and scaling of responses.

The analytic approach used here is a simplified version of the methodology described in this paper.

One final caveat: the survey subjects are a self-selected sample, and hence may not be representative of the general populace. One way of dealing with that issue would be to regress the nootropic effects on various subject characteristics that might be predictive of nootropic effect. I did not do this, although the survey includes questions that could be used for this purpose.

## Summary of Results

I did a Bayesian analysis that used a hierarchical prior for the nootropic effects and accounted for scale usage heterogeneity and discretization of responses. The first figure shows estimates for the population mean effect (\alpha+\beta_{k} from the model described below) for each nootropic. The black point is the posterior median, the red line is the 80% posterior credible interval, and the black line is the 90% interval.

The picture is considerably murkier if you look at the posterior *predictive* distribution for each nootropic. The effect *for an individual* is \alpha+\beta_{k}+\varepsilon_{k}, where \varepsilon_{k} is an individual deviation from the population mean, with nootropic-specific variance \sigma_{k}^{2}. The next figure shows credible intervals for this individual effect, for each nootropic. These individual effects are quite uncertain: the values \sigma_{k} vary from around 1.9 to around 2.8.
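The intervals in these figures are just quantiles of the posterior draws. With draws in hand (the values below are hypothetical stand-ins, since the actual sampler output is not reproduced here), they can be computed as:

```python
import numpy as np

# Hypothetical posterior draws for one nootropic's population mean
# (alpha + beta_k); in the real analysis these come from the MCMC output.
rng = np.random.default_rng(3)
draws = rng.normal(loc=5.2, scale=0.4, size=4000)

median = np.median(draws)
ci80 = np.percentile(draws, [10, 90])   # 80% central credible interval
ci90 = np.percentile(draws, [5, 95])    # 90% central credible interval
print(median, ci80, ci90)
```

The predictive interval for an individual would instead use draws of \alpha+\beta_{k}+\varepsilon_{k}, which is why it comes out so much wider.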

## The Data

The data are provided as a table with one row per subject (survey respondent), and one column per nootropic. There were 36 nootropics mentioned in the survey, but subjects only gave ratings to those they had actually used. The first step is to reshape the data from this “wide” format into a “long” format with columns for subject, nootropic, and response, with each case being some subject’s experience with some nootropic. Then

- i indexes cases;
- \mathrm{subj}_{i} is the subject for case i;
- \mathrm{noo}_{i} is the nootropic for case i;
- r_{i} is the rating subject \mathrm{subj}_{i} gave for nootropic \mathrm{noo}_{i}.
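The wide-to-long reshape can be sketched with pandas (the column names and the tiny table below are hypothetical, not the actual survey data):

```python
import pandas as pd

# Hypothetical wide format: one row per subject, one column per nootropic.
wide = pd.DataFrame({
    "subject": [1, 2, 3],
    "caffeine": [5, 7, None],      # None = subject never used this nootropic
    "modafinil": [None, 8, 4],
})

long = wide.melt(id_vars="subject", var_name="nootropic", value_name="rating")
long = long.dropna(subset=["rating"])  # keep only the ratings actually given
print(long)
```

After dropping the missing entries, each remaining row is one case i, with its subject, nootropic, and rating columns supplying \mathrm{subj}_{i}, \mathrm{noo}_{i}, and r_{i}.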

## Discretization

Take each rating r_{i} to be the binned version of a continuous latent variable y_{i}. For example, a rating of r_{i}=5 means that 4.5\leq y_{i}\leq5.5. Similarly, a rating of r_{i}=0 means y_{i}\leq0.5, and a rating of r_{i}=10 means y_{i}\geq9.5.
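The binning is just rounding with the ends of the scale clamped; a minimal sketch:

```python
import numpy as np

def bin_rating(y):
    # Map a continuous latent value y to the observed 0-10 rating:
    # round to the nearest integer, clamping at the ends of the scale.
    return int(np.clip(np.round(y), 0, 10))

print(bin_rating(4.7))   # 5, since 4.5 <= y <= 5.5
print(bin_rating(-1.3))  # 0, since y <= 0.5
print(bin_rating(11.8))  # 10, since y >= 9.5
```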

This approach uses a fixed, equally-spaced set of breakpoints \theta_{m}=m+0.5 for 0\leq m < 10; a further refinement, which I did not explore, would be to infer the breakpoints themselves, perhaps restricting them to some parametric form such as a quadratic in m.

## Scale Usage

One can expect that survey respondents will vary in how they translate their response to a nootropic into a continuous rating y_{i}. Assume that an underlying continuous response z_{i} gets translated into a continuous rating y_{i} via an individual bias term and scaling factor:

\begin{array}{rcl} y_{i} & = & \mathrm{bias}_{j}+\mathrm{scale}_{j}\cdot z_{i}\\ j & = & \mathrm{subj}_{i}. \end{array}

Hierarchical priors for the scale-usage parameters are appropriate:

\begin{array}{rcl} \mathrm{bias}_{j} & \sim & \mathrm{Normal}\left(\alpha,\sigma_{\mathrm{bias}}\right)\\ \sigma_{\mathrm{bias}} & \sim & \mathrm{HalfNormal}\left(7.0\right)\\ \log\left(\mathrm{scale}_{j}\right) & \sim & \mathrm{Normal}\left(0,\sigma_{\mathrm{scale}}\right)\\ \sigma_{\mathrm{scale}} & \sim & \mathrm{HalfNormal}\left(2.0\right)\\ \alpha & \sim & \mathrm{Normal}\left(5.0,5.0\right) \end{array}

It probably would have made sense to use a bivariate normal prior on \mathrm{bias}_{j} and \log\left(\mathrm{scale}_{j}\right) to allow for correlations between them in the population, but I did not explore this option.
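A generative sketch of these scale-usage priors (Python rather than the R used for the actual analysis; the number of subjects is taken from the survey size):

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj = 850

# Draw scale-usage parameters from the hierarchical priors in the text.
alpha = rng.normal(5.0, 5.0)
sigma_bias = abs(rng.normal(0.0, 7.0))    # HalfNormal(7)
sigma_scale = abs(rng.normal(0.0, 2.0))   # HalfNormal(2)
bias = rng.normal(alpha, sigma_bias, size=n_subj)
scale = np.exp(rng.normal(0.0, sigma_scale, size=n_subj))

# Each subject's continuous rating y for an underlying response z:
z = 5.0                                   # example underlying response
y = bias + scale * z
print(y.shape)
```

Since \log\left(\mathrm{scale}_{j}\right) is normal, \mathrm{scale}_{j} itself is lognormal and therefore always positive, as a multiplicative scaling factor should be.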

The priors on \alpha, \sigma_{\mathrm{bias}} and \sigma_{\mathrm{scale}} are weakly informative, based on the 0 to 10 scale used:

- Values of \alpha outside the range 0 to 10 are implausible.
- A value of \sigma_{\mathrm{bias}} > 7 is implausible, as it allows quite extreme values for \mathrm{bias}_{j} to be common.
- A value of \sigma_{\mathrm{scale}} > 2 is implausible, as it means that it would be common to see a factor of \exp\left(2\sqrt{2}\right)\approx17 difference in the scaling used by two different subjects.

## Nootropic effects

The effectiveness of a nootropic will vary over the population; letting \beta_{k} be the mean effect of nootropic k, and \sigma_{k}^{2} its variance over the population, we have

\begin{array}{rcl} z_{i} & \sim & \mathrm{Normal}\left(\beta_{k},\sigma_{k}\right)\\ k & = & \mathrm{noo}_{i}. \end{array}

Since there are 36 different nootropics in the study, I used hierarchical priors for the mean effects and variances:

\begin{array}{rcl} \beta_{k} & \sim & \mathrm{Normal}\left(0,\sigma_{\mathrm{noo}}\right)\\ \sigma_{\mathrm{noo}} & \sim & \mathrm{HalfNormal}\left(7.0\right)\\ \log\left(\sigma_{k}\right) & \sim & \mathrm{Normal}\left(\mu_{\mathrm{lse}},\sigma_{\mathrm{lse}}\right)\\ \mu_{\mathrm{lse}} & \sim & \mathrm{Normal}\left(0,2\right)\\ \sigma_{\mathrm{lse}} & \sim & \mathrm{Normal}\left(0,2\right) \end{array}

The priors for \sigma_{\mathrm{noo}}, \mu_{\mathrm{lse}}, and \sigma_{\mathrm{lse}} are again intended to be weakly informative—given the 10-point scale, values of \sigma_{\mathrm{noo}} larger than 7, values of \mu_{\mathrm{lse}} larger than 2 \left(\approx\log7.4\right), and values of \sigma_{\mathrm{lse}} larger than 2 all seem implausibly extreme.
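The nootropic-level part of the model can be sketched the same way (again Python rather than R; \sigma_{\mathrm{lse}} is drawn as a half-normal here since a scale parameter must be positive, where the text writes Normal(0, 2)):

```python
import numpy as np

rng = np.random.default_rng(5)
n_noo = 36

# Draw nootropic-level parameters from the hierarchical priors in the text.
sigma_noo = abs(rng.normal(0.0, 7.0))           # HalfNormal(7)
beta = rng.normal(0.0, sigma_noo, size=n_noo)   # mean effect of each nootropic
mu_lse = rng.normal(0.0, 2.0)
sigma_lse = abs(rng.normal(0.0, 2.0))           # positivity assumed; see lead-in
sigma_k = np.exp(rng.normal(mu_lse, sigma_lse, size=n_noo))  # per-nootropic SDs

# Underlying responses z for, say, 100 cases of nootropic k = 0:
z = rng.normal(beta[0], sigma_k[0], size=100)
print(z.shape)
```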

## Likelihood

The normal distribution for z_{i} induces a normal distribution for y_{i}, conditional on the other model variables:

\begin{array}{rcl} y_{i} & \sim & \mathrm{Normal}\left(\mu_{i},\, s_{i}\right)\\ \mu_{i} & = & \mathrm{bias}_{j}+\mathrm{scale}_{j}\cdot\beta_{k}\\ s_{i} & = & \mathrm{scale}_{j}\cdot\sigma_{k}\\ j & = & \mathrm{subj}_{i}\\ k & = & \mathrm{noo}_{i}. \end{array}

The likelihood for case i in the data set is then given by

\begin{array}{rcl} \Pr\left(r_{i}=0\right) & = & \Phi\left(\frac{0.5-\mu_{i}}{s_{i}}\right)\\ \Pr\left(r_{i}=k\right) & = & \Phi\left(\frac{k+0.5-\mu_{i}}{s_{i}}\right)-\Phi\left(\frac{k-0.5-\mu_{i}}{s_{i}}\right)\\ & & \text{for }0 < k < 10\\ \Pr\left(r_{i}=10\right) & = & 1-\Phi\left(\frac{9.5-\mu_{i}}{s_{i}}\right)\\ & = & \Phi\left(\frac{\mu_{i}-9.5}{s_{i}}\right) \end{array}

where \Phi is the CDF for the standard normal distribution.
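These rating probabilities are differences of normal CDF values at the breakpoints, padded with 0 and 1 at the ends of the scale. A sketch (the example \mu_{i} and s_{i} are arbitrary):

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rating_probs(mu, s):
    # Pr(r = k) for k = 0..10, from the binned-normal likelihood above
    cdf = [0.0] + [Phi((k + 0.5 - mu) / s) for k in range(10)] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(11)]

p = rating_probs(mu=5.0, s=2.0)
print(sum(p))  # the 11 probabilities sum to 1
```

Padding the CDF list with 0 and 1 handles the end bins r=0 and r=10 with the same difference formula as the interior bins.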

## Estimation Scripts

The R code I used to run the estimation and produce the plots is in these three scripts, which I ran one after the other: nootropics.R, nootropics2.R, and nootropics3.R.
