Uncategorized

July 28, 2015 by kevin@ksvanhorn.com

Doomsday and the Dice Room Murders

The puzzle

Scott Aaronson has a wonderful book called Quantum Computing Since Democritus (1), and in his chapter on the anthropic principle he writes about a thought experiment (attributed to John Leslie) that he calls the Dice Room:

Imagine that there’s a very, very large population of people in the world, and that there’s a madman. What this madman does is, he kidnaps ten people and puts them in a room. He then throws a pair of dice. If the dice land snake-eyes (two ones), then he simply murders everyone in the room. If the dice do not land snake-eyes, then he releases everyone, then kidnaps 100 people. He now does the same thing: he rolls two dice; if they land snake-eyes, then he kills everyone, and if they don’t land snake-eyes, then he releases them and kidnaps 1000 people. He keeps doing this until he gets snake-eyes, at which point he’s done.

Think of the population of this world as being randomly partitioned into a sequence of batches, with batch $t$ having $10^{t}$ people. On step $t$ the madman kidnaps the people in batch $t$ .

The Dice Room is a metaphor for human extinction. The population of this hypothetical world corresponds to all human beings who ever have or who ever may live. The batches represent succeeding generations of humanity, the exponentially growing batch size represents exponential growth of the human population, the madman represents the sum of the various existential risks facing humanity, and rolling snake-eyes corresponds to an extinction event that wipes out humanity. Being kidnapped and placed in the Dice Room corresponds to being born into a particular generation of humanity.

This Dice Room scenario leads to the following question: if you are kidnapped, what is your probability of dying? Correspondingly, how likely is ours to be the generation in which humans go extinct? Aaronson gives two answers. The first:

…the dice have a 1/36 chance of landing snake-eyes, so you should only be a “little bit” worried (considering.)

Let’s call this the “proximate risk” argument.

And the second:

…consider, of people who enter the room, what the fraction is of people who ever get out. Let’s say that it ends at 1000. Then, 110 people get out and 1000 die. If it ends at 10,000, then 1110 people get out and 10,000 die. In either case, about 8/9 of the people who ever go into the room die.

… We can say that we’re conditioning on a specific termination point, but that no matter what that point is, we get the same answer. It could be 10 steps or 50 steps, but no matter what the termination point is, almost all of the people who go into the room are going to die, because the number of people is increasing exponentially.

Let’s call this the “proportion murdered” argument. Also, $1000/1110\approx9/10$ rather than $8/9$ , so we’ll use $9/10$ in the rest of this note. The proportion-murdered argument amounts to an argument that human extinction is probably nigh.

Aaronson comments that “If you’re a Bayesian, then this kind of seems like a problem,” as Bayesian reasoning seems to give two different answers to the same question given the same information.

In a nutshell: how imminent doom is averted

Right off the bat, you should be smelling something rotten here: Bayesian reasoning is just unadorned probability theory, nothing more and nothing less, and we know that the rules of probability theory are logically consistent—or rather, if they’re not, then Zermelo-Fraenkel set theory is also inconsistent and mathematicians start throwing themselves off of tall buildings. If we have two different arguments giving different answers, one of them is wrong. In particular, the proportion-murdered argument is wrong.

The proportion-murdered argument relies on the property that $\pi_{t}$ , the prior probability that you are in batch $t$ , increases exponentially with $t$ . If the madman murders at step $T$ and $\pi_{t+1}=10\pi_{t}$ for all $1\leq t < T$ , and you know you are kidnapped at some point, then the probability that you are in batch $T$ (and hence die) is

\frac{\pi_{T}}{\sum_{t=1}^{T}\pi_{t}}\approx\frac{9}{10}.

But such exponential growth cannot continue indefinitely: the probabilities $\pi_{t}$ must sum to 1, and any infinite sequence of nonnegative numbers that sums to 1 must go to zero in the limit. Regardless of what prior distribution we use for your batch number, we know that $\pi_{t}\rightarrow0$ as $t\rightarrow\infty$ . Thus is the proportion-murdered argument destroyed, imminent extinction averted, and the foundations of mathematics rescued.

In the rest of this note I’ll flesh out the analysis.

Let’s get explicit

To start the analysis, let’s be crystal clear on our assumptions:

Writing $M(t)$ for “the madman murders at step $t$ ” and $\text{hm}(t)$ for “the madman has already murdered by step $t$ ,” the madman’s choice process is as follows, for $t\geq1$ : $\begin{array}{rcl} \text{hm}\left(1\right) & = & \text{false}\\ \text{hm}\left(t+1\right) & = & \left(\text{hm}\left(t\right)\text{ or }M\left(t\right)\right)\\ \Pr\left(M\left(t\right)\mid\text{hm}\left(t\right)\right) & = & 0\\ \Pr\left(M\left(t\right)\mid\neg\text{hm}\left(t\right)\right) & = & p\\ & = & \frac{1}{36} \end{array}$
where “ $\neg$ ” means “not.” That is, once the madman has murdered he does not do so again, and if he has not murdered in a prior step then there is a probability $p$ he will murder in the current step.
The population is partitioned into batches, which are numbered from 1 onward, with every individual belonging to exactly one batch. We write $b$ for the specific batch to which you belong.
We write $\pi_{t}$ for $\Pr\left(b=t\right)$ , the prior probability that you belong to batch $t$ .
You die, written $D$ , if the madman murders on the step corresponding to your batch number: $D\equiv M\left(b\right).$
You are kidnapped, written $K$ , if the madman has not yet murdered on the step corresponding to your batch number: $K\equiv\neg\text{hm}\left(b\right).$
You know that you are kidnapped, but not your batch number. Thus, the probability of interest is $\Pr\left(D\mid K\right).$

What is the specific set of batch probabilities $\pi_{t}$ that arise in the Dice Room scenario? We could derive it as follows:

Assume that batch $t$ contains $10^{t}$ individuals and that there are $n$ batches (a finite number). This gives a total population size of $N=\sum_{t=1}^{n}10^{t}$ .
Number the individuals from 1 to $N$ ; batch 1 consists of the first $10^{1}$ individuals, batch 2 consists of the next $10^{2}$ individuals, etc.
Let $i$ be your assigned number; since you don’t know your number, the prior for $i$ is the uniform distribution over the integers from 1 to $N$ .

This gives $\pi_{t}=10^{t}/N$ for $1\leq t\leq n$ and $\pi_{t}=0$ for $t > n$ . In this case if the madman “murders” at a step $t > n$ then nobody actually dies, as all batches after batch $n$ are empty.

What if we don’t want to assume a finite population? Then things get trickier, because there does not exist a uniform distribution over the infinite set of all positive integers. We would have to give some prior on $i$ that was at least weakly informative.

The details don’t matter. The analysis that follows doesn’t depend on what specific batch probabilities $\pi_{t}$ are used, as long as they are valid probabilities: each is nonnegative and they sum to 1.

The proximate-risk argument

The first step in formalizing the proximate-risk argument is to decompose the problem by the batch to which you belong:

\begin{array}{rcl} \Pr\left(D\mid K\right) & = & \Pr\left(M\left(b\right)\mid K\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(\left(b=t\right)\,\&\, M\left(t\right)\mid K\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(b=t\mid K\right)\cdot\Pr\left(M\left(t\right)\mid\left(b=t\right)\,\&\, K\right) \end{array}

In words: take the weighted average, over all $t$ , of the probability that the madman murders at step $t$ if you are in batch $t$ and are kidnapped; use as the weighting the probability that you are in batch $t$ if you are kidnapped.

Now

\begin{array}{rcl} \left(b=t\right)\,\&\, K & \Leftrightarrow & \left(b=t\right)\,\&\,\neg\text{hm}\left(b\right)\\ & \Leftrightarrow & \left(b=t\right)\,\&\,\neg\text{hm}\left(t\right) \end{array}

\begin{array}{rcl} \Pr\left(M(t)\mid\left(b=t\right)\,\&\, K\right) & = & \Pr\left(M(t)\mid\left(b=t\right)\,\&\,\neg\text{hm}\left(t\right)\right)\\ & = & \Pr\left(M(t)\mid\neg\text{hm}(t)\right)\\ & = & \frac{1}{36} \end{array}

with the second step justified by the fact that $(b=t)$ and $M(t)$ are independent propositions, even when conditioning on $\neg\text{hm}\left(t\right)$ ; then

\begin{array}{rcl} \Pr\left(D\mid K\right) & = & \sum_{t=1}^{\infty}\Pr\left(b=t\mid K\right)\cdot\frac{1}{36}\\[1em] & = & \frac{1}{36}\Pr\left(\exists t\geq1.\, b=t\right)\\[1em] & = & \frac{1}{36}. \end{array}

So the proximate-risk argument checks out as valid.

The proportion-murdered argument

Formalizing this argument proceeds in the same fashion as for the proximate-risk argument, except that we decompose the problem by the time step at which the madman murders:

\begin{array}{rcl} \Pr\left(D\mid K\right) & = & \Pr\left(M\left(b\right)\mid K\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(\left(b=t\right)\,\&\, M\left(t\right)\mid K\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(M(t)\mid K\right)\cdot\Pr\left(b=t\mid M(t)\,\&\, K\right) \end{array}

In words: take the weighted average, over all $t$ , of the probability that you are in batch $t$ if you are kidnapped and the madman murders at step $t$ ; use as the weighting the probability that the madman murders at step $t$ given that you are kidnapped.

Let’s verify that this really is a weighted average, i.e., that the weights sum to 1. We have

\Pr\left(M(t)\right)=p\left(1-p\right)^{t-1}

and hence

\begin{array}{rcl} \Pr\left(\exists t\geq1.\, M(t)\right) & = & \sum_{t=1}^{\infty}\Pr\left(M(t)\right)\\ & = & p\sum_{t=1}^{\infty}\left(1-p\right)^{t-1}\\ & = & 1. \end{array}

It is true in general that if $\Pr\left(Y\right)=1$ then $\Pr\left(Y\mid Z\right)=1$ for any $Z$ whose probability is not 0. So

\begin{array}{rcl} \sum_{t=1}^{\infty}\Pr\left(M(t)\mid K\right) & = & \Pr\left(\exists t\ge1.\, M(t)\mid K\right)\\ & = & 1. \end{array}

In words: the madman is guaranteed to eventually murder, and conditioning on the fact that you are kidnapped does not change this.

Note that

\begin{array}{rcl} M(t)\,\&\, K & \Leftrightarrow & M(t)\,\&\,\left(b\leq t\right); \end{array}

if the madman murders at step $t$ , you are kidnapped if and only if your batch number is $t$ or less. Defining

Q_{t}\equiv\frac{\pi_{t}}{\sum_{u=1}^{t}\pi_{u}},

we then find that

\begin{array}{rcl} \Pr\left(b=t\mid M(t)\,\&\, K\right) & = & \Pr\left(b=t\mid M(t)\,\&\,\left(b\leq t\right)\right)\\ & = & \Pr\left(b=t\mid b\leq t\right)\\ & = & Q_{t} \end{array}

(This uses the fact that $b=t$ and $M(t)$ are independent propositions even if we condition on $\left(b\leq t\right)$ .)

In words: if the madman murders at step $t$ and you are kidnapped, then $Q_{t}$ is the probability that you are in batch $t$ , exactly as assumed by the proportion-murdered argument.

We then have

\Pr\left(D\mid K\right)=\sum_{t=1}^{\infty}\Pr\left(M(t)\mid K\right)\cdot Q_{t};

that is, the probability we seek to determine is the weighted average of $Q_{t}$ over all $t$ .

So far every step of the proportion-murdered argument has checked out. However, the proportion-murdered argument also relies on a claim that $Q_{t}\approx9/10$ regardless of $t$ . That is, if it is known that you are in one of the first $t$ batches, it is highly likely that you are in batch $t$ itself. If $\pi_{u+1}=10\pi_{u}$ for $1\leq u < t$ , then $Q_{t}$ does indeed work out to approximately $9/10$ . But, as previously noted, it cannot be true that $\pi_{u+1}=10\pi_{u}$ for all $u\geq1$ : the probabilities $\pi_{u}$ can sum to 1 only if $\pi_{u}\rightarrow0$ as $u\rightarrow0$ . From this we see that $Q_{t}\rightarrow0$ as $t\rightarrow\infty$ .

Detail: fixing the proportion-murdered argument

Let’s carry out the rest of the analysis to verify that, even taking the proportion-murdered approach to the question, we still get an answer of $1/36$ .

First, some results we will need:

\begin{array}{rcl} \Pr\left(K\right) & = & \Pr\left(\neg\text{hm}\left(b\right)\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(\left(b=t\right)\,\&\,\neg\text{hm}\left(t\right)\right)\\ & = & \sum_{t=1}^{\infty}\Pr\left(b=t\right)\Pr\left(\neg\text{hm}\left(t\right)\right)\\ & = & \sum_{t=1}^{\infty}\pi_{t}\left(1-p\right)^{t-1} \end{array}

and

\begin{array}{rcl} \Pr\left(M(t)\,\&\, K\right) & = & \Pr\left(M(t)\,\&\,\left(b\leq t\right)\right)\\ & = & \Pr\left(M(t)\right)\cdot\Pr\left(b\leq t\right)\\ & = & p\left(1-p\right)^{t-1}\sum_{u=1}^{t}\pi_{u}. \end{array}

Then

\begin{array}{rcl} \Pr\left(M(t)\mid K\right)\cdot Q_{t} & = & \frac{\Pr\left(M(t)\,\&\, K\right)\cdot Q_{t}}{\Pr\left(K\right)}\\ & = & \frac{p\left(1-p\right)^{t-1}\pi_{t}}{\Pr\left(K\right)} \end{array}

and so

\begin{array}{rcl} \Pr\left(D\mid K\right) & = & \sum_{t=1}^{\infty}\Pr\left(M(t)\mid K\right)\cdot Q_{t}\\ & = & \sum_{t=1}^{\infty}\frac{p\left(1-p\right)^{t-1}\pi_{t}}{\Pr\left(K\right)}\\ & = & p\frac{\Pr\left(K\right)}{\Pr\left(K\right)}\\ & = & p\\ & = & \frac{1}{36}. \end{array}

Concluding comments

The take-away here is this: don’t trust purely verbal arguments when it comes to determining conditional probabilities. It is too easy to make mistakes, even for experts.

Experienced software developers take it for granted that any piece of code that hasn’t been tested is wrong. A similar attitude is warranted towards any argument involving conditional probabilities. The counterpart to testing code is to formalize the argument. In other words, do the math, justifying each step of the derivation by appeal to the laws of probability.

References

1. Aaronson, Scott (2013). Quantum Computing Since Democritus. Cambridge University Press.

July 19, 2015 by kevin@ksvanhorn.com

Conditional Probability and the Dice Room Murders

[Edit: Doomsday and the Dice Room Murders supersedes this post, giving a more general analysis that allows for an unbounded population.]

(PDF)

The puzzle

Scott Aaronson has a wonderful book called Quantum Computing Since Democritus, and in his chapter on the anthropic principle he writes about a thought experiment (attributed to John Leslie) that he calls the Dice Room:

Imagine that there’s a very, very large population of people in the world, and that there’s a madman. What this madman does is, he kidnaps ten people and puts them in a room. He then throws a pair of dice. If the dice land snake-eyes (two ones), then he simply murders everyone in the room. If the dice do not land snake-eyes, then he releases everyone, then kidnaps 100 people. He now does the same thing: he rolls two dice; if they land snake-eyes, then he kills everyone, and if they don’t land snake-eyes, then he releases them and kidnaps 1000 people. He keeps doing this until he gets snake-eyes, at which point he’s done.

This scenario leads to the following question: if you are kidnapped, what is your probability of dying? He gives two answers. The first:

…the dice have a 1/36 chance of landing snake-eyes, so you should only be a “little bit” worried (considering.)

And the second:

…consider, of people who enter the room, what the fraction is of people who ever get out. Let’s say that it ends at 1000. Then, 110 people get out and 1000 die. If it ends at 10,000, then 1110 people get out and 10,000 die. In either case, about 8/9 of the people who ever go into the room die.

…We can say that we’re conditioning on a specific termination point, but that no matter what that point is, we get the same answer. It could be 10 steps or 50 steps, but no matter what the termination point is, almost all of the people who go into the room are going to die, because the number of people is increasing exponentially.

(Actually, $1000/1110\approx 9/10$ rather than $8/9$ , so we’ll use $9/10$ in the rest of this note.)

Aaronson comments that “If you’re a Bayesian, then this kind of seems like a problem,” as Bayesian reasoning seems to give two different answers to the same question.

Commentary

The other reason you should be skeptical of the above argument is that it’s purely verbal. It is extremely easy to make mistakes in applying probability theory if you just give intuitive verbal arguments without actually grinding through the details of the math. It’s easy to make mistakes even if you do most of the math except for some small step that seems obvious (see, for example, my analysis of the Marginalization Paradox.)

We shall see that the error in Aaronson’s (Leslie’s?) reasoning lies in dropping a conditioning proposition when assessing a probability; that is, an unconditional probability $\Pr\left(X\right)$ is implicitly used when a conditional probability $\Pr\left(X\mid Z\right)$ is called for. The conditioning proposition $Z$ at first glance may appear irrelevant, but it greatly alters the probability of $X$ .

Let’s get explicit

To start the analysis, let’s be crystal clear on our assumptions:

Define $s(j)=\sum_{t=1}^{j}10^{t}=\frac{10}{9}\left(10^{j}-1\right).$
That is, $s(j)$ is the total number of people kidnapped up to the $j$ -th step inclusive (if the process gets that far). This assumes that no person is ever kidnapped more than once, which appears to be implicit in Aaronson’s description.
Assume that the population of the world is $N=s(n)$ for some positive integer $n$ . This assumption is just a convenience so that we don’t have to deal with any “left-overs.”
Assume that $n$ is large enough that that it is almost certain that the madman will get snake-eyes before he runs out of people to kidnap. For example, $n=300$ gives a probability of about $0.9998$ that the madman eventually gets snake-eyes.
Writing $M(t)$ for “the madman murders at step $t$ ” and $\text{hm}(t)$ for “the madman has already murdered by step $t$ ,” the madman’s choice process is as follows, for $1\leq t\leq n$ : $\begin{array}{rcl} \text{hm}\left(1\right) & = & \text{false}\\ \text{hm}\left(t+1\right) & = & \left(\text{hm}\left(t\right)\text{ or }M\left(t\right)\right)\\ \Pr\left(M\left(t\right)\mid\text{hm}\left(t\right)\right) & = & 0\\ \Pr\left(M\left(t\right)\mid\neg\text{hm}\left(t\right)\right) & = & p\\ & = & \frac{1}{36} \end{array}$
where “ $\neg$ ” means “not.” That is, once the madman has murdered he does not do so again, and if he has not murdered in a prior step then there is a probability $p$ he will murder in the current step.
The madman selects victims in a random order. Let $v$ be a random permutation of the numbers from $1$ to $N$ ; that is, we take all possible permutations of $1$ to $N$ to be equally probable values for $v$ . If we imagine that the individuals in the population are numbered from $1$ to $N$ , then $v_{i}=1$ means that individual $i$ is the first selected, $v_{i}=100$ means that individual $i$ is the 100th to be selected, and so on. We can always extend the random ordering out past the point where the slaughter occurs, so we take $v_{i}$ to be meaningful regardless of whether or not individual $i$ is ever kidnapped. Since all permutations are equally likely, we have $\Pr\left(v_{i}=k\right)=\frac{1}{N}$
for all integers $1\leq k\leq N$ .
The random ordering defines $n$ batches of individuals, where batch 1 has the first $10^{1}$ individuals, batch 2 has the next $10^{2}$ individuals, and so on. Define $B\left(i\right)$ to be the number of the batch to which individual $i$ belongs; specifically,
$B\left(i\right)=t\;\Leftrightarrow\; s(t-1) < v_{i}\leq s(t).$
Individual $i$ dies, written $D(i)$ , if the madman murders on the step corresponding to the batch to which individual $i$ belongs:
$D\left(i\right)=M\left(B\left(i\right)\right).$
Individual $i$ gets kidnapped, written $K(i)$ , if the madman has not yet murdered on the step corresponding to the batch to which individual $i$ belongs:
$K\left(i\right)=\neg\text{hm}\left(B\left(i\right)\right).$
You are some individual $i$ , and the information you have available is that you have been kidnapped. Thus, the probability you are interested in is
$\Pr\left(D(i)\mid K\left(i\right)\right).$

Alternative decompositions

The first step in formalizing either of Aaronson’s arguments is to decompose the problem by batch:

\begin{array}{rcl} \Pr\left(D(i)\mid K(i)\right) & = & \Pr\left(M\left(B(i)\right)\mid K(i)\right)\\ & = & \sum_{t=1}^{n}\Pr\left(\left(B\left(i\right)=t\right)\,\&\,M\left(t\right)\mid K(i)\right). \end{array}

That is, the probability of being murdered is the sum of the probabilities of being murdered at each of the $n$ possible steps.

The next step is to apply the product rule to $P_{t}$ , defined as

P_{t}\,\equiv\,\Pr\left(\left(B(i)=t\right)\,\&\,M(t)\mid K(i)\right).

The product rule states that for any three propositions $X$ , $Y$ , and $Z$ , we may decompose that probability that both $X$ and $Y$ are true, conditional on $Z$ being true, as

\Pr\left(X\,\&\,Y\mid Z\right)=\Pr\left(X\mid Z\right)\Pr\left(Y\mid X\,\&\,Z\right).

Since $X\,\&\,Y$ is equivalent to $Y\,\&\,X$ , this gives two different possible decompositions, and that is where the two analyses differ:

The first analysis factors $P_{t}$ as $\Pr\left(B\left(i\right)=t\mid K(i)\right)\,\cdot\,\Pr\left(M\left(t\right)\mid\left(B\left(i\right)=t\right)\,\&\,K(i)\right)$
and implicitly claims that
$\begin{array}{rcl}\Pr\left(M(t)\mid\left(B(i)=t\right)\,\&\,K(i)\right) & = & \Pr\left(M(t)\mid\neg\text{hm}(t)\right) \\ & = & \frac{1}{36} \end{array}$
regardless of $t$ . (That is, knowing that $i$ belongs to batch $t$ and is kidnapped provides no information relevant to the probability of the madman murdering at step $t$ beyond establishing that he has not yet murdered by step $t$ .) Since individual $i$ certainly belongs to one of the $n$ batches, we then get

$\begin{array}{rcl} \Pr\left(D(i)\mid K(i)\right) & = & \frac{1}{36}\sum_{t=1}^{n}\Pr\left(B(i)=t\mid K(i)\right)\\ & = & \frac{1}{36}. \end{array}$
The second analysis factors $P_{t}$ as $\Pr\left(M(t)\mid K(i)\right)\,\cdot\,\Pr\left(B\left(i\right)=t\mid M\left(t\right)\,\&\,K(i)\right)$
and implicitly claims two things: first, that
$\Pr\left(B\left(i\right)=t\mid M\left(t\right)\,\&\,K(i)\right)\,=\,\frac{10^{t}}{s(t)}\,\approx\,\frac{9}{10}$
(if $i$ is kidnapped at some point and the madman murders at step $t$ , then $i$ has a probability of about $9/10$ of being in batch $t$ ), and second that
$\sum_{t=1}^{n}\Pr\left(M(t)\mid K(i)\right)\approx 1$
(it is highly probable that the madman eventually murders, and knowing that $i$ is kidnapped does not change this). This then yields
$\begin{array}{rcl} \Pr\left(D(i)\mid K(i)\right) & \approx & \frac{9}{10}\sum_{t=1}^{n}\Pr\left(M(t)\mid K(i)\right)\\ & \approx & \frac{9}{10} \end{array}$
in contradiction to the earlier answer of $1/36$ .

To resolve the apparent paradox, we then have to determine which of these three claims are valid and which are not:

Claim 1. $\Pr\left(M(t)\mid\left(B(i)=t\right)\,\&\,K(i)\right)=\Pr\left(M(t)\mid\neg\text{hm}(t)\right)$ .

Claim 2. $\Pr\left(B\left(i\right)=t\mid M\left(t\right)\,\&\,K(i)\right) = 10^{t}/s(t)$ .

Claim 3. $\sum_{t=1}^{n}\Pr\left(M(t)\mid K(i)\right) \approx 1$ .

Conditional independence

All three of the claims involve conditional probabilities. To evaluate the claims we need to think about (conditional) independence of variables. Two propositions $X$ and $Y$ are independent, conditional on $Z$ , if

\Pr\left(X\mid Y\,\&\,Z\right)=\Pr\left(X\mid Z\right)

or, equivalently, if

\Pr\left(Y\mid X\,\&\,Z\right)=\Pr\left(Y\mid X\right).

Two variable $x$ and $y$ are independent, conditional on $Z$ , if the propositions $x=X$ and $y=Y$ are independent, conditional on $Z$ , for any possible pair of values $X$ and $Y$ .

The figure below shows the Bayesian network graph for our problem when $n=4$ . There is a node for every variable, and for every variable there are arcs to it from those variables used in its immediate definition. You can read off conditional independencies directly from this graph, using the notion of d-separation [1], although I won’t go into all the details here.

The important point we’ll use if that if every undirected path between two variables passes through a “collider” node that is not part of the information you are conditioning on ( $Z$ in the above discussion), then those two variables are conditionally independent. A collider node on an undirected path is a variable that has both arcs pointing in to it. For example, in the undirected path

\text{hm}(4)\rightarrow M(4)\rightarrow D(i)\leftarrow B(i)

between $\text{hm}(4)$ and $B(i)$ , the variable $D(i)$ is a collider node.

Verifying Claim 1

We first note that, for $1\leq t\leq n$ ,

\left(B(i)=t\right)\,\&\,K(i)\quad\Leftrightarrow\quad\left(B(i)=t\right)\,\&\,\neg\text{hm}(t).

If you belong to batch $t$ , then you are kidnapped if and only if the madman has not murdered before step $t$ . So

\begin{array}{rcl} & & \Pr\left(M\left(t\right)\mid\left(B\left(i\right)=t\right)\,\&\,K(i)\right) \\ & = & \Pr\left(M\left(t\right)\mid\left(B\left(i\right)=t\right)\,\&\,\neg\text{hm}\left(t\right)\right). \end{array}

From the Bayesian network graph we can see that $M(t)$ is independent of $B(i)$ conditional on $\text{hm}(t)$ : all undirected paths between $M(t)$ and $B(i)$ include either the collider $\rightarrow D(i)\leftarrow$ or the collider $\rightarrow K(i)\leftarrow$ , and we are not conditioning on either $D(i)$ or $K(i)$ . So we have

\begin{array}{rcl} & & \Pr\left(M\left(t\right)\mid\left(B\left(i\right)=t\right)\,\&\,\neg\text{hm}\left(t\right)\right) \\ & = & \Pr\left(M(t)\mid\neg\text{hm}(t)\right) \end{array}

and Claim 1 is verified.

This tells us that the first analysis is, in fact, correct:

\Pr\left(D(i)\mid K(i)\right)=1/36.

Verifying Claim 2

We first note that, for $1\leq t\leq n$ ,

M\left(t\right)\,\&\,K(i)\quad\Leftrightarrow\quad M\left(t\right)\,\&\,\left(B(i)\leq t\right).

If the madman murders at step $t$ , then you are kidnapped if and only if you are in batch $t$ or an earlier batch. So

\begin{array}{rcl} & & \Pr\left(B\left(i\right)=t\mid M\left(t\right)\,\&\,K(i)\right) \\ & = & \Pr\left(B(i) = t\mid M(t)\,\&\,\left(B(i)\leq t\right)\right). \end{array}

All undirected paths between $B(i)$ and $M(t)$ are again blocked by one of the colliders $\rightarrow D(i)\leftarrow$ or $\rightarrow K(i)\leftarrow$ , so $B(i)$ and $M(t)$ are independent conditional on $B(i)\leq t$ , giving

\begin{array}{rcl} & & \Pr\left(B(i)=t\mid M(t)\,\&\,\left(B(i)\leq t\right)\right) \\ & = & \Pr\left(B(i)=t\mid B(i)\leq t\right) \\ & = & \frac{10^{t}}{s(t)} \\ &\approx& \frac{9}{10} \end{array}

and Claim 2 is verified.

Claim 3: not verified

As you have no doubt guessed by now, Claim 3 does not check out. Although it is true that

\sum_{t=1}^{n}\Pr\left(M(t)\right)\approx1,

we cannot substitute $\Pr\left(M(t)\right)$ for $\Pr\left(M(t)\mid K(i)\right)$ . $M(t)$ and $K(i)$ are dependent, due to the undirected path between them that run through the variables $\text{hm}(t)$ (among others.) This is where the second analysis falls down.

In fact, we find that

\sum_{t=1}^{n}\Pr\left(M(t)\mid K(i)\right)\approx\frac{10}{9}\cdot\frac{1}{36};

that is, if you know that you are kidnapped, but you don’t know at which step, then it is highly likely that the madman never murders anyone at all! Furthermore, this holds no matter how large you make $n$ . The intuition behind this is that it is highly likely that you are in batch $n$ , as batch $n$ contains about 9 times as many people as all other batches combined. But if you are in batch $n$ and are kidnapped, then there is only a probability of $1/36$ that the madman will roll snake eyes at this final step; most likely he will not roll snake eyes and, having run out of people to kidnap, will not commit the planned murder.

Details on Claim 3

Having given the hand-wavy verbal argument, let’s grind through the math. Using the standard identity for conditional probabilities,

\begin{array}{rcl} \Pr\left(M\left(t\right)\mid K(i)\right) & = & \Pr\left(M(t)\,\&\,K(i)\right)/\Pr(K(i))\\ & = & \Pr\left(M(t)\,\&\,\left(B(i)\leq t\right)\right)/\text{Pr}\left(K(i)\right)\\ & = & \Pr\left(M(t)\right)\Pr\left(B(i)\leq t\right)/\Pr(K(i)) \end{array}

with the last step justified by the fact that $M(t)$ and $B(i)$ are independent, as discussed in verifying Claim 2. But

\Pr\left(M(t)\right)=p(1-p)^{t-1}

and

\Pr\left(B(i)\leq t\right)=s(t)/s(n)

\begin{array}{rcl} \Pr\left(M\left(t\right)\mid K(i)\right) & = & \frac{p\left(1-p\right)^{t-1}s(t)}{s(n)\Pr(K(i))}\\[1em] & = & \frac{10p}{9s(n)\Pr(K(i))}\left(1-p\right)^{t-1}\left(10^{t}-1\right). \end{array}

We can check from the graph that $B(i)$ and $\text{hm}(t)$ are independent, so

\begin{array}{rcl} \Pr\left(K(i)\right) & = & \Pr\left(\neg\text{hm}\left(B(i)\right)\right)\\ & = & \sum_{t=1}^{n}\Pr\left(B(i)=t\,\&\,\neg\text{hm}(t)\right)\\[1.5em] & = & \sum_{t=1}^{n}\Pr\left(B(i)=t\right)\Pr\left(\neg\text{hm}(t)\right)\\[1.5em] & = & \sum_{t=1}^{n}\frac{10^{t}}{s(n)}\left(1-p\right)^{t-1}. \end{array}

Putting these two equalities together gives

\begin{array}{rcl} \sum_{t=1}^{n}\Pr\left(M(t)\mid K(i)\right) & = & \frac{10p\sum_{t=1}^{n}(1-p)^{t-1}\left(10^{t}-1\right)}{9\sum_{t=1}^{n}10^{t}(1-p)^{t-1}}\\[1.5em] & = & \frac{10p}{9}\left(1-\frac{\sum_{t=1}^{n}(1-p)^{t}}{\sum_{t=1}^{n}10^{t}(1-p)^{t}}\right)\\[1.5em] & \approx & \frac{10}{9}p. \end{array}

Conclusion

In summary, we have found that Aaronson’s first analysis of the Dice Room scenario is correct, and the second analysis, based on considering what proportion of the people who ever go into the Dice Room die, is not. The latter analysis implicitly (and erroneously) reduces the conditional probability $\Pr\left(M(t)\mid K(i)\right)$ —the probability that the madman murders at step $t$ , given that individual $i$ is kidnapped—to the unconditional probability $\Pr\left(M(t)\right)$ . The conditioning information $K(i)$ turns out to strongly affect the probability of $M(t)$ .

References

1. Geiger, Dan, Thomas Verma, and Judea Pearl (1990). “Identifying independence in Bayesian networks,” Networks 20, pp. 507–534. Online PDF.

July 14, 2015 by kevin@ksvanhorn.com

Implementing the Symmetric Effects Prior

(The full write-up is here.)

In a previous note I proposed a prior on the regression coefficients for a nominal input variable that leads to a symmetric prior on the effects:

\begin{array}{rcl} \beta & \sim & \text{Normal}\left(\overline{0},W\Sigma W^{\prime}\right)\\ W & = & X^{-1}\\ \Sigma_{jk} & = & \left\{\begin{array}{ll} \sigma^{2} & \text{if }j=k\\ \rho\sigma^{2} & \text{if }j\neq k \end{array}\right. \\ \rho & = & -1/(K-1) \end{array}

where $\beta$ is the vector of regression coefficients, $K$ is the number of levels, $X$ and $\Sigma$ are square matrices with $K-1$ rows/columns, row $k$ of $X$ is the encoding for level $k\neq K$ , and level $K$ is encoded as $-\sum_{k=1}^{K-1}X_{k}$ . This gives an induced prior on the $K$ effects $\alpha_{k}=X_{k}\beta$ that is symmetric in the effects, has a mean of 0 and variance of $\sigma^{2}$ for each $\alpha_{k}$ , and is degenerate, in that the sum of the effects is 0.

It may be desirable to avoid actually constructing the matrix $\Sigma$ , either for uniformity in the implementation of a Bayesian regression model (e.g., independent normal priors for each regression coefficient over the entire set of predictor variables) or because $K$ is large. In this note I discuss two ways of achieving this goal:

choose an encoding of the levels that turns independent normal priors on the regression coefficients into the desired symmetric prior on the effects; or
directly evaluate the probability density without constructing the matrix $\Sigma$ .

Low to moderate number of levels

When $K$ is small or moderate, we may achieve independent normal distributions for each of the regression coefficients $\beta_{k}$ by careful choice of encoding. If $X$ is chosen such that $W\Sigma W^{\prime}=\sigma^{2}I$ , then the prior

\beta_{k}\sim\text{Normal}\left(0,\sigma\right)\quad\text{for }1\leq k\leq K-1

gives us the desired symmetric prior on the effects.

To find such an encoding $X$ we begin with an eigendecomposition of $\Sigma$ : eigenvalues $\lambda_{1},\ldots,\lambda_{K-1}$ and corresponding orthonormal eigenvectors $u_{1},\ldots,u_{K-1}$ of $\Sigma$ . Then

\Sigma=U\Lambda U^{-1}

where $U$ is the matrix obtained by stacking the eigenvectors side by side, and $\Lambda$ is the diagonal matrix whose $k$ -th diagonal element is $\lambda_{k}$ :

\begin{array}{rcl} U & = & \left(u_{1},\ldots,u_{K-1}\right)\\ \Lambda & = & \text{diagonal}\left(\lambda_{1},\ldots,\lambda_{K-1}\right). \end{array}

In the full write-up I show that choosing

\begin{array}{rcl} X & = & UD\\ D & = & \text{diagonal}\left(d_{1},\ldots,d_{K-1}\right)\\ d_{k} & = & \frac{\lambda_{k}^{1/2}}{\sigma}.\end{array}

yields $W\Sigma W^{\prime}=\sigma^2 I$ as desired.

The orthonormal eigenvectors and corresponding eigenvalues of of $\Sigma$ are

\begin{array}{rcl} u_{1} & = & \left(K-1\right)^{-1/2}\left(1,\ldots,1\right)^{\prime}\\ \lambda_{1} & = & \frac{\sigma^{2}}{K-1}\\ d_{1} & = & \left(K-1\right)^{-1/2} \end{array}

and, for $2\leq k\leq K-1$ ,

\begin{array}{rcl} u_{ki} & = & \left(k(k-1)\right)^{-1/2}\cdot\left\{\begin{array}{ll} -1 & \text{if }i<k\\ k-1 & \text{if }i=k\\ 0 & \text{if }i>k \end{array}\right. \\ \lambda_{k} & = & \frac{K\sigma^{2}}{K-1}\\ d_{k} & = & \left(\frac{K}{K-1}\right)^{1/2}. \end{array}

Using $X=UD$ , which multiplies column $k$ of $U$ by $d_{k}$ , we have

X_{ik}=d_{k}u_{ki};

in particular,

\begin{array}{rcl} X_{i1} & = & \frac{1}{K-1}\\ X_{ik} & = & 0\text{ if }1<k<i\\ X_{ii} & = & \left(\frac{K\left(k-1\right)} {k\left(K-1\right)}\right)^{1/2}\text{ if }i\neq1\\ X_{ik} & = & -\left(\frac{K}{k(k-1)\left(K-1\right)}\right)^{1/2}\text{ if }k>i. \end{array}

This gives the required encodings for the first $K-1$ levels. The encoding for level $K$ is the negative sum of the encodings for levels 1 to $K-1$ , which works out to

X_{K}=\left(-1,0,\ldots,0\right).

As a check, the $K\times K$ covariance matrix for the full effects vector

\tilde{\alpha}=\begin{pmatrix}\alpha\\ \alpha_{K}\end{pmatrix}

is $\sigma^{2}\tilde{X}\tilde{X}^{\prime}$ , where $\tilde{X}$ is obtained by appending $X_{K}$ to $X$ as an additional row. Using Mathematica, I have verified that

\tilde{X}\tilde{X}^{\prime}=\begin{pmatrix}1 & \rho & \cdots & \rho & \rho\\ \rho & 1 & \cdots & \rho & \rho\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ \rho & \rho & \cdots & 1 & \rho\\ \rho & \rho & \cdots & \rho & 1 \end{pmatrix}\quad\text{where }\rho=-\frac{1}{K-1}

for all $K$ from 3 to 100, as expected.

Large number of levels

If $K$ is large, as occurs in a multilevel model with a prior on $\sigma$ , the above approach may be inefficient. Rather than doing a dot product of a level encoding with a vector of regression coefficients, it may be preferable to instead

directly create a vector of effects $\alpha_{k}$ , $1\leq k\leq K-1$ , with prior
$\alpha\sim\text{Normal}\left(\overline{0},\Sigma\right);$
compute $\alpha_{K}=-\sum_{i=1}^{K-1}\alpha_{k}$ ; then
use the level $k$ to index into the full effects vector $\left(\alpha,\alpha_{K}\right)$ .

If we are estimating the model using Hamilton Monte Carlo or similar methods, such as the NUTS sampler used in Stan, then there is the question of efficiently computing the log of the prior density for $\alpha$ . We would like to avoid actually constructing the large matrix $\Sigma$ . First note the following:

As shown in the full write-up, $\alpha^{\prime}\Sigma^{-1}\alpha=\tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}$
where

$\tilde{\sigma}^{2}=\frac{K}{K-1}\sigma^{2}$
Since $\Sigma=\sigma^{2}S$ , where $S$ is a matrix that is a function only of $K$ and not of $\sigma$ , we have $\det\Sigma=\sigma^{2(K-1)}\det S$
where $\det S$ has no dependence on $\sigma$ .

Then, writing $\equiv$ for “equal except for an additive term that has no dependence on $\alpha$ or $\sigma$ ,” we have

\begin{array}{rcl} \log\left(\text{Normal}\left(\alpha\mid\overline{0},\Sigma\right)\right) & \equiv & -\frac{1}{2}\log\det\Sigma-\frac{1}{2}\alpha^{\prime}\Lambda\alpha\\ & \equiv & -\left(K-1\right)\log\sigma-\frac{1}{2} \tilde{\sigma}^{-2}\sum_{k=1}^{K}\alpha_{k}^{2}\\ & \equiv & \sum_{k=1}^{K-1} \log\left(\text{Normal}\left(\alpha_{k}\mid0,\tilde{\sigma}\right)\right)-\frac{1}{2}\tilde{\sigma}^{-2}\alpha_{K}^{2}. \end{array}

In a Stan model the implementation would look something like this:

transformed parameters {
  ...
  vector[K] alpha1;

  ...
  for (k in 1:(K-1))
    alpha1[k] <- alpha[k];
  alpha1[K] <- -sum(alpha);
}
model  {
  ...
  increment_log_prob(
    -(K-1) * log(sigma)
    - (K-1) / (2 * K * sigma^2) * dot_self(alpha1));
}