This post describes a Bayesian model I developed for a former employer, The Modellers, a Hall & Partners company. The Modellers is a marketing research firm specializing in high-end analytics, and our parent company H & P does a lot of tracking studies: series of surveys conducted at regular intervals to track how attitudes, awareness, and opinions about a company and its products change over time, or to track recognition of advertising campaigns. One problem with tracking studies is that sampling error causes the numbers to fluctuate even when nothing is actually changing, which causes a certain amount of heartburn among clients. Anything that reduces the sampling noise is therefore worthwhile.
Some clients also want to run the surveys more frequently. The problem is that surveying twice as frequently without increasing sampling noise requires twice the data collection costs: for example, surveying 1000 people every two weeks instead of 1000 people every four weeks.
But… if you just ran the survey two weeks ago, and two weeks before that, it’s not as if you have no idea what the results of the current survey are going to be. In most tracking studies you would be surprised to see large changes over a one-week or two-week period. So if you’re tracking, say, what proportion of a targeted population thinks your widget is the best on the market, the data from two weeks ago and four weeks ago provide some information about the population proportion today. It works the other way, too: the data you’ll collect two weeks and four weeks from now also have relevance to estimating the population proportion today, for the same reason that a large change is unlikely in that short of a time period. (That doesn’t help you today, but it can help you in a retrospective analysis.)
So, without further ado, here is an explanation of the model.
The Local Level Model
The local level model is one of the simplest forms of state-space model. It is used to track a real-valued quantity that cannot be precisely measured, but changes only a limited amount from one time step to the next.
For each time step $t$ there is a real-valued state $\alpha_{t}$ that cannot be directly observed, and a measurement variable $y_{t}$ that constitutes a noisy measurement of $\alpha_{t}$. There are two variance parameters:
- $\sigma_{y}^{2}$ is the error variance that relates a noisy measurement $y_{t}$ to the state $\alpha_{t}$;
- $\sigma_{\alpha}^{2}$ is a “drift” variance that characterizes the typical magnitude of change in the state variable from one time step to the next.
Given $\sigma_{\alpha}$, $\sigma_{y}$, and $\alpha_{0}$, the joint distribution for $\alpha_{1},\ldots,\alpha_{T}$ and $y_{0},\ldots,y_{T}$ is given by
\begin{align}
\alpha_{t+1} & \sim\mathrm{Normal}\left(\alpha_{t},\sigma_{\alpha}\right)\nonumber \\
y_{t} & \sim\mathrm{Normal}\left(\alpha_{t},\sigma_{y}\right)\tag{1}\label{eq:llm0}
\end{align}
That is, the sequence of states $\alpha_{0},\ldots,\alpha_{T}$ is a Gaussian random walk, and $y_{t}$ is just $\alpha_{t}$ with Gaussian noise added. If $\sigma_{\alpha}$ were infinite, then $y_{t}$ would be the only measurement providing information about $\alpha_{t}$; but a finite $\sigma_{\alpha}$ limits how much $\alpha_{t}$ and $\alpha_{t'}$ may plausibly differ, hence $y_{t'}$ for $t'<t$ and $t'>t$ also provides information about $\alpha_{t}$. The information $y_{t'}$ provides about $\alpha_{t}$ decreases as $\left|t-t'\right|$ or $\sigma_{\alpha}$ increases.
Figure 1 shows the dependency graph for the local level model, for four time steps. Each node represents a variable in the model. An arrow from one variable to another indicates that the probability distribution for the second variable is defined conditional on the first. Comparing equations (\ref{eq:llm0}) to Figure 1, we see that the distribution for $\alpha_{t+1}$ is conditional on $\alpha_{t}$ and $\sigma_{\alpha}$, hence there is an arrow from $\alpha_{t}$ to $\alpha_{t+1}$ and another arrow from $\sigma_{\alpha}$ to $\alpha_{t+1}$ in the dependency graph.
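To make the model concrete, here is a short R sketch that simulates a local level series and its noisy measurements. All parameter values here are invented purely for illustration.

```r
# Simulate the local level model of equation (1).
# All parameter values are invented for illustration.
set.seed(1)
T_steps     <- 100
sigma_alpha <- 0.05  # drift standard deviation
sigma_y     <- 0.20  # measurement error standard deviation

alpha <- cumsum(rnorm(T_steps, 0, sigma_alpha))  # Gaussian random walk started at 0
y     <- rnorm(T_steps, alpha, sigma_y)          # noisy measurements of the state

plot(y, pch = 20, col = "grey", xlab = "t", ylab = "value")
lines(alpha, lwd = 2)  # the hidden state we would like to recover
```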
A Modified Local Level Model for Binomial Data
Now consider tracking the proportion of a population that would respond positively to a given survey question (if asked). Can we model the evolution of this proportion over time using the local level model? The proportion $p_{t}$ at time $t$ is not an arbitrary real value as in the local level model, but that is easily dealt with by defining the state variable to be $\alpha_{t}=\mathrm{logit}\left(p_{t}\right)$, which ranges from $-\infty$ to $\infty$ as $p_{t}$ ranges from $0$ to $1$. The parameter $\sigma_{\alpha}$ then constrains what are plausible changes in the proportion from one time step to the next. But what of the real-valued measurement $y_{t}$ and its error variance $\sigma_{y}^{2}$? We have instead a discrete measurement given by a number $n_{t}$ of people surveyed at time $t$ and a number $k_{t}$ of positive responses. So for survey data we swap out the measurement model
\[
y_{t}\sim\mathrm{Normal}\left(\alpha_{t},\sigma_{y}\right)
\]
and replace it with a binomial measurement model:
\begin{align*}
k_{t} & \sim\mathrm{Binomial}\left(n_{t},p_{t}\right)\\
\mathrm{logit}\left(p_{t}\right) & =\alpha_{t}
\end{align*}
In the above, $\mathrm{Binomial}\left(n_{t},p_{t}\right)$ is the distribution for the number of successes $k_{t}$ out of $n_{t}$ independent trials, each having a probability $p_{t}$ of success.
We end up with the following model: given $\sigma_{\alpha}$, $\alpha_{0}$, and the sample sizes $n_{0},\ldots,n_{T}$, the joint distribution for $\alpha_{1},\ldots,\alpha_{T}$ and $k_{0},\ldots,k_{T}$ is given by
\begin{align}
\alpha_{t+1} & \sim\mathrm{Normal}\left(\alpha_{t},\sigma\right)\nonumber \\
k_{t} & \sim\mathrm{Binomial}\left(n_{t},p_{t}\right)\nonumber \\
p_{t} & =\mathrm{logit}^{-1}\left(\alpha_{t}\right)\tag{2}\label{eq:llmbinomial}
\end{align}
where we have renamed $\sigma_{\alpha}$ as just $\sigma$, since it is the only variance parameter.
Figure 2 shows the dependency graph for this binomial local level model. As before, the sequence of states $\alpha_{0},\ldots,\alpha_{T}$ is a Gaussian random walk. As before, the fact that $\sigma$ is finite means that not only $k_{t}$, but also $k_{t'}$ for $t'<t$ and $t'>t$, provide information about $\alpha_{t}$, in decreasing amounts as $\left|t-t'\right|$ or $\sigma$ increases.
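As a quick illustration of the data-generating process in (2), the following R sketch simulates a tracking series. The wave count, sample sizes, and drift value are all invented.

```r
# Simulate survey counts from the binomial local-level model of equation (2).
# Wave count, sample sizes, and drift value are invented for illustration.
set.seed(2)
T_steps <- 26                  # e.g. a year of biweekly waves
sigma   <- 0.05                # drift standard deviation on the logit scale
n       <- rep(1000, T_steps)  # sample size at each wave

alpha <- qlogis(0.5) + cumsum(rnorm(T_steps, 0, sigma))  # random walk from logit(0.5)
p     <- plogis(alpha)          # true population proportions
k     <- rbinom(T_steps, n, p)  # observed counts of positive responses

plot(k / n, pch = 20, xlab = "wave", ylab = "observed proportion")
lines(p, lwd = 2)               # the true proportion being tracked
```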
A Bayesian Perspective
The frequentist statistics that most researchers learn treats probabilities as long-run frequencies of events. Bayesian probabilities, in contrast, are epistemic probabilities [cox][jaynes]: they measure degrees of certainty in propositions.
The binomial local level model as we have presented it has two parameters whose values are unknown a priori: the drift parameter $\sigma$ and the initial state $\alpha_{0}$. To have a complete Bayesian model we must define prior distributions over these variables. These prior distributions encode our background information on what are plausible values for the variables. We use
\begin{align}
\alpha_{0} & \sim\mathrm{Normal}\left(0,s_{\alpha}\right)\nonumber \\
\sigma & \sim\mathrm{HalfNormal}\left(s_{\sigma}\right)\tag{3}\label{eq:prior}
\end{align}
where $\mathrm{HalfNormal}\left(s\right)$ is a normal distribution with standard deviation $s$, truncated to allow only positive values. (See Figure 3.) If we choose $s_{\alpha}\approx1.8$ (close to $\pi/\sqrt{3}$, the standard deviation of the standard logistic distribution, whose inverse-logit transform is exactly uniform), this results in a prior over $p_{0}$ that is approximately uniform. Based on examination of data from past years, we might choose a value of $s_{\sigma}$ that scales with $\sqrt{w}$, where $w$ is the number of weeks in one time period. This allows plausible values for the drift standard deviation $\sigma$ to range from 0 to roughly $2s_{\sigma}$. The high end of that range corresponds to a typical change over the course of a single month of around 10% absolute (e.g., an increase from 50% to 60%, or a decrease from 40% to 30%).
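You can check the approximate-uniformity claim by simulation. Here is a minimal R sketch, using the $s_{\alpha}$ value suggested above:

```r
# Check that the prior implied on p_0 is roughly uniform.
# s_alpha = 1.8 is the value suggested above; adjust to taste.
s_alpha  <- 1.8
p0_draws <- plogis(rnorm(1e5, 0, s_alpha))  # push prior draws through logit^{-1}
hist(p0_draws, breaks = 40, freq = FALSE, xlab = "p_0",
     main = "Implied prior on p_0")
abline(h = 1, lty = 2)  # density of an exactly uniform prior, for comparison
```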
The model specification given by (\ref{eq:llmbinomial}) and (\ref{eq:prior}) is just a way of describing a joint probability distribution for the parameters $\sigma$ and $\alpha_{0},\ldots,\alpha_{T}$ and the data $k_{0},\ldots,k_{T}$, given the sample sizes $n_{0},\ldots,n_{T}$; this joint distribution has the following density function $g$:
\begin{multline*}
g\left(\sigma,\alpha_{0},k_{0},\ldots,\alpha_{T},k_{T}\mid n_{0},\ldots,n_{T}\right)=\\
f_{1}\left(\sigma\right)g_{2}\left(\alpha_{0}\right)f_{3}\left(k_{0}\mid n_{0},\mathrm{logit}^{-1}\left(\alpha_{0}\right)\right)\\
\cdot\prod_{t=1}^{T}g_{4}\left(\alpha_{t}\mid\alpha_{t-1},\sigma\right)f_{3}\left(k_{t}\mid n_{t},\mathrm{logit}^{-1}\left(\alpha_{t}\right)\right)
\end{multline*}
where
\begin{align*}
f_{1}\left(\sigma\right) & =\mathrm{HalfNormal}\left(\sigma;s_{\sigma}\right)\\
g_{2}\left(\alpha\right) & =\mathrm{Normal}\left(\alpha;0,s_{\alpha}\right)\\
f_{3}\left(k\mid n,p\right) & =\mathrm{Binomial}\left(k;n,p\right)\\
g_{4}\left(\alpha_{t}\mid\alpha_{t-1},\sigma\right) & =\mathrm{Normal}\left(\alpha_{t};\alpha_{t-1},\sigma\right)\\
\mathrm{HalfNormal}\left(x;s\right) & =\begin{cases}
\frac{2}{\sqrt{2\pi}\,s}\exp\left(-\frac{x^{2}}{2s^{2}}\right) & \mbox{if }x>0\\
0 & \mbox{otherwise}
\end{cases}\\
\mathrm{Normal}\left(x;\mu,s\right) & =\frac{1}{\sqrt{2\pi}\,s}\exp\left(-\frac{\left(x-\mu\right)^{2}}{2s^{2}}\right)\\
\mathrm{Binomial}\left(k;n,p\right) & =\frac{n!}{k!\,(n-k)!}\,p^{k}\left(1-p\right)^{n-k}
\end{align*}
This formula for the joint density is just the product of the (conditional) densities for the individual variables.
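For readers who find code easier to check than formulas, here is the joint density $g$ as an R function, computed on the log scale for numerical stability. The prior scales s_alpha and s_sigma are assumed values for illustration, not ones from the original analysis.

```r
# Log of the joint density g(sigma, alpha_0, k_0, ..., alpha_T, k_T | n_0, ..., n_T).
# The prior scales s_alpha and s_sigma are assumed values for illustration.
log_g <- function(sigma, alpha, k, n, s_alpha = 1.8, s_sigma = 0.2) {
  if (sigma <= 0) return(-Inf)
  T1 <- length(alpha)  # alpha = c(alpha_0, ..., alpha_T)
  log(2) + dnorm(sigma, 0, s_sigma, log = TRUE) +           # f1: half-normal prior
    dnorm(alpha[1], 0, s_alpha, log = TRUE) +               # g2: prior on alpha_0
    sum(dnorm(alpha[-1], alpha[-T1], sigma, log = TRUE)) +  # g4: random-walk terms
    sum(dbinom(k, n, plogis(alpha), log = TRUE))            # f3: binomial terms
}
```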
Alternatively, we can express the model directly in terms of the proportions $p_{t}$ instead of their logits $\alpha_{t}$; using the usual change-of-variables formula, the joint distribution for $\sigma$, $p_{0},\ldots,p_{T}$, and $k_{0},\ldots,k_{T}$ conditional on $n_{0},\ldots,n_{T}$ then has the following density function $f$:
\begin{multline*}
f\left(\sigma,p_{0},k_{0},\ldots,p_{T},k_{T}\mid n_{0},\ldots,n_{T}\right)=\\
f_{1}\left(\sigma\right)f_{2}\left(p_{0}\right)f_{3}\left(k_{0}\mid n_{0},p_{0}\right)\cdot\prod_{t=1}^{T}f_{4}\left(p_{t}\mid p_{t-1},\sigma\right)f_{3}\left(k_{t}\mid n_{t},p_{t}\right)
\end{multline*}
where
\begin{align*}
f_{2}\left(p\right) & =p^{-1}\left(1-p\right)^{-1}\,\mathrm{Normal}\left(\mathrm{logit}\left(p\right);0,s_{\alpha}\right)\\
f_{4}\left(p_{t}\mid p_{t-1},\sigma\right) & =p_{t}^{-1}\left(1-p_{t}\right)^{-1}\,\mathrm{Normal}\left(\mathrm{logit}\left(p_{t}\right);\mathrm{logit}\left(p_{t-1}\right),\sigma\right).
\end{align*}
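A quick numerical sanity check on the change of variables: $f_{2}$ should integrate to 1 over $(0,1)$. In R, with an assumed value for $s_{\alpha}$:

```r
# Sanity check: the change-of-variables density f2 integrates to 1.
s_alpha <- 1.8  # assumed prior scale
f2 <- function(p) dnorm(qlogis(p), 0, s_alpha) / (p * (1 - p))
integrate(f2, 0, 1)  # should print a value very close to 1
```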
For a Bayesian model there is no need to continue on and define estimators and test statistics, investigate their sampling distributions, etc. The joint distribution for the model variables, plus the rules of probability theory, are all the mathematical apparatus we need for inference. To make inferences about an unobserved model variable $X$ after collecting the data that observable variables $Y_{1},\ldots,Y_{m}$ have values $y_{1},\ldots,y_{m}$ respectively, we simply apply the rules of probability theory to obtain the distribution for $X$ conditional on $Y_{1}=y_{1},\ldots,Y_{m}=y_{m}$.
For example, if we are interested in $p_{T}$, then (conceptually) the procedure for making inferences about $p_{T}$ given the survey counts $k_{0},\ldots,k_{T}$ and sample sizes $n_{0},\ldots,n_{T}$ is this:
1. Marginalize (sum/integrate out) the variables we’re not interested in, namely $\sigma$ and $p_{0},\ldots,p_{T-1}$, to get a joint probability density function for $p_{T}$ and the data variables $k_{0},\ldots,k_{T}$:
\[
\begin{array}{l}
f\left(p_{T},k_{0},\ldots,k_{T}\mid n_{0},\ldots,n_{T}\right)=\\
\qquad\int_{0}^{\infty}\int_{0}^{1}\cdots\int_{0}^{1}f\left(\sigma,p_{0},k_{0},\ldots,p_{T},k_{T}\mid n_{0},\ldots,n_{T}\right)\,\mathrm{d}p_{0}\cdots\mathrm{d}p_{T-1}\,\mathrm{d}\sigma.
\end{array}
\]
This is just a variant of the rule from probability theory that says that
\[
\Pr\left(A\right)=\Pr\left(B_{1}\right)+\cdots+\Pr\left(B_{m}\right)
\]
when $A$ is equivalent to “$B_{1}$ or … or $B_{m}$” and $B_{1},\ldots,B_{m}$ are mutually exclusive propositions (no overlap).
2. Compute the conditional density function (also known as the posterior density function):
\[
f\left(p_{T}\mid k_{0},n_{0},\ldots,k_{T},n_{T}\right)=\frac{f\left(p_{T},k_{0},\ldots,k_{T}\mid n_{0},\ldots,n_{T}\right)}{\int_{0}^{1}f\left(p_{T},k_{0},\ldots,k_{T}\mid n_{0},\ldots,n_{T}\right)\,\mathrm{d}p_{T}}.
\]
This is just a variant of the rule for conditional probabilities,
\[
\Pr\left(A_{i}\mid C,D\right)=\frac{\Pr\left(A_{i}\mbox{ and }C\mid D\right)}{\Pr\left(C\mid D\right)},
\]
combined with the rule that, if it is known that exactly one of $A_{1},\ldots,A_{m}$ is true, then
\[
\Pr(C)=\Pr\left(A_{1}\mbox{ and }C\right)+\cdots+\Pr\left(A_{m}\mbox{ and }C\right).
\]
3. Given any bounds $a<b$, we can now compute the probability that $a\leq p_{T}\leq b$, conditional on the data values $k_{0},n_{0},\ldots,k_{T},n_{T}$:
$$
\Pr\left(a\leq p_{T}\leq b\mid k_{0},\ldots,k_{T},n_{0},\ldots,n_{T}\right)=\int_{a}^{b}f\left(p_{T}\mid k_{0},n_{0},\ldots,k_{T},n_{T}\right)\,\mathrm{d}p_{T}
$$
This is known as a posterior probability, as it is the probability we obtain after seeing the data. In this step we get the posterior probability of $a\leq p_{T}\leq b$ by, roughly, just summing over all the possible values of $p_{T}$ between $a$ and $b$. (The R sketch following this list works through the whole procedure numerically for $T=1$.)
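The author’s script (described below) carries out these computations exactly; the following toy R sketch does the same thing for $T=1$ on a coarse grid, so you can see the marginalize-then-condition procedure in miniature. The grid resolutions, prior scales, and data values are all invented.

```r
# Toy version of the conceptual procedure for T = 1: integrate out sigma and
# p_0 on a grid, condition on the data, and read off posterior probabilities.
# Grid resolutions, prior scales, and data values are invented for illustration.
sig <- seq(0.005, 1, length.out = 60);      dsig <- sig[2] - sig[1]
p   <- seq(0.001, 0.999, length.out = 200); dp   <- p[2]   - p[1]
s_alpha <- 1.8; s_sigma <- 0.2
k <- c(520, 540); n <- c(1000, 1000)

f1 <- 2 * dnorm(sig, 0, s_sigma)                    # half-normal prior on sigma
f2 <- dnorm(qlogis(p), 0, s_alpha) / (p * (1 - p))  # prior density of p_0

# Accumulate f(p_1, k_0, k_1 | n_0, n_1) by summing over the sigma and p_0 grids.
joint_p1 <- numeric(length(p))
for (s in seq_along(sig)) {
  f4 <- outer(p, p, function(p1, p0)                # f4[i, j] = f4(p1_i | p0_j, sig_s)
    dnorm(qlogis(p1), qlogis(p0), sig[s]) / (p1 * (1 - p1)))
  w0 <- f2 * dbinom(k[1], n[1], p)                  # p_0 factors: prior times likelihood
  joint_p1 <- joint_p1 + f1[s] * as.vector(f4 %*% w0) * dp * dsig
}
post_p1 <- joint_p1 * dbinom(k[2], n[2], p)         # multiply in the time-1 likelihood
post_p1 <- post_p1 / sum(post_p1 * dp)              # normalize: posterior density of p_1

sum(post_p1[p >= 0.51 & p <= 0.56] * dp)            # Pr(0.51 <= p_1 <= 0.56 | data)
```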
We can make inferences about the change $\Delta_{T}=p_{T}-p_{0}$ in a similar fashion, using the joint density function $h$ for $\sigma$, $p_{0},k_{0},\ldots,p_{T-1},k_{T-1}$, $\Delta_{T}$, and $k_{T}$:
\[
\begin{array}{l}
h\left(\sigma,p_{0},k_{0},\ldots,p_{T-1},k_{T-1},\Delta_{T},k_{T}\mid n_{0},\ldots,n_{T}\right)=\\
\qquad f\left(\sigma,p_{0},k_{0},\ldots,p_{T-1},k_{T-1},p_{0}+\Delta_{T},k_{T}\mid n_{0},\ldots,n_{T}\right).
\end{array}
\]
In practice, computing all these high-dimensional integrals exactly is usually intractably difficult, which is why we resort to Markov chain Monte Carlo methods [mcmc]. The actual computational procedure is then this:
1. Draw a dependent random sample $\left(\sigma^{(i)},p_{0}^{(i)},\ldots,p_{T}^{(i)}\right)$, $1\leq i\leq N$, from the joint conditional distribution
\[
f\left(\sigma,p_{0},\ldots,p_{T}\mid k_{0},n_{0},\ldots,k_{T},n_{T}\right)
\]
using Markov chain Monte Carlo. Each of the sample values is an entire vector.
2. Approximate the posterior distribution for the quantity of interest as a histogram of values computed from the sample values. For example, if we’re interested in the proportions $p_{t}$, then for each $t$ we create a histogram from the values $p_{t}^{(i)}$, $1\leq i\leq N$. If we’re interested in $\Delta_{T}$, then we create a histogram from the values $p_{T}^{(i)}-p_{0}^{(i)}$, $1\leq i\leq N$. (A minimal sampler illustrating both steps follows this list.)
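To make this concrete, here is a minimal random-walk Metropolis sampler for the model in R. It is a sketch under assumed prior scales, not the production approach; in practice one would use dedicated MCMC software (e.g., Stan or JAGS), which mixes far better than this naive sampler.

```r
# Minimal random-walk Metropolis sampler for the binomial local-level model.
# A sketch with assumed prior scales and invented data, for illustration only.
log_post <- function(theta, k, n, s_alpha = 1.8, s_sigma = 0.2) {
  sigma <- exp(theta[1])   # sample log(sigma) so the walk is unconstrained
  alpha <- theta[-1]       # alpha_0, ..., alpha_T on the logit scale
  T1    <- length(alpha)
  dnorm(sigma, 0, s_sigma, log = TRUE) + theta[1] +          # half-normal + Jacobian
    dnorm(alpha[1], 0, s_alpha, log = TRUE) +                # prior on alpha_0
    sum(dnorm(alpha[-1], alpha[-T1], sigma, log = TRUE)) +   # random-walk terms
    sum(dbinom(k, n, plogis(alpha), log = TRUE))             # binomial likelihood
}

run_mcmc <- function(k, n, iters = 20000, step = 0.05) {
  d     <- length(k) + 1
  theta <- c(log(0.1), qlogis((k + 1) / (n + 2)))  # crude starting point
  lp    <- log_post(theta, k, n)
  out   <- matrix(NA_real_, iters, d)
  for (i in seq_len(iters)) {
    prop    <- theta + rnorm(d, 0, step)           # propose a joint random-walk move
    lp_prop <- log_post(prop, k, n)
    if (log(runif(1)) < lp_prop - lp) { theta <- prop; lp <- lp_prop }
    out[i, ] <- theta
  }
  out
}

set.seed(3)
draws  <- run_mcmc(k = c(520, 540, 570, 590), n = rep(1000, 4))
keep   <- draws[-(1:5000), ]       # discard burn-in
p3     <- plogis(keep[, 5])        # column 5 holds alpha_3; p_3 = logit^{-1}(alpha_3)
delta3 <- p3 - plogis(keep[, 2])   # Delta_3 = p_3 - p_0
hist(p3, breaks = 50, main = "Posterior for p_3")
quantile(delta3, c(0.025, 0.975))  # 95% interval for the change since time step 0
```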
A Graphical Example
In order to get an intuition for how this model works, it is helpful to see how the inferred posterior distribution for $p_{t}$ (the population proportion at time $t$) evolves as we add new data points. We have done this in Figure 4, assuming for simplicity that the drift parameter $\sigma$ is known. The first row corresponds to $t=0$, the second to $t=1$, etc.
The left column displays
\[
f\left(p_{t}\mid\sigma,n_{0},k_{0},\ldots,n_{t-1},k_{t-1}\right),
\]
that is, the inferred probability density function for $p_{t}$ given known values for $\sigma$ and survey results up to the previous time step.
The middle column displays the likelihood
\[
\Pr\left(k_{t}\mid n_{t},p_{t}\right)=\frac{n_{t}!}{k_{t}!\left(n_{t}-k_{t}\right)!}\,p_{t}^{k_{t}}\left(1-p_{t}\right)^{n_{t}-k_{t}},
\]
which is the information provided by the survey results from time period $t$.
The right column displays
\[
f\left(p_{t}\mid\sigma,n_{0},k_{0},\ldots,n_{t},k_{t}\right),
\]
that is, the inferred probability density for $p_{t}$ given known values for $\sigma$ and survey results up to and including time step $t$.
Here are some things to note about Figure 4:
- Each density curve in the right column is obtained by taking the density curve in the left column, multiplying it by the likelihood curve, and rescaling the result so that the area under the density curve is 1. That is,
$$
f\left(p_{t}\mid\sigma,n_{0},k_{0},\ldots,n_{t-1},k_{t-1}\right)
$$
acts like a prior distribution for $p_{t}$, which is updated by the new data to obtain a posterior distribution
$$
f\left(p_{t}\mid\sigma,n_{0},k_{0},\ldots,n_{t-1},k_{t-1},n_{t},k_{t}\right).
$$
In the right column this prior is shown in gray for comparison.
- The density curves in the right column are narrower than the corresponding likelihood curves in the middle column. This increase in precision is what we gain from the informative prior in the left column.
- Except for row 0, each density curve in the left column is obtained by taking the density curve in the right column from the previous row (time step), $f\left(p_{t-1}\mid\sigma,n_{0},k_{0},\ldots,n_{t-1},k_{t-1}\right)$, and making it a bit wider. (This necessarily lowers the peak of the curve, as the area under the curve must still be 1.) To understand this, recall that the left column is the inferred posterior for $p_{t}$ given only the data up to the previous time step $t-1$. Then all we know about $\alpha_{t}=\mathrm{logit}\left(p_{t}\right)$ is whatever we know about $\alpha_{t-1}$, plus the fact that $\alpha_{t}$ differs from $\alpha_{t-1}$ by a random amount that has a variance of $\sigma^{2}$:
$$
\alpha_{t}\sim\mathrm{Normal}\left(\alpha_{t-1},\sigma\right)
$$
The density curve $f\left(p_{t-1}\mid\sigma,n_{0},k_{0},\ldots,n_{t-1},k_{t-1}\right)$ is shown in gray in the left column for comparison. (The sketch below implements one such widen-and-update cycle.)
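Here is a small R sketch of one such cycle (widen through the random-walk step, then multiply by the likelihood and renormalize) on a grid of $p$ values. The data values, grid, and parameter settings are invented, and $\sigma$ is treated as known, as in Figure 4.

```r
# One widen-and-update cycle from Figure 4, on a grid of p values.
# Data, grid, and parameter values are invented; sigma is treated as known.
p  <- seq(0.001, 0.999, length.out = 999)
dp <- p[2] - p[1]
sigma   <- 0.1
s_alpha <- 1.8

# Posterior after time 0 (k_0 = 520 of n_0 = 1000), computed directly:
post0 <- dnorm(qlogis(p), 0, s_alpha) / (p * (1 - p)) * dbinom(520, 1000, p)
post0 <- post0 / sum(post0 * dp)

# Left column: widen post0 through the random-walk step (integrate over p_{t-1}).
f4 <- outer(p, p, function(pt, prev)
  dnorm(qlogis(pt), qlogis(prev), sigma) / (pt * (1 - pt)))
prior1 <- as.vector(f4 %*% (post0 * dp))

# Middle column: likelihood of the new observation (k_1 = 540 of n_1 = 1000).
lik1 <- dbinom(540, 1000, p)

# Right column: multiply prior by likelihood and renormalize.
post1 <- prior1 * lik1
post1 <- post1 / sum(post1 * dp)

matplot(p, cbind(post0, prior1, post1), type = "l", lty = 1,
        xlab = "p", ylab = "density")  # note prior1 is wider and lower than post0
```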
Of course, in reality we don’t know $\sigma$ and must infer it along with everything else. Figure 5 repeats the analysis of Figure 4, except that each plot shows a joint probability distribution for $\sigma$ and $p_{t}$.
- The left column is $f\left(\sigma,p_{t}\mid n_{0},k_{0},\ldots,n_{t-1},k_{t-1}\right)$.
- The middle column is the likelihood $\Pr\left(k_{t}\mid n_{t},p_{t}\right)$.
- The right column is $f\left(\sigma,p_{t}\mid n_{0},k_{0},\ldots,n_{t},k_{t}\right)$.
- $t$ increases from one row to the next, starting with $t=0$ in the top row.
- The $x$ axis is $p_{t}$, the $y$ axis is $\sigma$, and brighter shading indicates a higher probability density.
- The range displayed for $\sigma$ starts at $0$ at the bottom of each plot.
Note the following:
- The “probability clouds” in the right column lean to the right. This is because the data indicate that $p_{t}$ is increasing with $t$ (note that the center of the likelihood moves to the right as $t$ increases), but a near-zero value for $\sigma$ doesn’t allow for much change in $p_{t}$.
- As $t$ increases we get more evidence for a nonzero value of $\sigma$, and the “probability clouds” move up, away from the $x$ axis.
A Computational Example
To help readers understand this model and the Bayesian concepts, I’ve written an R script, binomllmcompexamp.R, that computes various posterior densities and posterior probabilities up to time step 3, exactly as described earlier. The script is intended to be read as well as run; hence the emphasis is on clarity rather than speed. In particular, since some functions compute integrals over upwards of four variables using a generic integration procedure for each point in a density plot, some of the computations can take several minutes. Do not be daunted by the length of the script; it is long because half of it is comments, and little effort has been made to factor out common code, as I judged it better to be explicit than brief. The script is grouped into blocks of short functions that are all minor variations of the same pattern; once you’ve read the first few in a block, you can skim over the rest.
Here are the steps to use the script:
1. Save binomllmcompexamp.R to some folder.
2. Start up R.
3. Issue the command source('folder/binomllmcompexamp.R'), replacing folder with the path to the folder where you saved binomllmcompexamp.R.
4. Issue the command demo(). This runs a demo showing the various commands the script provides.
5. Experiment with these commands, using different data values or intervals.
The following list describes the available commands. In these descriptions, $k_{0},n_{0}$ are the data for time step 0, $k_{1},n_{1}$ are the data for time step 1, and $\Delta_{t}$ is $p_{t}-p_{0}$. In general, any time a command takes parameters k and n, these are the survey data: k is a vector containing the count of positive responses for each time step, and n is a vector containing the sample size for each time step. Note that k[1] contains $k_{0}$ (the count for time step 0), k[2] contains $k_{1}$, etc., and similarly for n.
- print.config(): Prints the values of various control parameters, prior parameters, and data values that you can modify.
- plot.p0.posterior.density(k,n): Plots the posterior density $f\left(p_{0}\mid k_{0},n_{0}\right)$.
- p0.posterior.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq p_{0}\leq b\mid k_{0},n_{0}\right)$.
- plot.p1.posterior.density(k,n): Plots the posterior density $f\left(p_{1}\mid k_{0},n_{0},k_{1},n_{1}\right)$.
- p1.posteror.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq p_{1}\leq b\mid k_{0},n_{0},k_{1},n_{1}\right)$.
- plot.delta1.posterior.density(k,n): Plots the posterior density $f\left(\Delta_{1}\mid k_{0},n_{0},k_{1},n_{1}\right)$.
- delta1.posterior.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq\Delta_{1}\leq b\mid k_{0},n_{0},k_{1},n_{1}\right)$.
- plot.p2.posterior.density(k,n): Plots the posterior density $f\left(p_{2}\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2}\right)$.
- p2.posteror.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq p_{2}\leq b\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2}\right)$.
- plot.delta2.posterior.density(k,n): Plots the posterior density $f\left(\Delta_{2}\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2}\right)$.
- delta2.posterior.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq\Delta_{2}\leq b\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2}\right)$.
- plot.p3.posterior.density(k,n): Plots the posterior density $f\left(p_{3}\mid k_{0},n_{0},\ldots,k_{3},n_{3}\right)$.
- p3.posteror.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq p_{3}\leq b\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2},k_{3},n_{3}\right)$.
- plot.delta3.posterior.density(k,n): Plots the posterior density $f\left(\Delta_{3}\mid k_{0},n_{0},\ldots,k_{3},n_{3}\right)$.
- delta3.posterior.probability(a,b,k,n): Computes the posterior probability $\Pr\left(a\leq\Delta_{3}\leq b\mid k_{0},n_{0},k_{1},n_{1},k_{2},n_{2},k_{3},n_{3}\right)$.
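As an illustration, with four waves of invented data a session might look like this (the data values are hypothetical; the function names are exactly as listed above):

```r
# Hypothetical session; the data values are invented for illustration.
k <- c(520, 540, 570, 590)      # positive responses at time steps 0..3
n <- c(1000, 1000, 1000, 1000)  # sample sizes at time steps 0..3

plot.p3.posterior.density(k, n)            # posterior density of p_3
p3.posteror.probability(0.55, 0.60, k, n)  # Pr(0.55 <= p_3 <= 0.60 | data)
delta3.posterior.probability(0, 1, k, n)   # Pr(p_3 - p_0 between 0 and 1), i.e.
                                           # the probability of an increase since t = 0
```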
References
[mcmc] Brooks, S., A. Gelman, G. Jones, and X.-L. Meng, eds. (2011). Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
[cox] Cox, R. T. (1946). “Probability, frequency and reasonable expectation,” American Journal of Physics 14(1), 1–13.
[timeseries] Durbin, J. and S. J. Koopman (2012). Time Series Analysis by State Space Methods, second edition. Oxford University Press.
[jaynes] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.