Here is a review of some concepts from probability theory and calculus that are especially important in Bayesian statistics.
One-dimensional integrals and probability densities
An integral can be thought of as a sum over a grid of values, multiplied by the spacing of the grid. This is useful in combination with the notion of a probability density function when you want to compute the probability that a random variable lies in some interval.
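If $p$ is a probability density function for some variable $x$, and $a \le b$, then the probability that $a \le x \le b$ is the integral

$$\Pr(a \le x \le b) = \int_a^b p(x)\,dx;$$

this may be thought of as a short-hand notation for

$$\sum_{i=1}^{n} p(x_i)\,\Delta$$

for some very large number $n$, where

$$\Delta = \frac{b - a}{n}, \qquad x_i = a + \left(i - \tfrac{1}{2}\right)\Delta.$$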
If we chop the interval from $a$ to $b$ into $n$ equal-sized subintervals, then $x_i$ is the midpoint of the $i$-th subinterval, and $\Delta$ is the width of each of these subintervals. The values $x_1, \ldots, x_n$ form a grid of values, and $\Delta$ is the spacing of the grid. If the grid spacing is so fine that $p(x)$ is nearly constant over each of these subintervals, then the probability that $x$ lies in subinterval $i$ is closely approximated as $p(x_i)\,\Delta$. We get $\Pr(a \le x \le b)$ by adding up the probabilities for each of the subintervals. Figure 1 illustrates this process for particular values of $a$ and $b$ and a variable $x$ with a normal distribution with mean 0 and variance 1.
The exact probability is the integral $\int_a^b \phi(x)\,dx$, which is the area under the normal density curve between $a$ and $b$. ($\phi(x)$ is the probability density for the normal distribution at $x$.) The figure illustrates an approximate computation of the integral using $n = 5$. The subintervals have width $\Delta = (b - a)/5$ and their midpoints are $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$. The five rectangles each have width $\Delta$, and their heights are $\phi(x_1)$, $\phi(x_2)$, $\phi(x_3)$, $\phi(x_4)$, and $\phi(x_5)$. The sum $\sum_{i=1}^{5} \phi(x_i)\,\Delta$ is the total area of the five rectangles combined; this sum is an approximation of $\int_a^b \phi(x)\,dx$. As $n$ gets larger the rectangles get thinner, and the approximation becomes closer to the true area under the curve.
Please note that the process just described defines what the integral means, but in practice more efficient algorithms are used for the actual computation.
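The midpoint sum described above can be sketched in a few lines of Python. This is only an illustration: the endpoints 0 and 2 are assumed values chosen for the example (they are not taken from Figure 1), and the function names are made up for this sketch. The exact answer is computed from the error function for comparison.

```python
import math

def normal_pdf(x):
    """Density of the normal distribution with mean 0 and variance 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_sum(f, a, b, n):
    """Approximate the integral of f from a to b using n midpoint rectangles."""
    delta = (b - a) / n                  # width of each subinterval
    total = 0.0
    for i in range(1, n + 1):
        x_i = a + (i - 0.5) * delta      # midpoint of the i-th subinterval
        total += f(x_i) * delta          # area of the i-th rectangle
    return total

def normal_cdf(x):
    """Exact normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

a, b = 0.0, 2.0                          # assumed endpoints, for illustration only
exact = normal_cdf(b) - normal_cdf(a)
for n in (5, 50, 500):
    approx = midpoint_sum(normal_pdf, a, b, n)
    print(n, approx, abs(approx - exact))
```

The printed error shrinks as $n$ grows, which is the sense in which the sum defines the integral.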
The probability that $x \ge a$ is

$$\Pr(x \ge a) = \int_a^{\infty} p(x)\,dx;$$

you can think of this integral as being $\int_a^{B} p(x)\,dx$ for some very large value $B$, chosen to be so large that the probability density $p(x)$ is vanishingly small whenever $x > B$.
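As a rough numerical illustration of replacing an infinite limit with a very large finite one, the sketch below approximates an upper-tail probability for a standard normal variable using increasingly large cutoffs. The lower limit 1 and the cutoffs 2, 4, and 8 are assumed values picked for the example, and the helper names are hypothetical.

```python
import math

def normal_pdf(x):
    """Density of the normal distribution with mean 0 and variance 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_sum(f, a, b, n):
    """Approximate the integral of f from a to b using n midpoint rectangles."""
    delta = (b - a) / n
    return sum(f(a + (i - 0.5) * delta) * delta for i in range(1, n + 1))

a = 1.0                                            # assumed lower limit
exact = 0.5 * math.erfc(a / math.sqrt(2.0))        # exact Pr(x >= a)
for big in (2.0, 4.0, 8.0):                        # candidate "very large" cutoffs
    approx = midpoint_sum(normal_pdf, a, big, 10000)
    print(big, approx, exact - approx)             # shortfall = mass beyond the cutoff
```

The shortfall is the probability mass beyond the cutoff, so it becomes negligible once the cutoff is far enough into the tail.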
Multi-dimensional integrals and probability densities
We can also have joint probability distributions for two variables $x$ and $y$ with a corresponding joint probability density function $p(x, y)$. If $a_x \le b_x$ and $a_y \le b_y$, then the probability that $a_x \le x \le b_x$ and $a_y \le y \le b_y$ is the nested integral

$$\int_{a_x}^{b_x} \int_{a_y}^{b_y} p(x, y)\,dy\,dx;$$

analogously to the one-dimensional case, this two-dimensional integral may be thought of as the following procedure:
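- Choose very large numbers $n_x$ and $n_y$.
- Chop the interval from $a_x$ to $b_x$ into $n_x$ equal subintervals, and chop the interval from $a_y$ to $b_y$ into $n_y$ equal subintervals, so that the rectangular region defined by $a_x \le x \le b_x$ and $a_y \le y \le b_y$ is divided into $n_x n_y$ very small boxes.
- Let $(x_i, y_j)$ be the midpoint of the box at the intersection of the $i$-th row and $j$-th column. These midpoints form a two-dimensional grid.
- Let $\Delta_x = (b_x - a_x)/n_x$ and $\Delta_y = (b_y - a_y)/n_y$; note that $\Delta_x$ is the width and $\Delta_y$ is the height of each of the very small boxes.
- Sum up the values $p(x_i, y_j)\,\Delta_x\,\Delta_y$ for all $i$ and $j$, $1 \le i \le n_x$ and $1 \le j \le n_y$.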
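The procedure above can be sketched directly as a double loop. The joint density used here is an assumption made for the example: a bivariate normal with zero means, unit variances, and a correlation parameter `rho`. It is not the density from the figures, and the function names are hypothetical.

```python
import math

def bivariate_normal_pdf(x, y, rho):
    """Bivariate normal density with zero means, unit variances, correlation rho."""
    z = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * z) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def grid_sum_2d(f, ax, bx, ay, by, nx, ny):
    """Sum f at the box midpoints, each weighted by the box area."""
    dx = (bx - ax) / nx                      # width of each box
    dy = (by - ay) / ny                      # height of each box
    total = 0.0
    for i in range(1, nx + 1):
        x_i = ax + (i - 0.5) * dx
        for j in range(1, ny + 1):
            y_j = ay + (j - 0.5) * dy
            total += f(x_i, y_j) * dx * dy
    return total

# With rho = 0 the two variables are independent, so the probability of the
# square factors into a product of two one-dimensional probabilities.
independent = lambda x, y: bivariate_normal_pdf(x, y, 0.0)
approx = grid_sum_2d(independent, -1.0, 1.0, -1.0, 1.0, 200, 200)
exact = math.erf(1.0 / math.sqrt(2.0)) ** 2
print(approx, exact)

# A correlated density should still integrate to about 1 over a wide region.
correlated = lambda x, y: bivariate_normal_pdf(x, y, 0.5)
print(grid_sum_2d(correlated, -8.0, 8.0, -8.0, 8.0, 200, 200))
```

The independent case is used as a check only because its exact answer is easy to write down; the same function works for any joint density.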
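Figure 2 shows an example of a joint probability density for $x$ and $y$, displayed using shades of gray, with lighter colors indicating a greater probability density. The joint distribution shown is a multivariate normal centered at $x = \mu_x$ and $y = \mu_y$, with covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_x^2 & \rho\,\sigma_x\sigma_y \\ \rho\,\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}.$$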
Figure 3 illustrates the process of computing $\Pr(a_x \le x \le b_x,\ a_y \le y \le b_y)$ when $x$ and $y$ have the above joint density, for a particular choice of $n_x$ and $n_y$.
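We can, of course, generalize this to joint distributions over three, four, or more variables $x_1, \ldots, x_n$, with a corresponding joint probability density function $p(x_1, \ldots, x_n)$. If $a_1 \le b_1$ and … and $a_n \le b_n$, then the probability that $a_1 \le x_1 \le b_1$ and … and $a_n \le x_n \le b_n$ is the nested integral

$$\int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} p(x_1, \ldots, x_n)\,dx_n \cdots dx_1,$$

and we can think of this as choosing very large numbers $m_1, \ldots, m_n$ and forming an $n$-dimensional grid with $m_1 \times \cdots \times m_n$ cells, evaluating $p(x_1, \ldots, x_n)$ at the midpoint of each of these cells, and so on.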
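A minimal sketch of the same idea in any number of dimensions is shown below, using `itertools.product` to walk over the cell midpoints. The density (independent standard normals) and the grid sizes are assumptions chosen so that the exact answer is known; the function names are hypothetical.

```python
import itertools
import math

def grid_sum_nd(f, lower, upper, counts):
    """Sum f at every cell midpoint of an n-dimensional grid, times the cell volume."""
    deltas = [(b - a) / m for a, b, m in zip(lower, upper, counts)]
    cell_volume = math.prod(deltas)
    # Midpoints along each axis; the grid is their Cartesian product.
    axes = [[a + (i - 0.5) * d for i in range(1, m + 1)]
            for a, d, m in zip(lower, deltas, counts)]
    total = 0.0
    for midpoint in itertools.product(*axes):
        total += f(*midpoint) * cell_volume
    return total

def independent_normal_pdf(*xs):
    """Joint density of independent standard normal variables (assumed example)."""
    return math.prod(math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in xs)

counts = (40, 40, 40)
approx = grid_sum_nd(independent_normal_pdf, (-1.0,) * 3, (1.0,) * 3, counts)
exact = math.erf(1.0 / math.sqrt(2.0)) ** 3
print(math.prod(counts), approx, exact)   # number of cells, approximation, exact value
```

The number of cells is the product of the per-axis counts, so it grows very quickly with the number of variables; this is one reason the grid picture is a definition rather than a practical algorithm in high dimensions.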
Conditional probability densities
Suppose that we have variables $x$ and $y$ with a joint distribution that has density function $p(x, y)$, describing our prior information about $x$ and $y$. We then find out the value of $y$, perhaps via some sort of a measurement. Our updated information about $x$ is summarized in the conditional distribution for $x$, given that $y = v$; the density function for this conditional distribution is

$$p(x \mid y = v) = \frac{p(x, v)}{\int_{-\infty}^{\infty} p(x', v)\,dx'}.$$
For example, if $x$ and $y$ have the joint distribution described in Figure 3, then Figure 4 shows the conditional density function for $x$ given one value of $y$, and Figure 5 shows the conditional density function for $x$ given a different value of $y$. You can see from the density plot that $x$ and $y$ are positively correlated, and this explains why the density curve for $x$ given the larger value of $y$ is shifted to the right relative to the density curve for $x$ given the smaller value of $y$.
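In general, if we have a joint probability distribution for $n$ variables $x_1, \ldots, x_n$, with a joint probability density function $p(x_1, \ldots, x_n)$, and we find out the values of, say, $x_{m+1}, \ldots, x_n$, then we write

$$p(x_1, \ldots, x_m \mid x_{m+1}, \ldots, x_n)$$

for the conditional density function for $x_1, \ldots, x_m$, given that $x_{m+1}, \ldots, x_n$ take their observed values, and its formula is

$$p(x_1, \ldots, x_m \mid x_{m+1}, \ldots, x_n) = \frac{p(x_1, \ldots, x_n)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x'_1, \ldots, x'_m, x_{m+1}, \ldots, x_n)\,dx'_1 \cdots dx'_m}.$$

One way of understanding this formula is to realize that if you integrate the conditional density over all possible values of $x_1, \ldots, x_m$, the result must be 1 (one of those possible values must be the actual value), and this is what the denominator in the above formula guarantees.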
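The role of the denominator can be seen numerically by evaluating a joint density along a slice at a fixed observed value and dividing by a grid approximation of its integral. The joint density below is an assumed example (a bivariate normal with zero means, unit variances, and correlation 0.6), not the one from the figures, and the helper names are hypothetical.

```python
import math

def joint_pdf(x, y, rho=0.6):
    """Assumed example: bivariate normal, zero means, unit variances, correlation rho."""
    z = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return math.exp(-0.5 * z) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

def conditional_density_on_grid(joint, v, lo, hi, n):
    """Return grid midpoints, approximate conditional density values given y = v, and spacing."""
    delta = (hi - lo) / n
    xs = [lo + (i - 0.5) * delta for i in range(1, n + 1)]
    numer = [joint(x, v) for x in xs]        # joint density along the slice y = v
    denom = sum(numer) * delta               # grid approximation of the normalizing integral
    return xs, [p / denom for p in numer], delta

def mean_on_grid(xs, density, delta):
    """Grid approximation of the mean of a density tabulated at xs."""
    return sum(x * p * delta for x, p in zip(xs, density))

for v in (-1.0, 1.0):
    xs, cond, delta = conditional_density_on_grid(joint_pdf, v, -10.0, 10.0, 2000)
    print(v, sum(cond) * delta, mean_on_grid(xs, cond, delta))
```

Each conditional density sums to 1 on the grid, and with a positive correlation the conditional mean moves to the right as the observed value increases.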
Suppose that we have two variables $x$ and $y$ that have a known deterministic relation that allows us to always obtain the value of $y$ from the value of $x$, and likewise obtain the value of $x$ from the value of $y$. Some examples are
- $y = 2x$ and $x = y/2$,
- $y = e^x$ and $x = \ln y$, or
- $y = x^3$ and $x = y^{1/3}$,
but not $y = x^2$ when $x$ may be positive or negative: if we are given $y$ then $x$ could be either $+\sqrt{y}$ or $-\sqrt{y}$.
If we have a probability density $p_x(x)$ for $x$ then we can turn it into a probability density $p_y(y)$ for $y$ using the change of variables formula:

$$p_y(y) = p_x(x)\left|\frac{dx}{dy}\right|,$$

where $x$ on the right-hand side is the value of $x$ determined by $y$.

In the above formula, $dx/dy$ stands for the derivative of $x$ with respect to $y$; this measures how much $x$ changes relative to a change in $y$, and in general it depends on $y$. Loosely speaking, imagine changing $y$ by a vanishingly small amount $dy$, and let $dx$ be the corresponding (vanishingly small) change in $x$; then the derivative of $x$ with respect to $y$ is the ratio $dx/dy$.
Here are some common changes of variable:
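- $y = c\,x$, where $c$ is some nonzero constant. Then
  $$p_y(y) = \frac{1}{|c|}\,p_x\!\left(\frac{y}{c}\right).$$
- $y = e^x$. Then
  $$p_y(y) = \frac{1}{y}\,p_x(\ln y).$$
- $y = \ln x$, where $x > 0$. Then
  $$p_y(y) = e^{y}\,p_x(e^{y}).$$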
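One way to sanity-check a transformed density is to confirm that it assigns the same probability to an interval of $y$ as the original density assigns to the matching interval of $x$. The sketch below does this for a scaling, an exponential, and a logarithmic change of variable. The starting densities (standard normal, and an exponential distribution with rate 1 for the logarithm case) and all function names are assumptions made for the example.

```python
import math

def normal_pdf(x):
    """Density of the normal distribution with mean 0 and variance 1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def midpoint_sum(f, a, b, n):
    """Approximate the integral of f from a to b using n midpoint rectangles."""
    delta = (b - a) / n
    return sum(f(a + (i - 0.5) * delta) * delta for i in range(1, n + 1))

def density_scaled(y, c=-3.0):
    """Density of y = c*x when x is standard normal: p_x(y/c) / |c|."""
    return normal_pdf(y / c) / abs(c)

def density_exp(y):
    """Density of y = exp(x) when x is standard normal: p_x(ln y) / y."""
    return normal_pdf(math.log(y)) / y

def density_log(y):
    """Density of y = ln(x) when x is exponential with rate 1: e^y * p_x(e^y)."""
    return math.exp(y) * math.exp(-math.exp(y))

n = 2000
prob_x = midpoint_sum(normal_pdf, -1.0, 1.0, n)                  # Pr(-1 <= x <= 1)
prob_scaled = midpoint_sum(density_scaled, -3.0, 3.0, n)         # same event for y = -3x
prob_exp = midpoint_sum(density_exp, math.exp(-1.0), math.e, n)  # same event for y = e^x
prob_log = midpoint_sum(density_log, 0.0, 1.0, n)                # Pr(0 <= ln x <= 1)
exact_log = math.exp(-1.0) - math.exp(-math.e)                   # Pr(1 <= x <= e), rate-1 exponential
print(prob_x, prob_scaled, prob_exp)
print(prob_log, exact_log)
```

The first three numbers agree because they describe the same event in different coordinates; the last two agree for the same reason.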