last updated: October 24, 2024

Bayesian Statistics


4.2 Two more examples of conjugate pairs

There are several examples of conjugate pairs in the exercises at the end of this chapter. We include a couple more here, the first of which generalizes the calculations in Example 3.2.3.

  • Lemma 4.2.1 (Gamma-Exponential conjugate pair) Let \(n\in \N \). Let \((X,\Lambda )\) be a continuous Bayesian model with model family \(M_\lambda \sim \Exp (\lambda )^{\otimes n}\) and parameter \(\lambda \in (0,\infty )\). Suppose that the prior is \(\Lambda \sim \Gam (\alpha ,\beta )\) and let \(x\in (0,\infty )^n\). Then the posterior is \(\Lambda |_{\{X=x\}}\sim \Gam (\alpha +n,\beta +\sum _1^n x_i)\).

Proof: From Theorem 3.1.2 we have that for \(\lambda \in (0,\infty )\)

\begin{align*} f_{\Lambda |_{\{X=x\}}}(\lambda ) &\propto f_{\Exp (\lambda )^{\otimes n}}(x)f_{\Gam (\alpha ,\beta )}(\lambda ) \\ &\propto \l (\prod _{i=1}^n\lambda e^{-\lambda x_i}\r )\l (\frac {\beta ^\alpha }{\Gamma (\alpha )}\lambda ^{\alpha -1}e^{-\beta \lambda }\r ) \\ &\propto \lambda ^n e^{-\lambda \sum _1^n x_i}\lambda ^{\alpha -1}e^{-\beta \lambda } \\ &\propto \lambda ^{\alpha +n-1}e^{-\lambda (\beta +\sum _1^n x_i)}, \end{align*} By Lemma 1.2.5 we recognize this p.d.f. as \(\Lambda |_{\{X=x\}}\sim \Gam (\alpha +n,\beta +\sum _1^n x_i)\).   ∎
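The algebra above is easy to sanity-check numerically: up to the constant hidden by \(\propto \), the likelihood times the prior should be a fixed multiple of the claimed \(\Gam (\alpha +n,\beta +\sum _1^n x_i)\) density at every \(\lambda \). A minimal sketch in Python, with illustrative values of \(\alpha \), \(\beta \) and \(x\) of our own choosing:

```python
import math

# Illustrative values (our own choice, not from the text): a Gamma(2, 1)
# prior and a small hypothetical sample of Exp(lambda) observations.
alpha, beta = 2.0, 1.0
x = [0.5, 1.2, 0.3, 2.0]
n, sx = len(x), sum(x)

def gamma_pdf(lam, a, b):
    """Density of Gamma(a, b), with b the rate parameter."""
    return b ** a / math.gamma(a) * lam ** (a - 1) * math.exp(-b * lam)

def likelihood_times_prior(lam):
    """f_{Exp(lam)^(n)}(x) * f_{Gamma(alpha,beta)}(lam): the posterior up to a constant."""
    return lam ** n * math.exp(-lam * sx) * gamma_pdf(lam, alpha, beta)

# If Lemma 4.2.1 is right, this ratio is the same constant at every lambda.
lams = [0.2, 0.7, 1.3, 2.5]
ratios = [likelihood_times_prior(l) / gamma_pdf(l, alpha + n, beta + sx)
          for l in lams]
assert max(ratios) - min(ratios) < 1e-12
```

The constant ratio is exactly the factor that \(\propto \) let us ignore: \(\frac {\beta ^\alpha }{\Gamma (\alpha )}\cdot \frac {\Gamma (\alpha +n)}{(\beta +\sum x_i)^{\alpha +n}}\).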

We’ll now do a more complicated example, in which the constant of proportionality would change multiple times if we were to write it in; thanks to \(\propto \), we won’t. In Section 4.5 we will look at Bayesian inference for the normal distribution where both \(\mu \) and \(\sigma \) are unknown parameters. For now we view \(\sigma \) as fixed, so the mean \(\mu \) is the only parameter.

  • Lemma 4.2.2 (Normal-Normal conjugate pair) Let \(n\in \N \), \(u\in \R \) and \(\sigma ,s>0\). Let \((X,\Theta )\) be a continuous Bayesian model with model family \(M_{\theta }\sim \Normal (\theta ,\sigma ^2)^{\otimes n}\) and parameter \(\theta \in \R \). Suppose that the prior is \(\Theta \sim \Normal (u,s^2)\) and let \(x\in \R ^n\). Then

    \begin{equation} \label {eq:conj_normal_normal} \Theta |_{\{X=x\}}\sim \Normal \l ( \frac {\frac {1}{\sigma ^2}\sum _{1}^n x_i+\frac {u}{s^2}}{\frac {n}{\sigma ^2}+\frac {1}{s^2}}\,,\; \frac {1}{\frac {n}{\sigma ^2}+\frac {1}{s^2}} \r ). \end{equation}

Proof: From Theorem 3.1.2 we have that for \(\theta \in \R \)

\begin{align*} f_{\Theta |_{\{X=x\}}}(\theta ) &\propto f_{\Normal (\theta ,\sigma ^2)^{\otimes n}}(x)f_{\Normal (u,s^2)}(\theta ) \\ &\propto \l (\prod _{i=1}^n\frac {1}{\sigma \sqrt {2\pi }}e^{-\frac {(x_i-\theta )^2}{2\sigma ^2}}\r ) \l (\frac {1}{s\sqrt {2\pi }}e^{-\frac {(\theta -u)^2}{2s^2}}\r ) \\ &\propto \exp \l (-\frac {1}{2\sigma ^2}\sum _{i=1}^n(x_i-\theta )^2-\frac {1}{2s^2}(\theta -u)^2\r ) \\ &\propto \exp \big (-\mc {Q}(\theta )\big ) \end{align*} where

\begin{equation*} \mc {Q}(\theta ) =\theta ^2\stackrel {A}{\overbrace {\l (\frac {n}{2\sigma ^2}+\frac {1}{2s^2}\r )}} -2\theta \stackrel {B}{\overbrace {\l (\frac {1}{2\sigma ^2}\sum _{i=1}^n x_i+\frac {u}{2s^2}\r )}} +\stackrel {C}{\overbrace {\l (\frac {1}{2\sigma ^2}\sum _{i=1}^n x_i^2+\frac {u^2}{2s^2}\r )}}. \end{equation*}

Completing the square in \(\mc {Q}(\theta )\), using the general form of completing the square (which you can find on the reference sheet in Appendix A), we have that

\begin{align*} f_{\Theta |_{\{X=x\}}}(\theta ) &\propto \exp \l (-A\l (\theta -\frac {B}{A}\r )^2 - C + \frac {B^2}{A}\r ) \\ &\propto \exp \l (-\frac {1}{2(\tfrac {1}{2A})}\l (\theta -\tfrac {B}{A}\r )^2\r ). \end{align*} By Lemma 1.2.5 we recognize this p.d.f. as \(\Theta |_{\{X=x\}}\sim \Normal \l (\frac {B}{A},\frac {1}{2A}\r )\). We have

\[\frac {B}{A}=\frac {\frac {1}{\sigma ^2}\sum _{i=1}^n x_i+\frac {u}{s^2}}{\frac {n}{\sigma ^2}+\frac {1}{s^2}} \qquad \text { and }\qquad \frac {1}{2A}=\frac {1}{\frac {n}{\sigma ^2}+\frac {1}{s^2}},\]

as required. Note that in the first term we have cancelled a factor of \(\frac 12\) from both \(B\) and \(A\).   ∎
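The closed form (4.1) translates directly into code. Here is a small helper (the function name is our own), taking the data, the fixed model variance \(\sigma ^2\), and the prior parameters \(u\) and \(s^2\):

```python
def normal_normal_posterior(x, sigma2, u, s2):
    """Posterior N(mean, var) for Theta | X = x, as in Lemma 4.2.2 / (4.1).

    x: observed data; sigma2: fixed model variance;
    u, s2: mean and variance of the prior Theta ~ N(u, s2).
    """
    n = len(x)
    precision = n / sigma2 + 1 / s2                  # this is 2A in the proof
    mean = (sum(x) / sigma2 + u / s2) / precision    # = B / A
    var = 1 / precision                              # = 1 / (2A)
    return mean, var

# With no data the posterior is just the prior:
assert normal_normal_posterior([], 1.0, 0.0, 4.0) == (0.0, 4.0)
```

Note how, for large \(n\), the \(n/\sigma ^2\) term dominates the precision, so the returned variance is approximately \(\sigma ^2/n\).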

From (4.1) we can see that the variance will decrease as \(n\to \infty \), and that for large \(n\) it will be \(\approx \frac {\sigma ^2}{n}\). This agrees with our experience in Example 4.2.4, in which case we had \(\sigma ^2=0.4^2\), giving variance \(\approx \frac {0.4^2}{n}\). Recall that in our discussion at the end of Example 4.2.4 we noted that the posterior variance had become very small after only \(10\) observations, despite the prior having a reasonably large variance.

In the formulae we obtained in (4.1), each time a variance appears, for both \(\sigma ^2\) and \(s^2\), it appears on the bottom of a fraction. This suggests that we might obtain nicer formulae if we instead parameterize the normal distribution as \(\Normal (\mu ,\frac {1}{\tau })\), where by comparison to our usual parametrization we have written \(\tau =\frac {1}{\sigma ^2}\). It is common to do this in Bayesian statistics and the variable \(\tau \) is then known as precision. We will do this, for example, in Exercise 4.4 which considers the Normal distribution with fixed mean and unknown variance.
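Concretely, writing \(\tau =\frac {1}{\sigma ^2}\) for the model precision and \(\tau _0=\frac {1}{s^2}\) for the prior precision (the symbol \(\tau _0\) is our own shorthand), (4.1) becomes

\[\Theta |_{\{X=x\}}\sim \Normal \l (\frac {\tau \sum _1^n x_i+\tau _0 u}{n\tau +\tau _0}\,,\;\frac {1}{n\tau +\tau _0}\r ),\]

so precisions simply add: the posterior precision is the prior precision plus \(n\) copies of the model precision, and the posterior mean is a precision-weighted average of the data and the prior mean.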

You can find a table of conjugate pairs on the reference sheets in Appendix A, below the tables of named distributions. It contains all of the examples within this chapter; there is no need for you to memorize the formulae.

  • Remark 4.2.3 You are now ready to start on all of the exercises for this chapter.

  • Example 4.2.4 Speed cameras are used to measure the speed of individual cars. They do so by recording two images of a moving car, with the second image being captured a fixed time after the first image. By analysing the two images the camera can tell how far the car has travelled in that time, which gives an estimate of its speed. This is not an easy process and there is some degree of error involved.

    Suppose that we are trying to assess whether the manufacturer’s description of the error is accurate. The manufacturer claims that, if the true speed is \(30\) miles per hour, then the speed recorded by the camera can be modelled as a \(\Normal (30,0.16)\) random variable.

    We construct an experiment to test this. We set up a camera and drive \(10\) cars past it, each travelling at \(30\) miles per hour (let us assume this can be done accurately, which is not unrealistic using modern cruise control). The camera records speeds of

    \begin{equation} \label {eq:camera_data_1} 30.9,\quad 29.9,\quad 30.1,\quad 30.3,\quad 29.7,\quad 30.1,\quad 30.1,\quad 29.2,\quad 30.6,\quad 30.4. \end{equation}

    We record this data as \(x=(x_i)_{i=1}^{10}\).

    We will come back to this example in future but for now let us assume, for simplicity, that we believe the manufacturer’s claim that the data will have a normal distribution with variance \(0.16\). We want to see if the mean matches up with the figure claimed. We’ll use the model family

    \[M_\theta \sim N(\theta ,0.16)^{\otimes 10}\eqd N(\theta ,0.4^2)^{\otimes 10},\]

    where the mean \(\theta \) is an unknown parameter. The parameter space of this model is \(\Pi =\R \) and the range of the model is \(\R ^{10}\).

    For our prior we will use \(\Theta \sim \Normal (30,5^2)\) which has p.d.f.

    \[f_\Theta (\theta )=\frac {1}{5\sqrt {2\pi }}e^{-(\theta -30)^2/50}.\]

    We will study techniques for choosing the prior in Chapter 5. For now our motivation is that we expect the true value for \(\theta \) is about \(30\), but we don’t have a lot of confidence in that, so we pick a fairly large value for the variance.

    • Remark 4.2.5 It is always sensible to think about what property of reality your ‘true’ parameter value represents. In this case, the true value of \(\theta \) is the average speed that would be recorded by the camera, for a car that was travelling at exactly \(30\)mph. We don’t know this value.

    By Lemma 4.2.2 the posterior distribution is

    \[\Theta |_{\{X=x\}}\sim \Normal \l ( \frac {\frac {1}{0.16}\sum _1^{10}x_i+\frac {30}{5^2}}{\frac {10}{0.16}+\frac {1}{5^2}}, \frac {1}{\frac {10}{0.16}+\frac {1}{5^2}} \r ) \eqd \Normal \l (30.13, 0.13^2\r ). \]

    Here we fill in \(\sum _1^{10} x_i=301.3\) and round the parameters to two decimal places. As in our previous examples, let us compare the prior \(\Theta \) to the posterior \(\Theta |_{\{X=x\}}\).
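These numbers are quick to reproduce. A short Python check, plugging the data (4.2) and the prior parameters into (4.1):

```python
# Speed-camera data from (4.2) and the N(30, 5^2) prior.
x = [30.9, 29.9, 30.1, 30.3, 29.7, 30.1, 30.1, 29.2, 30.6, 30.4]
sigma2, u, s2 = 0.4 ** 2, 30.0, 5.0 ** 2

precision = len(x) / sigma2 + 1 / s2        # = 10/0.16 + 1/25 = 62.54
post_mean = (sum(x) / sigma2 + u / s2) / precision
post_sd = (1 / precision) ** 0.5

print(round(sum(x), 1), round(post_mean, 2), round(post_sd, 2))
# -> 301.3 30.13 0.13
```

Note how little the prior terms \(\frac {u}{s^2}=1.2\) and \(\frac {1}{s^2}=0.04\) contribute next to the data terms \(1883.125\) and \(62.5\).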

    (image)

    It is difficult to show them on the same axis, so we have had to miss out the top part of the curve \(f_{\Theta |_{\{X=x\}}}\). The outcome is similar to Example 3.2.3, in that the posterior has focused in on a small region. Given our data (4.2) this seems sensible. The influence of the prior has largely been forgotten.

    We were originally interested in comparing the behaviour the manufacturer claimed the camera would have with the results of our experiment. To do so we should compare \(\Normal (30,0.16)\), which is what the manufacturer claimed our experiment should observe, with the predictive distribution from our data analysis. As in Example 3.2.3, what we want here is the predictive distribution for a single data point (i.e. the case \(n=1\)). For that, our model family is \(N(\theta ,0.4^2)\), and our posterior distribution for the unknown parameter \(\theta \) is \(\Normal (30.13, 0.13^2)\), which gives the p.d.f. of the predictive distribution for a single datapoint as

    \begin{align*} f_{\text {predictive}}(x) &=\int _\R f_{\Normal (\theta , 0.4^2)}(x)f_{\Normal (30.13, 0.13^2)}(\theta )\,d\theta \end{align*} which we can evaluate numerically\(^{1}\). We obtain
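The integral is straightforward to approximate with a quadrature rule. A sketch (trapezoidal rule, plain Python, with the posterior variance computed directly from (4.1)), which also confirms the footnote's remark that the predictive density is itself normal, with variance \(0.4^2\) plus the posterior variance:

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

sigma2 = 0.4 ** 2                     # model variance
post_mean = 30.13                     # posterior mean (rounded, from above)
post_var = 1 / (10 / 0.16 + 1 / 25)   # posterior variance, from (4.1)

def predictive_pdf(x, steps=4000):
    """Trapezoidal approximation of the integral of
    f_{N(theta, 0.4^2)}(x) * f_posterior(theta) over theta."""
    sd = math.sqrt(post_var)
    lo, hi = post_mean - 8 * sd, post_mean + 8 * sd   # posterior is negligible outside
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_pdf(x, theta, sigma2) * normal_pdf(theta, post_mean, post_var)
    return total * h

# Agrees pointwise with the closed form N(post_mean, sigma2 + post_var):
for x in (29.5, 30.0, 30.13, 30.7):
    assert abs(predictive_pdf(x) - normal_pdf(x, post_mean, sigma2 + post_var)) < 1e-4
```

The closed form reflects a general fact about mixing normals over a normal posterior: the model variance and the posterior variance add.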

    (image)

    They are quite similar. If our predictive distribution is a true reflection of the camera’s behaviour, it suggests that the camera may be overestimating speeds by a small amount. Based on the \(10\) datapoints that we have, we would need to do some statistical testing before saying anything more, and we’ll have to wait until Chapter 7 for that.

    We’ll return to this data again in Example 4.5.3, where we will also treat the variance as an unknown parameter.

1 In fact, some further calculation would reveal that this is also the p.d.f. of a normal distribution.