Bayesian Statistics
8.4 Gibbs sampling
Recall that our parameter \(\theta \) may really be a vector of parameters, as in Section 4.5 where we considered the model \(M_{(\mu ,\sigma )}\sim \Normal (\mu ,\sigma ^2)\) with \(\theta =(\mu ,\sigma )\in \R \times (0,\infty )\). With the numerical techniques introduced in Section 8.3.1 it is possible to handle very complex models, which might have many parameters \(\theta =(\theta _1,\ldots ,\theta _d)\in \R ^d\).
For \(d=1\) or \(d=2\) the MH algorithm is effective. For much larger \(d\), what tends to happen is that the sequence \((y_m)\) generated by the MH algorithm takes a very long time to explore the parameter space \(\Pi \sw \R ^d\), which means that it takes longer to converge to (the distribution of) \(\Theta |_{\{X=x\}}\). This happens simply because there is a lot of space to explore inside \(\R ^d\) when \(d\) is large; this phenomenon is often known as the curse of dimensionality.
-
Example 8.4.1 Imagine you have parked your car inside a multi-storey car park and then forgotten where you’ve parked it. If you know which level your car is on then you only have to search that level, and you will find your car in minutes. If you don’t know which level, it will take you much longer. The first case is exploring \(d=2\), the second is \(d=3\). Actually, to make this example properly match the difference between \(d=2\) and \(d=3\), the car park should have the same number of floors as there are parking spaces along the side-length of each floor! The problem only gets worse as \(d\) increases further.
One strategy for working around this problem is to update the parameters \((\theta _1,\ldots ,\theta _d)\) one at a time. That is, we would first change \(\theta _1\) while keeping \((\theta _2,\ldots ,\theta _d)\) fixed, then we would change \(\theta _2\) while keeping \((\theta _1,\theta _3,\ldots ,\theta _d)\) fixed, and so on. After updating \(\theta _d\) we would then return to \(\theta _1\). It is helpful to introduce some notation for this: we write \(\theta _{-i}=(\theta _1,\ldots ,\theta _{i-1},\theta _{i+1},\ldots ,\theta _d)\) for the vector \(\theta \) with the \(\theta _i\) term removed. We use the same notation for the random vector \(\Theta \), e.g. \(\Theta _{-i}\).
-
Remark 8.4.2 In reality, instead of updating the \(\theta _i\) one-by-one, it is common to update the parameters in small batches. For example we might update \((\theta _1,\ldots ,\theta _4)\) in one step, then \((\theta _5,\ldots ,\theta _8)\) in the next step, and so on. It is helpful to put related parameters, with values that might strongly influence each other, within the same batch.
When we update the parameters in turn, a common choice of proposal distribution is to set \(Q\sim \Theta _i|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}\) in the update for \(\theta _i\), where \(\theta _{-i}\) are the values obtained from the previous update. This choice of proposal has the effect that, from (8.1), we end up with \(\alpha =1\) and all proposals are then accepted. When using proposals of this form, the MH algorithm is usually known as the Gibbs sampler.
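To see why these proposals are always accepted, here is a sketch of the calculation, written with the acceptance probability in its usual Metropolis-Hastings form (which is how we read (8.1)). The posterior density factorizes as \(f_{\Theta |_{\{X=x\}}}(\theta )=f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(\theta _i)\,f_{\Theta _{-i}|_{\{X=x\}}}(\theta _{-i})\). Writing \(y\) for the current state and \(\tilde y\) for the proposal, which differ only in coordinate \(i\) (so \(\tilde y_{-i}=y_{-i}=\theta _{-i}\)), and writing \(q\) for the p.d.f. of the proposal \(Q\), which does not depend on the current value of \(y_i\), we get
\begin{align*} \alpha &= \min \l (1,\;\frac {f_{\Theta |_{\{X=x\}}}(\tilde y)\,q(y_i)}{f_{\Theta |_{\{X=x\}}}(y)\,q(\tilde y_i)}\r ) \\ &= \min \l (1,\;\frac {f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(\tilde y_i)\,f_{\Theta _{-i}|_{\{X=x\}}}(\theta _{-i})\,f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(y_i)}{f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(y_i)\,f_{\Theta _{-i}|_{\{X=x\}}}(\theta _{-i})\,f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(\tilde y_i)}\r ) = 1, \end{align*}
because the numerator and denominator are identical.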
-
Definition 8.4.3 The distributions of \(\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}\), for \(i=1,\ldots ,d\), are known as the full conditional distributions. From Lemma 1.6.1 the \(i^{th}\) full conditional has p.d.f. given by
\(\seteqnumber{0}{8.}{11}\)\begin{align} f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(\theta _i) &= \frac {f_{\Theta |_{\{X=x\}}}(\theta )} {\int _{\R }f_{\Theta |_{\{X=x\}}}(\theta _1,\ldots ,\theta _{i-1},t,\theta _{i+1},\ldots ,\theta _d)\,dt} \notag \\ &\propto f_{\Theta |_{\{X=x\}}}(\theta ) \label {eq:full_conditionals} \end{align} We can calculate \(f_{\Theta |_{\{X=x\}}}\) from Theorems 2.4.1/3.1.2, which provides a strategy for calculating (8.12) analytically. Note that \(\propto \) in (8.12) treats \(\theta _{-i}\) and \(x\) as constants; the only variable is \(\theta _i\), and the denominator is the normalizing constant obtained by integrating out \(\theta _i\).
The Gibbs sampler that results from these strategies is as follows.
-
1. Choose an initial point \(y_0=(\theta _1^{(0)},\ldots ,\theta _d^{(0)})\in \Pi \).
-
2. For each \(i=1,\ldots ,d\), sample \(\tilde {y}\) from \(\Theta _{i}|_{\{\Theta _{-i}=\theta ^{(m)}_{-i},\,X=x\}}\) and set
\[y_{m+1}=(\theta _1^{(m)},\ldots ,\theta _{i-1}^{(m)},\tilde {y},\theta _{i+1}^{(m)},\ldots ,\theta _d^{(m)}).\]
Note that we increment the value of \(m\) each time that we increment \(i\). When we reach \(i=d\), return to \(i=1\) and repeat.
Repeat this step until \(m\) is large enough that values taken by the \(y_m\) are no longer affected by the choice of \(y_0\).
-
3. The final value of \(y_m\) is now (approximately) a sample of \(\Theta |_{\{X=x\}}\).
Note that we need to take samples from the full conditionals \(\Theta _{i}|_{\{\Theta _{-i}=\theta ^{(m)}_{-i},\,X=x\}}\) in step 2. This isn’t always possible, and the Gibbs sampler is only helpful when it is. It is most often used in cases where the full conditionals turn out to be named distributions, or nearly so, as in Example 8.4.5 below.
For the MH algorithm we had Theorem 8.2.2 to tell us that \(y_m\) was (approximately) a sample of \(\Theta |_{\{X=x\}}\), justified by the discussion in Section 8.2.3. It is possible to make similar arguments for the Gibbs algorithm above, but we won’t include them in our course.
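For concreteness, here is a minimal sketch of the Gibbs sampler in Python. It assumes we can supply a routine \verb|sample_full_conditional(i, theta, x)| that returns a draw from \(\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}\); this routine, and all names below, are our own illustration rather than part of any library.
\begin{verbatim}
import numpy as np

def gibbs_sampler(sample_full_conditional, theta0, x, n_sweeps=5000):
    # sample_full_conditional(i, theta, x) must return a draw from the i-th
    # full conditional, i.e. Theta_i given Theta_{-i} = theta_{-i} and X = x.
    theta = np.array(theta0, dtype=float)   # initial point y_0 in Pi
    samples = []
    for m in range(n_sweeps):
        for i in range(len(theta)):
            # update coordinate i, keeping the other coordinates fixed
            theta[i] = sample_full_conditional(i, theta, x)
        samples.append(theta.copy())
    # discard an initial burn-in portion before treating these as samples
    # of Theta | X = x
    return np.array(samples)
\end{verbatim}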
If the full conditionals can’t be easily sampled from, then one strategy is to use the MH algorithm (run inside of step 2 above) to obtain approximate samples of \(\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}\). This technique is known as Metropolis-within-Gibbs. In practice, once the parameters are divided up into batches, as described in Remark 8.4.2, some batches may be amenable to Gibbs sampling, whilst others may require Metropolis-within-Gibbs. The details will depend on the model, and trial and error is often required to find the best combination of techniques. We won’t try to write down algorithms of that complexity within these notes, but you should hopefully end the course with an understanding of how (and why) each piece of such an algorithm would work.
-
Example 8.4.5 This example comes from Sections 1.1.1/7.5.3/8.6.2 of the book ‘Bayesian Approach to Interpreting Archaeological Data’ by Buck et al (1996). The data comes from radiocarbon dating, and is a vector \((x_1,x_2,\ldots ,x_n)\) of estimated ages obtained from \(n\) different objects. We write \(\theta _i\) for the true age of object \(i\), which is unknown. Our model for the estimated age of object \(i\) is \(x_i\sim \Normal (\theta _i,v_i)\), and we assume that the estimation errors are independent. For simplicity we will assume that the variances \(v_i\) are known, so our model family has \(n\) parameters \(\theta =(\theta _1,\ldots ,\theta _n)\). We thus have the model family
\[M_\theta =\Normal (\theta _1,v_1) \otimes \ldots \otimes \Normal (\theta _n,v_n).\]
From the historical context of the objects, it is known that \(\theta _1<\theta _2<\ldots <\theta _n\), so we condition our model \(M_\theta \) on this event. We can use Exercise 1.8 to do this conditioning, resulting in a new model family \(M'_\theta \) given by
\(\seteqnumber{0}{8.}{12}\)\begin{align*} f_{M'_\theta }(x) &= \begin{cases} \frac {f_{M_\theta }(x)}{\P [\Normal (\theta _1,v_1) < \Normal (\theta _2,v_2)< \ldots < \Normal (\theta _n,v_n)]} & \text { for }\theta _1<\theta _2<\ldots <\theta _n \\ 0 & \text { otherwise} \end {cases} \\ &\propto \begin{cases} \prod _{i=1}^n f_{\Normal (\theta _i,v_i)}(x_i) & \text { for }\theta _1<\theta _2<\ldots <\theta _n \\ 0 & \text { otherwise} \end {cases} \\ &\propto \begin{cases} \exp \l (-\frac 12\sum _{i=1}^n\frac {(\theta _i-x_i)^2}{v_i}\r ) & \text { for }\theta _1<\theta _2<\ldots <\theta _n \\ 0 & \text { otherwise.} \end {cases} \end{align*} We use the Bayesian model \((X,\Theta )\) with model family \(M'_{\theta }\) and the improper prior
\[f_\Theta (\theta )= \begin {cases} 1 & \text { for }0<\theta _1<\theta _2<\ldots <\theta _n, \\ 0 & \text { otherwise.} \end {cases}\]
By Theorem 3.1.2 we obtain that the posterior distribution has p.d.f.
\(\seteqnumber{0}{8.}{12}\)\begin{equation} \label {eq:radiocarbon_post} f_{\Theta |_{\{X=x\}}}(\theta ) \propto \begin{cases} \exp \l (-\frac 12\sum _{i=1}^n\frac {(\theta _i-x_i)^2}{v_i}\r ) & \text { for }0<\theta _1<\theta _2<\ldots <\theta _n, \\ 0 & \text { otherwise.} \end {cases} \end{equation}
This is the same density as that of \(M'_\theta \), except that now we treat \(\theta \) rather than \(x\) as the variable. The exponent is symmetric in \(x\) and \(\theta \), so we already know this distribution: it is the distribution of \(\theta \sim \Normal (x_1,v_1)\otimes \ldots \otimes \Normal (x_n,v_n)\) conditioned on the event \(0<\theta _1<\theta _2<\ldots <\theta _n\). One way to simulate samples of this distribution is via rejection sampling: simulate \(\theta \sim \Normal (x_1,v_1)\otimes \ldots \otimes \Normal (x_n,v_n)\) and reject (then re-sample) until \(\theta \) satisfies \(0<\theta _1<\theta _2<\ldots <\theta _n\). The trouble is that unless \(n\) is small, we will reject almost all of our samples, because the condition we have imposed is an unlikely one.
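To see the problem concretely, here is a hedged sketch in Python of this naive rejection sampler; the data \(x\) and \(v\) below are invented purely for illustration.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def naive_rejection(x, v, max_tries=10**6):
    # Sample theta ~ N(x_1,v_1) x ... x N(x_n,v_n) conditioned on
    # 0 < theta_1 < ... < theta_n, by rejecting until the condition holds.
    for _ in range(max_tries):
        theta = rng.normal(x, np.sqrt(v))
        if theta[0] > 0 and np.all(np.diff(theta) > 0):
            return theta
    raise RuntimeError("no sample accepted: the ordering event is too unlikely")

# invented data: the acceptance rate falls rapidly as n grows and as the
# x_i move closer together relative to the standard deviations sqrt(v_i)
x = np.linspace(100.0, 1000.0, 10)
v = np.full(10, 200.0**2)
theta_sample = naive_rejection(x, v)
\end{verbatim}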
From (8.13) and (8.12) we have full conditionals given by
\(\seteqnumber{0}{8.}{13}\)\begin{align*} f_{\Theta _{i}|_{\{\Theta _{-i}=\theta _{-i},\,X=x\}}}(\theta _i) &\propto \begin{cases} \exp \l (-\frac 12\sum _{j=1}^n\frac {(\theta _j-x_j)^2}{v_j}\r ) & \text { for }\theta _i\in (\theta _{i-1},\theta _{i+1}), \\ 0 & \text { otherwise} \end {cases} \\ &\propto \begin{cases} \exp \l (-\frac 12\frac {(\theta _i-x_i)^2}{v_i}\r ) & \text { for }\theta _i\in (\theta _{i-1},\theta _{i+1}), \\ 0 & \text { otherwise} \end {cases} \end{align*} (where we set \(\theta _0=0\) and \(\theta _{n+1}=\infty \) for notational convenience). Note that \(\theta _i\) is the only variable here; the second line follows because \(\theta _{-i}\) and \(x\) are treated as constants, so the terms with \(j\neq i\) are absorbed into the constant of proportionality. We recognize this full conditional distribution as that of \(\theta _i\sim \Normal (x_i,v_i)\) conditioned on the event \(\theta _i\in (\theta _{i-1},\theta _{i+1})\). These full conditionals are much easier to sample from: we use rejection sampling, drawing \(\theta _i\sim \Normal (x_i,v_i)\) and rejecting until we obtain a sample with \(\theta _i\in (\theta _{i-1},\theta _{i+1})\). Hence, in this situation we have all the necessary ingredients to use a Gibbs sampler.
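Putting these pieces together, the sketch below (again in Python, with invented data) implements the resulting Gibbs sampler: each coordinate is refreshed by drawing from \(\Normal (x_i,v_i)\) and rejecting until the draw lands in \((\theta _{i-1},\theta _{i+1})\). Since only one coordinate is constrained at a time, this rejection step usually succeeds quickly; in practice one could instead draw from the truncated normal directly (for example with scipy.stats.truncnorm).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def sample_full_conditional(i, theta, x, v):
    # Draw theta_i ~ N(x_i, v_i) conditioned on theta_i in (lower, upper),
    # where lower = theta_{i-1} (or 0) and upper = theta_{i+1} (or infinity).
    lower = theta[i - 1] if i > 0 else 0.0
    upper = theta[i + 1] if i < len(theta) - 1 else np.inf
    while True:
        proposal = rng.normal(x[i], np.sqrt(v[i]))
        if lower < proposal < upper:
            return proposal

def gibbs_radiocarbon(x, v, n_sweeps=5000):
    # any strictly increasing, positive starting point y_0 will do
    theta = np.sort(np.abs(x)) + 1e-3 * np.arange(1, len(x) + 1)
    samples = []
    for _ in range(n_sweeps):
        for i in range(len(theta)):
            theta[i] = sample_full_conditional(i, theta, x, v)
        samples.append(theta.copy())
    return np.array(samples)   # discard burn-in before use

# invented data: five estimated ages, each with standard deviation 30
x = np.array([120.0, 200.0, 210.0, 350.0, 500.0])
v = np.full(5, 30.0**2)
posterior_samples = gibbs_radiocarbon(x, v)
\end{verbatim}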