last updated: January 23, 2025

Bayesian Statistics


8.3 Markov Chain Monte Carlo

We now take a Bayesian model \((X,\Theta )\). We will assume that the parameter space \(\Pi \) is a subset of \(\mathbb {R}^d\). Recall from Theorems 2.4.1 and 3.1.2 that \(\Pi \) is also the range of both the prior \(\Theta \) and the posterior \(\Theta |_{\{X=x\}}\). As usual, we will assume that the prior distribution \(\Theta \) is a continuous distribution with p.d.f. \(f_\Theta \).

Markov Chain Monte Carlo, or MCMC, is a general term for algorithms that use Markov chains to sample from probability distributions. In particular, we can use the Metropolis-Hastings algorithm from Section 8.2 to construct samples from the posterior distribution \(Y=\Theta |_{\{X=x\}}\). Let us think about what capabilities we need in order to do this.

  • 1. We first need a choice of joint distribution \((Y,Q)\), and the ability to take samples \(\tilde {y}\) from the proposal distribution \(Q|_{\{Y=y\}}\), for any value of \(y\).

    Random walk case: Choose \(Q=\Theta |_{\{X=x\}}+Z\), where \(Z\) satisfies \(f_Z(z)=f_Z(-z)\), as in Example 8.2.4. Then \(\widetilde {y}=y+Z\) has the distribution \(Q|_{\{\Theta |_{\{X=x\}}=y\}}\).

    General case: If we want to use a more complicated proposal distribution, then there is a more difficult choice to make here.

  • 2. The second requirement is that we can calculate the value of \(\alpha \) in (8.1)/(8.4).

    Random walk case: We can calculate the p.d.f. \(f_{\Theta |_{\{X=x\}}}\) using Theorems 2.4.1 and 3.1.2. In the symmetric case this is all that we need.

    General case: If we don’t use the symmetric case, then we also need to evaluate \(f_{Q|_{\{Y=y\}}}\), from our chosen joint distribution \((Y,Q)\). Our choice of \((Y,Q)\) (in the step above) will usually result in us being able to write down the joint p.d.f. \(f_{Y,Q}\), from which we can (at least in principle) use Lemma 1.6.1 to calculate \(f_{Q|_{\{Y=y\}}}\).

  • 3. The last thing that we require is that the proposals are accepted fairly often, in step 3 of the MH algorithm. As a rough guide, an acceptance rate of \(15\%\) to \(50\%\) is generally viewed as a good target.

    • An acceptance rate that is very high (i.e. moderately close to \(100\%\)) often means that the MH algorithm is only taking very small steps, so it won’t converge quickly.

    • An acceptance rate that is too low (i.e. close to \(0\%\)) means that the chain rarely moves, because almost all proposals are rejected. This also prevents the MH algorithm from converging quickly.

    A good acceptance rate happens naturally in some cases. When it does not, a bit of hand-tuning or additional techniques are required, some of which appear in the MSc material next semester.
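To make requirement 3 concrete, here is a minimal Python sketch of how one might monitor the acceptance rate of a random-walk sampler for several proposal widths \(\sigma\). Everything specific here is illustrative rather than from the notes: the target is a \(N(0,1)\) "posterior" known up to a normalising constant, and the \(\sigma\) values are arbitrary. A very small \(\sigma\) gives an acceptance rate near \(100\%\), a very large \(\sigma\) a rate near \(0\%\), and an intermediate \(\sigma\) lands in the healthy range.

```python
import math
import random

def acceptance_rate(pdf, y0, sigma, n_steps, seed=0):
    """Run random-walk Metropolis-Hastings and return the fraction of
    proposals that were accepted."""
    rng = random.Random(seed)
    y, accepted = y0, 0
    for _ in range(n_steps):
        y_prop = y + rng.gauss(0.0, sigma)      # symmetric proposal
        alpha = min(1.0, pdf(y_prop) / pdf(y))  # acceptance probability, as in (8.1)/(8.4)
        if rng.random() < alpha:
            y, accepted = y_prop, accepted + 1
    return accepted / n_steps

# Hypothetical target: a N(0, 1) "posterior", known only up to a constant.
target = lambda y: math.exp(-0.5 * y * y)

for sigma in (0.1, 2.5, 50.0):
    rate = acceptance_rate(target, y0=0.0, sigma=sigma, n_steps=10_000)
    print(f"sigma = {sigma:5.1f}: acceptance rate = {rate:.2f}")
```

Hand-tuning \(\sigma\) typically amounts to repeating a short run like this until the reported rate falls into the desired band.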

We’ll focus on the symmetric case from now on. We’ll write down a full description of the MCMC algorithm for Bayesian inference, for the random walk case, in Section 8.3.1.

8.3.1 MCMC algorithm for the random walk case

Take a (discrete or continuous) Bayesian model \((X,\Theta )\), with parameter space \(\Pi \subseteq \mathbb {R}^d\). We want to obtain samples of \(\Theta |_{\{X=x\}}\), and we know the p.d.f. \(f_{\Theta |_{\{X=x\}}}\) from Theorems 2.4.1/3.1.2.

We must first choose an initial point \(y_0\). Ideally we want this value near the typical values of \(\Theta |_{\{X=x\}}\), because that will (at least initially) give a higher acceptance rate. If we choose the initial state poorly then the algorithm will take a little longer to converge, which often doesn’t matter too much.

We must also choose a continuous distribution for \(Z\) satisfying \(f_Z(z)=f_Z(-z)\) for all \(z\in \mathbb {R}\). A common choice is \(Z\sim N(0,\sigma ^2)\), as in Example 8.2.4, for some chosen value of \(\sigma \).

Then, given \(y_m\), we define \(y_{m+1}\) as follows.

  • 1. Sample \(z\) from \(Z\) and set \(\tilde {y}=y_m+z\).

  • 2. Calculate \(\alpha =\min \left (1,\frac {f_{\Theta |_{\{X=x\}}}(\tilde {y})}{f_{\Theta |_{\{X=x\}}}(y_m)}\right )\).

  • 3. Then, set \(y_{m+1}=\begin {cases} \tilde {y} & \text { with probability }\alpha , \\ y_m & \text { with probability }1-\alpha . \end {cases} \)

We repeat steps 1–3 until \(m\) is large enough that the values taken by \(y_m\) are no longer affected by the choice of \(y_0\). This can often be judged by eye, from a plot of the sequence \((y_m)\).
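As a concrete illustration, steps 1–3 can be sketched in Python. The specifics here are hypothetical stand-ins rather than part of the notes: we take a posterior with a \(\Beta (3,2)\) shape, known only up to its normalising constant (the constant cancels in step 2), and the values of \(\sigma\), the chain length and the burn-in are untuned guesses.

```python
import random

def rw_metropolis(pdf, y0, sigma, n_steps, seed=0):
    """Random-walk Metropolis-Hastings, following steps 1-3 above.

    `pdf` need only be known up to a normalising constant, since step 2
    uses only a ratio of density values."""
    rng = random.Random(seed)
    y = y0
    chain = [y]
    for _ in range(n_steps):
        # Step 1: sample z from Z ~ N(0, sigma^2) and set y~ = y_m + z.
        y_prop = y + rng.gauss(0.0, sigma)
        # Step 2: alpha = min(1, f(y~) / f(y_m)).
        alpha = min(1.0, pdf(y_prop) / pdf(y))
        # Step 3: move to y~ with probability alpha, otherwise stay at y_m.
        if rng.random() < alpha:
            y = y_prop
        chain.append(y)
    return chain

# Hypothetical posterior: a Beta(3, 2) shape, unnormalised. Proposals outside
# (0, 1) get density 0, so they are always rejected and the chain stays inside.
def unnorm_posterior(y):
    return y * y * (1.0 - y) if 0.0 < y < 1.0 else 0.0

chain = rw_metropolis(unnorm_posterior, y0=0.5, sigma=0.3, n_steps=50_000)
samples = chain[1_000:]              # discard a burn-in, judged from a trace plot
print(sum(samples) / len(samples))   # should be close to E[Beta(3, 2)] = 3/5
```

In practice one would plot `chain` to judge the burn-in by eye, as described above, rather than fixing it in advance.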

Repeating the whole procedure gives multiple samples of \(\Theta |_{\{X=x\}}\), which we can plot in a histogram to obtain an approximation of the posterior distribution. This is exactly what we already did in Example 8.2.3.