last updated: January 23, 2025

Bayesian Statistics


8.3 Markov Chain Monte Carlo

We now take a Bayesian model \((X,\Theta )\). We will assume that the parameter space \(\Pi \) is a subset of \(\mathbb {R}^d\). Recall from Theorems 2.4.1 and 3.1.2 that \(\Pi \) is also the range of both the prior \(\Theta \) and the posterior \(\Theta |_{\{X=x\}}\). As usual, we will assume that the prior distribution \(\Theta \) is a continuous distribution with p.d.f. \(f_\Theta \).

Markov Chain Monte Carlo, or MCMC, is a general term for algorithms that use Markov chains to sample from probability distributions. In particular, we can use the Metropolis-Hastings algorithm from Section 8.2 to construct samples from the posterior distribution \(Y=\Theta |_{\{X=x\}}\). Let us think about what capabilities we need in order to do this.

  • 1. We first need a choice of joint distribution \((Y,Q)\), and the ability to take samples \(\tilde {y}\) from the proposal distribution \(Q|_{\{Y=y\}}\), for any value of \(y\).

    Random walk case: Choose \(Q=\Theta |_{\{X=x\}}+Z\), where \(Z\) satisfies \(f_Z(z)=f_Z(-z)\), as in Example 8.2.4. Then \(\widetilde {y}=y+Z\) has the distribution \(Q|_{\{\Theta |_{\{X=x\}}=y\}}\).

    General case: If we want to use a more complicated proposal distribution, then there is a more difficult choice to make here.

  • 2. The second requirement is that we can calculate the value of \(\alpha \) in (8.1)/(8.4).

    Random walk case: We can calculate the p.d.f. \(f_{\Theta |_{\{X=x\}}}\) using Theorems 2.4.1 and 3.1.2. In the symmetric case this is all that we need.

    General case: If we don’t use the symmetric case, then we also need to evaluate \(f_{Q|_{\{Y=y\}}}\), from our chosen joint distribution \((Y,Q)\). Our choice of \((Y,Q)\) (in the step above) will usually result in us being able to write down the joint p.d.f. \(f_{Y,Q}\), from which we can (at least in principle) use Lemma 1.6.1 to calculate \(f_{Q|_{\{Y=y\}}}\).

  • 3. The last thing that we require is that the proposals are accepted fairly often, in step 3 of the MH algorithm. As a rough guide, an acceptance rate of \(15\%\) to \(50\%\) is generally viewed as a good target.

    • An acceptance rate that is very high (i.e. moderately close to \(100\%\)) often means that the MH algorithm is only taking very small steps, so it won’t converge quickly.

    • An acceptance rate that is too low (i.e. close to \(0\%\)) means that the chain rarely moves, because almost all proposals are rejected. This also prevents the MH algorithm from converging quickly.

    A good acceptance rate happens naturally in some cases. When it does not, a bit of hand-tuning or additional techniques are required, some of which appear in the MSc material next semester.
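To make requirement 3 concrete, here is a minimal Python sketch of how one might monitor the acceptance rate of a random-walk sampler for several proposal widths \(\sigma\). Everything specific here is illustrative rather than from the notes: the target is a \(N(0,1)\) "posterior" known up to a normalising constant, and the \(\sigma\) values are arbitrary. A very small \(\sigma\) gives an acceptance rate near \(100\%\), a very large \(\sigma\) a rate near \(0\%\), and an intermediate \(\sigma\) lands in the healthy range.

```python
import math
import random

def acceptance_rate(pdf, y0, sigma, n_steps, seed=0):
    """Run random-walk Metropolis-Hastings and return the fraction of
    proposals that were accepted."""
    rng = random.Random(seed)
    y, accepted = y0, 0
    for _ in range(n_steps):
        y_prop = y + rng.gauss(0.0, sigma)      # symmetric proposal
        alpha = min(1.0, pdf(y_prop) / pdf(y))  # acceptance probability, as in (8.1)/(8.4)
        if rng.random() < alpha:
            y, accepted = y_prop, accepted + 1
    return accepted / n_steps

# Hypothetical target: a N(0, 1) "posterior", known only up to a constant.
target = lambda y: math.exp(-0.5 * y * y)

for sigma in (0.1, 2.5, 50.0):
    rate = acceptance_rate(target, y0=0.0, sigma=sigma, n_steps=10_000)
    print(f"sigma = {sigma:5.1f}: acceptance rate = {rate:.2f}")
```

Hand-tuning \(\sigma\) typically amounts to repeating a short run like this until the reported rate falls into the desired band.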

We’ll focus on the symmetric case from now on. We’ll write down a full description of the MCMC algorithm for Bayesian inference, for the random walk case, in Section 8.3.1.

8.3.1 MCMC algorithm for the random walk case

Take a (discrete or continuous) Bayesian model \((X,\Theta )\), with parameter space \(\Pi \subseteq \mathbb {R}^d\). We want to obtain samples of \(\Theta |_{\{X=x\}}\), and we know the p.d.f. \(f_{\Theta |_{\{X=x\}}}\) from Theorems 2.4.1/3.1.2.

We must first choose an initial point \(y_0\). Ideally we want this value near the typical values of \(\Theta |_{\{X=x\}}\), because that will (at least initially) give a higher acceptance rate. If we choose the initial state poorly then the algorithm will take a little longer to converge, which often doesn’t matter too much.

We must also choose a continuous distribution for \(Z\) satisfying \(f_Z(z)=f_Z(-z)\) for all \(z\in \mathbb {R}\). A common choice is \(Z\sim N(0,\sigma ^2)\), as in Example 8.2.4, for some chosen value of \(\sigma \).

Then, given \(y_m\), we define \(y_{m+1}\) as follows.

  • 1. Sample \(z\) from \(Z\) and set \(\tilde {y}=y_m+z\).

  • 2. Calculate \(\alpha =\min \left (1,\frac {f_{\Theta |_{\{X=x\}}}(\tilde {y})}{f_{\Theta |_{\{X=x\}}}(y_m)}\right )\).

  • 3. Then, set \(y_{m+1}=\begin {cases} \tilde {y} & \text { with probability }\alpha , \\ y_m & \text { with probability }1-\alpha . \end {cases} \)

We repeat steps 1–3 until \(m\) is large enough that the values taken by \(y_m\) are no longer affected by the choice of \(y_0\). This can often be judged by eye, from a plot of the sequence \((y_m)\).
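As a concrete illustration, steps 1–3 can be sketched in Python. The specifics here are hypothetical stand-ins rather than part of the notes: we take a posterior with a \(\Beta (3,2)\) shape, known only up to its normalising constant (the constant cancels in step 2), and the values of \(\sigma\), the chain length and the burn-in are untuned guesses.

```python
import random

def rw_metropolis(pdf, y0, sigma, n_steps, seed=0):
    """Random-walk Metropolis-Hastings, following steps 1-3 above.

    `pdf` need only be known up to a normalising constant, since step 2
    uses only a ratio of density values."""
    rng = random.Random(seed)
    y = y0
    chain = [y]
    for _ in range(n_steps):
        # Step 1: sample z from Z ~ N(0, sigma^2) and set y~ = y_m + z.
        y_prop = y + rng.gauss(0.0, sigma)
        # Step 2: alpha = min(1, f(y~) / f(y_m)).
        alpha = min(1.0, pdf(y_prop) / pdf(y))
        # Step 3: move to y~ with probability alpha, otherwise stay at y_m.
        if rng.random() < alpha:
            y = y_prop
        chain.append(y)
    return chain

# Hypothetical posterior: a Beta(3, 2) shape, unnormalised. Proposals outside
# (0, 1) get density 0, so they are always rejected and the chain stays inside.
def unnorm_posterior(y):
    return y * y * (1.0 - y) if 0.0 < y < 1.0 else 0.0

chain = rw_metropolis(unnorm_posterior, y0=0.5, sigma=0.3, n_steps=50_000)
samples = chain[1_000:]              # discard a burn-in, judged from a trace plot
print(sum(samples) / len(samples))   # should be close to E[Beta(3, 2)] = 3/5
```

In practice one would plot `chain` to judge the burn-in by eye, as described above, rather than fixing it in advance.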

Repeating the whole procedure gives multiple samples of \(\Theta |_{\{X=x\}}\), which we can plot in a histogram to obtain an approximation of the posterior distribution. This is exactly what we already did in Example 8.2.3.