Bayesian Statistics
Chapter 8 Computational methods
We noted at several points that the conjugate pairs from Chapter 4 do not provide enough flexibility for many practical situations. Instead, Bayesian statistics relies heavily on a family of computational techniques, which we introduce in this chapter. These techniques generate samples from the posterior distribution at moderate computational cost. For simple model families of the kind we have worked with throughout the course, a desktop machine is sufficient; more complex model families can require larger machines.
Throughout this chapter we assume the same setup as in Chapter 7, which we repeat here for convenience. We work with a discrete or absolutely continuous Bayesian model \((X,\Theta )\), where we have data \(x\) and posterior \(\Theta |_{\{X=x\}}\). We keep all of our usual notation: the parameter space is \(\Pi \), the model family is \((M_\theta )_{\theta \in \Pi }\), and the range of the model is \(R\). Note that \(M_\theta \) could have the form \(M_\theta \sim (Y_\theta )^{\otimes n}\) for some random variable \(Y_\theta \) with parameter \(\theta \), corresponding to \(n\) i.i.d. data points.
8.1 Approximate Bayesian computation \(\offsyl \)
In this section we describe a numerical method for calculating the posterior \(\Theta |_{\{X=x\}}\) that is based on rejection sampling. Recall that we used rejection sampling in Section 1.4 to prove Lemma 1.4.1, and also to give some intuition for our first examples of conditioning.
The algorithm we study here is known as Approximate Bayesian Computation, or ABC for short. We will describe it first for discrete data, in the situation where the prior (and, consequently, the posterior) are also discrete distributions. We haven’t studied this case in any of our previous chapters, so let us first introduce it here.
Definition 8.1.1 (Bayesian model with discrete parameters and discrete data) Take a prior with p.m.f. \(p_\Theta (\theta )\) and a discrete model family \((M_\theta )_{\theta \in \Pi }\) where \(\Pi \) is a finite or countable set. The Bayesian model \((X,\Theta )\) has the law
\[\P [X=x,\Theta =\theta ]=\P [M_\theta =x]p_\Theta (\theta ).\]
It is straightforward to sum over \(x\) and obtain the prior distribution \(\P [\Theta =\theta ]=p_\Theta (\theta )\), and also to sum over \(\theta \) and obtain the sampling distribution \(\P [X=x]=\sum _{\theta \in \Pi }\P [M_\theta =x]p_\Theta (\theta )\). For \(x\in R_X\) we have \(\P [X=x]>0\) and thus the posterior \(\Theta |_{\{X=x\}}\) is defined via Lemma 1.4.1. Also using Lemma 1.4.1, the conditional distribution \(X|_{\{\Theta =\theta \}}\) satisfies
\[\P [X|_{\{\Theta =\theta \}}=x] =\frac {\P [X=x,\Theta =\theta ]}{\P [\Theta =\theta ]} =\frac {\P [M_\theta =x]p_\Theta (\theta )}{p_{\Theta }(\theta )} =\P [M_\theta =x]\]
so \(X|_{\{\Theta =\theta \}}\eqd M_\theta \).
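As a quick illustration (the numbers here are chosen purely for concreteness and are not part of the definition), take \(\Pi =\{0.3,0.7\}\) with uniform prior \(p_\Theta (0.3)=p_\Theta (0.7)=\tfrac 12\) and \(M_\theta \sim \text {Bernoulli}(\theta )\). If we observe \(x=1\) then
\[\P [X=1]=0.3\cdot \tfrac 12+0.7\cdot \tfrac 12=\tfrac 12, \qquad \P [\Theta |_{\{X=1\}}=0.7]=\frac {\P [M_{0.7}=1]\,p_\Theta (0.7)}{\P [X=1]}=\frac {0.7\cdot \tfrac 12}{\tfrac 12}=0.7.\]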
In the context of Definition 8.1.1 the ABC algorithm for generating samples from \(\Theta |_{\{X=x\}}\) is the following:
1. Sample \(\theta _0\) from the discrete distribution \(\Theta \).
2. Sample \(x_0\) from the discrete distribution \(M_{\theta _0}\eqd X|_{\{\Theta =\theta _0\}}\).
3. Then:
- if \(x\neq x_0\), go back to step one;
- if \(x=x_0\), accept \(\theta _0\) as a sample of \(\Theta |_{\{X=x\}}\).
This algorithm is precisely the strategy of our proof for Lemma 1.4.1, written as an algorithm and adapted to the special case of Definition 8.1.1. It generates a single sample of the distribution \(\Theta |_{\{X=x\}}\). We can run the algorithm again to obtain more samples.
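A short sketch may help to make the loop concrete. The sketch below is written in Python with NumPy; the model family, prior and observed data are illustrative choices of ours, not part of the theory above. We take \(M_\theta \sim \text {Binomial}(10,\theta )\) with a two-point prior on \(\theta \).

import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) choices, not fixed by the theory above:
# model family M_theta = Binomial(10, theta), two-point prior on theta.
thetas = np.array([0.3, 0.7])        # parameter space Pi
prior = np.array([0.5, 0.5])         # prior p.m.f. p_Theta
n_trials, x = 10, 7                  # model size and observed data

def abc_sample():
    # Step 1: sample theta_0 from the prior distribution of Theta.
    # Step 2: sample x_0 from M_{theta_0}.
    # Step 3: accept theta_0 only if x_0 equals the observed data x.
    while True:
        theta0 = rng.choice(thetas, p=prior)
        x0 = rng.binomial(n_trials, theta0)
        if x0 == x:
            return theta0

# Each call returns one exact sample of Theta | {X = x}; repeat for more samples.
samples = np.array([abc_sample() for _ in range(2000)])
print("estimate of P[Theta|_{X=7} = 0.7]:", np.mean(samples == 0.7))

Note that the probability of accepting on any single pass through the loop is \(\P [X=x]\), which foreshadows the efficiency issue discussed at the end of this section.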
The ABC algorithm outlined above requires only that we have the ability to take samples from discrete distributions with known probability mass functions. To handle cases with continuous priors and/or data, we also need to be able to sample from continuous distributions with known probability density functions. The modifications are as follows:
• If \(\Theta \) is continuous and \((M_\theta )\) is discrete then we can adopt the same algorithm, with the modification that in step 1 we must now sample from a continuous distribution rather than a discrete distribution.
• If \((M_\theta )\) is continuous then in step 3 we will have \(\P [x=x_0]=0\). In this case the simplest strategy is to fix some \(\eps >0\) and accept \(\theta _0\) as an approximate sample of \(\Theta |_{\{X=x\}}\) if \(|x-x_0|\leq \eps \).
This idea is based on (1.9), which stated that if \(\Theta |_{\{X=x\}}\) is to be defined then it should be the limit as \(\eps \to 0\) of \(\Theta |_{\{|X-x|\leq \eps \}}\). The terminology ‘Approximate’ Bayesian Computation comes from this step.
More complex strategies for comparing \(x\) and \(x_0\) can also be used, with the aim of focusing on the aspects of the data that are most important to us.
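As with the discrete case, a short sketch may be useful. The setup below is an illustrative choice of ours and is not part of the discussion above: prior \(\Theta \sim N(0,1)\), model \(M_\theta \sim N(\theta ,1)\), a single observed data point \(x\), and the simple acceptance rule \(|x-x_0|\leq \eps \).

import numpy as np

rng = np.random.default_rng(1)

# Illustrative (hypothetical) choices: prior Theta ~ N(0,1), model M_theta ~ N(theta,1),
# one observed data point x, and tolerance eps for the acceptance rule.
x = 1.5
eps = 0.05

def abc_sample_continuous():
    # Step 1: sample theta_0 from the continuous prior.
    # Step 2: sample x_0 from M_{theta_0}.
    # Step 3: accept theta_0 if x_0 lies within eps of the observed data x.
    while True:
        theta0 = rng.normal(0.0, 1.0)
        x0 = rng.normal(theta0, 1.0)
        if abs(x - x0) <= eps:
            return theta0

samples = np.array([abc_sample_continuous() for _ in range(1000)])
# In this conjugate Normal-Normal setup the exact posterior is N(x/2, 1/2), so for
# small eps the sample mean should be close to x/2 = 0.75.
print("approximate posterior mean:", samples.mean())

Decreasing \(\eps \) improves the approximation but lowers the acceptance probability, which is exactly the trade-off discussed next.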
The ABC algorithm as described above has a serious drawback. In discrete cases the probability that \(x=x_0\) (in step 3) can be extremely small, meaning that we have to go around the loop \(1\to 2\to 3\to 1\) many times before we find a sample we can accept. In continuous cases, choosing \(\eps \) close to \(0\) gives a good approximation but introduces the same problem: the probability of accepting the sampled \(x_0\) becomes small. To handle this difficulty, various ways of sampling \(x_0\) (in step 2) have been developed that increase the acceptance probability without changing the distribution of \(x_0\). One such method is ABC-MCMC, which uses ideas from Section 8.3 to sample \(x_0\). Another is sequential ABC, where \(x_0\) is sampled as a perturbation, in a carefully chosen direction, of the (rejected) \(x_0\) from the previous iteration of the loop. We will not detail such methods here, but they are very popular in some applications.