Bayesian Statistics

$\newcommand{\footnotename}{footnote}$ $\def \LWRfootnote {1}$ $\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\let \LWRorighspace \hspace $ $\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }$ $\newcommand {\mathnormal }[1]{{#1}}$ $\newcommand \ensuremath [1]{#1}$ $\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } $ $\newcommand {\setlength }[2]{}$ $\newcommand {\addtolength }[2]{}$ $\newcommand {\setcounter }[2]{}$ $\newcommand {\addtocounter }[2]{}$ $\newcommand {\arabic }[1]{}$ $\newcommand {\number }[1]{}$ $\newcommand {\noalign }[1]{\text {#1}\notag \\}$ $\newcommand {\cline }[1]{}$ $\newcommand {\directlua }[1]{\text {(directlua)}}$ $\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}$ $\newcommand {\protect }{}$ $\def \LWRabsorbnumber #1 {}$ $\def \LWRabsorbquotenumber "#1 {}$ $\newcommand {\LWRabsorboption }[1][]{}$ $\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }$ $\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }$ $\def \mathcode #1={\mathchar }$ $\let \delcode \mathcode $ $\let \delimiter \mathchar $ $\def \oe {\unicode {x0153}}$ $\def \OE {\unicode {x0152}}$ $\def \ae {\unicode {x00E6}}$ $\def \AE {\unicode {x00C6}}$ $\def \aa {\unicode {x00E5}}$ $\def \AA {\unicode {x00C5}}$ $\def \o {\unicode {x00F8}}$ $\def \O {\unicode {x00D8}}$ $\def \l {\unicode {x0142}}$ $\def \L {\unicode {x0141}}$ $\def \ss {\unicode {x00DF}}$ $\def \SS {\unicode {x1E9E}}$ $\def \dag {\unicode {x2020}}$ $\def \ddag {\unicode {x2021}}$ $\def \P {\unicode {x00B6}}$ $\def \copyright {\unicode {x00A9}}$ $\def \pounds {\unicode {x00A3}}$ $\let \LWRref \ref $ $\renewcommand {\ref }{\ifstar \LWRref \LWRref }$ $ \newcommand {\multicolumn }[3]{#3}$ $\require {textcomp}$ $\newcommand {\intertext }[1]{\text {#1}\notag \\}$ $\let \Hat \hat $ $\let \Check \check $ $\let \Tilde \tilde $ $\let \Acute \acute $ $\let 
\Grave \grave $ $\let \Dot \dot $ $\let \Ddot \ddot $ $\let \Breve \breve $ $\let \Bar \bar $ $\let \Vec \vec $ $\require {colortbl}$ $\let \LWRorigcolumncolor \columncolor $ $\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigrowcolor \rowcolor $ $\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigcellcolor \cellcolor $ $\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }$ $\require {mathtools}$ $\newenvironment {crampedsubarray}[1]{}{}$ $\newcommand {\smashoperator }[2][]{#2\limits }$ $\newcommand {\SwapAboveDisplaySkip }{}$ $\newcommand {\LaTeXunderbrace }[1]{\underbrace {#1}}$ $\newcommand {\LaTeXoverbrace }[1]{\overbrace {#1}}$ $\newcommand {\LWRmultlined }[1][]{\begin {multline*}}$ $\newenvironment {multlined}[1][]{\LWRmultlined }{\end {multline*}}$ $\let \LWRorigshoveleft \shoveleft $ $\renewcommand {\shoveleft }[1][]{\LWRorigshoveleft }$ $\let \LWRorigshoveright \shoveright $ $\renewcommand {\shoveright }[1][]{\LWRorigshoveright }$ $\newcommand {\shortintertext }[1]{\text {#1}\notag \\}$ $\newcommand {\vcentcolon }{\mathrel {\unicode {x2236}}}$ $\renewcommand {\intertext }[2][]{\text {#2}\notag \\}$ $\newenvironment {fleqn}[1][]{}{}$ $\newenvironment {ceqn}{}{}$ $\newenvironment {darray}[2][c]{\begin {array}[#1]{#2}}{\end {array}}$ $\newcommand {\dmulticolumn }[3]{#3}$ $\newcommand {\LWRnrnostar }[1][0.5ex]{\\[#1]}$ $\newcommand {\nr }{\ifstar \LWRnrnostar \LWRnrnostar }$ $\newcommand {\mrel }[1]{\begin {aligned}#1\end {aligned}}$ $\newcommand {\underrel }[2]{\underset {#2}{#1}}$ $\newcommand {\medmath }[1]{#1}$ $\newcommand {\medop }[1]{#1}$ $\newcommand {\medint }[1]{#1}$ $\newcommand {\medintcorr }[1]{#1}$ $\newcommand {\mfrac }[2]{\frac {#1}{#2}}$ $\newcommand {\mbinom }[2]{\binom {#1}{#2}}$ $\newenvironment {mmatrix}{\begin {matrix}}{\end {matrix}}$ $\newcommand {\displaybreak }[1][]{}$ $ \def \offsyl {(\oslash )} 
\def \msconly {(\Delta )} $ $ \DeclareMathOperator {\var }{var} \DeclareMathOperator {\cov }{cov} \DeclareMathOperator {\Bin }{Bin} \DeclareMathOperator {\Geo }{Geometric} \DeclareMathOperator {\Beta }{Beta} \DeclareMathOperator {\Unif }{Uniform} \DeclareMathOperator {\Gam }{Gamma} \DeclareMathOperator {\Normal }{N} \DeclareMathOperator {\Exp }{Exp} \DeclareMathOperator {\Cauchy }{Cauchy} \DeclareMathOperator {\Bern }{Bernoulli} \DeclareMathOperator {\Poisson }{Poisson} \DeclareMathOperator {\Weibull }{Weibull} \DeclareMathOperator {\IGam }{IGamma} \DeclareMathOperator {\NGam }{NGamma} \DeclareMathOperator {\ChiSquared }{ChiSquared} \DeclareMathOperator {\Pareto }{Pareto} \DeclareMathOperator {\NBin }{NegBin} \DeclareMathOperator {\Studentt }{Student-t} \DeclareMathOperator *{\argmax }{arg\,max} \DeclareMathOperator *{\argmin }{arg\,min} $ \( \def \to {\rightarrow } \def \iff {\Leftrightarrow } \def \ra {\Rightarrow } \def \sw {\subseteq } \def \mc {\mathcal } \def \mb {\mathbb } \def \sc {\setminus } \def \wt {\widetilde } \def \v {\textbf } \def \E {\mb {E}} \def \P {\mb {P}} \def \R {\mb {R}} \def \C {\mb {C}} \def \N {\mb {N}} \def \Q {\mb {Q}} \def \Z {\mb {Z}} \def \B {\mb {B}} \def \~{\sim } \def \-{\,;\,} \def \qed {$\blacksquare $} \CustomizeMathJax {\def \1{\unicode {x1D7D9}}} \def \cadlag {c\`{a}dl\`{a}g} \def \p {\partial } \def \l {\left } \def \r {\right } \def \Om {\Omega } \def \om {\omega } \def \eps {\epsilon } \def \de {\delta } \def \ov {\overline } \def \sr {\stackrel } \def \Lp {\mc {L}^p} \def \Lq {\mc {L}^p} \def \Lone {\mc {L}^1} \def \Ltwo {\mc {L}^2} \def \toae {\sr {\rm a.e.}{\to }} \def \toas {\sr {\rm a.s.}{\to }} \def \top {\sr {\mb {\P }}{\to }} \def \tod {\sr {\rm d}{\to }} \def \toLp {\sr {\Lp }{\to }} \def \toLq {\sr {\Lq }{\to }} \def \eqae {\sr {\rm a.e.}{=}} \def \eqas {\sr {\rm a.s.}{=}} \def \eqd {\sr {\rm d}{=}} \def \approxd {\sr {\rm d}{\approx }} \def \Sa {(S1)\xspace } \def \Sb {(S2)\xspace } \def \Sc {(S3)\xspace } \)

Chapter 7 Testing and parameter estimation

In this chapter we discuss aspects of statistical testing and parameter inference, using the Bayesian models set up in earlier chapters. Throughout this chapter we work in the situation of a discrete or absolutely continuous Bayesian model $(X,\Theta )$, where we have data $x$ and posterior $\Theta |_{\{X=x\}}$. We keep all of our usual notation: the parameter space is $\Pi $, the model family is $(M_\theta )_{\theta \in \Pi }$, and the range of the model is $R$. Note that $M_\theta $ could have the form $M_\theta \sim (Y_\theta )^{\otimes n}$ for some random variable $Y_\theta $ with parameter $\theta $, corresponding to $n$ i.i.d. data points.

We noted in Chapter 5 that a well-chosen prior distribution can lead to a more accurate posterior distribution. Statistical testing is often used in situations where multiple different perspectives are involved, and this makes the specification of prior beliefs more complicated. For example, trials of medical treatments involve patients, pharmaceutical companies and regulators, all of whom have different levels of trust in each other as well as potentially different prior beliefs. It is common practice to check how much the results of statistical tests depend upon the choice of prior, often by varying the prior or comparing to a weakly informative prior.

7.1 Hypothesis testing

Hypothesis testing is surprisingly simple within the Bayesian framework. We first need to introduce the way to present the results.

Definition 7.1.1 Let $A$ and $B$ be events such that $\P [A\cup B]=1$ and $A\cap B=\emptyset $. The odds ratio of $A$ against $B$ is

\[O_{A,B}=\frac {\P [A]}{\P [B]}.\]

It expresses how much more likely $A$ is than $B$. For example, $O_{A,B}=2$ means that $A$ is twice as likely to occur as $B$; if $O_{A,B}=1$ then $A$ and $B$ are equally likely.
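Because $\P [A]+\P [B]=1$ in Definition 7.1.1, the odds ratio and the probability of $A$ determine each other: $\P [A]=O_{A,B}/(1+O_{A,B})$. The following is a minimal sketch of that conversion; the function names are hypothetical and chosen only for illustration.

```python
def odds_from_prob(p_a):
    """Odds of A against B, assuming P[B] = 1 - P[A] as in Definition 7.1.1."""
    if not 0 <= p_a < 1:
        raise ValueError("P[A] must lie in [0, 1) so that P[B] is non-zero")
    return p_a / (1 - p_a)


def prob_from_odds(odds):
    """Recover P[A] from the odds of A against B."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


# Odds of 2 mean A is twice as likely as B, i.e. P[A] = 2/3 and P[B] = 1/3.
p_a = prob_from_odds(2)
```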

Take a Bayesian model $(X,\Theta )$ with parameter space $\Pi $. We split the parameter space into two pieces: $\Pi =\Pi _0\cup \Pi _1$ where $\Pi _0\cap \Pi _1=\emptyset $. We consider two competing hypotheses:

\begin{align*} H_0&:\text { that }\theta \in \Pi _0, \\ H_1&:\text { that }\theta \in \Pi _1, \end{align*} where $\theta $ represents the true value of the parameter i.e. the value for which our model should (at least, as a good approximation) match up with reality.

Definition 7.1.2 The prior odds of $H_0$ against $H_1$ is defined to be
$\seteqnumber{0}{7.}{0}$
\begin{equation} \label {eq:prior_odds} \frac {\P [\Theta \in \Pi _0]}{\P [\Theta \in \Pi _1]}. \end{equation}

Given the data $x$, the posterior odds of $H_0$ against $H_1$ is defined to be
$\seteqnumber{0}{7.}{1}$
\begin{equation} \label {eq:posterior_odds} \frac {\P [\Theta |_{\{X=x\}}\in \Pi _0]}{\P [\Theta |_{\{X=x\}}\in \Pi _1]}. \end{equation}

We might also refer to (7.1) as the ‘prior odds of $\Pi _0$ against $\Pi _1$’, and similarly for (7.2).

Note that the prior odds involve the prior $\Theta $, and the posterior odds involve the posterior $\Theta |_{\{X=x\}}$, but otherwise the formulae are identical. We assume implicitly that $\P [\Theta \in \Pi _0]$ and $\P [\Theta \in \Pi _1]$ are both non-zero, which by Theorems 2.4.1 and 3.1.2 implies that the same is true for $\Theta |_{\{X=x\}}$. Note also that the prior and posterior odds are only well defined for proper prior and posterior distributions, or else we cannot make sense of the probabilities above.

It is often helpful to get a feel for how much the data has influenced the result of the test. For these purposes we also define the Bayes factor

\begin{equation} \label {eq:bayes_factor} B=\frac {\text {posterior odds}}{\text {prior odds}}. \end{equation}

Our next lemma shows why $B$ is important. It is equal to the ratio of the likelihoods of the event $\{X=x\}$, i.e. of the data that we have, conditional on $\Theta \in \Pi _0$ and $\Theta \in \Pi _1$. In other words, $B$ is the ratio of the likelihood of the data under $H_0$ to the likelihood of the data under $H_1$.

Lemma 7.1.3 In the notation above, the Bayes factor satisfies $B=\frac {L_{X|_{\{\Theta \in \Pi _0\}}}(x)}{L_{X|_{\{\Theta \in \Pi _1\}}}(x)}$ where $L$ denotes the likelihood function.

Proof: We split the proof into two cases, depending on whether the Bayesian model is discrete or absolutely continuous. In the discrete case we have

\[B =\frac {\P [\Theta |_{\{X=x\}}\in \Pi _0]\P [\Theta \in \Pi _1]}{\P [\Theta |_{\{X=x\}}\in \Pi _1]\P [\Theta \in \Pi _0]} =\frac {\frac {\P [\Theta \in \Pi _0, X=x]}{\P [X=x]}\P [\Theta \in \Pi _1]}{\frac {\P [\Theta \in \Pi _1, X=x]}{\P [X=x]}\P [\Theta \in \Pi _0]} =\frac {\frac {\P [\Theta \in \Pi _0,X=x]}{\P [\Theta \in \Pi _0]}}{\frac {\P [\Theta \in \Pi _1,X=x]}{\P [\Theta \in \Pi _1]}} =\frac {\P [X|_{\{\Theta \in \Pi _0\}}=x]}{\P [X|_{\{\Theta \in \Pi _1\}}=x]}. \]

We have used equation (1.4) from Lemma 1.4.1 several times here. The continuous case is left for you, in Exercise 7.7. ∎
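The identity in Lemma 7.1.3 can be checked numerically in a small discrete model. The sketch below is an illustration only: the finite parameter space, the prior weights, the binomial model and the data value are all made-up assumptions, not taken from the text.

```python
import math


def binom_pmf(x, n, theta):
    """P[Bin(n, theta) = x]."""
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)


# Hypothetical discrete prior on a four-point parameter space.
prior = {0.2: 0.3, 0.4: 0.3, 0.6: 0.2, 0.8: 0.2}
n, x = 10, 7  # hypothetical data: 7 successes in 10 trials

# Split the parameter space: H0 is theta > 0.5, H1 is theta < 0.5.
Pi0 = {theta for theta in prior if theta > 0.5}
Pi1 = set(prior) - Pi0

# Joint probabilities P[Theta = theta, X = x] and the posterior via Bayes' rule.
joint = {theta: prior[theta] * binom_pmf(x, n, theta) for theta in prior}
evidence = sum(joint.values())
posterior = {theta: joint[theta] / evidence for theta in prior}


def odds(dist):
    """Odds of Pi0 against Pi1 under the distribution `dist`."""
    return sum(dist[t] for t in Pi0) / sum(dist[t] for t in Pi1)


# Bayes factor as (posterior odds) / (prior odds).
bayes_factor = odds(posterior) / odds(prior)


def conditional_likelihood(subset):
    """P[X = x | Theta in subset]."""
    return sum(joint[t] for t in subset) / sum(prior[t] for t in subset)


# Ratio of conditional likelihoods, as in Lemma 7.1.3.
likelihood_ratio = conditional_likelihood(Pi0) / conditional_likelihood(Pi1)

print(bayes_factor, likelihood_ratio)
```

The two printed numbers should agree up to floating-point error, whatever prior weights are chosen, provided both $\Pi _0$ and $\Pi _1$ have non-zero prior probability.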

As a rough guide to interpreting the Bayes factor, the following table¹ is often used:

Bayes factor	Interpretation: evidence in favour of $H_0$ over $H_1$
1 to 3.2	Indecisive / not worth more than a bare mention
3.2 to 10	Substantial
10 to 100	Strong
above 100	Decisive

Note that a high value of $B$ only says that $H_0$ should be preferred over $H_1$. It does not tell us anything objective about how good our model $(M_\theta )$ is; it only tells us that $X|_{\{\Theta \in \Pi _0\}}$ is a better fit for $x$ than $X|_{\{\Theta \in \Pi _1\}}$ is.

Values of the Bayes factor below $1$ suggest evidence in favour of $H_1$ over $H_0$. In such a case we can swap the roles of $H_0$ and $H_1$, which corresponds to the Bayes factor changing from $B$ to $1/B$, and we can then use the same table to discuss the weight of evidence in favour of $H_1$ over $H_0$.
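As a sketch of how the table and the $B\mapsto 1/B$ swap fit together, the following hypothetical helper returns which hypothesis is favoured and the corresponding label. The table does not say which row a boundary value such as $3.2$ or $10$ belongs to; assigning boundaries to the stronger category is an assumption made here.

```python
def interpret_bayes_factor(B):
    """Return (favoured hypothesis, strength label) for a Bayes factor B > 0.

    Boundary values (3.2, 10, 100) are assigned to the stronger category;
    this is an assumption, since the table leaves it unspecified.
    """
    if B <= 0:
        raise ValueError("the Bayes factor must be positive")
    # For B < 1, swap the roles of H0 and H1 and use 1/B instead.
    favoured = "H0" if B >= 1 else "H1"
    strength = B if B >= 1 else 1 / B
    if strength < 3.2:
        label = "Indecisive"
    elif strength < 10:
        label = "Substantial"
    elif strength < 100:
        label = "Strong"
    else:
        label = "Decisive"
    return favoured, label
```

For instance, `interpret_bayes_factor(0.05)` reports strong evidence in favour of $H_1$, because $1/0.05=20$ falls in the "10 to 100" row.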

¹ From Kass & Raftery (1995).

Example 7.1.4 Returning to Example 4.5.3, suppose that we wished to test the hypothesis that the speed camera is, on average, overestimating the speed of cars. Recall that in this example:
- Our model was $\Normal (\mu ,\frac {1}{\tau })$, for the speed recorded by the camera when a car travels at exactly 30mph.
- We used a weak prior $(\mu ,\tau )\sim \NGam (30,\frac {1}{10^2},1,\frac 15)$.
- We found the posterior $(\mu ,\tau )\sim \NGam (30.14, 10.01, 6.00, 1.24)$.
Both the posterior and prior density functions are plotted in Example 4.5.3.

Recall that if $(\mu ,\tau )\sim \NGam (m,p,a,b)$ then $\mu |\tau \sim \Normal (m,\frac {1}{p\tau })$, so the marginal mean of $\mu $ is $m$. Hence, the speed camera on average overestimates the speed when $\mu >30$, and underestimates on average when $\mu <30$. The probability that $\mu $ is exactly $30$ is zero, because our posterior $\NGam $ is a continuous distribution, so we will simply ignore that possibility. We don’t care about the location of $\tau $ here so we simply allow it to take any value $\tau \in (0,\infty )$. This gives us hypotheses
$\seteqnumber{0}{7.}{3}$
\begin{align*} H_0&:\text { that }(\mu ,\tau )\in \Pi _0=(30,\infty )\times (0,\infty ), \\ H_1&:\text { that }(\mu ,\tau )\in \Pi _1=(-\infty ,30)\times (0,\infty ). \end{align*} We want to compute the Bayes factor $B$. We’ll start with the posterior odds ratio. We have
$\seteqnumber{0}{7.}{3}$
\begin{equation*} \P [(\mu ,\tau )\in \Pi _0]=\int _{30}^\infty \int _0^\infty f_{\NGam (30.14, 10.01, 6.00, 1.24)}(\mu ,\tau )\,d\tau \,d\mu \approx 0.82, \end{equation*}

computed numerically and rounded to two decimal places. Note that $\P [(\mu ,\tau )\in \Pi _1]=1-\P [(\mu ,\tau )\in \Pi _0]$, which gives a posterior odds ratio of

\[\frac {\P [\Theta |_{\{X=x\}}\in \Pi _0]}{\P [\Theta |_{\{X=x\}}\in \Pi _1]}=\frac {0.82}{1-0.82}\approx 4.56\]

again rounded to two decimal places. The prior odds ratio, calculated via the same procedure, is exactly $1$. This occurs because the prior $\NGam (30,\frac {1}{10^2},1,\frac 15)$ distribution is symmetric about $\mu =30$ (this symmetry is visible in the sketch in Example 4.5.3), which gives $\P [\NGam (30,\frac {1}{10^2},1,\frac 15)\in \Pi _1]=\P [\NGam (30,\frac {1}{10^2},1,\frac 15)\in \Pi _0]=\frac 12$. Hence the Bayes factor for this hypothesis test is
$\seteqnumber{0}{7.}{3}$
\begin{equation} B=\frac {4.56}{1.00}=4.56. \end{equation}

Based on our table above, we have substantial evidence that the speed camera is overestimating speeds.
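The numerical step in this example can be sketched as follows. The sketch assumes the parametrization in which $\tau \sim \Gam (a,b)$ with $b$ a rate parameter, so that the marginal distribution of $\mu $ is a Student-t distribution with $2a$ degrees of freedom, location $m$ and squared scale $\frac {b}{ap}$. This reduces the double integral to a one-dimensional one, evaluated here with Simpson's rule; this is not necessarily the numerical method used to obtain the figures above, and the function names are hypothetical.

```python
import math


def student_t_pdf(t, df):
    """Density of the standard Student-t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def student_t_cdf(t, df, n=2000):
    """CDF via Simpson's rule on [0, |t|] plus symmetry about 0 (n must be even)."""
    upper = abs(t)
    if upper == 0:
        return 0.5
    h = upper / n
    total = student_t_pdf(0.0, df) + student_t_pdf(upper, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + math.copysign(area, t)


def prob_mu_greater(threshold, m, p, a, b):
    """P[mu > threshold] when (mu, tau) ~ NGam(m, p, a, b), with b as a rate (assumed)."""
    scale = math.sqrt(b / (a * p))
    return 1 - student_t_cdf((threshold - m) / scale, 2 * a)


def odds(prob):
    return prob / (1 - prob)


prior_prob = prob_mu_greater(30, 30, 1 / 10**2, 1, 1 / 5)
posterior_prob = prob_mu_greater(30, 30.14, 10.01, 6.00, 1.24)
bayes_factor = odds(posterior_prob) / odds(prior_prob)

print(prior_prob, posterior_prob, bayes_factor)
```

The prior probability comes out as $\frac 12$ by symmetry, and the posterior probability should land close to the $0.82$ quoted above. Because the posterior parameters used here are themselves rounded, the second decimal place of the probability, and hence of the Bayes factor, may differ slightly from the values in the example.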

A potential problem with our test is that we have not cared about how much the camera is overestimating speeds. The (marginal) mean of $\mu $ in our posterior distribution is $30.14$, which is only slightly larger than the true speed $30$, and this suggests that the error is fairly small. We would need to be careful about communicating the result of our test, to avoid giving the wrong impression.

Note that we have used a small amount of Bayesian shorthand in this example, by writing $\mu $ and $\tau $ for both random variables and samples of these random variables.
