Bayesian Statistics
2.4 Bayesian updates
We now write down a general version of the method in Example 2.3.3, in the form of a theorem that we can apply once per update step. It is worth noting that in many practical situations only one update step is actually needed; this is normally the case when we receive all the relevant data at the same time. If our data arrives gradually (e.g. once per year from an annual survey) then we can provide ongoing analysis by carrying out an update step whenever new data arrives.
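To illustrate the point about update steps, here is a minimal sketch in Python, assuming the Beta-Binomial conjugate update from Example 2.3.3 (a \(\operatorname{Beta}(a,b)\) prior with \(k\) successes observed in \(n\) trials gives a \(\operatorname{Beta}(a+k,\,b+n-k)\) posterior); the function name and the split of the data into batches are our own illustrative choices, not from the notes. One update on all the data, or one update per batch as it arrives, lands on the same posterior.

```python
# A sketch of update steps for the Beta-Binomial model of Example 2.3.3:
# a Beta(a, b) prior with k successes in n trials gives a
# Beta(a + k, b + n - k) posterior. Names and batch split are illustrative.

def update(a, b, k, n):
    """One Bayesian update step for the Beta-Binomial model."""
    return a + k, b + n - k

# One update step on all the data at once ...
a, b = update(2, 8, 4, 10)

# ... agrees with one update step per batch as the data arrives:
a2, b2 = update(2, 8, 1, 3)        # first batch: 1 success in 3 trials
a2, b2 = update(a2, b2, 3, 7)      # second batch: 3 successes in 7 trials

assert (a, b) == (a2, b2) == (6, 14)   # both give the Beta(6, 14) posterior
```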
In Example 2.3.3 we only had one parameter, so our parameter space was \(\Pi \subseteq \mathbb{R}\), and we only had one piece of data (per update), so our model for the data was a random variable \(X\) taking values in \(\mathbb{R}\). In general we will want some number \(d\in \mathbb{N}\) of parameters, so we take \(\Pi \subseteq \mathbb{R}^d\), and we will want to handle some number \(n\in \mathbb{N}\) of datapoints at once, so we let \(X\) take values in \(\mathbb{R}^n\).
Theorem 2.4.1
Let \((X,\Theta )\) be a discrete Bayesian model, with parameter space \(\Pi \), prior p.d.f. \(f_\Theta \), model family \((M_\theta )\) and range \(R\). Suppose that \(x\in R\). Then the posterior
\(\Theta |_{\{X=x\}}\) is a continuous random variable and has p.d.f.
\begin{equation}
f_{\Theta|_{\{X=x\}}}(\theta) = \frac{1}{Z}\,\mathbb{P}[M_\theta = x]\, f_{\Theta}(\theta) \tag{2.11}
\end{equation}
where \(Z=\int_\Pi \mathbb{P}[M_\theta = x]\,f_{\Theta}(\theta)\,d\theta\). The range of \(\Theta|_{\{X=x\}}\) is \(\Pi\), the same range as for \(\Theta\).
Proof of Theorem 2.4.1:
Let \((X,\Theta)\) be a discrete Bayesian model as given and let \(x\in R\). Note that \(\mathbb{P}[M_\theta = x]>0\) for all \(\theta\) because \((M_\theta)\) is a discrete model family, so from (2.4) we have that \(\mathbb{P}[X=x]>0\). By Lemma 1.5.1,
\[ \mathbb{P}\left[\Theta|_{\{X=x\}}\in B\right] = \frac{\mathbb{P}[X=x,\,\Theta\in B]}{\mathbb{P}[X=x]} = \frac{\int_{B}\mathbb{P}[M_\theta=x]\,f_\Theta(\theta)\,d\theta}{\int_{\mathbb{R}^d}\mathbb{P}[M_\theta=x]\,f_\Theta(\theta)\,d\theta} = \int_{B}\frac{1}{Z}\,\mathbb{P}[M_\theta=x]\,f_\Theta(\theta)\,d\theta, \]
where \(Z=\int_{\mathbb{R}^d}\mathbb{P}[M_\theta=x]\,f_\Theta(\theta)\,d\theta\). The denominator here comes from (2.4) and the numerator from (2.3).
The definition of a discrete Bayesian model gives that the prior \(\Theta\) has range \(\Pi\), so we may assume \(f_\Theta(\theta)=0\) for \(\theta\notin\Pi\). Hence \(Z=\int_{\Pi}\mathbb{P}[M_\theta=x]\,f_\Theta(\theta)\,d\theta\). It follows that \(\Theta|_{\{X=x\}}\) is a continuous random variable with p.d.f. as in (2.11). Also, \(f_{\Theta|_{\{X=x\}}}(\theta)>0\) if and only if \(\theta\in\Pi\), so \(\Theta|_{\{X=x\}}\) also has range \(\Pi\). ∎
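To make (2.11) concrete, here is a minimal sketch in Python of a grid-based numerical version of Theorem 2.4.1 for a one-dimensional parameter space; the function name, the equally spaced grid, and the Riemann-sum estimate of \(Z\) are our own illustrative choices, not a method prescribed by the notes.

```python
import numpy as np

def grid_posterior(theta_grid, prior_pdf, likelihood):
    """Grid approximation of equation (2.11) for a 1-d parameter space Pi.

    theta_grid : equally spaced points covering Pi
    prior_pdf  : the prior p.d.f. f_Theta evaluated at each grid point
    likelihood : P[M_theta = x] evaluated at each grid point
    """
    unnorm = likelihood * prior_pdf      # P[M_theta = x] * f_Theta(theta)
    h = theta_grid[1] - theta_grid[0]    # grid spacing
    Z = unnorm.sum() * h                 # Riemann-sum estimate of Z
    return unnorm / Z                    # posterior p.d.f. on the grid
```

Dividing by \(Z\) plays exactly the role of the normalizing constant in Theorem 2.4.1: it makes the returned values integrate to (approximately) one over the grid.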
Now that we have Theorem 2.4.1, and in particular equation (2.11), we should use it to calculate posterior densities, rather than falling back on Definition 1.4.2. For example, in Example 2.3.3 we can go straight from our description of the model to writing down the p.d.f. of \(P|_{\{X=4\}}\) as
\begin{align}
f_{P|_{\{X=4\}}}(p) &= \frac{1}{Z}\,\mathbb{P}[\operatorname{Bin}(10,p)=4]\, f_{\operatorname{Beta}(2,8)}(p) \notag \\
&= \frac{\binom{10}{4}\,\mathcal{B}(2,8)}{Z}\, p^4(1-p)^{10-4}\, p^{2-1}(1-p)^{8-1} \notag \\
&= \frac{1}{Z'}\, p^{5}(1-p)^{13} \tag{2.12}
\end{align}
for \(p\in [0,1]\), and zero elsewhere. Note that in the last line we have written \(\frac{1}{Z'}=\frac{\binom{10}{4}\,\mathcal{B}(2,8)}{Z}\), without having to do any computation: we only need to care about the part of the formula that depends on \(p\), because the rest will be a normalizing constant. From Lemma 1.2.5 we can recognize (2.12) as the p.d.f. of the \(\operatorname{Beta}(6,14)\) distribution.
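As a numerical sanity check (not part of the notes), we can put the \(\operatorname{Beta}(2,8)\) prior and the \(\operatorname{Bin}(10,p)\) likelihood from Example 2.3.3 on a grid, normalize as in (2.11), and confirm that the result matches the \(\operatorname{Beta}(6,14)\) p.d.f.; the grid size and tolerance below are arbitrary choices.

```python
import numpy as np
from scipy.stats import beta, binom

p = np.linspace(0.0, 1.0, 10_001)
prior = beta.pdf(p, 2, 8)            # Beta(2, 8) prior from Example 2.3.3
like = binom.pmf(4, 10, p)           # P[Bin(10, p) = 4], the likelihood

unnorm = like * prior                # numerator of (2.11), proportional to (2.12)
post = unnorm / (unnorm.sum() * (p[1] - p[0]))   # normalize on the grid

# The normalized product matches the Beta(6, 14) p.d.f. up to grid error.
assert np.allclose(post, beta.pdf(p, 6, 14), atol=1e-4)
```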
2.4.1 Historical notes \((\oslash)\)
Equation (2.11) is often known as Bayes’ rule, after Thomas Bayes (1701-1761). Bayes was one of the first mathematicians to study conditional probability, although he only became interested in it late in life and did not publish his work. Instead, it was edited and published by Richard Price (1723-1791) after Bayes’ death. Both Bayes and Price were primarily interested in philosophy; statistics barely existed at the time, and calculus had only recently been discovered. In fact, what Bayes discovered is much closer to Lemma 1.4.1 in the special case of discrete random variables.
The concept of Bayesian inference first appears in the work of Pierre-Simon Laplace (1749-1827). It was originally known as ‘inverse probability’ and kept this name until the 1950s, when the term ‘Bayesian’ came into use instead. This makes Bayesian methods one of the oldest parts of statistics; by comparison, techniques based on maximum likelihood estimators (MLEs) were not introduced until the 1920s. During the middle part of the 20th century statistics was dominated by techniques based on MLEs, and Bayesian techniques fell out of fashion. They became popular again with the development of modern computing power in the 1980s and 1990s, when it was realized (as we will see in Chapter 8) that Bayesian updates could be performed numerically, without relying on families of well-known distributions. This made it possible to write down highly complex Bayesian models whilst still having them ‘learn’ from data.