Bayesian Statistics

$\newcommand{\footnotename}{footnote}$ $\def \LWRfootnote {1}$ $\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\let \LWRorighspace \hspace $ $\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }$ $\newcommand {\mathnormal }[1]{{#1}}$ $\newcommand \ensuremath [1]{#1}$ $\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } $ $\newcommand {\setlength }[2]{}$ $\newcommand {\addtolength }[2]{}$ $\newcommand {\setcounter }[2]{}$ $\newcommand {\addtocounter }[2]{}$ $\newcommand {\arabic }[1]{}$ $\newcommand {\number }[1]{}$ $\newcommand {\noalign }[1]{\text {#1}\notag \\}$ $\newcommand {\cline }[1]{}$ $\newcommand {\directlua }[1]{\text {(directlua)}}$ $\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}$ $\newcommand {\protect }{}$ $\def \LWRabsorbnumber #1 {}$ $\def \LWRabsorbquotenumber "#1 {}$ $\newcommand {\LWRabsorboption }[1][]{}$ $\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }$ $\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }$ $\def \mathcode #1={\mathchar }$ $\let \delcode \mathcode $ $\let \delimiter \mathchar $ $\def \oe {\unicode {x0153}}$ $\def \OE {\unicode {x0152}}$ $\def \ae {\unicode {x00E6}}$ $\def \AE {\unicode {x00C6}}$ $\def \aa {\unicode {x00E5}}$ $\def \AA {\unicode {x00C5}}$ $\def \o {\unicode {x00F8}}$ $\def \O {\unicode {x00D8}}$ $\def \l {\unicode {x0142}}$ $\def \L {\unicode {x0141}}$ $\def \ss {\unicode {x00DF}}$ $\def \SS {\unicode {x1E9E}}$ $\def \dag {\unicode {x2020}}$ $\def \ddag {\unicode {x2021}}$ $\def \P {\unicode {x00B6}}$ $\def \copyright {\unicode {x00A9}}$ $\def \pounds {\unicode {x00A3}}$ $\let \LWRref \ref $ $\renewcommand {\ref }{\ifstar \LWRref \LWRref }$ $ \newcommand {\multicolumn }[3]{#3}$ $\require {textcomp}$ $\newcommand {\intertext }[1]{\text {#1}\notag \\}$ $\let \Hat \hat $ $\let \Check \check $ $\let \Tilde \tilde $ $\let \Acute \acute $ $\let 
\Grave \grave $ $\let \Dot \dot $ $\let \Ddot \ddot $ $\let \Breve \breve $ $\let \Bar \bar $ $\let \Vec \vec $ $\require {colortbl}$ $\let \LWRorigcolumncolor \columncolor $ $\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigrowcolor \rowcolor $ $\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigcellcolor \cellcolor $ $\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }$ $\require {mathtools}$ $\newenvironment {crampedsubarray}[1]{}{}$ $\newcommand {\smashoperator }[2][]{#2\limits }$ $\newcommand {\SwapAboveDisplaySkip }{}$ $\newcommand {\LaTeXunderbrace }[1]{\underbrace {#1}}$ $\newcommand {\LaTeXoverbrace }[1]{\overbrace {#1}}$ $\newcommand {\LWRmultlined }[1][]{\begin {multline*}}$ $\newenvironment {multlined}[1][]{\LWRmultlined }{\end {multline*}}$ $\let \LWRorigshoveleft \shoveleft $ $\renewcommand {\shoveleft }[1][]{\LWRorigshoveleft }$ $\let \LWRorigshoveright \shoveright $ $\renewcommand {\shoveright }[1][]{\LWRorigshoveright }$ $\newcommand {\shortintertext }[1]{\text {#1}\notag \\}$ $\newcommand {\vcentcolon }{\mathrel {\unicode {x2236}}}$ $\renewcommand {\intertext }[2][]{\text {#2}\notag \\}$ $\newenvironment {fleqn}[1][]{}{}$ $\newenvironment {ceqn}{}{}$ $\newenvironment {darray}[2][c]{\begin {array}[#1]{#2}}{\end {array}}$ $\newcommand {\dmulticolumn }[3]{#3}$ $\newcommand {\LWRnrnostar }[1][0.5ex]{\\[#1]}$ $\newcommand {\nr }{\ifstar \LWRnrnostar \LWRnrnostar }$ $\newcommand {\mrel }[1]{\begin {aligned}#1\end {aligned}}$ $\newcommand {\underrel }[2]{\underset {#2}{#1}}$ $\newcommand {\medmath }[1]{#1}$ $\newcommand {\medop }[1]{#1}$ $\newcommand {\medint }[1]{#1}$ $\newcommand {\medintcorr }[1]{#1}$ $\newcommand {\mfrac }[2]{\frac {#1}{#2}}$ $\newcommand {\mbinom }[2]{\binom {#1}{#2}}$ $\newenvironment {mmatrix}{\begin {matrix}}{\end {matrix}}$ $\newcommand {\displaybreak }[1][]{}$ $ \def \offsyl {(\oslash )} 
\def \msconly {(\Delta )} $ $ \DeclareMathOperator {\var }{var} \DeclareMathOperator {\cov }{cov} \DeclareMathOperator {\Bin }{Bin} \DeclareMathOperator {\Geo }{Geometric} \DeclareMathOperator {\Beta }{Beta} \DeclareMathOperator {\Unif }{Uniform} \DeclareMathOperator {\Gam }{Gamma} \DeclareMathOperator {\Normal }{N} \DeclareMathOperator {\Exp }{Exp} \DeclareMathOperator {\Cauchy }{Cauchy} \DeclareMathOperator {\Bern }{Bernoulli} \DeclareMathOperator {\Poisson }{Poisson} \DeclareMathOperator {\Weibull }{Weibull} \DeclareMathOperator {\IGam }{IGamma} \DeclareMathOperator {\NGam }{NGamma} \DeclareMathOperator {\ChiSquared }{ChiSquared} \DeclareMathOperator {\Pareto }{Pareto} \DeclareMathOperator {\NBin }{NegBin} \DeclareMathOperator {\Studentt }{Student-t} \DeclareMathOperator *{\argmax }{arg\,max} \DeclareMathOperator *{\argmin }{arg\,min} $ \( \def \to {\rightarrow } \def \iff {\Leftrightarrow } \def \ra {\Rightarrow } \def \sw {\subseteq } \def \mc {\mathcal } \def \mb {\mathbb } \def \sc {\setminus } \def \wt {\widetilde } \def \v {\textbf } \def \E {\mb {E}} \def \P {\mb {P}} \def \R {\mb {R}} \def \C {\mb {C}} \def \N {\mb {N}} \def \Q {\mb {Q}} \def \Z {\mb {Z}} \def \B {\mb {B}} \def \~{\sim } \def \-{\,;\,} \def \qed {$\blacksquare $} \CustomizeMathJax {\def \1{\unicode {x1D7D9}}} \def \cadlag {c\`{a}dl\`{a}g} \def \p {\partial } \def \l {\left } \def \r {\right } \def \Om {\Omega } \def \om {\omega } \def \eps {\epsilon } \def \de {\delta } \def \ov {\overline } \def \sr {\stackrel } \def \Lp {\mc {L}^p} \def \Lq {\mc {L}^p} \def \Lone {\mc {L}^1} \def \Ltwo {\mc {L}^2} \def \toae {\sr {\rm a.e.}{\to }} \def \toas {\sr {\rm a.s.}{\to }} \def \top {\sr {\mb {\P }}{\to }} \def \tod {\sr {\rm d}{\to }} \def \toLp {\sr {\Lp }{\to }} \def \toLq {\sr {\Lq }{\to }} \def \eqae {\sr {\rm a.e.}{=}} \def \eqas {\sr {\rm a.s.}{=}} \def \eqd {\sr {\rm d}{=}} \def \approxd {\sr {\rm d}{\approx }} \def \Sa {(S1)\xspace } \def \Sb {(S2)\xspace } \def \Sc {(S3)\xspace } \)

Chapter 4 Conjugate priors

Recall that we define a Bayesian model $(X,\Theta )$ using a prior $\Theta $ and a model family $(M_\theta )_{\theta \in \Pi }$. The main result from Chapters 2 and 3 is that we can (at least, in principle) obtain the distribution of the posterior $\Theta |_{\{X=x\}}$. If the prior and posterior are from the same family of distributions then we say that this family of distributions is conjugate to the model family.

Note that the setup here involves two families of distributions: (1) the model family $(M_\theta )$ and (2) the conjugate family, to which the prior and posterior both belong. The formal definition is a bit of a mouthful:

Definition 4.0.1 Let $(M_\theta )_{\theta \in \Pi }$ and $(T_a)_{a\in A}$ be two model families, with parameter spaces $\Pi $ and $A$ respectively. We say that $(M_\theta )$ and $(T_a)$ are a conjugate pair if whenever $(X,\Theta )$ is a Bayesian model with model family $(M_\theta )$ and prior $\Theta \sim T_a$, for all $x\in R_X$ there exists $b\in A$ such that $\Theta |_{\{X=x\}}\sim T_b$. We say that the family $(T_a)$ is a conjugate prior for $(M_\theta )$.

The point of using conjugate pairs is that, to specify a Bayesian update step, we only need to describe how the parameters of the prior distribution change to give the posterior distribution. This is in general much simpler than applying (2.11) and (3.3) directly. We will describe a few conjugate pairs in this chapter and discuss their limitations in Section 4.6.

4.1 Notation: proportionality

In (2.12) and (3.7) we used $\frac {1}{Z}$ and $\frac {1}{Z'}$ for normalizing constants. It was helpful not to worry about exactly what the values of these constants were. In longer calculations we might need several different normalizing constants of this kind, and it is helpful to have some notation for doing so (beyond simply $\frac {1}{Z''},\frac {1}{Z'''}$ and so on).

Definition 4.1.1 Let $f$ and $g$ be functions with the same domain. We write $f\propto g$ if there exists $C\in (0,\infty )$ such that $f(x)=Cg(x)$ for all $x$ in that domain. In words, $f$ is said to be proportional to $g$.

The relation $\propto $ has several nice properties, which are easy to check and are left for you in Exercise 4.9. For example, for any function $f$ we have $f\propto f$. Also, $f\propto g$ if and only if $g\propto f$ and, lastly, if $f\propto g$ and $g\propto h$ then $f\propto h$. We’ll use these properties frequently in calculations, without further comment.

Example 4.1.2 Using the notation $\propto $, Lemma 1.2.5 says that for random variables $X$ and $Y$:
- If $X$ and $Y$ are discrete and $p_X\propto p_Y$ then $X\eqd Y$.
- If $X$ and $Y$ are continuous and $f_X\propto f_Y$ then $X\eqd Y$.
We will often use Lemma 1.2.5 in this way from now on, including in the next example.

Example 4.1.3 The calculation in (2.12) can be written simply as
\begin{align*} f_{P|_{\{X=4\}}}(p) &\propto \P [\Bin (10,p)=4]f_{\Beta (2,8)}(p) \\ &\propto p^4(1-p)^{10-4}p^{2-1}(1-p)^{8-1} \\ &\propto p^{5}(1-p)^{13}. \end{align*} It follows immediately from Lemma 1.2.5 that $P|_{\{X=4\}}\sim \Beta (6,14)$.
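As a numerical sanity check on Example 4.1.3, the following Python sketch integrates the unnormalized function $p^5(1-p)^{13}$ over $[0,1]$ and compares the result with $\mc {B}(6,14)$, which is the normalizing constant of the $\Beta (6,14)$ p.d.f. It is an illustration only: the function names, the midpoint rule and the grid size are our own choices and do not come from the notes.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def unnormalized_posterior(p):
    """Last line of the calculation in Example 4.1.3: p^5 (1 - p)^13."""
    return p**5 * (1 - p)**13

def midpoint_integral(f, n_cells=20_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n_cells
    return h * sum(f((i + 0.5) * h) for i in range(n_cells))

# If the posterior is Beta(6, 14), the integral below must equal B(6, 14).
Z = midpoint_integral(unnormalized_posterior)
print("numerical integral:", Z)
print("B(6, 14):          ", beta_fn(6, 14))
```

The two printed values should agree to many decimal places, which is what Lemma 1.2.5 relies on: once the shape $p^{5}(1-p)^{13}$ is known, the constant is forced.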

Example 4.1.4 More generally, the key equations (2.11) and (3.3) from Theorems 2.4.1 and 3.1.2 can be written
\begin{align*} f_{\Theta |_{\{X=x\}}}(\theta )\propto p_{M_\theta }(x)f_{\Theta }(\theta ) & \qquad \text { for discrete Bayesian models,} \\ f_{\Theta |_{\{X=x\}}}(\theta )\propto f_{M_\theta }(x)f_{\Theta }(\theta ) & \qquad \text { for continuous Bayesian models.} \end{align*} We will often use Theorems 2.4.1 and 3.1.2 in this way from now on.

A complication of using $\propto $ is that the symbol does not explicitly specify which variables should be treated as part of the proportionality, and which other variables can be treated as constants. For our purposes there is a simple way to resolve this difficulty. When we use $\propto $ there will, in most cases, be a function on the left of the first $\propto $ that appears within a calculation. The arguments of that function (not including subscripts) are the variables that proportionality applies to; everything else can be treated as constant as far as $\propto $ is concerned.
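To see this convention in action, here is a small Python sketch based on Example 4.1.3, where the function on the left has argument $p$. Factors that do not depend on $p$, namely $\binom {10}{4}$ and $\frac {1}{\mc {B}(2,8)}$, are kept in one version and dropped in the other, and both are then normalized on a grid. The helper names and the grid size are hypothetical choices made for illustration.

```python
import math

def normalize_on_grid(f, n_cells=2000):
    """Evaluate f at midpoints of [0, 1] and rescale so the values integrate to 1."""
    h = 1.0 / n_cells
    values = [f((i + 0.5) * h) for i in range(n_cells)]
    total = h * sum(values)
    return [v / total for v in values]

def with_constants(p):
    """Likelihood times prior, keeping every factor that is constant in p."""
    likelihood = math.comb(10, 4) * p**4 * (1 - p)**6
    beta_2_8 = math.gamma(2) * math.gamma(8) / math.gamma(10)
    prior = p**(2 - 1) * (1 - p)**(8 - 1) / beta_2_8
    return likelihood * prior

def without_constants(p):
    """The same function of p with all constant factors dropped."""
    return p**5 * (1 - p)**13

dens_a = normalize_on_grid(with_constants)
dens_b = normalize_on_grid(without_constants)
max_gap = max(abs(u - v) for u, v in zip(dens_a, dens_b))
print("largest difference after normalizing:", max_gap)
```

The difference is at the level of floating-point rounding, reflecting that anything not depending on $p$ is absorbed into the normalizing constant.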

4.1.1 The Beta-Binomial pair

Here is our first example of a conjugate pair, which generalizes the calculations in Examples 2.1.1 to 2.3.3.

Lemma 4.1.5 (Beta-Bernoulli conjugate pair) Let $n\in \N $. Let $(X,\Theta )$ be a discrete Bayesian model with model family $M_\theta \sim \Bern (\theta )^{\otimes n}$ and parameter $\theta \in [0,1]$. Suppose that the prior is $\Theta \sim \Beta (a,b)$ and let $x\in \{0,1\}^n$. Then the posterior is $\Theta |_{\{X=x\}}\sim \Beta (a+k,b+n-k)$ where $k=\sum _1^n x_i$.

Proof: Note that $k$ is the number of Bernoulli trials that generate a $1$, and that we have $n$ trials in total. Under $M_\theta $, each trial has probability $\theta $ of generating $1$. From Theorem 2.4.1 we have that for $\theta \in [0,1]$

\begin{align*} f_{\Theta |_{\{X=x\}}}(\theta ) &\propto \P [\Bern (\theta )^{\otimes n}=x] f_{\Beta (a,b)}(\theta ) \\ &\propto \l (\prod _{i=1}^n\P [\Bern (\theta )=x_i]\r ) f_{\Beta (a,b)}(\theta ) \\ &\propto \l (\theta ^k(1-\theta )^{n-k}\r )\l (\frac {1}{\mc {B}(a,b)}\theta ^{a-1}(1-\theta )^{b-1}\r )\\ &\propto \theta ^{a+k-1}(1-\theta )^{b+n-k-1}. \end{align*} By Lemma 1.2.5 we recognize this as the p.d.f. of the $\Beta (a+k,b+n-k)$ distribution, so $\Theta |_{\{X=x\}}\sim \Beta (a+k,b+n-k)$. ∎
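The update rule in Lemma 4.1.5 amounts to counting successes and failures. A minimal Python sketch of it is given below; the function name `beta_bernoulli_update` and the data vector are hypothetical, chosen so that the result matches the posterior $\Beta (6,14)$ found in Example 4.1.3.

```python
def beta_bernoulli_update(a, b, x):
    """Posterior Beta parameters (a + k, b + n - k) from Lemma 4.1.5.

    a, b : parameters of the Beta(a, b) prior
    x    : sequence of n observations, each 0 or 1
    """
    if any(xi not in (0, 1) for xi in x):
        raise ValueError("each observation must be 0 or 1")
    n = len(x)   # total number of trials
    k = sum(x)   # number of trials that generated a 1
    return a + k, b + n - k

# Illustrative data (an assumption): 4 successes in 10 trials, prior Beta(2, 8).
data = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
print(beta_bernoulli_update(2, 8, data))  # prints (6, 14)
```

Note that only $n$ and $k$ enter the result, so the order of the observations in `data` does not matter.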

The value of $k$ has an intuitive interpretation, because it is the number of successful trials observed in our data $x$ (here we take a trial to result in $1$ if successful, and $0$ if failed). Looking back at Example 2.3.3, this allows us to do all of the Bayesian update calculations with one easy piece of arithmetic.

More generally, we can use the Binomial distribution in place of the Bernoulli distribution, as in the next lemma.

Lemma 4.1.6 (Beta-Binomial conjugate pair) Let $n\in \N $ and $m_1,\ldots ,m_n\in \N $. Let $(X,\Theta )$ be a discrete Bayesian model with model family

\[M_\theta \sim \Bin (m_1,\theta )\otimes \ldots \otimes \Bin (m_n,\theta )\]

and parameter $\theta \in \Pi =[0,1]$. Suppose that the prior is $\Theta \sim \Beta (a,b)$ and let $x=(x_1,\ldots ,x_n)$ where $x_i\in \{0,\ldots ,m_i\}$. Then the posterior is $\Theta |_{\{X=x\}}\sim \Beta \l (a+\sum _1^n x_i,b+\sum _1^n m_i-\sum _1^n x_i\r )$.

Proof: From Theorem 2.4.1 we have that for $\theta \in [0,1]$

\begin{align*} f_{\Theta |_{\{X=x\}}}(\theta ) &\propto \l (\prod _{i=1}^n\binom {m_i}{x_i}\theta ^{x_i}(1-\theta )^{m_i-x_i}\r )\l (\frac {1}{\mc {B}(a,b)}\theta ^{a-1}(1-\theta )^{b-1}\r )\\ &\propto \theta ^{a+\sum _1^n x_i-1}(1-\theta )^{b+\sum _1^nm_i-\sum _1^nx_i-1}. \end{align*} By Lemma 1.2.5 we recognize this as the p.d.f. of the $\Beta \l (a+\sum _1^n x_i,b+\sum _1^n m_i-\sum _1^n x_i\r )$ distribution, so $\Theta |_{\{X=x\}}\sim \Beta \l (a+\sum _1^n x_i,b+\sum _1^n m_i-\sum _1^n x_i\r )$. ∎
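The update in Lemma 4.1.6 can be sketched in Python in the same way. The function name `beta_binomial_update` and the numbers used below are hypothetical; the first call reproduces Example 4.1.3 (a single $\Bin (10,\theta )$ observation equal to $4$ with prior $\Beta (2,8)$).

```python
def beta_binomial_update(a, b, m, x):
    """Posterior Beta parameters from Lemma 4.1.6.

    a, b : parameters of the Beta(a, b) prior
    m    : trial counts (m_1, ..., m_n) of the Binomial components
    x    : observed success counts (x_1, ..., x_n), with 0 <= x_i <= m_i
    """
    if len(m) != len(x):
        raise ValueError("m and x must have the same length")
    if any(not (0 <= xi <= mi) for mi, xi in zip(m, x)):
        raise ValueError("each x_i must satisfy 0 <= x_i <= m_i")
    successes = sum(x)            # sum of the x_i
    failures = sum(m) - sum(x)    # sum of the m_i minus sum of the x_i
    return a + successes, b + failures

print(beta_binomial_update(2, 8, [10], [4]))               # prints (6, 14)
print(beta_binomial_update(2, 8, [10, 5, 20], [4, 1, 7]))  # prints (14, 31)
```

Taking every $m_i=1$ gives $\Bin (1,\theta )=\Bern (\theta )$, and the function then returns the same parameters as the update in Lemma 4.1.5.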

Remark 4.1.7 $\offsyl $ There is a further generalization of this model to experiments that can have many possible outcomes. It involves the Dirichlet and multinomial distributions. It is not much more complicated, but we won’t cover it within this course.