Bayesian Statistics
Chapter 6 Bayesian Notation
Recall that for a random variable \(Y\) we define the likelihood function \(L_Y\) by
\(\seteqnumber{0}{6.}{0}\)\begin{equation} \label {eq:likelihood_def} L_Y(y)=\begin{cases} p_Y(y) & \text { where $p_Y$ is the p.m.f.~and $Y$ is discrete,} \\ f_Y(y) & \text { where $f_Y$ is the p.d.f.~and $Y$ is continuous.} \end {cases} \end{equation}
We continue with our convention of denoting probability density functions by \(f\) and probability mass functions by \(p\).
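For example, if \(Y\sim \Bin (n,q)\) then \(Y\) is discrete and
\[L_Y(y)=p_Y(y)=\binom {n}{y}q^{y}(1-q)^{n-y}\qquad \text {for }y\in \{0,1,\ldots ,n\},\]
whereas if \(Y\sim \Normal (\mu ,\sigma ^2)\) then \(Y\) is continuous and \(L_Y(y)=f_Y(y)=\frac {1}{\sqrt {2\pi \sigma ^2}}e^{-(y-\mu )^2/(2\sigma ^2)}\).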
This notation allows us to write the key equation from Theorems 2.4.1 and 3.1.2, for the distribution of the posterior, in a single form. If \((X,\Theta )\) is a (discrete or continuous) Bayesian model, where \(\Theta \) is a continuous random variable with p.d.f. \(f_\Theta (\theta )\), and the model family \((M_\theta )\) has likelihood function \(L_{M_\theta }\), then the posterior distribution of \(\Theta \) given the data \(x\) has p.d.f.
\(\seteqnumber{0}{6.}{1}\)\begin{equation} \label {eq:bayes_rule_combined} f_{\Theta |_{\{X=x\}}}(\theta )=\frac {L_{M_\theta }(x)f_\Theta (\theta )}{L_X(x)} \end{equation}
where \(Z=L_X(x)\) is the normalizing constant, or equivalently \(f_{\Theta |_{\{X=x\}}}(\theta )\propto L_{M_\theta }(x)f_\Theta (\theta )\). In both Sections 2.2 and 3.1 we noted that \(M_\theta \eqd X|_{\{\Theta =\theta \}}\), which leads to
\(\seteqnumber{0}{6.}{2}\)\begin{equation} \label {eq:bayes_rule_full_condensed} f_{\Theta |_{\{X=x\}}}(\theta )\propto L_{X|_{\{\Theta =\theta \}}}(x)f_\Theta (\theta ). \end{equation}
The term \(L_{X|_{\{\Theta =\theta \}}}(x)\) is often known as the likelihood function of the Bayesian model, and equation (6.3) is yet another version of Bayes’ rule. It is the most general version of Bayes’ rule that we will encounter within this course, and it is the basis for most practical applications of Bayesian inference.
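As a brief illustration of (6.3), consider the Beta–Bernoulli model from earlier chapters (see also Exercise 6.1). Take the prior \(\Theta \sim \Beta (\alpha ,\beta )\) and the model family \(M_\theta \sim \Bern (\theta )^{\otimes n}\), and write \(k=\sum _1^n x_i\) for data \(x=(x_i)_1^n\). Then \(L_{M_\theta }(x)=\theta ^k(1-\theta )^{n-k}\) and \(f_\Theta (\theta )\propto \theta ^{\alpha -1}(1-\theta )^{\beta -1}\), so (6.3) gives
\[f_{\Theta |_{\{X=x\}}}(\theta )\propto \theta ^{k}(1-\theta )^{n-k}\,\theta ^{\alpha -1}(1-\theta )^{\beta -1}=\theta ^{\alpha +k-1}(1-\theta )^{\beta +n-k-1},\]
which we recognize as the p.d.f. of the \(\Beta (\alpha +k,\beta +n-k)\) distribution.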
Some textbooks, and many practitioners, prefer to use a more condensed notation for equations (6.2) and (6.3). They write simply \(f(y)\) for the likelihood function of \(Y\), and \(f(x)\) for the likelihood function of \(X\). Conditioning is written as \(f(y|x)\) for the likelihood function of \(Y\) given the event \(\{X=x\}\). This notation requires that we only ever write \(x\) for samples of \(X\) and \(y\) for samples of \(Y\); otherwise we would not be able to infer which random variables were involved. In this notation (6.2) becomes
\(\seteqnumber{0}{6.}{3}\)\begin{equation} \label {eq:bayes_rule_combined_shorthand} f(\theta |x)=\frac {f(x|\theta )f(\theta )}{f(x)}, \end{equation}
which is easy to remember! It bears a close similarity to \(\P [A|B]=\frac {\P [B|A]\P [A]}{\P [B]}\), which is Bayes’ rule for events. Note that in (6.4) the ‘function’ \(f\) is really representing four different functions, depending on which variable(s) are fed into it – that part of the notation can easily become awkward and/or confusing if you are not familiar with it.
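Spelled out in the notation of (6.2), the symbol \(f\) in (6.4) stands for a different function at each of its four appearances:
\[f(\theta |x)=f_{\Theta |_{\{X=x\}}}(\theta ),\qquad f(x|\theta )=L_{M_\theta }(x),\qquad f(\theta )=f_\Theta (\theta ),\qquad f(x)=L_X(x).\]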
There are many variations on the notation in (6.4).
-
1. Some textbooks prefer to denote likelihood by \(p(\cdot )\) instead of \(f(\cdot )\), giving \(p(\theta |x)=\frac {p(x|\theta )p(\theta )}{p(x)}\). Some use a different symbol for likelihood functions connected to \(\Theta \) and those connected to \(X\), for example \(p(\theta |x)=\frac {l(x|\theta )p(\theta )}{l(x)}\) where \(p\) denotes prior and posterior and \(l\) denotes the likelihood of the model family.
-
2. Sometimes the likelihood is omitted entirely, by writing e.g. \(\theta |x\) to denote the distribution of the posterior \(\Theta |_{\{X=x\}}\). Here lower case letters are used to blur the distinction between random variables and data. For example, you might see a Bayesian model with Binomial model family and a Beta prior defined by writing simply \(\theta \sim \Beta (a,b)\) and \(x|\theta \sim \Bin (n,\theta )\).
-
3. Some textbooks use subscripts to indicate which random variables are conditioned on, as we have done, but in a slightly different way e.g. \(f_{X|Y}(x,y)\) instead of \(f_{X|_{\{Y=y\}}}(x)\).
In this course we refer to all these various notations as Bayesian shorthand, or simply shorthand.
Using shorthand can make Bayesian statistics very confusing to learn, so we have avoided it so far within this course. We will sometimes use it from now on, when it is convenient and clear in meaning. This includes several of the exercises at the end of this chapter. For those of you taking MPS4111, Bayesian shorthand will be used extensively in the second semester of your course. Hopefully, by that point you will be familiar enough with the underlying theory that it will save you time rather than cause confusion.
Within this course we use the convention that you should write your answers to questions in the same style of notation as the question uses, unless you are explicitly asked to do otherwise.
A technical remark
-
Remark 6.0.1 \(\offsyl \) The underlying reason for most of the troubles with notation is that, from a purely mathematical point of view, there is no need to restrict to the two special cases of discrete and continuous distributions. It is more natural to think of both Theorems 2.4.1 and 3.1.2 as statements of the form ‘we start with a distribution (the prior) and we perform an operation to turn it into another distribution (the posterior)’. The operation involved here (the Bayesian update) can be made sense of in a consistent way for all distributions, but it requires the disintegration theorem which takes some time to understand.
At present, the strategy chosen by most statisticians is to simply not study disintegration. This is partly for historical reasons. It was clear how to do the continuous case several decades before the disintegration theorem was proved, and the discrete case was understood two centuries before that. Based on these two special cases statisticians developed the idea of a likelihood function, split into two cases as in (6.1). The use of likelihood functions then became well established within statistics, before disintegrations of general distributions were understood by mathematicians. Consequently statistics generally restricts itself to the discrete and continuous cases that we have described in this course.
There are advantages and disadvantages to this choice. It still gives us enough flexibility to write down most of the Bayesian models that we might want to use in data analysis – although we would struggle to handle a model family that uses e.g. the random variable of mixed type in Exercise 1.2. Very occasionally it actually makes things go wrong, as we noted in Remark 3.1.3. The main downside is that we often have to treat the discrete and continuous cases separately, as we did in Chapters 2 and 3. That consumes a bit of time and it leaves us with a weaker understanding of what is going on.
6.1 Exercises on Chapter 6
-
6.1 \(\color {blue}\star \,\star \) The following equations, written in Bayesian shorthand, are the key conclusions from results in earlier chapters of these notes. Which results are they from?
-
(a) \(f(x|y)=\frac {f(y,x)}{f(y)}\).
-
(b) If \(\theta \sim \Beta (\alpha ,\beta )\) and \(x|\theta \sim \Bern (\theta )^{\otimes n}\) then \(\theta |x\sim \Beta (\alpha +k,\beta +n-k)\), where \(x=(x_i)_1^n\) and \(k=\sum _1^n x_i\).
Write the following results in Bayesian shorthand, using similar notation to that in parts (a) and (b).
-
-
6.2 \(\color {blue}\star \,\star \) The following results are written in Bayesian shorthand.
-
(a) If \(x\sim N(0,1)\) then \(x|\{x>0\}\sim |x|\).
-
(b) If \(x\) and \(y\) are independent then \(x|y\sim x\).
In each case, write a version of the results in precise mathematical notation. Which parts of Chapter 1 are they closely related to?
-
-
6.3 \(\color {blue}\star \,\star \) Suppose that we model \(x|\theta \sim \NBin (m,\theta )^{\otimes n}\), where \(m\in \N \) is fixed and \(\theta \in (0,1)\) is an unknown parameter.
-
(a) Show that \(f(x|\theta )\propto \theta ^{mn}(1-\theta )^{\sum _1^n x_i}.\)
-
(b) Show that the prior \(\theta \sim \Beta (\alpha ,\beta )\) is conjugate to \(\NBin (m,\theta )^{\otimes n}\), and find the posterior parameters.
-
(c)
-
(i) Show that the reference prior for \(\theta \) is given by \(f(\theta )\propto \theta ^{-1}(1-\theta )^{-1/2}\).
-
(ii) Does \(f(\theta )\) define a proper distribution?
-
(iii) Find the posterior density \(f(\theta |x)\) arising from this prior.
-
Hint: The setup given is a Bayesian model with model family \(M_{\theta }\sim \NBin (m,\theta )^{\otimes n}\).
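(A note on conventions, in case it helps: if, as in earlier chapters, \(\NBin (m,\theta )\) denotes the number of failures observed before the \(m\)th success in a sequence of independent \(\Bern (\theta )\) trials, then its p.m.f. is \(p(k)=\binom {k+m-1}{k}\theta ^{m}(1-\theta )^{k}\) for \(k\in \{0,1,2,\ldots \}\), and the form claimed in part (a) follows by multiplying the likelihoods of the \(n\) independent samples.)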
-
-
6.4 Suppose that we model \(x|\mu ,\tau \sim \Normal (\mu ,\frac {1}{\tau })^{\otimes n}\), where both \(\mu \) and \(\tau \) are unknown parameters. We use the improper prior \(f(\mu , \tau )\propto \frac {1}{\tau }\) for \(\tau >0\), and \(f(\mu ,\tau )=0\) elsewhere.
-
(a) \(\color {blue}\star \,\star \) Show that for \(\mu \in \R \) and \(\tau >0\) the posterior distribution satisfies
\[f(\mu ,\tau |x)\propto \tau ^{\frac {n}{2}-1}\exp \l (-\frac {\tau }{2}\sum _{i=1}^n(x_i-\mu )^2\r ).\]
-
(b) \(\color {blue}\star \star \star \) Find the marginal p.d.f. of \(\tau |x\). Show that \((\mu ,\tau )|x\) is a proper distribution if and only if \(n\geq 2\).
Hint: The setup given is a Bayesian model with model family \(M_{\mu ,\tau }\sim \Normal (\mu ,\frac {1}{\tau })^{\otimes n}\). For part (b) use the sample-mean-variance identity (4.10).
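(For reference, the identity referred to in the hint is presumably the usual decomposition
\[\sum _{i=1}^{n}(x_i-\mu )^2=\sum _{i=1}^{n}(x_i-\bar {x})^2+n(\bar {x}-\mu )^2,\qquad \bar {x}=\frac {1}{n}\sum _{i=1}^{n}x_i,\]
which separates the dependence on \(\mu \) from the dependence on the data.)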
-
-
6.5 \(\color {blue}\star \,\star \) Let \((M_\theta )_{\theta \in \Pi }\) be a continuous family of distributions. For \(i=1,2,\) let \(\Theta _i\) be a continuous random variable with p.d.f. \(f_{\Theta _i}\), both taking values in \(\R ^d\). Let \(\alpha ,\beta \in (0,1)\) be such that \(\alpha +\beta =1\).
-
(a) Show that \(f_\Theta (\theta )=\alpha f_{\Theta _1}(\theta )+\beta f_{\Theta _2}(\theta )\) is a probability density function.
-
(b) Consider Bayesian models \((X_1,\Theta _1)\) and \((X_2,\Theta _2)\), with the same model family \((M_\theta )\) and different prior distributions. Consider also a third Bayesian model \((X,\Theta )\) with model family \((M_\theta )\) and prior \(\Theta \) with p.d.f. \(f_\Theta (\theta )=\alpha f_{\Theta _1}(\theta )+\beta f_{\Theta _2}(\theta )\).
Show that the posterior distributions of these three models satisfy
\[f_{\Theta |_{\{X=x\}}}(\theta )=\alpha ' f_{\Theta _1|_{\{X_1=x\}}}(\theta ) + \beta ' f_{\Theta _2|_{\{X_2=x\}}}(\theta )\]
where \(\alpha '=\frac {\alpha Z_1}{\alpha Z_1+\beta Z_2}\) and \(\beta '=\frac {\beta Z_2}{\alpha Z_1+\beta Z_2}\). Here \(Z_1\) and \(Z_2\) are the normalizing constants given in Theorem 3.1.2 for the posterior distributions of \((X_1,\Theta _1)\) and \((X_2,\Theta _2)\).
-
(c) Outline briefly how to modify your argument in (b) to also cover the case of discrete Bayesian models.
-
-
6.6 \(\color {blue}\star \star \star \) This question explores the idea in Exercise 4.6 further, but except for (a)(ii) it does not depend on having completed that exercise.
-
(a) Let \((M_\theta )\) be a discrete or absolutely continuous family with range \(R\). Let \((X,\Theta )\) be a Bayesian model with model family \(M_\theta ^{\otimes n}\). Let \(x\in R^n\) and write \(x(1)=(x_1,\ldots ,x_{n_1})\), \(x(2)=(x_{n_1+1},\ldots ,x_{n})\). Let \((X_1,\Theta )\) and \((X_2,\Theta |_{\{X_1=x(1)\}})\) be Bayesian models with model families \(M_\theta ^{\otimes n_1}\) and \(M_\theta ^{\otimes n_2}\), where \(n_1+n_2=n\).
-
(i) Show that
\[(\Theta |_{\{X_1=x(1)\}})|_{\{X_2=x(2)\}}\eqd \Theta |_{\{X=x\}}.\]
Use likelihood functions to write your argument in a way that covers both the discrete and absolutely continuous cases.
-
(ii) What is the connection between this fact and Exercise 4.6?
-
-
(b) Rewrite your solution to (a)(i) in a Bayesian shorthand notation of your choice.
-