last updated: October 24, 2024

Bayesian Statistics


5.3 Reference priors

An interesting response to the argument in Section 5.2.1 was given by the statistician Harold Jeffreys in 1946. It leads to a particular suggestion for the choice of prior. Suppose that two different people, Alice and Bob, construct a Bayesian model with one parameter. Alice uses the model family \((M_\theta )_{\theta \in \Pi }\) and Bob uses the model family \((M_\varphi )_{\varphi \in \Pi }\), where \(\theta \) and \(\varphi \) are related by some function \(h(\theta )=\varphi \), where \(h:\Pi \to \Pi \) with \(\Pi \sw \R \), and \(h\) is strictly increasing and differentiable. That is, they use the ‘same’ model family, but parametrize it differently.
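For example, the model family might consist of the Bernoulli distributions: Alice’s parameter could be the success probability, while Bob’s parameter is the square root of the success probability, so the two labellings are linked by a strictly increasing function.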

Alice will use prior p.d.f. \(f_1\) and Bob will use prior p.d.f. \(f_2\). This means that Alice constructs a model with sampling distribution

\[f_{X_1}(x)=\int _\Pi f_{M_\theta }(x)f_1(\theta )\,d\theta \]

and Bob constructs a model with sampling distribution

\[f_{X_2}(x)=\int _\Pi f_{M_{h(\theta )}}(x)f_2(\theta )\,d\theta .\]

Alice and Bob have never met each other, and in fact they do not even know that each other exists. Neither of them knows the function \(h\).

This is where we come in. We write the statistics textbook that both Alice and Bob will read. They will choose their priors based on our instructions – the same instructions, for both people. Can we provide Alice and Bob with a way to choose their individual priors that will make their models equal, i.e. so that \(f_{X_1}(x)=f_{X_2}(x)\)?

  • Remark 5.3.1 If Alice and Bob did meet each other, then Alice could tell Bob what her prior \(\Theta \) was, and by comparing notes they could work out the function \(h\). Bob could then choose his prior to be the p.d.f. of \(h(\Theta )\), where \(\Theta \) is Alice’s prior. This choice makes \(f_{X_1}(x)=f_{X_2}(x)\); see Exercise 5.7.

Returning to the situation where Alice and Bob do not meet, the surprising answer to the problem is: yes, this is possible. The solution is that we should tell them both to use the prior

\begin{equation} \label {eq:jeffreys_prior} f(\lambda )\propto \E \l [\l (\frac {d}{d\lambda }\log (L_{M_{\lambda }}(X))\r )^2\r ]^{1/2} \quad \text { where }(M_\lambda )\text { is their chosen model family and }X\sim M_\lambda . \end{equation}
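To get a feel for this formula computationally, here is a minimal sketch in Python that approximates the right hand side by Monte Carlo, taking the exponential family \(M_\theta \sim \Exp (\theta )\) as an illustrative choice. Here \(\frac {d}{d\theta }\log (L_{M_\theta }(X))=\frac {1}{\theta }-X\), so the expectation equals \(\var (X)=1/\theta ^2\) and the estimate should be close to \(1/\theta \). The helper name jeffreys_prior_mc, the finite-difference step and the sample size are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def jeffreys_prior_mc(theta, loglik, sampler, n=100_000, d=1e-4):
    # Monte Carlo estimate of E[(d/dtheta log L_{M_theta}(X))^2]^(1/2),
    # using a central finite difference for the derivative in theta.
    x = sampler(theta, n)
    score = (loglik(x, theta + d) - loglik(x, theta - d)) / (2 * d)
    return np.mean(score ** 2) ** 0.5

# Exponential family with rate theta: log-likelihood and sampler.
loglik = lambda x, theta: np.log(theta) - theta * x
sampler = lambda theta, n: rng.exponential(scale=1.0 / theta, size=n)

for theta in [0.5, 1.0, 2.0, 4.0]:
    # the last two columns should roughly agree
    print(theta, jeffreys_prior_mc(theta, loglik, sampler), 1.0 / theta)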

Let us not worry about how Jeffreys found this solution, and let us just show that it really works.

Proof that the solution works: \(\offsyl \) Alice writes down her prior \(f_1(\theta )\propto \E [(\frac {d}{d\theta }\log (L_{M_{\theta }}(X)))^2]^{1/2}\) and Bob writes down his prior, \(f_2(\varphi )\propto \E [(\frac {d}{d\varphi }\log (L_{M_{h(\varphi )}}(X)))^2]^{1/2}\). Then Alice’s model is

\[f_{X_1}(x) \propto \int _\Pi f_{M_{\theta }}(x)f_1(\theta )\,d\theta . \]

Alice doesn’t know the function \(h\), but we do; substituting \(\theta =h(\lambda )\), her model is equal to

\begin{align*} f_{X_1}(x) &\propto \int _\Pi f_{M_{h(\lambda )}}(x)f_1(h(\lambda ))h'(\lambda )\,d\lambda \\ &\propto \int _\Pi f_{M_{h(\theta )}}(x)f_1(h(\theta ))h'(\theta )\,d\theta . \end{align*} In the last line we have simply changed notation by setting \(\lambda =\theta \). Meanwhile, Bob’s model is equal to

\[f_{X_2}(x) \propto \int _\Pi f_{M_{h(\theta )}}(x)f_2(\theta )\,d\theta . \]

It follows that \(f_{X_1}(x)=f_{X_2}(x)\) if we have \(f_2(\theta )\propto f_1(h(\theta ))h'(\theta )\). We will now show that this equation holds for any strictly increasing differentiable function \(h\). Writing \(\varphi =h(\theta )\),

\begin{align*} f_2(\theta )^2 &\propto \E \l [\l (\frac {d}{d\theta }\log (L_{M_{h(\theta )}}(X))\r )^2\r ] \\ &= \E \l [\l (\frac {d}{d\varphi }\log (L_{M_{\varphi }}(X))\times \frac {d\varphi }{d\theta }\r )^2\r ] \\ &= \E \l [\l (\frac {d}{d\varphi }\log (L_{M_{\varphi }}(X))\r )^2\r ]h'(\theta )^2 \\ &\propto f_1(h(\theta ))^2h'(\theta )^2. \end{align*} To reach the second line we use the chain rule, and in the third line we use that \(\frac {d\varphi }{d\theta }=h'(\theta )\). Taking square roots gives \(f_2(\theta )\propto f_1(h(\theta ))h'(\theta )\), and hence \(f_{X_1}(x)=f_{X_2}(x)\) as required.   ∎
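The invariance we have just proved can also be checked numerically. The sketch below takes Alice’s family to be \(M_p\sim \Bern (p)\) and, as an illustrative choice, \(h(\theta )=\theta ^2\), so that Bob’s parameter \(\theta \) labels the model \(M_{h(\theta )}=\Bern (\theta ^2)\). Since \(X\) only takes the values \(0\) and \(1\), the expectations are exact sums, and the two quantities printed on each line, \(f_2(\theta )\) and \(f_1(h(\theta ))h'(\theta )\), should agree up to the accuracy of the finite-difference derivative. The function names are ours.

import numpy as np

def score_sq_expectation(t, prob_of, d=1e-6):
    # E[(d/dt log L(X))^2] for a Bernoulli model whose success probability,
    # as a function of the parameter t, is prob_of(t). The derivative is a
    # central finite difference; the expectation is an exact sum over X in {0, 1}.
    def loglik(x, a):
        q = prob_of(a)
        return x * np.log(q) + (1 - x) * np.log(1 - q)
    p = prob_of(t)
    total = 0.0
    for x, weight in [(1, p), (0, 1 - p)]:
        score = (loglik(x, t + d) - loglik(x, t - d)) / (2 * d)
        total += weight * score ** 2
    return total

h = lambda t: t ** 2          # illustrative reparametrisation, increasing on (0, 1)
h_prime = lambda t: 2 * t

f1 = lambda p: score_sq_expectation(p, lambda a: a) ** 0.5   # Alice's prior, up to a constant
f2 = lambda t: score_sq_expectation(t, h) ** 0.5             # Bob's prior, up to a constant

for t in [0.2, 0.5, 0.8]:
    # f2(t) and f1(h(t)) h'(t) should agree
    print(t, f2(t), f1(h(t)) * h_prime(t))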

In the earlier half of the 20th century, the argument in Section 5.2.1 was taken quite seriously, and treated as a major philosophical reason to question the reliability of Bayesian statistics. In particular, the objection was that if Alice and Bob both tried to use the same uninformative prior but used models that were parametrized differently, then they would obtain different results, despite having the same intentions and, from their own perspectives, the same methodology. Jeffreys showed that this difficulty could be entirely avoided with a particular choice of prior.

These arguments took place before modern computers, when it was difficult to test how well Bayesian methods worked in practice (except for conjugate priors). We are now better able to test how much different modelling errors matter. Statisticians today no longer attach much weight to this objection.

Starting from the ideas above and those in Remark 5.2.2, there is a modern branch of statistics that investigates uninformative priors with particular theoretical properties. One approach is to use a prior that tries to maximise the difference (in a suitable sense) between the prior and posterior distributions, essentially seeking to maximise the influence of the data. Priors with this property are known as reference priors. Their theory is beyond what we can cover here, but it turns out that if we have only one parameter then the reference prior is the same as the prior proposed by Jeffreys, whereas for multi-parameter models it is not. We will only study the one-parameter case here. Let us investigate what the reference prior looks like for some particular choices of one-parameter model.

  • Definition 5.3.2 Suppose that \((M_\theta )\) is a family of distributions with parameter space \(\Pi \sw \R \). The reference prior \(\Theta \) associated with \((M_\theta )\) has p.d.f. given by

    \begin{equation} \label {eq:reference_prior_d1} f_{\Theta }(\theta )\propto \E \l [\l (\frac {d}{d\theta }\log (L_{M_{\theta }}(X))\r )^2\r ]^{1/2}. \end{equation}

    There are caveats to this definition: the reference prior might be an improper prior, and if the expectation in (5.2) is not finite then the reference prior may not exist.
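    For example, if \(M_\theta \sim \Normal (\theta ,1)\) with \(\theta \in \R \), then \(\frac {d}{d\theta }\log (L_{M_\theta }(X))=X-\theta \) and the expectation in (5.2) is equal to \(1\) for every \(\theta \), so the reference prior is an improper uniform prior on \(\R \).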

Sometimes the reference prior \(\Theta \) for \((M_\theta )\) is easier to find via the equation

\begin{equation} \label {eq:reference_prior_d2} f_{\Theta }(\theta )\propto \E \l [-\frac {d^2}{d\theta ^2}\log (L_{M_{\theta }}(X))\r ]^{1/2}. \end{equation}

  • Remark 5.3.3 \(\offsyl \) To deduce (5.3) from (5.2), use the partial differentiation identity \(\frac {\partial ^2}{\partial \theta ^2} \log f(x,\theta ) = \frac {1}{f(x,\theta )}\frac {\partial ^2}{\partial \theta ^2} f(x,\theta ) - \big ( \frac {\partial }{\partial \theta } \log f(x,\theta )\big )^2\), where \(f(x,\theta )\) denotes \(f_{M_\theta }(x)\), and that \(\E \big [ \frac {1}{f(X,\theta )}\frac {\partial ^2}{\partial \theta ^2}f(X, \theta ) \, \big | \, \theta \big ] = \frac {\partial ^2}{\partial \theta ^2} \int _{\mathbb {R}} f(x,\theta )\,dx = 0\). We omit the details.
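For instance, for the Poisson family \(M_\mu \sim \Poisson (\mu )\) with \(\mu >0\), we have \(\frac {d}{d\mu }\log (L_{M_\mu }(x))=\frac {x}{\mu }-1\) and \(-\frac {d^2}{d\mu ^2}\log (L_{M_\mu }(x))=\frac {x}{\mu ^2}\). Equation (5.2) gives \(\E [(\frac {X}{\mu }-1)^2]^{1/2}=(\var (X)/\mu ^2)^{1/2}=\mu ^{-1/2}\) and equation (5.3) gives \(\E [\frac {X}{\mu ^2}]^{1/2}=\mu ^{-1/2}\), so both lead to the (improper) reference prior \(\propto \mu ^{-1/2}\).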

  • Example 5.3.4 For the Bernoulli model family \((M_p)_{p\in [0,1]}\) where \(M_p\sim \Bern (p)\), the likelihood is

    \[L_{M_p}(x)= \begin {cases} p^x(1-p)^{1-x} & \text { for }p\in [0,1] \\ 0 & \text { otherwise} \end {cases} \]

    for \(x\in \{0,1\}\). Hence \(\frac {d}{dp}\log (L_{M_p}(x))=\frac {d}{dp}(x\log p + (1-x)\log (1-p))=\frac {x}{p}-\frac {1-x}{1-p}\). For \(p\in [0,1]\), the density function of the reference prior \(P\) is given by

    \begin{align*} f_P(p) &\propto \E \l [\l (\frac {d}{dp}\log (L_{M_{p}}(X))\r )^2\r ]^{1/2} \\ &\propto \E \l [\l (\frac {X}{p}-\frac {1-X}{1-p}\r )^2\r ]^{1/2} \\ &\propto \l (p\l (\frac {1}{p}-\frac {1-1}{1-p}\r )^2+(1-p)\l (\frac {0}{p}-\frac {1-0}{1-p}\r )^2\r )^{1/2} \\ &\propto \l (\frac {1}{p}+\frac {1}{1-p}\r )^{1/2} \\ &\propto p^{-1/2}(1-p)^{-1/2}. \end{align*} Using Lemma 1.2.5 we recognize that \(P\sim \Beta (\frac 12,\frac 12)\).
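    Alternatively, the same density can be obtained from (5.3): since \(-\frac {d^2}{dp^2}\log (L_{M_p}(x))=\frac {x}{p^2}+\frac {1-x}{(1-p)^2}\) and \(\E [X]=p\), we have \(\E [-\frac {d^2}{dp^2}\log (L_{M_p}(X))]=\frac {1}{p}+\frac {1}{1-p}=\frac {1}{p(1-p)}\), which again gives \(f_P(p)\propto p^{-1/2}(1-p)^{-1/2}\).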

A useful fact is that the reference priors for \(M_\theta \) and \(M_\theta ^{\otimes n}\) are identical, in the sense that they are proportional to each other. This is shown in Exercise 5.8.