last updated: October 24, 2024

Bayesian Statistics


6.2 The connection to maximum likelihood

You have already seen maximum likelihood based methods for parameter inference in previous courses. They rely on the idea that, if we wish to estimate the parameter \(\theta \), we can use the value

\begin{equation} \label {eq:mle_def} \hat {\theta }=\argmax _{\theta \in \Pi } L_{M_\theta }(x) \end{equation}

Here \((M_\theta )\) is a family of models and \(x\) is data, and we believe that for some value(s) of the parameter the model \(M_\theta \) is reasonably similar to whatever physical process generated our data.

The value of \(\hat \theta \), which is usually uniquely specified by (6.5), is known as the maximum likelihood estimator of \(\theta \), given the data \(x\) and model family \((M_\theta )\). Graphically, it is the value of \(\theta \) corresponding to the highest point on the graph \(\theta \mapsto L_{M_\theta }(x)\). Heuristically, it is the value of \(\theta \) that produces a model \(M_\theta \) that has the highest probability (within our chosen model family) to generate the data that we actually saw.
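To make (6.5) concrete, here is a small numerical sketch (my own illustration, not part of the notes): we take the model family to be Binomial\((10,\theta )\), observe \(x=7\) successes, and find the maximiser of \(\theta \mapsto L_{M_\theta }(x)\) by a simple grid search. The closed-form answer here is \(\hat \theta =x/n\).

```python
import math

# Hypothetical example: model family M_theta = Binomial(10, theta),
# observed data x = 7 successes out of n = 10 trials.
n, x = 10, 7

def likelihood(theta):
    # L_{M_theta}(x) = C(n, x) * theta^x * (1 - theta)^(n - x)
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

# Grid search for the argmax over the parameter space (0, 1).
grid = [i / 10000 for i in range(1, 10000)]
theta_hat = max(grid, key=likelihood)

print(theta_hat)  # 0.7, matching the closed-form MLE x/n
```

In practice the likelihood is maximised analytically or with a numerical optimiser rather than a grid, but the grid search makes the "highest point on the graph" picture explicit.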

Recall that, for a discrete random variable \(Y\), the mode is the most likely single value for \(Y\) to take, or \(\argmax _{y\in R_Y} \P [Y=y]\) in symbols. You may be familiar with the following definition already, but we are about to need it, so we recall it here:

  • Definition 6.2.1 Let \(Y\) be a continuous random variable with range \(R_Y\). The mode of \(Y\) is the value \(y\in R_Y\) that maximises the p.d.f. \(f_Y\), given by \(\argmax _{y\in R_Y}f_Y(y)\).

  • Example 6.2.2 For continuous random variables, \(\P [Y=y]=0\) for all \(y\), so in this case the concept of a ‘most likely value’ is best represented by the maximum of the probability density function. Let \(Y\sim \Gam (3,4)\), with p.d.f.

    \[f_Y(y)=\begin {cases} 32y^{2}e^{-4y} & \text { for }y>0,\\ 0 & \text { otherwise.} \end {cases} \]

    (Figure: graph of the p.d.f. \(f_Y\), with the mode marked at \(y=\frac 12\).)

    The mode is at \(y=\frac 12\). This value can be found by solving the equation \(\frac {df_Y(y)}{dy}=32\l (2ye^{-4y}+y^2(-4e^{-4y})\r )=32ye^{-4y}(2-4y)=0\) and checking that the solution \(y=\frac 12\) corresponds to a local maximum.
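The calculus in Example 6.2.2 can be double-checked numerically. The following sketch (my own, not part of the notes) maximises the p.d.f. over a fine grid:

```python
import math

# p.d.f. of Y ~ Gamma(3, 4), as in Example 6.2.2
def f(y):
    return 32 * y**2 * math.exp(-4 * y) if y > 0 else 0.0

# Grid search for the mode over (0, 2]
grid = [i / 1000 for i in range(1, 2001)]
mode = max(grid, key=f)

print(mode)  # 0.5, matching the calculus solution y = 1/2
```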

Comparing equations (6.5) and (6.3), there is a clear connection between MLEs and Bayesian inference: if we take a flat prior (i.e. \(f_\Theta (\theta )\) is constant) then the MLE \(\hat {\theta }\) is equal to the mode of the posterior distribution. This allows us to view the MLE approach as a simplification of the Bayesian approach, obtained in two steps:

  • 1. We fix the prior to be a uniform distribution (or an improper flat prior, if necessary).

  • 2. Instead of considering the posterior distribution as a random variable, we approximate the posterior distribution with a point estimate: its mode.
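As a small check of this connection (my own sketch, using a hypothetical Bernoulli example): with \(n\) Bernoulli trials, \(s\) successes, and a flat Beta\((1,1)\) prior, the posterior is Beta\((s+1,n-s+1)\), whose mode \(s/n\) coincides with the MLE.

```python
# Hypothetical Bernoulli example: n trials, s successes.
n, s = 20, 13

# Step 1: flat prior Beta(1,1); by conjugacy the posterior is
# Beta(s + 1, n - s + 1).
a, b = s + 1, n - s + 1

# Step 2: summarise the posterior by a point estimate, its mode.
# The mode of Beta(a, b) is (a - 1)/(a + b - 2) when a, b > 1.
posterior_mode = (a - 1) / (a + b - 2)

# The MLE for a Bernoulli parameter is the sample proportion s/n.
mle = s / n

print(posterior_mode, mle)  # identical: 0.65 0.65
```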

In principle we might make either one of these simplifications without the other one, but they are commonly made together. We are now able to discuss how the two approaches compare:

  • We’ve seen in many examples that, as the amount of data that we have grows, the posterior distribution tends to become more and more concentrated around a single value. In such a case, the MLE becomes a very good approximation for the posterior. This situation is common when we have plenty of data – see Section 6.2.1 for a more rigorous (but off-syllabus) discussion.

  • If we do not have lots of data then the approximation in step 2 will be less precise, and in the Bayesian case the influence of the prior will matter. In this situation whether Bayesian or MLE methods perform best depends on several factors.

    Bayesian methods require more work to implement, but they allow us to incorporate prior beliefs (if we have them). These prior beliefs can make the analysis more reliable, if they are realistic beliefs, but can make it less reliable if they are not. Bayesian methods generate a posterior distribution that has a clear meaning in terms of conditional probability, with little scope for misinterpretation.

    MLE based methods are comparatively easier to implement, but come with a risk of loss of detail from the approximation in step 2. They produce a point estimate for the unknown parameters, which is easier to communicate, but is also more open to misinterpretation. (We will discuss this issue in more detail in Chapter 7.)

  • If our model is not a reasonable reflection of reality, or if having more data does not help us infer parameters more accurately, then both methods become unreliable – no matter how much data we have.

  • Remark 6.2.3 When we give a point estimate of a random variable it is more common to use the mean, but for the MLE we use the mode. The reason for doing so is simply that the mode often gives a nicer formula.

Statistical methods based on MLEs and the simplifications 1 and 2 listed above are often known as ‘frequentist’ or ‘classical’ methods. You will sometimes find that statisticians describe themselves as ‘Bayesian’ or ‘frequentist’, carrying the implication that they prefer to use one family of methods over the other. This may come from greater experience with one set of methods or from a preference due to the specifics of a particular model.

To a great extent this distinction is historical. During the middle of the 20th century methods based on simplifications 1 and 2 dominated statistics, because they could be implemented without the need for modern computers. Once it was realized that modern computers made Bayesian methods possible (with complicated model families) the community that investigated these techniques needed a name and an identity, to distinguish itself as something new. The concept of identifying as ‘Bayesian’ or ‘frequentist’ is essentially a relic of that social process, rather than anything with a clear mathematical foundation.

Modern statistics makes use of both posterior distributions and (MLE or otherwise) simplifications of the posterior distribution. Sometimes it mixes the two approaches together, or chooses between them for model-specific reasons. We do need to divide things up in order to learn them, so we will only study Bayesian models within this course – but in general you should maintain an understanding of other approaches too.

6.2.1 Making the connection precise \(\offsyl \)

Several theorems are known which actually prove, under wide-ranging conditions, that when we have plenty of data the MLE and Bayesian approaches become essentially equivalent. These theorems are complicated to state, but let us give a brief explanation of what is known here.

Take a model family \((M_\theta )_{\theta \in \Pi }\) and define a Bayesian model \((X,\Theta )\) with model family \(M_\theta ^{\otimes n}\). This model family represents \(n\) i.i.d. samples from \(M_\theta \). Fix some value \(\theta ^*\in \Pi \), which we think of as the true value of the parameter \(\theta \). Let \(x\) be a sample from \(M_{\theta ^*}^{\otimes n}\). We write the posterior \(\Theta |_{\{X=x\}}\) as usual.

Let \(\hat \theta \) be the MLE associated to the model family \(M_\theta ^{\otimes n}\) given the data \(x\), that is \(\hat \theta =\argmax _{\theta \in \Pi }L_{M_\theta ^{\otimes n}}(x)\). Then as \(n\to \infty \) it holds that

\begin{equation} \label {eq:bvm} \Theta |_{\{X=x\}}\approxd \Normal \l (\theta ^*,\frac {1}{n}I(\theta ^*)^{-1}\r ) \end{equation}

where \(I(\theta )\) is the Fisher information matrix defined by \(I(\theta )_{ij}=-\E [\frac {\p ^2}{\p \theta _i \p \theta _j}\log f_{M_\theta }(X)]\), with \(X\sim M_\theta \). The key point is that (6.6) says that, for large \(n\), the posterior \(\Theta |_{\{X=x\}}\) concentrates around the true value \(\theta ^*\), because of the factor \(\frac 1n\) in the variance; since the MLE \(\hat \theta \) also converges to \(\theta ^*\), the two approaches become essentially equivalent.
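Equation (6.6) can be seen numerically in a case where everything is available in closed form (my own sketch, using a Bernoulli model): with \(\theta ^*=0.3\) and idealised data of \(s=n\theta ^*\) successes, the posterior Beta\((s+1,n-s+1)\) under a flat prior has variance close to \(\frac 1n I(\theta ^*)^{-1}=\theta ^*(1-\theta ^*)/n\), since the Fisher information of a single Bernoulli sample is \(1/(\theta (1-\theta ))\).

```python
# Bernoulli model: I(theta) = 1/(theta*(1-theta)) for one sample.
theta_star = 0.3
n = 10_000
s = 3_000          # idealised data: s = n * theta_star successes

# Posterior under a flat prior is Beta(s+1, n-s+1); use the
# closed-form variance of a Beta(a, b) distribution.
a, b = s + 1, n - s + 1
post_var = a * b / ((a + b) ** 2 * (a + b + 1))

# Bernstein-von Mises variance: (1/n) * I(theta_star)^{-1}
bvm_var = theta_star * (1 - theta_star) / n

print(post_var, bvm_var)  # both about 2.1e-5
```

The two variances agree to within a fraction of a percent at this sample size, and the agreement improves as \(n\) grows.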

Equation (6.6) is known as the Laplace approximation. The mathematically precise form of this approximation, which replaces \(\approxd \) in (6.6) by the concept of convergence in distribution, is known as the Bernstein-von Mises theorem. The first rigorous proof was given by Doob in 1949 for the special case of finite sample spaces. It has since been extended under more general assumptions, notably to cover countable state spaces, but the general case (and whatever conditions it may need) is still unknown. From these results we do know that various conditions are required for (6.6) to hold. We can also identify cases in which (6.6) will fail: for example if \(M_\theta \sim \Unif (0,\theta )\), then the support of the model depends on \(\theta \), the usual regularity conditions fail, and the posterior is not asymptotically normal.