Bayesian Statistics

$\newcommand{\footnotename}{footnote}$ $\def \LWRfootnote {1}$ $\newcommand {\footnote }[2][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\newcommand {\footnotemark }[1][\LWRfootnote ]{{}^{\mathrm {#1}}}$ $\let \LWRorighspace \hspace $ $\renewcommand {\hspace }{\ifstar \LWRorighspace \LWRorighspace }$ $\newcommand {\mathnormal }[1]{{#1}}$ $\newcommand \ensuremath [1]{#1}$ $\newcommand {\LWRframebox }[2][]{\fbox {#2}} \newcommand {\framebox }[1][]{\LWRframebox } $ $\newcommand {\setlength }[2]{}$ $\newcommand {\addtolength }[2]{}$ $\newcommand {\setcounter }[2]{}$ $\newcommand {\addtocounter }[2]{}$ $\newcommand {\arabic }[1]{}$ $\newcommand {\number }[1]{}$ $\newcommand {\noalign }[1]{\text {#1}\notag \\}$ $\newcommand {\cline }[1]{}$ $\newcommand {\directlua }[1]{\text {(directlua)}}$ $\newcommand {\luatexdirectlua }[1]{\text {(directlua)}}$ $\newcommand {\protect }{}$ $\def \LWRabsorbnumber #1 {}$ $\def \LWRabsorbquotenumber "#1 {}$ $\newcommand {\LWRabsorboption }[1][]{}$ $\newcommand {\LWRabsorbtwooptions }[1][]{\LWRabsorboption }$ $\def \mathchar {\ifnextchar "\LWRabsorbquotenumber \LWRabsorbnumber }$ $\def \mathcode #1={\mathchar }$ $\let \delcode \mathcode $ $\let \delimiter \mathchar $ $\def \oe {\unicode {x0153}}$ $\def \OE {\unicode {x0152}}$ $\def \ae {\unicode {x00E6}}$ $\def \AE {\unicode {x00C6}}$ $\def \aa {\unicode {x00E5}}$ $\def \AA {\unicode {x00C5}}$ $\def \o {\unicode {x00F8}}$ $\def \O {\unicode {x00D8}}$ $\def \l {\unicode {x0142}}$ $\def \L {\unicode {x0141}}$ $\def \ss {\unicode {x00DF}}$ $\def \SS {\unicode {x1E9E}}$ $\def \dag {\unicode {x2020}}$ $\def \ddag {\unicode {x2021}}$ $\def \P {\unicode {x00B6}}$ $\def \copyright {\unicode {x00A9}}$ $\def \pounds {\unicode {x00A3}}$ $\let \LWRref \ref $ $\renewcommand {\ref }{\ifstar \LWRref \LWRref }$ $ \newcommand {\multicolumn }[3]{#3}$ $\require {textcomp}$ $\newcommand {\intertext }[1]{\text {#1}\notag \\}$ $\let \Hat \hat $ $\let \Check \check $ $\let \Tilde \tilde $ $\let \Acute \acute $ $\let \Grave \grave $ $\let \Dot \dot $ $\let \Ddot \ddot $ $\let \Breve \breve $ $\let \Bar \bar $ $\let \Vec \vec $ $\require {colortbl}$ $\let \LWRorigcolumncolor \columncolor $ $\renewcommand {\columncolor }[2][named]{\LWRorigcolumncolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigrowcolor \rowcolor $ $\renewcommand {\rowcolor }[2][named]{\LWRorigrowcolor [#1]{#2}\LWRabsorbtwooptions }$ $\let \LWRorigcellcolor \cellcolor $ $\renewcommand {\cellcolor }[2][named]{\LWRorigcellcolor [#1]{#2}\LWRabsorbtwooptions }$ $\require {mathtools}$ $\newenvironment {crampedsubarray}[1]{}{}$ $\newcommand {\smashoperator }[2][]{#2\limits }$ $\newcommand {\SwapAboveDisplaySkip }{}$ $\newcommand {\LaTeXunderbrace }[1]{\underbrace {#1}}$ $\newcommand {\LaTeXoverbrace }[1]{\overbrace {#1}}$ $\newcommand {\LWRmultlined }[1][]{\begin {multline*}}$ $\newenvironment {multlined}[1][]{\LWRmultlined }{\end {multline*}}$ $\let \LWRorigshoveleft \shoveleft $ $\renewcommand {\shoveleft }[1][]{\LWRorigshoveleft }$ $\let \LWRorigshoveright \shoveright $ $\renewcommand {\shoveright }[1][]{\LWRorigshoveright }$ $\newcommand {\shortintertext }[1]{\text {#1}\notag \\}$ $\newcommand {\vcentcolon }{\mathrel {\unicode {x2236}}}$ $\renewcommand {\intertext }[2][]{\text {#2}\notag \\}$ $\newenvironment {fleqn}[1][]{}{}$ $\newenvironment {ceqn}{}{}$ $\newenvironment {darray}[2][c]{\begin {array}[#1]{#2}}{\end {array}}$ $\newcommand {\dmulticolumn }[3]{#3}$ $\newcommand {\LWRnrnostar }[1][0.5ex]{\\[#1]}$ $\newcommand {\nr }{\ifstar \LWRnrnostar \LWRnrnostar }$ $\newcommand {\mrel }[1]{\begin {aligned}#1\end {aligned}}$ $\newcommand {\underrel }[2]{\underset {#2}{#1}}$ $\newcommand {\medmath }[1]{#1}$ $\newcommand {\medop }[1]{#1}$ $\newcommand {\medint }[1]{#1}$ $\newcommand {\medintcorr }[1]{#1}$ $\newcommand {\mfrac }[2]{\frac {#1}{#2}}$ $\newcommand {\mbinom }[2]{\binom {#1}{#2}}$ $\newenvironment {mmatrix}{\begin {matrix}}{\end {matrix}}$ $\newcommand {\displaybreak }[1][]{}$ $ \def \offsyl {(\oslash )} \def \msconly {(\Delta )} $ $ \DeclareMathOperator {\var }{var} \DeclareMathOperator {\cov }{cov} \DeclareMathOperator {\Bin }{Bin} \DeclareMathOperator {\Geo }{Geometric} \DeclareMathOperator {\Beta }{Beta} \DeclareMathOperator {\Unif }{Uniform} \DeclareMathOperator {\Gam }{Gamma} \DeclareMathOperator {\Normal }{N} \DeclareMathOperator {\Exp }{Exp} \DeclareMathOperator {\Cauchy }{Cauchy} \DeclareMathOperator {\Bern }{Bernoulli} \DeclareMathOperator {\Poisson }{Poisson} \DeclareMathOperator {\Weibull }{Weibull} \DeclareMathOperator {\IGam }{IGamma} \DeclareMathOperator {\NGam }{NGamma} \DeclareMathOperator {\ChiSquared }{ChiSquared} \DeclareMathOperator {\Pareto }{Pareto} \DeclareMathOperator {\NBin }{NegBin} \DeclareMathOperator {\Studentt }{Student-t} \DeclareMathOperator *{\argmax }{arg\,max} \DeclareMathOperator *{\argmin }{arg\,min} $ \( \def \to {\rightarrow } \def \iff {\Leftrightarrow } \def \ra {\Rightarrow } \def \sw {\subseteq } \def \mc {\mathcal } \def \mb {\mathbb } \def \sc {\setminus } \def \wt {\widetilde } \def \v {\textbf } \def \E {\mb {E}} \def \P {\mb {P}} \def \R {\mb {R}} \def \C {\mb {C}} \def \N {\mb {N}} \def \Q {\mb {Q}} \def \Z {\mb {Z}} \def \B {\mb {B}} \def \~{\sim } \def \-{\,;\,} \def \qed {$\blacksquare $} \CustomizeMathJax {\def \1{\unicode {x1D7D9}}} \def \cadlag {c\`{a}dl\`{a}g} \def \p {\partial } \def \l {\left } \def \r {\right } \def \Om {\Omega } \def \om {\omega } \def \eps {\epsilon } \def \de {\delta } \def \ov {\overline } \def \sr {\stackrel } \def \Lp {\mc {L}^p} \def \Lq {\mc {L}^p} \def \Lone {\mc {L}^1} \def \Ltwo {\mc {L}^2} \def \toae {\sr {\rm a.e.}{\to }} \def \toas {\sr {\rm a.s.}{\to }} \def \top {\sr {\mb {\P }}{\to }} \def \tod {\sr {\rm d}{\to }} \def \toLp {\sr {\Lp }{\to }} \def \toLq {\sr {\Lq }{\to }} \def \eqae {\sr {\rm a.e.}{=}} \def \eqas {\sr {\rm a.s.}{=}} \def \eqd {\sr {\rm d}{=}} \def \approxd {\sr {\rm d}{\approx }} \def \Sa {(S1)\xspace } \def \Sb {(S2)\xspace } \def \Sc {(S3)\xspace } \)

Chapter 5 The prior

We’ve spent most of our energy in Chapters 2-4 on understanding Bayesian models and performing the Bayesian update step. In Chapter 4 we focused on techniques for choosing the prior in a way that would make calculations straightforward to perform. In this chapter we maintain our focus on the prior, but with the opposite goal. Our interest here is in choosing a prior that best reflects a set of beliefs.

There are two parts to this chapter. Section 5.1 focuses on techniques for choosing a prior based on the opinions on experts. We are interested to quantify these opinions, in order to combine them with data, and we hope that by doing so we will produce more accurate results than could be obtained from the data alone. Sections 5.2 and 5.3 are concerned with the opposite situation where we want to focus solely on the data, and carry out our analysis with as few preconceptions as is possible.

Neither of these situations gives conjugate priors, in general. Consequently they lead to Bayesian updates that require computational methods, which we will study in Chapter 8. In both cases we must continue to abide by Cromwell’s rule from Section 4.4: the chosen prior should allow a non-zero probability for all parameter values that are physically possible.

5.1 Elicitation

Elicitation is the process of extracting an individuals beliefs about some unknown quantity, and representing those beliefs via probabilities. It is a difficult and inexact process. People often struggle to turn their thoughts into probabilities and are susceptible to many different psychological biases. There is no reason to expect that a single persons beliefs will be self-consistent. We will discuss psychological biases in Section 5.1.1. For now let us focus on the process of elicitation.

The first question we need to answer is whose prior we actually want. For example, if we are trying to determine the effectiveness of a drug, should we use the prior beliefs of the pharmaceutical company, or perhaps the regulators, perhaps even the patients? As a general rule, the prior should represent the beliefs held by the person(s) who decides what actions should be taken in response to the statistical analysis. They should, ideally, be the same person(s) as will face the consequences of a poor or incorrect decision. They are known as the elicitee for the duration of the process.

Eliciting probabilities

We have a limited capacity to think in terms of probabilities. For example, no elicitee can use their personal experience to judge the difference between probability $0.5$ and $0.5001$. When people are prepared to state the probability of some event exactly, it is usually based on some sort of symmetry. For example most people will tell you that for a fair six sided dice we have $\P [\text {throw a }6]=\frac 16$. In reality the probability is not exactly $\frac 16$, because the dice is not perfectly symmetric, but it is close enough for most practical purposes.

Except for symmetrical cases, elicitees estimating probabilities will generally rely on a mixture of their intuition and memory. We can help to understand their beliefs by choosing our questions carefully. It is good practice to focus on quantities that are meaningful to the elicitee, which usually means asking about quantities they have actually observed, or about relationships between quantities that they have expert knowledge of. In a complex model we may have to avoid asking directly about the parameters, because the elicitee may not understand what these parameters represent, even though our goal is to choose a prior distribution.

Eliciting distributions

We now consider how to elicit a whole distribution. Whole distributions are complicated objects and we can never claim that a certain distribution will perfectly represent someone’s beliefs about an unknown quantity. The best we can hope for a is a reasonable representation.

We generally concentrate on two aspects of the distribution:

• Location represents the value, or range of values, where the unknown quantity is most likely to be. It might be a guess for a mode, median or mean.
• Dispersal represents the level of certainty that the parameter falls within its most likely range of values. It might be represented using a variance.

Non-statisticians will often have difficulty dealing directly with concepts like mode, median, mean and variance. Instead of asking directly, a common strategy is to choose a family of distributions and then elicit probabilities to determine appropriate parameters. We might start this process by asking the elicitiee to draw the rough shape of the distribution representing their beliefs, and choose a family able to reproduce that shape.

It is usually easier to elicit estimates of location than it is to elicit estimates of dispersal. An elicitee with a good understand of probability might be willing to specify percentiles directly, for example to state values of $q$ such that $\P [\Theta \leq q]=0.95$ and $\P [q\leq \Theta ]=0.95$, but many people will find this difficult, particularly for probabilities that are close to zero or one.

The following scheme is known as the bisection method. It focuses on events of equal probability that (according to the elicitee) have a good chance of occurring. It seeks to elicit information about a single unknown real parameter $\theta $.

1. The elicitee is first asked to give a value $m$ such that the events $\theta \in (-\infty ,m]$ and $\theta \in [m,\infty )$ are equally likely.
2. Next, the elicitee is asked to give a value $l$ such that the events $\theta \in (-\infty ,l]$ and $\theta \in [l,m]$ are equally likely.
3. Lastly, the elicitee is asked to give a value $u$ such that the events $\theta \in [m,u]$ and $\theta \in [u,\infty )$ are equally likely.

In more statistical language, the elicitee provides their estimate for the 25th percentile $l$, the median of 50th percentile $m$ and the 75th percentile $u$. We could extend the process by splitting up further into more intervals of equal probability, or by using other percentiles, but we should be wary that increasing the complexity of the questions will also increase the risk that the elicitee fails to communicate their beliefs accurately.

We use the quantities $l,m,u$ obtained to deduce the parameters, within our chosen family of possible prior distributions. It is helpful that we tend to have three quantities and (for named distributions) only one or two parameters, because this allows us to check up on how well we have represented the individuals prior beliefs.

Example 5.1.1 Suppose that we have carried out the bisection method and obtained $m=0.7,l=0.5,u=0.8$, and the elicitee has sketched a distribution for a parameter $\theta \in [0,1]$ that looks like this:

We decide to try and represent this information with a $\Beta (\alpha ,\beta )$ distribution. We solve the equations

\[\P [\Beta (\alpha ,\beta )\leq 0.3]=0.25,\qquad \P [\Beta (\alpha ,\beta )\geq 0.8]=0.25\]

numerically to obtain $\alpha =3.04$ and $\beta =1.71$. The median of the $\Beta (3.04,1.71)$ distribution is $0.66$, also obtained numerically. This is close to the elicitees value for $m=0.7$. The distribution we have obtained is:

It is a reasonable match for what the elicitee has drawn. It also accounts for Cromwell’s rule by putting a small amount of probability on $\theta $ close to zero, which the elicitee did not think possible.

As part of the elicitation process we usually need to decide how strongly the prior should focus the values taken by $\theta $ into a particular region. A prior that strongly focuses $\theta $ on one (or occasionally more) small regions is known as a strongly informative prior. A prior that does not is generally known as a weakly informative prior. These are not mathematical definitions but they are very commonly used terms. You will often see them shortened to simply ‘strong’ and ‘weak’.

5.1.1 Psychological biases

People often take decisions using heuristics, which are shortcuts that are used to make quick and effective guesses. For example people will often assume that a more expensive product will be of better quality, and may base their purchasing decisions partly on this idea; it is often, but not always, true. These sorts of heuristics can introduce biases into the elicitation process, when heuristics struggle to capture the reality of a complex situation. Some pitfalls to be aware of during elicitation:

1. Availability bias. This is where an elicitee overestimates the probability of an event because it is easy to remember (or notice) that event, or because it has recently occurred.

For example, after hearing news of a plane crash, people are more susceptible to overestimate the frequency of plane crashes.
2. Anchoring. This is where an elicitee relies too heavily on a single piece of information.
3. Hindsight bias. This is when an elicitee falsely believes that they would have predicted an event, after that event has happened. This behaviour risks underestimating the probabilities of other outcomes.
4. Overconfidence. This is the tendency to give too much probability to events that are believed to be likely, and consequently underestimate the probability of unlikely events.

For example, in many studies where people were asked to provide 95% intervals for various unknown quantities, the true value lay outside of the estimated intervals as much as 20-30% of the time.

There are many other ways in which the heuristics we rely on in everyday life can lead to errors and biases. It is never possible to eradicate them all, but it is clear that experience and training in making probabilistic statements, as well as making elicitees aware of potential sources of bias, tends to result in more accurate estimation.