last updated: October 24, 2024

Bayesian Statistics


3.2 Notation: independent data

We often want to construct a Bayesian model where the data consists of \(n\) independent, identically distributed samples from some common distribution. That is, we want our model family to be of the form \(X=(X_1,\ldots ,X_n)\) where the \(X_i\) are independent with the same distribution. It is helpful to have some notation for this.

Given a pair of random variables \(Y\) and \(Z\), we write \(Y\otimes Z\) for the random variable \((Y,Z)\) formed of a copy of \(Y\) and a copy of \(Z\) that are independent of each other. We will tend to use this notation in combination with named distributions. For example, \(X\sim \Normal (0,1)\otimes \Normal (0,1)\) means that \(X=(X_1,X_2)\) is a pair of independent \(\Normal (0,1)\) random variables. When we want to create \(n\) copies we will use a superscript \(\otimes n\), so \(X\sim \Normal (0,1)^{\otimes n}\) means that \(X=(X_1,\ldots ,X_n)\) is a sequence of \(n\) independent \(\Normal (0,1)\) random variables.

Note that if \(X\) has range \(R\) then \(X^{\otimes n}\) has range \(R^n\).

  • Example 3.2.1 The standard relationship between Bernoulli trials and the Binomial distribution can be written as follows. If \((X_i)_{i=1}^n\sim \Bern (p)^{\otimes n}\) then \(\sum _{i=1}^n X_i\sim \Bin (n,p)\).
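This relationship is easy to check by simulation. A minimal sketch in Python (numpy assumed available), with illustrative values \(n=10\) and \(p=0.3\):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n, p, trials = 10, 0.3, 100_000

# each row is one sample of X ~ Bern(p)^{⊗n}; summing along rows
# gives `trials` samples whose distribution should be Bin(n, p)
bernoulli_draws = rng.binomial(1, p, size=(trials, n))
sums = bernoulli_draws.sum(axis=1)

# the empirical mean of the sums should be close to the Binomial mean n*p = 3
print(sums.mean())
```

Comparing the full histogram of `sums` against the \(\Bin (10,0.3)\) mass function gives a sharper check than the mean alone.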

  • Example 3.2.2 Let \(M\) be a continuous random variable with p.d.f. \(f_M\) and let \(X\eqd M^{\otimes n}\). Then \(X\) has p.d.f.

    \[f_{X}(x)=\prod _{i=1}^n f_M(x_i),\]

    where \(x=(x_1,\ldots ,x_n)\). A similar relationship applies in the case of discrete random variables, to probability mass functions.
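The product formula can be verified numerically. A sketch taking \(M\sim \Normal (0,1)\) and \(n=3\) (scipy assumed available), using the fact that three independent standard normals have joint density \(\Normal (0,I_3)\):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

x = np.array([0.5, -1.2, 0.3])  # one point x = (x_1, x_2, x_3)

# product of the marginal N(0,1) densities: f_X(x) = Π_i f_M(x_i)
product_of_marginals = np.prod(norm.pdf(x))

# joint density of three independent N(0,1)s, i.e. N(0, I_3)
joint = multivariate_normal(mean=np.zeros(3), cov=np.eye(3)).pdf(x)

print(np.isclose(product_of_marginals, joint))  # True
```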

  • Example 3.2.3 We are interested in modelling the amount of time that people spend on social activities. We will use data from the 2015 American Time Use Survey, corresponding to the category ‘socializing and communicating with others’.

    We decide to model the time spent on a single social activity as an exponential random variable \(\Exp (\lambda )\), where \(\lambda \) is an unknown parameter. This is a common model for time durations. Our data consists of \(n=50\) independent responses, each of which tells us the duration that was spent on a single social activity, in minutes. This gives us the model family

    \[M_\lambda =\Exp (\lambda )^{\otimes 50}\]

    which has p.d.f.

    \begin{align} \label {eq:social_model_pdf} f_{M_\lambda }(x)&=\prod _{i=1}^n \lambda e^{-\lambda x_i} =\lambda ^{50}e^{-\lambda \sum _1^{50}x_i}. \end{align} A single item of data has range \((0,\infty )\) so the range of our model is \((0,\infty )^{50}\).

    We need to choose a prior for \(\lambda \). As in Example 2.2.2 we will make a somewhat arbitrary choice, because for now our focus is on understanding how Bayesian updates work. Our prior for the rate parameter \(\lambda \) will be \(\Lambda \sim \Gam (2,60)\):

    (image)

    We can check that this prior sits roughly within the region of parameters that we would expect: it has \(\E [\Lambda ]=\frac {2}{60}=\frac {1}{30}\), which corresponds to an average social activity time of \(30\) minutes, because \(\E [\Exp (\lambda )]=\frac {1}{\lambda }\).
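This check is easy to reproduce with scipy (assumed available); note that scipy parametrizes the Gamma distribution by shape and scale, so \(\Gam (2,60)\) in our rate parametrization has shape \(2\) and scale \(1/60\):

```python
from scipy.stats import gamma

# prior Λ ~ Gam(2, 60): shape a = 2, rate 60, hence scale = 1/60
prior = gamma(a=2, scale=1/60)

print(prior.mean())      # E[Λ] = 2/60 = 1/30 ≈ 0.0333
print(1 / prior.mean())  # implied average duration 1/E[Λ] = 30 minutes
```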

    We represent our data as a vector \(x=(x_1,\ldots ,x_{50})\). A histogram of the data is as follows:

    (image)

    It satisfies \(\sum _1^n x_i=6638\), which we can substitute into (3.6).

    We can find the posterior distribution \(\Lambda |_{\{X=x\}}\) using Theorem 3.1.2. It has p.d.f.

    \begin{align} f_{\Lambda |_{\{X=x\}}}(\lambda ) &= \frac {1}{Z}f_{M_\lambda }(x)f_{\Gam (2,60)}(\lambda ) \notag \\ &= \frac {1}{Z}\lambda ^{50}e^{-6638\lambda }\frac {60^2}{\Gamma (2)}\lambda ^{2-1}e^{-60\lambda } \notag \\ &= \frac {1}{Z'}\lambda ^{51}e^{-6698\lambda } \label {eq:social_post_pdf} \end{align} Note that here we have absorbed the factor \(\frac {60^2}{\Gamma (2)}\) into the normalizing constant \(\frac {1}{Z}\) to obtain a new normalizing constant \(\frac {1}{Z'}\). We know that (3.7) is a probability density function, so by Lemma 1.2.5 we have \(\Lambda |_{\{X=x\}}\sim \Gam (52,6698)\), and \(\frac {1}{Z'}\) must be the normalizing constant of the \(\Gam (52,6698)\) distribution.
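The calculation above is an instance of the standard Gamma–Exponential conjugate update: a \(\Gam (\alpha ,\beta )\) prior combined with \(n\) independent \(\Exp (\lambda )\) observations gives a \(\Gam (\alpha +n,\,\beta +\sum _i x_i)\) posterior. A minimal sketch with the numbers from this example:

```python
def gamma_exponential_update(alpha, beta, n, data_sum):
    """Posterior parameters for a Gam(alpha, beta) prior (rate parametrization)
    after observing n independent Exp(lambda) samples with the given sum."""
    return alpha + n, beta + data_sum

# prior Gam(2, 60); n = 50 observations with Σx_i = 6638
alpha_post, beta_post = gamma_exponential_update(2, 60, 50, 6638)
print(alpha_post, beta_post)  # 52 6698, matching the posterior found above
```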

    Plotting the prior and posterior probability density functions together gives

    (image)

    Here we see that, even though our prior is spread across a fairly wide range of values, the posterior has concentrated very precisely on a small region. Compared to Example 2.3.3 we have much more data, so our analysis produces a higher level of confidence in the best choice of parameter values. Consequently our choice of prior matters less than it did in Example 2.3.3.

    It is sensible to compare the results of our analysis with the histogram of the data \(x\). Our model is for \(50\) independent samples, so strictly the sampling and predictive distributions of our model generate \(50\) real-valued samples, which is awkward to sketch. Instead we use the sampling and predictive distributions for a single data point (i.e. the \(n=1\) case of our model). From (3.2) and (3.5), these have probability density functions

    \begin{align*} f_{\text {sampling}}(x_1)&=\int _0^\infty f_{\Exp (\lambda )}(x_1)f_{\Gam (2,60)}(\lambda ) \,d\lambda , \\ f_{\text {predictive}}(x_1)&=\int _0^\infty f_{\Exp (\lambda )}(x_1)f_{\Gam (52,6698)}(\lambda ) \,d\lambda . \end{align*} Comparing these to the data, we obtain

    (image)

    It is clear that the predictive distribution is a better match for the data than the sampling distribution. To make the comparison we have scaled the total area of the histogram to \(1\), to match the fact that the area under a probability density function is also \(1\).
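These mixture integrals can also be checked numerically. For an \(\Exp (\lambda )\) model with a \(\Gam (a,b)\) distribution on \(\lambda \), the integral evaluates in closed form to \(ab^a/(b+x_1)^{a+1}\) (a Lomax density). A sketch of the check (scipy assumed available; scipy's Gamma uses shape/scale, so rate \(b\) becomes scale \(1/b\)):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

def mixture_pdf(x1, a, b):
    """∫ f_Exp(λ)(x1) f_Gam(a,b)(λ) dλ, computed numerically (rate parametrization)."""
    integrand = lambda lam: lam * np.exp(-lam * x1) * gamma.pdf(lam, a=a, scale=1/b)
    upper = gamma.ppf(1 - 1e-12, a=a, scale=1/b)  # Gamma tail mass beyond here is negligible
    val, _ = quad(integrand, 0, upper)
    return val

def closed_form(x1, a, b):
    return a * b**a / (b + x1) ** (a + 1)

x1 = 20.0  # a duration of 20 minutes
print(np.isclose(mixture_pdf(x1, 2, 60), closed_form(x1, 2, 60)))        # sampling, Gam(2, 60)
print(np.isclose(mixture_pdf(x1, 52, 6698), closed_form(x1, 52, 6698)))  # predictive, Gam(52, 6698)
```

Plotting `mixture_pdf` for the two parameter choices over a grid of durations reproduces the curves compared against the histogram above.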

  • Remark 3.2.4 In both Chapters 2 and 3 we have used continuous distributions for our random parameters. In principle we could use discrete distributions instead, i.e. \(\Pi \) would become a finite set and \(\Theta \) would only be allowed to take values in \(\Pi \). We would need to slightly modify (2.11) and (3.3) for such cases. There aren’t any common families of distributions whose parameter spaces are discrete, and in practice we rarely have a reason to want models of this type. We won’t study them in this course.