Bayesian Statistics

\( \def \mb {\mathbb } \def \R {\mb {R}} \def \P {\mb {P}} \def \N {\mb {N}} \def \Z {\mb {Z}} \def \sw {\subseteq } \def \-{\,;\,} \def \offsyl {(\oslash )} \)

Chapter 1 Conditioning

1.1 Random variables

Let $X$ be a random variable taking values in $\R $. You should think of $X$ as an object that takes a random value, which is hopefully natural. Most of the things we interact with are random: for example, when we buy a pair of shoes we do not know how long they will last, and when we walk home later we do not know how much rain will fall. In principle we might think of anything as being random, but within this course we will restrict ourselves to random variables that take values in $\R ^d$. We won’t use bold symbols for vectors in this course. Typically we will write $x$ or $y$ for elements of $\R ^d$, and when we need to use coordinates we’ll write e.g. $x=(x_1,\ldots ,x_d)\in \R ^d$, where $x_i\in \R $.

We are interested in two particular types of random variable in this course, captured by the following definition.

Definition 1.1.1 Let $X$ be a random variable taking values in $\R ^d$.
- 1. We say that $X$ is discrete if there exists a countable set $A\sw \R ^d$ such that $\P [X\in A]=1$.
  
  In the case $d=1$, this will usually mean that either $\P [X\in \N ]=1$ or $\P [X\in \Z ]=1$. We use the terminology ‘let $X$ be a random variable with values in $\N $ (or $\Z $)’ for this case.
  
  When $X$ is discrete, the function $p_X(x)=\P [X=x]$, defined for $x\in \R ^d$, is known as the probability mass function or simply p.m.f. of $X$.
  
  The range of $X$ is the set $R_X=\{x\in \R ^d\-\P [X=x]>0\}$.
- 2. We say that $X$ is continuous if there exists a function $f_X:\R ^d\to [0,\infty )$ such that
  \begin{equation} \label {eq:abs_cts_pdf} \tag{1.1} \P [X\in A]=\int _A f_X(x)\,dx \end{equation}
  
  for all $A\sw \R ^d$.
  
  In this case $f_X$ is known as the probability density function or simply p.d.f. of $X$. For $d>1$ it is common to write $X=(X_1,\ldots ,X_d)$ and refer to $f_X(x)$ as the joint p.d.f. of the $X_i$.
  
  The range of $X$ is the set $R_X=\{x\in \R ^d\-f_X(x)>0\}$.
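
The two parts of Definition 1.1.1 can be compared numerically. The following is a minimal sketch for $d=1$, not part of the course material: the Poisson and Exponential distributions, the parameter values, and the helper names `prob_discrete` and `prob_continuous` are all assumptions chosen for illustration. For a discrete $X$ we sum the p.m.f. over the points of $A$; for a continuous $X$ we approximate the integral of the p.d.f. over $A$ with a midpoint rule.

```python
import math

def poisson_pmf(k, lam):
    """p.m.f. of the Poisson(lam) distribution, supported on k = 0, 1, 2, ..."""
    if k < 0:
        return 0.0
    return math.exp(-lam) * lam ** k / math.factorial(k)

def exponential_pdf(x, lam):
    """p.d.f. of the Exponential(lam) distribution; zero outside (0, infinity)."""
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def prob_discrete(pmf, points):
    """P[X in A] for a discrete X: sum the p.m.f. over the points of A."""
    return sum(pmf(x) for x in points)

def prob_continuous(pdf, a, b, n=100_000):
    """P[X in [a, b]] for a continuous X with d = 1.

    Midpoint-rule approximation of the integral of the p.d.f. over [a, b].
    """
    h = (b - a) / n
    return h * sum(pdf(a + (i + 0.5) * h) for i in range(n))

lam = 2.0  # illustrative parameter value (an assumption)

# Discrete case: A = {0, 1, 2}.
p_disc = prob_discrete(lambda k: poisson_pmf(k, lam), range(0, 3))

# Continuous case: A = [0, 1].
p_cont = prob_continuous(lambda x: exponential_pdf(x, lam), 0.0, 1.0)

# Compare with the closed forms 5 e^{-2} and 1 - e^{-2}.
print(p_disc, 5 * math.exp(-lam))
print(p_cont, 1 - math.exp(-lam))
```

In both cases the numerical value should agree with the closed form up to rounding and discretisation error.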

Most random variables used in statistical inference are one of these two types. In this course we will use reference sheets of named distributions, found in Appendix A, covering a very large range of examples. These reference sheets will be made available in the exam. You should be familiar with relationships between named distributions that were discussed in earlier courses, for example the relationship between Bernoulli trials and the Geometric and Binomial distributions.
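
As a reminder of the relationship between Bernoulli trials and the Binomial and Geometric distributions, here is a small sketch under stated assumptions. The helper names and the values $n=5$, $p=0.3$ are hypothetical, and we assume the convention that a Geometric random variable counts the number of trials up to and including the first success; the reference sheets may use a different convention. The code builds the p.m.f. of a sum of $n$ independent Bernoulli($p$) trials by repeated convolution and compares it with the Binomial formula.

```python
from math import comb

def bernoulli_pmf(p):
    """p.m.f. of a single Bernoulli(p) trial, as a dict {value: probability}."""
    return {0: 1 - p, 1: p}

def convolve(pmf_a, pmf_b):
    """p.m.f. of the sum of two independent discrete random variables."""
    out = {}
    for a, pa in pmf_a.items():
        for b, pb in pmf_b.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out

def binomial_pmf_from_trials(n, p):
    """p.m.f. of the number of successes in n independent Bernoulli(p) trials."""
    pmf = {0: 1.0}  # a sum of zero trials is the constant 0
    for _ in range(n):
        pmf = convolve(pmf, bernoulli_pmf(p))
    return pmf

n, p = 5, 0.3  # illustrative values (an assumption)

built = binomial_pmf_from_trials(n, p)
formula = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}
max_gap = max(abs(built[k] - formula[k]) for k in range(n + 1))

# Geometric (assumed convention): first success on trial k means
# k - 1 failures followed by one success.
geometric = {k: (1 - p) ** (k - 1) * p for k in range(1, 200)}
total_mass = sum(geometric.values())

print(max_gap)     # gap between the convolution and the Binomial formula
print(total_mass)  # the truncated Geometric p.m.f. should sum to nearly 1
```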

Note that the integral in (1.1) is over a set $A\sw \R ^d$, with variable $x\in \R ^d$. In this course we’ll generally use this notation instead of writing out multiple integral signs (e.g. $\int \int \cdots \int \ldots \,dx_1\,dx_2\cdots dx_d$).
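
As an illustration of this notation, consider the following worked example; the density is an assumption chosen for simplicity and is not taken from the reference sheets. Take $d=2$, let $f_X(x)=e^{-x_1-x_2}$ for $x\in (0,\infty )^2$ with $f_X(x)=0$ otherwise, and let $A=[0,1]^2$. Then

\begin{equation*} \P [X\in A]=\int _A f_X(x)\,dx=\int _0^1\int _0^1 e^{-x_1-x_2}\,dx_1\,dx_2=\left (\int _0^1 e^{-t}\,dt\right )^2=\left (1-e^{-1}\right )^2. \end{equation*}

The compact form $\int _A f_X(x)\,dx$ on the left and the iterated integral on the right denote the same quantity.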

Definition 1.1.2 Let $X$ be a random variable taking values in $\R ^d$. We say that $X$ is deterministic if there exists $x\in \R ^d$ such that $\P [X=x]=1$.

We will often view a constant, say $a\in \R $, as an example of a deterministic random variable. This is another slight abuse of terminology, but it is natural and it won’t cause any trouble. Note that deterministic random variables are a special type of discrete random variable.

1.1.1 $\offsyl $ Technicalities

In this off-syllabus section we mention three technical points. They are aimed mainly at students with more technical backgrounds in analysis and probability theory. We won’t discuss these points in lectures.

1. More advanced textbooks use the term absolutely continuous for the class of random variables that we have called continuous. The complication arises because there are random variables for which the cumulative distribution function $F_X$ is a continuous function but no p.d.f. $f_X$ exists. These random variables are usually associated with random fractals and are rarely used within statistics, so in statistics it is common to drop the word ‘absolutely’.
2. In this course we will use the convention that probability density functions must be continuous (as functions) except where they are zero. You can check that all of the distributions on the reference sheet in Appendix A are given in this form.

   In fact, probability density functions $f_X(x)$ are only defined almost everywhere. The phrase ‘for almost all $x$’ is also commonly used. We cannot explain its precise meaning within this course, and many (otherwise good) textbooks on Bayesian statistics fail to note that this difficulty exists. Loosely, the same distribution can be defined using two (or more) different probability density functions $f_X(x)$ and $f'_X(x)$, but it will always be the case that $f_X(x)=f'_X(x)$ for ‘almost all’ values of $x$. We will discuss the matter further in Section 1.2, Remarks 3.1.3 and 6.1.2.
3. In our definitions and results above, the sets $A$ for which we evaluate $\P [X\in A]$ must be Borel subsets of $\R ^d$. In practice this technicality does not restrict us at all and we will continue to ignore this point for the remainder of the course.

Taking care of these issues rigorously requires some background on Lebesgue integration, but we do not assume that background for this course.
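
To make the second point concrete, here is a standard example, chosen for illustration and not taken from later sections. Let $f_X(x)=1$ for $x\in (0,1)$ and $f'_X(x)=1$ for $x\in [0,1]$, with both functions equal to zero elsewhere. They differ only at the two points $x=0$ and $x=1$, and for every interval $[a,b]\sw \R $ we have

\begin{equation*} \int _a^b f_X(x)\,dx=\int _a^b f'_X(x)\,dx=\text {length of }[a,b]\cap [0,1], \end{equation*}

so both functions serve as a p.d.f. for the same distribution, namely the uniform distribution on the unit interval.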
