Bayesian Statistics
Chapter 6 Discussion
We have now understood enough about Bayesian inference to discuss how it compares to other techniques. We will do so in Section 6.2. First, we outline the various notations that are used for the Bayesian framework, most of which are more condensed than the notation we have used in Chapters 1–5.
6.1 Bayesian shorthand notation
Recall that for a random variable \(Y\) we define the likelihood function \(L_Y\) by
\(\seteqnumber{0}{6.}{0}\)\begin{equation} \label {eq:likelihood_def} L_Y(y)=\begin{cases} p_Y(y) & \text { where $p_Y$ is the p.m.f.~and $Y$ is discrete,} \\ f_Y(y) & \text { where $f_Y$ is the p.d.f.~and $Y$ is continuous.} \end {cases} \end{equation}
We continue with our convention of denoting probability density functions by \(f\) and probability mass functions by \(p\).
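For instance, with two familiar distributions (chosen here purely as an illustration of (6.1)):
\[ Y\sim \mathrm {Poisson}(\mu )\ \text {(discrete):}\quad L_Y(y)=p_Y(y)=\frac {e^{-\mu }\mu ^y}{y!},\qquad Y\sim \mathrm {Exponential}(\lambda )\ \text {(continuous):}\quad L_Y(y)=f_Y(y)=\lambda e^{-\lambda y}. \]
In both cases \(L_Y\) is simply the p.m.f.~or p.d.f.~of \(Y\), regarded as a function of the observed value \(y\).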
This notation allows us to write the key equation from Theorems 2.4.1 and 3.1.2, for the distribution of the posterior, in a single form. If \((X,\Theta )\) is a (discrete or continuous) Bayesian model, where \(\Theta \) is a continuous random variable with p.d.f. \(f_\Theta (\theta )\), and the model family \((M_\theta )\) has likelihood function \(L_{M_\theta }\), then the posterior distribution of \(\Theta \) given the data \(x\) has p.d.f.
\(\seteqnumber{0}{6.}{1}\)\begin{equation} \label {eq:bayes_rule_combined} f_{\Theta |_{\{X=x\}}}(\theta )=\frac {L_{M_\theta }(x)f_\Theta (\theta )}{L_X(x)} \end{equation}
where \(Z=L_X(x)\) is the normalizing constant, or equivalently \(f_{\Theta |_{\{X=x\}}}(\theta )\propto L_{M_\theta }(x)f_\Theta (\theta )\). In both Sections 2.2 and 3.1 we noted that \(M_\theta \eqd X|_{\{\Theta =\theta \}}\), which leads to
\(\seteqnumber{0}{6.}{2}\)\begin{equation} \label {eq:bayes_rule_full_condensed} f_{\Theta |_{\{X=x\}}}(\theta )\propto L_{X|_{\{\Theta =\theta \}}}(x)f_\Theta (\theta ). \end{equation}
The term \(L_{X|_{\{\Theta =\theta \}}}(x)\) is often known as the likelihood function of the Bayesian model, and equation (6.3) is yet another version of Bayes’ rule. It is the most general version of Bayes’ rule that we will encounter within this course, and it is the basis for most practical applications of Bayesian inference.
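As a concrete illustration of how (6.3) is used (the Beta–Binomial model here is chosen purely for familiarity), suppose the prior is \(\Theta \sim \Beta (a,b)\) and the model family is \(M_\theta \sim \Bin (n,\theta )\). Then for data \(x\in \{0,1,\ldots ,n\}\),
\[ f_{\Theta |_{\{X=x\}}}(\theta )\;\propto \;L_{M_\theta }(x)\,f_\Theta (\theta )\;\propto \;\theta ^x(1-\theta )^{n-x}\cdot \theta ^{a-1}(1-\theta )^{b-1}\;=\;\theta ^{a+x-1}(1-\theta )^{b+n-x-1}, \]
which is proportional to the p.d.f.~of a \(\Beta (a+x,\,b+n-x)\) distribution. In particular, the normalizing constant \(L_X(x)\) never needs to be computed explicitly.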
Some textbooks, and many practitioners, prefer to use a more condensed notation for equations (6.2) and (6.3). They write simply \(f(y)\) for the likelihood function of \(Y\), and \(f(x)\) for the likelihood function of \(X\). Conditioning is written as \(f(y|x)\) for the likelihood function of \(Y\) given the event \(\{X=x\}\). This notation requires that we only ever write \(x\) for samples of \(X\) and \(y\) for samples of \(Y\), or else we would not be able to infer which random variables were involved. In this notation (3.1) becomes
\(\seteqnumber{0}{6.}{3}\)\begin{equation} \label {eq:bayes_rule_combined_shorthand} f(\theta |x)=\frac {f(x|\theta )f(\theta )}{f(x)}, \end{equation}
which is easy to remember! It bears a close similarity to \(\P [A|B]=\frac {\P [B|A]\P [A]}{\P [B]}\), which is Bayes’ rule for events. Note that in (6.4) the ‘function’ \(f\) is really representing four different functions, depending upon which variable(s) are fed into it – that part of the notation can easily become awkward and/or confusing if you are not familiar with it.
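Spelled out, matching each use of \(f\) in (6.4) against the full notation of (6.2) and (6.3) gives
\[ \underbrace {f(\theta |x)}_{\text {posterior }f_{\Theta |_{\{X=x\}}}(\theta )}\;=\;\frac {\overbrace {f(x|\theta )}^{\text {likelihood }L_{X|_{\{\Theta =\theta \}}}(x)}\;\;\overbrace {f(\theta )}^{\text {prior }f_\Theta (\theta )}}{\underbrace {f(x)}_{\text {normalizing constant }L_X(x)}}, \]
which is (6.2) written symbol for symbol.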
There are many variations on the notation in (6.4).
1. Some textbooks prefer to denote likelihood by \(p(\cdot )\) instead of \(f(\cdot )\), giving \(p(\theta |x)=\frac {p(x|\theta )p(\theta )}{p(x)}\). Some use a different symbol for likelihood functions connected to \(\Theta \) and those connected to \(X\), for example \(p(\theta |x)=\frac {l(x|\theta )p(\theta )}{l(x)}\), where \(p\) denotes the prior and posterior and \(l\) denotes likelihoods.
2. Sometimes the likelihood is omitted entirely, by writing e.g. \(\theta |x\) to denote the distribution of the posterior \(\Theta |_{\{X=x\}}\). Here lower case letters are used to blur the distinction between random variables and data. For example, you might see a Bayesian model with Binomial model family and a Beta prior defined by writing simply \(\theta \sim \Beta (a,b)\) and \(x|\theta \sim \Bin (n,\theta )\).
3. Some textbooks use subscripts to indicate which random variables are conditioned on, as we have done, but in a slightly different way e.g. \(f_{X|Y}(x,y)\) instead of \(f_{X|_{\{Y=y\}}}(x)\).
In this course we refer to all these various notations as Bayesian shorthand, or simply shorthand.
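To see how compact the shorthand can be, the Beta–Binomial model of point 2 above, together with its posterior (the standard conjugate update for this model), can be stated in a single line:
\[ \theta \sim \Beta (a,b),\qquad x|\theta \sim \Bin (n,\theta ),\qquad f(\theta |x)\propto f(x|\theta )f(\theta ),\qquad \text {so}\quad \theta |x\sim \Beta (a+x,\,b+n-x). \]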
Using shorthand can make Bayesian statistics very confusing to learn, so we have avoided it so far within this course. We will sometimes use it from now on, when it is convenient and clear in meaning. This includes several of the exercises at the end of this chapter. For those of you taking MAS61006, Bayesian shorthand will be used extensively in the second semester of your course. Hopefully, by that point you will be familiar enough with the underlying theory that it will save you time rather than cause confusion.
6.1.1 A technical remark
Remark 6.1.2 \(\offsyl \) The underlying reason for most of the troubles with notation is that, from a purely mathematical point of view, there is no need to restrict to the two special cases of discrete and continuous distributions. It is more natural to think of both Theorems 2.4.1 and 3.1.2 as statements of the form ‘we start with a distribution (the prior) and we perform an operation to turn it into another distribution (the posterior)’. The operation involved here (the Bayesian update) can be made sense of in a consistent way for all distributions, but it requires the disintegration theorem which takes some time to understand.
At present, the strategy chosen by most statisticians is to simply not study disintegration. This is partly for historical reasons. It was clear how to do the continuous case several decades before the disintegration theorem was proved, and the discrete case was understood two centuries before that. Based on these two special cases statisticians developed the idea of a likelihood function, split into two cases as in (6.1). The use of likelihood functions then became well established within statistics, before disintegrations of general distributions were understood by mathematicians. Consequently statistics generally restricts itself to the discrete and continuous cases that we have described in this course.
There are advantages and disadvantages to this choice. It still gives us enough flexibility to write down most of the Bayesian models that we might want to use in data analysis – although we would struggle to handle a model family that uses e.g. the random variable of mixed type in Exercise 1.2. Very occasionally it actually makes things go wrong, as we noted in Remark 3.1.3. The main downside is that we often have to treat the discrete and continuous cases separately, as we did in Chapters 2 and 3. That consumes a bit of time and it leaves us with a weaker understanding of what is going on.