last updated: January 23, 2025

Bayesian Statistics


8.4 Gibbs sampling

Recall that our parameter θ may really be a vector of parameters, as in Section 4.5 where we considered the model M(μ,σ) = N(μ,σ²) with θ = (μ,σ) ∈ ℝ×(0,∞). In this section we introduce a technique for handling models with many parameters θ = (θ1,…,θd) ∈ ℝ^d. For those of you taking MAS364, this section is non-examinable and is included for interest only. For those of you taking MAS61006, this section is on the syllabus and will feed into your work next semester.

For d=1 or d=2 the MH algorithm as described in Section 8.3 is effective. For much larger d, what tends to happen is that the sequence (ym) generated by the MH algorithm takes a very long time to explore the parameter space Π ⊆ ℝ^d, which means that it takes longer to converge to (the distribution of) Θ|{X=x}. This happens simply because there is a lot of space to explore inside ℝ^d when d is large; this phenomenon is often known as the curse of dimensionality.

  • Example 8.4.1 Imagine you have parked your car inside a multi-storey car park and then forgotten where you’ve parked it. If you know which level your car is on then you will only have to search that level, and you will find your car in minutes. If you don’t know which level, it will take you much longer. The first case is exploring d=2, the second is d=3. Actually, to make this example properly match the difference between d=2 and d=3, the car park should have the same number of floors as there are parking spaces along the side-length of each floor! The problem only gets worse as d increases beyond 3.

One strategy for working around this problem is to update the parameters (θ1,…,θd) one at a time. That is, we would first change θ1 while keeping (θ2,…,θd) fixed, then we would change θ2 while keeping (θ1,θ3,…,θd) fixed, and so on. After updating θd we would then return to θ1. It is helpful to introduce some notation for this: we write θ−i = (θ1,…,θi−1,θi+1,…,θd) for the vector θ with the θi term removed. We use the same notation for the random vector Θ, e.g. Θ−i.

  • Remark 8.4.2 In reality, instead of updating the θi one by one, it is common to update the parameters in small batches. For example, we might update (θ1,…,θ4) in one step, then (θ5,…,θ8) in the next step, and so on. It is helpful to put related parameters, whose values may strongly influence one another, within the same batch.

When we update the parameters in turn, a common choice of proposal distribution in the update for θi is Q ∼ Θi|{Θ−i=θ−i, X=x}, where θ−i are the values obtained from the previous update. This choice of proposal has the effect that, from (8.1), we end up with α=1, so all proposals are accepted. When using proposals of this form, the MH algorithm is usually known as the Gibbs sampler.
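To see why α=1, write π(θ) = fΘ|{X=x}(θ) and factor π(θ) = π(θi|θ−i)π(θ−i). The proposal density here is q(θ′|θ) = π(θ′i|θ−i) with θ′−i = θ−i. Substituting into the acceptance probability (a sketch; π and q are shorthand, and the exact form of (8.1) in these notes may use different letters, but the cancellation is the same):

```latex
\alpha
= \min\!\left(1,\;
    \frac{\pi(\theta')\,q(\theta \mid \theta')}{\pi(\theta)\,q(\theta' \mid \theta)}\right)
= \min\!\left(1,\;
    \frac{\pi(\theta'_i \mid \theta_{-i})\,\pi(\theta_{-i})\cdot\pi(\theta_i \mid \theta_{-i})}
         {\pi(\theta_i \mid \theta_{-i})\,\pi(\theta_{-i})\cdot\pi(\theta'_i \mid \theta_{-i})}\right)
= \min(1,1) = 1.
```

Every factor cancels, so full-conditional proposals are always accepted.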

  • Definition 8.4.3 The distributions of Θi|{Θ−i=θ−i, X=x}, for i=1,…,d, are known as the full conditional distributions. From Lemma 1.6.1 the ith full conditional has p.d.f. given by

    fΘi|{Θ−i=θ−i,X=x}(θi) = fΘ|{X=x}(θ) / ∫ℝ fΘ|{X=x}(θ1,…,θi−1, θi′, θi+1,…,θd) dθi′ ∝ fΘ|{X=x}(θ).  (8.12)

    We can calculate fΘ|{X=x} from Theorems 2.4.1/3.1.2, which provides a strategy for calculating (8.12) analytically. Note that (8.12) treats θ−i and x as constants; the only variable is θi.

  • Remark 8.4.4 The notation Θi|{Θ−i=θ−i, X=x} for the full conditionals is a bit unwieldy. In Bayesian shorthand we would write simply θi|θ−i, x, which is much neater.

The Gibbs sampler that results from these strategies is as follows.

  • 1. Choose an initial point y0 = (θ1(0),…,θd(0)) ∈ Π.

  • 2. For each i=1,…,d, sample ỹ from Θi|{Θ−i=θ−i(m), X=x} and set

    ym+1 = (θ1(m),…,θi−1(m), ỹ, θi+1(m),…,θd(m)).

    Note that we increment the value of m each time that we increment i. When we reach i=d, return to i=1 and repeat.

    Repeat this step until m is large enough that values taken by the ym are no longer affected by the choice of y0.

  • 3. The final value of ym is now a sample of Θ|{X=x}.
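The steps above can be sketched in code. As an illustration (this example is not from the notes), take Θ to be a standard bivariate normal with correlation ρ, for which the full conditionals are exactly normal: θ1|θ2 ∼ N(ρθ2, 1−ρ²) and θ2|θ1 ∼ N(ρθ1, 1−ρ²). For simplicity the sketch counts one full sweep over the coordinates as one iteration, rather than incrementing m per coordinate.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, y0=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each coordinate update is an exact draw from its full conditional,
    theta1 | theta2 ~ N(rho*theta2, 1 - rho**2) and vice versa, so every
    proposal is accepted (alpha = 1)."""
    rng = np.random.default_rng(seed)
    theta = np.array(y0, dtype=float)        # step 1: initial point y0
    sd = np.sqrt(1.0 - rho**2)
    samples = np.empty((n_steps, 2))
    for m in range(n_steps):                 # step 2: sweep over coordinates
        theta[0] = rng.normal(rho * theta[1], sd)  # update theta1 given theta2
        theta[1] = rng.normal(rho * theta[0], sd)  # update theta2 given theta1
        samples[m] = theta
    return samples

samples = gibbs_bivariate_normal(rho=0.8, n_steps=20000)
burned = samples[2000:]   # step 3: discard early values affected by y0
print(np.corrcoef(burned.T)[0, 1])  # close to 0.8
```

After discarding the burn-in, the empirical correlation of the retained samples is close to the target value ρ=0.8, which is one quick sanity check that the chain has converged.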

Note that we need to take samples from the full conditionals Θi|{Θ−i=θ−i(m), X=x} in step 2. This isn’t always possible, and the Gibbs sampler is only helpful if we can do it. It is often used in cases where the full conditionals turn out to be named distributions, or nearly so, as in Example 8.4.5 below.

For the MH algorithm we had Theorem 8.2.2 to tell us that ym was (approximately) a sample of Θ|{X=x}, justified by the discussion in Section 8.2.3. It is possible to make similar arguments for the Gibbs algorithm above, but we won’t include them in our course.

If the full conditionals can’t be easily sampled from, then one strategy is to use the MH algorithm (run inside step 2 above) to obtain approximate samples of the full conditional Θi|{Θ−i=θ−i, X=x}. This technique is known as Metropolis-within-Gibbs. In practice, once the parameters are divided up into batches, as described in Remark 8.4.2, some batches may be amenable to Gibbs sampling, whilst others may require Metropolis-within-Gibbs. The details will depend on the model. Trial and error is often required to find the best combination of techniques. We won’t try to write down algorithms of that complexity within these notes, but you should hopefully end the course with an understanding of how (and why) each piece of an algorithm like that would work.
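As a sketch of the idea (the function names and the target density here are illustrative, not from the notes), a single Metropolis-within-Gibbs update replaces the exact full-conditional draw for coordinate i with one random-walk MH step. Because the other coordinates are held fixed, the full conditional is proportional to the joint posterior, so the log-posterior can be used directly:

```python
import numpy as np

def mwg_update(theta, i, log_post, step=0.5, rng=None):
    """One random-walk MH step for coordinate i, targeting its full
    conditional.  log_post is the log of the joint posterior density,
    up to an additive constant."""
    if rng is None:
        rng = np.random.default_rng()
    prop = theta.copy()
    prop[i] = theta[i] + step * rng.normal()      # symmetric proposal in coordinate i
    log_alpha = log_post(prop) - log_post(theta)  # log acceptance ratio
    if np.log(rng.uniform()) < log_alpha:
        return prop                               # accept
    return theta                                  # reject: keep the old value

def log_post(theta):
    # Illustrative target: three independent N(0,1) coordinates.
    return -0.5 * np.sum(theta ** 2)

rng = np.random.default_rng(1)
theta = np.zeros(3)
draws = np.empty((20000, 3))
for m in range(20000):
    for i in range(3):            # one sweep = one MH step per coordinate
        theta = mwg_update(theta, i, log_post, rng=rng)
    draws[m] = theta
draws = draws[2000:]              # discard burn-in
```

In a real model, some coordinates in the sweep would use exact full-conditional draws (pure Gibbs) and only the awkward ones would use `mwg_update`.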

  • Example 8.4.5 This example comes from Sections 1.1.1/7.5.3/8.6.2 of the book ‘Bayesian Approach to Interpreting Archaeological Data’ by Buck et al. (1996). The data comes from radiocarbon dating, and is a vector (x1,x2,…,xn) of estimated ages obtained (via carbon dating) from n different objects. We write θi for the true age of object i, which is unknown. Our model for the age of each object is xi ∼ N(θi, vi), and we assume that the estimation errors are independent, for each i. For simplicity we will assume that the vi are known parameters, so our model family has n parameters θ = (θ1,…,θn). We thus have the model family

    Mθ = N(θ1,v1) ⊗ ⋯ ⊗ N(θn,vn).

    From the historical context of the objects, it is known that θ1 < θ2 < ⋯ < θn, so we condition our model Mθ on this event. We can use Exercise 1.8 to do this conditioning, resulting in a new model family M̃θ given by

    fM̃θ(x) = fMθ(x) / P[N(θ1,v1) < N(θ2,v2) < ⋯ < N(θn,vn)]   for θ1 < θ2 < ⋯ < θn, and 0 otherwise
            ∝ ∏_{i=1}^{n} fN(θi,vi)(xi)   for θ1 < θ2 < ⋯ < θn, and 0 otherwise
            ∝ exp( −(1/2) ∑_{i=1}^{n} (θi − xi)²/vi )   for θ1 < θ2 < ⋯ < θn, and 0 otherwise.

    We use the Bayesian model (X,Θ) with model family M̃θ and the improper prior

    fΘ(θ) = 1 for 0 < θ1 < θ2 < ⋯ < θn, and 0 otherwise.

    By Theorem 3.1.2 we obtain that the posterior distribution has p.d.f. 

    fΘ|{X=x}(θ) ∝ exp( −(1/2) ∑_{i=1}^{n} (θi − xi)²/vi )   for 0 < θ1 < θ2 < ⋯ < θn, and 0 otherwise.  (8.13)

    This is the same density as that of the conditioned model family, except now we treat θ rather than x as the variable. The density is symmetric in x and θ, so we already know this distribution: it is the distribution of θ ∼ N(x1,v1) ⊗ ⋯ ⊗ N(xn,vn) conditioned on the event 0 < θ1 < θ2 < ⋯ < θn. One way to simulate samples of this distribution is via rejection sampling: simulate θ ∼ N(x1,v1) ⊗ ⋯ ⊗ N(xn,vn) and reject the sample θ until it satisfies 0 < θ1 < θ2 < ⋯ < θn. The trouble is that unless n is small, we will mostly end up rejecting the samples, because the condition we have imposed is an unlikely one.
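To see how quickly plain rejection sampling breaks down, note that when the θi are identically distributed, the ordering constraint alone holds with probability exactly 1/n!. A quick empirical check (with illustrative values, not the book's data; the equal-means case is the worst case, but the rate still collapses as n grows even with separated means):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, trials = 6, 200_000
# Draw theta ~ N(x_i, v_i) with all x_i = 0 and v_i = 1 (illustrative values).
theta = rng.normal(0.0, 1.0, size=(trials, n))
# Count the draws satisfying theta_1 < theta_2 < ... < theta_n
# (ignoring the extra positivity constraint, which only makes things worse).
ordered = np.all(np.diff(theta, axis=1) > 0, axis=1)
accept_rate = ordered.mean()
print(accept_rate)  # about 1/6! ~= 0.0014
```

At n=6 roughly one draw in 720 is accepted; at n=20 the acceptance probability 1/20! makes the method completely unusable, which is why the coordinate-wise Gibbs updates below are needed.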

    From (8.13) and (8.12) we have full conditionals given by

    fΘi|{Θ−i=θ−i,X=x}(θi) ∝ exp( −(1/2) ∑_{j=1}^{n} (θj − xj)²/vj )   for θi ∈ (θi−1, θi+1), and 0 otherwise
                           ∝ exp( −(1/2)(θi − xi)²/vi )   for θi ∈ (θi−1, θi+1), and 0 otherwise

    (where we set θ0 = 0 and θn+1 = ∞ to make convenient notation). Note that θi is the only variable here; the second line follows because θ−i and x are treated as constants, so the terms with j ≠ i contribute only a constant factor. We recognise this full conditional distribution as that of θi ∼ N(xi, vi) conditioned on the event θi ∈ (θi−1, θi+1). These full conditionals are much easier to sample from: using rejection sampling, we sample θi ∼ N(xi, vi) and reject until we obtain a sample for which θi ∈ (θi−1, θi+1). Hence, in this situation we have all the necessary ingredients to use a Gibbs sampler.
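Putting the pieces together, a Gibbs sampler for this posterior might be sketched as follows. The data x and v below are illustrative stand-ins, not the radiocarbon data from Buck et al.; each full-conditional draw uses the rejection step just described.

```python
import numpy as np

def gibbs_ordered_ages(x, v, n_steps, seed=0):
    """Gibbs sampler for the posterior (8.13): theta_i ~ N(x_i, v_i)
    conditioned on 0 < theta_1 < ... < theta_n.  Each full-conditional
    draw is N(x_i, v_i) restricted to (theta_{i-1}, theta_{i+1}),
    sampled by rejection."""
    x = np.asarray(x, dtype=float)
    sd = np.sqrt(np.asarray(v, dtype=float))
    n = len(x)
    rng = np.random.default_rng(seed)
    # A valid starting point: sorted, positive, strictly increasing.
    theta = np.sort(np.abs(x)) + np.arange(1, n + 1) * 1e-3
    samples = np.empty((n_steps, n))
    for m in range(n_steps):
        for i in range(n):
            lo = theta[i - 1] if i > 0 else 0.0          # theta_0 = 0
            hi = theta[i + 1] if i < n - 1 else np.inf   # theta_{n+1} = infinity
            while True:                # rejection: redraw until in (lo, hi)
                draw = rng.normal(x[i], sd[i])
                if lo < draw < hi:
                    break
            theta[i] = draw
        samples[m] = theta
    return samples

# Illustrative data: four objects with noisy age estimates.
x = [10.0, 12.0, 11.0, 15.0]
v = [1.0, 1.0, 1.0, 1.0]
samples = gibbs_ordered_ages(x, v, n_steps=5000)
post_mean = samples[500:].mean(axis=0)   # burn-in discarded
print(post_mean)                          # increasing, near the sorted data
```

Each rejection step here is cheap because the truncation interval for a single θi is typically wide, in contrast to the joint rejection scheme above, whose acceptance probability shrinks factorially in n.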