Last October I joined the Systems Approaches to Biomedical Science CDT at Oxford. The best thing about being part of an interdisciplinary programme like SABS is that we get to work with people from many different backgrounds. Many of my colleagues are chemists and biochemists, and in our first six months we took a range of short modules ranging from basic cell biology to scientific computing. While they patiently explained to me the basics of PCR, Nickel columns and other
dark magic experimental techniques, I tried to help with the occasional maths problems.
A topic that came up again and again throughout the more computational modules was MCMC, or Markov Chain Monte Carlo. Trying to explain a somewhat convoluted sampling technique to people who have barely ever seen a probability density function isn’t easy. I myself only studied it in the third year of a Maths&Stats undergrad and found it confusing at the time. However, I did recently come up with a fun little example, so I figured I ought to share it with the world.
Monte Carlo methods
Monte Carlo methods, of which MCMC is one particular type, are algorithms that use random sampling to make approximations. They’re particularly useful for estimating integrals and drawing from complex distributions.
Monte Carlo tends to work by drawing some sort of random numbers and then splitting them in groups (usually accepting or rejecting them) in a way that captures the behaviour of our integral/density/object of interest.
Suppose for example that we wish to estimate . A way to do this using Monte Carlo is to draw a square with side 2 and a circle of radius 1 inside it. We can then draw points uniformly at random over the square and see what fraction of them fall inside the circle. Since the square has area 4 and the circle has area , we can use that fraction as an estimate for .
A visual representation of Monte Carlo estimation. 115 out of the 150 uniformly distributed points lie in the circle, giving an estimated value of 3.0667 for .
This may sound like a crude and inefficient way to do estimation. A sample of 150 points gives large errors and even going to 10 000 points doesn’t reliably give an estimate correct to two decimal places. Still, Monte Carlo can be a valuable technique for solving complex multidimensional problems.
Markov Chain Monte Carlo (around Oxford)
In the above example we draw points independently and uniformly at random. Because the set up is so simple, it is more or less obvious that we get a sensible and consistent estimate. However, provided that we can argue that out results hold in the limit (and converge reasonably quickly), nothing is stopping us from adapting the process to different distributions.
Suppose that we have a distribution function that we wish to sample from. One way to do so is to construct a Markov chain that has as a stationary distribution and then sample from that chain. If we let the process run for long enough, the overall time spent in each of the states will match . This may sound very convoluted—won’t constructing a whole Markov process be a lot more effort than estimating or calculating directly and then sampling from that?
Not necessarily. It is often the case that comparing two objects, or events, can be easier than calculating their individual intrinsic quality in a meaningful way. You can say “A is two times better than B” or “A is two times more likely than B” without knowing exactly how good/likely A and B are. There is a baseline level or scaling constant which we don’t need to worry about when making pairwise comparisons, but which can be difficult to evaluate. This is often the case in Bayesian inference . Markov chains can help us avoid the issue of scaling constants.
Take Oxford colleges during summer for example, and consider how popular different colleges are among tourists. We can measure popularity of a college as the probability that a randomly chosen tourist happens to visit that college at a particular time of day . It would be quite a lot of effort to go to all the colleges and measure the number of their visitors, add them all together and then rescale to get the probabilities.
The MCMC approach to determining college popularity would be instead to start wandering around Oxford. We set off from some college (say Corpus Christi), and then at regular time intervals decide whether to move at a randomly chosen nearby college. If the college is more popular (like Christ Church), we move to it. If it’s less popular, we only move to it with probability proportional to its popularity — e.g. if we know Oriel gets only 70% of the visitors Corpus does, we move to it with probability 70%. Otherwise we just stay at Corpus until the time of the next random jump.
We can keep walking in this fashion until our feet hurt, or until the porters stop letting us in. Intuitively, we’ll end up spending more time in the popular colleges than in the less popular ones — it’s likely we’ll be stuck in Christ Church or Magdalen for several time steps at a time, because they’re so much more popular than any college around them. Similarly, we’d probably jump away from Pembroke at the first opportunity.
This is a Markov process — we make independent jumps at regular time intervals, which are based only on our current location but not on our past history. We can argue formally that our intuition is correct, in that the stationary distribution in this particular example is unique and corresponds to college popularity. So if we measure the proportion of time spent at different colleges, this will be a good estimate for the popularity distribution!
This is what MCMC is, in a nutshell. It may seem a little different from the “circle in a square” example at the beginning, but it has the same core idea of making random choices, putting them in boxes, and then counting how many choices fall in which box.
Oh, and if you do find yourself in Oxford, make sure you visit Pembroke. It really is very lovely!
 The R code for generating the Monte Carlo estimation and image can be found here.
 I’m talking about the common rephrasing of Bayes’ rule that the posterior density is proportional to the prior times the likelihood. The idea is that the prior and the likelihood capture all of the information contained in the posterior, and the scaling constant is just a bit of a computational nuisance required to make sure everything integrates to 1.
 Naturally, we will assume certain ridiculous claims, such as independence of tourists, and we will also ignore the fact that different colleges have different policies when it comes to visitors.