on blo(pi)gging

I first started thinking about doing a PhD in my second year at Oxford. At the time I had a look at the graduate study sections on the Department of Statistics website and eventually found my way to BLOPIG, a blog maintained by the protein informatics group. Because of the interdisciplinary nature of protein informatics research and because the group is so large, the blog covers an impressive range of topics. As an undergraduate I could maybe sort of follow one post in three, and from these I made two conclusions:

  • OPIG sounds like a great group to work in, and
  • people doing DPhils are way too smart.

I definitely entertained the idea that one day I’d become a scientist cool enough to be BLOPIG-worthy, but I also worried a lot that I just wasn’t that clever. Fast-forward about four years and you’ll find me sitting at my desk in the department, often worrying that people around me are far more clever than I am. Despite the fact that I’m currently in OPIG… I suspect that my 19-year-old self would find me a bit ridiculous. Since then I’ve learnt the following:

  • OPIG is indeed a fantastic group to be in,
  • people doing DPhils are incredibly smart, and
  • attending group meetings massively helps with understanding what the blog posts are about.

Oh, and a while ago I wrote a thing. We end up in curious places, I suppose.




White noise

A large part of the statistics courses I took during my undergraduate degree involved parametric modelling. There we would often use “noise”, “white noise”,  and “Gaussian noise” to mean the same thing — independent normally distributed random variables with mean 0, which (hopefully) account for the difference between the theoretical model and the observed data.

I’ve come across this particular piece of jargon again and again, and it almost always refers specifically to the normal distribution. I hadn’t really thought much of that: “noise” and “error” both refer to disturbances in the “true” signal, and we tend to think of errors as normally distributed by default. None of my undergraduate work ever involved noise in the colloquial sense, so I had no idea whether this reflected in any way real noise… until a few days ago.

I’ve been working on a short research project at IBME for the past few weeks. My work involves analysing voice recordings made in uncontrolled environments (mostly just people’s homes), and the data I have also includes short recordings of background noise. I haven’t worked much with those yet, but I did plot a histogram of one I picked at random. Here’s what it looks like!


A histogram of the amplitudes in a 5 second recording of background noise. The blue curve is a Gaussian fit.

Pretty neat, huh? I expected it to look symmetric, etc., but I’m pretty sure this is the best fit of real data to a normal distribution I’ve ever seen.

An (easy?) example of MCMC

Last October I joined the Systems Approaches to Biomedical Science CDT at Oxford. The best thing about being part of an interdisciplinary programme like SABS is that we get to work with people from many different backgrounds. Many of my colleagues are chemists and biochemists, and in our first six months we took a range of short modules ranging from basic cell biology to scientific computing. While they patiently  explained to me the basics of PCR, Nickel columns and other dark magic experimental techniques, I tried to help with the occasional maths problems.

A topic that came up again and again throughout the more computational modules was MCMC, or Markov Chain Monte Carlo. Trying to explain a somewhat convoluted sampling technique to people who have barely ever seen a probability density function isn’t easy. I myself only studied it in the third year of a Maths&Stats undergrad and found it confusing at the time. However, I did recently come up with a fun little example, so I figured I ought to share it with the world.

Monte Carlo methods

Monte Carlo methods, of which MCMC is one particular type, are algorithms that use random sampling to make approximations. They’re particularly useful for estimating integrals and drawing from complex distributions.

Monte Carlo tends to work by drawing some sort of random numbers and then splitting them in groups (usually accepting or rejecting them) in a way that captures the behaviour of our integral/density/object of interest.

Suppose for example that we wish to estimate \pi . A way to do this using Monte Carlo is to draw a square with side 2 and a circle of radius 1 inside it. We can then draw points uniformly at random over the square and see what fraction of them fall inside the circle. Since the square has area 4 and the circle has area \pi , we can use that fraction as an estimate for \frac{\pi}{4}.

Monte Carlo estimation

A visual representation of Monte Carlo estimation. 115 out of the 150 uniformly distributed points lie in the circle, giving an estimated value of 3.0667 for \pi[1].

This may sound like a crude and inefficient way to do estimation. A sample of 150 points gives large errors and even going to 10 000 points doesn’t reliably give an estimate correct to two decimal places. Still, Monte Carlo can be a valuable technique for solving complex multidimensional problems.

Markov Chain Monte Carlo (around Oxford)

In the above example we draw points independently and uniformly at random. Because the set up is so simple, it is more or less obvious that we get a sensible and consistent estimate. However, provided that we can argue that out results hold in the limit (and converge reasonably quickly), nothing is stopping us from adapting the process to different distributions.

Suppose that we have a distribution function p(x) that we wish to sample from. One way to do so is to construct a Markov chain that has p(x) as a stationary distribution and then sample from that chain. If we let the process run for long enough, the overall time spent in each of the states will match p(x). This may sound very convoluted—won’t constructing a whole Markov process be a lot more effort than estimating or calculating p(x) directly and then sampling from that?

Not necessarily. It is often the case that comparing two objects, or events, can be easier than calculating their individual intrinsic quality in a meaningful way. You can say “A is two times better than B” or “A is two times more likely than B” without knowing exactly how good/likely  A and B are. There is a baseline level or scaling constant which we don’t need to worry about when making pairwise comparisons, but which can be difficult to evaluate. This is often the case in Bayesian inference [2]. Markov chains can help us avoid the issue of scaling constants.

Take Oxford colleges during summer for example, and consider how popular different colleges are among tourists. We can measure popularity of a college as the probability that a randomly chosen tourist happens to visit that college at a particular time of day [3]. It would be quite a lot of effort to go to all the colleges and measure the number of their visitors, add them all together and then rescale to get the probabilities.

The MCMC approach to determining college popularity would be instead to start wandering around Oxford. We set off from some college (say Corpus Christi), and then at regular time intervals decide whether to move at a randomly chosen nearby college. If the college is more popular (like Christ Church), we move to it. If it’s less popular, we only move to it with probability proportional to its popularity — e.g. if we know Oriel gets only 70% of the visitors Corpus does, we move to it with probability 70%. Otherwise we just stay at Corpus until the time of the next random jump.

We can keep walking in this fashion until our feet hurt, or until the porters stop letting us in. Intuitively, we’ll end up spending more time in the popular colleges than in the less popular ones — it’s likely we’ll be stuck in Christ Church or Magdalen for several time steps at a time, because they’re so much more popular than any college around them. Similarly, we’d probably jump away from Pembroke at the first opportunity.

This is a Markov process — we make independent jumps at regular time intervals, which are based only on our current location but not on our past history. We can argue formally that our intuition is correct, in that the stationary distribution in this particular example is unique and corresponds to college popularity. So if we measure the proportion of time spent at different colleges, this will be a good estimate for the popularity distribution!

This is what MCMC is, in a nutshell. It may seem a little different from the “circle in a square” example at the beginning, but it has the same core idea of making random choices, putting them in boxes, and then counting how many choices fall in which box.

Oh, and if you do find yourself in Oxford, make sure you visit Pembroke. It really is very lovely!

[1] The R code for generating the Monte Carlo estimation and image can be found here.

[2] I’m talking about the common rephrasing of Bayes’ rule that the posterior density is proportional to the prior times the likelihood. The idea is that the prior and the likelihood capture all of the information contained in the posterior, and the scaling constant is just a bit of a computational nuisance required to make sure everything integrates to 1.

[3] Naturally, we will assume certain ridiculous claims, such as independence of tourists, and we will also ignore the fact that different colleges have different policies when it comes to visitors.

Student’s t-test and knowing how much you don’t know

A friend — who is, incidentally, a lot better than me at the whole blogging thing — recently asked me about the t-distribution and what it’s used for. I ended up writing a fairly lengthy email in response and thought I might as well share it with the rest of the Internet.

Imagine the following setting: we have a sample X_1,\dots,X_n of independent normally distributed variables, all with mean \mu and variance \sigma^2. Suppose \sigma^2 is known and we want to test whether some value \hat{\mu} is a feasible guess for \mu.

H0: \mu = \hat{\mu}

H1: \mu \neq \hat{\mu}

The way we would typically go about this is by calculating the sample mean \bar{X} = \frac{1}{n}\sum X_i and saying that under H0 it follows a N(\hat{\mu}, \sigma^2/n) distribution. This means that under H0 the test statistic

Z=\frac{\bar{X} - \hat{\mu}}{\sqrt{\sigma^2/n}}

follows a standard normal distribution and we can proceed with calculating the p-value of this test. Easy!

What happens if we don’t know \sigma^2? It is tempting to use the (unbiased) sample variance

S^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2

to estimate \sigma^2 and substitute in the Z statistic above.

T=\frac{\bar{X} - \hat{\mu}}{\sqrt{S^2/{n}}}

Assuming T \approx N(0,1) we can compute an approximate p-value that should be fine… right?

Let’s forget about the algebra of it for a second. In the first case, we had some uncertainty about the mean and no uncertainty whatsoever about the variance. What the hypothesis test does is it checks whether the uncertainty that we have is enough to explain the difference between \bar{X} and \hat{\mu}. Not knowing the variance gives us some extra uncertainty on top of what we had before. The more uncertain we are of what we know, the more tolerant we should be to reality not matching our expectations.

Intuitively at least, the approximate p-value above is going to be somewhat conservative, since we’re not accounting for the variability of the sample variance. However, we still expect T to behave more or less like a standard normal — especially if our sample size is large, in which case we’re very confident about our estimate for \sigma^2. We can make several guesses about the density of T based on this idea alone:

  • it looks more or less like a standard normal (i.e. like a bell curve)

  • it has wider tails

  • it’s not fixed, but depends on n: the more data we have, the better our normal approximation

I do love informal approaches, but the academic community doesn’t necessarily share my enthusiasm. In any case we can all agree that knowing the exact p-value is preferable to making an approximation. This is where the t-distribution kicks in.

Definition. Let Z \sim N(0,1) and V \sim \chi^2_k be two independent random variables. The t-distribution with k degrees of freedom is defined to be the distribution of \frac{Z}{\sqrt{V/k}}.

While you can write the density function explicitly, it’s the form above that is the useful one. I won’t go through the algebra of it, but using it you can check that T \sim t_{n-1}. This means we can compute exact p-values for the hypothesis test (or rather we can let R compute them for us). This is what Student’s t-test amounts to!

Now for the really cool part. We can plot the standard normal and several t-distributions with varying degrees of freedom. Here’s what happens:


Our intuition was pretty spot on!

Zachary’s karate club

Can you give me an example of a network?

You can, of course. Your Facebook friends, for example, and the ways they are connected to each other. Or the transport network of the city you live in. Or, if you’re being really original, the World Wide Web.

Funnily enough when it comes to classic examples in the field, these are all high up on the list but not quite as high as “friends in a karate club”. Bizarre as it may seem, this example is such a cliché that you can win an award for being the first one to mention it during a networks conference.

Talking about a network of friends seems like a perfectly reasonable thing to do, but what’s with the karate? The answer dates back to a 1977 paper by Wayne Zachary, An Information Flow Model for Conflict and Fission in Small Groups. Zachary studied the change in social dynamics of one particular university karate club over the course of three years.

Zachary’s karate club

The club had 34 socially active members, i.e. members who also interacted outside of club meetings. They formed various friendships among each other, without necessarily being aware of distinct social circles within the club. Zachary, however, noticed that each member would either mostly have friends in common with the karate instructor (node 1 above), or with the club president (node 34 above), but would rarely be equally well-connected with both. Long story short, he identified two disjoint social groups within the club using network analysis and when the president and the instructor fell out and the club split, this happened pretty much exactly the way the model predicted. Pretty neat, huh? And he didn’t even have that much data to begin with. Imagine what you could do with the stuff Facebook knows!

New term, new courses

Today’s morning flight took me from frosty Sofia to the much warmer Oxford. My penultimate term as an undergraduate begins on Monday.

I was writing a personal statement recently when I realised something quite interesting. In my three and a half years as a Maths&Stats student so far I have taken every single course on offer by the Department of Statistics, the one exception being Actuarial Science. Most I chose because they sounded genuinely interesting, some I picked just because of the lecturer and/or tutor, and a couple seemed pretty useful, if not awe-inspiring. I didn’t even notice them add up.

Still, it’s a pretty impressive streak and it’ll be a shame to break it right at the finish line. However, simulation isn’t really my thing and while machine learning is certainly tempting, I’ve already gone over the required number of courses and I do have a dissertation to write. This term I’ll be spending more time at the Maths Institute instead – I’m taking probabilistic combinatorics and networks, Perhaps rather surprisingly even the former seems to have a fair bit to do with random graphs, so I won’t be too far away from my comfort zone.

It promises to be an exciting term, as always. It’s good to be back.