Bayesian Vs. Frequentist statistics and the case that never was…

Hal Varian Quote

There’s a lot of truth in the quote above, well, the bit about statistics at least: I don’t think anyone, anywhere, at any time has ever considered computer engineering a sexy occupation. That said, sexy or not, statistical analysis has become an essential tool for researchers in every scientific discipline, so it is worth discussing the two primary schools of thought here. Note also that I’m usually disinclined to pigeonhole a statistician as ‘Frequentist’ or ‘Bayesian’, as it’s unnecessarily confrontational; such labelling ignores the fact that one should be aware of, if not employ, both methodologies depending on requirements.

Frequentist statistics is the standard approach taught in most university modules and focuses on the probability of the data given the hypothesis, P(D|H); e.g. the P-values cited in most papers are derived from P(D|Ho), the probability under the null hypothesis Ho of data at least as extreme as those observed. One uses the term ‘frequentism’ because this ‘probability of data’ is taken to mean “the frequency with which something would happen, in a lengthy series of trials given some hypothesis about the experiment”[1]. That is, we set up the problem as a repeated-trial problem with a null hypothesis, so the probability lies not in the hypothesis (which is strictly true or false, 1 or 0) but in the data.
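To make P(D|Ho) a little more concrete, here is a minimal sketch of a P-value calculation. The experiment is hypothetical (7 heads in 10 tosses, with the null hypothesis that the coin is fair), the function names are my own, and the two-sided rule used (add up every outcome that is no more likely than the observed one) is just one common convention:

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(heads: int, n: int, p_null: float = 0.5) -> float:
    """Total probability, under the null hypothesis, of every outcome
    that is no more likely than the one actually observed."""
    observed = binomial_pmf(heads, n, p_null)
    total = sum(
        binomial_pmf(k, n, p_null)
        for k in range(n + 1)
        # small tolerance so ties are not lost to floating-point error
        if binomial_pmf(k, n, p_null) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical data: 7 heads in 10 tosses, null hypothesis p = 0.5
p_value = two_sided_p_value(heads=7, n=10)
print(f"P-value: {p_value:.4f}")
```

Note that the number says nothing about how probable the hypothesis itself is; it only describes how surprising the data would be if the hypothesis were true.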

The above implies that if you really must put a number on the probability of a hypothesis, you are forced to drop the frequentist approach. This is where Bayesian statistics comes in useful, since it allows one to focus on the probability of the hypothesis given the data, P(H|D). Unlike the frequentist view of repeatable random sampling, this approach treats the data as fixed (for example, the x tosses of a coin are all the data you will have) and hypotheses as random: each hypothesis is assigned a probability between 0 and 1. Take for example a pair of doctors issuing the same statement:

1) Frequentist Doctor: He probably doesn’t have condition X

Meaning: The frequentist approach forces a dichotomous decision (we either reject the null hypothesis or fail to reject it), so this sentence seems slightly odd if read in the loose sense. We should therefore read it as “the data are inconsistent with the hypothesis that the patient has this condition”.

2) Bayesian Doctor: He probably doesn’t have condition X

Meaning: The doctor is speaking in the most literal sense possible: he has taken the data and calculated from them the probability that the patient has the condition. There is also an inherent subjectivity or belief involved in this statement, which will be discussed below.


Bayesian statistics insists you provide a ‘prior belief’ of your model

Let’s imagine we are interested in the average length ‘L’, in inches, of all tadpoles in a pond. As a dedicated field biologist you have some prior information: L is certainly between 1 and 3 inches, and most likely near the middle of this range. So in Bayesian statistical inference we first make a guess at the probability distribution of L; that is, instead of saying L has one true value, we treat it as drawn from some probability distribution, known as the prior probability distribution, which reflects the state of knowledge about L before collecting any data.

Next, we collect some data and, based on our observations, use Bayes’ theorem (see bottom) to update the prior distribution in light of the data, giving a new probability distribution for L called the posterior distribution. The posterior distribution reflects our state of knowledge about L after collecting data. This is one of the core advantages of the Bayesian statistical framework: the ability to include specific constraints in the form of prior distributions and model structures when you have little data to work with.

Echoing the two doctors’ statements above and using the posterior distribution, the Bayesian biologist can say there is a 95% probability that the mean length is in the interval 1 to 3:

 P(1 ≤ L ≤ 3) = 95%

Of course from a frequentist perspective this statement makes no sense – L is simply an unknown constant which either lies in the range [1, 3] or it doesn’t. It’s important to note here that frequentist statistics only allows probability statements about sampling in the form of confidence intervals:

P(1 ≤ Z ≤ 3) = 95%

Here the probability attaches to the sampling procedure rather than to L: Z stands for a random sample of size ‘n’ from the population of tadpoles in the pond, and the interval from 1 to 3 is the confidence interval computed from it. It is useful to imagine taking many different samples, finding the sample mean for each and using that value to form a 95% confidence interval, thereby producing many different 95% confidence intervals. The 95% confidence means that, when we do this many times, about 95% of the resulting intervals will actually include the true population mean.
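The ‘many repeated samples’ reading of a confidence interval is easy to check by simulation. The sketch below is only illustrative: the pond’s true mean (2 inches), the known standard deviation (0.5 inches), the normal shape of the population and the sample size are all assumptions of mine, not measurements.

```python
import random
import statistics


def coverage(true_mean=2.0, sigma=0.5, n=30, trials=2000, seed=0):
    """Fraction of 95% confidence intervals that contain the true mean.

    Assumes a normal population with known sigma, so each interval is
    simply: sample mean +/- 1.96 * sigma / sqrt(n).
    """
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n**0.5
    hits = 0
    for _ in range(trials):
        # draw one fresh sample of n tadpole lengths
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        # does this sample's interval capture the fixed, true mean?
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials


rate = coverage()
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The true mean never moves; it is the intervals that vary from sample to sample, and roughly 95% of them should land on top of it.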

Bringing it all together with an example

You could be forgiven at this stage for thinking the two methods are simply skirting around semantics, with slightly philosophical meanderings differentiating one from the other. But having discussed how the two techniques differ in their treatment of input parameters, it is natural (and correct) to assume that they can lead to different conclusions, an observation that further highlights the importance of knowing when to deploy each method.

Let’s look at an example that demonstrates the different conclusions from each technique. As with many other statistical examples, we’ll take the classic coin toss: given a coin with unknown probability p of heads and (1-p) of tails, we wish to find p. We decide to toss it ten times and get 7 heads.

The frequentist statistician will immediately calculate the probability of heads to be p = 7/10 = 0.7, the observed proportion of heads (the maximum likelihood estimate).

The Bayesian statistician suggests that this doesn’t seem right: he thinks p might be closer to 0.5 and wants to use this subjective guess-work to help estimate p. Remember that with the Bayesian approach, instead of considering only the maximum likelihood estimate for p, we treat p as a random variable with its own distribution of possible values. A sketch of the 0.5 belief would look something like a smooth, symmetric hump over the range 0 to 1, peaking at 0.5.

For now, you’ll have to take my word for it, but this graph looks like a type of probability distribution called a Beta distribution with shape parameters α and β each having a value of 5, also called a Beta(5,5) distribution. We can calculate the mean and variance of this distribution using the standard Beta distribution formulas:

mean = α/(α+β)

variance = αβ/((α+β)^2 (α+β+1))

Therefore the mean is 5/(5+5) = 0.5 and the variance is (5×5)/(10^2 × 11) = 0.023, so the standard deviation works out to be 0.15. Finally, using Bayes’ Theorem, the Bayesian statistician combines the data generated by the coin tosses (7 heads and 3 tails) with the prior distribution and gets the posterior distribution of p, which turns out to be a Beta(5+7, 5+3) = Beta(12,8) distribution.
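For anyone wondering where the 12 and the 8 come from, here is a short sketch of the update. It assumes the ten tosses are independent, writes the prior as Beta(α, β) with α = β = 5, and drops constants that do not depend on p:

```latex
\begin{align*}
P(p \mid D) &\propto P(D \mid p)\, P(p) \\
            &\propto p^{7}(1-p)^{3} \cdot p^{\alpha-1}(1-p)^{\beta-1} \\
            &= p^{7+5-1}(1-p)^{3+5-1} \\
            &= p^{12-1}(1-p)^{8-1}
\end{align*}
```

The last line has the shape of a Beta(12,8) density: the heads are simply added to α and the tails to β.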

Using the same formulas as above, we plug in the values 12 and 8 and find that the mean is now 0.6 and the variance is 0.011, so the standard deviation is 0.11.
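The arithmetic is simple enough to check in a few lines of Python. This is only a sketch with helper names of my own choosing; it uses the standard Beta mean and variance formulas, and the rule that a Beta prior is updated by adding the observed heads to α and the observed tails to β:

```python
from math import sqrt


def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, variance


def update_beta(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing coin-toss data."""
    return alpha + heads, beta + tails


prior = (5, 5)                                      # belief centred on 0.5
posterior = update_beta(*prior, heads=7, tails=3)   # 7 heads in 10 tosses

for label, (a, b) in (("prior", prior), ("posterior", posterior)):
    mean, var = beta_mean_var(a, b)
    print(f"{label}: Beta({a},{b}) mean={mean:.2f} "
          f"variance={var:.3f} sd={sqrt(var):.2f}")
```

Running it should reproduce, to rounding, the means and standard deviations worked out by hand.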


Where the frequentist statistician calculated p = 0.7, the Bayesian calculated a value of 0.6 by incorporating a prior estimate alongside the data. Note that the final value of p would change again if the Bayesian had started from a belief other than 0.5. This is one of the reasons that frequentist statisticians distrust Bayesian methods: an inference model incorporating subjective guess-work doesn’t sit well with some.
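To see how much the answer leans on the prior, the sketch below recomputes the posterior mean for the same 7 heads and 3 tails under three different starting beliefs. Only Beta(5,5) comes from the example; the flat Beta(1,1) and the very confident Beta(50,50) are hypothetical priors I have added for comparison.

```python
def posterior_mean(alpha, beta, heads=7, tails=3):
    """Mean of the Beta posterior after adding the data to the prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)


# Only Beta(5,5) is taken from the example; the others are illustrative.
priors = {
    "flat, Beta(1,1)": (1, 1),
    "centred on 0.5, Beta(5,5)": (5, 5),
    "very sure of 0.5, Beta(50,50)": (50, 50),
}

means = {name: posterior_mean(a, b) for name, (a, b) in priors.items()}
for name, value in means.items():
    print(f"{name}: posterior mean = {value:.3f}")
```

The weaker the prior, the closer the result sits to the frequentist 0.7; the stronger the prior, the harder ten tosses find it to move the estimate away from 0.5.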

Additionally, quantifying prior beliefs as probability distributions is not simple, nor easily agreed with other parties. Even with an agreed prior distribution for p, actually computing the posterior distribution using Bayes’ Theorem was a major obstacle before powerful computing techniques became available.

To recap:

In frequentist statistical analysis, we find confidence intervals for the parameters, since the parameters (or hypotheses) are fixed and the data are random, i.e. we wish to find the probability of the data given the hypothesis, P(D|Ho).

Bayesian analysis finds probability intervals for the parameters since, conversely, the parameters (aka hypotheses) are random and the data are fixed, i.e. we wish to find the probability of the hypothesis given the data, P(Ho|D). As a result, Bayes’ Theorem is usually used when an experimental outcome has been determined but one wishes to check the validity of some aspect of the experiment, e.g. what is the probability that a person who fails a polygraph test was actually telling the truth?
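The polygraph question is a direct application of Bayes’ Theorem, P(H|D) = P(D|H) P(H) / P(D). The sketch below uses entirely made-up numbers (90% of people tested are truthful, the test fails 15% of truthful people and 90% of liars), so treat the output as an illustration of the mechanics and not as a statement about real polygraphs:

```python
def posterior_probability(prior_h, p_data_given_h, p_data_given_not_h):
    """Bayes' Theorem for a hypothesis H and its complement."""
    # P(D): total probability of seeing the data at all
    evidence = (p_data_given_h * prior_h
                + p_data_given_not_h * (1 - prior_h))
    return p_data_given_h * prior_h / evidence


# Hypothetical inputs, chosen purely for illustration
p_truthful = 0.90              # prior: share of tested people telling the truth
p_fail_given_truthful = 0.15   # false-positive rate of the test
p_fail_given_lying = 0.90      # share of liars the test catches

p_truthful_given_fail = posterior_probability(
    p_truthful, p_fail_given_truthful, p_fail_given_lying
)
print(f"P(truthful | failed test) = {p_truthful_given_fail:.2f}")
```

With these assumed inputs, most people who fail the test were in fact telling the truth, simply because truthful people so heavily outnumber liars to begin with.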




