At its simplest, statistical inference draws conclusions from the data in our samples and projects those findings onto populations for which we do not have data. Having previously discussed some of the primary differences between the Bayesian and Frequentist (Classical) schools of inference, one can summarise the fundamental differences between the two approaches by how they interpret probability, represent the unknown parameters, acknowledge the use of prior information and make the final inferences. The rising popularity of Bayesian statistics has gone hand in hand with an exponential rise in computational processing power, which has made tractable once-intractable calculations that previously held back analysis alongside Classical methods such as maximum likelihood.
In this blog post I’ll highlight how Bayesian methodology has allowed complicated biological models with relevant parameters to be estimated while allowing prior information to be efficiently incorporated. First, though, it may help to revisit some simple Bayesian terminology commonly found throughout the literature:
Some biologically relevant examples please…
To put the Bayesian principles above into context, we can use the simple example of a species of stickleback fish in highland and lowland glacial lakes, diverging over thousands of years and evolving distinct phenotypes adapted to their local environments. If, after exceptionally heavy rains, a trickle of highland fish somehow migrates to the lowland lakes, how do we assign a sampled individual to its native source lake on the basis of its genotype?
1. Based on prior knowledge of the fish stocks in the lower lake, we guess that the probability of randomly selecting a fish that is a lowland native is 0.8 and the probability that it is a newly arrived highland fish is 0.2. This subjective guesswork represents the prior probability P(H), shown in the far right column of the table below.
2. Now let’s imagine that in the lower-lake population there are two genotypes at some particular locus, α and β. Drawing on background research, suppose the likelihood of genotype α is 0.05 in the highland stickleback population and 0.90 in the lowland population. The joint distribution, that is, the probability of a particular observation, is calculated by multiplying the likelihood by the prior, i.e. P(D|H) × P(H).
3. One important row to take note of is the marginal likelihood P(D), the probability that an observation will be of a particular genotype irrespective of the parameter value; it is obtained by summing the joint distribution across parameter values.
4. Finally, the posterior probability P(H|D) is calculated by dividing the joint distribution P(D|H) × P(H) by the marginal likelihood (also known as the probability of the data, P(D)). Therefore, if we observe genotype β, the posterior probability that the fish is a newly arrived highland stickleback is 0.71. So although a randomly selected fish may be from an unknown source lake, we can use Bayesian methods to infer its population of birth based on its genotype.
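The four steps above can be sketched in a few lines of code. Note one assumption: the post does not state the genotype-β likelihoods, so here they are taken as the complements of the α likelihoods (0.95 for highland, 0.10 for lowland, assuming only two genotypes at the locus). With these values the highland posterior comes out at roughly 0.70; the 0.71 quoted above presumably reflects slightly different likelihoods or rounding in the original table.

```python
# Prior probabilities P(H), from step 1
prior = {"highland": 0.2, "lowland": 0.8}

# Assumed likelihoods P(D|H) for observing genotype β (complements
# of the stated α likelihoods, since the locus has two genotypes)
lik_beta = {"highland": 0.95, "lowland": 0.10}

# Step 2: joint distribution P(D|H) × P(H)
joint = {h: lik_beta[h] * prior[h] for h in prior}

# Step 3: marginal likelihood P(D), summed across parameter values
marginal = sum(joint.values())

# Step 4: posterior probability P(H|D)
posterior = {h: joint[h] / marginal for h in joint}

print(round(posterior["highland"], 2))  # ≈ 0.70
```

The same four-line recipe (prior, likelihood, joint, normalise) applies to any discrete-hypothesis Bayesian problem; only the numbers change.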
The above methodology has been used in a broad variety of practical applications, including population stratification detection in GWAS, genetic counselling, pedigree analysis and risk calculations that incorporate additional information, which is useful when a person’s genotype cannot be determined. More intuitive Bayes-related biological problems can be found here and here; working through some of these simple examples should really further one’s appreciation of the usefulness of the Bayesian toolbox. Another simple example of Bayesian statistical inference arises in the following situation:
In a population, 1% have a rare disease. A test is 99% effective, i.e. given a patient is sick, 99% test positive, and given a patient is healthy, 99% test negative. Given a positive test result, what is the probability that the patient is actually sick? The key idea is that the probability of an event A given an event B (e.g., the probability that one is sick given that one has tested positive) depends not only on the relationship between events A and B (i.e., the accuracy of the test) but also on the marginal probability (or “simple probability”) of each event. Since the test is known to be 99% accurate, the errors could be 1% false positives, 1% false negatives (missed cases), or a mix of false positives and false negatives. Bayes’ theorem allows one to calculate the conditional probability of having the illness, given a positive test result, in any of these cases. Hence, a positive test result does not prove conclusively that the person is sick, but Bayes’ theorem provides a formula for evaluating that probability.
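The arithmetic behind this example is worth seeing explicitly. A minimal sketch, using the numbers stated above (1% prevalence, 99% sensitivity, 99% specificity):

```python
p_sick = 0.01          # prevalence: P(sick)
sensitivity = 0.99     # P(positive | sick)
specificity = 0.99     # P(negative | healthy)

# Marginal probability of a positive result, P(positive):
# true positives plus false positives
p_pos = sensitivity * p_sick + (1 - specificity) * (1 - p_sick)

# Bayes' theorem: P(sick | positive)
p_sick_given_pos = sensitivity * p_sick / p_pos

print(p_sick_given_pos)  # ≈ 0.5
```

Despite the “99% effective” test, a positive result means only a 50% chance of being sick, because true positives from the rare disease are exactly matched by false positives from the large healthy majority.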
Another reason for the rising popularity of Bayesian methods is the ease with which complex likelihood problems can be tackled via Markov chain Monte Carlo techniques. It is important to note that the data are often not as simple as a positive or negative test result. For example, if the data are continuous, then the denominator in Bayes’ theorem above is replaced by an unsightly integral, which may or may not have a closed-form solution. In fact, sometimes we need sophisticated Monte Carlo methods just to approximate the integral. Clearly, then, more complex data will lead to multiple integrals that, until recently, could not be readily evaluated. In my next post I will discuss how MCMC methods have enabled the evaluation of Bayesian posterior probabilities, thereby making calculations tractable for complicated genetic models…
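To make the “approximate the integral” point concrete, here is a toy illustration (an assumed model, not one from the post): the marginal likelihood P(D) = ∫ P(D|θ) P(θ) dθ for binomial data with a Uniform(0, 1) prior, estimated by the simplest Monte Carlo scheme of drawing θ from the prior and averaging the likelihood. This particular integral happens to have the closed form 1/(n + 1), so we can check the approximation.

```python
import math
import random

random.seed(42)
n, k = 10, 7  # assumed data: k successes in n trials

def likelihood(theta):
    """Binomial likelihood P(D | theta)."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

# Monte Carlo: draw theta from the Uniform(0, 1) prior and average
# the likelihood; this approximates the integral P(D).
samples = 100_000
estimate = sum(likelihood(random.random()) for _ in range(samples)) / samples

print(estimate, 1 / (n + 1))  # estimate ≈ exact value 1/11
```

Real biological models rarely admit a closed form to check against, and naive prior sampling like this becomes hopelessly inefficient in high dimensions, which is exactly where the MCMC methods of the next post come in.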
References and further reading
 Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
 Wilson, I. J. & Balding, D. J. Genealogical inference from microsatellite data. Genetics 150, 499–510 (1998).
The BUGS (Bayesian inference Using Gibbs Sampling) software project has some invaluable Bayesian resources here
Bayes Theorem Tutorial in Graphics (due-diligence.typepad.com)