
Bayesian-LAMARC tutorial: Why?

This tutorial is designed to be read from beginning to end, but if you like you can jump straight to any of its sections:

  * What's this about? What's a Bayesian analysis?
  * How is a Bayesian approach used in LAMARC?
  * So, what do these curvefiles tell me?
  * So what won't these curvefiles tell me?
  * Is there any other feature of these curvefiles that might surprise me if I didn't know the secret?
  * So, why should I run Bayesian LAMARC vs. Likelihood LAMARC?
  * So tell me about these priors, then. They're flat, right?
  * Well, OK, I'll give it a shot. What do I do?

What's this about? What's a Bayesian analysis?

Better minds than mine have devoted themselves to creating whole web pages explaining what a Bayesian approach is, and why it's important. See, for example, Eliezer Yudkowsky's An Intuitive Explanation of Bayes' Theorem. In brief, the advantages of a Bayesian approach include:

The biggest disadvantage of a Bayesian approach is:

How is a Bayesian approach used in LAMARC?

In Likelihood-LAMARC, each chain is a search through tree-space with a single set of 'driving values'. In Bayesian-LAMARC, this search through tree-space is augmented by a search through parameter space, as defined by the priors. The current position of that parameter-space search serves, by proxy, as the 'driving values' for the parallel search through tree-space.

In practical terms, a Likelihood-LAMARC search looks like:

  1. Generate a tree using a set of driving values.
  2. Come up with a new tree using that same set of driving values and the old tree.
  3. Compare the two trees, and usually keep the one with the better likelihood.
  4. Save the tree you liked better.
  5. Go to step 2 until you're done.
  6. Analyze the trees you've collected.
  7. Make a new set of driving values based on your trees, and go back to step 2 again for a new chain.
(The 'usually' is so that the search can sometimes move 'downhill' in hopes of finding a better peak in a different area.)
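
For the programmatically inclined, here is a minimal sketch of that loop in Python. It is not LAMARC's actual code: 'propose_tree' and 'likelihood' are hypothetical stand-ins for LAMARC's internal rearrangement and scoring machinery, passed in as ordinary functions.

    import random

    def run_likelihood_chain(initial_tree, driving_values, propose_tree, likelihood,
                             n_steps, rng=random):
        # One Likelihood-LAMARC-style chain, sketched under the assumptions above.
        tree = initial_tree
        saved_trees = []
        for _ in range(n_steps):
            candidate = propose_tree(tree, driving_values)             # step 2
            # Step 3: usually keep the better tree, but sometimes accept a worse
            # one so the search can move downhill (a Metropolis-style rule).
            ratio = likelihood(candidate, driving_values) / likelihood(tree, driving_values)
            if rng.random() < min(1.0, ratio):
                tree = candidate
            saved_trees.append(tree)                                   # step 4
        # The saved trees are analyzed (step 6) to produce the next chain's
        # driving values (step 7).
        return saved_trees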

A Bayesian-LAMARC search looks like:

  1. Generate a tree using a set of driving values.
  2. Randomly choose whether to go to step 3 or step 7.

  3. Select a new point in parameter space.
  4. Compare the two sets of parameters, and usually keep the one with the better likelihood.
  5. Save the set of parameters you liked better, and replace the old set of driving values with those new ones.
  6. Go back to step 2; once you're done, go to step 10.

  7. Come up with a new tree using the current set of driving values and the old tree.
  8. Compare the two trees, and usually keep the one with the better likelihood.
  9. Go back to step 2; once you're done, go to step 10.

  10. Analyze the parameters you've collected.

Steps 1 and 7-9 are exactly the same as those in Likelihood-LAMARC, except that the trees are not saved for subsequent analysis, and instead the parameters are saved and analyzed.
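
Written out the same way, the Bayesian loop might look like the sketch below. Again, 'propose_tree', 'propose_params', and 'likelihood' are hypothetical stand-ins rather than actual LAMARC routines; the point is only the structure of the coin flip between the two kinds of update.

    import random

    def run_bayesian_chain(initial_tree, initial_params, propose_tree, propose_params,
                           likelihood, n_steps, rng=random):
        # One Bayesian-LAMARC-style chain, sketched under the assumptions above.
        tree, params = initial_tree, initial_params
        sampled_params = []
        for _ in range(n_steps):
            if rng.random() < 0.5:                                     # step 2: coin flip
                # Steps 3-5: move in parameter space, within the prior's bounds.
                candidate = propose_params(params)
                ratio = likelihood(tree, candidate) / likelihood(tree, params)
                if rng.random() < min(1.0, ratio):
                    params = candidate         # these become the new driving values
                sampled_params.append(params)  # step 5: save the kept parameters
            else:
                # Steps 7-9: rearrange the tree under the current driving values.
                candidate = propose_tree(tree, params)
                ratio = likelihood(candidate, params) / likelihood(tree, params)
                if rng.random() < min(1.0, ratio):
                    tree = candidate
        return sampled_params                  # step 10: analyze these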

Just as Likelihood-LAMARC samples trees proportionally to their likelihood, Bayesian-LAMARC samples sets of parameters proportionally to their probability. For more detail, and a comparison of simulated data in both a Likelihood and Bayesian setting, see:

Kuhner, M. K., and L. P. Smith, 2007. Comparing Likelihood and Bayesian Coalescent Estimation of Population Parameters. Genetics 175: 155-165.

That was more than I really wanted to know.

You asked!

What I meant to say is, methodology aside, what are the results of a Bayesian run and what do they mean?

In the current version of LAMARC, each parameter is analyzed separately, and a probability density function is created for each. (A probability density function is a curve where the total area under the curve is one.) The highest point on each curve is given as the point estimate for that parameter. The area under the curve is used to report credibility intervals in the output file: the 0.005 percentile is the point below which 0.5% of the area of the curve falls and above which 99.5% of it falls; the reverse is true for the 0.995 percentile. Together, the 0.005 and 0.995 percentiles give you your 99% credibility interval. (Credibility intervals are like confidence intervals in that they tell you where the truth is most likely to lie, but are used in a Bayesian context. They are sometimes called the 'posterior probability interval'.)

Probably of more interest are the curvefiles themselves, produced by LAMARC as output files. Each parameter has a curvefile for each chromosomal region in your data set, plus one curvefile for the overall estimate (the product of the individual curves). They are named [prefix]_[region]_[parameter].txt, where [prefix] is 'curvefile' by default (this can be changed in the menu or the XML), [region] is 'overall' or 'reg#' with # the number of the region in question, and [parameter] is the short name of the parameter, like 'Theta1' for the theta for the first population, or 'M32' for the migration rate from the third population into the second population.

Each curvefile contains information about the curve at the top, followed by a 2-column tab-delimited list of the parameter values and point likelihoods that define the probability density function. It can be imported into a spreadsheet program or Mathematica to create an actual graph.
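
If you would rather not do the arithmetic by hand, a short Python sketch like the one below can read a curvefile of the kind just described (ignoring any header line that is not two numbers, since the exact header layout may vary) and report the point estimate and 99% credibility interval. The filename in the commented example simply follows the naming convention above.

    import csv

    def read_curvefile(path):
        # Return parallel lists of parameter values and point densities from the
        # 2-column tab-delimited body of a curvefile.
        xs, ys = [], []
        with open(path) as handle:
            for row in csv.reader(handle, delimiter="\t"):
                try:
                    x, y = float(row[0]), float(row[1])
                except (ValueError, IndexError):
                    continue            # header or blank line
                xs.append(x)
                ys.append(y)
        return xs, ys

    def summarize(xs, ys):
        # Point estimate = highest point on the curve; 99% credibility interval =
        # the 0.005 and 0.995 quantiles of the (trapezoid-rule) area under it.
        peak = xs[max(range(len(ys)), key=lambda i: ys[i])]
        area = [0.0]
        for i in range(1, len(xs)):
            area.append(area[-1] + 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1]))
        area = [a / area[-1] for a in area]

        def quantile(q):
            for x, a in zip(xs, area):
                if a >= q:
                    return x
            return xs[-1]

        return peak, quantile(0.005), quantile(0.995)

    # Hypothetical usage, following the naming convention described above:
    # peak, low, high = summarize(*read_curvefile("curvefile_overall_Theta1.txt"))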

So, what do these curvefiles tell me?

Each curvefile is, in essence, the product of the likelihood and the prior: if both inputs are accurate, the result tells you the relative probability that the parameter takes any particular value. Unlike the likelihood version, this overview is complete and continuous.

Another advantage of the curvefiles is that you can check whether they're unimodal. Our research to date indicates that the true probability density function should be unimodal, so multiple peaks point to one of three causes: an insufficiently long run of LAMARC, insufficient sampling of the data, or unusual correlations between parameters. The first gives rise to multiple peaks in the curvefiles that come from single regions, and those peaks will not be conserved from one region to the next. The second gives rise to multiple peaks in the curvefile for the overall estimate over regions. The third gives rise to the same multimodal peaks across all regions. The various possibilities, then, are:

  * Multiple peaks in single-region curvefiles that are not conserved from region to region: the run of LAMARC was too short.
  * Multiple peaks in the overall curvefile: the data were insufficiently sampled.
  * The same multiple peaks in every region: an unusual correlation between parameters.
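
If you would rather not eyeball every curve, a few lines of Python can flag multimodality. The helper below is purely illustrative (it is not part of LAMARC) and assumes you have already read the density column out of a curvefile:

    def count_peaks(densities):
        # Count strict local maxima in a curvefile's density column.
        peaks = 0
        for i in range(1, len(densities) - 1):
            if densities[i] > densities[i - 1] and densities[i] > densities[i + 1]:
                peaks += 1
        return peaks

    # A smooth, unimodal curve gives 1; anything larger suggests one of the
    # three situations described above.
    print(count_peaks([0.1, 0.4, 0.9, 0.5, 0.2]))   # prints 1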

So what won't these curvefiles tell me?

The first thing they won't tell you is anything about any correlation between your parameters. For example, the known correlation between Theta and growth would not be seen in the curvefiles (or anywhere else in a Bayesian-LAMARC run). This is not a shortcoming of the Bayesian method per se, but rather a shortcoming of LAMARC as currently written: each parameter's walk through parameter space is analyzed without regard to how the other parameters are moving at the same time. So, say that two parameters are jointly exploring the following parameter space, illustrated as a contour map. We'll start by observing what happens in the uncorrelated case:

[Figure: graph of two uncorrelated parameters]

In the Bayesian run, this contour map fills up with sampled points in proportion to its height. So, the peak area fills up with a lot of points, and as you go down, the density of sampled points goes down, too. A two-dimensional analysis of these points could reproduce the contour map. But in a one-dimensional analysis (as in the current version of LAMARC), each curvefile is the result of projecting each point onto one of the axes, then smoothing the density of resulting points to get a probability density function (seen outside each axis).

Now let's look at some other examples, where the two parameters are either correlated (on the left) or inversely-correlated (on the right). The contour map of parameter space might then look like:

[Figure: graph of two correlated parameters (left) and graph of two inversely correlated parameters (right)]

Unfortunately, not only can a one-dimensional analysis not distinguish between these two cases, it also cannot distinguish between these cases and the uncorrelated case: the projected densities onto the axes are the same in all three cases.
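
If you want to convince yourself of this, the self-contained Python sketch below draws points from three toy joint distributions--uncorrelated, correlated, and inversely correlated bivariate normals, standing in for the contour maps above (they are not LAMARC output)--and shows that the one-dimensional projections are statistically indistinguishable.

    import math
    import random
    import statistics

    def sample(rho, n=100_000, seed=42):
        # Draw n points from a standard bivariate normal with correlation rho.
        rng = random.Random(seed)
        points = []
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            y = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
            points.append((x, y))
        return points

    for rho in (0.0, 0.9, -0.9):   # uncorrelated, correlated, inversely correlated
        xs, ys = zip(*sample(rho))
        # The projections onto each axis look the same in all three cases, even
        # though the joint distributions are very different.
        print(f"rho={rho:+.1f}  sd(x)={statistics.stdev(xs):.3f}  sd(y)={statistics.stdev(ys):.3f}")

A two-dimensional summary (the sample correlation, for instance) would of course tell the three cases apart; that is exactly the information the one-dimensional curvefiles discard.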

The problem gets even knottier when the two parameters have some sort of non-symmetrical correlation. Consider a correlation whose contour map looks like this:

[Figure: graph of two variably correlated parameters]

In this example, there is a single maximum on the contour map, but its oddly-distorted shape causes there to be multiple peaks in its projection onto the X axis. Given a sufficiently-long 'flat' section of the curve, the secondary peak could even end up being higher than the peak corresponding to the peak of the contour plot.

This is, it should be noted, not an incorrect assessment of the data, but it is limited. In other words, if the parameter plotted on the Y axis in the above example is truly equally likely to be anywhere in the lower half of its allowed range, (which it would be, if the prior on that parameter was accurate), and all those Y values correspond to a single X value, there will indeed be a peak at that X value, which may even overshadow the peak seen in the contour plot. Conversely, an analysis that simply reported the peak of the contour plot would miss the fact that there was a different X value which might be a better choice, were you only considering possible values of X.

The take-home lesson? The point estimates of individual parameters you get in a Bayesian run are only guaranteed to be accurate if considered independently from one another. And, of course, if the priors are accurate.

Is there any other feature of these curvefiles that might surprise me if I didn't know the secret?

Funny you should ask! As a matter of fact, there's one biggie: our curve-smoothing algorithm does not take the prior boundaries into account. This means that if your prior ranges from 0 to 100, you might end up with a curve that extends below zero, even if negative values are absurd for your data. This does not mean that LAMARC actually used any negative values when calculating the trees, but rather that the curve-smoothing algorithm has taken some of the density at or around zero and smoothed it beyond the boundary. If you're desperately curious about our curve-smoothing algorithm, look here.

If you get this behavior, there are two ways to interpret it. One is simply to assign all density outside the boundary in question to the boundary itself. In other words, if 3% of your curve falls below zero, simply claim that the lower 3% of your probability density curve is at zero. Another possibility (especially if you have accepted the default priors) is that your priors are wrong. If, in the same 0 to 100 example, you have a lot of density smoothed beyond 100, and the peak of the curve is at 99.5, it's fairly clear that your data is pushing the search as high as you are allowing it to go. It might be informative to see where it would take the search when you allow it to go higher still.
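
For the first interpretation, the bookkeeping can be made concrete with a few lines of Python. The helper below is illustrative rather than a LAMARC feature: it estimates what fraction of a curvefile's area was smoothed past the boundary, so you can report that fraction as sitting at the boundary itself.

    def mass_below(xs, ys, boundary):
        # Fraction of the curve's area (trapezoid rule) lying below 'boundary',
        # given the two columns of a curvefile.
        below = total = 0.0
        for i in range(1, len(xs)):
            segment = 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
            total += segment
            if xs[i] <= boundary:
                below += segment
        return below / total

    # If this returns, say, 0.03 for a boundary of zero, report the lower 3% of
    # the probability density as sitting at zero itself.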

So, why should I run Bayesian LAMARC vs. Likelihood LAMARC?

There are two advantages we have found in a Bayesian run vs. a likelihood run in LAMARC. In some difficult cases, our experience indicates that Bayes-LAMARC does a better job of searching tree space than does Likelihood-LAMARC. We attribute this to Likelihood-LAMARC's reliance on driving values: its search is akin to looking under a lamp-post because that's where the light is. The method of changing the driving values from chain to chain allows us (to extend the analogy) to move the lamp-post from time to time, and given a sufficiently long run, the search would find all the good trees. But Bayes-LAMARC's methodology searches tree space at the same time that it searches parameter space, meaning that the driving values are constantly changing. The boundaries of that search are immobile--they're the boundaries of the priors--but if your priors are correct, the search does a better job of covering the allowable space.

The second advantage of a Bayesian run vs. a likelihood run is in a sense a subset of the first: a Bayesian run is better at determining whether your data support parameter values at or near zero. In a likelihood run, a driving value of zero makes it nearly impossible to search the part of tree space appropriate to a non-zero value; conversely, a non-zero driving value makes it nearly impossible to search the part of tree space appropriate to a value of zero. One observed failure mode in Likelihood-LAMARC is when it estimates a parameter after one chain to be nearly zero, then estimates that same parameter to be high after the next chain, near-zero again the next, and continues to ping-pong back and forth throughout the run. Our current theory is that this behavior stems from having single driving values. Our Bayesian runs have not shown this behavior, and instead settle on a near-zero estimate, complete with seemingly appropriate credibility intervals.

Finally, if you wish to analyze a case with modern populations diverging from ancestral populations, you must use a Bayesian analysis as likelihood is not supported in cases with divergence. For the curious, the reason for this is that a likelihood run would prefer to keep the population splitting times constant throughout a chain and then consider what they suggest about other possible split times, but it turns out that trees inferred with one split time are often not useful in deciding about other potential split times. A Bayesian analysis avoids this problem as it allows split times to vary throughout the chain.

So tell me about these priors, then. They're flat, right?

Yes, the current version of LAMARC only has flat priors. A prior is supposed to represent your knowledge of the potential answers before you start, and if you don't know anything about what the potential answer might be beyond 'it'll be larger than X and less than Y', assigning all values between X and Y the same probability is a way to express that.

Even then, you have to decide what kind of density your priors should have between those boundaries. LAMARC currently gives you two options: a linear prior and a logarithmic prior (with natural logarithms, not base-10 logarithms). This choice can affect the credibility intervals of the reported parameters, and will definitely affect the search through parameter space.

Your choice should be motivated by how you believe the parameter varies. For example, if you believe that the best value for theta will be around 0.01, and that the 95% credibility interval will be from 0.001 to 0.1, you should use a logarithmic prior for theta. If you believe that the best value for a migration rate will be around 100, and that the 95% credibility interval will be from 50 to 150, you should use a linear prior for migration.

In the end, if you run LAMARC long enough and you have enough data, it won't matter which type of prior you use because the data will overwhelm it. But your search will have to be much longer if you use the wrong type. If you get it right beforehand, it will spend an equal amount of time searching the 1st to 50th percentiles as it will the 50th to 99th percentiles, meaning that both limits will have the same degree of precision. In addition, your search will have been maximally efficient.
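
To make the difference between the two prior shapes concrete, here is a small self-contained Python sketch. The functions 'draw_linear' and 'draw_log' are illustrative helpers, not LAMARC routines, and the bounds 0.001 and 0.1 come from the theta example above.

    import math
    import random

    def draw_linear(low, high, rng=random):
        # Flat prior on the parameter itself: every value between the bounds is
        # equally likely.
        return rng.uniform(low, high)

    def draw_log(low, high, rng=random):
        # Flat prior on the natural log of the parameter: every factor of ten
        # between the bounds gets equal weight.
        return math.exp(rng.uniform(math.log(low), math.log(high)))

    linear_draws = [draw_linear(0.001, 0.1) for _ in range(100_000)]
    log_draws = [draw_log(0.001, 0.1) for _ in range(100_000)]
    print(sum(d < 0.01 for d in linear_draws) / len(linear_draws))   # roughly 0.09
    print(sum(d < 0.01 for d in log_draws) / len(log_draws))         # roughly 0.5

The logarithmic prior spends about half its draws below 0.01 and half above, which is what you want if you believe the answer is 'around 0.01, give or take a factor of ten'; the linear prior over the same bounds would spend roughly 90% of its draws above 0.01.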

Also remember that any flat areas of your curvefiles should raise a red flag. For example, let's go back to our graph of variably-correlated parameters:

[Figure: graph of two variably correlated parameters]

One possible cause of the curve on the left (with the long flat section) is a logarithmic prior for that parameter with a lower bound so low that it included a wide swath of values, all of which had the same effect on the data as they would if the parameter value was zero. For example, if you are estimating recombination rate, there will be a particular value for that rate that predicts a single recombination event in the entire history of your population. All potential recombination rates below that value will therefore predict the same thing: zero recombination events. Conversely, any tree with zero recombination events will happily accept any value for recombination below that cut-off.

The upshot of all this is that if you get a curvefile with a flat section, you must realize that that section contains only the information you put into the prior--in other words, the same amount of information you had before you ran LAMARC at all. What makes this example particularly worrying is not that recombination has a long flat section--that's fairly easy to explain--but that, due to a correlation between recombination and some other parameter, that other parameter now has a secondary peak. And as we said before: if your prior is right, then this secondary peak is also right. But if the prior was intended to represent a more or less complete lack of information, the program gleaned information from it anyway.

With some amount of trepidation, we have put in default priors for each force, though we strongly encourage users to make deliberate choices about the priors for their particular variables, instead of blindly accepting the given defaults. For reference, the default priors are:

Uh, suddenly I'm not so sure about doing a Bayesian run any more.

You are wise to be cautious, for it is always critically important to examine your assumptions before undertaking any scientific endeavor, and the priors represent a large set of assumptions that can affect your results in sometimes surprising ways. It might help to realize that your priors are part of the model for population history you're using--a model that already includes a variety of assumptions and simplifications. The difference, of course, is that the assumptions and simplifications of the coalescent model have been hashed out in the literature over the course of several years, and you're going to have to defend the prior you put on your data for the first time.

On the plus side, the more data you collect and the longer you run LAMARC, the less your choice of priors will affect the results. If you do a sufficient analysis, and understand that any flat parts of the resulting curves came from your assumptions and not from your data, you will be OK.

Well, OK, I'll give it a shot. What do I do?

Onward, then, to the Bayesian tutorial.
