
LAMARC tutorial

This tutorial is designed to be read from beginning to end, but if you like you can jump straight to:

Why should I use LAMARC?

LAMARC is a tool for studying populations. It can estimate:

- effective population sizes (Theta)
- migration rates between populations
- population growth rates
- recombination rates

It can estimate them all at once or hold some of them fixed while estimating the rest. If this is the sort of information you're after, LAMARC can help.

What do I need to use LAMARC?

First off, you need genetic data. This might be standard genetic data like DNA or SNP sequences, or microsatellite counts, or it could (as of version 2.0) be observations with genetic underpinnings, like phenotypic traits or electrophoretic data. More data will give more accurate estimates, but it's important to note that the addition of more individuals, after a certain point (10-15 per population), is much less helpful than the addition of more independent genetic regions. The history of a known region in an additional individual is probably similar to that of its cohorts. The history of a new unlinked region from the same (or new) set of individuals is almost certainly very different from that of the old region, and is therefore more likely to hold new information.

Secondly, you need a computer. In particular, you probably need a computer with a lot of memory. A good LAMARC work-out can use hundreds of megabytes of RAM, and for particularly ambitious analyses with lots of data, gigabytes are probably called for. For learning purposes, LAMARC can run on more limited computers, if you're careful to give it a limited data set and not allow it to record too many observations.

Finally, you need some time. The total amount of time can vary from data set to data set, but it's not unusual for a complete, solid run of LAMARC to take a week or two. Larger data sets requiring full error analysis (profiling) can take a month or longer. However, it would be a mistake to throw all your data at LAMARC in January, then write to us in May asking when you could expect results. The answer would probably be a tragic, "LAMARC ran out of memory, and inexplicably failed to crash." Much better, instead, to test LAMARC on subsets of your data and get preliminary results in a day, which will then help you determine how best to analyze your full data set. This tutorial is designed to help you do just that.

One caveat: some enterprising souls, having lots of computers and little time, may think, 'Aha! I can solve my problem of having no time by running the parallelized version of LAMARC!' Which would be a great idea, if such a thing existed. There is a parallel version of MIGRATE, so if you only want to estimate effective population sizes and migration rates, that option is open to you. If your LAMARC run contains multiple regions or does multiple replicates, you can separate them into different files, run with summary files turned on, and recombine them using poor man's parallelization. But if those options don't appeal to you, you're stuck waiting for single runs to finish. Sorry.

OK, I got my data. Now what?

Hang on, let's take this slowly. If this is your first time using LAMARC, I strongly recommend using only a subset of your data, to get a feel for how the program works, first.

If you have multiple populations, let's select two of them to start with. If you already know something about migration patterns among your populations, pick two that you know share migrants. Then select around 10 individuals from each population. If you're only studying one population, pick 10-15 individuals. If you want accurate results, it's vitally important that you pick these individuals randomly from your data set, and not on the basis of their being the 'most different' or 'most interesting'. Such a scheme can grossly distort your results.

Next, pick one region to analyze. If you have DNA or SNP data, that means one continuous stretch of sequence. If you have microsat data, that probably means just one microsatellite, though if you happen to have linked microsat data, it could be those linked microsats.

Armed with the knowledge of which data you want to analyze, it's now time to assault the converter. LAMARC 2.1's file conversion program can be used both in a batch mode and as a graphical tool. You can read about how to use the converter here.

OK, the converter made a file for me. What next?

You are ready to run LAMARC. The very first thing LAMARC asks you for is the name of a file, and the converter-created file is what it wants. You can edit the contents of that file if you wish (it's all XML), but it's probably much easier to change things in the menu. If you like, you can even enter the name of your file at the command line ('lamarc yourfile.xml'), and if you don't want to do anything in the menu, you can run LAMARC in 'batch mode' by adding a '-b' flag ('lamarc yourfile.xml -b').
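
If you plan to script runs, the batch-mode invocation can be launched from another program. Below is a minimal sketch using Python's standard subprocess module; it assumes the lamarc executable is on your PATH and that 'yourfile.xml' is the converter-created file in the current directory.

import subprocess

# Launch LAMARC in batch mode ('-b') on a converter-created XML file.
# Assumes the 'lamarc' executable is on your PATH and 'yourfile.xml'
# is in the current directory; adjust both as needed.
result = subprocess.run(["lamarc", "yourfile.xml", "-b"], check=True)
print("LAMARC exited with code", result.returncode)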

What about Bayesian? How does that work?

You can read about the Bayesian capabilities of LAMARC in why you might want to run Bayesian-LAMARC and how you would do so. The latter picks up from right here. From here on, this tutorial focuses on the Likelihood-LAMARC path.

So... that's it? Enter the filename and hit 'run'?

For a first crack at your data, yes. It'll use a lot of default values that are perhaps not the best choice for your data set, but they'll work for now. After you see how it runs, we'll get around to changing the defaults.

OK, hang on, I'm going to go try it.

I'll be here.

It's running! Um, can you explain this counting thing?

After creating a tree, LAMARC is rearranging it to see if it can find better ones. Each time it does so, it increments the counter. Using the defaults, it first counts from -1000 up to zero without sampling anything (this is the phase known as 'burn-in'), then it starts counting up to 10,000, sampling every 20th tree. Those trees will be used to estimate parameters, which you'll see in a moment.
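
To make the counting concrete, here is a schematic sketch in Python of the default schedule for one initial chain. This is not LAMARC's actual code; it only mimics the counter, using the default numbers described above.

# A schematic sketch of the default counting schedule for one initial
# chain; it only mimics the counter and does no real tree rearrangement.
BURN_IN = 1000         # the counter runs from -1000 up to 0; nothing is sampled
CHAIN_LENGTH = 10000   # then it runs from 1 up to 10,000
SAMPLE_INTERVAL = 20   # every 20th tree is kept for parameter estimation

sampled = 0
for step in range(-BURN_IN, CHAIN_LENGTH + 1):
    # ...LAMARC would propose and accept/reject a rearrangement here...
    if step > 0 and step % SAMPLE_INTERVAL == 0:
        sampled += 1

print(sampled)   # 500 trees sampled per initial chain with the defaults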

OK, it finished counting and now it's stopped.

It's busy calculating the max--

No, never mind, there we go. OK, it gave me a lot of information and started counting again. What does all this mean?

OK, your screen probably looks something like this:

15:39:34  Initial chain   1:  [====================]        10000 steps
 
15:40:09  Predicted end of chains for this region:  Tue Feb  8 16:06:43
          2005
 
15:40:09  Accepted  7.39% | Posterior lnL 27.0287215 | Data lnL -627.243228
Trees discarded due to too many events:        2
Trees discarded due to too small population sizes:        0
Trees discarded due to an infinitesimal data likelihood:        0
Trees discarded due to extremely long branch lengths:        0
Tree-Arranger accepted            284/8268 proposals
Tree-Size-Arranger accepted       455/1732 proposals
  Class                  Theta
  Population 1          0.039795
  Population 2          0.065072
   
  Population                     Mig
  Population 1            --------  164.6673
  Population 2            142.1566  --------

15:40:09  Initial chain   2:  [=|                  ]           35

Let's take a look at each piece in turn.

Predicted end of chains
LAMARC's first estimate of when it will be done. This estimate gets revised at the end of every chain, and hopefully gets more accurate as it goes.

Accepted 7.39%
The percentage of proposed changes to the tree that were accepted by LAMARC. This number is usually in the 10% range. If it's very low, LAMARC has latched on to some form of tree and is very reluctant to vary from it, which means it is probably not exploring tree-space very effectively. If it's too high, LAMARC is accepting almost everything that comes its way, which probably means it's not proposing radical-enough changes, or else that your data give it no reason to prefer one tree over another (such data are likely mostly noise, so they aren't very informative about population parameters--if you run into this, review your input data for obvious problems like missing data).

Posterior lnL 27.0287215
The posterior log likelihood. This number (which should always be positive) indicates how much better the maximized parameters are than the initial, driving set of parameters. In early chains (like this one), it may be quite high, but it will drop down to the single digits in later chains. Please note that this is a relative likelihood, valid only within this specific run; it cannot be used in log-likelihood tests or other statistical applications.

Data lnL -627.243228
The data log likelihood. This number will probably be very negative, should increase for the first few chains, then level off for the last few chains. It represents the log of the probability of the data given the currently accepted tree. Since a large data set is highly improbable on any tree, it's not a problem that this number is so low. However, if it is still rising rapidly by the end of your run, you haven't yet found the best trees and your run is too short.

One other thing you should note is that if you're comparing this run to a new run with more of your data, the data log likelihood will go down. This is due to the fact that a tree with more tips will generally fit the data at those tips more poorly than a tree with fewer tips.

Trees discarded due to...
Sometimes, LAMARC will discard trees because the trees themselves are inherently too tricky to deal with. Almost always, these trees would also be rejected from having too low a likelihood, so you shouldn't worry about this too much unless one of these numbers gets very high (say, larger than 5% of the total number of proposed trees). If that happens, your starting parameters might be too extreme, or you might be calling two populations different when they are actually genetically identical (rejections due to 'too many events' can have this cause). Generally, though, it's just LAMARC being efficient.

If absolutely no trees are rejected, you'll see the message "No trees discarded due to limit violations." which means you're fine.

Arranger accepted
This is a more detailed breakdown of the 'Accepted 7.39%', above. The two arrangers on by default in LAMARC are the Tree-Arranger, which breaks a branch of a tree and then re-attaches it, and the Tree-Size-Arranger, which preserves the topology of the tree but picks new sizes for some or all of the branches. The Tree-Arranger is absolutely required, since it's the only one to change tree topologies. The Tree-Size-Arranger is more of a helper function, which is why (by default) it is only attempted 1/5 as often as the Tree-Arranger.

These numbers can vary fairly widely, but pay attention if they get too large or too small, just as you would to the overall 'Accepted X%' line, above.
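
As a quick sanity check, the overall acceptance figure is just the two arrangers' counts pooled. Using the numbers from the example output above, a couple of lines of Python reproduce it:

# Pool the two arrangers' counts from the example screen output above;
# together they reproduce the overall 'Accepted 7.39%' figure.
tree_arranger_accepted, tree_arranger_proposed = 284, 8268
size_arranger_accepted, size_arranger_proposed = 455, 1732

accepted = tree_arranger_accepted + size_arranger_accepted   # 739
proposed = tree_arranger_proposed + size_arranger_proposed   # 10000
print(f"Accepted {100 * accepted / proposed:.2f}%")          # Accepted 7.39%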

Theta and Mig
These are the parameters LAMARC is trying to estimate. You will be estimating one theta value for every population in your data, and two migration rates for every pair of populations in your data. In this case, with two populations, that means a theta for each (0.039795 for population 1 and 0.065072 for population 2), and two migration rates (164.6673 for the rate from pop2 to pop1, and 142.1566 for the rate from pop1 to pop2). These are LAMARC's best estimates of the parameters, as judged from one 'chain'-worth of trees. They will be used as driving values for the next chain, as will the last tree sampled in the chain. The values reported for the last chain become LAMARC's final estimates of your parameters.
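
If you're curious how quickly the number of parameters grows as you add populations, here is a tiny sketch. It counts only Thetas and migration rates, not any other parameters you might also choose to estimate.

def n_basic_parameters(n_populations: int) -> int:
    """One Theta per population plus two migration rates per pair."""
    thetas = n_populations
    migrations = n_populations * (n_populations - 1)   # 2 per unordered pair
    return thetas + migrations

print(n_basic_parameters(2))   # 4, as in this two-population example
print(n_basic_parameters(5))   # 25; profiling all of these takes a while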

Initial Chain 2
LAMARC is starting over, essentially. Except that this time, it hopefully has a halfway-decent tree to start with, and better driving values than it did the first time. It'll repeat this process (by default) until it's done it 10 times.

Great! While you were explaining that, LAMARC got to something called 'Final Chain 1' and it's taking a lot longer. What's happening?

By default, LAMARC runs 10 'Initial' chains and 2 'Final' chains. The initial chains take 10,000 steps, and the final chains take 200,000 steps. This is LAMARC's attempt to solve a basic problem in likelihood analysis: LAMARC does a great search in the area right around the driving values, but not so great a search elsewhere. By using multiple chains, LAMARC moves to driving values as close to the true maximum as possible, so that it can find that maximum efficiently. The first 10 chains are an attempt to find the right neighborhood, and the first final chain is an attempt to narrow it down even further. The last final chain is used to actually estimate your parameters, and to provide support intervals for those estimates.
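
Combined with the default sampling interval of one tree in 20 (mentioned earlier), that chain schedule works out to a fair number of trees per region. A rough tally, ignoring burn-in:

# Rough tally of the default search for a single genetic region,
# assuming every chain keeps one tree in 20 (the default interval).
INITIAL_CHAINS, INITIAL_STEPS = 10, 10_000
FINAL_CHAINS, FINAL_STEPS = 2, 200_000
SAMPLE_INTERVAL = 20

total_steps = INITIAL_CHAINS * INITIAL_STEPS + FINAL_CHAINS * FINAL_STEPS
print(total_steps)                      # 500000 rearrangements proposed
print(total_steps // SAMPLE_INTERVAL)   # 25000 trees sampled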

Yay, it finished Final Chain 2! And, er, now it's stuck.

If you can see the message "Beginning profiling, please be patient", it means that LAMARC has entered a computationally intensive phase where it calculates the support intervals for its point estimates. We call this phase 'Profiling', since it provides profiles of the likelihood surface as seen from the perspective of each parameter in turn. After a while, your screen will look something like this:

03:44:15  Beginning profiling, please be patient
04:06:09  Finished profile 1 of 7.  Predicted end of this set of profiles:
          Sat Apr  9 06:17:33 2005
 
04:36:23  Finished profile 2 of 7.  Predicted end of this set of profiles:
          Sat Apr  9 06:46:43 2005
 
04:37:26  Finished profile 3 of 7.  Predicted end of this set of profiles:
          Sat Apr  9 05:48:20 2005

It can take an extremely variable amount of time to finish profiling, but a good rule of thumb is 'about half again as long as it's taken so far'. LAMARC provides an estimate of when it expects to be done as it finishes each profile, predicting that each remaining profile will take about as long as the average profile so far. For an exploratory run like we've assembled here, a few hours is not unreasonable.
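
If you'd like to reproduce that prediction yourself, the rule is straightforward: average the time per finished profile and assume each remaining profile takes that long. A small sketch using the times from the example output above:

from datetime import datetime

# Reproduce LAMARC's 'predicted end' rule: each remaining profile is
# assumed to take the average time of the profiles finished so far.
# Times are taken from the example screen output above.
start = datetime(2005, 4, 9, 3, 44, 15)   # "Beginning profiling"
now = datetime(2005, 4, 9, 4, 37, 26)     # profile 3 of 7 just finished
finished, total = 3, 7

average = (now - start) / finished
print(now + average * (total - finished))  # ~2005-04-09 05:48, as reported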

[extra credit] Wait a minute--it finished profiling, but then started up again? What's going on?

If you have more than one genetic region in your data, LAMARC first goes through the chains for the first region, then calculates the profiles for that region, then goes on to the next region. So what you're seeing is LAMARC beginning to analyze the data from the second region. Back when we were getting started, I suggested only using data from one region, and this is why--each region goes through the same process, which can be rather lengthy. But no harm done--it just means a longer wait.

OK, it's finally done. So, what can I tell from this 'outfile' it produced?

Let's go through the outfile a section at a time. More information on outfiles can be found in this section.

Maximum Likelihood Estimates (MLEs) of Parameters

This first section contains the bulk of the most critical information about your parameters: their best estimates and their support intervals. Your outfile might look something like:

                       Theta        |    Migration Rate
Population       Theta1     Theta2  |   M21        M12
Best Val (MLE)  0.005697   0.002163 | 716.8337   1478.605
    Percentile                      |
   99%   0.005  0.003973   0.001279 | 213.8190   551.3366
   95%   0.025  0.004314   0.001424 | 301.2110   718.4079
   90%   0.050  0.004503   0.001518 | 355.3201   816.1701
   75%   0.125  0.004824   0.001706 | 451.1927   987.5201
   50%   0.250  0.005162   0.001903 | 553.9211   1174.315
           MLE  0.005697   0.002163 | 716.8337   1478.605
   50%   0.750  0.006309   0.002432 | 902.0924   1832.927
   75%   0.875  0.006793   0.002636 | 1047.330   2115.421
   90%   0.950  0.007349   0.002867 | 1211.883   2438.987
   95%   0.975  0.007735   0.003025 | 1324.255   2661.947
   99%   0.995  0.008568   0.003364 | 1561.403   3136.894
Theta1:  Theta for Pop1
Theta2:  Theta for Pop2
M21:  Migration rate into Pop1 from Pop2
M12:  Migration rate into Pop2 from Pop1

There are four parameters LAMARC is estimating: two Thetas (population sizes) and two migration rates. The estimates can be read across the first line of numbers: the Theta for population one is 0.005697 and for population two is 0.002163; the migration rate into population one is 716.8337 and into population two is 1478.605.

The rest of the table is the result of choosing percentile profiling, and is what LAMARC was doing after the end of 'Final Chain 2'. Each line represents a percentile: the value below which the true parameter value lies with the stated probability. So, for example, since the .025 percentile for Theta 1 is 0.004314, the true value of Theta 1 has a 2.5% chance of being lower than that value and a 97.5% chance of being higher. To find the 95% support interval, then, we take the .025 and .975 percentiles and report the spread between them--in this case, LAMARC is 95% certain that Theta 1 lies between 0.004314 and 0.007735. The 99%, 90%, 75%, and 50% intervals can be read from the table in the same way, and can give you a better idea of LAMARC's picture of the data.

Profile Likelihoods

In this section, you are given many more details about the profiles listed in the 'Maximum Likelihood Estimates' section. The first section will always be "Overall Profile Tables", and since in this example we only used one genetic region, that's all we have. If we had used more than one region, we would still start with the "Overall Profile Tables", followed by "Regional Profile Tables" containing the same sort of information for each genetic region considered separately in turn.

The first swath of data will look something like this:

Log Likelihoods:

Percentile  Theta1   |   Ln(L)   
  0.005    0.003973  | -3.158082 
  0.025    0.004314  | -1.761258 
  0.050    0.004503  | -1.193397 
  0.125    0.004824  | -0.502098 
  0.250    0.005162  | -0.068077 
   MLE     0.005697  |  0.159471 
  0.750    0.006309  | -0.068013 
  0.875    0.006793  | -0.502098 
  0.950    0.007349  | -1.193396 
  0.975    0.007735  | -1.761353 
  0.995    0.008568  | -3.157959 

The first two columns we have seen before, in the first section. Now, however, we are given the actual posterior log likelihoods when that parameter (Theta 1, in this case) is constrained to each particular value. The log likelihoods here are actually what determine the percentiles, not the other way around. Using the standard likelihood-ratio approximation (twice the drop in log likelihood from the maximum is approximately chi-square distributed with one degree of freedom), and knowing that the point of maximum likelihood has a log value of 0.159471 (as it does in the above table), the points at which the log likelihood has dropped by about 1.92, to approximately -1.761, will be your .025 and .975 percentiles.
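
You can verify those cutoffs yourself. Here is a minimal sketch using SciPy; the only thing it assumes beyond the table above is the chi-square approximation just described.

from scipy.stats import chi2

# The log-likelihood drop defining a two-sided interval: half the
# chi-square quantile (1 degree of freedom) for that coverage.
max_lnL = 0.159471   # the MLE row of the Theta1 profile table above

for coverage in (0.50, 0.75, 0.90, 0.95, 0.99):
    cutoff = max_lnL - chi2.ppf(coverage, df=1) / 2
    print(f"{coverage:.0%} interval: lnL cutoff = {cutoff:.3f}")

# The 95% cutoff comes out near -1.761 (the .025/.975 rows) and the
# 99% cutoff near -3.158 (the .005/.995 rows), matching the table.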

The next section gives us information about parameter correlation:

Best fit parameters with Theta1 held constant:

Percentile  Theta1   |   Theta2     M21       M12    
  0.005    0.003973  |  0.002155  722.4041  1475.974 
  0.025    0.004314  |  0.002157  721.1656  1476.671 
  0.050    0.004503  |  0.002158  720.5119  1477.008 
  0.125    0.004824  |  0.002159  719.4664  1477.540 
  0.250    0.005162  |  0.002161  718.4165  1478.015 
   MLE     0.005697  |  0.002163  716.8337  1478.605 
  0.750    0.006309  |  0.002165  715.0486  1479.142 
  0.875    0.006793  |  0.002166  713.6108  1479.514 
  0.950    0.007349  |  0.002167  711.9080  1479.921 
  0.975    0.007735  |  0.002166  710.6590  1480.219 
  0.995    0.008568  |  0.002164  707.7885  1480.974 

Once again, the first two columns are repeats of what we've seen before. But now, we get to see what happens to the rest of the parameters if we modify the first one. The basic idea here is that we hold Theta1 at a particular value, then allow our other parameters to vary as we find a new point of joint maximum likelihood (and it is this maximum likelihood that we report in the above table). In other words, LAMARC asks the data, "OK, I know that the best value for Theta1 is .005697. But if I were wrong, and the best value were .003973, what would Theta2, M21, and M12 be?" In this case, the answer is "Not very different." There's a slight correlation between Theta1 and all three other parameters, but it's not very strong. Contrast the above table with this, taken from the same example file:

Percentile   M21     |   Theta1    Theta2     M12    
  0.005    213.8190  |  0.005733  0.001852  1545.118 
  0.025    301.2110  |  0.005727  0.001907  1525.316 
  0.050    355.3201  |  0.005721  0.001960  1508.983 
  0.125    451.1927  |  0.005711  0.002050  1487.323 
  0.250    553.9211  |  0.005704  0.002113  1479.440 
   MLE     716.8337  |  0.005697  0.002163  1478.605 
  0.750    902.0924  |  0.005692  0.002191  1480.951 
  0.875    1047.330  |  0.005689  0.002203  1482.800 
  0.950    1211.883  |  0.005685  0.002213  1484.520 
  0.975    1324.255  |  0.005683  0.002218  1485.435 
  0.995    1561.403  |  0.005678  0.002226  1486.743 

Here, we see how varying the migration rate M21 affects the estimation of the other three parameters. In contrast with varying Theta1, which only caused variation in the third or fourth significant digit, we can see here that varying M21 causes variation in the second or third significant digit of the other parameters. Again, not huge, but more noticeable.

Wait, wait, wait, I'm lost already. Why am I looking at all this?

Sorry. OK, here are the basic questions these sections answer:

What are the values for my parameters?
The values for your parameters are the first row of the 'Maximum Likelihood Estimates' tables; the ones labeled 'MLE'.
What are the confidence intervals for my parameters?
Technically for this kind of analysis they're called 'support intervals', so the standard 95% support intervals can be found by looking for the rows that start with '95%'. So, in our Theta1 example above, the lower 95% point is at 0.004314, the upper is at 0.007735, and the MLE is at 0.005697. One way to write this is:

0.005697 +.002038/-.001383

Since this is clearly too many digits of precision, you might round off to:

0.0057 +.0020/-.0014

If you thought this was confusing and were willing to sacrifice precision, you could average this error to:

0.0057 +/-.0017

even though it's, you know, wrong. Also keep in mind that likelihood-based LAMARC is believed to produce too-narrow confidence intervals because of its reliance on driving values; runs with replication mitigate this problem.
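
For completeness, here is the same error-bar arithmetic as a few lines of Python, using the Theta1 numbers from the table above:

# Asymmetric error bars for Theta1, from the MLE and the .025/.975
# percentile rows of the table above.
mle, lower, upper = 0.005697, 0.004314, 0.007735

plus, minus = upper - mle, mle - lower
print(f"{mle:.6f} +{plus:.6f}/-{minus:.6f}")     # 0.005697 +0.002038/-0.001383
print(f"{mle:.4f} +/-{(plus + minus) / 2:.4f}")  # 0.0057 +/-0.0017 (the lossy version)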

What else can I say about my estimated parameters?
You can comment on parameter correlations by looking for trends in the 'Profile Likelihoods' section. For example, you could look at M21 as compared to M12, and if a high M21 correlated with a low M12 and vice versa, you could talk about the total amount of migration between those two populations.

OK, I think I get it. Let's go to the next sections.

Gladly! Actually the next bits don't need a lot of explanation--the next section ("User Specified Options") is a summary of your starting values, search strategy, and filenames. The next section ("Data summary") is an overview of what your data looked like, what data models you used, and an actual copy of the raw data. And the final section ("Run Reports by Region") is a copy of most of what was sent to the screen during the LAMARC run (sans the progress bar, of course).

OK, then! I'm ready to publish!

Ha ha.

Er, right. So, what do I do next?

Now you have decisions to make, and since these decisions will be based on your data, which I don't have with me just now, you're going to have to make most of them on your own. If you're up for another question and answer session, though, skip ahead to Analyzing the Rest of Your Data. Or if you're keen to discover how to do a Bayesian run, read on.
