LAMARC Documentation: Suitable data for LAMARC

Suitable data for LAMARC

This article gives our best available information on the type and amount of data needed to use LAMARC successfully. Remember that LAMARC is an experimental tool, not an oracle. You know your data better than we do. The following guidelines will help, but we don't guarantee success even if you meet these criteria. Every data set is different. But if you have much less data than described here, you will probably not be happy with your results.

How many unlinked regions?

How many samples?

How many linked sites?

What kinds of samples?

What kinds of populations?

How many unlinked regions?

Best rule we've found is count the number of forces you wish to recover and add 1. So for estimation of Theta and migration rate it is possible to get results with one or two regions but the returned values improve markedly with three. This is because each independent region tracks a different evolutionary pathway. Each gives you an independent measure of the forces of interest. Averaging the estimate for each force from each region together gives you a much better estimate than any individual region can provide.

There are a few exceptions:

Estimation of growth rate of a single population is very poor with less than 3 unlinked regions and particularly benefits from having more.

Recombination is best estimated from a single lengthy region (more discussion of this below in "How many linked sites?"

Note that runtime increases approximately linearly with the number of regions. But it's hard to have too many, unless you exceed your computer's memory capacity (or your patience).

How many samples?

Surprisingly, Felsenstein has shown that for estimation of Theta the optimal number of samples is very low, around 8. (Reference: Felsenstein 2006.) If you can obtain another unlinked region you will gain much more than you would gain by increasing the sample size for the current region beyond 8. We generally recommend aiming for 10-15 sequences per population. Again, recombination is an exception; a recombination analysis probably needs 15-20 sequences.

If you cannot get multiple genes, multiple sequences are some help, but we do not recommend going above 30 sequences/subpopulation as the added difficulty of the analysis more than outweighs the gain in information. Remember that the more sequences you have, the longer you will have to search to find good trees, and the longer each tree will take to process. Runtime goes up with the log of the number of sequences, but as you will also have to perform more steps (perhaps many more steps) in the final outcome, adding sequences causes a worse-than-linear increase in runtime.

How many linked sites?

This depends on the expected level of polymorphism. For DNA or SNP data, a region should ideally be long enough to see 10 or more variable sites. If that's not possible you will definitely need multiple unlinked regions. Above 100 or so variable sites, there is little additional information to be gained outside of estimating recombination.

Runtime goes up less than linearly with number of sites; how much less depends on how polymorphic your data are. Nearly invariant data are very quick, so if you have long sequences, by all means use them (assuming you have the memory).

Recombination is an exception. You would like your sequenced area to include tens of recombinations, but not hundreds. Unfortunately it is hard to know this in advance. A recombinational analysis will definitely need 20-30 variable sites or more to have much accuracy. In general, sequences for recombination inference should be long, but watch out for signs that you are in the hundreds-of-recombinations zone as such runs will take a very long time. (The estimates will be good, but you won't want to wait for them.)

The best approach if you do not know how much recombinaion your have is to start small and experiment. For example, run a very small (2-4) sample subset of your data and see how long it takes and how much recombination is finds. If you find you have high levels of recombination you may then want to shorten your sequences.

What kinds of samples?

LAMARC requires random samples from each subpopulation. Therefore

DO NOT cherrypick the most divergent or most interesting sequences,
DO NOT pick one from each major lineage,
DO NOT discard identical sequences, and
DO NOT do anything of this sort!

Such data will give grossly biased estimates.

It has been stated in print (not by us!) that mutually incompatible sites in the data set must be removed. This is totally untrue. It is true for programs which assume an infinite-sites model, but LAMARC does not. Remove sites only if you cannot be sure they are correctly aligned. It is best to remove all alignment columns for which the alignment is doubtful.

Be sure that all of your samples are of the same homolog; throwing in a few sequences from a paralog will ruin your analysis.

You do not need to sample from multiple subpopulations evenly, nor in proportion to their population sizes. However, try not to let the sample size from any subpopulation go below 2 diploid or 4 haploid individuals as the estimates are likely to be unstable.

What kinds of populations?

LAMARC works well for subpopulations which show definite signs of geographic structure. You may want to use the program STRUCTURE to test for this before beginning. If STRUCTURE does not see any structure in your populations, the migration rate is probably too high and you should pool those populations together.

One warning here. If STRUCTURE finds clusters it's tempting to use them to redefine your populations. DON"T DO THAT!!! You've just returned all the migrants to their original populations. If you find structure, keep your samples in their original spacial groupings.

At the other extreme, if your populations are totally isolated you will not, of course, get a good estimate of the migration rate, though other parameters can be well estimated. If your populations have probably not exchanged migrants in the last 4N generations (for a diploid, or 2N for a haploid) you should analyze each one separately, or use a divergence model; it is not appropriate to try to estimate non-divergence migration rates.

(Previous | Contents | Next)