(Previous | Contents | Next)

Evolutionary Forces

This article describes, in some technical detail, the way we model evolutionary forces. You don't need this information to get the program running, but it can be very helpful in understanding your results.

The forces supported at this time are:

For each force, we describe the parameter being estimated, and list some assumptions made in the analysis. If the assumptions are violated, the results will not necessarily be wrong, but should be interpreted cautiously.

Coalescence

We estimate the parameter Theta for each population. While most papers describe Theta as 4Nemu (where Ne is the effective number of individuals and mu is the mutation rate in mutations per generation), this is specific to diploids. If you put haploid mtDNA into LAMARC, the Theta estimates will be estimates of 2Nfmu instead, where Nf is the effective number of females in the population. It is best to think of Theta as "number of heritable copies in population * 2 * mutation rate," since this definition works no matter what the ploidy is. (The "2" comes from the fact that two sequences that have diverged for time t are different by 2 * mu * t mutations, since both diverging lineages accumulate mutations.)

The units of mu are mutations per site per generation. Please note that many studies compute a (lower-case) theta whose units involve mutations per locus per generation. To convert, multiply the per- site Theta by the number of sites.

If the "multiple rate categories" option of the data model is used, the mutation rate is the weighted mean across all categories.

While it would be helpful to have separate estimates of population size and mutation rate, with genetic data from a single time point there is no way to separate them. If you have outside information about either population size or mutation rate, for example from outgroup analysis, you can then estimate the other parameter directly.

A Frequently Asked Question is how to interpret the parameter Theta when analyzing mitochondrial DNA or Y-chromosome DNA. In these cases Theta is proportional to the mtDNA or Y-chromosome effective population size. For example, when given mtDNA, the program estimates Theta=2Nfmu, and when given Y-chromosome DNA, the program estimates Theta=2Nmmu (where "f" and "m" refer to the female and male effective population sizes).

To combine estimates of Theta from regions with different Thetas, such as mtDNA and nuclear DNA, you can set the relative population sizes in the 'Effective Population Size' menu, or set these values directly in the XML input. Reported multi-region Theta estimates will be scaled accordingly.

Assumptions:

Migration

We estimate the immigration rate into each population from each other population (moving forward in time), so that a three-population case will estimate six rates. The immigration parameter M is expressed as m/mu where m is the immigration rate per generation and mu is the neutral mutation rate per site per generation. (As usual, if multiple mutation categories are used, mu is the weighted average.)

If you would prefer to consider migration in terms of 4Nm, multiply the given value by the recipient population's value of Theta. Please be careful of the distinction between immigration and emigration. LAMARC always estimates and reports immigration, so 'M12' indicates the rate at which individuals from population 1 have moved to population 2.

In the current version we cannot meaningfully combine genetic regions with different migration rates in the same analysis. We hope to provide this capability later.

Peter Beerli has argued that it may be useful to include a population in your analysis even if you have not sampled any individuals from it. This can guard against biases produced by unacknowledged populations. For example, if you believe that your real-life situation involves three populations exchanging migrants, but you have been able to sample only two of them, you might add the third population even with no individuals. The parameter estimates involving this population will be weak, but the estimates involving the other two populations may be more accurate than if the unsampled population were omitted entirely.

However, LAMARC is not very stable with unsampled populations, because the likelihood surface for the parameters of the unsampled population may be flat or multi-peaked. For this reason we have not allowed unsampled populations in v2. If you feel this capability would help you, consider using MIGRATE, and also please let us know. We will work on stabilizing estimates for unsampled populations if this is of general interest.

Assumptions:

We do not assume that migration is symmetrical, though you can impose this assumption if you wish using constraints.

Recombination

We estimate the recombination rate r = C/mu, where C is the recombination rate per inter-site link per generation and mu is the neutral mutation rate per site per generation. As usual, if multiple rate categories are in effect, mu is the weighted average of the categories.

In the current version we cannot meaningfully combine regions or populations with different recombination rates. We hope to provide this capability later.

We do not assume that all recombinations are visible. An advantage of LAMARC and its predecessor RECOMBINE over pairwise methods is that their recombination estimates are not dragged downwards if the data are nearly invariant (though the error bars, of course, increase greatly).

Assumptions:

Growth

We estimate the exponential growth rate g as defined in the following equation, where t is a time before the present:

Thetat = Thetapresent time exp(-gt)

This means that positive values of g indicate a population which is growing, and negative values a population which is shrinking. They are not symmetrical in magnitude, however; g = 10 indicates rather slow growth, but g = -10 indicates significant shrinkage for most values of Theta.

When migration is also in effect, growth rates are computed for each population independently.

In the presence of growth, the reported value of Theta is the present-day Theta (at the time when the data were sampled). The equation above can be used to determine values of Theta for past times. Just remember that the time parameter t is measured in units of mutations (i.e. 1 t is the average number of generations it takes one site to accumulate one mutation), and g is measured in the inverse units of t. If mu is known, divide generations by mu to get units of t (conversely, t*mu is a number of generations).

Assumptions:

Simulation studies show that estimation of growth tends to be biased upwards; this bias is reduced by having multiple unlinked regions and/or, in cases with recombination, long sequences. Be cautious in interpreting results from one or a few regions. The profile likelihoods are more reliable than the maxima but can also show the bias.

Gamma-distributed relative mutation rate over regions

This isn't exactly a "force," but because LAMARC optionally allows you to estimate the parameter (α) of the gamma distribution which best fits data sets composed of multiple unlinked genomic regions, it's in the menu for evolutionary forces. More information about this is available here.

Because there need to be multiple genomic regions in order for there to be relative mutation rates to distribute among them, this "force" will not appear in the evolutionary forces menu if the data is annotated as coming from a single genomic region.

Due to a mathematical detail of how this feature is implemented in LAMARC, it can only be applied to maximum-likelihood analyses. If you wish to perform a Bayesian analysis, your best bet is to estimate the relative single-region mutation rates by some other means, then supply these to LAMARC as constants.

Assumptions:

Divergence

We estimate the time at which pairs of populations split from their common ancestor. Thus, a case with three populations will estimate two divergence times. For example, a human/chimp/gorilla case might estimate the human versus chimp divergence time and the human+chimp versus gorilla divergence time.

LAMARC does not infer the population tree. The information that we should be computing human+chimp and not chimp+gorilla as the first split must be provided by the user. [For a method that attempts to infer population trees, see the "*BEAST" program from Drummond's group.] Divergence requires not only the topology of the population tree, but its order of events. If we have a tree with an A+B split and a C+D split we must specify which split happened most recently.

Divergence times are scaled by the mutation rate. A time of 1.0 for the divergence of two populations means that it occured long enough ago that each character has an expectation of 1 mutation since then.

If migration is additionally turned on, migration is allowed between populations that exist at the same time. This leads to inference of more migration rates than a non-divergence case. For example, in the human/chimp/gorilla example the program could maximally infer bidirectional migration rates between human and chimp, human and gorilla, chimp and gorilla, and between gorilla and the human/chimp ancestor. Users should beware of inferring too many migration rates for the available amount of data. Unwanted rates can be fixed at zero. For example, in this case fixing the rates between human and gorilla to zero would seem reasonable.

An important caveat about models of divergence is that in some cases the data will say nothing about one or more parameters (the jargon term is "not identifiable"). For example, if a population split very recently there will be no power to infer the sizes of the two descendants or the migration rate between them. If they split very long ago there will be no power to infer the size of the ancestor or its migration rate. And if the migration rate between two populations is very high there will be no power to infer their divergence time. It is important to look at confidence intervals and, in Bayesian runs, the shape of the posterior distribution to detect cases in which some parameters are not identifiable.

Quick calculators for starting values (FST, Watterson) are not available in cases with divergence. We don't know how to use them for unsampled ancestral populations, so cravenly don't use them at all.

Assumptions:

(Previous | Contents | Next)