Glossary
This article gives informal definitions of terms used in the
LAMARC documentation. For more precise definitions, consult our published
papers.
- Bayesian analysis. In the LAMARC context, an
analysis which places priors on the population parameters and then samples
both possible trees and possible parameters. It reports the relative
probabilities of the range of parameter values which it visited. The
alternative is a Likelihood analysis.
- Brownian-motion model. A mathematical approximation of the
stepwise model of microsatellite mutation. It will run much faster, but
may be inaccurate if your data are nearly invariant.
- Burn-in. Discarding some steps of the sampler before
beginning to record any. This is done because early steps may
be highly atypical. For example, if the sampler starts with a
deeply unreasonable tree, the first few trees it produces will
also be unreasonable and probably should be discarded.
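As a rough illustration of burn-in (not LAMARC's own code), the sketch below drops the first few recorded steps of a sampler trace. The function name `discard_burn_in` and the trace values are hypothetical.

```python
def discard_burn_in(samples, burn_in):
    """Return the samples recorded after the first `burn_in` steps."""
    if burn_in < 0:
        raise ValueError("burn_in must be non-negative")
    return samples[burn_in:]


# Hypothetical trace of sampled Theta values; the early ones are atypical
# because the sampler started from an unreasonable tree.
trace = [9.5, 4.2, 0.9, 0.011, 0.010, 0.012, 0.009]
kept = discard_burn_in(trace, burn_in=3)
print(kept)
```

How many steps to discard is a judgment call; the value 3 here is chosen only to make the example readable.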
- Chain. LAMARC grinds for a while
producing trees, and then makes an estimate based on
those trees. It may repeat this cycle many times, especially in
a likelihood run. Each such cycle is a "chain." "Initial"
chains are used to get reasonable starting values of the
parameters, and are generally shorter; "final" chains are
used to obtain the actual parameter estimates, and are
generally longer. (A Bayesian run is often just one huge
final chain.)
- Coalescence. Kingman developed the idea of looking at
a genealogy going backwards in time from modern individuals to
their ancestors. He named the resulting pattern the "n-coalescent"
(later writers have dropped the "n-"). So a coalescent tree is
the backwards pattern of relationships among modern sampled
individuals, and a coalescence is a point at which two of those
individuals have a common ancestor. In LAMARC, we sometimes
use "coalescence" to refer to the effect of Theta on the shape
of the coalescent tree, as opposed to the effects of migration,
recombination, growth, and so forth. The proper name of this
evolutionary force is "genetic drift."
- Curve file. In a Bayesian run, a file containing data points
from the probability density function that
results from collecting the parameters which the chain has visited. It can
be read into a spreadsheet (Excel) or math/stat package (Mathematica) to
produce a picture of the Bayesian posterior probability curve.
- Data likelihood. The probability of your sequence data given a
specific tree (genealogy). This is also computed by maximum-likelihood
phylogeny algorithms such as PHYLIP's
DNAML.
Usually expressed as a log-likelihood because it's so small; the resulting
numbers are negative, and are improved by moving closer to zero.
- Data uncertainty. LAMARC runs on nucleotide data can now model data uncertainty.
The per-base error rate gives the rate at which each single instance of a
nucleotide should be assumed to have been miscalled. A value of 0 indicates
that all were sequenced correctly. A value of 0.001 indicates that one in one
thousand is incorrectly called.
Note that this is different from using the IUPAC ambiguity codes, as it privileges
one (or more) data values over another. It may be used simultaneously
with the IUPAC codes.
As of December 2009, the per-base error rate must be set to a single
value for each segment.
See the data uncertainty section in the
Instructions for using the LAMARC menu
documentation for more information.
- Divergence. Splitting of
an ancestral population into two descendant populations. There may optionally
be ongoing migration between the descendants after the split.
- Effective population size. A theoretical concept that converts a
real population into an idealized Wright-Fisher population. The effective
population size is the size of a Wright-Fisher population that would have
the same amount of genetic drift (same Theta) as our real one. Most real
populations have an effective population size much smaller than their
census size due to factors like non-breeding individuals, overlapping
generations, and unequal reproductive success between the sexes.
- Epoch time. In LAMARC, an "epoch" is the period of time between one population
splitting event and the next, in a model with divergence. The parameter reported
for LAMARC epoch times is the time of the population split in generations times
the mutation rate per site in mutations per generation, counting backwards from
the present (when the samples were collected). Note that it is a time (the
time at which the populations split) and not a length of time (the length of time
between one population split and the next). If you wish to know the split time in
generations or years, you will need an external estimate of the mutation rate.
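The conversion described above can be sketched as follows. This is only an illustration of the arithmetic implied by the definition (reported time = generations times mutation rate per site); the function names, the mutation rate, and the generation time are hypothetical inputs you would have to supply from outside LAMARC.

```python
def split_time_in_generations(epoch_time, mu_per_site):
    """Convert a reported epoch time (generations * mu) into generations."""
    if mu_per_site <= 0:
        raise ValueError("mutation rate must be positive")
    return epoch_time / mu_per_site


def split_time_in_years(epoch_time, mu_per_site, years_per_generation):
    """Convert a reported epoch time into years, given a generation time."""
    return split_time_in_generations(epoch_time, mu_per_site) * years_per_generation


# Hypothetical values: reported epoch time 0.002, external estimate of
# mu = 1e-8 per site per generation, 25 years per generation.
generations = split_time_in_generations(0.002, 1e-8)   # roughly 2e5 generations
years = split_time_in_years(0.002, 1e-8, 25)            # roughly 5e6 years
print(generations, years)
```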
- F84 model. Evolutionary model proposed by Joe Felsenstein (in
1984) in which the frequencies of the four nucleotides and the ratio
of transitions to transversions may be varied. Several simpler
models such as Kimura 2-Parameter and Jukes-Cantor can easily
be expressed as subsets of F84. (They can also be expressed as
subsets of GTR but this will be much slower.)
- Growth (g).
The parameter governing the
exponential growth model used by LAMARC. Theta at any time
t (where t is measured increasing into the past, and is
in "mutational units" where one unit of time is the expected
time until a site mutates once) is equal
to modern-day Theta times exp(-gt). Positive values of g
thus indicate a growing population and negative values a
shrinking one. Interpretation of the actual value requires
knowledge of the mutation rate, because of the use of
mutational units in defining time. However, even without
this, g values from organisms of similar mutation rate can
be compared.
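The growth formula above, Theta(t) = Theta(now) times exp(-gt), can be evaluated directly. The sketch below is just that formula; the function name `theta_at_time` and the example numbers are illustrative assumptions.

```python
import math


def theta_at_time(theta_now, g, t):
    """Theta at time t in the past (t in mutational units), given modern Theta and g."""
    return theta_now * math.exp(-g * t)


theta_now = 0.01
# Positive g: the population was smaller in the past, so it has been growing.
print(theta_at_time(theta_now, g=100.0, t=0.005))
# Negative g: the population was larger in the past, so it has been shrinking.
print(theta_at_time(theta_now, g=-100.0, t=0.005))
```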
- GTR model. The General Time-Reversible Model, the
most general easily tractable model of nucleotide sequence
data. Can be used to emulate simpler models, but if a model
can be expressed as a simplification of F84 instead it will
run faster.
- Haplotype. A collection of markers known to come from
the same chromosomal copy in an individual. See "Phase."
- Heating. Improving the search performance of a
Markov chain Monte Carlo sampler by adding additional searches
which see a smoothed-out or "heated" version of the search space.
We refer to each additional search as a "temperature". This method is also
known as 'Metropolis-coupled Markov chain Monte Carlo', or MCMCMC, or
MC3.
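For readers unfamiliar with MCMCMC, the sketch below shows the textbook rule for deciding whether two searches at different temperatures exchange their current states. It assumes each search samples the posterior raised to the power 1/temperature; this is a generic formulation, not necessarily the exact scheme LAMARC implements, and the function name is hypothetical.

```python
import math


def swap_acceptance(log_post_i, log_post_j, temp_i, temp_j):
    """Probability of swapping the states of two searches.

    log_post_i / log_post_j: log posterior of the state currently held by
    search i / search j. temp_i / temp_j: their temperatures (1.0 = unheated).
    Assumes search k samples posterior ** (1 / temp_k).
    """
    beta_i, beta_j = 1.0 / temp_i, 1.0 / temp_j
    log_ratio = (beta_i - beta_j) * (log_post_j - log_post_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


# The heated search (temperature 2) has found a better state than the cold
# search (temperature 1), so the swap is always accepted.
print(swap_acceptance(-100.0, -90.0, temp_i=1.0, temp_j=2.0))
# The reverse situation is accepted only occasionally.
print(swap_acceptance(-90.0, -100.0, temp_i=1.0, temp_j=2.0))
```

The point of the swap is that a heated search crosses valleys in the search space more easily and can hand a good state to the unheated search.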
- K-Allele model. A mutational model which assumes that
there are K states which a marker can be in, and change from any
state to any other is equally probable. Jukes-Cantor is a
K-Allele model (with K=4) for DNA data (though in LAMARC you
will need to use an actual DNA model to get the same effect).
The K-Allele model is suitable for data where we do not know
the pattern of mutation, such as presence/absence of a
chromosomal modification, or for electrophoretic mobility data.
It may also be useful for microsatellites
which do not fit a stepwise model well.
- Likelihood analysis. In the LAMARC
context, an analysis which collects trees at a particular driving value of
the parameter(s) and then uses those trees to make a likelihood curve for
other values of the parameters. The alternative is a Bayesian analysis.
- Locus. In ordinary genetic terminology, a
gene or other defined piece of chromosome. In LAMARC, we use the terms segment and region to
indicate sections of genetic information that would be referred to in other
places as 'loci'. We try to be consistent in using our terms, but may have
slipped on occasion--it's likely we mean 'segment' if we accidentally said
'locus' somewhere.
- Marker. A position along the chromosome for
which we have collected data. This is in contrast to "site", which is any position on the chromosome within the
area we are considering, even if we didn't collect data for it. SNP data
involves choosing a few interesting markers out of a much larger collection
of sites.
- Markov chain Monte Carlo. A strategy for integrating
functions which cannot be solved directly, such as finding
the probability of sequence data given population parameters.
"Markov chain" means that each step of the search depends only
on the previous step, and "Monte Carlo" means that random
choices are used rather than, say, a systematic grid search.
Abbreviated MCMC.
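A minimal sketch of the idea, using a Metropolis sampler on a toy one-dimensional target rather than on trees: each step depends only on the current state ("Markov chain") and uses random proposals ("Monte Carlo"). The function `metropolis`, the Gaussian proposal, and the standard-normal target are all assumptions made for illustration; LAMARC's sampler works on genealogies and is far more involved.

```python
import math
import random


def metropolis(log_target, start, n_steps, step_size, rng):
    """Sample from a density known only up to a constant, via its log."""
    current = start
    current_lp = log_target(current)
    samples = []
    for _ in range(n_steps):
        # Propose a random move that depends only on the current state.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


def log_standard_normal(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(1)
samples = metropolis(log_standard_normal, start=10.0, n_steps=5000,
                     step_size=1.0, rng=rng)
kept = samples[1000:]  # discard early steps as burn-in
print(sum(kept) / len(kept))  # should be near 0 for this target
```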
- Maximizer. The part of LAMARC which analyzes stored trees or
parameters to determine the best estimate of each parameter. Can be used
in conjunction with profiling to estimate error
bars for these parameters.
- Maximum likelihood estimate (MLE). The set of parameter values that
maximizes the posterior likelihood (the sum over
sampled genealogies of the probability of the data given the genealogy and
the probability of the genealogy given the parameters). In other words,
this is the best solution found by a likelihood
run. Contrast with the most probable
estimate, the best solution found in a Bayesian
analysis.
- Migration rate. LAMARC estimates immigration rates
(movement of breeding individuals into a population) and
records them as M=m/mu, where m is the chance of immigration per
individual per generation, and mu is the chance of mutation per
site per generation. If you are interested in 4Nm instead,
multiply our estimate of M by our estimate of Theta
for the recipient population.
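The conversion to 4Nm follows from the definitions in this glossary: with M = m/mu and a diploid nuclear Theta = 4N(mu), the product M times Theta is 4Nm. A small sketch, with a hypothetical function name and made-up estimates:

```python
def four_n_m(m_over_mu, theta_recipient):
    """Convert M = m/mu into 4Nm using Theta = 4N(mu) of the recipient population."""
    return m_over_mu * theta_recipient


# Hypothetical estimates: M = 100 into a population with Theta = 0.01.
print(four_n_m(100.0, 0.01))  # about 1 immigrant-equivalent in 4Nm units
```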
- Mixed KS model. A microsatellite mutational model that assumes some proportion
of changes are among random states (like K-Allele)
and the remainder are among adjacent states (like Stepwise). May be a better fit than either model
alone for some microsatellite data. LAMARC allows you to specify the
proportion of stepwise to multistep changes or to allow the program to try
to optimize this ratio as it runs. The parameter percent_stepwise
is the proportion of stepwise changes; percent_stepwise=0 is the K-Allele
model and percent_stepwise=1 is the Stepwise model.
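One way to picture the Mixed KS model is as a weighted mixture of the two component models. The sketch below computes, for a single mutation, the probability of landing in each allele state. The state indexing, the function name, and in particular the handling of the smallest and largest states (all stepwise weight goes to the one available neighbor) are assumptions made for illustration, not a description of LAMARC's internals.

```python
def mixed_ks_jump_probs(state, n_states, percent_stepwise):
    """Probability of each new state, given that one mutation occurs.

    state: current allele state, indexed 0 .. n_states - 1.
    percent_stepwise: proportion of stepwise changes, between 0 and 1.
    """
    if n_states < 2:
        raise ValueError("need at least two states")
    probs = [0.0] * n_states
    # K-Allele component: every other state is equally likely.
    for other in range(n_states):
        if other != state:
            probs[other] += (1.0 - percent_stepwise) / (n_states - 1)
    # Stepwise component: only adjacent states are reachable.
    neighbors = [s for s in (state - 1, state + 1) if 0 <= s < n_states]
    for s in neighbors:
        probs[s] += percent_stepwise / len(neighbors)
    return probs


print(mixed_ks_jump_probs(state=2, n_states=5, percent_stepwise=0.8))
```

Setting `percent_stepwise` to 0 or 1 in this sketch recovers the pure K-Allele or pure Stepwise behavior, respectively.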
- Most probable estimate (MPE). The highest
point on the posterior probability curve for a given
parameter. Effectively, the value around which the sampled parameter
values clustered most densely (think of the
tallest bar in a histogram). In other words, this is the best solution
found by a Bayesian run. Contrast with maximum likelihood estimate, the best solution found by a likelihood analysis.
- Mu. Neutral mutation rate per site or marker (depending on data type) per generation. If you
have multiple data types, the relative mu rate must be set for each to
ensure accurate co-estimation of your parameters. Please be careful in
comparing LAMARC runs with results from other methods, as these often report
mu per segment rather than mu per site.
- Parameter. In LAMARC parlance, a parameter is a numerical
quantity that describes an aspect of a population
or set of populations. LAMARC can co-estimate effective population sizes (Theta), migration rates, the recombination rate, and growth
rates for one or more populations.
- Phase. Phase is information about whether or
not several variants seen in the same individual are on the same chromosome
of that individual, or different chromosomes. (In other words, whether
they are part of the same haplotype.) If data is of
"known phase" then we know which variants group together on the same
chromosome; if it is of "unknown phase" we only know which variants are
present, but not how they are allocated among the gene copies. Another way
of saying this is that haplotype data is phase-known and genotype data is
phase-unknown.
- Point probability. The height of the posterior probability curve at a given value of an
estimated parameter. Point probabilities can be compared to other point
probabilities on the same curve (for example, to find the most
probable estimate), but are otherwise not as useful as integrating a
section of that curve.
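To make "integrating a section of that curve" concrete, the sketch below applies the trapezoid rule to a curve given as (parameter value, density) pairs. The in-memory list stands in for data read from a curve file, whose exact layout is not assumed here; the function names and the triangular example curve are hypothetical.

```python
def area_under_curve(points):
    """Trapezoid-rule area under (value, density) points sorted by value."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        total += 0.5 * (y0 + y1) * (x1 - x0)
    return total


def probability_in_range(points, low, high):
    """Share of the curve's area lying between low and high.

    Only grid points inside [low, high] are used, so low and high should
    coincide with values present in the curve.
    """
    inside = [(x, y) for x, y in points if low <= x <= high]
    return area_under_curve(inside) / area_under_curve(points)


# Hypothetical posterior curve: a triangle peaking at 1.0.
curve = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (1.5, 0.5), (2.0, 0.0)]
print(probability_in_range(curve, 0.5, 1.5))
```

The point probability at 1.0 is twice that at 0.5, which says which value is more probable; the integral says how much probability the whole interval holds.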
- Population. A group of organisms that are more or less
freely interbreeding among themselves and isolated from other groups.
- Posterior likelihood. The probability of the observed
genealogies at their best (MLE) parameter values, divided by
their probability at the values which produced them ("driving
values"). This is quoted because it can be useful in diagnosing
a poorly performing run; if it is very large, the driving values
were poor ones and the run should be extended until better ones
can be found. It cannot be used in a likelihood ratio
test between runs, because it is a ratio of two independent likelihoods,
neither of which we can actually compute on its own. Posterior likelihoods
are produced in a likelihood analysis.
- Posterior probability curve. A probability density function that describes the relative
probabilities that the true value of a parameter is a particular value.
Useful when comparing the relative probabilities of two or more possible
values for the parameter, or when integrating to find the total probability
that the true value of the parameter is within a range of values (as during
profiling). Posterior probability curves are
produced in a Bayesian analysis, and exported as curve files by LAMARC. See also Point probability.
- Prior. A LAMARC prior for a Bayesian run is a curve that describes your prior
knowledge of the possible range of values for a given parameter. LAMARC
uses only 'flat' priors, giving the parameter an equal chance to be anywhere
within the allowed range, but the density of those priors may be either
linear or logarithmic, depending on how the parameter is expected to vary.
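The difference between the two kinds of flat prior can be seen by drawing values from each. In this sketch a linear prior is uniform on the parameter itself and a logarithmic prior is uniform on the parameter's logarithm; the function names and bounds are illustrative assumptions.

```python
import math
import random


def draw_linear(low, high, rng):
    """Draw from a prior that is flat on the parameter itself."""
    return rng.uniform(low, high)


def draw_log(low, high, rng):
    """Draw from a prior that is flat on the log of the parameter (low > 0)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(7)
low, high = 1e-5, 10.0
linear_draws = [draw_linear(low, high, rng) for _ in range(1000)]
log_draws = [draw_log(low, high, rng) for _ in range(1000)]

# Fraction of draws below 1.0 under each prior.
print(sum(x < 1.0 for x in linear_draws) / 1000)
print(sum(x < 1.0 for x in log_draws) / 1000)
```

With these bounds the logarithmic prior puts most of its draws below 1.0, while the linear prior puts most of them above, which is why the logarithmic form suits parameters that may vary over several orders of magnitude.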
- Probability Density Function. A curve that
describes a probability distribution in terms of integrals (the area under a
probability density function must be 1.0). More informally, it can be
viewed as a 'smoothed out' version of a histogram.
- Profiles. A picture of the uncertainty
in LAMARC's results. In a likelihood run, profiles are produced
by holding one parameter constant at a series of different values,
and for each value maximizing all other parameters. This is like
taking a slice through the multi-dimensional surface to reveal the landscape
with respect to one parameter. In a Bayesian run, profiles
are produced by showing the probability curve for one parameter
at a time; no information is available about how other parameters
co-vary with the chosen one. In either case, percentile profiles
show the results at percentiles of the distribution (for example,
the 95% marks) while the less expensive fixed profiles show
the results at arbitrary points chosen in advance.
- Recombination rate (r).
LAMARC measures recombination
rate as C/mu, where C is the rate of recombination per inter-site
link per generation, and mu is the mutation rate per site per
generation. Be careful in comparing this with other estimates,
which are often of 4NC, or per-locus rather than per-site.
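When comparing with other estimates, it can help to unpack r = C/mu. The sketch below recovers C from r and an external mutation rate, and scales it to a stretch of sites by counting inter-site links (one fewer than the number of sites). The function names and numbers are hypothetical, and treating the per-link rates as additive is an approximation that assumes they are small.

```python
def recombination_per_link(r, mu_per_site):
    """C: recombination rate per inter-site link per generation, from r = C/mu."""
    return r * mu_per_site


def recombination_per_stretch(r, mu_per_site, n_sites):
    """Approximate recombination rate per generation across n_sites linked sites."""
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    return recombination_per_link(r, mu_per_site) * (n_sites - 1)


# Hypothetical values: r = 0.05, mu = 1e-8 per site per generation, 1001 sites.
print(recombination_per_link(0.05, 1e-8))
print(recombination_per_stretch(0.05, 1e-8, 1001))
```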
- Region (or 'Linked Segment Region'). A
stretch of linked markers along a chromosome. The usual term is "locus", but a LAMARC region can contain what a geneticist
might consider multiple loci (e.g. coding regions for several genes) as long
as they are all linked and their map is known. In the program and in our
documentation, we usually refer to this kind of region as a "linked segment
region", because our regions can contain multiple linked segments.
- Replicate. An internal repetition of a
LAMARC analysis used to produce a more refined result, and to measure
whether the run is long enough to produce consistent results. In a likelihood analysis, when multiple replicates are
run, their results are combined via Geyer's reverse logistic regression
method to produce a joint estimate that should be more accurate than the
individual ones. In a Bayesian analysis, when
multiple replicates are run, the resulting posterior
probability curves are simply averaged.
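For the Bayesian case, "simply averaged" can be sketched as a pointwise mean of the replicate curves. This assumes every replicate's curve is evaluated on the same grid of parameter values, which is an assumption of this example; the function name is hypothetical.

```python
def average_curves(curves):
    """Pointwise average of density curves evaluated on a shared grid."""
    if not curves:
        raise ValueError("need at least one curve")
    if len({len(c) for c in curves}) != 1:
        raise ValueError("all curves must share the same grid")
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


# Two hypothetical replicate curves on the same three grid points.
replicate_1 = [0.0, 1.0, 0.5]
replicate_2 = [1.0, 3.0, 0.5]
print(average_curves([replicate_1, replicate_2]))
```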
- Runtime report. The on-screen report
showing each chain and its acceptance rates, parameter estimates, times, and
so forth; also reprinted as the last entry in the output file.
- Segment (or Coherent Segment). A contiguous
stretch of sites containing markers
of all the same data type. Multiple segments can be linked together within
a region, even when the data types differ.
- Site. A position along the chromosome in the
area under study, whether or not you have observed any data at that
position. Contrasts with "marker", which is a site at
which you have observed data. Sites are important because we must consider
every site as a possible location for a recombination, whether or not it is
a marker; the total chance of recombination is proportional to the number
of sites.
- Stepwise model. A mutational model that
assumes mutations happen between adjacent states. It is generally used
for microsatellites, and in this case the assumption is that
microsatellites increase or decrease by one repeat unit at a time. (It may
still work well even if this assumption is occasionally violated; if you
believe it is often violated, the K-Allele or Mixed KS models may be more
appropriate.) The Brownian model approximates Stepwise and runs much
faster, but may break down if population sizes are very small.
- Theta. Population parameter controlling the
amount of genetic diversity in a population, and equal to two times the
neutral mutation rate per site times the number of heritable gene copies
in the population. For a diploid population, the Theta for nuclear DNA is
equal to 4N(mu).
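Going the other way, an estimate of Theta can be turned into a population size only with an external mutation rate. The sketch below inverts the definitions given above (Theta = 2 times mu times gene copies; 4N(mu) for diploid nuclear DNA); the result is an effective size, and the function names and values are hypothetical.

```python
def gene_copies_from_theta(theta, mu_per_site):
    """Number of heritable gene copies implied by Theta = 2 * mu * copies."""
    return theta / (2.0 * mu_per_site)


def diploid_size_from_theta(theta, mu_per_site):
    """Diploid population size N implied by Theta = 4 * N * mu."""
    return theta / (4.0 * mu_per_site)


# Hypothetical values: Theta = 0.004, mu = 1e-8 per site per generation.
print(diploid_size_from_theta(0.004, 1e-8))   # about 100,000 individuals
print(gene_copies_from_theta(0.004, 1e-8))    # about 200,000 gene copies
```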
- Tip. A single observed sequence; generally a haplotype,
though if you do not know the phase of your data you can make
fictional haplotypes to use as initial tips. Called "tip" because
it appears at one of the tips of the genealogical tree.
- Tree. An informal word for the coalescent genealogy which
relates the sampled sequences. Formally, genealogies with recombination
should be called Ancestral Recombination Graphs, but we frequently
call them trees even though they are not strictly tree-like.