
The Interactive LAMARC menu system

Introduction

LAMARC's user interface is fairly awkward to use. This is because LAMARC is mainly a batch program, and this interface reflects the way LAMARC does things internally, not the way users think about things. LAMARC is designed to run off XML input. This interface reads in and appropriately decorates the output of the File Conversion Utilities, and allows users to tweak existing XML (often the output of this interface) to do different analyses than it was originally created for. Note that there is almost no overlap between what can be edited in this interface and what can be edited in the File Converter. Our long-term goal is to combine this interface with the File Converter and make LAMARC a purely batch program (which has many advantages, such as more effective use of parallel computing).

This interface reflects how LAMARC is organized internally, which is not necessarily obvious. We recommend that you take a tour through all the menus before you do anything to get a sense of where things are. For example, "Migration" is found under "Analysis", which may seem odd until you realize that if "Migration" is on, the "Analysis" of the data changes. There are a lot of other examples of seemingly normal terms meaning slightly different things in the LAMARC context. A tour of the interface will give you an overview.

General Conventions

LAMARC has what is essentially a command-line interface, which is fairly uncommon in the modern era. Here are some general conventions that will help you understand it:

Start Up

When you start up LAMARC the first thing you will be asked is what your output directory is. You will then be asked for your input file. This may seem a bit backwards, since usually you would want to read data in before writing it out, but it reflects the internal workings of LAMARC. If you don't specify an input file, LAMARC will follow the Phylip convention and look in the output directory for a file called "infile".

The data in the input file defines the kinds of analyses which are possible. If you don't see the kind of analysis you wish to do listed on the "Analysis" menu, you will need to modify your input file so that kind of analysis is possible. For example, if you wish to study migration, you need at least two populations. If "Migration" is not an option in "Analysis" you only have one population defined in your input file. You will need to fix that, either using the file converter or editing the XML directly, before LAMARC can analyze migration.

Once the data have been located and processed (which may take several seconds), the first screen you see upon starting LAMARC is the top level menu:

LAMARC main screen

The menu may appear in a different form depending on your computer system, but the basic ideas are always the same. You can now review and set values in the following areas, each described in its own section below: Data options, Analysis, Search Strategy, Input and Output related tasks, and Show current settings.

On all LAMARC menus, the bottom line will give two options: Run, which begins the analysis, and Quit, which exits the program.

If you are viewing a sub-menu, you will also have the option to return to the previous menu.

If you have made any changes from the initial setup within a submenu, there will be an Undo option.

If you have performed any Undo operations, there will be a Redo option.

Any time that the current menu settings would form a valid LAMARC input file, there will be an option to write out such a file.

Warning 1: LAMARC's search defaults are fairly small. This allows a first-time user to get visible results right away, but for a serious publishable analysis you will almost surely want to increase some of the settings in the "Search Strategy" menu.

Warning 2: Once you have selected "Run" you will have no further interaction with the program; you'll have to kill and restart it in order to change its behavior. However, LAMARC does save a modified version of its input file, updated with any changes made via the menu, into the file "menusettings_infile.xml" when you exit the menu. If you want to re-run LAMARC starting with the same data and menu options as you last selected, choose "menusettings_infile.xml" as your input file when restarting LAMARC.


Data options

LAMARC data screen

This menu allows you to define what your data is and how you want to model it.

The first two items (C and S) define the source of the random number seed used to start the analysis. Normally the seed is set from the system clock, so this option defaults to "Yes". To toggle it off and use the explicit seed instead, type C.

A very few systems lack a system clock; on those you will need to set this value by hand (either here or in the input file).

The explicit random seed is used if you wish to do exactly the same analysis twice. You can hand-set the seed by entering S. You will be queried for the number to be used. LAMARC will then find the closest integer of the form 4N+1 to the number you entered.
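
The rounding is simple arithmetic; here is a minimal Python sketch (the function name is illustrative, not LAMARC's actual code):

    def nearest_lamarc_seed(n):
        # seeds must have the form 4N+1 (1, 5, 9, ...); round the
        # entered number to the closest such integer
        return 4 * round((n - 1) / 4) + 1

    print(nearest_lamarc_seed(1000))  # -> 1001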

The E option (Effective population size menu) will only appear if you have data from multiple regions. It provides a way to combine data collected from different regions of the genome that have unique effective population sizes. For example, data from nuclear chromosomes of a diploid organism reflect an effective population size four times larger than data from the mitochondrion of the same organism. Data from sex chromosomes also have unique effective population sizes: the ratio of relative effective population sizes for a non-sex chromosome, an X chromosome, and a Y chromosome is 4:3:1. Selecting E takes you to a sub-menu where you can select a particular genomic region and set its effective population size.

The next set of menus allow you to modify the data-analysis model for each segment of your data. You can either modify the model for each segment individually (1, ...), or you can modify a default model for the different types of data LAMARC found in your infile. If you have DNA or RNA or SNP data, N allows you to edit the default data model for all Nucleotide data. If you have microsatellite data, M allows you to edit that data's default model. If you have K-Allele data, K allows you to edit that data's default model. To assign the appropriate default data model to all segments, select D.

For nucleotide data, you can either choose the Felsenstein '84 model (F84) or the General Time Reversible model (GTR). Microsatellite data may take the Brownian, Stepwise, K-Allele, or Mixed K-Allele/Stepwise models (MixedKS). K-Allele data may only take the K-Allele model.

Options common to all data model submenus

Several menu options are common to all evolutionary models; for conciseness these are described here.

If you are editing the data model for a particular segment (and not for a default), the first line displays the type of data found in that segment, and you are given the option (D) of using the appropriate default data model for that segment. The M option (Data Model) allows you to cycle through the possible data models appropriate for that data type.

The next two menu lines (C and A) describe the current state of the categories model of variation in mutation rates among sites. LAMARC uses the Hidden Markov Model of Felsenstein and Churchill (1996). In this model, you must determine how many rate categories you wish to assume, then provide the relative mutation rates for each category and the probability that a site will fall into that category. However, you do not need to specify which category each site actually falls into. The program will sum over all possibilities.

If you choose to have multiple categories, select C (Number of Categories), which will take you to a sub-menu. Here, you can change the number of categories with the N option, then select particular rate/probability pairs to change their values on a further sub-menu. For example, if you wish to model two categories with 80% of sites evolving at a base rate and the remaining 20% evolving ten times faster, you would set the number of categories to 2, then set one rate/probability pair to 1 and .8, and the second rate/probability pair to 10 and .2.

Internally, the program will normalize the rates so that the mean rate across categories, weighted by the category probabilities, is 1.0.

In data modeled with categories of mutation rates, the "mu" value (a component of various forces such as theta and M) is the weighted average of the individual mutation rates. In the above example, if you determine that mu is 0.028, you can solve the equation:

0.028 = (0.8 * 1x) + (0.2 * 10x)
x = 0.028 / 2.8 = 0.01
10x = 0.1

and thus determine that the two individual mutation rates are 0.01 and 0.1.
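
The same arithmetic as a short Python sketch (variable names are illustrative):

    probs = [0.8, 0.2]       # probability that a site is in each category
    rel_rates = [1.0, 10.0]  # relative mutation rates of the categories
    mu = 0.028               # overall (weighted-average) mutation rate

    # weighted mean of the relative rates: 0.8*1 + 0.2*10 = 2.8
    mean_rel = sum(p * r for p, r in zip(probs, rel_rates))

    # scale the relative rates so their weighted average equals mu
    rates = [mu * r / mean_rel for r in rel_rates]
    print(rates)  # -> approximately [0.01, 0.1]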

The program will slow down approximately in proportion to the number of rate categories. We do not recommend using more than five as the gain in accuracy is unlikely to be worth the loss in speed. Do not use two categories with the same rate: this accomplishes nothing and slows the program down.

A category rate of zero is legal and can be useful to represent invariant sites; however, at least one rate must be non-zero.

If you wish to use the popular gamma distribution to guide your rates, use another program to calculate a set of categories that will approximate a gamma distribution, then enter the corresponding rates and probabilities manually into LAMARC. There is currently no provision to infer gamma distributed rate variation within a single segment. For gamma distributed mutation rate variation across independent regions, see the gamma parameter of the analysis menu.
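
One common recipe for this is the equal-probability ("median") discretization of Yang (1994); here is a sketch, assuming SciPy is available (LAMARC itself performs no such calculation):

    from scipy.stats import gamma

    def gamma_categories(alpha, ncat):
        # equal-probability categories; each rate is the median of its
        # quantile slice of a gamma with mean 1 (shape=alpha, scale=1/alpha)
        probs = [1.0 / ncat] * ncat
        rates = [gamma.ppf((2 * i + 1) / (2.0 * ncat), alpha, scale=1.0 / alpha)
                 for i in range(ncat)]
        # rescale so the weighted mean rate is exactly 1.0
        mean = sum(p * r for p, r in zip(probs, rates))
        return [r / mean for r in rates], probs

    rates, probs = gamma_categories(alpha=0.5, ncat=4)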

The A (Auto-Correlation) option provides an autocorrelation coefficient which controls the tendency of rates to "clump". The coefficient can be interpreted as the average length of a run of sites with the same rate. If you believe there is no clumping (each site has an independent rate), set this coefficient to 1. If you believe that, for example, groups of about 100 sites tend to have the same rate, set it to 100.

While auto-correlation values may be set for any model, it is likely to make sense biologically only in the case of contiguous DNA or RNA data. It is not sensible to use it for widely separated SNPs or microsatellites.

After other model-specific options, the R (Relative mutation rate) option provides a coefficient which controls the comparison of mutation rates (mu) between segments and/or data types. If, for example, you have both microsatellite data and nuclear chromosome data in your file, and you know that changes accrue in your microsatellite data ten times faster than changes accrue in the DNA, you can use this option to set the relative mu rate for the microsat segment(s) to be 10, and the relative mu rate for the DNA segment(s) to be 1. Overall estimates of parameters with mu in them (like Theta) will be reported relative to the DNA values. If you want overall estimates reported relative to the microsat values, you can set the microsat mu rate to 1 and the DNA mu rate to 0.1.

Model-specific menus: nucleotide data

F84 model

The Felsenstein '84 (F84) model is a fairly general nucleotide evolutionary model, able to accommodate unequal nucleotide frequencies and unequal rates of transition and transversion. It has the flexibility to mimic simpler models such as Kimura's (set all nucleotide frequencies to 0.25) and Jukes and Cantor's (set all nucleotide frequencies to 0.25 and the transition/transversion ratio to 0.500001).

The T option (TT Ratio) allows you to set the ratio between transitions (A/G, C/T) and transversions (all other mutations). If bases mutated completely at random, this ratio would be 0.5 (1:2). If you want a random-mutation model (corresponding to the model of Jukes and Cantor), LAMARC will use 0.500001 instead of 0.5, due to a limitation of its algorithm that would otherwise divide by zero.
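
The 0.5 figure can be confirmed by simply counting substitution types, as in this small sketch:

    bases = "ACGT"
    purines = {"A", "G"}
    transitions = transversions = 0
    for i in bases:
        for j in bases:
            if i == j:
                continue
            # a transition stays within purines (A/G) or pyrimidines (C/T)
            if (i in purines) == (j in purines):
                transitions += 1
            else:
                transversions += 1
    print(transitions, transversions, transitions / transversions)  # 4 8 0.5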

Programs such as PAUP* can be used to estimate the transition/transversion ratio from your data. In practice it probably does not need to be very exact.

The B option (Base Frequencies) will take you to a submenu where you can either tell LAMARC to calculate the base frequencies directly from the data (the F option), or enter the values for the relative base frequencies yourself. Unless your sequences are very short, it is probably best to calculate these frequencies from the data. If a particular nucleotide does not exist in your data, you may set its frequency to a very small non-zero value (0.00001 is probably low enough).

General Time-Reversible (GTR) model

The GTR model is the most general tractable model for nucleotide data. It allows both unequal nucleotide frequencies and unequal rates for each pair of nucleotides. The most practical way to use GTR in LAMARC is to estimate its rates with another program, such as PAUP*. LAMARC does not have any facility to estimate the GTR rates itself, but requires them to be provided.

It is wasteful to use GTR when a simpler model is adequate, since it runs rather slowly. PAUP* with MODELTEST can be used to assess the adequacy of simpler models.

The G option (GTR rates) requests input of the six pair-specific mutation rates. These are symmetrical rates, before consideration of nucleotide frequencies, and can be obtained from PAUP*. PAUP* may provide only the first five rates; in that case the sixth ([GT]) rate is always 1.0.

The B option (Base Frequencies) allows you to set the base frequencies of the four nucleotides, in the order A, C, G, and T. The "Base frequencies computed from data" option is not given here, since the third-party program you use to determine GTR rates will also give you base frequencies, and you should use those.

Modeling data uncertainty in F84 and GTR models

LAMARC runs on nucleotide data can now model data uncertainty. This is option P in both the F84 and GTR models. The per-base error rate gives the rate at which each single instance of a nucleotide should be assumed to have been miscalled. A value of 0 indicates that all bases were sequenced correctly; a value of 0.001 indicates that one in one thousand is incorrect. The default value is 0. This feature is in beta test as of December 2009.

Model-specific menus: microsatellite data

Apart from the choice of which model to use, the only choices for the microsatellite models, except for the MixedKS model, are those common to all models: handling of rate differences among markers, and normalization. These are discussed above. It is not meaningful or useful to ask for autocorrelation when analyzing only a single microsatellite marker per segment.

Stepwise model

The stepwise microsatellite model assumes that microsatellite mutations are always single-step changes, so that the larger observed differences have been built up via successive single-step mutations.

Brownian-motion model

The Brownian-motion microsatellite model is an approximation of the stepwise model. Rather than a discrete model of single mutational steps, we use a continuous Brownian-motion function and then truncate it to the nearest step. This is much faster than the full stepwise model and returns comparable results on fairly polymorphic data, but is less accurate on nearly invariant data.

K-Allele model

This model assumes that only the alleles detected in the data exist, and that mutation from any such allele to any other is equally likely. The Jukes-Cantor DNA model, for example, is a K-Allele model for k=4.

Mixed K-Allele/Stepwise model

The Mixed K-Allele/Stepwise model (MixedKS) considers both mutation to adjacent states (like the Stepwise model) and mutation to arbitrary states (like the K-Allele model). The relative frequency of the two is expressed by the proportion constant percent_stepwise, available as menu option L. It indicates the proportion of changes which are stepwise, so that percent_stepwise=0 is K-Allele and percent_stepwise=1 is Stepwise. An initial value can be set here, and either used throughout the run, or optimized at the end of every chain if the Optimize (O) option is set. The program finds the value of percent_stepwise that maximizes the likelihood of the data on the final genealogy of each chain, using a bisection approach.

Model-specific menus: K-Allele data

The single model available for K-Allele data is the K-Allele model. "K-allele data" is defined as any genetic data collected as discrete units, such as electrophoretic data or phenotypic data. As for microsatellite data, the K-allele model assumes equally likely one-step mutations from any state to any other state.


Analysis

LAMARC analysis screen

The Analysis option leads to a submenu that allows you to specify the evolutionary forces you're going to infer, as well as the starting values, constraints, and profiling options for each force's parameters. Fewer or more options will appear here depending on your data. If there is more than one population, you will have an M option for estimating Migration parameters. Similarly, if you have more than one region in your data, you can turn on or off estimation of varying mutation rates over regions (gamma), and if you have trait data, you can set up your mapping analysis.

Each force currently in effect is marked as Enabled, and forces not in effect are marked as Disabled. If you wish to add or remove a force, or to change the parameters associated with a force, enter that force's submenu.

One point to bear in mind is that for nucleotide data the mutation rate mu is always expressed as mutation per site, not mutation per locus as in many studies. You may need to do a conversion in order to compare your results with those from other studies.

Each force is explained below, and following that is a description of the various options available on the submenus: constraints, profiling, and Bayesian priors. For more information on evolutionary forces, consult the forces documentation.

Theta (Effective Population Size): the "Coalescence" force

Coalescence is obligatory on all data sets, so there is no provision for turning it off.

The Theta submenu allows you to customize estimation of Theta, the number of heritable units in the population times the neutral mutation rate times two. This is 4Nemu for ordinary diploid data, Nemu for mitochondrial data, and so forth.

Starting values of Theta cannot be less than or equal to zero. They should not be tiny (less than 0.0001), because the program will take a long time to move away from a tiny starting value and explore larger values.

This program provides Watterson and FST estimates for use as starting values. It should never be quoted as a correct calculation of Watterson or FST, because if it finds the result unsatisfactory as a starting value, it will substitute a default.

The G option allows you to hand-set all of the Thetas to the same initial value. The W option allows you to set all of them to the Watterson value. (This will cause re-computation of the Watterson value, and can take several seconds with a large data set.) The F option allows you to set all of them to the FST value. You can then fine-tune by using the numbered options to hand-set the starting Thetas of individual populations. The FST option is only available if there is more than one population.

Growth parameters: the "Growth" force

This submenu allows you to turn on and off estimation of population growth rates, and to set starting parameters.

If there is a single population in your data, LAMARC will estimate a single growth rate for it. If there are multiple populations, LAMARC will estimate one independent growth rate per population.

If we label growth as g, then the relationship between Theta at a time t > 0 in the past and Theta at the present day (t = 0) is:

Theta(t) = Theta(present day) * e^(-gt)

This means that a positive value of g represents a growing population, and a negative value, a shrinking one.

Time is measured in units of mutations (i.e., one unit of t is the average number of generations it takes one site to accumulate one mutation), and g is measured in the inverse units of time. If mu is known, multiply generations by mu to get units of t, or conversely, divide t by mu to get a number of generations.
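
In code, the relationship and the unit conversion look like this (a sketch; the mu value is a made-up example):

    import math

    def theta_at(theta_present, g, t):
        # Theta(t) = Theta(present day) * exp(-g*t), with t in mutational units
        return theta_present * math.exp(-g * t)

    mu = 1e-8                # mutations per site per generation (assumed)
    generations = 50000
    t = generations * mu     # convert generations to mutational units
    print(theta_at(0.01, g=100.0, t=t))  # ~0.0095: smaller Theta in the past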

Starting parameter input for growth is similar to that for Theta, except that no quick pairwise calculators are available; you will have to either accept default values or enter values of your own. Avoid highly negative values (less than -10) as these have some risk of producing infinitely long trees which must then be rejected.

Migration parameters and model: the "Migration" force

This submenu allows you to customize estimation of the migration rates among your populations. The rates are reported as M = m/mu, where m is the immigration rate per generation and mu is the neutral mutation rate per site per generation. Note that many other programs compute 4Nem instead; be sure to convert units before making comparisons with such results.
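
Because the mu terms cancel, the conversion needs only Theta and M; a minimal sketch (use the Theta of the population receiving the migrants):

    def four_ne_m(theta_recipient, M):
        # 4Ne*m = (4Ne*mu) * (m/mu) = Theta * M
        return theta_recipient * M

    print(four_ne_m(theta_recipient=0.01, M=250.0))  # -> 2.5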

You do not have the option to turn migration on and off; if there is only one population migration must be off, and if there is more than one population then migration must be on. (Otherwise there is no way for the genealogy to connect to a common ancestor.)

The main choice for migration is how to set the starting values for the migration parameters. You can use an FST-based estimator or hand-set the values, either hand-setting all to the same value, or setting each one individually.

The FST estimator does not always return a sensible result (for example, if there is more within-population than between-population variability), and in those cases we substitute an arbitrary default value. If you see strange FST estimates you may wish to hand-set those values. Please do not quote LAMARC as a source of reliable FST estimates, since we do not indicate which have been replaced by defaults.

The final menu entry sets the maximum number of migrations allowed in a genealogy. An otherwise normal run may occasionally propose a genealogy with a huge number of migrations. This could exhaust computer memory; in any case it would slow the analysis down horribly. Therefore, we provide a way to limit the maximum number of migrations. This limit should be set high enough that it disallows only a small proportion of genealogies, or it will produce a downward bias in the estimate of M.

If you find that you are sometimes running out of memory late in a program run that involves migration, try setting this limit a bit lower. If you find, on examining your runtime reports, that a huge number of genealogies are being dropped due to excessive events, set it a bit higher. (The "runtime reports" are the messages displayed on the screen while the Markov chains are evolving; a copy of these messages is provided at the end of each output file.) You may also want to try lower starting values if many genealogies are dropped in the early chains.

Recombination parameter: the "Recombination" force

This submenu allows you to customize estimation of the recombination rate parameter r, defined as C/mu where C is the recombination rate per site per generation and mu is the neutral mutation rate per site per generation. We do not currently allow segment-specific or population-specific recombination rates; only one value of r will be estimated.

The first menu line allows you to turn recombination estimation on and off. Estimating recombination slows the program down a great deal, but if recombination is actually occurring in your data, allowing inference of recombination will not only tell you about recombination, but may improve inference of all other parameters.

You cannot estimate recombination rate if there is only one site, and in practice you cannot estimate it unless there is some variation in your data--at least two variable sites. Your estimate will be very poor unless there are many variable sites.

The S option allows you to set a starting value of r. No pre-calculated value is available, so your choices are to set it yourself or accept an arbitrary default.

Starting values of r should be greater than zero. If you do not want to infer recombination, turn the recombination force off completely instead. If you believe that r is zero, but wish to infer it to test this belief, start with a small non-zero value such as 0.01. It is unwise to set the starting value of r greater than 1.0, because the program will probably bog down under huge numbers of recombinations as a result. A rate of 1 would indicate that recombination is as frequent as mutation, and this state of affairs cannot generally be distinguished from complete lack of linkage.

The M option sets the maximum number of recombinations allowed in a genealogy. An otherwise normal run may occasionally propose a genealogy with a huge number of recombinations. This could exhaust computer memory; in any case it would slow the analysis down horribly. Therefore, we provide a way to limit the maximum number of recombinations. This limit should be set high enough that it disallows only a small proportion of genealogies, or it will produce a downward bias in the estimate of r.

If you find that you are sometimes running out of memory late in a program run that involves recombination, try setting this limit a bit lower. If you find, on examining your runtime reports, that many genealogies are being dropped due to excessive events, set it a bit higher. (You may also want to try lower starting values if many genealogies are dropped in the early chains.)

Gamma parameter: allowing the background mutation rate to vary over regions

If you suspect that the mutation rate varies between your genomic regions, but do not know the specifics of how exactly it varies, you can turn on estimation of this force to allow for gamma-distributed rate variation. The shape parameter of the gamma ('alpha') can be estimated, or you can set it to a value you believe to be reasonable. While the gamma distribution is a convenient way to allow for different types of variation, it is unlikely that the true variation is drawn from an actual gamma distribution, so the important thing here is mainly that you allow mutation rates to vary, not necessarily which particular value is estimated for the shape parameter. For more information, see the section "Combining data with different mutation rates".

Trait Data analysis

This section provides the capability to map the location of a measured trait within one of your genomic regions. You will need to have provided trait data in your input file. For more details about trait mapping, see the mapping documentation.

The Trait Data analysis menu will show you all of the traits which you have provided data for and can attempt to map, with an indication of which genomic region each is in. To modify the model for a trait, choose that trait by number; you will be taken to a specific menu for mapping that trait. It will start by reminding you of the trait name, and then show the type of analysis you are using. The two kinds of mapping analysis are discussed in more detail in "Trait Mapping." As a brief reminder, a "float" analysis collects the trees without use of the trait data, and then finds the best fit of trait data to trees after the fact. A "jump" analysis assigns a trial location to the trait and then allows it to be reconsidered as the run progresses.

In this menu, you can also restrict the range of sites which you are considering as potential locations for your trait. For example, you may be quite sure that the trait is not located in the first 100 sites, but you still wish to analyze them because they may add useful information about Theta and recombination rate. You can remove the range 1-100 from consideration using the R option. You can also add sites to consideration using the A option: for example, if you know that your trait is either in sites 1-100 or 512-750, one approach is to remove all sites, then add these two ranges specifically.

If you have turned on a "jump" analysis, the necessary rearrangement strategies will appear in the Strategy:Rearrangement submenu. You may wish to inspect them and make sure that you like the proportion of effort used for each strategy.

Divergence parameters and model: the "Divergence" force

The only value that can be edited in Divergence is the Epoch Boundary Time (scaled by the mutation rate) of each Divergence event. You can set starting values and priors for these as usual. There are no constraints available for these parameters. If you wish to redefine the Ancestor/Descendant relationships you need to either return to the file converter or edit the input file XML.

Divergence-Migration parameters and model: the "Divergence-Migration" force

This force is presented exactly the same way that a regular migration matrix is except there are also entries for Ancestor populations. Note that even though you can potentially enter migration rates between invalid population pairs (for example an ancestor and one of its children) these will be ignored by the calculation. (Be warned that if you manage to create an XML input file with values for migration rates between invalid pairs, for example by hand-editing your XML, the program will produce confused and meaningless results.) Also note that pairwise calculators for starting values are not available for cases with divergence.

Options common to all Force submenus

Three options are available on all Forces submenus (except for Trait mapping and Divergence), and they all behave in the same fashion. Constraints allow you to hold parameters constant or constrain them to be equal to other parameters. Profiling affects the reported support intervals around the estimates (and can affect how long it takes the program to run). If you are running LAMARC in Bayesian mode, the Bayesian Priors menu allows you to set the priors for the parameters.

Constraints

Beginning with version 2.0 of LAMARC, we allow constraints on all parameters. All parameters can be unconstrained (the default, and equivalent to pre-2.0 runs), constant, or grouped. Grouped parameters all have the same starting value, and can either be constrained to be identical (and vary together as a unit), or be set constant. In addition, we allow some parameters to be set 'invalid', which means 'constant, equal to zero, and not reported on in the output'.

Say, for example, you know that the recombination rate for your population is 0.03. In this case, you can set the recombination starting value to 0.03 and set the recombination constraint to 'constant'. Or say you have a set of populations from islands in a river; you may know that all downstream migration rates will be equal to each other, and that all upstream migration rates will be equal to each other. In this case, you can put all the downstream rates together in one group, all the upstream rates together in another group, and set each group's constraint to 'identical'. If you have another set of populations and know that migration is impossible between two of them, you could set those migration rates to be 'invalid' (or simply set them constant and set the starting values to zero).

In general, a LAMARC run with constraints will be somewhat faster than one without, since fewer parameters have to be estimated. This can be particularly helpful for complex systems where you already have some information, and are interested in estimating just a few parameters. Unfortunately, constraints are not available at this time for the Epoch Boundary Time parameters.

Select 'C' to go to the Constraints sub-menu for any force. To change the constraint on a particular parameter, enter that parameter's menu index number. To group parameters, pick one of them and enter 'A' (Add a parameter), then the number of your parameter, then 'N' (for a new group). Then pick another parameter that should be grouped with the first one, enter 'A' again, the number of your new parameter, then the group number of the group you just created (probably 1). Groups are created with the automatic constraint of 'identical', meaning that they will vary, but be co-estimated. You may also set a group 'constant', which has the same effect as setting the individual parameters constant, but guarantees they will all have the same value.

Profiling

Each force's Profiling option (P) takes you to a sub-menu where you can adjust how LAMARC gives you feedback about the support intervals for your data. Setting the various profiling options is important in a likelihood run, since it is the only way to obtain confidence limits of your estimates, and can drastically affect total program time. It is less important in a Bayesian run, since the produced curvefiles have the same information, and profiling simply reports key points from those curves (and it takes essentially no time to calculate, as a result). Profiling is automatically turned on in a Bayesian run, and it doesn't make a lot of difference which type of profiling is used in that instance, so most of the discussion below will be most applicable to a likelihood run.

For each force, you can turn profiling on (A) or off (X) for all parameters of a given force, though you cannot profile any parameter you set constant. The next option (P) toggles between percentile and fixed-point profiling. Selecting this option will cause all parameters with some sort of profiling on to become either percentile or fixed. You can turn profiling on and off for individual parameters by selecting that parameter's menu index number.

Both kinds of profiling try to give some information about the shape of the likelihood (or posterior, in a Bayesian analysis) curve, including both how accurate the estimates are, and how tightly correlated estimates for the different parameters are.

Fixed-point profiling considers each parameter in turn. For a variety of values of that parameter (five times the MLE, ten times the MLE, etc.) it computes the optimal values for all other parameters, and the log likelihood value at that point. This gives some indication of the shape of the curve, but the fixed points are not always very informative. In the case of growth, some values are set to multiples of the MLE, while others are set to some generally-useful values unrelated to the MLE, such as 0.

Percentile profiling, instead of using fixed points, gives you values which the value of your parameter is X% likely to fall below. A value for Theta of 0.00323 at the .05 percentile means that the true value of Theta has only a 5% chance of being less than or equal to 0.00323, and a 95% chance of being greater than 0.00323. In a likelihood run, LAMARC will then calculate the best values for all other parameters with the first parameter fixed at that percentile. If the above example were calculated in a run estimating recombination and growth rates, LAMARC would calculate the best values for recombination and growth if Theta had been 0.00323. This gives a much nicer picture of the shape of the curve, but it is very slow. If you use percentile profiling for likelihood, expect it to take a significant proportion of your total run time.

The accuracy of the percentile profiling in a likelihood run is dependent on the likelihood surface for your data fitting a Gaussian in every dimension. When the surface is Gaussian, the percentiles for each parameter can be determined by finding the values which result in particular log likelihoods. For example, the .05 percentile is mathematically defined to be the point at which the log likelihood is exactly 1.35 less than the log likelihood for the MLE, while the .25 percentile can be found at the point where the log likelihood is exactly 0.23 less. LAMARC finds the values for which these likelihoods can be found, but cannot determine whether the actual likelihood surface for your data has a truly Gaussian shape. Hence, percentile profiling cannot be used to report absolute confidence intervals, but it is at least a step in that direction.
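
These offsets are half the corresponding chi-square quantile with one degree of freedom; a sketch of the calculation, assuming SciPy:

    from scipy.stats import chi2

    def log_likelihood_drop(percentile):
        # a lower-tail percentile p corresponds to two-sided coverage 1-2p
        coverage = 1.0 - 2.0 * percentile
        return chi2.ppf(coverage, df=1) / 2.0

    print(round(log_likelihood_drop(0.05), 2))  # -> 1.35
    print(round(log_likelihood_drop(0.25), 2))  # -> 0.23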

You may want to turn off profiling completely, or use fixed-point profiling, for exploratory runs. Percentile profiling gives the best information for final runs, but may be too slow. If you save your data to a summary file (see summary files), you can go back and change the profiling options in a subsequent run, which then won't have to recalculate the Markov chains; it will merely calculate new profiles.

If you turn off profiling, you will lose both the profile tables themselves and the approximate confidence intervals in the MLE tables. A good compromise is to set the output file verbosity to "concise", which causes LAMARC to only calculate two profiles (for percentile profiling, the 95% support intervals) instead of about 10.

Bayesian Priors

If you are running LAMARC in Bayesian mode (see the Search Strategy menu), each force will have the option to edit the Bayesian priors (B) for that force. A more detailed discussion of a Bayesian run can be found below, as well as in the tutorial.


Search Strategy

LAMARC search strategy screen

This menu allows you to fine-tune your search strategy, to get the best results from LAMARC with the minimal time. Consider tuning these settings if you are not satisfied with the performance of your searches. For advice on choosing the best settings here, see the article "Search Strategies."

The first option in the Search Strategy menu (P, 'Perform Bayesian or Likelihood analysis') toggles your setup between a likelihood run and a Bayesian run. This choice can have a profound impact on the course of your run, though hopefully both have a reasonable chance of arriving at the truth at the end. A likelihood run (the only option for versions of LAMARC earlier than 2.0) searches tree-space with a fixed set of 'driving values' per chain, and searches the resulting likelihood surface to find the best set of parameter estimates. A Bayesian run searches tree-space at the same time as it searches parameter-space, then uses its parameter-space search as a Bayesian posterior to report the best values for individual parameters. For more details about a Bayesian search with some comparisons to the likelihood search, see the Bayesian tutorial.

Bayesian priors menu

If you have elected to run a Bayesian search, you will get the option (B) to set the priors for the various forces in your data. Selecting the option will take you to a sub-menu listing all active forces and a summary of the current priors for each force. Once you select a particular force, you get the option to edit the default prior for that force (D), and a list of parameters, each of which may be selected to edit that parameter's prior.

When editing the prior for a particular parameter, you may select whether you wish to use the default prior with the D option, re-setting the current prior to the default. For all priors, you may then set three options: the shape of the prior (S), which may be linear or (natural) logarithmic, and the upper (U) and lower (L) bounds of the prior. There is currently no provision for other prior shapes.

Rearrangers menu

Selecting R from the Search Strategy menu takes you to a sub-menu where you can set the relative frequencies of the various arrangers. The main arranger in a LAMARC run is the Topology rearranger (T), the main tree rearranger. It works by selecting and breaking a branch of its current tree, then re-simulating that branch to add it back to the tree. It should almost always be set greater than the other tree rearrangers (the size and haplotype arrangers), and any decrease in its relative frequency probably requires a concomitant increase in chain length (see sampling strategy, below).

A new arranger for version 2.0 is the Tree-Size rearranger (S). This rearranger leaves the topology of the tree constant, but re-samples branch lengths root-wards below a selected branch (biased with a triangular distribution towards the root). Our initial experiments with this rearranger indicate that it is helpful in a variety of situations, but particularly helpful for runs with growth and migration. It should be used sparingly, however: we've found setting this rearranger's frequency to 1/5 that of the topology rearranger is a generally good ratio.

If your data appears to have phase-unknown sites, you will have the option to set the relative frequency of the Haplotype rearranger (H). The haplotype rearranger considers new phase assignments for a pair of haplotypes. Like the tree-size rearranger, setting this frequency to 1/5 that of the topology rearranger has been found to produce good results.

If you have chosen to do a Bayesian run, you will have the option to set the relative frequency of the Bayesian rearranger (B). This rearranger considers new parameter values on the current tree. By default, this is set to the same frequency as the topology rearranger, and this seems to be adequate for runs with a small number of variable parameters. This can be increased for runs with a larger number of parameters, but you probably don't want a relative frequency of more than 2-3 times that of the topology arranger--increase the length of your chains instead.

If you are doing trait mapping using the "jump" strategy (in which the trait alleles are assigned a chromosomal location, and this location is reconsidered during the run) two additional rearrangers become available. The Trait haplotypes rearranger (M) allows reconsideration of ambiguous trait haplotypes: for example, it can switch between DD, Dd and dD as haplotype resolutions for someone showing a dominant phenotype. The Trait Location rearranger (L) moves the trait position within the region. We have little information about the best proportions of effort to put into these two rearrangers, but the Trait Location rearranger probably needs to get at least 20% effort to function well. These arrangers are not needed in "float" mapping or in non-mapping runs and will not be shown.

Sampling Strategy (chains and replicates)

This sub-menu allows you to adjust the time LAMARC spends sampling trees. It can (and should) be adjusted to reflect whether you want an 'exploratory' run vs. a 'production' run, how complicated your parameter model is, and whether you are performing a likelihood or Bayesian run. Options germane to each of the above scenarios will be discussed in turn.

The first option (R) allows you to use replication--repeating the entire analysis of each genomic region a number of times, and consolidating the results. This is more accurate than running LAMARC several times and attempting to fuse the results by hand, because LAMARC will compute profiles over the entire likelihood surface, including replicates, instead of individually. It will, of course, increase the time taken by the current run in proportion to the number of replicates chosen (both the time spent generating chains and, typically, the time spent profiling the results). The minimum number of replicates is one, for a single run-through of LAMARC. A reasonable upper limit is 5 if your runs produce reasonably consistent results, though you may want to use a higher number to try to overcome more inconsistent runs. Replication is useful for 'production' runs more than exploratory runs, and can help both likelihood and Bayesian searches.

LAMARC's search is divided into two phases. First, the program will run "initial" chains. In a likelihood run it is useful to make these relatively numerous and short as they serve mainly to get the driving values and starting genealogy into a reasonable range. When all the initial chains are done, the program will run "final" chains. These should generally be longer, and are used to polish the final estimate. Exploratory runs can have both short initial and short (but somewhat longer) final chains, and production runs should have longer sets of both chains. Because a likelihood run is highly dependent on the driving values, you will probably need several initial chains (10 is a reasonable number), and a few final chains (the default of 2 usually works). A Bayesian run is not at all dependent on the driving values, and while you might use several initial chains for an exploratory run just to see what's happening with the estimates, you should probably simply use zero or one initial chains and one quite-long final chain to obtain your production estimates.

For both initial and final chains, there are four parameters to set. "Number of chains" determines how many chains of that type will be run. "Number of recorded genealogies" (in a likelihood run) or "Number of recorded parameter sets" (in a Bayesian run) determines how many samples will actually be used to make the parameter estimates. "Interval between recorded items" determines how many rearrangements will be performed between samples. Thus, if you ask for 100 recorded items per chain, and set the interval between them to 20, the program will perform a total of 2000 rearrangements, sampling 100 times to make the parameter estimates. The total number of samples will determine the length of your run, and can be shorter for exploratory runs but should be long enough to produce stable results for production runs. In a Bayesian run, as mentioned, you will want one long chain for your production run. If you are seeing spurious spikes in your curvefiles, you probably need to increase the sampling interval, too: because each rearrangement only changes a single parameter (and also takes time to rearrange trees), certain parameters can stay the same simply by neglect, and will end up being oversampled in the output. Increasing the sampling interval can overcome this artifact.

"Number of samples to discard" controls the burn-in phase before sampling begins. LAMARC can be biased by its initial genealogy and initial parameter values, and discarding the first several samples can help to reduce this bias. To continue with the example above, if you ask for 100 samples, an interval of 20, and 100 samples to be discarded, the program will create a total of 2100 samples, throwing away the first 100 and sampling every 20th one thereafter. In a likelihood run, you want the burn-in phase to be long enough to get away from your initial driving values, which will be longer the more complex your model is. In a Bayesian run, you also want the burn-in phase to be long enough to get away from your initial set of values and the initial genealogy. Again, this will need to be longer if you have a more complex model with lots of parameters.

Multiple simultaneous searches with heating

The last menu item on the Search Strategy menu allows you to help your search explore a wider sampling of trees by setting multiple "temperatures." A search through possible trees at a higher temperature accepts proportionally less likely trees, in the hopes that future rearrangements will find a new, well-formed tree with a higher likelihood. This approach will often rescue a search that otherwise becomes stuck in one group of trees and does not find other alternatives.

(The reason that the word "temperature" is used here may be understood by means of an analogy. Imagine, on a snowy winter day, that there are several snowmen on the lawn in front of a house, and you want to identify the tallest one; you do not want to determine the exact height, you just want to determine the tallest snowman. One way of doing this would be to raise the temperature so that all of the snowmen melt; you could then identify the tallest snowman as the one that disappears last. Using multiple "heated" Markov chains simultaneously provides smoothed-out, compressed views of the space of possible genealogy arrangements.)

To set multiple temperatures, select the M option (Multiple simultaneous searches with heating), then select S (Number of Simultaneous Searches) and enter the number of temperatures you want. You will then get a list of new menu options, and be able to set the various temperatures. For best results, temperatures should progress in value pseudo-exponentially. A typical series of temperatures might be "1 1.1 2 3 8", but different data sets might have different optimal magnitudes, from "1 2 10 20 50" to "1 1.01 1.05 1.1 1.3". Watching the Swap Rates between temperatures during the run is important for determining an optimal series here; rates should vary somewhere between 10 and 40 (the numbers given are percents). Below about 5% you are getting little return for a huge increase in computation, and above 50% the two chains are so close to each other that they are unlikely to be exploring significantly distinct areas of parameter space (a process more efficiently handled by using replicates).
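
One simple way to generate a pseudo-exponential starting ladder is a geometric progression, which you can then tune against the observed swap rates (a sketch; the ratio is a made-up example):

    def temperature_ladder(n, ratio):
        # geometric progression: 1, ratio, ratio**2, ...
        return [round(ratio ** i, 3) for i in range(n)]

    print(temperature_ladder(5, 1.7))  # -> [1.0, 1.7, 2.89, 4.913, 8.352]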

Should finding an optimal series of temperatures by hand become too difficult, or if the optimal series of temperatures varies during a run, LAMARC can be told to try to optimize the temperatures automatically, by switching from "static" to "adaptive" heating (the A option that appears if you have more than one temperature). With static heating, the temperatures you specify will be used throughout the run. With adaptive heating, the program will continually adjust the temperatures during the run to keep swapping rates between 10% and 40%. We make no guarantees that adaptive heating will be superior to static heating, but it should at least drive the values to the right magnitudes, and keep them there during the course of the run.

A second option that appears if you have multiple temperatures is the ability to set the swap interval for different temperatures (I). The default is "1", which means LAMARC picks two adjacent temperatures and checks to see if the higher-temperature chain has a better likelihood than the lower-temperature chain after each rearrangement. To check less frequently, set this value to a higher number (3 for every third rearrangement, etc.). A higher value will speed up the program incrementally, but typically does not represent a significant gain.

In general, a run will increase in length proportionally to the number of temperatures chosen, though the time spent profiling will be the same as in an unheated run.


Input and Output related tasks

LAMARC I/O screen

This menu controls almost all of the interactions between the program, the computer, and the user. You can use it to modify the names and content of the files LAMARC produces, as well as the information printed to the screen during a LAMARC run.

Verbosity of Progress Reports

This first option on the input and output menu controls the reports that appear on-screen as the program runs. Type V to toggle among the four options. NONE suppresses all output, and might be used when running LAMARC in the background, after you know what to expect. CONCISE only periodically reports about LAMARC's progress, noting that a new region has begun, or that profiling has started. NORMAL adds some real-time feedback on the program's progress and additionally guesses at completion time for each program phase (the guesses are not always completely reliable, however). VERBOSE provides the maximum amount of feedback, reporting additionally on some of the internal states of the program, including overflow/underflow warnings and the like. If something has gone wrong with your LAMARC run, having this option set to VERBOSE is your best chance at a diagnosis.

Output File Options

The next menu item sends you to a submenu where you can set various aspects of the final output file. Select O to go to this menu.

Selecting N allows you to set the name of the output report file. Please be aware that if you specify an existing file, you will overwrite (and destroy) its contents.

Selecting V allows you to toggle among the three levels of content for the output file. VERBOSE will give a very lengthy report with everything you might possibly want, including a copy of the input data (to check that the data were read correctly and are aligned). NORMAL will give a moderate report with less detail, and CONCISE will give an extremely bare-bones report with just the parameter estimates and minimal profiling. We recommend trying VERBOSE first and going to NORMAL if you know you don't need the extra information. CONCISE is mainly useful if your output file is going to be read by another computer program rather than a human being, or if speed is of the essence, since it speeds up profiling in a likelihood run roughly five-fold.

The "Profile likelihood settings" option (P) leads to a new sub-menu that lists all forces and gives you an overview of how they are going to be profiled. You can turn on (A) or off (X) profiling for all parameters from this menu, or set the type of profiling to percentile (P) or fixed (F). The other menu options take you to the force-specific profiling menus discussed above.

Name of menu-modified version of input file

The "Name of menu-modified version of input file" option (M) allows you to change the name of the automatically-generated file which will be created by LAMARC when the menu is exited ("menusettings_infile.xml", by default). This file contains all the information in the infile, but also contains any information that may have been set in the menu. If you want to repeat your run with exactly the same options that you chose from the menu this time, you can rerun using this file as your input file.

Writing and Reading Summary Files

The next two menu items on the "Input and Output Related Tasks" menu are used to enable or disable reading and writing of summary files. If summary file writing is enabled, LAMARC saves the information it calculates as it goes; this is enough to recover from a failed run, or to repeat the numerical analysis of a successful run. If a run fails while generating chains, LAMARC will take the parameter estimates from the last completed chain, use them to generate trees in a new chain, then use those trees and the original estimates to start a new chain where it had crashed before. In this scenario, LAMARC cannot produce numerically identical results to what it would have produced had the run finished, but should produce similar results for non-fragile data sets. However, if profiling had begun in the failed run, the summary files do contain enough information to produce numerically identical results, at least to a certain degree of precision.

To turn on summary file writing, select W from the menu, then X to toggle activation. The name of the summary file may be changed with the N menu option. This will produce a summary file as LAMARC runs. To then read that summary file, turn on summary file reading the next time LAMARC is run (from the same data set!) with the R option from this menu, then X to toggle activation, and finally N to set the correct name of the file. LAMARC will then begin either where the previous run stopped, or, if the previous run was successful, will start again from the profiling stage.

For particularly long runs on flaky systems, it may be necessary to both read and write summary files. If both reading and writing are on, LAMARC will read from the first file, write that information to the second file, and then proceed. If that run is then stopped, the new file may be used as input to start LAMARC further along its path than before. If this option is chosen, be sure to use different names for the input summary file and the output summary file.

If reading from a summary file, most of the options set when writing the summary file must remain the same, or unpredictable results may occur, including LAMARC mysteriously crashing or producing unreliable results. However, since all profiling occurs after reading in the summary file, options related to that may be changed freely. For example, in order to get preliminary results, you may run LAMARC with "fixed" profiling, "concise" output reports, and writing summary files on. Afterwards, if more detail is needed about the statistical support of your estimates, you may run LAMARC again, this time with summary file reading, "percentile" profiling, and "verbose" output files.

Tracer output

LAMARC will automatically write files for the utility Tracer written by Drummond and Rambaut (see the "Using Tracer with LAMARC" documentation in this package). LAMARC's Tracer output files are named [prefix]_[region]_[replicate].txt. You can turn on or off Tracer output and set the prefix here.

NEWICK tree output

If there is no migration or recombination, LAMARC can optionally write out the tree of highest data likelihood it encounters for each region, in Newick format. Options in this menu control whether such a Newick file will be written, and what its prefix will be. This option is not needed for normal use of the program, but it is sometimes interesting to see what the best tree was, and how it compares with the best tree found by phylogeny-inference programs. (Sometimes, surprisingly, LAMARC is able to outdo normal inference programs.)

Bayesian curvefiles

A Bayesian run of LAMARC will produce additional output for each region/parameter combination that details the probability density curve for that parameter. Each file can be read into a spreadsheet program (like Excel) to produce a graphic of your output. If you decide you don't have enough disk space for these files, or don't want them for some other reason, you can turn off this feature by toggling the U option ('Write Bayesian results to individual files'). You can change the prefix given to all of these curvefiles with the C option ('Prefix to use for all Bayesian curvefiles'). Curvefile names are of the format [prefix]_[region]_[parameter].txt, where '[prefix]' is the option settable here, '[region]' is the region number, and '[parameter]' is the parameter in question. More details about Bayesian curvefiles are available in the Bayesian tutorial.

Recombination location output

In runs modeling recombination, LAMARC can dump a file listing each recombination location in every sampled tree in the last final chain. A recombination between data positions -13 and -12 is recorded as -13; one between 340 and 341 is recorded as 340. You can read the file into R or another statistical computing tool, and plot a histogram to see where recombinations are most often accepted. Keep in mind that there is a slight bias toward accepting recombinations near the ends of the input sequences, since less data is available there to demonstrate that a proposed recombination is unsupported.

These files are named [prefix]_[region]_[replicate].txt. You can turn this output on or off and set the prefix here. (The default prefix is 'reclocfile'.) This option is off by default, since it can produce a large amount of output.
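
A sketch of reading such a file and plotting the histogram in Python rather than R (this assumes matplotlib is installed and that the file holds one whitespace-separated location per entry; check your own file's layout):

    import matplotlib.pyplot as plt

    with open("reclocfile_1_1.txt") as f:   # [prefix]_[region]_[replicate].txt
        locations = [int(tok) for tok in f.read().split()]

    plt.hist(locations, bins=100)
    plt.xlabel("recombination location (site)")
    plt.ylabel("count across sampled trees")
    plt.show()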


Show current settings

LAMARC overview screen

This menu option provides reports on all current settings, so that you can see what you've done before starting the program. You cannot change the settings here, but each display will indicate which menu should be used to change the displayed settings.
