LAMARC Documentation: Haplotypes/Diplotypes

Using genotypic data

The LAMARC program is not a tool to infer haplotypes in the sense of telling you what the best haplotype resolutions would be. However, it can make estimates using genotypic data with unknown phase, by including in its search space many different resolutions of the haplotypes.

This process is not slow in itself, but it makes the search space enormously larger, so you will need to run many more steps. You will also almost surely need to use heating; otherwise the search tends to get "stuck" as soon as it finds a reasonably compatible tree and set of haplotypes.

If you can get haplotypes by family studies or experimental methods, they will produce more accurate results than LAMARC's process. However, if you can only get haplotypes from haplotype-inference software (for example, a program based on the EM algorithm), it may be better, if you have time, to use LAMARC's process instead; using the inferred haplotypes as a starting point will speed up the search. No matter how good the haplotype-inference program, it will introduce some bias by using only the "best" haplotypes; if these haplotypes are not fully correct, they will tend to indicate less recombination than has actually occurred.

Input

The XML format for phase-unknown data includes an extra tag within each individual, <phase>. This tag has an obligatory attribute, "type", whose value must be either "known" or "unknown". In the "known" case, the phase tag encloses a list of all sites whose phase is known (and therefore need not be reconsidered during the run). In the "unknown" case, the phase tag encloses a list of all sites whose phase is not known (and therefore must be reconsidered). Use whichever is more convenient for your data.

If all sites are known, the phase tag can be omitted, or an empty <phase type="unknown"> tag can be used. If all sites are unknown, an empty <phase type="known"> tag can be used.

Only one <phase> tag is allowed per individual.

You do not need to exclude sites which are homozygous, as they will not be considered for phase inference anyway; but if any heterozygous sites are phase-known, it is much better to indicate them as such than to let them be unnecessarily reconsidered.
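The choice between a "known" list and an "unknown" list can be sketched in a few lines of Python. This is only an illustration, not part of LAMARC: the function names and the site numbers are hypothetical, and it assumes that the list inside <phase> is written as whitespace-separated site numbers. Check the XML input documentation for the numbering convention your data file uses.

```python
import xml.etree.ElementTree as ET


def phase_element(heterozygous_sites, phase_known_sites):
    """Build a <phase> element for one individual (hypothetical helper).

    Lists whichever set of heterozygous sites is shorter: the phase-known
    ones (type="known") or the phase-unknown ones (type="unknown").
    Homozygous sites are left out; they are not considered for phase anyway.
    """
    het = set(heterozygous_sites)
    known = set(phase_known_sites) & het
    unknown = het - known
    # On a tie, prefer "unknown", which names the sites to be reconsidered.
    if len(unknown) <= len(known):
        kind, sites = "unknown", unknown
    else:
        kind, sites = "known", known
    elem = ET.Element("phase", {"type": kind})
    # Assumption: sites are written as whitespace-separated numbers.
    elem.text = " ".join(str(site) for site in sorted(sites))
    return elem


def phase_xml(heterozygous_sites, phase_known_sites):
    """Return the <phase> element as a string."""
    elem = phase_element(heterozygous_sites, phase_known_sites)
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical individual: heterozygous at four sites, three of them phased.
    print(phase_xml([3, 7, 12, 20], [3, 7, 12]))
```

With these made-up inputs only one heterozygous site is unphased, so the sketch emits a short "unknown" list rather than a longer "known" one.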

To indicate that a haplotype-reconsideration arrangement strategy should be used, add a <haplotyping> tag to the <strategy> section, and indicate for each strategy what proportion of the time it should be used. For example, here is a valid strategy block that will spend 20% of its effort on haplotype reconsideration and 80% on tree reconsideration:

 <strategy>
    <resimulating> 0.8 </resimulating>
    <haplotyping> 0.2 </haplotyping>
 </strategy>

We have used 20% haplotyping with fair success, but you may want to experiment. Note that the resimulating strategy is required; 100% effort on haplotyping is not a valid option.
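If you generate input files with a script, a small check like the following can catch a strategy block that leaves out resimulation. It is a sketch in Python, not LAMARC code: the function name is hypothetical, and dividing by the total is only a convenience for reporting shares, not a statement about how LAMARC itself treats the values.

```python
import xml.etree.ElementTree as ET

# The strategy block from the example above.
STRATEGY = """
<strategy>
    <resimulating> 0.8 </resimulating>
    <haplotyping> 0.2 </haplotyping>
</strategy>
"""


def arranger_shares(strategy_xml):
    """Return each arranger's share of effort (hypothetical helper).

    Raises ValueError if resimulation is missing or zero, since 100%
    effort on haplotyping is not a valid option.
    """
    root = ET.fromstring(strategy_xml)
    # float() ignores the whitespace surrounding each number.
    weights = {child.tag: float(child.text) for child in root}
    if weights.get("resimulating", 0.0) <= 0.0:
        raise ValueError("the resimulating strategy is required")
    total = sum(weights.values())
    return {tag: weight / total for tag, weight in weights.items()}


if __name__ == "__main__":
    for tag, share in arranger_shares(STRATEGY).items():
        print(f"{tag}: {share:.0%}")
```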

Menu

To turn haplotype reconsideration on and off, use the Search Strategies menu, Rearrangement submenu. If haplotyping is off, selecting its entry will turn it on, and you will be prompted for the frequency of haplotype reconsideration desired (for example, 0.2 for 20% effort into reconsideration). If haplotype reconsideration is on, selecting its entry will turn it off.

Note that these options will have no effect unless some phase-unknown sites are present in the data set. Haplotype reconsideration is automatically disabled for any region which lacks phase-unknown sites.

Search strategy

Every step spent reconsidering the haplotypes is a step not spent reconsidering the tree. If you put 20% effort into haplotypes, only 80% of the steps rearrange the tree, so you will need 25% more steps (1/0.8 = 1.25 times as many) just to get the same tree coverage. It will probably be safer to double the number of steps so as to allow for the increased search space.
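The arithmetic behind the minimum step count is easy to get wrong, so here is a short Python sketch of it. The function name is hypothetical and the 10000-step figure is only an example. If a fraction f of the effort goes to haplotypes, a fraction 1 - f goes to the tree, so matching a run of N tree steps takes N / (1 - f) steps in total.

```python
import math
from fractions import Fraction


def total_steps_for_tree_coverage(tree_steps, haplotype_effort):
    """Total steps needed so that tree rearrangement still gets tree_steps.

    haplotype_effort is the proportion given to haplotyping, e.g. 0.2.
    Exact fractions are used so that rounding error cannot add a step.
    """
    effort = Fraction(str(haplotype_effort))
    if not 0 <= effort < 1:
        raise ValueError("haplotype effort must be at least 0 and below 1")
    return math.ceil(tree_steps / (1 - effort))


if __name__ == "__main__":
    # Example: a run that previously used 10000 steps with no haplotyping.
    print(total_steps_for_tree_coverage(10000, 0.2))
```

This gives only the break-even figure for tree coverage. It makes no allowance for the larger search space, which is why doubling the number of steps is the safer choice.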

Haplotype steps are quicker than tree-resimulation steps, so you can afford to be fairly generous with the number of haplotype steps. However, we don't recommend putting more than 50% effort into haplotype reconsideration, because this may end up optimizing the haplotypes of a very suboptimal tree.

Heating is almost essential for genotypic data unless very few sites are phase-unknown. Signs that the haplotype sampler is "stuck" include estimates that don't move far from their starting value, or estimates that are nearly constant within one run but vary wildly between runs. Stuck haplotyping runs normally overestimate Theta because they are using sub-optimal haplotypes. Unfortunately the error bars in such cases may exclude the truth.
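One rough way to look for the second symptom is to compare the spread of estimates within each run to the spread between runs. The Python sketch below is our own illustration, not a LAMARC feature: the function name is hypothetical, the Theta values are made up for the example, and no particular ratio is an established cutoff.

```python
from statistics import mean, pstdev


def spread_summary(runs):
    """Return (within-run spread, between-run spread) for Theta estimates.

    runs is a list of runs, each a list of estimates from that run.
    Within-run spread is the average standard deviation inside a run;
    between-run spread is the standard deviation of the run means.
    """
    within = mean(pstdev(run) for run in runs)
    between = pstdev([mean(run) for run in runs])
    return within, between


if __name__ == "__main__":
    # Made-up estimates: steady inside each run, far apart across runs.
    runs = [
        [0.010, 0.010, 0.011],
        [0.050, 0.051, 0.050],
        [0.002, 0.002, 0.002],
    ]
    within, between = spread_summary(runs)
    print(f"within-run spread:  {within:.5f}")
    print(f"between-run spread: {between:.5f}")
```

A between-run spread far larger than the within-run spread matches the pattern described above and suggests trying heating or longer runs.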

If you have access to a haplotype-inference program, using its inferred haplotypes as a starting point may produce better estimates more quickly. We prefer this to simply accepting the externally inferred haplotypes as correct. While they may be nearly correct, they will err in a specific direction (corresponding to a downward bias in Theta) because they are "too perfect" compared to the truth.

Evaluating the results

When haplotype reconsideration is in effect, some extra information will be printed in the runtime report and output report to let you assess whether it is working well. (The "runtime report" is the set of messages that LAMARC displays while it is running. If you select "verbose" output, these messages will be appended to your output file.) Acceptance rates are given for each type of "arranger" (rearrangement strategy) separately, so you might see:

Tree-Arranger accepted            149/1591 proposals
Haplotype-Arranger accepted       277/409 proposals

This means that the resimulating arranger is accepting fewer than 10% of its proposed trees, whereas the haplotype-reconsidering arranger is accepting about two-thirds of its proposals. If one of the arrangers is accepting few or no proposals, your search is probably not working even if the overall acceptance rate is satisfactory; at the very least, you will need to run the program for a long time to get adequate searching.
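If you keep runtime reports from many runs, the per-arranger rates can be pulled out with a short script. The Python sketch below assumes the lines look exactly like the two shown above; the pattern and the function name are our own, and a real report may need a looser pattern.

```python
import re

# The two report lines from the example above.
REPORT = """\
Tree-Arranger accepted            149/1591 proposals
Haplotype-Arranger accepted       277/409 proposals
"""

# Assumed line shape: "<name> accepted <accepted>/<proposed> proposals".
LINE = re.compile(r"^(\S+) accepted\s+(\d+)/(\d+) proposals$", re.MULTILINE)


def acceptance_rates(report):
    """Map each arranger name to its acceptance rate (accepted / proposed)."""
    rates = {}
    for name, accepted, proposed in LINE.findall(report):
        if int(proposed) > 0:  # skip arrangers that made no proposals
            rates[name] = int(accepted) / int(proposed)
    return rates


if __name__ == "__main__":
    for name, rate in acceptance_rates(REPORT).items():
        print(f"{name}: {rate:.1%}")
```

For the example lines this reports a rate a little under 10% for the tree arranger and about 68% for the haplotype arranger.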

If multiple temperatures are in use, the reported acceptance rates are for the coldest temperature only.
