LAMARC Documentation: Parallelizing

Parallelizing LAMARC

LAMARC does not come in a parallel version. However, if you wish, you can do 'poor man's parallelization' by running subsets of your analysis on different machines and then consolidating the results. This is done by making separate input files asking LAMARC to perform a different set of independent calculations and output a series of summary files. Those summary files can then be concatenated together for a final LAMARC run that will perform the overall analysis.

You can parallelize LAMARC by spreading out the calculations for different genetic regions over different computers (one per genetic region), and/or by spreading the calculation of different 'replicates' over different computers.

This is complex enough we highly recommend you keep good notes since it is easy to get confused. Check lists and careful naming conventions help, but keeping track of a bunch of asynchronous software jobs on multiple machines is inherently difficult.

There are three steps to the process

Scripts to run "Poor Man's Parallelization"

We have released a set of Python scripts in the scripts/pyParallel directory of the LAMARC distribution. They can be used to perform the steps below. However, since this is a difficult topic and the scripts are not thoroughly tested, you may wish to read the information in the sections below.

To run the scripts, do the following steps from a terminal/command window:

prepare a single LAMARC file describing all regions and setting the full number of replicates you want. We'll assume that file is called infile.xml
put infile.xml in its own, otherwise empty directory
from that directory, issue the command: python <path/to/>pyParallel/divide_data.py -l infile.xml -o divdir

The program will produce output telling you which files to run LAMARC on next and which python program to run after the LAMARC runs are completed. Depending if you have both multiple regions and multiple replicates, there may be a second such step of multiple LAMARC runs followed by a python command. The last step is always a single LAMARC run to produce a final file.

Dividing up your data

Regions:

If you have more than one genetic region, the first step is to divide up your data into its component regions. You can do this by modifying your original data files and feeding them into the converter one set at a time, or you can, within the converter, delete the extra regions. In the end, you will want one LAMARC input file for each of your regions.

Do note that different genetic regions can take different amounts of time for LAMARC to analyze, depending on the amount of data in each. If one of your regions is a 100 kbase stretch of DNA, and another region is a single microsatellite segment, the DNA is going to take longer to analyze than the microsat. You should almost always be running preliminary LAMARC runs on your data anyway so you know what type and amount of analysis best suits your data, and these amounts will be different for every region. In any event, plan to put the regions that will take a long time on your faster computers, and your less-data-rich regions on the slower computers.

You must also be sure that the same number of chains are being used for each analysis (i.e. 10 initial chains and 2 final chains).

Replicates:

When running multiple replicates on different machines, the same input files can be used with two VERY IMPORTANT caveats.

First, if your parallel machines have shared file space, you will overwrite your output and summary files if run from the same directory. This is a bad idea anyway, because you will want to investigate runs from months ago sometimes. It's best to run every run in its own file space, for example snail/p2/s3/recomb which is the run of snail data which has 2 populations, SNP segment 3 and has recombination turned on. It's a bit of a hassle to type, but it pays when you discover 6 months from now that something very strange happened in that run.
Second, and this is particularly important: you must be sure that your parallel runs have different random number seeds. There are two ways of setting the random number seed and both have their pitfalls, here:
- If you set the random number seed with the <seed> tag in the input file, you must use different input files with different seeds for each replicate. What's more, the numbers must be more that four apart from each other. LAMARC uses as a seed the closest integer of the form 4n+1 where n is an integer, so, for example, 1000, 1001, 1002, and 1003 will all be 'rounded' to 1001 before being used as a seed.
- Conversely, if there is no <seed> tag in the input file, LAMARC will use the system clock, so you can use the same input file for each replicate. However, runs must be started at least 5 seconds apart from each other if you do this. LAMARC will use the current time to the second to get an integer, then again 'round' that integer to the closest integer of the form 4n+1. Effectively, that means that every four-second window of time gets a single random number seed. So, if you have a script that starts LAMARC runs on different computers, any it starts up in the same 4-second window will be computationally identical to each other, and therefore useless as separate analyses. Occasionally, different architectures will produce different results with LAMARC even with the same seed, but it is, shall we say, unwise to rely on this behavior.

Running the programs and collecting the data.

In order to get LAMARC to do a joint analysis over all replicates and/or regions, it must be given a set of trees and the numbers used to obtain those trees. Each run, therefore, must write this information out by using summary files.

Summary file writing can either be turned on from the menu (I for Input and Output related tasks, then W for Writing of tree summary file), or from the XML, using the <use-out-summary> tag (set true) and the <out-summary-file> tag (for the name of the file) (see the menu section of the documentation). Be sure to save each summary file with a different name, or at least within a different directory, so you can tell where it came from later. Nothing within the summary files will enable you to distinguish them later, apart from the different values they contain.

Don't forget that if you are doing multiple replicates, each region needs the same number of replicates.

Also, since the purpose of this exercise is speed, be sure to turn off profiling for all your parameters for these individual runs. You can turn on profiling again in your final LAMARC run, but having it on here just wastes time--this information is not saved to the output summary files.

When you have run these analyses, you should have summary files for all of your LAMARC runs.

Concatenating your data and running the joint analysis

You now need to engage in a bit of chicanery. You must fool LAMARC into thinking that it wants to run one huge analysis with all of your regions and replicates, then create a summary file that pretends to be the result of that analysis. If you feed that summary file to LAMARC, it will then see that the only analysis that is left to be done is the joint analysis, and skip to that step.

Combining over replicates

First, if you have used multiple replicates from separate runs, you must create a new input file for each region that claims to want to run the number of replicates you have already created. This is as simple as copying any of the input files used to create the individual replicates, and changing the value of the <replicates> tag (the random number seed does not matter for this step). Then you must create one summary file out of your collected set of summary files. This requires a bit of editing, but not much.

Each summary file will have one <number> tag for each chain in that analysis, numbered 0 0 0, 0 0 1, 0 0 2, etc. The first number is for the region, the second for the replicate, and the third for the chain. For now, regardless of which genomic region you are dealing with, just change the replicate number. So, the first summary file you can leave as-is, the second summary file you change to 0 1 0, 0 1 1, 0 1 2, etc., the third you change to 0 2 0, etc., and so on. In addition, there will be a <reg_rep> tag at the beginning of the <chainsum> section, which will read '0 0' at first. Again, the first number is for the region, and the second number is for the replicate. You should change the second number (corresponding to the replicate number) to match. Finally, delete the <XML-summary-file> tag from the beginning of all files but the first, the </XML-summary-file> tag from the end of all files but the last, and the <end-region> </end-region> pair of tags from the end of all files.

Then, concatenate your replicate summary files, in order. Nothing need be interleaved; each file should simply follow the next, in whatever order you gave them replicate numbers.

When you're done, you should have one large summary file which should match the format of insumfile.3rep.xml (actual xml is here) , though it should be much much larger (the example file has only two trees per replicate).

Now you need to run LAMARC again. Use your new input file, telling it to use the summary file you have created. If you are doing this step on your way to concatenating your data from multiple genomic regions, you must again turn on summary file writing. LAMARC will read in your data, and should skip right to the multi-replicate analysis. If it does not, something has gone awry, and you should check your input summary file. One trick is to compare the new output summary file to your input summary file, and see where they diverge--that point is presumably the source of your error.

If all goes well, you should now have a new output summary file that is identical to your input summary file with the exception that it now has a <replicate-summary> tag in it containing the estimates of your various parameters, along with the reported maximum likelihood. These differences can be seen in outsumfile.3rep.xml (actual xml is here). If you had more than one region, you must then repeat this process for each of your regions.

Combining over regions

Now you need to repeat essentially the same process to combine your regions together to get one overall parameter estimate. First, you need to go create one input file that wants to do a full multi-region and multi-replicate (if necessary) analysis. You'll need to go back to the converter for this, and feed it all of your data this time, instead of just the data from one genomic region at a time.

Next, take the resulting LAMARC input file and either in the XML or in the menu, you need to re-set up the correct number of chains you used for your individual analyses, as well as the correct number of replicates. You also need to tell it to read from a summary file, which we now need to create. Finally, if you want error analysis, turn on profiling in this step (and not before, since the profiling information is not saved).

To combine your region-based summary files, you'll need to again change the <number> tags to correctly match the region number in your data. Each file should start with a set of <number> tags containing 0 0 0, then 0 0 1, 0 0 2, etc. If you have multiple replicates, much further down you will find a new set of tags containing 0 1 0, 0 1 1, 0 1 2, etc. Again, the first number is the region, the second number is the replicate, and the third number is the chain. Leave these tags alone for your first region (the numbering starts at zero), but for each subsequent region, you'll need to increase the first number: 1 0 0 , 1 0 1, 1 0 2, etc. (with 1 1 0, 1 1 1, 1 1 2, etc. following if you have multiple replicates). Also, each replicate will have a <reg_rep> tag, containing 0 0, 0 1, 0 2, etc. (If you didn't do replicates, that still counts as 'one replicate', numbered 0 0). The first number in that tag (corresponding to the region) also needs to be changed to the appropriate region number (1 0, 1 1, 1 2, etc. for the summary file for your second region if it has multiple replicates, or just the first "1 0" if there are none.

Finally, you need to delete the initial <XML-summary-file> tag from all files but the first, and the final </XML-summary-file> from all files but the last. Do not delete any <end-region> </end-region> tag pairs you might have at this point (which you will have if you did not use replicates), and certainly do not delete the information in the <replicate-summary> tags.

Now, concatenate all your files, in order. In other words, make one big file that starts off with region '0', has region '1' next, and so on. Save that as a unique input summary file, and tell your LAMARC input file to read from that.

Once again, you're ready to run LAMARC, this time for the final time. If all goes well, it will cruise through the data, pausing at the end of each replicate to recalculate some internal modifiers (the Geyer weights, if you know what those are) as well as the profiles (if you have turned those on again), and then at the very end, it will calculate a summary over regions. Again, if you have LAMARC write to a summary file, that file should be identical to your cobbled-together version, with the exception that at the very end, you will get a <region-summary> tag with your final estimates therein. And you're done!

(Previous | Contents | Next)