LAMARC Documentation: Data file conversion

Converter Command File Reference

Converter Command File Introduction

The converter command file is an XML-format text file that can be used to bypass the converter GUI and provide information directly to the converter.

When to use a Converter Command File

For most LAMARC users, running the lamarc file converter in GUI mode will be the quickest and most intuitive way to convert data files for use in LAMARC. However, there are a few situations in which it may be necessary to write a converter command file. These situations include:

automating conversion for use in simulation studies,
using a new converter feature for which there is not yet GUI support, and
reading in information that is tedious and error-prone to enter by hand (such as locations for SNP data).

If a command file is needed to access a particular feature, it can be read into the converter either in batch mode or from the GUI.

An Example Converter Command File

An example converter command file with matching MIGRATE data files is provided in the batch_converter/ directory. The file sample-conv-cmd.xml (actual xml is here) is annotated with comments and should be a good guide to what's going on.

How to Create a Converter Command File

The simplest way to create your own file is probably a combination of:

copying the provided example,
preparing an example in the GUI and then using the File > Write Batch Command File menu command, and
editing a final version based on the above two items.

The rest of this section is provided as a reference in case copying from examples is not sufficient for your needs.

How to Use a Converter Command File

You can use your converter command file by:

Reading it in from the GUI with the "File > Read Command File" menu item, or
Providing it with the -c command line argument to the converter, in either GUI or batch mode.

Command File Overview

The top level tag of the file is a <lamarc-converter-cmd> tag. Its possible immediate children are listed in the table below. Note that none of these child tags are required. This is because, generally speaking, fragments of complete converter command files are allowed to be read in from the GUI.

Top Level Tags in Lamarc Converter Command File
parent tag	child tag	child required	child instances allowed
<lamarc-converter-cmd>	<traits>	optional	SINGLE
	<regions>	optional	SINGLE
	<populations>	optional	SINGLE
	<individuals>	optional	SINGLE
	<panels>	optional	SINGLE
	<infiles>	optional	SINGLE
	<outfile>	optional	SINGLE
	<lamarc-header-comment>	optional	SINGLE
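
As a rough sketch of the layout described in the table above, a command file might be organized as follows. The element order simply follows the table; whether the converter requires a particular order is not stated here. The XML comments are placeholders, not valid content: any element you keep must be filled in as described in its own section, and any element you do not need can be dropped.

```xml
<lamarc-converter-cmd>
    <!-- Every child below is optional and may appear at most once. -->
    <traits>
        <!-- trait-info and phenotype definitions go here -->
    </traits>
    <regions>
        <!-- one or more region definitions go here -->
    </regions>
    <populations>
        <!-- one or more population names go here -->
    </populations>
    <individuals>
        <!-- individual definitions go here -->
    </individuals>
    <panels>
        <!-- panel definitions go here -->
    </panels>
    <infiles>
        <!-- one or more infile definitions go here -->
    </infiles>
    <outfile>hypothetical-output-name.xml</outfile>
    <lamarc-header-comment>hypothetical comment text</lamarc-header-comment>
</lamarc-converter-cmd>
```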

Traits

The <traits> tag is used only for trait mapping. If you are not mapping traits, you may skip ahead to the regions section.

The <traits> tag contains definitions of one or more of the following objects.

<trait-info>, used to specify a trait name and associated alleles, and
<phenotype> definitions, used to specify a model for an observed trait manifestation.

Below is a table describing the relevant XML tags. You can also find an example trait-info definition and examples of phenotype definitions in the section on trait mapping.

Table of Sub-Tags of <traits>

Tags Describing Traits in Lamarc Converter Command File
parent tag	child tag	child required	child instances allowed
<traits>	<trait-info>	optional	multiple
<traits>	<phenotype>	optional	multiple
<trait-info>	<name>	REQUIRED	SINGLE
<trait-info>	<allele>	REQUIRED	multiple
<phenotype>	<name>	REQUIRED	SINGLE
<phenotype>	<genotype-resolutions>	REQUIRED	multiple
<genotype-resolutions>	<trait-name>	REQUIRED	SINGLE
<genotype-resolutions>	<haplotypes>	REQUIRED	multiple
<haplotypes>	<alleles>	REQUIRED	SINGLE
<haplotypes>	<penetrance>	REQUIRED	SINGLE
tag	contents
<allele>	unique name; should not contain spaces
<alleles>	ordered list of names (from <allele> tags of corresponding trait), separated by spaces
<penetrance>	value between 0 and 1; indicates the chance that an individual with these specific alleles will display the enclosing trait
<name>	unique name; should not contain spaces
<trait-name>	unique name; should not contain spaces
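
To illustrate how the tags in these tables nest, here is a minimal sketch of a <traits> section. The trait, allele, and phenotype names and the penetrance values are all hypothetical, and the sketch assumes diploid data (two alleles per <alleles> entry). The examples in the trait mapping section remain the authoritative reference.

```xml
<traits>
    <!-- Hypothetical trait with two alleles; names contain no spaces. -->
    <trait-info>
        <name>wing-spot</name>
        <allele>plain</allele>
        <allele>spotted</allele>
    </trait-info>
    <!-- Hypothetical phenotype: each haplotypes entry pairs an ordered
         allele list with the chance of showing this phenotype. -->
    <phenotype>
        <name>has-spots</name>
        <genotype-resolutions>
            <trait-name>wing-spot</trait-name>
            <haplotypes>
                <alleles>spotted spotted</alleles>
                <penetrance>1.0</penetrance>
            </haplotypes>
            <haplotypes>
                <alleles>plain spotted</alleles>
                <penetrance>0.5</penetrance>
            </haplotypes>
        </genotype-resolutions>
    </phenotype>
</traits>
```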

Tags Specifying Inheritance and Mutation Models: <regions> and <segments>

For background on these tags, see the section Modeling Linkage Properties and Relative Mutation Rates of Your Data of the documentation.

Regions

Specifying Inheritance Relationships
parent tag	child tag	child required	child instances allowed
<regions>	<region>	REQUIRED	multiple
<region>	<name>	REQUIRED	SINGLE
	<effective-popsize>	optional	SINGLE
	<segments>	optional	SINGLE
	<trait-location>	optional	multiple
<trait-location>	<trait-name>	REQUIRED for mapping, optional otherwise	SINGLE
tag	contents
<effective-popsize>	value greater than 0; defaults to 1; the relative effective population size of samples from this region.
<trait-name>	unique name; should not contain spaces

Segments

Specifying Properties of Data Samples
parent tag	child tag or attribute	child required	child instances allowed
<segments>	<segment>	REQUIRED	multiple
<segment>	datatype	REQUIRED	-
	marker-proximity	optional	-
	<name>	REQUIRED	SINGLE
	<markers>	REQUIRED	SINGLE
	<map-position>	optional	SINGLE
	<length>	optional	SINGLE
	<locations>	optional	SINGLE
	<first-position-scanned>	optional	SINGLE
	<unresolved-markers>	optional	SINGLE
tag	contents
<markers>	number of sites with data; for dna this is the number of sites sequenced; for snp data it is the number of snps; for kallele and microsat data it is the number of distinct sites at which kallele/msat data was collected.
<map-position>	location of <first-position-scanned> in region-wide coordinates
<length>	total number of bases searched for data
<locations>	the location of each particular data site of your data in segment coordinates
<first-position-scanned>	the location of the first sampled location in your data in segment coordinates
attribute	value	meaning
datatype	dna	DNA data
	snp	SNP data
	kallele	k-allele data
	microsat	microsatellite data
marker-proximity	linked	individual data markers likely to be inherited together
marker-proximity	unlinked	individual data markers are independently inherited
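
The two tables above can be combined into a single <regions> block. The following is a minimal sketch with one region holding one SNP segment. The region and segment names, the map position, and the length are hypothetical; the seven marker positions are borrowed from the SNP example discussed later in this document. The sketch assumes that <locations> takes a space-separated list, and it leaves out the optional marker-proximity attribute and <trait-location> tag.

```xml
<regions>
    <region>
        <name>chrom2</name>
        <!-- Relative effective population size; 1 is the default. -->
        <effective-popsize>1</effective-popsize>
        <segments>
            <segment datatype="snp">
                <name>chrom2-snps</name>
                <!-- Seven SNPs, one per entry in locations below. -->
                <markers>7</markers>
                <!-- Position of first-position-scanned in region-wide
                     coordinates (hypothetical value). -->
                <map-position>1000</map-position>
                <!-- Total number of bases searched (hypothetical value). -->
                <length>250</length>
                <!-- Marker positions in segment coordinates. -->
                <locations>13 19 35 77 102 112 204</locations>
                <first-position-scanned>1</first-position-scanned>
            </segment>
        </segments>
    </region>
</regions>
```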

Populations

The <populations> tag is used to name distinct populations. If your data files have named populations, the population names here should match the names that are in your files.

Specifying population names with the <populations> tag
parent tag	child tag	child required	child instances allowed
<populations>	<population>	REQUIRED	multiple
tag	contents
<population>	a name unique among all populations, regions, and segments
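
For instance, a data set with two populations might be declared as in the sketch below. The names North and South are hypothetical; if your data files name their populations, use those names instead.

```xml
<populations>
    <population>North</population>
    <population>South</population>
</populations>
```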

Data files

The <infiles> tag will tell the converter where to find your data, and how to associate each file with the previously-defined regions, segments, and populations.

Tags Describing Input Files in Lamarc Converter Command File
parent tag	child tag or attribute	child required	child instances allowed
<infiles>	<infile>	REQUIRED	multiple
<infile>	datatype	REQUIRED	-
	format	optional	-
	sequence-alignment	optional	-
	<name>	REQUIRED	SINGLE
	<segments-matching>	REQUIRED	SINGLE
	<population-matching>	optional	SINGLE
	<individuals-from-samples>	optional	SINGLE
<individuals-from-samples>	type	REQUIRED	-
<population-matching>	type	REQUIRED	-
<population-matching>	<population-name>	depends on value of type attribute	multiple
<segments-matching>	type	REQUIRED	-
<segments-matching>	<segment-name>	depends on value of type attribute	multiple
tag	contents
<individuals-from-samples>	the number of adjacent samples to bundle into a single individual
attribute	value	meaning
datatype	dna	DNA data
	snp	SNP data
	kallele	k-allele data
	microsat	microsatellite data
format	migrate	input file is a migrate file
format	phylip	input file is a phylip file
sequence-alignment	interleaved	the first line of each sequence appears, followed by all second lines, then all third lines, etc.
sequence-alignment	sequential	each entire sequence appears in the file before the next one starts.
type for <individuals-from-samples>	byAdjacency	bundle adjacent samples into individuals
type for <population-matching>	byList	Each population referred to in the file is to be assigned to a particular population defined in this file. If this type is used, sub-tags of the type <population-name> should be used to identify those populations (each should have a name that matches a population defined in the <populations> tag, above).
	byName	The file itself contains information about which populations the data refer to. These names must match the names given in the <populations> tag, above.
	single	All individuals in the file are to be assigned to a single population. That population must then be identified by a <population-name> sub-tag.
type for <segments-matching>	byList	Each segment referred to in the file is to be assigned to a particular segment defined in this file. If this type is used, sub-tags of the type <segment-name> should be used to identify those segments (each should have a name that matches a defined segment).
type for <segments-matching>	single	All individuals in the file are to be assigned to a single segment. That segment must then be identified by a <segment-name> sub-tag.
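
Putting the attributes and sub-tags above together, a single input file entry might look like the following sketch. The file name, segment name, and population names are hypothetical and would need to match your own data and the definitions elsewhere in the command file. The sketch uses the <population-matching> tag name from the attribute rows, and it assumes that the <population-name> entries under byList are given in the order the populations appear in the data file.

```xml
<infiles>
    <infile datatype="snp" format="migrate" sequence-alignment="sequential">
        <!-- Hypothetical data file name. -->
        <name>chrom2.mig</name>
        <!-- All data in this file belong to one previously defined segment. -->
        <segments-matching type="single">
            <segment-name>chrom2-snps</segment-name>
        </segments-matching>
        <!-- Populations in the file are mapped to defined populations. -->
        <population-matching type="byList">
            <population-name>North</population-name>
            <population-name>South</population-name>
        </population-matching>
        <!-- Bundle every two adjacent samples into one individual. -->
        <individuals-from-samples type="byAdjacency">2</individuals-from-samples>
    </infile>
</infiles>
```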

Specifying the Name of the Produced Lamarc file

The <outfile> tag specifies the name of the file that you want the converter to produce.

Tags Describing Output Files in Lamarc Converter Command File
tag	contents
<outfile>	name of outfile to produce; defaults to `infile.xml`

Miscellaneous Tags

Miscellaneous Tags in Lamarc Converter Command File
tag	contents
<lamarc-header-comment>	text of comment to be inserted in lamarc file
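
Both of these tags hold plain text, so a fragment setting them could be as small as the sketch below. The file name and the comment text are hypothetical.

```xml
<lamarc-converter-cmd>
    <!-- Name of the LAMARC input file the converter should write. -->
    <outfile>my-lamarc-infile.xml</outfile>
    <!-- Free text copied into the produced file as a comment. -->
    <lamarc-header-comment>Converted from hypothetical MIGRATE files</lamarc-header-comment>
</lamarc-converter-cmd>
```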

Specifying Relationships Between Individuals and Data Samples

For most LAMARC analyses, it is not necessary to specify which pairs (or more) of data sequences belong to the same individual. However, there are a few cases where it may be necessary, including:

Trait mapping, since traits are observed for individuals.
When haplotypes are incompletely resolved from individuals.
When combining nucleotide data (defined by sample) and microsats (defined by individuals).

Assigning samples to individuals, and optionally attaching trait phenotypes or haplotype resolution information to them, is done with the <individuals> tag. An example can be found in section Assigning Phenotypes to Individuals of the Trait Mapping documentation.

Specifying Relationships between Individuals and Sample Data in Converter Command File
parent tag	child tag	child required	child instances allowed
<individuals>	<individual>	optional	multiple
<individual>	<name>	REQUIRED	SINGLE
	<sample>	REQUIRED	multiple
	<phase>	optional	multiple
	<has-phenotype>	optional	multiple
	<genotype-resolutions>	optional	multiple
<sample>	<name>	REQUIRED	SINGLE
<phase>	<segment-name>	REQUIRED	SINGLE
<phase>	<unresolved-markers>	REQUIRED	SINGLE
tag	contents
<name>	a name unique among all individuals and samples
<has-phenotype>	a <phenotype> name already defined in the <traits> section
<genotype-resolutions>	an "anonymous" phenotype belonging to the enclosing individual only. See <traits> subtags table for definition
<segment-name>	the name of the segment to which this set of phase information applies
<unresolved-markers>	sites for which data markers are unresolved for this individual and segment

To see an example of the <phase>, <segment-name> and <unresolved-markers> tags in use, see the file sample-conv-cmd.xml (actual xml is here).

The values for the 'unresolved-markers' tag should be site labels. The first valid site in a segment is the value of the 'first-position-scanned' tag for that segment, and the last valid site is determined by the length of the segment. If the segment does not have as many markers in it as valid sites (as for SNP data), the values here should match the values in the 'locations' tag for the segment. In the example file, the second segment of the second chromosome has SNP data with markers at positions 13, 19, 35, 77, 102, 112, and 204. These are therefore the only valid values for the 'unresolved-markers' tag inside a 'phase' tag for this segment.
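
As a sketch of that rule, the fragment below marks three of the seven SNP positions as unresolved for one individual. The individual, sample, and segment names are hypothetical, and the sketch assumes that <unresolved-markers> takes a space-separated list. Every value listed is one of the marker positions 13, 19, 35, 77, 102, 112, and 204; a value such as 20 would not correspond to any marker in this segment.

```xml
<individuals>
    <individual>
        <name>north-ind0</name>
        <!-- The two samples (haplotypes) that make up this individual. -->
        <sample><name>north-ind0-a</name></sample>
        <sample><name>north-ind0-b</name></sample>
        <phase>
            <!-- Segment to which this phase information applies. -->
            <segment-name>chrom2-snps</segment-name>
            <!-- Site labels whose phase is unknown for this individual. -->
            <unresolved-markers>19 77 204</unresolved-markers>
        </phase>
    </individual>
</individuals>
```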

Specifying Panel Correction Information

Panel member counts should be entered only if the user wishes to invoke Panel Correction. They need not be specified for all regions, only those for which one has the number of sequences used to create the panel.

WARNING: Do not estimate the number of sequences used to create a panel; doing so will make your results indefensible. If you do not have the actual number of sequences, you should not use Panel Correction. Your mutation rates will be lower, but that's the best you can do without knowing more about how the panel was created.

Specifying Panel Correction Information in Converter Command File
parent tag	child tag	child required	child instances allowed
<panels>	<panel>	optional	multiple
<panel>	<panel-name>	optional	SINGLE
	<panel-region>	REQUIRED	SINGLE
	<panel-pop>	REQUIRED	SINGLE
	<panel-size>	REQUIRED	SINGLE
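
A single panel entry built from the tags in this table might look like the sketch below. The panel, region, and population names are hypothetical, and the value 20 is only a placeholder: in line with the warning above, replace it with the actual number of sequences used to create the panel, or omit the panel entirely.

```xml
<panels>
    <panel>
        <!-- Optional label for this panel. -->
        <panel-name>chrom2-north-panel</panel-name>
        <!-- Region and population the panel applies to. -->
        <panel-region>chrom2</panel-region>
        <panel-pop>North</panel-pop>
        <!-- Placeholder: use the real count of panel sequences. -->
        <panel-size>20</panel-size>
    </panel>
</panels>
```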
