tarchive2amos: utility for generating AMOS message files
The AMOS package uses a compact representation for the information exchange to and from the assembler. This representation, the AMOS message format, is described in detail here, and was inspired by the interchange format developed at Celera Genomics for use in Celera Assembler.
tarchive2amos is a utility that converts files from the NCBI Trace Archive format into the AMOS message format.
tarchive2amos can use data specified in the following three formats:
- sequence data in one or more multi-fasta formatted files. These files must be named fasta.* (Trace Archive standard) or *.seq.
- quality data in zero or more multi-fasta formatted files. These files must be named qual.* (Trace Archive standard) or *.qual and must match the names of the sequence files. The quality files are optional: if they are absent, all bases are assigned a quality value of 20 (1 error in 100 bp) unless a different value is given with the -qual option.
- ancillary data in XML format. These files must be named xml.* (Trace Archive standard) or *.xml and must match the names of the sequence files. The information specified in these files includes (but is not limited to) clipping information and library size information. For more information please refer to the Trace Archive documentation. Like the quality files, the XML files are not required.
In addition to these files, the user can provide a list of clear ranges (clipping coordinates) in a separate file. This information overrides any clear range set in the XML files. Furthermore, reads not present in the clear range file are excluded from the conversion.
Note that if a clear range file is not specified, reads with no clear range set in the XML or the sequence file (see below) will be assigned a clear range that spans the entire extent of the read.
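The precedence rules above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function name resolve_clear_range, its arguments, and the 0-based end-exclusive coordinate convention are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

ClearRange = Tuple[int, int]


def resolve_clear_range(
    read_id: str,
    read_length: int,
    from_xml_or_seq: Optional[ClearRange],
    clear_file: Optional[Dict[str, ClearRange]],
) -> Optional[ClearRange]:
    """Return the clear range to use, or None if the read is excluded."""
    if clear_file is not None:
        # A clear range file was supplied: it overrides every other source,
        # and reads that are missing from it are dropped (None).
        return clear_file.get(read_id)
    if from_xml_or_seq is not None:
        # No clear range file: fall back to the XML or sequence header value.
        return from_xml_or_seq
    # No clipping information anywhere: the whole read is clear.
    # Assumption: (start, end) is 0-based and end-exclusive.
    return (0, read_length)
```

Passing None for clear_file models a run without the -c option, while an empty dictionary models a clear range file that lists no reads (so every read is excluded).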
Sequence file formats
tarchive2amos accepts four different formats for the header lines in the sequence file:
- Trace Archive format generated by a query (either through the website or the query_tracedb script)
>gnl|ti|145655111 name:38245161 ...
The first identifier is the TRACE_ID in the XML file and the second one is the name assigned to the trace (TRACE_NAME) in the XML file.
- Trace Archive format:
The first identifier is the trace identifier (TRACE_ID in the XML file) while the second one is the assigned name for the trace (TRACE_NAME in the XML file). The output message file will only contain the trace name (in the eid: field of each read record).
- TIGR sequence format (also produced by the trimming package lucy):
>GBRAA01TF 1000 2000 1500 17 823
The first identifier is the trace name, followed by three numbers representing the library size estimates (ignored by tarchive2amos), then followed by the clear range.
- Generic multi-fasta
Note that the sequence and quality files are linked through the first identifier on the multi-fasta header line. The XML and the sequence files are linked through the TRACE_NAME field in the XML: it has to match the trace name portion of the header in the two Trace Archive formats, or the identifier at the start of the header in the other two formats.
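As a rough illustration of how the header styles shown above differ, the following Python sketch recognises the query-style Trace Archive header, the TIGR header, and the generic multi-fasta header. It is not taken from tarchive2amos: the Header type, the parse_header function, and the rule used to detect a TIGR header (exactly five numeric fields after the name) are assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Header(NamedTuple):
    name: str                         # trace name (or first identifier)
    trace_id: Optional[str]           # TRACE_ID, when the header carries one
    clear: Optional[Tuple[int, int]]  # clear range, when the header carries one


# Query-style Trace Archive header: ">gnl|ti|<TRACE_ID> name:<TRACE_NAME> ..."
_QUERY = re.compile(r"^>gnl\|ti\|(\d+)\s+name:(\S+)")


def parse_header(line: str) -> Header:
    if not line.startswith(">"):
        raise ValueError("not a fasta header line")
    match = _QUERY.match(line)
    if match:
        return Header(name=match.group(2), trace_id=match.group(1), clear=None)
    fields = line[1:].split()
    if not fields:
        raise ValueError("empty fasta header")
    # Assumed TIGR detection: name, three library size numbers, two clip values.
    if len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
        # The three library size estimates (fields 1-3) are ignored.
        return Header(fields[0], None, (int(fields[4]), int(fields[5])))
    # Generic multi-fasta: only the first identifier is used.
    return Header(fields[0], None, None)
```

For the TIGR example above, parse_header(">GBRAA01TF 1000 2000 1500 17 823") yields the name GBRAA01TF with a clear range of (17, 823).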
tarchive2amos assumes that for each file called <file>.seq there is a <file>.qual and a <file>.xml (alternatively, the files may be called fasta.<file>, qual.<file> and xml.<file>). If no .xml file is present the program will only produce a set of RED (read) records.
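The naming convention can be expressed as a small helper that derives the expected quality and XML file names from a sequence file name. This is a sketch under the two naming schemes described above; the function name companion_files is hypothetical and the helper does not check whether the files exist.

```python
from pathlib import Path
from typing import Tuple


def companion_files(seq_path: str) -> Tuple[Path, Path]:
    """Return the (quality, XML) paths expected next to a sequence file."""
    path = Path(seq_path)
    if path.name.startswith("fasta."):
        # Trace Archive standard: fasta.<file>, qual.<file>, xml.<file>
        rest = path.name[len("fasta."):]
        return path.with_name("qual." + rest), path.with_name("xml." + rest)
    if path.suffix == ".seq":
        # Alternative scheme: <file>.seq, <file>.qual, <file>.xml
        return path.with_suffix(".qual"), path.with_suffix(".xml")
    raise ValueError(f"unrecognised sequence file name: {path.name}")
```

A caller could test each returned path with Path.exists() to decide whether quality values and XML metadata are available for that sequence file.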
tarchive2amos -o <prefix> [-c <clear_ranges>] [-l <libs>] [-m <mates>] <seq_file1> <seq_file2> ...
tarchive2amos will read one or more sequence files (as described above) and place the output in a file called <prefix>.afg. Note that the -o option is required. Use the -h option for a complete list of options.
A set of clear ranges may be specified in an additional file (with option -c) in the format:
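When the tool is driven from a script, the command line from the synopsis can be assembled as an argument list. The Python sketch below only builds the list, using the -o, -c, -l and -m options shown in the synopsis; the function name build_command and the file names in the usage note are made up for illustration.

```python
from typing import List, Optional, Sequence


def build_command(
    prefix: str,
    seq_files: Sequence[str],
    clear: Optional[str] = None,
    libs: Optional[str] = None,
    mates: Optional[str] = None,
) -> List[str]:
    """Build an argument list matching the tarchive2amos synopsis."""
    if not seq_files:
        raise ValueError("at least one sequence file is required")
    # -o is mandatory; the output is written to <prefix>.afg.
    cmd = ["tarchive2amos", "-o", prefix]
    if clear is not None:
        cmd += ["-c", clear]
    if libs is not None:
        cmd += ["-l", libs]
    if mates is not None:
        cmd += ["-m", mates]
    return cmd + list(seq_files)
```

The resulting list could be handed to subprocess.run, for example build_command("asm", ["reads.seq"], clear="reads.clr"), where asm, reads.seq and reads.clr are placeholder names.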
<read id> <clip_left> <clip_right>
These values will override any value specified in the XML or sequence files.
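A reader for this three-column format might look like the following Python sketch. It assumes the columns are separated by whitespace and that blank lines may be skipped, neither of which is stated above; the function name parse_clear_ranges and the second read name in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_clear_ranges(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Parse '<read id> <clip_left> <clip_right>' lines into a dictionary."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        read_id, left, right = fields
        ranges[read_id] = (int(left), int(right))
    return ranges
```

For example, the two lines "GBRAA01TF 17 823" and "GBRAA02TF 5 640" produce a dictionary with one (clip_left, clip_right) pair per read.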
In addition to Trace Archive XML files, tarchive2amos also accepts library and read mate information in a Bambus-style .mates file (option -m). Library information can also be provided with the -l option in a file formatted as follows:
<lib_id> <mean_size> <size_stdev>
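A library file in this layout can be generated from a list of library descriptions. The Python sketch below assumes single-space separators and one library per line; the function name format_library_file and the library names in the usage note are invented for the example.

```python
from typing import Iterable, Tuple, Union

Number = Union[int, float]


def format_library_file(libs: Iterable[Tuple[str, Number, Number]]) -> str:
    """Render (lib_id, mean_size, size_stdev) tuples as library file text."""
    out = []
    for lib_id, mean_size, size_stdev in libs:
        if not lib_id or any(ch.isspace() for ch in lib_id):
            # Whitespace inside an identifier would break the column layout.
            raise ValueError(f"invalid library id: {lib_id!r}")
        out.append(f"{lib_id} {mean_size} {size_stdev}\n")
    return "".join(out)
```

Calling format_library_file([("small", 2000, 200), ("large", 10000, 1500)]) returns two lines that can be written to a file and passed with -l.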
- -i <id> - specifies the starting identifier for the messages generated. This option is useful when appending to an existing AMOS bank.
- -min <len> - minimum length of reads accepted (default 100 bp)
- -max <len> - maximum length of reads accepted (default 2048 bp)
- -qual <qval> - quality value to be assigned to qualityless reads (default 20)
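The effect of the -min, -max and -qual defaults can be pictured with a short Python sketch. It is a guess at the behaviour, not the tool's implementation: the function names are hypothetical, and the example assumes that the length limits are inclusive and apply to the full read sequence.

```python
from typing import List, Optional


def accept_read(seq: str, min_len: int = 100, max_len: int = 2048) -> bool:
    """Length filter; assumption: both bounds are inclusive."""
    return min_len <= len(seq) <= max_len


def qualities_for(
    seq: str, quals: Optional[List[int]] = None, default_qual: int = 20
) -> List[int]:
    """Use the given qualities, or one default value per base if none exist."""
    if quals is None:
        return [default_qual] * len(seq)
    return quals


def phred_error_probability(q: float) -> float:
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

With the default quality of 20, phred_error_probability gives 0.01, which matches the "1 error in 100 bp" figure quoted earlier.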
The program produces rather verbose output when inconsistencies are found in the data.