AMOS WIKI - User contributions [en]

Minimus

2010-01-29T16:25:49Z

Trgibbons: /* Overview */

== Overview ==

Minimus is one of several assembly pipelines included in the AMOS software package. It is designed specifically for small data sets, such as the set of reads covering a specific gene. The code will work for larger assemblies (we have used it to assemble bacterial genomes); however, due to its stringency, the resulting assembly will be highly fragmented. For large or complex assemblies, the execution of Minimus should be followed by additional processing steps, such as scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules which share information through a central file bank:

* [[hash-overlap]] - Computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - Uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - Refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

Minimus uses AMOS message files as both the inputs and the outputs. Please see the [[File conversion utilities]] documentation for more information.

[[minimus2 | Minimus2]] is a modified version of the minimus pipeline designed for merging two sequence sets. Instead of hash-overlap, it uses a nucmer-based overlap detector, which is much faster.

== Documentation ==

Documentation on running minimus is included with the distribution in the /docs subdirectory.

See [[Minimus/README]].

== Examples ==

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in /docs/minimus.README.

== Basic usage ==

To run minimus you will need a set of sequence files. Assuming you have a set of reads in fasta format called '''my_reads.seq''', you can run minimus with the following two commands:

toAmos -s my_reads.seq -o my_reads.afg

minimus my_reads

The output will be a fasta-formatted file called '''my_reads.fasta''', a contig file called '''my_reads.contig''' with details about the assembly of each contig, and an AMOS bank folder containing various files used internally by minimus.
The toAmos file conversion utility is the most general and probably the most useful of the file conversion utilities included with minimus. More information about toAmos and the [[File_conversion_utilities | other file conversion utilities]] can be found in the [[AMOS | AMOS documentation wiki]]. For example, you can include quality data from a Phred style quality score file by running [[ToAmos | toAmos]] with the -q option as follows:

toAmos -s my_reads.seq -q my_reads.qual -o my_reads.afg
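The two-step invocation above can also be assembled programmatically. The following is a minimal Python sketch, assuming only the toAmos options shown on this page (-s, -q, -o); the function name minimus_commands is hypothetical and is not part of AMOS. It only builds the argument lists and does not run anything.

```python
def minimus_commands(prefix, seq_file, qual_file=None):
    """Build the toAmos and minimus command lines described above.

    Returns a list of argument lists; nothing is executed here.
    """
    to_amos = ["toAmos", "-s", seq_file]
    if qual_file is not None:
        # Optional Phred-style quality file, passed with -q.
        to_amos += ["-q", qual_file]
    # toAmos writes the AMOS message file that minimus reads by default.
    to_amos += ["-o", prefix + ".afg"]
    return [to_amos, ["minimus", prefix]]


for command in minimus_commands("my_reads", "my_reads.seq", "my_reads.qual"):
    print(" ".join(command))
```

Each returned list could be passed to subprocess.run on a machine where AMOS is installed and on the PATH.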

Minimus can also be called with the following equivalent command:

runAmos -C $AMOSBASE/src/Pipeline/minimus.acf my_reads

The AMOS package also includes other helpful tools such as [[Hawkeye]], which is useful for evaluating your assembly with respect to paired-end reads. It can be run on the minimus bank with the following command:

hawkeye my_reads.bnk/

== Publication ==

[http://www.biomedcentral.com/1471-2105/8/64 Minimus: a fast, lightweight genome assembler]

Sommer, DD, Delcher, AL, Salzberg, SL, and Pop, M. (2007) BMC Bioinformatics, 8:64. doi:10.1186/1471-2105-8-64.

== Acknowledgements ==
The development of minimus was supported by the National Institutes of Health under grants R01-LM06845 and R01-LM007938 to SLS and by Department of Homeland Security cooperative agreement W81XWH-05-2-0051.

Tarchive2amos

2010-01-20T19:25:11Z

Trgibbons: /* Required inputs */ Renamed section, since qual and xml files are not actually required inputs

tarchive2amos: utility for generating AMOS message files

== Overview ==

The AMOS package uses a compact representation for the information exchange to and from the assembler. This representation, the AMOS message format, is described in detail here, and was inspired by the interchange format developed at Celera Genomics for use in Celera Assembler.

Tarchive2amos is a utility that allows users to convert files from the NCBI Trace Archive format into the AMOS message format.

== Input Files ==

tarchive2amos can use data specified in the following three formats:

* sequence data in one or more multi-fasta formatted files. These files must be named fasta.* (Trace Archive standard) or *.seq.
* quality data in zero or more multi-fasta formatted files. These files must be named qual.* (Trace Archive standard) or *.qual and must match the names of the sequence files. Quality files are optional: if they are absent, all bases are assigned a quality value of 20 (1 error in 100 bp).
* ancillary data in XML format. These files must be named xml.* (Trace Archive standard) or *.xml and must match the names of the sequence files. The information specified in these files includes (but is not limited to) clipping information and library size information. For more information, please refer to the Trace Archive documentation. Like the quality files, the XML files are not required.
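The "1 error in 100 bp" figure for the default quality value follows from the usual Phred convention, Q = -10 log10(P). A minimal Python sketch of that arithmetic, with an illustrative function name that is not part of AMOS:

```python
def phred_to_error_probability(q):
    """Convert a Phred-style quality value Q to an error probability P.

    Uses the standard relation Q = -10 * log10(P), i.e. P = 10 ** (-Q / 10).
    """
    return 10 ** (-q / 10)


# Default quality assigned when no quality file is present.
p = phred_to_error_probability(20)
print(f"{p:.2f}")  # 0.01, i.e. one expected error per 100 bases
```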

In addition to these files, the user can provide a list of clear ranges (clipping coordinates) in a separate file. This information will override any clear ranges set by the XML files. Furthermore, reads not present in the clear range file will be excluded from the conversion.

Note that if a clear range file is not specified, reads with no clear range set in the XML or the sequence file (see below) will be assigned a clear range that spans the entire extent of the read.

== Sequence file formats ==

tarchive2amos accepts four different formats for the header lines in the sequence file:

* Trace Archive format generated by a query (either through website or query_tracedb script)

>gnl|ti|145655111 name:38245161 ...

The first identifier is the TRACE_ID in the XML file and the second one is the name assigned to the trace (TRACE_NAME) in the xml file.

* Trace Archive format:

>gnl|ti|145655111 38245161

The first identifier is the trace identifier (TRACE_ID in the XML file) while the second one is the assigned name for the trace (TRACE_NAME in the XML file). The output message file will only contain the trace name (in the eid: field of each read record).

* TIGR sequence format (also produced by the trimming package lucy):

>GBRAA01TF 1000 2000 1500 17 823

The first identifier is the trace name, followed by three numbers representing the library size estimates (ignored by tarchive2amos), then followed by the clear range.

* Generic multi-fasta

>GBRAA01TF

Note that the sequence and quality files are linked through the first identifier on the multi-fasta header line. The XML and the sequence files are linked through the TRACE_NAME field in the XML (it has to match the trace name portion of the header in the Trace Archive format, or the trace identifier in the other two formats).
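To make the four header layouts concrete, here is a minimal Python sketch that classifies a header line and pulls out the fields described above. It is an illustration only, not the parser used by tarchive2amos; the function name, the dictionary keys, and the format labels are all assumptions of this sketch.

```python
import re


def parse_header(line):
    """Classify a multi-fasta header line into one of the four layouts above."""
    fields = line.lstrip(">").split()
    first = fields[0]
    match = re.match(r"gnl\|ti\|(\d+)$", first)
    if match and len(fields) > 1:
        trace_id = match.group(1)
        if fields[1].startswith("name:"):
            # Trace Archive query output: >gnl|ti|<TRACE_ID> name:<TRACE_NAME> ...
            name = fields[1][len("name:"):]
            return {"format": "trace_archive_query", "trace_id": trace_id, "name": name}
        # Plain Trace Archive header: >gnl|ti|<TRACE_ID> <TRACE_NAME>
        return {"format": "trace_archive", "trace_id": trace_id, "name": fields[1]}
    if len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
        # TIGR header: name, three library size estimates, then the clear range.
        return {"format": "tigr", "name": first,
                "clear": (int(fields[4]), int(fields[5]))}
    # Anything else is treated as a generic multi-fasta header.
    return {"format": "generic", "name": first}


print(parse_header(">GBRAA01TF 1000 2000 1500 17 823"))
```

The name returned here is the identifier that links a sequence record to its quality record and, for Trace Archive headers, to the TRACE_NAME field in the XML.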

== Synopsis ==

tarchive2amos assumes that for each file called <file>.seq there is a <file>.qual and a <file>.xml (alternatively, the files may be called fasta.<file>, qual.<file>, and xml.<file>). If no .xml file is present, the program will only produce a set of RED (read) records.

tarchive2amos -o <prefix> [-c <clear_ranges>] [-l <libs>]
[-m <mates>] <seq_file1> <seq_file2> ...

tarchive2amos will read one or more sequence files (as described above) and place the output in a file called <prefix>.afg. Note that the -o option is required. Use the -h option for a complete list of options.

A set of clear ranges may be specified in an additional file (with option -c) in the format:

<read id> <clip_left> <clip_right>

These values will overwrite any value specified in the XML or sequence files.
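The precedence rules for clear ranges can be summarised in a short Python sketch. This is an illustration of the rules stated on this page, not the tarchive2amos implementation: the function names are hypothetical, and the (0, read_length) coordinates used for a full-length range are an assumption, since the coordinate convention is not specified here.

```python
def parse_clear_ranges(text):
    """Parse lines of '<read id> <clip_left> <clip_right>' into a dict."""
    ranges = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skipping blank or malformed lines is a choice of this sketch
        read_id, left, right = fields
        ranges[read_id] = (int(left), int(right))
    return ranges


def resolve_clear_range(read_id, read_length, file_ranges=None, other_range=None):
    """Pick the clear range for one read, or None if the read is excluded.

    file_ranges: dict from the -c file, or None if no such file was given.
    other_range: range taken from the XML or the sequence header, if any.
    """
    if file_ranges is not None:
        # The -c file wins; reads missing from it are excluded.
        return file_ranges.get(read_id)
    if other_range is not None:
        return other_range
    # No range anywhere: span the whole read (coordinates assumed).
    return (0, read_length)


ranges = parse_clear_ranges("GBRAA01TF 17 823\n")
print(resolve_clear_range("GBRAA01TF", 900, ranges, (10, 850)))
```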

In addition to Trace Archive XMLs, tarchive2amos also accepts library and read mate information in a Bambus-style .mates file. Furthermore, library information can also be provided with the -l option in a file formatted as follows:

<lib_id> <mean_size> <size_stdev>

== Additional options ==

* -i <id> - specifies the starting identifier for the messages generated. This option is useful when appending to an already existing AMOS bank.
* -min <len> - minimum length of reads accepted (default 100 bp)
* -max <len> - maximum length of reads accepted (default 2048 bp)
* -qual <qval> - quality value to be assigned to qualityless reads (default 20)

== Notes ==

The program produces rather verbose output when inconsistencies are found in the data.

Tarchive2amos

2010-01-20T19:24:16Z

Trgibbons: /* Synopsis */ Added the description of the implicit use of qual and xml files, as well as a mention to the -h option

tarchive2amos: utility for generating AMOS message files

== Overview ==

The AMOS package uses a compact representation for the information exchange to and from the assembler. This representation, the AMOS message format, is described in detail here, and was inspired by the interchange format developed at Celera Genomics for use in Celera Assembler.

Tarchive2amos is a utility that allows users to convert files from the NCBI Trace Archive format into the AMOS message format.

== Required inputs ==

tarchive2amos can use data specified in the following three formats:

* sequence data in one or more multi-fasta formatted files. These files must be named fasta.* (Trace Archive standard) or *.seq.
* quality data in zero or more multi-fasta formatted files. These files must be named qual.* (Trace Archive standard) or *.qual and must match the names of the sequence files. Quality files are optional: if they are absent, all bases are assigned a quality value of 20 (1 error in 100 bp).
* ancillary data in XML format. These files must be named xml.* (Trace Archive standard) or *.xml and must match the names of the sequence files. The information specified in these files includes (but is not limited to) clipping information and library size information. For more information, please refer to the Trace Archive documentation. Like the quality files, the XML files are not required.

In addition to these files, the user can provide a list of clear ranges (clipping coordinates) in a separate file. This information will override any clear ranges set by the XML files. Furthermore, reads not present in the clear range file will be excluded from the conversion.

Note that if a clear range file is not specified, reads with no clear range set in the XML or the sequence file (see below) will be assigned a clear range that spans the entire extent of the read.

== Sequence file formats ==

tarchive2amos accepts four different formats for the header lines in the sequence file:

* Trace Archive format generated by a query (either through website or query_tracedb script)

>gnl|ti|145655111 name:38245161 ...

The first identifier is the TRACE_ID in the XML file and the second one is the name assigned to the trace (TRACE_NAME) in the xml file.

* Trace Archive format:

>gnl|ti|145655111 38245161

The first identifier is the trace identifier (TRACE_ID in the XML file) while the second one is the assigned name for the trace (TRACE_NAME in the XML file). The output message file will only contain the trace name (in the eid: field of each read record).

* TIGR sequence format (also produced by the trimming package lucy):

>GBRAA01TF 1000 2000 1500 17 823

The first identifier is the trace name, followed by three numbers representing the library size estimates (ignored by tarchive2amos), then followed by the clear range.

* Generic multi-fasta

>GBRAA01TF

Note that the sequence and quality files are linked through the first identifier on the multi-fasta header line. The XML and the sequence files are linked through the TRACE_NAME field in the XML (it has to match the trace name portion of the header in the Trace Archive format, or the trace identifier in the other two formats).

== Synopsis ==

tarchive2amos assumes that for each file called <file>.seq there is a <file>.qual and a <file>.xml (alternatively, the files may be called fasta.<file>, qual.<file>, and xml.<file>). If no .xml file is present, the program will only produce a set of RED (read) records.

tarchive2amos -o <prefix> [-c <clear_ranges>] [-l <libs>]
[-m <mates>] <seq_file1> <seq_file2> ...

tarchive2amos will read one or more sequence files (as described above) and place the output in a file called <prefix>.afg. Note that the -o option is required. Use the -h option for a complete list of options.

A set of clear ranges may be specified in an additional file (with option -c) in the format:

<read id> <clip_left> <clip_right>

These values will overwrite any value specified in the XML or sequence files.

In addition to Trace Archive XMLs, tarchive2amos also accepts library and read mate information in a Bambus-style .mates file. Furthermore, library information can also be provided with the -l option in a file formatted as follows:

<lib_id> <mean_size> <size_stdev>

== Additional options ==

* -i <id> - specifies the starting identifier for the messages generated. This option is useful when appending to an already existing AMOS bank.
* -min <len> - minimum length of reads accepted (default 100 bp)
* -max <len> - maximum length of reads accepted (default 2048 bp)
* -qual <qval> - quality value to be assigned to qualityless reads (default 20)

== Notes ==

The program produces rather verbose output when inconsistencies are found in the data.

Tarchive2amos

2010-01-20T19:20:29Z

Trgibbons: /* Required inputs */ Beginning to update this page

tarchive2amos: utility for generating AMOS message files

== Overview ==

The AMOS package uses a compact representation for the information exchange to and from the assembler. This representation, the AMOS message format, is described in detail here, and was inspired by the interchange format developed at Celera Genomics for use in Celera Assembler.

Tarchive2amos is a utility that allows users to convert files from the NCBI Trace Archive format into the AMOS message format.

== Required inputs ==

tarchive2amos can use data specified in the following three formats:

* sequence data in one or more multi-fasta formatted files. These files must be named fasta.* (Trace Archive standard) or *.seq.
* quality data in zero or more multi-fasta formatted files. These files must be named qual.* (Trace Archive standard) or *.qual and must match the names of the sequence files. Quality files are optional: if they are absent, all bases are assigned a quality value of 20 (1 error in 100 bp).
* ancillary data in XML format. These files must be named xml.* (Trace Archive standard) or *.xml and must match the names of the sequence files. The information specified in these files includes (but is not limited to) clipping information and library size information. For more information, please refer to the Trace Archive documentation. Like the quality files, the XML files are not required.

In addition to these files, the user can provide a list of clear ranges (clipping coordinates) in a separate file. This information will override any clear ranges set by the XML files. Furthermore, reads not present in the clear range file will be excluded from the conversion.

Note that if a clear range file is not specified, reads with no clear range set in the XML or the sequence file (see below) will be assigned a clear range that spans the entire extent of the read.

== Sequence file formats ==

tarchive2amos accepts four different formats for the header lines in the sequence file:

* Trace Archive format generated by a query (either through website or query_tracedb script)

>gnl|ti|145655111 name:38245161 ...

The first identifier is the TRACE_ID in the XML file and the second one is the name assigned to the trace (TRACE_NAME) in the xml file.

* Trace Archive format:

>gnl|ti|145655111 38245161

The first identifier is the trace identifier (TRACE_ID in the XML file) while the second one is the assigned name for the trace (TRACE_NAME in the XML file). The output message file will only contain the trace name (in the eid: field of each read record).

* TIGR sequence format (also produced by the trimming package lucy):

>GBRAA01TF 1000 2000 1500 17 823

The first identifier is the trace name, followed by three numbers representing the library size estimates (ignored by tarchive2amos), then followed by the clear range.

* Generic multi-fasta

>GBRAA01TF

Note that the sequence and quality files are linked through the first identifier on the multi-fasta header line. The XML and the sequence files are linked through the TRACE_NAME field in the XML (it has to match the trace name portion of the header in the Trace Archive format, or the trace identifier in the other two formats).

== Synopsis ==

tarchive2amos -o <prefix> [-c <clear_ranges>] [-l <libs>]
[-m <mates>] <seq_file1> <seq_file2> ...

tarchive2amos will read one or more sequence files (as described above) and place the output in a file called <prefix>.afg. Note that the -o option is required.

A set of clear ranges may be specified in an additional file (with option -c) in the format:

<read id> <clip_left> <clip_right>

These values will overwrite any value specified in the XML or sequence files.

In addition to Trace Archive XMLs, tarchive2amos also accepts library and read mate information in a Bambus-style .mates file. Furthermore, library information can also be provided with the -l option in a file formatted as follows:

<lib_id> <mean_size> <size_stdev>

== Additional options ==

* -i <id> - specifies the starting identifier for the messages generated. This option is useful when appending to an already existing AMOS bank.
* -min <len> - minimum length of reads accepted (default 100 bp)
* -max <len> - maximum length of reads accepted (default 2048 bp)
* -qual <qval> - quality value to be assigned to qualityless reads (default 20)

== Notes ==

The program produces rather verbose output when inconsistencies are found in the data.

Minimus/README

2010-01-14T22:31:44Z

Trgibbons: /* Brief Summary */

minimus - The AMOS Lightweight Assembler

== Brief Summary ==
Minimus is an assembly pipeline designed specifically for small
data-sets, such as the set of reads covering a specific gene. Note that
the code will work for larger assemblies (we have used it to assemble
bacterial genomes); however, due to its stringency, the resulting assembly
will be highly fragmented. For large and/or complex assemblies the execution
of Minimus should be followed by additional processing steps, such as
scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules which share information through a central file bank:

* [[hash-overlap]] - Computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - Uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - Refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

==Dependencies==
None.

==Running==
Either execute the minimus configuration script directly from
$bindir OR copy it to your local directory, edit it, and run it with
the `runAmos' command interpreter. The following variables must be set
on the command line or added to the script for the pipeline to operate
properly:

TGT - The target genome sequences in AMOS message format (.afg)
minimus -D TGT=<target> <prefix>
OR
runAmos -C minimus -D TGT=<target> <prefix>

Where <prefix> will be the output file prefix, and <target> is the
input AMOS message file. Check the `runAmos' documentation or type
`runAmos --help' for details on operating an AMOS pipeline. The
minimus pipeline config file can be easily modified by the user to add
additional processing steps.

In order to run minimus you need to provide an AMOS formatted file
of the reads. Such a file (commonly with extension .afg) can be
generated from a combination of sequence (.seq), quality (.qual), and
Trace Archive XML (.xml) files using the [[ToAmos | toAmos]] or
[[Tarchive2amos | tarchive2amos]] programs which will appear in the
$bindir directory upon installation.

The default TGT file is <prefix>.afg, thus if our input file is
<prefix>.afg we can run minimus simply by typing:

minimus <prefix>

== Output ==
Output will be a TIGR .contig file and a FastA .fasta file. The
TIGR contig file contains the gapped consensus and multi-alignment
information for the assembly. Each contig sequence is preceded by a
header line which starts with '##', followed by the gapped consensus
sequence with gaps represented as a '-' character. Following the
consensus is the gapped read sequence preceded by a header line
beginning with '#'. The .fasta file contains all the contigs produced
by minimus in a multi-FastA formatted file. These sequences will match
the sequences in the .contig file, but without the gaps.
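The relationship between the two output files can be illustrated with a short Python sketch that reads the consensus records from .contig-style text and removes the gap characters. It relies only on the '##' and '#' line prefixes described above; the header contents in the toy input are made-up placeholders and do not reproduce the real header layout, and the function name is hypothetical.

```python
def ungapped_contigs(contig_text):
    """Return (header, ungapped consensus) pairs from TIGR .contig-style text."""
    contigs = []
    header, pieces, in_consensus = None, [], False
    for line in contig_text.splitlines():
        if line.startswith("##"):
            # A new contig starts; store the previous one, if any.
            if header is not None:
                contigs.append((header, "".join(pieces).replace("-", "")))
            header, pieces, in_consensus = line[2:].strip(), [], True
        elif line.startswith("#"):
            # Read records follow the consensus; this sketch skips them.
            in_consensus = False
        elif in_consensus:
            pieces.append(line.strip())
    if header is not None:
        contigs.append((header, "".join(pieces).replace("-", "")))
    return contigs


toy = "##c1\nAC-GT\nTT-A\n#r1\nAC-G\n##c2\nGG--C\n"
print(ungapped_contigs(toy))
```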

To obtain an ACE format representation of the assembly, we can run
the following to obtain a <prefix>.ace file:

bank-report -b <prefix>.bnk CTG > <prefix>.ctg
amos2ace <prefix>.afg <prefix>.ctg

Where <prefix> is the same as was used in the above section and
<prefix>.afg is the original input to the assembly pipeline. We can
simply add these commands to the runAmos config file to produce an ACE
file every time we run minimus.

==Example==
Assume we have a set of Trace Archive data with the names
`target.seq', `target.qual' and `target.xml' which contain the
sequence information for a small assembly task. To run the minimus
pipeline and generate the default output, we would type the following:

tarchive2amos -o target target.seq
minimus -D TGT=target.afg target

This will generate the default output named `target.contig' and
`target.fasta'. We could then generate an ACE assembly format file by
following the instructions in the above section, substituting "target"
for "<prefix>".

Minimus is now packaged with two example assemblies: an Influenza A
assembly and a zebrafish gene assembly. Both are under the 'test'
directory, which is located in the main AMOS directory after you untar
the AMOS tarball.

Minimus/README

2010-01-14T22:29:29Z

Trgibbons:

minimus - The AMOS Lightweight Assembler

== Brief Summary ==
minimus is an assembly pipeline designed specifically for small
data-sets, such as the set of reads covering a specific gene. Note that
the code will work for larger assemblies (we have used it to assemble
bacterial genomes); however, due to its stringency, the resulting assembly
will be highly fragmented. For large and/or complex assemblies the execution
of Minimus should be followed by additional processing steps, such as
scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of
three main modules:

* overlapper - computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm

* tigger - uses the read overlaps to generate the layouts of reads representing individual contigs

* make-consensus - refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

==Dependencies==
None.

==Running==
Either execute the minimus configuration script directly from
$bindir OR copy it to your local directory, edit it, and run it with
the `runAmos' command interpreter. The following variables must be set
on the command line or added to the script for the pipeline to operate
properly:

TGT - The target genome sequences in AMOS message format (.afg)
minimus -D TGT=<target> <prefix>
OR
runAmos -C minimus -D TGT=<target> <prefix>

Where <prefix> will be the output file prefix, and <target> is the
input AMOS message file. Check the `runAmos' documentation or type
`runAmos --help' for details on operating an AMOS pipeline. The
minimus pipeline config file can be easily modified by the user to add
additional processing steps.

In order to run minimus you need to provide an AMOS formatted file
of the reads. Such a file (commonly with extension .afg) can be
generated from a combination of sequence (.seq), quality (.qual), and
Trace Archive XML (.xml) files using the [[ToAmos | toAmos]] or
[[Tarchive2amos | tarchive2amos]] programs which will appear in the
$bindir directory upon installation.

The default TGT file is <prefix>.afg, thus if our input file is
<prefix>.afg we can run minimus simply by typing:

minimus <prefix>

== Output ==
Output will be a TIGR .contig file and a FastA .fasta file. The
TIGR contig file contains the gapped consensus and multi-alignment
information for the assembly. Each contig sequence is preceded by a
header line which starts with '##', followed by the gapped consensus
sequence with gaps represented as a '-' character. Following the
consensus is the gapped read sequence preceded by a header line
beginning with '#'. The .fasta file contains all the contigs produced
by minimus in a multi-FastA formatted file. These sequences will match
the sequences in the .contig file, but without the gaps.

To obtain an ACE format representation of the assembly, we can run
the following to obtain a <prefix>.ace file:

bank-report -b <prefix>.bnk CTG > <prefix>.ctg
amos2ace <prefix>.afg <prefix>.ctg

Where <prefix> is the same as was used in the above section and
<prefix>.afg is the original input to the assembly pipeline. We can
simply add these commands to the runAmos config file to produce an ACE
file every time we run minimus.

==Example==
Assume we have a set of Trace Archive data with the names
`target.seq', `target.qual' and `target.xml' which contain the
sequence information for a small assembly task. To run the minimus
pipeline and generate the default output, we would type the following:

tarchive2amos -o target target.seq
minimus -D TGT=target.afg target

This will generate the default output named `target.contig' and
`target.fasta'. We could then generate an ACE assembly format file by
following the instructions in the above section, substituting "target"
for "<prefix>".

Minimus is now packaged with two example assemblies: an Influenza A
assembly and a zebrafish gene assembly. Both are under the 'test'
directory, which is located in the main AMOS directory after you untar
the AMOS tarball.

Minimus

2010-01-14T22:28:16Z

Trgibbons: /* Basic usage */

== Overview ==

Minimus is an assembly pipeline designed specifically for small data sets, such as the set of reads covering a specific gene. The code will work for larger assemblies (we have used it to assemble bacterial genomes); however, due to its stringency, the resulting assembly will be highly fragmented. For large or complex assemblies, the execution of Minimus should be followed by additional processing steps, such as scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules which share information through a central file bank:

* [[hash-overlap]] - Computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - Uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - Refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

Minimus uses AMOS message files as both the inputs and the outputs. Please see the [[File conversion utilities]] documentation for more information.

[[minimus2 | Minimus2]] is a modified version of the minimus pipeline designed for merging two sequence sets. Instead of hash-overlap, it uses a nucmer-based overlap detector, which is much faster.

== Documentation ==

Documentation on running minimus is included with the distribution in the /docs subdirectory.

See [[Minimus/README]].

== Examples ==

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in /docs/minimus.README.

== Basic usage ==

To run minimus you will need a set of sequence files. Assuming you have a set of reads in fasta format called '''my_reads.seq''', you can run minimus with the following two commands:

toAmos -s my_reads.seq -o my_reads.afg

minimus my_reads

The output will be a fasta-formatted file called '''my_reads.fasta''', a contig file called '''my_reads.contig''' with details about the assembly of each contig, and an AMOS bank folder containing various files used internally by minimus.
The toAmos file conversion utility is the most general and probably the most useful of the file conversion utilities included with minimus. More information about toAmos and the [[File_conversion_utilities | other file conversion utilities]] can be found in the [[AMOS | AMOS documentation wiki]]. For example, you can include quality data from a Phred style quality score file by running [[ToAmos | toAmos]] with the -q option as follows:

toAmos -s my_reads.seq -q my_reads.qual -o my_reads.afg

Minimus can also be called with the following equivalent command:

runAmos -C $AMOSBASE/src/Pipeline/minimus.acf my_reads

The AMOS package also includes other helpful tools such as [[Hawkeye]], which is useful for evaluating your assembly with respect to paired-end reads. It can be run on the minimus bank with the following command:

hawkeye my_reads.bnk/

== Publication ==

[http://www.biomedcentral.com/1471-2105/8/64 Minimus: a fast, lightweight genome assembler]

Sommer, DD, Delcher, AL, Salzberg, SL, and Pop, M. (2007) BMC Bioinformatics, 8:64. doi:10.1186/1471-2105-8-64.

== Acknowledgements ==
The development of minimus was supported by the National Institutes of Health under grants R01-LM06845 and R01-LM007938 to SLS and by Department of Homeland Security cooperative agreement W81XWH-05-2-0051.

Minimus

2010-01-14T22:26:56Z

Trgibbons: /* Basic usage */

== Overview ==

Minimus is an assembly pipeline designed specifically for small data sets, such as the set of reads covering a specific gene. The code will work for larger assemblies (we have used it to assemble bacterial genomes); however, due to its stringency, the resulting assembly will be highly fragmented. For large or complex assemblies, the execution of Minimus should be followed by additional processing steps, such as scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules which share information through a central file bank:

* [[hash-overlap]] - Computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - Uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - Refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

Minimus uses AMOS message files as both the inputs and the outputs. Please see the [[File conversion utilities]] documentation for more information.

[[minimus2 | Minimus2]] is a modified version of the minimus pipeline designed for merging two sequence sets. Instead of hash-overlap, it uses a nucmer-based overlap detector, which is much faster.

== Documentation ==

Documentation on running minimus is included with the distribution in the /docs subdirectory.

See [[Minimus/README]].

== Examples ==

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in /docs/minimus.README.

== Basic usage ==

To run minimus you will need a set of sequence files. Assuming you have a set of reads in fasta format called '''my_reads.seq''', you can run minimus with the following two commands:

`toAmos -s my_reads.seq -o my_reads.afg'

`minimus my_reads'

The output will be a fasta-formatted file called '''my_reads.fasta''', a contig file called '''my_reads.contig''' with details about the assembly of each contig, and an AMOS bank folder containing various files used internally by minimus.
The toAmos file conversion utility is the most general and probably the most useful of the file conversion utilities included with minimus. More information about toAmos and the [[File_conversion_utilities | other file conversion utilities]] can be found in the [[AMOS | AMOS documentation wiki]]. For example, you can include quality data from a Phred style quality score file by running [[ToAmos | toAmos]] with the -q option as follows:

`toAmos -s my_reads.seq -q my_reads.qual -o my_reads.afg'

Minimus can also be called with the following equivalent command:

`runAmos -C $AMOSBASE/src/Pipeline/minimus.acf my_reads'
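The conversion and assembly steps above can be combined in a small shell function. This is only a sketch: the function name '''assemble_reads''' and the '''DRY_RUN''' switch are hypothetical and not part of AMOS, and the function assumes the reads are stored in <prefix>.seq. Only the toAmos options already shown on this page (-s, -q, -o) are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): convert reads and assemble them.
# Usage: assemble_reads <prefix> [<quality file>]
# Expects <prefix>.seq; set DRY_RUN=1 to print the commands instead of running them.
assemble_reads() {
    prefix=$1
    qual=${2:-}

    # Either echo the command (dry run) or execute it.
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "$*"
        else
            "$@"
        fi
    }

    # Step 1: build the AMOS message file, with quality data if supplied.
    if [ -n "$qual" ]; then
        run toAmos -s "$prefix.seq" -q "$qual" -o "$prefix.afg" || return 1
    else
        run toAmos -s "$prefix.seq" -o "$prefix.afg" || return 1
    fi

    # Step 2: assemble; minimus reads <prefix>.afg by default.
    run minimus "$prefix"
}
```

Calling the function with DRY_RUN set to 1 prints the two commands without executing them, which is a cheap way to check file names before starting an assembly.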

The AMOS package also includes other helpful tools such as [[Hawkeye]], which is useful for evaluating your assembly with respect to paired-end reads. It can be run on the minimus bank with the following command:

`hawkeye my_reads.bnk/'

== Publication ==

[http://www.biomedcentral.com/1471-2105/8/64 Minimus: a fast, lightweight genome assembler]

Sommer, DD, Delcher, AL, Salzberg, SL, and Pop, M. (2007) BMC Bioinformatics, 8:64. doi:10.1186/1471-2105-8-64.

== Acknowledgements ==
The development of minimus was supported by the National Institutes of Health under grants R01-LM06845 and R01-LM007938 to SLS and by the Department of Homeland Security under cooperative agreement W81XWH-05-2-0051.

Minimus/README

2010-01-14T22:24:50Z

Trgibbons: /* Example */

minimus - The AMOS Lightweight Assembler

== Brief Summary ==
minimus is an assembly pipeline designed specifically for small
data sets, such as the set of reads covering a specific gene. The code
also works for larger assemblies (we have used it to assemble bacterial
genomes); however, because of its stringency, the resulting assembly
will be highly fragmented. For large or complex assemblies, minimus
should be followed by additional processing steps, such as scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of
three main modules:

* overlapper - computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm

* tigger - uses the read overlaps to generate the layouts of reads representing individual contigs

* make-consensus - refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

==Dependencies==
None.

==Running==
Either execute the minimus configuration script directly from
$bindir OR copy it to your local directory, edit it, and run it with
the `runAmos' command interpreter. The following variables must be set
on the command line or added to the script for the pipeline to operate
properly:

TGT - The target genome sequences in AMOS message format (.afg)
`minimus -D TGT=<target> <prefix>'
OR
`runAmos -C minimus -D TGT=<target> <prefix>'

Where <prefix> will be the output file prefix, and <target> is the
input AMOS message file. Check the `runAmos' documentation or type
`runAmos --help' for details on operating an AMOS pipeline. The
minimus pipeline config file can be easily modified by the user to add
additional processing steps.

In order to run minimus you need to provide an AMOS formatted file
of the reads. Such a file (commonly with extension .afg) can be
generated from a combination of sequence (.seq), quality (.qual), and
Trace Archive XML (.xml) files using the [[ToAmos | toAmos]] or
[[Tarchive2amos | tarchive2amos]] programs which will appear in the
$bindir directory upon installation.

The default TGT file is <prefix>.afg, so if our input file is named
<prefix>.afg we can run minimus simply by typing:

`minimus <prefix>'
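The default-TGT rule can be sketched as a small shell function that prints the command line to use. The function name minimus_cmd is hypothetical and not part of AMOS; it only formats the two invocations documented above and does not run anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): print the minimus command line
# for an output prefix and an optional input message file.
# Usage: minimus_cmd <prefix> [<target.afg>]
minimus_cmd() {
    prefix=$1
    target=${2:-$prefix.afg}
    if [ "$target" = "$prefix.afg" ]; then
        # Input matches the default TGT, so no -D override is needed.
        printf 'minimus %s\n' "$prefix"
    else
        # Input has a different name, so TGT must be set explicitly.
        printf 'minimus -D TGT=%s %s\n' "$target" "$prefix"
    fi
}
```

For example, `minimus_cmd asm reads.afg' prints the explicit form with -D TGT=reads.afg, while `minimus_cmd asm' prints the short form.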

== Output ==
Output will be a TIGR .contig file and a FastA .fasta file. The
TIGR contig file contains the gapped consensus and multi-alignment
information for the assembly. Each contig sequence is preceded by a
header line which starts with '##', followed by the gapped consensus
sequence with gaps represented as a '-' character. Following the
consensus are the gapped read sequences, each preceded by a header line
beginning with '#'. The .fasta file contains all the contigs produced
by minimus in a multi-FastA formatted file. These sequences will match
the sequences in the .contig file, but without the gaps.
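The relationship between the two output files can be illustrated with a short awk sketch that rebuilds ungapped FastA records from a .contig file. It relies only on the layout described above ('##' contig headers, '#' read headers, '-' for gaps). The function name contig_to_fasta is hypothetical, and using the first word after '##' as the sequence name is an assumption; the header lines written by minimus itself may differ.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): print the ungapped consensus of
# every contig in a TIGR .contig file as FastA.
# Usage: contig_to_fasta <prefix>.contig
contig_to_fasta() {
    awk '
        # "##" starts a new contig: flush the previous one, remember the name.
        /^##/ { if (name != "") print_seq()
                name = substr($1, 3); seq = ""; in_cons = 1; next }
        # A single "#" starts a read record: the consensus has ended.
        /^#/  { in_cons = 0; next }
        # Consensus lines: drop the gap characters and append.
        in_cons { gsub(/-/, ""); seq = seq $0 }
        END   { if (name != "") print_seq() }
        function print_seq() { print ">" name; print seq }
    ' "$1"
}
```

Each sequence is written on a single line rather than wrapped, which is enough for comparing against the .fasta file.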

To obtain an ACE format representation of the assembly (a
<prefix>.ace file), we can run the following:

`bank-report -b <prefix>.bnk CTG > <prefix>.ctg'
`amos2ace <prefix>.afg <prefix>.ctg'

Where <prefix> is the same as was used in the above section and
<prefix>.afg is the original input to the assembly pipeline. We can
simply add these commands to the runAmos config file to produce an ACE
file every time we run minimus.
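As an alternative to editing the config file, the two commands can be wrapped in a shell function. This is a sketch only: the name make_ace is hypothetical and not part of AMOS, and the function simply checks that the bank and the original message file exist before running the commands shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of AMOS): produce <prefix>.ace after a
# finished minimus run.
# Usage: make_ace <prefix>
make_ace() {
    prefix=$1
    # Both the bank and the original message file are needed.
    if [ ! -d "$prefix.bnk" ] || [ ! -f "$prefix.afg" ]; then
        echo "make_ace: need $prefix.bnk and $prefix.afg" >&2
        return 1
    fi
    # Dump the contig records from the bank, then convert them to ACE.
    bank-report -b "$prefix.bnk" CTG > "$prefix.ctg" &&
    amos2ace "$prefix.afg" "$prefix.ctg"
}
```

The check runs before any output is written, so a mistyped prefix does not leave an empty <prefix>.ctg file behind.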

==Example==
Assume we have a set of Trace Archive data with the names
`target.seq', `target.qual' and `target.xml' which contain the
sequence information for a small assembly task. To run the minimus
pipeline and generate the default output, we would type the following:

`tarchive2amos -o target target.seq'
`minimus -D TGT=target.afg target'

This will generate the default output named `target.contig' and
`target.fasta'. We could then generate an ACE assembly format file by
following the instructions in the above section, substituting "target"
for "<prefix>".

Minimus is now packaged with two example assemblies: an Influenza A
assembly and a zebrafish gene assembly. Both are in the 'test'
directory, which is located in the main AMOS directory after you untar
the AMOS tarball.

Minimus/README

2010-01-14T22:24:09Z

Trgibbons: /* Output */

Minimus/README

2010-01-14T22:23:23Z

Trgibbons: /* Running */

Minimus

2010-01-14T22:17:46Z

Trgibbons: /* Overview */

Minimus

2010-01-14T20:11:12Z

Trgibbons: /* Basic usage example */ I updated the minimus home page in an attempt to make it more approachable. I will not be offended if this is rolled back or radically modified.

== Overview ==

minimus is an assembly pipeline designed specifically for small data sets, such as the set of reads covering a specific gene. The code also works for larger assemblies (we have used it to assemble bacterial genomes); however, because of its stringency, the resulting assembly will be highly fragmented. For large or complex assemblies, minimus should be followed by additional processing steps, such as scaffolding.

minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules:

* [[hash-overlap]] - computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - refines the layouts produced by the tigger to generate accurate multiple alignments within the reads

minimus uses AMOS messages as both its inputs and its outputs. Please see the [[File conversion utilities]] documentation for more information.

[[minimus2]] is a modified version of the minimus pipeline designed for merging two sequence sets. Instead of hash-overlap, it uses a nucmer-based overlap detector, which is much faster.

== Documentation ==

Documentation on running minimus is included with the distribution in the /docs subdirectory.

See [[Minimus/README]].

== Examples ==

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in /docs/minimus.README.

== Basic usage ==

To run minimus you will need a set of sequence files. Assuming you have a set of reads in FASTA format called '''my_reads.seq''', you can run minimus with the following two commands:

toAmos -s my_reads.seq -o my_reads.afg

minimus my_reads

The output will be a FASTA-formatted file called '''my_reads.fasta''', a contig file called '''my_reads.contig''' with details about the assembly of each contig, and an AMOS bank folder ('''my_reads.bnk''') containing various files used internally by minimus.

The toAmos file conversion utility is the most general, and probably the most useful, of the file conversion utilities included with AMOS. More information about toAmos and the [[File_conversion_utilities | other file conversion utilities]] can be found in the [[AMOS | AMOS documentation wiki]]. For example, you can include quality data from a Phred-style quality score file by running [[ToAmos | toAmos]] with the -q option as follows:

toAmos -s my_reads.seq -q my_reads.qual -o my_reads.afg

Minimus can also be called with the following equivalent command:

runAmos -C $AMOSBASE/src/Pipeline/minimus.acf my_reads

The AMOS package also includes other helpful tools such as [[Hawkeye]], which is useful for evaluating your assembly with respect to paired-end reads. It can be run on the minimus bank with the following command:

hawkeye my_reads.bnk/

== Publication ==

[http://www.biomedcentral.com/1471-2105/8/64 Minimus: a fast, lightweight genome assembler]

Sommer, DD, Delcher, AL, Salzberg, SL, and Pop, M. (2007) BMC Bioinformatics, 8:64. doi:10.1186/1471-2105-8-64.

== Acknowledgements ==
The development of minimus was supported by the National Institutes of Health under grants R01-LM06845 and R01-LM007938 to SLS and by the Department of Homeland Security under cooperative agreement W81XWH-05-2-0051.