AMOS WIKI - User contributions [en]

AMOS

2012-05-04T05:32:10Z

Floflooo:

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* August 5, 2011 - [http://sourceforge.net/projects/amos/files/amos/3.1.0/ Version 3.1.0] of AMOS released!
* August 2, 2011 - [http://sourceforge.net/projects/amos/files/sample_data/ AMOS Sample Data] posted
* December 7, 2010 - [http://sourceforge.net/projects/amos/files/amos/3.0.0/ Version 3.0.0] of AMOS released!

== Documentation ==

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa, 454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment-based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[Minimo]] - the minimus assembler with many more options: short read support, variable stringency, strand-specificity, various output formats
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[minimus2-blat]] - Same as minimus2 but uses BLAT instead of Nucmer for added speed

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[FRCurve]] - Feature-Response Curve
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

Additional documentation in development through the [[AMOS Documentation Project]]

== Download ==
The AMOS source is freely available for download from the File Release Section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution. Please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please send all correspondence by email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

Minimo

2011-11-12T02:25:48Z

Floflooo:

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Just like [[minimus|Minimus]], Minimo follows the Overlap-Layout-Consensus paradigm.

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. Additional parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity), or to do a strand-specific assembly. You can use Minimo on short reads, but the number of sequences should be kept reasonable!

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.
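The effect of the two stringency parameters can be pictured with a small C++ sketch. It is not Minimo's code: the helper names are hypothetical, and it assumes the overlap has already been aligned into two equal-length strings (gaps written as '-'). It only shows how a minimum length and a minimum identity percentage together decide whether an overlap is kept.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Minimo: percent identity of two
// equal-length, already aligned overlap strings (gaps written as '-').
double percentIdentity(const std::string& a, const std::string& b) {
    assert(a.size() == b.size() && !a.empty());
    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == b[i] && a[i] != '-') ++matches;
    return 100.0 * static_cast<double>(matches) / static_cast<double>(a.size());
}

// Keep an overlap only if it satisfies both thresholds. minLen and
// minIdent play the roles of the minimum overlap length and the
// minimum overlap identity described above.
bool acceptOverlap(const std::string& a, const std::string& b,
                   std::size_t minLen, double minIdent) {
    if (a.empty() || a.size() != b.size() || a.size() < minLen) return false;
    return percentIdentity(a, b) >= minIdent;
}
```

With a 40-base overlap containing a single mismatch (97.5 % identity), a threshold of 98 rejects the overlap while a threshold of 95 accepts it, which is why lowering the identity tends to join more reads at the cost of fidelity.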

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=<s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo you will need a set of sequence files. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

For the assembly of transcripts or other directional sequence datasets, try a strand-specific assembly:

Minimo my_reads.fa -D STRAND_SPEC=1

== Publication ==

[http://dx.doi.org/10.1002/0471250953.bi1108s33 Next generation sequence assembly with AMOS]

Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M. (2011) Curr Protoc Bioinformatics, Chapter 11:Unit 11.8, doi:10.1002/0471250953.bi1108s33

Minimus

2011-11-12T02:23:21Z

Floflooo: /* Publication */

== Overview ==

Minimus is one of several assembly pipelines included in the AMOS software package. It is designed specifically for small datasets, such as the set of reads covering a specific gene. Note that the code will work for larger assemblies (we have used it to assemble bacterial genomes); however, due to its stringency, the resulting assembly will be highly fragmented. For large and/or complex assemblies the execution of Minimus should be followed by additional processing steps, such as scaffolding.

Minimus follows the Overlap-Layout-Consensus paradigm and consists of three main modules which share information through a central file bank:

* [[hash-overlap]] - Computes the overlaps between the reads using a modified version of the Smith-Waterman local alignment algorithm
* [[tigger]] - Uses the read overlaps to generate the layouts of reads representing individual contigs
* [[make-consensus]] - Refines the layouts produced by the tigger to generate accurate multiple alignments within the reads
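To make the overlap step more concrete, here is a deliberately simplified C++ sketch. It is not the hash-overlap algorithm: hash-overlap uses a modified Smith-Waterman alignment and tolerates mismatches, whereas this hypothetical function only finds exact suffix-prefix matches between two reads. It shows the kind of relationship the layout step later consumes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length of the longest exact suffix of `a` that is also a prefix of `b`.
// Returns 0 if no such overlap of at least `minLen` bases exists.
// Simplification: exact matching only, no alignment and no error tolerance.
std::size_t suffixPrefixOverlap(const std::string& a, const std::string& b,
                                std::size_t minLen) {
    std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t len = maxLen; len >= minLen && len > 0; --len) {
        // Compare the last `len` bases of a with the first `len` bases of b.
        if (a.compare(a.size() - len, len, b, 0, len) == 0) return len;
    }
    return 0;
}
```

For the reads ACGTAC and TACGGA the function reports an overlap of 3 bases (TAC). Raising minLen to 4 makes the same pair report no overlap.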

Minimus uses AMOS message files as both the inputs and the outputs. Please see the [[File conversion utilities]] documentation for more information.

[[minimus2 | Minimus2]] is a modified version of the minimus pipeline designed for merging two sequence sets. Instead of hash-overlap it uses a nucmer-based overlap detector, which is much faster.

== Documentation ==

Documentation on running minimus is included with the distribution in the /docs subdirectory.

See [[Minimus/README]].

== Examples ==

Examples of a flu assembly and a Zebrafish gene can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in /docs/minimus.README.

== Basic usage ==

To run minimus you will need a set of sequence files. Assuming you have a set of reads in FASTA format called '''my_reads.seq''', you can run minimus with the following two commands:

toAmos -s my_reads.seq -o my_reads.afg

minimus my_reads

The output will be a FASTA-formatted file called '''my_reads.fasta''', a contig file with details about the assembly of each contig called '''my_reads.contig''', and an AMOS bank folder with various files used internally by minimus.
The toAmos file conversion utility is the most general and probably the most useful of the file conversion utilities included with minimus. More information about toAmos and the [[File_conversion_utilities | other file conversion utilities]] can be found in the [[AMOS | AMOS documentation wiki]]. For example, you can include quality data from a Phred style quality score file by running [[ToAmos | toAmos]] with the -q option as follows:

toAmos -s my_reads.fasta -q my_reads.qual -o asm_reads.afg

Minimus can also be called with the following equivalent command:

runAmos -C $AMOSBASE/src/Pipeline/minimus.acf asm_reads

The AMOS package also includes other helpful tools such as [[Hawkeye]], which is useful for evaluating your assembly with respect to paired-end reads. It can be run on the minimus bank with the following command:

hawkeye asm_reads.bnk/

== Publication ==

[http://www.biomedcentral.com/1471-2105/8/64 Minimus: a fast, lightweight genome assembler]

Sommer, DD, Delcher, AL, Salzberg, SL, and Pop, M. (2007) BMC Bioinformatics, 8:64, doi:10.1186/1471-2105-8-64.

== Acknowledgements ==
The development of minimus was supported by the National Institutes of Health under grants R01-LM06845 and R01-LM007938 to SLS and by Department of Homeland Security cooperative agreement W81XWH-05-2-0051.

Programmer's guide

2011-11-09T05:38:32Z

Floflooo: /* The C++ API */

{| align="right"
| __TOC__
|}

== Getting AMOS ==
AMOS can be downloaded from our Sourceforge download site: http://sourceforge.net/project/showfiles.php?group_id=134326 as a tar file, or directly from the AMOS git repository (see below).

=== The .tar file ===
If you choose to download AMOS as a .tar file, getting started is as simple as untarring the file, running "./configure" from the top level directory, then "make all". For more details see the Getting Started document as well as the INSTALL file provided in the top level directory.

=== GIT access ===
To access AMOS directly, you can clone a copy of the source code to your local machine:

## clone the remote master repo to a local copy
git clone git://amos.git.sourceforge.net/gitroot/amos/amos

If you are a registered AMOS developer with read/write access to the repository, you can check out the code using:

## clone the remote master to a local copy (replace SFNAME with your sourceforge username)
git clone ssh://<SFNAME>@amos.git.sourceforge.net/gitroot/amos/amos

## make some changes

## now commit your changes to your local repo
git commit -a -m "brief change message"

## once you are happy, send the changes to the master repo
git push

## update local repo with remote
git pull

This page lists recent changes
[http://amos.git.sourceforge.net/git/gitweb.cgi?p=amos/amos;a=summary http://amos.git.sourceforge.net/git/gitweb.cgi?p=amos/amos;a=summary]

Here are a couple of tutorials on how to use git to commit changes, make branches, etc.:

[http://git-scm.com/documentation http://git-scm.com/documentation]: Detailed documentation
[http://git-scm.com/course/svn.html http://git-scm.com/course/svn.html]: Fast tutorial for svn users

Before being able to compile the AMOS code you will need to create the appropriate configuration files with the command "./bootstrap" run from the top level directory. You will then be able to continue with compilation as described above under the .tar file.

If you wish to play a more involved role in the development of AMOS, or if you wish to contribute some of your code or bug fixes, please contact us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Autoconf basics (how to add your own code to the source tree) ==
This section is not meant as documentation for the GNU autoconf package. Below you will learn how to add a program to the AMOS distribution, in an already existing directory. If you want help with a more complex autoconf operation please contact us at the email listed above.

The template for the Makefile file that will be created by the configure command (see description of compilation above) can be found in the file Makefile.am in each of the directories. This file consists of two sections: a description of the files that are going to be installed when running "make install", and a description of each of the files that will be compiled as part of the "make" command. If you wish to add a program to the AMOS tree, you will thus need to add both a record indicating this program will be installed by the make process, and instructions on how to build this program. The instructions for adding a script (either a Perl script or an AMOS configuration file), or a C++ program are described below.

=== Adding a script to the AMOS tree ===
To add a script you can simply list it in the "dist_bin_SCRIPTS" variable at the beginning of the Makefile.am file, e.g.:

dist_bin_SCRIPTS = \
bank-unlock.pl

The build process will automatically add a "use lib" line to the beginning of your Perl scripts indicating where the AMOS code is installed. Furthermore, the #! line will be appropriately modified according to the location of the Perl binary identified by the configure process.

When building AMOS configuration files, the build process will automatically update the BINDIR and NUCMER variables in your file to the values identified by the configure process for the location of the AMOS binary installation directory, and for the location of the nucmer binary (part of the MUMmer distribution).

=== Adding a C++ program to the AMOS tree ===
To add a C++ program to AMOS, you must first add the name of the program to the "bin_PROGRAMS" variable in the Makefile.am file:

bin_PROGRAMS = \
bank2contig

You must then specify instructions on how this binary will be built. These instructions include the location of the source files used in building the program:

bank2contig_SOURCES = \
bank2contig.cc

instructions on additional libraries that might be needed:

bank2contig_LDADD = \
$(top_builddir)/src/Common/libCommon.a \
$(top_builddir)/src/AMOS/libAMOS.a

or additional flags:

bank2contig_CPPFLAGS = \
-I$(top_srcdir)/src/Common

If you wish to use the global library and CFLAGS parameters you may provide just the _SOURCES variable.

== AMOS messages and the Perl API ==
AMOS programs can communicate with each other using a flat file format inspired by the format used by Celera Assembler. An overview of this file format and the way AMOS objects are stored is provided on the [[Infrastructure]] page.

The AMOS distribution provides a Perl module that can be used to parse AMOS (and Celera Assembler) message files. For a detailed description of the various functions provided by the AMOS::AmosLib module you can use the perldoc documentation:

$ perldoc AMOS::AmosLib

Below we will only describe the use of this module to read and parse AMOS messages.

To include the AMOS::AmosLib module in your perl program you will need to use the command:

use AMOS::AmosLib;

at the beginning of the code. If this module is not installed in the Perl search path (which can be set in the PERLLIB environment variable), you might have to also use the Perl command "use lib" to specify the location of the AMOS library.

Like the C++ API (described below), reading AMOS messages from a file involves first reading the message in its entirety, oblivious of the data encoded within, then parsing the message to extract the individual components. These two steps can be executed as follows:

my $rec = getRecord(\*STDIN); # read a record from the standard input
my ($id, $fields, $recs) = parseRecord($rec); # parse the information in the message

The first command retrieves the entire message from the input, i.e. a whole block of text between curly braces.
The second command retrieves the three components of the message:

1. $id - the three letter code of the message (see Types of messages)
2. $fields - hash table of the individual fields in the message. E.g. for a read ($id eq "RED"), $$fields{"seq"} represents the sequence of the read.
3. $recs - array of any possible sub-messages. These messages will need to be parsed individually with the parseRecord command. An example of a sub-message is the TLE (tile) message indicating the position of a read within a contig. $#$recs represents the index of the last sub-message (if $#$recs == -1, there are no sub-messages).

== The C++ API ==
Below is a quick overview of the AMOS C++ API. The quickest way to get started is to examine the file src/Bank/bank-tutorial.cc. This file highlights the interaction with the AMOS bank through the C++ API and contains copious comments meant to guide you through your first AMOS program.

For a detailed description of all AMOS classes refer to the automatically generated doxygen API docs: http://amos.sourceforge.net/docs/api/

The main AMOS data structure is the bank - an indexed database of assembly objects. This central data structure allows the integration of multiple software modules that communicate by modifying the objects stored in a shared bank.

=== Overview of include files ===

#include <foundation_AMOS.hh> all of the below

#include <inttypes_AMOS.hh> integer typedefs
#include <exceptions_AMOS.hh> exception types
#include <datatypes_AMOS.hh> structs
#include <databanks_AMOS.hh> bank types
#include <messages_AMOS.hh> message types and message NCodes
#include <universals_AMOS.hh> assembly classes

=== Basic terminology ===

* IID internal integer identifier and object reference
* EID external string identifier
* BID bank specific identifier (index of the file store, may be invalidated by bank operations)
* 3-Code 3-character identifier string for objects and fields
* N-Code an integer representation of a 3-code (Encode/Decode functions)
* message a single curly-bracketed AMOS message (see message grammar)
* sub-message a single curly-bracketed AMOS message contained by another (see message grammar)
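The relationship between a 3-code and its N-code can be illustrated with a small C++ sketch. The packing used here (one byte per character, first character in the lowest byte) is an assumption made for illustration, and encode3/decode3 are hypothetical names; the Encode/Decode functions in AMOS may pack the characters differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical packing of a 3-character code into an integer:
// one byte per character, first character in the lowest byte.
// The real AMOS Encode function may use a different layout.
std::uint32_t encode3(const std::string& code) {
    assert(code.size() == 3);
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(code[0]))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[1])) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[2])) << 16);
}

// Inverse of encode3: unpack the three bytes back into a string.
std::string decode3(std::uint32_t n) {
    std::string code(3, '\0');
    code[0] = static_cast<char>(n & 0xFF);
    code[1] = static_cast<char>((n >> 8) & 0xFF);
    code[2] = static_cast<char>((n >> 16) & 0xFF);
    return code;
}
```

The point of such a representation is that comparing or hashing an integer is cheaper than comparing strings, while the round trip decode3(encode3("RED")) still recovers the readable code.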

Relative orientation of reads/contigs (used in overlaps or scaffold links)

normal ---a---> ---b--->
anti-normal <---a--- <---b---
innie ---a---> <---b---
outie <---a--- ---b--->
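The four orientations follow directly from the directions of the two elements. A minimal C++ sketch, using hypothetical names rather than the types defined by the AMOS library:

```cpp
#include <cassert>
#include <string>

enum class Orientation { Normal, AntiNormal, Innie, Outie };

// aForward / bForward: true when the element points left to right (--->).
Orientation classify(bool aForward, bool bForward) {
    if (aForward && bForward)   return Orientation::Normal;      // ---a---> ---b--->
    if (!aForward && !bForward) return Orientation::AntiNormal;  // <---a--- <---b---
    if (aForward)               return Orientation::Innie;       // ---a---> <---b---
    return Orientation::Outie;                                   // <---a--- ---b--->
}

// Human-readable name, matching the labels used in the diagram above.
std::string orientationName(Orientation o) {
    switch (o) {
        case Orientation::Normal:     return "normal";
        case Orientation::AntiNormal: return "anti-normal";
        case Orientation::Innie:      return "innie";
        case Orientation::Outie:      return "outie";
    }
    assert(false && "unreachable: all enumerators are handled");
    return "";
}
```

For example, classify(true, false) yields the innie orientation, in which the two elements point towards each other.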

=== Dealing with AMOS message files ===
Reading an AMOS message from a file is as simple as:

Message_t msg;
msg.read (cin);

Note that the msg object is generic, representing a properly formatted message object (see message grammar), irrespective of the actual assembly object represented by the message. This object can be used to read arbitrary message files, such as those generated by Celera Assembler, even though the individual objects do not map to AMOS objects.
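What "generic" means here can be sketched without the AMOS library. The parser below is a hypothetical stand-in, not the Message_t class, and it assumes a simplified grammar: a header line "{XXX", single-line fields written as "tag:value", multi-line fields written as "tag:" followed by lines and terminated by a line containing only ".", nested sub-messages, and a closing "}". It keeps the code, the fields and the sub-messages without knowing what object they describe.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a generic message; not the AMOS Message_t class.
struct SimpleMessage {
    std::string code;                           // three-letter code, e.g. "CTG"
    std::map<std::string, std::string> fields;  // field name -> value
    std::vector<SimpleMessage> subs;            // nested sub-messages
};

// Parse the body of a message whose header line ("{XXX") was already read.
SimpleMessage parseBody(std::istream& in, const std::string& header) {
    assert(!header.empty() && header[0] == '{');
    SimpleMessage msg;
    msg.code = header.substr(1);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line == "}") return msg;                 // end of this message
        if (line[0] == '{') {                        // nested sub-message
            msg.subs.push_back(parseBody(in, line));
            continue;
        }
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos)
            throw std::runtime_error("malformed field: " + line);
        std::string name = line.substr(0, colon);
        std::string value = line.substr(colon + 1);
        if (value.empty()) {                         // assumed multi-line field
            std::string part;
            while (std::getline(in, part) && part != ".")
                value += part + "\n";
        }
        msg.fields[name] = value;
    }
    throw std::runtime_error("unterminated message");
}

// Read the next top-level message from the stream.
SimpleMessage parseMessage(std::istream& in) {
    std::string line;
    while (std::getline(in, line))
        if (!line.empty() && line[0] == '{') return parseBody(in, line);
    throw std::runtime_error("no message found");
}
```

Feeding the sketch a std::istringstream holding a contig message with one tile sub-message yields a SimpleMessage whose code is "CTG" and whose subs vector has one entry with code "TLE". Interpreting those fields as a contig is a separate step.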

To assign the message contents to a specific object, e.g. a contig:

Contig_t contig;
contig.readMessage(msg);

Note that the readMessage operation will fail if the message does not properly encode an AMOS contig.

The reverse operation, writing a new message from an internal AMOS object, can be performed simply:

contig.writeMessage(msg);
msg.write(cout);

=== Communicating with the bank ===
AMOS banks can be opened in two modes: for random access (bank mode), and for sequential access (bank stream mode). To open a bank you must also specify the type of the objects stored in it, by providing the N-code of the object. Thus, to open a bank of contigs:

Bank_t contig_bank(Contig_t::NCODE);
BankStream_t contig_stream(Contig_t::NCODE);

contig_bank.open("mybank.dir");
contig_stream.open("mybank.dir");

The string "mybank.dir" refers to the physical location of the bank on the disk, and represents the name of a directory that contains all the relevant bank files. In addition to the location of the bank, the open() command may specify a mode of access as B_READ, or B_WRITE, or both (B_READ|B_WRITE) (the default access is B_READ):

contig_bank.open("mybank.dir", B_READ|B_WRITE);

Bank streams can only be used for sequential access, e.g.:

Contig_t contig;
contig_stream >> contig; // read from bank
contig_stream << contig; // write to bank

The sequential access mode is useful for processing anonymous objects (without an assigned IID or EID), or simply for ease of use.

Random access banks can be used to perform more complex operations:

// lookup by IID
if (! contig_bank.existsIID(1))
cerr << "Cannot find object with iid 1" << endl;

// lookup by EID
if (! contig_bank.existsEID("bigcontig"))
cerr << "Cannot find object with eid bigcontig" << endl;

contig_bank.fetch(1, contig); // retrieve object by IID
contig_bank.fetch("bigcontig", contig); // retrieve object by EID

contig_bank.append(contig); // add an object to the bank
contig_bank.remove(1); // remove an object by IID
contig_bank.remove("bigcontig"); // remove an object by EID

Note that by default objects are not physically removed from the bank when using the remove command, rather they are marked for deletion. To compact the bank after several remove operations you will need to run

contig_bank.clean();
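The difference between marking an object for deletion and compacting the store can be shown with a toy container. This is only a sketch of the general technique, assuming a simple in-memory layout; ToyBank and its members are hypothetical and say nothing about how Bank_t is implemented on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical toy store, not the AMOS Bank_t implementation.
class ToyBank {
public:
    // Add an object under a caller-chosen IID.
    void append(std::uint32_t iid, const std::string& payload) {
        assert(index_.count(iid) == 0);
        index_[iid] = slots_.size();
        slots_.push_back(Slot{iid, payload, false});
    }

    bool existsIID(std::uint32_t iid) const { return index_.count(iid) != 0; }

    // Mark the object as deleted; its slot still occupies storage.
    void remove(std::uint32_t iid) {
        std::map<std::uint32_t, std::size_t>::iterator it = index_.find(iid);
        assert(it != index_.end());
        slots_[it->second].deleted = true;
        index_.erase(it);
    }

    // Physically drop deleted slots and rebuild the IID -> slot index.
    // Slot positions change here, IIDs do not.
    void clean() {
        std::vector<Slot> kept;
        index_.clear();
        for (std::size_t i = 0; i < slots_.size(); ++i) {
            if (slots_[i].deleted) continue;
            index_[slots_[i].iid] = kept.size();
            kept.push_back(slots_[i]);
        }
        slots_.swap(kept);
    }

    std::size_t storedSlots() const { return slots_.size(); }

private:
    struct Slot { std::uint32_t iid; std::string payload; bool deleted; };
    std::vector<Slot> slots_;
    std::map<std::uint32_t, std::size_t> index_;
};
```

After three appends and one remove, storedSlots() still reports 3, and only clean() brings it down to 2. The slot positions shift during compaction while the IIDs stay valid, which mirrors the remark in the terminology list that a BID may be invalidated by bank operations.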

=== Indices ===
There is often a need to cross-reference the various objects stored in a bank, e.g. to obtain the list of reads present in a contig, or, for a read, to identify the contig or scaffold it belongs to. Some such relationships are natively represented in the AMOS objects (e.g. contig messages also list the reads belonging to them); for others it is necessary to build lookup tables. AMOS helps you by providing a generic mechanism for specifying lookup tables linking arbitrary AMOS types. The AMOS indices are implemented using STL hash multi-maps, which allow one-to-many correspondences.

A simple example of the use of indices is shown below. The code generates a map linking each read to its mate (this information is normally contained in the Fragment_t object).

Index_t rd2mate;
rd2mate.buildReadMate("mybank"); // build index linking reads to their mates in the bank "mybank"

ID_t mate = rd2mate.lookup(5); // find mate of read with IID=5
if (mate == NULL_ID) // if no mate found, returns NULL_ID
cerr << "Read 5 has no mate " << endl;

This example relied on the pre-defined function buildReadMate that automatically builds an index of reads to mates. Several such predefined functions are provided, see the documentation for the Index_t object. If you need to build your own index, for which no predefined build function exists, you can use the insert command to add an identifier pair to the index:

Index_t obj2obj;
obj2obj.insert(id1, id2);

In case of a one-to-many mapping (e.g. all the reads in a scaffold) you can retrieve all the IDs corresponding to a query ID using:

pair<Index_t::const_iterator, Index_t::const_iterator> startend = obj2obj.lookupAll(myid);
for (Index_t::const_iterator i = startend.first; i != startend.second; ++i)
cout << "Found id " << i->second << endl; // each entry is a (query ID, matching ID) pair
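The one-to-one and one-to-many lookups described in this section can be reproduced with a standard hash multi-map. The sketch below does not use Index_t: the type and function names are hypothetical, and the value 0 for "not found" is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::uint32_t Id;                          // stand-in for an AMOS ID type
typedef std::unordered_multimap<Id, Id> ToyIndex;  // one key, many values
const Id kNullId = 0;                              // hypothetical "not found" value

// Record that two reads are mates of each other (both directions).
void insertMatePair(ToyIndex& index, Id a, Id b) {
    assert(a != b);
    index.insert(std::make_pair(a, b));
    index.insert(std::make_pair(b, a));
}

// Return one value stored under `key`, or kNullId if there is none.
Id lookupOne(const ToyIndex& index, Id key) {
    ToyIndex::const_iterator it = index.find(key);
    return it == index.end() ? kNullId : it->second;
}

// Return every value stored under `key` (one-to-many lookup).
std::vector<Id> lookupAll(const ToyIndex& index, Id key) {
    std::vector<Id> out;
    std::pair<ToyIndex::const_iterator, ToyIndex::const_iterator> range =
        index.equal_range(key);
    for (ToyIndex::const_iterator it = range.first; it != range.second; ++it)
        out.push_back(it->second);
    return out;
}
```

A read-to-mate index uses insertMatePair and lookupOne, while a scaffold-to-reads index inserts several values under the same key and retrieves them with lookupAll. The order of the values returned by lookupAll is not specified.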

AMOS

2011-10-19T03:44:36Z

Floflooo: /* Assemblers */


Minimo

2011-10-19T03:42:39Z

Floflooo: /* Overview */

Minimo

2011-10-19T03:25:40Z

Floflooo: /* Overview */

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Use it on moderately-sized data! Minimo follows the Overlap-Layout-Consensus paradigm, just like [[minimus|Minimus]].

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. Additional parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity), or to do a strand-specific assembly.

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.
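To make the two stringency parameters concrete, here is a minimal sketch of how an overlap could be tested against a minimum length and a minimum identity. It is an illustration only: the function name is hypothetical, identity is assumed to be the percentage of matching positions in the overlap, and Minimo's internal overlap computation may differ. The default values are the MIN_LEN and MIN_IDENT defaults listed in the usage message.

```python
def overlap_passes(overlap_len, matches, min_len=35, min_ident=98.0):
    """Return True if an overlap meets both stringency thresholds.

    overlap_len: number of aligned positions in the overlap
    matches:     number of those positions where the two reads agree
    (Both the function and this identity definition are assumptions.)
    """
    if overlap_len <= 0:
        return False
    identity = 100.0 * matches / overlap_len
    return overlap_len >= min_len and identity >= min_ident


# A 40 bp overlap with one mismatch has 97.5 % identity.
print(overlap_passes(40, 39))                  # rejected at the default 98 %
print(overlap_passes(40, 39, min_ident=95.0))  # accepted at a relaxed 95 %
```

Lowering either threshold lets more overlaps through, which is why the assembly becomes less fragmented but also more prone to chimeric contigs.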

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=< s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a file of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

For the assembly of transcripts or other directional sequence datasets, try a strand-specific assembly:

Minimo my_reads.fa -D STRAND_SPEC=1
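If Minimo is driven from a script, the command lines above can be assembled programmatically. The following is a small sketch, not part of AMOS: the helper name is hypothetical, and it only builds the argument list using the "-D NAME=value" option syntax shown in the usage message.

```python
import shlex


def minimo_command(fasta_in, **options):
    """Build the argument list for a Minimo run from keyword options."""
    args = ["Minimo", fasta_in]
    for name, value in options.items():
        args += ["-D", f"{name}={value}"]
    return args


cmd = minimo_command("my_reads.fa", MIN_LEN=80, MIN_IDENT=90, FASTA_EXP=1)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if Minimo is installed and on the PATH.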

Minimo

2011-10-19T03:24:41Z

Floflooo: /* Basic usage */

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Use it on moderately-sized data! Minimo follows the Overlap-Layout-Consensus paradigm, just like [[minimus|Minimus]].

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. In addition, two parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity).

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=< s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a file of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

For the assembly of transcripts or other directional sequence datasets, try a strand-specific assembly:

Minimo my_reads.fa -D STRAND_SPEC=1

Minimo

2011-10-19T03:24:15Z

Floflooo: /* Basic usage */

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Use it on moderately-sized data! Minimo follows the Overlap-Layout-Consensus paradigm, just like [[minimus|Minimus]].

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. In addition, two parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity).

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=< s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a file of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

For the assembly of transcripts or other directional sequence datasets, try:

Minimo my_reads.fa -D STRAND_SPEC=1

Minimo

2011-10-19T03:23:43Z

Floflooo: /* Basic usage */

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Use it on moderately-sized data! Minimo follows the Overlap-Layout-Consensus paradigm, just like [[minimus|Minimus]].

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. In addition, two parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity).

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=< s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a file of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

For the assembly of transcripts or other directional sequence datasets, try:

Minimo my_reads.fa -D STRAND_SPEC=1

Minimo

2011-10-19T03:22:14Z

Floflooo: /* Documentation */

== Overview ==

Minimo is largely based on [[minimus|Minimus]], and as such favours assembly quality over speed. Use it on moderately-sized data! Minimo follows the Overlap-Layout-Consensus paradigm, just like [[minimus|Minimus]].

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs formatted in ACE and FASTA. In addition, two parameters can be used to tune the assembly stringency (minimum overlap length and minimum identity).

Generally, decreasing the minimum overlap identity results in a less fragmented but likely less faithful assembly, as sequencing errors or small variations between closely related species (in the case of metagenomic data) might cause chimeric contigs. Similarly, decreasing the minimum overlap length might produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving the assembly of small repeated regions.

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

Minimo -h

The usage message is:

Minimo is a de novo assembler based on the AMOS infrastructure. Minimo uses a
conservative overlap-layout-consensus algorithm to avoid mis-assemblies and
can be applied to short read or strand-specific assemblies. The input is a
FASTA file and there are options to control the stringency of the assembly
and the processing of the quality scores. By default, the results are in the
AMOS format and written to the directory where the input FASTA file is located.
Usage:
Minimo FASTA_IN [options]
Options:
-D QUAL_IN=<file> Input quality score file (in Phred format)
-D GOOD_QUAL=<n> Quality score to set for bases within the clear
range if no quality file was given (default: 30)
-D BAD_QUAL=<n> Quality score to set for bases outside clear range
if no quality file was given (default: 10). If your
sequences are trimmed, try the same value as GOOD_QUAL.
-D MIN_LEN=<n> Minimum contig overlap length (at least 20 bp,
default: 35)
-D MIN_IDENT=<d> Minimum contig overlap identity percentage (between 0
and 100 %, default: 98)
-D STRAND_SPEC=<n> Do a strand-specific assembly (e.g. for transcripts)
(0:no 1:yes, default: 0)
-D ALN_WIGGLE=<d> Alignment wiggle value (from 2 for short reads to 15 for
long reads, default: 2)
-D FASTA_EXP=<n> Export results in FASTA format (0:no 1:yes, default: 0)
-D ACE_EXP=<n> Export results in ACE format (0:no 1:yes, default: 0)
-D OUT_PREFIX=< s> Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a file of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

Minimo my_reads.fa

To export the contigs in a FASTA file or in ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90

Amosvalidate

2011-07-24T23:21:00Z

Floflooo: /* Running amosvalidate */

Automated assembly validation pipeline.

Adam Phillippy, Michael Schatz, Mihai Pop
Center for Bioinformatics and Computational Biology, University of Maryland

Publication: Genome assembly forensics: finding the elusive mis-assembly. Phillippy AM, Schatz MC, Pop M. Genome Biol. 2008;9(3):R55.

== Overview ==

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled.

Related tools:
* [[Hawkeye]]
* [[MUMmer]]

== Running amosvalidate ==

amosvalidate reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of other assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact: toAmos reads the assembly files and converts them to plaintext AMOS messages, and bank-transact reads those messages and creates the binary encoded bank directory. See the AMOS Assembly Conversion Page for more information.

For example:

$ toAmos -f assembly.frg -a assembly.asm -o - | bank-transact -m - -o assembly.bnk -c

Creates the bank assembly.bnk from the files assembly.frg and assembly.asm, which are the input and output files for the Celera Assembler.

$ toAmos -ace assembly.ace -o - | bank-transact -m - -o assembly.bnk -c

Creates the bank assembly.bnk from an ace file, which is an output format for many assemblers including Phrap, Arachne, Velvet and Newbler. Check your assembler's documentation for more information on creating ACE files. More information on converting to AMOS is available in the toAmos documentation.

$ tarchive2amos -o assembly -assembly ASSEMBLY.xml TRACEINFO.seq;
$ bank-transact -m assembly.afg -b assembly.bnk -c

Creates the bank assembly.bnk from an assembly archive XML file called ASSEMBLY.xml. Note that all of the read FASTA files should be concatenated into a single TRACEINFO.seq file, the read quality files should be concatenated into a single TRACEINFO.qual file, and the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.

Once the bank has been built, launch the analysis by typing:

$ amosvalidate assembly.bnk

If the assembler you used does not record the clear range, you'll very likely run into an error. In this case, re-run amosvalidate without using clear range information:

$ amosvalidate assembly.bnk -D CLEAR_RANGE=0

After the validation completes, the mis-assembly features will be loaded into the bank and present in the files assembly.all.feat and assembly.suspicious.feat. These features can be viewed in Hawkeye by typing:

$ hawkeye assembly.bnk

== Matepair Happiness ==

Matepairs from a double barreled shotgun sequencing library should be oriented towards each other, and their distance apart in the assembly should match the library's size distribution. The tool asmQC looks for regions where multiple matepairs are mis-oriented or the insert coverage is low. Both can indicate that the assembly has a rearrangement mis-assembly. The tool cestat-cov computes a per-library statistic called the CE statistic at every position in the assembly. The CE statistic indicates how well the mates spanning a position match the library's distribution. If the mates are consistently closer than expected at a given position, as would occur in a collapsed repeat or excision from the assembly, the statistic will have a large negative value (ce < -4). If the inserts are consistently larger than expected, such as from a repeat copy number expansion or other insertion event, the statistic will have a large positive value (ce > 4).
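As a rough illustration of the statistic described above, the sketch below computes a z-score style CE value for the inserts spanning one position. The formula (local mean insert size minus library mean, divided by the standard error of the mean) is the commonly used form of the compression/expansion statistic and is an assumption here, since this page does not spell it out. The function name and the example numbers are hypothetical, and cestat-cov itself may differ in detail.

```python
import math


def ce_statistic(local_inserts, lib_mean, lib_sd):
    """Compression/expansion z-score for the inserts spanning one position.

    local_inserts: sizes of the inserts whose mates span the position
    lib_mean, lib_sd: mean and standard deviation of the library insert size
    """
    n = len(local_inserts)
    if n == 0 or lib_sd <= 0:
        raise ValueError("need at least one insert and a positive lib_sd")
    local_mean = sum(local_inserts) / n
    return (local_mean - lib_mean) / (lib_sd / math.sqrt(n))


# Hypothetical library: mean 3000 bp, standard deviation 300 bp.
compressed = ce_statistic([2300, 2400, 2350, 2450], 3000, 300)
stretched = ce_statistic([3600, 3700, 3650, 3750], 3000, 300)
print(compressed, stretched)  # below -4 and above 4, respectively
```

Because the denominator shrinks with the number of spanning inserts, several moderately short inserts at the same position can be more significant than one very short insert.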

'''cestat-cov output file: asm.ce.feat'''

Record of positions in the assembly with unusual CE statistic (|ce| > 4).

Description of columns in file:
1. Contig ID
2. MATEPAIR Feature Type
3. range start
4. range end
5. CE_COMPRESS | CE_STRETCH
6. Library ID

asmQC output is written directly to the bank, but the features can be extracted with bank-report.

== Correlated SNP Detection ==

Correlated SNPs are positions in the genome where most of the reads have one base, but multiple other reads have another base. Unlike sequencing errors that occur at random, these correlated discrepancies can indicate the presence of a mis-assembly. In a haploid bacterial genome, for example, correlated SNPs nearly always indicate that 2 copies of a near identical repeat have been collapsed into a single copy. In diploid or polyploid genomes, these can indicate a collapsed repeat, or positions where the homologous chromosomes disagree. If the frequency is higher than expected biologically, it is strong evidence for a collapsed repeat.

'''analyzeSNPs output file: asm.snps'''

analyzeSNPs finds all positions in the multiple alignment where the reads disagree. By default, it only reports positions where there are 2 or more reads that disagree with the consensus (but agree with each other) and the sum of their quality values is at least 40.

Description of columns in file:
1. Contig ID
2. Gapped position
3. Ungapped position
4. Consensus
5. Depth of coverage
6. Number of reads that disagree with the consensus
7. X(N) X=base1, N=number of reads that have base1
8. {R1,R2,RN} Read ids that have base X
9. Y(N) Y=base2, N=number of reads that have base2
10. {R1, R2, RN} Read ids that have base Y
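The default reporting rule described above (at least 2 reads that disagree with the consensus but agree with each other, with a summed quality of at least 40) can be sketched for a single alignment column as follows. This is an illustration under assumptions, not the analyzeSNPs code: the function name and the representation of a column as (base, quality) pairs are hypothetical.

```python
from collections import defaultdict


def correlated_snp(column, consensus, min_reads=2, min_qual_sum=40):
    """Return the non-consensus bases in one column that pass both thresholds.

    column: list of (base, quality) pairs, one per read covering the position
    """
    support = defaultdict(lambda: [0, 0])  # base -> [read count, quality sum]
    for base, qual in column:
        if base != consensus:
            support[base][0] += 1
            support[base][1] += qual
    return sorted(base for base, (count, qsum) in support.items()
                  if count >= min_reads and qsum >= min_qual_sum)


# Six reads agree with the consensus "A"; two reads show "G", one shows "T".
column = [("A", 30)] * 6 + [("G", 25), ("G", 20), ("T", 35)]
print(correlated_snp(column, "A"))  # "G" passes, the lone "T" does not
```

A single discordant read is treated as a probable sequencing error, however high its quality, because it is not corroborated by another read.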

'''clusterSNPs output file: asm.snp.feat'''

clusterSNPs scans the SNP report generated by analyzeSNPs to find regions that have a high frequency of SNPs. By default, it reports all regions with at least 2 SNP columns (as found by analyzeSNPs) within at most 500 bp of each other.

Description of columns in file:
1. Contig ID
2. SNP Feature Type (P)
3. range start
4. range end
5. HIGH_SNP
6. The number of SNPs
7. The average distance between SNPs
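The clustering step can be pictured with a short sketch. It assumes that "within at most 500bp of each other" means that consecutive SNP positions are no more than 500 bp apart; the function name is hypothetical and the real clusterSNPs may use a different rule.

```python
def cluster_positions(positions, max_gap=500, min_count=2):
    """Group sorted positions into (start, end, count) clusters.

    Consecutive positions farther apart than max_gap start a new cluster;
    clusters with fewer than min_count positions are dropped.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_count:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_count:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


snp_positions = [100, 350, 700, 5000, 9000, 9100]
print(cluster_positions(snp_positions))
```

In this example the isolated SNP at position 5000 is not reported, while the two groups of nearby SNPs become candidate HIGH_SNP regions.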

== Read Coverage ==

If the libraries have been constructed using a random shearing process, the reads should uniformly cover the genome at the average depth of coverage. Regions where the coverage is deeper than expected can indicate a collapsed repeat.

'''analyze-read-depth output file: asm.depth.feat'''

By default, analyze-read-depth reports regions that are 3x deeper than the average coverage. Positions within 1000bp of each other are clustered together.

Description of columns in file:
1. Contig ID
2. Coverage Feature Type (D)
3. range start
4. range end
5. Maximum depth of coverage in this range
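The default behaviour described above can be sketched as follows, assuming a per-base depth array for one contig. The function name is hypothetical, "3x deeper" is interpreted as depth greater than or equal to 3 times the mean, and analyze-read-depth may compute the average and merge regions differently.

```python
def high_depth_regions(depth, factor=3.0, merge_dist=1000):
    """Return (start, end, max_depth) regions of unusually deep coverage.

    depth: list of per-base read depths for one contig (0-based positions)
    Positions at or above factor * mean depth are flagged, and flagged
    positions within merge_dist of each other are merged into one region.
    """
    if not depth:
        return []
    threshold = factor * sum(depth) / len(depth)
    regions = []  # each entry: [start, end, max_depth]
    for pos, d in enumerate(depth):
        if d < threshold:
            continue
        if regions and pos - regions[-1][1] <= merge_dist:
            regions[-1][1] = pos
            regions[-1][2] = max(regions[-1][2], d)
        else:
            regions.append([pos, pos, d])
    return [tuple(r) for r in regions]


# Toy contig: 8x coverage everywhere except two deep spots.
depth = [8] * 100
depth[10:13] = [40, 45, 40]
depth[50] = 30
print(high_depth_regions(depth))                # one merged region
print(high_depth_regions(depth, merge_dist=5))  # two separate regions
```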

== Singleton Breakpoint Analysis ==

After an assembly is complete, there can be reads left over, called singletons, that are not placed in the assembly. These reads are often from contaminating DNA or otherwise low quality sequence and can be safely ignored. However, some types of mis-assemblies can cause singletons where a portion of the read will align well to the contig but the rest of the read past the mis-assembly junction does not. If there are multiple reads that all follow the same pattern of partially aligning until the same position, this is strong evidence for mis-assembly.

'''listReadPlacedStatus output file: asm.singletons'''

[[listReadPlacedStatus]] can report which contig(s) a read is placed into, but in the pipeline simply lists which reads are singletons.

'''casm-breaks output file: asm.break.fea'''

The singleton reads are then aligned to the consensus sequences of the contigs and then analyzed for shared breakpoints. casm-breaks reports positions where there are multiple reads that all have the same breakpoint pattern. Unlike some of the other pipeline tools, [[casm-breaks]] writes an XML like message file.

File Format:
{FEA Feature message
typ:B Breakpoint feature
src:N,CTG The breakpoint occurs in contig N
com: <string> string linking all of the breakpoint features for a set of reads
clr:X,Y Range of the contig where the read aligns
} End of feature

== Repeat K-mer Analysis ==

Almost all mis-assemblies are caused by repeats, and thus it can be useful to find the locations of the repeats in an assembly. Furthermore, it is very interesting to find the locations of collapsed or expanded repeats. We developed a new metric, called normalized k-mer analysis, that can discover collapsed or expanded repeats. A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled. For example, the number of times the k-mers across a 2 copy repeat will be present in K_r is 2 * the depth of coverage. If the 2-copy repeat occurs in 2-copies in the assembly, then those kmers will all be present twice in K_c, and K* will be equal to the depth of coverage. If, however, the repeat was collapsed and occurs only once, then K_c will be 1 across the repeat, and K* will be equal to 2*the average depth of coverage.
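The normalized k-mer count K* described above can be illustrated with a toy example. This is a sketch only: the function names are hypothetical, reverse-complement strands are ignored for brevity, and count-kmers is certainly more efficient than this dictionary-based version.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer found with a sliding window over the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def normalized_kmers(reads, contigs, k):
    """K* for each contig k-mer: occurrences in reads / occurrences in contigs."""
    k_r = kmer_counts(reads, k)
    k_c = kmer_counts(contigs, k)
    return {q: k_r[q] / k_c[q] for q in k_c}


# Toy data: "ACGTA" is a 2-copy repeat collapsed into one contig copy,
# "TTGCA" is unique sequence; the depth of coverage is 2.
reads = ["ACGTA"] * 4 + ["TTGCA"] * 2
contigs = ["ACGTA", "TTGCA"]
k_star = normalized_kmers(reads, contigs, 4)
print(k_star["ACGT"], k_star["TTGC"])
```

The k-mers from the collapsed repeat have a K* of twice the depth of coverage, while the k-mers from unique sequence have a K* equal to the depth of coverage.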

'''count-kmers output file: asm.22.n22mers'''

count-kmers can count k-mers of arbitrary length in the reads or contig consensus sequences, and it can compute normalized k-mers. In the forensics pipeline, it computes normalized k-mers where k=22 and the number of occurrences is at least 22 (approximately 3 * the standard depth of coverage, 8).

File format (N is the normalized k-mer count for a kmer sequence):
>N
kmersequence

'''kmer-cov output file: asm.nkmer.feat'''

kmer-cov maps the k-mer coverage across a sequence. In the forensics pipeline it reports regions at least 1000bp long covered by high frequency normalized kmers, i.e., the collapsed repeats in the genome.

Description of columns in file:
1. Contig ID
2. Coverage Feature Type (K)
3. range start
4. range end
5. Length of region

== Feature Combiner ==

The above metrics can find many different types of mis-assemblies, but each is limited in the type of mis-assembly it can find. Furthermore, normal statistical variation may introduce false positives in the analysis. For example, flagging every insert whose size is more than 2 standard deviations below the library mean will flag about 2.5% of the inserts even though the vast majority are correct. Instead, we use a feature combiner to collect all of the evidence for a mis-assembly and output regions with multiple mis-assembly features present at the same region. This allows users to focus their attention on the regions that are most likely to be mis-assembled.

All of the features are loaded into the bank, and will then be visible within Hawkeye for further inspection.

'''suspiciousfeat2region output file: asm.suspicious.feat'''

File format:
1. Contig id
2. Mis-assembly Feature Type (A)
3. range start
4. range end
5. MIS-ASSEMBLY
6. Number of features in the region
7. Number of feature types in the region
8. List of features separated with pipe ("|") character
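The idea behind the feature combiner can be sketched as an interval merge that keeps only regions supported by several features. This is a simplified illustration under assumptions: the function name, the tuple layout, and the overlap and threshold rules are hypothetical and are not taken from suspiciousfeat2region.

```python
def combine_features(features, min_features=2):
    """Merge overlapping features and keep regions with enough evidence.

    features: iterable of (contig_id, feature_type, start, end)
    Returns (contig_id, start, end, n_features, n_feature_types) tuples.
    """
    regions = []
    for contig, ftype, start, end in sorted(
            features, key=lambda f: (f[0], f[2], f[3])):
        last = regions[-1] if regions else None
        if last and last["contig"] == contig and start <= last["end"]:
            last["end"] = max(last["end"], end)
            last["types"].append(ftype)
        else:
            regions.append({"contig": contig, "start": start,
                            "end": end, "types": [ftype]})
    return [(r["contig"], r["start"], r["end"],
             len(r["types"]), len(set(r["types"])))
            for r in regions if len(r["types"]) >= min_features]


features = [
    ("1", "D", 100, 900),    # deep coverage
    ("1", "P", 500, 1200),   # high SNP density, overlaps the feature above
    ("1", "K", 5000, 6200),  # repeat k-mers, isolated
    ("2", "M", 10, 50),      # matepair feature, isolated
]
print(combine_features(features))
```

Only the region on contig 1 where two independent metrics agree is reported; the isolated features remain available for inspection but are not flagged as suspicious.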

Bambus 2.0/quick start guide

2011-07-24T07:54:33Z

Floflooo: /* Step 1. Install the AMOS package - Bambus 2.0 is part of it. */

This is a copy of the Bambus 2 user guide taken (and improved) from here: http://www.cbcb.umd.edu/software/bambus/doc/HowToBambus2.pdf

See also: http://www.cbcb.umd.edu/software/bambus

==How to run Bambus 2.0==
'''Caveat:''' Bambus is still being actively developed and the code is currently in the "user beware" and "for experts only" stage.

=== Step 1. Install the AMOS package - Bambus 2.0 is part of it. ===
See [[AMOS Getting Started]].

'''Note:''' since Bambus is still under active development you should pull the latest unofficial release of AMOS directly from the Git repository - see instructions at: [[Programmer's guide]].

=== Step 2. What information you need ===
Bambus needs to know about the contigs produced by the assembler and information about how these contigs are linked to each other. In AMOS terms, the basic information necessary is a list of contigs (http://amos.sourceforge.net/docs/api/classAMOS_1_1Contig__t.html) and a list of contig links (http://amos.sourceforge.net/docs/api/classAMOS_1_1ContigLink__t.html) or contig edges (http://amos.sourceforge.net/docs/api/classAMOS_1_1ContigEdge__t.html - bundles of consistent contig links) indicating the relative placement of pairs of contigs.

These data can either be provided to Bambus directly in the form of an AMOS message file (see [[Message Types]]) or inferred from mate-pair information as described below.

== Running Bambus 2.0 ==
* First, add the .afg file built as described above (for other conversion utilities see: http://sourceforge.net/apps/mediawiki/amos/index.php?title=File_conversion_utilities) to an AMOS bank (flat-file database):
bank-transact -cf myproj.bnk -m myfile.afg

* Use the mate-pair information to construct a collection of contig links.
clk -b myproj.bnk

'''Note:''' you can also construct these links with your own custom software and upload them into the bank, in which case you would skip the "clk" command.

* Bundle the contig links into a collection of contig edges.
Bundler -b myproj.bnk

'''Note:''' as with the clk command you might want to build the contig edges separately and upload them into the bank using your own software.

'''Note:''' the Bundler command also accepts the command line parameter "-t" followed by a list of edge types as defined in src/AMOS/Link_AMOS.hh. Currently the following types are defined: '''M''' - mate-pair, '''O''' - overlap, '''P''' - physical, '''A''' - alignment, '''S''' - synteny, and '''X''' - other.

* Identify genomic repeats and output them to standard output
MarkRepeats -b myproj.bnk [-redundancy X -aggressive] > myRepeats

Optional parameters:
:"-redundancy X" only uses contig edges comprising X or more contig links
:"-aggressive" - aggressive repeat identification based on global depth of coverage statistics (default procedure relies on graph analysis rather than coverage statistics)

'''Note:''' this program requires the boost library

* Order and orient contigs according to repeat and link information

'''IMPORTANT:''' several of the operations performed by this program destructively modify the bank (changes cannot be undone). You should make a copy of the bank prior to running OrientContigs.

OrientContigs -b myproj.bnk -prefix myscaff

:"-prefix" specifies the prefix for all output files

Optional parameters:
:"-all" - output unlinked contigs as scaffolds
:"-noreduce" - turns off graph simplification routines (see below)
:"-redundancy X" - same as above - ignore edges with fewer than X links
:"-repeats filename" - ignores repeats listed in "filename" (one contig ID per line), as generated e.g. by the MarkRepeats program described above.
:"-aggressive" - aggressive scaffolding - by default, links that are stretched by more than 3 standard deviations are ignored. The aggressive option turns this feature off and tries to reconcile the scaffold as well as possible.

* Linearize the scaffolds (if desired). By default Bambus 2 produces non-linear graph-based scaffolds. If fasta output is desired, it is necessary to linearize the scaffolds.
untangle -e myscaff.evidence.xml -s myscaff.out.xml -o myscaff.untangle.xml

* Output fasta result (if desired). This involves two steps: the first generates the fasta file representing the contigs, and the second combines them, separated by Ns, into a scaffold fasta file.
bank2fasta -d -b myproj.bnk > contigs.fasta
printScaff -e myscaff.evidence.xml -s myscaff.untangle.xml -l myscaff.library -f contigs.fasta -merge -o myscaff
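
The steps above can be chained in a small shell function. The following is a minimal sketch, not an official AMOS script: the names bambus_scaffold, run and DRY_RUN, the ".backup" suffix and the "$prefix.repeats" file name are assumptions made for illustration. The commands and options are the ones listed in this section, with the bank passed through "-b" as in the other bank-transact examples on this wiki.

```shell
#!/bin/sh
# Sketch only: chains the Bambus 2 steps described above.
# bambus_scaffold, run, DRY_RUN and the file names derived from $prefix
# are illustrative assumptions, not part of AMOS.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bambus_scaffold() {
    afg=$1; bank=$2; prefix=$3

    run bank-transact -cf -b "$bank" -m "$afg" || return 1
    run clk -b "$bank" || return 1
    run Bundler -b "$bank" || return 1

    # MarkRepeats writes the repeat list to standard output.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ MarkRepeats -b $bank > $prefix.repeats"
    else
        MarkRepeats -b "$bank" > "$prefix.repeats" || return 1
    fi

    # OrientContigs modifies the bank destructively, so copy it first.
    run cp -r "$bank" "$bank.backup" || return 1
    run OrientContigs -b "$bank" -prefix "$prefix" -repeats "$prefix.repeats" || return 1

    # Optional: linearize the scaffolds and write a scaffold fasta file.
    run untangle -e "$prefix.evidence.xml" -s "$prefix.out.xml" -o "$prefix.untangle.xml" || return 1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ bank2fasta -d -b $bank > contigs.fasta"
    else
        bank2fasta -d -b "$bank" > contigs.fasta || return 1
    fi
    run printScaff -e "$prefix.evidence.xml" -s "$prefix.untangle.xml" \
        -l "$prefix.library" -f contigs.fasta -merge -o "$prefix"
}
```

Calling "DRY_RUN=1 bambus_scaffold myfile.afg myproj.bnk myscaff" only prints the commands, which is a cheap way to review them before touching a real bank.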

== Outputs ==
The output of the OrientContigs program is a collection of scaffolds stored in the bank. The program also generates several files whose names start with the specified prefix (shown here as myScaff):
*myScaff.agp
**The scaffolds generated by the OrientContigs program in NCBI AGP format
*myScaff.dot
**The scaffolds generated by the OrientContigs program in Graphviz dot format. It can be converted to a PostScript or PDF file using the dot program in the Graphviz package.
*myScaff.evidence.xml
*myScaff.library
*myScaff.out.xml
**The scaffolds generated by the OrientContigs program compatible with the Bambus 1 format.
*myScaff.fasta
**The fasta file of the scaffolds, joined by Ns
*myScaff.stats
**Statistics on the scaffolds generated, including N50 and total span.
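
The conversion of the .dot file mentioned above can be sketched as follows. The helper name dot_to_pdf is hypothetical; "dot -Tpdf input -o output" is the usual Graphviz invocation, and the sketch assumes Graphviz is installed.

```shell
# Sketch: convert the Graphviz output of OrientContigs to PDF.
# dot_to_pdf is a hypothetical helper name.
dot_to_pdf() {
    dotfile=$1
    pdffile=${dotfile%.dot}.pdf      # myScaff.dot -> myScaff.pdf
    if ! command -v dot >/dev/null 2>&1; then
        echo "dot (Graphviz) not found on PATH" >&2
        return 1
    fi
    # Print the name of the PDF on success.
    dot -Tpdf "$dotfile" -o "$pdffile" && echo "$pdffile"
}
```

For example, "dot_to_pdf myScaff.dot" would write myScaff.pdf next to the input file.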

=== Scaffold simplifications ===
By default (unless option "-noreduce" is provided) the OrientContigs program simplifies certain graph patterns:
* simple paths
* bubbles
** These patterns are iteratively merged into single contigs until no additional simplifications can be made.

Fedora installation

2011-07-24T07:49:41Z

Floflooo:

This was tested on Fedora 13.

First, download either the regular or development version of AMOS.

i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a Git repository. To get it, run:
git clone git://amos.git.sourceforge.net/gitroot/amos/amos

In the directory where the AMOS files are located, run the following to install the prerequisites:
su -c "yum install automake qt3 qt3-devel boost boost-devel libXmu libXmu-devel libXi libXi-devel expat expat-devel"

If you need the AMOScmp, AMOScmp-shortReads-alignmentTrimmed or minimus2 components of AMOS, you need to install MUMMER. As far as I know, there is no easy way to do so. Go to http://mummer.sourceforge.net/ and follow the MUMMER installation instructions.

For the standard version of AMOS, skip to the next step. For the development version, first run:
./bootstrap

Then regardless of the version:
./configure --prefix=/usr/local/AMOS
make
make check
su -c "make install"
su -c "ln -s /usr/local/AMOS/bin/* /usr/local/bin/"

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
Minimo -h

Debian installation

2011-07-24T07:48:02Z

Floflooo:

These instructions are for Debian and Debian-based distros (e.g. Ubuntu 9.04)

To start, download either the regular or development version of AMOS.

i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a Git repository. To get it, run:
git clone git://amos.git.sourceforge.net/gitroot/amos/amos

In the directory where the AMOS files are located, run the following to install the prerequisites:
sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev

For the Hawkeye component of AMOS, you need Qt3:
sudo aptitude install libqt3-headers

For the standard version of AMOS, skip to the next step. For the development version, first run:
./bootstrap

Then regardless of the version:
./configure --with-Qt-dir=/usr/share/qt3 --prefix=/usr/local/AMOS
make
make check
sudo make install
sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
Minimo -h
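
To check more than one program at once, a small helper can list the tools that are not reachable through PATH. This is a sketch only; missing_tools is a hypothetical name, and the program names passed to it are up to you.

```shell
# Sketch: print each given program that cannot be found via PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For example, "missing_tools Minimo minimus AMOScmp bank-transact" prints nothing when all four programs are installed and linked as described above.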

AMOS Getting Started

2011-07-24T07:45:45Z

Floflooo: /* Downloading the development version */

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no: AMOS is not an assembler, but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

No need to remember this URL as you can easily reach it from the [[AMOS|AMOS main page]].

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using Git following the directions here: [http://sourceforge.net/scm/?type=git&group_id=134326 http://sourceforge.net/scm/?type=git&group_id=134326]

Or in short:
git clone git://amos.git.sourceforge.net/gitroot/amos/amos

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the development version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local/AMOS

will cause the AMOS directory hierarchy to be installed underneath /usr/local/AMOS.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.
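
The configure, make and make install steps above can be sketched as one function. The names build_amos, amos_srcdir and configure.log are assumptions made for this example; the function is meant to be run from inside the unpacked source directory, and the search for "WARNING!" relies on the form of the configure messages quoted in the next section.

```shell
# Sketch of the build steps above. build_amos, amos_srcdir and
# configure.log are illustrative names, not part of AMOS.

# Derive the source directory name from the tarball name.
amos_srcdir() {
    basename "$1" .tar.gz
}

# Configure, build and install into the given prefix.
build_amos() {
    prefix=${1:-/usr/local/AMOS}

    # Keep a copy of the configure output so it can be reviewed later.
    ./configure --prefix="$prefix" > configure.log 2>&1 || {
        echo "configure failed, see configure.log" >&2
        return 1
    }
    # Show messages about missing dependencies, if any.
    grep 'WARNING!' configure.log

    make && make install
}
```

A typical session would be "tar -xzf amos-1.4.5.tar.gz", then "cd" into the directory printed by "amos_srcdir amos-1.4.5.tar.gz", then "build_amos /usr/local/AMOS". Installing into a system directory may require administrator rights.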

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of dependencies ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.
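
For sh-compatible shells, setting the variable can be wrapped in a small check so that NUCMER is exported only when the program actually exists. This is a sketch; set_nucmer is a hypothetical helper, and its optional argument is a directory to try when nucmer is not already on PATH.

```shell
# Sketch: export NUCMER only if the program can be located.
# set_nucmer is a hypothetical helper name.
set_nucmer() {
    if command -v nucmer >/dev/null 2>&1; then
        # nucmer is already reachable through PATH.
        NUCMER=$(command -v nucmer)
    elif [ -n "$1" ] && [ -x "$1/nucmer" ]; then
        # Fall back to the directory given as first argument.
        NUCMER=$1/nucmer
    else
        echo "nucmer not found; install MUMmer or pass its directory" >&2
        return 1
    fi
    export NUCMER
}
```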

==== Specifying the location of the Qt library ====
On most Unix installations (see below for OSX and Cygwin), the Qt library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include_dir
--with-Qt-lib_dir
--with-Qt-bin_dir
--with-Qt-lib

Similarly, if you get the message:

WARNING! Boost graph toolkit was not found but is required to run parts of the AMOS Scaffolder (Bambus 2)

try specifying the location of Boost with the option:

--with-Boost-dir
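
The options above are passed to configure together with --prefix. The sketch below assembles the option string from directories that actually exist; configure_opts is a hypothetical helper, and the "=directory" form follows the --with-Qt-dir example on the [[Debian installation]] page.

```shell
# Sketch: assemble configure options from toolkit directories that exist.
# configure_opts is a hypothetical helper; pass your own directories.
configure_opts() {
    prefix=$1; qt_dir=$2; boost_dir=$3
    opts="--prefix=$prefix"
    if [ -n "$qt_dir" ] && [ -d "$qt_dir" ]; then
        opts="$opts --with-Qt-dir=$qt_dir"
    fi
    if [ -n "$boost_dir" ] && [ -d "$boost_dir" ]; then
        opts="$opts --with-Boost-dir=$boost_dir"
    fi
    echo "$opts"
}
```

The result can be used as "./configure $(configure_opts /usr/local/AMOS /usr/share/qt3 /path/to/boost)", where /path/to/boost is a placeholder and none of the paths may contain spaces.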

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== Mac OS X installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier) - an integer identifier, internal to AMOS; and EID (external identifier) - a string representing some external identifier of the record, e.g. the original name of a sequencing read. Each identifier must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.

==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists, and bank-transact will fail otherwise. When creating a new bank you can run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.
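
Because a bank is a directory, a wrapper can decide whether -c is needed. The following is a sketch under that assumption; load_bank is a hypothetical name, and the sketch deliberately never uses -f, so an existing bank is not overwritten.

```shell
# Sketch: load a message file, creating the bank only when it is missing.
# load_bank is a hypothetical wrapper around bank-transact.
load_bank() {
    bank=$1; msg=$2
    if [ -d "$bank" ]; then
        # Bank directory exists: load the messages into it.
        bank-transact -b "$bank" -m "$msg"
    else
        # No bank yet: -c creates it.
        bank-transact -c -b "$bank" -m "$msg"
    fi
}
```

For example, "load_bank mybank mymessagefile" behaves like the second invocation above on the first call and like the first invocation afterwards.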

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the 3-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.
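
The output of bank-report can be piped into ordinary text tools. The sketch below counts the messages of each type; it assumes that every message opens with a line consisting of "{" followed by the 3-letter code (check this against the [[Message Types]] documentation), and count_messages is a hypothetical helper name. Nested sub-messages, if any, are counted as well.

```shell
# Sketch: count the messages of each type in AMOS message text read from
# standard input. Assumes each message starts with a line like "{RED".
# count_messages is a hypothetical helper name.
count_messages() {
    awk '/^[{][A-Z][A-Z][A-Z]$/ { n[substr($0, 2)]++ }
         END { for (t in n) print t, n[t] }' | sort
}
```

For example, "bank-report -b mybank CTG RED | count_messages" would print one line per message type with its count.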

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: for reading, and for writing. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank

==== Bank versions ====
The specific format of the AMOS bank is closely related to the current version of the AMOS software. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using the matching old bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then load the output into a new bank with the most recent bank-transact command.
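
The dump-and-reload procedure can be sketched as follows. The names migrate_bank and the ".dump.afg" file are assumptions for illustration; the first argument is whichever old bank-report program matches the bank (bank-report-1.1 is the example named above).

```shell
# Sketch: move a bank written by an older AMOS release to the current format.
# migrate_bank and the dump file name are illustrative assumptions.
migrate_bank() {
    old_report=$1; old_bank=$2; new_bank=$3
    dump=$new_bank.dump.afg

    # Dump the old bank to a message file with the matching old reporter.
    "$old_report" -b "$old_bank" > "$dump" || return 1
    # Create a fresh bank from the dump with the current bank-transact.
    bank-transact -c -b "$new_bank" -m "$dump"
}
```

For example: "migrate_bank bank-report-1.1 old.bnk new.bnk".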

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language, and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), providing the user with a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.
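
A thin wrapper can verify that both inputs are readable before AMOScmp starts. This is a sketch only; run_amoscmp is a hypothetical name, while TGT, REF and the default file names are the ones described above.

```shell
# Sketch: check the inputs before starting AMOScmp.
# run_amoscmp is a hypothetical wrapper name.
run_amoscmp() {
    prefix=$1
    tgt=${2:-$prefix.afg}       # target reads, default prefix.afg
    ref=${3:-$prefix.1con}      # reference sequence, default prefix.1con

    for f in "$tgt" "$ref"; do
        if [ ! -r "$f" ]; then
            echo "cannot read input file: $f" >&2
            return 1
        fi
    done
    AMOScmp -D "TGT=$tgt" -D "REF=$ref" "$prefix"
}
```

For example: "run_amoscmp prefix mytarget.afg myreference.1con".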

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs. It also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.
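
Since minimus targets small jobs, it is often run on many inputs in a row. The sketch below runs it on every .afg file in a directory; assemble_all is a hypothetical helper, and each prefix is simply the file name without its .afg extension.

```shell
# Sketch: run minimus on every .afg file in a directory.
# assemble_all is a hypothetical helper name.
assemble_all() {
    dir=${1:-.}
    for afg in "$dir"/*.afg; do
        [ -e "$afg" ] || continue      # no .afg files: the glob stayed literal
        prefix=${afg%.afg}
        minimus "$prefix" || echo "minimus failed for $prefix" >&2
    done
}
```

For example, "assemble_all genes" would run "minimus genes/x" for each file genes/x.afg.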

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e. ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly - the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

AMOS Getting Started

2011-07-24T07:44:54Z

Floflooo: /* Installing the development version */

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no: AMOS is not an assembler, but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

No need to remember this URL as you can easily reach it from the [[AMOS|AMOS main page]].

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using Git following the directions here: [http://sourceforge.net/scm/?type=git&group_id=134326 http://sourceforge.net/scm/?type=git&group_id=134326]

Or in short:
git clone git://amos.git.sourceforge.net/gitroot/amos/amos

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the development version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local/AMOS

will cause the AMOS directory hierarchy to be installed underneath /usr/local/AMOS.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of dependencies ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.

==== Specifying the location of the Qt library ====
On most Unix installations (see below for OSX and Cygwin), the Qt library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include_dir
--with-Qt-lib_dir
--with-Qt-bin_dir
--with-Qt-lib

Similarly, if you get the message:

WARNING! Boost graph toolkit was not found but is required to run parts of the AMOS Scaffolder (Bambus 2)

try specifying the location of Boost with the option:

--with-Boost-dir

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== Mac OS X installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier) - an integer identifier, internal to AMOS; and EID (external identifier) - a string representing some external identifier of the record, e.g. the original name of a sequencing read. Each identifier must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.

==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists; bank-transact will fail otherwise. To create a new bank, run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the 3 letter codes of the messages to be output, e.g:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: read locks and write locks. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.
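The locking rules amount to a small compatibility check. The following Python sketch only restates the rules described above; lock_is_compatible is a hypothetical helper and says nothing about how AMOS implements its locks on disk.

```python
def lock_is_compatible(requested, held):
    """Return True if a lock of kind `requested` may be taken while `held` locks exist.

    `requested` is "read" or "write"; `held` lists the kinds of locks already placed.
    """
    if requested == "read":
        # Readers may share the bank with other readers, but not with a writer.
        return "write" not in held
    if requested == "write":
        # A writer needs exclusive access.
        return len(held) == 0
    raise ValueError(f"unknown lock kind: {requested!r}")
```

A tool running in "inspect" mode would simply skip this check.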

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank

==== Bank versions ====
The specific format of the AMOS bank is closely tied to the version of the AMOS software that created it. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the bank with the most recent bank-transact command.

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language, and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), providing the user with a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs. It also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e., ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly; the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

FRCurve

2011-05-23T07:03:06Z

Floflooo: /* Documentation */

'''FRCurve''': Feature-Response Curve

== Overview ==

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-Response curve (FRC) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).

The AMOS package provides an automated assembly validation pipeline called [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Amosvalidate amosvalidate] that analyzes the output of an assembler using a variety of assembly quality metrics (or features). Examples of features include: (M) mate-pair orientations and separations, (K) repeat content by k-mer analysis, (C) depth-of-coverage, (P) correlated polymorphism in the read alignments, and (B) read alignment breakpoints to identify structurally suspicious regions of the assembly. After running amosvalidate on the output of the assembler, each contig is assigned a number of features that
correspond to doubtful regions of the sequence.

Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. More specifically, for a fixed feature
threshold <math>\phi</math>, the contigs are sorted by size and, starting from the longest, only those contigs are tallied, if their sum of features is <math>\leq \phi</math>. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve.
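The construction of one FRC point per threshold can be sketched as follows. This is a minimal Python illustration, not the code of the FRCurve pipeline. It assumes one reading of the description above: contigs are taken from longest to shortest and the tally stops at the first contig that would push the running feature total above the threshold. The function name and the input layout (length, number of features) are hypothetical.

```python
def feature_response_curve(contigs, genome_size, thresholds):
    """Return a list of (threshold, approximate coverage) points.

    contigs     -- list of (length, n_features) pairs
    genome_size -- estimated genome size in bp
    thresholds  -- feature thresholds (phi values) to evaluate
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)  # longest first
    points = []
    for phi in thresholds:
        total_len = 0
        total_feat = 0
        for length, n_feat in ordered:
            # Assumption: stop as soon as the cumulative feature count would exceed phi.
            if total_feat + n_feat > phi:
                break
            total_len += length
            total_feat += n_feat
        points.append((phi, total_len / genome_size))
    return points


# Toy data: three contigs on a 1000 bp genome.
points = feature_response_curve([(500, 3), (300, 0), (200, 1)], 1000, [0, 3, 4])
```

With the toy data, a threshold of 0 admits no contigs because the longest one already carries 3 features, a threshold of 3 admits the two longest contigs, and a threshold of 4 admits all three. Under the alternative reading, where each contig is kept if its own feature count is at most the threshold, only the inner loop would change.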

FRC's properties:
<ul>
<li>The FRC can be used as a metric to compare the assembly quality of multiple assemblers.</li>
<li>The FRC does not require any reference sequence (except an estimate of the genome size) to be used for validation, thus making it a very useful tool in de novo sequencing projects. </li>
<li>Separate FRCs can be generated for each feature type, making it possible to scrutinize the relative strengths and weaknesses of different assemblers.</li>
</ul>

== Documentation ==

Following the AMOS philosophy, the FRCurve is implemented as a pipeline that consists of two steps:
* 1. invocation of the amosvalidate tool to compute the features for the set of contigs;
* 2. invocation of the FRC module, getFRCvalues.
The name of the pipeline in the AMOS distribution is "FRCurve_pipeline".

Documentation on how to run FRCurve is obtained by typing:

FRCurve -h

The usage message is:

Feature-Response Curve pipeline
Usage:
FRCurve [params] \
-D GENOME_SIZE=<n> - Genome size (number of bps)
-D BANK=<n> - AMOS bank name

Description:
The Feature-Response curve characterizes the sensitivity (coverage)
of the sequence assembler as a function of its discrimination threshold (number of features).
Given any set of features compute by the amosvalidate pipeline, the response (quality)
of the assembler output is analyzed as a function of the maximum number of possible
errors (features) allowed in the contigs.
For more details see the wiki page at:
http://sourceforge.net/apps/mediawiki/amos/index.php?title=FRCurve
Output:
The Feature-Response curve (FRC) is saved in file "FRC.txt", while
FRCs for each feature type are saved respectively in:
"FRC_coverage.txt", "FRC_polymorphism.txt", "FRC_breakpoint.txt",
"FRC_kmer.txt", "FRC_matepair.txt" and "FRC_misassembly.txt"
Output file format:
Each file contains the FRCs in 3-columns format
- column 1 = feature threshold T;
- column 2 = contigs' N50 associated to the threshold T in column 1;
- column 3 = cumulative size of the contigs whose number of features is <= T;
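For downstream plotting, the three-column output can be read with a few lines of Python. This is a sketch under assumptions: the usage message does not state the column separator, so whitespace-separated columns are assumed here, and the sample numbers are made up purely to show the shape of the data.

```python
def parse_frc(text):
    """Parse three-column FRC output into (threshold, n50, cumulative_size) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        threshold, n50, cumulative_size = (float(x) for x in fields)
        rows.append((threshold, n50, cumulative_size))
    return rows


# Made-up numbers, only to illustrate the layout.
sample = "0 0 0\n5 1200 3400\n12 1500 4100\n"
rows = parse_frc(sample)
```

Dividing the third column by the estimated genome size gives the approximate coverage for each threshold.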

== Example ==

The figure below shows the Feature-Response Curve generated for the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus minimus] assembly pipeline on the ''Brucella suis'' genome using the benchmark dataset available [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Benchmark here].

[[File:minimus_frc.jpeg|600px]]

== People ==

* [http://cims.nyu.edu/~gn387/ Giuseppe Narzisi] (PhD Student, NYU)
* [http://www.cs.nyu.edu/mishra/ Bud Mishra] (Faculty, NYU)

== References ==

Narzisi G. and Mishra B.:
''Comparing De Novo Genome Assembly: The Long and Short of It''.
'''PLoS ONE''' 6(4):e19175. April 2011 (DOI: [http://dx.plos.org/10.1371/journal.pone.0019175 10.1371/journal.pone.0019175]).

== Acknowledgements ==

Research reported here was supported by grants from NSF CDI program and Abraxis BioScience, LLC.

FRCurve

2011-05-23T06:31:28Z

Floflooo: /* Documentation */

'''FRCurve''': Feature-Response Curve

== Overview ==

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-Response curve (FRC) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).

The AMOS package provides an automated assembly validation pipeline called [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Amosvalidate amosvalidate] that analyzes the output of an assembler using a variety of assembly quality metrics (or features). Examples of features include: (M) mate-pair orientations and separations, (K) repeat content by k-mer analysis, (C) depth-of-coverage, (P) correlated polymorphism in the read alignments, and (B) read alignment breakpoints to identify structurally suspicious regions of the assembly. After running amosvalidate on the output of the assembler, each contig is assigned a number of features that
correspond to doubtful regions of the sequence.

Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. More specifically, for a fixed feature
threshold <math>\phi</math>, the contigs are sorted by size and, starting from the longest, only those contigs are tallied, if their sum of features is <math>\leq \phi</math>. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve.

FRC's properties:
<ul>
<li>The FRC can be used as a metric to compare the assembly quality of multiple assemblers.</li>
<li>The FRC does not require any reference sequence (except an estimate of the genome size) to be used for validation, thus making it a very useful tool in de novo sequencing projects. </li>
<li>Separate FRCs can be generated for each feature type, making it possible to scrutinize the relative strengths and weaknesses of different assemblers.</li>
</ul>

== Documentation ==

Following the AMOS philosophy, the FRCurve is implemented as a pipeline that consists of two steps:
* 1. invocation of the amosvalidate tool to compute the features for the set of contigs;
* 2. invocation of the FRC module.
The name of the pipeline in the AMOS distribution is "FRCurve_pipeline".

Documentation on how to run FRCurve is obtained by typing:

FRCurve -h

The usage message is:

Feature-Response Curve pipeline
Usage:
FRCurve [params] \
-D GENOME_SIZE=<n> - Genome size (number of bps)
-D BANK=<n> - AMOS bank name

Description:
The Feature-Response curve characterizes the sensitivity (coverage)
of the sequence assembler as a function of its discrimination threshold (number of features).
Given any set of features compute by the amosvalidate pipeline, the response (quality)
of the assembler output is analyzed as a function of the maximum number of possible
errors (features) allowed in the contigs.
For more details see the wiki page at:
http://sourceforge.net/apps/mediawiki/amos/index.php?title=FRCurve
Output:
The Feature-Response curve (FRC) is saved in file "FRC.txt", while
FRCs for each feature type are saved respectively in:
"FRC_coverage.txt", "FRC_polymorphism.txt", "FRC_breakpoint.txt",
"FRC_kmer.txt", "FRC_matepair.txt" and "FRC_misassembly.txt"
Output file format:
Each file contains the FRCs in 3-columns format
- column 1 = feature threshold T;
- column 2 = contigs' N50 associated to the threshold T in column 1;
- column 3 = cumulative size of the contigs whose number of features is <= T;

== Example ==

The figure below shows the Feature-Response Curve generated for the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus minimus] assembly pipeline on the ''Brucella suis'' genome using the benchmark dataset available [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Benchmark here].

[[File:minimus_frc.jpeg|600px]]

== People ==

* [http://cims.nyu.edu/~gn387/ Giuseppe Narzisi] (PhD Student, NYU)
* [http://www.cs.nyu.edu/mishra/ Bud Mishra] (Faculty, NYU)

== References ==

Narzisi G. and Mishra B.:
''Comparing De Novo Genome Assembly: The Long and Short of It''.
'''PLoS ONE''' 6(4):e19175. April 2011 (DOI: [http://dx.plos.org/10.1371/journal.pone.0019175 10.1371/journal.pone.0019175]).

== Acknowledgements ==

Research reported here was supported by grants from NSF CDI program and Abraxis BioScience, LLC.

Amosvalidate

2011-05-23T04:43:38Z

Floflooo:

Automated assembly validation pipeline.

Adam Phillippy, Michael Schatz, Mihai Pop
Center for Bioinformatics and Computational Biology, University of Maryland

Publication: Genome assembly forensics: finding the elusive mis-assembly. Phillippy AM, Schatz MC, Pop M. Genome Biol. 2008;9(3):R55.

== Overview ==

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled.

Related tools:
* [[Hawkeye]]
* [[MUMmer]]

== Running amosvalidate ==

amosvalidate reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of other assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact. toAmos reads the assembly files and converts them to plaintext AMOS messages, and bank-transact reads those messages and creates the binary encoded bank directory. See the AMOS Assembly Conversion Page for more information.

For example:

$ toAmos -f assembly.frg -a assembly.asm -o - | bank-transact -m - -b assembly.bnk -c

Creates the bank assembly.bnk from the files assembly.frg and assembly.asm, which are the input and output files for the Celera Assembler.

$ toAmos -ace assembly.ace -o - | bank-transact -m - -b assembly.bnk -c

Creates the bank assembly.bnk from an ace file, which is an output format for many assemblers including Phrap, Arachne, Velvet and Newbler. Check your assembler's documentation for more information on creating ACE files. More information on converting to AMOS is available in the toAmos documentation.

$ tarchive2amos -o assembly -assembly ASSEMBLY.xml TRACEINFO.seq;
$ bank-transact -m assembly.afg -b assembly.bnk -c

Creates the bank assembly.bnk from an assembly archive XML file called ASSEMBLY.xml. Note that all of the read FASTA files should be concatenated into a single TRACEINFO.seq file and the read quality files into a single TRACEINFO.qual file; the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.

Once the bank has been built, launch the analysis by typing:

$ amosvalidate assembly.bnk

After the validation completes, the mis-assembly features will be loaded into the bank and present in the files assembly.all.feat and assembly.suspicious.feat. These features can be viewed in Hawkeye by typing:

$ hawkeye assembly.bnk

== Matepair Happiness ==

Matepairs from a double-barreled shotgun sequencing library should be oriented towards each other, and their distance apart in the assembly should match the library's size distribution. The tool asmQC looks for regions where multiple matepairs are mis-oriented or the insert coverage is low. Both can indicate that the assembly has a rearrangement mis-assembly. The tool cestat-cov computes a per-library statistic called the CE statistic at every position in the assembly. The CE statistic indicates how well the mates spanning a position match the library's distribution. If the mates are consistently closer than expected at a given position, as would occur in a collapsed repeat or excision from the assembly, the statistic will have a large negative value (ce < -4). If the inserts are consistently larger than expected, such as from a repeat copy number expansion or other insertion event, the statistic will have a large positive value (ce > 4).
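The page does not spell out the formula, so the following Python sketch assumes the usual definition of the compression/expansion statistic: a z-score of the mean size of the inserts spanning a position against the library mean, scaled by the standard error. The function names are hypothetical, and the mapping of the sign onto the CE_COMPRESS and CE_STRETCH labels is an assumption based on the description above.

```python
import math


def ce_statistic(local_insert_sizes, lib_mean, lib_stdev):
    """z-score of the local mean insert size against the library distribution (assumed formula)."""
    n = len(local_insert_sizes)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(local_insert_sizes) / n
    return (local_mean - lib_mean) / (lib_stdev / math.sqrt(n))


def classify_ce(ce, cutoff=4.0):
    """Label a position using the |ce| > 4 cutoff quoted in the text."""
    if ce < -cutoff:
        return "CE_COMPRESS"  # assumption: mates closer together than expected
    if ce > cutoff:
        return "CE_STRETCH"   # assumption: mates further apart than expected
    return None
```

For example, nine inserts of 2500 bp from a library with mean 3000 bp and standard deviation 300 bp give ce = (2500 - 3000) / (300 / 3) = -5, which falls below the -4 cutoff.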

'''cestat-cov output file: asm.ce.feat'''

Record of positions in the assembly with unusual CE statistic (|ce| > 4).

Description of columns in file:
1. Contig ID
2. MATEPAIR Feature Type
3. range start
4. range end
5. CE_COMPRESS | CE_STRETCH
6. Library ID

asmQC output is written directly to the bank, but the features can be extracted with bank-report.

== Correlated SNP Detection ==

Correlated SNPs are positions in the genome where most of the reads are one base, but multiple other reads have another base. Unlike sequencing errors that occur at random, these correlated discrepancies can indicate the presence of a mis-assembly. In a haploid bacterial genome, for example, correlated SNPs nearly always indicate 2 copies of a near identical repeat have been collapsed into a single copy. In diploid or polyploid genomes, these can indicate a collapsed repeat, or positions where the homologous chromosomes disagree. If the frequency is higher than expected biologically, it is strong evidence for a collapsed repeat.

'''analyzeSNPs output file: asm.snps'''

analyzeSNPs finds all positions in the multiple alignment where the reads disagree. By default, it only reports positions where there are 2 or more reads that disagree with the consensus (but agree with each other) and the sum of their quality values is at least 40.

Description of columns in file:
1. Contig ID
2. Gapped position
3. Ungapped position
4. Consensus
5. Depth of coverage
6. Number of reads that disagree with the consensus
7. X(N) X=base1, N=number of reads that have base1
8. {R1,R2,RN} Read ids that have base X
9. Y(N) Y=base2, N=number of reads that have base2
10. {R1, R2, RN} Read ids that have base Y
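The default reporting rule can be sketched for a single alignment column as below. This is an illustrative Python fragment, not the analyzeSNPs implementation; the function name and the (base, quality) input layout are hypothetical.

```python
from collections import defaultdict


def is_correlated_snp(column, consensus, min_reads=2, min_qual_sum=40):
    """column: list of (base, quality) pairs for the reads covering one position."""
    support = defaultdict(lambda: [0, 0])  # alternate base -> [read count, summed quality]
    for base, qual in column:
        if base != consensus:
            support[base][0] += 1
            support[base][1] += qual
    # Report only if some alternate base is backed by enough agreeing reads
    # whose quality values sum to the required minimum.
    return any(n >= min_reads and q >= min_qual_sum for n, q in support.values())
```

Two reads showing G with qualities 25 and 20 against a consensus A would be reported, whereas one G and one C would not, because the disagreeing reads do not agree with each other.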

'''clusterSNPs output file: asm.snp.feat'''

clusterSNPs scans the SNPs report generated by analyzeSNPs to find regions that have a high frequency of SNPs. By default, it reports all regions with at least 2 columns within at most 500bp of each other as found by analyzeSNPs.

Description of columns in file:
1. Contig ID
2. SNP Feature Type (P)
3. range start
4. range end
5. HIGH_SNP
6. The number of SNPs
7. The average distance between SNPs
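The clustering step can be illustrated with a short Python sketch. It assumes that "within at most 500bp of each other" refers to the gap between consecutive SNP positions; the real clusterSNPs logic may differ, and the function names are hypothetical.

```python
def _summarize(run):
    # Average spacing between consecutive SNPs in the run.
    avg = (run[-1] - run[0]) / (len(run) - 1) if len(run) > 1 else 0.0
    return (run[0], run[-1], len(run), avg)


def cluster_snps(positions, max_gap=500, min_snps=2):
    """Return (start, end, n_snps, avg_distance) for runs of SNP positions in one contig."""
    regions = []
    run = []
    for pos in sorted(positions):
        if run and pos - run[-1] > max_gap:
            # Gap too large: close the current run and start a new one.
            if len(run) >= min_snps:
                regions.append(_summarize(run))
            run = []
        run.append(pos)
    if len(run) >= min_snps:
        regions.append(_summarize(run))
    return regions


regions = cluster_snps([100, 350, 700, 5000, 9000, 9100])
```

Here the isolated SNP at position 5000 is dropped, while the other positions form two reported regions.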

== Read Coverage ==

If the libraries have been constructed using a random shearing process, the reads should uniformly cover the genome at the average depth of coverage. Regions where the coverage is deeper than expected can indicate a collapsed repeat.

'''analyze-read-depth output file: asm.depth.feat'''

By default, analyze-read-depth reports regions that are 3x deeper than the average coverage. Positions within 1000bp of each other are clustered together.

Description of columns in file:
1. Contig ID
2. Coverage Feature Type (D)
3. range start
4. range end
5. Maximum depth of coverage in this range
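A minimal Python sketch of the default behaviour is shown below. It is not the analyze-read-depth code: it assumes a per-base depth list for a single contig, computes the average over that list only, and uses 0-based inclusive coordinates. All names are hypothetical.

```python
def deep_coverage_regions(depth, factor=3.0, merge_dist=1000):
    """Return (start, end, max_depth) for regions at least `factor` times the mean depth."""
    if not depth:
        return []
    cutoff = factor * (sum(depth) / len(depth))
    regions = []
    for pos, d in enumerate(depth):
        if d < cutoff:
            continue
        if regions and pos - regions[-1][1] <= merge_dist:
            # Close enough to the previous deep position: extend that region.
            start, _, peak = regions[-1]
            regions[-1] = (start, pos, max(peak, d))
        else:
            regions.append((pos, pos, d))
    return regions


# Toy contig: background depth 1 with three deep positions; small merge distance for brevity.
toy = [1] * 100
toy[10], toy[15], toy[80] = 40, 50, 60
deep = deep_coverage_regions(toy, merge_dist=20)
```

The two nearby deep positions are merged into one region that records the larger of the two depths, and the distant one is reported separately.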

== Singleton Breakpoint Analysis ==

After an assembly is complete, there can be reads left over, called singletons, that are not placed in the assembly. These reads are often from contaminating DNA or otherwise low quality sequence and can be safely ignored. However, some types of mis-assemblies can cause singletons where a portion of the read will align well to the contig but the rest of the read past the mis-assembly junction does not. If there are multiple reads that all follow the same pattern of partially aligning until the same position, this is strong evidence for mis-assembly.

'''listReadPlacedStatus output file: asm.singletons'''

[[listReadPlacedStatus]] can report which contig(s) a read is placed into, but in the pipeline simply lists which reads are singletons.

'''casm-breaks output file: asm.break.fea'''

The singleton reads are then aligned to the consensus sequences of the contigs and then analyzed for shared breakpoints. casm-breaks reports positions where there are multiple reads that all have the same breakpoint pattern. Unlike some of the other pipeline tools, [[casm-breaks]] writes an XML like message file.

File Format:
{FEA Feature message
typ:B Breakpoint feature
src:N,CTG The breakpoint occurs in contig N
com: <string> string linking all of the breakpoint features for a set of reads
clr:X,Y Range the contig where the read aligns
} End of feature
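Feature messages laid out as above can be read with a small parser. The Python sketch below handles only the single-line "key:value" fields shown in the layout and is not a general AMOS message parser; the function names and the sample values are hypothetical.

```python
def parse_fea_messages(text):
    """Collect {FEA ... } blocks into dictionaries of their key:value fields."""
    features = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "{FEA":
            current = {}
        elif line == "}" and current is not None:
            features.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key] = value.strip()
    return features


def contig_and_range(feature):
    """Extract (contig IID, range start, range end) from a parsed feature."""
    contig_iid = int(feature["src"].split(",")[0])
    start, end = (int(x) for x in feature["clr"].split(","))
    return contig_iid, start, end


# Made-up example message.
sample = "{FEA\ntyp:B\nsrc:7,CTG\ncom:break_0001\nclr:1200,1850\n}\n"
parsed = parse_fea_messages(sample)
```

Grouping the parsed features by their com string would bring together the breakpoint features that belong to the same set of reads.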

== Repeat K-mer Analysis ==

Almost all mis-assemblies are caused by repeats, and thus it can be useful to find the locations of the repeats in an assembly. Furthermore, it is very interesting to find the locations of collapsed or expanded repeats. We developed a new metric, called normalized k-mer analysis, that can discover collapsed or expanded repeats. A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled. For example, the number of times the k-mers across a 2 copy repeat will be present in K_r is 2 * the depth of coverage. If the 2-copy repeat occurs in 2-copies in the assembly, then those kmers will all be present twice in K_c, and K* will be equal to the depth of coverage. If, however, the repeat was collapsed and occurs only once, then K_c will be 1 across the repeat, and K* will be equal to 2*the average depth of coverage.
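The normalized count K* can be computed directly from the definition. The Python sketch below is a simplified illustration, not the count-kmers implementation: it treats sequences as plain strings and ignores reverse complements, and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-length substring across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def normalized_kmer_counts(reads, contigs, k):
    """K* for each k-mer of the consensus: occurrences in reads / occurrences in contigs."""
    k_r = count_kmers(reads, k)
    k_c = count_kmers(contigs, k)
    return {kmer: k_r[kmer] / n for kmer, n in k_c.items()}


kstar = normalized_kmer_counts(["AAACCC", "AAACCC", "AAAC"], ["AAACCC"], 3)
```

In a real assembly, k-mers whose K* is well above the average depth of coverage mark regions where fewer copies were assembled than the reads support.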

'''count-kmers output file: asm.22.n22mers'''

count-kmers can count k-mers of arbitrary length in the reads or contig consensus sequences, and it can compute normalized k-mers. In the forensics pipeline, it computes normalized k-mers where k=22 and the number of occurrences is at least 22 (approximately 3 * the standard depth of coverage, 8). File format (N is the normalized k-mer count for a kmer sequence):

>N
kmersequence

'''kmer-cov output file: asm.nkmer.feat'''

kmer-cov maps the k-mer coverage across a sequence. In the forensics pipeline it reports regions at least 1000bp long covered by high frequency normalized kmers, i.e., the collapsed repeats in the genome.
Description of columns in file:
1. Contig ID
2. Coverage Feature Type (K)
3. range start
4. range end
5. Length of region

== Feature Combiner ==

The above metrics can find many different types of mis-assemblies, but each is limited in the type of mis-assembly it can find. Furthermore, normal statistical variation may introduce false positives in the analysis. For example, flagging every insert whose size is more than 2 standard deviations below the library mean will flag about 2.5% of the inserts even though the vast majority are correct. Instead we use a feature combiner to collect all of the evidence for a mis-assembly and output regions where multiple mis-assembly features are present. This allows users to focus their attention on the regions that are most likely to be mis-assembled.

All of the features are loaded into the bank, and will then be visible within Hawkeye for further inspection.

'''suspiciousfeat2region output file: asm.suspicious.feat'''

File format:
1. Contig id
2. Mis-assembly Feature Type (A)
3. range start
4. range end
5. MIS-ASSEMBLY
6. Number of features in the region
7. Number of feature types in the region
8. List of features separated with pipe ("|") character
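The idea of combining evidence can be sketched as an interval merge that keeps only regions supported by more than one feature type. This Python fragment is an assumption-laden illustration: the actual clustering rules of suspiciousfeat2region are not described here and may differ, and the function name and input layout are hypothetical.

```python
def combine_features(features, min_types=2):
    """features: list of (start, end, feature_type) on one contig.

    Returns (start, end, n_features, n_types) for merged regions that contain
    at least `min_types` distinct feature types.
    """
    groups = []
    for start, end, ftype in sorted(features):
        if groups and start <= groups[-1]["end"]:
            # Overlaps the current region: extend it and record the feature type.
            group = groups[-1]
            group["end"] = max(group["end"], end)
            group["types"].append(ftype)
        else:
            groups.append({"start": start, "end": end, "types": [ftype]})
    return [
        (g["start"], g["end"], len(g["types"]), len(set(g["types"])))
        for g in groups
        if len(set(g["types"])) >= min_types
    ]


suspicious = combine_features([
    (100, 200, "P"), (150, 400, "K"), (380, 420, "P"),
    (1000, 1100, "D"), (1050, 1080, "D"),
])
```

In this toy input the first three features overlap and involve two feature types, so they form one suspicious region; the two coverage features overlap each other but are of a single type and are not reported.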

Amosvalidate

2011-05-23T04:42:25Z

Floflooo: Typo fixes

Automated assembly validation pipeline.

Adam Phillippy, Michael Schatz, Mihai Pop
Center for Bioinformatics and Computational Biology, University of Maryland

Publication: Genome assembly forensics: finding the elusive mis-assembly. Phillippy AM, Schatz MC, Pop M. Genome Biol. 2008;9(3):R55.

== Overview ==

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled.

Related tools:
* [[Hawkeye]]
* [[MUMmer]]

== Running amosvalidate ==

amosvalidate reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of other assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact. toAmos reads the assembly files and converts them to the plaintext AMOS message format, and bank-transact reads those messages and creates the binary encoded bank directory. See the AMOS Assembly Conversion Page for more information.

For example:

$ toAmos -f assembly.frg -a assembly.asm -o - | bank-transact -m - -b assembly.bnk -c

Creates the bank assembly.bnk from the files assembly.frg and assembly.asm, which are the input and output files for the Celera Assembler.

$ toAmos -ace assembly.ace -o - | bank-transact -m - -b assembly.bnk -c

Creates the bank assembly.bnk from an ace file, which is an output format for many assemblers including Phrap, Arachne, and Newbler. Check your assembler's documentation for more information on creating ACE files. More information on converting to AMOS is available in the toAmos documentation.

$ tarchive2amos -o assembly -assembly ASSEMBLY.xml TRACEINFO.seq;
$ bank-transact -m assembly.afg -b assembly.bnk -c

Creates the bank assembly.bnk from an assembly archive XML file called ASSEMBLY.xml. Note that all of the read FASTA files should be concatenated into a single TRACEINFO.seq file, the read quality files should be concatenated into a single TRACEINFO.qual file, and the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.

Once the bank has been built, launch the analysis by typing:

$ amosvalidate assembly.bnk

After the validation completes, the mis-assembly features will be loaded into the bank and present in the files assembly.all.feat and assembly.suspicious.feat. These features can be viewed in Hawkeye by typing:

$ hawkeye assembly.bnk

== Matepair Happiness ==

Matepairs from a double-barreled shotgun sequencing library should be oriented towards each other, and their distance apart in the assembly should match the library's size distribution. The tool asmQC looks for regions where multiple matepairs are mis-oriented or the insert coverage is low. Both can indicate that the assembly has a rearrangement mis-assembly. The tool cestat-cov computes a per-library statistic called the CE statistic at every position in the assembly. The CE statistic indicates how well the mates spanning a position match the library's distribution. If the mates are consistently closer than expected at a given position, as would occur in a collapsed repeat or excision from the assembly, the statistic will have a large negative value (ce < -4). If the inserts are consistently larger than expected, such as from a repeat copy number expansion or other insertion event, the statistic will have a large positive value (ce > 4).
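
As a rough illustration of the idea, the sketch below computes a z-score-style statistic for the inserts spanning one position: the difference between the local mean insert size and the library mean, scaled by the standard error. The formula, the function name ce_statistic, and the example library parameters are assumptions made for illustration; the exact computation performed by cestat-cov is not reproduced here.

```python
import math

def ce_statistic(local_insert_sizes, library_mean, library_stdev):
    """Z-score of the local mean insert size against the library distribution.

    Assumed form: (local mean - library mean) / (library stdev / sqrt(n)).
    """
    n = len(local_insert_sizes)
    if n == 0 or library_stdev <= 0:
        raise ValueError("need at least one insert and a positive stdev")
    local_mean = sum(local_insert_sizes) / n
    return (local_mean - library_mean) / (library_stdev / math.sqrt(n))

# Hypothetical library: mean 3000 bp, standard deviation 300 bp.
compressed = ce_statistic([2500] * 9, 3000, 300)  # mates closer than expected
stretched = ce_statistic([3500] * 9, 3000, 300)   # mates farther than expected
print(compressed, stretched)
```

Under these assumptions, nine consistently short inserts already push the statistic past the -4 threshold, while a single short insert would not.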

'''cestat-cov output file: asm.ce.feat'''

Record of positions in the assembly with unusual CE statistic (|ce| > 4).

Description of columns in file:
1. Contig ID
2. MATEPAIR Feature Type
3. range start
4. range end
5. CE_COMPRESS | CE_STRETCH
6. Library ID

asmQC output is written directly to the bank, but the features can be extracted with bank-report.

== Correlated SNP Detection ==

Correlated SNPs are positions in the genome where most of the reads are one base, but multiple other reads have another base. Unlike sequencing errors that occur at random, these correlated discrepancies can indicate the presence of a mis-assembly. In a haploid bacterial genome, for example, correlated SNPs nearly always indicate 2 copies of a near identical repeat have been collapsed into a single copy. In diploid or polyploid genomes, these can indicate a collapsed repeat, or positions where the homologous chromosomes disagree. If the frequency is higher than expected biologically, it is strong evidence for a collapsed repeat.

'''analyzeSNPs output file: asm.snps'''

analyzeSNPs finds all positions in the multiple alignment where the reads disagree. By default, it only reports positions where 2 or more reads disagree with the consensus (but agree with each other) and the sum of their quality values is at least 40.

Description of columns in file:
1. Contig ID
2. Gapped position
3. Ungapped position
4. Consensus
5. Depth of coverage
6. Number of reads that disagree with the consensus
7. X(N) X=base1, N=number of reads that have base1
8. {R1,R2,RN} Read ids that have base X
9. Y(N) Y=base2, N=number of reads that have base2
10. {R1, R2, RN} Read ids that have base Y
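
The default reporting rule can be sketched as follows. This is a minimal illustration, not the analyzeSNPs implementation: the function name, the representation of an alignment column as (base, quality) pairs, and the example values are all hypothetical.

```python
from collections import defaultdict

def correlated_snp_bases(column, consensus, min_reads=2, min_qual_sum=40):
    """Return non-consensus bases supported by enough agreeing reads.

    column: list of (base, quality) pairs, one per read covering the position.
    """
    support = defaultdict(lambda: [0, 0])  # base -> [read count, quality sum]
    for base, qual in column:
        if base != consensus:
            support[base][0] += 1
            support[base][1] += qual
    return sorted(
        base for base, (count, qsum) in support.items()
        if count >= min_reads and qsum >= min_qual_sum
    )

# Two reads agree on G with combined quality 47; the lone T is ignored.
column = [("A", 30), ("A", 35), ("A", 28), ("G", 25), ("G", 22), ("T", 10)]
print(correlated_snp_bases(column, "A"))
```

A single discrepant read, or several low-quality ones, is treated as likely sequencing error and is not reported.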

'''clusterSNPs output file: asm.snp.feat'''

clusterSNPs scans the SNPs report generated by analyzeSNPs to find regions that have a high frequency of SNPs. By default, it reports all regions with at least 2 columns within at most 500bp of each other as found by analyzeSNPs.

Description of columns in file:
1. Contig ID
2. SNP Feature Type (P)
3. range start
4. range end
5. HIGH_SNP
6. The number of SNPs
7. The average distance between SNPs
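
The clustering step can be sketched as below. It is an illustration under the assumption that "within 500bp of each other" means that consecutive SNP positions are at most 500bp apart; the function name and return layout are hypothetical and do not reflect the clusterSNPs source.

```python
def cluster_snps(positions, max_gap=500, min_snps=2):
    """Group SNP positions whose neighbours are at most max_gap apart.

    Returns (start, end, SNP count, average spacing) for each cluster
    containing at least min_snps positions.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    report = []
    for c in clusters:
        if len(c) < min_snps:
            continue
        spacing = (c[-1] - c[0]) / (len(c) - 1) if len(c) > 1 else 0.0
        report.append((c[0], c[-1], len(c), spacing))
    return report

print(cluster_snps([100, 350, 700, 5000, 9000, 9100]))
```

The isolated SNP at position 5000 forms a cluster of one and is dropped, which mirrors the intent of reporting only regions with a high SNP frequency.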

== Read Coverage ==

If the libraries have been constructed using a random shearing process, the reads should uniformly cover the genome at the average depth of coverage. Regions where the coverage is deeper than expected can indicate a collapsed repeat.

'''analyze-read-depth output file: asm.depth.feat'''

By default, analyze-read-depth reports regions that are 3x deeper than the average coverage. Positions within 1000bp of each other are clustered together.

Description of columns in file:
1. Contig ID
2. Coverage Feature Type (D)
3. range start
4. range end
5. Maximum depth of coverage in this range
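
A minimal sketch of this kind of depth scan is shown below. It assumes a per-base list of coverage values for one contig and uses the default thresholds quoted above; the function name and the synthetic depth profile are hypothetical, and the real analyze-read-depth may compute the average and the clustering differently.

```python
def high_depth_regions(depth, factor=3.0, merge_dist=1000):
    """Find positions at least factor * mean depth and merge nearby ones.

    depth: list of per-base coverage values for one contig (0-based positions).
    Returns (start, end, max depth) tuples with inclusive coordinates.
    """
    if not depth:
        return []
    threshold = factor * (sum(depth) / len(depth))
    regions = []
    for pos, d in enumerate(depth):
        if d < threshold:
            continue
        if regions and pos - regions[-1][1] <= merge_dist:
            start, _, peak = regions[-1]
            regions[-1] = (start, pos, max(peak, d))  # extend the open region
        else:
            regions.append((pos, pos, d))             # start a new region
    return regions

# Synthetic contig at 8x coverage with three deep stretches.
depth = [8] * 5000
depth[1000:1100] = [40] * 100
depth[1500:1520] = [30] * 20
depth[4000:4010] = [50] * 10
print(high_depth_regions(depth))
```

The first two deep stretches lie within 1000bp of each other and are reported as one region; the third is far enough away to be reported separately.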

== Singleton Breakpoint Analysis ==

After an assembly is complete, there can be reads left over, called singletons, that are not placed in the assembly. These reads are often from contaminating DNA or otherwise low quality sequence and can be safely ignored. However, some types of mis-assemblies can cause singletons where a portion of the read will align well to the contig but the rest of the read past the mis-assembly junction does not. If there are multiple reads that all follow the same pattern of partially aligning until the same position, this is strong evidence for mis-assembly.

'''listReadPlacedStatus output file: asm.singletons'''

[[listReadPlacedStatus]] can report which contig(s) a read is placed into, but in the pipeline simply lists which reads are singletons.

'''casm-breaks output file: asm.break.fea'''

The singleton reads are then aligned to the consensus sequences of the contigs and analyzed for shared breakpoints. casm-breaks reports positions where multiple reads all have the same breakpoint pattern. Unlike some of the other pipeline tools, [[casm-breaks]] writes an XML-like message file.

File Format:
{FEA Feature message
typ:B Breakpoint feature
src:N,CTG The breakpoint occurs in contig N
com: <string> string linking all of the breakpoint features for a set of reads
clr:X,Y Range of the contig where the read aligns
} End of feature
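
Feature messages in this layout are easy to read with a few lines of code. The parser below is only a sketch: it assumes every field sits on a single "tag:value" line, as in the format summary above, and does not handle multi-line fields or nested messages. The sample message and its values are made up for the example.

```python
def parse_fea_messages(text):
    """Parse flat {FEA ... } blocks with one 'tag:value' field per line."""
    features, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("{FEA"):
            current = {}                      # start of a feature message
        elif line == "}" and current is not None:
            features.append(current)          # end of the feature message
            current = None
        elif current is not None and ":" in line:
            tag, value = line.split(":", 1)
            current[tag] = value.strip()
    return features

# Hypothetical breakpoint feature on contig 7.
sample = """{FEA
typ:B
src:7,CTG
com:breakset_1
clr:1200,1450
}
"""
for feat in parse_fea_messages(sample):
    contig_iid = int(feat["src"].split(",")[0])
    start, end = (int(x) for x in feat["clr"].split(","))
    print(feat["typ"], contig_iid, start, end)
```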

== Repeat K-mer Analysis ==

Almost all mis-assemblies are caused by repeats, and thus it can be useful to find the locations of the repeats in an assembly. Furthermore, it is very interesting to find the locations of collapsed or expanded repeats. We developed a new metric, called normalized k-mer analysis, that can discover collapsed or expanded repeats. A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled. For example, the number of times the k-mers across a 2 copy repeat will be present in K_r is 2 * the depth of coverage. If the 2-copy repeat occurs in 2-copies in the assembly, then those kmers will all be present twice in K_c, and K* will be equal to the depth of coverage. If, however, the repeat was collapsed and occurs only once, then K_c will be 1 across the repeat, and K* will be equal to 2*the average depth of coverage.
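
The normalized count K* can be illustrated with a toy example. The sketch below counts k-mers with a sliding window and divides read counts by consensus counts; it ignores reverse complements and uses a tiny k, so it is only a demonstration of the arithmetic, not of how count-kmers is implemented. All function names and sequences are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every k-length substring across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def normalized_kmer_counts(reads, contigs, k):
    """K* = (occurrences in reads) / (occurrences in contigs) per k-mer."""
    k_r = count_kmers(reads, k)
    k_c = count_kmers(contigs, k)
    return {kmer: k_r[kmer] / n for kmer, n in k_c.items()}

# A 2-copy repeat "ACGT" sequenced at 3x coverage appears 6 times in the reads.
reads = ["ACGT"] * 6
correct = normalized_kmer_counts(reads, ["ACGTTTACGT"], 4)  # both copies assembled
collapsed = normalized_kmer_counts(reads, ["ACGT"], 4)      # copies collapsed
print(correct["ACGT"], collapsed["ACGT"])
```

With both copies present in the consensus, K* equals the depth of coverage (3); with the repeat collapsed, K* doubles to 6.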

'''count-kmers output file: asm.22.n22mers'''

count-kmers can count k-mers of arbitrary length in the reads or contig consensus sequences, and it can compute normalized k-mers. In the forensics pipeline, it computes normalized k-mers where k=22 and the number of occurrences is at least 22 (approximately 3 * the standard depth of coverage, 8). File format (N is the normalized k-mer count for a k-mer sequence):

>N
kmersequence

'''kmer-cov output file: asm.nkmer.feat'''

kmer-cov maps the k-mer coverage across a sequence. In the forensics pipeline it reports regions at least 1000bp long covered by high frequency normalized k-mers, i.e., the collapsed repeats in the genome.

Description of columns in file:
1. Contig ID
2. Coverage Feature Type (K)
3. range start
4. range end
5. Length of region

== Feature Combiner ==

The above metrics can find many different types of mis-assemblies, but each is limited in the type of mis-assembly it can find. Furthermore, normal statistical variation may introduce false positives in the analysis. For example, flagging every insert whose size is more than 2 standard deviations below the library mean will flag about 2.5% of the inserts even though the vast majority are correct. Instead we use a feature combiner to collect all of the evidence for a mis-assembly and output regions where multiple mis-assembly features are present. This allows users to focus their attention on the regions that are most likely to be mis-assembled.
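
The false-positive rate of a one-sided cutoff 2 standard deviations below the mean can be checked directly. The sketch assumes insert sizes are normally distributed, which is an idealization; the function name is hypothetical. The resulting tail is roughly 2.3%, in line with the approximate figure quoted above.

```python
import math

def lower_tail(z):
    """P(Z < -z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p = lower_tail(2.0)
print(f"fraction of correct inserts flagged: {p:.4f}")
print(f"expected spurious flags among 10000 correct inserts: {10000 * p:.0f}")
```

Even a small per-insert rate produces hundreds of spurious flags in a large library, which is why isolated signals are combined before a region is reported.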

All of the features are loaded into the bank, and will then be visible within Hawkeye for further inspection.

'''suspiciousfeat2region output file: asm.suspicious.feat'''

File format:
1. Contig id
2. Mis-assembly Feature Type (A)
3. range start
4. range end
5. MIS-ASSEMBLY
6. Number of features in the region
7. Number of feature types in the region
8. List of features separated with pipe ("|") character
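
The combining idea can be sketched as an interval merge. The code below is an assumption-laden illustration, not the suspiciousfeat2region algorithm: it merges overlapping features on the same contig and keeps only regions supported by at least two distinct feature types. The function name, the tuple layout, and the example features are hypothetical.

```python
def combine_features(features, min_types=2):
    """Merge overlapping features per contig; keep regions with several types.

    features: iterable of (contig_id, feature_type, start, end).
    Returns (contig_id, start, end, n_features, n_types, "T1|T2|...") tuples.
    """
    regions = []
    current = None  # [contig, start, end, [feature types]]
    for contig, ftype, start, end in sorted(features, key=lambda f: (f[0], f[2], f[3])):
        if current and current[0] == contig and start <= current[2]:
            current[2] = max(current[2], end)   # overlap: extend the region
            current[3].append(ftype)
        else:
            if current:
                regions.append(current)
            current = [contig, start, end, [ftype]]
    if current:
        regions.append(current)
    return [
        (c, s, e, len(t), len(set(t)), "|".join(t))
        for c, s, e, t in regions if len(set(t)) >= min_types
    ]

features = [
    ("ctg1", "D", 1000, 2500),
    ("ctg1", "P", 1800, 2200),
    ("ctg1", "K", 2400, 3600),
    ("ctg1", "P", 9000, 9100),
    ("ctg2", "D", 100, 400),
]
print(combine_features(features))
```

Only the first region on ctg1 is reported, because it is supported by three different feature types; the two isolated features are dropped.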

AMOS Getting Started

2011-05-10T00:03:20Z

Floflooo: /* OSX installation */

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

No need to remember this URL as you can easily reach it from the [[AMOS|AMOS main page]].

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, that can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using CVS following the directions here: [http://sourceforge.net/scm/?type=cvs&group_id=134326 http://sourceforge.net/scm/?type=cvs&group_id=134326]

Or in short:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the CVS version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local/AMOS

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/AMOS.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of dependencies ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.
==== Specifying the location of the QT library ====
On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include_dir
--with-Qt-lib_dir
--with-Qt-bin_dir
--with-Qt-lib

Similarly, if you get the message:

WARNING! Boost graph toolkit was not found but is required to run parts of the AMOS Scaffolder (Bambus 2)

try specifying the location of Boost with the option:

--with-Boost-dir

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== Mac OS X installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Each identifier must be unique within a specific object type, but the same value may be used by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.
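
The uniqueness rule can be modelled by keying every lookup on the pair (object type, identifier). The class below is a toy model for illustration only and says nothing about how the bank is stored on disk; the class name, method names, and records are hypothetical, and for simplicity every object is given both identifiers. CTG and RED are the message codes for contigs and reads.

```python
class ObjectIndex:
    """Toy index enforcing per-type uniqueness of IIDs and EIDs."""

    def __init__(self):
        self._by_iid = {}  # (object type, IID) -> record
        self._by_eid = {}  # (object type, EID) -> record

    def add(self, obj_type, iid, eid, record):
        if (obj_type, iid) in self._by_iid:
            raise KeyError(f"duplicate IID {iid} for type {obj_type}")
        if (obj_type, eid) in self._by_eid:
            raise KeyError(f"duplicate EID {eid!r} for type {obj_type}")
        self._by_iid[(obj_type, iid)] = record
        self._by_eid[(obj_type, eid)] = record

    def by_iid(self, obj_type, iid):
        return self._by_iid[(obj_type, iid)]

index = ObjectIndex()
index.add("CTG", 1, "contig_A", {"length": 5000})
index.add("RED", 1, "read_0001", {"length": 800})  # same IID, different type
try:
    index.add("CTG", 1, "contig_B", {"length": 7000})  # second contig with IID 1
except KeyError as err:
    print("rejected:", err)
```
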
==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists, and bank-transact will fail otherwise. When creating a new bank you can run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the 3-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.
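
When these commands are driven from a script, it can help to build the argument lists in one place. The helpers below are a sketch that only assembles the invocations described in this section (the -c, -f, -b, and -m options and the message type codes); the function names are hypothetical, and nothing is executed.

```python
import shlex

def bank_transact_cmd(bank, message_file, create=False, force=False):
    """Build a bank-transact argument list from the options described above."""
    cmd = ["bank-transact"]
    if create:
        cmd.append("-c")   # create a new bank
    if force:
        cmd.append("-f")   # overwrite the bank if it already exists
    return cmd + ["-b", bank, "-m", message_file]

def bank_report_cmd(bank, message_types=()):
    """Build a bank-report argument list, optionally limited to message types."""
    return ["bank-report", "-b", bank, *message_types]

print(shlex.join(bank_transact_cmd("mybank", "mymessagefile", create=True)))
print(shlex.join(bank_report_cmd("mybank", ["CTG", "RED"])))
```

The resulting lists can be passed to subprocess.run as they are, which avoids shell quoting problems with unusual file names.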

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: for reading and for writing. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank
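
The locking rules amount to a standard readers-writer policy, which the toy model below makes explicit. It is not the AMOS implementation (which locks the bank directory on disk); the class and method names are hypothetical, and unlock_all is only analogous in spirit to bank-unlock.

```python
class BankLockModel:
    """Toy model of the read/write locking rules; not the AMOS implementation."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def acquire_read(self):
        if self.writer:
            return False   # a writer excludes all other access
        self.readers += 1
        return True

    def acquire_write(self):
        if self.writer or self.readers:
            return False   # writers need exclusive access
        self.writer = True
        return True

    def unlock_all(self):
        """Clear every lock, e.g. after a program holding one has crashed."""
        self.readers = 0
        self.writer = False

lock = BankLockModel()
print(lock.acquire_read(), lock.acquire_read(), lock.acquire_write())
lock.unlock_all()
print(lock.acquire_write(), lock.acquire_read())
```

Two readers can hold the bank at once, but a writer is refused until all locks are cleared, after which the writer in turn blocks new readers.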

==== Bank versions ====
The specific format of the AMOS bank is closely tied to the version of the AMOS software that created it. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using the matching bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the bank with the most recent bank-transact command.

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), providing the user with a mechanism to execute only some of the steps of a pipeline.
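
The idea of numbered steps that can be run selectively is sketched below. This is not the runAmos pipeline language or its executor; it is a hypothetical Python model in which each step number maps to a callable, and the step names simply echo the overlapper, contigger, and multi-aligner example given earlier.

```python
def run_steps(steps, start=None, end=None):
    """Run numbered steps in order, optionally restricted to [start, end].

    steps: dict mapping step number -> zero-argument callable.
    Returns the list of step numbers that were executed.
    """
    executed = []
    for number in sorted(steps):
        if start is not None and number < start:
            continue
        if end is not None and number > end:
            continue
        steps[number]()
        executed.append(number)
    return executed

log = []
steps = {
    10: lambda: log.append("overlap"),
    20: lambda: log.append("layout"),
    30: lambda: log.append("consensus"),
}
print(run_steps(steps, start=20))  # resume after the overlap step
```

Being able to resume at a given step is useful when an early, expensive stage has already completed and only the later stages need to be rerun.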

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in the file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs; furthermore, it reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document; the tools included in amosvalidate are described in: Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008;9(3):R55.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e., ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly: the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

Bambus2

2011-04-22T02:57:40Z

Floflooo:

[http://www.cs.umd.edu/~sergek/ Sergey Koren] and
[http://www.cbcb.umd.edu/~mpop/ Mihai Pop]

Scaffolding represents the task of ordering and orienting contigs by incorporating additional information about their relative placement along the genome. The original Bambus package was the first general-purpose scaffolder made available as an open source package. We are happy to announce the arrival of Bambus 2.0, the second generation Bambus scaffolder, also available as an open source package. While most other scaffolders are closely tied to a specific assembly program, Bambus accepts the output from most current assemblers and provides the user with great flexibility in choosing the scaffolding parameters. In particular, Bambus is able to accept contig linking data other than that specified by mate-pairs. Such sources of information include alignment to a reference genome (Bambus can directly use the output of MUMmer), physical mapping data, or information about gene synteny.

Getting data into Bambus 2 requires that you convert your assembly to AMOS format. Here is my recipe:

[[toAmos]] \
-s my.fa \
-c my.contig \
-m my.mates \
-o my.afg

You need the .fa to list the contigs within the GFD-like contig file (annoying but true). You don't need accurate sequences in the .fa, you just need something to make the format valid. The .contig and .mates are as expected for [[Bambus]].

The resulting .afg is then 'banked' with:

[[bank-transact]] -c \
-b my.bnk \
-m my.afg

Bambus2 is composed of a series of scripts. For instructions on how to use them, see: [[Bambus 2.0/quick start guide]].

There is a Python script to facilitate running Bambus2 in one quick command:

goBambus2

which returns:

run: goBambus2 <input reads or contigs or amos bank name> <output prefix> [options]
eg.: goBambus2 example.contigs myoutput --all --contigs
This script is designed to run the Bambus pipeline and takes either reads or contigs plus XML Trace Archive data as input and outputs scaffolds
For further info please contact the Bambus 2 authors: Sergey Koren and Mihai Pop

For example, you could run:

goBambus2 brucella.seq myScaff --all --reads

If you prefer Perl, you could use Perl to drive Bambus 2. Here is an example script: [[Bambus 2.0/goBambus-perl]].

More information is available at http://www.cbcb.umd.edu/software/bambus/

Bambus2

2011-04-22T02:27:54Z

Floflooo:

[http://www.cs.umd.edu/~sergek/ Sergey Koren] and
[http://www.cbcb.umd.edu/~mpop/ Mihai Pop]

Scaffolding represents the task of ordering and orienting contigs by incorporating additional information about their relative placement along the genome. The original Bambus package was the first general purpose scaffolders made available as an open source package. We are happy to announce the arrival of Bambus 2.0, the second generation Bambus scaffolder available as an open source package. While most other scaffolders are closely tied to a specific assembly program, Bambus accepts the output from most current assemblers and provides the user with great flexibility in choosing the scaffolding parameters. In particular, Bambus is able to accept contig linking data other than specified by mate-pairs. Such sources of information include alignment to a reference genome (Bambus can directly use the output of MUMmer), physical mapping data, or information about gene synteny.

Getting data into Bambus 2 requires you convert your assembly to AMOS format. Here is my recipe:

[[toAmos]] \
-s my.fa \
-c my.contig \
-m my.mates \
-o my.afg

You need the .fa to list the contigs within the GFD-like contig file (annoying but true). You don't need accurate sequences in the .fa, you just need something to make the format valid. The .contig and .mates are as expected for [[Bambus]].

The resulting .afg is then 'banked' with:

[[bank-transact]] -c \
-b my.bnk \
-m my.afg
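
The two steps above can also be scripted. What follows is a minimal Python sketch that only assembles the two command lines, using just the flags shown in the recipe. The function names, and the assumption that all files share one stem (my.fa, my.contig, my.mates), are illustrative and not part of AMOS.

```python
import shlex


def to_amos_command(fasta, contig, mates, afg):
    """Argument list for the toAmos conversion step shown above."""
    return ["toAmos", "-s", fasta, "-c", contig, "-m", mates, "-o", afg]


def bank_create_command(bank, afg):
    """Argument list for loading the .afg into a new bank (-c creates it)."""
    return ["bank-transact", "-c", "-b", bank, "-m", afg]


def recipe(stem):
    """Both steps, assuming every input file shares the same stem."""
    afg = stem + ".afg"
    return [
        to_amos_command(stem + ".fa", stem + ".contig", stem + ".mates", afg),
        bank_create_command(stem + ".bnk", afg),
    ]


if __name__ == "__main__":
    for cmd in recipe("my"):
        # Print only; hand each list to subprocess.run(cmd, check=True)
        # to actually execute it on a machine where AMOS is installed.
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments, rather than one string, avoids quoting problems if a file name contains spaces.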

Bambus2 is composed of a series of scripts. For instructions on how to use them, see: [[Bambus 2.0/quick start guide]].

There is a Python script to facilitate running Bambus2 in one quick command:

goBambus2

which returns:

run: goBambus2 <input reads or contigs or amos bank name> <output prefix> [options]
eg.: goBambus2 example.contigs myoutput --all --contigs
This script is designed to run the Bambus pipeline and takes either reads or contigs plus XML Trace Archive data as input and outputs scaffolds
For further info please contact the Bambus 2 authors: Sergey Koren and Mihai Pop

For example, you could run:

goBambus2.py brucella.seq myScaff --all --reads

If you prefer Perl, you could use Perl to drive Bambus 2. Here is an example script: [[Bambus 2.0/goBambus-perl]].

More information is available at http://www.cbcb.umd.edu/software/bambus/

AMOS Getting Started

2011-04-22T01:57:40Z

Floflooo:

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler, but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

There is no need to remember this URL, as you can easily reach it from the [[AMOS]] main page.

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using CVS following the directions here: [http://sourceforge.net/scm/?type=cvs&group_id=134326 http://sourceforge.net/scm/?type=cvs&group_id=134326]

Or in short:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the CVS version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local/AMOS

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/AMOS/.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of dependencies ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.
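
Before re-running configure, it can help to check where (or whether) nucmer is visible from your environment. The following Python sketch does that under one stated assumption: an explicit NUCMER setting takes precedence over a PATH search, mirroring the advice above. The configure script's real lookup logic may differ, and find_nucmer is a hypothetical helper name.

```python
import os
import shutil
import tempfile


def find_nucmer(environ=None):
    """Return a path to nucmer, or None if it cannot be located.

    Assumption: NUCMER, when set to an executable file, wins over PATH.
    """
    environ = os.environ if environ is None else environ
    explicit = environ.get("NUCMER")
    if explicit and os.path.isfile(explicit) and os.access(explicit, os.X_OK):
        return explicit
    # An empty PATH makes shutil.which return None instead of searching.
    return shutil.which("nucmer", path=environ.get("PATH", ""))


if __name__ == "__main__":
    # Demo with a stand-in executable so the sketch runs without MUMmer.
    with tempfile.TemporaryDirectory() as d:
        fake = os.path.join(d, "nucmer")
        with open(fake, "w") as fh:
            fh.write("#!/bin/sh\n")
        os.chmod(fake, 0o755)
        print("found via PATH:", find_nucmer({"PATH": d}) == fake)
    print("in your environment:", find_nucmer())
```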
==== Specifying the location of the QT library ====
On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include_dir
--with-Qt-lib_dir
--with-Qt-bin_dir
--with-Qt-lib

Similarly, if you get the message:

WARNING! Boost graph toolkit was not found but is required to run parts of the AMOS Scaffolder (Bambus 2)

try specifying the location of Boost with the option:

--with-Boost-dir

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== OSX installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly-related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Both identifiers must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.
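
The identifier rule can be illustrated with a toy in-memory index. This Python sketch is not the AMOS bank API: the class and method names are hypothetical, and it only models the point that an IID or EID is unique within one object type (keyed here by the three-letter codes CTG and RED) while different types may reuse the same value.

```python
class ToyBank:
    """Toy index keyed by (type code, identifier); not the AMOS bank API."""

    def __init__(self):
        self._by_iid = {}  # (type code, IID) -> object
        self._by_eid = {}  # (type code, EID) -> object

    def add(self, type_code, obj, iid=None, eid=None):
        if iid is None and eid is None:
            raise ValueError("an object needs an IID, an EID, or both")
        # Reject duplicates before storing anything, so a failed add
        # leaves the index unchanged.
        if iid is not None and (type_code, iid) in self._by_iid:
            raise KeyError(f"duplicate IID {iid} for type {type_code}")
        if eid is not None and (type_code, eid) in self._by_eid:
            raise KeyError(f"duplicate EID {eid!r} for type {type_code}")
        if iid is not None:
            self._by_iid[(type_code, iid)] = obj
        if eid is not None:
            self._by_eid[(type_code, eid)] = obj

    def by_iid(self, type_code, iid):
        return self._by_iid[(type_code, iid)]

    def by_eid(self, type_code, eid):
        return self._by_eid[(type_code, eid)]


if __name__ == "__main__":
    bank = ToyBank()
    bank.add("CTG", "first contig", iid=1)
    bank.add("RED", "first read", iid=1, eid="read_a")  # same IID, other type
    print(bank.by_iid("CTG", 1), "/", bank.by_iid("RED", 1))
```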
==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists; bank-transact will fail otherwise. To create a new bank, run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.
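
The choice between these invocations can be made by a small wrapper. The Python sketch below only builds the argument list and relies on two things stated on this page: a bank is a directory, and -c together with -f overwrites an existing bank. The function name and the overwrite parameter are hypothetical.

```python
import os


def bank_transact_command(bank, message_file, overwrite=False):
    """Build a bank-transact argument list suited to the bank's state."""
    args = ["bank-transact"]
    if not os.path.isdir(bank):
        args.append("-c")       # bank missing: create it
    elif overwrite:
        args += ["-c", "-f"]    # bank present: force it to be overwritten
    # Otherwise the bank exists and the messages are loaded into it as is.
    return args + ["-b", bank, "-m", message_file]


if __name__ == "__main__":
    print(" ".join(bank_transact_command("mybank", "mymessagefile")))
```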

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the three-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: for reading, and for writing. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank
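
The read/write rules in this section can be summarized with a toy model. This Python sketch says nothing about how AMOS actually implements its locks; the class and method names are hypothetical, and it only encodes the compatibility rules described above (readers may share, a writer needs exclusive access, and all locks can be cleared after a crash).

```python
class ToyBankLock:
    """Toy model of the read/write rules; not how AMOS implements locks."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def acquire_read(self):
        if self.writer:
            return False        # a writer excludes everyone else
        self.readers += 1
        return True

    def acquire_write(self):
        if self.writer or self.readers:
            return False        # writing needs exclusive access
        self.writer = True
        return True

    def release_read(self):
        if self.readers == 0:
            raise RuntimeError("no read lock held")
        self.readers -= 1

    def release_write(self):
        if not self.writer:
            raise RuntimeError("no write lock held")
        self.writer = False

    def unlock_all(self):
        """Drop every lock, as one would after a crash left stale locks."""
        self.readers = 0
        self.writer = False


if __name__ == "__main__":
    lock = ToyBankLock()
    print("two readers:", lock.acquire_read(), lock.acquire_read())
    print("writer while reading:", lock.acquire_write())
```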

==== Bank versions ====
The specific format of the AMOS bank is closely related to the current version of the AMOS software. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the bank with the most recent bank-transact command.

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language, and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), providing the user with a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).
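
To make the idea of numbered steps concrete, here is a toy Python model of a pipeline in which steps run in ascending order and a sub-range can be selected. It does not reproduce the runAmos language or its options; the function, its start and end parameters, and the step numbers are all hypothetical.

```python
def run_pipeline(steps, start=None, end=None):
    """Run numbered steps in ascending order, optionally a sub-range.

    steps maps a step number to a zero-argument callable; start and end
    are inclusive bounds. Returns the step numbers that were executed.
    """
    executed = []
    for number in sorted(steps):
        if start is not None and number < start:
            continue
        if end is not None and number > end:
            continue
        steps[number]()
        executed.append(number)
    return executed


if __name__ == "__main__":
    log = []
    steps = {
        10: lambda: log.append("overlapper"),
        20: lambda: log.append("contigger"),
        30: lambda: log.append("multi-aligner"),
    }
    # Re-run only the last two steps, e.g. after changing a parameter.
    print(run_pipeline(steps, start=20), log)
```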

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.
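
When AMOScmp is driven from a script, the command line above can be assembled programmatically. This Python sketch uses only the -D option and the TGT and REF variables shown above; the function name and its keyword arguments are hypothetical, and it builds the argument list without running anything.

```python
def amoscmp_command(prefix, tgt=None, ref=None):
    """Build an AMOScmp argument list with optional TGT/REF overrides."""
    args = ["AMOScmp"]
    if tgt is not None:
        args += ["-D", f"TGT={tgt}"]
    if ref is not None:
        args += ["-D", f"REF={ref}"]
    # The prefix is always required because it names the output files.
    return args + [prefix]


if __name__ == "__main__":
    print(" ".join(amoscmp_command("prefix")))
    print(" ".join(amoscmp_command("prefix", tgt="mytarget.afg",
                                   ref="myreference.1con")))
```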

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs. It also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e. ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly; the assembly with fewer features is likely "better").
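
The automatic comparison mentioned above can be sketched in a few lines of Python. The data layout is an assumption made for illustration: each feature is a (contig, start, end, tag) tuple that has already been extracted from a bank, and the tag names are invented. Nothing here reads an AMOS bank or reflects the actual feature encoding.

```python
from collections import Counter

# Each feature is (contig, start, end, tag); the tag names are hypothetical.
ASSEMBLY_A = [
    ("contig1", 100, 250, "snp_cluster"),
    ("contig1", 900, 1400, "mate_distance"),
    ("contig2", 10, 60, "snp_cluster"),
]
ASSEMBLY_B = [
    ("contig1", 120, 240, "snp_cluster"),
]


def count_by_tag(features):
    """Count flagged ranges per problem tag."""
    return Counter(tag for _contig, _start, _end, tag in features)


def fewest_features(assemblies):
    """Return the name of the assembly with the fewest flagged ranges."""
    return min(assemblies, key=lambda name: len(assemblies[name]))


if __name__ == "__main__":
    print(dict(count_by_tag(ASSEMBLY_A)))
    print(fewest_features({"A": ASSEMBLY_A, "B": ASSEMBLY_B}))
```

A raw feature count is a crude score; weighting by tag or by the length of the flagged range may be more informative.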

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

OSX installation

2011-04-22T01:43:13Z

Floflooo:

Download QT/Mac 3.3.x from Trolltech.

As of 4/12/06, the most recent version is available at: ftp://ftp.trolltech.com/qt/source/qt-mac-free-3.3.6.tar.gz

Follow the Trolltech instructions for building QT. Make sure to set the environment variable QTDIR appropriately.

Run this command to configure AMOS:

./configure

Note that if the QT or Boost configure tests fail, you may have to manually specify the location of the QT or Boost libraries, e.g. most likely:

./configure --with-Qt-dir=/opt/local/lib/qt3 --with-Boost-dir=/opt/local/include

Run make to build AMOS. Then run:

cd src/bankViewer
$QTDIR/bin/qmake
make

The Hawkeye binary will then build in the Hawkeye directory. You will have to manually copy it to your bin directory.

OSX installation

2011-04-22T01:38:23Z

Floflooo:

Download QT/Mac 3.3.x from Trolltech.

As of 4/12/06, the most recent version is available at: ftp://ftp.trolltech.com/qt/source/qt-mac-free-3.3.6.tar.gz

Follow the Trolltech instructions for building QT. Make sure to set the environment variable QTDIR appropriately.

Run this command to configure AMOS:

./configure

Note that if the QT or Boost configure tests fail, you may have to manually specify the location of the QT libraries, e.g. most likely:

./configure --with-Qt-dir=/opt/local/lib/qt3 --with-Boost-dir=/opt/local/include

Run make to build AMOS. Then run:

cd src/bankViewer
$QTDIR/bin/qmake
make

The Hawkeye binary will then build in the Hawkeye directory. You will have to manually copy it to your bin directory.

AMOS Getting Started

2011-04-22T00:49:47Z

Floflooo: Fixed typos and added the --with-Boost-dir option

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler, but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

There is no need to remember this URL, as you can easily reach it from the [[AMOS]] main page.

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using CVS following the directions here: [http://sourceforge.net/scm/?type=cvs&group_id=134326 http://sourceforge.net/scm/?type=cvs&group_id=134326]

Or in short:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the CVS version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of dependencies ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.
==== Specifying the location of the QT library ====
On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include_dir
--with-Qt-lib_dir
--with-Qt-bin_dir
--with-Qt-lib

Similarly, if you get the message:

WARNING! Boost graph toolkit was not found but is required to run parts of the AMOS Scaffolder (Bambus 2)

try specifying the location of Boost with the option:

--with-Boost-dir

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== OSX installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly-related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Both identifiers must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.

==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists; bank-transact will fail otherwise. To create a new bank, run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the three-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: for reading, and for writing. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank

==== Bank versions ====
The specific format of the AMOS bank is closely related to the current version of the AMOS software. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the bank with the most recent bank-transact command.

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), which gives the user a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs; it also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e., ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly - the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

OSX installation

2011-04-22T00:28:25Z

Floflooo:

Download QT/Mac 3.3.x from Trolltech.

As of 4/12/06, the most recent version is available at: ftp://ftp.trolltech.com/qt/source/qt-mac-free-3.3.6.tar.gz

Follow the Trolltech instructions for building QT. Make sure to set the environment variable QTDIR appropriately.

Run this command to configure AMOS:

./configure

Note that the QT configure tests may fail. You may have to manually specify the location of the QT libraries, most likely:

./configure --with-Qt-dir=/opt/local/lib/qt3

Run make to build AMOS. Then run:

cd src/bankViewer
$QTDIR/bin/qmake
make

The Hawkeye binary will then be built in the Hawkeye directory. You will have to manually copy it to your bin directory.

AMOS Getting Started

2011-04-20T00:35:50Z

Floflooo: /* Fedora installation */

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler, but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp - a comparative assembler, and Minimus - a basic assembler for small datasets. However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

No need to remember this URL as you can easily reach it from the [[AMOS]] main page.

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, that can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using CVS following the directions here: [http://sourceforge.net/scm/?type=cvs&group_id=134326 http://sourceforge.net/scm/?type=cvs&group_id=134326]

Or in short:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the CVS version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.
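
The three steps above can be chained in a small shell function. This is only a sketch: build_amos is a hypothetical helper name, and the source directory and prefix are placeholders you would replace with your own paths. It stops at the first failing step, so that a configure error does not go unnoticed.

```shell
# Sketch: configure, build and install AMOS from an unpacked source tree.
# build_amos is a hypothetical helper, not part of the AMOS distribution.
#   $1 = unpacked source directory (e.g. amos-1.4.5)
#   $2 = installation prefix (e.g. /usr/local)
build_amos() {
    src_dir=$1
    prefix=$2
    (
        cd "$src_dir" || exit 1
        # Abort as soon as one step fails instead of building on a bad configure.
        ./configure --prefix="$prefix" || exit 1
        make || exit 1
        make install || exit 1
    )
}
```

For example, "build_amos amos-1.4.5 /usr/local" runs the same commands as typing them by hand inside the amos-1.4.5 directory.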

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of MUMmer ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.
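
Before re-running configure it can help to confirm that the path you are about to use really points at an executable. The sketch below assumes a Bourne-style shell; check_nucmer is a hypothetical helper name, and it only checks that the file exists and is executable, not that it is a working nucmer.

```shell
# Sketch: report which nucmer would be used, preferring the NUCMER variable
# and falling back to whatever is on the PATH.
# check_nucmer is a hypothetical helper, not part of AMOS or MUMmer.
check_nucmer() {
    nucmer_path=${NUCMER:-$(command -v nucmer 2>/dev/null)}
    if [ -n "$nucmer_path" ] && [ -x "$nucmer_path" ]; then
        echo "$nucmer_path"
    else
        echo "nucmer not found; set NUCMER before running ./configure" >&2
        return 1
    fi
}
```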
=== Specifying the location of the QT library ===
On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will make without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include-dir
--with-Qt-lib-dir
--with-Qt-bin-dir
--with-Qt-lib

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora, RedHat, CentOS installation ===
[[Fedora installation]]

=== OSX installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs were able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Both identifiers must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap, all with IID = 1.
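
The uniqueness rule can be illustrated with a small check. The sketch below works on a hypothetical plain-text listing with one object per line in the form "TYPE IID EID"; this listing is an assumption made for the example and is not an AMOS file format. find_duplicate_iids is likewise a hypothetical helper name.

```shell
# Sketch: report (type, IID) pairs that occur more than once.
# Input (hypothetical listing, not an AMOS format): one "TYPE IID EID" per line,
# read from the named files or from standard input.
find_duplicate_iids() {
    # The key combines object type and IID, so "CTG 1" and "RED 1" do not clash.
    # Each offending pair is printed once, when it is seen the second time.
    awk '{ key = $1 " " $2; if (seen[key]++ == 1) print key }' "$@"
}
```

A listing containing a contig, a read and an overlap that all have IID 1 produces no output, whereas two contigs with IID 1 are reported as "CTG 1".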
==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists; bank-transact will fail otherwise. To create a new bank, run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.
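
Since a bank is just a directory, a script can decide between the two invocations by testing whether that directory exists. The following is a sketch under that assumption; load_into_bank is a hypothetical helper name, and it deliberately never passes -f, so an existing bank is appended to rather than overwritten.

```shell
# Sketch: load a message file into a bank, creating the bank only if needed.
# load_into_bank is a hypothetical helper, not part of the AMOS distribution.
#   $1 = bank directory, $2 = AMOS message file
load_into_bank() {
    bank=$1
    messages=$2
    if [ -d "$bank" ]; then
        # Bank directory already exists: plain load.
        bank-transact -b "$bank" -m "$messages"
    else
        # No such directory yet: ask bank-transact to create the bank.
        bank-transact -c -b "$bank" -m "$messages"
    fi
}
```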

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the three-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: read locks and write locks. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e., the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank

==== Bank versions ====
The specific format of the AMOS bank is closely tied to the version of the AMOS software that wrote it. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using the matching bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the output into a new bank with the most recent bank-transact command.
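
The dump-and-reload procedure might be scripted roughly as follows. This is a sketch: migrate_bank is a hypothetical helper name, and it assumes that the old bank-report writes the message file to standard output. The old bank is left untouched and a new bank is created next to it.

```shell
# Sketch: copy an old-format bank into a new bank via a message file.
# migrate_bank is a hypothetical helper, not part of the AMOS distribution.
#   $1 = old bank-report program (e.g. bank-report-1.1)
#   $2 = old bank, $3 = new bank to create
# Assumption: the old bank-report writes its messages to standard output.
migrate_bank() {
    old_report=$1
    old_bank=$2
    new_bank=$3
    dump=$(mktemp) || return 1
    # Only reload if the dump succeeded.
    "$old_report" -b "$old_bank" > "$dump" &&
        bank-transact -c -b "$new_bank" -m "$dump"
    status=$?
    rm -f "$dump"
    return $status
}
```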

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language), which gives the user a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.
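
A missing input file is an easy mistake to make with these naming conventions, so a wrapper script may want to check the inputs first. The sketch below uses a hypothetical helper, amoscmp_inputs_ok, which applies the default names described above, verifies that both files are readable, and prints the AMOScmp command line it would correspond to without running it.

```shell
# Sketch: resolve the target and reference files for a prefix and check them.
# amoscmp_inputs_ok is a hypothetical helper, not part of the AMOS distribution.
#   $1 = prefix, $2 = optional target file, $3 = optional reference file
amoscmp_inputs_ok() {
    prefix=$1
    tgt=${2:-$prefix.afg}     # default target: prefix.afg
    ref=${3:-$prefix.1con}    # default reference: prefix.1con
    for f in "$tgt" "$ref"; do
        if [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
    # Print (do not run) the matching command line.
    echo "AMOScmp -D TGT=$tgt -D REF=$ref $prefix"
}
```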

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs; it also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e., ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly - the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.
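
To compare assemblies by their number of features, one option is to count the feature records that bank-report prints for each bank. The sketch below rests on two assumptions that should be checked against your AMOS version: that features are stored under the three-letter code FEA, and that each record in the report starts with a line beginning "{FEA". count_features is a hypothetical helper name.

```shell
# Sketch: count the feature records stored in a bank after amosvalidate has run.
# count_features is a hypothetical helper, not part of the AMOS distribution.
# Assumptions: features use the message code FEA, and every record printed by
# bank-report starts with a line beginning "{FEA".
count_features() {
    bank-report -b "$1" FEA | grep -c '^{FEA'
}
```

Running it on two banks built from the same reads, e.g. "count_features asm1.bnk" and "count_features asm2.bnk", gives the two numbers to compare.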

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

AMOS

2011-03-18T01:30:13Z

Floflooo: /* Assemblers */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[Minimo]] - the minimus assembler with many more options
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[minimus2-blat]] - Same as minimus2 but uses BLAT instead of Nucmer for added speed

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[FRCurve]] - Feature-Response Curve
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release Section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI certified open source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage for the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

FRCurve

2011-03-18T01:25:20Z

Floflooo:

'''FRCurve''': Feature-Response Curve

== Overview ==

Inspired by the standard receiver operating characteristic (ROC) curve, the Feature-Response curve characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).

The AMOS package provides an automated assembly validation pipeline called [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Amosvalidate amosvalidate] that analyzes the output of an assembler using a variety of assembly quality metrics (or features). Examples of features include: (M) mate-pair orientations and separations, (K) repeat content by k-mer analysis, (C) depth-of-coverage, (P) correlated polymorphism in the read alignments, and (B) read alignment breakpoints to identify structurally suspicious regions of the assembly. After running amosvalidate on the output of the assembler, each contig is assigned a number of features that correspond to doubtful regions of the sequence.

Given any such set of features, the response (quality) of the assembler output is then analyzed as a function of the maximum number of possible errors (features) allowed in the contigs. More specifically, for a fixed feature threshold <math>\phi</math>, the contigs are sorted by size and, starting from the longest, contigs are tallied only as long as their sum of features is <math>\leq \phi</math>. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve.
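
One point of the curve can be sketched with sort and awk. This is an illustration of one reading of the description above, not the FRC module itself: it treats the "sum of features" as a running total over the sorted contigs and stops at the first contig that would push the total past the threshold. The input listing ("NAME LENGTH FEATURES", one contig per line) and the name frc_point are assumptions made for the example, and coverage is approximated as tallied contig length divided by genome size.

```shell
# Sketch: compute one Feature-Response point for a given threshold.
# frc_point is a hypothetical helper; it is not the FRC module shipped with AMOS.
#   $1 = feature threshold (phi), $2 = genome size in bp
# Input on stdin (hypothetical listing): "NAME LENGTH FEATURES", one contig per line.
# Assumption: features are summed cumulatively over contigs sorted longest first,
# and tallying stops at the first contig that would exceed the threshold.
frc_point() {
    phi=$1
    genome_size=$2
    sort -k2,2nr | awk -v phi="$phi" -v g="$genome_size" '
        {
            if (total + $3 > phi) exit   # next contig would exceed the threshold
            total += $3                  # running number of features
            covered += $2                # running length of tallied contigs
        }
        END { printf "%d %.4f\n", phi, covered / g }'
}
```

Calling frc_point for a range of thresholds on the same listing yields the (threshold, coverage) pairs that make up a curve.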

== Documentation ==

Following the AMOS philosophy, the FRCurve is implemented as a pipeline that consists of two steps:
* 1. an invocation of the amosvalidate tool to compute the features for the set of contigs;
* 2. an invocation of the FRC module.
The name of the pipeline in the AMOS distribution is "FRCurve".

Documentation on how to run FRCurve is obtained by typing:

FRCurve -h

The usage message is:

Usage:
FRCurve [params] \
-D GENOME_SIZE=<n> - Genome size (number of bps)
-D BANK=<n> - AMOS bank name
Output:
The Feature-Response curve (FRC) is saved in file "FRC.txt", while
FRCs for each feature type are saved respectively in:
"FRC_coverage.txt", "FRC_polymorphism.txt", "FRC_breakpoint.txt",
"FRC_kmer.txt", "FRC_matepair.txt" and "FRC_misassembly.txt"
File format:
Each file contains the FRCs in 3-columns format
- column 1 = feature threshold T;
- column 2 = contigs' N50 associated to the threshold T in column 1;
- column 3 = approximate coverage of the contigs whose number of features is <= T;
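
The three-column output can be loaded for plotting or comparison with a short script. The sketch below is an assumption-laden illustration: it supposes that the columns are separated by whitespace and that the file has no header line, neither of which is stated in the usage message, and the function name <code>parse_frc</code> is made up for this example.

```python
def parse_frc(text):
    """Parse the contents of an FRC file into (threshold, n50, coverage) tuples.

    Assumes whitespace-separated columns and no header line.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        threshold, n50, coverage = (float(field) for field in fields)
        rows.append((threshold, n50, coverage))
    return rows


# Made-up values, only to show the shape of the result.
example = "0 1200 10.5\n5 3400 80.0\n"
for threshold, n50, coverage in parse_frc(example):
    print(threshold, n50, coverage)
```

To read a real file, pass <code>open("FRC.txt").read()</code> to the function.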

== Example ==

The figure below shows the Feature-Response Curve generated for the [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Minimus minimus] assembly pipeline on the ''Brucella suis'' genome using the benchmark dataset available [http://sourceforge.net/apps/mediawiki/amos/index.php?title=Benchmark here].

[[File:minimus_frc.jpeg|600px]]

== People ==

* [http://cims.nyu.edu/~gn387/ Giuseppe Narzisi] (PhD Student, NYU)
* [http://www.cs.nyu.edu/mishra/ Bud Mishra] (Faculty, NYU)

== References ==

Coming soon...

== Acknowledgements ==

Research reported here was supported by grants from NSF CDI program and Abraxis BioScience, LLC.

ToAmos

2011-03-17T08:58:06Z

Floflooo: /* Known issues */

toAmos: converter from various types of inputs to AMOS messages

== Overview ==

toAmos is primarily designed for converting the output of an assembly program into the AMOS format so that it can be stored in an AMOS bank. toAmos can be used as a replacement for tarchive2amos; however, the latter is more flexible when converting from Trace Archive or simple .seq and .qual inputs.

== Synopsis ==

toAmos -o out_file
(-s fasta_reads (-q qual_file) (-gq good_qual) (-bq bad_qual))
(-c tigr_contig | -a celera_asm [-S][-utg] | -ta tigr_asm | -ace phrap_ace [-phd])
(-m bambus_mates | -x trace_xml | -f celera_frg [-acc])
(-arachne arachne_links | -scaff bambus_scaff)
(-i insert_file | -map dst_map)
(-pos pos_file)
(-id min_id)

toAmos reads the inputs specified on the command line and converts the information into AMOS message format. The following types of information can be provided to toAmos:

* Sequence and quality data (options -f, -s, -q, -gq, or -bq)
* Library and mate-pair data (options -m, -x, -f, -i, or -map)
* Contig data (options -c, -a, -ta, or -ace)
* Scaffold data (options -a, -arachne, or -scaff)
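
As an illustration of how the option groups combine, the following Python sketch assembles a toAmos command line for the common case of FASTA reads with an optional quality file. Only flags listed in the synopsis are used; the helper name <code>toamos_command</code> and the file names are hypothetical, and the sketch builds the argument list without running the tool.

```python
import shlex


def toamos_command(out_file, fasta_reads=None, qual_file=None, celera_frg=None):
    """Build an argument list for toAmos from a few of its documented options."""
    args = ["toAmos"]
    if fasta_reads is not None:
        args += ["-s", fasta_reads]   # sequence data in FASTA format
    if qual_file is not None:
        args += ["-q", qual_file]     # quality scores in QUAL format
    if celera_frg is not None:
        args += ["-f", celera_frg]    # Celera Assembler .frg input
    args += ["-o", out_file]          # output message file
    return args


# Hypothetical file names, shown as the command line a shell would receive.
print(shlex.join(toamos_command("reads.afg",
                                fasta_reads="reads.fasta",
                                qual_file="reads.qual")))
```

The resulting list can be handed to <code>subprocess.run</code> on a system where toAmos is installed.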

== Options ==
{| class="somecssclass" border="1"
|-
| -o <out_file> || output filename ('-' for standard output)
|-
| -s <fasta_reads> || sequence data file in FASTA format (reads names ending in .1 or /1 are taken as mate pairs)
|-
| -q <qual_file> || sequence quality score file in QUAL format
|-
| -gq <good_qual> || minimum quality score for high-quality bases (default: 30); if no quality file is provided, bases within the clear range are assigned this quality value
|-
| -bq <bad_qual> || maximum quality score for low-quality bases (default: 10); if no quality file is provided, bases outside the clear range are assigned this quality value
|-
| -c <tigr_contig> || provide TIGR .contig file [http://www.cbcb.umd.edu/research/contig_representation.shtml in GDE-like format]
|-
| -a <celera_asm> || use Celera Assembler .asm contig file (contig and scaffold information)
|-
| -S || include the surrogate unitigs in the .asm file as AMOS contigs
|-
| -utg || include all UTG unitig messages in the .asm file as AMOS contigs
|-
| -ta <tigr_asm> || contig file in TIGR Assembler format (.tasm)
|-
| -ace <phrap_ace> || contig file in Phrap ACE format (can be accompanied by -q)
|-
| -phd || read the content of PHD file referenced in ACE files
|-
| -m <bambus_mates> || library and mate-pair information file in Bambus format
|-
| -x <trace_xml> || ancillary data file (library, mate-pair, clear range) in Trace Archive XML format
|-
| -f <celera_frg> || library, mate-pair, sequence, quality, and clear range data file in Celera Assembler format
|-
| -acc || use accession numbers in FRG files
|-
| -arachne <arachne_links> || scaffold file in Arachne .links format
|-
| -scaff <bambus_scaff> || scaffold file in Bambus .scaff format
|-
| -map <dst_map> || read map information - mapping from internal library ID to external library ID useful in conjunction with the -f option. This file consists of space-separated records providing a mapping from the "acc:" field in "DST" records within the .frg file to an externally recognizable name for each library.
|-
| -pos <pos_file> || TIGR-style .pos position file
|-
| -id <min_id> || start numbering contigs at this number
|-
|}
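
The file passed to -map is described above only as a set of space-separated records. The Python sketch below shows one way such a mapping could be read; it assumes one record per line with the internal "acc:" value first and the external library name second, which is a guess about the layout rather than a documented format, and the name <code>parse_dst_map</code> is hypothetical.

```python
def parse_dst_map(text):
    """Map internal library IDs to external names.

    Assumed layout (not confirmed by the toAmos documentation):
    one record per line, "<internal_acc> <external_name>".
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or incomplete records
        internal_acc, external_name = fields[0], fields[1]
        mapping[internal_acc] = external_name
    return mapping


# Made-up records for illustration.
example = "1001 small_insert_lib\n1002 fosmid_lib\n"
print(parse_dst_map(example))
```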

== TIGR specific options (not too useful outside TIGR) ==

* -i <insert file> - use mapping from internal library ID to external library ID provided in a .insert file produced by pullfrag.

== Known issues ==

The -ta (TIGR Assembler input) option has not been thoroughly tested and likely does not work properly. Contact us if this option is important to you.

== Errors ==
n/a

ToAmos

2011-03-17T08:57:27Z

Floflooo: /* Errors */

toAmos: converter from various types of inputs to AMOS messages

== Overview ==

toAmos is primarily designed for converting the output of an assembly program into the AMOS format so that it can be stored in an AMOS bank. toAmos can be used as a replacement for tarchive2amos; however, the latter is more flexible when converting from Trace Archive or simple .seq and .qual inputs.

== Synopsis ==

toAmos -o out_file
(-s fasta_reads (-q qual_file) (-gq good_qual) (-bq bad_qual))
(-c tigr_contig | -a celera_asm [-S][-utg] | -ta tigr_asm | -ace phrap_ace [-phd])
(-m bambus_mates | -x trace_xml | -f celera_frg [-acc])
(-arachne arachne_links | -scaff bambus_scaff)
(-i insert_file | -map dst_map)
(-pos pos_file)
(-id min_id)

toAmos reads the inputs specified on the command line and converts the information into AMOS message format. The following types of information can be provided to toAmos:

* Sequence and quality data (options -f, -s, -q, -gq, or -bq)
* Library and mate-pair data (options -m, -x, -f, -i, or -map)
* Contig data (options -c, -a, -ta, or -ace)
* Scaffold data (options -a, -arachne, or -scaff)

== Options ==
{| class="somecssclass" border="1"
|-
| -o <out_file> || output filename ('-' for standard output)
|-
| -s <fasta_reads> || sequence data file in FASTA format (reads names ending in .1 or /1 are taken as mate pairs)
|-
| -q <qual_file> || sequence quality score file in QUAL format
|-
| -gq <good_qual> || minimum quality score for high-quality bases (default: 30); if no quality file is provided, bases within the clear range are assigned this quality value
|-
| -bq <bad_qual> || maximum quality score for low-quality bases (default: 10); if no quality file is provided, bases outside the clear range are assigned this quality value
|-
| -c <tigr_contig> || provide TIGR .contig file [http://www.cbcb.umd.edu/research/contig_representation.shtml in GDE-like format]
|-
| -a <celera_asm> || use Celera Assembler .asm contig file (contig and scaffold information)
|-
| -S || include the surrogate unitigs in the .asm file as AMOS contigs
|-
| -utg || include all UTG unitig messages in the .asm file as AMOS contigs
|-
| -ta <tigr_asm> || contig file in TIGR Assembler format (.tasm)
|-
| -ace <phrap_ace> || contig file in Phrap ACE format (can be accompanied by -q)
|-
| -phd || read the content of PHD file referenced in ACE files
|-
| -m <bambus_mates> || library and mate-pair information file in Bambus format
|-
| -x <trace_xml> || ancillary data file (library, mate-pair, clear range) in Trace Archive XML format
|-
| -f <celera_frg> || library, mate-pair, sequence, quality, and clear range data file in Celera Assembler format
|-
| -acc || use accession numbers in FRG files
|-
| -arachne <arachne_links> || scaffold file in Arachne .links format
|-
| -scaff <bambus_scaff> || scaffold file in Bambus .scaff format
|-
| -map <dst_map> || read map information - mapping from internal library ID to external library ID useful in conjunction with the -f option. This file consists of space-separated records providing a mapping from the "acc:" field in "DST" records within the .frg file to an externally recognizable name for each library.
|-
| -pos <pos_file> || TIGR-style .pos position file
|-
| -id <min_id> || start numbering contigs at this number
|-
|}

== TIGR specific options (not too useful outside TIGR) ==

* -i <insert file> - use mapping from internal library ID to external library ID provided in a .insert file produced by pullfrag.

== Known issues ==

The -ta (TIGR Assembler input) and -ace (ACE formatted input) options have not been thoroughly tested and likely do not work properly. Contact us if either of these options is important to you.

== Errors ==
n/a

AMOS Getting Started

2011-02-01T02:06:16Z

Floflooo:

{{TOC}}

"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler but rather a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). However, it is important to realize that, with a little bit of programming, you can use AMOS to put together your own shotgun assembler customized for the specific characteristics of your data.

This page will provide you with the basic information needed to get started using AMOS. Advanced AMOS users can go directly to in-depth resources from the main page [[AMOS]].

== Downloading AMOS ==
AMOS can be downloaded from Sourceforge using the following link: [http://sourceforge.net/project/showfiles.php?group_id=134326 http://sourceforge.net/project/showfiles.php?group_id=134326]

No need to remember this URL as you can easily reach it from the [[AMOS|AMOS main page]].

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available for download from this page we recommend you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

=== Downloading the development version ===

If you want the bleeding-edge of AMOS, e.g. to edit the source code, you should download the development version of AMOS using CVS following the directions here: [http://sourceforge.net/scm/?type=cvs&group_id=134326 http://sourceforge.net/scm/?type=cvs&group_id=134326]

Or in short:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

== Installing AMOS ==
After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

=== Installing the development version ===

The first step to install the CVS version of AMOS is to type:
./bootstrap

Then proceed with the instructions for the normal installation below.

=== Normal installation ===
The AMOS source package has a name like: amos-1.4.5.tar.gz where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.

AMOS uses the [http://www.gnu.org/software/autoconf GNU autoconf] package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.

=== Specifying the location of MUMmer ===
If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp

you either have not installed the [http://mummer.sourceforge.net/ MUMmer] package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler [[AMOScmp]].

To remedy this situation, please install MUMmer following instructions found at [http://mummer.sourceforge.net http://mummer.sourceforge.net].

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.

=== Specifying the location of the QT library ===
On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include-dir
--with-Qt-lib-dir
--with-Qt-bin-dir
--with-Qt-lib

=== Debian and Ubuntu installation ===
[[Debian installation]]

=== Fedora installation ===
[[Fedora installation]]

=== OSX installation ===

[[OSX installation]]

=== Cygwin installation ===
[[Cygwin installation]]

== Running AMOS ==

=== Basic AMOS concepts ===
AMOS consists of a collection of modules that operate on a central data-structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly related objects such as reads, contigs, scaffolds, etc. The modules thus communicate with each other by making changes to the bank. For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper will first read the shotgun reads from the bank, compare them to each other and write back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense out of it, by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads from the bank both the reads and the contigs, builds a multiple alignment of the reads, using as a guide the layout of the reads produced by the contigger, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space. If this little description didn't make much sense to you, check out our [http://www.cbcb.umd.edu/research/assembly_primer.shtml Genome Assembly Primer]. It also has pointers to further reading.
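
The idea of modules that communicate only through the bank can be illustrated with a toy example in Python. Everything here is a simplification made up for this page: the bank is a plain dictionary rather than a directory of indexed files, overlaps are exact suffix/prefix matches, the layout is greedy, and the "multi-aligner" merely merges reads at their overlap offsets. None of this reflects how the real AMOS modules are implemented.

```python
def overlapper(bank, min_len=3):
    """Read the reads from the bank and write back exact suffix/prefix overlaps."""
    reads = bank["reads"]
    overlaps = []
    for a, seq_a in reads.items():
        for b, seq_b in reads.items():
            if a == b:
                continue
            # Longest suffix of read a that is also a prefix of read b.
            for k in range(min(len(seq_a), len(seq_b)) - 1, min_len - 1, -1):
                if seq_a.endswith(seq_b[:k]):
                    overlaps.append((a, b, k))
                    break
    bank["overlaps"] = overlaps


def contigger(bank):
    """Read the overlaps and write back greedy read layouts."""
    best_next, has_predecessor = {}, set()
    for a, b, k in sorted(bank["overlaps"], key=lambda o: -o[2]):
        if a not in best_next and b not in has_predecessor:
            best_next[a] = (b, k)
            has_predecessor.add(b)
    layouts = []
    for start in bank["reads"]:
        if start in has_predecessor:
            continue  # not the first read of a layout
        layout, current = [(start, 0)], start
        while current in best_next:
            current, k = best_next[current]
            layout.append((current, k))
        layouts.append(layout)
    bank["layouts"] = layouts


def multi_aligner(bank):
    """Stand-in for a real multi-aligner: merge reads at their overlap offsets."""
    bank["contigs"] = [
        "".join(bank["reads"][read_id][k:] for read_id, k in layout)
        for layout in bank["layouts"]
    ]


# The three modules never call each other; they only read and update the bank.
bank = {"reads": {"r1": "ACGTAC", "r2": "TACGGA", "r3": "GGATTT"}}
for module in (overlapper, contigger, multi_aligner):
    module(bank)
print(bank["contigs"])  # ['ACGTACGGATTT']
```

Because each stage only depends on what is stored in the bank, any one of them could be swapped for a different implementation without touching the others.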

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Both identifiers must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1; however, there can be a contig, a read, and an overlap all with IID = 1.
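
The uniqueness rule can be pictured as an index keyed by the pair (object type, identifier). The Python class below is a hypothetical sketch of that rule only; the class name and methods are invented for illustration and do not correspond to the AMOS bank API. The type codes CTG (contig) and RED (read) are the message codes used elsewhere on this page.

```python
class ToyBankIndex:
    """Identifiers are unique per object type, not across the whole bank."""

    def __init__(self):
        self._by_iid = {}  # (object_type, iid) -> payload
        self._by_eid = {}  # (object_type, eid) -> payload

    def add(self, object_type, payload, iid=None, eid=None):
        # Reject an identifier already used by an object of the same type.
        if iid is not None and (object_type, iid) in self._by_iid:
            raise KeyError(f"duplicate IID {iid} for type {object_type}")
        if eid is not None and (object_type, eid) in self._by_eid:
            raise KeyError(f"duplicate EID {eid} for type {object_type}")
        if iid is not None:
            self._by_iid[(object_type, iid)] = payload
        if eid is not None:
            self._by_eid[(object_type, eid)] = payload

    def fetch_by_iid(self, object_type, iid):
        return self._by_iid[(object_type, iid)]


index = ToyBankIndex()
index.add("CTG", "first contig", iid=1)
index.add("RED", "first read", iid=1, eid="read_0001")  # same IID, other type
print(index.fetch_by_iid("CTG", 1), "/", index.fetch_by_iid("RED", 1))
```

Adding a second contig with IID 1 would raise an error, while the read with IID 1 is accepted because it belongs to a different object type.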

==== Message files ====
The AMOS banks are not the only mechanism for AMOS modules to communicate with each other, and to the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data-structures present in a bank.

For more details on the AMOS message file format check out the [[Infrastructure]] pages. The use of message files will be described in more detail in the remainder of this tutorial.

==== Reading and writing banks ====
To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists, and bank-transact will fail otherwise. When creating a new bank you can run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.
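
A wrapper script can decide between the two invocations automatically. The Python sketch below relies on the fact that a bank is a directory and adds -c only when that directory is missing; the helper name <code>bank_transact_args</code> is hypothetical, and treating -f as something that is always combined with -c is an assumption drawn from the wording above. The sketch only builds the argument list and does not run bank-transact.

```python
import os


def bank_transact_args(bank, message_file, force=False):
    """Choose the bank-transact flags depending on whether the bank exists."""
    args = ["bank-transact"]
    if force:
        args += ["-c", "-f"]        # create, overwriting an existing bank
    elif not os.path.isdir(bank):
        args.append("-c")           # bank directory is missing: create it
    args += ["-b", bank, "-m", message_file]
    return args


# The flags printed here depend on whether ./mybank exists on your system.
print(" ".join(bank_transact_args("mybank", "mymessagefile")))
```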

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the three-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.

==== Bank locking ====
To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: for reading, and for writing. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.
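
These rules are those of a classic readers-writer lock. The small Python class below models only the compatibility rules described above; it is a hypothetical in-memory illustration and says nothing about how AMOS actually records locks on disk.

```python
class ToyBankLock:
    """Many concurrent readers, or exactly one writer."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def acquire(self, mode):
        """Return True if a lock of the given mode can be granted now."""
        if mode == "read":
            if self.writer:
                return False      # a writer excludes all readers
            self.readers += 1
            return True
        if mode == "write":
            if self.writer or self.readers:
                return False      # a writer needs exclusive access
            self.writer = True
            return True
        raise ValueError(f"unknown lock mode: {mode}")

    def release(self, mode):
        if mode == "read" and self.readers > 0:
            self.readers -= 1
        elif mode == "write":
            self.writer = False


lock = ToyBankLock()
print(lock.acquire("read"), lock.acquire("read"), lock.acquire("write"))
```

In this model the two read requests succeed and the write request is refused until both readers have released the bank.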

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank

==== Bank versions ====
The specific format of the AMOS bank is closely related to the current version of the AMOS software. The banks are not backward compatible, i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the bank with the most recent bank-transact command.

==== Pipelines ====
As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language, and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throw-back to the days of the Basic language) providing the user with a mechanism to execute only some of the steps of a pipeline.
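
The role of the step numbers can be shown with a small Python model. It does not use the runAmos pipeline syntax, which is documented separately; the function <code>run_pipeline</code>, its <code>start</code> and <code>end</code> parameters, and the step names are all hypothetical and only mimic the idea of running a numbered subset of steps in order.

```python
def run_pipeline(steps, start=None, end=None):
    """Run the numbered steps in ascending order, optionally restricted to a range.

    steps: dict mapping a step number to a zero-argument callable.
    Returns the list of step numbers that were executed.
    """
    executed = []
    for number in sorted(steps):
        if start is not None and number < start:
            continue
        if end is not None and number > end:
            continue
        steps[number]()
        executed.append(number)
    return executed


log = []
toy_steps = {
    10: lambda: log.append("load reads into bank"),
    20: lambda: log.append("compute overlaps"),
    30: lambda: log.append("build contigs"),
}
# Resume from step 20, e.g. after the bank has already been populated.
print(run_pipeline(toy_steps, start=20), log)
```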

To learn more about AMOS pipelines and how to write them, check out the documentation for [[runAmos]] (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

=== Creating inputs for AMOS ===
The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the [[File conversion utilities]].

=== Running AMOScmp ===
AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix

The prefix must still be provided as it is used to generate the name of the output files.
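
The default file names and the two overrides can be summarized in a short Python sketch. The helper names are hypothetical; the sketch only resolves the input file names and builds the argument list described above, without running AMOScmp.

```python
def amoscmp_inputs(prefix, tgt=None, ref=None):
    """Resolve the target and reference files AMOScmp would read."""
    target = tgt if tgt is not None else f"{prefix}.afg"
    reference = ref if ref is not None else f"{prefix}.1con"
    return target, reference


def amoscmp_args(prefix, tgt=None, ref=None):
    """Build the AMOScmp argument list, adding -D only for overridden variables."""
    args = ["AMOScmp"]
    if tgt is not None:
        args += ["-D", f"TGT={tgt}"]
    if ref is not None:
        args += ["-D", f"REF={ref}"]
    args.append(prefix)  # the prefix is always required
    return args


print(amoscmp_inputs("prefix"))
print(amoscmp_args("prefix", tgt="mytarget.afg", ref="myreference.1con"))
```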

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. [http://www.cbcb.umd.edu/papers/Pop%20et%20al%20Comparative.pdf Comparative genome assembly]. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.

=== Running minimus ===
Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the [[minimus]] documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs. It also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg will be silently ignored.

=== Viewing the result of an assembly ===
The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the [[Hawkeye]] documentation.

=== Validating assemblies ===
Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e. ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the assembly in the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each assembly - the assembly with fewer features is likely "better").
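
A comparison of this kind can be sketched in Python once the features have been extracted from the bank. The tuple layout <code>(contig, start, end, tag)</code>, the tag names, and the function names are all hypothetical; how features are actually exported from a bank is not covered here.

```python
from collections import Counter


def count_features(features):
    """Count features per tag; each feature is (contig, start, end, tag)."""
    return Counter(tag for _contig, _start, _end, tag in features)


def fewest_features(assemblies):
    """Return the name of the assembly with the smallest number of features."""
    return min(assemblies, key=lambda name: len(assemblies[name]))


# Made-up feature lists for two assemblies of the same genome.
assemblies = {
    "assembly_A": [("ctg1", 100, 250, "snp_cluster"),
                   ("ctg1", 900, 1400, "mate_pair"),
                   ("ctg2", 10, 60, "snp_cluster")],
    "assembly_B": [("ctg1", 120, 240, "snp_cluster")],
}
for name, features in assemblies.items():
    print(name, dict(count_features(features)))
print("fewest features:", fewest_features(assemblies))
```

A raw count ignores contig sizes and feature types, so it is only a first indication of which assembly deserves a closer look.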

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

== Getting help ==
To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please [http://lists.sourceforge.net/lists/listinfo/amos-users subscribe] to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

Debian installation

2011-02-01T02:04:16Z

Floflooo: moved Ubuntu installation to Debian installation

These instructions are for Debian and Debian-based distros (e.g. Ubuntu 9.04)

To start, download either the regular or development version of AMOS.

i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a CVS repository. To get it, run:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

In the directory where the AMOS files are located, run the following to install the prerequisites:
sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev

For the Hawkeye component of AMOS, you need Qt3:
sudo aptitude install libqt3-headers

For the standard version of AMOS, skip to next step, but for the CVS development version, first, run:
./bootstrap

Then regardless of the version:
./configure --with-Qt-dir=/usr/share/qt3 --prefix=/usr/local/AMOS
make
make check
sudo make install
sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
Minimo -h

Ubuntu installation

2011-02-01T02:04:16Z

Floflooo: moved Ubuntu installation to Debian installation

#REDIRECT [[Debian installation]]

Debian installation

2011-02-01T02:01:02Z

Floflooo:

These instructions are for Debian and Debian-based distros (e.g. Ubuntu 9.04)

To start, download either the regular or development version of AMOS.

i/ The regular AMOS version is available from http://sourceforge.net/projects/amos/files/, e.g.:
wget http://sourceforge.net/projects/amos/files/amos/2.0.8/amos-2.0.8.tar.gz/download
ii/ The development version of AMOS is in a CVS repository. To get it, run:
cvs -z3 -d:pserver:anonymous@amos.cvs.sourceforge.net:/cvsroot/amos co -P AMOS

In the directory where the AMOS files are located, run the following to install the prerequisites:
sudo aptitude install ash coreutils gawk gcc automake mummer mummer-doc libboost-dev

For the Hawkeye component of AMOS, you need Qt3:
sudo aptitude install libqt3-headers

For the standard version of AMOS, skip to next step, but for the CVS development version, first, run:
./bootstrap

Then regardless of the version:
./configure --with-Qt-dir=/usr/share/qt3 --prefix=/usr/local/AMOS
make
make check
sudo make install
sudo ln -s /usr/local/AMOS/bin/* /usr/local/bin/

Now all the programs shipped in AMOS should be available from the command-line.
For example try:
Minimo -h

AMOS

2011-01-19T02:04:06Z

Floflooo: /* AMOS Development */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release Section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI certified open source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage for the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

AMOS

2011-01-19T02:03:17Z

Floflooo: /* Utilities */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

AMOS

2011-01-19T02:02:57Z

Floflooo: /* Trimming, Overlapping, & Error Correction */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

AMOS

2011-01-19T02:02:28Z

Floflooo: /* Validation and Visualization */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

AMOS

2011-01-19T02:01:56Z

Floflooo: /* Announcements */

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* December 7th, 2010 - Version 3.0.0 of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net

Minimo

2010-12-09T06:23:46Z

Floflooo: /* Overview */

== Overview ==

Minimo is largely based on [[minimus|Minimus]] and, as such, favours assembly quality over speed. Use it on moderately sized data sets. Like [[minimus|Minimus]], Minimo follows the Overlap-Layout-Consensus paradigm.

The main advantage of Minimo over [[minimus|Minimus]] is that it takes simple FASTA files as input and generates contigs in ACE and FASTA formats. In addition, two parameters (minimum overlap length and minimum overlap identity) can be used to tune the assembly stringency.

Generally, decreasing the minimum overlap identity produces a less fragmented but likely less faithful assembly, because reads that differ only by sequencing errors or by small variations between closely related species (in the case of metagenomic data) may be merged into chimeric contigs. Similarly, decreasing the minimum overlap length may produce less fragmented, less faithful assemblies. However, increasing the minimum overlap length may sometimes also produce better assemblies by resolving small repeated regions.
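
To make the two stringency parameters concrete, here is a minimal Python sketch of how a length threshold and an identity threshold could jointly accept or reject a candidate overlap. It is an illustration only, not Minimo's actual implementation: the function names <code>overlap_identity</code> and <code>accept_overlap</code> are hypothetical, the two overlap strings are assumed to be already aligned and gap-free, and the default thresholds simply mirror the MIN_LEN and MIN_IDENT defaults listed in the usage message.

```python
def overlap_identity(a: str, b: str) -> float:
    """Percent identity between two aligned, equal-length overlap strings.

    Hypothetical helper: assumes the overlap is already aligned without gaps.
    """
    if len(a) != len(b):
        raise ValueError("aligned overlap strings must have the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def accept_overlap(a: str, b: str, min_len: int = 35, min_ident: float = 98.0) -> bool:
    """Accept an overlap only if it is long enough and similar enough.

    Defaults mirror the MIN_LEN and MIN_IDENT defaults in the usage message;
    the acceptance rule itself is an assumption made for illustration.
    """
    return len(a) >= min_len and overlap_identity(a, b) >= min_ident


# A 40 bp overlap with a single mismatch has 39/40 = 97.5 % identity.
read_end = "A" * 40
read_start = "A" * 39 + "C"
print(accept_overlap(read_end, read_start))                  # strict identity threshold
print(accept_overlap(read_end, read_start, min_ident=90.0))  # relaxed identity threshold
```

With the strict threshold the single mismatch is enough to reject the overlap, which keeps the two reads in separate contigs; relaxing the threshold merges them, which is the less fragmented but potentially less faithful outcome described above.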

== Documentation ==

Documentation on how to run Minimo is obtained by typing:

 Minimo -h

The usage message is:

 Usage:
   Minimo FASTA_IN [options]
 Options:
   -D QUAL_IN=<file>   Input quality score file
   -D GOOD_QUAL=<n>    Quality score to set for bases within the clear
                       range if no quality file was given (default: 30)
   -D BAD_QUAL=<n>     Quality score to set for bases outside clear range
                       if no quality file was given (default: 10). If your
                       sequences are trimmed, try the same value as GOOD_QUAL.
   -D MIN_LEN=<n>      Minimum contig overlap length (at least 20 bp,
                       default: 35)
   -D MIN_IDENT=<d>    Minimum contig overlap identity percentage (between 0
                       and 100 %, default: 98)
   -D ALN_WIGGLE=<d>   Alignment wiggle value (from 2 for short reads to 15 for
                       long reads, default: 2)
   -D FASTA_EXP=<n>    Export results in FASTA format (0:no 1:yes, default: 0)
   -D ACE_EXP=<n>      Export results in ACE format (0:no 1:yes, default: 0)
   -D OUT_PREFIX=< s>  Prefix to use for the output file path and name

== Basic usage ==

To run Minimo, you will need a set of sequences. Assuming you have a set of reads in FASTA format called '''my_reads.fa''', you can run Minimo with the following command:

 Minimo my_reads.fa

To export the contigs in FASTA or ACE format (e.g. for downstream processing), use the FASTA_EXP and ACE_EXP options:

 Minimo my_reads.fa -D FASTA_EXP=1 -D ACE_EXP=1

If you need to use a specific overlap length or identity between reads of a contig, try:

 Minimo my_reads.fa -D MIN_LEN=80 -D MIN_IDENT=90
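
When comparing several stringency settings, it can help to generate the command lines programmatically. The following Python sketch builds Minimo invocations for a few MIN_IDENT values using only the <code>-D NAME=value</code> options listed in the usage message. The helper <code>minimo_command</code> and the <code>ident90</code>-style prefixes are hypothetical names chosen for illustration, and the exact output file names produced under OUT_PREFIX are not specified here. The sketch only prints the commands; running them is left to the reader.

```python
import shlex


def minimo_command(fasta_in, **options):
    """Build a Minimo argument list from keyword options.

    Hypothetical helper: each keyword becomes one '-D NAME=value' pair,
    in the order the keywords were given.
    """
    cmd = ["Minimo", fasta_in]
    for name, value in options.items():
        cmd += ["-D", f"{name}={value}"]
    return cmd


# Sweep the minimum overlap identity while keeping the overlap length fixed.
for ident in (90, 95, 98):
    cmd = minimo_command(
        "my_reads.fa",
        MIN_LEN=80,
        MIN_IDENT=ident,
        FASTA_EXP=1,
        OUT_PREFIX=f"ident{ident}",  # assumed prefix, one per run
    )
    print(shlex.join(cmd))
```

Giving each run its own OUT_PREFIX is intended to keep the results of the different settings apart so that the resulting assemblies can be compared side by side.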

Minimo

2010-12-09T06:16:52Z

Floflooo: /* Basic usage */

Minimo

2010-12-09T06:12:39Z

Floflooo:

Minimo

2010-12-09T05:38:17Z

Floflooo: /* Documentation */

Minimo

2010-12-09T05:36:56Z

Floflooo: /* Documentation */

Minimo

2010-12-09T05:35:27Z

Floflooo: Created page with '== Overview == Minimo is largely based on Minimus, and as such favours assembly quality to speed. Use on moderately-sized data! Minimo follows the Overlap-Layout-Con…'

AMOS

2010-12-09T05:28:20Z

Floflooo:

{| align="right"
| __TOC__
|}

The AMOS consortium is committed to the development of open-source whole genome assembly software. The project acronym (AMOS) represents our primary goal -- to produce A Modular, Open-Source whole genome assembler. Open-source so that everyone is welcome to contribute and help build outstanding assembly tools, and modular in nature so that new contributions can be easily inserted into an existing assembly pipeline. This modular design will foster the development of new assembly algorithms and allow the AMOS project to continually grow and improve in hopes of eventually becoming a widely accepted and deployed assembly infrastructure. In this sense, AMOS is both a design philosophy and a software system.

Quick links:
* [[AMOS Getting Started]]
* [http://sourceforge.net/project/showfiles.php?group_id=134326 Download]
* [http://sourceforge.net/projects/amos SourceForge project page]

== Announcements ==

* November 10th, 2009 - New version of AMOS released!

== Documentation ==
Additional documentation in development through the [[AMOS Documentation Project]]

=== Assemblers ===
* [[ABBA]] - Assembly Boosted By Amino Acid Sequences
* [[AMOScmp]] - comparative assembler
* [[AMOScmp-shortReads]] - comparative assembler for short reads (Solexa,454)
* [[AMOScmp-shortReads-alignmentTrimmed]] - comparative assembler for short reads that uses alignment based trimming
* [[minimus]] - basic genome assembler for small datasets
* [[minimus2]] - basic genome assembler for two datasets; can also be used as an assembly merge pipeline
* [[Minimo]] - the minimus assembler with many more options

=== Validation and Visualization ===
* [[Hawkeye]] - assembly viewer
* [[amosvalidate]] - assembly forensics
* [[Benchmark]] - assembly benchmark data

=== Scaffolding ===
* [[Bambus]] - Open source standalone hierarchical scaffolding
* [[Bambus2]] - Scaffolding Polymorphic Genomes and Metagenomes

=== Trimming, Overlapping, & Error Correction ===
* [[Figaro]] - statistical vector trimmer
* [[UMD Overlapper]] - High quality overlap computations
* [[KI Overlapper]] - Repeat aware overlapper
* [[AutoEditor]] - Automatic correction of genome sequencing errors
* [[FastqQC]] - Read composition and quality

=== Utilities ===
* [[File conversion utilities]] - converting data to and from AMOS
* [[AMOS Utilities | AMOS Utilities]] - general utilities
* [[runAmos]] - Pipeline executor

=== AMOS Development ===
* [[Programmer's guide]] - Getting started with the Source code
* [[Infrastructure]] - Developer level details
* [[Wiki guide]] - Guide for editing the wiki

=== Assembly Tutorials ===
* [http://www.cbcb.umd.edu/research/assembly_primer.shtml Assembly primer] - overview of genome assembly.
* [http://www.cbcb.umd.edu/research/contig_representation.shtml Representing assemblies (not just in AMOS)]
* [http://wgs-assembler.sourceforge.net Running Celera Assembler]

== Download ==
The AMOS source is freely available for download from the File Release section of our SourceForge project page. Please refer to the COPYING license included in the package for a description of the Artistic License, the same OSI-certified open-source license used by Perl and countless other packages. Not all of the above packages are included with the standard AMOS distribution; please see the homepage of the software you wish to download to verify that it is included with the AMOS source distribution.

[http://sourceforge.net/project/showfiles.php?group_id=134326 Download from SourceForge]

== Consortium members ==

There have been numerous positive responses regarding the AMOS initiative, and we expect the list of involved organizations to grow significantly as the project matures. Please contact us if you want to join. The groups currently involved with the development of AMOS are listed below, along with their responsibilities and areas of expertise.

* University of Maryland, Center for Bioinformatics and Computational Biology
** project organization and direction
** infrastructure
** consensus
** automated sequence editing
** scaffolding
** overlap detection
** contig construction

* The Institute for Genomic Research
** production pipelines
** automated finishing tools
** error correction

* Karolinska Institutet
** overlap detection
** error correction

* Marine Biological Laboratory - Woods Hole
** graphical interface
** integration of assembly data with analysis (gene, polymorphism, etc.) information

== Join the consortium ==

All interested parties are welcome to join or aid the AMOS consortium. Please address all correspondence via Email to:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net

== Bug reports and support ==

For AMOS bug reports or support requests, please browse our SourceForge project page or Email us at:

amos-help (at) lists (dot) sourceforge (dot) net

== Acknowledgements ==

The AMOS consortium would like to thank the following organizations for their funding and/or support:
* The National Institutes of Health - grants R01-LM06845, N01-AI-15447
* The National Science Foundation - grants IIS-9902923, IIS-9820497
* Department of Homeland Security - cooperative agreement W81XWH-05-2-0051
* SourceForge.net