AMOS: A Modular Open-Source Assembler
1. AMOS overview
The AMOS consortium is committed to the development of open-source
whole genome assembly software. The project acronym (AMOS) represents
our primary goal: to produce A Modular,
Open-Source whole genome assembler.
It is open-source so that everyone is welcome to contribute and help
build outstanding assembly tools, and modular so
that new contributions can be easily inserted into an existing assembly
pipeline. This modular design will foster the development of new assembly
algorithms and allow the AMOS project to continually grow and improve
in hopes of eventually becoming a widely accepted and deployed assembly
infrastructure. In this sense, AMOS is both a design philosophy and
a software system.
AMOS Getting Started
Programmer's guide
AMOS Documentation Project
2. Table of contents
- AMOS overview
- Table of contents
- Collaborators
- Acknowledgements
- Infrastructure
  - libAMOS API
  - Specifications
- Modules and projects
  - Assembly pipeline
  - Overlap detection
  - Contig construction
  - Consensus
  - Scaffolding
  - Error correction
  - Validation
  - Utilities
- Download
- Join the consortium
- Bug reports
3. Consortium members
There have been numerous positive responses regarding the AMOS initiative,
and we expect the list of involved organizations to grow significantly
as the project matures. Please contact us if you
want to join. The groups currently involved with the development of
AMOS are listed below, along with their responsibilities and areas of
expertise.
4. Acknowledgements
The AMOS consortium would like to thank the following organizations
for their funding and/or support:
- The National Institutes of Health: grants R01-LM06845, N01-AI-15447
- The National Science Foundation: grants IIS-9902923, IIS-9820497
- Department of Homeland Security: cooperative agreement W81XWH-05-2-0051
- SourceForge.net
5. Infrastructure
The principal benefit of the AMOS project is its modular design, but
supporting many isolated components requires a robust infrastructure.
In response to this need, TIGR has developed numerous
C++ classes for the efficient storage of assembly data types. These
assembly objects can be written to and read from a central data repository,
allowing for separate modules to build on and improve existing assemblies
in discrete steps. This allows an assembly pipeline to run its steps
in any order, and for data snapshots to be preserved at any time. In
order to convey the assembly data outside of the C++ classes, we have
implemented an ASCII message format modeled on that used by Celera Assembler*.
This message format will be the unifying standard for all external module
communication and will allow data snapshots to be output in a concise
text format. The API (application programming interface) for the AMOS
foundation classes and the specification for the AMOS message format
can be found in the sections below.
* "A Whole-Genome Assembly of Drosophila." Myers E, Sutton
G, et al., Science, 2000. 287(5461):2196-204.
5.1. Application programming interface
The AMOS API describes the programming interface for all of the AMOS
foundation classes. Currently these classes are implemented in C++,
but could be ported to other languages as long as the API is preserved.
The implementation can be found in the latest distribution under the
src/AMOS project directory. These classes comprise the
libAMOS.a library. This library contains the tools necessary
to handle and manipulate AMOS messages, data-banks and internal assembly
data structures such as sequencing reads, contigs, scaffolds, etc. The
C++ source code for libAMOS is freely available for download here.
AMOS infrastructure API
5.2. Specifications
The AMOS file types and message formats are defined in various specification
documents, which can be found by following the link below. These documents
also provide information on how to use messages for module communication
and general development procedure recommendations.
AMOS specification documents
6. Modules and projects
The following sections list all modules currently in development and
those modules that are already in production. Because AMOS is in a constant
state of development, there is an ever-expanding list of ongoing projects,
and this section attempts to outline the basic function of each project
along with its status and parent organization. Status descriptions are
(in order of occurrence): planning, development,
testing, production and antiquated.
These status descriptions appear to the right of the module name. Clicking
on a module name will redirect you to the project homepage (if applicable).
6.1. Assembly pipeline
runAmos is the command executor for all of the AMOS pipelines.
minimus is a lightweight assembly tool for performing
small assembly tasks for which the complexity of a full assembler
is unnecessary. Some such tasks, commonly needed during genome finishing,
include joining together two overlapping contigs, adding reads to an
existing contig, and refining the multiple-alignment of the reads within
a contig. We use a standard 3-step assembly process known as overlap-layout-consensus
which is explained further on the minimus website. minimus
is freely available as part of the AMOS distribution which can be downloaded
here.
Examples of a flu assembly and a zebrafish gene assembly can be found in the test/minimus directory created when the AMOS distribution is untarred. Documentation on the examples is included with the distribution in the /docs subdirectory.
Supported in part by DHS cooperative agreement W81XWH-05-2-0051.
AMOScmp is a comparative assembly pipeline. With the
rapid growth in the number of sequenced genomes has come an increase
in the number of organisms for which two or more closely-related species
have been sequenced. This has created the possibility of building a
comparative genome assembly algorithm, which can assemble a newly sequenced
genome by mapping it onto a reference genome. Methods are described
in our paper (below) and on the AMOScmp website. The MUMmer whole genome
alignment package is required for the mapping step of this pipeline,
and is freely available from the MUMmer homepage. AMOScmp is freely available
as part of the AMOS distribution, which can be downloaded here.
Related publications
- "Comparative Genome Assembly." Pop M, Phillippy A, Delcher AL,
Salzberg SL, Briefings in Bioinformatics, 2004. 5(3):237-48.
6.2. Overlap detection
UMD overlapper
STATUS: testing
The UMD overlapper is designed to reduce the number of
overlaps produced by the assembler by reducing the number of repeat-induced
overlaps. Most assemblers use exact k-mer matches to identify reads
that potentially overlap. The UMD overlapper speeds up this initial
phase of overlapping through the use of minimizers, a technique that
reduces the number of k-mers considered by an order of magnitude.
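The minimizer idea can be illustrated with a short Python sketch. It is not the UMD overlapper's implementation: the function names, the lexicographic ordering of k-mers, and the small values of k and w are assumptions chosen for readability. From every window of w consecutive k-mers only the smallest is kept, so two reads that share a long enough substring still select at least one k-mer in common, while far fewer k-mers need to be indexed.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, keep only the
    lexicographically smallest k-mer (leftmost one on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first index among equal k-mers, i.e. the leftmost.
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return selected


def shared_minimizers(read_a, read_b, k=4, w=3):
    """K-mers selected as minimizers in both reads (candidate overlap seeds)."""
    kmers_a = {kmer for _, kmer in minimizers(read_a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(read_b, k, w)}
    return kmers_a & kmers_b
```

In this sketch a read of 12 bases with k=4 contains 9 k-mers, of which only a subset is retained; real overlappers use much larger k and w.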
Related publications
- "A preprocessor for shotgun assembly of large genomes." Roberts
M, Hunt BR, Yorke JA, Bolanos R, Delcher A, Journal of Computational
Biology, 2004. 11(4):734-752.
- "Reducing storage requirements for biological sequence comparison."
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Bioinformatics, 2004, 20(18):3363-3369.
KI overlapper
STATUS: testing
The Karolinska Institutet overlapper is designed to handle
the problems created by sequencing errors. Instead of exact k-mer matches
- the approach used by most existing assemblers - the KI overlapper
uses a q-gram based method to identify "near hits" - k-mers that differ
at a small number of positions. This approach allows this overlapper
to identify overlaps otherwise missed by other overlappers.
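As a rough illustration of what a "near hit" means, the sketch below tests whether two k-mers of equal length differ at only a few positions. It applies a generic q-gram count filter followed by a direct mismatch count. This is a textbook construction, not the KI overlapper's algorithm, and the function names and default parameters are assumptions.

```python
from collections import Counter


def qgrams(s, q):
    """Multiset of all substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_hit(a, b, q=3, max_mismatches=1):
    """True if k-mers a and b differ at no more than max_mismatches positions.

    Each mismatch can destroy at most q q-grams, so two strings within
    max_mismatches substitutions must share at least
    (len - q + 1) - q * max_mismatches q-grams. Pairs below that bound
    are rejected without a base-by-base comparison.
    """
    if len(a) != len(b):
        return False
    threshold = len(a) - q + 1 - q * max_mismatches
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    if shared < threshold:
        return False
    return hamming(a, b) <= max_mismatches
```

An exact k-mer index would miss the pair `ACGTACGTAC` / `ACGTTCGTAC`, which differ at a single position, whereas this test accepts it.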
Related publications
- "Correcting errors in shotgun sequences." Tammi MT, Arner E, Kindlund
E, Andersson B, Nucleic Acids Research, 2003. 31(15):4663-72.
- "TRAP: Tandem Repeat Assembly Program produces improved shotgun
assemblies of repetitive sequences." Tammi MT, Arner E, Andersson
B, Computer Methods and Programs in Biomedicine, 2003. 70(1):47-59.
- "Separation of nearly identical repeats in shotgun assemblies using
defined nucleotide positions, DNPs." Tammi MT, Arner E, Britton T,
Andersson B, Bioinformatics, 2002. 18(3):379-88.
6.3. Contig construction
UMD Contigger
STATUS: development
The UMD contigger uses the set of read overlaps generated
during the overlap stage in order to identify unambiguous contigs -
maximally consistent tilings of reads. These contigs represent stretches
of the genome that can be unambiguously assembled and form a convenient
backbone for further processing.
6.4. Consensus
libSlice
STATUS: production
libSlice is a C++ library that provides the user with
a parametric implementation of the Churchill-Waterman algorithm for
computing the consensus base from a column in a multiple alignment of
reads. This task is an essential part of any consensus module. The implementation
can be found in the latest distribution under the src/Slice
project directory. These C structs comprise the libSlice.a
library.
libAlign
STATUS: production
libAlign is a robust multi-alignment library for consensus
generation. It can efficiently handle large inputs and is able to identify
and correctly align slightly misplaced and/or low-similarity reads in
the input. The implementation can be found in the latest distribution
under the src/Align project directory. These classes comprise
the libAlign.a library and depend on the libSlice
library.
6.5. Scaffolding
Bambus is the first general purpose scaffolder that is
publicly available as an open source package. While most other scaffolders
are closely tied to a specific assembly program, Bambus accepts the
output from most current assemblers and provides the user with great
flexibility in choosing the scaffolding parameters. In particular, Bambus
is able to accept contig linking data other than that specified by mate pairs.
Such sources of information include alignment to a reference genome
(Bambus can directly use the output of MUMmer), physical mapping data,
or information about gene synteny.
Related publications
- "Hierarchical scaffolding with Bambus." Pop M, Kosack DS, Salzberg
SL, Genome Research, 2004. 14(1):149-59.
6.6. Error correction
The AutoEditor is a tool developed at TIGR that combines
the trace information with the tiling of reads within a contig in order
to identify and correct sequencing errors. Note that unlike other methods
for error correction, the AutoEditor will only modify a base if supporting
evidence is found in the traces, thus greatly reducing the possibility
of introducing new errors. In our tests the AutoEditor corrected up to 90% of the sequencing
errors present in the data, leading to a corresponding reduction in
the manual labor required during the finishing stages.
Related publications
- "Automated correction of genome sequence errors." Gajer P, Schatz
M, Salzberg SL, Nucleic Acids Research, 2004. 32(2):562-9.
UMD error corrector
STATUS: testing
In conjunction with the UMD overlapper, the UMD error corrector
identifies and corrects potential sequencing errors by detecting bases
in a multiple alignment of reads that are supported by only one of the
reads. The algorithm uses a heuristic rule called the 4-3 rule that
examines overlapping sets of 4 reads at 3 positions in order to identify
differences corresponding to distinct copies of a repeat.
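The first half of this description, finding bases supported by only one read, can be sketched as follows. The sketch does not implement the 4-3 rule and is not the UMD code. The alignment representation (equal-length strings with `.` where a read does not cover a column), the function name, and the `min_support` parameter are assumptions made for illustration.

```python
from collections import Counter


def singleton_bases(alignment, min_support=2):
    """Report bases that appear in only one read of an alignment column.

    alignment: list of equal-length strings, one per read, with '.'
    marking columns the read does not cover.
    Returns (read_index, column, observed_base, majority_base) tuples.
    """
    suspects = []
    for col in range(len(alignment[0])):
        column = [(row, read[col])
                  for row, read in enumerate(alignment)
                  if read[col] != "."]
        counts = Counter(base for _, base in column)
        if not counts:
            continue
        majority, support = counts.most_common(1)[0]
        # Without enough agreeing reads there is no basis for a correction.
        if support < min_support:
            continue
        for row, base in column:
            if counts[base] == 1 and base != majority:
                suspects.append((row, col, base, majority))
    return suspects
```

A disagreement seen in two or more reads is left alone by this sketch, since it may reflect a true difference between repeat copies rather than a sequencing error.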
Related publications
- "A preprocessor for shotgun assembly of large genomes." Roberts
M, Hunt BR, Yorke JA, Bolanos R, Delcher A, Journal of Computational
Biology, 2004. 11(4):734-752.
KI error corrector
STATUS: testing
Sequencing errors in combination with repeated regions cause major
problems in shotgun sequencing, mainly due to the failure of assembly
programs to distinguish single base differences between repeat copies
from erroneous base calls. The Karolinska Institutet error corrector
implements a new strategy to correct errors in shotgun sequence data
using defined nucleotide positions, DNPs. The method distinguishes single
base differences from sequencing errors by analyzing multiple alignments
consisting of a read and all its overlaps with other reads. The construction
of multiple alignments is performed using a novel pattern matching algorithm.
Related publications
- "Correcting errors in shotgun sequences." Tammi MT, Arner E, Kindlund
E, Andersson B, Nucleic Acids Research, 2003. 31(15):4663-72.
- "TRAP: Tandem Repeat Assembly Program produces improved shotgun
assemblies of repetitive sequences." Tammi MT, Arner E, Andersson
B, Computer Methods and Programs in Biomedicine, 2003. 70(1):47-59.
- "Separation of nearly identical repeats in shotgun assemblies using
defined nucleotide positions, DNPs." Tammi MT, Arner E, Britton T,
Andersson B, Bioinformatics, 2002. 18(3):379-88.
6.7. Validation
amosvalidate is a validation pipeline for genome assemblies.
This pipeline includes a collection of methods for ascertaining the
quality of an assembly, and examines multiple measures of assembly quality
to pinpoint potential mis-assemblies. Validation techniques include mate-pair
validation, repeat analysis, coverage analysis, identification of correlated
read polymorphisms, and read alignment breakpoint analysis. Regions of the
assembly exhibiting multiple signatures of mis-assembly are flagged as
suspicious and output by amosvalidate for further examination.
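The idea of flagging regions that show several signatures at once can be sketched as an interval intersection. This is an illustration under stated assumptions rather than the amosvalidate implementation: the signature names, the half-open `(start, end)` coordinates, and the threshold of two distinct signatures are all hypothetical.

```python
def flag_suspicious(signatures, min_signatures=2):
    """Return merged regions covered by at least min_signatures signature types.

    signatures: dict mapping a signature name to a list of half-open
    (start, end) intervals on one contig.
    """
    # Every interval boundary splits the contig into elementary segments.
    points = sorted({p for ivs in signatures.values() for iv in ivs for p in iv})
    regions = []
    for left, right in zip(points, points[1:]):
        # Count distinct signature types that fully cover this segment.
        hits = sum(
            any(start <= left and right <= end for start, end in ivs)
            for ivs in signatures.values()
        )
        if hits >= min_signatures:
            if regions and regions[-1][1] == left:
                # Extend the previous region when segments are adjacent.
                regions[-1] = (regions[-1][0], right)
            else:
                regions.append((left, right))
    return regions
```

Requiring agreement between independent signatures trades some sensitivity for fewer false alarms, since any single measure can be triggered by chance.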
TIGR has assembled a set of benchmark assembly genomes.
Each benchmark set comes with the sequence of the finished genome, random
shotgun reads, closure reads, and ancillary library and insert information.
Each sequence is categorized as matching or non-matching, based on its
mapping to the finished genome. Sequences that match the finished genome
at 90% identity for over 80% of their trimmed length (as aligned by
MUMmer) are included in the matching set, while
all other reads are grouped into the non-matching set. Ancillary information
is presented in Trace Archive XML format. Please refer to the benchmark
website for a more lengthy description and the actual data.
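The matching rule described above can be written as a small predicate. The sketch assumes that "at 90% identity" means an identity of at least 90% and that "over 80%" is a strict inequality. The function and parameter names are hypothetical, and the aligned length and percent identity are taken as given rather than parsed from MUMmer output.

```python
def classify_read(aligned_len, trimmed_len, identity_pct,
                  min_identity=90.0, min_coverage=0.80):
    """Classify a read as 'matching' or 'non-matching' to the finished genome.

    aligned_len  -- number of trimmed read bases covered by the alignment
    trimmed_len  -- length of the read after trimming
    identity_pct -- percent identity of the alignment
    """
    covered = aligned_len / trimmed_len
    if identity_pct >= min_identity and covered > min_coverage:
        return "matching"
    return "non-matching"
```

For example, a 1000-base trimmed read aligned over 850 bases at 95% identity would be placed in the matching set, while the same read aligned over only 700 bases would not.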
STATUS: development - production
The ASM File converters are a collection of utilities
for converting sequence and assembly data between the most widely used
data formats as well as to and from the AMOS message format. Examples
of the data handled by these utilities are: Trace Archive data and ancillary
information, .ACE assembly format, TIGR Assembler input and output formats,
Celera Assembler message format, and Arachne input and output formats.
amoslib is a Perl module for the handling of AMOS message
files.
MUMmer is a whole-genome alignment software suite developed
and maintained by TIGR. It is extremely useful for mapping sequencing
reads and assembly contigs to finished sequence. This can provide valuable
information for assembly quality assessment and for comparative genomics.
Hawkeye serves as a visualization tool for AMOS assembly
data. It reads from an AMOS data bank and produces an interactive
display of the assembly data for quality assessment and assembly investigation.
It was developed with the Qt GUI toolkit.
The AMOS source is freely available for download from the File
Release Section of our SourceForge
project page. Please refer to the COPYING license included in the
package for a description of the Artistic License,
the same OSI certified open source
license used by Perl and countless other packages. Not all of the above
packages are included with the standard AMOS distribution. Please see
the homepage for the software you wish to download to verify that it
is included with the AMOS source distribution.
All interested parties are welcome to join or aid the AMOS consortium.
Please address all correspondence via Email to:

To receive information regarding new releases and developments, please
subscribe to our moderated, low-traffic users' mailing list:

For AMOS bug reports or support requests, please browse our SourceForge
project page or Email us at:
