AMOS - Getting Started
- Introduction
- Downloading AMOS
- Installing AMOS
- Normal installation
- Specifying the location of MUMmer
- Specifying the location of the QT library
- OSX installation
- Cygwin installation
- Running AMOS
- Basic AMOS concepts
- Message files
- Reading and writing banks
- Bank locking
- Bank versions
- Pipelines
- Creating inputs for AMOS
- Running AMOScmp
- Running minimus
- Viewing the result of an assembly
- Validating assemblies
- Getting help
Introduction
"Is AMOS an assembler?" is one of the first questions we are asked.
The short answer is no. AMOS is not an assembler, but rather a software
infrastructure for developing assembly tools. If you are only
interested in running an off-the-shelf assembler on your shotgun data,
do not despair: AMOS provides two such assemblers, AMOScmp (a
comparative assembler) and Minimus (a basic assembler for small
datasets). However, it is important to realize that, with a little bit
of programming, you can use AMOS to put together your own shotgun
assembler customized for the specific characteristics of your data.
This page will provide you with the basic information needed to get
started using AMOS. Advanced AMOS users can go directly to
in-depth resources:
Downloading AMOS
AMOS can be downloaded from Sourceforge using the following link: http://sourceforge.net/project/showfiles.php?group_id=134326
No need to remember this URL as you can easily reach it from either the
main Sourceforge page for the AMOS project http://sourceforge.net/projects/amos/
or from our pretty front page http://amos.sourceforge.net.
This link will bring you to the Sourceforge download page for our
project. While older versions of our code are also available for
download from this page, we recommend you download the latest version
to take advantage of the full functionality of the code.
AMOS is released as a source-code package, with the exception of the
OSX version of the assembly viewer Hawkeye, which can be downloaded as
a binary from the File Release section of the download page.
Instructions for compiling and installing AMOS are provided below.
Installing AMOS
After reading this section make sure you also read the INSTALL file
distributed with AMOS. This file may contain information
pertaining to the latest version of AMOS that is not included here.
Normal installation
The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5
is the version of the code. Once you untar this file (using
"tar -xzf amos-1.4.5.tar.gz" in Linux, or
"gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix)
you will find the current AMOS distribution in a directory named
amos-1.4.5. The next steps assume you have cd'd into this directory.
AMOS uses the GNU
autoconf package to reduce cross-platform compatibility
issues. Before compiling the code you will need to run the configure
script that will probe your system for the locations of all software
packages required by AMOS.
By simply running:
./configure
you will prepare AMOS to be installed in the directory hosting the
source package. This is OK if you are just testing AMOS.
We recommend, however, that you provide the configure script with a
more permanent home for AMOS, e.g.:
./configure --prefix=/usr/local
which will cause the AMOS directory hierarchy to be installed
underneath /usr/local/.
After running configure, check the messages left on your screen to
make sure no errors occurred. Errors during the configure step can
lead to an incomplete build.
To compile the code you need to simply run:
make
followed by
make install
to install AMOS into the directory selected with the --prefix option
to configure.
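The steps above can be collected into a small shell function. This is only a sketch: build_amos and amos_srcdir are hypothetical helper names (they are not part of AMOS), and the tarball name and prefix are placeholders to be replaced with your own values.

```shell
# Sketch only: unpack, configure, build, and install AMOS.
# build_amos and amos_srcdir are hypothetical helpers, not AMOS tools.

# Derive the source directory name from the tarball name,
# e.g. amos-1.4.5.tar.gz -> amos-1.4.5
amos_srcdir() {
    basename "$1" .tar.gz
}

build_amos() {
    tarball=$1    # e.g. amos-1.4.5.tar.gz
    prefix=$2     # e.g. /usr/local
    srcdir=$(amos_srcdir "$tarball")

    # Portable unpacking for systems whose tar lacks the -z option.
    gunzip -c "$tarball" | tar xf - || return 1

    # Run in a subshell so the caller's working directory is unchanged;
    # the && chain stops at the first step that fails.
    (
        cd "$srcdir" &&
        ./configure --prefix="$prefix" &&
        make &&
        make install
    )
}
```

A call such as "build_amos amos-1.4.5.tar.gz /usr/local" would then run all four steps and return a non-zero status if any of them fails.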
Normally, these steps are sufficient to install AMOS on most UNIX
systems. If you encounter errors during configuration or
compilation, or if you are trying to install AMOS on an OSX or Cygwin
system, please read the following sub-sections.
Specifying the location of MUMmer
If the configure script gives you a message like:
WARNING! nucmer was not found but is required to run AMOScmp
install nucmer if planning on using AMOScmp
you either have not installed the MUMmer
package, or you have installed it in a location where the configure
script cannot find it. MUMmer (the nucmer program
in particular) is required by the comparative assembler AMOScmp.
To remedy this situation, please install MUMmer following instructions
found at http://mummer.sourceforge.net.
If MUMmer is already installed, but configure cannot find it, you can
specify the location of the nucmer program by setting the environment
variable NUCMER, e.g.:
NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER
in a "traditional" shell (sh, bash, ksh, etc.), or
setenv NUCMER /usr/local/bin/mummer/nucmer
in csh or tcsh. Of course you'll need to replace
/usr/local/bin/mummer/nucmer with the actual location of this program
on your system.
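Before running configure it can help to confirm that the path you intend to use really points at an executable. The sketch below does this for sh-compatible shells; find_nucmer is a hypothetical helper name, not something shipped with AMOS or MUMmer.

```shell
# Sketch only: locate nucmer before running ./configure.
# find_nucmer is a hypothetical helper, not part of AMOS or MUMmer.
find_nucmer() {
    if [ -n "${NUCMER:-}" ] && [ -x "$NUCMER" ]; then
        # An explicit NUCMER setting that points at an executable wins.
        printf '%s\n' "$NUCMER"
    elif command -v nucmer >/dev/null 2>&1; then
        # Otherwise fall back to whatever nucmer is found on the PATH.
        command -v nucmer
    else
        echo "nucmer not found; install MUMmer or set NUCMER" >&2
        return 1
    fi
}

# Possible usage:
#   NUCMER=$(find_nucmer) && export NUCMER && ./configure
```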
Specifying the location of the QT library
On most Unix installations (see below for OSX and Cygwin), the QT
library should be properly installed and AMOS will build without any
problems. If, however, you notice a message like:
WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs
the configure process was not able to find the QT library on your
system. Check with your system administrator to have this toolkit
installed on your system. If, however, you are certain the toolkit is
installed, but AMOS still didn't find it, you can directly specify the
location of the toolkit directory, or specifically the include, bin,
and lib directories where QT is installed, and the name of the library
file, using the following options to the configure script:
--with-Qt-dir
--with-Qt-include-dir
--with-Qt-lib-dir
--with-Qt-bin-dir
--with-Qt-lib
OSX installation
Download QT/Mac 3.3.x from Trolltech.
As of 4/12/06, the most recent version is available at: ftp://ftp.trolltech.com/qt/source/qt-mac-free-3.3.6.tar.gz
Follow the Trolltech instructions for building QT. Make sure to set the
environment variable QTDIR appropriately.
Run ./configure to configure AMOS. Note the QT configure tests may
fail. Run make to build AMOS. Then run:
cd src/bankViewer
$QTDIR/bin/qmake
make
The Hawkeye binary will then be built in the Hawkeye directory. You
will have to manually copy it to your bin directory.
Cygwin installation
Make sure you have the following packages installed:
Base:
- ash
- coreutils
- gawk
- gzip
Devel:
- gcc-g++
- make
X11:
- qt3
- qt3-bin
- qt3-devel
- xorg-x11-* (all x11 packages, including fonts)
Run ./configure from the top level source directory. Note the QT
configure tests will fail. Run make to build AMOS except for the
Hawkeye directory. Then run:
cd src/bankViewer
qmake-qt3
make
The Hawkeye binary will then build in the bankViewer directory. You will
have to manually copy it to your bin directory.
Running AMOS
Basic AMOS concepts
AMOS consists of a collection of modules that operate on a central
data-structure called a bank.
A bank is really just a directory that contains a database
(organized as a collection of indexed files) comprising assembly
related objects such as reads, contigs, scaffolds, etc. The
modules thus communicate with each other by making changes to the bank.
For example, an assembler might consist of three modules: an
overlapper, a contigger, and a multi-aligner. The overlapper
will first read the shotgun reads from the bank, compare them to each
other and write back to the bank a list of overlaps, i.e. pairs
of reads that match each other. The contigger then reads the
collection of overlaps and makes sense out of it, by producing a layout
of the reads that is consistent with most of the observed overlaps.
The contigger then writes these contigs (contiguous
chunks of the genome) to the bank. Finally, the multi-aligner
reads from the bank both the reads and the contigs, builds a multiple
alignment of the reads, using as a guide the layout of the reads
produced by the contigger, then updates the contigs with the detailed
alignment information. Thus, the three programs were able to
communicate with each other using the bank as an intermediate storage
space. If this little description didn't make much sense to you, check
out our Genome Assembly Primer. It also has pointers to further
reading.
Objects in the bank may be identified by one or both of the following
identifiers: IID (internal identifier), an integer identifier internal
to AMOS; and EID (external identifier), a string representing some
external identifier of the record, e.g. the original name of a
sequencing read. Both identifiers must be unique within a specific
object type, but may be shared by objects of different types. For
example, there can only be one contig with an IID equal to 1; however,
there can be a contig, a read, and an overlap, all with IID = 1.
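The uniqueness rule can be illustrated with a small check. The sketch below assumes a hypothetical input of "TYPE IID" pairs, one per line (this is an illustration format, not an AMOS file format), and prints every pair that appears more than once. dup_iids is a made-up helper name.

```shell
# Sketch only: report IIDs that are reused within the same object type.
# Input (hypothetical format): one "TYPE IID" pair per line, e.g. "CTG 1".
dup_iids() {
    # Count each TYPE/IID pair; print it once, on its second occurrence.
    awk '{ key = $1 " " $2; if (seen[key]++ == 1) print key }'
}
```

Feeding it the lines "CTG 1", "RED 1", "CTG 1" would flag "CTG 1" (two contigs share an IID), while the read with IID 1 is fine because it is a different object type.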
Message files
The AMOS banks are not the only mechanism for AMOS modules to
communicate with each other, and with the "outside world". AMOS also
uses a flat-file format (AMOS message files) inspired by the format
used in Celera Assembler. This format is generally used as an
intermediate format for converting to and from external file formats.
The AMOS message files are then used to populate the data-structures
present in a bank.
For more details on the AMOS message file format check out the
following two documents. The use of message files will be
described in more detail in the remainder of this tutorial.
Reading and writing banks
To learn how to generate AMOS message files check out the section
called Creating inputs for AMOS. Assuming you already have an
AMOS message file, most of the modules will require that the
information from this file be loaded into a bank. This
section describes the commands used to transfer information between a
bank and the message file.
The command bank-transact can be used to load a message file into a
bank. In its simplest invocation:
bank-transact -b mybank -m mymessagefile
bank-transact loads the messages in mymessagefile into the bank mybank.
Note that this invocation assumes the bank already exists, and
bank-transact will fail otherwise. When creating a new bank you can
run:
bank-transact -c -b mybank -m mymessagefile
The option -c stands for "create". By also providing the option -f
(force), the bank will be overwritten if it already exists.
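Because the plain invocation fails on a missing bank and -c is needed to create one, a script that may run either way has to choose between the two forms. A minimal sketch, relying on the fact that a bank is a directory; load_messages is a hypothetical wrapper name, not an AMOS command:

```shell
# Sketch only: load a message file, creating the bank when it is missing.
# load_messages is a hypothetical wrapper around bank-transact.
load_messages() {
    bank=$1       # bank directory, e.g. mybank
    msgfile=$2    # AMOS message file to load

    if [ -d "$bank" ]; then
        # The bank already exists: add the messages to it.
        bank-transact -b "$bank" -m "$msgfile"
    else
        # No bank yet: create it with -c.
        bank-transact -c -b "$bank" -m "$msgfile"
    fi
}
```

The -f option is deliberately left out of this sketch so that an existing bank is never overwritten by accident.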
The contents of a bank can be output into a flat-file format with the
command:
bank-report -b mybank
By default bank-report outputs all the data in the bank. The output
can be restricted to certain message types by providing the
three-letter codes of the messages to be output, e.g.:
bank-report -b mybank CTG RED
will output all the contig (CTG) and read (RED) records. In addition,
bank-report allows the user to specify a list of EIDs (option -E) or a
list of IIDs (option -I) to be reported.
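In practice the report is usually redirected into a file so that it can be edited or reloaded later. The sketch below wraps that pattern; dump_bank is a hypothetical helper name and the output file name is up to you.

```shell
# Sketch only: write a bank report to a file.
# dump_bank is a hypothetical wrapper, not an AMOS tool.
# Usage: dump_bank BANK OUTFILE [TYPE ...]
dump_bank() {
    bank=$1
    outfile=$2
    shift 2
    # Any remaining arguments are three-letter message codes (e.g. CTG RED);
    # when none are given, bank-report outputs the whole bank.
    bank-report -b "$bank" "$@" > "$outfile"
}
```

For example, "dump_bank mybank contigs_and_reads.afg CTG RED" would write only the contig and read records.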
Bank locking
To make concurrent access to the bank safe, AMOS programs lock the bank
while they operate on it. There are two types of locks: for reading
and for writing. If a bank is locked for reading, other read accesses
are allowed but no writes. If a bank is locked for writing, no
concurrent accesses are allowed. Some of the AMOS tools (such as the
viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e.
the code ignores any locks placed on the bank.
In certain situations, if a program accessing the bank crashes, the
bank may remain locked, prohibiting further access. All
existing locks can be removed with the command (make sure that another
user is not accessing the same bank):
bank-unlock mybank
Bank versions
The specific format of the AMOS bank is closely related to the current
version of the AMOS software. The banks are not backward compatible,
i.e., a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A
simple solution for reading a bank created by an older version of AMOS
is to output the contents of the bank using bank-report (the AMOS
distribution contains old versions of the bank-report code, e.g.
bank-report-1.1), then reload the bank with the most recent
bank-transact command.
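The dump-and-reload procedure can be sketched as a single function. migrate_bank is a hypothetical helper name; the old report program (for example bank-report-1.1) is passed in as an argument because which old versions are available depends on your AMOS distribution.

```shell
# Sketch only: convert an old-format bank by dumping and reloading it.
# migrate_bank is a hypothetical helper, not an AMOS tool.
# Usage: migrate_bank OLD_REPORT_PROGRAM OLD_BANK NEW_BANK
migrate_bank() {
    old_report=$1    # e.g. bank-report-1.1
    old_bank=$2
    new_bank=$3

    msgfile=$(mktemp) || return 1

    # Dump with the old reader, then create a fresh bank with the
    # current bank-transact; stop if the dump itself fails.
    "$old_report" -b "$old_bank" > "$msgfile" &&
        bank-transact -c -b "$new_bank" -m "$msgfile"
    status=$?

    rm -f "$msgfile"
    return $status
}
```

The new bank is created under a different name, so the original bank is left untouched until you have checked the result.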
Pipelines
As it has hopefully become clear from the introduction to AMOS above,
most genome assembly tasks involve the sequential execution of several
modules, in an assembly line (or pipeline) fashion. AMOS
provides a mechanism for quickly putting together simple pipelines.
By "simple" we mean situations where the specific assembly
task involves running several programs in order, without the need for
more complex control structures such as "if" statements or loops.
To implement complex pipelines you will have to rely on Perl or
another general-purpose programming language.
AMOS pipelines are described in a simple interpreted language, and
consist of a series of steps that are executed in order. The steps are
meant to provide a logical breakdown of the individual assembly tasks,
representing the execution of one or more programs.
Each step in a pipeline is identified by a step number (a
throw-back to the days of the Basic language) providing the user with a
mechanism to execute only some of the steps of a pipeline.
To learn more about AMOS pipelines and how to write them, check out the
documentation for runAmos
(the pipeline executor), or check out one of the pipelines
distributed with AMOS (AMOScmp and minimus are good starting points).
Creating inputs for AMOS
The inputs to most AMOS programs must be provided in the AMOS message
format. For help converting non-AMOS file formats into
message files visit the
overview of conversion utilities.
Running AMOScmp
AMOScmp is a comparative assembler that can be used to assemble reads
from one genome (called the target)
using as a template the sequence of a related genome (called the reference).
Read the AMOScmp
documentation for a detailed description of this
program.
By default, running AMOScmp as follows:
AMOScmp prefix
assumes that the target is provided in the AMOS message file
prefix.afg, and the reference in the file prefix.1con. To use
different file locations, you can set the variables TGT and REF, either
directly within the AMOScmp script, or on the command line:
AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con" prefix
The prefix must still be provided as it is used to generate the name of
the output files.
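A small wrapper can apply the default file names and check that both inputs are readable before the pipeline starts. This is a sketch under the assumptions stated in the text above (defaults prefix.afg and prefix.1con, overridable through TGT and REF); run_amoscmp is a hypothetical name.

```shell
# Sketch only: run AMOScmp with optional target/reference overrides.
# run_amoscmp is a hypothetical wrapper, not part of AMOS.
# Usage: run_amoscmp PREFIX [TARGET_AFG [REFERENCE_1CON]]
run_amoscmp() {
    prefix=$1
    tgt=${2:-$prefix.afg}     # default target: prefix.afg
    ref=${3:-$prefix.1con}    # default reference: prefix.1con

    # Fail early with a clear message if an input is missing.
    for f in "$tgt" "$ref"; do
        if [ ! -r "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done

    AMOScmp -D "TGT=$tgt" -D "REF=$ref" "$prefix"
}
```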
AMOScmp will populate a bank named prefix.bnk, and will load into it a
set of contigs, as well as a scaffold, linking together contigs that
are adjacent along the reference. In addition, AMOScmp
outputs the set of contigs as both a multi-FASTA file prefix.fasta, and
a TIGR .contig file prefix.contig. Note that the consensus of
the contigs (reported in the FASTA file) is generated from the target
genome, and may differ from the reference genome (after all, the goal
of the assembler is to assemble the target). In fact, AMOScmp
uses sophisticated algorithms for detecting differences between the
target and reference in order to prevent misassemblies. For
more information refer to:
M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. Comparative
genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.
Running minimus
Minimus is a basic genome assembler that can be used for small assembly
jobs (e.g. a single gene, or a viral genome). Minimus is
currently used as a central component of the Influenza A sequencing
pipeline at The Institute
for Genomic Research. Read the minimus documentation
for more information.
To run minimus you must provide a set of shotgun reads in an AMOS
message file. Running:
minimus prefix
assumes the input is in file prefix.afg. After running, minimus
populates the bank prefix.bnk with a set of contigs. It also reports
the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig
file (prefix.contig). Note that minimus does not use mate-pairs. In
essence it is, in Celera Assembler terminology, a unitigger. Any
mate-pair information provided in the .afg will be silently ignored.
Viewing the result of an assembly
The content of a bank can be viewed with a program called
Hawkeye:
hawkeye mybank
For detailed information on how to use Hawkeye, refer to the Hawkeye documentation.
Validating assemblies
Even the best genome assemblers sometimes make mistakes. AMOS
provides a mechanism to run several checks on the output of an
assembler (assuming the data are already stored in a bank), through a
script called amosvalidate.
Amosvalidate runs through the assembly and identifies several
types of inconsistencies, such as clusters of SNPs in the assembled
reads, clusters of mate-pairs that are too close or too far from each
other (with respect to the estimated library sizes), and unassembled
reads that do not properly match the assembly. A full
description of these measures is beyond the scope of this document.
We are currently submitting a manuscript describing the tools
included in amosvalidate and will update this page when it gets
published.
All the potential assembly problems identified by amosvalidate are
written back into the bank as features, i.e., ranges along the
assembly. Each feature is tagged with the problem that was identified
in that region. Typically, users then load the assembly in the Hawkeye
viewer and examine the assembly in the tagged regions. Alternatively,
the features may be extracted from the bank and processed
automatically by specialized software (e.g. several assemblies of the
same genome can be compared by the number of features identified in
each assembly; the assembly with fewer features is likely "better").
Running amosvalidate is as simple as:
amosvalidate prefix
where prefix.bnk is the location of the bank.
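Since minimus populates prefix.bnk and amosvalidate reads the same bank, the two can be chained. The sketch below shows one way to do so; assemble_and_validate is a hypothetical helper name, and it assumes the reads are in prefix.afg as described in the minimus section.

```shell
# Sketch only: assemble with minimus, then validate the resulting bank.
# assemble_and_validate is a hypothetical helper, not an AMOS tool.
assemble_and_validate() {
    prefix=$1
    # amosvalidate runs only if minimus finished successfully.
    minimus "$prefix" &&
        amosvalidate "$prefix"
}

# Afterwards the tagged regions can be inspected in the viewer:
#   hawkeye "$prefix.bnk"
```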
Getting help
To report bugs in AMOS, or to get help, email us at:

To receive information regarding new releases and developments, please subscribe
to our moderated, low-traffic users' mailing list: