AMOS Getting Started

From AMOS WIKI
Revision as of 08:20, 5 June 2010 by Floflooo


"Is AMOS an assembler?" is one of the first questions we are asked. The short answer is no. AMOS is not an assembler but a software infrastructure for developing assembly tools. If you are only interested in running an off-the-shelf assembler on your shotgun data, do not despair: AMOS provides two such assemblers, AMOScmp (a comparative assembler) and Minimus (a basic assembler for small datasets). It is important to realize, however, that with a little bit of programming you can use AMOS to put together your own shotgun assembler, customized for the specific characteristics of your data.

This page provides the basic information needed to get started using AMOS. Advanced AMOS users can go directly to the in-depth resources linked from the AMOS main page.

Downloading AMOS

AMOS can be downloaded from Sourceforge using the following link: http://sourceforge.net/project/showfiles.php?group_id=134326

No need to remember this URL as you can easily reach it from the [AMOS main page].

This link will bring you to the Sourceforge download page for our project. While older versions of our code are also available from this page, we recommend that you download the latest version to take advantage of the full functionality of the code.

AMOS is released as a source-code package, with the exception of the OSX version of the assembly viewer Hawkeye, which can be downloaded as a binary from the File Release section of the download page. Instructions for compiling and installing AMOS are provided below.

If you want to edit the source code, you should download the code from CVS following the directions here: http://sourceforge.net/scm/?type=cvs&group_id=134326


Installing AMOS

After reading this section make sure you also read the INSTALL file distributed with AMOS. This file may contain information pertaining to the latest version of AMOS that is not included here.

Normal installation

The AMOS source package has a name like amos-1.4.5.tar.gz, where 1.4.5 is the version of the code. Once you untar this file (using "tar -xzf amos-1.4.5.tar.gz" in Linux, or "gunzip -c amos-1.4.5.tar.gz | tar xf -" in other flavors of Unix) you will find the current AMOS distribution in a directory named amos-1.4.5. The next steps assume you have cd'd into this directory.
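The unpacking step can be sketched as a short shell snippet. The version 1.4.5 is the example used above; the variable names AMOS_VERSION, AMOS_ARCHIVE and AMOS_SRCDIR are purely illustrative and not part of AMOS.

```shell
# Illustrative sketch: unpack an AMOS source archive in a portable way.
# AMOS_VERSION is a hypothetical variable; set it to the version you downloaded.
AMOS_VERSION="${AMOS_VERSION:-1.4.5}"
AMOS_ARCHIVE="amos-${AMOS_VERSION}.tar.gz"
AMOS_SRCDIR="amos-${AMOS_VERSION}"

if [ -f "$AMOS_ARCHIVE" ]; then
    # gunzip -c writes the decompressed stream to stdout and keeps the archive.
    gunzip -c "$AMOS_ARCHIVE" | tar xf - && cd "$AMOS_SRCDIR"
else
    echo "Archive $AMOS_ARCHIVE not found in $(pwd)" >&2
fi
```

Piping through gunzip -c avoids relying on the -z option of tar, which not every Unix tar supports.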

AMOS uses the GNU autoconf package to reduce cross-platform compatibility issues. Before compiling the code you will need to run the configure script that will probe your system for the locations of all software packages required by AMOS.

By simply running:

./configure

you will prepare AMOS to be installed in the directory hosting the source package. This is OK if you are just testing AMOS. We recommend, however, that you provide the configure script with a more permanent home for AMOS, e.g.:

./configure --prefix=/usr/local

will ultimately lead the AMOS directory hierarchy to be installed underneath /usr/local/.

After running configure, check the messages left on your screen to make sure no errors occurred. Errors during the configure step can lead to an incomplete build.

To compile the code you need to simply run:

make

followed by

make install

to install AMOS into the directory selected with the --prefix option to configure.
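The three steps above can be chained so that the build stops at the first failure and the configure messages are kept for review. This is only a sketch: build_amos and AMOS_PREFIX are hypothetical names, and the log file name configure.log is an arbitrary choice.

```shell
# Illustrative sketch: configure, build and install AMOS, stopping on failure.
# AMOS_PREFIX is a hypothetical variable naming the installation directory.
AMOS_PREFIX="${AMOS_PREFIX:-/usr/local}"

build_amos() {
    # Keep a copy of the configure messages so they can be reviewed afterwards.
    ./configure --prefix="$AMOS_PREFIX" > configure.log 2>&1 || {
        echo "configure failed; see configure.log" >&2
        return 1
    }
    # List any warnings (for example, a missing nucmer or Qt toolkit).
    grep -i "warning" configure.log || true
    make && make install
}
```

Run build_amos from the top of the unpacked source directory. Installing under /usr/local usually requires administrator rights, so a prefix inside your home directory may be more convenient for a first test.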

Normally, these steps are sufficient to install AMOS on most UNIX systems. If you encounter errors during configuration or compilation, or if you are trying to install AMOS on an OSX or Cygwin system, please read the following sub-sections.


Specifying the location of MUMmer

If the configure script gives you a message like:

WARNING! nucmer was not found but is required to run AMOScmp
   install nucmer if planning on using AMOScmp

you either have not installed the MUMmer package, or you have installed it in a location where the configure script cannot find it. MUMmer (the nucmer program in particular) is required by the comparative assembler AMOScmp.

To remedy this situation, please install MUMmer following instructions found at http://mummer.sourceforge.net.

If MUMmer is already installed, but configure cannot find it, you can specify the location of the nucmer program by setting the environment variable NUCMER, e.g.:

NUCMER=/usr/local/bin/mummer/nucmer
export NUCMER

in a "traditional" shell (sh, bash, ksh, etc.), or

setenv NUCMER /usr/local/bin/mummer/nucmer

in csh or tcsh. Of course, you'll need to replace /usr/local/bin/mummer/nucmer with the actual location of this program on your system.


Specifying the location of the QT library

On most Unix installations (see below for OSX and Cygwin), the QT library should be properly installed and AMOS will build without any problems. If, however, you notice a message like:

WARNING! Qt3 toolkit was not found but is required to run AMOS GUIs

the configure process was not able to find the QT library on your system. Check with your system administrator to have this toolkit installed on your system. If, however, you are certain the toolkit is installed, but AMOS still didn't find it, you can directly specify the location of the toolkit directory, or specifically the include, bin, and lib directories, where QT is installed, and the name of the library file, using the following options to the configure script:

--with-Qt-dir
--with-Qt-include-dir
--with-Qt-lib-dir
--with-Qt-bin-dir
--with-Qt-lib

Ubuntu installation

See the Ubuntu installation page.

Fedora installation

See the Fedora installation page.

OSX installation

Download QT/Mac 3.3.x from Trolltech.

As of 4/12/06, the most recent version is available at: ftp://ftp.trolltech.com/qt/source/qt-mac-free-3.3.6.tar.gz

Follow the Trolltech instructions for building QT. Make sure to set the environment variable QTDIR appropriately.

Run ./configure to configure AMOS. Note the QT configure tests may fail. Run make to build AMOS. Then run:

cd src/bankViewer
$QTDIR/bin/qmake
make

The Hawkeye binary will then build in the Hawkeye directory. You will have to manually copy it to your bin directory.


Cygwin installation

Make sure you have the following packages installed:

  • Base:
    • ash
    • coreutils
    • gawk
    • gzip
  • Devel:
    • gcc-g++
    • make
  • X11:
    • qt3
    • qt3-bin
    • qt3-devel
    • xorg-x11-* (all x11 packages, including fonts)

Run ./configure from the top level source directory. Note the QT configure tests will fail. Run make to build AMOS except for the Hawkeye directory. Then run:

cd src/bankViewer
qmake-qt3
make

The Hawkeye binary will then build in the bankViewer directory. You will have to manually copy it to your bin directory.

Running AMOS

Basic AMOS concepts

AMOS consists of a collection of modules that operate on a central data structure called a bank. A bank is really just a directory that contains a database (organized as a collection of indexed files) comprising assembly-related objects such as reads, contigs, scaffolds, etc. The modules communicate with each other by making changes to the bank.

For example, an assembler might consist of three modules: an overlapper, a contigger, and a multi-aligner. The overlapper first reads the shotgun reads from the bank, compares them to each other, and writes back to the bank a list of overlaps, i.e. pairs of reads that match each other. The contigger then reads the collection of overlaps and makes sense of it by producing a layout of the reads that is consistent with most of the observed overlaps. The contigger then writes these contigs (contiguous chunks of the genome) to the bank. Finally, the multi-aligner reads both the reads and the contigs from the bank, builds a multiple alignment of the reads using the layout produced by the contigger as a guide, then updates the contigs with the detailed alignment information. Thus, the three programs are able to communicate with each other using the bank as an intermediate storage space.

If this little description didn't make much sense to you, check out our Genome Assembly Primer. It also has pointers to further reading.

Objects in the bank may be identified by one or both of the following identifiers: IID (internal identifier), an integer identifier internal to AMOS; and EID (external identifier), a string representing some external identifier of the record, e.g. the original name of a sequencing read. Both identifiers must be unique within a specific object type, but may be shared by objects of different types. For example, there can only be one contig with an IID equal to 1, but there can be a contig, a read, and an overlap that all have IID = 1.


Message files

The AMOS banks are not the only mechanism for AMOS modules to communicate with each other and with the "outside world". AMOS also uses a flat-file format (AMOS message files) inspired by the format used in Celera Assembler. This format is generally used as an intermediate format for converting to and from external file formats. The AMOS message files are then used to populate the data structures present in a bank.

For more details on the AMOS message file format check out the Infrastructure pages. The use of message files will be described in more detail in the remainder of this tutorial.


Reading and writing banks

To learn how to generate AMOS message files check out the section called Creating inputs for AMOS. Assuming you already have an AMOS message file, most of the modules will require that the information from this file be loaded into a bank. This section describes the commands used to transfer information between a bank and the message file.

The command bank-transact can be used to load a message file into a bank. In its simplest invocation:

bank-transact -b mybank -m mymessagefile

bank-transact loads the messages in mymessagefile into the bank mybank. Note that this invocation assumes the bank already exists, and bank-transact will fail otherwise. To create a new bank you can run:

bank-transact -c -b mybank -m mymessagefile

The option -c stands for "create". By also providing the option -f (force), the bank will be overwritten if it already exists.
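The choice between the two invocations can be wrapped in a small shell function. The sketch below is only an illustration: load_messages and DRY_RUN are hypothetical names, and it treats any existing directory as an existing bank, which is a simplifying assumption. Only the -b, -m and -c options described above are used.

```shell
# Illustrative sketch: load a message file into a bank, creating it if needed.
# load_messages is a hypothetical helper, not an AMOS command.
# Set DRY_RUN=1 to print the command instead of running it.
load_messages() {
    bank="$1"
    msgfile="$2"
    if [ -d "$bank" ]; then
        # A bank is a directory, so an existing directory is assumed to be a bank.
        set -- bank-transact -b "$bank" -m "$msgfile"
    else
        # No such directory yet: ask bank-transact to create the bank.
        set -- bank-transact -c -b "$bank" -m "$msgfile"
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, "DRY_RUN=1 load_messages mybank mymessagefile" prints the command that would be run. The -f option is deliberately left out so that an existing bank is never overwritten by accident.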

The contents of a bank can be output into a flat-file format with the command:

bank-report -b mybank

By default, bank-report outputs all the data in the bank. The output can be restricted to certain message types by providing the three-letter codes of the messages to be output, e.g.:

bank-report -b mybank CTG RED

will output all the contigs (CTG) and read (RED) records. In addition bank-report allows the user to specify a list of EIDs (option -E) or a list of IIDs (option -I) that will be reported.
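A quick sanity check on such a report is to count the records of each type. The sketch below assumes that the report arrives on standard output and relies on each record in the message format opening with a line of the form "{XXX", where XXX is the three-letter code. count_records is a hypothetical helper, not an AMOS command.

```shell
# Illustrative sketch: count the records of one type in AMOS message text
# read from standard input. count_records is a hypothetical helper.
count_records() {
    # Each record opens with "{" followed by its three-letter code,
    # so counting those opening lines counts the records of that type.
    grep -c "^{$1"
}
```

A possible use is "bank-report -b mybank RED | count_records RED", which would print the number of read records in the bank.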


Bank locking

To allow concurrent access to the bank, AMOS programs lock the bank while they operate on it. There are two types of locks: read locks and write locks. If a bank is locked for reading, other read accesses are allowed but no writes. If a bank is locked for writing, no concurrent accesses are allowed. Some of the AMOS tools (such as the viewer Hawkeye) have an option to load a bank in "inspect" mode, i.e. the code ignores any locks placed on the bank.

In certain situations, if a program accessing the bank crashes, the bank may remain locked, prohibiting further access. All existing locks can be removed with the command (make sure that another user is not accessing the same bank):

bank-unlock mybank


Bank versions

The specific format of the AMOS bank is closely tied to the version of the AMOS software that wrote it. Banks written by older versions cannot be read directly by newer versions, e.g. a bank produced by AMOS 1.0 will not be readable by AMOS 1.5. A simple solution for reading a bank created by an older version of AMOS is to output the contents of the bank using the matching bank-report (the AMOS distribution contains old versions of the bank-report code, e.g. bank-report-1.1), then reload the output into a new bank with the most recent bank-transact command.
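The dump-and-reload procedure can be sketched as a shell function. This is an illustration under stated assumptions: migrate_bank and OLD_REPORT are hypothetical names, the old bank-report binary is assumed to accept -b and to write the messages to standard output, and the intermediate file name is an arbitrary choice.

```shell
# Illustrative sketch: migrate a bank written by an older AMOS release.
# migrate_bank is a hypothetical helper. OLD_REPORT names the old bank-report
# binary matching the bank's version (bank-report-1.1 is the example above).
migrate_bank() {
    old_bank="$1"
    new_bank="$2"
    dump="${new_bank}.afg"
    # Dump the old bank to a message file using the matching old reader.
    "${OLD_REPORT:-bank-report-1.1}" -b "$old_bank" > "$dump" || return 1
    # Load the dump into a freshly created bank with the current bank-transact.
    bank-transact -c -b "$new_bank" -m "$dump"
}
```

For example, "migrate_bank old.bnk new.bnk" leaves the original bank untouched and writes the intermediate dump to new.bnk.afg.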


Pipelines

As has hopefully become clear from the introduction to AMOS above, most genome assembly tasks involve the sequential execution of several modules, in an assembly line (or pipeline) fashion. AMOS provides a mechanism for quickly putting together simple pipelines. By "simple" we mean situations where the specific assembly task involves running several programs in order, without the need for more complex control structures such as "if" statements or loops. To implement complex pipelines you will have to rely on Perl or another general-purpose programming language.

AMOS pipelines are described in a simple interpreted language, and consist of a series of steps that are executed in order. The steps are meant to provide a logical breakdown of the individual assembly tasks, each representing the execution of one or more programs. Each step in a pipeline is identified by a step number (a throwback to the days of the Basic language), providing the user with a mechanism to execute only some of the steps of a pipeline.

To learn more about AMOS pipelines and how to write them, check out the documentation for runAmos (the pipeline executor), or check out one of the pipelines distributed with AMOS (AMOScmp and minimus are good starting points).

Creating inputs for AMOS

The inputs to most AMOS programs must be provided in the AMOS message format. For help converting non-AMOS file formats into message files see the File conversion utilities.


Running AMOScmp

AMOScmp is a comparative assembler that can be used to assemble reads from one genome (called the target) using as a template the sequence of a related genome (called the reference). Read the AMOScmp documentation for a detailed description of this program.

By default, running AMOScmp as follows:

AMOScmp prefix

assumes that the target is provided in the AMOS message file prefix.afg, and the reference in the file prefix.1con. To use different file locations, you can set the variables TGT and REF, either directly within the AMOScmp script, or on the command line:

AMOScmp -D "TGT=mytarget.afg" -D "REF=myreference.1con"  prefix

The prefix must still be provided as it is used to generate the name of the output files.
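Because a missing input file is an easy mistake to make, the invocation above can be wrapped in a function that checks its inputs first. run_amoscmp is a hypothetical wrapper written for illustration; the TGT and REF variables and the default file names are the ones described above.

```shell
# Illustrative sketch: check the input files before starting AMOScmp.
# run_amoscmp is a hypothetical wrapper, not part of AMOS.
# Usage: run_amoscmp prefix [target.afg] [reference.1con]
run_amoscmp() {
    prefix="$1"
    tgt="${2:-$prefix.afg}"     # default target:    prefix.afg
    ref="${3:-$prefix.1con}"    # default reference: prefix.1con
    for f in "$tgt" "$ref"; do
        if [ ! -r "$f" ]; then
            echo "Missing input file: $f" >&2
            return 1
        fi
    done
    AMOScmp -D "TGT=$tgt" -D "REF=$ref" "$prefix"
}
```

Calling "run_amoscmp prefix mytarget.afg myreference.1con" is then equivalent to the command shown above, except that it refuses to start when either file cannot be read.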

AMOScmp will populate a bank named prefix.bnk, and will load into it a set of contigs, as well as a scaffold, linking together contigs that are adjacent along the reference. In addition, AMOScmp outputs the set of contigs as both a multi-FASTA file prefix.fasta, and a TIGR .contig file prefix.contig. Note that the consensus of the contigs (reported in the FASTA file) is generated from the target genome, and may differ from the reference genome (after all, the goal of the assembler is to assemble the target). In fact, AMOScmp uses sophisticated algorithms for detecting differences between the target and reference in order to prevent misassemblies. For more information refer to:

M. Pop, A. Phillippy, A.L. Delcher and S.L. Salzberg. Comparative genome assembly. Briefings in Bioinformatics. 5(3), pp. 237-248, 2004.


Running minimus

Minimus is a basic genome assembler that can be used for small assembly jobs (e.g. a single gene, or a viral genome). Minimus is currently used as a central component of the Influenza A sequencing pipeline at The Institute for Genomic Research. Read the minimus documentation for more information.

To run minimus you must provide a set of shotgun reads in an AMOS message file. Running:

minimus prefix

assumes the input is in the file prefix.afg. After running, minimus populates the bank prefix.bnk with a set of contigs. It also reports the contigs in both a FASTA file (prefix.fasta) and a TIGR .contig file (prefix.contig). Note that minimus does not use mate-pairs. In essence it is, in Celera Assembler terminology, a unitigger. Any mate-pair information provided in the .afg file will be silently ignored.
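After a run, it can be useful to confirm that the expected outputs are present before moving on. The sketch below checks for the three outputs named above; check_assembly_outputs is a hypothetical helper, and treating an empty FASTA or .contig file as a problem is an assumption that may not suit every dataset.

```shell
# Illustrative sketch: verify that an assembly run left its expected outputs.
# check_assembly_outputs is a hypothetical helper, not part of AMOS.
check_assembly_outputs() {
    prefix="$1"
    status=0
    # The bank is a directory.
    [ -d "$prefix.bnk" ] || { echo "missing bank: $prefix.bnk" >&2; status=1; }
    # The contig reports should exist and be non-empty.
    for f in "$prefix.fasta" "$prefix.contig"; do
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; status=1; }
    done
    return "$status"
}
```

A typical use would be "minimus prefix && check_assembly_outputs prefix". The same check applies to AMOScmp, which produces outputs with the same names.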


Viewing the result of an assembly

The content of a bank can be viewed with a program called Hawkeye:

hawkeye mybank

For detailed information on how to use Hawkeye, refer to the Hawkeye documentation.


Validating assemblies

Even the best genome assemblers sometimes make mistakes. AMOS provides a mechanism to run several checks on the output of an assembler (assuming the data are already stored in a bank), through a script called amosvalidate. Amosvalidate runs through the assembly and identifies several types of inconsistencies, such as clusters of SNPs in the assembled reads, clusters of mate-pairs that are too close or too far from each other (with respect to the estimated library sizes), and unassembled reads that do not properly match the assembly. A full description of these measures is beyond the scope of this document. We are currently submitting a manuscript describing the tools included in amosvalidate and will update this page when it gets published.

All the potential assembly problems identified by amosvalidate are written back into the bank as features, i.e. ranges along the assembly. Each feature is tagged with the problem that was identified in that region. Typically, users then load the assembly in the Hawkeye viewer and examine the tagged regions. Alternatively, the features may be extracted from the bank and processed automatically by specialized software (e.g. several assemblies of the same genome can be compared by the number of features identified in each; the assembly with fewer features is likely "better").

Running amosvalidate is as simple as:

amosvalidate prefix

where prefix.bnk is the location of the bank.

Getting help

To report bugs in AMOS, or to get help, email us at:

amos-help (at) lists (dot) sourceforge (dot) net

To receive information regarding new releases and developments, please subscribe to our moderated, low-traffic users' mailing list:

amos-users (at) lists (dot) sourceforge (dot) net