AMOS - programmer's guide
- Getting AMOS
- The .tar file
- Direct CVS access
- Autoconf basics (how to add your own code to the CVS tree)
- Adding a script to the AMOS tree
- Adding a C++ program to the AMOS tree
- AMOS messages and the Perl API
- The C++ API
- Overview of include files
- Basic terminology
- Dealing with AMOS message files
- Communicating with the bank
- Indices
Getting AMOS
AMOS can be downloaded from our Sourceforge download site: http://sourceforge.net/project/showfiles.php?group_id=134326 as a tar file, or directly from the AMOS CVS (see below).
The .tar file
If you choose to download AMOS as a .tar file, getting started is as
simple as untarring the file, running "./configure" from the top-level
directory, then "make all". For more details, see the Getting Started document as well as the INSTALL file provided in the top-level directory.
Direct CVS access
To access AMOS directly through anonymous CVS, use the following settings:
CVSROOT=:pserver:anonymous@cvs.sourceforge.net:/cvsroot/amos
CVS_RSH=ssh
You can then retrieve the AMOS distribution with the command:
cvs co -P AMOS
Before being able to compile the AMOS code you will need to create the
appropriate configuration files with the command "./bootstrap" run from
the top level directory. You will then be able to continue with
compilation as described above under the .tar file.
If you wish to play a more involved role in the development of AMOS, or
if you wish to contribute some of your code or bug fixes, please
contact us at:

Autoconf basics (how to add your own code to the CVS tree)
This section is not meant as documentation for the GNU autoconf
package. Below you will learn how to add a program to the AMOS
distribution, in an already existing directory. If you want help
with a more complex autoconf operation please contact us at the email
listed above.
The template for the Makefile that will be created by the
configure command (see description of compilation above) can be found
in the file Makefile.am in each of the directories. This file
consists of two sections: a description of the files that are going to
be installed when running "make install", and a description of each of
the files that will be compiled as part of the "make" command. If
you wish to add a program to the AMOS tree, you will thus need to add
both a record indicating this program will be installed by the make
process, and instructions on how to build this program. The
instructions for adding a script (either a Perl script or an AMOS
configuration file), or a C++ program are described below.
Adding a script to the AMOS tree
To add a script you can simply list it in the "dist_bin_SCRIPTS" variable at the beginning of the Makefile.am file, e.g.:
dist_bin_SCRIPTS = \
bank-unlock.pl
The build process will automatically add a "use lib" line to the
beginning of your Perl scripts indicating where the AMOS code is
installed. Furthermore, the #! line will be appropriately
modified according to the location of the Perl binary identified by the
configure process.
When building AMOS configuration files, the build process will
automatically update the BINDIR and NUCMER variables in your file to the
values identified by the configure process for the location of the AMOS
binary installation directory, and for the location of the nucmer
binary (part of the MUMmer distribution).
Adding a C++ program to the AMOS tree
To add a C++ program to AMOS, you must first add the name of the program to the "bin_PROGRAMS" variable in the Makefile.am file:
bin_PROGRAMS = \
bank2contig \
You must then specify instructions on how this binary will be built.
These instructions include the location of the source files used
in building the program:
bank2contig_SOURCES = \
bank2contig.cc
instructions on additional libraries that might be needed:
bank2contig_LDADD = \
$(top_builddir)/src/Common/libCommon.a \
$(top_builddir)/src/AMOS/libAMOS.a
or additional flags:
bank2contig_CPPFLAGS = \
-I$(top_srcdir)/src/Common
If you wish to use the global library and CFLAGS parameters you may provide just the _SOURCES variable.
AMOS messages and the Perl API
AMOS programs can communicate with each other using a flat file format inspired by the format used by Celera Assembler. An overview of this file format, and of the way AMOS objects are stored, is provided in the two files listed below:
The AMOS distribution provides a Perl module that can be used to parse
AMOS (and Celera Assembler) message files. For a detailed
description of the various functions provided by the AMOS::AmosLib
module you can use the perldoc documentation:
$ perldoc AMOS::AmosLib
Below we will only describe the use of this module to read and parse AMOS messages.
To include the AMOS::AmosLib module in your Perl program you will need to use the command:
use AMOS::AmosLib;
at the beginning of the code. If this module is not installed in
the Perl search path (which can be set in the PERLLIB environment
variable), you might have to also use the Perl command "use lib" to
specify the location of the AMOS library.
Like the C++ API (described below), reading AMOS messages from a file
involves first reading the message in its entirety, oblivious of the
data encoded within, then parsing the message to extract the individual
components. These two steps can be executed as follows:
my $rec = getRecord(\*STDIN); # read a record from the standard input
my ($id, $fields, $recs) = parseRecord($rec); # parse the information in the message
The first command retrieves the entire message from the input, i.e. a whole block of text between curly braces.
The second command retrieves the three components of the message:
- $id - the three letter code of the message (see Types of messages)
- $fields - hash table of the individual fields in the message.
E.g. for a read ($id eq "RED"), $$fields{"seq"} represents the
sequence of the read.
- $recs - array of any possible sub-messages. These messages
will need to be parsed individually with the parseRecord command.
An example of a sub-message is the TLE (tile) message, which indicates
the position of a read within a contig. $#$recs is the
index of the last sub-message (if $#$recs == -1, there are no
sub-messages).
The C++ API
Below is a quick overview of the AMOS C++ API. The quickest way to get started is to examine the file src/Bank/bank-tutorial.cc.
This file highlights the interaction with the AMOS bank through
the C++ API and contains copious comments meant to guide you through
your first AMOS program.
For a detailed description of all AMOS classes refer to the automatically generated doxygen API docs.
The main AMOS data structure is the bank, an indexed database of
assembly objects. This central data structure allows the integration
of multiple software modules that communicate by modifying the objects
stored in a shared bank.
Overview of include files
#include <foundation_AMOS.hh>    all of the below
#include <inttypes_AMOS.hh>      integer typedefs
#include <exceptions_AMOS.hh>    exception types
#include <datatypes_AMOS.hh>     structs
#include <databanks_AMOS.hh>     bank types
#include <messages_AMOS.hh>      message types and message NCodes
#include <universals_AMOS.hh>    assembly classes
Basic terminology
IID          internal integer identifier and object reference
EID          external string identifier
BID          bank-specific identifier (index of the file store, may be invalidated by bank operations)
3-Code       3-character identifier string for objects and fields
N-Code       an integer representation of a 3-code (Encode/Decode functions)
message      a single curly-bracketed AMOS message (see message grammar)
sub-message  a single curly-bracketed AMOS message contained by another (see message grammar)
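To make the relationship between a 3-code and an N-code concrete, here is a minimal sketch of one way three characters can be packed into an integer and recovered again. The functions encode3 and decode3 and the byte order are assumptions made for illustration only; they are not the AMOS Encode/Decode functions, whose actual packing may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not the AMOS Encode function): pack the three
// characters of a 3-code into the low 24 bits of an integer.
std::uint32_t encode3(const std::string& code) {
    assert(code.size() == 3);  // a 3-code has exactly three characters
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(code[0]))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[1])) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[2])) << 16);
}

// Hypothetical helper (not the AMOS Decode function): recover the three
// characters from an integer produced by encode3.
std::string decode3(std::uint32_t ncode) {
    std::string code(3, ' ');
    code[0] = static_cast<char>(ncode & 0xFF);
    code[1] = static_cast<char>((ncode >> 8) & 0xFF);
    code[2] = static_cast<char>((ncode >> 16) & 0xFF);
    return code;
}
```

With this packing, decode3(encode3("RED")) yields "RED" again, and comparing two object types becomes a single integer comparison instead of a string comparison.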
Relative orientation of reads/contigs (used in overlaps or scaffold links)
normal       ---a---> ---b--->
anti-normal  <---a--- <---b---
innie        ---a---> <---b---
outie        <---a--- ---b--->
Dealing with AMOS message files
Reading an AMOS message from a file is as simple as:
Message_t msg;
msg.read(cin);
Note that
the msg object is generic, representing a properly formatted message
object (see message grammar), irrespective of the actual assembly
object represented by the message. This object can be used to
read arbitrary message files, such as those generated by Celera
Assembler, even though the individual objects do not map to AMOS
objects.
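The idea of reading a message in its entirety before interpreting it can be illustrated without the AMOS library. The sketch below is a simplified stand-in, not Message_t::read: the function readRawMessage is hypothetical, and it assumes that a message opens on a line beginning with "{" and closes on a line consisting only of "}".

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of AMOS): return the next complete
// curly-bracketed block from `in`, including any nested sub-messages,
// or an empty string if the input ends first.
std::string readRawMessage(std::istream& in) {
    std::string line, record;
    int depth = 0;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '{') ++depth;    // message or sub-message opens
        if (depth == 0) continue;                        // skip text outside any message
        record += line + '\n';
        if (line == "}" && --depth == 0) return record;  // outermost message closed
    }
    return std::string();  // no complete message was available
}
```

The function never looks at the 3-code or at the fields, so it can return blocks whose contents it does not understand; interpreting the text is left to a separate parsing step, which mirrors the split between msg.read and readMessage.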
To assign the message contents to a specific object, e.g. a contig:
Contig_t contig;
contig.readMessage(msg);
Note that the readMessage operation will fail if the message does not properly encode an AMOS contig.
The reverse operation, writing a new message from an internal AMOS object, can be performed as follows:
contig.writeMessage(msg);
msg.write(cout);
Communicating with the bank
AMOS banks can be opened in two modes: for random access (bank mode), and
for sequential access (bank stream mode). To open a bank you must
also specify the type of the objects stored in it, by providing the
N-code of the object. Thus, to open a bank of contigs:
Bank_t contig_bank(Contig_t::NCODE);
BankStream_t contig_stream(Contig_t::NCODE);
contig_bank.open("mybank.dir");
contig_stream.open("mybank.dir");
The string "mybank.dir" refers to the physical location of the bank on
the disk, and represents the name of a directory that contains all the
relevant bank files. In addition to the location of the bank, the
open() command may specify a mode of access as B_READ, or B_WRITE, or
both (B_READ|B_WRITE) (the default access is B_READ):
contig_bank.open("mybank.dir", B_READ|B_WRITE);
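Combining access modes with "|" works because each mode occupies its own bit. The following sketch shows the general technique with made-up constants; TOY_READ, TOY_WRITE and their values are assumptions for illustration and are not the definitions of B_READ and B_WRITE in the AMOS headers.

```cpp
// Hypothetical flag type and values; the real B_READ/B_WRITE may differ.
typedef unsigned char ToyMode;
const ToyMode TOY_READ  = 0x1;  // bit 0: reading allowed
const ToyMode TOY_WRITE = 0x2;  // bit 1: writing allowed

// Test whether a combined mode includes read or write access.
bool allowsRead(ToyMode mode)  { return (mode & TOY_READ)  != 0; }
bool allowsWrite(ToyMode mode) { return (mode & TOY_WRITE) != 0; }
```

A mode built as TOY_READ | TOY_WRITE passes both checks, while TOY_READ alone passes only allowsRead.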
Bank streams can only be used for sequential access, e.g.:
Contig_t contig;
contig_stream >> contig; // read from bank
contig_stream << contig; // write to bank
The sequential access mode is useful for processing anonymous objects
(without an assigned IID or EID), or simply for the ease of use.
Random access banks can be used to perform more complex operations:
// lookup by IID
if (! contig_bank.existsIID(1))
cerr << "Cannot find object with iid 1" << endl;
// lookup by EID
if (! contig_bank.existsEID("bigcontig"))
cerr << "Cannot find object with eid bigcontig" << endl;
contig_bank.fetch(1, contig); // retrieve object by IID
contig_bank.fetch("bigcontig", contig); // retrieve object by EID
contig_bank.append(contig); // add an object to the bank
contig_bank.remove(1); // remove an object by IID
contig_bank.remove("bigcontig"); // remove an object by EID
Note that by default objects are not physically removed from the bank
when using the remove command, rather they are marked for deletion.
To compact the bank after several remove operations you will need
to run
contig_bank.clean();
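The difference between marking an object for deletion and physically removing it can be shown with a toy container. This is only a sketch of the general mark-and-compact idea; the class ToyBank and its members are hypothetical and say nothing about how AMOS banks are stored on disk.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a bank; all names here are illustrative, not AMOS API.
class ToyBank {
    struct Slot { std::uint32_t iid; bool deleted; };
    std::vector<Slot> slots_;

public:
    void append(std::uint32_t iid) { slots_.push_back(Slot{iid, false}); }

    // Mark the object for deletion; its slot still occupies storage.
    void remove(std::uint32_t iid) {
        for (Slot& s : slots_)
            if (s.iid == iid) s.deleted = true;
    }

    // Physically drop every slot that was marked for deletion.
    void clean() {
        std::vector<Slot> kept;
        for (const Slot& s : slots_)
            if (!s.deleted) kept.push_back(s);
        slots_.swap(kept);
    }

    // An object exists only if it is stored and not marked for deletion.
    bool exists(std::uint32_t iid) const {
        for (const Slot& s : slots_)
            if (s.iid == iid && !s.deleted) return true;
        return false;
    }

    // Number of slots held in storage, including those marked for deletion.
    std::size_t storedSlots() const { return slots_.size(); }
};
```

After remove, the object can no longer be found but storedSlots is unchanged; only clean reduces the storage, which is why compaction is worth running after a batch of removals rather than after each one.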
Indices
There is often the need to cross-reference the various objects stored
in a bank, e.g. to obtain the list of reads present in a contig, or,
for a read, to identify the contig or scaffold it belongs to.
Some such relationships are natively represented in the AMOS
objects (e.g. contig messages also list the reads belonging to them),
for others it is necessary to build lookup tables. AMOS helps you
by providing a generic mechanism for specifying lookup tables linking
arbitrary AMOS types. The AMOS indices are implemented using STL
hash multi-maps, which allow one-to-many correspondences.
A simple example of the use of indices is shown below. The code
generates a map linking each read to its mate (this information is
normally contained in the Fragment_t object).
Index_t rd2mate;
rd2mate.buildReadMate("mybank"); // build index linking reads to their mates in the bank "mybank"
ID_t mate = rd2mate.lookup(5); // find mate of read with IID=5
if (mate == NULL_ID) // if no mate found, returns NULL_ID
cerr << "Read 5 has no mate " << endl;
This example relied on the pre-defined function buildReadMate that
automatically builds an index of reads to mates. Several such
predefined functions are provided, see the documentation for the
Index_t object. If you need to build your own index, for which no
predefined build function exists, you can use the insert command to add
an identifier pair to the index:
Index_t obj2obj;
obj2obj.insert(id1, id2);
In case of a one-to-many mapping (e.g. all the reads in a scaffold) you
can retrieve all the IDs corresponding to a query ID using:
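// each entry of the underlying multi-map is a (query ID, matching ID) pair
pair<Index_t::const_iterator, Index_t::const_iterator> startend = obj2obj.lookupAll(myid);
for (Index_t::const_iterator i = startend.first; i != startend.second; ++i)
cout << "Found id " << i->second << endl;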
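Since the indices are described as hash multi-maps, the one-to-many lookup can be sketched with the standard library alone. The code below uses std::unordered_multimap as a stand-in; the names Id, ToyIndex and lookupAll are hypothetical and this is not the Index_t implementation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-ins for illustration; not the AMOS ID_t or Index_t types.
typedef std::uint32_t Id;
typedef std::unordered_multimap<Id, Id> ToyIndex;

// Collect every ID associated with `key` (empty if the key is absent).
std::vector<Id> lookupAll(const ToyIndex& index, Id key) {
    std::vector<Id> values;
    // equal_range returns the half-open range of entries whose key matches.
    std::pair<ToyIndex::const_iterator, ToyIndex::const_iterator> range =
        index.equal_range(key);
    for (ToyIndex::const_iterator i = range.first; i != range.second; ++i)
        values.push_back(i->second);  // first is the key, second the linked ID
    return values;
}
```

Inserting the pairs (7, 1) and (7, 2) and then calling lookupAll(index, 7) returns both linked IDs, in an unspecified order, because a hash multi-map does not keep entries sorted.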