Programmer's guide

Getting AMOS

AMOS can be downloaded as a tar file from our SourceForge download site, http://sourceforge.net/project/showfiles.php?group_id=134326, or cloned directly from the AMOS git repository (see below).


The .tar file

If you choose to download AMOS as a .tar file, getting started is as simple as untarring the file, running "./configure" from the top level directory, and then running "make all". For more details see the Getting Started document as well as the INSTALL file provided in the top level directory.


GIT access

To access AMOS directly, you can clone a copy of the source code to your local machine:

 ## clone the remote master repo to a local copy
 git clone git://amos.git.sourceforge.net/gitroot/amos/amos 

If you are a registered AMOS developer with read/write access to the repository, you can check out the code using:

 ## clone the remote master to a local copy (replace <SFNAME> with your SourceForge username)
 git clone ssh://<SFNAME>@git.code.sf.net/p/amos/code
 
 ## make some changes
 
 ## now commit your changes to your local repo
 git commit -a -m "brief change message"
 
 ## once you are happy, send the changes to the master repo
 git push
 
 ## update local repo with remote
 git pull
 

This page lists recent changes:

http://amos.git.sourceforge.net/git/gitweb.cgi?p=amos/amos;a=summary


Here are a couple of tutorials on how to use git to commit changes, make branches, etc.:

 http://git-scm.com/documentation: Detailed documentation
 http://git-scm.com/course/svn.html: Fast tutorial for svn users

Before being able to compile the AMOS code you will need to create the appropriate configuration files with the command "./bootstrap" run from the top level directory. You will then be able to continue with compilation as described above under the .tar file.

If you wish to play a more involved role in the development of AMOS, or if you wish to contribute some of your code or bug fixes, please contact us at:

amos-help (at) lists (dot) sourceforge (dot) net


Autoconf basics (how to add your own code to the source tree)

This section is not meant as documentation for the GNU autoconf package. Below you will learn how to add a program to the AMOS distribution, in an already existing directory. If you want help with a more complex autoconf operation please contact us at the email listed above.

The template for the Makefile that the configure command creates (see the description of compilation above) is the file Makefile.am found in each directory. This file consists of two sections: a description of the files that will be installed when running "make install", and a description of each of the files that will be compiled as part of the "make" command. To add a program to the AMOS tree, you therefore need to add both a record indicating that the program will be installed by the make process, and instructions on how to build it. The steps for adding a script (either a Perl script or an AMOS configuration file) or a C++ program are described below.


Adding a script to the AMOS tree

To add a script you can simply list it in the "dist_bin_SCRIPTS" variable at the beginning of the Makefile.am file, e.g.:

dist_bin_SCRIPTS = \
        bank-unlock.pl

The build process will automatically add a "use lib" line to the beginning of your Perl scripts indicating where the AMOS code is installed. Furthermore, the #! line will be appropriately modified according to the location of the Perl binary identified by the configure process.

When building AMOS configuration files, the build process will automatically update the BINDIR and NUCMER variables in your file to the values identified by the configure process: the location of the AMOS binary installation directory, and the location of the nucmer binary (part of the MUMmer distribution).


Adding a C++ program to the AMOS tree

To add a C++ program to AMOS, you must first add the name of the program to the "bin_PROGRAMS" variable in the Makefile.am file:

bin_PROGRAMS = \
       bank2contig

You must then specify instructions on how this binary will be built. These instructions include the location of the source files used in building the program:

bank2contig_SOURCES = \
       bank2contig.cc

instructions on additional libraries that might be needed:

bank2contig_LDADD = \
       $(top_builddir)/src/Common/libCommon.a \
       $(top_builddir)/src/AMOS/libAMOS.a

or additional flags:

bank2contig_CPPFLAGS = \
   -I$(top_srcdir)/src/Common

If you wish to use the global library and CFLAGS parameters you may provide just the _SOURCES variable.


AMOS messages and the Perl API

AMOS programs can communicate with each other using a flat file format inspired by the format used by Celera Assembler. An overview of this file format, and of the way AMOS objects are stored, is provided on the Infrastructure page.


The AMOS distribution provides a Perl module that can be used to parse AMOS (and Celera Assembler) message files. For a detailed description of the various functions provided by the AMOS::AmosLib module you can use the perldoc documentation:

$ perldoc AMOS::AmosLib


Below we will only describe the use of this module to read and parse AMOS messages.

To include the AMOS::AmosLib module in your perl program you will need to use the command:

use AMOS::AmosLib;

at the beginning of the code. If this module is not installed in the Perl search path (which can be extended through the PERL5LIB environment variable, or PERLLIB when PERL5LIB is not set), you might also have to use the Perl command "use lib" to specify the location of the AMOS library.

As with the C++ API (described below), reading AMOS messages from a file involves first reading the message in its entirety, without regard to the data encoded within, and then parsing the message to extract the individual components. These two steps can be executed as follows:

my $rec = getRecord(\*STDIN);  # read a record from the standard input
my ($id, $fields, $recs) = parseRecord($rec);  # parse the information in the message

The first command retrieves the entire message from the input, i.e. a whole block of text between curly braces. The second command retrieves the three components of the message:

  1. $id - the three letter code of the message (see Types of messages)
  2. $fields - hash table of the individual fields in the message.  E.g. for a read ($id eq "RED"), $$fields{"seq"} represents the sequence of the read.
  3. $recs - array of any sub-messages.  These messages need to be parsed individually with the parseRecord command.  An example of a sub-message is the TLE (tile) message indicating the position of a read within a contig.  $#$recs represents the index of the last sub-message (if $#$recs == -1, there are no sub-messages).
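
The read-then-parse idea described above does not depend on Perl. As an illustration only, the following standalone C++ sketch splits a message into the same three parts (code, fields, sub-messages). It does not use AmosLib or the AMOS C++ API: the Record struct and the readRecord and parseBody functions are hypothetical names. The sketch also assumes that braces appear at the start of a line, that single-line fields have the form name:value, and that a multi-line field starts with an empty value and ends with a line containing only a period.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed message; not an AMOS type.
struct Record {
    std::string code;                           // three-letter code, e.g. "CTG"
    std::map<std::string, std::string> fields;  // field name -> value
    std::vector<Record> subs;                   // nested sub-messages
};

// Parse one message whose opening line ("{XXX") has already been read.
static Record parseBody(std::istream& in, const std::string& header) {
    assert(!header.empty() && header[0] == '{');
    Record rec;
    rec.code = header.substr(1, 3);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '}') break;                  // end of this message
        if (line[0] == '{') {                       // nested sub-message
            rec.subs.push_back(parseBody(in, line));
            continue;
        }
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;   // not a field line; skip it
        std::string key = line.substr(0, colon);
        std::string value = line.substr(colon + 1);
        if (value.empty()) {                        // assumed multi-line field
            std::string part;
            while (std::getline(in, part) && part != ".")
                value += part;                      // lines are joined without newlines
        }
        rec.fields[key] = value;
    }
    return rec;
}

// Read and parse the next top-level message; returns false at end of input.
bool readRecord(std::istream& in, Record& out) {
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '{') {
            out = parseBody(in, line);
            return true;
        }
    }
    return false;
}
```

Calling readRecord on a std::istringstream (or std::cin) in a loop visits each top-level message in turn; sub-messages such as TLE end up in the subs vector of their parent, mirroring the $recs array of the Perl interface.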

The C++ API

Below is a quick overview of the AMOS C++ API. The quickest way to get started is to examine the file src/Bank/bank-tutorial.cc. This file highlights the interaction with the AMOS bank through the C++ API and contains copious comments meant to guide you through your first AMOS program.

For a detailed description of all AMOS classes refer to the automatically generated doxygen API docs: http://amos.sourceforge.net/docs/api/

The main AMOS data structure is the bank, an indexed database of assembly objects. This central data structure allows the integration of multiple software modules, which communicate by modifying the objects stored in a shared bank.


Overview of include files

#include <foundation_AMOS.hh>   all of the below
#include <inttypes_AMOS.hh>     integer typedefs
#include <exceptions_AMOS.hh>   exception types
#include <datatypes_AMOS.hh>    structs
#include <databanks_AMOS.hh>    bank types
#include <messages_AMOS.hh>     message types and message NCodes
#include <universals_AMOS.hh>   assembly classes


Basic terminology

  • IID: internal integer identifier and object reference
  • EID: external string identifier
  • BID: bank specific identifier (index of the file store, may be invalidated by bank operations)
  • 3-Code: 3-character identifier string for objects and fields
  • N-Code: an integer representation of a 3-code (Encode/Decode functions)
  • message: a single curly-bracketed AMOS message (see message grammar)
  • sub-message: a single curly-bracketed AMOS message contained by another (see message grammar)
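
To make the relationship between a 3-code and an N-code concrete, here is a minimal standalone sketch that packs three characters into an integer and unpacks them again. The names encode3 and decode3 are hypothetical, and the byte order (first character in the lowest byte) is an assumption made for illustration; the layout used by the actual AMOS Encode/Decode functions may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack a 3-character code into an integer, one byte per character.
// Assumed layout: code[0] in the lowest byte, code[2] in the highest used byte.
std::uint32_t encode3(const std::string& code) {
    assert(code.size() == 3);
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(code[0]))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[1])) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(code[2])) << 16);
}

// Recover the 3-character code from its integer form.
std::string decode3(std::uint32_t n) {
    std::string code(3, '\0');
    code[0] = static_cast<char>(n & 0xFF);
    code[1] = static_cast<char>((n >> 8) & 0xFF);
    code[2] = static_cast<char>((n >> 16) & 0xFF);
    return code;
}
```

An integer form of this kind lets object types be compared and switched on cheaply, while the 3-character string stays readable in message files.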

Relative orientation of reads/contigs (used in overlaps or scaffold links)

normal           ---a--->  ---b--->
anti-normal     <---a---  <---b---
innie            ---a---> <---b---
outie           <---a---   ---b--->
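
The four cases in the diagram are fully determined by the direction of each of the two sequences. The sketch below encodes that mapping; ToyOrientation, classify and orientationName are hypothetical names used only for this illustration and are not part of the AMOS API.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum mirroring the four rows of the diagram above.
enum class ToyOrientation { Normal, AntiNormal, Innie, Outie };

// aForward / bForward: true if the sequence points left-to-right (--->).
ToyOrientation classify(bool aForward, bool bForward) {
    if (aForward && bForward)   return ToyOrientation::Normal;      // ---a---> ---b--->
    if (!aForward && !bForward) return ToyOrientation::AntiNormal;  // <---a--- <---b---
    if (aForward && !bForward)  return ToyOrientation::Innie;       // ---a---> <---b---
    return ToyOrientation::Outie;                                   // <---a--- ---b--->
}

// Human-readable label for an orientation.
const char* orientationName(ToyOrientation o) {
    switch (o) {
        case ToyOrientation::Normal:     return "normal";
        case ToyOrientation::AntiNormal: return "anti-normal";
        case ToyOrientation::Innie:      return "innie";
        case ToyOrientation::Outie:      return "outie";
    }
    assert(false && "unreachable");
    return "";
}
```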


Dealing with AMOS message files

Reading an AMOS message from a file is as simple as:

Message_t msg;
msg.read (cin);

Note that the msg object is generic: it represents any properly formatted message (see message grammar), irrespective of the actual assembly object the message describes. This object can therefore be used to read arbitrary message files, such as those generated by Celera Assembler, even though the individual objects do not map to AMOS objects.

To assign the message contents to a specific object, e.g. a contig:

Contig_t contig;
contig.readMessage(msg);

Note that the readMessage operation will fail if the message does not properly encode an AMOS contig.

The reverse operation, writing a new message from an internal AMOS object, is just as simple:

contig.writeMessage(msg);
msg.write(cout);


Communicating with the bank

AMOS banks can be opened in two modes: for random access (bank mode), and for sequential access (bank stream mode). To open a bank you must also specify the type of the objects stored in it, by providing the N-code of the object. Thus, to open a bank of contigs:

Bank_t contig_bank(Contig_t::NCODE);
BankStream_t contig_stream(Contig_t::NCODE);

contig_bank.open("mybank.dir");
contig_stream.open("mybank.dir");

The string "mybank.dir" refers to the physical location of the bank on the disk, and represents the name of a directory that contains all the relevant bank files. In addition to the location of the bank, the open() command may specify a mode of access as B_READ, or B_WRITE, or both (B_READ|B_WRITE) (the default access is B_READ):

contig_bank.open("mybank.dir", B_READ|B_WRITE);

Bank streams can only be used for sequential access, e.g.:

Contig_t contig;
contig_stream >> contig;  // read from bank
contig_stream << contig;  // write to bank

The sequential access mode is useful for processing anonymous objects (without an assigned IID or EID), or simply for ease of use.

Random access banks can be used to perform more complex operations:

// lookup by IID
if (! contig_bank.existsIID(1))
    cerr << "Cannot find object with iid 1" << endl;

// lookup by EID
if (! contig_bank.existsEID("bigcontig"))
   cerr << "Cannot find object with eid bigcontig" << endl;

contig_bank.fetch(1, contig);  // retrieve object by IID
contig_bank.fetch("bigcontig", contig); // retrieve object by EID

contig_bank.append(contig); // add an object to the bank
contig_bank.remove(1);   // remove an object by IID
contig_bank.remove("bigcontig"); // remove an object by EID

Note that, by default, the remove command does not physically remove objects from the bank; it only marks them for deletion. To compact the bank after several remove operations you will need to run:

contig_bank.clean();
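
The difference between marking an object for deletion and compacting the store can be sketched with a toy in-memory container. ToyBank and all of its members are hypothetical and bear no relation to how Bank_t is implemented on disk; the sketch only illustrates why removed objects keep occupying space until a clean step runs, and why slot positions can change afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for a bank; not the AMOS Bank_t class.
class ToyBank {
public:
    // Append an object and return the index of the slot it occupies.
    std::size_t append(const std::string& obj) {
        slots_.push_back(Slot{obj, false});
        return slots_.size() - 1;
    }

    // Mark a slot as deleted without reclaiming its storage.
    void remove(std::size_t slot) {
        assert(slot < slots_.size());
        slots_[slot].deleted = true;
    }

    // Physically drop every slot that was marked for deletion.
    // Slot indices of the surviving objects may change as a result.
    void clean() {
        std::vector<Slot> kept;
        for (const Slot& s : slots_)
            if (!s.deleted) kept.push_back(s);
        slots_.swap(kept);
    }

    // Number of objects that are not marked for deletion.
    std::size_t liveCount() const {
        std::size_t n = 0;
        for (const Slot& s : slots_)
            if (!s.deleted) ++n;
        return n;
    }

    // Number of slots physically held, including deleted ones.
    std::size_t storedCount() const { return slots_.size(); }

private:
    struct Slot { std::string obj; bool deleted; };
    std::vector<Slot> slots_;
};
```

In this toy model, storedCount stays unchanged after remove and only shrinks after clean, which is the behavior the paragraph above describes for banks.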


Indices

There is often a need to cross-reference the various objects stored in a bank, e.g. to obtain the list of reads present in a contig, or, for a read, to identify the contig or scaffold it belongs to. Some such relationships are natively represented in the AMOS objects (e.g. contig messages also list the reads belonging to them); for others it is necessary to build lookup tables. AMOS helps you by providing a generic mechanism for specifying lookup tables linking arbitrary AMOS types. The AMOS indices are implemented using STL hash multi-maps, which allow one-to-many correspondences.

A simple example of the use of indices is shown below. The code generates a map linking each read to its mate (this information is normally contained in the Fragment_t object).

Index_t rd2mate;
rd2mate.buildReadMate("mybank"); // build index linking reads to their mates in the bank "mybank"

ID_t mate = rd2mate.lookup(5);   // find mate of read with IID=5
if (mate == NULL_ID)             // if no mate found, returns NULL_ID
   cerr << "Read 5 has no mate" << endl;

This example relies on the predefined function buildReadMate, which automatically builds an index of reads to mates. Several such predefined functions are provided; see the documentation for the Index_t object. If you need to build an index for which no predefined build function exists, you can use the insert command to add an identifier pair to the index:

Index_t obj2obj;
obj2obj.insert(id1, id2);

In case of a one-to-many mapping (e.g. all the reads in a scaffold) you can retrieve all the IDs corresponding to a query ID using:

pair<Index_t::const_iterator, Index_t::const_iterator> startend = obj2obj.lookupAll(myid);
for (Index_t::const_iterator i = startend.first; i != startend.second; ++i)
   cout << "Found id " << i->second << endl;  // each entry is a (key, value) pair
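
For readers unfamiliar with multi-maps, the lookup pattern above can be reproduced with the standard library alone. The following sketch is a minimal stand-in built on std::unordered_multimap; ToyIndex, ToyID and TOY_NULL_ID are hypothetical names, and the choice of 0 as the "not found" value is an assumption made for this example rather than a statement about the AMOS NULL_ID constant.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::uint32_t ToyID;   // hypothetical ID type
const ToyID TOY_NULL_ID = 0;   // assumed "no such ID" sentinel

// Hypothetical one-to-many index; not the AMOS Index_t class.
class ToyIndex {
public:
    // Record that 'key' maps to 'value'; a key may be inserted many times.
    void insert(ToyID key, ToyID value) {
        assert(value != TOY_NULL_ID);  // the sentinel must never be stored
        map_.insert(std::make_pair(key, value));
    }

    // Return one value for 'key', or TOY_NULL_ID if there is none.
    ToyID lookup(ToyID key) const {
        auto it = map_.find(key);
        return it == map_.end() ? TOY_NULL_ID : it->second;
    }

    // Return every value stored for 'key' (order is unspecified).
    std::vector<ToyID> lookupAll(ToyID key) const {
        std::vector<ToyID> out;
        auto range = map_.equal_range(key);
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(it->second);
        return out;
    }

private:
    std::unordered_multimap<ToyID, ToyID> map_;
};
```

The equal_range call returns the same kind of (begin, end) iterator pair that lookupAll exposes in the snippet above; copying the values into a vector is a convenience of this sketch.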