Hawkeye
An Interactive Visual Analytics Tool for Genome Assemblies.
Michael Schatz, Adam Phillippy, Ben Shneiderman, Steven Salzberg
Version 1.0 - March 5, 2007
Publication: Schatz, M.C., Phillippy, A.M., Shneiderman, B., Salzberg, S.L. (2007) Hawkeye: a visual analytics tool for genome assemblies. Genome Biology 8:R34.
Abstract
Genome assembly remains an inexact science. Even when accomplished with the best software available, the assembly of a genome often contains numerous errors, both small and large. Hawkeye is a visual analytics tool for genome assembly analysis and validation, designed to aid in identifying and correcting assembly errors. Hawkeye blends the best practices from information and scientific visualization to facilitate inspection of large-scale assembly data while minimizing the time needed to detect mis-assemblies and make accurate judgments of assembly quality.
All levels of the assembly data hierarchy are made accessible to users, along with summary statistics and common assembly metrics. A ranking component guides investigation towards likely mis-assemblies or interesting features to support the task at hand. Wherever possible, high-level overviews, dynamic filtering, and automated clustering are leveraged to focus attention and highlight anomalies in the data. Hawkeye's effectiveness has been proven on several genome projects, where it has been used both to improve quality and to validate the correctness of complex genomes.
Hawkeye is compatible with most widely used assemblers, including Phrap, ARACHNE, Celera Assembler, Newbler, AMOS, and assemblies deposited in the NCBI Assembly Archive.
- Click for a presentation on AMOS Assembly Validation and Visualization. [1.4MB]
- Click for a recorded demonstration of using Hawkeye to analyze a mis-assembly. [2.2MB]
Build & Installation
Hawkeye comes in source form with the AMOS distribution. You should build the entire AMOS distribution even if you only want to run Hawkeye so all of the necessary convertors and libraries are available. You can download the AMOS source package from: http://sourceforge.net/project/showfiles.php?group_id=134326.
Hawkeye requires Qt 3.x to be installed. The latest version of Qt is currently 3.3.6 and can be downloaded from the Trolltech website for Unix and Mac OS X: http://www.trolltech.com/products/qt/downloads. Many Linux distributions come with the Qt runtime libraries by default, but do not have the development package installed. You must install both the runtime libraries and the development package (header files) to build Hawkeye. Cygwin (Windows) is also supported by following the directions in the INSTALL file in the AMOS source. Qt 4.x is not supported at this time.
The general build process is to run './configure; make; make install' in the AMOS source directory. You may need to explicitly specify the Qt directories to configure when building AMOS with the following options:
$ configure --help
<snip>
  --with-Qt-dir=DIR           DIR is equal to QTDIR if you have followed the
                              installation instructions of Trolltech. Header
                              files are in DIR/include, binary utilities are
                              in DIR/bin and the library is in DIR/lib. Use
                              the options below to override these defaults
  --with-Qt-include-dir=DIR
  --with-Qt-bin-dir=DIR
  --with-Qt-lib-dir=DIR
  --with-Qt-lib=LIB           Use -lLIB to link the Qt library
More information is available in the INSTALL file within the AMOS tarball.
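As a minimal sketch of passing the Qt location to configure, the helper below prints the command line rather than running it. Only the --with-Qt-dir option comes from the help text above; the function name amos_configure_cmd and the example prefix /usr/lib/qt3 are hypothetical and should be replaced with your own Qt 3 install location.

```shell
# Sketch: print the ./configure command for AMOS given a Qt 3 install prefix.
# Falls back to $QTDIR when no argument is given, and to a plain ./configure
# when neither is set. The function name is illustrative, not part of AMOS.
amos_configure_cmd() {
    qt_dir=${1:-${QTDIR:-}}
    if [ -n "$qt_dir" ]; then
        printf './configure --with-Qt-dir=%s\n' "$qt_dir"
    else
        printf './configure\n'
    fi
}

# Hypothetical prefix; check the printed command before running it.
amos_configure_cmd /usr/lib/qt3
```

Printing the command first makes it easy to confirm the Qt path before starting a full AMOS build.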
Launching Hawkeye
Hawkeye reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of other assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact. toAmos reads the assembly files and converts them to plaintext AMOS messages, and bank-transact reads those messages and creates the binary encoded bank directory. See the AMOS Assembly Conversion Page for more information.
For example:
$ toAmos -f human.frg -a human.asm -o - | bank-transact -m - -b human.bnk -c
Creates the bank human.bnk from the files human.frg and human.asm, which are the input and output files for the Celera Assembler.
$ toAmos -ace human.ace -o - | bank-transact -m - -b human.bnk -c
Creates the bank human.bnk from an ACE file, which is an output format for many assemblers including Phrap, ARACHNE, and Newbler. Check your assembler's documentation for more information on creating ACE files. Note that the ACE file contains all of the sequence information, so it is not necessary to import the FASTA files separately. More information on converting to AMOS is available in the toAmos documentation.
$ tarchive2amos -o human -assembly ASSEMBLY.xml TRACEINFO.seq
$ bank-transact -m human.afg -b human.bnk -c
Creates the bank human.bnk from an assembly archive XML file called ASSEMBLY.xml. Note that all of the read FASTA files should be concatenated into a single TRACEINFO.seq file, the read quality files should be concatenated into a single TRACEINFO.qual file, and the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.
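The conversions above differ mainly in the input format, so the choice can be sketched as a small dispatcher that prints the pipeline for a given file. The toAmos and bank-transact invocations are copied from the examples above; the function name bank_build_cmd, the dispatch on file extension, and the assumption that a Celera .asm file has a .frg file with the same prefix are illustrative only. The tarchive2amos case is left out because it needs several input files.

```shell
# Sketch: print the bank-building command for one assembly file.
# Assumes the bank should be named after the input file's prefix.
bank_build_cmd() {
    input=$1
    prefix=${input%.*}    # strip the final extension, e.g. human.ace -> human
    case $input in
        *.ace)
            printf 'toAmos -ace %s -o - | bank-transact -m - -b %s.bnk -c\n' \
                "$input" "$prefix" ;;
        *.asm)
            # Assumption: the matching .frg file shares the same prefix.
            printf 'toAmos -f %s.frg -a %s -o - | bank-transact -m - -b %s.bnk -c\n' \
                "$prefix" "$input" "$prefix" ;;
        *.afg)
            printf 'bank-transact -m %s -b %s.bnk -c\n' "$input" "$prefix" ;;
        *)
            printf 'unsupported input: %s\n' "$input" >&2
            return 1 ;;
    esac
}

bank_build_cmd human.ace
```

The function only prints the command, so it can be reviewed and then pasted into the shell.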
Once the bank has been built, launch the viewer by running hawkeye on the bank directory. This will open your assembly to the Hawkeye Launch Pad where you can see an overview of your assembly and select scaffolds or contigs for closer investigation:
$ hawkeye human.bnk
Command Line Options
The options available are listed by specifying -h.
$ hawkeye -h
Usage: hawkeye [options] [bankname [contigid [position]]]
Options:
  -c <path>  Add a chromatogram path
  -D <DB>    Set the chromatogram DB
  -T         Enable Trace Fetch cmd
  -p <port>  Initialize Server on this port
  -K <kmer>  Load File of kmers
  -h         Display this help
A typical execution will be "hawkeye prefix.bnk", which loads the assembly from the bank named prefix.bnk. Specifying a path with -c sets a location where the viewer looks for the project's chromatograms. You may set multiple paths, and hawkeye will search each one. Similarly, the -d option also specifies locations for the chromatograms, but this is for "TIGR style" naming schemes to be used in conjunction with the -D option. Work is under way to simplify chromatogram access. The -p option sets a TCP port on which Hawkeye accepts commands, which is especially useful for integration with mummerplot.
Note that to view chromatograms within the viewer you need both the chromatograms themselves and the chromatogram positions, either in the bank or in the trace files. The chromatogram positions are the positions of the peaks in the traces where the base calls were made. They can be loaded into the bank with "updateBankPositions bankname posfile", where bankname is the name of the bank and posfile is a file encoding the positions for each read. Some trace file formats, such as SCF, have the positions encoded within, in which case it is not necessary to load the positions into the bank.
If your reads are in the trace archive, and you set the name (EID) of the reads to be ti numbers, then you can fetch the traces and chromatogram positions on the fly from the trace archive by enabling the Trace Fetch Command (-T). With this enabled, Hawkeye will execute the following system command to load a trace:
$ curl "http://www.ncbi.nlm.nih.gov/Traces/previous/trace.fcgi?\
cmd=java&j=scf&val=%EID%&ti=%EID%" -s -o %TRACECACHE%/%EID%
Note that you need to have curl installed in your current path. If your organization has its own trace server, you can replace this command with one for your organization by modifying AMOS/src/bankViewer/DataStore.cc. More information on the Trace Fetch Command is coming soon.
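To see what the Trace Fetch Command looks like for a single read, the sketch below expands the %EID% and %TRACECACHE% placeholders with sed. The template is copied from the command above; the function name expand_trace_cmd, the read ID 12345, and the cache directory /tmp/traces are hypothetical, and this is an illustration of the substitution rather than Hawkeye's own implementation.

```shell
# Sketch: expand the placeholders of the Trace Fetch Command for one read.
TRACE_FETCH_TEMPLATE='curl "http://www.ncbi.nlm.nih.gov/Traces/previous/trace.fcgi?cmd=java&j=scf&val=%EID%&ti=%EID%" -s -o %TRACECACHE%/%EID%'

expand_trace_cmd() {
    eid=$1      # read name (EID), expected to be a ti number
    cache=$2    # directory where fetched traces would be stored
    # Assumes neither value contains '|', '&' or '\', which sed treats specially.
    printf '%s\n' "$TRACE_FETCH_TEMPLATE" |
        sed -e "s|%EID%|$eid|g" -e "s|%TRACECACHE%|$cache|g"
}

# Hypothetical read ID and cache directory; prints the command without running it.
expand_trace_cmd 12345 /tmp/traces
```

Running the printed command by hand is a quick way to check that curl can reach the trace server before launching the viewer with -T.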
Hawkeye can display k-mer coverage in addition to read or insert coverage in the coverage plot region. To do so, you must pre-compute the k-mer counts in your assembly. AMOS comes bundled with a tool, 'count-kmers', that can be used for this purpose. A typical execution is to count the occurrences of k-mers (k=22) in your reads and plot those values. A sufficiently long k-mer should be unique in your genome, so the average k-mer coverage indicates the depth of read coverage, and spikes in k-mer coverage indicate repetitive regions. The counts are computed and displayed as follows:
$ count-kmers -r human.bnk > human.22mers
$ hawkeye -K human.22mers human.bnk
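As a toy illustration of why repeats appear as spikes in k-mer coverage, the sketch below counts k-mers in a single short sequence using awk. It is not how count-kmers is implemented and does not produce its file format; the function name count_kmers_toy and the example sequence are made up for the illustration.

```shell
# Toy illustration: count every k-mer in one sequence and list the most
# frequent first. Real data should be counted with count-kmers instead.
count_kmers_toy() {
    seq=$1
    k=$2
    printf '%s\n' "$seq" | awk -v k="$k" '{
        # Slide a window of length k across the sequence.
        for (i = 1; i + k - 1 <= length($0); i++)
            count[substr($0, i, k)]++
        for (kmer in count)
            print count[kmer], kmer
    }' | sort -k1,1nr -k2,2
}

# ACGT occurs twice in this made-up sequence; every other 4-mer occurs once.
count_kmers_toy ACGTACGTTT 4
```

The repeated 4-mer ACGT is listed first with a count of 2, which is the small-scale analogue of a spike over a repetitive region.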
Sample Assembly
A sample assembly is available here: gb6.small.afg.bz2 (4.6MB)
The assembly is a small selection from Bacillus anthracis consisting of 4 small scaffolds of 11 contigs created from 6249 reads. It is distributed as a compressed AMOS message file. Download it, and then view it as follows:
$ bunzip2 gb6.small.afg.bz2
$ bank-transact -m gb6.small.afg -b gb6.small.bnk -c
$ hawkeye -T gb6.small.bnk
Specifying -T enables the trace fetch command so that traces can be viewed on-the-fly from the NCBI trace archive. See Command Line Options for more information. See the Launch Pad documentation for a description of how to navigate this assembly.
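Before working through the sample, it can help to confirm that the tools it relies on are on your PATH. The sketch below is one way to do that; the function name missing_tools is hypothetical, and the tool list simply collects the commands used in this section plus curl, which the trace fetch command needs.

```shell
# Sketch: print each named command that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands used by the sample walkthrough; curl is needed for -T.
missing=$(missing_tools bunzip2 bank-transact hawkeye curl)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi
```

If nothing is printed, all four commands were found.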
Acknowledgements
This work was supported in part by NIH award R01-LM06845, the National Institute of Allergy and Infectious Diseases under contract NIH-NIAID-DMID-04-34, HHSN266200400038C, and DHS/HSARPA award W81XWH-05-2-0051 to SLS.