Hawkeye Scaffold View
The scaffold view of the assembly shows how the contigs and inserts are placed on the scaffold. It uses the mate-pair relationships and library sizes to categorize the "happiness" of each insert, that is, whether the paired reads are correctly oriented and at the expected distance apart. The threshold distance for a "happy" insert can be adjusted by setting the maximum number of standard deviations by which an insert size may differ from the library mean. Details on any object displayed in the Insert view can be found by clicking on it. The mate of any unhappy insert is highlighted by right-clicking on the read.
Along with inserts and contigs, the view also plots the read and insert coverage at each position along the scaffold, and a measure of the overall happiness of the inserts called the CE statistic. The viewer also highlights the location of arbitrary features along the scaffold. This functionality is currently used to highlight clues of mis-assembly, such as regions where the assembly has a high coverage of unhappy inserts, or regions with a high density of correlated SNPs. Both are strong evidence of mis-assembly, and their combination at one location is nearly conclusive evidence.
The view is divided into three regions. Along the right are the control panel (f in the picture) and the details panel (g), which allow users to filter and set parameters of the display and to see the details of selected objects. Along the bottom left (e) is an overview of the entire scaffold, showing the contig placement and features. The rest of the display (a-d), the main display, shows the placement of the contigs (b), features (c), and inserts (d), and the statistical levels (a) of the currently selected region of the scaffold. The range slider below the overview and the magnifying glass tools allow users to select the region of interest. The display is interactive, and the details of every object are available on demand by clicking on the object.
The combination of evidence displayed in the scaffold view makes it possible to quickly identify mis-assemblies. Consider the region displayed below, where happy mates have been hidden and k-mer coverage is plotted, but the display otherwise uses default parameters.
Our analysis begins at the cluster of yellow compressed inserts. A single compressed mate is not unusual, since inserts sample the library distribution, but this is an unusually large cluster. Similarly, the cluster of singleton mates (purple) below the compression is unusually large. Moving up, the small red features indicate that there are multiple correlated SNPs in that same region, even though this is a haploid organism. Further investigation in the Hawkeye Contig View should be used to confirm these are not chance correlations, but given the low background distribution of correlations, we can assume this is most likely due to mis-assembly. The bright white in the read coverage heat map indicates this is the highest read coverage in the scaffold. The CE statistic indicates a very strong compression, and at -6 is well below the threshold. Finally, the spike in k-mer coverage in yellow at the top of the plot indicates this is a complicated repeat region. Every mis-assembly characteristic has been met, and we can conclude undeniably that this is a mis-assembly.
In contrast, note that the repeat (high k-mer coverage) on the left of the plot has only 2 compressed mates, no correlated SNPs, and even coverage. This demonstrates the difficulty in understanding mis-assemblies: while nearly every mis-assembly occurs in a repeat, not every repeat is mis-assembled, and the presence of an individual mis-assembly clue, such as a single compressed mate, is inconclusive. It is only the combination of evidence that allows one to prove mis-assembly.
A Mis-assembled region displayed in the Scaffold View
The main display (a-d) shows the contigs, inserts, features, and statistical information for a region of the scaffold. The scaffold of contigs (b) is represented as rectangles appropriately spaced and sized. The color of each rectangle indicates whether the contig is oriented forward (blue) or reverse (dark blue) in the scaffold. Immediately beneath the contigs are two heatmaps. The first, in purple, highlights regions where the insert coverage is exceptionally high or low. Similarly, the green track highlights regions of high and low read coverage.
Features
Beneath the heatmaps are the feature tracks. Features are regions of scaffolds or contigs that have been selected for having interesting properties. AMOS comes bundled with tools for computing mis-assembly type features, but arbitrary features of any type can be loaded from a tab-delimited file using loadFeatures.
The currently available feature types are as follows:
|These regions have a high density of correlated SNPs, where multiple reads agree with each other but disagree with the consensus at multiple locations. This occurs for biological reasons, such as in assemblies of diploid organisms, but also (and more commonly) because of collapsed repeats in the assembly. These regions are computed by the AMOS tools analyzeSNPs and clusterSNPs.|
|These regions mark the locations of surrogate unitigs created by Celera Assembler, which are regions the assembler computes to be likely repetitive sequence and are often sources of mis-assembly.|
|These regions were marked by the AMOS tool asmQC, which evaluates the happiness of mate-pairs in the assembly. More information is available in the amosvalidate documentation.|
|These regions indicate where several singleton reads break their alignment. This can occur because of mis-assembly, where the singleton reads span the junction between the mis-assembled copies.|
Coverage and CE Statistic
Above the contigs (a) are two plots: coverage levels (top) and the CE statistic (bottom). The purple line in the coverage plot indicates the insert coverage, and the green line indicates the read coverage. Mean values in the current scaffold are displayed as dashed lines. If Hawkeye is loaded with -K, k-mer coverage will be plotted in yellow. See Command Line Options for more information.
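As an illustration of what the coverage plot represents, the following is a minimal sketch of computing per-position coverage and its scaffold-wide mean (the dashed line) from a list of placed intervals. It is not taken from the Hawkeye source; the function names, the half-open coordinate convention, and the toy coordinates are all assumptions made for the example.

```python
from typing import List, Tuple


def coverage_profile(intervals: List[Tuple[int, int]], length: int) -> List[int]:
    """Per-position depth for half-open intervals [start, end) on a scaffold.

    Hypothetical helper: uses a difference array so the cost is linear in the
    number of intervals plus the scaffold length.
    """
    diff = [0] * (length + 1)
    for start, end in intervals:
        start = max(start, 0)      # clip to the scaffold boundaries
        end = min(end, length)
        if start < end:
            diff[start] += 1       # depth rises where the interval begins
            diff[end] -= 1         # and falls just past its last base
    depth, running = [], 0
    for delta in diff[:length]:
        running += delta
        depth.append(running)
    return depth


def mean_coverage(depth: List[int]) -> float:
    """Scaffold-wide mean depth (the value a dashed line would mark)."""
    return sum(depth) / len(depth) if depth else 0.0


# Toy data: three reads and one insert on a 10 bp scaffold (made-up coordinates).
reads = [(0, 4), (2, 6), (5, 9)]
inserts = [(0, 9)]
read_cov = coverage_profile(reads, 10)
insert_cov = coverage_profile(inserts, 10)
```

The same routine serves both tracks: passing read intervals gives read coverage, and passing whole-insert intervals (from the start of one read to the end of its mate) gives insert coverage.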
Beneath the coverage plots is a plot of the CE statistic at that point (green and red lines). The CE statistic computes the level of compression or expansion for the inserts spanning a particular position. Values near zero indicate no deviation, large negative values (< -3) indicate statistically unlikely compression, and large positive values (> 3) indicate statistically unlikely expansion, so the statistic flags both compression and expansion type mis-assemblies. The value is computed on a per-library basis, so there will be as many plots as libraries represented in the scaffold. See the Control Panel Library menu for the legend of library colors. A manuscript describing the CE statistic in more detail is in preparation.
Below the feature tracks are the inserts in the scaffold. If possible, the mate-pairs are drawn connected by a thin line, while in all cases the thick rectangle indicates the position of the read. By default, the inserts are colored and partitioned categorically based on their mate happiness. There are 7 happiness levels:
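The exact formula is not given here, so the following is only a sketch under a stated assumption: that the CE statistic at a position is the z-score of the mean size of the inserts spanning that position, measured against the library mean and scaled by the standard error. The function names and the example numbers are hypothetical; the cutoff of 3 is the one described above.

```python
import math
from typing import List, Optional


def ce_statistic(local_sizes: List[float], lib_mean: float,
                 lib_sd: float) -> Optional[float]:
    """Assumed form: (local mean - library mean) / (library sd / sqrt(n)).

    local_sizes holds the sizes of the inserts, from one library, that span
    the position of interest. Returns None when nothing can be computed.
    """
    n = len(local_sizes)
    if n == 0 or lib_sd <= 0:
        return None
    local_mean = sum(local_sizes) / n
    return (local_mean - lib_mean) / (lib_sd / math.sqrt(n))


def flag(ce: Optional[float], threshold: float = 3.0) -> str:
    """Label a CE value using the +/-3 cutoff described in the text."""
    if ce is None:
        return "no data"
    if ce <= -threshold:
        return "compression"
    if ce >= threshold:
        return "expansion"
    return "ok"


# Made-up example: a 3000 +/- 300 bp library, four spanning inserts near 2500 bp.
ce = ce_statistic([2400, 2500, 2600, 2500], lib_mean=3000, lib_sd=300)
label = flag(ce)
```

Under this assumption, a single short insert barely moves the statistic, while a cluster of consistently short inserts drives it strongly negative, which matches the idea that only unusually large clusters of compressed mates are suspicious.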
|These inserts are happy in terms of both orientation and distance.|
|These inserts have the correct orientation, but are further apart than expected based on the library distribution and happiness threshold.|
|These inserts have the correct orientation, but are closer than expected based on the library distribution and happiness threshold.|
|These inserts have an invalid orientation, with the reads either both pointing in the same direction or pointing away from each other. (Note that the reverse is true for transposon reads, and Hawkeye will accommodate this.)|
|This means a read is present in the scaffold, but its mate is in some other scaffold. The thin line indicates the expected position of the mate.|
|This means a read is present in the scaffold, but its mate is a singleton and not in any contig or scaffold. The thin line indicates the expected position of the mate.|
|This means the read has no mate provided. The reason is unknown, but usually the read is a directed closure read, comes from 454 sequencing data, or has a failed mate.|
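The seven levels above can be read as a decision procedure. The sketch below is one way to express it; it is not Hawkeye's implementation. The `Read` type, the category labels, and the order of the checks are assumptions for illustration, and the reversed orientation rule for transposon reads is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    scaffold: Optional[str]  # None means the read is a singleton (not placed)
    start: int               # leftmost coordinate on the scaffold
    end: int                 # rightmost coordinate on the scaffold
    forward: bool            # True if the read points left-to-right


def classify_insert(read: Read, mate: Optional[Read], lib_mean: float,
                    lib_sd: float, happy_distance: float) -> str:
    """Return a hypothetical label for one of the seven happiness levels.

    happy_distance is the allowed number of standard deviations from the
    library mean; lib_sd is assumed to be positive.
    """
    if mate is None:
        return "unmated"                 # no mate provided at all
    if mate.scaffold is None:
        return "singleton mate"          # mate is not in any contig or scaffold
    if mate.scaffold != read.scaffold:
        return "linking"                 # mate lives in another scaffold
    left, right = sorted((read, mate), key=lambda r: r.start)
    if not (left.forward and not right.forward):
        return "misoriented"             # same direction, or pointing apart
    size = max(read.end, mate.end) - min(read.start, mate.start)
    deviation = (size - lib_mean) / lib_sd
    if deviation > happy_distance:
        return "expanded"
    if deviation < -happy_distance:
        return "compressed"
    return "happy"


# Made-up example: a 3000 +/- 300 bp library with a threshold of 2 deviations.
anchor = Read("scf1", 0, 100, True)
result = classify_insert(anchor, Read("scf1", 1900, 2000, False), 3000, 300, 2.0)
```

Checking placement before orientation and distance matters: the insert size is only meaningful when both reads sit in the same scaffold.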
The Overview panel (e) shows the entire scaffold and features. The background is tinted to highlight the currently visible region in the main display. It is aligned with the range slider beneath for selecting regions to display. Clicking in the panel recenters the display on the click point.
The top of the control panel (f) controls the pointer. The arrow allows a user to select objects and see their details in the details box (g). Right-clicking an insert with the arrow highlights the mate if it is within the current scaffold, and control-clicking a read jumps to the mate even if the mate is in a different scaffold. This allows one to follow a chain of mates between scaffolds. The magnifying glass tools allow one to zoom in or out of the main display.
The queries box allows one to control the main display. The Search box allows one to find any object by a regular expression on the name (eid or iid) of the object. The Happy Distance sets the maximum number of standard deviations from the mean an insert size may be and still be classified as happy.
Next is a box for the features. Each predefined feature type has a checkbox and a slider. The checkbox controls whether that feature type is displayed, and the slider controls how severe a feature has to be to be displayed. By default, all features are displayed, but the sliders can be used to show only regions with extreme insert or read coverage, for example. The colors of the feature sliders act as a legend for the features in the main display.
Next is a series of toggles for the Mate Types, each of which controls whether that type is displayed. For example, happy mates are often the least informative, so they can be hidden.
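To make the Search behaviour concrete, here is a small sketch of matching a regular expression against either the eid or the iid of each object. It uses Python's `re` syntax, which may differ from the regular expression dialect Hawkeye accepts; the record layout and the example names are invented for illustration.

```python
import re
from typing import Dict, List


def search_objects(objects: List[Dict], pattern: str) -> List[Dict]:
    """Return the objects whose eid or iid matches the pattern anywhere.

    Hypothetical record layout: each object is a dict with a string 'eid'
    (external name) and an integer 'iid' (internal id).
    """
    rx = re.compile(pattern)
    return [obj for obj in objects
            if rx.search(obj["eid"]) or rx.search(str(obj["iid"]))]


# Invented example records.
objects = [
    {"eid": "lib1_read0012.b1", "iid": 12},
    {"eid": "lib1_read0012.g1", "iid": 13},
    {"eid": "lib2_read0400.b1", "iid": 400},
]

forward_reads = search_objects(objects, r"\.b1$")    # names ending in .b1
by_iid = search_objects(objects, r"^13$")            # exact iid 13
```

Anchoring the pattern (with `^` and `$`) is useful when searching by iid, since an unanchored `13` would also match any name that merely contains those digits.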
Below the mate types are a series of toggles. They are as follows:
|Coverage Stat||Toggles the display of the coverage level plot.|
|CE Statistic||Toggles the display of the CE Statistic value.|
|Connect Mates||Toggles whether mates are drawn connected or as separate reads.|
|Partition Types||Toggles whether the inserts are partitioned according to their happiness or drawn as compactly as possible.|
|Tint Partition||Toggles whether a tinted rectangle is drawn behind the different happiness groups, which is especially useful if the mates are not colored according to happiness.|
Next to the display toggles is a radio button group controlling how the mates are colored as follows:
|Mates are colored according to their happiness level.|
|Continuous||Happy, compressed, and expanded mates are colored as a function of their deviation from the mean. Insert sizes near the mean will be colored to blend with the background, while large deviations are drawn increasingly bright. This can be more sensitive than categorical coloring for finding clusters of slightly compressed or expanded mates.|
|Linking||Linking mates are colored according to the contig they link to. Note: only 14 colors are used, so some colors may repeat if there is a large number of linked contigs. In general, though, the colors will be locally accurate. Click for details to verify.|
|Library||Inserts are colored according to their parent library. The legend of colors is displayed below.|
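Two of the coloring modes above lend themselves to a short sketch. The actual color mapping used by Hawkeye is not documented here, so both functions are assumptions: the continuous mode is modeled as a brightness that grows linearly with the deviation and saturates at a hypothetical cutoff, and the linking mode is modeled as a contig id taken modulo the 14 available colors.

```python
PALETTE_SIZE = 14  # the text states that only 14 linking colors are used


def continuous_brightness(insert_size: float, lib_mean: float, lib_sd: float,
                          saturate_at: float = 4.0) -> float:
    """Brightness in [0, 1]: 0 blends with the background, 1 is brightest.

    saturate_at is a hypothetical number of standard deviations beyond which
    the color stops getting brighter.
    """
    if lib_sd <= 0:
        return 0.0
    deviation = abs(insert_size - lib_mean) / lib_sd
    return min(deviation / saturate_at, 1.0)


def linking_color_index(contig_iid: int) -> int:
    """Assumed palette assignment: contig ids wrap around the 14 colors."""
    return contig_iid % PALETTE_SIZE


# Made-up example: a 3000 +/- 300 bp library.
at_mean = continuous_brightness(3000, 3000, 300)
two_sd = continuous_brightness(3600, 3000, 300)
far_out = continuous_brightness(6000, 3000, 300)
```

With any fixed palette, two different contigs can end up with the same color (here, contigs 3 and 17), which is why clicking an insert for details is the reliable way to confirm which contig it links to.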
The final region of the control panel is a legend for the libraries. Each library is listed by iid along with its mean and standard deviation. The color code is represented by a sample insert, and the same colors are also used for the CE statistic plot.