Quality-check a FASTQ file for position-dependent sequencing or quality biases.

Example:

 fastqqc.pl -tsv file.fq > file.fq.qc

[[File:Fastqqc.png]]
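The kind of per-position summary described above can be sketched in a few lines of Python. This is not the fastqqc.pl implementation, only an illustration under stated assumptions: records are strictly four lines each, qualities use a Phred offset of 33 by default (older Illumina files use 64), and the function name and output keys are hypothetical.

```python
from collections import Counter
from io import StringIO


def fastq_position_stats(handle, phred_offset=33):
    """Return one dict per read position: base percentages and mean quality.

    Assumes plain 4-line FASTQ records (no wrapped sequence lines).
    """
    counts = []  # counts[i] is a Counter of the bases seen at position i
    qsums = []   # qsums[i] is the summed quality score at position i
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        for i, (base, q) in enumerate(zip(seq, qual)):
            if i == len(counts):
                counts.append(Counter())
                qsums.append(0)
            counts[i][base.upper()] += 1
            qsums[i] += ord(q) - phred_offset
    rows = []
    for i, (c, qs) in enumerate(zip(counts, qsums)):
        n = sum(c.values())
        row = {"pos": i + 1, "Q": qs / n}
        for b in "ACGTN":
            row[b] = 100.0 * c[b] / n  # percentage of reads with base b here
        rows.append(row)
    return rows


# Two made-up reads; the third base of the second read has a lower quality.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nII5I\n")
stats = fastq_position_stats(example)
```

A composition that drifts along the read, or a mean quality that falls off toward the end, is the kind of position-dependent bias this check is meant to expose.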

== FastqQC Plot ==

<pre>
## R script for generating quality plot
## Shows the per position sequence composition (A=red, T=orange, G=green, C=cyan, N=black)
## And the per position average quality value (dashed-purple line)
prefix="s_7"
dir="./"

pdf(paste(paste(dir,prefix,sep=""),".qc.pdf",sep=""))
par(mfrow=c(2,1))

qc1pre = paste(prefix,"_1", sep="")
qc1name=paste(dir, qc1pre,"_sequence.txt.qc", sep="")
qc1 <- read.table(qc1name, header=TRUE)
qc2pre = paste(prefix,"_2", sep="")
qc2name=paste(dir, qc2pre,"_sequence.txt.qc", sep="")
qc2 <- read.table(qc2name, header=TRUE)

plot(qc1$pos, qc1$X.A, type="l", col="red", ylim=c(0,40), xlab="", ylab="", main=qc1pre, axes=FALSE, frame.plot=TRUE)
lines(qc1$pos, qc1$X.C, col="cyan")
lines(qc1$pos, qc1$X.G, col="green")
lines(qc1$pos, qc1$X.T, col="orange")
lines(qc1$pos, qc1$X.N, col="black")
lines(qc1$pos, qc1$Q, col="purple", lty=2)
abline(v=seq(0,80,10), col="grey", lty=3)
abline(h=seq(0,80,10), col="grey", lty=3)
axis(side=1, at=seq(0,80,by=10))
axis(side=2, at=seq(0,80,by=10))

plot(qc2$pos, qc2$X.A, type="l", col="red", ylim=c(0,40), xlab="", ylab="", main=qc2pre, axes=FALSE, frame.plot=TRUE)
lines(qc2$pos, qc2$X.C, col="cyan")
lines(qc2$pos, qc2$X.G, col="green")
lines(qc2$pos, qc2$X.T, col="orange")
lines(qc2$pos, qc2$X.N, col="black")
lines(qc2$pos, qc2$Q, col="purple", lty=2)
abline(v=seq(0,80,10), col="grey", lty=3)
abline(h=seq(0,80,10), col="grey", lty=3)
axis(side=1, at=seq(0,80,by=10))
axis(side=2, at=seq(0,80,by=10))

dev.off()
</pre>

Mcschatz:

The contig window of the viewer displays the multiple alignment of reads within contigs and lets one view the bases of the reads and the consensus sequence. The chromatogram signal and quality values of the reads can optionally be displayed, as can the trimmed, unassembled portion of each read. One can quickly navigate to any position in any contig, or scan contigs for regions of disagreement between the reads. Alternatively, the consensus sequence of a contig can be searched by regular expression.

[[Image:HawkeyeContigView.jpg]]

The Contig View using the SNP coloring features

Immediately below the toolbars is the consensus of the contig. A solid circle above the consensus flags a position as having a discrepancy in the tiling. Clicking in the consensus colors and sorts the reads based on their base at that position. In the screenshot above, I clicked on the C at position 11386. The first 2 reads (GDEI048TF and GDEIN20TF) are colored green because they have an A at that position, whereas the other reads have a C and are therefore colored blue. This makes it easy to see that these two reads disagree with the other reads in a consistent way. Single-clicking on a read displays the chromatogram signal for the read. The chromatogram displayed in the main window is stretched so that its peaks align with the base calls. Double-clicking on a read opens a new window with the raw chromatogram (see below).
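The partitioning behind SNP coloring can be illustrated with a small Python sketch. It is not Hawkeye's code: the tiling representation (read name mapped to a start column and a gapped sequence), the function names, the read names and the group ordering (alphabetical by base, uncovered reads last) are all assumptions made for the example.

```python
from collections import defaultdict


def partition_reads_by_base(tiling, column):
    """Group read names by the base each read shows at a consensus column.

    `tiling` maps read name -> (offset, gapped_sequence), where offset is the
    0-based consensus column at which the read starts. Reads that do not
    cover the column are grouped under the key None.
    """
    groups = defaultdict(list)
    for name, (offset, seq) in tiling.items():
        idx = column - offset
        base = seq[idx].upper() if 0 <= idx < len(seq) else None
        groups[base].append(name)
    return dict(groups)


def order_reads_by_base(tiling, column):
    """Return read names sorted so that reads sharing a base are adjacent."""
    groups = partition_reads_by_base(tiling, column)
    keys = sorted(groups, key=lambda b: (b is None, b or ""))
    return [name for k in keys for name in groups[k]]


# Hypothetical tiling: four reads placed against an 8-column consensus.
tiling = {
    "readA": (0, "ACGTAC"),
    "readB": (2, "GTCC"),
    "readC": (1, "CGTA"),
    "readD": (5, "C"),
}
groups = partition_reads_by_base(tiling, 4)
ordered = order_reads_by_base(tiling, 4)
```

Each group would then receive one color, which is what makes a consistent disagreement between sets of reads stand out.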

Below the tiling is some summary information, including the filename of the bank, how many contigs are contained in the bank, the current contig id, the length of the consensus of the current contig and the number of reads in the current contig.

== SNP Barcode ==

Hawkeye can legibly display at most 50 base pairs from a single read at a time. Therefore, the font adjustment uses semantic zooming. Once the font size drops below a threshold, the bases are no longer displayed as letters but as abstract colored rectangles matching the chromatogram colors. In addition, instead of displaying every base in every read, the view displays only those positions that disagree with the consensus. This creates a barcode-like display in which the reads' patterns of SNPs can be examined. It reduces the information load of viewing the multiple alignment and focuses attention on the most interesting positions, those that disagree. Furthermore, very large regions of up to 1000 bp can be inspected at once. SNP sorting still works in the Barcode display, so clicking in the consensus will re-sort and repartition the reads based on that position.
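The idea of showing only the disagreeing positions can be mimicked in plain text. The following Python sketch is an illustration, not Hawkeye's rendering code; it assumes the same hypothetical tiling layout of (start column, gapped sequence) per read, prints a dot where a read agrees with the consensus, and prints the read's base where it differs.

```python
def snp_barcode(consensus, tiling):
    """Return one text row per read, showing only bases that differ
    from the consensus. Agreeing positions become '.', uncovered
    columns stay blank."""
    rows = {}
    for name, (offset, seq) in tiling.items():
        row = [" "] * len(consensus)
        for i, base in enumerate(seq):
            col = offset + i
            if col >= len(consensus):
                break  # ignore bases hanging past the consensus end
            same = base.upper() == consensus[col].upper()
            row[col] = "." if same else base.upper()
        rows[name] = "".join(row)
    return rows


# Hypothetical consensus and reads, invented for this example.
consensus = "ACGTACGT"
tiling = {
    "read1": (0, "ACGTAC"),
    "read2": (2, "GTCCGT"),
    "read3": (1, "CGAAC"),
}
barcode = snp_barcode(consensus, tiling)
for name, row in barcode.items():
    print(name, row)
```

Because agreeing bases collapse to a uniform background, far more columns fit on screen than when every base is drawn as a letter.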

{|
| [[Image:HawkeyeBases.png|thumb]]
| [[Image:HawkeyeBarcodeSmall.png|thumb]]
|}

Switching from bases to SNP Barcode views.

[[Image:HawkeyeBarcode.png]]

Using the SNP barcode to inspect a very large region

== Raw Chromatograms ==

Double-clicking on a read in the tiling displays the read in a separate chromatogram window. This requires that the chromatogram is available and that the chromatogram positions have been loaded into the bank or are encoded in the trace file. Below we see the chromatogram display for read DMGLJ45TR. Unlike in the tiling window, the chromatogram displayed has not been stretched; instead, the base calls are spaced to align with the peaks. The top number is the consensus position. The next line is the consensus, and the current read's base calls are beneath it. A solid circle above the consensus flags a discrepancy in the tiling at that position (as at position 776/802). Below the read's base calls are their quality values. Below the quality values is the current sequence position (1-based gapped). Below this the chromatogram signal is displayed, and the chromatogram position is displayed at the bottom. The window is tinted so that trimmed bases appear with a dark background: the C at 785/811 is the last base in the clear range of this read.

[[Image:HawkeyeRawChromo.png]]

== Options ==

=== File Menu ===

The File menu has options for opening a bank, setting chromatogram paths, and displaying summary windows.

=== Options Menu ===

* Color Bases - Toggles whether the bases are colored red, blue, yellow and green or shown in just black and white
* SNP Coloring - Toggles whether clicking in the consensus colors the reads based on the base at that position
* Show Full Range - Toggles whether the trimmed bases are displayed. They are displayed with a red tint.
* Show Positions - Toggles whether the consensus position is written at each position or just at every 10 bp
* Show Indicator - Toggles whether the caret indicator is displayed above the consensus position
* Show Quality Values - Toggles whether the base quality values are displayed in the tiling
* Low Case Low QV - Toggles whether bases with low quality (< 30) are displayed in lowercase letters
* Highlight Discrepancies - Toggles whether bases that disagree with the consensus are highlighted in purple
* Prefetch Chromatograms - Toggles whether chromatograms are displayed without clicking in the tiling
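As an illustration of the Low Case Low QV option, the following Python sketch lowercases every base whose quality value is below 30, the cutoff given in the list above. The function name and the example data are hypothetical, and this is not Hawkeye's own code.

```python
def lowercase_low_quality(bases, quals, threshold=30):
    """Return the read with bases below the quality threshold in lowercase."""
    return "".join(
        b.lower() if q < threshold else b.upper()
        for b, q in zip(bases, quals)
    )


# Made-up base calls and quality values.
shown = lowercase_low_quality("ACGT", [40, 12, 30, 29])
print(shown)
```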

=== Toolbars ===

The position spin box displays the current position (the leftmost displayed consensus position) and also accepts input for jumping to an arbitrary position. The up and down arrows on the spin box step the view window in 1-base increments. The left and right arrows next to the spin box jump to the previous and next position with a discrepancy in the reads, relative to the current position indicated by the caret above the consensus position line.
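Jumping between discrepancies amounts to a search in a sorted list of flagged columns. The Python sketch below shows one way to do it; it is not taken from Hawkeye. It assumes a discrepancy is any column where some read differs from the consensus, and the tiling layout and function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def discrepancy_columns(consensus, tiling):
    """Return the sorted consensus columns where any read differs
    from the consensus (the assumed definition of a discrepancy)."""
    cols = set()
    for offset, seq in tiling.values():
        for i, base in enumerate(seq):
            col = offset + i
            if col < len(consensus) and base.upper() != consensus[col].upper():
                cols.add(col)
    return sorted(cols)


def next_discrepancy(cols, current):
    """First discrepancy strictly to the right of `current`, or None."""
    i = bisect_right(cols, current)
    return cols[i] if i < len(cols) else None


def prev_discrepancy(cols, current):
    """Last discrepancy strictly to the left of `current`, or None."""
    i = bisect_left(cols, current)
    return cols[i - 1] if i > 0 else None


# Hypothetical consensus and reads, invented for this example.
consensus = "ACGTACGT"
tiling = {
    "read1": (0, "ACGTAC"),
    "read2": (2, "GTCCGT"),
    "read3": (1, "CGAAC"),
}
cols = discrepancy_columns(consensus, tiling)
```

Precomputing the sorted list once per contig keeps each jump to a binary search instead of a rescan of the tiling.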

The A+ and A- buttons increase and decrease the font size and activate the semantic zooming into the SNP Barcode. The Find box accepts Qt regular expressions; the left and right arrows next to it search the consensus sequence for the expression and highlight the match in the consensus.
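A forward and backward regular expression search over a consensus string can be sketched as follows. The example uses Python's `re` module, whose syntax differs in some details from Qt regular expressions, so it only approximates what the Find box does. The function name and the consensus string are hypothetical.

```python
import re


def find_in_consensus(consensus, pattern, start=0, forward=True):
    """Return the (start, end) span of the nearest match, or None.

    Forward search looks for a match beginning at or after `start`;
    backward search returns the last match beginning before `start`.
    """
    regex = re.compile(pattern)
    if forward:
        m = regex.search(consensus, start)
        return m.span() if m else None
    hits = [m.span() for m in regex.finditer(consensus) if m.start() < start]
    return hits[-1] if hits else None


# Made-up consensus; the pattern looks for homopolymer runs of 4 or more T.
consensus = "ACGTTTTACGGATTTTTC"
first = find_in_consensus(consensus, "T{4,}")
second = find_in_consensus(consensus, "T{4,}", start=first[1])
back = find_in_consensus(consensus, "T{4,}", start=second[0], forward=False)
```

The returned span is what a viewer would highlight in the consensus line.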

2009-07-13T00:07:50Z