Jump to: navigation, search

Automated assembly validation pipeline.

Adam Phillippy, Michael Schatz, Mihai Pop Center for Bioinformatics and Computational Biology, University of Maryland

Publication: Genome assembly forensics: finding the elusive mis-assembly. Phillippy AM, Schatz MC, Pop M. Genome Biol. 2008;9(3):R55.


Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained a largely manual and expensive process, and consequently, many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large scale assembly quality.

Our automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic as other validation approaches have tried, the power of our approach comes from leveraging multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify (with high confidence) regions that are mis-assembled.

Related tools:

Running amosvalidate

amosvalidate reads the assembly data from an AMOS bank. A bank is a special directory of binary encoded files containing all information on an assembly. A bank is created by the AMOS assemblers directly, or by converting the results of others assemblers into AMOS format. This is typically done with the tools toAmos and bank-transact. toAmos reads the assembly files and converts them to plaintext AMOS message formats, and bank-transact reads those messages and creates the binary encoded bank directory. See the AMOS Assembly Conversion Page for more information.

For example:

$ toAmos -f assembly.frg -a assembly.asm -o - | bank-transact -m - -o assembly.bnk -c

Creates the bank assembly.bnk from the files assembly.frg and assembly.asm, which are the input and output files for the Celera Assembler.

$ toAmos -ace assembly.ace -o - | bank-transact -m - -o assembly.bnk -c

Creates the bank assembly.bnk from an ace file, which is an output format for many assemblers including Phrap, Arachne, Velvet and Newbler. Check your assembler's documentation for more information on creating ACE files. More information on converting to AMOS is available in the toAmos documentation.

$ tarchive2amos -o assembly -assembly ASSEMBLY.xml TRACEINFO.seq;
$ bank-transact -m assembly.afg -b assembly.bnk -c

Creates the bank assembly.bnk from an assembly archive XML file called ASSEMBLY.xml. Note all of the read fasta files should be concatentated into a single TRACEINFO.seq file, and the read qualities files should be concatenated into a single TRACEINFO.qual file, and the TRACEINFO.xml file should be present as well. More information is available in the tarchive2amos documentation.

Once the bank has been built, launch the analysis by typing:

$ amosvalidate assembly.bnk

If the assembler you used does not record the clear range, you'll very likely run into an error. In this case, re-run amosvalidate without using clear range information

$ amosvalidate assembly.bnk -D CLEAR_RANGE=0

After the validation completes, the mis-assembly features will be loaded into the bank and present in the files assembly.all.feat and assembly.suspicious.feat. These features can be viewed in Hawkeye by typing:

$ hawkeye assembly.bnk

Matepair Happiness

Matepairs from a double barreled shotgun sequencing library should be oriented towards each other, and their distance apart in the assembly should match the library's size distribution. The tool asmQC looks for regions where multiple matepairs are mis-oriented or the insert coverage is low. Both can indicate the assembly has a rearrangement mis-assembly. The tool cestat-cov computes a per-library statistic called the CE statistic at every position in the assembly. The CE statistics indicates how well the mates spanning a positing match the library's distribution. If the mates are consistently closer than expected at a given position, as would occur in a collapsed repeat or excision from the assembly, the statistic will have a large negative value (ce < -4). If the inserts are consistently larger than expected, such as from a repeat copy number expansion or other insertion event, the statistic will have a large positive value (ce > 4)

cestat-cov output file: asm.ce.feat

Record of positions in the assembly with unusual CE statistic (|ce| > 4).

Description of columns in file:

1. Contig ID
2. MATEPAIR Feature Type
3. range start
4. range end
5. Library ID

asmQC output is written directly to the bank, but features can be extracted with bank-report

Correlated SNP Detection

Correlated SNPs are positions in the genome where most of the reads are one base, but multiple other reads have another base. Unlike sequencing errors that occur at random, these correlated discrepancies can indicate the presence of a mis-assembly. In a haploid bacterial genome, for example, correlated SNPs nearly always indicate 2 copies of a near identical repeat have been collapsed into a single copy. In diploid or polyploid genomes, these can indicate a collapsed repeat, or positions where the homologous chromosomes disagree. If the frequency is higher than expected biologically, it is strong evidence for a collapsed repeat.

analyzeSNPs output file: asm.snps

analyzeSNPs finds all positions in the multiple alignment that the reads disagree. By default, it only reports positions where there are 2 or more reads that disagree with the consensus (but agree with each other) and the sum of their quality values is at least 40

Description of columns in file:

1. Contig ID
2. Gapped position
3. Ungapped position
4. Consensus
5. Depth of coverage
6. Number of reads that disagree with the consensus
7. X(N) X=base1, N=number of reads that have base1
8. {R1,R2,RN} Read ids that have base X
9. Y(N) Y=base2, N=number of reads that have base2
10. {R1, R2, RN} Read ids that have base Y

clusterSNPs output file: asm.snp.feat

clusterSNPs scans the SNPs report generated by analyzeSNPs to find regions that have a high frequency of SNPs. By default, it reports all regions with at least 2 columns within at most 500bp of each other as found by analyzeSNPs.

Description of columns in file:

1. Contig ID
2. SNP Feature Type (P)
3. range start
4. range end
5. The number of SNPs
6. The average distance between SNPs

Read Coverage

If the libraries have been constructed using a random shearing process, the reads should uniformily cover the genome at the average depth of coverage. Regions where the coverage is deeper than expected can indicate a collapsed repeat.

analyze-read-depth output file: asm.depth.feat

By default, analyze-read-depth reports regions that are 3x deeper than the average coverage. Positions within 1000bp of each other are clustered together.

Description of columns in file:

1. Contig ID
2. Coverage Feature Type (D)
3. range start
4. range end
5. Maximum depth of coverage in this range

Singleton Breakpoint Analysis

After an assembly is complete, there can be reads left over, called singletons, that are not placed in the assembly. These reads are often from contaminating DNA or otherwise low quality sequence and can be safely ignored. However, some types of mis-assemblies can cause singletons where a portion of the read will align well to the contig but the rest of the read past the mis-assembly junction does not. If there are multiple reads that all follow the same pattern of partially aligning until the same position, this is strong evidence for mis-assembly.

listReadPlacedStatus output file: asm.singletons

listReadPlacedStatus can report which contig(s) a read is placed into, but in the pipeline simply lists which reads are singletons.

casm-breaks output file: asm.break.fea

The singleton reads are then aligned to the consensus sequences of the contigs and then analyzed for shared breakpoints. casm-breaks reports positions where there are multiple reads that all have the same breakpoint pattern. Unlike some of the other pipeline tools, casm-breaks writes an XML like message file.

File Format:

{FEA 	Feature message
typ:B 	Breakpoint feature
src:N,CTG 	The breakpoint occurs in contig N
com: <string> 	string linking all of the breakpoint features for a set of reads
clr:X,Y 	Range the contig where the read aligns
} 	End of feature

Repeat K-mer Analysis

Almost all mis-assemblies are caused by repeats, and thus it can be useful to find the locations of the repeats in an assembly. Furthermore, it is very interesting to find the locations of collapsed or expanded repeats. We developed a new metric, called normalized k-mer analysis, that can discover collapsed or expanded repeats. A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled. For example, the number of times the k-mers across a 2 copy repeat will be present in K_r is 2 * the depth of coverage. If the 2-copy repeat occurs in 2-copies in the assembly, then those kmers will all be present twice in K_c, and K* will be equal to the depth of coverage. If, however, the repeat was collapsed and occurs only once, then K_c will be 1 across the repeat, and K* will be equal to 2*the average depth of coverage.

count-kmers output file: asm.22.n22mers

count-kmers can count k-mers of arbitrary length in the reads or contig consensus sequences, and it can compute normalized k-mers. In the forensics pipeline, it computes normalized k-mers where k=22 and the number of occurrences is at least 22 (approximately 3 * the standard depth of coverage, 8). File format (N is the normalized k-mer count for a kmer sequence): >N kmersequence

kmer-cov output file: asm.nkmer.feat

kmer-cov maps the k-mer coverage across a sequence. In the forensics pipeline it reports regions at least 1000bp long covered by high frequency normalized kmers, i.e., the collapsed repeats in the genome. Description of columns in file:

1. Contig ID
2. Coverage Feature Type (K)
3. range start
4. range end
5. Length of region

Feature Combiner

The above metrics can find many different types of mis-assemblies, but each is limited in type of mis-assembly it can find. Furthermore, normal statistical variation may introduce false positives in the analysis. For example, flagging every insert mate whose size is less than 2 standard deviations from the library mean will flag about 2.5% of the inserts even though the vast majority are correct. Instead we use a feature combiner to collect all of the evidence for a mis-assembly and output regions with multiple mis-assembly features present at the same region. This allows one to focus their attention on the regions that are most likely to be mis-assemblied.

All of the features are loaded into the bank, and will then be visible within Hawkeye for further inspection.

suspiciousfeat2region output file: asm.suspicious.feat

File format:

1. Contig id
2. Mis-assembly Feature Type (A)
3. range start
4. range end
6. Number of features in the region
7. Number of feature types in the region
8. List of features separated with pipe ("|") character