UMD Overlapper
Overview
The UMD overlapper is designed to reduce the number of overlaps produced by the assembler by reducing the number of repeat-induced overlaps. Most assemblers use exact k-mer matches to identify reads that potentially overlap. The UMD overlapper makes this step much cheaper through the use of minimizers, a technique that reduces the number of k-mers considered in the initial phase of overlapping by an order of magnitude.
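To make the minimizer idea concrete, here is a minimal sketch in Python. It is not the UMD overlapper's code: the function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions chosen for illustration. For every window of `w` consecutive k-mers, only the smallest k-mer is kept, so far fewer k-mers need to be stored and compared.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as minimizers of seq.

    Illustrative assumption: k-mers are ranked lexicographically and
    ties are broken by taking the leftmost k-mer in the window.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over w consecutive k-mers and keep the smallest one.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


def share_minimizer(read_a, read_b, k, w):
    """True if the two reads have at least one minimizer k-mer in common."""
    kmers_a = {kmer for _, kmer in minimizers(read_a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(read_b, k, w)}
    return bool(kmers_a & kmers_b)
```

Under these assumptions, two reads that share an exact substring of length at least `w + k - 1` always pick the same minimizer inside it, so `share_minimizer` still flags them as candidate overlaps even though most k-mers were never indexed.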
In conjunction with the UMD overlapper, the UMD error corrector identifies and corrects potential sequencing errors by detecting bases in a multiple alignment of reads that are supported by only one of the reads. The algorithm uses a heuristic called the 4-3 rule, which examines overlapping sets of 4 reads at 3 positions in order to identify differences that correspond to distinct copies of a repeat.
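The first step described above, finding bases supported by only one read, can be sketched as follows. This is a hedged illustration, not the UMD error corrector: the function name `singleton_bases`, the input format (equal-length aligned strings with `-` for a gap or no coverage), and the requirement that another base be seen at least twice are all assumptions. The 4-3 rule itself is not implemented here, since its details are not given in this overview.

```python
from collections import Counter


def singleton_bases(aligned_reads):
    """Return (read_index, column, base) for bases seen in only one read.

    aligned_reads: equal-length strings forming a multiple alignment,
    with '-' marking a gap or an uncovered position (assumed format).
    """
    flagged = []
    if not aligned_reads:
        return flagged
    for col in range(len(aligned_reads[0])):
        column = [read[col] for read in aligned_reads]
        counts = Counter(base for base in column if base != "-")
        # Only columns where some other base has real support are informative.
        if len(counts) < 2 or max(counts.values()) < 2:
            continue
        for row, base in enumerate(column):
            if base != "-" and counts[base] == 1:
                flagged.append((row, col, base))
    return flagged
```

For example, with the aligned reads `ACGTA`, `ACGTA`, `ACCTA`, and `ACGTA`, the sketch flags only the `C` at column 2 of the third read. A real corrector would also need to decide, as the 4-3 rule does, whether such a difference is an error or evidence of a distinct repeat copy.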
Related publications
- "A preprocessor for shotgun assembly of large genomes." Roberts M, Hunt BR, Yorke JA, Bolanos R, Delcher A. Journal of Computational Biology, 2004, 11(4):734-752.
- "Reducing storage requirements for biological sequence comparison." Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Bioinformatics, 2004, 20(18):3363-3369.