Bank2contig

From AMOS WIKI

bank2contig is a general converter from AMOS banks into a variety of other contig formats.

For descriptions of format, see [1]

TIGR Assembler / GDE Contig Format

The .contig format is a simple text format for encoding read-to-contig alignments. This is the default output format for bank2contig. The layout format (-L) is the same as the contig format, except that no sequence information is written. This makes it useful for listing the reads in each contig, their positions, clear ranges, etc.

Example:

##56487 19 1623 bases, 00000000 checksum.
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>
TTAGACCCAGGAGAAG-CATAAAATTTTCAGAGCCATCTGATGTAGGAGGAAGTTATGAA
  • Each contig is preceded by a header starting with "##", followed by the contig identifier, the number of reads aligned to it, and the number of bases in the padded consensus. Records generated by TIGR Assembler also contain an 8-digit checksum; most converters, however, generate a blank checksum (it is not used by any code anyway).
  • The contig sequence, listed after the "##" header, is padded with the gap character.
  • Each read aligned to the consensus is preceded by a header starting with a single "#" character. The 0-based offset of the read in the consensus is given in parentheses. Within the square brackets, the string "RC" indicates that the read was reverse complemented, a fact also indicated by the reversed order of the clear range within the braces ({720 10}). The clear range is 1-based with respect to the unpadded/ungapped read sequence. Note that the low number is 10, meaning the first 9 bases (1-9) have been trimmed from the beginning (5' end) of the read. There may also be bases trimmed at the end of the read (3' end) beyond base 720, but this format does not record how many. Next, the coordinates of the read along the ungapped, 1-based consensus are given within angle brackets (<1 710>). The header also contains a checksum (largely ignored) and the number of bases that follow it.
  • After the read header, the aligned section of the read (the bases within the clear range alone) is provided in padded form, and in the correct orientation (complemented if necessary).
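The header layout described above can be illustrated with a short Python sketch. It is not part of AMOS: the names ContigHeader, ReadHeader, parse_contig_header and parse_read_header are hypothetical, the regular expressions are inferred only from the example record shown above, and the handling of forward reads (empty square brackets) is an assumption.

```python
import re
from typing import NamedTuple, Tuple


class ContigHeader(NamedTuple):
    contig_id: str
    num_reads: int
    padded_length: int


class ReadHeader(NamedTuple):
    read_id: str
    offset: int                       # 0-based offset in the consensus
    reverse: bool                     # True when "[RC]" is present
    num_bases: int
    clear_range: Tuple[int, int]      # 1-based, ungapped read coordinates
    consensus_range: Tuple[int, int]  # 1-based, ungapped consensus coordinates


# Patterns inferred from the example record; real files may vary.
CONTIG_RE = re.compile(r"^##(\S+)\s+(\d+)\s+(\d+) bases")
READ_RE = re.compile(
    r"^#(?!#)([^(\s]+)\((\d+)\)\s+\[(RC)?\]\s+(\d+) bases,\s+\w+ checksum\.\s+"
    r"\{(\d+) (\d+)\}\s+<(\d+) (\d+)>"
)


def parse_contig_header(line: str) -> ContigHeader:
    """Parse a '##' contig header line."""
    match = CONTIG_RE.match(line)
    if match is None:
        raise ValueError(f"not a contig header: {line!r}")
    contig_id, num_reads, length = match.groups()
    return ContigHeader(contig_id, int(num_reads), int(length))


def parse_read_header(line: str) -> ReadHeader:
    """Parse a '#' read header line."""
    match = READ_RE.match(line)
    if match is None:
        raise ValueError(f"not a read header: {line!r}")
    read_id, offset, rc, bases, clr_a, clr_b, con_a, con_b = match.groups()
    return ReadHeader(
        read_id=read_id,
        offset=int(offset),
        reverse=rc == "RC",
        num_bases=int(bases),
        clear_range=(int(clr_a), int(clr_b)),
        consensus_range=(int(con_a), int(con_b)),
    )


if __name__ == "__main__":
    print(parse_contig_header("##56487 19 1623 bases, 00000000 checksum."))
    print(parse_read_header(
        "#000035230611N10F(0) [RC] 711 bases, 00000000 checksum. {720 10} <1 710>"
    ))
```

For a reverse-complemented read, the first clear-range value is larger than the second, so the reverse flag can be cross-checked against clear_range.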



SAM Conversion

The SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments, used in the 1000 Genomes Project and many others.

bank2contig is a basic converter from the AMOS assembly format into SAM format. It works from AMOS banks (an indexed binary format) and outputs the assembled reads with extended CIGAR strings compatible with the samtools library. At this time it does not convert mate or library information, but it should be sufficient for analyzing and visualizing the read-to-contig alignments from a variety of assembly formats, including AMOS, Celera Assembler, phrap, velvet, etc.
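To give a feel for what an extended CIGAR string encodes, the following Python sketch derives one from a padded consensus and a padded read of equal length. This is only an illustration under stated assumptions, not the algorithm bank2contig uses: the function names are hypothetical, "-" is assumed to be the gap character, and the operations follow the SAM convention (M for aligned bases, I for insertion relative to the reference, D for deletion, P for padding).

```python
from itertools import groupby

GAP = "-"  # assumed gap character, as in the padded contig example


def column_op(cons_base: str, read_base: str) -> str:
    """Classify one alignment column as a CIGAR operation."""
    if cons_base == GAP and read_base == GAP:
        return "P"  # gap in both: padding
    if cons_base == GAP:
        return "I"  # read has a base where the consensus has none
    if read_base == GAP:
        return "D"  # consensus has a base where the read has none
    return "M"      # base aligned to base (match or mismatch)


def padded_cigar(consensus: str, read: str) -> str:
    """Run-length encode the column operations of two padded sequences."""
    if len(consensus) != len(read):
        raise ValueError("padded sequences must have equal length")
    ops = (column_op(c, r) for c, r in zip(consensus, read))
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


if __name__ == "__main__":
    print(padded_cigar("ACG-TA", "AC-GTA"))  # 2M1D1I2M
```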

The basic steps are:

1. Create AMOS AFG file: AMOScmp, Minimus, & velvet automatically create AFG files

# Or convert ACE File
$ toAmos -ace data.ace -o data.afg

# Or convert Celera Assembler
$ toAmos -frg data.frg -a data.asm -o data.afg

2. Create AMOS bank

$ bank-transact -m data.afg -b data.bnk -c

3. Create contig fasta & SAM alignment file

$ bank2fasta -i -b data.bnk > data.fa
$ bank2contig -i -s data.bnk > data.sam

4. Load with samtools and view alignments

$ samtools faidx data.fa                            # index the contig FASTA
$ samtools import data.fa.fai data.sam data.bam     # SAM->BAM
$ samtools index data.bam                           # index BAM
$ samtools tview data.bam data.fa                   # view alignments


DNPTrapper

DNPTrapper is an assembly editing and visualization tool specifically designed for manual analysis and finishing of repeated regions. It differs from previous tools by providing flexibility and an overview that greatly simplifies the finishing process, by allowing the user to view whole repeat regions at once and to edit assembly errors manually by drag and drop. The program implements and visualizes the results of a previously described statistical method that detects defined nucleotide positions (DNPs, representing single base differences between repeat units) in the presence of sequencing errors.


Usage:

bank2contig -T data.bnk > data.xml


Simple Layout

The simple layout format (-S) is a simple tab-delimited file listing the IDs of the reads in each contig. The fields are:

1. contig id
2. contig status
3. read id
4. reverse complement flag (0/1)
5. read offset (0-based gapped offset)
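
A file with these five fields can be read with a few lines of Python. The sketch below assumes one read per line with exactly five tab-separated columns in the order listed above; LayoutRow and parse_layout_line are hypothetical names, and the sample line (including the status value "A") is made up for illustration.

```python
from typing import NamedTuple


class LayoutRow(NamedTuple):
    contig_id: str
    contig_status: str
    read_id: str
    reverse: bool  # True when the reverse complement flag is 1
    offset: int    # 0-based gapped offset of the read in the contig


def parse_layout_line(line: str) -> LayoutRow:
    """Split one tab-delimited layout line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    contig_id, status, read_id, rc_flag, offset = fields
    if rc_flag not in ("0", "1"):
        raise ValueError(f"reverse complement flag must be 0 or 1: {rc_flag!r}")
    return LayoutRow(contig_id, status, read_id, rc_flag == "1", int(offset))


if __name__ == "__main__":
    # Hypothetical sample line; real status values may differ.
    print(parse_layout_line("56487\tA\t000035230611N10F\t1\t0\n"))
```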