Benchmark

Overview

We have assembled a set of benchmark genomes for testing genome assembly. Each benchmark set comes with the sequence of the finished genome, random shotgun reads, closure reads, and ancillary library and insert information. Each read is categorized as matching or non-matching, based on its mapping to the finished genome. Reads that match the finished genome at 90% identity over 80% of their trimmed length (as aligned by MUMmer) are included in the matching set, while all other reads are grouped into the non-matching set. Ancillary information is presented in Trace Archive XML format. Please refer to the benchmark website for a longer description and the actual data.

Genomes

Description

Each tarball contains the following files:

genome.1con

       The finished genome sequence for this organism in multi-FastA format.
       Each chromosome or plasmid is a separate FastA entry.

Sequences

random.seq
closure.seq
random_nonmatching.seq
closure_nonmatching.seq

The sequences produced by the random (whole-genome shotgun) phase and the closure (finishing) phase of the sequencing project. Sequences grouped into the 'nonmatching' files failed to match the finished genome at 90% identity over 80% of their trimmed length (as aligned by MUMmer). All other sequences met or exceeded this criterion. To simulate assembly of the original shotgun project, concatenate random.seq and random_nonmatching.seq and assemble the combined reads.
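The concatenation step described above could be sketched in Python as follows. This is only a minimal sketch: the helper name concatenate_fasta and the output filename shotgun_all.seq are hypothetical and not part of the benchmark distribution.

```python
def concatenate_fasta(inputs, output):
    """Concatenate multi-FastA files into one file, in the order given.

    Hypothetical helper, not shipped with the benchmark data.
    """
    with open(output, "w") as out:
        for path in inputs:
            last_line = "\n"
            with open(path) as src:
                for line in src:
                    out.write(line)
                    last_line = line
            # Guard against a missing final newline so the next file's
            # first header does not fuse with the previous sequence line.
            if not last_line.endswith("\n"):
                out.write("\n")


# Example call, assuming the current directory is an unpacked tarball:
# concatenate_fasta(["random.seq", "random_nonmatching.seq"], "shotgun_all.seq")
```

The same helper could be applied to the matching .qual files, provided they are concatenated in the same order as the .seq files.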

These files are in multi-FastA format, with whitespace-delimited sequence information placed in the FastA headers. The six fields in the header are:

       ID, MINL, MAXL, MEANL, CLEARL, CLEARR

       ID      - A unique sequence identifier

       MINL    - Estimated minimum insert size

       MAXL    - Estimated maximum insert size

       MEANL   - Estimated mean insert size

       CLEARL  - The leftmost position of the trimmed sequence.  Trim
                 points that exclude vector and low-quality basecalls
                 have already been computed, but the sequence files
                 contain the entire read; to get the trimmed data, use
                 the range from CLEARL through CLEARR.  CLEARL and CLEARR
                 are inclusive range bounds and use a 1-based coordinate
                 system.

       CLEARR  - The rightmost position of the trimmed sequence.
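As an illustration of the header layout and the clear range, the following Python sketch parses a .seq file and extracts the trimmed sequence. The names Read, parse_seq and trimmed are hypothetical. Parsing the insert sizes as floats is an assumption made so that both integer and decimal values are accepted; the source does not state the number format.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    read_id: str
    minl: float    # estimated minimum insert size
    maxl: float    # estimated maximum insert size
    meanl: float   # estimated mean insert size
    clearl: int    # 1-based, inclusive left bound of the clear range
    clearr: int    # 1-based, inclusive right bound of the clear range
    sequence: str  # the entire, untrimmed read

    def trimmed(self) -> str:
        # 1-based inclusive bounds map to the 0-based half-open slice
        # [clearl - 1, clearr).
        return self.sequence[self.clearl - 1:self.clearr]


def _make_read(fields: List[str], chunks: List[str]) -> Read:
    if len(fields) != 6:
        raise ValueError("expected 6 header fields, got %d" % len(fields))
    read_id, minl, maxl, meanl, clearl, clearr = fields
    return Read(read_id, float(minl), float(maxl), float(meanl),
                int(clearl), int(clearr), "".join(chunks))


def parse_seq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield one Read per FastA record from an iterable of text lines."""
    fields = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if fields is not None:
                yield _make_read(fields, chunks)
            fields = line[1:].split()
            chunks = []
        else:
            chunks.append(line)
    if fields is not None:
        yield _make_read(fields, chunks)
```

Because parse_seq accepts any iterable of lines, an open file handle can be passed directly, for example `with open("random.seq") as fh: reads = list(parse_seq(fh))`.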

Qualities

random.qual
closure.qual
random_nonmatching.qual
closure_nonmatching.qual

The quality values for each of the above sequence files, as two-digit integers separated by single whitespace characters. Each quality sequence is headed by the same FastA ID found in the seq files.
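A quality file in this layout could be read with a short Python sketch like the one below. The function name parse_qual is hypothetical, and taking the first whitespace-delimited token of each header as the ID is an assumption about the header contents.

```python
from typing import Dict, Iterable, List


def parse_qual(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each FastA ID to its list of integer quality values."""
    quals: Dict[str, List[int]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token on the header line.
            current = line[1:].split()[0]
            quals[current] = []
        elif current is not None:
            # Values such as "08" parse as the integer 8.
            quals[current].extend(int(q) for q in line.split())
    return quals
```

An open file handle can be passed directly, for example `with open("random.qual") as fh: quals = parse_qual(fh)`.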

Meta data

random.xml
closure.xml
random_nonmatching.xml
closure_nonmatching.xml

The ancillary information in Trace Archive XML format. For each sequencing read, there is a <trace> record with the following fields:

       <trace_name>    - A unique sequence identifier. Same as the ID field
                         in the seq files.

       <template_id>   - The insert ID this read was sequenced from. Reads
                         from the same insert can be grouped to form mate-pair
                         information for the assembly process.

       <trace_end>     - Direction of the sequencing reaction. Useful in
                         determining the orientation of the mate sequences.

       <library_id>    - The library ID this insert was taken from. Reads from
                         the same library will share the same size
                         distribution.

       <insert_size>   - Estimated size of this insert.

       <insert_stdev>  - Standard deviation of the estimated insert size.

       <type>          - Type of read, either "closure" or
                         "paired_production", meaning the read is a closure
                         walk or an end-paired sequence, respectively.
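Grouping reads by <template_id> to recover mate pairs, as described above, might look like the following Python sketch. The function name group_mates is hypothetical. The sketch assumes that each field appears as a child element of its <trace> record under exactly the tag names listed above, and it makes no assumption about the name of the enclosing root element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple


def group_mates(xml_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each template_id to a list of (trace_name, trace_end) pairs."""
    root = ET.fromstring(xml_text)
    by_template = defaultdict(list)
    # iter() walks the whole tree, so the root element name does not matter.
    for trace in root.iter("trace"):
        name = trace.findtext("trace_name")
        template = trace.findtext("template_id")
        end = trace.findtext("trace_end") or ""
        if name is None or template is None:
            continue  # skip records that cannot be paired
        by_template[template.strip()].append((name.strip(), end.strip()))
    return dict(by_template)
```

Templates that end up with two entries carrying different <trace_end> values are candidate mate pairs. To process a file, its contents can be read first, for example `group_mates(open("random.xml").read())`.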
