Revision as of 22:23, 7 July 2009 by Mcschatz (Talk | contribs)



We have assembled a set of benchmark genomes for evaluating assembly. Each benchmark set comes with the sequence of the finished genome, random shotgun reads, closure reads, and ancillary library and insert information. Each read is categorized as matching or non-matching based on its mapping to the finished genome. Reads that match the finished genome at 90% identity for over 80% of their trimmed length (as aligned by MUMmer) are included in the matching set, while all other reads are grouped into the non-matching set. Ancillary information is presented in Trace Archive XML format. Please refer to the benchmark website for a longer description and the actual data.



Each tarball contains the following files:

Reference Genome

  • genome.1con

The finished genome sequence for this organism in multi-FastA format. Each chromosome or plasmid is a separate FastA entry.


  • random.seq
  • closure.seq
  • random_nonmatching.seq
  • closure_nonmatching.seq

The sequences produced by the random (whole genome shotgun) phase and the closure (finishing) phase of the sequencing project. Sequences in the 'nonmatching' files failed to match the finished genome at 90% identity over 80% of their trimmed length (as aligned by MUMmer). All other sequences met or exceeded this criterion. To simulate assembly of the original shotgun project, concatenate random.seq and random_nonmatching.seq and assemble the combined reads.
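
The concatenation step described above could be scripted as follows. This is a minimal sketch and not part of the benchmark distribution: the helper name concatenate and the output name shotgun.seq are hypothetical, and the code assumes each file is small enough to read into memory.

```python
def concatenate(inputs, output):
    """Write the contents of each input file to output, in the order given.

    A newline is added after any input that lacks a trailing one, so the
    first FastA header of the next file always starts on its own line.
    """
    with open(output, "w") as out:
        for name in inputs:
            with open(name) as src:
                data = src.read()
            out.write(data)
            if data and not data.endswith("\n"):
                out.write("\n")
```

For example, concatenate(["random.seq", "random_nonmatching.seq"], "shotgun.seq") would write the combined shotgun reads to shotgun.seq. If the assembler also needs quality values, the corresponding .qual files would presumably have to be combined in the same order.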

These files are in multi-FastA format, with whitespace-delimited sequence information placed in the FastA headers. The six fields in the header are:

       ID      - A unique sequence identifier
       MINL    - Estimated minimum insert size
       MAXL    - Estimated maximum insert size
       MEANL   - Estimated mean insert size
       CLEARL  - The leftmost position of the trimmed sequence.  We have
                 already trimmed all sequences to remove vector and
                 low-quality basecalls.  The sequence files contain the
                 entire read; to get the trimmed data, use the range from
                 CLEARL through CLEARR. CLEARL and CLEARR are inclusive range
                 bounds, and use a 1-based coordinate system.
       CLEARR  - The rightmost position of the trimmed sequence.
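
The header layout and clear range above can be illustrated with a short parser. This is a sketch under stated assumptions: the fields are assumed to appear in the header in the order listed (ID, MINL, MAXL, MEANL, CLEARL, CLEARR), the insert sizes are parsed as floats in case they are not whole numbers, and the names ReadHeader, read_fasta, parse_header and trim are hypothetical.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    read_id: str
    minl: float
    maxl: float
    meanl: float
    clearl: int  # 1-based, inclusive
    clearr: int  # 1-based, inclusive


def read_fasta(lines):
    """Yield (header_line, sequence) pairs from an iterable of FastA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def parse_header(line):
    """Split a FastA header into its six whitespace-delimited fields."""
    fields = line.lstrip(">").split()
    if len(fields) != 6:
        raise ValueError("expected 6 header fields, got %d" % len(fields))
    return ReadHeader(
        fields[0],
        float(fields[1]),
        float(fields[2]),
        float(fields[3]),
        int(fields[4]),
        int(fields[5]),
    )


def trim(sequence, clearl, clearr):
    """Return the clear range; 1-based inclusive bounds map to [clearl-1:clearr]."""
    return sequence[clearl - 1:clearr]
```

With a file open as fh, the trimmed reads could then be obtained by looping over read_fasta(fh), calling parse_header on each header, and passing the sequence with its CLEARL and CLEARR values to trim.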


  • random.qual
  • closure.qual
  • random_nonmatching.qual
  • closure_nonmatching.qual

The quality values for each of the above sequence files, as two-digit integers separated by single whitespace characters. Each quality sequence is headed by the same FastA ID found in the seq files.
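
Reading the quality files could look like the sketch below. It assumes the ID is the first whitespace-delimited token on each header line and that quality values may wrap across several lines; the function name read_qual is hypothetical.

```python
def read_qual(lines):
    """Yield (read_id, quality_values) pairs from an iterable of .qual lines."""
    read_id, values = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if read_id is not None:
                yield read_id, values
            fields = line[1:].split()
            read_id = fields[0] if fields else ""
            values = []
        elif read_id is not None:
            # int() accepts zero-padded tokens such as "07".
            values.extend(int(token) for token in line.split())
    if read_id is not None:
        yield read_id, values
```

The resulting lists can be matched to the reads in the seq files by ID.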

Metadata

  • random.xml
  • closure.xml
  • random_nonmatching.xml
  • closure_nonmatching.xml

The ancillary information in Trace Archive XML format. For each sequencing read, there is a <trace> record with the following fields:

       <trace_name>    - A unique sequence identifier. Same as the ID field
                         in the seq files.
       <template_id>   - The insert ID this read was sequenced from. Reads
                         from the same insert can be grouped to form mate-pair
                         information for the assembly process.
       <trace_end>     - Direction of the sequencing reaction. Useful in
                         determining the orientation of the mate sequences.
       <library_id>    - The library ID this insert was taken from. Reads from
                         the same library share the same estimated insert size.
       <insert_size>   - Estimated insert size for this insert.
       <insert_stdev>  - Standard deviation of the estimated insert size.
       <type>          - Type of read, either "closure" or "paired_production",
                         meaning the read is a closure walk or an end-paired