We have assembled a set of benchmark genomes for assembly. Each benchmark set comes with the sequence of the finished genome, random shotgun reads, closure reads, and ancillary library and insert information. Each read is categorized as matching or non-matching based on its mapping to the finished genome: reads that match the finished genome at 90% identity over 80% of their trimmed length (as aligned by MUMmer) are placed in the matching set, and all other reads are placed in the non-matching set. Ancillary information is presented in Trace Archive XML format. Please refer to the benchmark website for a fuller description and the actual data.
- Brucella suis
- Shewanella oneidensis
- Staphylococcus aureus COL
- Staphylococcus epidermidis RP62A
Each tarball contains the following files:
The finished genome sequence for this organism in multi-FastA format. Each chromosome or plasmid is a separate FastA entry.
The sequences produced by the random (whole genome shotgun) phase and the closure (finishing) phase of the sequencing project. Sequences grouped into the 'nonmatching' files failed to match the finished genome at 90% identity over 80% of their trimmed length (as aligned by MUMmer). All other sequences matched the finished genome at or above this criterion. To simulate assembly of the original shotgun project, concatenate the data in random.seq and random_nonmatching.seq and assemble that.
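The concatenation step described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the benchmark distribution: the function name `concatenate_reads` and the output file name `shotgun.seq` are hypothetical, and the only file names taken from the text are random.seq and random_nonmatching.seq. The sketch also guards against an input file that lacks a trailing newline, which would otherwise fuse the next FastA header onto the previous sequence line.

```python
def concatenate_reads(parts, combined):
    """Write the files listed in `parts`, in order, into the file `combined`.

    A newline is added after any input file that does not end with one,
    so FastA records from different files never run together.
    """
    with open(combined, "w") as out:
        for part in parts:
            last_line = "\n"
            with open(part) as src:
                for line in src:
                    out.write(line)
                    last_line = line
            if not last_line.endswith("\n"):
                out.write("\n")
    return combined
```

A call such as `concatenate_reads(["random.seq", "random_nonmatching.seq"], "shotgun.seq")` would then produce a single input file for the assembler.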
These files are in multi-FastA format, with whitespace-delimited sequence information placed in the FastA headers. The six fields in the header, in order, are:
ID, MINL, MAXL, MEANL, CLEARL, CLEARR
ID - A unique sequence identifier
MINL - Estimated minimum insert size
MAXL - Estimated maximum insert size
MEANL - Estimated mean insert size
CLEARL - The leftmost position of the trimmed sequence. The trim points have already been computed to exclude vector and low-quality basecalls, but the sequence files contain the entire read; to get the trimmed data, use the range from CLEARL through CLEARR. CLEARL and CLEARR are inclusive range bounds and use a 1-based coordinate system.
CLEARR - The rightmost position of the trimmed sequence.
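As an illustration of the header layout and the clear range described above, the following Python sketch parses the six header fields and extracts the trimmed portion of each read. The names `ReadHeader`, `parse_header`, `trim` and `read_trimmed` are hypothetical, and the example read is invented. Parsing the insert sizes as floats is an assumption, since the text does not say whether they are always integers.

```python
from typing import NamedTuple

class ReadHeader(NamedTuple):
    read_id: str
    minl: float   # estimated minimum insert size
    maxl: float   # estimated maximum insert size
    meanl: float  # estimated mean insert size
    clearl: int   # leftmost trimmed position (1-based, inclusive)
    clearr: int   # rightmost trimmed position (1-based, inclusive)

def parse_header(line):
    """Parse a FastA header line of the form '>ID MINL MAXL MEANL CLEARL CLEARR'."""
    fields = line.strip().lstrip(">").split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 header fields, got {len(fields)}: {line!r}")
    read_id, minl, maxl, meanl, clearl, clearr = fields
    return ReadHeader(read_id, float(minl), float(maxl), float(meanl),
                      int(clearl), int(clearr))

def trim(sequence, clearl, clearr):
    """Return positions clearl..clearr (1-based, inclusive) of the sequence."""
    # 1-based inclusive [clearl, clearr] maps to the Python slice [clearl-1:clearr].
    return sequence[clearl - 1:clearr]

def read_trimmed(lines):
    """Yield (header, trimmed_sequence) for each record in a multi-FastA stream."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, trim("".join(chunks), header.clearl, header.clearr)
            header, chunks = parse_header(line), []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, trim("".join(chunks), header.clearl, header.clearr)

# Invented example: a 10-base read whose clear range is positions 3 through 8.
example = [">READ1 2000 4000 3000 3 8", "NNACG", "TACNN"]
for hdr, seq in read_trimmed(example):
    print(hdr.read_id, seq)  # prints "READ1 ACGTAC"
```

The same slice, `[clearl - 1:clearr]`, applies to any per-base data that runs parallel to the read.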
The quality values for each of the above sequence files, as two-digit integers separated by single whitespace characters. Each quality sequence is headed by the same FastA ID found in the seq files.
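Reading the quality files described above might look like the following Python sketch. The function name `read_qualities` and the example record are hypothetical. The sketch takes the first whitespace-delimited token of each header as the ID, which is an assumption about headers that may or may not carry additional fields.

```python
def read_qualities(lines):
    """Return a dict mapping each FastA ID to its list of integer quality values."""
    qualities = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            current = line[1:].split()[0]
            qualities[current] = []
        elif line and current is not None:
            # Two-digit values such as "05" parse to the integer 5.
            qualities[current].extend(int(q) for q in line.split())
    return qualities

# Invented example record spanning two lines.
example = [">READ1", "05 10 20 30", "40 09"]
print(read_qualities(example))  # prints {'READ1': [5, 10, 20, 30, 40, 9]}
```

Because there is one quality value per base, the quality list for a read can be trimmed with the same 1-based inclusive CLEARL and CLEARR bounds as its sequence, for example `quals[clearl - 1:clearr]`.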
The ancillary information in Trace Archive XML format. For each sequencing read, there is a <trace> record containing the following fields:
<trace_name> - A unique sequence identifier. Same as the ID field in the seq files.
<template_id> - The insert ID this read was sequenced from. Reads from the same insert can be grouped to form mate-pair information for the assembly process.
<trace_end> - Direction of the sequencing reaction. Useful in determining the orientation of the mate sequences.
<library_id> - The library ID this insert was taken from. Reads from the same library will share the same size distribution.
<insert_size> - Estimated size of this insert.
<insert_stdev> - Standard deviation of the estimated insert size.
<type> - Type of read, either "closure" or "paired_production", meaning the read is a closure walk or an end-paired sequence, respectively.
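The mate-pair grouping described for <template_id> can be sketched with Python's standard xml.etree.ElementTree module. This is a minimal illustration under stated assumptions: each field is taken to be a child element of its <trace> record, the function name `group_by_template` is hypothetical, and the embedded XML snippet (including the root element name and the "F"/"R" values for <trace_end>) is invented for the example rather than copied from the benchmark files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def group_by_template(xml_text):
    """Map each template (insert) ID to a list of (trace_name, trace_end) tuples."""
    root = ET.fromstring(xml_text)
    by_template = defaultdict(list)
    for trace in root.iter("trace"):
        name = trace.findtext("trace_name")
        template = trace.findtext("template_id")
        if name is None or template is None:
            continue  # skip records without the fields needed for pairing
        end = (trace.findtext("trace_end") or "").strip()
        by_template[template.strip()].append((name.strip(), end))
    return dict(by_template)

# Invented example: two reads from insert T1 and one unpaired read from T2.
EXAMPLE = """
<trace_volume>
  <trace><trace_name>READ1</trace_name><template_id>T1</template_id><trace_end>F</trace_end></trace>
  <trace><trace_name>READ2</trace_name><template_id>T1</template_id><trace_end>R</trace_end></trace>
  <trace><trace_name>READ3</trace_name><template_id>T2</template_id><trace_end>F</trace_end></trace>
</trace_volume>
"""

groups = group_by_template(EXAMPLE)
# Keep only inserts with exactly two reads as candidate mate pairs.
mates = {t: reads for t, reads in groups.items() if len(reads) == 2}
print(mates)  # prints {'T1': [('READ1', 'F'), ('READ2', 'R')]}
```

Inserts with a single read, or with more than two (for example when closure reads share a template), are left out of `mates` here; how to treat them is a choice for the assembly pipeline.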