We have tested Figaro on both simulated and real read data. All of the simulated data discussed in our paper is provided here.

To create a test in which we know exactly where the true vector ends, we generated a set of artificial sequences with variable-length vector sequence at their ends, based on shotgun reads from the Chlamydophila caviae GPIC genome project. We trimmed the first 300 bases from each of 19,633 reads and attached a vector sequence of random length, ranging from 10 to 50 bp, generated from the SmaI cloning site of the pUC18 vector (GenBank accession L09136). About 20% of the reads received no vector sequence. Finally, we introduced varying amounts of error into the vector sequence to assess the performance of Figaro in the presence of sequencing errors.
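
The construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the script used to build the datasets. The function names, the substitution-only error model, and the choice to take the vector fragment from the bases immediately upstream of the cloning site are all hypothetical.

```python
import random

BASES = "ACGT"

def mutate(seq, error_rate, rng):
    """Replace each base by a different base with probability error_rate.

    Assumption: errors are modeled as substitutions only.
    """
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_read(read, vector_upstream, error_rate, rng,
                  trim=300, min_len=10, max_len=50, no_vector_fraction=0.2):
    """Return (sequence, first_non_vector_base); the position is 1-based.

    vector_upstream is the vector sequence ending at the cloning site
    (hypothetical input; it should be at least max_len bases long).
    """
    insert = read[trim:]                      # drop the first `trim` bases
    if rng.random() < no_vector_fraction:     # roughly 20% get no vector
        return insert, 1
    n = rng.randint(min_len, max_len)         # inclusive on both ends
    vector = mutate(vector_upstream[-n:], error_rate, rng)
    return vector + insert, len(vector) + 1
```

Passing a seeded `random.Random` instance as `rng` keeps a simulated dataset reproducible across runs.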

Each dataset represents a different error rate within the vector sequences. The files contain FASTA sequences only. The number following the read ID on each header line (“>XXX #”) is the position of the first non-vector base in the read: “>XXX 1” means the read contains no vector, and “>XXX 42” means the true read begins at the 42nd base.
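
As an aid to using these files for evaluation, the following Python sketch recovers the true vector length and the vector-free read from each record. It assumes the read ID and the position are separated by whitespace on the header line. The function names `parse_header` and `read_truth` are illustrative and not part of Figaro.

```python
def parse_header(header):
    """Split a header such as '>XXX 42' into (read_id, first_non_vector_base)."""
    fields = header[1:].split()
    return fields[0], int(fields[1])

def read_truth(lines):
    """Yield (read_id, vector_length, true_read) for each FASTA record.

    `lines` is any iterable of text lines, for example an open file.
    """
    read_id, first, chunks = None, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if read_id is not None:
                seq = "".join(chunks)
                # Position is 1-based, so the vector spans seq[:first - 1].
                yield read_id, first - 1, seq[first - 1:]
            read_id, first = parse_header(line)
            chunks = []
        elif line:
            chunks.append(line)
    if read_id is not None:
        seq = "".join(chunks)
        yield read_id, first - 1, seq[first - 1:]
```

A trimming tool's output can then be scored by comparing its predicted clip point for each read with the `vector_length` value yielded here.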