Next-generation sequencing (NGS) technologies, such as Illumina's Genome Analyzer [1] and Roche/454's GS FLX [2], produce massive volumes of data. For instance, Illumina's Genome Analyzer IIx can produce up to 640 million 150 bp paired-end reads in a single run [3]. The increasing availability of such high-volume data is opening new possibilities to researchers, including assessment of rare variants in viral populations via deep sequencing, metagenomic sequencing of bacterial communities, and pooled resequencing of human chromosomes. Extracting meaningful information from these kinds of sequencing projects is often difficult, however, due to the error rates associated with NGS; separating true variants from sequencing errors remains challenging. Analysis is further complicated by an ever-increasing variety of downstream software and a lack of clear standards [4]. Selecting the most appropriate sequencing technology, and choosing the appropriate software package and parameter values for data analysis, are typically done via a 'hit and miss' approach - a costly exercise, even in the world of 'cheap' NGS.
Simulation of NGS data, followed by software benchmarking, presents an alternative approach. Early published simulators include GenFrag [5], whose usefulness for modern NGS projects is limited by a simplistic error model, a single input genome, and a lack of quality score information. SAMtools [6] also supplies a simulator, but it uses a uniform error rate; a slight improvement, released as 'dwgsim' [7], uses a uniformly increasing error rate. Because NGS data feature highly heterogeneous error profiles [8, 9], the usefulness of this simulator must be questioned.
There is growing evidence that sequence context (i.e., the nucleotide sequence surrounding a base and the base's position within the read) influences error rates in both Roche/454 and Illumina sequencing [8, 9]. This awareness has led to more advanced simulators such as MetaSim and Flowsim [10, 11]. While MetaSim generates reads from many input genomes and uses sequence-context error models, it cannot be trained on real data and does not assign quality values to reads, limiting its potential applications. The recent program Flowsim is the most realistic NGS simulator to date, with advanced error modelling and quality scores [11]; however, it operates only in 'flowspace' and is therefore entirely limited to simulation of Roche/454 pyrosequencing data. Likewise, the unpublished simulator SimSeq [12] empirically captures some characteristic features of Illumina error models, but it allows only a single input genome, does not empirically derive all parameters, and cannot simulate Roche/454 data. ART [13], an unpublished cross-platform simulator, also uses context-dependent error models and assigns quality scores; however, it appears limited to a single genome and does not allow training on users' own data sets. Thus there is a need for a realistic, cross-platform NGS simulator, as multiple sequencing platforms are likely to persist, each with their own strengths and weaknesses [14].
Here, we describe GemSIM - a General, Error-Model based SIMulator of NGS data. It uses the generic and standardised formats SAM (aligned reads) [6] and FASTQ (raw reads) [15], thus ensuring GemSIM's applicability to both current and emerging NGS technologies. GemSIM creates empirical error models from real NGS data, facilitating technology-, machine-, and even run-specific simulation. GemSIM considers a sequence context composed of a window of the three bases before the current base, the current base itself, and the one base after it (we call this the 'sequence-context word'). GemSIM also assigns realistic, empirically derived quality scores to simulated single or paired-end reads. It can draw reads from either single or multiple genomes or haplotype sets, making it applicable to deep sequencing, metagenomic, and resequencing projects. We demonstrate GemSIM's usefulness for evaluating error models and benchmarking downstream analysis software by using GemSIM to capture the error profiles of two different paired-end Illumina runs and one Roche/454 Titanium run, and by simulating reads from a set of in silico generated Buchnera aphidicola haplotypes. We then attempt to identify SNPs using the popular program VarScan [16] and assess the effects of different error profiles and technologies on SNP-calling accuracy.
Implementation
GemSIM is implemented in Python as a command line package, consisting of the four programs GemErr, GemHaps, GemReads, and GemStats. The GemSIM workflow is as follows:
GemErr
GemErr generates empirical error models from real data. A SAM format alignment of control data is used as input. A list of polymorphic sites or sites which are known to differ from the reference genome may also be supplied; these sites are then considered to be true SNPs and are ignored during error model calculation.
Reads are sequentially parsed, tracking the total number of reads and read length distributions. For paired-end reads, insert size, whether the read is the first or second read in the pair, and the proportion of properly aligned pairs are also recorded. For each base of each read the following information is then stored: a) nucleotide type and base position in the read; b) mismatch or true base for the position; c) indels following the current position; d) the preceding three bases in the read; e) the following base in the read; and f) quality scores for true, mismatch, and inserted bases. Although it is mainly the sequence preceding the current position that is known to affect error rates [8, 9], the following base in the read is tracked to allow accurate simulation of indels within homopolymers. Sequence aligners record these errors either at the start or the end of a homopolymer; by taking the following base into consideration, indels are inserted only once within long homopolymers, at the end, rather than potentially multiple times within the homopolymer. Empirical distributions for the tracked information are written to a file and used as error models for input into GemReads.
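To illustrate how such per-base information might be tallied, the following Python sketch records mismatches, sequence-context words, and quality scores for a single read. It is a simplified illustration rather than GemSIM's actual implementation: the function name tally_read_errors and the dictionary layout are our own, indel tracking is omitted, and the read is assumed to have an ungapped alignment lying entirely within the reference.

# Illustrative sketch only (not GemSIM's code): tally per-base error information
# from one read with an ungapped alignment that lies entirely within the reference.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))     # counts[(read_pos, word)][observed_base]
qual_dist = defaultdict(lambda: defaultdict(int))  # qual_dist[(true_base, observed_base)][quality]

def tally_read_errors(read_seq, read_qual, ref_seq, ref_start, known_snps=frozenset()):
    """read_seq/read_qual: read bases and Phred qualities (ints); ref_start: 0-based
    leftmost alignment position; known_snps: reference positions treated as true SNPs."""
    for i, observed in enumerate(read_seq):
        ref_pos = ref_start + i
        if ref_pos in known_snps:
            continue                                       # true variant, not a sequencing error
        true_base = ref_seq[ref_pos]
        before = read_seq[max(0, i - 3):i].rjust(3, 'N')   # three preceding bases in the read
        after = read_seq[i + 1:i + 2] or 'N'               # following base in the read
        word = before + true_base + after                  # the 'sequence-context word'
        counts[(i, word)][observed] += 1
        qual_dist[(true_base, observed)][read_qual[i]] += 1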
If a particular sequence-context word does not occur at least some minimum number of times x (default x = 4) within the reference genome, then the model is updated using information based on the longest nucleotide sequence that occurs at least x times in the reference. For instance, if AACTG is missing from the reference but ACTG can be found five times, then the matrix entry where T is the current position, G is the following position, and AAC are the three bases before the current position is updated by considering all other matrix entries where the two bases before the current position are AC, the current position is T, and G is the following position.
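This back-off can be sketched as follows; the function backed_off_entry and the ref_counts/model structures are illustrative assumptions rather than GemSIM's internal representation.

# Hypothetical sketch of the back-off described above. `word` is a five-base
# sequence-context word (three preceding bases + current base + following base),
# `ref_counts` maps each (possibly shortened) word to its occurrence count in the
# reference, and `model` maps full-length words to per-base observation counts.
def backed_off_entry(word, ref_counts, model, min_count=4):
    for trim in range(4):                      # drop 0, 1, 2, then all 3 preceding bases
        short = word[trim:]                    # e.g. AACTG -> ACTG -> CTG -> TG
        if ref_counts.get(short, 0) >= min_count:
            pooled = {}
            for w, entry in model.items():     # pool every entry sharing the shortened context
                if w.endswith(short):
                    for base, n in entry.items():
                        pooled[base] = pooled.get(base, 0) + n
            return pooled
    return model.get(word, {})                 # nothing frequent enough: use the raw entry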
GemStats (optional)
GemStats takes an error model generated by GemErr and calculates a set of statistics based on this model; for paired-end reads, models for the first and second read in a pair are considered separately. Statistics reported include the overall mismatch, insertion, and deletion error rates; the error rate for each nucleotide; and the error rate by base position within the read. Additionally, any sequence-context words with an error rate more than two standard deviations greater than the average error rate are also reported.
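A minimal sketch of such statistics, assuming the illustrative model layout used above (observation counts keyed by read position and sequence-context word, with the true base at the fourth position of the word), might look as follows; it reports per-word mismatch rates and flags words more than two standard deviations above the mean.

# Sketch only, assuming the illustrative layout from the previous sketch:
# model[(read_pos, word)] holds observation counts per observed base, and the
# true base sits at index 3 of the five-base sequence-context word.
from statistics import mean, pstdev

def context_word_error_rates(model):
    totals, errors = {}, {}
    for (pos, word), obs in model.items():
        true_base = word[3]
        totals[word] = totals.get(word, 0) + sum(obs.values())
        errors[word] = errors.get(word, 0) + sum(n for b, n in obs.items() if b != true_base)
    rates = {w: errors[w] / totals[w] for w in totals if totals[w] > 0}
    avg, sd = mean(rates.values()), pstdev(rates.values())
    outliers = {w: r for w, r in rates.items() if r > avg + 2 * sd}
    return rates, outliers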
GemHaps (optional)
GemHaps takes a genome sequence, and an input command specifying haplotype frequencies and the number of SNPs in each haplotype (here, haplotypes are defined as a group of related nucleotide sequences, each differing by at least one SNP). The positions of mutations are randomly determined, and haplotypes are written to a file for input into GemReads. Alternatively, users may create their own tab-delimited haplotype file (for instance, describing SNPs generated according to an evolutionary model).
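For illustration, a haplotype set in the spirit of GemHaps could be generated as in the sketch below; the tab-delimited columns written here are an assumption for the sketch, not necessarily GemSIM's actual haplotype file format.

# Illustrative only: generate a random haplotype description in the spirit of GemHaps.
# The columns written here (haplotype id, frequency, 1-based position, reference base,
# alternative base) are an assumption, not GemSIM's actual file format.
import random

def make_haplotypes(genome, spec, out_path):
    """spec: list of (frequency, n_snps) tuples, one per haplotype."""
    bases = 'ACGT'
    with open(out_path, 'w') as out:
        for hap_id, (freq, n_snps) in enumerate(spec, start=1):
            positions = random.sample(range(len(genome)), n_snps)   # random SNP positions
            for pos in sorted(positions):
                ref = genome[pos]
                alt = random.choice([b for b in bases if b != ref])  # any base except the reference
                out.write(f"hap{hap_id}\t{freq}\t{pos + 1}\t{ref}\t{alt}\n")

# e.g. two haplotypes at frequencies 0.7 and 0.3, carrying 5 and 12 SNPs respectively:
# make_haplotypes(genome_seq, [(0.7, 5), (0.3, 12)], "haplotypes.txt")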
GemReads
GemReads requires as input an error model file (generated by users with GemErr or supplied with GemSIM), a FASTA genome file or a directory of FASTA genomes (metagenomics mode), an optional haplotype file (user-defined or from GemHaps), and a tab-delimited species-abundance text file (metagenomics mode only). Additionally, the user specifies whether the reference genome(s) are circular or linear, and which quality score offset to use (33 or 64, depending on the required FASTQ output version). The requested number of single or paired-end reads is generated as follows and output as FASTQ files:
(1) Read length and insert length (paired-end only) are randomly chosen from the empirical distributions defined by the error model (read length may also be set to a static value).
(2) In metagenomics mode, an input reference genome is chosen with probability proportional to the species abundance and genome size.
(3) Read location and direction are chosen at random, and the read is copied from the input genome.
(4) The read is assigned to a haplotype (if supplied) and updated with SNPs, where appropriate.
(5) Errors are introduced according to the error model, accounting for read position, sequence-context word, and first or second read in the pair (for paired-end reads).
(6) Quality scores are assigned using the empirical distributions recorded in the error model file.
For paired-end reads, steps 1-6 are repeated; however, the insert size is used in conjunction with the previous read's position and direction to determine the location of the second read. Our error model also tracks how many read pairs have one read that does not align. We simulate this by generating a read consisting entirely of Ns with low quality scores, forcing it not to map. For instance, for 100 bp paired-end Illumina sequencing data, one read would contain 100 Ns, each with a quality score of 'B' or 2.
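A much-simplified sketch of the single-end version of this loop (steps 1-6 above) is shown below; it is not GemSIM's code, as it uses a fixed read length, reduces the error model to a per-position mismatch probability and a per-position quality distribution, and omits haplotype assignment, indels, and paired-end handling.

# Simplified sketch of the single-end simulation loop (steps 1-6 above), not GemSIM's code:
# err_by_pos[i] is the mismatch probability at read position i, qual_by_pos[i] is a list
# of Phred qualities to sample at that position, and offset is the FASTQ encoding (33 or 64).
import random

def reverse_complement(seq):
    return seq.translate(str.maketrans('ACGTN', 'TGCAN'))[::-1]

def simulate_reads(genome, n_reads, read_len, err_by_pos, qual_by_pos, offset=33, circular=True):
    bases = 'ACGT'
    for n in range(n_reads):
        if circular:                                           # step 3: random location
            start = random.randrange(len(genome))
            frag = (genome + genome)[start:start + read_len]   # allow wrap-around
        else:
            start = random.randrange(len(genome) - read_len + 1)
            frag = genome[start:start + read_len]
        if random.random() < 0.5:                              # step 3: random direction
            frag = reverse_complement(frag)
        read, quals = [], []
        for i, base in enumerate(frag):
            if random.random() < err_by_pos[i]:                # step 5: introduce errors
                base = random.choice([b for b in bases if b != base])
            read.append(base)
            quals.append(chr(random.choice(qual_by_pos[i]) + offset))  # step 6: quality encoding
        yield f"@sim_read_{n}\n{''.join(read)}\n+\n{''.join(quals)}\n"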