GemSIM: general, error-model based simulator of next-generation sequencing data
© McElroy et al; licensee BioMed Central Ltd. 2012
Received: 21 September 2011
Accepted: 15 February 2012
Published: 15 February 2012
Skip to main content
© McElroy et al; licensee BioMed Central Ltd. 2012
Received: 21 September 2011
Accepted: 15 February 2012
Published: 15 February 2012
GemSIM, or General Error-Model based SIMulator, is a next-generation sequencing simulator capable of generating single or paired-end reads for any sequencing technology compatible with the generic formats SAM and FASTQ (including Illumina and Roche/454). GemSIM creates and uses empirically derived, sequence-context based error models to realistically emulate individual sequencing runs and/or technologies. Empirical fragment length and quality score distributions are also used. Reads may be drawn from one or more genomes or haplotype sets, facilitating simulation of deep sequencing, metagenomic, and resequencing projects.
We demonstrate GemSIM's value by deriving error models from two different Illumina sequencing runs and one Roche/454 run, and comparing and contrasting the resulting error profiles of each run. Overall error rates varied dramatically, both between individual Illumina runs, between the first and second reads in each pair, and between datasets from Illumina and Roche/454 technologies. Indels were markedly more frequent in Roche/454 than Illumina and both technologies suffered from an increase in error rates near the end of each read.
The effects of these different profiles on low-frequency SNP-calling accuracy were investigated by analysing simulated sequencing data for a mixture of bacterial haplotypes. In general, SNP-calling using VarScan was only accurate for SNPs with frequency > 3%, independent of which error model was used to simulate the data. Variation between error profiles interacted strongly with VarScan's 'minumum average quality' parameter, resulting in different optimal settings for different sequencing runs.
Next-generation sequencing has unprecedented potential for assessing genetic diversity, however analysis is complicated as error profiles can vary noticeably even between different runs of the same technology. Simulation with GemSIM can help overcome this problem, by providing insights into the error profiles of individual sequencing runs and allowing researchers to assess the effects of these errors on downstream data analysis.
Next-generation sequencing (NGS) technologies, such as Illumina's Genome Analyzer  and Roche/454's GS FLX , produce massive volumes of data. For instance, Illumina's Genome Analyzer IIx can produce up to 640 million 150 bp paired-end reads in a single run . Increasing availability of high volume data is opening new possibilities to researchers. These include assessment of rare variants in viral populations via deep sequencing, metagenomic sequencing of bacterial communities, and pooled resequencing of human chromosomes. Extracting meaningful information from these kinds of sequencing projects is often difficult, however, due to the error rates associated with NGS. Separating true variants from sequencing errors remains challenging. Furthermore, analysis is complicated by an ever-increasing variety of downstream software, and a lack of clear standards . Both selecting the most appropriate sequencing technology, and choosing the appropriate software package and parameter values for data analysis are typically done via a 'hit and miss' approach - a costly exercise, even in the world of 'cheap' NGS.
Simulation of NGS data, followed by software benchmarking, presents an alternative approach. Early published simulators include GenFrag . Its usefulness for modern NGS projects is limited by a simplistic error model, a single input genome, and a lack of quality score information. SamTools  also supplies a simulator, however it uses a uniform error rate. A uniformly increasing error rate is used in a slight improvement released as 'dwgsim'. NGS features highly heterogenous error profiles [8, 9], so the usefulness of this simulator must be questioned.
There is growing evidence that sequence context (i.e., the nucleotide sequence surrounding a base and the base's position within the read) influences error rates in both Roche/454 and Illumina sequencing [8, 9]. This awareness has led to more advanced simulators such as MetaSim and Flowsim [10, 11]. While MetaSim generates reads from many input genomes and uses sequence-context error models, it cannot be trained on real data and does not assign quality values to reads, limiting its potential applications. The recent program Flowsim is the most realistic NGS simulator to date, with advanced error modelling and quality scores . However it operates only in 'flowspace' and is therefore entirely limited to simulation of Roche/454 pyrosequencing data. Likewise, the unpublished simulator SimSeq  empirically captures some characteristic features of Illumina error models, however only allows a single input genome, does not empirically derive all parameters, and cannot simulate Roche/454 data. ART , an unpublished cross-platform simulator, also uses context-dependent error models and does assign quality scores. However it appears limited to a single genome and does not allow training on user's own data sets. Thus there is a need for a realistic, cross-platform NGS simulator, as multiple sequencing platforms are likely to persist, each with their own strengths and weaknesses .
Here, we describe GemSIM - a General, Error Model based SIMulator of NGS sequencing data. It uses the generic and standardised formats SAM (aligned reads)  and FASTQ (raw reads) , thus ensuring GemSIM's applicability to both current and emerging NGS technologies. GemSIM creates empirical error models from real NGS data, facilitating technology-, machine-, and even run-specific simulation. GemSIM considers a sequence-context composed of a window of three bases before the current base, the current base, and one base after the current base (we call this the 'sequence-context word'). GemSIM also assigns realistic, empirically-derived quality scores to simulated single or paired-end reads. It can draw reads from either single or multiple genomes or haplotype sets, making it applicable to deep sequencing, metagenomic, and resequencing projects. We demonstrate GemSIM's usefulness for evaluating error models and benchmarking downstream analysis software by using GemSIM to capture the error profiles of two different paired-end Illumina runs and one Roche/454 Titanium run, and by simulating reads from a set of in silico generated Buchnera aphidicola haplotypes. We then attempt to identify SNPs using the popular program VarScan  and assess the effects of different error profiles and technologies on SNP-calling accuracy.
GemSIM is implemented in Python as a command line package, consisting of the four programs GemErr, GemHaps, GemReads, and GemStats. The GemSIM workflow is as follows:
GemErr generates empirical error models from real data. A SAM format alignment of control data is used as input. A list of polymorphic sites or sites which are known to differ from the reference genome may also be supplied; these sites are then considered to be true SNPs and are ignored during error model calculation.
Reads are sequentially parsed, tracking the total number of reads and read length distributions. For paired-end reads, insert size, whether the read is the first or second read in the pair, and the proportion of properly aligned pairs are also recorded. For each base of each read the following information is then stored: a) nucleotide type and base position in read; b) mismatch or true base for the position; c) indels following the current position; d) preceding three bases in the read; e) following base in the read, and f) quality scores for true and mismatch bases and insert bases. Although it is mainly the sequence preceding the current position that is known to affect error rates [8, 9], the following base in the read is tracked to allow accurate simulation of indels within homopolymers. Sequence aligners record these errors either at the start or end of a homopolymer. By taking the following base into consideration, indels are only inserted once within long homopolymers, at the end, rather than potentially multiple times within the homopolymer. Empirical distributions for tracked information are stored to a file and used as error models for input into GemReads.
If a particular sequence-context word is not contained at least some minimum × number of times (default × = 4) within the reference genome, then the model is updated using information based on the longest nucleotide sequence that occurs at least × times in the reference. For instance, if AACTG is missing from the reference, however ACTG can be found five times, then the matrix entry where T is the current position, G is the following position, and AAC are the three bases before the current position is updated by considering all other matrix entries where the two bases before the current position are AC, the current position is T, and G is the following position.
GemStats takes an error model generated by GemErr, and calculates a set of statistics based on this model - for paired-end reads, models for the first and second read in a pair are considered separately. Statistics reported include the overall mismatch, insertion, and deletion error rates; the error rate for each nucleotide; and the error rate by base position within the read. Additionally, any sequence-context words with an error rate more than two standard deviations greater than the average error rate are also reported.
GemHaps takes a genome sequence, and an input command specifying haplotype frequencies and the number of SNPs in each haplotype (here, haplotypes are defined as a group of related nucleotide sequences, each differing by at least one SNP). The positions of mutations are randomly determined, and haplotypes are written to a file for input into GemReads. Alternatively, users may create their own tab-delimited haplotype file (for instance, describing SNPs generated according to an evolutionary model).
Read length and insert length (paired-end only) are randomly chosen from the empirical distributions defined by the error model (read length may also be set to a static value).
In metagenomics mode, an input reference genome is chosen with probability proportional to the species abundance and genome size.
Read location and direction are chosen at random, and the read is copied from the input genome.
Read is assigned to a haplotype (if supplied) and updated with SNPs, where appropriate.
Errors are introduced according to the error models, accounting for read position, sequence-context word, and 1st or 2nd read in pair (for paired-end reads).
Quality scores are assigned using recorded empirical distributions contained in the error model file.
For paired-end reads, steps 1-6 are repeated, however the insert size is used in conjunction with the previous read position and direction to determine the read location. Our error model also tracks how many read pairs have one read that does not align. We simulate this by generating a read that consists entirely of Ns with low quality scores, forcing it not to map. For instance, for 100 bp paired-end Illumina sequencing data, one read would contain 100 Ns, each with a quality score of 'B' or 2.
Memory and runtime for Illumina and Roche/454 simulations
Data processed (bases)
GemErr, Illumina v4
0.9 × 108
GemErr, Illumina v5
1.1 × 108
1.3 × 107
GemReads, Illumina v4
5.0 × 108
GemReads, Illumina v5
5.0 × 108
4.0 × 108
Number of low-frequency (< 4) k-mers in reference and control genomes
The fact that most 5-mers are contained within the control genomes used also supports the notion that the derived error models can be used to accurately simulate reads from any unrelated reference genomes. For the 10% of 5-mers not well represented within the control genomes, GemSIM derives an error rate based on the relevant 4-mer (or 3-mer, for the one PhiX 4-mer mentioned above).
Error model statistics for Illumina v4, Illumina v5, and Roche/454
Ill. v4 1st read
Ill. v4 2nd read
Ill. v5 1st read
Ill. v5 2nd read
1st most freq. (%)
GGGTA - > GGGGA (4.47)
ACAAG - > ACACG (3.94)
GGGTC - > GGGGC (5.85)
AGGTG- > AGGGG (3.69)
AAACA - > AAAAA (1.07)
2nd most freq. (%)
AGGTG - > AGGGG (3.71)
AGGTG - > AGGGG (3.29)
CTCGG - > CTCCG (5.83)
CGGTG - > CGGGG (2.7)
CCCAC - > CCCCC (1.02)
3rd most freq. (%)
CCCAA - > CCCCA (3.15)
CCCAA - > CCCCA (3.24)
GGGCG - > GGGGG (4.06)
GGGTG - > GGGGG (2.45)
CCCCG - > CCCAG (0.75)
4th most freq. (%)
CGGTG - > CGGGG (3.06)
GGGTA - > GGGGA (3.14)
CGGTG - > CGGGG (3.65)
GGGTC - > GGGGG (2.03)
AAAGG - > AAAAG (0.70)
5th most freq. (%)
GGGTG - > GGGGG (2.71)
ACAAA - > ACACA (2.97)
GGGTA - > GGGGA (3.20)
CGGTC - > CGGGC (1.98)
AGGAA - > AGGGA (0.52)
While insertions were roughly twice as common as deletions in the Illumina runs, in general, indel error rates in both Illumina runs were two to three orders of magnitude lower than in the Roche/454 run. This is consistent with previous findings that mismatches are the predominant forms of error in Illumina, while indels are the most frequent errors in Roche/454 sequencing [20, 22], showing that the GemSIM error-modeling approach captures these key characteristics of the different sequencing platforms.
For each error model, simulated reads were drawn from a set of five haplotypes derived from the B. aphidicola reference genome, with frequencies of 1%, 3%, 5%, 7% and 84%. The four low-frequency haplotypes each contained 100 randomly placed SNPs and the same haplotypes (with their associated SNPs) were used for all simulations and downstream analysis.
SNP-calling was performed using VarScan v2.2.3  on pileup files generated with SamTools v0.1.12a . As we were interested in identifying low-frequency true SNPs and associated false positives (errors), the minimum variant frequency parameter was set to zero. Minimum coverage was set to 100 and the minimum number of reads supporting each SNP was set to five. As VarScan largely depends on individual base quality scores to distinguish between true SNPs and sequencing errors, we varied the minimum average quality (M.A.Q.) parameter from 10 to 40 and investigated its interaction with specific error profiles and SNP-calling accuracy.
SNP-calling was highly accurate for SNPs with a frequency > 3% for all sequencing platforms simulated. Roche/454 showed a slightly lower true positive rate (i.e. an increased false negative rate) than the Illumina simulations. When using VarScan with a M.A.Q. of 20, 379 out of 400 SNPs were identified (86, 97, 97, and 97 for frequencies 1, 3, 4 and 7%, respectively). In contrast, all 400 SNPs were identified from both Illumina simulations. Upon closer inspection, 11 of the false negatives with true frequency of 1% failed to be supported by five reads. All the remaining false negatives were associated with homopolymer indel errors. Inspection of the pileup file showed these SNPs were contained in the data; however VarScan reported them as indels instead.
Any inaccurate SNP calls within +/- 1% of a known haplotype frequency were classed as false positives. For Illumina v5 and Roche/454, all false positives had a frequency under 1% (+/- 1%). Illumina v4 showed a drastically increased false positive rate, as can be expected from the higher average error rate of this run. Despite this, false positives were still restricted to under 3% (+/- 1%) population frequency. As M.A.Q. was increased, however, some false positives with a true frequency of 1% (+/-1%) were now given a frequency by VarScan of 3% (+/-1%). This can be seen as spikes in the frequency = 3% graph, between M.A.Q. 30 and M.A.Q. 40. This can be understood by considering how the M.A.Q. parameter works. The VarScan manual states that M.A.Q. is the 'minimum base quality at a position to count a read' . This means VarScan uses M.A.Q. to select a subset of reads from which to make a call. Thus increasing M.A.Q. reduces the sampling of reads, which in turn reduces the accuracy of the SNP frequency calculation when frequencies are small. This highlights the need for a detailed understanding of any chosen data and analysis pipeline and the value of performing benchmarking with simulated data.
Number of genomic sites considered by VarScan for increasing values of M.A.Q
By considering errors within their sequence-context in real data, GemSIM captured known features of Illumina and Roche/454 error profiles, thus validating our approach. GemSIM therefore facilitates independent, objective, and comparable simulation of both Roche/454 and Illumina sequencing data. Analysis of the error models created by GemSIM also provided new insights into error profiles, specifically that there were substantial differences between the error profiles of the Illumina v4 and Illumina v5 sequencing runs. While it is not clear whether this difference is due to the change in chemistry or to some other factors, it does show that sequencing runs can vary substantially, even when performed by the same sequencing provider using the same machine. Furthermore, differences between these error profiles have a substantial impact on downstream analysis, as shown by our study of SNP-calling accuracy in simulated data.
Our findings call for empirically derived, run-specific error models in sequencing simulation. GemSIM meets this need with a set of python scripts that can be run on a standard desktop computer. By allowing analysis of run-specific error models created from the user's own data, GemSIM helps researchers to identify unique features of their data - understanding of which may be invaluable for downstream data analysis. Error-model analysis also facilitates quality control and identification of 'bad quality' sequencing runs, such as the Illumina v4 run described here. Finally, simulation of sequencing data based on empirically-derived error models allows researchers to choose the most appropriate sequencing platform for their project, assess the impacts of errors on downstream data analysis, and objectively interpret any findings.
GemSIM is most suited to simulating resequencing or metagenomic projects where known reference sequences exist, as it relies on the presence of a reference sequence to initially generate reads (error models are then superimposed on top of the read). For example, GemSIM can be used in a deep resequencing project to establish at which coverage any further sequencing may not provide any increase in SNP detection accuracy. GemSIM may also prove itself valuable in developing and benchmarking de-novo assemblers, by assessing how well a known genome can be reconstructed from simulated reads. By providing a manually modified reference genome to GemSIM, users could also simulate reads to assess the detection of large genomic rearrangements via de-novo assembly.
Future improvements to GemSIM may include increasing the number of bases tracked before the current position during error model construction, as it is possible that error profiles are even more heterogenous than reported here. For Roche/454 sequencing, indel errors are known to increase with increasing homopolymer length, while there is evidence to show that Illumina sequencing accuracy can be influenced by the sequence up to 10 bases before the current position . Currently, the number of bases before the current position is limited by both memory requirements and the need for the sequence-context word to be present in the control dataset. With future improvements in computing power and memory handling, it will be feasible to allow users to optionally increase the sequence-context word length, when appropriate.
As new sequencing technologies emerge, we will also continue testing and developing GemSIM for compatibility. Recently released platforms, such as the Illumina's MiSeq and Pacific Biosciences RS system, can be readily assessed as they are compatible with the two generic formats required by GemSIM, FASTQ and SAM.
The error models described in this paper are provided with the GemSIM package as generic technology-specific error models, for users who do not have access to control data. New error models for different platforms and chemistries will be supplied, as they become available.
Project name: GemSIM.
Project home page: http://sourceforge.net/projects/gemsim/
Operating system(s): platform independent.
Programming language: Python 2.6
Other requirements: Numpy, Python 2.6
License: GNU GPL v3.
Any restrictions to use by non-academics: none.
KM was supported by an Australian Postgraduate Award. FL was supported by an NHMRC Fellowship (Nos. 510428). Illumina sequencing was funded by the Australian Cystic Fibrosis Research Trust. This project was also supported by the National Health and Medical Research Council of Australia (NHMRC) Program Grant No. 510448. We would like to thank Rowena Bull and Andrew Lloyd for provision of the plasmid control data.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.