Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data
© Orton et al.; licensee BioMed Central. 2015
Received: 6 August 2014
Accepted: 9 March 2015
Published: 24 March 2015
RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences. Next generation sequencing is revolutionising the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population. Identification of low frequency variants is important for our understanding of mutational dynamics, disease progression, immune pressure, and for the detection of drug resistant or pathogenic mutations. However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing and sample processing, which can both be substantial.
We have created a novel set of laboratory control samples that are derived from a plasmid containing a full-length viral genome with extremely limited diversity in the starting population. One sample was sequenced without PCR amplification whilst the other samples were subjected to increasing amounts of RT and PCR amplification prior to ultra-deep sequencing. This enabled the level of error introduced by the RT and PCR processes to be assessed and minimum frequency thresholds to be set for true viral variant identification. We developed a genome-scale computational model of the sample processing and NGS calling process to gain a detailed understanding of the errors at each step, which predicted that RT and PCR errors are more likely to occur at some genomic sites than others. The model can also be used to investigate whether the number of observed mutations at a given site of interest is greater than would be expected from processing errors alone in any NGS data set. After providing basic sample processing information and the site’s coverage and quality scores, the model utilises the fitted RT-PCR error distributions to simulate the number of mutations that would be observed from processing errors alone.
These data sets and models provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing.
RNA viruses such as foot-and-mouth disease virus (FMDV) evolve rapidly due to their large population size, high replication rate and the poor proof-reading ability of their RNA-dependent RNA polymerase. Mutation rates of RNA viruses are cited to be between 10−3 and 10−6 mutations per nucleotide per transcription cycle [1-4], therefore mutations can potentially be introduced every time a viral genome is replicated. As a result, these viruses exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences [5-8]. The capacity of RNA viruses to evolve rapidly is a key driver of viral virulence, vaccine resistance and host-jumping (the infection of a new species and adaptation to life within the new host). Therefore, an understanding of the mutational dynamics of RNA viruses is essential for our understanding of viral disease progression, transmission and the development of antiviral therapeutics .
Next generation sequencing (NGS) techniques provide the means for rapid and cost-effective dissection of viral evolutionary dynamics at an unprecedented level of detail [10-18]. The massively parallel and high-throughput nature of NGS platforms, combined with the relatively short genome of an RNA virus, enables the analysis of viral samples (which can contain billions of virions) to a very high depth. Such ultra-deep coverage of the genome enables the diversity of the whole viral population to be examined and subsequently compared between samples to investigate evolutionary events such as selection and bottlenecks. Furthermore, high depth enables the identification of important variants present within the viral population at low frequencies, such as those that increase pathogenicity or convey drug resistance  which would pose a risk if they become dominant in later populations due to selection. However, a current problem with the application of NGS platforms to viral population analysis is that true low frequency viral variants cannot be effectively distinguished from variants caused by errors during sample preparation or sequencing.
Focussing on the Illumina platform, after RT and PCR, viral samples undergo fragmentation via sonication (or nebulization) to generate DNA fragments of various sizes. Typically 300-400 bp fragments are then selected by gel extraction. The process of sonication results in approximately half of the sample DNA being destroyed via evaporation, and only ~10% of the remaining DNA is located in the 300-400 bp range . This results in a total sample loss of ~95% and therefore represents a sampling bottleneck of 5%. However, the latest Illumina Nextera DNA sample preparation kit simultaneously fragments and tags sample DNA with adapters in a single step (tagmentation) and results in significantly less sample loss  and hence requires substantially less starting DNA. Illumina library preparation involves the ligation of Illumina adaptors and unique sequencing tags onto the ends of DNA fragments followed by 10 cycles of PCR to amplify the library ready for sequencing on an Illumina platform.
Base calling on the Illumina platform is complicated by factors including similar emission spectra of the A/C and G/T fluorophores as well as phasing and pre-phasing issues . In terms of error rate, the Illumina platform is considered to be better than its rivals [23-25] and every sequenced base in every read is assigned a quality score that is a measure of the uncertainty of the base call using the Phred scale. At one of the highest quality scores Q40, there is a 1 in 10,000 probability of a base being called incorrectly. Although this is a very low probability, the ultra-deep coverage of viral samples can result in a coverage of 10 K-100 K at each position in the genome. This implies that even if every base were of the highest quality, we would still expect errors to be introduced via base miscalls from the Illumina machine. Furthermore, it has been reported that Illumina machines have a sequence specific error profile whereby errors occur more frequently around certain motifs such as GGC and GGX [26-28]. Such systematic errors are poorly understood given that motifs such as GGC (the codon for glycine) occur very frequently in DNA, and yet it is only a small number of these motifs in a genome that suffer such errors, suggesting that there are other issues involved such as the DNA sequence further upstream of the motif.
The introduction of erroneous mutations during the sample preparation process and miscalls during base calling confounds the identification of true low frequency viral variants. Consequently, true low-frequency variants will be practically indistinguishable from process error, whilst high frequency viral variants will be easily identified, as they are observable at levels much higher than can be attributed to error. However, there have been few experimental analyses to determine thresholds above which a variant is highly likely to be real. Furthermore, there are very few computational methods available to distinguish true viral variants from erroneous mutations, with the majority of methods tailored to the 454 platform such as AmpliconNoise  and ReadClean454 . Other tools aimed at detecting low frequency variants include Segminator II  which is specialised towards temporally sampled data, V-Phaser which utilises information on the co-location of variants on reads [32,33], Lo-Freq which utilises quality scores to model base miscalls , and an approach which incorporates re-sequencing with reads distribution and strand bias . However, these tools do not consider the error effects of the RT or PCR processes; although Flux Simulator  does consider these processes it is specialised towards RNAseq datasets.
Here we focus on FMDV which has a genome of ~8.3 kb and is a non-enveloped, positive-sense, single-stranded RNA virus in the Aphthovirus genus of the family Picornaviridae. FMDV is the causative agent of the highly contagious and economically serious foot-and-mouth disease (FMD), a vesicular disease of cloven-hoofed animals, which can spread extremely rapidly and has the potential to cause enormous economic losses. In this study, we have created a set of novel control samples in the laboratory that are all derived from a DNA plasmid containing a full-length FMDV cDNA with extremely limited diversity in the starting population. One sample was ultra-deep sequenced on the Illumina platform without any RT-PCR sample processing, whilst the other samples were either PCR or RT-PCR amplified prior to sequencing. This enabled the level of error introduced by the RT and PCR processes to be individually assessed and also enabled minimum frequency thresholds to be set for true viral variant identification. We combined these experimental data sets with a genome wide computational model of the sample processing and NGS calling process to gain a detailed understanding of the error rates and thresholds at each step. Furthermore, we demonstrate how the model can also used to investigate a specific site of interest in any NGS data set, by utilising the site’s coverage, quality scores, and the fitted RT-PCR error distributions, one can investigate if the number of mutations observed are more than would be expected from processing errors alone. Combined, these data sets and the computational model provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing.
Control sample and data preparation
Sequence: A 7,856 bp DNA fragment containing 80% of the FMDV genome was cut from the undiluted linearised pT7S3-O1K B64 plasmid using restriction enzyme digestion with PstI and then gel extracted. No RT or PCR was required to amplify this clonal sample.
PCR-low: Undiluted and linearised pT7S3-O1K B64 plasmid was subjected to two independent PCR amplifications, 19 cycles each, to generate 2 independent but overlapping PCR products - PCR1 and PCR2 - which are 4065 bp and 4033 bp in length, respectively. The DNA polymerase (DNApol) used in this PCR process was Platinum Taq High Fidelity (Invitrogen).
PCR-high: Linearised pT7S3-O1K B64 plasmid was diluted to 106 plasmid DNA copies per μl (so as to result in the same amount of final DNA product as PCR-low), and then subjected to two independent PCR amplifications of 39 cycles each, to generate PCR products PCR1 and PCR2. PCR efficiency was measured with efficiency measured at 80% during the initial stages of PCR (see Additional file 1). The 106 starting copy number and 39 cycles of PCR are the optimised values used to process FMDV samples for NGS sequencing determined from our previous studies , where further details of the laboratory protocol can be found.
RT-PCR: Linearised pT7S3-O1K B64 plasmid was diluted 1000 fold to 106 plasmid DNA copies per μl, was first transcribed from DNA to RNA using the MEGAscript T7 kit (Ambion), and then reverse transcribed using Superscript III (Invitrogen) to generate cDNA followed by amplification via 39 cycles of PCR to generate PCR products PCR1 and PCR2, as above.
The Sequence control sample covers a slightly different portion of the plasmid genome than the remaining three PCR amplified samples due to the different location of the PstI restriction site to the previously optimised PCR primer sites (Figure 2); restriction site selection was influenced by biosecurity measures whereby no more than 80% of the FMDV genome could be transported out of the high security area for sequencing. To enable direct comparisons between all the samples, we focus our analyses on the portions of the FMDV genome that are shared between all of the samples (Figure 2). The ends of the DNA fragments are ignored due to large spikes in coverage at these positions, due to sonication bias.
Control samples were sequenced at the Polyomics Facility, University of Glasgow, on an Illumina Genome Analyzer IIx. Briefly, DNA was fragmented using sonication and the resultant fragment distribution assessed on an Agilent BioAnalyzer 2100 platform. After size selection of fragments of between 300 and 400 bp, a library of purified genomic DNAs was prepared by ligating adapters to the fragment ends to generate flow-cell-suitable templates. A unique 6-nt sequence tag for multiplexing was added to each control sample followed by PCR of 10 cycles using Phusion Oy (Finnzymes) DNApol. The amplified and tagged DNA samples were then pooled and attached to the flow cell by complementary surface-bound primers, isothermal bridging amplification formed DNA clusters for reversible terminator sequencing, yielding reads of 73 nucleotides.
Reads were processed using a previously developed pipeline [17,18]; however, other tools such as bwa  and bowtie  could be used as alternatives. Reads were first trimmed from 73 nt to 70 nt due to the very poor quality of the last 3 bases and then filtered, with reads discarded if they had an average probability of error per nt greater than 0.1%. The trimmed and filtered reads were then aligned to the plasmid reference genome with a simple, custom-made scoring algorithm, previously described in [17,18]. We further trimmed the first and last 6 nts of each aligned read, as they showed a higher number of mismatches to the reference sequence due to insertions or deletions close to the edges of the reads [17,18,20,41]. Information on the aligned reads is then piled-up to give the reference base, coverage and the number of As, Cs, Gs and Ts at each genome position along with an average probability of a sequencing error derived from the quality scores. The fastq files from all four samples have been uploaded to the EBI SRA repository (http://www.ebi.ac.uk/ena/) with the accession numbers ERR776658, ERR776657, ERR77656, and ERR776655 for the RT-PCR, PCR-High, PCR-Low and Sequence controls respectively.
The computational model operates at the population level. Instead of modelling 109 or 106 individual genomes (viral or plasmid) through the exponential PCR process, which would be computationally intensive, we view the population as an array or alignment of 106 initial genomes and operate on the numbers of As, Cs, Gs, and Ts present at each genome position. This population view is based on the assumption that the copying of each individual base in each genome during PCR (or RT) is independent and is therefore a single binomial trial with a probability of success (a mutation) equal to the DNA polymerase error rate. Therefore, with 106 genomes being duplicated in the first PCR cycle, each individual genome position is independently copied 106 times with a mutation potentially introduced each time. An example of the alignment view and error accumulation through PCR cycles is presented in Additional file 1.
Where b is the number of polymorphic bins, and E i and M i are the number of nucleotide sites (represented as a proportion of all sites) in the ith polymorphic bin, for the experimental (E) and model (M) mutation spectrums, respectively. For model fitting and parameter estimation, we use a Sequential Monte Carlo (SMC) algorithm [42-45], a form of Approximate Bayesian Computation (ABC). We use the SMC algorithm proposed by , further details of which are presented in the Additional file 1, but briefly a parameter (or particle) θ is sampled from a prior distribution π(θ) and propagated through a sequence of intermediate distributions π(θ|d(x0, x*) ≤ εi) until it represents a sample from the target distribution π(θ|d(x0, x*) ≤ εT). d is the distance between the model simulated dataset (x*) and the laboratory data set (x0) calculated by the sum of squares score above. The tolerances εi are minimum distance scores and are chosen such that εi > … > εT ≥ 0, thus the distributions gradually evolve towards the target posterior. For the model fitting, we used uniform prior distributions and a dynamic tolerance schedule that progressively decreases the tolerance at each step. The SMC process essentially fits the model parameters to the given data set, and thus provides estimates of the error rates of the enzymes used during RT and PCR, as well as the underlying distributions involved.
Set Initial Diversity: The first step is to set the initial diversity of the viral population. This can be set to Clonal (no diversity), Fixed (all genome positions have the same initial mismatch frequency, or Sampled where the initial mismatch frequency of every genome position is drawn from a Beta distribution (αS, βS) where αS, βS are the two shape parameters of a Beta distribution; αS, βS can either be defined or added to the parameters to be estimated. For this study, we use the Clonal (no diversity) setting.
Determine RT enzyme error profile: Assuming the RT enzyme is more likely to make errors (mutations) at some genome positions than others (in a sequence specific manner) the RT enzyme error rate for each genome position is randomly selected from a Beta distribution (αR, βR). Alternatively, one can simply use a single RT enzyme error rate for all genome positions.
Transcription and Reverse Transcription: As our starting clonal sample is DNA, we first transcribe our DNA to RNA so that it can be reverse transcribed back to cDNA, enabling us to evaluate errors that impact upon the RNA template. We therefore represent this process as two distinct steps but where transcription uses the same parameters as reverse transcription, as we are unable to disentangle these two processes from each other using the data sets available. Transcription and reverse transcription are linear copying processes, with the original template removed after completion. Each genome position is considered in turn, where the population at each of the 4 bases (A, C, G, and T) at that position is also considered in turn. Error introduction is modelled by a binomial distribution (n ij , pRi ), where pRi is the RT enzyme error rate at genome position i and n ij is the number of the jth base currently at genome position i. A random draw from this distribution gives the number of errors made, each error is randomly assigned to one of the other 3 bases at that genome position whose population increases by one, whilst that of the transcribed base decreases by one.
Determine DNApol error profile: Assuming the DNApol enzyme is more likely to make mutations at some positions than others, the DNApol error rate for each genome position is also randomly selected from a Beta distribution (αD, βD). Alternatively, one can simply set the DNApol error rate to be the same for all genome positions.
Sample preparation PCR: Each genome position is considered in turn, where the population at each of the 4 bases (A, C, G, and T) at that position is also considered in turn. First, if PCR efficiency is not 100% the number of bases to be copied is determined from a binomial distribution (n ij , pC), where pC is the PCR efficiency for PCR cycle c and n ij is the number of base j at genome position i; the population of base j is then increased by the number of bases to be copied. Second, the number of mutations that were made during copying is determined from another binomial distribution (n ij , pMi ) where pMi is the DNApol error rate at genome position i and n ij is the number of the jth base copied at genome position i (determined from first step); Poisson and then normal approximations to the binomial are used when the population of n ij is high (>100 and >109, respectively). Third, the population of the copied base is decreased by the number of mutations to be made, and for each mutation the resulting mutated base is randomly selected to be one of the 3 other bases whose population increases by 1.
Sonication: The 5% sampling bottleneck caused by sonication is represented by a hyper-geometric distribution (which describes the probability of k successes in n draws). Here we use the Poisson approximation of the hyper-geometric for speed, and the corresponding normal approximation when the base population (n) is large. A Poisson distribution (λ ij ) is used if the base population is <109 where λ ij is the mean and equal to [5% × population at genome position i base j]; for base populations >109 we use the normal approximation of the hyper-geometric.
Illumina Library PCR: The 10 cycles of PCR performed during the Illumina library preparation proceeds in the same manner to the sample preparation PCR above, except that the DNApol error rate is fixed for each genome position to the published error rate of Phusion Oy (Finnzymes; 1 in 2,272,727).
Illumina Sampling: The reads obtained from an Illumina sequencing run represent a small sample of the amplified Illumina library. To represent this bottleneck, a sample of bases is randomly selected stochastically from the population present at each genome position, with the number of bases to be sampled given by the read coverage at that genome position from the corresponding experimental data set.
Illumina Error: Every base in every read is assigned a quality score (Q) by the Illumina machine, which can be converted into a probability of error (p) with the formula p = 10- Q/10. For each genome position, a mean probability of error can then be calculated from all the quality score p values aligned at that position. The number of Illumina errors at each position is then modelled by randomly drawing from a binomial distribution (n i , pEi ) where pEi and n i are the average probability of a sequencing error and the coverage at genome position i, respectively.
Variant frequency thresholds
Table 1 shows that the maximum observed mismatch frequency in the RT-PCR sample was 0.66%; in the mutation spectrum this genome position is placed in the bin that has a midpoint of 0.5% (Figure 4). We can be highly confident that any mutation observed above this frequency is true, for samples processed with this protocol. However, there are likely to be a number of true viral variants present below this threshold. Figure 4 shows that the real FMDV sample begins to deviate from the RT-PCR sample at the 0.157% threshold, with more variants being observed in the real sample from this point on. A total of 226 variants are observed above this threshold in the RT-PCR control sample, whilst 516 variants are observed in the real FMDV sample, 290 of which are therefore likely to be true. A threshold of 0.66% would correctly identify all the errors (100% specificity), but would only identify 32 of the 290 likely variants in the real FMDV sample as true. Lowering the threshold by half to 0.3% results in a doubling of the number of likely true variants identified to 75 at a relatively small cost with specificity only dropping to 99.86%. Whilst at a threshold of 0.15%, specificity drops to 96% with 226 errors falsely identified as true, but all 290 are the likely true variants also identified. It is important to note that a threshold will be dictated by the actual sample processing used and must also be taken in context. The fidelity of the enzymes used and the number of the PCR cycles will highly influence threshold setting, whilst the coverage of a genome position and the quality of it’s aligned reads will influence a threshold in a site specific manner.
Next we characterised all the base substitutions that occurred in each of the clonal samples. In the Seq control sample, by far the most frequent base miscall was G to T (a G miscalled as a T) which represented approximately 23% of all Illumina base miscalls, followed by T to C, C to T, G to A and A to G which varied between approximately 12% and 15% (Additional file 1). Over 40% of all Illumina errors occurred at G positions, even though there were less G positions in the genome sequence considered than A or C positions; the remaining 60% of errors were distributed equally between the A, C and T positions. This suggests that the Illumina machine is more prone to error at G positions and could reflect the known issues with GGC and GGX motifs discussed previously. For the PCR samples, the most common base substitutions are those representing transitions between the purines (A to G and G to A) and pyrimidines (T to C and C to T), all of which are dominant within the dataset; although the A to G and T to C transitions are more abundant. Although substantially lower than these transitions, the G to T is by far the highest transversion, highlighting the contribution of the above Illumina error into the overall mutation spectrum. The RT sample shows the same characteristics as the PCR sample.
Overall, the mutation spectra suggest that Illumina error is clearly observable and should therefore be considered when examining viral populations at ultra-deep coverage. The main bulk of the error appears to come from the PCR process rather than the RT step, but RT does substantially affect the high frequency end of the spectrum. Although RT enzymes have higher error rates than their PCR counterparts, PCR utilises numerous amplification cycles and as errors are cumulative the amount of PCR error in the sample increases with each cycle.
Similarly, for the RT-PCR model, there is again good agreement between the simulated and experimental data sets when one assumes the RT enzyme is more likely to make mutations at some positions than others (Figure 5C). The model predicts a highly skewed beta distribution (0.19, 3074 to represent the error rate of the RT enzyme used (Superscript III – Invitrogen), with a low median error rate of 5.31 × 10−6 but with a mean an order of magnitude higher at 6.13 × 10−5 substitutions per nt copied (Figure 6), which is within the range of published error rates for Superscript III (6.66 × 10−5 to 3.12 × 10−5; Invitrogen and ) which again varies depending on the assay used. The 95th percentile of this distribution has an error rate of 3.2 × 10−4 suggesting some genome positions are highly prone to error during the reverse transcriptase process. This could be due to sequence specific effects or perhaps the 2D/3D structure of the RNA influencing where the reverse transcriptase makes an error.
Figure 8 is an example of setting a frequency threshold in the context of a genome position of interest. In such a case, one already knows the coverage and quality scores, both of which can greatly affect a frequency threshold. Coverage directly affects frequency (at a coverage of 100 a single erroneous mutation would result in a frequency of 1% but is unlikely to be trusted), whilst a site with poor quality could similarly lead to a high error frequency from machine miscalls alone. Therefore, the model can be used to investigate the possible affect the RT and PCR processes are having at a specific site given what is already know from the sequence data. We envisage that the model would be used in this case for a select number of genome positions, such as though at the higher scale of the mutation spectrum, or specific sites of interest such as those that convey drug resistance, pathogenicity or become fixed later on in a transmission chain, due to the computational cost of running the model across the whole genome.
NGS technology provides the means to go beyond consensus sequences and investigate the population structure of viral samples. The introduction of erroneous mutations during the sample preparation process and miscalls during sequencing confounds the identification of true low frequency viral variants. Identifying such variants is important when tracing the source and development of specific mutations of interest, or when comparing the total diversity within viral populations, such as those before and after a bottleneck. In this study, we generated and analysed a collection of novel NGS data sets that reveal the amount of error introduced during each viral sample processing step and used them to set a minimum frequency threshold of ~0.5% to separate real mutations from process error. However, this threshold could be lowered at the cost of specificity, and will naturally vary depending on the specifics of the sample processing steps used such as enzyme fidelity and number of PCR cycles. We complemented this with a genome-scale model to gain a deeper understanding of each processing step.
The data and model show that despite high quality read filtering, Illumina error is still observable, although typically at relatively low frequencies. The introduction of PCR resulted in a substantial increase in the observable error to higher mismatch frequencies. However, using the highest fidelity DNApol enzymes and minimising the number of PCR cycles will logically reduce the amount of PCR error introduced. The model predicted the mean error rate of the DNApol used (Platinum Taq High Fidelity; Invitrogen) to be 2.64 × 10−5 substitutions per nt copied, which is only slightly higher than the manufacturers published error rate of 1.8 × 10−5 (6 times the error rate of standard taq; Invitrogen); a further study using novel BEAMing technology estimated a lower bound error rate of 1.4 × 10−5 . However, the measured error rate is extremely dependent on the assay used, with estimates for Taq itself range from 1.1 × 10−4  to between 2.3 × 10−5  and 2.7 × 10−5 . In this report, we have used a novel approach using NGS reads to estimate the error rate, and propose that the error rate is in fact best represented with a distribution rather than a single value. The model predicted a mean error rate of the RT enzyme used (Superscript III; Invitrogen) to be 6.13 × 10−5 substitutions per nt copied, which is within the range of published error rates for Superscript III (6.66 × 10−5 to 3.12 × 10−5, Invitrogen and ) which again varies depending on the assay used. Our RT-PCR sample was first transcribed into RNA using T7 polymerase (Ambion) before reverse transcription back into DNA. Therefore, the RT error rate estimated by the model is applied to both the transcription and reverse transcription steps as these two steps cannot be fitted separately with the available data. If the transcription step is less error prone, then the RT step will in fact have a higher error rate than we have reported here to compensate, and vice versa.
The model predicts that not all genome positions are equal and that some sites are much more prone to PCR error than others. This novel model prediction could be due to the sequence composition of surrounding bases influencing the likelihood of DNApol making an error, which has previously been reported for human polymerase v . The model also makes a similar prediction for the RT process, which resulted in a highly skewed Beta distribution with some genome positions highly susceptible to error. Interestingly, this does not appear to be specific to the RT-PCR enzymes (Platinum Taq High Fidelity and Superscript III) or viral sequence (FMDV) that we used in this study. Similar findings have been reported with Hepatitis C virus , using different RT-PCR enzymes combined with a consensus sequencing strategy of numerous molecular clones. They reported that RT-PCR errors were not evenly distributed, but were concentrated in specific hotspot regions, one of which coincided with a known region of hyper-variability in the viral genome. This suggests that any specific sequence context or secondary structure that negatively impacts on the RT-PCR enzymes may also affect the viral RNA polymerase itself. Therefore, hyper-variable regions in viral genomes may be, in part, the result of natural polymerase mutational hotspots. Overall, identification of the sequence and structural signatures of mutational hotspot regions, and a comparison of these between different viruses would make an interesting study and could lead to novel insights into what drives the mutational dynamics of viruses.
From the data itself we identified a minimum frequency of ~0.5% above which one can be reasonably confident that an observed mutation is real. However, a minimum mismatch frequency threshold should be interpreted with some caution and must always be considered in context, as any mismatch frequency is highly dependent on the coverage at a genome position. For example, 1 observed mismatch at a genome position with a coverage of 100 leads to a mismatch frequency of 1% which would appear relatively high. However, a single mutation is not reliable given the error prone nature of the processes, and a coverage of 100 would be considered low compared to the average of 34,000 observed in our Seq control data set, suggesting an underlying problem with that genome position that has resulted in limited coverage. Additional filters should therefore be applied when validating the mismatches at a particular genome position, such as minimum coverage cut-offs, and validation against the number of mismatches expected given the sample processing procedure.
To address this, our model can also be used to investigate if the mutations observed at a given genome position are real, by simulating the number of mutations that would be expected from sequencing and sample processing error alone, and comparing this distribution to the observed number. This could be applied to any viral Illumnia data set, and adapted to represent alternative RT-PCR enzymes if estimates of their error rates are known; alternatively, one could simply run the model multiple times scanning through a range of enzyme error rates. Although we used 39 cycles of PCR in our optimised protocol, alternate virus/system protocols may well utilise fewer cycles. As shown in Figure 4, when the number of PCR cycles is halved, the mutation spectrum shifts substantially to lower error frequencies. Therefore, one should logically limit the number of PCR cycles used if possible as PCR errors accumulate with each cycle. Alternatively, there are high-fidelity polymerases available that have published error rates orders of magnitude above the Platinum Taq polymerase used in our protocol, which will also greatly reduce the observed errors. However, in both cases the RT step remains which alone can still result in erroneous high frequency variants, although only at relatively few genome positions. In addition, as Illumina machines share the same sequence by synthesis chemistry, and as we utilise the quality scores outputted by the sequencer itself, our results are applicable to more recent machines.
Recent techniques have been developed [54,55] that utilise the fragmentation followed by circularisation of viral RNA. Each small RNA circle is then reverse transcribed multiple times to create a single cDNA strand with multiple copies of the original RNA sequence. The cDNA can then be directly sequenced (after library adapter ligation) and as each read will contain multiple copies (typically three) of the same viral RNA sequence, reverse transcriptase errors and sequencer base miscalls can be readily identified; a similar strategy to that of utilising overlapping read pairs . These approaches are very promising, however, they are currently limited to in-vitro samples due to the substantially large amount of initial viral RNA required, as they do not utilise PCR at any step. Therefore, our strategy still provides a valuable means for identifying and quantifying errors introduced during the processing of biological samples, such as field isolated during an epidemic, which typically have to amplified to achieve the required amount of RNA as standard. Furthermore, our results that reverse transcriptase is highly prone to error at some sites, may have implications for such circular re-sequencing techniques. An alternative approach to error handling is the incorporation of unique barcodes into sample DNA which have been successfully applied to viral population and fitness analyses [56,57].
Our work here has been focussed on the Illumina platform, but application of the same experimental techniques would be useful to assess and compare the error profiles, in terms of viral populations, of different platforms. It would also be possible to apply the models to other platforms, although this would likely involve substantial more work. Although a large proportion of the models, such as the RT-PCR component, could be directly applied to other sequencing platforms, the models would then need to be extended to represent each platform’s specific sample preparation protocol and sequencing error profile.
We have created a novel set of laboratory control samples that enabled the level of error introduced by the RT and PCR processes to be assessed and minimum frequency thresholds to be set for true viral variant identification. We combined this with a genome-scale computational model of the sample processing and NGS calling process to gain a detailed understanding of the errors at each step, which predicted that RT and PCR errors are more likely to occur at some genomic sites than others. The model can also be used to investigate whether the number of observed mutations at a given site of interest is greater than would be expected from processing errors alone in any NGS data set. These data sets and models provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing. Furthermore, the data sets themselves provide an ideal test set for the evaluation of viral variant calling tools to assess their ability to distinguish real viral variants from RT-PCR and sequencer errors.
Availability of supporting data
The raw FASTQ files from all four control samples will be deposited and made publicly available from the Sequence Read Archive (SRA) at the European Nucleotide Archive (ENA).
This work was supported by the Biotechnology and Biological Sciences Research Council, UK via a DTA PhD studentship (project BB/E018505/1), BBSRC standard grant (projects BB/I013784/1 and BB/I014314/1), and Epi-SEQ, a research project supported under the 2nd joint call for transnational research projects by EMIDA ERA-NET.
- Holland J, Spindler K, Horodyski F, Grabau E, Nichol S, VandePol S. Rapid evolution of RNA genomes. Science. 1982;215(4540):1577–85.View ArticlePubMedGoogle Scholar
- Drake JW. Rates of spontaneous mutation among RNA viruses. Proc Natl Acad Sci U S A. 1993;90(9):4171–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Duffy S, Shackelton LA, Holmes EC. Rates of evolutionary change in viruses: patterns and determinants. Nat Rev Genet. 2008;9(4):267–76.View ArticlePubMedGoogle Scholar
- Sanjuan R. Mutational fitness effects in RNA and single-stranded DNA viruses: common patterns revealed by site-directed mutagenesis studies. Philos Trans R Soc Lond B Biol Sci. 2010;365(1548):1975–82.View ArticlePubMed CentralPubMedGoogle Scholar
- Eigen M. Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften. 1971;58(10):465–523.View ArticlePubMedGoogle Scholar
- Eigen MaS P. The hypercycle—a principle of natural self-organization. Naturwissenschaften. 1978;65(1):7–41.View ArticleGoogle Scholar
- Holmes EC, Moya A. Is the quasispecies concept relevant to RNA viruses? J Virol. 2002;76(1):460–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Orton RJ, Wright CF, Morelli MJ, Juleff N, Thebaud G, Knowles NJ, et al. Observing micro-evolutionary processes of viral populations at multiple scales. Philos Trans R Soc Lond B Biol Sci. 2013;368(1614):20120203.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen-Harris H, Borucki MK, Torres C, Slezak TR, Allen JE. Ultra-deep mutant spectrum profiling: improving sequencing accuracy using overlapping read pairs. BMC Genomics. 2013;14:96.View ArticlePubMed CentralPubMedGoogle Scholar
- Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, Gharizadeh B, et al. Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008;4(4):e1000074.View ArticlePubMed CentralPubMedGoogle Scholar
- Hoffmann C, Minkah N, Leipzig J, Wang G, Arens MQ, Tebas P, et al. DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res. 2007;35(13):e91.View ArticlePubMed CentralPubMedGoogle Scholar
- Kampmann ML, Fordyce SL, Avila-Arcos MC, Rasmussen M, Willerslev E, Nielsen LP, et al. A simple method for the parallel deep sequencing of full influenza A genomes. J Virol Methods. 2011;178(1–2):243–8.View ArticlePubMedGoogle Scholar
- Margeridon-Thermet S, Shulman NS, Ahmed A, Shahriar R, Liu T, Wang C, et al. Ultra-deep pyrosequencing of hepatitis B virus quasispecies from nucleoside and nucleotide reverse-transcriptase inhibitor (NRTI)-treated patients and NRTI-naive patients. J Infect Dis. 2009;199(9):1275–85.View ArticlePubMed CentralPubMedGoogle Scholar
- Rozera G, Abbate I, Bruselles A, Vlassi C, D’Offizi G, Narciso P, et al. Massively parallel pyrosequencing highlights minority variants in the HIV-1 env quasispecies deriving from lymphomonocyte sub-populations. Retrovirology. 2009;6:15.View ArticlePubMed CentralPubMedGoogle Scholar
- Simen BB, Simons JF, Hullsiek KH, Novak RM, Macarthur RD, Baxter JD, et al. Low-abundance drug-resistant viral variants in chronically HIV-infected, antiretroviral treatment-naive patients significantly impact treatment outcomes. J Infect Dis. 2009;199(5):693–701.View ArticlePubMedGoogle Scholar
- Wang C, Mitsuya Y, Gharizadeh B, Ronaghi M, Shafer RW. Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res. 2007;17(8):1195–201.View ArticlePubMed CentralPubMedGoogle Scholar
- Wright CF, Morelli MJ, Thebaud G, Knowles NJ, Herzyk P, Paton DJ, et al. Beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing. J Virol. 2011;85(5):2266–75.View ArticlePubMed CentralPubMedGoogle Scholar
- Morelli MJ, Wright CF, Knowles NJ, Juleff N, Paton DJ, King DP, et al. Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts. Vet Res. 2013;44(1):12.View ArticlePubMed CentralPubMedGoogle Scholar
- Flaherty P, Natsoulis G, Muralidharan O, Winters M, Buenrostro J, Bell J, et al. Ultrasensitive detection of rare mutations using next-generation targeted resequencing. Nucleic Acids Res. 2012;40(1):e2.View ArticlePubMed CentralPubMedGoogle Scholar
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, et al. A large genome center’s improvements to the Illumina sequencing system. Nat Methods. 2008;5(12):1005–10.View ArticlePubMed CentralPubMedGoogle Scholar
- Caruccio N. Preparation of next-generation sequencing libraries using Nextera technology: simultaneous DNA fragmentation and adaptor tagging by in vitro transposition. Methods Mol Biol. 2011;733:241–55.View ArticlePubMedGoogle Scholar
- Kircher M, Heyn P, Kelso J. Addressing challenges in the production and analysis of illumina sequencing data. BMC Genomics. 2011;12:382.View ArticlePubMed CentralPubMedGoogle Scholar
- Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, et al. Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012;30(5):434–9.View ArticlePubMedGoogle Scholar
- Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012;13:341.View ArticlePubMed CentralPubMedGoogle Scholar
- Zagordi O, Daumer M, Beisel C, Beerenwinkel N. Read length versus depth of coverage for viral quasispecies reconstruction. PLoS One. 2012;7(10):e47046.View ArticlePubMed CentralPubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008;36(16):e105.View ArticlePubMed CentralPubMedGoogle Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res. 2011;39(13):e90.View ArticlePubMed CentralPubMedGoogle Scholar
- Meacham F, Boffelli D, Dhahbi J, Martin DI, Singer M, Pachter L. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics. 2011;12:451.View ArticlePubMed CentralPubMedGoogle Scholar
- Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ. Removing noise from pyrosequenced amplicons. BMC Bioinformatics. 2011;12:38.View ArticlePubMed CentralPubMedGoogle Scholar
- Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, et al. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog. 2012;8(3):e1002529.View ArticlePubMed CentralPubMedGoogle Scholar
- Archer J, Baillie G, Watson SJ, Kellam P, Rambaut A, Robertson DL. Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II. BMC Bioinformatics. 2012;13:47.View ArticlePubMed CentralPubMedGoogle Scholar
- Macalalad AR, Zody MC, Charlebois P, Lennon NJ, Newman RM, Malboeuf CM, et al. Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data. PLoS Comput Biol. 2012;8(3):e1002417.View ArticlePubMed CentralPubMedGoogle Scholar
- Yang X, Charlebois P, Macalalad A, Henn MR, Zody MC. V-Phaser 2: variant inference for viral populations. BMC Genomics. 2013;14:674.View ArticlePubMed CentralPubMedGoogle Scholar
- Wilm A, Aw PP, Bertrand D, Yeo GH, Ong SH, Wong CH, et al. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012;40(22):11189–201.View ArticlePubMed CentralPubMedGoogle Scholar
- Li M, Stoneking M. A new approach for detecting low-level mutations in next-generation sequence data. Genome Biol. 2012;13(5):R34.View ArticlePubMed CentralPubMedGoogle Scholar
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012;40(20):10073–83.View ArticlePubMed CentralPubMedGoogle Scholar
- Ellard FM, Drew J, Blakemore WE, Stuart DI, King AM. Evidence for the role of His-142 of protein 1C in the acid-induced disassembly of foot-and-mouth disease virus capsids. J Gen Virol. 1999;80(Pt 8):1911–8.PubMedGoogle Scholar
- Fijalkowska IJ, Schaaper RM, Jonczyk P. DNA replication fidelity in Escherichia coli: a multi-DNA polymerase affair. FEMS Microbiol Rev. 2012;36(6):1105–21.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.View ArticlePubMed CentralPubMedGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Lin W, Piskol R, Tan MH, Li JB. Comment on “Widespread RNA and DNA sequence differences in the human transcriptome”. Science. 2012;335(6074):1302. author reply 1302.View ArticlePubMedGoogle Scholar
- Sisson SA, Fan Y, Tanaka MM. Sequential Monte Carlo without likelihoods. Proc Natl Acad Sci U S A. 2007;104(6):1760–5.View ArticlePubMed CentralPubMedGoogle Scholar
- Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MP. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface. 2009;6(31):187–202.View ArticlePubMed CentralPubMedGoogle Scholar
- Beaumont M, Cornuet JM, Marin JM, Robert CP. Adaptive approximate Bayesian computation. Biometrika. 2009;96(4):983–90.View ArticleGoogle Scholar
- Csillery K, Blum MG, Gaggiotti OE, Francois O. Approximate Bayesian Computation (ABC) in practice. Trends Ecol Evol. 2010;25(7):410–8.View ArticlePubMedGoogle Scholar
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12(2):R18.View ArticlePubMed CentralPubMedGoogle Scholar
- Martinez F, Lafforgue G, Morelli MJ, Gonzalez-Candelas F, Chua NH, Daros JA, et al. Ultradeep sequencing analysis of population dynamics of virus escape mutants in RNAi-mediated resistant plants. Mol Biol Evol. 2012;29(11):3297–307.View ArticlePubMedGoogle Scholar
- Potter J, Zheng W, Lee J. Thermal stability and cDNA sythesis capability of SuperScript III reverse transcriptase. Focus. 2003;25(1):19–24.Google Scholar
- Li M, Diehl F, Dressman D, Vogelstein B, Kinzler KW. BEAMing up for detection and quantification of rare sequence variants. Nat Methods. 2006;3(2):95–7.View ArticlePubMedGoogle Scholar
- Barnes WM. The fidelity of Taq polymerase catalyzing PCR is improved by an N-terminal deletion. Gene. 1992;112(1):29–35.View ArticlePubMedGoogle Scholar
- Boulanger J, Muresan L, Tiemann-Boege I. Massively parallel haplotyping on microscopic beads for the high-throughput phase analysis of single molecules. PLoS One. 2012;7(4):e36064.View ArticlePubMed CentralPubMedGoogle Scholar
- Arana ME, Takata K, Garcia-Diaz M, Wood RD, Kunkel TA. A unique error signature for human DNA polymerase nu. DNA Repair. 2007;6(2):213–23.View ArticlePubMed CentralPubMedGoogle Scholar
- Domingo-Calap P, Sentandreu V, Bracho MA, Gonzalez-Candelas F, Moya A, Sanjuan R. Unequal distribution of RT-PCR artifacts along the E1-E2 region of Hepatitis C virus. J Virol Methods. 2009;161(1):136–40.View ArticlePubMedGoogle Scholar
- Acevedo A, Brodsky L, Andino R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature. 2014;505(7485):686–90.View ArticlePubMed CentralPubMedGoogle Scholar
- Lou DI, Hussmann JA, McBee RM, Acevedo A, Andino R, Press WH, et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc Natl Acad Sci U S A. 2013;110(49):19872–7.View ArticlePubMed CentralPubMedGoogle Scholar
- Wu NC, Young AP, Al-Mawsawi LQ, Olson CA, Feng J, Qi H, et al. High-throughput profiling of influenza A virus hemagglutinin gene at single-nucleotide resolution. Sci Rep. 2014;4:4942.PubMed CentralPubMedGoogle Scholar
- Mangul S, Wu NC, Mancuso N, Zelikovsky A, Sun R, Eskin E. Accurate viral population assembly from ultra-deep sequencing data. Bioinformatics. 2014;30(12):i329–37.View ArticlePubMed CentralPubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.