Accuracy and quality assessment of 454 GS-FLX Titanium pyrosequencing
© Gilles et al; licensee BioMed Central Ltd. 2011
Received: 20 July 2010
Accepted: 19 May 2011
Published: 19 May 2011
The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007, for the older GS20 system based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.
We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.
The pattern identified here calls for the use of internal controls and error-correcting base callers, to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
Scientific strategies and approaches based on next-generation sequencing (NGS) have been revolutionizing genetics over the last few years. Many aspects of basic, applied and clinical research now rely on the generation of enormous amounts of sequence data from various sample sources, to assess polymorphism (mostly SNPs) or expression data (RNA-Seq) at the genome level [1, 2]. This shift in the scale of sequence acquisition has been achieved by simultaneous progress in bioinformatics, the availability of genome assemblies and key technical findings in the domains of biochemistry and sequencing device physics. In this context, the 454 GS-FLX (Roche Diagnostics Corporation), Illumina® technology (Illumina, Inc.) and SOLiD™ systems (Applied Biosystems™) offer a number of complementary solutions for specific requirements (see Metzker for a review). 454 GS-FLX Titanium technology provides around 1,000,000 sequences in a single 10-hour run. These sequences, with an average read length of 330 bp, may be up to 500 bp long under shotgun library conditions, much longer than can be obtained with the other available approaches. This makes mapping easier, particularly for repetitive regions, and facilitates de novo genome sequencing, exome capture, metagenomics and amplicon sequencing.
One of the basic questions arising from this spectacular increase in sequence volume concerns the possible detrimental effects of this shift in quantity on the quality of the obtained data. In other words, is there a tradeoff between the quantity and quality of information? It is widely accepted that next-generation sequencing approaches generate such large amounts of sequence data that even if overall accuracy (derived from error rate) or quality (percentage of error-free sequences) is suboptimal it is still possible to reconstruct polymorphism rigorously by comparing redundant sequences that cover the same genomic region multiple times (i.e. depth of coverage provides accuracy, not the individual read) [5–8]. This is the typical "quick and dirty" view of NGS. This approach may sound reasonable, but it is based on assumptions such as low error rate and error randomness for the unambiguous detection of polymorphism. If this last assumption is challenged, even a low error rate has a huge impact on sequence analysis, as in cases of related allele detection, paralogous sequences or pseudogene identification. In these cases, the "quick and dirty" approach is inadequate, because consensus sequence calculation is accurate only if these three sources of sequence diversity are distinguishable from the error due to background noise .
In 2007, S. Huse and collaborators raised the question of the accuracy and quality of the massively parallel GS20 pyrosequencing system, performing an empirical analysis of the per-base error rate. This was needed because "the quality score of a position is not a measure of confidence that the correct base was called at that position, as with a traditional PHRED score. Instead, the GS20 quality score is a measure of confidence that the homopolymer length at a position is correct" [10, 11]. They used V6 hypervariable region sequences from cloned microbial ribosomal DNA for this purpose. They concluded that the accuracy rate was 99.51%, on average, and that 82% of the sequences contained no error. They also demonstrated that 39% of the errors corresponded to homopolymer effects. Finally, they detected no significant correlation between error and distance from the 5' end of the sequences over 101 positions. Surprisingly, despite changes in this technology over the last four years, the accuracy and quality of 454-based sequences have not been reevaluated, and this previous study remains the basic reference used by the scientific community to account for error rate in 454 GS-FLX systems (181 articles citing this study at the time of writing). Over the same period, chemistry, acquisition devices (CCD cameras in particular) and quality filtering algorithms have evolved. A new analysis is therefore required, and this was the main goal of this work.
Furthermore, in addition to estimating the per-base error rate, we aimed to identify the potential causes of sequencing errors and possible solutions for improving both the accuracy and quality of pyrosequences. We selected several variables likely to affect sequencing errors directly or indirectly: (i) the position of the nucleotide base within the sequence (the beginning of the sequence may be more accurate than the end), (ii) the primary structure of the sequence, including, in particular, the presence of homopolymers, (iii) the length of the sequence generated (a sequence may be short due to quality filtering, resulting from an accumulation of errors, or due to the stochastic ending of polymerization), and (iv) the position of the bead carrying the sequence both within and between the regions on a PT plate (PicoTiterPlate) (edge effect), and between multiple PT plates. Our analyses are based on Roche test fragments. These are sequences used for GS-FLX Titanium diagnostics that are included in all runs, but not subjected to PCR amplification before sequencing. These fragments therefore allow us to estimate the sequencing error due to pyrosequencing alone. Huse et al. found that the experimental sequences they used display an error rate five times higher than that of the GS20 Roche test fragments (0.49% vs. 0.1%). Since almost all of their results are based on experimental sequences, we cannot directly compare our results to theirs. However, we do not intend to focus on a general error rate, but rather to assess the effect of several variables on error generation.
Table 1. Comparative analysis of the accuracy and quality of sequences. Columns: number of sequences, % of error-free sequences, number of positions, total % of error. Rows: the six reference sequences, each evaluated over the first 101 positions (Ref 1 (101) to Ref 6 (101)) and over the full length (Ref 1 (572), Ref 2 (552), Ref 3 (500), Ref 4 (532), Ref 5 (592), Ref 6 (516)). (Table values omitted.)
The error rate for the first 101 sequenced positions (corresponding to 8,596,016 examined bases) displayed a mean of 0.534% (95% CI: [0.529, 0.539]) (45,895 erroneous bases) for 454 GS-FLX Titanium data. This global error rate is five times higher than the error rate obtained by the analyses of GS20 test fragments and is similar to that obtained for GS20 experimental sequences. Indeed, 0.49% of the positions were erroneous for a comparable dataset relating to 101 positions (Table 1). If we break down the global error rate for all reference sequences according to the type of error, insertions are found to be the most common errors (mean = 0.273% [0.269, 0.276]; q1/2 = 0.215), followed by deletions (0.232% [0.229, 0.235]; q1/2 = 0.170), mismatches (0.022% [0.021, 0.023]; q1/2 = 0.010), and ambiguous base calls (0.007% [0.006, 0.007]; q1/2 = 0.010). This pattern is entirely consistent with that described by Huse et al., and it is in agreement with the study of the 454 GS-FLX, but markedly different from Illumina™ sequencing, in which insertions and deletions of single bases occur less frequently than mismatches [13, 14]. In total, 58,269 sequences (67.57% [67.26, 67.88]) of this length were found to be free from error. This trend is similar to that reported for GS20 experimental sequences, for which 82% of sequences matched the corresponding reference sequence perfectly. Unfortunately, these data are not available for GS20 test fragments.
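As an illustration, the interval reported above for the global error rate can be reproduced with a standard normal approximation for a binomial proportion (a sketch; the text does not state which interval method the authors used):

```python
import math

# Observed counts for the first 101 positions, as reported in the text
errors, n = 45895, 8596016

p = errors / n                                  # point estimate of the error rate
half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # 95% normal-approximation CI
lower, upper = p - half_width, p + half_width

print(f"{p:.3%} [{lower:.3%}, {upper:.3%}]")    # 0.534% [0.529%, 0.539%]
```

With more than eight million examined bases, the normal approximation is safe here, and the result matches the reported 0.534% [0.529, 0.539].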
If we restricted the analysis to full-length sequences (500 to 592 positions), we found for the 86,237 sequences that passed the 454 quality filters (29,100,738 bases) that 312,351 bases were erroneous (1.073% [1.069, 1.077]). The pattern observed for the first 101 positions was confirmed for the full-length sequence data, with insertions (0.541% [0.538, 0.543]; q1/2 = 0.465) and deletions (0.359% [0.357, 0.362]; q1/2 = 0.350) being the most common types of error and mismatches (0.088% [0.087, 0.089]; q1/2 = 0.085) and ambiguous base calls (0.085% [0.084, 0.086]; q1/2 = 0.090) making a smaller contribution to global error rate. Only 8,702 of the 86,237 full-length sequences (10.09% [9.89, 10.29]) had no error with respect to the corresponding reference sequence. This result strongly contrasts with the higher proportion of error-free sequences for the first 101 bases.
The comparison of error rates between sequences of different lengths (first 101 positions vs. full-length sequences) highlighted two key findings in addition to the doubling of the global error rate for full-length sequences, as found elsewhere. First, this length-associated overall increase in error rates did not reflect a common mechanism for all types of error, as insertion and deletion rates increased only slightly (by factors of 2 and 1.5, respectively), whereas mismatch and ambiguous base call rates increased to a much greater extent (by factors of 5 and 9, respectively). This decoupling of the changes in rate for the different types of error modified their respective contributions to the global error rate: mismatch and ambiguous base call errors made a greater contribution for longer sequences, although their effects remained moderate. Overall error rates and the rates of different types of error are therefore not uniform for the sequences obtained by 454 GS-FLX Titanium sequencing, and conclusions drawn for short sequences should not be directly extrapolated to longer sequences, as sequence length affects error rates. The second key result of this in-depth analysis was that the low error rate (1.07%) should be seen in the light of the large number of erroneous sequences (89.91%) in the dataset. This combination of a low error rate and a large number of erroneous sequences results from the occurrence, on average, of only very small numbers of errors in individual sequences. These findings conflict with those reported for GS20 sequencing and suggest that the removal of erroneous sequences may not be a useful strategy for increasing overall quality.
This pattern is particularly problematic for 454 data, as the number of sequences decreases significantly after 300 bases (see Figure 1 and additional file 2 for illustration), whatever the reference sequence considered (for a total length ranging from 500 to 592). In summary, for the longest sequences (>300 positions), the combination of higher error rates towards the end of the sequence and the decrease in the number of sequences available may make it difficult to correct errors. This difficulty results from a deficit in the number of sequences needed to reduce the probability of erroneous assignment at a given sequence position, as coverage falls below a reasonable threshold (i.e. the minimum number of reads per bp required; see additional file 1).
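The interplay between depth of coverage and per-base error rate can be illustrated with a deliberately simple model: a majority-vote consensus over reads carrying independent errors. This is not the paper's correction procedure, and the paper's central point is precisely that errors are not independent; the sketch merely shows why depth rescues randomly distributed errors but is powerless at positions where the local error rate approaches 50%.

```python
from math import comb

def p_consensus_error(e, depth):
    """Probability that a strict majority of reads is wrong at one position,
    assuming independent errors with per-base error rate e (toy model)."""
    k_min = depth // 2 + 1
    return sum(comb(depth, k) * e**k * (1 - e)**(depth - k)
               for k in range(k_min, depth + 1))

# At the mean full-length error rate (1.07%), modest coverage suffices ...
print(p_consensus_error(0.0107, 10))   # ~3e-10
# ... but at a hotspot where the local error rate reaches 50%, depth barely helps
print(p_consensus_error(0.50, 100))    # ~0.46
```

Under this toy model, ten-fold coverage is ample for the average position, yet the positions described in the text with error rates above 50% remain uncorrectable at any depth, which motivates the per-region error estimation advocated below.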
This issue is further complicated by the heterogeneous distribution of error types among the six control DNA reference sequences, within and between gasket regions of a PT GS-FLX Titanium plate, and also between PT plates, as initially suggested by the large standard errors of the error estimates (derived from Table 1). This overall variability of error distribution makes it difficult to draw any clear conclusions ruling out particular parameters that might potentially influence error rates, or to identify a single mechanism accounting for the observed errors in the dataset. This pattern requires an in-depth analysis of the interaction and explanatory power of various factors before we can assess the degree of sequencing error and identify solutions for preventing artifacts.
The evolution of 454 technology combines progress in chemistry and in acquisition devices (such as CCD cameras and PT plate handling equipment) with improvements in quality filters and base-calling algorithms. All these modifications are potential sources of variation in the amount, length and quality of sequences. In this work, we analyzed the interaction of seven variables identified as potential sources of sequencing error. We characterized sequencing error as a function of information about position in the sequence (Position and Seq.length), the presence of homopolymers (Homopolymer) and reference sequence type (Seq.type), all of which are sequence-specific information. Location on the PT plate was also taken into account, through the region of origin (Region) and the distance of beads to the region center (Dist.region) or to the plate center (Dist.plate; see Materials and Methods for details), as both the flow of chemicals through the plate and the central position of the CCD camera may play a role in error generation. Before this analysis, we tested the hypothesis of homogeneous error rates on the three PT plates. This hypothesis was rejected (χ2 = 2613.3, df = 2, P < 2 × 10⁻¹⁶). The significant result obtained in this test is mostly due to the high power of detection associated with the large number of samples available, but this heterogeneity requires the specification of individual parameter values for the logistic model describing each PT plate. The three runs were therefore analyzed separately. This approach did not prevent us from extracting the common trends influencing error rate and distribution. The models (for each plate and for each type of error) explained between 14.32% and 37.38% of the error distribution and were highly significant (P < 2 × 10⁻¹⁶).
The nullity of r (the Bravais-Pearson correlation coefficient) between pairs of the seven variables was tested independently for each run. As the usual assumptions required to infer the distribution of the test statistic were not met, we used permutations to approximate its distribution under H0, with a type I error rate of 0.05 and Benjamini-Hochberg correction to take multiple testing into account. Most of the pairs of variables (74.29%) were significantly correlated under these conditions. However, 41.85% of the pairs were only weakly correlated, with 0.005 < r < 0.05. The pair of variables displaying the strongest correlation was the position of the error in the reference sequence (Position) and sequence length (Seq.length), with 0.40 < r < 0.50, depending on the PT plate considered. The second strongest correlation was that between distance to the region center (Dist.region) and distance to the PT plate center (Dist.plate), with 0.38 < r < 0.62.
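The permutation approach described above can be sketched in a few lines. The implementation and the toy variables below are illustrative only (the study applied this to the seven variables per run, with Benjamini-Hochberg correction across the resulting p-values afterwards):

```python
import random

def pearson_r(x, y):
    """Bravais-Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_test_r(x, y, n_perm=999, seed=0):
    """Approximate P(|r| >= |r_obs|) under H0 (r = 0) by shuffling y."""
    rng = random.Random(seed)
    r_obs = abs(pearson_r(x, y))
    y = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y)
        if abs(pearson_r(x, y)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)  # add-one rule avoids p = 0

# Toy example: two strongly related variables
x = list(range(30))
y = [i + (i % 3) for i in range(30)]
r_obs, p_value = permutation_test_r(x, y, n_perm=499, seed=1)
```

Shuffling one variable destroys any association while preserving both marginal distributions, which is why no distributional assumption on r is needed.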
At the DNA sequence level, we detailed the variables individually accounting for the highest proportion of the error rate for each error type. It is essential to bear in mind, throughout this analysis, that most of the explanatory power of these variables was obtained with combinations of variables. We analyzed each type of error independently.
For insertion errors (Figure 2), the variable Homopolymer accounted for 5.97% ± 1.33 of the variation in error on its own and was positively associated with the error rate. This finding is consistent with published empirical observations linking errors to homopolymers. The variable Position accounted for 11.94% ± 2.22 of the variation and was also positively associated with error: in other words, the error rate due to insertions increased along the sequence. Finally, the variable Seq.length accounted for 5.48% ± 3.13 of the variation, insertion rates being lower for longer sequences and higher for shorter sequences. These last two results may appear paradoxical, but the combined information for these variables indicates that the distribution of insertion errors along sequences is not random, with more insertions at the 3' end, whatever the length of the sequence considered. This is fully explained if we consider that (i) the number of sequences decreases with length (Figure 1), hence changing the number of sequences for which error rates are computed with respect to the reference, and (ii) the quality filtering process (v2.3) implemented in the GS-FLX system involves the trimming of reads with many off-peak signal intensities. In particular, for insertions, the TrimBack Valley Filter trims sequences from the 3' end until the proportion of valley flows (intermediate signal intensities, i.e. signal intensities falling in the valley between the peaks for 1-mer and 2-mer incorporations, or 2-mer and 3-mer, etc.) is < 1.25%. This implies that short sequences are not short because strand synthesis stopped prematurely, but because of a rapid decrease in the quality of the flowgram (raw sequence) resulting from early out-of-phase synthesis. Trimming eliminates the 3' end with above-threshold ambiguous base calls, but the remaining sequence still contains errors.
For deletion errors, Seq.type accounted for 2.36% ± 1.39 of the variation, reflecting substantial heterogeneity between the reference sequences. The variables Homopolymer (accounting for 6.89% ± 0.89 of the variation) and Position (accounting for 8.93% ± 5.21 of the variation) were both positively associated with the deletion rate: deletion errors tend to occur more frequently in homopolymers, and their rates are higher towards the 3' end of sequences.
Finally, mismatch and ambiguous base call error rates were both found to be linked to Position (45.24% ± 4.04 and 25.85% ± 1.71, respectively) and Seq.length (25.00% ± 9.61 and 7.66% ± 2.11, respectively), with higher error rates found in 3' positions within sequences and longer sequences tending to have lower error rates.
As detailed in the results and discussion section, error rate variability is mostly accounted for by the combination of the seven variables analyzed. However, the heterogeneous physical pattern may be partially driven by the combined influence of the central CCD camera (edge effect) with chemical flow direction (Y-axis). This explanation is, however, insufficient in itself to account for the observed pattern, and other variables clearly influence error rate. The negative relationship between insertion and deletion errors is probably related to physical acquisition issues, but chemistry-related artifacts probably also have an effect (through the related statistical variables analyzed), including the CAFIE effect (carry forward and incomplete extension) in particular. Carry forward occurs when a trace amount of nucleotide remains in a well after the apyrase wash, perpetuating premature nucleotide incorporations for specific sequence combinations during the next base flow and contributing to signal 'noise'. Incomplete extension occurs when some DNA strands on a bead fail to incorporate during the appropriate base flow. The strands that fail to incorporate must await another flow cycle for sequencing to continue and are thus incorporated out-of-phase with the rest of the strands .
This study clearly demonstrates that sequencing error rate, as deciphered here, is a heterogeneous feature of 454 GS-FLX Titanium pyrosequencing. We cannot extrapolate the results obtained for other technologies, such as the GS20 system, to this system, and the use of a single global error rate is not appropriate either. Our results provide information about the number of sequences required to correct for a specific erroneous position, when detected, but this procedure requires the error rate to be computed within the 454 PT plate regions, in which the physical distribution of error rate is heterogeneous. Internal DNA controls should therefore be used when appropriate [7, 19, 24] (readily available for amplicon sequencing), together with an error-correcting base caller, and routine procedures taking error data into account should be defined. When error rate is not estimated, a large number of potential false-positive polymorphisms is to be expected, and only post-sequencing validation can account for these artifacts [26, 27]. For the resolution of this issue, the use of both sequencing primers and deep coverage, combined with the use of random sequencing priming sites, should partially compensate for error -- even for high error rates -- although it may be more difficult to distinguish between low-frequency alleles and errors than previously anticipated.
We used the six control DNA fragment Type I sequences (as provided in Roche 454 protocols) as reference sequences. This made it possible to use a large number of strictly identical templates to characterize the sequencing error rate of this technology. The sequences generated constituted a set of three replicates from three different runs, making it possible to assess the quality and accuracy of the 454 GS-FLX Titanium method. Six references were used, with lengths ranging from 500 to 592 bp and GC contents from 52.75% to 65.85%; each of these reference sequences contained a large number of homopolymers (20 to 34), defined as a succession of three or more identical bases. Homopolymer positions are shown in Figure 1 and in additional file 2. The reference sequences are provided in additional file 5.
All reference sequence positions were classified according to the presence and length of a homopolymer: (i) the first and last bases of a homopolymer and those within two bases on either side of a homopolymer were coded "1". All the other positions within the homopolymer were coded "3" to "6" (the length of the homopolymer). All positions outside these zones (not influenced by the homopolymer) were coded "0".
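The coding scheme above can be sketched as follows. This is an illustrative implementation (the function name and the cap at code "6" are ours, the cap following the longest homopolymer class named in the text):

```python
def code_homopolymers(seq, max_len=6):
    """Code each position of seq: 0 = outside any homopolymer zone,
    1 = first/last base of a homopolymer or within two bases of one,
    3..max_len = interior position, coded with the homopolymer length."""
    n = len(seq)
    codes = [0] * n

    # Find all runs of >= 3 identical bases
    runs, i = [], 0
    while i < n:
        j = i
        while j < n and seq[j] == seq[i]:
            j += 1
        if j - i >= 3:
            runs.append((i, j))
        i = j

    # Pass 1: interior positions carry the homopolymer length (capped)
    for i, j in runs:
        for k in range(i + 1, j - 1):
            codes[k] = min(j - i, max_len)

    # Pass 2: first/last bases and the two flanking bases on each side -> "1"
    for i, j in runs:
        for k in (i - 2, i - 1, i, j - 1, j, j + 1):
            if 0 <= k < n and codes[k] == 0:
                codes[k] = 1
    return codes

print(code_homopolymers("ATTTACG"))  # [1, 1, 3, 1, 1, 1, 0]
```

In the example, the TTT homopolymer makes its middle base "3", its first and last bases and their neighbors "1", and leaves the final G uninfluenced ("0").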
The dataset consisted of 86,237 sequences, corresponding to 29,100,738 positions. Sequencing was carried out at Genoscreen, France. We aimed to identify factors linked to error rate. For a tractable analysis, we analyzed a dataset corresponding to all the positions at which an error was detected, plus a similar number of error-free positions randomly selected from the whole original dataset.
Reads (see additional file 6) were sorted according to their reference sequences by BLASTn. Each read was aligned with its reference sequence to identify the positions and number of sequencing errors. To optimize the pairwise alignment parameters, the total number of errors was counted in a test dataset of 500 kb for a series of gap opening and gap extension penalties. The final analyses were carried out with ClustalW, using "1" as the gap opening penalty and "10" as the gap extension penalty.
In the analyses, the observation unit was the position on the 454-generated sequences. These positions were transformed into positions on the reference sequence. Insertions are reported with respect to the position of the base preceding the gap. For each position, a binary variable was defined, indicating the presence or absence of a sequencing error. An error is defined here as a discordance between two homologous positions: the first in the reference sequence and the second in the generated sequence. A discordance may be an insertion, a deletion, a nucleotide mismatch or an ambiguous base call (N, corresponding to a nucleotide that could not be determined in the generated sequence; according to Huse et al.). We investigated the pattern of 454 error types, focusing on the following seven factors: (i) Position, the position in the sequence expressed as a proportion of the total length of the reference sequence (treated as a quantitative variable); (ii) Seq.type, the reference sequence concerned (qualitative variable with 6 settings); (iii) Homopolymer, the type of homopolymer linked to the position, as defined above; (iv) Dist.region, the Euclidean distance between the generated sequence (bead) and the center of the region on the plate; (v) Dist.plate, the Euclidean distance between the generated sequence and the center of the plate; (vi) Seq.length, the length of the generated sequence considered (the observed sequence length results from the GS-FLX quality filtering process); (vii) Region, the region of the plate in which the considered replicate was observed.
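The bookkeeping of error types from a gapped pairwise alignment can be sketched as follows (the actual alignments were produced with ClustalW; the aligned strings here are toy examples, with '-' marking gaps):

```python
def classify_errors(ref_aln, read_aln):
    """Classify discordances between aligned reference and read ('-' = gap).
    Insertions are attributed to the preceding reference base."""
    counts = {"insertion": 0, "deletion": 0, "mismatch": 0, "ambiguous": 0}
    errors = []    # (1-based reference position, error type)
    ref_pos = 0
    for r, q in zip(ref_aln, read_aln):
        if r == "-":                       # extra base in the read
            counts["insertion"] += 1
            errors.append((ref_pos, "insertion"))
            continue
        ref_pos += 1
        if q == "-":                       # reference base missing from the read
            counts["deletion"] += 1
            errors.append((ref_pos, "deletion"))
        elif q == "N":                     # ambiguous base call
            counts["ambiguous"] += 1
            errors.append((ref_pos, "ambiguous"))
        elif q != r:                       # nucleotide mismatch
            counts["mismatch"] += 1
            errors.append((ref_pos, "mismatch"))
    return counts, errors

counts, errs = classify_errors("ACGT-AA", "AC-TTNG")
print(counts)  # {'insertion': 1, 'deletion': 1, 'mismatch': 1, 'ambiguous': 1}
```

Each reported error carries the reference coordinate, so per-position binary error indicators (the observation unit described above) fall out directly.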
The R software was used for all statistical tests. The significance of regression coefficients was assessed by a permutation test with Benjamini-Hochberg correction, with α = 0.05. As we studied both qualitative and quantitative variables, we transformed the qualitative variables: the various possible settings of each qualitative variable were replaced by binary (dummy) variables.
Maximum likelihood estimation was used to estimate the parameters of the model. The significance of the parameters was then tested with Student's t test. A model was generated for each of the three plates and for each of the error types (insertion, deletion, mismatch and N). All the analyses were performed with R (version 2.6.0).
The contribution of a given explanatory variable xi is assessed as follows. Let comp.mod denote the logistic model including all the variables considered, and dev(comp.mod) its deviance. Let dev(sub.mod) be the deviance of the model including all the variables other than xi. Then part(xi) = (dev(sub.mod) - dev(comp.mod))/dev(comp.mod) expresses the contribution of xi in addition to the other variables. We can symmetrically define the participation of all the variables other than xi, using dev(xi), the deviance of the model including only xi: part(whole\xi) = (dev(xi) - dev(comp.mod))/dev(comp.mod). Hence the explanatory power may be broken down into the sum of three terms: the first exclusive to xi, the second exclusive to the rest of the variables, and the last expressing the explanation common to xi and the other variables: 1 = part(xi) + part(whole\xi) + (1 - part(xi) - part(whole\xi)).
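The arithmetic of this decomposition is simple once the three models have been fitted; the deviance values below are hypothetical, chosen only to show the three terms:

```python
def deviance_partition(dev_comp, dev_without_xi, dev_only_xi):
    """Break down explanatory power among x_i, the remaining variables,
    and their shared part, from the deviances of three fitted models:
    the complete model, the model without x_i, and the model with only x_i."""
    part_xi = (dev_without_xi - dev_comp) / dev_comp    # exclusive to x_i
    part_rest = (dev_only_xi - dev_comp) / dev_comp     # exclusive to the others
    shared = 1.0 - part_xi - part_rest                  # common explanation
    return part_xi, part_rest, shared

# Hypothetical deviances: complete model, model without x_i, model with only x_i
print(deviance_partition(100.0, 110.0, 130.0))  # approx. (0.1, 0.3, 0.6)
```

A large exclusive part for xi indicates that dropping it noticeably worsens the fit, while a large shared term signals the correlated-variable situation reported above, where no single variable accounts for the error distribution on its own.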
We thank G. Nève for assistance with figure design. We thank M. Galan for useful comments on previous versions of the manuscript and S. Nielsen and J. Sappa (Alex Edelman) for major improvements of English grammar throughout the text. This work was supported by the AIP BioRessources "EcoMicro" grant from the French Institut National de la Recherche Agronomique (INRA), permanent institutional support from Montpellier SupAgro, University Aix-Marseille I, and the R&D budget of Genoscreen (Lille, France).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.