The profile of errors that accumulate in TCR repertoire data acquired by next generation sequencing has not been fully explored. Our results, derived by sequencing monoclonal TCR CDR3 chains, indicate that an error rate >1% and as high as 6% can be expected after Illumina® sequencing of a 30-36 nt CDR3 central to a 125 bp read, even after screening of sequences for identity with established TCR sequence 5' and 3' of the CDR3. Considering that millions of sequences may be acquired in a single repertoire assessment, tens of thousands of erroneous sequences may be present as well as thousands of unique erroneous sequences. Indeed, if a mean of >100 events are acquired for "correct" unique CDR3 sequence reads, at an error rate of 1% more unique erroneous sequences may be present than error-free sequences. Although our analyses focused on the Illumina® system, the level of error we observed is of a similar magnitude to the ~2% estimated using the Roche platform for shorter (14 nt average) CDR3 region reads .
Our results have a number of implications both in regards to the extent to which sequencing errors may permeate repertoire data and methods through which these can be refined. Erroneous sequences had specific characteristics. The large majority showed length identity with the parental CDR3, and among these most incorporated only single nt substitutions (79.1 ± 10.5%, 88.4 ± 5.7%, 88.2 ± 4.9% for 5C.C7, OT-1, and DO11.10 CDR3 at q = 0, and 99.2 ± 0.1%, 98.9 ± 0.2%, and 98.3 ± 0.2% respectively at q = 30). Therefore elimination of single nt substitutions related to an index sequence of higher frequency has the potential to remove the majority of erroneous sequences (Figure 5). Site and nt-specific substitution rates varied substantially within a CDR3, however, and in our data sets did not exceed a frequency of 0.69% for phred unfiltered sequences and 0.24% for sequences filtered at a q = 30. Identifying single nt mismatch sequences compared with a higher frequency index sequence in repertoire analyses can be performed through straightforward algorithms, and a threshold frequency value can be assigned to flag such sequences as potential mis-sequence events. It would seem prudent to incorporate a separately barcoded monoclonal TCR control in all repertoire studies so as to establish error thresholds based on the quality of individual sequencing reactions.
One limitation in filtering single nt mismatch sequences is that for sequences present at low frequencies, there may be insufficient event numbers to reliably identify probable mis-sequence events. For example, if 200 index sequences are present and the filter cutoff is set at 0.5% of the index, a one nt mis-match sequence observed once would be culled (200 × 0.5% = 1). But if that sequence was present twice, it would not meet criteria for this filter. Yet, at these small discrete numbers, a binomial distribution of detected events when the occurrence probability is 0.5% indicates that any potential mis-sequence event present 3 or fewer times would need to be culled to be >95% confident that all sequences less frequent than 0.5% of the index are eliminated. Therefore it may be necessary to establish variable cutoffs as sequence counts become low, so as to reliably eliminate mis-sequences. Increasing cutoff values, however, may also be associated with an increased risk of improperly excluding genuine sequences from the data set. A more complex algorithm that explores sequence space by clustering similar sequences with up to 3 nt substitutions has also been used previously, and our results would support this type of approach . However, that algorithm assumed that genuine CDR3 sequences would only rarely possess <3 nt mismatches. Eliminating sequences with multiple differences may lead to overculling, and if phred filters are applied may be unnecessary. Again, the addition of quantitatively defined frequency thresholds based on internal controls may minimize the inappropriate extraction of true sequences with high nt similarity to an index sequence.
No evidence was found for substantial error introduction during sample preparation steps. Although this phase of manipulation incorporates 2 rounds of PCR, these were performed with high fidelity polymerases with low intrinsic error rates. Errors introduced through PCR would be anticipated to be idiosyncratic. For each of the TCR, among the multiple independently prepared samples, little variation in error rates was observed. As it would seem improbable that the same rare events are consistently incorporated in these steps, the alternative conclusion that the preparatory amplifications did not markedly contribute to overall error seems justified. An alternative source for errors may be those inherent in mRNA transcription within the T cell. However, this would also seem unlikely to be a primary contributor as measured rates of mRNA transcriptional error are 1 - 2 orders of magnitude lower than that observed in these studies [25, 26]. This would suggest that the solid phase amplification and sequencing steps are the primary sources of errors, and improved technologies that minimize these are therefore critical. Indeed, unlike single-strand sequencing, more recently developed paired-end approaches that bi-directionally cover the CDR3 sequence may allow improved identification and filtering of sequences with errors.
Our data indicate that additional features may aid in identifying erroneous sequences within mixed TCR populations. Skewed sequencing direction was highly associated with errors. Among phred-unfiltered sequences, the majority of erroneous sequences were observed to be sequenced primarily in a forward or reverse orientation. Indeed, for many sequences, exclusive directionality was observed despite the presence of hundreds of independent reads. Restricting sequences based on phred scores greatly diminished, though did not eliminate this directional bias. Hence, filtering sequences based on directional bias may be useful to eliminate erroneous sequences. One caveat is that in the setting of a large polyclonal population of T cells, even with the acquisition of millions of sequences, many erroneous sequences will be present among lower frequency events. Event numbers may often be inadequate to reliably screen for variation from an anticipated distribution of forward to reverse reads.
Interestingly, with the phred-filtered sequences, a clear nt bias was observed among errors for each of the TCR. Specifically, four substitutions comprised the majority of errors, C→T, G→A, T→C, and A→G. Why these particular purine-pyrimidine transversions are specifically prominent is not clear. However, 2 forms of symmetry underlie them. First, bidirectional mutations are seen, i.e. a C→T and T→C, and a G→A and A→G. This suggests that these base pair combinations are specifically prone to errors. Second, the mutations are also related by their equivalence across complementary read strands. Thus a C→T substitution in a CDR3 sequence could result from a C→T error in a forward read or a G→A in the reverse read. A similar symmetry applies to the T→C and A→G substitutions. This indicates that very specific base pairing is highly prone to error during sequencing. The nature of the base substitution is therefore a potentially important parameter in determining the likelihood that related sequences result from mis-sequencing.
Considering the different factors that are associated with erroneous TCR CDR3 sequences, we would envision a multi-step process for filtering sequences. Most important is the initial application of a phred filter. We observed a progressive decrease in overall errors by increasing the base-call value. Often a q = 20 is considered an adequate cutoff, however our findings indicate a decrease in errors with a q = 30, with an acceptable loss of total sequence numbers. Using single end sequencing here, many residual erroneous sequences were nevertheless present. Paired end sequencing may provide additional benefit in excluding mis-sequence events when full length bidirectional sequence can be acquired. Additional error reduction strategies will depend on the residual error rate present, which can be indicated by a separately barcoded monoclonal CDR3 control incorporated into a sequencing reaction. Our data indicates that at a q = 30, filtering single nt mismatch sequences that are less frequent than 1% of a high frequency index sequence reliably purges >98% of residual erroneous sequences. Increasing the threshold value for filtering potentially incorrect sequences may eliminate more erroneous events but will also increasingly purge true sequences. Further, as mentioned above, as the frequency of the index sequence diminishes to low numbers, the threshold value for culling sequences may need to be increased based on binomial probabilities. It may therefore be helpful to examine these events for additional indications of error, such as the presence of specific transversions that were particularly common among erroneous sequences. Finally, in performing multiple repetitions of sequencing reactions, we observed that many low frequency sequences were only present in one or a few sequencing reactions. The performance of multiple independent sequencing reactions on single amplified samples permits the identification of common sequences, providing increased confidence in their authenticity.
As each sample analyzed for CDR repertoire possesses variable intrinsic diversity, it is not possible to a priori define the extent to which specific interventions that exclude sequences from a data set will mistakenly purge true sequences. Ultimately the best means to safeguard the integrity of CDR3 data set is by diminishing the errors inherent in sequencing, and this must be a priority. New third generation sequencing instruments are currently under development, and, not requiring sample amplification, have the potential to eliminate errors introduced during sample preparation . Fidelity of these systems for quantitative repertoire assessments, however, remains to be determined. This would have to be high as sequencing primary rather than amplified DNA or cDNA samples markedly reduces sequence redundancy, which may be used for quality control.