Quality filtering of Illumina index reads mitigates sample cross-talk
© The Author(s). 2016
Received: 27 September 2016
Accepted: 26 October 2016
Published: 4 November 2016
Multiplexing multiple samples during Illumina sequencing is a common practice and is rapidly growing in importance as the throughput of the platform increases. Misassignments during de-multiplexing, where sequences are associated with the wrong sample, are an overlooked error mode on the Illumina sequencing platform. This results in a low rate of cross-talk among multiplexed samples and can cause detrimental effects in studies requiring the detection of rare variants or when multiplexing a large number of samples.
We observed rates of cross-talk averaging 0.24 % when multiplexing 14 different samples with unique i5 and i7 index sequences. This cross-talk rate corresponded to 254,632 misassigned reads on a single lane of the Illumina HiSeq 2500. Notably, all types of misassignment occur at similar rates: incorrect i5, incorrect i7, and incorrect sequence reads. We demonstrate that misassignments can be nearly eliminated by quality filtering of index reads while preserving about 90 % of the original sequences.
Cross-talk among multiplexed samples is a significant error mode on the Illumina platform, especially if samples are only separated by a single unique index. Quality filtering of index sequences offers an effective solution to minimizing cross-talk among samples. Furthermore, we propose a straightforward method for verifying the extent of cross-talk between samples and optimizing quality score thresholds that does not require additional control samples and can even be performed post hoc on previous runs.
KeywordsIllumina Sequencing Multiplexing Sequencing errors
In recent years Illumina sequencing has emerged as a mainstay for numerous biological applications. Due to the immense number of sequences that can be obtained, it is often useful to sequence DNA from multiple samples in a single run. This multiplexing process relies upon unique “index” sequences, termed i5 and i7, that are added to both sides of the DNA being sequenced. With only a few unique i5 and i7 sequences, hundreds of different i5 and i7 combinations can be created, enabling many samples to be simultaneously sequenced. De-multiplexing the samples after sequencing only requires finding the sequencing reads associated with each index pair that was added to the sequencing run.
As with other sequencing approaches, the Illumina method has been characterized for the frequency and type of errors that are generated . Substitutions, where one base is misread as another, are the most frequent error class and occur more often toward the end of the sequence [2, 3]. Insertions, deletions and motif-specific errors occur less frequently, but they can still cause problems for certain applications [4, 5].
Another type of error involves cross-talk among multiplexed samples and has received far less attention despite recent reports that error rates can be significant [6–8]. Such errors are particularly insidious in applications that require the detection of variants that are rare in one sample but abundant in others, which includes biosphere surveys, investigations of ancient DNA, and the identification of cancerous cells . Cross-talk errors can also be problematic if a large number of samples are multiplexed, such that each sample is a small fraction of the total number of reads. Since cross-talk can come from multiple sources, it has sometimes been attributed to experimental mistakes, cross-contamination during primer synthesis, multiple misread bases within index sequences, or sample carryover from previous sequencing runs on the same machine .
Using standard de-multiplexing protocols, we observed a 0.09 % rate of sequence misassignments, which have the correct i5 and i7 index pair but incorrect sequence, and a 0.16 % rate of index misassignments, which have a correct sequence read but a single incorrect i5 or i7 index. These rates are consistent with prior studies that found misassignment rates between 0.06 and 0.21 % [8, 9]. Furthermore, the rate of sequence misassignment was similar to that of i5 or i7 index misassignment (Fig. 1b), indicating that the sequence is being misassigned rather than both index sequences being independently misassigned. Both sequence and index misassignments will contribute to cross-talk between samples when each sample is separated from other samples by a single index, whereas only sequence misassignments are relevant when unique dual-indexing is used. Nevertheless, the existence of sequence misassignments indicates that even the use of two unique index sequences is insufficient to eliminate cross-talk.
Misassignment errors could result from distinct cluster originators forming at an overlapping spot on the flow cell . If this were the case, we might expect the quality score profiles of incorrect reads to oscillate between low quality in positions where the two sequence clusters differ (e.g., one A, one C) and high quality where they are identical (e.g., both A). However, we did not observe any such pattern in the quality score signals of incorrect triplets, perhaps because there is a poor correlation between the quality score and the actual probability of error  or because neighboring positions are taken into account when assigning quality scores. Nevertheless, we would expect overlapping clusters to lower the quality of all read steps due to competing signals, yet this was also not observed. Instead it appears that one cluster tends to overpower the other during each read step (i5, i7, or sequence), and the overpowering cluster in the pair can switch between read steps.
To our knowledge, this is the first systematic study of cross-talk on the Illumina platform that uses standard dual indexing as opposed to custom or single indexing schemes. Previous studies of cross-talk identified the advantages of dual indexing over single indexing and of quality filtering index sequences [7, 8]. Here we extended these findings by showing that there are three independent modes of cross-talk: incorrect i5 index, i7 index, and sequence. The existence of sequence misassignments prevents dual indexing from completely eliminating cross-talk without quality filtering. It also means that if only a single (i7) index is used, filtering on sequence quality in addition to index quality is the best strategy. In agreement with previous work , we determined that no amount of quality filtering can completely eliminate cross-talk when samples are only separated by one of two index sequences. Thus, unique dual indexing is required when identification of extremely rare variants is critical. We also proposed a simple method for both quantifying cross-talk and choosing run-specific or application-specific thresholds for mitigating it by counting reads assigned to unexpected index pairs during quality filtering (Fig. 5).
Cross-talk between samples effectively limits the number of index pair combinations that can be reliably used. As the fraction of clusters sharing an i5 or i7 increases, the number of misassigned reads will concomitantly increase. Eventually, even at small rates of misassignment the incorrect reads would rise to an intolerable level if enough index combinations were used. This is supported by a previous study in which the rate of cross-talk was estimated to approach 1 % when 625 index pair combinations were used . For this reason, we believe it is necessary to quality filter index reads in addition to the sequencing read when employing a multiplexing strategy. Furthermore, to mitigate the issue of spurious results due to cross-talk in the literature, we recommend that repositories such as the Sequence Read Archive (SRA)  enable and encourage the submission of quality scores for index sequences and unexpected (control) index pairs. This would allow retroactive filtering of published sequences, and would also provide a means for automatic accumulation of data on the magnitude of sample cross-talk as sequencing platforms evolve.
Template DNA extraction and PCR amplification
A total of 13 strains (Additional file 1: Table S1) belonging to the genera Amycolatopsis or Streptomyces were grown at 28 °C in 1 mL of 1/10th concentration ISP2 medium (10 g Malt extract, 4 g Yeast extract, and 4 g Dextrose per 1 L) for 9 days. The remaining protocol closely paralleled that of a previous study . Briefly, the cultures were centrifuged at 1000 rcf for 10 min to pellet the cells. A 700 μL volume of supernatant was removed, the remaining volume was vortexed, and 200 μL of the concentrated mycelium was transferred to a 0.2 mL thin-wall tube (Corning). These tubes were sonicated at 100 % amplitude for 60 s using a Model 505 Sonicator with Cup Horn (QSonica) while the samples were completely enclosed. After sonication, the samples were centrifuged, and the supernatant containing DNA was used as template for PCR amplification.
Extracted DNA was amplified using indexed primers containing adapters (Additional file 1: Table S2). Samples were carefully arranged into a 96 well plate in alternating rows and columns to prevent any possibility of cross-contamination. Primers were designed to target either a stably integrated chromosomal barcode or the RNA polymerase subunit β (rpoB) gene. The PCR reaction consisted of a 2 min denaturation step at 95 °C, followed by 40 cycles of 20 s at 98 °C, 15 s at 67 °C, and 15 s at 80 °C. The PCR reaction contained 10 μL of iQ Supermix (Bio-Rad), 0.8 μL of 10 μM forward primer, 0.8 μL of 10 μM reverse primer, 4 μL of DNA template, and 5.9 μL of reagent grade H2O per sample. Primers were synthesized by Integrated DNA Technologies using their TruGrade service that is intended to prevent cross-contamination during synthesis. Furthermore, primers were purchased across multiple orders that were staggered in time to further ensure that primer cross-contamination could not occur.
DNA purification, sequencing, and analysis
PCR products were purified separately with the Wizard SV-Gel and PCR Cleanup System (Promega). Samples were sequenced by the UW-Madison Biotechnology Center on an Illumina Hi-Seq 2500 in rapid mode. Sample concentrations were determined using an Agilent 2100 Bioanalyzer, and pooled immediately prior to sequencing in order to reach a target density of 8.5e5 to 1e6 clusters per mm2. Spiking PhiX was unnecessary because the sequences’ first 5 bases were well separated (hamming distance from 2 to 5), and we have not noticed a reduction in cross-talk from adding PhiX in prior runs. Single-end sequencing was performed for 51 cycles. After sequencing the cluster density was determined to be 9.9e5/mm2.
Samples were de-multiplexed using Illumina’s bcl2fastq (v2.17) software and its associated defaults, that is, allowing 1 mismatch per index and only outputting reads that “pass filter”. Illumina’s pass filter algorithm screens out reads based on the signal intensities over the first 25 cycles of the sequencing read. The additional parameter “--create-fastq-for-index-reads” was specified to force the program to output fastq files for both index sequences (i5 and i7). Raw index and sequence reads are available from the sequence read archive (SRA) under accession number SRP083789. We also de-multiplexed another randomly selected index pair (i5: ACGTAAGG; i7: GGCCAATT) that was not used with any sample. This index pair had zero associated reads, confirming that the observed rates of sequence misassignment are larger than expected from misread bases alone.
Reads were assigned to the nearest expected sequence within an edit distance of four (including mismatches, insertions, and deletions) using the DECIPHER (v2.1.6) package  in R  (http://www.DECIPHER.codes). The sequences belonging to each sample were separated by an edit distance of at least 14, meaning that a small number of misread bases would not prevent correct matching. Barring insertions and deletions, which are uncommon on the Illumina platform, the 14 sequence variants were separated by between 21 and 43 substitutions. The probability of 17 (21 differences –4 mismatches) or more substitutions within 51 bases is 10−23 at a high misread rate of 1 % (Q20). Between 4.8 and 9.6 million reads were mapped to each of the 14 sequences having a known index pair, with a mean of 7.5 million reads per expected triplet. A total of 99.9 % of unexpected triplets differed from an expected triplet by a single read step, with the remainder differing by two read steps (e.g., incorrect i5 and i7).
Quality score filtering was applied with the TrimDNA function of DECIPHER , which allows specification of a maximum average error rate. The quality score (Q) can be converted to a probability of error (p) using the formula p = 10(Q/−10). The sequence misassignment rate was calculated as the fraction of reads having the same i5 and i7 index pair that mapped to the wrong sequence, divided by the total number of mapped reads having that index pair. The index misassignment rate was calculated as the fraction of reads that mapped to a sequence with a known index pair, but differing by a single i5 or i7 index from the expected index pair.
The authors thank the University of Wisconsin Biotechnology Center DNA Sequencing Facility for performing the Illumina sequencing associated with this study.
This work was supported by the Simons Foundation, Targeted Grant in the Mathematical Modeling of Living Systems Award 342039, the National Science Foundation Grant DEB 1457518, and the National Institute of Food and Agriculture, US Department of Agriculture, Hatch project 1006261.
Availability of data and materials
The datasets generated and analyzed during the current study are available in the GenBank repository, under accession SRP083789.
EW performed the experiments. EW and KV designed the study, analyzed the data and wrote the manuscript. Both authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Cox MP, Peterson DA, Biggs PJ, Solexa QA. At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010;11:485.View ArticlePubMedPubMed CentralGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research. 2008;36:e105–5.Google Scholar
- Schirmer M, Ijaz UZ, D’Amore R, Hall N, Sloan WT, Quince C. Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research. 2015;43:e37–7.Google Scholar
- Nakamura K, Oshima T, Morimoto T, Ikeda S, Yoshikawa H, Shiwa Y, et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Research. 2011;39:e90.Google Scholar
- McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP, Callahan BJ. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Meth. Nature Publishing Group; 2016;:1–7.Google Scholar
- Wright ES, Vetsigian KH. Inhibitory interactions promote frequent bistability among competing bacteria. Nat Commun. 2016;7:11274.View ArticlePubMedPubMed CentralGoogle Scholar
- Kircher M, Sawyer S, Meyer M. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform. Nucleic Acids Research. 2011;40:e3–e3.Google Scholar
- Nelson MC, Morrison HG, Benjamino J, Grim SL, Graf J. Analysis, Optimization and Verification of Illumina-Generated 16S rRNA Gene Amplicon Surveys. Heimesaat MM, editor. PLoS ONE. 2014;9:e94249.View ArticlePubMedPubMed CentralGoogle Scholar
- D'Amore R, Ijaz UZ, Schirmer M, Kenny JG, Gregory R, Darby AC, et al. A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling. BMC Genomics. 2016;17:55.Google Scholar
- Leinonen R, Sugawara H, Shumway M, on behalf of the International Nucleotide Sequence Database Collaboration. The Sequence Read Archive. Nucleic Acids Research. 2010;39:D19–D21.Google Scholar
- Wright ES. Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R. R Journal. 2016;8:352–9.Google Scholar
- R Core Team. R: A language and environment for statistical Computing. Vienna: R Foundation for Statistical Computing; 2016. URL https://www.R-project.org/.