SASI fragment design
For the SASI-Seq approach to work, a fragment or set of fragments was required that would be inexpensive, easily identified and resistant to degradation and loss during Illumina library preparation. One of the most common variable steps in the library preparation process is size selection[24], which can yield very tight (+/- 10 bp) or very broad (+/- 400 bp, or greater) fragment size ranges. In order to prevent the SASI fragments being lost during size selection, we therefore envisioned a set of three fragments of different sizes, approx. 200 bp, 400 bp and 600 bp that would be evenly spaced within the range of fragment size distributions commonly used for Illumina sequencing.
The viral genome PhiX 174 is one of the most commonly sequenced genomes, as it is often used as an internal control during Illumina sequencing[25, 26]. As such, it likely has a perfect reference, and bioinformatics pipelines have been written to remove PhiX reads from Illumina datasets. We therefore designed our spike-in fragments to represent discrete segments of the PhiX genome around a common core. To do this, we used the program Oligo 6[27] to design a set of primers against the NC_001422.1 GenBank reference sequence that gave three fragments of approximately 200 bp, 400 bp and 600 bp from a common reverse primer and that had roughly equal Tm and priming efficiencies. The best primer pairs had forward primers at positions 926, 743 and 571 and a reverse primer at position 1123, giving amplicons of 214, 397 and 568 bp respectively. In order to add a signature to these fragments that could be uniquely associated with a particular sample, we placed a unique sequence barcode from our set of Illumina barcode sequences at the 5’ end of each forward primer[28]. These barcodes were designed using a Hamming script[22, 23] that assumes the major error mode of Illumina sequencing is substitution and ensures that no two barcodes are fewer than 4 base substitutions apart. This enables single error correction, i.e. if a barcode sequence gains an error during sequencing it will be one base away from the perfect sequence and can be counted as that original barcode. A barcode sequence has to gain at least three errors before it will be falsely counted as an alternative barcode. With the Illumina error rate less than 1%[29] this should occur at a frequency of less than 1 in 10^6. For the purposes of both this application and multiplexing during Illumina sequencing we sought to construct a set of 384 such barcodes that included our previous set of 96 8-mer multiplexing barcodes[19, 30, 31]. To do this we found we needed to expand the barcode word length to a 9-mer, so assigned the 9th base as A in the first 167 barcodes in the set, as this is the first base of the Illumina adapter sequence following the run of barcode bases (for 9-mer barcode sequences see Additional file 1: Table S1).
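The distance argument above can be illustrated with a minimal Python sketch. It is not the Hamming script cited in the text: the function names and the four toy 8-mers are hypothetical, chosen only so that the minimum pairwise distance is 4. With single error correction, a tag carrying one substitution is recovered, a tag carrying two is left unassigned, and only a tag carrying three can be assigned to the wrong barcode. If each base errs independently with probability below 0.01, three such errors occur with probability of roughly 0.01^3 = 10^-6 or less, which is the bound quoted above.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def assign_barcode(read_tag, barcodes, max_mismatch=1):
    """Return the single barcode within max_mismatch substitutions, else None."""
    hits = [bc for bc in barcodes if hamming(read_tag, bc) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


# Hypothetical toy barcodes for illustration only (not taken from Table S1).
toy_set = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]

print(min_pairwise_distance(toy_set))       # minimum distance of the toy set
print(assign_barcode("AAAAAAAC", toy_set))  # one error: corrected
print(assign_barcode("AAAAAACC", toy_set))  # two errors: unassigned
print(assign_barcode("AAAAACCC", toy_set))  # three errors: wrong barcode
```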
Initial SASI fragment investigation experiments
In order for this approach to work it was necessary to demonstrate that the SASI fragments remained within a DNA sample once added and could not be degraded or processed away. For these tests SASI amplicons were generated with barcode tag #1 at both ends, as described in Methods. We first sought to determine whether the fragments were sheared under the typical physical shearing conditions employed during Illumina library construction. 500 ng aliquots of human genomic DNA were spiked with 0.5 ng of SASI fragment mixture and sheared using a Covaris focused acoustic shearing device to produce average fragment sizes of 200 bp, 300 bp, 400 bp and 500 bp respectively. Illumina sequencing libraries were constructed from each sheared DNA sample, with each library receiving a different P7 indexing barcode sequence. The libraries were mixed in equimolar proportions and sequenced on an Illumina MiSeq instrument. From each indexed library we analysed the fraction of reads that shared similarity to the PhiX reference sequence (Figure 2). This demonstrated that the majority of the SASI fragments were broken during shearing and that virtually none of the larger 568 bp amplicon remained, but that approximately 10% of detected fragments were intact 214 and 397 bp amplicons.
In order to investigate the effect of different size selection protocols on the levels of detectable SASI fragments, we again took 500 ng aliquots of human genomic DNA spiked with 0.5 ng of SASI fragment mixture, sheared them to an average fragment size of 300 bp and made Illumina sequencing libraries using a variety of size selection approaches, before sequencing as a multiplexed pool. We have previously found that the Sage Science Pippin Prep gives the tightest distribution of fragment sizes during fractionation[24]. We therefore used the Pippin Prep to separate as tight a size fraction as possible centred around 300 or 500 bp, i.e. approximately halfway between the sizes of the SASI amplicons. We made libraries including this size fractionation step both before and after library PCR. We also made libraries using: the Caliper LabChip XT to size fractionate tight 300 and 500 bp fragments; agarose-gel electrophoresis to size fractionate a tight 300 bp size fraction; AMPure beads to purify >200 bp fragments and 400-600 bp fragments; and Agilent SureSelect custom exome enrichment. The results demonstrated the persistent nature of the SASI fragments, in that we were able to detect SASI fragment reads from all the libraries (Figure 3), including after Pippin Prep fractionation (for mapped insert-size distributions, see Additional file 2: Figure S1) and, perhaps surprisingly, even after SureSelect target enrichment, albeit at a very low level.
Following the inclusion of a specific probe (10 μM final concentration; for details see Methods) we found that SASI fragments could be reproducibly detected following SureSelect target enrichment, with representation after sequencing close to the spike-in level (results not shown).
Optimisation of multiplexing barcode sequences
Ideally, sequence multiplexing would be pure in the sense that a single sample would have a unique and exclusive barcode sequence. Also, for SASI-Seq to have maximum sensitivity, a single sample would display only the intended barcode sequence(s). However, there are two mechanisms by which background contamination can occur: i) cross-contamination can occur between barcoding oligonucleotides during synthesis and subsequent processing and ii) errors during sequencing can lead to sequence drift such that an alternate barcode sequence is read.
In previous experiments in which samples were deliberately omitted from multiplexed library pools, we noticed that such samples could still be detected at a low level. In order to determine the best processing and purification approach for oligo synthesis, we made a set of libraries using barcoded multiplexing PCR primers that had been purified by HPLC (from Company A) or PAGE (Company B), or using IDT TruGrade processing (custom service, Integrated DNA Technologies, Iowa, USA). We deliberately did not open the tubes containing some of the barcode primers, but included those barcodes in the sequence dataset analysis to see what fraction of reads were attributed to barcodes that had not been used (Additional file 3: Table S2). With HPLC or PAGE purification, approximately 0.56% and 0.34% of reads, respectively, mapped to the missing barcodes. With TruGrade this was dramatically reduced to just 0.03%. The set of barcode sequences used initially was designed to be 4 bases apart and to tolerate one mismatch. In order to investigate the origin of these mis-attributed barcodes, we tabulated the number of perfect matches and the number of matches within one mismatch against each barcode in the 384 set (Additional file 4: Table S3 sheet 1). We found that some matches to absent barcode sequences had higher levels of perfect matches (than single-change mismatches) to other barcodes synthesised within the same batch, indicating cross-contamination in the lab or during synthesis. Other mis-attributed barcodes had higher numbers of hits allowing for one mismatch than they did perfect matches, indicating that those matches were due to sequence drift from other barcodes as a result of sequencing error. We looked at the level of barcode mis-attribution in other runs, two of which are illustrated in Additional file 4: Table S3 as sheets 2 and 3. Whilst in some runs mis-attribution was primarily due to perfect matches, indicating lab contamination (Additional file 4: Table S3 sheet 2), up to 0.2% mis-attribution was observed due to sequence drift (Additional file 4: Table S3 sheet 3).
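The perfect-match versus one-mismatch tabulation described above can be sketched in Python as follows. This is a simplified illustration rather than the analysis actually used: the function names, the toy barcodes and the rule of comparing perfect matches against exactly-one-mismatch matches are all assumptions made for the example.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def tabulate_matches(observed_tags, barcodes):
    """Count perfect and exactly-one-mismatch matches for each barcode."""
    perfect, one_off = Counter(), Counter()
    known = set(barcodes)
    for tag in observed_tags:
        if tag in known:
            perfect[tag] += 1
            continue
        near = [bc for bc in barcodes if hamming(tag, bc) == 1]
        if len(near) == 1:  # unambiguous single-error neighbour
            one_off[near[0]] += 1
    return perfect, one_off


def likely_origin(barcode, perfect, one_off):
    """Heuristic label for reads attributed to a barcode that was not used."""
    p, m = perfect[barcode], one_off[barcode]
    if p == 0 and m == 0:
        return "not detected"
    if p > m:
        return "contamination"   # mostly perfect matches
    if m > p:
        return "sequence drift"  # mostly single-error matches
    return "ambiguous"


# Hypothetical toy data: the last two barcodes were never added to the pool.
barcodes = ["AAAAAAAA", "AAAACCCC", "CCCCAAAA", "CCCCCCCC"]
tags = ["CCCCAAAA"] * 3 + ["CCCCAAAT"] + ["CCCCCCCT", "CCCCCCCG", "CCCCCCCC"]
perfect, one_off = tabulate_matches(tags, barcodes)
for bc in ("CCCCAAAA", "CCCCCCCC"):
    print(bc, likely_origin(bc, perfect, one_off))
```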
In order to make SASI-Seq as sensitive as possible, and sample multiplexing as distinct as possible, we sought to reduce this background level of barcode mis-attribution by redesigning our 384-plex barcode set such that each barcode differed by at least 5 bases from the closest other barcode sequence. When using single error correction, this would tolerate 3 sequencing errors, since at least 4 sequencing errors would be required to potentially convert a barcode to within one base of an alternative barcode. This required increasing the barcode length to 11 bases; the sequences are given in Additional file 5: Table S4.
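The relationship between minimum distance and error tolerance can be made concrete with a short Python sketch. The greedy search shown here is an assumed, illustrative way of building a set with a given minimum Hamming distance; it is not the design script used for the 11-mer set, and the toy parameters (5-mers, distance 3) are chosen only so that it runs instantly.

```python
from itertools import combinations, product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def errors_to_misassign(min_distance: int, corrected: int = 1) -> int:
    """Fewest substitutions that can bring a barcode within `corrected`
    mismatches of a different barcode in the set."""
    return min_distance - corrected


def greedy_barcodes(length: int, min_distance: int, limit: int):
    """Scan all sequences in lexicographic order, keeping each one that is
    at least min_distance from every barcode kept so far."""
    chosen = []
    for bases in product("ACGT", repeat=length):
        candidate = "".join(bases)
        if all(hamming(candidate, bc) >= min_distance for bc in chosen):
            chosen.append(candidate)
            if len(chosen) == limit:
                break
    return chosen


print(errors_to_misassign(4))  # original set: 3 errors needed
print(errors_to_misassign(5))  # redesigned set: 4 errors needed

toy = greedy_barcodes(length=5, min_distance=3, limit=8)
print(toy)
```

A plain greedy scan of this kind is unlikely to be practical or optimal for 384 barcodes of 11 bases, and it ignores constraints such as base composition, so it should be read only as an illustration of the distance criterion.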
We ordered Illumina PCR multiplexing oligos with these 384 different 11-mer barcode sequences from IDT to TruGrade purity in four 96-well plates, and validated that purity by checking for the presence of unexpected barcodes. Briefly, we amplified an Illumina adapter-ligated fragment library of the S. aureus TW20 strain in the presence of each of the 384 barcoded primers, in four 96-well plates. After PCR we made two multiplexed library pools, one containing an equal volume of all odd-numbered barcoded libraries and the other an equal volume of all even-numbered barcoded libraries. We used an 8-channel pipette for this purpose, so that we could pipette whole columns without error. Each pool was purified, quantified and run on an Illumina MiSeq to determine the frequency of each barcode (Additional file 6: Table S5). The incidence of mis-attribution in each experiment was less than 0.005%, of which 75% and 83% respectively were perfect matches, demonstrating the highly discriminatory nature of these barcodes, which would be a prerequisite for sensitive cross-contamination detection using SASI-Seq.
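The mis-attribution rate for such an odd/even pooling design can be computed as in the following sketch. The function name and the read counts are hypothetical and serve only to show the arithmetic: any read assigned to a barcode that was not placed in the pool is counted as mis-attributed.

```python
def misattribution_rate(read_counts, pooled):
    """Fraction of reads assigned to barcodes that were not in the pool.

    read_counts maps barcode number -> reads assigned to that barcode;
    pooled is the set of barcode numbers actually added to the pool.
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads to analyse")
    stray = sum(n for bc, n in read_counts.items() if bc not in pooled)
    return stray / total


# Hypothetical counts for a small "odd" pool; barcodes 2 and 4 were not added.
odd_pool = {1, 3, 5}
counts = {1: 40000, 3: 30000, 5: 29996, 2: 3, 4: 1}
rate = misattribution_rate(counts, odd_pool)
print(f"{rate:.5%}")
```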
Sample assurance using SASI-Seq
To test the performance of SASI-Seq, we prepared a set of 96 multiplexed libraries from samples that had been spiked with 0.1% SASI fragments containing a unique 11-mer barcode at one end. The reads from each library were segregated according to the barcode sequence and each library dataset mined for reads originating from the SASI spike-in fragments. The results are best visualised as a tabulated matrix of sequencing barcode versus spike-in barcode for each library (e.g. Figure 4; full results in Additional file 7: Table S6). The number of SASI-specific reads varies between samples, but for each their representation roughly approximates the 0.1% spike-in level. Variation is probably a result of a number of factors, including variation in the number of reads for each Illumina barcode data set, accuracy of quantification of both DNA sample and SASI fragments, and pipetting accuracy at low volumes. In separate experiments (not shown) different relative levels of sequencing barcodes and SASI fragments were observed, indicating that some barcodes/fragments are not outperforming others, as has been observed with some “in-line” barcoding sets (e.g.[32]).
For the most part, the only SASI fragments detected in each library dataset were those that were expected. There were, however, a small number of hits to other SASI barcode sequences in some of the libraries. On analysis, these were found to be perfect matches indicating cross-contamination during processing rather than sequence error resulting in barcode cross-talk.
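The sequencing-barcode versus spike-in-barcode matrix can be assembled as in the sketch below. It assumes that each SASI read has already been reduced to a (library, SASI barcode) pair; the function names, library labels and counts are hypothetical.

```python
from collections import Counter, defaultdict


def build_matrix(pairs):
    """Tabulate SASI reads as matrix[library][sasi_barcode] -> read count."""
    matrix = defaultdict(Counter)
    for library, sasi_barcode in pairs:
        matrix[library][sasi_barcode] += 1
    return matrix


def unexpected_hits(matrix, expected):
    """Return the off-diagonal entries: SASI barcodes seen in a library
    other than the one that was spiked into it."""
    flagged = {}
    for library, counts in matrix.items():
        stray = {bc: n for bc, n in counts.items() if bc != expected.get(library)}
        if stray:
            flagged[library] = stray
    return flagged


# Hypothetical toy data: library "lib2" carries two reads from SASI barcode 1.
expected = {"lib1": 1, "lib2": 2, "lib3": 3}
pairs = [("lib1", 1)] * 50 + [("lib2", 2)] * 48 + [("lib2", 1)] * 2 + [("lib3", 3)] * 51
matrix = build_matrix(pairs)
print(unexpected_hits(matrix, expected))
```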
Detecting sample swaps and cross-contamination using SASI-Seq
To demonstrate that we could reliably use SASI-Seq to detect sample swaps and cross-contamination events we deliberately mixed samples with known spike-ins. Sample swaps could be identified quite readily; an example is shown in Figure 5, in which two consecutive samples were purposely transposed.
We next tested the power with which we could detect cross-contamination by deliberately mixing in samples containing other spike-in barcodes at known levels. Specifically, samples in triplicate containing 0.1% uniquely tagged spike-ins were deliberately cross-contaminated by adding another sample, containing 0.1% spike-ins with SASI barcode #77, to 10%, 1% and 0.1% relative to the concentration of the original sample. At the 0.1% level of overall spike-in, cross-contamination down to 1% could be reliably detected above the background contamination and sample-to-sample SASI read variation within the experiment (Figure 6). The low level of background contamination seen in this experiment was probably a result of small splashover events during library preparation, since the contaminants had perfect matches and no such contamination was observed when individual libraries were remade manually (results not shown).
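The sensitivity of this kind of experiment can be explored with a rough Python sketch. It assumes that SASI read counts are proportional to the amount of each spiked sample present; the read depth of one million, the 0.5% decision threshold and all function names are hypothetical choices made for illustration, not parameters taken from the experiment.

```python
def expected_sasi_reads(total_reads, spike_fraction=0.001, contamination=0.0):
    """Expected (own, foreign) SASI read counts when a contaminating sample
    is added at `contamination` relative to the original sample and both
    carry the same spike-in fraction."""
    own = total_reads * spike_fraction / (1 + contamination)
    foreign = own * contamination
    return own, foreign


def classify_library(counts, expected, threshold=0.005):
    """Label a library from its SASI barcode counts (barcode -> reads)."""
    if not counts:
        return "no SASI reads"
    dominant = max(counts, key=counts.get)
    if dominant != expected:
        return "possible swap"
    strongest_other = max((n for bc, n in counts.items() if bc != expected), default=0)
    if strongest_other / counts[expected] >= threshold:
        return "possible cross-contamination"
    return "as expected"


for level in (0.10, 0.01, 0.001):
    own, foreign = expected_sasi_reads(1_000_000, contamination=level)
    print(f"{level:.1%} contamination: ~{own:.0f} own, ~{foreign:.1f} foreign reads")

print(classify_library({12: 980, 77: 95}, expected=12))
print(classify_library({13: 1000}, expected=12))
print(classify_library({12: 1000, 77: 1}, expected=12))
```

Under these assumptions a 0.1% contaminant contributes on the order of a single SASI read per million, which is consistent with it being difficult to separate from background, whereas 1% contributes roughly ten.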
Universal SASI-Seq
In order to make the SASI fragment design applicable to Nextera library preparation and PCR-based enrichment approaches, we added the sequences TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG and GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG, that are normally introduced via the Nextera reaction[33], to the respective 5’ ends of forward and reverse primers used in SASI fragment generation. This enabled the SASI fragments to be amplified using the standard Nextera PCR primers, or using primers with these sequences that could easily be included in any PCR enrichment panel (results not shown).
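Constructing such tailed primers amounts to prepending each Nextera sequence to the 5’ end of the corresponding primer, as in the sketch below. The two tail sequences are those quoted above; the PhiX-specific primer sequences are hypothetical placeholders, because the actual primer sequences are not given in this section, and the position of any sample barcode relative to the tail is deliberately left out.

```python
# Tail sequences quoted in the text (forward and reverse, respectively).
NEXTERA_FWD_TAIL = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
NEXTERA_REV_TAIL = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"


def add_tail(tail: str, primer: str) -> str:
    """Return the primer with the tail attached at its 5' end
    (sequences are written 5' to 3')."""
    for name, seq in (("tail", tail), ("primer", primer)):
        if not seq or set(seq) - set("ACGT"):
            raise ValueError(f"{name} must be a non-empty A/C/G/T sequence")
    return tail + primer


# Hypothetical placeholder primers, NOT the real SASI primer sequences.
placeholder_fwd = "ACGTACGTACGTACGTACGT"
placeholder_rev = "TGCATGCATGCATGCATGCA"

universal_fwd = add_tail(NEXTERA_FWD_TAIL, placeholder_fwd)
universal_rev = add_tail(NEXTERA_REV_TAIL, placeholder_rev)
print(universal_fwd)
print(universal_rev)
```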
SNP calling
Since this approach involves adding foreign DNA to samples under study we had a slight concern that SASI fragment sequence may contaminate usable sequence data and interfere with subsequent analysis, leading to false SNP calling and elevated false positive rates. To examine this possibility we sequenced the genome of Staphylococcus aureus TW20, for which we had a complete genome sequence[34], both with and without the inclusion of SASI fragments. Variant analysis of the resulting datasets showed that neither dataset had any variants compared to the reference thus providing assurance that SASI fragments do not lead to false SNP calls.