Dual indexed design of in-Drop single-cell RNA-seq libraries improves sequencing quality and throughput

The increasing demand for single-cell RNA-sequencing (scRNA-seq) experiments, in both the number of experiments and the number of cells queried per experiment, necessitates higher sequencing depth coupled with high data quality. New high-throughput sequencers, such as the Illumina NovaSeq 6000, enable this demand to be met in a cost-effective manner. However, current scRNA-seq library designs present compatibility challenges with newer sequencing technologies, such as index hopping, and their ability to generate high-quality data has yet to be systematically evaluated. Here, we engineered a new dual-indexed library structure, called TruDrop, on top of the inDrop scRNA-seq platform to solve these compatibility challenges, such that TruDrop libraries and standard Illumina libraries can be sequenced alongside each other on the NovaSeq. We overcame the index-hopping issue, demonstrated significant improvements in base-calling accuracy, and provided an example of multiplexing twenty-four scRNA-seq libraries simultaneously. We showed favorable comparisons in transcriptional diversity of TruDrop compared with prior library structures. Our approach enables cost-effective, high-throughput generation of high-quality sequencing data, which should enable more routine use of scRNA-seq technologies.


Introduction
Most droplet-based single-cell RNA-seq (scRNA-seq) libraries to date have been sequenced on Illumina sequencing platforms using their sequencing-by-synthesis technology (1)(2)(3)(4). Libraries generated by droplet-based scRNA-seq approaches require a certain read depth for adequate identification of cell types and states (1)(2)(3). With the introduction of Illumina's NovaSeq 6000 next-generation sequencing (NGS) platform, the number of scRNA-seq libraries that can theoretically be multiplexed and sequenced together to the required depth has significantly increased (5).
Coupled with improvements in hardware technology and sequencing chemistry, sequencing costs can be dramatically reduced, which in turn can facilitate scRNA-seq for routine lab use (Supplementary Table 1). However, the utilization of the improved exclusion amplification (ExAmp) chemistry and patterned flow cells in this new technology has introduced new problems for droplet-based scRNA-seq library structures to date (6)(7)(8)(9)(10).
One aspect to be considered when sequencing using ExAmp chemistry is the increased rate of index hopping between samples sequenced together compared with those sequenced using Illumina's normal bridge amplification chemistry (7). Index hopping occurs due to the physical incorporation of the sample index from one library into a library molecule from a different library (Fig. 1A-E) (8,9). The end result is the mis-assignment of reads between samples (Fig. 1B). Index hopping presents a significant problem for scRNA-seq libraries, where data resolution and sample integrity are vitally important. While computational approaches that use cell barcodes as a second index to solve this mis-assignment problem have been proposed (9,10), the redundant nature of barcodes used in different bead lots means that a large amount of data will need to be discarded due to cross-sample barcode collisions, as detailed below. One of the best strategies to solve the index-hopping problem is to incorporate a second sample index (i5) on the other side of the final sequencing library (Fig. 1F-I) (11). Thus, an index-hopped read would be identified by an unanticipated combination of sample indexes and can be filtered out. Currently, using a second index and proper sample handling to prevent sample mixing prior to sequencing are the only methods available to proactively prevent index hopping in bulk sequencing assays (8,11).
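The filtering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the demultiplexing software used in this work: the sample sheet `EXPECTED_PAIRS`, the sample names, and the index sequences are arbitrary placeholders, and exact matching of both indexes is assumed.

```python
from collections import Counter

# Hypothetical sample sheet: each sample owns exactly one (i7, i5) pair.
# The sequences below are arbitrary placeholders, not TruDrop indexes.
EXPECTED_PAIRS = {
    ("ATTACTCG", "TATAGCCT"): "sample_A",
    ("TCCGGAGA", "ATAGAGGC"): "sample_B",
}


def assign_read(i7, i5, expected=EXPECTED_PAIRS):
    """Return the sample for an expected index pair, or None if the
    combination is unanticipated (for example, an index-hopped read)."""
    return expected.get((i7, i5))


def demultiplex(index_reads, expected=EXPECTED_PAIRS):
    """Count reads per sample; unanticipated pairs are counted as 'discarded'."""
    counts = Counter()
    for i7, i5 in index_reads:
        sample = assign_read(i7, i5, expected)
        counts[sample if sample is not None else "discarded"] += 1
    return counts


if __name__ == "__main__":
    reads = [
        ("ATTACTCG", "TATAGCCT"),  # valid pair for sample_A
        ("TCCGGAGA", "ATAGAGGC"),  # valid pair for sample_B
        ("ATTACTCG", "ATAGAGGC"),  # i7 of A with i5 of B: hopped, discarded
    ]
    print(demultiplex(reads))
```

With a single index, the third read would have been silently assigned to one of the two samples; with a unique pair per sample it is recognizable and can be dropped.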
There are several issues to consider when designing a dual-indexed scRNA-seq library for compatibility with the NovaSeq. A combinatorial dual-indexing scheme, in which at least one of the two sample indexes is repeated across two or more samples, will reduce the number of samples that could potentially be mis-assigned. However, samples sharing a sample index would still need to be treated as a single-indexed library (Fig. 1G) (7). The best method then is to use a unique dual-indexed system (Fig. 1I) so that none of the sample indexes on one side of the library (i7) or the other (i5) are shared between samples (7). The indexes used for both sides of the library should be sufficiently different that a 1-base error (insertion, deletion, or substitution) does not result in the mis-assignment of the associated read (12).
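One way to check the last requirement is to compute the minimum pairwise edit distance within an index set. The sketch below is an assumption-laden illustration rather than the procedure used to design the TruDrop indexes: it uses the Levenshtein distance, assumes exact matching at demultiplexing, and ignores the fact that an insertion or deletion in a fixed-length index read also shifts adapter bases into the read. Under those assumptions, a minimum distance of at least 2 means no single-base error can turn one index into another.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if bases match)
            ))
        prev = curr
    return prev[-1]


def min_pairwise_distance(indexes):
    """Smallest edit distance between any two indexes in the set."""
    return min(levenshtein(a, b) for a, b in combinations(indexes, 2))


def tolerates_single_error(indexes):
    """True if no single-base error can convert one index into another."""
    return min_pairwise_distance(indexes) >= 2


if __name__ == "__main__":
    # Placeholder 7-base sequences, chosen only for illustration.
    print(tolerates_single_error(["AACCGGT", "TTGGCCA", "CGTATCG"]))
    print(tolerates_single_error(["AAAAAAA", "AAAAAAT"]))
```

If single-base errors are to be corrected rather than merely rejected, a stricter threshold (a minimum distance of 3) would be the usual choice.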
Another issue to consider was the use of custom sequencing primers with the prior library structures, such as inDrop V2, which were incompatible with large amounts of other Illumina libraries, such as common TruSeq libraries (2,13). Thus, previous sequencing runs of V2 scRNA-seq libraries occupied entire sequencing lanes (Methods). When sequencing just a single library type, the resulting low base composition diversity during the cell barcode read results in a spike in the base call error rate. The ability to sequence alongside other Illumina libraries should increase the diversity of bases incorporated across the flow cell at each cycle, improving not only the base calling accuracy, but also the flow cell cluster recognition during sequencing (14).
Here, we document the development and benchmarking of an Illumina-compatible dual-index library structure for the inDrop scRNA-seq platform that builds upon the widely used, commercially available V2 gel beads in a manner independent of the cell barcodes incorporated into the library. We demonstrate the necessity for transitioning to uniquely dual-indexed libraries when sequencing on platforms that use ExAmp chemistry due to cross-sample cell barcode collisions. Using the design documented here, anywhere from 1 to 96 of the resulting scRNA-seq libraries can be sequenced alongside other Illumina samples with minimized sample cross-talk and improvements in sequencing accuracy, which should facilitate the widespread adoption of scRNA-seq in experimental workflows.
Importantly, although the PhiX spike-in occupied some of the read depth, the mean quality score increased for both the transcript read and the barcode + UMI read, compared with a run without PhiX (Table 1) (15). The improved quality scores equate to a decrease in the probability of an error in base calling from 0 10 to 1 10 on the transcript read, and a corresponding decrease in error probability from 10 to 0 10 on the barcode + UMI read. This represents about a 1.8- and 1.7-fold decrease, respectively, in the base calling error rate for bases incorporated during sequencing. This is also reflected in the base calling accuracy plots from the two sequencing runs (Fig. 2
Furthermore, to achieve a standard Illumina TruSeq library structure, the cell barcode + UMI read has been swapped to read 1, which has been documented to be the higher quality read (18).
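The link between quality scores and base-calling error probability used above follows the standard Phred definition, Q = -10 log10(P). The short sketch below shows that conversion and how a fold decrease in error rate maps to a gain in Q. The function names are illustrative, and the input values in the example are not the measured scores from Table 1.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def fold_decrease(q_before, q_after):
    """Fold decrease in error probability when quality rises from q_before to q_after."""
    return phred_to_error_prob(q_before) / phred_to_error_prob(q_after)


def q_gain_for_fold(fold):
    """Increase in Phred score that corresponds to a given fold decrease in error."""
    return 10 * math.log10(fold)


if __name__ == "__main__":
    # Q30 is one error per 1,000 bases.
    print(phred_to_error_prob(30))
    # Quality gains implied by the fold decreases quoted in the text.
    print(round(q_gain_for_fold(1.8), 2), round(q_gain_for_fold(1.7), 2))
```

By this definition, the reported 1.8- and 1.7-fold decreases in error rate correspond to gains of roughly 2.6 and 2.3 quality-score units.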

Sequencing quality of inDrop scRNA-seq libraries is improved when sequenced with a diverse
Since these indexes were designed to be pooled in sets of 8 index pairs (19)  failed to amplify in a manner similar to that of libraries with V2 indexes 6 and 12, it was replaced with index pair 25 (which behaved similarly to V2) in all further testing.

TruDrop libraries see improved performance when sequenced using ExAmp chemistry
To put TruDrop libraries into action, we first sequenced these libraries on the iSeq 100, which utilizes patterned flow cells and ExAmp chemistry to test clustering efficiency and priming effectiveness during the sequencing run (20,21).

Single-cell data generated by TruDrop maintain the same cell population structure as
To determine whether scRNA-seq data generated with TruDrop were valid, count data were generated by alignment, deconvolution, and filtering in a manner parallel to the same samples generated with V2. For sets of mouse and human samples, data generated by the two library structures were analyzed together using t-SNE (22)  clusters. These data suggest that the library structure and sequencer used did not result in any overt biases in the data for recovering cell types.

TruDrop libraries generate larger throughput of data on the NovaSeq
We evaluated the performance of TruDrop libraries of human colonic specimens at different sequencing depths by comparing the number of UMIs and genes recovered after NovaSeq sequencing (Fig. 5E, F). Similar to prior testing, diminishing returns were observed with increasing read depth due to re-sequencing of reads that collapse into single UMIs (3). In this prior work, medians of 3,000 UMI/cell and 1,300 genes/cell were reported when samples were sequenced to 60K reads per cell, with a predicted maximum of 3,500 UMI/cell and 1,400 genes/cell (3). For the samples sequenced here, we observed medians of 16,000 UMI/cell and 3,800 genes/cell when samples were sequenced to 150K reads per cell (Supplementary Table 4). The predicted maximum output in our runs is 20,507 UMI/cell and 4,280 genes/cell (Fig. 5E, F). While cell typing could be done with as few as 20K reads per cell (Fig. 5A-D), we find that analysis in the range of 40K to 60K reads per cell (11,000 UMI/cell, 2,800 genes/cell) yields the most return for value.
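The diminishing returns described above can be illustrated with a simple saturation curve. The sketch below assumes a hyperbolic (Michaelis-Menten-type) model, UMI = UMI_max × reads / (K + reads). That model is an assumption made here for illustration and is not confirmed as the fit used to obtain the predicted maximum. The half-saturation depth `K` is back-calculated from two figures reported in the text, 16,000 UMI/cell at 150K reads per cell and a predicted maximum of 20,507 UMI/cell.

```python
def half_saturation(reads, umis, umi_max):
    """Solve umis = umi_max * reads / (K + reads) for K, given one observed point."""
    return reads * (umi_max / umis - 1.0)


def predict_umis(reads, umi_max, k):
    """Expected median UMI/cell at a given read depth under the assumed hyperbolic model."""
    return umi_max * reads / (k + reads)


# Values reported in the text; the model itself is an assumption.
UMI_MAX = 20_507
K = half_saturation(150_000, 16_000, UMI_MAX)

if __name__ == "__main__":
    for depth in (20_000, 40_000, 60_000, 150_000):
        print(depth, round(predict_umis(depth, UMI_MAX, K)))
```

Under this assumed curve, depths of 40K to 60K reads per cell yield roughly 10,000 to 12,000 UMI/cell, which is consistent with the 11,000 UMI/cell quoted for that range. Beyond it, each additional read contributes progressively fewer new UMIs.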

Discussion
Multiplexed NGS is currently essential for performing scRNA-seq in a cost-efficient manner. In order to fully realize the advantage of the decreased costs associated with sequencing on platforms that utilize Illumina's ExAmp chemistry, it is necessary for scRNA-seq libraries to utilize a multiplex sequencing strategy that adequately addresses the problem of index hopping.
With the development of TruDrop, we take a preventative approach in utilizing a unique dual-indexing method that minimizes sample cross-talk (6). Most prior work on high-throughput scRNA-seq libraries has focused on using computational methods to deconvolve and filter out entire barcodes (cells) with reads that could have originated from index-hopped sequencing reads, resulting in substantial data loss (9). To our knowledge only the V3 inDrop library structure has previously endeavored to implement a dual-indexed system for high-throughput scRNA-seq in a lower rate of ambiguous alignments. The uniquely aligned reads are those that move on to downstream data analysis, and thus, this improvement results in substantially more usable data.
As for the discrepancy in the percentage of uniquely aligned reads between mouse (73%) and human (87%), this is a routinely observed difference between mapping to the mouse and human reference genomes. Furthermore, the TruDrop libraries did not generate biased results, as sequencing the same samples using either library structure recovered the same cell types, with TruDrop libraries producing higher quality data.
In summary, the TruDrop library structure resulted in the ability to sequence inDrop
is supported by T32HD007502. B.C. is supported by T32LM012412. C.R.S. was supported by T32AI007281. We thank Linas Mazutis from MSKCC for his valuable input on the library preparation protocol, Karen Beeri from the VANTAGE core for her technical assistance, and
PhiX, the 7-base long i7- and i5-index reads are used so that PhiX reads can be filtered out and discarded during demultiplexing. (C) Plot of the calculated proportion of cell barcodes that will need to be discarded from single-indexed sequencing runs at different levels of multiplexing. We assume each sample will contain ~3000 cell barcodes.
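The calculation described in panel (C) can be approximated with a short sketch. This is not the authors' calculation: it assumes each sample draws its cell barcodes uniformly at random from a shared pool, and that any barcode also present in at least one other co-sequenced sample must be discarded. The pool size of 147,456 (384 × 384) is an assumption about the V2 bead design and is not stated in the text; the ~3000 barcodes per sample comes from the caption.

```python
def fraction_discarded(n_samples, barcodes_per_sample=3000, pool_size=147_456):
    """Expected fraction of a sample's cell barcodes that collide with at least
    one other sample in the same single-indexed run.

    pool_size is an assumed value (384 x 384 barcode combinations).
    """
    if n_samples < 2:
        return 0.0
    # Probability that a given barcode is absent from one other sample.
    p_absent_from_one = 1.0 - barcodes_per_sample / pool_size
    # Probability that it appears in at least one of the other samples.
    return 1.0 - p_absent_from_one ** (n_samples - 1)


if __name__ == "__main__":
    for n in (2, 8, 24, 96):
        print(n, round(fraction_discarded(n), 3))
```

Under these assumptions the discarded fraction grows quickly with the number of multiplexed samples, which is the motivation for unique dual indexing over barcode-based filtering.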