Cross-site comparison of ribosomal depletion kits for Illumina RNAseq library construction

Herbert, Zachary T.; Kershner, Jamie P.; Butty, Vincent L.; Thimmapuram, Jyothi; Choudhari, Sulbha; Alekseyev, Yuriy O.; Fan, Jun; Podnar, Jessica W.; Wilcox, Edward; Gipson, Jenny; Gillaspy, Allison; Jepsen, Kristen; BonDurant, Sandra Splinter; Morris, Krystalynne; Berkeley, Maura; LeClerc, Ashley; Simpson, Stephen D.; Sommerville, Gary; Grimmett, Leslie; Adams, Marie; Levine, Stuart S.

doi:10.1186/s12864-018-4585-1

Methodology article
Open access
Published: 15 March 2018

Cross-site comparison of ribosomal depletion kits for Illumina RNAseq library construction

Zachary T. Herbert¹,
Jamie P. Kershner²,
Vincent L. Butty³,
Jyothi Thimmapuram⁴,
Sulbha Choudhari⁴,
Yuriy O. Alekseyev^5,6,
Jun Fan⁷,
Jessica W. Podnar⁸,
Edward Wilcox⁹,
Jenny Gipson¹⁰,
Allison Gillaspy¹⁰,
Kristen Jepsen¹¹,
Sandra Splinter BonDurant¹²,
Krystalynne Morris¹³,
Maura Berkeley¹,
Ashley LeClerc⁵,
Stephen D. Simpson¹³,
Gary Sommerville¹,
Leslie Grimmett¹,
Marie Adams¹⁴ &
…
Stuart S. Levine³

BMC Genomics volume 19, Article number: 199 (2018) Cite this article

17k Accesses
36 Citations
22 Altmetric
Metrics details

Abstract

Background

Ribosomal RNA (rRNA) comprises at least 90% of total RNA extracted from mammalian tissue or cell line samples. Informative transcriptional profiling using massively parallel sequencing technologies requires either enrichment of mature poly-adenylated transcripts or targeted depletion of the rRNA fraction. The latter method is of particular interest because it is compatible with degraded samples such as those extracted from FFPE and also captures transcripts that are not poly-adenylated such as some non-coding RNAs. Here we provide a cross-site study that evaluates the performance of ribosomal RNA removal kits from Illumina, Takara/Clontech, Kapa Biosystems, Lexogen, New England Biolabs and Qiagen on intact and degraded RNA samples.

Results

We find that all of the kits are capable of performing significant ribosomal depletion, though there are differences in their ease of use. All kits were able to remove ribosomal RNA to below 20% with intact RNA and identify ~ 14,000 protein coding genes from the Universal Human Reference RNA sample at >1FPKM. Analysis of differentially detected genes between kits suggests that transcript length may be a key factor in library production efficiency.

Conclusions

These results provide a roadmap for labs on the strengths of each of these methods and how best to utilize them.

Background

Ribosomal depletion is a critical method in transcriptomics that allows for efficient detection of functionally relevant coding as well as non-coding transcripts through removal of highly abundant rRNA species. Use of oligo dT primer to capture the polyadenylated 3′ end of the transcripts and isolate mRNA is routine in many RNA sequencing preparations; however this method lacks the ability to handle degraded samples where most of the RNA is separated from the 3′ tail, or to isolate non-polyadenylated transcripts such as lncRNAs. Ribosomal removal methods address these issues by directly depleting the rRNA while leaving other transcripts intact. This technique is widely utilized and is a basic component of many large datasets [1,2,3].

The current generation of rRNA removal kits employs three distinct strategies to deplete these transcripts. In the first method, rRNA is captured by complimentary oligonucleotides that are coupled to paramagnetic beads, after which the bound rRNA is precipitated and removed from the reaction. Kits utilizing this method include Illumina’s RiboZero, Qiagen GeneRead rRNA depletion, and Lexogen RiboCop. The second method uses an alternative strategy, hybridizing the rRNA to DNA oligos and degrading the RNA:DNA hybrids using RNAseH. These kits include NEBNext rRNA depletion, Kapa RiboErase, and Takara/Clontech’s RiboGone. A third method that is specifically aimed at low-input samples using the Takara/Clontech SMARTer Pico kit, targets the ribosomal RNA sequences after conversion to cDNA and library prep using the ZapR enzyme. Additional high abundance transcripts such as globin and mitochondrial RNA (mtRNA) can be targeted by each method. Despite the availability of many different kits utilizing these methods, the efficiency of rRNA removal and possible off-target effects of these different methodologies on the resulting RNAseq data remains unclear.

With an increasing number of ribosomal RNA depletion kits available, understanding the relative strengths of these methods is critical for improving experimental design. To address this challenge, we have conducted a cross-site study comparing seven rRNA depletion kits against a standard sample both as intact and degraded RNA. We find that about half of the kits are likely to require significant care in implementation and note that the Lexogen RiboCop and Qiagen GeneRead kits worked poorly with heavily degraded samples. The different kits also appear to be affected by relative lengths of the transcripts as well as the degradation of the input RNA. These results suggest that different methodologies may be appropriate depending on the experimental question and quality of input material.

Results

To better understand the strengths of the different ribosomal depletion methodologies available, we utilized a set of controlled samples that could be broadly distributed in order to provide a consistent biological background for each kit and site (Fig. 1). All experiments utilized the well characterized Universal Human Reference RNA (UHR) from Agilent, either in its intact state, or following heat degradation (Additional file 1: Figure S1). Two control spike-ins were added to this sample, the Lexogen Spike-In RNA Variant Controls (SIRVs) which were added before degradation and co-degraded with the sample, and the External RNA Controls Consortium (ERCC) from Ambion, which was added after degradation and thus remained intact and served as an additional control. It should be noted that while heat degraded RNA is a proxy for difficult to isolate samples (eg, brain microdissections or tumors), the samples are not formalin fixed and so lack the base modifications often observed in FFPE samples [4]. Consequently, the degraded sample in this study should be considered a best case scenario for how the kits would perform with challenging sample types that yield poor quality RNA. For each experiment, 100 ng of RNA input was used with the exception of the Takara/Clontech SMARTer pico kit which used 1 ng input as recommended by the manufacturer.

The study tested seven rRNA depletion kits (Fig. 1), each tested at four sites. The kits tested include Illumina RiboZero Gold (RZ), Lexogen RiboCop (LX), Qiagen GeneRead rRNA Depletion (Q), all of which use bead capture for ribosomal depletion, the New England Biolabs NEBNext rRNA Depletion (NE), Kapa RiboErase (K), and Takara/Clontech Ribogone (CR) kits that are based on RNAseH degradation of the rRNA, and SMARTer Pico (CZ) which uses the ZapR enzyme to remove rRNA after library prep. The Lexogen, Qiagen, Ribogone, and NEBNext kits all utilized the NEB Next Ultra II directional library generation kit to convert the RNA to Illumina libraries while the other kits used RNA library generation kits from the manufacturer of the depletion chemistry. A total of 11 sites participated in the study with each site handling no more than four kits. Sites were selected from genomic core facilities that are members of the Association of Biomedical Research Facilities (ABRF) who routinely preform RNA library preparation for academic labs. For each vendor, a consultation conference call was held between the vendor and the participating sites to review the protocol in detail before the experiment was performed with the goal of standardizing and clarifying the protocol, thereby minimizing the chance for confusion about the methodology. Full details of alterations to the standard protocols agreed on by the vendor and the participating sites can be found in the Additional file 2. Technical duplicates of both the degraded and intact RNA were run at each site for each kit. The total 106 samples after dropouts were pooled and sequenced on three NextSeq500 runs at a single site to eliminate bias due to sequencing.

The different methodologies were first evaluated for their ability to perform their primary objective, the removal of rRNA from the samples before sequencing. rRNA reads in each sample were identified by aligning to known rRNA sequences using BWA. A cutoff of 50% nuclear rRNA was chosen to indicate ribosomal depletion failure. Illumina’s RiboZero Gold kit showed ~ 5% rRNA with the intact sample at all sites (excluding a single point failure) but slightly higher rRNA fractions for the degraded sample. This kit was used as a baseline for the other kits due to its long-standing reputation in the sequencing community (Fig. 2a). The other kits that used the capture method for depletion were less consistent than RiboZero Gold with one failed site for Lexogen RiboCop and three failed sites for the Qiagen GeneRead rRNA depletion kit. For both the Lexogen and Qiagen kits, the intact samples performed significantly better than the degraded sample and caution should be used when using these kits on highly degraded RNA.

By comparison, the kits that degraded the rRNAs by either RNase H treatment or using ZapR showed more consistent results. Excluding single sites that failed with the NEBNext Ultra rRNA and SMARTer Pico kits, those two kits as well as the Takara/Clontech RiboGone and Kapa RiboErase kits performed very well with no differences observed between intact and degraded RNA. The RNaseH methods all showed very low rRNA fractions overall with the noted exception to the NEB kit. The SMARTer Pico kit had a slightly higher rRNA level, similar to that observed with Illumina Ribozero Gold degraded samples.

For those samples with successful rRNA depletion, we next ascertained the quality of the RNA sequencing data. Samples with greater than 50% rRNA were excluded from further analysis to eliminate artifacts that may be caused by improper implementation of the protocol. All the kits showed strong strand bias as expected by the protocols (Fig. 2b). Notably, SMARTer Pico reads mapped to the sense strand, which is the opposite strand from the other methods. While this is expected, care should be taken in adapting existing informatics pipelines to this kit. Differences were observed among the kits in how they handled mtRNAs. These were a major contaminant in the Clontech kits, particularly the RiboGone method that only targets the 12 s mtRNA and not the 16 s mtRNA. The other methods that addressed all mtRNA reads, such as Illumina Ribozero Gold, significantly reduced the fraction of reads from mtRNAs (Fig. 2c). This is especially noticeable in the RiboZero samples where site3 utilized a standard Human/Mouse/Rat kit instead of the RiboZero Gold.

Looking at the non-rRNA reads in each sample, the vast majority align to protein coding genes based on the ENSEMBL annotation. All samples show > 60% protein coding with most over 80% and the Clontech RiboGone kit having the largest fraction mapping (Fig. 3a). Most of the samples identified ~ 14,000 genes expressed at greater than one RPKM and ~ 16,000 at over 0.1 RPKM (Fig. 3b). A single site using the Takara/Clontech SMARTer Pico kit did show a somewhat reduced number of genes, which was associated with a lower library complexity observed from that site. Antisense mapped reads and reads mapping to the signal recognition particle RNAs (SRPs) were the most variable aspect of each sample though the source of these differences were unclear as they varied widely from site to site.

While the total number of protein coding genes detected was quite similar, many genes appear to be detected at significantly different rates. To better understand this observation, we performed differential gene expression analysis on the intact RNA samples that passed our QC metrics, comparing each preparation back to Illumina RiboZero Gold (Fig. 3c). The Qiagen rRNA depletion kit was excluded as only two replicates passed these criteria. Hierarchical clustering of the differentially detected genes, clusters the kits first by their RNAseq library prep methodology, with all three kits prepared using the NEB Ultra II Directional RNA Library kit clustering together, followed by the Kapa RiboErase kit and finally the low input Takara/Clontech SMARTer pico. Generally, several hundred genes could be easily observed as differentially detected between each of the kits and Illumina’s RiboZero (fold changes > 2, Benjamini corrected p-values < 0.001, Fig. 3d). Testing the physical properties of these differentially detected genes found that gene length appears to be a large contributor to the direction of the bias with shorter transcripts better detected by RiboZero and longer transcripts better detected by the other kits (Fig. 3e). Many of the most variably detected genes across the data set are quite small, such as mitochondrial proteins and ribosomal proteins (Additional file 3: Figure S2). The libraries themselves did not display any particular size bias with the RiboZero samples having an average length distribution similar to the other kits (Additional file 4: Figure S3). Bias in GC percentage was also observed in a few samples, with the Kapa RiboErase kit having the strongest bias against high GC transcripts (though the underlying gene list is quite small, Fig. 3f).

Ribosomal depletion is a key methodology used in studying noncoding RNAs as many of these transcripts lack polyadenylation sites [5]. We focused on lincRNAs (long intervening noncoding RNAs), one type of noncoding RNAs, since they do not overlap with any protein coding or other long non-coding RNA genes. Using the ENSEMBL annotation, approximately 4% of the non rRNA reads map to lincRNAs using RiboZero (Fig. 4a). Similar numbers are observed with the SMARTer Pico kit and Kapa RiboErase. The other kits, all of which were prepared with NEB Ultra II directional RNA library kit, show less than 3% of reads mapping to lincRNAs. This global decrease in number of lincRNA reads appears to reflect a general decrease in the number of mapping reads rather than a specific bias against a subset of lincRNAs. Two lines of evidence support this conclusion. First, the number of lincRNAs detected at > = 0.01 RPM remains ~ 3500 for all of the different kits tested and no bias is seen against the NEB prepped kits (Fig. 4b). Second, while the majority of lincRNAs detected can be assigned to 4 specific lincRNAs (MALAT1, SNORD3A, RNRP and NEAT1), the remaining fraction remains constant among the different kits, suggesting a global decrease in mapping (Fig. 4c). The precise set of lincRNAs detected varies somewhat but the core of 3200 lincRNAs are detected by all of the kits (Fig. 4d).

While comparison of different detection rates of mRNAs in UHR can point to possible differences between the kits’ chemistries, the spike-in controls provide an absolute metric to evaluate their effectiveness. Both SIRV (co-degraded with the RNA) and ERCC (not degraded) control spike-ins were added to the sample before library preparation and should give an unbiased look at the behavior of the different kits. As an initial test, we examined the ratio between the two spike-in types to model the impact of degradation on efficiency of library formation. Importantly, the kits do treat degraded RNA differently in their protocols and the not degraded ERCCs in the degraded samples were processed along the path proscribed for the bulk RNA, suggesting they are under-fragmented relative to the bulk population. Examining the ratios between the ERCC and SIRVs, we find that all the intact samples show ~ 60% SIRV reads (Fig. 5a). By comparison, the degraded samples show significant bias between the ERCCs and SIRV, generally favoring the intact ERCC. The RiboZero Gold and Kapa RiboErase kits show the least bias based on degradation, while the kits using the NEB stranded RNAseq kits showed a bias against shorter RNA fragments, which is similar to what was observed for protein coding genes (Fig. 3e). The SMARTer Pico kit is biased against the intact ERCC spike-ins in the degraded sample, which is likely due to not pre-degrading the ERCCs in the context of the degraded total RNA, emphasizing the importance of this step. Inserting not degraded spike-in controls into variably degraded experimental samples may confound the ability of these spike-ins to serve as a normalization tool, as previously observed [6, 7].

Distinct transcripts are present at defined ratios in the spike-in control allowing direct visualization of over and under-representation of transcripts. Within the SIRV spike-ins, the distinct transcripts were largely at equal ratios between kits and sites, though deviation from the expected values is observed (Fig. 5b). The SMARTer Pico kit was particularly susceptible to variation, possibly due to the low total input (1:100^th of the other kits), and some transcripts (e.g. purple at 1/4×) show loss of signal in the degraded sample. By comparison, the ERCC spike-ins showed significantly more variability across sites, even within the intact RNA samples (Additional file 5: Figure S4). Overall, the Lexogen and Takara/Clontech RiboGone kits generally had the most consistent and even performance on both the ERCC and SIRV spike-in controls across their test sites for intact samples.

Using the transcript diversity within the SIRV spike-in control, we further examined the general trends observed with respect to transcript length, GC content, and abundance. The 50 highest expressed SIRV transcripts from the intact samples were individually quantified using STAR/RSEM and the TPMs normalized to the expected frequency of the transcripts. We then compared the normalized frequencies for all of the transcripts highest and lowest quintile for different parameters to look for differences. Randomly selected transcripts show minimal differences (Fig. 5c), nor do we observe biases when the transcripts are sorted by GC percentage or amount of expected transcript (Fig. 5e, f). Transcript length does appear to show a bias, with the longest 10 transcripts (len > 2200 nt) showing lower levels of detection than the shortest transcripts (len < 480 nt, Fig. 5d) and three of the poorest detected transcripts in the SIRVs are in the set of longest transcripts. However, the overall distribution remains very broad and size alone is not predictive of the behavior of the transcript.

Discussion

rRNA depletion methodologies have expanded significantly in the last year. While differences exist between the rRNA depletion chemistries tested, all of the kits tested were able to successfully remove a significant amount of the rRNA in library preparations. The bead depletion chemistries were the most challenging to consistently implement successfully and struggled to remove rRNAs in the degraded RNA samples, possibly due to inefficiencies in hybridization with the degraded material. All of the kits, including the low-input ZapR based kit from Takara/Clontech, detected ~ 14,000 transcripts at > 1 RPKM.

With the broad success of the different chemistries, other aspects of library preparation, such as cost and ease of use, can be considered. Participating sites were surveyed to collect feedback regarding ease of use and previous experience with each kit that was tested (Table 1).

Table 1 User Experience with Different rRNA Depletion Chemistries: Study participants were surveyed for their opinions on the ease of use of each kit they tested

Full size table

While all sites were familiar with the RiboZero depletion kit, the majority of sites had no prior experience with the other kits. The participating sites generally reported high comfort levels with all the kits. Interestingly, the comfort level with the different kits did not correlate with success as the sites with the lowest comfort levels for the NEB kit and ZapR based kit from Takara/Clontech both showed very good performance. This may be due to the robustness of the specific methods or the quality of the written protocols.

While we did observe differences in the efficiency of library preparation, analytics remains a key caveat. For this study we used STAR/RSEM/DESEQ [8,9,10] for the analysis of the transcript levels, but different informatics tools may have more or less ability to handle the variations between the different chemistries and to model the spike-in controls. The combination of defined control samples with single transcript and spliced spike-ins provides an opportunity to use this data in the evaluation of different algorithmic approaches without overfitting to a single site or single type of chemistry. The number of RNAseq algorithms continues to multiply and finding the most appropriate methodology remains challenging. We believe this data set will provide a unique opportunity to better characterize the strengths and challenges of not only the depletion chemistries, but the RNAseq analysis algorithms as well.

Conclusions

We evaluated the performance of seven different commercially available rRNA depletion kits. While all kits were able to successfully remove rRNA in at least in some instances, there were marked differences in cross-site reproducibility, ease of use, and biases associate with sample quality. These results will provide important guidance for researchers when selecting the most appropriate rRNA depletion kit for their specific project aims.

Methods

Input RNA preparation

Intact and degraded input RNA was prepared and aliquoted at a single site. The Universal Human Reference RNA (Agilent) was diluted to 500 ng/ μL in 200 μl of RNase-free water and 3.94 μl of the Spike-in RNA Variant Control E2 Mix (Lexogen) were added. The sample was split into two aliquots, one of which was then heated at 94° C on an Eppendorf™Thermomixer for 1 h and 27 min. 1 μl of ERCC RNA Spike-In Mix 1 (ThermoFisher Scientific) was added to both the intact and degraded samples before running on a Bioanalyzer 2100 RNA Nano chip (Agilent) (Additional file 1: Figure S1). The final intact and degraded RNA samples were then diluted to 25 ng/μl and were distributed to each site on dry ice for rRNA-depletion and library preparation.

Site selection and index allocation

Eleven genomics core facilities were selected from among the members of the Association of Biomedical Research Facilities membership who volunteered to take part in this study. All of the participating cores routinely perform RNAseq library preparation for laboratories at their institutes and each site prepared between one and four library types. Kits were assigned to each site to minimize overlapping sets with the exception of the Takara/Clontech and Illumina kits. All four Takara/Clontech sites preformed both the Ribogone (CR) and SMARTer Pico (CZ) kit to minimize shipping costs. Similarly, Illumina RiboZero (RZ) sites were selected to minimize shipping of the donated reagents. Sites were not selected based on prior experience with specific chemistries. Each chemistry was tested by four sites, named 1-4. The same library creation site will not necessary have the same site number for different chemistries. Indices were assigned by the group to prevent overlapping among libraries. Note the Qiagen chemistry is no longer available for purchase.

Protocol normalization

For each chemistry, a conference call was arranged for each of the participating sites and a vendor representative to review the protocol in detail prior to library preparation. During the course of the phone call, the protocol was reviewed in detail and any required modifications or variables (eg., cycle number) were discussed and agreed on by all users. These details are reported in the Additional file 2.

rRNA depletion and library construction

Each site performed rRNA depletion and subsequent library prep following the vendor protocols (Additional file 2: Table S1). Kits, excluding Illumina, were supplied from a single manufacturer lot. Input RNA concentrations, fragmentation conditions and PCR cycles for intact and degraded RNA samples for each kit were discussed with the vendors and can be found in Additional file 2: Table S2 for each kit. Completed libraries were quantified by Qubit or equivalent and run on a Bioanalyzer or equivalent for size determination. Libraries were pooled and sent to a single site for final quantification by Qubit fluorometer (ThermoFisher Scientific), TapeStation 2200 (Agilent), and RT-qPCR using the Kapa Biosystems Illumina library quantification kit. Libraries were pooled for each run on a NextSeq 500.

Sequencing

Sites were instructed to make an equimolar pool of libraries from each kit using site-specific quantification and pooling SOPs and return each pool along with individual un-pooled libraries to the designated sequencing site. The sequencing site quantified each pool by Qubit fluorometer, Agilent TapeStation, and qPCR using the Kapa Illumina quant assay. Library pools were multiplexed and sequenced over three high output paired-end 75 bp runs on the Illumina NextSeq 500 to achieve sufficient read depth for analysis (See Additional file 2).

Alignment and quality control

Reads were aligned against hg19 using bwa-mem v. 0.7.12-r1039 with flags –t 16 –f [11]. Mapping rates, fraction of multiply-mapping reads, strandedness (Fig. 2b), number of unique 20-mers at the 5′ end of the reads, insert size distributions (Additional file 4: Figure S3) and fraction of nuclear ribosomal RNAs (Fig. 2a) were calculated using dedicated perl scripts and bedtools v. 2.25.0 [12]. In addition, each resulting bam file was randomly down-sampled to one million aligned reads and read density across genomic features were estimated for RNA-Seq-specific quality control metrics. Samples with < 10 reads or > 50% rRNA were eliminated from additional analysis and were not included in later sequencing pools. 50% rRNA was selected as the threshold as for samples with less than 50% rRNA, achieving sufficient read depth is faster and less expensive by providing additional sequencing depth rather than repreparation of the library. This number, however, is significantly higher than most participating sites would consider successful. Sample reads from all runs were concatenated before final RNA analysis.

RNA-Seq mapping and quantitation

Reads were aligned against hg19 / ENSEMBL 75 annotation using STAR v. 2.5.1b [8]. with the following flags -runThreadN 8 --runMode alignReads --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --alignIntronMin 10 --alignIntronMax 1,000,000 --alignMatesGapMax 1,000,000 --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM with --genomeDir pointing to a low-memory footprint, 75 nt-junction hg19 STAR suffix array. Gene expression was quantitated using RSEM v. 1.2.30 [10] with the following flags for all libraries: rsem-calculate-expression --paired-end --calc-pme --alignments -p 8 --forward-prob 0 against an annotation matching the STAR SA reference, with the exception of the positively-stranded CZ libraries, for which --forward-prob was set to 1. Posterior mean estimates (pme) of counts were retrieved, ribosomal RNA counts removed and expression in reads per kilobase of modeled exon per million mapped reads (RPKM, Fig. 3b, Additional file 3: Figure S2) or transcripts per million (TPM, Fig. 3a) were computed on the remaining count matrices. Similar scripts and pipelines were used for ERCC and SIRV mapping and quantitation. For both ERCCs and SIRVs, counts were retrieved from the RSEM gene output, as well as fractional isoform usage for SIRVs. SIRVs were normalized by dividing by the expected frequency from https://www.lexogen.com/wp-content/uploads/2017/06/SIRV_Set1_sequence-design-overview-170612a.xlsx.

Differential representation analysis

Libraries and kits were compared against RiboZero samples using a standard differential expression framework to identify genes with differential representation by the various chemistries. Briefly, intact samples from different sites were treated as replicates and differential expression was performed in the R statistical environment (R v. 3.2.3) using Bioconductor’s DESeq2 package on the protein-coding genes only [9]. Dataset parameters were estimated using the estimateSizeFactors(), and estimateDispersions() functions, and differential expression based on a negative binomial distribution/Wald test was performed using nbinomWaldtest() (all packaged into the DESeq() function), using the kit type as a contrast. Fold-changes, p-values and Benjamin-Hochberg-adjusted p-values (BH) were reported for each gene. Genes with BH < 0.001 and absolute fold-changes greater than 2 were considered for downstream analyses (Fig. 3c-d).

lincRNA analysis

After removing lincRNAs that were not assigned to chromosomes, 7251 lincRNAs were identified from the ENSEMBL annotation for further analysis. The lincRNAs with mean RPM > =0.01 were retained for each kit. To examine the distribution of different lincRNAs in each of the samples, the percentage RPM was calculated. Except the top 4 lincRNAs, rest of the lincRNAs were classified as ‘Others’ category. The figures were generated using a custom python script. Based on the ribo-depletion methods, the seven kits were grouped into three sets: RLQ (includes RZ, LX, and Q), NCK (includes NE, CR, K) and CZ. The lincRNAs detected in three sets were compared using an Euler diagram created using Venny (http://bioinfogp.cnb.csic.es/tools/venny/).

Genomic features analysis and visualization

Gene lengths were retrieved from RSEM outputs. GC content was calculated on the longest annotated isoform of each protein-coding gene. Box-plots were generated using Spotfire (Tibco) and TreeView (Fig. 3e-f).

Abbreviations

ABRF:: Association of Biomolecular Resource Facilities
BWA:: Burrows-Wheeler Aligner
CR:: Takara/Clontech RiboGone
CZ:: Takara/Clontech Pico Input
ERCC:: External RNA Controls Consortium
FFPE:: Fromalin Fixed Parafin Embedded
FPKM:: Fragments per Kilobase per Million
K:: Kapa Biosystems RiboErase
lncRNA:: Long Non-Coding RNA
LX:: Lexogen RiboCop
mRNA:: Messenger RNA
mtRNA:: Mitochondrial RNA
NE:: New England Biolabs rRNA removal
PCR:: Polymerase Chain Reaction
Q:: Qiagen rRNA removal
RPKM:: Reads per Kilobase per Million
RPM:: Reads per Milllion
rRNA:: ribosomal RNA
RT-qPCR:: Real-Time Quantitative Polymerase Chain Reaction
RZ:: Illumin Ribo-zero
SIRV:: Spike-in RNA Variant Controls
SOP:: Standard Operating Procedure
SRPs:: Signal Recognition Particle RNA
UHR:: Universal Human Reference RNA

References

Cui P, Lin Q, Ding F, Xin C, Gong W, Zhang L, et al. A comparison between ribo-minus RNA-sequencing and polyA-selected RNA-sequencing. Genomics. 2010;96:259–65.
Article CAS PubMed Google Scholar
Zhao W, He X, Hoadley KA, Parker JS, Hayes DN, Perou CM. Comparison of RNA-Seq by poly (a) capture, ribosomal RNA depletion, and DNA microarray for expression profiling. BMC genomics. BioMed Central. 2014;15:419.
Google Scholar
Guo Y, Zhao S, Sheng Q, Guo M, Lehmann B, Pietenpol J, et al. RNAseq by Total RNA library identifies additional RNAs compared to poly(a) RNA library. Biomed Res. 2015;2015:1–9.
Google Scholar
Masuda N, Ohnishi T, Kawamoto S, Monden M, Okubo K. Analysis of chemical modification of RNA from formalin-fixed samples and optimization of molecular biology applications for such samples. Nucleic Acids Res Oxford University Press. 1999;27(22):4436–43.
Article CAS Google Scholar
Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, and mechanisms. Cell. 2013;154:26–46.
Article CAS PubMed PubMed Central Google Scholar
Jiang L, Schlesinger F, Davis CA, Zhang Y, Li R, Salit M, et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011;21:1543–51.
Article CAS PubMed PubMed Central Google Scholar
Risso D, Ngai J, Speed TP, Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol. 2014;32:896–902.
Article CAS PubMed PubMed Central Google Scholar
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Article CAS PubMed Google Scholar
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.
Article PubMed PubMed Central Google Scholar
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
Article CAS PubMed PubMed Central Google Scholar
Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009;25:1754–60.
Article CAS PubMed PubMed Central Google Scholar
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This study is a product of the DNA Sequencing Research Group (DSRG) of the Association of Biomedical Research Facilities (ABRF, www.abrf.org). The authors are grateful to the participating vendors: Illumina, New England Biolabs, Lexogen, Takara/Clontech, Qiagen and Kapa/Roche who donated library preparation reagents, spike-in controls and sequencing reagents as well as to the ABRF executive board for funding. We also thank the many member labs within the ABRF who volunteered to test kits.

Funding

This work is supported by NCI award P30-CA14051 (VB, SL), NIEHS award P30-ES002109 (VB, SL) and the Harvard University Center for Aids Research NIH award P30 AI060354 (ZTH,MB,GS,LG). The funding bodies have not had a role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials

The dataset supporting the conclusions of this article is available in the NCBI Gene Expression Omnibus repository, accession number GSE100127 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE100127).

Author information

Authors and Affiliations

Molecular Biology Core Facilities at Dana-Farber Cancer Institute, Boston, MA, USA
Zachary T. Herbert, Maura Berkeley, Gary Sommerville & Leslie Grimmett
BioFrontiers Institute, Next-Gen Sequencing Facility, University of Colorado Boulder, Boulder, CO, USA
Jamie P. Kershner
MIT BioMicro Center, Massachusetts Institute of Technology, Cambridge, MA, USA
Vincent L. Butty & Stuart S. Levine
Bioinformatics Core, Purdue University, West Lafayette, IN, USA
Jyothi Thimmapuram & Sulbha Choudhari
Microarray and Sequencing Resource Core Facility, Boston University, Boston, MA, USA
Yuriy O. Alekseyev & Ashley LeClerc
Department of Pathology and Laboratory Medicine, Boston University, Boston, MA, USA
Yuriy O. Alekseyev
Genomic Core Facility, Department of Biomedical Sciences, Joan C. Edwards School of Medicine, Marshall University, Huntington, WV, USA
Jun Fan
Genomic Sequencing and Analysis Facility, University of Texas, Austin, TX, USA
Jessica W. Podnar
DNA Sequencing Center, Brigham Young University, Provo, UT, USA
Edward Wilcox
Laboratory for Molecular Biology and Cytometry, University of Oklahoma Health Sciences Center, Oklahoma City, OK, USA
Jenny Gipson & Allison Gillaspy
IGM Genomics Center, University of California, San Diego, La Jolla, CA, USA
Kristen Jepsen
UW Biotechnology Center, Gene Expression Center, University of Wisconsin-Madison, Madison, WI, USA
Sandra Splinter BonDurant
Hubbard Center for Genome Studies, University of New Hampshire, Durham, NH, USA
Krystalynne Morris & Stephen D. Simpson
Genomics Core Facility, Van Andel Institute, Grand Rapids, MI, USA
Marie Adams

Authors

Zachary T. Herbert
View author publications
You can also search for this author in PubMed Google Scholar
Jamie P. Kershner
View author publications
You can also search for this author in PubMed Google Scholar
Vincent L. Butty
View author publications
You can also search for this author in PubMed Google Scholar
Jyothi Thimmapuram
View author publications
You can also search for this author in PubMed Google Scholar
Sulbha Choudhari
View author publications
You can also search for this author in PubMed Google Scholar
Yuriy O. Alekseyev
View author publications
You can also search for this author in PubMed Google Scholar
Jun Fan
View author publications
You can also search for this author in PubMed Google Scholar
Jessica W. Podnar
View author publications
You can also search for this author in PubMed Google Scholar
Edward Wilcox
View author publications
You can also search for this author in PubMed Google Scholar
Jenny Gipson
View author publications
You can also search for this author in PubMed Google Scholar
Allison Gillaspy
View author publications
You can also search for this author in PubMed Google Scholar
Kristen Jepsen
View author publications
You can also search for this author in PubMed Google Scholar
Sandra Splinter BonDurant
View author publications
You can also search for this author in PubMed Google Scholar
Krystalynne Morris
View author publications
You can also search for this author in PubMed Google Scholar
Maura Berkeley
View author publications
You can also search for this author in PubMed Google Scholar
Ashley LeClerc
View author publications
You can also search for this author in PubMed Google Scholar
Stephen D. Simpson
View author publications
You can also search for this author in PubMed Google Scholar
Gary Sommerville
View author publications
You can also search for this author in PubMed Google Scholar
Leslie Grimmett
View author publications
You can also search for this author in PubMed Google Scholar
Marie Adams
View author publications
You can also search for this author in PubMed Google Scholar
Stuart S. Levine
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

MA,SL,ZH conceived of study; ZH,JK,YA,JF,JP,MA,SL coordinated rRNA depletion method protocols and directed sites; JK,YA,JF,JP,EW,JG,AG,KJ,SBD,KM,MB,AL,SS,GS,LG,MA generated data; JT,VB,SC, and SL analyzed data; JK,ZH,JT,MA, and SL wrote the manuscript. All authors read and approved the final manuscript for publication.

Corresponding author

Correspondence to Stuart S. Levine.

Ethics declarations

Ethics approval and consent to participate

Study does not involve human or animal subjects.

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1:

Figure S1. BioAnalyzer traces for samples used in the study. Left: Intact UHR, Right: Heat degraded UHR RNA. (PPTX 127 kb)

Additional file 2:

Contains the following: Describes normalization of kit protocols and any individual site deviations from this normalization. Figure S1. – BioAnalyzer traces for samples used in the study. Left: Intact UHR, Right: Heat degraded UHR RNA. Figure S2. – Clustering of differentially detected genes. Top 50 most differentially detected genes, as measured by variance of log2RPKM across all samples, were clustered based on their differential expression. TOP: Hierarchical tree of clustering based on a complete linkage function using Euclidean distance. 2ND LINE: Intact/Degraded status is shown. Intact samples are indicated in white while degraded samples are indicated in grey. 3RD LINE: Kit. Dark Blue = RZ|RiboZero Gold, Yellow = LX|Lexogen RiboCop, Aqua = NE|NEBNext rRNA Depletion, Green = Q|Qiagen, Grey = K|Kapa RiboErase, Blue = CR|Clontech Ribogone, Orange = CZ|SMARTer Pico total RNA. HEAT MAP: Red indicate higher level of absolute expression. Scale shown to right. White lines indicate the highest branches within the hierarchical tree. Figure S3. – Insert size distribution for RNAseq libraries from intact RNA. The insert size for each library passing the 50% rRNA filter was calculated for reads with convergent reads that were separated by < 1000 bp. Kit abbreviations: RZ = RiboZero Gold, LX = Lexogen RiboCop, NE = NEBNext rRNA Depletion, K=Kapa RiboErase, CR = Clontech Ribogone, CZ = SMARTer Pico total RNA. Top: length of the 90th percentile of inserts reads. Middle: length of the median insert read. Bottom: length of the 10th percentile of inserts read. Figure S4. - Relative ratio of reads mapping to ERCCs. Fraction of reads mapping to each ERCC mRNA is shown for each replicate. Light horizontal lines show 2-fold changes in fraction observed (log scale). Each expected concentration is shown in a different color. Data sets ordered by intact/degraded status followed by site within each kit left to right. Table S1. Catalog numbers and manual versions for protocols used. Table S2. Comparison of RNA library preparation chemistry variables. (DOCX 21 kb)

Additional file 3:

Figure S2. Clustering of differentially detected genes. Top 50 most differentially detected genes, as measured by variance of log2RPKM across all samples, were clustered based on their differential expression. TOP: Hierarchical tree of clustering based on a complete linkage function using Euclidean distance. 2ND LINE: Intact/Degraded status is shown. Intact samples are indicated in white while degraded samples are indicated in grey. 3RD LINE: Kit. Dark Blue = RZ|RiboZero Gold, Yellow = LX|Lexogen RiboCop, Aqua = NE|NEBNext rRNA Depletion, Green = Q|Qiagen, Grey = K|Kapa RiboErase, Blue = CR|Clontech Ribogone, Orange = CZ|SMARTer Pico total RNA. HEAT MAP: Red indicate higher level of absolute expression. Scale shown to right. White lines indicate the highest branches within the hierarchical tree. (PPTX 179 kb)

Additional file 4:

Figure S3. Insert size distribution for RNAseq libraries from intact RNA. The insert size for each library passing the 50% rRNA filter was calculated for reads with convergent reads that were separated by < 1000 bp. Kit abbreviations: RZ = RiboZero Gold, LX = Lexogen RiboCop, NE = NEBNext rRNA Depletion, K=Kapa RiboErase, CR = Clontech Ribogone, CZ = SMARTer Pico total RNA. Top: length of the 90th percentile of inserts reads. Middle: length of the median insert read. Bottom: length of the 10th percentile of inserts read. (PPTX 82 kb)

Additional file 5:

Figure S4. Fraction of reads mapping to each ERCC mRNA is shown for each replicate. Light horizontal lines show 2-fold changes in fraction observed (log scale). Each expected concentration is shown in a different color. Data sets ordered by intact/degraded status followed by site within each kit left to right. (PPTX 165 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Herbert, Z.T., Kershner, J.P., Butty, V.L. et al. Cross-site comparison of ribosomal depletion kits for Illumina RNAseq library construction. BMC Genomics 19, 199 (2018). https://doi.org/10.1186/s12864-018-4585-1

Download citation

Received: 13 September 2017
Accepted: 08 March 2018
Published: 15 March 2018
DOI: https://doi.org/10.1186/s12864-018-4585-1

Cross-site comparison of ribosomal depletion kits for Illumina RNAseq library construction

Abstract

Background

Results

Conclusions

Background

Results

Discussion

Conclusions

Methods

Input RNA preparation

Site selection and index allocation

Protocol normalization

rRNA depletion and library construction

Sequencing

Alignment and quality control

RNA-Seq mapping and quantitation

Differential representation analysis

lincRNA analysis

Genomic features analysis and visualization

Abbreviations

References

Acknowledgements

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Additional files

Additional file 1:

Additional file 2:

Additional file 3:

Additional file 4:

Additional file 5:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us