Evaluating bias-reducing protocols for RNA sequencing library preparation
© Jackson et al.; licensee BioMed Central Ltd. 2014
Received: 4 March 2014
Accepted: 26 June 2014
Published: 7 July 2014
Next-generation sequencing does not yield fully unbiased estimates for read abundance, which may impact on the conclusions that can be drawn from sequencing data. The ligation step in RNA sequencing library generation is a known source of bias, motivating developments in enzyme technology and library construction protocols. We present the first comparison of the standard duplex adaptor protocol supplied by Life Technologies for use on the Ion Torrent PGM with an alternate single adaptor approach involving CircLigase (CircLig protocol).
A correlation between over-representation in sequenced libraries and degree of secondary structure has been reported previously, therefore we also investigated whether bias could be reduced by ligation with an enzyme that functions at a temperature not permissive for such structure.
A pool of small RNA fragments of known composition was converted into a sequencing library using one of three protocols and sequenced on an Ion Torrent PGM. The CircLig protocol resulted in less over-representation of specific sequences than the standard protocol. Over-represented sequences are more likely to be predicted to have secondary structure and to co-fold with adaptor sequences. However, use of the thermostable ligase Methanobacterium thermoautotrophicum RNA ligase K97A (Mth K97A) was not sufficient to reduce bias.
The single adaptor CircLigase-based approach significantly reduces, but does not eliminate, bias in Ion Torrent data. Ligases that function at temperatures to remove the possible influence of secondary structure on library generation may be of value, although Mth K97A is not effective in this case.
KeywordsNext-generation sequencing Bias Ion Torrent Mth K97A Thermostable CircLigase Small RNA
Next-generation sequencing is a powerful genomic tool that can be used to investigate the transcriptome (the abundance of RNA species), the translatome (the ribosome occupancy on mRNAs) and the interactome (RNA-protein binding). However, substantial differences in data obtained have been observed when the same fragment library is sequenced on different platforms . This has motivated attempts to characterise and remedy biases in sequencing library generation, either through modification of protocols [2, 3] or applying bioinformatic corrections .
Whilst there are a number of different sequencing platforms, each share a common series of steps to convert the RNA pool of interest into an RNA-seq library. Each platform has a specific set of adaptors that are ligated to both 5′ and 3′ ends of a pool of RNA fragments of interest (e.g. small RNAs or fragmented mRNAs). These adaptors are then used to prime reverse transcription and PCR amplification. These completed libraries are then sequenced, using the adaptors as the starting point for the sequencing reaction.
However, the biochemical manipulations involved in this library generation process introduce biases that affect the final sequencing output. PCR amplification results in under-representation of both AT rich and GC rich regions [5, 6], but this can be minimised by the use of polymerases that have been generated through molecular evolution to reduce these biases such as KAPA HiFi . Single molecule studies have shown that T4 RNA ligases 1 and 2 (rnl1 and rnl2), which are used to ligate adaptor sequences, are also associated with significant biases (reviewed in ). A point mutation of the truncated rnl2 (trRnl2 K227Q) reduces bias , but fragments predicted to co-fold with the adaptor are still over-represented in sequencing libraries when standard protocols are used . To address whether ligation at temperatures that minimise co-folding reduces bias, a thermostable 3′RNA ligase has been developed: Methanobacterium thermoautotrophicum RNA ligase K97A mutant (Mth K97A) . To date, the suitability of Mth K97A for use in RNA-seq has not been assessed.
Herein we present the first direct comparison of the CircLig and standard Life Technologies protocols. Furthermore, to investigate whether Mth K97A might be suitable for use in RNA-seq library generation, we substituted it for trRnl2 K227Q in the CircLig protocol. The degree of bias was assessed by generating libraries from a pool of 20 nt RNAs of known composition using each protocol and sequencing them on an Ion Torrent personal genome machine.
Comparing over-representation introduced by different library preparation protocols
Position specific biases
Correlation between predicted structure and over-representation
Importantly, only the most over-represented fragments from the Mth library were associated with any secondary structure or co-folding ability suggesting that secondary structure is not a major source of bias when this protocol and enzyme are used.
Next-generation sequencing can be used for a range of genomic investigations. However, as with any technology, systemic biases affect the accuracy of sequencing data and thus the strength of conclusions that can be drawn. The ligation step in library generation has been shown to be a significant source of bias. We found that compared to the standard protocol the CircLig protocol with trRnl2 K227Q was associated with almost half the level of over-representation when a degenerate RNA pool was sequenced. Although the adaptor sequences for library generation differ between Life Technologies SOLiD and Ion Torrent platforms, the protocols are essentially similar. Therefore, using the correct adaptor sequences, the CircLig approach could be used for any Life Technologies platform and be expected to produce more representative libraries than the standard protocols.
Analysis of the sequencing data from both standard and CircLig protocols revealed the sequences that were over-represented were predicted to be more structured under the ligation conditions. This correlation may be causative as T4 RNA ligase 2 is involved in the repair of nicked dsRNA , but this paper does not seek to address this question directly. Instead we show that ligation with the thermostable ligase Mth K97A at temperatures not broadly permissive for secondary structure and co-folding does not reduce over-representation.
A substantial portion of the RNA pool could not be ligated using Mth K97A (Figure 2). The initial characterisation of the ligation efficiency of this enzyme was performed with one RNA sequence . By using a partially degenerate fragment pool we are able to characterise the enzyme more fully and reveal the enzyme has a strong preference for A and C at the 3rd nucleotide from the ligation site. Consistent with this, much higher ligation efficiencies were obtained when the enzyme was used by Zhelkovsky and Reynolds to ligate RNA with A at this position  than we observed with our partially degenerate RNA pool. While we cannot exclude the possibility that this is specific to the adaptor sequence used (rather than a general limitation of this enzyme), this bias does make Mth K97A inappropriate for use in existing sequencing protocols.
Surprisingly, the most abundant sequences in all libraries had a higher than expected G-content. It is unclear whether this is because each library protocol has a previously unknown bias in favour of G-rich sequences or if the bias is at a step common to all libraries. We suggest this may be an interesting avenue for further research.
The CircLig protocol reduces, but does not abolish, bias associated with Ion Torrent PGM RNA-seq. Highly structured sequences are more likely to be over-represented in RNA-seq libraries, but this is not remedied by the use of the thermostable Mth K97A enzyme. Although the CircLig protocol does involve more hands-on time than the standard Life Technologies protocol, it offers superior accuracy and therefore we recommend it for sequencing on Life Technologies platforms.
Design of fragment pool
A random number generator (http://www.random.org) was programmed to output values between 1 and 4 inclusive, with each digit corresponding to a different base. The program was run 10 times giving the following sequence: GCAGUUGCCA. An RNA fragment pool with this sequence followed by 10 degenerate nucleotides (5′ – GCAGUUGCCANNNNNNNNNN – 3′) was synthesised (ThermoScientific Molecular Biology). Fragments contained both 5′ phosphate and 3′ hydroxyl moieties necessary for ligation.
Small RNA library construction
For standard libraries, ligation of 10 ng of RNA pool was performed with the Ion Total RNA-Seq kit v2 (Life Technologies), following the manufacturer’s instructions.
For the CircLig libraries, 40 ng RNA pool (~6 pmol) and 100 ng Universal Cloning Linker (~18 pmol; NEB) were denatured at 80°C for 2 min then placed immediately on ice. Ligation with 200 U T4 RNA Ligase 2 truncated K227Q mutant (trRnl2 K227Q; NEB) was performed in the presence of 1X RNA ligase buffer (NEB), 15% PEG8000, 20 U SuperaseIn (Ambion), at 25°C for 3 hr. Ligation with 20 pmol Mth K97A was performed in 1X NEB1 buffer and 20 U SuperaseIn, at 65°C for 1 hr, as described by Zhelkovsky and McReynolds (11). Reactions were terminated by heat inactivation at 90°C and RNA/DNA recovered by isopropanol precipitation with 1 μl GlycoBlue (Ambion) as a coprecipitant. Reaction mixtures were denatured at 80°C in an equal volume of formamide loading buffer (96% Formamide, 10 mM EDTA, 0.01% bromophenol blue, 0.01% xylene cyanol) and separated on a 15% TBE-Urea polyacrylamide gel (Invitrogen). 250 ng small RNA marker (Abnova), denatured the same way, was run in adjacent wells. Gels were stained with SybrGold (Invitrogen) and visualised using a Safe Imager (Invitrogen) to guide excision of the 37 nt band. RNA/DNA was recovered from the gel slice by overnight elution in 300 mM NaCl, 0.1% SDS at 4°C followed by isopropanol precipitation and resuspension in 10 μl nuclease-free water. 2 μl of 1.25 μM Ion Torrent compatible reverse transcription (RT) primer (5′-CGCCTTGGCC/Sp/CACTCA/Sp/CCTCTCTATGGGCAGTCGGTGATATCTATTGATGGTGCCTACAG – 3′ where Sp is an 18-atom hexa-ethyleneglycol spacer) was added to each sample, denatured at 80°C for 3 min then snap cooled on ice. Reverse transcription was performed using 200U Superscript III (Invitrogen) at 48°C for 30 min in 1X first strand buffer (Invitrogen), 0.5 mM dNTPs, 5 mM DTT and 20 U SuperaseIn. RNA was hydrolysed by incubation at 98°C for 20 min with 100 mM NaOH, 50 mM EDTA. cDNA was recovered by isopropanol precipitation and separated on a 6% TBE-Urea polyacrylamide gel as described above. The upper band (~80 bp) was excised, being careful to avoid the lower weight bands (unincorporated RT primer and RT primer dimers). cDNA was recovered from the gel slice and circularised with 100 U Circligase (Epicentre) in 1X Circligase buffer, 2.5 mM MnCl2, 0.5 mM ATP, at 60°C for 1 hr. Reactions were heat inactivated at 80°C and cDNA precipitated.
All libraries were PCR amplified and sequenced using an Ion Torrent 318 chip as per manufacturer’s instructions, with the difference in library size in the CircLig protocol accounted for to ensure the same molarity of each library was used.
The PCR adaptors were removed from the reads using the standard Ion Torrent pipeline. Sequenced reads were then filtered based on their size: 18 nt to 22 nt inclusive for reads from the standard protocol, 40 nt to 44 nt inclusive for CircLig based protocols. A custom script using Smith-Waterman alignments, with identity >0.7, was used to remove the linker sequence from the 3′ end of the alternate protocol reads, and to remove the 10 nt non-degenerate region from all reads.
Read sets were analysed to determine how many times each particular sequence was observed. A theoretical dataset corresponding to the total number of sequenced reads from each library was constructed, i.e. a distribution X ~ Binomial(n, 1/410), where n was the total number of reads. Goodness of fit testing was performed using a discrete Kolmogorov –Smirnov test from the R Package ‘Matching’ with 500 iterations of bootstrapping.
The predicted structure of each read, including the 10 nt non-degenerate region, was produced using RNAfold from the ViennaRNA Package , and the minimum free energies were plotted against the reads ranked from most to least abundant, using the mean minimum free energy over a sliding 1000-read window. Co-folding structures between the fragment (GCAGUUGCCANNNNNNNNNN) and the linking sequence used in the alternative protocols (CUGUAGGCACCAUCAAU) were also predicted using RNAcofold from the ViennaRNA Package, and plotted in the same way. Predictions were made at 25°C for the rnl2 dataset, at 65°C for the mth dataset, and at 16°C for the standard protocol dataset. All other parameters were left as defaults. Plots were produced using R (http://www.R-project.org). Data are available in the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MTAB-2566.
This work was funded by the Biotechnology and Biological Sciences Research Council [TJJ grant number BB/D526653/1], a BBSRC Professorial Fellowship [AEW,RVS grant number BB/F02326X/2] and the Medical Research Council (Programme funding to AEW).
- Tian G, Yin X, Luo H, Xu X, Bolund L, Zhang X, Gan SQ, Li N: Sequencing bias: comparison of different protocols of microRNA library construction. BMC Biotechnol. 2010, 10: 64-PubMed CentralPubMedView ArticleGoogle Scholar
- Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS: Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009, 324 (5924): 218-223.PubMed CentralPubMedView ArticleGoogle Scholar
- Sorefan K, Pais H, Hall AE, Kozomara A, Griffiths-Jones S, Moulton V, Dalmay T: Reducing sequencing bias of small RNAs. Silence. 2012, 3 (1): 4-PubMed CentralPubMedView ArticleGoogle Scholar
- Benjamini Y, Speed TP: Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012, 40 (10): e72-PubMed CentralPubMedView ArticleGoogle Scholar
- Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18-PubMed CentralPubMedView ArticleGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-PubMed CentralPubMedView ArticleGoogle Scholar
- Oyola SO, Otto TD, Gu Y, Maslen G, Manske M, Campino S, Turner DJ, Macinnis B, Kwiatkowski DP, Swerdlow HP, Quail MA: Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics. 2012, 13: 1-PubMed CentralPubMedView ArticleGoogle Scholar
- Zhuang F, Fuchs RT, Robb GB: Small RNA expression profiling by high-throughput sequencing: implications of enzymatic manipulation. J Nucleic Acids. 2012, 2012: 360358-PubMed CentralPubMedView ArticleGoogle Scholar
- Viollet S, Fuchs RT, Munafo DB, Zhuang F, Robb GB: T4 RNA ligase 2 truncated active site mutants: improved tools for RNA analysis. BMC Biotechnol. 2011, 11: 72-PubMed CentralPubMedView ArticleGoogle Scholar
- Zhuang F, Fuchs RT, Sun Z, Zheng Y, Robb GB: Structural bias in T4 RNA ligase-mediated 3′-adapter ligation. Nucleic Acids Res. 2012, 40 (7): e54-PubMed CentralPubMedView ArticleGoogle Scholar
- Zhelkovsky AM, McReynolds LA: Structure-function analysis of Methanobacterium thermoautotrophicum RNA ligase - engineering a thermostable ATP independent enzyme. BMC Mol Biol. 2012, 13 (1): 24-PubMed CentralPubMedView ArticleGoogle Scholar
- Nandakumar J, Ho CK, Lima CD, Shuman S: RNA substrate specificity and structure-guided mutational analysis of bacteriophage T4 RNA ligase 2. J Biol Chem. 2004, 279 (30): 31337-31347.PubMedView ArticleGoogle Scholar
- Lorenz R, Bernhart SH, Honer Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Algorithms Mol Biol. 2011, 6: 26-PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.