RDDpred: a condition-specific RNA-editing prediction model from RNA-seq data
© Kim et al. 2015
Published: 11 January 2016
RNA-editing is an important post-transcriptional RNA sequence modification performed by two catalytic enzymes, "ADAR" (A-to-I) and "APOBEC" (C-to-U). By utilizing high-throughput sequencing technologies, the biological function of RNA-editing has been actively investigated. Currently, RNA-editing is considered to be a key regulator that controls various cellular functions, such as protein activity, alternative splicing patterns of mRNA, and substitution of miRNA targeting sites. DARNED, a public RDD (RNA/DNA difference) database, reports more than 300 thousand RNA-editing sites detected in the human genome (hg19). Moreover, multiple studies have suggested that RNA-editing events occur in highly specific conditions. According to DARNED, 97.62 % of registered editing sites were detected in a single tissue or in a specific condition, which also supports the view that RNA-editing events occur condition-specifically. Since RNA-seq can capture the whole landscape of the transcriptome, it is widely used for RDD prediction. However, significant numbers of false positives or artefacts can be generated when detecting RNA-editing from RNA-seq. Since it is difficult to perform experimental validation at the whole-transcriptome scale, a powerful computational tool is needed to distinguish true RNA-editing events from artefacts.
We developed RDDpred, a Random Forest RDD classifier. RDDpred reports potentially true RNA-editing events from RNA-seq data. RDDpred was tested with two publicly available RNA-editing datasets and successfully reproduced RDDs reported in the two studies (90 %, 95 %) while rejecting false-discoveries (NPV: 75 %, 84 %).
RDDpred automatically compiles condition-specific training examples without experimental validation and then constructs an RDD classifier. As far as we know, RDDpred is the first automated, machine-learning based pipeline for RDD prediction. We believe that RDDpred will be very useful and can contribute significantly to the study of condition-specific RNA-editing. RDDpred is available at http://biohealth.snu.ac.kr/software/RDDpred .
RNA-editing: a biologically crucial regulator and highly condition-specific event
An RNA-editing event is defined as a post-transcriptional RNA sequence modification. Currently, there are two known RNA-editing mechanisms, performed by two different catalytic enzymes, “ADAR” (A-to-I) and “APOBEC” (C-to-U) [2, 3]. The most common type of editing in metazoans is the one catalyzed by the ADAR family of enzymes. By utilizing high-throughput sequencing technologies, the biological function of RNA-editing has been actively investigated [5–7]. Currently, RNA-editing is considered to be a key regulator that controls various cellular functions, including protein activity, alternative splicing patterns of mRNA, and substitution of miRNA targeting sites [1, 8–10].
Moreover, multiple studies have shown a direct relation between RNA-editing and biological phenotypes. For example, Galeano’s group showed that editing events by ADAR2 enzymes in glioblastoma are crucial for pathogenesis and claimed that the ADAR-class enzyme can be considered a tumor suppressor. In addition, APOBEC3G, an APOBEC-class enzyme, causes HIV-1 retroviral inactivation by deamination.
RNA-seq: an important tool for investigating condition-specific RNA-editing patterns
RNA-seq, high-throughput sequencing of the transcriptome, is a powerful method for investigating whole-transcriptome status. Since the technology takes a snapshot of cells with massive numbers of sequencing reads, it is suitable for detecting condition-specific events at the whole-transcriptome scale. Therefore, it is also suited to detecting RNA-editing events, which have such condition-specific characteristics. A number of studies have used RNA-seq to reveal condition-specific editing patterns at the whole-transcriptome scale [5–7].
Systematic artefacts: the major hurdle to detecting authentic RNA-editing events from RNA-seq
Even though RNA-seq is suitable for RNA-editing detection, current computational pipelines for RNA-editing detection with RNA-seq carry a considerable risk of false positives. A 2012 article in Nature Biotechnology, “The difficult calls in RNA editing”, reports interviews with eight prominent RNA-editing researchers. They pointed out that false-positive calling is one of the most challenging problems in RNA-editing detection with RNA-seq.
The false positives caused by mis-alignment of short reads can be termed “Systematic Artefacts” due to their inherent and reproducible characteristics. Systematic artefacts can arise for various reasons: (a) inherent duplications/repeats within genomic sequences, (b) ambiguity caused by splicing junctions, (c) prevalent polymorphisms between individuals, and (d) shortness of sequencing reads [17, 18]. This inherent and reproducible error has been assumed to be one of the major confounding factors in detecting sequence variants [19, 20].
Artefact simulation results: obtained from 10 iterations
Three distinct computational approaches that addressed the artefact-issue
To handle systematic artefacts in RNA-seq, a number of computational approaches have been developed. These can be categorized into three groups in terms of the features they use: (a) a priori knowledge based filtering [26, 27], (b) computational simulation of artefacts, and (c) machine-learning based prediction models [5, 28].
A priori knowledge based filtering uses public genomic features, such as Alu repeats, genomic duplications, and pseudogenes, to assess the detected editing sites directly. For instance, Li’s group used public annotation of genomic repeats to filter out potential artefacts within the detected RDD (RNA/DNA Difference) sites. On the other hand, the approach based on computational simulation of artefacts uses calculated features rather than public features. Peng’s group used extensively simulated RNA-seq data to predict inherently error-prone regions in the genome sequence and used them as a filter.
Unlike the filter-based methods, which directly assess RDD candidate sites with pre-defined filters, machine-learning based methods generate a predictor in advance. The predictor, or machine-learning classifier, is trained to learn the differences between true and false examples. As an example, Laurent’s group generated a Random Forest predictor that utilizes read-alignment patterns as attributes. With 77 attributes, Laurent’s group generated a predictor and demonstrated, by experimental validation, an estimated accuracy of 87 %. As mentioned, since RNA-editing events occur in a highly condition-specific manner, the machine-learning approach might have an advantage in that it is more data-driven, generating a condition-specific model.
Machine-learning based RNA-editing prediction became possible
Laurent’s work was the first successful demonstration that a machine-learning approach to RNA-editing prediction is both feasible and sensitive. However, it has several limitations as a general-purpose model. First of all, a predictor needs training data consisting of positive and negative examples. In Laurent’s study, both kinds of training examples were collected from additionally performed Sanger-seq. However, as we emphasized, RNA-editing is a condition-specific event. Since Laurent’s approach used experimentally verified training examples specific to their own conditions, the model might not be applicable to different conditions unless additional sequencing is performed. Therefore, it would be more cost-efficient to apply the machine-learning approach without requiring experimental validation.
Implementation of RDDpred
We tested RDDpred with Python (2.7.3), Samtools-Bcftools (1.2.1), and the WEKA (3.6.12) package in a Linux environment.
1) Input and output of RDDpred
RDDpred takes alignment results as input and outputs a prediction for each SNV, or RDD candidate. The raw RDD candidates are detected with the Samtools-Bcftools pipeline, while the prediction model is trained using the WEKA package.
2) Selection of alignment tool by the user
Condition-specific training data preparation
1) Positive-set of training data: utilizing public databases, RADAR and DARNED
The RADAR and DARNED databases include 2.5 million and 300 thousand curated sites, respectively [13, 29]. The two databases share a considerable portion of sites, 150 thousand in total. Since these previously known sites have already been shown to have editing potential, we can use the candidate sites that match the consensus sites as positive examples (Fig. 4). Since RDDpred takes the positive sites as an input, users can change or supplement the sites that are considered true events.
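The selection of positive examples can be illustrated with a minimal sketch. It is not RDDpred's actual code: the function names and the toy coordinates are hypothetical, and it assumes that "consensus sites" means sites present in both databases.

```python
def to_site_set(records):
    """Normalize (chromosome, position) records into a set of site keys."""
    return {(chrom, int(pos)) for chrom, pos in records}


def select_positive_examples(candidates, radar_sites, darned_sites):
    """Return the RDD candidates that match the database consensus.

    Assumption: "consensus" is taken to mean present in BOTH databases.
    """
    consensus = radar_sites & darned_sites
    return candidates & consensus


# Hypothetical toy coordinates, not real database entries.
radar = to_site_set([("chr1", 100), ("chr1", 200), ("chr2", 300)])
darned = to_site_set([("chr1", 200), ("chr2", 300), ("chr3", 400)])
candidates = to_site_set([("chr1", "200"), ("chr2", 999), ("chr3", 400)])

positives = select_positive_examples(candidates, radar, darned)
print(sorted(positives))  # [('chr1', 200)]
```

In this sketch, a user who wants to supplement the known sites would simply add extra site keys to one of the input sets before calling the function.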
2) Negative-set of training data: applying MES artefact calculation method
RDDpred predictor description
1) RDDpred mainly focuses on systematic artefacts
It is known that RNA-seq produces various types of artefacts, such as amplification errors during library construction, sequencing errors, and errors caused by mis-alignment. Unlike the former two, mis-alignment errors are reproducible. Since errors from library construction and sequencing are generally transient, they can be excluded by replicating experiments. In contrast, since mis-alignment errors, or systematic artefacts, are inherent to the specific alignment method, they might not be excluded even after multiple replications. Therefore, RDDpred mainly focuses on detecting systematic artefacts, while also taking other artefacts into account.
2) Read-alignment pattern: a valuable source for distinguishing systematic artefacts
Attributes used to train the prediction model: a total of 15 features are calculated with the samtools-bcftools (v1.2) pipeline
VAF: Variant read ratio
SGB: Segregation based metric
FQ: Phred probability of all samples being the same
PV3: Mapping quality bias
MQB: Mann-Whitney U test of Mapping Quality Bias
MQ0F: Fraction of MQ0 reads
MQ: Root-mean-square mapping quality of covering reads
VDB: Variant Distance Bias for filtering splice-site artefacts in RNA-seq data
RPB: Mann-Whitney U test of Read Position Bias
PV4: Tail distance bias
PV2: Base quality bias
BQB: Mann-Whitney U test of Base Quality Bias
PV1: Read strand bias
3) The 15 attributes for RDDpred
As mentioned, the 15 attributes are grouped into six categories. (a) The “Read Depth” category represents the read count at editing sites. (b) The “Allele Segregation” category includes four attributes: VAF, SGB, FQ, and CallQual. All of these attributes are calculated from the ratio of edited reads to total reads. (c) The “Mapping Quality” category reflects how well the reads are aligned, using the alignment scores that the aligner generates. Four attributes, PV3, MQB, MQ0F, and MQ, belong to this category. (d) The “Read Position” category includes three attributes, VDB, RPB, and PV4, which represent how strongly the positions of variants are biased within sequencing reads. (e) The “Base Quality” category uses base-quality information generated by the sequencing machine to detect whether low-quality bases are significantly biased toward editing sites. Two attributes, PV2 and BQB, belong to this category. (f) Finally, the “Read Strand” category includes a single attribute, PV1, which represents how strongly the strands of edited reads are biased relative to those of non-edited reads (Table 2).
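To make the attribute list concrete, the following sketch shows one way such a 15-value vector could be read from a single bcftools-style VCF record. It is an illustration under stated assumptions, not RDDpred's code: the function names are hypothetical, "DP" is assumed to be the Read Depth attribute, VAF is assumed to be derived from the DP4 read counts, CallQual is assumed to be the QUAL column, and PV1 to PV4 are assumed to be the four comma-separated values of the PV4 tag (strand, base quality, mapping quality, tail distance). The example record is made up.

```python
import math

# Attribute order follows the six categories described in the text.
FEATURE_ORDER = [
    "DP",                             # Read Depth (assumed tag name)
    "VAF", "SGB", "FQ", "CallQual",   # Allele Segregation
    "PV3", "MQB", "MQ0F", "MQ",       # Mapping Quality
    "VDB", "RPB", "PV4",              # Read Position
    "PV2", "BQB",                     # Base Quality
    "PV1",                            # Read Strand
]


def parse_info(info_field):
    """Split a VCF INFO string 'K1=V1;K2=V2' into a dict of strings."""
    fields = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def number(fields, key):
    """Return the tag as a float, or NaN when the tag is absent."""
    return float(fields[key]) if key in fields else math.nan


def extract_features(vcf_line):
    """Build the attribute dict for one tab-separated VCF record."""
    cols = vcf_line.rstrip("\n").split("\t")
    info = parse_info(cols[7])
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info["DP4"].split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if "PV4" in info:
        pv = [float(x) for x in info["PV4"].split(",")]
    else:
        pv = [math.nan] * 4
    values = {
        "VAF": (alt_fwd + alt_rev) / total if total else math.nan,
        "CallQual": float(cols[5]),
        "PV1": pv[0], "PV2": pv[1], "PV3": pv[2], "PV4": pv[3],
    }
    for tag in ("DP", "SGB", "FQ", "MQB", "MQ0F", "MQ", "VDB", "RPB", "BQB"):
        values[tag] = number(info, tag)
    return {name: values[name] for name in FEATURE_ORDER}


# Made-up record for illustration only.
EXAMPLE_LINE = (
    "chr1\t1000\t.\tA\tG\t50.0\t.\t"
    "DP=20;VDB=0.5;SGB=-0.25;RPB=1;MQB=1;BQB=0.75;MQ0F=0;"
    "DP4=6,4,5,5;MQ=60;PV4=1,0.5,1,0.25;FQ=-30"
)
features = extract_features(EXAMPLE_LINE)
print(len(features), features["VAF"])  # 15 0.5
```

Missing tags are mapped to NaN here so that every record yields a vector of the same length; how RDDpred itself handles missing values is not stated in the text.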
Evaluation with two previous studies
Comparison results from two different tissues, which show that RDD occurs condition-specifically
- 1. Training datasets
Positive examples: sites predicted as positive by public databases (RADAR, DARNED)
Negative examples: sites predicted as artefacts by the MES method (Peng et al., Nature Biotechnology 2012)
Entries that overlap with the test data are excluded from the training data
- 2. Test datasets
Positive examples: sites confirmed by experimental validation (Sanger-seq)
Negative examples: false-discovery sites identified by experimental validation (Sanger-seq)
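The evaluation design above can be sketched as follows. This is a hedged illustration, not the authors' evaluation script: the function names are hypothetical, the sites and labels are made-up toy values rather than results from either study, and sensitivity and NPV are assumed to follow their standard definitions, TP / (TP + FN) and TN / (TN + FN).

```python
def exclude_test_overlap(training, test):
    """Drop training entries whose site also appears in the test data."""
    return {site: label for site, label in training.items() if site not in test}


def evaluate(predicted, validated):
    """Compare predictions against experimentally validated labels.

    Both arguments map a site key to True (editing) or False (artefact).
    """
    tp = fn = tn = fp = 0
    for site, is_true_edit in validated.items():
        called = predicted[site]
        if is_true_edit and called:
            tp += 1
        elif is_true_edit:
            fn += 1
        elif called:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


# Toy values for illustration only.
validated = {
    ("chr1", 10): True, ("chr1", 20): True, ("chr1", 30): True,
    ("chr1", 40): True, ("chr1", 50): True,
    ("chr2", 10): False, ("chr2", 20): False, ("chr2", 30): False,
}
predicted = dict(validated)
predicted[("chr1", 50)] = False  # one validated site is missed

training = {("chr1", 10): True, ("chr3", 5): False, ("chr3", 9): True}
training = exclude_test_overlap(training, validated)

metrics = evaluate(predicted, validated)
print(len(training), metrics)  # 2 {'sensitivity': 0.8, 'npv': 0.75}
```

Removing the overlapping entries before training keeps the validated sites unseen by the classifier, so the reported figures are not inflated by memorized examples.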
RDDpred prediction in Bahn’s dataset
RDDpred prediction in Peng’s dataset
Additional specification of RDDpred
Linux version: Linux version 2.6.32-358.el6.x86_64, CentOS release 6.4
Memory usage: up to 20 GB
CPU usage: 20-cores (Intel(R) Xeon(R) CPU E5645 @ 2.40 GHz)
We developed a software package for RNA-editing prediction from RNA-seq data. RDDpred utilizes currently published databases and methods, such as RADAR, DARNED [13, 29] and the MES method, to build a condition-specific predictor. RDDpred generates a predictor that takes into account the experimental condition under which the RNA-seq experiments were performed. As of now, there are only two studies with which we can compare RDDpred. Nevertheless, we demonstrated that RDDpred was able to reproduce the results and reduce the false discoveries in both studies.
Category rankings of attributes utilized by RDDpred model: top 3 categories showed relatively strong prediction power
During high-throughput sequencing, the sequencer generates base qualities that represent the confidence of each base call. Therefore, unlike the other five categories, “Base Quality” reflects the molecular status of bases as directly recorded by the sequencer. Until now, we only knew that bases modified by editing enzymes are somehow recognized as guanine (or thymine for the APOBEC class), but did not know how this recognition appears from the perspective of the sequencing machine. The base-quality result indicates that there might be some distinction between normal and edited bases at the molecular level. Thus, it implies that more detailed recording of molecular characteristics during the sequencing process might be a key to improving the accuracy of RNA-editing detection.
RDDpred: a useful tool for investigating condition-specific RNA-editing with RNA-seq
RNA-seq is one of the most powerful methods for investigating the transcriptome, and the amount of RNA-seq data has recently increased nearly exponentially. In spite of this rapid accumulation of RNA-seq data and the recognition of the important biological roles of RNA-editing, only a few studies have reported RNA-editing findings, owing to the difficulty of obtaining robust profiles of the RNA-editome. Since it is difficult to perform experimental validation of RNA-editing events at the whole-transcriptome scale, a reliable and easy-to-use prediction method is needed.
RDDpred prepares training examples that are specific to the condition of the input data without experimental validation. RDDpred performed well, reproducing the detections of two previous studies and correcting most of their false discoveries. Moreover, as far as we know, RDDpred is the first automated pipeline that utilizes a machine-learning technique with well-evaluated performance. Thus, we believe that RDDpred will be very useful and can contribute significantly to the study of RNA-editing. RDDpred is available at http://biohealth.snu.ac.kr/software/RDDpred .
This research was supported by the Bio & Medical Technology Development Program of the NRF funded by the Korean government, MSIP(No. NRF-2014M3C9A3063541). Also, this research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Science, ICT & Future Planning (No. NRF-2012M3C4A7033341). The publication cost will be paid by the Seoul National University Office of Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
- Burns CM, Chu H, Rueter SM, Hutchinson LK, Canton H, Sanders-Bush E, et al. Regulation of serotonin-2C receptor G-protein coupling by RNA editing. Nature. 1997; 387:303–8.
- Keegan LP, Leroy A, Sproul D, O’Connell MA. Adenosine deaminases acting on RNA (ADARs): RNA-editing enzymes. Genome Biol. 2004; 5:209.
- Harris RS, Petersen-Mahrt SK, Neuberger MS. RNA editing enzyme APOBEC1 and some of its homologs can act as DNA mutators. Mol Cell. 2002; 10:1247–53.
- Nishikura K. Functions and regulation of RNA editing by ADAR deaminases. Ann Rev Biochem. 2010; 79:321.
- St Laurent G, Tackett MR, Nechkin S, Shtokalo D, Antonets D, Savva YA, et al. Genome-wide analysis of A-to-I RNA editing by single-molecule sequencing in Drosophila. Nat Struct Mol Biol. 2013; 20:1333–39.
- Peng Z, Cheng Y, Tan BC-M, Kang L, Tian Z, Zhu Y, et al. Comprehensive analysis of RNA-seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol. 2012; 30:253–60.
- Bahn JH, Lee JH, Li G, Greer C, Peng G, Xiao X. Accurate identification of A-to-I RNA editing in human by transcriptome sequencing. Genome Res. 2012; 22:142–50.
- Rueter SM, Dawson TR, Emeson RB. Regulation of alternative splicing by RNA editing. Nature. 1999; 399:75–80.
- Rosenberg BR, Hamilton CE, Mwangi MM, Dewell S, Papavasiliou FN. Transcriptome-wide sequencing reveals numerous APOBEC1 mRNA-editing targets in transcript 3’ UTRs. Nat Struct Mol Biol. 2011; 18:230–6.
- Nishikura K. Editor meets silencer: crosstalk between RNA editing and RNA interference. Nat Rev Mol Cell Biol. 2006; 7:919–31.
- Galeano F, Rossetti C, Tomaselli S, Cifaldi L, Lezzerini M, Pezzullo M, et al. ADAR2-editing activity inhibits glioblastoma growth through the modulation of the CDC14B/Skp2/p21/p27 axis. Oncogene. 2013; 32:998–1009.
- Chiu YL, Soros VB, Kreisberg JF, Stopak K, Yonemoto W, Greene WC. Cellular APOBEC3G restricts HIV-1 infection in resting CD4+ T cells. Nature. 2010; 466:276–6.
- Kiran A, Baranov PV. DARNED: a database of RNA editing in humans. Bioinformatics. 2010; 26:1772–6.
- Song W, Liu Z, Tan J, Nomura Y, Dong K. RNA editing generates tissue-specific sodium channels with distinct gating properties. J Biol Chem. 2004; 279:32554–2561.
- Miyata Y, Sugita M. Tissue- and stage-specific RNA editing of rps14 transcripts in moss (Physcomitrella patens) chloroplasts. J Plant Physiol. 2004; 161:113–5.
- Bass B, Hundley H, Li JB, Peng Z, Pickrell J, Xiao XG, et al. The difficult calls in RNA editing. Nat Biotechnol. 2012; 30:1207–9.
- Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, et al. Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics. 2009; 25:3207–212.
- Heap GA, Yang JH, Downes K, Healy BC, Hunt KA, Bockett N, et al. Genome-wide analysis of allelic expression imbalance in human primary cells by high-throughput transcriptome resequencing. Hum Mol Genet. 2010; 19:122–34.
- Talwalkar A, Liptrap J, Newcomb J, Hartl C, Terhorst J, Curtis K, et al. SMaSH: a benchmarking toolkit for human genome variant calling. Bioinformatics. 2014; 30:2787–95.
- Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Rätsch G, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013; 10:1185–91.
- Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29:15–21.
- Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014; 42:756–63.
- Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011; 27:2987–993.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010; 20:1297–303.
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004; 32:493–6.
- Li JB, Levanon EY, Yoon JK, Aach J, Xie B, LeProust E, et al. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science. 2009; 324:1210–13.
- Mo F, Wyatt AW, Sun Y, Brahmbhatt S, McConeghy BJ, Wu C, et al. Systematic identification and characterization of RNA editing in prostate tumors. PLoS One. 2014; 9(7):e101431.
- Zhang Q, Xiao X. Genome sequence-independent identification of RNA editing sites. Nat Methods. 2015; 12:347–50.
- Ramaswami G, Li JB. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 2014; 42(D1):D109-D113.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter. 2009; 11:10–18.
- Leinonen R, Sugawara H, Shumway M. The Sequence Read Archive. Nucleic Acids Res. 2011; 39(suppl 1):D19-D21.