iSeeRNA: identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data
© Sun et al.; licensee BioMed Central Ltd. 2013
Published: 15 February 2013
Long intergenic non-coding RNAs (lincRNAs) are emerging as a novel class of non-coding RNAs and potent gene regulators. High-throughput RNA sequencing combined with de novo assembly promises large-scale discovery of novel transcripts. However, identifying lincRNAs among thousands of assembled transcripts remains challenging because they are difficult to separate from protein-coding transcripts (PCTs).
We have implemented iSeeRNA, a support vector machine (SVM)-based classifier for the identification of lincRNAs. iSeeRNA shows better performance than other software. A publicly available web server for iSeeRNA is also provided for small datasets.
iSeeRNA demonstrates high prediction accuracy and runs several orders of magnitude faster than other similar programs. It can be integrated into transcriptome data analysis pipelines or run as a web server, thus offering a valuable tool for lincRNA studies.
Over the past decade, evidence from numerous high-throughput genomic platforms has revealed that although less than 2% of the mammalian genome encodes proteins, a significant fraction can be transcribed into different complex families of non-coding RNAs (ncRNAs) [1–4]. Other than microRNAs and other families of small non-coding RNAs, long non-coding RNAs (lncRNAs, >200 nt) are emerging as potent regulators of gene expression. Long intergenic non-coding RNAs (lincRNAs), a subtype of lncRNAs originally identified by Guttman et al. from four mouse cell types using chromatin state maps, are discrete transcriptional units located between known protein-coding loci. Recent studies demonstrate the functional significance of lincRNAs. However, it remains a daunting task to identify all the lincRNAs present in various biological processes and systems.
Whole transcriptome sequencing, known as RNA-Seq, offers the promise of rapid, comprehensive discovery of novel genes and transcripts. With de novo assembly software such as Cufflinks and Scripture, a large set of novel assemblies can be obtained from RNA-Seq data. Several programs have been used to facilitate the cataloging of lincRNAs from RNA-Seq assemblies. For example, Li et al. used the Codon Substitution Frequency (CSF) score to identify lincRNAs from de novo assembled transcripts in chicken skeletal muscle. Pauli et al. took advantage of the PhyloCSF score followed by other filtering steps to identify lincRNAs expressed during zebrafish embryogenesis. Cabili et al. also used the PhyloCSF program to eliminate de novo assembled transcripts with positive coding potential and identified ~8200 lincRNA loci in 24 human tissues. However, the extremely long computational times demanded by PhyloCSF may become the bottleneck when handling millions of assemblies generated from high-throughput sequencing. Furthermore, neither CSF nor PhyloCSF provides publicly available tools that can be readily integrated into a lincRNA identification workflow. Therefore, ab initio reconstruction of a reliable set of lincRNAs through computational methods remains a daunting task, and there is an urgent need for a standalone tool that accurately and quickly distinguishes lincRNAs in extremely large datasets. Previous studies showed that supervised machine learning methods, especially the Support Vector Machine (SVM), may represent a potential solution for accurate discrimination between lincRNAs and protein-coding gene transcripts (PCTs). For example, CONC (Coding Or Non-Coding), CPC (Coding Potential Calculator), and PORTRAIT have been developed to discriminate PCTs and ncRNAs in general. However, the performance of these programs is largely dependent on datasets. For instance, CONC is slow on large datasets, which may limit its usefulness in transcriptome data analysis. CPC works well with known PCTs but may tend to classify novel PCTs as lincRNAs if they have not been recorded in the protein databases used by CPC. PORTRAIT was specifically designed for neglected species such as fungi. Moreover, the performance of these programs on the identification of lincRNAs has not been evaluated.
In this study, we present a new SVM-based classifier and standalone tool, iSeeRNA. It demonstrated high accuracy and balanced sensitivity and specificity on both lincRNA and PCT datasets. It also outperforms other programs by running several orders of magnitude faster, thus representing an ideal tool for lincRNA identification from transcriptome sequencing data.
To be compatible with de novo assembly software such as Cufflinks and Scripture, which use the GTF/GFF or BED file formats, we set these three formats as the default input file formats for iSeeRNA. This allows easy integration of iSeeRNA into the transcriptome data analysis workflow. Detailed information about the file formats can be found at the UCSC Genome Browser (http://genome.ucsc.edu/FAQ/FAQformat.html).
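As a rough illustration of what such an input record carries, the sketch below parses a single 12-column BED line into exon intervals. The function name `parse_bed12` and the example record are hypothetical and are not part of iSeeRNA; the field layout follows the UCSC BED description linked above.

```python
from typing import List, Tuple

def parse_bed12(line: str) -> Tuple[str, str, str, List[Tuple[int, int]]]:
    """Return (chrom, name, strand, exons) for one BED12 record.

    Exons are 0-based, half-open (start, end) genomic intervals.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")
    chrom, chrom_start = fields[0], int(fields[1])
    name, strand = fields[3], fields[5]
    block_count = int(fields[9])
    # blockSizes and blockStarts are comma-separated and may end with a comma
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if len(sizes) != block_count or len(starts) != block_count:
        raise ValueError("blockCount does not match the block lists")
    # blockStarts are relative to chromStart
    exons = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return chrom, name, strand, exons

# Hypothetical two-exon transcript, for illustration only
example = "chr1\t1000\t5000\ttx1\t0\t+\t1000\t5000\t0\t2\t500,1000,\t0,3000,"
print(parse_bed12(example))
```

Exon intervals of this kind are what per-transcript features such as sequence length or averaged conservation would be computed over.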
To build SVM models for iSeeRNA, we used the LIBSVM (version 3.11) implementation with a Radial Basis Function (RBF) kernel, which was shown to be the best kernel for this task. During training, the SVM was set up as a binary classifier with the two classes being lincRNAs (positive set) and PCTs (negative set). Optimized SVM parameters C and gamma were obtained by using the accompanying grid.py script with 5,000 randomly selected instances from the training dataset. To obtain the best-performing model, 10-fold cross-validation was used. In addition, two models were trained and tested separately using species-specific datasets for human and mouse, respectively.
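The parameter search described here can be sketched in plain Python as follows. This is not the LIBSVM code itself: the exponential grid ranges are an assumption based on what grid.py is commonly run with, and `cv_score` is a hypothetical callback standing in for the cross-validation accuracy that LIBSVM would report for a given (C, gamma) pair.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel value: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def kfold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle 0..n-1 and return k (train, test) index pairs."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]

def grid_search(cv_score: Callable[[float, float], float]) -> Tuple[float, float, float]:
    """Score every (C, gamma) on an exponential grid; return (score, C, gamma)."""
    best = (float("-inf"), 0.0, 0.0)
    for log2c in range(-5, 16, 2):        # assumed range: C = 2^-5 .. 2^15
        for log2g in range(3, -16, -2):   # assumed range: gamma = 2^3 .. 2^-15
            c, g = 2.0 ** log2c, 2.0 ** log2g
            score = cv_score(c, g)
            if score > best[0]:
                best = (score, c, g)
    return best

# Toy scorer (hypothetical) that peaks at C = 8, gamma = 0.125
def toy_score(c: float, g: float) -> float:
    return -abs(math.log2(c) - 3) - abs(math.log2(g) + 3)

best_score, best_c, best_gamma = grid_search(toy_score)
print(best_c, best_gamma)
```

In a real run, `cv_score` would train an RBF-kernel SVM on the training folds produced by something like `kfold_indices(n, 10)` and return the mean held-out accuracy.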
iSeeRNA was benchmarked against two other classification programs: PhyloCSF and CPC. These two programs were installed locally and executed with default parameters. For PhyloCSF, a score of 0 was used as the classification threshold. For CPC, UniRef90 was employed as the protein database and the default classification model developed by its authors was used.
To evaluate performance, accuracy (sensitivity or specificity) and the Matthews Correlation Coefficient (MCC), an indicator used in machine learning as a measure of the quality of binary (two-class) classification, were calculated, and Receiver Operating Characteristic (ROC) curves were generated.
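These measures are conventionally defined as follows. The forms below are the standard textbook definitions and are given here as an assumption about the exact expressions used:

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{MCC} &= \frac{TP \times TN - FP \times FN}
                  {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```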
Where TP, FP, TN and FN are the numbers of true positives (lincRNAs predicted to be non-coding), false positives (PCTs predicted to be non-coding), true negatives (PCTs predicted to be coding) and false negatives (lincRNAs predicted to be coding).
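A minimal sketch of how these measures could be computed from the four counts is shown below. The function name and the example counts are hypothetical, the formulas are the standard definitions, and lincRNAs are treated as the positive class as in the text.

```python
import math
from typing import Dict

def _ratio(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),   # lincRNAs correctly called non-coding
        "specificity": _ratio(tn, tn + fp),   # PCTs correctly called coding
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the two classes differ in size.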
Selecting appropriate features is one of the most critical steps in building an SVM classifier. Many features have been used to distinguish ncRNAs from PCTs. These can be classified into three categories: conservation, Open Reading Frame (ORF) and nucleotide sequence-based [12, 14–16, 22–25]. Because lincRNAs share some common sequence properties with other classes of ncRNAs, we employed features that have demonstrated good potential to differentiate PCTs from ncRNAs in general. As a result, a total of 10 features in three categories were used to build our SVM models. The first class of feature is conservation. Many studies have demonstrated that lincRNAs are less conserved than PCTs in general, making this a suitable feature for distinguishing them. To calculate the conservation score, we first downloaded the base-resolution phastCons score files from UCSC; the scores of all nucleotides were then collected and averaged to obtain the conservation score for each transcript. Homolog search-based features are among the most popular features for ncRNA classification but were not employed, for the following reasons. First, many novel PCTs are not recorded in the protein databases and therefore tend to be misclassified as ncRNAs. Second, the homolog search score is strongly correlated with the conservation score (Spearman correlation = 0.728, see Additional file 1), so it did not further improve performance when conservation was used. Lastly, homolog search is very demanding in terms of computational time and would greatly slow down the SVM classifier. Two ORF-related features were selected as the second class: ORF length and ORF proportion, defined as the length of an ORF divided by the total length of the transcript. We reasoned that a true lincRNA transcript, compared to PCTs, is more likely to have a low-quality ORF, reflected by either a short ORF or a small ORF proportion. The txCdsPredict program from the UCSC Genome Browser was employed to calculate the ORF for each transcript. The other seven features constitute the third class: the frequencies of seven di- or tri-nucleotide sequences (GC, CT, TAG, TGT, ACG and TCG), which contribute the most to the overall performance. Some other nucleotide-based features were not employed due to their weak classification ability. We found that all three classes were useful to some extent in distinguishing lincRNAs from PCTs when used alone, and that exon conservation score and ORF proportion showed the highest discrimination power among all the features (see Additional file 2).
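The three feature classes can be sketched as below. This is an illustration under stated assumptions rather than the iSeeRNA implementation: `longest_orf_length` is a naive forward-strand stand-in for txCdsPredict, motif frequency is assumed to be the number of overlapping hits divided by the number of windows, the per-base scores are supplied as a plain list, and only the motifs listed in the text are used. All function names are hypothetical.

```python
from typing import Dict, Sequence

MOTIFS = ("GC", "CT", "TAG", "TGT", "ACG", "TCG")  # motifs as listed in the text
STOPS = {"TAA", "TAG", "TGA"}

def mean_conservation(scores: Sequence[float]) -> float:
    """Average per-base conservation score over the transcript."""
    return sum(scores) / len(scores) if scores else 0.0

def longest_orf_length(seq: str) -> int:
    """Longest ATG..stop stretch (nt, stop included) in the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                best = max(best, i + 3 - start)
                start = None
    return best

def motif_frequency(seq: str, motif: str) -> float:
    """Overlapping occurrences of motif divided by the number of windows."""
    seq = seq.upper()
    windows = len(seq) - len(motif) + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i:i + len(motif)] == motif)
    return hits / windows

def feature_vector(seq: str, scores: Sequence[float]) -> Dict[str, float]:
    """Assemble conservation, ORF and motif-frequency features for one transcript."""
    orf_len = longest_orf_length(seq)
    features = {
        "conservation": mean_conservation(scores),
        "orf_length": float(orf_len),
        "orf_proportion": orf_len / len(seq) if seq else 0.0,
    }
    for motif in MOTIFS:
        features["freq_" + motif] = motif_frequency(seq, motif)
    return features

# Hypothetical 13-nt sequence with flat conservation scores
print(feature_vector("CCATGAAATAGGC", [0.1] * 13))
```

Before SVM training, feature values on very different scales (for example ORF length in nucleotides versus frequencies between 0 and 1) would normally be rescaled to a common range.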
Performance evaluation of iSeeRNA on testing datasets
We further evaluated iSeeRNA performance on several benchmark datasets collected from published studies. The first dataset is a collection of experimentally validated functional lincRNAs (28 for human and 11 for mouse). iSeeRNA successfully identified these transcripts as lincRNAs with 100% accuracy. We then applied iSeeRNA to a collection of 8,195 human lincRNAs identified from de novo assembled transcripts; iSeeRNA correctly predicted 97.3% (7,977/8,195) of these lincRNAs (data not shown). These results further demonstrate the high accuracy of iSeeRNA for the identification of lincRNAs.
Evaluation of accuracy and CPU time of iSeeRNA, PhyloCSF, and CPC on comparison dataset
To test the efficiency of iSeeRNA, we next recorded the computational times of the three methods on the comparison dataset. Overall, iSeeRNA was several orders of magnitude faster than PhyloCSF and at least 10 times faster than CPC (Table 2). This suggests that iSeeRNA is more suitable for processing the large numbers of transcripts produced by high-throughput transcriptome sequencing. This advantage, together with its acceptance of GFF/GTF/BED as input file formats, makes iSeeRNA an ideal program that can be smoothly integrated into a lincRNA annotation pipeline for high-throughput transcriptome data analysis.
In this study, we report a lightweight SVM-based program, iSeeRNA, designed for computational identification of lincRNAs from high-throughput transcriptome sequencing data. We have provided not only a standalone program that can be integrated into the transcriptome data analysis pipeline but also a web server that those with limited bioinformatics support can use independently. Compared to similar programs, iSeeRNA directly supports the file formats widely used by RNA-Seq assemblers, and it has demonstrated the best performance in terms of both prediction accuracy for lincRNAs and PCTs and computational time. We think this stems from the following improvements in feature selection, training datasets and optimization of the computational method: (i) iSeeRNA was uniquely trained in a species-dependent manner. By using species-specific lincRNA and PCT training datasets, we have built two separate SVM models for human and mouse, respectively. In addition, iSeeRNA allows users to build customized models for other species of interest as the number of species-specific lincRNAs grows rapidly. (ii) iSeeRNA was trained with a balanced dataset containing approximately equal numbers of lincRNAs and PCTs. This avoids over-representation of protein-coding data and potential bias during performance evaluation, thus leading to accurate prediction with balanced sensitivity and specificity. (iii) Compared to CPC, iSeeRNA does not use any homolog-based features (such as the BLASTX score) derived from homolog search results. As novel PCTs are likely to be missing from the database, these features are biased towards lincRNAs, which may explain why CPC achieved a higher sensitivity but a comparatively lower specificity (Table 2). In addition, iSeeRNA employed seven sequence-based features that were not considered by CPC. (iv) Unlike PhyloCSF, which relies solely on conservation to evaluate the coding potential of a transcript, iSeeRNA integrates multiple features. Our results demonstrated that PhyloCSF had difficulty in discriminating clearly between lincRNAs and PCTs. Even at the optimal threshold (95), 12.9% of PCTs were wrongly classified as lincRNAs (Figure 3). However, the classification performance was clearly improved by integrating more features in iSeeRNA (Figure 2). Furthermore, PhyloCSF failed to calculate scores for some of the HAVANA-annotated lincRNA transcripts (Table 2), which further limits its application to lincRNA identification.
In conclusion, we have implemented a highly accurate and reliable tool, iSeeRNA, for high-throughput screening of lincRNAs from transcriptome sequencing data. We provide not only a web server for small datasets but also a standalone program that can be integrated into a bioinformatics pipeline for complex transcriptome data analysis. iSeeRNA demonstrates high accuracy and balanced sensitivity and specificity for both lincRNAs and PCTs. This makes it a valuable tool for lincRNA studies.
Funding: This work was supported by General Research Funds from the Research Grants Council of Hong Kong, China [CUHK476309, CUHK476310 to HW, and CUHK473211 to HS]; CUHK direct grants [2041474 to HS, 2041492 and 2041662 to HW]; and the National Natural Science Foundation of China (No. 61171191) and the Natural Science Foundation of Jiangsu Province in China (BK2010500) to XS.
The publication costs for this article were funded by the University Grants Committee of the Government of the Hong Kong Special Administrative Region, China, under the General Research Funds (CUHK473211).
This article has been published as part of BMC Genomics Volume 14 Supplement 2, 2013: Selected articles from ISCB-Asia 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.