Predicting siRNA potency with random forests and support vector machines
© Wang et al. 2010
Published: 01 December 2010
Skip to main content
© Wang et al. 2010
Published: 01 December 2010
Short interfering RNAs (siRNAs) can be used to knockdown gene expression in functional genomics. For a target gene of interest, many siRNA molecules may be designed, whereas their efficiency of expression inhibition often varies.
To facilitate gene functional studies, we have developed a new machine learning method to predict siRNA potency based on random forests and support vector machines. Since there were many potential sequence features, random forests were used to select the most relevant features affecting gene expression inhibition. Support vector machine classifiers were then constructed using the selected sequence features for predicting siRNA potency. Interestingly, gene expression inhibition is significantly affected by nucleotide dimer and trimer compositions of siRNA sequence.
The findings in this study should help design potent siRNAs for functional genomics, and might also provide further insights into the molecular mechanism of RNA interference.
RNA interference (RNAi) is a post-transcriptional gene regulatory mechanism by which a double-stranded RNA (dsRNA) induces sequence-specific gene silencing . The RNAi pathway consists of multiple steps, including the cleavage of the dsRNA by Dicer to form a 19-nucleotide short interfering RNA (siRNA) with 3' overhangs, the incorporation of the siRNA molecule into the RNA-induced silencing complex (RISC), and the recognition of the target gene transcript(s) by the RISC-siRNA complex to induce mRNA degradation or translational repression. In mammals and many other organisms, chemically synthesized siRNA molecules can be introduced into cells to knockdown the expression of a specific gene. Because of the simplicity and low cost, siRNA-based gene silencing has quickly become an important technique in functional genomics .
Since not all siRNAs are equally effective, siRNA design is one of the critical steps in gene silencing studies. Earlier experimental data have suggested several sets of empirical rules for designing potent siRNAs. For example, Ui-Tei et al.  proposed several criteria for potent siRNAs, including the presence of AU-rich 5' terminal region and G/C at the 3' end of the antisense strand, and the absence of long GC stretches (>9 base pairs). The thermodynamic properties of siRNA duplexes were shown to affect target mRNA degradation [4, 5]. In addition, secondary structure in siRNAs could reduce the efficacy of gene silencing . Although these findings provided important insight into RNA interference, the empirical rules were often derived from relatively small datasets, and thus might not cover all the relevant features affecting siRNA potency.
With the accumulation of siRNA data, machine learning methods have been developed for both classification and regression analysis of siRNA potency. Strom  reported a genetic programming (GP) method with boosting algorithms for the binary classification of effective and ineffective siRNAs. It was shown that the boosted GP classifier outperformed support vector machines (SVMs) trained on the same dataset. Ladunga  trained SVMs for siRNA classification based on biophysical signatures of free energy, target site accessibility and dinucleotide characteristics. Wang et al.  also constructed an SVM classifier to identify hyperfunctional siRNAs by using simple sequence and structural features such as base composition at each position, GC content, and secondary structure. Artificial neural networks (ANNs) have been trained for regression analysis of siRNA efficacy. Huesken et al.  developed the BIOPREDsi model with nucleotide sequence information to predict the inhibitory activity of siRNAs, and used the ANN model to design a human siRNA library. Shabalina et al.  performed thermodynamic and correlation analyses on a set of siRNAs, and constructed an ANN model with three parameters characterizing siRNA sequences. In addition, Vert et al.  constructed the simple linear model DSIR using the LASSO procedure for siRNA efficacy prediction with basic sequence features.
The above-mentioned previous studies suggest that many siRNA features of sequence composition, thermodynamic stability and secondary structure are related to the effectiveness of gene silencing. However, few advanced methods have been used to select the most relevant features for predicting siRNA potency. Most of the previous studies selected some siRNA features based on empirical knowledge or simple statistical analysis (e.g., correlation analysis).
More recently, Klingelhoefer et al.  used a stochastic logistic regression-based algorithm to identify relevant features associated with siRNA potency. The feature selection method revealed several sequence motifs such as UCU in potent siRNAs.
In this study, random forests (RFs) were constructed to select important sequence features for predicting siRNA potency. RF-based variable importance measures were previously used in microarray expression data analyses to select a relatively small set of informative genes for disease/sample classification [14, 15]. However, it remained to be demonstrated whether the RF methodology could be applied to sequence feature selection for siRNA classification. The selected sequence features were used to construct SVM models for predicting siRNA potency. Our results suggest that siRNA potency is significantly affected by its nucleotide dimer and trimer compositions. Some nucleotide motifs such as UCC appear to be positively correlated with siRNA efficacy, whereas other motifs such as GAG may have a negative effect on gene expression inhibition. The findings will likely be useful for rational siRNA design in large-scale functional genomics projects.
As described in the previous study , a non-redundant set of experimentally evaluated siRNAs were collected from several published studies. For each siRNA, the relative level of target gene mRNA was the ratio of the remaining mRNA level after siRNA treatment to the wild-type control level. The relative mRNA level (ranging from 0 to 1) was used to determine the effectiveness of mRNA knockdown. Effective siRNAs gave rise to lower levels of remaining gene expression. The cut-off level of 0.5 was used to define positive instances (potent siRNAs with the relative mRNA level ≤ 0.5) and negative instances (non-potent siRNAs with the relative mRNA level > 0.5). The dataset used in this study consisted of 165 positive instances and 115 negative instances. Each siRNA instance was a sequence of 19 nucleotides (from 5' to 3' end) representing the antisense strand of the target gene mRNA transcript. The two-nucleotide overhang at the 3' end of a siRNA was not included in the data instance. Although more siRNA data have recently become available and will be used in the future study, our approach was first tested on this relatively small dataset so that our findings could be compared with the previously published results.
Potential sequence features for siRNA classification.
Number of features
siRNA nucleotide sequence
Global and local G/C contents
Secondary structure stability
In this study, important sequence features for siRNA classification were identified by using the random forest (RF) algorithm implemented in the software package available at http://www.stat.berkeley.edu/~breiman/RandomForests/. The RF algorithm uses a combination of independent decision trees to model data and measure variable importance . Each decision tree in a forest is constructed using a bootstrap sample from the data, and about one-third of the data instances are not used to grow the tree. These instances are called the out-of-bag (oob) data for the tree. At each node of the tree, m variables out of all the n input variables (m « n) are randomly selected, and the tree node is split using the selected m variables. The random selection of features at each node decreases the correlation between the trees in the forest. Thus, the RF algorithm can handle many redundant features and avoid model overfitting. It has been shown that RFs outperform AdaBoost ensembles on noisy datasets, and can perform well on data with many weak input variables .
To evaluate the importance of variable x, its values in the oob instances associated with each tree in the forest are permuted randomly. The permuted oob instances as well as the original oob instances are then classified using the tree. The number of correct classifications on the original oob instances is subtracted by the number of predictions for the correct class on the permuted oob instances to calculate a raw score based on the tree. The importance score of variable x is defined as the average of raw scores over all the trees in the forest. For a fixed number of trees in the forest, the larger importance score a variable has, the more important it is for classification. In addition, a z-score can be obtained by dividing the variable importance score by its standard error, and a statistical significance level may be assigned to the z-score assuming normality .
where and are two data vectors, and γ is a training parameter. A smaller γ value makes the decision boundary smoother. Another parameter for SVM training is the regularization factor C, which controls the trade-off between low training error and large margin . Different values for the γ and C parameters have been tested in this study to optimize the classifier performance.
where TP is the number of true positives; TN is the number of true negatives; FP is the number of false positives; and FN is the number of false negatives. In addition to overall accuracy, sensitivity and specificity, Matthews correlation coefficient (MCC) is also commonly used as a measure of the quality of binary classifications . MCC measures the correlation between predictions and the actual class labels.
The Receiver Operating Characteristic (ROC) curve is probably the most robust approach for classifier evaluation and comparison . The ROC curve is drawn by plotting the true positive rate (i.e., sensitivity) against the false positive rate, which equals to (1 – specificity). In this work, the ROC curve has been generated by varying the output threshold of a classifier and plotting the true positive rate against false positive rate for each threshold value. The area under the ROC curve (AUC) can be used as a reliable measure of classifier performance . Since the ROC plot is a unit square, the maximum value of AUC is 1, which is achieved by a perfect classifier. Weak classifiers have AUC values close to 0.5.
There are many potential features for siRNA classification. To select the important features, siRNA sequences were coded with the 120 features shown in Table 1, and then used to construct random forests (RFs) with different settings of the m parameter (the number of variables randomly selected to split each node in a tree). As suggested by the RF software , the ceiling of the square root of the total number of input variables might be used as the default value of m (i.e., m = 11). In this study, 10 RFs were constructed by varying the m parameter setting from 2 to 20 (m = 2, 3, 5, 7, 9, 11, 13, 15, 17, or 20). Each RF with 1000 trees selected the top 20 features based on the z-score of variable importance. Some of the common features selected by the RFs were then identified for siRNA classification. The use of multiple RFs might increase the reliability for identifying relevant features.
Important features selected by random forests.
G/C% (first 5 bases)
Some trinucleotide or dinucleotide features show negative correlation with siRNA efficacy. For example, the frequencies of CAG and GAG on the antisense strand have average z-scores of 11.849 and 11.674, respectively, and are negatively correlated with siRNA efficacy (Table 2). The Pearson's correlation coefficients are -0.289 for CAG% and -0.305 for GAG%. The other composition features with negative effects on siRNA efficacy include GCA%, AUA%, CUG%, AG%, GG%, GGA%, and GGC% (Table 2). These nucleotide motifs should be avoided in designing potent siRNAs.
Several features of base composition and G/C content were also selected by the RFs. As shown in Table 2, the frequency of G on the antisense strand (G%) is negatively correlated with siRNA efficacy (r = -0.266), whereas U% shows a positive correlation with gene expression inhibition (r = 0.127). Both the global G/C content and the G/C content in the first 5 bases (antisense strand) are negatively correlated with siRNA efficacy (r = -0.147 and -0.256, respectively). In addition, the nucleotide identity at the third position (NT3, from the 5' end) of the antisense strand may be an important feature for siRNA classification. A nucleotide C at the third position has a negative effect on siRNA efficacy (r = -0.199). However, the nucleotide identities at the other positions as well as siRNA secondary structure stability were not selected by most of the RFs.
Some of the important features in Table 2 were previously shown to be related to the efficacy of gene silencing. Very high G/C contents were found to have a negative effect on siRNA efficacy [9, 13]. The presence of AU-rich 5' terminal region of the antisense strand was proposed to be one of the criteria for designing potent siRNAs . Consistent with the previous findings, both the overall G/C% and the 5' terminal G/C% (first 5 bases) of the antisense strand were selected by the RFs in this study. It was previously shown that the frequency of U, but not G or GG, was positively correlated with siRNA efficacy . This observation has also been confirmed in the present study. Nevertheless, many of the other features in Table 2, including the top three features (UCC%, CAG%, and GAG%), have not been well documented in the literature. Thus, the findings in this study provide new insights into the rational design of potent siRNAs.
Performance of support vector machine classifiers.
To further evaluate the classifier RF_Features, we examined the SVM output used to predict siRNA potency. The classifier was constructed using siRNA instances with binary labels (potent or non-potent). The actual magnitude of gene expression inhibition was not included in the training data. In Figure 2, the SVM output for each test instance is plotted against the level of gene expression inhibition. Interestingly, the output from RF_Features is positively correlated with the level of gene expression inhibition (Pearson's correlation coefficient r = 0.4715). The result further suggests that the classifier RF_Features has learned some important siRNA patterns related to the efficacy of gene silencing.
We have developed a new machine learning approach for predicting siRNA potency based on random forests and support vector machines. Since there were many potential features for siRNA classification, random forests were used for feature selection based on variable importance scores. Interestingly, most of the selected features were nucleotide dimer and trimer compositions of siRNA sequence. Some nucleotide motifs (e.g., UCC) showed positive correlation with siRNA efficacy, whereas other motifs (e.g., GAG) might have a negative effect on gene silencing. These important features were used to train support vector machines for predicting siRNA potency with relatively high accuracy. In the future, we will apply our approach to a large, integrated dataset of siRNAs, and develop a software system for rational siRNA design in functional geneomic studies.
Publication of this supplement was made possible with support from the International Society of Intelligent Biological Medicine (ISIBM). We acknowledge the scientific contributions and inputs of Dr. Mary Qu Yang.
This article has been published as part of BMC Genomics Volume 11 Supplement 3, 2010: The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S3.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.