Energy profile and secondary structure impact shRNA efficacy
- Hong Zhou^{1}Email author and
Affiliated with
- Xiao Zeng^{2}Email author
Affiliated with
DOI: 10.1186/1471-2164-10-S1-S9
© Zhou and Zeng. 2009
Published: 7 July 2009
Advertisement
DOI: 10.1186/1471-2164-10-S1-S9
© Zhou and Zeng. 2009
Published: 7 July 2009
RNA interference (RNAi) is a cellular mechanism in which a short/small double stranded RNA induces the degradation of its sequence specific target mRNA, leading to specific gene silencing. Since its discovery, RNAi has become a powerful biological technique for gene function studies and drug discovery. The very first requirement of applying RNAi is to design functional small interfering RNA (siRNA) that can uniquely induce the degradation of the targeted mRNA. It has been shown that many functional synthetic siRNAs share some common characteristics, such as GC content limitation and free energy preferences at both terminals, etc.
Our three-phase algorithm was developed to design siRNA on a whole-genome scale based on those identified characteristics of functional siRNA. When this algorithm was applied to design short hairpin RNA (shRNA), the validated success rate of shRNAs was over 70%, which was almost double the rate reported for TRC library. This indicates that the designs of siRNA and shRNA may share the same concerns. Further analysis of the shRNA dataset of 444 designs reveals that the high free energy states of the two terminals have the largest positive impact on the shRNA efficacy. Enforcing these energy characteristics of both terminals can further improve the shRNA design success rate to 83.1%. We also found that functional shRNAs have less probability for their 3' terminals to be involved in mRNA secondary structure formation.
Functional shRNAs prefer high free energy states at both terminals. High free energy states of the two terminals were found to be the largest positive impact factor on shRNA efficacy. In addition, the accessibility of the 3' terminal is another key factor to shRNA efficacy.
RNA interference (RNAi) is a cellular mechanism in which a short/small double stranded RNA induces the degradation of its sequence specific target mRNA, leading to specific gene silencing. Since its discovery, RNAi has become a powerful biological technique for gene function studies and drug discovery [1–3]. It also shows promise as a direct therapeutic agent [4–6]. In order to apply RNAi technology, one must scan the target sequence and identify a stretch of 19 ~29 nucleotide sequence that might give the best chance to succeed as gene-specific small interfering RNA (siRNA) because randomly selected siRNA is seldom functional [7]. Previous research has identified some empirical rules regarding the efficacy of siRNA sequences. These rules include the GC content limitation (the optimal GC content is between 30% and 55%) [7–13], the thermo-stability preference (lower stability at the 3' terminal of sense strand would help the antisense strand enter into the RISC complex) [11, 14, 15], and some base preferences in the siRNA sequences [7, 16]. Recently, several studies employed computational methods to analyze the published functional siRNA datasets whose sizes are relatively large and revealed more base preferences at particular positions [17–20]. These computational models and their discoveries seem to be promising, because they were using large set of experimentally validated sequences. However, one must be cautious when reading into these conclusions because the existing datasets might be biased for lack of negative results (some negative results were seldom reported in publications).
Another challenge is that most existing datasets are usually about chemically synthesized siRNA sequences. Currently there are two approaches to induce siRNA sequences into cells. One is to transfect chemically synthesized siRNA sequences into cells. Though this approach is more frequently used, the drawback is that it can not offer long-term gene suppression and some mammalian cell types are resistant to the transfection methods [21, 22]. The alternative approach is to have a small hairpin RNA or short hairpin RNA (shRNA) expressed on a plasmid vector that enables long-term gene suppression [22]. Another advantage of using the shRNA approach is that once a functional vector is identified, it can be reproduced easily and inexpensively.
As suggested by Matveeva et al [19], some of the characteristics of functional siRNA may not apply to shRNA, though many of them may. In 2004 and 2005, we developed a three-phase algorithm for computer-aided siRNA design [23, 24]. This algorithm includes all the design considerations published by then and arranges all the necessary siRNA selection rules in three groups of filters according to their impacts on the siRNA efficacy and applies them to the design process in three steps. Each filter represents a specific design rule. Phase I filters eliminate all the siRNA sequences containing the damaging elements for a functional siRNA. Examples are those filters that prohibit the existence of internal palindromes or long GC stretches. All siRNA candidates must successfully pass all the phase I filter assessment. Phase II filters are used to rank eligible siRNA sequences by a final score with the sum of gain and penalty points. There is a cutoff such that siRNA candidates will be selected only if their final scores are no less than the cutoff point. Phase III filters represent those rules whose impact on the siRNA functionality has yet to be clarified and therefore are considered optional. Most of the selective filters in Phase II are set to ensure the selection rule that the 3' terminal is less thermodynamically stable, compared to the 5' end (on the sense strand). This differential stability increases the probability that the antisense strand will be incorporated into the RISC complex [14, 15]. This algorithm was adopted by SABiosciences Corporation and has been eventually used for their shRNA design. Biological experiments conducted by SABiosciences have shown that 71.2% (316/444) of the shRNAs designed by our algorithm are functional (i.e. can suppress at least 70% of the target gene expression in average), a high success rate that has never been reported experimentally with shRNA (for the definition of a functional shRNA or a successful design of shRNA, please read the section Methods). For example, the design algorithms employed recently by Root [22] and Moffat [21] for shRNA generate less than 40% functional shRNAs. Since our algorithm was originally developed based on discoveries and studies concerning siRNA functionality, its high success rate in shRNA design suggests that most design concerns for siRNA sequences are applicable to shRNA design.
This article further extends our analysis on the available shRNA dataset generated by SABiosciences. Although this dataset is biased as it is specifically generated by our three-phase algorithm, the analysis reveals useful information that may help confirm the effectiveness of the rules used in the algorithm, modify the existing rules or rearrange them for better prediction and identify new rules.
The two sets of biological experiments completely tested 444 shRNAs targeting 125 human genes. Of the 444 shRNAs, 316 are found to be functional (71.2%). Considering the fact that variations exist in the experimentally measured suppression efficacy, we decided to remove some shRNAs whose efficacy are in the range near 70% in hope of ensuring the validity of the dataset. The two ranges for removal are 60–75% and 55–80%. This means we exclude shRNAs whose efficacy is between 60–75% and 55–80% respectively. With the 60–75% removal range, there are 351 shRNAs for analysis in which 268 are considered functional (efficacy >= 75%). However, with the 55–80% range, there are only 289 shRNA sequences for analysis in which 221 shRNAs are considered functional (efficacy >= 80%).
By default, the three-phase algorithm sets a selection cutoff such that only shRNAs which score at least 7 points will be selected for biological experiments. However, former experiments showed that there were a few genes for which our algorithm failed to design enough shRNAs [23]. Thus in the dataset there are some shRNAs scoring less than 7 points. For this reason, the chi-square test for independence, is first applied to assess if the 7-points-cutoff is significant in distinguishing between functional and nonfunctional shRNAs. Note that all further statistical results are obtained using the chi-square test unless otherwise noted. The analysis shows that the 7-points-cutoff is significant in making the distinction between the functional and nonfunctional shRNAs. shRNAs scoring at least 7 points have a higher probability of being functional. When the removal range is 60–75%, the p-value is 0.037. With the removal range to be 55–80%, the statistical significance increases (p = 0.0046). Several other removal ranges were tested using the chi-square test and it has been found that the removal range of 55–80% generates the most significant result. For example, if the removal range is further expanded to be 50–85%, the significance drops (p = 0.037). Unless otherwise stated, the rest of our results were obtained with a removal range of 55–80%.
Higher ΔG_{as-5} increases the probability that shRNAs are functional.
shRNA ΔG_{as-5} | Number | Functional Rate | |
---|---|---|---|
>= -3.2 | Functional | 155 | 81.6% (155/190) |
Not functional | 35 | ||
<-3.2 | Functional | 66 | 66.7% (66/99) |
Not functional | 33 |
Table 1 indicates that the enforcement of filter f-dga in the selection algorithm could improve the design success rate significantly at the expense of missing some functional shRNAs. The results in Table 1 were obtained using a removal range of 55–80%. Even with no removal range, enforcing the filter f-dga significantly improves the shRNA design success rate to 76.0% (222 out of 292). This suggests that the high energy state at the 3' terminal has positive impact on shRNA efficacy.
Using either ΔG_{as-7}, ΔG_{ss-6} or both can better predict the efficacy of shRNAs.
Energy profile | ΔG_{as-5} | ΔG_{as-7} | ΔG_{ss-6} | ΔG_{as-7} or ΔG_{ss-6}or both |
---|---|---|---|---|
Energy criteria | >= -3.2 | >= -6.6 | >= -7.0 | |
P value of Chi-test | 0.0046 | 0.0005 | 0.0045 | 0.0000001 |
True positive | 155 | 155 | 115 | 192 |
False positive | 35 | 32 | 22 | 39 |
True negative | 33 | 36 | 46 | 29 |
False negative | 66 | 66 | 106 | 29 |
ROC specificity | 0.49 | 0.53 | 0.68 | 0.43 |
ROC sensitivity | 0.70 | 0.70 | 0.52 | 0.87 |
Without any removal range, all the shRNAs whose ΔG_{as-7} >= -6.6 or ΔG_{ss-6} >= -7.0 or both have averaged suppression efficacy of 76.98%, while those shRNAs with no high energy state at either end have averaged suppression efficacy of 65.83%. A two-sample t-test reveals that the difference between the two averaged efficacy is very significant (p = 0.0000047). This confirms that the use of either ΔG_{as-7}, ΔG_{ss-6}or both could significantly improve the selection of functional shRNAs.
Recent research has shown that the siRNA sequence characteristics could be helpful in predicting siRNA efficacy [16–20]. To test this, we compiled a set of 404 sequence characteristics parameters and used linear least square analysis (multiple regression analysis) in an effort to identify the most influential sequence characteristics. Since the two terminals' energy profiles contribute to shRNA efficacy, the least square analysis includes the two energy parameters. The 406 parameters are generated as follows:
The free energy of the first 7 base pairs of the antisense string (if >= -6.6, then value 1; otherwise, value -1).
The free energy of the first 6 base pairs of the sense string (if >= -7.0, value equals 1; otherwise value equals -1).
At each of the 21 positions, the nucleotide could be either A, C, G, T. For the presence of each nucleotide, there are 4 values generated. For example, for A, the 4 parameters are 1 0 0 0 while for G they are 0 0 1 0. So there are 84 parameters for the 21 nucleotides.
For each pair of nearest neighbors, there are 4 × 4 = 16 parameters. For example, for AC, they are 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0. So there are 16 × 20 = 320 parameters.
Sequence characteristics that consistently show either positive or negative impacts on shRNA efficacy in Least Square Analysis.
Positive Parameters | ||
---|---|---|
Parameter | Position | Coefficient |
G | 3 | 8.01 |
G | 5 | 9.11 |
A | 11 | 7.59 |
T | 14 | 6.16 |
CA | 6–7 | 11.48 |
CC | 8–9 | 11.63 |
TG | 10–11 | 19.58 |
GA | 13–14 | 31.32 |
TA | 17–18 | 6.36 |
TA | 18–19 | 24.11 |
Negative Parameters | ||
Parameter | Position | Coefficient |
TA | 4–5 | -35.04 |
TA | 6–7 | -23.35 |
GG | 8–9 | -21.00 |
GC | 9–10 | -24.66 |
GA | 17–18 | -15.08 |
Different parameter sets are also compiled in search for the best parameter set. For example, some parameter sets exclude energy profiles and some include GC ratio, etc. It is worthy to note that with different parameter sets, the coefficients change dramatically while the predication accuracy does not improve. This makes us wonder whether the findings by linear least square analysis are nuisances. However, we did notice that on average, the nucleotide pairs have larger impact than single nucleotides. For example, in Table 3, the absolute average value of the coefficients of pairs is 20.33 while that of the single nucleotide is 7.73. The T-test shows that the p-value of the difference is 0.00034. This might indicate that longer sequence characteristics affect shRNA efficacy more than shorter sequence characteristics.
As there are reports that local RNA target structure influences siRNA efficacy [26, 27], we then wanted to know if this could happen to shRNA. The secondary structure of mRNA could be formed for various reasons, but internal palindromes and repeated sequences are two causes in some cases. Please be aware that here we only performed the analysis statistically. The result does not indicate how much the secondary structure could impact the shRNA efficacy.
Initially we considered all possible palindromes and repeated sequences of length 7 or more that could involve any part of the 21 shRNA nucleotides. No significant results were found. We then only considered the possible palindromes and repeated sequences of length 8 or more that could only involve any part of the 6 base pairs from 5' terminal or of the 7 base pairs from 3' terminal. It was found that the number of possible palindromes of length 8 or more involving 3' terminal shows statistical bias between functional and nonfunctional shRNAs. Nonfunctional shRNAs tend to have more possible palindromes. This bias is most significant with palindromes of length 9 or more involving any part of the 7 base pairs from the 3' terminal. By Chi-Square test, the statistical significance is p = 0.011 with removal range of 55–80% and p = 0.0001 with removal range of 60–75%. Please notice that here we assumed that more possible palindromes implies higher probabilities for the terminal to be involved in secondary structure formation. If the assumption is valid, then the above result implies that secondary structure involving the 3' terminal could negatively impact the shRNA efficacy.
The above experiment targets the 7 base pairs at the 3' terminal. It is reasonable again to ask if other lengths of nucleotide sequences at 3' terminal will show similar statistical bias. Unsurprisingly, we found that all 3' terminal sequences of lengths 1 to 7 show similar statistical bias, i.e. functional shRNAs tend to have less possibility for all the terminal sequences of lengths 1 – 7 to be involved in palindrome formation. The most statistical bias is found with terminal sequence of length 6 (p < 0.000004 with removal range of 60–75%, p = 0.0006 with removal range 55–80%). This discovery motivates us to combine it with energy profile in order to further improve the efficacy predication. Our investigation has yet to show that this statistical bias could further improve the predication accuracy. This is not surprising since the possible palindrome structure could affect the terminal energy state. The two variables, the terminal energy state and the possible secondary structure are interfering with each other. A multivariate statistical analysis or recursive partition approach might help bring more lights into our future investigation.
The default setting of the three-phase algorithm is relatively stringent. Under this default setting, the algorithm cannot design shRNA sequences for approximately 8% of genes from human Refseq database [23]. It is possible that many functional shRNA sequences are missed by this algorithm. However, when there could be enough shRNA sequences designed for a given gene, this algorithm with the default stringent setting promises a good probability for functional shRNAs.
It has been confirmed by several studies that the free energy profile at the two terminals is the most critical factor relating to siRNA efficacy [7, 15–20]. Our analysis on shRNA dataset confirms that this belief applies to shRNA design. However, there is difference. For chemically synthesized siRNA, it was first found that the high free energy of first 5 base pairs at each terminal correlated well with siRNA efficacy [12, 25], and later other researchers discovered that it might be the high free energy of the first 2 base pairs [19, 20]. Our results show that shRNA efficacy can be predicted more effectively using the free energy profile of the first 6 base pairs at the 5' terminal and the first 7 base pairs at the 3' terminal. Currently we are not clear about the reason behind the difference, though the difference might be due to the differences between chemically synthesized siRNA and shRNA.
Internal palindrome is one of several causes that help RNA molecules form secondary structures. Our analysis found that shRNAs with more possible palindromes involving the 3' terminal tend to affect shRNA efficacy negatively, especially those possible palindromes that are of length 9 or more and involve the 7 terminal nucleotides. RNA secondary structure involving the terminal could limit the accessibility of the terminal, which might explain why the secondary structure could negatively impact shRNA efficacy. However, our result is very primitive since it is obtained with possible palindromes only and is only a statistical analysis result. Our future work will make use of software mfold to more precisely elucidate the relationship between shRNA efficacy and RNA secondary structure. If more positive relationships are found, confirmation by biological experiments will follow.
293H cells (Invitrogen) were cultured in D-MEM supplemented with 10% FBS and 1× non-essential amino acids (Invitrogen) for no more than 15 passages. Gene specific shRNA sequences were designed using the three-phase algorithm [23]. Negative control is a shRNA that has a random sequence and no homology to the mammalian transcriptome. All shRNAs were cloned into the pGeneClip™ hMGFP vector (Promega) to generate SABioscience SureSilencing™ shRNA plasmids. Transfection grade SureSilencing™ plasmid (0.8 mg) mixed with Lipofectamine 2000 reagent (Invitrogen, 3 mL) was delivered to 80,000 cells in a 24-well plate format. Culture media were changed 24 hours after transfection. Transfection efficiency was estimated by following the expression of GFP using fluorescence microscopy. After 48 hours, total RNA was extracted using the ArrayGrade™ Total RNA Isolation Kit with gDNA cleanup by TURBO DNase™ (Ambion).
cDNA was synthesized from total RNA using the ReactionReady™ First Strand cDNA Synthesis Kit. Real-time PCR was performed using RT2 SYBR Green qPCR Master Mixes on the Bio-Rad iCycler real-time PCR system or the Stratagene Mx3000 realtime PCR system.
Detailed description of knockdown success rate and its calculation is given in [28]. Real-time RT-PCR method was chosen to measure the relative mRNA levels between cells (293H from Invitrogen) transfected with a negative control shRNA and the cells transfected with target-specific shRNA. The percentage knockdown of a gene was calculated using the ΔΔCt method [29]. Basically, the expression level of our gene of interest is normalized to an internal control gene (GAPDH here) to obtain the ΔCt value, before it is compared between two differentially treated cells (thus the ΔΔCt). Since this validation process is complex, involving many different steps, we have built a statistical model to capture all the variations in the whole process [28]. By this model, we can mature the 95% confidence interval (CI) of the percentage knockdowns of any given shRNA. So our definition of a "successful" design is the shRNA that has a mean knockdown (KD) above 70%, with a lower bound of KD above 55.5% (with 95% CI). Please notice that all experiments were repeated for three times, and only those experiments with transfection efficiency greater than 80% and good PCR replicate consistency were included for analysis.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.