Large-scale analysis of structural, sequence and thermodynamic characteristics of A-to-I RNA editing sites in human Alu repeats

Background Alu repeats in the human transcriptome undergo massive adenosine to inosine RNA editing. This process is selective, as editing efficiency varies greatly among different adenosines. Several studies have identified weak sequence motifs characterizing the editing sites, but these alone do not account for the large diversity observed. Results Here we build a dataset of 29,971 editing sites and use it to characterize editing preferences. We focus on structural aspects, studying the double-stranded RNA structure of the Alu repeats, and show the editing frequency of a given site to depend strongly on the micro-structure it resides in. Surprisingly, we find that interior loops, and especially the nucleotides at their edges, are more likely to be edited than helices. In addition, the sequence motifs characterizing editing sites vary with the micro-structure. Finally, we show that thermodynamic stability of the site is important for its editing. Conclusions Analysis of a large dataset of editing events reveals more information on sequence and structural motifs characterizing the A-to-I editing process


Background
RNA Editing is a post-transcriptional modification of mRNA [1][2][3][4], which may result in the synthesis of proteins that are not directly encoded in the genome. There are two major types of RNA Editing in mammals, both of which occur via deamination of a base, either cytidine (which is turned into uridine) or adenosine (which turns into inosine). Inosine is read by the ribosome (and sequencers) as guanosine, and thus A I modifications at the mRNA level translate into an A G changes at the genetic code level. In this work we focus exclusively on A-to-I RNA Editing, which is catalyzed by enzymes from the ADAR (Adenosine Deaminases that Act on RNA) family. ADARs are double-stranded RNA (dsRNA) binding proteins, and thus dsRNA is a prerequisite for A-to-I editing [1,2].
RNA Editing is a fine-tuning mechanism, capable of changing only a few nucleotides. Both edited and unedited variants of the same transcript may be present in the cell. A-to-I editing is known to be vital in vertebrates, and important for normal life in invertebrates. In Drosophila, knocking out ADAR activity causes the flies to exhibit defects in locomotion and mating and to suffer tremors [5]. ADAR knockout C. elegans worms exhibit chemotaxis defects [6]. In mice, knocking out ADAR1 causes embryonic death and defects in erythropoiesis [7,8]. ADAR2 -/-mice die shortly after birth and are increasingly seizure prone after postnatal day 12 [9]. The lethal phenotype is accounted for by a single editing site resulting in a single amino acid substitution in the gluR-B gene.
In addition, alteration of A I editing has been ascribed to several pathological conditions [10], mainly to neuro-psychiatric conditions such as amyotrophic lateral sclerosis (ALS) [11], epilepsy [9,12], major depression disorder [13][14][15], and glioblastoma multiforme [16]. Reduced A-to-I editing levels have been linked to cancer in various tissues, most strongly to brain tumors. A correlation between the reduction of ADAR3 and the tumor aggressiveness was observed, and overexpression of ADAR1 and ADAR2 resulted in decreased proliferation rate of the glioblastoma multiforme cell-lines [17].
Isolating inosine-containing transcripts from C. elegans and human brain, it has been noticed that most A-to-I editing occurs in non coding regions [18]. Genome-wide bioinformatic searches for A-to-I editing sites have enabled the identification of abundant A-to-I editing in the transcriptome of several vertebrates [19][20][21][22][23][24]. It was found that editing occurs mainly within repetitive elements. These repetitive elements are likely to base-pair with a neighboring similar element and form the dsRNA structure which is the target of the ADAR enzymes. In particular, virtually all A-to-I editing events in human occur specifically within Alu repeats.
The Alus are a particular set of primate-specific retrotransposons, approximately 280 nucleotides in length. The Alus are the most abundant of all transposable elements in primates, making up more than 10% of the human genome, with some 1.1 million copies. Recent studies [21,23] have demonstrated that the frequency of A-to-I editing in human is much higher than in mouse, rat, chicken and fly. This has to do with the abundance and low diversity of the Alu elements as compared to similar elements in other genomes [24]: since Alu is so common in the human genome, there is a high probability that an Alu and a counterpart, oppositely oriented Alu, exist nearby and are transcribed together. When the RNA transcript folds, these two Alus form a helix, thus becoming a target for the dsRNA binding ADARs.
The physiological significance of A-to-I editing within non-coding repetitive elements is still elusive. Several possible mechanisms have been suggested through which editing of a non-coding repetitive element might affect the fate of a transcript: editing may result in insertion or elimination of a splice site, and may theoretically lead to the alteration of transcriptional start and stop codons [25]. Hyperedited inosine-containing RNAs might be cleaved at specific sites [26][27][28][29]. In addition, inosine containing mRNAs were also shown to be retained in the nucleus, suggesting an additional regulatory role for A-to-I editing [30,31]. However, the validity and scope of this last mechanism has been debated recently [32,33]. Finally, while the molecular significance is yet unclear, editing within Alu repeats was shown to be altered in cancerous tissues [17].
A-to-I editing is characterized by a puzzling specificity and selectivity in the adenosines which are edited. In some substrates, e.g. the AMPA receptor gluR-B subunit in mice [34] and the E1 sites within an Alu repeat in the NARF gene [25], RNA Editing is extremely efficient, editing 100% of transcripts at a specific adenosine. In others, such as most of the sites in Alu repeats, a seemingly random editing pattern is observed, where many adenosines are targeted, with varying editing efficiency.
However, careful analysis reveals that editing in Alu repeats is also highly reproducible: the variability among healthy individuals in editing level at a given site within a specific Alu repeat is much lower than the site-to-site differences. Sequence preferences for ADARs have been previously documented. C and T are overrepresented at the nucleotide 5′ to the editing site, while G is underrepresented. At the nucleotide 3′ to the site, G is significantly overrepresented [19,[35][36][37][38][39]. These motifs are too weak, however, to fully characterize A-to-I editing. Therefore, the question still stands: what controls the editing level at each given site? ADARs bind to the RNA via double-stranded RNA binding motifs. Thus, dsRNA is a necessity for A-to-I editing. Indeed, it has been shown for the highly selective R/G editing site within the hairpin of the glutamate receptor subunits mRNAs, that the identities of bases in the helical region are evolutionarily conserved, while the bases in the nonhelical part of the hairpin covary so as to maintain their non-helical structure [40]. This distinctive feature demonstrates the importance of the secondary structure to the phenomenon of RNA Editing.
The internal structure of the dsRNA is expected to control the editing efficiency [41]. For example, it has been shown experimentally that internal loops may effectively be equivalent to helix termini in terms of editing efficiency [42]. Thus, internal loops along dsRNA, if large enough, may act as delimiters separating a large dsRNA into many small helices. Since ADARs deaminate fewer A's in shorter helices, their existence (along with the sequence preferences of the ADARs) might be a means to increase the specificity of editing. It is thus plausible that more features of the secondary structure of an RNA molecule play an important role in determining the specificity of adenosine deamination of an ADAR substrate.
In this paper we will characterize the properties of Ato-I editing sites in terms of their secondary structure properties, their sequence properties, and their thermodynamic properties. We describe the building of a database of MFOLD [43] foldings used to query these properties, and then display and discuss the results of those queries.

Structural Analysis
We first look at the editing frequency for each substructure type (see Table 1 and fig. 1). We compare a "test set" of A-to-I Editing sites, which we denote by E 1 , and a control set of sites not known to be edited, denoted by E 0 . The E 1 and E 0 sets are defined with precision in the Methods section. Interestingly, while the existence of a helix is well known to be a prerequisite for editing, the overall frequency of E 1 is actually more than two fold lower in helices (0.044) than in interior loops (0.091). As the overwhelming majority of E 1 sites reside in helices and interior loops, we focus henceforth on these two substructures only. For clarity, we emphasize here that by "interior loop" we mean only the unpaired nucleotides that form the loop's constituent strands. Table 1 also suggests length dependence. The editing prevalence as a function of length is given in figs. 2 and 3 (henceforth, error bars represent 95% confidence intervals. Also, some graphs of integer-valued variables have non-integer entries due to data binning). Clearly, longer helices are more likely to be edited, while longer strands of interior loops are less likely to be edited. In addition, the length of the opposite strand (the one the editing site does not reside in) also affects the editing frequency in an interior loop: as shown in fig. 4, symmetric loops are more likely to be edited.
Furthermore, we study the effect of the location of the specific nucleotide within its respective substructure. We define cePos as the distance of the site (in nucleotides) from the closest edge of the substructure it is in (cePos = 0 means the very edge of a substructure). Figs. 5 and 6 present the frequency of E 1 sites as a function of cePos. For helices, one observes a general trend of enhancement of editing as a site lies deeper in the helix. For interior loops, however, there is dramatic depletion of E 1 for cePos > 0. In fact, it should be noted that 91% of edited sites in interior loops lie at the very edge of the loop, i.e. cePos = 0. Most of these are in fact a single mismatch within an almost perfect helix (i.e., opposite strand length is also one nucleotide). Such mismatches were already implicated as preferred targets of ADARs, as previous in-vitro data as well as bioinformatic work indicate that AC mismatches are more favorable substrates than A-T pairs [19,44]. However, it is worthwhile noticing that our analysis shows this trend to persist even for longer interior loops: interior loop strands of length up to five nucleotides are more likely to be edited than the average site in a helix (see fig. 3).
For these cePos = 0 sites, there is a significant (p < 2.2e-16) effect to the direction of the nearest neighboring helix: A-to-I editing frequency is 0.068 for sites with a helix only in the upstream site, 0.094 for sites with a helix only in the downstream site, and 0.13 for sites with helices on both sides.
The above results hold when controlling for the total length of the substructure: we compared E 1 and E 0 sites for helices of a given length, and for loops of a given size. The resulting trend was the same: for E 1 sites in helices cePos is larger than for E 0 sites, whereas in interior loops the connection is reversed. Other location variables tested, such as the position relative to the middle of the substructure, or relative to the 5′ end, did not result in noticeable results.

Sequence Analysis
We start with the nucleotide opposite of the editing site. For helices, it is clear what this means: the "opposite" nucleotide of a site is the nucleotide that pairs with that site (and is therefore always T). We expand this idea, however, to sites at the edges of interior loops (i.e., having cePos = 0): for these sites on the most 5′ (3′) nucleotide of the loop-strand, the opposite nucleotide is the most 3′ (5′) site of the other strand in the loop. If  the site is the only one on its strand, and the opposite strand has more than one nucleotide, the opposite nucleotide is undefined. We shall refer to the opposite nucleotide as opNuc for short. There is a very strong enrichment for sites with C on the opposite site: we looked at the frequency of E 1 for sites with a given opNuc, and obtained a frequency of 18.5% for C, whereas for A the frequency was 5.1% and for G, 3.7%. This is consistent with (but more pronounced than) the data presented in [19][20][21][22]37,44].
Next we look at the statistics of the nucleotides upstream and downstream of the A-to-I editing sites. In order to avoid biases due to the underlying nucleotide statistics in Alu repeats we do not look at the raw distribution of nucleotides but rather at the enrichment factor, i.e. how much is the editing frequency increased (compared to the average within the respective substructure) when the neighboring site is any specific nucleotide. The enrichment factors are presented in figs 7, 8 and 9 for the two immediate neighbor nucleotides separately, as well as for the joint variable composed of both upstream and downstream neighbors. Overall, the profiles found are similar to those seen in previous largescale studies of editing [19][20][21][22]24,45]: T is most preferred upstream and is not preferred downstream, while G is most preferred downstream and least preferred upstream (in both helices and loops). However, we do find a significant (p < 1.1e-16 for all comparisons) difference between the profiles for helices and loops. For example, the preference for an upstream T is stronger in helices, whereas the preference for a downstream G is stronger in interior loops.
We also calculated the enrichment factors for the joint variable composed of the site's upstream neighbor, downstream neighbor, and opNuc. The results are displayed in Table 2.
In addition, we searched for enrichment in the extended neighborhood of the editing sites, looking at 30 neighboring nucleotides at both sides of the site (upN refers to the nucleotide N sites upstream to the editing site, and dnN refers to the nucleotide N sites downstream to the editing site). Almost all neighbors show a significantly different nucleotide distribution around edited sites, see Tables 3 (helices) and 4 (interior loops). The most significant differences (largest χ 2 scores) are observed for neighbors up1, up2, up7 and dn18 in helices and up1, dn1, up2 and up3 in interior loops. We note that while almost all 60 neighbors tested show statistically significant difference, it is hard to tell whether these differences are due to ADARs preference or rather stem from editing hot spots within the Alu. We also present the enrichment factors for seven positions surrounding the editing sites which were reported to show preferences to specific nucleotides when surrounding ADAR2 editing sites [41]. As seen in Table 5, the patterns observed here for Alu editing are somewhat different: for example, locations dn10 and dn13 seem to favor G in contrast to the opposite trend reported in [41] for ADAR2 sites. The differences might be due to the much larger sample we study here. Additionally, it is also possible that editing sites in the coding region, mostly having a functional role, have different characteristics than the ones in Alu repeats. However, these differences could also result from differences between the profiles of ADAR1 and ADAR2. While the sample of editing sites studied in [41] is biased towards ADAR2 targets, the sample studied here, coming from a wide range of tissues, represents a different mix of the two enzymes, with larger weight of ADAR1. Moreover, the different splice-variants of the ADARs possibly have varying editing efficiencies and site preferences. The mix of these variants occurring in our in-vivo sample, could also lead to slight variations in the preferences observed as compared to results of invitro studies.

Thermodynamic Calculations
Finally, we study the effect of thermodynamic stability on editing efficiency. For each genomic neighborhood, we look at the thermodynamic average over all the low free-energy structures. The laws of statistical mechanics give us the probability that the RNA is in a specific secondary structure n, where T is the temperature in degrees Kelvin, k B is Boltzmann's constant and Z is defined by the sum where the label n runs over all available foldings, and G n are the respective free energies. In practice, we only use those folds generated by MFOLD which are expected to be all folds relevant at human-body temperature. We may now, for example, calculate the probability of some particular site to be in a helix, where p s i is the frequency of site i being in substructure of type s. This entropy is a measure of the thermodynamic volatility of the site's substructure: if a site is always in the same substructure (e.g. the site is always in a helix), it will have zero structural entropy. If, however, the site's substructure fluctuates, for example between a helical structure and a loop structure, it will have higher structural entropy. The structural entropy of a site with equal probability to be in two difference substructures is ln(2) = 0.7. The highest possible structural entropy is of a site which spends equal time in each of the possible substructures. Figs. 10 and 11 show the frequency of E 1 as a function of the structural entropy, for sites whose lowest free-energy structure is helix or interior-loop separately. Interestingly, A-to-I editing is enriched for sites with low structural entropy, i.e. having a well-defined low energy micro-structure. A wobbling state, fluctuating between two or more possible structures is less well edited. This holds regardless of the ground-state structure, but the effect is stronger for interior loops: sites with a well-defined interior loops structure are twice more likely to be edited compared with sites whose ground state structure is also an interior loop but having even 1% probability to be in other structures.
Analysis of a large dataset of secondary structures of putatively edited Alu repeats reveals that structural motifs are indeed important in determining the A-to-I editing efficiency of a given site. Most notably, we highlight the strong preference for editing of adenosines within short symmetrical internal loops. Moreover, the microstructure also has modest but noticeable effect on the cis-preferences of the ADARs. Long perfect dsRNA duplexes are often considered to be the best target for editing by ADARs. Here we find that sites adjacent to the edge of helices (cePos = 0) are even more efficiently edited. Averaged over our whole database, adenosines deep within (cePos > 10) long (> 30 bp) perfect helices are indeed edited more efficiently than the average adenosine in a helix -we find 1625 such sites, with editing frequency 8.2%, compared to only 4.4% for the average helix site. However, this is still lower than the average frequency for interior loops, 9.1%. Moreover, focusing on single A-C mismatches within a helix (i.e. cePos = 0 sites having neighboring helices on both sides and C as the opposite nucleotide)   raises the frequency to 19%. Finally, choosing also the optimal neighbors, i.e. T upstream and G downstream, raises the editing frequency as observed in our database to 37% ! We stress again that these frequencies should not be regarded as the true editing efficiency, but rather as a relative measure. Yet, one is able to conclude that the best way to engineer an efficient editing site is not to put it in a long perfect duplex, but rather to be in a single mismatch within a duplex. Interestingly, the 100% edited E1 site in the NARF gene [25], fits nicely with these engineering rules -it is a cePos = 0 site in a symmetric loop, with C opposite to it and T and G in the upstream and downstream sites, respectively. However, the strand length there is 3 and not the optimal 1. An editing site that fully adheres to the above "rules" is the amber/W one of the hepatitis delta virus antigenome (genotype I) [46]. This site is critical for the virus to assemble viral particles and to be infectious [47]. Given the high adaptivity of viruses, it is not surprising to find that this site fits with all of the above: it is located in a single A-C mismatch within a helix (cePos = 0 and loop length = 2), and has T and G as its immediate neighbors.
However, the Q/R site in GluRB does not fit to our observations. It lies within a rather long (17 bp) helix, with cePos = 5, with C (rather than the optimal T) upstream and G downstream [48]. Yet, this site is also 100% edited. Apparently, there is still much more to learn about the characteristics of editing by ADARs, beyond the information presented in the present study.   One possible explanation is that this site in known to be edited only by ADAR2 [49]. The two editing enzymes ADAR1 and ADAR2 are known to have overlapping, but distinct, preferences [36][37][38]50,51]. However, our approach does not allow us to distinguish between them. It was recently shown that editing of mouse B1 and B2 SINEs is mediated by both enzymes [39]. Some sites within these repeats are ADAR1 specific, some are ADAR2 specific and some are edited by both. It is not yet clear which enzyme is the main one in terms of Alu editing in human. Since our database is based on mRNA sequences from a wide range of tissues, it is possible that it characterized mainly the profile of the widely-expressed ADAR1 rather than that of ADAR2 which is expressed mainly in the brain. It is thus likely that some of the preferences identified in this work characterize ADAR1 and are therefore not present in the GluRB ADAR2-specific site. The discrepancies between nucleotide distributions around the editing sites reported above and those reported by [41] for ADAR2 sites might also attest for differences between the ADAR2 profile and the one characterizing our dataset, probably a mix of the two enzymes, with larger weight of ADAR1.
In an attempt to estimate the differences between the two enzymes, we compared 4657 editing sites supported by 13805 brain mRNAs, where both ADAR1 and ADAR2 are present, and 1684 sites residing in 10186 non-brain mRNAs, presumably edited mostly by ADAR1 (tissue-origin was determined by UCSC annotation [52]). The patterns observed were similar but not identical. For example, 1376 of the 2966 brain sites residing in a helix (46.4%) had a G in the dn1 position, compared to 452 out of 1076 (42.0%) in non-brain sites in a helix (p-value 0.013). However, differences were not statistically significant upon Bonferroni-correcting for multiple testing. Thus, a larger and better dataset (fully characterized in terms of of the tissues studied) is required in order to study the small tissue differences between the preferences of the two ADAR enzymes.

Conclusions
Using a dataset of 29,971 editing sites within Alu repeats, we analyzed the editing preferences. We found that the micro-structure a site resides in affects its editing frequency. In addition, the sequence motifs characterizing editing sites vary with the micro-structure. We have also shown that structural entropy and thermodynamic stability play a role in determining editing efficiency. We find that the probability of a nucleotide fluctuating between a number of possible structures to be edited is lower than the weighted average of the probabilities for each possible structure alone. This provides a hint as to the rate of the A-to-I editing process compared to the relaxation time scales controlling the transition between the possible folds. Taken together, the results presented here could be of help in revealing the complex nature of the ADARs editing profile.

Methods
We construct a list of putative editing sites within Alu repeats following the method presented in [23,24]. That is, we use mismatches in the relatively clean RNA sequences, rather than the much larger but noisier EST data. We use UCSC alignments of human RNA sequences to the genomes http://genome.ucsc.edu [52] and record all mismatches in these alignments. Then, known SNPs are removed, and the list is intersected with Alu locations, to obtain a set of mismatches within Alu repeats. A-to-I editing sites in Alu repeats tend to occur in clusters, we thus take only clusters of three or more consecutive identical mismatches. While this process is not inherently biased towards any specific type of mismatch, 98% of the mismatches found are A-to-G, suggesting that although these sites are typically supported by a single mismatch, they do represent true A-to-I editing sites with a low level of false-positives. Nucleotide distributions at certain locations around editing sites, reported to exhibit nucleotide biases [41]. For each of the sites, we present the probability to have a given nucleotide N when the 0 location adenosine is edited, divided by the probability of that nucleotide regardless of whether the 0 adenosine is edited: P(N|0 adenosine is edited)/P(N). Values > 1 indicate enrichment of N for edited sites, and < 1 indicate depletion. upX = X nucleotides upstream of site 0, dnX = X nucleotides downstream. The probabilities are given for editing sites in helices and interior loops separately, but are very similar for both. For comparison, we present the patterns reported in [41]. We then construct the predicted secondary structures using the following procedure: (a) for each site in our list, its Alu was located in the genome. Then, the nearest antisense Alu was located, and the genomic neighborhood that includes all nucleotides in and between the two Alus was taken, along with 200 extra nucleotides on each side. 61% of the inter-Alu distances are less than 1000 nucleotides, 22% more lie between 1000 and 2000, 9% are between 2000 and 3000 and the remaining 8% strech from 3000 to 6800 nucleotides.
(b) Neighborhoods having > 400 bp overlap on the same strand were merged into a single neighborhood encompassing both. This step resulted in 3,276 neighborhoods, containing 29,971 putative editing sites. RNA segments corresponding to these neighborhoods were folded using MFOLD, resulting in predicted secondary structures.
The accuracy of RNA secondary structure prediction by current dynamic programming algorithms (such as the MFOLD software) is moderate, up to 73% [53]. Yet, while false structure predictions would inevitably introduce noise to our analyzed data, the large sample size should allow for detecting a signal. Moreover, one should bear in mind that the RNA structures we consider -long and almost perfect dsRNAs -are relatively easy to analyze, and thus the accuracy of the folding algorithms is expected to be much higher than the above quoted rate.
(c) We parsed the output of MFOLD into a relational database containing all the information about the secondary structures in which the various sites reside (the basic substructures are given in fig. 1).
(d) All adenosines in the genomic neighborhoods' sites were classified: We find 29,971 putative editing sites within Alu repeats, denoted by E 1 and 590,206 adenosines within Alu repeats that were not detected as editing sites, denoted by E 0 . The adenosines which are not in Alu repeats do not enter our analyses. In the following, we use the E 0 sites as a control set. It should be stressed that the sensitivity of the bioinformatic algorithms for detecting editing sites is rather low, mainly due to the low coverage, low editing efficiency of most sites and tissue origin of the available sequences. For example, the observed editing efficiency averaged over all Alus, which is 0.048, is probably lower than the actual value. Therefore, the set of E 0 sites should not be thought of as sites that are never edited, but rather as a background, maybe slightly depleted in editing sites. On the other hand, the set of E 1 contains, with high precision, only edited sites [23].