Comparative analysis of EhSINE1-containing regions of E. histolytica genome and syntenic regions of E. dispar
A total of 393 full-length SINE1 elements (length > 450 bp) were identified in E. histolytica by genome sequence analysis. Syntenic regions corresponding to each of the EhSINE1-containing loci were located in E. dispar. To score for synteny, entire scaffolds were matched in the two species using the program GATA. In 88% of cases where synteny was found, the syntenic regions matched throughout the scaffold, while in the rest synteny was not visible in some patches. The presence or absence of any of the EdSINEs (EdSINE1, 2, 3) was determined at syntenic loci of E. dispar. The results are summarized in Figure 1A. Of the 393 EhSINE1-containing loci of E. histolytica, syntenic regions in E. dispar could be predicted with certainty for 180 loci. Only these loci were included for further study, thus removing the contribution of differential sequence coverage from our comparative analysis. Loci not represented in both species due to differences in genome coverage, or difficulty in alignment, have not been included. In addition, as stated above, the synteny stretched for the entire length of the scaffold. Of these 180 loci, SINEs were absent in E. dispar at 114 loci, were present at 24 loci and their presence or absence could not be determined at 42 loci. Amongst the 114 loci where SINEs were absent, no repeat element of any type was found at 96 loci (representative example shown in Additional file 3 Figure S2), while LINE or EdRC4 sequences were found at 18 loci. Amongst the 24 loci that contained a SINE, 18 had EdSINE1 (one representative example is shown in Figure 2 and the rest in Additional file 4 Figure S3-S19), one had a truncated copy of EdSINE1, and five had EdSINE2 and/or EdSINE3. Amongst the 42 loci where presence or absence of SINE could not be established, in 23 cases the scaffold ended within the locus in E. dispar, and in 19 cases the homology was restricted to genes on one side of the SINE while there was no homology on the other side (possibly due to deletions, inversions or rearrangements).
For 213 EhSINE1-containing loci of E. histolytica, syntenic loci could not be found in E. dispar for the following reasons. In many of these cases the scaffolds containing these loci were composed entirely of repeats (83 loci), tRNA genes and repeats (7 loci) or pseudogenes and repeats (9 loci). In 114 cases synteny could not be determined either because there were multiple copies of homologous genes, or homologous genes were located on multiple scaffolds, or there was a single gene in the scaffold.
Comparative analysis of EdSINE1-containing regions of E. dispar genome and syntenic regions of E. histolytica
A total of 302 full-length SINE1 elements (length greater than 450 bp) were identified in E. dispar by genome sequence analysis. Syntenic regions corresponding to each of the EdSINE1-containing loci were located in E. histolytica and the presence or absence of any of the EhSINEs (EhSINE1, 2, 3) was determined as described above. Of the 302 loci, syntenic regions could be predicted with certainty for 127 loci (Figure 1B). Of these, SINEs were absent in E. histolytica at 73 loci, were present at 19 loci and their presence or absence could not be determined at 35 loci. Amongst the 73 loci where SINEs were absent, no repeat element of any type was found at 62 loci (Additional file 5 Figure S20), while LINE sequences were found at 11 loci. Amongst the 19 loci that contained a SINE, 18 had EhSINE1 (as scored above) while 1 had EhSINE2. Amongst the 35 loci where presence or absence of SINE could not be established, in 23 cases the scaffold ended within the locus in E. histolytica, and in 12 cases the homology was restricted to genes on one side of the SINE while there was no homology on the other side.
For 175 EdSINE1-containing loci, syntenic loci could not be found in E. histolytica for the reasons mentioned in the previous section. In 75 cases the E. dispar loci were composed entirely of repeats while in 100 cases synteny could not be determined either because there were multiple copies of homologous genes, or homologous genes were located on multiple scaffolds, or there was a single gene in the scaffold.
Sequence alignment of syntenic loci
Figure 3 shows actual sequence alignments of a few selected syntenic loci where SINE1 is found in E. histolytica but missing in E. dispar, and vice versa. As is evident, in each case the element is flanked by TSDs. Only one copy of the TSD is found at the syntenic locus of the species where the SINE is missing. The surrounding sequences show the sequence similarity expected of intergenic regions of the two species (80-90%). One example is shown of an intergenic region where SINE1 is present in both species. Although SINE1 is located in the same intergenic region, the actual point of insertion is not the same and consequently the TSD sequences are different (Figure 3B). This was the typical pattern seen in other loci of this type where SINEs were located in the same intergenic regions in both species.
From the above data it is clear that SINE elements are located in the same intergenic region in both species (although at different insertion points) in only about 20% of the cases where the presence or absence of SINE1 could be established at syntenic loci. In the remaining loci (about 80%) SINE1 was not found at the same location in both species. Since the elements in the two species have a common lineage and are closely related, what possible factors might account for these differences? According to the target-primed reverse transcription model, retrotransposition is initiated by the LINE-encoded endonuclease (EN) nicking the bottom strand of the target site [15]. Hence it is reasonable to believe that the sequences preferentially nicked by the EN could be the preferred insertion sites of the retrotransposon, and that the behavior of the EN might influence the choice of target site of a non-LTR retrotransposon. Since the Eh EN and Ed EN differ from each other at many amino acid positions (as shown below), it is possible that the two enzymes have evolved different recognition specificities. To test this, we studied the properties of the EdLINE1-encoded EN and compared them with those of the EhLINE1-encoded EN.
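The "about 20%" figure can be reproduced from the locus counts reported in the two comparative sections above. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name are illustrative assumptions, and only the counts themselves come from the text.

```python
# Locus counts as reported in the text (Figure 1A and 1B).
# The dictionary layout and names are illustrative, not the authors' code.
EH_LOCI = {"total": 393, "syntenic": 180, "no_synteny": 213,
           "absent": 114, "present": 24, "undetermined": 42}
ED_LOCI = {"total": 302, "syntenic": 127, "no_synteny": 175,
           "absent": 73, "present": 19, "undetermined": 35}


def check_counts(counts):
    """Verify that the reported sub-categories add up to their parent totals."""
    assert counts["syntenic"] + counts["no_synteny"] == counts["total"]
    assert (counts["absent"] + counts["present"]
            + counts["undetermined"]) == counts["syntenic"]


def fraction_shared(counts):
    """Fraction of resolvable syntenic loci with a SINE in both species.

    'Resolvable' means presence or absence could be established,
    so undetermined loci are excluded from the denominator.
    """
    resolved = counts["absent"] + counts["present"]
    return counts["present"] / resolved


for label, counts in (("EhSINE1 loci", EH_LOCI), ("EdSINE1 loci", ED_LOCI)):
    check_counts(counts)
    print(f"{label}: {100 * fraction_shared(counts):.1f}% shared")
```

Starting from EhSINE1 loci the shared fraction is 24 of 138 resolvable loci, and starting from EdSINE1 loci it is 19 of 92, which brackets the approximate value quoted in the text.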
Cloning and expression of the EdLINE1 endonuclease (Ed EN) polypeptide
To obtain the Ed EN coding sequence we used the Eh EN sequence (already cloned in our lab) as a starting point. Ed EN differs from Eh EN at 23 amino acid positions (Figure 4). The Eh EN sequence was mutated at these positions (as described in Methods) to obtain the Ed EN coding sequence. This 782-bp EcoRI-NotI fragment was cloned in the E. coli expression vector pET30b. The expressed protein contained a His-tag and, together with other vector sequences at the amino terminus, was 307 amino acids long, with an expected molecular mass of 35.3 kDa (Figure 5A). It was purified by nickel-agarose chromatography, and its identity was confirmed by using an anti-His tag antibody and an anti-Eh EN antibody (Figure 5B). The Ed EN protein could nick a nonspecific substrate, pBS. As previously reported for Eh EN [19], supercoiled pBS DNA was efficiently nicked by the purified Ed EN protein to yield open circular and linear DNAs. The presence of discrete bands corresponding to open circular and linear forms shows that the enzyme makes predominantly single-strand nicks and not double-strand breaks (Figure 5D).
Kinetics of the Ed EN-catalyzed reaction with pBS supercoiled DNA substrate under steady-state conditions
To determine the kinetics under steady-state conditions, reactions were carried out with the enzyme at a concentration of 2 nM and with pBS DNA at a concentration of 2-75 nM (Figure 6). The disappearance of supercoiled DNA was determined by densitometric scanning as described for Eh EN [20]. As mentioned in Methods, all time-course results were the average of at least three independent determinations. The variation observed at each time point was <4.7% of the mean value (0.09-5.1). Although the variation in values of each data point was in the range of 4% in three replicates, the slopes for each set showed less variation (up to 1.0%). Kinetic parameters (Km and kcat) were calculated from a Lineweaver-Burk plot (Figure 6C). Km for pBS DNA was calculated to be 1.086 ± 0.009 × 10⁻⁸ M. The catalytic constant, kcat (Vmax/[E]), was determined to be 5.67 ± 0.027 × 10⁻³ sec⁻¹. These values were comparable to the Km (2.6 ± 0.018 × 10⁻⁸ M) and kcat (1.6 ± 0.01 × 10⁻² sec⁻¹) of Eh EN [20], and the Km was comparable with the low Km values (0.5-17 nM) of restriction endonucleases determined with different DNA substrates under different conditions of buffer and temperature [22]. Furthermore, the turnover number (kcat) of the enzyme was in the lower range of that reported for restriction endonucleases [(1.6-16.6) × 10⁻² sec⁻¹] [22]. The low turnover number of a retrotransposon-encoded endonuclease may have a significant role in limiting the rate of retrotransposition events in the genome. From the above data we infer that the kinetic parameters of Ed EN are not significantly different from those of Eh EN.
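A Lineweaver-Burk plot linearizes the Michaelis-Menten equation as 1/v = (Km/Vmax)(1/[S]) + 1/Vmax, so Km and Vmax follow from the slope and intercept of a straight-line fit, and kcat = Vmax/[E]. The sketch below illustrates this calculation only. The rates are synthetic, generated from the reported Km and kcat rather than taken from the measured data, and the substrate concentrations are arbitrary values chosen within the stated 2-75 nM range.

```python
def michaelis_menten(s, vmax, km):
    """Initial rate predicted by the Michaelis-Menten equation."""
    return vmax * s / (km + s)


def lineweaver_burk_fit(substrate, rates):
    """Least-squares line through (1/[S], 1/v); returns (Km, Vmax)."""
    xs = [1.0 / s for s in substrate]
    ys = [1.0 / v for v in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # equals Km / Vmax
    intercept = mean_y - slope * mean_x  # equals 1 / Vmax
    vmax = 1.0 / intercept
    km = slope * vmax
    return km, vmax


ENZYME_M = 2e-9      # enzyme concentration from the text (2 nM)
KM_REPORTED = 1.086e-8   # M, reported Km for Ed EN
KCAT_REPORTED = 5.67e-3  # per second, reported kcat for Ed EN

# Assumed substrate concentrations within the 2-75 nM range (illustrative).
substrate = [2e-9, 5e-9, 10e-9, 25e-9, 50e-9, 75e-9]

# Synthetic, noise-free rates; NOT the measured densitometry data.
rates = [michaelis_menten(s, KCAT_REPORTED * ENZYME_M, KM_REPORTED)
         for s in substrate]

km_fit, vmax_fit = lineweaver_burk_fit(substrate, rates)
kcat_fit = vmax_fit / ENZYME_M
print(f"Km = {km_fit:.3e} M, kcat = {kcat_fit:.3e} per sec")
```

Because the double-reciprocal transform gives the lowest substrate concentrations the most leverage on the fit, noisy data are often also fitted by direct nonlinear regression as a cross-check.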
Nicking site sequence preference of Ed EN
We had earlier shown that Eh EN preferentially nicked a 176 bp fragment from E. histolytica precisely at the site where a SINE1 element was known to insert in this region of the genome [19]. To test whether Ed EN had a sequence preference similar to that of Eh EN, the same 176 bp fragment was incubated with Ed EN. The nicking pattern, determined for the bottom strand, was exactly the same as that obtained with Eh EN. Three nicking hot spots were obtained, of which site #3 corresponded with the exact site of insertion of Eh SINE1 in vivo (Figure 7A). The sequences important for target site recognition, as determined for Eh EN [18], were tested for Ed EN by altering the sequences immediately surrounding the nicked site #3. Transition mutations were introduced using oligonucleotides with the appropriately altered sequence to PCR amplify a 117 bp fragment from the 176 bp template (position 60 to 176). The DNA sequences thus obtained contained a normal site #2 and a mutated site #3. The activity of Ed EN on the mutated site #3 was quantitated using site #2 as an internal control. The results showed that changing the GG nucleotides (on the top strand, upstream of the nick) to TT decreased the activity to 10% for Ed EN (compared with 2% for Eh EN), and changing the T nucleotide upstream of the nick to C increased the activity to 183% for Ed EN (compared with 133% for Eh EN) (Figure 7B). These results show that Ed EN and Eh EN are very similar in their target site specificity. Since only the endonuclease domain of ORF2 was used in these studies, the possibility that the complete ORF2 protein displays a different specificity in vivo cannot, however, be entirely ruled out.
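One plausible way to express activity at a mutated site relative to an internal control is a ratio of ratios. This is a sketch under that assumption and not necessarily the quantitation used here: the site #3 signal is divided by the site #2 signal in the same lane, and the result is compared with the same quantity for the unmutated fragment. The function name and the band intensities are hypothetical.

```python
def relative_activity(mut_site3, mut_site2, wt_site3, wt_site2):
    """Percent nicking at mutated site #3 relative to the wild-type fragment.

    Each site #3 signal is first normalized to the site #2 signal from the
    same lane (the internal control), which cancels lane-to-lane differences
    in loading and overall enzyme activity.
    """
    mut_ratio = mut_site3 / mut_site2
    wt_ratio = wt_site3 / wt_site2
    return 100.0 * mut_ratio / wt_ratio


# Hypothetical densitometry values in arbitrary units (not measured data).
activity = relative_activity(mut_site3=80.0, mut_site2=400.0,
                             wt_site3=800.0, wt_site2=400.0)
print(f"activity at mutated site #3: {activity:.0f}% of wild type")
```

With this definition, a value below 100% indicates that the mutation reduced nicking at site #3, and a value above 100% indicates that it enhanced nicking.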
DNA structural features of E. dispar and E. histolytica SINE1 insertion sites
We have earlier shown that sequence-dependent DNA structural features may play an important role in site selection by SINE elements in E. histolytica [18]. Here we have analyzed the same features in E. dispar to see whether the SINE insertion sites share these properties with those of E. histolytica. The parameters checked are listed in 'Methods'. In general, the majority of the features showed a similar pattern in both species, except for the DNA denaturation energy and free energy profiles. Our previous study had shown that the insertion sites in E. histolytica are T-enriched and that the T-content profile showed a significant peak at -22 bp relative to the insertion site. A similar profile was also observed for E. dispar, with a peak at -22 bp (Figure 8). Statistical analysis by the Mann-Whitney test of the difference in average T content of insertion sites for E. histolytica and E. dispar suggested that there is no significant difference between the two. Similar results were obtained when other DNA-based structural parameters were determined. Thus the SINE1-occupied sites in E. histolytica and E. dispar share the same structural features.
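The comparison of T content between the two species can be illustrated with a small, self-contained Mann-Whitney U calculation. This is a minimal sketch that assumes T content is computed per flanking window. The sequences are made up for illustration, and the p-value uses the normal approximation with tie correction and no continuity correction, which is crude for samples this small.

```python
import math


def t_content(seq):
    """Fraction of T residues in a DNA sequence."""
    seq = seq.upper()
    return seq.count("T") / len(seq)


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test; returns (U for sample a, p-value).

    Ties receive average ranks; the p-value comes from the normal
    approximation with tie-corrected variance.
    """
    n1, n2 = len(a), len(b)
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        t = j - i                      # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_a, 1.0
    z = (u_a - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u_a, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical flanking windows (illustrative sequences, not genome data).
eh_windows = ["TTTATTTCTTTAATTT", "ATTTTTGTTTATTTCA", "TTTCTTTTAATTTGTT"]
ed_windows = ["TTTATTTTTCTAATTA", "ATTTCTGTTTATTTTA", "TTACTTTTAATTTGTT"]

u_stat, p_value = mann_whitney_u([t_content(s) for s in eh_windows],
                                 [t_content(s) for s in ed_windows])
print(f"U = {u_stat}, p = {p_value:.3f}")
```

A large p-value is consistent with the two sets of insertion sites having similar T content, although failing to detect a difference is not proof that none exists.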
We checked whether the intergenic regions at loci where SINEs were found in both genomes shared greater sequence similarity compared with loci where SINEs did not occur in both genomes. However, this was not found to be the case. Intergenic regions in both sets of loci showed overall sequence similarity in the range of 75-90%. E. histolytica and E. dispar are very closely related sibling species [23] which were in fact classified as a single species until they were re-described as two separate species [24]. Their close relationship is also evident from phylogeny based on LINE-derived RT sequences. This analysis showed that all three families of LINEs and SINEs already existed in the common ancestor before E. histolytica and E. dispar separated into two distinct species [6]. The great similarity between the two ENs of E. histolytica and E. dispar, as found in this study, again shows that the basic retrotransposition machinery is highly conserved in these sibling species. It is therefore possible that SINEs may indeed have occupied all of the potential insertion sites in the genome of the common ancestor of E. histolytica and E. dispar but many of the inserted elements may have been preferentially lost in each genome as the two species diverged from each other. Indeed the differential loss of retrotransposons from specific loci might have contributed to speciation [25, 26]. On the other hand, it is possible that SINE expansion took place after the divergence of the two species, and that only a subset of the potential insertion sites in the E. histolytica and E. dispar genomes is currently occupied. In that case one may expect each of these extant genomes to possess a large number of 'empty' sites where SINEs could potentially insert in the future. A hallmark of retrotransposition is the appearance of a target site duplication (TSD) following the insertion of a new element. In syntenic loci an unoccupied site is expected to have one copy of the TSD, which is duplicated in the occupied site. We checked for TSD sequences in E. dispar unoccupied sites corresponding to the syntenic E. histolytica occupied sites. We randomly picked 75 E. histolytica loci for which SINE1 is absent at the syntenic locus in E. dispar, and looked for matches with the TSD at each locus. At 19 loci we found a very good match with the TSD sequence (matched length greater than 15 bp, and sequence identity greater than 85%). The occurrence of close matches of TSD sequences in the syntenic loci of E. dispar suggests that potential empty sites may exist where future retrotransposition events could take place.
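The TSD search can be sketched as a scan of the syntenic region for the best match to the TSD, followed by the two thresholds quoted above (matched length greater than 15 bp, identity greater than 85%). The matching procedure is not specified in the text, so the ungapped sliding-window comparison below is an assumption, and the function names and sequences are hypothetical.

```python
def best_ungapped_match(tsd, region):
    """Slide the TSD along the region; return (start, identity) of the best window.

    Identity is the fraction of matching positions in an ungapped
    comparison. Returns (-1, 0.0) if the region is shorter than the TSD.
    """
    tsd, region = tsd.upper(), region.upper()
    best_start, best_identity = -1, 0.0
    for start in range(len(region) - len(tsd) + 1):
        window = region[start:start + len(tsd)]
        matches = sum(1 for x, y in zip(tsd, window) if x == y)
        identity = matches / len(tsd)
        if identity > best_identity:
            best_start, best_identity = start, identity
    return best_start, best_identity


def is_candidate_empty_site(tsd, region, min_len=15, min_identity=0.85):
    """Apply the thresholds from the text: length > 15 bp and identity > 85%."""
    if len(tsd) <= min_len:
        return False
    _, identity = best_ungapped_match(tsd, region)
    return identity > min_identity


# Hypothetical 18 bp TSD and a syntenic region carrying it with one mismatch.
tsd = "ATTGCATTGACCTTAGGA"
region = "GGGG" + "ATTGCATTGACCTTAGCA" + "CCCC"

start, identity = best_ungapped_match(tsd, region)
print(start, round(identity, 3), is_candidate_empty_site(tsd, region))
```

An ungapped scan misses matches interrupted by small insertions or deletions, so a local alignment would be the natural replacement if such cases matter.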