Stable intronic sequence RNAs (sisRNAs) are selected regions in introns with distinct properties

Background Stable introns and intronic fragments make up the largest population of RNA in the oocyte nucleus of the frog Xenopus tropicalis. These stable intronic sequence RNAs (sisRNAs) persist through the onset of zygotic transcription when synchronous cell division has ended, and the developing embryo consists of approximately 8000 cells. Despite their abundance, the sequence properties and biological function of sisRNAs are just beginning to be understood. Results To characterize this population of non-coding RNA, we identified all of the sisRNAs in the X. tropicalis oocyte nucleus using published high-throughput RNA sequencing data. Our analysis revealed that sisRNAs, have an average length of ~ 360 nt, are widely expressed from genes with multiple introns, and are derived from specific regions of introns that are GC and TG rich, while CpG poor. They are enriched in introns at both ends of transcripts but preferentially at the 3′ end. The consensus binding sites of specific transcription factors such as Stat3 are enriched in sisRNAs, suggesting an association between sisRNAs and transcription factors involved in early development. Evolutionary conservation analysis of sisRNA sequences in seven vertebrate genomes indicates that sisRNAs are as conserved as other parts of introns, but much less conserved than exons. Conclusion In total, our results indicate sisRNAs are selected intron regions with distinct properties and may play a role in gene expression regulation.

the form of a lariat in which the 5′ end is linked to the 3' acceptor splice site. These lariats are debranched into a linear form and then degraded rapidly in most cases [12,13]. The majority of intron fragments are believed to be unstable, with a few exceptions [14][15][16][17]. Recently, a number of sisRNAs were identified in the oocyte nucleus (germinal vesicle or GV) of Xenopus tropicalis [9] and later in the oocyte cytoplasm [10]. Although less stable than cytoplasmic mRNA, the sisRNAs are very stable. Transcription inhibition studies revealed that they are stable for at least 2 days and RNA sequencing analysis demonstrated that they are transferred to the egg upon GV breakdown and persist until at least the blastula stage [9]. These sisRNAs are usually only a portion of the full intron and while nuclear sisRNAs are present as either linear or lariat molecules, most cytoplasmic sisRNAs are in lariat form [9,10]. Thus far, sisRNAs have also been found in human, mouse, chicken, zebrafish, the Epstein-Barr virus and Drosophila melanogaster [9,[18][19][20][21][22]. Little is known about the sequence properties and biological function of these abundant sisRNAs although recently, sisR-1 and sisR-4, have been shown to participate in a feedback loop to modulate their parental gene expression in Drosophila [20,[22][23][24].
To further characterize sisRNAs, we identified all sisR-NAs in the X. tropicalis genome using the published high-throughput sequencing data of RNA from the GV [9]. We then determined the average length of sisRNAs, their distribution in genes, sequence composition, transcription factor binding site (TFBS) enrichment, gene ontology and evolutionary conservation. Here we show that sisRNAs are most widely expressed from genes with multiple introns, enriched in GC and TpG. They are enriched in introns at both ends of transcripts but preferentially at the 3′ end. They also contain specific transcription factor binding sites (TFBS), which supports recent findings that sisRNAs play a role in the regulation of gene expression.

Genome-wide identification of sisRNAs in Xenopus tropicalis
To study the sequence characteristics of sisRNAs, the highthroughput sequencing data of RNA (RNA-seq) from the GV of the frog Xenopus tropicalis [9] was used to detect sisRNA peaks with the model based analysis for ChIP-seq (MACS) algorithm [25]. We identified a total of 63,410 RNA peaks by a more stringent criterion (FDR = 0.01) (Fig. 1a, Table 1) and compared the location of each RNA peak to exons and introns to identify the true sisRNA peaks (i.e. those from spliced introns that did not cross intron-exon boundaries.) Two widely used and primarily manually annotated gene sets, RefSeq (9448 protein-coding genes) and Ensembl (28,967 protein-coding genes), were used as references to identify the sisRNA peaks. Twenty-four thousand nine hundred and one sisRNAs were identified by refSeq genes (termed as refSeq sisRNAs), with a total length of~9 Mbps, which accounts for 0.6% of the X. tropicalis genome (Fig. 1a, Table 1). Similarly, 34,169 sisRNAs were identified by Ensembl genes (termed as Ensembl sisRNAs), with a total length of~12 Mbps, which accounts for 0.8% of the genome (Fig. 1a, Table 1). Together, a total of 20,020 peaks were identified by both gene sets (Fig.  1a), which represents~80% of refSeq sisRNAs and~60% of Ensembl sisRNAs respectively. The sisRNA length ranges from 160 bps to 2908 bps with the majority 200-500 bps long and an average of 363.6 bps and 356.6 bps for refSeq and Ensembl sisRNAs, respectively ( Fig. 1b- Table 1). Taken together, two high quality datasets of genome-wide sisRNAs were generated for further investigation of sisRNA properties.
sisRNAs are widely expressed and preferentially located in genes with multiple introns To determine the number of genes with sisRNAs, we first divided the genes of each dataset into 10 groups according to the number of introns in a gene. Genes without introns were excluded from further analysis. From the 9448 RefSeq genes, 93.8% (8864) have multiple exons while only 6.2% (583) of the genes have a single exon (Tables 2). Our data show that the more introns a gene has, the more likely the gene generates sisRNAs. While 24.9% of genes with a single intron have sisRNAs located in their intron, the possibility that a gene with 5 introns has a sisRNA increased to 66.6%. When a gene has 10 or more introns the possibility increased to 81.0% (Fig. 2a). On average, 67.3% of Refseq genes (5965/8864) with introns produce sisR-NAs (Fig. 2a, Table 2).
We also calculated the average number of different sisRNAs per gene. We observed that the average sisRNA number increases with intron number (the average number of introns per gene is 7.8 in Xenopus tropicalis). For example, genes with a single intron have an average of 0.60 sisRNAs, genes with 5 introns have an average of 2.06 sisRNAs, and genes with 10 or more introns have an average of 5.06 sisRNAs (Fig. 2b, Table 2). On average, there are 2.97 sisRNAs per gene and 0.36 sisRNAs per intron (Table 2). We also compared the intron length to the number of sisRNAs for each intron, and it indicates that longer introns tend to have more sisRNAs (R=0.72, Supplementary Fig. 1A-B). Next, we compared the total length of introns (Supplementary Fig. 1C) or normalized intron number per Mb ( Supplementary Fig. 1D) to the number of sisRNAs for each gene. The results indicate that the number of sisR-NAs is well correlated with the total length of introns (R=0.87) rather than the number of introns (R=0.09) in a gene. We observed a similar result for Ensembl genes (Fig. 2, Table 3). The sisRNA sequences are GC and TpG rich To investigate the base-pair composition in sisRNAs, we first calculated the prevalence of all 10 unique dinucleotides in sisRNAs and in the X. tropicalis genome. We observed that sisRNAs as compared to the X. tropicalis genome are rich in AC, CC, AG, CA and GC, while poor in CG, AT, AA, TA and GA, for both RefSeq-and Ensembl-identified sisRNAs (Fig. 3a). We then extended the calculation to trinucleotides. As we expected, sisR-NAs are rich in CCA, GCC, CAC, GCA, CAG, AGC, which are all trinucleotides containing GC or CA|TG, the dinucleotides shown to be enriched in sisRNAs (Fig.  3b). sisRNAs are very poor in CGN, which are trinucleotides with CpG, and also poor in ATA, AAT, AAA and TAA (Fig. 3b). Generally speaking, CpG rich regions (e.g. CpG islands) are G+C rich and CpG poor regions are A+T rich. Interestingly, we observed that sisRNAs are CpG poor while GC rich. We then calculated the GC content, CpG density and CA|TG density for each sisRNA as well as each scaffold in the whole genome ( Fig. 3c-e). Compared to the X. tropicalis genome, sisRNAs have a higher GC content (Fig.  3c), a lower CpG density (Fig. 3d) and a higher CA|TG density (Fig. 3e). As a result, in terms of base-pair compositions, sisRNAs are different from the genome, with a different prevalence of dinucleotides, trinucleotides, and a lower CpG density, higher CA|TG density and GC content. In other words, sisRNAs are unique regions of the genome with their own properties. The results for RefSeq sisRNAs and Ensembl sisRNAs are nearly identical (Fig. 3), providing strong support for our results.

The sisRNAs are specific regions of the introns
Since sisRNAs are derived from regions of introns, and introns have a distinct base-pair composition as compared to the whole genome, we expect sisRNAs to also have a sequence composition different from the whole genome. However, it remains unclear whether sisRNAs are randomly distributed in introns or in specific regions of the intron. GC content (G+C%), CpG density and CA|TG density are widely used and important parameters to analyze sequences characteristics. Thus, we calculated these parameters for each intron with/without sisRNAs and compared these results with sisRNAs alone. The total GC content of the introns with sisRNAs (40.59%) is higher (p< 0.001, t-test) than the genome (40.07%). The GC content in sisRNAs alone (41.92% and 41.79% for RefSeq and Ensembl sisRNAs, respectively) is higher than that of the introns (p< 0.001, t-test) (Fig. 4a, Table 4). The CpG density of introns when compared to the whole genome is poor (Fig. 4b). Interestingly, introns with sisRNAs have a higher CpG density than both introns without sisRNAs and sisRNAs alone (Fig. 4b).
While the CA|TG density in introns is very similar to the genome, sisRNAs have a higher CA|TG density than introns with sisRNAs ( Fig. 4c). Thus, the sisRNAs are in regions of the introns with higher CA|TG density and GC%. We further divided the introns without sisRNAs into two groups according to whether the host gene is with/without sisRNAs: Intron A (the host gene without sisRNAs) or Intron B (the host gene has sisRNAs but the intron itself is without sisRNAs). We found that intron A and intron B are very similar, thus sisRNAs are closely associated with the introns from which they originate. Taken together, these results reveal that sisRNAs are specific regions of introns with distinct sequence compositions.
sisRNAs are enriched at both 5′ and 3′ end of transcripts, with a preference for the 3′ end After we observed that sisRNAs are in specific regions of introns, with unique base-pair compositions, we analyzed whether they are derived from specific introns along the gene. To study where sisRNAs are enriched, we concatenated all the introns for each gene along the transcript. We divided each joined intron transcript into (See figure on previous page.) Fig. 1 sisRNAs are identified from GV RNA-seq data according to Ensemble and RefSeq gene sets in X. tropicalis. a Venn diagram of peaks called by MACS with FDR=0.01 and determined to be sisRNAs according to refSeq and Ensembl gene sets. b UCSC screenshot of identified sisRNAs in the gene E2F3. Red and blue blocks indicate exons, red peaks indicate mRNA detected in the cytoplasm, blue peaks indicate RNA detected in the GV, grey blocks indicate sisRNAs identified, orange lines indicate the summits of sisRNA peaks. c-d Histograms of length distributions for (c) refSeq and (d) Ensembl sisRNAs 100 bins and identified the bins in which the sisRNAs are located. As shown in Fig. 5a, the sisRNAs are mostly enriched at the beginning (5′ end) and the end (3' end) of the transcript with a preference for the 3′ end. An example is shown for the gene nasp: more sisRNAs were observed at the 3′ end (Fig. 5b). These results further confirmed that sisRNAs are not random sequences from the introns: they have distinct sequence compositions and are preferentially driven from the 3′ end of a transcript. We next asked whether sisRNAs are derived from the 5′ and 3′ end of introns because these end introns have unique properties. To investigate this possibility, we divided all the introns from genes with more than 3 introns into 3 categories: 1 st − 5′ end introns (S), middle introns (M) and last − 3′ end introns (E). The results indicate that last − 3′ end introns are very similar to the middle intron, while the first − 5′ end introns have different compositions and are slightly longer (Supplementary Fig. 2). The properties of the first introns may be different because they sometimes overlap with the promoter regions, which have higher CpG, CA|TG density, and GC content. These results indicate that sisRNAs are not preferentially derived from 3′ end introns because these introns have unique properties, but instead some other mechanism of sisRNA production is involved.
Another possibility is that sisRNAs produced from 3′ end introns are remnants of gene transcription and splicing. To test this idea, we performed the analysis of correlation between the expression levels of host genes and sisRNAs. As shown in Fig. 5c, the expression levels of host genes (FPKM) and sisRNAs peak signals are negatively relevant (R=-0.1724), which does not support sisRNAs as remnants of gene transcription and splicing that have not been cleaned from cells. Thus, the stability of sisRNAs is regulated independently from the host mRNAs. Taken together, sisRNAs are most enriched in the 3′ end of introns, and the mechanism for this enrichment remains to be investigated.

Specific TFBSs are enriched in sisRNA-producing introns
Because of the specific sequence properties and preference for introns on the ends of a gene, we performed an enrichment calculation of transcription factor binding sites (TFBS) in sisRNAs to study the potential functions of sisRNAs. We searched for enriched DNA motifs among the 935 position weight matrices (PWMs) collected from the TRANSFAC databases [26] in sisRNAs and in introns without sisRNAs (Fig. 6). We also calculated the motif enrichment in both RefSeq and Ensembl sisRNAs and the enrichments in two datasets are nearly identical (R=0.99), indicating the calculation is robust and the quality of identified sisRNAs is high (Fig. 6b). Stat3, NF-κB, p50:p50, MYOG:NF1 and GAF consensus sites are enriched in sisRNAs but depleted in introns without sisRNAs (Fig. 6a). Stat3 (signal transducer and activator of transcription 3) is a member of the STAT protein family, functions as a transcriptional activator [27], and is highly expressed in X. tropicalis (Fig. 6c). NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells) is a protein complex that controls transcription [28]. p50 is the mature NF-κB subunit, which has no intrinsic ability to activate transcription and has been proposed to act as a transcriptional repressor when binding with κB elements as homodimers (p50:p50) [29]. Myogenin (MYOG) is one of four muscle-specific basic helix-loop-helix regulatory factors involved in controlling myogenesis [30], and NF-1 (Neurofibromin 1) is a negative regulator of the Ras signal transduction pathway [31] and is required for skeletal muscle development [32]. The GAGA factor (GAF) is one of a few transcription factors that can regulate transcription at multiple levels: depending on its target genomic location, it can act as either activator or repressor [33]. We also observed that some TFBS, such as Ncx, Prop1 and Nkx3a, are depleted in sisRNAs while slightly enriched in introns without sisRNAs (Fig. 6a). These results suggest that specific TFBS involved in transcription regulation, either activation or suppression, are enriched in sisRNA-producing introns, which imply that sisRNAs may play a functional role in transcriptional regulation.
The GO terms nucleotide binding, RNA binding and ATP binding are enriched in genes with sisRNAs To further investigate the role of sisRNAs, we asked whether the genes with sisRNAs share any function. GO (Gene Ontology) enrichment analysis [34] of the 5419 RefSeq genes with sisRNAs indicated that 13.9% of the genes are involved in nucleotide binding, 3.2% of the genes are involved in RNA binding and 7.9% genes are involved in ATP binding ( Table 5). The other enriched gene sets included RNA processing, mRNA metabolic process and translation (Table 5). These results suggested that sisRNAs might have a potential biological function involved in gene regulation and metabolism.
The sisRNAs are as evolutionary conserved as introns, and much less than exons Our data indicate that sisRNAs are in specific regions of introns, contain TFBSs, and are in the introns of genes involved in nucleotide, ATP and RNA binding. To investigate whether these sisRNAs are conserved across species, we determined the PhastCons conservation score for sisRNAs and introns (Fig. 7). The PhastCons scores [35] are calculated based on multiple alignments of 6 vertebrate genomes (zebrafish, chicken, opossum, rat, mouse and human) with X. tropicalis. As expected, the boundary of exon and intron has the highest conservation score (Fig. 7a, c). Although sisRNAs are as conserved as other intron regions, they are still much less conserved than exons, suggesting they might not be conserved among species, which was also shown in a recent study [10]. An alternative approach to evaluate the conservation of sisRNAs is to compare the sisRNAs in different cell types from various species. We obtained the cytoplasm sisRNAs data recently published [21], which contains several species and cell types, including human red blood cells, Hela cells, mouse red blood cells and 3 T3 cells, chicken DF1 cells, and Xenopus laevis XTC cells. We compared the GC content, CpG density, and CA|TG density among these cytoplasm sisRNAs to the refSeq and Ensembl sisRNAs that we identified (Fig. 7d-f). We found that between different cell types in the same species, for example, human red blood cells and Hela cells, the GC content, CpG density and CA|TG density of these sisRNAs are quite different from each other (Fig.  7d-f). While for the same cell types between different species, for example, human red blood cells and mouse red blood cells, some sequence composition similarities exist (Fig. 7d-f). These results indicate that sisRNAs are cell type specific. Although sequences of sisRNAs may not be so conserved among species, the sequence compositions are similar.
The X. laevis XTC sisRNAs are very similar to X. tropicalis GV sisRNAs in terms of sequence composition. We next asked how many common host genes with sisRNAs exist between the two species. We compared 5563 unique genes host sisRNAs in GV of X. tropicalis to 2029 unique genes which have sisRNAs in X. laevis XTC cells, there are only 602 genes overlapped and this chance is even lower than random selection (Fig. 7g). Among these 602 genes, the number of sisRNAs from the same intron is less than 50 (data not shown). Overall, our calculation suggested there may not be positional or sequence conservation between sisRNAs, which is also supported by other recent work [10].
We also performed GO (Gene Ontology) enrichment analysis for the genes host sisRNAs in each cell type ( Supplementary Fig. 3A-F). In chicken DF1 cells, the sisRNAs are not enriched in any gene sets (Supplementary Fig. 3E). Even though sisRNAs are enriched in some gene sets in the other 5 cell types, there are no common enriched gene sets present in all 5. The most common enriched gene set is related to sister chromatid segregation or nuclear division, which is observed in human RBC, Hela, mouse RBC, and X. laevis XTC cells (Supplementary Fig. 3A-C, F). Another common enriched biological process is covalent chromatin modification, which is observed in human RBC, mouse RBC and 3 T3 cells ( Supplementary Fig. 3A, C-D). It will be interesting to test whether sisRNAs play functional roles in cell division or chromatin organization.

The properties of sisRNAs in Xenopus tropicalis
The biological function of sisRNAs in the Xenopus oocyte nucleus is unclear but of particular interest especially in light of their stability and abundance. The majority of zygotic transcription in the Xenopus embryo begins at the midblastula transition (MBT), after the 12 th cell division. The transcription rate must be extremely high in the oocyte and the transcripts must be very stable to allow a sufficient amount of RNA to be deposited in the cytoplasm such that the RNA:DNA ratio is not significantly depleted as the cells divide before MBT [36,37]. In this case, the sisRNAs could simply be the byproducts of universal stable RNA transcripts [9]. However, we found that sisRNAs are not randomly distributed in introns, but rather in specific regions of the intron, with a unique sequence composition, indicating that these sisRNAs are selected to be stable and abundant and thus are very likely to have a relevant biological function in the Xenopus oocyte and early embryonic development.
Recently, 9000 sisRNAs have been found in the cytoplasm with about half of these confirmed as lariat molecules [10]. These sisRNAs are only derived from a relatively small number of specific introns [10], which further confirmed our observations that sisRNAs are not random sequences but specific regions of introns. Besides Xenopus, numerous circular intronic sequences have been identified in cultured human cell lines (Hela and H9) [19] implying that they are widely expressed in many different species and may have a significant biological role [22].

b) combinations in sisRNAs identified by refSeq genes (red) and by Ensembl genes (blue). c-e Boxplots show the GC% (c), CpG density (d) and CA/TG density (e) of the genome (green) compared to sisRNAs identified by refSeq (red) and Ensembl (blue) genes
The association of sisRNA-producing introns with Stat3 We showed that consensus TFBS of Stat3 is the most enriched motif in the introns from which sisRNAs are generated, and observed that Stat3 is highly expressed in Xenopus tropicalis oocytes [9]. A recent microarray study showed that Stat3 is expressed at stage 2 and peaks at stage 8 and is still detectable as late as stage 33 in both X. tropicalis and X. laevis [38]. It has been widely reported that Stat3 can bind to intronic regions to regulate gene expression. For example, the expression of BCL3 is induced by IL-6 via Stat3 binding to intronic enhancer HS4 [39]. The signaling factor, WNT5A is an evolutionarily conserved target of the Stat3 signaling cascade based on 11-bp-spaced tandem Stat3-binding sites  within intron 4 of human, chimpanzee, cow, mouse and rat WNT5A orthologs [40]. Stat3 binding to the introns of Foxp3, RORα, RORγt, and IL-6Rα have also been reported [41][42][43]. Analysis of~75,000 Stat3 binding sites identified by chromatin immunoprecipitation (ChIP)-seq in a transformed human breast cell line revealed that most Stat3 binding sites are located within introns [44]. In total, these data indicate a role for Stat3 in the regulation of expression of the genes from which sisR-NAs arise. It is also possible that specific TF interactions are involved in the biogenesis of sisRNAs. Both of these seem more likely considering STAT3 is a TF that binds DNA and RNA [45]. A recent study showed that lncRNA directly binds STAT3 in the cytoplasm of human dendritic cells (DC), thereby preventing dephosphorylation of STAT3 by SHP1, and controlling the differentiation of DC [46]. Another study demonstrated an interaction between STAT3 and circular RNAs [47], which may be a type of sisRNA. Taken together, future experiments can be designed to test if sisRNAs interact with STAT3 and if their formation is dependent on STAT3.
The evolutionary origin of sisRNAs DNA methylation at the 5′ position of cytosine (5mC), primarily in CpG context, is observed in nearly every vertebrate examined, including Xenopus tropicalis [48]. 5mC can deaminate to thymine 10-50 times faster than the mutation rate of other nucleotides [49]. Deamination of 5mC caused the high depletion of the CpG dinucleotide in mammalian genomes [50], and as a result, TG (the deamination product of 5mCG) is the most abundant dinucleotide in vertebrates [51]. The unmethylated CGs [52] tend to be clustered together into CG islands (CGI) [53]. As a result, CpG rich regions (CG islands) have a higher GC content, whereas CpG poor regions have a lower GC content. We observed that CpG is depleted in sisRNAs as compared to the X. tropicalis genome, and as expected, TG is enriched. Interestingly, we observed higher GC content in sisRNAs, which suggested that the base composition of sisRNAs is not merely the consequence of deamination of 5mC, there must be other mechanisms playing a role in the evolutionary origin of sisRNAs. A recent study showed that exons of lncRNA loci also have a high GC content due to purifying selection [54], thus it is possible that sisRNAs share some evolutionary properties with lncRNAs. Surprisingly, the CpG density in sisRNAs alone is very similar to introns without sisRNAs, and lower than that of introns with sisRNAs. This suggests that sisRNAs are surrounded by regions with a high CpG level. In other words, although sisRNAs themselves are CpG poor, they are derived from introns with high CpG density, and higher CpG density in the introns may be indicative of producing sisRNAs.
We also assessed the evolutionary conservation of the X. tropicalis sisRNA sequences in six other vertebrate genomes, including zebrafish, chicken, opossum, rat, mouse and human. Our results indicated that sisRNAs are not conserved: sisRNAs are as conserved as other parts of introns, but much less conserved than the exons. Cytoplasmic sisRNAs have been identified recently in X. tropicalis oocytes and these sisRNAs are only derived from a relatively small number of specific introns [10], it is worth noting that these sisRNAs are also not conserved.
Thus far sisRNAs are identified in human, mouse, chicken, zebrafish, Xenopus and Drosophila, implying that sisRNAs might be widely expressed in many other species [21,22]. Considering that sisRNAs are not conserved, if sisRNAs do have some biological functions such as gene regulation, it might be cell type and species-specific.
In conclusion, our results suggest that sisRNAs are not transcribed from random part of introns but specific regions with distinct properties. The sisRNAs are GC rich while CpG poor, and preferentially enriched in the 3′ end of the mRNA transcript. Specific TFBSs involved in gene regulation are enriched in the regions from which sisRNAs arise, suggesting an association with specific proteins, such as Stat3, and further experiments are required to investigate this association. With more and more sisRNA data available in different species, the potential biological functions of sisRNAs would hopefully to be revealed soon.

Dataset generation
The GV and cytoplasmic RNA-seq data were obtained from the Gall lab [9]. sisRNA sequences from human red blood cells, Hela cells, mouse red blood cells, mouse 3 T3 cells, chicken DF1 cells, X. laevis XTC cells were obtained from [21]. All the sequencing data were produced by the Gall lab, and acquisition and culturing of cells are described in detail [21]. Briefly, human HeLa cells, mouse 3 T3 cells, chicken DF1 cells and XTC cells are cell lines cultured in different mediums [21]; human RBCs were purchased from the Interstate Blood Bank (through Zen-Bio, Inc.); mouse tissues were provided by the Bortvin laboratory, Department of Embryology, Carnegie Institution for Science, Baltimore, MD [21]. The dataset of refSeq and Ensembl genes, PhastCons conservation scores, as well as the genome sequences of Xenopus tropicalis were downloaded from the University of California Santa Cruz Genome Bioinformatics website (http://genome.ucsc.edu/) [55]. The reference X. tropicalis genome assembly was xenTro2 (assembly version 4.1). Exons, introns, 5'UTRs, and 3'UTRs for refSeq genes and Ensembl genes were determined using UCSC annotations. UCSC genome browser screen shots were generated using custom tracks of the UCSC web site (https://genome.ucsc.edu/).

Identification of sisRNAs in germinal vesicles
The Model-Based Analysis of ChIP-seq algorithm (MACS) [25] was used for detecting sisRNAs peaks by analyzing germinal vesicle (GV) RNA-seq data [9]. First, we identified all 63,410 peaks of the GV RNA-seq data with default parameters, but a more stringent FDR=0.01 (default is 0.05). We then determined the location of each peak relative to exon and intron annotation for refSeq and Ensembl genes, respectively. A peak was identified as a sisRNA if it was located within an intron and did not overlap with an exon. We set the cut-off length of 150-bps for the sisRNAs to rule out the potential contamination of short non-coding RNAs (e.g. miRNA, siRNA, piRNA). In this way we identified 24, 901 and 34,169 sisRNAs for refSeq and Ensembl annotated genes, respectively.

Calculation of sisRNA density along introns
To calculate the sisRNA density in introns, we first concatenated all the introns for each gene. Then we divided each joined intron transcript into 100 bins. For a bin b with length of L in a certain gene g, we denote b (1:L). For each position i in b let the variable x at position i (x i ) be 1 in case of an overlap with a sisRNA and 0 if it does not overlap. The density of sisRNAs in each bin for a certain joined intron transcript is calculated as:

Calculation of motif enrichment in sisRNAs and introns
To determine the enriched motifs in sisRNAs and introns, we calculated an enrichment score for each motif.
To avoid the bias of sampling from the X. tropicalis genome, we searched the whole genome for each motif. In each scaffold, motifs were searched using MAST in MEME suite [56] with the position weight matrices (PWMs). The PWMs used in this study were collected from the TRANSFAC databases [26] in which 935 PWMs are provided and MAST was run with default parameters. For each motif M with the length L we denote M (y start :y end ) to record the positions where the motif starts and ends: y 1 :y 1 +L-1, y 2 : y 2 +L-1 ... y N :y N +L-1, N being the total number of motifs in the genome. For each position y i : y i +L-1, if it overlapped with the examined regions, sisRNAs or introns, x i =1, otherwise x i =0. For whole sisRNAs or introns, the observed occurrences (OCC obs ) and expected occurrences (OCC exp ) of the motif are calculated as: where N is the total number of motifs in the whole genome, L r is the total number of base pairs in the examined regions (sisRNAs or introns), and L g is the total number of base pairs for the whole X. tropicalis genome. The enrichment score E for motif M was calculated as follows: E ¼ OCC obs OCC exp , where OCC obs is the observed occurrences, and OCC exp is the expected occurrence of motif M in the examined regions (sisRNAs or introns).

Supplementary information
Supplementary information accompanies this paper at https://doi.org/10. 1186/s12864-020-6687-9. CA|TG density (C), and length (D) of the 1st introns (green), middle introns (orange), and last introns (purple) of genes with more than 3 introns, to sisRNAs identified by refSeq (red) and Ensembl (blue) genes. Supplementary Figure 3. The gene ontology (GO) analysis for the genes hosting cytoplasmic sisRNAs across species. Significant enriched GO terms for genes hosting cytoplasmic sisRNAs in human red blood cells (A), human Hela cells (B), mouse red blood cells (C), mouse 3 T3 cells (D), chicken DF1 cells (E), and Xenopus laevis XTC cells (F).