piRNA-IPdb: a PIWI-bound piRNAs database to mining NGS sncRNA data and beyond

Background PIWI-interacting RNAs (piRNAs) are an abundant single-stranded type of small non-coding RNAs (sncRNAs), which initially were discovered in gonadal cells, with a role as defenders of genomic integrity in the germline, acting against the transposable elements. With a regular size range of 21-35 nt, piRNAs are associated with a PIWI-clade of Argonaute family proteins. The most widely accepted mechanisms of biogenesis for piRNAs involve the transcription of longer precursors of RNAs to be processed, by complexes of proteins, to functional size, preferentially accommodating uridine residues at the 5’ end and 3’ methylation to increase the stability of these molecules. piRNAs have also been detected in somatic cells, with diverse potential functions, indicating their high plasticity and pleiotropic activity. Discovered about two decades ago, piRNAs are a large and versatile type of sncRNA and that remain insufficiently identified and analyzed, through next-generation sequencing (NGS), to evaluate their processing, functions, and biogenesis in different cell types and during development. piRNAs’ distinction from other sncRNAs has led to controversial results and interpretation difficulties when using existing databases because of the heterogeneity of the criteria used in making the distinction. Description We present “piRNA-IPdb”, a database based uniquely on datasets obtaining after the defining characteristic of piRNAs: those small RNAs bound to PIWI proteins. We selected and analyzed sequences from piRBase that exclusively cover the binding to PIWI. We pooled a total of 18,821,815 sequences from RNA-seq after immunoprecipitation that included the binding to any of the mouse PIWI proteins (MILI, MIWI, or MIWI2). Conclusions In summary, we present the characteristics and potential use of piRNA-IPdb database for the analysis of bona fide piRNAs.

Background piRNAs, the most recently discovered small non-coding RNAs, are essentially defined by their interaction with PIWI proteins. These molecules are single-stranded small non-coding RNAs with a regular size of about 21-35 nucleotides. They were first discovered in germ cells as defenders of genomic integrity in the germline, acting as post-transcriptional repressors of transposable elements [1]. Acting as guide RNAs for PIWI, piRNAs depend entirely in their bound with Argonaute-clade PIWI proteins to form a piRNA-induced silencing complexes. The processing of piRNAs starts from longer RNA molecules that are shortened during piRNA biogenesis, generating, preferentially, 5' uridine residues and a 2'-O-methylated 3'-end, which increases the stability of these molecules [2]. A secondary piRNA biogenesis pathway has been proposed, in which a piRNA acts as template to recruit an antisensecomplementary sequence to be processed into a new piRNA. Despite their being initially discovered and mostly studied in germ cells, piRNAs have also been detected in somatic cells [3] associated with diverse, not previously identified, potential functions, indicating their high plasticity and pleiotropic activity [2,[4][5][6][7], even though they are far from being fully characterized.
The distinction of piRNAs from other sncRNAs has led to controversial results and interpretation difficulties when existing databases are used because of heterogeneity in the criteria used to characterize piRNAs. Various databases are emerging to facilitate piRNA identification. But such bioinformatic tools could be considered too imprecise in regard to the basic consideration of what a piRNA is, in essence. Consequently, many sequences may be accumulated using different identifying criteria, such as are detected in NONCODE or RNAdb or others, such as piRNABank, piRNAQuest, IsopiRBank, or piRBase, that were initially considered piRNA specific. The problem has been the lack of a gold standard protocol for piRNA identification, so each database follows his own criteria and, in consequence, discrepancies appear among them. It seems, though, that in the claims made for the large number of sequences that populate these databases, a central point is being missed. By definition, piRNAs show an association with PIWI proteins [8]. Diverse characteristics have been associated with piRNA sequences, but the unique, invariable aspect of piRNAs should be their bond to PIWI proteins. In addition, several regions of other sncRNAs (tRNAs, rRNAs, snoRNAs, miRNAs) may share the essential features of piRNAs and act functionally as such [4,9], but would not necessarily have to be considered as artifacts, as has been suggested [10], precluding potential functional mechanisms of regulation.
The aim of the present work is to elaborate and assess a database, piRNA-IPdb, based on the recent update of piRBase [11] and adjusting the piRNA identification criterion exclusively to PIWI-bound detected sequences after immunoprecipitation approaches in the mouse. Among the 21 species represented in the piRBase, the largest number of sets collected in the database corresponds to Mus musculus. Moreover, in vertebrates, mouse is the species with the highest number of piRNA datasets identified by immunoprecipitation with PIWI proteins. Overall, the present piRNAimmunoprecipitation database (piRNA-IPdb) can serve as a useful and trustworthy piRNA sequence resource for future piRNA research.
From the whole piRBase data we filtered sequences for each of selected datasets (all from PIWI immunoprecipitation sequencing assays), with custom scripts (publicly available), obtaining sequences and expression data for each piRNA in each dataset. In parallel, the expression data was transformed from raw counts to count per million (dividing each sequence count by total count of the dataset and multiplying by one million) in order to normalize the diversity of sequencing methods of each dataset. With these data, we selected the most expressed 1 % of piRNA sequences and labeled them as "highly expressed centile" ("hec").
The genome coordinates were obtained by aligning database sequences against the last published genome of the mouse (GRCm39/mm39) using Bowtie aligner, allowing one mismatch (-v1 parameter), and sorting the sequences with more than one possible valid alignment (-m1 option).
Scripts to generate the database are publicly available from project GitHub (https://github.com/OdBT/piRNA-IPdb_v2). Specific scripts were used for each of the analyses carried out, using mainly shell scripts for data analysis and R scripts [12] for data plotting.
We use the widely known NGS toolbox's script (basicanalyses.pl) [13] to obtain the piRNA-IPdb sequence length distribution and the nucleotide composition. The detection of piRNA amplification cycle (ping-pong cycle) (reverse-complementary piRNA sequences overlapping 10 nucleotides) was performed using the signature.py Python script [14]. The presence of miRNAs related to piRNA-IPdb sequences was checked by mapping piRNA sequences against pre-miRNA, downloaded from miR-Base using Bowtie sequence aligner and miRBase as microRNA database.

Database details
For this piRNA-IPdb, a total of 18,821,815 unique sequences were pooled (obtained from 23 individual datasets from piRBase). Some interface functions have been integrated for the different piRNAs registered in the database with associated functions determined on the basis of the RepeatMasker database, including repetitive elements, different transposable elements (LINEs, SINEs, etc.), rRNAs, tRNAs, and, others (Table 1).
In the following sections, we will discuss the results of the exploratory analysis of this database, emphasizing the expected classical characteristics reported for piRNAs.

Sequence length distribution
The distribution of sequence lengths in this piRNA-IPdb showed that 90.23 % of sequences have lengths between 24 and 31 nt (Fig. 1A). These data are consistent with the sequences typically identified in the literature.

Nucleotide composition
The pre-piRNA trimming process generates piRNAs that preferentially (but not unequivocally) start at 5' with a uridine (1U-bias) [15]. We also checked the complete nucleotide composition of the first 15 nucleotides of all sequences. Results indicate that this 1U-bias is also Fig. 1 Database sequence analysis. A, Sequence length distribution of piRNA-IPdb. B, Nucleotide frequency of first 15 nucleotides from all sequences in the database. C, Frequency in the database of overlapping sequences among 10 nucleotides generated by potential ping-pong amplification cycles present in piRNA-IPdb (Fig. 1B). Specifically, 62 % of the sequences start with this base.

Ping-Pong cycle hallmark
A substantial number of piRNAs in the germline participate in a transposable element-mediated amplification process called a "ping-pong" pathway, initially characterized by the presence of piRNAs with the 1U-10A hallmark sequence [16], although this pattern is not unequivocally a consequence of the 1U-bias [17]. Using the signature.py Python script [14], we measured the quantity of antisense overlapping sequence pairs between database sequences. The results (Fig. 1C) confirmed that our piRNA database is also abundant in sequences showing this characteristic, with a high proportion of sequences overlapping in these 10 nucleotides.

Measuring piRNA expression
We extracted information from the reads in each dataset (excluding ds5 and ds134, which did not contain such data). Due to the large variability in total library size for each dataset, reads were normalized using counts per million. The vast majority of piRNA sequences have a very low number of detected reads, although some groups of sequences displayed sequences with very well-detected expression. To highlight this feature, we labelled as FASTA files the most highly expressed 1 % of sequences with the "hec" label (as "highly expressed centile"). The sequence-length distribution of these sequences is shown in Fig. 2A. Compared to the whole database, hec sequence length distribution showed greater length and higher 1U-bias (Fig. 2B).

Mapping piRNAs
The genome alignment of database piRNAs showed that 91.16 % of the sequences map, with only 1 mismatch allowance, to the mm39 genome database. However, 19.09 % of the total sequences map to multiple positions.
The sequence FASTA file is uploaded together with a BED file containing the genome coordinates of each piRNA with a unique aligned position.

Web interface
In order to make the database accessible, we also created a web page available at (https://ipdb2.shinyapps.io/ ipdb2/). The web page hosts the downloadable database files with a summary of the piRNA immunoprecipitation database.

Utility and discussion
Case study: Identifying miRNAs from the database sequences As an exploratory example of interactive studies based on this database, we have evaluated the potential duality in the generation of miRNAs vs. piRNAs from an RNA precursor sequence with the capacity to generate either type of sncRNA, depending on their specific binding to AGO or PIWI proteins. Recent studies have raised doubts about the presence of other sncRNAs, such miRNAs, tRNAs, rRNAs, as contaminants in piRNA databases [10]. However, it is possible to identify cases in which both types of molecule (piRNA and other sncRNAs) could be bifunctional and even suggest alternative regulatory mechanisms to generate one or the other type. In fact, alternative, non-canonical pathways in the biogenesis of miRNAs have been reported [18][19][20]. Here, we evaluate that possibility for miRNAs/ piRNAs from precursors that might generate such a situation.
Using miRBase as a reference, the sequences of the piRNA-IPdb have been mapped in the search for miR-NAs. A total of 42,704 sequences (0.23 % of the total) potentially considered to be piRNA as well as miRNA have been detected, of which 607 showed identical sequences present in both databases, the rest being partially aligned or having just one nucleotide of difference (one mismatch). Some miRNAs have been aligned with up to 2000 different piRNAs ( Table 2). A function has  been integrated into the website database (https:// ipdb2.shinyapps.io/ipdb2/, by which it is possible to directly recognize which is the coverage of piRNAs from a specific miRNA, simply by clicking on the function miRNA. Figure 3 shows some examples of coverage of piRNA sequences against miRNA sequences. This analysis, performed on the three miRNAs detected as most expressed, showed that the alignments of miRNA regions are coincident with the regions where the highest number of piRNA counts are detected over the pri-miRNA sequences (Table 3). This leads us to hypothesize that, since PIWI appears to be unable to bind to double-stranded RNAs, the generation of piRNAs should be previous to the stem-loop configuration of the pre-miRNAs-that is, a kind of by-product in piRNA biogenesis, where their precursors were in excess or the level of PIWI was limiting. Otherwise, regions corresponding to the loop or single-stranded ends of pri-miRNAs would correspond to higher abundances of piRNAs after the processing of such pri-miRNAs.

Conclusions
In this study we present the creation of the piRNA database "piRNA-IPdb", a database with 18,821,815 sequences that are detected after immunoprecipitation with PIWI proteins, we check that these piRNAs meets the regular characteristics of bona fide piRNAs, and we demonstrate the utility of such database for piRNA analysis at the case study.