Genome-wide identification and comprehensive analysis of microRNAs and phased small interfering RNAs in watermelon

Background MicroRNAs (miRNAs) are a class of endogenous small non-coding RNAs involved in the post-transcriptional gene regulation and play a critical role in plant growth, development and stress responses. Watermelon (Citrullus lanatus L.) is one of the important agricultural crops worldwide. However, the watermelon miRNAs and phasiRNAs and their functions are not well explored. Results Here we carried out computational and experimental analysis of miRNAs and phased small interfering RNAs (phasiRNAs) in watermelon by analyzing 14 small RNA profiles from roots, leaves, androecium, petals, and fruits, and one published small RNA profile of mixed tissues. To identify the targets of miRNAs and phasiRNAs, we generated a degradome profile for watermelon leaf which is analyzed using the SeqTar algorithm. We identified 97 conserved pre-miRNAs, of which 58 have not been reported previously and 348 conserved mature miRNAs without precursors. We also found 9 novel pre-miRNAs encoding 18 mature miRNAs. One hundred and one 21 nucleotide (nt) PHAS loci, and two hundred and forty one 24 nt PHAS loci were also identified. We identified 127 conserved targets of the conserved miRNAs and TAS3-derived tasiRNAs by analyzing a degradome profile of watermelon leaf. Conclusions The presented results provide a comprehensive view of small regulatory RNAs and their targets in watermelon. Electronic supplementary material The online version of this article (10.1186/s12864-018-4457-8) contains supplementary material, which is available to authorized users.

Watermelon is one of the 20 most important agricultural crops worldwide [34]. Although watermelon is economically important, only a few studies [34][35][36][37][38] have paid attention on the miRNAs in watermelon, and there are still no studies about phasiRNAs in watermelon.
To systematically investigate the miRNAs and phasiR-NAs in watermelon, we sequenced 14 small RNA profiles from 5 tissues, i.e., roots, leaves, androecium, petals, and fruits. Comprehensive computational analyses of these small RNA profiles lead to identification of 97 pre-miRNAs of which 58 have not been reported, 101 PHAS loci encoding 21 nt phasiRNAs (including a new TAS3 locus), and 241 PHAS loci encoding 24 nt phasiRNAs. A degradome profile of watermelon leaf was generated for identifying targets of miRNAs and phasiRNAs. Totally, 127 conserved targets for conserved miRNAs and TAS3derived tasiRNAs were identified in the analysis. These results significantly enhanced our understanding of small RNA guided gene regulations in watermelon.

Materials and small RNA sequencing profiles
The watermelon plants were grown in the greenhouse of Faculty of Life Science and Technology, Kunming University of Science and Technology (24°51 0 N, 102°5 2 2 E, altitude 1835 m), Yunnan, China. During the experiments, the daily average temperature was 25°C, the daily maximum difference in temperature was 10°C and humidity was 60-80%. Fourteen samples from five tissues, i.e., root, leaf, petal, androecium and pre-mature fruit, of watermelon plants without treatments were collected and frozen using liquid nitrogen immediately (Additional file 1: Supplementary Table S1). The 14 samples were stored at -80°C until RNAs were extracted. Total RNAs were extracted from samples using the Trizol reagent according to the manufacturer's protocol. The integrities of the RNAs were checked using an ultraviolet spectrophotometer (Hoefer, MA, USA), based on the ratio of the optical density at 260 nm to that at 280 nm (OD260/280) and were also assessed by electrophoresis in a denaturing formaldehyde agarose gel, based on visual comparison of the 18S and 28S ribosomal RNAs. The RNA samples with OD260/280 between 1.8 and 2.0 were checked for the total quantities. Only samples with at least 20 μg were chosen for preparation of sRNA sequencing libraries. 20 μg total RNAs dissolved in 35 μl were delivered to the sequencing facility. The small RNAs of the samples were isolated from total RNAs and were sequenced using Illumina HiSeq 2000 sequencer. The obtained small RNA profiles had been deposited to NCBI under GEO the accession number, GSE102030. The qualities of the obtained sRNA profiles were evaluated with the FASTQC program (https:// www.bioinformatics.babraham.ac.uk/projects/fastqc/) (in Additional file 2: Figure S1).

Identification of conserved and novel miRNAs
The computational pipelines for analyzing the obtained small RNA profiles were shown in supplementary materials (Additional file 2: Figure S2).
Our computational methods for analyzing the small RNA libraries was reported previously [39]. In brief, small RNAs were extracted after removing raw reads with more than 5 low scored nucleotides (< 25) and 3' adaptor sequences of the remaining reads (see Additional file 2: Figure S2d). Unique small RNAs were obtained after eliminating redundant sequences. The unique sRNAs were aligned to genome and mRNAs of watermelon to calculate the numbers of reads and unique sequences mapped to the genome and mRNAs. Then unique sequences mapped to known non-coding RNAs (rRNAs, tRNAs, snRNAs, snoRNAs) and repeats were removed from unique RNAs by aligning to databases Rfam (r11) [40], NONCODE (v3.0) [41], Silva [42], The TIGR Plant Repeat Databases [43] and Repbase (r20) [44] using SOAP2 [45]. Remaining reads were mapped to the miRBase database (r21) [46] to calculate the frequencies of conserved miRNAs. The numbers of reads that were aligned to different types of molecules were counted and summarized (see Additional file 1: Table S2).
After this step, the remaining unique reads were aligned to genome to predict novel miRNAs. The surrounding sequences (down-and up-stream) were extended by 300 bp which were used to predict fold-back structure using RNAfold [47]. The structures with at least 18 paired nucleotides and a folding energy of smaller than -40 Kcal/mol were kept as putative precursors for further analysis. The small RNA reads were mapped to the obtained putative precursors. Finally, novel miRNA identification and annotation were strictly based on appearance of miRNA* and predictable fold back structures for the miRNA precursor sequences, as suggested by Meyers et al. [48].
As described previously [39], mature miRNA sequences from all plant species were downloaded from the miR-Base (v21) [46] from which unique miRNA sequences were obtained (see Additional file 2: Figure S2a). Next, these unique miRNA sequences were used as queries against the watermelon genome using BLASTN to predict known miRNA homologs in watermelon that were not represented in small RNA libraries. Hits with no more than two mismatches were identified and the flanking regions (80nt, 130nt, 180nt down stream and upstream) to the mapped mature miRNAs were isolated and used to predict fold-back structures using the RNAfold. The predicted fold-back structures were examined for the presence of miRNA on the same arm of the hairpin as the known family members from other plants. These precursor sequences were further evaluated by MIRcheck and selected candidates that have ≤ 5 mismatches, ≤ 2 bulged or asymmetrically unpaired nucleotides, and ≤ 3 continuous mismatches within the mature miRNA.
We compared the identified conserved miRNAs with those reported previously [34][35][36][37][38]. If the sequences of a mature miRNA was the same as reported earlier, the conserved miRNA was named the same as reported previously. The remaining conserved miRNAs were named by using upper case MIR followed by the family name, and alphabetical letters in lower case if these have not been reported earlier.

Analysis of miRNA expression patterns in different tissues
The raw frequencies of mature miRNAs for samples of different tissues were compared with edgeR [49]. miRNAs with multiple-test corrected P-values of smaller than 0.05 were regarded as significantly deregulated miRNAs in the different tissues. miRNA with abundances of at least 5 RPTM (Reads Per Ten Million sequencing tags) in at least two of the 14 samples and standard deviation of at least 1 were selected to perform hierarchical clustering. The normalized RPTM values plus one were log scaled to calculate correlation coefficient (CC) values between samples. Then the CC values were applied to the pheatmap function in the pheatmap library in R to perform hierarchical clustering. miRNAs with average abundances of at least 5 RPTM in the 14 samples were used to perform Principle Component Analysis (PCA). Log-scaled normalized RPTM values plus one were applied to the prcomp function in the psych library in R to perform PCA.

Identification of TAS3 loci in watermelon
As done previously [50], TAS3 derived tasiRNAs, siRNAs that target the ARF family genes, in Arabidopsis and rice were collected. Then, these tasiRNAs were aligned to the watermelon genome by allowing at most 2 mismatches. The neighboring sequences of the matched loci were cut out to include 250 nt from both 5' and 3' sides. Then we examined the typical miR390 complementary sites around the conserved tasiRNAs by predicting the miR390 sites on the cut-out sequences.

Identification of PHAS loci in watermelon
To predict PHAS loci and phasiRNAs, we used one more small RNA profile of mixed watermelon tissues in our previous work [35] in addition to 14 small RNA profiles generated in the current study. This sRNA profile was downloaded from the from the NCBI GEO database with the accession number GSM936361. This profile and the combined file of the 14 sRNA profiles generated in this study were used to predict PHAS loci and phasiRNAs, respectively.
The PHAS loci and phasiRNA were predicted as described previously [39,51]. The unique sequences in the small RNA libraries were mapped to Repbase (r20) [44] and The TIGR Plant Repeat Databases [43] to remove sRNAs mapped to repeats, then the remaining sRNAs were mapped to the genome of watermelon using SOAP2 [45] (see Additional file 2: Figure S2b). A self-written program was used to scan the genome and cDNA sequences using a window of 210 nt or 240 nt (ten 21 nt or 24 nt) respectively. A two-nucleotide positive offset was used to calculate the positions of siRNAs on the anti-sense strand because the existence of two-nucleotide over-hang at the 3'-end of siRNA duplex [17,24,26,27]. Then a P-value was calculated for each of the windows using a modified version of methods in [26], where n was the number of unique 21 nt (or 24 nt) sRNAs mapped within a window, k was the number of phased unique 21 nt (or 24 nt) sRNAs within the window, and m was the number of phases. In this study, m was set to 10. And a phase score was calculated for each position of the genome and cDNA sequences using the method in [52]. For a window started at a position with more than three phased unique sRNAs, i.e., when k ≥ 3, where P i was the number of phased reads at the ith phase from the position, U i was the number of non-phased reads at the ith phase from the position, and m was the number of phases in the window, and k was the number of unique phased siRNAs in the window. m was 10 in this study.
The window with a P-value less than 0.05 was extended 100 bp at both 5'-and 3'-ends, then the overlapped windows were merged. The P-values of the merged windows were used to calculate the false positive rates using the method in [53]. The merged windows with a maximal phase scores of larger than pre-determined threshold and multiple test corrected P-values of smaller than 0.05 were reported as PHAS loci. The predicted PHAS loci were named with its chromosome and a unique serial number for each chromosome. "P21" and "P24" were added at the beginning of the predicted PHAS loci encoding 21 and 24 nt phasiRNAs. The neighboring PHAS loci were predicted as PHAS clusters if the distances between individual PHAS loci were smaller than 2000 base pairs. The phased siRNAs of the predicted PHAS loci were reported as phasiRNAs. The phasiRNAs of a PHAS loci were named by adding siR and a serial number to the name of the PHAS loci.

Degradome sequencing
A leaf sample of a watermelon plant grown in greenhouse at Kunming University of Science and Technology, Kunming, Yunnan, China, was frozen in liquid nitrogen immediately after harvesting. The total RNA from the leaf of a watermelon plant was extracted using the Trizol reagent according to the manufacturer's protocol. The integrity of the RNA was checked with an ultraviolet spectrophotometry and 2100 BioAnalyzer. The degradome of polyadenylated transcripts was sequenced using Illumina HiSeq 4,000 sequencer. The degradome sequencing profile had been deposited into the NCBI GEO database under the accession number, GSE102030.
The quality of the obtained degradome profile was evaluated with the FASTQC program. The obtained degradome profile was processed with a self-developed pipeline (see Additional file 2: Figure S2c). Firstly, the raw degradome sequencing profile was filtered to remove low quality reads that have low scored nucleotides (< 20) in the first 20 nucleotides. Next, the first 20 nucleotides in remaining reads were used in later analysis. The unique sequences were obtained and the frequencies of the unique sequences were calculated. Then, similar to sRNA profiles, unique sequences were aligned to genome and mRNAs of watermelon, Rfam (r11) [40], NONCODE (v3.0) [41], Silva [42], The TIGR Plant Repeat Databases [43] and Repbase (r20) [44] using SOAP2 [45] to calculate the numbers of reads and unique sequences mapped to different categories of molecules as well as repeats. The SOAP2 alignment result of mapping degradome reads to watermelon mRNAs was used to predict miRNA and siRNA targets, as detailed below.

Identification of miRNA/siRNA targets in watermelon
The targets of miRNAs and tasiRNAs were predicted with the SeqTar algorithm [54]. For conserved miRNAs and tasiRNAs, the targets that had less than or equal to four mismatches were used for further analysis. For novel miR-NAs, only targets with at least 1 valid read and less than 4 mismatches were used.
The targets of phasiRNAs were also predicted with the SeqTar algorithm. Only targets that have no mismatches or at least one valid degradome read and less than or equal to 3 mismatches were used for further analysis.

Identifying miRNA complementary sites on PHAS loci
The SeqTar algorithm was used to predict miRNA complementary sites on the original transcripts of PHAS loci. For conserved miRNAs, the targets that have at least one valid read, i.e., read started at the 9th to 11th positions of a miRNA binding site (as defined in [54]), or targets that have less than 4 mismatches were used for further analysis.

High-throughput sequencing profiles of watermelon Small RNAs
We generated 14 small RNA libraries from the pooled RNA isolated from five different tissues (root, leaf, petal, androecium and premature fruit). Plants were grown under greenhouse conditions without special treatments. The small RNAs of these 14 samples were sequenced using Illumina HiSeq 2000 sequencer. Initially the 3' adapter from the raw reads was trimmed and small RNAs ranging between 18-30 nucleotides were extracted. Upon sequencing of these 14 small RNA libraries, we obtained approximately 20 million reads for each of the libraries. Totally, we obtained 280,698,505 total reads represented by 60,516,813 unique small RNAs (Additional file 1: Table S1). After examining the scores per nucleotides with FASTQC, the qualities of the obtained sRNA sequencing profiles were generally good with scores above 30 for the first 25 nt (see Additional file 2: Figure S1), which include most small RNAs with 21 to 24 nt. The small RNA sequences were aligned to mRNAs of watermelon, precursors of miRNAs in the miRBase (v21), other non-coding RNAs besides miRNAs, repeat elements and genome of watermelon with SOAP2 (see details in Figure  S2 and in Materials and methods). Most reads could be aligned to the genome of watermelon (see Additional file 1: Table S2), suggesting high qualities of the obtained sRNA profiles. The total reads of 21 nt and 24 nt are over-represented while the unique reads of 23 nt and 24 nt are over-represented in all four libraries (Additional file 2: Figure S3). The 21 nt reads and unique sequences represents the only peak for those sequences mapped to the miRNA precursors (Additional file 2: Figure S3), consistent with the lengths of mature plant miRNAs.

Identification of conserved miRNAs and their expression patterns in different tissues
By aligning mature miRNAs from other species to watermelon genome (see details in Figure S2 and in Materials and methods), we identified 97 pre-miRNAs of which 58 have not been reported previously ( Fig. 1a and Additional file 1: Table S3). We also found 348 mature miRNAs without precursor by aligning the sRNA profiles to the mature miRNAs in the miRBase (v21) (Additional file 1: Table S3). When compared to two model species, Arabidopsis thaliana and rice, we found 80 members that belong to 20 miRNA families that are highly conserved (Additional file 1: Table S4).
Because we have sequenced small RNAs from five different tissues of watermelon, it is feasible to assess the differences in miRNA expression patterns in root, leaf, petal, androecium and premature fruit. We performed Principle Component Analysis and Hierarchical Clustering based on the normalized frequencies of mature miRNAs (Additional file 1: Table S5). The samples from the same tissues were clustered together and samples from different tissues were clearly differentiated (Fig. 1b). We also found the strong correlation among the samples of the same tissues, meanwhile samples from different tissues have much lower correlation coefficient values (Fig. 1c). These results suggest that miRNAs in different tissues have important functions and perhaps involve in different physiological processes in different tissues.
We also compared the abundances of conserved miR-NAs for samples from root, leaf, petal, and pre-mature fruit with edgeR [49] to identify deregulated miRNAs in different tissues. This analysis has identified 87 to 153 deregulated miRNAs between the compared tissues (Additional file 1: Table S6). For examples, MIR156i_5p has higher expression levels in leaves than in the roots, petals, and fruits; MIR319g has higher expression level in petals than in leaves; MIR167f-5p has higher expression level in petals than in roots (Additional file 1: Table S7).
We also identified 9 putative novel miRNAs in watermelon (Additional file 1: Table S8). Although miRNA and miRNA* were identified for these miRNAs, their expression levels are generally very low when being compared to conserved miRNAs, suggesting that these novel miRNAs may be non-functional or not functionally important.

Identifying miRNA targets in watermelon
To identify the miRNA targets, we generated a degradome profile with over 23 million reads, of which most could be aligned to watermelon genome (Additional file 1: Table S9). The nucleotide scores of the obtained degradome were of high quality since most nucleotides had scores above 35 (Additional file 2: Figure S4). We identified targets of conserved miRNA by analyzing the Fig. 1 The conserved miRNAs in watermelon and their expression patterns in different tissues. a The number of pre-miRNAs identified in this study and that reported in literature. b The PCA analysis of miRNA expression profiles in different tissues. c The hierarchical clustering of miRNA expession profiles in different tissues obtained degradome profile of watermelon leaf using the SeqTar algorithm [54]. This analysis revealed 127 conserved targets for conserved miRNA families and TAS3 generated tasiARFs (Additional file 2: Tables S10 and S11).
Some of the conserved miRNA targets are shown in Fig. 2. The miR160 family targets 5 ARF genes (in Additional file 1: Table S10), of which two are shown in Fig. 2a and 2b. The miR166 family targets five HD-Zip transcription factors (in Additional file 1: Table S10), of  Fig. 2c and 2d. The miR171 family targets eight SCL transcription factors (in Additional file 1: Table S10), of which two are shown in Fig. 2e and 2f. The miR393 family targets 8 F-box genes (in Additional file 1: Table S10), of which two are shown in Fig. 2g and 2h.
The miR396 family targets 8 GRF transcriptional factors (in Additional file 1: Table S10), of which one is shown in Fig. 2i. Several other conserved miRNAs also cleaved their targets with significant number of valid degradome reads (see Additional file 2: Figure S5).  In additional to those 127 conserved targets (in Additional file 1: Table S10), conserved miRNA may regulate non-conserved targets as well (Additional file 1: Table  S12). Only 48 putative targets were identified for novel miRNAs and most of these targets have no or very limited valid reads at the identified miRNA complementary sites (Additional file 1: Table S13), again suggesting these novel miRNAs might be non-functional.

Identification of TAS3 loci in watermelon
We previously found one TAS3 in watermelon by analyzing ESTs (Expressed Sequencing Tags) of watermelon [35]. In this study, the previously identified TAS3 loci was found on chromosome 6 of watermelon, named as TAS3a (Fig. 3a). We found an additional TAS3 locus in this study, named as TAS3b, on chromosome 2 of watermelon in this study (Fig. 3b). Both of these two loci have typical miR390 sites beside the tasiRNA generation region (Fig. 3b to c). It is interesting that both the 5' and the 3' miR390 site on TAS3b could induce cleavage of TAS3b transcripts (Fig. 3d), and the 5' miR390 site is the major site that induces more cleavages on TAS3b transcripts. The 10 to 11 positions of 5' miR390 sites are complementary on TAS3b, but not matched on TAS3a (Fig. 3b), which might be the reason that the 5' miR390 site on TAS3b is cleavable, but non-cleavable on TAS3a. TAS3b shows a clear phased siRNA generation patterns in one of the sRNA profiles used (Fig. 3e). TAS3a and TAS3b encode two and one tasiRNAs, respectively ( Fig. 3a and f). These three tasiRNAs target 4 ARF genes (Fig. 3g). Two ARF genes were cleaved by these TAS3-derived tasiRNAs at two different sites as revealed by the degradome analysis ( Fig. 3h  and i).
Typically, the 3' miR390 is the only or major site that could induce cleavage on TAS3 transcripts [20]. But Xia et al. [55] noticed that many spruce TAS3 genes include a cleavable 5' miR390 site. Thus, our results indicate that TAS3 loci of other species may also have cleavable 5' miR390 sites in addition to spruce.

PHAS loci in watermelon
The combined sRNA profiles of 14 samples in this study and one additional sRNA profile downloaded from NCBI GEO database were aligned to watermelon genome using SOAP2 [45], respectively. Then, we used a self-written program to predict PHAS loci and phasiRNAs from the alignment results of SOAP2 based on methods described earlier (see Materials and methods for details). We predicted one hundred and one 21nt PHAS loci by using a phase score threshold of 5 and a multiple-test corrected P < 0.05 (Additional file 1: Table S14). These PHAS loci produced at least four unique twenty-one nt phasiR-NAs that were sequenced in the sRNA libraries used in study. We aligned the obtained PHAS sequences to NCBI Nucleotide Collection (nr/nt) database, the coding gene of watermelon and the TIGR Plant Repeat database to obtain putative annotation of the predicted PHAS loci (details are given in Additional file 1: Table S14). These PHAS loci were mainly originated from protein coding genes (Fig. 4a). The second largest category is repeats (as shown in Fig. 4a). One of the predicted 21 nt PHAS loci is a TAS3 locus.
Two hundred and forty one 24 nt PHAS were predicted by using a phase score threshold of 5 and a multiple-test corrected P < 0.05 (Additional file 1: Table S15). The 24 nt PHAS loci are different from the 21 nt PHAS loci in watermelon, with the largest category belonging to unknown category (Fig. 4b). The second largest category of 24 nt PHAS loci is protein coding genes (Fig. 4b).
In several previous studies, we found the some PHAS loci in other species may generate both 21 and 24 nt phasiRNAs [39,56]. In this study, we found that three PHAS loci may generate both 21 and 24 nt phasiRNAs (Fig. 6). Only the first locus shown in Fig. 6a was annotated as an unknown gene, the other two loci in Figs. 6b and c were intergenic or un-annotated genes.

Putative phasiRNAs targets in watermelon
The 101 and 241 PHAS loci encoding 851 and 1227 twenty one and twenty four nt phasiRNAs, respectively (in Additional file 1: Table S16 and S17, respectively).
We found that in addition to some cis-targets, where phasiRNAs target their own transcripts, some phasiRNAs may target other genes in trans (see Additional file 1: Table S18 and S19 for 21 and 24 nt phasiRNAs, respectively). For examples (see Additional file 2: Figure S6), P21_Chr10_2_siR6 targets Cla021422, a BHLH transcription factor; P21_Chr1_12_siR7 target Cla005420, a photosystem II polypeptide; P21_Chr6_4_siR13 targets Cla006161, a Hydroxycinnamoyl transferase and P21_Chr5_3_siR3 targets Cla018830, a NADPHcytochrome P450 reductase. These phasiRNAs induce significant cleavages on their corresponding target transcripts, suggesting that they might have trans-targets.

Conclusion
Watermelon is an important plants for economical and agricultural reason. In depth miRNA and PhasiRNA studies have been lacking for this important species. We systematically sequenced 14 small RNA profiles for five different tissues of watermelon, and identified 97 pre-miRNAs, 101 PHAS loci encoding 21 nt phasiRNAs, and 241 PHAS loci encoding 24 nt phasiRNAs. The expression patterns of miRNAs vary in different tissues and show to be very similar for samples of the same tissues. Samples from the same tissues could be corrected clustered based on the expression profiles of miRNAs, suggesting the important functions of miRNAs in different tissues. We also identified one additional TAS3 locus. One hundred and twenty seven targets of conserved miRNAs and TAS3 derived tasiRNAs were identified by analyzing a degradome profile. These results significantly improved our knowledge of sRNA guided gene regulations in watermelon.