Identification of plant lncRNAs with new lncNAT species from strand-specific RiboMinus transcriptome sequencing data
To systematically study the molecular features and evolutionary profiles of lncRNAs in plants, we selected eight representative plant species across the phylogenetic landscape for analysis, including Chlamydomonas reinhardtii (C. reinhardtii), Selaginella moellendorffii (S. moellendorffii), Zea mays (maize), Oryza sativa subsp. Japonica (rice), Arabidopsis lyrata (A. lyrata), Arabidopsis thaliana (A. thaliana), Populus trichocarpa (poplar) and Solanum lycopersicum (tomato) (Fig. 1). These species can represent single-cell alga, ferns, monocotyledon and dicotyledon, spanned an evolutionary history of 116 million years. Whole-genome sequences and annotations of these species were available from public databases (Additional file 1: Table 1).
Different from previous studies, we acquired only RiboMinus and strand-specific RNA-seq (ssRNA-seq) data for multiple plant tissues, either collected from public database or sequenced by ourselves when they are not available (Additional file 2). This critical approach was designed to include lncRNA species either with poly(A) or without poly(A), which can also identify the transcriptional orientation of lncRNAs, and their location on sense- or antisense- strand of coding gene regions. A consensus and efficient lncRNA identification workflow were constructed based on our previous study [25]. To reduce the variation emerging from the tissue difference among species, we selected data of five tissues: root, stem, leaf, flower, and seed/fruit. For lower-level plant species without differentiated tissues, such as C. reinhardtii and S. moellendorffii, we used all available data. Considering the batch effects coming from PCR artifacts, sequence depth and gene expressional abundance, etc. among different data sources, we eliminated PCR artifacts and low-quality sequencing data by pre-processing and mapping high-quality reads to genome in our procedure. We required the mapped data size for each tissue to be above 10X genomic coverage depth in order to retain transcripts with low-level expression (“Additional file 2” listed the data size of each sample for all tissues of the eight species). We normalized the expression of transcripts using FPKM, which reduced or eliminated the discrepancy from sequencing depth and transcript length. At last, we removed those low-expressional transcripts (FPKM 0.5 for single-exon transcripts and 0.1 for multiple-exon transcripts) to resolve the transcripts with low expressional abundance. For species with annotations including lncRNAs, we added them in our result if those lncRNAs satisfied our criterion.
A total of 39,945 lncRNAs were obtained in all the eight species, for which the transcriptional orientation of each lncRNA was also confirmed. LncRNAs identified in this study were categorized into two types: lncRNA in intergenic region (lincRNA: 25,350) and lncRNA antisense to one or more exons of coding genes (lncNAT: 14,595) (Fig. 1, Additional file 1: Table 2, Additional file 3). Both of lincRNAs and lncNATs dispersed widely in all of the plant clades from 777, the least number in C. reinhardtii to 8321, the most in maize. By comparing lncRNAs identified in this study with lncRNAs from the databases of PLncDB, CANTATAdb, GreeNC, and NONCODE, a total of 22,313 new lncRNAs (12,838 lincRNAs and 9475 lncNATs) were found, and their distribution in each species were listed in “Additional file 1: Table 2”.
Previous studies on plant lncRNAs often missed out lncNATs due to lack of orientation in their RNA sequencing data (thus could not made accurate distinction on lncNATs from coding gene transcripts). Here we searched lncNATs using strand specific RNA-seq data, and found that lncNATs dispersed widely in plant (Fig. 1). LncNATs took a large part in plant lncRNAs, and even more than lincRNA in several species, such as A. thaliana and C. reinhardtii (Fig. 1). There are about 20% coding genes with lncNATs transcribing in A. thaliana. Coding genes can be expressed with their counterpart lncNATs simultaneously (Additional file 4: Fig. S1A). Compared with coding genes without lncNATs, their expressional levels were not significantly affected by the presence of lncNATs (Additional file 4: Fig. S1B). The lack of correlation between the expression of coding genes and their paired lncNATs implied the lncNATs may not be directly involved in the regulation of coding genes’ expression.
The consistent molecular features of lncRNAs among divergent plant species
The characteristics of lncRNAs, such as transcript length, AT content, exon number, and so on, were explored and compared to coding genes in individual plant species [11, 26, 27]. However, if these features were consistent in plants were not clear. In addition to the most important differences between lncRNA and mRNA that was coding products, other features remained to be investigated between lncRNA and mRNA in plants.
In current study, we revealed comprehensive characteristics of lncRNAs in multiple species using unified sequencing data and bioinformatic analysis procedure. We found both lncNATs and lincRNAs have more simple structures than mRNAs. LncRNAs were shorter than mRNAs (median length of lncNAT: 964, lincRNA: 817, mRNA: 1422, both p-value < 2.2e-16, t-test) (Fig. 2A), and lncRNAs have fewer exons compared to mRNAs (Mean of lncNAT: 1.4, lincRNA: 1.7, mRNA: 5.2, both p-value < 2.2e-16, t-test) (Fig. 2B). These two features were conserved in plant lncRNAs (Fig. 2A, B). LncRNAs existed in plant genomes mainly as single-exon transcripts. A total of 65% lincRNAs and 82% lncNATs were single-exon, while in mRNAs this number was 22%. This phenomenon was observed in all plant species of this study, and was in consistent with results of other plant species in previous studies [27].
Fewer multiple-exon transcripts meant few requirements for splicing by lncRNA. In a previous study, researchers found lncRNAs were rarely spliced and mainly non-polyadenylated [28]. The splicing efficiency in human and mouse showed that inefficient splicing might be a common feature of lncRNAs across species [29]. The ratio of splicing lncRNAs was significantly lower than those for mRNAs (Wilcoxon test, both p-value = 7.6e-06, Additional file 4: Fig. S2A) in this study. Splicing lincRNAs ranged from 13.7% (poplar) to 48.8% (S. moellendorffii), compared to the ratio of mRNAs from 65.3% (rice) to 92.6% (C. reinhardtii) (Fig. 2C). Inefficient splicing of lncRNAs was not just a feature in animals but is also common across plant species. Next, we wanted to investigate if the splicing sites in lncRNAs were same with mRNAs. Further analysis of the base distribution upstream and downstream of splicing sites showed that sequences of splice sites in lncRNAs were conserved across plant species. They have almost the same sequence context between lncRNAs and mRNAs, which suggested that they probably used the same splicing mechanism (Additional file 4: Fig. S2B).
Both of lincRNAs and lncNATs have conserved nucleotide constitution among multiple species. LncRNAs have lower GC content than coding sequences (CDS) of coding genes, while which was higher than intergenic sequences (Fig. 2D). This feature was also consistent across multiple species in plant. High GC content was usually associated with coding sequences [30]. Though lncNATs and coding genes shared overlapping sequences, the GC content of lncNATs was still lower than CDS in most plant species (Fig. 2D). The bias of nucleotide in lncRNA meant the selective pressure on their sequences. At the same time, mutants accumulated in both lincRNAs and lncNATs were much higher than coding genes (Fig. 2E, Additional file 1: Table 3). Mutants accumulated in coding regions with the possibility to change amino acid and lead to the change of protein products. While the absence of coding products for lncRNAs made them tolerate more mutants. Although lncNATs shared sequences with coding genes, their sequences accumulated more mutants and fewer GC bases, implied they may experience different paths to coding genes in evolution.
New discovery of lncRNA origination by inserting of transposable elements (TEs)
TEs were widely existed in eukaryotic genomes and acted as key factors for gene and genome evolution [31, 32]. Insertion of TEs was considered as one of many ways for origin of lncRNA [15, 33]. To study the roles of TE for lncRNAs, we first established a reference library of TEs for each genome using RepeatMasker [34]. As a result, up to 59.7% lincRNAs can be found contain TEs in their sequences (Fig. 3A). Besides, we found a large proportion of TEs inserted into the genomic loci of lncRNAs and coding genes (lincRNA: 3.9% ~ 59.7%, lncNAT: 2.4% ~ 32%, coding gene: 3.1% ~ 43.9%). But, when we limited the inserting regions to exons of lncRNAs and CDS of coding genes, the ratio of transcripts overlapping with TEs decreased sharply in the latter (1.8% ~ 13.1%), while in lncRNAs the ratio remained (lincRNA: 3.1% ~ 58.2%, lncNAT: 0.5% ~ 31.8%) (Fig. 3A). Significantly, sequences of TEs were remained in lncNATs though they shared sequences with coding genes. Decrease of TEs in protein coding sequences indicated that TEs inserting into CDS were eliminated in evolution since they may change gene products, and resulted in serious consequences. In coding genes, TE insertion was concentrated in regions with limited effects on gene products, such as UTR, intron. On the contrary, lncRNAs were not influenced by insertion of TEs in their exons for lack of coding products, so these TE sequences can be kept and accompanied by lncRNAs in evolution. This suggested that lncRNAs may originate from TE insertion. The close relationship between lncRNA and TEs revealed the key roles of TEs played in the origin of lncRNA.
TEs were classified into two types: Class I TEs or retrotransposons, and Class II TEs or DNA transposons. They are both present in the plant genome. However, we found that Class II TEs favored transcribed regions, including lncNAT and lincRNA, whereas Class I TEs favored regions of non-transcribed sequences (Fig. 3B). The ratio between Classes I and II TEs deviating from that the entire genome indicated that the TEs, Classes I and II, in lncRNAs and coding sequences were under different selection pressure.
TEs play important roles on the origination of lncRNAs. Benefitting from the strand-specific sequencing data, we can confirm the positions of promoters. We found about 38 ± 21% lincRNAs had promoters originating from TEs, which was higher than the percentage of coding genes (29 ± 20%) and lncNATs (27 ± 21%) (Fig. 3C). The ratio of lncRNAs with promoters originating from TEs was positively related to whole genomic TE content in each species (Pearson’s correlation, lincRNA: 0.63, lncNAT: 0.83). This suggested that TEs randomly inserted into genome was one of the major ways to generate lncRNAs.
Further, we found TEs initiated the transcription of lncRNAs by supplying TF binding sequences for lncRNAs (Fig. 3D). Combining with ChIP-seq data from nine transcript factors (TFs) in A. thaliana, A. lyrata, rice and maize, we identified 250 lncRNAs, which contained TF binding sites originating from TEs, and these lncRNAs accounted for 11.7% of lncRNAs with TF biding sites being confirmed (Additional file 1: Table 4, Additional file 5). As an example, we found a lincRNA Osat_00007032, which was upstream of coding gene OS01G0166800 in rice (subsp. Japonica), and expressed in anther, pistil, leaf, root, stem and seed, but especially high in pistil (Additional file 4: Fig. S3). A TE belonging to MERMITEJ subfamily covered the upstream of Osat_00007032, and provided promoter for Osat_00007032. In the promoter, binding sequences for transcript factor MADS29 was confirmed (Fig. 3D). MADS29 belongs to MADS-box transcription factors, which were critical regulators for rice reproductive development [35, 36]. Surprisingly, we cannot detect lncRNAs or TEs in the syntenic region of genome for Indica, which was another subspecies of rice. This indicated that lincRNA Osat_00007032 was specific for Japonica, and was evolved recently by TE insertion. This novel discovery that TE brought promoters containing TF binding sites to the upstream exonic locus, revealed the relationship between TE and lncRNA different from that previously found in tomato lncRNA, for which a TE inserted into the downstream of a promoter [16].
Most plant lincRNAs only conserved within genus indicating rapid turnover in evolution
Sequence conservation provides insight into evolutionary process of functional elements in genome. However, most lncRNAs in plants were found with poor conservation, including the lncRNAs whose functions were well studied [37, 38]. In this study, we attempted to explore the evolution of lncRNAs in plants by analyzing the conservation of lncRNAs with two methods. First, we evaluated the sequence conservation using PhastCons score [39]. PhastCons score was calculated based on the result of seven-way whole genomic alignment for A. thaliana, A. lyrata, soybean, rice, poplar, tomato and S. moellendorffii. We compared the sequence conservation of lincRNAs, lncNATs, CDS of coding genes, 5’UTR and 3’UTR of coding genes, and used sequences from intergenic regions as control. As expected, CDS exhibited the highest sequence conservation, while the intergenic sequence was the lowest (Fig. 4A). Considering that lncNATs and coding genes shared some common sequences, they showed similar sequence conservation (Fig. 4A, Additional file 4: Fig. S4A). The conservation of lincRNAs, however, was lower than 5’UTR, 3’UTR and introns of coding genes (Fig. 4A). The proportion of plant lincRNAs with PhastCons scores higher than 0.6 was only 0.2%, and this number in coding genes was 14.4%. About 71% plant lincRNAs scored 0, indicating very low conservation (Additional file 4: Fig. S4B). In placental mammals, old (minimum age 90 Myr) and young lncRNAs (minimum age 25 Myr) were defined basing on the phylogenetic distribution of species, and they found the conservation of old lncRNAs were close to CDS, while young lncRNAs were even lower than intergenic regions [20]. In this study, we found plant lincRNAs showed similar conservation to intergenic sequences, and which suggested that most plant lincRNAs were young and with short evolutionary age (Fig. 4A, Additional file 4: Fig. S4A).
Further, we evaluated the conservation of plant lincRNAs by identifying the homologous lncRNAs across the eight plant species. We mapped lincRNAs in one species to genomes of all other species. LncNATs were excluded since the existence of overlapped regions shared with coding genes. While strong conversation of coding genes existed between evolutionarily remote species in the plant kingdom, the conservation of lincRNAs retained only in the evolutionarily close groups (Fig. 4B, C). Previous studies of lncRNAs showed that the lncRNA homology mainly existed within a family in both plants and animals [20, 21]. However, we found conserved lincRNAs were rare in plant family, as only 5% lincRNAs from rice were homologous with those in maize (Fig. 4C). In genus such as Arabidopsis, half of lincRNAs were found having homologs in other species within the same genus (Fig. 4C). Hence, we found that most of the homologous lincRNAs were genus- or species-specific in plants, suggesting that lincRNAs have a rapid turnover in evolution.
IPS was found to be the only plant lncRNA group with conserved sequence motifs
Some lincRNAs were found to be functional via short-conserved patches, though the whole sequences were divergent [40, 41]. To know if conserved patches existed in plant lncRNAs, we developed a method using sliding windows to calculate the mean PhastCons score for each patch. We found most lincRNAs have no apparently conserved patches in the eight plant species (Fig. 4D). We then looked into the plant lncRNAs in lncrnadb [42], and found IPS was the only lncRNA with conserved patches.
IPS lncRNA family participated in phosphorus (Pi) homeostasis in plants, including AtIPS1 and AtIPS2/At4 in Arabidopsis, OsIPS1 and OsIPS2 in rice, and TPSI1 in tomato [43,44,45]. They were found to be accumulated under Pi starved condition. IPS contained conserved 24-nt nucleotides motif in several species [43, 45]. The motif contained two highly conserved regions which were separated by three nucleotides, and played roles as the target mimicry for miR399 [46]. Strikingly, our study found all species contained this motif except algae (Fig. 4E). The absence of IPS in algae may be explained by their lack of vascular tissues, whereas IPS was expressed in vascular tissues when lacking of Pi. IPS first appeared in S. moellendorffii, a representative of primitive vascular plant (Fig. 4E, F). But only one of the two IPS motifs was found in S. moellendorffii, whereas two motifs were often found in high plants (Fig. 4E, F). Different from most lncRNAs with short evolutionary histories, IPS underwent a long evolutionary history with conserved function in Pi homeostasis, suggesting plant lncRNAs may play important roles in the adaptation of terrestrial life during migration from aquatic to terrestrial.
Specific expression of lncRNAs revealing sophisticated transcription regulation
Previously several lncRNA studies showed that most lncRNAs were expressed at low levels, and often expressed in specific tissues or conditions in plants [11, 26]. In this study, we attempted to profile the expression of lincRNAs and lncNATs in more species. Both lincRNA and lncNAT showed lower expression levels and higher tissue specificity compared to coding genes (Fig. 5A, B). LincRNA and lncNAT showed similar expressional levels (average: 35.2 vs 22.9), while the expressional level of mRNA was significantly higher (average: 124.8, Kolmogorov-Smirnov test, both p-value < 2.2e-16) (Fig. 5A). Similar trends were observed in all plant species (Additional file 4: Fig. S5A). Low expressional level was a common feature for plant lncRNAs.
Tissue specificity was evaluated using Jensen-Shannon (JS) score in all of the plant species, which returned a score between 0 and 1 [47]. JS score of 1 represented the transcript expressed uniquely in one tissue, while JS score of 0 meant the transcript expressed in all tissues. Both lincRNA and lncNAT exhibited significantly higher JS score, i.e., higher tissue specificity, than mRNA (Kolmogorov-Smirnov test, both p-value < 2.2e-16) (Fig. 5B, Additional file 4: Fig. S5B).
The expressional patterns of lncRNAs evolved rapidly in plant, which formed sharp contrast to the coding genes. We defined tissue-specific (TS) transcripts as transcript with JS score > = 0.9. We found TS lncRNAs and TS mRNAs were distributed to tissues differently in the same species (Fig. 5C). Also, among the different plants, TS lncRNAs of tomato and maize were more likely to express in leaf and root, while TS lncRNAs were found mostly in the pistil of rice, and the seed of A. lyrata.
Previous studies in rice and maize showed that lncRNAs were more specially expressed in reproductive tissues [11, 26]. By comparing the expression of all transcripts, we found mRNAs were more evenly expressed in all tissues in A. thaliana, while lncRNAs were significantly different among tissues (Fig. 5D). In A. thaliana, lncNATs were preferentially expressed in embryo, while lincRNAs were in both endosperm and embryo. LncRNAs had a narrow expression profile compared to coding genes.
LncRNAs showed similar TF binding rate compared to that of coding genes, though they had low expressional levels and unstable expression patterns. We used ChIP-seq data of nine transcript factors (TFs) from A. thaliana, A. lyrata, rice and maize to predict the binding sites in promoter (Additional file 5). The frequency of lncRNA promoters binding with TFs was comparable to coding genes (Fig. 5E). TFs with low binding rate in coding genes also had low binding efficiency in lncRNA, e.g., LFYGR, NLP7, P1, a-myc, HASF1b.
Plant lncRNAs forming co-expression network with coding genes
Expression is often closely related to function, especially highly and specifically expressed lncRNAs [48, 49]. We selected lncRNAs that were highly expressed in one or two tissues in A. thaliana to construct WGCNA network. A total of 178 lincRNAs, 555 lncNATs and 20,729 coding genes were selected to construct the network. In this network, lncRNAs and coding genes were clustered into 24 modules according to their expressional characteristics (Fig. 6A). In these modules, a total of 689 lncRNAs (155 lincRNAs and 534 lncNATs) were found to be co-expressed with coding genes. Especially, lncNATs in the network were not directly related to their antisense coding genes, which referred they may not regulate the transcription of antisense genes. GO enrichment analysis was performed for the coding genes in each module. Functions of the coding genes were summarized as follows: I) Basic metabolism: kinase activity, lipid metabolic process, and hydrolase activity; II) Response to stress and response to abiotic stimulus; III) Reproduction (Fig. 6C). LncRNAs involved in the network may play roles in these functions.
The correlation between modules and samples showed that the expressional patterns of lncRNAs were highly correlated to specific tissues (Fig. 6A). In our result, lncRNAs of A. thaliana were highly and specially expressed in embryo and endosperm (Fig. 5D). In WGCNA network, the embryo was strongly correlated with module 01, and the endosperm was strongly correlated to module 22 (Fig. 6A, B). In module 01, we found a myriad of coding genes associating with embryo development. Functions of these genes can be categorized as: I) Involving in embryo development directly; II) Overexpression, mutation or knocking down caused embryo death; III) Enrichment of protein products during embryonic development. Co-expressional relationship was found between these coding genes and lncRNA, and lncRNAs in this network may play roles in embryo development. In module 22, two radial networks were constituted of four lncNATs and seven lincRNAs with coding genes (Fig. 6B). In these coding genes, we found four genes encoding transcript factors played roles in the endosperm development. As an example, gene AT1G02580 encoded transcript factor MEDEA/MEA, a polycomb group gene that was imprinted in the endosperm [50]. All of the 11 lncRNAs in the network were found co-expressed with it, which indicating that these lncRNAs may play roles in endosperm imprint. Coincidently, researchers found some lncRNAs expressing in castor bean seeds were involved in genomic imprint, and several of them comprised the imprinted cluster with imprinted coding genes [27].