Maize transposable elements contribute to long non-coding RNAs that are regulatory hubs for abiotic stress response

Background Several studies have mined short-read RNA sequencing datasets to identify long non-coding RNAs (lncRNAs), and others have focused on the function of individual lncRNAs in abiotic stress response. However, our understanding of the complement, function and origin of lncRNAs – and especially transposon derived lncRNAs (TE-lncRNAs) - in response to abiotic stress is still in its infancy. Results We utilized a dataset of 127 RNA sequencing samples that included total RNA datasets and PacBio fl-cDNA data to discover lncRNAs in maize. Overall, we identified 23,309 candidate lncRNAs from polyA+ and total RNA samples, with a strong discovery bias within total RNA. The majority (65%) of the 23,309 lncRNAs had sequence similarity to transposable elements (TEs). Most had similarity to long-terminal-repeat retrotransposons from the Copia and Gypsy superfamilies, reflecting a high proportion of these elements in the genome. However, DNA transposons were enriched for lncRNAs relative to their genomic representation by ~ 2-fold. By assessing the fraction of lncRNAs that respond to abiotic stresses like heat, cold, salt and drought, we identified 1077 differentially expressed lncRNA transcripts, including 509 TE-lncRNAs. In general, the expression of these lncRNAs was significantly correlated with their nearest gene. By inferring co-expression networks across our large dataset, we found that 39 lncRNAs are as major hubs in co-expression networks that respond to abiotic stress, and 18 appear to be derived from TEs. Conclusions Our results show that lncRNAs are enriched in total RNA samples, that most (65%) are derived from TEs, that at least 1077 are differentially expressed during abiotic stress, and that 39 are hubs in co-expression networks, including a small number that are evolutionary conserved. These results suggest that lncRNAs, including TE-lncRNAs, may play key regulatory roles in moderating abiotic responses.


Background
The functional component of any genome extends beyond its protein coding sequences. Much of the additional function is encoded by RNAs, which vary in size from small RNAs (sRNAs) of< 25 nucleotides (nt) in length, to tRNAs of 70 to~90 nt in length, to an even larger class of long non-coding RNAs (lncRNAs). lncRNAs are typically defined as being longer than 200 nt and containing no more than one short (< 100 amino acids) open reading frame [1].
lncRNAs represent a stunning proportion of transcriptional products. In mice, for example, an early study cataloged~34,000 lncRNAs, representing one-third of all polyadenylated cDNAs [2]. More recent work has annotated~14,000 lncRNAs in humans [3]. Work in plants has lagged somewhat behind, but plant lncRNAs have been identified based on various kinds of high throughput expression data. For example, microarrays have been used to detect 6480 lncRNAs from Arabidopsis thaliana [4]; single-stranded RNA sequence data have led to the identification of 2224 lncRNA transcripts in rice (Oryza sativa) [5]; and total RNAseq data have been employed to detect 7245 lncRNAs in maize (Zea mays ssp. mays) [6].
At least three general properties of lncRNAs have become apparent from studies of both plants and animals. The first is that many lncRNAs are polyadenylated and capped, suggesting that they are transcribed and processed similarly to mRNAs [7]. However, lncRNAs can also be non-polyadenylated, and hence robust lncRNA discovery requires consideration of both polyadenylated and non-polyadenylated RNA samples. The second is that lncRNAs tend to be expressed at lower levels than coding genes, but with precise spatio-temporal patterns [3,[7][8][9][10][11][12][13]. A third general property is that some lncRNAs overlap with coding regions and sometimes contain parts of an exon; however, most originate from intergenic spaces (and these are sometimes called long intergenic RNAs or lincRNAs). Consistent with their origin from intergenic spaces, a large proportion of lncRNAs are either derived from transposable elements (TEs) or contain remnants of TEs. For example, Kapusta et al. [7] determined that 75% of human lncRNAs contained regions that appear to be derived from TEs.
Just as the origin and structures of lncRNAs are diverse, they play similarly varied functional roles. One major role is to act as templates for sRNA production, which in turn often contribute toward the epigenetic silencing of TEs [14,15]. Some lncRNAs perform other key functions, especially regulatory roles in cellular and developmental processes [3,16]. In plants, for example, lncRNAs have been shown to affect functions as diverse as phosphate signaling [17], flowering time [18], and susceptibility to pathogens [19]. Consistent with the hypothesis that lncRNAs play important regulatory roles, some lncRNAs are conserved among species and appear to be under purifying selection [3,20,21].
A growing body of evidence also points to a potential role for plant lncRNAs in responses to abiotic and biotic stresses. A few studies have identified Arabidopsis lncRNAs that respond to salt, drought, heat and cold stresses, as well as phosphate starvation [22][23][24]. The expression of 28% (1832 of 6480) of Arabidopsis lncRNAs was found to be significantly altered under biotic and/or abiotic stresses [4]. These findingsi.e., that lncRNAs are associated with stress responsesare particularly important in the context of crop species, because abiotic stresses affect crop yield and quality [13,[25][26][27][28][29]. However, the identification of lncRNAs during crop stress response remains largely unexplored, with a few notable exceptions. For example, 637 nitrogenresponsive lncRNAs and 664 drought-responsive lncRNAs have been identified in maize seedlings [6,30]. Similarly, 1010 and 1503 lncRNAs are known to be differentially expressed under abiotic stress in rice and in chickpea [31]. An important but challenging issue is to discover lncRNAs that are associated with abiotic stress responses and then to determine which lncRNAs function as key regulators, which serve as sRNA templates and which represent transcriptional noise.
Here we identify lncRNAs that relate to abiotic stress responses in maize. Our work extends previous maize lncRNA studies in at least three ways [6,8,30]. First, our efforts to detect lncRNAs are based on more expansive data. To perform lncRNA discovery, we have amassed 127 RNAseq datasets that were generated by different methods, in different tissues and across developmental stages, with a large subset generated in abiotic stress experiments, including salt, drought, heat, cold, UV and ozone stresses. The data include 89 RNAseq samples based on Illumina sequencing, 36 RNAseq datasets based on PacbioIsoSeq experiments, and two Illumina RNAseq datasets that were based on total RNA to potentially detect non-polyadenylated lncRNAs. Second, we investigate the relationship between TEs and lncRNAs. More than 85% of the maize genome consists of DNA derived from TEs [32], and we therefore expect that many lncRNAs exhibit sequence similarity to TEs. Thus far, however, the connection between lncRNA and specific TE superfamilies has not yet been investigated for maize. Third, we identify the subset of lncRNAs that are differentially expressed under abiotic stress to begin to narrow the set of candidates that function in stress response. To further narrow a candidate list of potentially functional lncRNAs, we also investigate co-expression of lncRNAs with neighboring genes and within expression networks [33,34]. Bringing these diverse analyses together, we identify several lncRNAs that are hubs in co-expression networks that respond to abiotic stress and show that several of these hubs are lncRNAs derived from TEs.

Construction of transcripts and lncRNA discovery
To discover lncRNAs and examine their expression during abiotic stress, we used 89 RNAseq samples, 2 total RNA-Seq samples and 36 Pacbio Iso-Seq samples. For the Illumina datasets, we extracted and cleaned~305 Gb of sequence data; on average 92.1% of Illumina reads per sample aligned successfully to the maize B73 v4 reference sequence [35]. Aligned reads from each Illumina sample were merged. We also collected and cleaned~1.98 Gb of IsoSeq sequences, aligned them to the B73 reference, and collapsed them for a total of 17,673 loci with 43,774 transcripts. We then combined the Illumina RNAseq and Pacific Biosciences (PacBio) IsoSeq data based on alignment of contigs to the reference, ultimately identifying a non-redundant set of 77,172 loci with 95,523 transcripts (Additional file 1: Fig. S1). Among these, 19,449 transcripts were found only in the total RNA sample. The set of 95,523 transcripts consisted of both coding transcripts and potential lncRNA transcripts. To identify the latter, we used a pipeline based on a combination of annotation programs and Pfam analyses (see Methods). Of the 95,523 assembled transcripts, CPC2 annotation identified 31,967 non-coding transcripts (CPC2 score < − 1), and 41,839 transcripts were deemed to be noncoding based on CNCI analysis. Of these two sets, 26,099 transcripts were longer than 200 bp and were predicted to be non-coding by both CPC2 and CNCI. These were further filtered by: i) comparing them to the Pfam database, retaining only those transcripts without a match (Blast, Evalue>1e-05) and ii) FPKM filtration, based on our requirement that FPKM had to exceed 1 in least one sample. The final dataset, which we consider high confidence lncRNAs, consisted of 23,309 transcripts (Table 1; Additional file 1: Fig. S1), representing 24% of the total (23,309/95,523). The average length of these candidate lncRNAs was 382 bp. None had an ORF > 100 amino acids in length, as per our definition of lncRNAs (see Methods), but most (95.15%) had one ORF. Among the 23,309 lncRNA candidates, 59.3% (or 13,822 transcripts) were identified from polyadenylated (polyA+) RNAseq samples, and the remaining 40.7% (or 9487 transcripts) were from total RNA samples, representing potential polyA-transcripts (Table  1; hereafter we refer to lncRNAs from total RNAs as polyA-for simplicity). A file containing all the identified lncRNAs sequences, along with their genomic locations, is provided in DataS1.
The 23,309 lncRNAs were widely distributed across the 10 maize chromosomes (Additional file 1: Fig. S2). We also examined their location relative to annotated coding sequences within the maize genome. As expected from our search strategy, most lncRNAs (87.9%, 20,499 of 23, 309) were intergenic, based on the output (a U class code) from gff compare. Only 185 lncRNAs were found to be intronic, with 29 and 156 of these as polyA-and polyA+ ( Table 1). The few remaining high confidence lncRNAs corresponded to, or overlapped with, previously annotated lncRNAs in the B73 v4 reference (Table 1). Among the 20,499 lincRNAs, 44.7% (or 9153 of 20,499) were from total RNA datasets (i.e, potentially polyA-), representing a significant enrichment for lncRNAs within the total RNA samples (Pearson χ-squared; p < 0.001).

Most lncRNAs are derived from transposable elements
Previous work has shown that a large fraction of lncRNAs are derived from TEs [7,35], including maize lncRNAs [8]. These observations have led to the hypothesis that TEs contribute to the functional domains of lncRNAs [36]. However, previous papers have provided few details about the TE superfamilies that have contributed to lncRNAs or to the proportion length of individual lncRNAs that can be attributed to TEs. Accordingly, we examined our set of lncRNAs to identify which may be derived from a TE. To do so, we masked regions of our 23,309 high-confidence lncRNAs using a speciesspecific TE library (see Materials and Methods). Overall, we found that 65.69% lncRNAs (15,312 of 23,309) overlapped with known maize TEs, which is a proportion similar to the previous maize study based on fewer lncRNAs [8]. Most (61%, or 9341 of 15,312) TE-lncRNAs showed similarity to TEs over ≥90% of their length (Fig. 1a). Perhaps unsurprisingly, the proportion of polyA-lncRNAs that were masked by TE sequence was higher than that of polyA+ lncRNAs (79.26% vs. 56.37%), which is a significant difference (p < 1e-5) (Fig. 1b). Hereafter we refer to lncRNAs with sequence similarity to TEs as "TE-lncRNAs".
We further investigated the superfamily of TEs that were similar to the 15,312 TE-lncRNAs. We found that 86% had sequence similarity to Long Terminal Repeat (LTR) retrotransposons of the Gypsy and Copia superfamilies ( Table 2) and also that some of these TE-lncRNAs exceeded 3750 bp in length (Fig. 1c). A much smaller proportion of TE-lncRNAs were derived from DNA transposons ( Table 2); the longest of these were shorter than the longest TE-lncRNAs with similarity to Gypsy and Copia elements (Fig. 1c).
These observations raise an interesting question: Do LTR/Gypsy and LTR/Copia elements give rise to lncRNAs more often than expected, given their proportion of the genome? To address this question, we estimated the proportion length among all annotated TEs that were attributable to LTR/Gypsy, LTR/Copia and other element superfamilies, based on RepeatMasker analyses. We then compared these percentages to the proportion length among inferred TE-lncRNAs ( Table 2). We found, for example, that LTR/Gypsy elements produced TE-lncRNAs at roughly the expected proportion (61% vs. 59%), relative to their representation in the genome. However, LTR/Copia elements contributed TE-lncRNAs at a lower proportion than their proportion length among annotated TEs (22% vs. 33%). Particularly notable is the fact that class II DNA elements produced TE-lncRNAs in our dataset at~2fold higher rate (12% vs. 6%) than expected based on their total length among TEs in the genome (Table 2). Altogether, our results verify that the most maize lncRNAs derive from TEs, but they also indicated that different TE superfamilies give rise to TE-lncRNAs at different rates.

Differential expression under abiotic stress
One general feature of lncRNAs is that they are expressed at lower levels than protein coding genes, and they are often expressed tissue specifically [3,6,8,23,37,38]. We assessed the expression levels of coding and lncRNA transcripts based on their maximum FKPM across all of our 129 datasets and then averaged these maximum levels across transcripts. The results indicated that lncRNAs are expressed at lower levels than coding RNAs (Fig. 2a), with coding regions expressed at three-fold higher maximum levels, on average, than non-TE-lncRNAs (average FPKM: 12.57 vs. 4.30) and six-fold higher maximum levels, on average, than TE-lncRNAs (average FPKM: 12.57 vs. 2.04). We next sought to identify coding genes and lncRNAs that were differentially expressed under abiotic stress. To do so, we contrasted a subset of our RNAseq samples that were generated from abiotic stress experiments that included both treatment and control RNAseq samples. For example, our samples included 12 RNAseq datasets that represent two control samples and two replicated treatment samples from each of four stresses (salt, drought, heat and cold) (Additional file 2: Table S1), all taken from V3 seedlings. Accordingly, we contrasted each stress treatment to the control, for a total of four contrasts (salt, drought, heat and cold) in V3 seedlings. Extending this approach to V4 and V6 seedlings across all the RNAseq data, we performed a total of 12 contrasts (Additional file 2: Table S1). These contrasts identified numerous differentially expressed coding genes and lncRNAs ( Table 3). The various treatments identi-fied~2000 up-or down-regulated coding transcripts, on average, and a set of 1077 non-redundant lncRNAs that were either up-and down-regulated across treatments. Fig. 1 The relationship between lncRNAs and TEs. a The histogram indicates the number of lncRNAs (y-axis) relative to the percentage length (xaxis) of lncRNAs that have similarity to TEs. b Numbers of lncRNAs that are polyA-(i.e., from total RNA) or polyA+ with similarity to TEs. The proportion of polyA-lncRNAs is significantly enriched for similarity to TEs. c The length distribution of TE-lncRNAs organized by their inferred TE superfamily of origin Among the 1077 non-redundant lncRNA transcripts, many were differentially expressed in two or more treatments. For example, 679 lncRNAs were identified as differentially expressed across V3-V6 stages under heat treatment (Table 3; Fig. 2c; Additional file 4: Table S3, Additional file 1: Fig. S3). Of these, 29 lncRNAs were differentially expressed in all three developmental stages, but 79, 214 and 232 lncRNAs were specific to the V3, V4 and V6 stages, respectively. Interestingly, 40.50% (32/ 79) heat-responsive lncRNAs at the V3 stage, 26.17% (56/214) heat-responsive lncRNAs at V4 and 42.67% (99/232) heat-responsive lncRNAs at V6 were also differentially expressed in response to other stress treatments, but not shared among developmental stages. These patterns implicate many lncRNAs as a common component of abiotic stress responses, but they also imply that these responses have temporal (i.e., developmental) specificity in leaves from V3 to V6 seedlings.
Interestingly, 529 non-redundant TE-lncRNAs were differentially expressed under one or more conditions. The proportion of differentially expressed TE-lncRNAs was lower than the proportion of all lncRNAs; TE-lncRNAs were 65% of the total proportion of lncRNAs, but constituted only 45 and 56% of up-and downregulated lncRNAs. Most of the differentially expressed TE-lncRNAs had similarity to LTR/Gypsy and LTR/ Copia, as expected, but other TE families also contributed to differentially expressed TE-lncRNAs. For example, MSTRG.32907 exhibited similarities to LINE elements, MSTRG.73329 was similar to DNA/hAT-Ac elements, and MSTRG.37644 was an LTR/Gypsy elements. All of these were differentially expressed in leaves from V3 seedlings, but in different abiotic treatments (heat, cold and salt, respectively) ( Fig. 4).
lncRNAs have been shown to be involved in cis regulation of neighboring genes. To investigate this possibility, we examined the correlation in expression between lncRNAs and their closest neighboring gene in either the 5′ or 3′ direction, yielding a dataset of 1077 differentially expressed lncRNAs and their neighboring genes. The lncRNAs were strongly (r = 0.48), and highly significantly (p < 2e-16) correlated with the expression of their closest neighboring gene (Fig. 2b), suggesting that lncRNAs may either be involved in cis regulation or are subject to some of the same cis regulatory features as their neighboring genes.

Co-expression modules associated with stress responses
Compared to coding genes and microRNAs, the function of lncRNAs in abiotic stress response remains largely   unknown. Computational construction of gene coexpression networks can be a valuable tool for linking lncRNAs and coding RNAs and also for beginning to infer potential biological functions, because co-expressed genes are often members of the same pathway or protein complexes, are often either functionally related, or are controlled by the same transcriptional regulatory program [33,[39][40][41]. We used the 89 Illumina RNA-seq datasets to build co-expression networks (see Methods and Table S1). WGCNA analyses identified 40 modules that comprise various nodes in the network. Of the 40 inferred modules, 16 were significantly correlated with stress treatments (Fig. 3, Additional file 1: Fig. S4, Additional file 4: Table S3, Additional file 5: S4). These 16 contained 7221 transcripts including 408 lncRNAs, of which 171 were TE-lncRNAs. Most of the 16 modules were associated with a single stress and developmental stage, but some were correlated with two or more stresses or stages (Fig. 3). For example, the ME_darkgreen module was highly correlated with drought at the V3 stage (r 2 = 0.76, p < 4e-18), but it was also significantly correlated with salt stress at the V3 (r 2 = 0.21, p < 0.05) and V4 (r 2 = 0.29, p < 0.005) stages. Similarly, the ME_salmon module correlated with drought (r 2 = 0.25, p < 0.02) and also salt stress at the V3 (r 2 = 0.45, p < 1e-05) and V4 stages (r 2 = 0.38, p < 3e-04). Complete correlation information between modules and stress conditions and developmental stages are provided in Additional file 1: Fig. S4.
Recent work uncovered a temporal transcriptional logic underlying nitrogen (N) signaling in Arabidopsis [42]; we see similar logic based on developmental timing for abiotic stress responses. Consider the example of heat stress: the ME_tan module was correlated with V3 heat stress (r 2 = 0.89, p < 4e-32), the ME_yellow module correlated with V4 heat stress (r 2 = 0.96, p < 1e-49), and the ME_darkturquoise (r 2 = 0.43, p < 2e-05) and ME_ pink (r 2 = 0.49, p < 1e-06) modules were associated with heat stress in the V6 stage. These data suggest a developmental cascade of heat-responsive modules. To illustrate this graphically, we arranged the 16 associated modules by stress and development stage. Like heat stress, cold and drought stress were both associated with distinct modules at different developmental stages. There were exceptions, however, as both salt and UV stress associated with two modules in the V4 stage (Fig. 3).
Among the 16 significant modules, the most lncRNAs were associated with the ME_yellow module, which correlated with heat stress in the V4 stage (r 2 = 0.96, p < 1e-49) and contained 147 lncRNAs and 65 TE-lncRNAs. Figure 4 details the expression pattern of this and other stress related modules. Given these modules, it is possible to extract the eigengenes from modules to infer function. For example, the eigengenes for the ME_yellow module were Fig. 3 A visual representation of the 16 modules that were significantly correlated with abiotic stress responses. All of the modules were associated with one stress condition and developmental stage, such that they exhibit a temporal cascade of stress responsiveness under different stresses and across V3 to V6 developmental stages. The scale of the heat map reflects the level of correlation (r) among genes in an expression module for a specific abiotic stress (i.e., Heat, Cold, Drought, Salt, UV, Ozone) at a specific development stages (i.e., V3 to V6) assigned into GO categories related to 'response to heat', 'response to high light intensity', 'heat acclimation, response to radiation', 'regulation of seedling development' and 'ER-nucleus signaling pathway'. The ME_darkturquoise (r 2 = 0.43, p < 2e-05) and ME_pink (r 2 = 0.49, p < 1e-06) modules were also associated with heat stress but in a later development stage (V6). These two modules contained 52 lncRNAs and 16 TE-lncRNAs, and their eigengenes exhibited significant enrichment of the GO terms 'intracellular ribonucleoprotein complex', 'HslUV protease complex', 'cytoplasmic translation' and 'intracellular membrane-bounded organelle' (Additional file 6: Table S5, Additional file 7: Table S6, Additional file 8:  Table S7). Overall, GO-inferred functions helped verify that the modules reflect aspects of the stress response.

LncRNAs are hubs in modules
An interesting facet of the 16 stress-associated modules is that each contained both lncRNAs and TE-lncRNAs. We have mentioned that the ME_yellow module contained the most lncRNAs of the 16 modules, with 147 lncRNAs and 65 TE-lncRNAs, but other modules were similar in containing lncRNAs. For example, the ME_tan module, which is associated heat stress in V3, contained 26 lncRNAs and 9 TE-lncRNAs. An important question concerns the role of these lncRNAs in expression networks. One role, which is suggested by our results (Fig. 2b), is that some of the lncRNAs in modules are co-expressed with genes due to cis interactions. It is also possible, however, that lncRNAs regulate genes in trans. To investigate this possibility, we screened for key 'hubs', which we defined by Top) The heat graph shows transcript expression levels for hub genes and lncRNAs in each module (y-axis) and across conditions (x-axis). The key to modules (y-axis) and stress conditions (x-axis) are shown on the right legend, with conditions also separated by developmental stage (bottom of x-axis). Warmer colors within the heat map indicate high expression, and cooler colors are low (or under) expression. This particular heat map illustrates that the four heat-associated modules are, as expected, highly expressed under heat stress, but not always at the same developmental stage. Bottom) The bar plots below the heat graph are eigen-lncRNA expression values selected from the top three overrepresented TE-lncRNAs with high interconnectivity. The x-axis is the same as the heat map, and the id of the TE-lncRNAs is provided by the color key. This graphs shows, again, that the TE-lncRNAs tend to be more highly expressed under heat stress, but with some dependence on developmental stage. Additional file 1: Figs. S5 to S9 present similar figures for modules associated with cold, drought, salt, UV and ozone stress, respectively high connectivity (i.e., intramodular connectivity within the top 10% of all members of the module), membership > 0.9 and high significance (p < 0.01) in the module. Based on these filters, we identified 670 hubs that included 39 lncRNAs from different stress-responsive modules (Additional file 5: Table S4), of which 18 were TE-lncRNAs.
Many hubs in co-expression networks belong to transcription factors (TF) of families such as TCP, AP2/ EREBP, MYB, WRKY, NAC, bZIP [43][44][45][46]. We found interactions and potential crosstalk between lncRNAs and stress-responsive TFs from these families. In the heatresponsive modules, for example, hub lncRNAs such as MSTRG.32907, MSTRG.36825 and MSTRG.30107 and MSTRG.35709 were connected to TF families such as TCP, NAC, Dof and bHLH, which are known to respond to abiotic stress from previous studies (Fig. 5, 48-50]. These results suggest the possibility that lncRNAsand more specifically, some TE-lncRNAsact to regulate abiotic stress responses. If they play a functional role, one would expect them to be conserved over evolutionary time. We tested this idea by blasting each of the 39 hub lncRNAs to an evolutionary gradient of genomes that included sorghum, rice and Arabidopsis (Additional file 9: Table S8). Of the 39, 16 had strong hits (e < 10 − 15 ) to sorghum, a close relative to maize, and 4 of these 16 were TE-lncRNAs. Moreover, three of the hub lncRNAs had hits to rice, but zero TE-lncRNAs had rice hits, and none of the 39 hub lncRNAs had significant hits to Arabidopsis. Overall, these results suggest that~10% these lncRNAs have been conserved since the divergence of rice and maize, roughly 50 million years ago [47], and that 39% have been conserved since the divergence between sorghum and maize, roughly 16 million years ago [48].
Testing the reliability of RNA-seq based inferences via qRT-PCR All of our inferences are based on bioinformatic analyses of RNAseq samples. To explore the reliability of these inferences, we performed a heat-stress experiment on maize B73 V3 seedings. The seedlings were subjected to 50°C for 4 h (see Methods), and their RNA was extracted. We then subjected the samples to quantitative real-time PCR (qRT-PCR) to compare expression changes between replicated control and heat-treated seedlings. We focused on ten representative transcripts, including seven lncRNAs and three coding genes. Among this set of ten transcripts, six were significantly up-regulated under heat-stress and four were significantly down-regulated based on qRT-PCR (Fig. 6). We then compared experimental results to those based on RNAseq data, illustrating a high degree of consistency (r = 0.936; p < 1 × 10 − 4 ) between inferences based on RNA-Seq and on qRT-PCR (Fig. 6, Additional file 10: Table S9).

Discussion
In this study, we accumulated and mined an expansive dataset to identify lncRNAs in maize, particularly those that are expressed in response to abiotic stress. Bioinformatic analyses led to the identification of 23,309 lncRNAs, the largest collection yet identified from maize. We characterized these lncRNAs with respect to three features: i) their prevalence and origins, especially lncRNAs that appear to be derived from TEs, ii) their expression levels and patterns, including a detectable ciseffect, and iii) their potential for functioning in abiotic stress response, as inferred from the construction of coexpression networks.

lncRNA identification and characterization
By its very nature, lncRNA discovery is limited by a number of factors. It is first, of course, limited by the definition of lncRNAs that have been used in the literaturei.e., an RNA molecule > 200 bp with at most one ORF or overlapping exon of < 100 codons [1]. Following precedence, we have adopted this definition for lncRNA discovery, but it bears remembering that some of these could in fact be translated because they contain short ORFs. A second limitation is the fact that our search strategy did not include lncRNAs that overlapped with (or contained) an annotated exon. We applied this limitation purposefully, to avoid mis-classification based on fragmented RNA molecules or contigs. For that reason, however, our work likely underrepresents lncRNAs derived from genes and so some of our estimates may be inaccurate. For example, if many lncRNAs are derived from genic regions, then our estimate of the proportion of lncRNAs that are derived from TE-lncRNAs is an overestimate. It is worth noting, however, that our estimate of the proportion of TE-lncRNAs (65%) is similar to a previous, smaller maize study that estimated 68% of lncRNAs were derived from TEs [8]. A third limitation is that the completeness of lncRNA discovery relies critically on the number of tissue and developmental samples that are available. With the exception Although our study focuses on only one tissue (i.e., leaves from seedlings of different developmental stages), it greatly expands lncRNA discovery in maize because previously the most RNAseq samples used for lncRNA discovery was 30 [8].
Our RNA datasets were highly enriched for polyadenylated (polyA+) transcripts, because it consisted of 36 PacBio fl-cDNA datasets, 89 RNAseq datasets and only two total RNA datasets. Nonetheless, fully 44% of intergenic lncRNAs were identified from the total RNA data, representing a disproportionately large number relative to polyA+ data. This observation superficially suggests that far more lncRNAs are polyA-, which is an important point to consider when one considers that most -but not all [6,49,50] -lncRNA surveys in plants have relied solely on RNAseq samples and not total RNA samples. Previous work has also suggested that the ratio of polyA-and polyA+ lncRNAs may be a function of growth conditions and external stresses [13]. A fuller understanding of lncRNAs may require more substantial investments in total RNA datasets.

Most lncRNAs are TE-lncRNAs
Given our identification of 23,309 lncRNAs, we next sought to characterize their loci of origin and particularly to identify those that likely originated from TEs. We found that~65% (15,312) of lncRNAs contained similarity to known TEs. Of these, most (61%, 9341 of 15,312) were similar to TEs over > 90% of their length, suggesting they were derived solely from TEs. As we noted above, our estimates of the proportion of TE-lncRNAs could be too high, based on our search strategy. However, it is also not surprising that we identified (See figure on previous page.) Fig. 5 The networks of four heat-responsive modules. The four modules are, the ME_tan module (top left), the ME_yellow module (top right), the ME_pink module (middle), and the ME_darkturquoise module (bottom). In each network diagram, the green circles represent TE-lncRNAs; the blue color represents nonTE-lncRNAs; the orange dots represent known transcription factors from various families (e.g., TCP), and grey circles represent coding RNAs. The size of the dot represents intramodular connectivity, with larger sizes representing higher connectivity. From these networks, we can infer that lncRNAs and TE-lncRNAs are sometimes as or more interconnected than transcription factors Fig. 6 qRT-PCR validation of differentially expressed lncRNAs and coding RNAs The qRT-PCR data were generated from 10 differentially expressed loci, based on leaf tissue of V3 seedlings under heat-shock and control conditions. The qRT-PCR histogram for each locus represents the mean ± standard error (SE) of three independent biological replicates, and the qRT-PCR are compared to fold-change data inferred from RNAseq data. The fold-change values based on qRT-PCR and RNAseq data were significantly correlated across the 10 loci (r = 0.9363, p < 0.000067). a high proportion of TE-lncRNAs, for at least three reasons. First, previous studies in mammals have demonstrated that most lincRNAs derive from TEs [7,35]. Second, the maize genome is replete with TEs, with > 85% of the genome estimated to consist of DNA derived from TEs [32]. Finally, an important function of lncRNAs is to be precursors for small RNAs, which in turn contribute to TE silencing via sequence homology [8,[51][52][53].
We also investigated the TE families from which TE-lncRNAs originated. Most of the TE-lncRNAs were derived from LTR/Gypsy and LTR/Copia families (Table 2), reflecting their preponderance in the maize genome [32,53]. lncRNAs derived from LTR/Gypsy elements were represented in a similar proportion to their genomic proportion (by length) among the TEs we investigated in our study (Table 2). However, LTR/Copia elements were underrepresented in the TE-lncRNA dataset relative to their combined lengths in the genome, 22% versus 33%. This suggests that LTR/Copia elements do not produce lncRNAs as readily as LTR/Gypsy elements, at least within our data. The reasons for the difference between LTR/ Copia and LTR/Gypsy are presently unclear, but one can consider two broad categories: TE age and TE location. For the former, older elements might be expected to be in a deeply-silenced epigenetic state that relies primarily on the maintenance of methylation during cell division rather than an active epigenetic response that enlists lncRNAs [54]. For the latter, one might expect LTR/Copia elements to be in genomic locations that are transcribed. In fact, however, the opposite is true, because LTR/Gypsy elements tend to be concentrated in pericentromeric regions [32] where there may be less active transcription and less ongoing silencing. In contrast, LTR/Copia elements tend to accumulate preferentially in euchromatic regions [32] that tend to be more transcriptionally active. Class II DNA elements also tend to be located near genes and euchromatic regions, but unlike LTR/Copia elements they produce lncRNAs at about a 2-fold higher than implied by their genomic lengths (Table 2). To sum: we have shown that TE superfamilies over-and under-produce lncRNAs relative to their genomic representation based on our extensive collection of datasets, but the ultimate causes of these differences remain unclear.

Levels and patterns of lncRNA expression
Several previous papers from both plants and animals have shown that lncRNAs tend to be expressed at lower levels than bona fide genes and that they also tend to show tissue-specific patterns of expression [3,[7][8][9][10][11][12]. We have verified the former by recording the maximum FPKM for each lncRNA transcript across datasets; on average, lncRNAs are expressed at 4-fold lower levels than genic transcripts by this metric (Fig. 2a). Unfortunately, we cannot verify that lncRNAs have more tissue specific expression than genes, because the bulk of our data were isolated from leaves. We can, however, verify that they have lower entropy than genes, on average (Average Shannon Entropy = 2.10 for coding genes vs. 1.13 for lncRNAS), because the lncRNAs consistently lack expression evidence under more conditions.
Of the 13,822 polyA+ lncRNAs, we found that 1077 (7.79%) were differentially expressed under stress conditions, including 529 TE-lncRNAs. These TE-lncRNAs provided an opportunity to assess whether they could be linked to the expression of nearby genes, indicating some sort of cis-regulatory pattern, as has been observed in other species [20,55,56]. TE-lncRNAs were significantly correlated (r 2 = 0.48; p < 2.0e-16) with their nearest neighboring genes (Fig. 2b), suggesting that TE-lncRNAs may either be involved in cis regulation or are subject to some of the same cis regulatory features as their neighboring genes, such as open chromatin structure.

lncRNAs, abiotic stress and coexpression modules
This study was designed specifically to identify stressresponsive lncRNAs. We approached this problem in two ways. We first identified differentially regulated lncRNAs from a series of controlled experiments for heat, cold, drought and salt stress. These experiments were based on leaf tissue from seedlings of the V3, V4 and V6 stages. Comparing the stress treatment to their corresponding control at the appropriate developmental stage across 12 different contrasts, we identified 1077 lncRNAs with evidence for differential expression. This observation corroborates previous studies in suggesting that lncRNAs may be differentially regulated under stress [6, 22-24, 30, 31], but it provides no indication whether the differentially regulated lncRNAs are a byproduct of stress responses or play a functional role. There is, however, a large gap between observing differential expression and proving function. As a first step toward bridging this gap, we have built co-expression networks based on both coding RNAs and lncRNAs from 89 RNAseq datasets, yielding a total of 40 coexpression modules. Of these, 16 were significantly associated with stress responses, and GO annotations of these modules were generally consistent with their inferred response functions. One interesting facet of these 16 modules is that they demonstrate clear patterns across developmental time (Fig. 3), suggesting that temporal hierarchies are important for plant responses to environmental stress.
It is difficult to infer function from co-expression modules [57], but studies have shown that genes with high connectedness tend to be functionally essential [58,59]. We were therefore particularly interested whether any of our lncRNAs are included within co-expression networks and particularly whether they are 'hubs' within network modules. Of the 16 modules that were significantly associated with stress responses, we identified 670 hubs, many of which corresponded to genes from known transcription factor families (Fig. 5). Of these 670 hubs, 39 were lncRNA transcripts. These represent our best candidates for lncRNAs that function in stress response, potentially as trans-acting regulatory factors. Consistent with this last conjecture, several of these lncRNA hubs were connected to genes from known TF factors [60][61][62]. Moreover,1 0% of these lncRNAs yielded strong blast hits to rice, suggesting some measure of evolutionary conservation consistent with functional constraint, at least for this subset.
One somewhat surprising finding is that 18 of the 39 lncRNA hubs are related in sequence toand perhaps derived from -TEs. This observation raises the intriguing idea that TE exaptation can occur at the level of lncRNAs. It is now well known that TE exaptation contributes to many aspects of genome function, including protein coding genes and especially functional regulatory elements [63][64][65]. The location of TE-lncRNAs as hubs, along with their connectedness to known TFs, suggests that a small subset of TE-derived lncRNAs may function as trans-acting regulatory factors in maize. If true, these hubs appear to have been recruited recently, given that only four of 16 yield strong hits to the sorghum genome. Clearly additional work is required to prove that these TE-lncRNAs function as hypothesized in abiotic response, but their centrality in co-expression modules is nonetheless an intriguing result that is consistent with previous findings showing that most lncRNAs are derived from TEs [7] and that lncRNAs can play central regulatory roles in plant and animal development [63].

Sample collection
In this study, we gathered 36 Pacbio Isoseq datasets that were sampled from different tissues [66] and 91 illumina RNAseq datasets that were sampled from leaves of maize B73 [6,[67][68][69] (Additional file 2: Table S1, Additional file 1: Fig. S1). Of the Illumina datasets, 89 represented polyA+ transcripts and two were based on total RNA, which includes putative polyA-transcripts. The datasets were used for three purposes: lncRNA discovery, differential gene expression analyses, and the inference of gene co-expression networks. All of the 129 datasets were used for lncRNA discovery. A subset of 71 of the 91 RNAseq datasets were employed for differential gene expression analyses (Additional file 2: Table S1); these included replicated control and treatment samples from experiments that tested the effects of drought, salt, heat, cold, UV and ozone treatments on gene expression. Finally, all of the 89 polyA+ Illumina RNAseq datasets were used for inferring gene coexpression networks. The 89 Illumina datasets represented a developmental series sampled from leaves of V3, V4 and V6 seedlings; we take advantage of this developmental series in some network analyses (Additional file 2: Table S1, Additional file 1: Fig. S1).

Data processing and alignment
Raw data were converted into the FASTQ-formatted file by the Fastq-dump program from the SRA Toolkit (https://github.com/ncbi/sratoolkit). For Illumina data, the SolexaQA++ v3.1 program [70] was employed for quality trimming, using the Q20 value. After trimming, any reads < 50 bp were removed. Cleaned reads were then aligned to the B73 reference genome sequence (v4, http://plants.ensembl.org) using the STAR aligner program [71] with default parameters. Aligned reads were assembled into transcripts by the StringTie program, using the RABT (reference annotation-based transcript) assembly algorithm [72]. For the Pacbio IsoSeq data, reads were aligned to the B73 reference genome using the Minimap2 program [73]. Unique isoforms were collapsed, based on genome alignment by Cupcake ToFU (https://github.com/Magdoll/cDNA_Cupcake).
Subsequently, the assembled transcripts from Illumina RNAseq and Pacbio IsoSeq were merged using StringTie, which yielded a non-redundant unified set of transcripts.

Computational identification of intergenic and intronic lncRNAs
To find lncRNAs, a strict computational strategy was performed as described by Lv et al. (2016) that and consisted of four steps. First, non-redundant transcripts were submitted to annotation programs to evaluate their coding potential. We used two annotation programs -CPC2 [74] and CNCI [75] and focused on transcripts that were identified as having no coding potential by both programs as candidate lncRNAs. Second, we submitted candidates to the Pfam database using Pfam_scan script (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/), which aligns transcripts with HMMER [76]. We filtered any transcripts that aligned to known protein families at an Evalue<1e-05. Third, we compared the remaining transcripts to reference annotations using gffcompare [77], which outputs various codes to designate the relationship of transcripts to annotated coding regions. We retained transcripts with class codes "i", which indicates that a transcript is fully contained within a reference intron, and "u", which designates transcripts that are not obviously related to known coding regions, for further analyses. This last step is likely to miss some sense and anti-sense lncRNAs that derive from coding regions but also limit false positives based on incompletely assembled coding transcripts. Finally, we retained transcripts as high confidence lncRNAs if they passed all of the previous four steps, if they were longer than 200 bp, and if they had an FPKM (fragments per kilobase of exon model per million reads mapped) > 1 in at least one of our sample datasets. To determine the relationship of high-confidence lncRNAs to TEs, we masked the lncRNA sequences to identify TE domains. Masking was based on the maize-specific library of Repbase database (www.girinst.org) and was performed by RepeatMasker (www.repeatmasker.org).

Gene expression analyses
We performed two separate types of analyses based on gene and lncRNA expression data. The first analysis was differential expression analysis based on comparisons between stress and control data (Table S1). To perform these analyses, high quality reads were aligned to the B73 reference using the STAR program [71]. For reads that mapped to multiple locations, we removed alignment reads with a mapping quality < 20, based on SAM-Tools [78]. Raw counts were quantified using the featureCounts program [79], and the FPKM value per gene was calculated using a custom Perl script. The DESeq2 package [80] was used to perform pairwise comparisons between samples to identify differentially expressed transcripts. To identify differentially expressed genes (DEG), we relied on two criteria: the Log2(fold change) had to be > 1 and the adjusted p-value from DEseq analyses had to be p-adj < 0.05.
The second type of analysis was the inference of coexpression networks. To construct networks, expression profiles were extracted from each gene and lncRNA, and expression levels were normalized using variance stabilizing transformation in DESeq2 [80]. Co-expression correlations among lncRNAs and genes were based on Pearson correlations with R 2 ≥ 0.8 across the 89 RNAseq datasets. An unsigned co-expression network was inferred using the WGCNA package [81] with an optimal soft threshold = 12. Modules within the network were assigned using Topological Overlap Matrix (TOM). The correlations between modules and stress treatments were calculated and plotted, and then the significant stress-responsive modules were extracted for further analysis. Co-expressed networks were visualized by the Gephi program [82].

Gene ontology enrichment analysis
The eigengene probes of each stress-responsive module were assigned putative functions by searching against the UniProt protein database [83]. Searching was based on using the Blastx program [84], using a cut-off evalue ≤ 1e-10. Coding eigengenes were then submitted to the AgriGO v2 online toolkit [85] for gene ontology term enrichment. A Fisher's exact test was applied for the enrichment analysis and the p value was adjusted using the Bonferroni method, with an experiment-wide significance level of 0.05.

Experimental stress treatment, RNA extraction and qRT-PCR analysis
The maize inbred line B73 was germinated in a greenhouse at JAAS (Jiangsu Academy of Agricultural Sciences). Seedlings at the three-leaf (V3) stage were then incubated at 50°C for 4 h for heat stress treatment, as described by Makarevitch et al. [69]. Control plants were retained under a temperature of 25°C. Leaves from three independent biological replicates were collected and processed for RNA extraction and first strand cDNA synthesis according to PrimeScriptTM RT Master Mix (TaKaRa). qRT-PCR was performed using SYBR Premix DimerEraser™ kits (Takara) on a Real Time PCR System (Roche LightCyclerR 96, USA), according to the manufacturer's instructions. Quantification results of target transcripts were calculated using the comparative 2-ΔΔCT method. Primers were designed using Primer Primer5 [86] and can be found in Additional file 10: Table S9.
Additional file 1: Figure S1. A schematic showing the data used in this paper, the bioinformatic pipeline for lncRNA identification, numbers of genes and identified lcRNAs, and some features of downstream lncRNA analyses. Figure S2. The chromosomal distribution of lncRNAs. Density was plotted across each chromosome. Figure S3 Figure S4. The correlation values (r) between each of the inferred coexpression modules and the specific traits (e.g., Heat, Cold) at different development times (e.g., V3 to V6). A subset of these data for the top 16 stress-associated modules is provided as a heat map in Figure 3.  Figure  S8) and ozone stress ( Figure S9). Each figure contains a heat map (top) and graphs of the expression of specific TElncRNAs (bottom) that were chosen because they overrepresented with high interconnectivity. The heat graph shows transcript expression levels for genes and lncRNAs in each module (y-axis) and across conditions (x-axis). The key to modules (yaxis) and stress conditions (x-axis) are shown on the right legend, with conditions also separated by developmental stage (bottom of x-axis). Warmer colors within the heat map indicate high expression, and cooler colors are under-expression. The bar plots below the heat graph are eigen-lncRNA expression values selected from the top overrepresented lncRNAs with high interconnectivity. The x-axis is the same as the heat map, and the id of the lncRNAs is provided by the color key.
Additional file 2: Table S1. A summary of pre-processing and alignment of Illumina and Pacbio datasets.
Additional file 3: Table S2. Details of identified lncRNAs including TE information.
Additional file 4: Table S3. Differentially expressed lncRNAs and TE-lncRNAs under different abiotic stresses across developmental stages. Each sheet represents different stress treatments at a specific developmental stage.