Long non-coding RNAs are major contributors to transcriptome changes in sunflower meiocytes with different recombination rates

Background Meiosis is a form of specialized cell division that marks the transition from diploid meiocyte to haploid gamete, and provides an opportunity for genetic reassortment through recombination. Experimental data indicates that, relative to their wild ancestors, cultivated sunflower varieties show a higher recombination rate during meiosis. To better understand the molecular basis for this difference, we compared gene expression in male sunflower meiocytes in prophase I isolated from a domesticated line, a wild relative, and a F1 hybrid of the two. Results Of the genes that showed differential expression between the wild and domesticated genotypes, 63.62 % could not be identified as protein-coding genes, and of these genes, 70.98 % passed stringent filters to be classified as long non-coding RNAs (lncRNAs). Compared to the sunflower somatic transcriptome, meiocytes express a higher proportion of lncRNAs, and the majority of genes with exclusive expression in meiocytes were lncRNAs. Around 40 % of the lncRNAs showed sequence similarity with small RNAs (sRNA), while 1.53 % were predicted to be sunflower natural antisense transcripts (NATs), and 9.18 % contained transposable elements (TE). We identified 6895 lncRNAs that are exclusively expressed in meiocytes, these lncRNAs appear to have higher conservation, a greater degree of differential expression, a higher proportion of sRNA similarity, and higher TE content relative to lncRNAs that are also expressed in the somatic transcriptome. Conclusions lncRNAs play important roles in plant meiosis and may participate in chromatin modification processes, although other regulatory functions cannot be excluded. lncRNAs could also be related to the different recombination rates seen for domesticated and wild sunflowers. Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2776-1) contains supplementary material, which is available to authorized users.

Here we present additional text (details of methods and discussion), tables and figures. Sections of this document are refered in the main text, and numbers of tables and figures are referred in the main text as "AF1-#", where "#" is the corresponding number for table or figure. Text in blue includes link to web resources.

Contents
Sequencing and assembly results 1 Additional discussion of lncRNA Identification Sequencing and assembly results We used the assembly from F 1 meiocytes as a representative transcriptome for differential expression analysis since alleles from both parents are expected to be present and expressed in the hybrid. 73,669 transcripts (with a N50=1,298) composed this assembly, of these, 73,658 showed expression in at least one of the genotypes. From this transcriptome, we identified 71 genes orthologs to Arabidopsis thaliana known meiotic genes, 8 more than previously described in a transcriptome assembled de novo from meiocytes and somatic sequences [5]. This confirmed the reliability and representativeness of the F 1 transcriptome. We observed that when we assembled the transcriptome with sequences from the F 1 and one of the parents, the complexity of the transcriptome increased, i.e., more genes as well as splice variants were reconstructed, but the number of known-meiotic genes detected was the same. So we thought that the possibilities of miss-assemblies and chimeric transcripts increases when reads from different genotypes are mixed, but not new genes are detected.
On the other hand, after testing for the number of missing genes in this transcriptome through the methodology described in [6], we conclude that the F 1 transcriptome is complete. Table AF1-1 presents the number of reads obtained from each one of the libraries, as well as the number and percentages of reads with unique hit to the F 1 assembly. The uniformity of the percentages of reads from each library mapping to the F 1 transcriptome demonstrate that no significant bias in expression was introduced by using the F 1 transcriptome.
In Table AF1-1 we can see that the percentages of reads that have a unique hit with the F 1 transcriptome for meyocites goes from a minimum of 75.26 up to a maximum of 79.91, with an average of 77.90, a median of 78.40 and a standard deviation of only 1.86%. The number and percentages of reads from the somatic transcriptome previously obtained in [5] were used to compare the lncRNA detected in the somatic and meyocites transcriptomes.   [11] and 'Coding-Potential Assessment Tool' (CPAT) [28] were employed conjointly to assure a high confidence assignment of transcripts as lncRNA. These two algorithms are complementary; CPC assesses protein-coding potential of transcripts using sequence features and support vector machine, while CPAT uses an alignment-free logistic regression model. For CPC, negative scores indicate low coding potential, and a maximum threshold of -1 is considered as strong evidence of non-coding potential, and thus in our case as authentication for lncRNAs. On the other hand, CPAT directly gives an estimate of the probability that the sequence is coding for a protein, allowing a direct interpretation of the results. Here we give details of the selection process and compare it with the use of the two algorithms in the literature.
All 34,304 sunflower genes without a good blastx hit to proteins were subjected to the CPC and CPAT algorithms. For the classification of transcripts as lncRNAs we set a double threshold, considering a gene as lncRNA only for transcripts for which CPC score ≤ −1 and CPAT score ≤ 0.3 Table AF1-3 presents the number and percentages of sequences which fulfilled each criterion.
In Table AF1-3 we can appreciate how the application of both criteria for the selection of lncRNA was successful in avoiding putative false positives, that will happen if only the criterion CPC ≤ −1 but not  From Table AF1-4 we can see that the average CPC score for genes classified as lncRNA is -1.26, while for the group classified as 'Unknown' this average is -1.01, but has a maximum of 2.47, much larger than the threshold of -1.00 set for lncRNAs, thus CPC was an effective discriminant criteria. However, the CPAT threshold of a score less or equal than 0.3 in coding probability, resulted even more powerful to determine non-coding capacity; while nominally we asked the value to be equal or less than 0.3, effectively all lncRNAs detected have the same value of in coding probability, 2 × 10 −6 , indicating a near null possibility for these sequence to be coding protein. Thus, a posteriori, the maximum value of the CPAT parameter for our lncRNA sequences is far from 0.3 and very near zero.
To compare the stringency of the thresholds set to CPC and CPAT in our procedure to designate lncRNAs, we compiled Table AF1-5 showing references that use these algorithms in distinct organisms.
From Table AF1-5 we can see that in references [12,31,27,33,34,19] the threshold set to CPC, 0, is less stringent the value set by us, -1, while in references [17,16,8,10,29] this threshold is equally stringent than the one set by us. On references [26,1,2,22,32], the threshold for the CPAT algorithm is more permissive than the one set by us, i.e., in all cases > 0.3 More relevant, in all references presented in Table AF1-5, except by [30], only a single algorithm is employed to call lncRNA, while we employed thresholds for both programs, and as we have demonstrated (see Table AF1-3), this fact filter a large number of putative false positive lncRNAs. Only in [30] both programs are employed conjointly to call lncRNA, and in that case our threshold for CPC is more stringent, ≤ −1, than the one in [30], ≤ 0.5, while the threshold for CPAT in [30] is 0.02, nominally more stringent that the threshold used by us, 0.3; however, as shown in Table AF1-4 in all our 25,327 lncRNAs the CPAT value is near zero (2 × 10 −6 ) and thus our classification results are at least as robust than the ones presented in [30].
In summary, we have shown that our results of classification of the 25,327 as lncRNAs are at least as robust, but likely more trustworthy than the ones presented in the 17 references, corresponding to 17 different organisms, presented in Table AF1-5.

Putative targets for lncRNAs
As shown in multiple model systems, lncRNAs can form networks of ribonucleoprotein (RNP) complexes with chromatin regulators, and thus can function as scaffolds in these complexes [23]. With the aim of predicting putative targets (protein coding genes) for the sunflower lncRNA found in this study, we tested the 'LncTar' algorithm [13] with a subset of our data.
The method implemented in LncTar [13] assume that base pairing plays a critical role in RNA-RNA interactions, and works by estimating the approximate binding free energy, delta-G (dG) for a pair of sequences. The value of dG is then normalized by dividing it by the minimum of the length of the two sequences, obtaining normalized delta-G (ndG). Values of ndG which are more negative indicate a higher probability of 'interaction' between the sequences, even when such interactions do not follow the canonical rules of base-pairing, and are calculated by Nearest-Neighbor doublets. Pairs of sequences that surpass a given threshold are reported by the program. In [13] authors test their method with 10 pairs of corroborated lncRNA / mRNA sequences, successfully predicting 8 of the 10 pairs by using a ndG threshold ndG< −0.1 The manual of the program suggests cutoffs for ndG -0.08 (low confidence), -0.10 (low confidence), -0.13 (medium confidence), -0.15 (high confidence), and -0.20 (very high confidence).
It is important to notice that the authors in [13] do not clarify the proportion of false positives that their method is likely to give. They tested 5000 pairs of random lncRNA / mRNAs, and found that the threshold of ndG=-0.10 is surpassed (ndG < −0.1) by around 5% of the random pairs; this implies that with that threshold 5% of the 'significant' pairs will be false positives, even if 8 of 10 experimentally confirmed interactions were recovered. To take a much higher threshold for ndG is not a good solution, given that as discussed in [13], true interaction are dependent on other factors, as stacked pair energy, loop energy and RNA tertiary structure are not taken into account. Thus, any interaction found with this method demands experimental confirmation to demonstrate the putative targeting of a gene by a lncRNA.
LncTa was downloaded from http://www.cuilab.cn/lnctar and installed to perform the analyses. A test run showed that in our computer system the program takes around 4.5 seconds to process a single pair of sequences. Given that we have a total of 59,085 genes, of which 33,758 are protein coding and 25,327 are lncRNAs, the number of comparison of all possible pairs of protein coding and lncRNAs to find putative targets is very large, 33, 758 × 25, 327 = 854, 988, 866, and thus it is unfeasible to perform all these comparisons in a reasonable time, even if the computation time is reduced by the use of a more powerful computer or distributed in a computer cluster.
To make a test run with 'LncTar' we decided to use a subset of the previously described meiotic genes found in the domesticated genotype [5]. From these genes we selected the 10 with largest differences in expression between the parental genotypes.
Reasoning that these 10 DE meiotic genes could be targets of lncRNAs that were also DE between parental genotypes, we selected 473 lncRNAs which were DE between parents with a Q-value ≤ 1e − 20.
With these two sets of sequences we have 10 × 473 = 4730 pair comparisons to do. LncTar was run with these sets and resulted in the detection of 12 pairs lncRNA / mRNA (mRNA are the transcripts of the meiotic genes) with a ndG < −0.15. Table AF1-6 presents the pairs of lncRNA / putative meiotic targets found in the run.  Table AF1-6 we observe that 12 lncRNAs have as putative targets 6 sunflower meiotic genes. In two cases, 3 distinct lncRNAs have the same target (genes AESP and MMD; rows 1 to 3 and 4 to 6, respectively); in two cases two distinct lncRNAs have the same targets (genes MSH5 and ZIP4; rows 7 to 8 and 11 to 12, respectively) while in the remaining two cases (genes RAD51 and XRCC3; rows 9 and 10, respectively) the relation is one to one (lncRNA to target). Note that even if we use the most stringent threshold for ndG quoted in the software manual, -0.20 ('very high confidence'), only rows 5 to 9 and 12 in Table AF1-6 will be eliminated, letting 6 interactions, i.e., 4 meiotic genes with putatively targeted by 6 lncRNAs.
It is also interesting to see if there is any correlation between the expression of the lncRNA and their putative targets. Expression levels in Transcripts per Million (TPM) and correlations between expression levels in the three genotypes are shown in Table AF1-7.
From Table AF1-7 we can see that the expression levels of the meiotic genes is, as expected, much larger than the one for the lncRNA. This was in general the case in all our data; lncRNAs presented a much smaller relative expression than the protein coding genes. Also from From Table AF1-7 we can observe that for some of the pairs 'lncRNA / putative target' there is a high correlation in expression level when measured in the three genotypes (D=Domesticated, F 1 and Wild); rows 4 to 10 and 12 (8/12) have an absolute value ofr > 0.9 Figure AF1-2 present the plot of relative expression by genotype for the case of the meiotic gene c45048 g1 i1-MMD1 which was identified as putative target for the lncRNAs lncRNA-c43609 g2 i1, lncRNA-c66209 g1 i1 and lncRNA-c24499 g2 i1 (rows 4 to 6 in tables AF1-6 and AF1-7).
From Figure AF1-1 we can see that the tendency for expression change between the meiotic gene c45048 g1 i1-MMD1 and the three lncRNAs putatively targeting it (lncRNAs lncRNA-c43609 g2 i1, lncRNA-c66209 g1 i1 and lncRNA-c24499 g2 i1) is very alike; for the four genes the lowest expression is found at the domesticated genotype (D), increasing in F 1 and then given the maximum for the wild genotype (W). The concordance of the expression pattern gives as result the very high correlations (r > 0.9874) between the expression patterns (see rows 4 to 6 in Table AF1-7).
To avoid jumping to conclusions about the reliability of the interactions found, we designed a 'control group' of meiotic genes. These were the 10 meiotic genes with the smallest evidence of differential expression (less significant, P > 0.8) between genotypes. These not-DE meiotic genes were run as putative targets of the set of lncRNAs previously selected. From this run a set of 20 significant (ndG< −.15) interactions between lncRNA and non-differentially expressed meiotic genes were found. Names of genes ndG and gene function are presented in Table AF1-8. Table AF1-8 shows the 20 interactions found between 18 distinct lncRNA (note that lncRNA-c58535 g1 i1 is present in rows 2, 15 and 20) and 7 distinct not-differentially expressed meiotic genes. All the lncRNAs shown in Table AF1-8 are distinct to the ones previously found (tables AF1-6 and AF1-7).
Table AF1-9 shows the expression levels per genotype and estimated correlation (r) between expression levels of lncRNAs and their putative targets (meiotic genes not-differentially expressed).
From Table AF1-9 we can observe that, even when the control set of non-differentially expressed meiotic genes presents relatively homogeneous expression in the three genotypes, some of the absolute values of correlation estimated with the expression in the lncRNAs are large,r > 0.9 (rows 1, 2 and 4) even when in general the correlations are smaller than with the set of differentially expressed meiotic genes (Table  AF1-6).
The logic for the selection of a control group of not-differentially expressed meiotic genes was that given that these genes do not vary in the genotypes, they were unlikely to be controled by lncRNA which are differentially expressed in the same genotypes. However, the number of 'significant' interactions in the 'control' group (tables AF1-8 and AF1-9) is larger, 20 interactions, than the ones detected with the set  of differentially expressed genes (tables AF1-6 and AF1-7), 12 interactions. Even when it cannot be ruled out the some of the interactions found with the 'control' group could be legitimate, it results highly suspicious that more interactions are found where it was expected to find less.
To make a further test of the reliability of the algorithm, we defined a second control group formed by 10 genes with very large expression in the somatic transcriptome but very low expression in meiocytes. The rational for this last experiment is that genes with very low expression in meiocytes, but high expression in the somatic tissues are unlikely to be control by lncRNA differentially expressed in meiocytes. However, this second control experiment produced 49 interactions with a ndG < −0.15. Table AF1-10 summarize statistics for ndG in the three experiments carried out to detect interactions between the set of 473 differentially expressed lncRNAs and the 'Test' set of 10 meiotic differentially expressed genes (tables AF1-6 and AF1-7), the 'Control 1', consisting on 10 meiotic not-differentially expressed genes (tables AF1-8 and AF1-9) and finally 'Control 2', the set of 10 (non meiotic) protein coding genes strongly expressed in somatic tissues but not in meiocytes.
From Table AF1-10 we can see that the number of 'significant' interactions detected (n) is larger in the two control groups ('Control 1' and 'Control 2' rows) than in the original test ('Test' row), demonstrating that very likely many, if not all, of the detected interactions could be biologically irrelevant. It cannot be ruled out that some of the interactions detected could be true interactions between lncRNAs and putative targets, however, there is clearly a large number of interactions that are false positives, otherwise the number of interactions detected in the Test group will be larger, or at least equal, to the number of interactions in the control groups, but the opposite happens. Even with the relatively small number of interactions explored, 470 × 10 = 4730, for the groups of 470 lncRNAs and 10 protein coding genes at each case, n = 12, 20, 49 interactions are detected for 'Test', 'Control 1' and 'Control 2' respectively. Also Table AF1-10. Statistics for ndG in the three groups of genes searched. Test -10 meiotic differentially expressed genes (tables AF1-6 and AF1-7); Control 1 -10 meiotic not-differentially expressed genes (tables AF1-8 and AF1-9); Control 2 -10 protein coding genes strongly expressed in somatic tissues but not meiocytes. Even when the algorithm implemented in [13] will be detecting part of the necessary factors for the interaction of lncRNA with their targets, say the binding free energy for pairs of sequences, it is clear that this parameter even if necessary is not sufficient to predict real lncRNA / mRNA interactions.  Repetitive elements (not TEs) found in the sunflower lncR-NAs. G = genes, NH=No hits found, TR = similar to known sunflower tandem repeats, TR rDNA = ribosomal DNA, TR SSR = microsatellites, U = similar to unknown sequences in the NCBI db, U C61 = similar to contig 61, U LC = unknown repeats (low complexity), U S = similar to unknown repeats in the Sanger library [7].

Transposons (TEs) and Repetitive Elements (REs) in sunflower lncRNAs
To determine which of the lncRNAs reported here contained TEs or REs we performed a blastn analysis with three databases of repetitive elements, SunRep [20], RepBase [9] and Repeat Element Database (PGSB-REcat) [21]. Additionally, we employed the service in line of RepeatMasker.
For the blastn experiments our lncRNAs were compared with each one of the mentioned databases using parameters '-dust no -evalue 1' and results were carefully evaluated to determine the optimal threshold values to include only biologically relevant results, avoiding as much as possible false positives as well as false negatives. We center our analysis in the bitscore of the output that takes into account the length of the alignment, mismatches as well as gaps and that in contrast with the expected value (evalue) does not depend on the size of the explored database. We determined that a threshold of bitscore = 71 produced perfect alignments without mismatches or gaps of length = 38 bp. Thus we set a threshold of bitscore ≥ 70 in order to detect even small fragments of RE in the lncRNA. All the hits that passed the threshold bitscore ≥ 70 had an expected value evalue < 1 × 10 −6 for all databases, thus the criterion for expected value is unnecessary, even if it is always fulfilled.
For the online service of RepeatMasker we uploaded our lncRNA sequences and compared with the full collection of Arabodopsis thaliana, using the best search engine (cross match) with the maximum sensitivity ('slow' option). Table AF1-11 presents the sources of data employed to detect TEs and repetitive elements in our sunflower lncRNAs and summarize the results obtained. From Table AF1-11 we can see that, as expected, the largest number (and percentage) of lncRNAs detected to contain TE or RE elements is obtained with the database from sunflower elements ('SunRep'). However, the other sources for the analyses also gave small numbers of lncRNAs with TE or RE. In some cases the same lncRNA is significant for more than one source, thus row 5 of the table, presenting the total number and percentages of distinct lncRNAs with TE or REs is not equal to the sum of the values of n from rows 1 to 4. The following tables present a more detailed analysis of the results per source.  Table AF1-12) and only in 2 lncRNAs the elements are detected in all four databases (row 13 in Table AF1 -12). The largest number of lncRNAs detected in two datasets, 57 (row 9 in Table AF1-12) happens with databases SunRep and RepeatMasker (RepMar).   Table AF1-13) is equal to 38 bp in row 5 and column 'SunRep'. In general, the minimum and average lengths of the alignments show that our procedure is unlikely to produce false positives, because even small lengths of the lncRNAs containing TE or REs elements were detected.

qRT-PCR analysis of selected lncRNAs
Previously, our RNA-seq data for protein coding genes were validated for 5 genes by qRT-PCR, showing that fold changes in expression are consistent between RNA-seq and qRT-PCR in the domesticated meiocytes and somatic transcriptomes (see Figure 6 in [5]). Here we performed qRT-PCR analysis on selected lncRNAs to compare the estimated fold change in expression between the domesticated (D, HA89) and wild (W, Ac-8) genotypes. A total of 22 lncRNAs were selected for the analysis, and primers were designed using PRI3 online software, optimizing for size and conditions for qRT-PCR. Two lncRNAs were selected as controls, given their uniform expression in the two genotypes, as estimated by RNA-seq results.
qRT-PCR reactions were prepared and run as previously reported in [5] from RNA extracted from meiocytes from the domesticated (D, HA89) and wild (W, Ac-8) genotypes. All reactions were performed four times (technical replicates) with RNA from each source, and final results were analyzed by the 2 −∆∆C T method [15]. Table AF1-14 summarize primary results for RNA-seq and qRT-PCR for the genes tested.
From Table AF1-14 we can see that only 18 of the 23 pairs of primers designed gave a unique specific PCR product to be analyzed by qRT-PCR (column SP), thus for the 5 rows in Table AF1-14 where there was not unique PCR product, all remaining qRT-PCR results are missing. Figure AF1-14 exemplifies cases where the PCR reaction gave a single specific product (panel A), or more than one product (Panel B). Given the fact that lncRNAs are, in general, expressed at absolute concentrations lower than protein coding genes [14], the number of PCR cycles needed to detect the product above the noise threshold (BE in Table AF1-14) is large and prone to errors [15]. In many cases this leads to undetermined values for the mean of the C T cycle ('Und.' in theÊC T columns in Table AF1-14); only genes with keys Con1, G1, G4, G5, G6, G8 and G10 (rows 7, 13, 16 to 18, 20 and 22 in Table AF1-14) have determined values in all columns, and thus can be used to compare RNA-seq with qRT-PCR results by the 2 −∆∆C T method. Table AF1-15 presents some of the possible comparisons of fold changes estimated by RNA-seq (F R ) and qRT-PCR (F q ) using the simple 2 −∆C T comparison.
The first two rows of Table AF1-15 present the comparisons of fold changes for the control genes, Con1 and G1 between the D and W genotypes. From these data we can see that Con1 is a better control than G1, given that its fold change between the genotypes estimated by qRT-PCR is 1.7, while for G1 it is 0.25 For comparisons performed between genes in the wild genotype, it is important to consider that in RNAseq the measurement in transcripts per million (TPM) is only valid when comparing the same gene in two or more conditions [24]. However, when comparing different genes in the same genotype (comparisons 'Within W' in Table AF1-15) TPM measurements can be biassed if the length of the genes are different, because longer genes have a greater probability of accumulating gene tags even if both genes compared are expressed at the same relative expression, thus only comparisons between genes of approximately the same size (column 'GS' in Table qt1) were performed in the comparisons 'Within W' in Table AF1-15. For those cases, all fold changes show the same tendency, even when large differences, as it is expected by the exponential nature of the qRT-PCR estimation [18]. Table AF1-16 presents the comparisons between log 2 fold changes between the D and W genotypes performed by the −∆∆C T method with two independent controls (Con1 and G1).
lncRNA in Sunflower -Additional File 1 In Table AF1- 16 we can see that the tendency of the fold changes between the D and W genotypes for the five genes using qRT-PCR, log 2 (F q ), and RNA-seq, log 2 (F R ), are in general consistent when using two different control genes for the −∆∆C T method, even when, as expected the qRT-PCR measures Table AF1-16. Comparisons of log 2 fold changes for five genes by qRT-PCR using the −∆∆C T method with two independent controls (Con1 and G1), log 2 (F q ) and by RNA-seq, log 2 (F R ).
Control Gene log 2 (F q ) log 2 (F R )

Con1
(a) Specific product for gene with key='Con1'. BE = 26.   Table AF1-14 for keys of genes. are more extreme and variable, given the exponential error structure [18] and low concentration of the lncRNAs.