Genome-wide profiling of Populus small RNAs
© Klevebring et al. 2009
Received: 23 July 2009
Accepted: 20 December 2009
Published: 20 December 2009
Skip to main content
© Klevebring et al. 2009
Received: 23 July 2009
Accepted: 20 December 2009
Published: 20 December 2009
Short RNAs, and in particular microRNAs, are important regulators of gene expression both within defined regulatory pathways and at the epigenetic scale. We investigated the short RNA (sRNA) population (18-24 nt) of the transcriptome of green leaves from the sequenced Populus trichocarpa using a concatenation strategy in combination with 454 sequencing.
The most abundant size class of sRNAs were 24 nt. Long Terminal Repeats were particularly associated with 24 nt sRNAs. Additionally, some repetitive elements were associated with 22 nt sRNAs. We identified an sRNA hot-spot on chromosome 19, overlapping a region containing both the proposed sex-determining locus and a major cluster of NBS-LRR genes. A number of phased siRNA loci were identified, a subset of which are predicted to target PPR and NBS-LRR disease resistance genes, classes of genes that have been significantly expanded in Populus. Additional loci enriched for sRNA production were identified and characterised. We identified 15 novel predicted microRNAs (miRNAs), including miRNA*sequences, and identified a novel locus that may encode a dual miRNA or a miRNA and short interfering RNAs (siRNAs).
The short RNA population of P. trichocarpa is at least as complex as that of Arabidopsis thaliana. We provide a first genome-wide view of short RNA production for P. trichocarpa and identify new, non-conserved miRNAs.
Plants produce a diverse and dynamic population of small RNAs (sRNAs) that are involved in transcriptional and post-transcriptional gene silencing and directing DNA methylation [1–4]. Distinct sub-populations of sRNAs have been identified and experimental evidence derived from Arabidopsis thaliana mutants has shown that each sub-population is derived via distinct biogenesis routes: microRNAs (miRNAs) are produced via the action of the Dicer-like RNAseIII-type ribonuclease DCL1 and are cleaved from precursor near-perfect stem-loop hairpins formed from RNA polymerase II transcripts; 21 nucleotide (nt) endogenous short-interfering RNAs (siRNAs) derive from long, double-stranded RNAs (dsRNA) and are produced via the action of an RNA-dependent RNA (RDR) polymerase; trans-acting siRNAs (TAS) are produced primarily by RDR6 together with SGS3 and DCL4, which yield phased 21 nt siRNAs; 24 nt heterochromatic siRNAs are produced by the action of the DNA-dependent RNA polymerase PolIV, RDR2 and DCL3 [1, 5]. Work in A. thaliana has made extensive use of high throughput sequencing in combination with sRNA silencing mutants to elucidate the roles of genes within the different biogenesis pathways for sRNA classes .
The genus Populus is now firmly established as the model system for forest trees . Populus represents an excellent model, being suitable for studies focusing on commercial traits, such as biomass-yield and wood fibre qualities, as well as association mapping and ecological interaction studies. As seasonal, hard-wood perennials with an extended juvenile phase, Populus species undergo a number of processes that could be expected to involve some degree of epigenetic control and re-programming. As such Populus represents an ideal system in which to further understanding of traits such as seasonal senescence, dormancy/growth-arrest and juvenile to adult phase transition. Populus is also nearly exclusively dioecious, yet currently the mechanism determining gender is unknown. An increasingly compelling body of evidence exists to suggest that chromosome 19 may be in the process of becoming a ZW style sex chromosome [8–11], with the female potentially being the heterogametic sex , although this is certainly not clear . Again this is a trait that could involve an epigenetic component. As long-lived, clonally-replicating species, poplars and aspens also represent an excellent opportunity to identify how such long-lived species may have evolved particular means to survive both abiotic and biotic stresses. Particularly in the case of biotic factors, long-lived species face the challenge of surviving repeated attacks from antagonists with short generation times which are therefore capable of more rapid evolutionary change. Indeed the molecular clock of poplar ticks considerably more slowly than that of Arabidopsis thaliana .
Populus has so far not been exhaustively profiled for sRNAs, especially when compared to A. thaliana where there have now been a number of high-throughput sequencing studies performed [6, 13–18] and where an excellent web resource exists for viewing the available datasets . Previous work in Populus has either consisted of in silico studies or has been performed to a lower sequencing depth and has concentrated only on the identification of miRNAs [12, 20–22]. To date, none have described the genomic distribution of other classes of sRNAs in Populus, despite those representing the majority of sRNAs produced.
We sequenced sRNAs in the sequenced P. trichocarpa genotype (Nisqually-1) using a concatenation approach in combination with massively parallel pyrosequencing (454). Using an established analysis pipeline that draws on the accumulated knowledge gained from analysing sRNA data in A. thaliana , we characterised sRNAs in young leaf material and in particular we concentrated on describing the genomic distribution and context of loci producing numerous sRNAs as well as identifying predicted trans-acting siRNA and miRNA loci.
Sequence reads summary.
Low-complexity (<3 different bases), size-range (18-24 nt)
Perfect match to Populus genome
There have been three publications relating to sRNAs in Populus, two of which used traditional Sanger sequencing [20, 21] with the third using 454 pyrosequencing . To allow direct comparison with our data we used the data from  (Additional files 1 and 2) to re-create a redundant dataset for analysis using the UEA plant sRNA toolkit. This resulted in 2,459 unique sequences between 18-24 nt after filtering (see Methods), of which 563 were perfect matches and 1846 were overlapping with the 80,538 unique sequence reads in our dataset.
One notable immediate difference between our dataset and that of  is the frequency distribution of different sRNA size classes. We found that 24 nt sRNAs were the most abundant size class, with 21 nt sRNAs also being highly abundant (Figure 1D). This is in agreement with previous reports in A. thaliana [6, 16], maize  and Physcomitrella patens . In contrast, the results presented by  showed clear dominance of 21 nt sRNAs with the second highest size class being 22 nts, followed by 24 nts. Re-analysis using the same filtering criteria as for our data again revealed a clear dominance of 21 nt sRNAs, however 24 nt sequences were the second highest class followed by 22 nt (combining data across leaves and vegetative buds). The reason for this difference may be that  used a P. balsamifera clone whereas we used the same genetic clone as was used to produce the genome sequence, and the unusually high sequence variation within or between Populus species complicates analysis .
Since sRNAs often associate with repeats, it was of interest to analyse the distribution of sRNAs in relation to repeats in the Populus genome. However, as compared to Arabidopsis, repetitive element annotation in Populus is less well developed. We performed a RepeatModeler and RepeatMasker analysis of the Populus genome sequence. The results from this analysis are available at PopGenIE [28, 29]. For RepeatModeler-identified repeats the majority of sRNAs did not overlap a repeat and only ten repeats had overlap to >100 unique sRNA with none having >500 overlapping sequences. The maximum number of overlapping sequences was 182 for a repeat that overlaps two NBS-LRR disease resistance gene models. The sRNAs in this region showed a clear enrichment for 21 mers. Within the RepeatMasker data there was clear dominance of 21 nt sRNAs for those repeats with >100 overlapping sRNAs. There were also a number of cases where there was dominance of 22 nt sRNAs or near-equal representation of 22 and 24 nt sRNAs with these being overlaps to LTR retrotransposons. For example a repeat on scaffold_132 (position 308756..310627) showed a majority of 22 nt sequences with substantial numbers of 24 nt sequences and few sequences from other size classes. Such examples are potentially interesting considering the findings of  where it was found that there is greater overlap of 22 nt sRNAs to LTR elements in maize than in Arabidopsis. Future work using biogenesis mutants will be needed to clarify whether 22 nt LTR-associated sRNAs in Populus are more similar to the maize or the Arabidopsis genomes.
In all other repeats there was clear dominance of 24 nt sRNAs with 21 nt sRNAs also having a significant representation. There were also substantial numbers of 22 nt sRNAs while other size classes were insignificant. In general, the RepeatMasker data matched the results that have been reported in A. thaliana [6, 16].
RepeatMasker was used to identify all Populus and Rosid repeats in RepBase  (release 14.03). RepBase contains 169 lineage specific Populus repeats and 1018 Rosid repeats (of which 516 are Arabidopsis specific). There are additionally 176 ancestral/ubiquitous repeats. There were extremely few examples of overlap to repeats within the Rosids dataset with only 212 sRNAs overlapping currently annotated RepBase repeats. Three of these overlapped to a Medicago LINE element and the remainder overlapped with eight annotated LTR elements, five of which were Populus specific. The contrast between these overlap results and those of the RepeatModeler data strongly suggest that there is currently a paucity of public data relating to the repetitive elements in the Populus genome.
Predicted phased and trans-acting loci.
Number of Targets (perfect match)
28(1) NBS- LRR
16(7) NBS- LRR
The genomic context of sRNAs was examined by identifying overlap to predicted gene models and repeats. The vast majority of sRNAs did not overlap gene-coding loci and most of those that did overlapped with <10 unique sRNAs. However, a small number of genes overlapped with a large numbers of sRNAs. For example, the two genes with the highest overlap to sRNAs were estExt_Genewise1_v1.C_91780006 (1,646 sRNAs) and gw1.7267.9.1 (1,351 sRNAs). Nearly all of these overlapping sRNAs map anti-sense to the gene and in both cases there is a ribosomal DNA (rDNA)-like repeat and an LTR retrotransposon within the gene coding sequence. In both cases there was near-equal distribution of sRNAs in all size classes (18-24 mers) suggesting that these are not siRNAs. In contrast other genes with high numbers of overlapping sRNAs showed clear enrichment for a particular size class. For example fgenesh4_pg.C_scaffold_6025000001 showed enrichment for 22 nt and to a lesser extent 24 nt sRNAs and eugene3.00102261 for 21 nt sRNAs. There were also cases where there was near-equal distribution across size classes but with a peak at the smaller sizes (e.g. gw1.376.2.1 and gw1.422.22.1). In these cases, all reads derived from the same strand as the gene and therefore likely represent degradation products of highly expressed genes. These two genes (and two additional genes in the list with >100 overlapping sRNAs) encode psbA, with all four having maximal homology to the Arabidopsis gene ATCG00020, which also shows a similar pattern of sRNA overlap. There were 27 genes with >100 unique overlapping sRNAs (Additional file 4) and 2,969 genes overlapped to at least a single mapped sRNAs. Excluding those genes with >100 overlapping sRNAs, the majority of overlapping sRNAs were 21 mers. Examination of the annotation of the 27 genes with >100 overlapping sRNAs did not reveal any functional over-representation of Gene Ontology categories or the presence of particular types of genes.
Target prediction for the 21 nt siRNAs from this locus identified four ARF genes (two ARF3 homologs, an ARF2 homolog and an ARF4 homolog) suggesting that both target site and target are conserved. Based on our own examination of the ARF family, both ARF2 and ARF3 appear to be duplicated in Populus, with ARF4 existing as a single ortholog. In the case of ARF3 we identified two additional predicted gene models (eugene3.08470003 and eugene3.150910001) that lie within the same branch of the phylogenetic tree generated for the entire ARF family (Additional file 5) but they are truncated and lack the target recognition site for the TAS3a tasiRNA. In the case of ARF2, one of the duplicates (estExt_fgenesh4_pm.C_LG_XII0386) contains three SNPs within the target site. As one of these SNPs is at the 11th base pair of the target sequence, it is quite probable that this copy of the gene is no longer targeted. The other copy (eugene3.00150845) maintains complete homology to A. thaliana within the target site. It would be interesting to examine the functional role of this locus in Populus, as in Arabidopsis it has been shown to control vegetative phase transition . This is a trait of particular interest in Populus as the long juvenile phase represents a significant limitation to breeding programmes.  examined the ARF family in detail and identified six potential ARF2 genes. However only two of these (the two we also identified as duplicated copies) were from the Jamboree gene model set (v1.1) and as a result the remaining four models that they predicted as ARF2 genes were not included in the current analysis.
In A. thaliana, all TAS loci identified to date produce phased 21 nt sRNAs, with the phase being set by miRNA or siRNA cleavage [15, 31, 37, 38]. TAS1 and TAS2 transcripts are targeted by miR173, TAS3 by miR390 and TAS4 by miR828. While miR390 and miR828 are conserved in Populus, there is no evidence of miR173 conservation.  showed that there appears to be little evidence for the production of phased siRNAs of other lengths in A. thaliana. The UEA plant sRNA toolkit ta-siRNA tool implements the algorithm from  to identify potential TAS/phased loci on the basis of such phasing. Using a p value cut-off of 0.001 (see ), 28 potential phased loci were identified containing between three and 14 phased sequences (Additional file 6). In the case of the predicted phased locus on LG_X (Table 2) we could extend the predicted phased locus beyond the region identified (Figure 2A) by assuming that 'missing' sequences in relation to a predicted miRNA target site were produced but not present in the current dataset. The production of potentially phased siRNAs alone is, however, not enough to class a locus as trans-acting so we concentrated on the 12 predicted highly significant loci with p values <0.00001 and used the psRNATarget prediction tool  to identify potential targets for phased siRNAs (Table 2, Additional file 6). Seven of these 12 loci were within predicted protein-coding regions, five are intergenic and one spans across the end of a predicted gene model. Of the five loci located in intergenic regions, all but one overlapped to NBS-LRR repetitive elements and we therefore do not classify these as TAS loci. To date, all eight of the confirmed A. thaliana TAS loci (TAS1a-c, TAS2, TAS3a-c, TAS4) are produced from non protein-coding regions. Phased loci within coding regions have also been reported but are not classed as trans-acting as they typically target in cis or target other members of the gene family from which they are produced [14, 39]. Using this criterion only the locus at LG_VI:551356..551607 (Table 2, Figure 2B) would be classified as a TAS locus.
Among the identified loci are some interesting examples. On LG_II (LG_II:14322242..14322493, p 0.000014), phased sRNAs are produced specifically from the exons of a predicted No Apical Meristem (NAM) gene (gw1.II.959.1) with a cluster of sRNAs being produced from both of the two exons and none from the intervening intron, suggesting that this locus is specific to the transcribed RNA. The loci on LG_III and scaffold_80 contain phased siRNAs that do not have unique hits within the genome. In the case of the three loci on LG_III, all three are homologous and contain largely the same set of siRNAs. From the current study it is not possible to determine whether all three loci actually produce siRNAs. In the case of the locus on scaffold_80 there are numerous genomic hits for each siRNA.
 found that Populus PPR-P genes were predicted to be targeted singularly or dually by either miR475 or miR476. Our data show that Populus PPR-P loci do produce siRNAs, with a subset producing phased siRNAs. However, the gene presented in Figure 5 of  (eugene3.00062011) contains targets sites for both miR475 and miR476 but did not produce siRNAs in the current study or in that of . The A. thaliana TAS1 and TAS2 loci specifically target PPR genes but there is no evidence of homology between these loci and the ones identified in this study. Therefore it would seem that similar evolutionary mechanisms have been deployed to silence the same gene families in both species. Previously miR475 and miR476 predictions were based on the considerably smaller datasets of [20, 21] and had not been well characterised.
We found that particularly miR476 has strong sequence support as a bona fide miRNA. The stem-loop structure and sRNA read distribution for the four miR475 and three miR476 loci can been found in Additional file 7 and 8 respectively.
The phased locus on LG_VI generates siRNAs that are predicted to target the MYB transcription factor gene from which they are generated. Expression of the siRNAs was low within the sample we sequenced and it is therefore hard to predict whether such a potential loop of transcriptional regulation has functional significance. TAS4 in A. thaliana specifically targets MYB transcription factors  however there is no apparent significant homology between the TAS4 locus and the Populus locus identified here. However, it is interesting that phasing of TAS4 is set by miR828 and that there is a miR828 target site in phase with the locus we have identified.
We also performed a search for predicted genomic target sites of miR390, miR475, miR476, miR482.2, and miR828 to examine whether any predicted targets matched the location of identified phased loci. We included miR482.2 because it potentially sets the phase of the locus identified on LG_X:19646453..19647187 (Table 2). However, this did not yield any additional loci producing sRNAs.
Using the findings presented in , we examined this region in greater detail. The available genome sequence for the heterozygous P. trichocarpa female represents only one haplotype at each location, with each chromosome sequence representing a chimeric combination of scaffolds from both haplotypes. For the peritelomeric region of LG_XIX, the alternative haplotype is represented by scaffold_117 , a fact that can be confirmed by examining the presence of the mapped genetic markers ORPM 276 and ORPM 277 (see  for details). The two haplotypes for this region are highly divergent, with contrasting gene content . Of the genes that are in common between the two haplotypes, the vast majority are NBS-LRR genes. Here we show that only the haplotype represented in LG_XIX contains the identified hotspot for sRNA production, with sRNA production being minimal for scaffold_117 (Additional files 10 and 11). The phased locus identified on LG_XIX is also specific to this haplotype. Target predictions for the phased siRNAs and for other sRNAs that had unique alignments to this region of LG_XIX showed almost exclusive targeting of NBS-LRR genes. Of genes targeted by non-phased sRNAs and located on linkage groups (interpreting genes on scaffolds is ambiguous and these are therefore ignored), 26 of 51 genes were located on LG_XIX, 18 of which are located within the first 1 Mb of the linkage group and all of which are NBS-LRR genes. Although the majority of genes shared between the two haplotypes for this region are NBS-LRR genes, no target sites were identified for genes on scaffold_117 and examination of paralog data at PopGenIE suggests that the targeted genes are not those with high similarity to genes on scaffold_117. The notable over-represented production of sRNAs from this region in both the 21 and 24 nt size classes and the pattern of NBS-LRR gene targeting leaves open the possibility that sRNAs and NBS-LRR genes within this region have a role to play in sex determination or the maintenance of reduced recombination. The observed haplotypic divergence for these features is certainly intriguing.
Comparison of all siRNA reads to Viridiplantae mature miRNA sequences in miRBase (Release 13.0) identified matches to 45 miRNA families (allowing two mismatches including 5' and 3' overhang, Additional file 12). The number of matching families increased to 65 when matches were searched for in the miRNA precursor sequences. There are currently 43 Populus miRNA families in miRBase. Perfect matches to 38 of these were identified when matching to precursor sequences and 34 when matching to mature miRNAs. Matching to precursor sequences was performed as we noticed a number of cases where we sequenced a high copy number 21 nt sequence that was predicted as a miRNA but did not match the current miRBase mature miRNA entries. As an example, we identified a sequence matching ptc-miR827 with an expression count of one. At the corresponding genomic locus a miRNA was predicted within our dataset but with the miRNA and miRNA* sequences reversed compared to the current miRBase entry. In our dataset, the miRNA* had a read count of 12. A miRNA corresponding to the current miRBase entry was predicted in , in which no miRNA* sequence was found. This suggests that the two sequences may be under differential control.
Sequences matching the mature miRNA sequence of two conserved families (miR1511 and miR858) that do not currently have Populus mirBase entries were found with sequence counts of 36 and 17 respectively. Neither of the regions where these sequences map to were predicted as miRNAs within our dataset. The sequence matching miR1511 has 19 perfect hits within the genome and two hits containing two mismatches. The majority of these hits are within LTR retrotransposons (based on RepPop annotations) and none of the locations has evidence of a miRNA* sequence. This makes it unlikely that any of the loci represent an actual miRNA, although there is still the potential that the generated sRNA could serve a targeting function. The sequences matching miR858 have two perfect hits within the genome. The matching sequence was also present at a high sequence count (312) in the data of , however no potential miRNA* sequence was found in either dataset. Neither locus produces a convincing hairpin structure, making it unlikely that these are actual miRNAs, despite the perfect homology to mature miRNA sequences in other plant species. Further work is needed to determine whether these loci produce siRNAs that result in target cleavage.
The UEA plant sRNA toolkit miRCat tool identified 414 potential miRNAs (Additional file 13). In A. thaliana, miRCat has been shown to have >90% sensitivity for detecting known miRNAs and to have a specificity of >99%  when applied to the comparable dataset of . Target predictions were run for all predicted miRNAs. All predicted miRNAs matching existing miRBase entries had predicted targets. Predicted miRNAs matched 143 of the 237 Populus miRNAs currently in miRBase, with members from 29 of the 43 families included. Of the remaining sequences, 156 had no predicted targets and 115 had predicted targets. None of the predicted miRNAs without predicted targets appear to be conserved. Functional enrichment for Gene Ontology (GO ) Biological Process categories was carried out for all predicted targets of existing miRBase miRNAs and for all novel predicted miRNAs from this study. In both cases there was dramatic and nearly exclusive over-representation of processes associated with development and pattern formation/specification.
Although all of the predicted miRNAs fulfil the requirements of expression and foldback, the need for stringent annotation criteria has become increasingly evident.  argue that small RNAs that derive from regions apparently able to form hairpins could very well represent false-positives, and that further evidence such as DCL1 dependency or detection of a miRNA* is required to classify a locus as a bona fide miRNA . Since we do not have data on DCL1 dependency, we used the presence of a miRNA* sequence as a classification criterion. However, for predicted miRNAs matching existing miRBase entries only 66 of 143 sequences (46%) had a miRNA* sequence present in our dataset and the majority of current miRBase entries have no miRNA* support.
Novel predicted miRNA loci.
Within the three classes of data (matching miRBase; not-matching and with predicted targets; not-matching and no predicted targets) there were distinctly different sequence length frequencies. The vast majority (76%) of sequences matching miRBase entries were 21 mers. For novel miRNAs with predicted targets, the largest size class was 21 mers (43%), however a notable percentage of 22 mers (12%) and 24 mers (22%) were included. For sequences with no predicted targets, there was clear over-representation of 24 mers (72%), with 21 mers being the only other notable size group (20%). There were also differences in the frequency of sequences starting with a U, which is characteristic of the majority of miRNAs in A. thaliana . For the three classes of predicted miRNAs the percentage of sequences starting with an A, U, C or G respectively was: matching miRBase 5, 79, 15, 1; not-matching, with targets 22, 57, 13, 8; not-matching, no targets 32, 40, 17, 12 (Additional file 16). In all three cases the majority of sequences begin with a U, however the distribution is clearly far more biased towards U in those sequences matching current miRBase entries. It was also the case that most cases not matching miRBase but having an identified miRNA* started with a U. Despite caveats, at least some of the predicted miRNAs will represent non-conserved, young miRNAs but are certainly expressed at low levels and would require considerably increased depth of coverage in order to locate a miRNA* sequence for which a lack of predicted targets is not unexpected .
We examined the three novel predicted miRNAs with the highest sequence counts in more detail (Figure 4A-E). Many of the scaffolds within the current genome assembly are likely to represent haplotypes rather than unassembled sequences. This is particularly true of the shorter scaffolds. VISTA alignment results suggest that the two loci represented in Figure 4C-D lie within duplicated regions, or a least regions that are highly similar and syntenic and Figure 4G shows that they have highly conserved hairpin sequences.
It is likely that several novel miRNA candidates remain to be found in Populus, and deeper sequencing of small RNA libraries from additional tissues, in particular flowers, and conditions is needed. It would, for example, be interesting to sample material before and after the juvenile phase transition. The small number of novel non-conserved miRNAs identified in combination with the evidence that we obtained considerably greater depth of coverage of the small RNA population than previous work suggests that greater depth will now be required to identify miRNAs with moderate to low abundance due to the requirement for a sequenced miRNA*. Sequencing sRNAs from additional tissues and conditions will also help clarify which of the current miRBase ptc-miRNAs can be experimentally confirmed. As was also reported to be the case in rice , many of the current Populus miRNAs in miRBase lack any experimental evidence and at least some of the predicted miRNAs more likely represent siRNAs.
Since the Populus genome has undergone a relatively recent whole-genome duplication, we were interested in how many pairs of duplicated genes had retained miRNA targeting in both copies. Using psRNATarget predictions of miRNA targeting for all miRBase miRNAs (Release 13.0) in combination with data made available on the JGI Populus genome ftp site , an analysis of recent duplicated genes was used to identify those duplets (duplicated gene pairs) where only one of the two copies is predicted to be targeted by a miRNA. Across all of the 45,555 predicted Populus gene models, 167 (≈ 0.4%) were predicted to be targeted by a miRNA. The duplicated dataset contains 6,699 duplicated gene pairs of which the vast majority (6,663) are not predicted to be targeted by miRNAs. Of the 36 duplets remaining (≈ 0.5%) there were 21 where both duplets are predicted to be targeted by miRNAs and 15 where only one of the two duplet copies is targeted. Alignments of these showed that six duplets lost target predictions due to gene models being highly truncated compared to the other copy with the target site entirely missing. There were SNPs or single base pair insertions in seven cases but none of these were at the 10th or 11th position and so target predictions were lost because the threshold score we set (2) was not met. In these cases it is not clear whether the duplet would still be targeted or not. There was one example of an insertion and one of a deletion that did affect the 10th and 11th position of the target site, which would almost certainly result in a loss of miRNA binding.
We profiled the sRNA population in Populus to a depth far exceeding previous efforts. This allowed identification of novel, non-conserved miRNA loci, phased siRNAs and characterisation of the genomic distribution of sRNAs. We identified a region of LG_XIX overlapping the sex determination region and a major cluster NBS-LRR genes as a hot-spot for sRNA production Whether or not sRNAs are more important regulators of developmental transitions in long-lived species such as Populus compared to annual species remains to be established.
Young, but fully expanded, leaves from Populus trichocarpa 'Nisqually-1' trees of ≈ 1.5 metres grown in pots in the Umeå University greenhouses under natural light regime were sampled at noon on March 23 2007. Total RNA was prepared using a modified Trizol protocol . Enrichment and RNA cloning of RNAs in the 18 to 24-nts size range was performed as previously described  with the following modifications: The 3' adapter (Linker1, CTGTAGGCACCATCAAT) was 5'-adenylated and 3'-dedioxymodified (IDT) and the 5' adapter (Acceptor1, atcgtAGGCACCUGAUA, lower case is DNA, upper case is RNA) was chimeric (IDT). The primer for reverse transcription was ATTGATGGTGCCTACAG (IDT). Re-suspension of first-strand cDNA was carried out in 20 μl 0.1 × TE.
After first-strand cDNA synthesis, amplification was carried out in a 20-μl reaction using 2 U of phusion polymerase (New England Biolabs), 5 μl first-strand cDNA and 5 pmol each of the primers FusionA1 (GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGATA) and FusionB1 (GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG). The amplification product was precipitated using 70% ethanol in the presence of 20 μg glycogen (Ambion), washed once in 500 μl ice-cold 70% ethanol and re-suspended in 20 μl of nuclease-free water (Qiagen). The purified product was subjected to emulsion Polymerase chain reaction (PCR) and massively parallel pyrosequencing on 1/16th of a plate in a GS-FLX instrument following the manufacturer's instructions (Roche).
After first-strand cDNA synthesis, six parallel amplification reactions were carried out using 1 μl of cDNA template per reaction 5'-monophosphorylated primers. After pooling, nucleic acids were purified using ethanol precipitation as described above. Two 20-μl concatenation reactions were set up, each containing 5 μl purified amplification product, 4 μl 5 × T4 DNA Ligase buffer, 1 μl T4 DNA Ligase (High concentration, Invitrogen) and 10 μl sterile deionized water. The reactions were incubated for 3 hours at room temperature (≈ 20°C). Concatenated DNA was ethanol precipitated as described above, and subjected to ligation of sequencing adapters according to the manufacturer's instructions (Roche). After emulsion PCR, sequencing on a GS-FLX was carried out following the manufacturer's instructions (Roche).
To extract the small RNA sequences from the data we matched tags of the 10 last and 10 first bases of the 5'- and 3'-tags respectively and extracted the sequence between them. After initial tag searching on the sequenced strand, all reads were converted to their reverse complement and the tag searching was repeated. This was done since the blunt concatenation can take place in any direction (as shown in Figure 1A).
All data filtering and analysis was performed using the UEA plant sRNA toolkit . This is an sRNA analysis pipeline for second-generation sequencing data analysis and implements a number of algorithms based on knowledge obtained primarily from sRNA studies in A. thaliana. Sequences were filtered to only include 18-24 nt entries, to remove known r/tRNAs and low complexity sequences and to keep only those with perfect matches to the published P. trichocarpa genome sequence . The sequenced P. trichocarpa clone is the same one used for the current study.
The overlap between sRNAs and genomic features was examined using the data provided at the PopGenIE web resource . All sRNA sequences have been deposited at the PopGenIE web-resource  and can be viewed within the main genome browser.
The redundant sequence data was used as input for the siLoCo and ta-siRNA prediction tools. For these tools, all options were left at their defaults. A p value of 0.001 was used for the ta-siRNA prediction tool. The output from the ta-siRNA prediction tool was visualised at PopGenIE and loci with p values <10-5 were manually inspected to identify whether the loci could be further extended by allowing for sequences missed due to low expression/lack of sequencing depth.
The miRCat tool was used to identify putative miRNA sequences using the redundant filtered sequence data. A minimum sRNA abundance of 2 was set with a maximum of 16 genomic hits and a size-range of 18-24. This tool also generates a GFF-format file containing details of all predicted miRNAs and non-miRNA sequences. This file was uploaded to the PopGenIE web resource for visualisation. Matches to existing, known Populus miRNAs (miRBase Release 13.0 ) were identified using PatMan . Two searches were run to identify matches to mature miRNAs, one allowing 2 mismatches and another allowing two gaps. The second was used to allow overhang at the beginning or the end of the miRNA sequence as in many cases, we identified a 21 nt sequence for predicted miRNAs where the current miRBase entry is shorter. No gaps within the mature sequence were allowed. We also searched for matches using the predicted hairpin sequence to miRBase hairpin sequences as there were cases where the predicted mature miRNA represented the currently deposited miRNA* sequence.
miRNA target predictions were performed for all current P. trichocarpa miRNAs in miRBase (Release 13.0) in addition to novel predicted miRNAs and TAS siRNAs from this study. We compared target predictions made using the UEA plant sRNA toolkit target prediction tool, the WMD2 target search tool (Web miRNA Designer ), the TargetFinder tool from the Carrington lab [13, 31] and the psRNATarget tool available at . For psRNATarget the maximum expectation was set to two and all other options left as default. The psRNATarget tool reports all potential complementary regions between miRNA/ta-siRNA and target sequences using an improved iterative parallel Smith-Waterman algorithm and a weighted scoring schema allowing each mismatch to be weighted according to the mismatch type and position in the query small RNA (cf ). We found that there was near-complete overlap between the psRNATarget predictions and those from TargetFinder and the UEA toolkit tool but with fewer predicted targets using the maximum expectation values of two. As this likely represents a situation of having some false-negatives but few false-positives, we decided to use the psRNATarget tool. There was also extensive overlap between the predictions from the TargetFinder and UEA target search tool with less overlap between prediction from the WMD2 target search tool on average (data not shown).
Target predictions for miRBase (Release 13.0) sequences from all tools have been made available at the PopGenIE ftp site and are already included in the genome browser.
Potential genomic target sites (i.e. not within the coding sequence of gene models) were identified using PatMan by searching for complementary matches with up to 2 mismatches or by using the psRNATarget tool to search for targets within over-lapping 3.4 Kb genomic regions. Although likely to exclude a number of bona fide targets, we used a maximum expectation value threshold of two in order to minimise false-positives.
We performed both repeat masking of repeats in RepBase  using RepeatMasker  as well as performing de novo repeat identification using the RepeatModeler pipeline . RepeatModeler uses RECON , RepeatScout  and Tandem Repeat Finder  to perform de novo repeat identification. The consensus repeat library generated by RepeatModeler was then used with RepeatMasker. The data contained in the RepPop database  were also used. All repeat runs performed have been made available at the PopGenIE web resource and can be visualised in the genome browser or downloaded from the ftp site.
In order to be able to directly compare our data to that of , we downloaded their supplementary data and created a redundant sequence file to use as input to the UEA plant sRNA toolkit. This was then used to perform exactly the same analysis as described for our dataset. Comparisons to the Barakat data are therefore based on the results from the UEA plant sRNA toolkit tools rather than the results presented in the text of .
All files produced by the UEA plant sRNA toolkit tools for both our data and that of  are available for download from the PopGenIE ftp site.
refers toArabidopsis thaliana throughout.
The authors wish to thank Kicki Holmberg, Christian Nathanaelsson and Annika Åberg for help with preparation prior to 454 sequencing. We also thank Anna Westring for her kind help with graphical issues and Dr. Magnus Bjursell, Dr. Johan Lindberg and Dr. Andreas Sjödin for bioinformatics support, provision of analysis scripts and informative discussion. This work was supported by Swedish Research Council and Knut and Alice Wallenberg Foundation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.