De novo characterization of the gametophyte transcriptome in bracken fern, Pteridium aquilinum
© Der et al; licensee BioMed Central Ltd. 2011
Received: 24 May 2010
Accepted: 8 February 2011
Published: 8 February 2011
Skip to main content
© Der et al; licensee BioMed Central Ltd. 2011
Received: 24 May 2010
Accepted: 8 February 2011
Published: 8 February 2011
Because of their phylogenetic position and unique characteristics of their biology and life cycle, ferns represent an important lineage for studying the evolution of land plants. Large and complex genomes in ferns combined with the absence of economically important species have been a barrier to the development of genomic resources. However, high throughput sequencing technologies are now being widely applied to non-model species. We leveraged the Roche 454 GS-FLX Titanium pyrosequencing platform in sequencing the gametophyte transcriptome of bracken fern (Pteridium aquilinum) to develop genomic resources for evolutionary studies.
681,722 quality and adapter trimmed reads totaling 254 Mbp were assembled de novo into 56,256 unique sequences (i.e. unigenes) with a mean length of 547.2 bp and a total assembly size of 30.8 Mbp with an average read-depth coverage of 7.0×. We estimate that 87% of the complete transcriptome has been sequenced and that all transcripts have been tagged. 61.8% of the unigenes had blastx hits in the NCBI nr protein database, representing 22,596 unique best hits. The longest open reading frame in 52.2% of the unigenes had positive domain matches in InterProScan searches. We assigned 46.2% of the unigenes with a GO functional annotation and 16.0% with an enzyme code annotation. Enzyme codes were used to retrieve and color KEGG pathway maps. A comparative genomics approach revealed a substantial proportion of genes expressed in bracken gametophytes to be shared across the genomes of Arabidopsis, Selaginella and Physcomitrella, and identified a substantial number of potentially novel fern genes. By comparing the list of Arabidopsis genes identified by blast with a list of gametophyte-specific Arabidopsis genes taken from the literature, we identified a set of potentially conserved gametophyte specific genes. We screened unigenes for repetitive sequences to identify 548 potentially-amplifiable simple sequence repeat loci and 689 expressed transposable elements.
This study is the first comprehensive transcriptome analysis for a fern and represents an important scientific resource for comparative evolutionary and functional genomics studies in land plants. We demonstrate the utility of high-throughput sequencing of a normalized cDNA library for de novo transcriptome characterization and gene discovery in a non-model plant.
As the sister lineage to seed plants, ferns represent a critical clade for comparative evolutionary studies in land plants [1, 2]. In contrast to seed plants, ferns typically retain the ancestral condition for a suite of life history traits (e.g. the lack of secondary growth, homospory, motile sperm, and independent free-living gametophyte and sporophyte generations). Ferns are thus an important outgroup for studying the evolution of wood, seeds, pollen, flowers, and fruit among other economically important characteristics found in seed plants, as well as the evolution of development in these complex structures and the expansion of gene families associated with seed plant evolution (e.g. transcription associated proteins). For reasons not yet fully understood, ferns typically have much higher chromosome numbers and larger genomes than seed plants [1, 3, 4].
Understanding the factors that influence these differences and their evolutionary consequences will require developing genomic resources in ferns to provide the necessary comparative context to understand the evolution of these traits [3–5]. Additionally, because ferns have evolved and maintained free-living and photosynthetic gametophyte and sporophyte life stages, they are an ideal group for studies of life-cycle evolution in land plants and genome function in independent haploid and diploid phases.
Among the genomic tools available for non-model organisms, expressed sequence tags (ESTs) have proven to be a rapid and cost effective strategy to develop sequence markers for comparative evolutionary and functional studies. While taxonomic sampling of plants in genome-scale projects has expanded substantially with dramatic decreases in sequencing cost, and increases in throughput, the development of genomic resources in ferns has lagged far behind those of other plant groups. This deficit has primarily been attributed to technical and economic barriers associated with the complex and large genomes in ferns, but is compounded by the limited agronomic value of most ferns . To date (December 2010), genomic information in ferns is limited to a genetic linkage map  and a modest EST data set comprised of about 5,000 Sanger cDNA sequences  for Ceratopteris richardii, just over 30,500 ESTs for Adiantum capillus-veneris , and over 2,600 ESTs in Osmunda lancea . Fewer than 500 ESTs for Pteridium aquilinum have previously been sequenced and deposited in Genbank .
With the introduction of cost efficient and massively-parallel high-throughput sequencing technologies, genome-scale studies in non-model organisms are being actively pursued for gene discovery, expression profiling, SNP and SSR marker development, and studies in functional, comparative, and evolutionary genomics in taxa where little or no previous genomic information exists [10–27]. We chose the Roche 454 GS-FLX Titanium pyrosequencing technology to sequence a full length enriched normalized cDNA library for the gametophyte generation of the bracken fern, Pteridium aquilinum (L.) Kuhn.
Pteridium (family: Dennstaedtiaceae) is a cosmopolitan fern genus comprised of several closely related species that are well differentiated from other genera in the family. Pteridium aquilinum is the most widespread of the bracken species and is distributed throughout the northern hemisphere and Africa . Bracken is notorious as a weed in open fields and is toxic to people and livestock. Despite its toxicity, bracken is eaten as a delicacy in several parts of the world, and due to its often high local abundance and large coarse stature, is sometimes used as thatching or packing material. Because bracken is common, easily cultured and manipulated, and can have a major economic impact, it has become one of the most intensively studied fern species.
Bracken has been used as a model system for the study of the fern life cycle [29–37], gametophyte development and the pheromonal mechanism of sex determination [38–45], cyanogenesis , carcinogenesis [47–49], invasion ecology [50–52], and climate change . Pteridium aquilinum has a diploid chromosome count of 2n = 104 and a total genome size of about 9.8 Gbp .
This study was conceived to develop an extensive expressed gene sequence resource in ferns for evolutionary and functional genomics. We present the first comprehensive transcriptome characterization for a fern gametophyte, including an assessment of transcriptome coverage, gene family and functional representation, SSR identification, and a comparative analysis of gene sets across land plants.
Number of sequences
Mean length (bp)
Standard deviation on length (bp)
Mode length (bp)
Median length (bp)
Range in length (bp)
2 - 624
78 - 624
86 - 3229
Total length (Mbp)
Assembly summary statistics
Primary assembly (MIRA)
Secondary assembly (MIRA+CAP3)
Number of reads assembled into contigs
Number of reads discarded during assembly
Number of 454 reads retained as singletons
Number of primary contigs (MIRA)
Number of secondary contigs (CAP3)
Total number of unique sequences (unigenes)
Mean unigene length (bp)
Largest unigene length (bp)
Total assembly length (Mbp)
Mean read depth coverage
Transcriptome coverage estimates: ESTcalc
Number of technologies
454 GS-FLX (Titanium)
Predicted assembly statistics
Total Assembled Sequence (MB)
Mean unigene length (bp)
Mean unigene length (longest unigene per gene, bp)
Singleton yield (%)
Percent transcriptome (%)
Percent of genes tagged (%)
Percent of genes with 90% coverage (%)
Percent of genes with 90% coverage by largest unigene (%)
Percent of genes with 100% coverage (%)
Percent of genes with 100% coverage by largest unigene (%)
Taxonomic distribution of unigene blastx hits in the nr database
Best blastx hit
Lowest common ancestor for blastx hits
Number of unigenes
Percent of unigenes with hits
Number of unigenes
Percent of Unigenes with hits
Artificial sequences, hits don't pass threshold, or taxon not assigned
A total of 2,679 perfect di-, tri-, tetra-, and pentanucleotide simple sequence repeats (SSRs) longer than 9, 8, 6, and 5 repeats, respectively, were identified in 2,285 unigenes (Additional file 5) using msatCommander . Sufficient flanking sequences existed to design high quality primers for 548 potentially amplifiable SSR loci. PCR primers were chosen using Primer3  as implemented in msatCommander  (Additional file 6). Since this RNA was extracted from gametophytes derived from spores collected from a single diploid sporophyte, we are unable to determine the level of variation present at these SSR loci in natural populations.
Repetitive transposon classification
Number of elements
Percentage of sequence
Rolling circle Helitrons
We have used high-throughput sequencing data to characterize the gametophyte transcriptome of Pteridium aquilinum, a species for which very little genomic data are available. These data represent an 865-fold increase over the expressed sequence data previously available for Pteridium in Genbank .
Because contaminant adapter/primer sequences, polyA/T repeats, and low complexity end sequences can substantially compromise de novo assembly and can be difficult to completely remove (KM Dlugosch, personal communication), we aggressively filtered and trimmed the reads beyond the default instrument-level processing routines at the cost of sequence information loss (approximately 11.6 Mbp were removed, representing 4.4% of the filter-passed data).
Considering the sheer quantity and depth of sequencing produced by next-generation sequencing platforms, we deemed this an acceptable level of loss to improve accuracy in the assembly. We also used a two-step assembly strategy to minimize redundancy in our final unigene sequence set. We adopted this approach because MIRA is able to handle the large number of reads produced by 454 sequencing and utilizes a multi-pass strategy to identify and correct sequencing and assembly errors to produce a highly accurate assembly, but is sensitive to uneven sequencing depth of coverage and allelic diversity, resulting in a high number of redundant contigs. CAP3 is a proven and efficient DNA sequence assembler that can be used to join highly similar overlapping sequences, but is unable to handle the large number of reads produced by new high-throughput sequencing platforms. By combining these two assembly tools, we were able to join contigs and singletons that failed to assemble in MIRA to reduce sequence-level redundancy in our final unigene sequence set.
In examining the taxonomic distribution of nr blastx hits for the unigenes, we identified only a small proportion of sequences with best blast hits or LCA assignments outside of the green plants. When we examine these hits in greater detail, we find that many of them only align to short conserved domains, are hypothetical proteins of unknown function from model organisms, or are genes which are conserved across broad taxonomic levels, such as cytochrome P450, alpha-tubulin and dynein proteins. Additionally, because no other fern genomes have been sequenced, some of these sequences may represent novel fern genes. Thus, the evidence indicates that there is very little heterospecific sequence contamination in these data.
While the simulations that underlay ESTcalc are based on the well characterized Arabidopsis thaliana floral transcriptome (approximately 18,000 genes with transcripts averaging 1,500 bp long) and assume perfect cDNA normalization and sequence assembly, Wall et al.  show that their results were highly predictive for empirical datasets from diverse eukaryotic species and tissues, making their simulations useful as a null model for predicting transcriptome coverage in other organisms. The predictions for transcriptome coverage produced by ESTcalc are largely consistent with that observed in our two-step assembly. However, the larger assembly size, greater number of unigenes, and shorter unigene lengths observed in our data set relative to the ESTcalc prediction may be explained by imperfect cDNA normalization or inefficient de novo assembly. Additionally, it is also becoming evident that with increased transcriptome sequencing throughput, it is possible to capture a richer, more nuanced, picture of transcriptome complexity (e. g. partially processed transcripts and alternative splice forms) [73–75]. This increased information content, however, presents significant challenges for de novo assembly and often results in a fragmented or partially redundant assembly [76, 77]. Also consistent with the ESTcalc estimate that we have tagged all of the transcripts present in this sample, our unigene accumulation curve shows that the rate of new unigene detection for this cDNA library has declined to the point that additional sequencing is unlikely to detect new genes, but may however serve to condense and join non-overlapping contigs in our assembly. Similar approaches to evaluate sufficient sampling in transcriptome projects have been used by other researchers when other information about the transcriptome is absent [15, 78].
The GO functional categories represented in the Pteridium gametophyte transcriptome are not significantly different from the suite of functional categories present in the full Arabidopsis genome GO annotation. Most of the unigenes annotated with a cellular component are localized to plastids or mitochondria, but a large number of them are also targeted for ribosomes or the plasma membrane (Figure 3). The molecular function of unigenes is heavily dominated by binding nucleic acids or proteins and metabolic activity, including hydrolase and kinase activity (Figure 3). The biological processes represented include all of the major cellular processes from transport and cellular organization to transcription, translation, and metabolism (Figure 3).
Visual examination of annotated/colored KEGG maps (not shown) indicates that we have captured all of the genes required for glycolysis, the citrate cycle, and plant hormone biosynthesis including gibberellin, abscisic acid, strigolactone, cytokinin, brassinosteroid, and auxin. We also detected Enzyme code signatures for most of the genes involved in nucleic and amino acid metabolism and chlorophyll biosynthesis.
The PlantTribes database contains an objective MCL cluster-based classification system for plant genes and gene families [60, 79, 80]. By identifying similar sequences in this classification system, we assigned unigene sequences with putative gene family identities. The most abundant of these gene families present in the unigene set was the pentatricopeptide repeat protein (PPR) family, with over 600 unigenes classified as PPR proteins. We were also able to identify 65 unigenes classified in the MADS-box transcription factor family. Using this classification, gene sequences from Pteridium can be extracted for gene families of interest for use in studies of gene family evolution or phylogenomics. The overlap in orthogroup membership and blast hits for proteins in Arabidopsis thaliana, Selaginella moellendorffii, and Physcomitrella patens is similar (Figure 4A/B), but some striking differences can be observed. In both the PlantTribes and blast-based Venn diagrams, most of the unigenes which were identified in Arabidopsis, Selaginella, or Physcomitrella are also shared across all three species. In the PlantTribes classification, most of the genes are shared with Arabidopsis (21,649 unigenes), the species in this comparison that shares the most recent common ancestor with Pteridium, while slightly fewer and approximately equal gene representation is shared with Selaginella and Physcomitrella (19,649 and 19,485 unigenes, respectively). This is in contrast to the blast-based examination of gene set overlap in which Arabidopsis again has the greatest number of unigenes with hits (23,148 unigenes), but Physcomitrella has hits with 6,122 more unigenes than Selaginella (Figure 4). At first this seems counterintuitive because Pteridium shares a more recent common ancestor with Selaginella than with Physcomitrella. This pattern may be explained by the expression of "gametophyte genes" in Pteridium that are conserved with genes in the Physcomitrella genome, however little is known about the expression and evolution of genes between sporophyte and gametophyte stages. Both Physcomitrella and Pteridium have maintained a homosporous life cycle with a large independent gametophyte stage. These life history differences may also play a role on the selective pressures and/or constraints influencing gene evolution and more work is needed to address these hypotheses.
In our examination of Arabidopsis gametophyte genes, we identified gametophyte-expressed homologs in Pteridium for over half (52.7%) of the previously characterized Arabidopsis gametophyte specific or enriched genes. This finding suggests that a highly specific suite of genes required for gamete production and syngamy may be conserved over long periods of evolutionary time, despite substantial differences in life cycle and reproductive strategies between Pteridium and Arabidopsis. It should be noted also that these conserved genes are not the genes required for meiosis because the tissues sampled in this study and those used to identify gametophyte specific genes in Arabidopsis were all post-meiotic. An in-depth study of sporophytic genes in Pteridium is needed to better understand the evolution and expression of genes between the sporophyte and gametophyte stages.
This study is the first comprehensive sequencing effort and analysis of gene function in the transcriptome of a fern and represents the most extensive expressed sequence resource available in ferns to date, nearly 16 times more data than exists for Adiantum capillus-veneris. These data are an important new scientific resource for comparative evolutionary studies in land plants and will be of great value for studies of genome structure and function in ferns. These data can be used to develop microarrays for gene expression assays or serve as a reference transcriptome for future RNA-seq experiments in Pteridium. As additional genome-scale projects in diverse plants are undertaken, these data will be of immense value in representing ferns, the sister clade to seed plants, in comparative genomic analyses.
Pteridium aquilinum ssp. aquilinum spores (collection number: Wolf 83; sourced from a single sporophyte individual collected in Norwich, UK) were sown onto sterile agar nutrient media containing Bold's macronutrients and Nitch's micronutrients (prepared as described by ) and grown under white light. Whole gametophytes including both vegetative and sexually mature male, female, and hermaphroditic individuals of various ages (up to 9 months from germination) were flash frozen in liquid nitrogen and ground to a fine powder. Total RNA was isolated using the Sigma Spectrum Plant Total RNA Kit, incorporating on-column DNase I (Qiagen) digestion during extraction to remove traces of genomic DNA. Total RNA was concentrated by precipitating in 2.5 M ammonium acetate and 70% ethanol, then resolubilizing the RNA pellet in RNase-free water to approximately 500 ng/μL. Total RNA was quantified and its quality verified using an Agilent Bioanalyzer 2100. Total RNA was sent to the Center for Genomics and Bioinformatics at Indiana University, Bloomington (IU CGB), where a normalized transcriptome (cDNA) library optimized for Roche 454 GS-FLX Titanium sequencing was prepared .
Briefly, full-length enriched cDNA was synthesized with the CloneTech SMART cDNA synthesis kit using modified 454-ready adapter/primer oligos (K Mockaitis, unpublished). The frequency of abundant cDNA species was reduced using the Evrogen Direct Trimmer normalization kit. Normalized cDNA was fragmented by sonication, blunt end repaired, and ligated to custom 454 sequencing adapters. Amplification of the sequencing library incorporated an adapter-mediated PCR suppression effect to preferentially amplify ligation products suitable for 454 sequencing . The final transcriptome library was size selected and 454 sequencing proceeded according to the manufacturers recommended protocol on 3 regions of a four region PicoTiter plate.
Sequence reads generated in this study were deposited in the NCBI sequence read archive (SRA012887). Raw sequence reads that passed instrument software quality filters were trimmed of custom oligonucleotide adapter sequences (Justin Choi, unpublished, IU CGB). The resulting sequences were further processed with SeqClean  and SnoWhite v1.0.3  to remove low quality, short, and contaminant sequences, and to aggressively trim polyA/T sequences. Cleaned reads were assembled de novo in MIRA v3rc4 [55, 85] using a minimum percent identity of 94% to align reads, retaining singleton reads in the assembly (-OUT:sssip = yes). This primary assembly was passed through a secondary assembly step in CAP3 (95% identity, 25 bp overlap)  to reduce redundancy in the final assembly and join additional contigs. Custom perl scripts were used to extract summary information about the reads and assemblies (Tables 1 and 2; Additional file 7).
We utilized a web-based tool, ESTcalc , to estimate the predicted level of transcriptome coverage for our data set. Input parameters to ESTcalc require that we specify the sequencing technology used (or a combination of technologies) and either the total sequencing level (Mbp), or the number of reads and an estimate of read lengths. We used the best approximation for sequencing technology available (454 GS-FLX) and the empirical values observed for the cleaned sequence data (254 Mbp or 681,722 reads with an average of 372.6 bp/read) to obtain our estimates. The estimates reported were identical whether we parameterized on total sequence or supplied read length information as well.
To determine the number of eukaryotic ultra conserved orthologs (UCOs ) we captured in the Pteridium transcriptome data set, we queried a list of 357 UCO coding sequences from Arabidopsis (sequences available at: http://compgenomics.ucdavis.edu/compositae_reference.php) into the unigene set with an e-value threshold of 1e-10 using NCBI tblastx. These blast results were then parsed to determine then number of UCOs with a positive hit that returned an amino acid alignment greater than 30 residues long.
We assessed the changing rate of new gene detection as a function of sampling effort (unigene accumulation curve, Figure 2) using a bootstrapped random sampling protocol implemented in a custom perl script (Additional file 8). This script uses the empirical distribution of read number per unigene in our final assembly to randomly sample reads one at a time and tracks the total number of unigenes detected at each step. Because the order of sampling can impact the shape of this curve, we computed 1,000 replicate random sample orders and calculated the mean number of unigenes detected after each draw. To evaluate the level of variation in the number of unigenes detected, we also calculated the 95% confidence interval on the number of unigenes. Using this curve, we then estimated the number of reads it took to capture an average of 90%, 95%, and 99% of the unigenes and the average number of additional reads required to detect each of the last 10 unigenes.
To evaluate for the presence of potential contaminating sequences, we examined the taxonomic distribution of blastx hits for the unigene set in the NCBI nr protein database using an e-value threshold of 1e-10. The top 10 blast hits for each unigene were kept and examined in MEGAN v.3.7.2 . MEGAN is a tool built for the examination of metagenomic data sets and provides a number of useful functions to explore the information content of large blast results. The blast results for each unigene were mapped onto the NCBI taxonomy tree by examining just the best hit (lowest e-value) or by using the lowest common ancestor (LCA) algorithm . LCA was determined using at least three blast hits with a bitscore greater than 75 and within 10% of the top bitscore for that unigene.
The same blast search used to examine the taxonomic distribution of blast hits was used to identify putative homologous proteins and annotate each sequence with gene ontology (GO) terms using Blast2GO [65–67]. Blast2GO was also used to automatically handle InterProScan (IPS) searches to identify conserved protein domains in translations of the longest ORF in each unigene. Any GO terms associated with IPS hits were then merged into the blast-based GO annotation. GO terms were then used to map enzyme codes to each sequence. Enzyme codes were then used to automatically color and retrieve KEGG pathway maps [86, 87]. As a final step in examining a broad functional representation of the gametophyte transcriptome, GO terms were mapped to the reduced GO-slim ontology and visualized and explored with directed acyclic graphs (not shown) and summarized with filtered pie charts including GO categories represented by at least 150 sequences (Figure 3)
Unigenes were classified into tribe- and orthoMCL clusters in the PlantTribes2.0 database using a custom Perl pipeline (dePamphilis lab, unpublished) which queries each unigene against the complete inferred protein set from ten plant species that have complete sequenced genomes, using a blastx e-value threshold of 1e-10. Unigenes were assigned to MCL clusters based on the best blast hit. Species used for blast searches and gene clustering in the PlantTribes2.0 database include: Chlamydomonas reinhardtii v3.0, Physcomitrella patens v1.1, Selaginella moellendorffii v1.0, Oryza sativa v5.0, Sorghum bicolor v1.0, Vitis vinifera v1.0, Populus trichocarpa v1.0, Medicago truncatula v1.0, Carica papaya v1.0, and Arabidopsis thaliana (TAIR7). Meta-information about each assigned cluster was extracted from the database for each unigene and was output to a file. Simple text-based searches examined this information to retrieve gene family names and putative gene family functional data. The shared single copy tribes  for Arabidopsis, Vitis, Populus, and Oryza were identified in the PlantTribes2.0 database and the number of these tribes detected in the unigene set was determined by examining the pipeline output file. Orthogroup assignments for Pteridium unigenes were examined for cluster membership by Selaginella, Physcomitrella, and Arabidopsis to generate a Venn diagram showing putative gene level overlap (Figure 4A). Unigenes were also directly queried against each of these proteomes using a blastx e-value threshold of 1e-10 to examine the distribution of similar proteins in these three species. Venn diagrams were generated to graphically illustrate the overlap of unigenes for each proteome (Figure 4B). To screen for the presence of putative gametophyte genes, a list of male gametophyte specific genes in Arabidopsis was extracted from the microarray study of Honys and Twell  and a combined list of significantly enriched female gametophyte genes was compiled from the studies of Wuest et al.  and Yu et al. . These Arabidopsis gametophyte gene lists compiled from the literature were examined for overlap with the list of genes producing significant alignments with Pteridium unigenes in the blastx search against the Arabidopsis genome to produce a Venn diagram of gametophyte genes (Figure 5).
Funding for this research was provided by the Vice President for Research and the Center for Integrated Biosystems at Utah State University. The Department of Biology and the Ecology Center at Utah State University provided funds for publication processing charges. Keithanne Mockaitis directed the cDNA library preparation and sequencing at the Center for Genomics and Bioinformatics at Indiana University, and Justin Choi pre-processed the primary sequencing data. Michael Pfrender, Aaron Duffy, Katrina Dlugosch, Eric Wafula, Paul Cliften, Amanda Grusz, Jacob Davidson, and Kristal Watrous assisted with various aspects of this project or manuscript preparation.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.