Functional analysis and comparative genomics of expressed sequence tags from the lycophyte Selaginella moellendorffii

Background The lycophyte Selaginella moellendorffii is a member of one of the oldest lineages of vascular plants on Earth. Fossil records show that the lycophyte clade arose 400 million years ago, 150–200 million years earlier than angiosperms, a group of plants that includes the well-studied flowering plant Arabidopsis thaliana. S. moellendorffii has a genome size of approximately 100 Mbp, as small or smaller than that of A. thaliana. S. moellendorffii has the potential to provide significant comparative information to better understand the evolution of vascular plants. Results We sequenced 2181 Expressed Sequence Tags (ESTs) from a S. moellendorffii cDNA library. One thousand three hundred and one non-redundant sequences were assembled, containing 291 contigs and 1010 singletons. Approximately 75% of the ESTs matched proteins in the non-redundant protein database. Among 1301 clusters, 343 were categorized according to Gene Ontology (GO) hierarchy and were compared to the GO mapping of A. thaliana tentative consensus sequences. We compared S. moellendorffii ESTs to the A. thaliana and Physcomitrella patens EST databases, using the tBLASTX algorithm. Approximately 60% of the ESTs exhibited similarity with both A. thaliana and P. patens ESTs; whereas, 13% and 1% of the ESTs had exclusive similarity with A. thaliana and P. patens ESTs, respectively. A substantial proportion of the ESTs (26%) had no match with A. thaliana or P. patens ESTs. Conclusion We discovered 1301 putative unigenes in S. moellendorffii. These results give an initial insight into its transcriptome that will aid in the study of the S. moellendorffii genome in the near future.


Background
Our understanding of biology has been greatly improved by studying genome structure and gene function of a broad sampling of model organisms such as Mus musculus (mouse), Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), Caenorhabditis elegans (nematode), and Arabidopsis thaliana [1][2][3][4][5]. Comparative genomics has made it clear that orthologs of many proteins that act as signal transduction components, transcriptional regulatory factors, and metabolic enzymes can be identified between and among these model organisms [6]. As a result, the knowledge gained from comparative and evolutionary studies of these species can provide insights into homologous processes in a wide range of other organisms, varying from crop plants to humans [7]. Within plants however, most of the efforts in genomics have been focused on crop plants or economically important plants such as Oryza sativa (rice), Zea mays (maize), and Lycopersicon esculentum (tomato) [8][9][10]. Thus, coupled with the sequencing of the A. thaliana genome, these efforts have provided data on only a single branch of the plant evolutionary tree, namely members of the Monocotyledonae and Dicotyledonae, collectively termed the angiosperms and commonly known as flowering plants. As a result, the community of plant scientists has little sequence data on other plant lineages that could provide insights into common mechanisms of how plants develop and survive in a terrestrial environment, nor do they have any kind of evolutionary benchmarks that might reveal how angiosperms have come to dominate most world ecosystems [11].
Clear evidence for the existence of angiosperms is present in the fossil record of the lower Cretaceous (140 million years ago), and some evidence suggests their existence 60 million years earlier, around the same time that conifers and ginkgos arose [12]. In contrast, fossil evidence for the lycophytes is found in strata dated to approximately 420 million years ago [13]. Thus, this clade diverged very early from the lineage that led to all other vascular plants , and has existed on earth over twice as long as plants that are the most common subjects of current laboratory and agricultural research. As such, the study of lycophytes may provide novel insights into plant biology that would not be provided by research that focuses only on flowering plants.
Selaginella is an extant genus of the lycophyte clade. It is sometimes referred to as a 'seed-free' plant to highlight the fact that it has not evolved flowers and seeds in the time since its divergence from other plant lineages. It has a number of characteristics that would make its study convenient for, and valuable to, the plant biology community [11,14]. For example, like many other species of Selaginella, S. moellendorffii ( Figure 2) is a small diploid plant that can be easily grown in the laboratory. Further, it has an approximate genome size of 100 Mbp [14], smaller than that of A. thaliana, and among the smallest published genome sizes for 'seed-free' genera. Because of these attributes, S. moellendorffii was recently chosen as one of the non-crop plants for BAC library construction in a NSF funded Green Plant BAC library Project [15]. More importantly, the Department of Energy Joint Genome Institute A simplified version of the plant phylogenetic tree simplified and condensed from Pryer et al. [11] Figure 1 A simplified version of the plant phylogenetic tree simplified and condensed from Pryer et al. [11]. The tree shows that lycophytes (highlighted) diverged from other vascular plant lineages soon after plants colonized the terrestrial environment. Representative species were chosen from sub-clades within the clades listed, and illustrate major developments in plant evolution including the colonization of land (land plants, L), the development of vasculature (vascular plants, V) and true leaves (euphyllophytes, E), and the evolution of flowers (flowering plants, F), and seeds (seed plants, S).
(JGI) has officially announced that it will sequence the S. moellendorffii genome [16], making this species a target of extreme interest for research into comparative plant genomics, biochemistry, and development.
Expressed sequence tag (EST) sequencing has been used as an efficient and economical approach for large-scale gene discovery [17]. It has also successfully provided frameworks for many genome projects [18,19]. Recently, a large number of ESTs have been generated from various plant species and deposited in GenBank, including both model and crop plants like A. thaliana, rice, wheat, and maize as well as species representative of clades other than angiosperms, such as gymnosperms, cycads, and mosses [20][21][22][23]. Although over 1000 ESTs from another Selaginella species S. lepidophylla, also known as the resurrection plant, have also been deposited in GenBank [20], no manuscript has been published reporting on their analysis. In this paper, we describe 2181 ESTs generated from a S. moellendorffii cDNA library. These ESTs were assembled into 1301 clusters, annotated using the BLASTX algorithm, surveyed for their abundance within the dataset, and classified into functional groups according to the Gene Ontology (GO) hierarchy. Finally, a comparative genomics approach was used for comparing S. moellendorffii ESTs with those of A. thaliana and Physcomitrella patens to look for genes unique to S. moellendorffii.

Generation of S. moellendorffii cDNA library and ESTs
To gain a broad coverage of S. moellendorffii transcripts, we collected and pooled whole S. moellendorffii plants for mRNA extraction and subsequent cDNA library construction. To enrich for full-length cDNA clones, doublestranded cDNA was size-fractionated before cloning. Based upon the average insert sizes of 35 cDNA clones chosen at random from the library, we estimate that the cDNA library has an average insert size of 850 bp. 2304 clones were sequenced from the 5' end of the cDNAs, which generated 2181 vector-trimmed EST sequences with an average sequencing read length of 640 bp.

Assembly of S. moellendorffii ESTs
To identify overlapping EST sequences, reduce sequencing error and produce non-redundant EST data for further functional annotation and comparative analysis, 2181 ESTs were assembled into clusters through stackPACK v2.2 clustering system [24]. Based upon regions of nucleotide identity, EST sequences were merged into contiguous consensus sequences (contigs). One thousand three hundred and one non-redundant EST clusters, putatively regarded as unigenes, were generated, consisting of 291 contigs and 1010 singletons. The cluster size varied from one to 105 copies of any given EST ( Figure 3). Manual inspection of the assembled ESTs identified 10 clusters The morphology of S. moellendorffii  counted as unigenes that may actually represent non-overlapping sequence reads from cDNAs corresponding to four single genes. As an example, three unigenes were found to be best aligned to three different regions of the same protein in a BLASTX analysis (described in the following paragraph), suggesting we lack a complete transcript for their accurate assembly. Conversely, we also found that some clustered ESTs did not necessarily have identical sequences within their overlapping regions. In most of the cases, regions of sequence disagreement within the clusters tend to appear towards the ends of the EST reads, which is likely to be caused by errors generated during sequencing. In some other cases, it may due to failure to discriminate between gene family members during clustering, or allelic diversity in S. moellendorffii.

Annotation of S. moellendorffii ESTs
To annotate S. moellendorffii ESTs, the 1301 putative unigenes were translated dynamically in all 6 reading frames and searched for homology against the NCBI non-redundant (nr) protein database using BLASTX [25]. BLASTX hits with E-values less than 10 -5 were taken to be significant. Among 1301 unigenes, 962 (74%) had BLASTX hits in the nr database, while the remaining 339 (26%) had hits with E-values greater than 10 -5 or no hit. When a less permissive cutoff E-value of 10 -10 was adopted, the numbers of unigenes with BLASTX hits and without BLASTX hits changed slightly to 891 (68%) and 410 (32%) respectively. Our dataset showed that the inferred translation products of most S. moellendorffii ESTs appear to be similar to proteins in other organisms but that there was also a percentage of ESTs that represented potential Selaginellaor lycophyte-specific genes. Interestingly, 15 ESTs had at least their top five BLASTX hits from non-plant organisms, including six from bacteria or cyanobacteria

Highly represented S. moellendorffii ESTs
EST copy number can be used to approximate gene expression levels in an organism, although there are artifacts of cDNA library construction that may limit or over-represent certain transcripts [26]. Table 1 summarizes the first 32 most abundantly represented transcripts in the S. moellendorffii EST collection, having six or more EST copies in each cluster, with their identities putatively assigned by BLASTX analysis of the assembled contigs. As expected, a large number of the S. moellendorffii ESTs are photosyn-thesis-related genes, with 19 clusters containing 213 ESTs (9% of total sequenced ESTs) corresponding to genes involved in photosynthesis. There were seven clusters matching to core proteins of photosynthesis reaction centers, including four subunits of photosystem I (PSI-G, PSI-H, PSI-L, PSI-N), and three photosystem II proteins (PsbW, OEC23, CP22). There were four contigs corresponding to light-harvesting chlorophyll a/b-binding proteins, including one early light-induced protein. We also found ESTs for the RuBisCO small subunit, carbonic anhydrase, plastocyanin, one subunit of cytochrome b 6 f complex, ferredoxin and ferredoxin/NADP oxidoreductase, proteins involved in carbon fixation and photosynthetic electron transport. There were two putative anti- oxidative proteins found within S. moellendorffii ESTs: chloroplastic iron superoxide dismutase and catalase, presumably required for the decomposition of superoxide and hydrogen peroxide [27,28]. The BLASTX results show that all of these highly expressed S. moellendorffii photosynthetic genes had homologs in A. thaliana genome, consistent with previous observation that the photosynthesis machinery has been highly conserved throughout plant evolution.
Three highly expressed S. moellendorffii transcripts corresponded to genes encoding enzymes of metabolism, including an aldolase-like protein, a putative glutamine synthetase cytosolic isoenzyme involved in nitrogen assimilation [29,30], and a putative S-adenosylmethionine synthetase required for the synthesis of the major methyl group donor involved in the methylation of a variety of biomolecules ranging from histones to secondary metabolites, and for the biosynthesis of ethylene [31,32].
Other relatively abundant ESTs included one encoding a putative subtilisin-chymotrypsin inhibitor, exhibiting 49% amino acid sequence identity with the wheat subtilisin-chymotrypsin inhibitor, which may play a role in plant defense by inhibiting the serine proteinases of pathogens [33]. Two transcripts that matched an A. thaliana expressed protein and Pisum sativum core protein may function as membrane channel proteins. Interestingly, one highly expressed EST matched with an E-value of 10 -12 a C. elegans protein of unknown function, and is only more distantly related to an A. thaliana late embryogenesis abundant protein.
There were five highly expressed ESTs that did not yield significant matches using BLASTX (E>10 -5 ). These are putative Selaginella-specific genes and may encode proteins with functions unique to Selaginella or lycophytes. within a 20 amino acid motif). The other three highly expressed S. moellendorffii specific ESTs lack hints for functional annotation. The biological function of the proteins encoded by these genes, and the question of whether high transcript abundance is predictive of high protein expression will be a matter for future investigation.

Functional categorization of S. moellendorffii ESTs
The most sensitive method to find new members of known gene families among EST sequences is to search for homology of the translated ESTs to motifs extracted from a multiple alignment of known gene family members [18]. To functionally categorize S. moellendorffii ESTs using motif homology searches, we translated the 1301 unigenes in six reading frames and imported them into InterProScan [34], which aligned 491 clusters to InterPro entries (E<10 -5 ). Mapping of InterPro entries to GO [35], assigned 343 out of 491 InterPro hits with 562 GO accession numbers. The 562 accession numbers further generated 964 individual GO mappings in the three major ontologies (biological processes, molecular functions and cellular components) [36]. The apparent discrepancies between these values arises from the fact that not all Inter-Pro hits had available GO accession numbers associated with them, one InterProScan entry could be assigned to more than one GO accession numbers, and one GO accession number could be mapped under multiple parental categories [37].
Tables 2 and Figure 4 summarize the GO assignment of S. moellendorffii ESTs in terms of biological processes, molecular functions and cellular components, covering a broad range of the GO functional categories. Using the downloaded A. thaliana GO assignments from the TIGR A. thaliana Gene Index [38,39], we compared the distribution of GO categories between S. moellendorffii ESTs and A. thaliana tentative consensus sequences (TCs). Table 3 shows that the distribution patterns of GO assignments of S. moellendorffii and A. thaliana transcripts were generally similar, with a few exceptions in some categories. Besides the true differences in functional distribution of unigenes, some of the differences could be due to the difference in EST data sources between these two species. For example, in terms of biological processes, A. thaliana has a higher percentage in 'response to stimulus and stress' and 'development' than S. moellendorffii. Considering that among the A. thaliana ESTs in the TIGR database, some were generated from plants at specific developmental stages or from plants exposed to specific biotic or abiotic stimuli, it is very likely that ESTs from orthologs of these genes would be missing from the S. moellendorffii ESTs which were generated from normal mature plants.
The current GO annotations for plants are based solely on the annotated proteins of A. thaliana and O. sativa, both of which are angiosperms. Since the lycophyte clade diverged from other plant lineages 400 million years ago, and 200 million years before angiosperms, it is perhaps to be expected that a large proportion of S. moellendorffii genes could not be accurately assigned to GO categories in the database containing only angiosperm gene entries. We expect that the representation of plant species other than angiosperms will certainly benefit resources as InterPro and in turn will lead to further resolution within GO.

Comparative genomics of S. moellendorffii ESTs
One important objective of comparative genomics is to trace gene evolution including the emergence, development, and loss of orthologous genes in different organisms over evolutionary time [40]. To survey the S. moellendorffii ESTs in an evolutionary context, we used the S. moellendorffii unigene sequences as queries to search for homologous sequences in the A. thaliana and P. patens EST databases using tBLASTX algorithm (cut off E-value = 10 -6 ). There were two reasons that we chose A. thaliana and P. patens ESTs as tBLASTX databases. First, A. thaliana and P. patens are representatives of the most diverged lineages of land plants, namely angiosperms and bryophytes. They flank Selaginella in the plant phylogenetic tree, and last shared a common ancestor over 400 million years ago [23], thus providing ample opportunity for the evolutionary divergence of individual genes and gene families. Second, the large quantities of A. thaliana and P. patens ESTs in GenBank (472,278 and 104,027 respectively) provide a substantial coverage of the transcriptome in these two species. Using them as BLAST databases makes it possible Note that one gene product may be assigned to more than one GO terms, and one children term can fit into multiple parental categories. The representation means the number of non-redundant ESTs that can be mapped to a certain GO term. The representation percentage is based on the total number of GO mappings in each of the three major ontologies (biological process: 420, molecular function: 364, cellular component: 180).
to do a relatively comprehensive genomic analysis even in the absence of the full genome sequence of P. patens. Figure 5 summarizes the distribution of S. moellendorffii ESTs by tBLASTX results. Among 1301 non-redundant S. moellendorffii ESTs, 788 (61%) ESTs had homology with both A. thaliana and P. patens ESTs. These ESTs probably identify non-dispensable genes, which tend to be evolutionarily conserved in all land plants [41]. 168 (13%) ESTs had exclusive similarity with A. thaliana ESTs, and may represent the genes that evolved in land plants after the divergence of bryophytes, or those that were lost from the genomes of mosses.  [47]. Only 8 (1%) S. moellendorffii ESTs had similarity only with P. patens ESTs. These ESTs may represent genes that arose early in plant evolution but were lost later after the divergence of the lycophytes. It should be noted, however, that all eight of these ESTs had relative low tBLASTX score (E-value around 10 -10 ), limiting our certainty that the homologous ESTs in P. patens are true orthologs. Finally, there were 337 (26%) ESTs that had no tBLASTX match in the A. thaliana and P. patens EST databases. These ESTs may be Selaginella-specific genes, possibly having evolved only in lycophytes after their divergence from other lineages or having arisen after the divergence of bryophytes and later being lost in euphyllophytes.

Conclusion
We sequenced 2181 ESTs from the lycophyte S. moellendorffii, putatively representing 1301 unigenes. Our data showed that a large proportion of the genes had homologous genes in the well-studied model plant A. thaliana and other plant species. By browsing the putative functional annotations of these ESTs, researchers will be able to choose S. moellendorffii genes of interest and compare them to their othologs in other species. We also found a substantial number of putative Selaginella-specific genes that do not share similarity with known genes, with some of them even representing very highly expressed genes. Representations of Gene Ontology (GO) mapping results for S. moellendorffii non-redundant ESTs  Considering the complexity of the plant kingdom and a time span more than 150 million years between the divergences of lycophytes and angiosperms, it will not be surprising to identify gene functions in S. moellendorffii that are not present in A. thaliana. When the draft genome sequence of S. moellendorffii is completed and released, this EST resource will also play an important role in the mapping and annotation of the genome. As a member of a clade that arose after the bryophytes and before all other vascular plants, S. moellendorffii will provide new opportunities in studying plant evolution, particularly those adaptations relating to fundamental traits that facilitated the transition of green plants to the land, such as lignification in vascular plants, root/stem/leaf organography, complex patterns of sporophyte branching, and the elaboration of reproductive structures.

Genomic PCR
To amplify the genomic sequences of the two most highly expressed ESTs (SmoC1_cn126 and SmoC1_cn125) in S. moellendorffii, PCR was performed using genomic DNA extracted from 50 mg fresh tissue of S. moellendorffii as described previously [49] as template and two pairs of PCR primers designed from their EST contig sequences: CC1170 (5'-CGAGCTCGTAGTGATAGTGTC -3') and CC1171 (5'-AACCATAGGAGAGGAAGACC-3') for SmoC1_cn126; CC1228 (5'-ATAGCTTAGCTGCTTTCT-TCTC-3') and CC1229 (5'-ATACTACTCATGTCGCAGCTC -3') for SmoC1_cn125. PCR was performed using an initial 2 min denaturation at 94°C, followed by 25 cycles, each consisting of a 0.5 min denaturation at 94°C, a 0.5 min annealing at 50°C, and a 1 min extension at 72°C. These 25 cycles were followed by a 5 min extension at 72°C. PCR products were purified using QIAquick PCR Purification Kit (QIAGEN) and sequenced at Purdue Genomics Center.

Authors' contributions
JKW constructed the S. moellendorffii cDNA library, participated in the EST sequencing, carried out the bioinfomatic analysis of the ESTs, and performed the genomic PCR for two transcripts. MT participated in the S. moellendorffii cDNA library construction and provided comments on the manuscript. CC conceived the study and coordinated work. JKW and CC wrote the article. All authors read and approved the final manuscript.