Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions
- Andreas D Zimmer†1,
- Daniel Lang†1,
- Karol Buchta1,
- Stephane Rombauts8, 9,
- Tomoaki Nishiyama5,
- Mitsuyasu Hasebe6, 10,
- Yves Van de Peer8, 9,
- Stefan A Rensing2, 3, 7 and
- Ralf Reski1, 3, 4
© Zimmer et al.; licensee BioMed Central Ltd. 2013
Received: 6 March 2013
Accepted: 19 July 2013
Published: 23 July 2013
The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation.
Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant “flagship” genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the http://www.cosmoss.org model organism database.
Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5’-UTR introns in the moss. Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
Given its phylogenetic key position as an early diverging land plant that bridges the gap of about one billion years between the unicellular green algae and flowering plants, the moss Physcomitrella patens (Physcomitrella) unites most of the attributes desirable for a model organism, including a short generation time, small stature, comparatively low morphological complexity, a haplo-dominant life cycle, traceable cell lineage, high growth rate and simplicity of genetic transformation. Combined with the potential for evolutionary-developmental (evo-devo) studies, these traits have become increasingly attractive to a wide range of plant scientists. Over the last two decades, a growing community has established P. patens as a model organism with a well-developed molecular toolbox including the uniquely efficient gene targeting via homologous recombination and comprehensive genomics resources which have been made available early on using the central web service cosmoss.org [1–4]. The moss is also a promising model for green biotechnology [5–9], which allows the production of safe recombinant proteins with eukaryotic post-translational modifications in competitive quantities.
The draft genome sequence of the moss Physcomitrella patens was published in 2008. The availability of the genomic sequence and the established molecular toolbox provide the ideal foundation for extensive comparative and evo-devo studies. This is reflected in the publication record: a growing body of researchers from all fields has begun to apply Physcomitrella as an additional model organism for comparative studies [11–20]. Various evo-devo studies have demonstrated the ability of Physcomitrella transgenes to act as functional orthologs in cross-species complementation assays using A. thaliana mutant lines (e.g. [21–24]). Additionally, large-scale analyses and cross-kingdom comparisons increasingly utilize the moss as a representative organism for the plant kingdom [25–29]. This ongoing interest, the available resources, the active community, and the moss’ attractive phylogenetic position recently led the U.S. Department of Energy’s Joint Genome Institute (JGI) to select P. patens as a “plant flagship genome”.
The quality of a genome annotation is the bottleneck for any form of downstream and comparative analyses. Particularly affected by flaws are the large-scale, high-throughput approaches employed in systems biology. Following its initial V1.0 annotation, the P. patens genome annotation has been iteratively improved. The draft V1 assembly was based on whole-genome shotgun Sanger sequencing at 8.6x clone depth and comprises 2536 V1 scaffolds. This number was reduced to 2106 in the released V1.1 after removal of bacterial contaminations. After an additional round of scaffold filtering, released as V1.2, the genome sequence of the 27 chromosomes is still scattered over 1995 genomic scaffolds. The filtering of the gene catalogue, in particular by removal of transposable elements and other non-protein coding regions, led to the prediction of 27,966 protein-coding genes. The results of a survey conducted on behalf of the JGI at the annual moss meeting 2011 clearly show that for most groups (79%) the initial sequencing and release of the genome was already “essential”. However, many research topics were listed that “would be enabled if a highly complete and accurate reference genome for Physcomitrella was available”, revealing the need for advanced genome annotation. For example, there were cases of well-characterized moss genes that were present in the genomic sequence but were missing from the gene catalogue. In V1.2 only 4515 (~16%) gene models had both 5’ and 3’-UTRs (untranslated regions). Over 23,000 genes missed either 5’-UTR or 3’-UTR annotation and thus were incomplete. Furthermore, functional annotation was only available for 41% of the genes and hardly any of these annotations were backed by traceable experimental evidence. A further shortcoming was that no established and universal means for scientists and curators existed to link (published) knowledge on moss genes to the digital representations scattered across several databases.
Model organism databases (MODs, e.g. Gramene, TAIR, FlyBase) are integrated, web-accessible resources, which are a prerequisite for the success and the quality of a reference genome. They act as general repositories for all kinds of research data and scientific knowledge that is generated by the scientific community. Thus, they provide the necessary infrastructure for researchers working with model species, are the focal point for scientists new or outside the field and provide conceptual interfaces for data exchange with more general data repositories (e.g. NCBI, UniProt, Ensembl, Phytozome, and PLAZA) to enable comparative analyses and ensure overall data quality [41, 42].
Experience from various MODs and whole genome sequencing projects shows that automatic annotation without substantial manual curation is not sufficient to ensure data quality and knowledge discovery. An active community is necessary to transfer all available data, especially the biological knowledge covered by the scientific literature, to the genome annotation.
Initially, the cosmoss.org resource was set up to provide access to the P. patens virtual transcriptome assemblies and annotation using BLAST services, keyword search and sequence retrieval. Subsequently, it was extended to provide services for splice site prediction, for mining gene families of transcription associated proteins (PlanTAPDB) and to predict dual protein targeting (ATP). Since the release of the initial P. patens genome assembly, the resource cosmoss.org additionally provides access to the draft genome sequence, the genetic map, and the filtered genome annotation V1.2. Moreover, cosmoss.org serves as a platform to coordinate the analysis and annotation of the P. patens genome sequence. As part of this, a wiki and several mailing lists have been set up to report and discuss the results within the community. Additionally, an integrative genome browser serves as a main entrance point for the exploration of the moss genome and annotation. The integrative cosmoss.org browser is based on the Gbrowse software and provides base pair level resolution for large-scale annotation data covering predictions for all different kinds of genomic regions ranging from protein-coding genes, transposable elements and repeats to tRNA, rRNA, miRNAs, and other non-protein coding RNAs. Furthermore, the annotations are linked to the cosmoss.org internal annotation resources as well as to GenBank, Pfam, miRBase and comparative genomics resources like Phytozome and PLAZA.
While a significant part of cosmoss.org is based on our analyses, we are continuously integrating external published data, e.g. sRNAs, miRNAs and EST (expressed sequence tag) or short read data from the sequence read archive (SRA) and from collaborators around the world. In addition, the cosmoss.org gene annotation releases are shared and hosted at the NCBI and the comparative plant resources, Phytozome, PLAZA, and PlantGDB. In July 2009, the Physcomitrella community annotation services were transferred from the JGI website to cosmoss.org and the resource now functions as the central annotation repository for the moss P. patens.
Here, we report the complete re-annotation of the P. patens genome assembly V1 (Table 1), demonstrate its utility for comparative analyses and introduce the extensions to the cosmoss.org resource to act as a permanent central genome annotation repository and model organism database involving:
The improvement of gene structures with specific focus on the incorporation of transcript evidence to cover alternative splice variants and to derive UTRs.
Complete renewal of the functional annotation.
Prediction and annotation of non-protein-coding genes.
Integration of user annotations: Structural and functional annotations.
Integration of manual annotations: Development of community annotation services.
Table 1 P. patens genome annotation releases
Column sources: Rensing et al.; Lang et al.
Rows: genome size (Mb); protein-coding genes with EST support; annotated as alternatively spliced; genes with UTRs; gene density (kb per gene); exons / gene; mean exon length (bp); mean intron length (bp). Gene structures altered since previous release: loci added with plant homologs; models added with Pfam domain; loci added with Pfam domain; models filtered out; eukaryotic type signal recognition particle RNA (SRP).
Results and discussion
Improved structural annotation of the P. patens genome
The cosmoss.org Physcomitrella patens V1.6 genome annotation reported here is the result of iterative rounds of evidence mapping, repeat masking, gene structure prediction, filtering and model selection and harbors annotation of protein-coding genes, transposable and repetitive elements and, for the first time, definition of non-protein-coding loci. The release comprises 32,275 protein-coding genes, 432 tRNA loci, 798 rDNA regions, 229 miRNA precursors (108 families), 213 snRNA genes, and 6 SRP (signal recognition particle) loci. Considering the number of miRNA families, P. patens with 108 families has an intermediate position between the green alga C. reinhardtii (47 families) and the flowering plant A. thaliana (187 families). Consistent with previous findings [3, 10] about half of the genome consists of full length LTR retrotransposons and related fragments including chromodomain-containing gypsy LTR retrotransposons (Tcn1) shared by fungi and non-flowering plants.
V1.6 protein-coding gene predictions are based on multiple sources of evidence. Of prime importance are ESTs (Additional file 1: Table A1) from 19 different experimental conditions, tissue types and developmental stages providing a reliable basis for gene structure prediction. The combined transcript evidence was used to train species-specific prediction models using SpliceMachine and EuGène, followed by the generation of weighted consensus gene structures using EVidenceModeler and PASA. The prediction procedure was repeated iteratively. Each round involved several filtering steps for separation of non-protein coding, repeat and transposable element-associated genes. The interim versions V1.3-1.5 were not published but are the basis for V1.6 (see Additional file 2: Table A2). The whole protein-coding gene prediction and annotation process is summarized in Additional file 3: Figure A1. The annotation release V1.6 comprises 32,275 loci coding for 38,357 protein-coding transcripts (see annotation releases overview Table 1). 26,722 (~83%) loci are supported by transcript evidence (i.e. EST or full-length cDNA). V1.6 genes have a mean length of 2369 bp and a mean transcript length of 1389 bp. The number of only 1582 unchanged gene models from V1.1 to V1.6 is an excellent indicator of the extent of changes and improvements that led to the current release. While the changes in release V1.2 (Table 1) were restricted to the removal/filtering of non-protein-coding genes (7972 filtered models), the complete annotation process leading to V1.6 (Additional file 3: Figure A1) resulted in 22,307 (~80% of V1.2) updated models, 8387 (25% of V1.6) new loci and 4077 (15% of V1.2) models which have been removed due to non-protein coding/transposable element origin. Of the new loci, 1535 transcripts (1338 genes) are part of a gene family with at least one additional plant species, and 2196 transcripts (1456 genes) encode at least one Pfam domain.
Also, published genes and gene models released by the scientific community were mapped, manually curated and integrated into V1.6.
Inspection of UTR annotations indicates that gene structure completeness is much improved in V1.6. The number of protein-coding loci with both 5’ and 3’ UTRs increased from 4515 to 15,757. The median transcript length increased from 987 (V1.2) to 1248 bases (V1.6), which can be explained by a higher percentage of annotated UTR regions in this release. V1.6 gene models only contain UTR annotations if they are fully supported by transcript evidence.
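The completeness figures can be cross-checked with a few lines of arithmetic. The sketch below only uses counts quoted in this article (27,966 genes in V1.2, 32,275 loci in V1.6, and the 4515 and 15,757 models with both UTRs); the variable and function names are illustrative and not part of any cosmoss.org tooling.

```python
# Counts quoted in the text (V1.2 versus V1.6).
V12_GENES = 27_966        # protein-coding genes in V1.2
V12_BOTH_UTRS = 4_515     # V1.2 gene models with 5' and 3' UTRs
V16_LOCI = 32_275         # protein-coding loci in V1.6
V16_BOTH_UTRS = 15_757    # V1.6 loci with 5' and 3' UTRs


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


gained = V16_BOTH_UTRS - V12_BOTH_UTRS          # additional complete models
v12_share = share(V12_BOTH_UTRS, V12_GENES)     # completeness in V1.2
v16_share = share(V16_BOTH_UTRS, V16_LOCI)      # completeness in V1.6

print(gained, v12_share, v16_share)
```

The difference corresponds to the 11,242 additional complete genes and the two shares to the "~16%" and "~49%" reported for V1.2 and V1.6.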
Annotation and characterization of alternative splicing
We used the PASA pipeline to study the extent and characteristics of alternative splicing (AS) in the moss. We used GenomeThreader transcript alignments in place of PASA’s standard GMAP alignments, as GenomeThreader supports alignment of transcript evidence to multiple (nearly identical) loci. This is important for covering the nearly identical tandemly-arrayed genes observed in P. patens [3, 10] and other segmental duplications.
Table 2 Summary statistics of alternative splicing in P. patens genes
Columns: # Sub clusters; % Sub clusters involved in alt-splicing
In V1.6 we have incorporated information on AS for the first time into the released genome annotation. The integration of AS into the annotation process leads to 3500 (10.8%) annotated loci with an average of 2.52 transcripts per locus and a maximum of 11 transcripts in V1.6. 1775 loci (51% of AS transcripts; 5.5% of all genes) have an altered coding sequence (CDS) due to alternative splicing resulting in 2380 distinct proteins. In contrast to A. thaliana, where the analysis of large-scale full-length cDNAs suggested most splicing events to occur outside of coding regions in the 5’-UTR, alternative splicing in moss seems to affect UTRs and CDS regions to a similar degree. 2948 alternative transcripts are due to alternative splicing in the UTRs of 1991 loci (56% of AS transcripts; 6.2% of all genes). Extensive alternative splicing of moss 5’-UTRs was observed previously in an individual study of the MDHAR genes.
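As a plausibility check, the genome-wide shares given above follow directly from the locus counts. The sketch below recomputes only the "of all genes" percentages from counts stated in the text; the constant names are chosen for illustration.

```python
# Locus counts reported for annotation release V1.6.
TOTAL_LOCI = 32_275        # protein-coding loci
AS_LOCI = 3_500            # loci annotated as alternatively spliced
CDS_ALTERED_LOCI = 1_775   # loci whose CDS is altered by alternative splicing
UTR_AS_LOCI = 1_991        # loci with alternative splicing in their UTRs


def percent_of_all_loci(count: int) -> float:
    """Share of all protein-coding loci, in percent, one decimal place."""
    return round(100.0 * count / TOTAL_LOCI, 1)


as_share = percent_of_all_loci(AS_LOCI)
cds_share = percent_of_all_loci(CDS_ALTERED_LOCI)
utr_share = percent_of_all_loci(UTR_AS_LOCI)

print(as_share, cds_share, utr_share)
```

The CDS and UTR shares are of similar size, which is the basis for the statement that alternative splicing affects both regions to a similar degree.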
Insights into the evolution of gene structures along the green lineage
Table 3 Protein-coding gene statistics of selected Viridiplantae
Rows: gene length [bp]; transcript length [bp]; CDS length [bp]; exon length [bp]; intron length [bp]; exons per gene; introns per gene; 5'-UTR exon length [bp]; 5'-UTR intron length [bp]; 3'-UTR exon length [bp]; 3'-UTR intron length [bp]; 5'-UTR length [bp]; 3'-UTR length [bp]; multi exon transcript; single exon transcript; transcripts with both 5' and 3'-UTR; transcripts with 5'-UTR; transcripts with 3'-UTR; transcripts without UTR; multi exon 5'-UTR; single exon 5'-UTR; multi exon 3'-UTR; single exon 3'-UTR.
Improved functional annotation of moss proteins
Table 4 Comparison of the Gene Ontology (GO) annotation of P. patens V1.2 and V1.6
Rows: total GO terms; genes with GO terms; genes with BP; genes with MF; genes with CC.
Table 5 Selected GO categories: P. patens in comparison to A. thaliana
Column: GO term id
Two component system – histidine kinases and response regulators: two-component signal transduction system (phosphorelay); two-component sensor activity; two-component response regulator activity.
Further categories: photosynthesis, light harvesting; ciliary or flagellar motility; L-phenylalanine catabolic process; glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) activity.
In summary, we improved the GO annotation of P. patens gene products in general and demonstrate that the annotation can serve as a solid basis for comparative and exploratory analyses. Compared to A. thaliana (TAIR10), where only 7% of all genes have no GO annotation, there are still more than 40% of all loci without any GO annotation for P. patens, and those with annotation are mainly inferred by electronic annotation (IEA). While a significant fraction of this 40% probably comprises orphan genes, special focus should be placed on the improvement of the functional annotation of the P. patens protein-coding genes to unravel its full potential as a reference model organism. To facilitate such efforts we developed the cosmoss.org community annotation interface to browse and alter the P. patens annotation (see the genonaut section below).
Gene families (clusters) were retrieved by clustering a multitude of sequences of Archaeplastida proteins (Additional file 5: Table A4) using OrthoMCL. Parameters were optimized as described in Methods to target Archaeplastida gene families sensu stricto, i.e. families of genes that evolved by speciation and duplications after the divergence of the red/green lineage from a single gene in the last common ancestor. Multi-gene or superfamilies thus are split across multiple clusters. The number of P. patens loci in clusters is 32,733; subtracting the 32,275 annotated protein-coding loci leaves 458 that fall into multiple clusters. Some of these cases are due to fragmentary or false structure predictions caused by fusion of two or more distinct gene loci into a single locus and will be resolved in future.
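The bookkeeping behind the surplus of 458 cluster memberships can be illustrated on toy data. The following is a minimal sketch, assuming cluster assignments are available as a mapping from a cluster name to its member loci; the cluster and locus names are hypothetical and the function is not part of OrthoMCL.

```python
from collections import Counter


def membership_summary(clusters):
    """Count cluster memberships per locus.

    clusters: dict mapping a cluster name to a list of member loci.
    Returns (total memberships, distinct loci, loci found in more than one cluster).
    """
    counts = Counter(
        locus for members in clusters.values() for locus in set(members)
    )
    memberships = sum(counts.values())
    multi_cluster_loci = sorted(locus for locus, n in counts.items() if n > 1)
    return memberships, len(counts), multi_cluster_loci


# Hypothetical toy input: locusB appears in two clusters.
toy_clusters = {
    "family_1": ["locusA", "locusB"],
    "family_2": ["locusB", "locusC"],
    "family_3": ["locusD"],
}
memberships, distinct_loci, multi = membership_summary(toy_clusters)

# The same subtraction applied to the numbers reported in the text.
surplus_reported = 32_733 - 32_275

print(memberships, distinct_loci, multi, surplus_reported)
```

Listing the loci that occur in more than one cluster gives a starting set for inspecting possible fused gene models.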
Analysis of intron-loss and gain in the green lineage
Similar to 5’-UTRs, where we observe a striking number of multi-exon regions in moss, the overall number of single-exon transcripts is remarkable. While Chlamydomonas and Volvox contain fewer than 10% single-exon transcripts (Table 3 and Additional file 4: Table A3), this fraction is on average ~19% in other land plants, while moss possesses 23.4%. This may be due either to fragmentary gene predictions and residual non-protein coding genes, or may reflect secondary intron losses. It has been observed that introns and their positions are highly conserved during land plant evolution. In A. thaliana and O. sativa intron losses outnumber intron gains. Furthermore, there is evidence for secondary intron loss; e.g. a moss sedoheptulose-1,7-bisphosphatase (SBP) gene which lost six out of seven introns. Such intron losses might account for the relatively high percentage of single-exon genes in P. patens. One suggested mechanism for intron loss involves the reverse transcription of an mRNA followed by the (partial) replacement of the genomic DNA copy by an intron-less cDNA via homologous recombination, called retrocopying. An increased rate of intron loss might thus be facilitated by the extraordinarily high rate of DNA repair by homologous recombination in P. patens. This hypothesis is based on models proposing a prominent role of gene conversion and DNA repair in intron loss [84, 85].
Table 6 Number of introns in Viridiplantae
Columns: amount/fraction of intron-less genes; amount/fraction of genes with fewer introns than the median intron numbers of other plants.
About 3% (941) of the P. patens genes seem to have lost their introns entirely, i.e. are of putative retrocopy origin, which is in the range of the other land plants (Table 6). The extension of the analysis to Arabidopsis, rice and Chlamydomonas supports the findings from the comparison of absolute numbers of single-exon genes between algae and land plants described in the previous sections. The alga has significantly fewer (0.7%) single-exon genes than the three land plants (3-4%) under study. One likely scenario is an increased activity of transposons resulting in a secondary, maybe more recent, intron gain in algae [69, 86]. This view is supported by the observation that intron positions are often not conserved between the two algae and the land plants (data not shown). Considering the comparable rate of intron loss in plant gene families, the fact that the total number of multi-exon transcripts in P. patens is similar to vascular plants, that only 60% of the 8979 single exon models are supported by expression evidence, and that more than half of these transcripts are shorter than 500 bp (which is less than half of the mean of P. patens transcripts, 1389 nt), leads us to conclude that a significant number of the predicted single exon genes represent fragmentary predictions, non-protein coding genes or pseudogenes.
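The two criteria named in this argument, missing expression support and a length below 500 bp, can be turned into a simple screen for questionable single-exon models. The sketch below is a hypothetical illustration and not a filter applied in V1.6; it also recomputes the single-exon share, assuming the 8979 single-exon models are counted against the 38,357 V1.6 transcripts.

```python
# Counts quoted in the text.
SINGLE_EXON_MODELS = 8_979
ALL_TRANSCRIPTS = 38_357
MEAN_TRANSCRIPT_LENGTH = 1_389   # nt, mean of P. patens transcripts
SHORT_CUTOFF = 500               # bp, threshold mentioned in the text

single_exon_share = round(100.0 * SINGLE_EXON_MODELS / ALL_TRANSCRIPTS, 1)


def is_questionable(exon_count: int, length_bp: int, has_expression_support: bool) -> bool:
    """Flag single-exon models that are short or lack expression evidence.

    Hypothetical rule for illustration; multi-exon models are never flagged.
    """
    if exon_count != 1:
        return False
    return length_bp < SHORT_CUTOFF or not has_expression_support


print(single_exon_share)
print(is_questionable(1, 420, True))    # short single-exon model
print(is_questionable(1, 1500, True))   # long, supported single-exon model
```

Models flagged by such a rule would be candidates for manual inspection rather than automatic removal, since genuine retrocopies are also expected among single-exon genes.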
Gene family size evolution in Viridiplantae
Among these expansions house-keeping and metabolic gene functions are most prominent, independently supporting analyses of paralogs retained after the proposed whole genome duplication event ~45 million years ago and of the unique presence of identical tandemly-arrayed genes. Prominent examples of this expanded category of genes are abundantly expressed components of multimeric protein complexes like the ribosome and proteasome. In total, 86 of the clusters represent the different abundantly expressed protein components of the ribosome. On average these harbor ~1.5 times more genes in P. patens than in A. thaliana. For the 36 clusters representing structural components of the proteasome we observed on average 1.6 times the gene complement of Arabidopsis in the moss. These expansions are also detectable in the functional comparison of the two species using GO enrichment analysis (Additional file 6: Table A5).
In addition, families encoding for smaller complexes and monomers are expanded, including the Light Harvesting Complex II (LHCII) major antenna, Ribulose-1, 5-bisphosphate carboxylase oxygenase (RuBisCO) small subunit, TOC12 (translocase of the outer chloroplast membrane 12) and components of the splicing and translation machineries.
The expansion of house-keeping and metabolic gene functions is also mirrored by the findings of the GO enrichment analysis comparing P. patens and A. thaliana (Table 5 and Additional file 6: Table A5), which revealed an increased complement of genes involved in translation, oxidation-reduction, electron transport, microtubule based movement, glycolysis, and ATP synthesis. Additionally, specific expansions occurred which might represent lineage-/species-specific adaptations (e.g. expansins, MIKC*-type MADS box transcription factors, late embryogenesis abundant (LEA) proteins, early response to dehydration (ERD) proteins, cationic peroxidases) and in enriched GO “biological process” annotations (e.g. chitin catabolic processes, the phosphoenolpyruvate-dependent sugar phosphotransferase system, cell wall macromolecule catabolic processes, phosphatidylinositol-mediated signaling, chromatin assembly or disassembly, cell redox homeostasis, ciliary or flagellar motility and DNA repair). Some of these enriched categories like “ciliary or flagellar motility”, which can be explained by the absence of flagellated sperm in flowering plants, confirm the findings of previous analyses [10, 69, 88]. The majority of the above listed categories and families (nine out of 13) represents true novel insights which will help to unravel the much-cited qualities of the mosses in coping with abiotic and biotic stressors like their unique ability to repair DNA damage [89–91].
Extension of the cosmoss.org resource to provide a permanent, central model organism database and annotation repository for P. patens
Protein-coding gene models were clustered to loci and the resulting locus definitions were used to derive information-rich locus identifiers (Cosmoss Gene ID; CGI; see Additional file 7: S2 and Additional file 8: Figure A2 for details).
With the development of genonaut, we have extended cosmoss.org by the capability to annotate Physcomitrella patens genes with regard to gene name, product name, description, and Gene Ontology (GO) terms. The interface which allows searching, browsing and editing of annotation is modular and can be extended to support additional (ontology) annotations as gene features. Traceability of annotations in terms of author and experimental evidence is crucial for quality assessment of information retrieval. Thus, the genonaut interface accepts the alteration of an existing gene description only if the source is specified. Integration of multiple sources into a unique abstraction layer is achieved by assigning unique and permanent Cosmoss Reference IDs (CRID) to author statements from all sources. The highest quality author statements are experimental evidence provided as references to peer-reviewed publications. The easiest way to achieve this is to provide a valid PubMed ID and the system automatically retrieves all relevant information from NCBI PubMed. If the source is a publication that is not tracked in PubMed, a custom reference can be created. If no publication is available, as the “weakest” possible evidence, a note in form of a text comment or web link describing the evidence is required. Besides the references, the genonaut interface allows linking to other resources via database cross-references (Dbxref).
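The rule that every change must carry a source, ranked from PubMed reference to custom reference to free-text note, can be sketched as a small data structure. This is an illustrative model only and does not reflect the actual genonaut implementation; the class, field and tier names are assumptions, and the digits-only check merely stands in for a real PubMed lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Evidence attached to an annotation change (hypothetical model)."""
    pubmed_id: Optional[str] = None
    custom_reference: Optional[str] = None
    note: Optional[str] = None

    def tier(self) -> str:
        """Return the strongest evidence tier present, or fail if none is given."""
        # Stand-in validity check; a real system would query NCBI PubMed.
        if self.pubmed_id and self.pubmed_id.isdigit():
            return "pubmed"
        if self.custom_reference:
            return "custom_reference"
        if self.note:
            return "note"
        raise ValueError("an annotation change requires a source")


# Placeholder values for illustration only.
print(Source(pubmed_id="12345678").tier())
print(Source(note="see lab web page").tier())
```

Rejecting a change at construction of the evidence record keeps unsourced descriptions out of the annotation store by design.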
The gene products can be further annotated using GO terms. To allow convenient manual GOA, the genonaut interface assists the annotator by allowing the user to browse the appropriate GO namespace by keywords to assign the correct terms. In addition to the mandatory reference for a genonaut annotation, we have integrated the assignment of the GO evidence codes. In this way the quality of each assigned GO term is directly discernible. As traceability is crucial for the maintenance of annotation quality, the genonaut system traces every annotation change using a history system. Thus it is possible to trace changes and possibly revert to a previous state if needed, but more importantly to comprehend the annotation history of every gene and annotation version.
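A history system of this kind can be modelled as an append-only list of annotation states, where a revert is itself recorded as a new version. The sketch below is a simplified illustration under that assumption; the class and field names are hypothetical and do not describe genonaut internals.

```python
from copy import deepcopy


class AnnotationHistory:
    """Append-only version history for one gene annotation (hypothetical model)."""

    def __init__(self, initial: dict):
        self._versions = [deepcopy(initial)]

    def update(self, author: str, **changes) -> int:
        """Store a new version with the given changes; return its version number."""
        state = deepcopy(self._versions[-1])
        state.update(changes)
        state["changed_by"] = author
        self._versions.append(state)
        return len(self._versions) - 1

    def revert_to(self, version: int, author: str) -> int:
        """Restore an earlier state as a new version so the revert stays traceable."""
        restored = deepcopy(self._versions[version])
        restored["changed_by"] = author
        self._versions.append(restored)
        return len(self._versions) - 1

    def current(self) -> dict:
        return deepcopy(self._versions[-1])

    def version_count(self) -> int:
        return len(self._versions)


history = AnnotationHistory({"gene_name": "", "description": "hypothetical protein"})
good = history.update("annotator1", gene_name="exampleA")
history.update("annotator2", description="mistaken edit")
history.revert_to(good, "curator")
print(history.current())
```

Because nothing is overwritten, the full sequence of states remains available for review after the revert.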
Whereas the annotation browser capabilities are publicly available, the editor functions are restricted to registered cosmoss.org users (December 2012: 228 annotator accounts). The cosmoss.org curator team acts as a superior authority which supervises and validates the user provided annotations by direct personal communication.
Moreover, the genonaut interface provides a starting point to retrieve detailed annotation about P. patens protein-coding genes. Besides the possibility to search and edit the annotations and annotation history, the genonaut interface is linked to the sequence retrieval, genome browser and sequence viewer providing transcript and protein domain annotations.
To support the manual curation of gene structures we have integrated and adapted the Apollo structural gene annotation editor for the cosmoss.org genome browser. Generated user_models (December 2012: 830 manually curated transcripts) are assigned CGIs (Additional files 7 and 8) from the user model namespace, extended using the author’s username to allow multiple versions per locus (e.g. Pp1s275_35U2 ➔ Pp1s275_35U2__zimmer.1).
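The naming scheme in the example above can be expressed as a pair of helper functions. The pattern, CGI followed by a double underscore, the username, a dot and a version number, is inferred from the single example given here and may not cover every identifier in use; the function names are hypothetical.

```python
import re

# Pattern inferred from the example "Pp1s275_35U2__zimmer.1" (an assumption).
USER_MODEL_PATTERN = re.compile(r"^(?P<cgi>.+)__(?P<user>[^.]+)\.(?P<version>\d+)$")


def user_model_id(cgi: str, username: str, version: int) -> str:
    """Build a user model identifier from a CGI, a username and a version number."""
    return f"{cgi}__{username}.{version}"


def parse_user_model_id(identifier: str):
    """Split a user model identifier into (CGI, username, version)."""
    match = USER_MODEL_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a user model identifier: {identifier!r}")
    return match["cgi"], match["user"], int(match["version"])


print(user_model_id("Pp1s275_35U2", "zimmer", 1))
print(parse_user_model_id("Pp1s275_35U2__zimmer.1"))
```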
Here we describe the complete re-annotation of the P. patens V1 genome assembly comprising structural and functional annotation of protein-coding genes and, for the first time, description of non-protein loci including tRNA, rRNA, miRNA, snoRNA and snRNA loci. Compared to V1.2 the improved structural annotation V1.6 resulted in 8387 additional protein-coding loci, 11,242 more complete genes and only 1582 unaltered gene structures. 70% of the 32,275 protein-coding genes are supported by EST evidence. Nearly half (~49%) of the protein-coding loci in V1.6 can now be considered complete, containing both UTRs. Furthermore, the information-rich cosmoss.org locus IDs also carry information on the chromosomal/scaffold localization and about alternative splicing of transcripts.
We significantly increased the number of genes with functional annotations (58% as compared to 41% in V1.2) in form of GO term annotations (GOA). Our quality assessment of the V1.6 GOA demonstrates sufficient annotation depth to recover results from previous high-quality phylogeny-based approaches using ontology term enrichment analysis. Nevertheless, there are still 41% of all loci without any GO annotation and only 0.04% of all assigned GO terms are supported by direct experimental evidence. Although this is a common phenomenon for most available plant genomes, special focus needs to be placed on the improvement of the functional annotation of the P. patens protein-coding genes in order for it to serve as a reference model organism and plant “flagship”. With the development of the cosmoss.org community annotation services described here allowing users to browse, view and alter functional and structural annotations of moss genes, transfer and exchange of knowledge is greatly facilitated. Including the described extensions of the resource, cosmoss.org now is well-equipped to serve as a permanent, central model organism database and annotation repository for P. patens.
We demonstrate the utility of the provided annotation and resources for the comparative study of plant evolution including the analysis of codon usage (see Additional file 7: S1 and Additional file 9: Figure A3), alternative splicing, gene structure evolution as well as the detection of lineage- and species-specific expansions of gene families and biological processes.
Results from our comparative analyses were mostly consistent with previous observations, but also provided several novel insights. In particular, we found further evidence for intron loss during land plant evolution and secondary intron gain in the alga Chlamydomonas. Investigation of alternative splicing and gene structures revealed a unique complexity of 5’-UTRs in the moss, pointing to the importance of UTRs for the regulation of gene expression in this early diverging land plant.
Our comparative analysis of functional annotations and protein clusters revealed expansions of moss house-keeping and metabolic gene functions as well as hitherto unknown lineage-specific expansions. In total, 832 gene clusters are expanded in P. patens and at least ~13% of all gene loci are orphan genes as they have no homolog in other as yet published genomes. Subsequent functional analysis of this data set will further extend our understanding of the unique capabilities to cope with abiotic and biotic stressors and to efficiently repair DNA damage.
P. patens reference genes
In total, 137 manually annotated and validated Physcomitrella patens gene structures are in the cosmoss.org genome browser track “Ppref genes”. Some of the genes are directly derived from published genes (GenBank) or provided by the scientific community, but the majority was extended or corrected manually using ESTs and FLcDNAs. In addition, in-house validated sequenced P. patens gene structures were added.
EuGène P. patens gene prediction process
EuGène allows combining various types of evidence, including ESTs, mate pair information, homologous sequences, existing gene predictions and splice site predictions. Our EuGène predictions for P. patens are based on splice site predictions from SpliceMachine trained on filtered P. patens EST alignments, on coding, intronic and intergenic regions, and on homology evidence (A. thaliana homologs); they are filtered using transposon-related sequences and gaps. The optimal parameters were determined on an independent P. patens reference gene set not used for training of EuGène. Two P. patens whole-genome EuGène predictions went into the generation of release V1.6. The first contains 37,872 predicted loci and was restricted to generating UTR regions only where transcript evidence is available; the second contains UTR regions predicted ab initio (46,071 loci).
The first EuGène model predicts ~94% of all CDS exons in the reference genes correctly, whereas the second does so for ~95% (results are summarized in Table 3). Both predict 76% of all reference CDS without any error. This implies that the EuGène predictions perform well at the exon level but split several loci into two or more distinct genes. While manually inspecting these problematic loci, we noticed that gene models created by the JGI that had not been selected for release V1.1 were often better than the selected model, or could be used to correct or complement the EuGène predictions. To combine all available P. patens evidence and to further improve the protein-coding gene structures, we used EVidenceModeler.
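The difference between exon-level and CDS-level accuracy can be illustrated with a minimal Python sketch. It is not the evaluation code used for Table 3; the function name `evaluate`, the toy coordinates and the simplification to (start, end) pairs without scaffold or strand are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) of one CDS exon; scaffold and strand omitted


def evaluate(reference: Dict[str, List[Exon]],
             predicted: List[List[Exon]]) -> Tuple[float, float]:
    """Return (fraction of reference CDS exons predicted exactly,
    fraction of reference genes whose complete CDS is reproduced)."""
    predicted_exons = {exon for model in predicted for exon in model}
    predicted_structures = {tuple(sorted(model)) for model in predicted}

    total_exons = sum(len(exons) for exons in reference.values())
    correct_exons = sum(exon in predicted_exons
                        for exons in reference.values() for exon in exons)
    correct_cds = sum(tuple(sorted(exons)) in predicted_structures
                      for exons in reference.values())
    return correct_exons / total_exons, correct_cds / len(reference)


# Toy data (hypothetical): refA is split into two predicted genes.
reference = {
    "refA": [(100, 200), (300, 400), (500, 600)],
    "refB": [(900, 1000), (1100, 1200)],
}
predicted = [
    [(100, 200), (300, 400)],      # first part of refA
    [(500, 600)],                  # remainder of refA as a separate gene
    [(900, 1000), (1100, 1200)],   # refB reproduced completely
]
exon_level, cds_level = evaluate(reference, predicted)
print(exon_level, cds_level)
```

In this toy case every exon is found, yet only one of the two reference CDS is reproduced in full, which mirrors the pattern of high exon accuracy combined with split loci described above.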
EVidenceModeler (EVM) - weighted consensus gene model predictions
EVM (Haas et al. 2008) combines evidence from different sources into a consensus gene structure prediction. Because the different types of evidence can be weighted and their combination optimized, EVM compensates for the drawbacks of individual sources while exploiting their strengths. For P. patens we used EVM to find the optimal combination of PASA transcript assemblies, EST alignments and five different whole-genome protein-coding gene predictions. The process is also described in Additional file 3: Figure A1. The resulting models were subsequently subjected to PASA to model the UTR regions. As a consequence, all UTRs in release V1.6 are supported by transcript evidence. Using EVM enabled us to increase the prediction performance on the reference gene set (86.1% of all CDS and 97.3% of all CDS exons are correct; see Additional file 2: Table A2).
Additional gene structure predictions using EuGène
EuGène version 3.4 was adapted and trained for P. patens on the basis of the Ppref gene set (see above), including a species-specific splice-site prediction and IMMs (Interpolated Markov Models) trained on intergenic, CDS and UTR regions. Gaps in the genomic sequence and repetitive regions, in particular LTR retrotransposons, were masked for training and prediction. As additional evidence, homologous protein sequences (Swissprot rel. 13.4 and Arabidopsis thaliana TAIR7 homologs) and, in particular, EST alignments (360,974 from GenomeThreader, 118,243 from sim4 and 97,373 from exonerate) were used. Where available, we also supplied EST mate pair information to EuGène (63,945 EST mate pairs). Two whole-genome EuGène predictions were used for the consensus model approach leading to V1.6. The training input of these two models was the same, except that one model was additionally trained with 5’- and 3’-UTR regions (EuGène MarkovIMM plugin).
Splice site prediction
P. patens EST alignments were loaded into a Bio::DB::SeqFeature database. Excluding alignments with alternative splicing and low-quality EST alignments yielded a distinct, species-specific splice-site training set for SpliceMachine. Splice sites were only taken into account where GenomeThreader, sim4 and exonerate produced exactly the same EST alignment. Two models, one donor (GT) and one acceptor (AG) model, were generated and used for EuGène training.
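The agreement criterion can be sketched in Python as follows. This is only an illustration under stated assumptions: alignments are represented as lists of (start, end) exon blocks on the forward strand, and the function name `agreed_splice_sites` and the input layout are hypothetical rather than taken from the original pipeline.

```python
from typing import Dict, List, Set, Tuple

Block = Tuple[int, int]  # aligned exon block (start, end), forward strand assumed


def agreed_splice_sites(
    alignments_by_tool: Dict[str, Dict[str, List[Block]]]
) -> Tuple[Set[int], Set[int]]:
    """Collect donor and acceptor positions from ESTs that all tools
    aligned with exactly the same exon blocks."""
    tools = list(alignments_by_tool.values())
    shared_ests = set(tools[0]).intersection(*tools[1:])

    donors: Set[int] = set()
    acceptors: Set[int] = set()
    for est in shared_ests:
        structures = {tuple(tool[est]) for tool in tools}
        if len(structures) != 1:
            continue  # the tools disagree, so this EST is not used
        exons = structures.pop()
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            donors.add(left_end + 1)        # first base of the intron
            acceptors.add(right_start - 1)  # last base of the intron
    return donors, acceptors


# Toy input (hypothetical EST ids and coordinates).
alignments = {
    "GenomeThreader": {"est1": [(100, 200), (300, 400)], "est2": [(10, 50), (90, 120)]},
    "sim4":           {"est1": [(100, 200), (300, 400)], "est2": [(10, 52), (90, 120)]},
    "exonerate":      {"est1": [(100, 200), (300, 400)], "est2": [(10, 50), (90, 120)]},
}
donors, acceptors = agreed_splice_sites(alignments)
print(sorted(donors), sorted(acceptors))
```

Only est1 contributes splice sites here, because sim4 places one exon boundary of est2 differently from the other two tools.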
PASA – transcript assemblies
PASA assemblies were performed as directed on the software homepage, with the following modifications. By default, PASA uses GMAP for transcript alignments; however, our evaluation (data not shown) revealed improved EST alignments using GenomeThreader. PASA offers the possibility to include alignments in GFF3 format via the --IMPORT_CUSTOM_ALIGNMENTS_GFF3 switch. The P. patens GenomeThreader EST alignments were therefore converted into PASA-compliant GFF3 format. By default, PASA supports only one alignment per EST/transcript. To account for duplicated genes (especially tandemly arrayed genes and (near-)identical genes), the corresponding EST alignments were renamed. For example, the sequence ppsp14d22fl matches two loci in the genome, so the corresponding sequence is duplicated and named ppsp14d22fl and ppsp14d22fl_2. Transcript alignments with less than 90% EST length coverage or less than 95% alignment identity were discarded. The maximum allowed intron length was set to 20,000 nt, based on the longest observed intron supported by Sanger ESTs. Three cycles of PASA annotation loading, annotation comparison and annotation updates were used to maximize the incorporation of transcript alignments into the transcript assemblies.
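The filtering and renaming steps can be summarized in a short Python sketch. The record layout (the keys `est`, `coverage` and `identity`) and the function name are assumptions made for illustration; the thresholds and the `_2` suffix follow the description above.

```python
from collections import Counter
from typing import Dict, List


def filter_and_rename(alignments: List[Dict],
                      min_coverage: float = 0.90,
                      min_identity: float = 0.95) -> List[Dict]:
    """Discard weak alignments and give every additional alignment of the
    same EST a numbered suffix so that each name occurs only once."""
    seen: Counter = Counter()
    kept: List[Dict] = []
    for aln in alignments:
        if aln["coverage"] < min_coverage or aln["identity"] < min_identity:
            continue  # below the 90% coverage or 95% identity threshold
        seen[aln["est"]] += 1
        count = seen[aln["est"]]
        name = aln["est"] if count == 1 else f"{aln['est']}_{count}"
        kept.append({**aln, "est": name})
    return kept


# Toy records (hypothetical values; only the first EST name is from the text).
records = [
    {"est": "ppsp14d22fl", "locus": "A", "coverage": 0.98, "identity": 0.99},
    {"est": "ppsp14d22fl", "locus": "B", "coverage": 0.97, "identity": 0.99},
    {"est": "est_low_id",  "locus": "C", "coverage": 0.99, "identity": 0.93},
]
names = [r["est"] for r in filter_and_rename(records)]
print(names)
```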
Consensus gene predictions – unfiltered V1.6 models
The P. patens JGI AllModels V1.1 (http://ftp.jgi-psf.org/pub/JGI_data/Physcomitrella_patens/v1.1/transcripts.Phypa1_1.AllModels.fasta.gz), the cosmoss.org EuGène models described in the previous section, and transcript alignments (PASA, exonerate and sim4) were combined and weighted with EVidenceModeler (EVM). EVM combines all evidence per locus into one consensus gene structure model. EVM was trained on the Ppref genes, excluding the reference genes used for the training of EuGène. Subsequently, the UTR regions were modeled from P. patens PASA transcript alignment assemblies as described on the PASA homepage. In this context, models with alternative splicing evidence were generated and incorporated into V1.6.
Functional gene annotation
The predicted protein sequences of the unfiltered V1.6 gene models were subjected to BLAST2GO. The initial BLASTP search (e-value cut-off: 1E-4) was performed against GenPept release 172.0 from which P. patens V1.1 entries had been subtracted, to allow an unbiased functional annotation independent of the previous, entirely automatic V1.1 GOA. For the BLAST2GO annotation step, the minimum coverage between a hit and its HSP was set to 40%. The validation step as well as the integration of GO annotations based on InterProScan V4.5 (InterPro release v22.0) were used to generate the GOA for the P. patens V1.6 proteins.
GO term annotation was extended by experimental evidence, various subcellular target predictions and homology-based methods using the pred2GOA method described below. Existing functional annotations such as gene names, description lines, GO terms, KEGG EC numbers and KO terms were collected from JGI and the Kyoto Encyclopedia of Genes and Genomes (KEGG) and combined into a non-redundant database. Existing experimental evidence for subcellular localization was also manually integrated as GOA where available. Additional functional annotations were created using the homology-based methods BLAST2GO, IPRScan and KAAS. These two steps resulted in GO terms with IEA evidence codes. Subcellular localization of the protein sequences was predicted, taking into account the full-length status of each gene, using several tools: TAPScan, MultiLoc, WolfPSORT, TargetP, ChloroP, SignalP, Prosite KDEL, HMMTOP and MEMSAT3. To combine these predictions with the existing GOA, we developed the algorithm pred2GOA, which allows weighted integration of GOA from multiple sources. The resulting predictions were translated into cellular component GO terms and compared to the existing GOA at the GO slim (plants) level. The assignment of GO terms is based on a weighted majority rule consensus. If the underlying gene did not have an annotated UTR and a start codon, at least one of the predictors’ predictions had to be based on more than the N-terminal region of the protein. The resulting GO term assignments were reviewed for consistency and defined as “Inferred from Sequence or Structural Similarity” (ISS). The resulting GO annotation was mapped to GO slim terms using the Blast2GO internal mapping function with the “goslim_plant.obo” ontology subset.
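A weighted majority rule consensus of the kind mentioned above can be sketched in Python. This is not the pred2GOA implementation, whose details are not given here; the weights, the requirement of more than half of the total weight, and the function name `weighted_majority` are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional


def weighted_majority(predictions: Dict[str, str],
                      weights: Dict[str, float]) -> Optional[str]:
    """Return the GO term backed by more than half of the total weight,
    or None if no term reaches a majority."""
    votes: Dict[str, float] = defaultdict(float)
    for tool, term in predictions.items():
        votes[term] += weights.get(tool, 1.0)  # unknown tools count with weight 1
    best_term, best_score = max(votes.items(), key=lambda item: item[1])
    return best_term if best_score > sum(votes.values()) / 2 else None


# Hypothetical weights and predictions for one protein.
weights = {"TargetP": 1.0, "ChloroP": 1.0, "MultiLoc": 1.0, "WolfPSORT": 2.0}
predictions = {
    "TargetP": "GO:0009536",    # plastid
    "ChloroP": "GO:0009536",    # plastid
    "MultiLoc": "GO:0009536",   # plastid
    "WolfPSORT": "GO:0005739",  # mitochondrion
}
consensus = weighted_majority(predictions, weights)
print(consensus)
```

With these toy weights the plastid term collects 3 of 5 weight units and is assigned; if the votes were split evenly, no term would be returned.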
GO enrichment analysis
The enrichment analyses were performed using the Bioconductor package topGO. We used and compared results from both the classical Fisher’s Exact Test and the topGO algorithm “weight01” as test statistics, with a p-value cut-off of 0.05. The Arabidopsis GOA was downloaded from the TAIR ftp server (TAIR10; http://ftp.arabidopsis.org/Ontologies/Gene_Ontology/ATH_GO_GOSLIM.txt).
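For readers unfamiliar with the classical test, the following Python sketch computes a one-sided Fisher's exact (hypergeometric) p-value for the over-representation of a single GO term. It does not reproduce topGO or its “weight01” algorithm; the function name and the toy numbers are assumptions.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Probability of observing at least k annotated genes in a sample of n,
    given K annotated genes among N genes in total."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy example: 10 genes in total, 5 carry the term, and all 5 genes of
# interest carry it as well.
p_value = fisher_enrichment_p(k=5, n=5, K=5, N=10)
print(p_value)
```

A term would be called enriched at the cut-off used above if this value is at most 0.05, before or after any multiple-testing correction that is applied.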
Protein clusters and representative gene model selection
Although certainly useful in single-gene analysis and the study of AS, the use of all splice variants in large-scale comparative analysis introduces undesirable complications. Therefore, gene catalogues are usually reduced to one representative isoform per locus prior to large-scale analyses. By convention, the representative model is the variant with the lowest splice variant index. A representative gene model per locus was selected by the following procedure. In the first step, all P. patens V1.6 proteins together with the proteins of various sequenced Archaeplastida (Additional file 5: Table A4) were clustered using OrthoMCL V.2. The inflation parameter was optimized using a set of reference gene families (e.g. [10, 15]) and set to 1.3. The resulting clusters were aligned with MAFFT ginsi v6.712b, and the uncorrected pairwise distances were subsequently calculated with distmat (EMBOSS 6.1.0). For each locus, the distmat matrices were used to select as representative the Physcomitrella model with the fewest substitutions (minimum distmat distance) to any non-Physcomitrella cluster member. Clusters with only Physcomitrella sequences were subjected to a BLASTP search against GenPept (release 172.0). In this case, the representative per locus was set to the Physcomitrella sequence that best covers its closest GenPept hit in terms of alignment length. If still no clear representative could be determined, the model with the longest transcript was chosen as the representative. Because the clustering is based on sequence information alone, in some cases not all splice variants of a locus were grouped into the same cluster. Thus, the number of loci in clusters slightly exceeds the number of physical loci.
Pre-tRNA genes were predicted by combining results from tRNAscan-SE 1.21 and ARAGORN 1.1. rRNA loci were predicted using RNAmmer-1.2 and extended by mapping the available SILVA Physcomitrella rRNAs to the genome with BLASTN.
Non-protein-coding loci and gene families were determined using Infernal with the RFAM (version 8.1) covariance models and by integration of the miRBase 18 Physcomitrella miRNA classifications and annotations.
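The core of this selection rule can be sketched in Python. The sketch covers only the minimum-distance criterion and the longest-transcript fallback and omits the intermediate BLASTP step; the function name, the data layout and the isoform identifiers are hypothetical.

```python
from typing import Dict, List


def pick_representative(isoforms: List[str],
                        distances: Dict[str, List[float]],
                        lengths: Dict[str, int]) -> str:
    """Choose the isoform with the smallest distance to any non-moss cluster
    member; fall back to the longest transcript if no distances exist."""
    scored = [(min(distances[iso]), -lengths[iso], iso)
              for iso in isoforms if distances.get(iso)]
    if scored:
        # Smallest distance first; ties are broken by the longer transcript.
        return min(scored)[2]
    return max(isoforms, key=lambda iso: lengths[iso])


# Toy locus with two splice variants (hypothetical ids and values).
isoforms = ["locus1.1", "locus1.2"]
lengths = {"locus1.1": 1500, "locus1.2": 1800}
distances = {"locus1.1": [12.5, 30.0], "locus1.2": [18.0, 25.0]}

by_distance = pick_representative(isoforms, distances, lengths)
by_length = pick_representative(isoforms, {}, lengths)
print(by_distance, by_length)
```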
P. patens V1.6 protein-coding genes filtering
On the basis of a Bio::DB::SeqFeature database, the V1.6 gene models were filtered against the annotated LTR retrotransposons, Repbase (release 20080801) using RepeatMasker v3.26, and the non-coding RNAs described in the previous section.
The effective number of codons was calculated with CodonW 1.4.4.
Fisher’s exact tests and Wilcoxon tests were performed with R 3.0.0. The p-values were corrected for multiple testing using the false discovery rate (“fdr”) method.
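In R, the “fdr” option of p.adjust is an alias for the Benjamini-Hochberg procedure. A minimal Python sketch of that adjustment is given below for illustration; it is not the code used in the analyses, and the example p-values are invented.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value and enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```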
Availability of supporting data
The complete annotation and sequence information is accessible via the cosmoss.org download section. It comprises gene structure releases in GFF3 format; gene, transcript, CDS, protein and UTR sequences in FASTA format; mappings/lookup tables; and GO annotations in GAF format.
We thank Tomomichi Fujita, Keiko Sakakibara, Atsushi Toyoda, Asao Fujiyama, and Yuji Kohara for the generation of FLcDNA libraries and sequencing. We are indebted to our collaborators at the US DoE JGI and the Moss Genome Consortium. Funding by the German Research Foundation DFG (RE 837/10-2) and by MEXT, JSPS and JST is gratefully acknowledged. The article processing charge was funded by the German Research Foundation (DFG) and the Albert Ludwigs University Freiburg in the funding programme Open Access Publishing.
- Lang D, Eisinger J, Reski R, Rensing SA: Representation and high-quality annotation of the Physcomitrella patens transcriptome demonstrates a high proportion of proteins involved in metabolism in mosses. Plant Biol (Stuttg). 2005, 7: 238-250. 10.1055/s-2005-837578.
- Rensing SA, Fritzowsky D, Lang D, Reski R: Protein encoding genes in an ancient plant: analysis of codon usage, retained genes and splice sites in a moss, Physcomitrella patens. BMC Genomics. 2005, 6: 43. 10.1186/1471-2164-6-43.
- Lang D, Zimmer AD, Rensing SA, Reski R: Exploring plant biodiversity: the Physcomitrella genome and beyond. Trends Plant Sci. 2008, 13: 542-549. 10.1016/j.tplants.2008.07.002.
- cosmoss.org. http://www.cosmoss.org.
- Saidi Y, Finka A, Chakhporanian M, Zryd JP, Schaefer DG, Goloubinoff P: Controlled expression of recombinant proteins in Physcomitrella patens by a conditional heat-shock promoter: a tool for plant research and biotechnology. Plant Mol Biol. 2005, 59: 697-711. 10.1007/s11103-005-0889-z.
- Lucumi A, Posten C, Pons MN: Image analysis supported moss cell disruption in photo-bioreactors. Plant Biol (Stuttg). 2005, 7: 276-282. 10.1055/s-2005-865638.
- Decker EL, Reski R: Moss bioreactors producing improved biopharmaceuticals. Curr Opin Biotechnol. 2007, 18: 393-398. 10.1016/j.copbio.2007.07.012.
- Büttner-Mainik A, Parsons J, Jérôme H, Hartmann A, Lamer S, Schaaf A, Schlosser A, Zipfel PF, Reski R, Decker EL: Production of biologically active recombinant human factor H in Physcomitrella. Plant Biotechnol J. 2011, 9: 373-383. 10.1111/j.1467-7652.2010.00552.x.
- Parsons J, Altmann F, Arrenberg CK, Koprivova A, Beike AK, Stemmer C, Gorr G, Reski R, Decker EL: Moss-based production of asialo-erythropoietin devoid of Lewis A and other plant-typical carbohydrate determinants. Plant Biotechnol J. 2012, 10: 851-861. 10.1111/j.1467-7652.2012.00704.x.
- Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud PF, Lindquist EA, Kamisugi Y, Tanahashi T, Sakakibara K, Fujita T, Oishi K, Shin-I T, Kuroki Y, Toyoda A, Suzuki Y, Hashimoto SI, Yamaguchi K, Sugano S, Kohara Y, Fujiyama A, Anterola A, Aoki S, Ashton N, Barbazuk WB, Barker E, Bennetzen JL, Blankenship R: The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science. 2008, 319: 64-69. 10.1126/science.1150646.
- Hirano K, Nakajima M, Asano K, Nishiyama T, Sakakibara H, Kojima M, Katoh E, Xiang H, Tanahashi T, Hasebe M, Banks JA, Ashikari M, Kitano H, Ueguchi-Tanaka M, Matsuoka M: The GID1-Mediated gibberellin perception mechanism is conserved in the Lycophyte Selaginella moellendorffii but not in the Bryophyte Physcomitrella patens. Plant Cell. 2007, 26: 26-
- Axtell MJ, Snyder JA, Bartel DP: Common functions for diverse small RNAs of land plants. Plant Cell. 2007, 19: 1750-1769. 10.1105/tpc.107.051706.
- Alboresi A, Caffarri S, Nogue F, Bassi R, Morosinotto T: In silico and biochemical analysis of Physcomitrella patens photosynthetic antenna: identification of subunits which evolved upon land adaptation. PLoS One. 2008, 3: e2033. 10.1371/journal.pone.0002033.
- Gitzinger M, Parsons J, Reski R, Fussenegger M: Functional cross-kingdom conservation of mammalian and moss (Physcomitrella patens) transcription, translation and secretion machineries. Plant Biotechnol J. 2009, 7: 73-86. 10.1111/j.1467-7652.2008.00376.x.
- Paponov IA, Teale W, Lang D, Paponov M, Reski R, Rensing SA, Palme K: The evolution of nuclear auxin signalling. BMC Evol Biol. 2009, 9: 126. 10.1186/1471-2148-9-126.
- Pils B, Heyl A: Unraveling the evolution of cytokinin signaling. Plant Physiol. 2009, 151: 782-791. 10.1104/pp.109.139188.
- Mukherjee K, Brocchieri L, Bürglin TR: A comprehensive classification and evolutionary analysis of plant homeobox genes. Mol Biol Evol. 2009, 26: 2775-2794. 10.1093/molbev/msp201.
- Bartels S, González-Besteiro MA, Lang D, Ulm R: Emerging functions for plant MAP kinase phosphatases. Trends Plant Sci. 2010, 15: 322-329. 10.1016/j.tplants.2010.04.003.
- Pitsch NT, Witsch B, Baier M: Comparison of the chloroplast peroxidase system in the chlorophyte Chlamydomonas reinhardtii, the bryophyte Physcomitrella patens, the lycophyte Selaginella moellendorffii and the seed plant Arabidopsis thaliana. BMC Plant Biol. 2010, 10: 133. 10.1186/1471-2229-10-133.
- Aya K, Hiwatashi Y, Kojima M, Sakakibara H, Ueguchi-Tanaka M, Hasebe M, Matsuoka M: The Gibberellin perception system evolved to regulate a pre-existing GAMYB-mediated system during land plant evolution. Nat Commun. 2011, 2: 544.
- Yasumura Y, Moylan EC, Langdale JA: A conserved transcription factor mediates nuclear control of organelle biogenesis in anciently diverged land plants. Plant Cell. 2005, 17: 1894-1907. 10.1105/tpc.105.033191.
- Menand B, Yi K, Jouannic S, Hoffmann L, Ryan E, Linstead P, Schaefer DG, Dolan L: An ancient mechanism controls the development of cells with a rooting function in land plants. Science. 2007, 316: 1477-1480. 10.1126/science.1142618.
- Marella HH, Sakata Y, Quatrano RS: Characterization and functional analysis of ABSCISIC ACID INSENSITIVE3-like genes from Physcomitrella patens. Plant J. 2006, 46: 1032-1044. 10.1111/j.1365-313X.2006.02764.x.
- Perroud PF, Quatrano RS: BRICK1 is required for apical cell growth in filaments of the moss Physcomitrella patens but not for gametophore morphology. Plant Cell. 2008, 20: 411-422. 10.1105/tpc.107.053256.
- Zemach A, McDaniel IE, Silva P, Zilberman D: Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science. 2010, 328: 916-919. 10.1126/science.1186366.
- Heidel AJ, Lawal HM, Felder M, Schilde C, Helps NR, Tunggal B, Rivero F, John U, Schleicher M, Eichinger L, Platzer M, Noegel AA, Schaap P, Glöckner G: Phylogeny-wide analysis of social amoeba genomes highlights ancient origins for complex intercellular communication. Genome Res. 2011, 21: 1882-1891. 10.1101/gr.121137.111.
- Yang X, Tschaplinski TJ, Hurst GB, Jawdy S, Abraham PE, Lankford PK, Adams RM, Shah MB, Hettich RL, Lindquist E, Kalluri UC, Gunter LE, Pennacchio C, Tuskan GA: Discovery and annotation of small proteins using genomics, proteomics, and computational approaches. Genome Res. 2011, 21: 634-641. 10.1101/gr.109280.110.
- Fu Y, Bannach O, Chen H, Teune JH, Schmitz A, Steger G, Xiong L, Barbazuk WB: Alternative splicing of anciently exonized 5S rRNA regulates plant transcription factor TFIIIA. Genome Res. 2009, 19: 913-921. 10.1101/gr.086876.108.
- Tang H, Wang X, Bowers JE, Ming R, Alam M, Paterson AH: Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps. Genome Res. 2008, 18: 1944-1954. 10.1101/gr.080978.108.
- JGI plant flagship genomes. http://hagsc.org/projects/flagship.
- Van den Berg BHJ, McCarthy FM, Lamont SJ, Burgess SC: Re-annotation is an essential step in systems biology modeling of functional genomics data. PLoS One. 2010, 5: e10642. 10.1371/journal.pone.0010642.
- Markmann-Mulisch U, Wendeler E, Zobell O, Schween G, Steinbiss HH, Reiss B: Differential requirements for RAD51 in Physcomitrella patens and Arabidopsis thaliana development and DNA damage repair. Plant Cell. 2007, 19: 3080-3089. 10.1105/tpc.107.054049.
- Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, Hurwitz B, McCouch S, Ni J, Pujar A, Ravenscroft D, Ren L, Spooner W, Tecle I, Thomason J, Tung C, Wei X, Yap I, Youens-Clark K, Ware D, Stein L: Gramene: a growing plant comparative genomics resource. Nucleic Acids Res. 2008, 36: D947-D953.
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, Radenbaugh A, Singh S, Swing V, Tissier C, Zhang P, Huala E: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008, 36: D1009-D1014.
- Tweedie S, Ashburner M, Falls K, Leyland P, McQuilton P, Marygold S, Millburn G, Osumi-Sutherland D, Schroeder A, Seal R, Zhang H: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009, 37: D555-D559. 10.1093/nar/gkn788.
- Hirschman J, Berardini TZ, Drabkin HJ, Howe D: A MOD(ern) perspective on literature curation. Mol Genet Genomics. 2010, 283: 415-425. 10.1007/s00438-010-0525-8.
- Consortium UP: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2012, 40: D71-D75.
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri AK, Keefe D, Keenan S, Kinsella R, Komorowska M, Koscielny G, Kulesha E, Larsson P, Longden I, McLaren W, Muffato M, Overduin B, Pignatelli M, Pritchard B, Riat HS: Ensembl 2012. Nucleic Acids Res. 2012, 40: D84-D90. 10.1093/nar/gkr991.
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012, 40: D1178-D1186. 10.1093/nar/gkr944.
- Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K: Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 2012, 158: 590-600. 10.1104/pp.111.189514.
- Gaudet P, Lane L, Fey P, Bridge A, Poux S, Auchincloss A, Axelsen K, Braconi Quintaje S, Boutet E, Brown P, Coudert E, Datta RS, de Lima WC, de Oliveira Lima T, Duvaud S, Farriol-Mathis N, Ferro Rojas S, Feuermann M, Gateau A, Hinz U, Hulo C, James J, Jimenez S, Jungo F, Keller G, Lemercier P, Lieberherr D, Moinat M, Nikolskaya A, Pedruzzi I, Rivoire C, Roechert B, Schneider M, Stanley E, Tognolli M, Sjölander K, Bougueleret L, Chisholm RL, Bairoch A: Collaborative annotation of genes and proteins between UniProtKB/Swiss-Prot and dictyBase. Database (Oxford). 2009, 2009: bap016. 10.1093/database/bap016.
- Kersey PJ, Lawson D, Birney E, Derwent PS, Haimel M, Herrero J, Keenan S, Kerhornou A, Koscielny G, Kähäri A, Kinsella RJ, Kulesha E, Maheswari U, Megy K, Nuhn M, Proctor G, Staines D, Valentin F, Vilella AJ, Yates A: Ensembl genomes: extending Ensembl across the taxonomic space. Nucleic Acids Res. 2010, 38: D563-D569. 10.1093/nar/gkp871.
- Howe D, Costanzo M, Fey P, Gojobori T, Hannick L, Hide W, Hill DP, Kania R, Schaeffer M, St Pierre S, Twigger S, White O, Rhee SY: Big data: the future of biocuration. Nature. 2008, 455: 47-50. 10.1038/455047a.
- Mazumder R, Natale DA, Julio JAE, Yeh LS, Wu CH: Community annotation in biology. Biol Direct. 2010, 5: 12. 10.1186/1745-6150-5-12.
- Richardt S, Lang D, Reski R, Frank W, Rensing SA: PlanTAPDB, a phylogeny-based resource of plant transcription-associated proteins. Plant Physiol. 2007, 143: 1452-1466. 10.1104/pp.107.095760.
- Mitschke J, Fuss J, Blum T, Höglund A, Reski R, Kohlbacher O, Rensing SA: Prediction of dual protein targeting to plant organelles. New Phytol. 2009, 183: 224-235. 10.1111/j.1469-8137.2009.02832.x.
- Kamisugi Y, Von Stackelberg M, Lang D, Care M, Reski R, Rensing SA, Cuming AC: A sequence-anchored genetic linkage map for the moss, Physcomitrella patens. Plant J. 2008, 56: 855-866. 10.1111/j.1365-313X.2008.03637.x.
- Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S: The generic genome browser: a building block for a model organism system database. Genome Res. 2002, 12: 1599-1610. 10.1101/gr.403602.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2011, 39: D32-D37. 10.1093/nar/gkq1079.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301. 10.1093/nar/gkr1065.
- Kozomara A, Griffiths-Jones S: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res. 2011, 39: D152-D157. 10.1093/nar/gkq1027.
- Axtell MJ, Jan C, Rajagopalan R, Bartel DP: A two-hit trigger for siRNA biogenesis in plants. Cell. 2006, 127: 565-577. 10.1016/j.cell.2006.09.032.
- Griffiths-Jones S: The microRNA Registry. Nucleic Acids Res. 2004, 32: D109-D111. 10.1093/nar/gkh023.
- Kodama Y, Shumway M, Leinonen R: The Sequence Read Archive: explosive growth of sequencing data. Nucleic Acids Res. 2012, 40: D54-D56. 10.1093/nar/gkr854.
- Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, Lushbough C, Brendel V: PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 2008, 36: D959-D965.
- Novikova O, Smyshlyaev G, Blinov A: Evolutionary genomics revealed interkingdom distribution of Tcn1-like chromodomain-containing Gypsy LTR retrotransposons among fungi and plants. BMC Genomics. 2010, 11: 231. 10.1186/1471-2164-11-231.
- Degroeve S, Saeys Y, De Baets B, Rouze P, Van de Peer Y: SpliceMachine: predicting splice sites from high-dimensional local context representations. Bioinformatics. 2005, 21: 1332-1338. 10.1093/bioinformatics/bti166.
- Schiex T, Moisan A, Rouzé P: Computational Biology. EuGène: an Eukaryotic Gene Finder that combines several sources of evidence. Edited by: Gascuel O, Sagot MF. 2001, Berlin / Heidelberg: Springer, 111-125. 2066.
- Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008, 9: R7. 10.1186/gb-2008-9-1-r7.
- Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003, 31: 5654-5666. 10.1093/nar/gkg770.
- Gremme G, Brendel V, Sparks ME, Kurtz S: Engineering a Software Tool for Gene Structure Prediction in Higher Organisms. Inf Softw Technol. 2005, 47: 965-978. 10.1016/j.infsof.2005.09.005.
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-1875. 10.1093/bioinformatics/bti310.
- Wang BB, Brendel V: Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci USA. 2006, 103: 7175-7180. 10.1073/pnas.0602039103.
- Barbazuk WB, Fu Y, McGinnis KM: Genome-wide analyses of alternative splicing in plants: opportunities and challenges. Genome Res. 2008, 18: 1381-1392. 10.1101/gr.053678.106.
- Sinha R, Zimmer AD, Bolte K, Lang D, Reski R, Platzer M, Rensing SA, Backofen R: Identification and characterization of NAGNAG alternative splicing in the moss Physcomitrella patens. BMC Plant Biol. 2010, 10: 76. 10.1186/1471-2229-10-76.
- Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, Wong WK, Mockler TC: Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010, 20: 45-58. 10.1101/gr.093302.109.
- Alexandrov NN, Troukhan ME, Brover VV, Tatarinova T, Flavell RB, Feldmann KA: Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant Mol Biol. 2006, 60: 69-85. 10.1007/s11103-005-2564-9.
- Lunde C, Baumann U, Shirley NJ, Drew DP, Fincher GB: Gene structure and expression pattern analysis of three monodehydroascorbate reductase (Mdhar) genes in Physcomitrella patens: implications for the evolution of the MDHAR family in plants. Plant Mol Biol. 2006, 60: 259-275. 10.1007/s11103-005-3881-8.
- Merchant SS, Prochnik SE, Vallon O: The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science. 2007, 318: 245-250. 10.1126/science.1143609.
- Chung BYW, Simons C, Firth AE, Brown CM, Hellens RP: Effect of 5’UTR introns on gene expression in Arabidopsis thaliana. BMC Genomics. 2006, 7: 120. 10.1186/1471-2164-7-120.
- Weise A, Rodriguez-Franco M, Timm B, Hermann M, Link S, Jost W, Gorr G: Use of Physcomitrella patens actin 5’ regions for high transgene expression: importance of 5’ introns. Appl Microbiol Biotechnol. 2006, 70: 337-345. 10.1007/s00253-005-0087-6.
- Mignone F, Gissi C, Liuni S, Pesole G: Untranslated regions of mRNAs. Genome Biol. 2002, 3: REVIEWS0004.
- Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21: 3674-3676. 10.1093/bioinformatics/bti610.
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Res. 2005, 33: W116-W120. 10.1093/nar/gki442.PubMed CentralPubMedGoogle Scholar
- Gene Ontology (GO) evidence codes: http://www.geneontology.org/GO.evidence.shtml
- Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 1-13. 10.1093/nar/gkn923.
- Wolf L, Rizzini L, Stracke R, Ulm R, Rensing SA: The molecular and physiological responses of Physcomitrella patens to ultraviolet-B radiation. Plant Physiol. 2010, 153: 1123-1134. 10.1104/pp.110.154658.
- Wood AJ, Duff RJ: The aldehyde dehydrogenase (ALDH) gene superfamily of the moss Physcomitrella patens and the algae Chlamydomonas reinhardtii and Ostreococcus tauri. Bryologist. 2009, 112: 1-11. 10.1639/0007-2745-112.1.1.
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.
- Kersting AR, Bornberg-Bauer E, Moore AD, Grath S: Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biol Evol. 2012, 4: 316-329. 10.1093/gbe/evs004.
- Teich R, Grauvogel C, Petersen J: Intron distribution in Plantae: 500 million years of stasis during land plant evolution. Gene. 2007, 394: 96-104. 10.1016/j.gene.2007.02.011.
- Roy SW, Penny D: Patterns of intron loss and gain in plants: intron loss-dominated evolution and genome-wide comparison of O. sativa and A. thaliana. Mol Biol Evol. 2007, 24: 171-181.
- Frugoli JA, McPeek MA, Thomas TL, McClung CR: Intron loss and gain during evolution of the catalase gene family in angiosperms. Genetics. 1998, 149: 355-365.
- Hu K: Intron exclusion and the mystery of intron loss. FEBS Lett. 2006, 580: 6361-6365. 10.1016/j.febslet.2006.10.048.
- Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006, 7: 211-221.
- Funke RP, Kovar JL, Logsdon JM, Corrette-Bennett JC, Straus DR, Weeks DP: Nucleus-encoded, plastid-targeted acetolactate synthase genes in two closely related chlorophytes, Chlamydomonas reinhardtii and Volvox carteri: phylogenetic origins and recent insertion of introns. Mol Gen Genet. 1999, 262: 12-21. 10.1007/s004380051054.
- Rensing SA, Ick J, Fawcett JA, Lang D, Zimmer A, Van de Peer Y, Reski R: An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC Evol Biol. 2007, 7: 130. 10.1186/1471-2148-7-130.
- Banks JA, Nishiyama T, Hasebe M: The Selaginella genome identifies genetic changes associated with the evolution of vascular plants. Science. 2011, 332: 960-963. 10.1126/science.1203810.
- Schaefer D, Zryd JP, Knight CD, Cove DJ: Stable transformation of the moss Physcomitrella patens. Mol Gen Genet. 1991, 226: 418-424.
- Hohe A, Reski R: A tool for understanding homologous recombination in plants. Plant Cell Rep. 2003, 21: 1135-1142. 10.1007/s00299-003-0644-3.
- Kamisugi Y, Schlink K, Rensing SA, Schween G, Von Stackelberg M, Cuming AC, Reski R, Cove DJ: The mechanism of gene targeting in Physcomitrella patens: homologous recombination, concatenation and multiple integration. Nucleic Acids Res. 2006, 34: 6206-6214.
- P. patens annotation guidelines. https://www.cosmoss.org/physcome_project/wiki/Annotation_guidelines
- Cosmoss.org gene annotation interface – Genonaut. http://www.cosmoss.org/annotation/Genonaut
- Lee E, Harris N, Gibson M, Chetty R, Lewis S: Apollo: a community resource for genome annotation editing. Bioinformatics. 2009, 25: 1836-1837. 10.1093/bioinformatics/btp314.
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998, 8: 967-974.
- Slater GSC, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinforma. 2005, 6: 31. 10.1186/1471-2105-6-31.
- PASA software homepage. http://pasa.sourceforge.net
- Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M: KEGG: Kyoto Encyclopedia of genes and genomes. Nucleic Acids Res. 1999, 27: 29-34. 10.1093/nar/27.1.29.
- Moriya Y, Itoh M, Okuda S, Yoshizawa AC, Kanehisa M: KAAS: an automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007, 35: W182-W185. 10.1093/nar/gkm321.
- Lang D, Weiche B, Timmerhaus G, Richardt S, Riaño-Pachón DM, Corrêa LGG, Reski R, Mueller-Roeber B, Rensing SA: Genome-wide phylogenetic comparative analysis of plant transcriptional regulation: a timeline of loss, gain, expansion and correlation with complexity. Genome Biol Evol. 2010, 2: 488-503. 10.1093/gbe/evq032.
- Höglund A, Donnes P, Blum T, Adolph HW, Kohlbacher O: MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics. 2006, 22: 1158-1165. 10.1093/bioinformatics/btl002.
- Horton P, Park KJ, Obayashi T, Fujita N, Harada H, Adams-Collier CJ, Nakai K: WoLF PSORT: protein localization predictor. Nucleic Acids Res. 2007, 35: W585-W587. 10.1093/nar/gkm259.
- Emanuelsson O, Nielsen H, Brunak S, Von Heijne G: Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol. 2000, 300: 1005-1016. 10.1006/jmbi.2000.3903.
- Emanuelsson O, Nielsen H, Von Heijne G: ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci. 1999, 8: 978-984. 10.1110/ps.8.5.978.
- Boden M, Hawkins J: Prediction of subcellular localization using sequence-biased recurrent networks. Bioinformatics. 2005, 21: 2279-2286. 10.1093/bioinformatics/bti372.
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res. 2006, 34: D227-D230. 10.1093/nar/gkj063.
- Tusnády GE, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics. 2001, 17: 849-850. 10.1093/bioinformatics/17.9.849.
- Jones DT: Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics. 2007, 23: 538-544. 10.1093/bioinformatics/btl677.
- Alexa A, Rahnenführer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006, 22: 1600-1607. 10.1093/bioinformatics/btl140.
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.
- Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Schattner P, Brooks AN, Lowe TM: The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 2005, 33: W686-W689. 10.1093/nar/gki366.
- Laslett D, Canback B: ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004, 32: 11-16. 10.1093/nar/gkh152.
- Lagesen K, Hallin P, Rødland EA, Staerfeldt HH, Rognes T, Ussery DW: RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007, 35: 3100-3108. 10.1093/nar/gkm160.
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.
- Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009, 25: 1335-1337. 10.1093/bioinformatics/btp157.
- Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: wikipedia, clans and the “decimal” release. Nucleic Acids Res. 2011, 39: D141-D145. 10.1093/nar/gkq1129.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12: 1611-1618. 10.1101/gr.361602.
- Jurka J: Repbase update: a database and an electronic journal of repetitive elements. Trends Genet. 2000, 16: 418-420. 10.1016/S0168-9525(00)02093-X.
- RepeatMasker Open-3.0. http://www.repeatmasker.org
- codonW. http://codonw.sourceforge.net
- P. patens V1.6 annotation download section. https://www.cosmoss.org/physcome_project/wiki/Genome_Annotation/V1.6#Downloads_genome_annotation_version_1.6
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.