- Research article
- Open Access
Enrichment of Triticum aestivum gene annotations using ortholog cliques and gene ontologies in other plants
© Tulpan et al.; licensee BioMed Central. 2015
- Received: 30 October 2014
- Accepted: 27 March 2015
- Published: 15 April 2015
While the gargantuan multi-nation effort of sequencing T. aestivum gets close to completion, the annotation process for the vast number of wheat genes and proteins is in its infancy. Previous experimental studies carried out on model plant organisms such as A. thaliana and O. sativa provide a plethora of gene annotations that can be used as potential starting points for wheat gene annotations, proven that solid cross-species gene-to-gene and protein-to-protein correspondences are provided.
DNA and protein sequences and corresponding annotations for T. aestivum and 9 other plant species were collected from Ensembl Plants release 22 and curated. Cliques of predicted 1-to-1 orthologs were identified and an annotation enrichment model was defined based on existing gene-GO term associations and phylogenetic relationships among wheat and 9 other plant species. A total of 13 cliques of size 10 were identified, which represent putative functionally equivalent genes and proteins in the 10 plant species. Eighty-five new and more specific GO terms were associated with wheat genes in the 13 cliques of size 10, which represent a 65% increase compared with the previously 130 known GO terms. Similar expression patterns for 4 genes from Arabidopsis, barley, maize and rice in cliques of size 10 provide experimental evidence to support our model. Overall, based on clique size equal or larger than 3, our model enriched the existing gene-GO term associations for 7,838 (8%) wheat genes, of which 2,139 had no previous annotation.
Our novel comparative genomics approach enriches existing T. aestivum gene annotations based on cliques of predicted 1-to-1 orthologs, phylogenetic relationships and existing gene ontologies from 9 other plant species.
- Gene Ontology
- Wheat Gene
- Very Long Chain Fatty Acid
- Clique Size
- Term Assignment
Gene orthology forms the backbone of comparative and evolutionary genomics and it represents a central piece in many computational methods for functional annotation of genes particularly relevant for newly sequenced plant species. Orthologs are genes that evolved from their last common ancestor after a speciation event [1,2] and are essentially considered to be the ‘same’ gene in different species. In comparison, paralogs are genes which are derived via a gene duplication event and although evolutionarily related, they are not the ‘same’ gene and are unlikely to have all the same function in different species. The precise identification of orthologs and paralogs is a quintessential step in comparative genomics and functional analysis of genes.
Existing orthology prediction methods can be broadly grouped into two categories : (i) graph-based methods that cluster pairs of genes based on (typically protein) sequence similarity (e.g. InParanoid , RoundUp , COG , KOG , eggNOG , OrthoDB , OrthoMCL , OMA ), and (ii) tree-based methods, which cluster genes and aim for the reconciliation of the protein and the species trees (e.g. TreeFam , Ensembl Compara , PhylomeDB , LOFT ). Systematic evaluations of these methods including advantages, disadvantages, challenges and validation are discussed in the literature [16,17].
A widely adopted approach for orthology prediction is the Reciprocal Best BLAST Hit (RBBH) method (also known as ‘bidirectional best hit’) [6,18], which identifies orthologous genes between two species that are more similar to each other than to any other gene in the same species. While the RBBH method was proven to provide a solid bidirectional bridge between orthology and bidirectional best hits inferred from sequence similarity , it is by far not perfect given its limitations caused by evolutionary events such as gene loss and gene duplications or by incomplete genomic sequences. Such limitations lead to false positives representing incorrect labelling of paralogs as orthologs . Nevertheless, in particular circumstances where trusted orthologs are required, such as information enrichment in genome annotation , the RBBH method generates high quality 1-to-1 orthologs, which can be further used to seed orthologous groups .
Information about ten plant species included in this study
NCBI Taxonomy ID
Number of chromosomes
Estimated genome size [Mb]
Number of selected DNA/proteins (Ensembl Plants)
2n = 14 (DD)
2n = 2x = 10
2n = 2x = 10
2n = 2x = 20
2n = 2x = 14
2n = 2x = 24
2n = 2x = 20
2n = 6x = 42 (AABBDD)
2n = 14 (AA)
2n = 2x = 20
To enhance the confidence in our orthology prediction, we use two types of information as base input sequences for our RBBH prediction: DNA coding sequences and their corresponding proteins. We define as 1-to-1 orthologs those pairwise orthology relations commonly predicted by the RBBH method using as input the aforementioned information. In case of disagreement between the two predictions, the putative individual orthologs are dismissed, at the expense of losing potentially valid orthologs. For example, if the coding sequence representing gene TRAES_5BL_DAE1BD995 in Triticum aestivum is an RBBH of the corresponding DNA for gene TRIUR3_21564 in Triticum urartu, and if the corresponding protein sequences associated with the same genes are also RBBHs, then we accept the pair (TRAES_5BL_DAE1BD995, TRIUR3_21564) as 1-to-1 ortholog.
A related approach where cliques of OMA  orthologs and paralogs were used to enhance functional annotation of prokaryotic genes was proposed by Skunca et al. (2013) . Their approach assigned novel GO terms to orthologous genes based on majority voting, i.e. if a certain GO term exists in 50% or more of the genes in a clique, then it is assigned to the remaining genes in the same clique. A more recent study , proposes a computational framework (OrthoClust) that integrates single species co-association networks into data clusters across multiple species via orthology relationships. Their framework is applied on RNA-Seq expression profiles of C.elegans and D. melanogaster from the modENCODE consortium.
Here, we introduce a novel 3-step gene ontology enrichment model that relies on a set of cliques containing pairwise 1-to-1 predicted orthologs among multiple plant species, a set of known annotations for genes in each plant species and the phylogenetic relationship of those species. Once a target plant species such as T. aestivum is selected, the model calculates gene ontology scores based on the phylogenetic proximity between two species, the overall hierarchy of gene ontology and the predicted orthology relationships. Based on the calculated scores and further refinement using the augmented degree of novelty compared to existing annotations and a minimum score threshold, novel GO terms are assigned to the genes in the target species. Overall, based on clique sizes equal or larger than 3, our model contributed to the existing gene-GO term associations in T. aestivum by enriching 7,838 (8%) genes, of which 2,139 had no previous annotation.
We downloaded DNA coding sequences, protein sequences and Gene Ontology (GO) vocabulary for functional annotation from the FTP site of Ensembl Plants version 22  for 10 plant species: Aegilops tauschii, Arabidopsis thaliana, Brachypodium distachyon, Brassica rapa, Hordeum vulgare, Oryza sativa, Sorghum bicolor, Triticum aestivum, Triticum urartu and Zea mays. Coding DNA and protein sequences were pre-processed and only those corresponding to longest transcripts were selected for pairwise BLAST runs. Annotations and physical map information was acquired programmatically from the Gramene MySQL database build 41. The physical mapping information for Aegilops tauschii and Triticum urartu was complemented with information extracted and processed from the original publications of the two species [24,25].
We implemented a Reciprocal Best BLAST Hit (RBBH) approach for 1-to-1 orthology prediction inspired from previous work applied to human and mouse genomes . Two BLAST runs are executed for each pair of plant species and for each sequence type (DNA, protein) to identify reciprocal best hits (RBBHs). A 1-to-1 orthology relationship is assigned for those pairs of genes that are bidirectional hits within a confidence interval (e-value ≤ 10−5).
Cliques of orthologs discovery in ten plant species
Cliquer version 1.21 was used to identify cliques of orthologs in 10 plant species. Cliquer is a highly efficient graph-based algorithm for finding cliques in an arbitrary weighted graph. It uses an exact branch-and-bound algorithm developed by Patric Östergård at Aalto University in Finland. Custom scripts were used to convert pairwise orthology data into DIMACS-formatted files required by Cliquer. Each gene represents a node in the graph and each edge represents an orthologous relationship. All edges in the graph were equally scored (score = 1). Cliquer was executed using the following parameters: −a, −x and –m 3. Cliquer generated all cliques with size between 3 and 10 using as input 45 files with unique pairs of orthologs between the 10 plant species in less than one hour (~44 minutes) on a desktop computer running Linux Ubuntu (64-bit) kernel 3.8.0-35-generic, with 256 GB of RAM (less than 2GB used) and two 16-core CPUs (no parallelism used).
Orthologs clique validation using overlapping gene ontologies
For comparison purposes, to distinguish between well predicted cliques of orthologs and randomly predicted ones, we generated a set of cliques with OCLs between 3 and 10 populated with genes selected uniformly at random from complete pools of genes for each of the 10 plant species. For each clique size (OCL), we generated as described above an equal number of cliques and we calculated the GO Set Overlap as described by Equation 1.
Gene ontology enrichment
For the 10 plant species considered in this study, the maximum GO score is 4.32, which corresponds to a GO term present in all genes in a clique of size 10 (1 + 1 + ½ + ½ + 1/3 + ¼ + 1/5 + 1/5 + 1/6 + 1/6 = 4.3166667 ≈ 4.32).
We use a score threshold of G T = 0.5 above which we assign GO terms to a wheat gene. The choice of 0.5 for the threshold value is rooted in the significant phylogenetic proximity of Triticum aestivum to other closely related cereals such as Triticum urartu, Aegilops tauschii and Hordeum vulgare, all being part of groups with scores equal to 1 and 2. A GO term assigned to a gene in a species with group score equal to 2, will contribute 1/2, i.e. 0.5 to the overall GO score , thus being considered sufficiently significant to be assigned to the orthologous wheat gene.
The following parameter settings were used in the AgriGO  analysis: (i) Fisher’s exact test with Benjamini-Yekutieli (FDR under dependency) multiple comparison correction and (ii) significance level α = 0.05.
1-to-1 ortholog cliques
For every pair of plant species we predicted 1-to-1 orthologs using the RBBH method. Additional file 1: Table S1 provides details with respect to the total number of 1-to-1 orthologs predicted when DNA and protein sequences were used. Based on these results we considered the intersection of the DNA and protein orthology predictions for further exploration. We found cliques of 1-to-1 orthologs among the 10 plant species and grouped them based on the Ortholog Clique Level (OCL), which represents the clique size. Given the sparsity of gene ontologies in plants, we focused a large part of our analyses on ortholog cliques of size 10 while we provide additional information regarding different aspects of genes in cliques with OCLs between 3 and 10 [see Additional file 2: Figures S1 and Figure S2]. Each ortholog clique consists of genes connected to every other orthologous gene and thus they are expected to have similar functions.
GO scores and OCLs
Characterization of genes in cliques of size 10
13 cliques of size 10 with 1-to-1 orthologs in all 10 plant species
Clique 1 consists of a set of 10 pairwise 1-to-1 orthologs in 10 plant species potentially representing cyclophilin71, a member of the immunophillin group of proteins known for their property of binding to the immune-suppressant drug cyclosporine A. This particular protein is unusual due to the presence of an additional WD domain (along the traditional PPIase domain) experimentally proven to exist in A. thaliana (AT3G44600 – CYP71/AtCYP71) and O. sativa (OS08G0557500/LOC_Os08g44330) . Due to its ability to modulate the distribution of FAS1 and LHP1 on chromatin in plants, loss of this gene function causes drastic pleiotropic phenotypic defects . While less is known about cyclophylin A in wheat, some information is available about cyclophylin B . The wheat gene (TRAES_7DS_8020BEEC2) in this clique had 16 original GO terms and was enriched by our model with 35 new GO terms with an average GO score of 1.12. The highest scored new GO term assigned to this gene was protein peptidyl-prolyl isomerization (GO:0000413).
Clique 2 includes 10 genes potentially representing one (NMAT1) of the four nuclear maturase genes (NMAT1 to 4) encoding mitochondrial proteins in plants. In Arabidopsis, NMAT1 (AT1G30010) functions in the trans-splicing of nad1 intron 1, and has a role in cis-splicing of nad2 intron 1 and nad4 intron 2 . While no NMAT1 gene was experimentally confirmed in wheat yet, recent studies in wheat mitochondria showed that nad2 intron 1, nad1 intron 2 and cox2 shifted from a predominantly hydrolytic pathway at room temperature to alternative pathways in the cold . The wheat gene (TRAES_7DL_659883F3D) in this clique had 4 original GO terms and was enriched by our model with 5 new GO terms with an average GO score of 3.32. All 5 new GO terms were equally scored and represent mitochondrion (GO:0005739), seed germination (GO:0009845), seedling development (GO:0090351), Group III intron splicing (GO:0000374) and vegetative to reproductive phase transition of meristem (GO:0010228).
Genes in Clique 3 show high DNA and protein sequence similarity with the A. thaliana “chlororespiratory reduction 6” (AT2G47910 – CRR6) chloroplast thylakoid membrane protein. In Arabidopsis, this protein is required for the assembly of the NAD(P)H dehydrogenase complex of the photosynthetic electron transport chain. A suite of recent studies in wheat revealed how photosystem 1 and 2 activity is influenced by various stress factors such as heat [38,39] and draught , nevertheless no genes were clearly identified as key factors in the corresponding pathways. The wheat gene (TRAES_6AS_87906149C) in this clique had 4 original GO terms and was enriched by our model with 2 new GO terms with an average GO score of 0.92. The two new GO terms are iron-sulfur cluster assembly (GO:0016226) and aromatic amino acid family (GO:0009073).
Genes in Clique 4 are highly similar sequence-wise with A. thaliana “ABC-2 type transporter family protein” (AT3G13220) ATP-binding cassette transporter G26 (ABCG26) involved in tapetal cell and pollen development. This gene is required for male fertility and pollen exine formation. The predicted ortholog in rice (OS06G0607700) was identified in a 2013 study  as one of the important components for sporopollenin synthesis and secretion, functioning as transporter of sporopollenin precursors, translocating its substrates from tapetal cells to the developmental microspores. This gene plays a major role in the anther development and male sterility in rice. While studies of pollen formation at the phenotype level in T. aestivum were reported as early as 1986 , further work is needed to reveal the complex gene-based mechanisms that control the pollen development process in wheat. The wheat gene (TRAES_7DL_439CC6EA0) in this clique had 9 original GO terms and was enriched by our model with 1 new GO term with a GO score of 2.32. The new GO term turned out to be an obsolete one (ATP catabolic process - GO:0006200) currently replaced by ATPase activity (GO:0016887), which was already part of the original GO terms. Thus no real enrichment was obtained for this wheat gene.
Clique 5 contains genes with a high degree of similarity with A. thaliana “methyltransferases” FIO1/FIONA1 (AT2G21070) gene, which is a genetic regulator of period length in the plant’s circadian clock. The gene is located in the nucleus and is involved in methyltransferase activity in flowering, circadian rhythm and photoperiodism. Other methyltransferase genes were isolated and characterized in monocots such as maize  and rice . In wheat, five homologous cDNA sequences were connected with methyltransferase activity  and their expression patterns were studied. The Cab-1 gene was also identified as a circadian clock regulator in wheat in 1988 . The wheat gene (TRAES_6AS_AD173C5A3) in this clique had 1 original GO term and was enriched by our model with 4 new GO terms with an average GO score of 1.69. The new GO terms are methylation (GO:0032259), nucleus (GO:0005634), circadian rhythm (GO:0007623) and photoperiodism, flowering (GO:0048573).
Genes in Clique 6 are related to A. thaliana “Rhodanese/Cell cycle control phosphatase superfamily protein” (AT2G40760) located in chloroplast and involved in aging. Genes with similar functionality were identified in other plants such as a predicted orthologous gene in maize (GRMZM2G087671), which was previously identified as a target gene of draught-responsive microRNAs . In wheat, a previous study  reported the existence of a full-length cDNA sequence named TaTST (Triticum aestivum thiosulfate sulfurtransferase) with role in powdery mildew resistance mapped on the short arm of 6B chromosomes of wheat through Southern blot and GSP-PCR using Chinese Spring nullisomic/tetrasomic lines and ditelosomic lines. The wheat gene (TRAES_6AS_FD8F6B539) in this clique was not previously annotated and was enriched by our model with 1 new GO term with a GO score of 0.95. The new GO term is aging (GO:0007568).
Clique 7 includes genes related to A. thaliana “3-oxo-5-alpha-steroid 4-dehydrogenase family protein” (AT3G55360 – CER10), which is located in the endoplasmic reticulum. This gene is an Enoyl-CoA reductase involved in all very long chain fatty acids (VLCFA) elongation reactions that are required for cuticular wax, storage lipid and sphingolipid metabolism. AT3G55360 apparently encodes the sole enoyl reductase activity associated with microsomal fatty acid elongation in Arabidopsis . The Affymetrix wheat probe set Ta.28682.2.S1_x_at is associated with the CER10 Arabidopsis gene. This Arabidopsis gene shows a high level of similarity with the orthologous wheat gene TRAES_3B_90F2B79E9 in Clique 7, which was experimentally assigned the putative function “Enoyl-CoA reductase” in a 2010 study focused on changes in properties of wheat leaf cuticle during interactions with Hessian fly . The wheat gene (TRAES_3B_90F2B79E9) in this clique had 4 original GO terms and was enriched by our model with 9 new GO terms with an average GO score of 1.54. The GO terms with the highest score (3.32) are fatty acid elongase activity (GO:0009922), trans-2-enoyl-CoA reductase (NADPH) activity (GO:0019166) and plasma membrane (GO:0005886).
Among the 13 cliques of size 10, Clique 8 includes 1-to-1 orthologous genes with less specific annotation. The annotations corresponding to the A. thaliana gene AT3G55760 leads to an unknown protein located in the chloroplast stroma and expressed in 16 plant structures. The wheat gene (TRAES_4BS_2159A428F) in this clique had 2 original GO terms and was not enriched by our model.
Clique 9 includes genes with high sequence similarity with A. thaliana “Jumonji domain-containing protein 22” (AT5G06550). This gene encodes a hairless (HR) demethylase that acts as a positive regulator of seed germination in the PHYB-PIL5-SOM pathway. GramineaeTFDB lists the wheat tplb0016n18 transcript factor as one of the 4 members of the Jumonji family proteins. This TF is homologous with the wheat Ensembl gene model TRAES_7DL_96FFFB41E (reciprocal best BLAST e-values equal with 1e-157 and 6e-163 for blastp and 0 for blastn) that we identified as being part of Clique 9 and is reported to have orthologous sequences in 6 plant species (A. thaliana, B. distachyon, H. vulgare, O. sativa, S. bicolor and Z. mays) – all being 1-to-1 orthologs in Clique 9. The wheat gene (TRAES_7DL_96FFFB41E) in this clique had 5 original GO terms and was enriched by our model with 4 new GO terms with an average GO score of 1.08. The new GO terms are regulation of flower development (GO:0009909), protein targeting to mitochondrion (GO:0006626), cell surface receptor signalling pathway (GO:0007166) and regulation of transcription, DNA-templated (GO:0006355).
Clique 10 consists of genes similar to a putative A. thaliana “GMP synthase (glutamine-hydrolyzing)/glutamine amidotransferase” (AT1G63660). Based on the limited available annotation, this gene plays a role in asparagine synthase (glutamine-hydrolyzing) activity, catalytic activity, GMP synthase (glutamine-hydrolyzing) activity and ATP binding. Two sets of genes in rice and Arabidopsis (housekeeping and tissue specific) which have evolved under contrasting evolutionary constraints include the rice gene OS08G0326600 as an ortholog of AT1G63660 . The corresponding wheat Affymetrix probe set id (Ta.3136.1.S1_at) for the Ensembl wheat gene model TRAES_6AS_E6DEE586C is listed on the PlaNet  website as being co-expressed with the aforementioned rice gene. This suggests that TRAES_6AS_E6DEE586C is a good candidate for a wheat GMP synthase gene. The wheat gene (TRAES_6AS_E6DEE586C) in this clique had 7 original GO terms and was enriched by our model with 4 new GO terms with an average GO score of 1.54. The new GO terms are cytosol (GO:0005829), RNA methylation (GO:0001510), protein import into nucleus (GO:0006606) and pyrimidine ribonucleotide biosynthetic process (GO:0009220).
Genes in Clique 11 are 1-to-1 orthologs with A. thaliana gene AT4G35870, characterized in TAIR as an “early-responsive to dehydration stress protein (ERD4)”, which, interestingly, coincides with the description of another Arabidopsis gene, namely AT1G30360 and in UniProt as a “CSC1-like protein”, which is located in the cell membrane and is involved in protein targeting to vacuole. The corresponding AT4G35870 protein was experimentally determined to be involved in vacuolar sorting of storage proteins (AtGFS10) . The wheat gene (TRAES_4DL_11B05CF85) in this clique had 1 original GO term and was enriched by our model with 2 new GO terms with an average GO score of 2.33. The new GO terms are protein targeting to vacuole (GO:0006623) and integral component of membrane (GO:0016021).
Clique 12 consists of highly similar genes with A. thaliana “NAD(P)-binding Rossmann-fold superfamily protein” (AT4G35250 - HCF244), which is located in chloroplast and manifests a binding and catalytic activity. Interestingly, the Arabidopsis gene was identified as an ortholog of the Ycf39 (CyanoBase designation Slr0399), which was originally identified in a screen for suppressor mutants that restored the ability of a D2 mutant of Synechocystis 6803 to bind the bound plastoquinone, QA, and was suggested to be involved in delivering plastoquinone to Photosystem II during assembly [54,55]. The wheat gene (TRAES_7DS_E3B38CA36) in this clique was not previously annotated and was enriched by our model with 9 new GO terms with an average GO score of 2.09. The 3 GO terms with the highest score are chloroplast thylakoid (GO:0009534), translation initiation factor activity (GO:0003743) and photosystem II assembly (GO:0010207).
Genes in Clique 13 are highly similar with A. thaliana “RAD3-like DNA-binding helicase protein” (AT1G03190 - UVH6/AtXPD). This gene acts as a negative regulator for plant response to UV damage and heat, which trigger tissue death and reduced chloroplast function. The gene functions in DNA repair and it is essential for plant growth . According to KEGG, the gene is involved in two pathways, namely the basal transcription factors pathway (ath03022) and the nucleotide excision repair pathway (ath03420). The wheat gene (TRAES_1AS_A25EED9EA) in this clique had 9 original GO terms and was enriched by our model with 9 new GO terms with an average GO score of 1.12. All new GO terms received the same scores and include heat acclimation (GO:0010286), response to high light intensity (GO:0009644), transcription from RNA polymerase II promoter (GO:0006366) and RNA splicing, via endonucleolytic cleavage and ligation (GO:0000394).
Co-expression and expression profile similarity for genes in cliques of size 10
We verified the similarity of expression profiles for genes in cliques of size 10 using the Expressolog Tree Viewer  from the Bio-Analytic Resource (BAR) for Plant Biology available at University of Toronto [see Additional file 1: Table S3 ]. Expression datasets available in GEO for 4 plant species were used, such as: A. thaliana (AtGenExpress data series of Schmid et al., 2005 ), O. sativa (GEO accession numbers GSE7951 and GSE6893), H. vulgare (GEO accession number GSE16754) and Z. mays (PlexDB experiment number ZM37). Regrettably, wheat is not included yet in their database.
The Arabidopsis, rice and maize genes in all but one (clique 13) of the cliques of size 10 were identified as expressologs. The average expression profile similarity SCC scores (Spearman Correlation Coefficients) equal to 0.21 (std. dev. = 0.12) for Arabidopsis vs. rice and 0.37 (std. dev. = 0.24) for Arabidopsis vs. maize. For clique 13, only Arabidopsis and maize genes were identified as expressologs with SCC = 0.15. In 9 of the 13 cliques of size 10 the barley genes were also identified as being expressologs with the corresponding Arabidopsis, rice and maize genes. In the remaining 4 cliques, no barley expressologs were identified. Overall, the similar expression profile evidence for genes in 4 out of 10 plant species suggests that our cliques of 1-to-1 orthologs are well defined.
To facilitate the connexion between the previous analysis and wheat genes, we investigated the co-expression of genes in cliques of size 10 using the “Standard analysis” NetworkComparer approach in PlaNet . PlaNet includes information from 7 plant species, of which 5 overlap with the ones used in our study: Arabidopsis, barley, brachypodium, rice and wheat. The “Standard analysis” NetworkComparer approach compares a gene of interest with other genes (represented by microarray probe sets) belonging to the PFAM family of the query. These probe sets are then used to generate an ancestral network, which depicts conserved co-expression relationships across selected probe sets and the identity of transcripts constituting conserved PFAMs are revealed. For 9 out of 13 cliques of size 10, wheat genes shared conserved Pfam domains with genes in up to 4 other plant species (Arabidopsis, barley, brachypodium and rice) from the PlaNet database [see Additional file 3]. In 4 of those cases, all 5 genes shared the same Pfam domains. Similar with our gene ontology analysis discussed above, no genes in Clique 8 were identified to share conserved Pfam domains. In addition, we used the Pfam2GO mappings of Reviewed Computational Analysis (RCA) annotations provided by the Gene Ontology Consortium to map 23 (22 original and 1 new) GO terms from a total of 147 original and new GO terms corresponding to the 13 wheat genes in cliques of size 10. The new GO term “protein peptidyl-prolyl isomerization” (GO:0000413) confirmed by this approach corresponded to the wheat gene TRAES_7DS_8020BEEC2 in Clique 1, while the 22 original GO terms characterised wheat genes in cliques 1, 2, 4, 5, 7, 10, 11 and 13.
Gene ontology characterization
Overall, for the 10 plant species, the top five high frequency GO terms in the “molecular function” group are: protein binding (GO:0005515), ATP binding (GO:0005524), metal ion binding (GO:0046872), DNA binding (GO:0003677) and nucleotide binding (GO:0000166). The top five in “cellular component” GO terms are: membrane (GO:0016020), nucleus (GO:0005634), integral to membrane (GO:0016021), plasma membrane (GO:0005886) and chloroplast (GO:0009507). Similarly, the five highly occurring “biological process” GO terms for the 10 plant species are: oxidation-reduction process (GO:0055114), protein phosphorylation (GO:0006468), metabolic process (GO:0008152), regulation of transcription - DNA-dependent (GO:0006355) and biological process (GO:0008150).
For cliques of size 10, “cellular component” GO terms [see Additional file 2: Figure S6] such as: nucleus (GO:0005634, 34 times), chloroplast (GO:0009507, 29 times), membrane (GO:0016020, 28 times) and chloroplast stroma (GO:0009570, 20 times) occur more frequently among the participant genes. “Molecular function” GO terms [see Additional file 2: Figure S7] have also significantly high occurrence, such as ATP binding (GO:0005524, 30 times), protein binding (GO:0005515, 21 times) and nucleotide binding (GO:0000166, 19 times), while “biological process” GO terms [see Additional file 2: Figure S8] have medium to lower occurrences, the most prevalent being regulation of flower development (GO:0009909, 14 times), protein folding (GO:0006457, 14 times), vegetative to reproductive phase transition of meristem (GO:0010228, 13 times) and seed germination (GO:0009845, 13 times).
Gene Ontology (GO) enrichment model
Based on the gene ontology acquired from Ensembl Plants we propose the following procedure for enriching the annotation of wheat genes using information extrapolated from 1-to-1 orthologs with 9 other plant species.
Status of currently known annotated genes (proteins) in 10 plant species
Total num. genes
Num. annotated genes
Num. unannotated genes (%)
The second step consists of calculating the gene ontology scores (Equation 2) based on the phylogenetic proximity between genes in two species, the overall hierarchy of gene ontology and the predicted orthology relationships. For each wheat gene which belongs to a clique of size at least 3, we assign a new GO term if the GO score is above the 0.5 threshold. The genes in cliques of size at least 3 have a total of 107,997 GO terms associations. This step selects 80,067 new GO term assignments with scores equal to or higher than 0.5, leading to an overall cumulative improvement of 47% compared to the existing gene-GO term associations (171,488).
For cliques of size ten, 12 out of 13 wheat genes acquired a total of 150 new GO term associations (cumulative improvement of 115%) over the existing 130 GO terms.
The third step consists of a refinement for assignment of GO terms by keeping only those new terms that add more information (higher specificity) to the gene, i.e. we accept only those new GO terms that reside deeper (closer to a leaf node in the graph) or on a different path than the original annotations in the overall GO graph hierarchy. This will also ensure that no “place holder” root GO terms (molecular function, cellular component and biological process) will be selected for those genes where annotations already exist along the corresponding paths in the GO graph. Nevertheless, root GO terms will be accepted when no annotation is available for a given gene or if the available annotation follows a different path corresponding to a different root GO term in the GO term graph. Based on this approach, out of 80,067 new GO term assignments with scores equal to or higher than 0.5 assigned at step 2, only 25,607 new GO terms add more information to the existing annotations (only 152 root GO terms). For these, the GO term path length varies between 2 and 15 with a mean around 6.4 [see Additional file 2: Figures S9 and S10] and the GO scores span the [0.5, 3.3] interval with the mean around 1.1 [see Additional file 2: Figures S11 and S12]. This process leads to an overall GO term assignment increase of 15% compared to the original annotation (171,488 GO terms). A total of 7,838 wheat genes were annotated with newly assigned GO terms, leading to an overall increase of 8% with respect to the total number of wheat genes in Ensembl Plants release 22. Out of those 7,838 genes, 2,139 had no previous annotation [see Additional file 4] of which only 145 represent root GO terms (16 biological process - GO:0008150 and 129 molecular function - GO:0003674).
For cliques of size 10, only 85 out of 150 new GO terms are more specific and enrich the annotations based on their GO graph paths. This leads to an overall improvement of 65% compared to existing annotations (130 GO terms). Two wheat genes, namely TRAES_6AS_FD8F6B539 and TRAES_7DS_E3B38CA36 [see Additional file 5], out of 13 had no previous known annotations and were enriched with 1 (GO:0007568 - aging) and 9 (GO:0003743 - translation initiation factor activity, GO:0009507 - chloroplast, GO:0009534 - chloroplast thylakoid, GO:0016117 - carotenoid biosynthetic process, GO:0019288 - isopentenyl diphosphate biosynthetic process, methylerythritol 4-phosphate pathway, GO:0019684 - photosynthesis, light reaction, GO:0006364 - rRNA processing, GO:0010114 - response to red light, and GO:0010207 - photosystem II assembly) new GO terms, respectively.
We propose a novel annotation model for wheat genes based on 1-to-1 cliques of orthologs, existing gene ontologies from 9 other plant species and their phylogenetic relationship. Our annotation model relies on the intersection of 1-to-1 orthologs predictions based on DNA and protein sequences encoded by the same gene. Large cliques of orthologs combined with an additive scoring scheme based on phylogenetic distances between plant species provide the mechanism for gene ontology knowledge transfer from orthologous genes in other well annotated plant species to wheat genes.
In addition, we provided experimental validation and analysis of genes with similar expression profiles in Arabidopsis, barley, maize and rice and evidence that our cliques of orthologs are valid. Our model demonstrated that wheat gene functional annotations can be enriched via cliques of 1-to-1 orthologs, gene ontology information and phylogenetic relationships among considered species.
To further bridge the gap between newly sequenced and completely characterized wheat genes and proteins, a large number of validated annotations is required from the experimental community. These extremely valuable manual annotations can be in turn integrated into automatic computational annotation pipelines and models such as the one presented here to further increase the quality, throughput and understanding of the deluge of information generated today.
This research was funded by the National Research Council Canada and the Canadian Wheat Alliance.
- Fitch WM. Distinguishing Homologous from Analogous Proteins. Syst Zool. 1970;19:99.View ArticlePubMedGoogle Scholar
- Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005;39:309–38.View ArticlePubMedGoogle Scholar
- Trachana K, Larsson TA, Powell S, Chen W-H, Doerks T, Muller J, et al. Orthology prediction methods: a quality assessment using curated protein families. Bioessays. 2011;33:769–80.View ArticlePubMed CentralPubMedGoogle Scholar
- Ostlund G, Schmitt T, Forslund K, Köstler T, Messina DN, Roopra S, et al. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010;38(Database issue):D196–203.View ArticlePubMed CentralPubMedGoogle Scholar
- Deluca TF, Wu I-H, Pu J, Monaghan T, Peshkin L, Singh S, et al. Roundup: a multi-genome repository of orthologs and evolutionary distances. Bioinformatics. 2006;22:2044–6.View ArticlePubMedGoogle Scholar
- Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278:631–7.View ArticlePubMedGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.View ArticlePubMed CentralPubMedGoogle Scholar
- Powell S, Forslund K, Szklarczyk D, Trachana K, Roth A, Huerta-Cepas J, et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 2014;42(Database issue):D231–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV. OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res. 2013;41(Database issue):D358–65.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen F, Mackey AJ, Stoeckert CJ, Roos DS. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006;34(Database issue):D363–8.View ArticlePubMed CentralPubMedGoogle Scholar
- Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C. OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res. 2011;39(Database issue):D289–94.View ArticlePubMed CentralPubMedGoogle Scholar
- Ruan J, Li H, Chen Z, Coghlan A, Coin LJM, Guo Y, et al. TreeFam: 2008 Update. Nucleic Acids Res. 2008;36(Database issue):D735–40.PubMed CentralPubMedGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.View ArticlePubMed CentralPubMedGoogle Scholar
- Huerta-Cepas J, Capella-Gutiérrez S, Pryszcz LP, Marcet-Houben M, Gabaldón T. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 2014;42(Database issue):D897–902.View ArticlePubMed CentralPubMedGoogle Scholar
- Van der Heijden RTJM, Snel B, van Noort V, Huynen MA. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 2007;8:83.View ArticlePubMed CentralPubMedGoogle Scholar
- Fang G, Bhardwaj N, Robilotto R, Gerstein MB. Getting started in gene orthology and functional analysis. PLoS Comput Biol. 2010;6, e1000703.View ArticlePubMed CentralPubMedGoogle Scholar
- Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009;5, e1000262.View ArticlePubMed CentralPubMedGoogle Scholar
- Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A. 1999;96:2896–901.View ArticlePubMed CentralPubMedGoogle Scholar
- Wolf YI, Koonin EV. A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol Evol. 2012;4:1286–94.View ArticlePubMed CentralPubMedGoogle Scholar
- Dalquen DA, Dessimoz C. Bidirectional best hits miss many orthologs in duplication-rich clades such as plants and animals. Genome Biol Evol. 2013;5:1800–6.View ArticlePubMed CentralPubMedGoogle Scholar
- Skunca N, Bošnjak M, Kriško A, Panov P, Džeroski S, Smuc T, et al. Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol. 2013;9, e1002852.View ArticlePubMed CentralPubMedGoogle Scholar
- Yan K-K, Wang D, Rozowsky J, Zheng H, Cheng C, Gerstein M. OrthoClust: an orthology-based network framework for clustering data across multiple species. Genome Biol. 2014;15:R100.View ArticlePubMed CentralPubMedGoogle Scholar
- Monaco MK, Stein J, Naithani S, Wei S, Dharmawardhana P, Kumari S, et al. Gramene 2013: comparative plant genomics resources. Nucleic Acids Res. 2014;42(Database issue):D1193–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Jia J, Zhao S, Kong X, Li Y, Zhao G, He W, et al. Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature. 2013;496:91–5.View ArticlePubMedGoogle Scholar
- Ling H-Q, Zhao S, Liu D, Wang J, Sun H, Zhang C, et al. Draft genome of the wheat A-genome progenitor Triticum urartu. Nature. 2013;496:87–90.View ArticlePubMedGoogle Scholar
- Lynn DJ, Winsor GL, Chan C, Richard N, Laird MR, Barsky A, et al. InnateDB: facilitating systems-level analyses of the mammalian innate immune response. Mol Syst Biol. 2008;4:218.View ArticlePubMed CentralPubMedGoogle Scholar
- Niskanen S, Östergård PRJ. Cliquer User’s Guide: Version 1.0. Communications Laboratory, Helsinki University of Technology, Espoo, Finland, Tech. Rep. T48, 2003.Google Scholar
- Du Z, Zhou X, Ling Y, Zhang Z, Su Z. agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res. 2010;38:W64–70.View ArticlePubMed CentralPubMedGoogle Scholar
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40(Database issue):D1178–86.View ArticlePubMed CentralPubMedGoogle Scholar
- Mayer KFX, Waugh R, Brown JWS, Schulman A, Langridge P, Platzer M, et al. A physical, genetic and functional sequence assembly of the barley genome. Nature. 2012;491:711–6.PubMedGoogle Scholar
- Sakai H, Lee SS, Tanaka T, Numa H, Kim J, Kawahara Y, et al. Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics. Plant Cell Physiol. 2013;54, e6.View ArticlePubMed CentralPubMedGoogle Scholar
- Schwacke R, Schneider A, van der Graaff E, Fischer K, Catoni E, Desimone M, et al. ARAMEMNON, a novel database for Arabidopsis integral membrane proteins. Plant Physiol. 2003;131:16–26.View ArticlePubMed CentralPubMedGoogle Scholar
- Trivedi DK, Yadav S, Vaid N, Tuteja N. Genome wide analysis of Cyclophilin gene family from rice and Arabidopsis and its comparison with yeast. Plant Signal Behav. 2012;7:1653–66.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Luan S. The cyclophilin AtCYP71 interacts with CAF-1 and LHP1 and functions in multiple chromatin remodeling processes. Mol Plant. 2011;4:748–58.View ArticlePubMedGoogle Scholar
- Wu H, Wensley E, Bhave M. Identification and analysis of genes encoding a novel ER-localised Cyclophilin B in wheat potentially involved in storage protein folding. Plant Sci. 2009;176:420–32.View ArticleGoogle Scholar
- Keren I, Tal L, des Francs-Small CC, Araújo WL, Shevtsov S, Shaya F, et al. nMAT1, a nuclear-encoded maturase involved in the trans-splicing of nad1 intron 1, is essential for mitochondrial complex I assembly and function. Plant J. 2012;71:413–26.PubMedGoogle Scholar
- Dalby SJ, Bonen L. Impact of low temperature on splicing of atypical group II introns in wheat mitochondria. Mitochondrion. 2013;13:647–55.View ArticlePubMedGoogle Scholar
- Dash S, Mohanty N. Response of seedlings to heat-stress in cultivars of wheat: Growth temperature-dependent differential modulation of photosystem 1 and 2 activity, and foliar antioxidant defense capacity. J Plant Physiol. 2002;159:49–59.View ArticleGoogle Scholar
- Brestic M, Zivcak M, Kalaji HM, Carpentier R, Allakhverdiev SI. Photosystem II thermostability in situ: environmentally induced acclimation and genotype-specific reactions in Triticum aestivum L. Plant Physiol Biochem. 2012;57:93–105.View ArticlePubMedGoogle Scholar
- Zivcak M, Brestic M, Balatova Z, Drevenakova P, Olsovska K, Kalaji HM, et al. Photosynthetic electron transport and specific photoprotective responses in wheat leaves under drought stress. Photosynth Res. 2013;117:529–46.View ArticlePubMedGoogle Scholar
- Niu B-X, He F-R, He M, Ren D, Chen L-T, Liu Y-G. The ATP-binding cassette transporter OsABCG15 is required for anther development and pollen fertility in rice. J Integr Plant Biol. 2013;55:710–20.View ArticlePubMedGoogle Scholar
- El-Ghazaly G, Jensen W. Studies of the development of wheat (Triticum aestivum) pollen: formation of the pollen aperture. Can J Botany. 1986;64(12):3141–54.View ArticleGoogle Scholar
- Steward N, Kusano T, Sano H. Expression of ZmMET1, a gene encoding a DNA methyltransferase from maize, is associated not only with DNA replication in actively proliferating cells, but also with altered DNA methylation status in cold-stressed quiescent cells. Nucleic Acids Res. 2000;28:3250–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Teerawanichpan P, Chandrasekharan MB, Jiang Y, Narangajavana J, Hall TC. Characterization of two rice DNA methyltransferase genes and RNAi-mediated reactivation of a silenced transgene in rice callus. Planta. 2004;218:337–49.View ArticlePubMedGoogle Scholar
- Dai Y, Ni Z, Dai J, Zhao T, Sun Q. Isolation and expression analysis of genes encoding DNA methyltransferase in wheat (Triticum aestivum L.). Biochim Biophys Acta. 2005;1729:118–25.View ArticlePubMedGoogle Scholar
- Nagy F, Kay SA, Chua N-H. A circadian clock regulates transcription of the wheat Cab-1 gene. Genes Dev. 1988;2:376–82.View ArticleGoogle Scholar
- Li J, Fu F, An M, Zhou S, She Y, Li W. Differential Expression of MicroRNAs in Response to Drought Stress in Maize. J Integr Agric. 2013;12:1414–22.View ArticleGoogle Scholar
- Niu J-S, Yu L, Ma Z-Q, Chen P-D, Liu D-J. Molecular cloning, characterization and mapping of a rhodanese like gene in wheat. Yi Chuan Xue Bao. 2002;29:266–72.PubMedGoogle Scholar
- Gable K, Garton S, Napier JA, Dunn TM. Functional characterization of the Arabidopsis thaliana orthologue of Tsc13p, the enoyl reductase of the yeast microsomal fatty acid elongating system. J Exp Bot. 2004;55:543–5.View ArticlePubMedGoogle Scholar
- Kosma DK, Nemacheck JA, Jenks MA, Williams CE. Changes in properties of wheat leaf cuticle during interactions with Hessian fly. Plant J. 2010;63:31–43.PubMedGoogle Scholar
- Mukhopadhyay P, Basak S, Ghosh TC. Differential selective constraints shaping codon usage pattern of housekeeping and tissue-specific homologous genes of rice and arabidopsis. DNA Res. 2008;15:347–56.View ArticlePubMed CentralPubMedGoogle Scholar
- Mutwil M, Klie S, Tohge T, Giorgi FM, Wilkins O, Campbell MM, et al. PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell. 2011;23:895–910.View ArticlePubMed CentralPubMedGoogle Scholar
- Fuji K, Shimada T, Takahashi H, Tamura K, Koumoto Y, Utsumi S, et al. Arabidopsis vacuolar sorting mutants (green fluorescent seed) can be identified efficiently by secretion of vacuole-targeted green fluorescent protein in their seeds. Plant Cell. 2007;19:597–609.View ArticlePubMed CentralPubMedGoogle Scholar
- Ermakova-Gerdes S, Vermaas W. Inactivation of the open reading frame slr0399 in Synechocystis sp. PCC 6803 functionally complements mutations near the Q(A) niche of photosystem II. A possible role of Slr0399 as a chaperone for quinone binding. J Biol Chem. 1999;274:30540–9.View ArticlePubMedGoogle Scholar
- Nixon PJ, Michoux F, Yu J, Boehm M, Komenda J. Recent advances in understanding the assembly and repair of photosystem II. Ann Bot. 2010;106:1–16.View ArticlePubMed CentralPubMedGoogle Scholar
- Liu Z, Hong S-W, Escobar M, Vierling E, Mitchell DL, Mount DW, et al. Arabidopsis UVH6, a homolog of human XPD and yeast RAD3 DNA repair genes, functions in DNA repair and is essential for plant growth. Plant Physiol. 2003;132:1405–14.View ArticlePubMed CentralPubMedGoogle Scholar
- Patel RV, Nahal HK, Breit R, Provart NJ. BAR expressolog identification: expression profile similarity ranking of homologous genes in plant species. Plant J. 2012;71:1038–50.View ArticlePubMedGoogle Scholar
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, et al. A gene expression map of Arabidopsis thaliana development. Nat Genet. 2005;37:501–6.View ArticlePubMedGoogle Scholar
- Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37:1–13.View ArticlePubMed CentralGoogle Scholar
- Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6, e21800.View ArticlePubMed CentralPubMedGoogle Scholar
- Letunic I, Bork P. Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic Acids Res. 2011;39:W475–8.View ArticlePubMed CentralPubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.