Statistical analysis of fractionation resistance by functional category and expression
© The Author(s) 2017
Published: 24 May 2017
The current literature establishes the importance of gene functional category and expression in promoting or suppressing duplicate gene loss after whole genome doubling in plants, a process known as fractionation. Inspired by studies that have reported gene expression to be the dominating factor in preventing duplicate gene loss, we analyzed the relative effect of functional category and expression.
We use multivariate methods to study data sets on gene retention, function and expression in rosids and asterids to estimate effects and assess their interaction.
Our results suggest that the effects of functional category and expression on duplicate gene retention are independent and show no statistical interaction.
In plants, functional category is the more dominant factor in explaining duplicate gene loss.
After whole genome duplication events there is massive duplicate gene loss, a process known as fractionation. Duplicate genes from whole genome duplications are sensitive to pseudogenization and excision of chromosomal fragments. Notably, fractionation continues even after the polyploid species has been rediploidized. Models such as the Gene Balance Hypothesis and the Gene Dosage Hypothesis [9, 10] attempt to explain the pattern of these duplicate gene losses.
A striking example of gene balance is provided by the preferential retention of circadian clock genes after the whole genome triplication event in the history of Brassica rapa. The regulation of these genes in plants is assured by stoichiometric negative feedback loops. These clock genes, as a whole, are preferentially retained compared to other core eukaryotic genes or to neighbouring genes flanking the clock genes.
Both the Gene Balance Hypothesis and the Gene Dosage Hypothesis are needed because each model explains observations that the other cannot fully explain. However, teasing apart the relative importance of those factors requires rigorous multivariate analysis. This is what we undertake in this paper, and despite the intuitive appeal of the Gene Dosage Hypothesis, we find that gene functional category explains far more of the variation in retention rate than gene expression does.
We construct gene families based on sequence similarity and conserved gene order between extant species using CoGe [17, 18]. These gene families are pruned into smaller units that are linked by the whole genome duplication in the ancestor using the “Orthologs for Multiple Genomes” program. Detailed flowcharts and parameters for generating gene families have been presented previously [12, 13].
The species grape, peach and cacao form the rosid data set. These species can trace their last common ancestor to the period after the divergence of the asterids, following the core eudicot hexaploidization about 120 million years ago. There are no additional rounds of whole genome duplication in the evolutionary paths leading to these present-day species [20–22]. Therefore, whole genome comparative analysis of the rosid data set offers insights on the effects of fractionation over a long period of time.
The asterid data set provides a different viewpoint of the fractionation process compared to the rosid data set. The last common asterid ancestor diverged five to ten million years after the hexaploid core eudicot ancestor. This early divergence means the fractionation process after the hexaploid ancestor of the asterid data set is mostly independent from the fractionation process in the species of the rosid data set. Furthermore, the species of the asterid data set, which consists of the extant species tomato, Mimulus, and Utricularia, have undergone additional rounds of whole genome duplication.
The asterid data set addresses two potential concerns. The first concern is whether the results of the rosid data set represent a general effect or a clade-specific trend. The second concern is whether the additional rounds of whole genome duplication introduce a different pattern compared to a single ancient whole genome duplication event. Thus far the fractionation pattern of the genomes in both data sets is consistent with the literature and appears to be general [11, 13].
For the expression analysis, we use grape to represent the rosids and tomato to represent the asterids. High quality RNA-seq expression data, already normalized and organ-specific, are available for both species [23, 26]. Since a gene’s function may be relevant to specific tissues only, for each gene, we use the highest expression level it displays across all organs to represent its expression score.
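The per-gene expression score described above can be sketched in a few lines. This is a minimal illustration only: the gene identifiers, organ names and expression values are hypothetical, and the function name `expression_score` is our own, not part of any published pipeline.

```python
def expression_score(organ_levels):
    """Return the highest normalized expression level observed across organs."""
    return max(organ_levels.values())


# Hypothetical normalized, organ-specific expression values (not real data).
expression = {
    "geneA": {"leaf": 2.5, "root": 0.1, "flower": 14.0},
    "geneB": {"leaf": 0.0, "root": 3.2, "flower": 1.1},
}

# One score per gene: the maximum over all organs.
scores = {gene: expression_score(levels) for gene, levels in expression.items()}
```

Taking the maximum rather than the mean keeps a gene that is strongly expressed in a single organ from being scored as weakly expressed.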
We use retention indices to measure how fractionation-resistant or fractionation-prone gene families are. The retention index of a gene family is the number of species in which the gene is still maintained in duplicate. For example, if a gene family of the rosid data set is maintained as duplicates only in grape, then the retention index of that gene family is one. Since there are three species in both the rosid data set and the asterid data set, retention indices range from zero (gene reduced to singletons in all species) to three (gene maintained as duplicates in all species).
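As a rough sketch of the retention index, the function below counts the species in which a family still has more than one copy. The per-species copy counts are hypothetical, and treating "maintained in duplicate" as "two or more copies" is an assumption made for this illustration.

```python
def retention_index(copies_per_species):
    """Count the species in which the gene family is still present in duplicate.

    Assumption: 'in duplicate' is read as two or more retained copies.
    """
    return sum(1 for copies in copies_per_species.values() if copies >= 2)


# Hypothetical rosid gene family: duplicated in grape only.
family = {"grape": 2, "peach": 1, "cacao": 1}
index = retention_index(family)  # 1
```

With three species per data set, the function returns a value between zero and three, matching the range given above.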
For the expression analysis, we use individual genes instead of gene families, for two reasons. The first reason is that genes in duplicate families have varying gene expressions that may differ by orders of magnitude. The skewness of the data prevents us from using averages. Second, we cannot just take the highest expressing gene in the gene family in the same way as we chose the organ with the highest expression to represent the gene’s score. This is to avoid the artifact that the more genes a gene family has, the higher the expression of the gene family will be by virtue of having more chance to include a high expressing gene.
We also bin gene expression data into two groups, HighExp and LowExp, as an additional normalization step. Genes of the HighExp group have expression levels greater than or equal to the median gene expression level of their functional category. The LowExp group contains genes that have expression levels lower than the median gene expression level of their functional category.
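The binning step can be sketched as follows. The gene identifiers and scores are hypothetical, and the function name `bin_by_category_median` is ours; only the rule itself (at or above the category median is HighExp, below it is LowExp) comes from the description above.

```python
from statistics import median


def bin_by_category_median(genes):
    """Label each gene HighExp or LowExp relative to its own category's median.

    genes: list of (gene_id, category, expression_score) tuples.
    Returns a dict mapping gene_id to "HighExp" or "LowExp".
    """
    scores_by_category = {}
    for _, category, score in genes:
        scores_by_category.setdefault(category, []).append(score)
    medians = {cat: median(vals) for cat, vals in scores_by_category.items()}
    return {
        gene_id: "HighExp" if score >= medians[category] else "LowExp"
        for gene_id, category, score in genes
    }


# Hypothetical genes and scores (not real data).
genes = [
    ("g1", "Z1", 1.0), ("g2", "Z1", 2.0), ("g3", "Z1", 3.0), ("g4", "Z1", 4.0),
    ("g5", "Z3", 10.0), ("g6", "Z3", 20.0), ("g7", "Z3", 30.0),
]
bins = bin_by_category_median(genes)
```

Because the median is computed within each functional category, a category whose genes are expressed at a high level overall still contributes genes to both bins.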
We designate three high-level GO functional categories (Fig. 5) that we previously found to have the strongest effect on fractionation [13, 14]. The first category is “Metabolic process” (Z1), one of the most fractionation-prone. The second category is “Enzyme class” (Z2). It is also highly fractionation-prone, but it includes enzymes involved in signalling pathways, so the category as a whole may show increased retention compared to Z1. The third category is “Regulation and Response” (Z3). It is composed of the two most fractionation-resistant GO categories. These three high-level GO functional categories cover two of the three distinct GO domains: “biological process” (Z1 and Z3) and “molecular function” (Z2).
Each high-level functional category is further divided into six low-level GO categories to represent more specific and biologically distinct functions. The GO terms “secondary metabolic process”, “lipid biosynthetic process”, “steroid metabolic process”, “nucleobase-containing compound metabolic process”, “carbohydrate metabolic process”, and “protein metabolic process” represent Z1. These six metabolic GO terms are representative of diverse metabolic processes. The GO terms “transferase activity”, “oxidoreductase activity”, “hydrolase activity”, “ligase activity”, “lyase activity”, and “isomerase activity”, the six major enzyme classes, represent Z2.
GO terms and number of genes
Metabolic Process (Z1)
- lipid biosynthetic process
- steroid metabolic process
- nucleobase-containing compound metabolic process
- carbohydrate metabolic process
- protein metabolic process
- secondary metabolic process
Enzyme Class (Z2)
Regulation and Response (Z3)
- regulation of metabolic process
- nucleic acid binding transcription factor activity
- response to hormone
- response to temperature stimulus
- response to stress
Results and discussion
Figure 6 shows the average retention index for each functional category. This result is consistent with our prediction that genes of Z3 are more fractionation-resistant than genes of Z2 and Z1.
Figure 7 also shows that in grape, the difference between fractionation-resistant Z3 and fractionation-prone Z1 and Z2 is smaller than the difference in tomato. One reason for this observation is that gene families that are singletons in all three species make up a far higher proportion of the rosid data set than of the asterid data set, so even the fractionation-resistant functional category contains many singleton gene families.
Table 2. ANOVA table on balanced grape and tomato data (Type II tests), reported separately for grape and tomato.
Expression has been suggested to be the most important factor in determining duplicate retention after whole genome duplication events. Our results suggest otherwise: functional category is the more dominant factor of the two. Furthermore, our results in Table 2 suggest that there is no interaction between functional category and expression level.
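To make the comparison of main effects and interaction concrete, the following sketch computes the sums of squares and F statistics of a two-way ANOVA for a balanced design. It is an illustration under stated assumptions, not the analysis code behind Table 2: we assume a retention index as the response, functional category as factor A and expression bin as factor B, and the toy observations are invented and constructed to be exactly additive. In a balanced design the sequential and Type II sums of squares coincide, so the simple formulas below suffice. P-values are omitted because they would require an F distribution, which the Python standard library does not provide.

```python
from itertools import product
from statistics import mean


def balanced_two_way_anova(cells):
    """Two-way ANOVA with interaction for a complete, balanced design.

    cells maps (level_a, level_b) -> list of observations.
    Returns (sums of squares, degrees of freedom, F statistics) as dicts.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    grid = list(product(a_levels, b_levels))
    sizes = {len(obs) for obs in cells.values()}
    if len(cells) != len(grid) or len(sizes) != 1:
        raise ValueError("design must be complete and balanced")
    n = sizes.pop()  # replicates per cell
    if n < 2:
        raise ValueError("need at least two observations per cell")

    grand = mean(y for obs in cells.values() for y in obs)
    mean_a = {a: mean(y for b in b_levels for y in cells[(a, b)]) for a in a_levels}
    mean_b = {b: mean(y for a in a_levels for y in cells[(a, b)]) for b in b_levels}
    mean_ab = {key: mean(obs) for key, obs in cells.items()}

    ss = {
        "A": n * len(b_levels) * sum((mean_a[a] - grand) ** 2 for a in a_levels),
        "B": n * len(a_levels) * sum((mean_b[b] - grand) ** 2 for b in b_levels),
        "A:B": n * sum(
            (mean_ab[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2 for a, b in grid
        ),
        "Residual": sum(
            (y - mean_ab[key]) ** 2 for key, obs in cells.items() for y in obs
        ),
    }
    df = {
        "A": len(a_levels) - 1,
        "B": len(b_levels) - 1,
        "A:B": (len(a_levels) - 1) * (len(b_levels) - 1),
        "Residual": len(grid) * (n - 1),
    }
    ms_residual = ss["Residual"] / df["Residual"]
    f_stats = {t: (ss[t] / df[t]) / ms_residual for t in ("A", "B", "A:B")}
    return ss, df, f_stats


# Invented toy data: retention indices by (functional category, expression bin).
toy = {
    ("Z1", "LowExp"): [0, 1],
    ("Z1", "HighExp"): [1, 1],
    ("Z3", "LowExp"): [2, 3],
    ("Z3", "HighExp"): [3, 3],
}
ss, df, f_stats = balanced_two_way_anova(toy)
```

In this toy example the category effect accounts for most of the variation, the expression effect is small, and the interaction term is zero by construction; it only illustrates how the three terms are separated, not the magnitude of the effects reported here.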
We expect the result presented here to hold in other flowering plant lineages as well, given that both the rosid data set and the asterid data set show a consistent trend. Also, our previous analyses on fractionation resistance [13, 14] show these retention trends to be consistent across different lineages, giving us more confidence in this prediction.
Going forward, we want to further explore the role of expression in fractionation. One direction is to explore the different types of expression. Some genes are expressed only in certain tissues or at certain developmental stages, such as during the development of flowers; others have organ-specific expression patterns; still others are always on but fluctuate depending on the situation. Different expression patterns may have different fractionation tendencies.
Another direction is to expand the analysis to other genes that are currently not part of the analysis. One particular analysis for future work is the relationship between retained duplicates and the nearby genes. Retained duplicates are reported to have an effect on the distribution of genes with copy number variation in humans. We can explore if similar effects are also present in plants.
In summary, we have evidence to suggest that functional category plays a more important role than gene expression level in duplicate gene retention after whole genome duplication. Many challenges and possibilities remain for future work that builds on these results to better explain the mechanisms and the effects of the fractionation process.
Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.
This publication was funded by Discovery Grant RGPIN-2016-05585 to DS from the Natural Sciences and Engineering Research Council of Canada.
Availability of data and material
All genome data is available from the CoGe website.
ECC and DS did the research and writing; AM and JHC gave statistical guidance. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 4, 2017: Selected articles from the Fifth IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2015): Genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-4.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Soltis DE, Soltis PS. Polyploidy: recurrent formation and genome evolution. Trends Ecol Evol. 1999; 14(9):348–52.
2. Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Wall PK, Soltis PS, et al. Polyploidy and angiosperm diversification. Am J Bot. 2009; 96(1):336–48.
3. Abrouk M, Murat F, Pont C, Messing J, Jackson S, Faraut T, Tannier E, Plomion C, Cooke R, Feuillet C, et al. Palaeogenomics of plants: synteny-based modelling of extinct ancestors. Trends Plant Sci. 2010; 15(9):479–87.
4. Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473(7345):97–100.
5. Hegarty M, Coate J, Sherman-Broyles S, Abbott R, Hiscock S, Doyle J. Lessons from natural and artificial polyploids in higher plants. Cytogenet Genome Res. 2013; 140(2–4):204–25.
6. Crow KD, Wagner GP. What is the role of genome duplication in the evolution of complexity and diversity? Mol Biol Evol. 2006; 23(5):887–92.
7. Comai L. Genetic and epigenetic interactions in allopolyploid plants. Plant Mol Biol. 2000; 43:387–99.
8. Birchler JA, Veitia RA. Gene balance hypothesis: Connecting issues of dosage sensitivity across biological disciplines. Proc Nat Acad Sci. 2012; 109(37):14746–53.
9. Papp B, Pal C, Hurst LD. Dosage sensitivity and the evolution of gene families in yeast. Nature. 2003; 424:194–7.
10. Schnable JC, Wang X, Pires JC, Freeling M. Escape from preferential retention following repeated whole genome duplication in plants. Front Plant Sci. 2012; 3(94):1–8.
11. Conant GC, Birchler JA, Pires JC. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr Opin Plant Biol. 2014; 19:91–8.
12. Zheng C, Chen E, Albert VA, Lyons E, Sankoff D. Ancient eudicot hexaploidy meets ancestral eurosid gene order. BMC Genomics. 2013; 14(Suppl 7):3.
13. Chen ECH, Najar CBA, Zheng C, Brandts A, Lyons E, Tang H, Carretero-Paulet L, Albert VA, Sankoff D. The dynamics of functional classes of plant genes in rediploidized ancient polyploids. BMC Bioinformatics. 2013; 14(S-15):19.
14. Chen EC, Sankoff D. Gene expression and fractionation resistance. BMC Genomics. 2014; 15(Suppl 6):19.
15. Lou P, Wu J, Cheng F, Cressman LG, Wang X, McClung CR. Preferential retention of circadian clock genes during diploidization following whole genome triplication in Brassica rapa. Plant Cell Online. 2012; 24(6):2415–26.
16. Gout JF, Kahn D, Duret L, Paramecium Post-Genomics Consortium. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 2010; 6(5):1000944.
17. Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008; 53:661–73.
18. Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, Wang X, Bowers J, Paterson A, Lisch D, Freeling M. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar and grape: CoGe with rosids. Plant Physiol. 2008; 148:1772–81.
19. Zheng C, Swenson K, Lyons E, Sankoff D. OMG! orthologs in multiple genomes - competing graph-theoretical formulations. In: Przytycka T, Sagot M-F, editors. WABI 2011, 11th Workshop on Algorithms in Bioinformatics. Lecture Notes in Computer Science 6833. Springer: 2011. p. 364–75.
20. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyére C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, Felice N, Paillard S, Juman I, Moroldo M, Scalabrin S, Canaguier A, Le Clainche I, Malacrida G, Durand E, Pesole G, Laucou V, Chatelet P, Merdinoglu D, Delledonne M, Pezzotti M, Lecharny A, Scarpelli C, Artiguenave F, Pè ME, Valle G, Morgante M, Caboche M, Adam-Blondon AF, Weissenbach J, Quétier F, Wincker P. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007; 449:463–7.
21. Jung S, Cestaro A, Troggio M, Main D, Zheng P, Cho I, Folta KM, Sosinski BAA, Celton JM, Arús P, Shulaev V, Verde I, Morgante M, Rokhsar DS, Velasco R, Sargent DJ. Whole genome comparisons of Fragaria, Prunus and Malus reveal different modes of evolution between rosaceous subfamilies. BMC Genomics. 2012; 13:129.
22. Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, Abrouk M, Murat F, Fouet O, Poulain J, Ruiz M, Roguet Y, Rodier-Goud M, Barbosa-Neto JF, Sabot F, Kudrna D, Ammiraju JS, Schuster SC, Carlson JE, Sallet E, Schiex T, Dievart A, Kramer M, Gelley L, Shi Z, Bérard A, Viot C, Boccara M, Risterucci A, Guignon V, Sabau X, Axtell MJ, Ma Z, Zhang Y, Brown S, Bourge M, Golser W, Song X, Clement D, Rivallan R, Tahi M, Akaza JM, Pitollat B, Gramacho K, D’Hont A, Brunel D, Infante D, Kebe I, Costet P, Wing R, McCombie WR, Guiderdoni E, Quétier F, Panaud O, Wincker P, Bocs S, Lanaud C. The genome of Theobroma cacao. Nat Genet. 2011; 43:101–8.
23. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012; 485:635–41.
24. US Department of Energy, J.G.I. Mimulus version 1; 2010. https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Mguttatus.
25. Ibarra-Laclette E, Lyons E, Hernández-Guzmán G, Pérez-Torres CA, Carretero-Paulet L, Chang TH, Lan T, Welch AJ, Juárez MJ, Simpson J, Fernández-Cortés A, Arteaga-Vázquez M, Góngora-Castillo E, Acevedo-Hernández G, Schuster SC, Himmelbauer H, Minoche AE, Xu S, Lynch M, Oropeza-Aburto A, Cervantes-Pérez SA, de Jesús Ortega-Estrada M, Cervantes-Luevano JI, Michael TP, Mockler T, Bryant D, Herrera-Estrella A, Albert VA, Herrera-Estrella L. Architecture and evolution of a minute plant genome. Nature. 2013; 498:94–8.
26. Vitulo N, Forcato C, Carpinelli E, Telatin A, Campagna D, D’Angelo M, Zimbello R, Corso M, Vannozzi A, Bonghi C, Lucchin M, Valle G. A deep survey of alternative splicing in grape reveals changes in the splicing machinery related to tissue, stress condition and genotype. BMC Plant Biol. 2014; 14(1):99.
27. The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9. Data Version 2012-04-20.
28. Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005; 21(18):3674–6.
29. Makino T, McLysaght A, Kawata M. Genome-wide deserts for copy number variation in vertebrates. Nat Commun. 2013; 4:2283.
30. Soltis DE, Albert VA, Leebens-Mack J, Palmer JD, Wing RA, dePamphilis CW, Ma H, Carlson JE, Altman N, Kim S, et al. The Amborella genome: an evolutionary reference for plant biology. Genome Biol. 2008; 9(402):10–1186.