Statistical analysis of fractionation resistance by functional category and expression
BMC Genomics volume 18, Article number: 366 (2017)
The current literature establishes the importance of gene functional category and expression in promoting or suppressing duplicate gene loss after whole genome doubling in plants, a process known as fractionation. Inspired by studies that have reported gene expression to be the dominating factor in preventing duplicate gene loss, we analyzed the relative effect of functional category and expression.
We use multivariate methods to study data sets on gene retention, function and expression in rosids and asterids to estimate effects and assess their interaction.
Our results suggest that the effects of functional category and expression on duplicate gene retention are independent and show no statistical interaction.
In plants, functional category is the dominant factor in explaining duplicate gene loss.
The proliferation and advancement of tools for genetic analysis have changed the understanding of the role of polyploidy in evolution. Polyploidy, which can result from whole genome duplication events that double or triple the genome, is now considered to be a recurrent and frequent theme in plant evolution. Virtually all land plants have a polyploid ancestor [2–5], with many lineages having additional rounds of whole genome duplication events (Fig. 1). These special events in evolutionary history have been linked to increased morphological and genetic diversity [6, 7].
After whole genome duplication events there is massive duplicate gene loss, a process known as fractionation. Duplicate genes from whole genome duplications are sensitive to pseudogenization and excision of chromosomal fragments. Notably, fractionation continues even after the polyploid species has been rediploidized. Models such as the Gene Balance Hypothesis and the Gene Dosage Hypothesis [9, 10] attempt to explain the pattern of these duplicate gene losses.
The Gene Balance Hypothesis argues that the need to maintain stoichiometric ratios between important gene products results in the maintenance of these duplicate genes. In this model, duplicate regulatory genes and duplicate genes responding to stimulus are expected to be maintained at a greater rate due to gene product interactions. Gene products that do not need to interact with other gene products to maintain a delicate balance, such as many metabolic and enzymatic genes which interact with metabolites such as food, sugar, and fat, are expected to be lost at a greater rate. We have verified these general expectations in previous work [12–14] as documented in Fig. 2.
A striking example of gene balance is provided by the preferential retention of circadian clock genes after the whole genome triplication event in the history of Brassica rapa. The regulation of these genes in plants is assured by stoichiometric negative feedback loops. These clock genes, as a whole, are preferentially retained compared to other core eukaryotic genes or to neighbouring genes flanking the clock genes.
The competing model, the Gene Dosage Hypothesis, argues that important genes are simply more likely to be kept. Because maintaining high expression levels is biologically expensive, a high expression level is a good indicator that a gene is important. Prior to the whole genome duplication (WGD), loss of these genes would entail significant loss of fitness. After WGD, the organism has reached a new normal, with twice the previous activity, and disproportionate loss of these expensive genes via fractionation would also incur a decrease in fitness. Therefore, duplicate genes with high expression levels will be maintained in duplicate. In this model, gene function is still the driving force to maintain these duplicates, but high-level general functional categories, such as the above-mentioned metabolic, enzymatic, regulatory, and response patterns, are too general to be of use in predicting duplicate gene retention. Gout et al. reported that, in Paramecium, highly expressed genes are maintained in duplicate more often than lowly expressed genes. Controlling for different functional categories having different expression levels does not change this result (Fig. 3). In previous work, we also reported that duplicate genes are more likely to be maintained as duplicates if they have high expression levels, regardless of their functional categories. However, our results showed that the effect of gene expression on maintaining duplicate genes after whole genome duplication events is much less pronounced than in the Paramecium study.
Both the Gene Balance Hypothesis and the Gene Dosage Hypothesis are needed because each model explains observations that the other model cannot fully explain. However, teasing apart the relative importance of those factors requires rigorous multivariate analysis. This is what we undertake in this paper, and despite the intuitive appeal of the Gene Dosage Hypothesis, we find that gene functional category explains far more of the variation in retention rate than gene expression does.
We construct gene families based on sequence similarity and conserved gene order between extant species using CoGe [17, 18]. These gene families are pruned into smaller units that are linked by the whole genome duplication in the ancestor using the “Orthologs for Multiple Genomes” program. Detailed flowcharts and parameters for generating gene families have been presented previously [12, 13].
The species grape, peach and cacao form the rosid data set. These species can trace their last common ancestor to the period after the divergence of the asterids, following the core eudicot hexaploid about 120 million years ago. There are no additional rounds of whole genome duplication in the evolutionary paths leading to these present-day species [20–22]. Therefore, whole genome comparative analysis of the rosid data set offers insights on the effects of fractionation over a long period of time.
The asterid data set provides a different viewpoint of the fractionation process compared to the rosid data set. The last common asterid ancestor diverged five to ten million years after the hexaploid core eudicot ancestor. This early divergence means the fractionation process after the hexaploid ancestor of the asterid data set is mostly independent from the fractionation process in the species of the rosid data set. Furthermore, the species of the asterid data set, which consists of the extant species tomato, Mimulus, and Utricularia, have additional rounds of whole genome duplication.
The asterid data set addresses two potential concerns. The first concern is whether the results of the rosid data set represent a general effect or a clade-specific trend. The second concern is whether the additional rounds of whole genome duplication introduce a different pattern compared to a single ancient whole genome duplication event. Thus far the fractionation pattern of the genomes in both data sets is consistent with the literature and appears to be general [11, 13].
For the expression analysis, we use grape to represent the rosids and tomato to represent the asterids. High quality RNA-seq expression data, already normalized and organ-specific, are available for both species [23, 26]. Since a gene’s function may be relevant to specific tissues only, for each gene, we use the highest expression level it displays across all organs to represent its expression score.
We use retention indices to measure how fractionation-resistant or fractionation-prone gene families are. The retention index of a gene family is the number of species in which the gene is still maintained in duplicate. For example, if a gene family of the rosid data set is maintained as duplicates only in grape, then the retention index of that gene family is one. Since there are three species in both the rosid data set and the asterid data set, retention indices range from zero (gene reduced to a singleton in all species) to three (gene maintained as duplicates in all species).
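The retention index can be illustrated with a minimal sketch. The input layout (a copy count per species for each gene family), the family names, and the helper `retention_index` are hypothetical and chosen only for illustration; we also assume here that "maintained in duplicate" means at least two copies in a species.

```python
from collections import Counter

# Hypothetical toy input: for each gene family, the number of gene copies
# found in each of the three species of a data set (invented values).
copies = {
    "fam_A": {"grape": 2, "peach": 1, "cacao": 1},
    "fam_B": {"grape": 2, "peach": 2, "cacao": 2},
    "fam_C": {"grape": 1, "peach": 1, "cacao": 1},
}

def retention_index(copies_by_species):
    """Count the species in which the family still has two or more copies."""
    return sum(1 for n in copies_by_species.values() if n >= 2)

# Retention index per family, ranging from 0 (all singletons) to 3.
indices = {fam: retention_index(c) for fam, c in copies.items()}

# Number of families in each retention category.
histogram = Counter(indices.values())
```

Tallying the indices with `Counter` gives the kind of per-category family counts that a retention histogram summarizes.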
Figure 4 summarizes how many gene families fall into each retention category, based on each gene family’s retention index. For rosids, a much larger proportion of gene families have become singletons. While the “all singletons” category (retention index of zero) also contains the highest number of gene families in asterids, the families are more evenly distributed among the retention categories.
For the expression analysis, we use individual genes instead of gene families, for two reasons. The first reason is that genes in duplicate families have expression levels that may differ by orders of magnitude. The skewness of the data prevents us from using averages. Second, we cannot simply take the highest-expressing gene in the gene family in the same way as we chose the organ with the highest expression to represent the gene’s score. Doing so would introduce an artifact: the more genes a gene family has, the higher its expression would be, simply because it has more chances to include a highly expressed gene.
We also bin gene expression data into two groups, HighExp and LowExp, as an additional normalization step. Genes in the HighExp group have expression levels greater than or equal to the median gene expression level of their functional category. The LowExp group contains genes with expression levels below that median.
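The scoring and binning steps can be sketched as follows. The gene names, organs, and expression values are invented, the helper names are hypothetical, and for simplicity each toy gene belongs to a single functional category, although a real gene may carry several.

```python
from statistics import median

# Hypothetical toy data: gene -> (functional category, expression per organ).
genes = {
    "g1": ("Z1", {"leaf": 3.0, "root": 10.0}),
    "g2": ("Z1", {"leaf": 1.0, "root": 2.0}),
    "g3": ("Z1", {"leaf": 5.0, "root": 4.0}),
    "g4": ("Z3", {"leaf": 8.0, "root": 0.5}),
    "g5": ("Z3", {"leaf": 20.0, "root": 7.0}),
}

def expression_score(per_organ):
    """Use the highest expression level across all organs as the score."""
    return max(per_organ.values())

def bin_by_category_median(genes):
    """Label each gene HighExp or LowExp relative to its category median."""
    scores = {g: expression_score(expr) for g, (_, expr) in genes.items()}
    by_category = {}
    for g, (cat, _) in genes.items():
        by_category.setdefault(cat, []).append(scores[g])
    medians = {cat: median(vals) for cat, vals in by_category.items()}
    # Genes at or above the category median go to HighExp, the rest to LowExp.
    return {
        g: "HighExp" if scores[g] >= medians[cat] else "LowExp"
        for g, (cat, _) in genes.items()
    }

groups = bin_by_category_median(genes)
```

Because the median is taken within each category, a gene is compared only with genes of similar function, which is what makes the binning act as a normalization step.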
We use GO terms to classify gene families into functional categories via Blast2GO. GO terms are nested within each other to provide different resolutions of annotation (Fig. 5). We call GO terms that are close to one of the three “root terms” “high level terms”. These high level terms describe general functional categories. As a result, a particular gene may be annotated with multiple high level terms, as shown in Fig. 5.
We designate three high-level GO functional categories (Fig. 5) that we previously found to have the highest effect on fractionation [13, 14]. The first category is “Metabolic process (Z1)”, one of the most fractionation-prone. The second category is “Enzyme class (Z2)”. It is also highly fractionation-prone, but it includes enzymes involved in signalling pathways, so the category as a whole may show increased retention compared to Z1. The third category is “Regulation and Response” (Z3). This is composed of the two most fractionation-resistant GO categories. These three high-level GO functional categories cover two of the three distinct GO domains: “biological process” (Z1 and Z3) and “molecular function” (Z2).
Each high-level functional category is further divided into six low-level GO categories to represent more specific and biologically distinct functions. GO terms “secondary metabolic process”, “lipid biosynthetic process”, “steroid metabolic process”, “nucleobase-containing compound metabolic process”, “carbohydrate metabolic process”, and “protein metabolic process” represent Z1. These six metabolic GO terms are representative of diverse metabolic processes. GO terms “transferase activity”, “oxidoreductase activity”, “hydrolase activity”, “ligase activity”, “lyase activity”, and “isomerase activity”, the six major enzyme classes, represent Z2.
GO terms “regulation of metabolic process”, “nucleic acid transcription factor activity”, “signal transduction”, “response to hormone”, “response to temperature”, and “response to stress” represent Z3. This is a combination of two highly fractionation-resistant functional categories, “biological regulation” and “response to stimulus”, so that there are six low-level and biologically distinct GO terms in each high-level functional category (Table 1).
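The category scheme above can be written down as a small lookup structure. This is only an illustrative sketch: the dictionary `CATEGORIES` and the helper `high_level_categories` are hypothetical names, and matching on term names (rather than GO identifiers or the full GO hierarchy) is a simplifying assumption.

```python
# The three high-level categories and their six low-level GO terms,
# as listed in the text (term names only; GO identifiers are omitted).
CATEGORIES = {
    "Z1": {
        "secondary metabolic process",
        "lipid biosynthetic process",
        "steroid metabolic process",
        "nucleobase-containing compound metabolic process",
        "carbohydrate metabolic process",
        "protein metabolic process",
    },
    "Z2": {
        "transferase activity",
        "oxidoreductase activity",
        "hydrolase activity",
        "ligase activity",
        "lyase activity",
        "isomerase activity",
    },
    "Z3": {
        "regulation of metabolic process",
        "nucleic acid transcription factor activity",
        "signal transduction",
        "response to hormone",
        "response to temperature",
        "response to stress",
    },
}

def high_level_categories(go_terms):
    """Return the sorted high-level labels whose low-level terms
    overlap the given annotation (a gene may match several)."""
    terms = set(go_terms)
    return sorted(z for z, low in CATEGORIES.items() if terms & low)

# A hypothetical gene annotated with an enzyme term and a response term.
example = high_level_categories(["hydrolase activity", "response to stress"])
```

The example gene lands in both Z2 and Z3, which mirrors the point that a single gene can be annotated with more than one high-level term.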
Results and discussion
The inherently different gene count for different functions (Table 1) means the categories are not balanced as would be required for ANOVA. We sidestep the issue by using the average retention index of each functional category instead of the raw count. This strategy comes at the expense of statistical power since we are now left with just two data points for each low-level functional category. Still, Fig. 6 shows the expected result of high expression correlating with high fractionation resistance.
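The averaging step can be sketched as below, under the assumption that each record pairs a low-level GO term and an expression group with a retention index. The records and the helper `mean_retention` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records: (low-level GO term, expression group, retention index).
records = [
    ("signal transduction", "HighExp", 3),
    ("signal transduction", "HighExp", 1),
    ("signal transduction", "LowExp", 1),
    ("signal transduction", "LowExp", 0),
    ("hydrolase activity", "HighExp", 1),
    ("hydrolase activity", "LowExp", 0),
    ("hydrolase activity", "LowExp", 0),
]

def mean_retention(records):
    """Average the retention index within each (GO term, expression group) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of indices, count]
    for term, group, index in records:
        cell = totals[(term, group)]
        cell[0] += index
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

cell_means = mean_retention(records)
```

Each low-level term ends up with exactly two values, one per expression group, regardless of how many genes it contains. This is what balances the design, and also why so few data points remain.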
Figure 6 shows the average retention index for each functional category. This result is consistent with our prediction that genes of Z3 are more fractionation-resistant than genes of Z2 and Z1.
Figure 7 further reinforces this prediction. In grape, the difference between Z3 and Z2 is only marginally significant after p-value adjustment, likely due to insufficient data. The clear difference between Z3 and Z2 in tomato bolsters the view that the difference is real.
Figure 7 also shows that in grape, the differences between the fractionation-resistant Z3 and the fractionation-prone Z1 and Z2 are smaller than the differences in tomato. One reason is that gene families that are singletons in all three species constitute a far higher proportion of the rosid data set than of the asterid data set, so even the fractionation-resistant functional category contains many singleton gene families.
The ANOVA table (Table 2) addresses the main question of the paper: which of the Gene Balance Hypothesis and the Gene Dosage Hypothesis has the greater impact on duplicate gene retention? We answer this by determining whether functional category or expression level has the larger effect size in the two-way ANOVA. In the table, the effect size, measured as partial eta squared, supports the conjecture in the Chen et al. paper that functional category carries more weight in determining retention indices than expression level. The table also shows that while functional category strongly affects average retention indices, the effect of expression level on average retention indices is no longer significant.
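As an illustration of the effect-size comparison, the sketch below computes the sums of squares of a balanced two-way design and the partial eta squared of each effect, defined as SS_effect / (SS_effect + SS_error). It is not the analysis code behind Table 2: the function names and toy values are invented, and treating the replicates within a cell as independent observations is an assumption.

```python
def two_way_anova_ss(y):
    """Sums of squares for a balanced two-way design.

    y[i][j] lists the replicate observations for level i of factor A
    (functional category) and level j of factor B (expression group).
    Every cell must hold the same number of values.
    """
    a, b, n = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / n for j in range(b)] for i in range(a)]
    mean_a = [sum(cell[i]) / b for i in range(a)]
    mean_b = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(mean_a) / a
    ss_a = b * n * sum((m - grand) ** 2 for m in mean_a)
    ss_b = a * n * sum((m - grand) ** 2 for m in mean_b)
    ss_ab = n * sum(
        (cell[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_err = sum(
        (v - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for v in y[i][j]
    )
    return {"category": ss_a, "expression": ss_b,
            "interaction": ss_ab, "error": ss_err}

def partial_eta_squared(ss):
    """Partial eta squared: SS_effect / (SS_effect + SS_error)."""
    err = ss["error"]
    return {k: v / (v + err) for k, v in ss.items() if k != "error"}

# Invented average retention indices: rows are Z1, Z2, Z3; columns are
# HighExp, LowExp; each cell holds two replicate low-level categories.
toy = [
    [[0.5, 0.7], [0.3, 0.5]],
    [[1.0, 1.2], [0.8, 1.0]],
    [[1.5, 1.7], [1.3, 1.5]],
]
eta = partial_eta_squared(two_way_anova_ss(toy))
```

In this toy table the category effect is constructed to be larger than the expression effect, and the two effects are purely additive, so the interaction term is numerically zero. The numbers carry no biological meaning.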
Expression has been suggested to be the most important factor in determining duplicate retention after whole genome duplication events. Our results suggest otherwise: functional category is the dominant factor of the two. Furthermore, our results in Table 2 suggest that there is no interaction between functional category and expression level.
We expect the result presented here to be present in other flowering plant lineages as well, given how both the rosid dataset and the asterid dataset show a consistent trend. Also, our previous analyses on fractionation resistance [13, 14] show these retention trends to be consistent across different lineages, giving us more confidence in this prediction.
Going forward, we want to further explore the role of expression in fractionation. One direction is to explore different types of expression. Some genes are expressed only in certain tissues or at certain developmental stages, such as flower development; others have organ-specific expression patterns; still others are always on but fluctuate depending on the situation. Different expression patterns may have different fractionation tendencies.
Another direction is to expand the analysis to other genes that are currently not part of the analysis. One particular analysis for future work is the relationship between retained duplicates and the nearby genes. Retained duplicates are reported to have an effect on the distribution of genes with copy number variation in humans. We can explore if similar effects are also present in plants.
In summary, we have evidence to suggest that functional category plays a more important role than gene expression level in duplicate gene retention after whole genome duplication. Many challenges and opportunities remain for building on this work to better explain the mechanisms and effects of the fractionation process.
Soltis DE, Soltis PS. Polyploidy: recurrent formation and genome evolution. Trends Ecol Evol. 1999; 14(9):348–52.
Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson AH, Zheng C, Sankoff D, Wall PK, Soltis PS, et al. Polyploidy and angiosperm diversification. Am J Bot. 2009; 96(1):336–48.
Abrouk M, Murat F, Pont C, Messing J, Jackson S, Faraut T, Tannier E, Plomion C, Cooke R, Feuillet C, et al. Palaeogenomics of plants: synteny-based modelling of extinct ancestors. Trends Plant Sci. 2010; 15(9):479–87.
Jiao Y, Wickett NJ, Ayyampalayam S, Chanderbali AS, Landherr L, Ralph PE, Tomsho LP, Hu Y, Liang H, Soltis PS, et al. Ancestral polyploidy in seed plants and angiosperms. Nature. 2011; 473(7345):97–100.
Hegarty M, Coate J, Sherman-Broyles S, Abbott R, Hiscock S, Doyle J. Lessons from natural and artificial polyploids in higher plants. Cytogenet Genome Res. 2013; 140(2–4):204–25.
Crow KD, Wagner GP. What is the role of genome duplication in the evolution of complexity and diversity? Mol Biol Evol. 2006; 23(5):887–92.
Comai L. Genetic and epigenetic interactions in allopolyploid plants. Plant Mol Biol. 2000; 43:387–99.
Birchler JA, Veitia RA. Gene balance hypothesis: Connecting issues of dosage sensitivity across biological disciplines. Proc Nat Acad Sci. 2012; 109(37):14746–53.
Papp B, Pal C, Hurst LD. Dosage sensitivity and the evolution of gene families in yeast. Nature. 2003; 424:194–7.
Schnable JC, Wang X, Pires JC, Freeling M. Escape from preferential retention following repeated whole genome duplication in plants. Front Plant Sci. 2012; 3(94):1–8.
Conant GC, Birchler JA, Pires JC. Dosage, duplication, and diploidization: clarifying the interplay of multiple models for duplicate gene evolution over time. Curr Opin Plant Biol. 2014; 19:91–8.
Zheng C, Chen E, Albert VA, Lyons E, Sankoff D. Ancient eudicot hexaploidy meets ancestral eurosid gene order. BMC Genomics. 2013; 14(Suppl 7):3.
Chen ECH, Najar CBA, Zheng C, Brandts A, Lyons E, Tang H, Carretero-Paulet L, Albert VA, Sankoff D. The dynamics of functional classes of plant genes in rediploidized ancient polyploids. BMC Bioinformatics. 2013; 14(S-15):19.
Chen EC, Sankoff D. Gene expression and fractionation resistance. BMC Genomics. 2014; 15(Suppl 6):19.
Lou P, Wu J, Cheng F, Cressman LG, Wang X, McClung CR. Preferential retention of circadian clock genes during diploidization following whole genome triplication in Brassica rapa. Plant Cell Online. 2012; 24(6):2415–26.
Gout JF, Kahn D, Duret L, Paramecium Post-Genomics Consortium. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLoS Genet. 2010; 6(5):1000944.
Lyons E, Freeling M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 2008; 53:661–73.
Lyons E, Pedersen B, Kane J, Alam M, Ming R, Tang H, Wang X, Bowers J, Paterson A, Lisch D, Freeling M. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar and grape: CoGe with rosids. Plant Physiol. 2008; 148:1772–81.
Zheng C, Swenson K, Lyons E, Sankoff D. OMG! orthologs in multiple genomes - competing graph-theoretical formulations In: Przytycka T, Sagot M-F, editors. WABI 2011, 11th Workshop on Algorithms in Bioinformatics. Lecture Notes in Computer Science 6833. Springer: 2011. p. 364–75.
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, Choisne N, Aubourg S, Vitulo N, Jubin C, Vezzi A, Legeai F, Hugueney P, Dasilva C, Horner D, Mica E, Jublot D, Poulain J, Bruyére C, Billault A, Segurens B, Gouyvenoux M, Ugarte E, Cattonaro F, Anthouard V, Vico V, Del Fabbro C, Alaux M, Di Gaspero G, Dumas V, Felice N, Paillard S, Juman I, Moroldo M, Scalabrin S, Canaguier A, Le Clainche I, Malacrida G, Durand E, Pesole G, Laucou V, Chatelet P, Merdinoglu D, Delledonne M, Pezzotti M, Lecharny A, Scarpelli C, Artiguenave F, Pè ME, Valle G, Morgante M, Caboche M, Adam-Blondon AF, Weissenbach J, Quétier F, Wincker P. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature. 2007; 449:463–7.
Jung S, Cestaro A, Troggio M, Main D, Zheng P, Cho I, Folta KM, Sosinski BAA, Celton JM, Arús P, Shulaev V, Verde I, Morgante M, Rokhsar DS, Velasco R, Sargent DJ. Whole genome comparisons of Fragaria, Prunus and Malus reveal different modes of evolution between rosaceous subfamilies. BMC Genomics. 2012; 13:129.
Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, Allegre M, Chaparro C, Legavre T, Maximova SN, Abrouk M, Murat F, Fouet O, Poulain J, Ruiz M, Roguet Y, Rodier-Goud M, Barbosa-Neto JF, Sabot F, Kudrna D, Ammiraju JS, Schuster SC, Carlson JE, Sallet E, Schiex T, Dievart A, Kramer M, Gelley L, Shi Z, Bérard A, Viot C, Boccara M, Risterucci A, Guignon V, Sabau X, Axtell MJ, Ma Z, Zhang Y, Brown S, Bourge M, Golser W, Song X, Clement D, Rivallan R, Tahi M, Akaza JM, Pitollat B, Gramacho K, D’Hont A, Brunel D, Infante D, Kebe I, Costet P, Wing R, McCombie WR, Guiderdoni E, Quétier F, Panaud O, Wincker P, Bocs S, Lanaud C. The genome of Theobroma cacao. Nat Genet. 2011; 43:101–8.
Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature. 2012; 485:635–41.
US Department of Energy, J.G.I. Mimulus version 1; 2010. https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Mguttatus.
Ibarra-Laclette E, Lyons E, Hernández-Guzmán G, Pérez-Torres CA, Carretero-Paulet L, Chang TH, Lan T, Welch AJ, Juárez MJ, Simpson J, Fernández-Cortés A, Arteaga-Vázquez M, Góngora-Castillo E, Acevedo-Hernández G, Schuster SC, Himmelbauer H, Minoche AE, Xu S, Lynch M, Oropeza-Aburto A, Cervantes-Pérez SA, de Jesús Ortega-Estrada M, Cervantes-Luevano JI, Michael TP, Mockler T, Bryant D, Herrera-Estrella A, Albert VA, Herrera-Estrella L. Architecture and evolution of a minute plant genome. Nature. 2013; 498:94–8.
Vitulo N, Forcato C, Carpinelli E, Telatin A, Campagna D, D’Angelo M, Zimbello R, Corso M, Vannozzi A, Bonghi C, Lucchin M, Valle G. A deep survey of alternative splicing in grape reveals changes in the splicing machinery related to tissue, stress condition and genotype. BMC Plant Biol. 2014; 14(1):99.
The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9. Data Version 2012-04-20.
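Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005; 21(18):3674–6.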
Makino T, McLysaght A, Kawata M. Genome-wide deserts for copy number variation in vertebrates. Nat Commun. 2013; 4:2283.
Soltis DE, Albert VA, Leebens-Mack J, Palmer JD, Wing RA, dePamphilis CW, Ma H, Carlson JE, Altman N, Kim S, et al. The Amborella genome: an evolutionary reference for plant biology. Genome Biol. 2008; 9(402):10–1186.
Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada. DS holds the Canada Research Chair in Mathematical Genomics.
This publication was funded by Discovery Grant RGPIN-2016-05585 to DS from the Natural Sciences and Engineering Research Council of Canada.
Availability of data and material
All genome data is available from the CoGe website.
ECC and DS did the research and writing, AM and JHC gave statistical guidance. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
About this supplement
This article has been published as part of BMC Genomics Volume 18 Supplement 4, 2017: Selected articles from the Fifth IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2015): Genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-4.
From Fifth IEEE International Conference on Computational Advances in Bio and Medical Sciences(ICCABS 2015) Miami, FL, USA. 15–17 October 2015
Chen, E., Morin, A., Chauchat, JH. et al. Statistical analysis of fractionation resistance by functional category and expression. BMC Genomics 18, 366 (2017). https://doi.org/10.1186/s12864-017-3736-0
- Gene loss
- Whole genome duplication
- Gene ontology
- Expression level