Ancient eudicot hexaploidy meets ancestral eurosid gene order
© Zheng et al.; licensee BioMed Central Ltd. 2013
Published: 5 November 2013
Skip to main content
© Zheng et al.; licensee BioMed Central Ltd. 2013
Published: 5 November 2013
A hexaploidization event over 125 Mya underlies the evolutionary lineage of the majority of flowering plants, including very many species of agricultural importance. Half of these belong to the rosid subgrouping, containing severals whose genome sequences have been published. Although most duplicate and triplicate genes have been lost in all descendants, clear traces of the original chromosome triples can be discerned, their internal contiguity highly conserved in some genomes and very fragmented in others. To understand the particular evolutionary patterns of plant genomes, there is a need to systematically survey the fate of the subgenomes of polyploids, including the retention of a small proportion of the duplicate and triplicate genes and the reconstruction of putative ancestral intermediates between the original hexaploid and modern species, in this case the ancestor of the eurosid clade.
We quantitatively trace the fate of gene triples originating in the hexaploidy across seven core eudicot flowering plants, and fit this to a two-stage model, pre- and post-radiation. We also measure the simultaneous dynamics of duplicate orthologous gene loss in three rosids, as influenced by biological functional class. We propose a new protocol for reconstructing ancestral gene order using only gene adjacency data from pairwise genomic analyses, based on repeating MAXIMUM WEIGHT MATCHING at two levels of resolution, an approach designed to transcend limitations on reconstructed contig size, while still avoiding the ambiguities of a multiplicity of solutions. Applied to three high-quality rosid genomes without subsequent polyploidy events, our automated procedure reconstructs the ancestor of the eurosid clade.
The gene loss analysis and the ancestor reconstruction present complementary assessments of post-hexaploidization evolution, the first at the level of individual gene families within and across sister genomes and the second at the chromosome level. Despite the loss of more than 95% of gene duplicates and triplicates, and despite major structural rearrangement, our reconstructed eurosid ancestor clearly identifies the three regions corresponding to each of the seven original chromosomes of the earlier pre-hexaploid ancestor. Functional analysis confirmed trends reported for more recent plant polyploidy events: genes involved with regulation and responses were retained in multiple copies, while genes involved with metabolic processes were lost.
The publication of the grapevine genome sequence by Jaillon et al. in 2007  included the discovery of an ancient hexaploidization event, which also showed clear traces in all the other eudicot genomes sequenced up to that time, but which was absent from the monocot genome of rice. Since then, this event has been confirmed  and characterized in most detail with the publication of the cacao genome sequence  and in other work of Salse and colleagues . It also has been dated to occur before the radiation of the core eudicots, but after their divergence from more basal eudicots . The original basis for all this work was the observation of seven triples of homeologous regions in each genome. The ancient hexaploid would have combined three identical or closely related subgenomes, each made up of seven chromosomes. In the modern species, each of the three regions (fragmented by post-hexaploid genome rearrangements in some of the genomes), reflecting the three equivalent chromosomes in the original subgenomes, contains some genes paralogous to genes in one, or occasionally both, of the other two homeologous regions. Meticulous work identified the boundaries of these regions and suggested a credible history of chromosome fusions and other major rearrangements leading to the modern genomes from an ancestral N = 3 × 7 = 21-chromosome hexaploid.
There is a growing literature on gene order reconstruction in the context of flowering plant phylogeny [6–8], although the extensive paralogy induced by hexaploidization, and the subsequent gene order scrambling due to fractionation, i.e., the loss of duplicate genes from one or two of the three homeologous regions, more or less randomly, cause difficulty for the inference of ancestral gene order.
In this paper we follow two separate lines of investigation, the first into the dynamics of fractionation in seven descendants of the hexaploidization and functional constraints on which genes have lost copies, and the second into the reconstruction of an early descendant of the hexaploid, namely the common ancestor of the eurosids. Merging the two approaches combine both an intensional measure - detailed rates of gene loss - and an extensional characterization - changes to chromosomal structure - of the early evolution of the core eudicots.
For the reconstruction of the eurosid ancestor, we use the grapevine genome, member of the earliest branching rosid order, Vitales, as an outgroup, as well as the two least rearranged eurosid genomes, cacao and peach. Our entirely automated method for the reconstruction of ancestral gene order
harmonizes them into triples of orthologs across the three genomes by the OMG! technique ,
optimally infers ancestral contigs using MAXIMUM WEIGHT MATCHING (MWM) [18, 19] on oriented (same or different DNA strand) gene adjacencies, and then discards any matches not supported by at least two of the genomes,
replaces all the genes on the genomes by oriented symbols representing the new contigs,
does another round of MWM on the adjacencies of the new symbols, to form chromosomal fragments,
merges adjacent fragments if they are not separated by more than a preset number of genes on at least two genomes.
In contrast to genome halving  or genome aliquoting  methods, we do not attempt to reconstruct the ancestor at the moment of polyploidization. Rather we reconstruct a presumably rediploidized, partially fractionated, and somewhat rearranged descendant of that polyploid, in an effort to infer recent common ancestor of the extant genomes.
Before this reconstruction of the ancestral genome, which makes no reference whatsoever to hexaploidy or its remnants, we quantitatively analyze the internal paralogies, or homeologies, of the seven genomes, documenting the pervasive pattern of triples of syntenic blocks that are the signature of hexaploidization. We also develop a model for the fractionation process to account for the proportion of gene triples, gene pairs, and single copy genes within each genome. We attribute some of the deviation between model predictions and observations to rate dependence on gene functional class, and undertake to quantify this rate diversity based on known gene ontology in Arabidopsis of triplets, duplicates and single-copy of homologous rosid genes.
In the final sections of this paper, we apply the knowledge we have generated about the remnants, in the grapevine, cacao and peach genomes, of the 21 regions produced by the hexaploidization, to try to identify these regions in the ancestral eurosid genome we reconstructed, and find striking coherence between the reconstructed chromosomes and what would be expected from a gross comparison of the genomes at the level of entire regions.
Syntenic dot-plots for the self-comparison of both grapevine and of cacao clearly show a pattern of pairwise homologies falling largely into 21 groups according to the chromosomal location of the two homologs [1, 3]. This is also true to a lesser extent for peach. Within each of the three genomes, these groups of homologies are distributed among seven triples of large regions, where there are numerous homologies between each of the three pairs of regions. There are very few, if any, gene homologies within regions, few between different (non-homeologous) tripled regions, and few between genes in regions and genes outside of all tripled regions.
We adopt the coloring scheme introduced in [1, 3] for the seven triples of regions: red, orange, yellow, green, pale blue, blue and purple. We distinguish between the three homologs of each colour by using dark, medium and light shades of each, ordered according to the number of genes detected in each by our methods, with the darkest containing the most genes.
For grapevine, cacao and peach, the ranges of the triplicated regions were determined by examining the dot plots produced by SYNMAP. For the other four genomes, the greater degree of chromosomal rearrangement entails more than 21 groups of smaller syntenic regions in the SYNMAP output. But these regions can still be grouped into triples, with numerous homologies between each of the three pairs of groups of regions and few homologies within groups, between genes in different triples, or between genes in groups and those in no group. Without risk of ambiguity we will refer to these groups of regions as "regions" as if they were each contiguous regions as in grapevine, cacao and peach.
Due to rapid fractionation, there are very few triples of homologous genes within each genome, as we will discuss in the next section. This loss of iso-functional paralogs results in an interleaving  pattern of pairwise homologies among three regions.
Distribution of colours among synteny blocks.
genes in synteny blocks
% genes coloured
% genes coloured in coloured blocks
number of blocks (most counted 2X)
% one colour
% no colour
% different colour
number of gene pairs (most counted 2X)
% in coloured blocks
% same colour, different region
% same colour, same region
% one gene coloured
% neither coloured
Gene family sizes.
frequencies of gene family sizes
all three genes survived is .
two of the three survived is , and
only one survived is .
a triplet would still survive is .
an original triplet would manifest as a pair is , and
an original triplet would be reduced to a single copy is .
Though a general functional classification of all or most genes is not available for any of our rosid genomes, in the next section we initiate a comparative study of fractionation rates by assuming that genes have the same functions as their Arabidopsis homologs, when this homology can be detected.
We carried out a search for Arabidopsis orthologs for the genes in our grapevine, cacao and peach data by subjected them to the LastZ analysis in CoGe against the TAIR Version 10 database at http://arabidopsis.org/. For each rosid gene we picked the best Arabidopsis hit. Only the three genomes were examined because for comparability, our subsequent analysis required at least one copy of a gene in each genome, and no more than three. We found a total of 7200 such complete gene families in the three genomes; including the remaining genomes in the analysis would have drastically diminished this number and compromised the statistics.
Gene ontologies for all the Arabidopsis homologs chosen were downloaded from http://www.geneontology.org/GO.downloads.ontology.shtml. For 84% of the 7200 gene families, all the genes mapping to an Arabidopsis homolog mapped to the same one. For those families containing members mapping to different homologs, these were virtually always Arabidopsis genes having the same functionality.
For the 7200 gene families, we distinguished five categories of multiple-copy gene retention: 1 = gene families consisting of only one copy in all three genomes, 2 = genes that have one copy in two of the genomes and two or three copies in the third genome, 3 = genes that have one copy in one genome and two or three copies in the others, 4 = genes with two copies in each genome, 5 = gene families with at least two copies in each genome, at least one of which contains three copies. Thus genes with lower scores have lost more copies to fractionation and those with higher scores have retained more and are "fractionation-resistant".
For each family, we tabulated all the GO terms "hit" by at least one member, and amalgamated these terms for the family. We noted that, not surprisingly, families in higher scoring categories, containing more genes, hit a higher number of terms than families with lower scores. To see if some gene functions were associated with higher or lower rate of retention, we normalized the proportions of gene families in each retention category showing an association with a term by the proportion of gene families in that retention category showing any hits, separately for Biological function, Cellular component and Molecular function. For example, 3982/5933 = 67% of families in retention category 1 (one copy in each genome) had at least one hit for some Biological function term. One of these terms, Response to stimulus was hit by 15% of these families, so that the normalized association value for this GO term and this retention category is 0.15/0.67 = 0.22.
The formal structure we adopt, both for the extant rosid genomes used as input data, and for the ancestral eurosid we aim to reconstruct, is the oriented gene order. In graph theoretical terms, these are pairs of graph matchings. The set of vertices of the graphs contains the two ends of all the genes, distinguished either as 5' versus 3', h(ead) versus t(ail) or "+" versus "-". The first matching simply connects every gene end to its other end. (Because every vertex is connected to another vertex, this is a complete matching.) The second connects each gene end to at most one other gene end (usually of a different gene), representing the adjacency of two genes. This is usually not a complete matching; a vertex representing a gene end at the end of a chromosomes is not connected to any other vertex.
The problem of reconstructing an unknown gene order on the basis of three given gene orders from related genomes is called the median problem. There are many methods for solving this (e.g. [27–30]); they generally attempt to construct a match as similar as possible, optimizing some measure of similarity under various constraints, to the three given adjacency matchings. The methods are distinguished by fine distinctions among the simplified models of genome rearrangement invoked, assumptions about identical gene complement or lack of duplicate genes, insistence or not on linearity of chromosomes or the number of chromosomes in the solution, relative weights of adjacencies versus chromosome-terminal gene ends, and other considerations.
Most gene order median problems are difficult: exact algorithms bog down for realistic instances while heuristics risk somewhat sub-optimal solutions. But of much more concern to biologists than details about optimality and efficiency is that the reliance on global optimality criteria leads to a severe problem of non-uniqueness of solutions. Reconstruction methods based on gene-by-gene adjacencies do not scale well for gene orders involving tens of thousands of genes. Long reconstructed chromosomes inevitably contain many poorly supported adjacencies, i.e., adjacencies that only occur in one, or even none, of the three data genomes. Though these may be relatively infrequent in terms of overall proportions, for example, 1% of 1000 adjacencies, giving 10 per chromosome, this is still a worrisome number; the reconstructions can often be broken at these points and reassembled in different combinations using other hitherto unused adjacencies, without losing the optimality of the solution. As a result there are an exponential number of solutions, many of them differing greatly at the level of chromosome structure. This is a dismaying situation for a biologist.
This partially explains the pervasiveness, in the field of ancestral genome inference, of reconstructions in two or more steps. The first step reconstructs a large number of small "contigs" or anchors, and the subsequent steps pieces these together into larger chromosome fragments or whole chromosomes, using some other objective function or relying on some other data. Often the ultimate step in reconstruction involves manual intervention, invoking biological intuition or some implicit expectations about the final shape of the solution.
In this paper, we also use a multi-step process, but our goals are first, to eliminate as far as possible sources of non-uniqueness in the reconstruction. We make sure that nothing is reconstructed on the basis of a series of arbitrary choices among equally plausible alternatives (a recipe for rampant non-uniqueness), and second, that the whole process be automated and reproducible, with explicit thresholds and other parameters.
The input to the reconstruction method is the complete gene order for each of the extant genomes, together with identification of orthologous genes in the different genomes. We construct our gene orders based on SYNMAP's pairwise comparisons of gene orders: grapevine-cacao, cacao-peach and peach-grapevine, using its rigorous default values to ensure that orthologs are identified not only by their sequence similarities, but also by severe requirements on their syntenic context (i.e., syntenic blocks are identified only if they contain at least 5 pairs of collinear orthologous genes). We impose transitivity on the results where this does not already obtain, e.g., if a is orthologous to b and b is orthologous to c, then a is assumed orthologous to c whether or not this has been detected by SYNMAP.
For the grapevine, cacao and peach genomes, this initially produced a total of 14,543 orthology sets containing two, three or more genes, for a total of 39,286 genes.
The requirement for synteny within SYNMAP is an effective way of resolving non-tandem paralogy in large majority of instances. Two paralogs in different syntenic contexts are treated separately as different genes by virtue of these contextual distinctions.
Paralogs still show up, however. Genes a 1 and a 2 in genome A and can both register as orthologous to gene b in genome B because of segmental duplication in A or because identical contexts of a 1 and a 2 survived the fractionation process intact since the early polyploidization process. a 1 and a 2 are thus paralogs.
We use the OMG! procedure  to judiciously suppress a small number of homology relations to resolve these situations, usually by dividing large orthology sets into smaller ones, allowing only three or two genes to be considered orthologous across the three genomes. The objective is to maximize the sum of the squares of the number of genes in remaining orthology sets (9 or 4, respectively, for three-way and two-way orthology).
For the grapevine, cacao and peach genomes, this produced a final total of 5664 orthology sets containing two genes and 9148 containing three genes, for a total of 14,812 sets containing 38,772 genes.
Because we will be weighting adjacencies differentially, we do not use any of the dedicated median solvers, but the more flexible approach of MAXIMUM WEIGHT MATCHING (MWM). This allows us to make a slight weight distinction between two conflicting matches that appear equally often in the three genomes, but in subtly different contexts. We solve the MWM problem on the three sets of adjacencies, one for each genome, determined by the order of the genes output by SYNMAP and OMG!. The first polynomial-time exact algorithm for MWM -- "path, trees and flower" -- was introduced by Edmonds . We use Galil's O(n3) version of the algorithm  for which Python code by J. van Rantwijk is available online.
Weight system for adjacencies.
occurrence of adjacency between gene ends x and y
weight of xy
xy occurs in all three genomes
xy occurs in two genomes; x or y is absent from the third genome
xy occurs in two genomes; x and y present in the third genome
xy and uv occur in one genome; xu and yv in another; uv, but not both x or y, in third
xy and uv occur in one genome; xu and yv in another; x, y and uv in third
xy occurs in one genome; one of x or y absent in each of the other genomes
xy occurs in one genome; x and y both present in one or neither of the other genomes
xy occurs in one genome; x and y present in other two genomes
In the output from the algorithm, the 14,812 orthology sets were organized into 63 contigs. These included all the adjacencies in the input with weights 3.00 (5657 of them), 2.03 (6319), 2.00 (1109) or 1.50 (113), and all except one of weight 1.49 (31 of them). In addition there were 1520 of adjacencies of weight 1.01, 1.00 or 0.99.
We know that it is the adjacencies of weight less than 2.00 that propagate non-uniqueness. Those with weight 2.00 or more are very likely to be in all solutions. In our particular data this extends to the adjacencies of weight 1.50 and almost all of weight 1.49.
If we discard all poorly supported adjacencies from the (MWM) solution, then, there can be little variability in the solutions. This, however, comes at a cost of increasing the number of contigs from 63 to 1583, which must then be assembled into chromosomes by other means.
One incidental benefit of discarding most low-weight adjacencies is that all the circular contigs among the 63 are broken into linear fragments. It is theoretically possible to have circular contigs containing only weight 2.00 adjacencies, but this is not encountered in our data.
In addition, we discard 1368 genes scattered among 934 short contigs (three adjacencies or less). The large majority of these have only one gene, and are likely to represent errors originating in SYNMAP or OMG!, or movements of very small genomic fragments, containing little information about the main evolutionary events in these genomes. They also represent possible sources of non-uniqueness in reconstruction of ancestral chromosomes.
Note that had we attempted to use any of the other sequenced rosids instead of, or in addition to grapevine, cacao and peach, few additional contigs could be expected, because the highly fragmented nature of the subgenomes in these genomes would have resulted in mostly short contigs that would only have been discarded; some existing contigs could have received additional support, but since all except the weight 1.00 adjacencies are already consistent among themselves, this would have been of little benefit.
We use the 649 remaining contigs to create "summary" versions of the original three genomes. Every contig in the solution to the MWM projects a coordinate position and an orientation back to a single chromosome in each of the grapevine, cacao and peach genomes (even though the contig may contain a minority of genes from another chromosome), based on the average coordinates and most frequent orientation of the genes it contains from that genome on that particular chromosome.
We can thus recognize adjacencies between these contig projections in each summary genome and carry out MWM, using the same weights as in Table 3 in Step 3 and the deletions in Step 4 to construct a new median consisting of a matching on the contigs. Again we discard adjacencies supported by only one of the three genomes. There remain 177 new contigs.
We can then merge new contigs with neighboring projections in two or more genomes, according to a criterion of how many genes may intervene between two candidates for merger. For each pair of new contigs on the same chromosomes in at least two genomes, we do the following:
decompose each contig into pieces that are internally strictly contiguous on the two genomes,
find the two pieces, one from each new contig, that are the closest, when appropriately oriented.
We then merge the pair of new contigs that are the closest, as long as they are not more than 500 genes apart on either of the two genomes. The cutoff in this final step, 500 with our data, is critical, but is easy to find since it has a direct and dramatic effect on the result. Lower values, like 400, result in an unrealistically large number of small chromosomes; higher values like 600 produce assemblies including one or two impossibly long chromosomes.
The pipeline we have described is not computationally intensive. Data preparation Steps 1 and 2 required 1 to 2 minutes in total for the rosid data. The MWM run in Step 3 took 15 minutes. The data manipulations in Steps 4 and 5 required less than a minute. The second MWM executed in 2 seconds.
This shows that despite the fact that no triplication information whatsoever was used in the reconstruction, most of the 21 triplicated regions are located solely or largely on a single chromosome in the reconstruction. Moreover, as we shall see in the next section, the general structure of the chromosomes corresponds closely to what we would expect from common patterns of region fusions in the three genome.
Inspection of the coloured karyotypes in Figure 3 is suggestive of some of the evolutionary events which must have transpired in the rosids, without need to reference the individual genes that make up the triplicate regions, in the style of [3, 4]. The main principles we adopt are first, that these genomes are descendants of a hexaploid, and second, that syntenic segments from two ancestral chromosomes on a single modern chromosome, when observed in two or three of the sequenced genomes, is unlikely to have been produced by two or three coincidental events, but is rather most parsimoniously accounted for by a single event at some point in the common evolutionary history of these genomes. This ties in with our search for uniqueness of reconstructions.
The ancestral hexaploid is assumed to contain three identical sets of seven chromosomes, which we label with seven colors, each in three shades from darkest to lightest.
All three sequenced genomes have a large light green region adjacent to, or very close to, a small dark red region on the same chromosome, as well as another large dark red region elsewhere. This suggests a non-reciprocal translocation occurring on the tree branch between the hexaploid ancestor and the rosid ancestor, and accounts for one of the fused chromosomes depicted in the latter. There is also a small light green region at the end of the dark red chromosome in grapevine. This is most likely the result of a chromosomal fission in the tree branch leading to grapevine.
There are two chromosomes containing medium green in grapevine. This would have required a fission in the lineage leading to grapevine, but this must have been preceded by a fusion of medium green with medium purple in the common ancestor of grapevine and cacao, since this association is present in both. This accounts for another of the fused chromosomes apparent in the rosid ancestor.
All genomes show adjacencies between medium pale blue and dark pale blue, suggesting a fusion in the rosid ancestor as shown, with a further translocation with medium green in grapevine and additional rearrangements in peach.
Grapevine and cacao evidence an early fusion of medium blue and medium yellow (the remaining fused chromosome visible in the rosid ancestor), followed by a fission of most of the medium yellow to form a separate chromosome in grapevine.
There is an association of dark blue and medium yellow in the two eurosid genomes only, suggesting this event occurred after the divergence from Vitales. However, the inherited association with medium blue is retained in cacao, so that a three-way fusion is is portrayed in the eurosid ancestor. Other exclusively eurosid associations include fusions of light purple and dark orange, and of light blue and light orange, with much of the light orange fissioning to form a separate chromosome in cacao. In addition, the medium green/medium purple fusion inherited from the rosid ancestor has acquired an association with dark red in the eurosids, so that a three-way fusion seems appropriate there.
These considerations suggest, as summarized in Figure 7, that the rosid ancestor had 18 chromosomes, four resulting from fusions among the original hexaploid chromosomes, and 14 consisting of entire chromosomes or parts of single chromosomes from the latter. Grapevine would have undergone two additional fusions and two fissions (at new breakpoints) of chromosomes previously fused in the rosid ancestor.
Similarly the eurosid ancestor can be predicted to have at most 15 chromosomes, derived by four additional fusions and one fission, from which the 10-chromosome cacao and 8-chromosome peach evolved, mostly by further fusions.
Although the exercise in the previous section takes no account of the actual gene-order evidence, it does provide a plausibility check on the detailed reconstruction leading to Figure 6. The discrepancy between the 13-chromosome reconstruction in Figure 6 and the 15-chromosome version in Figure 7 can be accounted for in terms of the fusion of the light orange region with the medium green/medium purple/dark red chromosome in the detailed reconstruction and the fusion of two other red chromosomes. In addition, there is a translocation of the medium blue from its position in cacao (Figure 7) to its position in peach (Figure 6). The compositions of the four chromosomes affected in Figure 6 are otherwise unchanged, and the remaining nine are identical in the two approaches.
We can conclude that, with allowances for its superficial nature, the "top-bottom" derivation in the previous section gives results which accord well with the detailed reconstruction earlier in this paper. The sole problem is the position of the medium blue region, whose positions in the rosid genomes can only be explained by two coincidental fusions (in cacao and grapevine) or by an unlikely translocation of exactly this region, no more and no less, to another chromosome in peach. Of interest is that in the reconstruction of the 9-chromosome Rosaceae ancestor by Jung et al.  based on grapevine, peach and strawberry, two of reconstructed chromosomes have the same overall structure as peach 2 and 5; this does not resolve the quandary, but at least suggests that the evolutionary events involved predate the Rosaceae radiation.
As a prelude to ancestral genome reconstruction, we have quantitatively documented the details of the triplicated regions in the seven rosid eudicots under study, and suggested a model for fractionation in two steps, one pre-radiation and the other post-radiation. Fitting the model to the data from seven rosids requires the assumption of an unrealistically low number of genes involved in the initial hexaploidization, suggesting a dependence of fractionation rate on gene class, an effect that is confirmed by differential functional associations of highly fractionated versus highly retained gene triples in grapevine, cacao and peach.
A major preoccupation of our reconstruction methodology was to avoid the non-uniqueness of optimal solutions, a problem endemic to one-step reconstructions based even partly on poorly supported adjacencies. The key to this is to discard all such adjacencies from the MWM solution to the median problem. We rely instead on a second application of the algorithm to summary genomes made up of projections back from the high-confidence contigs produced by the initial run.
Our reconstruction procedure requires three parameter settings. One is the weight system, which on one hand distinguishes between conflicting evidence and absent evidence against adjacencies with the same level of support, and on the other hand detects a certain number of poorly supported adjacencies whose contexts in the three genomes suggest that they nevertheless merit inclusion in a solution. The second parameter is a threshold for contig sizes (more than three adjacencies), designed to eliminate noise, due either to errors or insensitivity in the data-generation steps, or to movements of isolated genes to remote locations in the genome. Despite a general tendency to conserve the large triplicated regions, there is much low-level rearrangement occurring in these genomes. They each require from 400 to 600 rearrangements to be generated from the reconstructed ancestral gene order, though much of this is ascribable to fractionation; inevitably a few of these rearrangements conspire in two genomes to mislead the reconstruction, especially if we place too much faith in the significance of contigs containing only one or two genes. The third setting is the cutoff for merging neighboring fragments after the higher-level run of MWM. In our data, the appropriate value empirically obtained was 500 genes; 400 was too low, unable to detect several valid fusions, while 600 was too high, resulting in an unbalanced genome concentrated largely in a single chromosome. Of the three parameter settings, the first two should be generally applicable, while the third will have to be adjusted (in an automated way if desired, using chromosome number or other karyotype characteristic) to each data set. At least for our data set, all of these settings are indispensable. Not dropping small contigs after the first MWM leads to many unlikely fusions as well as numerous tiny chromosomes that do not fit in with the rest. Using simply weights 1, 2 and 3 instead of those in Table 3, or not using the right cutoff lead, both lead to unbalanced genomes and other anomalies.
As more angiosperm genome sequences become available, reconstruction of additional early ancestral genomes will become possible, adding to the Rosaceae ancestor in , and the eurosid ancestor studied here, to further elucidate the evolution of large clades of these plants. Additionally, genomes that break up the phylogenetic clades studied here will help resolve conflicting evidence of chromosome fusion and fission events. Paradoxically, although the whole genome doubling (or tripling) that recurs throughout this evolutionary domain has the effect of scrambling local gene order through fractionation, it also provides a convincing way of validating reconstructions based on gene order, as exemplified by the intact triplicated regions inferred in Figure 6.
This paper is based in part on an extended abstract published in the proceedings of the ICCABS 2012 conference . The only sections that were retained, though extensively rewritten, are the first two sections after the Introduction, corresponding to Figures 2 - 4 in the present version. Virtually all of the rest of the original paper was discarded and replaced by new material.
Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). DS holds the Canada Research Chair in Mathematical Genomics. EL was supported by a grant from the Gordon and Betty Moore Foundation (GBMF3383).
Publication of this article was supported by the Canada Research Chair in Mathematical Genomics.
This article has been published as part of BMC Genomics Volume 14 Supplement 7, 2013: Selected articles from the Second IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2012): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S7
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.