Reconstruction of an ancestral Yersinia pestisgenome and comparison with an ancient sequence
© Duchemin et al.; 2015
Published: 2 October 2015
Skip to main content
Volume 16 Supplement 10
We’re sorry, something doesn't seem to be working properly.
Please try refreshing the page. If that doesn't work, please contact us so we can address the problem.
© Duchemin et al.; 2015
Published: 2 October 2015
We’re sorry, something doesn't seem to be working properly.
Please try refreshing the page. If that doesn't work, please contact us so we can address the problem.
We propose the computational reconstruction of a whole bacterial ancestral genome at the nucleotide scale, and its validation by a sequence of ancient DNA. This rare possibility is offered by an ancient sequence of the late middle ages plague agent. It has been hypothesized to be ancestral to extant Yersinia pestis strains based on the pattern of nucleotide substitutions. But the dynamics of indels, duplications, insertion sequences and rearrangements has impacted all genomes much more than the substitution process, which makes the ancestral reconstruction task challenging.
We use a set of gene families from 13 Yersinia species, construct reconciled phylogenies for all of them, and determine gene orders in ancestral species. Gene trees integrate information from the sequence, the species tree and gene order. We reconstruct ancestral sequences for ancestral genic and intergenic regions, providing nearly a complete genome sequence for the ancestor, containing a chromosome and three plasmids.
The comparison of the ancestral and ancient sequences provides a unique opportunity to assess the quality of ancestral genome reconstruction methods. But the quality of the sequencing and assembly of the ancient sequence can also be questioned by this comparison.
Extant species are derived from a process of evolution and diversification from species now disappeared. These species are called ancient in general and ancestral if they left a descendant. Ancestral genomic sequences can be estimated through computation from a set of extant sequences related by a phylogeny and a model of evolution , while ancient genomic sequences in general can be sequenced from the remains of dead organisms .
Ancestral genome reconstruction can consist in predicting a gene content in ancestral species , and for each gene its sequence . While originally used to study proteins or isolated genes, ancestral genome reconstructions are now robust at a scale larger than the gene, for fragments where no rearrangement have occurred . Methods for inferring ancestral gene orders have also been explored [5–8]. Together, these methods open the way to the reconstruction of complete ancestral genomes, including their sequences.
Obtaining ancestral sequences can allow, through the study of physical properties of the reconstructed molecules, the inference of the paleoenvironnements in which these molecules evolved . These methods also allow access to an oriented and ordered view of molecular events along the history of life. Moreover, they offer a better understanding of this history and can further our knowledge of the mechanisms linking organic sequences to their functions .
Despite this, ancestral sequence reconstruction suffers from several limits. Along with the study of molecular evolution, it relies on the validity of models and their fundamental hypothesis. Furthermore, given that we are interested in a phenomenon often distant in time, it is at best difficult to obtain proofs validating proposed predictions. Thus, the validation of ancestral reconstruction methods is often limited to robustness tests, or simulations that themselves rely on the validity of the models of evolution .
Ancient DNA sequences is another way to have an access to the past history of living organisms. Under certain conditions it is possible to obtain genetic material through the sequencing of the remains of an organism. Ancient DNA sequencing began in the middle of the 80s with the cloning and sequencing of fragments of mitochondrial DNA in a museum specimen of Equus quagga, an extinct equine species that disappeared in the XIX th century . The advent of PCR methods  and high-throughput sequencing  followed by what is called third generation sequencing  allowed the sequencing of several extinct animals [15–17], ancient unicellular eukaryotes [18, 19], bacteria [2, 20, 21], metagenome , or virome .
The ancient sequences disclose a new source of information concerning the evolution of lineages of interest. They have already been used, among other things, to understand the dynamic of extant populations of the genus Homo [24–26], or other animals , to correct and recalibrate phylogenies , or to better understand past pandemics [18, 19, 2, 20, 21].
However, along with the problems specific to sequencing technologies, ancient DNA sequencing is limited by the post-mortem chemical degradation of DNA molecules throughout time. Thus, like fossils, ancient sequences are scarce while, unlike them, limited to recent times.
The late-medieval ancient genome, likely close to that ancestor, offers a validation opportunity for the ancestral reconstruction method. We achieve here this reconstruction and perform the comparison.
Note that a sequence of the same genome was proposed recently by Rajaraman et al. , but was not issued from ancestral reconstruction. The contigs of the ancient genome were scaffolded with a method including the phylogeny of relatives, and some parts of the assembly could be corrected, but what we present here is not using at all the ancient sequence in the reconstruction phase, it is done only from independent extant data.
The data consists in 13 Yersinia annotated genomes (Figure 1) from which we extract 3772 homologous protein gene families containing at least two genes, using the HOGENOM database . Of these, 1971 have exactly one copy per extant strain. This step corresponds to part A in Figure 3.
Using Muscle  (default parameters), we aligned the 1971 families, concatenated the variable sites of all alignments and obtained a phylogenetic tree using PhyML  (100 bootstraps, otherwise default parameters) that we rooted by separating the pestis from the pseudotuberculosis clades, according to a consensus in the literature. In our tree the branch separating the two clades is well supported, as well as the branches surrounding the ancestor that we wish to reconstruct (see Figure 1). This step corresponds to part B in Figure 3.
All gene families sequences were then aligned using Prank  and one gene tree per family was computed using PhyML (100 bootstraps, otherwise default parameters). Because we are aligning recently diverged strains of the same organisms , the sequences often have not diverged enough to allow an unambiguous tree reconstruction. So we collapsed all branches with a support lower than 99 and then used ProfileNJ  to solve the created polytomies. ProfileNJ reconstructs species tree branches instead of collapsed branches and chooses among several solutions with a Neighbor-Joining formula. Distances for the Neighbor-Joining part were computed with bppdist, a Bio++ suite software  (GTR + Γ(4) model).
ProfileNJ also roots the gene trees according to "Last Common Ancestor" reconciliation method, annotating internal nodes with duplications or speciations, and choosing a root minimizing the number of duplications.
Reconciled gene trees depict the history of the gene family, including all ancestral genes, uniquely defined by the reconciliation.
This step corresponds to part C in Figure 3.
From the 3772 gene families, some were discarded because they showed signal of a process that we do not handle well in our pipeline, gene transfer. Transfer was suspected when a branch in the reconciled gene tree would correspond to at least 4 independent losses in the species tree. We also removed the families with more than 5 genes in the black death ancestor, suspecting insertion sequences, which are poorly handled by the method. We also removed families containing genes fully included in other genes: as we model the evolution of gene orders, these would be difficult to handle. We eventually removed families when the reconciled gene tree did not contain a gene in the ancestor we want to reconstruct.
The final data set contained 3656 families. Note that when removing gene families from the study, we do not necessarily give up the reconstruction of parts of the ancestral sequence. We just define the removed parts as intergenic. As we also reconstruct intergenic sequences, this simply modifies the resolution at which we are able to detect rearrangements.
Each gene is a segment of a chromosome or a plasmid and has a start and an end position on it. We identify these positions as the extremities of the gene. A start position may be greater than an end position: the order of the extremities defines the orientation of the gene. We model each genome by a graph, whose nodes are gene extremities of genes in that genome. We put an edge, called an adjacency between pairs of extremities of a same gene. Additionally if genes AA′ and BB′ are consecutive (A and A′ are the extremities of the first gene, appearing in that order on the chromosome or plasmid, and B, B′ are the extremities of the second gene), we put an adjacency between A′ and B. So extant genomes are sets of disjoints cycles in a graph, modeling chromosomes and plasmids.
Gene extremities can be clustered into families, inherited from gene families, and also inherit the reconciled gene tree.
Ancestral adjacencies between gene extremities were inferred using DeCo . It models the evolution of an adjacency between two gene extremities following a parsimony principle, i.e. minimizing the number of gains and breakages of adjacencies, due to rearrangements. It takes as input the species tree, all gene trees, and extant adjacencies, and proposes a set of ancestral adjacencies between ancestral gene extremities defined by the reconciled gene trees. This step corresponds to part D in Figure 3.
There can be several reasons for the adjacency graph not to be a collection of paths and cycles, as we would expect if the data and methods were perfect. Incorrect gene trees are probably the major source of such discrepancies, while others may come from uncertainties in adjacency history inference.
We transform the adjacency graph into a genome (i.e. an adjacency graph that is a collection of paths and cycles), first by correcting gene trees, by operations we call zipping and unzipping, then by removing a minimum number of adjacencies so that the remaining graph is a genome.
Each ancestral gene extremity of a gene g should have at most two adjacencies. If one has more than two, a first hypothesis can be that in the real ancestral genome, the gene g was duplicated in two copies, and each copy would carry some of the adjacencies of g.
If in one extant species, there are two homologous copies of the gene g, and their extremities share the homologs of the adjacencies attributed to an extremity of g, then we perform the unzipping operation.
It consists in making two genes out of g by modifying the gene tree T of the gene family containing g. Only the subtree rooted at g is changed, into a subtree rooted at a new duplication node with two descendants: g and a new gene g′. Then the two subtrees rooted at g and g′ are reconstructed, first by assigning all leaves to g or g′ according to their neighborhood; Then by constructing subtrees on these leaves using ProfileNJ. In the case where some leaves can't be assigned to either g or g′ using their neighborhood (i.e. their extant neighbors are not descendant of any of the ancestral neighbors), then leaves are assigned to one of the two set of leaves according to their mean phylogenetic distances with them. Where there is a tie (for instance if all sequences are identical, all distances are null), the leaf is randomly assigned to one of the two leaf-set.
Figure 5A gives an example of an unzipping operation on the ancestral adjacency graph and on the gene tree.
If the unzipping procedure increases the number of adjacencies incident to a gene extremity of a gene h in the immediate neighborhood of g in the adjacency graph, then the unzipping procedure is applied to h as well, and then to its neighbors, until the region is linearized.
Another possible reason for a gene g to be involved in more than two adjacencies is that two of these adjacencies gh and gh′ concern two paralogs h and h′ which in reality should form only one gene. In that case we perform a zipping operation, similar to the one described in .
Let h d be the last common ancestor of h and h′ in their gene tree. Suppose it is assigned to species s, whose descendants are s1 and s2. It is a duplication node, and we turn it into a speciation node by giving it two descendant nodes h1 and h2, and assigning its descendant leaves to either one of them, depending on whether they are genes from descendants of s1 or s2. Then subtrees rooted at h1 and h2 are reconstructed using ProfileNJ.
Figure 5B gives an example of a zipping operation on the ancestral adjacency graph and on the gene tree.
Zipping produces a new ancestral gene h d instead of two paralogues h and h′. We propagate the same operation to the neighbors of the ancestral gene h d in the adjacency graph if they are themselves supernumerary paralogues.
Note that for zipping and unzipping, the propagation mechanism allows the treatment of several consecutive nodes, such that a large segmental duplication containing multiple genes can be dealt with as long as there exists a node to start the unzipping move (e.g. at one extremity of the segmental duplication).
Zipping and unzipping are tested independently for each ancestral node with more than two neighbors. Each of them should decrease the number of gene extremities with more than two adjacencies. The operation that decreases it the most is kept.
If none of zipping and unzipping succeeds in removing all such supernumerary adjacencies (it is possible that none of the hypotheses applies), then we remove as few adjacencies as possible so that only gene extremities with at most two adjacencies remain. This is achieved using a maximum matching technique described in .
Ancestral sequences have to be reconstructed by pieces, because they need a multiple alignment free of rearrangements. The pieces have to be glued together, and in order to avoid between pieces border problems, pieces have to overlap. This is why we reconstruct an ancestral sequence for all pairs of genes which are connected by an adjacency. Then pairs are aligned together on their common gene, and merged.
We orient each adjacent gene pair with a first and a second gene, each gene should be once the first gene of a pair, and once the second in another pair. We use the gene tree of the first gene as a guide, to construct a multiple sequence alignment with the extant sequences that contain this adjacent pair (thus, the sequences contains both genes and the sequence between them when they are neighbors in an extant species, and only the first gene of the adjacency when they aren't), and the ancestral sequence using Prank .
Gene sequences at the ends of contigs are reconstructed alone using their own tree. In consequence each inter-gene sequence is reconstructed once and each gene sequence is reconstructed twice and at least once with its own tree. We assemble the obtained ancestral sequences by aligning (using Smith & Waterman's algorithm) the ones sharing a gene and then making the consensus sequence of that alignment, favoring the sequence reconstructed with the tree of the aligned gene.
For instance, consider the ancestral path ABC (where A,B and C are genes), we reconstruct the ancestral sequence of A using its own tree, AB using A's tree, BC using B's tree and C using its own tree. Afterward the ancestral sequence of A is aligned with the ancestral sequence AB, favoring the sequence of A when computing the consensus. Then the sequence AB is aligned with the sequence BC, favoring the sequence BC in the consensus (as both sequences align on gene B and BC used B's tree for the reconstruction). Finally, the sequence ABC is aligned with the sequence C, favoring C in the consensus.
A graphical view of these steps are given in Figure 3, parts F and G.
Note that, as stated before, the ancestral sequence reconstruction needs a multiple alignment free of rearrangements. This means that the size of the recombination events that can be taken into account for ancestral sequences reconstruction depends on the density of the markers (here, the gene extremities of 3656 gene families) used in the ancestral order reconstruction step.
We perform the whole process of ancestral gene order reconstruction for three data sets: the whole set of filtered families, the set of D free families, without duplication and the DL free families, without duplication nor loss.
Ancestral gene order is computed with the whole set, but it gives fragmented paths in the adjacency graph. The fragments are progressively assembled using the D free and DL free gene orders.
The ancestral gene order was reconstructed for the chromosome (3342 genes) and the three plasmids (pCD: 74 genes, pMT: 87 genes, pPCP: 5 genes). The plasmids pCD and pPCP were obtained as circular elements in the adjacency graph, while the plasmid pMT was represented by one linear fragment. The chromosome was obtained as three linear components. To join these components, we ran DeCo on their six extremities using a gradient of adjacency gain/loss costs ratio (from 1/10 to 10/1) and scored each potential adjacency by the number of times it was observed. We then applied a weighted maximum matching technique  to extract the best possible order between the fragments (only one optimal solution remained).
The ancestral sequences of the plasmids pCD, pMT and pPCP were entirely reconstructed, for a total of respectively 100.1 kb, 67.7 kb and 9.6 kb. Concerning the ancestral chromosome, a total of 4.7 Mb of ancestral sequence was reconstructed, which is close to the size of the extant chromosomes of Yersinia pestis strains (e.g. 4.7 Mb for the strain Antiqua). A lack of signal in extant genomes due to convergent rearrangements, prevented the reconstruction of four ancestral adjacencies. Because of these, the ancestral chromosome sequence is actually composed of four disjoint fragments (their sizes are respectively 3.44 Mb, 0.67 Mb, 0.40 Mb and 0.19 Mb).
The reconstructed ancestral sequences are avalaible in Additional file 1.
Using Megablast  we aligned the 2134 ancient Yersinia pestis contigs obtained by Bos et al. (avalaible at http://paleogenomics.irmacs.sfu.ca/FPSAC/, last accessed 19 june 2015) against the obtained ancestral genome, including chromosome and plasmids.
We examine 2179 hits of length >102.5bp from 2087 contigs (see Additional file 2 for the bimodal distribution of hit lengths which justifies this threshold). The others are full of repeated elements, making the comparison difficult. As a consequence the examined hits all match to the chromosome and none to the plasmids.
So at a large scale, there is only one difference which can be an assembly error in the ancient sequence or a derived mutation of the ancient bacteria, because the ancient configuration is not supported by extant genomes.
At a finer scale, differences are more numerous. Approximately 81% of the 2084 contigs with a hit are exact matches to the ancestral genome. We examined some of the remaining and found that the differences could be explained by three kinds of error sources in the ancestral or ancient sequences:
Lack of sufficient data for ancestral reconstruction: it is the case if only one of the two children which branches off the ancestor, in addition to an outgroup, support the presence of a sequence. In that case there is no comparison point to infer some bases, and some are inferred differently than in the ancient sequence.
Lack of a good model of evolution at an intermediary scale, like duplication of small elements. They are here included in alignments and indel models, which do not account for repetitions.
Assembly errors in the ancient sequence.
Consider for example the ancient contig number 497 where a mismatch occurs when aligned with the ancestral sequence. The mismatch is situated in an intergenic region of the ancestral genome that is present in one descendant of the reconstructed ancestor and two outgroup Yersinia pestis species. Consequently, the ancestral sequence was reconstructed using a tree where the node of interest was along a branch, missing a comparison point (i.e. another descendant) to choose between its descendant allele and the outgroup allele.
Consider also the ancient contig number 8849 which aligns with one mismatch to the reconstructed ancestor. At the position of the mismatch, all extant (group and outgroup species) sequences bear the same allele and thus the reconstructed ancestral sequence bears it too. However, the ancient contig bears another allele at that position. If we consider the ancient contig as correct, then this difference would be an original mutation on the ancient strain. Such an hypothesis could be checked by mapping the ancient reads to their contigs in order to assess the validity of that specific allele. However, we note that the original study  that used read data to call SNPs did not detect any that were specific to the ancient strain.
Such discrepancies can sometimes be explained by errors in the ancient sequence, especially in regions where repetitions occur. For instance, the case illustrated on Figure 8A, is seen on the contig number 8335 obtained by Bos et al. (which is also the chimeric contig but this discrepancy is independent). Around position 1860, that ancient contig displays one occurrence of a 20-mer. However, the reconstructed ancestral sequence has two consecutive occurrences of that 20-mer. This region is situated in an intergenic region, so it has been reconstructed by an alignment of an adjacency with its two flanking genes. The extant species (descendant of the reconstructed ancestor or not) which have this gene adjacency all display two occurrences (in favor of the ancestral reconstruction) at the exception of Yersinia pestis strain CO92, the Yersinia pestis reference genome which was used to map the ancient reads in . While the fact that we did not use the raw reads obtained in  prevents us to draw any definitive conclusion, this appears to be an error in the ancient sequence assembly, caused by a derived mutation in the genome used as a reference.
Conversely, it happens that similar patterns are better explained by errors in the reconstructed ancestral sequence. Such a case occurs on the locus where the ancient contig number 5613 maps. The situation is also similar to Figure 8A. Two contiguous regions hit at a distance of 1315 bp on the reconstructed ancestral sequence. The sequence separating the two hits in the ancestor is only supported by one extant descendant (Nepal strain) and the other extant descendants match the ancient contig in only one long hit. This seems to be an error due to the absence of an evolutionary model allowing big insertions. Prank models indels but 1315 bp is not really an indel but is rather an insertion of what should perhaps have been an evolutionary unit. It seems that the indel model prefers losing several times such a long DNA segment rather than inserting it once in a terminal branch of the phylogeny. So we can expect a small number of such false additions in the ancestral sequence.
A complete reconstruction of an ancestral genome at the nucleotide level requires to take into account evolutionary events at several scales: nucleotide substitutions, indels, duplications, losses, recombinations, transfers, transposable elements propagation, rearrangements. Each level is handled by dedicated bioinformatics tools which are rarely used together.
We associated here gene content/sequence/order tools in order to attempt the reconstruction of a whole ancestral bacterial genome, including a chromosome and three plasmids. We chose an organism from the Yersinia pestis clade because of a recently published ancient sequence. Despite being relatively recent at the evolutionary scale (650 years), the evolution at all levels, and in particular in genome structure and organization, makes the problem difficult. The difficulty can come from numerous events (rearrangements, insertion sequence dynamics), but also from scarce events (substitutions) that prevent reconstructing gene trees from sequences because of a lack of information.
We did not only assemble existing tools that handle evolution at different levels, but also report methodological novelties, like the zipping and unzipping processes to modify gene trees and linearize adjacency graphs. Using synteny information to construct gene trees is rarely achieved  and linearizing often only use cutting operations .
We cannot explicitly handle recombination events or gene transfers, duplications at levels different from the gene, and propagation of insertion sequences. Some tools exist to reconstruct gene content or order in the presence of transfers [3, 42], but not equivalent to ProfileNJ , which we used because of a lack of signal from the sequences in many gene families. It has not been developed for transfers apparently for algorithmic purposes . Transfers will probably limit the quality of the sequence, which at recombination points will be reconstructed with a wrong gene tree. We expect these limits to be rare, as we found only little evidence of gene evolution clearly discordant with the species tree.
Another limit of this method is that it handles evolution at three different scales: sequence, gene content, gene order, while evolution happens at a continuum of scales, some part of it we don't explicitly model. This is for example the case for small duplications: gene duplications are handled but if they are smaller than genes, duplications will be part of sequence evolution, where the models and alignements take indels into account but not duplications. This is also the case of insertion sequence propagation. If insertion sequences are annotated as genes, their dynamics is sometimes so fast that parsimony duplication/loss principles are not accounting for it, even within a very small amount of time. If insertion sequences are taken in intergenic regions, they will again be handled inside alignments and yield a small amount of false positives.
A small part of the sequence is not reconstructed because of convergent rearrangements which have wiped the traces of some intergenic sequences. These convergent rearrangements also introduce one ambiguity in the ancestral gene order. It is possible that it reflects an ancestral polymorphism which has differently been resolved in different lineages.
Polymorphism, and the absence of it in our ancestral genome, is another limitation of such an approach. The ancient population was probably composed of several variants, and the 650 years might not be sufficient to sort out all of it. So we are not sure that a single organism carried the genome we reconstruct, but it might be a consensus of several genomes.
Yet these limits concern probably a very small percentage of the sequence, which is largely reconstructed with a total match to the ancient sequence. Beyond the methodological challenge and the interesting comparison with an ancient genome, the goal of such a reconstruction is not to find an application in synthetic biology, but to understand the evolution of this dangerous pathogen. Substitutions, which apparently are only a minor part of the story, are often the only marker of evolution (for example in ) because of a better availability of performing tools.
In conclusion, we report here the reconstructed ancestral bacterial genome of an ancestral Yersinia pestis. The reconstruction is achieved using already published software and methods but also introduces methodological novelties, especially concerning ancestral adjacency graph linearization, leading to the obtention of larger reconstructed ancestral chromosome fragments.
The comparison of the reconstructed ancestral genome with an ancient sequence provides the opportunity to assess the quality of the reconstruction. It appears that while the reconstruction methods display some limits for events spanning more than a few nucleotides and smaller than a gene (for instance, a gene domain duplication), they yield good results concerning small (substitutions, short indels) and gene-scale events(for instance, gene duplications or rearrangements spanning at least a gene).
This work is funded by the Agence Nationale pour la Recherche, Ancestrome project ANR-10-BINF-01-01.
Publication charges for this work were funded by the Agence Nationale pour la Recherche, Ancestrome project ANR-10-BINF-01-01.
This article has been published as part of BMC Genomics Volume 16 Supplement 10, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S10.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.