Sum of parts is greater than the whole: inference of common genetic history of populations
© Utro et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Skip to main content
© Utro et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Reconstructability of population history, from genetic information of extant individuals, is studied under a simulation setting. We do not address the issue of accuracy of the reconstruction algorithms: we assume the availability of the theoretical best algorithm. On the other hand, we focus on the fraction (1 - f) of the common genetic history that is irreconstructible or impenetrable. Thus the fraction, f, gives an upper bound on the extent of estimability. In other words, there exists no method that can reconstruct a fraction larger than f of the entire common genetic history. For the realization of such a study, we first define a natural measure of the amount of genetic history. Next, we use a population simulator (from literature) that has at least two features. Firstly, it has the capability of providing samples from different demographies, to effectively reflect reality. Secondly, it also provides the underlying relevant genetic history, captured in its entirety, where such a measure is applicable. Finally, to compute f, we use an information content measure of the relevant genetic history. The simulator of choice provided the following demographies: Africans, Europeans, Asians and Afro-Americans.
We observe that higher the rate of recombination, lower the value of f, while f is invariant over varying mutation rates, in each of the demographies. The value of f increases with the number of samples, reaching a plateau and suggesting that in all the demographies at least about one-third of the relevant genetic history is impenetrable. The most surprising observation is that the the sum of the reconstructible history of the subsegments is indeed larger than the reconstructible history of the whole segment. In particular, longer the chromosomal segment, smaller the value of f, in all the demographies.
We present the very first framework for measuring the fraction of the relevant genetic history of a population that is mathematically elusive. Our observed results on the tested demographies suggest that it may be better to aggregate the analysis of smaller chunks of chromosomal segments than fewer large chunks. Also, no matter the richness of samples in a population, at least one-third of the population genetic history is impenetrable. The framework also opens up possible new lines of investigation along the following. Given the characteristics of a population, possibly derived from observed extant individuals, to estimate the (1) optimal sample size and (2) optimal sequence length for the most informative analysis.
Every genetic event that is consequential to the genetic landscape of a population is captured in a topological structure called the Ancestral Recombinations Graph (ARG) . The converse of this may not hold, that is every genetic event captured in the ARG may not contribute to the observed genetic patterns in the extant population, but nevertheless is a legitimate component of the relevant and common genetic history of the population. In a sense, the ARG is the phylogeny of the individuals of the population. It should be noted that just as in a phylogeny topology, an ARG also does not have any extraneous nodes. Further, it is not unreasonable to assume that there exists a "true" ARG for a collection of samples or a population. Recall that the nodes of the ARG represent genetic events. The topology is not necessarily a tree, due to genetic exchange events such as recombinations, gene duplications and so on. These are represented as nodes with multiple incoming edges in the ARG. The edges are usually annotated with mutation and the lengths are representative of the ages in generation. Thus the topology, together with its annotation and the edge lengths, determines the genetic landscape of the extant samples. The reader is directed to  for an exposition on random graph representation of the ARG.
In this paper, we simply use the expected number of nodes in the ARG as a measure of the relevant genetic history of the population. While this may not be precise, it is a fair proxy for the amount of the relevant genetic history. Then a well-defined question to ask is: What is the largest fraction, f, of the history that is estimable from a given sample? In other words, no matter what methodological ingenuity is employed, there is always (1 - f) fraction of the common history that is impenetrable. Let N be the number of nodes in an underlying ARG topology. Given the extant samples, let a method estimate N' ≤ N nodes. We further assume that the lengths of the edges, as well as the interconnectivity with the labels, are estimated correctly, so the estimated fraction of this ARG, defined as 0.0 ≤ N'/N ≤ 1.0, is a natural "overstimate". Let be the maximum of from all possible methods i. Then f, the penetrable fraction, is . However, it is impossible to enumerate all possible estimation methods. So, we resort to the mathematical structure called the minimal descriptor  of an ARG: it is an essential substructure of an ARG that preserves the genetic landscape of the extant samples, including the topology and edge lengths of the marginal trees. The reader is directed to  for an exposition on this nonredundant information content of an ARG. The minimal descriptor is also an ARG and let the number of nodes in the minimal descriptor be , then . Thus, this methodology-independent scheme gives an upper bound on the penetrable fraction f as , and a lower bound on the impenetrable fraction as .
In this paper, we seek the value of f, in human populations. Such a study of attempting to "know the unknowable" is best done in a simulation setting. From literature, we pick a population simulator that has the capability of providing not only individuals from different demographies, but also the underlying ARG. This is very suitable for our experimental set-up. Next, we design an algorithm to extract the minimal descriptor from a given ARG. Thus we compute the upper bound on f, as discussed. Recall that each node of the ARG has a specific age or depth associated with it. It may be noted that the length attribute of an edge can be viewed simply as the non-negative difference between the depths of the two incident nodes. The terminal leafnode are the extant individuals. The depth of the extant individual is defined to be zero and the value progressively increases as one traverses the ARG away from the terminal leaf nodes. The nodes of the minimal descriptor are also the nodes of the underlying ARG and the same age is associated with them. Let epoch d be defined as a range of depths say [d 1, d 2] with d 2 ≥ d 1. Then the history density at d, N d , is measured by the number of nodes in the ARG with depth in the epoch d. Extending this notion, the estimable density at d is measured as , where is the number of nodes in the minimal descriptor with depth in the epoch d. We study the demography characteristics in terms of the history density and the estimable density.
Let tARG denote the true ARG for a given data set with N nodes. Then, is the number of estimable nodes of tARG, irrespective of the reconstructability of the interconnecting edges. Thus is an overestimate and is an underestimate of the true values.
- mutation rate: According to different studies Homo sapiens, as a species, has a mutation rate around 1.5 × 10-8 per base pair per generation (bp/gen for short) . However, this value could change along the genome.
- sequence length: When simulating genetic population data, sequence length is one of the most important factors. While it may not computationally feasible to simulate a whole chromosome, enough polymorphisms are required in order to get meaningful results.
- sample size: The sample size needs to be large enough to capture important population features.
- recombination rate: The mean recombination rate along the genome in Homo sapiens is around 1.3 cM/Mb . However it has been seen that it can vary widely in a fine-scale manner when focusing on specific regions of the genome . Different simulations are run using recombination rates matching the major portion of the range observed in human data .
Mutation rate (bp/gen × 10-8)
0.7, 1.5, 3.0
Sequence length (Kb)
5, 10, 30, 50, 75, 100, 150, 200
5, 10, 30, 60, 120
Recombination rate (cM/Mb)
0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3, 2.5, 2.8, 3.1, 3.5, 3.9, 4.5, 5.1
Each population of the COSI demography has been tested independently as well as the whole human demography (i.e. all populations together). In total, more than 22800 simulations replicates were generated combining different values for the four simulation parameters described above. This includes ten replicates for each experiment. For the highest value of sequence length used (i.e. 200 Kb), some experiments were terminated after thirty minutes, since no substantial progress was being made towards its completion. Therefore, the results for 200 Kb sequence length are not reported in the summary plots.
Recall that an ARG is a phylogenetic structure that encodes both duplication events, such as mutations, as well as genetic exchange events, such as recombinations: this captures the (genetic) dynamics of a population evolving over generations. From a topological point of view, an ARG is always a directed acyclic graph where the direction of the edges is toward the more recent generation. An edge is annotated with the mutation genetic event, possible multiple events. Some simulators may give edges with empty labels. Recall that the length of the edge, not to be confused with the edge label, represents the epoch defined by the age (or depth) of the two incident nodes. A chain has a single incoming edge and a single outgoing edge. In the ARG we define the following nodes: (a) the leaf nodes have no outgoing edges and they represent the extant unit (b) the coalescent nodes have single incoming edges, and (c) the exchange nodes have multiple incoming edges. Finally, given two nodes v and w, if there is an outgoing edge from v to w, then v is referred to as parent of w and w is referred as a child of v.
In  a structure-preserving and samples-preserving core of an ARG G, called the minimal descriptor ARG (mdARG) of G was identified. Its structure-preserving characteristic ensures that the topology and the all the branch lengths of the marginal trees of the minimal descriptor ARG are identical to that of G and the samples-preserving property asserts that the patterns of genetic variation in the samples of the minimal descriptor ARG are exactly the same as that of G. It was also shown that an unbounded G has a finite minimal descriptor, that continues to preserve critical graph-theoretic properties of G. Thus this lossless and bounded structure is well defined for all ARGs (including unbounded ARGs) and we use the same here. However, a minimal descriptor of an ARG may not be unique. This does not affect the estimation of f, since , for all possible corresponding the the different minimal descriptors (see the Background section for the definitions).
Given the genetic landscape of some extant samples, its underlying ARG is a plausible explanation of the observation, since it is the annotated topological structure that captures the genetic history in its totality, that is relevant to the extant samples. It can also be viewed as a generator that faithfully produces the genetic landscape of the different demographies and since it is a random graph , we use multiple replicates to study its the characteristics. We observe that the mutation rate does not influence f significantly. Hence, we use a fixed mutation rate of 1.5 × 10-8 bp/gen in the figures. The interested reader is directed to the Additional File 1 for the plots of each experiment for all the other mutation rates. Even though the ARG is not a tree, the density (i.e, number of nodes per epoch) of the relevant genetic events decrease exponentially with depth. Also, the shape of the profiles for the different demographies is independent of the four classes of parameters.
Reconstructability of common genetic history is a fundamental curiosity in the study of populations. While the population evolution models mature and the algorithms get more sophisticated, what fraction of the common and relevant genetic history of populations continues to be elusive? We present a framework that enables such an exploration. This is based on the random topological structure, the ARG and a method-independent (information-theoretic) structure called the minimal descriptor. This is applied to different demographics in a simulation setting. The most surprising observation is that the sum of the reconstructible history of each of the chromosomal segments, s 1, s 2, ..., s m , is indeed larger than the reconstructible history of the single segment composed of these segments. This appears to be a universal property, holding in all the demographies tested. Also, irrespective of the sample size, we observe that at least one-third of the population genetic history is impenetrable, in all the demographies.
The framework also opens up possible new directions of investigation. Assume that the characteristics of a population can be derived, say from the linkage disequilibrium landscape and other characteristics of observed extant individuals. Then, can such a generator be used to answer the "best-practice" questions about the population: what is the (1) optimal sample size and (2) optimal sequence length for the most informative analysis.
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
MP carried out this work during an internship at IBM T J Watson Research Center.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.