A human genome-wide library of local phylogeny predictions for whole-genome inference problems
- Srinath Sridhar^{1} and
- Russell Schwartz^{2}Email author
DOI: 10.1186/1471-2164-9-389
© Sridhar and Schwartz; licensee BioMed Central Ltd. 2008
Received: 27 February 2008
Accepted: 18 August 2008
Published: 18 August 2008
Abstract
Background
Many common inference problems in computational genetics depend on inferring aspects of the evolutionary history of a data set given a set of observed modern sequences. Detailed predictions of the full phylogenies are therefore of value in improving our ability to make further inferences about population history and sources of genetic variation. Making phylogenetic predictions on the scale needed for whole-genome analysis is, however, extremely computationally demanding.
Results
In order to facilitate phylogeny-based predictions on a genomic scale, we develop a library of maximum parsimony phylogenies within local regions spanning all autosomal human chromosomes based on Haplotype Map variation data. We demonstrate the utility of this library for population genetic inferences by examining a tree statistic we call 'imperfection,' which measures the reuse of variant sites within a phylogeny. This statistic is significantly predictive of recombination rate, shows additional regional and population-specific conservation, and allows us to identify outlier genes likely to have experienced unusual amounts of variation in recent human history.
Conclusion
Recent theoretical advances in algorithms for phylogenetic tree reconstruction have made it possible to perform large-scale inferences of local maximum parsimony phylogenies from single nucleotide polymorphism (SNP) data. As results from the imperfection statistic demonstrate, phylogeny predictions encode substantial information useful for detecting genomic features and population history. This data set should serve as a platform for many kinds of inferences one may wish to make about human population history and genetic variation.
Background
Since the first draft sequences of the human genome were completed, much of the sequencing field has turned to the problem of identifying common genomic variations and their distributions among human populations [1–3]. These variations exist predominantly in the form of single nucleotide polymorphisms (SNPs), single DNA bases that take on two common alleles in the population. While most of these variants are believed to be functionally neutral, they nonetheless encode a great deal of information about the history and structure of the population from which they are sampled, as well as the molecular evolution of the local genetic region in which each occurs. Many computational methods have been developed to infer properties of the molecular evolution or population genetics of a species from these SNP data. Examples include methods for identifying sites of frequent recombination (e.g, [4]) or gene conversion (e.g., [5]), identifying conserved haplotype sequences (e.g., [6]), finding genomic regions that have undergone selective sweeps (e.g., [7]), and detecting population substructure (e.g., [8, 9]) and admixture (e.g., [10, 11]).
All of these inference methods work by a common principle of superimposing a mathematical model of the evolutionary event or process to be detected on a model of neutral evolution in the absence of that process. For example, an inference of population substructure may compare whether observed SNP allele frequencies in the current generation are more consistent with what we expect to find in a single population under Hardy-Weinberg equilibrium or what we expect to see in two genetically isolated populations evolving independently over many generations. Any such inference could in principle be made more easily and accurately if we could observe not just the current generation, but also prior generations at various points in this evolutionary process. Information on these past genetic sequences, commonly encoded in phylogenetic trees or networks, is not generally directly observable but it too can be computationally inferred.
Our goal in this paper is to facilitate a general strategy for performing a range of statistical inferences from genetic variation data: using phylogenetic inferences from the variation data as a common starting point and treating these inferred phylogenies as the input to inferences about specific features of molecular evolution in a population. Similar ideas have previously been applied on smaller scales. Such phylogeny-based inferences have been developed for specific inference problems, such as the detection of likely recombination breakpoints [12]. In addition, genome-scale analyses of phylogenies have been conducted in bacteria. For example, Filliol et al. [13] examined phylogeny inference on a genomic scale for the purpose of categorizing genetic variations from geographically diverse isolates of Mycobacterium tuberculosis. Nonetheless, there remain substantial obstacles to the more general use of phylogeny-based inference for whole-genome analysis in eukaryotes. Phylogeny inferences may proceed from a more limited model of molecular evolution than do the downstream inference algorithms and may therefore fail to represent key evolutionary events. Even if the model is correct, the inferred phylogenies may be incorrect. While there is no information in the modern sequences that is not also found in the phylogenies, regardless of their accuracy, incorrect phylogenies may end up confounding our analyses.
Perhaps most limiting is that intra-species phylogeny inference is a computationally demanding task that would be intractable on genomic scales by any widely-used inference method. The simplest variant of the problem is maximum parsimony (MP) [14], which seeks the smallest tree capable of explaining a given data set, a method that tends to be most suitable for short time scales in which mutations are likely to have been infrequent. MP phylogeny inference has been shown to be NP-hard [15] and thus computationally intractable except on small problem instances. There have been some prior methods to solve MP phylogenies optimally, as well as to solve more difficult inferences, such as the inference of phylogenetic networks incorporating both mutation and recombination [16–18]. In practice, MP phylogeny inference is generally performed with fast heuristic methods that do not guarantee optimal solutions (see, for example, Felsenstein [19]). Even these methods are not sufficiently tractable to process the millions of trees one would need for inference on genomic scales. Maximum likelihood [20] and Bayesian [21] methods for phylogeny inference tend to allow for more realistic and detailed mutational models and to more accurately account for uncertainty in inferences than do MP methods, but at the cost of generally even greater computational time.
In the present work, we seek to enable widespread use of phylogeny-based inference for genome-wide analysis by creating a library of local human phylogenies across the human genome drawn from the HapMap variation data [2]. We create this library by applying a recently developed method for maximum parsimony phylogeny inference [22] that made it possible for the first time to construct the millions of local phylogenies needed to enable whole-genome phylogeny-based inferences. We illustrate the use of this library for phylogeny-based inferences with sample applications based on a tree statistic that we call "phylogenetic imperfection." We demonstrate that imperfection shows significant regional and cross-population conservation and show that it is significantly predictive of fine-scale recombination rate.
Methods
Phylogeny Inferences
We infer maximum parsimony phylogenies using a method developed in Sridhar et al. [22]. The method finds maximum parsimony mutational phylogenies from matrices of binary SNP variation data. The algorithm first uses a series of preprocessing steps to eliminate redundancy from an observed data set, decompose the problem into smaller sub-problems where possible, and limit the space of possible solutions to each problem. These simplified phylogeny sub-problems are then converted into an equivalent representation based on the graph-theoretic concept of multi-commodity flows. These flow problems then translate directly into a mathematical format called an integer linear program (ILP), a form of constraint satisfaction problem for which highly optimized solver programs are available. We then use the CPLEX 10.0 ILP solver to find an optimal solution to the ILP, which we can then convert into a minimum-size solution to the original phylogeny problem. We refer the reader to Sridhar et al. [22] for details on the theory and implementation of these methods.
Data Sets
This study primarily uses data from the International Haplotype Map (HapMap Phase II) [2] for the purpose of conducting a fine-scale genome-wide scan of human genetic variations. We use computationally phased HapMap data for this analysis. Although we have developed algorithms that will infer maximum parsimony phylogenies directly from unphased data [23, 24], these algorithms are not efficient enough for use on a whole-genome scale. We restricted ourselves to the HapMap CEU population of Utah residents of European ancestry and the YRI population of residents of Yoruba in Ibadan, Nigeria because these subpopulations were genotyped for parent-child trios and can thus be expected to have minimal phasing error. The other two HapMap data sets (Han Chinese in Beijing, China and Japanese in Tokyo, Japan) were genotyped only for unrelated individuals and were omitted here due to the higher likelihood of phasing errors. All HapMap data sets were downloaded in phased form from the HapMap web site, where the PHASE program [25] had been used to identify most likely phases from the trio data. This HapMap build was based on the NCBI human genome assembly build 35 [26]. SNP location assignments and genomic coordinates are therefore based on NCBI build 35. The resulting data contained 120 haplotypes from 60 unrelated individuals for each of the two populations typed at approximately 3.7 million SNPs. Phylogeny inferences were run for window sizes of five, six, seven, eight, and nine consecutive SNPs at each overlapping window of the given size across the 22 autosomal human chromosomes in each of the CEU and YRI populations. The resulting library contains a total of nearly 16 million phylogenies, each covering 5–9 consecutive SNPs on 60 chromosomes. Statistical analyses described below were performed using the 5-SNP libraries. A subset of these analyses, described below, were performed after screening the full library to remove windows spanning predicted recombination hotspots, as assessed by the LDhot method of McVean [4].
Several additional datasets were used to study correlation of imperfection with other sequence features. We retrieved the set of nonsynonymous coding SNPs (ncSNPs) mapped to the build 35 genome using the Ensemble BioMart tool [27, 28], selecting all ncSNPs with validated assays. Fine-scale recombination rates and recombination hotspots were retrieved from the HapMap web site [2]. Locations of all short tandem repeats in the human genome were retrieved from the UCSC Genome web site [29]. The set of all human repeats was based on RepeatMasker [30] inferences and was also retrieved from the UCSC Genome resources Table View tool. The locations of high-scoring hits were also manually examined using the UCSC Genome Browser [31] and the dbSNP resource at the NCBI web site [1] to identify the genes and repetitive regions containing the particular SNPs of interest and, for coding SNPs, to identify their corresponding amino acid changes.
Statistical Analysis
Phylogenetic Imperfection
Note that computing phylogenetic imperfection is NP-hard. While we do not provide a formal proof of that statement, it follows from the fact that knowing the imperfection of a dataset allows one to trivially compute the number of mutations found in the maximum parsimony tree. One could therefore use an efficient algorithm for computing imperfection to create an efficient algorithm for MP phylogeny inference. For example, one might repeatedly identify one possible node and edge on the periphery of an optimal tree whose elimination reduces the optimal parsimony score, then recurse on the remainder of the data to construct the rest of the tree.
Mutual information
In order to test for regional variations in phylogenetic imperfection, we calculated mutual information between windows at varying genomic distances. We enumerated all pairs of 5-SNP windows across each chromosome, excluding those with overlapping SNPs. Each pair of windows was placed in a bucket according to the distance separating the central SNPs of the two windows. Bucket widths of 1 kb and 100 kbs were used in separate tests. We then treated the entries in these buckets as samples from two random variables, one variable corresponding to the imperfection of the upstream element of each pair and the other to the imperfection of the downstream element of each pair. We then calculated the mutual information of these two random variables as a measure of how informative an imperfection score at any one point on the genome is about those at varying distances along the genome.
The mutual information of the set is then defined as H(x) + H(y) - H(x, y).
In order to establish statistical significance of mutual information scores, we used the fact that a mutual information score can be regarded as a log likelihood ratio statistic, which itself is approximately chi-square distributed for sufficiently large sample size. Many of the data points in a given bucket will be dependent on one another, even under the null hypothesis that different windows are independent of one another, because the same window may contribute to multiple pairs within a given distance. We therefore adopted a conservative estimate of significance by taking only a subset of pairs containing at most one pair for any given window. With this approximation, the significance of any given data point can be estimated by treating it as a chi-square statistic with value (2 ln 2)Im, where I is the calculated mutual information and m is the number of independent data points supporting it. We calculated this value for each window and used the minimum value of the statistic as an approximate upper bound on the p-value of the comparisons. The number of degrees of freedom is equal to the maximum observed imperfection score, 23 for these data.
Imperfection versus recombinations
In order to compare imperfection and fine-scale recombination rates, we first identified for each window in our data set the location of the central SNP in the window. We then retrieved the fine-scale recombination rate at each such SNP from the HapMap-supplied data. The result was two paired lists of data points. We calculated Pearson correlation coefficients for the two lists for each chromosome individually and for all chromosomes collectively. Statistical significance of the correlation coefficients was assessed by permutation test, randomly permuting one data set with respect to the other for 1,000 trials for each test reported. A curve was fit to the data points by proposing that imperfection i is related to recombination rate r by a function of the form i = a(1 - e^{-br}) and using Newton-Raphson iteration to find the least-squares best fit parameters a and b. We calculated the correlation coefficients with the original data-set as well as with windows spanning recombination hotspots removed.
Imperfection Outliers
We selected those windows of highest imperfection by summing imperfection computed in the CEU and YRI populations for common windows and selecting those with the highest sum. This analysis was performed using data from which recombination hotspots had been excluded. The analysis was run once for the set of all windows in the genome excluding hotspots and once for those windows centered on non-synonymous coding SNPs (ncSNPs). ncSNPs mapped to the genome were selected from Biomart [27, 28] using the February 2006 dbSNP build. We selected all those with combined imperfection at least 24 for the set of all windows and for combined imperfection at least 12 for the set of ncSNP-centered windows. The gene list generated from ncSNP outliers was run through the GOSTAT server [32] to identify overrepresented gene classes. We report only the two most significant such classes as most of the others appear to represent subcategories of those two best hits.
Comparison to Maximum Likelihood Phylogenies
One might question whether statistics drawn from our phylogeny library could be biased by systematic errors in phylogeny inference. For example, MP inference can never produce trees larger than the true phylogenies and may therefore systematically underestimate phylogeny size. While there is no known ground truth by which we might definitively test for such biases, we can compare a subset of trees to those from a more statistically sound maximum likelihood (ML) method. Because of the high run-time of ML methods, we can compare only a small subset of the windows. We examined the first 200 5-SNP windows from chromosome 1. We constructed trees for these windows using the Phylip [33] ML inference code, with uniform mutation rate and constant rate variation among sites, using the speedier (S) processing option. We cannot directly compare geometries between the ML and MP trees because the ML approach necessarily treats individuals with the same sequence as distinct tree nodes, creating large subtrees with essentially arbitrary connectivity. We therefore post-process the ML trees by collapsing all nodes of common sequence (both observed haplotypes and inferred Steiner nodes) and relinking the resulting non-redundant node set into a minimum-cost spanning tree. We compared the average phylogeny sizes between these processed ML inferences and our MP data, quantified by the fractional difference between mean MP and mean ML tree size over the 200 data sets. We further quantified differences by two metrics on the geometries of the MP and processed ML trees. We first applied the Robinson-Foulds distance [34], a widely used metric for comparison of phylogenetic trees that measures the number of bipartitions of the population that are defined by edges in one tree but not the other. Second, we examined the root mean square (RMS) difference, summed over all pairs of observed haplotypes, between the separation of the pair in one tree and their separation in the other.
Computer Resources
The inferences were performed on a Pentium workstation computer running Linux. Code was written in C++ and uses the CPLEX 10 ILP solver for linear programming solution. Our inference algorithms and the phylogeny library are accessible through a web server at http://www.cs.cmu.edu/~imperfect. The server provides access to the preprocessed human phylogeny library and a front-end to a server to which users can supply their own data to be solved by our methods. In addition, users can directly download a full set of phylogenies for each chromosome in DOT format, a language for graph description developed for the Graphviz graph rendering package. The present analyses were based on inferences of a single phylogenetic tree per window of SNPs examined, but the server can also infer the network produced by the union of all maximum parsimony phylogenies for any given window. Source code will be provided upon request, although users must supply their own ILP solver to run it.
All other data processing and statistical computations were performed with code written in the Perl language. Graphics for the paper were prepared with SigmaPlot version 10 and Gnuplot version 3.7.
Results and discussion
Genome-wide Imperfection Scan
Run times in hours for inferring the complete phylogeny library, separated by window size and population.
Population | 5-SNP | 6-SNP | 7-SNP | 8-SNP | 9-SNP |
---|---|---|---|---|---|
CEU | 6 h. | 16 h. | 11 h. | 15 h. | 128 h. |
YRI | 13 h. | 8 h. | 15 h. | 37 h. | 53 h. |
In addition to providing a coarse visualization, the imperfection scores give us a statistic for assessing population conservation of the phylogenies. Over all chromosomes, the CEU data shows a mean imperfection of 0.30 and the YRI a mean imperfection of 0.55. These results may reflect the higher genetic diversity of African versus European populations. The two populations do, however, show a strong overlap in regions of high or low imperfection on the genome. For SNPs variant in both populations, the imperfections have a correlation coefficient of 0.49 between the populations when examining all windows or 0.36 when recombination hotspots are excluded. This correlation may reflect the in fluence of common histories prior to divergence of the two lineages, some inherent propensity of particular sites in the genome towards larger or smaller phylogenies, some combination of the two, or some systematic SNP-specific bias.
Mutual Information
Imperfection as a Statistic of Fine-scale Recombination
Chromosome-by-chromosome Pearson correlation coefficients of local phylogenetic imperfection and fine-scale recombination rate outside of recombination hotspots.
Chromosome | r_{ CEU }^{ a }(CEU) | r_{ YRI }^{ b }(YRI) |
---|---|---|
1 | 0.23 | 0.30 |
2 | 0.24 | 0.29 |
3 | 0.26 | 0.30 |
4 | 0.24 | 0.28 |
5 | 0.25 | 0.30 |
6 | 0.25 | 0.29 |
7 | 0.26 | 0.30 |
8 | 0.27 | 0.31 |
9 | 0.24 | 0.30 |
10 | 0.24 | 0.27 |
11 | 0.26 | 0.30 |
12 | 0.23 | 0.28 |
13 | 0.26 | 0.31 |
14 | 0.27 | 0.31 |
15 | 0.28 | 0.32 |
16 | 0.28 | 0.31 |
17 | 0.32 | 0.34 |
18 | 0.27 | 0.34 |
19 | 0.32 | 0.33 |
20 | 0.28 | 0.28 |
21 | 0.24 | 0.27 |
22 | 0.29 | 0.34 |
Chromosome-by-chromosome Pearson correlation coefficients of local phylogenetic imperfection and fine-scale recombination rate including recombination hotspots.
Chromosome | r_{ CEU }^{ a }(CEU) | r_{ YRI }^{ b }(YRI) |
---|---|---|
1 | 0.37 | 0.39 |
2 | 0.39 | 0.41 |
3 | 0.41 | 0.41 |
4 | 0.39 | 0.40 |
5 | 0.42 | 0.42 |
6 | 0.42 | 0.41 |
7 | 0.41 | 0.39 |
8 | 0.41 | 0.41 |
9 | 0.37 | 0.38 |
10 | 0.38 | 0.39 |
11 | 0.39 | 0.39 |
12 | 0.40 | 0.38 |
13 | 0.42 | 0.43 |
14 | 0.44 | 0.44 |
15 | 0.43 | 0.41 |
16 | 0.38 | 0.37 |
17 | 0.42 | 0.41 |
18 | 0.42 | 0.45 |
19 | 0.38 | 0.39 |
20 | 0.37 | 0.36 |
21 | 0.43 | 0.41 |
22 | 0.39 | 0.40 |
This analysis also allows us to reconsider the issue of conservation between the populations by examining how well mutual correlation with recombination rate explains correlation of phylogeny sizes between populations. The overall Pearson correlation between recombination and imperfection across all chromosomes is 0.40 for both populations, compared to a correlation of 0.49 between the imperfections of the two populations. These comparisons were likewise all determined to be significant with p-value < 0:001 by permutation tests. When hotspots are excluded, the overall correlations between imperfection and recombination rate drop to 0.36 for both populations. The correlation between imperfection scores for the two populations drops to 0.45. Collectively, these observations confirm that the imperfection scores are significantly influenced by other local genomic properties than just recombination rates.
Imperfection by SNP Class
This result is unsurprising since Africans are a more diverse and older population group than Europeans. Low imperfection scores account for a substantial majority of windows for all four sets and the absolute differences between the frequencies at the first few imperfection scores are small. ncSNP windows have a very slightly higher fraction of perfect phylogenies compared to all windows (77.92% versus 77.86% for CEU and 68.02% versus 65.13% for YRI). While there are greater differences between ncSNPs and general windows for larger imperfection scores, these could be explained by the small numbers of examples for the largest imperfection values. Selection against changes in protein code therefore appears to introduce at most a modest bias in window imperfection scores.
Outlier Phylogenies
Windows exhibiting the highest imperfection in the genome.
central SNP^{ a } | chr.^{ b } | pos.^{ c } | i _{ CEU } ^{ d } | i _{ YRI } ^{ e } | sequence context |
---|---|---|---|---|---|
rs2486545 | 1 | 240569616 | 15 | 17 | intergenic region |
rs10174559 | 2 | 241396766 | 11 | 16 | intron of KIF1A kinesin family member 1A |
rs11683248 | 2 | 29360497 | 11 | 14 | intron of ALK anaplastic lymphoma kinase (Ki-1) |
rs7405052 | 16 | 84225446 | 7 | 17 | intron of KIAA0182 |
rs12918736 | 16 | 84226491 | 7 | 17 | intron of KIAA0182 |
rs10926263 | 1 | 236937389 | 8 | 16 | intron of FMN2 formin 2 |
rs6037439 | 20 | 296188 | 17 | 7 | intergenic region |
rs7173687 | 15 | 24667811 | 9 | 15 | intron of GABRA5 gamma-aminobutyric acid (GABA) A receptor, alpha 5 |
rs2493310 | 1 | 3317579 | 9 | 15 | intron of PRDM16 PR domain containing 16 |
rs8045380 | 16 | 84226877 | 10 | 14 | intron of KIAA0182 |
Windows centered on non-synonymous coding SNPs exhibiting the greatest imperfection, excluding recombination hotspots.
central SNP^{ a } | chr.^{ b } | pos.^{ c } | i _{ CEU } ^{ d } | i _{ YRI } ^{ e } | gene^{ f } | variation^{ g } |
---|---|---|---|---|---|---|
rs1810247 | 15 | 19915318 | 5 | 11 | LOC650137 seven transmembrane helix receptor | C85R |
rs17690844 | 8 | 17656239 | 4 | 10 | MTUS1 mitochondrial tumor suppressor 1 | T453K |
rs2368406 | 10 | 29824078 | 6 | 8 | SVIL supervillin | A809P |
rs4973588 | 2 | 233660480 | 4 | 9 | NGEF neuronal guanine nucleotide exchange factor | T111M |
rs7208422 | 17 | 73642170 | 8 | 5 | TMC8 transmembrane channel-like 8 | I306N |
rs3751928 | 17 | 68792947 | 7 | 6 | CDC42EP4 CDC42 effector protein (Rho GTPase binding) 4 | |
rs7627615 | 3 | 185301118 | 4 | 9 | HTR3E 5-hydroxytryptamine (serotonin) receptor 3, family member E | T86A |
rs2802808 | 1 | 201698085 | 4 | 8 | NFASC neurofascin homolog (chicken) | |
rs10790715 | 11 | 124298892 | 5 | 7 | HEPACAM hepatocyte cell adhesion molecule | V218M |
rs557806 | 19 | 54069054 | 5 | 7 | PPP1R15A protein phosphatase 1 regulatory (inhibitor) subunit 15A | P251R |
rs1356410 | 15 | 40222129 | 5 | 7 | PLA2G4F phospholipase A2 group IVF | V740M |
rs351111 | 19 | 795020 | 6 | 6 | PRTN3 proteinase 3 (serine proteinase, neutrophil, Wegener granulomatosis autoantigen) | I119V |
Comparison to Maximum Likelihood Phylogenies
In order to test for systematic biases in phylogenies introduced by inference from an MP method, we compared them to phylogenies inferred by a maximum likelihood (ML) method for a subset of 200 5-SNP windows. We found a mean phylogeny size of 5.67 mutations for the ML trees versus 5.53 for the MP trees. If we regard the ML trees as a close approximation to the ground truth, then we can conclude MP trees underestimate phylogeny size by an average of 2.5% on these data. We can therefore suggest that, while there is some systematic bias toward smaller trees with an MP method, the bias is relatively modest for small windows. The RMS distance between pairs of individuals for corresponding ML and MP trees is 0.112, suggesting that most individuals are in similar relative positions between the two trees. The Robertson-Foulds distance has a mean value of 0.54, indicative of a somewhat larger average variation when comparing trees by edges rather than by individuals. This mean Robertson-Foulds score can be interpreted as an average of just over one inconsistent edge between each pair of trees. Since we cannot guarantee the optimality of the ML trees by the ML criterion, these measures may in fact overstate the difference between the MP trees and ground truth. It is also possible, though, that the ML trees may themselves be biased relative to the ground truth and may therefore understate the bias in the MP trees. Our post-processing step to collapse sub-trees with identical sequence may also bias the geometries to more closely match those of MP trees. Future examination with other tree statistics or other methods of tree inference may also yield more dramatic differences than we observe here.
Conclusion
We have used recent methodological improvements in fast phylogeny algorithms to construct a genome-wide library of local mutational phylogenies in the human genome. This library provides an unprecedented view of likely sequences of mutational events in local regions of the genome that may give us new insight into mechanisms of mutation and selective pressures on genomic scales and in individual genes of interest. Many forms of genomic analysis rely on indirect inferences of molecular evolution based on modern observed sequences and would likely benefit from accurate knowledge of full evolutionary histories. While we cannot observe these full histories, it is now possible to make reasonable inferences in local regions. We hope that this library will help enable a "phylogeny first" approach to whole-genome analysis tools based on the common hypothesis that good inferences of phylogenies will provide a stronger basis for statistical prediction of a broad class of genomic features than do raw variation data. We have demonstrated this approach with several sample applications of a simple tree statistic, imperfection, that measures the total size of a phylogeny.
The imperfection tree statistic is predictive of fine-scale recombination rate. It may therefore be useful as an alternative method for estimating recombination rates. Moreover, it detects sequence regularities beyond the correlations for which recombination rate can account, which we conjecture is likely to include local mutation rate biases. Phylogenetic imperfection in conjunction with other measures of recombination rate may be a useful way to separate these possibilities. This result may also in part reflect the fact that recombination rate and mutation rate are themselves correlated [44, 45]. It is possible that other mechanisms, such as gene conversion, significantly affect observed imperfection scores. Wiehe et al. [46] showed that one can distinguish recombination and gene conversion by local patterns of linkage disequilibrium (LD); a characteristic pattern of high LD between non-consecutive SNPs with mutual low LD to intervening SNPs might similarly allow one to distinguish genuine recurrent mutation from other possible sources of high phylogenetic imperfection.
The imperfection statistic also allows us to test several hypothesis about molecular evolution on genomic scales. One such hypothesis is that there are significant regional biases in mutational propensities across the genome, beyond what can be accounted for by local signals such as recombination hotspots. Consistent with the hypothesis, imperfection shows a pattern of local correlation on multiple scales, from a strong peak for nearby but non-overlapping windows on the kilobase scale to a gradual decline in correlation on even megabase scales. Given that imperfection strongly correlates with recombination rate but shows significant cross-population correlation even after corrections for recombination rate, it is likely that these regional correlations reflect a combination of regional variation in recombination rate and regional variability in mutation rate. We cannot, however, yet determine the precise degree to which these two factors, or others unknown to us, might contribute to the overall regional variability. Because the information calculations excluded windows sharing SNPs, the very strong local peak on the kilobase scale could only derive from regions extremely dense in SNPs. It is therefore plausible that the fine-scale peak corresponds to local correlations in imperfection due to hypermutable regions of the genome. Further study of regional patterns for known sources of phylogenetic imperfection may help to separate these effects and detect any unanticipated contributing factors. By contrast, the imperfection statistic leads us to reject the hypothesis of significant variations in phylogenetic complexity between coding versus non-coding SNPs, or between SNPs in repetitive versus non-repetitive regions of the genome.
While we use imperfection in the present study as an illustration of our proposed "phylogeny first" approach, there are many other tree statistics that may be informative for particular processes. Gene conversion, selective sweeps, and epistasis, among other processes, might all be anticipated to produce characteristic features of tree geometry by which they might be detected from a phylogeny library. Determining which statistics are informative for particular processes and how they compare to other inference methods is a broad problem that we plan to address in future work.
Declarations
Acknowledgements
This work was supported in part by funding from U.S. National Science Foundation grants CCR-0122581, CCF-043075, and IIS-0612099.
Authors’ Affiliations
References
- Sherry ST, Ward MH, Kholodov M, Baker J, Pham L, Smigielski E, Sirotkin K: dbSNP: The NCBI database of genetic variation. Nucleic Acids Research. 2001, 29: 308-311.PubMedPubMed CentralView ArticleGoogle Scholar
- The International HapMap Consortium: The international HapMap project. Nature. 2005, 426: 789-796.Google Scholar
- The ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) project. Science. 2004, 306: 636-640.View ArticleGoogle Scholar
- McVean GA, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnely P: The fine-scale structure of recombination rate variation in the human genome. Science. 2004, 304: 584-581.View ArticleGoogle Scholar
- Fitch DHA, Goodman M: Phylogenetic scanning: a computer-assisted algorithm for mapping gene conversions and other recombinational events. Bioinformatics. 1991, 7: 207-215.View ArticleGoogle Scholar
- Daly MJ, Rioux JD, Schaffner SF, Hudson TJ: High-resolution haplotype structure in the human genome. Nat Genet. 2001, 29 (2): 229-232.PubMedView ArticleGoogle Scholar
- Nielsen R, Williamson S, Kim Y, Hubisz MJ, Clark AG, Bustamante C: Genomic scans for selective sweeps using SNP data. Genome Research. 2005, 15: 1566-1575.PubMedPubMed CentralView ArticleGoogle Scholar
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.PubMedPubMed CentralGoogle Scholar
- Sridhar S, Rao S, Halperin E: An efficient and accurate graph-based method to detect population substructure. Proceedings of the International Conference on Research in Computational Molecular Biology (RECOMB). 2007, 503-517.View ArticleGoogle Scholar
- Tang H, Coram M, Wang P, Zhu X, Risch N: Reconstructing genetic ancestry blocks in admixed individuals. American Journal of Human Genetics. 2006, 79: 1-12.PubMedPubMed CentralView ArticleGoogle Scholar
- Sankararaman S, Sridhar S, Kimmel G, Halperin E: Estimating local ancestry in admixed populations. American Journal of Human Genetics. 2008, to appearGoogle Scholar
- Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SDW: Automated phylogenetic detection of recombination using a genetic algorithm. Molecular Biology and Evolution. 2006, 23: 1891-1901.PubMedView ArticleGoogle Scholar
- Filliol I, Motiwala AS, Cavatore M, Qi W, Hernando Hazbón M, Bobadilla del Valle M, Fyfe J, García-García L, Rstogi N, Sola C, Zozio T, Inírida Guerrero M, Inés León C, Crabtree J, Angiuoli S, Eisenach KD, Durmaz R, Joloba ML, Rendón A, Sifuentes-Osornio J, Ponce de León A, Cave MD, Fleischmann R, Whittam TS, Alland D: Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. Journal of Bacteriology. 2006, 188: 759-772.PubMedPubMed CentralView ArticleGoogle Scholar
- Edwards AWF, Cavalli-Sforza CC: The reconstruction of evolution. Annals of Human Genetics. 1963, 27: 105-106.Google Scholar
- Day WHE, Johnson DS: The computational complexity of inferring rooted phylogenies by parsimony. Mathematical Biosciences. 1986, 81: 33-42.View ArticleGoogle Scholar
- Song YS, Ding Z, Gusfield D, Langley CH, Wu Y: Algorithms to distinguish the role of gene-conversion from single-crossover recombination in the derivation of SNP sequences in populations. Journal of Computational Biology. 2007, 14: 1273-1286.PubMedPubMed CentralView ArticleGoogle Scholar
- Jin G, Nakhleh L, Snir S, Tuller T: Efficient parsimony-based methods for phylogenetic reconstruction. Bioinformatics. 2006, 23 (2): e123-e128.View ArticleGoogle Scholar
- Jin G, Nakhleh L, Snir S, Tuller T: Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Mol Biol Evol. 2007, 24 (1): 324-337.PubMedView ArticleGoogle Scholar
- Felsenstein J: Inferring Phylogenies. 2004, Sinauer Associates, Sunderland, MAGoogle Scholar
- Edwards AWF, Cavalli-Sforza CC: Reconstruction of evolutionary trees. Pgenetic and Phylogenetic Classification. Edited by: Heywood VH, McNeill J. 1964, Systematics Association, London, 67-76.Google Scholar
- Gomberg D: "Bayesian" postdiction in an evolution process. Unpublished manuscript. 1968, University of Padvia, ItalyGoogle Scholar
- Sridhar S, Blelloch GE, Halperin E, Ravi R, Schwartz R: Algorithms for near-perfect phylogenetic tree reconstruction in theory and practice. ACM/IEEE Transactions on Computational Biology and Bioinformatics. 2007, 4: 561-571.PubMedView ArticleGoogle Scholar
- Sridhar S, Blelloch GE, Ravi R, Schwartz R: Optimal imperfect phylogeny and haplotyping. Proceedings of the Computational Systems Bioinformatics Conference. 2006, 199-210.View ArticleGoogle Scholar
- Sridhar S, Lam F, Blelloch GE, Ravi R, Schwartz R: Efficiently finding the most parsimonious phylogenetic tree via linear programming. Proceedings of the International Symposium on Bioinformatics Research and Applications. 2007, 37-48.View ArticleGoogle Scholar
- Stephens M, Smith N, Donnelly P: A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics. 2001, 68: 978-989.PubMedPubMed CentralView ArticleGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Chetvermin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Helmberg W, Kasputin Y, Kenton DL, Khovayko O, Lipman DJ, Madden TL, Maglott DR, Ostell J, Pruitt KD, Schuler GD, Schriml LM, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Suzek TO, Tatusov R, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Research. 2006, 34: D173-D180.PubMedPubMed CentralView ArticleGoogle Scholar
- Birney W, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Gräf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kähäri A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin P, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard UJP: Ensembl 2006. Nucleic Acids Research. 2006, 34: D556-D561.PubMedPubMed CentralView ArticleGoogle Scholar
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: A generic system for fast and exible access to biological data. Genome Research. 2005, 14: 160-169.View ArticleGoogle Scholar
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ: The UCSC Genome Browser Database. Nucleic Acids Research. 2003, 31: 51-54.PubMedPubMed CentralView ArticleGoogle Scholar
- Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0. 2004, [http://www.repeatmasker.org]Google Scholar
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Research. 2002, 12: 996-1006.PubMedPubMed CentralView ArticleGoogle Scholar
- Beissbarth T, Speed TP: GOstat: Find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics. 2004, 20 (9): 1464-1465.PubMedView ArticleGoogle Scholar
- Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author. 2005, Department of Genome Sciences, University of Washington, SeattleGoogle Scholar
- Robinson D, Foulds L: Comparison of phylogenetic trees. Mathematical Biosciences. 1981, 53: 131-147.View ArticleGoogle Scholar
- Myers S, Bottolo L, Freeman C, McVean G, Donnelly P: A fine-scale map of recombination rates and hotspots across the human genome. Science. 2005, 310: 321-324.PubMedView ArticleGoogle Scholar
- Tautz D, Schlötterer C: Simple sequences. Current Opinion in Genetics and Development. 1994, 4: 832-837.PubMedView ArticleGoogle Scholar
- Field D, Wills C: Abundandant microsatellite polymorphisms in Saccharomyces cerevisiae and the different distributions of microsatellites in eight prokaryotes and S. cerevisiae, result from strong mutation pressures and a variety of selective forces. Proc Natl Acad Sci U S A. 1998, 95 (4): 1647-1652.PubMedPubMed CentralView ArticleGoogle Scholar
- DiBenedetto M, Pineau P, Nouet S, Berhouet S, Seitz I, Louis S, Dejean A, Couraud PO, Strosberg AD, Stoppa-Lyonnet D, Nahmias C: Mutation analysis of the 8p22 candidate tumor suppressor gene ATIP/MTUS1 in hepatocellular carcinoma. Molecular and Cellular Endorinology. 2006, 252: 207-215.View ArticleGoogle Scholar
- Moh MC, Lee LH, Yang X, Shen S: HEPN1, a novel gene that is frequently down-regulated in hepatocellular carcinoma, suppresses cell growth and induces apoptosis in HepG2 cells. Journal of Hepatology. 2003, 39: 580-586.PubMedView ArticleGoogle Scholar
- Ramoz N, Rueda LA, Bouadjar B, Montoya LS, Ort G, Favre M: Mutations in two adjacent novel genes are associated with epidermodysplasia verruciformis. Nature Genetics. 2002, 32: 579-581.PubMedView ArticleGoogle Scholar
- Korabiowska M, Cordon-Cardo C, Betke H, Schlott T, Kotthaus M, Stachura J, Brinck U: GADD153 is an independent prognostic factor in melanoma: Immunohistochemical and molecular genetic analysis. Histol Histopathol. 2002, 17 (3): 805-811.PubMedGoogle Scholar
- Uehara A, Sugawara S, Muramoto K, Takada H: Activation of human oral epithelial cells by neutrophil proteinase 3 through protease-activated receptor-2. J Immunol. 2002, 169 (8): 4593-4603.View ArticleGoogle Scholar
- Ludemann J, Utecht B, Gross WL: Anti-neutrophil cytoplasm antibodies in Wegener's granulomatosis recognize an elastinolytic enzyme. Journal of Experimental Medicine. 1990, 171: 357-362.PubMedView ArticleGoogle Scholar
- Hellmann I, Ebersberger I, Ptak SE, Pääbo S, Przeworski M: A neutral explanation for the correlation of diversity with recombination rates in humans. American Journal of Human Genetics. 2003, 72: 1527-1535.PubMedPubMed CentralView ArticleGoogle Scholar
- Hardison RC, Roskin KM, Yank S, Diekhans M, Kent WJ, Weber R, Einitski L, Li J, O'Connor M, Kolbe D, Schwartz S, Furey TS, Whelan S, Goldman N, Smit A, Miller W, Chiaromonte F, Haussler D: Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Research. 2003, 13: 13-26.PubMedPubMed CentralView ArticleGoogle Scholar
- Wiehe T, Mountain J, Parham P, Slatkin M: Distinguishing recombination and intragenic gene conversion by linkage disequilibrium patterns. Genetical Research. 2000, 75: 61-73.PubMedView ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.