 Methodology article
 Open Access
 Published:
An assembly and alignmentfree method of phylogeny reconstruction from nextgeneration sequencing data
BMC Genomics volume 16, Article number: 522 (2015)
Abstract
Background
Nextgeneration sequencing technologies are rapidly generating wholegenome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for nonmodel genomes and multigenome alignment are challenging.
Results
To greatly simplify the analysis, we present an Assembly and AlignmentFree (AAF) method (https://sourceforge.net/projects/aafphylogeny) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on nonmodel organisms.
Conclusion
Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among nonmodel organisms.
Background
Understanding the phylogenetic relationships among organisms is an essential aspect for many ecological, biogeographical, and evolutionary questions [1]. Currently, the simple step of generating a robust phylogeny for a group of poorly studied organisms can require substantial research investment. Most phylogenies are reconstructed from a tiny portion of the genome [2], but as nextgeneration sequencing technologies become faster and cheaper, the number of species for which whole genome sequence data are available has increased dramatically. Most wholegenome datasets are collected for reasons other than phylogenetic reconstruction, yet it is the first step in many comparative studies [3]. Therefore, constructing a phylogeny from genomic data would be a valuable tool even if genome datasets were not collected with this in mind. Unfortunately, most existing methods for phylogenetic reconstruction are not intended for analysis of genomic scale datasets [4]. Traditional phylogenomic techniques require genome assembly, detection of putative orthologous genes from the assembled sequences, and alignment at the DNA sequence level [5]. These analyses typically go beyond the expertise of researchers who only require a phylogeny to place comparative studies into an evolutionary framework. Approaches are needed to efficiently cope with the dramatic increase in genomic data, and to allow easy and reliable reconstruction of phylogenetic relationships among genomes [6].
Multiplesequence alignment is a central issue in phylogenetic reconstruction, and errors in the alignment process often lead to errors in phylogenetic reconstruction [7]. When working at the genome scale, multisequence alignment becomes very difficult (reviewed in [8]). Alignmentfree methods were initially proposed to circumvent the issues of recombination and genetic shuffling that make alignment difficult [9], and they have attracted attention due to their computational efficiency [10] and accuracy [11]. However, because they are not based on specific evolutionary models, they have been mainly used for comparing similarity between sequences [12, 13] and genomes [14].
The majority of the alignmentfree methods focus on the distribution within and among study genomes of short DNA/protein fragments, known generally as kmers where k is the length of the substring taken from the original sequences [15]. Usually distance matrices are calculated directly from the distribution of kmers, and phylogenies are built from these distances [16]. These distance metrics, however, are derived without an evolutionary model and therefore do not represent the genetic distances (see [17] for a recent review). Furthermore, the kmer statistics used to compare genomes are typically computed from assembled sequences [18, 19]. Unfortunately, for the majority of organisms, a reference genome from a closely related species is not available. Without a reference, denovo genome assembly of short reads remains a major challenge, especially in wild, outcrossed species with high levels of heterozygosity [20]. Additionally, high coverage is generally required [21, 22], but even with high coverage denovo assembly often remains errorprone [23]. The difficulty of assembly has led to methods that directly analyze unassembled read data [24, 25], and some of these methods are proposed for reconstructing phylogenies [26–29]. However, these methods are mainly designed for closely related prokaryotic genomes. Finally, although a couple of these methods [26, 29] have addressed the inherent assemblyfree problems such as coverage and sequencing errors, none proposed any solution. They also do not provide methods such as bootstrapping to assess confidence in the reconstructions.
Here we present a new method that directly reconstructs a phylogeny from wholegenome short read sequence (SRS) data. By removing the need for assembly of the sequencing reads, we extend alignmentfree methods to Assembly and AlignmentFree (AAF) methods. Furthermore, we develop, explain, and validate our AAF method using a combination of sequence evolution models, mathematical calculations and simulated SRS data from published genomes for 11 primates. The mathematical calculations provide the conceptual foundation for the method and predict its performance given a basic evolutionary model for sequence divergence. Simulation models of sequence evolution allow us to test the mathematical predictions. Simulations of SRS data allow us to validate the method given realistic genome complexity, different genome sizes, sequencing errors, and a range of sequence coverage. To provide a tool for researchers to assess their own AAF phylogenetic reconstructions, we also developed a twostage bootstrap that estimates the precision of our method when applied to novel genome data. In order to demonstrate how AAF works on real sequencing data, we apply AAF to reconstruct the phylogeny for 12 mammal species from raw sequencing datasets, since the primate phylogeny is well established through both morphological and molecular data. Finally, to illustrate the ability of AAF to handle data for which it was designed – lowcoverage data from poorly known species – we use AAF to reconstruct the phylogeny for 21 tropical tree species. The package is available at https://sourceforge.net/projects/aafphylogeny/ with detailed documentation and tutorials.
Results and discussion
The AAF approach first calculates pairwise genetic distances between each sample using the number of evolutionary changes between their genomes, which are represented by the number of kmers that differ between genomes. The phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix. Using simulated SRS read data (with sequencing error and incomplete coverage) from published and fully assembled genomes, our AAF method obtained the same phylogeny for 11 primate species (plus one outgroup, Fig. 1) as those previously published using traditional methods [30, 31], even though AAF did not use any information about assembly or alignment. Furthermore, the AAF method was very efficient, requiring only a few days on a standard work station (Table 1). Below, we provide complete theoretical and computational support for our method.
Evolutionary model
Our measure of the phylogenetic distance (denoted d) between two species, A and B, is based on the estimate of the rate parameter from a Poisson process for a mutation occurring at a single nucleotide under the evolutionary model that the mutation rate is the same for all nucleotides across the genomes. We include not only mutations caused by nucleotide substitutions, but also insertions and deletions (indels). If kmers are random nucleotide sequences of length k, and if mutations occur randomly and independently among nucleotides, the probability that no mutation will occur within a given kmer between species A and B is exp(−kd). Mutations will decrease the number of shared kmers, n _{ s }, between species relative to the total number of kmers, n _{ t }. In the hypothetical case in which only substitutions occurred and all kmers were unique, then all the species will have the same total number of kmers, n _{ t }, and the maximum likelihood estimate of exp(−kd) is n _{ s }/n _{ t }. In real situations, indels introduce the complication of changing n _{ t } which can be addressed as follows. A single insertion of length l will cause the loss of at most (k – 1) kmers and a gain of at most (l + k – 1) kmers, while a single deletion of length l will cause the loss of at most (l + k – 1) kmers and a gain of at most (k – 1) kmers. For computing distances, this means that deletions will cause a greater reduction in the number of shared kmers than either a substitution or an insertion of the same length. To account for this asymmetry, when estimating the distance between two taxa, the smaller of the two values of n _{ t } calculated separately for each of the taxa is used. This leads to the estimate of d (denoted D) of
This formula can be modified to correct for back substitutions (reversal to a nucleotide’s original state after two or more mutations), although this effect is small (see Methods: Estimating d).
Although application of Equation 1 for AAF is in principle straightforward, several important issues must be addressed. These divide naturally into alignmentfree and assemblyfree issues. Below, we first address alignmentfree issues that involve extracting as much information as possible about the true evolutionary variation between species. We then address assemblyfree issues that involve reducing sampling variation caused by incomplete coverage and sequencing errors.
Alignmentfree: kmer sensitivity and homoplasy
Lack of alignment makes it more difficult to extract all of the possible information about evolutionary distances between species. If genomes from two species were perfectly aligned, it would be possible to identify all substitutions and indels. In AAF, however, only differences in the presence/absence of kmers are used. If a kmer covers, for example, multiple substitutions, it will count equally as one carrying only a single substitution. Consequently, shorter kmers are more likely to have greater sensitivity to single evolutionary events. On the other hand, identical kmers could be derived from physically, functionally, or evolutionarily different regions of the genome and are therefore not homologous (kmer homoplasy). Longer kmers are less likely to suffer from kmer homoplasy. Therefore, a tradeoff exists for kmer length between the problem of sensitivity (which requires smaller k) and kmer homoplasy (which requires larger k).
To isolate alignmentfree issues, in this section we compute kmers under the assumption that we have completely assembled genomes without error. Therefore, only the evolutionary differences between genomes will be captured, without any differences caused by sampling error or sequencing effort.
kmer homoplasy
kmer homoplasy is generally considered as “noise” in phylogenetic reconstruction [16] and is a particular problem when kmer length is short and genome size is large. For example, if k = 15, the total number of possible kmers accounting for complementarity is 4^{15}/2 (~5x10^{8}), so if genomes are close to this length, identical kmers will appear in different species simply due to the limited number of total possible kmers.
kmer homoplasy may incorrectly inflate the proportion of shared kmers because (i) multiple copies of the same kmer at different locations in species A must all experience mutations before this kmer is no longer shared with species B, and (ii) a kmer that does undergo a mutation may turn into a kmer that already exists elsewhere in the genomes of species A or B (see Methods: kmer homoplasy). kmer homoplasy depends on the frequency distribution of different kmers across the genome of species A and B, which varies with k; for the species with the shortest genome, we denote this distribution Q _{ k }. To account for kmer homoplasy, we derived a mathematical formula for the proportion of shared kmers between species, p _{ h }, which is a prediction of the ratio n _{ s }/n _{ t } based on Q _{ k } (Methods: kmer homoplasy). In this formula, Q _{ k } can be calculated empirically from real genomes (Eq. 4) or estimated theoretically under the assumption that the ancestral genome for species A and B is a random sequence (Eq. 5).
We first investigated the general issue of kmer homoplasy under the assumption that the ancestral genome is a random sequence (Fig. 2a). For large genomes and small k, kmer homoplasy led to p _{ h } = 1 because all possible kmers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic kmer composition. Fortunately, because the possible number of kmers increases exponentially, small increases in k quickly overcome this limitation; for example, when k = 18 over 30 billion kmers are possible, which is considerably larger than the majority of genome sizes, and when k = 21, over 2 trillion possible kmers exist. Therefore, above a threshold k (which differs by genome size and sequence complexity), the effects of kmer homoplasy are greatly reduced. As would be expected, above this k the theoretical and observed values of p _{ h } are the same (Fig. 2a dashed black line). Furthermore, the k at which kmer homoplasy vanishes depends only weakly on the genetic distance d between species (Fig. 2a, d = 0.02 vs. d = 0.1). Therefore, a sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
The mathematical formula for p _{ h } accurately predicted the results from simulated sequence evolution starting from either a random ancestral sequence of 100 kbp (Fig. 2b) or real (nonrandom) sequences (Fig. 2c). Comparing sequence evolution simulated from random and real ancestral sequences, k must be larger to reduce kmer homoplasy with the real ancestral sequence (Fig. 2c); this is because the lower complexity in the real ancestral sequence increases the probability that a kmer appears at multiple locations in the genome by chance. For random sequences of length 10^{9} bp, kmer homoplasy is negligible for k ≥ 19 (Fig. 2a), whereas for Q _{ k } obtained from the primate genomes, kmer homoplasy only becomes negligible for k ≥ 21 (Fig. 2d).
Balancing sensitivity and homoplasy
While kmer homoplasy becomes negligible when k is sufficiently large, continuing to increase k will also increase the probability that a single kmer contains more than one evolutionary event, and this will reduce sensitivity and underestimate the proportion of shared kmers.
The balance between kmer homoplasy and sensitivity can be understood in terms of the statistical properties of bias and precision in the estimate D (Fig. 3). Bias refers to the under or overestimation of distances, whereas precision refers to the variability in the estimates. We investigated both the bias and precision by simulating sequence evolution (Methods: Simulation of sequence evolution); for these simulations, we used relatively short ancestral sequences (compared to the primate genomes) for which smaller values of k will be enough to overcome kmer homoplasy. For short kmers (k = 9, 11), increasing genome size led to consistent underestimates of D (Fig. 3a) due to the increased numbers of shared kmers from kmer homoplasy (Fig. 2). At small genome sizes, however, shorter kmers led to greater precision in the estimate D measured by the coefficient of variation, CV (Fig. 3b). This greater precision for shorter kmers occurred because shorter kmers have greater sensitivity to identify individual mutations. At large genome sizes, however, the precision decreased for shorter kmers due to kmer homoplasy. These results predict that shorter kmers will give better phylogenetic reconstructions up to the point that kmer homoplasy leads to strong bias and imprecision in the estimate D. Therefore, the optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce kmer homoplasy for a given genome size (Fig. 2).
To test this conclusion, we simulated 12 sequences based on their phylogenetic relationships starting with ancestral sequences ranging from 5 kb to 1280 kbp (100 phylogenies simulated for each starting length) given the “true” phylogenetic relationships among simulated species are given by Fig. 1b. As expected, for short genomes for which kmer homoplasy was negligible, shorter kmers led to fewer topological mistakes in phylogeny reconstruction (Fig. 3c). However, as genome length increased, phylogenies reconstructed using k = 9 possessed an increasing number of mistakes. For longer kmers (11 ≤ k ≤ 17), AAF invariably gave the correct topology when genome size was ≥160 kbp. Furthermore, the adequate performance with k = 11 despite the expected underestimate of D (Fig. 3a) suggests that reconstructing tree topology is robust to moderate amounts of bias.
Phylogeny reconstruction from assembled genomes
Using 11 assembled primate genomes with rabbit as an outgroup, AAF with k = 21 generated a phylogeny with the same topology as those described in recent publications [30, 31]. With k < 21, a few topology errors were observed, especially for deep nodes; these errors were anticipated as a result of kmer homoplasy (Fig. 2c). For k > 21, tree topology (Additional file 1: Figure S1) and branch lengths (Additional file 2: Table S1) were remarkably stable. This matches our prediction from Fig. 2c that values of k ≥ 21 should be sufficient to minimize kmer homoplasy. Therefore, we take 21 as the optimal k that balances homoplasy and sensitivity for this dataset, and we take the tree constructed from assembled genomes with optimal k as our optimal AAF tree (Fig. 1b). Quantitative comparison with the phylogeny from Perelman et al. (2011) showed high similarity to our optimal phylogeny with respect to branch lengths; there was a high correlation between their patristic distances (r = 0.9717) and a low Branch Score Distance (BSD = 0.0518;[32]). AAF is also very efficient, requiring only a few days on a standard workstation (Table 1).
Assemblyfree: incomplete coverage, sequencing error, and filtering
While the lack of alignment introduces possible errors in the inference of the actual evolutionary relationships among species, the lack of assembly primarily introduces sampling error caused by low genome coverage and sequencing errors [26, 29]. The actual number of kmers will be underrepresented given low sequencing coverage, whereas sequencing error will cause both the loss of true kmers and the gain of false kmers.
One simple solution to remove false kmers caused by sequencing error is to filter out all lowfrequency kmers; here, we filtered by removing kmers that occur as singletons (those which occur as single copies) in a genome. If sequencing errors are random, with no molecular or experimental bias, then the probability of observing the identical sequencing error at the same position is low. Therefore, given observed error rates for most nextgeneration sequencing platforms, kmers observed more than one time in the SRS data for a genome are unlikely to be errors. However, as sequencing coverage decreases, a larger fraction of real kmers will be singletons in the dataset, and therefore filtering will remove real kmers. As a consequence, although filtering will be beneficial at high coverage, at low coverage filtering will be detrimental. We investigated incomplete coverage, sequencing error, and kmer filtering using mathematical calculations, simulations, and application to simulated SRS data to determine the severity of potential problems caused by sampling errors and to derive recommendations accordingly.
Total and shared kmers with missing and false kmers
When there is incomplete coverage and sequencing errors, the true total and shared numbers of kmers, denoted n _{ t } * and n _{ s } *, differ from the observed total and shared kmers, n _{ t } and n _{ s }. We derived mathematical formulas (Methods: Combined effects of coverage and sequencing error) to predict the ratios p _{ t } = n _{ t }/n _{ t }* and p _{ s } = n _{ s }/n _{ s }* given information about coverage, sequencing errors, and kmer filtering (Fig. 4). False kmers caused by sequencing errors inflate p _{ t } starting from 2X coverage (Fig. 4a, solid lines), and at least 5X coverage is required to capture most of the shared kmers (dashed lines). When filtering out singleton kmers, p _{ t } at higher coverage correctly equals one; however, at lower coverage p _{ s } is reduced by 2030 % (Fig. 4b) due to the loss of true singletons. Thus, the advantage of filtering is that it reduces the number of false total kmers, at the cost of losing true shared kmers.
To filter or not to filter?
Given the tradeoff between filtering and not filtering, we investigated how much coverage is required before filtering should be used. Whether or not to filter kmers can again be decided by computing the bias and precision of the estimate D (Eqs. 8 and 9). Without filtering, the number of false kmers increases, and the true distance between species is overestimated (Fig. 5a, solid lines). Filtering out singletons can correct for this sequencing error effect with sufficient coverage (58X according to genome size, dashed lines).
Because reconstructing the correct topology appears robust to bias in the estimate D (Fig. 3a vs 3c), precision is likely more important, and filtering should be done when it can lower the variance in the estimate D. This crossover point generally occurs between 5X and 8X, depending on k (Fig. 5b, dashed lines for filtering and solid lines for no filtering).
To confirm these conclusions, we simulated SRS data with sequencing error and different coverage using ancestral genomes of length 80 kbp (Fig. 5c). As predicted by the estimates of the precision (Fig. 5b), filtering led to fewer mistakes in the topology of the phylogeny when coverage exceeded a threshold between 5X and 8X. The poor performance with k = 9 (Fig. 5c, red line) was due to kmer homoplasy as found previously (Fig. 2b). With the genome size of 80 kbp, k = 11 and 13 (Fig. 5c, orange and green lines) performed slightly better than other values of k in the simulations of assembled genomes (Fig. 3c) and here had the additional advantage of being less sensitive to sequencing error (Fig. 5a, b).
Tip correction
Incomplete coverage and sequencing error both inflate estimated distances between species and lead to longer tips on the tree. We derived a mathematical correction for this effect (Eq. 10) that does not depend on the true distance between species, but does depend on the read length and sequencing error and coverage. When the read length, sequencing error, and coverage are similar for all taxa, then this correction is similar for all species, so that the correction only affects the tips rather than internal nodes of the tree. This correction is particularly helpful for lower coverage datasets when filtering is not an option. Note that the decision whether to filter depends on precision (Fig. 5); therefore, although the tip correction adjusts for bias, it does not affect the decision whether to filter. The case of large differences in read length, sequencing error, and coverage among taxa is discussion in the Methods: Tip correction.
Bootstrapping
To determine the uncertainty in tree topology, we devised a twostage nonparametric bootstrap that accounts for both sampling variation (caused by incomplete coverage and sequencing error) and evolutionary variation (caused by the true history of sequence divergence and the ability of AAF to reconstruct it). The first stage of the bootstrap follows standard procedures: resample the original reads with replacement, construct a kmer presence/absence table, compute distances D, and construct the phylogeny. The variance in topology among 100 replicates is then used to assess the precision of reconstruction associated with sampling variation. To additionally incorporate evolutionary variation, the second stage involves, for each of the 100 kmer presence/absence tables, resampling with replacement 1/k of the total kmers. Only 1 in k of the kmers is selected to account for the nonindependence among kmers caused by their overlap. Simulations showed that choosing 1/k of the kmers gives the correct variance in D in the bootstrap (Methods: Bootstrapping).
This nonparametric bootstrap is not practical for large genomes due to the computational requirements for repeating the phylogenetic reconstruction a hundred or more times. Therefore, we also developed a twostage parametric bootstrap that uses mathematical equations to estimate the variances in distances between species caused by sampling and evolutionary variation (Methods: Bootstrapping). This approach gives similar results to the nonparametric bootstrap in both simulations with high coverage and filtering (Fig. 6a), and low coverage without filtering (Fig. 6b), but its computational requirements are independent of genome size.
Application on simulated and raw sequencing data
As an initial test of AAF, we simulated reads from 11 fully assembled primate genomes with rabbits as an outgroup (Methods: Read simulations). AAF successfully reconstructed phylogenies with the same topology and only slightly different branch lengths from simulated SRS data (Fig. 1ce, Table 2). The closest match to the phylogeny produced from the assembled primate genomes with k = 21 (Fig. 1b) was the filtered 5X coverage (Fig. 1e, BSD = 0.016), followed by the nonfiltered 2X coverage (Fig. 1c, BSD = 0.063) and then nonfiltered 5X coverage (Fig. 1d, BSD = 0.10). In fact, even for 0.5X coverage and no filtering, AAF recovered the correct topology (Additional file 1: Figure S1e), although we do not recommend application of AAF to datasets with such low coverage. Finally, with the tip correction, the trees are more similar to those constructed from the fully assembled genomes (Table 2).
Although we have included sequencing error and random sequences in our read simulations, there are nonrandom biases brought by different sequencing technologies. In order to test the performance of AAF on real sequencing data, we downloaded nextgeneration sequencing data for 7 primates that are available in NCBI Short Reads Archive. We did not find data for rabbit in SRA, and therefore we used cat as the outgroup. We added five species to expand the study to Euarchontoglires, the superorder to which primates belong. This dataset includes sequences from different sequencing platforms with different library construction strategies and read lengths (Additional file 3: Table S2). AAF successfully reconstructed a phylogeny from the 12 mammals dataset (Additional file 4: Figure S2) that agrees with recent studies [33, 34]. This illustrates how AAF can be used to construct phylogenies from publicly available, heterogeneous genome data.
AAF is designed mainly for constructing phylogenies for nonmodel organisms. Therefore, we showcase AAF on raw SRS datasets that might be typically encountered by evolutionary biologists and ecologists, a dataset of 21 tropical trees from four orders (Additional file 5: Table S3). Analyzing this 21 tropical trees dataset was the initial stimulus for developing AAF, and the dataset contains the many complications that will arise in studies of nonmodel organisms: low and uneven coverage, different read lengths, and no available reference genome close enough to aid traditional assembly and alignment. The AAF tree (Fig. 7) matched the established phylogeny at the interfamily level according to the current Angiosperm Phylogeny Group classification, APGIII [35], and the intergenera level according to recent studies [36, 37]. However, at the intragenus level, there is no published phylogeny for one of the major genera, Lithocarpus; the existing phylogenies for the other major genus, Ficus, have no consensus and poor bootstrap support [38, 39]. Therefore, we cannot compare our results diagnostically to established results in the literature. A tutorial giving a full demonstration from parameter selection to phylogeny reconstruction, tip correction and bootstrapping of this dataset is provided in the Additional file 6: Tutorial of the analysis of the 21 tropical trees dataset.
Comparison with other alignment and assemblyfree methods
Although there are other alignment and assemblyfree methods for constructing phylogenies, these methods are designed for prokaryote genomes [27, 28, 40]. A major advantage of AAF is its ability to process large genomes within days. The only methods that demonstrated attempts on mammaliansize genomes are provided by Song et al. [26] and Yi and Jin [29].
Song et al. [26] extended a variant of the classic alignmentfree statistic, D_{2}, to perform assemblyfree phylogenetic reconstructions. The authors demonstrated its application on 1X simulated genome sequences of five mammals and raw sequences of 13 tropical trees (included within the 21 species we present in our Tutorial), but the results were not convincing. For mammals, none of the resulting clustering was consistent with the known phylogenetic relationships of the five species. For the tropical trees, species from the same genus (Lithocarpus) were not grouped together, and the relationship between genera within Fagaceae was not consistent with previous studies [41]. Song et al. [26] did not provide a package which we could use for direct comparisons with AAF, although our successful application to 12 mammals and 21 tropical tree species contrasts with their results.
Yi and Jin [29] developed cophylog that uses an microalignment method for identifying SNPs (http://humpopgenfudan.cn/resources/softwares/COphylog.tar.gz.). This method is designed primarily for closely related species, and the applications Yi and Jin present consist mainly of bacteria. The authors applied cophylog to 5 mammal species in one of their supplemental figures. However, when we tried to repeat this analysis, the intermediate file (.co file) for a single run of one species (Bushbaby, SRR953063) was 129G. Ideally the intermediate file could be limited to around 24G for mammal species (genome size times 8byte), only if given perfectly assembled genomes without repetitive segments. Since each species has dozens to hundreds of runs of data, we were not able to replicate their results and compare their method to AAF based on real large genome data. This type of analysis will require more hard drive space than most potential users will have available, and this limits the practical use of the method for large genomes.
For a direct comparison between AAF and cophylog for smaller datasets, we simulated 100 data sets and assessed the performance of the two methods by the number of topological mistakes in the phylogeny. This simulation study is similar to that in Fig. 3c (see Methods: Simulation of sequence evolution). To select the length of the flanking regions (CK) to use in cophylog (since no guidance is provided with the method), we used our calculations for optimal k to avoid homoplasy, which should pose the same problems for cophylog as for AAF; thus, in cophylog we set the combined length of the flanking regions plus the focal nucleotide (i.e., 2CK + 1) equal to k in AAF. AAF uniformly outperformed cophylog (Additional file 7: Figure S3). Furthermore, the performance of cophylog depended on CK in a manner predicted for the selection of k in AAF: for short initial sequences, smaller values of CK performed best because they are more able to identify SNPs that are located close together on the genome. However, for longer sequences performance was degraded by homoplasy for small CK. The best value of CK for genome sizes >160 KB was CK = 6, which is comparable to k = 13 in AAF.
Advantages of the AAF method
The AAF method greatly simplifies phylogenetic reconstruction from genomic scale data. Compared to traditional phylogenetic methods, the AAF method uses the evolutionary signal from the whole genome, including coding and noncoding regions, and smallscale structural differences, like indels, not just nucleotide substitutions. In contrast to most current alignmentfree methods, the distance matrix calculated by AAF matches with the genetic distance. In contrast to the few available assemblyfree methods, AAF can analyze large genomes and huge datasets while accounting for problems posed by lack of assembly, specifically sequencing read errors and incomplete coverage. Our strategy of treating kmers as present/absent instead of using their absolute or relative frequencies will be advantageous for the direct analysis of raw SRS data, particularly when the coverage varies within and among genomes. Using presence/absence of kmers also makes AAF more robust to repetitive sequences with high abundance. We demonstrated that the method has statistically welldefined properties that allow optimization and adaptation of the method. These properties also provide predictive insight into the strengths and weaknesses of AAF given a range of evolutionary and sampling conditions.
Other main advantages of the AAF method include:

(i).
Low coverage requirements. Even with the increasing throughput of nextgeneration sequencing technology, it is still expensive to sequence many genomes at the high coverage needed for successful assembly. The AAF method is able to recover an accurate phylogeny with low coverage (Fig. 5c). This is largely due to the abundance of information brought by whole genomes. Even for species with large genomes like primates, it is possible to obtain 2X coverage for five individuals from a single Illumina HiSeq lane. Therefore, AAF can generate a robust phylogeny quickly and inexpensively for any group of species of interest.

(ii).
Low computational demands. Traditional phylogenomic studies rely on three main steps (without considering data acquisition) that are computationally demanding: assembly, orthologous gene identification and alignment, and phylogenetic reconstruction based on the aligned sequences. Traditional phylogenetic methods like maximumlikelihood and Bayesian methods rely on complex evolutionary models that require large amounts of computing time, even when using large computing clusters. The AAF method decreases the total analysis time drastically. Compared to 40–50 h required for human genome assembly using a supercomputer [42], which is just the first step of traditional analysis for one species, AAF takes less than two days to complete the phylogenetic reconstruction of a dozen primate genomes on a standard workstation with 25G of RAM and 12 threads.
Limitations
The advantages of the AAF method come with some costs:

(i).
Loss of kmer sensitivity. AAF does not use all of the evolutionary information that would be available if genomes were accurately assembled and aligned. The minimum length of kmers is set by the need to overcome kmer homoplasy; for the primate dataset in this study, the minimum length was k = 21. Therefore, if there are multiple mutations (substitutions or indels) within a 21nucleotide sequence, they will be covered by the same kmer. Nonetheless, the vast amounts of data available from entire genomes will largely overcome this problem.

(ii).
Deep nodes. The accuracy of most alignmentfree methods suffers when applied to sequences or genomes separated by large genetic distances [27, 29]. Deep phylogenetic nodes imply a higher density of mutations. This will cause the loss of sensitivity of AAF to differences between genomes, and will also exacerbate the effects of homoplasy, which in turn will decrease the estimate D between distantly related species (Additional file 8: Figure S4). In our simulations and application to the primate genomes, AAF performed well despite a divergence time for primates estimated at 87 Mya [31] and estimated genetic distance (probability of mutation) from basal node to tips of roughly 0.1. It also worked well when expanding the analysis to Euarchontoglires (Additional file 4: Figure S2b). However, AAF failed to identify some of the deep nodes when including more mammals across the Placentalia clade (Additional file 4: Figure S2c) with a divergence time of >100 Mya [33]. In our dataset of 21 tropical trees, the AAF relationships between families were consistent with the APG III System. The divergence time for this group is about 94 Mya, and the average genetic distance from basal node to tips is roughly 0.1 as well. These empirical examples suggest that AAF can successfully reconstruct phylogenies with divergence times of <100 Mya and genetic distances from base to tips of <0.1, and we do not recommend AAF for deeper nodes.

(iii).
Location of mutations. AAF does not directly give information about where genetic differences between genomes occur. Specifically, the calculation of the phylogeny from the distance matrix does not estimate ancestral states that might be used for mapping specific mutations onto the phylogenetic tree. Thus, AAF does not identify genes or regions of the genome that are conserved or discordant. Nonetheless, once a phylogeny is constructed, analyzing distribution patterns of genes and genome regions across the phylogeny can be performed on kmers directly [25, 43]. Integration with AAF, however, will require additional development.
Guidelines for parameter selection
AAF requires two choices from the user: (i) what kmer length to use, and (ii) whether to filter out singleton kmers. The choice of k depends on the possibility of kmer homoplasy; k must be long enough to guard against it. This choice can be made for real genomes by plotting p _{ h } and selecting k where the estimated value of p _{ h } matches the theoretical value assuming no kmer homoplasy, using empirically calculated Q _{ k } (Fig. 2d, Additional file 9: Figure S5). Larger values of k beyond this will generally decrease the performance of AAF due to both the loss of sensitivity (from covering multiple mutations under the same kmer) and the increased likelihood of kmer loss through sequencing error if kmers are not filtered (Fig. 4). We have included the R code for plotting Fig. 2d to provide a starting point for the selection of k (see more details in the tutorial). To confirm this selection, users can repeat the analyses while increasing k by 2 until successive values of k give the same phylogeny.
Filtering is a good guard against sequencing error and inflated tip lengths in the phylogeny (Figs. 1, 2, 3, 4 and 5). However, it requires coverage of at least 58X (depending on kmer length) to ensure that not too many true kmers are filtered out (Fig. 4). A pleasant side effect of filtering at the kmer counting stage is that this decreases the size of kmer table drastically and thereby decreases the computational load substantially (Table 1).
Future directions
AAF could be used to develop automated phylogeny generators such as phylota [44] while using wholegenome data, or REALPHY [40] but not only for microbes. It can make use of both assemblies and sequencing reads that are available, as we demonstrated for primates (assemblies downloaded from ensemble and raw reads downloaded from the NCBI Short Reads Archive). Automation is made feasible by AAF’s robustness against incomplete and unequal coverage, and its ability to accept different read lengths from any sequencing platform. Although AAF is designed for phylogenetic analysis, it has wider application for understanding the pattern of relationships among different taxonomic levels of samples such as population structure within species. While AAF is mainly designed for large eukaryotic genomes, it is also able to analyze prokaryote datasets. Possible application of the alignment and assembly free approach requires further exploration for other types of sequencing data, such as RADseq and metagenomic data especially for organisms without a reference. Although AAF is designed for subjects without a good reference genome, studies on species with a reference will also benefit from AAFs computation efficiency and userfriendly pipeline. Because the AAF approach is based upon a matrix of genetic distances among the genomes, it is easy to add new data without recalculating all of the other distances among genomes.
Our overall AAF approach could also be expanded naturally to investigate different phylogenetic patterns within genomes. Our application of AAF to phylogenies generates “average” genetic distances between whole genomes and a corresponding phylogeny based on the overall differences among genomes. It is possible, however, to use AAF to identify suites of kmers that are consistent with a different phylogeny from the majority. This compartmentalization of the evolutionary history of the genome could identify portions of the genome with discordant histories in comparison to the majority of the genome [25]. The approach could also be used, in conjunction with phenotypic information about species, to associate suites of kmers with phenotypes of interest. An advantage of the overall AAF approach is that it can identify genomic elements (kmers and contigs derived from them) that show interesting discordance or associative patterns among species without initially investing in wholegenome assembly and alignment. Thus, the use of AAF to construct phylogenies is only the initial step in a broader AAF program.
Conclusions
The AAF method proved to be an accurate and efficient way of estimating the phylogenetic relationships using raw sequence data from whole genomes. We developed the theoretical basis for optimizing kmer length selection, filtering, correcting tip branch lengths, and bootstrapping, directly addressing the problems of homoplasy, sequencing error, and incomplete coverage. Thus, AAF provides a robust tool for phylogeny reconstruction especially when only lowcoverage and heterogeneous genome data are available – data that would challenge traditional assembly and alignmentbased methods.
Methods
Generating the kmer table
We use programs from the phylokmer package [25] for counting and merging kmers into a combined kmer presence/absence table. The counting step identifies all possible kmers for each genome. Adjacent kmers overlap for (k – 1) consecutive nucleotides, so if there is complete coverage, each site is covered by k kmers. When there is filtering, a kmer is only recorded as present if it occurs as two or more copies in the same species. The merging step is to produce an M x N table of the counts of each of M kmers among a group of N species. Because the number of kmers counted for a given species will depend on the sequencing coverage and possible nonuniform coverage across the genome, the kmer frequency table is converted to a table of presence/absence of kmers among taxa.
Estimating d
In our model, the evolutionary distance (d) is estimated from the proportion of kmers that are shared between taxa. As in most evolutionary models, multiple substitutions at the same site limit the divergence among kmers and hence the distance estimated between species [45]. Under the assumption that mutations only take the form of nucleotide substitutions, the probability that a given nucleotide undergoes m substitutions in distance D between two species is Poisson distributed; thus, this probability is D ^{m} e ^{–D}/m!. If γ_{ m } denotes the probability that after m substitutions there is no change in the nucleotide, the probability of no change in the kmer is
Here, γ_{0} = 1, γ_{1} = 0, γ_{2} = w _{ s } ^{2} + 2w _{ t } ^{2}, and γ_{3} = 2w _{ s } ^{2} w _{ t } + 4w _{ t } ^{3} where w _{ s } and w _{ t } are the probabilities of transitions and transversions. Values of γ_{ m } for m > 3 are vanishingly small. Equation 2 can be solved to obtain the estimate D, although this estimate differs little from that given by Equation 1. For example, for w _{ s } = 0.5 and w _{ t } = 0.25, the difference between estimates D is 4 % when the true distance d = 0.2 and diminishes for lower values of d independently of k and genome size. Equation 2 assumes that all mutations are substitutions, and even though indels may be rare, large indels will lead to a much greater loss of shared kmers than single substitutions. Therefore, multiple substitutions will in practice lead to even smaller differences in the estimate of D from Equation 1 in the main text. Thus, Equation 1 in the main text rather than Equation 2 above is included in the pipeline.
Tree estimation
We constructed phylogenies from the N x N matrix of distance measures D _{ ij } for species i and j using weighted least squares [1] with weights proportional to the expected variance of distances calculated for Equation 1. For fixed n _{ t }, the values of n _{ s } will have an approximately binomial distribution with the number of trials n _{ t } and the probability of success of each trial (persistence of a kmer) given by n _{ s }/n _{ t }. Because n _{ t } will be large, this can be approximated by a Gaussian distribution with mean n _{ s } and variance n _{ s }(1 – n _{ s }/n _{ t }). From this, the distribution of D _{ ij } will be approximately lognormal with variance proportional to
where \( {\widehat{D}}_{ij} \) is the estimate of D _{ ij } determined during the treefitting process. The phylogenetic tree for N species given by branch lengths \( {\widehat{D}}_{ij} \) is that tree that minimizes \( {\displaystyle \sum_{i,j}^N\omega}\left(k{\widehat{D}}_{ij}\right){\left({D}_{ij}{\widehat{D}}_{ij}\right)}^2. \) For these calculations in our pipeline, we used the program fitch in the package PHYLIP [46], although we modified the program to accommodate the weights given by Equation 3.
kmer homoplasy
Kmer homoplasy is a central challenge for analyzing raw read data, and therefore we addressed it in detail. The results of the mathematical calculations of homoplasy are illustrated in Fig. 2. Here, we present the detailed mathematical results for interested readers.
To calculate the consequences of kmer homoplasy on the estimated distance between species requires the distribution of kmer frequencies within genomes. Let Q _{ k } denote the random variable for the number of copies of a kmer in a genome above 1; thus, if q _{ k }(i) denotes the probability distribution of Q _{ k }, q _{ k }(0) is the probability that a kmer is unique, q _{ k }(1) is the probability that a kmer occurs at two different locations in the genome, etc. There are three ways in which kmer homoplasy increases the observed proportion of shared kmers. (i) If there are multiple copies of a kmer, then even if some copies undergo a mutation from species A to species B, the species will still share the kmer if at least one copy in A does not undergo a mutation in B. (ii) Even if all copies of a kmer mutate from species A to B, a mutation in another region of the genome of B could give rise to the same kmer, so that this kmer is still shared between A and B. (iii) Even if all of the copies of a kmer in A undergo mutations, they will still be shared if the mutations generate kmers that exist in B at different locations. Combining these three scenarios, the probability of two species sharing kmers is p _{ h } = p _{1} + (1 – p _{1})p _{2} + (1 – p _{1})(1 – p _{2})p _{3} where
Here, P is the probability of a given kmer being identical at two different locations in the genome, which is 2(0.5  u + u ^{2})^{k} with GC content u, and g is the genome length. For theoretical exploration of the consequences of kmer homoplasy, p _{ h } can be calculated under the assumption that genomes are random sequences undergoing evolution. In this case, Q _{ k } follows a binomial distribution with probability of success P and number of trials equal to the genome size, g. From these assumptions,
The distribution Q _{ k } gives the probability of homoplasy within a genome. Q _{ k } can be calculated empirically from the kmer frequency distribution. The two are the same for simulated (Fig. 2b, c) or assembled genome sequences (Fig. 2d). While calculating Q _{ k } from sequencing reads, the kmer frequency distribution needs to be corrected to account for coverage and sequencing errors (see below: Tutorial using 21 tropical trees).
Tree comparison
We used both correlations between patristic distance matrices and the Branch Score Distance (BSD, from the PHYLIP package [46] to compare between our AAF optimal phylogeny (Fig. 1b) and the most recent phylogeny in the literature [31]. To compare phylogenetic trees produced from simulated SRS data (Table 2), we used BSD. BSD is based on the sum of the squared differences between the branch lengths of the two trees [32]. The larger the BSD, the larger is the distance between the trees. BSD has the advantage that it depends on absolute branch lengths rather than relative branch lengths (which determine correlations of patristic distances). Our mathematical results show that inaccuracies of the AAF method, such as the lengthening of branch tips (Eq. 10), often involve changes in absolute branch lengths. Therefore, BSD provides a rigorous test for comparisons between AAF trees.
Missing kmers due to incomplete coverage
Assuming that reads are random across a genome with coverage depth c, the probability of finding a kmer with reads of length r is p _{ r } = 1 – exp(−L) where L = c(r – k + 1)/r. If kmers are filtered to remove singletons, then the probability of missing a kmer due to coverage is p _{ rf } = 1 – (1 + L)exp(−L). Note that the probability of recording a kmer is lower for the case when singletons are filtered.
Missing kmers due to sequencing errors and filtering
We assume that sequencing errors replace a given nucleotide with A, T, C, or G at random with an error rate E. Then the probability of finding at least one copy of a true kmer is
If kmers are filtered, the probability of including a true kmer is
False kmers caused by sequencing errors
Sequencing errors will generate false kmers that will increase the apparent number of kmers within a given species, n _{ t }*. The probability that false kmers are produced is approximately p _{ ta } = L(1 – (1 – E)^{k}). False kmers will also increase the observed number of shared kmers between species n _{ s }* by generating apparent homoplasy; the probability of generating a false shared kmer is approximately p _{ sa } = L(1 – (1 – E/3)^{kd}). If kmers are filtered, the probability of false kmers generated by sequencing error is vanishingly small.
Combined effects of coverage and sequencing error
The previous results make it possible to estimate the net effect of incomplete coverage and sequencing errors on the estimate of the distance between species. Let p _{ t } and p _{ s } denote the theoretically predicted ratios of observed to true total and shared kmers, n _{ t }*/n _{ t } and n _{ s }*/n _{ s }. In the absence of kmer filtering, p _{ t } = p _{ r } p _{ e } + p _{ ta } and p _{ s } = p _{ r } ^{2} p _{ e } ^{2} + p _{ sa }. In the presence of filtering, these equations also apply by replacing p _{ r } and p _{ e } by p _{ rf } and p _{ ef }, and setting p _{ ta } = p _{ sa } = 0. These equations are used to show the effects of coverage, sequencing error, and filtering on the ratios n _{ t }*/n _{ t } and n _{ s }*/n _{ s } (Fig. 4). They can also be used to show the bias in estimates of distances caused by incomplete coverage and sequencing error (Fig. 5a). The distance computed from the observed total and shared kmers is D* = −(1/k)log(n _{ s }*/n _{ t }*), and the difference between D* and D (that would be calculated if the true n _{ t } and n _{ s } were known) is
To compute the loss of precision due to incomplete coverage and sequencing error, assume that the true numbers of shared and total kmers, n _{ s } and n _{ t }, are known; these values are random variables due to the evolutionary process, but we are interested only in the variation caused by incomplete coverage and sequencing error in estimating the distance between two species that represent a single realization of the evolutionary process. The variance in the estimate of D calculated from the observed ratio n _{ s }*/n _{ t }* is proportional to
where q _{ s } = p _{ r } ^{2} p _{ e } ^{2} and q _{ t } = p _{ r } p _{ e } when singletons are not filtered, and q _{ s } = p _{ rf } ^{2} p _{ ef } ^{2}, q _{ t } = p _{ rf } p _{ ef }, and p _{ ta } = p _{ sa } = 0 when they are filtered. From this equation, precision always increases with coverage (Fig. 5b).
Tip correction
Incomplete coverage and sequencing error will increase the estimated distance between species, but this bias can be corrected using Equation 8. Specifically, the tips of the tree can be reduced by
if kmers are not filtered; if kmers are filtered, p _{ r } and p _{ e } are replaced by p _{ rf } and p _{ ef }, and p _{ ta } = 0. This approximation for D _{ tip } depends only on c, r, E and k, and therefore if all species have similar values of c, r and E, the correction will be very similar for all species. Note that shortening tip lengths should be done only after the tree topology has been constructed, because the weights used in the leastsquares tree estimation (Eq. 3) apply to the observed values n _{ t }* and n _{ s }*, not the corrected values.
When c, r, E and k differ greatly among species, the values given by Equation 10 will differ among tips. A solution to this situation is to calculate the average D _{ tip } for all pairwise distances using the parameter values for taxa with the lower n _{ t }, and then trim each tip with the average of these values of D _{ tip }. This solution is preferable to correcting distances between taxa before constructing the phylogeny, because the construction of the phylogeny should incorporate variances due to coverage and sequencing error that would be removed by tip corrections.
Bootstrapping
We developed a nonparametric bootstrap similar to standard bootstraps used for phylogenetic reconstructions, and also a parametric bootstrap that can be scaled to very large genomes. Both bootstraps separate the effects of sampling variation (incomplete coverage, sequencing error) and evolutionary variation (mutations).
The nonparametric bootstrap first randomly resamples reads (with replacement) to assess the uncertainty in the phylogenetic tree caused by sampling variability. For each resampled data set, a kmer table is generated. To isolate the effect of sampling variability alone, bootstrap phylogenies are constructed from these kmer tables. To include evolutionary variability, the second stage takes each kmer table generated from resampling reads and resamples the table (with replacement) by taking rows with probability 1/k; this follows the “block bootstrap” proposed by [47, 48]. This resampling shortens the kmer table by 1/k to account for the overlap between kmers; each nucleotide can be covered by k kmers. In simulations of sequence evolution, this resampling procedure gave good approximations to the evolutionary variance in D estimated between two species, validating the block bootstrap for this application.
The parametric bootstrap uses our equations for the variation in the estimates of distances to simulate genetic distances including sampling and evolutionary variation. For stage one, the approach involves adding a random variable e_{ ij } to the distance between each pair of species i and j, thereby “contaminating” the distance matrix with the uncertainty expected from incomplete coverage and sequencing error; e_{ ij } is selected from a normal random number generator with mean zero and variance
Simulations showed that this formula sometimes underestimates and sometimes overestimates the true standard deviation in the distance between taxa by as much as 50 %. To provide a conservative bootstrap (i.e., one that is not going to improperly inflate the support for nodes), we multiplied the estimated standard deviation of e_{ ij } by a correction factor of 2. The resulting distance matrix is then used to construct a bootstrap phylogenetic tree. Repeating this procedure 100 (or more) times gives an estimate of the proportion of branch nodes that are consistent despite uncertainty in genetic distances between species caused by incomplete coverage and sequencing error. For stage two, the “contaminated” distance matrix from stage one is contaminated a second time to account for variation in mutations by adding the random variable e_{ ij } with mean zero and variance given by
This equation is modified from Equation 3 to incorporate covariances caused by overlapping kmers. The second stage of parametric bootstrap also incorporates covariances among distances between species that share a common branch in the tree. These covariances are estimated from the phylogenetic tree computed from the original data so that the covariance between distances d(A,B) and d(C,D) equals the proportion of evolutionary history shared by the four species (i.e., the distance between the node separating A and B, and the node separating C and D).
Simulation of sequence evolution
We used Rose (Randommodel Of Sequence Evolution [49]) to simulate sequences under a HKY model of evolution [50] with a transition bias of 2. We assumed that the insertion and deletion rates were the same, and that their sum was 10 % of the substitution rate. Both insertions and deletions had length uniformly distributed between 1 and 5. There are two types of simulations in our study: simulations of differences in sequences between two species given a true distance d between them, and simulations of many sequences from a phylogeny. The first is used to assess the estimate of distance, D (Fig. 3a, b), and the second is used to assess the ability of AAF (Fig. 3c) and cophylog (Additional file 7: Figure S3) to recover the phylogeny used as starting trees. We used the phylogeny given in Fig. 1b as the starting tree and randomly selected sequences from a segment of the rabbit genome [30] as the ancestral genome sequences.
Read simulations
The primate genome assemblies were downloaded from the Ensembl database [50], and we simulated pairend Illumina data using Dwgsim (Whole Genome Simulation, http://sourceforge.net/apps/mediawiki/dnaa/) assuming a read length of 70 bp, a sequencing error rate of 1 %, and coverages of 2X and 5X, with and without filtering.
Availability of supporting data
The SRS data sets of the 21 tropical trees are available in the NCBI Short Reads Archive. See accession numbers in Additional file 5: Table S3.
Abbreviations
 AAF:

Assembly and AlignmentFree
 BSD:

Branch Score Distance
 SRS:

Short Read Sequences
References
 1.
Felsenstein J: Inferring Phylogenies. Sinauer Associates Inc., Sunderland, MA 2004.
 2.
Straub SCK, Parks M, Weitemier K, Fishbein M, Cronn RC, Liston A. Navigating the tip of the genomic iceberg: Nextgeneration sequencing for plant systematics. Am J Bot. 2012;99:349–64.
 3.
Garland Jr T, Harvey PH, IVES AR. Procedures for the Analysis of Comparative Data Using Phylogenetically Independent Contrasts. Systematic Biol. 1992;41:18–32.
 4.
Yang Z, Rannala B: Molecular phylogenetics: principles and practice. Nature Reviews Genetics. 2012;13:303–14.
 5.
Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005;6:361–75.
 6.
BOORE J. The use of genomelevel characters for phylogenetic reconstruction. Trends Ecol Evol. 2006;21:439–46.
 7.
Ogden TH, Rosenberg MS. Multiple Sequence Alignment Accuracy and Phylogenetic Inference. Systematic Biol. 2006;55:314–28.
 8.
Chan CX, Ragan MA: Nextgeneration phylogenomics. Biology Direct 2013;8:3.
 9.
Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci. 1986;83:5155–9.
 10.
Dai Q, Yang Y, Wang T. Markov model plus kword distributions: a synergy that produces novel statistical measures for sequence comparison. Bioinformatics. 2008;24:2296–302.
 11.
Yang K, Zhang L. Performance comparison between ktuple distance and four modelbased distances in phylogenetic tree reconstruction. Nucleic Acids Res. 2008;36:e33–3.
 12.
Lippert RA, Huang H, Waterman MS. Distributional regimes for the number of kword matches between two random sequences. Proc Natl Acad Sci. 2002;99:13980–9.
 13.
Reinert G, Chew D, Sun F, Waterman MS. AlignmentFree Sequence Comparison (I): Statistics and Power. J Comput Biol. 2009;16:1615–34.
 14.
DomazetLoso M, Haubold B. Alignmentfree detection of local similarity among viral and bacterial genomes. Bioinformatics. 2011;27:1466–72.
 15.
Qi J, Luo H, Hao BL. CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004;32:W45–7.
 16.
Stuart GW, Moffett K, Leader JJ. A Comprehensive Vertebrate Phylogeny Using Vector Representations of Protein Sequences from Whole Genomes. Mol Biol Evol. 2002;19:554–62.
 17.
Haubold B. Alignmentfree phylogenetics and population genetics. Brief Bioinform. 2014;15:407–18.
 18.
Ulitsky I, Burstein D, Tuller T, Chor B. The Average Common Substring Approach to Phylogenomic Reconstruction. J Comput Biol. 2006;13:336–50.
 19.
Sims GE, Jun SR, Wu GA, Kim SH. Alignmentfree genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci. 2009;106:2677–82.
 20.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memoryefficient shortread de novo assembler. Giga Science. 2012;1:18.
 21.
Li R, Fan W, Tian G, Zhu H, He L, Cai J, et al. The sequence and de novo assembly of the giant panda genome. Nature. 2010;463:311–7.
 22.
Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, et al. Genome sequence and analysis of the tuber crop potato. Nature. 2011;475:189–95.
 23.
Baker M. De novo genome assembly: what every biologist should know. Nat Methods. 2012;9:333–7.
 24.
Cannon CH, CS KUA, ZHANG D, HARTING JR. Assembly free comparative genomics of shortread sequence data discovers the needles in the haystack. Mol Ecol. 2010;19:147–61.
 25.
CS KUA, Ruan J, Harting J, Ye CX, Helmus MR, Yu J, et al. ReferenceFree Comparative Genomics of 174 Chloroplasts. PLoS One. 2012;7:e48995.
 26.
Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. AlignmentFree Sequence Comparison Based on NextGeneration Sequencing Reads. J Comput Biol. 2013;20:64–79.
 27.
Roychowdhury T, Vishnoi A, Bhattacharya A. NextGeneration Anchor Based Phylogeny (NexABP): Constructing phylogeny from Nextgeneration sequencing data. Sci Rep. 2013;3.
 28.
MaurerStroh S, Gunalan V, Wong Wc, Eisenhaber F. A Simple Shortcut To Unsupervised Alignmentfree Phylogenetic Genome Groupings Even From Unassembled Sequencing Reads. J Bioinform Comput Biol. 2013;11:1343005.
 29.
Yi H, Jin L. Cophylog: an assemblyfree phylogenomic approach for closely related organisms. Nucleic Acids Res. 2013;41:e75–5.
 30.
Prasad AB, Allard MW, NISC Comparative Sequencing Program, Green ED. Confirming the Phylogeny of Mammals by Use of Large Comparative Sequence Data Sets. Mol Biol Evol. 2008;25:1795–808.
 31.
Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MAM, et al. A Molecular Phylogeny of Living Primates. PLoS Genet. 2011;7:e1001342.
 32.
Kuhner MK, Felsenstein J. Simulation Comparison Of Phylogeny Algorithms Under Equal And Unequal Evolutionary Rates. Mol Biol Evol. 1994;11:459–68.
 33.
Meredith RW, Janecka JE, Gatesy J, Ryder OA, Fisher CA, Teeling EC, et al. Impacts of the Cretaceous Terrestrial Revolution and KPg Extinction on Mammal Diversification. Science. 2011;334:521–4.
 34.
Broad Institute Sequencing Platform and Whole Genome Assembly Team, Baylor College of Medicine Human Genome Sequencing Center Sequencing Team, Genome Institute at Washington University, Broad Institute Sequencing Platform and Whole Genome Assembly Team, Baylor College of Medicine Human Genome Sequencing Center Sequencing Team, Genome Institute at Washington University. A highresolution map of human evolutionary constraint using 29 mammals. Nature. 2011;478:476–81.
 35.
Group AP. An update of the Angiosperm Phylogeny Group classification for the orders and families of flowering plants: APG III. Botanical J Linnean Soc. 2009;161:105–21.
 36.
Oh SH, Manos PS. Molecular phylogenetics and cupule evolution in Fagaceae as inferred from nuclear. Taxon. 2008;57:434–51.
 37.
Manos PS, Cannon CH, Oh SH. Phylogenetic Relationships and Taxonomic Status Of the Paleoendemic Fagaceae Of Western North America: Recognition Of A New Genus, Notholithocarpus. Madroño. 2008;55:181–90.
 38.
XU L, HARRISON RD, YANG P, YANG DR. New insight into the phylogenetic and biogeographic history of genus Ficus: Vicariance played a relatively minor role compared with ecological opportunity and dispersal. Journal of Systematics and Evolution. 2011;49:546–57.
 39.
Cruaud A, Ronsted N, Chantarasuwan B, Chou LS, Clement WL, Couloux A, et al. An Extreme Case of PlantInsect Codiversification: Figs and FigPollinating Wasps. Systematic Biol. 2012;61:1029–47.
 40.
Bertels F, Silander OK, Pachkov M, Rainey PB, van Nimwegen E. Automated Reconstruction of WholeGenome Phylogenies from ShortSequence Reads. Mol Biol Evol. 2014;31:1077–88.
 41.
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–72.
 42.
Nordström KJV, Albani MC, James GV, Gutjahr C, Hartwig B, Turck F, Paszkowski U, Coupland G, from mutant and wildtype individuals using kmers. Nature Biotechnology 2013, 31:325–330.
 43.
Sanderson M, Boss D, Chen D, Cranston K, Wehe A. The PhyLoTA Browser: Processing GenBank for Molecular Phylogenetics Research. Systematic Biol. 2008;57:335–46.
 44.
Jukes TH, Cantor CR. Evolution of Protein Molecules. Volume 3. New York: Mammalian Protein Metabolism; 1969. p. 21–132.
 45.
Felsenstein J, PHYLIP. (Phylogeny Inference Package) version 3.6. 2005.
 46.
Kunsch HR. The Jackknife And The Bootstrap For General Stationary Observations. Annals of statistics. 1989;17:1217–41.
 47.
Hall P, Horowitz JL, Jing BY. On Blocking Rules For The Bootstrap With Dependent Data. Biometrika. 1995;82:561–74.
 48.
Stoye J, Evers D, Meyer F. Rose: generating sequence families. Bioinformatics. 1998;14:157–63.
 49.
Hasegawa M, Kishino H, Yano TA. Dating Of The Human Ape Splitting By A Molecular Clock Of MitochondrialDNA. J Mol Evol. 1985;22:160–74.
 50.
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2012;41:D48–55.
Acknowledgements
We thank Jue Ruan (Beijing Institute of Genomics) for help with modification on phylokmer. We thank Cecile Ane, Garret Suen, Caitlin Pepperell, and members of the Ives lab for discussion and suggestions of this work. We thank Huiguang Yi for providing help in implementing Cophylog for comparison with AAF. This research was supported by the Science Foundation of the Chinese Academy of Sciences 135 program XTBGT01. Computing was supported by HPC Center, Kunming Institute of Botany, CAS, China. HF was supported by grants from the UWMadison Graduate School, the BascomPlaenert fund, the USNSF (DEB0816613) to ARI, and Yunnan Provincial Government High Level Talent Introduction grant (09SK051B01) to CHC. YSG was supported by Young International Scientists Fellowship of Chinese Academy of Sciences (grant number 151C53WJQNKXJ20110008).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
CHC conceived the original idea, and HF and ARI developed it into its current form. ARI did the simulations and HF did the rest of the analyses. HF and ARI drafted the manuscript with large input from YSG and CHC. HF and YSG wrote the software package. All authors read and approved the final manuscript.
Additional files
Additional file 1: Figure S1.
Primate phylogeny reconstructed by AAF from assembled genomes or simulated reads. From genome assemblies with (A) k = 15 (B) k = 23 (C) k = 25 (D) k =31. (E) Using 70bp reads simulated from assembled genomes with 1 % sequencing errors and 0.5 coverage. Incorrect branches (relative to the optimal tree with k = 21) are shown in red.
Additional file 2: Table S1.
Branch Score Distance (BSD) between the optimal AAF tree and trees generated with larger k.
Additional file 3: Table S2.
General information and accession numbers in NCBI Short Reads Archive of the mammal dataset.
Additional file 4: Figure S2.
Phylogeny of mammals constructed with raw reads downloaded from NCBI Short Reads Archive. (A) 7 primates. (B) 12 mammals. See details of dataset in Additional file 3: Table S2.
Additional file 5: Table S3.
General information and accession numbers of the 21 tropical trees dataset.
Additional file 6:
Tutorial of the analysis of the 21 tropical trees dataset. Contains the whole process from obtaining data to parameter selection, phylogeny reconstruction and bootstrap.
Additional file 7: Figure S3.
Performance comparison between AAF and cophylog measured by the number of topological mistakes in the phylogeny for initial sequence lengths ranging from 5 to 640 KB. Rose 1.3 was used to simulate sequence evolution down the 12 species phylogeny given in Fig. 1b (see Methods: Simulation of sequence evolution); for each sequence length, 100 simulations were performed, and the number of topological mistakes averaged. For Cophylog, the values of CK (length of the flanking regions) were 4 (red), 5 (orange), 6 (purple), 7 (green) and 8 (cyan), and for AAF the values of k were 11 (blue dots) and 13 (black dots).
Additional file 8: Figure S4.
Comparison of pairwise distances for the tropical trees dataset calculated with different k.
Additional file 9: Figure S5.
Estimating the optimal k of the tropical trees dataset. Theoretical predictions of the proportion of shared kmers, ph, calculated from the observed frequency distribution of kmers, Q _{ k }, for the tropical trees dataset ranging in size from 400 M to 1.3Gbp assuming the true distance between taxa is d = 0.1 (divergence time 94Mya).
Rights and permissions
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 kmers
 Phylogenomics
 Homoplasy
 Alignmentfree
 Assemblyfree