An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

Fan, Huan; Ives, Anthony R.; Surget-Groba, Yann; Cannon, Charles H.

doi:10.1186/s12864-015-1647-5

Methodology article
Open access
Published: 14 July 2015

An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data

Huan Fan^1,2,3,
Anthony R. Ives³,
Yann Surget-Groba⁴ &
…
Charles H. Cannon^1,5

BMC Genomics volume 16, Article number: 522 (2015) Cite this article

11k Accesses
91 Citations
7 Altmetric
Metrics details

Abstract

Background

Next-generation sequencing technologies are rapidly generating whole-genome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for non-model genomes and multi-genome alignment are challenging.

Results

To greatly simplify the analysis, we present an Assembly and Alignment-Free (AAF) method (https://sourceforge.net/projects/aaf-phylogeny) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on non-model organisms.

Conclusion

Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms.

Background

Understanding the phylogenetic relationships among organisms is an essential aspect for many ecological, biogeographical, and evolutionary questions [1]. Currently, the simple step of generating a robust phylogeny for a group of poorly studied organisms can require substantial research investment. Most phylogenies are reconstructed from a tiny portion of the genome [2], but as next-generation sequencing technologies become faster and cheaper, the number of species for which whole genome sequence data are available has increased dramatically. Most whole-genome datasets are collected for reasons other than phylogenetic reconstruction, yet it is the first step in many comparative studies [3]. Therefore, constructing a phylogeny from genomic data would be a valuable tool even if genome datasets were not collected with this in mind. Unfortunately, most existing methods for phylogenetic reconstruction are not intended for analysis of genomic scale datasets [4]. Traditional phylogenomic techniques require genome assembly, detection of putative orthologous genes from the assembled sequences, and alignment at the DNA sequence level [5]. These analyses typically go beyond the expertise of researchers who only require a phylogeny to place comparative studies into an evolutionary framework. Approaches are needed to efficiently cope with the dramatic increase in genomic data, and to allow easy and reliable reconstruction of phylogenetic relationships among genomes [6].

Multiple-sequence alignment is a central issue in phylogenetic reconstruction, and errors in the alignment process often lead to errors in phylogenetic reconstruction [7]. When working at the genome scale, multi-sequence alignment becomes very difficult (reviewed in [8]). Alignment-free methods were initially proposed to circumvent the issues of recombination and genetic shuffling that make alignment difficult [9], and they have attracted attention due to their computational efficiency [10] and accuracy [11]. However, because they are not based on specific evolutionary models, they have been mainly used for comparing similarity between sequences [12, 13] and genomes [14].

The majority of the alignment-free methods focus on the distribution within and among study genomes of short DNA/protein fragments, known generally as k-mers where k is the length of the substring taken from the original sequences [15]. Usually distance matrices are calculated directly from the distribution of k-mers, and phylogenies are built from these distances [16]. These distance metrics, however, are derived without an evolutionary model and therefore do not represent the genetic distances (see [17] for a recent review). Furthermore, the k-mer statistics used to compare genomes are typically computed from assembled sequences [18, 19]. Unfortunately, for the majority of organisms, a reference genome from a closely related species is not available. Without a reference, de-novo genome assembly of short reads remains a major challenge, especially in wild, out-crossed species with high levels of heterozygosity [20]. Additionally, high coverage is generally required [21, 22], but even with high coverage de-novo assembly often remains error-prone [23]. The difficulty of assembly has led to methods that directly analyze unassembled read data [24, 25], and some of these methods are proposed for reconstructing phylogenies [26–29]. However, these methods are mainly designed for closely related prokaryotic genomes. Finally, although a couple of these methods [26, 29] have addressed the inherent assembly-free problems such as coverage and sequencing errors, none proposed any solution. They also do not provide methods such as bootstrapping to assess confidence in the reconstructions.

Here we present a new method that directly reconstructs a phylogeny from whole-genome short read sequence (SRS) data. By removing the need for assembly of the sequencing reads, we extend alignment-free methods to Assembly and Alignment-Free (AAF) methods. Furthermore, we develop, explain, and validate our AAF method using a combination of sequence evolution models, mathematical calculations and simulated SRS data from published genomes for 11 primates. The mathematical calculations provide the conceptual foundation for the method and predict its performance given a basic evolutionary model for sequence divergence. Simulation models of sequence evolution allow us to test the mathematical predictions. Simulations of SRS data allow us to validate the method given realistic genome complexity, different genome sizes, sequencing errors, and a range of sequence coverage. To provide a tool for researchers to assess their own AAF phylogenetic reconstructions, we also developed a two-stage bootstrap that estimates the precision of our method when applied to novel genome data. In order to demonstrate how AAF works on real sequencing data, we apply AAF to reconstruct the phylogeny for 12 mammal species from raw sequencing datasets, since the primate phylogeny is well established through both morphological and molecular data. Finally, to illustrate the ability of AAF to handle data for which it was designed – low-coverage data from poorly known species – we use AAF to reconstruct the phylogeny for 21 tropical tree species. The package is available at https://sourceforge.net/projects/aaf-phylogeny/ with detailed documentation and tutorials.

Results and discussion

The AAF approach first calculates pairwise genetic distances between each sample using the number of evolutionary changes between their genomes, which are represented by the number of k-mers that differ between genomes. The phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix. Using simulated SRS read data (with sequencing error and incomplete coverage) from published and fully assembled genomes, our AAF method obtained the same phylogeny for 11 primate species (plus one outgroup, Fig. 1) as those previously published using traditional methods [30, 31], even though AAF did not use any information about assembly or alignment. Furthermore, the AAF method was very efficient, requiring only a few days on a standard work station (Table 1). Below, we provide complete theoretical and computational support for our method.

Table 1 AAF performance metrics

Full size table

Evolutionary model

Our measure of the phylogenetic distance (denoted d) between two species, A and B, is based on the estimate of the rate parameter from a Poisson process for a mutation occurring at a single nucleotide under the evolutionary model that the mutation rate is the same for all nucleotides across the genomes. We include not only mutations caused by nucleotide substitutions, but also insertions and deletions (indels). If k-mers are random nucleotide sequences of length k, and if mutations occur randomly and independently among nucleotides, the probability that no mutation will occur within a given k-mer between species A and B is exp(−kd). Mutations will decrease the number of shared k-mers, n _s, between species relative to the total number of k-mers, n _t. In the hypothetical case in which only substitutions occurred and all k-mers were unique, then all the species will have the same total number of k-mers, n _t, and the maximum likelihood estimate of exp(−kd) is n _s/n _t. In real situations, indels introduce the complication of changing n _t which can be addressed as follows. A single insertion of length l will cause the loss of at most (k – 1) k-mers and a gain of at most (l + k – 1) k-mers, while a single deletion of length l will cause the loss of at most (l + k – 1) k-mers and a gain of at most (k – 1) k-mers. For computing distances, this means that deletions will cause a greater reduction in the number of shared k-mers than either a substitution or an insertion of the same length. To account for this asymmetry, when estimating the distance between two taxa, the smaller of the two values of n _t calculated separately for each of the taxa is used. This leads to the estimate of d (denoted D) of

$$ D=\frac{-1}{k} \log \frac{n_s}{n_t} $$

(1)

This formula can be modified to correct for back substitutions (reversal to a nucleotide’s original state after two or more mutations), although this effect is small (see Methods: Estimating d).

Although application of Equation 1 for AAF is in principle straightforward, several important issues must be addressed. These divide naturally into alignment-free and assembly-free issues. Below, we first address alignment-free issues that involve extracting as much information as possible about the true evolutionary variation between species. We then address assembly-free issues that involve reducing sampling variation caused by incomplete coverage and sequencing errors.

Alignment-free: k-mer sensitivity and homoplasy

Lack of alignment makes it more difficult to extract all of the possible information about evolutionary distances between species. If genomes from two species were perfectly aligned, it would be possible to identify all substitutions and indels. In AAF, however, only differences in the presence/absence of k-mers are used. If a k-mer covers, for example, multiple substitutions, it will count equally as one carrying only a single substitution. Consequently, shorter k-mers are more likely to have greater sensitivity to single evolutionary events. On the other hand, identical k-mers could be derived from physically, functionally, or evolutionarily different regions of the genome and are therefore not homologous (k-mer homoplasy). Longer k-mers are less likely to suffer from k-mer homoplasy. Therefore, a trade-off exists for k-mer length between the problem of sensitivity (which requires smaller k) and k-mer homoplasy (which requires larger k).

To isolate alignment-free issues, in this section we compute k-mers under the assumption that we have completely assembled genomes without error. Therefore, only the evolutionary differences between genomes will be captured, without any differences caused by sampling error or sequencing effort.

k-mer homoplasy

k-mer homoplasy is generally considered as “noise” in phylogenetic reconstruction [16] and is a particular problem when k-mer length is short and genome size is large. For example, if k = 15, the total number of possible k-mers accounting for complementarity is 4¹⁵/2 (~5x10⁸), so if genomes are close to this length, identical k-mers will appear in different species simply due to the limited number of total possible k-mers.

k-mer homoplasy may incorrectly inflate the proportion of shared k-mers because (i) multiple copies of the same k-mer at different locations in species A must all experience mutations before this k-mer is no longer shared with species B, and (ii) a k-mer that does undergo a mutation may turn into a k-mer that already exists elsewhere in the genomes of species A or B (see Methods: k-mer homoplasy). k-mer homoplasy depends on the frequency distribution of different k-mers across the genome of species A and B, which varies with k; for the species with the shortest genome, we denote this distribution Q _k. To account for k-mer homoplasy, we derived a mathematical formula for the proportion of shared k-mers between species, p _h, which is a prediction of the ratio n _s/n _t based on Q _k (Methods: k-mer homoplasy). In this formula, Q _k can be calculated empirically from real genomes (Eq. 4) or estimated theoretically under the assumption that the ancestral genome for species A and B is a random sequence (Eq. 5).

We first investigated the general issue of k-mer homoplasy under the assumption that the ancestral genome is a random sequence (Fig. 2a). For large genomes and small k, k-mer homoplasy led to p _h = 1 because all possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition. Fortunately, because the possible number of k-mers increases exponentially, small increases in k quickly overcome this limitation; for example, when k = 18 over 30 billion k-mers are possible, which is considerably larger than the majority of genome sizes, and when k = 21, over 2 trillion possible k-mers exist. Therefore, above a threshold k (which differs by genome size and sequence complexity), the effects of k-mer homoplasy are greatly reduced. As would be expected, above this k the theoretical and observed values of p _h are the same (Fig. 2a dashed black line). Furthermore, the k at which k-mer homoplasy vanishes depends only weakly on the genetic distance d between species (Fig. 2a, d = 0.02 vs. d = 0.1). Therefore, a sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

The mathematical formula for p _h accurately predicted the results from simulated sequence evolution starting from either a random ancestral sequence of 100 kbp (Fig. 2b) or real (non-random) sequences (Fig. 2c). Comparing sequence evolution simulated from random and real ancestral sequences, k must be larger to reduce k-mer homoplasy with the real ancestral sequence (Fig. 2c); this is because the lower complexity in the real ancestral sequence increases the probability that a k-mer appears at multiple locations in the genome by chance. For random sequences of length 10⁹ bp, k-mer homoplasy is negligible for k ≥ 19 (Fig. 2a), whereas for Q _k obtained from the primate genomes, k-mer homoplasy only becomes negligible for k ≥ 21 (Fig. 2d).

Balancing sensitivity and homoplasy

While k-mer homoplasy becomes negligible when k is sufficiently large, continuing to increase k will also increase the probability that a single k-mer contains more than one evolutionary event, and this will reduce sensitivity and underestimate the proportion of shared k-mers.

The balance between k-mer homoplasy and sensitivity can be understood in terms of the statistical properties of bias and precision in the estimate D (Fig. 3). Bias refers to the under- or overestimation of distances, whereas precision refers to the variability in the estimates. We investigated both the bias and precision by simulating sequence evolution (Methods: Simulation of sequence evolution); for these simulations, we used relatively short ancestral sequences (compared to the primate genomes) for which smaller values of k will be enough to overcome k-mer homoplasy. For short k-mers (k = 9, 11), increasing genome size led to consistent underestimates of D (Fig. 3a) due to the increased numbers of shared k-mers from k-mer homoplasy (Fig. 2). At small genome sizes, however, shorter k-mers led to greater precision in the estimate D measured by the coefficient of variation, CV (Fig. 3b). This greater precision for shorter k-mers occurred because shorter k-mers have greater sensitivity to identify individual mutations. At large genome sizes, however, the precision decreased for shorter k-mers due to k-mer homoplasy. These results predict that shorter k-mers will give better phylogenetic reconstructions up to the point that k-mer homoplasy leads to strong bias and imprecision in the estimate D. Therefore, the optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size (Fig. 2).

To test this conclusion, we simulated 12 sequences based on their phylogenetic relationships starting with ancestral sequences ranging from 5 kb to 1280 kbp (100 phylogenies simulated for each starting length) given the “true” phylogenetic relationships among simulated species are given by Fig. 1b. As expected, for short genomes for which k-mer homoplasy was negligible, shorter k-mers led to fewer topological mistakes in phylogeny reconstruction (Fig. 3c). However, as genome length increased, phylogenies reconstructed using k = 9 possessed an increasing number of mistakes. For longer k-mers (11 ≤ k ≤ 17), AAF invariably gave the correct topology when genome size was ≥160 kbp. Furthermore, the adequate performance with k = 11 despite the expected underestimate of D (Fig. 3a) suggests that reconstructing tree topology is robust to moderate amounts of bias.

Phylogeny reconstruction from assembled genomes

Using 11 assembled primate genomes with rabbit as an outgroup, AAF with k = 21 generated a phylogeny with the same topology as those described in recent publications [30, 31]. With k < 21, a few topology errors were observed, especially for deep nodes; these errors were anticipated as a result of k-mer homoplasy (Fig. 2c). For k > 21, tree topology (Additional file 1: Figure S1) and branch lengths (Additional file 2: Table S1) were remarkably stable. This matches our prediction from Fig. 2c that values of k ≥ 21 should be sufficient to minimize k-mer homoplasy. Therefore, we take 21 as the optimal k that balances homoplasy and sensitivity for this dataset, and we take the tree constructed from assembled genomes with optimal k as our optimal AAF tree (Fig. 1b). Quantitative comparison with the phylogeny from Perelman et al. (2011) showed high similarity to our optimal phylogeny with respect to branch lengths; there was a high correlation between their patristic distances (r = 0.9717) and a low Branch Score Distance (BSD = 0.0518;[32]). AAF is also very efficient, requiring only a few days on a standard workstation (Table 1).

Assembly-free: incomplete coverage, sequencing error, and filtering

While the lack of alignment introduces possible errors in the inference of the actual evolutionary relationships among species, the lack of assembly primarily introduces sampling error caused by low genome coverage and sequencing errors [26, 29]. The actual number of k-mers will be under-represented given low sequencing coverage, whereas sequencing error will cause both the loss of true k-mers and the gain of false k-mers.

One simple solution to remove false k-mers caused by sequencing error is to filter out all low-frequency k-mers; here, we filtered by removing k-mers that occur as singletons (those which occur as single copies) in a genome. If sequencing errors are random, with no molecular or experimental bias, then the probability of observing the identical sequencing error at the same position is low. Therefore, given observed error rates for most next-generation sequencing platforms, k-mers observed more than one time in the SRS data for a genome are unlikely to be errors. However, as sequencing coverage decreases, a larger fraction of real k-mers will be singletons in the dataset, and therefore filtering will remove real k-mers. As a consequence, although filtering will be beneficial at high coverage, at low coverage filtering will be detrimental. We investigated incomplete coverage, sequencing error, and k-mer filtering using mathematical calculations, simulations, and application to simulated SRS data to determine the severity of potential problems caused by sampling errors and to derive recommendations accordingly.

Total and shared k-mers with missing and false k-mers

When there is incomplete coverage and sequencing errors, the true total and shared numbers of k-mers, denoted n _t * and n _s *, differ from the observed total and shared k-mers, n _t and n _s. We derived mathematical formulas (Methods: Combined effects of coverage and sequencing error) to predict the ratios p _t = n _t/n _t* and p _s = n _s/n _s* given information about coverage, sequencing errors, and k-mer filtering (Fig. 4). False k-mers caused by sequencing errors inflate p _t starting from 2X coverage (Fig. 4a, solid lines), and at least 5X coverage is required to capture most of the shared k-mers (dashed lines). When filtering out singleton k-mers, p _t at higher coverage correctly equals one; however, at lower coverage p _s is reduced by 20-30 % (Fig. 4b) due to the loss of true singletons. Thus, the advantage of filtering is that it reduces the number of false total k-mers, at the cost of losing true shared k-mers.

To filter or not to filter?

Given the trade-off between filtering and not filtering, we investigated how much coverage is required before filtering should be used. Whether or not to filter k-mers can again be decided by computing the bias and precision of the estimate D (Eqs. 8 and 9). Without filtering, the number of false k-mers increases, and the true distance between species is overestimated (Fig. 5a, solid lines). Filtering out singletons can correct for this sequencing error effect with sufficient coverage (5-8X according to genome size, dashed lines).

Because reconstructing the correct topology appears robust to bias in the estimate D (Fig. 3a vs 3c), precision is likely more important, and filtering should be done when it can lower the variance in the estimate D. This crossover point generally occurs between 5X and 8X, depending on k (Fig. 5b, dashed lines for filtering and solid lines for no filtering).

To confirm these conclusions, we simulated SRS data with sequencing error and different coverage using ancestral genomes of length 80 kbp (Fig. 5c). As predicted by the estimates of the precision (Fig. 5b), filtering led to fewer mistakes in the topology of the phylogeny when coverage exceeded a threshold between 5X and 8X. The poor performance with k = 9 (Fig. 5c, red line) was due to k-mer homoplasy as found previously (Fig. 2b). With the genome size of 80 kbp, k = 11 and 13 (Fig. 5c, orange and green lines) performed slightly better than other values of k in the simulations of assembled genomes (Fig. 3c) and here had the additional advantage of being less sensitive to sequencing error (Fig. 5a, b).

Tip correction

Incomplete coverage and sequencing error both inflate estimated distances between species and lead to longer tips on the tree. We derived a mathematical correction for this effect (Eq. 10) that does not depend on the true distance between species, but does depend on the read length and sequencing error and coverage. When the read length, sequencing error, and coverage are similar for all taxa, then this correction is similar for all species, so that the correction only affects the tips rather than internal nodes of the tree. This correction is particularly helpful for lower coverage datasets when filtering is not an option. Note that the decision whether to filter depends on precision (Fig. 5); therefore, although the tip correction adjusts for bias, it does not affect the decision whether to filter. The case of large differences in read length, sequencing error, and coverage among taxa is discussion in the Methods: Tip correction.

Bootstrapping

To determine the uncertainty in tree topology, we devised a two-stage nonparametric bootstrap that accounts for both sampling variation (caused by incomplete coverage and sequencing error) and evolutionary variation (caused by the true history of sequence divergence and the ability of AAF to reconstruct it). The first stage of the bootstrap follows standard procedures: resample the original reads with replacement, construct a k-mer presence/absence table, compute distances D, and construct the phylogeny. The variance in topology among 100 replicates is then used to assess the precision of reconstruction associated with sampling variation. To additionally incorporate evolutionary variation, the second stage involves, for each of the 100 k-mer presence/absence tables, resampling with replacement 1/k of the total k-mers. Only 1 in k of the k-mers is selected to account for the non-independence among k-mers caused by their overlap. Simulations showed that choosing 1/k of the k-mers gives the correct variance in D in the bootstrap (Methods: Bootstrapping).

This nonparametric bootstrap is not practical for large genomes due to the computational requirements for repeating the phylogenetic reconstruction a hundred or more times. Therefore, we also developed a two-stage parametric bootstrap that uses mathematical equations to estimate the variances in distances between species caused by sampling and evolutionary variation (Methods: Bootstrapping). This approach gives similar results to the nonparametric bootstrap in both simulations with high coverage and filtering (Fig. 6a), and low coverage without filtering (Fig. 6b), but its computational requirements are independent of genome size.

Application on simulated and raw sequencing data

As an initial test of AAF, we simulated reads from 11 fully assembled primate genomes with rabbits as an outgroup (Methods: Read simulations). AAF successfully reconstructed phylogenies with the same topology and only slightly different branch lengths from simulated SRS data (Fig. 1c-e, Table 2). The closest match to the phylogeny produced from the assembled primate genomes with k = 21 (Fig. 1b) was the filtered 5X coverage (Fig. 1e, BSD = 0.016), followed by the non-filtered 2X coverage (Fig. 1c, BSD = 0.063) and then non-filtered 5X coverage (Fig. 1d, BSD = 0.10). In fact, even for 0.5X coverage and no filtering, AAF recovered the correct topology (Additional file 1: Figure S1e), although we do not recommend application of AAF to datasets with such low coverage. Finally, with the tip correction, the trees are more similar to those constructed from the fully assembled genomes (Table 2).

Table 2 Branch Score Distance (BSD) between trees generated from simulated reads and the optimal AAF tree

Full size table

Although we have included sequencing error and random sequences in our read simulations, there are non-random biases brought by different sequencing technologies. In order to test the performance of AAF on real sequencing data, we downloaded next-generation sequencing data for 7 primates that are available in NCBI Short Reads Archive. We did not find data for rabbit in SRA, and therefore we used cat as the outgroup. We added five species to expand the study to Euarchontoglires, the super-order to which primates belong. This dataset includes sequences from different sequencing platforms with different library construction strategies and read lengths (Additional file 3: Table S2). AAF successfully reconstructed a phylogeny from the 12 mammals dataset (Additional file 4: Figure S2) that agrees with recent studies [33, 34]. This illustrates how AAF can be used to construct phylogenies from publicly available, heterogeneous genome data.

AAF is designed mainly for constructing phylogenies for non-model organisms. Therefore, we showcase AAF on raw SRS datasets that might be typically encountered by evolutionary biologists and ecologists, a dataset of 21 tropical trees from four orders (Additional file 5: Table S3). Analyzing this 21 tropical trees dataset was the initial stimulus for developing AAF, and the dataset contains the many complications that will arise in studies of non-model organisms: low and uneven coverage, different read lengths, and no available reference genome close enough to aid traditional assembly and alignment. The AAF tree (Fig. 7) matched the established phylogeny at the interfamily level according to the current Angiosperm Phylogeny Group classification, APGIII [35], and the intergenera level according to recent studies [36, 37]. However, at the intragenus level, there is no published phylogeny for one of the major genera, Lithocarpus; the existing phylogenies for the other major genus, Ficus, have no consensus and poor bootstrap support [38, 39]. Therefore, we cannot compare our results diagnostically to established results in the literature. A tutorial giving a full demonstration from parameter selection to phylogeny reconstruction, tip correction and bootstrapping of this dataset is provided in the Additional file 6: Tutorial of the analysis of the 21 tropical trees dataset.

Comparison with other alignment and assembly-free methods

Although there are other alignment and assembly-free methods for constructing phylogenies, these methods are designed for prokaryote genomes [27, 28, 40]. A major advantage of AAF is its ability to process large genomes within days. The only methods that demonstrated attempts on mammalian-size genomes are provided by Song et al. [26] and Yi and Jin [29].

Song et al. [26] extended a variant of the classic alignment-free statistic, D₂, to perform assembly-free phylogenetic reconstructions. The authors demonstrated its application on 1X simulated genome sequences of five mammals and raw sequences of 13 tropical trees (included within the 21 species we present in our Tutorial), but the results were not convincing. For mammals, none of the resulting clustering was consistent with the known phylogenetic relationships of the five species. For the tropical trees, species from the same genus (Lithocarpus) were not grouped together, and the relationship between genera within Fagaceae was not consistent with previous studies [41]. Song et al. [26] did not provide a package which we could use for direct comparisons with AAF, although our successful application to 12 mammals and 21 tropical tree species contrasts with their results.

Yi and Jin [29] developed co-phylog that uses an micro-alignment method for identifying SNPs (http://humpopgenfudan.cn/resources/softwares/CO-phylog.tar.gz.). This method is designed primarily for closely related species, and the applications Yi and Jin present consist mainly of bacteria. The authors applied co-phylog to 5 mammal species in one of their supplemental figures. However, when we tried to repeat this analysis, the intermediate file (.co file) for a single run of one species (Bushbaby, SRR953063) was 129G. Ideally the intermediate file could be limited to around 24G for mammal species (genome size times 8byte), only if given perfectly assembled genomes without repetitive segments. Since each species has dozens to hundreds of runs of data, we were not able to replicate their results and compare their method to AAF based on real large genome data. This type of analysis will require more hard drive space than most potential users will have available, and this limits the practical use of the method for large genomes.

For a direct comparison between AAF and co-phylog for smaller datasets, we simulated 100 data sets and assessed the performance of the two methods by the number of topological mistakes in the phylogeny. This simulation study is similar to that in Fig. 3c (see Methods: Simulation of sequence evolution). To select the length of the flanking regions (CK) to use in co-phylog (since no guidance is provided with the method), we used our calculations for optimal k to avoid homoplasy, which should pose the same problems for co-phylog as for AAF; thus, in co-phylog we set the combined length of the flanking regions plus the focal nucleotide (i.e., 2CK + 1) equal to k in AAF. AAF uniformly outperformed co-phylog (Additional file 7: Figure S3). Furthermore, the performance of co-phylog depended on CK in a manner predicted for the selection of k in AAF: for short initial sequences, smaller values of CK performed best because they are more able to identify SNPs that are located close together on the genome. However, for longer sequences performance was degraded by homoplasy for small CK. The best value of CK for genome sizes >160 KB was CK = 6, which is comparable to k = 13 in AAF.

Advantages of the AAF method

The AAF method greatly simplifies phylogenetic reconstruction from genomic scale data. Compared to traditional phylogenetic methods, the AAF method uses the evolutionary signal from the whole genome, including coding and non-coding regions, and small-scale structural differences, like indels, not just nucleotide substitutions. In contrast to most current alignment-free methods, the distance matrix calculated by AAF matches with the genetic distance. In contrast to the few available assembly-free methods, AAF can analyze large genomes and huge datasets while accounting for problems posed by lack of assembly, specifically sequencing read errors and incomplete coverage. Our strategy of treating k-mers as present/absent instead of using their absolute or relative frequencies will be advantageous for the direct analysis of raw SRS data, particularly when the coverage varies within and among genomes. Using presence/absence of k-mers also makes AAF more robust to repetitive sequences with high abundance. We demonstrated that the method has statistically well-defined properties that allow optimization and adaptation of the method. These properties also provide predictive insight into the strengths and weaknesses of AAF given a range of evolutionary and sampling conditions.

Other main advantages of the AAF method include:

(i).
Low coverage requirements. Even with the increasing throughput of next-generation sequencing technology, it is still expensive to sequence many genomes at the high coverage needed for successful assembly. The AAF method is able to recover an accurate phylogeny with low coverage (Fig. 5c). This is largely due to the abundance of information brought by whole genomes. Even for species with large genomes like primates, it is possible to obtain 2X coverage for five individuals from a single Illumina HiSeq lane. Therefore, AAF can generate a robust phylogeny quickly and inexpensively for any group of species of interest.
(ii).
Low computational demands. Traditional phylogenomic studies rely on three main steps (without considering data acquisition) that are computationally demanding: assembly, orthologous gene identification and alignment, and phylogenetic reconstruction based on the aligned sequences. Traditional phylogenetic methods like maximum-likelihood and Bayesian methods rely on complex evolutionary models that require large amounts of computing time, even when using large computing clusters. The AAF method decreases the total analysis time drastically. Compared to 40–50 h required for human genome assembly using a supercomputer [42], which is just the first step of traditional analysis for one species, AAF takes less than two days to complete the phylogenetic reconstruction of a dozen primate genomes on a standard workstation with 25G of RAM and 12 threads.

Limitations

The advantages of the AAF method come with some costs:

(i).
Loss of k-mer sensitivity. AAF does not use all of the evolutionary information that would be available if genomes were accurately assembled and aligned. The minimum length of k-mers is set by the need to overcome k-mer homoplasy; for the primate dataset in this study, the minimum length was k = 21. Therefore, if there are multiple mutations (substitutions or indels) within a 21-nucleotide sequence, they will be covered by the same k-mer. Nonetheless, the vast amounts of data available from entire genomes will largely overcome this problem.
(ii).
Deep nodes. The accuracy of most alignment-free methods suffers when applied to sequences or genomes separated by large genetic distances [27, 29]. Deep phylogenetic nodes imply a higher density of mutations. This will cause the loss of sensitivity of AAF to differences between genomes, and will also exacerbate the effects of homoplasy, which in turn will decrease the estimate D between distantly related species (Additional file 8: Figure S4). In our simulations and application to the primate genomes, AAF performed well despite a divergence time for primates estimated at 87 Mya [31] and estimated genetic distance (probability of mutation) from basal node to tips of roughly 0.1. It also worked well when expanding the analysis to Euarchontoglires (Additional file 4: Figure S2b). However, AAF failed to identify some of the deep nodes when including more mammals across the Placentalia clade (Additional file 4: Figure S2c) with a divergence time of >100 Mya [33]. In our dataset of 21 tropical trees, the AAF relationships between families were consistent with the APG III System. The divergence time for this group is about 94 Mya, and the average genetic distance from basal node to tips is roughly 0.1 as well. These empirical examples suggest that AAF can successfully reconstruct phylogenies with divergence times of <100 Mya and genetic distances from base to tips of <0.1, and we do not recommend AAF for deeper nodes.
(iii).
Location of mutations. AAF does not directly give information about where genetic differences between genomes occur. Specifically, the calculation of the phylogeny from the distance matrix does not estimate ancestral states that might be used for mapping specific mutations onto the phylogenetic tree. Thus, AAF does not identify genes or regions of the genome that are conserved or discordant. Nonetheless, once a phylogeny is constructed, analyzing distribution patterns of genes and genome regions across the phylogeny can be performed on k-mers directly [25, 43]. Integration with AAF, however, will require additional development.

Guidelines for parameter selection

AAF requires two choices from the user: (i) what k-mer length to use, and (ii) whether to filter out singleton k-mers. The choice of k depends on the possibility of k-mer homoplasy; k must be long enough to guard against it. This choice can be made for real genomes by plotting p _h and selecting k where the estimated value of p _h matches the theoretical value assuming no k-mer homoplasy, using empirically calculated Q _k (Fig. 2d, Additional file 9: Figure S5). Larger values of k beyond this will generally decrease the performance of AAF due to both the loss of sensitivity (from covering multiple mutations under the same k-mer) and the increased likelihood of k-mer loss through sequencing error if k-mers are not filtered (Fig. 4). We have included the R code for plotting Fig. 2d to provide a starting point for the selection of k (see more details in the tutorial). To confirm this selection, users can repeat the analyses while increasing k by 2 until successive values of k give the same phylogeny.

Filtering is a good guard against sequencing error and inflated tip lengths in the phylogeny (Figs. 1, 2, 3, 4 and 5). However, it requires coverage of at least 5-8X (depending on k-mer length) to ensure that not too many true k-mers are filtered out (Fig. 4). A pleasant side effect of filtering at the k-mer counting stage is that this decreases the size of k-mer table drastically and thereby decreases the computational load substantially (Table 1).

Future directions

AAF could be used to develop automated phylogeny generators such as phylota [44] while using whole-genome data, or REALPHY [40] but not only for microbes. It can make use of both assemblies and sequencing reads that are available, as we demonstrated for primates (assemblies downloaded from ensemble and raw reads downloaded from the NCBI Short Reads Archive). Automation is made feasible by AAF’s robustness against incomplete and unequal coverage, and its ability to accept different read lengths from any sequencing platform. Although AAF is designed for phylogenetic analysis, it has wider application for understanding the pattern of relationships among different taxonomic levels of samples such as population structure within species. While AAF is mainly designed for large eukaryotic genomes, it is also able to analyze prokaryote datasets. Possible application of the alignment and assembly free approach requires further exploration for other types of sequencing data, such as RAD-seq and metagenomic data especially for organisms without a reference. Although AAF is designed for subjects without a good reference genome, studies on species with a reference will also benefit from AAFs computation efficiency and user-friendly pipeline. Because the AAF approach is based upon a matrix of genetic distances among the genomes, it is easy to add new data without recalculating all of the other distances among genomes.

Our overall AAF approach could also be expanded naturally to investigate different phylogenetic patterns within genomes. Our application of AAF to phylogenies generates “average” genetic distances between whole genomes and a corresponding phylogeny based on the overall differences among genomes. It is possible, however, to use AAF to identify suites of k-mers that are consistent with a different phylogeny from the majority. This compartmentalization of the evolutionary history of the genome could identify portions of the genome with discordant histories in comparison to the majority of the genome [25]. The approach could also be used, in conjunction with phenotypic information about species, to associate suites of k-mers with phenotypes of interest. An advantage of the overall AAF approach is that it can identify genomic elements (k-mers and contigs derived from them) that show interesting discordance or associative patterns among species without initially investing in whole-genome assembly and alignment. Thus, the use of AAF to construct phylogenies is only the initial step in a broader AAF program.

Conclusions

The AAF method proved to be an accurate and efficient way of estimating the phylogenetic relationships using raw sequence data from whole genomes. We developed the theoretical basis for optimizing k-mer length selection, filtering, correcting tip branch lengths, and bootstrapping, directly addressing the problems of homoplasy, sequencing error, and incomplete coverage. Thus, AAF provides a robust tool for phylogeny reconstruction especially when only low-coverage and heterogeneous genome data are available – data that would challenge traditional assembly- and alignment-based methods.

Methods

Generating the k-mer table

We use programs from the phylokmer package [25] for counting and merging k-mers into a combined k-mer presence/absence table. The counting step identifies all possible k-mers for each genome. Adjacent k-mers overlap for (k – 1) consecutive nucleotides, so if there is complete coverage, each site is covered by k k-mers. When there is filtering, a k-mer is only recorded as present if it occurs as two or more copies in the same species. The merging step is to produce an M x N table of the counts of each of M k-mers among a group of N species. Because the number of k-mers counted for a given species will depend on the sequencing coverage and possible non-uniform coverage across the genome, the k-mer frequency table is converted to a table of presence/absence of k-mers among taxa.

Estimating d

In our model, the evolutionary distance (d) is estimated from the proportion of k-mers that are shared between taxa. As in most evolutionary models, multiple substitutions at the same site limit the divergence among k-mers and hence the distance estimated between species [45]. Under the assumption that mutations only take the form of nucleotide substitutions, the probability that a given nucleotide undergoes m substitutions in distance D between two species is Poisson distributed; thus, this probability is D ^m e ^–D/m!. If γ_m denotes the probability that after m substitutions there is no change in the nucleotide, the probability of no change in the k-mer is

$$ \frac{n_s}{n_t}={\left({\displaystyle \sum_{m=0}^{\infty }{\gamma}_m\frac{D^m{e}^{-D}}{m!}}\right)}^k $$

(2)

Here, γ₀ = 1, γ₁ = 0, γ₂ = w _s ² + 2w _t ², and γ₃ = 2w _s ² w _t + 4w _t ³ where w _s and w _t are the probabilities of transitions and transversions. Values of γ_m for m > 3 are vanishingly small. Equation 2 can be solved to obtain the estimate D, although this estimate differs little from that given by Equation 1. For example, for w _s = 0.5 and w _t = 0.25, the difference between estimates D is 4 % when the true distance d = 0.2 and diminishes for lower values of d independently of k and genome size. Equation 2 assumes that all mutations are substitutions, and even though indels may be rare, large indels will lead to a much greater loss of shared k-mers than single substitutions. Therefore, multiple substitutions will in practice lead to even smaller differences in the estimate of D from Equation 1 in the main text. Thus, Equation 1 in the main text rather than Equation 2 above is included in the pipeline.

Tree estimation

We constructed phylogenies from the N x N matrix of distance measures D _ij for species i and j using weighted least squares [1] with weights proportional to the expected variance of distances calculated for Equation 1. For fixed n _t, the values of n _s will have an approximately binomial distribution with the number of trials n _t and the probability of success of each trial (persistence of a k-mer) given by n _s/n _t. Because n _t will be large, this can be approximated by a Gaussian distribution with mean n _s and variance n _s(1 – n _s/n _t). From this, the distribution of D _ij will be approximately lognormal with variance proportional to

$$ \sigma \begin{array}{c}\hfill 2\hfill \\ {}\hfill D\hfill \end{array}\cong \frac{1}{n_t}\left(\frac{1-{e^{-k\widehat{D}}}^{{}_{ij}}}{e^{-k{\widehat{D}}_{ij}}}\right)=\frac{1}{n_t\omega \left(k{\widehat{D}}_{ij}\right)} $$

(3)

where $ {\widehat{D}}_{ij} $ is the estimate of D _ij determined during the tree-fitting process. The phylogenetic tree for N species given by branch lengths $ {\widehat{D}}_{ij} $ is that tree that minimizes $ {\displaystyle \sum_{i,j}^N\omega}\left(k{\widehat{D}}_{ij}\right){\left({D}_{ij}-{\widehat{D}}_{ij}\right)}^2. $ For these calculations in our pipeline, we used the program fitch in the package PHYLIP [46], although we modified the program to accommodate the weights given by Equation 3.