K-mer distribution of complex microbiota is homogenous irrespective of bacterial composition
Highly complex microbiota metagenomic raw sequence data can be split in short sequences of length k bases, which can be binned into a finite set of possible k-mer sequences (4k combinations). K-mer analysis of single bacterial genome data has previously revealed differences in k-mer distribution between bacterial species [30]. In contrast, we hypothesize that k-mer distribution of a large set of sequence data derived from a complex mix of microorganisms follows a relatively uniform distribution. To validate this hypothesis we selected two distinct stool samples representing two different enterotypes (Prevotella dominated for donor #1 and Bacteroides dominated for donor #2 - Figure 1A). We then analysed the occurrence of each 4-mer by searching through all raw sequence reads for the two metagenomes. Interestingly, the two selected metagenomes had very similar 4-mer distributions despite their highly different bacterial compositions (Figure 1B). Of note, the Shannon-Entropy for both samples was high (0.9932 and 0.9930 for donor #1 and #2, respectively) characteristic of a uniform distribution of 4-mers (Figure 1B). In line with our hypothesis, the Shannon-Entropy of the two selected metagenomes was clearly higher than the one of 28 known genomes of bacterial species from a large spectrum of phyla and classes (Additional file 1: Figure S1A top panel and C). In other words, genomes from individual bacterial species have a more heterogenous 4-mer distribution than complex metagenomes, even when such metagenomes are derived from very different gut microbiota compositions. This result was confirmed by evaluating the average normalized Shannon-index of the k-mer distribution for genomes derived from 28 bacterial strains compared to gut metagenomes derived from 21 low (<1010 bacteria) (cf. Additional file 1: Figure S1A middle panel) and 31 high (>1010 bacteria) (cf. Additional file 1: Figure S1A bottom panel) bacterial content human stool samples (P = 0.001 and <0.0001, respectively, cf. Additional file 1: Figure S1B). Similarly, we compared the 28 bacterial strains with 110 healthy individuals from the study by Yatsunenko et al. (mean and 95% confidence intervals for strains and metagenomes: 0.972 [0.963:0.980] and 0.983 [0.981:0.984], respectively, P = 0.004) [5]. Of note, the Yatsunenko study employed Illumina sequencing, showing that the methodology is platform-independent.
Moreover, individual bacterial genomes aggregated into 6 clusters defined by their k-mer distribution using agglomerative hierarchical cluster analysis (Figure 1D). The clusters were validated with a non-hierarchical K-means cluster analysis. The agreement between the two clustering techniques was good as defined by Cohen’s kappa agreement value (κ = 0.48). Interestingly, the identified clusters are associated with the phylogeny of the bacteria and can be used to evaluate taxonomic relations, as previously suggested [30]. Deductions from this result suggest that 4-mer analysis of metagenomes of complex bacterial mixtures can be decomposed into a linear regression of k-mer distribution vectors of individual bacteria genomes and a residual, which would represent the component unexplained by known bacterial genomes. In other words, this type of analysis could identify novel bacterial species and potentially elucidate their phylogenetic descent. This approach is beyond the scope of the present study.
Quantitative metagenomic analysis of serially diluted gut microbiota identifies lowest analyzable sample size limit
Biased metagenomic sequence distribution can be a result of technical obstacles (DNA extraction and library construction), contaminations and limiting amount of sample material [9,31]. Whereas the former causes may be improved or avoided the latter is most often unavoidable. Of note, the reliability of sequence distribution directly affects the validity of quantitative metagenomic data. Therefore, there is an urgent need for a method to evaluate metagenomic quality. To investigate if k-mer distribution analysis of complex metagenomes could predict metagenomic quality of samples with limiting material, we generated 10-fold serial dilutions of two purified gut microbiota samples presented above (cf. Figure 1 - donor #1 and #2). Each dilution underwent genomic DNA extraction and metagenomic analysis (Table 1). All dilutions of the same sample should ideally have identical gene distribution with the more concentrated sample being the most representative of the underlying gut microbiota and thus of best quality. We therefore mapped raw metagenomic sequences onto a reference gene catalogue [4] for all analyzed samples and correlated gene frequencies from four 10-fold dilutions with gene frequencies from the most concentrated sample, serving as internal reference sample (Figure 2A). For both samples (donor #1 and #2) this analysis demonstrated strong correlations between all serial dilutions and their reference sample with a clear reduction in correlation for the highest dilution for both samples, indicating the analytical sample size limitation associated with our analytical protocol (Figure 2B). As expected, correlation between two unrelated donors (the highest concentration sample from donor #1 and #2 - spearman r = 0.22) was significantly lower than intra-donor correlations (Figure 2B and C).
K-mer distribution analysis of metagenomic sequences identifies the same lower sample size limit as quantitative metagenomic analysis
Having established a metagenomic dataset including metagenomes with a defined decline in quality we investigated if k-mer analysis of raw sequences of the same dataset would be able to predict the lower sample size limit as defined in the previous paragraph based on a comparative gene mapping procedure. 4-mer analysis of raw metagenomes corresponding to dilution series samples (1 to 10.000 fold dilutions) of gut microbiota from donor #1 and #2 identified a biased 4-mer distribution for 1.000- and 10.000-fold dilution samples from both donor #1 and #2 (Figure 3A, left panel). Interestingly, aberrant k-mers were not fully overlapping between sample dilutions (Figure 3A, right panel), suggesting that low quality is derived from both sample preparation and system noise. Calculating the Shannon-Entropy for 4-mer distributions from all metagenomes confirmed that the two most dilute samples suffered from a particularly biased raw sequence read composition (Figure 3B). To identify aberrant 4-mers, we correlated the 4-mer frequency observed for each dilution series metagenome with the 4-mer frequency observed for the undiluted reference sample of donor #1 and #2, respectively (Figure 3C). This analysis revealed a distinct subset of 4-mers largely overrepresented in the diluted samples. A closer look at these 4-mers uncovered a tight association with the unique barcode-cassette sequence flanking the genome fragments of the metagenomic shot-gun repertoire. These sequences are derived from self-ligated shot-gun cassettes. Excessive amounts of these sequences are a consequence of limited genomic DNA and subsequent reduced ligation efficiency. Indeed, when we removed all raw sequence reads matching the barcode-cassette sequence of the respective metagenome repertoire, the 4-mer distributions of diluted samples were less aberrant (Additional file 1: Figure S2A), although the 10.000-fold diluted sample remained quantitatively more biased (reduced Shannon-Entropy) than the other dilutions for both donor #1 and #2 (Additional file 1: Figure S2B). Similarly, the correlation analysis revealed that the 10.000-fold diluted sample included k-mers largely overrepresented in the diluted sample compared to the undiluted reference k-mer distribution (Additional file 1: Figure S2C). Of note, this bias is correlated with the skewed gene distribution observed for the 10.000-fold dilution (Figure 2B).
These observations demonstrate that metagenomic quality, as defined by the capacity to precisely and robustly define gene distributions of microbiota, can be predicted by a k-mer distribution analysis of metagenomic raw sequences. It is however not clear if the skewed k-mer distribution observed for the highest sample dilutions (corresponding to low quality metagenomes) is due to aberrant bacterial gene sequences, as observed by correlative analysis of mapped reads (Figure 2), or due to concomitant non-mappable sequences similar to but distinct from the barcode-cassette sequences discussed above. We therefore filtered raw metagenome sequences to only contain mappable sequences. 4-mer analysis revealed an almost equal distribution of 4-mers for all dilution series metagenomes (Additional file 1: Figure S3A) resulting in very similar Shannon-Entropy for 4-mer distributions of all samples (Additional file 1: Figure S3B). Equally, k-mer frequencies correlated perfectly between dilution series samples from the same donor (Additional file 1: Figure S3C). The predictive features of the k-mer analysis are therefore relying on a secondary but concomitant degradation of sequence quality and distribution.
K-mer distribution predicts metagenomic sequence mapping to a reference gene catalogue
Our data demonstrate that k-mer analysis is primarily identifying the presence of aberrant sequences, such as contaminations linked to poor metagenome library assembly resulting from limited quantity of genomic DNA. Because sequence contaminations are unlikely to map to known bacterial genes, we speculated that skewed k-mer distributions could predict the frequency of raw sequence reads mapping to the reference gene catalogue. Of note, raw sequences in this context refer to entirely unmanipulated NGS datasets. This approach was chosen to render the methodology broadly applicable. Indeed, we were able to show a clear positive association between 4-mer distribution quantified as Shannon-Entropy and the frequency of mapped reads for dilution series metagenomes of donor #1 and #2 (r = 0.88, P = 0.0009 - Figure 4A). Of note, the three most concentrated dilution series samples for both donor #1 and #2 had very similar 4-mer distributions and thus similar gene mapping frequency, whereas the more diluted samples suffered a pronounced drop in the uniformity of their 4-mer distribution with an associated drop in gene mapping efficiency. Applying this analytical approach to a set of 52 metagenomes of 28 human gut microbiota (some gut microbiota were analyzed up to three times with different initial sample size input) showed that our observation was generally applicable, and that 4-mer analysis predicted gene mapping efficiencies below approximately 20% (r = 0.34, P = 0.0141 - Figure 4B). Of note, the rate of mapping was based on unfiltered raw sequences and therefore lower than previously reported [32]. We observed that low mapping efficiency was strongly associated with limiting sample material (less than 1010 bacteria per sample – Figure 4B). Low (<1010 bacteria) and high (>1010 bacteria) quantity samples differed significantly with regards to the quantity of DNA available for the ligation step of metagenomic library construction (P = 0.0004; median values and 25%-75% ranges are 1.0 μg [1.0;1.0] and 0.7 μg [0.6;1.0], respectively). The quantities were conform with what was observed for the dilution series samples (cf. Table 1). Above a mapping efficiency of 20% the normalized Shannon Entropy reaches a plateau despite variation in mapping efficiency. This is likely to be a consequence of the relatively large inherent variation in gene distributions between individuals, which is more or less compatible with the known but still incomplete gene reference catalog [4]. The constant increase in gene coverage provided by reference catalogues should eventually remove variations of gene mapping between samples.