Analysis of composition-based metagenomic classification
© Higashi et al.; licensee BioMed Central Ltd. 2012
Published: 19 October 2012
Skip to main content
© Higashi et al.; licensee BioMed Central Ltd. 2012
Published: 19 October 2012
An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the classification problem. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of n-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree.
We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found out that reasonable values for the parameters of interest could be (i) intermediate values of n, the length of the n-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all of the cases.
As expected, short n-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values for n result in sparse frequency vectors that represent differently metagenomic fragments that are in fact similar, also leading to low configuration scores. Regarding the similarity measure, in contrast to our expectations, the variation of the measures did not change the configuration scores significantly. Finally, the hierarchical strategy was more effective than the conventional strategy, which suggests that, instead of using a single classifier, one should adopt multiple classifiers organized as a hierarchy.
Rather than considering a single species in pure culture, metagenomics goes beyond and focuses on the exploration of entire microbial communities . This focus is possible only because of the recent improvements in sequencing technology. As is typical of new concepts, the emergence of this new paradigm has brought up some new challenges. Among them, the manipulation and analysis of short reads deserves special attention.
In some cases, the phylogenetic diversity of a microbial community is not well covered and, as a consequence, only a few reads can be assembled . Hence, one of the first steps of a large-scale metagenomic analysis is to estimate the phylogenetic distribution of the sample. One approach to perform this task is the taxonomic classification of the reads, which is the assignment of these reads into phylogenetic categories .
Essentially, there are three approaches to classifying sequences into taxonomic categories. One possibility is to focus on conserved gene markers (such as rRNA 16S) to identify the source organism of the read. Because rRNA is well conserved, this approach produces an accurate taxonomic classification of the reads. Nevertheless, because only a small fraction of the sequences contain these gene markers, most of the reads of a metagenomic sample cannot be classified using this approach .
Taxonomic classification can also be based on sequence similarity, that is, the alignment of metagenomic reads to a reference dataset (for example using BLAST ). This approach is an accurate method, as long as a similar sequence is present in the database--which is not always true for metagenomic projects . Some examples of off-the-shelf software for metagenomic analysis based on sequence similarity are CARMA  and Megan .
Yet another way to perform the taxonomic classification is to rely on a set of features that is induced by the sequences of nucleotides, producing the so-called composition-based classification . Some features employed in this case are: codon usage, GC content, and oligonucleotide frequency (henceforth n-mer frequency). The latter is usually considered to be a good choice, because the n-mer frequencies carry phylogenetic signals that are useful for extracting common patterns between organisms at different taxonomic levels [9–11]. The following are some examples of software for taxonomic classification based on n-mer frequencies: Phylopythia  implements a support vector machine for classifying sequences that are larger than 3 kbp, Phymm  uses interpolated Markov modes (IMM) to classify reads with at least 100 bp, TACOA  merges the k-nearest-neighbor (k-NN) algorithm with kernelized learning strategies to handle sequences from 800 bp to 50 kbp, and Treephyler  uses hidden Markov models (HMM) to classify reads of 200 bp.
This work focuses on composition-based classification using n-mer frequencies to encode genomic sequences. Such an approach involves a series of decisions, regardless of the specific classifier chosen to perform the task. Usually, these decisions are based on a set of preliminary experiments that account for one particular type of classifier [4, 13]. These studies provide valuable information regarding the performance of a given category of classifier; however, because they are biased by the peculiarities of the classifier of choice, they provide little insight about the characteristics of the classification problem itself. This paper presents a general framework for the empirical assessment of the impact that several decisions have on the degree of separability of taxonomic classes. Thus, instead of focusing on any classifier in particular, we focus our study on the classification problem.
Here we refer to a specific configuration of the classification problem as the setting induced by the following three features: (i) the length of the n-mer word used to encode the DNA sequences; (ii) the similarity measure adopted to compare the sequences; and (iii) the strategy used to assign sequences to taxonomic classes, which can be the conventional approach, in which the sequences are considered independently, or the hierarchical approach, in which the taxonomic context of each DNA fragment is accounted for. The goal of the current work is to serve as a guideline for the development of composition-based metagenomic classifiers by providing some intuition as to how the difficulty of the taxonomic classification problem changes with respect to the variation in the features described above.
We used two types of data: (i) complete genomes; and (ii) synthetic metagenomic fragments. These datasets are described in the following sections.
The genomes were obtained from GenBank, the NCBI database of genetic sequences . We used only microbial sequences, because the majority of metagenomic studies are focused on this type of organism . We considered all 1, 032 microbial genomes sequenced until January, 2010. Among these, 497 sequences had to be removed because they had incomplete taxonomic lineage or undefined nucleotides. Hence, the actual number of genomes used was 535, which encompassed the domains Bacteria and Archaea.
The synthetic fragments were generated by the program MetaSim  using the genomes described above. MetaSim is a metagenomic sequence simulator that can be used to create sets of synthetic fragments reflecting the taxonomic composition of typical metagenomic scenarios. A total of 23, 000 fragments with ~ 400bp was generated under the sequencing conditions of Roche's 454 pyrosequencer .
We now describe how we preprocessed the data to perform our analysis.
To associate the sequence with its corresponding taxonomic lineage we used the information available at NCBI Taxonomy and BioPerl, a toolkit for the manipulation of genomic data . The result of this process was a vector comprising seven positions that were filled out with NCBI taxids (taxonomy identifiers) corresponding to each one of the seven taxa: domain, phylum, class, order, family, genus, and species.
Thus, f (d, t, n, s, a) = y represents a score y to the dataset d, considering the taxon t, using a n-mer length of n to encode the sequences and the similarity measure s to check how similar the sequences are and, finally, using the score measure a. In other words, the score y is a measure of the degree of separability of the taxonomic classes in d at level t under the specific configuration induced by n, s, and a.
We now describe how we defined the score measures that were used to evaluate the classification problem.
We want to assess the "separability" of the taxonomic classes under a given configuration. A straightforward way to do so would be to choose a specific type of classifier and then measure its classification accuracy for each possible combination of values for (d, t, n, s, a) (using cross-validation, for example ). Note that in this case we would be measuring the difficulty of the problem under the assumptions made by that specific classifier. For example, if we adopted a linear model such as the Naive Bayes classifier, then we would be measuring how well classes can be separated by a hyperplane . Therefore, if we want to make no assumptions regarding the "shape" of the classes, the correct approach would be to use a nonlinear model capable of representing any boundary between the classes (such as a support vector machine using an appropriate kernel ). However, such an approach would require an expensive cross-validation process to determine the correct level of complexity of the model under each configuration (using, for example, regularization ).
We want a measure of the separability of the classes that can be efficiently computed and at the same time makes no strong assumptions regarding the shapes of the classes. A possible way of solving this problem is to base our measure on this simple observation: given a set of objects that belong to different classes, the level of separability of the classes can be assessed by the fraction of objects whose closest neighbor belongs to the same class. Note that, under this criterion, if the boundaries between the classes are well defined, then the set of objects will usually be considered to be separable, regardless of the shape of the classes. Therefore, this simple measure is an efficient way of assessing the degree of overlap between classes.
Algorithm 1 presents a detailed description of the computation of the proposed separability measure. Given a configuration (d, t, n, s), for each sequence in d, we calculate its nearest neighbor (NN) and check whether both sequences belong to the same class at the taxonomic level t. If so, then we add 1 to the configuration score. The result is then normalized to fall in the interval 0. We call this approach the conventional score measure.
Algorithm 1: conventional_score(d, t, n, s)
/* Computes the conventional score for a given set of DNA sequences */
Input: d ∈ D, t ∈ T, n ∈ N, s ∈ S
Output: Conventional scorel score
1 score ← 0
2 m ← 0
3 foreach sequence d i ∈ d do
4 if d i is not the only representative of its class in d at level t then
5 m ← m + 1
6 d j ← NN(d i , d, n, s) ;/* nearest neighbor of d i in d using n-mers and measure s */
7 if class(d i ) = class(d j ) at taxonomic level t then score ← score + 1
8 return score/m
Note that, if a genome is the only representative of its taxonomic group, then its nearest neighbor will necessarily belong to another class, which biases downwards the score measure shown in Algorithm 1. For this reason, we classify a genome only if it is not the unique example of its taxonomic class (line 4 of Algorithm 1). In the dataset used in our experiments, classes with a single member occur only at the taxonomic level of species. Specifically, out of 535 genomes used in the experiments, 328 were the unique representatives of their species.
As shown in Algorithm 1, the conventional score is the percentage of sequences that have the same lineage as their nearest-neighbors at a given taxonomic level. Incidentally, this approach is similar to using the k-Nearest-Neighbor (k-NN) classifier with k = 1 (except that in the latter case we would not eliminate classes with a single representative) . This approach is in accordance with our objective of focusing our analysis on the classification problem, because the 1-NN classifier does not make strong assumptions regarding the shape of the classes .
Given the hierarchical structure of the taxonomic classification task, one might wonder whether it is a good strategy to decompose the problem into simpler sub-problems that are defined at each taxonomic level. More specifically, instead of using a single classifier, one would have a hierarchy of classifiers that are organized according to the taxonomic tree. In this case, a given DNA sequence would be classified as follows: first, a classifier at the highest hierarchical level would determine the domain to which the sequence belongs. Then, the DNA sequence would be classified at the next hierarchical level, the phylum, with the particular classifier used to do so determined by the domain the sequence was assigned to at one level above. Following the same reasoning, the sequence would then be passed on to the classifier that is responsible for the specific phylum that it was assigned to, and so on, until the desired taxonomic level had been reached. This classification strategy has been used before in the literature [4, 23].
Note that, to compare the hierarchical scoring measure with the conventional measure, we cannot simply apply Algorithm 1 at each taxonomic level, because the nearest neighbor of a given sequence defines its classification at all of the taxonomic levels (and thus the hierarchical score would coincide with the conventional score). Since we do not want to introduce any bias in our analysis, we must define a score measure that is compatible with our strategy of measuring the separability between classes. This goal can be accomplished as follows. Suppose that a given sequence d i has been correctly classified at taxonomic level t. Then, to classify it one level below in the taxonomic tree, t + 1, we can eliminate all of the
sequences that do not belong to the same class as d i at level t. This procedure corresponds to selecting a specific classifier in the hierarchical scheme described above. Next, if we remove our initial assumption that d i was correctly classified at level t, it is clear that, by eliminating the appropriate sequences of the dataset, an incorrect classification at level t can be followed by a correct classification at level t + 1. This strategy is precisely what allows us to evaluate the hierarchical classification using nothing but the nearest neighbor of each DNA sequence.
Observe that eliminating the sequences that do not belong to the same class as d i at level t corresponds to assuming that d i was correctly classified at that level. Of course, to have an accurate score function at level t + 1, we must account for the possibility that the sequence was incorrectly classified at level t. Clearly, a straightforward way to estimate the probability of a misclassification at level t is to use the score function at that level. Therefore, we define the hierarchical score measure recursively: roughly speaking, the hierarchical score at level t corresponds to the product between the conventional score at the same level and the hierarchical score one level above. Algorithm 2 provides a step-by-step description of how to compute the proposed hierarchical score measure.
Algorithm 2: hierarchical_score(d, t, n, s)
/* Computes the hierarchical score for a given set of DNA sequences */
Input: d ∈ D, t ∈ T, n ∈ N, s ∈ S
Output: Hierarchical score
1 if t = 1 then return conventional_score(d, t, n, s) ;/* i.e., if t is "domain" */
3 score ← 0
4 m ← 0
5 foreach sequence d i ∈ d do
6 d' ← d with only sequences d k which belong to the same class as d i at level t - 1
7 if |d'| > 1 and d i is not the only representative of its class at level t then
8 m ← m + 1
9 d j ← NN(d i , d', n, s)
10 if class(d i ) = class(d j ) at taxonomic level t then score ← score + 1
11 return score/m * hierarchical_score(d, t - 1, n, s)
Using Algorithm 2, one can assess the degree of separability of taxonomic classes under a hierarchical classification scheme without making any strong assumptions regarding the shape of the classes. Therefore, the result of such an analysis applies to any set of classifiers, including a heterogeneous hierarchy composed of classifiers of different types.
As described above, in this work we assume that a given configuration of the taxonomic classification problem is defined by: (i) n, the length of n-mers used to encode the sequences; (ii) s, the similarity measure used; and (iii) a, the score measure, which can be the conventional measure or the hierarchical measure (Algorithms 1 and 2, respectively). To provide an empirical basis for the development of composition-based metagenomic classifiers, we analyzed the separability of taxonomic classes under different configurations of the classification task.
We performed 10 * 4 * 2 = 80 experiments with the genomic dataset and 8 * 4 * 2 = 64 experiments with the synthetic metagenomic fragments data (in both cases the three numbers correspond to |N|, |S|, and |A|, respectively; see Equation (5)). In total, we performed 80 + 64 = 144 experiments. Our analysis addresses the impact of parameters n, s, and a over the configuration scores. Although we also discuss other taxonomic levels, we focus our analysis on the classification problem at the taxon species.
The genomic dataset comprises 535 genomes encompassing 386 different species. Considering the conventional score measure, the configuration scores for this type of data at the level of species varied from f (G, sp, 1, kl, c) = 0.275, for the worst configuration, to f (G, sp, 5, 2, c) = 0.512, for the best configuration. The hierarchical scores varied between f (G, sp, 1, 2, h) = 0.378 and f (G, sp, 7, kl, h) = 0.532.
1-mer frequencies for sequences in five different phyla.
Observe also that from n = 8 to n = 10 the scores decrease slightly. This decrease is a consequence of the fact that, when n ≥ 8, the number of possible n-mer sequences is very large, which results in sparse frequency vectors with a low discriminative power. For example, if the similar sequences di = A A ATGGTA and d j = A G ATGGTA are encoded with n = 8, the result is two vectors with 65, 536 positions filled with zeros in all but one position, which would contain a "1" representing the words d i and d j . Hence, we have two extremely similar sequences represented by two different frequency vectors, which clearly disrupts the score function f.
Concerning the two score measures, the hierarchical approach presented slightly better performance than the conventional score, as shown in Figures 2 and 3. This relationship suggests that decomposing the classification task into smaller sub-problems does indeed make the problem easier.
The synthetic dataset comprises 23, 000 fragments with approximately 400bp. As mentioned previously, these sequences were generated with the sequences simulator MetaSim  under the sequencing conditions of the 454 pyrosequencer. The configuration scores at the level of species varied between f(F, sp, 8, ∞, c) = 0.007 and f (F, sp, 4, kl, c) = 0.112 for the conventional score function, and between f (F, sp, 7, 1, h) = 0.113 and f (F, sp, 4, 1, h) = 0.5 when the hierarchical measure was considered.
In summary, we observed that the scores associated with metagenomic data are in general smaller than the scores generated with genomic data, and using a hierarchical classification approach in this case appears to be even more beneficial. Moreover, the value of n that generated the best results decreased from n ≈ 7 to n ≈ 4, which indicates that, when dealing with metagenomic fragments with approximately 400bp, there is no point in using frequency vectors that have a dimension much higher than 256.
The scores are an approximately concave function of n with a maximum value that is between 4 and 7; the "optimal" value of n is smaller for the metagenomic dataset.
Changing the similarity measure s does not have a strong effect on the scores.
The hierarchical classification scheme appears to be a better alternative for both genomic and metagenomic data; however, in the latter case, its advantage over the conventional classification approach is more evident.
In general the scores that are associated with the metagenomic data are smaller than the scores that are associated with the genomic data, but the difference is more significant under the conventional classification scheme.
Configuration scores referring to the best n and s values.
Taxonomic classification is an essential step within a metagenomic study, since this is the first step of a metagenomic analysis and its result is used as a basis to posterior investigations. Usually, composition-based metagenomic classifiers are configured based on preliminary experiments that account for a specific type of classifier. In this work we proposed to shift the focus of the analysis to the classification task itself. To make this shift, we presented a general framework that can be used to study the impact of several decisions on the difficulty of the classification problem (that is, how "separable" the classes are under different configurations of the task).
In this work we focused the analysis on the impact of three factors in particular: (i) the length of the n-mers used to encode the DNA sequences; (ii) the similarity measure used to compare frequency vectors; and (iii) the underlying classification scheme (hierarchical or conventional). The results presented provide some intuition on how the difficulty of the classification problem changes as a function of the features above. Because our analysis does not assume any structure of the classification problem, it can be used as a guideline for the development of composition-based metagenomic classifiers of any type. Moreover, the framework presented in this work can be used for the analysis of the impact of other factors over the taxonomic classification task.
We thank Douglas Adriano Augusto for important help with the computational experiments. We also thank Carlos Henrique Brandt for granting permission to run some of the experiments on the clusters of CESUP (Centro Nacional de Supercomputação). This work was funded by CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior).
This article has been published as part of BMC Genomics Volume 13 Supplement 5, 2012: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S5 .
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.