Acquisition of datasets
We used two types of data: (i) complete genomes; and (ii) synthetic metagenomic fragments. These datasets are described in the following sections.
Complete genomes
The genomes were obtained from GenBank, the NCBI database of genetic sequences [14]. We used only microbial sequences, because the majority of metagenomic studies focus on this type of organism [15]. We considered all 1,032 microbial genomes sequenced by January 2010. Of these, 497 had to be removed because they had an incomplete taxonomic lineage or contained undefined nucleotides. Hence, the actual number of genomes used was 535, encompassing the domains Bacteria and Archaea.
Synthetic metagenomic fragments
The synthetic fragments were generated with the program MetaSim [16], using the genomes described above. MetaSim is a metagenomic sequence simulator that creates sets of synthetic fragments reflecting the taxonomic composition of typical metagenomic scenarios. A total of 23,000 fragments of approximately 400 bp was generated under the sequencing conditions of Roche's 454 pyrosequencer [17].
Preprocessing of datasets
We now describe how we preprocessed the data to perform our analysis.
Calculating n-mer frequencies
To encode the nucleotide sequences, we calculated the n-mer frequencies in each (meta)genomic sequence. To do so, we counted the number of occurrences of all possible n-mers in a given sequence, considering an overlap of n - 1 nucleotides (that is, we counted from position 1 to n, then from position 2 to n + 1, and so on). This strategy gives rise to a 4^n-dimensional vector whose elements represent the number of occurrences of each possible n-mer. We then divided the elements of this vector by the total number of n-mers contained in the sequence. For the experiments with the Kullback-Leibler (KL) divergence we used a slightly different approach to count the n-mers, because the strategy above could lead to a division by zero (see Equation 4). In particular, we assumed that each n-mer had occurred at least once, a method usually referred to in the literature as "add-one smoothing" [18]. In the end, each sequence is represented by a vector of n-mer frequencies (hereafter, "vector of frequencies"). We will sometimes refer to a vector of frequencies simply as a "sequence" when there is no risk of misinterpretation. Figure 1 illustrates the process described above.
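As an illustration, the following is a minimal Python sketch of this counting scheme (the function name and interface are ours, not part of any published tool); it assumes the input is an uppercase DNA string and simply skips windows containing undefined nucleotides:

```python
from itertools import product

def nmer_frequencies(sequence, n, add_one=False):
    """Compute the 4^n-dimensional vector of n-mer frequencies of a DNA sequence.

    Overlapping n-mers are counted (positions 1..n, 2..n+1, and so on). With
    add_one=True every count starts at 1 (add-one smoothing), which avoids
    zero frequencies when the KL divergence is used.
    """
    nmers = ["".join(p) for p in product("ACGT", repeat=n)]
    counts = {m: (1 if add_one else 0) for m in nmers}
    for i in range(len(sequence) - n + 1):
        word = sequence[i:i + n]
        if word in counts:            # skip windows with undefined nucleotides
            counts[word] += 1
    total = float(sum(counts.values()))
    return [counts[m] / total for m in nmers]

# Example: 2-mer frequency vector (16 elements) of a short sequence
freqs = nmer_frequencies("ACGTACGTAC", 2)
```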
Determining taxonomic lineage
To associate each sequence with its corresponding taxonomic lineage we used the information available in NCBI Taxonomy together with BioPerl, a toolkit for the manipulation of genomic data [19]. The result of this process was a seven-position vector filled with the NCBI taxids (taxonomy identifiers) corresponding to each of the seven taxa: domain, phylum, class, order, family, genus, and species.
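For illustration, such a lineage vector could be represented as below. The taxid values shown are only examples of the identifiers retrieved from NCBI Taxonomy, and the layout is a convenience of this sketch (the original pipeline used BioPerl), not a prescribed format:

```python
# Illustrative lineage for an Escherichia coli genome: seven positions ordered
# (domain, phylum, class, order, family, genus, species), each holding an NCBI
# taxid. The values below are examples; in practice they are retrieved from
# the NCBI Taxonomy database.
lineage = {
    "domain":  2,        # Bacteria
    "phylum":  1224,     # Proteobacteria
    "class":   1236,     # Gammaproteobacteria
    "order":   91347,    # Enterobacterales
    "family":  543,      # Enterobacteriaceae
    "genus":   561,      # Escherichia
    "species": 562,      # Escherichia coli
}

# Vector form, indexed 0 (domain) through 6 (species), as used in the sketches below
lineage_vector = [lineage[t] for t in
                  ("domain", "phylum", "class", "order", "family", "genus", "species")]
```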
Score functions
The next step is to implement a score function that provides, for a specific configuration, a score for the degree of separability of the taxonomic classes. To formally define this function, we adopt the following notation. D = {G, F} is the dataset, in which G represents the genomic sequences and F the synthetic metagenomic fragments. T = {do, ph, cl, or, fa, ge, sp} is the taxon set, which represents the sequence's taxonomic lineage. N = {1, 2, ..., 10} is the set of n-mer lengths, and S = {1, 2, ∞, kl} is the set of similarity measures, where 1 is the 1-norm distance (Equation 1), 2 is the 2-norm (Euclidean) distance (Equation 2), ∞ is the ∞-norm distance (Equation 3), and kl is the Kullback-Leibler divergence (Equation 4). In Equations 1-4, p and q denote two vectors of n-mer frequencies, and p_i and q_i denote their i-th elements.
$d_{1}(p, q) = \sum_{i=1}^{4^{n}} \lvert p_i - q_i \rvert$   (1)

$d_{2}(p, q) = \sqrt{\sum_{i=1}^{4^{n}} (p_i - q_i)^2}$   (2)

$d_{\infty}(p, q) = \max_{1 \le i \le 4^{n}} \lvert p_i - q_i \rvert$   (3)

$d_{kl}(p, q) = \sum_{i=1}^{4^{n}} p_i \log \frac{p_i}{q_i}$   (4)
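For concreteness, the four measures can be rendered in Python as follows (a sketch assuming p and q are equal-length lists of n-mer frequencies, with add-one smoothing already applied when the KL divergence is used):

```python
import math

def norm1(p, q):
    """1-norm (Manhattan) distance between two frequency vectors (Equation 1)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def norm2(p, q):
    """2-norm (Euclidean) distance (Equation 2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def norm_inf(p, q):
    """Infinity-norm distance: the largest absolute coordinate difference (Equation 3)."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence (Equation 4); q must have no zero entries,
    which add-one smoothing guarantees."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```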
A = {c, h} is the set of score measures. The element c represents the conventional score measure, in which a configuration is scored by considering each sequence separately, and the element h represents the hierarchical score measure, in which a configuration is scored with respect to each sequence's taxonomic context (see below). With this notation, the score function is defined as follows:
$f : D \times T \times N \times S \times A \to [0, 1]$   (5)
Thus, f(d, t, n, s, a) = y assigns a score y to the dataset d, considering the taxon t, using an n-mer length of n to encode the sequences, the similarity measure s to compare the sequences, and the score measure a. In other words, the score y is a measure of the degree of separability of the taxonomic classes in d at level t under the specific configuration induced by n, s, and a.
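To make the size of this configuration space explicit, the sketch below simply enumerates it; the variable names are placeholders chosen for this example:

```python
from itertools import product

DATASETS  = ["G", "F"]                                    # genomes, synthetic fragments
TAXA      = ["do", "ph", "cl", "or", "fa", "ge", "sp"]    # domain ... species
N_LENGTHS = range(1, 11)                                  # n-mer lengths 1..10
MEASURES  = ["1", "2", "inf", "kl"]                       # similarity measures
SCORES    = ["c", "h"]                                    # conventional, hierarchical

# Each tuple (d, t, n, s, a) is one configuration, mapped to a score y = f(d, t, n, s, a)
configurations = list(product(DATASETS, TAXA, N_LENGTHS, MEASURES, SCORES))
print(len(configurations))   # 2 * 7 * 10 * 4 * 2 = 1120 configurations
```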
We now describe how we defined the score measures that were used to evaluate the classification problem.
Conventional score measure
We want to assess the "separability" of the taxonomic classes under a given configuration. A straightforward way to do so would be to choose a specific type of classifier and then measure its classification accuracy for each possible combination of values for (d, t, n, s, a) (using cross-validation, for example [20]). Note that in this case we would be measuring the difficulty of the problem under the assumptions made by that specific classifier. For example, if we adopted a linear model such as the Naive Bayes classifier, then we would be measuring how well classes can be separated by a hyperplane [20]. Therefore, if we want to make no assumptions regarding the "shape" of the classes, the correct approach would be to use a nonlinear model capable of representing any boundary between the classes (such as a support vector machine using an appropriate kernel [21]). However, such an approach would require an expensive cross-validation process to determine the correct level of complexity of the model under each configuration (using, for example, regularization [20]).
We want a measure of the separability of the classes that can be efficiently computed and at the same time makes no strong assumptions regarding the shapes of the classes. A possible way of solving this problem is to base our measure on this simple observation: given a set of objects that belong to different classes, the level of separability of the classes can be assessed by the fraction of objects whose closest neighbor belongs to the same class. Note that, under this criterion, if the boundaries between the classes are well defined, then the set of objects will usually be considered to be separable, regardless of the shape of the classes. Therefore, this simple measure is an efficient way of assessing the degree of overlap between classes.
Algorithm 1 presents a detailed description of the computation of the proposed separability measure. Given a configuration (d, t, n, s), for each sequence in d, we calculate its nearest neighbor (NN) and check whether both sequences belong to the same class at the taxonomic level t. If so, then we add 1 to the configuration score. The result is then normalized to fall in the interval [0,1]. We call this approach the conventional score measure.
Algorithm 1: conventional_score(d, t, n, s)
/* Computes the conventional score for a given set of DNA sequences */
Input: d ∈ D, t ∈ T, n ∈ N, s ∈ S
Output: Conventional score
1  score ← 0
2  m ← 0
3  foreach sequence d_i ∈ d do
4      if d_i is not the only representative of its class in d at level t then
5          m ← m + 1
6          d_j ← NN(d_i, d, n, s) ;  /* nearest neighbor of d_i in d using n-mers and measure s */
7          if class(d_i) = class(d_j) at taxonomic level t then score ← score + 1
8  return score / m
Note that, if a genome is the only representative of its taxonomic group, then its nearest neighbor necessarily belongs to another class, which biases the score measure shown in Algorithm 1 downwards. For this reason, we classify a genome only if it is not the unique example of its taxonomic class (line 4 of Algorithm 1). In the dataset used in our experiments, classes with a single member occur only at the taxonomic level of species. Specifically, out of the 535 genomes used in the experiments, 328 were the unique representatives of their species.
As shown in Algorithm 1, the conventional score is the percentage of sequences that have the same lineage as their nearest neighbors at a given taxonomic level. Incidentally, this approach is similar to using the k-Nearest-Neighbor (k-NN) classifier with k = 1 (except that in the latter case we would not eliminate classes with a single representative) [22]. This approach is in accordance with our objective of focusing our analysis on the classification problem, because the 1-NN classifier does not make strong assumptions regarding the shape of the classes [20].
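A compact Python sketch of Algorithm 1 is shown below. It assumes `vectors` is a list of frequency vectors, `lineages` the corresponding list of lineage vectors (indexed by taxonomic level, 0 for domain through 6 for species), and `dist` one of the similarity measures defined earlier; these names are ours, not part of the original implementation:

```python
def conventional_score(vectors, lineages, level, dist):
    """Fraction of sequences whose nearest neighbor shares their class at the
    given taxonomic level (a sketch of Algorithm 1). Sequences that are the
    only representative of their class at that level are skipped."""
    score, m = 0, 0
    for i, v in enumerate(vectors):
        # line 4: skip d_i if it is the only representative of its class at this level
        has_classmate = any(lineages[j][level] == lineages[i][level]
                            for j in range(len(vectors)) if j != i)
        if not has_classmate:
            continue
        m += 1
        # line 6: nearest neighbor of d_i among all other sequences
        j = min((k for k in range(len(vectors)) if k != i),
                key=lambda k: dist(v, vectors[k]))
        # line 7: does the nearest neighbor share d_i's class at this level?
        if lineages[j][level] == lineages[i][level]:
            score += 1
    return score / m
```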
Hierarchical score measure
Given the hierarchical structure of the taxonomic classification task, one might wonder whether it is a good strategy to decompose the problem into simpler sub-problems that are defined at each taxonomic level. More specifically, instead of using a single classifier, one would have a hierarchy of classifiers that are organized according to the taxonomic tree. In this case, a given DNA sequence would be classified as follows: first, a classifier at the highest hierarchical level would determine the domain to which the sequence belongs. Then, the DNA sequence would be classified at the next hierarchical level, the phylum, with the particular classifier used to do so determined by the domain the sequence was assigned to at one level above. Following the same reasoning, the sequence would then be passed on to the classifier that is responsible for the specific phylum that it was assigned to, and so on, until the desired taxonomic level had been reached. This classification strategy has been used before in the literature [4, 23].
Note that, to compare the hierarchical scoring measure with the conventional measure, we cannot simply apply Algorithm 1 at each taxonomic level, because the nearest neighbor of a given sequence defines its classification at all of the taxonomic levels (and thus the hierarchical score would coincide with the conventional score). Since we do not want to introduce any bias in our analysis, we must define a score measure that is compatible with our strategy of measuring the separability between classes. This goal can be accomplished as follows. Suppose that a given sequence d_i has been correctly classified at taxonomic level t. Then, to classify it one level below in the taxonomic tree, t + 1, we can eliminate all of the sequences that do not belong to the same class as d_i at level t. This procedure corresponds to selecting a specific classifier in the hierarchical scheme described above. Next, if we remove our initial assumption that d_i was correctly classified at level t, it is clear that, by eliminating the appropriate sequences of the dataset, an incorrect classification at level t can be followed by a correct classification at level t + 1. This strategy is precisely what allows us to evaluate the hierarchical classification using nothing but the nearest neighbor of each DNA sequence.
Observe that eliminating the sequences that do not belong to the same class as d_i at level t corresponds to assuming that d_i was correctly classified at that level. Of course, to have an accurate score function at level t + 1, we must account for the possibility that the sequence was incorrectly classified at level t. Clearly, a straightforward way to estimate the probability of a misclassification at level t is to use the score function at that level. Therefore, we define the hierarchical score measure recursively: roughly speaking, the hierarchical score at level t corresponds to the product between the conventional score at the same level and the hierarchical score one level above. Algorithm 2 provides a step-by-step description of how to compute the proposed hierarchical score measure.
Algorithm 2: hierarchical_score(d, t, n, s)
/* Computes the hierarchical score for a given set of DNA sequences */
Input: d ∈ D, t ∈ T, n ∈ N, s ∈ S
Output: Hierarchical score
1   if t = 1 then return conventional_score(d, t, n, s) ;  /* i.e., if t is "domain" */
2   else
3       score ← 0
4       m ← 0
5       foreach sequence d_i ∈ d do
6           d' ← d with only the sequences d_k that belong to the same class as d_i at level t - 1
7           if |d'| > 1 and d_i is not the only representative of its class at level t then
8               m ← m + 1
9               d_j ← NN(d_i, d', n, s)
10              if class(d_i) = class(d_j) at taxonomic level t then score ← score + 1
11      return (score / m) * hierarchical_score(d, t - 1, n, s)
Using Algorithm 2, one can assess the degree of separability of taxonomic classes under a hierarchical classification scheme without making any strong assumptions regarding the shape of the classes. Therefore, the result of such an analysis applies to any set of classifiers, including a heterogeneous hierarchy composed of classifiers of different types.
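Under the same conventions as the sketch of Algorithm 1, a recursive rendering of Algorithm 2 could look as follows (levels are indexed 0 for domain through 6 for species; again, this is an illustrative sketch, not the original implementation):

```python
def hierarchical_score(vectors, lineages, level, dist):
    """Recursive sketch of Algorithm 2: the fraction of correctly matched
    nearest neighbors within each parent class, multiplied by the
    hierarchical score at the level above."""
    if level == 0:                        # "domain" has no parent level
        return conventional_score(vectors, lineages, level, dist)
    score, m = 0, 0
    for i, v in enumerate(vectors):
        # line 6: restrict the dataset to sequences sharing d_i's class at level - 1
        pool = [j for j in range(len(vectors)) if j != i
                and lineages[j][level - 1] == lineages[i][level - 1]]
        has_classmate = any(lineages[j][level] == lineages[i][level] for j in pool)
        # line 7: require a non-trivial restricted dataset and a non-singleton class
        if not pool or not has_classmate:
            continue
        m += 1
        j = min(pool, key=lambda k: dist(v, vectors[k]))    # line 9: NN within d'
        if lineages[j][level] == lineages[i][level]:        # line 10
            score += 1
    # line 11: combine with the hierarchical score one level above
    return (score / m) * hierarchical_score(vectors, lineages, level - 1, dist)
```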