### Simulated metagenomic datasets

We simulated four NGS metagenomic datasets using MetaSim [47]. In the first simulated datasets, the simulated communities consist of 5 bacterial species: *Acidobacterium capsulatum*, *Bacteroides fragilis*, *Nitrosospira multiformis*, *Proteus mirabilis* and *Sulfolobus islandicus* found in a soil bacterial community from a previous study [42]. Two types of sample relationships were simulated: group relationship where the species abundance levels of the samples belong to different groups and gradient relationship where the species abundance levels change continuously with some variables such as temperature or location.

In **Simulation 1**, we simulated 90 samples belonging to 3 groups (shown in Additional file 1: Figure S11). We began with a relative abundance vector of the 5 bacterial species (0.05,0.10,0.20,0.25,0.40), and each component was perturbed by multiplying the absolute value of a normally distributed variable with mean 1 and standard deviation equal to the value of the component itself. The relative abundance vector was renormalized to sum to 1. The original species relative abundance vector was perturbed and renormalized in this manner for three times independently, and three relative abundance vectors were obtained: (0.022, 0.058, 0.116, 0.507, 0.297), (0.042, 0.088, 0.281, 0.244, 0.345) and (0.046, 0.066, 0.042, 0.320, 0.526). We used these three vectors as the species abundance vectors for the three group centers and further generated three groups of species relative abundance vectors around them. For each group center relative abundance vector, we added to each component the absolute value of a Gaussian noise of mean zero and standard derivation equal to each component and renormalized components to sum to 1. Each relative abundance vector was randomized and renormalized 30 times, and a total of 90 relative abundance vectors were obtained. With these relative abundance vectors, we generated 90 metagenomic samples by mixing the 5 bacterial genomes at the abundance levels defined in the vectors and sampling short reads from the mixture genomes with MetaSim [47]. The sampling procedure mimics the next-generation sequencing by the Roche/454 platform with read length of 200nt.

In **Simulation 2**, we simulated 20 samples consisting of the same 5 bacterial species as in Simulation 1 and the relative abundance vector of the 5 species shifts along a gradient axis. Among the 5 bacterial species, we set the abundance level of one of them as constant, and let the abundance levels of the other 4 species vary following 4 functions across the sampling indexes. The four functions (shown in Additional file 1: Figure S12) have centers approximately around sample indexes 0, 5, 10 and 15, respectively. The abundances of all 4 species were first taken from these functions and were then normalized together with the species with constant abundance to sum to 1, forming the relative abundance vectors. Absolute values of Gaussian noises were added to each component of the abundance vector, with mean 0 and standard deviation equal to the value of each component. The vectors were renormalized after adding the noises. We generated 20 metagenomic samples by sampling the 5 bacterial genomes according to these relative abundance vectors using MetaSim [47] with read length of 200nt.

In

**Simulations 3 and 4**, we considered more complex communities consisting of 113 microbial genomes from the collection of genomes provided by the FAMeS dataset [

45,

46]. The relative abundance vectors of the microbial genomes were generated from the power-law (Zipf’s) distribution:

In **Simulation 3**, we generated 60 samples belonging to 3 groups with 20 samples in each group. For each group, we randomly ordered the 113 genomes and assigned the *i*-th genome with relative abundance *f (k;α,N)*. Then we added to each component the absolute value of a Gaussian noise with mean zero and standard derivation equal to each component and renormalized components to sum to 1. Each relative abundance vector was randomized and renormalized 20 times, and a total of 60 relative abundance vectors were obtained. Then we used these relative abundance vectors to simulate 60 metagenomic samples.

In **Simulation 4**, we generated 20 samples consisting of the same 113 genomes , and the relative abundance vector of the 113 genomes were generated by the power-law (Zipf’s) distribution as in the third simulation. In order to mimic the gradient model, the relative abundance vector shifts along a gradient axis of α from 0.275 to 0.75 by 0.025. Again, absolute values of Gaussian noises were added to each component of the 20 abundance vectors, with mean 0 and standard deviation equal to the value of that component. The vectors were renormalized after adding the noises. We generated 20 metagenomic samples according to these relative abundance vectors using MetaSim. In order to see the effect of sequencing platform, we generated short reads through both Roche/454 and Illumina/Solexa platforms by their default parameters in Simulations 3 and 4.

In all the simulations, we generated datasets at three sequencing depths: 1,000, 10,000 and 100,000 sequencing reads per sample. At each setting, we generated 100 duplicated datasets to simulate possible stochastic effects in real NGS data.

### Real metagenomic datasets

We analyzed three real shotgun metagenomic datasets published in recent years.

**The Mammalian Gut Dataset** includes 39 fecal microbiome samples from 33 mammalian species [31]. Shotgun sequencing of whole metagenome by the multiplex shotgun Roche/454 FLX pyrosequencing platform produced a total of 2,163,286 reads [mean = 55,469±28,724(SD) per sample] (Additional file 1: Table S1). The read length is 261±83(SD) nt. The animals in the dataset represent three diet types and varied digestive physiologies (hindgut-fermenting herbivores, *n* = 8; foregut-fermenting herbivores, *n* = 13; simple-gut carnivores, *n* = 7, and simple-gut omnivores, *n* = 11). As reported in previous studies with both targeted 16S rRNA sequencing and whole metagenome sequencing, the fecal microbiomes of herbivores and carnivores are associated with host diets and digestive physiologies [9]. Furthermore, bacterial compositions of omnivores are very diverse and cannot be distinguished from other groups of mammals [31]. Therefore, we studied this dataset in two steps. First, we excluded the omnivore samples and only studied the relationships among the groups of hindgut-fermenting herbivores, foregut-fermenting herbivores and simple-gut carnivores. Then we included the omnivore samples to study how the omnivore samples clustered with the other samples.

**The Global Ocean Dataset** includes 56 global ocean microbiome samples from 41 aquatic, largely marine locations across a multi-thousand kilometer transection from the North Atlantic through the Panama Canal and ending in the South Pacific [29]. Shotgun sequencing of whole metagenome by the AB3730xl platform produced 7,697,926 reads [mean = 137,463±145,307 (SD) per sample] (Additional file 1: Table S2). The read length is 1,032±58(SD) nt. In addition to different geographic locations, these samples also represent a range of habitat types, such as open ocean (*n* = 23), coastal (*n* = 19), and estuary (*n* = 3). As reported in previous studies, the diversity among samples shows significant separation between groups located in temperate regions and those located in tropical regions, although certain outliers can be present [29]. In our study, we first applied the sequence signature methods to the 23 open ocean samples and 19 coastal ocean samples separately to avoid possible interactive effects of sampling location and habitat types. Finally, the methods were applied to all 56 samples to obtain a panoramic view of the microbial communities.

**The Human Gut Dataset** includes 13 fecal microbiome metagenomic samples from healthy Japanese individuals, consisting of 7 adults, 2 weaned children, and 4 unweaned infants (Additional file 1: Table S4) [28]. Shotgun sequencing of the whole metagenome produced 1,041,617 reads [mean = 80,124±2,619(SD) per sample] with read length of 756±162(SD) nt. These samples were collected from 2 families (Family I: 2 adults and 2 weaned children; Family II: 2 adults and 1unweaned infant), 4 individual adults, and 2 individual infants. The previous study revealed intriguing differences in microbiomes between adults and unweaned infants and a high functional uniformity in adults and weaned children [28]. In our study, sequence signature methods were applied to detect the dissimilarity between the groups of samples.

### The *k*-tuple count vectors and dissimilarity measures

The comparison of metagenomic samples using sequence signatures arises from the need for the comparison of two microbial community samples without having to do alignment with reference genomes, which are often incomplete or unavailable. For every metagenomic sample, we can obtain its *k*-tuple count vector as the sequence signature. Based on these vectors, we can use a dissimilarity measure to evaluate the difference between two samples. We analyze the relationships among multiple samples using the dissimilarity matrix composed of the dissimilarities between all pairs of samples.

The first step is to count the number of occurrences of each *k*-tuple. Reads in NGS data can come from either the forward strand or the reverse strand of a genome. Because we do not know which strand a read comes from, we consider both a read and its complement to take both strands into consideration. That is, we supplement the observed reads by their complements to form the read set on which the number of *k*-tuple occurrences are counted. For genome or metagenome data, we have a finite alphabet set *S* = {" A ", " C ", " G ", " T "} and the set of all possible k-tuples *W* = {*w* ∈ *S*
^{
k
}}. The numbers of occurrences of all the 4^{k} tuples of length *k* in all reads of a metagenome sample form the *k*-tuple count vector of the sample.

In this paper, we studied a collection of dissimilarity measures between *k*-tuple count vectors, including three *d*
_{
2
}-type dissimilarity measures *d*
_{
2
}, *d*
_{2}
^{
S
} , and *d*
_{2}
^{*}[34]. For the calculation of *d*
_{2}
^{
S
} and *d*
_{2}
^{*}, we need to centralize the real count of a tuple by deducting its expected number of occurrences under certain models for the underlying background genome sequences. We estimated the background genome sequences using Markov models of orders 0, 1, 2, and 3, respectively. In addition to the *d*
_{
2
}-type dissimilarities, we also studied the Euclidean, Manhattan, and Chebyshev distance measures for the *k*-tuple frequency vectors. We also studied a dissimilarity measure developed from Dr. Hao’s group (*Hao*) [41] and this measure uses a Markov model of order *k*-2 for the background sequences. We next describe these measures in detail.

Let

and

be the

*k*-tuple count vectors of two metagenomic samples X and Y, respectively. The

*d*
_{
2
} dissimilarity measure [

34] derived from the well-known D

_{2} statistic [

48] is

Dissimilarity measures *d*
_{2}
^{
S
} and *d*
_{2}
^{*}[34] derived from two variants of D_{2} termed *D*
_{2}
^{
S
} and *D*
_{2}
^{*}[49], were also studied. To define *d*
_{2}
^{
S
} and *d*
_{2}
^{*}, we first introduce the centralized count variables by

and

where

*p*
_{·,i
} is the probability of

*i* th

*k*-tuple under the probability model (Markov model of order

*r*=0,1,2,3) for the background sequences and

is the total count of

*k*-tuples. Then we have

The ranges of both *d*
_{2}
^{
S
} and *d*
_{2}
^{*} are between 0 and 1. When the two samples are the same, both *d*
_{2}
^{
S
} and *d*
_{2}
^{*} are 0. When the two samples are completely independent, the expected values of *D*
_{2}
^{
S
} and *D*
_{2}
^{*} are 0, and thus, the expected value of both *d*
_{2}
^{
S
} and *d*
_{2}
^{*} is 0.5. When the enriched tuples in the two samples complement each other, the value of both *d*
_{2}
^{
S
} and *d*
_{2}
^{*} is close to 0. Therefore, these measures can be used to measure the dissimilarity between the samples.

In addition to the newly developed *d*
_{
2
}-type dissimilarity measures, we also studied the standard *l*
_{
p
}-norm distance measures for *k*-tuple frequency vectors. We first normalized *k*-tuple count vectors to *k*-tuple frequency vectors whose components sum to 1,
and

where

and

. Three classical

*l*
_{
p
}-norm distances including Manhattan (

*l*
_{
1
}-norm), Euclidean (

*l*
_{
2
}-norm) and Chebyshev (

*l∞-*norm), abbreviated as

*Ma*,

*Eu* and

*Ch*, respectively, are defined to compare

*k*-tuple frequency vectors.

A dissimilarity measure developed by Qi et al. [

41] from Dr. Hao’s group is also studied. We use the corresponding author’s last name

*Hao* as the short form of this measure to simplify the notation.

*Hao* is based on the relative difference between the actual

*k*-tuple frequencies of each word with its expectation under a Markov model of order

*k*–2.

A new dissimilarity measure developed by Willner *et al*. [38] was also studied in this paper. The relative abundance odds ratio measure of dinucleotide was defined as
. The relative abundance odds ratio for tri-nucleotides was defined as
, and for tetra-nucleotides was defined as
, where *N* and *M* represent any nucleotide [32].

The new dissimilarity measures, termed

*Willner*, based on the relative abundance odds ratios for di-, tri-, and tetra-nucleotides were defined as

For

*d*
_{2}
^{
S
} and

*d*
_{2}
^{*} and the

*Hao* dissimilarity measures described above, the expected number of occurrences of each tuple needs to be calculated. To do this, we need to assume a model for the background sequence. In this study, we used Markov models of different orders (0, 1, 2, and 3 for

*d*
_{2}
^{
S
} and

*d*
_{2}
^{*} , and

*k*–2 for

*Hao*) to model the background genome sequences. Markov model of order 0 corresponds to the independent identically distributed (i.i.d.) model where genome sequences are generated as a series of independent and identically distributed random characters. Thus, for a

*k*-tuple

*w* =

*w*
_{1}
*w*
_{2} …

*w*
_{
k
}, the expected frequency under the Markov model of order 0 (M

_{0}) is also the probability of

*w* occurring under M

_{0}:

where *p(w*
_{
j
}
*)* is the probability of *w*
_{
j
} in all the reads of a metagenomic sample.

Under Markov model (

*M*
_{
r
}) of order

*1≤ r ≤ k - 2* the expected frequency is

where *p(w*
_{
1
}
*w*
_{
2
}
*…w*
_{
r
}
*)* is the initial distribution of consecutive states *w*
_{
1
}
*w*
_{
2
}
*…w*
_{
r
} in all the reads of a metagenomic sample, and *p*(*w*
_{
j + r
}|*w*
_{
j
}
*w*
_{
j + 1} … *w*
_{
j + r − 1}) is the transition probability going from consecutive states *w*
_{
j
}
*w*
_{
j + 1} … *w*
_{
j + r − 1} to state *w*
_{
j + r
}.

### Beta-diversity analysis and evaluation methods

Detection of group relationships among microbial samples and the identification of external gradients driving shifts in microbial community structure are two major types of analysis tasks in microbial community comparison. Therefore, we evaluated the performance of the *k*-tuple dissimilarity measures in metagenomic comparison by assessing how well a method detects the known group relationships or identifies known environmental gradient.

For clustering analysis, we detect group relationships among the microbial samples by applying the UPGMA (unweighted pair-group method with arithmetic means) algorithm [43] based on the between-samples dissimilarity matrices. It is a hierarchical clustering algorithm based on pair-wise dissimilarity matrix of multiple samples, using average-linkage for measuring the dissimilarity of two clusters. We used the *phangorn*[50] package in R for this algorithm. We used the TreeClimber package [25], a tool to compare clustering of microbial communities, to evaluate the resulting clustering trees by the parsimony test. The parsimony score measures how far away the clustering tree is from the true classification of the samples. If samples in the same group are clustered together in a clustering tree, the parsimony score is *c*-1 where *c* is the number of groups. As the parsimony score increases, the clustering tree becomes increasingly different from the true grouping of the samples. Monte Carlo *p-*value is used to see whether the observed parsimony score is statistically significant. Here, samples were randomly grouped to generate clustering trees 1,000 times, and the *p-*value was approximated by the fraction of times that the parsimony score for the randomized trees is smaller than or equal to the parsimony score of the observed tree.

For the study of gradient relationships among the samples, the shift of microbial samples is visualized by PCoA (Principal Coordinates Analysis) [44], which is a multi-dimensional scaling (MDS) method that converts between-sample dissimilarity matrix into two-dimensional (or three-dimensional) ordinates of samples and arranges the samples in ordinate space. We used the MASS [51] package in R for PCoA. Then, the influence of environmental gradient(s) on microbial communities can be tested by calculating correlation, such as Pearson correlation coefficient (PCC), between PCoA axis and gradient axis. In this way, the performance of the between-sample dissimilarity measures given by sequence signature methods can be evaluated if the gradient driving microbial communities is known.