### Datasets

The input genome dataset consists of a catalogue of 2,031 completely sequenced genomes retrieved from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) in July 2012. The nucleotide sequences from the 2,031 bacterial genomes span 292 genera, 537 species and 1,246 strains. After downloading the entire bacterial genome dataset, all the plasmid sequences for the respective bacterial genomes were removed. Each of the 2,031 genomes was tagged using the first three letters of its genus and species names; in addition, the entire strain name was retained for clarity. For example, the genome Chlamydia trachomatis D/UW-3/CX was tagged as CHL_TRA_D/UW-3/CX.

Table S10 in Additional File 1 lists the entire set of 2,031 genomes and their associated statistics such as the length of the genome, the number of *n*-grams (*n* = 12) in the genome, the number of unique and common *n*-grams in the genome and the repeat ratio.

### The *n*-gram model for nucleotide representation

An *n*-gram is any subsequence of a nucleotide sequence of fixed length *n*. In the literature, these nucleotide subsequences have alternatively been called *n*-mers, oligonucleotides, oligopeptides, etc. To obtain the common and unique *n*-grams across all 2,031 bacterial genomes, all possible *n*-grams were extracted from each of the genomes in the dataset. Given a dataset of genome sequences *D*, let $d_i$ be the complete nucleotide sequence for an organism $O_i$ in *D*, where $d_i \in \Sigma^*$ and $\Sigma = \{A, G, C, T\}$ represents the set of four nucleotides. A set of *n*-grams $S_i = \{\, d_i[j..j+n-1] \mid 1 \le j \le |d_i| - n + 1 \,\}$ can then be obtained from $d_i$. Using this *n*-gram model, the following property of *n*-grams can be observed: a sequence of length $L$ contains exactly $L - n + 1$ (possibly repeating) *n*-grams.

Here it is important to note that a few of the bacterial genomes contain additional letters, namely *N*, *R*, *Y*, *W*, *M*, *S* and *K*, to account for two-base ambiguities (or, in the case of *N*, an unknown base) at a given position. For example, the letter *N* at a given position indicates an unknown base, the letter *R* indicates either *A* or *G*, the letter *Y* indicates either *C* or *T*, and so on. In addition, the letters *B*, *D*, *H* and *V* represent 3-base ambiguities. Therefore, ∑ = {*A*, *C*, *T*, *G*, *N*, *R*, *Y*, *W*, *M*, *S*, *K*, *B*, *D*, *H*, *V*}.
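The extraction step described above can be sketched as follows (a minimal illustration; the function and variable names are not from the original tool):

```python
def extract_ngrams(sequence, n=12):
    """Return all overlapping n-grams of a nucleotide sequence.

    A sequence of length L yields L - n + 1 n-grams.
    """
    return [sequence[j:j + n] for j in range(len(sequence) - n + 1)]

# Example with a short sequence and n = 4:
# "AGCTAGC" (length 7) yields 7 - 4 + 1 = 4 n-grams.
grams = extract_ngrams("AGCTAGC", n=4)
```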

### Unique and common *n*-gram profile

From the entire set of 2,031 genomes, all possible non-repeating *n*-grams (*n* = 12) were obtained. The *n*-grams from each genome were compared against the *n*-grams in the other genomes to arrive at a set of unique (present in a single genome) and common (present in multiple genomes) *n*-grams. The unique *n*-gram set includes two columns: the *n*-gram and the genome in which it is present. The common *n*-gram set includes four columns: the *n*-gram, the frequency of its occurrence in the entire dataset, the weight assigned to it by the scoring function, and the genomes in which it is present.
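The profile construction described above can be sketched as follows (an illustrative sketch only; the original tool's data structures are not specified in the text):

```python
from collections import defaultdict

def build_profiles(genomes, n=12):
    """Map each non-repeating n-gram to the set of genomes containing it,
    then split the mapping into unique (one genome) and common (several).

    `genomes` is assumed to be a dict of genome name -> nucleotide string.
    """
    occurrences = defaultdict(set)
    for name, sequence in genomes.items():
        for j in range(len(sequence) - n + 1):
            occurrences[sequence[j:j + n]].add(name)
    unique = {g: names for g, names in occurrences.items() if len(names) == 1}
    common = {g: names for g, names in occurrences.items() if len(names) > 1}
    return unique, common
```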

### Scoring function

The scoring function obtains a set of common and unique *n*-grams based on the *n*-gram model discussed above. It is parameterized with the length of the *n*-gram and the target dataset. The scoring function reads in the nucleotide sequences of each genome and generates all possible *n*-grams without any repeats. If a nucleotide sequence is of length $L$, then the total number of *n*-grams is given by $L - n + 1$. Once all the *n*-grams are generated, the scoring function compares all the *n*-grams from a genome against those from all the other genomes in the dataset, and after this comparison it determines a profile of all the common and unique *n*-grams in the dataset. All the unique *n*-grams are assigned a weight of unity, *i.e.* 1, and the common *n*-grams are assigned weights using a dampening factor that accounts for how popular the *n*-gram is with respect to the genomes present in the dataset.

For any *n*-gram $t$, the dampening factor is given by the expression $w(t) = 1 - \frac{\log g_t}{\log G}$, where $G$ denotes the total number of genomes in the dataset and $g_t$ denotes the total number of genomes in which the *n*-gram $t$ is present. This factor is similar to the term weighting discussed in our previous study [14]. The dampening factor adjusts the weights of the *n*-grams in such a way that popular *n*-grams receive a low weight and vice versa. Table S3 in Additional File 2 shows the weights assigned to a few hypothetical *n*-grams based upon their frequency of occurrence in the dataset. If an *n*-gram is present in only a single genome then its weight is unity, *i.e.* 1 ($g_t = 1$), and if it is present in all the genomes then its weight is zero, *i.e.* 0 ($g_t = G$).
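A log-normalized dampening factor satisfies the boundary conditions stated above (weight 1 for an *n*-gram found in a single genome, 0 for one found in all genomes); the exact functional form is an assumption here, and the function name is illustrative:

```python
import math

def dampening_factor(genomes_containing, total_genomes):
    """Weight an n-gram by its popularity across the reference set.

    Assumed form: w = 1 - log(g_t) / log(G). This gives w = 1 when the
    n-gram occurs in a single genome (g_t = 1) and w = 0 when it occurs
    in all G genomes (g_t = G), matching the stated boundary behaviour.
    """
    return 1.0 - math.log(genomes_containing) / math.log(total_genomes)
```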

### Model building

The model-building step involves indexing the entire set of common and unique *n*-grams and assigning an appropriate weight to each *n*-gram based on its frequency profile across the reference genome set. For model building, our tool considers either the entire set (100%) or a partial set (75%, 50% or 25%) of non-repeating *n*-grams from each genome. When building the model from a partial set, non-repeating *n*-grams are randomly selected from each genome, and the number of *n*-grams selected is proportional to the genome's size.
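The partial-set sampling can be sketched as follows (an illustrative sketch; function name and the optional seed are not from the original tool):

```python
import random

def sample_ngrams(ngrams, fraction, seed=None):
    """Randomly select a fraction (e.g. 0.75, 0.50 or 0.25) of a genome's
    non-repeating n-grams. Because the sample size is a fixed fraction of
    each genome's n-gram set, larger genomes contribute proportionally
    more n-grams to the model."""
    rng = random.Random(seed)
    k = round(len(ngrams) * fraction)
    return rng.sample(list(ngrams), k)
```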

Model building is a crucial step in MetaID and is also a time-consuming process. Whenever new genomes are added to the dataset, or a completely different community (viral, fungal, archaeal, etc.) is included, the model-building step has to be carried out again; this update process can therefore be scheduled at periodic intervals. Moreover, the model-building step in our tool is an offline process.

### Repeat ratio

While harvesting the *n*-grams (*n* = 12) from the reference genomes, we observed that a large number of *n*-grams tend to re-appear. We therefore introduced a parameter, the "repeat ratio", to account for the abundance of repeated *n*-grams in each genome. The repeat ratio is the fraction of repeated *n*-grams among the total number of *n*-grams in the genome, represented here as a percentage. Table S11 and Figure S1 (in Additional File 2) present a histogram of the repeat ratio distribution across the 2,031 bacterial genomes. Across these genomes, the repeat ratio ranged widely, from 0.85% to 71.53%. Only a small fraction of the genomes, *i.e.* 3.3%, have repeat ratios below 10%, while about 69.2% of the genomes have a repeat ratio between 25% and 60%. In total, nearly 99.6% of the genomes have repeat ratios ranging from 10% to 70%. The mean and standard deviation of the repeat ratios across the 2,031 bacterial genomes were 27.57 and 12.52, respectively.
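The repeat ratio defined above can be sketched as follows (illustrative names; the ratio is the fraction of *n*-gram occurrences that repeat an *n*-gram already seen in the same genome):

```python
def repeat_ratio(sequence, n=12):
    """Percentage of n-gram occurrences in a genome that are repeats,
    i.e. (total occurrences - distinct n-grams) / total occurrences."""
    total = len(sequence) - n + 1
    distinct = len({sequence[j:j + n] for j in range(total)})
    return 100.0 * (total - distinct) / total
```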

### Testing and identification (classification)

Though the objectives behind our testing and identification (classification) steps are the same, there is a subtle difference between them. For testing we consider 1%, 3%, 5%, and 7% of the non-repeated *n*-grams randomly chosen from each genome and try to identify their origin. In contrast, for identification we consider the entire set of metagenomic reads to harvest all possible *n*-grams (*n* = 12) and try to determine the constituent organisms in a given community.

Let us consider $R = \{r_1, r_2, \ldots, r_p\}$ as a set of *n*-grams obtained from the reads or randomly selected from the genome, and $G = \{g_1, g_2, \ldots, g_q\}$ as the set of genomes present in the database. We define a mapping from *R* to *G* as $f : R \to G$ in which all the elements in the domain *R* map to a single element in the co-domain *G*, *i.e.* $f(r) = g_y$ for every $r \in R$, where $g_y$ is the single element of the range in the co-domain *G* and $g_y \in G$. To obtain a mapping from *R* to *G* we construct a $p \times q$ matrix of the form $M = [m_{ef}]$, where $e = 1, \ldots, p$ indexes the rows and $f = 1, \ldots, q$ indexes the columns in the $p \times q$ matrix, and $m_{ef}$ represents the weight assigned to an *n*-gram $e$ that is present in genome $f$, or 0 if the *n*-gram $e$ is not present in the genome $f$. In the above-mentioned $p \times q$ matrix we define $C_z = \sum_{e=1}^{p} m_{ez}$ as the sum of all the elements in the column $z$. After computing the sum of each column in the matrix we arrange all the column sums in descending order. We then associate the set *R* with the genome $g_y$ provided that $C_y \ge C_z$ for all $z = 1, \ldots, q$.

In summary, after obtaining the *n*-grams from the reads or from the genomes, we construct a matrix with rows representing the *n*-grams and columns representing the entire set of genomes in the dataset. We then replace each matrix entry with the weight of the *n*-gram in the corresponding genome; if an *n*-gram is not part of the genome, that entry is set to zero, *i.e.* 0. After filling the matrix entries, we determine the column sum for each genome, identify the highest column sum, and associate (map) the entire set of *n*-grams to that genome.
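The matrix construction and column-sum mapping above can be sketched as follows (an illustrative sketch; the `weight` dictionary keyed by (*n*-gram, genome) stands in for the model's weight store, and missing entries count as 0, as in the matrix above):

```python
def classify(ngrams, genomes, weight):
    """Map a set of n-grams to the genome with the highest column sum.

    `weight` is a dict of (ngram, genome) -> model weight; an absent
    entry means the n-gram is not in that genome and contributes 0.
    """
    column_sums = {
        genome: sum(weight.get((g, genome), 0.0) for g in ngrams)
        for genome in genomes
    }
    # The genome whose column sum is largest receives the whole set.
    return max(column_sums, key=column_sums.get)
```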

It is important to note that in the identification step we try to map a set of reads to a genome instead of mapping each single read to a genome. This is because it is hard to classify each single read to a genome, owing to the intensive computation involved and the lack of discriminatory signals in individual reads. To ensure a successful classification, we compared our classification results against the classifications performed by MetaSim.

### MetaSim reads

Metagenomic reads for our mock-staggered community were obtained using the MetaSim simulation tool. On parameterizing MetaSim with the genomes, their abundance profile, the empirical error model (Table S12 in Additional File 2) and the total number of reads to be generated, MetaSim produces a set of reads against each genome. For our mock-staggered community, MetaSim generated about 3 million 100 bp paired-end reads. Table S13 in Additional File 2 shows the parameter settings used in MetaSim for constructing the mock-staggered community and the details of the simulation output.

### Mock communities

Two different mock communities were used in this study. The first is a mock-even community constructed from two datasets, HC1 and HC2, obtained from the MetaPhlAn website (http://www.huttenhower.org/metaphlan). The original datasets consisted of 100 genomes each, with an equal abundance of 1%. From these datasets, we constructed a mock-even community of 167 microbial genomes (72 from HC1 + 95 from HC2) that are also present in our set of 2,031 reference genomes. The entire set of reads pertaining to these 167 genomes was included in our community to ensure that their abundances are equal, *i.e.* 1%. We eliminated the remaining 33 genomes either because they were absent from our dataset or because no appropriate mapping was found between the KEGG IDs in HC1 and HC2 and the NCBI names in our database.

Second, we constructed a mock-staggered community by randomly choosing 100 microbial genomes out of the 2,031 genomes in our dataset. The final mock-staggered community included genomes with sizes varying from 641,770 to 9,033,684 bp and with repeat ratios ranging from 7% to 63%. For this community, we randomly assigned each genome an abundance value between 0.1% and 10%, totaling 100% (Table S9 in Additional File 1).

### Abundance estimation

Considering a set of reads from a genome, we harvested all possible non-repeated *n*-grams (*n* = 12) and mapped them against their reference genome. Upon mapping, we counted the total number of *n*-grams in common (the intersection) between the reads and the reference genome. We determined the relative "Observed Abundance" for a genome as the ratio of its non-repeated *n*-gram count to the total sum of the non-repeating *n*-grams of all the genomes present in the community, multiplied by the total number of genomes in the sample. After determining the observed abundances, we noticed that genomes with extreme repeat ratios, *i.e.* above 50% or below 15%, tended to be over- or under-estimated, respectively. Therefore, to correct the observed abundances, we either subtract the first standard deviation of the repeat ratios of the 2,031 genomes from the mean of those repeat ratios, or add it to the mean. If the repeat ratio of a genome lies between 15% and 50%, the mean of the repeat ratios of the 2,031 genomes is used as such. The corrected abundance for a genome is thus reported based on its repeat ratio, using

$$
\text{correction term} =
\begin{cases}
\mu - \sigma, & \text{if repeat ratio} > 50\% \\
\mu + \sigma, & \text{if repeat ratio} < 15\% \\
\mu, & \text{otherwise}
\end{cases}
$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the repeat ratios for the 2,031 genomes present in our dataset (Table S10 in Additional File 1). Note that the mean and standard deviation of the repeat ratios will change with the addition or elimination of genomes in the dataset.

From Figure S1 (Additional File 2), we noticed that most of the genomes have repeat ratios between 15% and 50%. Therefore, when correcting the abundances ("Corrected Abundance"), we subtract one standard deviation from the mean for those genomes whose repeat ratio is above 50% and add one standard deviation to the mean for those genomes whose repeat ratio is below 15%. For the genomes with repeat ratios between 15% and 50%, the mean of the repeat ratios is used as such. We report abundance estimates for any given community in percentages, summing either to 100% for the entire community or to the number of microbial species in the community. Therefore, if the corrected abundances do not add up to 100% (or to the number of species), we report the "Estimated Abundance", which is normalized to 100% (or to the number of species in the community).
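The repeat-ratio-dependent correction described above can be sketched as follows. Only the piecewise choice of mean ± standard deviation is shown, since how this term enters the final corrected-abundance expression is not fully specified in the text; the function name is illustrative, and the defaults use the reported dataset-wide values (mean 27.57, standard deviation 12.52):

```python
def correction_term(repeat_ratio, mean=27.57, std=12.52):
    """Piecewise term used when correcting observed abundances:
    mean - std for high-repeat genomes (> 50%), mean + std for
    low-repeat genomes (< 15%), and the mean itself otherwise.
    Defaults are the values reported for the 2,031-genome dataset
    and must be recomputed if the reference set changes."""
    if repeat_ratio > 50:
        return mean - std
    if repeat_ratio < 15:
        return mean + std
    return mean
```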

### Performance metrics

We report the standard performance measure of accuracy as a percentage. Accuracy is defined as the ratio of the number of entries (genomes) that have been correctly identified to the number of entries under consideration. In some cases, wherever we have information about specificity and sensitivity, we report balanced accuracies, *i.e.*

$$
\text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2}.
$$
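The balanced accuracy used here follows the standard definition, the arithmetic mean of sensitivity and specificity:

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy as the mean of sensitivity and specificity;
    the result is a percentage when the inputs are percentages."""
    return (sensitivity + specificity) / 2.0
```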