All genomic DNA sequences were obtained from the NCBI genome database [29] together with information about the different organisms. Additional information can also be found in Additional file 2.

The computer programs used to generate the results were implemented as described below. The following notation is used throughout:

Let $(w_1 w_2 \ldots w_n)_i$ represent an oligonucleotide ($n$-mer), with $1 \le i \le N = 4^n$ possible combinations. The function

gives the overlapping empirical frequency of the oligonucleotide $(w_1 w_2 \ldots w_n)_i$ with respect to the DNA sequence $Z = \{w_1 w_2 \ldots w_S\}$, where $S$ is much larger than $n$.
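As a minimal illustration of the overlapping empirical frequency described above, the following Python sketch slides a window of length n one position at a time along a sequence and divides each word count by the number of windows. The function name `overlapping_frequencies`, the skipping of windows that contain characters other than A, C, G and T, and the single-strand counting are assumptions of this sketch, not details taken from the programs used in the study.

```python
from collections import Counter
from itertools import product

def overlapping_frequencies(sequence, n):
    """Return {n-mer: frequency} over all 4**n words, using overlapping windows."""
    sequence = sequence.upper()
    counts = Counter()
    for start in range(len(sequence) - n + 1):
        word = sequence[start:start + n]
        # Assumption: windows containing ambiguous bases (e.g. N) are ignored.
        if set(word) <= set("ACGT"):
            counts[word] += 1
    total = sum(counts.values())
    # Enumerate all N = 4**n words so that unobserved words get frequency 0.
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return {w: (counts[w] / total if total else 0.0) for w in words}

# Toy sequence: 9 overlapping dinucleotide windows, 3 of which are "AC".
freqs = overlapping_frequencies("ACGTACGTAC", 2)
```

Returning an entry for every possible word, including those never observed, keeps vectors from different genomes aligned word by word for later comparison.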

The hexanucleotide-based relative abundances can then be calculated as follows:

where $1 \le i \le N = 4^n$.
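The normalisation can be sketched in Python as follows. The sketch assumes that the relative abundance of a word is its overlapping frequency divided by the product of the frequencies of its constituent nucleotides. This is consistent with the description of the denominators later in this section, but it is not copied from the original equations. The helper names are hypothetical, and a word length of 2 is used in the usage line only to keep the example small.

```python
from collections import Counter

def word_frequencies(sequence, n):
    """Overlapping frequencies of the n-mers observed in `sequence`."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def relative_abundances(sequence, n=6):
    """Assumed form: freq(word) / product of freq(nucleotide) over the word."""
    mono = word_frequencies(sequence, 1)
    abundances = {}
    for word, freq in word_frequencies(sequence, n).items():
        expected = 1.0
        for nucleotide in word:
            expected *= mono[nucleotide]
        abundances[word] = freq / expected
    return abundances

# Toy example with dinucleotides; the study itself uses hexanucleotides (n=6).
rho = relative_abundances("ACGTACGTACGT", n=2)
```

A value above 1 indicates a word that occurs more often than its base composition alone would suggest, and a value below 1 a word that occurs less often.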

The genomic signature is then found by comparing two genomic DNA sequences with the Pearson correlation formula:

$N = 4^n$ designates the total number of possible DNA word combinations, with

The nucleotides $w_l$, $1 \le l \le 6$, in the denominator of equations (4) and (5) are the corresponding nucleotides in the $i$th hexanucleotide $w_1 w_2 w_3 w_4 w_5 w_6$.

represent the average hexanucleotide relative abundance values.
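For concreteness, the comparison of two genomes can be sketched as a Pearson correlation between their relative abundance vectors, taken in the same word order. The code below is the textbook formula written with the Python standard library. The function name `pearson` and the toy vectors are illustrative assumptions and do not reproduce values from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Centre both vectors on their average relative abundance.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator

# Toy relative abundance vectors for two hypothetical genomes.
genome_a = [0.8, 1.1, 1.3, 0.9]
genome_b = [1.6, 2.2, 2.6, 1.8]  # exactly twice genome_a
r = pearson(genome_a, genome_b)  # close to 1: identical pattern up to scale
```

Because the correlation is invariant to scaling, two genomes with proportional relative abundance profiles are scored as having the same signature.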

Hierarchical clustering based on Euclidean distance was performed on the resulting symmetric 867 × 867 correlation matrix. Average linkage was used to put emphasis on the closest matches based on group similarities.
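The clustering step can be illustrated with a small, naive implementation of agglomerative clustering with average linkage. The sketch assumes that each row of the correlation matrix is treated as a feature vector and that Euclidean distances are taken between rows. The function name `average_linkage` and the 4 × 4 toy matrix are hypothetical. For a matrix of the size used here, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` would normally be preferred over this cubic-time loop.

```python
from itertools import combinations
from math import dist  # Python 3.8+

def average_linkage(rows):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean Euclidean distance over all cross-cluster pairs.
            d = sum(dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))
            if best is None or d < best[2]:
                best = (a, b, d)
        a, b, d = best
        # Replace the two closest clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        merges.append((a, b, d))
    return merges

# Toy symmetric correlation matrix for four hypothetical genomes.
correlations = [
    [1.00, 0.95, 0.20, 0.10],
    [0.95, 1.00, 0.25, 0.15],
    [0.20, 0.25, 1.00, 0.90],
    [0.10, 0.15, 0.90, 1.00],
]
merges = average_linkage(correlations)
```

In this toy case genomes 0 and 1 are merged first, then genomes 2 and 3, and the two pairs are joined last.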

Oligonucleotide usage variance (OUV) can be considered as a measure of oligonucleotide frequency bias, or selection pressure on the genomic DNA composition, and was calculated according to the given formula for each chromosome:

The function $M_0((w_1 w_2 \ldots w_n)_i)$ approximates oligonucleotide frequencies with the corresponding mononucleotide frequencies:

The formula implicitly assumes that each nucleotide in the approximated $n$-mer is independent of the neighbouring nucleotides. In addition, equation (7) assumes that genomic oligonucleotide frequencies are only influenced by AT content, which means that low variance values can be interpreted as random mutations carrying little or no information. High variance values, on the other hand, mean that substantial information is carried by the oligonucleotide being approximated.
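The independence assumption can be made concrete with a short Python sketch. The zero-order approximation is written as the product of the mononucleotide frequencies of a word, as described above. The variance itself is computed here as the sample variance of the differences between observed and approximated frequencies, which is an assumption of this sketch and may differ from the exact formula used in the study. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def observed_frequencies(sequence, n):
    """Overlapping frequencies for all 4**n words (sequence assumed to be A/C/G/T only)."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than the word length")
    words = ("".join(p) for p in product("ACGT", repeat=n))
    return {w: counts[w] / total for w in words}

def zero_order_frequency(word, mono):
    """Zero-order approximation: product of the mononucleotide frequencies."""
    expected = 1.0
    for nucleotide in word:
        expected *= mono[nucleotide]
    return expected

def oligonucleotide_usage_variance(sequence, n):
    """Assumed form: sample variance of (observed - approximated) frequencies."""
    mono = observed_frequencies(sequence, 1)
    observed = observed_frequencies(sequence, n)
    deviations = [f - zero_order_frequency(w, mono) for w, f in observed.items()]
    mean = sum(deviations) / len(deviations)
    return sum((d - mean) ** 2 for d in deviations) / (len(deviations) - 1)

# A strictly alternating toy sequence: only AC and CA occur, although the
# base composition alone would also predict AA and CC.
repetitive = oligonucleotide_usage_variance("ACACACACACACACAC", 2)
```

The alternating sequence yields a clearly non-zero value because its dinucleotide usage cannot be explained by base composition alone, which is the kind of bias the measure is meant to capture.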

Linear regression analysis was performed between OUV for di-, tetra-, and hexanucleotide frequencies (response variable) and genomic AT content (predictor variable) using log transformation. $R^2$ designates the coefficient of determination, expressed as a percentage.
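A regression of this kind can be sketched with ordinary least squares in plain Python. The text does not state which variable was log-transformed, so the sketch assumes that the natural logarithm is applied to the response (OUV). The function name `fit_line` is hypothetical, and the numbers are invented for illustration only and are not results from the study.

```python
from math import log

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; also returns R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Invented values for illustration: AT fraction and OUV for five genomes.
at_content = [0.30, 0.40, 0.50, 0.60, 0.70]
ouv = [0.010, 0.012, 0.016, 0.019, 0.025]

# Assumption: the log transformation is applied to the response (OUV).
slope, intercept, r_squared = fit_line(at_content, [log(v) for v in ouv])
print(f"slope = {slope:.3f}, R^2 = {100 * r_squared:.1f}%")
```

Multiplying $R^2$ by 100, as in the print statement, gives the percentage of variance in the transformed response that is explained by AT content.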

A conditional logistic multinomial (polychotomous) regression model was fitted to assess the individual influences of the predictors (genome size, AT content, OUV, phyla, oxygen requirement, habitat, growth temperature and pathogenicity), with the cluster groups as the response variable. The AIC and McFadden $R^2$ statistics were used as indicators of the quality of the fitted model. The following multinomial logistic regression model was run in the statistical program R using the package *nnet*:

The response variable "Groups" is a categorical variable consisting of the different cluster groups (see Figure 1). The predictors Phyla, Oxygen, Habitat and Growth temperature were also categorical factors, while Size, AT and OUV were numerical predictors. The Oxygen factor consisted of the categories aerobic, anaerobic and facultative. Habitat consisted of the categories host-associated, multiple, specialized, terrestrial and aquatic, while the growth temperature factor consisted of the categories psychrophilic, mesophilic and thermophilic. This information was taken from the NCBI website http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. The regression model converged after 220 iterations. Assessment of statistical significance was carried out with the *car* package.
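The two model-quality statistics mentioned above follow directly from log-likelihoods, as the short Python sketch below shows using their standard definitions. The function names are hypothetical, and the log-likelihoods and parameter count are invented placeholders rather than values from the fitted model.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_parameters - 2 * log_likelihood

def mcfadden_r2(log_likelihood_model, log_likelihood_null):
    """McFadden pseudo-R^2: 1 - ln(L_model) / ln(L_null)."""
    return 1.0 - log_likelihood_model / log_likelihood_null

# Hypothetical log-likelihoods, not values from the study.
ll_null = -1200.0  # intercept-only model
ll_full = -450.0   # model with all predictors
k_full = 60        # number of estimated coefficients (assumed)

print(aic(ll_full, k_full))           # 1020.0
print(mcfadden_r2(ll_full, ll_null))  # 0.625
```

AIC penalises the number of coefficients, which matters for a multinomial model with several categorical predictors, while McFadden's statistic measures the improvement over an intercept-only model.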

All regression models were statistically significant with the significance level set to *p < 0.001*.