Metabolic classification of microbial genomes using functional probes
© Lee et al; licensee BioMed Central Ltd. 2012
Received: 12 October 2011
Accepted: 27 April 2012
Published: 27 April 2012
Microorganisms able to grow under artificial culture conditions comprise only a small proportion of the biosphere's total microbial community. Until recently, scientists have been unable to perform thorough analyses of difficult-to-culture microorganisms due to limitations in sequencing technology. Modern techniques have dramatically increased sequencing rates and rapidly expanded the number of sequenced genomes. Alongside traditional taxonomic classifications, which focus on the evolutionary relationships of organisms, classifications of genomes based on alternative points of view may therefore help advance our understanding of the delicate relationships among organisms.
We have developed a proteome-based method for classifying microbial species. This classification method uses a set of probes comprising short, highly conserved amino acid sequences. For each genome, in silico translation is performed to obtain its proteome, from which a probe-set frequency pattern is generated. The probe-set frequency patterns are then used to cluster the proteomes/genomes.
Features of the proposed method include a high running speed when handling a large number of genomes and high applicability to organisms with incomplete genome sequences. Moreover, the probe-set clustering method is sensitive to the metabolic phenotypic similarities/differences among species and thus has potential for the classification or differentiation of closely related organisms.
Owing to new sequencing technologies, the number of microorganisms with completely or partially determined genomic sequences is rapidly increasing, including many species that cannot be artificially cultured and many new/unknown species collected from environmental samples. As the amount of genomic information increases, interdependent relationships between species (e.g., the symbiotic partnership between bacteria and host) and the survival strategies of certain microbes in their living environments (e.g., Archaea and Bacteria living in hot springs) become particularly interesting. As a result, the microbiology field has gradually expanded its focus from microbial clones to microbial communities, and new research fields such as metagenomics have formed accordingly. Because species are no longer studied through clonal isolates, the first question encountered by microbial genomics researchers looking at a heterogeneous population is often "Who is there?". A number of short biomarkers, such as the 16S rRNA genes, exhibit detectable sequence variations within a largely conserved framework between species and can be used both to identify individual species within a community and to infer their phylogenetic relationships [2–9]. However, the analysis of short sequences has drawbacks. These markers generally comprise just a small proportion of an organism's genome; for example, 16S rDNA contributes less than 0.2% of the bacterial genome. Previous studies have suggested that the lack of metabolic information in this small amount of genetic material renders it insufficient to describe the way of life of an entire organism or species [10, 11].
Many studies have used existing metabolic databases, e.g., Kyoto Encyclopedia of Genes and Genomes (KEGG), to understand metabolic relationships between organisms and to construct complete relationship trees [12, 13]. Interestingly, although relationship trees constructed using metabolic data are generally consistent with existing phylogenetic trees, there are important differences in the details [14–17]. In some ways, these relationship trees more effectively explain the survival strategies that organisms have developed to handle unique metabolic relationships, such as how symbiotic bacteria share metabolites with hosts [14, 18, 19]. However, these methods still have many basic shortcomings, including the dependence on complete quantitative information about metabolites, difficulty in defining reactants and intermediates, heavy reliance on human annotation, and the requirement to deal with excessively complex metabolic data. Many of the problems associated with constructing organismal relationships from metabolic data may be avoided by using a proteomic approach. Since proteins are the basic functional units of biological systems, construction of proteomic trees may prove effective in describing the metabolic relationships between species and in reconstructing phylogenetic relationships [21–25].
Microorganisms were chosen as the initial test subjects for our proteome-based classification method. Unlike complex multicellular organisms, most microorganisms are unicellular and structurally simple. In addition, most microbial intracellular proteins react directly to external stimuli. While there are differences among related proteins, functionally critical protein domains are typically highly conserved. This study used a common set of conserved protein sequences, called a probe-set, to determine relationships between organisms, an analysis that could be labelled "seeking commonality among variation". Then, "seeking variation among commonality", differences in conserved sequences between organisms were used to categorize individual organisms. Currently, the genomes of more than one thousand microorganism strains have been sequenced [27, 28]. The habitats of these microorganisms vary greatly, from symbiotic environments to extreme ecosystems. Since ample biochemical and metabolic data concerning these microorganisms are available, they were selected for the initial development and evaluation of our method.
We first verified that the proposed probe-set method could identify differences between enzymes and between metabolic pathways. Next, we demonstrated that the method could accurately differentiate host-associated from free-living bacteria. Finally, using sequence data from hundreds of microorganisms, we constructed a large-scale relationship tree. Several factors contribute to the success of the probe-set method in clustering microorganisms that share unique metabolic relationships, that coexist in extreme environments, or that possess extraordinary metabolic capabilities (e.g., green sulfur and photosynthetic bacteria). The probe-set method is essentially a compositional analytical method that avoids the disadvantages of conventional sequence-based classification methods, including the difficulty of classifying sequences containing exchanges or recombinations [29–32]. An important advantage of the probe-set method is its ability to classify organisms whose genomes have not been completely sequenced. This should make the method feasible for metagenomics studies, which generally involve incomplete and poorly annotated genomic sequences. In addition, the proposed method is able to detect metabolic differences between organisms with very close evolutionary relationships. Finally, to compare trees generated by the probe-set method with other kinds of classification trees, a tree topology comparison method was developed in this work. This tree comparison method is useful for large-scale phylogenetic tree comparison and could serve as a standard evaluation approach for future tree-building methods.
We used short, highly conserved, amino acid sequence fragments to define the proteomic probe-set. For implementation, the Prosite descriptors (peptide fragments mainly ranging from 10 to 20 residues in length) were selected as probes in this work. Each Prosite descriptor represents a conserved protein sequence extracted from a number of proteins with similar biological functions. A descriptor can be considered a consensus of a group of text strings that occur frequently at the functional site(s) of a certain kind of enzymes, transporters, receptors, etc. For example, descriptor PS00100, which represents the sequence pattern commonly found at the active site of chloramphenicol acetyltransferase, can be written as Q-[LIV]-H-H-[SA]-x(2)-D-G-[FY]-H. Sequences such as Q-L-H-H-S-G-G-D-G-F-H and Q-V-H-H-A-G-G-D-G-Y-H match this pattern descriptor, or probe. If a probe is found in a protein, that protein generally has the function represented by the probe. Although a single probe only describes a small fraction of the functions of a proteome/genome, a reasonable representation can be obtained by using a sufficiently large collection of probes. This is the core concept of the proposed method, which utilizes approximately 1,000 probes. To determine relationships among a chosen set of organisms, probe frequency patterns of the 1,000 probes were calculated for each proteome. These frequency patterns were then clustered to construct the classification tree. The probe-set method is thus a global-scale method that disregards noise in the system (i.e., non-conserved protein sequences) and focuses on conserved protein sequences that more accurately describe an organism's functional capacities.
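The probe-matching step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy proteome, and two design choices (counting overlapping matches, and reporting raw counts without any normalization) are assumptions made here for illustration. The converter handles only the basic Prosite pattern syntax (`-` separators, `x`, `[...]`, `{...}`, repetition counts, and `<`/`>` anchors at the pattern ends).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")        # drop separators and trailing period
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {AB} = any residue except A or B
    regex = regex.replace("(", "{").replace(")", "}")   # x(2) or x(2,4) = repetition
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

def count_probe(regex: str, protein: str) -> int:
    """Count match start positions; the lookahead also counts overlapping hits."""
    return sum(1 for _ in re.finditer(f"(?=(?:{regex}))", protein))

def frequency_pattern(probes: dict, proteome: list) -> dict:
    """Return {probe_id: total hits over all proteins} (raw counts, an assumption)."""
    compiled = {pid: prosite_to_regex(pat) for pid, pat in probes.items()}
    return {pid: sum(count_probe(rx, prot) for prot in proteome)
            for pid, rx in compiled.items()}

# PS00100 is the descriptor quoted in the text; the proteome below is invented.
probes = {"PS00100": "Q-[LIV]-H-H-[SA]-x(2)-D-G-[FY]-H"}
toy_proteome = ["MKQLHHSGGDGFHAA", "MMQVHHAGGDGYHQLHHSAADGFH", "MAAAA"]
print(frequency_pattern(probes, toy_proteome))
```

With roughly 1,000 probes, the resulting dictionary would play the role of one organism's probe-set frequency pattern.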
It is perhaps not surprising that enzymes with different functions have different probe-set frequency patterns. We next asked whether entire metabolic pathways could be distinguished using this method. This task is inherently more challenging and more relevant, as pathways are not composed of proteins with the same function, but of proteins with multiple functions that act cooperatively. For example, according to the KEGG database (the most extensive pathway database at present), a typical glycolysis/gluconeogenesis pathway requires 49 enzymes, including 15 EC1, 14 EC2, 6 EC3, 4 EC4, 8 EC5, and 2 EC6 enzymes. Five major metabolic pathways are commonly shared among organisms: the carbohydrate (P1), energy (P2), lipid (P3), nucleotide (P4), and amino acid (P5) pathways. As shown in Figure 1b, different metabolic pathways possessed different probe-set frequency patterns, and the CC values between pathways were low. These data indicate that the probe-set method can effectively discriminate between major metabolic pathways. Notably, metabolic reactants and intermediates were not considered in this analysis; therefore, the discriminatory power of the probe-set method should not be significantly affected by unclear definitions of reactants and intermediates (see Background).
Within the Bacteria superkingdom, the Proteobacteria phylum was divided into two groups by the probe-set method (see the red sectors in Figure 4). This grouping differs from regular taxonomic classifications, according to which Proteobacteria species should be clustered in the same group. The first group consisted of organisms mainly from the phyla Proteobacteria, Spirochaetes, and Chlamydiae; most organisms in this group have parasitic/symbiotic characteristics. The second group comprised many species characterized by their unique ways of energy production, such as photosynthesis (Cyanobacteria and green sulfur bacteria) and chemosynthesis (chemoautotrophs, chemolithoautotrophs, and hydrogen sulfur oxidization species). These results imply that, at taxonomic ranks lower than class, the probe-set method tends to classify species according to their biological and metabolic characteristics. Since, in the example of Proteobacteria, the characteristics were acquired due to the living environments of the organisms (e.g., hot-spring water), it is expected that our classification method can help identify organisms living in similar environments and provide information about how they survive in and interact with their environments.
As expected, in the Archaea superkingdom (see the lower part of Additional file 1), we also found that the similarities of living environment exerted important effects on the classification results of our method. Phylum Euryarchaeota, for instance, were divided into two major groups that were consistent with the taxonomical classes of those organisms (e.g., the Crenarchaeota class and Euryarchaeota class), but the grouping of organisms in each class did not exactly follow traditional taxonomic classification; instead, halophiles (living in high salt concentration environment), thermophiles (living in high temperature environment) and methanophiles (using methane as carbon and energy source) were respectively clustered together.
From these results, the probe-set method is able to reconstruct traditional phylogenetic classifications from a proteomic perspective and detect non-phylogenetic commonalities in organisms that have adapted unique biochemical capabilities. We believe that this is because the conserved sequences of a proteome can reflect the biological characteristics of an organism more accurately than genomic DNA.
Genome sequencing methods often split contiguous sequences into thousands of fragments that must be recombined in the correct order according to overlapping regions. Reconstructing an entire genome is not trivial, and many publicly available genomic sequences are still incomplete. Compositional analysis, which is the main concept behind the proposed probe-set method, represents a possible solution for analyzing incomplete genomic sequences. In this study, we simulated two scenarios to assess the reliability of the probe-set method in dealing with incomplete genomes.
Next, we performed a more stringent assessment, in which each of the 87 species was randomly truncated before reconstructing the classification tree. The topological similarity between the reconstructed tree and the reference tree was quantified as a CC value as described in the above subsection. As shown in Figure 8b, the probe-set method does not require the full amount of information to obtain good classification results; the CC value for the reconstructed tree with 50% truncated genomes was ~0.78. The high accuracy and CC values obtained in these truncation tests imply that the probe-set method is a promising and convenient tool for microbial taxonomy, particularly for species whose genomes have not been completely sequenced.
Due to the immense diversity of microbial morphologies distributed in various living environments, classification strategies for microorganisms based on wet-lab techniques may be costly, time-consuming and thus inefficient when compared with strategies based on computation. Classification methods with different properties and emphases have been proposed. Phenetic classification methods classify microbes according to measurable features such as cell shape, staining properties, and metabolic characteristics. Proteomic comparisons based on two-dimensional polyacrylamide gel electrophoresis have also been applied to distinguish closely related species. Biomarkers, e.g., the 16S rRNA genes, cytochrome c, and ATPases, are used as molecular clocks to elucidate the evolutionary history of species. Approaches such as whole genome alignment and gene ordering analysis construct phylogenetic trees from a genomic point of view. The proposed probe-set method focuses on expressible information contained in the genome, and thus its classification emphasizes the functional relationships among species. In order to comprehensively understand the evolutionary relationships among organisms, several attempts have been made to combine all existing classification methods, such as the polyphasic taxonomy introduced by Colwell and refined by Vandamme et al. Polyphasic taxonomy studies classify organisms based on their phenotypes, genotypes, and chemotaxonomic characteristics, and their results suggest that any single feature or biomarker is insufficient to properly classify organisms at every level of taxonomy. Since the proposed probe-set method utilizes almost all currently known consensus protein sequence patterns to perform classification, its combinative nature may make it a good alternative and comprehensive way to classify microorganisms.
In our large-scale proteomic tree, the positions of several species differed significantly from their positions in the traditional taxonomic classification tree, which is established according to the similarities of 16S rRNA genes. Compared with classifications based on 16S rRNA genes, our method tends to classify organisms according to the similarities of their metabolic capabilities (see Results, Figures 2 and 3). A metabolic capability of an organism is made possible by the cooperation of many genes, whose presence, we supposed, may be revealed by the composition of the organism's genome. Indeed, as shown in Figure 1b, the probe-set compositions of various metabolic pathways can be very different. Since the proposed probe-set method classifies organisms according to their genomic compositional differences, this method should be able to detect the genomic differences correlated with the metabolic capabilities of organisms.
The probe-set method compares genomes by transforming the coding sequences into probe frequency patterns and then clustering them. The clustering of these frequency patterns is much simpler than performing multiple sequence alignment of whole microbial genomes. The probe-set classification takes only 1.3 minutes to build a large proteomic tree containing 843 organisms on a computer with an Intel Xeon 2.13 GHz processor and 3 GB memory. This high speed makes the proposed method well suited to large-scale classification of microorganisms. Moreover, the proposed method compares organisms based on information extracted from their whole genomes and thus considers more function-related characteristics than traditional phylogenetic analysis strategies do. The example in Figure 5 demonstrates that Lactobacteria, whose functional differences cannot be detected by the traditional 16S rRNA-based method, are classified according to their fermentative capabilities.
Computing the probe frequencies uses only the coding regions of a genome sequence. It is now possible to obtain these coding sequences directly by using next generation sequencing technologies (NGS). Instead of completed whole genome sequences, the contigs of genomes or transcriptomes assembled by NGS can be the source data for the probe-set method. There are more than two thousand whole genome sequencing (WGS) projects, and over half of the genomes handled by them are still in contig form (http://www.ncbi.nlm.nih.gov/genbank/wgs.html). This situation probably arises because sequence assembly is a highly complex problem and may be the current bottleneck of WGS. Without well-assembled genome sequences, conventional whole genome comparison methods, such as whole genome alignments, might not be applicable. We have shown that the probe-set method can classify incomplete genome sequences with high accuracy (Figure 8). In addition, the method is order-independent: even if the order of the contigs of a genome is unknown, the probe-set frequencies of the genome can still be accurately obtained. Thus, as the NGS and WGS fields continue to increase the number of microorganisms being sequenced, the proposed method can be a useful tool for the phylogenetic or functional analysis of organisms with or without complete genome sequences.
The tree comparison method designed in this study measures the topological similarity between classification trees. We suggest that this method could be used as a standard procedure for tree comparisons. Tree comparisons are needed both to measure the similarity/difference between phylogenetic trees constructed with different methods and to quantify how well a tree reconstructs the phylogenetic, metabolic or community relationships among organisms. Previously, such comparisons were often done manually, without a quantitative measure, and manual comparison is applicable only when the number of organisms is small. The tree comparison method that we developed is applicable to large trees with hundreds, thousands or more organisms, facilitating the development and evaluation of future tree construction methods. For instance, in the HGT-removing experiment (Figure 6), where trees constructed with and without HGT genes were subjected to our tree comparison procedure, the similarities between trees were clearly revealed.
At present we utilized all available probes provided by the Prosite, except for the highly frequent ones (as annotated by Prosite) to develop our probe-set based classification method. However, it is likely that some probes contribute little to the classification power of the proposed method. We have planned to establish a reduced version of the current probe-set by perhaps removing some highly infrequent probes or performing some critical factor analyses to identify the probes that exert major effects on the classification power. In addition to classifying microorganisms, the probe-set may also be applied to the identification of microorganisms with specific metabolic properties or phenotypes. For example, two microbial groups with opposite biological characteristics, e.g., nitrogen-fixing and non-nitrogen-fixing organisms, can be put together to compare their probe frequencies. The probes with significant difference in occurring frequency between the two groups may serve as good markers for detecting the nitrogen-fixing ability of other organisms.
So far, we have not considered the non-coding regions of a genome when implementing the probe-set method. However, many regulatory sequences in a genome are not translated into proteins yet may still be functionally and evolutionarily important. For instance, some RNAs, such as ribozymes, act as enzymes. The 5' and 3' untranslated regions of messenger RNAs often contain sequence-conserved regulatory elements. Integrating these regulatory sequences and functional non-coding RNAs into the probe-set would expand the source of information for species classification from the proteome level to the transcriptome level. This expansion is expected to improve the comprehensiveness of the classifications.
A classification method has been proposed to classify microbial genomes. A set of probes (i.e., conserved amino acid sequences with known biological functions) is used to encode microbial genomes into frequency patterns. The classification is achieved by hierarchical clustering of these frequency patterns. The method is a kind of compositional analysis that is computationally inexpensive and highly fault tolerant. It can classify hundreds of genomes in minutes. Its classification results agree well with the phylogenetic relationships of microorganisms at higher classification levels, and clearly reflect the functional similarities among microorganisms at finer classification levels. Importantly, complete genome sequences are not a prerequisite for our method to obtain reliable results. In this post-genomic era, when the amount of genome sequence data is increasing so rapidly, the high efficiency and novelty of the proposed method make it feasible for large-scale classifications of microorganisms and phylogenetic studies of species with similar metabolic properties or incomplete genome sequences.
All proteomes were downloaded from the NCBI RefSeq server (http://www.ncbi.nlm.nih.gov/refseq/). Taxonomic data were retrieved from the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/). Amino acid sequence pattern descriptors provided by the Prosite database (http://au.expasy.org/prosite/) were utilized as the probes. Highly frequent sequence patterns, as annotated by Prosite, were eliminated. The enzymatic categories, protein sequences and metabolic pathway information of enzymes were downloaded from the KEGG database (http://www.genome.jp/kegg/). The metabolic, biochemical and environmental characteristics of organisms were obtained from the GOLD database (http://www.genomesonline.org/). Horizontally transferred genes were downloaded from the HGT database (http://genomes.urv.cat/HGT-DB/). There are 415 microorganisms recorded in the HGT database; all of these organisms were used in our HGT removal experiment.
For each "class" containing three or more organisms, three organisms were randomly selected to perform the random truncation experiment (Figure 8). For each selected organism, the contig sequences of its genome were randomly removed up to a given truncation rate. At every truncation rate, the random removal was performed 10 times; after each removal, a classification tree of the selected organisms was reconstructed and compared with the tree constructed without the random removal of coding sequences. The correlation coefficient values (see the last subsection) shown in Figure 8 are the averages of the 10 repeated experiments.
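The random truncation procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the original code: it assumes that whole contigs are dropped at random until the removed length reaches the truncation rate, and the names `truncate_contigs`, `truncation_experiment`, `build_tree` and `compare_trees` are hypothetical. The tree-building and tree-comparison steps are passed in as callables so that any implementation can be plugged in.

```python
import random
from statistics import mean

def truncate_contigs(contigs, rate, rng):
    """Drop randomly chosen contigs until at least `rate` of the total length is gone."""
    total = sum(len(c) for c in contigs)
    order = list(range(len(contigs)))
    rng.shuffle(order)
    removed, dropped = 0, set()
    for idx in order:
        if removed >= rate * total:
            break
        dropped.add(idx)
        removed += len(contigs[idx])
    # Keep the surviving contigs in their original order.
    return [c for i, c in enumerate(contigs) if i not in dropped]

def truncation_experiment(genomes, rate, build_tree, compare_trees, repeats=10, seed=0):
    """Average similarity between the reference tree and trees built from truncated genomes.

    genomes: {organism name: list of contig sequences}
    build_tree, compare_trees: user-supplied callables (hypothetical placeholders).
    """
    rng = random.Random(seed)
    reference = build_tree(genomes)
    scores = []
    for _ in range(repeats):
        truncated = {name: truncate_contigs(contigs, rate, rng)
                     for name, contigs in genomes.items()}
        scores.append(compare_trees(reference, build_tree(truncated)))
    return mean(scores)
```

Fixing the seed makes a run reproducible; the 10 repeats per truncation rate mirror the description above.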
where $A_i$ and $B_i$ represent the frequencies of probe $i$ in organism A and organism B, respectively, and $n$ denotes the number of probes.
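For orientation, one common form of correlation coefficient that fits these symbol definitions is the Pearson correlation coefficient. Whether this is exactly the CC used here is an assumption; the symbol names $A_i$, $B_i$, $\bar{A}$ and $\bar{B}$ are chosen for illustration, with $A_i$ and $B_i$ the frequencies of probe $i$ in organisms A and B and $n$ the number of probes:

```latex
\mathrm{CC}(A,B) =
  \frac{\sum_{i=1}^{n} (A_i - \bar{A})(B_i - \bar{B})}
       {\sqrt{\sum_{i=1}^{n} (A_i - \bar{A})^2}\;\sqrt{\sum_{i=1}^{n} (B_i - \bar{B})^2}},
\qquad
\bar{A} = \frac{1}{n}\sum_{i=1}^{n} A_i, \quad
\bar{B} = \frac{1}{n}\sum_{i=1}^{n} B_i
```

Under this form, CC ranges from -1 to 1, and two organisms with proportional probe frequency patterns obtain a CC of 1.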
Each frequency value was standardized within its column as $f_i' = (f_i - \mu)/\sigma$, where $f_i$ and $f_i'$ respectively represent the raw and standardized frequency values of organism $i$, and $\mu$ and $\sigma$ are the mean and standard deviation of the frequency values for all organisms in the column. A cell with a zero $f_i'$ is colored black. Positive and negative $f_i'$ values are represented by red and green colors, respectively, and the brightness of the color is proportional to the absolute value of $f_i'$.
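The standardization and color coding can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the usual z-score, (raw value minus column mean) divided by column standard deviation, uses the population standard deviation, and saturates the brightness at a hypothetical cutoff `z_max`; none of these details is confirmed by the text.

```python
import statistics

def standardize(column):
    """Z-score each raw frequency in one probe column (one value per organism)."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0] * len(column)     # constant column: every cell is drawn black
    return [(f - mu) / sigma for f in column]

def cell_color(z, z_max=3.0):
    """Map a standardized value to an (R, G, B) tuple; z_max is a hypothetical cutoff."""
    brightness = round(255 * min(abs(z) / z_max, 1.0))
    if z > 0:
        return (brightness, 0, 0)      # above the column mean: red
    if z < 0:
        return (0, brightness, 0)      # below the column mean: green
    return (0, 0, 0)                   # exactly at the mean: black

column = [1, 2, 3]                     # invented raw frequencies of one probe
print([cell_color(z) for z in standardize(column)])
```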
When performing classification, the color-coded rows were clustered in a way described in the next subsection. After the clustering, similar color-coded rows should be placed close to one another.
In this study, the probe frequency patterns of organisms were clustered by the CLUSTER 3.0 program (http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm) to obtain the classification tree. The Spearman rank correlation, a nonparametric version of the Pearson correlation coefficient, was used as the distance measure between probe-set patterns. This measure is more robust than the Pearson correlation coefficient against the effects of outliers. Once the distance matrix had been calculated, CLUSTER 3.0 applied the average linkage clustering algorithm, in which the average of the pairwise distances between two clusters is used to build the hierarchical clustering tree.
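The clustering step can be illustrated with a small pure-Python sketch. It is not CLUSTER 3.0 code; it assumes the distance is 1 minus the Spearman rank correlation (computed as the Pearson correlation of average ranks) and uses a naive average-linkage loop that is adequate only for small examples. All function names and the toy frequency patterns are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_distance(x, y):
    """Distance = 1 - Spearman rank correlation (0 = same ranking, 2 = reversed)."""
    return 1.0 - pearson(average_ranks(x), average_ranks(y))

def average_linkage(names, patterns):
    """Merge the two clusters with the smallest mean pairwise distance until one remains."""
    n = len(names)
    dist = {(i, j): spearman_distance(patterns[i], patterns[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [(names[i], [i]) for i in range(n)]  # (subtree, member indices)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(min(i, j), max(i, j))
                         for i in clusters[a][1] for j in clusters[b][1]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]  # nested tuples describing the tree

names = ["A", "B", "C", "D"]
patterns = [[5, 3, 0, 1], [6, 4, 0, 2], [0, 1, 5, 3], [0, 2, 6, 4]]  # invented counts
print(average_linkage(names, patterns))
```

In this toy input, A and B rank the four probes identically, as do C and D, so the two pairs are merged first and joined at the root.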
Each organism in a classification tree can be considered a leaf of the tree. A traveling distance method is proposed here to describe the distance between leaves in the same rooted tree. For any two leaves, there exists a lowest common ancestor, as shown in Figure 7a. Given two leaves X and Y, if m and n nodes must be traversed from X and Y, respectively, before reaching their lowest common ancestor, then the traveling distance between X and Y is computed as m + n. For a tree with N organisms, an N × N leaf-to-leaf distance matrix can thus be obtained (see Figure 7b). To measure the similarity between two trees, the CC of their two distance matrices is calculated. A high CC value indicates a high topological similarity between the compared trees. Because the computation of a CC value requires paired data, a limitation of this method is that the two trees under comparison must contain exactly the same set of leaves (organisms).
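The traveling-distance comparison can be sketched as follows, with trees written as nested tuples. Several points are assumptions made for illustration: m and n count the internal nodes strictly between a leaf and the lowest common ancestor, the CC is taken to be the Pearson correlation, and only the upper triangle of the symmetric distance matrix is correlated. With off-diagonal entries only, counting edges instead of nodes would shift every distance by the same constant and leave the CC unchanged. The function names and example trees are hypothetical.

```python
import math
from itertools import combinations

def leaf_paths(tree, path=()):
    """Map each leaf to its sequence of child indices from the root."""
    if not isinstance(tree, tuple):
        return {tree: path}
    paths = {}
    for k, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (k,)))
    return paths

def traveling_distance(path_x, path_y):
    """m + n for two distinct leaves, given their root-to-leaf index paths."""
    common = 0  # depth of the lowest common ancestor
    for a, b in zip(path_x, path_y):
        if a != b:
            break
        common += 1
    m = len(path_x) - 1 - common  # internal nodes between X and the ancestor
    n = len(path_y) - 1 - common  # internal nodes between Y and the ancestor
    return m + n

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tree_cc(tree_a, tree_b):
    """Correlate the leaf-to-leaf traveling distances of two trees."""
    pa, pb = leaf_paths(tree_a), leaf_paths(tree_b)
    if set(pa) != set(pb):
        raise ValueError("both trees must contain exactly the same leaves")
    pairs = list(combinations(sorted(pa), 2))
    da = [traveling_distance(pa[x], pa[y]) for x, y in pairs]
    db = [traveling_distance(pb[x], pb[y]) for x, y in pairs]
    return pearson(da, db)

tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(tree_cc(tree1, tree1), tree_cc(tree1, tree2))
```

Swapping the children of a node does not change any traveling distance, so trees that differ only in drawing order obtain a CC of 1.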
We thank Dr. I-Shou Chang at the Center of Biomedical Databases, National Health Research Institutes (NHRI), Dr. Chao A. Hsiung at the Institute of Population Health Sciences, NHRI, and Dr. Chau-Ti Ting at the Department of Life Science, National Taiwan University for their insightful suggestions. We also thank Yen-Yi Liu at the Institute of Bioinformatics, National Chiao Tung University, for helping analyze classification data. This work is funded by the National Science Council, Taiwan, R.O.C. with grant numbers 96-3112-B-007-006, and 97-2752-B-007-003-PAE.