Gene Cluster Profile Vectors: a method to infer functionally related gene sets by grouping proximity-based gene clusters
© Pejaver and Kim; licensee BioMed Central Ltd. 2011
Published: 27 July 2011
Proximity-based methods and co-evolution-based phylogenetic profiles methods have been used successfully to identify functionally related genes. Proximity-based methods are effective for physically clustered genes, while the phylogenetic profiles method is effective for co-occurring gene sets. However, both methods predict many false positives and false negatives. In this paper, we propose the Gene Cluster Profile Vector (GCPV) method, which combines these two methods by using phylogenetic profiles of whole gene clusters. The GCPV method is currently the only genome comparison-based method that characterizes relationships between gene clusters on the basis of the profiles of the individual genes in those clusters.
The GCPV method groups together reasonably related operons in E. coli about 60% of the time. The method is not sensitive to the choice of a reference genome set used and it outperforms the conventional phylogenetic profiles method. Finally, we show that the method works well for predicted gene clusters from C. crescentus and can serve as an important tool not only for understanding gene function, but also for elucidating mechanisms of general biological processes.
The GCPV method has been shown to be an effective and robust approach to the prediction of functionally related gene sets from proximity-based gene clusters or operons.
Next generation sequencing technology has caused an explosion in the genomic data that is available to the research community. As a consequence, the annotation of such large numbers of diverse genomes has now become a major challenge. More specifically, in order to elucidate novel biological processes in certain organisms, assigning functions to the genes involved and understanding their interplay have become key problems. To address these challenges, several methods of function assignment have been proposed over the past decade. These range from simple homology-based strategies (BLAST, bi-directional best hits, etc.) to comparative genomics-based methods such as the gene cluster method, the phylogenetic profiles method, and the gene fusion method.
In the case of prokaryotes, the gene cluster method has been very effective because functionally related genes tend to be physically clustered together on the genome, and these arrangements tend to be conserved due to selection pressures. However, conservation of genes is highly sensitive to the set of reference genomes and their phylogenetic relationships. In general, predicted gene clusters are correct in terms of the functional relationship of their component genes, but they become small and fragmented when only clusters conserved in many genomes are sought. In addition, proximity-based clusters are often fragmented because not all genes in a functionally related gene set are physically co-located.
Another technique, the phylogenetic profiles method, has been effective for co-evolving gene sets. However, this method is prone to false positives. Moreover, neither method helps in understanding the underlying mechanisms of biological processes. In order to establish such higher-level functions for both genes and gene clusters, we adopt a novel approach that involves the identification of functionally related gene clusters. To the best of our knowledge, this identification problem has not been addressed before. In this paper, we propose a novel method called the Gene Cluster Profile Vector (GCPV) method that combines the strengths of both techniques. GCPV takes as input a set of stringently defined gene clusters and then groups them by pairwise similarity, where the similarity of two clusters is computed from the occurrence profiles of their individual genes across a set of genomes. We evaluate the GCPV method’s effectiveness in grouping together related operons in Escherichia coli and also assess its performance in comparison to the single-gene phylogenetic profiles method. We then test its performance on predicted gene clusters from Caulobacter crescentus.
Genes whose products contribute to the same biological process tend to form small clusters in prokaryotes. However, these clusters themselves tend to be spread out across the entire genome. Simple proximity-based methods would be unable to identify functional relatedness in such cases. Moreover, although certain biological processes are observed in certain organisms, the corresponding clusters may be fragmented into smaller clusters. On the other hand, although the phylogenetic profiles method has been successfully used for gene function assignment, it is based on the assumption that genes with related functions have similar evolutionary profiles. This assumption tends to lead to false positives due to spurious matches. We propose a method that combines the advantages of the proximity and evolutionary constraints inherent in these two methods. Previous attempts at combining these methods computed separate proximity and co-evolution scores for a gene and combined these scores to assign it a function. Since the goal of this study is not only to assign specific functions to genes but also to identify functional relationships between whole gene clusters, such score-based methods cannot be used. We therefore propose a more intuitive combination of the phylogenetic profiles and gene cluster methods.
The problem being addressed in this paper can be described as follows. Given a target genome, a set of gene clusters in the target genome (as predicted by any cluster prediction algorithm [7–11]) and a reference genome set, the goal is to identify functionally related gene clusters in the target genome and thus generate clusters of gene clusters, each containing gene clusters with similar biological functions:
Input: (1) G_T, a target genome; (2) C, a set of gene clusters predicted by a proximity-based method in G_T; (3) G, a set of reference genomes that are at varying evolutionary distances from G_T.
Output: L, a set of clusters of gene clusters where each cluster contains functionally related gene clusters. The main reason for using proximity-based gene clusters as input is that gene clusters predicted with stringent parameters are generally accurate but are fragmented into small sets. Thus, we have designed the GCPV method to group together such tight, fragmented clusters.
In order to address the above problem, several challenges have to be overcome. First, no prior information on gene or cluster functions is known. Second, in order to identify functionally related gene clusters, both proximity and conservation information in these clusters need to be considered. Third, when considering phylogenetic profiles of gene clusters, the method needs to be independent of the size of gene clusters. Finally, comparative genomics methods like the phylogenetic profiles methods are typically dependent on the size and nature of the reference genome set used and such dependence needs to be minimized. Thus, the design and implementation of a novel method to identify functionally related gene clusters is not trivial.
The best approximation for a gene cluster is an operon, as operons are also sets of genes that are constrained by proximity to each other. However, it must be noted that an operon corresponds to a set of co-transcribed and co-regulated genes. We have used a set of known operons in Escherichia coli K-12 substr. MG1655 (NCBI RefSeq: NC_000913) that have been verified by experiments. This dataset was obtained by filtering out computationally predicted operons from RegulonDB (Release 6.7). After filtering, 1299 genes were represented in 379 operons from E. coli. These data, along with single-gene phylogenetic profile information, were input into the GCPV workflow. Note that for this and all the following experiments, only prokaryotic species were used as reference genomes and these were downloaded from NCBI. The resulting clusters of operons were evaluated using SEED broad categories.
An interesting example is the CCNA_02239 gene in C. crescentus, which is annotated as a hypothetical protein. PhyloEGGS predicted it to be part of a two-gene cluster also containing a translation initiation inhibitor (CCNA_02241). The GCPV method included this gene cluster in a cluster of clusters of size three, with the remaining two clusters being assigned similar KEGG pathways (ccs00190 and ccs01100). The latter is a general metabolic pathway while the former deals with oxidative phosphorylation. Based on this, one can hypothesize that CCNA_02239 may play a role in oxidative phosphorylation or a similar pathway. This would not have been evident through a simple BLAST search or by just using the cluster context. This highlights the use of the GCPV method as an effective annotation tool.
Clusters of operons from E. coli with a reference genome set of 120 genomes
Identifier for cluster of operon
ackA-pta, argT-hisJQMP, artPIQM, fliAZY, glnHPQ, metNIQ
fimAICDFGH, flgAMN, flhDC, slp-dctR, smtA-mukFEB
flgBCDEFGHIJ, flgKL, flhBAE, fliLMNOPQR, motAB-cheAW
csgDEFG, fliDST, yeaGH
We have established an effective and robust method for the detection of functionally related gene clusters, and thus genes, with no prior information on function provided. The GCPV method shows minimal dependence on the reference genome set used and has been shown to outperform the basic phylogenetic profiles method, which is in agreement with previous work in this area. Our work also improves on that earlier work in that our method can accommodate gene clusters of any size. A limitation of the GCPV method is that its coverage may not span the entire genome, as not all genes are present as parts of clusters.
This was intended as a pilot study, and future work includes the use of more sophisticated clustering techniques to further improve performance. This is particularly important in the context of improving consistency at the individual cluster level. Interestingly, in its current form, the GCPV method still groups together functionally similar clusters/operons, as indicated by the evaluation scores. This implies that although individual clusters may not be reference-set independent, the GCPV method generates clusters containing functionally related gene clusters and can be used for function assignment. Additionally, the GCPV method can be adapted to any gene set other than proximity-based gene clusters, as long as the intra-set coupling is tight. In general, the GCPV method holds the potential to play a role not only in genome annotation but also in testing hypotheses for roles of previously uncharacterized genes in metabolic pathways, protein-protein interactions and general biological processes.
Therefore, for a given gene cluster, its profile vector is a numeric vector in which each element represents the extent of conservation of genes from that cluster in a specific reference genome. For example, in Fig. 6 the third element in the GCPV on the left is 0.5, which means that the third genome contains only half the genes from that cluster.
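The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: single-gene phylogenetic profiles are taken to be binary presence/absence lists, and the function name `gene_cluster_profile_vector`, the gene names and the profile values are all hypothetical rather than taken from the paper.

```python
def gene_cluster_profile_vector(cluster_genes, gene_profiles):
    """Return, for each reference genome, the fraction of cluster genes present.

    cluster_genes: list of gene identifiers in one gene cluster.
    gene_profiles: dict mapping gene identifier -> list of 0/1 values,
                   one per reference genome (assumed binary profiles).
    """
    n_genomes = len(gene_profiles[cluster_genes[0]])
    counts = [0] * n_genomes
    for gene in cluster_genes:
        for j, present in enumerate(gene_profiles[gene]):
            counts[j] += present
    return [count / len(cluster_genes) for count in counts]


# Hypothetical profiles of four genes over four reference genomes.
gene_profiles = {
    "g1": [1, 1, 1, 0],
    "g2": [1, 0, 1, 0],
    "g3": [1, 1, 0, 0],
    "g4": [1, 0, 0, 0],
}

gcpv = gene_cluster_profile_vector(["g1", "g2", "g3", "g4"], gene_profiles)
print(gcpv)  # [1.0, 0.5, 0.5, 0.0]
```

In this toy input the third reference genome contains two of the four genes, so the third element of the vector is 0.5. Because each element is a fraction, the vector length depends only on the number of reference genomes and not on the size of the cluster.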
An example of this is shown in Fig. 6, where two hypothetical gene clusters, one with four genes and one with five, are considered. The cosine similarity of their GCPVs turns out to be 0.9949, which indicates a high similarity. This is reasonable because gene cluster Y is identical to cluster X in its structure except for the addition of a fifth gene, g_15. It can also be observed that the profiles of g_11 and g_15 are identical to each other. Thus, this fifth gene does not substantially affect the conservation profile of the cluster as a whole, and this results in a high cosine similarity value.
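The similarity computation can be illustrated with a short sketch. The two vectors below are hypothetical and are not the values from Fig. 6: `gcpv_y` is what a four-gene cluster with vector `gcpv_x` would become if a fifth gene present in the first three genomes were added. The function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a cluster absent from every genome matches nothing.
        return 0.0
    return dot / (norm_u * norm_v)


gcpv_x = [1.0, 0.5, 0.5, 0.0]  # hypothetical four-gene cluster
gcpv_y = [1.0, 0.6, 0.6, 0.0]  # same cluster plus one extra gene
similarity = cosine_similarity(gcpv_x, gcpv_y)
print(round(similarity, 4))
```

Since GCPV elements are non-negative fractions, the similarity always lies between zero and one, and adding a gene whose profile resembles the rest of the cluster leaves it close to one.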
As mentioned earlier, one of the key challenges in a comparative genomics method is reducing the dependence of the method on the reference genome set used. In order to address this challenge, an intermediate step has been incorporated into the GCPV workflow. This step involves the random generation of new reference genome subsets from the original reference genome set. These subsets vary in size (with the original reference genome set being the largest) and result in the generation of GCPVs of different lengths. For a given reference genome set, these can loosely be termed ‘sub-profiles’, and cosine similarities can be calculated for GCPVs generated from each of these sub-profiles. In this study, the minimum sub-profile size was 12 and the profile size was increased in increments of 12.
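A minimal sketch of this subset-generation step follows. It assumes one random subset is drawn per size, which the text does not specify, and it assumes that genomes are sampled uniformly without replacement. The function name `random_reference_subsets` and the integer genome identifiers are hypothetical.

```python
import random


def random_reference_subsets(genome_ids, step=12, seed=0):
    """Draw random reference genome subsets of size step, 2*step, ...

    The full reference set is always included as the largest subset.
    Assumption: exactly one subset per size, sampled without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsets = []
    for size in range(step, len(genome_ids), step):
        subsets.append(sorted(rng.sample(genome_ids, size)))
    subsets.append(sorted(genome_ids))
    return subsets


genome_ids = list(range(120))  # stand-ins for 120 reference genomes
subsets = random_reference_subsets(genome_ids)
sizes = [len(subset) for subset in subsets]
print(sizes)  # [12, 24, 36, 48, 60, 72, 84, 96, 108, 120]
```

Each subset selects the positions of the single-gene profiles that are kept, so a GCPV computed on a subset of size 12 has 12 elements.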
The idea behind this is that during the subsequent steps in the workflow, these cosine similarities would be combined to improve the method in two ways. First, as mentioned earlier, the use of cosine similarities from reference genome subsets would help reduce the dependence of performance on the reference genome set. Second, by randomly generating smaller reference genome subsets, one can account for lineage-specific co-evolution whose effects often arise in the phylogenetic profiles method.
The steps outlined previously result in a single value between zero and one for a given pair of gene clusters within the target genome. This is done for all pairs of gene clusters to generate a cosine similarity matrix. If there are k gene clusters, this matrix would have k rows and k columns with all entries in the diagonal being unity (since the cosine similarity calculated for a gene cluster and itself is one). This is repeated for each reference genome subset and would result in many such cosine similarity matrices.
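The construction of one such matrix can be sketched as follows. The GCPV values are hypothetical, and the helper names `cosine_similarity` and `similarity_matrix` are illustrative; the diagonal is set to one directly, as described above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(gcpvs):
    """Build the k x k cosine similarity matrix for k gene clusters."""
    k = len(gcpvs)
    matrix = [[1.0] * k for _ in range(k)]  # diagonal entries stay at unity
    for i in range(k):
        for j in range(i + 1, k):
            value = cosine_similarity(gcpvs[i], gcpvs[j])
            matrix[i][j] = value
            matrix[j][i] = value  # the matrix is symmetric
    return matrix


# Hypothetical GCPVs for three gene clusters over four reference genomes.
gcpvs = [
    [1.0, 0.5, 0.5, 0.0],
    [1.0, 0.6, 0.6, 0.0],
    [0.0, 0.0, 0.25, 1.0],
]
matrix = similarity_matrix(gcpvs)
for row in matrix:
    print([round(value, 3) for value in row])
```

Only the upper triangle is computed explicitly, which halves the work for large k.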
These matrices are then combined into a single matrix as a weighted sum, M = Σ_i w_i M^(i), where M is the final combined cosine similarity matrix, M^(i) is the cosine similarity matrix resulting from the reference genome subset i and w_i is the weight for the matrix M^(i).
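As an illustration only, the combination step might look like the sketch below. It assumes the combined matrix is a weighted sum of the per-subset matrices with weights normalized to sum to one, and it uses weights proportional to subset size purely as an example; the weighting scheme and the function name `combine_matrices` are assumptions, not the paper's choices.

```python
def combine_matrices(matrices, weights):
    """Weighted sum of equally sized square matrices.

    Assumption: weights are normalized here so that they sum to one,
    which keeps the diagonal of the combined matrix at unity.
    """
    total = sum(weights)
    k = len(matrices[0])
    combined = [[0.0] * k for _ in range(k)]
    for matrix, weight in zip(matrices, weights):
        for row in range(k):
            for col in range(k):
                combined[row][col] += (weight / total) * matrix[row][col]
    return combined


# Two hypothetical 2 x 2 similarity matrices from subsets of size 12 and 24.
m_small = [[1.0, 0.8], [0.8, 1.0]]
m_large = [[1.0, 0.6], [0.6, 1.0]]
combined = combine_matrices([m_small, m_large], weights=[12, 24])
print([[round(value, 4) for value in row] for row in combined])
```

With these example weights the matrix from the larger subset contributes twice as much as the one from the smaller subset.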
The final result of the GCPV method is a reasonable grouping of gene clusters that reflects their functional relatedness to each other. After experimenting with various clustering techniques, we found that density-based methods fail because the cosine similarities seem to be distributed evenly, and that the Markov Cluster method results in a smaller number of larger clusters. The divisive hierarchical clustering technique seems to give the most reasonable results. It starts from the assumption that all data points belong to one large cluster and then proceeds to break that cluster down into smaller units based on intra-cluster and inter-cluster distances.
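The top-down idea can be sketched with a simplified DIANA-style procedure. This is not the authors' implementation: the conversion of similarity to distance as one minus similarity, the splinter-group splitting rule, the `max_diameter` stopping threshold and all function names are assumptions made for illustration.

```python
def average_distance(i, group, dist):
    """Mean distance from point i to the other members of group."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(dist[i][j] for j in others) / len(others)


def split_cluster(cluster, dist):
    """Split one cluster in two by growing a splinter group."""
    remaining = list(cluster)
    # Seed the splinter group with the point farthest from the rest on average.
    seed = max(remaining, key=lambda i: average_distance(i, remaining, dist))
    remaining.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(remaining) > 1:
        moved = False
        for i in list(remaining):
            to_splinter = sum(dist[i][j] for j in splinter) / len(splinter)
            # Move a point if it is closer to the splinter group than to its own.
            if average_distance(i, remaining, dist) > to_splinter:
                remaining.remove(i)
                splinter.append(i)
                moved = True
    return remaining, splinter


def divisive_clustering(sim, max_diameter):
    """Recursively split until every cluster's largest distance is small enough."""
    k = len(sim)
    dist = [[1.0 - sim[i][j] for j in range(k)] for i in range(k)]
    pending, done = [list(range(k))], []
    while pending:
        cluster = pending.pop()
        diameter = max(dist[i][j] for i in cluster for j in cluster)
        if len(cluster) < 2 or diameter <= max_diameter:
            done.append(sorted(cluster))
        else:
            pending.extend(split_cluster(cluster, dist))
    return sorted(done)


# Hypothetical combined similarity matrix with two obvious groups.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
groups = divisive_clustering(sim, max_diameter=0.5)
print(groups)  # [[0, 1], [2, 3]]
```

The returned lists contain indices into the similarity matrix, so each list corresponds to one cluster of gene clusters. A production analysis would more likely rely on an established implementation and a principled cut of the hierarchy.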
In order to establish the effectiveness of the GCPV method, we have performed experiments on both real and predicted data. For validation purposes, we have used either the SEED or the KEGG broad categories. The validation methodology involves the calculation of a score for each cluster of clusters generated.
SK devised and developed the method and also designed the experiments. VRP devised and developed the method and performed the experiments. Both authors prepared the manuscript.
This work was partially supported by a US National Science Foundation grant (NSF MCB-0731950) and the Microbial Systems Node in the METACyt Initiative from the Lilly Foundation.
This article has been published as part of BMC Genomics Volume 12 Supplement 2, 2011: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S2.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.