PathwayVoyager: pathway mapping using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database
© Altermann and Klaenhammer; licensee BioMed Central Ltd. 2005
Received: 21 December 2004
Accepted: 03 May 2005
Published: 03 May 2005
Classifying predicted genes into their respective pathways is as important and challenging as genome annotation itself. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database of known genes and their respective biochemical functions. Although KEGG is accessible online, analyzing multiple genes through the web interface is time consuming and is unsuitable for proprietary data sets.
Presented here is a new software solution that utilizes the KEGG online database for pathway mapping of partial and whole prokaryotic genomes. PathwayVoyager retrieves user-defined subsets of the KEGG database and stores the data as local, Blast-formatted databases. Previously selected datasets can be re-used, reducing run time significantly. Whole or partial genomes can be analyzed automatically using NCBI's BlastP algorithm, and ORFs whose BlastP hits have e-values below the user-defined threshold are marked on pathway maps. Multiple gene hits are sorted by similarity. Since no sequence information is transmitted over the Internet, PathwayVoyager is an ideal solution for pathway mapping and reconstruction of confidential DNA sequence data.
PathwayVoyager represents an alternative approach to the many existing, more complex pathway reconstruction software solutions. The software does not require any dedicated hardware or software and is flexible and straightforward to use. It is ideally suited for environments where analyses on variable datasets are desired.
The ongoing sequencing of complete prokaryotic and eukaryotic genomes reveals a tremendous amount of uncharted data. In prokaryotic genomes, roughly 25 to 30 percent of the predicted ORFeome remains functionally unknown, with many Open Reading Frames (ORFs) showing similarities only to conserved hypothetical ORFs of other organisms. However, a significant number of the predicted ORFs do show similarities to functionally classified genes with defined roles in the complex network of metabolic pathways. Previously, the most common approach to determining the function of a gene and its gene product was to experimentally determine phenotypic changes upon inactivation or overexpression of the gene. This is the most effective approach for determining the roles of single genes, but it is not feasible for investigating complete ORFeomes with over 2000 ORFs. The Kyoto Encyclopedia of Genes and Genomes (KEGG) represents an ambitious and successful attempt to assign known enzymes to known biochemical pathways and is updated on a regular basis [1–3]. The database is presented through a web-based browser, and a multitude of different analyses are possible. Genes can be analyzed using online Blast algorithms (http://www.ncbi.nlm.nih.gov/BLAST/) and, if significant Blast similarities result in the assignment of a defined enzyme class (i.e. EC classification), these genes can be marked in the corresponding KEGG pathways. However, analyzing larger numbers of genes with this manual approach is tedious, inflexible, and time consuming. Since most public servers use unencrypted data transmission, remote Blast analyses with confidential sequence queries are not desirable. In addition to this web-browser-based approach, the KEGG database can also be accessed directly via an application programming interface (API), and the underlying databases can be downloaded for local use.
Third-party software most commonly uses the KEGG database for gene classification, often in combination with whole-genome annotation efforts [4], or utilizes the database content for reference gene sets used in further experiments [5]. Other remote software solutions like DAVID [6] integrate various databases and experimental results to allow for extensive query-based data mining on given gene lists. Similarly, Pathway Tools (http://bioinformatics.ai.sri.com/ptools/) utilizes dedicated server and database backbones to realize a sophisticated environment. New genomes must be manually integrated into a Pathway/Genome Database (PGDB), which in turn forms the basis for more complex queries and analyses. However, the significant resources required to implement genome information for Pathway Tools might not be readily available. PathFinder [7] and BioMiner [8] take a different approach in that they utilize some of the data the KEGG database provides and then approach pathway reconstruction using software-specific algorithms. However, these solutions tend to be complex, thoroughly web-based, and often utilize the whole KEGG database without selective options. Furthermore, most of these solutions require specific data formats that are compatible with the respective application for initial data parsing and entry. For example, although PathFinder can parse common EMBL input files, only genes with an EC-number tag can initially be integrated into the database system, implying a sophisticated level of existing genome annotation. Although this approach can be used to classify genes in general, it lacks the flexibility needed to compare specific pathways of interest in single organisms or groups of organisms to a selected set of genes. In addition, these complex algorithms are not always needed, and simpler and faster solutions with lower hardware and software requirements would be preferable for their ease of use and flexibility.
Closely-related groups of organisms may differ in certain key elements that define specific strain/species related differences. Comparing these organisms or groups of organisms with each other should highlight these differences and reveal specific properties or lead to new genetic targets for pathway engineering. Therefore, it may be desirable to choose only subsets of organisms and pathways from the overall KEGG database content.
PathwayVoyager was developed to overcome most of these obstacles. The software is a tool that analyzes an unlimited number of protein sequences against a user-selected subset of the KEGG database using NCBI's BlastP algorithm and subsequently places them into their proper pathway positions. Results are displayed in colored pathway maps, and hits can easily be analyzed using the graphical interface. This tool reflects a different approach to pathway mapping, in that it is simple and flexible and has few prerequisites. No dedicated hardware (e.g. a background server) or software (e.g. a relational database backbone) is necessary to analyze given datasets. A standard PC with the Windows operating system is sufficient to operate PathwayVoyager. In contrast to more complex tools, no underlying protein annotation is necessary, and plain protein sequences in FASTA format can be used as query templates. This approach is ideal for draft-phase genomes and ongoing annotation efforts in completed genomes, where the emphasis lies on the establishment and verification of gene annotation and an initial assessment of metabolic capabilities. The resulting main advantage of PathwayVoyager is its speed and economy for initial pathway mapping. Also, the resulting data can easily be accessed at different locations by transferring the generated flat-file database to the respective computers. Once the research objective shifts to comparative and predictive pathway analyses, other tools like DAVID or Pathway Tools become more advantageous. PathwayVoyager fills a niche for environments with limited hardware and software resources that still require a meaningful way to perform small- and large-scale pathway mapping projects from varying data sources.
PathwayVoyager is written completely in Perl/Tk and requires the Perl interpreter (http://www.activestate.com). No further Perl modules are required. However, two external distributions are required, namely the NCBI Blast distribution (ftp://ftp.ncbi.nih.gov/blast/executables/LATEST-BLAST) and the SOAP::Lite client (http://www.soaplite.com/) to utilize the KEGG API. Perl/Tk provides the interpreter for PathwayVoyager, and the SOAP::Lite client facilitates interaction with the KEGG API. The standalone Blast distribution is used to generate Blast-compatible databases and to perform the local Blast analyses. For data analysis and browsing of existing results, the Perl/Tk interpreter is the only prerequisite: the standalone Blast distribution and the SOAP::Lite client can be omitted, and no internet connection is necessary.
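As a rough illustration of the two roles the standalone Blast distribution plays here, the following Python sketch builds (but does not run) command lines for the legacy NCBI tools `formatdb` and `blastall`. The source does not state which commands or options PathwayVoyager actually invokes; the file names, the default e-value cutoff, and the choice of tabular output are assumptions made for illustration only.

```python
def formatdb_command(fasta_path):
    """Build a command that formats a protein FASTA file as a Blast database."""
    # -i: input FASTA file; -p T: the sequences are proteins
    return ["formatdb", "-i", fasta_path, "-p", "T"]


def blastp_command(query_fasta, database, evalue=1e-10,
                   out_path="blast_results.txt"):
    """Build a command that compares query proteins to a local database."""
    return [
        "blastall",
        "-p", "blastp",     # program: protein query vs. protein database
        "-d", database,     # local database created by formatdb
        "-i", query_fasta,  # query proteins in FASTA format
        "-e", str(evalue),  # expectation value cutoff (assumed default)
        "-m", "8",          # tabular output, one hit per line
        "-o", out_path,     # file that receives the results
    ]


# Hypothetical file names, for illustration only.
print(" ".join(formatdb_command("kegg_subset.fasta")))
print(" ".join(blastp_command("query_orfs.fasta", "kegg_subset.fasta")))
```

Each list could be handed to `subprocess.run` on a machine where the legacy Blast executables are installed. Both steps operate on local files only, which is consistent with the point that query sequences never need to leave the local computer.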
The software was developed to optimally complement the GAMOLA annotation suite [9] but accepts any protein sequence in FASTA format. The sequential numbering of ORFs in GAMOLA-annotated genomes is reflected by gene-name tags in the generated GenBank files. Extracting protein sequences into FASTA files preserves this numbering scheme, which is subsequently presented in the browser module of PathwayVoyager. This permits fast and efficient ORF tracking throughout the genome and often provides preliminary identification of gene clusters. PathwayVoyager operates as a stand-alone software solution without the need for additional database backbone systems. The use of PathwayVoyager and the KEGG database system implies agreement to the license terms specified for KEGG at http://www.genome.ad.jp/kegg/kegg5.html.
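The value of a preserved ORF numbering scheme can be sketched in a few lines of Python. The example assumes that each FASTA header starts with an ORF name ending in its sequential number (in the style of names such as Lba699); the exact header layout produced by GAMOLA is not given in the text, and the sequences below are placeholders, not real proteins.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word after ">"; the rest is description.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def orf_number(name):
    """Return the trailing number of an ORF name, or -1 if there is none."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else -1


def in_genome_order(records):
    """List ORF names by their sequential number."""
    return sorted(records, key=orf_number)


# Placeholder sequences; only the naming scheme matters here.
example = ">Lba700 placeholder\nMKV\nLLA\n>Lba698\nMAA\n>Lba699\nMTT\n"
records = read_fasta(example)
print(in_genome_order(records))
```

Because consecutive numbers correspond to neighboring ORFs, sorting hits this way makes it easy to spot runs of adjacent ORFs that map to the same pathway.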
The graphical user interface was designed to be self-explanatory and easy to use. After the initial pathway setup, no further installation steps are necessary. Although PathwayVoyager requires an internet connection in order to retrieve data from the KEGG database, all analyses involving the provided gene sets are performed locally and no sensitive data are transmitted. This eliminates one of the major security concerns when working with confidential data and permits the real-time use of the KEGG database system. PathwayVoyager does not require any dedicated hardware and has been tested on a standard PC running the Windows platform. Linux versions of Perl/Tk, the SOAP::Lite client, and the standalone Blast distribution are freely available and would allow PathwayVoyager to operate in a Linux environment as well. For certain selectable pathways (e.g. the Ribosomal reference pathway), KEGG does not yet support organism-independent marking. For practical reasons, no hits will be displayed for these pathways.
Results and discussion
The user-selected organism and pathway combination is shown in a separate pop-up window (not shown). The current status of the KEGG pathway mapping is also shown in a separate log-window (not shown). In general, the right panel (Figure 2, C) harbors the user-guide interface and was designed to lead the user through the analyses in a step-by-step approach. By default, the organism and pathway confirmation automatically initializes the KEGG pathway mapping. If only the selected and retrieved protein sequences are required, or a manual start for pathway mapping is desired, the setup module allows the configuration for manual mode. KEGG pathway mapping can then be initiated with the "Submit to KEGG" button. Selected pathway/organism combinations are saved as an ASCII text file. The retrieved protein sequences are stored into a separate ASCII file and a Blast-compatible database is generated. For future analyses, the pathway/organism selection and the respective database can be re-used with different query protein sequence sets.
The possibility to re-use previous selections dramatically reduces the time needed to complete KEGG analyses, as retrieval of individual protein sequences from KEGG is omitted. In addition, Blast results obtained with the given query set can also be re-used. This shortens the run time further, enabling rapid mappings and analyses of pathways with varying relaxed or stringent threshold values.
The provided gene set is then compared to the local database generated from the selected organism-pathway protein sequence combination using the BlastP algorithm. Blast hits with an e-value below the user-selectable threshold are used to generate the marked KEGG pathway requests. Pathway maps are saved as GIF files, and the URL of each KEGG pathway map is stored separately in a text file together with the corresponding BlastP results.
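The thresholding step can be sketched as follows in Python. The sketch assumes that BlastP results are available in the 12-column tabular format of legacy Blast (query ID in column 1, subject ID in column 2, e-value in column 11) and that a lookup table from database entries to EC numbers exists; the text does not describe PathwayVoyager's internal file formats, so `subjA`, `subjB`, the score columns, and the lookup table are all hypothetical.

```python
def parse_tabular_hits(text):
    """Return (query, subject, e-value) tuples from tabular Blast output."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Column 1: query, column 2: subject, column 11: e-value.
        hits.append((fields[0], fields[1], float(fields[10])))
    return hits


def ecs_to_mark(hits, subject_to_ec, threshold):
    """Group hits with an e-value below the threshold by EC number."""
    marked = {}
    for query, subject, evalue in hits:
        ec = subject_to_ec.get(subject)
        if ec is not None and evalue < threshold:
            marked.setdefault(ec, []).append((query, evalue))
    return marked


# Hypothetical result lines; only the ORF names and the two small
# e-values are taken from the example discussed in the text.
sample = (
    "Lba700\tsubjA\t98.0\t250\t5\t0\t1\t250\t1\t250\t1e-131\t480\n"
    "Lba699\tsubjA\t97.0\t250\t7\t0\t1\t250\t1\t250\t1e-127\t470\n"
    "Lba001\tsubjB\t30.0\t100\t70\t2\t1\t100\t1\t100\t0.5\t20\n"
)
lookup = {"subjA": "5.3.1.1", "subjB": "2.7.2.3"}
print(ecs_to_mark(parse_tabular_hits(sample), lookup, threshold=1e-120))
```

Because the stored Blast results are only filtered at this stage, the same results can be re-examined with a more relaxed or more stringent threshold without repeating the Blast run.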
In general, previously selected pathways are displayed by either their KEGG pathway code or full name. Alternative analyses can be displayed by changing the default mapping directory, using the "Directory" function (Figure 3, A). The selected pathway is then displayed graphically, and BlastP hits below the specified threshold are indicated as red boxes bearing the respective EC numbers (Figure 3, C). Each marked element is listed by its EC number, numerically sorted, in a listbox (Figure 3, B). Upon selection of an entry, all BlastP hits below the threshold are sorted by ascending e-value and displayed accordingly (Figure 3, D). This workflow allows quick pathway mapping of a given gene set, and genes potentially involved in multiple pathways can be easily identified and analyzed.
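The two orderings used by the browser can be illustrated with a short Python sketch. It assumes that "numerically sorted" means comparing the four fields of an EC number as integers rather than as text, which is one plausible reading and not confirmed by the source. The EC number 5.3.1.1 is the standard classification of triosephosphate isomerase; `orfX` and `orfY` and their e-values are placeholders.

```python
def ec_sort_key(ec_number):
    """Sort key that compares EC number fields as integers.

    Non-numeric fields (such as "-" in partial EC numbers) sort last.
    """
    return tuple(
        int(part) if part.isdigit() else float("inf")
        for part in ec_number.split(".")
    )


def listbox_entries(marked):
    """Return the marked EC numbers in numerical order."""
    return sorted(marked, key=ec_sort_key)


def hits_for_entry(marked, ec_number):
    """Return the hits of one EC number, best (lowest) e-value first."""
    return sorted(marked[ec_number], key=lambda hit: hit[1])


marked = {
    "5.3.1.1": [("Lba699", 1e-127), ("Lba700", 1e-131)],
    "2.7.1.11": [("orfX", 1e-50)],  # placeholder ORF and e-value
    "2.7.1.2": [("orfY", 1e-30)],   # placeholder ORF and e-value
}
print(listbox_entries(marked))
print(hits_for_entry(marked, "5.3.1.1"))
```

A plain text sort would place 2.7.1.11 before 2.7.1.2; comparing the fields as integers avoids this and lists 2.7.1.2 first.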
In the example shown, the conversion of glyceronephosphate to glyceraldehyde-3-phosphate is mediated by a triosephosphate isomerase (EC 5.3.1.1). Selecting this entry from the EC entry list (Figure 3, B) highlights all query hits found in L. acidophilus below the defined threshold (Figure 3, D). Two entries below an e-value of 1e-120 were found, namely ORFs Lba699 (e-value: 1e-127) and Lba700 (e-value: 1e-131). Both entries show significant similarities to triosephosphate isomerases. Further analyses showed that the conversion of glyceraldehyde-3-phosphate to glycerate-1,3-bisphosphate and then to glycerate-3-phosphate is mediated by Lba698 (EC 1.2.1.12, e-value 1e-176) and Lba699 (EC 2.7.2.3, e-value 0), respectively. The ambiguity found for EC 5.3.1.1 could thus be resolved, and the genome annotation was updated accordingly. More detailed analyses revealed the presence of the complete pathway for uptake and conversion of glucose into pyruvate and L-lactate. A more detailed analysis of the complete metabolic pathway reconstruction of L. acidophilus NCFM using PathwayVoyager is described elsewhere [10].
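The reasoning behind this example can be written out as a small Python sketch that assigns each ORF to the pathway step for which it has the lowest e-value. This "best hit per ORF" rule is an assumption made for illustration: the text does not state how the ambiguity was resolved, and PathwayVoyager itself only displays hits and leaves their evaluation to the user. The ORF names and e-values are those quoted above; the step labels are shorthand for the reactions described.

```python
def best_assignment_per_orf(hits):
    """Map each ORF to the pathway step with its lowest e-value."""
    best = {}
    for orf, step, evalue in hits:
        if orf not in best or evalue < best[orf][1]:
            best[orf] = (step, evalue)
    return {orf: step for orf, (step, _) in best.items()}


hits = [
    ("Lba699", "triosephosphate isomerase", 1e-127),
    ("Lba700", "triosephosphate isomerase", 1e-131),
    ("Lba698", "glyceraldehyde-3-P to glycerate-1,3-bisP", 1e-176),
    ("Lba699", "glycerate-1,3-bisP to glycerate-3-P", 0.0),
]
for orf, step in sorted(best_assignment_per_orf(hits).items()):
    print(orf, "->", step)
```

Under this rule Lba699 is assigned to the glycerate-3-phosphate step, where its e-value is 0, which leaves Lba700 as the only candidate for the triosephosphate isomerase step.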
PathwayVoyager does not evaluate or extrapolate the displayed hits, and the quality and significance of the results depend on the current content of the KEGG database. As with all predictive software, results should be carefully analyzed and seen in their genetic context to evaluate the activities and potential differences in substrate specificity of homologous enzymes. Results from previous analyses can be displayed by selecting the "View existing KEGG pathways" option in the PathwayVoyager main window (Figure 2, C).
Run times for PathwayVoyager may vary, depending on the number of selected pathways and organisms. Analysis of a complete genome of ~2,000 open reading frames (ORFs) using the complete KEGG database can be carried out in less than 36 h.
PathwayVoyager differs significantly in its approach from existing software solutions for pathway reconstruction. In contrast to their often highly complex and specific algorithms, PathwayVoyager takes a more straightforward approach and does not require substantial resources on the user's side. By relying on the Blast algorithm and the KEGG database, PathwayVoyager utilizes widely accepted resources to analyze and map data. Despite the uncomplicated approach, evidential data can be obtained rapidly and analyzed easily during genome analyses. PathwayVoyager represents an effective pathway mapping tool for large or confidential data sets.
Availability and requirements
• Project name: Biological Pathway Mapping
• Project home page: none
• Operating system(s): Platform independent
• Programming language: Perl
• Other requirements: Active Perl 5.8, SOAP::Lite client, NCBI's Blast distribution
• License: The software is distributed for free under the NC State University copyright and can be obtained upon request to the authors.
• Any restrictions to use by non-academics: none
This work was funded in part by the North Carolina Agricultural Research Service, and Danisco, Inc. of Madison, Wisconsin. We thank Evelyn Durmaz for her help and technical assistance. Special thanks go to M. Andrea Azcarate-Peril and B. Logan Buck for their helpful discussions and beta testing of the PathwayVoyager software.
- Ogata H, Goto S, Fujibuchi W, Kanehisa M: Computation with the KEGG pathway database. Biosystems. 1998, 47: 119-128. 10.1016/S0303-2647(98)00017-3.
- Kanehisa M: The KEGG database. Novartis Found Symp. 2002, 247: 91-101.
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32: D277-D280. 10.1093/nar/gkh063.
- Almeida LG, Paixao R, Souza RC, Costa GC, Almeida DF, Vasconcelos AT: A new set of bioinformatics tools for genome projects. Genet Mol Res. 2004, 3: 26-52.
- Yang HH, Hu Y, Buetow KH, Lee MP: A computational approach to measuring coherence of gene expression in pathways. Genomics. 2004, 84: 211-217. 10.1016/j.ygeno.2004.01.007.
- Dennis GJ, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4: P3. 10.1186/gb-2003-4-5-p3.
- Goesmann A, Haubrock M, Meyer F, Kalinowski J, Giegerich R: PathFinder: reconstruction and dynamic visualization of metabolic pathways. Bioinformatics. 2002, 18: 124-129. 10.1093/bioinformatics/18.1.124.
- Sirava M, Schafer T, Eiglsperger M, Kaufmann M, Kohlbacher O, Bornberg-Bauer E, Lenhof HP: BioMiner-modeling, analyzing, and visualizing biochemical pathways and networks. Bioinformatics. 2002, 18 Suppl 2: S219-S230.
- Altermann E, Klaenhammer TR: GAMOLA: a new local solution for sequence annotation and analyzing draft and finished prokaryotic genomes. OMICS. 2003, 7: 161-169. 10.1089/153623103322246557.
- Altermann E, Russell WM, Azcarate-Peril MA, Barrangou R, Buck BL, McAuliffe O, Souther N, Dobson A, Duong T, Callanan M, Lick S, Hamrick A, Cano R, Klaenhammer TR: Complete genome sequence of the probiotic lactic acid bacterium Lactobacillus acidophilus NCFM. Proc Natl Acad Sci U S A. 2005, 102: 3906-3912. 10.1073/pnas.0409188102.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.