Fungal cytochrome P450 database

Background: Cytochrome P450 enzymes play critical roles in fungal biology and ecology. To support studies on the roles and evolution of cytochrome P450 enzymes in fungi based on rapidly accumulating genome sequences from diverse fungal species, an efficient bioinformatics platform specialized for this superfamily of proteins is highly desirable. Results: The Fungal Cytochrome P450 Database (FCPD) archives genes encoding P450s in the genomes of 66 fungal and 4 oomycete species (4,538 in total) and supports analyses of their sequences, chromosomal distribution patterns, and evolutionary histories and relationships. The archived P450s were classified into 16 classes based on InterPro terms and clustered into 141 groups using tribe-MCL. The proportion of P450s in the total proteome and the class distribution in individual species exhibited certain taxon-specific characteristics. Conclusion: The FCPD will facilitate systematic identification and multifaceted analyses of P450s at multiple taxon levels via the web. All data and functions are available at the web site.


Background
Cytochrome P450 is the collective name for a superfamily of heme-containing monooxygenases. P450 enzymes not only participate in the production of diverse metabolites but also play critical roles in an organism's adaptation to specific ecological and/or nutritional niches by modifying potentially harmful environmental chemicals. In fungi, P450 enzymes have contributed to the exploration of and adaptation to diverse ecological niches [1,2].
Rapidly accumulating genome sequences from diverse fungal species, including more than 80 species with more currently being sequenced [3], offer opportunities to study the genetic and evolutionary mechanisms underpinning different fungal life styles at the genome level [4][5][6][7]. To support such studies with a focus on cytochrome P450s, we constructed a new platform named the Fungal Cytochrome P450 Database (FCPD), which archives P450s in most sequenced fungal and oomycete species and allows comparison of the archived data with previously published datasets, such as the Cytochrome P450 Engineering Database [8], a manually curated P450 database at http://drnelson.utmem.edu/CytochromeP450.html (referred to as the Nelson's P450 database herein), and P450 datasets derived from extensive phylogenetic analyses of selected fungal taxon groups [9,10]. The FCPD also supports multifaceted analyses of P450s using various web-based bioinformatics tools supported by the Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr/) [3]. The FCPD, in combination with high-throughput experimental approaches, will advance our understanding of the roles and evolution of P450s.

Pipeline for identifying and classifying fungal P450s
To identify P450 proteins from genome sequences, standardized genome databases managed by CFGP (http://cfgp.snu.ac.kr/) [3] and the InterPro scan annotation of each ORF [11] were used. The pipeline for the identification and archiving of P450s consists of four steps (Figure 1). In the first step, all proteins carrying one or more of 16 InterPro terms associated with cytochrome P450 were identified and classified according to the associated InterPro terms. Domain information of P450 proteins was also retrieved from the InterPro scan results. To filter out potential false positives (i.e., those carrying a very short domain), the minimum length for IPR001128 (Cytochrome P450) was set at 25 amino acids (aa). Because some of these potential false positives might in fact be novel P450s, we labelled them as "questionable P450" in FCPD rather than discarding them. In the second step, cache tables were created from the collection of putative P450 sequences, especially for results from several statistical analyses, to speed up data retrieval. BLAST datasets were also generated to support BLAST searches of P450s via the FCPD web site and cluster analysis. In the third step, class-specific and cluster-specific neighbour-joining phylogenetic trees that show relationships among P450s within individual phylogenetic groups (e.g., Figure 2) were constructed (bootstrapped with 2,000 or 10,000 repeats); these trees are displayed by Phyloviewer (http://www.phyloviewer.org/; Park et al., unpublished) on the FCPD web site. Using the BLAST dataset, fungal P450s were clustered using tribe-MCL [12] and compared with the data in three publicly available sources: the Cytochrome P450 Engineering Database [8], the Nelson's P450 database, and a set of phylogenetically analyzed P450s in multiple fungal species [9,10]. Results from this comparison were stored in the FCPD for viewing via the FCPD web site.
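The first step of the pipeline can be illustrated with a minimal sketch. It assumes that InterPro scan results have already been parsed into simple records; the `DomainHit` type, the `classify_p450s` function and the one-element term set are hypothetical (only IPR001128 is named in the text, while the actual pipeline uses 16 terms), and treating a single passing hit as sufficient is also an assumption.

```python
from dataclasses import dataclass

# Assumption: only IPR001128 is named in the text; the real pipeline uses 16 terms.
P450_TERMS = {"IPR001128"}
MIN_IPR001128_LEN = 25  # minimum domain length in aa, as stated in the text


@dataclass
class DomainHit:
    """One parsed InterPro scan hit (hypothetical record layout)."""
    protein_id: str
    interpro_id: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def classify_p450s(hits, terms=P450_TERMS):
    """Label proteins carrying a P450 term as 'P450' or 'questionable P450'."""
    labels = {}
    for hit in hits:
        if hit.interpro_id not in terms:
            continue  # not a P450-associated term
        length = hit.end - hit.start + 1
        too_short = (hit.interpro_id == "IPR001128"
                     and length < MIN_IPR001128_LEN)
        label = "questionable P450" if too_short else "P450"
        # Assumption: one passing hit is enough to keep a protein as 'P450'.
        if labels.get(hit.protein_id) != "P450":
            labels[hit.protein_id] = label
    return labels


example_hits = [
    DomainHit("prot_a", "IPR001128", 10, 400),  # long domain: kept
    DomainHit("prot_b", "IPR001128", 5, 20),    # 16 aa domain: questionable
    DomainHit("prot_c", "IPR000719", 1, 300),   # unrelated term: ignored
]
print(classify_p450s(example_hits))
```

Short-domain proteins are kept with a separate label rather than dropped, mirroring the decision described above to retain possible novel P450s.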
For species with multiple versions of genome annotation, data generated using different versions were linked to provide the history of annotation.
In the fourth step, all P450s archived in FCPD were matched by BLAST to the corresponding families in the Nelson's P450 database, which contains manually curated data based on the P450 International Nomenclature [13,14]. For each P450, the assigned family name was considered highly confident ('> = 44% identity' in the site) when the degree of aa sequence identity was 44% or higher. When no match at that level could be found in the Nelson's P450 database, the best hit in the BLAST search was chosen to assign the family name and labelled as low confidence ('< 44% identity' in the site). Considering that P450s are very diverse and that the Nelson's P450 database covers fewer fungal species than FCPD, it is highly likely that some of the P450s with low confidence represent novel families that have yet to be registered in the Nelson's P450 database (Figure 3). This annotation result was stored in FCPD and can be viewed through the FCPD web site.
In the genomes of 66 fungal and 4 oomycete species, 4,538 putative P450 genes were identified. Although oomycete species belong to the kingdom Stramenopila and show closer phylogenetic relationships to brown algae and diatoms [15], they have traditionally been studied by mycologists because of their morphological similarities to true fungi; their P450s were therefore included in FCPD.
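The family-assignment rule of the fourth step can be sketched as follows. This is an illustration under stated assumptions, not the FCPD implementation: the input tuples, the function name `assign_family`, the use of bit score to pick the best hit, and the family names in the example are all hypothetical. Only the 44% identity cut-off comes from the text.

```python
IDENTITY_CUTOFF = 44.0  # percent aa identity, as stated in the text


def assign_family(blast_hits):
    """Pick a family name and confidence label for one query P450.

    blast_hits: list of (family_name, percent_identity, bit_score) tuples
    against the curated reference set (hypothetical input layout).
    """
    if not blast_hits:
        return None, None
    confident = [h for h in blast_hits if h[1] >= IDENTITY_CUTOFF]
    # Assumption: among candidate hits, the highest bit score is "the best hit".
    pool = confident if confident else blast_hits
    best = max(pool, key=lambda h: h[2])
    label = "high confidence" if confident else "low confidence"
    return best[0], label


# Illustrative values only; identities and scores are made up.
print(assign_family([("CYP51", 62.0, 500.0), ("CYP61", 30.0, 120.0)]))
print(assign_family([("CYP505", 38.5, 210.0), ("CYP61", 30.0, 120.0)]))
```

Queries that fall into the low-confidence branch are the candidates for families not yet registered in the curated database.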

Evaluation of the accuracy of annotation via the automated pipeline in FCPD by comparing with data archived in the manually curated Nelson's P450 database
The automated annotation of P450s in FCPD may produce some false positives and false negatives. To evaluate its accuracy, all 886 P450s identified using the pipeline in 12 fungal species were compared with manually curated data in the Nelson's P450 database. The positive predictive value (PPV; the proportion of P450s predicted by FCPD that matched P450s archived in the Nelson's P450 database) was 0.894 (792 of the 886 P450s in FCPD matched P450s in the Nelson's P450 database). Some putative false positives in FCPD appeared to be pseudogenes. Another factor that contributed to the discrepancy between the two sources is that some data in the Nelson's P450 database were based on an earlier genome version than the one used for FCPD (e.g., version 4 of the Magnaporthe oryzae genome was used for the former, whereas FCPD is based on version 5). The gene prediction models applied to different versions might have produced different predictions. In contrast, 1,032 out of 1,034 fungal P450s curated in the Nelson's P450 database were identified as P450s by the FCPD pipeline (99.8% sensitivity), supporting the reliability of the FCPD pipeline. The two P450s not identified by FCPD came from Phytophthora sojae and P. ramorum, respectively, and corresponded to truncated sequences (34 and 89 aa, respectively) that were labelled as fragments of P450 in the Nelson's P450 database. Detailed analyses of the underlying reasons for the inconsistency between the two sources will help us improve the automated annotation pipeline of FCPD.
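The two accuracy figures reported above follow directly from the stated counts. The short sketch below reproduces the arithmetic; the function names are illustrative.

```python
def positive_predictive_value(matched, predicted):
    """Fraction of predicted P450s that match a curated P450."""
    return matched / predicted


def sensitivity(recovered, curated):
    """Fraction of curated P450s that the pipeline also identified."""
    return recovered / curated


# Counts taken from the text.
print(round(positive_predictive_value(792, 886), 3))  # 0.894
print(round(sensitivity(1032, 1034) * 100, 1))        # 99.8
```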

Notable features in fungal P450s in the taxonomic context
The numbers of P450s in individual species exhibited certain taxon-specific features (Table 1). Within the phylum Ascomycota, members of the subphylum Pezizomycotina typically carry around 100 P450s, with the exception of four species (Coccidioides immitis, Histoplasma capsulatum, Uncinocarpus reesii and Neurospora crassa) that carry only 22 to 46 P450s. The proportion of P450s in the total proteome in the subphylum Pezizomycotina (0.63% on average) is about twice that of vertebrates (0.33%) but less than that of plant species (0.82%). In contrast to the Pezizomycotina, species in the subphyla Saccharomycotina and Taphrinomycotina have very few P450s (e.g., only 3 P450s in Saccharomyces cerevisiae and 2 P450s in Schizosaccharomyces pombe). Within the phylum Basidio-

[Figure: Data retrieval pipeline in FCPD (Step 1 to Step 4)]

Distribution patterns of fungal P450s among clusters and clans
When fungal/oomycete P450s were combined with 5,447 P450s extracted from 40 other eukaryotic and prokaryotic species and clustered using tribe-MCL (with an inflation factor of 5.0, the strictest condition for clustering based on sequence similarity), 141 clusters were identified. Among these, 74 clusters contain only fungal P450s, suggesting that many fungal P450s have a configuration unique to fungi. The taxonomic origins of fungal P450s in the 26 clusters that contain more than 10 fungal P450s were analyzed (Figure 3). In the phylum Ascomycota, the family assignment of 1,007 P450s (29.24%) was supported with low confidence. In the phylum Basidiomycota, the proportion was 44.56% (352 out of 790 P450s). More than 90% of P450s (104 out of 110) in the phylum Zygomycota and all P450s in the phylum Chytridiomycota did not closely match any family in the Nelson's P450 database. These results strongly suggest that new fungal P450 families need to be defined.
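Once clustering output is available, counts such as "clusters containing only fungal P450s" or "clusters with more than 10 fungal P450s" reduce to simple set tests. The sketch below assumes a one-cluster-per-line text format with whitespace-separated member identifiers and a known set of fungal identifiers; the function names and toy identifiers are hypothetical.

```python
def parse_clusters(lines):
    """Parse one-cluster-per-line text into lists of member identifiers."""
    return [line.split() for line in lines if line.strip()]


def fungal_only(clusters, fungal_ids):
    """Indices of clusters whose members are all fungal."""
    return [i for i, members in enumerate(clusters)
            if all(m in fungal_ids for m in members)]


def rich_in_fungal(clusters, fungal_ids, more_than=10):
    """Indices of clusters containing more than `more_than` fungal members."""
    return [i for i, members in enumerate(clusters)
            if sum(m in fungal_ids for m in members) > more_than]


# Toy data: three clusters, four fungal identifiers.
toy_lines = ["fun1\tfun2\tfun3", "fun4\tplant1", "animal1\tanimal2"]
toy_fungal = {"fun1", "fun2", "fun3", "fun4"}
toy_clusters = parse_clusters(toy_lines)
print(fungal_only(toy_clusters, toy_fungal))                  # [0]
print(rich_in_fungal(toy_clusters, toy_fungal, more_than=2))  # [0]
```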

Update of FCPD
Considering the rapid increase in fungal genome sequencing [3], timely updates of FCPD are critical for presenting the latest information to users. The BLAST dataset, the bootstrapped phylogenetic trees specific for individual classes and clusters, the results from the clustering analysis and the annotation of P450s based on the international P450 nomenclature will be updated automatically once new P450s have been identified via the identification pipeline. Because the identification of P450s depends on the accuracy of the gene model employed to annotate the genome, FCPD will be updated whenever a new version of a previously released genome sequence becomes available, with the data based on earlier versions tagged as "Old putative P450 sequences." Links between new and old versions will be provided.

Accessing lists and sequences of fungal P450s based on species of origin and taxonomic position
To support efficient search and retrieval of P450 sequences, data archived in FCPD can be browsed and searched through multiple methods. Upon selecting a species of interest, general information about the species and a list of its P450s can be viewed. From this list, any P450 sequence can be stored in a personal data repository called the Favorite, where six bioinformatics tools can be used to analyze the stored data. The Favorite is a virtual space for storing sequences archived in CFGP [3]. A list of P450s belonging to each class defined by InterPro terms or to each cluster can also be displayed. The taxonomic distribution of P450s, resulting from comparison with data in the Cytochrome P450 Engineering Database (CYP450ED) [8] and two previous studies on fungal P450s [9,10], can be browsed. P450 sequences in FCPD can also be searched by gene name.

BLAST search of all or subsets of P450s
In FCPD, five different databases of P450s can be searched using BLAST: all P450s (including those from plants and animals), all fungal/oomycete P450s and three fungal phylum-specific databases of P450s. Fungal P450 sequences in the Nelson's P450 database can also be searched. From the BLAST search results, sequences of individual P450s can be saved in the Favorite for subsequent analyses.

Analyses of P450s using tools in the Comparative Fungal Genomics Platform
Many on-line databases that archive gene families allow all or part of the data to be downloaded to the user's computer but often do not provide data analysis tools via the database site. Consequently, to conduct the desired analyses, users may have to visit multiple websites to access analysis tools and/or install programs on a personal computer.
In FCPD, sequences of one or more fungal P450s can be selected by clicking the check boxes next to each P450 and stored in the Favorite. The Object Browser in FCPD supports the transfer of chosen sequences from the Favorite to CFGP, in which the data can be analyzed using six bioinformatics tools [3]. These tools are BLAST, ClustalW, InterPro Scan, PSort, SignalP 3.0 and BLASTMatrix. BLASTMatrix is a novel tool for surveying the presence of genes homologous to a query in multiple species simultaneously. Once a new analysis tool has been added to CFGP, users of FCPD will be able to use it immediately.
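The idea behind a multi-species homolog survey can be illustrated with a presence/absence matrix built from similarity-search hits. This is only a conceptual sketch and not a description of how BLASTMatrix is implemented; the input layout, the function name `presence_matrix` and the E-value cut-off are assumptions.

```python
def presence_matrix(hits, queries, species, max_evalue=1e-5):
    """Build {query: {species: bool}} from (query, species, evalue) hits.

    A cell is True when at least one hit for that query in that species
    has an E-value at or below `max_evalue` (assumed cut-off).
    """
    matrix = {q: {s: False for s in species} for q in queries}
    for query_id, species_name, evalue in hits:
        known = query_id in matrix and species_name in matrix[query_id]
        if known and evalue <= max_evalue:
            matrix[query_id][species_name] = True
    return matrix


# Toy data with placeholder species names and made-up E-values.
toy_hits = [
    ("query1", "species_A", 1e-50),
    ("query1", "species_B", 0.5),    # too weak: treated as absent
    ("query2", "species_B", 1e-12),
]
table = presence_matrix(toy_hits, ["query1", "query2"],
                        ["species_A", "species_B"])
for query, row in table.items():
    print(query, row)
```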

Visualization of chromosomal distribution patterns of P450s via SNUGB
To aid visualization of the chromosomal distribution pattern of P450s in species with available physical chromosome map information, FCPD provides a diagram illustrating the positions of P450s on individual chromosomes (Figure 5), which is drawn by a newly developed genome browser called SNUGB (http://genomebrowser.snu.ac.kr/; Jung et al., submitted). Currently, chromosomal maps of 13 fungal species are available.

Conclusion
To our knowledge, FCPD is the most comprehensive database that archives and classifies P450s in publicly available fungal and oomycete genomes (66 fungal and 4 oomycete species) through a systematic identification pipeline. The reliability of the pipeline in retrieving fungal P450 sequences was evaluated by comparing the resulting data with other established datasets, and the data from these sources were archived in FCPD for comparison and search. The pipeline also links annotated information from different versions of fungal genome sequences. The numbers of P450s in individual fungal species vary widely, and fungal-specific P450 clusters were found via clustering analysis. In combination with other bioinformatic platforms, such as CFGP http://cfgp.snu.ac.kr/