Fungal Secretome Database: Integrated platform for annotation of fungal secretomes

Background: Fungi secrete various proteins that have diverse functions. Prediction of secretory proteins using only one program is unsatisfactory. To enhance prediction accuracy, we constructed the Fungal Secretome Database (FSD).
Description: A three-layer hierarchical identification rule based on nine prediction programs was used to identify putative secretory proteins in 158 fungal/oomycete genomes (208,883 proteins, 15.21% of the total proteome). The presence of putative effectors containing known host targeting signals such as RXLX [EDQ] and RXLR was investigated, showing how strongly their occurrence is biased among species. The FSD's user-friendly interface provides summaries of prediction results and diverse web-based analysis functions through Favorite, a personalized repository.
Conclusions: The FSD can serve as an integrated platform supporting research on secretory proteins in the fungal kingdom. All data and functions described in this study can be accessed on the FSD web site at http://fsd.snu.ac.kr/.


Background
The "secretome" refers to the collection of proteins that contain a signal peptide and are processed via the endoplasmic reticulum and Golgi apparatus before secretion [1]. In organisms from bacteria to humans, secretory proteins are common and perform diverse functions. These functions include roles in the immune system [2], as neurotransmitters in the nervous system [3], and as hormones/pheromones [4], as well as acquisition of nutrients [5][6][7], building and remodeling of cell walls [8], signaling and environmental sensing [9], and competition with other organisms [10][11][12][13]. Some secretory proteins in pathogens function as effectors that carry special signatures and manipulate and/or destroy host cells. In Plasmodium and Phytophthora species, effectors carry the RXLX [EDQ] or RXLR motifs as host targeting signals [11][12][13].
With the aid of advanced genome sequencing technologies [14], the rapid increase of sequenced fungal genomes offers many opportunities to study the function and evolution of secretory proteins at the genome level [15,16]. The Comparative Fungal Genomics Platform (CFGP; http://cfgp.snu.ac.kr/) [16] now archives 235 genomes from 120 fungal/oomycete species. The accurate prediction of secretory proteins in sequenced genomes is the key to realizing such opportunities.
The widely used SignalP 3.0 program [17] detected 89.81% of the 2,512 experimentally verified sequences in SPdb [18], a database containing proteins with signal peptides. To improve the accuracy of prediction, we built a hierarchical identification pipeline based on nine prediction programs (Table 1). Through this pipeline, putative secretory proteins, including pathogen effectors, encoded by 158 fungal and oomycete genomes were identified. The Fungal Secretome Database (FSD; http://fsd.snu.ac.kr/) was established to support not only the archiving of fungal secretory proteins but also the management and use of the resulting data. The FSD also has a user-friendly web interface and offers several data analysis functions via Favorite, a personalized data repository implemented in the CFGP (http://cfgp.snu.ac.kr/) [16].

Construction and content
Evaluation of the pipeline for predicting secretory proteins
To evaluate the capabilities of four programs (SignalP 3.0 [17], SigCleave [19], SigPred [20], and RPSP [21]) for predicting signal peptides, we analyzed the secretory proteins collected in SPdb [18]. SignalP 3.0 identified 89.81% of the 2,512 proteins, and the other three programs, in combination, identified 87.50% of the proteins that were not predicted by SignalP 3.0. The remaining proteins (1.31% of 2,512 proteins) were investigated by using two programs that predict subcellular localization: PSort II [22] and TargetP 1.1b [23]. We found that 34.38% of these proteins were predicted to be extracellular, increasing the coverage to 99.16%. For the 1,093 characterized fungal/oomycete secretory proteins (Table 2), the combinatory pipeline raised the prediction coverage from 75.30% (SignalP 3.0 alone) to 84.17%. In addition, 98.14% of the 24,921 experimentally unverified sequences in the SPdb were predicted as secretory proteins by the pipeline, whereas SignalP 3.0 called 80.22% of them positive. To assess the robustness of the pipeline with non-secretory proteins, we prepared yeast proteins localized in the cytosol, endoplasmic reticulum, nucleus, or mitochondrion [24]. When these 1,955 proteins were subjected to the FSD pipeline and to SignalP 3.0, the numbers of false positives were almost the same (84 and 82, respectively).
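The stage-wise percentages above combine multiplicatively, because each stage only sees the proteins missed by the previous one. The following is a minimal sketch of that arithmetic, using only the rates and counts quoted in the text; the function and variable names are illustrative and are not part of the FSD code.

```python
def cumulative_coverage(stage_rates):
    """Overall coverage when each stage rescues a fraction of what earlier stages missed."""
    missed = 1.0
    for rate in stage_rates:
        missed *= 1.0 - rate
    return 1.0 - missed


# Stage-wise rates reported for the 2,512 SPdb proteins:
# SignalP 3.0, then SigCleave/SigPred/RPSP, then PSort II/TargetP 1.1b.
STAGE_RATES = [0.8981, 0.8750, 0.3438]
overall = cumulative_coverage(STAGE_RATES)

# False-positive rates on the 1,955 non-secretory yeast proteins.
fp_pipeline = 84 / 1955
fp_signalp = 82 / 1955

print(f"overall coverage:        {overall:.2%}")
print(f"false positives (FSD):   {fp_pipeline:.2%}")
print(f"false positives (SignalP): {fp_signalp:.2%}")
```

Under these assumptions the combined rates reproduce the reported overall coverage, and the two false-positive rates differ by roughly a tenth of a percentage point.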
Together, these results suggest that this ensemble approach could compensate for some of the weaknesses of individual programs, resulting in more robust predictions.
The FSD contains an identification pipeline that sequentially analyzes proteomes of interest using i) SignalP 3.0; ii) a combination of SigCleave, SigPred, and RPSP to screen those proteins not considered positive by SignalP 3.0; and iii) PSort II and TargetP 1.1b to analyze the negatives from the previous step. Additionally, SecretomeP 1.0f [25] was integrated into the FSD to provide information related to non-classical secretory proteins. To eliminate potential false positives, we filtered out proteins that contain i) more than one transmembrane helix predicted by TMHMM 2.0c [26] and/or ii) the endoplasmic reticulum retention signal ([KRHQSA]-[DENQ]-E-L; classified as false-positive; Figure 1A) [27]. In addition, iii) nuclear proteins predicted by both predictNLS [28] and PSort II [22] and iv) mitochondrial proteins predicted by both PSort II [22] and TargetP 1.1b [23] were eliminated because these two subcellular localizations are not related to secretory proteins.
Table 1 (partial)
SigCleave: A program to predict whether a protein has signal peptides or not [19]
SigPred: A program to predict whether a protein has signal peptides or not [20]
RPSP: A program to predict whether a protein has signal peptides or not [21]
TMHMM 2.0c: A program to predict whether a protein has trans-membrane helix(es) or not [26]
TargetP 1.1b: A program to predict a site where a protein probably resides [23]
PSort II: A program to predict a site where a protein probably resides [22]
SecretomeP 1.0f: A program to predict whether a protein is secreted by non-classical pathways or not [25]
predictNLS: A program to predict whether a protein has nuclear localization signal or not [28]
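The four false-positive filters described in this section can be sketched as a single predicate. This is an illustration rather than the FSD's actual Perl implementation: the `Prediction` record, its field names, and the location labels are hypothetical, and anchoring the retention signal at the C-terminus is an assumption that the text does not state explicitly.

```python
import re
from dataclasses import dataclass

# ER retention signal from the text: [KRHQSA]-[DENQ]-E-L.
# Assumption: the signal is matched at the C-terminal end of the protein.
ER_RETENTION = re.compile(r"[KRHQSA][DENQ]EL$")


@dataclass
class Prediction:
    """Hypothetical per-protein summary of the individual program outputs."""
    sequence: str
    tm_helices: int = 0            # helices predicted by TMHMM 2.0c
    predictnls_nuclear: bool = False
    psort_location: str = ""       # e.g. "nuclear", "mitochondrial", "extracellular"
    targetp_location: str = ""     # e.g. "mitochondrial", "secretory"


def is_false_positive(p: Prediction) -> bool:
    """Apply filters i) to iv) in the order given in the text."""
    if p.tm_helices > 1:                                   # i) more than one TM helix
        return True
    if ER_RETENTION.search(p.sequence.upper()):            # ii) ER retention signal
        return True
    if p.predictnls_nuclear and p.psort_location == "nuclear":   # iii) both programs agree
        return True
    if p.psort_location == "mitochondrial" and p.targetp_location == "mitochondrial":  # iv)
        return True
    return False
```

Note that filters iii) and iv) require agreement between two programs, so a nuclear call from PSort II alone does not remove a protein in this sketch.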
Following analysis via the pipeline, the putative secretory proteins that remain after removal of potential false positives are divided into four classes: i) SP contains all proteins predicted by SignalP 3.0; ii) SP 3 contains the proteins predicted by SigPred, SigCleave, or RPSP but not by SignalP 3.0; iii) SL contains the proteins predicted by PSort II and/or TargetP 1.1b but not by the first two steps; and iv) NS contains the proteins predicted by SecretomeP 1.0f but not by SignalP 3.0 (Figure 1A; Table 3).
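The class definitions above can be read as a short decision cascade. The sketch below follows the wording literally; the dictionary keys are hypothetical names for the boolean calls of each program, the label "SP3" stands for class SP 3, and whether class NS may overlap with SP 3 or SL is an assumption, since the text only states that NS excludes SignalP 3.0 positives.

```python
def assign_classes(calls):
    """Return the FSD classes for one protein, given boolean program calls.

    `calls` maps a program name to True/False; missing programs count as False.
    """
    classes = []
    if calls.get("SignalP"):
        classes.append("SP")                       # step i
    elif any(calls.get(name) for name in ("SigCleave", "SigPred", "RPSP")):
        classes.append("SP3")                      # step ii, only for SignalP negatives
    elif calls.get("PSortII") or calls.get("TargetP"):
        classes.append("SL")                       # step iii, negatives of steps i and ii
    # NS is defined independently of steps ii and iii in the text.
    if calls.get("SecretomeP") and not calls.get("SignalP"):
        classes.append("NS")
    return classes


print(assign_classes({"SignalP": True, "SecretomeP": True}))
print(assign_classes({"TargetP": True, "SecretomeP": True}))
```

A protein with no positive call receives an empty list and is not part of the putative secretome.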

System structure of the FSD
To improve the expandability and flexibility of the FSD, we adopted a three-layer structure (i.e., data warehouse, analysis pipeline, and user interface) in its design. The data warehouse was established using the standardized genome warehouse managed by the CFGP (http://cfgp.snu.ac.kr/) [16], which has been used in various bioinformatics systems [15,29-35]. The pipeline layer was built with a series of Perl programs. In addition to the prediction programs described above, ChloroP 1.1 and hydropathy plots [36] were included in the FSD to provide additional information on secretory proteins. Whenever new fungal genomes become available, the automated pipeline classifies them based on the predictions of nine programs, thus keeping the FSD current (Figure 1B). MySQL 5.0.67 and PHP 5.2.9 were used to maintain the database and to develop web-based user interfaces that present complex information intuitively. Web pages were served through Apache 2.2.11. Favorite, a personal data repository used in the CFGP (http://cfgp.snu.ac.kr/) [16], was integrated to provide thirteen functions for further analyses.

Secretory proteins in 158 fungal/oomycete genomes
To survey the genome-wide distribution of secretory proteins in fungi and oomycetes, we used the pipeline to analyze all predicted proteins encoded by 158 fungal/oomycete genomes. Of the 1,373,444 open reading frames (ORFs) analyzed, 92,926 (6.77%), 103,224 (7.52%), and 12,733 (0.93%) proteins belonged to classes SP, SP 3, and SL, respectively (Tables 4, 5, and 6). In total, 208,883 ORFs (15.21%) were denoted putative secretory proteins. The proteins belonging to class NS were not included in the putative secretome because they represented more than 40% of the whole proteome.
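As a quick consistency check, the class percentages and the secretome total can be recomputed from the ORF counts given above. The snippet is only a worked calculation; the names are illustrative and "SP3" stands for class SP 3.

```python
TOTAL_ORFS = 1_373_444
CLASS_COUNTS = {"SP": 92_926, "SP3": 103_224, "SL": 12_733}


def percentages(counts, total):
    """Share of each class among all ORFs, in percent, rounded to two decimals."""
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


shares = percentages(CLASS_COUNTS, TOTAL_ORFS)
secretome_total = sum(CLASS_COUNTS.values())
secretome_share = round(100 * secretome_total / TOTAL_ORFS, 2)

print(shares)
print(secretome_total, secretome_share)
```

The three class counts sum to the reported number of putative secretory proteins, and the rounded shares agree with the percentages in the text.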
To determine the distribution of classes SP, SP 3, and SL within fungi, we investigated the proportions of the three classes among phyla and subphyla (Figure 2). Class SP 3 was the largest, class SP was slightly smaller, and class SL was much smaller; this was consistent across every subphylum. Class SP was dominant only in Plasmodium species, oomycetes, and the kingdom Metazoa. Class SL did not exceed 2.10% of the whole genome, except in Plasmodium species (4.52%). Plasmodium species also showed the lowest variance among the three classes, which may reflect signal peptide-independent types of secretory proteins such as those carrying vacuolar transport signals (VTSs) [12]. These results may be partially affected by the composition of the training data for each prediction program and by inherent features of each algorithm.
The phylum Basidiomycota had a larger proportion of secretory proteins (17.90%) than other fungal taxa such as the subphylum Mucoromycotina (11.99%) and the phyla Ascomycota (12.87%) and Microsporidia (15.10%). Within the phylum Ascomycota, the subphylum Pezizomycotina showed a higher proportion of class SP (7.82%) than the subphyla Saccharomycotina and Taphrinomycotina (4.57% and 3.74%, respectively). Considering that the subphylum Pezizomycotina contains many pathogenic fungi (47 of 59) compared with the subphylum Saccharomycotina (11 of 65), the abundance of secretory proteins in the subphylum Pezizomycotina suggests that pathogens may, in general, have larger secretomes than saprophytes. In fact, Magnaporthe oryzae and Neurospora crassa, a closely related pair of pathogen and non-pathogen supported by recent phylogenomic studies [37][38][39], contain 22.31% and 16.93% of secretory proteins, respectively. Moreover, the same tendency was found across the 158 fungal/oomycete genomes archived in the FSD (pathogens and saprophytes showed 14.06% and 11.70%, respectively).

Effectors encoded by fungal/oomycete and Plasmodium genomes
Phytophthora species, a group that includes many important plant pathogens, use an RXLR signal to deliver effectors into host cells [40]. RXLR effectors were tightly co-located with signal peptides predicted by SignalP 3.0 with high confidence values (HMM and NN scores of 0.93 and 0.65, respectively) [41]. Under the same conditions, we identified 734 putative RXLR effectors from three Phytophthora species, similar to a previous study [42]. However, in 153 fungal genomes, only 0.04% of the total proteome contained this motif, suggesting that the use of RXLR for secretion is oomycete-specific. The search for the RXLR pattern in oomycetes was motivated by the RXLX [EDQ] motif of the VTS in the malaria pathogen, Plasmodium falciparum. Once P. falciparum invades the human erythrocyte, it secretes proteins that carry the pentameric VTS of the RXLX [EDQ] motif from the parasitophorous vacuole to the host cytoplasm [12,13]. To determine how many VTSs could be detected by our pipeline, we investigated 217 proteins of P. falciparum [13]. Of these, 115 proteins (53.00%) were classified as secretory proteins, defined in the FSD by the RXLX [EDQ] motif. Compared with the result predicted by SignalP 3.0 alone (41 out of 217), our pipeline detected substantially more of the proteins containing VTSs.
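The two host targeting signals discussed here are short sequence patterns, so their presence can be illustrated with regular expressions in which X matches any residue. The sketch below is a simplification: it scans the whole sequence and applies neither the SignalP 3.0 score thresholds nor any positional constraint relative to the signal peptide, and the function names are hypothetical.

```python
import re

# X = any amino acid; [EDQ] = glutamate, aspartate, or glutamine.
RXLR = re.compile(r"R.LR")
RXLX_EDQ = re.compile(r"R.L.[EDQ]")


def motif_profile(sequence):
    """Report which host targeting motifs occur anywhere in a protein sequence."""
    seq = sequence.upper()
    return {
        "RXLR": bool(RXLR.search(seq)),
        "RXLX[EDQ]": bool(RXLX_EDQ.search(seq)),
    }


def has_rxlxedq_only(sequence):
    """True if the sequence carries RXLX[EDQ] but no RXLR motif."""
    profile = motif_profile(sequence)
    return profile["RXLX[EDQ]"] and not profile["RXLR"]


print(motif_profile("MAAARSLRGG"))   # toy sequence containing R-S-L-R
print(motif_profile("MKRILAEGG"))    # toy sequence containing R-I-L-A-E
```

The toy sequences are invented for illustration and do not correspond to any protein in the FSD.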
In class SP, the proportions of proteins possessing the RXLX [EDQ] but not the RXLR motif were 96.75%, 56.18%, and 93.21% in fungi, oomycetes, and Plasmodium species, respectively (Figure 3A). There were similar proportions of the RXLX [EDQ] motif in classes SP 3 and SL across the three groups (Figure 3B and 3C). Taken together, these data show that the RXLR motif, with signal peptides predicted by SignalP 3.0, is oomycete-specific [41]. It is interesting that fungal genomes have significantly higher numbers of the RXLX [EDQ] motif than Plasmodium species (t-test based on amino acid frequency in each genome; P = 2.2e-16), suggesting

FSD web interfaces
To support browsing of the global patterns in the archived data, the FSD provides diverse charts and tables. For example, intersections of prediction results are summarized in a chart for each genome (Figure 4). Despite the many programs involved, all prediction results for each protein are displayed on one page, allowing users to browse them easily (Figure 5).
The personalized virtual space, Favorite, supports in-depth analyses in the FSD
The FSD allows users to collect proteins of interest and save them into the Favorite, which provides thirteen
Figure 2 Distribution of three classes at the phylum/subphylum level. The average ratios of the classes to the total ORFs at the subphylum and phylum levels are described. The orange circular arc represents the fungal kingdom, and the four light blue round boxes represent phyla or kingdoms. Inside the chart, the blue line represents the ratio of class SP; the red line, class SP 3; and the green line, class SL.
The web page shows a one-page summary of amino acid sequence, exon structure, and genome context via the SNUGB [15], along with 12 predictions, including signal peptides and subcellular localization.

7). From these result pages, users can again collect and store proteins in Favorite for further analyses. Additionally, Favorites created in the FSD can be shared with the CFGP (http://cfgp.snu.ac.kr/) [16], permitting users to use the 22 bioinformatics tools provided on the CFGP web site.

Conclusions
Given the availability of a large number of fungal genomes and diverse prediction programs for secretory proteins, a three-layer classification rule was established and implemented in a web-based database, the FSD. With the aid of an automated pipeline, the FSD classifies putative secretory proteins from 158 fungal/oomycete genomes into four different classes, three of which are defined as the putative secretome. The proportion of fungal secretory proteins and host targeting signals varies considerably by species. It is interesting that fungal genomes have high proportions of the RXLX [EDQ] motif, characterized as a host targeting signal in Plasmodium species. Summaries of the complex prediction results from twelve programs help users to readily access the information provided by the FSD. Favorite, a personalized virtual space in the CFGP, offers thirteen different analysis tools for further in-depth analyses. Moreover, 22 bioinformatics tools provided by the CFGP can be utilized via the Favorite. Given these features, the FSD can serve as an integrated environment for studying secretory proteins in the fungal kingdom.
Figure 6 SNU Genome Browser implemented in the FSD. The SNUGB (http://genomebrowser.snu.ac.kr/) [15] displays i) four types of signal peptides predicted by SignalP 3.0, SigCleave, SigPred, and RPSP, ii) amino acid patterns, iii) nuclear localization signals predicted by predictNLS, iv) transmembrane helices predicted by TMHMM 2.0c, and v) hydropathy plots.

Availability and requirements
All data and functions described in this paper can be freely accessed through the FSD web site at http://fsd.snu.ac.kr/.