- Research article
- Open Access
Genome-wide subcellular localization of putative outer membrane and extracellular proteins in Leptospira interrogans serovar Lai genome using bioinformatics approaches
BMC Genomics volume 9, Article number: 181 (2008)
In bacterial pathogens, both cell surface-exposed outer membrane proteins and proteins secreted into the extracellular environment play crucial roles in host-pathogen interaction and pathogenesis. Considerable efforts have been made to identify outer membrane (OM) and extracellular (EX) proteins produced by Leptospira interrogans, which may be used as novel targets for the development of infection markers and leptospirosis vaccines.
In this study we used a novel computational framework that combines several prediction methods with a deductive approach to identify putative OM and EX proteins encoded by the Leptospira interrogans genome. The framework consists of the following steps: (1) identifying proteins homologous to known proteins in subcellular localization databases derived from the "consensus vote" of computational predictions, (2) incorporating homology-based search and structural information to enhance gene annotation and functional identification, and to infer specific structural characteristics and localizations, and (3) developing a specific classifier for cytoplasmic proteins (CP) and cytoplasmic membrane proteins (CM) using linear discriminant analysis (LDA). We have identified 114 putative EX and 63 putative OM proteins, of which 41% are conserved or hypothetical proteins containing sequence and/or protein folding structures similar to those of known EX and OM proteins.
Overall results derived from the combined computational analysis correlate with the available experimental evidence. This is the most extensive in silico protein subcellular localization identification to date for Leptospira interrogans serovar Lai genome that may be useful in protein annotation, discovery of novel genes and understanding the biology of Leptospira.
Leptospirosis is a globally widespread zoonosis caused by the animal spirochete pathogen Leptospira interrogans. The clinical feature of its severe disease form, known as Weil's syndrome, or acute renal failure, is associated with multiple system complications, including renal failure, meningitis, and pulmonary haemorrhage. Although early treatment for leptospirosis is important for ensuring a favorable clinical outcome, this is often difficult to achieve, as symptoms during the early stages of infection resemble those of several other systemic diseases.
One potential method for controlling the spread of leptospirosis is through the development of vaccines. Candidates for vaccine production include outer membrane (OM) and extracellular (EX) proteins, several of which have been implicated in chemotaxis, adherence and other pathogenic steps. Attempts to identify such proteins have been made previously by experimental [2–14] and computational methods [15–20]. Complete genome sequences of two serovars of L. interrogans, Lai and Copenhageni, have been reported [15–17]. Hundreds of putative membrane proteins and lipoproteins were predicted, although in many cases gene annotation may be too incomplete or inaccurate to reliably identify putative vaccine candidates.
Previous studies have tried to identify potential vaccine candidates using experimental methods and in silico predictions. Proteomic analysis of purified outer membrane vesicles (OMVs) of L. interrogans serovar Copenhageni was performed by Nally et al. and revealed 33 intact OM proteins. The study by Gamberini et al. showed 16 predicted surface exposed lipoproteins of L. interrogans serovar Copenhageni via whole genome analysis, only four of which are conserved among 8 pathogenic serovars. Since leptospiral lipoproteins are usually (but not exclusively) surface exposed proteins, and many are vaccine candidates, Setubal et al. focused on lipoprotein prediction using the spirochaetal lipoprotein (SpLip) program and identified 146 predicted lipoproteins (but not their localizations) for L. interrogans serovar Lai. The search for new potential vaccine candidates was continued by Yang et al., who used a filtering approach combining in silico analysis, comparative genome hybridization, and microarray methods to identify 226 leptospiral surface exposed proteins. All of the previous studies summarized above focus on identification of vaccine candidates.
However, both computational and experimental methods have their own drawbacks [21, 22]. Computational methods, for instance, depend on the presence of type I signal peptides [23, 24], transmembrane helices [24–26], or other particular features specifically found in previously identified membrane proteins, which may not be highly specific or sensitive. Experimental methods, on the other hand, yield results that may be complicated by cross-compartment contamination occurring during the preparation of samples, which can also result in the inclusion of false positive results in data sets [21, 22]. Hence, results obtained from both methods can occasionally lead to conflicting conclusions. We believe that such a focused approach, without any attempt to accurately identify periplasmic proteins (PP) and cytoplasmic membrane (CM) proteins, can lead to erroneous identification of PP and CM proteins as OM or EX by both in silico and experimental approaches. A holistic prediction of all membrane protein localizations will lead to better accuracy in genome annotation of membrane proteins, including vaccine candidates.
In this study we utilized a combination of three computational prediction tools, PSORTb [27, 28], Proteome Analyst (PA), and ProtCompB, to perform whole genome analysis of protein subcellular localization, and to identify novel putative L. interrogans serovar Lai OM and EX vaccine candidates. We combined the results derived from these three prediction algorithms into a consensus vote, resulting in a more accurate protein subcellular localization prediction. Furthermore, we incorporated homology searching against the DBSubloc database and structural information from the GTD prediction to enhance genome annotation, and to infer OM, EX and PP localized proteins. We also developed a specific classifier based on Linear Discriminant Analysis (LDA) for identification of leptospiral cytoplasmic proteins (CP) and cytoplasmic membrane proteins (CM), using a training set obtained from the consensus vote. We were able to assign subcellular localizations to several previously uncharacterized hypothetical proteins, thus improving L. interrogans genome annotation.
We performed the subcellular localization prediction of L. interrogans serovar Lai using the pipeline described in the Materials and Methods section (shown in Figure 1), following the steps of training set verification, consensus vote, homology and structural prediction, and finally LDA-based classification.
Training set verification: Localization predictions of a set of experimentally verified proteins with known localization
To evaluate the robustness and versatility of our protein localization procedure, we used a set of well-characterized Gram-negative bacterial proteins with experimentally verified localizations taken from the work by Gardy and Brinkman as a test set. The data set comprising 299 proteins was first analyzed by using PSORTb, PA, and ProtCompB. We found that, individually, PSORTb, PA, and ProtCompB assigned 73%, 71% and 79% of the verified protein localizations respectively (recall rate in Table 1). The overall precision rates were 97%, 95% and 83%, respectively. As expected, the overall recall rate was highest for ProtCompB, while its precision rate was the lowest. The recall rate based on "consensus vote" (see Materials and Methods) results derived from all three methods was 48% without any false positives. Relaxing the criteria by considering predicted results of any two methods or the "majority vote" resulted in an overall recall rate of 77% with a single false positive.
Since the number of outputs for EX and OM proteins agreed by all three predictions was low (low recall rate), we used structure-based homology information from GTD and/or homology search results from DBSubloc prediction as the additional information for inferring protein localization. Using this information, we assessed the likelihood of the "non-consensus vote" outputs (see Materials and Methods) being EX or OM proteins. When the information from DBSubloc and GTD predictions was also used, the overall recall rates for EX, OM and PP increased to 67%, 89% and 86% respectively, as shown in Table 1. The method resulted in 96% precision. This performance was much better than that of any of the three individual methods, or any of the above combinations. We have therefore shown that the combination of prediction tools, DBSubloc homology search and GTD structure-based prediction markedly improved the accuracy and recall for EX, OM and PP protein localization prediction, and that our prediction pipeline is applicable to subcellular localization prediction of hypothetical or unknown proteins.
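The recall and precision rates quoted above can be illustrated with a short Python sketch. It assumes that recall is the fraction of all verified proteins whose localization is predicted correctly, and that precision is the fraction of non-"Unknown" assignments that are correct; the text does not spell out these definitions, so treat them as an assumption. The function name and the toy data are hypothetical and are not taken from the 299-protein set.

```python
def recall_precision(truth, predicted, unknown="Unknown"):
    """Return (recall, precision) for one predictor.

    truth, predicted: dicts mapping protein id -> localization label.
    A missing prediction is treated as `unknown` (assumed convention).
    """
    correct = sum(1 for pid, loc in truth.items()
                  if predicted.get(pid, unknown) == loc)
    assigned = sum(1 for pid in truth
                   if predicted.get(pid, unknown) != unknown)
    recall = correct / len(truth) if truth else 0.0
    precision = correct / assigned if assigned else 0.0
    return recall, precision


# Toy data, invented for illustration only.
truth = {"p1": "CP", "p2": "CM", "p3": "OM", "p4": "EX"}
predicted = {"p1": "CP", "p2": "CM", "p3": "Unknown", "p4": "OM"}

recall, precision = recall_precision(truth, predicted)
print(recall, precision)
```

Under these definitions a predictor that leaves many proteins unassigned keeps a high precision but loses recall, which is the trade-off described for the consensus vote.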
Subcellular localization predictions of L. interrogans: Step 1 Consensus votes
After demonstrating the accuracy of our prediction pipeline with the training set, we analyzed the whole predicted proteome of L. interrogans serovar Lai using three computational predictions for protein subcellular localization: PSORTb, ProtCompB, and Proteome Analyst (PA). The results obtained from each prediction program are shown in Table 2. ProtCompB assigned subcellular localizations to all protein queries, whereas approximately 50% of protein queries were assigned as unknown localization by PSORTb and PA.
After inspection of the prediction results derived from the three prediction algorithms, it was found that 797 out of 4,727 ORFs of the L. interrogans serovar Lai genome had the following consensus vote predicted localizations: 418 cytoplasmic proteins (CP), 332 cytoplasmic membrane proteins (CM), 17 periplasmic proteins (PP), 15 outer membrane proteins (OM), and 15 extracellular/secreted proteins (EX) (Tables 2, 3, 4; Additional files 1, 2, 3). The biological functions of most of the localized proteins are already annotated. Only about 9% (68 of 797 ORFs) were proteins annotated as conserved hypothetical or unknown proteins. This shows that the consensus vote approach has a high accuracy of subcellular localization prediction for L. interrogans. However, the recall of this approach is unacceptably low, since the localization of the majority of proteins remains unknown (3930 out of 4727 proteins).
When comparing the concordance or prediction agreement rates between the three prediction methods (excluding proteins with unknown localization by one or two programs), the rates for PSORTb and PA, PSORTb and ProtCompB, and PA and ProtCompB were 70.3%, 80%, and 59.5%, respectively. PSORTb was found to have a strong propensity to assign protein queries to CP and OM proteins, while PA was found to assign preferentially to CM, PP and EX proteins (p < 0.001, chi-square tests).
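A pairwise agreement rate of this kind can be sketched in Python as follows. The sketch assumes that a protein is excluded from a pair's comparison whenever either program reports an unknown localization, and that all labels have already been mapped to a shared vocabulary; handling of ProtCompB's merged membrane class is left out. The function name and the toy predictions are hypothetical.

```python
from itertools import combinations


def pairwise_concordance(predictions, unknown="Unknown"):
    """Return {(method_a, method_b): agreement rate or None}.

    predictions: dict mapping method name -> dict of protein id -> label.
    Proteins labelled `unknown` by either method of a pair are skipped.
    """
    rates = {}
    for a, b in combinations(sorted(predictions), 2):
        shared = [pid for pid in predictions[a]
                  if pid in predictions[b]
                  and predictions[a][pid] != unknown
                  and predictions[b][pid] != unknown]
        if not shared:
            rates[(a, b)] = None  # nothing to compare for this pair
            continue
        agree = sum(predictions[a][pid] == predictions[b][pid]
                    for pid in shared)
        rates[(a, b)] = agree / len(shared)
    return rates


# Toy predictions, invented for illustration only.
toy = {
    "PSORTb": {"p1": "CP", "p2": "Unknown", "p3": "OM", "p4": "CM"},
    "PA": {"p1": "CP", "p2": "CM", "p3": "EX", "p4": "CM"},
    "ProtCompB": {"p1": "CP", "p2": "CM", "p3": "OM", "p4": "EX"},
}
rates = pairwise_concordance(toy)
for pair, rate in rates.items():
    print(pair, rate)
```

Because the denominator differs for each pair, the three rates are not directly comparable in absolute counts; only the proportion of agreeing calls among jointly assigned proteins is reported.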
Step 2: Homology-based and protein folding recognition predictions for non-consensus vote localizations
The non-consensus vote OM, EX, and PP proteins were further analyzed for localizations using DBSubloc and GTD. As presented in Tables 5 and 6, a further 99 proteins (43 out of 83 proteins predicted by two previous methods and 56 out of 617 proteins predicted by one previous method) were identified as putative EX, while a further 48 proteins (23 out of 59 proteins predicted by two methods, and 25 out of 980 proteins predicted by one method) were identified as putative OM proteins, as shown in Tables 7 and 8. Moreover, 58 proteins (20 out of 20 proteins predicted by two methods and 38 out of 504 proteins predicted by one method) were additionally predicted as PP proteins (Additional file 1). It is of interest that several protein loci currently annotated as hypothetical proteins without localization information were predicted in EX, OM and PP compartments by the combination method (Tables 3, 4, 5, 6, 7, 8, 9 and Additional file 1). The homology search and structural information from DBSubloc and GTD thus allowed further identification of EX, OM, and PP proteins from the non-consensus vote set. However, 3725 protein localizations remain unknown.
Step 3: Cytoplasmic (CP) and cytoplasmic membrane proteins (CM) identified by Linear Discriminant Analysis (LDA)
The remaining 3725 proteins with unknown localization after step 2 were further analyzed using an LDA-based classifier that we developed to identify CP and CM proteins. The training set was the set of CP and CM consensus outputs (418 CP proteins and 332 CM proteins) predicted by all three prediction programs (Additional files 2, 3; see Materials and Methods). This approach additionally identified 2272 CP and 481 CM proteins from the 3725 "unknown set" (Additional files 4, 5). We also found that 66% (1501 out of 2272) of the LDA-based predicted CP and 54% (260 out of 481) of the LDA-based predicted CM are hypothetical or unknown proteins. In other words, overall 56.3% (1516 out of 2690) of hypothetical and/or unknown proteins in the whole genome were assigned as CP and 38% as CM or helix transmembrane proteins.
After the final step in the prediction method, we are able to confidently predict the localization of 3755 (79.4%) leptospiral proteins. Our combination method thus has a considerably improved recall over the PSORTb and PA methods, approaching that of ProtCompB (Table 1). To test the final prediction accuracy, with estimated % agreement and % coverage of our combination method, we then performed the localization prediction of 28 experimentally verified proteins from several studies of leptospiral outer membrane and extracellular, or cell surface, proteins.
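The bookkeeping across the three steps can be checked with a few lines of Python. All counts below are the ones reported in the text for steps 1 to 3; the variable names are only illustrative.

```python
TOTAL_ORFS = 4727

# Counts reported in the text for each step of the pipeline.
step1_consensus = {"CP": 418, "CM": 332, "PP": 17, "OM": 15, "EX": 15}
step2_homology_structure = {"EX": 99, "OM": 48, "PP": 58}
step3_lda = {"CP": 2272, "CM": 481}

totals = {}
for step in (step1_consensus, step2_homology_structure, step3_lda):
    for localization, count in step.items():
        totals[localization] = totals.get(localization, 0) + count

localized = sum(totals.values())
unknown = TOTAL_ORFS - localized
coverage = 100 * localized / TOTAL_ORFS

print(totals)
print(localized, unknown, round(coverage, 1))
```

The per-compartment totals reproduce the 114 EX, 63 OM, 75 PP and 813 CM proteins quoted elsewhere in the article, and the 813 CM proteins correspond to about 17% of the 4,727 ORFs.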
Protein subcellular localization prediction on the experimentally verified leptospiral outer membrane and extracellular proteins
As shown in Additional file 6, the three prediction programs PSORTb, PA and ProtCompB gave markedly different predictions from one another for the 28 experimentally verified OM and EX proteins. Each of the three prediction programs had weaknesses, either poor agreement (ProtCompB) or low coverage (PSORTb and PA). Our combination approach was much better in this respect and showed good agreement and coverage.
Computational prediction of protein subcellular localization is a key step for genome annotation and for the development of drug and vaccine targets. In this study, we used a combination method to putatively assign CP, CM, PP, OM, and EX proteins. We combined the results from three different algorithms, namely PSORTb, PA and ProtCompB, into a consensus vote to obtain higher prediction accuracy. The combination approach has previously been used to significantly reduce, or exclude, false positive predictions for membrane topology prediction, and outer membrane prediction. In our case, the accuracy of the consensus vote is very high, since well characterized OM and EX proteins were predicted, including lactonizing lipase, microbial collagenase, O-sialoglycoprotein endopeptidase, Rhs family protein, CsgA or C factor, thermolysin, leucine rich repeat proteins (LRR) [41–43], Ton-B dependent outer membrane receptor proteins, OmpA, porin, heavy metal efflux pump, TolC, and general secretory pathway protein D (Table 4).
On the other hand, the recall, or sensitivity, of consensus vote prediction is low, especially for EX and OM, because the PSORTb and PA programs are known to have limitations for some proteins. PSORTb requires a training set from a limited number of experimentally-determined proteins, while PA has a disadvantage in that query proteins have to share similarity to known proteins in the Swiss-Prot database. Among high-throughput computational predictions for protein subcellular localization, PSORTb has been reported as the prediction tool that achieves the highest overall accuracy, followed closely by PA.
To overcome the limitations in PSORTb, PA and ProtCompB, the predictions for proteins predicted by only one or two out of the three prediction methods (the non-consensus vote) were refined by homology-based search using the DBSubloc database and structural annotation in GTD. This allowed us to identify protein localizations with greater confidence. The advantage of GTD is that protein folding recognition or threading methods can determine pairs of proteins that have no obvious similarities in sequence, but have similar folds. It was previously suggested that this approach should be carried out to increase prediction sensitivity for specific protein localization [22, 45, 46]. To our knowledge, this study is the first to employ GTD information to infer leptospiral protein localizations.
Structure-based information from GTD prediction revealed that the majority of the 99 EX predictions were proteins that may be secreted by the type III or the type V (autotransport) system. These proteins are shown in Tables 5 and 6 with their corresponding PDB codes. Many of the putative EX proteins that are annotated as leucine rich repeat (LRR) containing proteins share sequence similarity to PopC protein (Q9RBS2), which is secreted through the hrp-secretion apparatus or the type III secretion pathway of Ralstonia solanacearum. Structurally related well-characterized extracellular LRR proteins in other species include YopM (PDB code 1jl5), a Yersinia pestis cytotoxin, internalin B, a virulence factor of Listeria monocytogenes (PDB code 1d0b), and polygalacturonase inhibiting protein (PDB code 1ogq), a secreted protein involved in plant defense.
It is of interest to note that several L. interrogans proteins are contained within the LRR and TPR (Tetratricopeptide repeat) protein families, but predicted sub-cellular localization is not necessarily conserved among all members within each family (Table 3, 5, 6, 7, 8, 9 and Table in additional file 4). The majority of LRR proteins were predicted to be EX localized, while TPR proteins were predicted in all compartments except PP. This finding is consistent with the multiple functions of TPR homologues from more distantly related species in different sub-cellular milieux, including signal transduction, chaperone activity, cell-cycle, transcription, and protein transport [49, 50].
Out of the 48 non-consensus vote predicted OM proteins, 24 were proteins annotated as outer membrane or putative outer membrane proteins, while the remainder were proteins annotated as conserved hypothetical proteins. The structural information derived from the GTD prediction of the conserved or hypothetical proteins that were predicted as putative OM was the same as that of the annotated outer membrane proteins. As shown in Tables 7 and 8, 24 hypothetical proteins can now be annotated as putative OM.
Although it is clear that the consensus vote combined with DBSubloc and GTD prediction can give robust prediction for EX, OM and PP, there are many proteins with either CP or CM localization remaining. Using our combination approach, we found that about 17% of genes encode putative CM proteins in the L. interrogans serovar Lai genome, which is a similar proportion to the 20%–30% CM proteins in other bacterial species [25, 51]. From our subcellular location prediction we identified 63 OM and 114 EX proteins as potential vaccine candidates. On the other hand, it is possible to exclude 813 CM and 75 PP predicted proteins as vaccine candidates, on the basis of their localization.
We compared our predictions with previously published works. We found that 10 of 16 membrane proteins predicted by Gamberini et al. 2006, including four also demonstrated to be immunogenic among 8 pathogenic serovars in that study, were also predicted by our method as membrane proteins (2 EX, 1 OM, 1 PP and 6 CM). We examined the localizations of the 145 putative lipoproteins reported by Setubal et al., and found 29 EX, 2 OM, 7 PP and 26 CM proteins among 125 probable lipoproteins, and 1 PP and 3 CM among 21 possible lipoproteins. The localizations of 63 putative lipoproteins could not be identified, which included proteins containing signal peptidase II recognition sites and proteins lacking sequence and/or structural homology to known membrane proteins (see Additional file 7). Spirochaetal lipoproteins are found in four subcellular compartments: the periplasmic leaflet of the cytoplasmic membrane, the periplasmic outer leaflet of the outer membrane, or beyond the outer membrane into the environment as extracellular proteins. Therefore, 15 of the 145 putative lipoproteins identified as CP by our method are unlikely to be lipoproteins because of their localization. These false positive lipoproteins include UDP-glucose 6-dehydrogenase, cell-division protein, regulator of chromosome condensation RCC1 family, and 3-oxoacyl-[acyl-carrier protein] reductase. The frequency of falsely-identified lipoproteins just exceeds the reported 1% false positive rate for the SpLip program. Our results can be considered as complementary to those reported by Setubal et al., and increase the accuracy of lipoprotein prediction.
We also compared our predictions with the 226 leptospiral surface exposed protein predictions (extracellular, outer membrane, periplasmic, inner (cytoplasmic) membrane by their localization definition) reported by Yang et al. and found a concordance of 38.5% (87/226) (see Additional file 8). We think the discrepancies arise from false assignments generated by the prediction algorithms used, which can be identified by comparison with proteins for which there are reliable experimental data on localization (see Additional file 6) [2–14, 53–57]. Our predictions have a higher coverage and agreement with the experimentally tested L. interrogans protein set than the study by Yang et al., suggesting that our prediction method may be of greater overall utility for genome annotation of membrane proteins. After manual inspection of predicted localizations, we found further examples of possible false assignments. The greatest discrepancy was found for 42 proteins that were identified as CM by our method, but as OM by Yang et al. Some proteins among this group have homologues in other species for which there is experimental evidence of CM location, including methyl-accepting chemotaxis protein mcpB, aerotaxis sensor receptor, and penicillin-binding protein.
It was found that several loci without localization annotation were assigned by the combination prediction method. Therefore, we propose that the annotations with respect to subcellular localization for these loci can be tentatively revised. Among this group of proteins, we noted additional similarities to known protein families. One prominent group with the SBBP domain (seven beta blade propeller proteins, Pfam PF06739) contains 9 hypothetical proteins: LA0283 (LIC10239), LA0423 (LIC10371), LA0426 (LIC10373), LA1567 (LIC12209), LA1568(12209), LA1569 (LIC12208), LA1691 (LIC12099), LA3276 (LIC10868), LA3834 (LIC13066). Three loci annotated as hypothetical proteins or lipoproteins, namely LA0996 (LIC12668), LA0962 (LIC12690), and LIC13296 (LA4135), were predicted as EX localized (shown in Tables 5 and 6), and may belong to the Len (leptospiral endostatin-like lipoproteins) family, based on conservation of the DUF1554 domain (Pfam PF07588) and structural similarity to mammalian endostatin-like protein (PDB 1koe). These proteins act as adhesion proteins and bind to host extracellular matrix (ECM) [53, 57] or human factor H (Tables 5 and 6, and the table in Additional file 6). Furthermore, the loci LIC11207 (LA2823), LIC10821 (LA3340), LIC10774 (LA3394) and LIC10365 (LA0416), previously described to have similarity with the leptospiral effector protein, were identified as putative EX proteins, in agreement with their proposed immunomodulator function.
Our combination prediction method has high agreement and coverage of experimentally verified OM and EX proteins (see Additional file 6). On the other hand, experimental localization studies are limited by insufficient sensitivity to detect low abundance proteins and by cross contamination of cellular compartments during sample purification, as discussed previously by Rey et al. It is of note that several predicted PP proteins in this work, e.g. FlaB1 periplasmic flagellin (LA2017/LIC11890), have previously been identified as possible PP contaminants in experimental studies of OMV proteins [13, 20]; hence our prediction method may help in correct interpretation of future experimental verification studies, thus leading to better predictions in uncharacterized genomes. However, it should be emphasized that no automatic prediction can be accurate without experimental verification.
In this study, we have demonstrated that the specificity and sensitivity of protein subcellular localization prediction can be improved by incorporation of multiple predictive methods and structural information. By this approach, localizations can be assigned to previously hypothetical L. interrogans proteins. We think this approach is applicable to subcellular localization predictions in other prokaryote proteomes, with the caveat that some predictions are more robust than others, i.e. CP and CM better than OM, EX or PP.
Materials and Methods
Data sets
Amino acid sequence queries were 4,727 proteins of the Leptospira interrogans serovar Lai genome (chromosome I: NC_004342, chromosome II: NC_004343) and 3,728 protein ORFs of Leptospira interrogans serovar Copenhageni strain Fiocruz L1-130 (accession numbers AE016823 (chromosome I) and AE016824 (chromosome II)), obtained from GenBank. Two data sets of proteins with known subcellular localization were used. One was an experimentally confirmed data set containing 278 CP and 309 CM proteins of Gram-negative bacteria described by Gardy et al. 2003 and used for validation of the LDA-based classifier's performance. The other was a 299-protein data set containing 145 CP, 69 CM, 29 PP, 38 OM and 18 EX proteins, which was the testing data previously used to evaluate various protein localization predictions in Gardy and Brinkman.
Computational prediction tools for in silico protein localization
Several publicly available programs were used in combination. Protein subcellular localization for Gram-negative bacteria was predicted using PSORTb [27, 28], Proteome Analyst (PA), and ProtCompB. Feature-based predictions of signal peptide sequences and α-helical transmembrane proteins were made using SignalP and TMHMM [24, 25] respectively.
Homology based searching and structural annotation
Homology search for subcellular localization information was carried out using BLAST search against DBSubloc, a localization-specific protein database. Protein fold recognition, which predicts the fold of a protein sequence with distant homology to a known structure, was performed using homology search against GTD (the Genomic Threading Database).
Prediction strategy (as shown in Figure 1)
Step 1. Consensus votes prediction
We reasoned that more accurate protein subcellular localization predictions can be gained from the consensus of methods. All leptospiral protein queries were analyzed using three subcellular localization prediction tools for Gram-negative bacteria, namely PSORTb, Proteome Analyst (PA), and ProtCompB, for cytoplasmic (CP), cytoplasmic membrane (CM), periplasmic (PP), outer membrane (OM) and extracellular (EX) proteins. Note that in this version of ProtCompB, CM and OM are not distinguished, so both are predicted as membrane proteins. The consensus prediction for each sequence was calculated using a simple majority vote type procedure. If all 3 methods agreed on a localization, it was assigned as a "consensus vote". The remaining results (1 or 2 out of 3 predicted) were assigned as "non-consensus vote". The CP and CM proteins assigned in this step were used as a training set for the development of the LDA-based classifier for CP and CM in the next step.
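The voting rule described in Step 1 can be sketched in Python as follows. This is a minimal illustration, assuming that each program's output has already been mapped to one of the shared labels or to "Unknown"; how ProtCompB's merged membrane class is reconciled with CM and OM is not specified in the text and is therefore left out. The function name and return format are hypothetical.

```python
from collections import Counter

UNKNOWN = "Unknown"


def classify_vote(calls):
    """Classify the calls of the prediction programs for one protein.

    calls: list of labels, one per program (e.g. PSORTb, PA, ProtCompB).
    Returns ("consensus", label) when every program gives the same
    localization, ("non-consensus", {label: votes}) when only some
    programs support a localization, and ("unassigned", None) when
    no program assigns one.
    """
    known = [c for c in calls if c != UNKNOWN]
    if len(known) == len(calls) and len(set(known)) == 1:
        return "consensus", known[0]
    if not known:
        return "unassigned", None
    return "non-consensus", dict(Counter(known))


print(classify_vote(["OM", "OM", "OM"]))
print(classify_vote(["EX", "Unknown", "EX"]))
print(classify_vote(["CP", "CM", "Unknown"]))
```

Keeping the vote counts for non-consensus proteins makes it easy to separate the "two methods" and "one method" groups that are examined separately in Step 2.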
Step 2. Homology-based and protein folding recognition prediction
Homology-based and structural information can also be used to infer the potential localization site of query proteins [22, 45, 46]. Therefore, the remaining query proteins assigned as non-consensus vote results of PP, OM and EX were further analyzed for sequence and structure homology. Since subcellular localization is an evolutionarily conserved trait, if a protein query was homologous to a known protein with the same localization, that localization was assigned. The protein query sequences were compared to proteins in the DBSubloc database at E-value ≤ 10^-3 using BLAST search. Structure annotation of these queries was also performed using GTD prediction. The query protein sequences were assigned to structures (shown as PDB codes) predicted with a high level of probability (certain and high). In this study, structure annotations in the confidence ranges certain (0 ≤ p < 0.01%) and high (0.01% ≤ p < 0.1%), which are based on the p-value measuring the reliability of the annotation, were considered statistically significant.
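The thresholds in Step 2 can be expressed as a small Python sketch. The E-value cutoff and the two GTD confidence ranges come from the text; everything else is an assumption made for illustration: the function names, the hit tuple layouts, the idea that each structural template carries a localization label, and the rule that support from either DBSubloc or GTD is enough to accept a candidate localization.

```python
def gtd_confidence(p_percent):
    """Map a GTD p-value given in percent to the confidence class used here."""
    if 0 <= p_percent < 0.01:
        return "certain"
    if 0.01 <= p_percent < 0.1:
        return "high"
    return None  # not treated as a significant structure annotation


def refine_localization(candidate, blast_hits, gtd_hits, evalue_cutoff=1e-3):
    """Accept a non-consensus candidate (PP, OM or EX) if homology supports it.

    blast_hits: list of (evalue, localization) pairs from a DBSubloc search.
    gtd_hits: list of (p_percent, pdb_code, template_localization) triples;
              the template localization is an assumed, manually curated field.
    Returns the candidate localization, or None when it is unsupported.
    """
    blast_support = any(evalue <= evalue_cutoff and loc == candidate
                        for evalue, loc in blast_hits)
    gtd_support = any(gtd_confidence(p) is not None and loc == candidate
                      for p, _pdb, loc in gtd_hits)
    return candidate if (blast_support or gtd_support) else None


# Toy hits, invented for illustration; 1jl5 is the YopM structure cited in the text.
print(refine_localization("EX", [(1e-20, "EX")], []))
print(refine_localization("EX", [(0.5, "EX")], [(0.005, "1jl5", "EX")]))
print(refine_localization("OM", [(1e-5, "CM")], []))
```

A stricter variant could require both sources to agree; the text says "GTD and/or DBSubloc", so the permissive rule is used here.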
Step 3. Identification of putative CP and CM using the LDA based classifier
The putative CP and CM proteins identified as non-consensus vote results were further analyzed by SignalP and TMHMM. The feature attributes derived from SignalP and TMHMM predictions were then integrated and analyzed using the LDA-based classifier. Proteins classified with probabilities ≥ 0.9 to be CP or CM proteins were taken as significant. The remaining queries that could not be identified in this step were classified as "unknown" results.
LDA based Classifier for CP and CM
We developed a specific classifier using the training set derived from the consensus vote prediction of leptospiral CP and CM proteins to increase the accuracy of prediction. Our classifier was built on an LDA algorithm that analyzes multiple feature vectors taken from the SignalP-NN, SignalP-HMM and TMHMM prediction results for the set of training sequences. The accuracy of the LDA-based classifier was investigated using leave-one-out cross validation. We used experimentally determined or known CP and CM proteins of Gram-negative bacteria, previously used in the evaluation of PSORTb, as a test data set for validation of the LDA-based classifier's performance. Overall, the LDA-based classifier achieved an accuracy of 94.96%.
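The idea of a two-class LDA classifier with leave-one-out cross validation and a 0.9 probability cutoff can be sketched in plain Python. This is not the classifier used in the study: it assumes only two features per protein (a signal-peptide score and a count of predicted transmembrane helices, both hypothetical stand-ins for the SignalP and TMHMM outputs), a pooled covariance matrix, and toy training data invented for illustration.

```python
import math


def fit_lda(X, y):
    """Fit a two-class LDA (0 = CP, 1 = CM) on 2-D feature vectors."""
    groups = {c: [x for x, lab in zip(X, y) if lab == c] for c in (0, 1)}
    means = {c: [sum(col) / len(rows) for col in zip(*rows)]
             for c, rows in groups.items()}
    s = [[0.0, 0.0], [0.0, 0.0]]  # pooled within-class scatter
    for c, rows in groups.items():
        for x in rows:
            d = [x[0] - means[c][0], x[1] - means[c][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(X) - 2
    s = [[v / dof for v in row] for row in s]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [means[1][i] - means[0][i] for i in range(2)]
    w = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    mid = [(means[0][i] + means[1][i]) / 2 for i in range(2)]
    log_prior = math.log(len(groups[1]) / len(groups[0]))
    b = -(w[0] * mid[0] + w[1] * mid[1]) + log_prior
    return w, b


def posterior_cm(model, x):
    """Posterior probability of class CM under the fitted LDA model."""
    w, b = model
    z = w[0] * x[0] + w[1] * x[1] + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def call_localization(model, x, cutoff=0.9):
    """Return 'CM', 'CP' or 'Unknown' using the probability cutoff."""
    p = posterior_cm(model, x)
    if p >= cutoff:
        return "CM"
    if 1.0 - p >= cutoff:
        return "CP"
    return "Unknown"


def leave_one_out_accuracy(X, y):
    """Refit without each sample in turn and score the held-out call."""
    hits = 0
    for i in range(len(X)):
        model = fit_lda(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predicted = 1 if posterior_cm(model, X[i]) >= 0.5 else 0
        hits += predicted == y[i]
    return hits / len(X)


# Toy features: (signal-peptide score, number of predicted TM helices).
X = [(0.05, 0), (0.10, 0), (0.02, 1), (0.20, 0), (0.08, 0),
     (0.30, 4), (0.60, 6), (0.15, 3), (0.45, 8), (0.25, 5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = fit_lda(X, y)
print(call_localization(model, (0.5, 7)), call_localization(model, (0.05, 0)))
print(leave_one_out_accuracy(X, y))
```

Proteins whose posterior probability falls between the two cutoffs stay "Unknown", which mirrors the way queries that could not be classified with probability ≥ 0.9 were left unassigned in Step 3.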
Bharti AR, Nally JE, Ricaldi JN, Matthias MA, Diaz MM, Lovett MA, Levett PN, Gilman RH, Willig MR, Gotuzzo E, Vinetz JM: Leptospirosis: a zoonotic disease of global importance. Lancet Infect Dis. 2003, 3 (12): 757-771. 10.1016/S1473-3099(03)00830-2.
Haake DA, Champion CI, Martinich C, Shang ES, Blanco DR, Miller JN, Lovett MA: Molecular cloning and sequence analysis of the gene encoding OmpL1, a transmembrane outer membrane protein of pathogenic Leptospira spp. J Bacteriol. 1993, 175 (13): 4225-4234.
Shang ES, Summers TA, Haake DA: Molecular cloning and sequence analysis of the gene encoding LipL41, a surface-exposed lipoprotein of pathogenic Leptospira species. Infect Immun. 1996, 64 (6): 2322-2330.
Haake DA, Martinich C, Summers TA, Shang ES, Pruetz JD, McCoy AM, Mazel MK, Bolin CA: Characterization of leptospiral outer membrane lipoprotein LipL36: downregulation associated with late-log-phase growth and mammalian infection. Infect Immun. 1998, 66 (4): 1579-1587.
Haake DA, Chao G, Zuerner RL, Barnett JK, Barnett D, Mazel M, Matsunaga J, Levett PN, Bolin CA: The leptospiral major outer membrane protein LipL32 is a lipoprotein expressed during mammalian infection. Infect Immun. 2000, 68 (4): 2276-2285. 10.1128/IAI.68.4.2276-2285.2000.
Lee SH, Kim KA, Park YG, Seong IW, Kim MJ, Lee YJ: Identification and partial characterization of a novel hemolysin from Leptospira interrogans serovar lai. Gene. 2000, 254 (1-2): 19-28. 10.1016/S0378-1119(00)00293-6.
Cullen PA, Cordwell SJ, Bulach DM, Haake DA, Adler B: Global analysis of outer membrane proteins from Leptospira interrogans serovar Lai. Infect Immun. 2002, 70 (5): 2311-2318. 10.1128/IAI.70.5.2311-2318.2002.
Haake DA, Matsunaga J: Characterization of the leptospiral outer membrane and description of three novel leptospiral membrane proteins. Infect Immun. 2002, 70 (9): 4936-4945. 10.1128/IAI.70.9.4936-4945.2002.
Cullen PA, Haake DA, Bulach DM, Zuerner RL, Adler B: LipL21 is a novel surface-exposed lipoprotein of pathogenic Leptospira species. Infect Immun. 2003, 71 (5): 2414-2421. 10.1128/IAI.71.5.2414-2421.2003.
Koizumi N, Watanabe H: Molecular cloning and characterization of a novel leptospiral lipoprotein with OmpA domain. FEMS Microbiol Lett. 2003, 226 (2): 215-219. 10.1016/S0378-1097(03)00619-0.
Matsunaga J, Barocchi MA, Croda J, Young TA, Sanchez Y, Siqueira I, Bolin CA, Reis MG, Riley LW, Haake DA, Ko AI: Pathogenic Leptospira species express surface-exposed proteins belonging to the bacterial immunoglobulin superfamily. Mol Microbiol. 2003, 49 (4): 929-945. 10.1046/j.1365-2958.2003.03619.x.
Zhang YX, Geng Y, Bi B, He JY, Wu CF, Guo XK, Zhao GP: Identification and classification of all potential hemolysin encoding genes and their products from Leptospira interrogans serogroup Icterohaemorrhagiae serovar Lai. Acta Pharmacol Sin. 2005, 26 (4): 453-461. 10.1111/j.1745-7254.2005.00075.x.
Nally JE, Whitelegge JP, Aguilera R, Pereira MM, Blanco DR, Lovett MA: Purification and proteomic analysis of outer membrane vesicles from a clinical isolate of Leptospira interrogans serovar Copenhageni. Proteomics. 2005, 5 (1): 144-152. 10.1002/pmic.200400880.
Asuthkar S, Velineni S, Stadlmann J, Altmann F, Sritharan M: Expression and characterization of an iron-regulated hemin-binding protein, HbpA, from Leptospira interrogans serovar Lai. Infect Immun. 2007, 75 (9): 4582-4591. 10.1128/IAI.00324-07.
Ren SX, Fu G, Jiang XG, Zeng R, Miao YG, Xu H, Zhang YX, Xiong H, Lu G, Lu LF, Jiang HQ, Jia J, Tu YF, Jiang JX, Gu WY, Zhang YQ, Cai Z, Sheng HH, Yin HF, Zhang Y, Zhu GF, Wan M, Huang HL, Qian Z, Wang SY, Ma W, Yao ZJ, Shen Y, Qiang BQ, Xia QC, Guo XK, Danchin A, Saint Girons I, Somerville RL, Wen YM, Shi MH, Chen Z, Xu JG, Zhao GP: Unique physiological and pathogenic features of Leptospira interrogans revealed by whole-genome sequencing. Nature. 2003, 422 (6934): 888-893. 10.1038/nature01597.
Nascimento AL, Ko AI, Martins EA, Monteiro-Vitorello CB, Ho PL, Haake DA, Verjovski-Almeida S, Hartskeerl RA, Marques MV, Oliveira MC, Menck CF, Leite LC, Carrer H, Coutinho LL, Degrave WM, Dellagostin OA, El-Dorry H, Ferro ES, Ferro MI, Furlan LR, Gamberini M, Giglioti EA, Goes-Neto A, Goldman GH, Goldman MH, Harakava R, Jeronimo SM, Junqueira-de-Azevedo IL, Kimura ET, Kuramae EE, Lemos EG, Lemos MV, Marino CL, Nunes LR, de Oliveira RC, Pereira GG, Reis MS, Schriefer A, Siqueira WJ, Sommer P, Tsai SM, Simpson AJ, Ferro JA, Camargo LE, Kitajima JP, Setubal JC, Van Sluys MA: Comparative genomics of two Leptospira interrogans serovars reveals novel insights into physiology and pathogenesis. J Bacteriol. 2004, 186 (7): 2164-2172. 10.1128/JB.186.7.2164-2172.2004.
Nascimento AL, Verjovski-Almeida S, Van Sluys MA, Monteiro-Vitorello CB, Camargo LE, Digiampietri LA, Harstkeerl RA, Ho PL, Marques MV, Oliveira MC, Setubal JC, Haake DA, Martins EA: Genome features of Leptospira interrogans serovar Copenhageni. Braz J Med Biol Res. 2004, 37 (4): 459-477. 10.1590/S0100-879X2004000400003.
Gamberini M, Gomez RM, Atzingen MV, Martins EA, Vasconcellos SA, Romero EC, Leite LC, Ho PL, Nascimento AL: Whole-genome analysis of Leptospira interrogans to identify potential vaccine candidates against leptospirosis. FEMS Microbiol Lett. 2005, 244 (2): 305-313. 10.1016/j.femsle.2005.02.004.
Setubal JC, Reis M, Matsunaga J, Haake DA: Lipoprotein computational prediction in spirochaetal genomes. Microbiology. 2006, 152 (Pt 1): 113-121. 10.1099/mic.0.28317-0.
Yang HL, Zhu YZ, Qin JH, He P, Jiang XC, Zhao GP, Guo XK: In silico and microarray-based genomic approaches to identifying potential vaccine candidates against Leptospira interrogans. BMC Genomics. 2006, 7: 293-10.1186/1471-2164-7-293.
Rey S, Gardy JL, Brinkman FS: Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria. BMC Genomics. 2005, 6: 162-10.1186/1471-2164-6-162.
Gardy JL, Brinkman FS: Methods for predicting bacterial protein subcellular localization. Nat Rev Microbiol. 2006, 4 (10): 741-751. 10.1038/nrmicro1494.
Bendtsen JD, Nielsen H, von Heijne G, Brunak S: Improved prediction of signal peptides: SignalP 3.0. J Mol Biol. 2004, 340 (4): 783-795. 10.1016/j.jmb.2004.05.028.
Emanuelsson O, Brunak S, von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc. 2007, 2 (4): 953-971. 10.1038/nprot.2007.131.
Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.
Kall L, Krogh A, Sonnhammer EL: A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004, 338 (5): 1027-1036. 10.1016/j.jmb.2004.03.016.
Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FS: PSORT-B: Improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res. 2003, 31 (13): 3613-3617. 10.1093/nar/gkg602.
Gardy JL, Laird MR, Chen F, Rey S, Walsh CJ, Ester M, Brinkman FS: PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics. 2005, 21 (5): 617-623. 10.1093/bioinformatics/bti057.
Lu Z, Szafron D, Greiner R, Lu P, Wishart DS, Poulin B, Anvik J, Macdonell C, Eisner R: Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics. 2004, 20 (4): 547-556. 10.1093/bioinformatics/btg447.
ProtCompB - Prediction sub-cellular protein localization. [http://linux1.softberry.com/berry.phtml?topic=protcompan&group=programs&subgroup=proloc]
Guo T, Hua S, Ji X, Sun Z: DBSubLoc: database of protein subcellular localization. Nucleic Acids Res. 2004, 32 (Database issue): D122-4. 10.1093/nar/gkh109.
McGuffin LJ, Street SA, Bryson K, Sorensen SA, Jones DT: The Genomic Threading Database: a comprehensive resource for structural annotations of the genomes from key organisms. Nucleic Acids Res. 2004, 32 (Database issue): D196-9. 10.1093/nar/gkh043.
Moller S, Croning MD, Apweiler R: Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics. 2001, 17 (7): 646-653. 10.1093/bioinformatics/17.7.646.
Bagos PG, Liakopoulos TD, Hamodrakas SJ: Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics. 2005, 6: 7-10.1186/1471-2105-6-7.
Ihara F, Kageyama Y, Hirata M, Nihira T, Yamada Y: Purification, characterization, and molecular cloning of lactonizing lipase from Pseudomonas species. J Biol Chem. 1991, 266 (27): 18135-18140.
Matsushita O, Yoshihara K, Katayama S, Minami J, Okabe A: Purification and characterization of Clostridium perfringens 120-kilodalton collagenase and nucleotide sequence of the corresponding gene. J Bacteriol. 1994, 176 (1): 149-156.
Abdullah KM, Lo RY, Mellors A: Cloning, nucleotide sequence, and expression of the Pasteurella haemolytica A1 glycoprotease gene. J Bacteriol. 1991, 173 (18): 5597-5603.
Hill CW, Sandt CH, Vlazny DA: Rhs elements of Escherichia coli: a family of genetic composites each encoding a large mosaic protein. Mol Microbiol. 1994, 12 (6): 865-871. 10.1111/j.1365-2958.1994.tb01074.x.
Tukel C, Raffatellu M, Humphries AD, Wilson RP, Andrews-Polymenis HL, Gull T, Figueiredo JF, Wong MH, Michelsen KS, Akcelik M, Adams LG, Baumler AJ: CsgA is a pathogen-associated molecular pattern of Salmonella enterica serotype Typhimurium that is recognized by Toll-like receptor 2. Mol Microbiol. 2005, 58 (1): 289-304. 10.1111/j.1365-2958.2005.04825.x.
Tran L, Wu XC, Wong SL: Cloning and expression of a novel protease gene encoding an extracellular neutral protease from Bacillus subtilis. J Bacteriol. 1991, 173 (20): 6364-6372.
Gueneron M, Timmers AC, Boucher C, Arlat M: Two novel proteins, PopB, which has functional nuclear localization signals, and PopC, which has a large leucine-rich repeat domain, are secreted through the hrp-secretion apparatus of Ralstonia solanacearum. Mol Microbiol. 2000, 36 (2): 261-277. 10.1046/j.1365-2958.2000.01870.x.
Ikegami A, Honma K, Sharma A, Kuramitsu HK: Multiple functions of the leucine-rich repeat protein LrrA of Treponema denticola. Infect Immun. 2004, 72 (8): 4619-4627. 10.1128/IAI.72.8.4619-4627.2004.
Evdokimov AG, Anderson DE, Routzahn KM, Waugh DS: Unusual molecular architecture of the Yersinia pestis cytotoxin YopM: a leucine-rich repeat protein with the shortest repeating unit. J Mol Biol. 2001, 312 (4): 807-821. 10.1006/jmbi.2001.4973.
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31 (1): 365-370. 10.1093/nar/gkg095.
Nair R, Rost B: Sequence conserved for subcellular localization. Protein Sci. 2002, 11 (12): 2836-2847. 10.1110/ps.0207402.
Nair R, Rost B: Better prediction of sub-cellular localization by combining evolutionary and structural information. Proteins. 2003, 53 (4): 917-930. 10.1002/prot.10507.
Bierne H, Sabet C, Personnic N, Cossart P: Internalins: a complex family of leucine-rich repeat-containing proteins in Listeria monocytogenes. Microbes Infect. 2007, 9 (10): 1156-1166. 10.1016/j.micinf.2007.05.003.
Di Matteo A, Federici L, Mattei B, Salvi G, Johnson KA, Savino C, De Lorenzo G, Tsernoglou D, Cervone F: The crystal structure of polygalacturonase-inhibiting protein (PGIP), a leucine-rich repeat protein involved in plant defense. Proc Natl Acad Sci U S A. 2003, 100 (17): 10124-10128. 10.1073/pnas.1733690100.
D'Andrea LD, Regan L: TPR proteins: the versatile helix. Trends Biochem Sci. 2003, 28 (12): 655-662. 10.1016/j.tibs.2003.10.007.
Blatch GL, Lassle M: The tetratricopeptide repeat: a structural motif mediating protein-protein interactions. Bioessays. 1999, 21 (11): 932-939. 10.1002/(SICI)1521-1878(199911)21:11<932::AID-BIES5>3.0.CO;2-N.
Wallin E, von Heijne G: Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci. 1998, 7 (4): 1029-1038.
Haake DA: Spirochaetal lipoproteins and pathogenesis. Microbiology. 2000, 146 (Pt 7): 1491-1504.
Stevenson B, Choy HA, Pinne M, Rotondi ML, Miller MC, Demoll E, Kraiczy P, Cooley AE, Creamer TP, Suchard MA, Brissette CA, Verma A, Haake DA: Leptospira interrogans Endostatin-Like Outer Membrane Proteins Bind Host Fibronectin, Laminin and Regulators of Complement. PLoS ONE. 2007, 2 (11): e1188-10.1371/journal.pone.0001188.
Vieira ML, D'Atri LP, Schattner M, Habarta AM, Barbosa AS, de Morais ZM, Vasconcellos SA, Abreu PA, Gomez RM, Nascimento AL: A novel leptospiral protein increases ICAM-1 and E-selectin expression in human umbilical vein endothelial cells. FEMS Microbiol Lett. 2007, 276 (2): 172-180. 10.1111/j.1574-6968.2007.00924.x.
Neves FO, Abreu PA, Vasconcellos SA, de Morais ZM, Romero EC, Nascimento AL: Identification of a novel potential antigen for early-phase serodiagnosis of leptospirosis. Arch Microbiol. 2007, 188 (5): 523-532. 10.1007/s00203-007-0273-2.
Barbosa AS, Abreu PA, Neves FO, Atzingen MV, Watanabe MM, Vieira ML, Morais ZM, Vasconcellos SA, Nascimento AL: A newly identified leptospiral adhesin mediates attachment to laminin. Infect Immun. 2006, 74 (11): 6356-6364. 10.1128/IAI.00460-06.
Verma A, Hellwage J, Artiushin S, Zipfel PF, Kraiczy P, Timoney JF, Stevenson B: LfhA, a novel factor H-binding protein of Leptospira interrogans. Infect Immun. 2006, 74 (5): 2659-2666. 10.1128/IAI.74.5.2659-2666.2006.
Alexander RP, Zhulin IB: Evolutionary genomics reveals conserved structural determinants of signaling and adaptation in microbial chemoreceptors. Proc Natl Acad Sci U S A. 2007, 104 (8): 2885-2890. 10.1073/pnas.0609359104.
Amin DN, Taylor BL, Johnson MS: Organization of the aerotaxis receptor aer in the membrane of Escherichia coli. J Bacteriol. 2007, 189 (20): 7206-7212. 10.1128/JB.00871-07.
Scheffers DJ, Pinho MG: Bacterial cell wall synthesis: new insights from localization studies. Microbiol Mol Biol Rev. 2005, 69 (4): 585-607. 10.1128/MMBR.69.4.585-607.2005.
We greatly thank Philip Shaw, Sastra Chaotheing and Duangdoa Wichadakul for their helpful critical reading of and comments on the manuscript. This work was supported by a grant from the National Center for Genetic Engineering and Biotechnology, Thailand.
WV and SI participated in designing the research project. SI and EP carried out the computational analysis and developed the LDA-based classifier. WV analyzed and interpreted the results, and drafted and produced the manuscript. PP provided further insights for refining the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Putative PP proteins in the L. interrogans serovar Lai genome. This table lists the Lai locus and protein annotation of (A) 17 predicted PP proteins derived from the consensus vote prediction, (B) 20 predicted PP proteins derived from 2 out of 3 predictions with significant DBSubLoc and/or GTD predictions, and (C) 38 predicted PP proteins derived from 1 out of 3 predictions with significant DBSubLoc and/or GTD predictions. (XLS 46 KB)
Additional file 2: Putative CP proteins predicted by the consensus vote prediction in L. interrogans serovar Lai genome. This table lists the Lai locus and protein annotation of 418 predicted CP proteins derived from consensus vote and used as the training set for the development of the LDA based classifier. (XLS 76 KB)
Additional file 3: Putative CM proteins predicted by the consensus vote prediction in L. interrogans serovar Lai genome. This table lists the Lai locus and protein annotation of 332 predicted CM proteins derived from consensus vote and used as the training set for the development of the LDA based classifier. (XLS 60 KB)
Additional file 6: Subcellular localizations of 28 experimentally studied OM and EX proteins of L. interrogans serovar Lai. This table lists the protein name, the L. interrogans serovar Lai and Copenhageni loci, the experimental localization, the subcellular localization predicted by PSORTb, ProtCompB and PA, and the combination prediction for the 28 experimentally studied OM and EX proteins. (XLS 44 KB)
Additional file 7: Subcellular localization of putative lipoproteins using the combination method. This table lists the Lai locus tag and protein annotation of 125 probable lipoproteins and 21 possible lipoproteins predicted by the SpLip program, and the subcellular localization of these lipoproteins predicted by the combination method. (XLS 48 KB)
Additional file 8: Subcellular localization of vaccine candidates using the combination method. This table lists the Lai locus tag and protein annotation of 226 vaccine candidates predicted by Yang et al., and the subcellular localization of these vaccine candidates predicted by the combination method. (XLS 58 KB)