In silico and microarray-based genomic approaches to identifying potential vaccine candidates against Leptospira interrogans

Background Currently available vaccines against leptospirosis are of low efficacy, have an unacceptable side-effect profile, do not induce long-term protection, and provide no cross-protection against the different serovars of pathogenic Leptospira. The current major focus in leptospirosis research is to discover conserved protective antigens that may elicit longer-term protection against a broad range of Leptospira. There is a need to screen vaccine candidate genes in the genome of Leptospira interrogans.
Results Bioinformatics, comparative genomic hybridization (CGH) analysis and transcriptional analysis were used to identify vaccine candidates in the genome of L. interrogans serovar Lai strain #56601. Of a total of 4727 open reading frames (ORFs), 616 genes were predicted to encode surface-exposed proteins by P-CLASSIFIER combined with signal peptide prediction, α-helix transmembrane topology prediction, integral β-barrel outer membrane protein and lipoprotein prediction, as well as by retaining the genes shared by the two sequenced L. interrogans genomes and by subtracting genes with human homologues. A DNA microarray of L. interrogans strain #56601 was constructed for CGH analysis and transcriptome analysis in vitro. Three hundred and seven differential genes were identified in ten pathogenic serovars by CGH; 1427 genes had high transcriptional levels (Cy3 signal ≥ 342 and Cy5 signal ≥ 363.5). After the differential genes were removed from the set of 616 genes encoding predicted surface-exposed proteins, 565 genes remained. The intersection between this set of 565 and the set of 1427 highly transcribed genes contained 226 genes. These 226 genes were thus identified as putative vaccine candidates. The proteins encoded by these genes are not only potentially surface-exposed in the bacterium, but also conserved in the two sequenced L. interrogans genomes. Moreover, these genes are conserved among ten epidemic serovars in China and have high transcriptional levels in vitro.
Conclusion Of the 4727 ORFs in the genome of L. interrogans, 226 genes were identified as vaccine candidates by bioinformatics, CGH and transcriptional analysis on the basis of the theory of reverse vaccinology. The proteins encoded by these genes might be useful as vaccine candidates as well as for diagnosis of leptospirosis.


Background
Leptospirosis is a globally important zoonotic disease caused by pathogenic Leptospira species [1]. Leptospires are thin, helically coiled, motile bacteria. They are classified into 17 genomospecies (including the saprophyte Leptospira biflexa and the pathogen Leptospira interrogans) on the basis of DNA-DNA hybridization studies, or serologically into more than two hundred pathogenic serovars on the basis of structural heterogeneity in the carbohydrate component of the lipopolysaccharide [2,3]. Currently available vaccines, based on inactivated whole bacteria or membrane preparations from pathogenic leptospires, are of low efficacy, have an unacceptable side-effect profile, require annual booster immunizations and do not confer cross-protective immunity against different serovars [4][5][6]. Because of these concerns, the current major focus in leptospirosis research is to discover cross-species-conserved or cross-serovar-conserved protective antigens that may elicit longer-term protection against a broad range of Leptospira [5,7]. New vaccine development strategies are thus needed for preventing this zoonosis. Reverse vaccinology, which is based on a genomic approach, has been applied to several bacteria, and novel vaccine candidate sequences have been identified [8][9][10][11]. The genome projects of two Leptospira strains provide detailed knowledge at the whole-genome level [12][13][14]. Although many efforts have been made to identify the surface-exposed proteins of leptospires, finding vaccine candidate antigens that provide cross-protection against different serovars of pathogenic L. interrogans still requires further work [7][15][16][17].
In the current study, we identified 226 potential vaccine candidate genes against L. interrogans using in silico analysis, comparative genomic hybridization (CGH) and transcriptional analysis, based on a genome-wide DNA microarray comprising 3528 open reading frames (ORFs) derived from the original annotation of L. interrogans strain #56601. These candidate genes not only encode predicted surface-exposed proteins of L. interrogans strain #56601, but also have high transcription levels in vitro. Moreover, the proteins encoded by these genes are conserved in the two sequenced L. interrogans genomes and in ten epidemic pathogenic serovars in China.

Results
In silico analysis for identification of genes encoding surface-exposed proteins
Of the 4727 ORFs of L. interrogans strain #56601, 1282 proteins were predicted to be surface-exposed using P-CLASSIFIER, 654 proteins had signal peptides, 813 were predicted to have no more than four α-helices with transmembrane topology, 96 were predicted to have β-barrel topology, implying that they are integral β-barrel outer membrane proteins, and 158 were predicted to have a lipoprotein signal peptide using SpLiP. The intersection between the set of surface-exposed proteins identified by P-CLASSIFIER and the set of proteins with at least one of the four characteristic topologies contained 688 genes. Using BLASTP, we calculated the similarity of proteins between serovar Lai and serovar Copenhageni as well as between serovar Lai and human (cut-off values: similarity >70% and E value = 1e-10 for the two serovars; E value = 1e-10 for serovar Lai and human). We found 3672 orthologs between the two serovars, and 605 proteins that are similar in serovar Lai and human. Finally, the bioinformatics study yielded 616 genes after retaining the orthologs between the two serovars and subtracting the genes that were similar in serovar Lai and human.
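The set logic of this in silico screen can be illustrated with a minimal Python sketch. The function name, argument names and toy gene identifiers are hypothetical and are not part of the original analysis; the sketch only assumes that each prediction program yields a collection of gene identifiers.

```python
def surface_exposed_candidates(pclassifier, signal_peptide, few_tm_helices,
                               beta_barrel, lipoprotein, orthologs, human_like):
    """Combine per-program predictions (collections of gene IDs) into candidates.

    All argument names are illustrative; each is assumed to hold the gene IDs
    flagged by the corresponding prediction step.
    """
    # A gene needs at least one of the four characteristic topologies ...
    any_topology = (set(signal_peptide) | set(few_tm_helices)
                    | set(beta_barrel) | set(lipoprotein))
    # ... and a surface-exposed call from the localization classifier.
    preliminary = set(pclassifier) & any_topology
    # Retain genes shared by both sequenced serovars, then subtract
    # genes whose products are similar to human proteins.
    return (preliminary & set(orthologs)) - set(human_like)


# Toy identifiers, for illustration only.
toy = surface_exposed_candidates(
    pclassifier={"g1", "g2", "g3", "g4"},
    signal_peptide={"g1", "g5"},
    few_tm_helices={"g2"},
    beta_barrel=set(),
    lipoprotein={"g3"},
    orthologs={"g1", "g2", "g3", "g5"},
    human_like={"g2"},
)
print(sorted(toy))  # ['g1', 'g3']
```

In this toy run, g4 is dropped because it has none of the four topologies, g5 because the classifier did not call it surface-exposed, and g2 because it resembles a human protein.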

Comparative genomic hybridization
We prepared a gene chip microarray corresponding to the complete genome sequence of L. interrogans strain #56601. The chips were hybridized to labelled total DNA extracted from strain Fiocruz L1-130 and ten pathogenic serovars. On the basis of test hybridizations of strain Fiocruz L1-130 vs. the reference sample, we considered genes with hybridization ratios between 1.0 and 3.0 to be present in both strains and genes with ratios greater than 10.0 to be absent from the test strain. Ambiguous values between 3.0 and 10.0 may have been due to highly divergent genes or hybridization to paralogous genes. The CGH results revealed that 307 genes of L. interrogans strain #56601 were absent or highly divergent in at least one strain tested. After subtracting these 307 differential genes from the 616 genes selected in silico, we were left with 565 genes, which not only encode presumably surface-exposed proteins but also are conserved in the ten pathogenic serovars.
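The ratio thresholds described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the authors' implementation: the ratio is assumed to be the reference signal divided by the test signal, the handling of the exact boundary values (3.0 and 10.0) and of ratios below 1.0 is a guess, and a gene is treated as differential if any test strain gives an "absent" or "ambiguous" call. All names and values are hypothetical.

```python
def classify_cgh_ratio(ratio):
    """Label one hybridization ratio (assumed: reference signal / test signal)."""
    if ratio > 10.0:
        return "absent"        # gene absent from the test strain
    if ratio > 3.0:
        return "ambiguous"     # possibly highly divergent or paralogous
    if ratio >= 1.0:
        return "present"       # present in both strains
    return "unclassified"      # below 1.0: not covered by the text (assumption)


def differential_genes(ratios_by_gene):
    """Return genes called absent or ambiguous in at least one test strain."""
    return {
        gene
        for gene, ratios in ratios_by_gene.items()
        if any(classify_cgh_ratio(r) in ("absent", "ambiguous") for r in ratios)
    }


# Toy ratios for three genes across three test strains (illustrative values).
toy_ratios = {
    "geneA": [1.2, 1.8, 2.9],
    "geneB": [1.1, 12.5, 1.4],
    "geneC": [2.0, 4.7, 1.3],
}
print(sorted(differential_genes(toy_ratios)))  # ['geneB', 'geneC']
```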

Transcriptome analysis
Microarray analysis of the mRNA extracted from in vitro grown leptospires revealed that the fluorescence signals of Cy3 and Cy5 ranged from 10.5 to 51,707 (see Figure 1); 1427 genes were expressed above the median level (Cy3 signal ≥ 342 and Cy5 signal ≥ 363.5) in the microarray and were therefore regarded as genes with high transcriptional levels. The intersection between the sets of 565 and 1427 genes contained 226 genes. According to the predictions, 8.0% (18/226) of the encoded proteins were located extracellularly, 53.1% (120/226) in the outer membrane, 16.4% (37/226) in the periplasmic space and 22.6% (51/226) in the inner membrane. These vaccine candidates were classified further according to their gene names and clusters of orthologous groups (COGs) [18,19] (Tables 1, 2, 3 and 4); 60.6% (137/226) of the candidates had COG annotations.
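As a worked illustration of the expression cut-off and the localization percentages reported above, the following Python sketch applies both channel thresholds and recomputes the percentages from the stated counts. The function names and the toy signal values are hypothetical; only the cut-offs and the counts come from the text.

```python
CY3_CUTOFF = 342.0    # median-based cut-off for the Cy3 channel (from the text)
CY5_CUTOFF = 363.5    # median-based cut-off for the Cy5 channel (from the text)


def highly_transcribed(signals):
    """Return genes whose signal meets the cut-off in both channels.

    `signals` maps a gene ID to a (Cy3, Cy5) tuple; the layout is an assumption.
    """
    return {
        gene
        for gene, (cy3, cy5) in signals.items()
        if cy3 >= CY3_CUTOFF and cy5 >= CY5_CUTOFF
    }


def percentages(counts):
    """Express each count as a percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Toy signals (illustrative values only).
toy_signals = {"g1": (900.0, 850.0), "g2": (342.0, 300.0), "g3": (20.0, 15.0)}
print(highly_transcribed(toy_signals))  # {'g1'}

# Predicted localization counts of the 226 candidates, as reported above.
localization = {
    "extracellular": 18,
    "outer membrane": 120,
    "periplasmic space": 37,
    "inner membrane": 51,
}
print(percentages(localization))
```

The four counts sum to 226, and the recomputed percentages (8.0, 53.1, 16.4 and 22.6) match those given in the text.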

Discussion
Vaccines composed of whole cells or outer membrane envelope are available in some countries to prevent human leptospirosis, and clinical trials have been reported [20][21][22][23]. In view of their disadvantages, especially their inability to elicit longer-term protection against different serovars of pathogenic leptospires, efforts have been focused on developing subunit vaccines [24]. During recent years, Hap1 [25] (also known as LipL32 [26]), LipL41, OmpL1 [27] and Lig [28,29] proteins have been identified as promising vaccine candidates for preclinical trials.
The availability of complete genome sequence information for many pathogens and the development of sophisticated computer programs have led to a new paradigm in vaccine development. It is now possible to screen potential vaccine candidate genes in a reverse manner, starting from the genome. This reverse vaccinology was first applied to MenB [30] and is now applied routinely in vaccine development, as in the search for vaccines against Streptococcus pneumoniae, Streptococcus agalactiae, Staphylococcus aureus, Porphyromonas gingivalis, Chlamydia pneumoniae and other microorganisms [10]. Bioinformatics analysis is the first important strategy of reverse vaccinology. Gram-negative bacteria have five subcellular location sites: cytoplasm, inner membrane, outer membrane, periplasm and extracellular space. The surface-exposed proteins, i.e. those located in sites other than the cytoplasm, are the most suitable vaccine candidates because they are more susceptible to antibody recognition and can therefore elicit protective immune responses. Many sophisticated computer programs have been developed to predict the subcellular locations of putative proteins in the whole genome [31-. Analyzing the gene transcription profile using DNA microarrays provides a second vaccine candidate selection strategy in reverse vaccinology. A gene with a fluorescence signal above the median value corresponds to an expression level higher than 5-10 mRNA copies per genome [34]. Those highly expressed genes could be potential vaccine candidates [34]. Finally, other approaches such as proteomic technology can be used to screen vaccine candidates. By combining these strategies, genes encoding potential vaccine antigens can eventually be identified.
Figure 1 Identification of highly expressed genes in L. interrogans by microarray. Bacteria were grown in EMJH medium at 37°C and were collected when the culture reached mid-exponential phase. RNA was purified and labelled with either Cy3 or Cy5 and hybridized with the microarray of L. interrogans strain #56601 (3528 genes). Transcription analysis revealed that 1427 genes were highly expressed (Cy3 signal ≥ 342 and Cy5 signal ≥ 363.5).
In our preliminary selection, all genes in L. interrogans strain #56601 were searched using P-CLASSIFIER, a system for predicting the subcellular locations of proteins on the basis of amino acid subalphabets and a combination of multiple support vector machines [33]. Moreover, four topologies were predicted by the corresponding programs. Proteins predicted to be surface-exposed and having any of these four topologies were screened as preliminary vaccine candidates. All proteins with more than four predicted transmembrane spanning regions were removed from the list of candidates, not only because they are likely to be completely embedded in the cell membrane and therefore inaccessible to antibodies, but also because they are difficult to express in E. coli [34].
We retained the genes shared by the two sequenced serovars and subtracted genes that had human homologues. Human homologues were subtracted because they are likely to cause problems of autoimmunity [35]. In this way, we narrowed the list of vaccine candidates to 616 genes in the genome of L. interrogans strain #56601.
In order to explore vaccine candidates that could generate cross-protection against the diverse serovars of leptospires, we applied CGH to identify genes that are conserved among the ten pathogenic strains involved in most infections [36]. This approach allowed us to refine the vaccine candidate shortlist further by eliminating antigens that were not conserved among these serovars. As a result, we identified 565 vaccine candidates that are not only presumably surface-exposed but also conserved among the ten prevalent serovars in China.
Transcriptome analysis was performed using DNA microarrays of L. interrogans in order to assess the transcription levels of all genes in the genome. A graph of the signal obtained for each gene gave a diagonal distribution reflecting the expression level of that gene. After subtracting genes with transcriptional levels below the median, we were left with 226 genes as vaccine candidates.
Applying the theory of reverse vaccinology, we identified 226 genes as potential vaccine candidates against L. interrogans by combining bioinformatics, CGH and transcriptional analysis. Among them, 60.6% (137/226) have COG annotations; thus, nearly 40% either have an unknown function or have no COG annotation. This group of gene products offers great promise as it comprises a pool of previously unexploited vaccine targets. To evaluate our results, we compared our candidates with those identified by others. Gamberini et al. (2005) found approximately 20% potential surface proteins using an in silico approach, and sixteen proteins were recognized by antibodies present in human sera [15]. However, only three of them (LA0222, LA2637 and LA2741) appear in our final set. This is not unexpected, since 206 genes encoding hypothetical or unknown proteins were selected from approximately 20% of the genome for cloning and expression. Nally et al. (2005) characterized 32 proteins in outer membrane vesicles of L. interrogans serovar Copenhageni by two-dimensional gel electrophoresis, including previously described outer membrane proteins (OMPs); in addition, unknown, hypothetical and putative OMPs were also identified [17]. Interestingly, only two of these proteins (LA0222 and LA2637) are represented among the sixteen proteins found by Gamberini and coworkers. There is an overlap of eight genes between our result and that of Nally et al. (2005) (LA0222, LA0505, LA0616, LA1495, LA2024, LA2295, LA2637 and LA3091). The discrepancies among these results may be due to differing methodologies. Genomics, transcriptional profiling and proteomics have emerged in the post-genomic era with the potential to speed up the vaccine discovery process. It should be pointed out that these methods have their respective advantages and limitations, and can be used in a complementary manner in the development of novel vaccines. Genomics involves the use of various software programs to predict the subcellular localization of proteins. However, some algorithms have limited accuracy. Transcriptome analysis uses gene chip arrays to measure gene expression, but suffers from the fact that mRNA levels may not reflect protein levels, because expression of a transcribed gene may be regulated at the level of translation. It is believed that the proteome maps of microorganisms are important for understanding cellular status at the protein level, which cannot be deciphered from genome or transcriptome analysis [37]. Proteomics of the outer membrane can rapidly identify almost all proteins in the outer membrane. However, some of the proteins identified in membrane preparations are in fact typical cytoplasmic proteins [10,38]. Moreover, one of the major disadvantages of subproteomic studies by 2-D gel electrophoresis and mass spectrometry is the potential for contamination via leaky fractionation or lysis [39]. In their proteomic study, Nally et al. (2005) also found that outer membrane vesicles contain small amounts of inner membrane or cytoplasmic proteins [17]. It is worth mentioning here that mainly surface-exposed proteins such as LipL32 (LA2637) [26,40], LipL41 (LA0616) [27,40], LipL45 (LA2295) [41] and LipL21 (LA0011) [42] have higher transcriptional levels in our results; this suggests that the genes with higher transcriptional levels identified in our current research may be preferable for development as vaccine candidates.
This is the first time that CGH and transcription analysis have been used to identify potential candidates for vaccines against L. interrogans. Our present work corroborates previous studies, showing the advantages of reverse vaccinology [8,11]. The next step following our present research is to verify whether the selected vaccine candidates are surface-exposed and to evaluate the protective activities of these proteins. Such studies will lead to the development of safe and effective new vaccines against leptospirosis in the future.

Conclusion
We have performed high-throughput in silico and microarray-based processes that are useful for determining potential vaccine candidates against leptospirosis. In total, 226 genes were identified in the genome of L. interrogans serovar Lai type strain #56601 using bioinformatics, CGH and transcriptional analysis. The proteins encoded by these genes are not only potentially surface-exposed in the bacterium, but also conserved in the two sequenced L. interrogans genomes. Moreover, these genes are conserved among ten epidemic serovars in China and have high transcriptional levels in vitro. These proteins might therefore be useful as vaccine candidates as well as for the diagnosis of leptospirosis. Further research, including verification that these vaccine candidates are surface-exposed and evaluation of their protective activities, will aid in the study of vaccines against leptospirosis in the future.

Bacterial strains and growth conditions
Ten strains of L. interrogans were used in this study (Table 5). All the strains were obtained from the Institute for Infectious Disease Control and Prevention (IIDC), Beijing, China. Leptospires were maintained by serial passages in guinea pigs for preservation of virulence and were cultured in liquid Ellinghausen-McCullough-Johnson-Harris (EMJH) medium at 28°C or 37°C with shaking under aerobic conditions. Culture conditions were then developed to ensure that only mid-exponential-phase bacterial cultures at a mean density of 10^6/ml were used in further experimentation. The cells were harvested by centrifugation at 10,000 g for 10 min at 4°C.
The L. interrogans serogroup Icterohaemorrhagiae serovar Lai type strain #56601 (strain Lai) was used to construct the DNA microarray. The genomic DNA of strain Fiocruz L1-130 was kindly provided by the Centro de Pesquisas Goncalo Moniz.

In silico analysis
Genes and protein data for human and for the sequenced L. interrogans (serovar Lai and serovar Copenhageni) were downloaded from NCBI. P-CLASSIFIER [33] was applied to predict the subcellular locations of proteins in L. interrogans strain #56601. Signal peptide prediction was carried out using SignalP 3.0 [43]. α-Helix transmembrane topology prediction was carried out using TMHMM [44]. BOMP was used to predict β-barrel outer membrane proteins [45]. Putative lipoproteins were predicted by SpLiP [46]. To identify proteins orthologous between serovar Lai and serovar Copenhageni as well as between serovar Lai and human, all predicted proteins were searched against each other locally using BLASTP [47].
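The BLASTP filtering step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the hits are available in the standard 12-column BLAST tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E value, bit score), it uses percent identity as a stand-in for the "similarity" measure reported in the Results, and it interprets the E-value cut-off as an upper bound. The function name and the example hits are hypothetical.

```python
def filter_blast_hits(lines, min_percent=None, max_evalue=1e-10):
    """Return query IDs with at least one hit passing the cut-offs.

    `lines` are tab-separated rows in the standard 12-column BLAST tabular
    layout. Percent identity (column 3) stands in for similarity (assumption).
    """
    passing = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        percent = float(fields[2])
        evalue = float(fields[10])
        if evalue > max_evalue:
            continue
        if min_percent is not None and percent <= min_percent:
            continue
        passing.add(query)
    return passing


# Hypothetical hits, for illustration only.
example = [
    "geneA\tsubj1\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-150\t520",
    "geneB\tsubj2\t45.0\t280\t150\t4\t1\t280\t5\t284\t1e-30\t120",
    "geneC\tsubj3\t88.0\t60\t7\t0\t1\t60\t1\t60\t1e-05\t40",
]
# Two-serovar comparison: percent threshold above 70 and E value cut-off.
print(sorted(filter_blast_hits(example, min_percent=70.0)))  # ['geneA']
# Lai vs. human comparison: E value cut-off only.
print(sorted(filter_blast_hits(example)))  # ['geneA', 'geneB']
```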
Comparative genomic hybridization
DNA microarrays of L. interrogans strain #56601 consisting of 3528 annotated ORFs longer than 250 bp were prepared as previously described [48]. RNA (10 µg) was labelled with Cy3 by reverse transcription using Superscript α (Invitrogen). Unincorporated dye was removed using a QIAquick Nucleotide Removal Kit (QIAGEN) as specified in the manufacturer's protocol. Samples were hybridized under cover slides to the microarray slides overnight at 42°C, and then washed as usual.
The hybridization slides were processed by Tiffsplit (Agilent), and data were further analyzed using Genespring software 5.0.2 and normalized using mean values combined with Microsoft Excel software. Microarrays were used to assay relative RNA abundance. Flagged spots and spots with SN < 2 were excluded from the intrachip and interchip reproducibility analysis. To estimate intrachip reproducibility, we calculated the coefficients of three spots on the same chip for each gene using Microsoft Excel. The signal values from the experiments represent average mRNA abundances. As in the CGH experiments, the Cy3 and Cy5 labelling dyes were interchanged in the second hybridization. Figure 2 is a scheme of the procedure we used to identify the vaccine candidates as described above (the numbers in parentheses are the results after the corresponding procedure step).
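The reproducibility check and dye-swap averaging described above might look like the following Python sketch. Two assumptions are made that the text does not confirm: that the "coefficients" computed for the three replicate spots are coefficients of variation (sample standard deviation divided by the mean), and that the signals from the two dye-swapped hybridizations were averaged per gene. The function names and values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean of replicate spot signals."""
    return stdev(values) / mean(values)


def dye_swap_average(first, second):
    """Average per-gene signals over two hybridizations with swapped dyes.

    Only genes measured in both hybridizations are kept (assumption).
    """
    shared = first.keys() & second.keys()
    return {gene: mean([first[gene], second[gene]]) for gene in shared}


# Three replicate spots for one gene on the same chip (illustrative values).
print(round(coefficient_of_variation([90.0, 100.0, 110.0]), 3))  # 0.1

# Signals from the first and the dye-swapped second hybridization.
print(dye_swap_average({"g1": 300.0, "g2": 500.0}, {"g1": 400.0}))  # {'g1': 350.0}
```

A low coefficient of variation across the three spots indicates that replicate measurements of a gene on the same chip agree closely.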

Authors' contributions
HLY and XKG designed the research project. HLY and YZZ carried out the bioinformatics analysis. PH and HLY completed the CGH. JHQ and HLY coordinated the transcriptome analysis. HLY and XKG drafted the manuscript. XCJ and GPZ participated in the design of the study and helped to draft the manuscript. All authors contributed to the writing and preparation of the manuscript. All authors read and approved the final manuscript.

Figure 2 Schematic representation of general procedure to identify the vaccine candidates in the genome of L. interrogans (the numbers in parentheses are the results after the corresponding procedure step).