ReVac: a reverse vaccinology computational pipeline for prioritization of prokaryotic protein vaccine candidates

Background Reverse vaccinology accelerates the discovery of potential vaccine candidates (PVCs) prior to experimental validation. Current programs typically use one bacterial proteome to identify PVCs through a filtering architecture using feature prediction programs or a machine learning approach. Filtering approaches may eliminate potential antigens based on limitations in the accuracy of prediction tools used. Machine learning approaches are heavily dependent on the selection of training datasets with experimentally validated antigens (positive control) and non-protective-antigens (negative control). The use of one or few bacterial proteomes does not assess PVC conservation among strains, an important feature of vaccine antigens. Results We present ReVac, which implements both a panoply of feature prediction programs without filtering out proteins, and scoring of candidates based on predictions made on curated positive and negative control PVCs datasets. ReVac surveys several genomes assessing protein conservation, as well as DNA and protein repeats, which may result in variable expression of PVCs. ReVac’s orthologous clustering of conserved genes, identifies core and dispensable genome components. This is useful for determining the degree of conservation of PVCs among the population of isolates for a given pathogen. Potential vaccine candidates are then prioritized based on conservation and overall feature-based scoring. We present the application of ReVac, applied to 69 Moraxella catarrhalis and 270 non-typeable Haemophilus influenzae genomes, prioritizing 64 and 29 proteins as PVCs, respectively. Conclusion ReVac’s use of a scoring scheme ranks PVCs for subsequent experimental testing. It employs a redundancy-based approach in its predictions of features using several prediction tools. The protein’s features are collated, and each protein is ranked based on the scoring scheme. Multi-genome analyses performed in ReVac allow for a comprehensive overview of PVCs from a pan-genome perspective, as an essential pre-requisite for any bacterial subunit vaccine design. ReVac prioritized PVCs of two human respiratory pathogens, identifying both novel and previously validated PVCs.


Background
Reverse vaccinology pipelines use genome datasets to identify potential vaccine candidates (PVCs) based on in silico prediction of hallmark features of an ideal vaccine candidate antigen. These features include presence of epitopes exposed on the bacterial surface for host immune recognition, antigenicity, sequence conservation across isolates, and expression during infection [1,2]. Since the development and application of reverse vaccinology to the case of Serogroup B meningococcus [3], its potential for growth has increased significantly with the advent of next-generation sequencing techniques, development of bioinformatic tools for multi-genome analyses, protein functional predictions, and high throughput protein expression platforms [4]. These advances in technology offer an opportunity to generate new reverse vaccinology programs that accurately predict candidate bacterial proteins for use in subunit-based vaccines.
Several tools have been developed for antigen prediction and vaccine candidate identification, including NERVE, Jenner-Predict, Vaxign, VaxiJen, VacSol, and Bowman-Heinson [5]. These tools typically follow either filtering or machine learning algorithms. The filtering workflows utilize a single program for each feature prediction and filter out proteins at each stage. A limitation of the filtering architecture is the potential of elimination of vaccine candidates from further analyses, in the event of a false negative prediction by any given bioinformatic tool. The machine learning workflows use datasets of known PVCs and negative controls to classify antigens and non-antigens through a probability score. To date, tools applying either of the two approaches consider protein sequences exclusively. An extensive review of all these workflows can be found in Dalsass et al. [5].
Here we describe ReVac, a computational pipeline for prediction and prioritization of protein-based bacterial vaccine candidates for experimental verification. ReVac surveys several genomes, using multiple independent tools for predictions of the same feature, to assess a large panel of protein features and sequence conservation. ReVac also scans both the protein and DNA sequences of genes for repeat sequences that could mediate phase variation (gene on/off switching) or protein structure variations, attributes that are typically not desirable in a candidate for vaccine development [6]. ReVac compiles all data across various features, at the protein and nucleotide level, from several bacterial genomes, into one tab-delimited output file. It also scores each protein based on each individual feature in parallel, without eliminating any candidate from analyses. A general problem in reverse vaccinology is that most workflows predict hundreds of proteins as vaccine candidates, rendering experimental verification assays cumbersome [5]. Although some provide a ranking of candidates based on sequence similarity with curated epitopes [7], this approach does not promote the discovery of new types of candidates from different bacteria. ReVac uses its own scoring scheme for the output of each feature prediction tool that is part of its workflow. The scoring scheme was developed, based on manually observing trends of feature predictions, of control datasets of known antigens and non-antigens. These control datasets were obtained from various antigen/epitope databases of predicted and experimentally curated proteins, namely Protegen, AntigenDB, Vaxign's control datasets, ePSORTB. We supplemented these publicly available datasets with known antigens from our Moraxella catarrhalis and non-typeable Haemophilus influenzae (NTHi) datasets [8][9][10][11][12][13]. These control datasets consist of DNA and protein sequences from various Gram-positive and Gram-negative species, which were run through ReVac (Additional file 1), and the corresponding scoring scheme is shown in Table 2.
The final output of ReVac consists of a list of predicted vaccine candidates sorted based on their ReVac scores, an aggregate scoring scheme that combines individual feature weights assigned to each of the candidates' features. This allows the user to consider candidates by perusing those with the highest ReVac scores. Importantly, ReVac accounts for strain to strain variation when prioritizing top candidates by generating clusters of orthologous genes across all genomes of the species of interest. ReVac displays average scores of gene conservation for each ortholog cluster to provide an estimate of variation. These two innovations in reverse vaccinology application allow for selection of a manageable number of conserved PVCs for experimental verification and vaccine development.

ReVac workflow
The ReVac pipeline uses the Ergatis workflow management system to analyze all data on distributed computer clusters [14]. Figure 1 shows the overall workflow and components of ReVac. Parallel computing allows ReVac to run efficiently while performing predictions on entire collections of input genomes. Analysis is launched using a list of GenBank-formatted genomes as input. ReVac's foundation components convert the GenBank files to formats suiting each predictive tool's input, as necessary. Amino acid and nucleotide gene sequence FASTA files, as well as annotation General Feature Format (GFF), files are created. Their content is then binned into smaller subsets of data that are submitted as parallel batches on the compute cluster.
ReVac utilizes several bioinformatic tools for its protein or nucleotide feature predictions ( Fig. 1 [16]. Conservation & function applies 4 different methods for generating clusters of orthologs, and implements a tool that updates annotations and assigns Gene Ontology (GO) terms [17]. Exclusion features determine protein similarity to Homo sapiens proteins (risk of autoimmunity) and a user-defined list of commensal organisms (to address the risk of depleting the microbiome), as well as the prediction of amino acid and/or nucleotide repeats that mediate phase variation. Genomic Islands (GI) prediction informs whether or not a gene is carried within a putative mobile element and therefore transmissible between isolates or species. Lastly, foundation components refer to all tools involved in file format conversion, input data generation and text processing. The implementation of multiple prediction tools and scoring schemes for most of the features considered compensates for each individual tools' potential for false negative/positive predictions. Given these attributes, ReVac offers an innovative and comprehensive workflow design for reverse vaccinology.
Outputs from ReVac's components are systematically converted into tab-delimited format and grouped by protein IDs or locus tags derived from the GenBank files. This is achieved using in-house Perl scripts, to generate ReVac's initial gene feature summary table. This table is then parsed using ReVac's scoring algorithm (Table 2) and a final score-sorted summary table is reported. These two tables include results for all genes provided as input without eliminating any potential candidates. To look for highly conserved core vaccine candidates, the scored summary table is further parsed for overall protein conservation, comparing all 4 orthology methods used, across all genomes. ReVac then refines Fig. 1 Schematic of the ReVac workflow, its components and underlying features. Blue arrows indicate the components where control datasets were used to develop the scoring algorithm. Red arrows indicate a user's input query dataset, which runs through all components and the scoring algorithm, to output a list of prioritized candidates for the supplied species. Scoring based on core genes or orthology components is indicated by the black arrow the list of PVCs for those with ReVac scores comprised of a distribution of ideal PVCs feature (i.e where the ReVac scores were penalized by a total of less than 10% of its overall score, due to the presence of undesirable PVC's scoring features). All clusters are then grouped and given an ortholog ID. Their annotation, average, minimum and maximum ReVac scores are reported at an ortholog cluster level. Based on scores observed for positive and negative controls we used, clusters harboring average scores higher than a ReVac score of 10 with minimum variation (based on the reported average, minimum and maximum) in the scores across the cluster, are ranked as top PVCs. A higher score cutoff can be chosen by the user to further reduce the number of prioritized candidates. Here, 10 was chosen as the cutoff for our NTHi and M. catarrhalis datasets, as it was observed that the frequency of non-antigens was higher below this value (Fig. 2, left peak of Controls), while the frequency of antigens formed a second distinct peak for scores 10 and higher (Fig. 2, right peak of Controls) (See also Additional file 1). Implementation of higher cutoffs to focus the list of candidates in a separate small table   does not eliminate any candidates from the complete  scored table. Other candidates can be selected by scanning the full table that shows PVCs in ranked order and evaluating the relative importance of features that may have diminished their overall score.

Control datasets used for development of the scoring scheme
The control datasets used in ReVac comprise a total of 564 proteins acquired from Vaxign, Protegen and Anti-genDB [8,9,12], as well as our manually curated list of NTHi and M. catarrhalis antigens [10,11]. Where possible, protein identifiers (IDs) from these three public databases were systematically converted to Uniprot unique IDs for consistency and ease of access to protein characteristics (Additional file 1: Sheet 3). Because ReVac is the first pipeline to consider nucleotide features associated with candidate antigens, we also obtained closely related nucleotide sequences for all public candidates by retrieval of best TBLASTN [18] hits against the National Center for Biotechnology Information (NCBI) nt database of non-redundant nucleotide sequences (all hits were to the respective species). Among other features, nucleotide sequences provided information on simple sequence repeats (SSRs) that may mediate phase variation. Since these databases contained some of the same sequences or different alleles of the same antigens, we used OrthoMCL [19] to identify their orthologs (Additional file 1). Of the 564 proteins, 376 were assigned to 102 clusters by OrthoMCL. As we were interested in the scores across all alleles of an antigen, we included all 564 in our analysis. The 564 proteins were split into 136 Gram-positive and 428 Gram-negative datasets using the species and associated Gram stain information provided from their respective databases. We also used the species hits from the TBLASTN results for this purpose. These two datasets were then run on two pipelines, each with relevant Gram-positive or Gram-negative parameters required for some of the tools incorporated in ReVac. Of the 564, 41 were unique non-antigens from Vaxign [9] and were included to assess their scores relative to our weighing scheme. All proteins from control datasets were run through the workflow (except orthology given the wide range of species represented) for development of the scoring scheme (Table 2.). Inspection of positive and negative control proteins enabled optimization and implementation of score boosting for desired features carried by real antigens, as well as maximum thresholds of penalization in the case of autoimmunity and SSRs, as described in the Methods. Summary tables from ReVac runs on all datasets are available in Additional files 1, 2 and 3.
A subset of the controls used is presented in Table 1 to illustrate the process of optimizing feature scoring. The scores for each component were developed by observing trends in the predicted features of all the tools and their correlation to whether the control protein was antigenic or non-antigenic. For example, the first 2 antigens from Table 1, the pertactin autotransporter from Bordetella pertussis and the peptidoglycan-associated outer membrane lipoprotein (P6) from NTHi, have overall subcellular localization predictions suggesting surface exposure, consistent with previous experimental findings [11,20,21]. The tools that accurately predicted these features were assigned positive weights (shown in Table 2) to identify other proteins displaying these features. In events when multiple tools show strong predictions of surface localization, the ReVac score is boosted as it was observed in multiple antigens from the dataset, and these features indicate a strong potential vaccine candidate. As for the tools that provided no features for these two antigens, they were not weighted negatively as they weren't necessary for surface exposure in the case of these two antigens but may be relevant to other proteins. We see this in the case of the Streptococcus agalactiae antigen, C protein alpha-antigen [22], where the presence of transmembrane helices and adhesin features were predicted in the protein. These tools were also assigned positive weights for identification of these features in other proteins, based on their observed frequency within the control dataset ( Table 2). Since some of the tools have no conclusive feature predictions for certain sequences, such antigens have lower overall ReVac scores.
Certain predicted features among outputs for these tools were not assigned weights as it was observed that their predictions may not accurately predict PVCs and hence, we were unable to assign a justified positive or negative weight. As such, PSORTB [13] suggests that the heparin binding protein (NHBA) from the Gramnegative bacterium Neisseria meningitidis, currently used in a multicomponent vaccine against meningococcal serogroup B, is localized exclusively in the periplasm. However, this is not consistent with experimental evidence that indicates the protein is exposed on the bacterial surface [23]. Thus, in the case of PSORTB predicted periplasmic proteins, no negative weight was assigned as some periplasmic predictions may be inaccurate or inconclusive such as in the case of NHBA. To account for this, we used multiple different tools for more accurate prediction of subcellular localization. Another example would be the case of pneumolysin from Streptococcus pneumoniae, an extracellular virulence factor [24]. PSORTB provided a strong extracellular prediction, however LipoP [25] suggested a cytoplasmic protein.
Again, for the same reason, intracellular predictions of LipoP were not penalized. Wherever similar and other trends were noticed among other tools the weights were assigned and distributed using similar justifications (Described further in Methods). The remaining non-antigens had feature predictions and annotations consistent with intracellular localization across all tools. These were assigned negative weights for each tool suggesting an intracellular localization, which should be avoided as potential PVCs. A complete list of weights assigned, and the scoring scheme is presented in Table 2 and described in the Methods.
Tools comprising the antigenicity prediction features were all assigned positive weights relative to the proportion of antigenic regions within a protein and boosted if the presence of curated epitopes within the sequence was observed. Most of these tools operate by splitting an input protein sequence into individual peptides and analyzing them individually as potential epitopes; all proteins tend to have at least some antigenic regions. As a result, weights relative to percent of antigenic regions were assigned. Lastly, adverse features are those that should be avoided when choosing any PVC, such as repeat regions or similarity to host or commensal organism proteins. ReVac identified repeats within the B. pertussis pertactin transporter and the N. meningitidis heparin binding proteins. Such repeats suggest that these antigens may undergo slipped strand mispairing resulting in phase variation of the proteins, a negative feature of vaccine antigens [6]. Antigens with sequence repeats in either promoter or protein coding regions are therefore negatively penalized. Additionally, negative scores are given to antigens with features of similarity to host and commensal proteins, to avoid the negative effects of cross reactivity of an immunizing vaccine antigen. When both features were absent, ReVac attributes positive weights to the score to increase the ranks of the PVCs away from ones having these features.
As not all the tools implemented in ReVac could be run for our control dataset, such as those related to protein conservation across their many respective species and genomes, a lower score cutoff of 8 was chosen for these datasets. Using this threshold, 74 of the 136 Grampositive antigens had a score of at least 8 with no non-antigens in the subset. 182 of 428 Gram-negative antigens had a score of at least 8 with 2 non-antigens in the subset (Table 1 and Additional file 4). It should be noted that given the breadth of species and the large number of validated antigens and non-antigens included in our control datasets, the scoring scheme we developed should be readily applicable to many bacterial pathogens. The scoring scheme can be applied iteratively to any number of new genomes being added to databases. We anticipate that the number of new genomes of interest will grow much faster than the experimental validation of new candidates that should be added to the control dataset. It is conceivable that many of the new candidates will harbor features similar to those already curated in our dataset and therefore will not change the scoring mechanism. However, when sufficient amounts of truly novel candidates become available in the future, an update to the scoring scheme could be released after some additional manual intervention. The simplest,    systematic way of identifying the need for a new release will be to determine when a critical number of important validated candidates fail to be ranked near the top of the ReVac output.
Application of ReVac to non-typeable Haemophilus influenzae (NTHi) ReVac was run on 270 NTHi genomes, derived from sputum isolates obtained from a 20-year prospective study of adults with chronic obstructive pulmonary disease (COPD) [11,26]. This dataset comprised 477,769 predicted protein encoding genes. Of these, 4477 proteins had scores that were not penalized more than 10% of their total score and grouped into ortholog clusters based on a consensus from 4 orthology prediction methods (See Methods). Each ortholog's average score was calculated, as well as its range within its ortholog cluster. Clusters were then filtered based on the presence of at least 80% of proteins present in the lowly penalized 4477 dataset. This yielded 29 ortholog clusters which were high scoring, i.e. greater than 10, that included both core and dispensable genes of NTHi. Surveying this list, provides a highly prioritized selection of orthologs for consideration, and downstream experimental verification of vaccine candidacy potential. Candidates were prioritized based on a candidate cluster being highly conserved within the species (usually > 90%), and at least 80% of the proteins in the cluster being high scoring (to allow for some allelic variation), with a narrow score distribution. Based on these results, some of our top candidates include a Hemopexin transporter protein, involved in extracellular transport of hemopexin, and the outer membrane lipoprotein P4 ( Table 3). The NTHi dataset of the analysis is provided in Additional file 2.

Application of ReVac to Moraxella catarrhalis
The Moraxella catarrhalis dataset consisted of 69 genomes, 49 were obtained from NCBI and 20 were newly sequenced by our group. The latter were obtained from sputum isolates, from patients with COPD [11,26,27]. This dataset comprised 130,179 predicted protein encoding genes. Of these, 3995 met the maximum 10% penalization filter, and again filtered based on 80% presence in a cluster. Analyses resulted in 64 high scoring ortholog clusters (greater than 10) of core and dispensable genes of M. catarrhalis. Based on these results, top candidates identified were an iron transporter protein, a Gramnegative porin protein and 2 conserved-hypothetical proteins previously unstudied ( Table 3). The M. catarrhalis dataset of the analysis is provided in Additional file 3.

ReVac benchmarking and runtime
ReVac was run on 4 major datasets, namely, 2 control datasets of Gram-positive and Gram-negative antigens, to optimize the scoring algorithm for specific components, and our test datasets of M. catarrhalis and NTHi, to prioritize potential vaccine candidates. Two smaller datasets of mycoplasma proteins were also run as test data. For benchmarking purposes, proteins were batched into groups of 1000 and 5000 sequences and were run through ReVac's most time-consuming commands to generate an estimate of CPU (Central Processing Unit) hours (Fig. 3a). These were run on an isolated, dedicated server with 2 CPUs (CPU model: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 256GB RAM), each with 12 cores with hyperthreading (approximately 48 virtual cores). An overall estimate was also made of ReVac's runtime in hours for our test datasets, with actual runtime for the entire duration of a run, on multiple servers on a compute grid without CPU usage restrictions and competition with other, unrelated compute jobs potentially submitted to the same servers by other users (Fig. 3b) [10,11,28].

Discussion
ReVac was developed to identify bacterial PVCs by evaluating multiple features based on ideal PVCs characteristics and homology to known PVCs. A major aim of Ferric iron ABC transporter iron-binding protein None All the above candidates were surface exposed, predicted antigenic, conserved core proteins with low autoimmunity and no repeat regions. Further information about these candidates are available in Additional files 2 and 3 respectively. a Present in M. canis ReVac's development was to add multilayered-redundant analyses for most of the protein feature predictions used in parallel, in order to provide additional independent confidence in the respective feature predictions. The various tools employed encompass categories of essential features of PVCs; subcellular localization, antigenicity and immunogenicity, conservation and function, and genomic islands. ReVac builds on previously published reverse vaccinology pipelines and includes several major improvements: 1) the analyses of multiple genomes for a given species enabling assessment of conservation of PVCs (core genome) or, for instance, unique dispensable genome PVCs of hyper-virulent strains. 2) the evaluation of SSRs in both coding and upstream regions to assess phase variable expression of PVCs based on the presence of this nucleotide sequence feature (enabled by the use of multiple genomes). 3) the parallel analysis and summary total of the features for each protein of the input genomes, with positive scores for desirable features versus negative scores for undesirable features, reducing the false negative elimination of candidates. And 4) flexibility to prioritize or de-prioritize any feature based on organism-specific considerations of a vaccine development project, for example. The parallel scoring system allows resolution of a PVC ranked highly or poorly due to multiple features rather than just a few. Reverse vaccinology pipelines, such as ReVac, are beneficial for the prioritization of PVCs for vaccine development against bacteria prior to experimentation. Providing short lists of ideal candidates to assay in vitro and in animal models is desirable and is especially powerful where animal models may not be well established. The human restricted pathogens Moraxella catarrhalis and non-typeable Haemophilus influenzae are examples of such bacteria with underdeveloped animal models [29]. The added innovative ability of ReVac to assess conservation and phase variability from multiple genomes is critical for these genetically diverse human pathogens [10,11].
To provide an estimate of genetic variation among the M. catarrhalis genomes, we used MASH [30] to generate a pairwise genome distance matrix and associated phylogenetic trees (Figs. 4 and 5). We observed the formation of 4 distinct clades. Two of these clades were known to be separated based on sero-susceptibility of those strains (blue and green). Upon surveying the metadata of the genomes, we discovered that one of the other two clades clustered based on their date of isolation (orange) and the most distant was another species, M. canis, which was misannotated as M. catarrhalis in NCBI (National Center for Biotechnology Information). We attempted to identify similar relationships between strains in our NTHi genomes, however, there appears to be no correlation between the strains apart from clustering based on multilocus sequence types (MLST) as previously reported [11]. There was no clear genetic clustering based on clinical source of the strain, geography, duration of persistence, exacerbation vs. colonization, or year of isolation, of these strains. Upon investigation of some of our M. catarrhalis candidates we observed that some candidates, at the protein level, could recapitulate the same separation among clades as the whole genome tree (Fig. 4c  and d). It is possible that these proteins are core drivers of inter-strain variation as not all protein clusters could replicate the whole genome tree.
Based on overall vaccine candidates from both NTHi and M. catarrhalis datasets, certain types of high scoring proteins are more prioritized over others. Hemolysins and TonB receptor proteins are identified as quality candidates in both species, however they are not included in our top candidates as they were not core proteins. Other types include various transporter proteins, outer membrane proteins and porins. This result is in part due to the outer membrane-surface exposed nature of these classes of proteins. However, ReVac's comprehensive assessment of immunogenic epitope identification and cross-referencing of proteins against commensal organism's genomes supports that there are dissimilar sequences of these proteins in the pathogen and its related commensal. Whether these proteins might be strong candidates in other species, will likely depend on their conservation across those species. The presence of redundant proteins to these will diminish their impact as vaccine candidates outside the two species considered here. Presence of similar types of proteins identified as vaccine candidates is interesting as both these species occupy similar niches in the host. In total, ReVac identified a set of surface exposed proteins in two exclusively human respiratory tract pathogens. The features of these proteins as pathogen-specific PVCs, provides strong rationale to experimentally validate the efficacy of the novel vaccine antigens against the pathogens.

Conclusion
The identification of core vaccine candidates provides a path for vaccine development against prokaryotic pathogens. Use of essential-core genome components in vaccines reduces the impact of any selective pressure, which may be imposed on a pathogen through vaccine use. c A protein alignment tree of one of ReVac's top candidates, which separates sero-sensitive and sero-resistant clades, but is absent in the other two clades (also present in the respective clades of (b)). d A protein alignment tree of the candidate iron transporter that replicates the whole genome tree topology This phenomenon has been observed in the case of Group A Streptococcus, where several potentially effective vaccine candidates are not a part of the core genome [31]. Selection of these non-core candidates could result in non-vaccine strains dominating a now vacant niche. Another strength of ReVac's prioritization of candidates on a cluster level, is that it allows for identification of clade-specific vaccine candidates. This approach could be powerful in situations where a species may be a commensal organism, but certain variants are pathogenic, such as in the case of E. coli [32]. A reverse vaccinology analysis at a whole pan-genome level will be able to identify and distinguish these candidates from core candidates.
All vaccine candidates are antigens, but not all antigens are effective vaccine candidates for various reasons. ReVac therefore prioritizes antigenic PVCs using predictions from multiple antigenicity and immunogenicity tools which are becoming more prominent as effective tools for identification of potentially novel epitope regions [33]. For example, certain antigens may induce more effective adaptive immune responses than others. It should be noted that ReVac is not an antigen predictor, but a workflow that ranks proteins by their vaccine candidacy potential. Follow-up in vitro and in vivo characterization will therefore be required to assess the validity of these PVCs as antigens. ReVac's primary mandate is to help identify, as well as reduce the number of candidates that will have to be tested by providing the user with a ranked list of PVCs.
Here we have presented ReVac, a reverse vaccinology pipeline for the identification of bacterial protein PVCs from the input of one or more annotated bacterial genomes. ReVac's implementation of a parallel scoring scheme of all proteins in the organisms' proteome and summary scores minimizes elimination of false negative antigens due to potential errors in the prediction tools implemented. False positive identification of antigens is reduced by assigning lower scores to proteins associated with non-antigens, and with similarity to human host proteins as well as commensal organisms' proteins. These multiple genome assessments performed in ReVac evaluate the conservation of PVCs and those subject to phase variable expression in a given population of bacterial strains to further reduce false positives. The compilation of these features integrated into ReVac's pipeline was tested on sets of positive and negative control antigens to optimize the scoring algorithm. We used ReVac to identify PVCs for the human-restricted pathogens M. catarrhalis and NTHi. We identified both known and novel PVCs for each bacterium, supporting the efficacy of ReVac to identify experimentally proven PVCs. Furthermore, the identification of previously uncharacterized proteins as PVCs shows the benefit of ReVac to identify novel protein targets to investigate in future studies. ReVac can be used to identify PVCs of current bacterial pathogens where vaccines are not currently developed, and can also be used to quickly identify PVCs of emerging pathogens as sequences become available.

Methods
ReVac and all its tools were built and integrated together using the open-source Ergatis workflow management system (available at http://ergatis.sourceforge.net) [14]. Ergatis allows the user to monitor bioinformatic pipelines, such as ReVac, through its user interface, and allows the parallelization of several analyses on distributed computer clusters [14]. Individual tools can be installed as Ergatis components and their analyses then parallelized, for any bioinformatic pipeline. For these reasons, Ergatis was chosen as a streamlined way of performing all ReVac's computationally intensive multi-genome analyses while allowing for easy access to output data and monitoring.
Here, we provide a description of all the tools incorporated in the ReVac pipeline, including those involved in file format conversions. The complete scoring weight scheme for each component is described in Methods. The rationale for assigning different weights to specific components was derived from the results of ReVac runs on our control datasets of known antigens and nonantigens acquired from various antigen databases from multiple bacterial species (Additional file 1 and Fig. 1). Components are grouped into broad categories (Table  2), as follows: NTHi genomes using default parameters with an overall cut off set at a value of 7.5 for the overall prediction score. Proteins that were predicted to localize at multiple sites were also included if their cumulative score was above 7.5. We used the long output-type option provided by PSORTb and predictions are provided in the summary table. Each protein was given a + 1 towards its overall score if it wholly or partially localized to the outer membrane, cell wall or was predicted to be extracellular. If the protein was predicted to localize only to the cytoplasm or cytoplasmic membrane it was given a − 1. A prediction of Periplasmic was not scored as proteins from control datasets showed that some outer membrane proteins were often miscalled as periplasmic. b. LipoP [25] -A hidden Markov model (HMM)based tool to predict lipoprotein signal peptides in Gram-negative Eubacteria, able to distinguish between lipoproteins Signal peptidase II (SPase)cleaved proteins, SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins [25]. Although it was developed for Gramnegatives, it has shown effective performance on the prediction of Gram-positive bacterial lipoproteins [25]. We ran LipoP using default parameters with a cutoff of − 3. Each protein was given a + 1 towards its overall score if it was predicted to be a signal peptidase or had transmembrane helices. c. TMHMM 2.0 [34] -A membrane protein topology prediction method, based on an HMM [34]. It predicts the number of transmembrane helices present in a protein sequence. TMHMM was run using default parameters and predicted number of helices are displayed in the summary table. Each protein was given + 0.5 if it had a single helix. It is not scored for 2 helices and penalized for more than 2 helices (− 0.2 for 3 and − 2 for 4 or more), as such proteins are harder to purify and less likely to be accessible on the cell surface. If the protein was predicted partially cytoplasmic or in the cytoplasmic membrane, it was penalized an additional − 2. d. SignalP 4.1 [35] -A program developed for the prediction of signal peptides from amino acid sequences that are targeted to the secretory pathway [35]. We ran SignalP using the 'best' option for prediction allowing a truncated length of 70 for the peptides. We also disabled its graphical output as we were only interested in presence/absence of signal peptides. Its raw output provides coordinates of a protein's signal peptide, which we then used to map back the actual signal peptide sequence for the summary table. Each protein was given a + 1 if it had a signal peptide. e. SPAAN [36] -An artificial neural network developed to predict the probability of a protein being an adhesin. Adhesins are surface exposed proteins which mediate the adhesion of microbial pathogens to host cells [36]. We ran SPAAN using a cutoff of 7. Each protein was given a + 1 if it was a predicted adhesin. f. HMMPFAM 3.0 [37] -A HMMER 3.0, HMM based tool, which queries a given amino acid sequence against multiple relevant subsets of TIGRFAM and PFAM databases [38,39] namely, surface exposed, signal proteins, secreted proteins, and bacteriocins. This database of motifs was developed from the TIGRfam v15.0 and Pfam v31.0 databases, manually filtered based on keywords in their descriptions, which were relevant to surface exposure. We used default noise threshold cutoffs for all our HMM runs. Each protein was given a + 0.5 for any positive hit against motifs in this database. 2) Antigenicity & Immunogenicity a. Antigenic (http://www.bioinformatics.nl/cgi-bin/ emboss/antigenic) -An EMBOSS package utilizing the method of Kolaskar and Tongaonkar to predict antigenic determinants in proteins [40]. This tool outputs the antigenic peptide regions within a protein. predictor also procured from IEDB. It scores amino acid residues using 6 different scale-based methods. We have applied all 6 methods in our analysis and only those peptide sequences which are predicted to be epitopes across all methods are considered. Analysis was conducted using default cutoffs and parameters across all 6 methods using a peptide length of 7. As in the case of MHC I, peptide epitopes from all 6 methods are tiled together to form consensus predicted B-cell epitope regions. Each protein is given a score of the ratio of its binding regions over its amino acid length. Here again it is boosted by the ratio of binding peptides of length 7, over the total number of peptides possible for a given protein with a sliding window of 1. f. Immunogenicity [16] -A Class I Immunogenicity tool from IEDB, that uses amino acid properties as well as their position within the peptide to predict the immunogenicity of a peptide-MHC (pMHC) complex. The peptides predicted to be effective MHC I binders from MHC Class I & NetCTLpan (the 99th percentile) are passed as inputs here and run using default parameters. Each protein is scored similar to the scheme followed in MHC Class I, but scores for 20% coverage or more, as not all MHC bound peptides induce an immune response. g. BLAT [15,16] -Using the BLAST-like Local Alignment Tool, amino acid sequences are compared to a database of experimentally curated epitope sequences acquired from IEDB. Using the BLAST output type and default parameters, protein regions mapping to curated epitopes are again tiled to get consensus epitope regions and their percent coverage. Each protein is given a score of the ratio of its homologous regions over its amino acid length. If that ratio is greater than 70% it is given another + 1. Here if the protein is positive for surface exposure in any 3 among PSORTb, LipoP, SignalP and IEDB, it is given a + 2 as this pattern was observed frequently in positive control datasets.

3) Conservation & Function
a. Jaccard Clusters of Orthologous Genes (COG) Analysis [41] -A two-phase protein clustering algorithm, used to generate protein paralog and ortholog clusters. It parses an all-v-all BLAST [42] output of whole genomes, after which a Jaccard similarity coefficient is calculated for every pair of proteins. Then it performs a bidirectional best hit analysis on the paralog clusters generated by the first phase of the algorithm, rather than on individual proteins to call its orthologs [41]. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods). b. PanOCT [43] -A tool for pan-genomic analysis of closely related prokaryotic species or strains using conserved gene neighborhood information to separate recently diverged paralogs into orthologous clusters [43]. We elected to use 90% conservation as our cut-off and ran the analysis at default parameters. The PanOCT component was adapted to use a list of GFF files, a multi-FASTA file of all amino acid sequences and an all-v-all BLAST file run using the -m8 output option, to generate its config file and then perform clustering. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods). c. OrthoMCL [19] -This program implements a scalable method for constructing orthologous groups using a Markov Cluster algorithm to group (putative) orthologs and paralogs [19]. Analysis makes use of a MySQL database to which data is stored during the clustering process, using an all-v-all m8 BLAST file and a multi-FASTA of all proteins, run at default parameters. Each protein belonging to a COG that is present in 90% of the genomes provided receives + 1 (in at least one of the conservation methods). d. LS-BSR [44] -Large Scale Blast Score Ratio (LS-BSR) compares the genetic content of hundreds to thousands of bacterial genomes and returns a matrix that describes the relatedness of all coding sequences (CDSs) in all genomes surveyed [44]. LS-BSR was run at default parameters, using a list file of strain specific DNA sequences of predicted genes to produce a BSR matrix of proteins that cluster with each other. Default cutoff is 0.7 but we elected to use 0.6 as some proteins which clustered together in other tools as core, were barely missing the default cutoff. Each protein belonging to a COG that is present in 90% of the genomes provided, receives + 1 (in at least one of the conservation methods).
Here, if a given protein is shown to cluster in 90% of the genomes across multiple methods (described above), it is given an additional + 0.1 for each of those clustering methods. e. Attributor -An in-house developed python script which refreshes annotations for a FASTA file of amino acid sequences, and assigns GO terms wherever applicable. It accepts inputs from TMHMM, LipoP, RAPSearch2 and HMMPFAM to call a specific annotation for a given protein. Each protein that has a GO term belonging to our database of surface exposed GO terms, is scored + 1 for each GO term, and − 1 for each GO term falling in our non-surface exposed GO term database, if none were identified in our surface exposed GO terms. Both GO term databases were constructed from the prokaryotic subset of GO terms, and then manually filtered for relevant surface exposed GO annotations, as well as Attributor GO term predictions for experimentally curated surface and nonsurface exposed proteins acquired from the ePSORTB database. Here we also looked to rescue PSORTb predictions calling periplasmic, when proteins are surface exposed based on attributor, and scoring an additional + 2. 4) Exclusion Features a. Autoimmunity [22] -Autoimmunity is a Perl script taken form NERVE [22] and adapted to run on Ergatis. It uses BLAST to compare each amino acid sequence as a query against the human proteome, allowing for 3 substitutions and 1 mismatch, over a minimum peptide length of 9. Raw outputs are again tiled together to give consensus regions of autoimmunity against Human proteins. Each protein that doesn't map to the human proteome is given a + 1. Any protein that has a positive hit is penalized by 2 times the ratio of its homologous regions over its amino acid length. If that ratio is greater than 20% it is penalized another − 2. b. Autoimmunity Commensals [22] -This component is an adaptation of the one above that runs against a non-redundant database of a commensal organism of choice. In our case, since we were looking at NTHi & M. catarrhalis, the most closely related species selected were H. haemolyticus and M. bovis, respectively. The database of non-redundant protein amino acid sequences was made from 13 strains of H. haemolyticus acquired from NCBI and then clustered using OrthoMCL, for NTHi. However, only one strain of M. bovis was available on NCBI. Each protein is scored the same as autoimmunity. c. SSR_Finder [45] -SSR_finder is a script developed by Siena et al. [45] which looks for Simple Sequence Repeats, up to 10 base pairs in length, in DNA coding sequences and 500 bp upstream of the gene. SSRs have been shown to contribute to phase variation of proteins allowing generation of different protein isoforms and mediating on/off translational switching through frameshifts [45]. It scans through a list of multi contig DNA fasta files for such SSRs. Each gene is given a + 1 for absence of an SSR. Presence of a repeat is penalized − 0.5 for each. An additional − 0.25 is received if the repeat is in promoter region, − 0.5 if the repeat has the potential to cause a frame shift, as well as − 0.01 times the total length of the repeat. d. SSR_Finder_Protein [45] -The above script was adapted to run on protein sequences looking for repeats, up to 20 amino acids in length, which would allow conformational changes in the protein. Each protein is penalized − 0.2 for each protein repeat, up to a maximum penalty of − 1. 5) Genomic Islands a. IslandPath [46] -IslandPath is a tool developed for the detection of genes of potential horizontal transfer origin known as Genomic Islands, in prokaryotes [46]. Islandpath accepts NCBI Protein Table (ptt) files of the proteome as well as protein & coding sequence FASTA files with their coordinates to identify possible genomic islands. Each protein present within a genomic island is penalized − 0.5.

6) Foundation Components
Input data that is passed through ReVac, requires a multitude of formats as per the component being invoked, as well as supplementary data from other predictive components. ReVac uses the following foundation components, which are provided as a part of the Ergatis workflow [14]. a. Genbank2bsml -Accepts standard GenBank file formats (.gbk) for conversion to Bioinformatic Sequence Markup Language (BSML), which is an Ergatis standard file format for data in addition to the more common FASTA files. b. Bsml2fasta -Accepts the BSML inputs from Genbank2bsml for conversion to FASTA files. This component allows for the generation of single sequence FASTA files or a larger multisequence file, providing the option of generating numerical sequence IDs for each sequence as well filtering files to contain either solely nucleotide or amino acid sequences. c. Split_multifasta -Splits the multi-sequence file(s) generated by Bsml2fasta into smaller files containing a user-defined number of sequences, for distribution onto the grid for the later components. d. Bsml2ptt -Allows conversion of the BSML files into a tab delimited ptt file. Currently only utilized in the IslandPath component. e. Extract_CDS_Features -An in-house developed script to parse Genbank files and extract all features relevant to any coding sequences present.