The variome of pneumococcal virulence factors and regulators

Background In recent years, the idea of a highly immunogenic protein-based vaccine to combat Streptococcus pneumoniae and its severe invasive infectious diseases has gained considerable interest. However, the target proteins to be included in a vaccine formulation must meet several genetic and immunological criteria (such as conservation, distribution, immunogenicity and protective effect) to ensure their suitability and effectiveness. This study aimed to gain comprehensive insights into the genomic organization, population distribution and genetic conservation of all pneumococcal surface-exposed proteins, genetic regulators and other virulence factors whose function and role in pathogenesis have been demonstrated or hypothesized. Results After retrieving the complete set of DNA and protein sequences reported in the databases GenBank, KEGG, VFDB, P2CS and UniProt for pneumococcal strains whose genomes have been fully sequenced and annotated, a comprehensive bioinformatic analysis and systematic comparison was performed for each virulence factor, stand-alone regulator and two-component regulatory system (TCS) encoded in the pan-genome of S. pneumoniae. A total of 25 S. pneumoniae strains, representing different pneumococcal phylogenetic lineages and serotypes, were considered. A set of 92 different genes and proteins was identified, classified and studied to construct a pan-genomic variability map (variome) for S. pneumoniae. Both pneumococcal virulence factor genes and regulatory genes were well distributed in the pneumococcal genome and exhibited a conserved feature of genome organization, in which replication and transcription are co-oriented. The analysis of the population distribution of each gene and protein showed that 49 of them are part of the core genome in pneumococci, while 43 belong to the accessory genome. Estimating the genetic variability revealed that pneumolysin, enolase and Usp45 (SP_2216 in S. pneumoniae TIGR4) are the pneumococcal virulence factors with the highest conservation, while TCS08, TCS05 and TCS02 represent the most conserved pneumococcal genetic regulators. Conclusions The results identified well-distributed and highly conserved pneumococcal virulence factors as well as regulators, representing promising candidates for a new generation of serotype-independent protein-based vaccine(s) to combat pneumococcal infections.


Background
Streptococcus pneumoniae, also known as the pneumococcus, is a Gram-positive, α-hemolytic and facultatively anaerobic bacterium. This microorganism is normally found as a harmless commensal in the upper respiratory tract of humans. Pneumococci are of great epidemiological importance due to their high impact on public health, causing more than one and a half million deaths per year worldwide [1]. S. pneumoniae is the main etiologic agent of community-acquired pneumonia. However, this is not its only clinical manifestation: other diseases such as otitis media, sinusitis, septicemia and meningitis are also caused by this pathogen and are associated with high mortality rates [2].
Given the particular biochemical and molecular features of Streptococcus pneumoniae (a Gram-positive, catalase-negative, optochin-sensitive and bile-soluble bacterium), its identification in the laboratory is relatively simple. Nevertheless, the great molecular, biochemical and immunological diversity of its capsule and other antigens, such as choline-binding proteins, makes it one of the hardest bacterial pathogens to combat [3,4]. The "Quellung reaction", developed over 100 years ago by Neufeld, allows the specific and reliable identification of each of the >94 serotypes discovered to date. The capsular polysaccharide is the sine qua non virulence factor; however, the pathogenic potential of serotypes may vary and, similarly, their prevalence varies from one geographic region to another [5]. Nevertheless, the capsule is not the only factor required by S. pneumoniae to induce disease. In fact, the surface of the pneumococcus is decorated with various proteins that have already been associated with its high pathogenic potential. In addition, their interaction with host cellular receptors has been demonstrated, and they exhibit crucial pathogenic functions such as adhesion, colonization, breaching of tissue barriers and immune evasion [6].
An important group of regulatory proteins of great interest are the histidine kinases (HK), located on the bacterial surface and functioning as the sensors of two-component regulatory systems (TCS). The sensing of environmental signals via TCS regulates the expression of genes involved in important cellular processes such as natural competence, antibiotic resistance, adaptation to different environmental conditions and surface protein expression, among others [7,8]. In general, a TCS is composed of a histidine kinase, a membrane protein that senses extracellular signals, and a cytoplasmic regulator/effector protein referred to as the response regulator (RR), to which these signals are transmitted. Signal transmission occurs via HK autophosphorylation and a subsequent trans-phosphorylation process. In Streptococcus pneumoniae, 13 TCS and one orphan RR have been identified [7].
The relevance of the cellular, physiological and pathogenic functions that these pneumococcal proteins fulfill has aroused great scientific and biotechnological interest, given their potential pharmaceutical applications as vaccine candidates [9]. Nowadays, the antibiotic treatment of infections caused by the pneumococcus is often complicated by increasing antibiotic resistance [10]. Furthermore, prevention through pneumococcal polysaccharide vaccines and/or pneumococcal conjugate vaccines only helps to control the disease caused by some of the serotypes and has an indirect impact on colonization [9]. Thus, there is an urgent need to define more global and effective strategies for treatment and/or prevention, in order to fight the pneumococcus and its local and invasive diseases. Consequently, the idea of a protein-based vaccine has gained great importance in recent years. However, in order to be considered for inclusion in a recombinant vaccine formulation, a bacterial protein has to fulfill specific criteria: (1) playing an important role in the bacterial fitness and/or pathogenesis of S. pneumoniae, (2) being widely distributed among the circulating strains and clinical isolates, (3) exhibiting high conservation of its gene and protein sequences, (4) being immunogenic, (5) conferring protection in experimental assays, and (6) having favorable physicochemical properties for expression and purification of its recombinant products.
Streptococcus pneumoniae is a pathogen that exhibits fratricidal behavior and an enormous capacity for natural competence, acquiring foreign genetic material and integrating it into its genome [11]. These processes, in addition to the mutation rates [12,13], greatly stimulate horizontal gene transfer with other microorganisms and explain pneumococcal genetic variability and genome plasticity [14,15]. This model of pneumococcal population evolution, in which recombination greatly outpaces mutation, is also favored by the relatively high number of repetitive sequences in the genome, which facilitate the incorporation of foreign DNA into the chromosome [15-18]. Consequently, these events contribute to structural reorganizations and influence the presence or absence of protein-encoding genes in different subsets of the global pneumococcal population, making them highly heterogeneous from the core- and pan-genomic point of view [15]. Likewise, the generation and fixation of particular changes in the genome affect the mutation rates, which in turn influence the evolution and conservation of genes and contribute to adaptive changes that potentially lead to increased virulence and a more complex interaction with the host [19].
Due to these molecular events and their importance, there is a need to fully and globally understand the genetic heterogeneity and variability among the different pneumococcal strains/serotypes (variome), and to gain a deeper and more detailed molecular understanding of the different physiological and pathogenic mechanisms that this microorganism uses to cause severe and life-threatening diseases. Obtaining this knowledge will allow the identification of potential pharmaceutical targets for new antimicrobial therapies. In addition, determining the degree of conservation and distribution of proteins among pneumococcal strains will help to confirm protein candidates for vaccines. However, despite the availability of a high number of completely sequenced genomes and the importance of analysing the genetic differences among pneumococci, only a few studies have focused on pneumococcal variability from a global perspective, as the human variome databases do [20]. To date, only the "Microbial Variome Database" [21], which holds and organizes the available information on the variome of the two Gram-negative bacterial species Escherichia coli and Salmonella enterica, provides such information for microorganisms. Remarkably, there are no open-source data of this nature for any Gram-positive bacterial genome. Hence, this study focused on the construction of the first S. pneumoniae Variome model, starting with the identification of all allelic and protein variants and a mutation and distribution (presence and absence) analysis of the virulence factors and regulators among a set of pneumococcal strains that possess a fully sequenced and annotated genome.

Methods
Definition of the study population set and determination of the optimal representation of the entire population of pneumococci
The search and selection of the Streptococcus pneumoniae strains for the analysis in this study was done using the microbial database of the "National Center for Biotechnology Information" NCBI (http://www.ncbi.nlm.nih.gov/genome) [22]. Likewise, in order to ensure an optimal representation of the global pneumococcal population, a genomic BLAST of 8290 available S. pneumoniae genomes was carried out. In brief, DNA alignments were performed for all the currently reported draft or completely sequenced genomes, employing the tool "Microbial Nucleotide BLAST" [23], which can be found at the website http://blast.ncbi.nlm.nih.gov/Blast.cgi. The comparative data were then employed to construct a DNA-based phylogenetic tree (dendrogram) by using the Genome Tree Report Tool of the NCBI (ncbi.nlm.nih.gov/genome/tree/176). Afterwards, the file containing the dendrogram constructed for the 8290 strains was downloaded from the NCBI database. Finally, the dendrogram file was viewed, analyzed and adapted in order to generate circular, slanted and/or rectangular cladograms by using the NCBI tool "Tree Viewer 1.17.0", which is available online at ncbi.nlm.nih.gov/projects/treeview (Fig. 1).

Definition of the virulence factors and two-component regulatory systems to be studied in S. pneumoniae
The search and selection of genes and proteins widely known as virulence factors, or of genes encoding factors with a proven interaction with the human host, was done by an exhaustive bioinformatic screening of the "Virulence Factors DataBase" (VFDB) [24], available at the website http://www.mgc.ac.cn/VFs. Additionally, the virulence factors and proteins involved in interactions with the host were confirmed and completed by a systematic review of the literature [14,25]. The common name of each selected virulence factor was then entered in the database UniProt [26], available at http://www.uniprot.org/, with the aim of obtaining the locus tag in the S. pneumoniae TIGR4 genome/strain. In addition, the genes encoding the HK or RR of the pneumococcal TCS were identified by using the database Prokaryotic Two-Component Systems (P2CS) [27], available at the website http://www.p2cs.org/index.php. Likewise, the corresponding locus tag in the S. pneumoniae TIGR4 genome/strain for each of the histidine kinase genes (hk) and response regulator genes (rr) was also recovered from the same database.

Fig. 1 Phylogenetic tree (slanted cladogram) of the pneumococcal genomes/strains. By using the online NCBI tools Genome Tree Report (ncbi.nlm.nih.gov/genome/tree/176) and Tree Viewer 1.17.0 (ncbi.nlm.nih.gov/projects/treeview), a phylogenetic tree was constructed from the genomic BLAST analysis of 8290 sequencing projects of pneumococci reported in the NCBI database. The topology of this slanted cladogram showed different pneumococcal lineages, in which the selected set of 25 pneumococcal strains can be identified in red as external nodes (the "well-distributed" key features are also highlighted in red), evidencing an optimal representation of the pneumococcal population. The overall number of sequenced pneumococcal genomes is provided for each external node. The blue lines depict those external nodes where fully sequenced and annotated genomes are located.
Chromosomal localization of the virulence factor and two-component regulatory systems genes in S. pneumoniae
The chromosomal location of all the genes in the genome of S. pneumoniae TIGR4 was determined, and the genomic maps in linear or circular representation were constructed, using the software SnapGene® (GSL Biotech), available at http://www.snapgene.com. In brief, the studied genomes of S. pneumoniae were imported through their corresponding GenBank accession codes (e.g., NC_003028.3 for TIGR4). Then, the chromosomal location was identified for each virulence factor gene, for the factors involved in the interaction with the host, and for the genes encoding proteins of stand-alone or two-component regulatory systems. Finally, linear maps showing the genomic localization of the virulence factors to scale, and circular maps showing the genomic neighborhood of the genes that form the two-component regulatory systems, were constructed.

Distribution of the virulence factors and two-component regulatory systems in the different strains of S. pneumoniae
The genetic and protein sequences of interest for the comparative analysis were identified using as reference the codes (locus tags) in the genomes of S. pneumoniae TIGR4 and/or R6 in the database Kyoto Encyclopedia of Genes and Genomes (KEGG) [28], available at http://www.kegg.jp/kegg/. Once every gene of interest was located in the database, a series of comparisons (BLAST searches) was performed using GenomeNet [29], available at http://www.genome.jp/, against only the fully sequenced and annotated genomes of S. pneumoniae. For the nucleotide sequences, the search was performed using the program BLASTN 2.2.29+, which performs nucleotide vs nucleotide alignments based on a scoring matrix BLOSUM62 [23,30]. In the same way, the search for the amino acid sequences was done using the program BLASTP 2.2.29+ [31,32], which performs amino acid vs amino acid alignments based on a similar matrix. Once the BLAST search was completed for each virulence factor, the list of hits was filtered using as selection criterion an expect value (E-value) of 0. Hits with an E-value >0 were included only after direct visual inspection of the alignments confirmed that they indeed corresponded to the same sequence. Once the list of genes and proteins that fulfilled the selection criteria had been defined, the strains of S. pneumoniae to which they belong were determined. All the DNA and protein sequences were downloaded and stored in an organized way in FASTA format.
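The selection step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run on GenomeNet: it assumes the hits were exported as tab-separated rows with four columns (query, subject, percent identity, E-value), and the function name `split_hits` and the subject identifiers are hypothetical.

```python
import csv
import io

def split_hits(tsv_text):
    """Split BLAST hits into accepted (E-value == 0) and to-inspect (E-value > 0)."""
    accepted, inspect = [], []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for query, subject, identity, evalue in reader:
        hit = {
            "query": query,
            "subject": subject,
            "identity": float(identity),
            "evalue": float(evalue),
        }
        # An E-value of exactly 0 passes the automatic criterion; every
        # other hit is set aside for visual inspection of its alignment.
        if hit["evalue"] == 0.0:
            accepted.append(hit)
        else:
            inspect.append(hit)
    return accepted, inspect

# Hypothetical rows: SP_2216 is the TIGR4 locus tag named in the text,
# the subject identifiers and values are invented for illustration.
EXAMPLE = (
    "SP_2216\tstrainA_0001\t99.8\t0.0\n"
    "SP_2216\tstrainB_0042\t87.5\t3e-45\n"
)
accepted, inspect = split_hits(EXAMPLE)
print(len(accepted), "accepted;", len(inspect), "for manual inspection")
```

Keeping the second list separate mirrors the two-stage rule in the text: the automatic criterion never discards a hit outright, it only decides which alignments still need to be checked by eye.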

Genetic variability (variome) of the virulence factors and two-component regulatory systems among the different pneumococcal strains
The multiple comparative alignments of pneumococcal sequences were done using the web tool MultAlin [33], available at http://multalin.toulouse.inra.fr/multalin/, with an identity matrix (1-0) so that even the slightest change in the nucleotide or amino acid sequences was penalized, covering substitutions, deletions, insertions and variations in length. From these analyses, the number of allelic and protein variants was determined for each gene according to the registry value assigned by the program to each sequence, where identical sequences have the same registry value, while different sequences have different values. The results of the alignments were manually curated and stored for further analysis. Finally, the precise determination of the total number of mutations was done with the software DnaSP [34,35], available at http://www.ub.edu/dnasp/. There, all the sequences found for a given gene were introduced and the calculations were performed for the corresponding type of mutation as mentioned before.
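The counting of allelic and protein variants can be illustrated with a short Python sketch. It does not reproduce MultAlin; it only applies the same identity criterion (any difference, including a difference in length, defines a new variant) by exact string comparison. The function name `assign_variants`, the strain labels and the toy sequences are hypothetical.

```python
def assign_variants(sequences):
    """Give each strain a variant number; identical sequences share a number."""
    registry = {}    # sequence -> variant number
    assignment = {}  # strain -> variant number
    for strain, seq in sequences.items():
        key = seq.upper()
        if key not in registry:
            # First time this exact sequence is seen: open a new variant.
            registry[key] = len(registry) + 1
        assignment[strain] = registry[key]
    return assignment, len(registry)

# Toy data (invented): strainC carries a synonymous change (GCT -> GCC),
# strainD a nonsynonymous one (GCT -> GTT).
dna = {
    "strainA": "ATGGCTAAA",
    "strainB": "ATGGCTAAA",
    "strainC": "ATGGCCAAA",
    "strainD": "ATGGTTAAA",
}
protein = {
    "strainA": "MAK",
    "strainB": "MAK",
    "strainC": "MAK",
    "strainD": "MVK",
}

alleles, n_alleles = assign_variants(dna)
proteins, n_proteins = assign_variants(protein)
print(n_alleles, "allelic variants;", n_proteins, "protein variants")
```

In this toy case the number of protein variants is lower than the number of allelic variants because a synonymous substitution creates a new allele without changing the protein.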

Results and discussion
Hundreds to thousands of S. pneumoniae strains and clinical isolates recovered from the nasopharynx, blood or cerebrospinal fluid (CSF) have been included to date in genomic sequencing projects worldwide. However, this study focuses on pneumococcal strains whose genomes are fully sequenced, annotated and publicly available. Therefore, a set of 25 pneumococcal strains was selected from the NCBI database as the study population to perform the bioinformatic analysis needed to construct the variome of the virulence factors and two-component regulatory systems of Streptococcus pneumoniae (Table 1). A Variome model of the pneumococcal virulence factors and regulators is an intraspecific study that aims to highlight variable genetic loci in the genome of Streptococcus pneumoniae. A perfect and ultimate Variome model would be one constructed with 100% of the genomic information, correctly assessed, from the entire pneumococcal population. However, the current state of the art is far from this scenario, and an optimal representation of the pneumococcal genomes assessed to date is therefore needed to validate these genomic analyses. Currently, 8290 pneumococcal sequencing projects are reported as draft or complete genomes in the Genome Assembly and Annotation Report of the NCBI database. Therefore, a global genomic BLAST (DNA alignment) of those 8290 available S. pneumoniae genomes/strains was performed, and a DNA-based phylogenetic tree was constructed by using the Genome Tree Report Tool of the NCBI. The topology of this phylogenetic tree (slanted cladogram) showed different pneumococcal lineages, in which the selected set of 25 pneumococcal genomes/strains can be identified as external nodes ("well-distributed" key features highlighted in red), evidencing an optimal representation of the pneumococcal population (Fig. 1).
In addition, it is important to highlight that the serotypes represented in this study population set (1, 2, 3, 4, 5, 6B, 11A, 14, 19A, 19F and 23F) have been described as the pneumococcal types with the highest pathogenic potential, due to the high burden of invasive pneumococcal diseases (IPDs) they cause worldwide. This is why the majority of them (except serotypes 2 and 11A) have been included in the pneumococcal conjugate vaccines (PCVs) currently used for immunization [1].
A considerable initial number of pneumococcal virulence factor genes was identified by employing the database VFDB [24]. This database provided further detailed information to establish their function, pathogenic role and type of interaction with a receptor in the human host. Additionally, a systematic screening of the literature [14] not only allowed the confirmation of the identified factors, but also made it possible to complement the list with additional factors that have not been included in the databases. Likewise, the number of tcs genes (27) was determined using the database P2CS [27].

[...] pneumococcal genome evaluated here exceeds the overall number of proteins because the reported number of genes includes all the tRNA-, rRNA- and protein-encoding genes.
Regarding their chromosomal localization, the pneumococcal virulence factor genes are distributed all along the pneumococcal genome (Fig. 2). Interestingly, these genes are located in a co-oriented manner in relation to the origin of replication (oriC: 2.160. .). Gene transcription takes place simultaneously with the bidirectional replication of the genome [36]. Hence, for genes oriented in the direction opposite to the corresponding replication fork, the two molecular machineries run into a head-on collision that might affect at least one of the processes. For replication, this phenomenon implies genomic instability, while gene transcription is probably rendered inefficient. Previous studies have proven that essential and highly constitutively expressed genes are co-oriented [36]. For the pneumococcus, 30 of the 36 genes encoding virulence factors that are localized in the first half of the genome lie on the forward strand and are co-oriented with the replication fork moving clockwise. Similarly, 21 of the 27 virulence factor genes localized in the second half of the genome are located on the reverse strand and are co-oriented with the replication fork moving anti-clockwise (Fig. 2). A similar genome organization is observed for the 27 genes that encode the TCSs in S. pneumoniae, where only one operon, the tcs04 genes (TCS04), is not co-oriented with the replication fork (Fig. 3). These data reinforce the idea that the virulence factor genes and the tcs genes are highly important for the pneumococcal interaction with the human host and for its pathogenic potential in processes such as adherence, colonization, invasion, immune evasion, fitness, antibiotic resistance and natural competence (Table 2).
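The co-orientation rule used above can be written down as a small Python sketch. It rests on simplifying assumptions that are not taken from the study: the chromosome is treated as a circle with the replication terminus exactly opposite the origin, the forward strand is coded as +1 and the reverse strand as -1, and the genome length and gene coordinates in the example are invented.

```python
def is_co_oriented(gene_start, strand, genome_length, ori=0):
    """Return True if the gene is transcribed in the direction of its replication fork."""
    # Distance from the origin, measured clockwise along the circular chromosome.
    offset = (gene_start - ori) % genome_length
    # Assumption: the first half after the origin is copied by the clockwise fork.
    clockwise_replichore = offset < genome_length / 2
    # The clockwise fork travels in the forward-strand direction (+1),
    # the anti-clockwise fork in the reverse-strand direction (-1).
    fork_direction = 1 if clockwise_replichore else -1
    return strand == fork_direction

def co_oriented_fraction(genes, genome_length, ori=0):
    """Fraction of (start, strand) gene records that are co-oriented."""
    flags = [is_co_oriented(start, strand, genome_length, ori) for start, strand in genes]
    return sum(flags) / len(flags)

# Hypothetical 2 Mb genome with three invented genes.
GENOME_LENGTH = 2_000_000
genes = [(100_000, 1), (1_500_000, -1), (1_500_000, 1)]
fraction = co_oriented_fraction(genes, GENOME_LENGTH)
print(f"{fraction:.2f} of the genes are co-oriented")
```

With real data, the origin position and the replichore boundaries would have to come from the genome annotation rather than from the half-genome approximation used here.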
The analysis of the distribution of genes associated with virulence and host-pathogen interactions among the studied pneumococcal strains revealed that only 26 of the 65 genes considered here are present in all 25 strains. These genes encode products involved in cell wall hydrolysis, ABC transporters and structural proteins involved in adherence to host tissue, the so-called adhesins. Interestingly, after a preliminary inspection (by locus tag, identifier names and/or product sizes) of the datasets and supplementary material reported by van Tonder and colleagues in 2017, only a few of the pneumococcal virulence factors (PspC, KsgA and 4 hypothetical lipoproteins) and regulators (RR04, HK08, RR08, RR09, RR10) were found in the pneumococcal "supercore" genomic list of 303 genes, which is based on the analysis of 3121 pneumococci recovered from healthy individuals from four different subsets of the global pneumococcal population [15]. These findings, if confirmed by a deeper analysis of the datasets based on sequence comparison, may indicate that pneumococcal pathogenesis is a much more complex process than previously thought. While most of the genes have a single copy in the genome, the lytA gene, encoding the major pneumococcal autolysin, is also found in two and even three copies in 13 and 2 strains, respectively. This is most likely due to the multiple integration of prophages into the chromosomal DNA [37] (Table 3). In strain SPNA45, the gene gnd, encoding the enzyme 6-phosphogluconate dehydrogenase, is duplicated and fused with a second copy of its downstream neighbor gene, which encodes the orphan response regulator (rr14). The remaining 39 of the 65 virulence factor genes were found to belong to the accessory genome, being absent from the 25 strains to different degrees. Thus, these genes are not essential but are beneficial for fitness and pathogenesis.
Striking examples are the genes encoding the Pilus-1 and Pilus-2 structures, which have been shown to mediate adherence, contribute to virulence and promote invasion [38-42]. These genes are located on pathogenicity islands (PAI), which also contain the genes required for cell surface anchoring and regulation [38-41]. Remarkably, strains such as ST556, Taiwan19F-14 and TCH8431/19A were found here to be positive for both types of pili (1 and 2). Among the other genes with restricted presence in some strains, it is important to mention those encoding sortase-anchored proteins or choline-binding proteins (CBPs), as well as histidine triad proteins (pht genes). These gene products are associated with different processes of bacterial fitness and pathogenesis (Tables 3 and 2) [6,43,44]. Regarding the distribution of the TCS among the analyzed strains, most of them were found in all 25 pneumococcal strains. Exceptions are TCS07 and TCS12, which contribute to fitness and competence, respectively [7,45]. These TCS are absent in a couple of strains (Table 4). In some other strains, genes such as hk01, hk12 and rr04 presented incomplete sequences, an artefact leading to truncated and hence non-functional proteins/regulators (Tables 3 and 2). Interestingly, only the genes encoding hk08, rr08, rr09, rr06 and rr04 were found to belong to the "supercore" genomic set of genes reported by van Tonder et al. in 2017 [15], indicating the important role these highly conserved and well-distributed regulatory proteins play in the pneumococcus and in its interplay with the environment.

Table notes: lytA is the only factor with more than one copy per genome. In the strain SPNA45, the gene gnd was found duplicated (2) and fused with a duplication of its downstream neighbor gene (rr14). In the gene nanA of TIGR4 (1), a shift in its ORF was found. However, it has also been reported that NanA is expressed in this pneumococcal strain. Defective gene copies (genes with any alteration in their primary DNA sequences) are depicted in bold and italics: in the SPNA45 strain, ply is fused with a copy of lytA, and pspA is defective in the ATCC700669 pneumococcal strain.
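The core/accessory classification discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the input is a table of gene copy numbers per strain, a gene counts as core only if it is present in every strain, and the strain labels, the copy numbers and the gene label `pilus1_gene` are invented for the example.

```python
def classify_genes(copies_by_gene, strains):
    """Split genes into core (present in all strains) and accessory (absent in some)."""
    core, accessory = [], []
    for gene, copies in copies_by_gene.items():
        # A gene is present in a strain if at least one copy was found there.
        present = sum(1 for strain in strains if copies.get(strain, 0) > 0)
        if present == len(strains):
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory

# Invented copy numbers for three hypothetical strains.
strains = ["strainA", "strainB", "strainC"]
copies_by_gene = {
    "ply": {"strainA": 1, "strainB": 1, "strainC": 1},
    "lytA": {"strainA": 1, "strainB": 2, "strainC": 3},
    "pilus1_gene": {"strainA": 1, "strainB": 0, "strainC": 0},
}

core, accessory = classify_genes(copies_by_gene, strains)
print("core:", core)
print("accessory:", accessory)
```

Multi-copy genes such as lytA remain core under this rule, because only presence in every strain matters, not the number of copies.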
The estimation of the variability of each individual virulence factor and pneumococcal regulator (at the DNA and protein level) allowed the construction of a partial variome for the 25 pneumococcal strains analysed. Briefly, the variome takes into consideration (1) the presence, absence or number of copies of genes in the different strains, (2) the total number of synonymous and nonsynonymous mutations, and (3) the number of allelic and protein variants explaining the variability of each factor. The results, summarized in Tables 5 and 6, contain the data for the genes and proteins associated with virulence and host-pathogen interaction, as well as the data for the stand-alone and TCS regulators. Specifically, the factors with the best distribution and highest evolutionary conservation were (1) the ply gene, encoding the sole pneumococcal cytolysin and cytotoxin pneumolysin [46], (2) the enolase gene, encoding the enzyme enolase (2-phosphoglycerate dehydratase), which has an essential function in metabolism [47] but also interacts specifically with plasmin(ogen) and is therefore involved in fibrinolytic processes, adherence and virulence, and (3) the pcsB (Usp45) gene, which encodes a 45-kDa secreted and immunogenic protein involved in cell division and stress response [48]. As for the mutations, these three proteins presented a small number of changes in comparison with the other proteins analyzed. The variome of the TCS (Table 6) led to the conclusion that the most conserved genes from the evolutionary point of view are hk05 and rr05 of ciaR/H (tcs05). The TCS CiaRH is involved in resistance to cefotaxime, regulation of genetic competence and increased pathogenicity in the respiratory tract in murine models [7,49,50]. Meanwhile, hk02 and rr02 (WalR/K, MicA/B or VicR/K) have been associated with resistance to erythromycin and are essential for bacterial growth.
Nevertheless, this essentiality was shown to be due to a member of its regulon (pcsB), and the TCS was no longer essential upon ectopic expression of PcsB [7,48]. Pneumococcal TCS08 is involved in the genetic regulation of pilus-1 [41]. The mutation analysis showed that the response regulators exhibited a lower rate of variation than the histidine kinases, with rr05, rr02, rr06 and rr08 being the most conserved response regulators. All the results obtained in this study support the global idea of a new generation of protein-based and serotype-independent vaccines for Streptococcus pneumoniae. The basis is the high degree of distribution and conservation of the virulence proteins in combination with the importance of their functions and immunogenic capacities. This probably makes them ideal pharmacological targets to treat the pneumococcus and its diseases. Such vaccines might be an alternative to immunization with the conjugated serotypes, or represent a strategy to combine immunogenic and highly conserved proteins with capsular polysaccharides to generate a serotype-independent immune response.

Conclusions
The construction of this "low-scale" Variome model for the virulence factors and regulators of Streptococcus pneumoniae was achieved from 25 pneumococcal strains with fully sequenced and annotated genomes. According to the molecular phylogenetic analysis performed on the NCBI website, this selected set of pneumococcal genomes ensured an optimal representation of the pneumococcal population (8290 strains) reported in the NCBI database to date. Similarly, this study population set also represented an important group of highly pathogenic pneumococcal serotypes (1, 2, 3, 4, 5, 6B, 11A, 14, 19A, 19F and 23F), which, except for serotypes 2 and 11A, have also been included in the current pneumococcal conjugate vaccine formulations used to prevent pneumococcal infections. A total of 92 different genes and proteins were identified, classified and studied for the construction of the variome. The genes of the pneumococcal virulence factors and TCS are distributed along the genome and are located in such a manner that transcription is co-oriented with replication. The analysis of the gene distribution in this study population set showed that 26 of the virulence factor genes were found in 100% of the 25 pneumococcal genomes/strains (core genome), while 39 are part of the flexible genome. The estimation of the variability of each individual virulence factor, stand-alone regulator or TCS indicated that the virulence factors with the lowest variability in the pneumococcus are pneumolysin, enolase and PcsB, while the regulators with the highest conservation are TCS05 (CiaR/H), TCS02 (VicR/K) and TCS08.
Finally, all the results obtained here with the bioinformatic analysis constitute the first model to compare, visualize and understand the future flood of new genomic data about the genetic variation (in terms of gene presence/absence or mutation) of pneumococcal virulence factors and regulators [51-53]. The applicability of this variome model, together with further population genomic analysis of pneumococci, will provide relevant information on potential targets for vaccines, supporting the idea of a new generation of protein-based formulations to combat Streptococcus pneumoniae and its disease burden.

Table 6 Analysis of the genetic variation (Variome) of the genes that form the two-component systems in S. pneumoniae