  • Research article
  • Open access

The variome of pneumococcal virulence factors and regulators



In recent years, the idea of a highly immunogenic protein-based vaccine to combat Streptococcus pneumoniae and its severe invasive infectious diseases has gained considerable interest. However, the target proteins to be included in a vaccine formulation have to fulfill several genetic and immunological criteria (such as conservation, distribution, immunogenicity and protective effect) to ensure their suitability and effectiveness. This study aimed to gain comprehensive insights into the genomic organization, population distribution and genetic conservation of all pneumococcal surface-exposed proteins, genetic regulators and other virulence factors whose important function and role in pathogenesis have been demonstrated or hypothesized.


After retrieving the complete set of DNA and protein sequences reported in the databases GenBank, KEGG, VFDB, P2CS and UniProt for pneumococcal strains whose genomes have been fully sequenced and annotated, a comprehensive bioinformatic analysis and systematic comparison was performed for each virulence factor, stand-alone regulator and two-component regulatory system (TCS) encoded in the pan-genome of S. pneumoniae. A total of 25 S. pneumoniae strains, representing different pneumococcal phylogenetic lineages and serotypes, were considered. A set of 92 different genes and proteins was identified, classified and studied to construct a pan-genomic variability map (variome) for S. pneumoniae. Both pneumococcal virulence factors and regulatory genes were well distributed in the pneumococcal genome and exhibited a conserved feature of genome organization, in which replication and transcription are co-oriented. The analysis of the population distribution of each gene and protein showed that 49 of them are part of the core genome in pneumococci, while 43 belong to the accessory genome. Estimating the genetic variability revealed that pneumolysin, enolase and Usp45 (SP_2216 in S. p. TIGR4) are the pneumococcal virulence factors with the highest conservation, while TCS08, TCS05 and TCS02 represent the most conserved pneumococcal genetic regulators.


The results identified well-distributed and highly conserved pneumococcal virulence factors as well as regulators, representing promising candidates for a new generation of serotype-independent protein-based vaccine(s) to combat pneumococcal infections.


Streptococcus pneumoniae, also known as the pneumococcus, is a Gram-positive, α-hemolytic and facultatively anaerobic bacterium. This microorganism is normally found as a harmless commensal in the upper respiratory tract of humans. Pneumococci are of great epidemiological importance due to their high impact on public health, causing more than one and a half million deaths per year worldwide [1]. S. pneumoniae is the main etiologic agent of community-acquired pneumonia. However, this is not its only clinical manifestation: other diseases such as otitis media, sinusitis, septicemia and meningitis are also caused by this pathogen and are associated with high mortality rates [2].

Given the particular biochemical and molecular features of Streptococcus pneumoniae (a Gram-positive, catalase-negative, optochin-sensitive and bile-soluble bacterium), its identification in the laboratory is relatively simple. Nevertheless, the great molecular, biochemical and immunological diversity of its capsule and of other antigens such as choline-binding proteins makes it one of the hardest bacterial pathogens to combat [3, 4]. The “Quellung reaction”, developed over 100 years ago by Neufeld, allows the specific and reliable identification of each of the >94 serotypes discovered to date. The capsular polysaccharide is the sine qua non virulence factor; however, the pathogenic potential of serotypes may vary and, similarly, their frequency or prevalence varies from one geographic region to another [5]. Moreover, the capsule is not the only factor required by S. pneumoniae to induce disease. In fact, the surface of the pneumococcus is decorated with various proteins that have already been associated with its high pathogenic potential. In addition, their interaction with host cellular receptors has been demonstrated, and they fulfill crucial pathogenic functions such as adhesion, colonization, breaching of tissue barriers and immune evasion [6].

An important group of regulatory proteins of great interest are the histidine kinases (HK), which are located on the bacterial surface and function as the sensors of two-component regulatory systems (TCS). The sensing of environmental signals via TCS regulates the expression of genes involved in important cellular processes such as natural competence, antibiotic resistance, adaptation to different environmental conditions and surface protein expression, among others [7, 8]. In general, a TCS is composed of a histidine kinase, a membrane protein that senses extracellular signals, and a cytoplasmic regulator/effector protein referred to as response regulator (RR), to which these signals are transmitted. Signal transmission occurs via HK autophosphorylation and a subsequent trans-phosphorylation process. In Streptococcus pneumoniae, 13 TCS and one orphan RR have been identified [7].

The relevance of the cellular, physiological and pathogenic functions that these pneumococcal proteins fulfill has aroused great scientific and biotechnological interest, given their potential pharmaceutical applications as vaccine candidates [9]. Nowadays, the antibiotic treatment of infections caused by the pneumococcus is often complicated by increasing antibiotic resistance [10]. Furthermore, prevention by the use of pneumococcal polysaccharide vaccines and/or pneumococcal conjugate vaccines only helps to control the disease caused by some of the serotypes and has an indirect impact on colonization [9]. Thus, there is an urgent need to define more global and effective strategies for treatment and/or prevention in order to fight the pneumococcus and its local and invasive diseases. Consequently, the idea of a protein-based vaccine has gained great importance in recent years. However, in order to be considered for or included in a recombinant vaccine formulation, a bacterial protein has to fulfill specific criteria such as: (1) playing an important role in the bacterial fitness and/or pathogenesis of S. pneumoniae, (2) being widely distributed among the circulating strains and clinical isolates, (3) exhibiting a high degree of conservation at the gene and protein sequence level, (4) being immunogenic, (5) demonstrating protection in experimental assays, and (6) having favorable physico-chemical properties for the expression and purification of its recombinant products.

Streptococcus pneumoniae is a pathogen exhibiting fratricide behavior and an enormous capacity for natural competence, acquiring foreign genetic material and integrating it into its genome [11]. These processes, in addition to the mutation rates [12, 13], greatly stimulate horizontal gene transfer with other microorganisms and explain pneumococcal genetic variability and genome plasticity [14, 15]. This model of pneumococcal population evolution, in which recombination greatly outpaces mutation, is also favored by the relatively high number of repetitive sequences in the genome, which facilitate the incorporation of foreign DNA into the chromosome [15,16,17,18]. As a consequence, these events contribute to structural reorganizations and influence the presence or absence of protein-encoding genes in different subsets of the global pneumococcal population, making them highly heterogeneous from the core- and pan-genomic point of view [15]. Likewise, the generation and fixation of particular changes in the genome affect the mutation rates, which in turn influence the evolution and conservation of genes and contribute to adaptive changes that potentially lead to increased virulence and a more complex interaction with the host [19].

Due to these molecular events and their importance, there is a need to fully and globally understand the genetic heterogeneity and variability among the different pneumococcal strains/serotypes (variome), and to gain a deeper and more detailed molecular understanding of the different physiological and pathogenic mechanisms that this microorganism uses to cause severe and life-threatening diseases. Obtaining this knowledge will help to identify potential pharmaceutical targets for new antimicrobial therapies. In addition, determining the degree of conservation and distribution of proteins among pneumococcal strains will help to confirm protein candidates for vaccines. However, despite the availability of a high number of completely sequenced genomes and the importance of analyzing the genetic differences among pneumococci, only a few studies have focused on its variability from a global perspective, in a manner similar to the human variome databases [20]. To date, only the “Microbial Variome Database” [21], which holds and organizes the available information on the variome of the two Gram-negative bacterial species Escherichia coli and Salmonella enterica, provides such information for microorganisms. Remarkably, there are no open-source data of this nature for any Gram-positive bacterial genome. Hence, this study focused on the construction of the first S. pneumoniae variome model, starting with the identification of all allelic and protein variants and a mutation and distribution analysis (presence and absence) of the virulence factors and regulators among a set of pneumococcal strains that possess a fully sequenced and annotated genome.


Definition of the study population set and determination of the optimal representation of the entire population of pneumococci

The search for and selection of the Streptococcus pneumoniae strains analyzed in this study was done using the microbial database of the National Center for Biotechnology Information (NCBI) [22]. To ensure an optimal representation of the global pneumococcal population, a genomic BLAST of 8290 available S. pneumoniae genomes was carried out. In brief, DNA alignments were performed for all currently reported draft or completely sequenced genomes using the tool “Microbial Nucleotide BLAST” [23], which is available on the NCBI website. The comparative data were then used to construct a DNA-based phylogenetic tree (dendrogram) with the Genome Tree Report tool of the NCBI. Afterwards, the file containing the dendrogram constructed for the 8290 strains was downloaded from the NCBI database. Finally, the dendrogram file was viewed, analyzed and adapted to generate circular, slanted and/or rectangular cladograms using the online NCBI tool “Tree Viewer 1.17.0” (Fig. 1).

Fig. 1

Phylogenetic tree (slanted cladogram) of the pneumococcal genomes/strains. Using the online NCBI tools Genome Tree Report and Tree Viewer 1.17.0, a phylogenetic tree was constructed from the genomic BLAST analysis of 8290 pneumococcal sequencing projects reported in the NCBI database. The topology of this slanted cladogram shows different pneumococcal lineages, in which the selected set of 25 pneumococcal strains can be identified in red as external nodes (the “well-distributed” key features are also highlighted in red), evidencing an optimal representation of the pneumococcal population. The overall number of sequenced pneumococcal genomes is provided for each external node. The blue lines depict those external nodes where fully sequenced and annotated genomes are located.

Definition of the virulence factors and two-component regulatory systems to be studied in S. pneumoniae

The search for and selection of genes and proteins widely known as virulence factors, or of genes encoding factors with a proven interaction with the human host, was done by an exhaustive bioinformatic screening of the “Virulence Factors DataBase - VFDB” [24]. Additionally, the virulence factors and proteins involved in interactions with the host were confirmed and completed by a systematic review of the literature [14, 25]. The common name of each selected virulence factor was then entered into the database UniProt [26] in order to obtain its locus tag in the S. pneumoniae TIGR4 genome. In addition, the genes encoding the HK or RR of the pneumococcal TCS were identified using the database Prokaryotic Two-Component Systems - P2CS [27]. Likewise, the corresponding S. pneumoniae TIGR4 locus tag of each histidine kinase gene (hk) and response regulator gene (rr) was retrieved from the same database.

Chromosomal localization of the virulence factor and two-component regulatory systems genes in S. pneumoniae

The determination of the chromosomal location of all the genes in the genome of S. pneumoniae TIGR4 and the construction of the genomic maps, in linear or circular representation, were done using the software SnapGene® (GSL Biotech). In brief, the studied genomes of S. pneumoniae were imported via their corresponding GenBank accession numbers (e.g., NC_003028.3 for TIGR4). Then, the chromosomal location was identified for each virulence factor gene, each gene encoding a factor involved in the interaction with the host, and each gene encoding a stand-alone regulator or a protein of a two-component regulatory system. Finally, linear maps, drawn to scale, were constructed for the genomic localization of the virulence factors, and circular maps were constructed for the genes that form the two-component regulatory systems.

Distribution of the virulence factors and two-component regulatory systems in the different strains of S. pneumoniae

The gene and protein sequences of interest for the comparative analysis were identified using as reference their locus tags in the genomes of S. pneumoniae TIGR4 and/or R6 in the database Kyoto Encyclopedia of Genes and Genomes (KEGG) [28]. Once every gene of interest was located in the database, a series of BLAST comparisons was performed using GenomeNet [29], considering only the fully sequenced and annotated genomes of S. pneumoniae. For the nucleotide sequences, the search was performed using the program BLASTN 2.2.29+, which carries out nucleotide vs. nucleotide alignments based on the scoring matrix BLOSUM62 [23, 30]. In the same way, the search for the amino acid sequences was done using the program BLASTP 2.2.29+ [31, 32], which performs amino acid vs. amino acid alignments based on a similar matrix. Once the BLAST was completed for each virulence factor, the list of hits was filtered using an expect value (E-value) of 0 as the selection criterion. Hits with an E-value > 0 were included only after direct visual inspection of the alignments confirmed that they were indeed the same sequence. Once the list of genes and proteins fulfilling the selection criteria had been defined, the S. pneumoniae strains to which they belong were determined. All DNA and protein sequences were downloaded and stored in an organized way in FASTA format.
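The hit selection described above can be illustrated with a minimal Python sketch. It assumes that the BLAST results were exported in the standard 12-column tabular format (in which the E-value is the eleventh column); the study itself used the GenomeNet web interface, so the input format, the function names and the sample identifiers below are hypothetical and serve only as an illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # reference gene (hypothetical identifier)
    subject: str     # matching sequence in another genome
    identity: float  # percent identity
    evalue: float    # BLAST expect value


def parse_tabular(text: str) -> list[Hit]:
    """Parse BLAST tabular output (12 standard tab-separated columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append(Hit(cols[0], cols[1], float(cols[2]), float(cols[10])))
    return hits


def split_hits(hits: list[Hit]) -> tuple[list[Hit], list[Hit]]:
    """Separate hits with E-value 0 from hits that need visual inspection."""
    accepted = [h for h in hits if h.evalue == 0.0]
    to_inspect = [h for h in hits if h.evalue > 0.0]
    return accepted, to_inspect


# Hypothetical example data, not taken from the study.
sample = (
    "query_gene\tgeneA\t100.00\t1416\t0\t0\t1\t1416\t1\t1416\t0.0\t2555\n"
    "query_gene\tgeneB\t91.20\t300\t20\t1\t1\t300\t5\t304\t2e-50\t190\n"
)
accepted, to_inspect = split_hits(parse_tabular(sample))
```

Hits in `to_inspect` are not discarded automatically; they correspond to the alignments that would still be checked by eye before inclusion.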

Genetic variability (variome) of the virulence factors and two-component regulatory systems among the different pneumococcal strains

The multiple comparative alignments of pneumococcal sequences were done using the web tool MultAlin [33], for which an identity matrix (1–0) was used in order to penalize even the slightest change in the nucleotide or amino acid sequences, covering substitutions, deletions, insertions and variations in length. From these analyses, the number of allelic and protein variants was determined for each gene according to the registry value assigned by the program to each sequence, where identical sequences have the same registry value, while different sequences have different values. The results of the alignments were manually curated and stored for further analysis. Finally, the total numbers of mutations, synonymous and nonsynonymous, were determined using the software DnaSP v5.1 [34, 35]. For this purpose, all the sequences found for a given gene were entered into the program and the calculations were performed for each type of mutation mentioned above.
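The two counting steps described above can be sketched in Python as follows. This is a simplified illustration, not the procedure implemented in MultAlin or DnaSP: variants are numbered by exact sequence identity, and codon changes are classified by comparing two aligned, in-frame, gap-free sequences codon by codon, so a codon with several substitutions counts as one change. All function names and example sequences are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_variants(sequences: dict[str, str]) -> dict[str, int]:
    """Give each strain a variant number; identical sequences share a number."""
    seen: dict[str, int] = {}
    variant_of: dict[str, int] = {}
    for strain, seq in sequences.items():
        key = seq.upper()
        if key not in seen:
            seen[key] = len(seen) + 1
        variant_of[strain] = seen[key]
    return variant_of


def classify_codon_changes(ref: str, alt: str) -> tuple[int, int]:
    """Count changed codons as (synonymous, nonsynonymous)."""
    if len(ref) != len(alt) or len(ref) % 3:
        raise ValueError("sequences must be aligned, in frame and of equal length")
    syn = nonsyn = 0
    for i in range(0, len(ref), 3):
        a, b = ref[i:i + 3].upper(), alt[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # unchanged codon, or codon with gaps/ambiguous bases
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical toy alleles of one gene in three strains.
alleles = {"strain1": "ATGGCTAAA", "strain2": "atggctaaa", "strain3": "ATGGCCAGA"}
variants = count_variants(alleles)
n_allelic_variants = len(set(variants.values()))
changes = classify_codon_changes(alleles["strain1"], alleles["strain3"])
```

In the toy data, GCT to GCC leaves alanine unchanged (synonymous), whereas AAA to AGA replaces lysine with arginine (nonsynonymous).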

Results and discussion

“Hundreds to thousands” of S. pneumoniae strains and clinical isolates recovered from the nasopharynx, blood or cerebrospinal fluid (CSF) have been included to date in genomic sequencing projects worldwide. However, this study focuses on pneumococcal strains whose genomes are fully sequenced, annotated and publicly available. Therefore, a set of 25 pneumococcal strains was selected from the NCBI database as the study population for the bioinformatic analysis needed to construct the variome of the virulence factors and two-component regulatory systems of Streptococcus pneumoniae (Table 1).

Table 1 The study population set of 25 S. pneumoniae strains included in this study and their serotypes

A variome model of the pneumococcal virulence factors and regulators is an intraspecific study that aims to highlight variable genetic loci in the genome of Streptococcus pneumoniae. A perfect and ultimate variome model would be one constructed with 100% of the genomic information, correctly assessed, from the entire pneumococcal population. However, the current state of the art is far from this scenario, and an optimal representation of the pneumococcal genomes assessed to date is appropriate in order to validate these genomic analyses. Currently, 8290 pneumococcal sequencing projects are reported as draft or complete genomes in the Genome Assembly and Annotation Report of the NCBI database. Therefore, a global genomic BLAST (DNA alignment) of those 8290 available S. pneumoniae genomes/strains was performed and a DNA-based phylogenetic tree was constructed using the Genome Tree Report tool of the NCBI. The topology of this phylogenetic tree (slanted cladogram) showed different pneumococcal lineages, in which the selected set of 25 pneumococcal genomes/strains can be identified as external nodes (“well-distributed” key features highlighted in red), evidencing an optimal representation of the pneumococcal population (Fig. 1). In addition, it is important to highlight that the serotypes represented in this study population set (1, 2, 3, 4, 5, 6B, 11A, 14, 19A, 19F and 23F) have been described as the pneumococcal types with the highest pathogenic potential, due to the high burden of invasive pneumococcal diseases (IPDs) they cause worldwide. This is the reason why the majority of them (except serotypes 2 and 11A) have been included in the pneumococcal conjugate vaccines (PCVs) currently used for immunization [1].

A considerable initial number of pneumococcal virulence factor genes was identified using the database VFDB [24]. This database provided further detailed information to establish their function, pathogenic role and type of interaction with a receptor in the human host. Additionally, a systematic screening of the literature [14] not only allowed the confirmation of the identified factors, but also made it possible to complement the list with additional factors that are not included in the databases. Likewise, the number of tcs genes (27) was determined using the database Prokaryotic Two-Component Systems - P2CS [27]. In total, 92 different genes encoding 61 surface proteins, 4 stand-alone transcriptional regulators, 13 HKs and 14 RRs were selected and included in this work for the construction of the variome, after being classified by their function and grouped according to their molecular mechanisms of surface exposure (Table 2).

Table 2 Function or pathogenic role of the virulence factors and two-component regulatory systems of S. pneumoniae

The genomes of the 25 analyzed pneumococcal strains range in size from 2,024,476 bp in SPN034156 up to 2,245,615 bp in Hungary 19A-6. Likewise, the G + C content varies between 39.50% in CGSP14 and 39.90% in SPN034156. 670-6B is the strain with the highest number of genes (2430) and proteins (2352), and SPN034156 is the strain with the lowest number of genes (1956) and proteins (1799). Hence, the difference among genomes can be up to 474 genes and 553 proteins. The overall number of genes in each pneumococcal genome evaluated here exceeds the overall number of proteins because the reported number of genes includes all tRNA-, rRNA- and protein-encoding genes.

Regarding their chromosomal localization, the pneumococcal virulence factor genes are distributed along the entire pneumococcal genome (Fig. 2). Interestingly, these genes are located in a co-oriented manner in relation to the origin of replication (oriC: 2.160.822–196). During the bidirectional replication of the genome, gene transcription proceeds simultaneously [36]. Hence, for genes oriented in the opposite direction to the corresponding replication fork, both molecular machineries will run into a head-on collision that might affect at least one of the processes. For replication, this phenomenon implies genomic instability, while gene transcription is probably inefficient. Previous studies have shown that essential and highly constitutively expressed genes are co-oriented [36]. For the pneumococcus, 30 of the 36 genes encoding virulence factors localized in the first half of the genome are on the forward strand and co-oriented with the replication fork moving clockwise. Similarly, 21 of the 27 virulence factor genes localized in the second half of the genome are located on the reverse strand and co-oriented with the replication fork moving anti-clockwise (Fig. 2). A similar genome organization is observed for the 27 genes that encode the TCSs in S. pneumoniae, where only one operon, the tcs04 genes (TCS04), is not co-oriented with the replication fork (Fig. 3). These data reinforce the idea that the virulence factor genes and the tcs genes are highly important for the pneumococcal interaction with the human host and for its pathogenic potential in processes such as adherence, colonization, invasion, immune evasion, fitness, antibiotic resistance and natural competence (Table 2).
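The co-orientation rule described above can be written down as a small Python sketch. It assumes a circular chromosome in which the replication terminus lies opposite the origin, so that the first half of the genome (clockwise from the origin) is replicated along the forward strand and the second half along the reverse strand. The genome length, gene positions and function names are hypothetical and are not taken from the TIGR4 annotation.

```python
def is_co_oriented(start: int, strand: str, genome_length: int, ori: int = 0) -> bool:
    """Return True if a gene is transcribed in the direction of its replication fork.

    Assumes the terminus lies exactly half a genome away from the origin.
    """
    offset = (start - ori) % genome_length
    in_first_half = offset < genome_length / 2
    # First half: forward-strand genes are co-oriented.
    # Second half: reverse-strand genes are co-oriented.
    return (strand == "+") == in_first_half


def co_oriented_fraction(genes: list[tuple[int, str]], genome_length: int, ori: int = 0) -> float:
    """Fraction of (start, strand) genes that are co-oriented with replication."""
    hits = sum(is_co_oriented(start, strand, genome_length, ori) for start, strand in genes)
    return hits / len(genes)


# Hypothetical genome of 2,000,000 bp with three genes.
GENOME_LENGTH = 2_000_000
genes = [(100_000, "+"), (1_500_000, "-"), (1_600_000, "+")]
fraction = co_oriented_fraction(genes, GENOME_LENGTH)
```

In this toy example the first two genes follow their replication fork and the third runs against it, so two of three genes are co-oriented. A real analysis would take the origin and terminus positions from the genome annotation rather than assuming a symmetric split.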

Fig. 2

Chromosomal localization and direction of the virulence factor genes of S. pneumoniae TIGR4. Linear representation of the pneumococcal genome. The arrows, drawn to scale, localize 62 of the 65 virulence factor and stand-alone regulator genes considered in this study (pitA, pitB and zmpD are not present in the genome of TIGR4). Each color represents a different class of encoded protein: blue = sortase-anchored proteins with an LPxTG cleavage motif; violet = choline-binding proteins (CBPs); green = lipoproteins; yellow = non-classical surface proteins (NCSP); and red = stand-alone regulators. This map was constructed using the software SnapGene® (GSL Biotech).

Fig. 3

Localization and direction of the two-component system genes in S. pneumoniae TIGR4. Circular representation of the pneumococcal genome. The arrows, not drawn to scale, localize the 27 genes that encode the proteins of the 13 two-component systems plus one incomplete system. Each color indicates a different class of encoded protein: red = histidine kinase sensors and blue = response regulator proteins. This map was constructed using the software SnapGene® (GSL Biotech).

The analysis of the distribution of genes associated with virulence and host-pathogen interactions among the studied pneumococcal strains revealed that only 26 of the 65 genes considered here are present in all 25 strains. These genes encode products involved in different functions such as cell wall hydrolysis, ABC transporters and structural proteins involved in adherence to host tissue, the so-called adhesins. Interestingly, after preliminary inspection (by locus tag, identifier names and/or product sizes) of the datasets and supplementary material reported by van Tonder and colleagues in 2017, only a few of the pneumococcal virulence factors (PspC, KsgA, and 4 hypothetical lipoproteins) and regulators (RR04, HK08, RR08, RR09, RR10) were found in the pneumococcal “supercore” genomic list of 303 genes, which is based on the analysis of 3121 pneumococci recovered from healthy individuals from four different subsets of the global pneumococcal population [15]. These findings, if confirmed after deeper analysis of the datasets based on sequence comparison, may indicate that pneumococcal pathogenesis is a much more complex process than previously thought. While most of the genes are present as a single copy in the genome, the lytA gene, encoding the major pneumococcal autolysin, is also found in two and even three copies in 13 and 2 strains, respectively. This is most likely due to the multiple integration of prophages into the chromosomal DNA [37] (Table 3). In strain SPNA45, the gene gnd, encoding the enzyme 6-phosphogluconate dehydrogenase, is duplicated and fused with a second copy of its downstream neighbor gene, which encodes the orphan response regulator (rr14). The remaining 39 of the 65 virulence factor genes were found to belong to the accessory genome, being absent from varying numbers of the 25 strains. Thus, these genes are not essential but are beneficial for fitness and pathogenesis. 
Striking examples are the genes encoding the Pilus-1 and Pilus-2 structures, which have been shown to mediate adherence, contribute to virulence and promote invasion [38,39,40,41,42]. These genes are located on pathogenicity islands (PAI), which also contain the genes required for cell surface anchoring and regulation [38,39,40,41]. Remarkably, strains such as ST556, Taiwan19F-14 and TCH8431/19A were found here to be positive for both types of pili (1 and 2). The other genes with a restricted presence in some strains encode sortase-anchored proteins or choline-binding proteins (CBPs), as well as histidine triad proteins (pht genes). These gene products are associated with different processes of bacterial fitness and pathogenesis (Tables 3 and 2) [6, 43, 44]. Regarding the distribution of the TCS, most of them were found in all 25 pneumococcal strains. The exceptions are TCS07 and TCS12, which contribute to fitness and competence, respectively [7, 45]. These TCS are absent in a couple of strains (Table 4). In some other strains, genes such as hk01, hk12 and rr04 presented incomplete sequences, an artefact leading to truncated and hence non-functional proteins/regulators (Tables 3 and 2). Interestingly, only the genes hk08, rr08, rr09, rr06 and rr04 were found to belong to the “supercore” genomic set of genes reported by van Tonder et al. in 2017 [15], indicating the important role that these highly conserved and well-distributed regulatory proteins play in the pneumococcus and in its interplay with the environment.
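The distinction between core and accessory genes used in this analysis can be illustrated with a short Python sketch: a gene is assigned to the core genome only if it is present in every strain of the study set, and to the accessory genome otherwise. The gene names, strain names and presence data below are hypothetical placeholders, not results from Tables 3 and 4.

```python
def classify_genes(
    presence: dict[str, set[str]], strains: list[str]
) -> tuple[list[str], list[str]]:
    """Split genes into core (present in all strains) and accessory (absent in some)."""
    all_strains = set(strains)
    core = sorted(gene for gene, found_in in presence.items() if all_strains <= found_in)
    accessory = sorted(gene for gene in presence if gene not in core)
    return core, accessory


def absence_count(presence: dict[str, set[str]], strains: list[str]) -> dict[str, int]:
    """Number of strains in which each gene is missing."""
    all_strains = set(strains)
    return {gene: len(all_strains - found_in) for gene, found_in in presence.items()}


# Hypothetical presence/absence data for three genes in four strains.
strains = ["strainA", "strainB", "strainC", "strainD"]
presence = {
    "geneX": {"strainA", "strainB", "strainC", "strainD"},
    "geneY": {"strainA", "strainC"},
    "geneZ": {"strainA", "strainB", "strainC"},
}
core, accessory = classify_genes(presence, strains)
missing = absence_count(presence, strains)
```

The `missing` counts correspond to the different degrees of absence of the accessory genes; copy number, as observed for lytA, would require storing a count per strain instead of a set of strains.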

Table 3 Distribution of the virulence factor and regulation genes of S. pneumoniae
Table 4 Distribution of the genes that conform the two-component systems in S. pneumoniae

The estimation of the variability of each individual virulence factor and pneumococcal regulator (at the DNA and protein level) allowed the construction of a partial variome for the 25 analysed pneumococcal strains. Briefly, the variome takes into consideration (1) the presence, absence or number of copies of genes in the different strains, (2) the total number of synonymous and nonsynonymous mutations, and (3) the number of allelic and protein variants explaining the variability of each factor. The results summarized in Tables 5 and 6 contain the data for the genes and proteins associated with virulence and host-pathogen interaction, as well as the data for the stand-alone and TCS regulators. Some factors stand out for having the best distribution and highest evolutionary conservation. These were (1) the ply gene, encoding the sole pneumococcal cytolysin and cytotoxin pneumolysin [46], (2) the enolase gene, which encodes the enzyme enolase (2-phosphoglycerate dehydratase), which has an essential function in metabolism [47] but also interacts specifically with plasmin(ogen) and is therefore involved in fibrinolytic processes, adherence and virulence, and (3) the pcsB (Usp45) gene, which encodes a 45-kDa secreted and immunogenic protein that is involved in cell division and stress response [48]. As for the mutations, these three proteins presented a low number of changes in comparison with the other proteins analyzed. The variome of the TCS (Table 6) led to the conclusion that the most conserved genes from the evolutionary point of view are the genes hk05 and rr05 of ciaR/H (tcs05). The TCS CiaRH is involved in resistance to cefotaxime, regulation of genetic competence and increased pathogenicity in the respiratory tract in murine models [7, 49, 50]. Meanwhile, hk02 and rr02 (WalR/K, MicA/B or VicR/K) have been associated with resistance to erythromycin and are essential for bacterial growth. 
Nevertheless, this essentiality was shown to be due to its regulon (pcsB), and the system was no longer essential upon ectopic expression of PcsB [7, 48]. Pneumococcal TCS08 is involved in the genetic regulation of pilus-1 [41]. The mutation analysis showed that the response regulators exhibited a lower rate of variation than the histidine kinases, with rr05, rr02, rr06 and rr08 being the most conserved response regulators. All the results obtained in this study support the global idea of a new generation of protein-based and serotype-independent vaccines for Streptococcus pneumoniae. The basis is the high degree of distribution and conservation of the virulence proteins in combination with the importance of their functions and their immunogenic capacities. This probably makes them ideal pharmacological targets to treat the pneumococcus and its diseases. This might be an alternative to immunization with the conjugated serotypes, or represent a strategy to combine immunogenic and highly conserved proteins with capsular polysaccharides to generate a serotype-independent immune response.

Table 5 Analysis of the Variome of the virulence factor genes of S. pneumoniae
Table 6 Analysis of the genetic variation (Variome) of the genes that conform the two-component systems in S. pneumoniae


The construction of this “low-scale” variome model for the virulence factors and regulators of Streptococcus pneumoniae was achieved using 25 pneumococcal strains with fully sequenced and annotated genomes. According to the molecular phylogenetic analysis performed on the NCBI website, this selected set of pneumococcal genomes ensured an optimal representation of the pneumococcal population (8290 strains) reported in the NCBI database to date. Similarly, this study population set also represented an important group of highly pathogenic pneumococcal serotypes (1, 2, 3, 4, 5, 6B, 11A, 14, 19A, 19F and 23F), which (except serotypes 2 and 11A) have also been included in the current pneumococcal conjugate vaccine formulations used to prevent pneumococcal infections. A total of 92 different genes and proteins were identified, classified and studied for the construction of the variome. The genes of the pneumococcal virulence factors and TCS are distributed along the genome and are located in such a manner that transcription is co-oriented with replication. The analysis of the gene distribution in this study population set showed that 26 of the 65 virulence factor and stand-alone regulator genes were found in 100% of the 25 pneumococcal genomes/strains (core genome), while 39 are part of the flexible genome. The estimation of the variability of each individual virulence factor, stand-alone regulator or TCS indicated that the virulence factors with the lowest variability in the pneumococcus are pneumolysin, enolase and PcsB, while the regulators with the highest conservation are TCS05 (CiaR/H), TCS02 (VicR/K) and TCS08. Finally, the results obtained here by bioinformatic analysis constitute the first model to compare, visualize and understand the future flood of new genomic data on the genetic variation (in terms of gene presence/absence or mutation) of pneumococcal virulence factors and regulators [51,52,53]. 
The applicability offered by this variome model, together with further population genomic analysis of pneumococci, will provide relevant information on potential targets for vaccines, supporting the idea of a new generation of protein-based formulations to combat Streptococcus pneumoniae and its disease burden.
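
The two measures summarized in these conclusions, gene distribution (core versus flexible genome) and sequence variability, can be illustrated with a short Python sketch. It is only a minimal sketch under stated assumptions, not the pipeline used in this study: the function names (`classify_genes`, `polymorphic_site_fraction`), the gene and strain labels, and the toy sequences are all hypothetical, and the variability measure is a simple stand-in for the polymorphism statistics reported by dedicated software such as DnaSP.

```python
from typing import Dict, List, Set, Tuple


def classify_genes(
    presence: Dict[str, Set[str]], strains: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split genes into core (found in every strain) and flexible (found in only some)."""
    core: List[str] = []
    flexible: List[str] = []
    for gene in sorted(presence):
        # A gene is core only if its carrier set covers the whole strain set.
        if strains <= presence[gene]:
            core.append(gene)
        else:
            flexible.append(gene)
    return core, flexible


def polymorphic_site_fraction(aligned: List[str]) -> float:
    """Fraction of alignment columns in which at least two sequences differ."""
    if not aligned or not aligned[0]:
        raise ValueError("need at least one non-empty aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    variable = sum(1 for column in zip(*aligned) if len(set(column)) > 1)
    return variable / length


# Hypothetical toy input (placeholder names, not data from this study).
strains = {"strain1", "strain2", "strain3"}
presence = {
    "geneA": {"strain1", "strain2", "strain3"},
    "geneB": {"strain1", "strain3"},
    "geneC": {"strain1", "strain2", "strain3"},
}
core, flexible = classify_genes(presence, strains)
print(core)      # ['geneA', 'geneC']
print(flexible)  # ['geneB']
print(polymorphic_site_fraction(["ATGC", "ATGA", "ATGC"]))  # 0.25
```

In this sketch a gene counts as core only when it is present in 100% of the strains considered, mirroring the criterion stated above. Gap characters in the alignment are treated like any other symbol, so an indel column counts as polymorphic; a real analysis would need an explicit policy for gaps and missing data.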



Abbreviations

BLAST: Basic local alignment search tool

BLOSUM: Blocks substitution matrix

CBPs: Choline binding proteins

CSF: Cerebrospinal fluid

DnaSP: DNA sequence polymorphism

HK: Histidine kinase

IPD: Invasive pneumococcal diseases

KEGG: Kyoto encyclopedia of genes and genomes

MSA: Multiple sequence alignment

NCBI: National center for biotechnology information

NCSPs: Non-classical surface proteins

P2CS: Prokaryotic two-component systems

PAIs: Pathogenicity Islands

RR: Response regulator

S. p.: Streptococcus pneumoniae

TCS: Two-Component regulatory Systems

UniProt: The Universal Protein Resource

Variome: Pan-genomic variability map

VFDB: Virulence factors data base


  1. World Health Organization. The global burden of disease: 2004 update. Geneva: WHO; 2008.

  2. Bridy-Pappas AE, Margolis MB, Center KJ, Isaacman DJ. Streptococcus pneumoniae: description of the pathogen, disease epidemiology, treatment, and prevention. Pharmacotherapy. 2005;25(9):1193–212.

  3. Brueggemann AB, Griffiths DT, Meats E, Peto T, Crook DW, Spratt BG. Clonal relationships between invasive and carriage Streptococcus pneumoniae and serotype- and clone-specific differences in invasive disease potential. J Infect Dis. 2003;187(9):1424–32.

  4. Johnson HL, Deloria-Knoll M, Levine OS, Stoszek SK, Freimanis Hance L, Reithinger R, Muenz LR, O'Brien KL. Systematic evaluation of serotypes causing invasive pneumococcal disease among children under five: the pneumococcal global serotype project. PLoS Med. 2010;7(10):1–13.

  5. Jedrzejas MJ. Pneumococcal virulence factors: structure and function. Microbiol Mol Biol Rev. 2001;65(2):187–207.

  6. Voss S, Gamez G, Hammerschmidt S. Impact of pneumococcal microbial surface components recognizing adhesive matrix molecules on colonization. Mol Oral Microbiol. 2012;27(4):246–56.

  7. Throup JP, Koretke KK, Bryant AP, Ingraham KA, Chalker AF, Ge Y, Marra A, Wallis NG, Brown JR, Holmes DJ, et al. A genomic analysis of two-component signal transduction in Streptococcus pneumoniae. Mol Microbiol. 2000;35(3):566–76.

  8. McCluskey J, Hinds J, Husain S, Witney A, Mitchell TJ. A two-component system that controls the expression of pneumococcal surface antigen A (PsaA) and regulates virulence and resistance to oxidative stress in Streptococcus pneumoniae. Mol Microbiol. 2004;51(6):1661–75.

  9. Gamez G, Hammerschmidt S. Combat pneumococcal infections: adhesins as candidates for protein-based vaccine development. Curr Drug Targets. 2012;13(3):323–37.

  10. Centers for Disease Control and Prevention. Active Bacterial Core Surveillance Report, Emerging Infections Program Network, Streptococcus pneumoniae. Atlanta: CDC; 2015.

  11. Eldholm V, Johnsborg O, Straume D, Ohnstad HS, Berg KH, Hermoso JA, Havarstein LS. Pneumococcal CbpD is a murein hydrolase that requires a dual cell envelope binding specificity to kill target cells during fratricide. Mol Microbiol. 2010;76(4):905–17.

  12. Donkor ES. Understanding the pneumococcus: transmission and evolution. Front Cell Infect Microbiol. 2013;3:7.

  13. Feil EJ, Smith JM, Enright MC, Spratt BG. Estimating recombinational parameters in Streptococcus pneumoniae from multilocus sequence typing data. Genetics. 2000;154(4):1439–50.

  14. Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, Oggioni M, Dunning Hotopp JC, Hu FZ, Riley DR, et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 2010;11(10):R107.

  15. van Tonder AJ, Bray JE, Jolley KA, Quirk SJ, Haraldsson G, Maiden MCJ, Bentley SD, Haraldsson A, Erlendsdottir H, Kristinsson KG, et al. Heterogeneity among estimates of the core genome and pan-genome in different pneumococcal populations. bioRxiv 2017, doi:

  16. Aras RA, Kang J, Tschumi AI, Harasaki Y, Blaser MJ. Extensive repetitive DNA facilitates prokaryotic genome plasticity. Proc Natl Acad Sci U S A. 2003;100(23):13579–84.

  17. Chewapreecha C, Harris SR, Croucher NJ, Turner C, Marttinen P, Cheng L, Pessia A, Aanensen DM, Mather AE, Page AJ, et al. Dense genomic sampling identifies highways of pneumococcal recombination. Nat Genet. 2014;46(3):305–9.

  18. Croucher NJ, Harris SR, Fraser C, Quail MA, Burton J, van der Linden M, McGee L, von Gottberg A, Song JH, Ko KS, et al. Rapid pneumococcal evolution in response to clinical interventions. Science. 2011;331(6016):430–4.

  19. Sokurenko EV, Gomulkiewicz R, Dykhuizen DE. Source-sink dynamics of virulence evolution. Nat Rev Microbiol. 2006;4(7):548–55.

  20. Ring HZ, Kwok PY, Cotton RG. Human Variome project: an international collaboration to catalogue human genetic variation. Pharmacogenomics. 2006;7(7):969–72.

  21. Chattopadhyay S, Taub F, Paul S, Weissman SJ, Sokurenko EV. Microbial variome database: point mutations, adaptive or not, in bacterial core genomes. Mol Biol Evol. 2013;30(6):1465–70.

  22. Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics. 1999;15(7–8):536–43.

  23. Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7(1–2):203–14.

  24. Chen L, Yang J, Yu J, Yao Z, Sun L, Shen Y, Jin Q. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 2005;33(Database issue):D325–8.

  25. Engel P, Goepfert A, Stanger FV, Harms A, Schmidt A, Schirmer T, Dehio C. Adenylylation control by intra- or intermolecular active-site obstruction in Fic proteins. Nature. 2012;482(7383):107–10.

  26. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2004;32(Database issue):D115–9.

  27. Barakat M, Ortet P, Jourlin-Castelli C, Ansaldi M, Mejean V, Whitworth DE. P2CS: a two-component system resource for prokaryotic signal transduction research. BMC Genomics. 2009;10:315.

  28. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.

  29. Kanehisa M. Linking databases and organisms: GenomeNet resources in Japan. Trends Biochem Sci. 1997;22(11):442–4.

  30. Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Schaffer AA. Database indexing for production MegaBLAST searches. Bioinformatics. 2008;24(16):1757–64.

  31. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001;29(14):2994–3005.

  32. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.

  33. Corpet F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 1988;16(22):10881–90.

  34. Librado P, Rozas J. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics. 2009;25(11):1451–2.

  35. Sokurenko EV, Feldgarden M, Trintchina E, Weissman SJ, Avagyan S, Chattopadhyay S, Johnson JR, Dykhuizen DE. Selection footprint in the FimH adhesin shows pathoadaptive niche differentiation in Escherichia coli. Mol Biol Evol. 2004;21(7):1373–83.

  36. Srivatsan A, Tehranchi A, MacAlpine DM, Wang JD. Co-orientation of replication and transcription preserves genome integrity. PLoS Genet. 2010;6(1):e1000810.

  37. Morales M, Garcia P, de la Campa AG, Linares J, Ardanuy C, Garcia E. Evidence of localized prophage-host recombination in the lytA gene, encoding the major pneumococcal autolysin. J Bacteriol. 2010;192(10):2624–32.

  38. van Kooyk Y, Geijtenbeek TB. DC-SIGN: escape mechanism for pathogens. Nat Rev Immunol. 2003;3(9):697–709.

  39. Figueira M, Moschioni M, De Angelis G, Barocchi M, Sabharwal V, Masignani V, Pelton SI. Variation of pneumococcal Pilus-1 expression results in vaccine escape during experimental Otitis media [EOM]. PLoS One. 2014;9(1):e83798.

  40. Soriani M, Telford JL. Relevance of pili in pathogenic streptococci pathogenesis and vaccine development. Future Microbiol. 2010;5(5):735–47.

  41. Song XM, Connor W, Hokamp K, Babiuk LA, Potter AA. The growth phase-dependent regulation of the pilus locus genes by two-component system TCS08 in Streptococcus pneumoniae. Microb Pathog. 2009;46(1):28–35.

  42. Iovino F, Hammarlöf DL, Garriss G, Brovall S, Nannapaneni P, Henriques-Normark B. Pneumococcal meningitis is promoted by single cocci expressing pilus adhesin RrgA. J Clin Invest. 2016;126(8):2821–6.

  43. AlonsoDeVelasco E, Verheul AF, Verhoef J, Snippe H. Streptococcus pneumoniae: virulence factors, pathogenesis, and vaccines. Microbiol Rev. 1995;59(4):591–603.

  44. Blue CE, Paterson GK, Kerr AR, Berge M, Claverys JP, Mitchell TJ. ZmpB, a novel virulence factor of Streptococcus pneumoniae that induces tumor necrosis factor alpha production in the respiratory tract. Infect Immun. 2003;71(9):4925–35.

  45. Martin B, Granadel C, Campo N, Henard V, Prudhomme M, Claverys JP. Expression and maintenance of ComD-ComE, the two-component signal-transduction system that controls competence of Streptococcus pneumoniae. Mol Microbiol. 2010;75(6):1513–28.

  46. Shak JR, Ludewick HP, Howery KE, Sakai F, Yi H, Harvey RM, Paton JC, Klugman KP, Vidal JE. Novel role for the Streptococcus pneumoniae toxin pneumolysin in the assembly of biofilms. MBio. 2013;4(5):e00655–13.

  47. Bergmann S, Schoenen H, Hammerschmidt S. The interaction between bacterial enolase and plasminogen promotes adherence of Streptococcus pneumoniae to epithelial and endothelial cells. Int J Med Microbiol. 2013;303(8):452–62.

  48. Ng WL, Robertson GT, Kazmierczak KM, Zhao J, Gilmour R, Winkler ME. Constitutive expression of PcsB suppresses the requirement for the essential VicR (YycF) response regulator in Streptococcus pneumoniae R6. Mol Microbiol. 2003;50(5):1647–63.

  49. Sebert ME, Patel KP, Plotnick M, Weiser JN. Pneumococcal HtrA protease mediates inhibition of competence by the CiaRH two-component signaling system. J Bacteriol. 2005;187(12):3969–79.

  50. Muller M, Marx P, Hakenbeck R, Bruckner R. Effect of new alleles of the histidine kinase gene ciaH on the activity of the response regulator CiaR in Streptococcus pneumoniae R6. Microbiology. 2011;157(Pt 11):3104–12.

  51. Gámez G, Castro F, Gómez-Mejia A, Gallego M, Bedoya A, Hammerschmidt S. Bioinformatic analysis and construction of the variome of the virulence factors and genetic regulators in Streptococcus pneumoniae. In: Annual Conference of the Association for General and Applied Microbiology (VAAM). Marburg, Germany: Biospektrum; 2015.

  52. Castro AF, Gómez-Mejia A, Gallego M, Bedoya A, Hammerschmidt S, Gámez GA. Variome of the Pneumococcal Surface-Exposed Proteins and other Virulence Factors: A Bioinformatics Analysis. [Abstract EuroPneumo-P1.27]. Pneumonia. 2015;7:17.

  53. Gámez GA, Castro AF, Gómez-Mejia A, Gallego M, Bedoya A, Hammerschmidt S. Análisis Bioinformático y Construcción del Varioma de los Factores de Virulencia y Sistemas de Regulación por Dos-Componentes de Streptococcus pneumoniae. [Abstract 3rd Colombian Congress on Computational Biology and Bioinformatics-CCBCOL3]. Medellín - Colombia; 2015, Oral Presentation 129.


The authors thank Prof. Vanessa Cienfuegos, School of Microbiology, University of Antioquia, for her critical evaluation of this research work and manuscript. We also thank the peer reviewers for their critical review of the manuscript; their suggestions and comments significantly improved the quality of this work.


Funding for this research work was provided by the Committee for Development of Research at the University of Antioquia (CODI, CIEMB-097-13) in Colombia, and the DFG GRK 1870/1 (Bacterial Respiratory Infections) in Germany. Neither funding source had any involvement in the design of this study; in the collection, analysis and interpretation of data; in the writing of this manuscript; or in the decision to submit this article for publication.

Availability of data and materials

The sequence data that support the findings of this study were previously published and were retrieved from GenBank (accession numbers are provided in Table 1). All data generated by the bioinformatic analyses are included in this published article. Supplementary raw files (mainly DNA and protein sequence comparisons) are available from the corresponding author on reasonable request.

Author information

Authors and Affiliations



All the authors contributed to this research work, participating in the conception and design (GG, AC, MG, AB, SH), collection and analysis of information (GG, AC, MG, AB), discussion of results (GG, AC, MG, AB, AGM, MC, SH), preparation of the manuscript draft (GG, AC), and critical revision and editing of the manuscript (GG, AC, AGM, MC, SH). GG and AC contributed equally to this research work and manuscript. All the authors have read and approved the final version of this manuscript.

Corresponding author

Correspondence to Gustavo Gámez.

Ethics declarations

Ethics approval and consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Competing interests

The authors declare that they have no competing interests in relation to this research work and manuscript.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Gámez, G., Castro, A., Gómez-Mejia, A. et al. The variome of pneumococcal virulence factors and regulators. BMC Genomics 19, 10 (2018).
