The SGH array provides an inexpensive high-throughput means to analyze the genomic content of any H. influenzae strain, by reporting on the possession of 2890 different gene clusters extracted from the WGS of 24 strains. Comparison of the WGS and SGH array data demonstrated the fidelity of the arrays. Comparisons of the hybridization values for the duplicated probe sets within a single array demonstrate their high reproducibility; and comparisons of the hybridization values among technical replicates (prepared with separate labeling, hybridizations, and scans) demonstrate high reproducibility across experiments.
We now have gene possession data on 210 Haemophilus influenzae strains from the combined array and WGS data. Analysis of the gene clusters from all strains reveals that only 23% (678/2890) of the supragenome is conserved across all strains. These data suggest that we are dealing with more than one species based on the Ahmed criterion, which states that if the addition of a single strain to an established and robust supragenome analysis (as we had for H. influenzae) results in a >5% decrease in the core genome size, then the last strain added is likely from a different species. If we exempt the HDHi and the other three core genome outliers, then 94.5% of strains contain 98% of all clusters. Based on our previous suggestion of using the core genome to define a species, we would argue that the differences observed for the HDHi are consistent with it forming a subspecies or even a separate species. Twenty-one of these novel strains were derived from longitudinal studies on two patients. Analyses of these strains by Gilsdorf and colleagues for phenotypic characteristics routinely used to distinguish between H. influenzae and H. haemolyticus suggested that 19 of the 21 were not NTHi strains but instead met criteria for classification as 1) H. haemolyticus; 2) non-haemolytic H. haemolyticus; or 3) chimeric strains with features of both H. influenzae and H. haemolyticus [46–48]. These strains were tested for the presence of iga, which encodes immunoglobulin A1 protease, and for the presence of outer membrane protein P6 by reaction with the 7F3 monoclonal antibody. They were also tested for hemolysis and porphyrin production. Strains that were positive for iga and P6 and negative for hemolysis and porphyrin were considered to be NTHi. Interestingly, these two patients also carried strains in their pharynges that meet the criteria for H. influenzae, 14 of which were analyzed in this study using the SGH array.
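The core genome fraction and the single-strain test described above can be illustrated with a minimal sketch. It assumes that each strain is represented as a set of gene cluster identifiers and that the ">5% decrease" is measured relative to the core size before the strain was added; the function names and the toy strains are hypothetical and are not taken from the study's analysis pipeline.

```python
from typing import Dict, Set


def core_clusters(strains: Dict[str, Set[int]]) -> Set[int]:
    """Return the gene clusters present in every strain (the core genome)."""
    cluster_sets = list(strains.values())
    if not cluster_sets:
        return set()
    return set.intersection(*cluster_sets)


def core_fraction(n_core: int, n_supragenome: int) -> float:
    """Fraction of the supragenome that is conserved across all strains."""
    return n_core / n_supragenome


def flags_different_species(core_before: int, core_after: int,
                            threshold: float = 0.05) -> bool:
    """True if adding one strain shrinks the core by more than `threshold`.

    Assumption: the decrease is relative to the core size before the
    strain was added.
    """
    if core_before == 0:
        return False
    relative_drop = (core_before - core_after) / core_before
    return relative_drop > threshold


# Reported values from the text: 678 core clusters out of 2890.
print(round(core_fraction(678, 2890), 2))

# Toy data (hypothetical): two established strains sharing 98 clusters.
established = {
    "strain_A": set(range(100)),
    "strain_B": set(range(98)) | {100},
}
core_before = len(core_clusters(established))

# A candidate strain that lacks many of the established core clusters.
candidate = set(range(80))
core_after = len(core_clusters({**established, "candidate": candidate}))
print(core_before, core_after,
      flags_different_species(core_before, core_after))
```

In this toy case the core drops from 98 to 80 clusters, a relative decrease of about 18%, so the candidate strain would be flagged for closer taxonomic scrutiny.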
Three of the strains making up this novel lineage are from diverse sites: two were isolated in Australia from asymptomatic patients and one was isolated in Iowa from a patient with COPD. Two hypotheses can explain these results. First, this lineage is already widespread. Second, the bacteria in these patients were under similar fitness pressures to lose a particular set of genes. These hypotheses should be resolvable by whole genome sequence analysis, which will reveal the sequence of alleles and any novel set of genes shared by these strains.
These data show the value of the SGH array for identifying core and distributed genes, grouping strains, and identifying new lineages, subspecies, or species. It should be noted, however, that because the initial design did not include any representative strains from this novel lineage, any lineage-specific genes will be missed by the array. Therefore, these results point us toward sequencing several representatives of this novel lineage and determining their pathogenicity profile in our chinchilla model of otitis media and invasive disease [16, 49, 50], as a recent report has identified H. haemolyticus as being etiologically associated with invasive disease.
This analysis revealed that many genes are enriched in commensal strains (121 genes) while others are more commonly found in virulent strains (Additional file 2: Table S2). Most enriched genes have unknown function, and thus should provide a rich source for targeted studies to characterize novel functional categories associated with virulence and protection from virulence. In contrast, we did not observe a correlation between geographical location and gene content. For example, the ten strains isolated in Pittsburgh (PittAA-PittJJ) are widely distributed throughout the phylogenetic tree. Similarly, the three strains isolated from Papua New Guinea, although they all fall into the same general branch, do not cluster closely together, suggesting they are more similar to strains isolated in other locations than they are to each other.
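As an illustration of how enrichment of a single gene in one group of strains might be assessed, the sketch below applies a two-sided Fisher's exact test to a 2x2 table of gene presence by strain class. The text does not state which statistical test was used, so the choice of Fisher's exact test is an assumption, and the function name and counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(with_a: int, without_a: int,
                           with_b: int, without_b: int) -> float:
    """Two-sided Fisher's exact test for a 2x2 presence/absence table.

    Group A (e.g. commensal strains): with_a carry the gene, without_a do not.
    Group B (e.g. virulent strains):  with_b carry the gene, without_b do not.
    Sums the probabilities of all tables, with the same margins, that are
    no more likely than the observed table.
    """
    n_a = with_a + without_a
    n_b = with_b + without_b
    n_with = with_a + with_b
    total = n_a + n_b

    def table_probability(x: int) -> float:
        # Hypergeometric probability that x of the gene carriers are in group A.
        return comb(n_a, x) * comb(n_b, n_with - x) / comb(total, n_with)

    observed = table_probability(with_a)
    lowest = max(0, n_with - n_b)
    highest = min(n_a, n_with)
    p_value = sum(
        p for x in range(lowest, highest + 1)
        if (p := table_probability(x)) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical counts: gene present in 3 of 3 commensal and 0 of 3 virulent strains.
print(fisher_exact_two_sided(3, 0, 0, 3))
```

With many genes tested at once, the resulting p-values would normally need a multiple-testing correction before any gene is called enriched.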
The data collected as part of this study pertain to Haemophilus sp., but the same strategy could be used to design and produce supragenome arrays for any species for which the majority of the supragenome has been sampled by sequencing a reasonably sized subset of strains. Many human pathogens and commensals have extensive genome diversity, such as Streptococcus pneumoniae, Escherichia coli, Gardnerella vaginalis, and Staphylococcus aureus [3, 5, 7, 14]. This diversity is also observed in many environmental and plant pathogens such as Pseudomonas aeruginosa, Xylella fastidiosa, and Bacillus cereus [52, 53]. The number of core genes as a percent of total genes varies extensively across species. G. vaginalis, E. coli, and B. cereus have high strain diversity, with only 27%, 35%, and 35% of total genes shared across all strains, respectively. In contrast, M. tuberculosis and B. anthracis have low strain diversity, with ~91% of the supragenome (pan-genome) shared across all strains. Different species generate novel strain combinations by transformation, conjugation and/or transduction. H. influenzae is naturally competent, and a genome-wide analysis of recombinants generated in the laboratory demonstrated that homologous recombination is the main mechanism generating strain diversity.
We hypothesize that the extensive differences in the size of the core relative to the supragenome reflect both the age and the evolutionary history of the taxonomic domain being studied as well as the mechanisms used for HGT. In this manner, species that cause chronic infections in highly variable niches, such as the dynamic microbiome present in the human nasopharynx, are likely to have been selected to increase strain diversification. Although it is tempting to speculate that B. anthracis, which causes highly acute disease and is not associated with long-term survival in the human host, does not require a large distributed genome because of its 'hit and run' pathogenic profile, the larger truth is that it is not a species unto itself, but rather a pathogenic clade of B. cereus, which has a very large distributed genome. Thus, M. tuberculosis is the sole species characterized to date that has a very limited supragenome, which may relate to its very recent evolutionary origins, i.e., it simply has not had time to diversify as much as older species. This array provides a powerful strategy for accessing genome content of diverse strains within a species or related species, and thus provides a high-value tool for research on human health and environmental science.