Comparative supragenomic analyses among the pathogens Staphylococcus aureus, Streptococcus pneumoniae, and Haemophilus influenzae Using a modification of the finite supragenome model

Background Staphylococcus aureus is associated with a spectrum of symbiotic relationships with its human host from carriage to sepsis and is frequently associated with nosocomial and community-acquired infections, thus the differential gene content among strains is of interest. Results We sequenced three clinical strains and combined these data with 13 publically available human isolates and one bovine strain for comparative genomic analyses. All genomes were annotated using RAST, and then their gene similarities and differences were delineated. Gene clustering yielded 3,155 orthologous gene clusters, of which 2,266 were core, 755 were distributed, and 134 were unique. Individual genomes contained between 2,524 and 2,648 genes. Gene-content comparisons among all possible S. aureus strain pairs (n = 136) revealed a mean difference of 296 genes and a maximum difference of 476 genes. We developed a revised version of our finite supragenome model to estimate the size of the S. aureus supragenome (3,221 genes, with 2,245 core genes), and compared it with those of Haemophilus influenzae and Streptococcus pneumoniae. There was excellent agreement between RAST's annotations and our CDS clustering procedure providing for high fidelity metabolomic subsystem analyses to extend our comparative genomic characterization of these strains. Conclusions Using a multi-species comparative supragenomic analysis enabled by an improved version of our finite supragenome model we provide data and an interpretation explaining the relatively larger core genome of S. aureus compared to other opportunistic nasopharyngeal pathogens. In addition, we provide independent validation for the efficiency and effectiveness of our orthologous gene clustering algorithm.


Background
Most strains of the Gram-positive bacterium Staphylococcus aureus are avirulent, antibiotic-sensitive commensals; however, over the past several decades there have emerged a number of pandemic, virulent, antibioticresistant strains including methicillin-resistant (MRSA) and vancomycin-resistant (VRSA) strains [1]. Although many S. aureus infections originate in the community, it is also the most common nosocomial bacterial infection in U.S. healthcare institutions, accounting for more than half a million hospital-acquired infections annually which exact an enormous financial and healthcare burden.
S. aureus can be detected in its primary reservoir in the anterior nares on a regular basis in about 20% (and intermittently in another 60%) of the human population [2], leading to efforts for decolonization in healthcare settings [3]. Some S. aureus strains have acquired any of a large number of virulence factors, and can cause a range of infections from mild to serious including pimples, impetigo, boils, cellulitis, endocarditis, necrotizing fasciitis, osteomyelitis, pneumonia, septic arthritis, septicemia, and toxic shock syndrome [4,5]. The widespread, long-term exposure of humans to S. aureus antigens from non-pathogenic strains may help explain why the development of an effective vaccine against pathogenic strains of S. aureus is difficult [6]. As a result of its ubiquity and its ability to acquire virulence and antibiotic-resistance factors it is now estimated that invasive MRSA infections cause more deaths in the U.S. (18,650 vs. 17,011 in 2005) than HIV/AIDS [7][8][9][10]. S. aureus infections of domestic livestock are also of concern, and cause significant economic losses [11,12].
Our laboratory has developed the Distributed Genome Hypothesis (DGH), a model for understanding intra-species gene content differences in bacterial pathogens, especially those associated with chronic infections [13][14][15]. The DGH states that pathogenic bacterial species use horizontal gene transfer to make available to the genomes of individual strains a set of non-core distributed genes with varying population frequencies, and with varying probabilities of contributing to the species' population fitness. The observation that these distributed genes are present at significantly different frequencies in the population of a given pathogenic species, combined with the fact that the total number of genes available in the population is larger (often much larger) than the number of genes in any one strain's genome, has led us to describe the set of genes available to a bacterial pathogenic species as a supragenome [16][17][18] in preference to the synonymously used term pan-genome [19]. The DGH views the combinatorial process of augmenting a set of core genes with a significant number of non-core distributed genes in each strain's genome as an evolutionary strategy to maximize the species' population fitness across a range of environmental conditions (e.g., nutrient supply, competing microbial flora, host innate and adaptive immune responses, and exposure to antibiotics) and at rates that are significantly greater than can be achieved through the vertical transmission and exchange of alleles of a relatively fixed set of genes [20].
Whole genome shotgun sequencing using 454 Life Sciences' next-generation pyrosequencing technology has been used in our laboratory to obtain high-coverage draft genomic DNA sequence datasets for large numbers of strains of several human bacterial pathogens [21][22][23]. Using these data, a predictive finite supragenome model of the DGH was developed, and has been used to delineate the supragenomes of Haemophilus influenzae [21] and Streptococcus pneumoniae [22], two species that are naturally transformable. Here we extend our research on the DGH in several respects: (i) by comparatively examining all of the genomes that were available for S. aureus, a non-transformable species; (ii) by making use of a newly available automated bacterial genome annotation service (the RAST system) for the annotation of these 17 genomes, a service that is based on a set of well-curated biological subsystem annotations [24]; and (iii) by introducing a revised finite supragenome model that allows the estimates of the population gene frequencies to vary during the maximization of the log-likelihood of the observed sample gene frequency data [21]. We use the descriptive and predictive capabilities of our revised finite supragenome model to delineate the S. aureus supragenome, compare it with the supragenomes of Haemophilus influenzae and Streptococcus pneumoniae, and estimate the number of chromosomal genes that would be found with the sequencing of additional S. aureus genomes.

Results
Bacterial genomic DNA sequences and the annotation data for their protein-coding genes Table 1 lists general information about the genomes of the 17 S. aureus strains that were used for this supragenome analysis. The following points should be noted about these DNA sequence datasets. First, only the bacterial chromosome DNA sequences from the 14 published S. aureus strains were used; the plasmid DNA sequences available for 8 of these 14 strains were not included in the analyses. Second, the 14 published S. aureus strains examined included some genomes that are known to be very similar (e.g., strains JH1 and JH9). Third, 16/17 of the S. aureus strains whose genomes were examined were of human origin with only one being of livestock origin (RF122, isolated from a case of bovine mastitis).
Automated bacterial genome annotation is the only practical method to keep pace with the productivity of modern DNA sequencing technologies [25] such as those used in this study to obtain high-coverage (~20X) draft genomic sequences for clinical S. aureus strains (Table 1). We chose the RAST automated bacterial genome annotation system [24] because it is free of charge, confidential, secure, handles compute-intensive bacterial genome annotation jobs quickly, and allows users to upload their own sequences, i.e., it does not just offer annotations of publicly available bacterial genomic DNA sequence datasets. Currently, as Table 2 shows, even based on the rudimentary criterion of CDS counts, different automated bacterial genome annotation providers can produce significantly different results from the same genomic DNA sequence datasets (and it should be noted that exact agreement in Table 2 for CDS counts for a given genome typically indicates re-use of an NCBI RefSeq dataset by another annotation provider). Thus, to have a consistent set of CDS annotations, we used the RAST system to annotate both our in-house generated draft genomic sequences and the bacterial chromosomes of the 14 published S. aureus strains. (Table 1).

Analyses of S. aureus gene frequencies using 17 genomes
We used our previously described computational pipeline [21] to cluster the chromosomal genes from the 17 S. aureus genomes which has proven to be highly reliable in comparison with other systems (Donati et al  n.a n.a. n.a ABWS00000000 n.a n.a n.a CGSSa01 2,971 2,769 n.a n.a. n.a ABWT00000000 n.a n.a n.a CGSSa03 2,951 2,795 n.a n.a. n.a ABWY00000000 n.a n.a n.  2010, vide infra) [26]. This single-linkage clustering procedure is designed to accommodate the use of draft genomic DNA sequence data and its annotations, i.e., data that may include open reading frames that are disrupted by genome assembly errors, or genes interrupted by contig breaks. This procedure yields clusters of CDS orthologs based on 70% sequence identity over 70% of the shorter sequence. Based on these clustering results, genes were classified (Table 3) based on their frequency as either unique (observed in one genome only), distributed (observed in more than one but not all genomes), or core (present in all genomes). The clustering results yielded 3,155 orthologous gene clusters (genes), of which 2,266 were core, 755 were distributed and 134 were unique. The unique genes had an uneven distribution, and it was not surprising that the bovine isolate, RF122, had both the largest number of unique genes (n = 53) and the smallest number of distributed genes (n = 205). Individual genomes contained between 2,524 (RF122) and 2,648 (CGSSa01) genes, whilst the maximum difference between any pair of S. aureus genomes, out of all possible (17 choose 2 = 136) pairs, based on protein-encoding gene content, was 476 genes. In addition, although only 13% of the total number of gene annotations among the 17 strains are noncore, 28% of the total number of genes found in these 17 strains are non-core indicating that many of these genes are found repeatedly throughout the species Figure 1 shows the results of a neighbor grouping analysis [23] performed using the distributed S. aureus genes, a procedure that displays the relatedness of strains based on a metric of identity by state (as opposed to identity by descent) which overcomes the problems associated with trying to do phylogenetic analyses on mosaic genomes resulting from horizontal gene transfer. The edge weights shown in the graph represent the fraction of the distributed genes in the supragenome that is either present in both (or absent in both) of the genomes represented by the vertices of the relevant edge. The mean distance among all possible strains pairs (n = 136) is 0.34 ± 0.01, and valid neighbor groups are indicated (see Materials and Methods). This analysis of distributed gene content helped determine the relationship of the three genomes that we sequenced with the 14 published S. aureus genomes. The genome of the strain CGSSa01 was very closely related to the genomes of the community-acquired methicillin-resistant S. aureus (CA-MRSA) strains USA300 and USA300TCH15, and the genome of the strain CGSSa03 was very closely related to the genomes of the vancomycin-intermediate S. aureus (VISA) strains JH1 and JH9. Figure 2 summarizes the pair-wise relationships between the genomes. The metrics used to describe these relationships are: (i) the number of genes with orthologs in each of the two strains (S = similarity score); (ii) the number of genes with an ortholog in one Core clusters (2,266 here) have one or more representative CDS in each genome examined; unique clusters are represented in only one genome; and distributed clusters in more than one but not all genomes examined.
strain but not the other (D = difference score); (iii) a composite comparison score (C = S -D); and (iv) the number of genes with orthologs found only in both strains (P = pair unique score). The mean similarity and difference scores calculated for all possible pairs (n = 136) produced values of 2,448 and 296 genes, respectively. These results and the occurrence of unique genes (Table 3) are consistent with the known similarities between the JH1 and JH9 genomes [27]; the Mu3 and Mu50 genomes [28]; and the USA300 and USA300TCH15 genomes [29]. These results are also consistent with the expected outlier status of RF122 with respect to the strains of human origin. One interesting pair-wise relationship is that of MRSA252 and CGSSa00; the comparison score was at its maximum for MRSA252, whereas all other values for this score involving CGSSa00 were less than the mean, and in most cases more than one standard deviation below the mean. In addition, the maximum pair unique score for all pair-wise relationships was that observed between CGSSa00 and MRSA252, suggesting a highly significant degree of relatedness.

Analyses of S. aureus population gene frequencies using a revised finite supragenome model
To more accurately model the number of genes contained within a species' supragenome we have revised our finite supragenome model to take advantage of the observed gene frequencies obtained from the sequencing of limited numbers of target strains. Our model treats each of the N genes in a bacterial species' supragenome as an independent Bernoulli random variable, with a gene's occurrence in a genome of one of the strains of the species representing a success outcome of a Bernoulli trial that has a probability equal to the gene's population frequency among all strains [21]. We model the population gene frequencies for a species with a supragenome as being limited to K classes, with each class defined by two parameters: a Bernoulli probability μ k that represents the gene frequency, and a corresponding mixture coefficient π k that represents the probability that one of the N genes in the supragenome belongs to class k. Thus, in addition to N and K, our model requires 2K additional parameters, which we denote using the vectors μ and π. The K elements of the vector π are constrained to be greater than zero and sum to one. The K elements of μ are constrained to be greater than zero, monotonically increasing with increasing k, and that μ K has a fixed value of one (for the core genes in the supragenome). Our observed sample data from the |S| genomes under study (17 here) is represented by a vector C = {c 0 , c 1 , c 2 , ..., c |S| } whose elements equal the number of genes observed in exactly n = 0, 1, 2, ..., | S| of these genomes. A constrained nonlinear programming function (fmincon) from MatLab's Optimization Toolbox is used to maximize the following log-likelihood function of the observed gene frequencies using values of N over a reasonable range [20]: Our initial model only carried out the optimization with respect to N and π; but did not allow the values of μ k (where k <K) to vary during the maximization of the log-likelihood of the observed data. The revised model removes this restriction, and the results obtained using this revised model yield insights not previously available.
An overview of the results obtained using the revised model is shown for three human bacterial pathogens: Haemophilus influenzae, Streptococcus pneumoniae, and Staphylococcus aureus (Figure 3). The results obtained for these three supragenomes differ significantly in their plots of the log-likelihood of the observed data against the values of supragenome size N that were examined during the optimization. Fortuitously, these results illustrate two contrasting types of supragenomes (H. influenzae and S. aureus) and a third (S. pneumoniae) whose general characteristics are intermediate between these two types. Thus, a broad plateau was observed in this plot for H. influenzae, whereas the log-likelihood plot for S. aureus declined very abruptly at estimated values of N that were significantly greater than the estimated size of its supragenome (Figure 3, upper panels). The revised supragenome model employed herein has the advantage that values of μ k (where k <K) are allowed to vary during the maximization of the log-likelihood.  Hence a priori estimates of fixed values for these parameters (i.e., as was required in our initial supragenome model)-a procedure that the bottom panels of Figure 3 show is difficult-are conveniently avoided. At the extreme case of the lowest population gene frequency class, the values of μ 1 and π 1 at the maximization of the log-likelihood of the observed data indicate that the H. influenzae supragenome is dominated by a large pool of very rare genes. In contrast, the value for μ 1 at the maximization of the log-likelihood of the observed data for the S. aureus supragenome (0.11) is an order of magnitude greater than that of H. influenzae. At the other extreme of population gene frequencies, even though the estimated size of the S. aureus supragenome at 3,221 chromosomal genes is the smallest value for N observed among these three species, the absolute number of S. aureus core genes (2,245) and their fraction of N (i.e., the value of π K = 0.6971) are both significantly greater than the same values for either H. influenzae or S. pneumoniae (Figure 3, lower panels). This estimate that approximately 30% of the S. aureus genes are non-core is in reasonable agreement with the results of earlier, more limited studies that used comparative genomic hybridization to estimate a value for this parameter of 22% [30].
The finite supragenome model is predictive as well as descriptive, Figure 4 shows the excellent correlation between the observed sample gene frequency data from the 17 S. aureus genomes under study (the number of genes observed in exactly n = 1, 2, ..., 17 of these genomes) and the same values predicted using the values of μ, π, and N obtained using our revised finite supragenome model trained on the sample data (all 17 strains). Figure 5 (lower panels) shows the ability of the model to predict the numbers of new, core, and the total number of chromosomal genes that should be detectable after sequencing up to 30 S. aureus genomes. These results agree very well with those obtained using the results from our analyses of the 17 S. aureus genomes under study ( Figure 5, upper panels). They also indicate that the sequencing of 30 S. aureus genomes will yield 99.5% of the total number and 99.4% of the core  Analyses of the RAST annotation data for the 17 S. aureus genomes under study The RAST annotation data for the S. aureus chromosomal CDS agreed very well with the results of our CDS clustering procedure (Tables 4 and 5). Each CDS feature in a GenBank-style annotation record (such as the ones available from the RAST system) typically has a product feature qualifier value associated with it. Table 4 shows that 96% (3,038 of 3,155) of the orthologous clusters generated by our CGS supragenome analysis pipeline [21] were comprised of RAST-identified CDS that all mapped to a single product feature qualifier value. Similar results were obtained when confining the analyses to just the core clusters, i.e., the clusters with the largest numbers of members, where there was 98% (2,122 of 2,266) agreement between the two methods. In those cases where clusters were comprised of CDS that mapped to more than a single product feature qualifier value, the results were often understandable, e.g., the additional product qualifier values were either undefined (e.g., hypothetical protein) or imprecise (e.g., regulatory protein). The reverse mapping-of S. aureus RAST annotation product feature qualifiers to CGS-identified CDS clusters-was expected to be more problematic, but these results were also very robust (Table 5). In this analysis 82% (258 of 316) of the RAST annotation CDS product feature qualifiers that describe genes belonging to distributed clusters appeared in only a single cluster, and 91% (1,473 of 1,623) of the core cluster-derived CDS product feature qualifiers were found in only a single cluster. The difference between the results in Tables  5 and 4 indicate that the reverse mapping (of CDS product feature qualifier values to gene clusters) is slightly more susceptible to the problems cited above concerning imprecise CDS product descriptions.
Seven percent (220 of 3,155) of the chromosomal CDS clusters (genes) found in the 17 S. aureus genomes under study were comprised of CDS that were annotated as being phage-derived, and these genes were unevenly distributed among the core, distributed and unique gene sets. Thus, a query of the CDS product feature qualifier values for the term phage anywhere in the product name yielded a mapping of the CDS annotations to 7 core, 190 distributed, and 23 unique clusters. Thus, although 72% (2,266), 24% (755), and 4% (134) of the chromosomal CDS clusters found in the 17 S. aureus genomes under study represented core, distributed, and unique genes, 3%, 86%, and 11% of the phagederived genes, respectively, were from these three classes of gene clusters.
The distinctive strategy of the RAST system is to have domain experts maintain well-curated subsystem functional annotations (e.g., metabolic pathways, regulatory pathways, or cellular structures) that can be used across bacterial genomes, instead of having the functional annotation of individual bacterial genomes attempted one-by-one by non-experts in the various subsystems that these genomes may encode. Subsystems analysis of the S. aureus chromosomal CDS annotations in the context of our gene clustering results was quite revealing (Table 6) in that almost half of the core CDS clusters could not be assigned to any RAST subsystem, as well as 90% of the distributed cluster CDS and 94% of the unique cluster CDS. These results highlight our very limited understanding of the biology of this important bacterial pathogen, as well as the need for improvements in functional annotation to keep pace with the extraordinary productivity of DNA sequencing technology.
The RAST annotation data also provided useful insights into the presence or absence of genetic determinants of methicillin resistance (Table 7) in the genomes of the 17 S. aureus strains under study. Thirteen gene product descriptions indicated involvement of the CDS expression product in resistance to the antibiotic of choice for S. aureus infections, methicillin. Ten of these 13 genes were core genes, with the remaining three genes mecA, mecI, and mecR1 distributed among more  4 5 6 7 8 9 10 11 12 13 14 15 16

Number of genomes in which gene is found
Observed Predicted Figure 4 Histogram of observed sample gene frequencies compared to the predicted number using the finite supragenome model. The number of genes for each frequency class was calculated using the results from our revised finite supragenome model (trained on all 17 strains). The observed and predicted number of core genes (2,266) found in all 17 strains agreed exactly; these values are not shown to avoid distortion of the scale of the graph. Distributed genes appear in two or more strains, but not all (from 2 to 16 here).
than one but not all genomes (Table 7). If present, all of these genes were single copy, with the exception of the fmtB gene, which was present in multiple copies in 7 of the 17 genomes under study. The presence of the mecA gene was consistent with the known methicillin-resistance status of the strains, and its absence in the untested strain CGSSa00 indicated that this strain should be methicillin-susceptible.

Discussion
The first model for supragenome (or pan-genome) analysis [19] was developed using genomic DNA sequence datasets from eight strains of the species Streptococcus agalactiae, also known as group B Streptococcus (GBS). This model was developed by fitting an exponential decay function to a plot of the average number of core genes observed with increasing numbers of genomes examined (where the average was taken for all possible permutations of the order of consideration of the genomes under study), and took the asymptote defined by such a plot as an estimate of the size of the GBS core genome. This model also fitted a second exponential decay function to a plot of the average number of new genes observed with increasing numbers of genomes examined (where the average is taken as before), and  (17) or a hypothetical study of 30 strains of Staphylococcus aureus. The two panels on the right show observed (upper panel) and predicted (lower panel) numbers of core and total genes that were or are predicted to be found in second to the nth genome for the number of strains examined (17) or a hypothetical study of 30 strains of Staphylococcus aureus. Observed new, core, and total genes were calculated using genomes examined in ascending order of their counts of non-core genes. took the asymptote as an estimate of the number of new genes that would be observed with each new GBS genome sequenced. Finally, this model also estimated the size of the GBS pan-genome by deriving a third equation for its rate of growth. A recent review [31] proposed a revised version of this model that adopts a power law fit (Heaps' Law, from the field of information retrieval) in lieu of the earlier exponential fit of the observed data. In both the original and the power law models, a threshold parameter (α) is used to distinguish open and closed pan-genomes, where an open pan-genome (with α ≤ 1) is defined as one that will yield a nonzero number of new genes when each additional genome of the species is sequenced. More recently, with the advent of ever less expensive sequencing technologies, making it possible to sequence scores of independent strains, it could be argued that modeling of the supra/pan-genome is unnecessary since sequencing of additional strains can be continued until no significant number of novel genes are identified [32] The probabilistic foundation of the model used in this work [21] offers a somewhat different perspective, but with the improvements described above that take into account multiple gene frequency classes allows for accurate supragenome modeling of populations/species for which it is not possible to obtain multiple independent clonal lineages (i.e. unculturable organisms) for sequence analysis. In these cases gene frequency were inferred from the different sequence coverage levels observed within the sequenced population. Since the vast majority of bacterial species are not culturable, but are now amenable to whole genome sequencing through single cell isolation and whole genome amplification techniques [33] our model can be used to estimate the percentage of the supragenome that has been obtained at intermediate coverage levels.
A recent comparison of the pan-genome model of Tettelin et al. with our finite supragenome model demonstrated that the two models make highly similar predictions when provided the same dataset [26], thus serving as a validation for both. However, both models share fundamental challenges in areas such as the selection of appropriate genomic DNA sequence datasets to use. For example, we chose not to include S. aureus plasmid DNA sequences, e.g., those associated with the published genomes that were used in our analysis (Table 1), nor the DNA sequences of the many S. aureus bacteriophage genomes that have been published [34]. However, our analysis included three draft genomes from S. aureus strains that we sequenced, and these unfinished genomes may contain plasmid-derived contigs. The results in Table 3 in fact suggest that the strain CGSSa01 may contain one or more S. aureus plasmids. Its genome contains the largest number of genes (2,648), and the 18 unique genes it contains are significantly greater than all but two of the other 17 strains. A comprehensive review of the S. aureus genome [35] provides a detailed description of some of the many plasmids and other mobile elements that it may contain. Decisions about the inclusion or exclusion of published plasmid and bacteriophage DNA sequences in a supragenome analysis can therefore lead to systematic error in estimates of counts of different classes of genes (e.g., core and unique genes).
Other issues arise during the selection of appropriate genomic DNA sequence datasets to use in a supragenome analysis. In this work we included genomes that are known to be very closely related, e.g., those of strains JH1 and JH9 [28], as well as one known outlier genome (RF122) of bovine origin [36]. These decisions can also be criticized as leading to systematic error in estimates of counts of different classes of genes. However, inclusion of a limited number of closely related and outlier strains also provide for useful internal controls for the results of the supragenome analysis. In some respects, and especially for a species such as S. aureus, bias in the selection of genomic DNA sequence datasets to use in a supragenome analysis is unavoidable. Given the intense interest in clinically relevant strains of S. aureus, one can reasonably expect that even with the ever increasing affordability and subsequent availability of bacterial genomic DNA sequence datasets, the S. aureus strains selected for sequencing will for the foreseeable future be dominated by epidemiologically *Shown are the numbers of distinct RAST annotation CDS "product" feature qualifiers that describe CDS belonging to either core, distributed, or unique clusters and where the relevant CDS yield the indicated number of distinct clusters. The CDS product feature qualifier "hypothetical protein" was deliberately excluded as it would be expected to map to different clusters.
important clinical strains, and that the much more quantitatively representative commensal S. aureus stains that could be isolated from human subjects will be under-represented in supragenome and other comparative genomic analyses. The S. aureus core genome has been found to contain a heptameric DNA sequence (GAAGCGG) that is believed to protect it from uncontrolled rearrangements [37]. This conserved crossover hotspot initiator or chi site is not the only DNA sequence motif and associated nucleic acid information processing system with a putative influence on the structure and maintenance of the S. aureus supragenome. The Sau1 Type I restrictionmodification (RM) system has at least two important influences in this regard [38]. First, it reduces the efficiency of conjugation between S. aureus and enterococci, the putative source of vancomycin resistance. This is believed to explain why very few vancomycinresistant S. aureus strains have arisen, despite tremendous selective pressure acting on the bacterial flora of patients treated with this drug [39]. Second, the Sau1 RM system's multi-copy specificity gene, sau1hsdS, has many alleles with significant population frequencies, and these alleles correspond to the major S. aureus lineages. Five copies of this gene occur in the RF122 genome; three copies in the genomes of CGSSa01, MSSA476, USA300, and USA300TCH15, and two copies in the remaining genomes we examined (i.e., it appears to be a multi-copy core gene found in all genomes examined). The Sau1 RM system therefore not only controls horizontal gene flow into S. aureus from other species, but also within S. aureus lineages via the polymorphic RM specificity alleles of sau1hsdS loci [38].
Analysis of the supragenome of S. aureus isolates from antibiotic-naïve populations would be an interesting extension of this work. Paradoxically, antibiotic treatment increases S. aureus conjugation frequency [40]; induces S. aureus temperate phage to excise, replicate, and transfer pathogenicity islands [41]; and when used in combination therapies may unexpectedly increase the spread of resistance among S. aureus strains [42]. In some bacterial genomes the selection pressure exerted by antibiotic exposure may also have the unexpected  effect of promoting multi-drug resistance due to positive epistasis amongst combinations of alleles of antibiotic resistance loci [43]. Thus, one might expect that S. aureus isolates from antibiotic-naïve populations would yield an estimate of supragenome size smaller than that reported here, and be comprised of an even larger percentage of core genes. The results of a supragenome analysis may therefore represent an aggregate of the results of environmental niche-specific supragenomes affected by extrinsic agents such as antibiotics that modulate horizontal gene transfer, as well as regulatory allele-specific supragenomes affected by intrinsic genetic phenomena (such as the Sal1 system) that also modulate horizontal gene transfer. Another interesting extension of this work would be an analysis of the supragenome of the related species Staphylococcus sciuri, which is one of the most abundant staphylococcal species and a frequent epidermal commensal of animals [44]. The mecA gene found in methicillin-resistant S. aureus (MRSA) strains encodes the PBP2A penicillin-binding protein, whose affinity for β-lactam antibiotics acts as a sink that vitiates the efficacy of these drugs and protects native S. aureus PBPs during their function as bacterial cell wall synthetic enzymes [45]. Incorporation of the mecA gene into the S. aureus genome is an unusual event, and requires both a delivery entity called the staphylococcal chromosome cassette (SCC), and a suitable but rarely encountered S. aureus genetic background that can tolerate the presence and expression of the mecA gene [45]. Staphylococcus sciuri, although susceptible to β-lactam antibiotics, is believed to be the source of the precursor homolog of the mecA gene present in the limited number of MRSA strains that have emerged worldwide [45]. Although limited in number, MRSA strains have spread in an epidemic manner with devastating clinical consequences. The S. sciuri genome appears to be ubiquitously agreeable to the presence and expression of its mecA precursor homolog [44], and hence a study of the S. sciuri supragenome may yield insights into the genetic determinants whose homologs in, or horizontal acquisition by, a S. aureus genome may predispose to the acquisition of mecA.

Conclusions
The supragenome of S. aureus offers a significant contrast to those of other human bacterial pathogens including H. influenzae and S. pneumoniae which share the nasopharynx as their primary site of colonization. The S. aureus supragenome (3,221 chromosomal genes) was the smallest among these three species, yet it contained the largest number (2,245) of core genes ( Figure  3). H. influenzae and S. pneumoniae each has a larger horizontal gene pool [45] to draw from by virtue of their possession of autocompetence and autotransformation systems which facilitate the horizontal acquisition of new genes [21,22]. Although some strains of S. aureus are deadly pathogens, most are not, and when compared to H. influenzae and S. pneumoniae, the S. aureus supragenome may have co-evolved with its human host for a longer period of time. As a result, there may be relatively less selection pressure on the S. aureus supragenome to maintain a larger size, as over an extended period of evolutionary time it has optimized its ability to maintain core and distributed phenotypes to survive the environmental conditions it typically encounters as a predominantly commensal species of its human host. Finally, we are currently extending the bioinformatic analyses described herein with the development of a free, post-annotation software package for the execution of the full supragenome analysis pipeline described here. This project will provide the community with the ability to reproduce a given set of published supragenome analysis results, re-analyze the underlying data after modification of the parameters used during an analysis, and perform more detailed and insightful querying of the results than can be summarized in a typical journal publication.

DNA sequencing and genome assembly
The genomes of three S. aureus strains CGSSa00, CGSSa01, and CGSSa03 were sequenced at the Center for Genomic Sciences (CGS). All three are low-passage clinical isolates obtained from patients in Pittsburgh, and were obtained, respectively, from: (i) a nasal specimen from an asymptomatic individual; (ii) an elbow arthroplasty infection [46]; and (iii) a chronic abdominal mesh implant infection that developed after ventral herniorrhaphy [47]. Each high-coverage (~20X) draft genome assembly was obtained using data generated on a Roche/454 Life Sciences GS-FLX instrument using one region of a two-region 70 × 75 mm pico-titer plate [47]. De novo draft genome assemblies were obtained using the Newbler assembler, software releases 1.1.01.20 (CGSSa00) and 1.1.03.24 (CGSSa01 and CGSSa03). Newbler's default assembly parameters were used except for the minimum overlap identity (MOI) percentages (default = 90%) used for the CGSSa01 and CGSSa03 assemblies, which were 96% and 97%, respectively. For each genome the optimal Newbler genome assembly was chosen (from a series of assemblies using different MOI values) as the one that yielded the smallest number of large contigs, the longest overall assembly length, and the smallest percentage of Q39minus Newbler-estimated assembly base-call quality values. An initial round of gap closure of the genome assembly of strain CGSSa00 was carried out as described [22]. Table 1 provides information about these three genomes and the 14 previously sequenced S. aureus genomes used in the comparative and supragenome analyses. Genomic DNA sequence assembly accession numbers for all genomes (Table 2) and publicly available FASTA files for the 14 reference genomes used were obtained from the NCBI.
Automated bacterial genome annotation and generation and identification of protein-encoding gene clusters The RAST system (Rapid Annotation using Subsystem Technology; http://rast.nmpdr.org/) [24] using default parameters was used to provide a consistent set of automated genome annotations for the bacterial chromosome assemblies of all 17 strains. Annotation output datasets were downloaded, and then in-house developed software was used to parse the protein-encoding gene features and prepare tranches of FASTA-formatted input files (proteins, genes, and genomes) for allagainst-all sequence alignment jobs using the FASTA and TFASTY software [27] installed on the Codon computing cluster at the Pittsburgh Supercomputing Center, all as described [21,22]. Subsequent steps in the proteinencoding gene-clustering generation and identification procedures were also performed as described [21,22], with the exception that version 2 (instead of version 1) of the multiple sequence alignment program Partial Order Alignment (POA2) [27] was used during the CDS gene clustering procedure.

Neighbor Grouping of S. aureus genomes
After the core (orthologs present in all genomes), distributed (orthologs present in two or more but not all genomes), and unique (present in only one genome) genes were identified, the presence or absence of the distributed genes in each of the 17 strains was used to define neighbor groups of the genomes under study, as described [23]. Briefly, neighbor grouping examines a pair-wise distance matrix in which the distance between a given pair of genomes is equal to the fraction of the distributed genes that is either present in both genomes or absent in both genomes. A pair of genomes are neighbors if the distance between them is less than the mean distance between each pair of genomes under study, minus the standard error of the mean. A valid neighbor group is a sub-graph comprised of two or more nodes (genomes) that are connected by nearest-neighbor edges.
Mathematical modeling of the S. aureus supragenome MatLab (version 7.1) and its Optimization Toolbox were used to develop a revised finite supragenome model [21] that allows the estimates of the population gene frequencies (for a limited number K of classes of genes) to vary during the maximization of the log-likelihood of the observed sample gene frequency data. The number of elements (K) of the population gene frequency vector (μ) and the corresponding mixture coefficient vector (π) was set at 6; the initial values of π were all set to 1/K, and the initial values of μ were {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. The MatLab programs used for the revised finite supragenome model will be provided upon request.