Investigating the global genomic diversity of Escherichia coli using a multi-genome DNA microarray platform with novel gene prediction strategies

Background The gene content of a diverse group of 183 unique Escherichia coli and Shigella isolates was determined using the Affymetrix GeneChip® E. coli Genome 2.0 Array, originally designed for transcriptome analysis, as a genotyping tool. The probe set design utilized by this array provided the opportunity to determine the gene content of each strain very accurately and reliably. This array constitutes 10,112 independent genes representing four individual E. coli genomes, therefore providing the ability to survey genes of several different pathogen types. The entire ECOR collection, 80 EHEC-like isolates, and a diverse set of isolates from our FDA strain repository were included in our analysis. Results From this study we were able to define sets of genes that correspond to, and therefore define, the EHEC pathogen type. Furthermore, our sampling of 63 unique strains of O157:H7 showed the ability of this array to discriminate between closely related strains. We found that individual strains of O157:H7 differed, on average, by 197 probe sets. Finally, we describe an analysis method that utilizes the power of the probe sets to determine accurately the presence/absence of each gene represented on this array. Conclusions These elements provide insights into understanding the microbial diversity that exists within extant E. coli populations. Moreover, these data demonstrate that this novel microarray-based analysis is a powerful tool in the field of molecular epidemiology and the newly emerging field of microbial forensics.


Background
Escherichia coli is a Gram negative bacterium that is commonly found in the lower intestine of warm-blooded organisms. Most E. coli strains are harmless. As such, commensal strains of E. coli are part of the normal microbiota of the lower gastrointestinal tract (GI), benefitting their hosts by producing vitamin K2 and also by preventing the establishment of pathogenic bacteria within the intestine through a "colonization barrier effect". Yet some strains can cause infections in humans. Six major categories of diarrheagenic E. coli exist: enterotoxigenic E. coli (ETEC), enteroinvasive E. coli (EIEC), enteropathogenic E. coli (EPEC), enterohemorrhagic E. coli (EHEC), enteroaggregative E. coli (EAEC), and diffusely adherent E. coli (DAEC). Furthermore various types of extraintestinal pathogenic E. coli (ExPEC) are known to cause infections outside the gastrointestinal tract; namely uropathogenic E. coli (UPEC) and newborn meningitis-associated E. coli (NMEC).
E. coli O157:H7, an EHEC pathotype, was first recognized as a human pathogen in 1982 [1]. It has emerged as a major enteric pathogen capable of causing outbreaks of food poisoning in humans and often responsible for costly food product recalls. The primary clinical manifestation of an E. coli O157:H7 infection is hematic diarrhea that can progress into more severe sequelae of hemolytic-uremic syndrome and thrombocytopenic thrombotic purpura, which then can lead to renal failure and death [2][3][4]. An estimated 75,000 cases of E. coli O157:H7 infections occur annually in the United States, making it the principal serotype of enterohemorrhagic E. coli isolated from patients [5]. Infections with this serotype usually have a food borne etiology. In 2006, for example, three multi-state and 24 single-state foodborne outbreaks were confirmed to be due to E. coli O157:H7 [6].
The extent of genomic diversity existing in E. coli and other microbial pathogen populations has been the subject of debate. Recent studies including several genomic methods like DNA microarrays, optical mapping, and whole genome sequencing have shed new light on the level of diversity within bacterial populations, making it clear that the level of genomic diversity had been grossly underestimated previously [7][8][9]. A comprehensive sampling of this diverse species, however, has not been addressed in any single study using an advanced wholegenome analysis method such as microarray analysis.
Previous microarray studies have been limited to sampling a relatively small and/or undiverse collection of E. coli strains. Moreover, traditional microarrays used in these studies have had limited gene content and typically utilize a single probe for determining whether a gene is present or absent within a given strain [10]. This makes gene detection calls questionable and absolutely requires the inclusion of a reference strain for making accurate gene calls.
Current array technology allows the design of microarrays containing greater than six million features. Arrays thus can be designed to probe the genomes of multiple organisms, a prerequisite of great value to an investigator who may study closely related strains or species. However, a caveat in multiple genome array design is that it is not often possible, nor desirable, to define unique probes that match every gene target with 100% sequence homology. It is therefore necessary to establish a consensus sequence for allelic variants. When a consensus sequence is represented on the array, consideration of the level of sequence homology of a particular target sequence is important to accurately determine the status of that gene following a hybridization experiment.
The Affymetrix GeneChip ® E. coli Genome 2.0 Array represents the genic and intergenic sequences of four sequenced strains of E. coli: enterohemorrhagic (EHEC) O157:H7 strains EDL933 and Sakai, the uropathogenic (UPEC) O6:H1:K2 strain CFT073, and the laboratory attenuated K12 strain MG1655 (OR:H48:K-) [11]. The array consists of 228,484 25-mer oligonucleotides that represent a total of 10,208 probe sets. Each probe set contains approximately 22 oligonucleotide probes; 11 perfect match (PM) probes and 11 mismatch probes (MM). Mismatch probes are identical to the perfect match probe with the exception of a one nucleotide (nt) mismatch located at the 13th (middle) position of the oligo sequence. These mismatch probes are designed to allow for an approximation, and correction, of non-specific hybridization signal. We presumed that a probe design strategy such as this would be ideally suited for genotyping studies for two reasons: i) the probe redundancy for each genomic target sequence and ii) the excellent specificity afforded by hybridization of short 25-mer probes. Previous microarray studies have used either long oligos (50-mers to 80-mers) or PCR-derived cDNA amplicons. Inherent to these array designs are the disadvantages that i) only a single probe signal is used to measure the presence of each genomic target and ii) relatively high non-specific hybridization signal is observed when using longer DNA probes [12,13].
In this study, we used the Affymetrix GeneChip ® E. coli Genome 2.0 Array to investigate the gene content of 207 diverse isolates of E. coli and Shigella. This multi-genome array provided us the opportunity to survey genes from different pathogen groups [11]. Strains of E. coli and Shigella interrogated in this study consist of approximately 60 different serotypes and 75 isolates of the O157:H7 serotype (Table 1). Moreover, the entire ECOR collection [14] was included in this study and contains a set of 72 reference strains isolated from a variety of hosts and geographical locations that is presumed to represent the range of phenotypic and genotypic variation in the E. coli species as a whole. Finally, four sequenced strains of Shigella, believed to be in the same species division as E. coli, were included in this study. In summary, this strain collection was chosen both to i) represent the global diversity of E. coli and ii) to capture a diverse collection of a single pathogen type (EHEC O157:H7). In doing so, we were able to not only evaluate the ability of this array to measure the global genomic diversity of this species, but also to assess whether this array was useful for discriminating among individual, and closely related, strains of the same pathogen type. The latter is an important feature for the newly emerging fields of microbial forensics and molecular epidemiology, where the ability to uniquely identify and discriminate among closely related strains is of great importance in conducting attribution investigations of foodborne outbreaks or of covert biocrimes and potential bioterrorism activities [15,16].
The probe set design utilized in this expression array allowed highly accurate determination of a gene allele presence. These data, combined with a unique custom analysis approach, eliminate the need of a reference strain, which previously has been an absolute requirement for accurate comparative genomic hybridization (CGH) studies. The comprehensive wealth of data derived from this study has been used to identify groups

Bacterial strains and preparation of genomic DNA
Strains of E. coli and Shigella used in this study are listed in Table 1. Strains EDL933, MG1655, and CFT073 are E. coli O157:H7, K-12, and uropathogenic strains, respectively, for which the genome sequences are available [17][18][19]. Sakai is a Japanese enterohemorrhagic O157:H7 outbreak strain, and likewise the genome sequence is available [20]. Strains were grown overnight in 3 mls of Luria Broth at 37°C in a shaking incubator. Genomic DNA was isolated from 2 mls of the overnight culture using the Qiagen DNeasy Tissue kit following the manufacturer's recommendations. Typically, 5-10 μg of purified genomic DNA was recovered in a final elution volume of 200 μl. The purified DNA was further concentrated using Microcon YM-30 microcentrifuge filters to a final volume of approximately 10 μl. 5 μg of the genomic DNA was fragmented by incubating at 37°C for 10 minutes in a 50 μl reaction containing 1X One-Phor-All Plus Buffer (GE Healthcare) and 0.1 units DNase I (GE Healthcare). The fragmentation reaction was heat-inactivated at 95°C for 10 minutes. Following fragmentation, the DNA was 3'-end labeled by adding 4 μl of 5X terminal transferase buffer (Promega), 1 μl of 1 mM biotin-11-ddATP (PerkinElmer NEL508), and 2 μl (60 units) of terminal transferase enzyme (Promega) (final volume 27 μl). Labeling was carried out for at least 2 hours at 37°C followed by heat inactivation at 95°C for 10 minutes.
Array, incubated at 45°C, with rotation (60 rpm) for 16 hours in a hybridization oven. Following hybridization, the wash and stain procedure was carried out on an Affymetrix FS-450 fluidics station using the mini_prok2v1_450 fluidics script [21]. Wash and stain reagents were prepared according to the Gen-eChip ® Expression Analysis Technical Manual [21]. The following exceptions were made to the wash and stain procedure: Streptavidin solution mix (vial 1) was replaced with SAPE solution mix. Arrays were scanned using a GeneChip ® Scanner 3000 7G running GCOS v1.4 software.
Analysis Considerations for Genotyping with the Affymetrix E. coli 2.0 Array Adapting an expression array for genotyping purposes required a novel data analysis approach, given that the use of the Affymetrix probe set design produces 22 independent measurements for each genome target. These 22 measurements are derived from a single probe set, consisting of 11 probe pairs. Each probe pair is composed of a single 25-mer perfect match (PM) oligo and its corresponding 25-mer mismatch (MM) oligo. The mismatch oligo is identical to the perfect match with the exception of a single nucleotide mismatch located at the central (13 th ) position of the oligo sequence. Therefore summarizing these independent measurements into a single probe set intensity value that is indicative of the status of a particular gene can be problematic. This can be further complicated when one or more of the individual probes (25-mers) in a probe set is able to hybridize to another genomic region, thereby giving rise to a "partially specific" probe set. Secondly, the Affymetrix array is a multi-genome array. Previous CGH studies have required a reference strain in order to accurately detect whether a particular gene target was present/absent. For a multigenome array, however, it is not possible to use data from any single hybridization experiment as a reference data set. Therefore the status of each gene target should, ideally, be determined by considering only raw probe intensities from an individual experiment.

Parsing CEL Files and Data Analysis Tools
All 207 Affymetrix CEL files generated in this study were parsed using the Robust MultiArray Averaging (RMA) method (Bioconductor affy Package and Affymetrix Power Tools) [22][23][24][25]. Hierarchical clustering analysis and principal component analysis was done using Spotfire and the MADE4 package of Bioconductor [26,27]. MAS 5.0 gene present/absent calls were determined using Affymetrix Power Tools as well as the affy Bioconductor Package [24,28]. All data files, including 207 Affymetrix CEL files, the RMA-summarized probe set intensities, and the MAS 5.0 gene present/absent calls, can be accessed at http://www.mrscentral.com

Probe Set Summarization Methods
It is desirable to determine the summarized intensity of each probe set (consists of 11 PM and 11 MM oligos). In all of our analyses, summarized probe set intensities were calculated for each strain by using the Robust Mul-tiArray Averaging (RMA) [23] method as implemented in the affy package of Bioconductor or Affy Power Tools. In brief, RMA summarization of probe level data is done by performing three individual treatments on all of the experimental CEL data simultaneously. First, probe specific correction of the PM probes is done using a model based on the observed intensities being the sum of signal and noise. Secondly, quantile normalization is performed on the corrected PM probe intensities. Finally, a median polishing algorithm is used to summarize the background-corrected, normalized probe intensities to generate a final probe set value.
We also evaluated the utility of the MAS 5.0 probe set summarization algorithm [28]. Briefly, the signal is calculated as follows: i) global background correction, ii) ideal MM value is calculated and subtracted to adjust the PM intensity, iii) adjusted PM intensities are logtransformed to stabilize the variance, iv) biweight estimation to provide a robust mean of the resulting values, and finally v) probe set intensity signal is scaled using a trimmed mean.

Making Accurate Gene Present/Absent Calls
In addition to summarizing probe set intensities, it is also desirable to make absolute gene present/absent calls. We assessed two novel methods for determining the status of each gene using the hybridization intensities from the four E. coli strains that are represented on this array. The first method was included in the standard Affymetrix GCOS analysis package and is referred to as the MAS 5.0 Gene Detection approach [28]. Here gene targets are determined to be either present, absent, or marginal as determined by a p-value calculated from the discrimination score (R) for each probe pair. The discrimination score is a basic property of a probe pair that describes its ability to detect its intended target. It measures the target-specific intensity difference of the probe pair (PM-MM) relative to its overall hybridization intensity (PM+MM). The discrimination score (R) is therefore defined as: R = PM − MM PM + MM and approaches 1.0 as the mismatch probe intensity approaches 0. The next step in calculating a detection p-value is to compare each discrimination score to the user-definable threshold Tau. Tau is a small positive number that can be adjusted to increase or decrease sensitivity and/or specificity of the analysis.
Increasing the threshold Tau can reduce the number of false positive (present) calls but may also reduce the number of true present calls. We found that the most accurate gene detection calls were made when using a Tau value of 0.20. Next, the one-sided Wilcoxon Signed Rank test is employed to generate the detection p-value. It assigns each probe pair a rank based on how far the probe pair discrimination score is from Tau. The detection p-value is then compared to a user-definable cutoff value that results in the final present/absent call for each gene. Here again, we evaluated several cut-off values for the detection p-value and found that a value of 0.050 provided highly accurate gene detection calls. That is, probe sets with detection p-values <0.05 were scored as "present", and greater than, or equal to, 0.05 as "absent".
The second method that we used for determining the status of a particular gene target is analogous to the more conventional methods of CGH where a reference strain is used to determine relative hybridization intensities. Here, we calculated a "RefMax" value for each probe set by determining the maximum hybridization intensity (RMA summarized) from each of the four reference strains represented on the array. Therefore, "RefMax" represents the maximum observed hybridization intensity for each probe set as determined by the appropriate reference strain to which the probe set was designed (for example, the RefMax for probe set j is calculated as: RefMax j = Max(CFT073 j , EDL933 j , Sakai j , MG1655 j )). Absolute probe set intensities for each of the 207 hybridization experiments performed were compared to the RefMax data. For strain i, probe set j, RefMax Ratio is defined as: Re fMax ij Ratio = log 2 strain ij refMax j For each probe set, gene targets were scored as "absent" if their hybridization intensities were greater than 4-fold lower from the RefMax value. Otherwise, genes were scored as "present".

Phylogenetic Analysis of E. coli Genome Data
Tables containing 10,208 MAS 5.0 Present/Absent calls (described above) were transformed into A/T binary nucleotide calls for each isolate. Tables were then wrapped/concatenated into fasta-formatted text files. Fasta text files were then used directly as input for the MEGA5 software package [29]. Phylogenetic analysis was performed using the maximum likelihood method.

Results
In the present study, we interrogated the genomic content of 207 isolates of E. coli and Shigella using the Affymetrix GeneChip ® E. coli Genome 2.0 Array, choosing to explore both the level of diversity that existed among closely related strains of the same pathotypes (i. e. independent O157:H7 strains) and the level of genomic diversity that existed among a globally diverse collection of E. coli strains (i.e. the ECOR collection). From this analysis, we were able to determine, with high accuracy, the gene content of each of these strains relative to those probes represented on the array.
Probe Set Summarization Methods: RMA vs. MAS 5.0 In Figure 1, we demonstrate the advantage of using RMA over MAS 5.0 by comparing summarized probe set intensities from the two reference O157:H7 strains, EDL933 and Sakai. From the scatter plots in Figure 1, it is apparent that the RMA probe summarization method ( Figure 1A) yields a much lower variance in probe sets where intensities are less than 8 (log2) as compared to MAS 5.0 probe set summarization method ( Figure 1B). This decrease in variance allows for a more accurate determination of actual gene differences between these two strains. This is even more apparent when relative probe intensities (log 2 [Sakai]/[EDL933]) are plotted. Summarized probe set intensities from the MAS 5.0 method result in large fold-change differences ( Figure  1D), suggesting, incorrectly, the presence of genomic differences at these loci (false positives). However, RMA's ability to reduce this variability ( Figure 1C) provides a much higher confidence in accurately identifying true genomic differences (fewer false positives). Because we feel that RMA is a better method for summarizing probe set intensities, we performed all downstream analyses (hierarchical clustering, Pearson correlation matrix, and PCA) using RMA summarized probe set intensity data.
Determining the Status of Gene Targets: RefMax vs. MAS 5.0 Using RefMax we found that >99.85% (5653/5654) of the target genes that shared >98% sequence homology to a particular probe set were accurately detected as "present" as shown in Table 2. Also, 64% of the genes sharing 96%-98% sequence homology are called "present" and when homology decreases to between 94%-96%, only 27% of the probe targets are called "present".
Here we are not always attempting to score these present/absent calls as either "correct" or "incorrect" but rather simply report our finding that the probe set design used here is capable of discriminating among closely related genes sharing >90% sequence homology   [30]. Also shown in Table 2, using MAS 5.0, 99.83% of the probe sets having >98% homology to their target gene were scored correctly as "present". Further, 55% of the probe sets having between 96%-98% sequence homology to their gene target were called as "present". In comparison to the MAS 5.0 method, the RefMax method is more sensitive to sequence variation between the probe set and its gene target. This clearly can be seen when sequence homologies between probe sets and gene targets are relatively low (between 90%-92%). In these instances, the MAS 5.0 method called 22.4% of the gene targets as "present" whereas our RefMax method only called 5.6% of these gene targets as "present". Furthermore, a clear advantage of using the MAS 5.0 method for scoring genes was its ability to accurately call genes as being "present" regardless of the overall probe intensities. For example, in three instances, where gene target lengths were below 100 nt and corresponding probe set intensities were in the bottom 2% of all probe set intensities on the array, the Affymetrix MAS 5.0 method correctly predicted the status of these gene targets in each of the reference strains a total of 19 out of 21 times. This corresponds to an accuracy of 90% (10% false negatives) for probe sets having intensities in the bottom 2% of all probe sets.
While having the advantage of being able to detect genes correctly despite sometimes having low absolute probe intensities, the MAS 5.0 method was found to be Gene present/absent calls were determined for the 4 sequenced reference strains represented on the array using either the RefMax or MAS 5.0 gene detection methods. Genome corresponds to the accession number of the genome/strain being interrogated. Homology Bin corresponds to the percentage by which a probe set consensus sequence matches the target genome sequence. Genes Present corresponds to the number of genes present on the array which fall into a particular Homology Bin (this is also the maximum number of correct "present" calls). Present or Absent calls were determined from either the MAS 5.0 or RefMax method and are shown under the "MAS5" and "RefMax" headings. Strains MG1655, EDL933, and Sakai were each performed in duplicate to show the reproducibility of each method. Independent measurements are indicated by a "/" under the Present and Absent headers.
only 95% reproducible when comparing replicate experiments of a single reference strain. When "absent" and "marginal" calls were not considered separately, however, this reproducibility increased to 97% (i.e. "A" or "M" = "A/M") (as in Table 2). When examining the reproducibility of our RefMax method, we observed that well over 99.9% of all gene status calls from replicate experiments agreed. This level of reproducibility was based on whether at least a 4-fold difference was observed for any probe set from replicate experiments. The lack of reproducibility in the MAS 5.0 method appears to be a function of the level of gene homology and not necessarily absolute probe signal, whereas our RefMax method was independent of either of these factors. When using our RefMax method for making gene status calls, it is necessary to apply a user-defined cut-off value for RefMax in order to account for probe sets that are present on the E. coli v2.0 array, yet absent in all four of the reference strains represented on this array. We have observed a total of 145 probe sets on this E. coli 2.0 array which do not appear to have target genes present in any of the four reference strains. This was confirmed by BLASTing these probe set sequences against the reference genome sequences. As expected, the RefMax intensity was found to be in the lower 1.5% of all probe intensities which suggests an "absent" gene target. Of these 145 probe sets, 72 corresponded to Affymetrix control probes (probe sets designated with "AFFX-" prefix). The remaining 73 probe sets corresponded to genes encoded by various phages, transposons, plasmids as well as several known antibiotic resistance genes. These probe sets were included by Affymetrix due to their relevance to microbial pathogens. Therefore, RefMax values whose absolute probe set intensities were < 9 were considered to be absent in the reference strains. We therefore scored all probe sets from non-reference strains that differed by less than 4fold from RefMax as "absent". Only when a non-reference strain was found to have an absolute probe set intensity 4-fold greater than the RefMax value were these genes scored as "present".
Likewise, one can use the MAS 5.0 technique described above to make present/absent calls. Upon doing so, we found that among the four reference strains, there were a total of 151 probe sets that were consistently called absent. Of these 151 probe sets, 73 corresponded to Affymetrix control probe sequences, leaving 78 other (non-control) probe sets. One can then expand this analysis from the four reference strains to all 207 isolates analyzed in this study. Upon doing so, we found 44 (non-control) probe sets that were consistently absent in all 207 E. coli and Shigella isolates examined in this study. Again, these genes were mainly antibiotic resistance elements from species other than E. coli or Shigella. Two notable exceptions were nine independent probe sets targeting Phage M13 genes as well as the EDL933 gene Z2261 (and unknown protein associated with Rhs element).
As an additional confirmatory method, we compared our MAS 5.0 gene present/absent calls with those determined by confirmatory PCR. The stx1 and stx2 genes were chosen as targets. We interrogated 195 isolates for the presence of these two genes by confirmatory PCR. We found an agreement between present/absent calls in 193/195 cases (99%). The two discrepancies are thought to be due to primer/probe specificity towards different stx alleles (data not shown).

Discussion
Strain Attribution, Identification, and Discrimination within the O157:H7 Pathotype The genomic content of 63 unique strains (75 isolates) of E. coli O157:H7 was assessed in order to determine whether we could accurately distinguish between independent isolates of the same serotype. From this study, we found that, on average, individual strains of O157:H7 differed by 197 gene targets (Additional File 1). Moreover, among this same group of isolates, we found that a maximum of approximately 750 gene targets differed among any two independent O157:H7 isolates (EC1214 vs. EC885). When individual strains were examined in replicate, we found that a maximum of three gene target differences may occur between replicates. Note, however, that these "replicates" could often be independent isolates of the same strain and the observed gene differences may sometimes be due to known deletions that are present in strains that were derived from a parent strain (i.e. EC536 and EC1212 are isogenic). Next we tested if any of our independent isolates were indistinguishable using this array-based method, i.e., were there multiple isolates of the same strain. We found several instances where there were fewer than 10 gene target differences between two independent isolates (e.g. EC1423 vs. EC423, EC514 vs. EC515 from Additional File 1). In fact, we were able to make correlations between isolates that were previously unknown and presumed to be independent. For example, strain 86-24 (a human isolate from Washington State in 1986) differed from strain EC533 by a single gene difference. Upon further inspection of the history of these strain designations, we found that EC533 was also a human isolate from Washington State, also obtained in 1986. It is therefore likely that these two isolate designations actually refer to the same strain that were received by our laboratory from two different repositories. This precipitated a closer examination of the history of our strain collection, which showed a few similar examples of this.
From a forensic perspective, this is a very important finding as it suggests that this array-based method is not only a powerful tool for discriminating among different strains of the same serovar but also provides, with a high level of confidence, a robust means for strain identification and attribution.

Exploring the Global Genomic Diversity of E. coli
The ECOR collection is a set of 72 reference strains of E. coli isolated from a variety of hosts and geographical locations [12]. It was established for use in studies of variation and genetic structure in natural populations and is representative of the range of phenotypic variation in the species as a whole. We therefore used this well-studied collection to analyze the performance of our array-based method in measuring the global diversity of E. coli. Among this collection, we found that individual strains differed, on average, by approximately 1900 gene target differences. Moreover, a maximum of 4002 gene target differences was observed between any two strains (ECOR57 vs. ECOR37). This represents approximately 40% of the probe targets present on this array. In addition to the ECOR reference collection and our O157:H7 reference collection, we included 50 other E. coli and Shigella isolates in our analysis (Table 1). It is interesting to note that some of the most diverse strains included in this study were among the latter non-ECOR, non-O157 strains. For example, the greatest number of probe set differences, 4890, was observed between an O157:H7 strain EC885 and the UPEC CFT073 strain EC1521. While this number may seem impossibly high, it is in fact representative of gene differences and allelic differences that exist between these two evolutionarily distinct lineages of E. coli. Different alleles are often represented by different probe sets on the E. coli Genome 2.0 Array. Therefore, probe set differences do not always directly correspond to true gene differences. A gene differences matrix was prepared by calculating the number of gene differences that exists among all 207 isolates examined in this study (Additional File 1). Similarly, we performed a Pearson correlation analysis on all of the strains included in this study based on their RMA probe set summarized values. Results from this analysis depict a strain-to-strain relatedness quantified from 0 to 1 (1 being identical). A visual representation of this correlation matrix is shown in Additional File 2 and is color-coded based on relatedness.
In order to better understand the global diversity of E. coli, we performed several cluster-based analyses. Using the RMA-summarized data generated from all of the probe sets, we performed a hierarchical clustering analysis (Euclidean means) on the strains and gene targets to demonstrate graphically the level of genomic diversity that exists within this species (Additional File 3). The dendrogram in Additional File 3 shows that the 207 isolates examined in this study cluster into three major groups. The first cluster (left most) represents EHEC1 strains. The other two clusters cannot be clearly defined by a particular set of traits and contain an assortment of both serotypes and pathotypes.
To further demonstrate the relatedness and diversity of the strains examined in this study, we performed a Principal Component Analysis (PCA) on RMA-summarized probe set intensities from all 207 isolates. Upon plotting the first three dimensions of the PCA data, we color-coded the strain identifiers based on their serotype (Figure 2A). From this, we observed that strains belonging to the same serotype often, but not always, cluster together. It is worth noting that all serotypes are not equally represented in the collection of isolates that were examined in the current study, therefore it is not possible to make conclusions regarding the clustering of some serotypes. As pathotype information was available for 128 of our isolates, a PCA was performed on these isolates independently ( Figure 2B). This PCA plot showed strong clustering of isolates from the EHEC1 and EHEC2 pathotypes. Though these pathotypes have similar clinical manifestations, their genomes are quite distinct. While some strain of the same pathotype often co-clustered, this was not universally true, therefore suggesting that virulence traits are inherited horizontally. Indeed, this phenomenon has been reported previously [31,32].
Examining the Conserved, Core, Backbone Genes of E. coli Our array-based technique is well suited for determining which genes are conserved in various types of E. coli. Initially, we wanted to determine which gene targets were present in all strains of E. coli and Shigella included in this study. We found 2256 gene targets, using the MAS 5.0 technique described above, that were consistently present in all 207 isolates examined (Additional File 4). We define these gene targets as being the core/conserved genic regions that make up the E. coli genomic backbone. Similar analyses have been performed previously on whole genome sequence data [33]. From these in silico analyses, approximately 2200 conserved core genes were identified. Therefore, previous in silico analyses correlate well with our microarray-based approach.
Moreover, because the Affymetrix GeneChip ® E. coli Genome 2.0 Array represents intergenic regions from the MG1655 genome, we were also able to assess the conserved intergenic regions as well. In total, there are 686 unique probe sets representing 240.9 kb of intergenic sequences (Additional File 5). Of these, we found 232 unique probe sets that were conserved (always

Defining the EHEC Pathotype-Specific Genes
Because we examined a large number (63) of O157:H7 strains and a large number of non-O157:H7 strains, we are also able to make conclusions on which genes are responsible for defining the O157:H7 serotype. That is, which genes are always present in O157:H7 strains but never present in non-O157:H7 strains. For this analysis, we included only O157:H7 strains and excluded O157: H-and "O157" (EC1522, EC1523, EC510, EC1427) strains. As might be expected, most genes fitting this criterion engender information for the biosynthesis of the O and H antigens (Table 3).

Phylogenetic Analysis of E. coli Genomes Using Microarray Data
Extant populations of E. coli are structured into main phylogenetic groupings. The evolutionary history of E. coli has usually been recapitulated employing Multi Locus Sequence Typing (MLST) analysis of the ECOR collection. However, it has been shown that a correlation exists between the phylogenetic history of the strains and the presence/absence of genes. We compared our whole-genome microarray analysis approach to the main phylogenetic groups of the ECOR collection previously defined and based on MLST data. For the most part, our whole genome analysis of the ECOR collection correlated well with previous designated phylogenetic groups as shown in Figure 3. There were, however, several notable exceptions. For example, strains ECOR70, ECOR71, and ECOR72 that were previously defined in group B1 were, by our analysis, found to be part of group A (shown in Figure 3 as A*). And, notably, as with MLST, array analysis showed group D strains to be more heterogeneous than other groupings, seemingly reflecting diverse and distinct lineages of Group D strains.
One might a priori expect disagreement between our whole-genome approach and previous MLST analyses. That is, whereas MLST is an approximation of the mutation rate of the genomic backbone based on sequence analysis of a relatively small number of housekeeping genes, analysis of whole genome differences is a measure of both the rate of mutation and horizontal gene transfer that exists among this species. These two analyses are quite different and distinct. So it is therefore interesting to find such a remarkably high level of correlation between these two different approaches.

Conclusions
The current study is an in depth analysis of the genomic content of a large and diverse collection of E. coli Using the MAS 5.0 gene detection method, we filtered those probe sets that were consistently called "present" in all O157:H7 strains yet were called "absent" in all non-O157:H7 strains. The 14 probe sets shown here correspond to O157:H7-specific genes.
strains, including strains of the closely related genus Shigella. By utilizing a high-density oligo array representing 10,112 unique probe sets, we were able to qualitatively describe the complete gene repertoire of each of 183 unique strains. In so doing, we found that diverse strains can differ by as many as 4890 probe set targets, e.g., CFT073 strain EC1521 and the O157:H7 strain EC885. We found that, among the 207 independent isolates examined in the present studies, individual strains differed, on average, by 2276 probe sets, and the median number of probe set differences was 2597. When one focuses on only the EHEC-like isolates (Additional File 3, cluster 1), we found that there are 2321 probe sets that are variable among this EHEC cluster. On average, EHEC cluster 1 strains differed by 197 probe sets. We also examined how our microarray-based method was able to recapitulate the phylogenetic history of isolates. Using the maximum-likelihood method, we found a remarkable level of correlation between previous MLST-based results and our current microarray-based results. While these two approaches differ, in some cases however, these two analyses may be regarded as highly similar. Many of the probe sets on the E. coli Genome 2.0 array are in fact representatives of different alleles of the same gene. We give as an example the mutS gene: the mutS gene is >95% conserved in nucleotide sequence in strains Sakai, MG1655, and CFT073. However, on the E. coli 2.0 array, there are three different probe sets that represent these three allelic variants of mutS. Importantly, the hybridization-based strategy utilized here is capable of discriminating between these three alleles [30]. So, indeed, our microarray approach is, in some cases, is also a measure of allelic diversity that exist among housekeeping genes. This is highly analogous to an MLST-based assay.
It is tempting to speculate on the minimal number of gene differences observed between any two independent isolates derived from the same culture. However, we will refrain from making any specific qualitative remarks to this regard. The reason is simply one of semantics in that the definition of "independent isolates" varies based on perspective. For example, a clinician might refer to two isolates derived from the same patient on different days as being "independent". Yet from a genomics point of view, these two isolates may well be the same strain as measured by their genomic content.
The concept and definition of "independent isolates" and "different strains" came to light in our investigation of the 2006 outbreak of O157:H7 associated with fresh spinach. During this outbreak, we interrogated >200 "independent isolates" that were geographically diverse (different clinical isolates from different states across the country or various food isolates). Collectively, we referred to these isolates as the "outbreak population". All of the isolates contained within this outbreak population, with the exception of two, were genomically indistinguishable (no probe set differences). The two exceptions differed by a single phage-related insertion. This difference resulted in the appearance of approximately 40 probe set differences, a result later confirmed independently in our laboratory by optical mapping and whole genome sequencing [34]. So, the question of minimal gene differences can only be answered accurately in the context of the whole genome. Whereas these 40 gene differences might indicate a new strain, clearly they were derived from the same population of E. coli O157:H7 that caused the outbreak; i.e., they were indeed siblings within the outbreak population.
It is imperative to consider, as technology allows us to delve deeper into the bacterial genome, that individual strains and independent isolates are not synonymous terms. That deriving isolates from a single sick individual on different days; diluting and plating a single cultured sample onto several plates and subsequently deriving isolates from different plates; deriving isolates from different patients sickened during a single foodborne outbreak; or obtaining isolates from different environmental samples, are all valid examples of what "independent isolates" denote. These independent isolates, however, may, but need not, be individual strains. That is, if differences, no matter how subtle, are observed in genome comparisons of two independent isolates, sensu stricto, they are two individual strains. In contrast, if no distinguishing differences are seen between two isolates, then they are  Figure 3 Molecular Phylogenetic analysis by Maximum Likelihood method: The evolutionary history was inferred by using the Maximum Likelihood method based on the Tamura-Nei model [35]. The tree with the highest log likelihood (-282250.6332) is shown. Initial tree(s) for the heuristic search were obtained automatically as follows. When the number of common sites was < 100 or less than one fourth of the total number of sites, the maximum parsimony method was used; otherwise BIONJ method with MCL distance matrix was used. The tree is drawn to scale, with branch lengths measured in the number of substitutions per site. There were a total of 10208 positions in the final dataset.
independent isolates of the same strain. These distinctions will increasingly impact the fields of molecular epidemiology and microbial forensics as today's technologies and their attendant discriminatory powers are utilized in such investigations.

Additional material
Additional File 1: Gene differences matrix. Number of gene differences based on strain-to-strain comparisons is shown. A "gene difference" is defined here as a 4-fold difference in the RMA-summarized probe set intensities. The cells are color-coded based on the number of gene differences using the scale below. Strains are ordered based on their relatedness as determined by hierarchical cluster analysis.
Additional File 2: Pearson correlation matrix. R-Bioconductor was used to calculate Pearson correlation coefficients using RMA-summarized probe set intensities. The cells are color-coded to show relatedness and correlation (coefficient from 0-1) according to the scale below. Strains are ordered based on their relatedness as determined by hierarchical cluster analysis.
Additional File 3: Hierarchical Cluster Analysis. RMA-summarized probe set intensities were used to hierarchically cluster (Euclidean means) all 207 isolates (top dendrogram) and all 10,208 genes (left dendrogram) in Spotfire. The heatmap shows RMA probe set intensities from low (green) to high (red). The top dendrogram is color-coded based on the 3 large clusters of E. coli.
Additional File 4: Conserved, core, backbone genes in E. coli and Shigella. Using the MAS 5.0 gene detection method, we filtered those probe sets that were called "present" in all 207 isolates. The 2256 conserved probe sets are listed here along with their gene description, when available.
Additional File 5: Conserved Intergenic Regions. Using the MAS 5.0 gene detection method, we filtered those probe sets that were annotated as "intergenic" and called "present" in all 207 isolates. The 232 conserved intergenic probe sets are listed here along with their genome position and length.