Genome evolution in major Escherichia coli O157:H7 lineages

Background Genetic analysis of Escherichia coli O157:H7 strains has shown divergence into two distinct lineages, lineages I and II, that appear to have distinct ecological characteristics, with lineage I strains more commonly associated with human disease. In this study, microarray-based comparative genomic hybridization (CGH) was used to identify genomic differences among 31 E. coli O157:H7 strains that belong to various phage types (PTs) and different lineage-specific polymorphism assay (LSPA) types. Results A total of 4,084 out of 6,057 ORFs were detected in all E. coli O157:H7 strains and 1,751 were variably present or absent. Based on these data, E. coli O157:H7 strains were divided into three distinct clusters, which consisted of 15 lineage I (LSPA type 111111), four lineage I/II (designated in this study) (LSPA type 211111) and 12 lineage II strains (LSPA 222222, 222211, 222212, and 222221), respectively. Eleven different genomic regions that were dominant in lineage I strains (present in ≥80% of lineage I and absent from ≥92% of lineage II strains) spanned segments containing as few as two and up to 25 ORFs each. These regions were identified within E. coli Sakai S-loops # 14, 16, 69, 72, 78, 83, 85, 153 and 286, Sakai phage 10 (S-loops # 91, 92 and 93) and a genomic backbone region. All four lineage I/II strains were of PT 2 and possessed eight of these 11 lineage I-dominant loci. Several differences in virulence-associated loci were noted between lineage I and lineage II strains, including divergence within S-loop 69, which encodes Shiga toxin 2, and absence of the non-LEE encoded effector genes nleF and nleH1-2 and the perC homologue gene pchD in lineage II strains. Conclusion CGH data suggest the existence of two dominant lineages as well as LSPA type and PT-related subgroups within E. coli O157:H7.
The genomic composition of these subgroups supports the phylogeny that has been inferred from other methods and further suggests that genomic divergence from an ancestral form and lateral gene transfer have contributed to their evolution. The genomic features identified in this study may contribute to apparent differences in the epidemiology and ecology of strains of different E. coli O157:H7 lineages.


Background
Enterohemorrhagic E. coli (EHEC) are associated with gastrointestinal and systemic illness in humans. This illness can range in severity from uncomplicated diarrhea to hemorrhagic colitis and the sometimes fatal hemolytic uremic syndrome [1][2][3]. EHEC strains possess a number of common virulence traits, such as the production of one or more types of antigenically distinct Shiga toxins (Stx1 and Stx2), a large plasmid that encodes an enterohemolysin, and a chromosomal gene cluster termed the locus of enterocyte effacement (LEE) that is found in most, but not all EHEC serotypes [4,5].
E. coli O157:H7 is the EHEC serotype most often associated with disease outbreaks and with the onset of severe disease in the U.S., Canada, Japan, and the U.K. [2,3]. Genomic sequencing of two outbreak-related E. coli O157:H7 strains, Sakai and EDL 933, revealed that there are many phage-related sequences and genomic islands scattered throughout the chromosome of this organism and that many of these genetic elements encode potential virulence attributes [6][7][8][9]. These E. coli O157:H7-specific genomic segments are dispersed throughout 177 different regions of a common genomic backbone that is shared with the distantly related E. coli K-12. Known as S-loops and O-islands (OI) in Sakai and EDL933 strains, respectively, some of the regions must be responsible for the virulence characteristics that were acquired during evolution of E. coli O157:H7.
E. coli O157:H7 strains are believed to comprise a clonal complex of related genotypes that are found worldwide [10]. It has been suggested that E. coli O157:H7 arose from the enteropathogenic E. coli serotype O55:H7 through sequential acquisition of virulence traits and serotype change [11][12][13]. A step-wise evolution of E. coli O157:H7 from enteropathogenic E. coli O55:H7 was recently proposed, based on the properties of specific existent strains that carry intermediate characteristics and are presumed to represent intermediates in the evolution of this EHEC serotype [11,13]. The proposed evolutionary pathway includes lysogenization by an stx2-converting phage followed by a shift in serotype from O55 to O157 brought about by acquisition of the O157 gnd-rfb locus [14]. The EHEC large plasmid was then acquired by the organism and the ability to ferment sorbitol was lost. The sorbitol-non-fermenting O157:H7 ancestor was subsequently lysogenized with an stx1-converting phage and, finally, acquired a frameshift mutation in the uidA gene, resulting in loss of β-glucuronidase activity [11]. The validity of this stepwise model is supported by recent comparative genomic hybridization (CGH) studies using E. coli O157:H7 whole genome-based oligonucleotide microarrays [13].
It is well recognized that E. coli O157:H7 populations have a bovine reservoir and that the organism is likely adapted for life in the ruminant gastrointestinal tract [15][16][17][18]. Using Octamer-Based Genome Scanning (OBGS), Kim et al., showed that Stx-producing, β-glucuronidase and sorbitol-negative E. coli O157:H7 strains have diverged into two distinct lineages, lineages I and II, and that descendants of these two lineages appear to have distinct ecological characteristics [19,20]. Populations of the two lineages are widespread in cattle in both the U.S. and Australia, suggesting that these two lineages have been disseminated throughout the global cattle population [20]. Analysis of a set of nearly 1,500 E. coli O157:H7 strains showed that lineage I strains are more commonly associated with human disease than lineage II strains, suggesting that there may be differences in virulence characteristics or transmissibility between these two taxonomic groups of E. coli O157:H7 strains [21].
Although high resolution comparative studies have indicated that prophages are associated with divergence of E. coli O157:H7 strains [6], systematic analysis of genetic distinctions between lineage I and lineage II strains has only recently been undertaken. We [22] and others [23] have recently reported that the Q anti-terminator gene found upstream of the stx2 operon in E. coli O157:H7 differs between lineage I and II strains. Possession of the stx2 gene is thought to be associated with the occurrence of more severe disease, such as hemolytic uremic syndrome, caused by EHEC strains [24]. In addition, Dowd and Ishizaki [25] recently used oligonucleotide mini-arrays to compare expression of a set of 610 genes between three lineage I and three lineage II strains, noting differential expression of stx2 as well as a number of other potentially virulence-associated genes under anaerobic growth conditions. Collectively, these published studies suggest that these lineages are genetically distinct and that lineage-specific genetic differences may be responsible for phenotypic differences between members of these two lineages.
To systematically identify lineage-specific genome segments, microarray-based CGH was used in this study to catalogue genomic alterations that are unique to lineage I or lineage II strains. The oligonucleotide microarray was based on the genome sequences of two lineage I, human outbreak-related E. coli O157:H7 strains, Sakai [9] and EDL933 [7], and the nonpathogenic E. coli K12 (MG1655) strain [26], and it was used to probe the genomes of a collection of E. coli O157:H7 strains. Although significant strain-strain variation was observed, our focus was on genome alterations that were conserved within different strains of a given lineage. Regions of divergence identified by CGH were then cloned and sequenced to gain additional insight into the genomic differences between the two lineages. The results of the study show that many lineage-specific differences in genomic content involve genes that are known or potentially virulence-associated. These findings may be used to identify candidate genes that could confer lineage-specific traits related to unique ecological or virulence characteristics.

Validation of microarray data by comparison with sequence data
In the CGH experiments, 6,057 probes from the MWG E. coli O157:H7 array set yielded adequate signals when hybridized with a mixture of labelled DNA from the three reference strains (K12, Sakai, and EDL933), and these probes were used for all subsequent analyses. For E. coli O157:H7 EDL933, 5221/5261 (99.2%) of the probes with 100% identity to the corresponding sequence gave the expected results (Table 1). Among the 40 probes that were expected to hybridize with DNA from E. coli O157:H7 strain EDL933 but did not, 13 (0.25%) were negative and 27 (0.5%) were uncertain according to the GACK analysis. For the E. coli O157:H7 Sakai strain, in contrast, only 4951/5335 (93%) of the probes with 100% identity to the corresponding sequence gave the expected results; 39 (0.7%) were negative and 345 (6%) were uncertain based on GACK analysis. However, twenty-one of the probes with 100% identity to E. coli Sakai sequence that did not generate a positive signal with E. coli Sakai DNA were homologous to ORFs in S-loop#108 [9]. This S-loop is equivalent to OI#57 in E. coli O157:H7 EDL933. PCR experiments revealed that the Sakai strain used in this study has a deletion of these ORFs in S-loop#108 while the corresponding OI in EDL933 was intact (data not shown).
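The concordance figures reported above can be re-derived from the probe counts. The following Python sketch is illustrative only: the function name `concordance` and the dictionary layout are hypothetical, and only the counts themselves come from the text.

```python
def concordance(expected_ok: int, total: int) -> float:
    """Return the percentage of probes that gave the expected result."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * expected_ok / total


# Counts reported in the text for probes with 100% sequence identity.
edl933 = {"ok": 5221, "total": 5261, "negative": 13, "uncertain": 27}
sakai = {"ok": 4951, "total": 5335, "negative": 39, "uncertain": 345}

for name, counts in (("EDL933", edl933), ("Sakai", sakai)):
    # Discordant probes should equal the negative plus uncertain GACK calls.
    discordant = counts["total"] - counts["ok"]
    assert discordant == counts["negative"] + counts["uncertain"]
    pct = concordance(counts["ok"], counts["total"])
    print(f"{name}: {pct:.1f}% concordant, {discordant} discordant probes")
```

The check inside the loop confirms that the negative and uncertain calls account for every discordant probe in both reference strains.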

Genomic variability in lineage I and lineage II E. coli O157:H7 strains
In order to distinguish lineage-specific differences from strain-strain variability, multiple strains belonging to three different genotypic groups were tested. Our strain set included fifteen different LSPA genotype 111111 strains (lineage I), four different LSPA type 211111 strains (designated lineage I/II in this study) and 12 different lineage II strains of LSPA types 222222, 222221, 222212, and 222211. Characteristics of the strains used in the study are presented in Table 2, and data from microarray hybridization experiments with these E. coli O157:H7 strains are presented in the supplemental material [see Additional file 1]. A total of 4,084 of the 6,057 probes hybridized with all E. coli O157:H7 strains tested, indicating that this set of genes likely represents the conserved core genome of the ancestral E. coli O157:H7 population that has been maintained during its evolution. There were 222 probes that hybridized only with DNA from E. coli K12 and not with any of the E. coli O157:H7 strains tested, including two probes (ECs1372 and b1894) that were expected to hybridize with EDL933 and Sakai DNA, based on sequence identity. The remaining 1751 probes showed significant variability in microarray hybridization signals among E. coli O157:H7 strains (Table 3), and the ORFs that they represent were designated as variably absent or present (VAP).
Of these 1,751 VAP ORFs, 79 hybridized with only one of the 31 E. coli O157:H7 strains tested and 662 hybridized with all but one of the 31 E. coli O157:H7 strains tested. Initial functional classification of the 1,751 VAP genes showed that 506 (29%) were encoded by prophage or phage-like elements found in the K-12, EDL933 and Sakai genomes and 615 (35%) were located within K-island (KI), O-island (OI), or S-loop genomic islands [7,9,26]. The distribution of VAP genes in the genomes of E. coli EDL933 and Sakai and the percentage of the 31 E. coli O157:H7 strains that were divergent for each gene were plotted (Figures 1 and 2). In this study, "lineage-specific" refers to the presence of single ORFs or ORF clusters exclusively in a given lineage, while "lineage-dominant" refers to the presence of single ORFs or ORF clusters in ≥80% of the strains of one lineage and their absence from ≥92% of strains of other lineages.
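The partition of probes into conserved, absent and VAP categories can be sketched as follows. This is a minimal illustration that assumes each probe has already been reduced to a 0/1 presence call per O157:H7 strain; the function names and the toy probe identifiers are hypothetical and are not taken from the study's data files.

```python
from typing import Dict, List


def classify_probe(o157_calls: List[int]) -> str:
    """Classify one probe from its 0/1 calls across the O157:H7 strains.

    'core'   : present in every O157:H7 strain tested
    'absent' : present in no O157:H7 strain (for example, K12-only probes)
    'VAP'    : variably absent or present among the strains
    """
    if not o157_calls:
        raise ValueError("at least one strain call is required")
    present = sum(o157_calls)
    if present == len(o157_calls):
        return "core"
    if present == 0:
        return "absent"
    return "VAP"


def summarize(calls: Dict[str, List[int]]) -> Dict[str, int]:
    """Count how many probes fall into each category."""
    counts = {"core": 0, "absent": 0, "VAP": 0}
    for probe_calls in calls.values():
        counts[classify_probe(probe_calls)] += 1
    return counts


# Toy calls for four strains (hypothetical values, not the study's results).
toy = {
    "probe_A": [1, 1, 1, 1],
    "probe_B": [0, 0, 0, 0],
    "probe_C": [1, 0, 1, 1],
}
print(summarize(toy))

# The category sizes reported in the text account for every probe analysed.
assert 4084 + 222 + 1751 == 6057
```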

Lineage- and phage type-specific and lineage- and phage type-dominant ORFs
A total of 132 of the 1,751 VAP ORFs were either specific or dominant to a lineage, LSPA type or PT (Table 4, Figure 3).

i) S-loop#14/OI#7
Three lineage I and lineage I/II-specific ORFs, ECs0237, ECs0238, and ECs0239, were identified in S-loop#14/OI#7 by CGH (Table 4). The nucleotide sequence [GenBank:EF112439] of this region in the lineage II strain FRIK 920 was homologous to Sakai sequence, except that a
There were two clusters of lineage and LSPA type divergent ORFs in S-loop#69. The first cluster, consisting of ORFs ECs1160 to ECs1163 located upstream of the stx2 genes in E. coli Sakai, was missing in all four lineage I/II and the 12 lineage II strains but was conserved in all lineage I strains except strain 97701. The ORFs within this cluster encoded putative bacteriophage proteins and hypothetical proteins.
The second cluster of divergent ORFs in S-loop#69/OI#45 consisted of 21 ORFs that were missing in 11 out of 12 lineage II strains and present in all four lineage I/II strains
are also called tellurite resistance- and adherence-conferring islands because they contain genes responsible for these phenotypes [27]. Putative virulence-associated ORFs located outside of the lineage I and lineage I/II-specific cluster, including the urease genes (ECs1321-ECs1327), genes for tellurite resistance (ECs1351-ECs1358), and iha (IrgA homologue adhesin) (ECs1360) [27,28], were found by CGH to be conserved in all E. coli O157:H7 strains tested. However, 12 ORFs within S-loop#72 were lineage I and lineage I/II-specific (Table 4). The nucleotide sequence [GenBank:EF112440] of the FRIK 920 amplification product obtained for this region had high similarity to the E. coli Sakai sequences, except that a 10.8 kb segment from the 3' end of ECs1377 to the 5' end of ECs1391 was missing. The missing region includes two putative transposases, ECs1380 and ECs1381, which were not identified by CGH. With the exception of ECs1382, which encodes a HecB-like protein, and ECs1388 (pchD), a PerC-homologue [29], all other lineage I and lineage I/II-specific ORFs in this region encode hypothetical proteins.
Figure 2. The distribution of divergent genes among 31 E. coli O157:H7 strains as determined in CGH experiments with MWG oligonucleotides. As in Figure 1 except that the genome map of E. coli O157:H7 strain Sakai is used and S-loops and specific ORFs of interest are shown.

Figure 1. The distribution of divergent genes among 31 E. coli O157:H7 strains as determined in CGH experiments with MWG oligonucleotides. The position of each ORF in the genome map of E. coli O157:H7 strain EDL933 is displayed on the X-axis and the percentage of E. coli O157
Table 4). The S-loop#78 divergent ORFs encoded hypothetical proteins of unknown function and putative bacteriophage-associated proteins. Another ORF present in this region, ECs1588 (pchE) [29], which encodes a PerC-homologue, was present in all lineage I and lineage I/II strains but absent from ten of the twelve lineage II strains tested. Repeated attempts to amplify the divergent region in S-loop#78 by long template PCR with FRIK 920 DNA were unsuccessful.

vi) S-loop#83/OI#55
A cluster of 15 lineage I and lineage I/II-specific ORFs, ECs1691-ECs1705, was detected across S-loop#83/OI#55 and its surrounding sequences (Table 4). Five ORFs within the boundaries of S-loop#83, ORFs ECs1693-ECs1697, are homologous to the prrA-modD-yc73-fepC gene cluster located on the pyelonephritis and cystitis pathogenicity island of uropathogenic E. coli CFT073, which was proposed to be involved in iron uptake in this strain [30]. The S-loop#83 ORFs ECs1698-ECs1699 located immediately downstream from this gene cluster are also conserved in E. coli CFT073 and encode putative transport proteins that may be involved in iron transport. Two lineage I- and lineage I/II-specific ORFs located upstream of S-loop#83 encoded hypothetical proteins, and five lineage I- and lineage I/II-specific ORFs located downstream of S-loop#83 encoded components of the E. coli phosphotransferase system (PTS), or PTS-dependent dihydroxyacetone kinase enzymes. These are ECs1701 and ECs1702, which together are homologous to the periplasmic trehalase treA in E. coli K12 [26], ECs1703, a putative PTS system enzyme I ycgC gene, and ECs1704 and ECs1705, which correspond to the putative dihydroxyacetone kinase genes dhaK1 and dhaK2.
The PCR fragment amplified with DNA from FRIK 920 showed that both the lineage I- and lineage I/II-specific ORFs identified by CGH as well as two putative transposases were missing. The FRIK 920 sequence [GenBank:EF112438] also showed that a portion of E. coli K12 DNA sequence (K12 coordinates 1250409-1253544) that was absent from E. coli Sakai was present in the chromosome of FRIK 920. The E. coli K12 sequence in this region contained portions of the b1201 and b1202 ORFs, which encode a PTS-dependent dihydroxyacetone kinase operon regulator, dhaR, and a protein of unknown function with both Pertactin adhesin and autotransporter domains, respectively.

viii) Sp10
The Sakai prophage Sp10 region, which is described as a hypervariable locus in EDL933, contains S-loops 91, 92, and 93. Sixteen ORFs within this prophage were observed to be lineage I-specific and one lineage-dominant (
Table 4). The ORFs ECs2989 and ECs2993 encode putative regulatory proteins, while the others encode hypothetical proteins of unknown function or prophage-related proteins. Repeated attempts to amplify this divergent region in S-loop#153 by PCR with FRIK 920 DNA were unsuccessful. However, nucleotide sequence analysis of DNA amplified from the region extending from the flanking region to within the stx1-converting prophage showed the integration site of the prophage to be the same in both the lineage II strain FRIK 920 and the lineage I strain Sakai (data not shown).

x) S-loop#286/OI#172
S-loop#286/OI#172 corresponds to a cryptic prophage-like element, SpLE5, in E. coli Sakai. Ten ORFs in this region in Sakai and EDL933 were present in all lineage I/II strains and all lineage I strains except those of PT 31 but not in lineage II strains (Table 4). The corresponding region [GenBank:EF112443] in E. coli FRIK 920 was homologous to E. coli Sakai DNA sequence but was missing the DNA segment corresponding to the SpLE5 element (from ECs5242 to ECs5252). The divergent SpLE5 ORFs include bacteriophage-associated genes and several genes encoding hypothetical proteins of unknown function. One of these hypothetical genes, ECs5250, was shown through transposon mutagenesis to be required for intestinal colonization in calves [34]. Another ORF, ECs5252, is a putative transcriptional regulator.

xi) KI#71, KI#121, and E. coli genomic backbone
Four ORFs in K-island (KI) #71 (b1142, b1147, b1148, and b1152) and four ORFs in KI#121 (b2360, b2361, b2362 and b2363) were identified as being present in all lineage II strains but not in lineage I strains. Of these, only two ORFs in KI#121 (b2360 and b2361) were also present in lineage I/II strains. All of these ORFs encode hypothetical proteins of unknown function (Table 4).
Two ORFs, b1201 and b1202, which were located on the conserved E. coli genomic backbone, were only found in E. coli K12 and E. coli O157:H7 lineage I/II and lineage II strains (Table 4). ORFs b1201 and b1202 encode a putative sensor-type regulator and a putative adhesion and penetration protein, respectively. Lineage I strain E. coli Sakai possesses truncated versions of these ORFs.
Two other ORFs found in the E. coli genomic backbone, b1519 and b1520, were lineage I and lineage I/II-specific. These ORFs encode a putative trans-aconitate methyltransferase enzyme and a hypothetical membrane protein of unknown function. DNA sequence analysis [GenBank:EF112442] demonstrated that these two ORFs were incomplete in lineage II strain FRIK 920.

Genomic characteristics of the lineage-specific segments
To visualize the distribution of the lineage-specific and lineage-dominant genome segments, the segments were mapped onto the O157:H7 strain Sakai genome using Microbial Genome Viewer [42] along with plots of Codon Adaptation Indices (CAI) and GC content. As shown in Figure 3, the lineage-specific segments (denoted by letters a-o) are distributed throughout several segments of the genome, with two main clusters (cluster c-j and cluster m-b). A correlation exists between the positioning of the lineage-specific segments and the origin and terminus of chromosome replication. Nearly half of the lineage I-specific segments (c-j) are concentrated in a 1.0 Mb segment of the genome that includes the replication terminus (at position 1.9 Mb) while several segments in the m-n region are likewise clustered near the replication origin, positioned at 4.8 Mb. However, there appears to be some bias in the distribution of the lineage-specific segments with respect to the two replichores, with segments n-h occurring within replichore 2 while only segments i-m are found in replichore 1. Replichore 1 is 290 kb longer than replichore 2 in the genome sequence of Sakai [9], which is a lineage I strain, and it is possible that symmetry is a driving force in shaping the genomes of the lineage II strains.

Genomotyping of E. coli O157:H7 strains
A dendrogram based on the Pearson correlation analysis of microarray data for all 31 E. coli O157:H7 strains is presented in Figure 4. Three distinct groups of strains, corresponding to lineage I, lineage I/II (LSPA 211111/PT2) and lineage II (LSPA types 222222, 222221, 222211, and 222212) strains are evident. The three PT31 strains within lineage I are grouped close to each other in the dendrogram. In contrast, PT14 and PT23 strains (the predominant strains in lineage I and lineage II groups, respectively) display variable genetic distances with respect to other strains of the same PT and other strains of different PTs within the same OBGS lineage.
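The similarity measure underlying such a dendrogram can be sketched in a few lines of Python. This is a minimal illustration of a Pearson correlation distance between strain profiles of log2 ratios; it is not the Acuity 3.1 implementation, whose linkage settings are not described here, and the function names and toy profiles are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def pearson_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance used for clustering: 0 for perfectly correlated profiles."""
    return 1.0 - pearson(x, y)


# Toy log2-ratio profiles over four probes (hypothetical values).
strain_1 = [0.1, 0.0, -2.0, -1.8]   # two probes strongly divergent
strain_2 = [0.2, -0.1, -1.9, -2.1]  # same two probes divergent
strain_3 = [0.0, 0.1, 0.1, -0.1]    # no probes divergent

print(pearson_distance(strain_1, strain_2))
print(pearson_distance(strain_1, strain_3))
```

Strains that share the same divergent probes have a small distance and would be joined early in a hierarchical clustering, which is how lineage-level groups emerge from the hybridization profiles.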

Discussion
To our knowledge this is the first time that CGH has been applied to such a diverse collection of E. coli O157:H7 strains. Results of this study suggest that CGH is a robust and discriminating method for comparing large numbers of E. coli O157:H7 strains. Twenty-one of the oligonucleotides with divergent signals for Sakai DNA formed part of the S-loop#108/OI#57 genomic island, which was confirmed to be missing from both of the Sakai strains in our collection (data not shown), suggesting that this element was lost during storage or subculture in the laboratory. There are a number of phage-related ORFs in S-loop#108 (Sp12), and the contribution of this gene cluster, if any, to the survival and/or virulence of E. coli O157:H7 strains in nature is unknown. The high level of agreement between the CGH hybridization signals observed and the expected results for ORFs that were reported to be present and those that were reported to be divergent suggests that CGH is a reliable method of determining genomic composition. Furthermore, all areas of lineage I-specific divergence were in clusters of two or more ORFs, so it seems unlikely that the divergent regions observed in this study are a result of printing or hybridization errors.
An additional line of evidence in support of the reliability of the CGH data was obtained from parallel selective subtractive hybridization (SSH) studies conducted by our research group [43]. CGH identified all of the nine lineage I-dominant chromosomal genomic regions identified by SSH as well as two additional ones. Finally, in several of the areas where lineage I-specific divergence was inferred using CGH, sequence differences between these two lineages were verified following amplification, cloning and sequencing of DNA from the lineage II E. coli strain FRIK 920. Taken together, these findings represent a rigorous and comprehensive validation of the CGH data presented in this study.
The CGH data presented in this study provide evidence of both regions of genomic stability and regions of genomic variability that exist within populations of E. coli O157:H7. A total of 4084 oligonucleotides hybridized with DNA from all of the 31 E. coli O157:H7 strains tested. The ORFs which hybridized with these oligonucleotide probes approximate the conserved portion of the E. coli O157:H7 genome and likely include the core genes required for survival of this bacterial pathogen [44,45]. Wick et al. identified 4230 conserved genes in E. coli O157:H7(:H-) and closely related E. coli O55:H7 strains using the same oligonucleotide set that was used in this study [13]; however, a very limited number of O157 and O157-related strains were examined in that study. In contrast, Dobrindt et al. identified 3100 core genes in the genomes of 26 E. coli strains of different serotypes associated with both intestinal and extra-intestinal illness [44]. It would seem reasonable to hypothesize that the larger the collection of strains from a specific E. coli serotype and the more diverse the collection of E. coli strains, the fewer core genes and the more VAP genes would be encountered.
A total of 1751 VAP ORFs were identified in E. coli O157:H7 strains in the present study. These ORFs existed as clusters across the E. coli Sakai and EDL933 chromosomes (Figures 1 and 2). The number of VAP ORFs identified in this study was significantly higher than that identified in the CGH study by Ogura et al. [46], which employed an E. coli Sakai-based microarray to examine genomic variability within eight human E. coli O157:H7 isolates. The higher number of VAP ORFs observed in this study likely reflects the higher number of strains tested, the inclusion of both human and bovine-derived strains, and the fact that the MWG microarray used in this study is based on ORFs from E. coli O157:H7 Sakai and EDL933 strains and an E. coli K12 strain rather than just E. coli Sakai.
In addition to genomotyping an extensive collection of E. coli O157:H7 strains, we used CGH to compare the genomic profiles of E. coli O157:H7 lineage I, lineage I/II and lineage II strains. These observations extend well beyond those obtained from previous E. coli K12 and O157:H7 microarray studies [13,44,45,46] and other genome analysis techniques such as WGPCR Scanning [47]. CGH shows promise as a method that can not only verify the existence of the lineages and categorize strains into subtypes, but also provide insight into specific genetic differences that could be related to differences in their ecology and evolution. The 132 VAP ORFs that were differentially distributed between the two OBGS lineages were of particular interest in this study, since these ORFs likely confer on the strains that possess them the unique ecological or virulence characteristics associated with these lineages.
These 132 lineage-divergent VAP ORFs were clustered within several chromosomal regions, many of which were associated with E. coli O157:H7 Sakai and EDL933 S-loops and OIs.
Prophages were originally suspected as either causing or at least correlating with divergence of the genomes in lineage I and lineage II strains [19], and other studies have shown that prophage regions of the genome are highly polymorphic in different O157:H7 strains [47][48][49][50][51]. It is therefore not surprising that many of the VAPs occur within prophage segments, because these elements have significant homology to other prophages in the genome, making them hotspots for recombination. Moreover, these regions of the genome also are topologically distinct. The results of distribution mapping of the lineage-specific VAPs (Figure 2) also imply bias, suggesting that genome symmetry may be superimposed onto homology and topology as driving forces shaping the genome during divergence of the two lineages. When combined with the fact that some of the genes present in the affected prophage or pathogenicity islands could influence expression of virulence genes (e.g. the perC-like genes ECs1388 and ECs1588), it seems reasonable to conclude that multiple evolutionary forces may be working on these genome segments.
For many of these divergent regions, a bias in PT was also evident. Divergent ORFs identified in S-loops 69, 78, and 153 and in Sp10 (S-loops 91, 92 and 93) were absent in all lineage II strains. Similarly, lineage I PT31 strains and lineage II strains also lacked the same ORFs in S-loop#286. Lineage I/II strains were all of the same LSPA type, 211111, and of PT 2 and formed a distinct cluster in a dendrogram based on the Pearson correlation analysis of microarray data for all 31 E. coli O157:H7 strains (Figure 4). PT23 was the most common PT of lineage II strains included in this study. All PT23 strains from Canada grouped together within the lineage II cluster on this dendrogram, despite different isolation dates, host origins, and geographic origin within Canada. In contrast to the apparent clonality of PT2, PT31, and PT23 strains, strains of PT14, the predominant lineage I PT, were highly divergent on this dendrogram and some PT14 strains appeared to be closely related to other PTs, such as PT87, PT21, and PT31. These results suggest that there is a relationship between genomotype and lineage and between genomotype and some PTs for E. coli O157:H7 strains; however, a larger number of strains from different geographical regions needs to be examined to verify these relationships.
The lineage, LSPA type and PT-specific and lineage, LSPA type or PT-dominant VAP ORFs identified in this study were associated with several known or suspected virulence genes, including regions of divergence within S-loop 69, which encodes Stx2. A number of studies have shown divergence in the structure of prophages encoding Stx genes in E. coli O157:H7, their insertion sites within the genome, and the type and level of toxin produced [22,47,50,52,53]. In addition, the absence of the non-LEE encoded effector genes nleH1-2 and nleF and the perC homologue gene pchD in lineage II strains is likely to affect their phenotype. However, further study is required to determine the precise role of these differences in the genomes of E. coli O157:H7 lineage I and lineage II strains in their ecology and virulence.

Conclusion
CGH was used to evaluate genomic variability within a collection of 31 E. coli O157:H7 strains. A total of 4084 ORFs were detected in all of the strains, suggesting that they represent core genes conserved in all E. coli O157:H7 strains. Among the 1751 VAP ORFs were clusters of ORFs associated with bacteriophages and other genetically mobile elements. Several clusters of ORFs were lineage-specific or dominant. A number of the genes within lineage-specific or dominant ORF clusters have been associated with virulence in E. coli O157:H7. Nucleotide sequencing of lineage-specific or dominant regions in the lineage II E. coli FRIK 920 confirmed that the alterations in the genome detected by CGH are the results of genomic deletions or insertions. The genomic composition of these strains within lineages and subgroups suggests that both genomic divergence from an ancestral form and the lateral transfer of gene clusters have contributed to their evolution.

Bacterial strains and preparation of genomic DNA
The 15 lineage I (including EDL933 and Sakai strains for which the genomic sequence has been determined) (LSPA 111111), four PT 2 (LSPA 211111) strains and 12 lineage II (LSPA 222222) E. coli O157:H7 strains included in this study were obtained from a variety of human and bovine sources (Table 2). OBGS types of all 31 strains were determined as previously described [19]. Phage typing of E. coli O157:H7 isolates was carried out as described previously [54]. The E. coli strains were grown overnight in 45 mL of Brain-Heart-Infusion (BHI) broth. The cultures were centrifuged at 8000 rpm for 10 minutes and the pellet was
Figure 4. The dendrogram generated by CGH analysis of E. coli O157:H7 strains. Data generated from two arrays for each strain have been averaged and analyzed with Acuity 3.1. The 6057 locus tags of open reading frames of three E. coli strains (EDL933, Sakai, and K12 MG1655) in the image were organized alphabetically. The dendrogram was generated by hierarchical cluster analysis (Pearson correlation) of log2 (RAT2N) of each strain with Acuity 3.1. In the colour scheme at the bottom of the figure, the brightest green corresponds to spots that are absent (divergent) with high certainty, the brightest red indicates spots that are present (conserved) with the greatest certainty, and black indicates spots that are uncertain or slightly divergent.
A 70 μl hybridization solution consisting of 30% formamide, 5× SSC, 0.1% SDS, 0.1 mg/ml sonicated salmon sperm DNA, and equal amounts of test and reference DNA, each containing at least 30 pmol of incorporated dye, was denatured at 95°C for 5 min and briefly centrifuged to collect the contents. DNAs were hybridized to the array overnight (~16 hours) at 42°C and washed according to the modified Corning method (Corning). Arrays were scanned with a GenePix 4000B scanner (Axon Instruments, Redwood City, CA) and processed using GenePix Pro 5.0. Two slides were hybridized per strain with two dye-swap repeats per slide.

Microarray data analysis
Microarray data were normalized using the ratio-based and Lowess methods in Acuity 3.1 (Axon Instruments) before analysis. The normalized data for all 31 strains were converted to log2(Fluor 647/Fluor 546) ratios in Acuity 3.1 and subsequently analyzed in Microsoft Excel. Control, blank, and test spots with a mean intensity below the mean of all negative controls were removed from the analysis. The arithmetic mean of the remaining spots across the four replicates was taken to construct the dataset. Acuity 3.1 was then used to construct a hierarchical dendrogram based on centred Pearson correlation, and GACK (Genomotyping Analysis by Charles Kim) [55] was used to generate a cut-off value for identifying the presence or absence of genes.
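The filtering, averaging and correlation steps described above can be illustrated with a minimal Python sketch. It is not the Acuity 3.1 or Excel workflow used in the study; the function names, the (mean intensity, log2 ratio) spot representation and the example values are assumptions made for illustration only.

```python
import math
from statistics import mean


def average_replicates(spots, negative_control_mean):
    """Average the log2 ratios of one ORF across its replicate spots.

    spots: list of (mean_intensity, log2_ratio) tuples (hypothetical layout).
    Spots whose mean intensity is below the negative-control mean are dropped.
    Returns None when no spot passes the filter.
    """
    kept = [ratio for intensity, ratio in spots
            if intensity >= negative_control_mean]
    return mean(kept) if kept else None


def pearson(x, y):
    """Centred Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Distance commonly used for Pearson-based hierarchical clustering."""
    return 1.0 - pearson(x, y)
```

In this sketch a strain profile is simply the list of averaged log2 ratios in a fixed ORF order, and `correlation_distance` would be the pairwise measure passed to a hierarchical clustering routine.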

Identification of lineage- and phage type-dominant ORF clusters
The presence or absence of each ORF was compared across all 15 lineage I, four lineage I/II and 12 lineage II E. coli O157:H7 strains included in the study, and ORFs that were variably absent or present (VAP) within this strain collection were identified. The proportion of strains of each OBGS lineage and phage type (PT) containing each VAP ORF was determined. ORFs that were present in 100% of strains in a lineage or PT and absent in all strains of the other lineage or PTs were designated as lineage- or PT-specific. ORFs that were present in greater than 80% of strains in a lineage or PT and absent in greater than 92% of strains of the other lineage or PTs were defined as lineage- or PT-dominant.
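The specific/dominant definitions above amount to a simple rule on presence fractions, sketched below in Python. The function name and the 0/1 call lists are hypothetical. The strict "greater than" comparisons follow the wording of this section, whereas the abstract states the thresholds as ≥, so the comparison operator should be treated as an assumption.

```python
def classify_orf(target_calls, other_calls,
                 min_present=0.80, min_absent=0.92):
    """Classify one ORF for a target lineage (or PT) against a comparison group.

    target_calls / other_calls: lists of 0/1 presence calls, one per strain.
    Returns "specific", "dominant" or "neither".
    """
    if not target_calls or not other_calls:
        raise ValueError("both groups need at least one strain")
    n_present = sum(target_calls)                   # present in target group
    n_absent = len(other_calls) - sum(other_calls)  # absent in other group
    # Specific: present in every target strain, absent from every other strain.
    if n_present == len(target_calls) and n_absent == len(other_calls):
        return "specific"
    # Dominant: thresholds applied as strict inequalities (assumption).
    frac_present = n_present / len(target_calls)
    frac_absent = n_absent / len(other_calls)
    if frac_present > min_present and frac_absent > min_absent:
        return "dominant"
    return "neither"
```

For example, an ORF present in 13 of 15 lineage I strains and absent from all 12 lineage II strains would be called dominant under this rule, while one present in exactly 12 of 15 (80%) would not, because the comparison is strict.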
For visualization of the VAP distribution, each locus tag was given a binary score (1 for presence and 0 for absence) based on GACK analysis. A single composite table was generated for all strains and sorted in Microsoft Excel to identify lineage-specific polymorphisms. This yielded a set of 132 loci showing a lineage-specific distribution. The binary distribution was then represented in RGB format (1 = green, 0 = red) and converted to a single text file, which was uploaded to and displayed in the Microbial Genome Viewer [42].
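A minimal sketch of the binary-to-RGB conversion follows. The tab-separated output layout and the locus tags in the usage note are illustrative assumptions; the actual input format expected by the Microbial Genome Viewer is not specified here.

```python
GREEN = (0, 255, 0)  # score 1: ORF present
RED = (255, 0, 0)    # score 0: ORF absent


def binary_scores(calls):
    """Map presence/absence calls (truthy/falsy) to 1/0 per locus tag."""
    return {locus: 1 if present else 0 for locus, present in calls.items()}


def to_rgb_lines(scores):
    """Render one 'locus<TAB>R,G,B' line per locus tag, sorted by tag.

    The line layout is a hypothetical text format, not the viewer's own.
    """
    lines = []
    for locus in sorted(scores):
        r, g, b = GREEN if scores[locus] == 1 else RED
        lines.append(f"{locus}\t{r},{g},{b}")
    return lines
```

Calling `to_rgb_lines(binary_scores({"Z0001": True, "ECs0002": False}))` gives one line per locus, which can be joined with newlines and written to a single text file.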
DNA sequences from the lineage-conserved regions flanking lineage- and PT-dominant regions were used to design primers for PCR amplification, cloning and DNA sequence analysis of these divergent regions (Table 5). DNA from the lineage II strain E. coli FRIK 920 was used to amplify divergent regions corresponding to E. coli Sakai S-loops 14, 16, 69, 72, 78, 83, 153, and 286, the backbone region b1519-b1520, and SP10. These divergent regions were amplified by PCR using AmpliTaq® polymerase (Applied Biosystems, Foster City, CA) or by long-template PCR using the Advantage™ 2 PCR Kit (BD Biosciences Clontech, Palo Alto, CA) and then cloned into the pGEM-T Easy vector (Promega, Madison, WI) or the pCR 2.1 TOPO TA vector (Invitrogen, Carlsbad, CA), or with the Expand cloning kit (Roche Applied Science). The primers and annealing temperatures used in these reactions are summarized in Table 5. The cloned sequences were analyzed on an ABI