Genome sequence analysis of Helicobacter pylori strains associated with gastric ulceration and gastric cancer

Background: Persistent colonization of the human stomach by Helicobacter pylori is associated with asymptomatic gastric inflammation (gastritis) and an increased risk of duodenal ulceration, gastric ulceration, and non-cardia gastric cancer. In previous studies, the genome sequences of H. pylori strains from patients with gastritis or duodenal ulcer disease have been analyzed. In this study, we analyzed the genome sequences of an H. pylori strain (98-10) isolated from a patient with gastric cancer and an H. pylori strain (B128) isolated from a patient with gastric ulcer disease.
Results: Based on multilocus sequence typing, strain 98-10 was most closely related to H. pylori strains of East Asian origin and strain B128 was most closely related to strains of European origin. Strain 98-10 contained multiple features characteristic of East Asian strains, including a type s1c vacA allele and a cagA allele encoding an EPIYA-D tyrosine phosphorylation motif. A core genome of 1237 genes was present in all five strains for which genome sequences were available. Among the 1237 core genes, a subset of alleles was highly divergent in the East Asian strain 98-10, encoding proteins that exhibited <90% amino acid sequence identity compared to corresponding proteins in the other four strains. Unique strain-specific genes were identified in each of the newly sequenced strains, and a set of strain-specific genes was shared among H. pylori strains associated with gastric cancer or premalignant gastric lesions.
Conclusion: These data provide insight into the diversity that exists among H. pylori strains from diverse clinical and geographic origins. Highly divergent alleles and strain-specific genes identified in this study may represent useful biomarkers for analyzing geographic partitioning of H. pylori and for identifying strains capable of inducing malignant or premalignant gastric lesions.


Background
Helicobacter pylori is a Gram-negative spiral-shaped bacterium that persistently colonizes the human stomach [1].
Persistent H. pylori colonization of the human stomach is a risk factor for several diseases, including non-cardia gastric adenocarcinoma, gastric lymphoma, and peptic ulceration [1,2]. The incidence of these diseases varies considerably throughout the world. For example, the incidence of gastric adenocarcinoma is substantially higher in East Asia, Central America, and South America than in most other parts of the world [3].
H. pylori isolates from unrelated humans exhibit a high level of genetic diversity [4,5]. Genetic variation is readily detectable by analyzing the nucleotide sequences of individual genes in different H. pylori strains [6]. H. pylori allelic diversity is probably the consequence of multiple factors, including a high rate of mutation, a high rate of intraspecies genetic recombination, and a long evolutionary history of the species [4,7]. Corresponding alleles in different H. pylori strains typically are 92 to 99% identical in nucleotide sequences [4,6], but several H. pylori genes exhibit a much higher level of genetic diversity [8,9].
Further analyses have shown that there is geographic variation among H. pylori strains [10-16]. Based on multilocus sequence analysis of a panel of 370 H. pylori strains isolated from humans in different parts of the world, seven populations of strains with distinct geographic distributions have been identified [17]. These H. pylori populations reflect the migration of humans from Africa to other parts of the world over a time period estimated to be approximately 58,000 years [12]. Geographic differences among H. pylori strains could potentially be a factor that helps to explain the varying incidence of H. pylori-associated diseases in various parts of the world.
In addition to variation among H. pylori strains in the sequences of individual genes, there is considerable variation among strains in gene content. One study analyzed genomic DNA from 56 different H. pylori strains using array hybridization methods and identified 1150 genes that were present in all of the strains tested (thus representing a "core" genome) [18]. Among 1531 genes analyzed, 25% were absent from at least one of the 56 H. pylori strains. It was predicted that the H. pylori core genome would consist of 1,111 genes if a much larger set of isolates were tested [18]. Other studies have reported the existence of core genomes comprising 1091 or 1281 genes, based on DNA array analysis of 34 or 15 H. pylori strains, respectively [19,20]. One study reported that the phylogeny of H. pylori strains based on MLST analysis was substantially different from the phylogeny of H. pylori strains based on analysis of gene content [18].
One of the most striking differences in gene content among H. pylori strains is the presence or absence of a 40-kb region of chromosomal DNA known as the cag pathogenicity island (PAI) [8,21-24]. In the United States and Europe, about 50-60% of H. pylori strains contain the cag PAI and the remaining strains lack this region of the chromosome [8,21-24]. In many other parts of the world, including East Asia, nearly all H. pylori strains contain the cag PAI [15,25,26]. The H. pylori cag PAI encodes an effector protein, CagA, and a type IV secretion apparatus that translocates CagA into gastric epithelial cells [27]. H. pylori strains harboring the cag PAI are associated with an increased risk of non-cardia gastric cancer or peptic ulcer disease compared to strains that lack the cag PAI [21,28]. The correlation between these diseases and presence of the cag PAI provides an example of how the clinical outcome of H. pylori infection is determined in part by genetic characteristics of the strains with which a person is infected.
In previous studies, the complete genomes of three H. pylori strains have been analyzed [29-31]. These three H. pylori strains were isolated from patients who had gastritis, atrophic gastritis, or duodenal ulcer disease. In the current study, we sought to analyze genetic features of H. pylori strains isolated from patients with two different H. pylori-associated diseases: gastric ulcer and gastric cancer. For this analysis, we selected a gastric ulcer strain (B128) that readily colonizes the stomachs of mice and Mongolian gerbils. This strain is of particular interest because an animal-passaged derivative of strain B128 (strain 7.13) causes gastric cancer in a Mongolian gerbil model [32,33]. For an analysis of a gastric cancer-associated H. pylori strain, we selected strain 98-10, which was isolated from a gastric cancer patient in Japan [34], a country with a very high incidence of gastric cancer [3,35].

General features of H. pylori genomes
Prior to the current study, the complete genome sequences of H. pylori strains isolated from patients with superficial gastritis, atrophic gastritis, or duodenal ulcer disease had been reported [29-31]. In the current study, we analyzed the genome sequences of an H. pylori strain (98-10) that was isolated from a patient with gastric cancer [34] and a strain (B128) that was isolated from a patient with gastric ulcer disease [32]. General features of the two genomes analyzed in the current study in comparison to three previously sequenced genomes are summarized in Table 1.

MLST analysis of H. pylori strains
In previous studies, MLST analysis has been used to classify H. pylori isolates into several haplogroups that have distinct geographic distributions [17]. To assign the two newly sequenced H. pylori strains to one of the previously described population clusters, we compared eight gene sequences from each strain to the corresponding sequences of 434 other H. pylori isolates, using an MLST database as described in the Methods. Based on this analysis, strain 98-10 was classified as a member of the East Asian population cluster and strain B128 was classified as a member of the European population cluster. A neighbor-joining tree depicting relationships of the two newly sequenced strains to representative reference strains isolated from diverse geographic locations is shown in Figure 1. The clustering depicted on this neighbor-joining tree accurately reflects the geographic origins of the reference strains, and is in agreement with previous assignments of the reference strains to distinct population groups [18]. In agreement with an earlier report [17], one of the previously sequenced H. pylori strains (J99) was most closely related to strains isolated in West Africa, and another (26695) was most closely related to strains isolated in Europe. A third H. pylori strain (HPAG1) analyzed in a prior study was closely related to strains isolated in Europe. Figure 1 illustrates that strain 98-10 is most closely related to strains of East Asian origin, and therefore, strain 98-10 belongs to a population cluster different from those of strains for which genome sequences were previously reported. Collectively, the genome sequences available for analysis represent three main geographic populations of H. pylori strains [European (26695, HPAG1, and B128), West African (J99), and East Asian (98-10)].
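The Figure 1 legend states that tree distances were estimated under the Kimura 2-parameter model. The following is a minimal sketch of that distance for one pair of aligned sequences, not the MEGA4 implementation. The function name is hypothetical, and gaps are skipped per pair here, whereas the study removed gapped positions from the whole dataset before analysis.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: skip any site with a gap or ambiguous base in either sequence.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A <-> G and C <-> T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitional differences
    q = transversions / compared  # proportion of transversional differences
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such pairwise distances over the concatenated MLST loci is the kind of input a neighbor-joining method takes.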

Analysis of cagA and vacA
CagA and VacA are two important H. pylori virulence factors that are secreted by a type IV secretion pathway and a type V (autotransporter) secretion pathway, respectively [14,38]. Diversity in cagA and vacA genes has been investigated in detail in previous studies, and diversity in these genes provides a basis for typing H. pylori strains [8,13-15]. Therefore, we analyzed the cagA and vacA genes in each of the two newly sequenced strains.
When strain 98-10 was incubated with AGS gastric epithelial cells as described previously [39], CagA underwent tyrosine phosphorylation (data not shown), which indicates that this strain has a functional type IV secretion system for translocation of CagA into host cells [27]. The CagA protein encoded by strain 98-10 contains 3 EPIYA motifs (sites of tyrosine phosphorylation), which have been designated EPIYA-A, EPIYA-B, and EPIYA-D [14]. The presence of an EPIYA-D motif is characteristic of H. pylori strains isolated in East Asia [13,14]. Broth culture supernatant from strain 98-10 caused vacuolation of HeLa cells, indicating the presence of an active VacA toxin. This strain contains a type s1c/m1 vacA allele, a feature that is characteristic of H. pylori strains isolated in East Asia.
Similar to strain 98-10, strain B128 has a functional type IV secretion system that can translocate CagA into gastric epithelial cells, and CagA subsequently undergoes tyrosine phosphorylation [41]. The CagA protein encoded by strain B128 contains two EPIYA motifs, designated EPIYA-A and EPIYA-C [14]. Strain B128 contains a type s1/m2 vacA allele, but a vacA mutation in this strain is predicted to prevent expression of a full-length VacA protein. The presence of the latter mutation was confirmed by nucleotide sequence analysis of a vacA fragment amplified by PCR. Immunoblot analysis using multiple anti-VacA antisera indicated that this strain did not produce a detectable VacA protein, and broth culture supernatant from this strain did not cause vacuolation of HeLa cells (data not shown).
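Counting EPIYA motifs in a deduced CagA sequence can be sketched as a plain pattern search. The function name and the toy sequence in the usage line are hypothetical. Assigning each site to the A, B, C, or D class depends on the flanking residues and is not attempted here.

```python
import re


def find_epiya_sites(protein):
    """Return the 0-based start position of every EPIYA motif in a protein sequence."""
    return [match.start() for match in re.finditer("EPIYA", protein.upper())]


# Toy sequence for illustration only; it is not a real CagA fragment.
sites = find_epiya_sites("MKEPIYAQQEPIYATT")
print(len(sites), sites)
```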

Characterization of the H. pylori core genome
Delineation of an H. pylori core genome (i.e., genes that are consistently present in all H. pylori isolates) is of interest, because many such genes are likely to be required for colonization of the human stomach. Based on the use of BLAST score ratio analysis as described in the Methods, we identified 1237 genes that were present in all 5 H. pylori genomes (Figure 2 and Additional file 1). In a previous study, 56 different H. pylori strains were analyzed by array methodology, and a core genome of 1150 genes was reported to be present in all 56 strains [18]. Among the 1150 genes reported to comprise the H. pylori core genome based on array analysis, 1094 were present in all 5 strains analyzed in the current study, as determined by sequence analysis. The list of core genes detected in all five strains by sequence analysis but not by array analysis includes >20 genes located within the cag PAI. Although the cag PAI is present in all 5 strains analyzed in the current study, this region of DNA is known to be absent from many H. pylori strains [24]. Five other clusters of contiguous genes (each with at least 4 genes per cluster) were present in all 5 sequenced strains, but were absent from the list of core genes identified by array analysis (HP0061-0065, HP0797-0800, HP1339-1343, HP1400-1403, and HP1455-1458) (Additional file 1). The differences in designation of core genes in the current study compared to previous studies can be attributed to numerous factors, including differences in the number of strains analyzed and differences in methodology for gene detection.
Figure 1. Phylogenetic structure based on sequence analysis of 8 H. pylori core genes. H. pylori strains analyzed in this figure include strains 98-10, B128, three strains for which genome sequences were previously determined (26695, J99, HPAG1), and representative strains isolated from patients in diverse geographic locations [18]. The figure lists the strain designations and the countries where strains were isolated. The nucleotide sequences of the concatenated MLST loci were aligned and compared, as described in Methods. All positions containing gaps and missing data were eliminated from the dataset. There were a total of 3041 positions in the final dataset. Neighbor-joining trees were constructed based on distances estimated by the Kimura 2-parameter model of nucleotide substitution [57,58]. The bootstrap consensus tree inferred from 1000 replicates is taken to represent the evolutionary history of the strains analyzed [59]. Branches corresponding to partitions reproduced in fewer than 50% bootstrap replicates are collapsed. The tree is drawn to scale, with the branch lengths in the same units as those of the evolutionary distances used to infer the phylogenetic tree. Phylogenetic analyses were conducted in MEGA4 [63]. Five H. pylori strains for which genome sequences were available are denoted by diamonds. Three main H. pylori population groups (East Asian, European, and West African) are identifiable.
Figure 2. Comparison of predicted proteomes by BLAST-score ratio (BSR) analysis.
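A BLAST score ratio normalizes the bit score of a query protein against its best hit in another proteome by the bit score of the query against itself. The sketch below shows how a core gene set could be derived from such ratios. The function names, input layout, and the 0.4 cutoff are all assumptions for illustration; the cutoff actually used is specified in the Methods and is not reproduced here.

```python
def blast_score_ratio(hit_score, self_score):
    """Ratio of the best-hit bit score in a target proteome to the self-hit bit score."""
    if self_score <= 0:
        raise ValueError("self-hit bit score must be positive")
    return hit_score / self_score


def core_genes(self_scores, hit_scores, genomes, cutoff=0.4):
    """Return reference genes whose ratio meets the cutoff in every genome.

    self_scores: {gene: bit score of the gene product against itself}
    hit_scores:  {gene: {genome: best-hit bit score in that genome}}
    cutoff:      assumed illustrative threshold, not taken from the study
    """
    core = []
    for gene, self_score in self_scores.items():
        per_genome = hit_scores.get(gene, {})
        # A genome with no hit at all contributes a score of 0.
        ratios = [blast_score_ratio(per_genome.get(g, 0.0), self_score) for g in genomes]
        if all(r >= cutoff for r in ratios):
            core.append(gene)
    return sorted(core)
```

Inverting the same test (a ratio below the cutoff in every other genome) would give a simple definition of a strain-specific gene.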
An analysis of the 1237 core genes indicated that, in almost all cases, there were differences in the amino acid sequences of the proteins encoded by individual strains. Pairwise comparisons of proteins encoded by different strains indicated that the levels of relatedness ranged from 65% to 100% amino acid identity. A representative comparison of the core proteins encoded by two strains (98-10 and 26695) is shown in Figure 3. Only 11 genes were identified for which the amino acid sequences of encoded proteins were identical among all 5 strains. Seven of these 11 genes encoded ribosomal proteins; others encoded a translation initiation factor (IF-1), a lipoprotein (Lpp20), a flagellar basal body protein (FliE), and a protein of unknown function (HP0031).

Analysis of divergent genes in an East Asian cancer-associated H. pylori strain
H. pylori strains isolated from unrelated humans exhibit allelic diversity (typically 92-99% nucleotide identity among corresponding alleles), which provides a basis for classification of strains into population clusters via MLST analysis. Several genes exhibit a substantially higher level of allelic diversity. For example, at least two genes (cagA and a sel1 homologue) are known to be markedly divergent in East Asian H. pylori strains compared to Western H. pylori strains [13,14,42]. We hypothesized that additional genes might be highly divergent in the East Asian strain 98-10 compared to the other 4 sequenced strains. To identify gene products encoded by the genome of 98-10 that are markedly divergent compared to products encoded by the other 4 genomes, we focused on analysis of the 1237 core genes that were present in all 5 sequenced strains. By using the approach described in Methods, we identified 8 gene products that were highly divergent in the East Asian strain compared to the other four strains (Table 2). These include CagA and a sel1 homologue, which were previously reported to be markedly divergent in East Asian strains compared to strains from other parts of the world [13,42]. The amino acid sequences of these divergent proteins encoded by the Japanese strain 98-10 were each <90% identical to sequences of corresponding proteins from the other four strains (Table 2). In each case, the divergent alleles in strain 98-10 and corresponding alleles in the other four strains were flanked by the same chromosomal genes.
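The selection of alleles that are divergent in one strain only can be sketched from a table of pairwise percent identities. This is an illustrative reading of the <90% criterion, not the procedure from the Methods. The function names are hypothetical, and the requirement that the remaining strains be at least 90% identical to one another is an added assumption.

```python
from statistics import mean


def summarize_gene(pair_identity, focal):
    """Return (mean identity of focal vs. others, mean identity among the others).

    pair_identity: {frozenset({strain_a, strain_b}): percent amino acid identity}
    """
    focal_vals = [v for pair, v in pair_identity.items() if focal in pair]
    other_vals = [v for pair, v in pair_identity.items() if focal not in pair]
    return mean(focal_vals), mean(other_vals)


def is_divergent_in_focal(pair_identity, focal, cutoff=90.0):
    """True if every focal comparison is below the cutoff and no other comparison is."""
    focal_vals = [v for pair, v in pair_identity.items() if focal in pair]
    other_vals = [v for pair, v in pair_identity.items() if focal not in pair]
    # Assumption: the non-focal strains must remain similar to one another.
    return all(v < cutoff for v in focal_vals) and all(v >= cutoff for v in other_vals)
```

Running the same test with each strain in turn as the focal strain corresponds to the per-strain searches reported for 98-10 and J99.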
As shown in Figure 1, strain J99 was most closely related to H. pylori strains isolated in West Africa, a population cluster different from those of the other strains for which genome sequences were available. Therefore, we hypothesized that specific genes might be highly divergent in the West African strain J99 compared to the other 4 sequenced strains. To identify such genes, we used the same approach as described above. Four unique highly divergent alleles were identified in strain J99 (Table 3), each encoding products that were <90% identical to corresponding proteins in the other four strains. Unique highly divergent alleles were not readily identifiable in strains 26695, HPAG1, or B128. A notable exception was the identification of a highly divergent vacA allele in strain B128 (gene HPB128_147g10). Identification of vacA as a divergent allele in strain B128 is attributable to the presence of an s1/m2 vacA allele in this strain and the presence of s1/m1 alleles in the four other strains; m1 and m2 forms of VacA typically exhibit only 60-70% amino acid identity within the mid-region of the protein [38].
Figure 3. Relatedness of core proteins predicted to be encoded by H. pylori strains 98-10 and 26695. A set of 1237 genes present in all 5 H. pylori strains was identified, as described in the Methods. The deduced amino acid sequences of the corresponding proteins encoded by strain 98-10 were used to search a database of sequences from strain 26695 using FastA. The best match was identified, and the percent amino acid identity was calculated. The histogram shows the number of ORFs exhibiting the indicated level of amino acid identity.

Identification of novel strain-specific genes
To identify strain-specific genes present in one of the two newly sequenced genomes but absent from the previously sequenced H. pylori genomes, we again used a BLAST score ratio analysis, as described in the Methods (Figure 2). Strain 98-10 contained 22 novel strain-specific genes and strain B128 contained 51 (Additional files 2 and 3). In addition, we identified 16 genes that were present in both strain 98-10 and B128, but not present in any of the previously sequenced strains (Additional file 4). Several of the strain-specific ORFs in H. pylori strains 98-10 and B128 were <100 nucleotides in length, and it is uncertain whether these very short ORFs are actually translated into proteins. An analysis of unique strain-specific genes in the three previously sequenced H. pylori genomes (26695, J99, and HPAG1) revealed a similar number of unique strain-specific genes (Table 1), which have been described in previous studies [29-31].
To identify potential functions of the strain-specific genes found solely in strain 98-10 or B128 (or both 98-10 and B128), the deduced protein sequences were used as queries for BLAST searching of an NCBI database of nonredundant protein sequences (Table 4 and Additional files 2, 3, 4). Most of the strain-specific proteins found solely in strain 98-10 or B128 were not closely related to any known proteins or were related to proteins in the database for which the functions are not known. Several of the strain-specific genes found exclusively in strain 98-10 or B128 have been previously detected in strains of H. pylori for which the genome sequences have not been determined.
Table 2 footnotes:
a The sequences of the indicated gene products in strain 98-10 were compared with corresponding sequences in each of the other 4 strains (26695, J99, HPAG1 and B128), and mean % amino acid identities were calculated as described in Methods.
b The sequences of the indicated gene products in each strain were compared in all permutations, except that comparisons involving strain 98-10 were excluded from analysis. Mean % amino acid identities were calculated as described in Methods.
c Percentage of aligned sites in which the protein from strain 98-10 contained an amino acid different from the corresponding amino acids in proteins from 4 other strains.
d Reported to be a constituent of the H. pylori core genome, based on at least one array analysis [18-20].
Subsets of the strain-specific genes found exclusively in strain 98-10 or B128 were found in contiguous chromosomal loci (Table 4). Two such gene clusters were identified in strain 98-10 and 11 were identified in strain B128. These gene clusters ranged from two to nine genes in length. Most of the gene clusters encode proteins of unknown function, but as noted above, one cluster encoded transposases and one cluster encoded two genes with homology to type IV secretion system components.
The % G+C contents of three gene clusters in strain B128 (containing ORFs HPB128_65g16, HPB128_65g17, HPB128_156g11, HPB128_156g12, HPB128_192g1, HPB128_192g2, and HPB128_192g3, encoding proteins of unknown function) were each <30%, a value substantially lower than the total % G+C content of strain B128 (38.8%) and lower than the % G+C content of previously analyzed H. pylori strains (39%) [29,30]. The low % G+C content of these gene clusters suggests that these segments of DNA may have been acquired via horizontal transfer events.
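The % G+C comparison above is simple to reproduce for any sequence. In this minimal sketch the function names are hypothetical, and the 30% threshold is taken from the text. Ambiguous bases are ignored, which is an assumption about how such sites should be handled.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of a nucleotide sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def is_low_gc(seq, threshold=30.0):
    """Flag a segment whose % G+C falls below the threshold (30% in the text)."""
    return gc_percent(seq) < threshold
```

A cluster flagged this way would then be compared with the genome-wide value (38.8% for strain B128) before horizontal acquisition is suggested.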

Strain-specific genes present in strains associated with gastric cancer or premalignant gastric lesions
Atrophic gastritis is a premalignant lesion [3], and H. pylori-infected patients with gastric ulcer disease have an increased risk of gastric cancer compared to H. pylori-infected patients with duodenal ulcer disease [46,47]. Strain 98-10 was isolated from a patient with gastric cancer, and strains HPAG1 and B128 were isolated from patients with atrophic gastritis and gastric ulcer disease, respectively. Therefore, we sought to identify strain-specific genes present in these three strains, but absent from the other two strains for which genome sequences were available (strains 26695 and J99, isolated from patients with superficial gastritis and duodenal ulcer disease, respectively). Ten strain-specific genes were found in the former 3 strains but were absent from strains 26695 and J99 (Table 5 and Additional file 5). These included 5 genes encoding restriction-modification systems and a gene predicted to encode an outer membrane protein; a previous study reported marked strain-specific variation in the size and sequence of this outer membrane protein [48]. We also performed a similar analysis to identify strain-specific genes present in various pairs of strains associated with malignant or premalignant conditions, but absent from strains 26695 and J99 (Table 5 and Additional files 4, 6, 7). The most commonly identified genes in these groups encoded restriction-modification systems or hypothetical proteins (Table 5 and Additional files 4, 6, 7). One of the genes in strains B128 and HPAG1 (hrgA) is a restriction endonuclease-replacing gene that was previously reported to be more prevalent among strains from Asian gastric cancer patients than among strains from non-cancer patients [49]. Three strain-specific genes found in strain B128 and strain HPAG1 (HPB128_146g1, HPB128_146g2, and HPB128_146g3) were previously reported to be localized on a plasmid in strain HPAG1 [31]. Another gene found exclusively in strain B128 and HPAG1 (HPB128_141g11) is found at the 3' end of the cag PAI in some H. pylori strains, and encodes a protein of unknown function (designated HP0521B) [50]. Strain-specific genes shared by 26695 and J99 (associated with superficial gastritis and duodenal ulcer disease), but not present in the three strains from patients with gastric cancer or premalignant gastric lesions are listed in Additional file 8.
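Once each strain has a set of gene-family identifiers, the search for genes shared by one group of strains and absent from all others reduces to set operations. The sketch below assumes orthologous genes have already been grouped under a common identifier (for example by the BLAST score ratio analysis). The function name and identifiers are hypothetical.

```python
def shared_only_in(presence, group):
    """Gene families present in every strain of `group` and absent from all other strains.

    presence: {strain: set of gene-family identifiers found in that strain}
    group:    non-empty iterable of strain names that must all carry the gene
    """
    group = set(group)
    others = set(presence) - group
    # Start from the families common to every strain in the group.
    shared = set.intersection(*(presence[s] for s in group))
    # Remove anything that also occurs in a strain outside the group.
    for strain in others:
        shared -= presence[strain]
    return sorted(shared)
```

Calling it with the three strains linked to cancer or premalignant lesions as `group` mirrors the first comparison in this section; passing pairs of strains mirrors the second.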

Discussion
In this study, we analyzed the genome sequences of an H. pylori strain isolated from a patient with gastric cancer and an H. pylori strain from a patient with gastric ulcer disease, and compared these with previously determined genome sequences of H. pylori strains associated with superficial gastritis, atrophic gastritis, and duodenal ulcer disease. We identified 1237 genes that were present in all 5 of these H. pylori strains. This group of genes may be considered to represent the H. pylori core genome. Some of the genes within the core genome are predicted to be essential for H. pylori viability in vitro. One previous study identified 33 genes that were essential for H. pylori viability [51]; all of these essential genes were present in the list of 1237 core genes identified in the current study. Other genes in the H. pylori core genome are not required for bacterial viability in vitro, but are predicted to be essential for H. pylori colonization of the stomach. Among 47 genes essential for H. pylori colonization of a gerbil model [52], 45 were included in the core genome described in the current study. Similarly, among 23 genes essential for H. pylori colonization of a mouse model (based on detection of a colonization defect in two different H. pylori strains) [53], 19 were included in the core genome described in the current study.
Several previous studies used array-based methodology to identify genes that are consistently present in all H. pylori strains analyzed [18-20]. The core genomes described in these previous studies have ranged from 1091 genes to 1281 genes. Potential reasons for differences in the reported size of the H. pylori core genome include variations in the number and choice of H. pylori strains selected for analysis, as well as variation in the DNA sequences that were used for array synthesis. In comparison to array-based methods, genome sequence analysis offers several potential advantages for delineation of a core genome. For example, genome sequence analysis is likely to be superior to array-based assays when differentiating between closely related paralogues, and genome sequence analysis is more likely to be successful in detecting the existence of highly divergent alleles. The main limitation of the sequence-based approach used in the current study for delineation of a core genome is that a relatively small number of genomes was analyzed. Nevertheless, there was reasonably close agreement between the core genes identified in the current study and the core genes identified in a previous array study [18].
Analysis of the 1237 core genes identified in this study revealed that the nucleotide sequences of these genes in individual strains were typically non-identical, and were differentiated by the presence of both synonymous and non-synonymous substitutions. As expected, allelic variation was detected within several housekeeping genes that have previously been used for MLST analysis. MLST analysis indicated that one of the strains analyzed in this study (98-10) belonged to an East Asian population cluster of H. pylori strains, whereas the other strains for which genome sequences were available belonged to European or West African population clusters. Thus, strain 98-10 is the first H. pylori strain from an East Asian population cluster to be analyzed by genome sequence analysis. We then focused on the identification of core genes in strain 98-10 that encoded proteins that were highly divergent compared to proteins encoded by the other 4 strains for which genome sequences were available. Eight such genes were identified (Table 2). Two of the genes shown in Table 2 (cagA and a sel1 homologue) were previously reported to be markedly divergent in East Asian strains compared to strains from other parts of the world [13,42]. We speculate that several of the other genes listed in Table 2 may exhibit similar patterns of geographic divergence. The observed high level of divergence may be associated with alterations in the functional activities of these proteins. The approach used in the current study prioritized identification of alleles that were highly divergent in one strain but similar in length compared to alleles in four other strains. A larger number of highly divergent alleles would have been identified if genes with substantial variations in length were included.
We identified several strain-specific genes in strain 98-10 or strain B128 that had not been previously described. Many of the new strain-specific genes identified in the current study were not closely related to any genes in the databases or were related to proteins for which the functions are not known. Notably, several of the new strain-specific genes identified in this study were closely related to genes present in related Helicobacter species, such as H. acinonychis [44] and H. cetorum [45].
Three of the strains for which genome sequences were available were isolated from patients with gastric cancer (98-10) or premalignant gastric lesions (atrophic gastritis and gastric ulcer; HPAG1 and B128). Therefore, we sought to identify genes present in these strains that were absent from strains isolated from patients with non-malignant conditions. We identified numerous genes that fulfilled these criteria (Table 5). Potentially several of these may be useful biomarkers for strains capable of inducing malignant or premalignant gastric lesions. Further studies involving larger numbers of strains will be needed in order to test this hypothesis.
Finally, it is notable that one of the strains selected for analysis in the current study (strain 98-10) was isolated from a gastric cancer patient in Japan, a country with a very high incidence of gastric cancer [3,35]. The biological basis for geographic variation in the incidence of gastric cancer is not yet clearly understood. Both environmental factors (such as a high-salt diet) and host genetic factors may be contributory [2,3]. In addition, H. pylori strains circulating in some parts of the world may have an increased carcinogenic potential compared to strains circulating in other parts of the world. In support of the latter hypothesis, most H. pylori strains isolated in Japan express forms of CagA that have multiple sites where tyrosine phosphorylation can occur and a unique tyrosine phosphorylation site (EPIYA-D), resulting in high levels of tyrosine-phosphorylated CagA within gastric epithelial cells and potent activation of the SHP-2 tyrosine phosphatase [13,14,25]. In future studies, it will be important to study further the geographic variations that exist among H. pylori genomes by analyzing a larger number of strains, and to determine whether the presence of particular allelic variations or strain-specific genes correlates with specific disease outcomes such as gastric cancer.

Conclusion
In this study we analyzed the genome sequences of an H. pylori strain isolated from a patient with gastric cancer and a strain isolated from a patient with gastric ulcer disease. Each strain contained novel genes not present in previously described H. pylori genomes. In addition, highly divergent alleles were identified. Comparative analysis of H. pylori strains isolated from patients with different clinical conditions provides a foundation for understanding why H. pylori may be associated with a variety of different gastroduodenal diseases.

H. pylori strains
H. pylori strain 98-10 was isolated from a patient in Japan with gastric adenocarcinoma [34]. H. pylori strain B128 was isolated from a patient in the United States with a gastric ulcer [32]. The genome sequences of H. pylori strains 26695, J99, and HPAG1 have been published previously [29-31]. Strain 26695 was isolated from a patient in the United Kingdom with gastritis [29]. Strain J99 was isolated from a patient in the United States with duodenal ulcer disease [30]. Strain HPAG1 was isolated from a patient in Sweden with chronic atrophic gastritis [31].

Genome sequencing
A single colony of H. pylori strain 98-10 and a single colony of strain B128 were isolated, and DNA was purified as described previously [54]. DNA sequencing was accomplished using an emulsion method for DNA amplification and an instrument (Genome Sequencer 20 System) that performs pyrophosphate-based sequencing (pyrosequencing) in picolitre-sized wells (454 Life Sciences, Branford, CT). Random libraries of DNA fragments were generated by shearing an entire genome and isolating single DNA molecules by limiting dilution. Specialized common adapters were added to the fragments, the individual fragments were captured on their own beads and, within the droplets of an emulsion, the individual fragments were clonally amplified [55]. This approach does not require subcloning in bacteria or the handling of individual clones, because the templates are handled in bulk within the emulsions. Three runs of the sequencing instrument were used for analysis of strain 98-10 and two instrument runs were used for analysis of strain B128. Assembly of sequence data was performed as described by Margulies et al. [55]. The average depth of sequencing coverage was approximately 20-fold. Sequence data from strain 98-10 were assembled into 51 large contigs, each > 600 nucleotides in size (average contig length 30,819 nucleotides). Sequence data from strain B128 were assembled into 73 large contigs, each > 600 nucleotides in size (average contig length 22,592 nucleotides). As described by Oh et al. [31], analysis of an H. pylori genome via this approach yields results comparable to those obtained by traditional Sanger sequencing.
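The contig summary statistics reported above (number of large contigs and their average length) can be reproduced from a list of assembled contig lengths. The following is a minimal sketch, not the assembler's own reporting code; the function name `summarize_large_contigs` and its `min_length` parameter are hypothetical names introduced here for illustration.

```python
from statistics import mean
from typing import Iterable, Tuple


def summarize_large_contigs(contig_lengths: Iterable[int],
                            min_length: int = 600) -> Tuple[int, float]:
    """Return (count, mean length) of contigs longer than min_length.

    The 600-nucleotide cutoff follows the text; the strict '>' comparison
    mirrors the "> 600 nucleotides" wording.
    """
    large = [n for n in contig_lengths if n > min_length]
    if not large:
        return 0, 0.0
    return len(large), mean(large)
```

Applied to the lengths of all contigs in an assembly, the function returns the two values quoted in the text for each strain.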

Analysis of sequence data
ORFs in the genomes of H. pylori strains 98-10 and B128 were predicted by FGENESB http://www.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb [56], an algorithm based on Markov chain models of coding regions and translation and termination sites that was "trained" on the genome of H. pylori strain 26695.

Multi-locus sequence typing
To analyze relationships between the strains analyzed in this study and other globally distributed H. pylori patient isolates, we used a multilocus sequence typing (MLST) database http://pubmlst.org/helicobacter containing data on 434 H. pylori strains that were isolated from patients in a broad range of geographic locations. This MLST database contains sequence data (398 to 627 bp per gene) for eight core genes (atpA, efp, mutY, ppa, trpC, ureI, vacA, and yphC) that are distributed throughout the H. pylori genome. Nucleotide sequences of the concatenated MLST loci were aligned using the ClustalW algorithm within MEGA4. Phylogenetic relationships were inferred using MEGA4 with the Kimura 2-parameter model of nucleotide substitution and neighbor-joining clustering [57,58]. The tree shown in Figure 1 is the product of 1000 bootstrap replicates [59].
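For readers unfamiliar with the Kimura 2-parameter model, the pairwise distance it assigns to two aligned nucleotide sequences is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites that differ by a transition and by a transversion, respectively. The sketch below illustrates that formula only; it is not the MEGA4 implementation. The function name `k2p_distance` is hypothetical, and skipping any site that contains a gap or ambiguous base is an assumption made here, since the gap treatment used in the analysis is not stated.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Assumption: sites with gaps or ambiguous bases are ignored.
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the model (a non-positive argument).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

A matrix of such distances, computed for every pair of concatenated MLST sequences, is the input that a neighbor-joining procedure would cluster into a tree.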

Identification of strain-specific genes and core genes
To identify strain-specific genes and genes present in all 5 H. pylori genomes analyzed in the current study, we used a BLAST score ratio (BSR) algorithm [60]. This algorithm is based on an analysis of BLAST raw scores, which, in contrast to analysis of BLAST output E-values, more accurately accounts for the length of the similarity between the Reference and Query sequences. As a first step, ORFs were translated into deduced amino acid sequences. BLAST score ratios were computed by first determining the BLAST raw score for each Reference peptide against itself; this raw score was designated as the Reference score. Each Reference peptide was then compared to each peptide in individual Query proteomes, and the best BLAST raw score in each Query proteome was recorded as the Query score. The BSR was calculated by dividing the Query score by the Reference score for each Reference peptide. Thus, all BSRs were normalized within a range between 0 and 1. A score of 1 indicates a perfect match of the Reference peptide to a Query peptide and a score of 0 indicates no BLAST match of the Reference peptide in the Query proteome. To identify strain-specific genes, multiple separate analyses were performed, each using a different strain as the reference. A BSR threshold value of 0.4 was used for identification of strain-specific genes. This stringent threshold value corresponds to approximately 30% amino acid identity over approximately 30% of the peptide length, a commonly used threshold for peptide similarity [60]. The same analytical approach was used to identify core genes that were present in all 5 strains for which genome sequences were available.
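The BSR calculation described above can be sketched in a few lines. This is an illustrative sketch, not the published BSR software [60]: it assumes the BLAST raw scores have already been parsed into dictionaries, and all function and variable names are hypothetical. Treating "strain-specific" as BSR < 0.4 in every Query proteome and "core" as BSR > 0.4 in every Query proteome is our reading of the threshold; the handling of a score of exactly 0.4 is an assumption.

```python
from typing import Dict, Set

Scores = Dict[str, float]  # Reference peptide ID -> BLAST raw score


def blast_score_ratios(reference_scores: Scores, query_scores: Scores) -> Scores:
    """BSR = best Query raw score / Reference self-score, per Reference peptide."""
    ratios = {}
    for peptide, ref_score in reference_scores.items():
        if ref_score <= 0:
            raise ValueError(f"non-positive self-score for {peptide}")
        # A peptide with no BLAST hit in the Query proteome gets a BSR of 0.
        ratios[peptide] = query_scores.get(peptide, 0.0) / ref_score
    return ratios


def strain_specific(reference_scores: Scores,
                    queries: Dict[str, Scores],
                    threshold: float = 0.4) -> Set[str]:
    """Reference peptides whose BSR is below threshold in every Query proteome."""
    per_strain = [blast_score_ratios(reference_scores, q) for q in queries.values()]
    return {p for p in reference_scores
            if all(r[p] < threshold for r in per_strain)}


def core(reference_scores: Scores,
         queries: Dict[str, Scores],
         threshold: float = 0.4) -> Set[str]:
    """Reference peptides whose BSR exceeds threshold in every Query proteome."""
    per_strain = [blast_score_ratios(reference_scores, q) for q in queries.values()]
    return {p for p in reference_scores
            if all(r[p] > threshold for r in per_strain)}
```

In this sketch, `reference_scores` holds each Reference peptide's score against itself and each entry of `queries` holds the best raw score of each Reference peptide against one Query proteome. Repeating the analysis with each strain in turn as the Reference mirrors the multiple separate analyses described in the text.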

Identification and analysis of alleles encoding highly divergent gene products
Among the core genes that were identified in all five H. pylori strains (BSR >0.4), we sought to identify alleles found in a single strain that differed markedly from corresponding alleles found in the other four strains. Candidate divergent alleles in a particular strain were initially identified by selecting peptides having a 0.4<BSR<0.9 in multiple analyses, each using a different strain as the reference. Deduced amino acid sequences from the 5 strains were aligned and compared using the NWay Comp program [61]. Alignments were manually inspected to exclude cases in which low BSRs were primarily attributable to differences in peptide length. Each gene product of interest from strain A was compared with corresponding gene products from strains B, C, D, and E, and a mean % amino acid identity value was calculated; similarly, the gene products in strains B, C, D, and E were compared in all permutations, and a mean % amino acid identity value was calculated. A gene from strain A was considered highly divergent if the former value was significantly lower than the latter value.
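The final comparison above reduces to two mean values per gene: the mean identity of the strain A product to each of the other four products, and the mean identity among all pairs of those four. The sketch below computes both from rows of a single multiple alignment. It is not the NWay Comp program [61]; the function names are hypothetical, and computing identity only over columns where neither sequence has a gap is an assumption. Because the text does not specify how "significantly lower" was assessed, the sketch returns the two means and leaves that judgment to the reader.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two rows of one alignment ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    # Assumption: columns with a gap in either row are not counted.
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def divergence_summary(aligned: Dict[str, str], focal: str) -> Tuple[float, float]:
    """Return (mean identity of focal vs. others, mean identity among others)."""
    others = [strain for strain in aligned if strain != focal]
    focal_mean = mean(percent_identity(aligned[focal], aligned[o]) for o in others)
    background_mean = mean(percent_identity(aligned[x], aligned[y])
                           for x, y in combinations(others, 2))
    return focal_mean, background_mean
```

With five strains, the first mean averages four comparisons and the second averages the six possible pairs among the remaining four strains.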

Sequence data
This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the project accessions ABSX00000000 (for strain 98-10) and ABSY00000000 (for strain B128). The versions described in this paper are the first versions (ABSX01000000 and ABSY01000000).

Authors' contributions
MM participated in the design of this study, analyzed genome sequences, and helped to draft the manuscript. CS performed the MLST analysis and helped to draft the manuscript. DI and RP contributed genome sequence data for strain B128. TC helped to design the study, analyzed genome sequences, and helped to draft the manuscript. All authors read and approved the final manuscript.