A comprehensive analysis of Helicobacter pylori plasticity zones reveals that they are integrating conjugative elements with intermediate integration specificity

Background The human gastric pathogen Helicobacter pylori is a paradigm for chronic bacterial infections. Its persistence in the stomach mucosa is facilitated by several mechanisms of immune evasion and immune modulation, but also by an unusual genetic variability which might account for the capability to adapt to changing environmental conditions during long-term colonization. This variability is reflected by the fact that almost every infected individual is colonized by a genetically unique strain. Strain-specific genes are dispersed throughout the genome, but clusters of genes organized as genomic islands may also collectively be present or absent. Results We have comparatively analysed such clusters, which are commonly termed plasticity zones, in a large number of H. pylori strains of varying geographical origin. We show that these regions contain fixed gene sets, rather than being true regions of genome plasticity, but two different types and several subtypes with partly diverging gene content can be distinguished. Their genetic diversity is incongruent with variations in the rest of the genome, suggesting that they are subject to horizontal gene transfer within H. pylori populations. We identified 40 distinct integration sites in 45 genome sequences, with a conserved heptanucleotide motif that seems to be the minimal requirement for integration. Conclusions The significant number of possible integration sites, together with the requirement for a short conserved integration motif and the high level of gene conservation, indicates that these elements are best described as integrating conjugative elements (ICEs) with an intermediate integration site specificity.


Background
Infections with the human gastric pathogen H. pylori are paradigmatic examples of chronic, or persistent, bacterial infections in the face of a constant immune response [1]. H. pylori infections are usually contracted during early childhood and persist for the lifetime of the host, but most infected individuals develop only mild gastric inflammation without overt symptoms. Nevertheless, a substantial fraction of infected persons develops more severe consequences, making H. pylori the principal cause of (symptomatic) chronic active gastritis and peptic ulcer disease, and a major risk factor for development of gastric adenocarcinoma and mucosa-associated lymphoid tissue (MALT) lymphoma [2,3]. For survival and persistent growth in the presence of a constant immune response and in an environment which is changing considerably over decades of infection, permanent adaptation of the bacteria is thought to be required [4]. Such adaptive processes may include regulatory mechanisms acting on gene expression, but also reversible or irreversible genome changes. For instance, it has been shown that strains isolated from patients with atrophic gastritis [5] or marginal zone B-cell MALT lymphoma [6] have reduced genomes in comparison to gastritis or ulcer strains, and a strain isolated from a gastric cancer patient had lost further genes in comparison to a strain isolated previously from the same patient during atrophic gastritis [7]. That genome plasticity plays a role in bacterial persistence is further supported by the observation that natural transformation competence, which is upregulated upon DNA stress [8], promotes persistent colonization in mice [9].
Allelic diversity caused by high mutation rates and frequent recombination events is a striking property of H. pylori strains. Genetic fingerprints of individual strains obtained by multilocus sequence typing of housekeeping genes have indicated that clonal transmission is likely to occur, but is followed by a rapid adaptation to the new host, so that H. pylori isolates from different subjects are almost always unique [4]. On the other hand, while recombination events generating allelic diversity are frequent, genome changes involving gain or loss of genes seem to be rare [10]. Nevertheless, on the level of gene content, evidence has been presented that H. pylori is a species with an open pan-genome, in which each individual isolate contains a distinct set of non-core, or strain-specific, genes [6,11,12,13]. Comparative analysis of the first sequenced H. pylori genomes suggested that these strain-specific genes are often located in genomic regions that had previously been termed plasticity zones or plasticity regions, a designation originally used to describe a particular genetic locus with high variation between the first two H. pylori genome sequences [14]. However, with the availability of more sequencing data and more complete H. pylori genome sequences, it became clear that parts of the plasticity regions are usually organized as genomic islands that may be integrated in one of several different genetic loci. Furthermore, they generally contain complete sets of genes required to produce type IV secretion machineries, as well as genes encoding different DNA-processing proteins [11,15,16], suggesting that they are actually mobile genetic elements capable of horizontal gene transfer between bacterial cells, and that they might be best described as conjugative transposons or integrating conjugative elements (ICEs).
The actual plasticity of these islands partly derives from the fact that gene rearrangements, insertions or deletions may have occurred within them, but it is not clear whether they also carry variable passenger genes. Interestingly, intrahost variation among genes of the plasticity zones, including deletions in a type IV secretion system gene, has been found for sequential isolates obtained from a duodenal ulcer patient over a course of 10 years [17]. Although several candidate genes of these plasticity regions have been suggested as disease markers, e.g. dupA for duodenal ulcer [18,19], or jhp950 for marginal zone B-cell MALT lymphoma [20], the functions of the plasticity zones are currently not well-understood.
To address the question of plasticity zone prevalence, and of their genetic diversity, we have performed a comparative analysis of these genome islands from a larger number of H. pylori genome sequences, including newly determined genome sequences of nine additional strains from different backgrounds. We show that these elements have a high prevalence throughout all populations, and that gene evolution within the elements is not congruent with the rest of the genomes. The wide variety of integration loci together with a conserved sequence motif at each integration site suggests an integration mechanism that depends on a short recognition motif in the DNA sequence only.

Prevalence of plasticity regions in the H. pylori population
We have reported previously that H. pylori strain P12 contains three genome regions with similarity to the prototypical plasticity zones, but only one of them (PZ2) corresponds to the originally described locus, whereas the other two regions (PZ1 and PZ3) have a genetic organization typical for genome islands and contain genes for type IV secretion systems that might make them capable of self-transfer [11]. In comparison, the original two genome sequences (strains 26695 and J99) contain only truncated and highly rearranged portions of these genome islands (Additional file 1: Figure S1). As reported previously, the most conserved type IV secretion system genes fall into one of two distinct groups, which have been termed either tfs3 and tfs3a/b [16], or tfs3 and tfs4 [11]. In accordance with Ref. [11], where conserved tfs3 genes have been shown not to be more closely related to tfs4 genes than to the respective comB genes encoding the type IV secretion system used for natural transformation, we consider tfs3 and tfs4 here as independent systems. Moreover, since there is evidence for horizontal gene transfer of the corresponding islands [11,16], but not for transposition within a strain, we propose to use the term integrating conjugative elements (ICE) and refer to individual islands as ICEHptfs3 or ICEHptfs4, respectively. A comparison of different designations of the islands and associated type IV secretion systems is given in Table 1. To determine the occurrence of ICEHptfs3 and ICEHptfs4 elements in the H. pylori population and the degree of variation among them, we performed a comparative sequence analysis of these elements from 36 completely sequenced H. pylori genomes available in public databases (Table 2).
We found that only 6 out of these 36 strains do not contain ICEHptfs3 or ICEHptfs4 islands or fragments thereof (Table 2). Among the remaining 30 strains, 19 harbour ICEHptfs3 islands, 6 of which seem to have complete gene sets, and 27 harbour ICEHptfs4 islands, 12 of which are complete. There are 3 strains with two different ICEHptfs4 elements, and 16 strains which have at least parts of both ICEHptfs3 and ICEHptfs4. Three strains (strains 51, SJM180 and Puno135) contain hybrid arrangements of ICEHptfs3 and ICEHptfs4 islands, but these seem to result from DNA rearrangements after integration of two independent genome islands (see below). Thus, each complete or truncated island can be assigned to either the ICEHptfs3 or the ICEHptfs4 type. Within the ICEHptfs3 group, two distinct variants can be discriminated, which differ by the presence (e.g., strain PeCan18) or absence (e.g., strain B8) of the pz21-pz23 genes ( Figure 1A). In contrast, three variants of ICEHptfs4, defined by orthologous, but variant sets of genes at both ends of the genome islands, or in their central regions, can be distinguished and are termed here ICEHptfs4a, ICEHptfs4b and ICEHptfs4c, respectively ( Figure 1B; Table 1). The third subtype, ICEHptfs4c, was only found in strain SouthAfrica7, which belongs to the hpAfrica2 population (see below), and as a plasmid-borne fragment in strain Lithuania75. Both types of genome island seem to vary considerably in size between strains (Table 2), but this is often due to small deletions within the islands or to insertion of IS elements; therefore, complete ICEHptfs3 islands have "standard" sizes of about 37.5, or 46 kb, depending on the presence of pz21-23 orthologs, while complete ICEHptfs4a, ICEHptfs4b and ICEHptfs4c usually comprise about 41, 39.5, and 39.5 kb, respectively ( Figure 1A, B).
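The strain counts in this paragraph can be cross-checked with simple inclusion-exclusion. The following is only an arithmetic sketch using the numbers stated in the text; the variable names are illustrative and not taken from the study.

```python
# Counts reported in the text for the 36 completely sequenced genomes.
total_strains = 36
with_tfs3 = 19   # strains harbouring ICEHptfs3 (complete or partial)
with_tfs4 = 27   # strains harbouring ICEHptfs4 (complete or partial)
with_both = 16   # strains carrying at least parts of both ICE types

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
with_any = with_tfs3 + with_tfs4 - with_both
without_any = total_strains - with_any

print(with_any, without_any)
```

The result agrees with the 30 ICE-positive and 6 ICE-free strains given above.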

Geographic distribution of ICEHptfs3 and ICEHptfs4 islands
It is well-established that H. pylori strains cluster into distinct populations according to their geographic origin when multilocus sequence typing using partial sequences of seven housekeeping genes is employed [21,22,23]. In contrast to this allelic variability, which suggests a common evolution of H. pylori and humans, consistent gene content profiles of individual populations could not be found, with the exception of one hypothetical gene (jhp914) present only in strains from the hpAfrica1 population [24]. Interestingly, comparison of gene content microarray data [24] with ICEHptfs4 composition suggests that most hpAfrica1 strains contain ICEHptfs4a genes close to the left junctions and in the mid region (jhp947-jhp951; hp1000-hp1006; Additional file 1: Figure S1), but ICEHptfs4b genes close to the right junctions (jhp917-jhp924; Additional file 1: Figure S1), while hpEurope strains variably contain these genes. Since there are only three hpAfrica1 strains among the 36 complete genome sequences analysed (strains 908, 2017 and 2018 were isolated from the same patient and are very similar), we decided to determine draft genome sequences of three further strains originating from Western Africa, as well as of six strains isolated in Europe, five of which had tested positive for the presence of an ICEHptfs4a-type or an ICEHptfs4b-type virB4 gene (data not shown). Sequence analysis revealed that all strains except one (196A) contain at least 37 kb of ICEHptfs3 and/or ICEHptfs4 sequences (Table 3).
To examine possible variations in plasticity zone distribution among phylogeographic groups, we first constructed a phylogenetic tree based on MLST gene sequences, using all 36 fully sequenced strains, the nine strains sequenced in this study, and 345 reference strains from the MLST database (Figure 2). No correlation between phylogeographic groups and the presence or absence of either ICEHptfs3 or ICEHptfs4 could be found. However, all hpAfrica1 strains contain truncated versions of ICEHptfs4b or of an ICEHptfs4a/b variant similar to the hpAfrica1 strains mentioned above (Tables 2 and 3). We then calculated Neighbor-joining phylogenetic trees using conserved ICEHptfs3 or ICEHptfs4 gene sequences (concatenated virB9, virB11 and virD4 sequences) and compared them with an MLST-derived tree (Figure 3A, B). Interestingly, ICEHptfs4ab genes clustered in a similar way as housekeeping gene sequences did, except for a much closer relationship of these genes than of housekeeping genes between hpAfrica2 strain SouthAfrica7 and other populations (Figure 3B; Additional file 2: Figure S2). In contrast, ICEHptfs3 sequences formed at least three strongly divergent clades that were not congruent with the MLST population structure. These clades seem to correspond to (1) the hspAmerind population; (2) a mixture of hspEAsia and hpAsia2 populations; and (3) a mixture of hpEurope and hpAfrica1 populations (Figure 3B; Additional file 2: Figure S2). However, the number of ICEHptfs3-positive strains analysed may be too low to draw definite conclusions from this observation.
Table 1 Comparison of plasticity zone mobile genetic element and associated type IV secretion system (T4SS) designations. Columns: element designation used in this study; T4SS designation used in this study; element designation used in [16]; T4SS designation used in [16]; element designation used in [11].
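The concatenation of virB9, virB11 and virD4 sequences ahead of tree building can be illustrated with a small sketch. It assumes pre-aligned sequences and uses a plain p-distance (proportion of differing sites, gaps ignored) as the input a Neighbor-joining method would need; the text does not state which software or distance model was used, so the function names, the distance measure and the toy sequences below are all hypothetical.

```python
from itertools import combinations

GENE_ORDER = ("virB9", "virB11", "virD4")

def concatenate(alignments, gene_order=GENE_ORDER):
    """Join per-gene aligned sequences for strains present in every gene.

    alignments maps gene name -> {strain name: aligned sequence}.
    """
    strains = set.intersection(*(set(alignments[g]) for g in gene_order))
    return {s: "".join(alignments[g][s] for g in gene_order)
            for s in sorted(strains)}

def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared

def distance_matrix(seqs):
    """Pairwise p-distances keyed by sorted strain-name pairs."""
    return {(s, t): p_distance(seqs[s], seqs[t])
            for s, t in combinations(sorted(seqs), 2)}

# Hypothetical toy alignments; s3 lacks virD4 and is therefore excluded.
toy = {
    "virB9":  {"s1": "ACGT", "s2": "ACGA", "s3": "ACGT"},
    "virB11": {"s1": "GG-A", "s2": "GGTA", "s3": "GCTA"},
    "virD4":  {"s1": "TT",   "s2": "TT"},
}
concatenated = concatenate(toy)
matrix = distance_matrix(concatenated)
```

Restricting the concatenation to strains that carry all three genes is a design choice of this sketch; truncated islands lacking one of the genes would otherwise distort the distances.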

Identification of conserved and ICE type-specific genes
Since both ICEHptfs3 and ICEHptfs4 islands contain genes for complete type IV secretion systems and may coexist in a single strain, an open question is whether individual genes or groups of genes from one type of island have the capacity to complement deficiencies in the other. Sequence comparisons showed that each of the type IV secretion apparatus components is clearly distinguishable between the different types (and partly between subtypes) of islands, with amino acid sequence similarities ranging from 40% to 80% (Table 4). This is also true for putative DNA processing or segregation proteins such as XerT, ParA, TopA or VirD2 (but not for the putative methylase/helicase PZ21 (OrfQ)/HPP12_447; see below), suggesting that the individual secretion systems might be sufficiently divergent to be incompatible.
To define further common ICE gene products and to identify ICE-type-specific genes, we performed similarity searches with all other amino acid sequences as well. The results show that nine further, hypothetical ICEHptfs4a genes have similar counterparts in ICEHptfs3-type islands (Table 4). Interestingly, orthologs of the conserved hypothetical genes hpb8_524 or hpp12_438 are present in ICEHptfs3, ICEHptfs4a and ICEHptfs4c islands, but absent from ICEHptfs4b islands. Because of their sequence similarities, we speculate that these hypothetical genes have additional conserved functions for genome island maintenance and/or transfer. In contrast, genes that are specific for either type of genome island might encode cargo proteins of the respective mobile genetic elements, fulfilling more specific roles. Such specific genes for ICEHptfs4 islands are hpp12_440 (present only on ICEHptfs4a and ICEHptfs4c islands), hpp12_450/hpg27_977 (which is specifically absent in ICEHptfs4c islands), hpp12_452, hpp12_453, hpp12_456, hpp12_459-461, and hpp12_472 (Table 4). Specific genes of ICEHptfs3 islands include hpb8_522, hpb8_523, hpb8_525, hpb8_531, hpb8_534, hpb8_535, hpb8_539, hpb8_541, hpb8_542, hpb8_549, hpb8_552, pz22 and pz23. Interestingly, ICEHptfs3 islands in some strains have insertions of specific genes encoding Fic domain-containing or JHP940-like proteins (Additional file 3: Figure S3).
The putative DNA methylase/helicase gene pz21 (orfQ)/hpp12_447 may be found associated with either ICEHptfs3 or ICEHptfs4 islands. In striking contrast to the above-mentioned divergence between orthologous ICEHptfs3 and ICEHptfs4 genes, the methylase/helicase orthologs present on ICEHptfs3 (e.g., pz21) and on ICEHptfs4a/b/c islands (e.g., hpp12_447) are highly conserved (90-98% similarity), indicating an evolutionary pressure for this gene which is distinct from other genes on the genome islands. A Neighbor-joining tree of pz21/hpp12_447 orthologs shows a certain clustering according to geographic origin, but this clustering is clearly independent of gene association with either ICEHptfs3 or ICEHptfs4 (Figure 3C). Indeed, in cases where both ICEHptfs3 and ICEHptfs4 methylase/helicase orthologs are present in a single strain (Shi112, Shi417, Gambia94/24), these orthologs are always more similar to each other than to ICEHptfs3 or ICEHptfs4 orthologs of geographically related strains, and even more similar than two ICEHptfs4 methylase/helicase orthologs present in a single strain (SouthAfrica7) are to each other (Figure 3C). Because of these high sequence similarities, homologous recombination between ICEHptfs3 and ICEHptfs4 methylase/helicase orthologs is possible. By analysing the gene arrangements of the hybrid ICEHptfs3-ICEHptfs4 elements mentioned above, we could identify situations where such recombination events indeed seem to have occurred after integration of one ICE element into another (Additional file 4: Figure S4).
Table footnotes: containing ICEHptfs4a-type genes close to the left junction, and ICEHptfs4b-type genes close to the right junction. 5 associated with genome rearrangement in comparison to strain P12. 6 associated with deletion of hpp12_980 to hpp12_995 (5') including one copy of 5S-23S-rRNA. 7 associated with a recombination between the two 5S-23S-rRNA loci (including hpp12_1381-1384). 8 partial duplication of both genes; ICEHptfs3 inserted into truncated ICEHptfs4b. 9 within a restriction-modification system inserted into this region. 10 integrated together with a 0.9 kb fragment of ICEHptfs3 and a putative toxin-antitoxin system. 11 integration of ICEHptfs4a into remnant of ICEHptfs4b, which is in turn integrated into truncated ICEHptfs3. 12 irregular integration, using internal AAGAATG motif. 13 left and right junctions coincide due to irregular integration. 14 numbers in parentheses indicate incomplete ICE elements. 15 disrupted by a chromosomal inversion from hpp12_92 to hpp12_128. 16 size of ICE increased by IS element insertion. 17 interrupted by a chromosomal rearrangement between hpp12_312 and hpp12_1044 (including babC deletion). 18 original integration probably in hpp12_994-5S-rRNA locus; from there, relocation of 26.6 kb fragment via internal AAGAATG motifs into hpp12_1510; 1.4 kb duplication (containing xerT) in both loci.

Analysis of ICE integration sites
Originally, the plasticity zone was found located at a distinct position within H. pylori genomes (i.e., between the ftsZ gene (hp0979) and one copy of the 5S-23S rRNA genes) [14]. However, analysis of strain P12, Shi470 and G27 genome sequences showed that ICEHptfs3 and ICEHptfs4 elements are also able to integrate into different genomic locations, in a manner similar to conjugative transposons or genome islands [11,16]. To examine further variations in integration sites, we compared the sequences of ICE integration sites and duplicated junction motifs in all genome sequences with recognizable left and/or right ICEHptfs3 and ICEHptfs4 junctions. In addition to 12 different sites described previously [16], we identified 28 further chromosomal sites and one plasmid site where complete or partial ICEHptfs3 or ICEHptfs4 elements can be integrated (Tables 2 and 3; Figure 4). Although these integration sites cluster in certain genome regions, such as the originally identified ICE integration locus (plasticity zone 2 in P12), the left border region of ICEHptfs4a, or a locus containing several restriction-modification system genes (hpp12_1364-1366), there is no obvious general preference for ICE integration. We also did not observe different patterns of ICEHptfs3 versus ICEHptfs4 integration sites; in fact, some integration sites are used by either ICEHptfs3 or ICEHptfs4 (Figure 4).
Figure 1 Gene arrangement of prototypical ICEHptfs3 (A) and ICEHptfs4 (B) islands. Genes encoding type IV secretion system components are drawn as red arrows, and other genes as grey arrows. Regions with high nucleotide sequence similarity are connected by dark grey bars, and regions with low to intermediate levels of similarity by light grey bars. Hatched arrows indicate orthologous, but clearly distinct gene variants. Typical sizes of the corresponding elements are indicated on the left. ICEHptfs3 elements differ by the presence or absence of pz21-pz23 genes (according to the nomenclature of [15]) and by several distinct variants of the pz34, pz35, and/or pz36 genes. However, variations within these two regions do not correlate with each other and were thus not considered for ICEHptfs3 subclassification. In contrast, ICEHptfs4 islands are further subclassified into ICEHptfs4a, ICEHptfs4b and ICEHptfs4c groups according to the presence of orthologous gene variants. Note that the polymorphic genes hpp12_446/hpg27_981 and hpp12_444-445/hpg27_982 could not clearly be assigned to ICEHptfs4a or ICEHptfs4b and were thus not considered for classification of ICEHptfs4 subtypes. LJ, left junction; RJ, right junction.
All islands with detectable junctions contained the conserved sequence motif AAGAATG [11,16], and this motif is always present in the corresponding empty sites of PZ-free strains (albeit sometimes mutated), suggesting that it represents a minimal requirement for integration of ICEHptfs3 and ICEHptfs4 elements. To determine whether additional sequences are required to form an integration site, we compared the sequences of the flanking regions of ICEHptfs3 and ICEHptfs4 separately ( Figure 5; Additional file 5: Figure S5). There is a certain preference for A or T close to the left junctions of both ICEHptfs3 and ICEHptfs4 islands (-1 to -3 or -1 to -6), but the alignment revealed no significant consensus sequences otherwise. However, there seems to be a stronger preference of A at the -1 position (resulting in AAAGAATG motifs) in ICEHptfs4 than in ICEHptfs3 islands. Furthermore, the low prevalence of the last G at the right junctions of ICEHptfs3 islands may even suggest that only six bases (AAGAAT) are used by ICEHptfs3 islands.
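The motif-based reasoning above can be made concrete with a short sketch that scans a sequence for AAGAATG on both strands and tallies the bases immediately upstream of each forward-strand hit (positions -1 to -6). This is an illustration only, not the procedure used in the study; the function names and the toy sequence are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_motif_sites(seq, motif="AAGAATG"):
    """Return sorted (0-based start, strand) pairs for hits on either strand."""
    seq = seq.upper()
    sites = []
    for pattern, strand in ((motif, "+"), (reverse_complement(motif), "-")):
        start = seq.find(pattern)
        while start != -1:
            sites.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping hits
    return sorted(sites)

def upstream_base_counts(seq, sites, width=6):
    """Tally bases at offsets -1..-width before each '+' strand motif start."""
    seq = seq.upper()
    counts = {offset: Counter() for offset in range(1, width + 1)}
    for start, strand in sites:
        if strand != "+":
            continue
        for offset in range(1, width + 1):
            pos = start - offset
            if pos >= 0:
                counts[offset][seq[pos]] += 1
    return counts

# Hypothetical toy sequence with one forward and one reverse-strand motif.
toy = "GGTTAAAAGAATGCCGTACATTCTTGG"
sites = find_motif_sites(toy)
counts = upstream_base_counts(toy, sites)
```

On real data, the tally would be taken over the flanking regions of all identified junctions, which is where an A or T preference close to the left junction would become visible.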

Identification of a unique ICEHptfs4 variant in the hpAfrica1 population
Since deletions of single genes or different sets of genes are frequent for both ICEHptfs3 and ICEHptfs4 islands (Table 2), we checked whether these occur randomly or at conserved sites. Deletions found within ICEHptfs3 variants range from small deletions (pz26 and pz27) to loss of major parts of the island (Additional file 3: Figure S3A), and mostly seem to occur at random positions and without conserved sequence motifs (data not shown). However, we also identified several cases where ICEHptfs3 truncation sites are flanked by AAGAATG motifs, suggesting that recombination events similar to ICE integration resulted in some deletions (Additional file 3: Figure S3A). For ICEHptfs4 islands, we found certain deletions that are more frequent. For example, four hspEAsia strains (35A, F30, F57, XZ274) have identical truncations of their ICEHptfs4a islands (Additional file 3: Figure S3B). These elements also have identical integration sites (Figure 4) and are accompanied by a common genome rearrangement [25], suggesting that the observed truncations reflect the situation in a common ancestor of all four strains. In fact, these truncated versions are the only ICEHptfs4a remnants that we found in hspEAsia or hspAmerind strains; all other complete or truncated variants in these populations are of the ICEHptfs4b type. A second common truncation was found in all hspWAfrica strains (908/2017/2018, Gambia94/24, 1_17C, 6_17A, 6_28C) and involved a loss of several genes close to the right junctions of their ICEHptfs4b or ICEHptfs4a/b islands, including the 5' regions of the respective virB4 genes (Additional file 3: Figure S3B). The same deletion occurs in hspWAfrica strain J99, where the corresponding virB4 gene (jhp917/918) is also known as dupA [18]. All these ICEHptfs4b islands have their right junctions deleted and are furthermore inserted at the same genome position (Tables 2 and 3), flanked on the truncation site by jhp916, jhp915 and jhp914 orthologs (Figure 6A). A closer inspection of the right border revealed that truncations have occurred at a CATTCTT (or AAGAATG on the reverse strand) motif which is conserved in the virB4 genes of ICEHptfs4b (but not ICEHptfs4a) islands. Interestingly, those ICEHptfs4b variants which contain ICEHptfs4a genes close to their left borders all have another small truncation of about 300 bp at their left junctions, which has also occurred at a conserved CATTCTT motif (Additional file 3: Figure S3B), indicating that these islands have integrated in an irregular fashion, producing irregular left junctions (ILJ) and irregular right junctions (IRJ; Figure 6A). Since the nearby jhp914 gene has previously been reported to be specifically present in the hpAfrica1 population [24], we asked whether this truncated right border might be a general signature of hpAfrica1 strains. To test this hypothesis, we performed a BLAST search of draft genome sequences with a 260 bp query sequence spanning the right border of J99 (including the IRJ). Of 78 retrieved draft genome sequences having the same IRJ, 64 also contained the jhp914 gene (data not shown). Furthermore, we checked a panel of H. pylori strains isolated in Nigeria for the presence of the irregular ICEHptfs4b right border (Figure 6B). PCR analysis with primers specific to virB4 and jhp914, respectively (Figure 6A), confirmed that 14 out of 19 strains from this population were positive for a similar gene arrangement in this locus and thus for an IRJ (Figure 6B, and data not shown).
Table footnotes: inferred from the Neighbor-joining tree shown in Figure 2. 2 resulting from insertion of two genome islands and rearrangements associated with IS element insertion and two copies of pz21/hpp12_447-like genes. 3 associated with a genome rearrangement between hpp12_1366 and hpp12_1298.

Discussion
The unusual genetic heterogeneity of H. pylori has been well-documented in terms of allelic diversity, establishing it as a species with a very high population recombination rate, and allowing for different populations from different geographic regions to be identified [4]. MLST analysis of these populations has revealed important insights into the coevolution of H. pylori and humans, and into migration events of human populations, but relatively little is known about bacterial population-specific properties on a genomic level. Striking differences in the presence or absence of putative host interaction genes have been reported for East Asian H. pylori strains in comparison to European strains [12], and many divergent genes were found to evolve under positive selection between East Asian and non-Asian strains [12,26]. Previous comparative analysis of a small number of H. pylori genome sequences indicated that many strain-specific genes are located either at potential genome rearrangement sites or within the plasticity zones [11]. However, for those plasticity zone regions that are organized in ICEHptfs3 or ICEHptfs4 islands as described here, identification of further novel genes seems unlikely. Instead, the gene content of a given type of ICEHptfs3 or ICEHptfs4 island is, apart from the variable presence of JHP940- or Fic domain protein-encoding genes, highly conserved, strongly suggesting that these elements are autonomous elements with fixed contents rather than true regions of genome plasticity. Nevertheless, partial truncations, insertions of restriction-modification systems, IS elements or even distinct genome islands, and associated rearrangements [25] are frequent within both types of ICE and result in a considerable amount of variation. Rearrangements between ICEHptfs3 and ICEHptfs4 elements may be facilitated by recombination events within pz21/hpp12_447 (methylase/helicase) orthologs present on both types of islands. Apart from that, ICEHptfs3 and ICEHptfs4 islands are clearly distinct and do not seem to exchange individual genes. The fact that pz21/hpp12_447 orthologs are the only genes with high similarity between ICEHptfs3 and ICEHptfs4 elements indicates either that these orthologs are frequently exchanged between both types of island, or that they are subject to strong selective pressures.
Interestingly, certain regions of both ICEHptfs3 and ICEHptfs4 islands are much more variable than others. For instance, we were able to identify 3, 5, and 4 distinct clades, respectively, for the pz34, pz35 and pz36 orthologs on ICEHptfs3 elements (data not shown), whereas all other ICEHptfs3 genes are more conserved. However, similar to the variability of hpp12_444/445 and hpp12_446 orthologs among ICEHptfs4 islands, where two clades each can be distinguished (data not shown), no clear correlation of these different clades with individual geographic groups could be found. Likewise, the three different subtypes of ICEHptfs4 islands which are characterized by orthologous, but distinct sets of genes, do not seem to be restricted to certain geographic groups. We also performed a preliminary analysis of two further hpAfrica2 strain genome sequences [27] and one hpSahul strain genome sequence [13] that were published after completion of our comparative analysis. Both hpAfrica2 strains contain one full-length ICEHptfs4b element, and the hpSahul strain harbours a full-length ICEHptfs4b and a partial ICEHptfs3 element (data not shown), which further supports the notion that these elements are present in all phylogeographic groups. The modular structure of ICEHptfs4 islands indicates that parts of these elements can easily be exchanged, and that all variants may coexist in a given H. pylori population.
Figure 3 Neighbor-joining analysis of type IV secretion system gene sequences. (A) Phylogenetic tree calculated with MLST sequences for fully sequenced strains only, with phylogeography assignments based on the Neighbor-joining tree shown in Figure 2. Note that unequivocal classification of strains PeCan4 and PeCan18 was not possible. (B) Phylogenetic tree calculated from concatenated virB9, virB11 and virD4 ortholog sequences of all ICEHptfs3 and ICEHptfs4 islands. (C) Neighbor-joining tree calculated from DNA sequences of methylase/helicase (hpp12_447/pz21) orthologs. Orthologs associated with ICEHptfs3 elements are marked by blue branch lines, and orthologs associated with ICEHptfs4 elements by red branch lines. Black lines indicate hybrid elements or the presence of two different elements in the same strain. Colouring of individual strains by phylogeographic origin is shown according to the tree in Figure 2.
Indeed, ICEHptfs4a, b and c islands all have some common genes which may be used for exchange of modules. However, it is striking that all members of ICEHptfs4b subtypes consistently lack hpp12_438 orthologs and that hybrid elements between different ICEHptfs4 subtypes do not occur. An exception is the combination of ICEHptfs4a (left) with ICEHptfs4b (right), which seems to occur in hpAfrica1 strains only, and always in a truncated version. genes hpb8_526 and pz35, as well as hpb8_527 and pz34 share only 61/73% and 54/70% identity/similarity, respectively, to each other, but are equally similar to hpp12_441 and hpp12_439, respectively. 3 some ICEHptfs4c islands contain the ICEHptfs4b versions with lower similarities in these sites. 4 similarities confined to short regions only. 5 no significant similarity detectable, but gene with similar size and orientation present. 6 ICEHptfs4c and ICEHptfs3 islands contain fusions of hpp12_454 and hpp12_455.
These restrictions on modular exchange suggest that there is a selective pressure on maintenance of cognate left and right ICEHptfs4 ends, for example by an inability of hybrid elements to be excised and/or transferred. The presence of ICEHptfs3-like islands in other Helicobacter species, such as H. cetorum [16,28] and H. suis [29], indicates that these elements were acquired a long time ago (i.e., before the cag pathogenicity island, which is absent in hpAfrica2 strains and was acquired more than 60,000 years ago [30]). Whereas microdiversity within cag pathogenicity island genes correlates with microdiversity in housekeeping genes, this is not the case for ICEHptfs3 or ICEHptfs4 genes, which again indicates that these islands are subject to more frequent horizontal gene transfer. Horizontal gene transfer of typical ICEs involves several steps [31]: first, the element is usually excised from the chromosome by a recombinase to generate a circular intermediate; second, this circular form is transferred from the donor to a recipient cell by conjugation; and third, the ICE integrates into the recipient cell chromosome via site-specific or unspecific recombination. In the case of ICEHptfs4, the first step depends on the XerT recombinase [11], and the second on the VirD2 relaxase [32], both of which are encoded on the ICE. It is likely, but has not yet been shown, that the ICE-encoded type IV secretion system is responsible for the conjugative transfer process. It is also currently unclear whether the XerT recombinase catalyzes integration of the ICE into the recipient cell chromosome as well. An interesting finding of this study was the presumptive minimal requirement for integration of both ICEHptfs3 and ICEHptfs4 islands, the sequence motif AAGAATG (or possibly AAGAAT for ICEHptfs3), as suspected previously [11,16].
Thus, the total number of possible insertion sites might be limited only by the number of these motifs in intergenic regions or in non-essential genes. In total, we identified more than 40 different integration sites, but the total number of possible integration sites might be significantly higher, given that AAGAATG sequences are found approximately 550 times within individual H. pylori genomes (data not shown). Many well-characterized ICEs integrate into a unique position in the host cell genome (the primary attachment site), often in the 3' regions of tRNA loci [31]. In the absence of primary attachment sites, these elements are sometimes capable of integrating into secondary sites with much less specificity, but this may result in ICE immobility or even toxicity for the host cell [33]. In contrast, other ICE-like elements, which are often termed conjugative transposons, have very low integration site specificities, with as many as 100,000 possible integration sites in a given host strain [34,35]. In this regard, ICEHptfs3 and ICEHptfs4 seem to integrate with an intermediate specificity, but still with the potential to insert into coding regions and thereby to disrupt essential genes. Possible integration sites are also located on the ICEs themselves, and we found several cases where one ICE is integrated into another. We could also identify situations where these internal sites were used for irregular ICE integration, associated with truncation of the left and/or right ICE ends, and possibly an inability of these elements to excise.
Figure 4 Integration sites of all ICEHptfs3 and ICEHptfs4 islands mapped onto the genome of strain P12. Positions of these elements as well as of plasticity zone 2 (PZ2) in the genome of P12 are shown within the circle. Each arrow indicates an individual ICEHptfs3 and/or ICEHptfs4 integration site. Note that the integration sites shown for strains where one island is integrated into another are not indicative of their genomic location in comparison to the main genome (for example, ICEHptfs3 of strain PeCan18 is inserted into an ICEHptfs4a fragment and therefore shown at 456 kb, but the ICEHptfs4a fragment is in fact integrated in the PZ2 region at 1059 kb in this strain).
Figure 5 Comparative analysis of integration sites. Sequence logos for nucleotide sequences around ICEHptfs3 (A) or ICEHptfs4 (B) integration sites were generated using WebLogo [43]. The level of sequence conservation is indicated by the height of the letters (with a maximum of 2 bits at each position).
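The count of candidate integration motifs mentioned above can be illustrated with a minimal sketch. It scans a sequence for AAGAATG on both strands (the reverse complement is CATTCTT) and reports the position and strand of each hit. The function names and the toy sequence are hypothetical and are not taken from the study; the sketch treats the sequence as linear, so a motif spanning the origin of a circular chromosome would be missed, and it does not distinguish intergenic from coding positions.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_motif_sites(genome: str, motif: str = "AAGAATG") -> list[tuple[int, str]]:
    """List (0-based position, strand) for each motif occurrence on either strand.

    Positions always refer to the forward strand of `genome`.
    """
    genome = genome.upper()
    sites = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = genome.find(pattern)
        while start != -1:
            sites.append((start, strand))
            # Advance by one base so that overlapping hits are not skipped.
            start = genome.find(pattern, start + 1)
    return sorted(sites)


# Toy sequence for illustration only (not real H. pylori data).
toy_sequence = "TTAAGAATGCCCATTCTTGG"
print(find_motif_sites(toy_sequence))  # [(2, '+'), (11, '-')]
```

Applied to a complete genome sequence read from a FASTA file, `len(find_motif_sites(genome))` would give the kind of genome-wide count referred to in the text, although the exact counting procedure used in the study is not described.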
Finally, despite the presence of genes that encode host interaction factors, such as JHP940 [36], or that are correlated with disease outcome, such as dupA [18], the (potentially different) functions of ICEHptfs3 and ICEHptfs4 islands are currently unclear. In our analysis, a total of 18 strains were positive for dupA (the ICEHptfs4b virB4 gene), and 12 additional strains were found positive for ICEHptfs4a or ICEHptfs4c virB4 genes, which are likely to have the same functions. Because of this, and since not all of these strains have complete ICEs or even complete type IV secretion systems, testing for the presence of the dupA gene alone, and correlating dupA alone with pathology, is probably not useful. It has been shown that a more complete analysis of type IV secretion system genes is more significant as a virulence marker [19]. Therefore, future correlation studies should determine the presence of the complete set of genes.
[...] genes (compared here with P12) close to the left junction and ICEHptfs4b genes (compared here with G27) close to the right junction of the island. In these strains, the left part of the island is shortened by 350 bp at a CATTCTT motif upstream of xerT, and the right part by approximately 3850 bp at a CATTCTT motif within ICEHptfs4b virB4, generating irregular left and right junctions (ILJ and IRJ). In strain PeCan18, the ICEHptfs4a fragment has probably been integrated in a similar manner, using irregular integration at the same chromosomal position, but the majority of ICEHptfs4a seems to have been deleted subsequently by (regular) integration of an ICEHptfs3 at the same internal virB4 motif and another internal CATTCTT motif upstream of ICEHptfs4a virB6. Gene colouring is as in Figure 1, and asterisks denote frameshift or nonsense mutations. (B) PCR analysis of the ICEHptfs4b right junction in H. pylori strains from Nigeria. PCR was performed from chromosomal DNA of the indicated strains with primers WS606 and WS539 (see Figure 6A).

Conclusions
Taken together, our comparative analysis reinforces the notion that major parts of the H. pylori plasticity zones described earlier should in fact be considered as mobile genetic elements with conserved gene content, rather than regions of genome plasticity. Although horizontal gene transfer of complete ICEHptfs3 or ICEHptfs4 elements remains to be demonstrated experimentally, the number of different integration sites indicates a considerable mobility, possibly also within individual H. pylori genomes. In this regard, these elements differ from the cag pathogenicity island, for which only one integration site is known (although rearrangements may occur). The high prevalence and wide distribution of these ICEs throughout all H. pylori populations suggest that they might provide an as yet unknown fitness benefit to their hosts.
Whole genomic DNA was isolated from bacteria that were subjected to minimal passage, using Qiagen Genomic-tip 100/G columns and the Genomic DNA Buffer Set (Qiagen). Genomic DNA was processed to generate 3 kb mate pair libraries, which were sequenced with 50 bp paired-end reads on an Illumina HiSeq 2000 platform (GATC, Konstanz, Germany). This resulted in 24-60 million reads per genome. After removal of PCR duplicates, the reads were mapped to a reference sequence consisting of concatenated ICEHptfs3 (strain B8), ICEHptfs4a (strain P12), and ICEHptfs4b (strain G27) sequences, using BWA [38] with default parameters. Unmapped reads were assembled de novo using Velvet [39], and ICE elements were identified by BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Gaps within ICE elements were closed by Sanger sequencing.

Software tools for analysis of H. pylori genome sequences
For comparative analysis, we evaluated all complete H. pylori genome sequences available in GenBank at the time of initiation of the study. We used multilocus sequence typing analysis to assign all strains to the populations and subpopulations described previously [21]. To do so, partial nucleotide sequences of the housekeeping genes atpA, efp, mutY, ppa, trpC, ureI and yphC were concatenated for each strain and aligned with the corresponding sequences of 345 reference strains from the MLST database (http://pubmlst.org/helicobacter), using the Muscle algorithm within MEGA5.2 [40]. All phylogenetic trees were constructed and tested by neighbor joining with MEGA5.2, using the Kimura 2-parameter model of nucleotide substitution, and 1,000 bootstrap replications. ICE elements were identified in complete or draft genome sequences using BLAST search and visualization with the Artemis Comparison Tool [41]. A chromosomal map of strain P12 was generated using CGView [42], and WebLogo [43] was used to display sequence alignments of ICE border regions.
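For readers unfamiliar with the Kimura 2-parameter model named above, the following minimal sketch computes the pairwise distance d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences between two aligned sequences. It is an illustration of the formula only, not a reimplementation of MEGA5.2. The function name is hypothetical, and skipping gap or ambiguous positions per sequence pair is an assumption, since the gap treatment used in the study is not stated.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Assumption: positions where either sequence has a gap or an ambiguous
    base are ignored for this pair (pairwise deletion).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable positions")

    p = transitions / compared
    q = transversions / compared
    # math.log / math.sqrt raise ValueError when the sequences are too
    # divergent for the model (non-positive argument).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

A matrix of such distances over all pairs of concatenated MLST sequences is the kind of input a neighbor-joining algorithm takes; bootstrap support would be obtained by recomputing the matrix on resampled alignment columns.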

Genetic analysis of hpAfrica1 strains
Genomic DNA of H. pylori strains was prepared using a QIAamp DNA mini kit. For MLST analysis, the housekeeping genes atpA, efp, mutY, ppa, trpC, ureI and yphC were partially amplified by PCR, using the primer sets described in the MLST database (http://pubmlst.org/helicobacter), and the PCR products were sequenced. Sequences were trimmed to the required sizes, concatenated and analyzed for clustering, as described above. For examination of the right junctions of ICEHptfs4 islands, PCR fragments were amplified with a PAN-Script DNA polymerase (PAN Biotech, Aidenbach, Germany) under standard conditions in the presence of 3 mM MgCl2 and at an annealing temperature of 52°C, using primers WS606 (5′-AGCAATAAAACGCTTAAAAGTCTC-3′) and WS539 (5′-ATGTCCAGTAAGGAATTTGTC-3′), and subsequently analyzed by gel electrophoresis.
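As a rough plausibility check of the primer pair given above, the sketch below reports length, GC content and a melting temperature estimate from the simple Wallace rule, Tm = 2(A+T) + 4(G+C) in °C. This rule is only a coarse approximation and tends to be unreliable for oligonucleotides of this length; it is not stated in the text how the annealing temperature of 52°C was chosen, so the calculation is purely illustrative. The function names are hypothetical.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in a primer sequence."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Melting temperature estimate in degrees C by the Wallace rule.

    Tm = 2 * (A + T) + 4 * (G + C); a coarse rule of thumb only.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Primer sequences as given in the text (5' to 3').
primers = {
    "WS606": "AGCAATAAAACGCTTAAAAGTCTC",
    "WS539": "ATGTCCAGTAAGGAATTTGTC",
}

for name, seq in primers.items():
    print(name, len(seq), round(gc_fraction(seq), 2), wallace_tm(seq))
```

Both estimates lie above the annealing temperature used, which is the usual relationship between estimated Tm and annealing temperature, but a nearest-neighbor calculation would be needed for a reliable value.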