Inter-species horizontal transfer resulting in core-genome and niche-adaptive variation within Helicobacter pylori
BMC Genomics volume 6, Article number: 9 (2005)
Horizontal gene transfer is central to evolution in most bacterial species. The detection of exchanged regions is often based upon analysis of compositional characteristics and their comparison to the organism as a whole. In this study we describe a new methodology combining aspects of established signature analysis with textual analysis approaches. This approach has been used to analyze the two available genome sequences of H. pylori.
This gene-by-gene analysis reveals a wide range of genes related to both virulence behaviour and the strain differences that have been relatively recently acquired from other sequence backgrounds. These frequently involve single genes or small numbers of genes that are not associated with transposases or bacteriophage genes, nor with inverted repeats typically used as markers for horizontal transfer. In addition, clear examples of horizontal exchange in genes associated with 'core' metabolic functions were identified, supported by differences between the sequenced strains, including: ftsK, xerD and polA. In some cases it was possible to determine which strain represented the 'parent' and 'altered' states for insertion-deletion events. Different signature component lengths showed different sensitivities for the detection of some horizontally transferred genes, which may reflect different amelioration rates of sequence components.
New implementations of signature analysis that can be applied on a gene-by-gene basis for the identification of horizontally acquired sequences are described. These findings highlight the central role of the availability of homologous substrates in evolution mediated by horizontal exchange, and suggest that some components of the supposedly stable 'core genome' may actually be favoured targets for integration of foreign sequences because of their degree of conservation.
Helicobacter pylori is a bacterial pathogen associated with gastritis, peptic ulcers, gastric adenocarcinoma, and rare lymphomas. It has a highly panmictic population structure in which homologous recombination makes the predominant contribution to sequence differences within a highly diverse population. The acquisition of genes from other strains and species is by far the most rapid evolutionary process. This occurs frequently without loss of existing functions, is central to the evolution of niche-adaptive and pathogenic characteristics of bacteria, and greatly influences inter-strain differences in gene complement [3–5]. In this context, it is notable that none of the traits typically used to differentiate E. coli from Salmonella can be attributed to point mutation; they are instead broadly attributable to horizontal exchange. H. pylori is relatively unusual in that it is a naturally transformable Gram-negative species that does not appear to have a species-specific DNA uptake sequence and appears to rely upon its niche separation as a transformation barrier. Disease-associated H. pylori strains have been divided into two types, type I being those that carry the cag pathogenicity island (cag PAI), which has a foreign species origin, and are associated with more severe disease.
Dinucleotide composition is highly stable within a genome and can distinguish between sequences from different species. Because of this constancy, the species composition is referred to as a 'genome signature' [9, 10]. This characteristic has been applied to assessments of DNA metabolic processes such as methylation and base conversion, DNA structure, and evolutionary relationships. It has also become established as a method for the identification of sequences that have been acquired by inter-species horizontal transfer. For example, these methods have recently been used to show lateral transfer of a tryptophan pathway operon, to identify the gain of additional metabolic functions in Pseudomonas putida, and to determine that many gain-of-function genes have been acquired by E. coli rather than lost from S. typhi. More recently developed Bayesian methods based upon similar premises have been used to assess global signatures and determine the origins of some lateral transfer events [14, 15]. However, there are problems associated with this and other methods that use progressive 'walking windows', and the larger the window the greater the problems. These result from the inclusion of intergenic sequence, the inability to distinguish divergence due to a single highly divergent gene from that due to a cluster of less divergent ones, and an inability to identify the limits of the abnormal regions. In practice, additional features are necessary to determine the ends of such regions, such as the location of repeats typical of pathogenicity islands in H. pylori, or comparisons with other sequences as in N. meningitidis strain MC58. In addition, divergence scores are influenced by the size of the sampling window used, such that sampling effects limit analysis of sequences shorter than about 800 bp (data not presented), and the need to use fixed window sizes prevents gene-by-gene studies.
We describe the use of a linear implementation of signature analysis that can efficiently address a range of walking window sizes using dinucleotide signatures (DNS) and longer signatures. In addition, use of a new approach based upon classical text analysis that allows analysis of genomes gene-by-gene is described. Analysis of H. pylori sequences, combined with comparisons of the identified genes between genomes, reveals complex changes that influence both niche-adaptive and core functions illustrating a previously unpredicted range of functions which are continuously undergoing variation and selection.
Results and discussion
Genes were ranked on the basis of their divergence from the mean genome composition. The degree of divergence that is indicative of acquisition from other species is not an absolute. It is influenced by the frequency with which genes are acquired, the untypicality of the donated material, and the rate at which acquired genes are ameliorated to the host sequence composition. Strains J99 and 26695 had 53 (Table 1) and 60 (Table 4) genes respectively with DNS that were >2 SD from the mean. Those with annotated functions included genes from the cag pathogenicity island (6 and 5), vac and related toxins (3 and 4), and restriction-modification genes (2 and 4). On the basis of the similarities determined in the H. pylori strain J99 sequence annotation, 7 of the most divergent genes as determined by DNS are not present in strain 26695. Likewise, 2 of the 50 most divergent genes in strain 26695 are not present in strain J99. This is consistent with the identification of genes acquired from other species that have not extended to both sequenced strains. It also suggests that a significant proportion of the 6 to 7% of genes unique to one or other strain are inherent to the Helicobacter gene pool, but are variably present in different strains rather than reflecting recent foreign origins. Comparisons of a selection of identified orthologous genes in the two strains are shown in Figure 1.
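The ranking and >2 SD cut-off described above can be illustrated with a minimal sketch. It assumes that each gene already has a single divergence score, that the threshold is one-sided (only scores above the mean are of interest), and that the population standard deviation is used; none of these details is stated in the text. The function name `flag_divergent` and the gene labels in the usage note are hypothetical.

```python
from statistics import mean, pstdev


def flag_divergent(scores, n_sd=2.0):
    """Return (gene, score) pairs whose score exceeds mean + n_sd * SD.

    `scores` maps a gene identifier to its divergence score.
    Assumption: one-sided cut-off using the population SD.
    """
    values = list(scores.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    flagged = [(gene, s) for gene, s in scores.items() if s > cutoff]
    # Most divergent first, mirroring a ranked table of genes.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

For example, with ten hypothetical genes scoring 0.1 and one scoring 1.0, only the latter exceeds the cut-off and is returned. Because the cut-off depends on the distribution of scores in each genome, the number of genes flagged will differ between strains.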
It cannot be assumed that all genes identified in this manner have been recently acquired. It is necessary to assess the nature of the sequence to determine if its divergence might be accounted for on the basis of features of the encoded protein. For example, JHP0476/HP0527, JHP1300/HP1408 and JHP0074/HP0080 include repetitive sequences likely to account for their DNS divergence. This type of analysis cannot be used to determine the possible foreign origin of such genes. Notably, the most divergent cag PAI gene (the 1st and 2nd most divergent gene in the whole genomes of strain 26695 and J99 respectively, JHP0476/HP0527) has a highly complex repetitive structure and the size of the large divergent peak associated with this island using previous methods is largely due to the presence of this gene.
While a significant proportion of the genes identified in this analysis lie in regions that include several such genes and that share characteristics of islands of horizontal transfer or pathogenicity islands, this is far from universally true. There are many instances of single genes or small numbers of genes that are not associated with any features that might otherwise have been used as indicators of horizontal acquisition, such as transposases and flanking repeats.
Our initial goal was to identify recently acquired and exchanged genes as candidates likely to be important in niche-adaptation, host interactions, and alterations in bacterial fitness. It has been argued that essential genes are unlikely to be transferred successfully since recipient taxa would already bear functional orthologues, which would have experienced long-term co-evolution with the rest of the cellular machinery. In contrast, it is proposed that genes under weak or transient selection, such as those associated with nonessential catabolic processes, new operons, and new niche-adaptive changes, are likely to be successfully transferred and retained. This leads to a model in which a stable 'core genome' comprised of essential metabolic, regulatory, and cell division genes provides a stable context for the more labile non-essential and niche-adaptive genes. On this basis core genes are used for phylogenetic studies and are thought to provide a relatively constant background in which species evolution occurs. Many of the identified genes with known functions affect virulence or niche adaptation, including: the vacuolating cytotoxin and related toxins (2 and 3), urease and flagellar components, and genes involved in iron acquisition. However, we also find clear evidence, confirmed by differences between the two genome sequences, that recent, and therefore relatively frequent, horizontal transfer is not limited to genes associated with niche adaptation and virulence. Amongst the core function genes identified were mutS, ftsK, xerD, and polA. The comparisons of the latter three between the sequenced strains are shown in Figure 1f, g and j. These comparisons support the results suggesting that these genes have been the substrates for horizontal exchange between species.
Tetranucleotide composition has been used to examine the presence of palindromic sequences that might be substrates for restriction systems, of Chi sites, and of unstable repeats mediating phase variation, but longer component signatures have not been used to identify horizontally acquired regions in bacterial genomes. Following analysis of eukaryotic sequences it was concluded that DNS captures most of the departure from randomness in DNA sequences and that longer component lengths correlate highly with the DNS results. Also, analysis of dinucleotides separated by no, one, or two other nucleotides showed that separated pairs are more nearly random than adjacent pairs, and these were concluded to be relatively uninformative. However, in preliminary analyses, while the typically long walking windows gave concordant results, as previously reported, we found that progressively smaller walking windows generated increasingly different patterns of divergence with other component lengths. Using tetranucleotide (TNS) and hexanucleotide (HNS) signature analysis we find that, while in some instances there is significant overlap between the genes identified using the different component lengths, there are substantial differences that indicate additional horizontally transferred genes not identified by DNS alone (Tables 2 to 6).
The 50 most divergent J99 ORFs by HNS included 26 (52%) that were not among the 53 (>2 SD) most divergent by DNS; these included 11 restriction-modification system genes and 6 others that were not annotated within the strain 26695 genome sequence. The identification of genes of a type known to be horizontally exchanged, and different between the gene complements of the strains, is strong corroboration for the foreign origin of the additional genes identified by HNS. In several instances (Tables 2 to 6) the DNS did not detect these genes at all; for example, restriction enzymes that were the 3rd, 13th and 41st most divergent genes by HNS were the 319th, 857th and 750th most divergent by DNS, respectively. In some instances the TNS gave intermediate results, and in others it identified other genes as more divergent than did the other methods. The TNS was most sensitive for the detection of rpoB (HP1198 / JHP1121), which is associated with a significantly different gene length in the two strains (Figure 1h). One explanation for this observation is that, while the DNS may initially be the most sensitive indicator of horizontal exchange, it may become ameliorated to the new sequence characteristics more rapidly than the longer component features, which are probably detecting qualitatively different sequence characteristics.
The differences in the analyses using different length components, and a comparison of the results from the two sequenced strains, suggest a complex evolutionary history for the cag pathogenicity island. These suggest that it probably has a mosaic structure including sequences from more than one species background, in addition to sequence that is entirely typical of H. pylori.
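The kind of rank comparison reported here, in which genes ranked highly by one signature length are ranked poorly by another, can be expressed as a short sketch. It assumes two dictionaries of per-gene scores, one per signature length; the function names and the toy gene labels are hypothetical and are not taken from the published program.

```python
def rank_genes(scores):
    """Map each gene to its rank (1 = most divergent)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def missed_by(reference_scores, other_scores, top_ref, top_other):
    """Genes in the top `top_ref` by the reference signature that fall
    outside the top `top_other` by the other signature.

    Returns {gene: rank under the other signature}.
    """
    ref_rank = rank_genes(reference_scores)
    other_rank = rank_genes(other_scores)
    return {
        gene: other_rank[gene]
        for gene, rank in ref_rank.items()
        if rank <= top_ref and other_rank[gene] > top_other
    }


# Toy scores for four hypothetical genes under two signature lengths.
hns = {"g1": 0.8, "g2": 0.7, "g3": 0.1, "g4": 0.2}
dns = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.2}
print(missed_by(hns, dns, top_ref=2, top_other=2))  # {'g2': 4}
```

With real data, `top_ref=50` and `top_other=53` would correspond to the comparison of the 50 most divergent ORFs by HNS against the 53 most divergent by DNS.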
It is normally impossible to determine the chronology of events to distinguish insertions and deletions when comparing strains. In strain 26695 there are two open reading frames that are both good candidate coding sequences. There is only one gene in this location in strain J99 composed of the 5' gene from strain 26695 and the 3' end of the subsequent gene. This could have arisen from either a deletion or an insertion event. However, the normal DNS of the J99 gene (JHP0073, 799th in divergence) and the 5' 26695 gene (HP0079, 751st in divergence), and the high divergence of the 3' 26695 gene (HP0078, 68th in divergence), indicate that the most likely event is an insertion into strain 26695 (Figure 1l). Likewise HP0119 is likely to contain an insertion and JHP1113 probably reflects the original sequences (Figure 1k).
The inclusion of two DNA metabolism genes associated with recombination and repair is notable. Both mutS and recN were identified in both strains (22nd and 35th, and 45th and 51st most divergent genes by DNS in strains 26695 and J99 respectively). When the homologous genes were compared between the strains, extensive divergences were evident between more than one region of each protein. That these genes have divergent signatures in both strains suggests that neither has a wholly native composition. This observation is consistent with the models of rapid evolution which suggest that transient competitive advantages are enjoyed by organisms that are hypermutators under conditions of environmental stress and transitions, and that these states can be produced by mutations in DNA repair genes [21–26]. However, such states have to be reversed so that an unsustainable mutational burden is not attained, and it has been proposed that this reversal is mediated by repair following horizontal transfer and homologous recombination, and that such strains are hyper-recombinogenic [27–29]. The untypicality of mutS and recN suggests that H. pylori is another species that can make use of this strategy for diversification under stressful conditions.
The identification of RNA polymerase genes, with associated differences between the strains, is striking. The divergence of phylogenetic trees based upon different sequences has been highlighted, and particularly the differences between the trees associated with RNA polymerase genes and rRNA [30, 31]. It has been argued that RNA polymerase is as essential to cell function as is rRNA and that there is no compelling reason to choose rRNA as the more reliable marker. While the DNS analysis does not address the stability of rRNA (and specifically excludes the rRNA sequences because their differing coding requirements and evolutionary pressures generate a divergent signature for other reasons), it does indicate that RNA polymerase can be a substrate for horizontal transfer, and that trees based upon this gene, or other essential genes, need not necessarily be considered a challenge to rRNA-based phylogenies.
The spectrum of recently horizontally acquired sequences identified emphasizes the two driving forces of horizontal exchange: the transfer of a phenotype which alters or enhances bacterial fitness resulting in increased competitive fitness or altered niche adaptation, and the presence of a substrate for homologous recombination. Because of the focus upon, and relative ease of identifying, large islands associated with readily identifiable features and phenotypes, the importance of the latter component has perhaps been underestimated. The genes that have been considered to code for 'core metabolic' 'house-keeping' functions are amongst those most likely to be changed by horizontal transfer events because of the presence of homologous substrates, and changes are likely to persist even when the change is phenotypically neutral. Equally, changes in the genes involved in core functions such as gene expression and DNA metabolism may have pleiotropic effects, and there may be significant differences in strain behaviour that are not simply the consequence of differences in their respective gene complements. The selection of genes for phylogenetic analysis on the basis of their coding for conserved core functions is also problematic because these are also frequently the genes most likely to share the high homology that facilitates recombination and horizontal exchange.
A traditional nucleotide signature is generated by segmenting a sequence of DNA into k equal-sized subsequences (or 'windows'). The mathematical basis for the signature is an odds ratio, $p_i$, calculated by dividing the frequency of a length-L oligonucleotide by its expected frequency. The odds ratios for each of the $4^L$ oligonucleotides in each window (w) are compared with the odds ratios for the overall sequence (s) [9, 10, 33]. The normalized difference δ is plotted, and thus a nucleotide signature consists of a k-length sequence of δ values: $\delta(w,s) = \frac{1}{4^L} \sum_{i \in x} \left| p_i(w) - p_i(s) \right|$, where x is the set of all $4^L$ oligonucleotides of length L and i is one such oligonucleotide.
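The fixed-window signature defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published Java implementation: the expected frequency of an oligonucleotide is taken to be the product of its single-base frequencies, only one strand is counted, and trailing bases that do not fill a whole window are ignored. The function names are hypothetical.

```python
from collections import Counter
from itertools import product


def odds_ratios(seq, L=2):
    """Observed / expected frequency for each of the 4**L oligonucleotides.

    Assumption: expected frequency = product of single-base frequencies.
    """
    seq = seq.upper()
    n_oligos = len(seq) - L + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    counts = Counter(seq[i:i + L] for i in range(n_oligos))
    ratios = {}
    for oligo in map("".join, product("ACGT", repeat=L)):
        expected = 1.0
        for base in oligo:
            expected *= base_freq[base]
        observed = counts[oligo] / n_oligos
        # An oligonucleotide that cannot occur gets a ratio of zero.
        ratios[oligo] = observed / expected if expected > 0 else 0.0
    return ratios


def delta(window, whole, L=2):
    """Mean absolute difference between window and whole-sequence ratios."""
    p_w = odds_ratios(window, L)
    p_s = odds_ratios(whole, L)
    return sum(abs(p_w[o] - p_s[o]) for o in p_s) / 4 ** L


def window_signature(seq, k, L=2):
    """Split seq into k equal windows and return the k delta values."""
    size = len(seq) // k
    return [delta(seq[j * size:(j + 1) * size], seq, L) for j in range(k)]
```

Calling `window_signature(genome, k=100, L=2)` on a sequence string would give one δ value per window; a window identical in composition to the whole sequence scores zero. Note that this direct version recomputes the whole-sequence ratios for every window, which is simple to read but not efficient.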
There are interesting parallels between signature-style genome analysis and stylometric techniques previously used to determine the authorship of controversial literary texts. The authorship problem is analogous to the biological one, and it is from this that our method is derived. Rather than using a fixed-window signature, signature scores are calculated for each coding open reading frame (ORF) and weighted with variance estimates so that the scores for shorter ORFs are comparable with those of their longer counterparts. Bissell's weighted cusum (cumulative sum) is modified so that n denotes the number of ORFs in the genome, X i the number of oligonucleotides in ORF i, and w i the number of nucleotides in ORF i. The results are scaled according to ORF size using the standard error σ = √(*#ORF). In this way false positives are abrogated by normalizing for over-representation of lower order peptides.
The method is implemented in Java and efficiency is maintained through an O(N) (N = sequence length) refinement: probabilities for the complete sequence are calculated in O(N) steps for any length-L oligonucleotide, and remain O(N) when $4^L > N$ through a hashing function; the second part of the program calculates σ for each ORF using a loop flattening technique, thereby avoiding the program having to recalculate overlapping sub-expressions. The program is available from ftp://ftp.dcs.warwick.ac.uk/people/Stephen.Jarvis/ and http://www.molbiol.ox.ac.uk/~saunders/.
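The counting step can be illustrated with a minimal single-pass sketch. It is written in Python rather than Java for brevity and is an assumption about how such a scheme might look, not a description of the published program: each base is packed into two bits, the key for the current oligonucleotide is updated in constant time as the window slides, and a hash table stores only the oligonucleotides actually observed, so memory stays bounded by N even when $4^L > N$. The loop flattening used for the per-ORF calculation is not shown. The names `count_oligos` and `decode` are hypothetical.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_oligos(seq, L):
    """Count length-L oligonucleotides in one pass over seq.

    Keys are integers holding L bases at two bits per base.
    Windows containing an ambiguous base (e.g. N) are skipped.
    """
    mask = (1 << (2 * L)) - 1
    counts = defaultdict(int)
    key = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for base in seq.upper():
        code = CODE.get(base)
        if code is None:
            key, valid = 0, 0
            continue
        # Shift in the new base and drop the oldest one.
        key = ((key << 2) | code) & mask
        valid += 1
        if valid >= L:
            counts[key] += 1
    return counts


def decode(key, L):
    """Turn an integer key back into its oligonucleotide string."""
    return "".join("ACGT"[(key >> (2 * (L - 1 - i))) & 3] for i in range(L))
```

For instance, counting dinucleotides in `"ACGTACGT"` and decoding the keys gives AC, CG and GT twice each and TA once. The same loop works unchanged for tetranucleotides or hexanucleotides by changing `L`.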
ORF: Open Reading Frame
Cremonini F, Gasbarrini A, Armuzzi A, Gasbarrini G: Helicobacter pylori-related diseases. Eur J Clin Invest. 2001, 31: 431-437. 10.1046/j.1365-2362.2001.00835.x.
Suerbaum S, Smith JM, Bapumia K, Morelli G, Smith NH, Kunstmann E, Dyrek I, Achtman M: Free recombination within Helicobacter pylori. Proc Natl Acad Sci U S A. 1998, 95: 12619-12624. 10.1073/pnas.95.21.12619.
Lawrence JG: Gene transfer, speciation, and the evolution of bacterial genomes. Curr Opin Microbiol. 1999, 2: 519-523. 10.1016/S1369-5274(99)00010-7.
Ochman H, Lawrence JG, Groisman EA: Lateral gene transfer and the nature of bacterial innovation. Nature. 2000, 405: 299-304. 10.1038/35012500.
Lan R, Reeves PR: Gene transfer is a major factor in bacterial evolution. Mol Biol Evol. 1996, 13: 47-55.
Lawrence JG, Ochman H: Molecular archaeology of the Escherichia coli genome. Proc Natl Acad Sci U S A. 1998, 95: 9413-9417. 10.1073/pnas.95.16.9413.
Saunders NJ, Peden JF, Moxon ER: Absence in Helicobacter pylori of an uptake sequence for enhancing uptake of homospecific DNA during transformation. Microbiology. 1999, 145: 3523-3528.
Censini S, Lange C, Xiang Z, Crabtree JE, Ghiara P, Borodovsky M, Rappuoli R, Covacci A: cag, a pathogenicity island of Helicobacter pylori, encodes type I-specific and disease-associated virulence factors. Proc Natl Acad Sci U S A. 1996, 93: 14648-14653. 10.1073/pnas.93.25.14648.
Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics. 1995, 11: 283-290. 10.1016/S0168-9525(00)89076-9.
Karlin S, Mrazek J, Campbell AM: Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997, 179: 3899-3913.
Xie G, Bonner CA, Grettin T, Gottardo R, Keyhani NO, Jensen RA: Lateral gene transfer and ancient paralogy of operons containing redundant copies of tryptophan-pathway genes in Xylella species and in heterocystous cyanobacteria. Genome Biol. 2003, 4: R14. 10.1186/gb-2003-4-2-r14.
Weinel C, Nelson KE, Tümmler B: Global features of the Pseudomonas putida KT2440 genome sequence. Environmental Microbiology. 2002, 4: 809-818. 10.1046/j.1462-2920.2002.00331.x.
Hooper SD, Berg OG: Detection of genes with atypical nucleotide sequence in microbial genomes. J Mol Evol. 2002, 54: 365-375.
Sandberg R, Winberg G, Bränden C-I, Kaske A, Ernberg I, Cöster J: Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Research. 2001, 11: 1404-1409. 10.1101/gr.186401.
Sandberg R, Bränden C-I, Ernberg I, Cöster J: Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G + C content. Gene. 2003, 311: 35-42. 10.1016/S0378-1119(03)00581-X.
Akopyants NS, Clifton SW, Kersulyte D, Crabtree JE, Youree BE, Reece CA, Bukanov NO, Drazek ES, Roe BA, Berg DE: Analyses of the cag pathogenicity island of Helicobacter pylori. Mol Microbiol. 1998, 28: 37-53. 10.1046/j.1365-2958.1998.00770.x.
Tettelin H, Saunders NJ, Heidelberg J, Jeffries AC, Nelson KE, Eisen JA, Ketchum KA, Hood DW, Peden JF, Dodson RJ, Nelson WC, Gwinn ML, DeBoy R, Peterson JD, Hickey EK, Haft DH, Salzberg SL, White O, Fleischmann RD, Dougherty BA, Mason T, Ciecko A, Parksey DS, Blair E, Cittone H, Clark EB, Cotton MD, Utterback TR, Khouri H, Qin H, Vamathevan J, Gill J, Scarlato V, Masignani V, Pizza M, Grandi G, Sun L, Smith HO, Fraser CM, Moxon ER, Rappuoli R, Venter JC: Complete genome sequence of Neisseria meningitidis serogroup B strain MC58. Science. 2000, 287: 1809-1815. 10.1126/science.287.5459.1809.
Alm RA, Ling LS, Moir DT, King BL, Brown ED, Doig PC, Smith DR, Noonan B, Guild BC, deJonge BL, Carmel G, Tummino PJ, Caruso A, Uria-Nickelsen M, Mills DM, Ives C, Gibson R, Merberg D, Mills SD, Jiang Q, Taylor DE, Vovis GF, Trust TJ: Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature. 1999, 397: 176-180. 10.1038/16495.
Lawrence JG: Gene transfer, speciation, and the evolution of bacterial genomes. Curr Opin Microbiol. 1999, 2: 519-523. 10.1016/S1369-5274(99)00010-7.
Karlin S, Ladunga I: Comparisons of eukaryotic genomic sequences. Proc Natl Acad Sci USA. 1994, 91: 12832-12836.
Chao L, Cox EC: Competition between high and low mutation strains of Escherichia coli. Evolution. 1983, 37: 125-134.
Giraud A, Matic I, Tenaillon O, Clara A, Radman M, Fons M, Taddei F: Costs and benefits of high mutation rates: adaptive evolution of bacteria in the mouse gut. Science. 2001, 291: 2606-2608. 10.1126/science.1056421.
Sniegowski PD, Gerrish PJ, Lenski RE: Evolution of high mutation rates in experimental populations of E. coli. Nature. 1997, 387: 703-705. 10.1038/42701.
Taddei F, Radman M, Maynard-Smith J, Toupance B, Gouyon PH, Godelle B: Role of mutator alleles in adaptive evolution. Nature. 1997, 387: 700-702. 10.1038/42696.
LeClerc JE, Li B, Payne WL, Cebula TA: High mutation frequencies among Escherichia coli and Salmonella pathogens. Science. 1996, 274: 1208-1219. 10.1126/science.274.5290.1208.
Oliver A, Canton R, Campo P, Baquero F, Blazquez J: High frequency of hypermutable Pseudomonas aeruginosa in cystic fibrosis lung infection. Science. 2000, 288: 1251-1254. 10.1126/science.288.5469.1251.
Rayssiguier C, Thaler DS, Radman M: The barrier to recombination between Escherichia coli and Salmonella typhimurium is disrupted in mismatch-repair mutants. Nature. 1989, 342: 396-401. 10.1038/342396a0.
Denamur E, Lecointre G, Darlu P, Tenaillon O, Acquaviva C, Sayada C, Sunjevaric I, Rothstein R, Elion J, Taddei F, Radman M, Matic I: Evolutionary implications of the frequent horizontal transfer of mismatch repair genes. Cell. 2000, 103: 711-721. 10.1016/S0092-8674(00)00175-6.
Brown EW, LeClerc JE, Li B, Payne WL, Cebula TA: Phylogenetic evidence for horizontal transfer of mutS alleles among naturally occurring Escherichia coli strains. J Bacteriol. 2001, 183: 1631-1644. 10.1128/JB.183.5.1631-1644.2001.
Hirt RP, Logsdon JM, Healy B, Dorey MW, Doolittle WF, Embley TM: Microsporidia are related to Fungi: evidence from the largest subunit of RNA polymerase II and other proteins. Proc Natl Acad Sci U S A. 1999, 96: 580-585. 10.1073/pnas.96.2.580.
Stiller JW, Duffield EC, Hall BD: Amitochondriate amoebae and the evolution of DNA-dependent RNA polymerase II. Proc Natl Acad Sci U S A. 1998, 95: 11769-11774. 10.1073/pnas.95.20.11769.
Doolittle WF: Phylogenetic classification and the universal tree. Science. 1999, 284: 2124-2128. 10.1126/science.284.5423.2124.
Karlin S, Campbell A, Mrazek J: Comparative analysis across diverse genomes. Annu Rev Genet. 1998, 32: 185-225. 10.1146/annurev.genet.32.1.185.
Bissell AF: Weighted cumulative sums for text analysis using word counts. J Statist Soc. 1995, 158: 525-545.
Pearson WR: Effective protein sequence comparison. Methods Enzymol. 1996, 266: 227-258.
At the time most of this study was performed NJS was supported by a Wellcome Trust Advanced Research Fellowship.
NJS initiated the project, performed the genome sequence analyses, compared the two strains, interpreted the results, and prepared the biological aspects of the manuscript. PB was a DPhil student who worked on the coding aspects of the new methodology. JFP contributed to the bioinformatics discussions and planning stage of this project. SAJ directed and primarily developed the analysis strategy and the implementation of the new computational basis of the methodology, and prepared the computational aspects of the manuscript.