Ancestral European roots of Helicobacter pylori in India

Background The human gastric pathogen Helicobacter pylori has co-evolved with its host; the origins and expansion of its multiple populations and sub-populations therefore mirror ancient human migrations. The ancestral origins of H. pylori in the vast Indian subcontinent are debatable, and it is not clear how different waves of human migration in South Asia shaped the population structure of H. pylori. We addressed these issues by mapping the genetic origins of present-day H. pylori in India and by their genomic comparison with hundreds of isolates from different geographic regions. Results We dissected the genetic identity of strains by multilocus sequence typing (MLST) of 7 housekeeping genes (atpA, efp, ureI, ppa, mutY, trpC, yphC) and phylogeographic analysis of haplotypes using MEGA and NETWORK software, while incorporating DNA sequences and genotyping data of whole cag pathogenicity islands (cagPAI). The distribution of cagPAI genes within these strains was analyzed by PCR, and the geographic type of the cagA phosphorylation motif EPIYA was determined by gene sequencing. All the isolates analyzed revealed European ancestry and belonged to the H. pylori sub-population hpEurope. The cagPAI harbored by Indian strains revealed European features upon PCR-based analysis and whole PAI sequencing. Conclusion These observations suggest that H. pylori strains in India share ancestral origins with their European counterparts. Further, the absence of other sub-populations such as hpAfrica and hpEastAsia, at least in our collection of isolates, suggests that the hpEurope strains enjoyed a special fitness advantage in Indian stomachs that allowed them to out-compete any endogenous strains. These results might also support hypotheses related to gene flow into India through Indo-Aryans and the arrival of Neolithic practices and languages from the Fertile Crescent.


Background
Analysis of genetic diversity in microorganisms normally reflects patterns of their own evolution although it is very rare that this can portray their hosts' evolution. Co-evolution between host and pathogens can be explained only if pathogens are not horizontally transmitted, and this supports a possible phylogenetic and evolutionary parallel of the host and pathogens. Sadly, in many cases frequent horizontal transmission separates the evolution of the bacterium from that of the host. However, for some pathogens, such as H. pylori [1][2][3], and JC viruses [4], transmission is faithfully restricted to families within specific communities. This phenomenon has in recent times provided evidence regarding patterns of human migration [2,4,5] in different continents.
The human gastric pathogen H. pylori is presumed to have co-evolved with its host [6] and to have established itself in the human stomach possibly millions of years ago [7]. It has recently been recognized as a reliable biological marker of host-pathogen co-evolution and ancient human migration, based on sequence variation in select gene loci. H. pylori is genetically extremely diverse, providing about 1,400 informative sites within 3.5 to 4.5 kb of sequence from housekeeping genes, and its global genetic structure based on such sequence haplotypes parallels that of humans [2]. Moreover, epidemiological studies have shown that transmission occurs predominantly within families [8][9][10][11]. H. pylori could therefore provide a window into human origins and migration [1,3] and into the impact of religions and social systems on the stratification of human ethnic groups [12].
A landmark study based on PCR-based DNA motif analysis proposed that H. pylori jumped recently from animals to humans and that, therefore, the acquisition of H. pylori by humans may be a recent phenomenon [13]. This study has been the basis for the idea of an 'H. pylori free New World' [13]. However, several independent studies based on large-scale analyses of candidate gene polymorphisms have contradicted the idea of recent acquisition and suggest that H. pylori might have co-evolved with humans [1,6,14].
Using the same set of Peruvian isolates described earlier by Kersulyte et al. [13], Devi et al. [3] from our group suggested that the genetic makeup of South American isolates could be an admixture of ancestral and modern lineages of H. pylori. They highlighted the presence of ancestral H. pylori in Peruvians that possibly survived influxes of Spanish strains during the Iberian expansion into Peru about 500 years ago. According to this study, the survival advantage of indigenous strains was possibly due to the acquisition of western type cagPAIs from newly arrived Spanish strains.
Previous genotyping studies on Indian isolates have largely targeted molecular epidemiological issues. However, Wirth et al. [12] were the first to use H. pylori genotypes to address issues such as the impact of two different religions and societal systems on the stratification of human ethnic groups, in the remote north eastern Ladakh area of India. In view of intriguing ideas on the ancient origin of H. pylori, and the fact that the ancient origins and arrival of H. pylori are hardly known in the context of the vast South Asian subcontinent, additional evidence based on strains from different geographical regions of Asia is clearly needed.
In this study, we attempted to unravel the population genetic structure and gene pool diversity of Indian isolates of H. pylori from culturally and linguistically diverse ethnic Indians. The main objective behind the study was to explore genetic features of the strains that might explain their ancestral origin and help reconstruct different waves of prehistoric human migration in India. We also examined whether some of the native strains could be linked to their ancestors in West Asia, Eurasia or Europe.

DNA isolates, diagnostic PCR and epidemiological genotyping
DNA quality and purity were confirmed by agarose gel electrophoresis, and diagnostic PCRs revealed the presence of cagA, iceA, vacA, glmM, babB and oipA genes in all the Indian isolates we tested. The molecular epidemiological features of all the 63 strains we analyzed are detailed in Figure 1. Our isolates were quite diverse with respect to the plasticity region ORFs that we analyzed, and no specific signature was dominant with regard to the arrangement or rearrangement of these ORFs. This indicated that all the isolates we examined were in fact independent and did not represent derivatives of clonal evolution.
Specific primers amplifying different alleles (see methods section) were used to analyze the vacA allelic diversity. The sizes of the amplified products for vacA s1 and vacA s2 were 259 bp and 286 bp respectively. Of the 63 isolates analyzed, the s1 allele was detected in 33 (52.3%) and the s2 allele type was detected in 11 (17.4%) strains. The m1 variant was detected in 34 (53.9%) and the m2 variant in 37 (58.7%). The highly toxigenic vacA allele combination s1m1 was found to be dominant (33.3%) as compared to other vacA allele types. The vacA genotype s1m2 was detected in 9 isolates (14.2%) whereas vacA s2m1 and vacA s2m2 genotypes were detected in 4 isolates (6.3%) each. Not all the isolates yielded full vacA amplicons, as regions of the vacA gene, in particular the signal region, posed difficulty in amplification. This is a very common phenomenon observed in H. pylori owing to frequent recombination. The vacA alleles have been shown to differ in frequency and type among East Asian isolates [15]; for instance, s1c is the predominant signal sequence allele among East Asian isolates [16]. Notably, vacA s1c was completely absent in the Indian isolates.
Figure 1 Detailed characteristics of Indian H. pylori isolates used in the study. [Yellow, region amplified or present; Blue, region absent or rearranged; -, region failed to amplify].

Multilocus sequence analysis
We report that almost all of the H. pylori strains from India share significant homology to the members of sub-population hpEurope. A total of 33 MLST profiles based on the DNA sequence of a concatenated multigene comprising 7 individual gene loci (atpA, efp, mutY, ppa, trpC, ureI and yphC) were generated from Indian isolates. Data comprising these MLST profiles were subjected to comparative genomic analysis with ~400 other H. pylori sequences from different geographical and ethnic groups [11]. Such analyses upon construction of a neighbor-joining tree in MEGA 3. [...] Indian populations [18] and the relative homogeneity of Indian populations regardless of their ethnic and linguistic affiliation [19].
Figure 2 Neighbor-joining tree (Kimura 2-parameter) (right) showing the global population structure of H. pylori wherein Indian isolates are highlighted. The phylogenetic tree was based on a total of 23 sequence records of South and North Indian isolates while incorporating ~400 other sequence records from the pubMLST database representing different H. pylori populations and sub-populations in the world. The population genetic structure was investigated by determining the multilocus haplotypes based on concatenated sequences of seven unlinked housekeeping genes that are scattered around the H. pylori chromosome. Individual isolates were assigned to bacterial populations called hpEastAsia (sub-populations: hspEAsia, hspMaori, hspAmerind), hpEurope, hpAfrica1 (hspSAfrica, hspWAfrica), hpAsia2 and hpAfrica2 [11]. Representatives from each of these (sub)-populations were chosen for subsequent analysis of the cagPAI. Isolates from the population hpAfrica2 do not contain cagPAI. Phylogenetic relationships were also estimated through NETWORK analysis (left) based on 665 mutating positions that revealed the co-evolution of the H. pylori genome. The Ladakhi (yellow) and other Indian (light green) lineages were more clearly discerned within the European (dark green) cluster (centre box), when analyses based on the remaining 650 mutating positions were performed. For the neighbor-joining tree (right), the bootstrap values of the interior branches, as calculated in MEGA, were significantly high to indicate the correct topology of the branches within the clades.
Figure 3 Comparative genomic analysis of the cagPAIs from Indian isolates. A) PCR based analysis of the complete cagPAI of 5 representative hpEurope isolates: 3K, 4K, 3C, MS40 and MS38 from India. Overlapping PCR primers amplified the whole cagPAI indicating the intactness of the PAI in these isolates. B) Global pair-wise alignments of whole cagPAI sequences of different H. pylori isolates were generated by VISTA using default parameters [47]. The OK129 genome was taken as the base sequence (not shown) and the rest of the sequences were aligned against it. The X-axis denotes length of the sequence under consideration and the Y-axis conveys homology in % with the base genome sequence. The Indian hpEurope isolate, 3K, was aligned with other whole cagPAI sequences from GenBank along with the cagPAIs of HP 26695, HPJ99 and HPAG1. The accession numbers for the public domain sequences of the cagPAIs from Europe [9] and Japan [49] that we used in our analyses were as follows: Ca73 (AY330638 and AY330639), Du23:2 (AY330643 and AY330644), Du52:2 (AY330640, AY330641 and AY330642), F80 (AB120421), OK112 (AB120425), F16 (AB120416), F17 (AB120417), F28 (AB120418), F79 (AB120420), OK101 (AB120422), OK109 (AB120424). Sequence of the French isolate, Fr 908, was determined in this study (EF195721). While the cagPAI sequence of the Indian isolate 3K (hpEurope) was found to be genetically highly similar to and aligning closely with the 26695 sequence, it also revealed significant sequence similarities with other isolates of European origins (that harbor Western type of cag EPIYA sequences) such as HPAG1, OK112, Du52, Du23, Ca73, J99 and Fr908. It was however largely unrelated to the East Asian like isolates (mainly harboring Asian type cag EPIYA sequences) such as F16, F28, F79, OK109, F17, OK101 and F80.

Analysis of the cagPAI and its Right Junction (RJ) motifs
Overlapping primer amplification spanning the entire cagPAI worked reproducibly with our isolates; Figure 3(A) shows the complete PCR output for the ~38 kb cagPAI region in 5 representative strains MS38, MS40, 3K, 4K and 3C. All the constituent genes of the PAI were successfully amplified for all the Indian isolates studied. To gain more insight into the composition and arrangement of the gene loci within the PAI, the cagPAI of isolate 3K was completely sequenced. This isolate was from a patient with peptic ulcer disease (PUD) from South India. The complete cagPAI of this isolate was 36,876 bp in size with a G+C content of 35.9%. The sequence composition and gene order in the cagPAI of 3K were compared to those of the three completely sequenced strains 26695, J99 and HPAG1, which revealed some minor differences such as fused HP0521 and HP0522 genes due to the deletion of a single nucleotide at the 3' end of HP0521. Similarly, single or dinucleotide differences were observed in cagX (HP0528), cagN (HP0538) and cagE (HP0544), and most of these insertions and deletions were observed in the intergenic regions. Broadly, the cagPAI genes were highly conserved with regard to their amino acid sequences when compared with at least 15 different publicly available cagPAI sequences.
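As an illustration of how a G+C content figure such as the one quoted for the 3K cagPAI can be derived from a sequence, the following minimal Python sketch computes the percentage of G and C bases. The function name `gc_content` and the toy fragment are assumptions made for illustration; they are not part of the original analysis, and the 3K sequence itself is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """Return the G+C percentage of a DNA sequence.

    Symbols other than A, C, G and T (gaps, N, etc.) are ignored.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T bases")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


# Hypothetical 12-base fragment, not taken from isolate 3K.
toy_fragment = "ATGCATTTAAGC"
print(round(gc_content(toy_fragment), 1))
```

Applied to a full cagPAI sequence retrieved from GenBank, the same function would give a value comparable to the percentage reported in the text.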
cag-RJ (the extreme right junction of the cagPAI, between the 3' end of the cagA gene and the start of the glutamate racemase gene, glr) was studied for our 63 isolates, where 99% of isolates harbored the type III motif. A total of 47 of 63 strains (75%) gave positive PCR results for cag-RJ (Figure 1). The type III motif was found in 27 of 39 South Indian isolates and 20 of 24 North Indian isolates. It is noteworthy that cag-RJ type III motifs are genetically close to European type I motifs, probably due to an ancient insertion event followed by recombinational scrambling among type I and III lineages [13]. We did not find any type II motifs in our Indian isolates; these constitute a signature characteristic of the East Asian gene pool.
Figure 4 Phylogenetic tree based on the 5' end sequence of the cagA (an informative 219 bp segment of cagA was used to align sequences from unrelated isolates) suggests possible common origins for isolates from ethnic Indians and the tribal [...]
We examined the relatedness of the cagA gene sequences of tribal isolates from India to the mainstream Indian isolates and the European isolates by analyzing a 219 bp informative fragment near the 5' end of cagA which usually distinguishes the European and the East Asian strains [20]. Comparative sequence analysis was used to construct phylogenetic relationships in MEGA3.1. All the sequence records corresponding to the isolates of Santhal and Oraon tribals revealed homologies to the mainstream Indian strains from Hyderabad, Lucknow and Bengal and also to all the representative European strains. These tribal isolates did not cluster with East Asian strains (Figure 4).
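The regional counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below, a minimal illustration rather than part of the original analysis, sums the South and North Indian type III counts and converts them to percentages; the variable and function names are assumptions.

```python
from fractions import Fraction

# Type III cag-RJ counts as reported in the text: (positive, tested).
type_iii = {"South India": (27, 39), "North India": (20, 24)}


def percent(numerator: int, denominator: int) -> float:
    """Exact ratio converted to a percentage."""
    return float(100 * Fraction(numerator, denominator))


total_positive = sum(pos for pos, _ in type_iii.values())
total_tested = sum(n for _, n in type_iii.values())

for region, (pos, n) in type_iii.items():
    print(f"{region}: {pos}/{n} = {percent(pos, n):.1f}%")
print(f"Overall: {total_positive}/{total_tested} = "
      f"{percent(total_positive, total_tested):.1f}%")
```

The regional counts add up to the 47 of 63 strains reported as PCR positive for cag-RJ.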
This makes it clear that the cagPAI of Indian strains is a completely evolved one and probably was acquired from a European source, well before the arrival of H. pylori in India. This is also evident from the fact that the Indian strains, though of a European descent, do not share characteristic features of Asian cagPAIs.

Discussion
Although the Indian peninsula has seen many different waves of population migration [21], the Paleolithic archaeological evidence is not clear enough to explain the peopling of this country [22]. Nonetheless, the Indus Valley and Harappan civilizations portray footprints of the Neolithic period [23] suggestive of the arrival of Indo-European speakers who established the caste system, an anthropologically significant prehistoric event [24,25]. The cultural and historical importance of the arrival and settlement of the Indo-Aryans is undisputed, but it is not clear whether this was established through 'replacement of the existing people by outsiders' [22] or whether the 'people already in India changed their habits and cultures' [22]. Such questions have never been addressed in an unambiguous manner, even though the potential of polymorphic DNA markers in the reconstruction of human migration and phylogeography [26,27] has long been appreciated. It appears that even carefully planned geographic genomics studies have remained largely speculative due to the lack of a universal 'gold standard', as the classical mitochondrial DNA markers offer too few informative polymorphisms and the newly developed Y-chromosome markers are even less polymorphic than mitochondrial hypervariable regions [2]. Lately, new genetic models have been successfully harnessed based on parasites and pathogens that probably accompanied their human host during evolution and much of human history, including migrations and expansions [2,4,5] in different continents. Such approaches constitute an attractive alternative for reconstructing human origins and spreads, population dynamics and bottlenecks, wars and displacements, farming and plagues etc.
Our study was aimed at tracking the ancient origins of Indian H. pylori through a two-pronged approach: i) to substantiate the European link of the pathogen in India and ii) to show that the pathogenicity island was also of European origin and has not been a 'recent' addition to the genome of Indian H. pylori. Our analyses, based on MLST and comprehensive genotyping of the cagPAI, linked nearly all of the Indian isolates to the H. pylori sub-population hpEurope. This suggests that H. pylori was most probably introduced to the Indian subcontinent by ancient Indo-European nomadic people; our findings are therefore consistent with the idea of a possible gene flow into India with the arrival of Indo-Aryans.
Overall, based on the MLST data (Figure 2) and the cagPAI patterns (Figure 3), we suggest that H. pylori might have arrived in India at about the same time as Indo-European language speaking people crossed into India (~4000-10,000 years before present). Alternatively, the unquestionable common origin of Indian strains with the European ones could actually be more ancient, following the upper Paleolithic spread of Homo sapiens in Eurasia, as suggested by mtDNA variability [18]; our H. pylori MLST data do not rule out this possibility.
Present day India represents a 'genetic playground' with tremendous diversity of cultures and languages. However, the people are largely stratified as tribals and non-tribals [25]. Four main language families are spoken: the largest, Indo-European (IE), is prevalent in the North, and the second largest, Dravidian (DR), comprises languages spoken in the South [28]. The other two language groups are Tibeto-Burman (TB) of the Sino-Tibetan family and the Austro-Asiatic (AA) family, largely spoken in the far North and North-east India. While most of the IE speakers belong to castes, the majority of the tribal communities (>450) speak about 750 different dialects that fit within one of the other three language families (DR, TB, AA) [25,28]. Such enormous cultural diversity might argue for many different populations and sub-populations of H. pylori. But until now, and including this study, only H. pylori with genetic features of hpEurope have been reported from India [29,30]. Even the newly described sub-population hpAsia2 from Ladakh is a variant of hpEurope, and many Ladakhi strains that we examined in this study clustered with the European H. pylori clade (Figure 2). Also, the cagA sequences from H. pylori of the tribal Oraon and Santhal people were indistinguishable from those of mainstream Indians and Europeans (Figure 4), indicating a sweeping spread of a single H. pylori genotype across the Indian peninsula. Moreover, we did not document the presence of any other H. pylori populations and sub-populations such as hspAmerind, hspMaori, hpAfrica and hpEastAsia in the limited, but representative, culture collection that we examined. However, the visible footprints of other migrations into India, such as from the North Eastern corridor, and the presence of phenotypic features resembling those of Africans in the South make it unwise to presume an 'H. pylori free India' at the time of arrival of Indo-European speaking invaders.
Given this issue and the fact that H. pylori's first association with humans traces back millions of years before present, in Africa [6,17], it is more realistic to hypothesize that H. pylori of the African and Asian gene pools might have already been present in India. The predominance of a single H. pylori population might therefore point to a distinct survival advantage conferred by a fully functional (western type) cagPAI. This is analogous to the scenario we previously reported [3] for the South American Amerindian strains, which were presumably out-competed by their Spanish counterparts arriving with an intact and functional western cagPAI.
Finally, it is possible that phylogeny based on highly recombining gene loci [15,29,31,32,33,34,35] may not be completely foolproof for inferring inheritance from different ancestral populations, especially when we use tools such as MEGA 3.0 [36], which do not support admixture analysis. Moreover, phylogenetic methods based on bifurcating trees, such as neighbor-joining analysis, may not be fully appropriate for analysis at the intra-species level [37,38], especially in the case of hypervariable genomic regions, where multiple homoplasy due to reversions, recurrent mutations etc., or polytomy may sometimes confound the phylogenetic interpretation. However, the housekeeping genes used here are selectively neutral and uniform as compared to virulence-associated loci such as the flagellins and vacA [10]; therefore, recombinant and hybrid alleles that blur lineage inferences could be a rare occurrence rather than routine. Partly in view of this assumption, and due to our previous experience in dissecting the complex ancestry of native Peruvian isolates using phylogenetic methods [3], we did not attempt admixture analysis with complicated Bayesian statistics. However, to ensure that our conclusions did not reflect the shortcomings of a single method, we adopted an integrated phylogenetic approach combining MEGA/NETWORK based analyses and a genotyping strategy based on the full cagPAI and its left and right end sequences. Interestingly, these approaches unambiguously show the Indian H. pylori genotypes scattered among the European ones. This would be consistent with gene flow into India with the Indo-Aryans, or with even more ancient origins following the Paleolithic expansion of humans in Eurasia, but it is also consistent with another scenario: migration from India to Europe. However, the latter scenario is less plausible given the lack of supporting archaeological, linguistic and historical data.
Nonetheless, an understanding of the time-scale would be helpful for choosing between such explanations, with the estimation of divergence times between the H. pylori sequences in the different human populations. These issues therefore need to be addressed in future.

Conclusion
In summary, we found significant overlap among genetic identities of Indian and European H. pylori based on core and flexible genome markers. This remarkable genetic similarity points to their possible common genetic origins and could therefore be potentially useful in understanding entry, survival, spread and adaptation of H. pylori in Indian stomachs. Also, this study is consistent with the hypothesis of co-evolution of H. pylori with H. sapiens and therefore, could form a reliable foundation to test and reconstruct gene flow into India with the arrival of Indo-Aryans or otherwise.

Bacterial strains, genomic DNA and diagnostic PCR
All the strains were cultured from patient biopsies by the Centre for Liver Research and Diagnostics, Deccan College of Medical Sciences, Hyderabad. All the biopsy material was collected with the necessary ethical clearances and after obtaining informed consent. Template DNA was prepared from single colony picks as described previously [39]. [...] (Figure 1). However, in the current study, the clinical background of the individual isolates was not taken into account. The Indian isolates we examined (n = 63) were originally from native Indian people, mainly of Aryan and Dravidian ancestry. PCR-based analyses of the genes cagA, glmM, babB [14] and oipA were carried out to ascertain the quality of the DNA samples we used. These PCR assays also served as amplification-level controls for the analysis of insertions, deletions and substitutions in the cagPAI.

MLST analysis by MEGA 3.1 and NETWORK 4.2.0
A 600 bp region from each of the 7 housekeeping genes spread throughout the genome (atpA, efp, ureI, ppa, mutY, trpC and yphC) was amplified by PCR and sequenced for all the Indian isolates exactly as described previously [3]. Sequencing was performed on both strands, using an ABI Prism 3100 DNA sequencer (Applied Biosystems, USA). PCR and direct sequencing were performed at least twice to determine and confirm the DNA sequences for each isolate. A consensus sequence for each sample was generated using GeneDoc (version 2.6.002). Multiple alignments of the sequenced nucleotides were carried out using Clustal X (version 1.81). Neighbor-joining trees were constructed in MEGA 3.0 [36] using the Kimura 2-parameter model with 10,000 bootstrap replicates. To construct phylogenetic trees based on MLST genotyping procedures, ~400 sequences of the 7 housekeeping genes of strains belonging to different established genotypes, including 40 sequences of isolates from Ladakh, were obtained from the pubMLST database [40] (courtesy, Daniel Falush). The Indian H. pylori diversity represented in the final MEGA 3.0 alignment and the resulting tree comprised a total of 63 sequences, inclusive of the 10 Ladakhi sequences generated in house along with the other 9 representative Ladakhi sequences from the database. We performed a network analysis on the MLST sequence data using the program Network 4.2.0.0 [38,41]. In particular, the median-joining algorithm for multistate DNA data was used [42,43]. Because the program cannot handle more than 1000 polymorphic sites at once, we performed the analysis separately on two halves of the sequence (encompassing 650 and 665 polymorphic sites, respectively). The input file (in *.rdf format) was obtained using the commercial software DNA Alignment 1.1.2.1.
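To make the two quantities in this section concrete, the following Python sketch computes a Kimura 2-parameter distance between two aligned sequences and lists the polymorphic columns of an alignment. It is a minimal illustration under stated assumptions, not the MEGA or NETWORK implementation: the function names and toy sequences are hypothetical, and gaps or ambiguous bases are simply skipped. The distance uses the standard formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous symbols
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def polymorphic_sites(alignment: list[str]) -> list[int]:
    """Return 0-based column indices where more than one base occurs."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    sites = []
    for i in range(length):
        bases = {seq[i].upper() for seq in alignment} & set("ACGT")
        if len(bases) > 1:
            sites.append(i)
    return sites


# Hypothetical toy alignment, not real MLST data.
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACATACGTAC"]
print(polymorphic_sites(toy))
print(round(k2p_distance(toy[0], toy[1]), 4))
```

Counting polymorphic columns in this way is what determines whether an alignment stays under a per-run site limit such as the one described for Network; how the authors chose the boundary between the two halves is not stated in the text.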

Profiling of the cagA gene, the whole cagPAI and its right junction
The 5' end of the cagA gene was amplified using primers described elsewhere [44] and the amplified products were sequenced with forward and reverse primers. The consensus sequences were then translated into amino acid sequences using GeneDoc software (version 2.6.002) and assigned to the Western or the East Asian group based on the C or D repeats, respectively, present in the EPIYA motif [45]. The cagA 5' end sequences of our Indian isolates MS15, MS7, MS4 and 3K, along with those of 26695 and J99, were compared to other records from GenBank [20,30,46]. A neighbor-joining phylogenetic tree was constructed from these sequences using MEGA 3.1 (Figure 4).
PCR analyses were carried out to determine the status of the cagPAI using 8 sets of primers that amplified the cagA gene, its promoter region, the cagE and cagT genes and the left end of the cagPAI [8,29,34]. We also analyzed the whole cagPAI of the representative isolates from India (3K, 4K, 3C, MS40 and MS38) by PCR using overlapping primers as described by Blomstergren and colleagues [9]. The entire cagPAI sequence of a single representative Indian isolate, 3K, was determined. The complete cagPAI sequence was aligned using the VISTA programme [47] against other PAI sequences belonging to strains 26695, J99, HPAG1 and 13 other clinical isolates corresponding to the H. pylori sub-populations hpEurope, hpEastAsia and hpAfrica1 (Figure 3B).
Chromosomal rearrangements are known to give rise to 5 types of insertion-deletion and substitution motifs in the region between the right end of cagA gene and the glutamate racemase (glr) gene (cag-RJ). We assessed these rearrangement profiles for all of the Indian isolates by PCR as described earlier by Kersulyte and colleagues [13].

Analysis of the chromosomal plasticity region cluster
Chromosomal plasticity region ORFs were assessed for all the 63 Indian isolates by PCR-based typing to ensure that all the strains we examined were independent and non-clonal by descent. The PCR primers and the procedures used for evaluating the presence of the plasticity region ORFs (JHP912, HP986, JHP947, JHP926, JHP944, JHP931, JHP945 and JHP933) have been described previously [48].
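One way to summarise such presence/absence typing is to count how many isolates share each ORF profile and to ask whether any single profile dominates. The Python sketch below illustrates this idea only; the isolate labels and PCR calls are invented for the example, the helper names are assumptions, and the authors do not describe the criterion they applied.

```python
from collections import Counter

# ORF names as listed in the text.
ORFS = ["JHP912", "HP986", "JHP947", "JHP926",
        "JHP944", "JHP931", "JHP945", "JHP933"]


def profile_counts(results: dict[str, dict[str, bool]]) -> Counter:
    """Count isolates sharing each presence/absence profile across ORFS."""
    return Counter(
        tuple(calls[orf] for orf in ORFS) for calls in results.values()
    )


def dominant_fraction(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of isolates carrying the most common profile."""
    counts = profile_counts(results)
    return counts.most_common(1)[0][1] / len(results)


# Hypothetical PCR calls for three made-up isolates (True = amplified).
example = {
    "isolate_A": dict.fromkeys(ORFS, True),
    "isolate_B": {**dict.fromkeys(ORFS, True), "HP986": False},
    "isolate_C": {**dict.fromkeys(ORFS, False), "JHP947": True},
}
print(len(profile_counts(example)), "distinct profiles")
print(f"most common profile covers {dominant_fraction(example):.0%} of isolates")
```

A low dominant fraction across many distinct profiles would be in line with isolates that are independent rather than clonal derivatives.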

Authors' contributions
SMD and IA performed and analyzed MLST, all other genotyping experiments and phylogenetic analysis. SMD also helped in analysis of babB and oipA genotyping. MAA performed vacA genotyping. IA also performed H. pylori isolation and culture. YA carried out in silico analysis of the cagPAI sequences. PF performed Network analysis on MLST data and contributed to manuscript writing. LAS and FM provided expert clinical and epidemiological support and contributed to discussions and manuscript writing. NA planned and supervised the study, edited the final draft of the manuscript and provided overall leadership. All the authors read and approved the final manuscript.