Comparative genomics reveals a lack of evidence for pigeons as a main source of stx2f-carrying Escherichia coli causing disease in humans and the common existence of hybrid Shiga toxin-producing and enteropathogenic E. coli pathotypes

Background Wild birds, in particular pigeons are considered a natural reservoir for stx2f-carrying E. coli. An extensive comparison of isolates from pigeons and humans from the same region is lacking, which hampers justifiable conclusions on the epidemiology of these pathogens. Over two hundred human and pigeon stx2f-carrying E. coli isolates predominantly from the Netherlands were analysed by whole genome sequencing and comparative genomic analysis including in silico MLST, serotyping, virulence genes typing and whole genome MLST (wgMLST). Results Serotypes and sequence types of stx2f-carrying E. coli showed a strong non-random distribution among the human and pigeon isolates with O63:H6/ST583, O113:H6/ST121 and O125:H6/ST583 overrepresented among the human isolates and not found among pigeons. Pigeon isolates were characterized by an overrepresentation of O4:H2/ST20 and O45:H2/ST20. Nearly all isolates harboured the locus of enterocyte effacement (LEE) but different eae and tir subtypes were non-randomly distributed among human and pigeon isolates. Phylogenetic core genome comparison demonstrated that the pigeon isolates and clinical isolates largely occurred in separated clusters. In addition, serotypes/STs exclusively found among humans generally were characterized by high level of clonality, smaller genome sizes and lack of several non-LEE-encoded virulence genes. A bundle-forming pilus operon, including bfpA, indicative for typical enteropathogenic E. coli (tEPEC) was demonstrated in 72.0% of the stx2f-carrying serotypes but with distinct operon types between the main pigeon and human isolate clusters. Conclusions Comparative genomics revealed that isolates from mild human disease are dominated by serotypes not encountered in the pigeon reservoir. It is therefore unlikely that zoonotic transmission from this reservoir plays an important role in the contribution to the majority of human disease associated with stx2f-producing E. coli in the Netherlands. Unexpectedly, this study identified the common occurrence of STEC2f/tEPEC hybrid pathotype in various serotypes and STs. Further research should focus on the possible role of human-to-human transmission of Stx2f-producing E. coli. Electronic supplementary material The online version of this article (10.1186/s12864-019-5635-z) contains supplementary material, which is available to authorized users.


Background
Shiga toxin-producing Escherichia coli (STEC) is a group of bacterial pathogens whose infection in humans is associated with varying clinical manifestations, including diarrhoea, haemorrhagic colitis and (occasionally fatal) haemolytic uremic syndrome (HUS) [1]. The production of Shiga toxin (Stx1 and/or Stx2 variants) is a cardinal virulence factor of this group of pathogens [2]. STEC is generally considered zoonotic with ruminants, and in particular cattle and sheep, as the main reservoirs [3,4]. In addition, there is evidence for birds, dogs, horses and pigs being additional reservoirs and/or spill-over hosts for STEC [5]. This implies that there may be other epidemiologically relevant sources of human STEC infection beyond ruminants.
STEC harbouring the stx 2f variant are frequently found in pigeons [6][7][8][9] and occasionally in other bird species [10], but have never been reported in ruminants. Initially, stx 2f -carrying E. coli were thought to be pigeon adapted with a limited impact on disease in humans. However, reports from several countries imply that infections with stx 2f -carrying E. coli are more common than anticipated [11][12][13]. In the Netherlands, they constituted 16% of all STEC infections in the period 2008-2011 but infections were generally associated with a relative mild course of the disease [13,14].
The occurrence of stx 2f -carrying strains in pigeons as well as in humans is suggestive for these birds being a zoonotic reservoir for human infection. Whole genome characterisation and strain comparison indicated that stx 2f -carrying E. coli from pigeons, humans with mild disease, and HUS patients belonged to three distinct sub-populations [15] with a certain but limited overlap between pigeon and human isolates with respect to serotypes and MLST [9,11]. Whether this overlap is sufficient to explain the epidemiological situation and justify the conclusion that human clinical isolates originate from a pigeon reservoir (directly by strains or indirectly by phages) remains under debate. To date, an extensive comparison of isolates from pigeons and humans from the same region is lacking, which hampers justifiable conclusions on the epidemiology of stx 2f-carrying E. coli.
With this study, an in-depth genomic comparison of stx 2f-carrying E. coli from pigeons and humans from the Netherlands is provided.

In silico analyses; typing
Analysis of the rpoB gene was used to confirm that the stx 2f -carrying isolates were really E. coli and not E. albertii as some authors have suggested [16]. In silico rpoB screening and phylogenetic analysis of the resulting alignment demonstrated that nearly all stx 2f -carrying isolates included in this study were E. coli except for four Dutch isolates (two human and two pigeon) and one from the UK. Three of these Dutch isolates displayed an ONT:Hserotype, while the fourth was typed to O115:H52. MLST typing according to the E. coli scheme resulted in two known STs (ST2681 (n = 2) and ST2680) and two new ones (see Additional file 1), however the rpoB sequence of all four isolates clearly cluster them among E. albertii (Fig. 1). A closer look at the UK strain SRR6144114 in the ENA database confirmed that this is indeed an E. albertii isolate rather than an E. coli.
Surprisingly, the major structural subunit of bundle-forming pilus determinant bfpA was demonstrated in the majority of pigeon and human isolates ( Table 2). In total, 72.0% (157/218) of these STECs harboured this typical enteropathogenic E. coli (tEPEC) determinant. However, none of the O113:H6 isolates (n = 25) nor the three Italian HUS isolates contained bfpA ( Table 2, Additional file 2). In total nine different bfpA alleles were identified in the entire isolate set investigated. Eight of them belonged to a different subgroup clearly separated from the well-known alpha and beta subtypes (Fig. 2, Additional files 2 and 3). In addition, one novel beta allele was characterized in two strains.

Comparative genomics and phylogenetic analysis
Comparative whole genome MLST (wgMLST) analysis of the 218 stx 2f -carrying E. coli predominantly displayed a clustering of the isolates according to serotype/ST with a general clustering of pigeon and human isolates along the phylogenetic tree which follows the observation of a clear non-random distribution of serotypes/STs among human and pigeon isolates (Fig. 3). The phylogenetic tree also showed shorter branch lengths of the serotypes exclusively found in humans (O63:H6, O113:H6, O125:H6) compared to the others that show overlap in occurrence between humans and pigeons (O4:H2, O45:H2 and O128:H2), indicative for a stronger clonal relation among the types exclusively found in humans. This was confirmed by looking in more detail at the number of genes different within the top eight serotypes investigated in this study ( Table 3). The strict human associated types showed significant lesser number of (T-test, P = 0.011). The pathogenicity island LEE was shown to be present in nearly all E. coli isolates included in the study (n = 212). It encoded the intimin adhesin gene eae, but also the well-known effector proteins EspB, EspF, EspG, EspH, EspZ, Map and Tir. Phylogenetic analysis zooming in on the 42 genes of LEE only, displayed a very similar clustering as wgMLST analysis (Additional file 4: Figure S1). Over 60 genes belonged to the stx 2f -phage including important determinants like cro, cI, int, capsid and tail structural genes and packaging genes. However, some of the genes normally involved in infection and propagation of Stx phages, such as cII, cIII, N Q, O, and P seemed to be absent. Consequently, immunological VERO cells test assays were performed to determine whether the phage was active. Shiga toxin-production was confirmed for a selection (n = 18) of strains belonging to various serotypes (data not shown). The phylogeny of over 60 stx 2f -phage associated genes revealed a more scattered distribution of the various serotypes (Additional file 4: Figure S2).
The unexpected result of the high prevalence of the bfpA gene among stx 2f -carrying E. coli isolates required subsequent genetic studies (RAST and BLAST). This Fig. 3 Neighbor-Joining phylogenetic tree of 218 stx 2f -carrying E. coli isolates based on wgMSLT data. The phylogenetic tree is constructed on a distance matrix calculated from the different allele numbers of the wgMLST scheme. The colours represent the various serotypes. Each isolate is indicated by the country of isolation, the year of isolation and its origin revealed that bfpA was located in a cluster of 14 genes, i.e. the bundle-forming pilus (BFP) operon. Phylogenetic analysis of this operon showed a clear separate clustering of O4:H2, O45:H2, and O128:H2 from the rest of the serotypes (Fig. 4).
The global regulator elements of BFP the so-called perABC (also known as bfpTVW) was not found in any of the bfpA positive strains.
Since BFP is commonly associated with FIB/FIIA plasmid families, this association was also investigated. In silico analysis revealed that in all BFP positive isolates the FIB repA gene was present and was almost always located on the same contig as bfpA, suggesting these genes were co-localized on the same plasmid. Because the assemblies concern draft genomes the FIIA repA was not always on the same contigs as FIB repA and bfpA.  Consequently, it is not known whether these FIIA rep genes are part of the BFP plasmid. In addition to repA, various other specific incF plasmid genes were encountered like the conjugative transfer protein genes traB-D, traF-I, traN, traP-R, traU-X, trbA-F and trbI.
In silico analyses; genome size A marked difference in genome sizes of the isolates was identified. The genomes of the serotypes O63:H6, O125:H6, O113:H6, O132:H34 and O145:H34 were similar in size to non-pathogenic E. coli and enteropathogenic E. coli (EPEC).

Discussion
Earlier studies emphasized the existence of a strict association between STEC carrying the stx 2f gene and pigeons, with limited impact on disease on humans [6,7,17]. However, reports from several countries imply that infections with stx 2f -carrying E. coli are more common than anticipated [11][12][13]. In the Netherlands, stx 2f -carrying E. coli constituted 16% of all STEC infections in the period 2008-2011, but were generally associated with a relative mild course of the disease [13]. As several STEC assays targeting stx genes are not capable of detecting the 2f variant, limited data on stx 2f -carrying E. coli from human infections in other countries are available due to under-diagnosis [18]. based on wgMLST, LEE island, and the BFP operon. Third, the strict associated human types, in contrast to the other types, tend to be highly clonal. Fourth, the genomic characteristics of the strict human associated types and pigeon types differ regarding genome size and virulence factor composition. In addition, an unexpected but important finding of the present study was that the majority of the stx 2f -carrying E. coli (72.0%) carried cardinal genes for tEPEC (BFP operon) as well as for STEC (stx 2f ), suggesting the existence of hybrid STEC/tEPEC strains. STEC serotypes can be strongly associated with specific reservoirs [4]. Besides a report on the isolation from shellfish and the associated production water (possibly contaminated with urban wastewater) [19] the dominant stx 2f -carying serotype O63:H6 in the present study has regularly and exclusively been reported from humans [11,13,14,20]. A weakness of the presented data is the possible under-sampling of the pigeon reservoir, which could have resulted in an underestimation of the circulating diversity. This was statistically confirmed by rarefaction analysis (Additional file 5). However, a probability analysis showed that if the common Dutch strict human associated serotypes (O63:H6, O113:H6, O125:H6, O145:H34) do actually occur in the pigeon reservoir with the same distribution as among humans we would have isolated them even with the current sample size (Additional file 5). The absence of these serotypes in pigeons and wild birds confirms the finding of a few other studies, although it was not clear whether this always concerned STECs [9,21,22]. Together with the observed high level of clonality, this strongly suggests that these common human associated stx 2f -carrying strains are not originating from the pigeon reservoir.
In this study, the majority of the STEC strains (carrying stx 2f ) is identified to simultaneously be tEPEC (defined by the presence of bfpA). The presence of both bfpA and stx 2f in E. coli strains is not new since it has been reported before. For example, Hazen et al. [23] demonstrated both genes in a human O128:H2 strain (STEC_H.1.8), which has been included in the current study. In addition, very recently, a study was published by Gioia-Di Chiacchio et al. [24], describing O137:H6 strains from a cockatiel and a budgerigar carrying both bfpA and stx 2f . However, our present study describes the occurrence of STEC/tEPEC hybrids on a far larger scale and among various E. coli serotypes and in different phylogenetic groups. The serotypes O63:H6, O125:H6, O132:H34 and O145:H34 all have been described earlier as (typical) EPEC [11,25,26]. While atypical EPEC (i.e. LEE-positive, bfpA-negative, stx-negative) have both animal and human reservoirs, tEPEC have a strict human reservoir [27,28]. In addition, tEPEC is most often not associated with typical severe STEC symptoms like bloody diarrhea and HUS but seems to be linked to milder but more persistent symptoms [13,27,29], which is similar as observed for stx 2f -carrying E. coli infections [13,27,29]. Surprisingly, the results of the present study demonstrated that also the majority of pigeon associated strains were identified as STEC/tEPEC hybrids. However, as described in this study the genomes of pigeon and human hybrid STEC/tEPEC show considerable differences. First, the genome sizes of the hybrids belonging to the strict human associated serotypes were generally smaller and more resembling EPECs while the hybrids belonging to other serotypes were significantly larger and more resembling STECs. Second, similar to Grande et al. [15], several non-LEE encoded type III effector STEC virulence determinants (nleA, nleB and nleC) were demonstrated only in strains from the pigeon associated cluster (including a limited number of human isolates) and in the clinical HUS isolates, while absent from the majority of the human associated hybrids associated with relatively mild disease. Although strains commonly encountered in relatively mild disease among humans are not found in the pigeon reservoir, some overlap between pigeons and humans can be seen regarding the more typical virulent STEC strains. Finally, pigeon and human isolates showed clear distinct BFP operon types.
Altogether, the emerging picture suggests that the stx 2f -carrying E. coli and stx 2f /tEPEC hybrids commonly encountered in relatively mild human disease do not directly originate from the pigeon reservoir. Although sporadically isolated from other sources it is possible that these mild disease strains do not have a zoonotic reservoir at all in terms of an animal species in which the pathogen is maintained and shed. Similarly no animal reservoirs have been identified for other STEC hybrids like stx-EAEC O104:H4 [30][31][32] and stx-ExPEC O80:H2 [33,34], which also show strong clonal relations [35]. In addition, it was demonstrated that the strain involved in an outbreak of STEC O117:H7 linked to transmission among men who have sex with men was characterized by a significantly smaller genome size compared to STEC O157 and O26 [36]. Moreover, the genomic relationships were consistent with existing symptomatic evidence for chronic infection with this O117:H7 serotype.

Conclusions
Pigeons should not be regarded as the most likely direct source of the most frequent encountered stx 2f-carrying E. coli types encountered in relatively mild human disease. Humans themselves may be the more plausible reservoir for the majority of milder infections with this pathogen. This study also showed the unexpected common existence of STEC/tEPEC hybrids among pigeon and human isolates although in different reservoir dependent genomic backbones (i.e. genome size, virulence genes, BFP operon type). The occurrence of the BFP plasmid among non-human isolates should be further investigated with respect to whole plasmid sequence and patho-phenotype of the BFP-carrying pigeon isolates. Possibly a phylodynamic approach would be helpful in elucidating the spread and evolution of this plasmid between isolates of different host species. Phylodynamic studies may also be of value in studying the possible human-to-human transmission of Stx2f-tEPEC hybrids. Finally, further experimental research on the infectivity of the Stx2f phages to E. coli isolates of different sources and of different pathotypes may be informative on their potential spread.

Methods
Stx2f -carrying E. coli strains Most of the Dutch human isolates (n = 119) originated from the collection held at the National Institute for Public Health and the Environment in the Netherlands (RIVM) and were collected as part of the national surveillance programme (2008-2017) [13]. Some additional isolates originated from the STEC-ID-net study and were isolated from the faeces of hospitalized patients or patients visiting their GP with (bloody) diarrhoea (n = 10) [14]. Thirteen Dutch pigeon isolates included in this study were obtained from a small study among pigeon droppings in the Netherlands in 2016. In total 140 pigeon faeces were sampled for the presence of Stx2f-producing E. coli according to ISO/TS 13136:2012. A prevalence of 9.3% was found among racing pigeons as well as free living pigeons in urban environments (data not shown). Two leafy green and one livestock isolates were also included in the study.
Besides Dutch isolates international ones were also included in this study in order to provide genomic context (n = 78). Raw reads or assemblies of non-Dutch isolates were recovered from publicly available databases; European Nucleotide Archive (ENA (www.ebi.ac.uk/ena)) and Escherichia/Shigella Enterobase (enterobase.warwick.ac.uk/species/ecoli/). The Italian isolates (n = 11) originated from a previous study [15]. An overview of the 223 isolates included in this study and their characteristics can be found in Additional file 1.

Whole genome sequencing
The sequencing of the Dutch strains was performed on various Illumina platforms (Illumina, San Diego, CA, USA), i.e. MiSeq PE300, HiSeq 2000 and HiSeq 2500 with the appropriate Illumina library protocols.

Assembly statistics and genome size analysis
The assemblies were assessed using the assembly file statistics of SeqSphere + 4.1.9 software (Ridom GmbH, Münster, Germany [37]). Various characteristics were determined like contig count, N50 and genome sizes.
To compare genome sizes of the various stx 2f -carrying E. coli serotypes against those of different E. coli pathotypes the Escherichia/Shigella Enterobase database was consulted. The following pathotypes aEPEC (atypical enteropathogenic E. coli), EIEC (enteroinvasive E. coli), ETEC (enterotoxigenic E. coli), ExPEC (extraintestinal pathogenic E. coli), STEC (Shiga-toxin producing E. coli) and UPEC (uropathogenic E. coli) were looked up as searches in the Field "Simple Patho" via the enterobase. warwick.ac.uk/species/ecoli/search_strains (this search was performed on 01-03-2018). Genome sizes of the selected pathotypes were registered and together with the stx 2f -carrying E. coli serotypes were compared by box plot analysis.
In silico MLST analysis, serotyping and determination of virulence and antimicrobial resistance genes Individual gene phylogeny of rpoB was generated after in silico analysis of this determinant and extraction of the nucleotide sequences using SeqSphere + . An alignment and maximum-likelihood tree using the Kimura [38] two-parameter model of distance estimation was made using Seaview Version 4.5.4 [39].
In silico serotypes were determined using the Seq-Sphere+ software by screening the assemblies for the presence of O-type (wzm, wzt, wzx and wzy) and H-type genes (fliC) as previously described [41].
Additionally the assemblies were analysed for the presence/absence of E. coli virulence genes. The sequence information for most of these genes was retrieved from the Center for Genomic Epidemiology database (bitbucket.org/account/user/genomicepidemiology/projects/ DB), but some gene clusters were added from own local databases and literature searches, e.g. bfpA, cdtI-cdtV, espB. Again SeqSphere+ was used to screen the assemblies for over a hundred virulence genes (see [42]).

Comparative genomics and phylogenetic analysis
KmerFinder 2.4 [43] (cge.cbs.dtu.dk/services/KmerFinder/) was used to determine the best matching E. coli isolates to the seven most prevalent serotypes of the stx 2f -carrying isolates. The best matches were E. coli O18:H7 strain IHE3034 (NC_017628.1, 5,108,383 bases) with 5179 genes with coding sequences (CDS) and E. coli O103:H2 strain 12009 (NC_013353.1, 5,449,314 bases), 5698 genes with CDS. Both complete genomes (strain IHE3034 as reference and strain 12009 as query) were used to design a whole genome multilocus sequence typing (wgMLST) scheme with the SeqSphere+ software to determine the genomic relatedness of the E. coli isolates included in this study. The target scan procedure details were set to 90% required identity and 100% required percentage aligned to the reference sequence. In total, 3365 targets were defined for core genome MLST (3,221,601 bases), while 1401 were assigned as accessory targets (1,111,383 bases). Overall, 413 targets were discarded because either homologous genes were encountered or they were missing a stop codon.
The assemblies of at least one representative of each serotype included in this study was annotated by RAST [44]. These RAST annotations were used to investigate certain areas of the genome, like the pathogenicity island locus of enterocyte effacement (LEE), stx 2f -phage and a bundle-forming pilus (BFP) plasmid, in more detail. The annotations helped to determine the composition of these genetic elements and enabled phylogenetic analysis after extraction of these specific parts from the assemblies. First, the contigs where the genes of interest were located, were recovered from the assemblies of the stx 2f -carrying isolates. For the LEE island this concerned the intimin determinant eae, for the stx 2f -phage the two stx subunits and in the case of the BFP plasmid the major bundle-forming pilus gene bfpA. Next the length of these contigs was determined. The longest contigs in the most common serotypes of the stx 2f -carrying isolates were used to assess the genomic structures of each of these three areas. Basic local alignment search tool (BLASTn) analyses were also included in this analysis [45]. In this way, the 42 genes which compose the LEE island of the various stx 2f -carrying E. coli serotypes were identified and used to develop a MLST scheme with SeqSphere+. Over 60 genes were characterized to belong to the stx 2f -phage. They were used to setup a core phage MLST scheme using SeqSphere+. The genes belonging to BFP plasmids were also recovered from the annotations and assemblies, resulting in a core plasmid MLST scheme of over 70 genes.

Statistical analysis
Rarefaction analysis; To test whether the pigeon associated E. coli population has been sampled enough as to capture the majority of the serotypes the distinct serotypes identified among both human and pigeon isolates were counted. A rarefaction analysis, implemented in EstimateS 9 [46] was run for individual-based abundance data, with 100 runs, randomization of individuals without replacement, and extrapolation to 500 individuals. The result is expressed as curves of the estimated number of serotypes expected to be found for a particular sample size, with associated confidence intervals (Additional file 4: Figure S2).
Bayesian inference of expected proportions of serotypes in pigeons; The extent to which certain serotypes could be absent in the sample retrieved from pigeons due to undersampling can be quantitatively evaluated by calculating the probability of observing more isolates of that particular serotype (p_higher in Additional file 5: Table S3) than actually observed in the sample. In the hypothesis that there is no difference between the distribution of the serotypes between humans and pigeons, the distribution of the various serotypes in the human sample was used as a beta prior to inform the binomial distribution of the isolates of the corresponding serotypes in the pigeon sample. For this, the beta binomial cumulative distribution was evaluated using the function pbetabinom.ab (q, size, shape1, shape2, log.p = FALSE) implemented in the R package VGAM [47]. Function arguments were: the number of pigeon isolates of a particular serotype (q) and the total number of isolates sampled from pigeons (size), the number of human isolates of the same particular serotype (shape1) and the number of human isolates of other serotypes (shape2). Hence, the prevalence of isolates from pigeon in a particular serotype was evaluated against the prevalence of the same serotype in the human sample.

Additional files
Additional file 1: Table S1. Characteristics of all 223 stx 2f -carrying strains analysed in this study; accession number, country, source, isolation year, in silico serotype, in silico ST, rpoB sequence analysis and assembly statistics. (XLSX 43 kb) Additional file 2: Table S2. Absence/Presence of E. coli virulence factors in 223 stx 2f -carrying strains. (XLSX 119 kb) Additional file 3: Fasta format of the bfpA gene sequences detected in this study. (FASTA 5 kb) Additional file 4: Figure S1. Neighbor-Joining phylogenetic tree of 212 LEE-positive E. coli isolates based on the 42 genes of locus of enterocyte effacement. The colours represent the various serotypes. Each isolate is indicated by the country of isolation, the year of isolation and its origin. Figure S2. Neighbor-Joining phylogenetic tree of 218 stx2f-carrying E. coli isolates based on the core phage MLST data. The colours represent the various serotypes. Each isolate is indicated by the country of isolation, the year of isolation and its origin. (DOCX 1158 kb) Additional file 5: Figure S3. Rarefaction analysis. The two lines indicate the number of serotypes (S (est) on the y-axis) relative to the sample size (number of samples on the x-axis), with the observed values represented by the continuous lines and the extrapolated values represented by the dotted lines. The confidence intervals are marked by the shaded areas. Table S3. Bayesian inference. The Table summarizes the observed number of isolates of each particular serotype in both the human and pigeon samples. The probability of observing a higher number of isolates in the pigeon sample, given that the serotypes distribution was the same as for the human samples, is given in the p_higher column for each of the respective serotypes. (DOCX 60 kb)