Gene duplications in the E. coli genome: common themes among pathotypes

Background Gene duplication underlies a significant proportion of gene functional diversity and genome complexity in both eukaryotes and prokaryotes. Although several reports in the literature described the duplication of specific genes in E. coli, a detailed analysis of the extent of gene duplications in this microorganism is needed. Results The genomes of the E. coli enteroaggregative strain 042 and other pathogenic strains contain duplications of the gene that codes for the global regulator Hha. To determine whether the presence of additional copies of the hha gene correlates with the presence of other genes, we performed a comparative genomic analysis between E. coli strains with and without hha duplications. The results showed that strains harboring additional copies of the hha gene also encode the yeeR irmA (aec69) gene cluster, which, in turn, is also duplicated in strain 042 and several other strains. The identification of these duplications prompted us to obtain a global map of gene duplications, first in strain 042 and later in other E. coli genomes. Duplications in the genomes of the enteroaggregative strain 042, the uropathogenic strain CFT073 and the enterohemorrhagic strain O145:H28 have been identified by a BLASTp protein similarity search. This algorithm was also used to evaluate the distribution of the identified duplicates among the genomes of a set of 28 representative E. coli strains. Despite the high genomic diversity of E. coli strains, we identified several duplicates in the genomes of almost all studied pathogenic strains. Most duplicated genes have no known function. Transcriptomic analysis also showed that most of these duplications are regulated by the H-NS/Hha proteins. Conclusions Several duplicated genes are widely distributed among pathogenic E. coli strains. In addition, some duplicated genes are present only in specific pathotypes, and others are strain specific. This gene duplication analysis shows novel relationships between E. coli pathotypes and suggests that newly identified genes that are duplicated in a high percentage of pathogenic E. coli isolates may play a role in virulence. Our study also shows a relationship between the duplication of genes encoding regulators and genes encoding their targets. Electronic supplementary material The online version of this article (10.1186/s12864-019-5683-4) contains supplementary material, which is available to authorized users.

EAEC adherence to intestinal cells is mediated by a specific type of fimbrial adhesin termed aggregative adherence fimbriae (AAFs). Epidemiological studies have shown that EAEC strains are genetically heterogeneous. A large number of virulence factors have been identified in EAEC clinical isolates [3]. Most of these virulence factors are expressed by only a percentage of all EAEC strains characterized. The strain O104:H4 is an example of EAEC genetic heterogeneity. A few years ago in Germany, this strain caused a large outbreak of bloody diarrhea [4]. Isolates from the O104:H4 outbreak harbor a plasmid (pAA) that encodes, among other virulence factors, the fimbriae that mediate the EAEC type of adherence [5]. Unlike typical EAEC strains, strain O104:H4 contains a prophage encoding the Shiga toxin [6], which is a well-characterized virulence determinant usually expressed by a different E. coli pathotype, EHEC [7].
When analyzing the 042 genomic sequence, we noticed that unlike other E. coli strains, the chromosome of this strain encodes four paralogues of the hha gene: hha, ydgT and the novel hha2 and hha3 genes [14]. The hha gene product, the Hha protein, is representative of a family that includes a group of sequence-related low molecular mass proteins (approximately 8 kDa) involved in gene regulation in enterobacteria. These proteins interact with the nucleoid-associated protein H-NS to modulate gene expression (as reviewed in [15]). The genomes of several enterobacterial isolates, such as Salmonella and E. coli strains, encode a paralogue of the hha gene (the ydgT gene). Orthologues of hha are also present in several conjugative plasmids [16,17]. The presence of the novel chromosomal hha paralogues hha2 and hha3 has been associated with pathogenic E. coli strains that belong to a wide range of pathotypes [14].
Gene duplication underlies a significant proportion of gene functional diversity and genome complexity [18][19][20][21][22]. Gene duplications occur in both eukaryotes and prokaryotes and significantly impact their gene repertoires [18][19][20][21][22][23]. In this work, we first aimed to gain insight into the biological role of the novel hha2 and hha3 genes of strain 042. To this end, we first performed a comparative genomic analysis between strains with and without hha2/ hha3. This approach allowed us to correlate hha2/hha3 with a gene cluster (the flu yeeR gene cluster), which is also duplicated in strain 042. Because strain 042 exhibits the duplication of genes encoding both regulators and the genes likely targeted by regulators, we decided to determine the extent of gene duplications in this strain and in the genomes of other pathogenic E. coli strains. Our analysis uncovers interesting patterns of gene duplications that are common to strains belonging to several E. coli pathotypes, both diarrheagenic and nondiarrheagenic.
For the identification of putative duplicates, we retrieved and downloaded the translated coding sequences of 28 E. coli strains from GenBank (Additional file 1: Table S1). For the BLAST search analysis, we used as filtering parameters a similarity cutoff > 85%, an alignment length between pairs > 85% and an e-value < 10 − 10 .
We analyzed the extent of gene duplication among strains by performing an all-vs-all BLASTp [25] protein similarity search (i.e., with the translated coding sequence regions of each strain, filtering the results according to the parameters specified above). For each duplicate, we retrieved genomic features (from the GenBank genomic feature files-gff), plotted the coordinates using R [26] and colored the duplicates according to their groups.
For the gene duplication analysis between strains and for the identification of the presence/absence of putative duplicated encoded proteins/coding regions, we also employed BLASTp. We searched the putative duplicates of interest against all translated coding sequences (all six frames). The results were filtered according to the above cutoff parameters.
The bioinformatics scripts employed for the analysis were deposited and available at the github website: https://github.com/molevol-ub/BacterialDuplicates. Statistical analysis. Proportions were compared between groups by using the two-tailed Fisher's exact test. A P-value of less than 0.05 was considered significant.
For the RNA-seq experiments, the detailed information and raw data were previously published in [27].

Results
E. coli strains encoding hha2/hha3 usually encode the flu yeeR aec69 aec70 cluster, which is also duplicated To gain insight into the biological role of hha duplication in the EAEC strain 042, we hypothesized that the presence of multiple alleles of a global regulator could be associated with the presence of genes specifically targeted by the regulator. To support this hypothesis, we decided to compare the core genomes of two groups, each with five E. coli strains. One of them included representatives that encode hha2/hha3 (hha + ), and the other included strains that do not encode them (hha − ).
To identify those genes that are truly exclusive to the hha + set, we used a restrictive strategy of excluding the pangenome of the hha − set from the core genome of the hha + set. By using this approach, only three gene families could be identified in the hha + set: the hha, yeeR and aec69 genes (Additional file 1: Figure S1 and Additional file 2).
The yeeR and aec69 genes belong to a gene cluster that includes flu (whose gene product is the well-characterized antigen43 protein), aec70 and aec71 (Fig. 1). A recent report shows that aec69, termed irmA, is transcribed in a single transcriptional unit with flu and yeeR [28]. In E. coli K12, the yeeR gene is truncated, and the irmA (aec69), aec70 and aec71 genes are missing (Fig. 1). This cluster belongs to the prophage CP4-44. Taking into account the high genomic variability of E. coli, the identification of yeeR and irmA (aec69) as linked to hha2/hha3 when the two five-strain groups were compared does not exclude the possibility that other strains that do not encode hha2/hha3 might encode yeeR/irmA (aec69) or that other strains harboring hha2/ hha3 do not harbor yeeR/irmA (aec69). To improve the analysis, we performed a BLASTp search on a total of 28 E. coli genomes, including both commensal and pathogenic strains belonging to several pathotypes (Additional file 1: Table S1). The results obtained showed that 72% of the strains expressing hha2/hha3 also express the yeeR or irmA (aec69) genes (P-value < 0.05), while 61% also express flu or aec70 (this latter comparison was close to the critical value (P = 0.055)) (Additional file 1: Table S2). In contrast, only 20-40% of hha2/hha3 − strains express yeeR, irmA (aec69), flu and aec70. aec71 does not appear to be associated with hha2/hha3. Its presence is widespread in both hha2/ hha3 and hha2/hha3 − strains. We then analyzed the map positions of the hha2/hha3 genes and the flu yeeR irmA (aec69) aec70 aec71 gene cluster in the chromosomes of seven E. coli strains corresponding to different pathotypes, including both enteric and extraintestinal pathogens (Fig. 2). In several instances, hha2/hha3 mapped close to the yeeR irmA (aec69) aec70 aec71 gene cluster. This study also showed that in most of the virulent E. coli strains analyzed (including the EAEC strain 042), genes belonging to the yeeR irmA (aec69) aec70 aec71 cluster are also duplicated (Fig. 2). The presence in the chromosome of strain 042 of four copies of hha-like genes (hha, ydgT, hha2 and hha3 [14]), three copies of hns-like genes (hns, stpA and hns2) [27], two copies of yeeR and irmA (aec69), three copies of flu and four copies of the aec71 gene suggests that gene duplication may play a relevant role in this and perhaps other pathogenic E. coli strains. We therefore decided to investigate the extent of gene duplications first in the genome of strain 042 and thereafter in the genomes of other pathogenic E. coli strains.

Gene duplications in the EAEC strain 042 genome
We analyzed the extent of gene duplications in strain 042 by using the BLASTp algorithm (see the materials for details) and mapped along the 042 genome those genes that are present in two or more copies (Fig. 3a). A total of 80 genes were duplicated in strain 042. Some of these genes correspond to transposases (black and open circles). Most of the duplicated genes are clustered in three main regions (labeled with vertical bars), which we arbitrarily termed regions 1 to 3. A significant number of genes that map to region 1, which is approximately 35.5 kb long, correspond to phage genes (Fig. 3b, Additional file 1: Table S3). Region 2 is approximately 17 kb long and contains the flu yeeR irmA (aec69) aec70 aec71 cluster, a toxin-antitoxin gene and several other genes of unknown function (Fig. 3b, Additional file 1: Table S3). One of the copies of this region includes the hha2 gene. The two copies of region 2 are inverted in the 042 chromosome, suggesting that genetic rearrangements leading to gene duplication can affect this region as a single recombinational unit. Region 3 is 10 kb long and includes mostly genes of unknown function (Fig. 3b, Additional file 1: Table S3).
We also analyzed gene duplications in strain 042 by using a BLASTn algorithm, yielding results similar to those obtained by using BLASTp (Additional file 1: Figure S2).

Duplicated genes in regions 1 and 2 are repressed by the H-NS/Hha system
Considering that some duplicated genes of region 2 in strain 042 (i.e., the yeeR irmA (aec69) gene cluster) have been identified as linked to hha2/hha3, it can be hypothesized that some duplicated genes are regulated by the H-NS/Hha system. To support this hypothesis, we analyzed the previously reported transcriptional profiles of strain 042 and its hha null (hha hha2) and hns mutant derivatives [27], which was performed in cultures growing in LB medium at 37°C. We assessed whether the duplicated genes of the three regions of strain 042 show H-NS-or Hha-dependent regulation ( Table 1). All genes from region 2 show fold change values higher than 2, both in the Fig. 2 Map position of the hha (red bar), hha2 (orange bar), hha3 (green bar) and genes from the flu yeeR aec gene cluster (violet bar) in the chromosomes of E. coli MG1655 and seven other E. coli strains belonging to different pathotypes (as indicated in each case). Note the duplication of genes belonging to the flu yeeR aec gene cluster in several strains. Scale bars 0 to 5 correspond to megabases in the genetic map hha null and hns derivatives. This result was also observed for several genes in region 1. Only two genes from region 3 appear to be coregulated by H-NS/Hha.
Genes from 042 regions 1 and 2 are also duplicated in several other pathogenic E. coli strains After determining the extent of gene duplications in the genome of strain 042, we addressed the question of whether the existing duplicates in strain 042 were strain-specific or whether they were generated in some putative ancestor and are also present in many other E. coli strains. We used the DNA sequences of the selected 28 E. coli genomes to perform a gene duplication analysis (see the methods for details) and annotated the number of copies of each of the duplicated genes from strain 042 that were detected in each of the other  [40]. See Additional file 1: Table S3 for the function of each duplicated gene genomes (Fig. 4). With respect to region 1, 10 out of the 14 duplicated genes in strain 042 were duplicated in most of the genomes analyzed. With respect to region 2, the 15 duplicated genes are duplicated in either some or most of the genomes studied (Fig. 4). Six of the genes from that region (listed as 55 to 60), which appear as a single linkage group and belong to the same putative transcriptional unit, are present in several copies (4 to 6) in the genomes of most of the strains. These genes encode conserved hypothetical proteins (55, 58-60), a putative antirestriction protein (56) and a putative DNA repair protein (57). With regard to region 3, duplications of the eight genes identified in strain 042 are a specific feature of that strain. Several of these genes are either absent or present in a single copy in most of the genomes studied (Fig. 4). It is relevant to mention here that only 9 out of 40 duplicated genes from strain 042 that map to regions 1 to 3 are present in a single copy in the genome of strain MG1655. The rest of the genes are not present in the genome of the commensal strain.

Gene duplications in the genomes of strains CFT073 (UPEC) and O145:H28 (EHEC)
To obtain a more complete picture of gene duplications in E. coli, we decided to analyze the genomes of two other E. coli strains that belong to pathotypes different from that of strain 042. Strain CFT073 is uropathogenic (UPEC), and strain O145:H28 is enterohemorrhagic (EHEC). With respect to strain CFT073, 94 duplicated genes could be identified. They can be grouped into six different DNA regions (Fig. 5, Additional file 1: Table  S4). Some of these genes correspond to transposases, similar to strain 042.
A total of 154 duplicated genes could be identified in the genome of strain O145:H28. The duplicated genes can also be grouped into six regions (Fig. 6, Additional file 1: Table S5). In this strain, several of the identified genes are present in more than two copies (Fig. 6). After identifying the duplicated genes in strains CFT073 and O145:H28, we also determined which of them are also duplicated in other E. coli strains. The genomic DNA sequences of the 28 E. coli strains were used to perform gene duplication analysis, and the number of copies of each of the duplicated genes from strains CFT073 and O145:H28 that were detected in each genome of the 28 E. coli strains was annotated. For both strains, duplicated genes that also occur as duplicates in other E. coli strains correspond to those already identified in strain 042 (Additional file 1: Figures S3 and S4). Some genes appear to be strain specific, as observed for strain 042. Some duplications in strains CFT073 and O145:H28 revealed a novel pattern: they are pathotype specific. The duplicated genes from strain CF703 region 5 belong to that group. Interestingly, most of these genes encode putative fimbrial proteins (Additional file 1: Table S4). Another example corresponds to duplications mapping in the region 4 of strain O145:H28. These genes are duplicated only in all the EHEC strains, one EPEC strain and one ETEC strain. Several of those genes are phage genes (Additional file 1: Table S5). In contrast to the EAEC strain 042, duplicated genes in the EHEC strain O145:H28 are not duplicated in the UPEC strains.

Discussion
The existence of gene duplications in both eukaryotes and prokaryotes has been extensively studied [18-23, 29, 30]. Several reports have established the basis for how gene duplication and divergence generate families and  Table S4 for the function of each of the duplicated genes superfamilies of proteins [21]. Gene duplications have been associated to the adaptation of cells to a changing environment [31,32], and have been found to occur more frequently among HGT genes than among indigenous genes [33]. The presence of several copies of genes such as flu in some E. coli strains was previously reported [30,32,[34][35][36].Nevertheless, detailed information about the extent of gene duplications in the genomes of the different types of pathogenic E. coli strains is needed. We applied an extensive blast search to identify putative internal duplications in the 042 strain using a moderate parameter cutoff (BLAST cutoff: > 85% similarity, > 85% alignment length and e-value < 10 − 10 ) and found that most duplicates cluster together in specific regions of the 042 genome. The results obtained suggest that different mechanisms underlie these duplication events. Whereas the duplication of region 2 from strain 042 involves an inversion, this was not the case for regions 1 and 3. Interestingly, the genes in region 1 have a phage origin and are widespread in several strains. It is worth mentioning that, as a general rule, duplications result in the presence of copies of the duplicated gene in both strands of the E. coli chromosome. It is also remarkable that a significant number of the duplicated genes are organized in putative transcriptional units (Fig. 3b), thus suggesting the existence of coordinated expression in response to specific stimuli. The comparative analysis of gene duplications in E. coli strains belonging to different pathotypes provides relevant information that can contribute to our understanding of the virulence mechanisms of this pathogen and better establish the relationships among the E. coli pathotypes. The existence of a significant number of genes that are duplicated in a wide range of pathotypes but absent from commensal strains suggests that these genes can play a relevant role in E. coli virulence. Genes 55 to 60 from strain 042 region 2 are duplicated in all except three of the 26 pathogenic E. coli strains analyzed. Given that detailed information about the function of the products encoded by a large number of these genes is missing, assigning functions to them and to many other genes of unknown function is a critical issue for better understanding the ability of E. coli to cause disease.
In addition to identifying a set of duplicated genes that is widespread in the different E. coli pathotypes, our study provides additional novel information on genomic features of virulent E. coli strains. In E. coli, some gene duplication processes are restricted to either specific strains or specific pathotypes. Examples are the duplicated genes in region 3  Table S5 for the function of each duplicated gene of strain 042 or the duplicated genes in regions 5 and 4 from the UPEC strain CFT073 and the EHEC strain 0145:H28, respectively. The study of the function of these genes can also contribute to a better understanding of the mechanisms underlying virulence in these pathotypes. It is well known that UPEC strains express specific types of fimbriae. Some of the duplicated genes in region 5 from the UPEC strain CFT073 encode putative fimbrial proteins, which might play a role in UPEC pathogenesis.
The correlation we observed between hha duplication and the presence of the duplicated yeeR irmA (aec69) gene cluster suggested that Hha (and H-NS) could modulate the expression of duplicated E. coli genes. The analysis of the comparative expression of duplicated genes in the wt 042 strain and its isogenic hha null and hns derivatives shows that under specific growth conditions (LB medium, 37°C), H-NS/Hha proteins downregulate the expression of a significant number of duplicated genes. These data highlight a novel role for the H-NS/Hha proteins in silencing several of the genes that are duplicated in strain 042. Hence, it can be hypothesized that to avoid fitness costs, duplications of genes targeted by global regulators may require the duplication of the genes that encode them. Derepression of H-NS/Hha-silenced genes can occur when environmental conditions change. Then, gene duplication may be advantageous because the two copies can exhibit different expression patterns and/or respond to different stimuli. This is the case for the duplicated irmA gene in strain 042 (our unpublished results).
A relevant point is whether HGT processes are underlying the presence of gene duplications in strain 042. The duplicated genes that map in the region 1 of strain 042 are of phage origin and can hence be considered as HGT DNA. In any case, a detailed phylogenetic analysis is being undertaken now to assess the origin of all duplicates that map in the three regions identified in strain 042.
Finally, our study has also shown some novel relationships between E. coli pathotypes. It is remarkable that most of the duplicated genes in the EAEC strain 042 are also duplicated in UPEC strains. Previous studies have suggested a close relationship between EAEC and UPEC strains [33,37]. In fact, E. coli strains showing a hybrid UPEC/EAEC genotype have been isolated [38]. The similar gene duplication patterns of EAEC and UPEC strains further support this EAEC/UPEC relationship. Unlike EAEC strain 042, duplicated genes in the EHEC strain O145:H28 are usually duplicated in EPEC and ETEC strains but not in UPEC strains. A distinctive feature of EHEC strains is that some of the duplicated genes are present in more than two copies.
For some E. coli infections, such as those caused by ETEC, the effectiveness of the existing vaccines must be significantly improved [39]. If any of the gene products encoded by the identified duplicated genes are antigenic, they could be candidates for developing novel improved E. coli vaccines.

Conclusions
Duplications of the hha gene can be correlated with the presence of genes belonging to the flu yeeR aec gene cluster, which is also duplicated in several pathogenic E. coli strains. The analysis of gene duplications in the E. coli genome has shown that (i) a number of duplicated genes are widely distributed among pathogenic E. coli strains, irrespective of the pathotype; (ii) some duplicated genes are only present in specific pathotypes; and (iii) some duplicated genes are strain specific. The present study also shows a relationship between duplications of both genes encoding regulators and genes encoding their targets. Our study also shows novel relationships between E. coli pathotypes. Finally, the distribution of duplicated genes in a high percentage of pathogenic E. coli isolates suggests that these genes must play a role in virulence. Hence, some of their gene products can serve as new targets for combating E. coli infections.

Additional files
Additional file 1: Table S1. List of E. coli strains whose genomes have been used. Table S2. Distribution of genes from the flu yeeR irmA aec70 aec71. Table S3. Locus tag and gene function of each of the duplicated genes in regions 1, 2 and 3 of strain 042. Table S4. Locus tag and gene function of each of the duplicated genes in regions 1, 2, 3, 4, 5 and 6 of strain CFT073. Table S5. Locus tag and gene function of each of the duplicated genes in regions 1, 2, 3, 4, 5 and 6 of strain O145:H8. The locus tags of the different copies are shown. Figure S1. Five-set Venn diagram of the exclusive core-genome of the hha2/3 + set (E. coli strains 042, NA114, O104:H4 LB226692, ETEC H10407 and UMN026). Figure S2. Genes duplicated in the E. coli strain 042, identified by using BLASTn instead of BLASTp. Figure S3. Distribution of the strain CFT073 duplicated genes in other E. coli strains belonging to a wide range of pathotypes.