In this study we identified core-genes and estimated the genetic variation among 186 publically available E. coli and Shigella genomes. Here, we will have a brief look at how E. coli is currently classified, how it fits our data, and discuss how these results may form a basis for future implementation of WGS as a standard typing tool for classification of E. coli in phylogeny and epidemiology and understanding E. coli evolution.
The dataset analyzed was obtained from GenBank and is publically available from NCBI. Two data quality issues are immediately encountered when using sequence data produced by others and from several different researchers: genome annotation and sequence quality. The annotation of the sequences can be very different, due to different annotation pipelines. Some annotations are manually curated and others are not. The completeness of each sequence can vary – some completed sequences are more “complete” than others. Chain et al. suggested a list of 6 categories in which all sequenced genomes could be defined based on their level of completeness . In an attempt to overcome the bias from different annotations all genomes were annotated using the Prodigal gene finder  which provided consistency across the entire data set.
Sequence quality is also a concern. Unfortunately there hasn’t been much focus on the issue, and publications estimating error rates in sequence databases are scarce. To our knowledge there are no recent publications estimating error rates in bacterial genomes deposited in GenBank. Wesche et al. estimated error rates in the mouse DNA sequences deposited to GenBank in 2004 . They found an error rate of 0.1% in coding DNA sequences. This is lower than the estimate done in 1988 for all GenBank sequences deposited at the time, which demonstrated an error of ~0.3% .
Eukaryotes in general have much more complex genomes, due to introns, exons and complex repeats, which in turn leads to a higher than expected error rate. Sequencing technologies and assembly have also improved significantly since 1988. It is hypothesized that a conservative estimate of sequence errors in bacterial sequences deposited to GenBank today is less than 0.1%. Consequently an average E. coli gene (~1000bp) will contain approximately 1 error per gene.
Most errors caused by NGS technologies comes from insertions and deletions (indels), which will be completely ignored, due to the way nucleotide diversity is calculated. Therefore the errors, which are actually having an effect on the nucleotide diversity calculations, are probably lower than 0.1%. Because of these facts, it is believed that errors will, at most, cause 0.001 additional diversity to any of the variation calculations, and we believe that this is probably a very conservative estimate.
Sequencing errors, both indels and nucleotide changes can, however, cause genes to be truncated. Touchon et al. showed that at least 23 essential housekeeping genes were missing in their core-genome , and genomes missing these genes turned out to contain truncated versions of the “missing” genes. It was hypothesized that this was probably due to sequencing errors. Owing to the possibility of sequencing errors accidently “deleting” genes from a genome, we also present the results for the soft core in this study.
Another issue, which sets a limit on our ability to interpret the results, is the lack of metadata, or specifically, the lack of a method for obtaining relevant metadata in an automated way. The amount of sequence data available now makes it unfeasible to email the corresponding author for each available genome to obtain its metadata. The community is aware of the increasing need for metadata and The Genomics Standards Consortium has suggested the Minimum Information about a Genome Sequence (MIGS), some of which is being incorporated into more recent GenBank files .
Pan- and core-genome
The core-genomes of E. coli and Shigella have been estimated in several studies. Lukjancenko et al. estimated the core-genome in 2010, from 61 genomes, using a single linkage clustering method and found it to be 1,472 HGCs if only E. coli was considered . Vieira et al. estimated the core-genome in 2010 from 29 E. coli and Shigella genomes using the orthoMCL algorithm and found the core-genome to consist of 1,957 gene clusters . In 2004 Fukiya et al. examined the core-genome from 22 E. coli strains using comparative genomic hybridization and estimated it to consist of approximately 2,800 shared open reading frames among all the strains . Willenbrock et al. used high-density micro arrays to estimate the core-genome of 32 E. coli and Shigella genomes, and estimated the core-genome to be around 1,563 genes . Chattopadhyaya et al. estimated the core-genome to consist of 1,513 genes among the 14 E. coli strains considered in their study . Touchon et al. estimated the core-genome in 20 E. coli to be 1,976 genes and the pan-genome to consist of 11,432 genes. Thus, in previous studies (with fewer genomes) the size of the core-genome seems to fluctuate between 1,000 and 3,000 genes and generally conforms to the expectation that the core-genome would decrease, as an increased number of strains are analyzed, which might be an artifact of truncated genes due to sequencing errors.
In this study we found the soft core-genome to consist of 3,051 HGCs (Figure 1) for 186 genomes. In contrast to previous studies, we allowed a soft core-gene to be missing in up to 5% of all the genomes. If the strict core (HGC must be found in all genomes) was considered, the core-genome shrinks to 1,702 HGCs. It fits well within previous estimations made with the same strict cutoff.
The pan-genome has also been estimated in many studies and will probably continue to increase as more genomes are sequenced. In one study, the pan-genome of E. coli has been estimated to be as large as 45,000 gene families . Another study suggests that the bacterial pan-genome is infinite . Additional E. coli isolates, including some more distinctly related to those already sequenced, should be sequenced to obtain a more complete picture of the E. coli pan-genome.
The joint core-genome diversity plotted in Figure 2 (yellow) has one large top, which suggests that for most core-genes there is little room for diversity. Several smaller tops are also observed. We examined some HGCs that are part of the larger of the smaller tops (~0.17 substitutions per site). In both cases the HGC consisted of a gene coding for an enzyme and its isozyme counterpart. As for the case of one of the most diverse core families, the ABC transporters, the high diversity is due to different genes coding for proteins having very similar functions.
The pan-genome diversity plotted in Figure 2 has one large top and the distribution is much broader, as would be expected, due to the inclusion of the accessory genes.
No single, officially recognized system for classification of prokaryotes exists at the present time. The “polyphasic approach” is the most popular, and includes phenotypic, chemotaxonomic and genotypic data . As for the genotypic data, this means that two genomes have to be 70% similar in order to be considered the same species. It has been shown that >70% similarity corresponds to an average nucleotide identity among the core-genes of >95% . These results are supported by the median ~0.018 substitutions per site for the joint core found in this study.
Figure 3 shows that the genes from the Mark Achtman MLST scheme and the T. Whittam MLST scheme, in general, have less diversity than the majority of core HGCs. This is a bit surprising because the more variation in a gene, the greater the potential to be able to distinguish different strains.
The Pasteur MLST scheme seems to contain quite diverse core-genes, but also contains some which are more conserved than the average core-genes. This raises the question of whether or not a selection of more variable core-genes could be made, which, in turn, could provide higher resolution. Variability is, of course, not the only consideration when choosing MLST genes, e.g. an MLST scheme should not contain genes that are candidates for horizontal gene transfer, they should not be paralogous, and they should reflect the true phylogeny as much as possible. It is beyond the scope of this study to present a new MLST scheme, but it will be demonstrated how resolution could improve by choosing more diverse MLST genes. 7 core HGCs were chosen semi-randomly, with variation around ~0.03 substitutions per site. Genes were chosen with variation higher than average, although not so high as to include paralogous genes. We found the corresponding genes in a set of 24 O157:H7 strains, aligned them and built a phylogenetic tree. Phylogenies were also inferred using each of the other three MLST schemes (see Additional file 2). We compared the MLST phylogenies with a published SNP tree created from these strains . There is almost no variation found in the traditional Mark Achtman MLST scheme genes in these strains. In the alternate MLST scheme tree there is more variation and in turn more resolution. T. Whittam’s scheme has the best overall resolution, probably due to the fact that T. Whittam’s scheme contains twice as many genes as the other MLST schemes. None of the MLST phylogenies presents the expected topology. It seems unlikely that any selection of genes this small will ever be able to infer a robust phylogeny for an E. coli outbreak. At this point in time, there is probably no need to chase after a better MLST scheme, as WGS will probably make MLST typing obsolete with time. For most scientists, WGS is already less expensive than MLST typing . WGS is, in general, far more promising, since it enables the use of entire core-genomes and SNPs (see core-gene tree discussion).
Barrick et al. documented the mutations fixed in a specific E. coli strain over 40,000 generations in vitro. We looked at the genes and their corresponding HGCs in which these mutations occurred, but found no significant trend with regard to the variability of the mutated genes (data not shown).
Gene function distribution
Most HGCs could not be annotated with a functional category (~12,000); this corresponds to ~25% of all the genes.
The annotations of the HGCs are presented in Figure 5. As expected, the conserved genes are overrepresented in the “ribosomal” category, and even though there are only a few HGCs found in the “extracellular” category, they are exclusively from the variable HGC pool.
E. coli as a species contains within it a large diversity of adaptive paths. This is the result of a highly dynamic genome, with a constant and frequent flux of insertions and deletions [7, 16]. Touchon et al. shows that the dynamic genome is compatible with a clonal population structure such as E. coli, since most gene acquisitions and losses happen in the exact same locations (“hotspots”). Hence the phylogenetic signal is still strong within the core genome even though recombination and lateral gene transfer is frequent .
The concatenated gene tree in Figure 6 demonstrates this strong phylogenetic signal quite well by the high fraction of confident nodes (confident nodes having a bootstrap value above 0.7). The tree also agrees with the MLST types. None of MLST types are actually split with the exception of ST-10, ST-11 and ST-93. In the ST-93 clade there is a single strain, which could not be typed by the in silico MLST algorithm. It is the draft genome of E. coli 101–1. Perfect matches for all 7 alleles are found, for the MLST scheme, but the combination is unknown. Its location within the ST-93 clade is valid though, since the unknown type is due to a single locus change (fumC-11 --> fumC-130). E. coli H 2687 with ST-587 is also a single locus variant of ST-11. ST-10 is split by ST-1060 and ST-167. Since the two strains of ST-1060 are sub-strains of K12, which is classified as ST-10, these fit inside the ST-10 clade. ST-167 is a single locus variant of ST-10.
All phylogroups (A, B1, B2, C, D, E, and F) also correspond very well with the core-gene tree. Only a few strains seem to violate the groups. E. coli MS 57 2 is classified as D, but the tree strongly suggests that it should belong to the B2 group. Gordon et al. showed that using the Clermont PCR multiplex method could lead to erroneous classification of phylotypes , in particular, classifying B2 phylotypes as D phylotypes were shown to be frequent. They proposed a new gene target, “ibeA”, which will distinguish most B2 types from D types. E. coli MS 57 2 contains the gene target ibeA, which confirms its placement within the B2 phylogroup .
The tree supports the claim that B2 and F are the ancestral groups followed by D and then the sister groups B1 and A [7, 16, 36].
The fact that phylotyping and MLST typing fit so nicely with the core-gene tree, both confirms the highly clonal nature of E. coli and supports the use of core-genes to infer the “true” E. coli phylogeny.
To obtain a resolution high enough to be used in short term epidemiology, researchers have turned to inferring phylogenies from Single Nucleotide Polymorphism (SNP). SNP trees have, with much success, been used previously to describe complex outbreaks in detail [4, 37]. However, to create a SNP tree, a good reference is needed and it is also frequently necessary to sort out false SNPs. The latter will always be subject to some controversy, because determination of a false SNP call will seldom be a completely objective call.
The creation of a core-gene tree requires no subjective alterations, which, in turn, also makes them much easier to automate and replicate than SNP trees. Figure 4 presents the E clade of the core-gene tree, and demonstrates the ability to differentiate three American E. coli O157:H7 outbreaks from each other. This is slightly better even, than the SNP tree published by Eppinger et. al .
In a case where the core-gene tree does not provide enough resolution, better resolution might be obtained by focusing on the more variable genes; in these cases care should be taken not to focus on paralogous to infer phylogeny. Whether this is possible is doubtful, and will require further studies with strains of known origin and relationship for validation.
Based on many various typing methods, Shigella consistently has been shown to belong within the E. coli species . Indeed, within Figure 6, all Shigella species can be seen to fall within the E. coli clade. How Shigella got the ‘shiga toxin’ and other pathogenicity genes has two opposing theories. One theory suggests that all the “Shigella genes” originated from one ancestral plasmid . Another theory suggests that Shigella originated from three different E. coli species, which, independently of each other, acquired the “Shigella genes” . Our core-gene tree (Figure 6) supports the latter theory, which is not surprising, since the theory was based on trees created from housekeeping genes. The core-gene tree fails to group the Shigella species. Shigella are classified based on their virulence factors, which are probably poor phylogenetic targets, and thus does not explain the “true” relationship between the Shigella species.
The pan-genome tree is based on the absence or presence of all the HGCs of the pan-genome. It has been reported by Touchon et al. that gene conversion events are more likely than point mutations in E. coli. From this they conclude that the contribution made by recombination events outweigh site-level mutations as an evolutionary mechanism .
The pan-genome tree differs from the core-gene tree, because it is focused on those genes that are absent between the genomes. Since all the core-genes will be present in all genomes these will not in any way influence the phylogenetic relationship in this tree.
The pan-genome tree does not have as confident nodes as the core-gene tree. The deeper nodes are almost all below 50%. However, the nodes close to the leaves are quite confident and a majority of these reaches 70-100%.
These results are in agreement with the previously mentioned study by Touchon et al. The gene diversity in E. coli creates a poor phylogenetic signal between distantly related strains, since the signal is only made up from very few fixed ancestral insertions. This is due to the high gene flux in E. coli which causes only closely related strains to share a significant amount of accessory genes .
There are many similarities between the core-gene tree and the pan-genome tree, but also some obvious differences. The pan-genome tree does not divide the strains as nicely into the different phylogroups as the core-gene tree. The MLST type clades are also more divided than is the case for the core-gene tree. These results might not be that surprising, since both phylogroups and MLST types are based on a small set of core-genes and the pan-genome tree actually ignores these genes.
The pan-genome tree, due to one single Shigella clade, supports the “one origin” theory, as opposed to the core-gene tree, which supports the “three origins” theory of Shigella. Since the definition of Shigella is based upon a group of genes which gives it its pathogenic characteristics, it makes perfect sense that the pan-genome tree, which focuses on gene presence/absence, is able to isolate the Shigella genus into one single clade.
This convergence for Shigella has been observed previously by calculating the “metabolic distance” between E. coli strains. Vieira et al. suggests that this inconsistency between genetic distance and metabolic distance is proof that the Shigella metabolic networks have evolved quickly by genetic drift .
Both trees fail to divide the Shigella genus into any species clades, which further supports the argument that the taxonomy within Shigella might not be optimal.
The core-gene tree in this study had a surprising capability to differentiate between closely related outbreak strains. However, more resolution might be needed to infer phylogenies or detect short-term outbreaks. In these cases, it might prove useful to put more weight on the variable regions of the genome. Further studies are needed to decide if this is a meaningful approach.
The results found in this study may lay ground for further studies into how we might create a standardized method for defining E. coli strains. To do this, studies are needed in which E. coli strains from different outbreaks and with different degrees of relatedness are sequenced and compared. Although “Single Nucleotide Polymorphism” (SNP) analysis was not done in this study, SNP potentially could be a powerful typing technique and will need to be included in future studies. This will, however, make more sense with a dataset that has been selected for this purpose.
It is becoming more and more apparent that a global epidemiological detection system is important, and for a global collaboration to be successful, standards are crucial.