Genome-wide association studies of Shigella spp. and Enteroinvasive Escherichia coli isolates demonstrate an absence of genetic markers for prediction of disease severity

Background We investigated the association of symptoms and disease severity of shigellosis patients with genetic determinants of infecting Shigella and entero-invasive Escherichia coli (EIEC), because determinants that predict disease outcome per individual patient could be used to prioritize control measures. For this purpose, genome wide association studies (GWAS) were performed using presence or absence of single genes, combinations of genes, and k-mers. All genetic variants were derived from draft genome sequences of isolates from a multicenter cross-sectional study conducted in the Netherlands during 2016 and 2017. Clinical data of patients consisting of binary/dichotomous representation of symptoms and their calculated severity scores were also available from this study. To verify the suitability of the methods used, the genetic differences between the genera Shigella and Escherichia were used as control. Results The isolates obtained were representative of the population structure encountered in other Western European countries. No association was found between single genes or combinations of genes and separate symptoms or disease severity scores. Our benchmark characteristic, genus, resulted in eight associated genes and > 3,000,000 k-mers, indicating adequate performance of the algorithms used. Conclusions To conclude, using several microbial GWAS methods, genetic variants in Shigella spp. and EIEC that can predict specific symptoms or a more severe course of disease were not identified, suggesting that disease severity of shigellosis is dependent on other factors than the genetic variation of the infecting bacteria. Specific genes or gene fragments of isolates from patients are unsuitable to predict outcomes and cannot be used for development, prioritization and optimization of guidelines for control measures of shigellosis or infections with EIEC.

isolates from patients are unsuitable to predict outcomes and cannot be used for development, prioritization and optimization of guidelines for control measures of shigellosis or infections with EIEC. Background Shigellosis is caused by the gram-negative bacterium Shigella and can lead to dysentery (1). The genus Shigella is divided in four species; Shigella dysenteriae, Shigella flexneri, Shigella boydii, and Shigella sonnei. All Shigella spp. are genetically closely related to Escherichia coli to that extent that they should be classified as one species (2,3).
However, it is a taxonomical decision based on historical and clinical arguments to maintain the current classification (4). Entero-invasive E. coli (EIEC) is a pathotype of E.
coli, which also can cause dysentery (5,6). Because of the similarity in pathogenetic features of EIEC and Shigella spp, distinction using diagnostic laboratory tests is difficult As in many other countries, shigellosis is a notifiable disease in the Netherlands. This means that each case is notified towards health authorities, and consequently, control measures are activated (8)(9)(10)(11). These control measures exist of source tracing for every shigellosis case, which place a burden on our public health system. Case definitions for shigelloses in the Dutch guidelines require the confirmation with culture techniques (8).
The sensitivity of the culturing of Shigella spp. and EIEC is low (12). Additionally, most laboratories perform a molecular prescreening based on the ipaH gene, which is present in both Shigella spp and EIEC. From approximately half of fecal samples positive in the molecular prescreening an isolate cannot be obtained in culture (12,13). Shigellosis cases that are diagnosed purely by molecular procedures are not notifiable.
In contrast to cultured Shigella spp., infections with EIEC are not notifiable in the Netherlands. Because of the high genetic similarities, identical disease outcomes and the 4 low sensitivity of culturing, the two infective agents are often not detected in culture at all or misidentified. Consequently, accurate application of the guidelines is challenging (14). Genes of pathogens that are predictive for disease outcomes can help in the prioritization of infectious disease control measures. Moreover, the presence of genes is more easily detected by using molecular procedures opposed to the current used culture techniques required for notification.
Few studies investigated the association of virulence genes with disease severity for shigellosis, using Pearson's correlation and regression analyses (15,16). In one of these studies, the virulence gene sepA was associated with abdominal pain and the combination of sepA, sigA and ial genes with bloody stools (16). Another study found that detection of the sen (shET-2) gene was associated with diarrhea and the virA gene was associated with fever (15). Both studies had a limited sample number, did not correct for multiple testing, and in one study the presence of virulence genes was established using direct detection in fecal samples. This approach is problematic, because different Enterobacteriaceae present in fecal samples may carry these genes, for example, on average, 2-3 E. coli clones are detected in the feces of a single person (17). Therefore, assessment of single isolates would be more appropriate. Furthermore, association with only a limited number of targeted virulence genes was conducted in these previous studies, while genomic approaches would analyze all harbored genes, gene variants, or other genetic content.
The purpose of our study is to investigate whether there is an association between symptoms and disease severity of the patients and genetic determinants of infecting Shigella and EIEC isolates in the Netherlands. To address this, microbial genome-wide association methods (GWAS) were applied. We hypothesize that genetic variants associated with symptoms or severity of disease allow development of specific molecular diagnostics that could predict the disease outcome per individual patient and prioritize 5 the employment of control measures for infections with Shigella spp and EIEC.

Data preparation and exploration
To assess whether other pathogens present in the fecal samples caused the symptoms and severity of patients, presence of symptoms and severity scores of patients with coinfection were compared to those of patients without coinfection. In 15.5% of the patients, a coinfection was detected. The symptom blood in stool, known as a typical symptom of shigellosis (18), was significantly less present in patients with a coinfection (chi-square, p = 0.019), while the presence of other symptoms was not statistically different (chi-square, p > 0.05). The lower fraction of patients with coinfection that experienced blood in stool was also reflected in the de Wit severity score, in which blood in stool is a criterion with double weighing, as it was significantly lower for patients with coinfection (T-test, p = 0.017). The Modified Vesikari Score (MVS), in which blood in stool is not a considered factor, showed no significant difference between patients with and patients without coinfection (T-test, p = 0.076).
The assemblies of 277 isolates were used to construct a gene presence/absence table and k-mers of variable length. This resulted in a gene presence/absence table consisting of 2,890 core genes (i.e. present in all 277 isolates) and 9,869 genes in total. K-mer counting yielded 28,551,795 genetic variants.
A phylogenetic tree was created based on the core genome SNPs, and the distribution of the severity scores, coinfection and the effects of underlying diseases were visualized ( Figure 1). The core SNP analysis resulted in some species-specific clusters. However, clusters that contain multiple species were also present ( Figure 1). In addition, severity scores, effects of underlying diseases and coinfection were randomly distributed over the isolates in the tree (Figure 1). For the GWAS analysis, only isolates sequenced during this 6 study and displayed in Figure 1 were used. However, for contextualization of the position of the isolates in this study compared to the global population structure of Shigella spp. and EIEC, an additional tree was inferred including genomes from each of the main lineages and phylogenetic groups (Additional File 1). It showed that the population structure of our EIEC isolates was mainly concentrated in three clusters containing ST270, ST6 and ST99 based on isolates from the United Kingdom (UK) (19). The UK ST270 cluster corresponded with cluster 8, the large EIEC cluster from Pettengill et al. (3). In our analysis, EIEC isolates belonging to cluster 4, EIEC small or cluster 7, the EIEC/EHEC/EAEC cluster were not included (3). For S. flexneri, a few isolates related to travel to Asia belonged to PG6 and PG2 ( Figure 1 and Additional File 1). However, the majority of isolates were PG3, consisting solely of isolates with serotype 2a or Y, and PG1, consisting of isolates of serotypes 1a, 1b, 1c, Yv and 4av. For S. sonnei, almost all isolates were of lineage III, only a few isolates within lineage II were detected ( Figure 1 and Additional File 1). The presence of large clusters of EIEC isolates, the presence and distribution of serotypes over the PGs for S. flexneri and the predominance of S. sonnei lineage III were described before, and were representative for a population structure as present in other Western European countries (19)(20)(21)(22).

GWAS using gene presence/absence of single genes
None of the tested symptoms and severity scales resulted in significantly associated genes with a sensitivity and specificity above 85%. However, eight significantly associated genes were found with sensitivity above 92% and a specificity of 87% for the characteristic "genus", that was used as a benchmark to evaluate algorithm performance.
The gene with the highest association, produces a hypothetical protein and had a Benjamini Hochberg corrected p-value of 7.01E-27 and a sensitivity and specificity of 99% and 87%, respectively.
Additionally, the p-values of all characteristics were compared to random permutation datasets by plotting the log transformed expected and observed p-values against each other ( Figure 2). The gene associations with the tested severity scales (Figure 2A and 2B) and symptoms ( Figure 2C) displayed similar plots as the random permutation datasets, indicating a performance as random cases. This did not apply to the benchmark characteristic "genus", that plot showed a clear difference between expected and observed p-values, which was supported by the low Benjamini Hochberg corrected pvalues ( Figure 2D).
It followed from the sensitivity analysis based on the benchmark characteristic "genus" that genes present in 0.7% of total isolates within the smallest group (Escherichia, n=30), corresponding to two isolates of the total number of isolates, resulted in significant pvalues. This indicated that a gene presence in a minimum of two isolates from the smallest group was enough to detect significance, if these genes were not present in the other larger group (Additional File 2).

GWAS using gene presence/absence of multiple genes
The generated random forest model, created using isolates from the training set resulted in an out-of-bag (OOB) estimate of error rates when testing the isolates from the test set.
A random error rate of 66.7% for the severity scores and 50% for the symptoms and genus was expected, as respectively three and two classes were predicted. OOB error rates in the created random forest models using 5000 trees for the prediction of symptoms and severity scales of patients were as expected for random datasets when applied to the test set. Error rates were ranging from 40.8% to 53.1% for all symptoms and 65.1% to 70.1% for the two severity scales ( Table 1). The construction of additional trees did not lead to 8 better predicting models. In contrast, the OOB error rate of the model that predicted the benchmark characteristic genus was 15.9%, much lower than the random expected error rate of 50% (Table 1). The created model for genus prediction was further explored by examining the location of the misclassified isolates in the phylogenetic tree ( Figure 1). Comparing them with the traditional laboratory results that were obtained during the IBESS-study showed that six out of ten discrepant isolates were so-called hybrid isolates and also had an uncertain assignment using the traditional laboratory tests ( Table 2).

GWAS using k-mers
Associating k-mers with different characteristics using Pyseer did not lead to any significant k-mers for abdominal pain, abdominal cramps, blood in stool, fever, headache, mucus in stool, nausea, vomiting, and the severity score of MVS (Table 1). In contrast, 156 k-mers were associated with diarrhea, however, all k-mers had an invalid chi squared test and likelihood-ratio test (LRT) p-values higher than 0.313. The de Wit severity score resulted in 17 associated k-mers, whereof 15 k-mers with an LRT p-value lower than 0.05.
An assembly of these 15 k-mers resulted in a single consensus sequence of 100 bp, based on overlapping k-mers. A BLASTn search of the consensus sequence against the database of the National Center for Biotechnology Information (NCBI, Bethesda, USA) revealed that the significant k-mers are located between two genes (Additional Figure 3), including a type II toxin-antitoxin gene (AYE47152.1) and a gene coding for DUF1391 (AYE48123.1), a protein of unknown function. A potential promoter region in the k-mer was found with a -10 box (CATTATTTT) at position 58 and a -35 box (TTGACG) at position 36 of the sequence (Additional Figure 3).
To validate the potential of the k-mer to predict the severity score of de Wit scale, the kmer was queried by BLAST against a database with all isolate assemblies from our study.
For every sample, the bit-score of the best scoring hit was plotted against the corresponding severity score ( Figure 3A). Roughly, three groups resulted, one with a bitscore of >175 corresponding with a full-length match with the k-mer, one with a bit-score of 50-175 corresponding to a partial match and <50 corresponding to no match.
Subsequently, the Kruskal-Wallis test was performed to investigate the difference in the de Wit severity score between the groups ( Figure 3B). No statistically significant difference between the groups was found, with a p-value of 0.6.
To check the suitability of the Pyseer method for the association of k-mers with characteristics in our data-set, the benchmark characteristic "genus" was used and resulted in 3,036,507 potential associated k-mers.

Discussion
The purpose of our study was to investigate association between genetic determinants of infecting targeting the stx 1 and stx 2 genes and particular virulence genes. These combination of genes within STEC bacteria are known to have associations with a higher risk for severe disease and clinical complications (24).
First, the association of the presence or absence of single genes resulted in no statistically significant association between genes with specific symptoms or severity scores with high sensitivity and specificity. Second, the association of multiple genes resulted again in no statistically significant association with specific symptoms and severity scores of patients, indicating that no complex genetic interactions that may explain disease severity could be found. Third, the association of k-mers resulted in a consensus sequence consisting of multiple aligned k-mers that was associated with a high severity score of de Wit. The sequence of 100 bp, containing multiple associated k-mers, was located between two genes with a putative promoter region with an optimal interbase distance of 16 bases but an unclear TATAAT box. When blasting the consensus k-mer against all assemblies, three difference bit scores were observed, suggesting there are three different genetic variants of this locus. Performing a Kruskal-Wallis test on these three different bit score groups, showed that the k-mer was not valid (p = 0.6), and presumably was a false positive.
In our study, the genes that were associated with specific symptoms in earlier described studies (15, 16), were not confirmed. In another study that was conducted in Brazil among children with shigellosis, sepA was associated with abdominal pain, and the combination of sepA, sigA and ial genes with bloody diarrhea (16). However, it was not clear if univariate or multivariate testing for virulence genes was performed. In another study from Brazil, a case-control study was conducted. They found that the sen (shET-2) gene was associated with diarrhea in children in general, but not with specific symptoms of shigellosis patients.
They associated the virA gene with fever in children with shigellosis, however virA was also found in 44% of controls (15). In our study, we have used a larger sample size consisting of patients with other demographics in another setting, analyzed all harbored genes instead of a predefined selection, used other methods with higher resolution as it was based on whole genomes, and included correction for multiple testing.
Because all used algorithms in our study generated negative results for association, the "genus" was also tested as a benchmark. The used algorithms performed adequate, as they resulted in relevant genetic variants. Furthermore, a sensitivity analysis indicated that the group distribution of the characteristic "genus" was suitable for significant detection of associated single genes. This characteristic had an adverse unequal group distribution of 10% versus 90%, indicating that the number of isolates and the distribution over the groups was suitable for associating genetic content with all symptoms and severity, except for "diarrhea", which was the only characteristic with a more unequal group distribution than "genus". Moreover, other studies found genetic variants significantly associated with their tested traits using the microbial GWAS methods that were used in our study (25)(26)(27)(28)(29).
Using Scoary, single genes that had association with the characteristic "genus" were found, with low p-values and high sensitivity and specificity. Further, with Pyseer, over 3,000,000 potentially associated k-mers were found. This is in concordance with another study that demonstrated the suitability of k-mers for identification of Shigella spp. and E.
coli isolates based on whole genome sequences (30). Moreover, using Random Forest, OOB estimate error rate for the benchmark characteristic "genus" was 15.9%. This indicated 13 that the model that predicts the genus of unknown isolates performed better than random, however does not accurately predicts the genus of some isolates. Notably, six out of ten discrepant isolates also had an uncertain assignment with traditional laboratory tests. If we exclude these isolates, the OOB estimate error rate is 1.9%, indicating that not the used method, but the nature of these isolates that are possessing characteristics of both In addition to the methods using gene presence/absence and k-mers that were used in our study, other types of genetic variants can be used as input for microbial GWAS (31). The k-mer approach used in this study is able to detect different genetic variants such as SNPs, indels, variable promotor regions and gene content simultaneously (32). This indicates that adding purely SNP-based methods to the used methods is redundant as SNPs are already encompassed in the performed k-mer method. Another genetic variant that can be used in GWAS is based on De Bruijn Graphs. However, it is mainly based on the creation of overlaps of k-mers, therefore, it probably does not generate association with symptoms or disease severity using the data from our study (33).
One of the strengths of our study was the availability of isolates representative for the population structure as encountered in Western European countries, as well as the clinical 14 data of the patients that they were infecting. Second, results of the performed traditional laboratory tests to determine the species of the bacteria were available for all isolates.
Finally, another strength of our study is that several potential genetic variants were associated with the trait "genus", and a sensitivity analysis was performed, both proving the suitability of the used algorithms.
Some considerations with regard to our study should be taken into account. study were inadequate with two and no isolates, respectively. However, we believe the number of isolates to be adequate, as studies with similar sample sizes have been performed in the past in which genetic variation in pathogens was identified that had predictive value for the course of disease (29,38). Finally, the used dataset only contained isolates encountered in the Netherlands, resulting in a geographical biased set (39,40). Therefore, to avoid missing serotypes in future studies, the current dataset should be supplemented with isolates from other geographic areas.

Conclusions
Using several microbial GWAS methods, genetic variants in Shigella spp. and EIEC that can predict specific symptoms or a higher disease severity were not found. In contrast to adjustment of the guidelines of STEC, genes or gene fragments that indicate higher risks for a more severe course of disease does not exists for shigellosis, whether caused by Shigella or EIEC, using the dataset in our study. Therefore, the bacterial specific genes or gene fragments from patient isolates are not suitable to predict outcomes in individual patients or to use in development, prioritization and optimization of guidelines for control measures of shigellosis or EIEC. As GWAS in our study associated genetic fragments with genus, future studies can be performed in which GWAS could support the distinction of Shigella spp. from EIEC. Additionally, the prediction of results of traditionally laboratory tests using draft genome sequences could be performed using GWAS. The results of these suggested follow-up studies could improve diagnostics and guidelines for control measures of shigellosis.

Bacterial isolates and clinical data
The data used in our study was collected during the Invasive Bacteria E. coli-Shigella With these severity scores, lower scores indicate a milder course of disease (43,44). The calculated scores were stratified into scales representing mild, moderate and severe disease according to their own categorization and used as dichotomous input in the GWAS methods in this study (Additional File 4). For the de Wit severity score, a score of 0-3 is considered mild, 4-6 moderate and ≥7 severe (44). The designers of the MVS score consider a score of 0-8 as mild, 9-10 as moderate and ≥11 as severe disease (43).
Laboratories that participated in IBESS, reported other detected bacteria, viruses and parasites in the fecal samples by molecular, culture and microscopy methods as well (Additional File 4). Other pathogens were detected in 15.5% of the patients from whom the Shigella spp. and EIEC isolates were isolated that were used in this GWAS study. The

Genome sequencing and data preparation
DNA isolation and short-read Illumina sequencing was performed as earlier described (41).
For preparation of the genomes, an in-house assembly pipeline available at GitHub (https://github.com/Papos92/assembly_pipeline) was used. It consists of raw data quality assessment using FastQC v. 0.11.8 (45)  were corrected by using two-thirds of the smallest class as sample size to create models based on gene presence/absence of multiple genes in the training set, using 5000, 8000 and 10,000 trees respectively. The performance of these models was validated by predicting the outcome of each trait using the genomes of the isolates in the test set.

GWAS using k-mers
To generate the k-mers that were associated with the characteristics, first, a population structure estimation was made using mash v. 2.0 (60). Second, K-mer counting was performed using fsm-lite v. 2.0.3, and the optimal number of dimensions to use as cofactors in the analysis was determined (32,61). Subsequently, to estimate the effect of the k-mers on the severity scores and patient symptoms, Pyseer v. 1.1.2 was used with the following settings: a maximum of six dimensions, a filter p-value of 1E-8, a minimum allele frequency of 0.02 and a maximum allele frequency of 0.98.
The resulting k-mers were aligned using ClustalW v. 2.1, which resulted in one consensus 20 sequence (62). To identify the position of the k-mers in the genome, the resulting consensus sequence was aligned using the nucleotide Basic Local Alignment Search Tool (BLASTn) v. 2.8.1 with default settings (63). To investigate whether the k-mer contained a promotor, BPROM was used (64).
In addition, to validate the association of the resulted consensus k-mer with the characteristics it was aligned against a BLAST database of all assembled genomes from this study, created using BLASTn v.

Consent for publication
Not applicable.

Availability of data and material
Sequences of isolates were submitted to the European Nucleotide Archive (ENA, EMBL-EBI, Cambridge, United Kingdom) with study number PRJEB32617 (https://www.ebi.ac.uk/ena).
The in-house assembly pipeline is available on Github: https://github.com/Papos92/assembly_pipeline. All other data generated or analyzed during this study are included in this published article and its supplementary files.