Similar yet different: phylogenomic analysis to delineate Salmonella and Citrobacter species boundaries

Background Salmonella enterica is a leading cause of foodborne illness worldwide resulting in considerable public health and economic costs. Testing for the presence of this pathogen in food is often hampered by the presence of background microflora that may present as Salmonella (false positives). False positive isolates belonging to the genus Citrobacter can be difficult to distinguish from Salmonella due to similarities in their genetics, cell surface antigens, and other phenotypes. In order to understand the genetic basis of these similarities, a comparative genomic approach was used to define the pan-, core, accessory, and unique coding sequences of a representative population of Salmonella and Citrobacter strains. Results Analysis of the genomic content of 58 S. enterica strains and 37 Citrobacter strains revealed the presence of 31,130 and 1540 coding sequences within the pan- and core genome of this population. Amino acid sequences unique to either Salmonella (n = 1112) or Citrobacter (n = 195) were identified and revealed potential niche-specific adaptations. Phylogenetic network analysis of the protein families encoded by the pan-genome indicated that genetic exchange between Salmonella and Citrobacter may have led to the acquisition of similar traits and also diversification within the genera. Conclusions Core genome analysis suggests that the Salmonella enterica and Citrobacter populations investigated here share a common evolutionary history. Comparative analysis of the core and pan-genomes was able to define the genetic features that distinguish Salmonella from Citrobacter and highlight niche specific adaptations.

Members of different serovars may exhibit phenotypic differences with respect to pathogenicity, host restriction, resistance to antibiotics, and metabolism. However, these differences can also exist among strains of the same serovar [7][8][9]. The evolutionary history of Salmonella, as constructed from single nucleotide polymorphism (SNP) matrix analysis of 156 genomes, depicts groups of serovars that have monophyletic and polyphyletic lineages [10].
Accurate and precise detection of Salmonella in foods is fundamental to ensuring a safe and adequate food supply. Diagnostic tests lacking sensitivity can lead to false negative results through a failure to detect the pathogen. Conversely, results lacking specificity can lead to false positive results through the misidentification of a non-pathogenic organism as a pathogen. During the detection of Salmonella in food, false positive results frequently arise due to the presence of Citrobacter [11,12]. Members of this genus resemble Salmonella more than any other genera in the family Enterobacteriaceae [13,14]. The two genera possess similar metabolic and antigenic properties. Furthermore, given the diverse nature of both genera, atypical strains may lack typical diagnostic features or may have acquired novel ones that confound their precise identification [1,13,15,16].
As with Salmonella, Citrobacter is frequently isolated from soil, water, and the digestive tract of animals [17]. Werkman and Gillen originally described members of this genus as intermediate forms between the genera Escherichia and Aerobacter and defined seven species of Citrobacter based on carbohydrate fermentation and gelatin liquefaction [18]. The inclusion of additional intermediate strains bearing similarities to other bacterial species significantly increased the complexity of the genus [19]. DNA hybridization analysis of 116 strains of Citrobacter defined 11 genomospecies within the genus [15]. Subsequent analysis of the 16S rRNA sequence and multilocus sequence typing (MLST), indicated three phylogenetic groupings for Citrobacter: I (C. freundii, C. youngae, C. braakii, C. werkmanii, C. gillenii, and C. murliniae); II (C. amalonaticus, C. farmeri, C. sedlakii, and C. rodentium); and III (C. koseri) with groups II and III being more closely related to Salmonella and other Enterobacteriaceae than group I [16].
Phylogenetic analysis of Enterobacteriaceae family members illustrate the complex relationship between closely related genera. The polyphyletic origins of Citrobacter have been shown by comparing the sequences of four Citrobacter species with those of several Salmonella, Escherichia, Klebsiella, Enterobacter, and Shigella strains [20]. Additional work by Retchless and Lawrence demonstrated the phylogenetic incongruence or conflicting topologies in the clade containing Escherichia coli, Salmonella, and Citrobacter [21]. This incongruence may be attributed to recombination between the ancestral taxa of these organisms and the gradual acquisition of recombination barriers. Different regions of the genomes, having different rates of recombination, may undergo genetic isolation due to sequence divergence and ecological adaptation. The gradual formation of barriers to recombination may eventually give rise to heterogeneous subpopulations with orthologous genes of differing lineages and less clear-cut species definitions [21][22][23]. The ambiguous relationships between Salmonella, Citrobacter, and E. coli present a practical issue in the area of food safety testing and medical diagnosis, as accurate species identification is critical for confirming the presence of pathogens.
To gain insights into the genetic features delineating Salmonella and Citrobacter, the genomic features of a set of 58 S. enterica and 37 Citrobacter strains were investigated. This strain collection was chosen to represent the diversity of these two bacterial groups that might be expected on Canadian produce and included five Citrobacter strains that were falsely identified as Salmonella during initial testing. The objectives of the analysis were to define the core and pan genome of this population of isolates and to identify regions contributing to their diversity that may have impacted the evolution of Salmonella and Citrobacter.

Characterization of Citrobacter strains falsely identified as Salmonella
During routine testing of lettuce samples, an industrial producer isolated five presumptive strains testing positive for Salmonella using a commercially available rapid identification method. Upon further investigation, these strains were determined to be Citrobacter and provided for us to use in this study. Morphologically, these five strains (S646, S647, S648, S1284, and S1285) produced colonies indicative of Citrobacter on the following selective/differential agar media; xylose lysine deoxycholate, brilliant green sulfa, triple sugar iron, and lysine iron agars. The five strains were all positive for glucose and sucrose fermentation, gas production, hydrogen sulfide production, and negative for lysine decarboxylation. All strains were negative for agglutination with the Salmonella O antiserum poly A-I + vi. However, the five strains produced positive reactions to the Salmonella Latex Test kit (Table S1). Variable results were also noted with tests based on biochemical and metabolic profiles (Table S1). API20E provided identification profiles for four of the five strains as C. freundii (90.8% identification accuracy, T = 0.65) while strain S1284 did not have a valid identification profile. The closest identified taxon for S1284 was C. braakii with 74.4% identity and a T value of 0.49. VITEK 2 identified S1284 as C. sedlakii with an 86% match but was unable to provide an identity for strains S646, S647, S648, and S1285. These same strains were only identified to the genus level as Citrobacter (scores ranging from 2.1-2.2) using the MALDI Biotyper whereas strain S1284 was identified as C. braakii (score = 2.5).
A maximum likelihood phylogenetic tree based on the 16S rRNA sequence of the five false positive Citrobacter strains, seven unspeciated Citrobacter produce isolates and members of the family Enterobacteriaceae (Table S2 and Figure S1) demonstrated that members of the Citrobacter genus did not form a distinct clade. Strain S1284 was in a clade with C. werkmanii NBRC 105721 while strains S646, S647, S648, and S1285 appeared to be closely related to C. koseri ATCC BAA895. The unspeciated produce isolates were in separate nodes on the tree (Groups 1 and 2) and clustered with C. braakii and C. freundii. The tree was characterized by low nodal support, providing limited taxonomic resolution between Citrobacter, E. coli, and Salmonella.
The average nucleotide identity (ANI) based on pairwise genome sequence comparison of the five Citrobacter strains with Salmonella and Citrobacter reference genomes (Table S3) was between 84 and 86% for strains S646, S647, S648 and S1285, which precluded the determination of their species identity. Citrobacter strain S1284 had an ANI of 94%, which was just below the threshold value for genetic similarity to C. braakii.

Genomic analysis of Salmonella and Citrobacter pangenome
A comparative analysis of the 58 Salmonella and 37 Citrobacter genomes was performed to gain insight into the genetic differences between these two groups of bacteria. The 58 S. enterica ssp. enterica strains chosen for this study encompass the diversity of the subspecies and include strains that may be associated with fresh produce in Canada (Table 1). Of the 38 serovars selected, 19 included representatives of serovars frequently associated with clinical salmonellosis in Canada. The five most commonly reported Canadian serovars were represented by multiple strains; Enteritidis (n = 5), Heidelberg (n = 4), Javiana (n = 4), Newport (n = 5), and Typhimurium (n = 7). Serovars Typhi, Paratyphi A, and Gallinarum are host-restricted and the remaining 16 serovars (Table 1, grey box) are rarely associated with human disease. The 37 Citrobacter strains chosen for this study included human clinical isolates and food isolates including five strains that were falsely identified as Salmonella (Table  1, blue box). The genome size of the individual S. enterica and Citrobacter spp. strains ranged from 4 to 5 Mb and contained 4000 to 5000 genes (Table S4).
The pan-genome of all 95 strains consisted of 31,130 protein-coding genes. Reciprocal comparison of Table 1 List of strains and genome sequences from the Salmonella enterica Foodborne Syst-OMICS (SalFoS) database that were used in this study. Highlighted strains indicate false positive Salmonella strains (blue) and rare Salmonella serovars (grey) orthologous coding sequences in both genera identified 1540 core gene sequences, encoded by approximately a third of each isolate's genome (Table S5). These results were supported by a rarefaction analysis that demonstrated that the number of shared gene families decreased as more genomes were included in the analysis and reached a plateau at approximately 1500 genes (Fig. 1). Analysis of the orthologous relationships between the pan-genome revealed the presence of 195 Citrobacter gene sequences that did not have orthologues in Salmonella (i.e. Citrobacter unique gene sequences), while the corresponding number of gene sequences unique to Salmonella was 1112 (Table S5).
A maximum likelihood phylogenetic tree based on the genomic variations in the amino aicd sequences of the Salmonella-Citrobacter core genome effectively divided the 95 strains into one of two clades (Fig. 2). The topology of the Salmonella taxon differed from that of Citrobacter, showing low bootstrap values in some of the internal nodes and shorter genetic distances. The Salmonella strains grouped into 3 major clades with the first clade consisting of Javiana and 8 other serovars. A second diverse clade of strains contained four serovars rarely reported in Canada (Tado, Fresno, Duesseldorf, and Elisabethville), as well as representatives from serovars Kentucky, Senftenberg, and Tennessee. The largest clade resolved the two typhoidal strains from the rest of non-typhoidal Salmonella (NTS) strains, with the rare serovar Taiping being most related to Typhi and Paratyphi A. Within the NTS clade, terminal nodes with the highest support included the clade consisting of the five Enteritidis and one Gallinarum strain, a result consistent with previous studies, as well as clades containing Typhimurium and Heidelberg [24,25]. Two strains, Newport (S197) and Javiana (S200) were found in clades distinct from those containing other Newport or Javiana strains. The serogroup designation of these two strains were confirmed according to the Kauffman-White serotyping scheme [26]. Also included in the NTS clade were the rare serovars Weston and Chingola that shared an ancestor with strains belonging to serovars Newport and Litchfield. Polyphyletic lineages included serovars Javiana and Newport whose respective members were found in other clades. The core genes of serovar Pasing were the most divergent of the 58 Salmonella strains and formed the outgroup for this population.
Overall, the topology of the Citrobacter group bore resemblance to previous 16S rRNA and MLST-based trees that grouped 11 Citrobacter genomospecies into three previously described species complexes [15,16]. This portion of the tree had well-supported nodes with the population falling into one of three major clades. The first clade contained the two C. koseri strains. The second clade contained a group of C. amalonaticus and C. farmeri strains that were closely related to four of the five strains falsely identified as Salmonella (S646, S647, S648, and S1285) (Fig. 2, green dots). Strain S1284 (red dot) was located in the third clade. It branched off from a larger group containing C. braakii, C. freundii, and a number of unspeciated strains. Several C. freundii strains Fig. 1 Estimation of the Salmonella-Citrobacter pan-and core genome size. Rarefaction analysis demonstrates that the average number of gene clusters in the pan-genome increases while the conserved core gene clusters reaches a plateau as more genomes are analyzed were found in different clades, confirming the polyphyletic nature of this species.
The heterogeneity and polyphyletic origin of Citrobacter species precluded the accurate taxonomic classification of the unspeciated strains in Fig. 2. Therefore, the core genomes of the Citrobacter strains in this study were compared to those of 160 Citrobacter strains from the RefSeq database (Table  S6). A maximum likelihood phylogenetic tree based on the variations in the amino acid sequences of the core genome resolved the population into twelve clades (Fig. 3a-l). With a few exceptions, one species was predominant in each clade. This grouping pattern enabled the identification of many of the Fig. 2 Consensus tree based on Salmonella-Citrobacter core genome. Maximum likelihood phylogenetic trees were constructed from multiple alignments of concatenated core gene sequences of Salmonella and Citrobacter using RaXML. The consensus tree divided the Salmonella and Citrobacter strains into two distinct clades. The tree confirms that the false positive strains (red and green dots) belong to the genus Citrobacter. Genetic distance is defined by the scale and bootstrap values indicate percentages of 1000 replicates unspeciated Citrobacter strains (Fig. 3, yellow dots). Eleven of the twenty-one unspeciated strains clustered with C. braakii strains (clade C). However, the species designation for strain S96 remained unclear.
Although this strain was found in the same cluster, where the majority of the strains were C. braakii, its proximity to several C. freundii and one C. amalonaticus obscured its identification. Four of the five Fig. 3 Citrobacter core genome tree. A maximum likelihood phylogenetic tree constructed from the concatenated core amino acid sequences of 197 Citrobacter strains was used to determine the taxonomic designation of the unspeciated Citrobacter strains in this study (yellow dots). Genome distance estimation was performed using MASH and the core genome was analyzed using centreseq [27]. genes. Genetic distance is defined by the scale and bootstrap values indicate percentages of 1000 replicates. Triangles denote collapsed nodes with n representing the number of strains found in each triangle. For each triangle, the letters represent the group of related strains found in the branches of the node Citrobacter strains that were falsely identified as Salmonella were found in a cluster with several C. amalonaticus strains (clade I). The remaining false positive strain, S1284, was found in a clade containing a mixture of several C. freundii strains, three C. europaeus strains, and one C. braakii strain (clade B). However, within this clade, S1284 clustered with a C. freundii strain suggesting that the two are closely related.

Functional analysis of Salmonella-Citrobacter core genes
The functional classification of the following three sets of gene sequences were analyzed (Table S7); the Salmonella and Citrobacter core gene sequences, the Salmonella-specific unique gene sequences, and Citrobacter-specific unique gene sequences. The core gene sequences had 1525 predicted functions in 176 KEGG pathways. A significant number of these sequences, approximating 11% of the predicted core gene functions were designated as either poorly characterized or could not be classified into a KEGG pathway or BRITE hierarchy. For the unique gene sequences specific to either Salmonella or Citrobacter, the respective totals were 521 and 135 predicted roles in 117 and 60 pathways. Poorly characterized genes made up a respective 28 and 16% of the Salmonella specific and Citrobacter specific unique gene sequences.
The majority of the core gene sequences were predicted to encode proteins with roles in the major KEGG categories of metabolism (60%), environmental information processing (14%), and genetic information processing (10%) (Fig. 4). Of the metabolic functions, the three major categories represented were for the metabolism of carbohydrates (17%), amino acids (10%), and energy (9%). In comparison, the major KEGG categories represented in the Salmonella-specific unique gene set were metabolism (54%), environmental information processing (18%), and human diseases (12%). Of the metabolic processes, the three major categories were carbohydrate metabolism (13%), amino acid metabolism (9%), and the metabolism of cofactors and vitamin (9%). In contrast, the majority of the functions unique to Citrobacter were Fig. 4 Functional classification of Salmonella and Citrobacter core and genus-specific unique genomes. Salmonella-Citrobacter core genes are depicted in (a), Salmonella-specific unique genes in (b), and Citrobacter-specific unique genes in (c). Left, proportion of genes represented in each KEGG major category. Right, proportion of putative metabolic genes represented in each KEGG metabolism sub-category predicted to occur in two major categories; metabolism (64%) and environmental information processing (24%). As with Salmonella, the three major categories within metabolism were carbohydrate (22%), amino acid (8%), and cofactors and vitamin metabolism (7%).
The Salmonella-specific gene sequences were proportionately higher in some global metabolic pathways such two-component system, ABC transporters, lipopolysaccharide biosynthesis, porphyrin and chlorophyll metabolism, and sulfur metabolism. In addition, 20 Salmonella-specific sequences were classified under 12 pathways that were not present in the core and Citrobacter-specific gene sequence sets (Table S7, orange), which include geraniol degradation, naphthalene degradation, bacterial infectious disease, and pathways related to organismal systems and human diseases. Predicted metabolic activities of Citrobacter-specific gene functions had higher representation in pathways involved with ABC transporters, sulfur metabolism, and glyoxylate and dicarboxylate metabolism and two sequences were uniquely categorized under the atrazine degradation pathway (Table S7, orange).

Network analysis
Network analysis of the Salmonella-Citrobacter pangenome provided a complementary view of the evolutionary relationships between the population of strains. Figure 5 illustrates the genetic network of the Salmonella and Citrobacter strains based on the proportion of protein families shared between the two groups of bacteria. At 26% shared gene families, all of the strains, regardless of their taxonomic designation, formed one community of interconnected members (Fig. 5a). This recapitulates the results of the Salmonella-Citrobacter comparative genome analysis, wherein the core gene sequences made up~30% of individual genomes. Increasing the proportion of shared sequences by 1% depicted a community with all the Salmonella (blue) strains on one end of the network while the Citrobacter strains (orange, red, and green) formed a gradient of decreasing interconnections with Salmonella (Fig. 5b). A threshold of 29% shared gene sequences (Fig. 5c) separated Salmonella and Citrobacter into two communities that remained connected due to the similarities of several Salmonella strains with individual Citrobacter strains. As The percentage of shared genes is based on the number of genes shared by two genomes divided by the maximum number of genes found in the genome. Colors indicate the following strains: blue -Salmonella, orange -Citrobacter, red -Citrobacter strains S646, S647, S648, S1285, green -Citrobacter strain S1284 the percent threshold was increased to 34%, two distinct communities became apparent and shared no homologues (Fig. 5d). However, the network also depicted the separation of four of the Citrobacter strains that were falsely identified as Salmonella (red) from the rest of the Citrobacter network. However, the Citrobacter strain S1284 (green) that was also falsely identified as Salmonella maintained many connections within the larger Citrobacter community. In contrast, all of the Salmonella strains maintained one network due to their genetic similarity.

Discussion
Analysis of the core and pan-genome of a population of Salmonella and Citrobacter strains offered a reliable delineation of the fuzzy taxonomic boundaries between these closely related bacteria. As the most stable portion of the pan-genome, the core genome encodes characteristics common to all members of a species [28,29]. Therefore, species identities are maintained despite considerable genomic flux that may occur between and within species. A recent study on the genomic structure of the Rickettsiales defined criteria to define genus and species assignments based on core genome alignments, with alignments ≥10% of the average input genome length and ≥ 96.8% identity respectively [30]. Here we show that despite the stringent parameters that were used to identify orthologous genes, approximately one third of each genome in this study was composed of core genes suggesting relationships between genera belonging to the family Enterobacteriaceae may be closer than in other bacterial families. A consensus tree based on sequence variations of the amino acid sequences encoded by the Salmonella-Citrobacter core genes was able to resolve the two bacterial species into two distinct clades and provided sufficient delineation to infer the taxonomic affiliation of the Citrobacter strains. Analysis of 37 Citrobacter strains proved insufficient in determining the taxonomic assignment of many of the unspeciated strains in this study. Adding more Citrobacter genomes to the analysis greatly increased the resolving power of the analysis. However, the identity of one strain, S96, remained ambiguous due to the uncertainty surrounding the identity of its closest homologues. Of the Citrobacter genomes investigated, the ones belonging to C. freundii appeared to be the most divergent, appearing in seven of the 12 clades. Given the complexity of Citrobacter taxonomy, it may be possible that some strains in the RefSeq database were classified erroneously. However, this observation also calls into question the utility of C. freundii as a species designation since based on our analysis, it appears to be a collection of highly disparate strains.
According to the results of the rarefaction analysis, the core genome defined in this study is believed to reflect the diversity of Salmonella and Citrobacter strains that may be associated with food commodities, specifically fresh produce in Canada. Many Salmonella-Citrobacter core gene sequences were predicted to have roles in central cellular processes including DNA replication, transcription, translation, cell division, and key metabolic pathways such as glycolysis and the tricarboxylic acid cycle. In contrast, the functional predictions of some genus-specific unique gene sequences were associated with activities required to thrive in a specific niche, such as specialized metabolic pathways. A limitation of the current study is that the KEGG database is primarily intended to catalogue eukaryotic (human) cellular functions. Thus, many bacterial gene functions are not represented. Just over half of the 1112 sequences identified as unique to our collection of Salmonella was mapped to cellular pathways. Given the redundancy expected when the activity of one gene makes contributions to multiple pathways, this number is reduced further. Continued study of bacterial gene functions along with the creation and continued curation of functional classification databases for bacterial genes would enable the more comprehensive and precise predictions of bacterial gene functions.
The stability of the core genome makes it an indispensable tool for defining the genetic integrity of bacterial populations. However, it is ill suited for investigating evolutionary relationships between closely related strains. A consensus tree based on the core genes of Salmonella and Citrobacter was not able to resolve the basal relationships between Salmonella serovars. There was greater support for the evolutionary relationships observed between many of the Citrobacter strains in our study, presumably due to the greater diversity of that population. However, many highly related strains were poorly resolved including the false positive strains and other isolates from fresh produce. It is possible that the overall strain diversity reported here is low as it is biased towards strains that were isolated through the context of food safety testing and outbreak investigations. In the absence of alternative sampling plans, the discriminatory power of phylogenetic trees can be increased by reducing the stringency in defining the core genome, or by the inclusion of intergenic or accessory regions. In this regard, public health laboratories have been successfully applying gene-by-gene comparison methods such as whole genome MLST (wgMLST) to outbreak investigations [31][32][33].
Accessory and singleton gene sequences represent the most dynamic regions of the genomes. They encode mobile elements, as well as many small proteins with putative or hypothetical functions. These elements evolve quickly and are believed to confer niche-specific selective advantages, as they are continually being lost and gained among strains of multiple lineages [34,35]. Phylogenetic network analysis provides a visual representation of events that drive bacterial evolution, such as the loss and gain of genes within the pan-genome. These events occur via recombination and horizontal gene transfer (HGT) and are not captured in a phylogenetic tree due to the non-tree-like nature of these processes [36]. Visualization of network connections can be altered by varying the threshold for the number of protein families shared by pairs of genomes within the population providing insight into the evolution of individual species. Reconstruction of networks based on prokaryotic genomes revealed that closely related taxa are strongly interconnected and form a distinct phylogenetic community that promotes gene sharing [37]. Moreover, closely related species that are found in the same environment, such as Salmonella and Citrobacter, are more interconnected and genetically similar [38]. Genetic exchange between members of the community network, through recombination or HGT, can contribute to the acquisition of traits enabling survival in fluctuating environments and potentially giving rise to strains with atypical biochemical, phenotypic, and antigenic properties [20,39]. The food environment presents a hostile landscape for microorganisms due to injury or stress caused by food preservation methods and intrinsic factors such as pH, temperature, and competition with other microflora [40]. These conditions induce adaptive responses for survival, such as stress tolerance and/or expanded metabolic capabilities [41]. Our results indicate that genetic exchange between Salmonella and Citrobacter could have contributed to their similarities as well as the diversification of their respective genera.

Conclusions
Bacterial species determination was initially based on phenotypic properties. However, phenotypic test results are often discrepant. Phenotypic characterization of the five Citrobacter strains that were misidentified as Salmonella provided variable results and were unable to unambiguously determine the species identity. Bacterial classification based on mass-spectrometry can complement standard phenotypic and biochemical approaches. However, the non-genomic nature of this method limits the phylogenetic investigation of closely related species or strains. This method is also limited by the comprehensiveness of the database used to compare spectra, which to date are heavily biased towards medically relevant pathogens [42]. Classification using 16S rRNA sequences and ANI provided limited taxonomic resolution and thus, less confidence in species identification [43]. The limited variability in the conserved region of the 16S rRNA genes in closely related enteric bacteria results in a decreased ability to resolve relationships below the family level [35]. Core and pan-genome analysis, however, were more informative in delineating Salmonella and Citrobacter. Both methods provided complementary but congruent results. The core genome analysis of Citrobacter highlighted the challenges associated with species designation, particularly in regards to C. freundii, which did not constitute a discrete phylogenetic group. Further taxonomic inquiry is needed to clarify the lineage and typing of its members. Our results showed that four of the five Citrobacter strains falsely identified as Salmonella were phylogenetically similar and the prevalence of these strains in fresh produce and other food products deserves further investigation to help improve the detection of Salmonella. The insights gained from the analyses of our study can be used to develop robust molecular assays capable of rapid identification and discrimination of Salmonella from Citrobacter or other closely related non-pathogens.

Methods
Bacterial strains, genome sequencing, and assembly Table 1 provides a list of the 37 Citrobacter spp. and 58 S. enterica subsp. enterica strains included in the study. The 58 S. enterica enterica strains in this study were chosen to encompass the diversity of the subspecies and represent serovars frequently associated with outbreaks as well as serovars isolated from food and rarely associated with human disease (grey box) [2,3]. Of the 38 serovars selected, 19 included representatives of serovars frequently associated with clinical salmonellosis in Canada. The five most commonly reported Canadian serovars were represented by multiple strains; Enteritidis (n = 5), Heidelberg (n = 4), Javiana (n = 4), Newport (n = 5), and Typhimurium (n = 7). The serovars Typhi, Paratyphi A, and Gallinarum are host-restricted while the remaining 16 serovars (Table 1, grey box) are rarely associated with human disease.
Six Citrobacter species (C. amalonaticus, n = 2; C. braakii, n = 5; C. farmeri, n = 1; C. freundii, n = 6, C. koseri, n = 2; and C. werkmanii, n = 1) were represented among the 37 Citrobacter strains sequenced in this study. Twenty-one strains did not have a species designation and five of these strains were isolated from fresh produce as part of a producer's food safety testing program and originally misidentified as Salmonella using a commercial immunoassay test (S646, S647, S648, S1284, S1285, Table 1, blue box).
All strains were sequenced at the Plateforme d'Analyses Génomiques of the Institute for Integrative and Systems Biology (IBIS), Université Laval, Quebec, Canada using the procedures described by the Salmonella Syst-OMICS consortium [44]. Raw sequence reads were downloaded from the SalFos database and de novo genome assembly was performed using SKESA (v 2.3.0) [45]. QUAST was used to assess the quality of assembled genomes [46], while gene calling and annotation were done using the Prokka software v1.12 [47]. All nucleotide sequence data and additional information on the strains are available from the Salmonella Foodborne Syst-OMICS database (SalFoS) (https://salfos.ibis.ulaval.ca/).

Phenotypic analysis of false positive isolates
The five Citrobacter strains that were falsely identified as Salmonella were cultured on tryptic soy agar (BD Difco, NJ, USA) overnight at 35°C. Isolates were grown in xylose lysine deoxycholate agar (XLD), brilliant green sulfa agar (BGS), triple sugar iron agar (TSI), and lysine iron agar (LIA) for 24 h at 35°C to determine colony morphologies indicative of S. enterica or Citrobacter. Serological reactions were assessed using Salmonella O antiserum poly A-I + vi (BD Difco, NJ, USA) and the OXOID Salmonella Latex Test (Thermo Fisher, Hampshire, UK) following the manufacturer's protocols. S. enterica ser. Enteritidis S187 and C. freundii ATCC8090 was used as a positive control. Biochemical characterization was carried out using API20E (bioMérieux, Marcy l'Étoile, France) and the VITEK 2 automated microbial identification system (version 07.01, bioMérieux, Marcy l'Étoile, France) using the GN identification card and following the manufacturer's instructions.
The strains were further characterized using a Biotyper matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer (Bruker Daltonik, Bremen, Germany) following a standard extraction procedure. Briefly, fresh colonies were applied onto a Biotarget plate in duplicate, allowed to dry, and overlaid with freshly prepared α-cyano-4-hydroxy-cinnamic acid (HCCA) matrix. Mass spectra were acquired and the strains were classified using the MALDI Biotyper 3.1 software and Biotyper taxonomy database version 6 (6903 entries, BDAL, Bruker Daltonik). S. enterica ser. Enteritidis S187, C. freundii ATCC 8090 and E. coli ATCC 25922 were used as controls. The software compares the peak profile of a sample's mass spectrum with a reference spectrum in the database and calculates an arbitrary unit score value from 0 to 3 according to a proprietary algorithm. High scores denote similarity between the sample and reference spectra. Classification criteria indicate genus and probable species-level identification for scores of 2.000 to 2.299 and highly probable species identification for values of 2.300 to 3.000.

Comparative genomic analysis
Annotated Citrobacter and Salmonella protein assemblies were clustered and filtered for redundancy using MMseqs2 [48]. An in-house script called centreseq was used to parse the data [27]. After removal of duplicate genes, the protein sequences were concatenated and reclustered using the MMseqs2 linclust algorithm. Cut-off values for percent amino acid identity and alignment length were both set at 90%. The complete set of gene sequences identified in all of the genomes was defined as the pangenome. The core genome represented sequences that were present in all of the analyzed genomes and the accessory genome was the set of sequences that were present in a sub-set of at least two genomes. Gene sequences that were only present in one genome were termed singletons and gene sequences that were exclusively found in either Salmonella or Citrobacter with no orthologue in the other genus were defined as genusspecific unique gene sequences.
A rarefaction analysis was performed to estimate the size of the core genome of Salmonella and Citrobacter. The analysis was achieved by randomly selecting subsets (ranging 2-95 genomes) of the input samples and calculating the core-and pan-counts. The core-and pancontents of each subset were calculated five times and the results averaged to create one data point. Curve fitting was done using nonlinear regression (one phase decay) in Graphpad Prism version 5.0b for Mac OS X (Graphpad Software, San Diego, California, USA). The point of the curve at which additional number of genes incorporated into the core genome relative to the number of analyzed genomes begins to plateau indicates the value needed for complete core genome representation for all included genomes.
Phylogenetic trees based on variations in the amino acid sequences encoded by the core genome were constructed using the tree module of centreseq [27]. Essentially, the core genome amino acid sequences were extracted and aligned using MUSCLE [49]. The alignment file was concatenated and used to construct a maximum likelihood phylogenetic tree using RAxML v8.2.9 [50] with 1000 bootstrap iterations. The numbers on the internal nodes of the tree represent the proportion of individual trees that are congruent with the consensus tree and indicate support for the separation of the taxa at that particular node.
Citrobacter sequences used to construct the Citrobacter specific phylogenetic tree were downloaded from RefSeq in June 2019 (n = 326) [51]. Genome distance estimation was performed using MASH [52]. Strains with low distance scores and in which Citrobacter was not the best hit were removed from the dataset (n = 18). Of the remaining RefSeq genomes, 160 sequences were chosen for further analysis to represent the 13 Citrobacter species depicted on the tree (Table S6).
Functional activities were assigned to the predicted amino acid sequences of the Salmonella-Citrobacter core and unique genes by mapping predicted activities to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database and BRITE hierarchies using the KEGG Automated Assignment Server (KAAS) at https:// www.genome.jp/kegg/kaas/ [53].
The percentage of shared genes among genomes of the 96 strains were visualized using the pairwise report and network chart features of centreseq [27]. Pairwise information on the core cluster counts were visualized by linking pairs of genomes that share genes at or below a given percentage threshold.