Analysis of the Pantoea ananatis pan-genome reveals factors underlying its ability to colonize and interact with plant, insect and vertebrate hosts

Background
Pantoea ananatis is found in a wide range of natural environments, including water, soil, as part of the epi- and endophytic flora of various plant hosts, and in the insect gut. Some strains have proven effective as biological control agents and plant-growth promoters, while other strains have been implicated in diseases of a broad range of plant hosts and humans. By analysing the pan-genome of eight sequenced P. ananatis strains isolated from different sources we identified factors potentially underlying its ability to colonize and interact with hosts in both the plant and animal Kingdoms.
Results
The pan-genome of the eight compared P. ananatis strains consisted of a core genome comprised of 3,876 protein coding sequences (CDSs) and a sizeable accessory genome consisting of 1,690 CDSs. We estimate that ~106 unique CDSs would be added to the pan-genome with each additional P. ananatis genome sequenced in the future. The accessory fraction is derived mainly from integrated prophages and codes mostly for proteins of unknown function. Comparison of the translated CDSs on the P. ananatis pan-genome with the proteins encoded on all sequenced bacterial genomes currently available revealed that P. ananatis carries a number of CDSs with orthologs restricted to bacteria associated with distinct hosts, namely plant-, animal- and insect-associated bacteria. These CDSs encode proteins with putative roles in transport and metabolism of carbohydrate and amino acid substrates, adherence to host tissues, protection against plant and animal defense mechanisms and the biosynthesis of potential pathogenicity determinants including insecticidal peptides, phytotoxins and type VI secretion system effectors.
Conclusions
P. ananatis has an ‘open’ pan-genome typical of bacterial species that colonize several different environments. The pan-genome incorporates a large number of genes encoding proteins that may enable P. ananatis to colonize, persist in and potentially cause disease symptoms in a wide range of plant and animal hosts.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-15-404) contains supplementary material, which is available to authorized users.


Background
Pantoea ananatis is a member of the family Enterobacteriaceae and is characterised by its ubiquity in nature and its frequent association with both plant and animal hosts. It has been found in a wide array of environments including rivers, soil samples, refrigerated beef and aviation fuel tanks [1,2]. P. ananatis is most frequently isolated from plant materials, including roots, leaves and stems of a broad range of plant hosts, and exists as part of the epiphytic and endophytic flora [1]. Since its identification as the causal agent of fruitlet rot of pineapple in the Philippines in 1928 [3], P. ananatis has been implicated in diseases of a wide range of hosts including maize, onion, Eucalyptus, sudangrass and honeydew melon [1,4]. Individual isolates also appear to be capable of causing disease symptoms on a wide range of hosts. For example, strains pathogenic on rice and pineapple were demonstrated to cause blight symptoms on onion [4]. Conversely, some P. ananatis strains have been shown to promote plant growth [5,6]. P. ananatis strains have also been found associated with insects, including tobacco thrips that act as vectors of onion-pathogenic strains, mulberry pyralids, ticks and fleas, demonstrating its ability to persist in invertebrate hosts [7-9]. Its implication in human infections reveals its capacity for proliferation and potential to cause disease in a vertebrate host [10,11]. The ubiquity of P. ananatis suggests that it has adapted to proliferate in a wide range of environments, and its isolation from both plant and animal hosts indicates that it has adapted for cross-Kingdom colonization and pathogenesis.
The concept of the pan-genome was introduced in 2005 [12]. The pan-genome of a bacterial species can be defined as the global gene repertoire of the species. It consists of a core genome, representing the genes present in all strains of the species, and an accessory genome composed of genes that are unique to particular strains as well as those genes that are absent from one or more of the sequenced strains [12,13]. Core genes encode proteins that are generally involved in crucial cellular processes and are thus mostly vertically transferred from parent to progeny. Accessory genes, on the other hand, are prone to lateral gene transfer, and often encode functions related to niche adaptation [12-14]. The microbial pan-genome of a species can further be considered as 'open' or 'closed'. A 'closed' pan-genome is highly conserved, and is typically associated with bacterial species that live in select niches, where they are secluded from the overall microbial gene pool or have a diminished capacity to acquire genes, such as Bacillus anthracis and Mycobacterium tuberculosis [13,14]. By contrast, an 'open' pan-genome is observed for bacterial species that can colonize and exploit several different environmental niches and can expand their accessory genome and pan-genome through different means of lateral gene transfer [12-14]. Four complete and four draft genomes of P. ananatis strains, isolated from various environmental sources and with diverse lifestyles, have recently been sequenced. Here we have determined and characterized the open P. ananatis pan-genome and show its adaptive capacity to interact with hosts of both the animal and plant Kingdoms.
Results and discussion
P. ananatis genome statistics
The genomes of eight P. ananatis strains, isolated from different geographic locations and sources, were sequenced [6,15-18]. These include four complete and four high-quality draft genome sequences (Table 1). The genomes consist of a chromosome, 4.39-4.61 Mb in size with an average G + C content of 53.7%, together with a large plasmid, pPANA1, which is 281-353 kb in size and has an average G + C content of ~52% (Table 1). An estimated total of 4,225-4,415 CDSs are encoded on the combined chromosome and pPANA1 plasmid of each of the strains. Reciprocal Best BlastP Hit (RBBH) analysis [19] was used to undertake pairwise comparisons of the protein complements encoded on the genomes of the eight strains, using orthology cut-off values of >70% sequence identity and >70% sequence coverage between the query and hit [20]. This analysis indicated that between 89.3 and 95.7% of the proteins encoded on the genome of one strain have orthologs encoded on the genomes of the other strains, with an average amino acid identity of 99.4% (Figure 1, Additional file 1: Table S1 and Additional file 2: Table S2). These results suggested that an extensive, highly conserved core genome, encompassing the majority of proteins encoded on each individual genome, exists among P. ananatis strains. Similar values were observed for Erwinia amylovora, where on average 89% of the mean 3,819 CDSs predicted for each of twelve compared genomes are core, with an average amino acid identity of >99% between the core CDSs. By contrast, only 46.7% of the CDSs are core among three commensal and fourteen pathogenic strains of Escherichia coli (Table 2) [21,22]. Alignment of the combined chromosome and pPANA1 nucleotide sequences also showed that extensive synteny exists among the four complete P. ananatis genomes (Additional file 3: Figure S1). Pairwise comparison of the P. ananatis protein sets against the total proteins encoded on the genome of the biological control strain P. vagans C9-1 [23] further indicated that an extensive set of core proteins may be conserved among different species within the genus Pantoea. Between 73.1 and 74% of the proteins encoded on each of the eight P. ananatis genomes shared orthology with proteins encoded on the genome of P. vagans C9-1, with an average amino acid identity of 84.8% (Additional file 1: Table S1 and Additional file 2: Table S2).
Table 1 The isolation source and predicted biological roles for each strain are shown. The P. ananatis genome consists of a chromosome and a large plasmid (LPP-1). The sizes, G + C contents (%) and number of proteins encoded on each replicon are indicated. a indicates the genomes for which genome sizes were estimated based on the sum of contig lengths.
P. ananatis has an 'open' pan-genome
The pan-genome for the eight sequenced P. ananatis strains was determined by BlastP comparison of the translated CDS sets, clustering of orthologous proteins, and addition of a representative of each ortholog cluster and of the strain-unique CDSs to the total pan-genome. The combined pan-genome for the eight compared P. ananatis strains encompasses 5,566 CDSs. Of these, 3,876 (69.6% of total CDSs) are core to all compared genomes (Figure 1). This implies that on average 89.5% of the CDSs encoded on the genome of each of the eight strains form part of the core genome determined for the eight compared strains. A total of 1,690 CDSs (30.4% of the pan-genome total) make up the accessory fraction (Figure 1), of which an average of 108 CDSs are unique to each genome among the eight compared strains. The sequential addition and BlastP comparison of the CDS sets of each of the eight P. ananatis strains, and BlastP comparison of all possible combinations as a function of the number of genomes n (where n = 2→8), were used to determine the number of core CDSs, the total pan-genome CDSs, and the strain-unique CDSs for each of the combinations. The resultant values were fitted to an exponential decay function [12] (Figure 2).
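The reciprocal best hit criterion used for the pairwise comparisons above can be illustrated with a minimal sketch. It assumes the BlastP results have already been parsed into simple records; the `Hit` layout, the function names and the use of a single coverage value per alignment are assumptions made for illustration and are not taken from the study's pipeline. Only the strict >70% identity and >70% coverage cut-offs come from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One BlastP alignment (hypothetical record layout for this sketch)."""
    query: str
    subject: str
    identity: float  # percent amino acid identity
    coverage: float  # percent of the sequence covered by the alignment
    bitscore: float


def best_hits(hits: Iterable[Hit], min_identity: float = 70.0,
              min_coverage: float = 70.0) -> Dict[str, str]:
    """Map each query to its highest-scoring subject that passes both cut-offs."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        # The text uses strict '>' thresholds for identity and coverage.
        if hit.identity <= min_identity or hit.coverage <= min_coverage:
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Toy data: A2 fails the identity cut-off, so it has no ortholog in B.
a_vs_b = [
    Hit("A1", "B1", 99.0, 100.0, 800.0),
    Hit("A1", "B2", 75.0, 90.0, 300.0),
    Hit("A2", "B2", 65.0, 95.0, 250.0),
    Hit("A3", "B3", 90.0, 80.0, 500.0),
]
b_vs_a = [
    Hit("B1", "A1", 99.0, 100.0, 800.0),
    Hit("B2", "A1", 75.0, 90.0, 300.0),
    Hit("B3", "A3", 90.0, 80.0, 500.0),
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy example B2's best hit is A1, but A1's best hit is B1, so the pair is not reciprocal and B2 is left without an ortholog.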
Figure 1 Comparison of the CDS sets of eight sequenced P. ananatis strains. The diagram was constructed using GenomeDiagram [70] on the basis of RBBH BlastP comparison of the CDS sets of each of the sequenced P. ananatis strains against the pan-genome CDS set. The strain order, from outside to inside, is AJ13355 (red), B1-9 (orange), BD442 (yellow), LMG20103 (green), LMG2665T (purple), LMG5342 (purple), PA13 (light green) and PA4 (light blue). Core CDSs are colored as black arrows, while accessory CDSs are colored in relation to the strain. The pPANA1 plasmid CDSs are demarcated by red lines. The outer ring shows the location of phage proteins (grey) and the ICE element (blue) in the pan-genome.
Table 2 The number of strains included in the comparison is indicated, as are the mean number of CDSs per compared genome, those CDSs which are core to all compared strains, the relative proportion of core CDSs per genome (%) and the pan-genome sizes (#CDSs) for the compared strains, where the data are available. The estimated core genomes and unique CDSs added to the pan-genome when additional sequenced strains are added to the comparison are shown. * denotes those pan-genome analyses where the orthologous CDSs in a compared genome were first grouped into orthologous clusters. a indicates the number of unique genes that are predicted to be added to the S. pneumoniae pan-genome if 100 genomes were compared.
By extrapolation of the fitted exponential decay function, the core, pan-genome and strain-unique CDSs for the entire species, beyond the scope of the eight sequenced strains, could be predicted [12]. On the basis of the asymptotic value from the exponential decay function, the estimated core genome for the species consists of 3,884 ± 4 CDSs (Figure 2A). This estimate is similar to the observed value for the comparison of the eight strains, indicating that the addition of more sequenced strains to the comparison will likely have a minimal effect on the core CDS fraction of the species. An estimated 106 ± 10 unique CDSs would be added to the species pan-genome if a new sequenced genome is incorporated into the comparison (Figure 2B). A rarefaction curve plotted on the basis of the average number of CDSs per sequenced genome and the estimated strain-unique CDSs (for n = 2→8) shows that the extrapolated curve continues to increase with the addition of new sequenced genomes to the analysis (Figure 2C). This typifies an 'open' pan-genome, as has been observed in similar pan-genome analyses for a number of bacterial species, members of which can inhabit a wide range of environments and/or have diverse lifestyles and possess efficient means of lateral gene transfer [12-14,21,22,24-26] (Table 2). By contrast, similar rarefaction curves reach an asymptotic value of zero within a limited pan-genomic context in species exhibiting a 'closed' pan-genome [12]. For example, no new genes are accumulated in the pan-genome of Bacillus anthracis when a fourth genome is added to the comparison, which may be linked to the more isolated niche this pathogen occupies [12,27]. The 'open' pan-genome of P. ananatis may thus reflect the diverse habitats from which strains of this species have been isolated and their different lifestyles. Similar pan-genome analyses in E. coli revealed that ~300 CDSs would be added to the pan-genome with each novel genome sequenced, while 52 unique CDSs would be added with each sequenced genome of the phytopathogen E. amylovora [21,22] (Table 2). Thus the P. ananatis pan-genome can be considered to be less 'open' than that of E. coli, but more 'open' than that of E. amylovora.
This assumption must, however, be viewed with caution, as several additional factors may influence the unique CDS, core genome and pan-genome calculations, such as genome completeness and annotation, the number of strains compared, strain selection, genome size, as well as the orthology cut-off values and methodologies employed [12-14,28].
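The comparison of all possible combinations of n genomes described in this section can be sketched as follows. This is a minimal illustration that assumes each genome has already been reduced to a set of ortholog-cluster identifiers; the function name, the toy strain names and the gene labels are hypothetical, and the averaging scheme is one plausible reading of the approach in [12] rather than the exact procedure used in the study.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set, Tuple


def pan_genome_curves(genomes: Dict[str, Set[str]]) -> Dict[int, Tuple[float, float, float]]:
    """For each n, return the mean core size, mean pan-genome size and mean
    number of strain-unique clusters over all combinations of n genomes."""
    names = sorted(genomes)
    curves: Dict[int, Tuple[float, float, float]] = {}
    for n in range(2, len(names) + 1):
        core_sizes, pan_sizes, unique_sizes = [], [], []
        for combo in combinations(names, n):
            sets = [genomes[name] for name in combo]
            core_sizes.append(len(set.intersection(*sets)))
            pan_sizes.append(len(set.union(*sets)))
            # Treat each genome in turn as the one added last and count the
            # clusters it contributes that the other n-1 genomes lack.
            for i, added in enumerate(sets):
                others = set.union(*(s for j, s in enumerate(sets) if j != i))
                unique_sizes.append(len(added - others))
        curves[n] = (mean(core_sizes), mean(pan_sizes), mean(unique_sizes))
    return curves


# Toy data: three shared clusters, one cluster shared by two strains,
# and one strain-unique cluster per strain.
toy_genomes = {
    "strain1": {"c1", "c2", "c3", "a1", "u1"},
    "strain2": {"c1", "c2", "c3", "a1", "u2"},
    "strain3": {"c1", "c2", "c3", "u3"},
}
curves = pan_genome_curves(toy_genomes)
```

The per-n means produced this way are the kind of data points to which a decay function can then be fitted. For eight genomes the loop visits 247 combinations, so exhaustive enumeration remains cheap at this scale.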
Figure 2 Plots of the core CDSs, strain-specific CDSs and pan-genome CDSs of the species P. ananatis. Comparisons of n = 1→8 genomes were performed to determine the core genome CDSs for the eight sequenced P. ananatis strains (A), the strain-unique CDSs (TCs) added by each genome in the comparison (B), and the P. ananatis pan-genome size (expressed as number of CDSs) (C). Black circles indicate the data points. The data were fitted to a generalized least-squares non-linear model as per [12] to determine the core CDSs (A) and strain-unique CDSs (B) for the species when more sequenced strains are added to the comparison, as depicted by the triangles and trend line (blue for core CDSs and red for strain-unique CDSs) in the plots, while the dashed line shows the asymptotic core (A) and strain-specific CDSs (B) values for the species as predicted using the least-squares non-linear model function. The pan-genome for the species was extrapolated by fitting the strain-specific CDS data and average number of CDSs per compared genome to an algebraic function as per [12], indicated by green triangles (C).
The P. ananatis accessory genome encodes mainly 'poorly characterized' proteins
The translated protein products of the CDSs encoded in both the accessory and core fractions of the P. ananatis pan-genome were classified into super-functional and functional categories, on the basis of orthology to functionally characterized proteins as determined using the COGnitor tool and according to the classification nomenclature employed in the Clusters of Orthologous Groups (COG) database [29] (Additional file 4: Table S3). The majority of the 1,690 proteins encoded on the accessory genome (69.2%) fall into the super-functional COG category of 'poorly characterized' proteins, while relatively small proportions of proteins encoded on the accessory genome belong to the metabolism (9.4%) and cellular processes (8.2%) super-functional categories (Figure 3; Additional file 4: Table S3). This finding is consistent with analyses of the E. coli and Staphylococcus aureus pan-genomes, where the majority of proteins encoded on the accessory genome also fell into the 'poorly characterized' super-functional category [13]. By contrast, a large proportion of the 3,876 proteins encoded on the P. ananatis core genome are involved in metabolism (35.9%), with only 32.6% belonging to the 'poorly characterized' super-functional category (Figure 3; Additional file 4: Table S3). Similar proportions of core and accessory proteins are involved in information storage and processing. Within this super-functional category, however, most of the proteins encoded on the core genome are involved in transcription, translation, ribosomal structure and biogenesis, while the majority of accessory genome-encoded proteins are involved in DNA replication, recombination and repair (Additional file 4: Table S3). This functional category includes transposases, integrases and other mobile genetic elements, and their extensive representation in the accessory genome indicates that horizontal gene transfer has potentially played a significant role in the diversification of the eight P. ananatis strains.
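The tallying of proteins into super-functional categories can be sketched as below. The letter-to-category mapping follows the standard COG functional classification; treating proteins without any COG assignment as 'poorly characterized' is an assumption of this sketch, as are the function name and the toy input.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

UNASSIGNED = "poorly characterized"

# Standard COG one-letter functional categories grouped by super-function.
SUPER_CATEGORY: Dict[str, str] = {}
for letters, name in [
    ("JAKLB", "information storage and processing"),
    ("DYVTMNZWUO", "cellular processes and signaling"),
    ("CGEFHIPQ", "metabolism"),
    ("RS", UNASSIGNED),
]:
    for letter in letters:
        SUPER_CATEGORY[letter] = name


def super_category_shares(cog_letters: Iterable[Optional[str]]) -> Dict[str, float]:
    """Return the percentage of proteins in each super-functional category.

    Proteins with no COG letter (None or '') are counted as poorly
    characterized, which is an assumption made for this illustration.
    """
    counts = Counter(SUPER_CATEGORY.get(letter or "", UNASSIGNED)
                     for letter in cog_letters)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in counts.items()}


# Toy input: one COG letter per protein, None where no COG was assigned.
toy_proteins = ["J", "C", "G", "S", None, "R", "M", "E"]
shares = super_category_shares(toy_proteins)
```

Running the same tally separately on the core and accessory protein sets gives the kind of side-by-side proportions reported in this section.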
Prophage integration and integrative and conjugative elements have played a major role in the diversification of P. ananatis strains
Integrated bacteriophage elements, or prophages, were identified in the genome sequences of the eight P. ananatis strains using Prophinder [30]. Between two and four prophages are integrated into the replicons of each of the eight strains, and 699 accessory CDSs (41.4% of the total accessory CDSs) are encoded by these prophages, indicating that phage integration has played a substantial role in P. ananatis diversification. Between 24.2% and 74.3% of the strain-unique CDSs for each of the eight strains are localized in predicted integrated phage elements (Figure 1). This suggests that a large fraction of the strain-unique CDS complement of any newly sequenced strain would likely be derived from phages. Prophages are found in two-thirds of all γ-proteobacteria and have been shown to play a major role in bacterial evolution through the horizontal transfer of genetic factors that contribute to various processes within the bacterial host, including fitness and pathogenesis [31]. Examples of prophage-borne genes in other bacteria include those encoding a Shiga toxin in E. coli, Type III secretion system effectors in Salmonella enterica, R- and F-type bacteriocins in Pseudomonas aeruginosa, and a superoxide dismutase (providing protection against oxidative defences within the mammalian host) in S. enterica [32-35]. The prevalence of prophage CDSs in the accessory portion of the pan-genome suggests that they may also play a major role in the adaptive evolution of P. ananatis, potentially contributing to its ability to colonize various environmental niches and hosts. A further 148 CDSs (8.7% of the total accessory CDSs) are encoded in integrative and conjugative elements (ICEs), which are present in the genomes of five of the sequenced strains (Figure 1). In other bacterial taxa, including S. enterica and Vibrio cholerae, ICEs have been shown to disseminate and confer a number of adaptive traits, including antibiotic and heavy metal resistance [36].
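Once prophage boundaries have been predicted, the share of accessory CDSs that fall inside them reduces to an interval-containment check. The sketch below assumes coordinates on a single replicon and requires a CDS to lie entirely within a predicted prophage; both choices, along with the function names and toy coordinates, are illustrative assumptions and do not describe Prophinder's output format.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


def within_any(cds: Interval, regions: List[Interval]) -> bool:
    """True if the CDS lies entirely inside at least one region."""
    start, end = cds
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)


def prophage_share(cds_list: List[Interval], prophages: List[Interval]) -> float:
    """Percentage of CDSs located within predicted prophage regions."""
    if not cds_list:
        return 0.0
    inside = sum(within_any(cds, prophages) for cds in cds_list)
    return 100.0 * inside / len(cds_list)


# Toy data: two of the four accessory CDSs sit inside the single prophage.
toy_accessory_cds = [(100, 400), (950, 1500), (1600, 2100), (5000, 5600)]
toy_prophages = [(900, 2200)]
toy_share = prophage_share(toy_accessory_cds, toy_prophages)

# The proportion reported in the text follows from the two counts given there.
reported_share = round(100.0 * 699 / 1690, 1)
```

With the counts from the text, 699 of 1,690 accessory CDSs gives the reported 41.4%.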
The P. ananatis pan-genome encodes a number of proteins found on the genomes of bacteria associated with distinct hosts
As P. ananatis strains are frequently found associated with a wide range of different hosts including plants, insects and humans, the P. ananatis pan-genome was analyzed to identify CDSs that may be involved in host-microbe interactions. The translated protein products of the 5,566 pan-genome CDSs were compared by BlastP against the NCBI non-redundant (nr) protein database [37,38]. This revealed a large number of P. ananatis proteins with orthology to proteins encoded on the genomes of microorganisms associated with distinct hosts, namely animal-associated bacteria (AAB), plant-associated bacteria (PAB) and insect-associated bacteria (IAB). A total of 1,415 CDSs (25.4% of the total pan-genome CDSs) shared orthology with CDSs specific to microorganisms belonging to one group (i.e. specific to the AAB, IAB or PAB group but not to more than one group). Figure 4 depicts a comparison of the P. ananatis pan-genome translated CDS products against the protein sets of 54 members of the Enterobacteriaceae for which complete genomes are available. This diagram shows that AAB-, IAB- and PAB-specific CDSs occur mainly within large regions of reduced conservation among the Enterobacteriaceae, indicating that they have likely arisen through extensive horizontal acquisition events. The BlastP analysis also identified a number of CDSs that were either unique to P. ananatis (106 CDSs; 1.9% of total pan-genome CDSs) or specific to the Pantoea genus (28 CDSs; 0.5% of total pan-genome CDSs). While pan-genome CDSs with orthologs in the genomes of both plant- and animal-associated bacteria are likely to play a role in the interactions with hosts in both Kingdoms, the large number of PAB-, AAB- and IAB-specific CDSs in the P. ananatis pan-genome may provide mechanisms specific to these different Kingdoms and were investigated further.
Figure 4 Comparative diagram of the P. ananatis pan-genome against the genomes of representative members of the family Enterobacteriaceae, showing that AAB-, IAB- and PAB-specific CDSs are mainly restricted to regions of lower conservation among the Enterobacteriaceae. The P. ananatis pan-genome CDS set was compared by the RBBH approach against the CDS sets of 54 members of the family Enterobacteriaceae, which are listed in Additional file 7: Table S6. Orthologs were considered as those translated CDSs sharing >50% amino acid identity and >70% sequence coverage across the query and hit sequences. Orthologs in each of the enterobacterial comparators are indicated by black arrows. The outer circle shows the locations of PAB- (dark green), non-enterobacterial PAB- (light green), IAB- (maroon), AAB- (red), Pantoea- (blue) and P. ananatis-specific (yellow) CDSs on the P. ananatis pan-genome.
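The group-specificity rule used in this section (a CDS counts as AAB-, IAB- or PAB-specific only when its orthologs occur in exactly one host-association group) can be written as a short sketch. It assumes each CDS has already been annotated with the set of host groups of the organisms carrying its orthologs; the function name, the 'shared' and 'unassigned' labels and the toy identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Set

HOST_GROUPS = {"AAB", "IAB", "PAB"}


def classify_cds(ortholog_groups: Set[str]) -> str:
    """Label a CDS by the host groups in which its orthologs are found."""
    groups = ortholog_groups & HOST_GROUPS
    if len(groups) == 1:
        return next(iter(groups)) + "-specific"
    if not groups:
        return "unassigned"  # e.g. no orthologs in any of the three groups
    return "shared"          # orthologs in more than one host group


# Toy annotations: CDS identifier -> host groups of organisms with orthologs.
toy_annotations: Dict[str, Set[str]] = {
    "cds_a": {"AAB"},
    "cds_b": {"PAB"},
    "cds_c": {"PAB", "AAB"},
    "cds_d": set(),
    "cds_e": {"IAB"},
}
labels = {cds: classify_cds(groups) for cds, groups in toy_annotations.items()}
summary = Counter(labels.values())

# The per-group counts reported later in the text sum to the total given above.
group_total = 151 + 15 + 1249
```

The arithmetic check at the end mirrors the figures in the text: 151 AAB-, 15 IAB- and 1,249 PAB-specific CDSs add up to the 1,415 group-specific CDSs.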
The P. ananatis pan-genome encodes proteins that may be involved in colonization of animal hosts
A total of 151 P. ananatis CDSs (2.7% of total pan-genome CDSs) encode proteins with orthologs restricted to animal-associated bacteria (AAB), including animal-pathogenic Salmonella, Escherichia and Yersinia strains, with between 61 (LMG5342) and 84 (AJ13355) AAB-specific CDSs encoded on the different P. ananatis genomes. Of these, 50 CDSs form part of the P. ananatis core genome, while 101 are found in the accessory portion of the pan-genome, suggesting possible adaptive evolution of some strains that has enabled colonization of and persistence in animal hosts. The majority of proteins encoded by the AAB-specific CDSs (110 CDSs; 66.3% of total AAB-specific CDSs) belong to the 'poorly characterized' super-functional category. However, a small proportion is involved in metabolism (21 CDSs) and cellular processes (17 CDSs) (Additional file 5: Table S4). Two sets of CDSs encode proteins putatively involved in the transport and metabolism of the carbohydrate substrates glucuronide and mannose, respectively. Orthologs of several P. ananatis pan-genome CDSs with a potential role in attachment and in defense against antimicrobials are also restricted in distribution to AAB (Additional file 5: Table S4). Two CDSs, found only in AJ13355 among the sequenced P. ananatis strains, encode proteins with a predicted role in the biogenesis of a type 1 fimbria, with orthologs restricted to E. coli and Salmonella spp. A putative non-fimbrial autotransporter adhesin, which is common to all sequenced P. ananatis strains, shared extensive sequence identity with the AidA-I and MisL adhesins of enteropathogenic E. coli and S. enterica, respectively. These adhesins are involved in intestinal adherence in these pathogens [39,40]. Orthologs of the S. enterica Mig-14 protein, which has been proposed to repress immune system functions [41], are also present in all strains.
Similarly, the protein products of two P. ananatis pan-genome CDSs showed orthology to β-lactamases and their cognate transcriptional regulators in a number of AAB, including clinical strains of Enterobacter cloacae (Sfo-1/AmpR) and Citrobacter sedlakii (Sed1/SedR). These provide resistance to a broad spectrum of antibiotics used in the clinical environment [42,43] and, given that P. ananatis strains have been isolated from this environment [10,11], may form the basis for antibiotic resistance of clinical P. ananatis strains.
Analysis of the P. ananatis pan-genome has thus revealed the presence of various CDSs coding for proteins with potential roles in adherence, immune suppression, antibiotic resistance and carbohydrate metabolism that may aid in the persistence of this species in animal hosts (Additional file 5: Table S4).
The P. ananatis pan-genome encodes proteins with a potential role in interactions with insect hosts
Fifteen pan-genome CDSs (0.3% of total pan-genome CDSs) have orthologs only in insect-associated bacterial genera, including Photorhabdus and Wolbachia. Between two (AJ13355 and PA13) and thirteen (LMG20103) of these IAB-specific CDSs are present in each of the individual P. ananatis genomes. In particular, one locus found in the genomes of four P. ananatis strains encodes twelve proteins showing sequence homology to a locus in Photorhabdus luminescens subsp. laumondii TTO1 and several Streptomyces spp. (Additional file 5: Table S4). Within this locus, PANGEN_3511 and PANGEN_3515 encode orthologs of two nikkomycin biosynthetic proteins in P. luminescens (plu1441 and plu1874). This antibiotic is produced by Streptomyces spp. and has acaricidal, fungicidal and insecticidal activities [44]. PANGEN_3520 and PANGEN_3522 encode orthologs of the Streptomyces rubellomurinus FrbC and FrbD proteins involved in production of an antimalarial compound (ABB90932-90933; 50% average amino acid identity) [45]. P. ananatis strains may thus have acquired a locus for the biosynthesis of a potential insecticidal peptide. As P. ananatis is frequently isolated from invertebrate hosts, the presence of such a peptide may be of interest for the biological control of insect pests [7-9].
The P. ananatis pan-genome encodes proteins with a potential role in plant-microbe interactions
A large number of the P. ananatis pan-genome CDSs (1,249 CDSs; 22.4% of total pan-genome content) shared orthology with CDSs restricted to plant-associated bacteria, and between 933 (PA4) and 985 (LMG2665T) of these are encoded on each individual P. ananatis genome. This finding concurs with the frequent isolation of P. ananatis from the plant environment [1]. Of these CDSs, 849 formed part of the pan-genome core, while 400 were associated with the accessory genome. While the majority of PAB-specific CDSs were found in common with both enterobacterial and non-enterobacterial plant-associated species, a relatively large number (200 CDSs; 16% of total PAB-specific CDSs) shared orthology only with CDSs restricted to non-enterobacterial PAB. This suggests that extensive horizontal exchange has occurred between P. ananatis and non-enterobacterial PAB. A total of 200 PAB-specific CDSs are localized on the pPANA1 pan-plasmid (50.5% of the total pan-plasmid CDSs).
Previously we showed that the pPANA1 plasmid is part of the Large Pantoea Plasmid group (LPP-1) common to all sequenced Pantoea spp. The LPP-1 plasmids share a small set of common core CDSs and a much larger accessory component, and we postulated that they play a major role in the ecological diversification of the genus [46,47]. The large number of PAB-specific CDSs (Figure 4) indicates that the pPANA1 plasmid likely plays a major role in the adaptation of P. ananatis to colonize and interact with plants.
As is the case for AAB-specific CDSs, the largest proportion of PAB-specific CDSs encode proteins that belong to the 'poorly characterized' super-functional category (46.2% of the total PAB-specific CDSs), but a substantial number of PAB-specific CDSs encode proteins involved in metabolism (30.4%) and other cellular processes (11.5%). These may play a role in efficient colonization, nutrient utilization and persistence in or on the plant and/or in other plant-microbe interactions. An extensive set of PAB-specific CDSs encode proteins with a role in the transport and metabolism of carbohydrates (Additional file 6: Table S5), which may facilitate the uptake and metabolism of plant-derived carbohydrates. A number of orthologs of proteins involved in the degradation of plant carbohydrates are also encoded by PAB-specific CDSs present in all P. ananatis strains, including a predicted endo-1,4-β-xylanase, two polygalacturonases, a putative pectin acetylesterase, as well as a predicted cellulase with extensive sequence identity to the minor cellulase Cel8Y of Dickeya dadantii [48] (Additional file 6: Table S5). Several PAB-restricted amino acid transport and metabolism systems are also encoded on the P. ananatis pan-genome, including orthologs of systems for the transport and metabolism of the opine octopine, released from plant tumors induced by Agrobacterium tumefaciens [49], and the Amadori compound deoxyfructosyl glutamine, found in rotting fruits and vegetables and in tumors caused by chrysopine-type Agrobacterium strains [50] (Additional file 6: Table S5). These compounds could furthermore serve as carbon, nitrogen and energy sources for P. ananatis in the plant. Several PAB-specific proteins with a predicted role in iron uptake and metabolism were also found to be encoded on the P. ananatis pan-genome.
These included a predicted hydroxamate siderophore transporter, a siderophore receptor, the ferric dicitrate sensor components FecI/FecR and a PAB-specific TonB-ExbBD complex for outer membrane iron transport (Additional file 6: Table S5). These PAB-specific proteins may allow P. ananatis to compete actively for the limited iron available in the plant environment [51].
Several CDSs encoding PAB-restricted proteins potentially involved in protection against plant defenses are present in the P. ananatis pan-genome. PANGEN_04534-4535 encode orthologs of the protein Ohr and its transcriptional regulator OhrR, which are involved in resistance to organic hydroperoxides produced by plants in defense against pathogen infection [52]. Several PAB-restricted multidrug efflux pumps, which could play a role in the extrusion of plant-produced antimicrobials, and β-lactamases, which may specifically degrade plant antimicrobials, are also encoded on the P. ananatis pan-genome (Additional file 6: Table S5). PANGEN_05563 encodes an ortholog of the cyclic β-1,2-glucan polysaccharide biosynthetic protein NdvB. This polysaccharide has been shown to provide protection to Xanthomonas campestris against localized and systemic defenses in the host plant [53]. A PAB-specific locus on the pPANA1 plasmid of all eight P. ananatis strains encodes orthologs of the proteins BudABCR, which are involved in the production of 2,3-butanediol. This volatile has been shown to promote plant growth and induce systemic resistance in the plant host [54,55], which may be linked to the plant growth-promoting role ascribed to strains of this species [5,6].
The P. ananatis pan-genome encodes proteins with potential roles in plant and animal pathogenesis
As strains of P. ananatis have been found to be pathogenic on a broad range of plant hosts as well as humans, the pan-genome was analyzed to identify potential molecular determinants underlying its pathogenicity. Our analyses revealed the absence of many of the factors that are central to the pathogenicity and virulence arsenal of related plant and animal pathogens, including animal toxins, hemolysins, phytotoxins, several secretion systems (Type II, III and IV) and their associated effectors. However, several genes and loci with orthology to characterized pathogenicity determinants in related animal and plant pathogens could be identified in the P. ananatis pan-genome.
Type VI secretion systems (T6SSs) have been identified in a number of human pathogenic bacteria, including P. aeruginosa and V. cholerae [56,57], as well as plant pathogens such as Pseudomonas syringae and Pectobacterium atrosepticum [58,59]. The T6SS serves as injectisome for effector proteins including the hemolysin co-regulated protein (Hcp) and valine-glycine repeat (VgrG) protein [56][57][58][59]. Previously we have described the presence of three Type VI secretion system loci in P. ananatis, namely T6SS-1, −2 and −3 [60]. The T6SS-1 and −2 loci are found on the genomes of all P. ananatis strains, while T6SS-3 is restricted to the pPANA1 plasmids of P. ananatis AJ13355, LMG20103 and PA4. Nonconserved islands are localized adjacent to the vgrG and hcp genes in the T6SS-1 and T6SS-3 loci. On the basis of the presence of conserved domains and structural homology to various proteins of known function, we postulated that the proteins encoded in these islands may be translocated across the outer membrane or into the host cell cytosol as terminal domains of the VgrG and Hcp effectors, and that they may play a role in various functions in a wide range of hosts, including animal and plant pathogenesis [60]. During the classification of the P. ananatis pan-genome CDSs as AAB-, IAB-or PAB-specific we observed that the proteins encoded in the vgrG island of the T6SS-3 locus shared orthology only with proteins encoded on the genomes of members of the AAB group. A further non-conserved island encoding proteins with orthologs specific to AAB was found in the T6SS-3 locus ( Figure 5A; Additional file 5: Table S4). While the orthologs of most of these T6SS-3 island proteins were annotated as hypothetical proteins and belong to the 'poorly characterized' super-functional COG category, their localization within the T6SS putative effector islands together with the fact that they are AAB-group specific, makes them interesting targets for further analysis. 
Similarly, PAB-specific proteins were found to be encoded by CDSs localized within the vgrG and hcp islands of the T6SS-1 locus (Figure 5B; Additional file 6: Table S5). This suggests a potential role for this secretion system in the translocation of effectors into the plant host. Other PAB-specific proteins in the T6SS-1 locus included a serine/threonine protein kinase (PpkA) and phosphatase (PppA) and an FHA domain protein, which have been shown to play a role in the posttranslational regulation of the Type VI secretion system [61]. A further partial Type VI secretion system (T6SS-2) was also identified in all Pantoea and Erwinia species [60]. All proteins encoded in this locus are restricted to PAB, suggesting a potential role for T6SS-2 in plant-microbe interactions.
Analysis of the PAB-specific complement also showed the presence of a CDS encoding an ice-nucleation protein with orthologs restricted to plant-pathogenic strains of P. ananatis, Pantoea agglomerans, X. campestris and P. syringae. This protein induces wounds in frost-damaged plants and is postulated to allow these pathogens to gain access to host tissues [62]. A locus on the pPANA1 plasmids of P. ananatis LMG2665T and B1-9 encodes proteins (PANGEN_05466-5474) with orthology to proteins for the biosynthesis of non-ribosomal peptide synthases/polyketide synthases (NRPS/PKS). The potential function of this locus in NRP/PK biosynthesis was further substantiated by comparison of the locus against the antiSMASH server [63]. Many phytopathogenic bacteria have been shown to carry CDSs encoding NRPS/PKS that are required for the production of phytotoxins, e.g. P. syringae syringomycin and syringopeptin, and P. atrosepticum coronofacic acid conjugates [64]. Orthologs of the proteins encoded in the P. ananatis NRPS/PKS locus were found to be restricted to PAB (E. amylovora ATCC BAA-2158: EAIL5_2884-2892; 72% average amino acid identity) and may thus potentially encode synthases for the production of a phytotoxin. However, the biological role of the NRPS/PKS products in P. ananatis will need to be determined.

Conclusions
Pantoea ananatis is ubiquitous in the environment and has an inherent capacity to survive, proliferate and form intimate relationships with plants, as well as insect and human hosts [1]. In particular, its frequent isolation from both plant and animal hosts suggests it has adapted to colonize, proliferate and potentially cause disease in these hosts. Here we analyzed the genome sequences of eight P. ananatis strains. As has been observed for other members of the Enterobacteriaceae, P. ananatis exhibits an 'open' pan-genome, which is mainly influenced by the integration of phage elements, but also by integrative conjugative elements and other insertion elements. Phages play a significant role in bacterial evolution, transferring fitness and pathogenicity factors to their bacterial host [30]. They could therefore represent major adaptive drivers for P. ananatis, allowing strains of this species to colonize and interact with both plant and animal hosts.
Figure 5. The T6SS-1 and T6SS-3 loci of P. ananatis encode proteins with orthologs restricted to plant-associated and animal-associated bacteria. The T6SS-3 (A) and T6SS-1 (B) loci of P. ananatis AJ13355 are shown as an example. T6SS core genes with orthologs in both PAB and AAB groups are colored in grey, while PAB- and AAB-specific genes are colored in green and red, respectively. The hcp and vgrG genes are denoted by blue arrows, while the yellow arrow represents a P. ananatis-specific gene.
Analysis of the P. ananatis pan-genome CDS complement revealed the presence of a large number of proteins restricted in distribution to plant- and animal-associated bacteria (PAB and AAB). These include a number of factors that could enable P. ananatis to adhere to and colonize host tissues, utilize nutrients and persist within a host, and potentially cause disease. However, it cannot be excluded that common mechanisms underlying colonization, persistence and pathogenicity exist among bacteria that are associated with both plant and animal hosts. The ability of a bacterium to make a cross-Kingdom jump is dependent on several prerequisites, including close and frequent contact with the novel host, the ability to overcome host defences, and the capacity for horizontal acquisition of genes encoding factors that enable the bacterium to persist in its new host [65]. The frequent isolation of P. ananatis from various different environments and the (pan-)genomic evidence of a bacterial species well-adapted towards survival in and interaction with different hosts provide a primary indication of the ecological success of P. ananatis and how it may have evolved to interact with cross-Kingdom hosts.

Methods
Genome comparisons and construction of the P. ananatis pan-genome
The genome sequences from eight P. ananatis strains, four complete and four partially assembled genomes (Table 1), were included in the study. The partial genomes were annotated using FgenesB [66]. CDS sets were standardized by local BlastN analysis to identify ORFs which may not have been predicted for a particular genome. The combined nucleotide sequences of the chromosome and pPANA1 plasmid, for the P. ananatis strains for which complete genomes were available, were aligned using Mauve v. 2.3.1 [67]. The translated CDS sets for each of the eight genomes were pair-wise compared by local BlastP analysis using the BioEdit 7.1.11 software package [68]. The comparison was performed using the reciprocal best Blast hit (RBBH) approach, whereby two proteins were assumed to be orthologs when the best Blast hit of a query protein in the subject protein set, when compared in turn by BlastP analysis against the query strain protein set, returned the original query protein as its best Blast hit [19]. The number of orthologous CDSs and average amino acid identities (sum of the amino acid identities for each compared protein divided by the aligned length of the protein) for each combination of pair-wise compared protein sets were determined. Orthologs were defined using cut-off values (>70% amino acid identity and >70% sequence coverage for the query and hit) [20]. Using these orthology parameters and localized BlastP comparison, the orthologous CDSs from each of the genomes were clustered [69], where each cluster represented the set of RBBH orthologs for each distinct CDS across all the compared genomes. A representative of each cluster (the longest member sequence of each cluster for CDSs shared by more than one strain), as well as CDSs unique to specific P. ananatis strains, were incorporated in a single pan-genome file.
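The reciprocal best hit criterion and the >70% identity and coverage cut-offs described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the `Hit` record, the function names, the toy hit values and the choice to rank hits by bit score after applying the cut-offs are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """One BlastP hit (hypothetical record layout, for illustration only)."""
    query: str
    subject: str
    identity: float     # percent amino acid identity
    query_cov: float    # percent of the query covered by the alignment
    subject_cov: float  # percent of the hit covered by the alignment
    bitscore: float


def best_hits(hits: List[Hit], min_identity: float = 70.0,
              min_coverage: float = 70.0) -> Dict[str, str]:
    """Return the best-scoring subject per query among hits passing the cut-offs."""
    best: Dict[str, Hit] = {}
    for h in hits:
        # Cut-offs are strict (>70%), as stated in the text.
        if (h.identity <= min_identity or h.query_cov <= min_coverage
                or h.subject_cov <= min_coverage):
            continue
        # Assumption: "best" means highest bit score.
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {query: h.subject for query, h in best.items()}


def reciprocal_best_hits(a_vs_b: List[Hit],
                         b_vs_a: List[Hit]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


# Toy data: only A1/B1 pass the cut-offs in both directions.
a_vs_b = [
    Hit("A1", "B1", 95.0, 98.0, 97.0, 500.0),
    Hit("A1", "B2", 72.0, 80.0, 75.0, 200.0),
    Hit("A2", "B2", 65.0, 90.0, 90.0, 150.0),  # fails the identity cut-off
    Hit("A3", "B3", 88.0, 60.0, 95.0, 300.0),  # fails the coverage cut-off
]
b_vs_a = [
    Hit("B1", "A1", 95.0, 97.0, 98.0, 500.0),
    Hit("B2", "A1", 72.0, 75.0, 80.0, 200.0),
    Hit("B3", "A3", 88.0, 95.0, 60.0, 300.0),  # fails the coverage cut-off
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this sketch B2 also has A1 as its best hit, but the pair is rejected because the best hit of A1 is B1, which is the one-to-one behaviour the reciprocal criterion is meant to enforce.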
The translated protein products of the entire pan-genome CDS set were compared against the Clusters of Orthologous Groups (COG) database using COGnitor [29] to determine the COG functional and super-functional category to which they belong. Prophages and phage proteins were identified by searching against the ACLAME database using Prophinder [30]. A diagram of the eight CDS sets compared against the pan-genome CDS set was constructed on the basis of the above localized BlastP analysis results using GenomeDiagram [70].

Pan-genome calculations
The size of the pan-genome for the eight sequenced strains, as well as the core and accessory fractions, were determined by localized BlastP analysis with the translated CDS sets of each of the sequenced P. ananatis strains against the pan-genome CDS set. The core genome, signifying all pan-genome CDSs common to all eight strains, and the accessory genome, incorporating those CDSs which are absent in one or more of the strains or were unique to the genome of a particular strain, were tabulated, and the pan-genome determined as the sum of the core and accessory CDSs.
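As a minimal sketch of this tabulation, assuming the BlastP results have already been reduced to a presence table mapping each pan-genome CDS to the set of strains that carry it, the core and accessory fractions could be derived as follows. The function name, cluster identifiers and strain labels are hypothetical.

```python
from typing import Dict, Set, Tuple


def partition_pan_genome(
    presence: Dict[str, Set[str]], n_strains: int
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split pan-genome CDSs into core, shared-accessory and strain-unique sets.

    presence maps each pan-genome CDS to the strains in which an ortholog
    was found (assumed to be precomputed from the BlastP comparison).
    """
    core: Set[str] = set()
    shared: Set[str] = set()
    unique: Set[str] = set()
    for cds, strains in presence.items():
        if len(strains) == n_strains:
            core.add(cds)        # present in every strain
        elif len(strains) == 1:
            unique.add(cds)      # present in a single strain only
        else:
            shared.add(cds)      # present in some, but not all, strains
    return core, shared, unique


# Toy presence table for three hypothetical strains.
presence = {
    "cds1": {"S1", "S2", "S3"},
    "cds2": {"S1", "S2"},
    "cds3": {"S3"},
    "cds4": {"S1", "S2", "S3"},
}
core, shared, unique = partition_pan_genome(presence, n_strains=3)
accessory = shared | unique
pan_genome_size = len(core) + len(accessory)
```

The accessory genome here is the union of the shared and strain-unique sets, so the pan-genome size is the sum of the core and accessory counts, as in the text.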
The sequential inclusion of the CDS sets of each of the eight genomes in all possible combinations was used to determine the core CDSs, accessory CDSs shared by more than one but not all strains, and those unique to a single strain, as a function of the number of genomes (n) in the comparison (where n = 1,2,…8). The estimated strain-specific CDSs and core CDSs for the species, beyond the scope of eight sequenced strains if the genomes of additional strains were sequenced (i.e. n→∞), were extrapolated by fitting the data for the n = 1→8 comparison combinations above to an exponential decay function as per [12]. The data were fitted to the function using the generalized least squares (gnls) algorithm of the nlme (linear and non-linear mixed-effects models) package in R [71]. The estimated data points for n→∞ obtained from the function along with the actual data points from the comparison of the eight P. ananatis CDS sets were plotted in graphs (Figure 2A and B) of the number of genomes versus the number of core/strain-specific CDSs. In order to extrapolate the pan-genome size for the species, the strain-unique CDSs and average number of CDSs per compared genome (for n = 2→8 genomes compared) were incorporated into an algebraic formula as per [12].
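The enumeration over all genome combinations that produces the data points for the fit can be sketched in Python as below. This covers only the counting step, not the exponential decay fit, which the study performed with gnls in R. The function name, the toy genomes and the decision to count strain-specific CDSs for every member of each combination and average them are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Set, Tuple


def core_and_specific_by_n(
    genomes: Dict[str, Set[str]]
) -> Dict[int, Tuple[float, float]]:
    """For each n, average the core size and the strain-specific CDS count
    over every combination of n genomes.

    genomes maps a strain name to the set of pan-genome CDS clusters it carries.
    """
    names = sorted(genomes)
    result: Dict[int, Tuple[float, float]] = {}
    for n in range(1, len(names) + 1):
        core_sizes = []
        specific_sizes = []
        for combo in combinations(names, n):
            # Core: clusters present in every genome of the combination.
            core = set.intersection(*(genomes[g] for g in combo))
            core_sizes.append(len(core))
            # Strain-specific: clusters found in one member and in no other.
            for g in combo:
                others: Set[str] = set()
                for other in combo:
                    if other != g:
                        others |= genomes[other]
                specific_sizes.append(len(genomes[g] - others))
        result[n] = (mean(core_sizes), mean(specific_sizes))
    return result


# Toy example with three hypothetical strains and five CDS clusters.
toy_genomes = {
    "S1": {"a", "b", "c"},
    "S2": {"a", "b", "d"},
    "S3": {"a", "e"},
}
curve = core_and_specific_by_n(toy_genomes)
```

The number of combinations grows as the binomial coefficient of the genome count, which remains small for eight genomes (255 non-empty subsets) but would call for random sampling of combinations with much larger strain collections.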
Comparison between the P. ananatis pan-genome and all available genomes
The CDSs encoded by P. ananatis were compared by BlastP analysis against the NCBI non-redundant (nr) protein database [38]. For this comparison, orthology was assumed for proteins sharing >30% amino acid identity and 70% sequence coverage between the query and hit [72]. On the basis of the Blast hits, CDSs were identified that had orthologs in bacteria occupying distinct ecological niches, namely those associated with animals (AAB), insects (IAB) and plants (PAB), as determined from available information on their source of isolation. For the purpose of this grouping, only those bacteria that are specifically associated with animals, or with plant tissues and/or the rhizosphere environment, were taken into consideration, while members that are frequently found associated with hosts of both Kingdoms, such as Klebsiella and Enterobacter spp., were disregarded.
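The grouping logic described above could be sketched as follows, assuming each CDS has already been reduced to the list of organisms carrying an ortholog that passes the cut-offs. The function name, the lookup table and the handling of organisms with no niche assignment are assumptions; in particular, the source does not state how hits of unknown origin were treated, so this sketch conservatively leaves such CDSs unclassified.

```python
from typing import Dict, List, Optional, Set


def classify_cds(
    hit_organisms: List[str],
    niche_of: Dict[str, str],
    ignored_genera: Set[str],
) -> Optional[str]:
    """Return "AAB", "IAB" or "PAB" if all considered orthologs come from
    a single niche group, otherwise None."""
    niches: Set[str] = set()
    for organism in hit_organisms:
        genus = organism.split()[0]
        if genus in ignored_genera:
            continue  # genera found with hosts of both Kingdoms are disregarded
        niche = niche_of.get(organism)
        if niche is None:
            return None  # assumption: unknown isolation source blocks classification
        niches.add(niche)
    return niches.pop() if len(niches) == 1 else None


# Illustrative lookup table only; the study derived niches from isolation sources.
niche_of = {
    "Erwinia amylovora": "PAB",
    "Pseudomonas syringae": "PAB",
    "Vibrio cholerae": "AAB",
    "Photorhabdus luminescens": "IAB",
}
ignored_genera = {"Klebsiella", "Enterobacter"}

plant_only = classify_cds(
    ["Erwinia amylovora", "Pseudomonas syringae", "Klebsiella pneumoniae"],
    niche_of, ignored_genera)
mixed = classify_cds(
    ["Erwinia amylovora", "Vibrio cholerae"], niche_of, ignored_genera)
```

With this rule, a CDS whose orthologs occur in both plant- and animal-associated bacteria is left unclassified rather than being assigned to either group.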
A diagram of the comparison between the translated protein products encoded by the P. ananatis pan-genome CDS set and the protein sets of 54 members of the Enterobacteriaceae, encompassing all the genera for which complete genomes were available (Additional file 7: Table S6), was constructed using GenomeDiagram [70], with the RBBH approach applied to the localized BlastP comparison. Orthologs were considered when proteins shared >50% amino acid identity and >70% coverage between the hit and query sequences.