Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification

Background Bacterial pathogens often show significant intraspecific variations in ecological fitness, host preference and pathogenic potential to cause infectious disease. The species of Listeria monocytogenes, a facultative intracellular pathogen and the causative agent of human listeriosis, consists of at least three distinct genetic lineages. Two of these lineages predominantly cause human sporadic and epidemic infections, whereas the third lineage has never been implicated in human disease outbreaks despite its overall conservation of many known virulence factors. Results Here we compare the genomes of 26 L. monocytogenes strains representing the three lineages based on both in silico comparative genomic analysis and high-density, pan-genomic DNA array hybridizations. We uncover 86 genes and 8 small regulatory RNAs that likely make L. monocytogenes lineages differ in carbohydrate utilization and stress resistance during their residence in natural habitats and passage through the host gastrointestinal tract. We also identify 2,330 to 2,456 core genes that define this species along with an open pan-genome pool that contains more than 4,052 genes. Phylogenomic reconstructions based on 3,560 homologous groups allowed robust estimation of phylogenetic relatedness among L. monocytogenes strains. Conclusions Our pan-genome approach enables accurate co-analysis of DNA sequence and hybridization array data for both core gene estimation and phylogenomics. Application of our method to the pan-genome of L. monocytogenes sheds new insights into the intraspecific niche expansion and evolution of this important foodborne pathogen.


Background
Listeria monocytogenes is a Gram-positive foodborne bacterial pathogen and the causative agent of the human and animal infectious disease, listeriosis. L. monocytogenes can thrive in diverse environmental reservoirs (e.g. soil, water, and sewage) and proliferate under unfavorable conditions (e.g. high osmolarity, low pH, and refrigeration temperature) that other bacterial pathogens cannot endure [1][2][3][4]. Its robust physiological characteristics, coupled with its ubiquity in food processing, distribution and retail environments, have made L. monocytogenes difficult to manage in food manufacturing, particularly for ready-to-eat food products. L. monocytogenes causes the highest rates of hospitalization (about 92%) and mortality (about 20%) among all foodborne bacterial pathogens in the United States [5], making the control of this bacterium in foods a high priority for both food safety and public health. Yet, the versatile lifestyle of L. monocytogenes both inside and outside its host, and its unique capability to invade and replicate in different host cell types (e.g. macrophages and nonprofessional phagocytes), have made this opportunistic pathogen a paradigm for studying host-pathogen interactions, pathophysiology, gene regulation, and stress adaptation [6,7].
Previous molecular subtyping studies have collectively suggested that the species of L. monocytogenes is composed of at least three major evolutionary or genetic lineages that notably differ in their prevalence in causing human and animal diseases [8][9][10][11][12][13][14][15]. Specifically, lineage I (or LI) and lineage II (or LII) of L. monocytogenes are frequently isolated from foods and implicated in the vast majority (>95%) of both sporadic cases and epidemic outbreaks of human listeriosis [3]. Genetic lineage III (or LIII) strains are rarely reported in cases of human infections, but are sometimes associated with animal disease cases [3,14,16]. The mechanisms underlying the biased predominance of certain L. monocytogenes genetic lineages in human listeriosis remain largely unknown. Several recent studies have revealed elevated levels of genetic diversity among LIII isolates [12,15]. Multilocus sequence typing analysis on the basis of partial sigB and actA gene sequences have also suggested that LIII is polyphyletic, with the co-existence of at least three distinct subgroups (i.e. LIIIA, LIIIB, LIIIC) [14,16]. Atypical phenotypes of LIII isolates, such as deficiency in rhamnose fermentation [14], attenuated virulence potential [16], reduced resistance to heat and cold stresses [17] and lowered biofilm productivity [18], have collectively indicated that LIII may have followed a distinct evolutionary path from other L. monocytogenes lineages.
Compared to fairly extensive studies on LI and LII strains, little is known about LIII. Although it is documented that most listerial virulence factors such as the positive regulatory factor (or PrfA) are well conserved across the entire L. monocytogenes species, LIII strains are underrepresented in both food contamination and human listeriosis. This led us to speculate the existence of additional, yet-to-be-identified genetic factors in the predominant disease-causing L. monocytogenes lineages (i.e. LI and LII) that may mediate listerial niche adaptation, resistance to extra-or intracellular stresses, and pathogenicity. These unknown genetic factors may have been lost, mutated, or decayed in LIII as the genomes evolved, resulting in a defective phenotype for LIII isolates in certain ecological and host niches. To test our hypothesis, we combined in silico comparative genomic analyses with an array-based comparative genomic hybridization (CGH) approach to probe the genomic diversity of L. monocytogenes and to identify genomic features common in LI and LII but absent in LIII. Array CGH is a powerful yet cost-effective approach for genotyping and detecting intraspecies genomic diversity for many bacteria. Previous efforts on comparative genomic analyses underscore the usefulness of CGH in resolving genetic lineages and identifying strain-or lineage-specific genes in L. monocytogenes [10,[19][20][21][22]. However, most of these studies targeted only a number of selected genes or partial listerial genomes, making an accurate assessment of intraspecies genomic diversity difficult.
The availability of more than 20 L. monocytogenes full and draft genomes has made this pathogen an ideal candidate for pan-genomic study ( Table 1). Our initial comparative analysis of 17 L. monocytogenes genomes indicated a "closed" pan-genome for this bacterial species. Species with a closed pan-genome typically share highly syntenic genomes with less frequent horizontal gene transfers (HGT) and genomic rearrangements. Therefore, the entire gene pool can be fully sampled by sequencing a small set of representative isolates, and the number of new genes to be discovered by sequencing additional genomes will quickly approach zero. This prompted us to design and construct a pan-genome CGH array that, in theory, accommodates the total genomic diversity of the L. monocytogenes species on a single DNA chip. Compared to several previous pangenome microarrays that targeted either the conserved sequence of gene families with low probe density or no coverage of the intergenic regions, we utilized a novel probe selection algorithm (PanArray) to design a pangenome tiling array that incorporates the genomes of 20 available L. monocytogenes strains [38]. This design provides unbiased coverage of the pan-genome, and also superior accuracy and resolution for data analysis.
Using integrated data obtained from both in silico whole-genome comparisons and pan-genome CGH analyses, we (1) explored the intraspecific genetic diversity of L. monocytogenes with a focus on the largely unexplored genetic lineage III; (2) estimated the core and pan-genome that define the L. monocytogenes species; (3) identified unique protein-coding genes and regulatory RNAs in the predominant disease-causing lineages, as they may relate to ecological fitness, host niche adaptation and pathogenicity; and (4) reconstructed phylogeny for different L. monocytogenes lineages and strains based on pan-genome characteristics. less than 7 novel genes per additional genome sequenced. Therefore, we presumed a single array could be designed to query the full genetic repertoire of the species, and be used to completely genotype currently unsequenced strains. For this purpose we designed a pan-genomic array comprising 385,000 50mer in situ synthesized oligonucleotide probes that fully tile the sequences of 20 L. monocytogenes genomes (Table 1), with no gaps, at greater than 2-fold coverage of each genome. Shortly after we completed our chip design, four additional L. monocytogenes genomes were sequenced to closure, including strain Clip 80459 (LI), strain Finland 1988 (LII), strain R2-561 (LII) and strain HCC23 (LIII). These new L. monocytogenes genomes enabled us to evaluate the genomic coverage of our array design by individually mapping each of the 385,000 oligonucleotide probes to annotated genes to the four genomes. A 50-mer probe was mapped to a particular gene if it perfectly matched the gene sequence or contained only a single nucleotide mismatch. For each annotated gene, the probe coverage was calculated as the percentage of the gene length  M1-002  IIIB  4b  ------A   W1-111  IIIB  4c  ------A  F2-208  IIIC  4a  -----Life Technologies Corporation/Cornell  University   A   F2-569  IIIC  4b  ------A   W1-110  IIIC  4c  ------A   1 Number of contigs based on GenBank at the time of our study. Strains with > 200 contigs were sequenced only to low coverage and were excluded from analysis. 2 Number of annotated protein coding genes and RNAs based on GenBank. 3 Nucleotide sequence identity in reference to EGD-e. 4 Strains used for array design (D); comparative sequence analysis (S), comparative genomic hybridizations (A). 5 Strains N1-017 and J2-071 were found to be mislabeled in GenBank; this has since been fixed. -Information not available. covered by mapped probes. Results in Table 2 suggest that our design adequately represents the intraspecies diversity of L. monocytogenes, particularly for LI and LII genomes. However, due to the limited number of fully sequenced LIII genomes available at the time of design, the coverage for LIII specific genes is less optimal, as indicated by HCC23.

Accuracy of CGH detection calls
Genomic DNA of nine LIII strains were each co-hybridized on the pan-genome arrays with that of EGD-e (LII) as an internal reference. The nine LIII strains were carefully selected from a strain collection to represent 3 different serotypes (4a, 4b, and 4c) and 3 different subgroups (LIIIA, LIIIB, and LIIIC) of this lineage (Table  1). Individual probes were designated as present or absent in the sample based on statistical analysis of the normalized signal intensities (see Materials and Methods). Since the position of each probe is known for all sequenced L. monocytogenes genomes, genes were scored by the fraction of targeting probes with a positive signal, otherwise known as the positive fraction (PF). This yields a very flexible scoring scheme that can be readily applied to any intragenic or intergenic feature of the genome targeted by a sufficient number of probes. A high PF indicates a gene is likely present in the hybridized genome. Circular maps of all PF values for the nine LIII genomes in reference to an LI strain F2365 and an LII strain EGD-e are shown in Figure 1.
To select an appropriate PF threshold and test the accuracy of gene calls based on PF values, we examined the true-positive and false-positive rates of the PF criterion for 51,814 annotated L. monocytogenes genes, compared against genomes for which we had both sequence and CGH array data. True gene "presence" was determined by a tblastn search of the 51,814 predicted proteins against a six frame translation of the genome [39], requiring a minimum of 50% amino acid similarity and an E-value ≤ 10 -5 . Figure 2 shows the Receiver Operating Characteristics (ROC) curves for the PF criterion measured against the tblastn standard for two L. monocytogenes strains, EGD-e and J2-071. The PF measure is remarkably robust, as there appear to be very few genes near the classification threshold. Additional file 1 shows a density estimation of PF values for both present and absent genes, suggesting that the vast majority of present genes have PF > 0.9 and absent genes PF < 0.1. Based on the ROC analysis, a PF cutoff of 0.6 was chosen to best match the tblastn results and minimize the expected error rate. The seemingly higher false-positive rate for J2-071, in comparison to the closed EGD-e genome, is partially due to tblastn false-negatives incurred from the 78 gaps in the J2-071 draft genome. In these cases, a gene that is truly present, but overlapping a sequencing gap, is falsely reported as absent by the tblastn method which artificially increases the measured false-positive rate of the CGH array method.
Accuracy statistics for the chosen 0.6 PF cutoff versus the 50% alignment similarity cutoff are given in Table 3. The array has perfect sensitivity for detecting the EGDe and J2-071 control genes. Expected accuracy was estimated for detecting both individual gene variants from all other strains and for detecting homologous gene groups (HGs). Orthologous gene groups are typically preferred; however, the inability of CGH to accurately determine sequence identity and gene order makes it impractical to discriminate between highly similar paralogs. Alternatively, we tested for the presence of 3,560 strongly homologous groups, identified by clustering proteins with higher than 50% amino acid similarity. A gene group was marked as present in a genome if any gene from that group exceeded the BLAST or PF threshold. Figure 2 also displays the true-and falsepositive rates of HG detection alongside the original ROC curves. In comparison to detecting individual gene variants, HG detection significantly increases the sensitivity of the array without increasing the false-positive rate. When analyzing only a single gene variant on the chip, high polymorphism in the sample genome can disrupt hybridization and lead to false-negatives. However, by considering an entire gene group, a sample only needs to hybridize with its nearest variant, thereby increasing the sensitivity [34]. To demonstrate the sensitivity of the array at detecting HGs in unsequenced strains, Table 3 also lists accuracy statistics for EGD-e and J2-071 when the probes specific to those genomes are removed from the analysis. This simulates the accuracy of the array at calling genes in an unsequenced LII or LIII strain. The sensitivity of the array is only slightly affected, with a 0.2% true-positive rate drop for EGD-e and a 1.3% drop for J2-071. The drop is more pronounced for J2-071 because it is one of only two LIII genomes included on the array, so ignoring the J2-071 specific probes affects the sensitivity of calling HGs from that lineage. Proportion of genes from four newly sequenced strains with probe coverage meeting a minimum percentage of the gene length (100%, 90%, 80%) for probes containing at most one SNP.

Estimation of core and pan-genomes
The expected number of new genes to be discovered by sequencing additional L. monocytogenes strains, and the sizes of the core and pan-genomes, were estimated using methods adapted from Tettelin et al. [24].
Frequent gaps and sequencing errors in low-quality genome assemblies were found to cause many missed protein alignments, which affected the core genome estimation. For example, only 683 EGD-e proteins meet the alignment threshold in all 24 draft L. monocytogenes genomes, an unreasonably low number. Additionally, fragmented annotations in the low quality genomes artificially inflate the pan-genome size estimate. To avoid these artifacts, only 18 "high quality" L. monocytogenes genomes were used for the new genes and pan-genome estimation. Genomes sequenced to less than 10× coverage using 454 pyrosequencing were excluded from the sequence analysis (Table 1). Array CGH results for the 8 additional LIII genomes were included in the core gene estimate.
To estimate the L. monocytogenes core genome, the number of shared genes was computed for many random permutations of N genomes, and the mean number of shared genes was computed for each N. The number of core genes for the species was estimated by fitting an exponential decay function to the means. For the highquality sequenced genomes, this analysis yielded an estimated horizontal asymptote of 2,467 ± 7 core genes. However, the sequenced genomes include only two LIII genomes. Repeating the analysis for all 26 genomes, including CGH results for the 8 additional LIII genomes, reduced the estimate by over 100 genes to 2,330 ± 5, emphasizing the importance of a balanced sample of diversity for estimating core genome size. Figure 3A displays the result of the 26 genome analysis including a smoothed density plot of the shared gene count distributions, the mean value for each N, and the best-fit exponential decay. Imperfect detection sensitivity due to sequencing gaps makes it impossible to achieve convergence for real data, so an exact core genome cannot be determined. Any non-zero false-positive rate for detecting core genes will artificially shrink the core genome with each additional genome, violating the horizontal asymptote of an exponential decay. This is evident in the almost linearly decreasing means towards the tail of Figure 3A. To account for these false-negatives, we introduced an additional parameter to the core genes model that adds a constant number of false-negatives upon the addition of each genome (see Materials and Methods). The revised model is a much closer fit to the data (residual standard error of 2.98 versus 10.68), accounts for noisy draft and CGH data, and yields an increased core genes estimate of 2,456 ± 4. This likely represents an upper bound on the core genome size. Considering results from both models, and the uncertainty caused by the draft genomes and CGH data, we estimate the core genome of L. monocytogenes to be between 2,330 to 2,456 genes (approximately 80% of a typical L. monocytogenes genome).  Present/Absent are based on a tblastn search. ACC, TPR, FPR, FDR stand for accuracy, true-positive rate, false-positive rate, and false discovery rate, respectively.
(-) Excludes all probes directly targeting the hybridized strain from the analysis to simulate detection accuracy for an unknown strain. For EGD-e, the mean of 9 data sets are given, along with their standard deviation to illustrate array reproducibility.
A major limitation of array CGH is that this method cannot detect novel genes contained in LIII genomes. For this reason, the pan-genome estimation was performed for only the high-quality sequenced genomes, of which two are from LIII. Again, the number of new genes identified by sequencing each additional genome was computed for many random permutations of N genomes. The number of new genes identified for each N was modeled by the power law function n = N -a [26]. Using the median values, the power law exponent a was estimated to be 1.12 ± 0.02. This is slightly lower than our original estimate of 1.38 due to the recent sequencing of four additional genomes, an updated annotation, and a stricter similarity threshold. In both cases, an exponent a > 1 indicates a closed pan-genome, meaning the size of the pan-genome is a bounded function of the number of sequenced genomes. However, fitting a power law to the mean values of these distributions yields a = 0.85 ± 0.01, suggesting an open pan-genome ( Figure 3C). This difference is caused largely by the diverse strains N1-017, HCC23, and J2-071, which contain many strain-specific genes and pull the mean values higher than the medians. For example, strain HCC23 contains 122 strain-specific genes not found in any of the other 17 strains. Removal of these three genomes from the analysis results in an a slightly greater than Figure 3 Prediction of core, new and pan genes in L. monocytogenes. (A) Exponential regression analysis that predicts the number of core genes in N sequenced genomes. For each N, permutations are randomly sampled and the number of core genes conserved in all N genomes is computed. The estimated number of core genes in 26 L. monocytogenes genomes ranges from 2,330 to 2,456. The sampled distribution is represented by a smoothed color density plot obtained through kernel density estimation. Yellow indicates the lowest density and purple indicates the highest density. For each N, black circles indicate the mean value and whiskers indicate the 5 th and the 95 th percentiles of the distribution. An exponential decay fit to the means is given by a solid red curve. A modified exponential decay is given by a solid black curve, which better fits the observed data by accounting for false-negative gene calls. (B) Power law regression analysis predicts the number of new genes that will be discovered by sequencing additional L. monocytogenes genomes. The LIII genomes are the outliers that pull the means higher, indicating that LIII diversity has not yet been fully sequenced. (C) Power law regression analysis predicts the number of L. monocytogenes pan genes accumulated from genome sequencing is currently 4,052 and growing. one for both the mean and median analyses. Two of these genomes are the only LIII strains in the analysis, indicating that additional sequencing from LIII may reduce the exponent even further. This regression analysis suggests L. monocytogenes has a significantly diverse gene reservoir, and additional sequencing of LIII genomes is necessary to resolve the exact size and nature of the L. monocytogenes pan-genome.
The estimated growth of the L. monocytogenes pangenome with additional sequencing was also simulated using many random permutations of genomes. For open pan-genomes, the cumulative number of unique genes discovered with the sequencing of additional genomes can be modeled by Heap's law using the power law function n = N g [26]. This regression is illustrated by Figure 3B and g was estimated as 0.12 ± 0.001. Since the growth of an open pan-genome is equivalent to the number of new genes added after sequencing each successive genome, the derivative of the pan genes function should be equal to the new genes function. That is N g-1 ∝ N -a and a = 1 -g for a < 1. Although simulated separately, the pan and new gene functions do follow this property for the mean value regressions, with a = 0.85 and g = 0.12 being in good agreement. For N = 18, the mean estimated pan-genome size is 4,052 and continues to grow, with diminishing returns, for larger N.
The above method is useful for estimating the size of the pan-genome, but because it depends on the order of the genomes analyzed, it does not yield a single representative set of pan genes for the analyzed strains. An alternative that does not depend on the order of genomes is to measure the number of gene groups identified by a similarity clustering method such as OrthoMCL [40]. We applied a similar approach, but for clustering strong homologs rather than orthologs, to be consistent with the other analyses. From a graph of 52,776 proteins with > 50% similar proteins connected by edges, 3,744 HGs were identified (Additional file 2) using the MCL graph clustering algorithm [41]. This provides a relative lower bound for the size of the currently sequenced L. monocytogenes pan-genome.

Lineage-specific genes and disparately distributed genes
Lineage-specific genes refer to genes that are exclusively present in a single L. monocytogenes lineage based on the above defined similarity threshold. To maintain a stringent threshold, a gene is not considered to be lineage-specific if any member of its HG is present in another lineage. Annotated genes in F2365 (LI), EGD-e (LII), and J2-071 (LIII) were used to screen for gene lineage specificity against all genomes analyzed in this study. Table 4 lists 4 LI-5 LII-and 6 LIII-specific genes identified in our study. Most of these genes encode hypothetical proteins. It is notable that only 5 of the 21 LII-specific genes previously identified by Doumith et al [10] passed our lineage specificity threshold. We used colony polymerase chain reaction (PCR) assays to verify the lineage specificity for all LI-and LIII-specific genes identified by CGH analysis (except for LMOf2365-0409 due to the small size of this gene for proper PCR primer  design). A total of 225 colony PCR assays were conducted for randomly selected L. monocytogenes strains in our collection, including 8 LI, 8 LII, and 9 LIII strains. The PCR results confirmed the lineage specificity for all genes analyzed, suggesting that the CGH approach was accurate for calling gene presence or absence and determining lineage specificity. We identified 86 disparately distributed genes (or DDGs) as listed in Table 5. DDGs refer to genes that are highly conserved (PF > 0.6 or protein similarity >50%) in LI and LII genomes but absent or highly divergent (PF < 0.6) in at least six of the nine LIII genomes. DDGs are of particular interest for us because the biased distribution and conservation of these genes in LI and LII genomes likely correlate to the enhanced ecological fitness and pathogenicity of L. monocytogenes in the host. The largest functional group of DDGs (41%) is associated with carbohydrate transport and metabolism. Figure 1 illustrates their distribution. L. monocytogenes harbors one of the largest bacterial carbohydrate phosphotransferase system (PTS) genes [42][43][44]. The abundance and diversity of the PTS system allows this soil saprophyte to utilize different carbon sources associated with the ecosystems it inhabits such as soil, silage and sediments. Fifteen PTS genes were identified as DDGs; most are associated with fructose-specific PTS enzyme II components (lmo0357-0358, lmo0631-0633, lmo2135-2137, and lmo2733). We surveyed the distribution of 978 annotated PTS genes and their homologs in all 26 L. monocytogenes genomes, and found 965 (99%) PTS genes are conserved in all LI and LII genomes and 7 (0.7%) are specific to LI. In contrast, 137 (14%) PTS genes are absent or divergent in LIII genomes. Diversity in PTS content is most noticeable among the three LIII subgroups, where 48 (4.8%), 137 (14%), and 136 (13.9%) PTS genes are absent in LIIIA, LIIIB and LIIIC, respectively. An interesting distinction among 3 subgroups is that LIIIA strains are capable of fermenting rhamnose, whereas LIIIB and LIIIC strains are deficient in rhamnose utilization [14]. We discovered a cluster of six genes (lmo2846-2851), which is likely to mediate rhamnose utilization, is missing from all LIIIB and LIIIC genomes. Five genes in this cluster [45] share protein similarities to the rhamnose catabolic pathway in Escherichia coli [46,47] and other Gram-positive bacteria such as Bacillus subtilius (Additional file 3).
The second-largest functional group of DDGs consists of 12 putative transcription factors representing 7 different regulatory gene families. Six are adjacent to PTS genes and possibly involved in regulating carbohydrate metabolism. Four are absent from the non-pathogenic L. innocua [43], L. welshimeri [48] and L. seeligeri [49], suggesting roles in virulence and pathogenicity. One Crp/Fnr (cyclic AMP receptor protein-fumarate and nitrate reduction regulator) family gene lmo0753 was found to be highly specific to LI and LII but absent in LIII. This Crp/Fnr factor is adjacent to a bile resistance gene btlB and shares similar functional domains with prfA, the master regulatory gene of L. monocytogenes virulence.
We found multiple DDGs associated with gastrointestinal (GI) tract adaptation. For instance, two bile-associated genes btlB (lmo0754) and pva (lmo0446) are absent in LIII. Both genes help L. monocytogenes resist the antimicrobial effects imposed by bile salts during its passage through human GI tract [50]. Loss of these genes lowered tolerance to bile and reduced persistence in murine GI tract [51]. The glutamate decarboxylase (GAD) system mediates the acid resistance in bacteria [52][53][54]. In L. monocytogenes gadD1 (lmo0447) is responsible for growth at mild acidic conditions (pH = 5.1) and gadD2 (lmo2363) primarily mediates the resistance to severe acidic stress (pH = 2.8) [55]. We found that gadD2 is conserved in all lineages, whereas gadD1 and its coupled glutamate: γ-aminobutyrate antiporter gadT1 (lmo0448) are absent in most LIII strains except for J2-071 and HCC23. An arginine deminase (ADI) system (lmo0036-0041) was recently characterized in L. monocytogenes [56]. The ADI system plays a role in listerial acid tolerance and may contribute to the enhanced adaptation to acidic conditions in the stomach. It was previously reported that this gene cluster is present in LI and LII but absent from LIII and non-pathogenic L. innocua and L. welshimeri [56]. Our results, however, showed that the ADI gene cluster is also highly conserved in LIIIB. An additional seventeen DDGs have no homolog in the genome of L. innocua, including three putative genes encoding LPXTG surface proteins (lmo0333, lmo1666 and lmo2085) and sepA, a putative virulence factor co-regulated by PrfA and σ B [57,58].

Small regulatory RNAs
Complete tiling of the L. monocytogenes pan-genome allowed us to survey the distribution of 100 non-coding small regulatory RNAs with specified 5' and 3' positions [45] in 9 LIII genomes. The majority (87%) of these sRNAs are conserved in LIII genomes, and only eight were found to be absent or divergent in LIII (PF < 0.6) ( Table 6). Noticeably, all eight sRNAs are also absent from L. innocua, and five were differentially expressed in intestinal lumen or blood, suggesting roles in host niche adaptation. For example, ril38 contributes to listerial survival in human blood [45].

Phylogenomic reconstruction
To reconstruct the phylogeny of all L. monocytogenes strains analyzed in this study, we surveyed the binary distributions of 3,560 HGs (Additional file 4) and 2,846  Genes conserved in all LI and LII genomes but absent in two or more LIII sub-groups (IIIA, IIIB or IIIC). Genes are listed based on their annotation in functional groups. 2 LIII subgroup in which a listed gene is present. 3 Presence "+" or absence "-" of a gene in L. innocua genome. 4 Genes belong to an annotated operon based on [45]; "-", not annotated in operons. EGD-e protein-coding genes among the 26 L. monocytogenes genomes, respectively. We then constructed neighbor-joining (NJ) trees [59] based on a maximumlikelihood gene content distance measurement [60] ( Figure  4). The NJ trees based on 3,560 HGs ( Figure 4A) and 2,846 EGD-e genes ( Figure 4B) both clearly separated all L. monocytogenes strains into 3 main clusters (i.e. a LI cluster, a LII cluster and a LIII cluster) [61]. However, the EGD-e gene-based NJ tree showed a distorted topology, indicative of a bias caused by a restricted set of loci used for phylogenetic reconstruction [62].
Of note in LI, the serotype 4b strain N1-017 appears to be closely related to serotype 1/2b strains in the LI cluster, likely representing an evolutionary intermediate between the split of serotype 4b and serotype 1/2b [10]. Of note in LII, four strains F6900, F6854, J2818 and J0161 were previously traced back to a single food processing facility over a time span of 12 years [63]. These four isolates are clustered closely on a single branch, indicative of a recent common ancestry. While the NJ trees based on gene content allowed some inference of L. monocytogenes phylogeny, the reliability of Up-regulated "↑", or down-regulated "↓" in vivo [45]; n/a, information not available. 2 Gene is either present "+" or absent "-" in a LIII genome. the tree topology can be compromised by reticulate events such as horizontal gene transfer (HGT). Therefore, a split network was constructed using the Neighbor-net algorithm [64] to evaluate the extent by which incompatible phylogenetic signals (e.g. HGT) might affect our estimation of phylogenetic topology. Split networks do not force the formation of a tree-like structure and are able to represent incompatible signals as parallel edges, indicating the possibility of HGT or recombination. The resulting split network (Figure 4c) shows a congruent topology with the NJ tree (Figure 4a), suggesting the majority of the 3,560 HGs have been vertically inherited.
Genomic diversification in L. monocytogenes lineage III Figure 5A shows a rooted NJ tree for the three LIII subgroups, using EGD-e as an outgroup. HCC23 appears to be most closely related to LIIIA. Further evidence that links HCC23 to LIIIA is the rhamnose utilization gene cluster. This gene cluster is conserved in LIIIA and HCC23 but absent in LIIIB and LIIIC. The rooted NJ tree also suggests that LIII is polyphyletic and HCC23 possibly resembles an ancestral state of LIII. The emergence of 3 LIII subgroups is likely to be concomitant with stepwise genome reduction as observed in some non-pathogenic Listeria species, including L. welshimeri [48] and L. seeligeri [49]. A total of 206 genes, that are highly conserved in LI and LII, are found to be phylogenetically informative for LIII (i.e. present or absent in at least one LIII strain) (see Additional file 5). Figure 5C shows a heat map of these genes in the ten LIII strains. Interestingly, gradual gene decay or diversification was observed in the order of LIIIA, LIIIC and LIIIB. Loss of select LI and LII core genes was most significant in LIIIB. This LIII subgroup forms a deep branch in a split network ( Figures 5B). However, it should be noted that the contribution of novel LIII genes to the phylogenetic reconstruction is likely to be underestimated due to the limited number of fully sequenced LIII genomes available at the time of this study.
To access the inter-lineage diversity from a gene content perspective, we identified 576, 521 and 489 accessory genes in F2365 (LI), EGD-e (LII), and J2-071 (LIII), respectively and surveyed their distributions in 26 L. monocytogenes genomes (Additional file 6). Minimum spanning trees were then built to compare and visualize the different distributions of these accessory genes across the three lineages (Additional file 7). Accessory genes display similar distributions in most LI and LII strains, featured by one to two dominant subsets (shown as large circles) generated by genes present or absent in most strains of the same lineage. However, more complex and branched distributions were observed in LIII strains, demonstrating an elevated genomic diversity in this rare L. monocytogenes lineage.

Discussion
Pan-genome CGH was used in this study to compare L. monocytogenes genomes in pursuit of novel genes that potentially promote the fitness and virulence of LI and LII strains in human, as these strains are predominantly associated with human listeriosis. We used phylogenomic concepts [65] to guide our search for DDGs and to infer the phylogeny for the species. Array CGH is suitable for the purpose of this study because it is relatively cost-effective compared to the sequencing and closure required to make accurate gene calls using whole-genome shotgun sequencing. Unlike whole-genome sequencing, however, the CGH approach has several inherent limitations in detecting novel genes or pseudogenes, inferring sequence-based phylogenies, and for a host of other analyses inaccessible with array data.
A particular challenge in this study was to unify the analysis of both genome sequence and CGH array data. The sensitivity of the two methods is fundamentally different. BLAST searches are capable of precisely measuring amino acid similarity and can identify orthologs and detect distant homologies. In contrast, DNA array hybridizations measure nucleotide conservation and are only capable of detecting highly conserved DNA sequences. In addition, hybridization gives no positional information and is non-specific, making it difficult to discriminate between paralogs. For this reason, we used homologous groups for gene content comparison, and permitted variant sequences to hybridize to their nearest neighbor in a group, rather than a single selected variant (see Methods). Prior to implementing this method, there was tremendous detection bias in the CGH data. The HG method greatly increased the agreement between the array and BLAST detection strategies, which was critical for the phylogenetic analysis of the combined data.
The low frequency of LIII in human listeriosis can be partially explained by its overall rarity in foods, lack of unrecognized virulence factors, or defective mutation in some known virulence factors. For instance, a novel streptolysin S-like hemolytic and cytotoxic virulence factor, listeriolysin S, was recently found to be exclusively present in LI strains [66]. This factor contributes to virulence of the pathogen in murine and human polymorphonuclear neutrophil-based assays [66]. Several studies also reported that premature stop codons are common in inlA in LIII strains [67][68][69][70]. Point mutations in inlA are presumably caused by localized recombination and lead to a truncated InlA protein and consequently a reduced invasion phenotype in human intestinal epithelial cells [67][68][69][70]. Our pan-genome study uncovered 86 DDGs and 8 non-coding small RNAs that are absent or mutated in the largely uncharacterized LIII genomes (Table 5 and Table 6). Most of these genes fall into the functional categories of cell wall structure, transcription regulation, and carbohydrate metabolism and transport. Such functions are likely to play critical roles in ecological fitness of L. monocytogenes in different environment such as food processing facilities and host niches. Genes involved in carbohydrate metabolism and transport stand out as the largest functional group of DDGs, implying that the capability of utilizing different carbon sources in the transmission and infection cycle contribute most to the predominance of LI and LII strains in human infections. In particular, PTS systems that are likely to confer nichespecific metabolic advantages are conserved in LI and LII but decayed or lost in LIII. For example, the fructose-like PTS components (lmo2133-lmo2137) are conserved in all LI and LII genomes but completely lost in LIIIB and LIIIC (Figure 1). This operon was postulated to have been acquired by L. monocytogenes through HGT from Enterobacteriaceae that cohabitate the GI tract of mammalian host [71]. A recent study of its homolog in extraintestinal pathogenic E. coli suggested that this operon promotes bacterial fitness against the stress in host serum and gut, and enhances bacterial invasion in eukaryotic cells [72]-both are integral parts of listerial pathogenesis.
L. monocytogenes possess extraordinary capabilities for sustaining harsh conditions during its residency in the environment (e.g. it can utilize limited carbon source), in foods (e.g. it can resist salts and grow at refrigeration temperatures), and in parasitized hosts (e.g. it can escape from immune defense). During its passage through the human GI tract, L. monocytogenes is able to resist the antimicrobial effects imposed by gastric contents. Multiple genes involved in combating GI tractrelated stresses, primarily gastric acid (gadD1, gadT1 and the ADI system) and bile salts (btlB and pva), are missing in LIII. Loss of these genes may result in a defective phenotype in surviving the GI tract prior to invasive infection [50]. Also absent in most LIII genomes are a number of small regulatory RNAs (e.g. rli29 and rli48) and transcription factors (e.g. lmo2138 and lmo2851) that appear to be up-regulated in the murine intestine [45]. It is reasonable to speculate that the GI tract may act as a major barrier to prevent LIII strains from causing systematic infections. Epidemiological studies seem to support this speculation by collectively showing that gastroenteritis, rather than more severe listeriosis symptoms, is predominant among infected individuals [73][74][75]. Although intracellular strategies have been the primary focus in numerous studies of listerial pathogenesis, a few recent studies demonstrated that the GI passage has a fundamental impact on listerial pathogenicity [76,77]. Considering that most LIII strains possess virulence factors related to its intracellular lifestyle and are cytopathogenic [14], the inability to survive in the GI tract becomes a plausible explanation for the overall rarity of LIII in human listeriosis.
We estimate that the L. monocytogenes core-genome consists of 2,330 to 2,456 genes and the pan-genome encompasses over 4,052 genes ( Figure 3). Compared to several other bacterial species, L. monocytogenes has relatively higher proportions (about 80%) of core genes shared by individual genomes (Table 7), which in turn reflects lower intraspecies genomic variability. This is consistent with the low rates of recombination in this bacterial species [68]. Despite the perceived high genomic synteny, L. monocytogenes possesses considerably diverse pan gene reservoir and displays biased distribution of accessory genes across major evolutionary lineages (Additional file 7).
Some incompatible phylogenetic signals as indicated in the split network ( Figure 4C) were traced back to prophage-associated genes. Notably, the comK prophage regions in different L. monocytogenes genomes display significant sequence variations (Additional file 8). Such variations may be a result of prophage decay, recombination that have accumulated in the remnants of common prophage ancestor(s), or multiple lysogenization of different bacteriophages at the same genomic location. Phages have been recognized as the major contributors of important biological properties (e.g. virulence factors) in many bacterial species [78,79]. The functional impact of bacteriophages on the biology of L. monocytogenes, if any, has yet to be determined.

Conclusions
Intraspecific variations in host preference, ecological fitness and virulence are common in many bacterial pathogens. This is exemplified by the species of L. monocytogenes which consists of multiple distinct genetic lineages. Two lineages of this species (i.e. LI and LII) predominantly cause human sporadic and epidemic infections, whereas the other (i.e. LIII) has never been implicated in human disease outbreaks for unclear yet intriguing reasons. Here we described a novel pan-genomic approach that combines in silico comparative analysis and high-density CGH arrays to explore the genomic diversity of L. monocytogenes. Our integrated approach allows vigorous core genome estimation and phylogenomic reconstruction, which in turn is nearly impossible for low-quality, short-read draft genome assemblies with hundreds of contigs. Exponential regression analysis predicts that L. monocytogenes has a core genome of between 2,330 to 2,456 genes (80% of each individual genome) and a pan-genome repertoire of over 4,052 unique genes. Comparison of all lineage strains reveals high genomic synteny with limited sequence drift associated with lysogenic bacteriophages. Phylogenomic reconstructions based on 3,560 homologous groups suggest a polyphyletic population infrastructure and gradual loss of metabolic genes as this saprophytic species diversified into the rare and probably defective lineage III. Based on our All numbers are estimates in this table. 1 Only studies including more than five strains are shown. 2 Pan-genome growth behaviors as described by the authors. * Estimated from figures, but not explicitly stated in the paper. 3 Cutoff values and methods for defining core and pan genes vary widely across the different studies. This column only gives a rough summary of the similarity cutoff. Cutoffs of the form I/L indicate a minimum BLAST hit of I% similarity over L% of the protein length. BSR is Blast Score Ratio [32]. SSR is the similarity score ratio used in this study, similar to BSR. results, one L. monocytogenes strain carries about 75% of the pan genes of this species. That said, experiments based on a single reference strain may not adequately sample the total genetic repertoire and not fully interpret the versatile biology of L. monocytogenes. With a more defined species core genome, we may also be able to supplement new genomic criterion for taxonomic classification of L. monocytogenes, as some traditional methods are often inconclusive and controversial. The pan-genomic approach described here can be used to explore the genomic diversity in other pathogenic species, as such information would be extremely valuable for us to better understand the intraspecific variations in virulence, and the ecology, epidemiology and evolution of microbial pathogens.

Methods
Bacterial isolates and genomic DNA extraction Table 1 lists the 31 L. monocytogenes strains analyzed in this study. As of November 2008, twenty sequenced L. monocytogenes strains were available and used for the pan-genome array design. CGH was performed for nine LIII strains representing 3 serotypes (4a, 4b, and 4c) and 3 subgroups (LIIIA, LIIIB, and LIIIC). Four additional isolates that were sequenced after the array design were incorporated in the pan-genomic and phylogenetic analysis. Bacterial strains were grown overnight in brain heart infusion (BHI) broth at 35°C. Genomic DNA was extracted and purified using MasterPure Gram positive DNA purification kit (EPICENTRE Biotechnologies, Madison, WI). Genomic DNA was labeled with Cy3 or Cy5 dye prior to array hybridization.

Pan-genomic array design
The pan-genome tiling array was designed using the PanArray software [38] to fully tile the 20 sequenced L. monocytogenes genomes (Table 1). PanArray employs a greedy probe selection algorithm to tile multiple whole genomes using a minimal number of probes. For this study, PanArray was used to design an array comprising 385,000 50-mer oligonucleotide probes that fully tile the 20 listerial genomes at 2.65 × coverage with no gaps. To avoid tiling low quality or contaminant sequence, contigs less than 2 Kbp in length were discarded, leaving 54,810,759 bp of tiled sequence. A full description of the array design is given in [38], and the array design is available from the Gene Expression Omnibus (GEO) [80] under accession number GPL8942. To incorporate newly sequenced strains that had not been included in the original array design, we aligned all probes on the array to the new genomes allowing one mismatch per probe, and added genes with probe coverage ≥90% of their length to the array annotation.

Array hybridization and data analysis
Genomic DNA of each LIII strain was co-hybridized with that of EGD-e on a Roche NimbleGen 385 K custom CGH array. Two dye-swap replicates were performed for each LIII strain/EGD-e pair to eliminate dye bias and test the array reproducibility. Genomic DNA labeling and array hybridization were performed at Roche NimbleGen according to the manufacturers specifications (Madison, WI). Technical details on DNA labeling and hybridization can be found at http://www.nimblegen.com/products/lit/cgh_userguide_v6p0.pdf . We designed a probebased intensity classification scheme to provide the most flexibility for pan-genome array data analysis, allowing any locus to be classified based on the aggregated scores of its individual probes, without reference to control hybridization. Specifically, all raw signal intensities were first transformed to log values, then log intensities for replicate hybridizations were normalized using quantile normalization [81]. Replicates were combined at the probe level by taking the average of the normalized log intensities for each probe. Quantile normalization assumes similar intensity distributions, so to avoid crosssample normalization bias. Each strain was normalized and processed independently.
Because there was no one single reference to operate on, and to preserve sensitivity for small polymorphisms, intensity data was not smoothed or segmented. Instead, individual probes were each classified as present or absent using a minimum kernel density (MKD) method. MKD methods have performed well for the binary classification of both genes and segments [31,82], and here we extended the idea to the classification of individual probes. Because the array contains the full genetic diversity of L. monocytogenes and 4,300 random control probes, there is expected to be a significant fraction of both present and absent probe intensities for any L. monocytogenes sample. Therefore, the distribution of probe intensities is generally bimodal, and the minima between the present and absent peaks can be used as an effective threshold for binary classification. For each sample, the probability density function of the observed intensities was estimated using kernel density estimation and the central minima of this function identified as the optimal cutoff (Additional file 9). This method was preferred because it is non-parametric, there is no potential normalization bias, it requires no training, and each sample can be processed independently without affecting the accuracy. It is also extremely flexible, in that a classification for any gene can be generated by aggregating the classifications of the probes targeting that gene. For this purpose, genes were scored by collecting all probes known to target a specific gene and computing the fraction of probes classified as present, the positive fraction (PF). A PF threshold of 0.6 was chosen by analysis of ROC curves for the EGD-e and J2-071 controls to minimize the total error rate (false-positive rate + falsenegative rate) versus the tblastn 50% protein similarity threshold. PF was favored because it does not depend on cross-sample normalization, as would be necessary for an intensity threshold, and additional genomes can be analyzed independently without affecting accuracy. This makes it ideal for rapid and economical analysis of novel isolates, while maintaining comparable accuracy to alternative analysis methods [31,34].

Pan-genomic analysis
Pan-genomic analysis was performed using the methods introduced by Tettelin et al. [24], with modifications on the conservation threshold and permutation sampling. Annotated proteins for each genome were aligned to the six frame translations of all other genomes using tblastn. Query proteins were marked as present in a subject genome if the corresponding amino acid sequences aligned at ≥ 50% similarity with an E-value ≤ 10 -5 , where "similarity" was defined as the number of positively scored residues divided by the length of the protein sequence. This threshold is more stringent than originally proposed in [24], but less stringent than those used in other studies (e.g. [32]). The 50% threshold was empirically selected as a compromise between tolerating draft genomes with fragmented annotations and avoiding false positive detections due to conserved domains and distant paralogs. A PF threshold of 0.6 was consequently chosen as an analogous threshold for the CGH results, as described above. Genomes sequenced to less than 10× coverage using 454 pyrosequencing were excluded from the analysis.
The addition of an N th genome was simulated by examining ordered combinations of N genomes. Due to the large number of available genomes, it was not feasible to consider all possible permutations as originally suggested. Instead, a randomly selected subset of 100,000 permutations was considered for the addition of each N, and the mean (or median) values were computed from this subset. For each permutation, the number of new genes found in the N th genome G N was computed as the number of proteins of G N not present in any genomes G i for i = {1, ..., N-1}. The number of core genes was computed as the number of proteins of G N present in all genomes G i for i = {1, ..., N}. Because gene sequences for the CGH strains are not known, EGD-e was set to be G N for all permutations. The number of pan genes in a permutation of N genomes was computed by examining the genomes G i in order from 1 to N. A gene in G i was identified as a pan gene if it was not present in any of the genomes G j for all j < i.
The Gauss-Newton method implemented by the R function nls [83] was used to perform non-linear least squares regression on the mean and medians of the core genes, new genes, and pan genes distributions. According to [26], the number of new genes n expected to be discovered by sequencing an N th genome was modeled by the power law function n = N -a , and the number of pan genes also by a power law n = N g . According to [24], the number of core genes was modeled by the exponential decay function n = e -N/τ + Ω, where Ω describes the horizontal asymptote and therefore the core genes estimate. In all cases, the functions were fit to the mean or median values for all N > 1.
To accommodate false-negative errors introduced by sequencing gaps and weak hybridization signal, the originally proposed exponential decay function was modified with the addition of a fourth parameter to model the effect of a constant number of false-negatives with the addition of each genome. The modified equation is: where the linear parameter b represents the number of core genes lost to false-negative errors for each N. Core gene loss due to false-negatives is not a truly linear phenomenon (e.g. sequencing gaps are not independent and the core genome can never be negative), but for a large core genome and a modest N it is a reasonable approximation that is easy to fit. To assure convergence of the optimization algorithm, b was first estimated via linear regression for N ≥ 15, and this was used as the start estimate of b for the full model regression. The augmented model is useful in that the observed core genome size may be linearly decreasing (as is expected for draft genomes), but an estimate of the true core genome size Ω may still be recovered.

Identification of homologous groups
Homologous groups (HGs) were used for phylogenetic reconstruction and core genome estimation. HGs were identified by clustering a graph of protein similarity for all annotated protein-coding genes from the 18 highquality L. monocytogenes genomes. A node was added to the graph for each one of the 52,776 annotated proteins. Edges were added between any two proteins with an alignment above the 50% similarity threshold. Unlike OrthoMCL, no orthology constraint was applied. Edges between any two similar proteins were added, including edges between proteins in the same genome. This was necessary due to the inability of CGH to accurately determine orthology. The MCL clustering algorithm was applied to this graph using an inflation parameter of 2.0. From this clustering, 3,744 HGs were identified, including strain-specific genes represented as singleton clusters (Additional file 2). Some HGs, mostly singletons, were not represented on the array because additional genomes had been sequenced after the array design. A total of 3,560 HGs, represented on the array by at least one member gene, were used for the phylogenetic analysis.
For sequenced genomes, an HG was called present if at least one member protein of the HG aligned above the 50% similarity threshold. For CGH genomes, an HG was called present if at least one member gene of the HG hybridized with PF ≥ 0.6. Results based on this threshold were converted to a unified binary table indicating gene presence or absence for all HGs in all genomes analyzed in this study (Additional file 4). These binary vectors were used for measuring evolutionary distance using the maximum-likelihood measure of [59], and Neighbor-net split networks [60] and neighbor-joining trees [65] were built using the SplitsTree program [84]. Alternative parsimony methods failed to build reasonable trees, most likely due to the large number of incompatible splits caused by both horizontal gene transfer and errors in the data. supported in part by the U.S. Department of Homeland Security Science and Technology Directorate under award NBCH2070002. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Authors' contributions