High-throughput sequencing of Bacillus anthracis in France: investigating genome diversity and population structure using whole-genome SNP discovery

Background Single nucleotide polymorphisms (SNPs) are ideal signatures for subtyping monomorphic pathogens such as Bacillus anthracis. Here we report the use of next-generation sequencing technology to investigate the historical, geographic and genetic diversity of Bacillus anthracis in France. 122 strains isolated over a 60-years period throughout the country were whole-genome sequenced and comparative analyses were carried out with a focus on SNPs discovery to discriminate regional sub-groups of strains. Results A total of 1581 chromosomal SNPs precisely establish the phylogenetic relationships existing between the French strains. Phylogeography patterns within the three canSNP sub-lineages present in France (i.e. B.Br.CNEVA, A.Br.011/009 and A.Br.001/002) were observed. One of the more remarkable findings was the identification of a variety of genotypes within the A.Br.011/009 sub-group that are persisting in the different regions of France. The 560 SNPs defining the A.Br.011/009- affiliated French strains split the Trans-Eurasian sub-group into six distinct branches without any intermediate nodes. Distinct sub-branches, with some geographic clustering, were resolved. The 345 SNPs defining the major B.Br CNEVA sub-lineage clustered three main phylogeographic clades, the Alps, the Pyrenees, and the Massif Central, with a small Saône-et-Loire sub-cluster nested within the latter group. The French strains affiliated to the minor A.Br.001/002 group were characterized by 226 SNPs. All recent isolates collected from the Doubs department were closely related. Identification of SNPs from whole-genome sequences facilitates high-resolution strain tracking and provides the level of discrimination required for outbreak investigations. Eight diagnostic SNPs, representative of the main French-specific phylogeographic clusters, were therefore selected and developed into high-resolution melting SNP discriminative assays. Conclusions This work has established one of the most accurate phylogenetic reconstruction of B. anthracis population structure in a country. An extensive next-generation sequencing (NGS) dataset of 122 French strains have been created that allowed the identification of novel diagnostic SNPs useful to rapidly determine the geographic origin of any strain found in France.


Background
Bacillus anthracis, the causative agent of anthrax, is a spore-forming bacterium found throughout the world. The bacterium mainly affects wild and domesticated herbivores, causing serious, often fatal disease. Spores which are the infectant form of the bacteria can remain in soils for decades before being ingested by grazing animals and later on contaminated browse by blow flies [1]. Consequently, opportunity for accumulating DNA mutations is limited, resulting in relatively little genetic variation within the species.
Single nucleotide polymorphisms (SNPs) are the most common genetic variation found in genomes of all species. It also represents important biologically informative DNA markers extensively used to elucidate deep phylogenetic relationships among worldwide strains, due to their evolutionary stability [2]. By querying a large number of them against collections of diverse strains, a set of canonical SNPs (canSNPs) that define major clades within the B. anthracis species has been identified and used for subdividing all isolates into three major lineages (A, B and C) and 13 major sub-lineages or sub-groups [2][3][4][5]. Other SNPs that define the A.Br. Ames and A.Br.WNA lineages [6,7] or that lie on various branches of the B. anthracis SNP tree [8] have also been published.
As a consequence of the rapid development of technology in the area of high throughput sequencing, extensive genomic sampling is now possible at a reasonable cost. By interrogating nearly every base of the genome, whole genome sequencing is the genotyping tool giving the highest possible resolution, with comparison information spanning from distant phylogenetic relationships to the highest level of subtyping. Identification of thousands of SNPs retrieved from compiled NGS sequences facilitates high-resolution strain tracking and provides the level of discrimination required for microbial forensics. Canonical SNP typing and whole genome comparisons of multiple strains are now becoming the reference method in B. anthracis genotyping.
In France, anthrax was a widespread cause of disease in livestock and humans until the beginning of the 20 th century. The development of live vaccines for animals by Louis Pasteur (1880-1881), followed by widespread vaccination, had then allowed the numbers of animals affected, and indirectly human cases, to steadily decline. The prevention of new cases and proper carcass disposal with the implementation of knackeries prevented the replenishment of spores in soil in many regions of France and concentrations progressively declined [9]. Today, anthrax is considered as a sporadic disease that continues to cause occasionally outbreaks in livestock. A few cases with one to several dead animals may be recorded annually. However two large outbreaks with several dozens of animals affected still occurred over the past five years in areas where outbreaks had been reported in the past [10].
We have previously reported the canSNP characterization and MLVA typing of a hundreds of strains of B. anthracis isolated in France [11]. French strains were assigned into the three B.Br.CNEVA, A.Br.011/009 and A.Br.001/002 canSNP sub-groups and further resolved into about fifty MLVA genotypes (unpublished data). To gain further resolution into the phylogenetic relationships among French isolates, 122 strains have been sequenced using parallel sequencing technology. Comparative genomics, with a focus on SNPs discovery, were then conducted to identify diagnostic SNPs that specifically discriminate regional sub-groups of strains.

canSNP genotyping
One hundred and thirty-six "field strains" representative of anthrax activity in France since about half a century (1953-2011) were selected for this study. All strains were first subjected to canSNP typing [11] to place them into a broader phylogenetic context [3]. An additional canSNP assay designed around the A.Br.011 SNP, that further divides the Trans-Eurasian A.Br.008/009 group into two sub-groups (A.Br.008/011 and A.Br.011/009) [4], was also run against the whole collection (data not shown). All French A.Br.008/009 samples (n = 39) were defined as belonging to the A.Br.011/009 sub-group.

Whole genome sequencing
In order to accurately determine the evolutionary relationships among these strains, 122 isolates were characterized by paired-end whole genome sequencing. We sequenced 67 B.Br.CNEVA-affiliated strains collected in the four areas where this sub-lineage is prevalent (i.e. the Alps (n = 34/38), the Pyrenees (n = 9), the Massif Central region (n = 18/20) and the Saône-et-Loire department (n = 6)), 31 A.Br.011/009-affiliated strains isolated throughout the country (n = 31/36) and 24 of the 26 A.Br.001/002-affiliated strains collected from the Doubs department (n = 21/22) and the North-East of France (n = 3). An African strain (IEMVT 89) affiliated to the A.Br 005/006 sub-group positioned at the basis of the main A radiation was also whole-genome sequenced for comparison.
The Illumina sequencers produced 4 to 25 million reads per strain after applying the quality filter of the Illumina base-calling pipeline. The filter-passed reads were aligned to the Ames Ancestor reference genome, resulting in more than a 56-fold sequencing depth on average and a genome coverage ranging from 98.9 to 99.9% (Additional file 1: Table S1).

Extraction of whole strain-specific SNPs among B. anthracis strains
Comparative analysis of the genomic sequences was next carried out. Available data for the B. anthracis Ames Ancestor, Sterne and A1055 strains were also included. The B. cereus AH820 genome was used as outgroup. Figure 1 illustrates the minimum spanning tree (MST) generated by the chromosomal whole-genome SNPs data. A total of 3987 chromosomal SNPs were discovered that precisely positioned French strains within the B. anthracis SNP tree. Additional SNPs (128 in pXO1 and 92 in pXO2) were also identified in both plasmids (Analysis of the 122 French strains, including the African strains IEMVT 89, data not shown). Consistent with previous MLVA and canSNP analyses [11], all French isolates were clustered into three distinct groups, i.e. B.Br.CNEVA, A.Br.001/002 and A.Br.011/009. The major B.Br.CNEVA group was separated by a long phylogenetic branch (438 SNPs) to the A and C branches radiation. A total of 1581 SNPs resolved the French population structure at an unprecedented resolution. Strain-specific SNPs can be defined for nearly all particular strains. Moreover, particular genotypes corresponding with geographic regions were observed.

B.Br.CNEVA phylogenetic analysis
More than half of the B. anthracis isolates from France clustered within the B.Br.CNEVA lineage. We therefore focused on further resolving the positions of its members. The B.Br.CNEVA sub-lineage was characterized by 779 unique chromosomal SNPs, 49 pXO1 SNPs and 49 pXO2 SNPs. Our analysis identified 345 diagnostic SNPs that discriminate between French strains, as well as 29 plasmidic SNPs ( Figure 2). Positions of the 67 wholegenome sequenced strains relative to each others are illustrated on Figure 2. A geographic clustering into four sub-groups (Alps, Pyrenees, Massif Central and, nested in this latter, the Saône-et-Loire sub-cluster) was observed using both chromosomal and plasmidic SNPs analysis. Four markers that specifically define one of the newly described geographic sub-groups, as well as one additional SNP shared by all French B.Br.CNEVA genomes, were selected and incorporated into SNP discrimination HRM assays (Table 1).

A.Br.011/009 phylogenetic analysis
Comparison of A.Br.011/009 genomes yielded a total of 574 unique chromosomal SNPs, including 560 SNPs that resolve the sub-group into six distinct sub-branches without any intermediate nodes ( Figure 3A). This substructure was poorly reflected in both pXO1 and pXO2 SNPs profiles. The 38 Plasmidic SNPs resolved the 31 samples in a higher number of putative branches, without formal clustering ( Figure 3B and C). In contrast to the B.Br.CNEVA lineage, a less apparent geographic clustering was also observed within each identified subbranches ( Figure 3A). Branches 1 and 2 include strains isolated in the north-east quarter of France, the outermost samples belonging to branch 2. Branches 3 and 6 are composed of strains isolated around the center (close to Paris) and in the South-East quarter of France, respectively. While the only strain of branch 5 was isolated close to the Pyrenees (South-West), the two samples from branch 4 are from remote geographical regions (West and South). Specific SNPs for the two more relevant branches (1 and 2) were developed into real-time HRM assays (Table 1).

A.Br.001/002 phylogenetic analysis
Concerning the minor A.Br.001/002 subgroup, sequence analysis identified 226 chromosomal SNPs that differentiate the 24 French strains. Notably, all isolates from the Doubs department (21 out 24 samples) formed a single clonal cluster characterized by 44 SNPs (Figure 1). All strains were closely related, differing by a maximum of 2 chromosomal SNPs. Similar SNP patterns were found using both virulence plasmids data (data not shown). Most of them (n = 20) were collected during 17 clustered animal outbreaks occurring in the summer of 2008. An additional strain was isolated in the same area in 2011 and presented the dominant 2008's SNP genotype, suggesting a common point source for these repeated outbreaks. One SNP marker was selected for further testing based upon its specificity for the Doubs strains ( Table 1). The three other strains are older "outbreak" samples (1953, 1954 and 1981) isolated in other departments of France. They were found to be unrelated to the Doubs recent outbreak isolates. Several specific SNPs differentiating these strains were also found on both plasmids.

French specific SNP discrimination genotyping assays
We designed eight novel canSNPs discrimination assays based on High-Resolution Melting (HRM)-PCR technology (Table 1). Assays targeted five SNPs identified among the French B.Br.CNEVA genomes, one SNP specific to the A.Br.001/002 strains isolated in Doubs, and two SNPs defining the two large sub-branches 1 and 2 of the A.Br.011/ 009 sub-group ( Figure 1). These diagnostic assays were then successfully validated across the 138 B. anthracis strains of our collection. The two expected alternate alleles exhibited distinct melting curves and melting temperatures (Tms), allowing unambiguous grouping of each allele. On average, differences in Tm ranging from 0.5 to 0.8°C were observed between the both allelic states ( Table 2). The seventeen non-sequenced genomes of the collection fall on the correct geographical groups (data not shown).
The eight novel canSNPs were further validated in silico against all B. anthracis genomes (n = 30) available in public NCBI database and Sequence Read Archive (SRA) [12]. The new markers were also screened across a hundred of non-French B. anthracis DNAs of diverse origins (described by 10 canSNP typing, unpublished data) to confirm their specificity. The eight assays accurately separated the French strains from the remaining globally diverse genotypes (data not shown).

Discussion
Comparative whole-genome sequencing offers a powerful way for in-depth characterization of any bacterial pathogen. It also provides an unbiased approach for informative SNPs discovery [13]. In this study, an ultimate picture of the genetic diversity found within the B. anthracis population in France has been established. An extensive NGS dataset of 122 autochthonous strains have been created that allowed the phylogenetic linkage of geographically diverse isolates and the identification of novel SNPs signatures useful to rapidly determine the geographic origin of any strain ecologically established in France. Within the B.Br. CNEVA lineage, which is the most successful group in frequency found in France, a clustering concordant with the geographical origin of each strain was highlighted. This phylogeography pattern is consistent with a single introduction of this lineage in the country, followed by dispersal of the ancestral population toward West and South, with progressively derived local populations of B. anthracis strains. Ecological establishment and in situ differentiation within three regions with environmental conditions particularly favorable  . It is worthwhile to mention that A strains are adapted to more diverse environments than B strains, which are restricted to more narrow environmental conditions [14]. This trend is reflected on a global scale. A strains have experienced a global dissemination and population explosion, while both B lineages are uncommon in much of the world and appear to be only successful locally or regionally [3]. In contrast to the B.Br.CNEVA lineage, French strains affiliated to both A.Br.001/002 and A.Br.011/009 groups appeared to be more disparate and less related to one another. Several distinct introductions and subsequent in situ differentiation of both canSNP groups in France could therefore not be excluded. One of the more remarkable findings from the whole genome SNP analysis was the discovery of a complex pattern of six phylogenetic sub-branches (composed of up to fifteen strains) within the A.Br.011/009 isolates collected in France. The TEA group (A.Br.008/009) is a cluster that predominates throughout Europe, the Middle East and the Western most Chinese province of Xinjiang [3,15]. The identification of six sub-branches, with some geographic clustering (canSNPs BA21 and BA22), indicates an extensive history for the particular A.Br.011/009 TEA sub-group in France that could be linked to an expansion into many novel environments throughout the country or a combination of possibly repeated introduction and infections. Similar pattern of loosely related strains (≈100 SNPs) was also observed for the A.Br.001/002 canSNP group, except for strains isolated from the Doubs department (canSNP BA20). All but one of these Doubs specimens were isolated from a single episode (involving 17 clustered outbreaks) associated with the death of 39 animals in 2008 [16]. Only a short evolutionary time period, i.e. very few SNPs, separate all Doubs isolates, reflecting recurring outbreaks in this specific area. Further comparisons with European strains affiliated to the same canSNP groups are, however, need for drawing sound conclusions on the evolution and natural history of both A.Br.011/009 and A.Br.001/002 canSNP groups in France.
Although other relationships are topologically similar to MLVA predictions, some discrepancies exist between whole-genome SNP-and MLVA-based phylogenetic reconstructions (data not shown). There result from homoplasy coupled with weaker character support in the VNTR system. SNPs are more stable and definitive markers than VNTRs. In the MLVA-31 minimum spanning trees, for example, the French B.Br.CNEVA strains are grouped into three clonal complexes (CCs) and a few singletons unrelated to others strains from close geographical origins (Thierry et al., unpublished data). Most of the strains from The Alps, The Massif Central and part of the Saône-et-Loire samples were clustered in a single, large clonal complex (CC1). Strains from the Pyrenees formed a distinct clade (CC2), as well as half of the Saône-et-Loire strains (CC3). The Saône-et-Loire isolates were erroneously separated within two different clusters based on difference in up to four VNTR loci between CC1 and CC3. The discriminative power of the 31 VNTR markers is also more limited to resolve strains from a specific region or outbreak, as illustrated by the A.Br.001/002 strains collected in the Doubs 2008 and 2011. While MLVA31 differentiated five genotypes, nine distinct SNPs profiles could be resolved by whole-genome SNPs scanning. Furthermore, the latter clearly demonstrated that all Doubs strains are genetically related and differs by only one to two SNPs throughout the whole genomes. SNP-based phylogenetic reconstruction of B. anthracis population structure is more reliable. Whole genome comparison and canSNPs genotyping represent the future of molecular epidemiology analysis.

Conclusions
In this study, an ultimate picture of the genetic diversity found within the B. anthracis population in France has been established. The knowledge derived from this work is extremely useful for future outbreak investigations, not only due to the specific genomic information that facilitates detailed comparisons between different bacterial isolates, but also because the data can be used to add unprecedented resolving power to current molecular typing tools. Using HRM, eight novel canSNP assays that narrowly defined the three B. anthracis genetic sublineages found in France were developed. As a consequence, whether an index strain is representative of ecologically established strains from France or have been imported from other areas can be rapidly determined. Short reads data sets were exported on the FastQ format and mapped to the Ames Ancestor genome and both pXO1 and pXO2 plasmidic sequences using BioNumerics version 6.6 (Applied Maths, Belgium) and Power assembler module asking for a similarity of at least 90%. A set of SNPs was deduced for each genome sequence data using Bio-Numerics Chromosome Comparisons module. Individual lists were compiled by using an in-house Python script, and data filtered to remove SNP positions at which one or more isolate displayed an ambiguous residue call or missing data. Ribosomal operons and VNTR loci were also excluded from the analysis, so as contiguous SNPs (in a 10pb window). The list of canSNP positions is provided in Table 1.

Bacterial strains and biosafety procedures
The draft genome sequences of one representative

Whole genome phylogenetic analysis
A minimum spanning tree was drawn in BioNumerics by using the filtered whole genome sequencing SNP data as input. The tree was rooted by using the B. cereus AH820 strain as outgroup. B. cereus AH820 resides very close to the B. anthracis cluster using multilocus sequence typing [18]. Nodes were numbered by BioNumerics. The canSNPs along the three branches leading to the French strains were identified from whole genome sequencing data by searching for SNPs with allelic states shared only by these different subgroups.

SNP discrimination assays by HRM
We designed High Resolution Melting (HRM) assays for eight French-specific SNPs using Primer 3 + software [19]. The positions of these SNPs in the Ames Ancestor genome [GenBank: AE017334.2] and the primers sequences used are listed in Table 1.
Amplification was performed on the ViiA7™ Real-Time PCR System (Life Technologies) using the LightCycler® 480 High Resolution Melting Master Mix (Roche Diagnostics). The reaction mixture consisted of 0.2 μM of each primer, 1 × LightCycler® 480 HRM master mix and 2.5 mM MgCl2 in a 10-μl final volume. The following parameters were used: 10 min at 95°C were followed by 40 cycles consisting of 10 s at 95°C, 10 s at 58°C and 20 s at 72°C. Samples were next heated to 95°C for 30 s, cooled down to 65°C for 1 min and heated from 65°C to 88°C at a rate of 1°C/s with 25 acquisitions/°C. HRM data were analyzed by the ViiA7™ Software (version 1.2.1).