Independent evolution of neurotoxin and flagellar genetic loci in proteolytic Clostridium botulinum

Background: Proteolytic Clostridium botulinum is the causative agent of botulism, a severe neuroparalytic illness. Given the severity of botulism, surprisingly little is known of the population structure, biology, phylogeny or evolution of C. botulinum. The recent determination of the genome sequence of C. botulinum has allowed comparative genomic indexing using a DNA microarray.

Results: Whole genome microarray analysis revealed that 63% of the coding sequences (CDSs) present in reference strain ATCC 3502 were common to all 61 widely-representative strains of proteolytic C. botulinum and the closely related C. sporogenes tested. This indicates a relatively stable genome. There was, however, evidence for recombination and genetic exchange, in particular within the neurotoxin gene and cluster (including transfer of neurotoxin genes to C. sporogenes), and the flagellar glycosylation island (FGI). These two loci appear to have evolved independently from each other, and from the remainder of the genetic complement. A number of strains were atypical; for example, while 10 out of 14 strains that formed type A1 toxin gave almost identical profiles in whole genome, neurotoxin cluster and FGI analyses, the other four strains showed divergent properties. Furthermore, a new neurotoxin sub-type (A5) has been discovered in strains from heroin-associated wound botulism cases. For the first time, differences in glycosylation profiles of the flagella could be linked to differences in the gene content of the FGI.

Conclusion: Proteolytic C. botulinum has a stable genome backbone containing specific regions of genetic heterogeneity. These include the neurotoxin gene cluster and the FGI, each having evolved independently of each other and the remainder of the genetic complement. Analysis of these genetic components provides a high degree of discrimination of strains of proteolytic C. botulinum, and is suitable for clinical and forensic investigations of botulism outbreaks.


Background
The species Clostridium botulinum consists of a group of four physiologically and phylogenetically distinct Gram-positive obligately anaerobic bacteria that share the common feature of producing the highly potent botulinum neurotoxin [1]. Organisms belonging to two of these groups are associated with the majority of cases of human botulism. C. botulinum Group I (proteolytic C. botulinum) is a mesophilic organism that is responsible for foodborne botulism, wound botulism, adult intestinal botulism and infant botulism. C. sporogenes is considered to be a non-toxigenic version of proteolytic C. botulinum [2]. C. botulinum Group II (non-proteolytic C. botulinum) is a psychrotrophic organism associated with most cases of foodborne botulism not attributed to Group I [3,4]. The botulinum neurotoxins are the most potent toxins known, with as little as 30-100 ng constituting a potentially fatal dose [5], and are considered to be a bioterrorism threat [6].
Seven major types of botulinum neurotoxin (types A to G), and a significant number of sub-types have been described. For example, four sub-types of type A toxin (termed A1, A2, A3, A4) have been identified [7][8][9]. Subtypes are defined as differing by at least 2.6% at the amino acid level [7,10]. Proteolytic C. botulinum strains form neurotoxin of types A, B, or F, and dual-toxin forming strains have also been described [2]. Additionally, some strains possess two neurotoxin genes, but only form one active neurotoxin. For example, A(B) strains possess a type A and type B neurotoxin gene, but only form type A neurotoxin. Non-proteolytic C. botulinum strains form a single neurotoxin of types B, E, or F. Each neurotoxin protein comprises a light chain and heavy chain. The light chains possess endopeptidase activity and cleave proteins in the SNARE complex leading to flaccid muscle paralysis, and potentially respiratory failure [11].
The neurotoxin genes are associated with other genes within the neurotoxin cluster, and two major cluster types are recognised. The most studied neurotoxin cluster in proteolytic C. botulinum is termed the ha plus/orf-X minus cluster. It is commonly associated with type A1 and type B neurotoxin genes [9,12,13], and is present in the genome of the sequenced type A1 strain ATCC 3502 used as a hybridisation reference in this work [14]. This cluster comprises genes for the neurotoxin (cntA), three haemagglutinins (HA) (cntC, cntD, cntE), non-toxic-non-haemagglutinin (NTNH) (cntB), and a positive regulatory protein (cntR). The second cluster type is called the ha minus/orf-X plus cluster. In the case of proteolytic C. botulinum, this cluster is most frequently associated with type A2, A3, A4 and F toxin genes, and the type A1 gene in A(B) strains [9,12,13]. This cluster includes genes for the neurotoxin, NTNH and CntR (historically also known as p21 [9,13]), lacks the three genes encoding HA, and additionally contains a group of three open reading frames (orf-X1, orf-X2, orf-X3) and a single CDS (coding sequence) (p47) all of unknown function.
The genome sequence of proteolytic C. botulinum strain ATCC 3502 (NCTC 13319, Hall 174) has been recently completed, and consists of a chromosome (3.9 Mbp) and plasmid (16.3 kbp), which contain 3,650 and 19 coding sequences (CDSs), respectively [14]. A DNA microarray was designed based on this sequence, and initial tests revealed that two prophages and a plasmid present in the genome of strain ATCC 3502 were absent from 11 test strains of proteolytic C. botulinum and C. sporogenes, and that the DNA microarray could be used to discriminate between strains of proteolytic C. botulinum [14]. The 11 test strains shared a minimum of 84% of the CDSs of ATCC 3502, but were significantly diverged from other sequenced clostridial species, demonstrating the wide phylogenetic distance between different clostridia [14].
The flagellar glycosylation island (FGI) also showed evidence of diversity between strains of proteolytic C. botulinum [14]. The ATCC 3502 genome contains a large putative FGI comprising CDSs CBO2666-2729. These are flanked by the CDSs for flagellar structural proteins FlgB (CBO2665), FliD and the flagellin structural subunits FlaA1 (CBO2730) and FlaA2 (CBO2731). The FGI can be divided into two distinct regions [14]. CBO2678-2689 are CDSs similar to those involved in capsular polysaccharide biosynthesis in Group B Streptococcus (designated FGI-I, flanked by putative flagellin structural genes CBO2666 and CBO2695), whereas CBO2696-2729 represent CDSs with sequence similarity to those involved in the modification of Campylobacter jejuni flagellins with nonulosonic acids (designated FGI-II, CBO2696-CBO2729) [14].
In order to extend our understanding of phylogenetic relationships and the biology of proteolytic C. botulinum, an extensive comparative genomic indexing study has been carried out involving 58 strains of proteolytic C. botulinum, 2 strains of non-proteolytic C. botulinum, and 3 strains of C. sporogenes using DNA microarrays based on the genome sequence of strain ATCC 3502. We have assessed the evolution of the neurotoxin gene and cluster and flagellar glycosylation island (FGI) in relation to the remainder of the genetic complement. We have also identified important links between CDSs contained within the FGI and sugars associated with post-translational modification of flagella, and discovered a new neurotoxin A subtype associated with UK wound botulism cases.

Methods
Bacterial strains and preparation of DNA
C. botulinum and C. sporogenes strains used in this work, together with the type of neurotoxin formed, their origin, source and date of isolation are listed in Table 1. Before use, all strains were checked for purity (consistent colony morphology) and lack of contamination by growth on PYGS plates under both aerobic and anaerobic atmospheres [15]. Proteolytic activity was determined by growth on Reinforced Clostridial Medium (RCM) containing 5% (w/v) skim milk [16] and lipase activity on McClung Toabe egg yolk medium [17]. Strains were also checked for presence of type A, B and F neurotoxin genes by PCR using 100 ng genomic DNA as template with primer pairs NKB-1 (5'-GATACATTTACAAATCCTGAAGGAGA-3') and NKB-5 (5'-AACCGTTTAACACCATAAGGGATCATAGAA-3') which generate a 2278 bp PCR product for the type A neurotoxin gene; B-1A (5'-GATGGAACCATTTGCTAG-3') and B2-D (5'-AACATCAATACATATTCCTGG-3') which generate a 1284 bp PCR product for the type B neurotoxin gene [18]; and BONTFF2 (5'-GTGCTTATTATGATCCTAATTATTTAACC-3') and BONTFR2 (5'-CCATACTTCCATTGAAAATAATCTTTATA-3') which, using the same reaction conditions, give a 765 bp PCR product for the type F neurotoxin gene (data not shown). The type(s) of neurotoxins formed by each strain was established by sero-neutralisation and the mouse bioassay [19,20].
Genomic DNA was purified from exponentially growing cells, digested with Sau3A1 and labelled with fluorescent nucleotides as previously described [14] except that Cy5- or Cy3-dUTP (GE Healthcare, UK) was substituted for Cy5- or Cy3-dCTP. The isolation of plasmid DNA followed the method outlined by O'Sullivan and Klaenhammer [21]. For restriction enzyme analysis, the manufacturer's protocols (New England BioLabs, USA) were followed with the addition of spermidine to a final concentration of 4 mM. Digests were analyzed by standard gel electrophoresis in 1.5% agarose.

Microarray hybridisation and data analysis
Each experiment combined 2 μg Cy5-dUTP-labelled ATCC 3502 (reference) DNA and 2 μg Cy3-dUTP-labelled test DNA, and was performed on a minimum of four probe set replicates as described previously [14]. DNA microarrays were scanned using an Axon GenePix 4000B microarray laser scanner (Axon Instruments, CA, USA). The data from detected features were initially processed using the GenePix Pro v.6.0 software supplied with the scanner.
The R package arrayMagic v.1.14.0 [25] was used to assess the quality of the hybridisations by generating a diagnostic plot showing the pairwise similarities between all hybridisations. The pairwise similarity score ($S_{ab}$) was calculated by arrayMagic via $S_{ab} = \mathrm{MAD}_i(X_{ia} - X_{ib})$, where for each pair of arrays (a and b) $X_{ia}$ is the log-ratio of the i-th probe on the a-th array, and the MAD (median absolute deviation) is taken over all CDSs. The hierarchical clustering diagram generated used the similarity scores as a measure of the 'distance' between arrays. In this way the fidelity of the microarray technical replicates could be assessed (arrayMagic's R script, experiment description file and diagnostic plot are available on request). The data for replicates that did not group together were discarded and the hybridisation experiments repeated with a fresh preparation of genomic DNA. Array data were further analysed using the GeneSpring GX package (Agilent Technologies) using Lowess normalisation. In order to correct for uneven printing or for probes which routinely gave a high or low signal, data were further normalised by using as a control hybridisation data from ATCC 3502 × ATCC 3502 dye-swap experiments (four microarrays) on a per CDS basis.
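The pairwise similarity score defined above can be illustrated with a minimal Python sketch. It is not the arrayMagic implementation: the log-ratio values are invented, the function names are hypothetical, and the unscaled MAD is an assumption (some packages multiply the MAD by a consistency constant).

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median of |v - median(values)| (unscaled)."""
    values = list(values)
    centre = median(values)
    return median(abs(v - centre) for v in values)

def pairwise_similarity(log_ratios_a, log_ratios_b):
    """S_ab = MAD_i(X_ia - X_ib) for two arrays with probes in the same order."""
    if len(log_ratios_a) != len(log_ratios_b):
        raise ValueError("both arrays must cover the same probes")
    differences = [xa - xb for xa, xb in zip(log_ratios_a, log_ratios_b)]
    return mad(differences)

# Hypothetical log-ratios for five probes on three arrays.
replicate_1 = [0.10, -0.20, 0.05, -1.50, 0.00]
replicate_2 = [0.12, -0.18, 0.02, -1.45, 0.03]
other_strain = [0.90, -1.10, 0.60, 0.20, -0.70]

s_replicates = pairwise_similarity(replicate_1, replicate_2)
s_unrelated = pairwise_similarity(replicate_1, other_strain)
```

A small score means two hybridisations agree closely, so technical replicates are expected to give lower scores with each other than with arrays from a different strain.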
Pearson correlation coefficients were calculated for the normalised signal ratios associated with probes for all chromosomal and pBOT3502 CDSs and used to create a similarity matrix for all 61 strains of proteolytic C. botulinum and C. sporogenes. The similarity matrix was subjected to the average linkage clustering method using GeneSpring GX software.
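As a rough illustration of this step, the sketch below computes Pearson correlations between strain profiles and merges them by average linkage. It does not reproduce GeneSpring GX: the use of 1 - r as the distance is an assumption, and the strain names and signal ratios are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_linkage(profiles):
    """Agglomerative clustering with distance 1 - r (assumed), averaged over member pairs.

    Returns the merges in order as (cluster_1, cluster_2, distance) tuples.
    """
    names = list(profiles)
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for i, a in enumerate(names) for b in names[i + 1:]}

    def cluster_distance(c1, c2):
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters and merge them.
        d, c1, c2 = min(((cluster_distance(c1, c2), c1, c2)
                         for i, c1 in enumerate(clusters)
                         for c2 in clusters[i + 1:]),
                        key=lambda item: item[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, d))
    return merges

# Hypothetical normalised signal ratios for five probes.
profiles = {
    "strain_A": [1.0, 0.9, 0.2, 1.1, 0.3],
    "strain_B": [1.1, 1.0, 0.3, 1.0, 0.2],
    "strain_C": [0.2, 1.0, 1.1, 0.3, 0.9],
}
merges = average_linkage(profiles)
```

Cutting the resulting merge list at a chosen distance gives groups of strains, in the same spirit as the clades defined from the dendrogram.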
The data generated by probes for neurotoxin cluster genes not found in ATCC 3502 were processed separately as there was no competing reference DNA during hybridisation. Whereas a signal channel ratio of 0.55 was taken as the cut-off between presence and absence of hybridisation for chromosomal genes, a ratio greater than 5.0 was taken as a positive hybridisation for CDSs not in ATCC 3502. This gave results that agreed well with known genome sequences of C. botulinum in the GenBank database. Data for probes to the ATCC 3502 neurotoxin gene cluster itself used a cut-off point of 0.40 to compensate for the fact that all hybridisations had been performed using ATCC 3502 DNA as the reference material.
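The three cut-offs described above can be summarised in a short sketch. The probe-class labels are hypothetical names introduced for illustration, and treating a ratio exactly equal to 0.55 or 0.40 as present is an assumption, since the text only states that a ratio greater than 5.0 is positive for CDSs not in ATCC 3502.

```python
# Cut-off values quoted in the text; the dictionary keys are hypothetical labels.
CUTOFFS = {
    "chromosomal": 0.55,
    "atcc3502_neurotoxin_cluster": 0.40,
    "not_in_atcc3502": 5.0,
}

def is_present(ratio, probe_class):
    """Call a CDS present or absent from its test/reference signal channel ratio."""
    cutoff = CUTOFFS[probe_class]
    if probe_class == "not_in_atcc3502":
        # No competing reference DNA: only a ratio greater than 5.0 is positive.
        return ratio > cutoff
    # Assumption: a ratio at or above the cut-off counts as present.
    return ratio >= cutoff

# Hypothetical ratios for three probes.
calls = [
    is_present(0.80, "chromosomal"),
    is_present(0.45, "atcc3502_neurotoxin_cluster"),
    is_present(3.20, "not_in_atcc3502"),
]
```

The same ratio can therefore lead to different calls depending on the probe class, which is why the neurotoxin cluster probes were processed separately.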

Validation of microarray
The microarray data were validated for biological significance using CDSs within the clostridial flagellar glycosylation island (FGI) and plasmid pBOT3502. The DNA sequence of 28 CDSs from the FGI-I (Figures 1 and 2) of ATCC 3502 and proteolytic C. botulinum type F strain Langeland, matched by annotation using genomic context and BlastP, was compared to the signals observed by microarray analysis. The highest sequence similarity between two homologous CDSs (CBO2682 and CLI_2747) corresponding to an absence of microarray hybridisation was 84.8%. The lowest sequence similarity between two homologous CDSs (CBO2683 and CLI_2748) that hybridised to the microarray was 85.7%, giving a minimum value of approximately 85% sequence identity between CDSs for a positive microarray hybridisation result. A similar percentage was previously reported for studies with Candida [26] and Helicobacter pylori [27].
Further validation was carried out by analysis of data for the 19 probes to the plasmid pBOT3502. Using a cut-off value of 0.30 (because of very high signals), only one strain, F9801A, gave a positive microarray signal for all 19 CDSs (Figure 3). Subsequent tests demonstrated that this strain, but not two others that were tested, contained a plasmid that shared identical restriction sites with pBOT3502 (Figure 4). Additionally, pBOT3502 contains five CDSs (CBOP15-CBOP19) that are dedicated to the synthesis and secretion of the bacteriocin, boticin [14]. However, apart from F9801A, no other strain gave a microarray signal for these probes, including C. botulinum strain 213B. This strain carries a plasmid bearing the genes for boticin B [28], so might have been expected to give a positive signal. However, alignment of the 1 kb sequence from strain 213B with that of pBOT3502 showed that sequence identity over this region, spanning pBOT3502 CDSs CBOP16 and CBOP17, was only 52.1%, which would fail to give a positive microarray signal.

Isolation and mass spectrometry analysis of flagellin proteins
Flagellin proteins were isolated [29] and mass spectrometry studies of intact flagellin proteins were carried out as described in earlier studies [30,31]. In some cases a large precipitate was observed in dialysed flagellin preparations. Protein isolates were evaporated to dryness in a Savant SpeedVac (Thermo Fisher Scientific UK) before resuspending in 5-10 μl of formic acid. The sample was agitated gently to dissolve protein and diluted 10-fold with hexafluoroisopropanol. Samples were infused into a hybrid quadrupole time-of-flight mass spectrometer (Micromass Q-TOF2, Waters Corporation, MA, USA) at a flow rate of 0.5-1.0 μl/min [30,31]. Top down mass spectrometry experiments were performed as described by [30], using argon collision gas with collision energies ranging from 20-30 V.

Sequencing of sub-type A5 neurotoxin genes carried by wound botulism strains
To lower the risk of PCR-based errors, the A5 genes were sequenced using non-cloned PCR products. Initial 3.8 kb PCR products of the majority of the gene CDS were generated using a LongRange PCR kit (Qiagen, UK), with primers BONTAF1 (5'-GCAACCAGTAAAAGCTTTTAAAATTC-3'), BONTAR1 (5'-CCATCCATCATCTACAGGAATAAA-3') and 100 ng genomic DNA as template. PCR products were purified using DyeEx 2.0 spin columns (Qiagen). Sequencing was carried out using an AbiPrism 3730 capillary sequencer. Sequence of the entire 3.8 kb PCR product was achieved by designing primers using available sequence data and by 'walking' forward on both strands. Comparison of the 3.8 kb sequence with published examples of C. botulinum neurotoxin genes showed that the A5 neurotoxin sub-type was a close relative of the A1 subtype, which implied a similar neurotoxin locus structure. Therefore to amplify DNA containing the 5' and 3' ends of the A5 neurotoxin genes a series of PCR primers were designed that would recognise the cntB gene (encoding NTNH) and the transposase that flank the A1 neurotoxin gene of ATCC 3502 (CBO0805 and CBO0807 respectively). PCR was performed using 'outward facing' primers recognising the 3.8 kb sequence combined with these two sets of primers. PCR products that were of the expected size were sequenced. All sequencing fragments were assembled using the ContigExpress programme of the Vector NTI Advance 10 software package (Invitrogen). Comparison of the completed A5 neurotoxin gene sequence with that of published examples of other C. botulinum neurotoxin genes together with phylogenetic tree construction was carried out using the AlignX programme of this package. The A5 neurotoxin gene of four strains associated with UK cases of wound botulism was sequenced and found to be identical.

Accession Numbers
A representative of the sub-type A5 neurotoxin gene sequence from wound botulism strain H0 4402 065 was deposited in GenBank (accession number EU679004). Microarray data have been deposited with Array Express (accession number E-MEXP-1637).
Figure 1 Whole genome analysis of 61 strains of proteolytic C. botulinum and C. sporogenes. Each row of the heatmap represents a strain (indicated at right), and its branch on the dendrogram is coloured according to type of neurotoxin formed (indicated at left of heatmap; spo refers to C. sporogenes). Although lost at this resolution, each microarray probe is represented by a vertical column within this row, from left to right first the 19 probes for each CDS of ATCC 3502 plasmid pBOT3502, followed by probes for chromosomal CDSs, from CBO3648 to CBO001. The colour of each column in the heatmap is an indicator of test signal over reference (ATCC 3502) signal channel ratio. Yellow columns represent probes which hybridised to both test and reference isolates equally, those in blue hybridised more strongly to the reference strain, and those in red hybridised more strongly to the test strain. Microarray features with fluorescent signals lower than 100 units (background noise), plus those CDSs not represented on the microarray are coloured grey. Distance measurements between 0 and 1.0 are indicated in the non-linear scale underneath the dendrogram. Clades 1 to 9 (brackets at right) are groups of strains which cluster at a distance measurement value of 0.3. The four main regions of variability (clusters of blue-coloured columns) are CDSs associated with pBOT3502, the Flagellar Glycosylation Island (FGI), and the two prophages, Φ-CB1 and Φ-CB2 (indicated above heatmap).

Whole genome analysis
The 61 strains of proteolytic C. botulinum and C. sporogenes tested in the present study were selected to represent diverse origins. They had originally been isolated at different times over a period of more than 80 years, from the environment (17 strains (including unknowns)) or associated with various forms of botulism (foodborne (20 strains), infant (17 strains), and wound (4 strains)). The strains were of different toxin types: type A toxin gene (17 strains), type B toxin gene (16 strains), type F toxin gene (3 strains), dual toxin genes (22 strains), and no toxin gene (3 strains of C. sporogenes) (Table 1). The CDS content of the 61 test strains was indexed in relation to the genome of proteolytic C. botulinum strain ATCC 3502 (Figure 1). The strains share a high degree of genetic relatedness (e.g. clustering distance or branch-lengths in the dendrogram were short with a high proportion of shared CDSs in the heatmap (coloured yellow)). Most major branch points in the dendrogram occurred at distance measurements of between 0.20 and 0.44. A distance measurement value of 0.30 separated the 61 strains of proteolytic C. botulinum and C. sporogenes into nine clades (excluding ATCC 3502) (Figure 1). The strains did not group together according to the location, environment, time of isolation, or the type of botulism with which they were associated (Figure 1). This lack of grouping probably reflects the wide range of sources of the strains, and has been reported previously by workers using other typing methods [8,32]. The predominance of yellow shading in the heatmap indicates that the 11 strains in clades 7 and 8 (Figure 1) were most closely related to the reference strain (ATCC 3502). For example, they shared the same FGI. While nine of fourteen type A1 neurotoxin strains (as ATCC 3502) were present in these two clades, a type B and type F strain were also present.
Indeed, most clades contained strains of more than one toxin type (or sub-type), and most toxin types (or subtypes) were represented in more than one clade, suggesting that the evolution of the neurotoxin genes has not paralleled that of the remainder of the genetic complement. For example, clade 3 contained nine type B strains, one type A1 strain, two type A2 strains, two type F and four type A5(B) strains (the novel type A5(B) strains are described below), and clade 9 contained two type B strains and three C. sporogenes strains (Figure 1), confirming the close relationship between proteolytic C. botulinum and C. sporogenes (e.g. [2,8,14]). Two clades, however, contained strains of just one toxin type. Clade 4 contained eight closely-related North American-isolated type A1(B) strains, and clade 7 comprised seven closely-related type A1 strains (Figure 1). Clade 5 contained a single strain (NCTC 2012, Loch Maree) that forms type A3 toxin. Interestingly, other genomic indexing methods (MLST, AFLP, VNTR) also found this strain to be unique and well separated from other strains of proteolytic C. botulinum [8,33,34]. In addition to the nine clades identified, further sub-groupings were identified within each clade, often of the same toxin type (or sub-type). Some strains appeared highly similar to each other when compared to the genome of ATCC 3502. These strains included the three type Bf strains isolated from two patients and food following a foodborne botulism outbreak in Quebec that grouped closely together within clade 6, and the four strains associated with wound botulism in the UK within clade 3 (Figure 1). Most of the differences in microarray data between the three type Bf strains were distributed around the signal channel ratio cut-off point of 0.55, suggesting that these apparent differences may reflect background noise associated with this type of analysis.
[Figure: Plasmid pBOT3502 of strain ATCC 3502 shares CDSs with other strains of proteolytic C. botulinum]
Indeed, it is likely that these three Bf strains are identical, as they were isolated from a pâté and clinical samples from the same outbreak. On the other hand, the wound botulism strains showed some clear differences in their genetic content. Other genomic indexing methods (e.g. MLST, PFGE, AFLP, VNTR) have given a broadly similar pattern to that found in the present study, with groups of small numbers of closely-related strains generally of the same toxin type grouping together, with several distinct groups for each toxin type [8,32,33,34]. There are, however, a number of interesting anomalies that might be interpreted as evidence for horizontally acquired genetic information, and therefore worthy of further study, for example the type B strain 2345 that groups most closely with the C. sporogenes strains.
It is estimated that the core gene set for all 61 strains of proteolytic C. botulinum and C. sporogenes tested was 2155 CDSs (Figure 5). This is approximately 63% of the CDSs of ATCC 3502 represented by probes on the microarray, and considerably higher than the value of 20% previously reported for 75 strains of C. difficile [35]. This further confirms the close relationship of proteolytic C. botulinum and C. sporogenes and indicates that exchange of genetic information with other species has occurred less frequently than in C. difficile. Apart from the neurotoxin gene cluster itself, which although significant in terms of biological impact, represents a very small part of the genome, four main areas of divergence were identified: the plasmid (pBOT3502), the flagellar glycosylation island (FGI) and the two prophages (Figure 1). Together these account for approximately 4. [...] [14], but the two strains included in this previous study are now revealed to be very close relatives. Indeed, it was estimated that the core gene set for the ten closely-related type A1 strains in clades 7 and 8 was 3055 CDSs, equating to 89% of the CDSs of ATCC 3502 (Figure 5).
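The core gene set is the set of CDSs scored as present in every strain, which can be sketched as a set intersection. The strain and CDS names below are invented placeholders, and the final two lines only check that the two reported percentages imply a similar number of CDSs represented on the microarray; because the percentages are rounded, the implied totals are approximate.

```python
def core_gene_set(present_cds_by_strain):
    """Return the CDSs scored as present in every strain (non-empty input assumed)."""
    cds_sets = [set(cds) for cds in present_cds_by_strain.values()]
    return set.intersection(*cds_sets)

# Hypothetical presence calls for four CDSs in three strains.
present = {
    "strain_1": {"CDS_a", "CDS_b", "CDS_c", "CDS_d"},
    "strain_2": {"CDS_a", "CDS_b", "CDS_d"},
    "strain_3": {"CDS_a", "CDS_c", "CDS_d"},
}
core = core_gene_set(present)
core_fraction = len(core) / 4  # fraction of the four reference CDSs in the core

# Consistency check on the reported figures (rounded percentages from the text).
implied_total_all_strains = 2155 / 0.63   # core set of all 61 strains
implied_total_type_a1 = 3055 / 0.89       # core set of the ten type A1 strains
```

Both implied totals fall in the range of roughly 3,400 to 3,450 CDSs, so the two percentages are mutually consistent.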
Additionally, two strains of non-proteolytic C. botulinum type E were tested, but too many CDSs were either absent or highly diverged for meaningful data to be derived (data not shown). It was previously reported that a strain of non-proteolytic C. botulinum type B and a strain of C. difficile were also too divergent to give a meaningful response on this microarray [14]. The poor hybridisation of DNA from the three strains of non-proteolytic C. botulinum to the microarray reflects the wide evolutionary and phylogenetic distance between proteolytic C. botulinum and non-proteolytic C. botulinum. This is a direct result of the species "C. botulinum" being defined not on the basis of a close evolutionary or phylogenetic relationship, but on the basis of the disease caused [3].

Neurotoxin cluster arrangement -Single toxin gene strains
The type A1 neurotoxin gene is normally present in the ha plus/orf-X minus type cluster, while the ha minus/orf-X plus cluster is more commonly associated with type A2, A3, A4 and F neurotoxin genes [9,36,37]. Twelve of the fourteen type A1 neurotoxin strains tested contained the ha plus/orf-X minus cluster, but in two strains (F9604A and MUL0109ASA) the type A1 neurotoxin gene appears to be in a ha minus/orf-X plus cluster (Figure 6). This arrangement has also been recently reported for a small number of other type A1 strains [18,38]. In the present study, the genes (p47, orf-X1, orf-X2, orf-X3 and lycA) that are only present in the ha minus/orf-X plus cluster were always present together (26 strains), with no strain possessing only part of this cluster. The neurotoxin gene of the two type A2 strains (NCTC 9837 and ZK3) and one type A3 strain (NCTC 2012, Loch Maree) was also present in a ha minus/orf-X plus cluster (Figure 6), as expected [9,13,18,37]. Although the two type A1 ha minus/orf-X plus strains (F9604A and MUL0109ASA) had the same neurotoxin cluster as the type A2 and A3 neurotoxin-forming strains (Figure 6), they were in different clades well separated from each other and from the other type A1 neurotoxin-forming strains (Figure 1). Instead these two type A1 ha minus/orf-X plus strains grouped with a type Ba strain (CDC 657) (Figure 1, clade 6). The type A neurotoxin gene in strain CDC 657 (type A4) is also in a ha minus/orf-X plus cluster [9]. Since previous studies using AFLP, MLST [...]
The 16 type B strains gave an almost identical hybridisation pattern, with all neurotoxin genes present in a ha plus/orf-X minus cluster (Figure 6). This is consistent with previous reports [9,12]. Strain MRB had a weak signal for cntB (encoding NTNH); this may reflect the mosaic structure of cntB, and a previous genetic crossover event between two types of neurotoxin gene cluster [2,9,40].
The three type F strains gave a ha minus/orf-X plus pattern ( Figure 6), an arrangement consistent with that reported in the genome sequence for strain Langeland (CDSs CLI_0845 to CLI_0850). However, while strains Langeland and Walls 8G grouped together in the whole genome analysis (Figure 1, clade 3), strain H461297F grouped with type A1 strains, providing further evidence that the neurotoxin gene clusters are not evolutionarily tied to their host organism [8,9,13].
The genes cntR/A1 and cntR/F (sometimes called p21) encode closely related sigma 70 factors involved in regulation of the neurotoxin genes [9,13]. The probe designed to be specific for cntR/A1 gave a positive result with all strains that possessed a neurotoxin gene in a ha plus/orf-X minus cluster (Figure 6). Similarly the probe designed to be specific for cntR/F gave a positive result with all strains that possessed a neurotoxin gene in a ha minus/orf-X plus cluster (Figure 6). The type of neurotoxin regulatory gene (cntR) present, therefore, is entirely in accordance with the type of neurotoxin gene cluster, but not with the type of neurotoxin gene.
Figure 5 Core set of CDSs of proteolytic C. botulinum/C. sporogenes
[Figure: Summary of microarray data for 16 neurotoxin gene cluster probes]

Neurotoxin cluster arrangement -Dual toxin gene strains
Twenty-two strains tested in the present study possessed two distinct neurotoxin genes. Fourteen of the dual gene toxin strains possessed a type A1 and type B neurotoxin gene. Two of these strains (CDC 657 and CDC 588) form both neurotoxins, albeit in different proportions, while the other 12 strains appear to form only type A neurotoxin (Figure 6). All 14 strains gave an identical response in that they possessed complete ha plus/orf-X minus and ha minus/orf-X plus clusters. The microarray data cannot distinguish between dual toxin gene strains which carry a type A1 toxin gene in a ha minus/orf-X plus cluster and a type B gene in a ha plus/orf-X minus cluster, and strains with the reverse arrangement. However, as all type B neurotoxin genes have been associated with a ha plus/orf-X minus cluster (Figure 6; [9,12]), the simplest explanation is that the dual toxin gene strains are in the former arrangement. This has been reported in strains NCTC 2916 and CDC 657 [9,36,41] and from a preliminary analysis of strains NCTC 11199, MDa10, 667 and CDC 588 [18,41]. The four strains that formed both type B and type F toxin showed a similar hybridisation profile to the A1(B) strains except that they possessed a type F toxin gene rather than a type A toxin gene. Again both the full ha plus/orf-X minus and ha minus/orf-X plus clusters are present (Figure 6). It is likely that these strains possess a type F gene in a ha minus/orf-X plus cluster, plus a type B gene in a ha plus/orf-X minus cluster. This hypothesis is supported by three observations: (i) this is the pattern found in strains forming either type B or type F toxin; (ii) such an arrangement was indicated by a preliminary analysis of strain CDC 3281 [42]; and (iii) it was reported for a recently sequenced unnamed Bf strain [GenBank: NZ_ABDP00000000].

Identification of a new toxin sub-type
The present study included four strains of proteolytic C. botulinum (H0 4244 0055, H0 4402 065, H04464 107, H0 4068 0341) that formed type A neurotoxin, and had been isolated from patients presenting with wound botulism in different regions of the UK in 2004. Whole genome analysis revealed that these strains formed a sub-group within clade 3, distinct from other type A strains. Since the majority of strains forming type A neurotoxin clustered together within clades 4 or 7, this suggested the possibility that they might represent an evolutionarily distinct group which could be sufficiently diverged to also produce a novel neurotoxin sub-type (Figure 1). From the DNA sequence of the entire cntA coding region, a translation product of 1297 amino acid residues, corresponding to a type A neurotoxin, could be predicted. The cntA/A gene sequences from all four strains were identical, suggesting that these strains may derive from a common source. Comparison with published examples of neurotoxin A sub-types revealed that the wound botulism-derived cntA/A genes were distinct from toxin sub-types A1-A4 (Figures 7, 8 and 9; Table 2). Sub-types of cntA are defined by a minimum of 2.6% difference between amino acid sequences [7,10]. The closest relative of the wound botulism-derived cntA/A gene is the cntA/A1 gene (Table 2), and the new DNA sequence predicts a 2.9% difference (37 amino acid residues) between the wound botulism-derived cntA/A genes and the cntA/A1 genes, the latter tending to share approximately 99.8% identity between themselves (see Figures 7 and 8 for an alignment of amino acid sequence of all five sub-types). On this basis these wound botulism-derived cntA/A genes define a new sub-type, and should be termed cntA/A5. Furthermore, the four type A5(B) strains represented the only 'non-A1' neurotoxin-forming strains that possessed a type A neurotoxin gene in a ha plus/orf-X minus cluster (Figure 6). Interestingly, all four type A5 strains gave a positive signal with the C-terminal type B probe. Following a combination of DNA sequencing and PCR analysis, the presence of a near-complete type B neurotoxin gene with the 5' end (i.e. N-terminus of protein) either absent or diverged from previously published examples was detected (data not shown). As such, these strains also represent the first published examples of type A(B) strains that lack the ha minus/orf-X plus cluster for neurotoxin genes. Since active type B toxin was not detected in the mouse test, they are designated as type A5(B). Wound botulism cases in the UK are associated with heroin abuse [43], and it is likely that the source of these strains of proteolytic C. botulinum is the same as the heroin, which comes from Afghanistan [44]. This may indicate that previously unknown botulinum neurotoxin types are present in this part of Asia; the majority of published botulinum neurotoxin gene sequences are from strains originating in Europe or North America.
[Figure: Amino acid sequence alignment of proteolytic C. botulinum type A neurotoxin subtypes (part 1)]
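The sub-type assignment rests on simple arithmetic: 37 differing residues out of 1297 against a 2.6% threshold. A minimal sketch of that check is given below; the function names are illustrative only, and the calculation assumes the difference is expressed over the full 1297-residue length quoted in the text.

```python
SUBTYPE_THRESHOLD_PCT = 2.6  # minimum amino acid difference quoted in the text


def percent_difference(n_differences, length):
    """Percentage of residues that differ between two aligned sequences."""
    return 100.0 * n_differences / length


def is_new_subtype(n_differences, length, threshold=SUBTYPE_THRESHOLD_PCT):
    """True when the difference meets or exceeds the sub-type threshold."""
    return percent_difference(n_differences, length) >= threshold


# Values from the text: 37 differing residues in a 1297-residue product.
a5_vs_a1 = percent_difference(37, 1297)
print(f"{a5_vs_a1:.1f}% difference; new sub-type: {is_new_subtype(37, 1297)}")
# prints "2.9% difference; new sub-type: True"
```

By the same calculation, the approximately 99.8% identity among cntA/A1 genes corresponds to only two or three differing residues, well below the threshold.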
The amino acid residue differences that distinguish the A5 sub-type from the four other type A sub-types are scattered throughout its length (Figures 7 and 8; [7-9]). The N-terminal eight amino acid residues are involved in binding to the neuronal cell plasma membrane [45]. Significantly, the A5 neurotoxin has a leucine at position 2, in contrast to the usual proline, an amino acid known to cause marked conformational changes in peptide secondary structure. The C terminus of the light chain, especially residues 398-448, is important for solubility, stability and catalysis [46], but only one residue (E444) close to the protease nicking site differs in this region. Similarly, of the heavy chain residues that are proposed to build the lactose and sialyllactose-binding pockets needed for ganglioside binding [47], only L1278 has been changed (to an F). It is tempting to predict from this in silico study that the gene product of cntA/A5 will share a similar toxicity to that of cntA/A1, although the fact that at least three residues known to be functionally important are different may have important implications.

Relationship to C. sporogenes
The type B toxin-producing strain 2345 had a weak signal for all four neurotoxin-associated probes (Figure 6), and also groups together with the three C. sporogenes strains in the whole genome analysis (Figure 1). These observations support a hypothesis that strain 2345 may represent a strain of C. sporogenes that has acquired part of the neurotoxin gene cluster (or a diverged but intact cluster) in the recent evolutionary past. Interestingly, two of the three C. sporogenes strains, which were expected to be completely negative for all neurotoxin-associated probes, gave a weak signal to the cntR/A1 probe (Figure 6). This could be due to the presence of a distantly related or partial cntR gene, implying that these C. sporogenes strains may either be descendants of C. botulinum that have lost most of their ha plus/orf-X minus neurotoxin gene cluster, or may have acquired a neurotoxin gene cluster by horizontal gene transfer (as postulated for strain 2345), but then subsequently lost most of it. BLAST searches were made against the predicted peptides of the (unfinished) genome sequence of C. sporogenes strain ATCC 15579, which is not one of the three strains used in this work. Of the CntR sequences compared, from several proteolytic C. botulinum strains and one non-proteolytic C. botulinum strain (Eklund 17B), that of ATCC 3502 gave the highest percentage identity (48%) over the longest unbroken stretch of peptide sequence. Whereas this tends to support the microarray data, the stretch of sequence was only 27 amino acid residues in length, so genome sequence analysis of more C. sporogenes strains would be needed to further investigate this interesting observation.
[Figure: Amino acid sequence alignment of proteolytic C. botulinum type A neurotoxin subtypes (part 2)]
Figure 9: Relatedness of C. botulinum type A neurotoxins. The dendrogram was generated with the AlignX (ClustalW) programme of the Vector NTI Advance 10 (Invitrogen) software package, using data presented in Table 2.

Evolution of neurotoxin genes in relationship to the genome
It is evident that in strains of proteolytic C. botulinum, the distribution of neurotoxin genes and neurotoxin cluster type are not consistent with the whole genome analysis (Figure 6, Figure 1). This is consistent with previous reports using other genomic indexing methods (e.g. 16S rrn, PFGE, AFLP, MLST [2,8,32,33]). The evolutionary patterns of neurotoxin and associated genes within the neurotoxin cluster are also incompatible with one another, and are likely to have arisen from several distinct recombination events. For example, the present study has confirmed earlier reports [18,38] that in type A1 strains, the neurotoxin genes may be located in a ha plus/orf-X minus cluster or in a ha minus/orf-X plus cluster. It was also found that the cntR gene correlated with neurotoxin cluster type, rather than neurotoxin gene. Previous reports have identified that NTNH-encoding genes also correlated with neurotoxin cluster type rather than neurotoxin gene, and that the middle of the NTNH gene may be a hot-spot for recombination events within the neurotoxin cluster [2,9,13]. Putative insertion sequence (IS) elements located close to the neurotoxin cluster and the localisation of neurotoxin genes on large plasmids may also have played a role in mobilisation and gene transfer of neurotoxin and associated genes [9,39].

Flagellar glycosylation island (FGI)
Microarray analysis of the CDSs within the FGI separated strains into six divisions. Divisions 1 and 2 had similar profiles, with division 1 missing some CDSs contained within FGI-II. Divisions 3-6 were all missing large sections of FGI-II, with division 5 also missing CDSs within FGI-I (Figure 2). The structure of these divisions indicates that, as seen for the neurotoxin cluster genes, the evolution of the FGI may have occurred independently of the remainder of the genetic complement, with most divisions containing strains of more than one toxin type and from more than one clade. Divisions 1-3 each contained strains of at least four toxin types (or sub-types) that belonged to three or more clades, while division 4 contained only type B strains from clade 3, and division 5 comprised only the strains in clade 9 (type B and C. sporogenes strains). Division 6 contained just one strain, 17A. The genetic variation highlighted by these divisions (Figure 2) forms the basis for a typing method for proteolytic C. botulinum [29].
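To illustrate how presence/absence calls for FGI CDSs can separate strains into divisions, the sketch below groups strains whose call profiles are identical. This is a deliberately simplified stand-in: the study does not state that divisions were defined by exact profile matching, and the strain names, CDS positions and calls are all invented for illustration.

```python
from collections import defaultdict


def group_by_profile(profiles):
    """Group strains whose presence/absence calls are identical.

    profiles maps strain name -> tuple of 0/1 calls, one per FGI CDS.
    Groups are returned in order of first appearance.
    """
    groups = defaultdict(list)
    for strain, calls in profiles.items():
        groups[tuple(calls)].append(strain)
    return [sorted(members) for members in groups.values()]


# Hypothetical calls for five made-up strains across six made-up CDSs
# (first three in FGI-I, last three in FGI-II); not data from the study.
toy_profiles = {
    "strain_1": (1, 1, 1, 1, 1, 1),
    "strain_2": (1, 1, 1, 1, 1, 1),
    "strain_3": (1, 1, 1, 0, 0, 0),  # missing the FGI-II CDSs
    "strain_4": (1, 1, 1, 0, 0, 0),
    "strain_5": (0, 1, 1, 0, 0, 0),  # also missing one FGI-I CDS
}
toy_divisions = group_by_profile(toy_profiles)
```

With real microarray data, near-identical profiles would normally be merged by a clustering method rather than by exact matching, so this shows only the principle.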
The genetic repertoire of the FGI indicated by the microarray analysis suggested that the glycan biosynthetic capacity of these C. botulinum strains may vary (Figure 2). Indeed, it has been shown that strains differ in their FlaA glycan structure, and it is proposed that FGI-II is involved in this process [29,31]. Flagellin proteins were isolated from representative strains in each division to determine the nature of the glycan produced and its correlation with  [48]. Many of the flagellins of division 2 strains carried a glycan oxonium ion at m/z 301. The MS/MS spectrum of this ion also shared fragmentation ions with a di-N-acetylhexuronic acid, but with the increased mass likely to correspond to the addition of a third acetyl group (data not shown). A glycan oxonium ion at m/z 418 was detected as the FlaA modification on all division 3 strains examined. This FlaA glycan has been fully characterized in strain FE9909ACS as a novel legionaminic acid derivative, Leg5Ac7NMeGlu [31]. Strains 920A276 and CDC 15044 from division 4 possessed flagellins with glycan oxonium ions at m/z 417 and 431, which shared glycan oxonium ion fragmentation patterns typical of nonulosonic acid sugars (data not shown). The flagellin of the division 5 strain FE0507BLP had a glycan oxonium ion at m/z 317, which had the characteristic MS/MS fragmentation pattern of the nonulosonic acid sugars pseudaminic and legionaminic acid (data not shown). These sugars have been structurally characterised from the flagellins of Campylobacter jejuni [49], Campylobacter coli [50] and Helicobacter pylori [51]. Taken together, these observations show that differences in the FGI microarray profiles may be reflected in the mass of glycan oxonium ions that modify the flagellin. Interestingly, top down mass spectrometry analysis of FlaA from strain 17A did not produce any marker ions characteristic of glycan, although the mass of the protein is greater than could be predicted from its DNA sequence. This indicates that it too is probably post-translationally modified [29]. In this case a 'bottom up' mass spectrometry analysis of flagellar tryptic peptides may be required to identify the glycan moiety.
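The interpretation of the m/z 301 ion as a di-N-acetylhexuronic acid carrying a third acetyl group implies a fixed mass offset. The sketch below works through that arithmetic under two assumptions not stated in the text: the ion is singly charged, and nominal (integer) masses are adequate, with acetylation adding 42 Da (C2H2O, monoisotopic about 42.011 Da). The function name is illustrative.

```python
ACETYL_NOMINAL_MASS = 42  # C2H2O added per acetylation (monoisotopic ~42.011 Da)


def mz_without_acetyl(mz, n_acetyl=1):
    """Nominal m/z expected after removing n acetyl groups (singly charged ion)."""
    return mz - n_acetyl * ACETYL_NOMINAL_MASS


# Division 2 oxonium ion from the text, minus the proposed third acetyl group.
implied_parent_mz = mz_without_acetyl(301)
print(implied_parent_mz)  # prints 259
```

On these assumptions, the corresponding species without the third acetyl group would be expected near m/z 259.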
A representative of division 3, which appears to produce the novel legionaminic acid derivative, Leg5Ac7NMeGlu, is the type F strain Langeland, the genome of which has been recently sequenced. Comparison of FGI sequences of both Langeland and ATCC 3502 showed that CDSs of FGI-I shared at least 80% identity, while FGI-II was highly divergent and was 30 kb smaller in strain Langeland (Figure 2). Homologues to the biosynthetic genes for legionaminic acid synthesis in Campylobacter coli have been identified in the FGI-II region of the Langeland genome [31]. The definitive confirmation that CDSs in FGI-II are responsible for biosynthesis of the glycan found on C. botulinum strain Langeland FlaA, however, awaits further genetic analysis.
Previously, nonulosonic acid sugars such as legionaminic acid and pseudaminic acid have been identified as the post-translational modification of flagellins in the Gram-negative gastrointestinal pathogens, Campylobacter and Helicobacter [52]. In these bacteria, the glycosylation of the flagellin is essential for filament assembly and glycan modifications have been shown to play a role in pathogenesis [53,54]. The presence of nonulosonic acid sugars in numerous strains of C. botulinum may have an important bearing on its ability to establish a gut infection and thereby cause infant or adult intestinal botulism. Although the present study has not correlated the distribution of specific flagellin glycan modifications with the type of botulism caused, this property may enable a strain to bring about infant/adult intestinal botulism at a lower infectious dose than that of strains lacking these flagellin modifications. A comparative genomic analysis of Campylobacter jejuni identified distinct distributions of flagellar glycosylation genes (cj1321-cj1326) that were present only in strains associated with colonisation of livestock [53]. The hypothesis was developed that the type of flagellar glycosylation genes in Campylobacter jejuni strains conferred a survival advantage to these strains within livestock, offering a possible explanation for the host specificity of some Campylobacter jejuni strains. It remains to be established whether diversity in flagellar glycan biosynthetic capacity in C. botulinum is similarly related to host specificity and the colonization ability of isolates.

Conclusion
The most important aspects of the biology and evolution of proteolytic C. botulinum have been highlighted by this study, particularly in relation to the neurotoxin and its associated cluster and the FGI. The close relationship with C. sporogenes, and very distant relationship with non-proteolytic C. botulinum have been confirmed. Proteolytic C. botulinum and non-proteolytic C. botulinum are phylogenetically distinct organisms that coincidentally share type B and type F neurotoxin genes. These genes are of such sequence similarity as to obviously share a recent common ancestor, and appear therefore to have crossed the species barrier. Intriguingly, type A and type E neurotoxin genes seem to be mutually exclusive and are each restricted to just one of these species. The genome of proteolytic C. botulinum appears to be relatively stable, and strains sequenced to date display a high degree of synteny (data not shown). There are, however, variable regions, and we have presented evidence for the independent horizontal transfer of genes encoding the neurotoxin cluster and FGI, compared to the remainder of the genetic complement. Transfer of neurotoxin and associated genes may be associated with a hot-spot for recombination within the NTNH, closely associated IS elements, and plasmids. Further investigations of unexpected toxin or FGI types within clades may be particularly interesting, and reveal more about the acquisition or loss of genetic material. For example, while most type A1 strains grouped together according to whole genome and FGI analysis, four appeared to be distinct. Two of these type A1 strains, FE9604A and MUL0109ASA were closely related to each other, with the toxin gene in a ha minus/orf-X plus cluster. Strains 17A and 96A were both ha plus/orf-X minus strains, but appear to be different to each other and all the other type A1 strains by whole genome and FGI analysis. 
Interestingly, strain 96A was also well separated from other type A1 strains by PFGE [32]. The sequencing of further strains (which is rapidly becoming affordable for most laboratories) is a particularly attractive way forward, as unlike microarray analysis, which can only highlight CDSs that are present or absent in a test strain, it will provide information not only on what has been inserted or lost, but where on the genome this has taken place. Indeed the genomes of several of the strains used in this study have recently been sequenced, and those with slightly larger genomes typically also carry approximately 300-600 novel genes with respect to the ATCC 3502 strain used as a reference in this work (data not shown).
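The limitation of a reference-based microarray described above can be expressed with set operations. The sketch below is illustrative only: the gene identifiers are hypothetical, and it models gene content but not genome position, which is the additional information that sequencing provides.

```python
def microarray_view(reference_cds, strain_genes):
    """What a reference-based array can report: reference CDSs present or absent."""
    present = reference_cds & strain_genes
    absent = reference_cds - strain_genes
    return present, absent


def sequencing_only(reference_cds, strain_genes):
    """Genes in the test strain that have no probe on the reference array."""
    return strain_genes - reference_cds


# Hypothetical gene identifiers, for illustration only.
reference = {"cds_1", "cds_2", "cds_3", "cds_4"}
test_strain = {"cds_1", "cds_3", "novel_a", "novel_b"}

present, absent = microarray_view(reference, test_strain)
novel = sequencing_only(reference, test_strain)
```

Here the array would report `cds_2` and `cds_4` as absent, while `novel_a` and `novel_b` would be detected only by sequencing. This is the category to which the approximately 300-600 novel genes mentioned above belong.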
A number of typing tools have been used for the molecular characterisation of proteolytic C. botulinum. Some (e.g. ribotyping, 16S rrn sequencing) can be used to identify the organism, but are not particularly effective at discriminating between strains [8,55]. Others (e.g. PFGE, MLST, AFLP, VNTR, fla sequencing, DNA microarrays) are able to discriminate between strains [8,14,29,32-34]. The present study and previous work [14] have demonstrated that comparative genomic indexing using a DNA microarray based on the genome sequence of ATCC 3502 is an effective tool to discriminate strains of proteolytic C. botulinum. Advantages of microarrays are that they can infer evolutionary relationships better than single/multi-locus methods and additionally provide valuable information on the genome content of tested strains, thereby providing an insight into the biology and evolution of the organism. The present microarray is suitable for the forensic analysis of strains of proteolytic C. botulinum, including investigations of bioterrorism-associated events. A second generation DNA microarray could be developed for this purpose based on the variable regions identified between a number of sequenced strains, and could utilise printed rather than spotted microarrays.