Updated genome assembly and annotation of Paenibacillus larvae, the agent of American foulbrood disease of honey bees

Abstract Background As scientists continue to pursue various 'omics-based research, there is a need for high quality data for the most fundamental 'omics of all: genomics. The bacterium Paenibacillus larvae is the causative agent of the honey bee disease American foulbrood. If untreated, it can lead to the demise of an entire hive; the highly social nature of bees also leads to easy disease spread, between both individuals and colonies. Biologists have studied this organism since the early 1900s, and a century later, the molecular mechanism of infection remains elusive. Transcriptomics and proteomics, because of their ability to analyze multiple genes and proteins in a high-throughput manner, may be very helpful to its study. However, the power of these methodologies is severely limited without a complete genome; we undertake to address that deficiency here. Results We used the Illumina GAIIx platform and conventional Sanger sequencing to generate a 182-fold sequence coverage of the P. larvae genome, and assembled the data using ABySS into a total of 388 contigs spanning 4.5 Mbp. Comparative genomics analysis against fully-sequenced soil bacteria P. JDR2 and P. vortex showed that regions of poor conservation may contain putative virulence factors. We used GLIMMER to predict 3568 gene models, and named them based on homology revealed by BLAST searches; proteases, hemolytic factors, toxins, and antibiotic resistance enzymes were identified in this way. Finally, mass spectrometry was used to provide experimental evidence that at least 35% of the genes are expressed at the protein level. Conclusions This update on the genome of P. larvae and annotation represents an immense advancement from what we had previously known about this species. We provide here a reliable resource that can be used to elucidate the mechanism of infection, and by extension, more effective methods to control and cure this widespread honey bee disease.


Background
Paenibacillus larvae is a spore-forming, Gram-positive bacterium, studied for the past century due to its ability to cause American foulbrood (AFB), a larval disease of honey bees [1]. The host is most vulnerable during an approximately 48-h window in the life cycle -the early larval stage -where arguably an undeveloped immune system and/or a lack of energy stores result in death. During this period, the oral LD50 ingestion is 8.49 spores [2]; death occurs due to systemic infection after the germinated bacterial spores proliferate in the midgut and then breach the midgut epithelium via a paracellular route [3]. The antibiotics oxytetracycline and tylosin are used both prophylactically and to treat symptoms; however, widespread drug resistance is evident [4] and their registered use is being withdrawn in many countries since residues can show up in honey. Even in susceptible isolates though, it is extremely difficult to completely eliminate from a hive, so without definitive knowledge of the molecular mechanism of pathogenesis, the design of specific treatments is significantly hindered. In 2006, a draft of the P. larvae genome was published at an estimated 5-6x coverage [5]; here we extend this coverage and further annotate the genome sequence with a combination of bioinformatics and proteomics.

Results and Discussion
Assembly of the P. larvae genome Using the Illumina GAII platform and the ABySS assembler, together with the previously collected Sanger reads [5]; we achieved 182x coverage of the P. larvae genome and generated an initial assembly with a contig N50 of 49.6 kb. This statistic describes the contiguity of an assembly, and denotes that 50% of the reconstructed genome is contained in contigs equal to or larger than the given value. Contigs were joined into scaffolds using ABySS and Anchor (version 0.2.7, http://www.bcgsc.ca/ platform/bioinfo/software/anchor/releases/0.2.7) (Docking et al., in preparation) to achieve a scaffold N50 of 83.2 kb. The largest contig size was 261,601 base pairs. There were 184 contigs with a size less than 1 kb, 136 contigs with length between 1 kb and 10 kb, 57 contigs with length between 10 kb and 100 kb and 11 contigs larger than 100 kb. The total length of the assembled contigs was 4,505,771 base pairs. Assembly statistics are listed in Table 1.
This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession ADZY00000000, with the latest version being ADZY02000000. Genome annotation and downstream analysis described below is based on the first version, ADZY01000000. Despite high sequence coverage, the assembly was relatively fragmented (Figure 1a). This fragmentation appears to be due primarily to long genomic repeats that could not be bridged by our sequencing strategy, as indicated by a preponderance of repetitive sequence occurring at contig ends (example shown in Figure 1b). Many of these repeats are similar to known bacterial insertion sequences and exceed 1 kb in length (data not shown). As a result, the majority of contigs were 1-10 kb in length.
Given the fragmented nature of the assembly and the potential contribution of transposable elements to chromosome re-arrangements, we used tBLASTx and MUMmer to investigate the level of synteny between P. larvae and the congeneric soil bacterium P. JDR2. Although the completely sequenced P. JDR2 genome is 7.2 Mbp, substantially larger than our assembly (4.4 Mbp), the species share~93% identity at 16S ribosomal loci. Figure 2a shows that, despite the difference in assembly size, gene-level synteny is generally conserved with P. JDR2 across the entirety of many P. larvae contigs. Some regions have excellent conservation (example in Figure 2b). This suggests that P. JDR2 can be a useful reference for comparative genomics and to order P. larvae contigs for genome finishing. Divergent regions between the two genomes are also of interest because they may harbor species-specific genes that are ecologically important in soil and beehive environments, respectively. For example, contig Seq24, which lacks synteny with P. JDR-2 (Figure 2c), contains two gene regions not present elsewhere in P. JDR-2 that are potential virulence factors (Table 2). Interesting, the same contig was also poorly conserved in another bacterium P. vortex (Figure 2d), thereby strengthening the claim that genes in this contig are unique to P. larvae. P. vortex is a pattern-forming soil bacterium with a 6.4 Mbp genome [6], so as expected, its overall level of synteny with P. larvae is quite high (Figure 2e), though slightly lower than observed when compared against P. JDR-2.
Within the regions unique to P. larvae, one includes several open reading frames (ORFs) with strong homology to the synthetases of the antibiotic plipastatin, which inhibits phospholipase A2 [7] and surfactin [8], which possess hemolytic activity [9]. The region also contains putative type I polyketide synthetases whose products are secondary metabolites, some of which have antibiotic, immunosuppresant, or toxic effects [10]. The second region encodes homologs of a putative iron-siderophore ABC transporter, which enables iron uptake from the medium, an important process for many bacterial infections [11].
Annotation of putative P. larvae proteins Using GLIMMER to predict genes from the P. larvae assembly and BLAST [12] searches against Bacillus or Streptococcus to provide annotation information (Additional File 1), we identified 3,568 gene models (final list in Additional File 2). Using shotgun proteomics methods to analyze P. larvae lysates, we identified 1262 proteins (1% false discovery rate), thus confirming the expression of 35% of the predicted proteins. Details regarding the identities of these proteins can be found in Additional File 3. As a major aim of this project is to find potential virulent factors and unique genes for this niche-specific bacterium, we describe below the genome    content of P. larvae for several pathways considered to be important in this regard. The putative functions of the proteins were predicted by searching for matches against the Conserved Domain Database (CDD), and the complete results are tabulated in Additional File 4.

Flagellar system
P. larvae is a flagellated bacterium and 41 genes associated with this system are detected in this assembly (Table 3), based on comparisons to two other fullysequenced organisms. This group includes motor/switch, flagellar rings, rod hook and filament proteins, along with proteins involved in regulation and as chaperones.
When compared against the Bacillus subtilis (Figure 3a, based on [13]) and Escherichia coli (Figure 3b, based on [14]) flagella, P. larvae has orthologs for nearly all of the genes. Those that are missing, such as the L and P ring proteins FlgH and FlgI in the outer membrane of the Gram-negative E. coli, are not necessary for the Grampositive P. larvae since it only has one membrane [15].
However, it appears to be missing some players needed for assembly of the flagellar hook, a structure that acts both as a joint and motor for each individual flagellum. The hook itself, made of monomers, requires the monomer scaffolding protein FlgD [16]; this was not found in the P. larvae genome. Also missing is FliK, which acts essentially as a checkpoint for the flagellum's correct length prior to export. This may mean that this species has evolved ways to proceed with hook assembly despite their omission, or the enzymes were simply not detected by our current methods; as with many negative results, however, our inability to detect such genes does not imply P. larvae is completely incapable of quality control functions. Besides the flagellar structure, another important aspect is the control of its movement. Directionality is largely dictated by the Che gene family and methyl-accepting chemotaxis proteins [17], which can also be found in the P. larvae genome. The base of the flagellum consists of assembly and regulatory components, which are made of Fli and Mot gene families in E. coli [18]. Again, orthologs for most of these can be seen in P. larvae.

Toxins
Our search for P. larvae genes that can encode toxins returned three matches to sixteen proteins (Table 4). A domain of the Clostridum perfringens epsilon toxin (pfam03318) was observed in EFX46729 and EFX45732, which forms pores on host cells [19] leading to cell death. P. larvae appear to possess classic binary toxins: the first factor is a membrane component that makes the host cell permeable to a second factor -the enzymatic component. Seven proteins matched the Clostridial binary toxin B domain (pfam03495), the membrane component. We also saw that four proteins matched the VIP2 domain (cd00233), which is the enzymatic component of a Bacillus toxin capable of ADP ribosylation on actin. In both the Clostridial and Bacillus toxins, the associated partners of these binary toxin-pairs were not observed.

Hemolysins
In the mid to late stage of the P. larvae infection cycle, bacteria circulate in bee hemolymph which contains hemocytes. Hemolysins are a class of bacterial toxins with lytic activity against blood cells. Four hemolysin domains were matched by proteins from the P. larvae genome. (Table 5). Three proteins contained regions that were highly similar to the hemolytic domain TylC (COG1253) and concomitantly to other domains such as CBS_pair_CorC_HylC_assoc and DUF21 (data not shown), implying that these are very similar proteins. EFX46836 was matched to the hlyIII domain, an integral membrane protein from Bacillus cereus which forms channels on host cells [20]. EFX45686 appears to be related to hemolysin TlyA (TIGR00478) with an indirect pore-forming role. One protein (EFX45724) showed similarity to DUF37 with unpublished claims of hemolysin activity in Aeromonas hydrophila.

Proteases
The occurrence of proteases in larval scales (remains of P. larvae-infected bee larvae) has already been well documented [21]. A more recent study hints on a definite possibility that one or more such proteins can act as the virulence factor [22]. Since it has been shown that vegetative rods residing in the host midgut can traverse the epithelium by a paracellular route, P. larvae may generate proteolytic activity that can disrupt epithelial cell-to-cell junctions. We searched the proteins for domains with this function, and after excluding ones with roles that are not directly implicated in pathogenesis (e.g. signal peptidase, SOS response for DNA repair), we list the matches in Table 6. Based on the information provided by the CDD, several proteins may conceivably serve as virulence factors. The Clp protease complex, in particular, has been linked to virulence factor production in Gram-positive bacteria, as well as degradation of misfolded proteins [23,24]. Proteins EFX45851, EFX44792 may be capable of cleaving the transmembrane regions of substrate proteins (matching domains cd06161, cd06158). Given that tight junctions between epithelial cells are mainly composed of occludin and claudin proteins, both of which have transmembrane domains [25], they may be specific substrates for P. larvae proteases. These points, however, are merely speculative, and the many proteases in Table 6 await experimental evidence to prove whether or not they actually contribute a real role in virulence.

Antibiotic resistance
P. larvae infections have been treated with tetracycline for decades. It has been shown that resistance against this antibiotic can be conferred by the pMA67 plasmid [4]. More recently, the macrolide tylosin has now been registered as a new antibiotic in the U.S. for use when tetracycline is no longer effective [26]. Since antibiotic resistance is inevitable [27], new therapeutics will eventually be required. Realizing that P. larvae appears to encode a plethora of drug efflux proteins (Table 7) may help in the selection of useful drugs. It is known that the bee midgut contains a vast array of bacteria, some of which can inhibit P. larvae growth in vitro [28][29][30], presumably by the production of lantibiotics. In our search we were able to find a number of different drug and lantibiotic efflux pumps and modifying enzymes (Table 7), and it is likely that these proteins are enlisted by P. larvae during interactions with cooccurring antagonistic bacteria.

Conclusions
In this article we report an update on the P. larvae genome to 182-fold coverage, and estimated the ordering of contigs based on comparison against the P. JDR2 genome. We predicted more than 3500 genes and provided these with functional annotation. Of great interest are enzymes that allow the bacterium to traverse the midgut epithelium after ingestion, a process which can contribute to its virulence. Proteases have been thought of as a key factor [22], and this is supported by the large repertoire of such enzymes which we have seen from the predicted gene products. Damage to the host is likely the result of mixed effects from the hemolysins and toxins that can be encoded by P. larvae. Its ability  to survive in the host by evading the immune system, as well as protecting itself against other gut bacteria, is likely made possible due to the antibiotic proteins that may be expressed. This improved version of the P. larvae genome and the more detailed annotations will be tremendously useful for improving our understanding of this species; for example, in designing primers to clone genes, or mutate them by site-directed mutagenesis. Especially for the field of proteomics, which often relies heavily on a database of protein sequences to make identifications, the current genome update will be a very powerful tool. Computer-based modeling techniques can be employed to predict the structure of virulence factors, which can guide drug design. Ultimately, the genome will pave the way to more effective prevention; or better yet, a cure for AFB. Given the bee population is under severe strain from the widely publicized Colony Collapse Disorder, attributable to multiple biotic and abiotic factors, it is important to continually improve our knowledge of the key threats that burden worldwide bee health.

Methods
Preparation of genomic DNA P. larvae strain BRL-230010, stored at -80°C was cultured for 5 d in a shaking incubator at 35°C in 25 mL of media. This media contained 1% Yeast Extract (Difco), 1% soluble starch which was prepared in 10 mM KH 2 PO 4 and adjusted to pH 6.6 with KOH (Recipe #4 from [31]). Genomic DNA was extracted using the DNeasy kit (Qiagen), following the manufacturer's protocol for Gram-positive bacteria.
Genome sequencing 8,212,402 2 × 50 bp paired-end reads with a mean fragment size of 182 bp were sequenced at the Genome Sciences Centre, Vancouver, Canada using the Illumina GAIIx platform as per manufacturer's instructions. This yielded approximately 182-fold coverage of the estimated genome size. In addition, 24,768 Sanger capillary-derived sequences were generated at the Human Genome Sequencing Centre, Baylor College of Medicine, using the ABI3730 sequencing platform. This provided an additional 3-fold coverage. Raw Sanger sequences were aligned against the UniVec database (NCBI http://www. ncbi.nlm.nih.gov) using BLAST. Those Sanger sequences that had good BLAST hits (e-value < 1e-10) were filtered from the dataset. Then, vector sequences were trimmed from the raw Sanger reads using cross_match (Cross_match http://www.phrap.org/) against the UniVec database. We then aligned all the Illumina reads to the Sanger sequences using MAQ version 0.7.1 [32]. Due to observed Honey Bee DNA contamination arising within the Sanger sequences, only sequences where at least 30% of the length aligned with Illumina reads were retained.
In the first stage of the assembly process, the Illumina reads were assembled into contigs using ABySS version 1.2.7 [33]. Assemblies were attempted using several different values for the ABySS sequence overlap parameter k (which determines the k-mer length used to construct the assembly graph), with the best assembly found at k = 37. Other modified ABySS parameter values were s = 100, n = 10, and d = 50. All other ABySS parameters were set to their default values. Following the initial assembly, contigs were joined into scaffolds by aligning the long-insert Sanger sequence data to the assembly, and merged using the scaffolding module of ABySS (Jackman et al., in preparation). Where possible, scaffold gaps were filled, and contigs were extended, using the local reassembly tool Anchor (version 0.2.7, http://www.bcgsc.ca/platform/bioinfo/software/ anchor/releases/0.2.7) (Docking et al., in preparation).

Genome annotation
ORFs were predicted with GLIMMER [34] using the long-ORF training script provided with the software package. ORFs were annotated as predicted proteins if they have at least 100 amino acids or at least 50 with a BLAST hit at a threshold of E = 10 -5 to Bacillus or Streptococcus (Additional File 1). Annotation terms were transferred from FASTA headers where appropriate, and in accordance with NCBI Prokaryotic Genome Submission Guidelines (full list of proteins in Additional File 2).
Genomic dot-plot comparisons were made with the program MUMmer [35]. To produce Figure 2, P. larvae *the listed match is not the one with the lowest E-value; it is displayed because it is the most informative  contigs were ordered manually based on BLAST hit coordinates to P. JDR2 and P. vortex as well as the graphical output of MUMmer alignments. The genome annotation as described above was accomplished using the first version of the assembly from this sequencing (ADZY01000000), which had a total N50 of 56,120 base pairs, a total length of 4,352,422, and a mean contig size of 12,329 bp. When we repeated this analysis with the latest assembly, all but 53 of the 3,568 gene models were identical. Of the 53, most were strong partial hits. All downstream analysis was done using ADZY01000000.  Shotgun proteomics of P. larvae lysates P. larvae culture was prepared as above. Bacteria were collected by centrifugation for 5 minutes at 16100 relative centrifugal force and the pellet was washed twice in PBS. For an in-gel digestion, the bacteria were lysed in 100 μl of 4× LDS and heated for 15 minutes at 90°C, passed 5 times through a 26 G needle fitted with a syringe before heating for another 10 minutes at 90°C.
Lysis was observed under the microscope. Protein concentration was estimated by Coomassie Plus staining, and 200 μg of protein was divided among all 10 wells of a 2-12% NuPAGE precast MES mini gel set up according to the manufacturer's instructions and the in-gel digestion proceeded essentially as described in [36]. For in-solution digestions, bacteria was lysed by either 100 μl of 6 M urea, 2 M thiourea, 20 mM Tris-HCl, pH 8.0