Erratum to: genomic comparison of 93 Bacillus phages reveals 12 clusters, 14 singletons and remarkable diversity

Background The Bacillus genus of Firmicutes bacteria is ubiquitous in nature and includes one of the best characterized model organisms, B. subtilis, as well as medically significant human pathogens, the most notorious being B. anthracis and B. cereus. As the most abundant living entities on the planet, bacteriophages are known to heavily influence the ecology and evolution of their hosts, including providing virulence factors. Thus, the identification and analysis of Bacillus phages is critical to understanding the evolution of Bacillus species, including pathogenic strains. Results Whole genome nucleotide and proteome comparison of the 83 extant, fully sequenced Bacillus phages revealed 10 distinct clusters, 24 subclusters and 15 singleton phages. Host analysis of these clusters supports host boundaries at the subcluster level and suggests phages as vectors for genetic transfer within the Bacillus cereus group, with B. anthracis as a distant member. Analysis of the proteins conserved among these phages reveals enormous diversity and the uncharacterized nature of these phages, with a total of 4,442 protein families (phams) of which only 894 (20%) had a predicted function. In addition, 2,583 (58%) of phams were orphams (phams containing a single member). The most populated phams were those encoding proteins involved in DNA metabolism, virion structure and assembly, cell lysis, or host function. These included several genes that may contribute to the pathogenicity of Bacillus strains. Conclusions This analysis provides a basis for understanding and characterizing Bacillus and other related phages as well as their contributions to the evolution and pathogenicity of Bacillus cereus group bacteria. 
The presence of sparsely populated clusters, the high ratio of singletons to clusters, and the large number of uncharacterized, conserved proteins confirms the need for more Bacillus phage isolation in order to understand the full extent of their diversity as well as their impact on host evolution.

The version of this article published in BMC Genomics 2014, 15(1):855 contains unpublished genomes downloaded from the public website phagesdb.org. We apologize for not having contacted the authors of these genomes in advance. In this correction, we removed all genomes that were unpublished as of the original publication date, at the authors' request (Adelynn, Doofinshmertz, Gir1, JPB9, Nigalana, Polaris, Pleiades, Pappano, Pegasus, Stitch). Removing these data did not alter the principal results and conclusions of our original work, including conservation of 100% of the phage relationships (grouping into clusters and subclusters). It only altered their numbers, with 83 total phages, 10 clusters and 15 singletons. Hence the figures, tables and text are very similar, with minor changes in numbering and wording. In addition, we have repaired and updated some of the references for Table 1. We apologize for any confusion or inconvenience this may have caused.

Background
Bacteriophages are the most abundant biological entities on the planet, with at least 10^31 bacteriophages in Earth's biosphere [1][2][3][4][5]. Their ability to infect and kill their bacterial hosts makes them key factors in both the evolution of bacteria and the maintenance of ecological balance (for recent reviews see [6][7][8][9][10][11][12]). In addition, they are able to infect and transfer genetic information to their hosts, in many cases being key factors in the transfer of pathogenic traits such as in pathogenic Escherichia coli, Salmonella sp., Corynebacterium diphtheriae and Vibrio cholerae. Despite their clear importance to global environmental and health concerns, little is known about the complexity and diversity of these living entities, but what is known from metagenomics and phage genome sequencing suggests it is vast.
The most studied bacteriophages are those that infect the Gram-positive bacterium Mycobacterium smegmatis mc²155, with over 4,800 phages isolated and 690 fully sequenced genomes (www.phagesdb.org). These phages have been isolated by students from throughout the world as part of the Howard Hughes Medical Institute Science Education Alliance Phage Hunters Advancing Genomics and Evolutionary Science (HHMI SEA-PHAGES) program for determining the diversity of phages that can infect a single host. A recent analysis of 491 of these indicates they belong to approximately 17 "clusters" of related phages (A-Q) and 13 singleton clusters [13]. Of interest, identical mycobacteriophages have only been isolated independently twice (Graham Hatfull, personal communication). Beyond these Mycobacterium phages, the bacterial family with the most phages isolated is the Gram-negative Enterobacteriaceae family (337 fully sequenced genomes available in GenBank). This group of phages has been isolated and sequenced independently by investigators throughout the world and contains many of the well-characterized, historical phages such as Lambda, Mu, T4 and T7. They have recently been grouped into 38 clusters of related phages and 18 singleton clusters [14].
A third group of well-studied phages, the Bacillus phages, have also been isolated by diverse investigators from throughout the world and infect many strains of the genus Bacillus. The Bacillus genus is ubiquitous in nature and includes one of the best characterized model organisms, B. subtilis, as well as medically significant human pathogens, the most notorious being B. anthracis (the causative agent of anthrax) and B. cereus (which causes food poisoning). Phages have been isolated that infect B. anthracis, B. cereus, B. megaterium, B. mycoides, B. pseudomycoides, B. subtilis, B. thuringiensis, and B. weihenstephanensis, allowing a unique opportunity to investigate the diversity of phages that infect different hosts within a bacterial genus. This study focuses on the genomic comparison of 83 fully sequenced phages that infect the Bacillus genus and discusses their place in the diversity and evolution of these important bacteria. In addition, we identify several genes that may contribute to the pathogenicity of Bacillus species. This analysis presents a framework for understanding phages that infect Bacillus and for comparing Bacillus phage diversity with the diversity of phages that infect other genera. In addition, it increases our understanding of the evolution and diversity of phages and their hosts, including the evolution of pathogenic strains.

Results and discussion
Whole genome nucleotide and amino acid comparison of the Bacillus family of phages reveals 10 diverse clusters of related phages and 15 singleton clusters
In order to determine the relationship of the 83 extant, fully-sequenced Bacillus phages, we analyzed the published phage genomes by methods similar to those of Hatfull et al. [15,16], including whole genome dot plot analysis, pairwise average nucleotide identities (ANI) and genomic maps. The accession numbers and basic properties (host, genome size, GC content, number of ORFs, number of tRNAs and morphotype) of the 83 fully sequenced Bacillus phages are provided in Table 1 along with the appropriate reference.
Dot plot analysis of the Bacillus phages revealed 10 clusters of phages with similarity over at least 50% of their genomes (clusters A-J, also referred to by "founding phage" for clarity) and 15 phages that are singletons, having little to no nucleotide similarity to any other Bacillus phages. Genomic dot plot analysis consists of placing the nucleotide sequences across both the X- and Y-axes. A dot is placed where the sequences are identical, resulting in a diagonal line down the center of the plot when a sequence is compared to itself. The phages were aligned on two separate plots due to the wide range in genome size and the fact that no additional nucleotide similarity was seen in a combined plot. Figure 1A contains phage genomes of less than 100 kb while 1B contains the larger phage genomes. As stated above, assignment of a phage to a cluster was based on nucleotide similarity over at least 50% of the genome when compared to at least one other phage in the cluster. A phage could be placed into the same cluster by weak similarity over most of the genome, by strong similarity over about half of the genome, or by a combination of relatedness. The ANI values were also calculated within each cluster and found to be at least 55% between a phage and another phage within a cluster. Of the total of 25 clusters, over half (15) are singleton clusters containing a single phage member, suggesting that the isolation of unique Bacillus phages is far from complete. Our analysis and grouping of phages into clusters agrees completely with a previous grouping of B. cereus group phages by Lee et al. in which our PhiNIT1-like J cluster phages would belong to Group I, our gamma d'Herelle-like E cluster phages to Group II, and our Wip1-like A cluster phages to Group III [17]. In addition it agrees with the recent grouping of B. pumilus phages into BpA, where BpA corresponds to our A cluster [18]. In addition to showing strong evolutionary relationships, whole genome nucleotide dot plots also reveal smaller regions of homology (<50% span length) between phages of different clusters that are likely areas of recombination. The largest such region is a ~10,000 bp region of similarity between phBC6A51 (bp 44289-50616 and 58088-61389) and gamma d'Herelle-like E cluster phages that includes a tail component protein, minor structure protein and holin as well as a site-specific recombinase, a Ftsk/SpoIIIE family protein and five conserved phage proteins.
Table 1 notes: The subcluster and cluster designation is given followed by the phage name. The founding phage for each cluster is in bold-italics. Hosts are the bacterial hosts on which the phages were isolated (not the host range). *tRNA predicted in this study using Aragorn and DNAMaster. **Phage SI0phi is reported as an incomplete genome but is included in this analysis because it was complete enough to clearly assign it to a cluster.
Figure 1 (caption fragment): Thick lines indicate cluster assignments, which are provided on the Y-axis (A-J). Dot plots were produced using Gepard [57] and whole genome amino acid sequences were retrieved from Phamerator [34].
In addition to whole genome nucleotide analysis, whole proteome dot plot analysis was performed (Figures 1C and D). Because nucleotide sequences diverge more rapidly, the amino acid dot plots were expected to reveal more distant evolutionary relationships. The analysis confirmed the basic cluster assignments seen with whole genome nucleotide analysis and revealed distant relationships between the TP21-like D, gamma d'Herelle-like E, and IEBH-like F cluster phages discussed in more detail below. Note that there should be some limited similarity between all of the Bacillus tailed phages in that they should all encode a major capsid protein (MCP), portal protein and terminase. However, these proteins can diverge to a point that no sequence similarity is apparent.
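The dot plot principle described above can be illustrated with a minimal Python sketch that records every position where a short word from one sequence exactly matches a word from the other. The function names, the word size and the toy sequence are assumptions made only for illustration; the plots in this study were produced with Gepard [57], which is considerably more sophisticated than this exact-match version.

```python
def dot_plot(seq_x, seq_y, word=3):
    """Return the set of (i, j) where seq_x[i:i+word] == seq_y[j:j+word]."""
    # Index every word of seq_y by the positions at which it starts.
    index = {}
    for j in range(len(seq_y) - word + 1):
        index.setdefault(seq_y[j:j + word], []).append(j)
    # Look up each word of seq_x and record one dot per exact match.
    dots = set()
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], []):
            dots.add((i, j))
    return dots


def render(dots, n_x, n_y):
    """Draw the dots as text, one row per position on the Y-axis."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(n_x))
        for j in range(n_y)
    )


if __name__ == "__main__":
    toy = "ATGCGTACGTTAGC"  # hypothetical sequence, not a phage genome
    dots = dot_plot(toy, toy)
    # Comparing a sequence to itself yields the full main diagonal.
    print(render(dots, len(toy) - 2, len(toy) - 2))
```

Applied to protein sequences instead of nucleotides, the same idea gives a crude analogue of the proteome dot plots, although real tools score similar (not only identical) residues.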
Another common way to group phages is by the percent of the proteome that is conserved between phages. CoreGenes 3.0 was used to confirm clusters by ensuring that phages within a cluster share ~40% of their proteome, a cutoff commonly used for determining phage relationships [19,20]. The cluster with the lowest conservation of the proteome (that is, the lowest conservation between a phage and its closest relative) is the Staley-like H cluster, with the highly related phages Staley and Slash sharing only 43.4% of their proteome with Basilisk. All other clusters yielded proteome comparison scores well above the 40% CoreGenes threshold, thus confirming that the phages belong in the proposed clusters.
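The proteome cutoff described above amounts to a simple ratio test, sketched below in Python. This is only an illustration of the arithmetic, not the CoreGenes algorithm: the function names are hypothetical, the protein counts are invented, and which proteome serves as the denominator is an assumption.

```python
def percent_shared(shared_proteins, proteome_size):
    """Percent of a proteome that has a counterpart in the other phage."""
    return 100.0 * shared_proteins / proteome_size


def passes_cutoff(shared_proteins, proteome_size, cutoff=40.0):
    """True when the shared fraction meets the ~40% cutoff from the text."""
    return percent_shared(shared_proteins, proteome_size) >= cutoff


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate both outcomes.
    print(passes_cutoff(33, 76))  # about 43% shared
    print(passes_cutoff(11, 40))  # 27.5% shared
```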
The division of phages into the proposed clusters is also supported by the low standard deviation in the average basic phage properties including genome size, GC content, number of ORFs and morphotype (Table 2). For example, cluster A consists entirely of tectiviruses with an average genome size of 14685 ± 302 bp, clusters B and C of podoviruses with short tails (average genome sizes of 19432 ± 1001 and 39864 ± 17 bp, respectively), clusters D, E, F, G and H of long noncontractile siphoviruses (average genome size ranging from 39222 ± 3522 to 81276 ± 777 bp), and clusters I and J of large contractile myoviruses (average genome sizes of 138886 ± 5607 and 158129 ± 4580 bp, respectively). The average number of tRNAs for each cluster is also reported but is highly variable within a cluster, with standard deviations often approaching the number of tRNAs. This variation may reflect the phages' adaptation to different hosts because tRNAs are thought to provide efficient protein production in hosts with alternate codon preferences [21]. Further host range studies are needed to test this hypothesis.

Division of clusters into subclusters reveals large variance between clusters
Each cluster was further analyzed by nucleotide dot plot to reveal groups of high similarity, or subclusters (Figures 2 and 3). These subclusters were chosen based on natural divisions in phage similarity seen in the dot plot, but could be more strictly defined by ANI values of at least 66% between two phages within the subcluster. The subcluster assignments indicate great diversity in the relatedness within each Bacillus phage cluster. It is unknown whether this diversity represents evolutionary forces that constrain certain types of phages or if it is an artifact of phage isolation. Further phage isolation is necessary to make this distinction.
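The stricter ANI-based definition mentioned above can be sketched as a threshold grouping. The Python example below joins two phages whenever their pairwise ANI meets the threshold (single linkage); the linkage rule and the function name are assumptions, since the subclusters in this study were chosen from dot plot divisions. The ANI values are those reported for cluster H in this article (Staley and Slash at 94%, Basilisk at 55% to both).

```python
def group_by_ani(phages, ani, threshold):
    """Group phages so that any pair with ANI >= threshold is linked."""
    parent = {p: p for p in phages}

    def find(p):
        # Follow parent links to the group representative.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), value in ani.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for p in phages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    cluster_h = ["Basilisk", "Slash", "Staley"]
    ani = {
        ("Staley", "Slash"): 94.0,
        ("Staley", "Basilisk"): 55.0,
        ("Slash", "Basilisk"): 55.0,
    }
    print(group_by_ani(cluster_h, ani, 66.0))  # subcluster-level threshold
    print(group_by_ani(cluster_h, ani, 55.0))  # cluster-level ANI floor
```

At the 66% threshold Basilisk separates from Staley and Slash, while at 55% all three fall into one group, matching the cluster H description in the text.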

Clusters containing highly related phages
Clusters C, D, F and G are each composed of a single subcluster containing highly related phages (sharing at least 74% ANI). Cluster G is the largest cluster containing only highly related phages, and harbors 9 myovirus phages [...] [18]. Clusters C and D each contain three phages of the podovirus and siphovirus families, respectively, while F has two siphoviruses. The majority of phages in each of these clusters are recently isolated phages that are not well characterized. In fact, the MCP was not annotated for any cluster C or D phage and we were unable to identify an MCP by TBLASTN searches, suggesting that the MCPs of these phages are novel.
Clusters containing more distantly related phages
[...] [22]. Similarly, cluster E contains 11 phages divided into 3 subclusters where ANI varies from 42% to 99.99% between phages, but every phage shares at least 55% ANI with at least one other phage in the cluster. There is 86% proteome conservation within each subcluster, and between subclusters there is at least 41% proteome conservation. Cluster H harbors the very similar Staley and Slash (94% ANI) and the more distantly related phage Basilisk, which shares 55% ANI and 43% of its proteome with Staley/Slash. Cluster I harbors SPO1 and its close relative CampHawk (subcluster I1) as well as the more distantly related phages Shanette and JL (subcluster I2), which share ~53% of their proteomes with the I1 phages. Clusters F and J contain more closely related phages. Cluster F harbors siphoviruses IEBH and 250 which share 90% ANI and 55% of their proteomes. Cluster J is the largest cluster and contains 23 myoviruses. Of interest, the eight subclusters to which these large phages belong are highly variable in host, tRNA content and number of ORFs (see Table 1), but they are all highly related, having at least 81% ANI.
Overall, Bacillus phages remain highly uncharacterized, but clusters B, E and I contain some of the well-characterized Bacillus phages, including the B. subtilis phage phi 29, the B. anthracis typing phages Gamma and Cherry, and B. subtilis phages SPO1 and CampHawk, respectively.
Single gene product analysis mirrors whole genome/proteome analysis
In addition to using whole genome or proteome comparisons to determine phage cluster assignment we recently demonstrated the utility of single gene product analysis using the mycobacteriophage tape measure protein (TMP) and major capsid protein (MCP) gene products [23]. We were unable to use either TMP or MCP for Bacillus phage single-gene comparison because podoviruses do not have a TMP and the MCP was not reported or identified by a TBLASTN search for several of the 83 Bacillus phages (including clusters C, D and H). Three genes are thought to be common to all tailed phages: the MCP (the major constituent of the icosahedral shell), the portal protein (which forms the pore into the capsid through which the DNA is packaged) and the large terminase (the ATPase that packages the DNA into the capsid) [24]. A putative large terminase gene product (TerL) was identified in 100% of the Bacillus phages and was, therefore, used for single-gene comparison (Figure 4). A dot plot alignment of the terminase gene products (TerL) confirmed our basic subcluster/cluster assignment with 100% of phages grouping by their pre-assigned subclusters and 90% by their clusters, while 12 of 15 singletons remained singletons. Cluster B phage BceA1 was the only phage that appeared to have a terminase that was not homologous to the rest of its cluster. This overall percentage (95.2%) is comparable to the 98.8% reported for the mycobacteriophages using TMP [23]. The terminase dot plot analysis is supported by a neighbor-joining tree in which all of the proteins grouped by cluster/subcluster, with the exception of BceA1 and six of the 15 singletons which associated with another cluster (Figure 5). The few outliers are consistent with a recent analysis that suggested genes encoding TerL have undergone sufficient horizontal transfer between phage groups to disrupt some correlations between terminase sequence type and cluster relationship [25].
From single-gene comparison, one of the subclusters appears to be unrelated to the rest of the cluster in which it belongs (subcluster B3 phage BceA1) while six singletons display similarity to other terminases. The SP10 terminase is similar to those in cluster I (SPO1-like), MG-B1 is similar to those in cluster B (Phi 29-like), SPP1 and BCJA1c terminases are similar to those of clusters D (TP21-L-like) and F (IEBH-like), while Bacillus virus 1 and phBC6A52 display remarkable similarity to terminases of the E cluster (Gamma d'Herelle-like). These relationships could indicate more distant/ancient relationships over the entire chromosome or small regions of genetic exchange. The limited similarity of BceA1 TerL proteins to the rest of the B cluster is consistent with its distant whole genome/proteome relationships (faint diagonal lines on both the nucleotide and amino acid dot plots, see Figure 1). Phages SP10 and MG-B1 also show significant overall similarity to the I (SPO1-like) and B clusters (Phi 29-like), respectively (see supercluster discussion below for the SP10/cluster I relationship). Very weak similarity between B cluster phages and phage MG-B1 appears in dot plots and the similarity of MG-B1 to the phi 29 family was previously reported by Redondo et al. [26]. CoreGenes analysis and genome mapping indicate 11 MG-B1 gene products in common with the entire B cluster (29% of the proteome), and they are found in the same order (Figure 6). However, MG-B1 is 7,475 bp larger than the rest of cluster B (30% larger), containing an extra 15-25 gene products by CoreGenes analysis. Further phage isolation will most likely clarify its precise relationship.
Weaker relationships are displayed by BCJA1c, Bacillus virus 1 and phBC6A52. Phage BCJA1c shares only 14-22% of its proteome with cluster D and F phages, while Bacillus virus 1 and phBC6A52 share only 10-22% of their proteome with phages in cluster E. In contrast, CoreGenes analysis suggests only small regions of genetic exchange for SPP1 in that it shares only ~5% of its proteome with the cluster D/F phages (including the terminase, tailspike, DnaB/DnaD replication protein, and the single stranded DNA binding and annealing proteins).

Predicting phage replication strategies by terminase conservation
The identification and analysis of Bacillus phage terminase proteins presented in Figures 4 and 5 can also provide valuable insight into the replication strategy of these highly uncharacterized phages by comparing their terminases to those of well-characterized phages. Such comparisons have been used to determine the replication strategy of phages that infect diverse hosts such as Enterobacteriaceae and Paenibacillus larvae [14,27]. In our analysis, several Bacillus phages contain terminases that were similar to the well-characterized SPO1 Bacillus phage, suggesting that they replicate and package their DNA by a similar concatemer strategy resulting in nonpermuted DNA with long, direct terminal repeats [28,29]. The cluster I phages had terminases of at least 87% similarity to SPO1 by BLASTP, while clusters G and J were weakly similar (~43% and ~56% similar, respectively) and singleton phage SP10 was 68% similar. Cluster E, phBC6A52 and Bacillus virus 1 terminases have weak homology to the HK97 terminase (42%-45% similarity) which packages by 3' cos ends, while phages of cluster H and singleton BanS-Tsamsa may have short DTRs due to weak homology to the Clostridium phage C terminase (~47% similarity) [30]. The B cluster phages and singleton MG-B1 have terminases that are homologues of the phi 29 terminase, suggesting they replicate DNA with a similar protein-primed replication strategy [31].
Figure 4 Single gene amino acid dot plot analysis using the large terminase mirrors whole genome cluster assignment of Bacillus phages. Bacillus phage clusters A-J are indicated on both the X- and Y-axis. Sequences for comparison were chosen by annotated large terminase gene products or a BlastP alignment to the closest relative when unannotated. Dot plots were produced using Gepard [57].

Identification of two superclusters describing distantly related phages through proteome conservation analysis
In an effort to identify more distantly related phages belonging to "superclusters", we carefully analyzed faint nucleotide and proteome dot plot lines, CoreGenes percentages, and whole genome maps for intercluster relationships. The genomic map of a representative phage from each subcluster is given in Figure 6 as an example, however the larger phages are excluded due to space constraints (clusters A through F are shown). Since short regions of similarity are common among phages, phages had to have similarity in genome content and order (synteny) to be termed a supercluster. Table 3 lists the two superclusters identified in this analysis.
Faint lines can be seen in both the nucleotide and proteome dot plots between clusters D, E and F as well as singleton PBC1. In addition, a similar genome content and order can be seen between these phages (for example phages TP21-L, Gamma and IEBH) where the first section of the chromosome contains phage structure and assembly genes and the last section harbors DNA metabolism genes (see Figure 6). These clusters also share an appreciable percentage of their proteome, with cluster D, E and F phages sharing ~21% of their proteome with at least two members of another cluster. This observation suggests an ancient relationship that has diverged. Singleton PBC1 also shares 32% of its proteome with the cluster F phages. These proteins include the portal protein, the MCP, three putative minor capsid proteins, a putative minor structural protein, the TMP, a holin, a glutaredoxin-like protein and nine hypothetical proteins. The environmental success of gamma-like phages is well documented (for a recent review see [32]). We have grouped the clusters D, E and F together with singleton PBC1 as the gamma d'Herelle-like supercluster, named after this well-characterized phage.
Clusters I, J and singleton SP10 have similar relationships, with I and J cluster phages sharing up to 27% of their proteome. Singleton SP10 shares ~29% of its proteome with cluster I phages and ~24% with cluster J phages, including several structural proteins (portal protein, MCP, minor structural protein, tail sheath, tail tube, tail assembly chaperone, tail lysin, tail fiber, tail baseplate and tail spike proteins), DNA replication proteins (DNA helicases, primase, endonuclease, exonuclease, and ribonucleotide reductase), a peptidoglycan binding protein, a tRNA processing protein, several RNA polymerase sigma factors, and hypothetical proteins. Of interest, phage SP10 had previously been described as a SPO1-related phage by its discoverers [33]. This supercluster comprised of clusters I, J and singleton SP10 is termed the SPO1-like supercluster after these well characterized phages with family members that can infect many genera [29].
DNA metabolism, cell lysis, structural, and host gene products are well conserved in Bacillus phages
Phamerator [34] was used to determine the most highly conserved gene products within the 83 fully sequenced Bacillus phages, and the extent of conservation among the phages. Phamerator identified a total of 4,442 phams, or groups of proteins with homology to one another. Of these, 894 (20%) had a predicted function and 3,548 (80%) were uncharacterized. In addition, 2,583 (58%) were orphams (phams containing a single member). This analysis confirms the highly diverse and uncharacterized nature of the Bacillus phages and underscores the immense biological reservoir that is present. Table 4 (phams with predicted function) and Table 5 (phams with uncharacterized proteins) contain the highly conserved phams that have twenty or more members. These phams are partitioned by their function as DNA replication/metabolism proteins, virion structure and assembly proteins, cell lysis proteins, or proteins involved in gene expression or host function. It is important to note that there may be other proteins with similar function not included in a pham due to lack of sufficient homology.
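The pham counts reported above can be checked with a few lines of Python, and the orpham definition (a pham with a single member) translates directly into a count. This is a sketch for illustration only: the function names and the toy pham table are hypothetical and are not Phamerator output.

```python
def percent(part, whole):
    """Percentage rounded to the nearest whole number, as in the text."""
    return round(100 * part / whole)


def pham_summary(phams):
    """Return (number of phams, number of orphams) for a pham table."""
    orphams = sum(1 for members in phams.values() if len(members) == 1)
    return len(phams), orphams


if __name__ == "__main__":
    total, with_function, orphams = 4442, 894, 2583  # counts from the text
    print(total - with_function)           # uncharacterized phams
    print(percent(with_function, total))   # percent with predicted function
    print(percent(orphams, total))         # percent orphams

    # Hypothetical toy pham table: pham name -> member gene products.
    toy = {"pham_a": ["gp1", "gp2"], "pham_b": ["gp3"], "pham_c": ["gp4"]}
    print(pham_summary(toy))
```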

DNA replication/metabolism
The most highly conserved Bacillus phage gene product is a class I ribonucleotide reductase (RNR, pham 247), with homologs found in 33 of the 83 phages; four phages have multiple homologs. RNR forms deoxyribonucleotides from ribonucleotides for DNA biosynthesis and is commonly found in lytic phages [35]. Other well-conserved proteins for nucleotide metabolism include a dihydrofolate reductase (conserved in 27 phages), thymidylate synthase (conserved in 24 phages), deoxynucleotide monophosphate kinase (conserved in 23 phages), fumarate reductase (conserved in 23 phages), deoxyuridine diphosphatase (DUT, conserved in 23 phages), RNR beta subunit (conserved in 22 phages) and a glutaredoxin-like protein (conserved in 22 phages). Many putative proteins involved in DNA replication and recombination were also identified including a DNA helicase (conserved in 28 phages), DNA exonuclease and endonuclease (conserved in 28 and 27 phages, respectively), DNA polymerase (conserved in 26 phages), two chromosome segregation proteins (conserved in 25 and 22 phages), and a Mre11-like nuclease, replicative helicase, DNA polymerase III, RecA homolog and DNA primase (each conserved in 23 phages). These results underscore the vital nature of efficient nucleotide metabolism in the propagation of lytic phages.
Figure 5 A neighbor-joining tree analysis of the Bacillus terminase mirrors whole genome cluster assignments. Phage names are colored by whole genome subcluster assignment, and this subcluster assignment is indicated on the right. Putative replication strategies for phages are also indicated when known. Abbreviations are direct, terminal repeats (DTR) and cohesive ends (cos). The phylogenetic tree was constructed using a MUSCLE [58] alignment and the neighbor-joining method in Mega5 [59]. Bootstrapping was set to 2000 and the unrooted tree was collapsed at a less than 50% bootstrap value.

Virion structure and assembly proteins
The structural and assembly proteins of the virion are also highly conserved gene products within the Bacillus phages, with phams consisting of an MCP, large terminase, portal protein, capsid structural protein, baseplate, tail sheath, and a tail lysin all having homologs in 28 of the 83 phages (34%). In addition, a procapsid protease, tail adsorption protein, tail lysin, virion structural protein, baseplate and another terminase have homologs in at least 23 of the 83 phages. These structural proteins are conserved among phages that are known myoviruses and siphoviruses, although the podoviruses and tectiviruses should also contain an MCP, portal protein and terminase. As discussed above, we were able to identify a large terminase for all of the Bacillus phages, meaning that these gene products had homologues that were somewhat characterized, but not homologous to the prevalent pham. In contrast, we were unable to identify an MCP for many of the Bacillus phages, suggesting that homologs have not been described and emphasizing the need for further characterization of Bacillus phages. In support of this finding, recent studies have shown that MCPs bearing no amino acid sequence similarity can harbor similar folds [36][37][38][39][40], hampering identification by sequence alone.

Cell lysis
Cell lysis proteins are vital to the phage lifecycle, allowing them to exit the cell and infect other hosts. Five cell lysis proteins were well conserved including a cell wall hydrolase and murein-transglycosylase (each conserved in 28 phages), two holins (each conserved in 23 phages) and a lysozyme-like protein (conserved in 23 phages).

Host functions/pathogenesis
Several gene products that are likely to regulate host functions were also highly conserved in Bacillus phages. A protein containing a bacterial SH3-like domain was identified in 25 of the 83 phages, including phages from cluster C, E, F, and J as well as the singletons phiCM3 and BanS-Tsamsa. The function of this protein is unknown but the SH3 domain is thought to mediate the assembly of large multiprotein complexes [41]. In addition, the cAMP regulatory protein (CRP) is found in 23 phages and may be used to control the expression of host carbon metabolism genes, which can contribute to bacterial virulence [42]. An FtsK/SpoIIIE-like cell division protein (gp22 in phage Cherry) was conserved in 23 of the phages (pham 370). This protein may control host transition into the sporulation state, contributing to the environmental fitness of B. anthracis [43]. As discussed above, pham 252 contains 23 DUT homologues, which are common in many bacteriophages and have been shown to function as G protein-like regulators required for the transfer of staphylococcal virulence factors [44,45].
There are several other, less conserved proteins that may contribute to host pathogenesis. Five Bacillus phages (SPO1, CampHawk, Pegasus, JL, and Shanette) encode a PhoH-like protein that aids in bacterial survival under phosphate starvation [46,47]. Genes belonging to the phosphate regulon are reportedly very common in marine phages (40%) while they are less common in non-marine phages (4%) [48], in good agreement with our identification of PhoH in 5.4% of the Bacillus phages.
Subcluster E1 phages encode resistance to the soil antibiotic fosfomycin, which may account for the resistance [...] has been used in the treatment of mycobacterial infections and resistance is a feature of many pathogenic bacteria. In fact, resistance is commonly used for the identification and isolation of Shiga toxin-producing E. coli [49].
Figure 6 A comparison of gene content and order within the Bacillus phage clusters reveals modularity and great diversity. Genome maps for representative phages from the subclusters within Bacillus phage clusters A-F are provided along with singleton MG-B1. Phages were mapped using Phamerator [34], where purple lines between phages denote regions of high nucleotide similarity and the ruler corresponds to genome base pairs. Boxes for gene products are labeled with predicted function, occasionally numbered, and colored to indicate similarity between the phages (E-value < 1e-4). Abbreviations are adenosine triphosphatase (ATPase), DnaB helicase (DNAB), double-stranded DNA binding (dsDNA binding), 2'-deoxyuridine 5'-triphosphatase (dUTPase), major capsid protein (MCP), N-acetyl-muramyl-L-alanine amidase (NAM amidase), pyrophosphate reductase (PP reductase), RNA polymerase (RNAP), sigma factor (σ factor), large terminase (TerL), small terminase (TerS), tape measure protein (TMP), pilus specific protein, ancillary protein involved in adhesion (SpaF1), single-stranded binding protein (SSB), single-strand recombinase (SS recombinase).

The comparison of subcluster and bacterial host reveals evolutionary boundaries
The Bacillus hosts in this study can be assembled into two separate groups by relatedness, and this evolutionary boundary may define phage boundaries and predict barriers for pathogenic gene transfer. B. subtilis, B. megaterium and B. pumilus are more closely related to each other than they are to the Bacillus cereus group of bacteria, which comprises B. cereus, B. anthracis, B. thuringiensis, B. weihenstephanensis, B. mycoides and B. pseudomycoides [50,51]. To determine whether such boundaries exist between phages and their hosts, the host from which each phage was isolated was compared within each cluster and subcluster.
The cluster to bacterial host relationship was somewhat ambiguous, with 70% of clusters populated by phages from only closely related Bacillus species (clusters A, B, C, D, E, F, and G) and others (clusters H, I and J) harboring phages from more distantly related Bacillus species (see Table 2). However, within these latter clusters there is a clear division at the subcluster level in that B. subtilis, B. pumilus, and B. megaterium phages always fall into a separate subcluster from phages that infect B. cereus, B. thuringiensis, B. anthracis, and B. weihenstephanensis. In fact, 22 of the 24 subclusters (92%) are divided by species, even when the cluster contains closely related species (the exceptions are subclusters J5 and J8, but these have closely related species). More phages are clearly needed to understand the host diversity within clusters, however, because only four clusters contain phages from diverse hosts (phages from both a B. subtilis, B. pumilus, or B. megaterium host and from a Bacillus cereus group host). In addition, this analysis was performed using only the host from which the phage was isolated, since the host range of most of these phages is unknown. Host range studies will provide greater insight. For example, a recent finding that phage BPC78 inhibits both B. cereus and B. subtilis suggests that some phages are able to overcome this apparent host boundary [52]. The subcluster to host analysis also suggests a closer relationship between the B. thuringiensis and B. cereus species when compared to B. anthracis, since there is a subcluster division between B. anthracis phages and those that infect B. thuringiensis or B. cereus (see clusters A and E versus J). This apparent evolutionary separation is surprising given the recent report of five phages that infect B. anthracis and B. thuringiensis as well as the B. cereus host on which they were isolated (BanS-Tsamsa [53], Bc431v3 [54], and JL, Shanette, and Basilisk [40]).
*Pham #'s are specific to this analysis due to assignment by Phamerator [34].
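The subcluster-to-host comparison described in this section can be illustrated with a minimal sketch. The two host groupings follow the species lists given in the text, but the label "subtilis group", the function names, and the phage records are hypothetical placeholders and not data from Table 2. The sketch only shows how one might tally whether each subcluster is restricted to a single isolation host species or a single host group.

```python
from collections import defaultdict

# Host groupings as listed in the text. The label "subtilis group" is our
# own shorthand for B. subtilis, B. megaterium and B. pumilus.
CEREUS_GROUP = {
    "B. cereus", "B. anthracis", "B. thuringiensis",
    "B. weihenstephanensis", "B. mycoides", "B. pseudomycoides",
}
SUBTILIS_GROUP = {"B. subtilis", "B. megaterium", "B. pumilus"}


def host_group(species):
    """Map an isolation host species to one of the two host groups."""
    if species in CEREUS_GROUP:
        return "cereus group"
    if species in SUBTILIS_GROUP:
        return "subtilis group"
    return "other"


def summarize(records):
    """records: iterable of (phage, subcluster, isolation_host) tuples."""
    hosts = defaultdict(set)
    for _phage, subcluster, species in records:
        hosts[subcluster].add(species)
    summary = {}
    for subcluster, species_set in hosts.items():
        groups = {host_group(s) for s in species_set}
        summary[subcluster] = {
            "single_species": len(species_set) == 1,
            "single_group": len(groups) == 1,
        }
    return summary


def fraction_single_species(summary):
    """Fraction of subclusters whose phages share one isolation host species."""
    return sum(v["single_species"] for v in summary.values()) / len(summary)


# Hypothetical records, for illustration only (not taken from Table 2).
example = [
    ("phage1", "X1", "B. subtilis"),
    ("phage2", "X1", "B. subtilis"),
    ("phage3", "X2", "B. cereus"),
    ("phage4", "X2", "B. thuringiensis"),
    ("phage5", "X3", "B. anthracis"),
]
result = summarize(example)
```

Because only the isolation host is recorded, a tally like this inherits the same limitation noted in the text: it says nothing about the full host range of each phage.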

Conclusions
Phages are intimately linked to the ecology and evolution of their hosts, making characterization vital to understanding the diversity and evolution of the Bacillus genus. Herein we described the comparison of 83 fully sequenced Bacillus phages and their grouping into 10 clusters, 15 singletons and 24 subclusters (see Tables 1 and 2). In addition, two groups of more distantly related phages were identified and termed "superclusters", namely the SPO1-like and gamma d'Herelle-like. This analysis of Bacillus phages may aid in understanding newly isolated phages as well as the enormous complexity of tailed phages. It may also serve as a reference for comparisons to phages that infect other genera. Other such large-scale analyses are of 491 phages that infect Mycobacterium and of 337 phages that infect the Enterobacteriaceae family. Hatfull et al. grouped the Mycobacteriophages into 17 "clusters" of related phages (A-Q) and 14 singleton clusters [13], while Grose and Casjens grouped the Enterobacteriaceae phages into 38 clusters of related phages and 18 singleton clusters [14]. In contrast to both of these phage groups, the Bacillus singletons outnumber the Bacillus clusters, presumably due to the smaller number of total phages isolated (83 phages as compared to 491 or 337). It should also be noted that additional Bacillus phage isolation will most likely require future revision of these cluster assignments, as phages may be isolated that unite clusters.
Our analysis revealed several clusters of highly related phages (clusters C, D, F and G), and other clusters that contained very diverse phages (A, B, E, H, I, and J) (see Figures 2 and 3). Due to the low number of Bacillus phages isolated and the apparent expected diversity, it is currently unknown if these differences reflect differences in phage lifestyles, or if they occur due to sampling biases. Our analysis also revealed the need for using several analytical techniques to group phages, since one technique may suggest apparent relatedness that is weak by other techniques.
In addition to whole genome analysis, analysis of Bacillus phage gene products further underscores the enormity of Bacillus phage diversity, with 80% of protein phams (3,548) consisting of uncharacterized proteins. Because several phams of known function were identified that may contribute to host pathogenicity, understanding these uncharacterized phams is critical to understanding the evolution of pathogenic Bacillus strains.
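The pham figures quoted in the text (4,442 phams, 894 with a predicted function, 2,583 orphams) can be cross-checked with a short sketch. The counts come from the text. The function names and the toy pham table are hypothetical, and the code is not how Phamerator itself computes these values.

```python
def percent(part, total):
    """Percentage of total represented by part, rounded to the nearest integer."""
    return round(100 * part / total)


def pham_summary(phams, annotated):
    """phams: dict mapping pham id -> list of member genes.
    annotated: set of pham ids that have a predicted function."""
    total = len(phams)
    # An orpham is a pham with exactly one member.
    orphams = sum(1 for members in phams.values() if len(members) == 1)
    with_function = sum(1 for pham_id in phams if pham_id in annotated)
    return {
        "total": total,
        "orphams": orphams,
        "uncharacterized": total - with_function,
    }


# Counts reported in the text.
TOTAL_PHAMS = 4442
WITH_FUNCTION = 894
ORPHAMS = 2583

uncharacterized = TOTAL_PHAMS - WITH_FUNCTION  # 3,548 phams
pct_uncharacterized = percent(uncharacterized, TOTAL_PHAMS)
pct_with_function = percent(WITH_FUNCTION, TOTAL_PHAMS)
pct_orphams = percent(ORPHAMS, TOTAL_PHAMS)

# Hypothetical toy table, for illustration only.
toy = pham_summary(
    {"p1": ["g1", "g2"], "p2": ["g3"], "p3": ["g4"]},
    annotated={"p1"},
)
```

The three percentages round to the 80%, 20% and 58% stated in the text.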
The analysis of Bacillus phage evolutionary boundaries suggests that close phage relationships (defined by subclusters) are restricted by the relatedness of the host, with the phages that infect the Bacillus cereus group of bacteria more similar to each other than to those that infect B. subtilis, B. megaterium and B. pumilus. This analysis of host versus cluster is not only beneficial to understanding the evolution of Bacillus species but may also indicate phage clusters more suitable for targeted phage therapy of pathogenic B. cereus and B. anthracis strains.

Computational analysis and genomic comparison
Bacillus phage sequences were retrieved from GenBank and the Bacillus Phage Database at PhagesDB.org as well as by contact with the authors of this website. To ensure retrieval of all Bacillus phages from GenBank, the major capsid protein (MCP) from at least one phage in each cluster was used to retrieve all phages with similar MCP sequence via TBLASTN [55]. Genomic maps of each phage were prepared using Phamerator [34], an open-source program designed to compare phage genomes. Phamerator was also used to calculate the percent G/C, number of ORFs and protein families or phams. The percentage of the proteome conserved was identified using the program CoreGenes 3.0 at the default BLASTP threshold of 75 [19,20], while average nucleotide identity (ANI) was calculated by Kalign [56]. Dot plots were generated using Gepard [57]. For ease in dot plot analysis, long direct terminal repeats were removed from some phages, other phage genomes were reverse complemented, and new bp one calls were made to reorient according to the majority of phages within a cluster. In addition, a portion of the PZA nucleotide sequence was reverse complemented to allow alignment with other phages of the cluster. Whole genome amino acid sequences were retrieved from Phamerator [34].
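Two of the sequence manipulations mentioned above, calculating percent G/C and reverse complementing or reorienting a genome before dot plot analysis, can be sketched in a few lines. This is an illustrative sketch only: it is not the code used by Phamerator or Gepard, the function names are hypothetical, and the rotation in `reorient` assumes the sequence can be treated as circularly permuted, which the text does not state.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def gc_percent(seq):
    """Percent G/C over the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq, new_start):
    """Rotate seq so that the 0-based index new_start becomes base pair one.
    Assumes the genome can be treated as circularly permuted."""
    return seq[new_start:] + seq[:new_start]


# Toy sequence, for illustration only.
toy = "ATGCC"
toy_gc = gc_percent(toy)
toy_rc = reverse_complement(toy)
toy_rotated = reorient(toy, 2)
```

Applying the same orientation to every genome in a cluster before plotting keeps the main diagonal of the dot plot aligned, which is the stated purpose of the reorientation step.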
The terminase phylogenetic tree was constructed using a MUSCLE [58] alignment and the neighbor-joining method in MEGA5 [59]. Bootstrapping was set to 2000 replicates and the unrooted tree was collapsed at bootstrap values of less than 50%. Sequences for comparison were chosen by annotated large terminase gene products or, when unannotated, by a BLASTP alignment to the closest relative.
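Collapsing a tree at a bootstrap threshold means removing internal branches whose support falls below the cutoff and attaching their descendants directly to the parent node. The sketch below illustrates that idea on a simple nested-dictionary tree. The data structure, function name and example tree are hypothetical, and this is a generic illustration rather than MEGA5's implementation.

```python
def collapse(node, threshold=50.0):
    """Return a copy of the tree in which internal nodes with bootstrap
    support below threshold (in percent) are merged into their parent.
    Nodes are dicts with optional keys "name", "support" and "children"."""
    if not node.get("children"):
        return dict(node)  # leaf: nothing to collapse
    new_children = []
    for child in node["children"]:
        child = collapse(child, threshold)
        support = child.get("support")
        is_internal = bool(child.get("children"))
        if is_internal and support is not None and support < threshold:
            # Weakly supported branch: promote its children to this level.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    out = {key: value for key, value in node.items() if key != "children"}
    out["children"] = new_children
    return out


# Hypothetical tree, for illustration only.
tree = {
    "children": [
        {"name": "A"},
        {
            "support": 30.0,
            "children": [
                {"name": "B"},
                {"support": 90.0, "children": [{"name": "C"}, {"name": "D"}]},
            ],
        },
    ],
}
collapsed = collapse(tree)
top_level = [child.get("name", "clade") for child in collapsed["children"]]
```

In this toy tree the 30% branch is removed, so B and the well supported C/D clade attach directly to the root, while the 90% clade is retained.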