To first determine the whole-genome relatedness of phages ΦCP9O, ΦCP13O, ΦCP26F, and ΦCP34O to each other and to other Clostridial phages, we used two approaches: correlations of tetra-nucleotide frequencies and clustering of predicted proteins based on sequence similarities. The results of both methods were consistent with each other and demonstrated close genomic relationships among our phages, more distant relationships to other Clostridial phages, and consistent correlations between phage and host phylogenies. Our phages were generally quite closely related - both techniques showed that the genomes of ΦCP34O and ΦCP13O were most closely related to each other and formed a distinct group from ΦCP26F and ΦCP9O (Figure 1). All four genomes were similar to the genome of ΦCP39O, previously published by our research group [6], and belonged to a larger clade (Figure 1B, 1C) containing ΦCPV1, a C. perfringens phage isolated in Russia [7]. Genomic comparisons of our phages to two other C. perfringens phage genomes (ΦSM101 and Φ3626), three C. difficile-infective phages (ΦC2, ΦCD27, and ΦCD119), and one C. botulinum-infective phage (ΦC-St) showed phage phylogeny closely associated with host phylogeny (Figure 1B, 1C). Our results of nearly identical topologies between tetra-nucleotide and proteomic trees is consistent with previous uses of tetra-nucleotide distributions as genomic signatures [15, 16] and to infer co-evolution between virus and host [17].
Core and accessory genomes of Clostridial phages
To determine if our phages contain a common set of genes shared with other Clostridial phages, we compared predicted ORFs based on classifications of clusters of orthologous groups (COGs) among the three host groups shown in Figure 1. COGs represent individual proteins or groups of paralogs from at least three lineages corresponding to ancient conserved domains (http://www.ncbi.nlm.nih.gov/COG/) and thus provide an informative means to compare conserved functions across genomes [18].
Three COGs were shared among bacteriophages infecting C. perfringens, C. difficile, and C. botulinum (Figure 2). These shared COGs were COG5412, annotated as a phage-related protein of unknown function; COG0629, a single-stranded DNA-binding protein; and COG0860, a phage endolysin, N-acetylmuramoyl-L-alanine amidase (Figure 2). Endolysins, together with holins, are the key bacteriophage-encoded enzymes involved in cell wall degradation and lysis of the host and are typically transcribed from adjacent ORFs in the phage genome [8, 19–21]. To better understand the evolution and natural variability of an endolysin in its genomic context, we investigated the phylogeny of the N-acetylmuramoyl-L-alanine amidase across multiple host genera and compared the phylogeny and host associations to the domain architecture of the endolysin-holin gene neighborhood.
Statistical associations between domain architecture and phylogeny
To compare our phage sequences and domain architecture to others, we retrieved amidase sequences belonging to the pfam protein family PF01520 from 26 publicly available bacteriophage genomes (Additional file 1, Table S1) and analysed these as fully described in the methods. Bacteriophage endolysins typically contain two domains: an enzymatically active domain and a cell wall binding domain, some of which have been elucidated with crystal structures [22]. We constructed an alignment of both putative domains after building a Hidden Markov Model from representative sequences in the Conserved Domain Database belonging to PF01520 and considering only columns with >10% sequence conservation to eliminate highly variable positions and control for sequence length heterogeneity.
Several interesting conclusions could be drawn from these analyses. First, to determine whether there is a significant association between the phylogeny of the amidase protein and the identity of the bacterial host, we used the UniFrac statistic [23] which assesses unique versus shared branch lengths by host for the observed tree relative to a null distribution of host groups randomly permuted within the tree. Significant clustering by host group was found with both UniFrac (p < 0.001) and the Parsimony test (p < 0.001) which performs a similar analysis based on tree topology [24]. The association between phage lytic enzymes and host is well-known [25]; here we show a strong and statistically significant association between the N-acetylmuramoyl-L-alanine amidase phylogeny and host for a large number of phages across five host genera (Figure 3).
Second, to better understand the genomic context of the amidase protein and associated holin genes, we used the same statistical approaches to formally compare the association between the domain architecture and phylogeny of the amidase protein. The five phages sequenced by our group belong to their own clade within the amidase tree and were the only genomes in which the holin is immediately downstream of the amidase protein in the presumed direction of transcription, a reversed arrangement of the typical domain architecture (Figure 3). Interestingly, though Φ3626, ΦC2, and ΦCD27 belonged to a sister clade, this domain architecture was unique even among these other Clostridial phages (Figure 3). To confirm this domain architecture for our phages, we re-sequenced the appropriate regions of Φ9O, Φ13O, Φ26F, Φ34O, and several other phage isolates, all of which shared the amidase-holin arrangement. Holin genes were identified using multiple sequence-similarity approaches as described in detail in the methods, and included identifications of transmembrane domains. The association between gene phylogeny and domain architecture was strongly significant as determined by UniFrac (p < 0.001) and P tests (p < 0.001).
Because lysis of bacterial cells generally requires both an endolysin and a holin - membrane disruption (the function of the holin) is considered to be requisite for the endolysin to attack the peptidoglycan [19] - understanding the phylogenetics and genomic context of these genes are important milestones to develop biotechnological applications. The unusual domain architecture we observed suggests that either the typical gene order or the reverse is a successful evolutionary strategy. The transcriptional regulation of these genes in our phages remains unknown, but searches for transcriptional promoters and terminators using BPROM (Softberry, Inc., Mount Kisco, NY, USA; http://linux1.softberry.com/berry.phtml) and TransTerm (http://nbc3.biologie.uni-kl.de) did not find either within the regions of our endolysin and holin genes; these genes may be co-transcribed. Efficacy of the endolysin as recently demonstrated for phages ΦCP26F and ΦCP39O [8] could potentially be improved by successful holin purification.
Genomic arrangement and context of orthologs
Twenty-one pfam families were identified among the four phage genomes (Figure 4). Of these, only one, PF04233, annotated as a homolog of phage Mu protein gp30, was found in only one genome (ΦCP13O). Three other pfams were found in 2-3 genomes and were absent from the other(s). A prophage antirepressor (PF02498) was present and 100% identical in the genomes of ΦCP9O, ΦCP13O, and ΦCP26F, but, interestingly, a syntenous protein of ΦCP34O (gene product 22, Figure 4), had no significant sequence similarity to these sequences based on pairwise blastp and no significant matches to any pfam domains. Similarly, 3'-phosphoadenosine 5'-phosphosulfate sulfotransferase (PAPS reductase)/FAD synthetase (pfam01507) genes were present in the genomes of ΦCP13O and ΦCP34O with 100% pairwise sequence similarity, but approximately syntenous ORFs in the genomes of ΦCP9O and ΦCP26F had no significant blastp similarity to COG0175 and did not match any pfam domains. The majority of pfams (17/21) were present in all four bacteriophage genomes (Figure 4). Detailed statistics for each genome are shown in Additional file 2, Table S3.
Conservation and variability of core genome
To investigate shared genes in more detail and to classify the majority of predicted ORFs which were not assigned to COGs or pfams, we next compared the distributions of pfams and sequence-similarity groups derived by clustering of all predicted ORFs across all four genomes to determine a core and accessory genome (Figure 5). Most gene clusters (41/61) were shared by all four genomes on the basis of sequence similarity (Figure 5a). Of the 17 pfam families that were common to all four genomes, we considered 12 to represent a 'conserved core genome', and five to represent a 'variable core genome' based on pairwise sequence similarities (Figure 5b). The five pfam families in the core genome containing highly variable genes were: PF01520, the N-acetylmuramoyl-L-alanine amidase (COG0860); PF11753 of unknown function; PF10145, a tail tape measure protein (COG5412); PF 02511, a thymidylate synthase (COG1351) involved in nucleotide transport and metabolism; and PF11651, a P22 coat protein (Figure 5b).
In the conserved core genome, genes within each of the 12 pfam families were very similar to each other, with a maximum pairwise sequence difference of 8% based on amino acid alignments with bl2seq (Figure 5b). Genes belonging to these 12 pfam families were involved in the following functions: tail protein, phage anti-repressor, ssDNA binding, portal protein, minor structural protein GP20, hydrolase, CHC2 zinc finger, terminase large subunit, virulence-associated protein E, and the holin (Figure 5b).
The holin genes were among the most conserved, with 100% identity among all sequences, and the amidase genes were the most variable (Figure 5b), suggesting these two genes are subject to very different rates of evolution despite their colocation in the genome and paired function in the lytic cycle. Holins target the relatively invariable cytoplasmic membrane, while phage endolysins recognize and degrade the cell wall, which is highly variable. It has been suggested that holins may function as a type of lysis clock, governing the timing of lysis of the host [26]. As the primary determinant of the length of the infective cycle, holins can be considered to experience stabilizing selection as there are opposing fitness advantages to extending the vegetative cycle and allowing phage replication versus lysing the host to release progeny phage to infect new host cells [19]. In contrast, the phage endolysins generally contain an enzymatically active domain and a cell-wall binding domain which recognizes highly-specific ligands on the host cell surface [27], and thus each domain is under strong directional selective pressures. Our data clearly show strong sequence conservation of the holin protein, and very distinct sequence types within the associated amidase for a group of closely related phages.
Detailed sequence comparisons of variable core genome and association with host genotype
For the four pfam families with known functions in the variable core genome, multiple-sequence alignments of the four genomes presented here and ΦCP39O sequenced previously by our group [6] revealed some striking differences in amino acid length and content. For all four proteins, two very distinct sequence types were represented.
For the amidase, ΦCP39O and ΦCP34O were most closely related and clearly distinct from the sequence types of ΦCP9O, ΦCP13O, and ΦCP26F (Figure 6a). The N-terminal portion of the protein from amino acid residues 1-166 of the multiple sequence alignment was the most variable portion of the protein (Figure 6a), and corresponds within approximately 5 residues to the enzymatically active domain (EAD) determined structurally and experimentally [22] for the endolysin from Listeria phage 2389 (NC_003291). The C-terminal portion of the protein, corresponding to the cell wall binding domain (CBD) of Listeria phage 2389 is more conserved than the EAD in our phages (Figure 6a).
The tape measure proteins (PF10145/COG5412) of ΦCP26F, ΦCP9O, and ΦCP39O were all 780AA long and 96% similar to each other and quite different from those of ΦCP13O and ΦCP34O. The tape measure proteins of ΦCP34O and ΦCP13O were 95% similar to each other, but only 473AA residues in length with a 225 AA N-terminal portion of the protein encoded by another ORF immediately upstream in the genome. For the portion of the protein encoded by a single reading frame, alignments of these five sequences revealed a deletion of 89 residues in the tape measure proteins of ΦCP34O and ΦCP13O (Figure 6b). Whether these represent gene fissions or fusions, or insertions or deletions relative to the ancestral state remains unknown, as do the consequences for the structure and function of the protein, but clearly these questions warrant further study.
For the thymidylate synthase (PF02511/COG1351), the phage relatedness patterns were the same as for the tape measure protein, with ΦCP34O and ΦCP13O containing a similar genotype distinct from that of ΦCP9O and ΦCP39O, largely defined by a variable region from residues 93-139 (Figure 6c). Similarly, the P22 coat proteins (PF11651) of ΦCP13O and ΦCP34O were distinct from those shared by ΦCP9O, ΦCP26F, and ΦCP39O (Figure 6d).
In contrast to these groupings, genomic fingerprints of the C. perfringens host based on rep-PCR defined three main host groups: 1) Cp34O and Cp9O, 2) Cp13O and Cp39O, and 3) Cp26F as a more distantly related group (Figure 6e). Interestingly, the single gene phage similarities based on the tape measure protein, the thymidylate synthase, and the coat protein reflected the whole-genome groupings shown in Figure 1 with ΦCP13O and ΦCP34O most similar to each other and ΦCP9O, ΦCP26F, and ΦCP39O forming a separate group. In contrast, sequence similarities based on the amidase protein were not concordant with the other genes in the core genome or the whole-genome clustering. Based on these data, we concluded that the selective pressures on the amidase genes for these phages are somehow unique from the rest of the genome. This result may have important implications for potential biotechnological applications in which amidase proteins are used separately or together with other gene products such as holins for bacterial control.
Endolysin protein structure
To investigate the association between the sequence variability of our phages and the structure of the EAD and the CBD of the amidase, we constructed a structural model using as a template a related structure from a Listeria phage (PDB; 1XOV) previously solved with crystallography [22]. Comparative modelling of bacteriophage lytic enzymes is becoming a common tool to inform the development of phage lysin-based biocontrol agents [28]. N-acetylmuramoyl-L-alanine amidases are one of at least six types of phage endolysins and attack the amide bonds between the amino sugar MurNAc and L-Ala of the cross-linking peptide stem in the peptidoglycan layer of the host cell wall [21]. The specificity of the enzyme is thought to be due to recognition of specific ligands on the host cell surface by the CBD [21]. Our modeling revealed that the enzymatic core is formed by a twisted, six-stranded β-sheet flanked by six helices (α1-α6) linked through a loop region to the cell wall binding domain which consists of two anti-parallel β-sheets (Figure 7). The areas of highest sequence conservation were concentrated in the CBD and the central portion of the enzymatic domain (Figure 7). Several point mutations within the CBD may contribute to its specificity, but interestingly, for our phages, the N-terminal EAD was much more variable than the CBD, suggesting much higher diversifying selective pressures on this portion of the protein.