Automated classification of tailed bacteriophages according to their neck organization

Background The genetic diversity observed among bacteriophages remains a major obstacle for the identification of homologs and the comparison of their functional modules. In the structural module, although several classes of homologous proteins contributing to the head and tail structure can be detected, proteins of the head-to-tail connection (or neck) are generally more divergent. Yet, molecular analyses of a few tailed phages belonging to different morphological classes suggested that only a limited number of structural solutions are used in order to produce a functional virion. To challenge this hypothesis and analyze proteins diversity at the virion neck, we developed a specific computational strategy to cope with sequence divergence in phage proteins. We searched for homologs of a set of proteins encoded in the structural module using a phage learning database. Results We show that using a combination of iterative profile-profile comparison and gene context analyses, we can identify a set of head, neck and tail proteins in most tailed bacteriophages of our database. Classification of phages based on neck protein sequences delineates 4 Types corresponding to known morphological subfamilies. Further analysis of the most abundant Type 1 yields 10 Clusters characterized by consistent sets of head, neck and tail proteins. We developed Virfam, a webserver that automatically identifies proteins of the phage head-neck-tail module and assign phages to the most closely related cluster of phages. This server was tested against 624 new phages from the NCBI database. 93% of the tailed and unclassified phages could be assigned to our head-neck-tail based categories, thus highlighting the large representativeness of the identified virion architectures. Types and Clusters delineate consistent subgroups of Caudovirales, which correlate with several virion properties. Conclusions Our method and webserver have the capacity to automatically classify most tailed phages, detect their structural module, assign a function to a set of their head, neck and tail genes, provide their morphologic subtype and localize these phages within a “head-neck-tail” based classification. It should enable analysis of large sets of phage genomes. In particular, it should contribute to the classification of the abundant unknown viruses found on assembled contigs of metagenomic samples. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-1027) contains supplementary material, which is available to authorized users.


Supplementary figure 1 A. Inter-gene distance matrices resulting from the analysis of the Aclame genomes
Subfamilies of genes whose relative positions are tightly restrained (STD Error Distance < 3, underlined in yellow in the matrices): -MCP, Portal -Portal, Ad4

B. Graphical representation of the average neck gene organization in phages with known morphology and neck Type
The numbers of identified phages with the corresponding gene arrangement are indicated in brackets. 248 Type 1 and 3 phages share a similar organization, defined by the cluster TermL-(x) 1-3 -Portal-(x) 2-3 -MCP-x-Ad-(x) 0-1 -Hc. In the remaining 17 Myoviridae of Type 2 and 10 Podoviridae of Type 4, specific conserved patterns are observed. The gene names and colors are defined in Figure 1. Ne1 corresponds to an additional Type 1 neck protein identified through this work (see Results section).

Supplementary figure 2 Secondary structure analysis of Type 1 neck Ne1 proteins identified in (A) Siphoviridae and (B) Myoviridae.
Profile-profile comparison using HHsearch provided sequence alignments for Ne1 proteins. Because of the high sequence divergence of these proteins, only alignments of the predicted secondary structure elements are presented for a set of Ne1 proteins (identified by their Genbank Identifyers). This set of proteins is representative of most detected Ne1 proteins. Indeed, Aclame classifies proteins into families of homologs.
In (A), patterns representative of the 5 Aclame Ne1 families comprising more than 10 proteins are displayed, together with the secondary structure patterns of the shortest (4604) and longest (6375) Ne1. In (B), patterns representative of the 3 Aclame Ne1 families comprising more than 5 proteins are displayed, together with the secondary structure patterns of the shortest (102338) and longest (3787) Ne1. All the displayed patterns (except that of 4604) correspond to proteins detected as analogous to SPP1 gp16.1 by HHSEARCH using a confidence threshold of 95%.
Protein 4604 was detected with a confidence threshold of 80%. Correspondence between the Aclame identifiers found on this figure and the GenBank identifiers is the following (using the same order): in fig

Matrices of profile-profile comparison and identity scores (A) within a typical Type 1 Cluster (Cluster 1) and (B) between Type 1 Clusters
(A) Analysis of Cluster 1 phage neck homologies. HHsearch probability scores are plotted above the diagonal and sequence identity percentages are plotted below the diagonal. Both use the following legend: (B) Analysis of phage neck homologies between Clusters. HHsearch probability scores are plotted using the same codes as in (A).

Supplementary Figure 4
Distribution of phage types and clusters over the different bacterial clades organised using the circular tree defined in (Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Science. (2006)  . Names of the bacterial clades are indicated in the inner circle with pink and cyan arches spanning the different bacterial species. The coverage of the clusters and types are indicated by grey arches labelled using colored square boxes in the outer circle.

Supplementary Figure 5
Global trees representing the global classification of phages from Aclame when only (A) the the Portal proteins or (B) the MCP-TermL-Portal triad proteins were used to calculate the similarity matrix between phages. The similarity scores were calculated using the same metric as described in the manuscript combining HHsearch probabilities and sequence identities. The background colors report for the Clusters and Types that were defined using the multiproteins classification. Along the external circle, small boxes indicate the cluster assignment for every phage (T1.x stands for Type 1 Cluster x , T2 for Type 2, etc ). A colored star was added outside the circle to pinpoint inconsistent grouping of some phages outside the Clusters defined from the multi-proteins classification.
The inconsistencies were rated using three criteria. First, we controlled whether the four types are properly discriminated. Second, we analyzed the clusters generated and whether the phages that are grouped together shared the same host categories (Firmicutes/Proteobacteria/Actinobacteria) as in Figure 3. Third, we also used as sensitivity markers pairs of phages such as N15 (Siphoviridae)/ VP882 (Myoviridae), grouped together in Cluster 6 (Type 1). These two phages, although quite distant, share the same biological properties of a linear genome containing a protelomerase gene, suggesting that they belong to a group diverged from a common ancestor [1]. Other methodologies were tested to produce the global tree with the single Portal marker : a multiple sequence alignment of all Portal sequences using Mafft linsi algorithm, trimming or not the gaps, using either Neighbour-Joining algorithm or more advanced methodologies included in the PhyML method to calculate the tree. In none of these cases, the derived classification was better than the one produced using the HHsearch-based similarity matrix shown in Suppl. Fig. 5A.

Altogether, our metrics supports that MCP, TermL and Portal are indeed informative and that
Portal is probably the best single protein marker to classify phages. However, Portal can also diverge in evolution while head-to-tail connection proteins still show consistent evolutionary links with proteins from other phages. We have added a new Figure, Figure 5, to illustrate this observation. To appreciate the protein sequence relationships between these phages, it is important to remind that above the so--35% identity range, common evolutionary history can be fairly well established. Below that threshold, sequence identity is generally assumed to be a less accurate proxy for reporting homology useful proxy for detecting common evolutionary history. Many phages have Portal sharing less than 30% identity with other Aclame proteins. Despite this diversity, it is remarkable that a majority of phages with divergent Portal can still be correctly clustered in the Portal-only tree. We believe that this arises from the fact that there is a dense reticulation of similar phages, which tend to properly attract phages with divergent Portal to the correct cluster of related phages. However, in case reticulation between similar phages is low (as for phages from Actinobacteria in Clusters 4 and 10), Portal may not suffice to drive proper clustering. In that case, markers physically connected to the head-neck-tail module (and in particular headto-tail connection markers) may turn as a useful proxy to favor the clustering of evolutionary related phages. The metric used to establish the phage similarity matrix account for that property. As soon as a protein pair will have significant match above 30% it will contribute favorably to phage association. The fact that our 4 Types multi-proteins classification is (i) globally consistent with Portal-only classification (ii) successful in difficult cases in recognizing phages hosted in Bacteria of the same phylum supports the idea that neck proteins bring useful signal to the classification of phages. We speculate that because Portal and neck proteins have direct physical contacts, their evolution was sufficiently correlated so that the signals from the neck proteins do not add too much noise to the classification and rather helps recognizing related phages when for some reasons Portal sequences are more divergent.

Supplementary Text 1 : Relationship between neck structural organisation and genome size in tailed bacteriophages
In our study, Siphoviridae represent a homogeneous phage family that generally exhibits a SPP1-like (Type 1) neck and have between 31 and 150 genes (Suppl. Fig. 4A). They share the neck structural organisation of phages SPP1, TP901-1, HK97, T1 and L5 (Maxwell et al. 2001;Maxwell et al. 2002;Edmonds et al. 2007;Lhuillier et al. 2009;Cardarelli et al.). The large majority of these phages exhibit 4 detected head-to-tail connection proteins (Table 3).
Twenty seven siphophages (16%) have only 3 head-to-tail connection proteins and for 1 phage only 2 of these proteins were detected. The missing protein is often Hc1, as observed in T5 and 21 other phages. In these phages, a completely different Hc protein may be used, or the stopper function may be performed by Ad1, even if no clear structural difference could be identified between Ad1 of phages with or without Hc1.
Most myophages belong to two structural groups that adopt either SPP1-like (Type 1) or T4like (Type 2) neck structures. These groups drastically differ by the size of their genomes while for the other half our methods could not detect any annotated head-to-tail connection proteins prompting for further exploration of these unknown systems.
Podoviridae also belong to two structural groups that adopt P22-like (Type 3) and 29-like (Type 4) neck structures, respectively. Here again, these groups drastically differ by the size of their genomes (Suppl. Fig. 4C). Podoviridae with a genome size similar to that of siphophages (between 31 to 150 genes) generally exhibit a P22-like neck: they adopt a neck structural organisation similar to that described for P22 (ex: T3, T7, 15) Tang et al. 2011). Tiny phages (containing less than 30 genes) generally present a 29-like neck (ex: PZA, c1).

Supplementary Text 2 : Structural analogies between the different neck Types
Our HHsearch analysis revealed structural relationships within each functional category: Ad, Hc and Tc. Analysing adaptor proteins, we observed that Ad1, Ad2, Ad3 and Ad4 are all predicted to contain 4 to 5 -helices. For one particular Ad1-Ad3 pair, HHsearch strongly predicts homology: the Ad1 of Siphoviridae phage Che9c and the Ad3 of Podoviridae phage 933W, have profiles matching with a probability value of 81%. Experimental evidence, derived from the structural characterisations of the Ad1 gp15 from SPP1 (Lhuillier et al. 2009), gp6 from HK97 (Cardarelli et al. 2010) and of the Ad3 gp4 from P22  consistently revealed that Ad1 and Ad3 proteins shared the same -helical bundle fold.
Examination of Ad1 to Ad3 profile alignments suggest that these two groups are segregated on the basis of large insertions within loops and and at the C-terminus. In a similar way, Hc1, Hc2 and Hc3 are all predicted to fold into a strand rich structure.
Furthermore, one match was detected between a Hc1 (Siphoviridae phage M6) and a Hc2 (Myoviridae phage Syn9) with a probability value of 71%, suggesting that these two superfamilies could share a structural core. Finally, the tail-completion proteins Tc1 and Tc2 both adopt an -( ) n --( ) n structure. Here again, a match was observed between a Tc1 (Siphoviridae phage BCJA1c) and a Tc2 (Myoviridae phage 25) with a probability value of 77%. The very recent determination of the 3D structure of a Tc2 protein (gp15 from phage T4) consistently revealed its structural homology with a Tc1 protein, gpU from phage (Fokine et al., 2013). Altogether, these observations support the existence of extended structural homologies between the different neck architectures. Additional experimental results are now needed in order to further describe a potential common structural core at the phage head-to-tail connection. Table 2 To compare the capacity of PSI-Blast to retrieve remote homologs with respect to that of HHsearch, we first defined, for every neck component, a set of reference profiles that could then be used as queries against the database of 28300 sequences contained in Aclame. These reference profiles were retrieved from two profiles databases, PFAM and CDD (Marchler-Bauer et al. 2013;Finn et al. 2014). To obtain the comprehensive list of reference profiles reported in Supp. Table 3, we actually used HHsearch and queried the PFAM and CDD profiles with the profiles of every neck proteins identified in our work. The reference PFAM and CDD profiles were in turn used as queries of a PSI-Blast search against all the 28300 sequences of the proteins contained in Aclame. All the proteins matched by this protocol were enumerated and their numbers are reported in Table 2.