Bacteriophages are the most abundant biological entities in the biosphere, infecting virtually every bacterial genus and species . In environmental samples, phages are found in an order of magnitude more frequently than their hosts . In light of their abundance and diversity, it is interesting that no M. catarrhalis prophages thus far have been identified. Preliminary work on the bacterium suggested the possibility of inducible prophages in M. catarrhalis strains (unpublished), however, attempts to induce and isolate M. catarrhalis phages were unsuccessful, leading to the choice of a bioinformatics approach to identifying M. catarrhalis prophages. Other studies have utilised bioinformatics approaches in the annotation of prophages in a diverse range of hosts, such as Streptococcus suis , Mycobacterium sp. , and Lawsonia intracellularis . Analysis of 95 M. catarrhalis assembled draft whole genome sequences resulted in the identification of 32 putative complete prophages, as well as numerous incomplete prophage remnants. These prophages, which have been assigned to the Siphoviridae family based on gene synteny and sequence identity, are annotated as ‘complete’ prophages according to the scoring system adopted by PHAST . However, it is still unknown if they can excise from host genomes, form functional virions, and infect new host cells.
What qualifies as a prophage element has no definitive answer, as there is no comprehensive database for such elements, and no one gene is found in all prophages to serve as an identifying marker. However, it is expected that most bacterial genomes contain prophage elements, whether they are active prophages or remnants [15, 20, 25, 26]. Additionally, as prophage elements are continuously under degradative pressure , it was expected that there would be present more incomplete prophage elements, than questionable prophages, than complete prophages; even though the prophage remnants may still play roles in host function . M. catarrhalis, with evolution spanning about 70 Ma, would be expected to have acquired, lost, modified, and inactivated a host of genetic parasites, including phages . This order of expectation was found to be true over all the samples (131 incomplete prophages, 54 questionable prophages, and 32 complete prophages). Additionally, no lysogen harboured more than one prophage, as the presence of a complete prophage confers immunity to superinfection by other similar prophages [29, 30]. All the queried M. catarrhalis genomes had at least one incomplete prophage element except for the strains O35E, and 2041717D.
The discovered M. catarrhalis prophages fall into a narrow range of sizes ranging from 25 kb to 55 kb; Siphoviridae viruses have genome sizes usually ranging from 35 to 70 kb of DNA, with a mean of about 50 kb. The average GC content of the identified prophages was found to be 41.90 %, which is similar to that of the M. catarrhalis genome (~41 %). This is different from the average GC content of Siphoviridae (roughly 52.28 %, calculated from all 685 Siphoviridae sequences available in the NCBI database at the time of publication). However, the similarity of GC contents between identified prophages and host M. catarrhalis may be indicative of the prophages having incorporated into M. catarrhalis genomes early in the host’s speciation, or that the prophages have lain dormant in the host genomes.
An alignment of the putative attachment sites clusters them into a major and six minor groups (Additional file 6: Figure S4.). The clustering of the att sites did not correspond to the prophage clades, and did not directly reflect the diversity of integrase genes. However, it is known that various integrases can have secondary attachment sites that are divergent from the sequence of the primary attachment site, which may explain the variety of putative att sites [31–33]. Though the att sites are not identical, the conservation of a putative core site may relate to the hyperconservation of integrase genes [34–36].
Categorisation of prophages using global sequence and protein similarity has previously yielded reproducible and consistent results [37–39]. However, the reliability of such analyses has only been previously reported in the Podoviridae and Myoviridae families. This study found that a similar alignment and comparison approach classified 32 M. catarrhalis prophages as Siphoviridae, within a distinct group. Our prophage classification was reinforced by the analysis of head-neck-tail proteins using the programme Virfam . The diversity of M. catarrhalis prophages is demonstrated in the breadth of categorisation using this software, where all prophages were categorised into four different Clusters of the Type I Siphoviridae (Additional file 2: Table S1).
The multiple alignment of all 32 M. catarrhalis putative prophages divides them into four separate clades. Based on the synteny and similarity of genes, prophage clades 2, 3, and 4 are highly related, but remain distinct, whereas clade 1 is more divergent than the others. This is illustrated by Virfam prophage categorisation, where 22 of 23 prophages in clades 2, 3, and 4 were categorised as Type 1, Cluster 3 Siphoviridae with high similarity to phages D3, HK97 and HK022 (the exception being Mcat28, which was categorised as a Type 1, Cluster 1 Siphoviridae). The clade 1 prophages were more variable, as described in the results section. Thus, clades 2, 3, and 4 represent a homogeneous and conserved group of M. catarrhalis prophages, but clade 1 represents a less well-defined cluster of M. catarrhalis prophages. Addition of more prophages to the distance tree would illustrate whether clade 1 requires further sub-division into more clades, or if a single clade is sufficient, particularly if the core genome for clade 1 prophages is smaller, or less homogeneous than that of clades 2, 3, and 4.
The BRIG alignment shown in Fig. 6 indicates that the prophages from clades 2, 3, and 4 are highly similar in their core genomes, and the core genome of clade 2 is a subset of that of clade 3, which in turn is a subset of the core genome of clade 4. The clade 1 prophages have a nearly unique core genome, represented by the span 0–40 kb of the reference pan-genome, which comprises the prophages Mcat1 – Mcat7. Mcat7, Mcat8 and Mcat9 are prophages with similarities to the clade 1 prophages, but also to the prophages of clades 2, 3, and 4, though they do not share the core genomes of the latter 3 clades.
The inclusion of five reference genera into the distance tree shows that the clades are adequately different to be considered for the categorisation of M. catarrhalis prophages; and that these prophages share some homology with the five genera of phages, but remain distinct as a group of their own. It is interesting to note that the highest region of identity between the reference genera and the M. catarrhalis prophages is centred around the tail fibre genes, which are involved in host recognition. We speculate that this may be a mechanism that confers host specificity to the phages.
This result is supported by the analysis of the more divergent genes for coat and tail shaft proteins. Coat genes clustered based on function into clusters representing 1) adaptor proteins, 2) joining and completion proteases, 3) connector proteins, 4) head proteases, and two undefined clusters, where there was little diversity in each cluster (Additional file 5: Figure S3). This distribution was observed for all M. catarrhalis prophages, regardless of prophage clade, indicating that selective pressure for the coat genes is different from that upon the integrase, terminase, portal, and plate genes. As with most structural genes, genetic mutation is more permissible than in non-structural genes, and as long as the function of the protein is preserved, there can be a larger amount of diversity before the gene renders its virus non-viable. Analysis of the tail shaft genes, however, presents a different picture. Although there are 7 clusters of genes, there are many more outlier genes or groups of genes that are significantly divergent from the major groups. Additionally, none of these groups could be assigned according to function, as there was no description of tail shaft genes in the reference genera. Furthermore, the distribution of genes from the reference genera puts these genes in closely associated clusters, with two of the 7 clusters of tail shaft genes consisting almost exclusively of genes from the reference genera. These distributions demonstrate diversity in M. catarrhalis prophage structural genes, which is expected for such genes under low selective pressure over a long evolutionary time period, such as that suggested by the analysis below.
The distribution of prophages and their M. catarrhalis hosts were compared based on the MLST tree created from 312 strains. There was no observable pattern correlating the prophage distribution and MLST backbone, which leads to the conclusion that the identified prophages are not confined to or fixed within a particular lineage of M. catarrhalis host. Furthermore, analysis of the prophage clades and how they are distributed amongst their hosts indicates that there is no observable pattern or relationship between prophage clade and M. catarrhalis host strain (Fig. 7). The data in this study suggest that the prophages have evolved alongside M. catarrhalis host before the divergence of serosensitive and seroresistant lineages, and that lateral transfer of the prophages between hosts is not limited by the divergence of the two M. catarrhalis lineages. The seroresistant lineage harbours prophages from all four prophage clades, but the serosensitive lineage only harbours prophages from clades 1 and 3. However, it is important to note that the serosensitive group is much smaller than the seroresistant group, and is only populated by 2 prophage hosts (as opposed to the 30 in the seroresistant group). This is compounded by the sequencing bias of this study in favour of the seroresistant lineage, where only four of the 95 isolates were found to be of the serosensitive lineage. The absence of any discernible pattern in this analysis demonstrates the diversity of M. catarrhalis prophages, supporting the prophage genetic analysis. However, further studies are required to establish any relationship between M. catarrhalis prophage and host, especially pertaining to M. catarrhalis isolates from the serosensitive lineage.
This diversity of M. catarrhalis prophages is contrasted by a hyperconservation of the phage specific genes encoding integrase, portal, and terminase (large and small subunits), which follows a general trend where the genes for clade 1 prophages cluster together, whereas those of clades 2, 3, and 4 cluster together (Fig. 3). This supports the argument that the prophages of clades 2, 3, and 4 share a relatively recent common ancestor, which is also reflected in the homogeneity of their core genome; whereas clade 1 prophages seem to have branched off earlier than clades 2, 3, and 4. Although the core genome of the clade 1 prophages shows a higher degree of diversity (Fig. 2 and Fig. 6), it is interesting that the integrase, portal, and terminase genes of the group 1 prophages are hyperconserved. This may support that the clade 1 prophages be categorised together, regardless of the near-absence of a core genome. The portal and two terminase genes can be grouped together as genes involved in packaging of new viral genetic material into a newly formed capsid. The generalised conservation of non-structural and portal genes in the two groups (clade 1 prophages compared to clade 2, 3, and 4 prophages) may indicate a difference in physiology of the two groups. This hypothetical evolutionary divergence is not reflected in the plate genes, which exhibit almost no diversity. This may be because the plate protein is essential for the injection of phage genetic material into the host, and significant mutation in the plate gene may render mutants incapable of infection and propagation. However, only 10 plate sequences were isolated, and it could be that this small sample size is indicative of plate genes that are too divergent from the database queries to be identified.
The integrase protein is spanned by two Pfam domains: DUF4102 and Phage_integrase, with the latter containing the active site. Further analysis of the ratio of non-synonymous to synonymous codons, dN/dS values (also known as ω or Ka/Ks), and the trend of dN/dS values averaged over 4 codons, for the integrase gene (Fig. 4) showed that the N-termini of the integrase genes (corresponding to DUF4102) are hyperconserved, with dN/dS trends between 0.2 and 0.4, whereas the distal two thirds of the genes show conservation, where the dN/dS trend only exceeds 1.0 at three points. Similar analysis of the large terminase subunit and portal genes revealed that the dN/dS trend lines, averaged over 4 codons, for all genes are statistically significantly below 1.0, supporting a hypothesis of purifying selection for these genes. Such conservation suggests that the non-structural genes play a central and stable role in the maintenance of the phage.
The putative cargo genes identified in M. catarrhalis prophages are indirectly related to virulence. The majority of ORFs are related to antitoxin components of toxin-antitoxin systems, of which the most descriptive BLAST hit is hicB, the antitoxin component of the hicAB toxin-antitoxin system , and these antitoxins are most highly related to the hicB components in Acinetobacter guillouaie (BAP37890.1) and Yersinia pseudotuberculosis (AIN15103.1). Toxin-antitoxin systems have been shown to be involved in bacterial defence against phage infection via different mechanisms [40–42]. HicA expression is bacteriostatic, mediated by HicA cleavage of mRNA . The HicAB toxin-antitoxin system is a type II system, where the HicB protein interacts sterically with HicA to inhibit its action. The corresponding toxin genes with sequence identity to hicA (BAP37889.1 and AIN15869.1) were not found using the BLAST and MUMmer  alignment algorithms. A Smith-Waterman alignment of the toxin and antitoxin components revealed that they share significant sequence similarity, but the relatively short toxin lengths may obscure results, leading to a reduced number of significant hits. Based on the presence of hicB antitoxin homologues in up to 87.4 % of the sequenced strains, it is suggested that at least one toxin-antitoxin system is utilised in M. catarrhalis, possibly relating to defence against phages. However, it could be that these are remnants of such systems, and have become degraded. Further research into the potential toxin-antitoxin systems related to M. catarrhalis prophages is under way.
Two prophages, Mcat23 and Mcat17 harboured genes similar to vapE. The virulence associated protein (Vap) family of virulence genes has been associated with clinically virulent strains of bacteria. The function of these virulence genes is not well described, but it is known that VapA is required in conjunction with other Vap proteins  in Rhodococcus equi for diversion of the phagosome-maturation pathway and prevents acidification of phagosomes . With the exception of VapF, the Vap proteins are well conserved on the amino acid and nucleic acid levels, and share a well-conserved C-terminus [45, 47]. Although vapE is described to be present with vapF in virulence plasmids , vap genes have been shown to be widely distributed and also occur chromosomally as single genes .