Complete genome sequence and identification of polyunsaturated fatty acid biosynthesis genes of the myxobacterium Minicystis rosea DSM 24000T

Background Myxobacteria harbor numerous biosynthetic gene clusters that can produce a diverse range of secondary metabolites. Minicystis rosea DSM 24000T is a soil-dwelling myxobacterium belonging to the suborderSorangiineae and family Polyangiaceae and is known to produce various secondary metabolites as well as polyunsaturated fatty acids (PUFAs). Here, we use whole-genome sequencing to explore the diversity of biosynthetic gene clusters in M. rosea. Results Using PacBio sequencing technology, we assembled the 16.04 Mbp complete genome of M. rosea DSM 24000T, the largest bacterial genome sequenced to date. About 44% of its coding potential represents paralogous genes predominantly associated with signal transduction, transcriptional regulation, and protein folding. These genes are involved in various essential functions such as cellular organization, diverse niche adaptation, and bacterial cooperation, and enable social behavior like gliding motility, sporulation, and predation, typical of myxobacteria. A profusion of eukaryotic-like kinases (353) and an elevated ratio of phosphatases (8.2/1) in M. rosea as compared to other myxobacteria suggest gene duplication as one of the primary modes of genome expansion. About 7.7% of the genes are involved in the biosynthesis of a diverse array of secondary metabolites such as polyketides, terpenes, and bacteriocins. Phylogeny of the genes involved in PUFA biosynthesis (pfa) together with the conserved synteny of the complete pfa gene cluster suggests acquisition via horizontal gene transfer from Actinobacteria. Conclusion Overall, this study describes the complete genome sequence of M. rosea, comparative genomic analysis to explore the putative reasons for its large genome size, and explores the secondary metabolite potential, including the biosynthesis of polyunsaturated fatty acids. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07955-x.

To characterize and explore the huge biosynthetic potential of myxobacteria, whole-genome sequencing of more strains is needed. Here we report the complete genome sequence of M. rosea DSM 24000 T and identify several biosynthetic gene clusters including one involved in the synthesis of PUFA. We also perform comparative genome analysis of M. rosea and other related myxobacteria to glean insights about the expansion in genome size that makes the M. rosea DSM 24000 T genome the largest bacterial genome known to date.

Results and discussion
Genomic properties of M. rosea DSM 24000 T The M. rosea genome assembled into a complete circular chromosome of length 16,040,666 bp with 69.07% GC. It has been deposited in GenBank under the accession number CP016211.1 within the BioProject number PRJNA321464. The assembly process did not detect any plasmid sequence. This is not surprising as among the genomes of order Myxococcales, only one organism M. fulvus 124B02 has been reported to harbor a plasmid, pMF1 [26]. However, as we have used Bluepippin size selection in our sequencing, we might have missed any smaller size plasmid. RAST-based annotation has predicted 14,121 genes that consist of 14,018 proteincoding genes, 88 tRNAs, four 5S-16S-23S rRNA operons, one transfer-messenger RNA, and two noncoding RNAs (each belongs to RNase_P_RNA and SRP_ RNA class) ( Table 1). As of date, the genome sequence of M. rosea is the largest amongst kingdom bacteria ( Fig. 1), and is~1.26 Mbp larger than the genome of the myxobacteria S. cellulosum So0157-2 (14,782,125 bp), which has been previously reported as the largest prokaryotic genome [27].
16S rRNA-based phylogenetic tree indicates that M. rosea DSM 24000 T is a close relative of members of the family Polyangiaceae in suborder Sorangiineae (Fig. 2). Similar tree topology has also been observed in the marker-gene-based tree where M. rosea is closely clustered with selected species of the genera within the Polyangiaceae family (Fig. S1). Moreover, M. rosea also shows higher DDH and ANI values with the Sorangium spp. as compared to other myxobacteria (Table S1) suggesting their close relatedness.
Analysis of genome expansion and protein function in M. rosea DSM 24000 T M. rosea encodes 14,018 protein-coding sequences which account for 87.50% coding density with an average gene size of 1003 bp (Table 1). A total of 6,167 (4 4%) coding sequences have been annotated as hypothetical proteins in M. rosea. Our pan-genome studies with 19 other myxobacteria (having > 9 Mbp genome size) revealed vast diversity among all studied members.

Core genome
Our study suggested that 650 orthologous proteincoding genes are conserved and constitute the core genome. This category includes only 5.03% of M. rosea proteins in contrast with its vast gene content (Table S2a)

Accessory genome
This study identified a total of 8947 (63.83%) accessory genes in M. rosea (Fig. 3), which are associated with the COG category CPS in higher number (39.29%) as com-

Unique genome
A total of 4421 (31.54%) proteins do not display any significant identity with selected myxobacteria, which are mentioned as unique proteins in M. rosea (Table S2a). Among them, only 347 unique proteins have been functionally identified which are associated with the COG category CPS (34%) followed by MET (30.25%) and ISP  (Fig. S2). Among unique proteins in M. rosea, 125 proteins exhibit significant similarity with exogenous genetic materials, including integrated plasmids, phages, and insertion sequence (IS) elements (Table S2a). Twenty-four genomic islands (GIs) have been identified in M. rosea comprising a total of 6,15,248 bp (3.84%) of the genome ( Table S2b). The GIs containing unique exogenous genes (Table S2b) may help facilitate horizontal gene transfer [28].

Signal transduction
Overall, our genome analysis indicates an abundance of signal transduction proteins as well as transcriptional regulators in M. rosea. Our analysis is supported by previous studies reporting a strong correlation between the number of bacterial transcriptional regulators and genome size [29]. Earlier, a linear relationship has been observed between the signaling proteins, including twocomponent system (TCS) proteins, and genome size in host-associated, as well as, environmental bacteria [30]. M. rosea also shows a higher number (323 proteins) of TCS proteins, which comprise 145 orphan histidine kinases (HK), 125 orphan response regulators (RR), and 53 hybrid TCS proteins as compared to S. cellulosum So0157-2 (309 TCS proteins) as well as other Sorangium spp. (Fig. 4A). However, no strong correlation (r = Fig. 1 Circular representation of the genome of M. rosea DSM 24000 T showing GC skew, GC content, genes on leading and lagging strands, core genes, duplicate genes, unique genes, unique exogenous genes, secondary metabolites producing genes (BGCs), and eukaryotic-like kinase (ELK) synthesizing genes from inner to outer layers respectively 0.531, p < 0.05) between the genome size and the number of TCS proteins has been found in myxobacteria, as reported previously [9]. Apart from the environmental diversity, the complex life cycle also influences the numbers of TCS proteins in the case of myxobacteria [31]. In addition to the TCS system, signal transduction mechanisms are also facilitated by serine, threonine, and tyrosine phosphorylation mediated protein kinases in prokaryotes. This protein family in myxobacteria has been reported to have strong sequence similarity with eukaryotic-like kinases (ELKs) [32]. M. rosea contains 353 ELKs, which is higher than S. cellulosum So ce56 (317) [33], as well as other myxobacteria (Fig. 4B). The number of ELKs increases with increasing genome size in bacteria [34]. A significant strong positive correlation between genome size and number of ELKs (r = 0.859, p < 0.001) is seen. In contrast to ELKs, M. rosea has fewer protein phosphatases (PPs) (43 genes), comprising all three major families of PPs i.e., serine/threonine PPs (PPP-family = 9 genes), metal-dependent serine/threonine PPs (PPM-family) including PP2c-type (21 genes) and SpoIIE-like PPs (5 genes), and tyrosine-specific PPs (PTP-family) including dual-specificity PTPs (5 genes), low molecular weight protein PTPs (2 genes) and PTPZlike PTPs (1 gene). In response to the peripheral stimuli, protein kinases phosphorylate the target proteins,  whereas, phosphatases deactivate them by removing the phosphate groups [35]. Thus, kinase/phosphatase ratio regulates the bacterial cell differentiation and development to quickly adapt to the persistently varying environment [36]. It has also been reported that PP2c-type PPs can compete with ELKs in bacteria [37]. However, a higher number of PP2c-type PPs has been observed in M. rosea (21 genes) than A. dehalogenans (2 genes), M. xanthus (4 genes), and S. cellulosum So ce56 (16 genes), reported as the highest PP2c-type PPs containing prokaryote [38]. Moreover, an elevated ratio of ELKs/ PPs has been also observed in M. rosea (8.2/1), as in A. dehalogenans (1.7/1), M. xanthus (6.9/1), and S. cellulosum So ce56 (7.7/1) [38]. It could explain the phosphorylation events which cannot be reversed by PPs during multicellular development in myxobacteria [38]. We identified 90 ELK proteins as being involved in the fruiting body production in M. rosea by BLASTP search (length ≥ 50% and e-value ≤1e-10) against the fruiting body forming proteins of M. xanthus [39] and HMMprofile based searches [40]. However, crucial genes for fruiting body development (actA, asgA, csgA, fruA, and sdeK) identified in M. xanthus are absent in M. rosea and in S. cellulosum So ce56 [33]. Therefore, as suggested in earlier studies [41], it can be argued that an alternative mechanism for fruiting body development may exist in M. rosea [42].

Secretome analysis
Our analysis revealed that 3035 proteins constitute the secretome in M. rosea, which is higher as compared to other myxobacteria (Fig. 4C). Significant positive correlation is seen between genome size and the number of secretome proteins (r = 0.845, p < 0.001). KEGG pathway analysis [43] has also revealed a higher number of proteins (104 proteins) are involved in the secretion system in M. rosea (KEGG pathway ID -mrm03070) as compared to A. . An extensive secretion system may explain the selection of such a large number of associated genes in M. rosea for executing sophisticated cellular crosstalk and adaptation to diverse environments. A variety of regulatory systems are broadly distributed across the M. rosea proteome, with most of them involved in transcription regulation. Free-living and soildwelling large-genome-containing bacteria usually acquire a complex regulatory network and a higher number of corresponding genes to survive in environments where the resources for growth are scarce but diverse [44]. Moreover, a higher number of lipid metabolism [I] associated proteins than carbohydrate metabolism [G] reveals efficient utilization of lipid as an energy source in M. rosea similar to that observed in M. xanthus [45]. Lipids have been observed in producing diverse morphological characters such as fruiting body formation in myxobacteria upon amino acid and carbon depletion [46]. Steroid biosynthesis in M. rosea further explores the importance of lipid bodies as signaling molecules similar to the steroid hormones in animals [47]. Thus, sophisticated intercellular communication for niche adaptation and morphogenetic variations may facilitate the retention of a huge amount of protein-coding genes in M. rosea.

Duplication events
Paralogous genes, which arise by gene duplications, comprise 44.10% genes in M. rosea (Table S2a). Using the same parameters to define paralogous genes, we find that well-studied members of the family Polyangiaceae i.e., S. cellulosum So ce56 and S. cellulosum So0157-2 contain 47.10 and 41.80% paralogous genes, respectively. Our results are in agreement with previous reports suggesting that the extensive expansion of paralogous genes account for the large genome size [48], similar to that reported in S. cellulosum So ce56 [33] and S. cellulosum
The acyltransferase (AT) domain is distinctly encoded by PfaB in marine γ-proteobacteria such as S. pneumatophori SCRC-2738, P. profundum SS9 [22,23], and M. marina MP-1 [24]. Whereas, AT domain is integrated into the carboxy-terminus of pfa3 in M. rosea (Fig. 5AI) as observed in terrestrial myxobacteria Aetherobacter (Fig. 5AII) [18]. The domain shows 65.26% and 64.91% identities with the AT domains of pfa3 proteins in Aetherobacter fasciculatus and Aetherobacter sp. SBSr008, respectively. It plays a significant role in shaping the final PUFA products synthesized from the PUFA gene cluster. However, the AT domain is not present in pfa3 of Sorangium (Fig. 5AIII), which has been suggested as the reason for the inability of Sorangium to produce DHA and EPA [18]. Overall, homology studies suggest that the PUFA clusters in M. rosea and Aetherobacter are unique amongst myxobacteria, containing all ten enzyme domains to yield PUFAs [56] including ARA, DHA, EPA as well as LA, GLA, SDA, and DPA. The fully functional PUFA synthase in M. rosea enables it to produce approximately 30% of the total cellular fatty acids [57]. Overall, the phylogeny of each gene (Fig. 5B) within the PUFA cluster reveal that these PUFA genes are evolutionarily closely related to Actinobacteria, suggesting that M. rosea might have acquired these genes from Streptomyces species via horizontal gene transfer. To further confirm how this cluster evolved within M. rosea, we also performed synteny studies based on identified homologs across close relatives. We identified that the pfa gene cluster in M. rosea along with close relatives Aetherobacter and Sorangium is completely conserved with the PUFA synthetic gene cluster in several Streptomyces spp., Azospirillum melinis, Tahibacter aquaticus, etc. (Fig. 5C). Conclusively, based on our phylogenetic and synteny analysis, we speculate that the pfa gene cluster might have been horizontally transferred to M. rosea and closely related myxobacteria i.e., Aetherobacter and Sorangium from Actinobacteria.

Conclusions
Myxobacteria are well known for their large genome size and genomic content, as well as the potential to produce a wide range of secondary metabolites, including polyunsaturated fatty acids. Although there has been a huge surge in next-generation sequencing of microbes in the last three decades, however, in comparison to other soil bacteria, only a few whole-genome sequences of myxobacteria are available. In the present work, we have sequenced, assembled, and annotated a 16.04 Mbp circular genome of M. rosea DSM 24000 T , the largest bacterial genome sequenced to date along with its genome characterization, and further emphasized the putative reasons for its genome expansion. Phylogenetic analysis and genome-genome distance calculation suggest M. rosea to be a close relative of the members of suborder Sorangiineae in the family Polyangiaceae. Due to its complex social behavior, diverse niche adaptation, and large genome size, M. rosea encodes a plethora of genes. Analysis of protein families reveals that most of the functionally identified proteins are associated with regulatory functions, protein folding, and genome packaging. Overrepresentation of protein families such as protein kinase, histidine kinase, tetR, transcription regulators like σ 54 , tetratricopeptide and pentapeptide repeats, VCBS, sel1, phage_GPD, FGE-sulfatase, short-chain dehydrogenase, and radical SAM, as well as higher numbers of secretomes and eukaryotic-like kinases in M. rosea as compared to other myxobacteria, are important explanations for genome expansion. Therefore, the requisite of adaptation in varied niches and complex myxobacterial multicellular behavior could be the driving forces behind genome expansion in M. rosea, which might be facilitated via gene-duplication followed by functional diversification of these proteins. A vast number of biosynthetic genes (7.71% of the coding potential) reveals the diversity of secondary metabolites production in M. rosea. Our study has identified the previously known functional PUFA biosynthetic gene cluster in the genome, one of the few known prokaryotic sources of DHA, EPA, LA, GLA, SDA, and DPA. Additionally, based on our phylogenetic and synteny studies, we hypothesize that this cluster might have been horizontally transferred from Actinobacteria. Our study on the genome sequencing, functional characterization, and pfa gene cluster analysis of M. rosea could further help biotechnological areas for heterologous expression of PUFAs from prokaryotes.

Bacterial culture and isolation of genomic DNA
The actively growing plate culture of M. rosea was procured from Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ) as strain number DSM 24000 T (also known as strain SBNa008 or NCCB 100349). The colonies from the procured sample were subcultured on VY/2 agar (DSMZ Medium 9) plates. These actively growing subculture plates were used to isolate whole genomic DNA using Zymogen Research Bacterial/fungal DNA isolation kit and Phenol-Chloroform-Isoamyl alcohol (PCI) methods. The quantity and quality of the extracted DNA were confirmed by gel electrophoresis and Nanodrop and supported by Qubit quantification. onto three single molecules real-time (SMRT) cells and sequenced using P6 polymerase and C4 chemistry (P6C4) with 180-min movie time. PacBio sequencing generated 4,41,539 raw reads (3,48,84,02,643 bp) with an average read length of 7900 bp. The Hierarchical Genome Assembly Process (HGAP) Pipeline v. SMRT v2.3.0 and consensus polishing with Quiver [58] were used to generate de novo assembly using default parameters. Gene prediction and functional annotation were performed by Rapid Annotation using Subsystem Technology (RAST) [59], whereas rRNA and tRNA genes were predicted using RNAmmer 1.2 [60] and tRNAscan-SE-1.23 [61]. RNAz 2.0 tool [62] was used to identify structured non-coding RNA (P > 0.85). A circular plot for M. rosea DSM 24000 T genome was drawn using BRIG (v 0.95-dev.0004) [63].
In silico DNA-DNA hybridization (DDH) and Average Nucleotide Identity (ANI) were calculated between M. rosea DSM 24000 T and other 21 selected members (all representative genomes from suborder Sorangiineae and a few representative genomes from other families in order Myxoccales) using Genome-to-Genome Distance Calculator (GGDC) server [68] and ANI Calculator [69] respectively.
Additional file 1: Fig. S1. Single-copy genes-based phylogenetic tree of myxobacteria. Branch color and leaf stripes represent the suborder and family-level taxonomy (color-coded), respectively.
Additional file 2: Fig. S2. COG functional categorization of the Total, Accessory, Core, and Unique proteins in M. rosea. CPS = Cellular Processes and Signaling, ISP = Information Storage and Processing, MET = Metabolism, and PC = Poorly characterized.
Additional file 3: Table S1. DDH and ANI values between M. rosea and selected myxobacteria. Color intensity changes from green to orange corresponding with higher to lower values, respectively.
Additional file 4: Table S2. a) Distribution of core, unique, duplicate, ELK, BGC genes in M. rosea. Unique exogenous genes in Column E have been shaded using orange color.; b) Identified genomic islands in M. rosea genome; and c) Comparative distribution of Pfam families among the selected myxobacterial genomes.