Comparative genomics analysis of c-di-GMP metabolism and regulation in Microcystis aeruginosa

Background Cyanobacteria are of special concern because they proliferate in eutrophic water bodies worldwide and affect water quality. As an ancient photosynthetic microorganism, cyanobacteria can survive in ecologically diverse habitats because of their capacity to rapidly respond to environmental changes through a web of complex signaling networks, including using second messengers to regulate physiology or metabolism. A ubiquitous second messenger, bis-(3′,5′)-cyclic-dimeric-guanosine monophosphate (c-di-GMP), has been found to regulate essential behaviors in a few cyanobacteria but not Microcystis, which are the most dominant species in cyanobacterial blooms. In this study, comparative genomics analysis was performed to explore the genomic basis of c-di-GMP signaling in Microcystis aeruginosa. Results Proteins involved in c-di-GMP metabolism and regulation, such as diguanylate cyclases, phosphodiesterases, and PilZ-containing proteins, were encoded in M. aeruginosa genomes. However, the number of identified protein domains involved in c-di-GMP signaling was not proportional to the size of M. aeruginosa genomes (4.97 Mb in average). Pan-genome analysis showed that genes involved in c-di-GMP metabolism and regulation are conservative in M. aeruginosa strains. Phylogenetic analysis showed good congruence between the two types of phylogenetic trees based on 31 highly conserved protein-coding genes and sensor domain-coding genes. Propensity for gene loss analysis revealed that most of genes involved in c-di-GMP signaling are stable in M. aeruginosa strains. Moreover, bioinformatics and structure analysis of c-di-GMP signal-related GGDEF and EAL domains revealed that they all possess essential conserved amino acid residues that bind the substrate. In addition, it was also found that all selected M. aeruginosa genomes encode PilZ domain containing proteins. Conclusions Comparative genomics analysis of c-di-GMP metabolism and regulation in M. aeruginosa strains helped elucidating the genetic basis of c-di-GMP signaling pathways in M. aeruginosa. Knowledge of c-di-GMP metabolism and relevant signal regulatory processes in cyanobacteria can enhance our understanding of their adaptability to various environments and bloom-forming mechanism.


Background
Cyanobacteria, which are phototrophic bacteria that survive in ecologically diverse habitats, have received growing attention because they have been forming toxic blooms in eutrophic water bodies worldwide for decades [1,2]. Dense blooms are considered seriously harmful to aquatic ecosystems because of their deleterious effects on water quality, such as increased turbidity, smothering submerged aquatic vegetation, and producing taste and odor compounds [3,4]. Moreover, some cyanobacteria species can synthesize toxic secondary metabolites, such as hepatotoxin microcystins that can inhibit eukaryotic protein phosphatases; thus, they threaten the function of water bodies for drinking, bathing, and fishing, and they also ultimately pose potential risks to animal and human health [5][6][7]. Cyanobacteria are able to inhabit most of Earth's environments because they evolved mechanisms to monitor and rapidly adapt to environmental changes through a web of complex signaling networks, such as using second messengers to regulate physiology or metabolism [8].
Cyanobacteria must cope with variations in the external environment. They rely on signaling molecules to translate these changes into intracellular responses and mediate adaptation to ambient conditions. Once bacterial cells sense an external stimulus, such as light and temperature, the intracellular level of a second messenger rapidly changes to amplify the biological input signal to a downstream output effector and initiate physiological changes, including sugar metabolism, motility, and biofilm production [8,9]. A ubiquitous second messenger, bis-(3′,5′)-cyclic-dimeric-guanosine monophosphate (c-di-GMP), which was first identified as an allosteric activator of cellulose synthase in Gluconacetobacter xylinus in 1987, plays an important role in regulating biofilm formation or dispersal in response to various environmental cues and cell-cell signals [10][11][12][13][14]. Studies have summarized that c-di-GMP regulates an astounding array of important processes in bacteria, including transcription, RNA turnover, protein synthesis, motility, virulence, and altering activities of proteins or protein complexes [15][16][17]. The intracellular level of c-di-GMP are modified by the rate of its synthesis and degradation in response to a variety of environmental stimuli, relying on the opposite enzymatic activity of diguanylate cyclases (DGCs) and c-di-GMP-specific phosphodiesterases (PDEs), respectively [12,18]. DGC proteins contain a GGDEF domain that synthesizes one c-di-GMP molecule from two GTP molecules [19,20]. PDE proteins contain an EAL or, less frequently, a HD-GYP domain, which breaks down c-di-GMP into the linear molecule 5′-phosphoguanylyl-(3′-5′)-guanosine (pGpG) or into two GMP molecules [21,22]. Moreover, GGDEF and EAL domains can both be present in the same protein, forming "hybrid" proteins, even though they have opposing activities [23,24]. In that case, only one of the two domains is catalytically active, and the other performs a regulatory function, or a third regulatory domain is present that may disjoin the activity of the GGDEF and EAL domains [23,25,26]. Ute Römling et al. list a census of all GGDEF, EAL, and HD-GYP domains in bacterial genomes [12,27].
Diverse sensor domains can modulate enzymatic activities in response to external stimuli, including N-terminal response regulator receiver (REC), Per/Arnt/Sim (PAS), histidine kinases/adenylate cyclases/methyl accepting proteins and phosphatases (HAMP), and cGMP phosphodiesterase/adenylyl cyclase/FhlA (GAF) domains, which were related to c-di-GMP association network retrieved by STRING [25,[28][29][30]. C-di-GMP has been found to be recognized by downstream receptors that have been linked to specific physiological processes, ranging from polysaccharide biosynthesis to direct regulation of gene expression and to motility. Among the downstream effectors, the PilZ domain is ubiquitous in bacteria and can bind c-di-GMP to regulate biosynthesis of biofilms, such as cellulose and alginate [31][32][33]. The PilZ domain can be a stand-alone protein or fused with other functional proteins, such as cellulose synthases and alginate biosynthesis protein, or attached to certain signaling domains, such as the GGDEF, EAL, and HD-GYP domains [32,34] (Fig. 1). Molecular mechanisms of c-di-GMP signaling in a few cyanobacteria that are obligate photosynthetic microorganisms in the environment, such as Thermosynechococcus and Synechocystis, have been examined in-depth [35][36][37][38]. However, none of those studies have addressed Microcystis, one of the most ubiquitous freshwater cyanobacterial genera, which limits the comprehensive understanding of c-di-GMP signaling in cyanobacteria [36].
Genome sequencing of numerous Microcystis species has been performed, which makes it possible to improve our knowledge about c-di-GMP function in this genus. The purpose of this study was to explore the genomic basis of c-di-GMP signaling in M. aeruginosa. In this study, c-di-GMP metabolism and regulation in M. aeruginosa was revealed through in silico comparative analyses. We identified genes that encode proteins containing the GGDEF, EAL, HD-GYP and PilZ domains and other associated sensing domains in the complete or draft genome sequences of 24 M. aeruginosa strains available in Gen-Bank. Meanwhile, we performed comparative genomic analyses based on phylogenetic, phylogenomic, positive selection, and pan-genome analyses of these strains to comprehensively analyze the c-di-GMP signaling genes. We also characterized the structural features of GGDEF, EAL, HD-GYP and PilZ domains. The comparative genomic analysis will help elucidate c-di-GMP metabolism and relevant signal regulation processes in cyanobacteria.

Results
General genome features of M. aeruginosa strains Genomes of 24 M. aeruginosa strains were retrieved from the National Center for Biotechnology Information (NCBI) database for series analysis. As shown in Table 1, except for strains NIES 2481 [39] and NIES 2549 [40], no plasmid sequence was discovered in other strains. Among them, genome sequences of the strains CHAOHU 1326 and NaRes975 were recently released by our laboratory [41,42], and their general information are shown in Table  S1. The average size of the genomes was 4.97 ± 0.40 Mb, and the average G + C content was 42.66 ± 0.26%. Genomes ranged in size from 4.26 Mb (M. aeruginosa PCC9806) to 5.89 Mb (M. aeruginosa KW, Table 1). The selected genomes are of high completeness and low contamination as evaluated based on lineage-specific marker sets by checkM [43]. Multiple rRNA coding sequences were present in M. aeruginosa strains. Generally, each strain contains 1~2 sets of rRNA clusters as a rough estimation due to the incomplete sequences present in the genomes (Additional file 1, Table S2).
Modular signaling proteins involved in c-di-GMP metabolism and regulation in M. aeruginosa A genome search for genes that encode enzymes involved in c-di-GMP metabolism was performed to identify the putative translated products that have DGC and PDE activities in the selected 24 M. aeruginosa genomes. The accession numbers of the predicted proteins are shown in Table 2. This survey led to identification of three enzymatic classes of predicted proteins DGCs, PDEs, and hybrid DGC-PDEs, which contain GGDEF and EAL domains, even though they have opposing activities. As listed in Tables 2, 14 of the 24 M. aeruginosa genomes had genes that encode DGC enzymes, which contain a fused N-terminal REC domain and GGDEF domain in tandem. The REC domain, as a signal receiver domain present in association with c-di-GMP metabolism domains, is supposed to modulate the enzymatic activities in response to the internal or external stresses. There are two types of PDEs in M. aeruginosa genomes, one type contains partial EAL domain, and the other type contains HD-GYP domain along with a N-terminal Once bacterial cells sense environmental cues, such as light and temperature, the intracellular level of c-di-GMP rapidly changes to amplify the biological input signal. C-di-GMP then is recognized by downstream effector that interact with a downstream target to affect specific physiological processes. C-di-GMP synthesis and degradation is achieved by DGC proteins containing the GGDEF domain and PDE proteins bearing the EAL or HD-GYP domains, respectively. Downstream effectors contain PilZ, degenerate GGDEF or EAL domain, riboswitch, and transcription factors DICT domain, a sensory domain in "diguanylate cyclases and two-component system". Interestingly, compared with the HD-GYP domain-containing PDEs, which were identified in all selected M. aeruginosa genomes and seemed to be highly conserved, proteins with partial EAL domains were found less frequently (in only three Bacteria encode a variety of sensory and signal transduction proteins to sense and adapt to changes in the physicochemical makeup of their environment. Sensory and signal transduction proteins encoded in the selected 24 M. aeruginosa genomes were predicted, and 12 sensory domaincontaining proteins were found. Most of these proteins are signal transduction histidine kinases. The accession numbers and domain architectures of the highly conserved GAF, PAS, and REC domain-containing proteins are listed in Additional file 1, Table S3. As an important sensor for photosensory behavior, the GAF domain was commonly associated with c-di-GMP domains in cyanobacteria [38]. As many as 11 of the 12 proteins had the GAF domain, and some even contained two. PAS-containing proteins are related to sensory input (GAF), transduction (HAMP), or output (histidine kinases). Half of the four predicted PAS-containing proteins contain a PAC motif, a conserved region of 40-45 amino acids located at the carboxyterminal of the PAS domain, which contributes to PAS structure [28]. Interestingly, some sensory domaincontaining proteins in different genomes were identical, and were therefore assigned the same accession number, such as NIES2549 and NIES2481, DIANCHI905 and PCC7806SL, and NaRes975 and PCC9808.

Pan-genome of M. aeruginosa
To assess the distribution of genes involved in c-di-GMP metabolism and regulation across the M. aeruginosa genome, a core-pan-genome analysis was performed using all 24 M. aeruginosa genome sequences as input in the Bacterial Pan Genome Analysis (BPGA) tool [44]. The pan-genome analysis revealed a core genome of 1918 genes with an accessory genome of 36,550 genes and 6489 unique genes (Fig. 2a). Accessory genes are those whose orthologs are present in two or more genomes, but not in all the genomes. M. aeruginosa possess a core genome shared by 24 strains, accounting for a Letters in parentheses are domains of the referred c-di-GMP metabolism enzymes b Number of DGCs, PDEs and hybrids proteins 37.0 to 49.9% of the genome repertoire. The core-pan plot ( Fig. 2b) showed that the pan-genome trend curve did not reach a plateau and seemed to extend with addition of more genomes to the analysis. Therefore, the pan-genome was considered an "open" pan-genome. In contrast, as shown in Fig. 2b, the core genome curve leveled off, considered as a conserved core genome. The distribution of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Cluster of Orthologous Groups (COG) categories in M. aeruginosa has been mapped in Fig. 2c and Additional file 1, Figure S1. Genes involved in c-di-GMP metabolism and regulation were assigned to gene families belonging to environmental information and signal transduction category. Among these genes, the GG[D/E]EF domain-containing DGCs and GG[D/E]EF-EAL hybrid proteins coding genes were classed into accessory genes, while the HD-GYP domain containing PDEs and the sensory genes which possess The BLAST Ring Image Generator (BRIG) alignment made it clear that most regions within the 24 M. aeruginosa genomes were conserved when compared to the reference strain NIES843 (Fig. 3). Several regions appeared to have low or even no similarity, possibly because of acquisition/deletion/rearrangement or horizontal gene transfer (HGT). The outer ring in Fig. 3 showed the distribution of genes encoding the domains involved in c-di-GMP signaling in NIES843 genome, and the specific locations were list at Additional file 1, Table S4. The corresponding sequences of other M. aeruginosa strains seemed highly conserved, and some even shared an identity up to 100% with the reference genome.

Phylogenetic analysis of M. aeruginosa strains
Comparing with the phylogenetic tree based on the 16S rRNA gene (Additional file 1, Figure S2a), the phylogenetic tree based on the conserved marker genes, previously validated as phylogenetic markers for (cyano) bacteria [45], produced higher resolution (Fig. S2b). The 31 conserved marker genes tree revealed a topology with generally well-defined nodes, with bootstrap support values greater than 90% over 1000 replicates. Further, propensity for gene loss (PGL) analysis of the gene families revealed a group of strains have lost the REC-GGDEF domain coding gene, including strains NIES98 [46], TAIHU98, NaRes975, PCC9808, PCC9432, DIA-NCHI905, PCC7806SL, PCC7941, NIES87, and NIES44 (Fig. 4a). As to the node, consisting of strains NIES1211, PCC9701, NIES44, NIES2549 and NIES2481, the EAL domain coding gene seems to be acquired, but PCC9701 and NIES44 have lost this gene. Gene encoding GGDEF-EAL hybrid protein was lost in strain NIES44. There is no gain or loss of genes encoding HD-GYP and sensory domain-containing protein. To further analyze the evolution of genes encoding sensory domain-containing protein, phylogenetic tree was constructed using a multilocus sequence typing approach based on these concatenated conserved sensory domain-containing proteins sequences (Fig. 4b). It showed a similar topology with the conserved marker genes tree (Fig. S2b). Phylogenomic analyses based on binary gene presence/absence (1/0) pan-genome matrix generated by BPGA pipeline resulted in a tree (Fig. 4c) with a topology similar to the trees obtained using conserved marker genes and sensory genes (Figs. S2b, 4b). All phylogenetic trees provided more robust topologies than that based on 16S rRNA gene analysis alone.
A visual comparison of phylogenies based on 31 marker genes, sensory domains, and pan-genome presence/absence matrix were generated by tanglegram [47]. As shown in Fig.  S3, only a few strains (2 of 24) occupied divergent positions on the phylogenetic trees based on 31 marker genes and sensory domains, which indicated a congruence between the two trees (Fig. S3b). Interestingly, most of that strains in which REC-GGDEF domains containing protein DGC is not detected (marked in blue), including NIES98, TAIHU98, NaRes975, PCC9808, PCC9432, DIANCHI905, PCC7806SL, and PCC7941, appeared to be phylogenetically closely related, thus were always grouped in the same clade of the different phylogenetic trees based on 31 marker genes, sensory domains, and pan-genome presence/absence matrix. Among them, DIANCHI905 and PCC7806SL [48] are representatives of toxic (microcystin-producing) bloomforming strains; in contrast, PCC 9432 and NIES98 are non-microcystin-producing strains [49]. Specifically, pairs of strains also appeared to be phylogenetically closely related, such as NIES2549 and NIES2481, DIANCHI905 and PCC7806SL, and NaRes975 and PCC9808, although they were isolated from diverse geographic origin (Table 1). The majority of the M. aeruginosa strains were isolated in different locations, but no correlation was found between their geographic distribution or bloom-forming ability and phylogenetic relationships, consistent with the previous report by Meyer et al [50].
To the majority sequence of genes encoding for c-di-GMP metabolism and regulation in M. aeruginosa, likelihood ratio test indicated that model M2 and M8 gave a significantly better fit than model M1 and M7, respectively, which allowed individual sites to evolve under positive selection (Additional file 1, Table S5). Lots of sites with ω > 1 were identified in the sequence of genes responsible for c-di-GMP metabolism and regulation, revealing that they are likely to have been subjected to positive selection (Additional file 2, Table S6).

Structural features of GGDEF domain, EAL domain, and HD-GYP domain of M. aeruginosa strains
To elucidate the structural features, structure predictive modeling of GGDEF domain, EAL domain, and HD-GYP domain was performed on the corresponding M. aeruginosa strains. The NIES843 genome is a representative genome of M. aeruginosa because of its genome has been completely sequenced and is modeled in Fig. 5. Similarly, the corresponding structural models of strain CHAOHU 1326 were shown in Additional file 1, Figure  S4. The Z-values of the models are shown in Table S7. All the models of the GGDEF, EAL and HD-GYP domain containing proteins were qualified with a Z-score higher than − 4.0.
According to SWISS-MODEL, the crystal structure of the conserved GGDEF domain of WspR (Protein Data Bank (PDB) id: 3BRE) was selected as the template to model the structure of the GGDEF domain of the DGC [51,52].
Amino acids S173 to N329 from the GGDEF domain were used to perform structural alignments (Fig. 5a, left). The amino acid sequences of the GGDEF domains in DGC showed a similarity of 34.2-37.2% to that of 3BRE (Additional file 1, Table S8). C-di-GMP binds to the catalytic site and to a second site distal to the catalytic loop. DGC proteins possess a conserved allosteric inhibition site (I site), composed of a RXXD motif (in which X represents any amino acid) five amino acids upstream of the GGDEF active site, that is important for controlling DGC activity. When levels of c-di-GMP are high, the second messenger can bind the RXXD motif, thereby repressing the DGC activity [53]. A systematic analysis and comparison of the 14 genomes that have corresponding GGDEF domains was performed to identify the amino acid motifs or signatures involved in catalysis and allosteric inhibition. As shown in Fig. 5a (left), the WebLogo alignment revealed that the RXXD and GGEEF motifs of the GGEEF domain were highly conserved in the same amino acid residues: Arg-Gln-Val-Asp (RQVD) and Gly-Gly-Glu-Glu-Phe (GGEEF), respectively. The GG[D/E]EF domain of the putative DGCs possessed the conserved amino acid residues essential for GTP binding, indicating that the DGCs may have catalytic activity [26].
Because only three genomes had partial EAL domains, the EAL domain in hybrid proteins from the M. aeruginosa NIES843 genome were chosen as paradigms to examine the crystal structure. Based on the crystal structure of the GGDEF-EAL domain of RmcA (PDB ID: 5M3C), which has a crystallographic resolution of 2.8 Å [54], the GGDEF and EAL domains in the hybrid protein of NIES843 were modeled. Compared with 5M3C, the GGDEF-EAL domains in the hybrid proteins showed sequence conservation of 35.9-37.8% (Additional file 1, Table S9). The low sequence conservation appeared to have no impact on model prediction by SWISS-MODEL. Compared with DGCs that contained only the GGDEF domain, amino acid residues of RXXD and GGDEF motifs in the GGDEF domain of the hybrid proteins were less conserved (Fig. 5a, right). The WebLogo alignment in Fig.  5b showed that amino acid residues of the EAL domain involved in the binding of c-di-GMP and catalytic activity were highly conserved in all sequences. The Glu in the EAL signature motif is an essential residue that is required to bind the c-di-GMP, whereas a change of Ala into Val (EVL) still sustains the enzymatic activity [55]. Arg in the second position downstream of the EAL signature motif was conserved in nearly all EAL domain sequences; thus, the EAL signature motif can be extended as EXLXR motif, which forms a stable platform to bind c-di-GMP [23].
Crystal structure of HD-GYP domain of M. aeruginosa NIES843 was modeled based on PA4781(PDB ID: 4R8Z) from P. aeruginosa [56] (Fig. 5c). Aligned with PA4781, the HD-GYP domain containing PDEs in M. aeruginosa showed sequence conservation of 33.5-34.3% (Additional file 1, Table S10). Generally, in HD-GYP domain, the HD residues clearly serve as metal ligands, the signature of HD can be extended as a larger motifs HDxGK; while the GYP motif may be more usefully considered as part of the HHExxDGxGYP, and the role of GYP motif may be substrate specificity determining but is not certainly clear [56,57]. WebLogo alignment revealed that the GYP motifs of GYP loop in M. aeruginosa were highly conserved in the same amino acid residues EFExxDGxGVP, whereas Val replaced Tyr compared with the GYP motif template. Moreover, the HD motif possess YR residues in M. aeruginosa strains. That is, amino acid residues were replaced by YR-GVP in HD-GYP motif in all selected M. aeruginosa strains.

Structural features of the PilZ domain of M. aeruginosa strains
All selected M. aeruginosa genomes encoded proteins that possess a PilZ domain. Twenty genomes encoded cellulose synthase (CelA), which contained a C-terminal PilZ domain, and the other four genomes encoded a protein that contained only a PilZ domain. The accession numbers of the corresponding proteins are shown in Additional file 1, Table S11.
To identify the structural features, structure predictive modeling of proteins with a single PilZ domain and PilZ domain-containing CelA was performed for M. aeruginosa strains. Predictive modeling was based on the crystal structure of the BcsA (PDB id: 4P02) from Rhodobacter sphaeroides, which has a crystallographic resolution of 2.65 Å, according to SWISS-MODEL results [31]. Model of the PilZ domain-containing protein CelA of strain CHAOHU 1326 is shown in Fig. 6a. The c-di-GMP-binding PilZ domain was located in the C-terminal region of CelA and had similar structure with protein containing a single PilZ domain in Fig. 6b, which was derived from the representative M. aeruginosa strain NIES843. Figure 6c shows that the PilZ domain consists of a six-stranded βbarrel and a short α-helix that follows the last strand of the β-barrel.

Discussion
The occurrence of cyanobacterial blooms appears to be increasing because of environmental factors, including continued eutrophication, rising atmospheric CO 2 concentrations, and global warming [58][59][60]. Cyanobacteria can survive in ecologically diverse habitats, to a great extent, because intracellular second messengers function in pathways that mediate cellular responses to oxidative stress, nutrient imbalances, and temperature variations in the environment [8]. C-di-GMP, as a universal bacterial second messenger, has been shown to regulate biofilm formation and aggregation, which are beneficial for cyanobacteria colony formation and thus promotes bloom formation [35,61]. With recent advances in genome sequencing and bioinformatics, it is possible to identify sequence groups with high genotypic similarity based on variation in protein-coding genes distributed across the genomes and predictions drawn from bioinformatics, and thereby provide genetic insight into c-di-GMP signaling regulation in M. aeruginosa. Because only one or two M. aeruginosa genomes do not adequately represent this species, 24 M. aeruginosa genomes available in NCBI's Gen-Bank were selected to comprehensively clarify the genetic similarities and differences of M. aeruginosa strains in the present study.
The selected M. aeruginosa strains in this study diverged to some extent at the genomic level and were isolated from aquatic ecosystems around the world. An indepth comparative genomics analysis that included genome feature analysis, core-pan-genome analysis, and phylogenetic analysis were used to distinguish differences and similarities among the 24 selected M. aeruginosa genomes.
M. aeruginosa genome sizes result from a mix of gains and losses during natural selection as they were subjected to changing environments and competitive forces during the evolution of the species. As a freshwater species, M. aeruginosa have medium genomes compared with other cyanobacteria, especially compared with marine species that mostly occur in low nutrient and stable open ocean waters, such as Synechococcus and Prochlorococcus, the genome sizes of which are almost half those of M. aeruginosa [62]. Some reports indicated that genome size is positively correlated with the number of duplicated genes, which can originate from either within the genome itself or can be introduced by HGT [63]. Gene duplication and high genetic redundancy in the M. aeruginosa genomes are considered an evolutionary strategy that might confer this cyanobacterial species an extensive adaptive capacity that allows them to inhabit a wide range of habitats worldwide, and facilitates the ability to proliferate and dominate the phytoplankton communities in eutrophic freshwater ecosystems [64]. The core-pan-genome analysis indicated that these strains maintained a conserved core genome and an expansive pan-genome that allow them to acquire new genes. M. aeruginosa possess a relatively small core genome, which might result from high genetic diversity and variable gene content [49].
In this study, bioinformatics tools furthered our understanding of c-di-GMP signaling in M. aeruginosa by recognizing and studying domain architectures and tridimensional structures of the predicted proteins with DCGs, PDEs, and DGC-PDEs encoded in the genomes. These coding genes are widespread in other cyanobacterial species, such as Synechocystis sp. PCC6803 and Thermosynechococcus elongatus BP-1, which reportedly encode a considerable number of proteins predicted to be involved in c-di-GMP metabolism [36,38]. In general, the number of domains involved in c-di-GMP signaling in cyanobacteria may be mainly determined by genome size [65]. However, there are at most three c-di-GMP signal-related domains identified in M. aeruginosa genomes, even if the mean genome size of this species is nearly twofold that of the Synechocystis sp. PCC6803 and Thermosynechococcus elongatus BP-1. An alternate explanation is that, in cyanobacteria, the number of c-di-GMP signal-related domains are not simply correlated with genome size but may also be affected by bacterial adaptation. Among the species present in the Cya-noBase database, the species found to lack c-di-GMP signaling systems were Prochlorococcus and some Synechococcus strains. It was reported that Synechococcus strains that contain c-di-GMP-modulating domains inhabit both marine and freshwater habitats and are found in rich-nutrient (eutrophic) waters, whereas Synechococcus strains lacking c-di-GMP-regulatory domains inhabit low-nutrient (oligotrophic) marine habitats [36]. Species adapted to stable habitats may have lost genes that encode c-di-GMPmodulating proteins. Primitive M. aeruginosa that inhabit low-nutrient lakes may have a small number of c-di-GMP domains even though they have relatively large genomes [66].
The domain architectures of the deduced amino acid sequences from the M. aeruginosa genomes also revealed diverse sensor domains, such as REC, PAS/PAC, GAF, and HAMP, which are involved in activity regulation by driving the protein dimerization process and play important roles in c-di-GMP-controlled rapid response to changing environmental conditions. It seemed that these sensor domains have been subjected to positive selection during evolution. Some sensory domain-containing proteins from different genomes have identical amino acid sequences, such as that of NIES2549 and NIES2481, DIANCHI905 and PCC7806SL, and NaRes975 and PCC9808. It should be noted that each pair of strains have close genetic relationships as determined by phylogenetic analysis.
In this study, four types of phylogenetic trees were established based on 16S rRNA gene, 31 protein-coding phylogenetic marker genes, sensory protein sequences, and binary gene presence/absence (1/0) pan matrix. The congruence of the two phylogenetic trees based on 31 proteincoding phylogenetic marker genes and sensory protein sequences facilitated a comprehensive understanding of the phylogenetic relationships and the evolution of the sensory domain coding genes among M. aeruginosa strains. It should be noticed that most of strains in which REC-GGDEF domains-containing protein DGC was not detected, appeared to be phylogenetically closely related. PGL analysis revealed most of genes involved in c-di-GMP signaling are stable in M. aeruginosa strains. In some M. aeruginosa strains, the EAL domain coding gene seems to be acquired from other origin by lateral gene transfer. In complex signal transduction process, the range of cellular functions might be regulated by different regulatory systems. The number of genes involved in c-di-GMP signaling might more likely rely on the number (kinds) of signals that they respond to and the intracellular level of the c-di-GMP that they need to regulate a series of physiological process, instead of genome size [67]. In some M. aeruginosa strains, the missing or lack of coding genes for DGC or PDE revealed that c-di-GMP-mediated regulation might not be the sole alternative regulatory pathway in this ancient photosynthetic microorganism. They might use other signal molecules such as NO, to regulate diverse biochemical and physiological processes [8].
It seems that the relatedness of the closely related strains studied did not perfectly reflect their similar physiological characteristics (e.g., cyanobacterial toxin-producing ability) or geographical origins. Phylogenetic analysis could also not reveal the M. aeruginosa strains with bloom-forming characteristics. Previous studies demonstrated that Microcystis "species" distinctions are problematic and doubtful [68,69]. Microcystis taxonomic studies using 16S rRNA analysis revealed that phylogenetic trees using sequences with significantly high sequence similarities did not clearly delineate Microcystis species [70,71]. In the present study, 31 protein-coding phylogenetic marker genes was used instead of discrimination based only on the traditional 16S rRNA gene, which could not sufficiently discriminate between strains [44,45]. Similarly, the phylogenomic tree based on whole genome information was more reliable compared with the phylogenetic tree only based on the 16S rRNA gene. More tests are needed to further determine whether the alternative approaches could refine Microcystis species classification.
GG[D/E]EF and EAL domain-containing proteins analyzed in this study included all essential conserved amino acid residues that bind the corresponding substrate to have enzymatic activity. In M. aeruginosa, HD-GYP domains possess the variant key residues YR-GVP, and further study needs to be done to verify whether this domain still has the ability to bind corresponding substrates or is degenerate. Structural analysis provides important information for predicting the function of these proteins that contain GGDEF, EAL, hybrid domains, and HD-GYP domain, and creates a paradigm for future studies that analyze the evolution of enzymes involved in c-di-GMP metabolism. It was also found that all selected M. aeruginosa genomes encode PilZ domain, regardless of if it is in CelA, by which c-di-GMP could stimulate the biosynthesis of extracellular polysaccharides that are important for biofilm formation.

Conclusion
In summary, comparative genomic analysis of 24 publicly available M. aeruginosa genomes focusing on c-di-GMP metabolism and regulation revealed the following main results: (1) Proteins involved in c-di-GMP metabolism and regulation, such as diguanylate cyclases, phosphodiesterases, and PilZ-containing proteins, were encoded in M. aeruginosa genomes. However, the number of identified c-di-GMP signaling related domains was not proportional to the size of M. aeruginosa genomes (4.97 Mb in average). Pan-genome analysis showed that genes involved in c-di-GMP metabolism and regulation are relatively conservative in M. aeruginosa strains. (2) Phylogenetic and phylogenomic analysis revealed that the relatedness of the closely related M. aeruginosa strains did not reflect the geographical origins, even though they were isolated from diverse freshwater ecological environments. PGL analysis revealed that most of c-di-GMP signaling related genes are stable in M. aeruginosa strains. (3) In silico analysis of signaling related DGCs, PDEs, and hybrid proteins revealed that GGDEF and EAL domains contained the conserved amino acid residues essential for the substrates binding, indicating a possible catalytic activity. In addition, it was also found that all selected M. aeruginosa genomes encode PilZ domain, regardless of if it is in CelA.
This study is the first to analyze c-di-GMP signalrelated proteins in M. aeruginosa, and our findings provide a pre-requisite genetic basis for further experimental characterization and evaluation of biological function. Some important aspects are still unclear that could help enhance our understanding of M. aeruginosa blooms in aquatic environments, such as the involvement of the specific domain-containing proteins of c-di-GMP signaling networks in M. aeruginosa physiological regulation and an ecologically relevant explanation of how M. aeruginosa adapts to its specific ecological niche.

M. aeruginosa genomes features
All of the M. aeruginosa genome sequences available in June 2018 in the NCBI database, annotated with the Prokaryotic Genome Annotation Pipeline [72], were used to conduct various analyses. Draft genomes that consisted of more than 1000 contigs were omitted to obtain consistent genome quality. CheckM v.1.0.8 was used to estimate the completeness of the selected M. aeruginosa genomes [43]. The genome features of these selected strains were listed in Table 1. The sequencing and sequence assembly of M. aeruginosa strain NaRes975 and CHAOHU 1326 genomes were performed as previously described [41].

Identification of genes involved in c-di-GMP metabolism and regulation
Genes that encode the GG[D/E]EF, EAL, and GG[D/ E]EF-EAL domains, the related sensor GAF, PAS, and HAMP domains, and the c-di-GMP binding domain PilZ from the selected 24 M. aeruginosa genome sequences were identified by performing BLAST (Identity ≥30% for amino acid and 80% for nucleotide, E-value ≤1E-5). Conserved Domain Database (CDD) [73], Microbial Signal Transduction Database (MiST, version 3.0) [74], and Simple Modular Architecture Research Tool (SMART) [75] were used to characterize the domain.

Comparative genome analyses
Core-pan-genome analysis was performed using the BPGA tool [44]. Orthologous clusters were assigned by grouping all protein sequences encoded by the 24 genomes using the default clustering tool USEARCH based on 90% sequence identity cut-off. Core-pan-genome plots were calculated over 500 iterations. Comparative functional analysis was performed based on COG of proteins and KEGG pathways by focusing on distributions of representative protein sequences of core, accessory, and unique clusters of the M. aeruginosa strains. Gene families were classified accordingly. BRIG(version 0.95) [76] was used to create a circular genome comparison to highlight the location of genes encoding c-di-GMP-associated signaling domains between the 24 genomes compared with the reference sequence.

Phylogeny and evolution analysis
To elucidate the phylogenetic relationships between the M. aeruginosa strains, 16S rRNA gene sequences of cyanobacterial strains for which whole genome sequence data were available on NCBI were downloaded and analyzed to construct a phylogenetic tree. Sequences were aligned in MUSCLE version 3.8 with default settings [77]; then, the phylogenetic and molecular evolutionary analyses were conducted using MEGA version X [78]. The phylogenetic tree was inferred using the neighbor-joining method with 1000 bootstrap replications. The evolutionary distances were computed using the maximum composite likelihood method and the units were number of base substitutions per site. The analysis involved 25 nucleotide sequences, including 24 sequences of M. aeruginosa strains and the Synechocystis sp. PCC6803 sequence as the outgroup. All positions that contained gaps and missing data were eliminated. There were a total of 1313 nucleotides in the final dataset.
To assess relationships between the M. aeruginosa strains, the phylogenetic tree was constructed based on amino acid sequences of 31 highly conserved proteins that were encoded by the genes distributed in genomes as a single copy along 24 M. aeruginosa genomes, according Wu and Eisen (2008) [45,79]. These protein sequences were mined by the AutoMated Phylogenomic inference Application−AMPHORA2 tool [80,81], using default settings for the bacteria option and a cut-off Evalue of 1 E-10. Individual alignments were performed for each of the 31 gene sets in MUSCLE version 3.8 with default settings [77], trimmed with respect to the reading frame, and subsequently concatenated with the FaBox Fasta Alignment Joiner [82]. Only genomes with all selected sets of conserved genes were used in the phylogenetic analysis. A Maximum Likelihood (ML) tree was constructed with MEGA X using the Jones-Taylor-Thornton model with nearest neighbor interchange [78,83]. Then, 1000 bootstrap replicates were calculated to evaluate relative branch support. There were 7481 total nucleotides in the final dataset. After that, the program Count was performed to compute PGL to quantify the frequency of loss for select gene family of the key node in the 31 marker genes phylogenetic tree [84].
A multilocus sequence typing approach based on concatenation of sensory domain containing proteins, was used to generate the phylogenetic reconstructed tree following protocols described. Individual alignments were performed for each of the sensory genes in MUSCLE version 3.8 with default settings, trimmed with respect to the reading frame, and subsequently concatenated with the FaBox Fasta Alignment Joiner [82]. ML tree was constructed with MEGA X using the Jones-Taylor-Thornton model and a bootstrap resampling value of 1000. There were 4951 total nucleotides in the final dataset.
The pan phylogenetic tree was reconstructed using the neighbor-joining algorithm based on a binary gene presence/absence (1/0) pan matrix generated by BPGA from orthologous clusters after clustered by USEARCH. The trees in Newick format were then loaded into Dendroscope 3 and the tanglegram algorithm was applied for further comparison [47,85].
Nucleotide sequences of c-di-GMP signal-related genes were aligned by MUSCLE version 3.8 with default settings [77]. Phylogenetic trees files were generated by MEGA version X with ML model [78]. Aligned sequences in conjunction with ML tree files were used as input for evolution analysis using codeml from the PAML (version 4.9) with site models [86]. Likelihood ratio test was performed among pairs of models (M1 and M2; M7 and M8).

Proteins structural analyses
Automated protein structure models were predicted and built by the SWISS-MODEL server [51] by searching for evolutionarily related protein structures against the SWISS-MODEL template library SMTL based on the PDB database [87,88]. In this platform, templates are ranked based on the expected quality of the resulting models, and estimated by Global Model Quality Estimate and Quaternary Structure Quality Estimate [88,89]. The crystal structures of a DGC (WspR) from P. aeruginosa [52], the GG[D/E]EF-EAL hybrid domain protein RmcA from P. aeruginosa [54], the HD-GYP domain containing protein PA4781 from P. aeruginosa [56], and the PilZ domain-containing protein BcsA from R. sphaeroides were selected as templates for the structural analyses [31]. QMEAN scoring functions were used to estimate alternative models and screen for models whose scores strongly matched highresolution structures that were then used to create the corresponding model [90]. Structures were matched using Chimera UCSF [91]. CDD was used to identify the amino acids of the motifs present in the various domains and PROSITE was used to determine the site [92]. Multiple protein sequence alignments were generated through MUSCLE with default parameters. Conserved motif sequence figures were visualized using WebLogo based on aligned amino acid sequences [93].
Additional file 1: Figure S1. COG distribution of core, accessory and unique genes present in 24 analyzed M. aeruginosa genomes. Figure S2.  Table S1. Genome features of M. aeruginosa CHAOHU 1326 and NaRes975. Table S2. Numbers of RNA genes found in all 24 analyzed M. aeruginosa genomes. Table S3. Highly conserved GAF and PAS domain-containing protein accession numbers and domain architectures in M. aeruginosa. Table S4. Locations of genes related to c-di-GMP metabolism and regulation in M. aeruginosa NIES843. Table S5. Positive selection for genes related to c-di-GMP metabolism and regulation in M. aeruginosa. Table S7. QMEAN Z-score of the predicted structures of EAL, GGDEF, HD-GYP and PilZ domain containing proteins. Table S8. Identity of DGC sequences from M. aeruginosa genomes compared to WspR from P. aeruginosa. Table S9. Identity of GGDEF-EAL domain sequences from M. aeruginosa genomes compared to that from P. aeruginosa. Table S10. Identity of HD-GYP containing PDE sequences from M. aeruginosa genomes compared to that from P. aeruginosa. Table S11. Accession numbers of the predicted PilZ containing proteins found in 24 analyzed M. aeruginosa genomes.
Additional file 2: Table S6. Positive selection of genes related to c-di-GMP metabolism and regulation in M. aeruginosa.