UvA-DARE (Digital Academic Repository)

Highly plastic genome of Microcystis aeruginosa PCC 7806, a ubiquitous toxic freshwater cyanobacterium

BMC Genomics 2008, 9:274. doi:10.1186/1471-2164-9-274. Received: 7 March 2008. Accepted: 5 June 2008. Published: 5 June 2008. This article is available from: http://www.biomedcentral.com/1471-2164/9/274

© 2008 Frangeul et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background:
The colonial cyanobacterium Microcystis proliferates in a wide range of freshwater ecosystems and is exposed to changing environmental factors during its life cycle. Microcystis blooms are often toxic, potentially fatal to animals and humans, and may cause environmental problems. There has been little investigation of the genomics of these cyanobacteria.

Results:
Deciphering the 5,172,804 bp sequence of Microcystis aeruginosa PCC 7806 has revealed the high plasticity of its genome: 11.7% DNA repeats containing more than 1,000 bases, 6.8% putative transposases and 21 putative restriction enzymes. Compared to the genomes of other cyanobacterial lineages, strain PCC 7806 contains a large number of atypical genes that may have been acquired by lateral transfers. Metabolic pathways, such as fermentation and a methionine salvage pathway, have been identified, as have genes for programmed cell death that may be related to the rapid disappearance of Microcystis blooms in nature. Analysis of the PCC 7806 genome also reveals striking novel biosynthetic features that might help to elucidate the ecological impact of secondary metabolites and lead to the discovery of novel metabolites for new biotechnological applications. M. aeruginosa and other large cyanobacterial genomes exhibit a rapid loss of synteny in contrast to other microbial genomes.

Conclusion:
Microcystis aeruginosa PCC 7806 appears to have adopted an evolutionary strategy relying on unusual genome plasticity to adapt to eutrophic freshwater ecosystems, a property shared by another strain of M. aeruginosa . Comparisons of the genomes of PCC 7806 and other cyanobacterial strains indicate that a similar strategy may have also been used by the marine strain Crocosphaera watsonii WH8501 to adapt to other ecological niches, such as oligotrophic open oceans.

Background
Dated approximately 3 billion years old by fossil records, cyanobacteria were the first oxyphototrophic prokaryotes present on Earth [1]. As architects of the Earth's atmosphere they had a major impact on the evolution of aerobic metabolism and the evolution of life [2]. Cyanobacteria still play a fundamental role in the functioning of global ecosystems by significantly contributing to carbon fluxes [3,4] and by providing nitrogen used for primary production [5]. On the other hand, cyanobacterial blooms may lead to a loss of biodiversity in the phytoplanktonic communities and, by generating very high quantities of organic matter used by anoxygenic bacteria in the bottom layers of water resources, can cause massive death of fish by asphyxia [6]. The financial costs resulting from cyanobacterial proliferations are considerable (e.g. 200 million Australian dollars/year in Australia) [7].
Freshwater cyanobacteria of the genus Microcystis are distributed worldwide, and are involved in numerous proliferation events in stratified lakes [8]. In their natural environment, Microcystis cells are organized in large colonies of various sizes and shapes, which were used to define various morphospecies. Five of these have recently been reunified as a single species, Microcystis aeruginosa [9]. The factors that determine the morphological variation within this polymorphic cyanobacterial species are currently under debate.
The ecology of M. aeruginosa is characterized by an annual life cycle comprising a spring and summer pelagic phase, and an overwintering benthic phase [10]. During the pelagic phase, M. aeruginosa colonies migrate daily in the water column [11] and may accumulate to form blooms or scums on the surface of the water. Thus, on a daily basis, as well as during the benthic and pelagic phases, colonies are exposed to changing environmental conditions of light, temperature and oxygen concentrations.
In the last decade, cyanobacterial blooms have been involved in numerous cases of animal [12] and human [13] poisonings, mainly due to the ability of Microcystis cells to synthesize toxins, in particular variants of microcystin [14]. Many other oligopeptides, such as cyanopeptolins, aeruginosins, microginins, microviridins and cyclamides may also be produced [15]. Other peptides and congeners doubtless remain to be discovered, as do their respective biosynthesis pathways.
To gain further insight into the ecophysiology of Microcystis aeruginosa, we deciphered the genome sequence of the toxic strain PCC 7806. The results presented here associate descriptive genomics and comparisons with the genomes of other cyanobacteria isolated from freshwater and marine ecosystems to highlight the ecophysiological peculiarities of this strain, and put its particularly high genome plasticity into a cyanobacterial context.

General features of the M. aeruginosa PCC 7806 genome
The 12× shotgun sequencing project produced 90,000 sequence reads, and their assembly resulted in more than 500 contigs. After the first steps of a long finishing process performed using CAAT-Box [16] and Consed [17] software, the number of contigs was reduced to 328 (N50 = 100 kb), 116 of which were more than 3,000 bases in length (up to 533,374 bases). The genome contains an unusually high number of long DNA repeats. Most of the extremities of these contigs consist of repeated DNA sequences, including genes coding for transposases (see below). The 116 contigs were deposited in the EMBL database (AM778843-AM778958). The genome sequence of M. aeruginosa PCC 7806 (Mic-PCC7806), represented by these contigs, consists of 5,172,804 bases, with an average G+C content of 42%. These values are consistent with those previously determined using thermally denatured DNA [18]. The contigs were annotated using CAAT-Box software and a total of 5,292 predicted protein-coding sequences (CDSs) were validated manually. These CDSs were compared to several protein (UniProt, COG and 45 cyanobacterial proteomes) and motif databases (Prosite and Pfam).
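The two summary statistics quoted above (N50 and G+C content) can be illustrated with a minimal Python sketch. The function names and the toy contigs are hypothetical and only serve the example; the CAAT-Box and Consed workflows used by the authors are not reproduced here.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def gc_fraction(contigs):
    """Fraction of G and C bases over all contigs (case-insensitive)."""
    total = sum(len(c) for c in contigs)
    gc = sum(c.upper().count("G") + c.upper().count("C") for c in contigs)
    return gc / total


# Toy data, not taken from the paper.
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_contigs = ["GGCC", "AT"]
print(n50(toy_lengths), gc_fraction(toy_contigs))
```

Applied to the real contig set, `gc_fraction` would be expected to return a value close to the 42% reported in the text.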
All the genomes used for the comparative studies described below are listed in the Methods section.

Comparison with other cyanobacterial genomes
A concatenated dataset of large and small subunit rRNA sequences (23S and 16S rRNA) was used to construct a phylogenetic tree including Mic-PCC7806 and 37 other cyanobacterial strains (Figure 1). The tree is congruent with previously published ones based on 16S rRNA sequences [19,20], but shows higher statistical support at most nodes (especially internal ones), probably due to the larger number of positions used. The strains of the genus Microcystis form a well-supported group (BV of 853‰) with Synechocystis sp. (Syn-PCC6803), Crocosphaera watsonii (Cwa-WH8501) and Cyanothece sp. (Cth-CCY0110 and Cth-ATCC51142). Within this group, Microcystis is most closely related to Syn-PCC6803 (BV of 990‰).
The Mic-PCC7806 genome was compared to the recently released genome of Microcystis aeruginosa strain NIES-843 (Mic-NIES843) [21]. Although the average similarity between the orthologous genes is 94%, their comparison shows that the two genomes differ considerably in both length and gene composition (Table 1). Indeed, the Mic-NIES843 genome is 0.6 Mb longer than that of Mic-PCC7806. Moreover, the two genomes display a high number of strain-specific genes (838 for Mic-PCC7806 and 1,760 for Mic-NIES843). Interestingly, most of these genes are absent from 44 other complete cyanobacterial genomes, suggesting that they have recently been acquired independently in each of the two Microcystis strains. Although the two genomes contain the same proportions of large DNA repeats (~12%, see below), their distribution and size partly differ, since Mic-PCC7806 contains 48 repeats longer than 3,000 bases versus only 11 in Mic-NIES843. The comparison of the location of similar genes in the largest contig of the Mic-PCC7806 assembly (contig328) and in the Mic-NIES843 genome shows numerous genomic rearrangements (see Additional file 1). These rearrangements, probably facilitated by the presence of large repeats, make the Mic-NIES843 genome of little help in finishing the assembly of the Mic-PCC7806 genome sequence.
The 5,292 CDSs of the Mic-PCC7806 genome were also compared to the proteomes of 44 strains representing the diversity of the cyanobacterial lineages (all publicly available genomes excluding Mic-NIES843). The distribution of the best High Scoring Pairs (HSPs) found using Blastall software indicates a high similarity between the proteome of Mic-PCC7806 and a group of three strains Cth-ATCC51142, Cth-CCY0110 and Cwa-WH8501 (Table 2). This is puzzling, since Mic-PCC7806 is closer to Syn-PCC6803 than to this group in the 23S-16S phylogeny (Figure 1). In order to exclude possible bias introduced by uneven distribution of CDSs in these genomes, we analyzed only the orthologs shared by three of these genomes, Mic-PCC7806, Syn-PCC6803 and Cwa-WH8501. Based on BiDirectional Best Hit (BDBH) analyses, 1,789 CDSs of the Mic-PCC7806 genome were found to correspond to putative orthologs in Cwa-WH8501 and Syn-PCC6803. The mean Blast score of these CDSs was 381 for the comparison between Mic-PCC7806 and Cwa-WH8501, and only 366 for Mic-PCC7806 versus Syn-PCC6803. The distribution curve of all the Blast scores (see Additional file 2) showed that the Mic-PCC7806 genome was more closely related to Cwa-WH8501 than to Syn-PCC6803 for all score values considered. The absence of congruence between the results obtained with rDNA sequences and the core proteins means that additional data sets for other members of these three cyanobacterial genera are required. Nevertheless, the results obtained by comparing all the orthologous genes shared by Mic-PCC7806 (freshwater strain) and Cwa-WH8501 (marine strain) are consistent with the fact that freshwater and marine cyanobacteria are interspersed in global 16S rDNA phylogenetic trees [20].
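The BiDirectional Best Hit logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input dictionaries (best hit and Blast score per query protein), the function names and the toy identifiers are all hypothetical, and the sketch does not reproduce the authors' actual Blastall parameters or filtering.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed input shape: query protein ID -> (best target protein ID, Blast score).
BestHits = Dict[str, Tuple[str, float]]


def bidirectional_best_hits(a_vs_b: BestHits, b_vs_a: BestHits) -> Dict[str, str]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = {}
    for a_id, (b_id, _score) in a_vs_b.items():
        back = b_vs_a.get(b_id)
        if back is not None and back[0] == a_id:
            pairs[a_id] = b_id
    return pairs


def shared_orthologs(ref_vs_x: BestHits, x_vs_ref: BestHits,
                     ref_vs_y: BestHits, y_vs_ref: BestHits) -> List[str]:
    """Reference proteins that have a BDBH partner in both genome X and genome Y."""
    in_x = bidirectional_best_hits(ref_vs_x, x_vs_ref)
    in_y = bidirectional_best_hits(ref_vs_y, y_vs_ref)
    return sorted(set(in_x) & set(in_y))


def mean_score(ref_vs_other: BestHits, ids: Iterable[str]) -> float:
    """Mean Blast score of the reference proteins in `ids` against one genome."""
    ids = list(ids)
    return sum(ref_vs_other[i][1] for i in ids) / len(ids)


# Toy data (hypothetical identifiers and scores).
mic_vs_cwa = {"mic1": ("cwa1", 400.0), "mic2": ("cwa2", 300.0), "mic3": ("cwa9", 50.0)}
cwa_vs_mic = {"cwa1": ("mic1", 400.0), "cwa2": ("mic2", 300.0), "cwa9": ("mic7", 60.0)}
mic_vs_syn = {"mic1": ("syn1", 380.0), "mic2": ("syn5", 100.0)}
syn_vs_mic = {"syn1": ("mic1", 380.0), "syn5": ("mic8", 120.0)}

shared = shared_orthologs(mic_vs_cwa, cwa_vs_mic, mic_vs_syn, syn_vs_mic)
print(shared, mean_score(mic_vs_cwa, shared), mean_score(mic_vs_syn, shared))
```

Restricting the score comparison to the shared set is what removes the bias from uneven CDS content: both means are computed over exactly the same reference proteins.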
Three distinct groups of proteins were identified on the basis of Blastp analyses of the 5,292 CDSs of Mic-PCC7806, with a selection of 15 other cyanobacterial genomes displaying at least 1% of best Blastp hits with Mic-PCC7806 (Table 2). The composition of these groups largely depends on the threshold chosen to consider that two proteins are similar. Without an obvious breakpoint in the distribution of protein similarities between different genomes (see Additional file 2), we arbitrarily chose a threshold of 40% of similarity, considering that below this value two proteins do not share the same function. The three groups are as follows:
- The "maeru40" group included 764 CDSs (14.4%) specific to the Mic-PCC7806 genome and not found in the 15 selected genomes; 438 (8.3%) of them have no homolog in the UniProt database.
- The "core40" group comprised 652 proteins (12.3%) sharing significant Blastp scores with at least one CDS in each of the 15 other genomes tested.
- The last group, designated "other40", consisted of 3,876 CDSs (73%) sharing significant Blastp scores with CDSs in only some of the other 15 genomes tested.
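The three-way grouping can be expressed as a short Python sketch. It assumes, for illustration only, that each CDS comes with its best percent similarity against each of the 15 genomes and that "similar" means a value at or above the 40% threshold; the function name, the genome labels and the choice of an inclusive threshold are hypothetical. The last lines simply check that the reported group sizes add up to the 5,292 CDSs.

```python
def classify_cds(best_similarity, genomes, threshold=40.0):
    """Assign a CDS to maeru40, core40 or other40.

    best_similarity: genome name -> best percent similarity of this CDS
                     against that genome (missing genome = no hit).
    """
    hits = sum(1 for g in genomes if best_similarity.get(g, 0.0) >= threshold)
    if hits == 0:
        return "maeru40"   # no similar protein in any selected genome
    if hits == len(genomes):
        return "core40"    # similar protein in every selected genome
    return "other40"       # similar protein in only some genomes


# Hypothetical labels standing in for the 15 selected genomes.
genomes = [f"genome{i}" for i in range(1, 16)]

print(classify_cds({}, genomes))
print(classify_cds({g: 55.0 for g in genomes}, genomes))
print(classify_cds({"genome1": 80.0, "genome2": 35.0}, genomes))

# Consistency check on the group sizes reported in the text.
group_sizes = {"maeru40": 764, "core40": 652, "other40": 3876}
total = sum(group_sizes.values())
percentages = {name: round(100 * n / total, 1) for name, n in group_sizes.items()}
print(total, percentages)
```

Because the threshold is arbitrary, as the text notes, rerunning such a classification with a different `threshold` value would shift CDSs between the three groups.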
The small percentage of CDSs in the core40 group reflects the wide diversity of the cyanobacterial genomes analyzed. In the other40 group, the distribution of the Mic-PCC7806 CDSs among the tested genomes matches their phylogenetic distances based on 23S-16S rDNA sequences. For example, in this group, 10% of the CDSs were present in all the genomes, apart from that of Gvi-PCC7421, which is the most distant phylogenetically (Figure 1). Moreover, the four closest genomes to Mic-

Figure 1. Phylogenetic maximum likelihood (ML) tree based on the concatenated 23S-16S rDNA sequences of diverse cyanobacterial lineages.

Plasticity of the genome of M. aeruginosa PCC 7806
Large number of long repeated sequences
The Mic-PCC7806 genome includes a very large number of DNA sequences containing more than 1,000 bases that are repeated at least twice in the genome with more than 90% identity. A comparative analysis of all the cyanobacterial genome sequences available in databases showed that Mic-PCC7806, Mic-NIES843 and Cwa-WH8501 are particularly rich in such DNA repeats. Indeed, they account for 11.7%, 11.7% and 19.8% of the total DNA length, respectively (Figure 2). The cumulative size of the repeated DNA sequences is not strictly a function of genome length, as the Mic-PCC7806 and Cwa-WH8501 genomes have the highest percentage of DNA repeats, but are of intermediate size relative to the other cyanobacterial genomes (see Additional file 3). In the Mic-PCC7806 genome, 1,346 CDSs (25%) are located within these DNA repeats. Among these CDSs, only 256 and 92 belong to the maeru40 and core40 groups, respectively. Most of the CDSs of the core40 group correspond to orthologs that are not located within DNA repeats in other cyanobacterial genomes. This implies that over the course of evolution, resident genes were probably captured by mobile genetic elements. A large number of CDSs (362) are very similar to transposases from the COG database, and 93% of them are located within long repeated DNA sequences. At least 46 transposases correspond to ISMae1A/2/3/4 that had previously been characterized in strain PCC7806 [22], but a large majority of the other transposases cannot be clearly associated with any known insertion sequence (only 17 are associated with IS30, 7 with IS1 and 3 with IS5). The genome of Cwa-WH8501 also contains numerous putative transposases. One third of them are associated with IS5, but none with IS30; the repeated DNA sequences are therefore different in each genome, and cannot account for the close phylogenetic relationship between these two strains.
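Once repeat copies have been located, the percentage of the genome they occupy can be computed by merging overlapping intervals so that no base is counted twice. The sketch below assumes the repeat coordinates are already available as half-open (start, end) intervals; detecting the repeats themselves (more than 1,000 bases, more than 90% identity) is outside its scope, and the function name and toy coordinates are hypothetical.

```python
def repeat_coverage(intervals, genome_length, min_length=1000):
    """Fraction of the genome covered by repeats longer than min_length.

    intervals: iterable of half-open (start, end) repeat coordinates.
    Overlapping or touching intervals are merged before summing.
    """
    kept = sorted((s, e) for s, e in intervals if e - s > min_length)
    covered = 0
    cur_start = cur_end = None
    for start, end in kept:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)         # extend the current block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Toy coordinates on a 10,000-base genome; the 500-base repeat is ignored.
toy_repeats = [(0, 1500), (1000, 2500), (5000, 6200), (7000, 7500)]
print(repeat_coverage(toy_repeats, 10_000))
```

For scale, 11.7% of the 5,172,804 bases of Mic-PCC7806 corresponds to roughly 0.6 Mb of repeated sequence.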

Synteny of cyanobacterial genomes
Although Mic-PCC7806 and Mic-NIES843 are very closely related strains (Figure 1), their genomes contain a high number of rearrangements. Moreover, an unexpectedly low level of synteny was also observed between the Microcystis strains and two close relatives, Cwa-WH8501 and Syn-PCC6803 (68% mean CDS similarity). Since the same observation was made for all the cyanobacterial genomes tested, we compared the dynamics of these genomes using a large set of other bacterial genomes chosen on the basis of their sizes and phylogenetic distances.
To this end, a synteny score was calculated for a number of genome pairs (see Methods), and then compared to their evolutionary distance based on the 23S-16S rDNA tree. This analysis showed that the synteny scores for cyanobacterial genomes were significantly lower than those obtained for pairs of non-cyanobacterial genomes with similar genome lengths and 23S-16S phylogenetic distances (Figure 3). Similar results were obtained for all the cyanobacterial genomes tested. This means that the low synteny scores observed cannot be related to the long DNA repeated sequences, which occur only in the Mic-PCC7806 and Cwa-WH8501 genomes. These results are in agreement with those of Fang et al. [23], who showed that both persistent and rare genes are significantly clustered in most of the 169 bacterial genomes analyzed. However, in a minority subset of bacterial genomes that includes the cyanobacteria, persistent genes were found to be fairly uniformly distributed throughout the genome.
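The exact synteny score is defined in the Methods section, which is not reproduced here. Purely as an illustration of the idea, the sketch below uses an assumed stand-in definition: the fraction of neighboring ortholog pairs in one genome that are also neighbors in the other. The function name, the definition and the toy gene orders are hypothetical and should not be read as the authors' measure.

```python
def synteny_score(order_a, order_b):
    """Assumed score: share of adjacent ortholog pairs of genome A
    that are also adjacent (in either orientation) in genome B.

    order_a, order_b: ortholog-group IDs in chromosomal order; each ID is
    expected at most once per genome. Genomes are treated as linear.
    """
    shared = set(order_a) & set(order_b)
    a = [g for g in order_a if g in shared]
    b = [g for g in order_b if g in shared]
    if len(a) < 2:
        return 0.0
    adjacent_in_b = {frozenset(pair) for pair in zip(b, b[1:])}
    conserved = sum(1 for pair in zip(a, a[1:]) if frozenset(pair) in adjacent_in_b)
    return conserved / (len(a) - 1)


# Toy gene orders: identical order, then a rearranged order.
print(synteny_score([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
print(synteny_score([1, 2, 3, 4, 5], [2, 1, 5, 3, 4]))
```

Plotting such a score against the 23S-16S distance for many genome pairs is the kind of comparison the paragraph describes: at equal phylogenetic distance, a lower score indicates faster loss of gene order.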
Interestingly, only 8 clusters with at least 4 CDSs remain syntenic in the genomes of Mic-PCC7806, Cwa-WH8501 and Syn-PCC6803. Four of these clusters correspond to ribosomal proteins. The other clusters are shown in Table 3. Considering the very low level of synteny between cyanobacterial genomes, it is likely that these specific clusters have been subjected to strong positive selection pressure and may play essential roles in these cyanobacteria. Some of these clusters are clearly linked to a specific biological function, such as the transport of phosphate (see Additional file 4) [24], while others consist of conserved proteins with unknown functions. One can thus speculate that these proteins may be involved in the same biological pathway as their close neighbors.

Intergenic regions
Four groups can clearly be identified among the cyanobacterial genomes studied on the basis of their intergenic distances (Figure 4). The first consists solely of the genome of Ter-IMS101, which harbors exceptionally long intergenic regions. To the best of our knowledge, no data have been published on this genome, which makes it impossible to rule out the possibility that these regions result from the poor quality of the sequence or the syntactic annotation. The second group includes the genome of Mic-PCC7806 and, among others, those of Cwa-WH8501 and Syn-PCC6803, which have a high proportion of intergenic sequences around 300 bases long; in the case of the Mic-PCC7806 genome, less than 35% of intergenic sequences are shorter than 100 bases. The third group comprises the genomes of Syn-PCC7942, Tel-BP1 and Gvi-PCC7421, which have short intergenic regions, similar in size to those found in a number of other bacterial genomes (see Additional file 5). The fourth group includes some members of the Prochlorococcus genus that have very small genomes with short or no intergenic regions.

Figure 2. Percentage of DNA repeated sequences in the total genome length. This analysis was performed on complete and in-finishing (*) cyanobacterial genomes. The strain identifiers are listed in the Methods section. Only DNA repeats containing more than 1000 bases, and with an identity threshold >90%, are taken into account.
The mean length of the intergenic sequences seems to be linked to the genome size of the cyanobacterium, except for the genome of Syn-PCC6803, which is smaller (3.6 Mb) than that of Gvi-PCC7421 (4.6 Mb), but harbors longer intergenic sequences. Although the role of long intergenic sequences in most cyanobacterial genomes remains unclear, we can surmise that they might be involved in the modulation of gene expression, which would allow cells to acclimate to rapid environmental changes.
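Intergenic distances of the kind compared above can be derived directly from gene coordinates. The following sketch assumes 1-based inclusive (start, end) coordinates on a single contig and ignores strand; overlapping genes contribute no intergenic region. The function name and the toy coordinates are hypothetical.

```python
def intergenic_lengths(genes):
    """Lengths of the gaps between consecutive genes on one contig.

    genes: list of (start, end) pairs, 1-based inclusive coordinates.
    Overlapping or abutting genes yield no gap.
    """
    ordered = sorted(genes)
    if not ordered:
        return []
    gaps = []
    prev_end = ordered[0][1]
    for start, end in ordered[1:]:
        gap = start - prev_end - 1
        if gap > 0:
            gaps.append(gap)
        prev_end = max(prev_end, end)
    return gaps


# Toy annotation: one 50-base gap, one overlap, one 300-base gap.
toy_genes = [(1, 100), (151, 300), (290, 400), (701, 900)]
gaps = intergenic_lengths(toy_genes)
short_fraction = sum(1 for g in gaps if g < 100) / len(gaps)
print(gaps, short_fraction)
```

The fraction of gaps shorter than 100 bases computed this way corresponds to the statistic quoted for Mic-PCC7806 (less than 35%).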

Cluster of atypical genes
In order to explore the plasticity of the Mic-PCC7806 genome further, the number of CDSs with an atypical dinucleotide composition was determined using a first-order Markov chain-based methodology [25]. This method can identify genes that may have been acquired recently by lateral transfers. In the Mic-PCC7806 genome, a total of 1,971 atypical genes were found, including 1,402 within 159 clusters of atypical genes (CAGs) that probably correspond to recently acquired foreign genomic elements (Table 4). As expected, more than 98% of Mic-PCC7806 genes belonging to the core40 group were not in CAGs, and 31% of the atypical genes were in the maeru40 group. Moreover, a high percentage (80%) of the transposase genes were in CAGs (16% of the genes present in CAGs encode putative transposases). Compared to seven other cyanobacterial genomes, those of Mic-PCC7806 and Mic-NIES843 harbor the highest percentages of atypical genes (37%) and CAGs (34% and 36%, respectively). These findings may indicate that the Microcystis genomes contain a higher proportion of genes recently acquired by lateral transfers than the other genomes studied.
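The principle behind a first-order Markov chain test for atypical composition can be sketched as follows: transition probabilities between consecutive bases are estimated from the whole genome, and each gene is scored by its mean log-likelihood under that model, so that low-scoring genes stand out as compositionally atypical. This is a simplified illustration, not the method of reference [25]; the function names, the pseudocount and the toy sequences are assumptions, and no significance cutoff is implied.

```python
import math
from collections import Counter

BASES = "ACGT"


def transition_log_probs(genome, pseudocount=1.0):
    """Log P(next base | current base), estimated from the genome sequence."""
    counts = Counter(zip(genome, genome[1:]))
    logp = {}
    for a in BASES:
        row_total = sum(counts[(a, b)] for b in BASES) + 4 * pseudocount
        for b in BASES:
            logp[(a, b)] = math.log((counts[(a, b)] + pseudocount) / row_total)
    return logp


def mean_log_likelihood(gene, logp):
    """Average log-probability per dinucleotide step; lower = more atypical."""
    steps = [pair for pair in zip(gene, gene[1:]) if pair in logp]
    if not steps:
        raise ValueError("gene too short or contains no A/C/G/T steps")
    return sum(logp[pair] for pair in steps) / len(steps)


# Toy genome dominated by AT steps; the GC-rich gene scores much lower.
toy_genome = "AT" * 500
model = transition_log_probs(toy_genome)
typical = mean_log_likelihood("ATATATAT", model)
atypical = mean_log_likelihood("GGCCGGCC", model)
print(typical, atypical)
```

In practice a cutoff on this score, and a rule for grouping neighboring atypical genes into clusters, would still have to be chosen; neither is specified in this section.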

Putative restriction and modification systems
Blast searches for restriction enzymes and examination of genes surrounding DNA methylases identified 21 potential restriction enzymes (see Additional file 6), seven of which were found to be co-localized with putative methylases (see Additional file 7) in the Mic-PCC7806 genome. The Mic-NIES843 genome also contains a high number (at least 17) of putative restriction enzymes [21]. Blast searches revealed that 14 restriction enzymes are common to both genomes. In contrast, seven and eight restriction enzymes seem specific to Mic-PCC7806 and Mic-NIES843, respectively. The Microcystis aeruginosa strains might thus constitute a rich source of novel restriction enzymes potentially useful in biotechnology. According to Zhao et al. [26], filamentous cyanobacteria (Anabaena, Spirulina and Nostoc strains) contain more restriction and modification genes than unicellular cyanobacteria (Synechocystis, Synechococcus and Prochlorococcus strains). Based on COG annotations, at least as many restriction-modification genes were found in Mic-PCC7806, Mic-NIES843 and Cwa-WH8501 as in filamentous cyanobacteria. Thus, rather than corresponding to a difference between filamentous and unicellular cyanobacteria, the restriction-modification gene content of Microcystis aeruginosa may reflect the potential exposure of the cells to high concentrations of foreign DNA due to the presence of numerous other bacterial cells or viruses associated with Microcystis colonies [27]. This exposure to foreign DNA is also consistent with the high number of CAGs putatively acquired by lateral transfers. Whether such a hypothesis might also hold true for planktonic cyanobacteria of the genus Crocosphaera remains an open question.

Figure 4. Distribution of the intergenic distances in diverse cyanobacterial genomes.
In bacterial genomes containing a high number of genes for restriction enzymes, short palindromic sequences corresponding to the target sites of these enzymes may be under-represented [28]. Since the genomes of Microcystis aeruginosa and Cwa-WH8501 contain a very high number of putative restriction enzymes, there should be a number of under-represented short sequences that correspond to restriction sites. To test this hypothesis, the number of occurrences of each 6-mer was counted, and a frequency distribution calculated for Mic-PCC7806, Mic-NIES843, Cwa-WH8501 and Syn-PCC6803 (Table 5). The under-represented sites in the first three genomes were not found in Syn-PCC6803, a genome devoid of restriction enzymes [29], supporting the idea that these rare 6-mers could indeed correspond to restriction enzyme sites. In total, there are 4,096 possible 6-mers, 1.5% of which are palindromes. Fifty-one percent of the rarest 1% of 6-mers in the Mic-PCC7806 genome are palindromes (see Additional file 8). Palindromes are thus over-represented among the rarest 6-mers, further supporting the hypothesis that they could correspond to sites cut by restriction enzymes. The identity of the rarest 1% of 6-mers in the Mic-PCC7806 genome was compared to known restriction sites in other organisms as identified by New England Biolabs [30]. We found that 20 of the 41 sites corresponded to sites cut by restriction enzymes in other organisms.
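The 6-mer analysis lends itself to a compact Python sketch. It counts every 6-mer on one strand of a sequence, picks the rarest 1%, and measures the share of reverse-complement palindromes among them. The function names and the toy sequence are hypothetical, and counting a single strand is a simplification; the authors' exact counting procedure is not described in this section.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(kmer):
    """True if the k-mer equals its own reverse complement (e.g. GAATTC)."""
    return kmer == reverse_complement(kmer)


def kmer_counts(seq, k=6):
    """Occurrences of every possible k-mer on the given strand (zeros included)."""
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=k)})
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    return counts


def rarest_kmers(counts, fraction=0.01):
    """The given fraction of k-mers with the lowest counts (ties broken alphabetically)."""
    n = max(1, round(len(counts) * fraction))
    return sorted(counts, key=lambda kmer: (counts[kmer], kmer))[:n]


all_6mers = ["".join(p) for p in product("ACGT", repeat=6)]
palindromes = [kmer for kmer in all_6mers if is_palindrome(kmer)]
print(len(all_6mers), len(palindromes), len(palindromes) / len(all_6mers))

toy_counts = kmer_counts("GAATTCGAATTC")
rare = rarest_kmers(toy_counts)
print(toy_counts["GAATTC"], len(rare))
```

The enumeration reproduces the figures in the text: 64 of the 4,096 possible 6-mers (about 1.5%) are palindromes, and the rarest 1% amounts to 41 sequences, the same number of sites that was compared against the New England Biolabs list.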
A novel DNA modification system was discovered recently in the Gram-positive bacterium Streptomyces lividans 66 [31]. This system results in the degradation of DNA in vitro by oxidative, double-stranded, site-specific cleavage during electrophoresis, and is determined by a cluster of five genes (dndA-B-C-D-E). The dnd gene products incorporate sulfur into the DNA backbone as a sequence-selective, stereospecific phosphorothioate modification [32]. According to He et al. [33], the resistance of phosphorothioate linkages to a variety of nuclease activities, and the site-specific nature of such a modification, suggest that phosphorothioates could have a role comparable to that of DNA methylation in protection against nucleases.
Although the presence of dndB homologs is not clear in the genomes of cyanobacteria, the rest of the cluster was found in several of them including Mic-PCC7806 (see Additional file 9). Despite the low level of synteny in cyanobacterial genomes (see above), the dndC-D-E genes are still clustered.

Unraveling genetic features related to the ecophysiology of M. aeruginosa PCC 7806
Life cycle, colony formation and floatation
During the overwintering benthic phase of their life cycle, Microcystis colonies withstand long periods of darkness. A fermentation pathway has been proposed based on biochemical data [34]. All the genes coding for the enzymes required for the various steps in this pathway have been identified in the genome sequence (see Additional file 10). During the benthic phase, Microcystis colonies are exposed to lower temperature and higher pressure. In this respect, it is interesting to note the presence of a gene (mic5251) coding for a protein similar to Hik33 that perceives osmotic stress and cold stress in Syn-PCC6803 [35]. Another gene, mic5237, is similar to the Ana-PCC7120 orrA gene whose product is involved in osmoregulation [36]. A genomic island carrying actM and pfnM, two genes that encode the eukaryotic-like proteins actin and profilin (an actin cognate binding partner), respectively, has been discovered in the Mic-PCC7806 genome. As shown by Guljamow et al. [37], this eukaryotic-like actin forms a shell-like structure that could strengthen cell resistance to hydrostatic and osmotic pressures. Interestingly, these genes are only present in Microcystis cells that inhabit the Braakman water reservoir (The Netherlands), which was cut off from the sea in the 20th century, and from which the Mic-PCC7806 strain was originally isolated.
Although several different M. aeruginosa morphotypes have been described [38], little is known about their colony formation. The genome sequence of strain Mic-PCC7806 revealed a gene coding for a lectin (mvn; mic3128), which binds specifically to a sugar moiety present on the surface of Mic-PCC7806 cells, and a binding partner has been identified in the lipopolysaccharide fraction [39]. A functional correlation between the potent toxin microcystin and this lectin has been demonstrated, with possible implications for the formation of colonial aggregates that are characteristic of different Microcystis morphotypes. Another protein, MrpC (microcystin-related protein C), has been shown to be a potential target of an O-glycosyltransferase of the SPINDLY family [40]. In situ, this protein accumulates at the cell surface, and is involved in cellular interactions. Microcystins may therefore have an impact on the aggregation of Microcystis cells, which is very important for the competitive advantage of these organisms over other phytoplankton species. Mvn and MrpC are predominantly encoded in toxic strains ([38] and E. Dittmann, unpublished data), but not in the genome of Mic-NIES843. The latter strain may thus represent an ecotype that differs from Mic-PCC7806 in the characteristics of the cell surface. Genes coding for a Ser/Thr kinase (mic0129) and a Ser/Thr phosphatase of the PPP family (mic4622) are found within two clusters that may be involved in cell wall synthesis. Mic-PCC7806 also has two genes that encode Wzc-like protein Tyr kinases (mic2086 and mic1089) and three genes coding for Wzb-like protein Tyr phosphatases (mic3515, mic3588 and mic6566). In E. coli, the function of these systems is known to be related to the synthesis of the cell wall and polysaccharides [41]. These kinases/phosphatases could potentially be involved in colony formation.
Colony migration depends not only on the cell ballast resulting from the accumulation of photosynthates and the size of the colonies, but also on the synthesis of gas vesicles (GV), intracellular structures providing cells with buoyancy [42]. The Mic-PCC7806 genome carries a cluster of 12 genes required for GV synthesis, two of which, gvpV and gvpW, are novel [43]. The mic1271 and mic1270 genes are highly similar to the genes coding for a light-regulated two-component system in Syn-PCC6803. This system, which consists of a cyanobacterial phytochrome (Cph1) and its response regulator (Rcp1), has been proposed to play a role in the control of processes required for the adaptation from light to dark conditions and vice versa [44]. Moreover, all the genes involved in circadian rhythm [45] are present in Mic-PCC7806 (see Additional file 11). Whether day-night cycles and the timing of vertical migration of Microcystis colonies in the water column are controlled by this phytochrome and by the circadian clock mechanism is worth testing.
In natural populations of Microcystis, oxidative stress was shown to induce programmed cell death (PCD) [46]. Accordingly, five putative eukaryotic caspase-like genes were identified by PSI-Blast in the genome of strain Mic-PCC7806. Three of them (Mic0980, Mic3930 and Mic4051) showed best similarity with Mic-NIES843 proteins that lack caspase-like motifs. Consequently, these three proteins are likely involved in functions other than PCD. In contrast, the Mic1068 protein showed similarity in the caspase-like region with one protein of Mic-NIES843 (MAE24870). The last caspase-like protein of Mic-PCC7806 (Mic5406) is strain-specific. Both mic1068 and mic5406 are expressed, and a cross-reaction with human caspase-3 polyclonal antisera was observed, indicating that the proteins are synthesized (data not shown). Alignment of the regions containing the conserved caspase domains of Mic1068, Mic5406, MAE24870 and a yeast metacaspase shows that the Histidine-Cysteine catalytic dyad of the key functional regions of the caspases is conserved (see Additional file 12). PCD might thus be triggered when Microcystis cells are exposed to severe environmental stress conditions, leading to the rapid decline of blooms, as has been suggested by Berman-Frank et al. in the case of Ter-IMS101 [47]. Mic-PCC7806 and Mic-NIES843 are the only unicellular cyanobacteria known to have genes coding for HstK-like kinases (mic1879 and mic1015), proteins characterized by the presence of both His and Ser/Thr kinase domains [48,49]. Some of these kinases are implicated in either the iron homeostasis/oxidative stress response or in the differentiation of N2-fixing cells in filamentous cyanobacteria ([48,49] and C-C Zhang, unpublished data). Cell differentiation does not occur in M. aeruginosa, but it would be interesting to test whether these HstK-like protein kinases are involved in iron homeostasis and/or in the control of programmed cell death in response to oxidative stress.
It has been proposed that the methionine recycling pathway may contribute to preventing oxidative stress in Bacillus subtilis [50,51]. Interestingly, all the genes involved in this pathway are present in the Mic-PCC7806 genome (see Additional file 13). One of these genes, mtnW (rbcL IV ), encodes a 2,3-diketo-5-methylthiopentyl-1-phosphate enolase that has been identified in all the Microcystis strains tested including Mic-NIES843 [21,52], but not in other cyanobacteria for which the genome sequences are available, except Lae-PCC8106 (accession n° ZP_01618990) and Cth-PCC8801 (accession n° ZP_02940034). The putative methionine recycling pathway may thus have a specific role related to the lifestyle or ecological niches inhabited by members of the genera Microcystis, Lyngbya and Cyanothece.
Genetic potential for the production of secondary metabolites
Cyanobacteria are known as prolific producers of natural products, in particular of the nonribosomal peptide and polyketide classes [15,53]. However, the potential to produce complex secondary metabolites largely varies among the cyanobacterial genera and species, and even among individual strains. Remarkably, the genomes of Mic-PCC7806, Mic-NIES843 and Cwa-WH8501 differ from unicellular cyanobacteria of other genera in that they contain a large number of genes that encode nonribosomal peptide synthetases (NRPS) and polyketide synthases (PKS). Interestingly, such genes in Mic-PCC7806 outnumber those found in Mic-NIES843 and Cwa-WH8501 (Table 6). Apart from the terrestrial filamentous strain Npu-PCC73102, Mic-PCC7806 devotes the largest percentage of its genome (~3.5%) to secondary metabolite production (Table 6) [54].
The strain Mic-PCC7806 is known to produce two isoforms of microcystin [55]. The corresponding genes in the bi-directional mcyA-J gene cluster encoding NRPS, PKS and tailoring enzymes [56,57] could be re-assigned during the genome sequencing project (Figure 5). Genes for cyanopeptolin biosynthesis (mcn cluster) could be assigned based on the amino acid specificities of the substrate-activating domains of a second NRPS gene cluster that was congruent with the amino acid moieties contained in the cyanopeptolin structure [58] (Figure 5). The mcn genes of Mic-PCC7806 display some similarity to the anabaenopeptilide genes of Anabaena strain 90 [59] and to the cyanopeptolin genes of Microcystis wesenbergii [60]. In addition, the genome of Mic-PCC7806 harbors three NRPS and PKS gene clusters (Figure 5). One of the clusters displays some similarity to the cluster involved in the production of the protease inhibitor aeruginoside in Planktothrix agardhii Cya 126 [61]. The genomic data therefore indicate that strain Mic-PCC7806 might be capable of producing a variant of aeruginosin (Figure 5).
The two remaining PKS I gene clusters do not show significant similarity to any known cyanobacterial biosynthetic gene clusters, and may be involved in the production of hitherto unknown compounds (Figure 5 and Table 6). The first gene cluster encodes an iterative PKS I that is similar in both architecture and sequence to the PksE of various actinobacteria, and is accompanied by several tailoring enzymes including three halogenases. The actinobacterial enzyme is involved in the biosynthesis of enediyne-type antitumor antibiotics [62]. The second PKS gene cluster encodes a modular PKS I complex accompanied by several putative tailoring enzymes, and a PKS III type enzyme that is capable of synthesizing compounds of the chalcone/stilbene family. These biosynthetic enzymes are widespread in plants but have only recently been discovered in bacteria [63]. A comparison of the biosynthetic potential of Mic-PCC7806 and Mic-NIES843 reveals that three of the large NRPS/PKS complexes, namely those dedicated to microcystin, cyanopeptolin and aeruginosin production, are encoded on both genomes, whereas some other gene clusters are not shared by both genomes. The biosynthetic versatility of members of the genus Microcystis may thus be larger than expected, since the two strains selected for genome sequencing have similar chemotypes. Besides the NRPS and PKS encoding genes, the genome of Mic-PCC7806 contains a gene cluster similar to the patellamide genes that were recently detected in symbiotic cyanobacterial strains of ascidians [64]. Patellamides are a family of cyclic peptides generated from a ribosomally-synthesized precursor. Mic-PCC7806 is the first freshwater cyanobacterium showing the capability to produce patellamide-like peptides. A peptide with striking similarity to the patellamides, microcyclamide, has been reported in M. aeruginosa strain NIES-298 [65].
Chemical analyses have revealed that the gene cluster discovered in Mic-PCC7806 is indeed dedicated to the production of a microcyclamide-type compound [66]. The genome of Mic-PCC7806 could attract further attention, as it also contains gene clusters comprising unique features that have yet to be characterized and which may well produce so-far unidentified natural substances.
Transporter genes are commonly found in the immediate vicinity of the secondary metabolite biosynthetic genes.
These secondary metabolites may therefore at least partly function at the surface of Microcystis cells, in the colony-surrounding sheath or in their planktonic environment. Gene clusters involved in the synthesis of secondary metabolites are frequently associated with genes that confer resistance to these metabolites, which would otherwise be toxic to the cells producing them. In Mic-PCC7806, only the transport system associated with the uncharacterized PKS I/PKS III hybrid compound (Figure 5) shows any similarity to typical efflux transporters that potentially confer self-resistance. The compound produced could therefore have an allelopathic or antibacterial role in the environment [67].

Conclusion
Among bacteria, members of the genus Microcystis have a particularly high potential for the production of complex secondary metabolites, although this is lower than that of some actinobacterial and myxobacterial genomes that have been shown to devote up to 10% of their coding capacity to the production of secondary metabolites [68]. Genomics has already been useful to the study of secondary metabolites, and has restored natural product research as a major field of pharmaceutical research [69]. Analysis of the Mic-PCC7806 genome has revealed striking novel biosynthetic features that might help to explain the ecological impact of these compounds, as well as guide the search for novel metabolites of biotechnological importance.
Data mining of the genome sequence of Mic-PCC7806 has also shed light on genes that are of importance for the colonial life style and survival of this cyanobacterium in its natural habitat, either during the benthic phase or when it forms blooms on the surface of the water. One of the most intriguing features of this genome is its exceptional plasticity, characterized by a very large number of long repeated sequences, and genes encoding transposases and putative restriction enzymes. These biological entities may generate deletions, duplications, conversions, and rearrangements in the chromosome [70]. One illustration of these changes is the marked loss of synteny between this genome and other cyanobacterial genomes. In addition, the presence of a large number of clustered atypical genes in the genome of Mic-PCC7806 suggests that frequent gene acquisition events by lateral transfers have occurred.
Genome plasticity in prokaryotes is often considered to be an adaptive strategy allowing microorganisms to promote diversification in a way similar to sexual reproduction in eukaryotic organisms. However, genomic rearrangements can also impede the co-expression of genes [71] and disrupt gene dosage effects [70]. The resulting trade-off between gene conservation and rearrangement in the chromosome depends on various factors and processes linked to the ecophysiology of the microorganisms. The cost of chromosome rearrangements may be greater for fast-growing bacteria than for slow-growing ones such as cyanobacteria [72]. The relative importance of the process of gene co-expression in cyanobacteria is more difficult to evaluate. However, it is worth noting that some of the eight syntenic clusters found in Mic-PCC7806 concern transport systems for nutrients, such as phosphate, which is often the limiting factor in marine and freshwater ecosystems.
Although Syn-PCC6803, Cwa-WH8501 and Mic-PCC7806 are phylogenetically closely related, only the last two strains have highly plastic genomes containing high proportions of long DNA repeats and transposase genes. No obvious explanation can be deduced from the ecophysiological features of these two strains. Indeed, members of the genus Microcystis are freshwater colonial cyanobacteria that proliferate in eutrophic ecosystems (e.g. ≤ 2 × 10⁷ cells/ml in [73]) while the Crocosphaera are marine nitrogen-fixing cyanobacteria living in oligotrophic open oceans (≤ 10³ cells/ml [74]). Microcystis colonies may display chaotic population dynamics, with alternating explosion and crash phases [75], but to the best of our knowledge, no such data are available for Crocosphaera. Such chaotic population dynamics could explain the widespread occurrence of rearrangements in the Mic-PCC7806 genome, if, as proposed by Helm et al. [76] for Salmonella serovars, bottlenecks and genetic drifts generally promote the fixation of mildly harmful rearrangements.
More genome sequences of members of the Microcystis and Crocosphaera genera are required to clarify the molecular basis of their genome plasticity, at both the intergeneric and intraspecies levels. This will also provide a deeper understanding of the evolutionary significance of this mode of adaptation to the environment. The ongoing sequencing of such genomes should make it possible to reach this goal in the near future. More generally, large cyanobacterial genomes constitute excellent model systems for studying genome dynamics and the mechanism(s) by which some gene clusters may escape rearrangement and retain the same physical organization in several different lineages.
[Table: Pairs of cyanobacterial genomes used in Figure 3]

DNA preparation and sequencing
The strain Microcystis aeruginosa PCC 7806 (kept in constant culture since its isolation in 1978; Pasteur Culture Collection, Paris, France [18]) was grown as described [52]. The genome sequence of Mic-PCC7806 was determined by a whole-genome shotgun strategy. Two libraries were generated using genomic DNA extracted with the Nucleobond AGX500 kit (Macherey-Nagel, Hoerdt, France) and sheared by nebulization. The first library contained inserts from 1 to 4 kb cloned in pcDNA2.1 (Invitrogen Life Technologies, Carlsbad, CA, USA) and the second included inserts from 5 to 8 kb cloned in the low-copy vector pSYX34 (gift of F. Kunst, Institut Pasteur, Paris, France). A BAC library was constructed in the vector pBeloBAC11 (inserts ≤ 20 kb) (Epicentre, Madison, USA) using spooled DNA extracted as described [78] and partially hydrolyzed with HindIII.
Plasmid DNA purification was performed using the Montage Plasmid Miniprep96 Kit (Millipore, Molsheim, France) or the TempliPhi DNA sequencing template amplification kit (GE Healthcare, Uppsala, Sweden). The BAC Miniprep96 Kit (Millipore, Molsheim, France) was used for BAC templates. Sequencing reactions were done from both ends of the DNA inserts, using the ABI PRISM BigDye Terminator cycle sequencing ready reactions kit, and run on a 3700 Genetic Analyzer (Applied Biosystems, Foster City, CA, USA). The trace files were assembled with the Phred-Phrap-Consed package [79]. Additional sequencing reactions were performed to close gaps, improve coverage and resolve sequence ambiguities, using PCR products amplified from genomic DNA or plasmid DNA templates.

Phylogenetic analysis
A dataset containing a concatenation of the 16S and 23S sequences was aligned by Muscle [80], and the alignment was manually edited to remove ambiguously aligned positions, giving a final dataset of 4195 nucleotide positions for phylogenetic analysis. From this dataset, a maximum likelihood tree was calculated by Phyml [81], using the HKY model of nucleotide evolution with an estimation of the transition/transversion ratio, including 4 rates of site heterogeneity, an estimated number of invariable positions, and an estimated alpha shape parameter. The numbers at the nodes correspond to the bootstrap values calculated on 1000 resampled datasets by Phyml.

Syntenic score computation
Ten orthologs located on either side of one pair of putatively orthologous CDS (linked by BDBH) were analyzed. For each pair of orthologous genes located in the proximity of the tested gene and of its ortholog, the synteny score was incremented by 1. Using this method of calculation, two totally syntenic genomes will have a score of 20 attributed to each of their orthologs, whereas two non-syntenic genomes will have a score of 0.
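The scoring rule described above can be illustrated with a short Python sketch. This is one possible reading of the procedure and not the implementation used in this study: the function name `synteny_score`, the gene-order lists, the ortholog dictionary and the treatment of the chromosome as linear (no wrap-around) are all assumptions made for illustration.

```python
def synteny_score(gene, order_a, order_b, ortholog, flank=10):
    """Count the neighbours of `gene` (within `flank` genes on either side in
    genome A) whose ortholog lies within `flank` genes of the ortholog of
    `gene` in genome B. Hypothetical helper; chromosomes are treated as linear."""
    pos_a = {g: i for i, g in enumerate(order_a)}
    pos_b = {g: i for i, g in enumerate(order_b)}
    i = pos_a[gene]
    j = pos_b[ortholog[gene]]
    score = 0
    for k in range(max(0, i - flank), min(len(order_a), i + flank + 1)):
        if k == i:
            continue  # the tested gene itself is not counted
        mate = ortholog.get(order_a[k])  # None if the neighbour has no ortholog
        if mate is not None and abs(pos_b[mate] - j) <= flank:
            score += 1
    return score


# Two perfectly collinear toy genomes of 41 genes each (hypothetical data).
order_a = [f"a{n}" for n in range(41)]
order_b = [f"b{n}" for n in range(41)]
bdbh = {f"a{n}": f"b{n}" for n in range(41)}
print(synteny_score("a20", order_a, order_b, bdbh))  # 20
```

With this reading, a gene in the middle of a fully conserved region reaches the maximum of 20, and a gene whose neighbours have no nearby orthologs in the other genome scores 0.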

Restriction-modification enzymes
Putative restriction enzymes were identified by Blast searching of known type I and II restriction enzymes against the Mic-PCC7806 genome. Because DNA methylases are more reliably identified by Blast than restriction enzymes, we also identified all methylases, and examined the surrounding genes for potential restriction enzymes.

Detection of atypical CDSs
A first-order Markov model was built based on the dinucleotide composition of the core genes of a group of 8 selected cyanobacterial genomes (Table 4), identified by bi-directional best hits using BLASTp (bitscore of 30% against itself). This Markov model takes into account the Markov probability matrix of the core genes to analyse whether the composition of the CDS under study is "atypical", using the formula described in [25]. For each CDS, the model calculates an index that represents the likelihood that the CDS has a dinucleotide composition compatible with that of the core genes. In order to assess significance cut-offs, we applied the following statistics [82]: for each gene analyzed, one million random sequences were generated based on the Markov model probability matrix of the core genes, and the Markov index was calculated for each of these random sequences. The results were then analyzed by a one-tailed test with a cut-off of 0.1%. The cut-off was defined after several in silico horizontal gene transfer simulations, during which random genes from different genomes were introduced artificially into the genome sequences under study. The optimal threshold (0.1%) was defined for all the genomes of the group as the value at which the model had the highest detection of the in silico introduced genes (true positives), and the lowest detection of core genes (false positives).
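The general logic of this test can be sketched in Python as follows. The sketch is illustrative only: the index used here (mean log-probability per dinucleotide transition) is an assumption standing in for the formula of [25], which is not reproduced in this text, and the function names, the pseudocount and the reduced default number of random sequences are hypothetical choices. Sequences are assumed to contain only upper-case A, C, G and T.

```python
import math
import random
from collections import Counter

BASES = "ACGT"


def train_markov(core_seqs, pseudocount=1.0):
    """Estimate base frequencies and first-order transition probabilities
    from the core genes (pseudocounts avoid zero probabilities)."""
    base_counts, pair_counts = Counter(), Counter()
    for seq in core_seqs:
        base_counts.update(seq)
        pair_counts.update(zip(seq, seq[1:]))
    total = sum(base_counts[b] + pseudocount for b in BASES)
    init = {b: (base_counts[b] + pseudocount) / total for b in BASES}
    trans = {}
    for a in BASES:
        row = sum(pair_counts[(a, b)] + pseudocount for b in BASES)
        trans[a] = {b: (pair_counts[(a, b)] + pseudocount) / row for b in BASES}
    return init, trans


def markov_index(seq, trans):
    """Assumed index: mean log-probability per transition under the model."""
    pairs = list(zip(seq, seq[1:]))
    return sum(math.log(trans[a][b]) for a, b in pairs) / len(pairs)


def simulate(length, init, trans, rng):
    """Draw one random sequence of the given length from the Markov model."""
    seq = rng.choices(BASES, weights=[init[b] for b in BASES])
    while len(seq) < length:
        row = trans[seq[-1]]
        seq.append(rng.choices(BASES, weights=[row[b] for b in BASES])[0])
    return "".join(seq)


def atypical_p_value(cds, init, trans, n_random=1000, seed=0):
    """One-tailed empirical p-value: fraction of random sequences whose index
    is at most that of the CDS (the study used one million sequences per gene)."""
    rng = random.Random(seed)
    observed = markov_index(cds, trans)
    hits = sum(
        markov_index(simulate(len(cds), init, trans, rng), trans) <= observed
        for _ in range(n_random)
    )
    return hits / n_random
```

Under these assumptions, a CDS would be flagged as atypical when `atypical_p_value` falls below 0.001, which requires far more than the default 1000 random sequences to resolve reliably.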

Clustering of atypical genes
We defined an initial cluster of at least 4 neighboring atypical genes which was allowed to grow (in both directions) searching for other nearby atypical genes, until regions containing 4 or more non-atypical genes appeared. By this process, a reduced number of less-atypical genes and of normal genes could be included in a larger CAG.
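A minimal Python sketch of this clustering rule is given below, under one possible interpretation: atypical genes are chained together as long as fewer than 4 consecutive non-atypical genes separate them, and a chain is retained as a CAG only if it contains a seed of at least 4 consecutive atypical genes. The function name `cluster_atypical` and its parameters are hypothetical, and the handling of "less-atypical" genes is not modelled.

```python
def cluster_atypical(flags, seed_size=4, max_gap=4):
    """Return (start, end) index pairs of clusters of atypical genes.

    `flags` lists the genes in chromosomal order, True meaning atypical.
    A cluster stops growing when `max_gap` or more consecutive
    non-atypical genes are met (assumed reading of the rule).
    """
    # Step 1: chain atypical genes separated by fewer than `max_gap` other genes.
    spans = []
    for i, flag in enumerate(flags):
        if not flag:
            continue
        if spans and i - spans[-1][1] - 1 < max_gap:
            spans[-1][1] = i  # extend the current span to this atypical gene
        else:
            spans.append([i, i])  # start a new span

    # Step 2: keep only spans containing a seed of consecutive atypical genes.
    clusters = []
    for start, end in spans:
        run = best = 0
        for flag in flags[start:end + 1]:
            run = run + 1 if flag else 0
            best = max(best, run)
        if best >= seed_size:
            clusters.append((start, end))
    return clusters


# Toy gene order (hypothetical): a seed of four atypical genes, one nearby
# atypical gene, then a pair of atypical genes beyond a gap of four.
genes = [0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
print(cluster_atypical([bool(g) for g in genes]))  # [(1, 7)]
```

In the toy example the two non-atypical genes at positions 5 and 6 are absorbed into the cluster, which corresponds to the inclusion of a small number of normal genes in a larger CAG.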
Abbreviations
CDS: coding sequence; HSP: high scoring segment pair; BDBH: bidirectional best hit; rDNA: ribosomal DNA; CAG: cluster of atypical genes; BV: bootstrap value; NRPS: nonribosomal peptide synthetase; PKS: polyketide synthase; N50: contig size such that all the larger contigs contain 50% of the bases of the assembly.