The diversity of cyanobacterial metabolism: genome analysis of multiple phototrophic microorganisms

Background Cyanobacteria are among the most abundant organisms on Earth and represent one of the oldest and most widespread clades known in modern phylogenetics. As the only known prokaryotes capable of oxygenic photosynthesis, cyanobacteria are considered to be a promising resource for renewable fuels and natural products. Our efforts to harness the sun's energy using cyanobacteria would greatly benefit from an increased understanding of the genomic diversity across multiple cyanobacterial strains. In this respect, the advent of novel sequencing techniques and the availability of several cyanobacterial genomes offers new opportunities for understanding microbial diversity and metabolic organization and evolution in diverse environments. Results Here, we report a whole genome comparison of multiple phototrophic cyanobacteria. We describe genetic diversity found within cyanobacterial genomes, specifically with respect to metabolic functionality. Our results are based on pair-wise comparison of protein sequences and concomitant construction of clusters of likely ortholog genes. We differentiate between core, shared and unique genes and show that the majority of genes are associated with a single genome. In contrast, genes with metabolic function are strongly overrepresented within the core genome that is common to all considered strains. The analysis of metabolic diversity within core carbon metabolism reveals parts of the metabolic networks that are highly conserved, as well as highly fragmented pathways. Conclusions Our results have direct implications for resource allocation and further sequencing projects. It can be extrapolated that the number of newly identified genes still significantly increases with increasing number of new sequenced genomes. Furthermore, genome analysis of multiple phototrophic strains allows us to obtain a detailed picture of metabolic diversity that can serve as a starting point for biotechnological applications and automated metabolic reconstructions.


Background
Cyanobacteria are a unique phylogenetic group of bacteria and are the only known prokaryotes capable of oxygen-evolving photosynthesis. Cyanobacteria occupy diverse ecological niches and exhibit enormous diversity in terms of their habitats, physiology, morphology and metabolic capabilities. Due to their numerical abundance, most notably in marine environments, cyanobacteria have profound impact on almost all biochemical cycles that shape life on Earth. They are major players in global oxygen supply, carbon dioxide (CO 2 ) sequestration, nitrogen fixation, as well as the primary phototrophic production of biomass. The latter capability, the utilization of atmospheric CO 2 and sunlight for growth, has triggered renewed interest in the organization of cyanobacterial metabolism: Cyanobacteria are considered a promising resource for third generation biofuels and have attracted interest for a variety of related biotechnological applications [1][2][3]. However, while substantial knowledge is available for several model strains, the diversity of cyanobacterial metabolism remains poorly understood.
With the advent of novel sequencing techniques and the completion of several genome sequencing projects, a considerable number of complete cyanobacterial genome sequences are now available. This increasing number of sequenced genomes provides new opportunities for understanding microbial diversity and metabolic organization in diverse environments. Here, we report a whole genome comparison of multiple phototrophic cyanobacteria. Our focus is to describe the genetic diversity found within cyanobacterial genomes and to describe metabolic adaptations and diversity of several strains with different environmental background. Our work builds upon several previous studies on cyanobacterial genomic diversity and evolution [2][3][4]. For example, Raymond et al. [5] have previously compared five whole genome sequences from all groups of photosynthetic prokaryotes, with the aim to identify genes that play an essential role in phototrophy and to understand the advent and developement of photosynthesis. Their results showed that the genomes of the studied organisms resemble mosaics of genes with very different evolutionary histories and that orthologs common to all five genomes showed a distinct lack of unanimous support for any single phylogenetic topology. The importance of horizontal gene transfer (HGT) for cyanobacteria was later corroborated by the work of Zhaxybayeva et al. [6]. Shi and Falkowski [7] demonstrated an overall phylogenetic discordance among putative orthologous protein families from 13 genomes of cyanobacteria. The authors identified a core set of genes that was argued to be resistant to HGT and on which a robust organismal phylogeny can be constructed. Molecular synapomorphies, protein signatures that are present in an indicated group but not in other cyanobacteria or bacteria, were described by Gupta et al. [8,9] to further understand the evolutionary relationships between cyanobacteria. Mulkidjanian et al. (2006) [4] conducted a comparative analysis of 15 cyanobacterial genomes, with a focus on the origin of photosynthesis, and concluded that modern cyanobacteria inherited their photosynthetic apparatus from ancestral anaerobic phototrophs and not by lateral gene transfer from other phototrophic bacterial lineages. Recently, also several ocean sampling expeditions investigated microbial diversity in marine environments [10,11], again confirming substantial oceanic microbial diversity and considerable heterogeneity of microorganisms at the genomic level, specifically for Prochlorococcus, one of the most abundant genus of cyanobacteria.
Here, we augment the view on cyanobacterial genomic diversity with the identification and detailed analysis of putative orthologous genes across 16 cyanobacterial whole genome sequences. Our analysis is not restricted to a single genus of cyanobacteria but seeks to integrate representatives of cyanobacteria from almost all known environments. Unlike several previous studies, we do not aim to reconstruct evolutionary trajectories, but rather seek to describe differences and similarities in genome content. Our main focus is the role of metabolic genes of central carbon metabolism and hence metabolic functionality across diverse strains. The manuscript is organized as follows: First, we define clusters of likely ortholog genes, denoted as CLOGs, based on pair-wise comparison of protein sequences. Subsequently, we investigate the core and pan-genome of cyanobacterial strains and discuss codon usage analysis, as well as gene sharing and phylogenetic congruence. In the final three sections, we focus on the diversity of cyanobacterial metabolism and discuss how specific enzymes, and hence metabolic pathways and capabilities are distributed across selected cyanobacterial strains.

Genome analysis and ortholog cluster
Starting point of our analysis are the genome sequences of 16 selected cyanobacteria, as obtained from Gene-Bank (http://www.ncbi.nlm.nih.gov/genbank). The chosen strains are not restricted to a single genus but were selected to represent the known genomic and metabolic diversity found in the cyanobacterial phylum, including eight marine and eight freshwater strains. The selected cyanobacterial strains include the model organisms Synechocystis sp. PCC 6803, Synechococcus elongatus PCC 7492 and Cyanothece sp. ATCC 51142, several nitrogen-fixing cyanobacteria (diazotrophs), as well as two thermophiles originally isolated from hot-spring environments. Details on the choice of strains are provided in Methods and a summary of the properties of the selected strains is given in Table 1. A phylogenetic tree based on 16S rRNA is shown and discussed further below.
To investigate genomic diversity, we aim to identify groups of ortholog genes, based on a pair-wise allagainst-all comparison of identified protein sequences. Two protein sequences are regarded as likely orthologs if the reciprocal comparison results in a bidirectional hit rate (BHR) larger than a given threshold. Subsequently, likely orthologs were assigned to clusters by merging ortholog pairs. Clusters of likely ortholog genes were then checked for consistency and, if applicable, split into separate clusters. In this way, gene pairs within one cluster that exhibit a BHR below a given threshold are avoided. We restrict the analysis to the chromosome, plasmids are not considered. Details of the algorithm are given in Material and Methods. Our approach follows earlier approaches to detect putative orthologs across several genome sequences [4,5,[12][13][14][15][16][17]. However, we adopt rather stringent criteria to avoid inclusion of erroneous non-ortholog pairs, at the expense of potentially underestimating the number of true orthologs.
Our algorithm results in 21238 distinct clusters of likely ortholog genes (CLOGs), distributed across all 16 strains (data in Additional File 1). Figure 1 shows a histogram of the number of assigned genes per CLOG. The majority of clusters, almost 60%, consists of a single gene (singletons), whereas only a small number of clusters have more than 30 or 40 members. CLOGs with exactly 16 members are overrepresented, indictated in Figure 1 by a vertical line. Overall, the distribution differs slightly from the results provided in the COG database [12,18]. Therein, considering only the two cyanobacterial strains (Syn6803 and Nos7120) included in the database, clusters of ortholog genes tend to be comprised of more genes, often including multiple genes from the same strain. To obtain insight into the organization of the cyanobacterial genomic diversity, each CLOG is assigned to a  [42], the ability of the strain to fixate nirtrogen, habitat and cell arrangement. Within the column for habitat marine strains are marked by an M, fresh water by an F, thermophile strains are marked by a T. Cell arrangement is subdivided in single cells (S) and filamentous cell arrangement (F). The division of the strains into different subsections is according to [43].  Only few clusters consist of more than 16 genes and almost no cluster consists of more than 32 genes. cyanobacterial strain if one or more member of a CLOG is present in the respective genome. Figure 2A shows a histogram of the number of CLOGs as a function of the number of associated strains. We can distinguish between core genes (660 CLOGs), those that are assigned to all 16 strains, shared genes (6668 CLOGs), those that are found in more than one but not in all strains, and unique genes (13910 CLOGs) that have no likely ortholog in any other of the 15 genome sequences. Figure 2B shows the number of CLOGs assigned to each cyanbacterial species, highlighting the contribution of core, shared, and unique CLOGs. The data is provided as Additional File 2.
We observe that the majority of ortholog clusters is associated with a single genome, and therefore represent unique genes with no likely ortholog in any other of the considered strains. The number of CLOGs shared among two or more genomes then quickly drops. We note that the scale in Figure 2A is logarithmic. However, a significant number of CLOGs is again assigned to the core genome. Clusters of likely ortholog genes that are present in all 16 cyanobacterial genomes are more frequent than clusters that are only shared between any given number, but not all, strains. The set of core CLOGs is in good agreement with the results reported in Mulkidjanian et al. (2006) [4]. Specifically, when using Syn6803 as a reference, almost all genes assigned to a core CLOG (>90%) in our analysis are likewise a member of a core cyanobacterial clusters identified by Mulkidjanian et al. [4]. Our results are also in good qualitative agreement with several previous studies on other bacterial lineages. For example, Hogg et al. [14] observed a similar distribution for 12 sequenced strains of Haemophilus influenzae. Extending the pan-genome concept to higher taxonomic units, Lapierre and Gogarten [19] report a shared core genome of approximately 250 genes across more than 500 sequenced bacterial genomes. In both cases, corresponding to the results shown in Figure 2, a U-shaped distribution was observed, such that unique and core genes are overrepresented compared to any single set of genes assigned to a finite number of genome sequences.
The cyanobacterial core-and pan-genome Whole genome comparisons offer the possibility to extrapolate the observed results beyond the number of strains explicitly considered in the comparison. In this respect, pan-genome analysis has recently emerged as a novel approach to estimate the size of the gene repertoire accessible to any given species [20]. A number of recent studies have found consistently that the number of genes accessible to a bacterial species is usually orders of magnitude larger than the number of genes contained in the genome of any single organism. These results have a direct implication for resource allocation and whole-genome sequencing projects, as they can potentially predict how many new genes are identified every time a new genome of the species of interest is sequenced. Figure 3 shows the size of the cyanobacterial coreand pan-genome estimated from the 16 strains considered here. The total pan-genome of all 16 strains encompasses more than 2·10 4 ortholog clusters and the increase as a function of the number of genomes does not show substantial flattening of the curve ( Figure 3B). With each newly included genome still more than approximately 500 novel ortholog clusters are added to the pan-genome. Given these rarefaction curves, it must be expected that sequencing of further cyanobacterial strains will still result in the discovery of a high number of as yet unknown genes, even when the number of sequenced genomes goes significantly beyond the number sequenced as yet. The results shown in any, between the core, shared and unique genes? Both questions have been addressed in the recent literature but cannot be resolved with any certainty yet. For the size of the bacterial pan-genome, divergent results have been obtained for different species. Hogg et al. [14], reported a finite pan-genome for Haemophilus influenzae, extrapolating from 12 whole genome sequences, while results for Streptococcus agalactiae indicate an infinite asymptotic pan-genome [21]. These results may indeed reflect differences in ecologial niches and evolutionary history. However, a fundamental objection to mathematical extrapolation has been raised recently [17]. As argued by Kislyuk et al. [17] such extrapolation estimates are likely to be spurious because they depend on the estimation of the occurence of extremely rare genes and genomes, respectively, which are problematic to estimate precisely because they are rare. Therefore, we do not give any estimate for the total cyanobacterial pan-genome. Nonetheless, we consider several key findings to be valid: There is a core genome that is shared between all 16 cyanobacterial strains considered here. The asymptotic size of the core genome when exptrapolated to all cyanobacterial strains is currently unknown. Furthermore, there is no indication that the cyanobacterial pan-genome is closed. Therefore, the results shown in Figure 3 provide a strong incentive for further genome sequencing even of closely related strains.
A second issue relates to the possible functional and evolutionary differences between shared, core and unique genes. Common to all recent studies is that the number of unique genes, and those that are only shared between a small number of genomes, represents a rather large proportion of the total gene repertoire [22]. A variety of hypotheses with respect to the origin of such a distribution have been put forward. For example, core genes are often assumed to be predominantely related to housekeeping functions [22]. Unique genes, on the other hand, may be characteristic to specific environments and are assumed to be subject to extensive HGT [6,7].
We tested this assertion by comparing the annotation obtained from gene ontology (GO) database [23]. An analysis of the GO annotation of core CLOGs reveals a significant enrichment of genes related to "translation" (p-value <1·10 -30 ), "DNA repair" (p-value <1·10 -4 ), "gene expression" (p-value <1.8·10 -7 ), "RNA processing/modification" (p-value <1·10 -5 ), diverse transporting processes (p-values <1·10 -4 ), as well as several metabolic and biosynthetic processes (p-values <1·10 -5 ). Genes of the unique CLOGs are enriched with the annotation "defence response" (p-value <1.6·10 -5 ), "DNA integration" (p-value <8.2·10 -5 ), and are in particular enriched in annotations of regulatory processes. The latter may implicate a conservation of a functional core, such as metabolism and gene expression machinery, whereras regulatory properties and interactions are more specific to diverse environments. A complete list of the enriched GO terms is provided in Additional File 3. Algorithmic details are given in the Methods. We also need to point out a possible bias due to a significant enrichment of GO annotated genes in the core CLOGs (p <9·10 -210 with Fisher's exact test), whereas genes associated with unique clusters are more likely to have no GO annotation. This imbalance can be explained by the fact that GO annotations are mainly based on BLAST searches in other species and unique genes can therefore be expected to have fewer matches. Furthermore, a significant fraction of unique genes may also be due to annotation errors or be non-functional as part of an ongoing process of genome reduction and pseudogenization [22].

Codon usage analysis
To further elucidate the difference between core and pan-genome, we compared the codon usage of the respective CLOGs across the 16 cyanobacterial genomes. To this end, the relative abundance of each nucleotide triplet for each amino acid was estimated and we calculated d 2 g1,g2 as the sum of squared differences in codon usage between any two groups of genes, g1 and g2, respectively. See Methods for algorithmic details. We found considerable differences in codon frequencies for the set of core genes between different strains, indicating adaptation of the codon usage to the respective strain. For all 16 strains, the codon usage frequency was found to be significantly different between the set of core and unique genes. To quantify the difference in The cyanobacterial pan-and core-genome. Estimated size of core-(A) and pan-(B) genome with increasing number of considered genomes. To avoid dependency on strain order, the 16 cyanobacterial strains were arranged in random order. At each step, we recalculated the number of core CLOGs (CLOGs assigned to all strains included as yet) and pan CLOGs (all CLOGs as yet found in at least one of the included strains) genome. This procedure was repeated 1000 times, the median across all iterations is shown. The errorbars represent the 0.1 and 0.9 quantiles estimated from 1000 iterations.
Beck et al. BMC Genomics 2012, 13:56 http://www.biomedcentral.com/1471-2164/13/56 codon usage between the set of of core and unique genes, we use the ratio r x of the sum of squared differences in codon usage between core/core and core/ unique genes for each strain. This ratio ranges from r x = 2 for Pro9215 up to r x = 24 for Syn6803 and Mic843. A table with detailed information is provided as Additional File 4. The difference in codon usage between core and unique genes within any single strain typically exceeds the differences between core genes, as well as between unique genes, across different strains. The difference in codon usage between core and unique genes is lowest in the rather small genomes of the three Prochlorococcus strains Pro9215, ProMED4, and Pro9211.

Gene sharing and phylogenetic congruence
We are interested in the relationships between cyanobacterial species based on gene sharing, as compared to 16S rRNA analysis. Figure 4A shows a phylogenetic tree obtained from 16S rRNA, using PHYLIP (phylogeny inference package version 3.69) by Felsenstein [24]. Several options to estimate similarity based on assignment of CLOGs are available. Here, we use a simple measure based on the number of CLOGs common to two strains divided by the total number of CLOGs associated with both strains combined. The respective distance tree is shown in Figure 4B. Both trees exhibit a high degree of similarity, with only minor topological differences. In both cases, the Prochlorococcus strains form the closest related cluster. We note that we do not consider phylogenetic trees of individual gene families, where a higher degree of phylogenetic discordance must be expected [7]. Likewise any estimate of distance based on shared CLOGs is likely biased by genome size, which again reflects evolutionary distance as determined by 16S rRNA analysis. Table 2 gives a pair-wise comparison of shared CLOGs between all 16 cyanobacterial strains. The table confirms the close association of the three Prochlorococcus strains with Syc7803 with respect to shared genes.

The metabolic network is highly conserved
Going beyond pan-genome analysis, we are particularly interested in the organization and diversity of cyanobacterial metabolism. To identify those CLOGs that can be associated with metabolic function, we utilize the Enzyme Commison (EC) number of each gene, as obtained from the KEGG database: A CLOG is regarded as metabolic if the respective set of orthologous genes can be assigned to one or more EC numbers associated with a specific enzymatic activity. We note that due to the hierarchical classification scheme, this assignment may also include broad enzymatic categories, as well as a limited number of non-metabolic enzymatic functions. See Methods for algorithmic details and some caveat.
Using the set of CLOGs described above, 1851 CLOGs of the 21238 can be regarded as metabolic. We note that due to bifunctional enzymes or inconsistent and erroneous annotation CLOGs may be assigned to more than one metabolic function. However, in our case only 66 CLOGs (out of 1851) are assigned to more than one EC number, with a total of 759 distinct EC numbers assigned across all clusters. These results indicate that inconsistent annotation does not significantly constrain our analysis, even without prior filtering or manual curation. Figure 5 shows the distribution of metabolic CLOGs across the unique, shared and core genome. CLOGs assigned to metabolic function are highly overrepresented within the set that is common to all 16 cyanobacterial strains, with about 55% of all core CLOGs associated with metabolic function. Obviously, cellular metabolism, defined here as genes assigned to enzymatic function, constitutes a large fraction of the core genome. Figure 6 gives the percentage of CLOGs assigned to enzymatic function across all 16 strains considered in this study. The number of enzymatic CLOGs increases linearly with the number of total CLOGs assigned to each strain, with an offset of about 500 core enzymatic CLOGs. However, the correlation between number of enzymatic and total CLOGs is rather weak and dominated by the contribution from enzymatic core CLOGs. A further analysis of the respective pathways and enzyme classes associated to CLOGs revealed no obvious difference between unique and core genes, that is, no particular enzymatic category or pathway was strongly overrepresented in either class. Nonetheless, a number of core pathways can be identified that are common to all 16 cyanobacterial strains. Among the highly conserved pathways are the Calvin Benson cycle, the oxidative pentose phosphate pathway, nucleotide synthesis, and amino acids synthesis. However, with respect to the latter, a number of phosphatases and transaminases are not annotated in several strains.

The diversity of cyanobacterial metabolism
A multifaceted picture is obtained, if we look how specific enzymes, and hence metabolic capabilities, are distributed across the 16 cyanobacterial strains. To this end, we first limit the analysis to shared EC numbers. Of the total of 759 distinct EC numbers, assigned across all clusters, a subset of 378 EC numbers is associated with more than one, but less than 16 strains. Figure 7 provides a clustered heatmap of the association between these 378 shared EC numbers and the 16 cyanobacterial strains. EC numbers were clustered using the matlab function clustergram with distance 'hamming'. Overall, we can distinguish between four broad categories: First, shared EC numbers that are predominantely annotated with the Prochlorococcus strains, Pro9215, ProMed4, Pro9211 and Syc7803 (Cluster A in Figure 7). Second, shared EC numbers that are only annotated to a small number of strains (Cluster B in Figure 7). Third, shared EC numbers that are annotated to a large number of strains (Cluster C in Figure 7), and, fourth, shared EC numbers that are annotated to almost all strains, except the three Prochlorococcus strains and Syc7803. We note that Figure 7 again underscores the similarity between the three Prochlorococcus strains and Syc7803 that is already apparent in Figure 4 and Table 2.
Investigating the associations more closely confirms differences between the four groups. Exclusively associated with the four strains ProMED4, Pro9211, Pro9215 and Syc7803 are the two EC Numbers EC 1.1.5.4, a malate:quinone oxidoreductase involved in the TCA cycle and EC 2.5.1.48, a cystathionine gamma-synthase which catalyzes one of the essential steps in the synthesis of the amino acid L-methionine. For most other strains the synthesis pathway of methionine is unclear. No EC numbers are solely associated with the three Prochlorococcus strains, but no other strain. Vice versa, group D contains a set of 11 EC numbers that are associated with all strains, except the three Prochlorococcus strains ProMED4, Pro9211, Pro9215. This set includes EC 1.7.7.1, a nitrite reductase, and EC 1.7.7.2, a nitrate reductase. Indeed, it was widely assumed that nitrate is unavailable for Prochlorococcus because none of the sequenced laboratory strains contain the respective gene for nitrate utilization [25]. However, this view was recently challenged as metagenomic sequence data revealed that several Prochlorococcus ecotypes may contain nitrate reductase genes [26]. EC numbers that are not annotated for the Prochlorococcus strains and Syc7803, include several enzymatic activities of central metabolism, such as EC 1.1.1.38 (malic enzyme), EC 1.1.1.94 (glycerol-3-phosphate dehydrogenase), and EC  3.1.3.11 (fructose-1,6-bisphosphatase). A comprehensive list of differences in EC annotations between group A and group D is given as Additional File 5. A detailed discussion of genes involved in central metabolism is provided below.

Cyanobacterial storage metabolism
As phototrophic organisms, most cyanobacteria rely on indigenous compounds that act as storage and allow to maintain cellular function during the night and in the absence of light. The most prevalent storage compound in cyanobacteria is glycogen, a branched polymer synthesized from glucose-6-phosphate. Glycogen is assumed to be accumulated during the day and mobilized during periods of darkness. Accumulation of glycogen is also relevant under conditions of nitrogen limitation. All 16 strains considered here possess the necessary enzymes for glycogen synthesis and mobilization. In particular, a CLOG that is annotated with the enzyme AGP (EC 2.7.7.27) belongs to the core genome and is associated with all 16 strains.  Table 3 for an overview. We note that in all cases where an enzyme is associated with more than one CLOG, there usually is a dominant CLOG associated to almost all strains and a small number of secondary CLOGs whose members are annotated with the same enzyme. On closer inspection, this distinction is not an artifact of the clustering algorithm, but is supported by pair-wise comparisons of the respective sequences.
Compared to glycogen, other storage compounds are less ubiquitous. Nonetheless, for the majority of strains, the enzymes for cyanophycin synthesis and mobilization are annotated. Cyanophycin is a polymer composed of aspartate and arginine and serves as a source of nitrogen and carbon in several cyanobacteria. As in the case for glycogen, the enzyme for cyanophycin synthesis (CphA, EC 6.3.2.29/30, see Knoop et al. [27]) is associated with several CLOGs, whereas the enzyme for cyanophycin mobilization (CphB, EC 3.4.15.6) is confined to a single CLOG. Both enzymes always occur together. That is, no strain is annotated only with synthesis or mobilization, with a single exception for Syc7803. However, in this case, the associated singleton CLOG is likely to be a  erroneous annotation. The respective gene is annotated as a putative cyanophycin synthetase in CyanoBase, but the similarity to other known genes encoding cyanophycin synthetase is low.
Less prevalent than cyanophycin is the utilization of poly-beta-hydroxybutyrate (PHB) as a storage for carbon. PHB is a nontoxic biodegradable polyester of biotechnological importance, whose production by genetically engineered cyanobacteria was discussed recently [28]. Among the 16 strains considered here, the corresponding enzymes for synthesis of PHB are annotated only for Synechosystis sp. PCC 6803 (Syn6803) and Microcystis aeruginosa NIES-843 (Mic843). We note that although the strain Aca11017 is also associated with the CLOG annotated with PhaA (PHA-specific bketothiolase/Acetyl-CoA acetyltransferase, EC 2.3.1.9), the strain lacks the remaining steps for PHB synthesis. The respective gene in Aca11017 is a close variant of Figure 7 Diversity of cyanobacterial metabolism. Shown is a clustered heatmap of the association between the 378 shared EC numbers and the 16 cyanobacterial strains. The clustering of cyanobacterial strains is in good agreement with the results shown in Figure 4. With respect to annotated EC numbers, four broad categories can be distinguished: Category A corresponds to EC numbers predominantely annotated for the three Prochlorococcus strains and Syc7803. Category B corresponds to EC numbers associated with only a small number of strains. Category C covers EC numbers that are associated with almost all strains. Category D corresponds to those EC numbers that are associated with most strains, but are less prevalent in the Prochlorococcus strains and Syc7803. The categories are identified by visiual inspection of the clustered heatmap. Table 3 Cyanobacterial storage metabolism.

Glycogen
Cyanophycin     PhaA that is not specific for PhB synthesis. We emphasize that all storage compounds are of high biotechnological interest.

The diversity of cyanobacterial central metabolism
Beyond storage compounds, Tables 4 and 5 summarize the presence of several key enzymes within cyanobacterial central metabolism across all 16 strains considered in this study. In contrast to Table 3, the tables do not distinguish between individual CLOGs associated with the same enzymatic function. A detailed depiction of individual CLOGs is provided as supplementary material (Additional File 6). For each enzyme usually a dominant CLOG exists in addition to a smaller number of secondary CLOGs. A graphical depiction of annotated pathways is given in Figure 8. Tables 4 and 5 allow for a detailed analysis of metabolic function. First, we note that all key enzymes of the Calvin-Benson cycle, responsible for CO 2 fixation, are annotated in all 16 strains (Table 4). Likewise, for all enzymes belonging to the pentose phosphate pathway (PPP), respective CLOGs are associated with all strains (Table 5). A more diverse picture is obtained for other key metabolic pathways. The enzyme FBP (fructose-1,6-bisphosphatase, EC 3.1.3.11) is not annotated for all strains and absent in all alpha-cyanobacteria, including the Prochlorococcus strains. However, taking into account results from a recent reconstruction and stoichiometric modeling of the strain Syn6803, the enzyme was found to be not essential for biomass formation [27]. To some extend, its function can also be substituted by the bifunctional enzyme SBP (fructose-1,6-/sedoheptulose-1,7-bisphosphatase, EC 3.1.3.37), present in all strains considered in this study. Likewise, the enzyme PFK (phosphofructokinase, EC 2.7.1.11) is not annotated for several strains, most notably again the Prochlorococcus strains. We note that PFK is essential for glycolysis, in its absence utilization of glycogen as a carbon and energy source has to proceed exclusively via the PPP. Other enzymes of the glycolytic pathway, such as FBA, GAP, PGM, and PYK are annotated for all strains (Table 4). In contrast, pyruvate metabolism, summarized in Table 5, is rather fragmented. While CLOGs annotated with the PEP carboxylase (PEPC, EC 4.1.1.31) are associated with all strains, the back reaction via the PEPKinase is rather rare and annotated only for three strains. PEPC catalyzes the anaplerotic conversion of PEP to oxaloacetate and inorganic phosphate Pi, and is essential for replenishment of TCA cycle intermediates. CLOGs annotated with the right-hand side of the TCA cycle, resulting in the formation of 2-oxoglutarate, are ubiquitous for all strains. The metabolite 2-oxoglutarate is considered to Figure 8 A pathway diagram of the cyanobacterial core metabolic network. Black boxes indicate enzymes whose corresponding CLOGs are associated with all 16 cyanobacterial strains. Grey boxes correspond to enzymes that are not annotated for one or more strains.
be a sensor for the nitrogen status of cyanobacteria [29] and serves as a precursor for several amino acids and nucleotides. The subsequent steps within the TCA cycle are highly fragmented. No CLOGs are associated with the enzyme complex 2-oxogluterate dehydrogenase (OGDH), catalyzing the conversion from 2-oxogluterate to succinyl-CoA. The lack of OGDH corrosponds to the known fact that cyanobacteria do not possess a conventional TCA cycle. Nonetheless, it is generally believed that the TCA cycle is able to carry cyclic flux and is therefore able to fulfill its respiratory function in the absence of light.  [30]. The respective CLOGs are shared between all strains, except the three Prochlorococus strains, Syc7942, and Syc7803, see Table 5 for an overview. The existence of a bypass also explains that the succinate thiokinase (STK, EC 6.2.1.5), otherwise an essential enzyme within the TCA cycle, is only annotated for a small number of strains. We speculate that in these few cases the enzyme STK serves mainly to produce succinyl-CoA. Furthermore, several strains lack CLOGs annotated with the enzyme succinate dehydrogenase (SDH, EC 1.3.99.1). For these cyanobacteria cellular respiration via the TCA cycle is significally impaired. CLOGs associated with the enzyme fumarate hydratase (FH, EC 4.2.1.2) are present for all cyanobacteria. However, for Syc7002 and Glo7421, the original genes are annotated differently, representing either erroneous annotation or a bifunctional enzyme. Extrapolating results from a metabolic model of Syn6803 (Knoop et al., 2010), the enzyme FH is essential to recycle fumarate, which is an obligate byproduct during growth. In addition, several other enzymes are only annotated in a subset of the considered strains. In particular, the malic enzyme (ME, EC 1.

Conclusions
The rapidly increasing number of complete microbial genomes offers new possibilities to understand microbial diversity in complex environments. In this work, we have presented a whole genome analysis of multiple phototrophic cyanobacteria, with the aim to gain insight into the diversity of cyanobacterial metabolism from a genome perspective. Cyanobacteria exhibit an enormous metabolic diversity and occur in almost all environments where light is available, and are therefore particularly suited for a comparative analysis of genetic diversity. The basis of our analysis was the definition of clusters of likely ortholog genes (CLOGs), and how these are distributed across the 16 cyanobacterial strains considered in this study. We found that of the 21238 distinct CLOGs identified across all cyanobacterial strains, the majority (approximately 65%) consist of single genes that have no likely ortholog in any other considered strain. About 3% of CLOGs are assigned to all strains, constituting the core genome shared among all strains considered in this study. The remaining CLOGs are assigned to more than one, but not all strains. We note that the set of core CLOGs does not represent a minimal set of genes to sustain Life. Several conclusions can be drawn from the distribution of CLOGs: First, we find no indication that the pan-genome of cyanobacteria is closed. Rather, the number of total CLOGs increases indicating that no small number of sequenced strains is sufficient to explore the full diversity of the cyanobacterial genome. While such extrapolations must be taken with caution [17], these result at least provide a strong incentive for further sequencing projects. Second, the set of core and unique CLOGs differs in with respect to enriched annotations. Core CLOGs exhibit a significant enrichment of genes with annotations that are commonly associated with household functions, such as "translation", "DNA repair", "gene expression", "RNA processing/modification", diverse transporting processes, as well as several metabolic and biosynthetic processes. The annotation of genes of unique CLOGs are enriched with terms pointing to more specific functions, such as various regulatory processes. Furthermore, we found that codon frequencies are different for core CLOGs between different strains, indicating adaptation of codon usage to the respective organisms.
A focus of our work was to explore the metabolic diversity of cyanobacteria from a genomic perspective. CLOGs assigned to metabolic functions are strongly overrepresented within the set of core CLOGs common to all strains considered in this study. Investigating the distribution of individual CLOGs in more detail, our analysis revealed a diverse picture of the presence of core metabolic pathways within all trains. Several key pathways in central metabolism are highly conserved, such as the pentose phosphate pathway and the Calvin-Benson cycle. However, other parts of the metabolic network, most notably pyruvate metabolism and the TCA cycle, are highly fragmented. In contrast to conventional functional annotation, the annotation of enzymatic function offers the advantage that the functional context of assigned CLOGs, in terms of pathways and adjacent reactions, can be taken into account. Therefore the analysis allows to assess the presence or absence of certain enzymes in terms of metabolic function, providing stronger criteria to judge erroneous annotation or reliability of the associated CLOGs. For example, the incompletely annotated TCA cycle of cyanobacteria puts constraints on its ability to provide the precursors for cellular respiration. This deficiency, in term, has consequences for the functional role of other assigned reactions, such as the succinate thiokinase (STK). The latter is therefore unlikly to assume its usual role, and correspondingly is annotated only for a small number of strains. In this respect, we consider our analysis also as a first step in automated network reconstruction. Largescale models of cellular metabolism are becoming increasingly important for a variety of biotechnological applications, but are currently often restricted to a small number of model strains [27,31]. Reconstruction of multiple strains can greatly benefit from a thorough analysis of metabolic diversity found among the set of already sequenced cyanobacteria. We expect that an iterative process, from genome analysis to model construction will eventually lead to a leap in understanding of the metabolic and ecological capabilities of bacterial species and to advanced eco-systems biology.
The tree is reproduced as Additional File 7 highlighting the position of the 16 selected strains within the tree. An overview of the properties of the selected strains is given in Table 1.

Definition of clusters of likely ortholog genes (CLOGs)
To identify ortholog genes, we performed an all-againstall comparison of all 16 cyanobacterial strains. For each possible combination of species A and B ortholog genes are identified using a method similar to the KEGG Automatic Annotation Server [34]. First all genes of A are compared to each gene in B and vice versa using blastp. Hits with a bit score below 50 bits are rejected. The bidirectional hit rate (BHR) for a gene pair a and b is computed as where S a,b is the blastp score of a versus b and S bestA b is the best score of b against any gene in A (which may be different to a). The value of BHR is unity for genes which are mutually best hits in both directions, and lower otherwise. The set of genes for which the BHR is calculated includes genes located on the same genome.
To favor cross-genome orthologs in the further steps of our analysis, the BHR of genes located on the same genome is artificially restricted to values up to 0.95, even when the actual value is higher. All gene pairs with a BHR above or equal 0.95 are classified as putative orthologs.
In a second step, the gene pairs are then clustered by merging all genes which are identified as putative orthologs. To avoid clusters where two genes have a low BHR but are weakly connected through a third gene, all genes in a pre-cluster are clustered again with UPGMA (Unweighted Pair Group Method with Arithmetic mean) and a minimal BHR of 0.75 [35]. This is accomplished by clustering the closest entities -the one with the highest BHR -and recalculating the BHR to all other enti-tiesCusing the following equation where X is the new entity merged of entities A and B, | A| is the size of A, and BHR X,C is the BHR of two entities, until all pairs of entities have a BHR below 0.75. Using our procedure, each gene is assigned to a single cluster.

Analysis of codon usage
The codon usage of a group g of genes was calculated by randomly selecting 100 different genes within this group, such that each gene has a size of at least 50 amino acids. For each encoding triplet t we calculated the relative frequency f t,g with which the triplet codes for its corresponding amino acid across all selected genes g. Stop codons were not taken into account. The differences in codon usage of two groups i and j where calculated according to To account for selection bias, randomized gene selection within the groups and calculation of d' 2 was 100 P repeated 100 times and averaged d 2 i,j = 1 100 100 1 d 2 i,j . To quantify the difference in codon usage between the set of of core and unique genes, we use the ratio r x of the sum of squared differences in codon usage between core/core and core/unique genes for each strain, The values are provided in Additional File 4. To verify if the differences in codon usage between core and unique genes within one species are statistically significant we used a two-sample Kolmogorov-Smirnov test with 100 repeats of d 2 core(x),core(x) and d 2 core(x),unique (x) . The test rejected the hypothesis that both samples came from the same distribution for all strains with an asymptotic p-value of p <4e -27 for Pro9215 and p <4e -43 for all other strains.

Enrichment of GO annotation
To investigate functional differences between core and uniqu CLOGs, each CLOG was assigned to the functional annotation provided by the Gene Onthology (GO) database [23,36] (effective January 2011) for every constituent gene within the respective CLOG. For enrichment analysis of GO terms the TopGO software was used [37], which is available as part of the Bioconductor R packages (http://www.bioconductor.org). Here we chose the parent-child algorithm [38] with standard parameters and Fisher's exact test. For the calculation of p-values only CLOGs with assigned GO term where taken into account. The complete list of all enriched terms with p-values < e -4 for core and unique CLOGs is provided as Additional File 3.

Phylogenetic tree analysis
The trees presented in this work were created with PHYLIP (PHYLogeny Inference Package version 3.69) by Felsenstein [24]. For the 16S-RNA comparison ( Figure  4A), a tree in newick format was constructed with the Ribosomal Database Project web site (http://rdp.cme. msu.edu/) [39]. The tree was then plotted with DRAW-GRAM of PHYLIP.
For the tree based on shared clusters ( Figure 4B), we built a similarity matrix for all strains, where the similarity between two strains is calculated by the number of the shared cluster divided by the number of clusters where at least one of the two strains participate. We did not take into account core clusters (which are the same for all strains) and unique clusters. The latter also serves to minimize size bias. From this matrix we calculated a distance matrix by substracting each entry from the maximal entry. The distance matrix was then converted into a tree using NEIGHBOR of PHYLIP and subsequently plottet again with DRAWGRAM. In both trees, Aca11017 was used as outgroup.

Assigments of Metabolic Function
Metabolic functions were assigned to the CLOGS by matching the occurring genes to the KEGG database (release date 19. october 2010) [40]. CLOGs with at least one gene associated to an enzymatic function in the KEGG database, are labeled with the respective EC number. In case that a CLOG contains genes which are assigned to different EC numbers we annotate this CLOG with multiple EC numbers, unless one of the numbers is just an incomplete form of the other (i.e. 3.7.-.-and 3.7.4.21). Consequentially the total number of distinct EC numbers does not exactly correspond to the number of metabolic CLOGS. For enzyme complexes consisting of multiple subunits, which are encoded by several genes and are therefore associated with different CLOGs, the EC number of the corresponding enzyme is assigned to each of the CLOGs. We note that EC numbers do not strictly correspond to genuine small-molecule metabolic function, since the EC nomenclature also includes general enzymatic activity, such as protein kinases and RNA or DNA polymerases.

Additional material
Additional file 1: Text file of CLOGs. Tab separated text file containing all CLOGs found by our methodes. Each line represents one CLOG, the entries for each strain are separated by tabs. Genes are annotated with the respective EC number, is applicable. If one strain has more than one entry in a CLOG, the genes are separated by a tilde. The last column summarizes all EC numbers denoted to genes in this CLOG and the frequency of appearance. Multiple ECs in one CLOG are separated by a hash.
Additional file 2: Table of strains assigned to CLOGs of different sizes. The table provides the number of CLOGs assigned to each strain. The fraction of CLOGs associated with one or more EC number is given in brackets. The size of a CLOG is determined by the number of strains that it is associated with.
Additional file 3: Table of enriched Gene Onthology terms. An excel file which contains the list of enriched GO terms for genes of the core CLOGs as well as for genes of the unique CLOGs. For each enriched term the table gives: the GO specific ID; the term in clear text; the total number of CLOGs annotated with the term across all clusters; the number of CLOGs in the particular set annotated with the term; the expected number of CLOGs in the current set annotated with the term, given an uniform distribution; the significance level of the enrichment, calculated with fisher's exact test. Only results with a p-value below 1e-3 are shown. Each list is divided into the three GO domains "biological process", "cellular component", and "molecular function".
Additional file 4: Table of differential codon usage. The table shows the differences in codon usage of core and unique genes across all 16 cyanobacterial strains. Each number indicates the difference in codon usage of the core genes of one strain (row) compared to the core or unique genes of one strain (columns) and is calculated as described in Methods.
Additional file 5: Differences in EC annotation. The pdf contains a list of EC numbers corresponding to groups A and D in Figure 7, respectively.
Additional file 6: Tables of CLOGs related to key enzymes of central metabolism. The table provides the CLOGs associated to enzymes involved in the storage metabolism, glycolysis, Calvin Benson cycle, PPP, pyruvate metabolism and TCA cycle. For each strain a dot is representing a strain specific gene which can be found in the CLOG with the corresponding enzymatic function. Bracketed dots represent genes, which are assigned to that CLOG, but differ in annotation and most likely in function.
Additional file 7: Annotated phylogenetic tree. Shown is a maximumlikelihood distance tree for sequenced cyanobacteria reproduced from Gupta et al. 2010 [9]. Strains chosen for analysis are indicated by red arrows.