Clustering homologues of amino acid biosynthetic enzymes
To determine the distribution of amino acid biosynthetic enzymes, a homologue clustering process was developed that allows the use of both complete and incomplete genomes [14, 15]. The procedure starts with the Seed Linkage software [14], which clusters cognate proteins from multiple organisms, beginning with a single seed sequence and proceeding through connectivity saturation with it. Since basal eukaryotes such as plants and fungi are prototrophic for amino acids, sequences coding for all the enzymes used in the biosynthesis of EAAs in the plant Arabidopsis thaliana and the fungus Saccharomyces cerevisiae were manually inspected using KEGG Pathway and used as seeds to search for homologues. Moreover, our group has been developing a procedure to enrich secondary databases such as COG [12] and KEGG Orthology (to be published) with UniRef50 clusters [16] available from UniProt, thereby allowing the inclusion of data from incompletely sequenced genomes. Additional file 1 (Sequences and genome status distribution) reflects the abundance of proteins derived from incomplete genomes and demonstrates the importance of their inclusion. In this work we took advantage of a home-built UniRef50 Enriched KEGG Orthology database (UEKO) to additionally cluster sequences with the seed sequences mentioned above. Since these searches recruit sequences from diverse clades, which may or may not contain organisms with completely sequenced genomes, we represented this information in Figure 1 as: (a) black filled circles for phyla containing complete genomes; (b) grey filled circles for clades with at least one draft genome available but no complete genome; and (c) empty circles for phyla with neither complete nor draft genomes. Protein fragments are not included in this search for homologues because they may represent full-length proteins that were only partially sequenced at the mRNA level or incompletely modeled from the genome.
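The symbol convention used for Figure 1 amounts to a simple rule on the sequencing status of the genomes available for each phylum. A minimal Python sketch of that rule follows; the function name, the status strings and the example inputs are illustrative assumptions and are not part of the original pipeline.

```python
def genome_status_symbol(statuses):
    """Return the Figure 1 symbol for a phylum.

    `statuses` is an iterable of per-genome status strings, assumed here
    to be "complete" or "draft" (hypothetical labels for illustration).
    """
    statuses = set(statuses)
    if "complete" in statuses:
        return "black circle"   # (a) at least one complete genome
    if "draft" in statuses:
        return "grey circle"    # (b) draft genome(s) but no complete genome
    return "empty circle"       # (c) neither complete nor draft genomes


# Made-up inputs, not data from the study.
print(genome_status_symbol(["draft", "complete"]))  # black circle
print(genome_status_symbol(["draft"]))              # grey circle
print(genome_status_symbol([]))                     # empty circle
```

The order of the checks matters: a phylum with both draft and complete genomes is reported as complete, which matches category (a) taking precedence over (b).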
Moreover, since some full-length proteins might not have been captured in the databases due to high sequence divergence, a second search round used all clustered sequences as queries against UniProt. This step also captures partial sequences (entries labeled as fragments in UniProt) that passed the coverage filtering applied (see Methods for details). These additional significant hits are represented by triangles in Figure 1. Furthermore, the enzymes required for the biosynthesis of the indicated amino acids are ordered along the anabolic pathway from left to right. All pathways refer to EAA biosynthesis except those of serine and glycine (the rightmost ones), which were used as experimental controls. Serine is represented by two alternative pathways observed in human and other eukaryotes: S(1), from 3P-D-glycerate; and S(2), from pyruvate. Glycine is represented by three pathways: G(1) and G(2), both starting from serine; and G(3), starting from threonine. As expected, serine and glycine biosynthesis were found to be potentially proficient in almost all phyla. This control supports the search mechanism and attests to the efficacy of the methods applied. A few exceptions were observed and deserve comment: (i) serine biosynthetic pathways were found to be absent in Rhodophyta, although the complete genome of Cyanidioschyzon merolae is available. We manually inspected this result with regular BLAST searches and did not find additional evidence, although a translation of a partial CDS was obtained for the glycine biosynthetic enzyme G1 (Figure 1, triangle); (ii) serine biosynthesis seems to be absent in Apicomplexa as well, a clade comprising two complete Plasmodium genomes that lack enzymes S1 and S4; (iii) among the animals, although we were able to find serine biosynthetic enzymes, we failed to support the NEAA character of glycine for Mollusca. However, evidence could be obtained for basal animals such as Placozoa and Porifera. For the microsporidian E. cuniculi, an obligate intracellular parasitic fungus with a complete genome, it has been reported that “the repertoire for the biosynthesis of amino acids is restricted to asparagines synthetase and serine hydroxymethyltransferase genes”, so serine was already known to be an EAA for this organism [17]. Thus, absence of evidence may not guarantee the absence of the gene. Nevertheless, out of 28 phyla, after discarding both the four clades with no genome project, completed or in progress (open circles), and those with complete genomes (filled symbols), we could not provide evidence of glycine biosynthesis for two phyla (Fornicata and Mollusca), whereas evidence for serine biosynthesis was obtained in all of them.
The data presented in Figure 1 clearly depict the presence of complete biosynthetic pathways for EAAs in both plants (Chlorophyta and Streptophyta) and fungi (Ascomycota and Basidiomycota), as stated above. In previous work we hypothesized that a great event of genome deletion, in which many of the intermediate enzymes of the amino acid biosynthetic pathways vanished, ended up affecting the usage of EAAs in chordate proteomes [18, 19]. In 2006, Payne and Loomis [10], using Pfam protein signatures, reported that protists and animals share essentiality for the nine amino acids. Here we provide a broader analysis covering all genomes available today, in an attempt to map how and when the Great Genomic Deletion happened. Evidence was found suggesting that this loss of the capability to synthesize EAAs is conspicuous at the base of metazoan evolution and simultaneously affects the complete set of EAAs. The phenomenon is characterized by an initial phenotypic deficiency, observed in Choanozoa, followed by multiple secondary gene losses. Accordingly, some enzymes found in Chordata, such as K14, M4 and M9, are missing in Arthropoda. Remarkably, some components, such as VIL1 and M7, are maintained in most metazoan clades despite the loss of the pathway.
In fact, a Great Deletion causing concurrent phenotypic loss of amino acid biosynthetic capability affects both metazoan and non-metazoan eukaryotes. Several clades containing complete genomes (black filled symbols), such as Rhodophyta, Euglenozoa and Apicomplexa, show a similar EAA pattern. Moreover, some evidence suggests the absence of complete pathways in the non-Dikarya fungi Microsporidia and Neocallimastigomycota. This supports separate events of Great Genomic Deletion at the origin of EAA auxotrophy in at least three other branches. Like Choanozoa, clades such as Heterokontophyta and Rhizaria present various enzymes and some complete pathways. Evidence of complete pathways for all EAAs but histidine (H) was obtained in Heterokontophyta. Valine (V), isoleucine (I), lysine (K) and threonine (T) are potentially synthesized in Rhizaria, as is methionine (M) in Euglenozoa and Amoebozoa. However, it is possible that other EAAs are also synthesized in some of these clades. The anabolic capabilities suggested by the current data might be underestimated because only draft genomes are available for most of these organisms. The Choanozoa clade contains only draft genomes. Although we observed more enzymes there than in metazoan clades, a final picture of Choanozoan phenylalanine biosynthesis, for example, might require completion of genome sequencing. Further gene loss occurs during metazoan evolution; however, for Placozoa, Porifera and Cnidaria, the Great Genomic Deletion seems to be well established. Since the first available sponge genome is still an ongoing project and its proteins are not yet deposited in UniProt, we manually inspected the deduced proteome using regular BLAST alignments (see Methods) and found evidence of auxotrophy for all nine EAAs. The same simple approach was applied to all phyla (Figure 1, triangles). Other clades that do not present any enzymes, such as Apusozoa and Jakobida, were omitted from Figure 1.
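One way to read "potentially proficient" for a given amino acid is that every enzyme of at least one alternative pathway has a detected homologue in the clade. The Python sketch below illustrates that reading only; the function names, the enzyme labels E1 to E5 and the pathway compositions are hypothetical placeholders and do not correspond to the enzyme sets of Figure 1.

```python
def complete_pathways(detected, alternatives):
    """Names of the alternative pathways whose enzymes were all detected.

    `detected` is the collection of enzyme codes found in a clade;
    `alternatives` maps a pathway name to the enzyme codes it requires.
    """
    detected = set(detected)
    return [name for name, enzymes in alternatives.items()
            if set(enzymes) <= detected]


def potentially_proficient(detected, alternatives):
    """True if at least one alternative pathway is complete."""
    return bool(complete_pathways(detected, alternatives))


# Hypothetical pathway definitions and detections (placeholders only).
alternatives = {"P(1)": ["E1", "E2", "E3"], "P(2)": ["E4", "E5"]}

print(complete_pathways(["E1", "E2", "E4", "E5"], alternatives))  # ['P(2)']
print(potentially_proficient(["E1", "E4"], alternatives))         # False
```

Under this reading, a clade that retains isolated enzymes (such as E1 and E4 in the second call) is not counted as proficient, which mirrors the distinction drawn above between maintained components and complete pathways.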
Lysine biosynthesis
Inspection of Figure 1 reveals a remarkable difference between the lysine (K) biosynthetic pathways present in fungi and plants. Since the occurrence of an α-aminoadipate (AAA) pathway, K(1), in fungi [20], as opposed to a diaminopimelate (DAP) pathway, K(2), known to be present in plants, algae and bacteria [21, 22], has already been reported, we set out to depict the complete scenario for K biosynthesis including prokaryotes (Figure 2). A third pathway, K(3), preferentially used by Archaea but also reported to exist in bacterial groups [23], was also considered; therefore, sequences from the archaeon Pyrococcus horikoshii were also used as seeds for homologue sequence clustering. The data support the view that the K(2) pathway, found to be complete in plants, is often present in prokaryotic clades of bacteria and archaea, in agreement with previous findings [21, 22]. Curiously, nine bacterial clades (Acidobacteria, Chlorobi, Deferribacteres, Deinococcus-Thermus, Fusobacteria, Chlamydiae, Synergistetes, Tenericutes and Thermotogae), all of which contain complete genomes, do not present the K12 enzyme, but there are three alternative subsets of enzymes present in prokaryotes that could circumvent this step in lysine biosynthesis. Chlamydiae may represent evidence of amino acid essentiality extending to prokaryotes, since diaminopimelate decarboxylase (K14) is absent and there are no known alternatives to this reaction. The set of enzymes responsible for the K(3) pathway was found to occur in prokaryotes, and it is complete in the archaeal clades Crenarchaeota and Euryarchaeota, as well as in the bacterial clades Chloroflexi and Proteobacteria, and probably in Actinobacteria and Bacteroidetes. Remarkably, the first four enzymes of this pathway coincide with those of the K(1) pathway (indicated by gray shading). The complete K(1) pathway occurs in Proteobacteria (and possibly in Actinobacteria, Bacteroidetes and Firmicutes, as evidenced by regular BLAST) and in fungi.
Thus, it is tempting to assume that a variant synthesis of K arose in Archaea and, after being modified in one of the four bacterial phyla above (with the addition of three enzymes: aminoadipate-semialdehyde dehydrogenase, saccharopine dehydrogenase NADP+ and saccharopine dehydrogenase NAD+), ended up constituting the K biosynthetic pathway that occurs in fungi. The eukaryotic clades Rhizaria and Heterokontophyta, which present the K(2) pathway, appear to group with plants.
Nitrogen auxotrophy
Consumption of amino acids is an important route for the assimilation of nitrogen into other biological compounds in heterotrophic organisms, such as those in some of the clades shown in Figure 1 (e.g. Chordata). Assimilation of free ammonium in eukaryotes is carried out by a cytoplasmic reaction catalyzed by glutamate dehydrogenase (EC:1.4.1.4), which incorporates ammonium into alpha-ketoglutarate to yield glutamate, using electrons from the reduced cytoplasmic coenzyme NADPH. Two isoforms are present in fungi and one in plants, the latter having the additional option not only to assimilate nitrogen but also, in association with nitrogen-fixing bacteria, to fix it. Thus, to investigate whether the Great Genomic Deletion of biosynthetic enzymes for EAAs co-occurred with heterotrophy for nitrogen, we generated clusters of the assimilative isoforms (EC:1.4.1.4) and, as a control, of the mitochondrial enzymes (EC:1.4.1.2), which tend to operate in the reverse direction, i.e. glutamate degradation, oxidizing glutamate, releasing ammonium and transferring electrons to the coenzyme NAD+. In yeast, the cytoplasmic assimilative isoforms are named GDH1 and GDH3, and the catabolic (mitochondrial) isoform is known as GDH2. Arabidopsis thaliana proteins were also used as seeds together with the Saccharomyces cerevisiae sequences: one known as putative GDH, which grouped with the fungal assimilative isoforms, and three catabolic GDHs, which grouped with the human mitochondrial GLUD1, though not with the yeast catabolic GDH2. Results are shown in Figure 3A. The left column shows a cluster that groups assimilative isoforms with the two from yeast and the putative GDH from A. thaliana. The catabolic mitochondrial isoforms from yeast (central column) and plant (right column) formed two independent clusters.
Among metazoans, an assimilative enzyme was found only in the basal group Cnidaria; all others, Porifera included, depend on amino acid consumption to build nitrogenous compounds such as DNA. Assimilative isoforms were also lacking in Choanozoa, although complete genomes are unavailable. The same was observed for Placozoa. Comparing these results with those shown in Figure 1, it is remarkable that Choanozoa, while still retaining many amino acid biosynthetic enzymes (37 out of 61, redundancy eliminated), shows a simultaneous deletion affecting both EAA biosynthesis and nitrogen assimilation. It is also apparent that the Great Genomic Deletion attains its almost final broad distribution in Cnidaria, which may be the last metazoan clade still capable of assimilating nitrogen from free ammonium. A few biosynthetic enzymes nevertheless remain in this clade and in other Metazoa, probably because of connective functions in metabolism (e.g. EC:1.2.1.31 aminoadipate-semialdehyde dehydrogenase, K5, and EC:1.5.1.7 saccharopine dehydrogenase, K7, also participate in the lysine degradation pathway). We have also observed that mammalian GDH (GLUD1) presents a specialized allosteric control [24], which might have shifted the enzyme toward glutamate catabolism rather than anabolism. Such control was first observed in Ciliophora [25] and is thought to have reached the metazoan ancestor by lateral gene transfer [26]. To confirm the grouping into three clusters of enzymes with such similar activities, Figure 3B shows a phylogenetic tree built with eukaryotic glutamate dehydrogenase sequences, which clustered the isoforms in full agreement with the data shown in Figure 3A.
The non-metazoan eukaryotes with complete genomes, such as Alveolata, Apicomplexa and Euglenozoa, lack EAA biosynthetic enzymes (Figure 1) but retain the capability of nitrogen assimilation (Figure 3). Fornicata and Parabasalia, although represented only by draft genomes, were shown to contain the nitrogen assimilation enzyme even though they appear to be auxotrophic for all EAAs. In Rhizaria, for which only draft genomes are available, no isoform of glutamate dehydrogenase was detected, although the clade still presents some EAA biosynthetic capability. It is possible that dependency on organic nitrogen was attained earlier in Rhizaria, although complete sequencing is required for a sound conclusion. In general, the data support a tendency for nitrogen heterotrophy to succeed amino acid essentiality. In Rhodophyta, a clade containing completely sequenced genomes, surprisingly no catabolic homologues were found; however, a sequence that clusters with the assimilative isoforms was found.
We also investigated nitrogen assimilation in prokaryotes. Homologues of assimilative enzymes are present and detected by our clustering procedure; however, although homologues of the catabolic seeds were found in bacterial clades, assimilative enzymes were not found in Aquificae, Chlamydiae and Synergistetes, all of which contain complete genomes. This absence is consistent with the lysine auxotrophy suggested for Chlamydiae (Figure 2) and supports the idea that EAA auxotrophy is associated with the lack of nitrogen assimilation even in prokaryotic clades. It is hard to infer differential enzymatic activity in prokaryotes, since the available annotated sequences often report mixed coenzyme use, either NADPH or NAD, although the homology tools grouped them distinctly. If homology is related to function, this may indicate that these organisms also require the consumption of NEAAs as a source of organic nitrogen. The presented scenario suggests that the loss of nitrogen assimilation, which forces the consumption of NEAAs, shortly succeeds the Great Genomic Deletion of EAA biosynthetic enzymes in metazoans. If this hypothesis is true, Cnidaria would be an exception.
EAA biosynthetic enzymes maintained
The EAA biosynthetic enzymes that remain in organisms lacking the complete amino acid pathway (Figure 1) are more susceptible to evolutionary modifications. It is also possible that paralogue subfunctionalization occurred in the common ancestor of animals, fungi and plants, and that the divergent copy was retained at the expense of the original gene. Considering both hypotheses, we set out to analyze enzymes from EAA pathways and from functional NEAA pathways present in metazoans. Phylogenetic trees for acetolactate synthase (VIL1 code in Figure 1) and for a group of alanine-glyoxylate, serine-glyoxylate and serine-pyruvate transaminases (G1 code in Figure 1) are represented in Figure 4. As expected, the distance between the ancestors of the two prototrophic groups, plants (green circles) and fungi (yellow circles), varies: 0.4 for VIL1 (Figure 4A) and 0.7 for G1 (Figure 4B). The distance from the ancestor of plants (green circles) to metazoans (red circles) is relatively higher for the remaining enzyme VIL1, 1.0 (2.5-fold the 0.4 measured from plants to fungi), than for the NEAA biosynthetic enzyme G1, 0.7 (1.0-fold the 0.7 measured from plants to fungi). Thus, the remaining EAA enzymes have experienced higher divergence after the attainment of amino acid auxotrophy.
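The fold values quoted for VIL1 and G1 are distances normalized by the plant-to-fungi background distance. As a quick check of that arithmetic, the following Python sketch recomputes them from the four distances given in the text; the function and variable names are illustrative assumptions, and the numbers are only those stated above.

```python
def normalized_divergence(distance, background):
    """Distance to a clade divided by the plant-to-fungi background distance."""
    if background <= 0:
        raise ValueError("background distance must be positive")
    return distance / background


# Distances read from the text (Figure 4A for VIL1, Figure 4B for G1).
distances = {
    "VIL1": {"plant_to_fungi": 0.4, "plant_to_metazoa": 1.0},
    "G1": {"plant_to_fungi": 0.7, "plant_to_metazoa": 0.7},
}

ratios = {
    enzyme: round(
        normalized_divergence(d["plant_to_metazoa"], d["plant_to_fungi"]), 2
    )
    for enzyme, d in distances.items()
}

print(ratios)  # {'VIL1': 2.5, 'G1': 1.0}
```

A ratio near 1 means the metazoan sequences have diverged from plants about as much as fungi have, whereas a ratio well above 1 indicates additional divergence on the metazoan side.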
To support this observation, Figure 5 shows the ratios calculated for 12 enzymes. Only trees showing significant bootstrap values for the branches of interest were considered. Enzyme codes in the bars are as described in Figure 1. The Y axis on the right side corresponds to the distance measured from plants (Streptophyta) to the ancestor of fungi (Dikarya). This distance was taken as a background distance to normalize the distances measured “from” plants (green bars) “to” the clades indicated on the X axis. The three enzymes on the right, S1, G1 and G2, belong to NEAA pathways, and their ratios are low. For the enzymes H5, FW7, F8, VIL1, VIL3, MT3 and M7, the ratios shown by the green bars are conversely high, ranging from around 1.5-fold up to 7-fold. These preliminary data suggest that additional evolutionary modifications have occurred to different degrees in the enzymes maintained after the loss of biosynthetic capability. The M(2) pathway appears incomplete in Basidiomycota (Figure 1; M8 is absent); however, the MT3 enzyme used here also takes part in the threonine pathway, which is complete in this clade. K6 and K10 are involved in pathways that are incomplete in plants and fungi, respectively. Accordingly, the distance measured from plants to fungi is high, and so is the drift from plants to Chordata (K6) or to Arthropoda (K10), therefore yielding balanced lower ratios. Since the ancestor of fungi and plants seems to be equally distant from both groups, and the divergence between plants and the Fungi/Metazoa group tends toward a trifurcation (see Figure 4), the yellow bars (which represent the distance from fungi to the animal clades on the X axis divided by the background distance from plants to fungi) are similar to the ratios represented by the green bars, regardless of how much modification has occurred in the animal sequences (e.g. VIL1, MT3, G1).
Furthermore, a detailed inspection of the phylogenetic trees seems to indicate that subfunctionalized paralogues appeared in basal clades such as Fungi and that those divergent paralogues remain in the more recent groups of organisms, while the copy that previously participated in biosynthesis was deleted in animals. Note that some divergent paralogues (outparalogues) [27] from Streptophyta and Ascomycota grouped with animal sequences with 100% bootstrap support (Figure 4A). Similarly, divergent paralogues were observed for the M7 enzyme (Ascomycota and Basidiomycota divergent paralogues grouped with animal sequences, 98% bootstrap; see Additional file 2: Phylogenetic tree of 5-methyltetrahydropteroyltriglutamate--homocysteine methyltransferase (M7)). Moreover, for the K10 enzyme, which participates in the K biosynthetic pathway that is defective in fungi, a divergent paralogue from Streptophyta groups with fungal enzymes (92% bootstrap) near the Arthropoda sequence (Additional file 3: Phylogenetic tree of dihydrodipicolinate synthase (K10)). Thus, the enzymes remaining from biosynthetic pathways show higher divergence, which might have been acquired through subfunctionalization in ancient clades.