We used correspondence analysis (see Methods) to compare the amino acid compositions of 208 predicted proteomes, broadly representing the three phylogenetic domains as well as various lifestyles (20 hyperthermophiles, 7 thermophiles, 8 psychrophiles and 173 mesophiles, including 53 eukaryotes; a detailed list is given in Additional file 1). Figure 1 shows the resulting distribution of species and amino acids as projected on the first factorial plane, which represents 77% of the total information in the original data table. We first analyze the distribution of species in terms of global properties and discriminated groups. We then focus on more detailed statistical characterizations of the various groups, with their associated amino acid signatures. Finally, we explore potential evolutionary trends associated with the various observations.
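As an illustration of the kind of input such an analysis works on, the sketch below builds a species-by-amino-acid count table and computes its total inertia (the chi-square statistic divided by the grand total), which is the quantity that the factorial axes of a correspondence analysis decompose. This is a minimal sketch and not the implementation described in Methods: the function names and the toy sequences are hypothetical, and the extraction of the factorial axes themselves is not shown.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_counts(proteins):
    """Count the 20 standard amino acids over all proteins of one proteome.

    Characters outside the 20 standard residues (e.g. 'X') are ignored.
    """
    counts = Counter()
    for seq in proteins:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    return [counts[aa] for aa in AMINO_ACIDS]


def total_inertia(table):
    """Total inertia of a contingency table: chi-square statistic / grand total."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    inertia = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip amino acids absent from every proteome
                inertia += (observed - expected) ** 2 / expected
    return inertia / n


# Toy, made-up sequences (hypothetical; real proteomes hold thousands of proteins).
toy_proteomes = {
    "species_A": ["MKKLLE", "MAEEK"],
    "species_B": ["MGGAAP", "MAPGA"],
}
table = [composition_counts(p) for p in toy_proteomes.values()]
print(total_inertia(table))
```

The share of this total inertia carried by the first two axes is what a statement such as "77% of the total information" refers to.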
Distribution of species and segregations
Global description
Confirming and refining the results of previous analyses [6, 7], the global distribution of species follows first GC content, which corresponds to the F1 factorial axis (contribution of about 63%) and increases from left to right (from 23% in Mycoplasma mycoides to 72.1% in Streptomyces coelicolor), and second optimal growth temperature, which corresponds to the F2 factorial axis (contribution of about 14%) and increases upward from moderate to high temperatures. It is important to stress that GC content and optimal growth temperature are not included in the set of analysed parameters; rather, they emerge as observations underlying the distribution of species obtained from their amino acid compositions.
Species segregation and discrimination following lifestyles and phylogenies
Based on phylogenetic and lifestyle classifications (as identified by colour codes in Figure 1 and Figure 1A), we observe a striking segregation of eukaryotes, prokaryotic mesophiles and hyperthermophiles, each occupying a sharply defined, non-overlapping strip. With respect to this segregation, the only 'discrepancy' concerns the eukaryote E. cuniculi, which lies in the territory of mesophilic prokaryotes. The thermophilic species lie essentially at the border between hyperthermophiles and prokaryotic mesophiles (with the exception of T. elongatus). In contrast with this clear-cut segregation, psychrophiles lie within the strip associated with prokaryotic mesophiles.
Concerning this general stratified structure in the distribution of species, we observe (Figure 1) that the barycenters of the various categories considered above (hyperthermophiles (HTH), thermophiles (TH), prokaryotic mesophiles (PMES), psychrophiles (PSYC) and eukaryotes (EUK)) are roughly aligned along the second factorial axis. This structure shows that, within each category, species are rather homogeneously distributed according to their GC content (with a mean value of about 40%) around the axis of barycenters. In this respect, based on the 8 available species, the scattering of psychrophiles around their barycenter appears limited compared with the other categories.
Statistical characterization of segregated groups and associated signatures
Statistical characterization of segregated groups
For a detailed characterization of the species distribution observed in Figure 1, we compared, for the various groups (HTH, TH, PMES, PSYC and EUK), the mean amino acid compositions, along with pooled means associated with physico-chemical characteristics (polar, charged and hydrophobic). We used one-way analysis of variance, followed by the Newman-Keuls multiple comparison test for pairwise differences. For robustness and consistency, we chose a stringent significance threshold, set at p < 0.001. These comparisons revealed significant signatures (with steady variations of mean values, either increasing or decreasing) between the following three groups: a merged group of hyperthermophiles and thermophiles (HTH-TH); a merged group of prokaryotic mesophiles and psychrophiles (PMES-PSYC); and eukaryotes (EUK). The corresponding signatures and characteristic trends are detailed below.
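The first step of this comparison, one-way analysis of variance, can be sketched as follows for a single amino acid. This is a minimal illustration under stated assumptions: the function name and the toy values are hypothetical, only the F statistic and its degrees of freedom are computed, and the p-value and the Newman-Keuls post hoc test (which require the F and studentized range distributions) are left to a statistics package.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per group."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy, made-up relative abundances (%) of one amino acid in three groups.
toy_groups = [
    [1.0, 2.0, 3.0],  # hypothetical group 1
    [2.0, 3.0, 4.0],  # hypothetical group 2
    [5.0, 6.0, 7.0],  # hypothetical group 3
]
f_stat, df_b, df_w = one_way_anova_f(toy_groups)
print(f_stat, df_b, df_w)
```

A large F indicates that the differences between group means are large relative to the scatter within groups; the pairwise test then identifies which groups differ.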
Physico-chemical signatures
The pools of polar, charged and hydrophobic amino acids are represented on the factorial plane in Figure 2. The polar pool and the [polar – charged] difference are each characteristic of every one of the three segregated groups (HTH-TH, PMES-PSYC and EUK), all three pairs of mean values being significantly different at p < 0.001. The abundance of the polar and [polar – charged] pools increases steadily from HTH-TH to PMES-PSYC, from HTH-TH to EUK and from PMES-PSYC to EUK (see Additional file 4). As for the hydrophobic pool, we observe a decrease from HTH-TH to EUK and from PMES-PSYC to EUK. The relative abundance of the hydrophobic pool thus appears as a characteristic signature of eukaryotes (EUK), since the corresponding mean value is significantly different from that of each of the two other groups (HTH-TH and PMES-PSYC), whereas the mean values of these two groups are not significantly different from each other at p < 0.001.
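The pooled quantities discussed above can be computed from a proteome's amino acid frequencies as in the sketch below. The pool memberships are an assumption: this section does not list them, so a common textbook grouping is used here purely as a placeholder, and the function name is hypothetical.

```python
# ASSUMPTION: placeholder pool definitions (a common textbook grouping),
# not necessarily the ones used in the paper's Methods.
CHARGED = set("DEKR")
POLAR = CHARGED | set("HNQST")  # polar pool taken to include the charged residues
HYDROPHOBIC = set("AILMFVW")


def pooled_fractions(freqs):
    """Sum per-residue frequencies (dict: one-letter code -> fraction) into pools."""
    polar = sum(freqs.get(aa, 0.0) for aa in POLAR)
    charged = sum(freqs.get(aa, 0.0) for aa in CHARGED)
    hydrophobic = sum(freqs.get(aa, 0.0) for aa in HYDROPHOBIC)
    return {
        "polar": polar,
        "charged": charged,
        "polar_minus_charged": polar - charged,
        "hydrophobic": hydrophobic,
    }


# Toy input: a made-up proteome in which all 20 residues are equally frequent.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
print(pooled_fractions(uniform))
```

Under these placeholder definitions, the [polar – charged] pool reduces to the uncharged polar residues, which is why it can vary independently of the charged pool.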
Amino acid signatures
Based on the discrimination of the three segregated classes HTH-TH, PMES-PSYC and EUK, we classify amino acids (at the significance level p < 0.001) into three groups (Figure 3 and Additional file 3):
a) Amino acids whose relative abundance is characteristic of each one of the three groups (with steady variation, either increase or decrease, from HTH-TH to PMES-PSYC and from PMES-PSYC to EUK): Val (decrease), His and Ser (increase).
b) Amino acids characterizing HTH-TH or EUK: for EUK, Cys is high and Leu, Gly and Ile are low; for HTH-TH, Tyr and Glu are high and Asp, Thr and Gln are low.
c) Amino acids with no discriminative characteristics (no significant differences between the three groups): all others (with the exception of Ala and Pro, which show partially discriminative properties).
In summary, the characterizations above (at the significance level p < 0.001) are represented in Figure 3 in correspondence with the segregation into the three main groups (HTH-TH, PMES-PSYC and EUK), with the non-discriminative amino acids essentially concentrated in a median horizontal strip of the factorial plane. The description would of course vary with the threshold (for example, with a probability threshold of 0.05, Cys is found to increase from HTH-TH to PMES-PSYC). More detailed amino acid comparison results are reported in Additional files 2, 3 and 4.
Overall trends and amino acid chronologies
We investigate overall trends that could underlie the segregation of species, following the amino acid territories shown in Figure 1.
Protein conservation
For the three phylogenetic domains of life (Archaea (A), Bacteria (B) and Eukarya (E)), systematic comparisons of proteomes were used to determine the subsets of proteins conserved exclusively in one domain or in a combination of domains (E, A, B, EA, EB, AB and EAB), along with the subset of species-specific proteins (SPEC, i.e. with no detectable similarities outside their own proteomes). The comparative data were taken from a recent study [11] covering 100 species (among the 208 considered here). The amino acid compositions of the different subsets were determined and used as dummy observations (see Methods) in the factorial analysis distribution shown in Figure 2. In this analysis, the trend from the core set EAB (which can be associated with ancient proteins [12, 7]) to the set SPEC of species-specific proteins essentially follows the factorial axis F2 and points towards eukaryotic territory.
Amino acid chronologies
In this section we consider the distributions observed above in the light of inferred chronologies for the recruitment of amino acids into the genetic code, following models and data from Jordan et al. [9], Trifonov [10], Miller [13, 14] and Cronin and Pizzarello [15].
1) Model of Jordan et al.:
Following the model of Jordan et al. [9] amino acids are classified as "gainers" (either strong or weak) or "losers" (either strong or weak), with "gainers" corresponding to amino acids supposed to be recruited late into the genetic code. In this model the "strong gainers" are His, Ser and Cys (corresponding to the discriminant signatures for the three main segregated classes above) along with Phe and Met. Conversely, in this model, the "strong losers" (presumed to include the most ancient amino acids) are Pro, Ala, Glu and Gly. This separation between "strong losers" and "gainers" is recovered rather faithfully in our factorial analysis if we separate the factorial plane into two regions T1 and T2 (as represented in Figure 4) corresponding respectively to [high_temperature]-[high_GC] and [moderate_temperature]-[low_GC] characteristics, with a roughly defined border. With such a separation of the factorial plane we observe that the "strong gainers" His, Ser and Cys are in T2, with Phe and Met at the border. In addition, the "weak gainers" in this model (Asn, Thr and Ile; but not Val, in T1) also lie within the T2 space. Conversely, all "strong losers" lie within T1, while the "weak loser" Lys is situated in T2.
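The correspondence described in this paragraph can be tabulated as in the sketch below, which encodes only the amino acids and placements named in the paragraph above (classes from the model of Jordan et al. [9], regions T1, T2 or border as read from Figure 4) and counts how each class falls across the regions. The variable and function names are hypothetical, and amino acids not mentioned in the paragraph are deliberately left out.

```python
from collections import Counter

# Classes as reported in the paragraph above for the model of Jordan et al. [9].
JORDAN_CLASS = {
    "His": "strong gainer", "Ser": "strong gainer", "Cys": "strong gainer",
    "Phe": "strong gainer", "Met": "strong gainer",
    "Asn": "weak gainer", "Thr": "weak gainer", "Ile": "weak gainer",
    "Val": "weak gainer",
    "Pro": "strong loser", "Ala": "strong loser", "Glu": "strong loser",
    "Gly": "strong loser",
    "Lys": "weak loser",
}

# Region of the factorial plane for each amino acid, as stated in the same paragraph.
REGION = {
    "His": "T2", "Ser": "T2", "Cys": "T2", "Phe": "border", "Met": "border",
    "Asn": "T2", "Thr": "T2", "Ile": "T2", "Val": "T1",
    "Pro": "T1", "Ala": "T1", "Glu": "T1", "Gly": "T1",
    "Lys": "T2",
}


def tally_regions(jordan_class, region):
    """Count amino acids for each (class, region) pair."""
    return Counter((cls, region[aa]) for aa, cls in jordan_class.items())


counts = tally_regions(JORDAN_CLASS, REGION)
for (cls, reg), n in sorted(counts.items()):
    print(f"{cls:13s} {reg:6s} {n}")
```

Read this way, the exceptions to the losers-in-T1, gainers-in-T2 pattern are Val (a weak gainer in T1) and Lys (a weak loser in T2), with Phe and Met at the border.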
2) Model of Trifonov:
With the factorial plane separation above, we observe that the amino acids in T1 correspond largely to the first amino acids recruited into the genetic code according to the chronology suggested by Trifonov [10] (Gly, Ala, Asp, Val, Pro, Ser, Glu, Leu, Thr, Arg; see the ranks reported on the amino acids in Figure 4). In this list, the first discrepancy with the separation in Figure 4 concerns Ser, in position 6 (the first 5 amino acids in the list all being situated in T1). It is nevertheless interesting to note that, in the analysis of Trifonov, Ser was also the first amino acid in the chronological classification for which two distinct positions were considered, according to the associated codons (either UCX or AGY, with the associated ranks 6 and 11; see Figure 1 in [10]). As for Thr (ranked 9 in the chronology of Trifonov), its position in Figure 4 is at the border between T1 and T2. The two discrepancies between the chronology of Trifonov and the T1/T2 separation thus concern Ser and Thr, and appear to correspond to contradictions between the models of Trifonov and of Jordan et al. (Ser and Thr being reported as "strong" and "weak gainers", respectively, in the model of Jordan et al. [9]).
3) Miller's experiments and data from the Murchison meteorite:
The ancient amino acids, as derived from Miller's experiments [13, 14] and from the analysis of the Murchison meteorite [15] (both sources of information being included in the criteria used for establishing the chronology in [10]), are essentially clustered in T1. It is striking to observe that the most abundant amino acids in these experiments [14, 15] (Gly and Ala) are situated deep in T1, whereas those reported to be less abundant tend to cluster at the boundary between T1 and T2. It is also not surprising to observe that the possible "discrepancies" between the spark data (Miller's experiment) and the T1/T2 scheme again concern the amino acids Ser and Thr. Interestingly, however, these amino acids also correspond to the observed discrepancies between the spark data and the Murchison meteorite data (Ser and Thr are not reported in the meteorite data; see representations in Figure 4), as well as to the discrepancies between the two models above, as already mentioned. Finally, the overall decreasing gradient from T1 to T2 (Figure 4) in terms of ancient amino acid abundance is further enhanced with the chronology following the "yields of amino acids in imitated primordial conditions" as compiled by Trifonov [10] (criterion N3', which includes 3 experimental conditions in addition to Miller's).
The various schemes above, which relate inferred chronologies to the recruitment of amino acids into the genetic code, lead us to suggest a time directionality, represented by the arrow from T1 to T2 in Figure 4. Overall, this arrow points in the same direction as the one associated with protein conservation: from the most conserved ancestral common "core" of proteins to the set of species-specific proteins.