Intron distribution and gene length
A comparison of the genomes of the three algal species (Ectocarpus fasciculatus, Chlamydomonas reinhardtii, Volvox carteri), the moss (Physcomitrella patens), the fern (Selaginella moellendorffii) and the six angiosperms (Oryza sativa, Zea mays, Sorghum bicolor, Arabidopsis thaliana, Glycine max and Populus trichocarpa) showed that the number of genes fell as the number of introns per gene increased (Additional file 1: Figure S1). The proportion of genes containing 0–9 introns ranged from 73.6% in C. reinhardtii to 90.1% in P. trichocarpa.
The correlation between SCUB frequency and intron number
The SCUB frequency in genes bearing 0–9 introns was based on the analysis of 59 codons. SCs with an A or a T in the third position (NNAs, NNTs) behaved quite distinctly from NNCs and NNGs (Additional file 2: Table S1). In the algal genomes, the frequencies of most NNAs and NNTs were inversely related to intron number, while the relationship was largely the opposite for most NNCs and NNGs. In the land plant sequences, however, the frequencies of NNAs and NNTs were positively correlated with intron number, while those of NNCs and NNGs were negatively correlated. These trends are illustrated by the ratios of NNAs and NNTs to NNCs and NNGs for the 18 amino acids with SCs, which were positively related to intron number in land plants, but negatively in algae (Figure 1, Additional file 3: Figure S2).
The algal and the land plant genomes also differed with respect to the global frequencies of NNA, NNT, NNC and NNG, defined as the summed frequency of all NNAs, NNTs, NNCs or NNGs, respectively, relative to that of all 59 SCs. In the algal genomes, both NNC and NNG were more common than NNA and NNT, and the global frequencies were related to intron number in the same way as those of the individual NNAs, NNTs, NNCs and NNGs (Figure 2A). In the land plant genomes, both the NNA and the NNT frequencies rose as intron number increased, while those of NNC and NNG both fell (Figure 2B-E). In the moss genome, the frequency of each of these four codon types was similar across the full gene set (Figure 2B), while in S. moellendorffii and the three cereal genomes, the NNC and NNG frequencies were notably higher than those of NNA and NNT in genes carrying 0–4 introns, but similar to them in genes carrying 5–9 introns (Figure 2C,D, Additional file 4: Figure S3). In the three dicotyledonous genomes, NNA and NNT were in excess over NNC and NNG across the whole gene set, particularly when the number of introns was large (Figure 2E, Additional file 4: Figure S3).
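The global frequencies defined above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: it assumes the standard genetic code (so that the 59 SCs are all codons except ATG, TGG and the three stop codons), in-frame coding sequences, and a hypothetical toy input keyed by intron number. It also computes a single pooled ratio of NNA and NNT to NNC and NNG, which simplifies the per-amino-acid ratios reported in Figure 1.

```python
from collections import Counter

# Under the standard genetic code, 59 of the 64 codons have synonymous alternatives.
# Excluded: ATG (Met) and TGG (Trp), which have no synonyms, and the three stop codons.
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def global_third_base_frequencies(cds_list):
    """Return the share of NNA, NNT, NNC and NNG among all counted synonymous codons."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            # Skip non-synonymous codons and codons with ambiguous bases.
            if codon in NON_SYNONYMOUS or not set(codon) <= set("ACGT"):
                continue
            counts[codon[2]] += 1
    total = sum(counts.values())
    return {"NN" + base: (counts[base] / total if total else 0.0) for base in "ATCG"}


def at_to_cg_ratio(freqs):
    """Ratio of (NNA + NNT) to (NNC + NNG); NaN if no NNC/NNG codons were counted."""
    cg = freqs["NNC"] + freqs["NNG"]
    return (freqs["NNA"] + freqs["NNT"]) / cg if cg else float("nan")


# Hypothetical toy input: intron number -> coding sequences of genes in that class.
toy_genes_by_intron_number = {
    0: ["ATGGCAGCTGCCGCGTAA"],
    1: ["ATGGCAGCTGCTGCCTAA"],
}
for n_introns, cds_list in sorted(toy_genes_by_intron_number.items()):
    freqs = global_third_base_frequencies(cds_list)
    print(n_introns, freqs, at_to_cg_ratio(freqs))
```

Grouping genes by intron number before calling the function gives one set of four frequencies per intron class, which is the form of data plotted against intron number in Figure 2.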
SCUB frequency is variable within exonic sequence
In genes containing between two and ten exons, the SC frequencies plotted from the first to the last exon formed arched curves (‘∩’ or ‘∪’) in both algae and land plants: the interstitial exons had either higher or lower frequencies than the two terminal exons (data not shown). In land plants, NNCs and NNGs mostly showed ‘∪’ curves, whereas NNAs and NNTs mostly showed ‘∩’ curves; the pattern in the algal genomes was the opposite. For each of the 18 amino acids with SCs, the ratio of the frequency of NNAs and NNTs to that of NNCs and NNGs across exons displayed a ‘∩’ distribution in land plants but a ‘∪’ distribution in algae (data not shown), and the mean of these ratios over the 18 amino acids followed the same patterns (Figure 3, Additional file 5: Figure S4). Similarly, the individual frequencies of NNA and NNT showed ‘∩’ distributions, and those of NNC and NNG ‘∪’ distributions, across the whole set of exons in the land plants, while the opposite was seen in the algae (Figure 4, Additional file 6: Figure S5).
The mean ratios of NNAs and NNTs to NNCs and NNGs within the first exon were comparable among genes with 2–10 exons in both the algal and the land plant genomes; in comparison with the first exon, these ratios in the subsequent exons were higher in the land plant genes but lower in the algal genes (Figure 3, Additional file 5: Figure S4). In the final exon, the ratios were conserved among the algal genes, but were positively correlated with intron number among the land plant genes; this correlation was weakest among the angiosperm species. In the interstitial exons, the ratios were conserved among the algal genes, but were variable among the land plants, particularly in genes having a larger number of introns. Heterogeneity between exons was also reflected by the frequencies of NNA, NNT, NNC and NNG (Figure 4), which were relatively well conserved in the first exon across all the test species. In the final exon, conservation was good among the algal species; in the moss, fern and monocotyledonous angiosperm species, the frequency of NNC and NNG was positively correlated with intron number, whereas that of NNA and NNT was negatively correlated; among the dicotyledonous species, the frequency of NNC and NNG was well conserved, but that of NNA was reduced and that of NNT was increased in genes carrying a larger number of introns.
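As an illustration of how a ‘∩’ or ‘∪’ profile might be told apart numerically, the sketch below compares the mean of the interstitial exons with the mean of the two terminal exons. This is an assumed heuristic, not the procedure used to draw the figures; the function name, the labels 'cap' (for ‘∩’) and 'cup' (for ‘∪’), and the `tolerance` parameter are all hypothetical.

```python
def exon_profile_shape(values, tolerance=0.0):
    """Classify a first-to-last exon profile as 'cap', 'cup' or 'flat'.

    'cap' means the interstitial exons lie above the terminal ones (inverted U);
    'cup' means they lie below (U shape).
    Returns None when there is no interstitial exon (fewer than three values).
    """
    if len(values) < 3:
        return None
    terminal_mean = (values[0] + values[-1]) / 2
    interior = values[1:-1]
    interior_mean = sum(interior) / len(interior)
    difference = interior_mean - terminal_mean
    if difference > tolerance:
        return "cap"
    if difference < -tolerance:
        return "cup"
    return "flat"


# Hypothetical per-exon ratios for a four-exon gene class.
print(exon_profile_shape([0.8, 1.2, 1.3, 0.9]))
print(exon_profile_shape([1.2, 0.8, 0.7, 1.1]))
```

A two-exon gene has no interstitial exon, so the function returns None for it rather than forcing a label.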
The role of DNA methylation in the formation of SCs
DNA methylation is a major source of DNA variation, since methylated C can readily be converted into T [15]. Conversion of the methylated C in a CpG, on one strand or on its complement, produces TpG or CpA, and conversion of both cytosines produces TpA. To investigate the influence of C methylation on the relationship between SCUB and either intron number or exon position, the frequencies of the 16 combinations of the second and third nucleotides of a codon (NNN) and of the 16 combinations of the third nucleotide with the first nucleotide of the next codon (NN|N) were compared. In the land plant genomes, an increase in intron number was associated with a sharper fall in the frequency of NCG than in that of NAG, NTG or NGG, while the frequency of NCA rose more strongly than that of the other NNA triplets (Figure 5C, Additional file 7: Figure S6); the frequencies of the four NNC and the four NNT codons did not differ from one another (data not shown). In land plant genes with a high intron number, the decline in the frequency of NC|G was steeper than that of the other three possible NC|Ns; at the same time, the frequency of NT|G rose more sharply than that of the other NT|Ns (Figure 5D, Additional file 8: Figure S7). This behaviour was not shown by either NG|N or NA|N (data not shown). Unlike NCA and NT|G, the frequencies of NTA, NTG, NC|A and NT|A were largely conserved, presumably reflecting strong selection pressure against base conversion at the first and second positions of the codon.
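The two dinucleotide contexts compared above can be counted as in the following minimal Python sketch. It assumes an in-frame coding sequence and does not filter stop or non-synonymous codons; the function names and the normalisation (share among pairs with the same codon third base) are illustrative assumptions rather than the definitions used in this study.

```python
from collections import Counter


def codon_dinucleotide_counts(cds):
    """Count two dinucleotide contexts in an in-frame coding sequence.

    within: codon positions 2-3 (e.g. 'CG' corresponds to NCG)
    across: codon position 3 plus position 1 of the next codon (e.g. 'CG' is NC|G)
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    within = Counter(codon[1:] for codon in codons)
    across = Counter(a[2] + b[0] for a, b in zip(codons, codons[1:]))
    return within, across


def share_among_same_third_base(counts, pair, third_base_index):
    """Share of `pair` among all counted pairs with the same codon third base.

    third_base_index is 1 for within-codon pairs and 0 for cross-codon pairs.
    """
    third = pair[third_base_index]
    total = sum(n for p, n in counts.items() if p[third_base_index] == third)
    return counts[pair] / total if total else 0.0


# Hypothetical toy sequence: codons GCC, GCG, AAG.
within, across = codon_dinucleotide_counts("GCCGCGAAG")
print(share_among_same_third_base(within, "CG", 1))  # NCG among NNG
print(share_among_same_third_base(across, "CG", 0))  # NC|G among NC|N
```

Computing these shares separately for each intron-number class or exon position would give series comparable in form to those in Figures 5 and 6.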
A relationship between DNA methylation-induced nucleotide substitution and SCUB was also detectable among the exon sequences in the land plant genomes (Figure 6, Additional file 9: Figure S8, Additional file 10: Figure S9). The distribution of NCG frequencies from the first to the last exon had a larger ‘∪’ curvature than those of the other NNGs; the frequencies of NAG, NGG and NTG were rather constant among the various exons. The behaviour of NCA was rather similar to that of NCG, in that its distribution showed the largest ‘∩’ curvature among the NNAs. The curvatures associated with the NC|G and NT|G distributions appeared to be greater than those associated with the other NC|Ns and NT|Ns, respectively (Figure 6C,D), while those associated with either the NNCs and NNTs or the NA|Ns and NG|Ns were similar to one another. In comparison with the other NNGs and NC|Ns, the frequencies of NCG and NC|G were the most strongly positively correlated with those of, respectively, NNG and NNC, and the most strongly negatively correlated with those of NNA and NNT. Similarly, compared to the other NNAs and NT|Ns, the frequencies of NCA and NT|G were most strongly positively correlated with those of, respectively, NNA and NNT, and most strongly negatively correlated with those of NNG and NNC (data not shown).
The role of methylation in SCUB is also revealed by the frequencies of the SCs encoding a given amino acid (Additional file 2: Table S1). For the residues Ala, Pro, Ser and Thr, each of which is encoded by more than two SCs with a C in the middle position, the NCG frequency declined more sharply than that of NCC as the intron number increased, while the NCA frequency rose more markedly. For Arg, Gly, Leu and Val (codons without a C in the middle position), the frequencies of NNCs were clearly lower than those of NNGs, while those of NNTs were higher than those of NNAs. A comparison between the pairs of residues Asn vs Lys, Asp vs Glu and Gln vs His (in each pair, the SCs share the same first two nucleotides and lack a C at the second position) showed that the frequencies of NNCs and NNTs changed more markedly than those of NNGs and NNAs, respectively. A similar analysis of asymmetric methylation, based on the codons CHG and CHH (H = A, C or T), was carried out by assessing the frequencies of N|NN and NNN; a more obvious alteration in C|NN and NNG frequencies than in the others was found with respect to both intron number and exon position (data not shown). Unlike in the land plant genomes, in the algal genomes the frequencies of NCG and NC|G, and of NCA and NT|G, did not differ from those of the other NNN and NN|N combinations, and were uncorrelated with both intron number and exon position (Figures 5A,B and 6A,B, Additional file 8: Figure S7, Additional file 9: Figure S8, Additional file 10: Figure S9).
Plants are clustered with respect to SCUB based on intron number and exon position
SCUB frequency clearly distinguished the algae from the land plants (Figure 7). Within the latter group of species, a principal component (PC) analysis based on either intron number or exon position also divided the monocotyledonous and dicotyledonous species into two recognizable clades (Figure 7C,D), although the relationships between the land plant species were somewhat different when a clustering analysis was applied as an alternative to the PC analysis (Figure 7A,B). A PC analysis using SCUB frequency at various exon positions suggested that a level of heterogeneity was present (Additional file 11: Figure S10). SCUB frequency based on the full set of exons successfully separated the algal from the land plant genomes; that based on the first exon alone produced four groups (algae, moss/fern, monocotyledonous species and dicotyledonous species); that based on the last exon alone merged the monocotyledonous and dicotyledonous species into a single group; finally, that based on the interstitial exons produced three clades, namely the algae, the moss/fern/monocotyledonous species and the dicotyledonous species.