Analysis of codon usage patterns in citrus based on coding sequence data

Background Codon usage is an important determinant of gene expression levels that can help us understand codon biology, evolution and mRNA translation of species. The majority of previous codon usage studies have focused on single species analysis, although few studies have focused on the species within the same genus. In this study, we proposed a multispecies codon usage analysis workflow to reveal the genetic features and correlation in citrus. Results Our codon usage analysis workflow was based on the GC content, GC plot, and relative synonymous codon usage value of each codon in 8 citrus species. This approach allows for the comparison of codon usage bias of different citrus species. Next, we performed cluster analysis and obtained an overview of the relationship in citrus. However, traditional methods cannot conduct quantitative analysis of the correlation. To further estimate the correlation among the citrus species, we used the frequency profile to construct feature vectors of each species. The Pearson correlation coefficient was used to quantitatively analyze the distance among the citrus species. This result was consistent with the cluster analysis. Conclusions Our findings showed that the citrus species are conserved at the genetic level and demonstrated the existing genetic evolutionary relationship in citrus. This work provides new insights into codon biology and the evolution of citrus and other plant species.


Background
The genetic code is degenerate. There are 64 different codons, including 61 codons encoding for amino acids and 3 stop codons, but only 20 translated amino acids. As a result of the degeneracy of the genetic code, many amino acids are encoded by two-to-six synonymous codons, termed condon usage bias. The genetic codes of different organisms are often biased towards the use of one of several codons. The codons that encode the same amino acid *Correspondence: wanxiaohua@ict.ac.cn † Zenan Shen and Zhimeng Gan contributed equally to this work. 1 High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China 2 University of Chinese Academy of Sciences, 100000 Beijing, China Full list of author information is available at the end of the article over the others are called synonymous codons [1]. These differences among the usage of the synonymous codons have been the important factor for the evolution of proteome diversity, and preferences for synonymous codons exists widely within the genomes due to mutation, natural selection, and random drift [2][3][4]. Thus, a comprehensive understanding of the biases in codon usage can help us explore the evolution of those proteins that have structural differences conserved at the sequence level [5][6][7][8].
Escherichia coli [9], Caenorhabditis, Drosophila, Arabidopsis [10], Paeoniaceae lactiflora [11] and Megalobrama amblycephala [12]. However, few studies has been performed on the correlation within the same genus based on codon usage patterns, and a similar study in citrus species was not based on the whole genome [13]. Therefore, further research and analysis of the Citrus genus could be useful for understanding the conservatism and evolution of different citrus species. Citrus species are economically important evergreen trees that are major fruit producers in the world, with annual global yields of more than 130 million tons [14]. They are native to the subtropical and tropical regions of Asia and the Malay [15][16][17]. Citrus plants spread to Australasia, Japan and other regions during the early Pleistocene. The geographical origin, timing and dispersal of citrus species across southeast Asia remain unclear [18]. The investigation of genetic difference can help us get new insights on evolutionary relationship of citrus. To reveal the correlation in citrus species, we proposed a multispecies codon usage analysis workflow including data preprocessing, codon usage bias analysis, high-frequency codons identification of 8 different citrus species in this study. The difference between the same highfrequency codons among different citrus was no more than 0.05, and in 13 high-frequency codons, 11 of them were the same. Compared with other species in the plant kingdom, citrus showed similar codon usage bias. Moreover, pearson correlation coefficient was used to study the relationship among citrus quantitatively [19]. This can confirm the results of cluster analysis. The results will help us understand codon biology and evolution in citrus plants, and will help improve the research on correlation analysis of the same genus.

Codon usage in 8 citrus genomes
The GC content may reflect significant compositional features of the genome. As the research shows, GC content still remained significantly negatively correlated with mean annual temperature, warmest and positively correlated with latitude and annual temperature range [20]. The average overall GC content in this study was 43.67%, and varied among the different citrus species and codon positions. Citrus grandis showed the highest GC content with a value of 43.79%, Citrus sinensis showed the lowest GC content with a value of 43.50%. For GC content at the first position, which obtained the highest value in citrus, Atlantia buxifolia showed the highest value at 50.70% and Citrus reticulata 'Mangshan' showed the lowest value at 50.51%. The highest and lowest values of GC2, GC3 and GC3s were GC2: 40.56%(Citrus grandis) and 40.12%(Citrus sinensis); GC3: 40.28%(Citrus clementina) and 39.35%(Atlantia buxifolia); and GC3s: 38.02%(Citrus clementina) and 37.08%(Citrus reticulata 'Mangshan'). Among the 8 citrus species, the value of GC3 and GC3s of Atlantia buxifolia was the lowest (Atlantia buxifolia is known as Chinese box orange and was formerly named Severinia buxifolia) [21]. The GC base pair is more thermally stable than AT base pair, and it can reflect the distribution history in citrus species. As an example of a primitive citrus species, Atlantia buxifolia showed that codon usage was not completely conserved and evolution was more active (Table 1).

Neutrality plot analysis
The neutrality plot was used to analyze the relationships among the three codon positions to examine the role of mutation in citrus [22]. We found that citrus genes had a narrow range of GC12(42%~48%) and GC3(36%~42%) values and there were significiant correlations between GC12 and GC3 in Citrus sinensis and Citrus clementina, Genes represents the number of sequences after filtering; GC1, GC2 and GC3 represent the GC content of the first, second, third base of codon; GC3s represents the GC content of the third synonymous position; ENC represents the effective number of codons where the slope of the regression line was more than 0.2. The significantly correlation indicating that the GC mutation bias effect the GC contents similarly among all positions of codons. In contrast, there was no significantly correlations in other 6 citrus species, and the slope of regression line was near 0, indicating there are low mutation bias or high conservation of GC content and limited evidence of directional mutation pressure in these citrus genes. The results also showed that Citrus sinensis was the most affected species by directional mutation pressure due to its highest correlation coefficient of 0.3047 in citrus ( Fig. 1). Because of the partially silent nature of the third codon position, GC3 represents one of the most neutral nucleotides within the genome with respect to the G + C content [23].

ENc plot analysis
Analysis of the relation between GC3 and ENC can determine the relation between the differences in ENC and the differences in GC contents. The ENc-plot is an effective tool to study the codon usage patterns, and it was used here to explore the influence of GC3s on the codon bias in citrus [24]. As shown in Fig. 2, citrus species showed similar patterns in ENc plot. Most genes were located below the expected ENc-plot curve, whereas only a small number of genes were at or above the curve. These results indicated that the conditional mutation might be a weak factor in shaping the codon bias, which is also affected by other factors.
To further prove the conservative of the influence of GC3s in citrus and to validate the difference between the observed and expected ENC values, (ENCexp-ENCobs)/ENCexp was calculated. As shown in Fig. 3, there was a single peak, the shape and location of the peak were similar among the citrus species. More than 60% of the total genes of the 8 citrus species were distributed within the 0 to 0.1 range of the (ENCexp-ENCobs)/ENCexp values, indicating that the most actual ENC values were slightly smaller than the expected ENC values from the GC3s. These results also prove that the conditional mutation might be a weak factor affecting the evolution history of citrus.

High-Frequency codons and codon pairs usage analysis in citrus
The RSCU of codons was calculated. AGA was the most frequent codon, which encoded Arg. GCT and GTT were the next two highly frequent codons, which encoded Ala and Val, respectively. Of all the 8 citrus species, AGA, GTT, GCT, TCT, TTG, ATT, GAT, CAT, AAT, TTT and  TAT were identified as the most frequent codons in common. Among these codons, 91% ended with A/T, and only 9% of them ended with G/C, indicating that citrus species were more likely to use A/T at the third position of highfrequency codons. Among the high-frequency codons, 36.4% started with G/C and the other 63.6% started with A/T, indicating a bias towards A/T at the first position of the high-frequency codons. Atlantia buxifolia had the most high-frequency codons at 15. It is possible that the GC to AT mutation in Atalantia buxifolia mainly occurred during the evolution ( Table 2) [25].
The RSCU of four NCG codons in the citrus species were the lowest (CCG:0.46 TCG:0.43 ACG:0.42 GCG:0.32). The results showed that citrus have a relatively high methylation level. Four NTA codons also had a low RSCU value (TTA:0.84 ATA:0.77 GTA:0.65 CTA:0.56), as low RSCU values of NTA codons inhibit mRNA degradation and thus increases protein production [26].
In practice, codon pairs are used more frequently. At the mRNA translation level, codon pair context influences the speed and accuracy of translation processes, and are species specific. Single codon optimization does not mean global optimization. Codon pairs also show some bias among synonymous pairs. As shown in the Additional file 1, based on 3,721 (61*61) codon pairs, 832 high-frequency codon pairs were identified on average, and Atlantia buxifolia had the highest number of highfrequency codon pairs at 839, and Citrus grandis had the lowest number of pairs at 822. The last three codon pairs were nnGCnn, nnCCnn and nnCTnn, which may relate to a lower methylation level of citrus DNA [27]. This result was consistent with our hypothesis that the codon usage patterns in Atlantia buxifolia was not completely conserved in the evolutionary process.

Codon usage patterns across the plant kingdom
The natural selection distinguishing between synonymous codons constrains the rate of nucleotide substitution. And within an evolutionary framework, the degree of codon bias reflects a balance between selection and synonymous mutations [28]. A heat map via biclustering was used to describe the variations of codon usage bias among 8 [29]. Citrus species had a closer relationship than other dicotyledon species (Fig. 4).
To prove the species in the same group had the similar GC and GC3 contents, GC distribution from 30 plant genomes was plotted. And they varied greatly in different species and have changed during evolution, which was confirmed by the results (Fig. 4). The original single-celled or multi-celled Chlorophyte plants had very high GC3 contents (0.69 to 0.82), whereas in the monocotyledons, the GC3 content decreased but was still over 0.5, and in Dicotyledons, the GC3 content was approximately 0.4. It is hypothesized that one of the major selective advantages of GC-rich DNA is the ability for more complex gene regulation [20].

Pearson correlation coefficient among citrus species
The similarity among citrus species was calculated quantitatively based on Pearson correlation coefficients, which were used to construct heat maps. The heat map of Pearson correlation coefficients between each species is shown in Fig. 5, which illustrates the correlation among citrus and shows which pairs of species have close relationships.
Citrus medica and Citrus reticulata 'Mangshan' had the highest value of 0.999989. This result was confirmed by the cluster analysis, which showed that these two species were clustered together. Citrus medica and Citrus ichangensis also clustered together, with a Pearson

Conclusion
We identified a multispecies codon usage analysis workflow that revealed the genetic features and correlation of the genus Citrus. In particular, we performed a comprehensive analysis of codons and codon pair usage in 8 citrus species and 22 other plants. Our results showed few differences in codon features among citrus species and, thus, that the genomes of citrus species were conserved. Regarding GC content, the nucleotide content of citrus genes was slightly GC poor and AT rich. As for Pearson correlation coefficient of dinucleotide sequence profile among citrus species, its results can also be confirmed by the cluster analysis. Using this workflow, we compared 8 species of citrus. This method can also be used on other species. However, our results should be considered cautiously, as more data are required. Future work will focus on additional codon usage indices in citrus to determine if citrus is conserved at these levels.
In conclusion, our findings provided insight into the codon usage patterns of citrus species and could be used for the cloning and expression of exogenous genes in citrus and other functionally important plants.
Protein-coding sequences (CDS) of those compared plant species were extracted by Tbtools(http://cjchen.github.io/tbtools/). All CDS without an AUG start codon, not ending with UAA, UAG or UGA stop codons, and having uncertain nucleotides and containing internal stop codons were filtered out, which were regarded as low quality sequences because of invalid format. After filtering, the remaining high quality sequences were used for further analysis. The filtering procedure was performed by python scripts written in-house.

Indices of codon usage
The overall GC content and the GC content at the first, second and third position reflect the strength of directional mutation. RSCU is an index used to study the overall synonymous codon usage variation among genes. Codons with RSCU values over 1.0 were identified at a high frequency and codons with RSCU values below 1.0 showed negative codon usage bias. RSCU was calculated according to the formula described in Sharp and Li [31]. The ENC reflects the degree of codon bias for 20 amino acids across ORFs. The ENC was between 20 and 61. An ENC value close to 20 indicates that only one of the synonymous is preferred, and a value close to 61 shows that each synonymous codon is used equally. The GC content and RSCU were calculated with C++ programs written inhouse, and the ENC was calculated using the codonW1.4.4 (http://codonw.sourceforge.net/).

Overview of the codon usage analysis workflow
Our workflow consists of six parts: data preprocessing, GC content analysis, neutrality plot and ENc plot analysis, high-frequency codons identification, comparison and cluster analysis, and statistical analysis. We examined the correlation of citrus species based on codon usage patterns (Fig. 6).

Analysis of gC content
GC content includes the overall GC content, GC1 (GC content of 1st nucleotide in codon), GC2 (GC content of 2nd nucleotide in codon), GC3 (GC content of 3rd nucleotide in codon) and GC3s (GC content of 3rd synonymous codons). The GC content reveals GC bias and varies greatly between species [32]. An analysis of codon usage pattern can provide a basis for understanding the relevant mechanism of the biased usage of synonymous codons. This analysis also has both practical and theoretical applications for understanding the basics of molecular biology [33].

Neutrality plot and eNc plot analysis
A neutrality plot (GC12-GC3) was used to estimate and characterize the codon usage patterns among three codon positions. GC12 represents the average of GC1 and GC2. A plot regression with a slope of 0 indicates no effect of directional mutation pressure (complete selective constraints), whereas a slope of 1 indicates the same mutation module between GC12 and GC3 and that complete neutrality was the main factor in evolution [11].
The ENc-plot(ENC-GC3s) is a general strategy to determine whether the codon usage of a gene is affected by mutation and selection. The expected ENc values were plotted against the GC3s values and were calculated according to Equation 1, where F represents the frequency of the estimated GC3s. That the actual ENC values lie on or around the standard GC3s curve indicates that the codon bias is determined by a G + C mutation bias only. In other words, the values distributed far below the standard curve shows that other factors such as selection effects are present [34]. 2 (1)

Identification of high-Frequency codons and codon pairs
Those codons with RSCU values over 1.5 or having a relative frequency above 60% of the synonymous codons for the corresponding amino acids were identified as highfrequency codons. Codon pairs with the last codon coding the same amino acid were defined as synonymous codon pairs. High-frequency codon pairs were defined as those codons with RSCPU (relative synonymous codon pair usage) values over 1.5 or when the number of codon pairs included over 60% of the total number of synonymous codon pairs [35][36][37]. The novel equation to compute RSCPU for a pair of codon is as follows: where x i is the number of the occurrences of the i th kind of codon pairs, and n i is the number of synonymous codon pair for the i th type amino acid pair [38]. Identification of high-frequency codons and codon pairs were performed by C++ programs written in-house.

Comparison and cluster analysis
The RSCU of 59 codons (excluding the 3 stop codons and codons with synonymous codons) of 8 citrus species and 22 other plants were clustered using the Mev4.8.1 software (https://sourceforge.net/projects/mev-tm4/) [39]. The hierarchical clustering, Euclidean distance and sample tree parameters were set to cluster with the RSCU. The GC and GC3 variation of 30 different species were analyzed using Microsoft Excel.

Statistical analysis
The distribution characteristics of dinucleotides can be used to study nucleic acids [40]. To further estimate the correlation among citrus species, we extracted the dinucleotide frequency profile vectors. Four kinds of nucleotides make up 16 different dinucleotide feature vectors. Each feature vector was calculated according to equation f xy = MN/(L − 1), where f xy stands for the frequency of each nucleotide pair, M and N stand for the kinds of nucleotides, MN stands for the number of occurrences of the dinucleotides and L represents the length of all sequences. For each sequence, we used a two bit sliding window to obtain the frequency of the vectors. Thus, each nucleic acid was calculated twice, and equation p xy = f xy /(f x f y ) was used to avoid repeated calculations based on the above-mentioned results. Variable p xy represents the frequency profile of the dinucleotides. Variable p x and p y represent the corresponding frequency profile of the nucleic acids [41].
The 16 different kinds of dinucleotides represent the signature of the species. We used the Pearson correlation coefficient to calculate the distance and obtain the similarity between two species. The Pearson correlation coefficient r was defined as follows: where X and Y represent the set of each dinucleotides frequency vectors of the citrus species. N represents the number of the points. Here, N equals to 16.