Analysis of codon usage bias of envelope glycoprotein genes in nuclear polyhedrosis virus (NPV) and its relation to evolution

Background Analysis of codon usage bias is a versatile method used to further our understanding of the genetic and evolutionary paths of species. The codon usage bias of envelope glycoprotein genes in nuclear polyhedrosis virus (NPV) has remained largely unexplored. Hence, the codon usage bias of the NPV envelope glycoprotein was analyzed here to reveal the genetic and evolutionary relationships between different viral species in the baculovirus genus.
Results A total of 9236 codons from 18 different NPV species of the baculovirus genus were used in this analysis. The NPV glycoprotein exhibits weak codon usage bias. Neutrality plot analysis and correlation analysis of effective number of codons (ENC) values indicate that natural selection is the main factor influencing codon usage bias, and that the impact of mutation pressure is relatively small. Cluster analysis shows that these viral species fall into two broad evolutionary groups, even though all 18 species belong to the same baculovirus genus.
Conclusions Many factors can affect codon bias, such as amino acid composition, mutation pressure, natural selection, and gene expression level. Cluster analysis also illustrates that the codon usage bias of the virus envelope glycoprotein can serve as an effective means of evolutionary classification in the baculovirus genus.
Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-3021-7) contains supplementary material, which is available to authorized users.


Background
Codons are not used equally in most organisms. Over an organism's evolutionary history, a preference for particular synonymous codons develops within a species or gene. Codons that are used at higher frequency within a species or gene are referred to as optimal codons. Codon usage bias refers to cases in which some codons are used at higher frequency than other synonymous codons during translation, often as a result of adaptive evolution [1]. Analysis of codon usage bias is thus of vital significance in efforts to improve exogenous gene expression levels within host cells. Codon usage bias is a common phenomenon and has been studied in many species, such as Escherichia coli [2], Arabidopsis thaliana [3], Xanthophyllomyces dendrorhous [4], Taenia saginata [5], Megalobrama amblycephala [6], metazoans [7], and even human beings [8]. Recent studies have shown that the use of particular synonymous codons can affect protein folding and the frequency of folding errors [9,10]. Furthermore, studies have shown that codon usage, through its inherent link to amino acid composition, influences the protein components of cells [11]. A thorough understanding of codon usage bias also plays a central role in accurately predicting the functions of related genes.
Different genes exhibit different codon usage bias within the same genome. Mutation, natural selection, and random drift are the three major factors shaping codon usage bias in a species [12][13][14][15]. Bioinformatic analyses have shown that translational selection is probably the original cause of codon usage bias. Other possible factors affecting codon usage bias among species include gene expression level [16], gene length [8], GC content [17], recombination rate, RNA stability [18], environmental stress [19], population size [20], and evolutionary age of genes [21]. Codon usage bias has a profound influence on genomic evolution [22]. Even within the same genome, codon usage patterns are not necessarily the same within the same gene [23].
The envelope glycoprotein is a major fatty-acid-acylated glycoprotein of Bombyx mori nuclear polyhedrosis virus (BmNPV) [24]. Through its pH-dependent membrane fusion activity, it mediates fusion of the virus with the host cell. The glycoprotein is mainly linked by disulfide bonds into trimers and is located at the end of the rod-shaped baculovirus, forming a typical membrane-grain structure. Research shows that a monoclonal antibody against the glycoprotein can significantly reduce the number of infectious virus particles [25]. The glycoprotein gene plays a key role both in the process by which baculovirus infects cells and in the efficient budding and release of progeny nucleocapsids [26,27]. In addition, silencing the glycoprotein gene in transgenic silkworms increases resistance to BmNPV [28,29]. The glycoprotein is therefore one of the most important envelope proteins in baculovirus. Because the glycoprotein gene product and its homologues are relatively conserved, it is an ideal gene for studying the evolutionary relationships of different baculoviruses. Analysis of its codon usage bias could therefore enable a better understanding of the molecular evolutionary dynamics of NPV.
In addition, the 18 NPV species were analyzed by cluster analysis based on RSCU values. AcNPV, PlxyNPV, RaouMNPV, BmNPV, BmaNPV, ToNPV and MvMNPV occupy similar evolutionary positions, consistent with the Neighbor-joining results (Fig. 2). The remaining species are broadly consistent with the Neighbor-joining results but differ in individual evolutionary branches; for example, AgNPV, CvMNPV and HycuNPV exhibit similar codon usage bias, indicating that these species are evolutionarily related.
Thus, there are some differences between the cluster analysis and the Neighbor-joining analysis; in particular, codon usage patterns differ considerably in Group 2, whereas the differences are less apparent in Group 1.

Glycoprotein codon usage bias of 18 NPVs analysis
We analyzed the glycoprotein genes of 18 NPV species. The GC content of these genes ranges from 42.7 to 54.4 %, with an average of 48.90 %. GC content varies most significantly in the first and third codon positions, with values of 46.39 and 64.44 %, respectively. The ENC of the glycoprotein genes varies from 38.9 to 52.5, with a mean of 47.39. Accordingly, none of the 18 glycoprotein genes exhibits strong codon bias, as all of their ENC values are above 35. These data show that the glycoprotein displays generally random codon usage and lacks strong codon bias (Table 1 and Additional file 1).
Additionally, the relative synonymous codon usage (RSCU) values of the 59 sense codons (excluding Trp, the initiator codon and the terminator codons) support the conclusion that the NPV glycoprotein presents weak codon bias. Nearly half of the glycoprotein codons (24/59) are frequently used, as shown in Table 2, such as GGC (coding for glycine) and UUG (coding for leucine). Furthermore, we compared the RSCU values of the 59 sense codons (Fig. 3). The RSCU values differ somewhat among the glycoprotein genes of the 18 NPVs, but the overall trend is similar. This illustrates that closely related species maintain stable codon usage patterns.
In Fig. 4, Axis 1 and Axis 2 correspond to the two main factors influencing codon usage bias; they account for 36.29 and 21.20 % of the total variation, respectively. The relationship between codon usage bias and amino acid composition was examined by multifactor variable analysis. Axis 1 has a distinct positive correlation with C3s (r = 0.965, p < 0.01), G3s (r = 0.948, p < 0.01), and GC3s (r = 0.996, p < 0.01), and a clear negative correlation with A3s (r = −0.957, p < 0.01) and T3s (r = −0.969, p < 0.01). There is an obvious negative correlation between GC3s and ENC (r = −0.822, p < 0.01), and ENC also shows a significant negative correlation with Axis 1 (Table 3). These parameters are highly correlated, as the absolute values of their correlation coefficients exceed 0.8. These results demonstrate that nucleotide composition indeed affects codon usage bias. The genes are diffusely distributed, which indicates that many factors affect codon usage bias (Fig. 5a). Axis 1 represents the main factor affecting codon usage bias. Codons ending with G/C are distributed closer to Axis 1 than codons ending with A/U (Fig. 5b).
Thus, these results suggest that nucleotide composition (especially G and C content) exerts a certain degree of influence on codon usage bias. Furthermore, the mutational impact of codons ending with G/C on codon usage bias is greater than that of codons ending with A/U.

GC3s affecting codon bias
A standard curve describes the relationship between ENC and GC3s that is expected when codon usage is shaped by mutation pressure alone. If the points, which represent individual genes, fall on or near the standard curve, codon usage bias is interpreted as being mainly determined by mutation pressure. In that case, codon usage bias depends on the base at the third codon position, that is, on the GC3s content of the gene. However, all of the points (in both Group 1 and Group 2) lie beneath the standard curve, indicating that mutation pressure is not the critical factor in the formation of codon preferences (Fig. 6). Thus, the GC3s value of the glycoprotein gene is not the sole factor affecting codon bias formation in the various NPV species. Furthermore, the dispersed distribution of the plotted genes indicates that other factors affect codon usage bias to a certain extent. These factors include natural selection, gene length, and gene expression level.

Natural selection plays an important role in the process of codon bias formation
ENC-plot analysis demonstrated the extent to which mutational pressure affects the formation of codon usage bias. We next sought to determine whether natural selection or mutation pressure plays the greater role in generating codon usage bias. To this end, we carried out a neutrality plot analysis on the GC content of codons. The distribution range of GC3 is very broad, from 48.4 to 77.5 % (Fig. 7). There is a significant correlation between GC1 and GC3 (p < 0.01), which initially seemed to indicate that mutation pressure plays the greater role in shaping codon usage bias. However, the neutrality plot shows that this is not the case. In Fig. 7, the GC3 values are diffusely distributed and all of the regression lines deviate from the diagonal. The slopes of the regression lines were determined to be 0.1063, 0.0685 and 0.0956. A slope equal to one (the diagonal), indicating a perfect correlation between GC12 and GC3, would mean that mutation pressure is the dominant factor in generating bias. Slopes approaching the vertical or horizontal axes would indicate natural selection as dominant. Despite the observed correlation between GC12 and GC3, our slopes of 0.1063, 0.0685 and 0.0956 indicate that the direct influence of mutation pressure on codon usage bias is only 10.63, 6.85 and 9.56 %, respectively. The influence of natural selection on codon usage bias was accordingly calculated to be 89.37, 93.15 and 90.44 %, indicating that natural selection is the dominant factor influencing bias.
Table 3 Correlation coefficients between the position of genes along the first two major axes with index of glycoprotein genes' codon usage and synonymous codon usage bias

Effects of gene length and expression level on codon usage bias
CAI values are useful in predicting levels of gene expression. Silkworm ribosomal genes, which are highly expressed, were used as the reference set in our computation of codon adaptation indices [30]. Correlation analysis shows no significant correlation between CAI and ENC, and no obvious correlation between CAI and either GC3s or overall GC content. This suggests that gene expression level has no detectable effect on codon bias in these genes. Likewise, gene length has no obvious correlation with CAI, ENC or Axis 1. This observation indicates that there is no correlation between the length of the gene and its codon usage bias for the NPV glycoprotein.
The CAI values of the various glycoprotein genes range from 0.765 to 0.812, and the gene lengths range from 1500 to 1593 bp. The variation in CAI values and gene length among the glycoprotein genes is relatively small, as shown in Fig. 8. Taken together with the lack of correlation described above, these results indicate that gene expression level and gene length play only a minor role in shaping the codon bias of this gene. Because the gene lengths of the various viral species are all similar and their CAI values are very close, the glycoprotein gene appears to have been stably expressed during the evolution of NPV.

Discussion
The results of our clustering analysis are similar to those of the Neighbor-joining analysis. We compared the RSCU values of the glycoprotein genes from 18 species, and the results show that they have relatively similar codon usage bias. Across our analyses, the glycoprotein genes show a general codon usage pattern without strong bias, because all of the ENC values are greater than 35. RSCU values are an index for assessing the frequency of synonymous codon usage. RSCU = 1.0 means that a codon is used exactly as often as expected if all codons within its synonymous set were used equally, that is, the codon is not biased. Alternatively, RSCU > 1.0 indicates that a particular codon is used more frequently than expected within its synonymous codon set, and vice versa [31]. Many factors can contribute to the synonymous codon usage bias of the glycoprotein gene of the NPV genus. Nucleotide composition is one of the factors that affect codon usage bias, especially for codons ending with G/C. However, codons ending with A/U are used more frequently in other species with A/T-rich genomes, such as Saccharomyces cerevisiae and Plasmodium falciparum [32,33]. In addition, previous studies have identified mutational pressure and natural selection as two major factors influencing codon usage bias [34]. The ENC-plot is an effective tool for measuring codon usage bias [35]. Our ENC-plot analysis showed that mutational pressure can slightly affect the formation of codon usage bias. However, our neutrality plot analysis indicates that natural selection might play an important role in shaping codon usage bias. This phenomenon also exists in other species, such as Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans, in which natural selection is also highly important in shaping codon usage bias across the complete genome [36]. Previous research has shown that gp64 is a highly conserved gene [37]. It is also one of the homologues of the NPV glycoprotein.
According to the results of this study, the 18 glycoprotein genes are almost identical in length, and their CAI values are maintained at the same level. These results illustrate that the length of the glycoprotein gene and its expression level do not vary randomly; in other words, the glycoprotein gene is one of the conserved genes in NPV.

Conclusions
Codon usage patterns are similar between different NPV species in the same genus. Neighbor-joining analysis and clustering analysis lead to similar conclusions. Multiple factors can affect the synonymous codon usage bias of any organism. From the analyses presented here, we draw the following conclusions: the glycoprotein gene of the NPV genus exhibits weak codon usage bias; and nucleotide composition, mutation pressure, gene length, and gene expression level can all influence synonymous codon usage bias to some degree, with natural selection being the main influencing factor. Although codon usage bias is not a necessary metric for carrying out traditional phylogenetic analysis, our study offers a novel perspective on the molecular and genetic mechanisms of viral evolution. Future advances in the understanding of codon usage evolution should support a more nuanced understanding of viral genetics.

Codon usage bias measurement index
The effective number of codons (ENC) is a measure that quantifies the extent to which the codon usage of a gene departs from equal usage of synonymous codons. It is an excellent indicator of codon usage bias in both genes and genomes. The minimum ENC value is 20, indicating extreme codon usage bias, and the maximum value is 61, indicating equally likely usage of all codons.
Relative synonymous codon usage (RSCU) is a ratio that compares the observed usage frequency of one specific codon with the frequency expected if all synonymous codons for the same amino acid were used equally. If the RSCU value is 1, the codon is used as often as expected, with no bias. Codons with an RSCU value greater than 1 exhibit positive bias (i.e., they are used more frequently than other synonymous codons), whereas codons with an RSCU value less than 1 exhibit negative bias and are used less frequently than other synonymous codons.
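The RSCU definition above can be illustrated with a short Python sketch. This is not the CodonW implementation used in this study; the function name `rscu` and the toy sequence are assumptions made for illustration, and the sketch assumes the standard genetic code. It reports the 59 codons that have at least one synonymous alternative (stop codons, Met and Trp are skipped).

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (assumed here; not stated in the text).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds):
    """Return {codon: RSCU} for every synonymous family observed in `cds`.

    RSCU = observed count * family size / total count of the family,
    so a value of 1.0 means the codon is used exactly as often as expected
    under equal usage of its synonymous codons.
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue                          # stops, Met and Trp carry no synonymous choice
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                          # amino acid absent from this sequence
        for codon in family:
            values[codon] = counts[codon] * len(family) / total
    return values

# Toy sequence (hypothetical): three GGC codons and one GGT codon, all glycine.
toy = rscu("GGCGGCGGCGGT")
print(toy["GGC"], toy["GGT"], toy["GGA"])
```

Within one amino acid family the RSCU values always sum to the family size, which is a convenient sanity check on any implementation.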
The codon adaptation index (CAI) is another effective measure of codon usage bias, in which each codon is referenced to an optimal codon frequency derived from a set of highly expressed genes. CAI values range from 0 to 1. A value of one indicates strong codon bias in which the optimal codon is always used, and vice versa.
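As a rough illustration of how a CAI value is obtained from a reference set, the following Python sketch computes relative adaptiveness weights and the geometric mean over a gene. It is a minimal sketch, not the CodonW routine used here: the function names, the toy reference sequence, and the 0.5 pseudocount for unobserved codons are assumptions, and the standard genetic code is assumed.

```python
import math
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in TCAG order (an assumption of this sketch).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def split_codons(cds):
    """Split a coding sequence into valid codons, dropping a trailing partial codon."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3) if seq[i:i + 3] in CODON_TABLE]

def relative_adaptiveness(reference_genes):
    """w(codon) = count in the reference set / count of the most used synonymous codon."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(split_codons(gene))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)
    weights = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue                      # no synonymous choice to score
        best = max(counts[c] for c in family)
        if best == 0:
            continue                      # family absent from the reference set
        for codon in family:
            # Pseudocount of 0.5 (assumed) keeps unobserved codons from forcing CAI to 0.
            weights[codon] = max(counts[codon], 0.5) / best
    return weights

def cai(cds, weights):
    """Geometric mean of the weights of all scored codons in `cds`."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no codons with a defined weight in this sequence")
    return math.exp(sum(logs) / len(logs))

# Toy reference set (hypothetical): GGC is the preferred glycine codon.
w = relative_adaptiveness(["GGCGGCGGCGGT"])
print(cai("GGCGGC", w), cai("GGCGGT", w))
```

A gene that uses only the preferred codon of each family scores 1; every use of a less preferred codon pulls the geometric mean down.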

Multifactor variable analysis
Correspondence analysis (COA) is a widely used statistical method for analyzing multiple factors and their influence on a particular component. In this study, correspondence analysis was used to analyze the effects of various factors on the formation of synonymous codon usage bias in the genes examined.
Linear regression analysis (LRA) and factor analysis (FA) were used to analyze the relationship between the ENC values and GC3 content of glycoprotein in NPV and the level of correlation between GC12 content and GC3 content. This analysis allowed us to deduce the effects of mutational pressure on codon bias formation.
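The GC3 and GC12 quantities used in these analyses can be computed directly from a coding sequence, as in the Python sketch below. The function name `gc_by_position` and the toy sequence are assumptions for illustration. Note one simplification: the sketch computes GC3 over all third positions, whereas the GC3s index reported by CodonW is restricted to synonymous third positions, so the two values can differ slightly.

```python
def gc_by_position(cds):
    """Return overall GC and position-specific GC content (in percent) of a coding sequence."""
    seq = cds.upper().replace("U", "T")
    if not seq or len(seq) % 3 != 0:
        raise ValueError("coding sequence length must be a non-zero multiple of 3")

    def gc_percent(bases):
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    # seq[0::3], seq[1::3], seq[2::3] are the first, second and third codon positions.
    gc1, gc2, gc3 = (gc_percent(seq[pos::3]) for pos in range(3))
    return {
        "GC": gc_percent(seq),
        "GC1": gc1,
        "GC2": gc2,
        "GC3": gc3,
        "GC12": (gc1 + gc2) / 2.0,   # mean GC content of positions 1 and 2
    }

# Toy sequence (hypothetical): codons ATG, GCG, TTA.
print(gc_by_position("ATGGCGTTA"))
```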
Neighbor joining (NJ) is a bottom-up clustering method for the creation of phylogenetic trees. Usually used for trees based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa to form the tree.
Cluster analysis (CA) is an analytical method that divides data into groups in such a way that elements more similar to each other are grouped together. Several different distance measures can be used in cluster analysis. The Euclidean distance, which is the straight-line distance between two points in the space of the measured variables, was used in our analysis.
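To make the idea concrete, the following Python sketch clusters a few RSCU-like profiles by Euclidean distance. It is only an illustration under stated assumptions: the analysis in this study was run in SPSS, the linkage rule is not specified in the text (average linkage is assumed here), and the profile names and numbers are hypothetical rather than taken from the data set.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(profiles, n_clusters=2):
    """Average-linkage agglomerative clustering (linkage choice is an assumption).

    `profiles` maps a name to its vector (e.g. RSCU values in a fixed codon order).
    Clusters are merged pairwise until `n_clusters` remain.
    """
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        distances = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(distances) / len(distances)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical two-codon RSCU profiles for four unnamed viruses.
toy_profiles = {
    "virus_A": [1.0, 1.0],
    "virus_B": [1.1, 0.9],
    "virus_C": [2.0, 0.1],
    "virus_D": [2.1, 0.0],
}
print(agglomerate(toy_profiles, n_clusters=2))
```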
ENC-plot analysis was used to determine the decisive factors affecting codon usage bias. Each point in the plot represents one gene, positioned by its GC3s value and its ENC value. Points located on the standard curve indicate that mutational pressure determines codon usage bias. Alternatively, points located below the standard curve indicate that factors other than mutational pressure affect codon usage bias.
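The standard curve is not written out in the text; a commonly used form is Wright's expected ENC under GC3s-only constraint, ENC = 2 + s + 29 / (s^2 + (1 − s)^2), where s is GC3s expressed as a fraction. Assuming that this is the curve meant here, the Python sketch below evaluates it and reports how far an observed gene lies below it. The function names and the example point are hypothetical.

```python
def expected_enc(gc3s):
    """Expected ENC when codon usage is constrained only by GC3s (fraction between 0 and 1).

    Uses Wright's formula, assumed here to be the 'standard curve' of the ENC-plot.
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)

def distance_below_curve(gc3s, observed_enc):
    """Positive values mean the gene lies below the standard curve."""
    return expected_enc(gc3s) - observed_enc

# Hypothetical gene with GC3s = 60 % and an observed ENC of 47.0.
print(expected_enc(0.60), distance_below_curve(0.60, 47.0))
```

The curve peaks at 60.5 when GC3s is 50 % and falls toward about 31 at the extremes, so a gene well below it is more biased than its GC3s alone would predict.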
A neutrality plot analysis was used to determine the extent to which mutational pressures affect codon usage bias as compared to natural selection. Synonymous codon mutations often occur in the third position of the codon, though at times mutations may also occur in the first and second positions, leading to non-synonymous codons. Using GC3 as a horizontal coordinate and GC12 as a vertical coordinate, the GC3 and GC12 contents of glycoprotein genes were plotted and a regression line was calculated to determine the extent to which mutational pressures played a role in the formation of codon usage bias as opposed to natural selection. Regression lines that fall near the diagonal (slope = 1) indicate weak external selection pressures on the generation of codon usage bias, whereas regression curves deviating from the diagonal indicate a heavy influence of natural selection on codon usage bias.
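The regression step of the neutrality plot can be sketched in Python as an ordinary least-squares fit of GC12 on GC3, followed by the conversion of the slope into the percentages quoted in the Results (for example, a slope of 0.1063 corresponds to 10.63 % mutation pressure and 89.37 % natural selection). The function names and the toy data points are assumptions; the regressions in this study were run in statistical software, not with this code.

```python
def neutrality_regression(gc3, gc12):
    """Ordinary least-squares fit of GC12 (y) on GC3 (x); returns (slope, intercept)."""
    if len(gc3) != len(gc12) or len(gc3) < 2:
        raise ValueError("need two equal-length series with at least two points")
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    if sxx == 0:
        raise ValueError("GC3 values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pressure_split(slope):
    """Read the slope as the share of mutation pressure; the remainder is attributed to selection."""
    mutation = 100.0 * slope
    return mutation, 100.0 - mutation

# Toy data (hypothetical): GC12 rises by 0.1 for each unit of GC3.
slope, intercept = neutrality_regression([50.0, 60.0, 70.0], [45.0, 46.0, 47.0])
print(slope, intercept)
print(pressure_split(0.1063))   # slope reported in the Results
```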

Software
All of the codon usage bias indices above were calculated from the data set using the program CodonW 1.4.4 (http://codonw.sourceforge.net/). Clustering analysis and correlation analyses among the codon usage indices were carried out using SPSS Version 22.0, MEGA 6.0, ClustalX 2.0 and GraphPad Prism 5.0.

Additional file
Additional file 1: Table S1. All the indices of total genes. (XLS 27 kb)