Analysis of transcriptome data reveals multifactor constraint on codon usage in Taenia multiceps

Huang, Xing; Xu, Jing; Chen, Lin; Wang, Yu; Gu, Xiaobin; Peng, Xuerong; Yang, Guangyou

doi:10.1186/s12864-017-3704-8

Research article
Open access
Published: 20 April 2017

Analysis of transcriptome data reveals multifactor constraint on codon usage in Taenia multiceps

Xing Huang^1,2,
Jing Xu¹,
Lin Chen³,
Yu Wang¹,
Xiaobin Gu¹,
Xuerong Peng⁴ &
…
Guangyou Yang¹

BMC Genomics volume 18, Article number: 308 (2017) Cite this article

2443 Accesses
34 Citations
2 Altmetric
Metrics details

Abstract

Background

Codon usage bias (CUB) is an important evolutionary feature in genomes that has been widely observed in many organisms. However, the synonymous codon usage pattern in the genome of T. multiceps remains to be clarified. In this study, we analyzed the codon usage of T. multiceps based on the transcriptome data to reveal the constraint factors and to gain an improved understanding of the mechanisms that shape synonymous CUB.

Results

Analysis of a total of 8,620 annotated mRNA sequences from T. multiceps indicated only a weak codon bias, with mean GC and GC3 content values of 49.29% and 51.43%, respectively. Our analysis indicated that nucleotide composition, mutational pressure, natural selection, gene expression level, amino acids with grand average of hydropathicity (GRAVY) and aromaticity (Aromo) and the effective selection of amino-acids all contributed to the codon usage in T. multiceps. Among these factors, natural selection was implicated as the major factor affecting the codon usage variation in T. multiceps. The codon usage of ribosome genes was affected mainly by mutations, while the essential genes were affected mainly by selection. In addition, 21codons were identified as “optimal codons”. Overall, the optimal codons were GC-rich (GC:AU, 41:22), and ended with G or C (except CGU). Furthermore, different degrees of variation in codon usage were found between T. multiceps and Escherichia coli, yeast, Homo sapiens. However, little difference was found between T. multiceps and Taenia pisiformis.

Conclusions

In this study, the codon usage pattern of T. multiceps was analyzed systematically and factors affected CUB were also identified. This is the first study of codon biology in T. multiceps. Understanding the codon usage pattern in T. multiceps can be helpful for the discovery of new genes, molecular genetic engineering and evolutionary studies.

Background

The multiple codons that encode the same amino acid are defined as synonymous codons. The non-normal distribution of synonymous codon usage within and between genomes is termed codon usage bias (CUB) [1, 2]. Among the various factors that are known to dictate CUB in a variety of organisms, compositional constraints and translational selection are considered to be the main influences [3].

Studies of synonymous codon usage contribute to the understanding of the mechanisms of biased usage of synonymous codons [4], selecting suitable host expression systems [5], designing degenerate primers [6], predicting genes based on genomic sequences [7] and functional protein classification [8]. Synonymous CUB has been characterized in a number of organisms. However, the transcriptomes of only Taenia pisiformis [9] and Taenia saginata [10] have been reported from the Taeniidae family.

Taenia multiceps (T. multiceps) is a parasite found in nearly all regions of the world and causes coenurosis [11], which is not only associated with significant economic losses to the livestock industry, but also represents a threaten to human health [11–15].

In the present study, we analyzed the codon usage profile of T. multiceps from annotations of the transcriptome using the CodonW 1.4.2 program and multivariate statistical analysis. Knowledge of the codon usage pattern of T. multiceps is important in elucidating the mechanisms underlying synonymous CUB and also for improved T. multiceps genetic vaccine production through informed selection of the most suitable expression systems.

Methods

Sequence acquisition

A total of 20,896 annotated coding sequences (CDSs) were obtained from the adult T. multiceps transcriptome database (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA80935/); all genes excluded gaps [16]. The codon usage of CDS from nuclear genome were analyzed, while all mitochondrial genes were excluded. Additionally, only genes greater than 450 base pairs (bp) were included to enhance the sequence quality in further analysis [9]. A total of 8,620 CDSs, including ribosomal genes (15 CDSs) and essential genes (15 CDSs), remained in the final analysis based on the CDS annotation information and the Database of Essential Genes (DEG) [17–21]. Some partial sequences are still referred to as “genes”.

Indices of codon usage

The following codon indices were determined: relative synonymous codon usage (RSCU) [3], effective number of codons (ENc) [22, 23], codon adaptation index (CAI), and GC-content at the first, second and third codon positions (GC1, GC2 and GC3), frequency of either a G or C at the third codon position of synonymous codons (GC3s), and the average of GC1 and GC2 (GC12) [22].

RSCU is the ratio of the observed and expected codon frequencies under a uniform synonymous codon usage [3], with codon bias diminishing as this value approaches 1.0, while RSCU values exceeding 1.0 indicate higher than expected codon usage [3].

ENc indicates the magnitude of codon bias for individual genes. Over a range of values from 20 to 61[23], lower values indicate greater codon bias. Generally speaking, ENc values lower than 36 indicate strong codon bias [23, 24].

CAI values indicate the extent of bias toward codons in highly expressed genes. Over a range of values between 0 and 1.0, higher CAI values indicate higher expression and greater CUB [22, 25, 26]. The set of sequences used to calculate CAI values in this study were the genes coding for 15 ribosomal proteins in T. multiceps [23], so that it can provide an indication of gene expression level under the assumption that translational selection can optimize gene sequences according to their expression levels.

All the indices of the total number of genes analyzed are shown in Additional file 1.

Principal Component Analysis (PCA)

Principal component analysis (PCA) have often been used to identify major trends of variation in synonymous codon usage among genes [27, 28]. In this paper, data were normalized in the manner developed by Sharp and Li [22] to define the relative adaptiveness of each codon. And then PCA based on the relative adaptiveness was applied to identify major trends of intragenomic variation in synonymous codon usage among genes [28]. In addition, we analyzed the distribution of PC scores for constitutively highly expressed genes (encoding ribosomal proteins) [28]. In each PC, the score for the gth gene (y _g) was normalized by the mean (m) and the standard deviation (S.D.) of scores for all genes, expressed as z _g = (y _g - m)/S.D.. If the mean absolute z g score for the highly expressed genes was greater than 5.17 (an interval in which theoretically only 1.5% of all genes are included), then gene expression level (Expression) was identified as the main trend of variation in PC scores among genes.

ENc-plot

The ENc-plot of ENc values plotted against GC3s values is used to analyze the influence of base composition on the codon usage in a genome [29]. A standard curve is generated to show the functional relationship between ENc and GC3 values under mutation pressure rather than selection pressure. In genes where codon choice is constrained only by a G + C mutation bias, predicted ENc values will lie on or around the GC3 curve. However, the presence of other factors, such as selection effects, causes the values to deviate considerably below the expected GC3 curve.

PR2 bias plot

Parity rule 2 (PR2) plot analysis, which was also conducted to investigate CUB, is used to the impact of mutation and selection on codon usage [30]. This analysis is based on a plot of AT-bias [A3/(A3 + T3)] and GC-bias [G3/(G3 + C3)] at the third codon position of the four-codon amino acids in entire genes. The four-codon amino acids are alanine, arginine (CGA, CGT, CGG, CGC), glycine, leucine (CTA, CTT, CTG, CTC), proline, serine (TCA, TCT, TCG, TCC), threonine, and valine [31].

Neutrality plots

Neutrality plots (GC12-GC3) [26] were used to evaluate the relationships among the three positions in T. multiceps codons. Following linear regression analysis, a slope of 0 indicates an absence of directional mutation pressure (complete selective constraints), while a slope of 1 indicates complete neutrality.

Determination of optimal codons

Based on axis 1 ordination, the top and bottom 5% of genes were regarded as the high and low datasets, respectively. Codon usage in the two data sets was compared using chi square tests, with the sequential Bonferroni correction to assess significance [32]. Optimal codons were defined as those used at significantly higher frequencies (P < 0.01) in highly expressed genes compared with the frequencies in genes expressed at low levels [33].

Software

The following programs were used in this study: Codon W (Ver.1.4.2) (http://sourceforge.net/projects/codonw/), CHIPS (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/chips.html), and CUSP (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/cusp.html). These programs were used to calculate CUB indices, such as GC, GC3s (G + C content at the third position of codons), and silent base compositions (A3s, T3s, C3s, and G3s, which indicate the frequency of codons with A, U, C, or G, respectively, at the synonymous third position). GRAVY, Aromo, RSCU and ENc values were also calculated and COA was performed.

Results

Codon composition analysis

As shown in Table 1 and Figure 1, the GC-content of the T. multiceps genes ranged from 36.5% to 62.1%. The GC-content of the total number of genes included in the analysis (8,620) were distributed mainly between 40% and 60%, with a mean value of 49.27%, indicating that a slight AT-rich bias in the genome. In addition, the average GC-content in the third codon position (GC3 = 51.43%, Table 1) was slightly higher than that among the total number (8,620) genes analyzed (49.27%).

Table 1 Mean values and standard deviation of GC, GC1, GC2, GC12, GC3, GC3s, GRVAY, Aromo, ENC and CAI values for reconstructed genes in T.multiceps

Full size table

The ENc values among all 8,620 genes varied from 30.5 to 61, with an average of 56.68 (Table 1), and only two of these genes showed a high codon bias (ENc < 35). These results suggested that the existence of random codon usage in T. multiceps, with no strong CUB.

Preferential codon usage

A shown in Table 2, frequent use of 32 of the 59 sense codons, including GCU, GCC and CGU, was observed. Furthermore, more than half (18/32) of the frequently used codons ended with G or C.

Table 2 Codon usage in T. multiceps genes

Full size table

Principal Component Analysis (PCA)

Principal components (PCs) with variances greater than the maximal variance of the original variables were selected as the significant axes [27]. PCA based on the relative adaptiveness showed that the first principal component (PC1) explained 8.95% of the total variation, while the other three PCs accounted for 3.40%, 3.18% and 2.84% of the data (Fig. 2). Moreover, multivariable correlation analysis was performed to gain a better understanding of the relationship between relative codon bias and nucleotide composition (Table 3). As shown in table 3, there was a clear negative correlation between PC1 and GC, GC3 and GC3s (r = −0.693, −0.859and −0.865, respectively, P < 0.01), while PC1correlated positively with A3s or T3s (r = 0.643 and 0.304; P < 0.01). In addition, the ENc value correlated negatively with GC, GC1, GC2, GC3, G3s, C3s and GC3s (r = −0.217, −0.125, −0.43, −0.243, −0.063, −0.246 and −0.249, respectively; P < 0.01). These results suggested the ENc value decreased as the content of GC or GC3s increased, with a corresponding increase in the strength of codon bias. Furthermore, ENc values showed a significant positive correlation with the first and the second principal component (PC1 and PC2) (r = 0.230 and 0.204; P < 0.01). However, PC1explained a larger proportion of the variation at 8.95% (Fig. 2), indicating that the first axis is the major contributor to codon bias although other factors also have a strong influence on this parameter.

Table 3 Correlation coefficients between the positions of genes along the PC1, PC2 and the index of codon usage and synonymous codon usage bias among the total number of genes analyzed

Full size table

The GC-content of each gene was then investigated in terms of the codon usage preference. Following classification according to GC-content (GC < 45%, 45% ≤ GC < 60%, and GC ≥ 60%), all the genes were then marked along the first two PCs (Fig. 3A). The genes with GC <45% and GC ≥ 60% were distributed mainly to the right and left of PC1, respectively, while the genes with GC contents ranging from 45% to 60% were clustered in the center of the plot.

To characterize the codon usage patterns of the different kinds of gene, hydrophobic genes with scores >0.3, aromatic genes with scores ≥0.15, essential genes and ribosomal genes were selected from the 8,620 genes included in this study. The distribution of these genes was marked along PC1 and PC2 based on the principal component analysis (Fig. 3B). A majority of the ribosomal genes were clustered to the right of PC1, while essential genes to the left of PC1. Hydrophobic genes with scores >0.3, aromatic genes with scores ≥0.15 and other genes were located mainly in the central region of PC1.

These results suggested that compositional constraints are the main factor accounting for the CUB, although other factors are also strong influences.

Relationship between ENc and GC3s

The features of codon usage among genes can be visualized by plotting ENc against GC3s [9]. As shown in Figure 4, a majority of T. multiceps genes were located under the curve of expected ENc values, while only a small number were distributed along or above. This implied that conditional mutations exert only weak influences on CUB of T. multiceps, although a major role may be played by other factors, such as natural selection. Most ribosomal genes were scattered along the expected ENc curve, while all essential genes were located at a marked distance below. These results implied that the CUB of ribosomal genes was affected mainly by mutations, while that of essential genes was influenced mainly by selection.

To a gain a more intuitive understanding of the difference between the observed and expected ENc values, the frequency distribution of (ENc_exp-ENc_obs)/ENc_exp, (ENc_exp-ENc_obs)/ENc_exp was plotted (Fig. 5). Most genes had (ENc_exp-ENc_obs)/ENc_exp values ranging from −0.05 to 0.2, with a peak in the distribution of values between 0–0.05. The significant differences observed between the observed and expected ENc values indicated that mutation exerts only a weak effect in shaping CUB.

PR2 bias plot analyses

PR2 plot analysis was conducted assess to the impact of mutation and selection on CUB (Fig. 6). In this analysis, most genes were distributed in the lower left quadrant of the PR2-plot (Fig. 6), implying that C and T (pyrimidines) were used more frequently than G and A (purines) in T. multiceps codons. These data provide further evidence that factors other than mutational pressure, such as natural selection, also contribute to CUB.

Neutrality plot analysis

In the neutrality plot of all the genes generated to evaluate the relationships among the three positions in T. multiceps codons (Fig. 7), most did not lie on or along the diagonal line. In addition, the ranges of GC12 and GC3 were narrow (0.3464–0.6818 and 0.2893–0.7944, respectively). These data suggested that T. multiceps codon usage is affected by natural selection. Moreover, linear regression of the entire coding sequence data yielded a slope of 0.1104, revealing that directional mutation pressure accounts for only 11.04% of the effect, while other factors (e.g. natural selection) account for 88.96% of the influence [34, 35]. Additionally, a significant positive correlation was identified between GC12 and GC3 (r = 0.187, P < 0.01), suggesting the influence of directional mutation pressure at all codon positions and that codon usage was affected by mutation.

Gene expression level and synonymous CUB

To explore the relationship between gene expression level and codon preference, we calculated the coefficients of the correlations between the codon adaptation index (CAI) and other gene features, including nucleotide composition and ENc values (Table 3). As shown in Table 3, CAI showed significant negative correlations with ENc value, PC1, PC2, GC2, A3s, and GRAVY value (r =−0.321,−0.358,−0.333,−0.045,−0.377, and−0.075, respectively; P < 0.01). However, CAI showed significant positive correlations with gene length and the other nucleotide composition indices (GC, GC1, GC3, GC3s, G3s, C3s and Aromo) (r = 0.074, 0.210, 0.124, 0.302, 0.306, 0.070 and 0.415, respectively; P < 0.01). These results indicated that the codon usage in T. multiceps was affected by gene expression levels. To be more specific, genes with higher expression levels had a greater degree of CUB and GC-rich content. Furthermore, these genes exhibited preference for codons with C or G at the synonymous position. Based on these results, we deduced that both nucleotide composition and gene expression levels play important roles in T. multiceps codon usage.

The relationship between amino-acid composition index and CUB in T. multiceps was investigated by Spearman’s rank correlation analysis to determine the correlation coefficients of the positions of the genes along the first two PCs with the corresponding amino-acid usage indices (Table 4).

Table 4 Correlation coefficients between the positions of genes along the PC1, PC2 and index of amino acid usage among the total number of genes analyzed

Full size table

In the principal component analysis, the first two PCs generated accounted for 74.35% of the variation in amino-acid usage. PC1 explained 68.43% of the variation in amino-acid usage (Fig. 8), with the genes showing significant negative correlations with GRAVY score and Aromo value score (r =−0.688 and−0.454, respectively; P < 0.01), while a significant positive correlation was identified with CAI (r = 0.081, P < 0.01). PC2 explained 5.92% of the variation, and showed significant positive correlations with GRAVY score and Aromo value (r = 0.509 and 0.637, respectively; P < 0.01). In contrast to the results for E. coli [36] and B. mori [35], CAI was found to be the most important factor influencing the amino-acid usage of T. multiceps, with aromaticity having the second most important influence, followed by hydrophobicity.

Effect of the hydrophobicity and aromaticity of encoded protein on codon bias

We also investigated the influence of other factors on codon usage in T. multiceps genes. A correlation analysis was performed to evaluate whether GRAVY and Aromo values were related to nucleotide composition and ENc values. As shown in Table 3, GRAVY values of the encoded proteins showed significant negative correlations with GC1, A3s, T3s, G3s and ENc values (r =−0.209,−0.319,−0.110,−0.278, and−0.039, respectively; P < 0.01), while this value showed significant positive correlations with GC3, GC3s, and C3s (r = 0.140, 0.121, 0.242, respectively; P < 0.01). In addition, Aromo values of the encoded proteins showed significant negative correlations with GC, GC1, GC2, A3s and G3s (r =−0.265,−0. 486,−0. 225,−0. 076, and−0. 185, respectively; P < 0.01). This indicated that codon usage variations were associated with both the degree of hydrophobicity and aromatic among the amino acids.

Gene length and synonymous CUB

According to Table 3, gene length showed no significant correlation with ENc values (r = 0.017, P > 0.05), but showed significant positive correlations with GC1, GC3, GC3s, C3s, G3s, and CAI (r = 0.090, 0.050, 0.055, 0.026, 0.059 and 0.074, respectively; P < 0.01). These results suggested an absence of any significant correlation between gene length and the CUB, although relatively higher expression of the longer genes was observed.

Optimal translational codons

Twenty-one codons, including UUC, CUC, CUG and AUC, were identified as optimal translational codons based on the average RSCU values of the high and low datasets. Of these, the AU/GC ratio was 41:22, and the optimal codons (except CGU) all ended with G or C (Table 5).

Table 5 Translational optimal codons of T. multiceps

Full size table

Closely-related species always have similar patterns of codon usage, while distantly related organisms, such as Escherichia coli, Saccharomyces cerevisiae and Drosophila melanogaster possess quite different codon usage patterns [37]. It is generally acknowledged that a ratio of codon usage frequency between two species that is greater than 2, or less than 0.5 indicates distinct CUB [38], while a ratio between these two values indicates a close codon usage preference. The ratios of codon usage frequency of T. multiceps compared with the four model organisms E. coli, S. cerevisiae, Homo sapiens and T. pisiformis, showed that number of codons with ratios greater than 2 or less than 0.5 was 10, 9, 4 and 0, respectively. This suggested that a relatively greater variation in codon preferences between T. multiceps and E. coli, S. cerevisiae, or Homo sapiens than that between T. multiceps and T. pisiformis, indicating that closer relationships between species are associated with less variation in codon usage (Fig. 9).

Discussion

Nucleotide composition is considered to be one of the most important factors that shapes codon usage among genes and genomes, with GC-content reflecting the overall trend of codon mutation [31]. The average GC-content of the total of 8,620 T. multiceps genes investigated was 49.27% (slightly below the average AT content), while the average GC3 content was slightly higher at 51.43%. These results are consistent with the GC and AT contents of Giardia lamblia [39] and T. saginata [10].

The average effective number of codons (ENc) among the T. multiceps genes was 56.68, with only two genes showing a strong CUB (ENc < 36). This indicates random codon usage in T. multiceps, with no strong codon bias, which is in accordance with the pattern in B. mori [35]. Furthermore, more than half of the high frequency codons ended with G/C (18/32); this phenomenon has been found in many other GC-rich organisms, including bacteria, archaea, fungi, wheat and rice [40–43].

CUB is a complex evolutionary phenomenon known to exist in a wide variety of organisms, including prokaryotes, as well as unicellular and multicellular eukaryotes [10]. Numerous hypotheses have been proposed to explain this phenomenon including the neutral theory [44] and the selection-mutation-drift balance model [3]. The number of factors reported to affect CUB is increasing, with gene length [45], GC-content [46, 47], recombination rate [46, 48–50], and gene expression level [45, 48, 51] shown to exert influences. Other studies have shown that RNA and protein structure [29, 52–54], intron length [55], population size [56], evolutionary age of the genes [57], and environmental stress [58], in addition to the hydrophobicity and the aromaticity of the encoded proteins [59, 60] are influencing factors. In this study, various factors such as gene compositional constraints, mutation pressure, gene expression level and, in particular natural selection, were all found to contribute to shaping the codon usage of T. multiceps. Other factors, such as hydrophobicity and aromaticity of the encoded proteins were implicated in generating the CUB of T. multiceps, while our analysis indicated that amino-acid selection also affects translational efficiency of T. multiceps.

Base changes in first and second positions of the codon lead to changes in the encoded amino-acid sequence, while the third codon position rarely induces such sequence variation. It is generally acknowledged that the third codon position is subject to lower selection pressure compared with that of the first and second codon positions. Thus, ENc-GC3s correlation analysis, PR2 bias plot analyses and neutrality plot analysis based on GC3 or GC3s are vitally important for elucidation of the CUB patterns in many organisms.

ENc-GC3s correlation analysis showed that mutation plays a minor role in shaping CUB in T. multiceps, while other factors, such as natural selection, exert significant effects on CUB in this species. Additionally, correlation analysis indicated that the CUB of ribosomal genes was shaped mainly by mutations, while essential genes were affected mainly by natural selection. Further evidence in support of this conclusion was provides by the PR2 bias plot analyses, which also indicated that selection is the major factor that shapes CUB in T. multiceps. ENc plots provide a method of quantifying the CUB of synonymous codons; however, this analysis alone is insufficient for determining the exact contributions of natural selection and mutational pressure to CUB within a species [35, 61]. In this study, we generated a neutrality plot to provide more precise information on this issue. According to the neutrality plot, directional mutation pressure accounts for only 11.04% of the effect, while other factors, such as natural selection, account for 88.96% [34, 35]. Therefore, natural selection was thought to be the major factor affecting the codon usage variation in T. multiceps. These results are similar to those obtained in investigations of B. mori [35].

Natural selection can enhance efficiency of transcription/translation by preferential usage of alternative synonymous codons. The study of Drosophila and Caenorhabditis revealed that significant codon usage bias was existed in highly expressed genes, and this is due to the increased effectiveness and accuracy during translation by preferential usage of optimal synonymous codons [45, 62]. Since synonymous mutations do not change the final protein product, selection for optimal codons is thought to be fairly weak [63]. This explains the possible relation between natural selection and the overall low levels of CUB in T. multiceps.

Previous studies have revealed that CUB in mammals is not correlated with the gene expression levels. However, in Arabidopsis thaliana [64], Oryza sativa [65], C. elegans [45], B. mori [35] and T. saginata [10], genes expressed at relatively high levels exhibited a greater degree of CUB.Various analyses can be used to assess gene expression levels, including EST (expressed sequence tag) counting [66], CAI values [10, 45] and ENc values [67]. In this study, calculation of CAI values was adopted to evaluate the levels of expression of T. multiceps genes. CAI and ENC values showed a significant negative correlation with PC1, suggesting gene expression levels influence CUB in T. multiceps, with stronger CUB in highly expressed genes.

For various organisms, such as Populus tremula [68], Caenorhabditis elegans [45], Drosophila melanogaster [47], Arabidopsis thaliana [45] Silene latifolia [69] and T. saginata [10], significant negative correlations were found between gene length and CUB. To account for this phenomenon, Moriyama and Powell proposed that selection constraints tend to reduce the length of highly expressed genes to generate shorter proteins that perform functions similar to those of longer proteins; thus reducing the energy expenditure required to generate a protein with a specific function [70]. In T. multiceps, however, gene length was found to be irrelevant in shaping CUB, although it was positively correlated with the gene expression level. This finding is inconsistent with that obtained in studies of T. saginata [10] and further investigations are required to explore the mechanisms of this phenomenon.

Identification of optimal codons could provide valuable information for use in molecular genetics studies of evolutionary and rational rearrangement (transformation) of codon usage [71–73]. Under normal circumstances, the optimal codons tend to reflect the GC and AT content of the genomes [43, 74], such as those of bacteria, archaebacteria and fungi. In the present study, the GC-content of codons in the T. multiceps transcriptome was lower than the AT content (GC:AT, 0.97:1), although 21 optimal codons found to be GC-rich (AU:GC, 41:22), with most ending in G/C. The same phenomenon has been reported in other organisms, such as Populus tremula (average GC-content, 45%) [68], Drosophila (average GC-content, 35%) [75], T. pisiformis (average GC-content, 49.48%), and T. saginata (average GC-content, 43.61%), with most favored codons being GC-rich or ending with G and/or C. In Triticum aestivum [76], Hordeumvulgare [61], Oryza sativa [65] and Zea mays [77], the average GC-content is 55.6%, 59.3%, 56.8% and 60% respectively, with optimal codons being AT-rich or ending in G or C.

Correspondence analysis (COA) is widely used to elucidate the variation in synonymous codon usage among genes. However, COA based on RSCU can be affected by biases such as amino acid biases [78]. Principal Component Analysis (PCA) using relative adaptiveness [28] or within-block correspondence analysis [79] can avoid the biases. Thus in this paper, PCA using relative adaptiveness was adopted to perform multivariate analysis other than correspondence analysis.

Conclusions

Our analysis of the codon usage pattern of T. multiceps indicates that natural selection is the major factor influencing the codon usage variation in this species, while other factors such as nucleotide composition, mutational pressure, and gene expression level, also contribute to shaping the CUB. Furthermore, we identified 21 optimal codons, all of which ended in G/C.

In summary, our analysis further elucidates the codon usage pattern in T. multiceps, and provides the basis of further investigations for the identification of novel genes, as well as molecular genetic engineering and evolutionary studies in this species.

Abbreviations

Aromo:: aromaticity
CAI:: the codon adaptation index
CDSs:: coding sequences
CUB:: codon usage bias
DEG:: essential genes
ENc:: the effective number of codons
GC1:: the GC-content at the first codon positions
GC12:: the average of GC1 and GC2
GC2:: the GC-content at the second codon positions
GC3:: the GC-content at the third codon positions
GC3s:: the frequency of either a guanine or cytosine at the third codon position of synonymous codons
GRAVY:: grand average of hydropathicity
PC1:: the first principal component
PC2:: the second principal component
PCA:: principal component analysis
PCs:: Principal Components.
PR2:: parity rule 2
RSCU:: the relative synonymous codon usage

References

Grantham R, Gautier C, Gouy M, Mercier R, Pave A. Codon catalog usage and the genome hypothesis. Nucleic Acids Res. 1980;8(1):r49–62.
Article CAS PubMed PubMed Central Google Scholar
Lloyd AT, Sharp PM. Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res. 1992;20(20):5289–95.
Article CAS PubMed PubMed Central Google Scholar
Sharp PM, Li WH. An evolutionary perspective on synonymous codon usage in unicellular organisms. J Mol Evol. 1986;24(1–2):28–38.
Article CAS PubMed Google Scholar
Powell JR, Moriyama EN. Evolution of codon usage bias in Drosophila. Proc Natl Acad Sci U S A. 1997;94(15):7784–90.
Article CAS PubMed PubMed Central Google Scholar
Zheng Y, Zhao WM, Wang H, Zhou YB, Luan Y, Qi M, et al. Codon usage bias in Chlamydia trachomatis and the effect of codon modification in the MOMP gene on immune responses to vaccination. Biochem Cell Biol. 2007;85(2):218–26.
Article CAS PubMed Google Scholar
Zhou T, Gu W, Ma J, Sun X, Lu Z. Analysis of synonymous codon usage in H5N1 virus and other influenza A viruses. Biosystems. 2005;81(1):77–86.
Article CAS PubMed Google Scholar
Salamov AA, Solovyev VV. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 2000;10(4):516–22.
Article CAS PubMed PubMed Central Google Scholar
Lin K, Kuang Y, Joseph JS, Kolatkar PR. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics. Nucleic Acids Res. 2002;30(11):2599–607.
Article CAS PubMed PubMed Central Google Scholar
Chen L, Liu T, Yang D, Nong X, Xie Y, Fu Y, et al. Analysis of codon usage patterns in Taenia pisiformis through annotated transcriptome data. Biochem Biophys Res Commun. 2013;430(4):1344–8.
Article CAS PubMed Google Scholar
Yang X, Luo X, Cai X. Analysis of codon usage pattern in Taenia saginata based on a transcriptome dataset. Parasit Vectors. 2014;7:527.
Article PubMed PubMed Central Google Scholar
Gauci C, Vural G, Oncel T, Varcasia A, Damian V, Kyngdon CT, et al. Vaccination with recombinant oncosphere antigens reduces the susceptibility of sheep to infection with Taenia multiceps. Int J Parasitol. 2008;38(8–9):1041–50.
Article CAS PubMed PubMed Central Google Scholar
Ibechukwu BI, Onwukeme KE. Intraocular coenurosis: a case report. Br J Ophthalmol. 1991;75(7):430–1.
Article CAS PubMed PubMed Central Google Scholar
El-On J, Shelef I, Cagnano E, Benifla M. Taenia multiceps: a rare human cestode infection in Israel. Vet Ital. 2008;44(4):621–31.
PubMed Google Scholar
Christodoulopoulos G. Two rare clinical manifestations of coenurosis in sheep. Vet Parasitol. 2007;143(3–4):368–70.
Article CAS PubMed Google Scholar
Edwards GT, Herbert IV. Observations on the course of Taenia multiceps infections in sheep: clinical signs and post-mortem findings. Br Vet J. 1982;138(6):489–500.
CAS PubMed Google Scholar
Wu X, Fu Y, Yang D, Zhang R, Zheng W, Nie H, et al. Detailed transcriptome description of the neglected cestode Taenia multiceps. PLoS One. 2012;7(9):e45830.
Article CAS PubMed PubMed Central Google Scholar
Gao F, Luo H, Zhang CT, Zhang R. Gene essentiality analysis based on DEG 10, an updated database of essential genes. Methods Mol Biol. 2015;1279:219–33.
Article CAS PubMed Google Scholar
Luo H, Lin Y, Gao F, Zhang CT, Zhang R. DEG 10, an update of the database of essential genes that includes both protein-coding genes and noncoding genomic elements. Nucleic Acids Res. 2014;42(Database issue):D574–80.
Article CAS PubMed Google Scholar
Zhang R, Lin Y. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 2009;37(Database issue):D455–8.
Article CAS PubMed Google Scholar
Zhang CT, Zhang R. Gene essentiality analysis based on DEG, a database of essential genes. Methods Mol Biol. 2008;416:391–400.
Article CAS PubMed Google Scholar
Zhang R, Ou HY, Zhang CT. DEG: a database of essential genes. Nucleic Acids Res. 2004;32(Database issue):D271–2.
Article CAS PubMed PubMed Central Google Scholar
Sharp PM, Li WH. The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987;15(3):1281–95.
Article CAS PubMed PubMed Central Google Scholar
Wright F. The ‘effective number of codons’ used in a gene. Gene. 1990;87(1):23–9.
Article CAS PubMed Google Scholar
Jiang Y, Deng F, Wang H, Hu Z. An extensive analysis on the global codon usage pattern of baculoviruses. Arch Virol. 2008;153(12):2273–82.
Article CAS PubMed Google Scholar
Lee S, Weon S, Kang C. Relative codon adaptation index, a sensitive measure of codon usage bias. Evol Bioinform Online. 2010;6:47–55.
CAS PubMed PubMed Central Google Scholar
Carbone A, Zinovyev A, Kepes F. Codon adaptation index as a measure of dominating codon bias. Bioinformatics. 2003;19(16):2005–15.
Article CAS PubMed Google Scholar
Kanaya S, Kudo Y, Nakamura Y, Ikemura T. Detection of genes in Escherichia coli sequences determined by genome projects and prediction of protein production levels, based on multivariate diversity in codon usage. Comput Appl Biosci. 1996;12(3):213–25.
CAS PubMed Google Scholar
Suzuki H, Saito R, Tomita M. A problem in multivariate analysis of codon usage data and a possible solution. FEBS Lett. 2005;579(28):6499–504.
Article CAS PubMed Google Scholar
Hartl DL, Moriyama EN, Sawyer SA. Selection intensity for codon bias. Genetics. 1994;138(1):227–34.
CAS PubMed PubMed Central Google Scholar
Sueoka N. Translation-coupled violation of Parity Rule 2 in human genes is not the cause of heterogeneity of the DNA G + C content of third codon position. Gene. 1999;238(1):53–8.
Article CAS PubMed Google Scholar
Shang M, Liu F, Hua J, Wang K. Analysis on codon usage of chloroplast genome of Gossypium hirsutum. Sci Agric Sin. 2011;44(2):245–53.
CAS Google Scholar
Rice WR. Analyzing tables of statistical tests. Evolution. 1989;43(1):223–5.
Article Google Scholar
Liu Q. Analysis of codon usage pattern in the radioresistant bacterium Deinococcus radiodurans. Biosystems. 2006;85(2):99–106.
Article CAS PubMed Google Scholar
Sueoka N. Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci U S A. 1988;85(8):2653–7.
Article CAS PubMed PubMed Central Google Scholar
Jia X, Liu S, Zheng H, Li B, Qi Q, Wei L, et al. Non-uniqueness of factors constraint on the codon usage in Bombyx mori. BMC Genomics. 2015;16:356.
Article PubMed PubMed Central Google Scholar
Lobry JR, Gautier C. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 1994;22(15):3174–80.
Article CAS PubMed PubMed Central Google Scholar
Sharp PM, Cowe E, Higgins DG, Shields DC, Wolfe KH, Wright F. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens; a review of the considerable within-species diversity. Nucleic Acids Res. 1988;16(17):8207–11.
Article CAS PubMed PubMed Central Google Scholar
San Hong F, Guang GA, Wei SL, Ping HX. Analysis of Genetic Code Preference in Arabidopsis thaliana [J]. Progress In Biochemistry and Biophysics. 2003;2:012.
Google Scholar
Lafay B, Sharp PM. Synonymous codon usage variation among Giardia lamblia genes and isolates. Mol Biol Evol. 1999;16(11):1484–95.
Article CAS PubMed Google Scholar
Miyasaka H. Translation initiation AUG context varies with codon usage bias and gene length in Drosophila melanogaster. J Mol Evol. 2002;55(1):52–64.
Article CAS PubMed Google Scholar
Marin A, Gonzalez F, Gutierrez G, Oliver JL. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cervisiae and Escherichia coli. Nucleic Acids Res. 1998;26(19):4540.
Article CAS PubMed PubMed Central Google Scholar
Moriyama EN, Powell JR. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Res. 1998;26(13):3188–93.
Article CAS PubMed PubMed Central Google Scholar
Hershberg R, Petrov DA. General rules for optimal codon choice. PLoS Genet. 2009;5(7):e1000556.
Article PubMed PubMed Central Google Scholar
Nakamura Y, Gojobori T, Ikemura T. Codon usage tabulated from the international DNA sequence databases. Nucleic Acids Res. 1997;25(1):244–5.
Article CAS PubMed PubMed Central Google Scholar
Duret L, Mouchiroud D. Expression pattern, and surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci U S A. 1999;96(8):4482–7.
Article CAS PubMed PubMed Central Google Scholar
Comeron JM, Kreitman M, Aguade M. Natural selection on synonymous sites is correlated with gene length and recombination in Drosophila. Genetics. 1999;151(1):239–49.
CAS PubMed PubMed Central Google Scholar
Marais G, Mouchiroud D, Duret L. Does recombination improve selection on codon usage? Lessons from nematode and fly complete genomes. Proc Natl Acad Sci U S A. 2001;98(10):5688–92.
Article CAS PubMed PubMed Central Google Scholar
Hey J, Kliman RM. Interactions between natural selection, recombination and gene density in the genes of Drosophila. Genetics. 2002;160(2):595–608.
CAS PubMed PubMed Central Google Scholar
Kliman RM, Hey J. Hill-Robertson interference in Drosophila melanogaster: reply to Marais, Mouchiroud and Duret. Genet Res. 2003;81(2):89–90.
Article CAS PubMed Google Scholar
Marais G, Piganeau G. Hill-Robertson interference is a minor determinant of variations in codon bias across Drosophila melanogaster and Caenorhabditis elegans genomes. Mol Biol Evol. 2002;19(9):1399–406.
Article CAS PubMed Google Scholar
Stenico M, Lloyd AT, Sharp PM. Codon usage in Caenorhabditis elegans: delineation of translational selection and mutational biases. Nucleic Acids Res. 1994;22(13):2437–46.
Article CAS PubMed PubMed Central Google Scholar
Chen Y, Carlini DB, Baines JF, Parsch J, Braverman JM, Tanda S, et al. RNA secondary structure and compensatory evolution. Genes Genet Syst. 1999;74(6):271–86.
Article CAS PubMed Google Scholar
Carlini DB, Chen Y, Stephan W. The relationship between third-codon position nucleotide content, codon bias, mRNA secondary structure and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics. 2001;159(2):623–33.
CAS PubMed PubMed Central Google Scholar
Oresic M, Dehn M, Korenblum D, Shalloway D. Tracing specific synonymous codon-secondary structure correlations through evolution. J Mol Evol. 2003;56(4):473–84.
Article CAS PubMed Google Scholar
Vinogradov AE. Intron length and codon usage. J Mol Evol. 2001;52(1):2–5.
Article CAS PubMed Google Scholar
Berg OG. Selection intensity for codon bias and the effective population size of Escherichia coli. Genetics. 1996;142(4):1379–82.
CAS PubMed PubMed Central Google Scholar
Prat Y, Fromer M, Linial N, Linial M. Codon usage is associated with the evolutionary age of genes in metazoan genomes. BMC Evol Biol. 2009;9:285.
Article PubMed PubMed Central Google Scholar
Goodarzi H, Torabi N, Najafabadi HS, Archetti M. Amino acid and codon usage profiles: adaptive changes in the frequency of amino acids and codons. Gene. 2008;407(1–2):30–41.
Article CAS PubMed Google Scholar
Romero H, Zavala A, Musto H. Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Res. 2000;28(10):2084–90.
Article CAS PubMed PubMed Central Google Scholar
Rispe C, Delmotte F, van Ham RC, Moya A. Mutational and selective pressures on codon and amino acid usage in Buchnera, endosymbiotic bacteria of aphids. Genome Res. 2004;14(1):44–53.
Article CAS PubMed PubMed Central Google Scholar
Kawabe A, Miyashita NT. Patterns of codon usage bias in three dicot and four monocot plant species. Genes Genet Syst. 2003;78(5):343–52.
Article CAS PubMed Google Scholar
Akashi H. Inferring weak selection from patterns of polymorphism and divergence at “silent” sites in Drosophila DNA. Genetics. 1995;139(2):1067–76.
CAS PubMed PubMed Central Google Scholar
Dey S. Benefits of being biased! J Genet. 2004;83(2):113–5.
Article PubMed Google Scholar
Chiapello H, Lisacek F, Caboche M, Henaut A. Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene. 1998;209(1-2):GC1-38.
Liu Q, Feng Y, Zhao X, Dong H, Xue Q. Synonymous codon usage bias in Oryza sativa. Plant Sci. 2004;167(1):101–5.
Article CAS Google Scholar
Wang L, Roossinck MJ. Comparative analysis of expressed sequences reveals a conserved pattern of optimal codon usage in plants. Plant Mol Biol. 2006;61(4–5):699–710.
Article CAS PubMed Google Scholar
Gupta S, Bhattacharyya T, Ghosh TC. Synonymous codon usage in Lactococcus lactis: mutational bias versus translational selection. J Biomol Struct Dyn. 2004;21(4):527–35.
Article CAS PubMed Google Scholar
Ingvarsson PK. Gene expression and protein length influence codon usage and rates of sequence evolution in Populus tremula. Mol Biol Evol. 2007;24(3):836–44.
Article CAS PubMed Google Scholar
Qiu S, Bergero R, Zeng K, Charlesworth D. Patterns of codon usage bias in Silene latifolia. Mol Biol Evol. 2011;28(1):771–80.
Article CAS PubMed Google Scholar
Moriyama EN, Powell JR. Codon usage bias and tRNA abundance in Drosophila. J Mol Evol. 1997;45(5):514–23.
Article CAS PubMed Google Scholar
Ko HJ, Ko SY, Kim YJ, Lee EG, Cho SN, Kang CY. Optimization of codon usage enhances the immunogenicity of a DNA vaccine encoding mycobacterial antigen Ag85B. Infect Immun. 2005;73(9):5666–74.
Article CAS PubMed PubMed Central Google Scholar
Peng R-H, Yao Q-H, Xiong A-S, Cheng Z-M, Li Y. Codon-modifications and an endoplasmic reticulum-targeting sequence additively enhance expression of an Aspergillus phytase gene in transgenic canola. Plant Cell Rep. 2006;25(2):124–32.
Article CAS PubMed Google Scholar
Rouwendal GJ, Mendes O, Wolbert EJ, De Boer AD. Enhanced expression in tobacco of the gene encoding green fluorescent protein by modification of its codon usage. Plant Mol Biol. 1997;33(6):989–99.
Article CAS PubMed Google Scholar
Rao Y, Wu G, Wang Z, Chai X, Nie Q, Zhang X. Mutation bias is the driving force of codon usage in the Gallus gallus genome. DNA Res. 2011;18(6):499–512.
Article CAS PubMed PubMed Central Google Scholar
Vicario S, Moriyama EN, Powell JR. Codon usage in twelve species of Drosophila. BMC Evol Biol. 2007;7:226.
Article PubMed PubMed Central Google Scholar
Zhang WJ, Zhou J, Li ZF, Wang L, Gu X, Zhong Y. Comparative Analysis of Codon Usage Patterns Among Mitochondrion, Chloroplast and Nuclear Genes in Triticum aestivum L. J Integr Plant Biol. 2007;49(2):246–54.
Article CAS Google Scholar
Liu H, He R, Zhang H, Huang Y, Tian M, Zhang J. Analysis of synonymous codon usage in Zea mays. Mol Biol Rep. 2010;37(2):677–84.
Article CAS PubMed Google Scholar
Perriere G, Thioulouse J. Use and misuse of correspondence analysis in codon usage studies. Nucleic Acids Res. 2002;30(20):4548–55.
Article CAS PubMed PubMed Central Google Scholar
Charif D, Thioulouse J, Lobry JR, Perriere G. Online synonymous codon usage analyses with the ade4 and seqinR packages. Bioinformatics. 2005;21(4):545–7.
Article CAS PubMed Google Scholar

Download references

Acknowledgements

We would like to thank the native English speaking scientists of Elixigen Company (Huntington Beach, California) for editing our manuscript.

Funding

This work was financially supported by the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) [grant number IRT0848].

Availability of data and material

Sequence data was retrieved from transcriptome database in NCBI (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA80935/).

Authors’ contributions

XH, JX, and LC conceived the study and analyzed the data. YW, X G and X P performed the bioinformatics and statistical analysis. G Y contributed to the design and coordination of the study. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval

All animals were raised strictly according to the animal protection laws of the People’s Republic of China (a draft of an animal protection law released on September 18, 2009). All procedures were reviewed and approved by the Animal Ethics Committee of Sichuan Agricultural University (Ya’an, China) (Approval No. 2012–038).

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Department of Parasitology, College of Veterinary Medicine, Sichuan Agricultural University, Chengdu, 611130, China
Xing Huang, Jing Xu, Yu Wang, Xiaobin Gu & Guangyou Yang
Chengdu Agricultural College, Chengdu, 611130, China
Xing Huang
Meat-processing Application Key Laboratory of Sichuan Province, College of Pharmacy and Biological Engineering, Chengdu University, Chengdu, 610106, China
Lin Chen
College of Science, Sichuan Agricultural University, Ya’an, 625014, China
Xuerong Peng

Authors

Xing Huang
View author publications
You can also search for this author in PubMed Google Scholar
Jing Xu
View author publications
You can also search for this author in PubMed Google Scholar
Lin Chen
View author publications
You can also search for this author in PubMed Google Scholar
Yu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Xiaobin Gu
View author publications
You can also search for this author in PubMed Google Scholar
Xuerong Peng
View author publications
You can also search for this author in PubMed Google Scholar
Guangyou Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Guangyou Yang.

Additional files

Additional file 1:

All the indices of the total number of genes analyzed. (XLS 3493 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Huang, X., Xu, J., Chen, L. et al. Analysis of transcriptome data reveals multifactor constraint on codon usage in Taenia multiceps . BMC Genomics 18, 308 (2017). https://doi.org/10.1186/s12864-017-3704-8

Download citation

Received: 27 September 2016
Accepted: 12 April 2017
Published: 20 April 2017
DOI: https://doi.org/10.1186/s12864-017-3704-8

Analysis of transcriptome data reveals multifactor constraint on codon usage in Taenia multiceps

Abstract

Background

Results

Conclusions

Background

Methods

Sequence acquisition

Indices of codon usage

Principal Component Analysis (PCA)

ENc-plot

PR2 bias plot

Neutrality plots

Determination of optimal codons

Software

Results

Codon composition analysis

Preferential codon usage

Principal Component Analysis (PCA)

Relationship between ENc and GC3s

PR2 bias plot analyses

Neutrality plot analysis

Gene expression level and synonymous CUB

Effect of the hydrophobicity and aromaticity of encoded protein on codon bias

Gene length and synonymous CUB

Optimal translational codons

Discussion

Conclusions

Abbreviations

References

Acknowledgements

Funding

Availability of data and material

Authors’ contributions

Competing interests

Consent for publication

Ethics approval

Publisher’s Note

Author information

Authors and Affiliations

Corresponding author

Additional files

Additional file 1:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomics

Contact us