Codon usage suggests that translational selection has a major impact on protein expression in trypanosomatids

Background Different proteins are required in widely different quantities to build a living cell. In most organisms, transcription control makes a major contribution to differential expression. This is not the case in trypanosomatids where most genes are transcribed at an equivalent rate within large polycistronic clusters. Thus, trypanosomatids must use post-transcriptional control mechanisms to balance gene expression requirements. Results Here, the evidence for translational selection, the enrichment of 'favoured' codons in more highly expressed genes, is explored. A set of highly expressed, tandem-repeated genes display codon bias in Trypanosoma cruzi, Trypanosoma brucei and Leishmania major. The tRNA complement reveals forty-five of the sixty-one possible anticodons indicating widespread use of 'wobble' tRNAs. Consistent with translational selection, cognate tRNA genes for favoured codons are over-represented. Importantly, codon usage (Codon Adaptation Index) correlates with predicted and observed expression level. In addition, relative codon bias is broadly conserved among syntenic genes from different trypanosomatids. Conclusion Synonymous codon bias is correlated with tRNA gene copy number and with protein expression level in trypanosomatids. Taken together, the results suggest that translational selection is the dominant mechanism underlying the control of differential protein expression in these organisms. The findings reveal how trypanosomatids may compensate for a paucity of canonical Pol II promoters and subsequent widespread constitutive RNA polymerase II transcription.


Background
Trypanosomatids have a devastating impact on the world's poor, causing African trypanosomiasis, Chagas disease and leishmaniasis [1]. The consequences of this range of human and animal diseases are hundreds of thousands of deaths each year, ~1.5 million cases a year of the disfiguring lesions associated with cutaneous leishmaniasis and severely curtailed agricultural development throughout sub-Saharan Africa. The African trypanosome also causes Nagana disease in cattle, rendering 10 million square kilometres of land unsuitable for livestock.
The protozoan parasites responsible branched early from the eukaryotic lineage and display a range of unusual molecular features. RNA polymerase II transcription of protein coding genes is polycistronic and constitutive and all mature mRNAs are trans-spliced to an identical leader sequence [2]. Genome sequencing revealed remarkably conserved gene order or synteny across the genomes of the African trypanosome, Trypanosoma brucei, the South American trypanosome, Trypanosoma cruzi, and Leishmania major [3]. These trypanosomatids cause distinct diseases, are spread by different insect vectors and are thought to have diverged from a common ancestor several hundred million years ago. Trypanosomatids display a unique paucity of conventional RNA polymerase II promoters and transcriptional control is extremely limited compared to any other organism studied in any detail. Widespread constitutive and polycistronic transcription places considerable emphasis on post-transcriptional control since genes in the same transcriptional cluster function in unrelated pathways and are expressed at widely different levels [4]. Thus, trypanosomatids present a unique opportunity to study post-transcriptional control of gene expression.
Cells must express different proteins over an enormous abundance-range, from fewer than 50 to more than a million molecules per cell reported in Saccharomyces cerevisiae [5]. Efficiently translated mRNA species are translated several thousand times, with translation initiating up to once every 2 s, providing substantial scope for differential control at the level of translation. Translational selection has been reported in a range of species, whereby more frequent synonymous codons correspond to more abundant cognate tRNAs, with the correspondence being more pronounced for highly expressed genes [6][7][8]. Synonymous codon bias has also been reported in trypanosomatids [9][10][11][12]. There is a good correlation between mRNA levels and protein levels in yeast with 73% of variance in protein abundance explained by mRNA abundance [13]. In contrast, microarray analysis reveals modest differences in mRNA abundance in trypanosomatids and proteome analysis suggests substantial differential control at the level of translation or protein turnover [14][15][16][17]. Evidence for translational selection is explored here using trypanosomatid genome sequence data [18][19][20] and whole-cell proteome data [21]. The findings suggest that translational selection is the dominant mechanism underlying the control of differential protein expression in trypanosomatids.

Tandem genes are highly expressed and display codon bias
Most trypanosomatid genes are 'single copy' (trypanosomatid genomes are typically diploid) but tandem gene amplification is thought to contribute to increased expression [20,22]. Thus, tandem amplified genes may be among the most highly expressed and, if translational selection operates, may be an excellent source of favoured codons. To begin to explore evidence for translational selection in trypanosomatids, codon bias was assessed in tandem amplified genes. Whole-cell proteome data has been derived from T. cruzi [21] but the genome assembly is incomplete due to sequence complexity; the strain used for the sequencing project is a hybrid of two genotypes with multiple distinct alleles for most genes [3]. Since the T. brucei assembly is excellent and most trypanosomatid genes share orthologues accessible through the GeneDB interface, the T. brucei genome was scanned for tandem duplicated protein-coding genes. Consistent with the first prediction above, sixty-four tandem-amplified genes in T. brucei encode proteins with orthologues among the 243 proteins over-represented (≥ 10 mass spectra) in the nonredundant T. cruzi proteome set [21]. This includes the histones, ribosomal proteins, chaperones, tubulins and enzymes of carbohydrate metabolism. Tandem arrayed genes display little or no sequence divergence so a single copy from each tandem was selected for further analysis (see Table 1 in Additional file 1).
Codon usage was analysed for this tandem gene set from T. brucei and for the orthologous sets from T. cruzi and L. major, >60,000 codons in total. Consistent with previous reports [9,11,12], this revealed codon bias in all three trypanosomatids (Table 1). An extreme example is the gene encoding the highly abundant α-tubulin in L. major, which uses only 40 of the 61 available codons. Figure 1 illustrates this bias across all synonymous codons for T. brucei and L. major, and shows that codon bias is more pronounced in L. major. This is likely explained by the higher 'background' GC-content; the intergenic and protein-coding GC-contents are 41% and 50.9% in T. brucei [18], 47% and 53.4% in T. cruzi [19] and 57.3% and 62.5% in L. major [20], respectively. Since RNA sequences probably compete for access to the translation machinery and most GC3-codons (codons with G or C at the third position) are favoured, the higher GC-content may have driven the increase in codon bias within protein-coding regions.
Although all two-fold degenerate codons show a preference for GC3, this feature is not universal throughout the high-expression gene sets. T. brucei shows little bias for Ala, Pro, Ser or Thr codons, and the GGG(Gly), AGG(Arg) and CGG(Arg) codons are more than two-fold under-represented in all three trypanosomatids (Fig. 1 and Table 1).
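The codon-usage measures discussed above (counts of synonymous codons, the number of distinct codons a gene uses and the GC3 fraction) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the procedure used for Table 1 or Figure 1: the function names `count_codons`, `rscu` and `gc3_fraction` are hypothetical, the standard genetic code is assumed, and relative synonymous codon usage (RSCU) is used here as one common way to express synonymous codon bias.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codons(cds):
    """Count in-frame sense codons in a coding sequence (5'->3', DNA or RNA letters)."""
    cds = cds.upper().replace("U", "T")
    counts = Counter()
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        # Skip stop codons and anything that is not a recognised codon (e.g. containing N).
        if CODON_TABLE.get(codon, "*") != "*":
            counts[codon] += 1
    return counts


def rscu(counts):
    """RSCU: observed codon count divided by the mean count of its synonymous family."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the sample; no RSCU can be assigned
        for c in codons:
            values[c] = counts.get(c, 0) * len(codons) / total
    return values


def gc3_fraction(counts):
    """Fraction of counted codons with G or C at the third position."""
    total = sum(counts.values())
    gc3 = sum(n for codon, n in counts.items() if codon[2] in "GC")
    return gc3 / total if total else 0.0


# Hypothetical toy sequence, used only to show the calls.
toy = count_codons("ATGGCCGCCGCTAAGTAA")
print(len(toy), rscu(toy)["GCC"], gc3_fraction(toy))
```

An RSCU value above 1 marks a codon used more often than expected if all synonymous codons were used equally, and `len(toy)` gives the number of distinct sense codons used, the quantity quoted above for α-tubulin.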

'Favoured' codons correspond to over-represented cognate tRNAs
Biased codons likely favour translation if cognate tRNAs are more abundant [7,23]. Previous analysis indicated that T. brucei tRNA genes are organised into clusters spread over several chromosomes and that relative tRNA abundance correlates with codon usage but not with gene copy number [24]. It was noted in that study, however, that tRNA nucleotide modification may lead to underestimation of tRNA abundance in some cases. The tRNA gene complement was analysed in all three trypanosomatids in order to explore the relationship between codon usage and tRNA gene copy number.
The GeneDB database revealed a total of 261 annotated tRNAs among the three trypanosomatids. The universal genetic code comprises 61 codons for 20 amino acids but some tRNAs decode multiple codons, a phenomenon known as wobble [25]. The trypanosomatid tRNA complement (see table 2 in the additional data file) represents 45 anticodons, a tRNA gene distribution that suggests that sixteen CU3 codons are decoded by wobble tRNAs (see Table 1). Eight U3 codons are likely decoded by anticodons with guanosine in the wobble position, while another eight C3 codons are likely decoded by anticodons with inosine (deaminated adenosine) in the wobble position [25,26]. All but one of these 'wobble pairs' were predicted previously in T. brucei [24]. Codon bias is seen among wobble pairs in the high-expression gene set; Asn, Asp, Cys, His, Phe and Tyr for example (Table 1), possibly reflecting translational selection based on differential stability of codon-anticodon interaction. In addition, some wobble codon pairs and putative cognate tRNAs are biased and over-represented, respectively; CGU/C(Arg)-ACG tRNA and GGU/C(Gly)-GCC tRNA for example (Table 1).
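The wobble logic described above can be sketched in Python. This is a simplified illustration, not the annotation procedure used here: the `WOBBLE` table follows the classical wobble rules in reduced form, adenosine at the first anticodon position is assumed to be deaminated to inosine, other base modifications are ignored, and the function name `decoded_codons` is hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Codon third-position bases read by each base at the first anticodon position.
# Assumption: simplified classical wobble rules; real tRNAs carry further modifications.
WOBBLE = {
    "G": "CU",   # G pairs with C (Watson-Crick) and U (wobble)
    "I": "UCA",  # inosine pairs with U, C and A
    "C": "G",    # C pairs with G only
    "U": "AG",   # unmodified U; modified uridines may be more restrictive
}


def decoded_codons(anticodon):
    """Return the codons (5'->3', RNA) that an anticodon (5'->3') is predicted to read."""
    ac = anticodon.upper().replace("T", "U")
    # Assumption: A at the first anticodon position is deaminated to inosine.
    first = "I" if ac[0] == "A" else ac[0]
    # Codon positions 1 and 2 pair with anticodon positions 3 and 2, respectively.
    prefix = COMPLEMENT[ac[2]] + COMPLEMENT[ac[1]]
    return [prefix + third for third in WOBBLE[first]]


print(decoded_codons("GCC"))  # Gly tRNA from the text: reads GGC and, by wobble, GGU
print(decoded_codons("ACG"))  # Arg tRNA from the text: reads CGU and CGC (and CGA under these rules)

# The anticodon count quoted in the text: 61 sense codons minus 16 wobble-decoded codons.
assert 61 - 16 == 45
```

Under these rules, a tRNA gene set that lacks a dedicated anticodon for one member of each CU3 pair can still read both codons, which is the inference drawn from the 45 observed anticodons.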
To test the idea that tRNA abundance is related to gene copy number in trypanosomatids, amino acid frequency in the tandem, high-expression protein set was plotted against tRNA gene copy number (Fig. 2A). The positive correlation strongly supports the idea that tRNA gene copy number determines relative tRNA abundance. Having established this relationship, the four GA3 synonymous codon-pairs with >30% bias in all three trypanosomatids were analysed (Fig. 2B). The results show a positive correlation between tRNA gene copy number and codon usage bias, providing further evidence for translational selection; only one T. brucei tRNA gene fails to display a matching bias in copy number. The correlation is more striking in L. major, the trypanosomatid that displays more pronounced codon bias. In this case, eight of nine GA3 pairs that display substantial codon bias also display a corresponding tRNA bias (Table 1; those shown in Fig. 2B plus GCG/A(Ala), CUG/A(Leu), CCG/A(Pro) and ACG/A(Thr)). Indeed, there are several examples in Leishmania where tRNA genes that recognise favoured codons appear to have been specifically amplified since divergence from the trypanosomes (CAG(Gln), CUG(Leu), CCG(Pro), ACG(Thr) and favoured 'wobble' codons; CGC(Arg), GGC(Gly), AUC(Ile), CUC(Leu), CCC(Pro) and ACC(Thr); see Table 1), perhaps to counter the high background GC-content. Thus, there is a correspondence between numbers of cognate tRNA genes, a likely measure of tRNA abundance, and preferred codons in the high-expression gene sets in all three trypanosomatid genomes.
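A comparison of this kind, amino acid frequency against tRNA gene copy number, can be summarised with a correlation coefficient. The sketch below computes a Pearson coefficient in plain Python. It is an assumption that a Pearson coefficient is an appropriate summary here, since the text reports only a positive correlation; the function name `pearson_r` is hypothetical and the input values are placeholders, not data from Table 1 or Figure 2.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Placeholder values only (hypothetical): tRNA gene copies per amino acid and the
# percentage frequency of that amino acid in a high-expression protein set.
trna_gene_copies = [1, 2, 3, 4]
amino_acid_percent = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(trna_gene_copies, amino_acid_percent))
```

The same function could be applied to codon-pair bias against the tRNA copy-number ratio for the GA3 pairs, with one pair of values per codon pair.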

Codon usage correlates with expression level
The codon adaptation index (CAI) is used to measure synonymous codon usage bias and can predict gene expression level if translational selection operates [27]. Whole-cell proteome data are available for T. cruzi [21] and the number of mass spectra matched to individual genes provides an indication of relative expression level. Thus, proteome data provide an opportunity to test for a correlation between codon usage and expression. For this analysis, T. cruzi sequences were divided into four categories expected to have progressively higher CAI scores if translational selection operates. The categories were as follows: (a) intergenic regions (translated in all six reading-frames); (b) 'single-copy' genes; (c) 'single-copy' genes detected through whole-cell proteome analysis (≥ 5 mass spectra) and (d) tandem-arrayed genes detected through whole-cell proteome analysis (≥ 10 mass spectra). The analysis shown in Figure 3 indicates progressively increasing CAI scores from (a) through (d). Protein-coding sequences appear to have evolved to optimise translation while intergenic sequences may have evolved to counter translation.
Table 1. … (see table 1 in the additional data file, >60,000 codons) and tRNA gene copy number (see table 2 in the additional data file) are shown. Overall favoured codons and all sixteen CU3 'wobble tRNA pairs' are indicated in bold text. T. cruzi has additional tRNA genes in many cases because the strain used for the sequencing project is a hybrid of two genotypes [19].
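A CAI score of the kind used in this analysis can be sketched as the geometric mean of per-codon 'relative adaptiveness' values derived from a reference set of highly expressed genes. The code below is a minimal illustration in that commonly used form, not the exact implementation behind Figure 3: the function names, the exclusion of Met and Trp codons, and the pseudocount of 0.5 for codons absent from the reference set are assumptions, and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import exp, log

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_of(cds):
    """Split a coding sequence (5'->3') into in-frame codons."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]


def relative_adaptiveness(reference_cds_list, pseudocount=0.5):
    """Weight each codon by its count relative to the most used synonymous codon."""
    counts = Counter(c for cds in reference_cds_list for c in codons_of(cds))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":  # stops and single-codon amino acids carry no information
            families.setdefault(aa, []).append(codon)
    weights = {}
    for codons in families.values():
        # Assumption: unseen codons get a small pseudocount so that weights stay positive.
        adjusted = {c: counts[c] if counts[c] > 0 else pseudocount for c in codons}
        top = max(adjusted.values())
        for c in codons:
            weights[c] = adjusted[c] / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scorable codons in a coding sequence."""
    logs = [log(weights[c]) for c in codons_of(cds) if c in weights]
    if not logs:
        raise ValueError("sequence contains no scorable codons")
    return exp(sum(logs) / len(logs))


# Hypothetical toy reference set standing in for a tandem, high-expression gene set.
weights = relative_adaptiveness(["GCCGCCGCCGCT", "AAGAAGAAA"])
print(cai("GCCAAG", weights), cai("GCTAAA", weights))
```

With this definition a gene built entirely from the most used codon of each family scores 1, and scores fall towards 0 as rarer synonymous codons accumulate, which is the ordering expected across categories (a) to (d) if translational selection operates.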

Codon usage can predict expression level for individual proteins
A more rigorous test of the contribution of translational selection to gene expression is whether expression level can be predicted for individual genes exclusively based on codon usage. Single-copy genes represented in the T. cruzi proteome data [21] were used for this analysis (see table 3 in the additional data file); tandem array genes were not suitable because gene copy number, thought to contribute to expression, is highly variable and unknown in most cases. CAI scores were calculated for each gene and plotted against the gene length-adjusted number of cognate mass spectra, a measure of relative expression. A positive correlation emerges and the relationship between protein abundance and CAI is log-linear (Fig. 4). The results indicate that T. cruzi protein-coding sequences can predict relative steady-state protein expression level. The trend is remarkable because of the range of other parameters, both alternative modes of expression control and experimental sampling, that could impact on the outcome. The findings are consistent with the idea that codon bias has a major impact on steady-state protein levels in trypanosomatids.
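A log-linear relationship of this kind can be written as log10(abundance) = intercept + slope × CAI and fitted by ordinary least squares. The sketch below shows one way to do this in plain Python. The normalisation of mass spectra per 1,000 codons, the function names and the numerical values are assumptions made for illustration; the text specifies only that spectral counts were adjusted for gene length.

```python
from math import log10


def length_adjusted_abundance(mass_spectra, length_codons, per=1000):
    """Mass spectra per `per` codons (assumed normalisation, for illustration)."""
    return mass_spectra * per / length_codons


def fit_log_linear(cai_scores, abundances):
    """Least-squares fit of log10(abundance) = intercept + slope * CAI."""
    if len(cai_scores) != len(abundances) or len(cai_scores) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    ys = [log10(a) for a in abundances]  # abundances must be positive
    n = len(ys)
    mean_x = sum(cai_scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in cai_scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cai_scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_abundance(cai_score, slope, intercept):
    """Predicted abundance on the original (non-log) scale."""
    return 10 ** (intercept + slope * cai_score)


# Hypothetical values chosen so that abundance rises ten-fold per 0.1 CAI units.
slope, intercept = fit_log_linear([0.4, 0.5, 0.6], [1.0, 10.0, 100.0])
print(slope, intercept, predict_abundance(0.5, slope, intercept))
```

Because the fit is on a log scale, a fixed increment in CAI corresponds to a fixed fold-change in predicted abundance rather than a fixed absolute change.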

Relative codon bias is conserved among trypanosomatids
Current high-throughput technologies fail to detect and/or quantify the expression levels of less abundant proteins [28]. An interesting question is whether CAI can predict expression level across the genome. Although certain proteins will be required in substantially different quantities in the different trypanosomatids, the majority are expected to be expressed at similar relative levels. Thus, relative CAI scores are expected to be broadly conserved if translational selection impacts upon global gene expression. CAI scores were calculated for 'single-copy' genes from syntenic, polycistronic gene clusters on three different chromosomes (see table 4 in the additional data file). The different gene clusters from each trypanosomatid showed similar CAI distributions so the data were pooled. CAI scores were first compared in the trypanosomes, T. brucei and T. cruzi, and the analysis indicates that relative scores are indeed conserved (Fig. 5A). A more rigorous test was then carried out between T. brucei and L. major, thought to have diverged around 250 Mya. Despite the substantially higher GC content in L. major, relative scores remain broadly conserved (Fig. 5B). The results are consistent with the idea that codon bias predicts translation efficiency for any mRNA in trypanosomatids.
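One way to ask whether relative CAI scores are conserved between orthologues, independent of absolute differences such as those caused by GC content, is a rank correlation. The sketch below computes Spearman's rho in plain Python. Using a rank correlation here is an assumption, since the text does not state which statistic, if any, was applied to Figure 5; the function names and CAI values are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean = (len(rx) + 1) / 2  # mean of the ranks 1..n, also with averaged ties
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var_x = sum((a - mean) ** 2 for a in rx)
    var_y = sum((b - mean) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical CAI scores for four orthologous gene pairs (same gene order in both lists).
species_a = [0.50, 0.55, 0.70, 0.62]
species_b = [0.60, 0.66, 0.85, 0.71]
print(spearman_rho(species_a, species_b))
```

In this toy case the absolute scores differ between the two lists but the ordering of genes is identical, which is the sense in which relative scores can be conserved despite a shift in overall codon bias.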
Figure 1. Relative frequency of synonymous codon usage in the tandem, high expression gene sets from T. brucei and L. major; codon usage patterns were broadly similar in T. brucei and T. cruzi (data not shown). Tandem amplified genes were considered highly expressed if represented by ≥ 10 mass spectra from whole-cell proteome analysis of four life-cycle stages of T. cruzi [21].
Figure 2. Correspondence between codon-usage in highly expressed genes and tRNA gene copy number.

Discussion
In trypanosomatids, bias in codon usage correlates with tRNA gene copy number and with expression level. This provides strong evidence for a major impact of translational selection on gene expression. Thus, translational selection facilitates the generation of differential protein abundance from genes embedded within polycistrons. Since translation rates are likely retarded by codons with low-abundance cognate tRNAs, natural selection of tRNA gene numbers and codon bias allows optimization of translation rate and efficiency across the genome. Many of the most highly expressed genes use a dual strategy to enhance expression; increased gene dosage combined with a high proportion of codons with more abundant cognate tRNAs. This dual strategy allows for an increase in overall transcription and translation.
In S. cerevisiae, although the value of codon bias as a predictor of protein levels is disputed, proteins encoded by genes with low bias are not detected on two-dimensional gels and protein abundance does correlate with codon bias when only genes with high bias are considered [29,30]. Thus, translational selection may be a pervasive mechanism in the control of gene expression but its impact may be obscured in many cell-types by other regulatory mechanisms. I propose that translational selection makes a more substantial contribution to gene expression control in trypanosomatids due to the paucity of regulated transcription. Initial ribosome assembly on mRNA may also be largely unregulated since trans-splicing leads to the attachment of an identical spliced-leader sequence to every mRNA [4]. Thus, differential translation efficiency may be the dominant level of gene expression control in trypanosomatids. Translational selection may have emerged in primitive cells that lacked mechanisms for differential mRNA expression and the emergence of differen-
Figure 4. Codon usage is predictive of expression level for individual genes. CAI scores were calculated for single-copy genes (see Fig. 3c and table 3 in the additional data file). The number of mass spectra, corrected for gene length, provides a measure of relative abundance. Only genes represented by ≥ 5 mass spectra from T. cruzi proteome analysis [21] were analysed to minimise sampling errors.
Figure 5. Relative codon bias is conserved among trypanosomatids. CAI scores were calculated for single-copy, protein-coding sequences. (a) Relative scores for T. brucei and T. cruzi. (b) Relative scores for T. brucei and L. major. The syntenic, polycistronic clusters analysed were from T. brucei chromosomes 1, 6 and 10 and from the orthologous genes in T. cruzi and on L. major chromosomes 12, 30 and 3 (see table 4 in the additional data file). The T. cruzi set is the same as that presented in Figure 3b; the chromosome numbers have not been determined in T. cruzi due to problems with sequence assembly.
Many trypanosomatid proteins are differentially expressed during the cell-cycle and the life-cycle and additional controls must clearly determine such differential expression. A number of mRNA un-translated regions, particularly at the 3' end, may modulate mRNA maturation, transport, turnover and translation, for example, and protein turnover may also vary [4]. When these additional controls operate, codon bias should fail to predict expression level. Prominent examples of differential regulation include the variant surface glycoprotein gene, abundantly expressed in bloodstream form T. brucei, and procyclins, expressed in insect stage T. brucei. Expression of these proteins is regulated using an unusual mechanism involving differential transcription by RNA polymerase I, which is restricted to rRNA genes in other eukaryotes [31]. mRNA turnover [32] and protein turnover [33] also contribute to controlling variant surface glycoprotein expression and, as expected, codon bias fails to predict relative expression when these controls operate (CAI for variant surface glycoprotein genes = 0.54 ± 0.01, n = 4). Thus, codon analysis in combination with high-throughput proteome analysis may allow identification of proteins subject to the alternative expression control strategies described above. In addition, orthologous genes that show poor correspondence in relative codon bias among trypanosomatids may be those that display species-specific expression differences. If this is the case, genome-wide codon-usage analysis will facilitate the identification of these genes.
Protein coding sequences are relatively easy to predict in trypanosomatids due to high density, intron poverty and organisation into directional clusters. New annotation tools are under development, however [34], and gene annotation could be refined. The findings reported here indicate that algorithms incorporating codon sampling could facilitate the annotation of current and future trypanosomatid genome sequence data.

Conclusion
Constitutive RNA polymerase II transcription is widespread in trypanosomatids so differential gene expression must be controlled post-transcriptionally. Research in this area has focussed on un-translated mRNA regulatory sequences typically found within 3' un-translated regions.
As reported here, analysis of synonymous codon bias indicated pronounced bias in highly expressed genes and this bias correlates with tRNA gene copy number and with gene expression level. In addition, relative codon bias is conserved among orthologous genes from divergent trypanosomatids, even in genes thought to be expressed at low level. Taken together, the results suggest that control at the level of translation, translational selection, is the dominant mechanism underlying differential protein expression in these organisms.

Analysis of sequence and expression data
Annotated trypanosomatid genome sequence data were browsed and analysed using the GeneDB interface [35] hosted by the Wellcome Trust Sanger Institute [36]. T. cruzi proteome expression data [21]

Additional file 1
All genes listed are linked to the GeneDB database [36]. The colour coding in additional tables S1, S3 and S4 is based on the GeneDB annotation. T. cruzi mass spectra values in additional tables S1 and S3 are from Atwood et al. [21]. Table S1 - Trypanosomatid tandem/high-expression set. Only the T. brucei GeneDB links are listed but further links to orthologous trypanosomatid genes can be found on each GeneDB page. Grey shading represents genes thought to be present in tandem arrays but not annotated (An) as such; HSP83 [41] and aldolase [42] in T. brucei for example, but mostly due to incomplete assembly of T. cruzi genome sequence.