Properties and determinants of codon decoding time distributions

Background Codon decoding time is a fundamental property of mRNA translation believed to affect the abundance, function, and properties of proteins. Recently, a novel experimental technology--ribosome profiling--was developed to measure the density, and thus the speed, of ribosomes at codon resolution. Specifically, this method is based on next-generation sequencing, which theoretically can provide footprint counts that correspond to the probability of observing a ribosome in this position for each nucleotide in each transcript. Results In this study, we report for the first time various novel properties of the distribution of codon footprint counts in five organisms, based on large-scale analysis of ribosomal profiling data. We show that codons have distinctive footprint count distributions. These tend to be preserved along the inner part of the ORF, but differ at the 5' and 3' ends of the ORF, suggesting that the translation-elongation stage actually includes three biophysical sub-steps. In addition, we study various basic properties of the codon footprint count distributions and show that some of them correlate with the abundance of the tRNA molecule types recognizing them. Conclusions Our approach emphasizes the advantages of analyzing ribosome profiling and similar types of data via a comparative genomic codon-distribution-centric view. Thus, our methods can be used in future studies related to translation and even transcription elongation.


4
M. musculus fragments were removed from the their 3' linker CTGTAGGCACCATCAATTCGTATGCCGTCTTCTGCTTGAA in a similar manner described for C. elegans. Gene coordinates were downloaded from the UCSC genes data set [11], using the mm9 genomic strain. Fragments failing to align to rRNA transcripts were further processed and aligned to exons. Those that failed to show a clear single peak 10-25 nt before the beginning of the ORFs were discarded (Figure 05 in Additional file 1). According to the results of this analysis, the A site of M. musculus was set 15/16/17 nt downstream for fragments of length of 29-30/31-33/34-36 nt accordantly (similarly to the results reported in the original study [4]).
E. coli and B. subtilis fragments were first removed from their attached 3' linker CTGTAGGCACCATCAAT, allowing up to one difference between the estimated linker and the real linker (due to its shorter length). The first nt of each fragment was ignored, as it frequently represents an un-templated addition during reverse transcription [12] and fragments shorter than 20 nt were discarded. Gene coordinates were downloaded from the Biomart database (www.biomart.org). Fragments failing to align to rRNA transcripts were further processed and aligned to genes transcripts. Fragments that failed to show a clear single peak 10-25 nt before the beginning of the ORFs were discarded (Figures 06-07 in Additional file 1). The results of this analysis indicate that the A site of E. coli is located 16-24 nt from the 5' of the fragments while the A site of B. subtilis is located 19-25 nt relatively to the 5' of the fragments.
To increase the robustness of the data, for S. cerevisiae, E. coli and B. subtilis RC profiles of genes from different replications of the experiment were averaged together. To avoid analyzing genes with low RC, only genes with a median RC above one were included in the analysis. Full statistics of the alignment steps for each processed dataset appear in Table S01.

Evaluating the influence of length of the ORFs on the calculated NFC values
To test whether the length of the ORF has a potential effect on the NFC values, we tested the influence of this factor in a controlled environment (TASEP simulation with translational pauses, as presented in Figure 28 in Additional file 1). All simulated genes were divided into two groups according to their length (top and bottom 50%). For each group we randomly selected an equal amount of 1200 NFC values for each codon type. Spearman correlation between the various estimated features on both group of genes was found to be highly correlative, indicating the robustness of the measure to the length of the genes (mean: r = 0.92, p = 4*10 -6 ; median: r = 0.94, p = 4*10 -6 ; mode: r = 0.86, p = 2.5*10 -7 ; mean of log-normal fitting: r = 0.95, p = 4*10 -6 ; median of log-normal fitting: r =0.99, p = 2.7*10 -6 ; skewness of log-normal fitting: r = 0.91, p = 3.9*10 -6 ).
other codons from the second subset was also calculated. These two measures were calculated for each organism for 100 different divisions of the genes into two subsets. were calculated for all genes in a GO functional group that were at least 300 codons long, to enable analysis of non-overlapping regions of the first and last 150 codons (this threshold was lowered to allow more genes to participate due to smaller group sizes), similarly using a sliding window of 50 codons.

Calculating tAI value of codons
The tAI index [14]  where codons with low adaptiveness values to the tRNA pool will be more slowly translated.

Simulating ribosomal density profiles using the TASEP model
Ribosomal density profiles were simulated using an adapted version of the TASEP biophysical translation model, previously used in different studies [15,16]. This model considers all major aspects of translation elongation and initiation, including: the elongation speed of each codon (which in this study is based on the adaptation of the codon to the tRNA pool); initiation rate; the size of the ribosomes; interaction between ribosomes such as traffic jams and ribosomes blocking each other; and the fact that the process is stochastic.
Generally, in this model, the mRNA was simulated using a lattice of sites, where represents the number of codons of the ORF. Each ribosome was defined to cover 11 codons and its A site was located at the sixth codon. During translation, any codon could be covered at a time by a single ribosome at most. In each step of the simulation, a single ribosome was allowed to attach itself to the lattice/advance to the next codon if the first/next six codons were not occupied. The time between initiation attempts was set to be exponentially distributed with a constant rate . Similarly, the time between jump attempts from site to site was assumed to be exponentially distributed with rate .
The time between events (initiation or jumping between sites) is therefore exponentially distributed (minimum of exponentially distributed random variables) with rate: ∑ where describes the site (codon) number on the lattice and if codon is being translated, otherwise =0. Therefore the initiation probability is given by and the probability of a ribosome to jump from site to site is given by [17].
To simulate the ribosomal profiling experiment, the TASEP model was adapted in the following manner [10]: footprint fragments were defined as fragments of the mRNA that were covered by a ribosome in the simulation. For each gene the TASEP model was run several times, proportionally to the measured mRNA levels of the gene. Thus each simulated profile of a gene was created from ribosome protected fragments from different mRNA copies, similarly to the real ribosome profile experiment.
In this study S. cerevisiae genes that contained a sufficient amount of RC (see criterion in Table S02) and with a least 50 codons were simulated. Codons translation rate were set according to their adaptiveness to the tRNA pool (specific used values in the simulation are presented in Table S11).
The initiation rate was set to be as the translation rate of the fastest codon (unless mentioned otherwise). To reach a steady-state distribution on the mRNA, the simulation of each mRNA copy In the first TASEP simulation [18] we set all codons to have equal translation efficiency, with a low initiation rate so traffic jams will be seldom created. The NFC distributions in Figure  However, codons with low decoding times could be more affected by ribosomal jams, a phenomenon reflected in 'right' tails of their NFC distribution. In addition, codons with high decoding times and translational pauses would create higher NFC read counts, also expressed in extreme values of the NFC distribution. Therefore, we expect the NFC distributions of slower codons to be less skewed (i.e. more symmetrical), as they should theoretically be less affected by delays caused by these factors. Thus, it is possible that a mean estimator could be too sensitive and easily biased by the skewness of the NFC distribution.
Therefore we created a second TASEP simulation [18]  indeed the translation elongation process is generated by a TASEP-like process.

9
To demonstrate that even a very small number of translational pauses can have a major effect on the mean estimator, we've altered 4% of the codons to cause translational pauses [18]. To this end, codons of the real S. cerevisiae ribosomal profiles with NFC higher than three-fold of the mean NFC of the ORF (when excluding codons with zero read counts [5]) were defined as locations of translational pauses. The translation time of these codons was set to be proportional to the NFC of the measured pauses in the real profiles. The resulting NFC distributions are presented Figure 28 in Additional file 1. However, Spearman correlation between the simulated decoding times and estimated mean NFC values (as calculated in previous studies [5,7]) deteriorated to r = -0.11 ( ). This result indicates that the mean NFC values are highly sensitive to possible outliers caused by the additional translational pauses.                       cerevisiae GO cellular components. The regions were calculated (relatively to the 5'/3' end of the ORFs), based on the mean distance vector presented in Figure S17. codons [2]. Only genes of at least 400 codons length were included in this presentation.    Supplementary Tables   Table 01. Alignment statistics of the analyzed organisms. Each row depicts the statistics of each processed replica of an organism. The second column depicts the number of obtained fragments from each replica, after filtering fragments of a quality lower than 50. The third column depicts the number of fragments after linker removal (see specific linker details in Supplementary Methods 'Reconstructing ORFs ribosomal profiles of the analyzed organisms') that had at least 19 nt. The fourth column depicts the number of remaining fragments after filtering fragments that aligned to rRNA sequences. The fifth column represents the number of fragments that aligned to annotated ORFs and spliced junctions. The sixth column describes the percentage of fragments that their P site could not be determined (details in Supplementary Methods 'Reconstructing ORFs ribosomal profiles of the analyzed organisms'). Columns seventen describe the percentage of remaining fragments that were mapped to one/two/three/more than three locations.   Table 07. For each organism the logL(best fit)/logL(log normal fit) and logL(best fit)/logL(normal fit) ratios are presented only for codons that their best mathematical fitting is not of a log-normal type (rounding at two digits precision). As seen from the results below, the ratio between loglikelihoods of the best mathematical fitting and log-normal fitting were bigger than 0.99; in comparison, the ratio between log-likelihoods of the best mathematical fitting and normal fitting were smaller than 0.78).