Poly(A) site prediction
EST sequences were obtained from dbEST v. 01/06/05 (6,009,051 human, 4,314,509 mouse and 623,741 rat ESTs). Full-length cDNA sequences were obtained from H-Inv 1.8 (41,118 sequences) [18] and FANTOM 2.01 (60,770 sequences) [19]. ESTs annotated as 3' were extracted (2,029,534 human, 1,864,874 mouse and 293,597 rat ESTs) and trailing poly(A) or poly(T) sequences of 5 nt or more were removed, with one mismatch (non-A or non-T) allowed for polyA/T tails of 10 or more. Both 3' EST and cDNA sequences were aligned to the repeat-masked human genome v27.35a.1, mouse genome v27.33c.1 and rat genome v27.3e.1 using the Megablast program [20]. We did not use a specific exon junction mapping software, since we were only interested in the terminal part of the 3' exon. All hits presenting at least 95% identity with the genomic sequence were retained (hit size >28 nt at default E-value). Partial hits flanking a repeat masked region of the genome were then realigned to the locally unmasked region. Hits with 95% identity after this step were retained. Clusters were formed with ESTs having either their 5' or 3' extremities falling within a 10 nt distance. ESTs were not oriented at this stage. Each cluster was analyzed using a sliding window to locate the most likely cleavage site, defined as the position where the window contains the most EST/cDNA ends. The following filters where then applied:
(i) Dangling ends: discard hits with more than 5 unmatched nt at cleavage site
(ii) Internal priming: discard cleavage sites flanked by A-rich region (at least 9 As out of 10 nt) in the 50 nt downstream genomic sequence
(iii) Poly(A) signal: retain only cleavage sites where 30 nt upstream genomic sequence contains one of the 11 variant poly(A) signal identified in our previous study [2]: AATAAA, ATTAAA, AGTAAA, TATAAA, CATAAA, GATAAA, AATATA, AATACA, AATAGA, AATGAA, ACTAAA
Only those cleavage sites passing the filters and supported by at least two ESTs/cDNAs were retained as predicted poly(A) sites. From the starting EST/cDNA datasets we finally retained 718,927 ESTs and 19,626 full length cDNAs to identify 66,647 different poly(A) sites in human, 739,259 ESTs and 23,069 full length cDNAs to identify 52,232 different poly(A) sites in mouse and 119,946 ESTs to identify 27,494 different poly(A) sites in rat.
Assignment of tandem Poly(A) signal sites
Poly(A) sites were assigned to transcript sequences taken from Ensembl 27.35a.1 [21]. If the poly(A) site lied within one or more annotated transcripts, downstream of the end of translation, the site was affected to each of these transcripts. If the poly(A) site lied upstream of the end of translation, then it was considered as "in CDS" and is not used for analysis. If the poly(A) site did not map to any annotated transcript, it was affected to the nearest 5' transcript. Poly(A) signals were assigned to their respective poly(A) sites by taking the signal that was closest to the 5'-most poly(A) site in each cluster. Only poly(A) sites mapping to the 3'-most exon of an Ensembl gene or its genomic downstream region up to 10 kb were considered further.
Assignment of conserved Poly(A) sites
Ortholog human/mouse (or mouse/rat) gene pairs were obtained from EnsMart [22]. All genes with paralogs were omitted from the analysis. 3'UTR regions assigned in Ensembl including up to 10 kb downstream genomic sequence of all transcripts were aligned by ClustalW with default parameter. Predicted Poly(A) sites were then defined as conserved if they were within a distance of 30 bp of a properly aligned poly(A) signal and had EST-support in both human and mouse. In the case where multiple poly(A) signals were associated to a single cleavage site, the signal closest to the cleavage site was used for the analysis. Multiple cleavage sites that were associated to the same poly(A) signal were omitted.
Mouse eVOC ontology mapping
Anatomical terms in mouse cDNA libraries were mapped to anatomical systems from the eVOC ontology ver 2.6 [12]. Terms that matched exactly to human eVOC terms were kept and all mixed terms (e.g. lung and colon) were manually checked to see if each component could match to a human eVOC term. Any terms that could not be directly mapped were classified as "unclassifiable". Finally all libraries were assigned to 12 anatomical systems: anatomical site, cardiovascular, respiratory, hematological, lymphoreticular, alimentary, urogenital, endocrine, musculoskeletal, dermal, nervous and unclassifiable.
Correlation analysis
χ2 test, t-test, calculation of correlation coefficient and Fisher's z-transformation were performed using Microsoft Excel 2002. Distributions of observed and random distributed values were calculated using dedicated Perl scripts.
Efficiency and tissue specificity
All EST counts were performed after discarding EST libraries annotated as normalized in the dbEST database (3% of overall human and mouse libraries). The relative efficiency R of a poly(A) site was calculated as a ratio of number of ESTs,
where nx,iis the number of ESTs of poly(A) site i within gene X, and nX,maxis maximum number of ESTs of any tandem poly(A) site within gene X.
Sites with a ratio higher than 0.5 were defined as "major" (high efficiency), while other sites were defined as "minor" (low efficiency).
The specificity of a poly(A) site was defined as the number of different expression systems in which this site was utilized, normalized by the number of supporting ESTs as follows. For a poly(A) site supported by N ESTs, we calculated an expected number of expression systems as the average number of expression systems obtained in 10,000 random sets of N ESTs sampled from the complete set of ESTs mapping any poly(A) site. Specificity S was then calculated as the log-ratio of the number of expression systems:
where T is observed number of different expression system in each poly(A) site, T
sim [n]
is the simulated number of different expression systems for n ESTs.
For genes with tandem poly(A) sites, a relative tissue specificity U was calculated for each site, as the ratio of tissue specificities:
where SX,iis tissue specificity of each tandem poly(A) site i within gene X, SX,maxis maximum tissue specificity of tandem poly(A) site within gene X, and S
min
is minimum value of S in each species. We adjusted the interval of relative tissue specificity to 0–1; 0 being the most specific site and 1 being the least specific. Median ratio was 0.90 for human, 0.88 for mouse. Sites with a ratio higher than these values were defined as "broad" while others were defined as "narrow".
A usage pattern was defined as the sequence of relative efficiencies or relative tissue specificities for all tandem poly(A) sites in a gene. For each usage pattern, the expected value E of the number of gene pairs that could randomly bare this pattern was calculated from a random combination of human and mouse gene pairs with conserved tandem poly(A) sites as described below. For a number of conserved poly(A) sites j, the list of relative usage patterns L [j, k] was defined by
L [j, k] = {[r1, r2,...,rj]1, [r1, r2,...,rj]2, ..., [r1, r2,..., rj]k}
where rj = {major or minor} or {broad or narrow}. For a number N of genes and the number of conserved poly(A) sites j, a probability P for human and mouse poly(A) sites having pattern L [j, k] was defined as:
Then the expected value E for the maximal value of k (kmax = 2j-1) is :
Expected frequencies of genes with narrow usage patterns, based on a binomial distribution, were calculated by Microsoft Excel 2002.
Differential use of poly(A) sites
To identify differentially processed poly(A) sites, Fisher's tests were performed on the distribution of the number of supporting ESTs from each expression system against all other systems for each poly(A) site as previously described [6]. A Bonferroni correction for multiple testing was applied. Poly(A) sites supported only by ESTs from pooled tissue libraries were omitted.