After the completion of the Human Genome Project, annotating the whole sequence became a major challenge. Most of the initial efforts focused on the annotation of coding sequences. Increasing attention, however, is now also given to non-coding RNAs and untranslated regions. To this end, different methods have been developed in recent years to annotate some of these elements using deep sequencing data [24–26]. For example, a novel approach using histone methylation marks has been successfully used to identify large intervening non-coding RNAs (lincRNAs). In addition, important progress has been made on the 3′UTR annotation of model organisms, such as Caenorhabditis elegans, using high-throughput approaches [27, 28]. However, we believe that 3′UTR annotation in vertebrates also needs to be complemented with new and specific methods such as the ones described in this work.
To develop appropriate new approaches, different aspects of the nature of 3′UTRs have to be considered. One particular hurdle in 3′UTR annotation is that the highly conserved elements that may be present in these sequences, and that could be used to define them, can be separated from one another by other elements, such as Alu sequences, that are not necessarily conserved. Another difficulty is that untranslated regions, in particular 3′UTRs, can be several kilobases long. This is particularly important because most of the transcriptional information used for annotation comes from ESTs, which typically cover only a few hundred bases, or from incomplete cDNA sequences. Moreover, while coding sequences are spliced and exon junctions can be precisely delimited by short ESTs, this cannot be done for long unspliced 3′UTRs. In addition, while coding sequences have a reading frame that provides useful information for correct annotation, there is no equivalent feature for 3′UTRs. Until now, the position of the PAS was usually determined by clustering ESTs ending at the same position and checking for the presence of a polyA tail not encoded in the genome. In addition, the presence of the canonical HexM approximately twenty bases upstream of the PAS was usually considered a good indicator of a bona fide 3′ end, although we now know that less than 60% of PASs are associated with a canonical HexM.
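The EST-based procedure summarized above (cluster EST 3′ ends, check for an untemplated polyA tail, look for the canonical hexamer upstream) can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names, the clustering window, the minimum EST support, the minimum polyA length and the 10 to 40 base search interval are all hypothetical choices made for illustration, and coordinates are assumed to be 0-based positions on the transcribed strand.

```python
from collections import Counter

# Canonical polyadenylation hexamer, written in the DNA alphabet.
CANONICAL_HEXAMER = "AATAAA"


def cluster_est_ends(est_ends, window=10, min_support=2):
    """Group EST 3' end positions separated by at most `window` bases.

    Returns (representative_position, n_supporting_ests) tuples, where the
    representative is the most frequent end position in the cluster.
    `window` and `min_support` are illustrative defaults (assumptions).
    """
    clusters, current = [], []
    for pos in sorted(est_ends):
        if current and pos - current[-1] > window:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    sites = []
    for cluster in clusters:
        if len(cluster) >= min_support:
            representative, _ = Counter(cluster).most_common(1)[0]
            sites.append((representative, len(cluster)))
    return sites


def has_untemplated_polya(est_tail, genomic_downstream, min_a=8):
    """True if the EST ends in a run of A's that the genome does not encode."""
    tail_run = len(est_tail) - len(est_tail.rstrip("A"))
    genome_run = len(genomic_downstream) - len(genomic_downstream.lstrip("A"))
    return tail_run >= min_a and genome_run < min_a


def has_canonical_hexamer(genome, cleavage_pos, search_from=40, search_to=10):
    """Look for AATAAA between `search_from` and `search_to` bases upstream."""
    start = max(0, cleavage_pos - search_from)
    end = max(0, cleavage_pos - search_to)
    return CANONICAL_HEXAMER in genome[start:end]
```

A candidate site would then be retained when it is supported by several ESTs and at least one of them carries an untemplated polyA tail. Because fewer than 60% of PASs carry the canonical hexamer, the hexamer test is better treated as supporting evidence than as a filter.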
From an evolutionary point of view, however, previous studies also showed that PASs can be conserved between mice and humans and among different vertebrates. In this study, we observed that the drop in genomic conservation after the PAS can be used to identify unannotated 3′ ends. Interestingly, this change in conservation was not observed for TSSs, possibly due to the strong evolutionary constraints acting upon 3′UTRs. In addition, different evolutionary forces shape the proximal regions of PASs and TSSs. For example, 5′UTRs are flanked by promoters containing transcription factor binding sites, which are necessary to recruit the transcriptional machinery to the TSS. In contrast, apart from the DSE necessary for a correct cleavage reaction, there are no other well-described conserved elements crucial for the 3′ end processing of the messenger. Consistent with this view, Xie and colleagues showed that most conserved elements in the 5′ regions of genes are associated with binding sites for transcription factors, which usually fall outside the mature transcript, while most conserved elements in the downstream region correspond to AU-rich elements and binding sites for miRNAs, which fall inside the mature transcript.
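The idea of a drop in conservation across a candidate PAS can be made concrete with a short sketch. The score below is simply the mean per-base conservation in a window upstream of the site minus the mean in a window downstream. It is an illustrative stand-in and not the CDI defined in this work; the function name, the 100-base flank and the input format (a plain Python list of per-base conservation values ordered along the transcribed strand) are assumptions.

```python
def conservation_drop(scores, site, flank=100):
    """Mean conservation upstream of `site` minus mean conservation downstream.

    `scores` is a list of per-base conservation values along the transcribed
    strand and `site` is the index of the candidate cleavage position.
    A large positive value indicates a sharp loss of conservation after the
    site. The window size `flank` is an illustrative assumption.
    """
    upstream = scores[max(0, site - flank):site]
    downstream = scores[site:site + flank]
    if not upstream or not downstream:
        raise ValueError("site too close to the edge of the scored region")
    return sum(upstream) / len(upstream) - sum(downstream) / len(downstream)
```

Under this toy definition, a well-conserved 3′UTR followed by poorly conserved intergenic sequence yields a clearly positive score, whereas a site embedded in uniformly conserved (or uniformly unconserved) sequence yields a score close to zero.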
In this study, we used ESTs to define unannotated PASs that in general did not overlap with nearby annotated genes. Thus, they did not constitute direct evidence of a physical link between the proximal annotated gene and the predicted extended 3′UTR. However, the CDI value of the PTEs depended on whether they were putative PASs or TSSs, as shown in Figure 2. In particular, the distribution of the putative PASs resembled that of the annotated PASs. This observation strongly suggests that at least some of these sites correspond to the 3′ ends of annotated genes. To provide evidence for a physical link between the annotated genes and their putative extensions, we examined the existence of orthologous transcripts using the predicted distal PAS. We found that higher CDIs were associated with an increased probability of finding a transcript from another species supporting the predicted PAS. This result indicates that this conservation parameter can facilitate the identification of bona fide 3′UTR extensions.
To look for further evidence of a physical link between the predicted extensions and the annotated gene, we then combined the transcriptional data from other species with deep sequencing reads from total RNA of different human tissues. Given the large number of short tags generated with this technique, it is now possible in many cases to reconstruct almost the entire 3′UTR of highly expressed genes. The brain-specific isoform of the ADD2 gene, with a 3′UTR of more than 6 kb, is an example. Although read coverage is generally not complete for genes expressed at lower levels, these gaps could potentially be filled by increasing the sequencing depth. Importantly, the signal-to-noise ratio of this method allowed us to detect a clear transition in the RNA signal for both highly and lowly expressed transcripts.
Although overlapping deep sequencing reads can reconstruct a putative extended 3′UTR, the possibility remains that the locus under consideration encodes different overlapping transcripts that are not physically linked. Using Northern blot analysis, we showed that this was not the case for two PTEs. In both cases, we observed a high molecular weight band that corresponded to the predicted isoform with an extremely long 3′UTR. We previously provided evidence of the existence of high molecular weight molecules by Northern blot analysis for the unannotated CPEB3 and ADD2 isoforms, with similar results [15, 16]. Interestingly, we found other examples in the literature, such as the TNR and RORB genes, where very high molecular weight bands were detected by Northern blot for their rodent orthologs [32, 33]. These isoforms are longer than the annotated isoforms, but they are consistent with the usage of the predicted PTEs identified in this study. Moreover, the majority of the isoforms that we predicted were abundantly expressed and might correspond to the most representative transcripts, if not the only ones, used by these genes in specific tissues. In recent updates of the human datasets, we noticed the incorporation of new lincRNAs that partially overlap with, and share the same 3′ end as, our predicted 3′UTR extensions. We therefore speculate that some of these lincRNAs may be part of the 3′UTR of the immediately upstream gene, as we have experimentally shown here for the KCNB1 gene (Figure 6). Northern blot analyses are required to evaluate this possibility in each case.
Comparative analysis using genomic and transcriptomic information from different species revealed that gene structure can be highly conserved across vertebrates and could be used to improve gene annotation. This observation, coupled with the presence of evolutionarily conserved genomic signatures and high-throughput transcriptomic data, facilitated the design of novel approaches that can be used to improve the gene annotation of extensively curated databases. Indeed, in this work, we identified a few hundred conserved PASs not represented in the current Ensembl human predictions, possibly the most exhaustive annotation of the human genome. Given that the identification of the PASs is strongly based on conservation across species, we reasoned that the methodology could be applied to other vertebrates. Therefore, we extended our analysis to rats and dogs using total RNA-Seq data recently deposited in public archives. The annotation of the rat genome is at an advanced stage because the species has been extensively used as an animal model. Still, we detected several hundred conserved 3′UTRs not present in the current databases. The annotation of the dog genome is instead at an early stage, and we accordingly found thousands of genes with distal PASs not represented in the existing models. Since the approach could be applied to different mammals, we propose that the method can be incorporated into the annotation pipeline of any species in the phylum. Moreover, we believe that validating the predictions in different organisms using species-specific transcriptional data will increase confidence in the orthologous models.
The RefSeq and Ensembl databases, among others, are intended to provide a common framework for genome-wide analyses, facilitating communication between laboratories around the world. However, a recurrent result from high-throughput sequencing studies is that a considerable number of transcribed elements map to unannotated regions, including untranslated regions of protein-coding genes [4, 5]. For example, this was observed in an early study that investigated the transcripts bound by the RNA-binding protein Nova, previously known to participate in the regulation of alternative splicing. Unexpectedly, the authors found that many binding sites fall within 3′UTRs or a few hundred bases downstream of the annotated PASs, presumably on unannotated 3′UTR extensions. Interestingly, NOVA2, one of the two members of the Nova family, appears to have a highly conserved unannotated 3′UTR itself (data not shown). Similar results were recently obtained by another group studying the RNA-binding protein TDP-43.
Different cis-acting elements present in 3′UTRs can be recognized by RNA-binding proteins and/or small RNAs. The effect of some of these trans-acting factors has been demonstrated using high-throughput techniques. In particular, it has been elegantly shown using genome-wide microarrays that variations in the levels of a miRNA change the stability of the population of mRNAs containing the specific recognition motif for that particular small RNA [37, 38]. Importantly, many of these regulatory factors have subtle effects on their targets. Therefore, the best way (and possibly the only one) to study their function is by analyzing their overall impact on the transcriptome. A better annotation of 3′UTRs will facilitate the identification of all the potential targets of a particular regulator. This, in turn, will increase the signal-to-noise ratio of the effect of the factor at a genome-wide level.