Assisted transcriptome reconstruction and splicing orthology
- Samuel Blanquart3,
- Jean-Stéphane Varré4,
- Paul Guertin1, 6,
- Amandine Perrin7, 4,
- Anne Bergeron†1 and
- Krister M. Swenson†2, 5Email author
© The Author(s) 2016
Published: 11 November 2016
Transcriptome reconstruction, defined as the identification of all protein isoforms that may be expressed by a gene, is a notably difficult computational task. With real data, the best methods based on RNA-seq data identify barely 21 % of the expressed transcripts. While waiting for algorithms and sequencing techniques to improve — as has been strongly suggested in the literature — it is important to evaluate assisted transcriptome prediction; this is the question of how alternative transcription in one species performs as a predictor of protein isoforms in another relatively close species. Most evidence-based gene predictors use transcripts from other species to annotate a genome, but the predictive power of procedures that use exclusively transcripts from external species has never been quantified. The cornerstone of such an evaluation is the correct identification of pairs of transcripts with the same splicing patterns, called splicing orthologs.
We propose a rigorous procedural definition of splicing orthologs, based on the identification of all ortholog pairs of splicing sites in the nucleotide sequences, and alignments at the protein level. Using our definition, we compared 24 382 human transcripts and 17 909 mouse transcripts from the highly curated CCDS database, and identified 11 122 splicing orthologs. In prediction mode, we show that human transcripts can be used to infer over 62 % of mouse protein isoforms. When restricting the predictions to transcripts known eight years ago, the percentage grows to 74 %. Using CCDS timestamped releases, we also analyze the evolution of the number of splicing orthologs over the last decade.
Alternative splicing is now recognized to play a major role in the protein diversity of eukaryotic organisms, but definitions of spliced isoform orthologs are still approximate. Here we propose a definition adapted to the subtle variations of conserved alternative splicing sites, and use it to validate numerous accurate orthologous isoform predictions.
The knowledge of all protein isoforms that may be expressed by a gene is fundamental. Recently, several computational methods have been proposed for transcriptome reconstruction that use RNA-seq data for exon identification, and expression levels data for transcript assembly [1, 2]. While exon identification performs quite well, transcript assembly remains difficult for complex transcriptomes. As shown in , the best-performing computational methods identified at most 21 % of spliced protein-coding transcripts from H. sapiens, and transcript detection remains low even with very high sequencing coverage, leading the authors to conclude that improved results would have to wait for technological advances. Those findings were confirmed by several other studies [3–5], that include methods recently developed such as StringTie .
Given these limitations of ab initio transcript prediction, it is natural to investigate assisted transcriptome reconstruction, in which the knowledge of transcript structures is transferred from one species to another: since transcripts provide a road-map of the successive links between exons, it should be possible to distinguish transcripts that may be expressed from those that may not, by analyzing the sequence of the target genes.
An essential problem is the assessment of the predictions. Confirmation of the prediction that “transcript t may be expressed in the target species” is only possible through experimental validation, which can be a long and costly process, since many transcripts are detectable only in specific cells, under specific conditions . Gene predictors that make explicit use of external evidence to predict eukaryotic gene structure have been around since the turn of the century , and with the discovery of alternative transcripts in recent years, hundreds of thousands of predicted isoforms are now available in databases . For example, as of June 2016, Felis catus had 32 842 predicted isoforms in the RefSeq database, of which 365 are confirmed, and Canis lupus familiaris had 45 430 predicted isoforms of which 1 644 are confirmed. These predictions were made by the Gnomon predictor, described in an unpublished paper , and whose performance is unknown, especially for predictions that rely exclusively on external evidence.
In order to evaluate the predictive power of transcript annotation transfer, it is necessary to identify splicing orthologs, loosely defined as transcripts from ortholog genes with similar splicing patterns. Zambelli et al. , introduced the concept and gave three possible definitions that yielded estimates ranging from 31 to 86 % for the percentage of human transcripts that have a splicing ortholog in mouse. This wide range of estimates is an indicator of how the concept is still a work-in-progress. A subsequent paper of Fong et al.  simplifies the definition of splicing orthologs as “…all protein-coding exons in the two proteins can be paired with 90 % overlap in lengths of both exons.”. This is clearly an over-simplification that ignores, among other things, nearby alternative donor or acceptor splice sites .
In this study, we revisit the concept of splicing orthology and we give a comprehensive assessment of the performance of assisted transcriptome reconstruction using the human to predict mouse, and vice versa.
Our experimental set-up is based on a procedural definition of splicing orthologs that is concurrently implemented by two procedures. Our predictor uses transcripts from a known species to predict transcripts in a target species, and for evaluation purposes, identifies putative pairs of orthologous transcript isoforms based solely on nucleotide sequence evidence. Our controller identifies putative pairs of orthologous protein isoforms between human and mouse using amino acid sequences and positions of exon junctions.
Transcripts from orthologous genes with differing splicing patterns could have easily identifiable differences in the number or identity of their exons. However, defining splicing orthology can be more difficult due to the presence of alternative donor and acceptor splice sites, where exons are elongated or truncated, often by a very few nucleotides. Within a gene, these exon isoforms  overlap and have different exon-intron border(s).
Across species, we define orthologous exon isoforms as orthologous exons that have conserved exon-intron borders. Using this concept, splicing orthologs are defined as transcripts of orthologous genes, whose exons are orthologous exon isoforms that appear in the same order in each gene, and that code for similar proteins.
In the ideal situation where all splicing orthologs are conserved across species in one-to-one orthologous gene pairs, all exons — with their flanking intronic sequences — should be best reciprocal hits, and the alignments should preserve exon-intron borders. Each protein would have a perfect match that should be a unique best reciprocal hit, the alignments should preserve the positions of exon junctions, and be without gaps in the near vicinity of junctions.
Given a pair of orthologous genes, the first task of the predictor is to establish a common reference sequence  for the set of all transcripts in both the human and mouse. Our solution is based on the concept of blocks, which are alignments of human and mouse exon segments that are either contained in, or disjoint from, each human or mouse transcript.
To ensure internal exons best reciprocal hit property, every human and mouse sequence in a block is Blasted, in the orthologous gene, using flanking genomic sequences of length 20. Further details on the predictor algorithm can be found in Additional file 1.
The fact that all blocks can be found by a Blast search allows the predictor to simulate a pure predicting mode, that does not have any prior knowledge of the target species transcripts. Given a gene model M, the predictor reports all blocks and signals that were found in the target gene, and reports its successes as a substring S of the gene model.
This paper predicts in flexible mode [2, 11], where orthologous transcripts must have the same splicing pattern, but the first and last exons are only required to have orthologous internal exon-intron borders. A transcript is said to be executable by the target gene if its donor and acceptor blocks are in S. The primary task of the predictor is to classify transcripts as executable or not executable.
In the simulation context of this paper, a secondary task of the predictor is to classify executable transcripts of the known species into found or yet-to-be-found, using the transcripts of the target species. When a known transcript k has a match t in the list of transcripts of the target species, the predictor reports a pair of putative ortholog transcripts (k,t). Thus the output of the predictor is a list of transcript pairs, a list of yet-to-be-found transcripts, and a list of non-executable transcripts.
The controller outputs a list of putative human/mouse protein ortholog pairs whose underlying transcripts have the same splicing patterns. For each gene, the controller constructs a list of candidate pairs, that are unique best reciprocal hits among protein isoforms of the gene with the same number of exons.
Finally, the right segment of the alignment of the first exon of each transcript, and the left segment of the alignment of the last exon, are tested for conservation by requiring non-negative Blosum62 scores, over a maximum of 10 amino acids. This last step is necessary because the position of the first (or last) exon junction can be preserved even when the first (or last) exons are different.
We implemented a co-validation process between the two procedures, with the principle that both procedures should agree on their conclusions: disagreements are treated as errors.
A transcript pair found by the predictor that also belong to the controller list is labeled splicing orthologs.
A transcript k classified as yet-to-be-found, for which the controller list does not contain any entry with transcript k, is labeled confirmed yet-to-be-found.
A transcript k classified as non-executable, for which the controller list does not contain any entry with transcript k is labeled confirmed species-specific.
Found by the predictor, but not contained in the controller list.
Yet-to-be-found by the predictor, but contained in the controller list.
Species-specific by the predictor, but contained in the controller list.
The number of each of these types of errors are reported and discussed in the next section.
Results and Discussion
All 15 513 one-to-one ortholog gene pairs with at least one CCDS isoform in human and in mouse were selected in the Ensembl database , as of 07/30/2015, the date of CCDS Release 19. Of these, the predictor could analyze 14 992 pairs of genes, totaling 24 382 human and 17,909 mouse transcripts with more than one exon. The 521 genes rejected at this stage had homology problems such as duplicated or rearranged exons, or were not one-to-one orthologs. The controller list computed on this set has 11 904 pairs of putative protein ortholog pairs. When transcripts from the same species are splicing orthologs, thus having the same splicing pattern, both procedures keep the isoform with the smallest CCDS accession number.
The first experiment predicts mouse transcripts using human as the known species. In this case, there are about 36 % more transcripts in the known species than in the target species. The second experiment mirrors the first, with the known species (mouse) having fewer transcripts than the target species (human). We also partitioned the complete set of predictions according to the CCDS release in which the transcript of the known species first appears.
We looked at the complete set of predictions, and also at subsets ordered by their CCDS release numbers. The results are analyzed from three different perspectives: What is the difference between early and late predictions? If we had made these predictions in the past how long would it have taken to confirm them? What is the impact of the size of the predicting set of transcripts? Unresolved cases are discussed at the end of the section.
Early cohort has good precision and recall values
Mouse transcripts as predicted by human transcripts
Total number of analyzed transcripts
Proportion of yet-to-be-found transcripts
Proportion of human-specific transcripts
Splicing orthologs, by mouse cohorts
Known mouse transcripts
The precision value is the proportion of splicing orthologs over the total number of positive predictions, defined here as the sum of the number of splicing orthologs and and the number of confirmed yet-to-be-found transcripts. For the first experiment we have a precision of (11122)/(11122+5969)=0.65.
The recall value is defined in the literature as the proportion of correctly identified isoforms with respect to the number of real isoforms, and is computed using the number of splicing orthologs over the number of known mouse transcripts, and is (11122)/(17909)=0.62.
The same precision and recall computations were done after partitioning the predictions in three time-stamped cohorts, each ending at a CCDS Mouse release. Cohort C 1 contains transcripts known at Release 2 (10/2006); cohort C 2 contains transcripts added to CCDS between Releases 3 and 7 (01/2011); cohort C 3 contains transcripts added to CCDS between Releases 8 and 19 (07/2015). Note that the number of splicing orthologs is different for human and mouse cohorts, but they all add up to the total of 11 122 transcript pairs: indeed, transcripts k and t of a pair (k,t) may belong to different cohorts.
For the first cohort, the precision and recall are, respectively, 0.85 and 0.74. Meaning that 74 % of the transcripts of the first mouse cohort are correctly predicted, which is more than three times the 21 % obtained by the purely computational methods tested in . Precision and recall fall with subsequent younger cohorts, and this phenomenon is discussed in the next section.
Table 1 also shows the large increase in the proportion of yet-to-be-found transcripts, from 0.12 in cohort C 1 to 0.46 in cohort C 3. This increase was expected, and is discussed further in the next section.
On the other hand, the increase of the proportion of human-specific transcripts, from 0.15 in cohort C 1 to 0.4 in cohort C 3 was unexpected. Possible explanations would be that the first cohort is dominated by highly expressed ubiquitous isoforms that were detected early; and/or that species-specific transcripts are less expressed, and are found later; and/or that recent sequencing experiments are more focussed on exploring the difference between mouse and human.
Performances get better over time
Here we analyze the evolution over time of the number of splicing orthologs, for a fixed number of positive predictions, which drives the evolution of precision values. We restricted this analysis to genes that have a large number of isoforms, defined here as genes that have at least two different CCDS isoforms in human and in mouse. This is a subset of 4253 positive predictions of the first experiment. Each cohort of this subset has respectively, 1464, 1404 and 1385 positive predictions.
The three colored curves of Fig. 3 correspond to the partition of the set of predictions in cohorts C 1 to C 3. They start at different heights, with cohort C 1 leading, but the fact that all curves have a similar shape, notably in the last half of the time interval plotted, was highly unexpected. Indeed, since the number of positive predictions is fixed for each cohort, the number of cumulative confirmed predictions must eventually become almost flat, with less and less predictions being confirmed. We expected such a phenomenon, especially for cohort C 1 whose precision is currently 74 %, but the saturation effect is barely perceptible in this subset.
A similar shape in cumulative curves indicates a similar absolute growth over time. This means that, for the last four releases, all cohorts have roughly the same number of new splicing orthologs. Given that all three have similar numbers of positive predictions, we expect the number of splicing orthologs to grow significantly in future CCDS releases, at least for genes that have a large number of isoforms.
The number of known species transcripts influences both precision and recall
Human transcripts as predicted by mouse transcripts
Total number of analyzed transcripts
Proportion of yet-to-be-found transcripts
Proportion of mouse-specific transcripts
Splicing orthologs, by human cohorts
Known human transcripts
Unresolved cases are the results of disagreements between the predictor and the controller. Both procedures are experimental, in the sense that they depend on parameters and thresholds, and adjusting these values was done with the goal of maximizing agreement, which is 95.0 % when human transcript is used to predict mouse transcripts.
There are 1018 unresolved cases in total, since most of them appear in both experiments. Of them, 369 ortholog pairs are found by the predictor but not by the controller and often reflect a built-in stringency of the controller. In order to distinguish subtle modifications in the regulation of alternative transcription, such as nagnag alternative transcripts  that add or delete one amino acid next to an exon junction, the controller rejects every pair that contains deletions in near junctions. Using the nucleotide sequences, the predictor has an advantage in deciding whether such insertion or deletion is a true mutation. The controller can also detect frame-shifts, where a single mutation can cause two dissimilar proteins in different species. This is currently not verified by the predictor.
There are 649 instances of pairs included in the controller list for which one of the transcript is predicted yet-to-be-found or species-specific by the predictor. They are due to pairs incorrectly included in the controller, or to the built-in stringencies of the predictor. In the case of the controller, the main source of errors is the presence of conserved alternative exons of the same length, yielding two possible isoforms A and B in the human, and A ′ and B ′ in the mouse. If the current state of the database contains A and B ′, the controller will pair them, but the predictor will correctly detect that the transcripts differ by one exon. When three of the four isoforms are present, the unique best-hit property of the controller construction will resolve the conflict. A second source of errors from the controller is the presence of very small first or last exons, sometimes as small as 1 nucleotide.
We gave a rigorous high level definition of splicing orthology and implemented it with a dual predictor/controller. We applied the methods to the CCDS human and mouse sets of isoforms and classified them into pairs of splicing orthologs.
We also showed that, for the prediction that could have been made eight years ago, human transcripts would have correctly predicted 7 024 of the 9 486, thus 74 % of the known mouse transcripts at that time. We showed that this percentage is a lower bound, since predictions for that cohort are still being confirmed with each new release of the CCDS database, driven by the discovery of predicted isoforms of genes that have a large number of isoforms.
Our list of 11 122 confirmed splicing orthologs is available in Additional file 3, together with their common block representations. It is intended as a benchmark for predictors, and as a data source for researchers interested in studying the conservation of alternative splicing.
Publication costs for this article were funded by the Inria Associate Team program.
This article has been published as part of BMC Genomics Vol 17 Suppl 10, 2016: Proceedings of the 14th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-17-supplement-10.
AB is partially supported by Canada NSERC Grant number 05729-2014, and by Équipes associées Inria-FRQNT Grant number 188128. This research is supported by the Inria Associate Team program.
Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files.
AB, SB, KS and JSV conceived the study, modeled the validation process, and analyzed the experiments. PG and AP wrote the draft code for constructing the gene models. AB, SB, KS and JSV modeled the predictor. SB and JSV wrote the final code for constructing the gene models. SB and JSV wrote the code for the predictor. AB and KS modeled the controller. KS wrote the code for the controller. AB wrote the manuscript. SB, KS and JSV revised the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using rna-seq. Nat Methods. 2011; 8(6):469–77.View ArticlePubMedGoogle Scholar
- Steijger T, Abril JF, Engstrom PG, Kokocinski F, Hubbard TJ, Guigo R, Harrow J, Bertone P, Abril JF, Akerman M, Alioto T, Ambrosini G, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013; 10(12):1177–84.View ArticlePubMedGoogle Scholar
- Angelini C, De Canditiis D, De Feis I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinforma. 2014; 15(135):1–25.Google Scholar
- Jänes J, Hu F, Lewin A, Turro E. A comparative study of rna-seq analysis strategies. Brief Bioinform. 2015; 16(6):932–40.View ArticlePubMedPubMed CentralGoogle Scholar
- Hayer KE, Pizarro A, Lahens NF, Hogenesch JB, Grant GR. Benchmark analysis of algorithms for determining and quantifying full-length mRNA splice forms from rna-seq data. Bioinformatics. 2015; 31(24):3938–45.PubMedPubMed CentralGoogle Scholar
- Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT, Salzberg SL. Stringtie enables improved reconstruction of a transcriptome from rna-seq reads. Nat Biotechnol. 2015; 33(3):290–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Kelemen O, Convertini P, Zhang Z, Wen Y, Shen M, Falaleeva M, Stamm S. Function of alternative splicing. Gene. 2013; 514(1):1–30.View ArticlePubMedGoogle Scholar
- Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001; 17(suppl 1):140–8.View ArticleGoogle Scholar
- Fong JH, Murphy TD, Pruitt KD. Comparison of refseq protein-coding regions in human and vertebrate genomes. BMC Genomics. 2013; 14(1):1.View ArticleGoogle Scholar
- Souvorov A, Kapustin Y, Kiryutin B, Chetvernin V, Tatusova T, Lipman D. Gnomon – ncbi eukaryotic gene prediction tool. Technical report, National Center for Biotechnology Information, Bethesda. 2010. http://www.ncbi.nlm.nih.gov/core/assets/genome/files/Gnomon-description.pdf. Accessed 23 Sept 2016.
- Zambelli F, Pavesi G, Gissi C, Horner D, Pesole G. Assessment of orthologous splicing isoforms in human and mouse orthologous genes. BMC Genomics. 2010; 11:534.View ArticlePubMedPubMed CentralGoogle Scholar
- Hiller M, Huse K, Szafranski K, Jahn N, Hampe J, Schreiber S, Backofen R, Platzer M. Widespread occurrence of alternative splicing at nagnag acceptors contributes to proteome plasticity. Nat Genet. 2004; 36(12):1255–57.View ArticlePubMedGoogle Scholar
- Thanaraj T, Clark F. Human gc-ag alternative intron isoforms with weak donor sites show enhanced consensus at acceptor exon positions. Nucleic Acids Res. 2001; 29(12):2581–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Ouangraoua A, Swenson KM, Bergeron A. On the comparison of sets of alternative transcripts. In: Bioinformatics Research and Applications, vol. LNBI 7292. Berlin: Springer: 2012. p. 201–12.Google Scholar
- Sellers PH. The theory and computation of evolutionary distances: Pattern recognition. J Algoritm. 1980; 1(4):359–73.View ArticleGoogle Scholar
- Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L, et al. Ensembl 2016. Nucleic Acids Res. 2016; 44(D1):710–6.View ArticleGoogle Scholar
- Burset M, Guigo R. Evaluation of gene structure prediction programs. genomics. 1996; 34(3):353–67.View ArticlePubMedGoogle Scholar
- Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SM, Aken B, et al, Current status and new features of the consensus coding sequence database. Nucleic Acids Res. 2014; 42(D1):D865–D872.Google Scholar