Heuristic pairwise alignment of de Bruijn graphs to facilitate simultaneous transcript discovery in related organisms from RNA-Seq data

Background The advance of high-throughput sequencing has made it possible to obtain new transcriptomes and study splicing mechanisms in non-model organisms. In these studies, there is often a need to investigate the transcriptomes of two related organisms at the same time in order to find the similarities and differences between them. The traditional approach to address this problem is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms independently and then employ similarity comparison algorithms to study them. Results Instead of obtaining predicted transcripts for these organisms separately from the intermediate de Bruijn graph structures employed by de novo transcriptome assembly algorithms, we develop an algorithm to allow direct comparisons between paths in two de Bruijn graphs by first enumerating short paths in both graphs, and iteratively extending paths in one graph that have high similarity to paths in the other graph to obtain longer corresponding paths between the two graphs. These paths represent predicted transcripts that are present in both organisms. We show that our algorithm recovers significantly more shared transcripts than traditional approaches by applying it to simultaneously recover transcripts in mouse against rat and in mouse against human from publicly available RNA-Seq libraries. Our strategy utilizes sequence similarity information within the paths that is often more reliable than coverage information. Conclusions Our approach generalizes the pairwise sequence alignment problem to allow the input to be non-linear structures, and provides a heuristic to reliably recover similar paths from the two structures. Our algorithm allows detailed investigation of the similarities and differences in alternative splicing between the two organisms at both the sequence and structure levels, even in the absence of reference transcriptomes or a closely related model organism.


Background
The advance of next generation sequencing technologies has made it possible to perform detailed studies of splicing mechanisms among eukaryotic organisms. A popular strategy is to first sequence their transcriptomes, then map the reads to reference databases. In non-model organisms, such reference databases are often unavailable, and de novo transcriptome assembly algorithms are employed to obtain predicted transcripts [1][2][3][4]. This is often achieved by first constructing a de Bruijn graph structure that contains all branching possibilities [5,6], then obtaining predicted transcripts based on coverage information along the paths.
In many of these studies, there is often a need to investigate the transcriptomes of two related organisms at the same time in order to study their similarities and differences. In these cases, RNA-Seq libraries are obtained from both organisms under different experimental conditions and the goal is to compare their transcriptome assemblies. The traditional approach to address this problem is to perform transcriptome assemblies to obtain predicted transcripts for the two organisms separately (see Figure 1). Similarity comparison algorithms such as BLAST [7] are then employed to extract corresponding transcripts that are shared in the two organisms.
Since predicted transcripts are constructed independently for each organism based on coverage information only, this strategy is often unreliable. To address this problem, we develop an algorithm to allow direct comparisons between paths in the two intermediate de Bruijn graph structures by an iterative extension strategy (see Figure 1). Since sequence similarity information is often more reliable, this strategy allows the direct extraction of shared transcripts based on evolutionary support.

De Bruijn graph
Given a set of reads and a parameter k, a de Bruijn graph is constructed by taking each k-mer that appears within the reads as a vertex. Two k-mers are connected by a directed edge if the (k − 1)-suffix of the first k-mer is the same as the (k − 1)-prefix of the second k-mer [5,6]. The de Bruijn graph implicitly assembles the reads by linking together the overlapping parts, and it is employed as the main intermediate structure by most short read assembly algorithms [8][9][10][11][12]. To obtain a more compact structure, each linear sequence of vertices that have no branches is collapsed into a single node that corresponds to contigs.

Iterative extension
Given de Bruijn graphs G 1 and G 2 that correspond to transcriptome assemblies of two related organisms, we first apply BLAST to obtain similarity scores between each pair of nodes u from G 1 and v from G 2 . We then start the iterative extension process as follows. For each node u from G 1 , we extract its most similar node v from G 2 with e-value below a cutoff. If such a node v exists, we retain u as a single-node path. We extend u by one node along all its outgoing edges into multiple paths, and apply BLAST from each of these extended paths from u against v. If at least one of these extended paths gives a better e-value against v, we retain all the paths that have better e-values and continue to extend the top path that gives the best e-value. We repeat the procedure starting from this new path until the e-value no longer improves. Note that only one best direction is chosen since extending in more than one direction is very time-consuming. By starting from each node u in G 1 independently, the probability of missing the real best path is reduced a lot.
After the above procedure, we have retained u and all the extended paths from u that have improved e-values, with the top path that gives the best e-value being fully extended. We then retain v as a single-node path and perform a similar extension process starting from v by extending it by one node along all its outgoing edges into multiple paths. We apply BLAST from each of these extended paths from v against all the retained paths from u. If at least one of these extended paths gives a better e-value, we retain all the paths that have better e-values and continue to extend the top path that gives the best e-value. Similar to above, we repeat the procedure starting from this new path until the e-value no longer improves to obtain a fully extended path and a set of retained paths from v that have improved e-values (see Figure 2).
We then repeat the entire extension procedure in turn in G 1 and G 2 by replacing u by the fully extended path from u and comparing against all the retained paths from v, and replacing v by the fully extended path from v and comparing against all the retained paths from u. The entire process is repeated until no more improvements can be made, and the algorithm is applied again by switching the role of G 1 and G 2 and repeating all the steps.
To obtain longer paths, we consider the retained paths from each node u and the retained paths from its twin node u′, in which u′ represents the reverse complementary sequence of u on the opposite strand. We merge the twin paths that are complementary to the retained paths from u′ with the retained paths from u, and keep those paths with improved e-values.

Extraction of predicted transcripts
We consider all the retained paths in G 1 as predicted transcripts in the first organism and all the retained paths in G 2 as predicted transcripts in the second organism. Since the collection of all these retained paths can be very big, we only keep a path if it contains a node in the de Bruijn graph that is not covered by another path with a better e-value according to the top BLAST alignment. In this condition, a node is covered by a path if it contains the node itself or its twin node. To avoid a large number of incorrectly predicted isoforms, we remove paths with worse e-values so that each node in the de Bruijn graph along with its twin node appears at most 10 times within the final set of paths.

Extraction of predicted shared transcripts
To obtain predicted shared transcripts that have correspondences between the two organisms, we apply BLAST from each predicted transcript in one organism against the set of all predicted transcripts in the other organism as database. We retain a predicted transcript as a predicted shared transcript if it appears both as a query with BLAST hits from one direction and as a subject BLAST hit in the other direction.

Validation
We implement our algorithm Mutual as a postprocessing module of Velvet [10], which is a popular sequence assembly algorithm that returns a set of contigs along with the de Bruijn graph. We compare our performance to Oases [3], which uses output from Velvet to construct predicted transcripts. We validate our algorithm by applying it to simultaneously recover transcripts in mouse against rat and in mouse against human from publicly available RNA-Seq libraries at the sequence read archive [13], including two libraries from mouse in [14] (SRX017794), one library from rat in [15] (SRX076903), and four libraries from human in [16] (SRX011545). We perform quality trimming by removing all positions including and to the right of the first position that has a quality score of less than 15, resulting in a size of 1.3 G for the mouse libraries, 2.5 G for the rat libraries and 1.1 G for the human libraries.
We apply each algorithm over k = 25, 31, and over k-mer coverage cutoff c = 3, 5, 10. In our algorithm Mutual, iterative extension is applied twice with an e-value cutoff of 0.1 using the bl2seq (BLAST 2 Sequences) variant of BLAST, once with translated BLAST and once with nucleotide BLAST. Velvet and Oases are applied independently in each organism. Since Oases applies the coverage cutoff itself to obtain a de Bruijn graph by modifying Velvet's original de Bruijn graph without coverage cutoff, Mutual is applied on the two de Bruijn graphs given by Oases to obtain predicted transcripts.
To obtain predicted shared transcripts for both Oases and Mutual, we apply both translated BLAST and nucleotide BLAST with an e-value cutoff of 10 −7 or 10 −20 from each predicted transcript in one organism against the set of all predicted transcripts in the other organism as database. The predicted transcripts that appear both as a query with BLAST hits from one direction and as a subject BLAST hit in the other direction are retained as predicted shared transcripts. To evaluate the accuracy of the predicted shared transcripts, we apply nucleotide BLAST to compare them against known mouse, rat or human transcriptome databases using the same e-value cutoff as the one used to obtain the transcripts, which is 10 −7 or 10 −20 . To assess the extent of translocated transcripts, we apply GMAP [17] to map the predicted shared transcripts to known mouse, rat or human genomes.

Predicted transcripts
Tables 1 and 2 show that Mutual constructed less predicted transcripts than Oases. Note that the predicted transcripts from Mutual are obtained by extending similar paths that appear in the two organisms through iterative BLAST, while the predicted transcripts from Oases are obtained independently in each organism. The similarity constraints in Mutual ensure that a predicted transcript in one organism has a similar counterpart in the other organism, albeit with a loose e-value cutoff. The later reciprocal BLAST is needed to enforce more stringent e-value cutoffs. On the other hand, the predicted transcripts from Oases have no such constraints, and reciprocal BLAST is used to obtain shared transcripts.

Predicted shared transcripts
When compared to Tables 1 and 2, Tables 3 and 4 show that only a small percentage of predicted transcripts were shared in the two organisms, with a smaller decrease by Mutual than by Oases. The decrease by Mutual is due to more stringent e-value cutoffs, while the decrease by Oases is due to imposing similarity constraints between the two organisms. While the actual amount of predicted shared transcripts that can be recovered depends on the size of libraries, the evolutionary distance between the two organisms and the experimental conditions, Tables 3 and 4 show that Mutual recovered more predicted shared transcripts than Oases. Almost all these predicted shared transcripts are found in the corresponding known transcriptome database, with comparable percentages between Mutual and Oases. The percentages are lower in rat, probably due to the fact that the rat genome is less well annotated. The number of predicted shared transcripts decreases as the assembly parameters become more stringent, but these transcripts are of higher quality.

Top BLAST hits to databases
By applying BLAST from each set of predicted shared transcripts in each organism to its known transcriptome database, Tables 5 and 6 show that Mutual recovered more shared transcripts than Oases, with many more shared transcripts recovered when the assembly parameters are less stringent. When compared to Tables 3 and 4, there is an effect of diminishing returns since a few thousand more predicted shared transcripts are needed to recover a few hundred more shared transcripts. Figures 3 and 4 show that the lengths of predicted shared transcripts recovered by Mutual were comparable to the ones recovered by Oases, which are slightly shorter in mouse but have slightly higher medians in rat. These transcripts are generally longer when the kmer coverage cutoff c increases.

Recovery of full length transcripts
The situation is different when considering predicted shared transcripts that are close to full length. Tables 7  and 8 show that Mutual recovered more or a comparable number of 80% full length transcripts as Oases when the assembly parameters are more stringent, and less 80% full length transcripts than Oases when the assembly parameters are less stringent. Although Mutual performs worst in rat that recovers less 80% full length transcripts than Oases, its predicted shared transcripts have slightly higher median lengths when considering all the transcripts together (see Figure 3), instead of just the ones that are 80% full length transcripts.  Notations are the same as in Table 1. The number in parentheses is the percentage of predicted shared transcripts that have BLAST hits from each organism to its known transcriptome database. Notations are the same as in Table 3.

Presence of translocated transcripts
As reported by GMAP, Tables 9 and 10 show that Mutual recovered a much larger number of predicted shared transcripts that are uniquely mapped than Oases, while at the same time returning more translocated transcripts that can be considered to be errors due to their rare occurrences [18]. The ratio of the number of translocated transcripts to the number of uniquely mapped transcripts is at most about twice as much for Mutual when compared to Oases. This ratio increases when k decreases or when the k-mer coverage cutoff c increases.

Accuracy of transcript reconstruction
By investigating the fitness of the alignment between the predicted shared transcripts and the known transcriptome database sequences, Figures 5 and 6 show that with respect to the accuracy of shared transcript reconstruction, there are tradeoffs between precision and recall by Mutual when compared to Oases. Mutual has slightly lower F-scores than Oases in most cases.

Availability
A software program that implements our algorithm (Mutual) is available at http://faculty.cse.tamu.edu/shsze/ mutual.

Conclusions
We have developed an algorithm that makes use of evolutionary information to simultaneously recover significantly more shared transcripts from RNA-Seq data in two related organisms that may be missed by traditional de novo approaches. While more shared transcripts are recovered due to the smaller evolutionary distance between mouse and rat, our algorithm can be applied to related organisms that are evolutionarily farther away, such as between mouse and human. While known transcriptomes are used as databases during validation, one important characteristic of our algorithm is that no reference transcriptomes or a closely related model organism are needed. Our algorithm can be used to recover shared transcripts that are specific to two closely related non-model organisms, which may not be present in a related model organism that is evolutionarily farther away.
Depending on the size of the de Bruijn graphs, our algorithm can take many processor-hours to run. It takes more than 600 processor-hours to obtain all the predicted transcripts in mouse against rat or in mouse against human for the least stringent values of k and the k-mer coverage cutoff c. Although our algorithm can take much more computational time than the de novo algorithms Velvet or Oases, the iterative BLAST searches can be run independently in parallel on a computing cluster. While an additional 60 processorhours are needed to obtain predicted shared transcripts from the predicted transcripts, a similar procedure is also needed for Oases. No special memory requirement is needed after the de Bruijn graphs are obtained. Only the top hit with e-value below the cutoff is considered. The number in parentheses is the change by Mutual over Oases. Notations are the same as in Table 5.
One drawback of our algorithm is that only a weak recovery of non-coding regions of mRNA is possible since these regions may not be conserved. Due to the use of similarity information between two related organisms to extend transcripts, our algorithm cannot identify extended transcripts that are not shared between the two organisms.   Table 7 Comparisons of the number of predicted shared transcripts that are 80% full length transcripts in the test on mouse against rat from Oases and from Mutual over different values of k and k-mer coverage cutoff c and over different e-value cutoffs 10 −7 and 10 −20 . Notations are the same as in Table 7.   Notations are the same as in Table 9. Figure 5 Precision, recall and F-score with respect to the accuracy of shared transcript reconstruction in the test on mouse against rat from Oases and from Mutual over different values of k and k-mer coverage cutoff c (represented by k_c) and over different e-value cutoffs 10 −7 and 10 −20 . Precision is defined to be the fraction of query positions from predicted shared transcripts that are included in BLAST alignments from each organism to its known transcriptome database. Recall is defined to be the fraction of subject positions from database sequences that are included in BLAST alignments from each organism to its known transcriptome database. F-score is the harmonic mean of precision and recall. Figure 6 Precision, recall and F-score with respect to the accuracy of shared transcript reconstruction in the test on mouse against human. Notations are the same as in Figure 5.