Multiple Sample Connectivity Graph (MSCG)
A set of RNA-Seq reads from F different samples are first mapped independently to the reference genome using a splice junction detection tool such as Tophat [19], SpliceMap [20], etc. The mapped reads are then clustered into genes, and the exon-intron boundary information for each may be derived from either its junction reads or existing annotations such as NCBI RefSeq [21] or UCSC known isoforms [22]. Based on this information, the sequence of a gene can be split into different expressed segments (or simply segments) [6], where a segment is a continuous region in the reference genome uninterrupted by any splicing events (or exon-intron boundaries).
Several transcriptome assemblers [7, 9, 23] use the Connectivity Graph (CG) to represent the splicing connections between segments or bases on single sample RNA-Seq data. Similarly, for multiple sample RNA-Seq data, we construct a multiple sample connectivity graph (MSCG) G = (V, E) based on F sets of mapped RNA-Seq reads as follows. V = {s, t} ∪ {v
i
|1 ≤ i ≤ M} where v1,... v
M
represents the M segments contained in a gene. (v
i
, v
j
) ∈ E if there is at least one read from the F samples joining both segments i and j. Also, (s, v
i
) ∈ E, (v
i
, t) ∈ E, 1 ≤ i ≤ M (see Figure 1(A)).
For simplicity, s is assigned number 0, t is assigned number M +1, and the vertices v1,...v
M
are all assigned numbers 1 through M. Thus, a vertex in V or an edge in E can be represented as a tuple (i, j) with 0 ≤ i, j ≤ M + 1. For example, (0, i) = (s, v
i
) and (i, M + 1) = (v
i
, t) for 1 ≤ i ≤ M. Similarly, (i, j) ∈ E if (v
i
, v
j
) ∈ E for i ≠ j, and (i, j) ∈ V if i = j (1 ≤ i, j ≤ M).
Isoform discovery on MSCG
In MSCG, every path P from s to t represents a possible isoform of the gene. We further assign weights to each vertex (or edge) in MSCG to represent the probability that the corresponding segment (or junction) is included in an isoform; the higher the weights, the lower probability that they are included. As a result, a shortest path (minimum weight path) represents the most possible isoform of the gene. Furthermore, we make use of paired-end read information (if any) and expression correlation between segments among the samples that could help the reconstruction of isoforms.
Our approaches to assigning weights and utilizing information from paired-end reads and segment expression correlations are described in detail below.
Weight assignment
A weight w
i,j
is assigned for each edge (i, j) ∈ E (if i ≠ j) and each vertex v
i
∈ V (if i = j) to reflect the likelihood that the corresponding segment (or junction) is "problematic": a higher weight is assigned if the segment (or junction) is more likely the product of some incorrectly mapped reads. Notice that w
i,j
may be either positive (considered as "cost") or negative (considered as "reward"). For simplicity, the weight of a path P is represented as), , which is the sum of all weights of the vertices and edges included on P.
Assume that the same number of reads are generated from F samples. For every vertex v
i
∈ V\{s, t}, we assign w
i,i
= − log(d
i
+ 1), where d
i
is the average read density of segment i in the F samples:
(1)
Here, is the number of reads mapped to segment i from the kth RNA-Seq sample, l
i
is the length of the segment i and L is the read length. Since we will look for a shortest path, paths going through segments with high densities are preferred.
Because noisy junctions may result in incorrect assembly results, a higher positive cost is assigned for junction edges that are more likely to be problematic. For every edge (i, j) ∈ E, where 1 ≤ i, j ≤ M and i ≠ j, we set
(2)
where 1 ≤ h ≤ M, P (i, j) represents the probability that the junction between segments i and j is included in an isoform, and d
i,j
is the average read density of edge (v
i
, v
j
) ∈ G:
(3)
where is the number of reads mapped to the corresponding junction from the kth RNA-Seq sample.
For every edge (0, i) ∈ E, we "inactivate" it by setting w0,ito infinity if v
i
can be reached from another vertex v
j
in the MSCG G; otherwise, we set w0,i= 0. Similarly for edge (i, M + 1) ∈ E, wi,M+ 1 is assigned infinity if there is an edge from v
i
to another vertex v
j
in the MSCG G, or 0 otherwise.
Incorporating paired-end read information
Paired-end RNA-Seq reads provide more information than single-end reads in transcriptome assembly, since both ends of a paired-end read come from the same RNA (or cDNA) fragment. To incorporate paired-end read information into our framework, we try to find a path in MSCG to simultaneously minimize the cost of the path and maximize the number of paired-end reads that are compatible with the isoform represented by the path. Reads that are compatible with an isoform are the reads that are possibly generated from the isoform. If a read is compatible with an isoform, the splicing patterns implied by the read and the isoform must be identical. More precisely, a single-end read b containing k segments can be represented as a vector b = (b1, b2,⋯, b
k
), where 1 ≤ b1 < ⋯ <b
k
≤ M are the segments included in b. An isoform I (or a path P ) that b is compatible with must include all the segments b1,⋯, b
k
, and must not include any other segment between b1 and b
k
. A paired-end read p = (b, b′) is compatible with I if and only if both b and b′ are compatible with I.
For each paired-end read p = (b, b′), where b = (b1,⋯ b
k
) and , we define the set of "inclusion segments" IS
p
and "exclusion segments" ES
p
as follows:
(5)
Intuitively, IS
p
(and ES
p
) represents the set of segments that an isoform I must (and must not) include, based on the information of p. For example, if p = ((b1, b3), (b5, b7)), then IS
p
= {b1, b3, b5, b7} (as they are included in p), and ES
p
= {b2, b6} (as they are spliced out in p). For each paired-end read p, we define a binary variable q
p~P
∈ {0, 1} to indicate whether p is compatible with a path P implied in the solution. Given a set of paired-end reads R, maximizing the number of compatible reads with P is equivalent to maximizing), . For each gene, paired-end reads that are mapped to only one segment (i.e., |IS
p
| = 1) are excluded from R, since these reads do not provide any useful information in the assembly.
Resolving ambiguities using Jensen-Shannon metric
In a complicated gene model, an MSCG may give rise to several sets of isoforms due to the existence of segments that introduce ambiguities (named as "uncertain" segments). For example, the MSCG in Figure 1(B) has two edges at each end of segment 4 due to the two uncertain segments, segments 3 and 5. Different combinations of these pairs of edges would lead to two possible sets of isoforms. Paired-end reads can be used to resolve such ambiguity (as in [23]), but it only works if there are paired-end reads mapped to uncertain segments. In [5], isoforms are decomposed such that the expression levels of the segments in one isoform are similar, but this strategy does not consider positional biases [24] and is applied only to a single sample.
In this work, we use Jensen-Shannon metric (or JS metric) to resolve the ambiguity of uncertain segments. JS metric measures the similarity of the expression patterns between samples and was used to analyze differential alternative splicing events [5]. It is defined as the square root of the Jensen-Shannon divergence [25]:
(6)
where H(x) stands for the entropy of the probability distribution x and p
i
is the distribution of segment i among the samples. The latter is calculated based on the read density of segment i (defined in Equation (1)) over all F samples.
If the JS metric of the expression levels of two uncertain segments is low (which means both segments are positively correlated), then both segments are likely to be included on the same isoform (termed "coexpressed segments", see Figure 1(B) for an example). Otherwise if the JS metric is high, they are likely to appear in different isoforms (termed "exclusively expressed segments"). To determine whether two uncertain segments are coexpressed or exclusively expressed, we randomly permute the expression of each segment in a gene 1000 times, and calculate the "background" JS metric distribution P
bg
. For a given false-discovery rate (FDR) β% (controlled by the user; the default is 5%), segments i and j are considered coexpressed (or exclusively expressed) if J S(i, j) is located in the lowest (or highest) β% of P
bg
, respectively. For coexpressed segments i and j, we add some "pseudo" paired-end reads p
c
spanning segments i and j (i.e., IS
pc
= {i, j}) to the read set R. These reads will encourage our algorithm (i.e., ISP) to prefer paths that include both segments i and j. Similarly for exclusively expressed segments i and j, paired-end reads p
e
with IS
pe
= {i} and ES
pe
= {j} are added to R.
The objective function and complexity of the problem
Using the notations defined in previous sections, given an MSCG G and a single-end and/or paired-end read set R, our objective function is to find a path P from s to t in the MSCG to maximize
(7)
where the set R′ includes all "pseudo" reads and excludes all reads p ∈ R with |IS
p
| = 1. Here, α
p
> 0 is a user-defined parameter and should be smaller for organisms with simple splicing patterns (like fruit fly or warm) and relatively larger for organisms with more complicated splicing patterns (like human or mouse). For the convenience of presentation, we will refer to this problem as a (constrained) shortest path problem on G, because when α = 0, the problem reduces to finding the minimum weight path (or shortest path) from s to t.
Unfortunately, it is hard to find a path to maximize Equation (7), since we can show that the corresponding decision problem is NP-complete even when w
i,j
= 0 and α
p
= 1.
Theorem: The following decision problem is NP-complete:
Input: An MSCG G = (V, E) and a set of mapped paired-end reads R; an integer k.
Question: Is there a path P in G such that), ?
Proof : The theorem can be proven by a straightforward reduction from the wellknown CLIQUE problem. The reduction is presented in Additional file 1. ■
An efficient heuristic algorithm to identify expressed isoforms in multiple samples
The ILP and LP approaches for finding an optimal path
In this section, we present two different approaches to find a path on the MSCG G maximizing Equation (7). First, a binary variable f (i, j) ∈ {0, 1} is introduced to indicate whether each vertex (or edge) in G is included in a path P. The following ILP problem is formulated to find a path maximizing Equation (7):
(8)
(9)
(10)
(11)
(12)
(13)
Equations (9)-(10) are constraints ensuring that the final solution represents a path (and thus an isoform) from s to t, while Equation (11)-(12) guarantee that q
p
= 1 if and only if the path P is compatible with paired-end read p. Solving the above ILP problem may be time-consuming since the number of variables may be large for some genes. Instead, we could relax the binary constraints in Equation (13) as follows, which turns the problem into an LP problem:
(14)
Ideally, the solution to the above LP problem is integral (i.e., q
p
, f(i, j) ∈ {0, 1}), which would represent a path from s to t. However, in some cases (for about 0.1% of the genes in our simulated and real data experiments), the LP problem may not lead to an integral solution. For these genes, we can solve the corresponding ILP problem instead. We use GNU Linear Programming Kit (GLPK, [26]) to solve both the ILP and LP problems.
The Iterative Shortest Path algorithm
A gene may have multiple isoforms expressed in the samples, but only one isoform is extracted by solving the above LP/ILP problem. To recover more expressed isoforms of the gene, we apply the "weight-decay" strategy [27] to modify the weights in the graph G and iterate the algorithm several times. In each iteration, the weights are adjusted to encourage the algorithm to look for an isoform different from all previously found isoforms. The details of this ISP algorithm are described in Additional file 1.