Sample preparation
Triple negative breast cancer cell lines MDA-MB-231 and MDA-MB-468 were obtained from ATCC (Manassas, VA) with identification numbers HTB-26 and HT-132 respectively and SUM149 was purchased from Asterand plc (Detroit, MI). All cell lines were grown according to the suppliers’ recommendation. CK2 inhibitor CX-4945 (Selleckchem, Houston, TX) was dissolved in DMSO and stored frozen at − 80 °C until used.
To induce splicing events, cells were grown to ~ 70% confluence and treated with 1 μM CX-4945 or DMSO during 12 h in a total of 5 replicates per condition. Total RNAs were isolated using the RNeasy Mini Kit (Qiagen, Germantown, MD) according to the manufacturer’s protocol. Integrity of RNA was quantified using the Agilent 2100 Bioanalyzer (Agilent Biosystems, Foster City, CA). Samples were labeled and hybridized in Human Transcriptome arrays (HTA) by the Genomics Core Facility of the Center for Applied Medical Research (CIMA) following manufacturer’s instructions.
RNAseq was performed in the Center for Cooperative Research in Biosciences (CICBiogune) using the Illumina HiSeq2000 sequencing technology, HiSeq Flow Cell v3 and TruSeq SBS Kit v3. 2μg of RNA of each sample was sent for this purpose. The run type was strand specific, multiplexed with paired-end reads of 100 nucleotides each. The amount of RNA for hybridization and validation purposes was 5 μg.
STAR 2.4.0 h1 was used to align the reads against the human genome. The reference genome was Ensembl v.74, GRCh 37.75. The output were sorted BAM files. All the other parameters were set to the default values. The average sequencing depth was 49 million reads (9.8 billion nucleotides sequenced per sample).
The microarray data preprocessing was performed using the aroma.affymetrix framework using the standard RMA algorithm applied to probesets of the paths [30]. In addition, we used both platforms to quantify expression at the gene level. Results are shown in the Additional file 1. Gene expression was computed from RNA-seq data using Kallisto [14] to quantify expression using a pseudo-aligment method. Kallisto returns an estimate of the expression of all the isoforms for each gene. The overall expression of the gene was simply computed by summing up the expression of each the its isoforms. We estimated the expression of the arrays using the RMA algorithm in the aroma.affymetrix framework using a Brainarray reference file of the Ensembl 74 transcriptome.
Event pointer for RNAseq
EventPointer is an R package to identify, classify and analyze alternative splicing events using microarrays and RNA-Seq data. The software is available for download at Bioconductor. A thorough description of EventPointer for microarrays can be found in [8]. This method has been extended to RNA-seq.
The concepts for detection, classification and statistical analysis are shared in EventPointer for the analysis of both technologies. The main difference of EventPointer for RNA-seq compared with that of microarrays are the ones associated with the type of input data (CEL or BAM files). The R code for the analysis is available at https://github.com/jpromeror/SplicingComparison.
EventPointer requires a splicing graph -a directed graph used to represent the structure of the different isoforms of a given gene [39] - as input to detect splicing events. EventPointer for RNA-seq uses SGSeq [24] to build the corresponding splicing graphs from BAM files. The complexity of the splicing graph can be controlled in SGSeq by setting different thresholds in the expression values of splicing junctions of the splicing graph (by default set to 2 FPKM). For RNA-seq, the splicing graphs are constructed for every single experiment. On the contrary, in the case of microarrays the same splicing graph (and the corresponding CDF) is used for all the experiments run on the same type of microarray (HTA or, more recently Clariom-D).
The input data for the statistical analysis is different in both technologies: signal values of the probes in microarrays and counts in RNA-seq. In order to deal with reads, Voom [17] is applied to preprocess the RNA-seq count data. The statistics to deal with the processed RNA-seq data is identical to the one used for microarray data and hence, the same statistical tests -based on limma [33]- are applied to both technologies.
As output, EventPointer provides a table with the following information associated to each detected alternatively spliced event: gene identifier, genomic position, type of event, statistical parameters and ΔΨ values. Additionally, EventPointer generates a “Gene Transfer Format” (GTF) file that can be used with the Integrative Genomics Viewer (IGV) [36] to view the structures of each detected alternative splicing event. This visualization facilitates the interpretation of the detected events and the design of primers for the validation of the events using standard PCR.
Estimation of PSI
We have included a novel algorithm to estimate Ψ that can be applied to both RNA-seq and microarrays. Assuming that the signal of a probe-set in microarrays and the number of reads within a region of the transcriptome in RNA-Seq depend on the product of an affinity value of the probe-set (or the equivalent length in RNA-seq) and the concentration of the interrogated isoforms in the paths, the following equation holds
$$ {S}_i={a}_i\bullet {t}_i $$
(1)
where Si is the measured expression value of path i, ai is the affinity of the probes or equivalent length of the path i and ti is the concentration of the isoforms mapped to path i. The affinity values (or equivalent lengths) and concentration values are assumed to be unknown and must be estimated from the data.
Particularizing the above equation to each of the paths and taking into account that the concentration of the isoforms in the reference path must be the sum of those of paths 1 and 2, the following equations are obtained:
$$ {S}_1={a}_1\bullet {t}_1 $$
(2)
$$ {S}_2={a}_2\bullet {t}_2 $$
(3)
$$ {S}_R={a}_R\bullet {t}_R={a}_R\left({t}_1+{t}_2\right) $$
(4)
In turn, the signal value of the reference path can be expressed as the sum of the signal values of paths 1 and 2 as follows,
$$ {S}_R={a}_R{a}_1^{-1}{S}_1+{a}_R{a}_2^{-1}{S}_2=u{S}_1+v{S}_2 $$
(5)
where u and v represent the fraction of the affinities of the mapped probe-set (or equivalent lengths) in the reference path and paths 1 or 2 respectively. The values of u and v can be estimated from signal data.
Dividing eq. (2) with eq. (4) we get,
$$ \frac{S_1}{S_R}=\frac{a_1{t}_1}{a_R\left({t}_1+{t}_2\right)} $$
(6)
Combining eqs. (5) and (6), the desired equation of the Percent Spliced Index (Ψ) used in EventPointer is obtained:
$$ \varPsi =\frac{t_1}{t_1+{t}_2}=\frac{u{S}_1}{S_R}=\frac{u{S}_1}{u{S}_1+v{S}_2} $$
(7)
Note that Ψ can be directly obtained from signal values once u and v are known. This equation does not require the estimation of the affinities (difficult to predict accurately) to compute Ψ. On the contrary, it simply requires to estimate u and v from signal values using eq. (5). In the case of RNA-seq, the equivalent lengths are known a priori and hence u and v. However, using this approach has an advantage: the estimates of these lengths can accommodate the potential lack of uniformity of the reads.
Note that u and v must be positive, similar between them and close to one. The first affirmation is trivial since affinity values (or equivalent lengths) are always positive. In microarrays, probe-sets are composed by several probes and their overall affinity are expected to be similar to each other, since these affinities are a median of the average of the affinities of the probes that build up them. Therefore a1 ≈ a2 ≈ aR,and u ≈ v ≈ 1. A similar reasoning can be applied to RNA-seq, if using coverage instead of read counts, since the coverage of the reference path is expected to be close to the sum of the coverages of paths 1 and 2.
These two fractions can be estimated from eq. (5) by using non-negative least squares as follows:
$$ {\displaystyle \begin{array}{c}\min \left\Vert Ax-b\left\Vert {}_2\right.\right.\\ {}s.t.x\ge 0,x\in {R}^2,A\in {R}^{m\;x\;n}\end{array}} $$
(8)
where,
$$ A=\left[\begin{array}{cc}\boldsymbol{Signal}\ \boldsymbol{P}\mathbf{1}& \boldsymbol{Signal}\ \boldsymbol{P}\mathbf{2}\\ {}\lambda & -\lambda \\ {}\lambda & 0\\ {}0& \lambda \end{array}\right];x=\left[\begin{array}{c}u\\ {}v\end{array}\right];b=\left[\begin{array}{c}\boldsymbol{Signal}\ \boldsymbol{R}\\ {}0\\ {}\lambda \\ {}\lambda \end{array}\right] $$
(9)
The penalty factor λ is added to force the equation to fulfill the previous considerations: u and v must be similar and close to 1. In our results, we found that the estimates were not sensitive to the specific value of λ if there is differential alternative splicing. If the relative usage of both paths is similar and therefore, Ψ is constant, the results are more sensitive to the value of λ. This fact is shown in Fig. 4: the correlation is much better for events that show variability in the relative expression of both paths.
The residuals of this model can be used to test if the additive model of eqs. 2, 3 and 4 holds. We computed the relative error of the residuals as follows:
$$ \upvarepsilon =\frac{{\left\Vert \left(u\cdotp \boldsymbol{Signal}\ \boldsymbol{P}\mathbf{1}+v\cdotp \boldsymbol{Signal}\ \boldsymbol{P}\mathbf{2}\right)-\boldsymbol{Signal}\ \boldsymbol{R}\right\Vert}_2}{{\left\Vert \boldsymbol{Signal}\ \boldsymbol{R}\right\Vert}_2} $$
If the relative error is large, the additive model does not fit the data and, therefore, the estimates are expected to be less reliable. In order to test this, we divided the events according to the relative error. The events with top 50% relative error have FDR two times larger than the bottom 50% as shown in Additional file 1: Table S10.
Statistical analysis
The comparison and analysis of the profiling data was done using a linear model. The design matrix was built considering both the cell line and treatment with CX4945 as factors. The interaction between cell line type and treatment was not considered.
The selected contrasts test for the difference between control samples (DMSO) and drug exposed ones (CX4945) controlling for the cell-type. The complete experimental design in the form of design and contrast matrices is included in Additional file 1: Table S9.
EventPointer includes several statistical methods to state the significance of an event. In this experiment, the events are considered to be statistically significant if there is a change in the expression of the isoforms associated to each of the alternative paths, this change occurs in opposite direction, i.e. opposite signs for the fold changes and the summarized p.value is significant (p value< 0.001).
In order to compare the arrays with different sequencing depths, we subsampled the RNAseq data to 30 and 10 million reads and rerun the whole pipeline with these data. The FDR for 30 million reads was better than using arrays. On the contrary, using 10 million reads the FDR was worse than using arrays. Interpolating both data, the FDR for arrays is similar to a depth of 20 million reads.
Filters used to include the events
For arrays, the signal of the probe-sets interrogating each of the alternative paths involved in a splicing event, must be expressed more than a certain threshold in at least one sample. This threshold is the 25% quantile of the expression of the signal in the reference paths for all the events included in the array. For RNAseq, the edges of the splicing graph (junction reads) are included only if their expression is at least 2 FPKM in at least one sample (SGSeq defaults).
Matching of the events using different technologies
Let’s assume that AR and AM are, possibly non-contiguous, regions of the genome that correspond to path A using either technology (AR for RNA-seq and AM for HTA). BR and BM have a similar description for path B and RR and RM for the reference path in each technology. Two events are considered to match if any of the following two expressions is true:
$$ \left(\left({A}_R\subset {A}_M\right)|\left({A}_M\subset {A}_R\right)\right)\&\left(\left({B}_R\subset {B}_M\right)|\left({B}_M\subset {B}_R\right)\right)\&\left(\left({R}_R\cap {R}_M\right)\ne \varnothing \right) $$
(10)
$$ \left(\left({A}_R\subset {B}_M\right)|\left({B}_M\subset {A}_R\right)\right)\&\left(\left({B}_R\subset {A}_M\right)|\left({A}_M\subset {B}_R\right)\right)\&\left(\left({R}_R\cap {R}_M\right)\ne \varnothing \right) $$
(11)
In these expressions, (x ⊂ y) is true if the genomic region x is a subset of the genomic region y (the nucleotide sequence of x is a substring of the nucleotide sequence in y). Besides, the operators “|” and “&” and the logical OR and AND operations. If (x ⊂ y)|(y ⊂ x), then one of the regions is contained in the other are considered to be “compatible”. On the other hand, (x ∩ y) ≠ ∅ means that regions x and y overlap in the genome. Therefore, the first expression is true if both paths AR and AH are compatible, BR and BM are compatible and RRand RM overlap. The second expression is true if path AR and path BM are compatible and also path AM and BR are compatible and, again, and RR and RM overlap.
Within an event, the longer path in the transcriptome is assigned the name “A” and the other the B. The second eq. (11) takes into account that, in some few cases, the name of the paths can be switched in both technologies.
PCR validation
For each splicing event, an end-point PCR was run using primers designed in the exons that flank the event of interest. RNA was retro-transcribed and the PCR was performed an analyzed as previously described [40]. Primers used are shown in Additional file 2.