Development of a novel splice array platform and its application in the identification of alternative splice variants in lung cancer

Background Microarray strategies, which allow for the characterization of thousands of alternative splice forms in a single test, can be applied to identify differential alternative splicing events. In this study, a novel splice array approach was developed, including the design of a high-density oligonucleotide array, a labeling procedure, and an algorithm to identify splice events. Results The array consisted of exon probes and thermodynamically balanced junction probes. Suboptimal probes were tagged and considered in the final analysis. An unbiased labeling protocol was developed using random primers. The algorithm used to distinguish changes in expression from changes in splicing was calibrated using internal non-spliced control sequences. The performance of this splice array was validated with artificial constructs for CDC6, VEGF, and PCBP4 isoforms. The platform was then applied to the analysis of differential splice forms in lung cancer samples compared to matched normal lung tissue. Overexpression of splice isoforms was identified for genes encoding CEACAM1, FHL-1, MLPH, and SUSD2. None of these splicing isoforms had been previously associated with lung cancer. Conclusions This methodology enables the detection of alternative splicing events in complex biological samples, providing a powerful tool to identify novel diagnostic and prognostic biomarkers for cancer and other pathologies.


Choice of hybridization technology
Dual color arrays were chosen for the analysis of splicing in paired tumor/normal tissue samples. In co-hybridization studies, the labeled material from each sample competes with its paired sample on a specific oligonucleotide probe spot, and the relative intensities are representative of the relative contribution of each sample when the hybridization equilibrium is reached. Inherently, co-hybridization eliminates the variation associated with the use of independent arrays, washes, detections and in silico comparison of single channel experiments. On the other hand, co-hybridization can introduce a dye bias, which nevertheless can be measured in self to self hybridizations and corrected, if necessary. In our experience, in high quality experiments, the dye bias is small and a correction is hardly ever needed.

Data processing
For data processing, a novel algorithm was developed to distinguish between gene expression changes and splicing variations. The analysis of differential splice forms is more complex than the analysis of regular gene expression, due to higher variation in thermodynamic conditions and to possible cross-hybridization, probe folding, or sub-optimally designed oligonucleotides. In addition, compared with regular gene expression analyses, additional variation can be introduced by the extra steps in the labeling protocol. In this situation, the use of spiked-in controls for data processing is especially important. Data processing was divided into four steps: data filtering and normalization, probe spot calibration, statistical analysis of gene probes, and isoform analysis.

Data filtering and normalization
The initial steps of data processing were performed as in gene expression arrays described elsewhere (Cerdá et al. 2008). Due to the intrinsic nature of the splice array experiments, we had to develop new procedures, closely tied to the array design and experimental execution, to overcome a specific problem: a predictably large number of non-responsive oligonucleotide probes (sub-optimal probes, non-expressed exons, etc.) may bias the statistical analysis and distort the normalization at the low signal level.
Therefore, prior to the normalization of the array data, probes that were judged nonresponsive were filtered. Specific negative and background control probes were designed and used in all hybridizations (Additional file 2: Figure S6). The signals of these controls in each channel were used to calculate the detection limit as the mean signal value + 3σ for each array. Then, probes with signal intensities below each array's detection limit in all the replicated arrays were filtered out (this process was called Global Filtering).
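As an illustration, the detection limit and Global Filtering steps could look like this in Python/NumPy (a minimal sketch; the function name and data layout are assumptions, not part of the published pipeline):

```python
import numpy as np

def global_filter(signals, negatives, n_sd=3.0):
    """Filter probes below the per-array detection limit.

    signals:   (n_arrays, n_probes) probe intensities for one channel
    negatives: (n_arrays, n_neg) negative/background control intensities
    Returns a boolean mask of probes that pass in at least one array.
    """
    # Detection limit per array: mean of negative controls + n_sd * sigma
    limits = negatives.mean(axis=1) + n_sd * negatives.std(axis=1)
    detected = signals > limits[:, None]   # per-array detection calls
    # A probe is filtered out only if it is below the limit on ALL
    # replicated arrays (the "Global Filtering" criterion)
    return detected.any(axis=0)
```

A probe surviving in any replicate is kept, which matches the conservative criterion in the text of removing only probes that fail in all replicated arrays.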
To normalize the data that passed Global Filtering, MA space was used (Dudoit et al. 2002). M values (vertical axis) are the log-differential expression ratios, and A values (horizontal axis) are the log-intensities of the spots. M and A are calculated from the Cy3 and Cy5 intensity values. The method used was Polyphemus Q-Splines Normalization, in which array data are normalized using an improved version of the non-linear normalization described previously (Workman et al. 2002), with a piecewise approach: Q-Splines for the inner pieces and linear interpolation for the two outermost ones, ensuring a smooth normalization curve (Additional file 2: Figure S7). The data used for the normalization did not include the signals derived from positive or negative control probes.
The normalization process is useful to correct deviations of the M values from the statistical assumption that most of the spots have M=0 in MA plots. The method allows the adjustment of all M values to form a cloud centered at M=0 in all intensity ranges. It is important to note that any normalization procedure makes two basic assumptions: the total intensity in both samples is equal, and the data cloud should be centered at M=0 in all intensity ranges. It is also important to keep in mind that normalization cannot correct data saturation or generate data below the original detection level.
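The MA transform and the idea of centering M at 0 across all intensity ranges can be illustrated as follows (Python/NumPy sketch; the binned-median centering is a deliberately crude stand-in for the Q-spline fit, which is not reproduced here):

```python
import numpy as np

def ma_transform(cy5, cy3):
    """Map two-channel intensities into MA space (log base 2)."""
    m = np.log2(cy5) - np.log2(cy3)          # log-ratio per spot
    a = 0.5 * (np.log2(cy5) + np.log2(cy3))  # mean log-intensity per spot
    return m, a

def center_m(m, a, n_bins=10):
    """Intensity-dependent centering of M: subtract the median M within
    A-intensity bins (a crude stand-in for the Q-spline normalization)."""
    order = np.argsort(a)                    # group spots by intensity
    m_norm = m.copy()
    for chunk in np.array_split(order, n_bins):
        m_norm[chunk] -= np.median(m[chunk])
    return m_norm
```

With a constant dye bias, every intensity bin has the same median M, so the centering removes the bias uniformly; a real intensity-dependent bias would be removed bin by bin.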
The positive and negative hybridization controls were thus helpful to monitor the entire process: to control lower detection limits, signal distribution and saturation, and to control that the final normalization had been performed adequately (the positive spikes should be positioned at M=0 after normalization).
In gene expression studies, intra-array replicate analyses are generally applied at this point. The splice arrays used in this study did not contain triplicate copies of each probe, as replicates for all the probes did not fit on the array surface at the probe densities used. Additional replicate hybridizations were not performed either, due to the limited amount of material available from the biological samples.
Data processing can be performed on individual arrays, although the absence of intra-array replicates precludes the elimination of bad data points by outlier analysis, which renders the analysis more error prone.

Probe spot calibration
The objective is to distinguish intensity changes from random variation with reasonable reliability. For this purpose, both the variability of the intensity of the control probes within each array and the variability of the intensity of self to self hybridized samples, as assessed by the gene probes, were evaluated. In both cases, the logarithm of the ratio of the two channel expression values was statistically analyzed, and the mean and standard deviation of the resulting log-ratio population were calculated. Typically, the distribution of the values obtained from the artificial spiked-in controls is somewhat narrower than that of the self to self hybridization (Additional file 2: Figure S2a and b).
These two fold change distributions provide information about the empirically significant change limit, since the threshold for statistically significant changes in probe response cannot be lower than the intrinsic technological variation. Controls are used in all the experiments (self to self comparisons and normal to tumor tissue comparisons) at equal concentration in both channels, thus allowing variations in the self to self hybridization to be related to variations in the controls. A very simple way to do this is to take the ratio of the two standard deviations, which yields the Calibration Factor (CF):

CF = σ_s / σ_c

where σ_s is the standard deviation of the self to self log-ratios (gene probes) and σ_c is that of the control log-ratios.

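The calibration factor is simply the ratio of the two standard deviations; a minimal sketch (function name is illustrative):

```python
import numpy as np

def calibration_factor(self_self_log_ratios, control_log_ratios):
    """CF = sigma(self-to-self gene probe log-ratios) /
            sigma(spiked-in control log-ratios)."""
    return np.std(self_self_log_ratios) / np.std(control_log_ratios)
```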
First-pass analysis: gene probe statistics
The first step is to decide whether the variation in the logarithm of the ratio of the two channel expression values for all the exon and junction probes of a given gene falls within (reflecting genes without or with differential expression) or outside (potential splicing variation) the variability of the experiment. To differentiate these situations, the standard deviation of the control probes (σ_c) was calculated and, using the CF from the calibration, converted into an estimate of the associated gene self to self standard deviation (σ*_s,g):

σ*_s,g = CF · σ_c

Subsequently, the threshold was defined as TH = µ_G ± 3σ*_s,g, µ_G being the mean value of the log-ratios of the Cy5 and Cy3 signals for all the oligos of a given gene G. Depending on the sensitivity and reliability required, the threshold can be modulated. Genes that contained at least one probe with a log-ratio below or above the threshold were selected for the next step in the analysis (Additional file 2: Figure S2c).
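This first-pass gene test can be sketched as follows (Python/NumPy; the function name and the returned per-probe mask are illustrative assumptions):

```python
import numpy as np

def first_pass_flag(gene_log_ratios, sigma_control, cf, k=3.0):
    """Flag a gene as a splicing candidate if any probe's log-ratio
    falls outside mu_G +/- k * (CF * sigma_control)."""
    sigma_sg = cf * sigma_control        # estimated self-to-self sd for the gene
    mu_g = np.mean(gene_log_ratios)      # gene-wide mean log-ratio
    lo, hi = mu_g - k * sigma_sg, mu_g + k * sigma_sg
    outliers = (gene_log_ratios < lo) | (gene_log_ratios > hi)
    return bool(outliers.any()), outliers
```

The multiplier k corresponds to the 3σ threshold in the text and can be modulated depending on the sensitivity and reliability required.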

Second-pass analysis: isoform quantification
Facing the challenge of doing high throughput screening of clinical samples to search for alternative splice variants, a multidisciplinary integrated approach was required.
Array design, sample preparation, labeling, and final data analysis were considered as an interdependent system. Due to the array design, a wide variety of probe-sample interactions was expected: from small sub-optimal junctions to highly cross-hybridized exons. A dual color approach was chosen because, in this technology, based on the co-hybridization of the two compared samples on the same array, at high concentrations the observed fold change monitors the biochemical equilibrium of the samples competing for hybridization. This equilibrium is expected to be proportional to the transcript ratio, although the signal level is not related to the actual transcript concentration.
Different algorithms for the quantification of isoforms have been described, including multivariate curve resolution (MCR) approaches applied to gene expression data (Wentzell et al., 2006), but they have not been applied to alternative splicing. Translating MCR to the analysis of differential splicing, for each gene a probe signal matrix S (2 colors x n probes) is represented as the product of two lower rank matrices, Q and P, which contain the quantification factors and the canonical shape patterns, plus a residual error term, E. The data matrix is set up based on the assumption that the signal in each channel is a linear combination of two or more hybridization patterns derived from different transcript variants (Additional file 2: Figure S8). The adaptation in matrix form is:

S ≈ Q · P + E

where S is the expression signal matrix from the hybridized probes (2 x n), Q contains the quantification factors that indicate the amount of each expression pattern needed to approximate the signal (2 x k), P contains the unitary base expression patterns that depict the canonical shapes (k x n), and E is the residual error of the approximation. As the MCR technique yields a family of solutions, the following constraints were applied to solve the equation: all matrices must be non-negative (transcripts can only be added, not subtracted); a wide range of intensities must be accommodated; only two conditions (the two colors) exist per gene; and a random noise term accounts for the different sources of variation. In practice, this translates into:
a) PCA filtering/whitening was used to minimize the impact of random noise.
b) The Alternate Least Squares (ALS) methodology using non-negative least squares approach was then used to find out the Q and P matrices from the initial expression matrix. To have an initial approach for the Q and P matrices and begin the iterative ALS, a principal component analysis (PCA) was used.
c) The expression vector was normalized to optimize the numerical stability for the iterative ALS method.
d) The two colors only allow defining 2 x n matrices, which can be deconvoluted into two components. The components do not necessarily have any biological meaning, i.e. they do not necessarily represent the actual transcripts in the mixture. However, the method is still valid to detect changes, although the biological interpretation becomes much more difficult.
The complete flowchart for data analysis in AltPolyphemus (Additional file 2: Figure S9) was validated on data obtained from the hybridization of synthetic VEGF and PCBP4 transcripts on the pilot array. To eliminate false positives due to low-expression probes, a yeast spike-in at very low concentration was used. This allowed the genes with very low expression probes to be monitored and processed.
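The constrained ALS procedure outlined in steps a) through d) can be sketched as follows (a minimal Python/NumPy illustration, not the AltPolyphemus implementation: an SVD-based initialization stands in for the PCA step, and non-negativity is enforced by simple clipping rather than a full non-negative least squares solver):

```python
import numpy as np

def mcr_als(s, n_comp=2, n_iter=200):
    """Approximate S (2 x n_probes) as Q @ P + E with Q, P >= 0.

    Simplified MCR-ALS sketch: alternating least squares with
    non-negativity enforced by clipping, initialized from the SVD.
    """
    # SVD-based initialization of the pattern matrix P (k x n_probes)
    _, _, vt = np.linalg.svd(s, full_matrices=False)
    vt = vt[:n_comp].copy()
    # Fix the SVD sign indeterminacy so each pattern is mostly positive
    vt *= np.where(vt.sum(axis=1, keepdims=True) < 0, -1.0, 1.0)
    p = np.clip(vt, 1e-9, None)
    for _ in range(n_iter):
        # Solve for Q given P (least squares), clip to keep amounts >= 0
        q = np.clip(s @ np.linalg.pinv(p), 0.0, None)
        # Solve for P given Q, clip, and rescale rows to unitary patterns
        p = np.clip(np.linalg.pinv(q) @ s, 1e-9, None)
        p /= np.linalg.norm(p, axis=1, keepdims=True)
    e = s - q @ p
    return q, p, e
```

With only two channels, at most two components can be resolved, and, as noted above, the recovered patterns need not correspond to actual transcripts; the residual E and the component contributions in Q are what signal a candidate splicing change.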
The presence of a bona fide splice event is expected to generate a variation in the fold change along the probes of the affected gene. To allow for final visual interpretation, the algorithm was completed with a graphic interface that plots the Cy3 and Cy5 intensities for all the oligonucleotides of a gene with signal intensity above background + 3σ of the background variability, together with the resolved curves composing the signal in the Cy3 and Cy5 channels (Additional file 2: Figure S3).
In summary, the implemented algorithm is able to detect and rank the genes that have a significant probability of showing an expression mixture with different transcript contributions. Combined with the first-pass gene-based analysis, the algorithm detects the probes that change within the mixtures, which, together with the probe sequences and the tagged probe quality, allows a biological hypothesis to be formulated and validation experiments to be planned.