Lung cancer accounts for a quarter of all cancer mortalities in the U.S. Non-small-cell lung cancer (NSCLC) is a histologically defined sub-group that represents 75 to 80% of all lung cancer cases. NSCLC can be subdivided into adenocarcinoma (AdCa), squamous cell carcinoma (SCC), and large cell lung cancer (LCLC). The high mortality rate of lung cancer can be attributed to late diagnosis, by which time the tumour is often already metastasised and aggressive. Platinum-based chemotherapy in combination with taxanes, camptothecins, or vinca alkaloids is the first-line treatment of choice for patients with advanced NSCLC. Yet, survival time is short and the five-year survival rate has risen only slightly since 1987. New therapies address molecular targets that are involved in tumour progression or angiogenesis (e.g. EGFR, VEGF-R). These include small molecule drugs directed against mutated EGFR (e.g. gefitinib, erlotinib), as well as monoclonal antibodies directed against EGFR (cetuximab) and against VEGF (bevacizumab). These targeted therapeutics have provided some clinical benefit but have also underlined that the molecular target needs to be accessible in the tumour type under treatment. That is, administering the drug will benefit patients only for specific sub-types of the tumour where the target is relevant, which introduces the concept of an individualised therapy. Novel drugs for a targeted therapy that take the sub-type specific tumour biology into account are urgently needed for the treatment of NSCLC, which remains a major unmet clinical need. In order to better understand molecular characteristics of NSCLC and derive novel molecular targets, studies have investigated, in an unbiased approach, gene expression in clinical samples or in novel xenograft models derived from NSCLC specimens, gene expression under compound treatment (e.g. Sagopilone), DNA copy-number variation, and epigenetic changes in NSCLC.
Yet, the changes caused by alternative splicing are relatively unexplored in NSCLC.
Alternative splicing (AS) describes the process by which pre-mRNA is spliced in different ways, thus giving rise to distinct mature mRNA transcripts. AS events can be characterised as inclusion or skipping of a complete exon (cassette exon, CE), prolongation or shortening of an exon (alternative 5'- or 3'-splice site), retention of an intron (IR), inclusion of only one exon from an array of two or more exons (mutually exclusive exons, MX), and alternative poly-A site. An alternative start of transcription can also lead to different exon-exon junctions; however, this mechanism is not an AS event and its regulation need not be at the level of splicing. It has become evident that more than 73% of all human genes are alternatively spliced [11, 12]. AS plays a major role in gene regulation, both in normal tissues and in disease. In cancer, AS has an impact on cellular processes related to tumour progression, including inhibition of apoptosis, tumour invasion, metastasis, and angiogenesis. Changes of the AS pattern of a gene can be triggered by differential expression of splicing factors or by changes up-stream of the splicing machinery. One example is SRPK1, a kinase that is over-expressed in breast, colon, and pancreas carcinoma. SRPK1 phosphorylates the splicing factor SF2/ASF, thereby mediating its import into the nucleus and recruitment to nuclear speckles. This process affects the AS of multiple target genes (e.g. BIN1, S6K1, MNK2), which contributes to tumour progression. Based on the analysis of ESTs from normal and cancerous tissues, it was suggested that alterations affecting the splicing machinery and its regulation are a further hallmark of cancer progression. Both the identification of RNA binding proteins affecting the AS pattern and the identification of their target sequences are active fields of research [18, 19].
Most studies of AS in cancer have focused on the analysis of individual genes. Several genes are well known for AS, and the derived alternative proteins have different functionalities in tumour compared to normal tissue. One member of the Bcl-2 family is Bcl-X (BCL2L1), whose short transcript variant Bcl-XS promotes apoptosis. AS results in a longer exon in the transcript variant Bcl-XL in cancer cells. In contrast to the short transcript variant, this isoform has an anti-apoptotic function. CD44 is another example of a gene that is affected by AS. Ten variant exons in this gene generate multiple transcript variants. In most tissues, the short isoform CD44s lacking all variant exons is expressed. Longer transcript variants containing one or many variant exons were found in specific cell types as well as in cancer cells. It was shown that the transcript variants of CD44 are involved in angiogenesis and metastasis [21, 22]. AS can also yield new epitopes in tumour cell surface proteins or in proteins of the extracellular matrix of tumour cells that can be exploited for therapy via targeting by an immunoconjugate. Using such an approach, CD44-v6 was targeted by the immunoconjugate bivatuzumab mertansine. The Fibronectin (FN1) gene codes for an extracellular matrix protein that contains three cassette exons, among them the extradomain B exon (EDB). An antibody fragment targeting the onco-foetal antigen FN1-EDB (L19-SIP) is currently in pre-clinical development. All these examples demonstrate that AS is a highly interesting field: on the one hand for elucidating induced changes along the hallmarks of cancer and, on the other hand, for the search for new drug targets, both for small molecule and for antibody-mediated approaches. Nevertheless, it remains a relatively unexplored area. Until recently, methods for a global analysis of AS were challenging and required a great deal of effort.
As a new technology that allows an unbiased analysis of AS, splice variant sensitive microarrays became commercially available in 2005. In this study, we utilise the oligonucleotide microarray Human Exon 1.0 ST Array (Affymetrix, Santa Clara, CA, USA). This array contains 6.5 million probes targeting known and predicted exons of the human transcriptome. Probe sequences were designed in such a way that up to four probes compose a probe set that maps to one exon of a gene. Probe intensities can be summarised on the gene level, which provides information about expression of the whole gene. In addition, the exon array technology allows summarisation per probe set, which provides exon expression values. Using both metrics together, one can determine the relative inclusion or skipping rate of an exon between two or more sample groups (differential splicing). Although alternative start of transcription is not an AS event, the exon array technology can also detect differences in the usage of transcription start sites. In the following, we also consider this kind of mechanism when speaking of differential splicing.
In previous studies (review), the exon array technology was used to detect differences in AS patterns between healthy human tissues, between human populations, under hypoxia conditions, and in disease tissues. Differential splicing was analysed in several types of cancer, among them colon, breast, prostate, bladder, and head and neck cancers [27, 32, 33]. Clinical samples of NSCLC have been investigated in two studies. Xi et al. analysed a data set of matched pairs of AdCa. They found evidence for differential splicing in 2369 genes and further analysed a subset of 729 genes that are cancer-related according to pathway annotations. Of 11 genes selected for validation using independent laboratory methods, differential splicing was confirmed in six genes (CEACAM1, ERG, RASIP1, VEGFC, CDKN2A, CDH3). Lin et al. analysed differential splicing in a large data set consisting of samples of AdCa and SCC of NSCLC besides colon and breast cancer. This data set does not contain samples of healthy tissue for comparison.
Data analysis of exon arrays remains a challenging task despite several different approaches described recently [27, 33–35]. In the workflow proposed by Affymetrix, exon level expression values are normalised to gene level expression values to calculate the splicing index (SI). Differentially spliced genes are identified using both the magnitude of change (SI) and its significance, e.g. a p value obtained from a Student's t-test. It became evident that the standard workflow leads to a high false positive rate, which especially affects noisy data sets such as cancer data sets with high intrinsic variability due to inter-patient heterogeneity. Three sources of artefacts are thought to be the major cause of false positives: (1) probe intensities at the level of the background noise, (2) cross-hybridising probes, and (3) imprecise calculation of the SI.
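The SI-based test described above can be illustrated with a minimal sketch. It assumes log2-scale summarised signals, takes the SI as the difference of the group means of the gene-normalised exon signal, and computes a pooled-variance Student's t statistic. All function names and the example values are hypothetical and are not taken from the Affymetrix workflow itself.

```python
import math
from statistics import mean, variance


def normalised_intensity(exon_log2, gene_log2):
    """Per-sample exon signal relative to the gene signal (log2 scale)."""
    return [e - g for e, g in zip(exon_log2, gene_log2)]


def splicing_index(ni_group_a, ni_group_b):
    """Difference of the mean normalised intensities of two sample groups."""
    return mean(ni_group_a) - mean(ni_group_b)


def student_t(a, b):
    """Two-sample Student's t statistic using the pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


# Hypothetical log2 signals for one probe set and its gene in four matched pairs.
tumour_exon = [9.1, 9.3, 8.9, 9.2]
tumour_gene = [8.0, 8.2, 7.9, 8.1]
normal_exon = [7.0, 7.2, 6.9, 7.1]
normal_gene = [8.0, 8.1, 7.9, 8.2]

ni_tumour = normalised_intensity(tumour_exon, tumour_gene)
ni_normal = normalised_intensity(normal_exon, normal_gene)

si = splicing_index(ni_tumour, ni_normal)  # positive: exon relatively included in tumour
t = student_t(ni_tumour, ni_normal)        # compare against a t distribution for a p value
print(f"SI = {si:.3f}, t = {t:.2f}")
```

In this toy example the gene-level signal is nearly identical in both groups while the exon-level signal differs, so the SI is large. A p value would be obtained from the t distribution with n_a + n_b - 2 degrees of freedom, for instance with a statistics library.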
Probes corresponding to exons that are not expressed in a particular sample group measure only the background noise and thus will not be informative. Still, their expression value does not follow the overall gene expression level, thus leading to high SI values and false positive results. It is generally accepted that such probes should be removed before conducting the analysis. Here, the difficulty resides in identifying probes that are in fact detecting signals above the background noise. Algorithms initially employed like DABG were based on the GC-content of the probes [29, 38]. Recent advances in this field incorporate a statistical model based on probe sequences (MAT algorithm) or a thermodynamic model of oligonucleotide hybridisation (MSNS algorithm).
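The idea of a GC-content-based detection call can be sketched as follows. This is a simplified illustration, not the DABG implementation: it assumes that a pool of background probe intensities is available for each GC count and uses an empirical p value, i.e. the fraction of GC-matched background intensities at or above the probe intensity. The function names, the threshold `alpha`, and all values are hypothetical.

```python
def detection_p_value(intensity, background):
    """Fraction of background intensities at or above the probe intensity."""
    at_or_above = sum(1 for b in background if b >= intensity)
    return at_or_above / len(background)


def is_detected(intensity, gc_count, background_by_gc, alpha=0.05):
    """Call a probe detected if its empirical p value falls below alpha."""
    p = detection_p_value(intensity, background_by_gc[gc_count])
    return p < alpha


# Hypothetical background intensities, keyed by the GC count of the probes.
background_by_gc = {
    10: [40, 45, 50, 55, 60, 62, 65, 70, 80, 120],
    15: [80, 90, 100, 110, 120, 130, 150, 170, 200, 260],
}

# The same raw intensity leads to different calls depending on GC content.
print(is_detected(150, 10, background_by_gc))  # above all GC-matched background
print(is_detected(150, 15, background_by_gc))  # within the GC-matched background
```

The example shows why a GC-matched background matters: an intensity of 150 exceeds every background probe with low GC content but lies well inside the background distribution of GC-rich probes.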
Cross-hybridising probes are probes that bind other sequences besides the intended exon. This can lead to a constantly high expression value that does not follow the gene-level expression value. Again, this will lead to a high SI and to false positives. To date, this issue has been addressed with either of two approaches: mapping probe sequences to the transcriptome and flagging candidate probes that have multiple hits, or filtering out probes with a constantly high expression value.
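Both filtering strategies can be sketched in a few lines. The sketch assumes that the number of transcriptome hits per probe has already been determined by a separate alignment step, and that expression values are on a log2 scale. The thresholds `high` and `max_range`, the identifiers, and the data are hypothetical choices for illustration only.

```python
def flag_multi_hit(hit_counts):
    """Return the IDs of probes that map to more than one target."""
    return {probe_id for probe_id, n_hits in hit_counts.items() if n_hits > 1}


def flag_constantly_high(expression, high=10.0, max_range=0.5):
    """Return probe sets whose signal is high in every sample and barely varies."""
    return {
        probe_set
        for probe_set, values in expression.items()
        if min(values) >= high and max(values) - min(values) <= max_range
    }


# Hypothetical number of transcriptome hits per probe.
hit_counts = {"p1": 1, "p2": 3, "p3": 1}

# Hypothetical log2 expression values per probe set across four samples.
expression = {
    "ps1": [11.2, 11.3, 11.1, 11.4],  # high and flat: candidate cross-hybridiser
    "ps2": [7.0, 9.5, 6.8, 9.9],      # varies between samples, not constantly high
    "ps3": [10.5, 12.8, 10.2, 12.9],  # high but clearly varying
}

print(flag_multi_hit(hit_counts))
print(flag_constantly_high(expression))
```

The first approach depends only on sequence information and can be applied once per chip definition, whereas the second is data-driven and depends on the samples in the study.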
The third kind of artefact can be attributed to difficulties in measuring the SI. As mentioned by Affymetrix, determination of the correct gene level summarisation value can be cumbersome. In general, it can be estimated more reliably for a gene with many constitutive exons and preferably only a small number of exons affected by AS. A number of improvements have been published that try to identify constitutive exons [42–44]. As an alternative, Shah and Pallas used a correlation-based approach instead of calculating the SI. Recently, Möller-Levet et al. introduced the new metric VFC, which is a weighted fold-change based on probe set reliability, i.e. the detection above background score.
In addition to the identification of artefacts as mentioned above, a method of analysis needs to consider the relationship between exons and genes and also annotations of known transcript variants. Apparently, probe set definitions and annotations provided by Affymetrix were used in many studies. Affymetrix continuously updates the annotation files, i.e. the assignment of probe sets to the latest set of genes and mRNA sequences. Yet other information is not updated, such as the assignment of probes to probe sets (chip definition), the assignment of probe sets to a putative gene locus (transcript cluster), and the reliability assignment of a probe set (core set, extended set, full set). As this information is not static, an analytical method would benefit from an update of all of these definitions: a probe looking perfect at design time might in fact map to multiple targets, suggesting exclusion from a probe set. Unfortunately, transcript and gene annotations may contain some errors; corrections might require remapping between probe sets and the new gene locus/loci. Predicted exons can get backed by transcript evidence but might still be omitted from a study since they have not been moved to the core set. In part, these issues have been addressed with the database X:Map that provides an up-to-date mapping of probe sequences to transcripts and genes. In some studies, X:Map is used either directly or indirectly via the Bioconductor package Exonmap [47, 48]. All of the analysis workflows discussed above underline that analysing an exon array data set is a multi-faceted challenge. In this study, we propose a new workflow that addresses the above issues. We generated a new chip definition that represents an updated composition of probe sets and their assignment to known transcripts and genes. Furthermore, we use advanced algorithms in order to remove artefacts and to detect differentially spliced genes.
With the new workflow we identified genes that exhibit differential splicing in NSCLC compared to normal adjacent lung tissue (NAT). In order to extend our biological understanding of AS in cancer, alterations of splicing patterns between NSCLC and NAT were analysed on a genome-wide level. We generated an exon array data set derived from matched pairs of NSCLC and NAT including both the AdCa and the SCC subtype. Initially, we analysed this data set with a standard workflow based on the SI (workflow proposed by Affymetrix) and the generally accepted analysis of variance (ANOVA; workflow implemented in Partek® Genomics Suite). Several improvement steps led to our enhanced workflow. With the final version, genes that are known to be differentially spliced in NSCLC versus NAT rank highly in the result list (e.g. FN1). In total, 14 genes of this result list were selected for validation using independent laboratory methods. We succeeded in validating 69% of all differential splicing events. This includes ten events that are genuine AS events and one alternative transcription start site event. This demonstrates that our enhanced workflow can reliably identify genes that are affected by AS even in a clinical cancer data set that contains different subtypes of NSCLC and shows high heterogeneity between patients. We also examined the data set for genes that exhibit a different splicing pattern between two subtypes of NSCLC (AdCa versus SCC).