General functionality of Chipster
User interface
Chipster's user interface consists of four panels: Analysis tools, datasets, workflow and visualization (Figure 2). The panels for the datasets and the workflow display essentially the same files, but while the former provides a typical folder view, the latter shows the relationships between the files. It is therefore easy to keep track of which analysis steps were taken to produce a particular file. Both views allow the user to export, rename, and delete files, and the workflow view also allows the user to prune and save workflows. The analysis tool panel displays Chipster's analysis tools grouped into categories such as normalization, preprocessing, statistics and pathway analysis for easy discovery. Once a tool has been selected, the user can view its short description, the manual page and the source code, and change parameters if necessary. A complete list of the current analysis tools is available on the Chipster web site, and the analysis functionality is described in more detail in the corresponding section of this article. The visualization panel allows the user to view the selected dataset using different visualization methods, which are discussed in more detail below.
Describing an experimental setup is accomplished using a Phenodata editor. Chipster's normalization tools produce a phenodata file, which the user can complete by entering the experimental groups for the different samples. Any other variables such as time, dose, pairing and technical replicates can also be entered by adding new columns to the phenodata. The description column allows the user to enter the sample names that s/he wants to be used in visualizations. Phenodata is by default created during normalization, but users can also import normalized data and generate a phenodata file for it in Chipster, as demonstrated in the second case study of this article.
When an analysis task has been submitted, its progress can be monitored by opening the Task manager window from the bottom panel of the user interface. Task manager lists the status (i.e. transferring inputs, waiting, running, transferring outputs, completed), starting and running times, and tool parameters. It also allows the user to cancel a task if needed.
Chipster allows users to save their analysis sessions, so that the work can be continued later, even on another computer, or shared with collaborators. Work on different datasets can be saved into separate sessions, and the sessions can also be combined later if needed. A session file is a zip-file containing all the data files, their relationships, and the tool parameters used for each analysis step. It is also possible to save just the commands for the analysis steps taken as a workflow, which can be applied to another dataset or shared with other users. The workflow functionality of Chipster is described in more detail later in this article.
A complete manual for Chipster describing data import, the user interface and the individual analysis tools is available on the web [2]. It also contains step-by-step tutorials which cover the whole analysis from data import to downstream applications such as pathway enrichment, using publicly available datasets. While helpful for individual users getting started with Chipster, the tutorials can also serve as teaching material in microarray data analysis courses. Several Chipster training sessions are organized every year in different locations; the details can be found on the Chipster website.
Visualizations
Visualizing data and inspecting it by eye is one of the most powerful ways of finding patterns that are interesting for further analysis. We have therefore put a lot of effort into providing rich and powerful visualizations in Chipster. Currently there are about 25 different visualizations, which are divided into two categories: interactive visualizations generated by the client program, and static images generated by R/Bioconductor on the server. Both types of visualizations are viewed in the visualization panel (Figure 2). This panel can be maximized if more area is required for viewing, or detached as a separate window if several visualizations need to be viewed simultaneously.
Chipster's interactive visualizations include 2D and 3D scatter plots, histogram, expression profiles, array layout, volcano plot, Venn diagram, heatmap and self-organizing map clustering (SOM) visualization. In addition to zooming and changing titles, colors, etc., the interactive visualizations allow users to select datapoints and create new gene lists based on these selections. There is cross-talk between the different visualization methods, so that datapoints selected in one visualization are highlighted when the same data is visualized using another method. All interactive visualizations can be saved in PNG format by right-clicking on the image.
R/Bioconductor provides a wide variety of visualizations for microarray data, many of which are available in Chipster. These include box plot, density plot, heatmap, correlogram, annotated dendrogram, MA plot, idiogram, quality control plots, gene set enrichment plots, and several visualizations for array comparative genomic hybridization (aCGH) data. As opposed to the interactive visualizations generated by the Chipster client program, the images generated by R/Bioconductor are static, although in many of them the user can change the sample names by entering the desired names in the phenodata file as described above.
Automatic workflows speed up analysis and enable reproducible and collaborative research
Microarray data analysis typically involves performing several analysis steps and trying different parameter settings. Once a suitable combination has been found and analysis completed, it is often desirable to save the steps taken as an automatic workflow. Reusing workflows serves many purposes. Firstly, it saves time as multi-step analysis can be executed with just one mouse click. Sharing workflows within a research group brings consistency to analysis and provides an easy way for bioinformaticians to help biologists. Sharing workflows in a wider context is also beneficial as providing a downloadable workflow file facilitates the reproduction of published results and increases the collaboration of the bioinformatics community in general.
The need for automatic workflows is widely recognized and many programs such as GenePattern, Taverna and Galaxy [5–7] provide different approaches towards this goal, ranging from pure workflow enactment engines to analysis software with web forms for workflow construction. In Chipster we have taken an approach where, instead of specifically constructing workflows, the user performs the analysis normally. The system keeps track of the analysis steps taken, and displays them visually in the Workflow panel (Figure 2). The user can experiment with different methods and parameters, and prune the resulting workflow by deleting the unwanted steps. When a satisfactory analysis pipeline is ready, the user simply clicks on the desired beginning point of it in the workflow panel and saves the workflow. The workflow is saved as a file, which contains instructions to run certain analysis tools with the selected parameter settings in a certain order. Importantly, Chipster also supports branched workflows, as real life analysis workflows are seldom simple linear sequences of steps.
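The record-then-save approach can be illustrated with a small sketch. The class and method names below (`StepRecorder`, `record`, `prune`, `extract_workflow`) and the tool and file names are hypothetical and are not Chipster's API. The sketch only shows how recorded steps form a branched graph from which a workflow can be extracted, starting at a chosen dataset.

```python
from collections import defaultdict


class StepRecorder:
    """Hypothetical recorder of analysis steps; not Chipster's implementation."""

    def __init__(self):
        # Maps an input dataset to the steps that were run on it.
        self.children = defaultdict(list)

    def record(self, source, tool, params, output):
        """Remember that `tool` with `params` turned `source` into `output`."""
        self.children[source].append(
            {"tool": tool, "params": params, "output": output}
        )

    def prune(self, output):
        """Delete the step that produced `output` and everything derived from it."""
        for steps in self.children.values():
            steps[:] = [s for s in steps if s["output"] != output]
        for step in self.children.pop(output, []):
            self.prune(step["output"])

    def extract_workflow(self, start):
        """List the steps reachable from `start` in run order; branches are kept."""
        ordered = []
        queue = [start]
        while queue:
            dataset = queue.pop(0)
            for step in self.children.get(dataset, []):
                ordered.append((dataset, step["tool"], step["params"], step["output"]))
                queue.append(step["output"])
        return ordered


rec = StepRecorder()
rec.record("raw.cel", "normalize", {"method": "rma"}, "normalized.tsv")
rec.record("normalized.tsv", "filter-sd", {"percentage": 0.9}, "filtered.tsv")
rec.record("normalized.tsv", "pca", {}, "pca.tsv")
rec.record("filtered.tsv", "two-group-test", {"test": "empiricalBayes"}, "de-genes.tsv")

rec.prune("pca.tsv")  # discard an exploratory branch
workflow = rec.extract_workflow("normalized.tsv")  # save from a chosen start point
```

Because the steps are stored per input dataset, a dataset with several recorded steps naturally yields a branched workflow rather than a linear one.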
Users can easily apply a workflow to another dataset, or share it with other Chipster users by giving them a copy of the workflow file. In addition to the user-made workflows, Chipster provides ready-made workflows for finding and analyzing differentially expressed genes, miRNAs and proteins. The user can continue analysis from the workflow results as normal, so they don't restrict the analysis in any way but can be used rather as a backbone.
The primary goal of Chipster's workflow functionality is to enable non-programming users to construct workflows. However, users with programming experience can extend the Java BeanShell code of a workflow file with any functionality desired: the workflow environment is a complete programming environment and the functionality of the client can be accessed using a workflow programming interface.
Analysis functionality
Data import and supported array types
Chipster is able to import any tab-delimited data. While Affymetrix CEL-files and Illumina BeadStudio/GenomeStudio-files are recognized automatically, other files are imported using an Import tool, which allows the user to specify the data columns corresponding to identifiers, sample and background intensities, etc. Chipster offers the possibility to import data not only from the user's computer, but also directly from public databases such as ArrayExpress [8], Gene Expression Omnibus (GEO) [9], and CanGEM [10], and from a given URL.
It is important to note that while the tools for preprocessing, statistics, clustering and visualizations work for any tab-delimited data, tools for annotation, pathway and promoter analysis require annotation information for the array. Chipster has annotation packages for most Affymetrix expression arrays (3', gene and exon arrays), all Illumina expression arrays and the human 27 k methylation array, and the most common Agilent expression arrays. In addition, rudimentary support is offered for Affymetrix and Illumina SNP arrays. For aCGH arrays it is essential to know the exact genomic coordinates for the probes, and Chipster has a dedicated tool for fetching these annotations from the CanGEM database [10]. For a full list of supported array types, please see the website [2]. Annotation packages for new arrays can be created using the AnnotationDbi package offered in the Bioconductor project.
Normalization
Chipster is capable of normalizing most of the commonly used chip types. It has dedicated normalization tools for Affymetrix 3', gene and exon arrays, Illumina arrays, and Agilent 1- and 2-color arrays. Chipster also offers a general normalization tool for cDNA arrays that can be used for normalizing other 2-color data. Similarly, the Agilent 1-color tool can be used for normalizing other 1-color data. The actual normalization methods, such as Robust Multi-array Average (RMA), Li-Wong (dChip), loess, quantile, robust spline and variance stabilizing normalization, are implemented as parameters of the tools [11, 12].
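As an illustration of one of the methods named above, the following is a minimal sketch of quantile normalization. It is not Chipster's code (Chipster delegates normalization to R/Bioconductor); the function name and the toy data are assumptions, and ties are handled naively by sort order.

```python
def quantile_normalize(columns):
    """Quantile-normalize arrays given as equal-length lists (one list per array).

    Tied values are ranked by sort order; production implementations
    typically average over ties instead.
    """
    n_probes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Probe indices ordered from the smallest to the largest intensity.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy arrays with three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array has the same set of values, only assigned to different probes, which is what makes the intensity distributions identical across arrays.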
It has been shown that a significant number of probes on several Affymetrix and Illumina arrays map to different genes than indicated by the manufacturer [13–16]. As remapping probes to the current genome and transcriptome databases has been shown to improve the interpretation of gene expression data, Chipster's normalization tools offer the possibility to use the remapped information. For Affymetrix' 3'-expression arrays the user can decide whether to use the alternative mappings (altCDFs) in the summarization step. For Affymetrix exon and gene arrays and for Illumina arrays the remappings are used automatically. The first case study of this article demonstrates how to apply the alternative mappings for Affymetrix' 3'-expression arrays.
After the initial normalization using a platform-specific tool, the data can be further normalized to specific genes or samples. Chipster also includes a tool for removing random (batch) effects, e.g. where samples cluster according to preparation day instead of the biological groups under study, using a linear mixed modelling approach to the normalization.
Quality control
Chipster has an extensive selection of tools for quality control. These include platform-specific tools, such as plots for RNA degradation, Relative Log Expression (RLE), Normalized Unscaled Standard Error (NUSE), scaling factor summary, percent of present probesets, and quality control probe expression in the case of Affymetrix arrays. The more general tools, such as Principal Component Analysis (PCA), clustering and Non-metric Multi-Dimensional Scaling (NMDS), can also be used for quality control of samples. If quality control tools indicate that certain samples need to be excluded from further analysis, this can be easily accomplished in Chipster by either excluding the deviant samples from the already normalized data or by re-normalizing the acceptable samples. The latter approach is recommended for certain normalization methods such as RMA, which are affected by the context (i.e. a set of arrays).
Filtering
Chipster includes tools for filtering genes by standard deviation, coefficient of variation, inter-quartile range, expression and flags. Another, more versatile way of filtering is to first calculate several descriptive statistics for each gene by using the specific tool for that, and then apply the "Filter using a column value" tool to filter the genes based on any of these. Annotated gene lists can also be filtered based on chromosomal location, pathway terms, etc. Different filters can be combined by using the interactive Venn diagram to create new subsets. Venn diagram can also be used for filtering the dataset with a list of gene identifiers.
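A variability filter of this kind can be sketched as follows. The function name, argument names and toy data are hypothetical, and the sketch assumes that every gene has a non-zero mean; it is not the implementation behind Chipster's filtering tools.

```python
import statistics


def filter_by_cv(expression, drop_fraction):
    """Drop the given fraction of genes with the lowest coefficient of variation.

    `expression` maps a gene identifier to its values across samples.
    """
    def cv(values):
        # Coefficient of variation: standard deviation relative to the mean.
        return statistics.stdev(values) / abs(statistics.mean(values))

    ranked = sorted(expression, key=lambda gene: cv(expression[gene]))
    n_drop = int(len(ranked) * drop_fraction)
    return {gene: expression[gene] for gene in ranked[n_drop:]}


expression = {
    "A": [10.0, 10.1, 9.9],
    "B": [5.0, 10.0, 15.0],
    "C": [8.0, 8.2, 7.9],
    "D": [2.0, 6.0, 4.0],
}
kept = filter_by_cv(expression, drop_fraction=0.5)
```

Replacing `cv` with `statistics.stdev` alone would give the corresponding standard deviation filter.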
Statistical testing
Statistical tools in Chipster can be divided into tests for finding differentially expressed genes, ordination methods and association analysis. Tools for pathway analysis as well as the statistical tools dedicated for aCGH data are described in their own sections below.
Tests for finding differences in mean gene expression between groups are divided into separate tools according to the number of groups to be compared (one group, two groups, several groups). Several tests are available in every tool, and they usually include both parametric tests, such as the t-test, empirical Bayes [17] and ANOVA, and non-parametric tests, such as the Mann-Whitney U and Kruskal-Wallis tests. Chipster also contains separate tools for Significance Analysis of Microarrays (SAM) [18] and Reproducibility-Optimized Test Statistic (ROTS) [19]. A linear modelling tool, an implementation of linear regression modelling, allows analysis of several variables at the same time. It can take into account three main effects and their interactions, as well as technical replicates and pairing, and its use is demonstrated in the first case study of this article.
Ordination methods include PCA, NMDS, and Canonical Correspondence Analysis (CCA). PCA can be performed for either genes or samples, and the results can be visualized as an interactive 3D-scatter plot, where samples can be colored according to any experimental variable defined in the phenodata file.
Association analysis can perform case-control analyses on SNP array data. It tests Hardy-Weinberg equilibrium, and association of the genetic markers with the case-control status using both dominant and recessive models of inheritance.
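For a single biallelic marker, a Hardy-Weinberg check can be sketched with a chi-square goodness-of-fit test on the genotype counts. This is a textbook formulation shown for illustration; the text does not state which test variant Chipster's tool uses, and the function name is hypothetical.

```python
import math


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square test (1 degree of freedom) of Hardy-Weinberg equilibrium.

    Takes the observed genotype counts AA, AB and BB. Monomorphic markers
    (an allele frequency of 0 or 1) must be excluded beforehand, because
    their expected counts are zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2_eq, p_eq = hardy_weinberg_chi2(25, 50, 25)    # counts exactly in equilibrium
chi2_dev, p_dev = hardy_weinberg_chi2(30, 40, 30)  # deficit of heterozygotes
```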
Unsupervised and supervised clustering
Chipster's tools for unsupervised clustering include K-means, hierarchical and quality threshold clustering and SOM. Hierarchical clustering results can be visualized as interactive heatmaps and plain trees, and the reliability can be checked using bootstrapping. For K-means clustering, Chipster includes a separate tool for estimating the optimal number of clusters to generate (K).
Classification or supervised clustering tools include K-nearest neighbor (KNN)-classification and the more versatile general classification. KNN-classification allows validation of classifiers by using either a cross-validation approach or a test set of new samples. The general classification tool offers many more classification methods, such as Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), and Naïve Bayes networks, but it does not allow classifying new samples like the KNN-classification does.
Annotation
Chipster uses annotation packages provided by the Bioconductor project and the BrainArray site [20]. There are two ways to annotate the data: either by generating a separate annotation file or by appending the annotation to the actual data. This latter option allows for filtering genes based on pathway involvement, chromosomal location, or other annotation information.
Pathway and promoter analysis
The pathway tools include gene enrichment analysis for Gene Ontology (GO) terms [21] and KEGG pathways [22] based on the hypergeometric test implemented in the GOstats package [23]. Users can select conditional testing for GO terms in order to avoid redundancy caused by the hierarchical structure of GO. In this mode, the gene list is tested for the most specific GO terms first. If significant terms are found, the genes mapping to these terms are removed before testing for the more general parent terms. As opposed to testing genes individually, the user can also perform gene set tests based on the globaltest package [24] and SAFE [25], which calculate a test statistic per GO category or KEGG pathway taking into account the expression levels of the genes. In addition to these tools running on the actual Chipster server, pathway tools running elsewhere are also offered in the Chipster client program. These include over-representation analysis with ConsensusPathDB provided by the Max-Planck Institute. ConsensusPathDB integrates functional interaction data from 20 databases covering protein-protein, metabolic, signalling and gene regulatory interaction networks [26], thus providing a powerful and combinatorial approach to pathway analysis.
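The hypergeometric test underlying this kind of enrichment analysis can be sketched as below. The function and argument names are hypothetical, the sketch covers only the plain (unconditional) test, and it is not the GOstats implementation.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, selected, overlap):
    """Probability of seeing at least `overlap` annotated genes by chance.

    universe  -- number of genes on the array (the background)
    annotated -- background genes annotated to the category
    selected  -- size of the gene list being tested
    overlap   -- genes in the list that are annotated to the category
    """
    total = comb(universe, selected)
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 in the category, 3 selected, 2 of them annotated.
p = hypergeom_enrichment_p(universe=10, annotated=4, selected=3, overlap=2)
```

In the conditional mode described above, genes that map to significant child terms would be removed from both the gene list and the background before the parent term is tested.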
The promoter analysis tools in Chipster offer a possibility to search for common sequence motifs with Weeder [27] or Cosmo [28], or search for known transcription factor binding motifs using the JASPAR matrices [29]. Transcripts are linked to the corresponding promoter sequences using RefSeq accession numbers. Promoter sequences for human, mouse, rat, drosophila and yeast are obtained from the UCSC genome browser [30].
miRNA analysis
The tools for miRNA analysis are applicable to most miRNA arrays including Agilent and Exiqon, as long as the data includes miRNA systematic names which Chipster uses as identifiers. The user can retrieve miRNA target genes from six different databases, run pathway enrichment analysis for the targets, and correlate miRNA expression with matching gene expression data if available.
aCGH data analysis
Chipster contains a comprehensive collection of tools for analysing DNA copy number data measured by aCGH. The tools include calling copy number aberrations (gains and losses) [31, 32], identifying commonly aberrated regions [33], removing wavy artifacts from aCGH profiles [34], and measuring known copy number variation for the areas of interest (probes, genes or chromosomal regions) from the Database of Genomic Variants [35]. Dedicated tools are also available for clustering [36], group comparisons [37], and hypergeometric tests for enriched GO categories. These take into account the specific characteristics of aCGH data, and are therefore more suitable than the equivalent tools developed for gene expression studies. Importantly, it is also possible to integrate aCGH data with expression data to assess expression changes induced by aberrated gene copy numbers [38]. The third case study of this article demonstrates how to integrate aCGH data with gene expression data in Chipster.
As the mapping of microarray probes to their genomic coordinates is essential for all aCGH data analysis, this information can be downloaded from CanGEM, which is a public database focusing on aCGH microarray data [10]. These mappings have been obtained from probe sequences using MegaBlast [39] and are available for different builds of the human genome. Direct importing of entire data sets from CanGEM is also supported.
Data export to public databases and other software
In addition to analysis sessions, individual data files can also be exported from Chipster in a tabular text format at any time. These files are suitable for use in many third-party software packages. Chipster can also export data in a suitable format for uploading to the ArrayExpress [8] and GEO [9] databases.
Case studies demonstrating Chipster's analysis and visualization tools
In this section we present three case studies to illustrate the merits of some data analysis and visualization options in Chipster, such as linear modelling, alternative probe mappings, and data integration. The analysis sessions of these case studies are available for download [40] and further inspection in Chipster.
Using linear modelling to analyze several factors simultaneously
This case study demonstrates how to apply the linear modelling tool to a biological problem using data from the case-control study published by Lenburg et al. [41]. They compared renal cell carcinoma tissue samples with healthy tissue from the same patient, which effectively introduces a pairing structure to the data. We will model the pairing explicitly here, and also include the gender of the individual and the side of the affected kidney (left or right) as independent variables in the model. In this example we also show how to apply alternative probe mappings for Affymetrix data, in this case for the U133A arrays.
The CEL-files for the 17 samples were imported to Chipster and the quality of the data was checked using the Affymetrix-specific quality control tools including RLE and NUSE. As no deviant arrays were identified, all the arrays were retained in the dataset and normalized using the RMA method and the alternative probe mappings (altCDFs). Using altCDFs for the summarization step practically halved the number of probesets, reducing it from 22 283 to 12 133. Next the experimental setup was described using the phenodata file, which was generated during normalization. The variable corresponding to the most interesting hypothesis (here, case versus control) was coded in the group column. All the other variables of interest such as gender, side and pairing were added as new columns to the phenodata and coded with numbers. Several quality controls including PCA, NMDS and dendrogram run on the normalized data showed that the sample groups separate well from each other. Affymetrix control probes and 90% of the genes that showed the lowest coefficient of variation were removed using the tools "Search by gene name" and "Filter by CV", respectively. Chipster's filtering tools "Filter by CV" and "Filter by standard deviation" allow users to set the filtering percentage according to their needs. We used a relatively high level of stringency in this and the following case studies in order to focus on the more prominent changes in expression and to minimize false positive findings in the downstream analyses.
The genes that are differentially expressed between cases and controls, males and females, or left and right kidneys, can be analysed using tests suitable for comparing two groups. However, this is a suboptimal solution, since possible interactions between the variables cannot be tested, and the effect of interest can be masked by confounding variables. To address this we used the linear modelling tool in Chipster to build a linear regression model that allows us to include all the variables in the same analysis and to take the pairing structure into account. Chipster's linear modelling tool is an implementation of the limma package [17] from the Bioconductor project. The case-control status, gender and side of the kidney were included as main effects and the patient was included as pairing. All variables were treated as categorical variables (factors). Thus, a single model containing all of these terms was fitted to each gene.
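Written in symbols, a per-gene model with these terms could take the form below. The notation is an assumption made for illustration and is not taken from the original analysis. In particular, the text does not state whether the patient term enters as a fixed blocking factor, as written here, or is handled in another way.

```latex
y_{gi} = \mu_g
       + \alpha_{g,\mathrm{group}(i)}
       + \beta_{g,\mathrm{gender}(i)}
       + \gamma_{g,\mathrm{side}(i)}
       + \pi_{g,\mathrm{patient}(i)}
       + \varepsilon_{gi}
```

Here y_{gi} denotes the normalized expression of gene g in sample i, \mu_g is the baseline expression of the gene, the \alpha, \beta and \gamma terms are the effects of case-control status, gender and kidney side, \pi is the patient (pairing) term, and \varepsilon_{gi} is the residual error.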
The Benjamini and Hochberg false discovery rate (FDR) correction was applied to the p-values to adjust them for multiple comparisons.
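The Benjamini and Hochberg adjustment itself is compact enough to sketch. The function below is an illustrative standalone version with a hypothetical name, not the code that runs on the Chipster server.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.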
Results for the case-control comparison were visualized using the interactive volcano plot, where the x-axis contains the log2-transformed fold change values, and the y-axis contains the -log10-transformed p-values (Figure 3). The linear modelling result was filtered for p-values using the tool "Filter using a column value". In total, 839 genes were statistically significantly differentially expressed (p-value < 0.05) between the cases and controls, 20 genes were significant for the gender comparison, and no genes were significant for the comparison between left and right kidneys. The list of genes that were up-regulated in cancer (378 genes) was enriched for GO categories Blood vessel development (GO:0001568) and Response to hypoxia (GO:0001666), as judged by the tool "Hypergeometric test for GO". Similarly, enrichment for HIF1-alpha transcription factor network and several adhesion pathways was indicated by the tool "Hypergeometric test for ConsensusPathDB". These results are consistent with the fundamental role of angiogenesis in the renal cell carcinoma pathogenesis [42].
In contrast to the analysis conducted by Lenburg et al., our results for the case-control comparison are adjusted for the other variables in the model. In other words, the results given for the case-control comparison take into account additional knowledge of the samples such as gender, side of the kidney and the patient. Lenburg et al. reported 1211 UniGene clusters and 23 unannotated probesets (corresponding to 851 unique gene symbols) that had changed more than three-fold. In order to compare their result to ours, the differentially expressed genes were filtered for fold change using the tool "Filter using a column value". The list of more than three-fold changed genes (280) was then compared to that of Lenburg et al. in the interactive Venn diagram visualization, using gene symbol as the common identifier. Only 191 genes were common to both datasets. In addition to the different analysis methodology, this difference probably reflects the use of remapped probes, which has been shown to cause up to 50% discrepancy in genes previously identified as differentially expressed [13]. Interestingly, the 89 genes detected only by Chipster included genes involved in hypoxia response (ADM, ALDOC and DDIT4), cell migration (COL1A2), and cell proliferation (PDGFD). Taken together, Chipster's linear modelling tool and alternative probe mappings enabled us to find additional genes potentially relevant to renal cell carcinoma, while keeping false positive findings due to outdated probe mappings to a minimum.
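The two volcano plot axes can be computed per gene as in the sketch below. It assumes that the group means are already on the log2 scale, as is the case after RMA normalization, so that the log2 fold change is a simple difference. The function name and the example values are hypothetical.

```python
import math


def volcano_point(mean_case_log2, mean_control_log2, p_value):
    """Return the (x, y) coordinates of one gene in a volcano plot."""
    # Difference of log2 means equals the log2 of the fold change.
    x = mean_case_log2 - mean_control_log2
    # Small p-values map to large positive y values.
    y = -math.log10(p_value)
    return x, y


x, y = volcano_point(mean_case_log2=9.0, mean_control_log2=7.0, p_value=0.001)
```

Genes that are both strongly changed and highly significant therefore end up in the upper left and upper right corners of the plot.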
Analyzing a prenormalized dataset: Comparing gene expression between two populations
In this example we demonstrate how to analyse prenormalized data in Chipster by using expression data from the study by Stranger et al. [43]. They performed gene expression profiling of Epstein-Barr virus-transformed lymphoblastoid cell lines of the 270 individuals genotyped in the HapMap Consortium using Illumina's WG-6 version 1 arrays. In this example we compare gene expression in the European (CEU) and African (YRI) populations using a subset of 120 samples (parents only).
Normalized data from the Genevar site [44] of the Sanger Institute were imported to Chipster using the Import tool. The data was converted to Chipster format and the phenodata was created by using the tool "Process prenormalized". The population was indicated with numeric codes (CEU = 1, YRI = 2) in the group column of the phenodata, and the population codes (CEU and YRI) were entered in the description column in order to use them as sample labels in visualizations.
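The resulting phenodata can be pictured as a small tab-delimited table. In the sketch below only the group and description columns and the numeric coding are taken from the text. The name of the first column, the sample identifiers and the function name are assumptions, and the real phenodata file may contain further columns.

```python
import csv
import io


def build_phenodata(sample_populations):
    """Build a tab-delimited phenodata table from (sample, population) pairs."""
    codes = {"CEU": 1, "YRI": 2}  # numeric group codes used in this case study
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # "sample" is a hypothetical header; "group" and "description" are from the text.
    writer.writerow(["sample", "group", "description"])
    for sample, population in sample_populations:
        writer.writerow([sample, codes[population], population])
    return buffer.getvalue()


# Hypothetical sample identifiers.
phenodata = build_phenodata([("sample_01", "CEU"), ("sample_02", "YRI")])
```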
Differential expression between the populations was visualized using the NMDS tool, which produces a two-dimensional map (Figure 4) based on sample dissimilarity calculated using Euclidean distance. As is instantly evident from the image, the YRI samples are more to the top-left of the image, and the CEU samples more to the bottom-right, indicating that there are differences in gene expression between the populations. The samples were also visualized in a 3-dimensional interactive scatterplot using the three most significant components from a PCA analysis. Again, it was noted that samples clearly segregated according to population, but no further sample clustering could be observed upon close examination of the data points along any axis and direction of view, suggesting that no additional underlying sample characteristics exhibited any major impact on the expression patterns.
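The input to an ordination such as NMDS is a matrix of pairwise sample dissimilarities. The sketch below computes Euclidean distances only and does not perform the NMDS itself; the function name, sample labels and values are hypothetical.

```python
import math


def euclidean_distance_matrix(samples):
    """Return pairwise Euclidean distances between expression profiles.

    `samples` maps a sample label to its list of expression values;
    all lists must have the same length and gene order.
    """
    names = list(samples)
    dist = {a: {} for a in names}
    for a in names:
        for b in names:
            dist[a][b] = math.dist(samples[a], samples[b])
    return dist


profiles = {"CEU_1": [0.0, 0.0], "CEU_2": [0.0, 1.0], "YRI_1": [3.0, 4.0]}
dist = euclidean_distance_matrix(profiles)
```

An ordination method then looks for a low-dimensional arrangement of the samples whose distances preserve the rank order of these dissimilarities as well as possible.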
Differentially expressed genes were analysed using the empirical Bayes test, after filtering out 95 percent of the probes that showed the lowest standard deviation. 1601 probes corresponding to 1233 known genes were statistically significantly differentially expressed between the populations at the 5% false-discovery rate. In order to gain functional insight, the differentially expressed genes were analysed for enrichment in GO categories for biological process using the tool "Hypergeometric test for GO" with default parameter settings. Interestingly, the most enriched category was immune response. The list of differentially expressed genes was further filtered on fold change using the tool "Filter using a column value". Only 75 probes corresponding to 45 known genes showed a fold change higher than 2. Taken together, it seems that gene expression differences between populations are commonplace, but most of the differences are very subtle.
Integrating DNA copy number and gene expression data
This third case study illustrates the integration of aCGH and mRNA data to assess expression changes induced by DNA copy number aberrations. As the aberrations typically also contain bystander genes in addition to the driving ones, integration with expression data helps to identify the potential cancer genes. We used 32 breast cancer samples with matching aCGH data [45] and expression data [46]. This is a subset of the original study containing 106 samples, because we were able to pair data only for 32 samples using the supplementary material of the referred articles. Attempts to obtain the pairing information from the original authors were also unsuccessful.
The Agilent 4x44K aCGH data was normalized using the Agilent 2-color normalization tool with normexp background correction (offset 50) and loess normalization [1]. The Affymetrix U133A expression data was GCRMA normalized [1], and 75% of the probesets with the lowest standard deviation were filtered out. Quality of the two data sets was checked with respective quality control tools, and since no deviant samples were observed, all arrays were retained. In order to enable the integration of the copy number and expression data, the Agilent probes and Affymetrix probesets were annotated with their chromosomal positions using the tool "Fetch probe positions from CanGEM" [10].
aCGH profiles typically show a wavy artefact related to their GC content. This pattern can be removed by using clinical genetics samples measured on the same array platform as calibration data [34]. We applied the tool "Smooth waves from normalized aCGH data" using a calibration dataset of mental retardation samples [47] which had been previously normalized using the same settings as described for the aCGH data above. Smoothed log ratios were then analyzed with the tool "Call copy number aberrations from aCGH data" [31, 32] to detect gains and losses. The aCGH data set was studied further by identifying commonly aberrated regions [33], which showed most frequent gains in 8q and 1q. The amount of known copy number variation (CNV) within these regions was measured with the tool "Count overlapping CNVs" [35], which annotates the data with two metrics: the number of reported CNVs that overlap with the region of interest, and the proportion of base pairs that falls within the reported CNVs. These values were compared to the mean and median across the whole genome, obtained by running the tool "Calculate descriptive statistics".
In order to assess expression changes induced by DNA copy number aberrations, the aCGH and mRNA data sets were first integrated using the tool "Match copy number and expression probes", which locates the closest copy number probe for each expression probeset. It also generates a heatmap showing the two data sets organized by chromosomal position. The effect of copy number changes on mRNA expression levels was then evaluated by a permutation-based non-parametric test [38] implemented in the tool "Test for copy number induced expression changes" using the default parameter settings. Probesets with a p-value smaller than 0.05 were selected with the tool "Filter using column value". Our analysis identified 884 genes (corresponding to 1087 Affymetrix probesets) which showed copy number induced expression changes. In the original paper, Andre et al. [45] highlighted a list of 20 frequently amplified genes, 15 of which showed significant correlation between expression and copy number. Chipster detected nine of these genes: BRF2, DDHD2, EIF4EBP1, ERBB2, ERLIN2, FGFR1, GRB7, LSM1, and RAB11FIP1.
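The matching step, locating the closest copy number probe for each expression probeset, can be sketched with a binary search over sorted probe positions. The sketch assumes that each probe is represented by a single coordinate on one chromosome; the function name and the coordinates are hypothetical, and this is not the code of the Chipster tool.

```python
import bisect


def match_closest_probe(cn_positions, expr_position):
    """Return the copy number probe position nearest to an expression probeset.

    `cn_positions` must be a non-empty sorted list of positions on the same
    chromosome as `expr_position`. If two probes are equally close, the one
    with the lower coordinate is returned.
    """
    i = bisect.bisect_left(cn_positions, expr_position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = cn_positions[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda pos: abs(pos - expr_position))


cn_probes = [100, 500, 900]
nearest = match_closest_probe(cn_probes, 450)
```

Using a binary search keeps the matching fast even when tens of thousands of probes have to be paired on each chromosome.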
The resultant gene list was explored further using different filters. As ERBB2 is a well-known breast cancer gene, we filtered the gene list for involvement in the ERBB2 signaling pathway by using the tool "Extract genes from KEGG pathway". Five such genes were found, in addition to ERBB2 itself. We also filtered the gene list for effect size (the amount of differential gene expression induced by the copy number difference), and for the coefficient of determination, R² (the proportion of variation in gene expression explained by copy number change). There were 25 genes for which the effect size of the DNA copy number on the gene expression was higher than two and explained over 50% of the variation in gene expression. Interestingly, one of these genes was TOB1 (Transducer of ErbB2 1), which has been recently implicated in breast cancer metastasis [48]. The relation between the copy number and expression data for TOB1 was illustrated using the tool "Plot copy number induced gene expression" (Figure 5). Taken together, these results demonstrate Chipster's ability to identify potential cancer related genes. While the integration method used by Andre et al. simply divides the samples into two groups based on DNA copy number calls, the method implemented in Chipster also takes into account the probabilities with which these calls are made (sometimes referred to as "soft calls"), which has been shown to yield improved results [38].