A user-friendly workflow for analysis of Illumina gene expression bead array data available at the arrayanalysis.org portal
BMC Genomics volume 16, Article number: 482 (2015)
Illumina whole-genome expression bead arrays are a widely used platform for transcriptomics. Most of the tools available for the analysis of the resulting data are not easily applicable by less experienced users. ArrayAnalysis.org provides researchers with an easy-to-use and comprehensive interface to the functionality of R and Bioconductor packages for microarray data analysis. As a modular open source project, it allows developers to contribute modules that provide support for additional types of data or extend workflows.
To enable data analysis of Illumina bead arrays for a broad user community, we have developed a module for ArrayAnalysis.org that provides a free and user-friendly web interface for quality control and pre-processing for these arrays. This module can be used together with existing modules for statistical and pathway analysis to provide a full workflow for Illumina gene expression data analysis.
The module accepts data exported from Illumina’s GenomeStudio, and provides the user with quality control plots and normalized data. The outputs are directly linked to the existing statistics module of ArrayAnalysis.org, but can also be downloaded for further downstream analysis in third-party tools.
The Illumina bead arrays analysis module is available at http://www.arrayanalysis.org. A user guide, a tutorial demonstrating the analysis of an example dataset, and R scripts are available. The module can be used as a starting point for statistical evaluation and pathway analysis provided on the website or to generate processed input data for a broad range of applications in life sciences research.
Illumina bead arrays are a popular choice for array-based genome profiling studies. Although Next Generation Sequencing technology is on the rise, microarray-based gene expression profiling is still widely utilized due to its ease of use, robust performance, reproducibility, and low per-sample cost. Furthermore, open data repositories (e.g. ArrayExpress and Gene Expression Omnibus) contain a vast number of microarray experiments, which are often re-analyzed, integrated, or combined with newly generated data in the context of modern integrated systems biology research. This process is facilitated by easy access to streamlined processing. To extract biologically meaningful information from genome profiling experiments, generated data first needs to be quality checked, filtered, pre-processed and statistically analyzed. Having these basic analysis steps at a user’s disposal is essential for an effective and iterative research process. As gene expression profiling experiments are typically designed, performed, and interpreted by biological domain experts rather than bioinformaticians, it is important to enable these researchers to independently operate basic analysis pipelines. Pipelines with a user interface that provides immediate and intuitive feedback are of great interest for increasing efficiency and effectiveness of the research process. Besides proprietary vendor-provided software (BeadStudio, GenomeStudio) and the open-source software illuminaio, several pre-processing and quality control (QC) methods for Illumina bead arrays are available (beadarray; lumi; limma). However, using these methods requires extensive bioinformatics skills, so they are not readily accessible to the broad research community.
To extend the utility of analysis workflows for Illumina bead arrays to non-bioinformaticians, we have created an open-source, user-friendly workflow, accessible via the web interface of ArrayAnalysis.org, that combines the functionality of Bioconductor packages for essential quality control and pre-processing with statistical functions and downstream analysis.
The relevance of analysis workflows for Illumina bead arrays that are friendly to a wide range of researchers has been recognized by several other bioinformatics developers, resulting in the availability of tools and pipelines related to our work (e.g. Chipster, MadMax, IlluminaGUI). Nevertheless, our module for ArrayAnalysis.org makes a significant contribution to the research community, as it offers an easily accessible alternative that does not require local installs. For instance, Chipster provides similar functionality but requires local software installation and availability of specific Java versions; MadMax is not open source and requires login credentials to be provided by the developers; and IlluminaGUI requires a local install of R and its support has been discontinued. Therefore, our web interface-based workflow is a convenient resource for free, fast and user-friendly analysis of Illumina bead arrays by a broad community of researchers, regardless of their bioinformatics skill level or research budget.
The Illumina QC and pre-processing module was developed to complement and link to previously created modules for analysis of microarrays, available at www.arrayanalysis.org. The Illumina module has been implemented as a wizard guiding the users through the different steps and is connected in an ArrayAnalysis workflow to downstream modules for statistics and pathway analysis. Figure 1 shows an overview of the steps of the Illumina module and its use together with other modules and software.
The module was implemented using R and the Bioconductor packages for Illumina analysis, lumi and limma, to provide the user with the most commonly used analysis options. Using the lumi package, we implemented various types of background correction (e.g. ‘none’, ‘bgAdjust’, ‘forcePositive’), variance stabilization (‘vst’ (variance-stabilizing transformation), ‘log2’, ‘cubicRoot’) and normalization. Additionally, the neqc method from the limma package has been included, which performs a background correction using a normal-exponential modeling approach, followed by a quantile normalization of all regular and control probes together and a log2-transformation of the dataset. After normalization, probes with intensities below detection level can be removed to speed up the processing and to reduce false positives.
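To make the normalization step concrete, the following minimal Python sketch applies a log2 transformation followed by quantile normalization to a small probes-by-arrays matrix. It is an illustration only and not the module's code (the module calls lumi and limma in R). The function names `log2_transform` and `quantile_normalize` and the toy intensities are assumptions, background correction is omitted, and ties are handled naively by sort order.

```python
import math


def log2_transform(matrix):
    """Return the log2 of every intensity; all values must be positive."""
    return [[math.log2(v) for v in row] for row in matrix]


def quantile_normalize(matrix):
    """Quantile-normalize a probes-by-arrays matrix (a list of rows).

    Each array (column) is sorted, the mean across arrays is taken at
    every rank, and each value is replaced by the mean for its rank.
    Ties are broken by sort order, which is a simplification.
    """
    n_probes = len(matrix)
    n_arrays = len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_probes)] for j in range(n_arrays)]
    # Probe indices of each array, ordered by increasing intensity.
    orders = [sorted(range(n_probes), key=col.__getitem__) for col in columns]
    # Mean intensity across arrays at each rank.
    rank_means = [
        sum(columns[j][orders[j][r]] for j in range(n_arrays)) / n_arrays
        for r in range(n_probes)
    ]
    normalized = [[0.0] * n_arrays for _ in range(n_probes)]
    for j in range(n_arrays):
        for r, i in enumerate(orders[j]):
            normalized[i][j] = rank_means[r]
    return normalized


# Toy raw intensities: 4 probes (rows) measured on 3 arrays (columns).
raw = [
    [64.0, 128.0, 32.0],
    [256.0, 512.0, 1024.0],
    [16.0, 64.0, 8.0],
    [1024.0, 2048.0, 256.0],
]
normalized = quantile_normalize(log2_transform(raw))
```

After this step every array has the same intensity distribution, while the rank order of probes within each array is preserved.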
Five types of quality control (QC) plots are implemented: (1) density plots and (2) boxplots of the log-intensity distributions of all arrays on a single graph, facilitating comparison of signals between arrays and identification of arrays with deviating distributions; (3) a correlation coefficient plot, representing correlations between all pairs of arrays in the dataset as a colored matrix; (4) a principal component analysis (PCA) plot, providing another view of the correlations of expression between arrays: the data are projected on several axes (or components) that explain the largest amounts of variance; (5) a hierarchical clustering plot that can be generated using various distance metrics (Pearson, Spearman, or Euclidean) and clustering methods (Ward, McQuitty, average, median, single, complete, or centroid), and is used to inspect the groupings of the samples. All plots use consistent colors for arrays and experimental groups and can be generated for both raw and pre-processed data, which helps to assess whether the pre-processing step corrects possible aberrations.
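The numbers behind the correlation coefficient plot, and behind a Pearson-based distance for hierarchical clustering, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the module's implementation: the array names, the toy log-intensities, the helper names `pearson` and `correlation_matrix`, and the choice of 1 - r as the distance are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(arrays):
    """Correlation between all pairs of arrays, as a nested dictionary."""
    names = list(arrays)
    return {a: {b: pearson(arrays[a], arrays[b]) for b in names} for a in names}


# Toy log2 intensities of five probes on three hypothetical arrays.
arrays = {
    "control_1": [6.0, 8.0, 4.0, 10.0, 7.0],
    "control_2": [6.2, 8.1, 4.3, 9.8, 7.1],
    "treated_1": [6.1, 9.5, 4.2, 8.0, 7.9],
}
corr = correlation_matrix(arrays)

# One possible correlation-based distance for clustering: 1 - r.
distance = {a: {b: 1.0 - corr[a][b] for b in corr[a]} for a in corr}
```

In this toy example the two control arrays correlate more strongly with each other than with the treated array, which is the kind of pattern the colored matrix and the clustering plot are meant to reveal.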
The Illumina identifiers are converted to equivalent nucleotide universal identifiers (nuIDs) based on their probe sequence. After quality control and pre-processing, the nuIDs are used to add annotation (e.g. gene symbol and Entrez Gene identifier) to the processed result tables.
Results and discussion
When running the Illumina workflow, the user is guided through the different analysis steps via a web-based user interface. At the first step, the user is prompted to upload a summarized probe-level data file and optionally a control probe data file, both of which are output of Illumina’s BeadStudio/GenomeStudio software. The user may choose to perform all pre-processing steps within our workflow (recommended), or to provide already background-subtracted data. Both summarized probe-level and summarized gene-level input data are supported. Summarized probe-level data is recommended as input, as this avoids improper combination of the expression values of different probes into a single-gene value.
In the second step, the user can annotate the imported samples with custom sample names and experimental group names, either by uploading a sample description file or by entering the sample description information manually via the web-based interface.
The third step summarizes the information about the uploaded data and provides the user with the option to enter an email address for notification when the workflow has finished.
The fourth step performs background correction and normalization of the user’s data. This encompasses the removal of per-array technical effects, which ensures that the values being further analyzed reflect underlying biology. Three actions are typically performed to achieve this: (i) background correction, (ii) between-array normalization and (iii) data transformation (typically a log2-transformation). The user may choose between two popular pre-processing approaches that implement these actions for Illumina data: (a) lumiExpresso from the lumi Bioconductor package, or (b) neqc from the limma package. The user can also choose which types of plots are to be created and whether probes with intensities below detection level are to be filtered out.
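The optional filtering of probes below detection level can be illustrated with a short Python sketch. It assumes that each probe comes with one detection p-value per array, where small values indicate a detected signal. The threshold of 0.01, the rule of keeping probes detected on at least one array, the function name `filter_detected` and the probe identifiers are assumptions for illustration and may differ from the module's actual settings.

```python
def filter_detected(expression, detection_pvals, threshold=0.01, min_arrays=1):
    """Keep probes detected on at least `min_arrays` arrays.

    `expression` and `detection_pvals` map a probe identifier to a list
    with one value per array. A probe counts as detected on an array when
    its detection p-value is below `threshold` (assumed convention).
    """
    kept = {}
    for probe, values in expression.items():
        n_detected = sum(1 for p in detection_pvals[probe] if p < threshold)
        if n_detected >= min_arrays:
            kept[probe] = values
    return kept


# Hypothetical probes with normalized log2 intensities on three arrays.
expression = {
    "probe_A": [10.2, 10.5, 9.9],
    "probe_B": [6.1, 6.0, 6.2],
    "probe_C": [7.4, 6.3, 6.2],
}
detection_pvals = {
    "probe_A": [0.000, 0.000, 0.001],  # detected on all arrays
    "probe_B": [0.450, 0.620, 0.380],  # never detected
    "probe_C": [0.004, 0.210, 0.330],  # detected on one array
}
filtered = filter_detected(expression, detection_pvals)
```

Raising `min_arrays` makes the filter stricter; with `min_arrays=2` only the first probe of this toy example would be retained.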
Upon completion of the run, the user receives a link to download a zip archive of results, either in the web interface or by email. If the QC diagnostic plots show arrays of insufficient quality, the pre-processing procedure may be repeated after exclusion of those arrays. Otherwise, the user can immediately proceed with the next module of the workflow to perform statistical analysis. Via a web interface, the existing statistics module prompts the user to specify which experimental groups are to be compared (e.g. treated versus control) or to define any custom comparison of interest. After the choices have been submitted, this module runs limma model fitting to compute a table of relevant statistics, including estimated coefficients (effect sizes) and their significance. Results from the statistics module can then be used for pathway analysis in a downstream module that makes automated calls to PathVisio, or they can be downloaded for processing in other software.
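As a rough illustration of the per-probe statistics reported for a treated versus control comparison, the Python sketch below computes an effect size (difference of mean log2 intensities, i.e. the log2 fold change) and an ordinary Welch t-statistic. This is deliberately simpler than what the statistics module does: limma fits linear models and reports moderated statistics, which this sketch does not reproduce. The function name `compare_groups` and the toy values are assumptions.

```python
import math
from statistics import mean, variance


def compare_groups(treated, control):
    """Return (log2 fold change, Welch t-statistic) for one probe.

    Inputs are log2 intensities of the probe in each group; each group
    needs at least two arrays so that a sample variance can be computed.
    """
    effect = mean(treated) - mean(control)
    standard_error = math.sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return effect, effect / standard_error


# Toy log2 intensities of one probe on three treated and three control arrays.
treated = [9.0, 9.2, 8.8]
control = [7.0, 7.1, 6.9]
log2_fold_change, t_statistic = compare_groups(treated, control)
```

A log2 fold change of 2 corresponds to a fourfold higher intensity in the treated group; the t-statistic scales that effect by its estimated uncertainty.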
Running time of an analysis is very much dependent on the size of the input file, the number of arrays, the specific user settings, and the modules used, and will range from minutes to hours in the extremes. Performance of ArrayAnalysis servers is being monitored to make sure they effectively deal with the workload, and extra capacity can be allocated in future if needed. When not surpassing a dozen concurrent runs, running times will not increase much. Additionally, users can download the R scripts to run on their own systems if desired, for example in case of many projected runs or very large data sets that would not be convenient to process over the internet. The scripts have been designed for ease-of-use, providing a separate initiation script to specify user settings (e.g. data directories and preferences), which automatically calls the other scripts.
The addition of the currently introduced Illumina module complements ArrayAnalysis.org with functionality to pre-process data from experiments run on the widely used Illumina bead array platform. It provides users of this platform or those processing existing data not only with an easy to use data quality control and pre-processing web module, but also with a direct connection to further modules offering downstream statistical and pathway analysis functionalities. As a whole, ArrayAnalysis.org is continuously being improved, evolving into a one-step solution for pre-processing, statistical analysis, and biological interpretation of data from multiple technological platforms. Being an open source project, developers within the user community can contribute by adding modules or improving functionality of existing ones, and source code can be downloaded for local deployment.
The developed Illumina bead array analysis workflow provides an easy, fast, and intuitive way for quality control, pre-processing, statistical, and pathway analysis of Illumina gene expression arrays for a broad range of researchers. The workflow provides immediate feedback on quality and basic statistics outcomes of generated data, increasing the speed and iterative capacity of intuitive research pipelines. This enables researchers to effectively resolve the first steps in data analysis and focus on their primary interest: extracting biologically meaningful information out of their gene expression data. The workflow can therefore be used as a starting point facilitating a broad range of applications in life sciences research.
Availability and requirements
Project name: ArrayAnalysis.org Illumina Pre-processing and QC module
Project home page: http://www.arrayanalysis.org
Operating system(s): Platform independent (web-based)
Programming language: implemented in R and PHP
Other requirements: none
License: Apache version 2.0
Any restrictions to use by non-academics: no restrictions
User guide: Additional file 1.
Tutorial: Additional file 2.
Abbreviations
limma: Linear models for microarray data
PCA: Principal component analysis
nuID: Nucleotide universal identifier
Kuhn K, Baker SC, Chudin E, Lieu MH, Oeser S, Bennett H, et al. A novel, high-performance random array platform for quantitative gene expression profiling. Genome Res. 2004;14(11):2347–56. doi:10.1101/gr.2739104.
Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, et al. ArrayExpress update-simplifying data submissions. Nucleic Acids Res. 2014. doi:10.1093/nar/gku1057.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013;41(Database issue):D991–5. doi:10.1093/nar/gks1193.
Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD. illuminaio: An open source IDAT parsing tool for Illumina microarrays. F1000Research. 2013;2:264. doi:10.12688/f1000research.2-264.v1.
Dunning MJ, Smith ML, Ritchie ME, Tavare S. beadarray: R classes and methods for Illumina bead-based data. Bioinformatics. 2007;23(16):2183–4. doi:10.1093/bioinformatics/btm311.
Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008;24(13):1547–8. doi:10.1093/bioinformatics/btn224.
Smyth GK. Limma: linear models for microarray data. In: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W, editors. Bioinformatics and Computational Biology Solutions using R and Bioconductor. New York: Springer; 2005. p. 397–420.
Eijssen LM, Jaillard M, Adriaens ME, Gaj S, de Groot PJ, Muller M, et al. User-friendly solutions for microarray quality control and pre-processing on ArrayAnalysis.org. Nucleic Acids Res. 2013;41(Web Server issue):W71–6. doi:10.1093/nar/gkt293.
Kallio MA, Tuimala JT, Hupponen T, Klemela P, Gentile M, Scheinin I, et al. Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics. 2011;12:507. doi:10.1186/1471-2164-12-507.
Lin K, Kools H, de Groot PJ, Gavai AK, Basnet RK, Cheng F, et al. MADMAX - Management and analysis database for multiple ~omics experiments. J Integr Bioinform. 2011;8(2):160. doi:10.2390/biecoll-jib-2011-160.
Schultze JL, Eggle D. IlluminaGUI: graphical user interface for analyzing gene expression data generated on the Illumina platform. Bioinformatics. 2007;23(11):1431–3. doi:10.1093/bioinformatics/btm101.
Shi W, Oshlack A, Smyth GK. Optimizing the noise versus bias trade-off for Illumina whole genome expression BeadChips. Nucleic Acids Res. 2010;38(22):e204. doi:10.1093/nar/gkq871.
Du P, Kibbe WA, Lin SM. nuID: a universal naming scheme of oligonucleotides for Illumina, Affymetrix, and other microarrays. Biol Direct. 2007;2:16. doi:10.1186/1745-6150-2-16.
Ritchie ME, Dunning MJ, Smith ML, Shi W, Lynch AG. BeadArray expression analysis using bioconductor. PLoS Comput Biol. 2011;7(12):e1002276. doi:10.1371/journal.pcbi.1002276.
van Iersel MP, Kelder T, Pico AR, Hanspers K, Coort S, Conklin BR, et al. Presenting and exploring biological pathways with PathVisio. BMC Bioinformatics. 2008;9:399. doi:10.1186/1471-2105-9-399.
All authors received funding for this research and for the preparation of the manuscript from the institutes they are affiliated with: LE: Maastricht University; VG: TNO; TK: TNO and EdgeLeap B.V.; MA: Maastricht University and AMC; CE: Maastricht University; MR: TNO and EdgeLeap B.V. The funding bodies did not have any role in study design, in the collection, analysis and interpretation of data, in the writing of the manuscript, or in the decision to submit the manuscript for publication.
We thank Lars Verschuren, Annelies Dijk-Stroeve, and Andre Boorsma for beta testing. We thank Nuno Nunes for technical support and system maintenance of the ArrayAnalysis.org servers. No materials were used in this study. Authors have obtained permission from all those mentioned in the Acknowledgements.
The authors declare that they have no competing interests.
MR and CE conceived the research. VG, LE, TK and MA implemented the software. MR, LE and VG wrote the manuscript. TK, CE and MA critically reviewed the manuscript. All authors read and approved the final manuscript.
The Additional files are given for reference; the most recent versions are available from http://www.arrayanalysis.org and https://github.com/BiGCAT-UM/ilmnQC_Module.
Tutorial demonstrating analysis of a publicly available example dataset from ArrayExpress.