- Research article
SIGMA: A System for Integrative Genomic Microarray Analysis of Cancer Genomes
BMC Genomicsvolume 7, Article number: 324 (2006)
The prevalence of high resolution profiling of genomes has created a need for the integrative analysis of information generated from multiple methodologies and platforms. Although the majority of data in the public domain are gene expression profiles, and expression analysis software are available, the increase of array CGH studies has enabled integration of high throughput genomic and gene expression datasets. However, tools for direct mining and analysis of array CGH data are limited. Hence, there is a great need for analytical and display software tailored to cross platform integrative analysis of cancer genomes.
We have created a user-friendly java application to facilitate sophisticated visualization and analysis such as cross-tumor and cross-platform comparisons. To demonstrate the utility of this software, we assembled array CGH data representing Affymetrix SNP chip, Stanford cDNA arrays and whole genome tiling path array platforms for cross comparison. This cancer genome database contains 267 profiles from commonly used cancer cell lines representing 14 different tissue types.
In this study we have developed an application for the visualization and analysis of data from high resolution array CGH platforms that can be adapted for analysis of multiple types of high throughput genomic datasets. Furthermore, we invite researchers using array CGH technology to deposit both their raw and processed data, as this will be a continually expanding database of cancer genomes. This publicly available resource, the System for Integrative Genomic Microarray Analysis (SIGMA) of cancer genomes, can be accessed at http://sigma.bccrc.ca.
Array comparative genomic hybridization (CGH) is a method used to detect segmental DNA copy number alterations and is widely used to discover chromosomal aberrations in cancer and other genetic diseases [1, 2]. In this method, differentially labeled genomic DNA samples are competitively hybridized to chromosomal targets, and the copy number balance between the two samples is reflected by their signal intensity ratio. Numerous array CGH platforms exist; these vary in the type of elements present on the array and their corresponding coverage of the human genome. With the development of high resolution, genome wide arrays, tens of thousands of loci can be evaluated for copy number status, facilitating the high throughput search for genes potentially involved in pathogenesis. This has allowed the identification of discrete regions of alteration that may have been missed by traditional cytogenetic methods and has proven to be a useful platform for exploring the underlying genetic basis of cancer [1, 3].
With the increasing utilization of array CGH, it has not only become important to establish standards for data deposition, but to develop tools to facilitate public access and to ease mining of available data. Currently, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository  and European Bioinformatics Institute (EBI) ArrayExpress  provide storage for array CGH data, but these databases have been largely designed for gene expression microarrays. Although these sites support visualization of previously analyzed gene expression profiles, there are limited tools available for direct mining and analysis of array CGH data. Hence, there is a need for forums specific to array CGH data. Recently, attempts have been made in making a database primarily of lower resolution array CGH data . However, with the accumulation of high density array data generated with diverse technology, the viewing of array data has become a bioinformatics challenge, especially when the integration of multiple datasets from different platforms is required. Therefore, a central database with analytical software tailored specifically for analyzing and visualizing different types of high resolution array CGH data would greatly facilitate data mining.
In this study we have created a database consisting of high-resolution, whole-genome array CGH profiles for nearly 200 commonly used cancer cell lines profiled on four different array platforms, which have been instrumental in biochemical and pharmacogenetic studies. Moreover, we have developed a user friendly, web-based java application called the System for Integrative Genomic Microarray Analysis (SIGMA) for comparative analysis of multiple genomes.
Results and Discussion
We have assembled a collection of 267 array CGH profiles, representing 184 distinct cell lines profiled on at least one of the four array CGH platforms (Table 1, Table 2). Moreover, 14 different cancer tissue origins and 30 distinct cancer types are represented in this database, resulting in the assembly of a wide spectrum of genomes in this repository (Table 2) [see Additional file 1]. Significantly, 56 of the 267 CGH profiles are unpublished raw data which is now made public.
Main functionalities of SIGMA
In order to increase the utility of this collection, a significant component of SIGMA is the web-based application which allows for the user-friendly mining of this dataset. Four major types of functionalities are offered by SIGMA: (1) interrogation of a single sample, (2) visualization and analysis of a single group of samples, (3) comparative analysis of two groups of samples, and (4) integration of data from multiple platforms (Figure 1B).
Visualization and interrogation of a single genome
The first function we discuss is the ability to view a single high resolution array CGH profile at multiple magnification levels. The major utility of this function is to display the underlying genomic architecture of a cell line, so that genetic features can be considered in experimental interpretation. For example, a whole genome karyogram of lung adenocarcinoma cell line H2087 profiled on the SMRT array platform (Figure 2A). From this image, we see many changes such as the loss of the 3p arm as well as segmental changes in chromosomes 8, 19 and 20. Specifically, we can select chromosome 8 (Figure 2B) and view that separately, then zoom into the region of interest and visualize it in finer detail (Figure 2C). Users can then highlight or place boundary lines in this region and query for which genes are located within the region of interest.
Subsequently, using the interval search option, users can retrieve the genes which are located in a desired region and have the option to query commonly used biological databases such as NCBI MIM, NCBI Gene, NCBI PubMed and the UCSC Genome Browser. For example, if we look at band 8q24.21 (Figure 2D), we can highlight the region and search the interval for which genes it contains. When we invoke the region search and retrieve only genes curated by RefSeq, we see there are 8 genes in the amplicon. If the user selects a particular gene, options to link out to the above mentioned biological databases become available. The utility of this function is to facilitate a direct connection from experimental findings to known, relevant information. Moreover, the ability to interrogate for specific genes and regions can be done for any types of the analysis outlined (Figure 1B).
Multiple genome comparison and mining across tumor types
A common research question is to look across a series of samples with common phenotype to identify recurrent genetic changes, for example comparison of lung adenocarcinomas [see Additional file 2]. With the spectrum of samples warehoused in SIGMA, such a query can be performed across multiple cancer types. For example, the alignment of a set of samples representing 8 different cancer types revealed common amplification of the MYC oncogene locus (Figure 3A), while the epidermal growth factor receptor (EGFR) locus is amplified only in a subset of samples (Figure 3B).
Recurrent alterations detected in one group of samples can be compared against those in another group, for example, the comparison of lung squamous cell carcinoma (SqCC) with cervical SqCC [see Additional file 3]. The strategy for comparison, for example the overlay of frequency plots derived from two groups of profiles, has been described elsewhere [7, 8].
Simultaneous visualization of data from multiple platforms
Cross platform comparison is essential to the multi-dimensional descriptions of a genome. Here, we have included a feature in SIGMA to allow users to view multiple platforms of data simultaneously. We use the breast cancer cell line, MCF7, to illustrate this functionality. Data from four different array CGH platforms were available publicly: SMRT array, Stanford cDNA microarray, Affymetrix 10K SNP array and the Affymetrix 100K SNP array. Figure 4A illustrates the cross platform display of chromosome 17, while Figure 4B shows the variable density of coverage by these four commonly used platforms.
Integrative visualization of DNA copy number and methylation
One of the novel features we have provided is the integrative visualization of copy number alterations with DNA methylation status. The major premise in studying copy number alterations at the DNA level is that these are the primary changes involved in driving changes in gene expression. Though gene dosage variation may be responsible for expression changes, alteration in DNA methylation pattern also contributes significantly to regulating gene expression. Recently, methods for global methylation analysis to measure aberrant DNA methylation status across tumor genomes have been developed [9–12]. Wilson et al (2006) compared methylation patterns with copy number profiles in lung cancer cells. Utilizing genomic and epigenetic data from this study for the H1395 lung cancer cell line, we illustrate a parallel display in SIGMA. Figure 5 shows a large segmental copy number gain spanning 1q21.2 to 1q23.1 with corresponding hypomethylation, localized precisely to 1q21.3 . Significantly, both copy number gain and decreased methylation can elevate gene expression. The S100 calcium binding protein A10 (S100A10) gene within this region has been previously shown to be over-expressed in lung cancer . It is readily apparent of the value of integrative studies examining aberrant DNA methylation and genomic copy number. With increased prevalence of studies of whole genome methylation, this feature will be of greater importance.
We have developed an application for the integrative cross platform analysis of array CGH data. The SIGMA application facilitates consolidation and structuring of diverse sources of array CGH data into a repository that is accessible with a new easy-to-use built-in web-based analytical application. The launch version contains data for 267 array CGH profiles, representing cancer cell lines of over 14 different types of tissue. The ability of SIGMA to incorporate multiple array CGH platforms facilitates the archiving of array CGH data from future publications, regardless of current or future array platform used. Though currently SIGMA's architecture facilitates the direct mining of genomic and epigenomic data, this can be easily adapted, and not limited to, high resolution genetic and gene expression surveys.
Sources of raw array CGH data
The raw data for 267 CGH profiles in the database was obtained from a variety of sources. They include both published and unpublished data [7, 8, 15–22]. Publicly available data were downloaded from NCBI GEO , Stanford Microarray Database  or websites affiliated with the author's laboratory. Data which were not publicly available were obtained by consent from the authors of the respective studies. The four array CGH platforms used for this study were the whole genome tiling path BAC (SMRT) array , the early access Affymetrix SNP 100 K array [16, 20], the Stanford cDNA microarray  and the Affymetrix SNP 10 K array . In addition, 2 of the cell lines were profiled for whole genome DNA methylation status using MeDIP array CGH [10, 13]. For this launch version, we concentrated on available cell lines profiled on high density array platforms and did not include profiles from clinical specimens. A summary of the sources for the raw data is given in Table 1, while the detailed description of each of the cell lines in the collection is given in Additional file 1.
Application layout of SIGMA
There are three main components which comprise this application; a Java WebStart application interface allowing users to formulate queries and perform visualization, an Apache Web Service which facilitates the connection of the user application to the database and a relational database which is implemented using MySQL (Figure 1A). Utilization of the Java WebStart technology ensures that users will have the latest version of the application, without the need for manual updating. In addition, efficiency and speed of the application will largely be determined by the user's computer specifications. Hence, we have provided different versions of our application based on system resource utilization, allowing users with greater system resources to perform more analytical tasks per session.
Database implementation and structure
SIGMA is a continually expanding database of array CGH experiments. The launch version contains 267 genomic profiles generated from cancer cell lines, implemented using the MySQL relational database application. Each array CGH experiment is contained in a separate database table allowing for easy and seamless expansion of this database. Upon addition of a new profile, a database table which contains the information of each array CGH experiment is updated. This table stores a record of each experiment, with the name of the cell line, the American Type Culture Collection (ATCC) identification (if applicable), array platform used and the description of the cancer type as part of the schema. For two channel array-based profiles, the dye which was used to label the sample is also recorded. Lastly, mapping information pertaining to a clone or probe and its position in the genome is kept in file with a fixed format, such that subsequent improvements and updates to the genomic positioning of the array elements can be easily incorporated. Moreover, since individual microarray software platforms use their own map version, map information for all platforms were compiled based on data from the UCSC Genome Browser . Currently, two genomic builds are supported: April 2003 (hg15) and May 2004 (hg17) assemblies.
Data for each platform were processed as similarly as possible. SMRT array data were normalized using the stepwise framework for normalization with default parameters . Similarly, Affymetrix 10 K and Affymetrix 100 K data were normalized and processed using dChip  with default settings. Specifically, the samples from the Affymetrix 10 K dataset of lung cancer cell lines were referenced against the group of matching blood lymphoblast lines and similarly, the breast cancer cell lines were referenced against their matching blood lymphoblast lines. Affymetrix 100 K data from Zhao et al. (2004) were referenced against 12 normal individuals and the NCI-60 profiles were referenced against 6 normal diploid controls. The gender of the profiles were not specified, hence data from the X chromosome may not be an accurate reflection. Segmentation of all data was performed using aCGH-Smooth , with data for the sex chromosomes removed prior to segmentation – as profiles were generated with sex matched or mismatched reference DNA in two channel hybridization experiments – and run with the settings of Lambda = 6.75 and "breakpoints per chromosome" = 100. Each element of the array is given a call with respect to normal: -1 if the element shows copy loss, 0 if the element shows no change in copy number and +1 if the element shows increased genomic content.
Availability and requirements
Project name: SIGMA (System for Integrative Genomic Microarray Analysis)
Project home page: http://sigma.bccrc.ca
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java version 1.6+
License: Free for academic and research use, commercial users please request special permission
System for Integrative Genomic Microarray Analysis
Comparative Genomic Hybridization
Single Nucleotide Polymorphism
Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet. 2006, 14 (2): 139-148. 10.1038/sj.ejhg.5201531.
Davies JJ, Wilson IM, Lam WL: Array CGH technologies and their applications to cancer genomes. Chromosome Res. 2005, 13 (3): 237-248. 10.1007/s10577-005-2168-x.
Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005, 37 Suppl: S11-7. 10.1038/ng1569.
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R: NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res. 2005, 33 (Database issue): D562-6. 10.1093/nar/gki022.
Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, Lilja P, Mukherjee G, Oezcimen A, Rayner T, Rocca-Serra P, Sharma A, Sansone S, Brazma A: ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2005, 33 (Database issue): D553-5. 10.1093/nar/gki056.
La Rosa P, Viara E, Hupe P, Pierron G, Liva S, Neuvial P, Brito I, Lair S, Servant N, Robine N, Manie E, Brennetot C, Janoueix-Lerosey I, Raynal V, Gruel N, Rouveirol C, Stransky N, Stern MH, Delattre O, Aurias A, Radvanyi F, Barillot E: VAMP: visualization and analysis of array-CGH, transcriptome and other molecular profiles. Bioinformatics. 2006
Coe BP, Lockwood WW, Girard L, Chari R, Macaulay C, Lam S, Gazdar AF, Minna JD, Lam WL: Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer. 2006, 94 (12): 1927-1935. 10.1038/sj.bjc.6603167.
Garnis C, Lockwood WW, Vucic E, Ge Y, Girard L, Minna JD, Gazdar AF, Lam S, Macaulay C, Lam WL: High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer. 2006, 118 (6): 1556-1564. 10.1002/ijc.21491.
Ching TT, Maunakea AK, Jun P, Hong C, Zardo G, Pinkel D, Albertson DG, Fridlyand J, Mao JH, Shchors K, Weiss WA, Costello JF: Epigenome analyses using BAC microarrays identify evolutionary conservation of tissue-specific methylation of SHANK3. Nat Genet. 2005, 37 (6): 645-651. 10.1038/ng1563.
Weber M, Davies JJ, Wittig D, Oakeley EJ, Haase M, Lam WL, Schubeler D: Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat Genet. 2005, 37 (8): 853-862. 10.1038/ng1598.
Khulan B, Thompson RF, Ye K, Fazzari MJ, Suzuki M, Stasiek E, Figueroa ME, Glass JL, Chen Q, Montagna C, Hatchwell E, Selzer RR, Richmond TA, Green RD, Melnick A, Greally JM: Comparative isoschizomer profiling of cytosine methylation: The HELP assay. Genome Res. 2006
Karimi M, Johansson S, Stach D, Corcoran M, Grander D, Schalling M, Bakalkin G, Lyko F, Larsson C, Ekstrom TJ: LUMA (LUminometric Methylation Assay)--a high throughput method to the analysis of genomic DNA methylation. Exp Cell Res. 2006, 312 (11): 1989-1995. 10.1016/j.yexcr.2006.03.006.
Wilson IM, Davies JJ, Weber M, Brown CJ, Alvarez CE, MacAulay C, Schubeler D, Lam WL: Epigenomics: mapping the methylome. Cell Cycle. 2006, 5 (2): 155-158.
Remmelink M, Mijatovic T, Gustin A, Mathieu A, Rombaut K, Kiss R, Salmon I, Decaestecker C: Identification by means of cDNA microarray analyses of gene expression modifications in squamous non-small cell lung cancers as compared to normal bronchial epithelial tissue. Int J Oncol. 2005, 26 (1): 247-258.
Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res. 2004, 64 (9): 3060-3071. 10.1158/0008-5472.CAN-03-3308.
Garraway LA, Widlund HR, Rubin MA, Getz G, Berger AJ, Ramaswamy S, Beroukhim R, Milner DA, Granter SR, Du J, Lee C, Wagner SN, Li C, Golub TR, Rimm DL, Meyerson ML, Fisher DE, Sellers WR: Integrative genomic analyses identify MITF as a lineage survival oncogene amplified in malignant melanoma. Nature. 2005, 436 (7047): 117-122. 10.1038/nature03664.
de Leeuw RJ, Davies JJ, Rosenwald A, Bebb G, Gascoyne RD, Dyer MJ, Staudt LM, Martinez-Climent JA, Lam WL: Comprehensive whole genome array CGH profiling of mantle cell lymphoma model genomes. Hum Mol Genet. 2004, 13 (17): 1827-1837. 10.1093/hmg/ddh195.
Coe BP, Lee EH, Chi B, Girard L, Minna JD, Gazdar AF, Lam S, Macaulay C, Lam WL: Gain of a region on 7p22.3, containing MAD1L1, is the most frequent event in small-cell lung cancer cell lines. Genes Chromosomes Cancer. 2006, 45 (1): 11-19. 10.1002/gcc.20260.
Shadeo A, Lam WL: Comprehensive copy number profiles of breast cancer cell model genomes. Breast Cancer Res. 2006, 8 (1): R9-10.1186/bcr1370.
Zhao X, Weir BA, LaFramboise T, Lin M, Beroukhim R, Garraway L, Beheshti J, Lee JC, Naoki K, Richards WG, Sugarbaker D, Chen F, Rubin MA, Janne PA, Girard L, Minna J, Christiani D, Li C, Sellers WR, Meyerson M: Homozygous deletions and chromosome amplifications in human lung carcinomas revealed by single nucleotide polymorphism array analysis. Cancer Res. 2005, 65 (13): 5561-5570. 10.1158/0008-5472.CAN-04-4603.
Bredel M, Bredel C, Juric D, Kim Y, Vogel H, Harsh GR, Recht LD, Pollack JR, Sikic BI: Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA. J Mol Diagn. 2005, 7 (2): 171-182.
Kim YH, Girard L, Giacomini CP, Wang P, Hernandez-Boussard T, Tibshirani R, Minna JD, Pollack JR: Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification. Oncogene. 2006, 25 (1): 130-138. 10.1038/sj.onc.1209303.
Ball CA, Awad IA, Demeter J, Gollub J, Hebert JM, Hernandez-Boussard T, Jin H, Matese JC, Nitzberg M, Wymore F, Zachariah ZK, Brown PO, Sherlock G: The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res. 2005, 33 (Database issue): D580-2. 10.1093/nar/gki006.
Ishkanian AS, Malloff CA, Watson SK, DeLeeuw RJ, Chi B, Coe BP, Snijders A, Albertson DG, Pinkel D, Marra MA, Ling V, MacAulay C, Lam WL: A tiling resolution DNA microarray with complete coverage of the human genome. Nat Genet. 2004, 36 (3): 299-303. 10.1038/ng1307.
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ: The UCSC Genome Browser Database: update 2006. Nucleic Acids Res. 2006, 34 (Database issue): D590-8. 10.1093/nar/gkj144.
Khojasteh M, Lam WL, Ward RK, Macaulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics. 2005, 6 (1): 274-10.1186/1471-2105-6-274.
Lin M, Wei LJ, Sellers WR, Lieberfarb M, Wong WH, Li C: dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics. 2004, 20 (8): 1233-1240. 10.1093/bioinformatics/bth069.
Jong K, Marchiori E, Meijer G, Vaart AV, Ylstra B: Breakpoint identification and smoothing of array comparative genomic hybridization data. Bioinformatics. 2004, 20 (18): 3636-3637. 10.1093/bioinformatics/bth355.
We thank Ian Wilson for useful discussion and Byron Cline for extensive testing of the application as well as Andrew Lam for tutorial development. This work was supported by funds from Genome Canada/Genome British Columbia, Canadian Institute of Health Research, and NIDCR grant RO1 DE15965-01.
RC designed and developed SIGMA and wrote the manuscript. WWL and BPC were involved in the design and development. AC, DM and AT were the programmers. JJD was involved with initial design and CM provided manuscript editing and program feedback. WLL is the principal investigator of the lab.
Electronic supplementary material
About this article
- Array Comparative Genomic Hybridization
- Copy Number Alteration
- Array Platform
- Lung Squamous Cell Carcinoma
- Array Comparative Genomic Hybridization Data