Combined analysis of genome-wide expression and copy number profiles to identify key altered genomic regions in cancer
© Fontanillo et al.; licensee BioMed Central Ltd. 2012
- Published: 19 October 2012
Analyses of DNA copy number alterations and gene expression changes in human samples have been used to find potential target genes in complex diseases. Recent studies have combined these two types of data using different strategies, but they have focused on finding gene-based relationships. However, it has been proposed that these data can also be used to identify key genomic regions, which may enclose causal genes, under the assumption that disease-associated gene expression changes are caused by genomic alterations.
Following this proposal, we undertake a new integrative analysis of genome-wide expression and copy number datasets. The analysis is based on the combined location of both types of signals along the genome. Our approach takes into account the genomic location in the copy number (CN) analysis and also in the gene expression (GE) analysis. To achieve this, we apply a segmentation algorithm to both types of data using paired samples. Then, we perform a correlation analysis and a frequency analysis of the gene loci in the segmented CN regions and the segmented GE regions, selecting in both cases the statistically significant loci. In this way, we find CN alterations that show strong correspondence with GE changes. We applied our method to a human dataset of 64 Glioblastoma Multiforme samples, finding key loci and hotspots that correspond to major alterations previously described for this type of tumor.
Identification of key altered genomic loci constitutes a first step toward finding the genes that drive the alteration in a malignant state. These driver genes can be found in regions that show high correlation between copy number alterations and expression changes.
- Genomic Alteration
- Copy Number Alteration
- Copy Number Data
- Log2 Ratio Signal
- Detect Copy Number Alteration
Acquisition of somatic genetic alterations plays an important role in the development of cancer. Several systematic efforts have addressed the study of genetic alterations to characterize human cancers [1, 2], including copy-number alterations (CNAs), translocations, insertions and single-nucleotide polymorphisms (SNPs). Most of these approaches focus on finding frequent alterations, which occur in a high number of cases. According to the selective pressure theory, a genomic alteration that confers an advantage to a malignant state is likely to be found in more tumors than expected by chance. However, most methods that look for recurrent aberrations using copy number information find many regions, containing many genes [4, 5]. Therefore, to identify recurrently altered genomic regions that are biologically relevant, it is necessary to integrate gene and genome information, as proposed by Akavia et al. [3]. Several reports have recently shown that integrative strategies can be very useful to identify driver genes, considering the hypothesis that disease-associated gene expression changes are frequently induced by genomic alterations [3, 6–10]. Most of these reports focus on finding gene-based relationships.
Building on these hypotheses, which relate transcriptomic and genomic alterations, we propose a new integrative method based on the location of both types of signals along the genome. Our method takes into account the genomic loci, both in the copy number (CN) analysis and in the gene expression (GE) analysis, and applies the segmentation step proposed by Ortiz-Estevez et al. [11]. These authors designed a method for robust comparison between CN and GE using paired samples. Their approach is based on a search for correlation between segmented CN regions and segmented GE regions to find the most significant simultaneous alterations. We follow this approach, introducing two new steps to assess the matching between CN and GE loci: (i) first, a signal correlation analysis; (ii) second, an alteration frequency analysis. Using these analyses we propose a set of significantly altered genomic regions in the studied pathological state. To show the performance and demonstrate the value of our method, we use a dataset of 64 Glioblastoma Multiforme (GBM) samples with paired measurements of GE and CN (taken from [7, 8]).
The method is designed for combined analysis of datasets from two types of genome-wide arrays: DNA genomic microarrays and RNA expression microarrays. These arrays provide quantitative copy number and expression data, respectively. The analysis places both types of signals along the genome, taking into account the gene loci for the CN data and the GE data. The rationale of the method is to search for copy number alterations with a major influence on the expression levels of the genes encoded. As a distinctive element from other integrative approaches, we do not consider SNPs or genes only individually. We take into account the gene loci following the strategy described in [11], which is based on the application of the same smoothing and segmentation algorithm to CN and GE in order to establish comparable regions. Once we get the smoothed segments, we perform two independent analyses for each gene locus: a signal correlation analysis and an alteration frequency analysis. (The workflow described in Materials and Methods, presented in the last figure, illustrates the procedure of the method including these two independent analyses.)
Analysis of correlation between gene expression and copy number levels
The number of probes in the SNP arrays, used to calculate the segmented signals for CN, is large and uniform along the genome. However, in the expression arrays some genomic regions do not have enough allocated gene loci and the probes are sparse. This becomes a problem when a GE segment includes outliers (i.e. gene loci whose expression levels are very different from the mean of their neighbours). To solve this problem, we look for statistically significant outliers within the GE segments that are present in at least one third of the samples, and we recalculate the signal correlation between their unsegmented GE and the corresponding CN segments. In this way, we find a new set of gene loci with correlation r ≥ 0.60, which is added to the initial set of candidate hotspots identified. This step of the procedure is important to recover some gene loci with highly significant correlation (e.g. EGFR or SEC61G), which were missed in the first step due to the described problem.
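The recovery step can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not specify the statistical test used to call an outlier, so the `tolerance` argument below is a hypothetical stand-in for it, and the function names and toy values are invented for the example.

```python
import math

def pearson_r(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def is_recurrent_outlier(ge_raw, ge_segmented, tolerance, min_fraction=1 / 3):
    """True when the locus departs from its GE segment level in enough samples.

    `tolerance` is a hypothetical placeholder for the (unspecified)
    significance test; `min_fraction` is the one-third rule from the text.
    """
    flagged = sum(1 for raw, seg in zip(ge_raw, ge_segmented)
                  if abs(raw - seg) > tolerance)
    return flagged / len(ge_raw) >= min_fraction

def rescue_locus(ge_raw, ge_segmented, cn_segmented, tolerance, r_min=0.60):
    """Correlate unsegmented GE with segmented CN for a recurrent outlier."""
    if not is_recurrent_outlier(ge_raw, ge_segmented, tolerance):
        return None
    r = pearson_r(ge_raw, cn_segmented)
    return r if r >= r_min else None

# Toy locus measured in six paired samples (values are invented).
ge_raw = [2.0, 2.2, 0.1, 1.9, 0.0, 2.1]   # unsegmented log2 ratios
ge_seg = [0.1] * 6                        # level of the enclosing GE segment
cn_seg = [1.0, 1.1, 0.0, 0.9, 0.0, 1.0]   # segmented CN log2 ratios
rescued_r = rescue_locus(ge_raw, ge_seg, cn_seg, tolerance=1.0)
print(rescued_r)
```

In this toy case the locus is flagged in four of six samples and its unsegmented expression follows the CN segments closely, so it would be added to the candidate set.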
Analysis of frequencies for the categorical states Up-Gain and Down-Loss
Genome-wide identification of hotspots: candidate key genomic regions
Our method identifies candidate key regions that show high correlation between CN and GE and that are frequently altered in the same direction in both types of signals. The overlap between the regions with the most significant correlation and the ones with the highest frequencies of simultaneous alteration (CN and GE) along the genome constitutes the hotspots where putative driver genes are likely to be encoded.
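A minimal sketch of this overlap, assuming per-locus correlation coefficients and simultaneous-alteration frequencies have already been computed. The function name, the locus identifiers and the values are hypothetical; the default cutoffs mirror the ones quoted in the Methods (r ≥ 0.60 and the upper 10% of the frequency distribution).

```python
def hotspot_loci(correlation, frequency, r_min=0.60, top_fraction=0.10):
    """Loci that pass both the correlation and the frequency criteria.

    correlation: {locus: Pearson r between segmented CN and GE}
    frequency:   {locus: fraction of samples altered in the same direction}
    """
    ranked = sorted(frequency.values(), reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    cutoff = ranked[n_top - 1]  # lowest frequency still inside the top fraction
    frequent = {locus for locus, f in frequency.items() if f >= cutoff}
    correlated = {locus for locus, r in correlation.items() if r >= r_min}
    return sorted(frequent & correlated)

# Ten invented loci: only g1 is both highly correlated and frequently altered.
correlation = {f"g{i}": 0.20 for i in range(1, 11)}
correlation.update({"g1": 0.80, "g2": 0.70})
frequency = {f"g{i}": 0.05 for i in range(1, 11)}
frequency.update({"g1": 0.40, "g2": 0.10})

print(hotspot_loci(correlation, frequency))
```

Taking a set intersection keeps the two analyses independent, so either criterion can be tuned without recomputing the other.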
Key genomic regions found for the 64 paired GBM cancer samples
Table 1. Significant U-G regions with the associated genes. Columns: Correlation (average r coefficient); U-G Frequency (average %); Number of genes.
Table 2. Significant D-L regions with the associated genes. Columns: Correlation (average r coefficient); D-L Frequency (average %); Number of genes.
Complete information corresponding to the genes found in the significant U-G regions and D-L regions is included respectively as supplementary material in Additional-file-1 (for the data corresponding to Table 1) and Additional-file-2 (for the data corresponding to Table 2).
The combined analysis of CN and GE data obtained using DNA genome and RNA expression microarrays for paired samples is a very powerful approach to uncover key altered regions in the biological state under study. We present a robust method to find genomic regions that show simultaneous significant changes in both CN and GE. Our calculations applied to a cancer dataset find expected known genomic alterations and many others identified as key altered genomic regions. This approach is also proposed as an adequate strategy to identify driver or causal genes under the hypothesis that disease-associated gene expression changes are frequently induced by genomic alterations.
In this study we use a dataset of 64 human Glioblastoma Multiforme (GBM) samples that includes, for each sample, Affymetrix DNA microarrays applied to detect genome-wide CN changes and Affymetrix RNA expression microarrays applied to detect GE changes. We used the same subgroup of samples that was previously analysed in Ortiz-Estevez et al. [11].
GE and CN normalization and signals calculation
GE data were processed using the RMA algorithm [18] applied to the human gene expression microarrays Affymetrix HGU133 plus 2.0 (using the same strategy followed in [19, 20]). The CRMAv2 algorithm [21] was applied to normalize the raw data and obtain the signals from the Affymetrix Human Mapping 500K SNP arrays. The processed signals were divided by the median of the normal samples for each element (SNP or gene) and then the log2 was computed. These log2 ratio signals were smoothed and segmented using the Circular Binary Segmentation (CBS) algorithm [22] with the default parameters implemented in the DNAcopy R package.
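The log2 ratio step can be illustrated with a short sketch. It assumes the normalized signals are positive and stored per element as plain lists; the function name and the toy values are hypothetical, and the segmentation itself (CBS in DNAcopy) is not reproduced here.

```python
import math
from statistics import median

def log2_ratios(tumor_signals, normal_signals):
    """Log2 ratio of each tumor signal to the median of the normal samples.

    tumor_signals:  {element_id: [signal per tumor sample]}
    normal_signals: {element_id: [signal per normal sample]}
    An element is a SNP (CN data) or a gene (GE data).
    """
    ratios = {}
    for element, values in tumor_signals.items():
        reference = median(normal_signals[element])
        ratios[element] = [math.log2(v / reference) for v in values]
    return ratios

# Toy element: the median of the normals is 2.0.
tumor = {"probe_a": [4.0, 2.0, 1.0]}
normal = {"probe_a": [1.0, 2.0, 3.0]}
print(log2_ratios(tumor, normal))  # {'probe_a': [1.0, 0.0, -1.0]}
```

Because both data types are expressed relative to the normal samples on the same log2 scale, the same segmentation procedure can then be applied to each of them.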
Correlation between GE and CN
Pearson correlation coefficients (r) of the segmented GE and CN data were calculated taking the values of the segmented copy number and gene expression at the central point of the genomic position for each gene. P-values for the correlation coefficient of every gene locus were computed and adjusted using the Bonferroni method. The established threshold for the selection of significantly correlated gene loci was a correlation coefficient r ≥ 0.60, which corresponds to an adjusted p-value < 0.005. When using the unsegmented GE signal of the gene loci, the same correlation threshold and p-value cutoff were applied.
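As a rough illustration of this selection, the sketch below computes r per gene locus across paired samples and keeps the loci at or above the threshold. The function names and toy values are assumptions made for the example; the p-value computation is omitted, and `bonferroni` only shows the adjustment applied to p-values obtained elsewhere.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def significant_loci(cn_by_locus, ge_by_locus, r_min=0.60):
    """Loci whose segmented CN and GE values correlate across samples."""
    selected = {}
    for locus, cn in cn_by_locus.items():
        r = pearson_r(cn, ge_by_locus[locus])
        if r >= r_min:
            selected[locus] = r
    return selected

# Segmented values at the gene midpoint for four paired samples (invented).
cn = {"geneA": [0.0, 0.5, 1.0, 1.5], "geneB": [0.0, 0.5, 1.0, 1.5]}
ge = {"geneA": [0.1, 0.4, 0.9, 1.6], "geneB": [0.3, -0.2, 0.1, 0.0]}
selected = significant_loci(cn, ge)
print(sorted(selected))
```

Only `geneA` is retained here; `geneB` has a negative coefficient and therefore falls below the one-sided threshold.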
Frequency of U-G and D-L alterations
The thresholds that define DNA copy number gains and losses and gene up- and down-regulation were established by applying the k-means algorithm with three clusters (k = 3) to the segmented data, independently for the CN data and for the GE data. The CN data values were classified into gained (G), lost (L) or no-change (N) and the GE values were classified as up-regulated (U), down-regulated (D) or no-change (N). The thresholds found by k-means for CN in the GBM dataset were > 0.19 (of the log2 ratio signals) for gain and < -0.15 for loss. The thresholds found for GE in the GBM dataset were > 0.10 (of the log2 ratio signals) for up-regulation and < -0.12 for down-regulation. A contingency table with the 9 possible categorical states for the two types of data was built for every gene locus. A cutoff threshold was set for the frequency of the up-regulated and gained (U-G) category and of the down-regulated and lost (D-L) category, based on the empirical cumulative distributions of the categories. Taking into account the gene loci, the significantly altered regions were defined as the ones with a frequency at or above the upper 10% quantile of the distribution of U-G or of the distribution of D-L.
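The classification and counting steps might look like the sketch below. It is an illustration rather than the authors' implementation: the one-dimensional k-means uses a simple deterministic initialization, the cut points are taken as midpoints between neighbouring cluster centers (an assumption), and all function names and sample values are hypothetical. The two threshold pairs are the ones reported above for the GBM dataset.

```python
from collections import Counter

def kmeans_1d(values, k=3, iterations=100):
    """Lloyd's algorithm on scalars; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def cut_points(values):
    """Lower and upper thresholds as midpoints between adjacent centers."""
    low, mid, high = kmeans_1d(values, k=3)
    return (low + mid) / 2, (mid + high) / 2

def classify(value, lower, upper, labels):
    """labels = (below lower, in between, above upper), e.g. 'LNG' or 'DNU'."""
    if value < lower:
        return labels[0]
    if value > upper:
        return labels[2]
    return labels[1]

def contingency_table(cn_values, ge_values, cn_cut, ge_cut):
    """Counts of the 9 (GE state, CN state) pairs for one gene locus."""
    counts = Counter()
    for cn, ge in zip(cn_values, ge_values):
        state = (classify(ge, *ge_cut, "DNU"), classify(cn, *cn_cut, "LNG"))
        counts[state] += 1
    return counts

CN_CUT = (-0.15, 0.19)  # loss / gain thresholds reported for GBM
GE_CUT = (-0.12, 0.10)  # down / up thresholds reported for GBM

# One locus in four paired samples (invented segmented log2 ratios).
cn = [0.5, 0.3, 0.0, -0.4]
ge = [0.4, 0.05, 0.0, -0.3]
table = contingency_table(cn, ge, CN_CUT, GE_CUT)
ug_frequency = table[("U", "G")] / len(cn)
dl_frequency = table[("D", "L")] / len(cn)
print(ug_frequency, dl_frequency)  # 0.25 0.25
```

In one dimension the boundary between two k-means clusters lies halfway between their centers, which is why the cut points are computed as midpoints; on real data `cut_points` would be run separately on the pooled segmented CN values and on the pooled segmented GE values.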
General workflow for identification of key regions in the genome
This work has been supported by funds provided by the Local Government Junta de Castilla y León (JCyL, ref. project: CSI07A09), by the Spanish Government (ISCiii, ref. project PS09/00843) and by the European Commission (Research Grant ref. FP7-HEALTH-2007-223411). SA thanks the JCyL and the European Social Fund (ESF-EU) for a research grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This article has been published as part of BMC Genomics Volume 13 Supplement 5, 2012: Proceedings of the International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S5.
1. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N, Stratton MR: A census of human cancer genes. Nature Rev Cancer. 2004, 4: 177-83. doi:10.1038/nrc1299.
2. Stratton MR, Campbell PJ, Futreal PA: The cancer genome. Nature. 2009, 458: 719-24. doi:10.1038/nature07943.
3. Akavia UD, Litvin O, Kim J, Sanchez-Garcia F, Kotliar D, Causton HC, Pochanard P, Mozes E, Garraway LA, Pe'er D: An integrated approach to uncover drivers of cancer. Cell. 2010, 143: 1005-17. doi:10.1016/j.cell.2010.11.013.
4. Beroukhim R, Getz G, Nghiemphu L, Barretina J, Hsueh T, Linhart D, Vivanco I, Lee JC, Huang JH, Alexander S, Du J, Kau T, Thomas RK, Shah K, Soto H, Perner S, Prensner J, Debiasi RM, Demichelis F, Hatton C, Rubin MA, Garraway LA, Nelson SF, Liau L, Mischel PS, Cloughesy TF, Meyerson M, Golub TA, Lander ES, Mellinghoff IK, Sellers WR: Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Nat Acad Sci USA. 2007, 104: 20007-12. doi:10.1073/pnas.0710052104.
5. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M, Mc Henry KT, Pinchback RM, Ligon AH, Cho Y-J, Haery L, Greulich H, Reich M, Winckler W, Lawrence MS, Weir BA, Tanaka KE, Chiang DY, Bass AJ, Loo A, Hoffman C, Prensner J, Liefeld T, Gao Q, Yecies D, Signoretti S, Maher E, Kaye FJ, Sasaki H, Tepper JE, Fletcher JA, Tabernero J, Baselga J, Tsao M-S, Demichelis F, Rubin MA, Janne PA, Daly MJ, Nucera C, Levine RL, Ebert BL, Gabriel S, Rustgi AK, Antonescu CR, Ladanyi M, Letai A, Garraway LA, Loda M, Beer DG, True LD, Okamoto A, Pomeroy SL, Singer S, Golub TR, Lander ES, Getz G, Sellers WR, Meyerson M: The landscape of somatic copy-number alteration across human cancers. Nature. 2010, 463: 899-905. doi:10.1038/nature08822.
6. Pollack JR, Sørlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Børresen-Dale A-L, Brown PO: Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Nat Acad Sci USA. 2002, 99: 12963-8. doi:10.1073/pnas.162471999.
7. Kotliarov Y, Steed ME, Christopher N, Walling J, Su Q, Center A, Heiss J, Rosenblum M, Mikkelsen T, Zenklusen JC, Fine HA: High-resolution global genomic survey of 178 gliomas reveals novel regions of copy number alteration and allelic imbalances. Cancer Res. 2006, 66: 9428-36. doi:10.1158/0008-5472.CAN-06-1691.
8. Kotliarov Y, Kotliarova S, Charong N, Li A, Walling J, Aquilanti E, Ahn S, Steed ME, Su Q, Center A, Zenklusen JC, Fine HA: Correlation analysis between single-nucleotide polymorphism and expression arrays in gliomas identifies potentially relevant target genes. Cancer Res. 2009, 69: 1596-603. doi:10.1158/0008-5472.CAN-08-2496.
9. Turner N, Lambros MB, Horlings HM, Pearson A, Sharpe R, Natrajan R, Geyer FC, van Kouwenhove M, Kreike B, Mackay A, Ashworth A, van de Vijver MJ, Reis-Filho JS: Integrative molecular profiling of triple negative breast cancers identifies amplicon drivers and potential therapeutic targets. Oncogene. 2010, 29: 2013-23. doi:10.1038/onc.2009.489.
10. Kim Y-A, Wuchty S, Przytycka TM: Identifying causal genes and dysregulated pathways in complex diseases. PLoS Computational Biology. 2011, 7: e1001095. doi:10.1371/journal.pcbi.1001095.
11. Ortiz-Estevez M, De Las Rivas J, Fontanillo C, Rubio A: Segmentation of genomic and transcriptomic microarrays data reveals major correlation between DNA copy number aberrations and gene-loci expression. Genomics. 2011, 97: 86-93. doi:10.1016/j.ygeno.2010.10.008.
12. De Tayrac M, Etcheverry A, Aubry M, Saïkali S, Hamlat A, Quillien V, Le Treut A, Galibert MD, Mosser J: Integrative genome-wide analysis reveals a robust genomic glioblastoma signature associated with copy number driving changes in gene expression. Genes Chromosomes Cancer. 2009, 48: 55-68. doi:10.1002/gcc.20618.
13. Ruano Y, Mollejo M, Ribalta T, Fiaño C, Camacho FI, Gómez E, de Lope AR, Hernández-Moneo JL, Martínez P, Meléndez B: Identification of novel candidate target genes in amplicons of glioblastoma multiforme tumors detected by expression and CGH microarray profiling. Molecular Cancer. 2006, 5: 39.
14. Reifenberger G, Collins VP: Pathology and genetics of astrocytic gliomas. J Mol Med. 2004, 82: 656-670. doi:10.1007/s00109-004-0564-x.
15. Chernova OB, Somerville RP, Cowell JK: A novel gene, LGI1, from 10q24 is rearranged and downregulated in malignant brain tumors. Oncogene. 1998, 17: 2873-2881. doi:10.1038/sj.onc.1202481.
16. Wechsler DS, Shelly CA, Petroff CA, Dang CV: MXI1, a putative tumor suppressor gene, suppresses growth of human glioblastoma cells. Cancer Res. 1997, 57: 4905-4912.
17. Cullis DN, Philip B, Baleja JD, Feig LA: Rab11-FIP2, an adaptor protein connecting cellular components involved in internalization and recycling of epidermal growth factor receptors. J Biol Chem. 2002, 277: 49158-49166. doi:10.1074/jbc.M206316200.
18. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-64. doi:10.1093/biostatistics/4.2.249.
19. Vicent S, Luis-Ravelo D, Antón I, García-Tuñón I, Borrás-Cuesta F, Dotor J, De Las Rivas J, Lecanda F: A novel lung cancer signature mediates metastatic bone colonization by a dual mechanism. Cancer Res. 2008, 68: 2275-85. doi:10.1158/0008-5472.CAN-07-6493.
20. Hernández JA, Rodríguez AE, González M, Benito R, Fontanillo C, Sandoval V, Romero M, Martín-Núñez G, de Coca AG, Fisac R, Galende J, Recio I, Ortuño F, García JL, De Las Rivas J, Gutiérrez NC, San Miguel JF, Hernández JM: A high number of losses in 13q14 chromosome band is associated with a worse outcome and biological differences in patients with B-cell chronic lymphoid leukemia. Haematologica. 2009, 94: 364-371. doi:10.3324/haematol.13862.
21. Bengtsson H, Wirapati P, Speed TP: A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6. Bioinformatics. 2009, 25: 2149-56. doi:10.1093/bioinformatics/btp371.
22. Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657-63. doi:10.1093/bioinformatics/btl646.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.