Bimodal gene expression patterns in breast cancer
© Bessarabova et al. 2010
Published: 10 February 2010
Skip to main content
© Bessarabova et al. 2010
Published: 10 February 2010
We identified a set of genes with an unexpected bimodal distribution among breast cancer patients in multiple studies. The property of bimodality seems to be common, as these genes were found on multiple microarray platforms and in studies with different end-points and patient cohorts. Bimodal genes tend to cluster into small groups of four to six genes with synchronised expression within the group (but not between the groups), which makes them good candidates for robust conditional descriptors. The groups tend to form concise network modules underlying their function in cancerogenesis of breast neoplasms.
Whole-genome gene expression studies primarily aim to identify conditional descriptors, i.e. subsets of genes or functional groups whose expression profiles distinguish between different biological states. Different biological conditions might include: disease state vs. normal state, good prognosis vs. bad, drug treated vs. untreated tissues, etc. Differential expression descriptors can be calculated in two ways. The traditional method consists of selecting a set of descriptor genes (gene signatures) using a variety of statistical methods [1–5]. Using this approach, a number of gene signatures were deduced for breast cancer phenotypes, including an "intrinsic" set for clustering of breast cancers , an "Amsterdam" signature consisting of 70 genes , a 76-gene "Rotterdam" signature  for metastasis, and a set of 21 genes associated with disease outcomes for ER+ tumors . Some of these sets are commercialized as multivariant diagnostics by Genomic Health http://www.genomichealth.com and Agendia http://www.agendia.com. Although important, gene signatures have many issues as descriptors - for instance, loss of specificity in validation studies with an increased number of samples , generally poor cross-platform compatibility (Amsterdam and Rotterdam signatures virtually do not overlap in gene content), lack of mechanistic (functional) correlation with phenotype, etc.
The second, more recent, approach deals with so-called "functional descriptors," such as pathways, signaling networks, enrichment distribution in ontologies, etc., which are differentially perturbed in the conditions being compared [11–14]. In good accordance with the original concept of "modularity" of biological functions systems , functional entities seem to be more robust descriptors than gene lists [16, 17]. In addition, functional descriptors provide strong mechanistic linkages with clinical phenotypes and, in the case of cancer, may explain important aspects of cancerogenesis.
However, in both cases, the genes composing gene signatures or functional categories are selected regardless of their individual patterns of expression among the samples in the study. In general, gene expression distribution in a population is assumed to be normal, as for any quantitative trait . However, it is not. As we have recently shown, distributions of expression signals of certain genes feature two distinct peaks among the samples in breast cancer . The phenomenon of expression "bimodality" was reported for other cancers as well [20–22], where «bimodality» was calculated by selection of hypervariable (HV) genes using F-statistics  and a combination of mixture modelling and kurtosis [21, 22].
Here we report a meta-analysis of bimodally expressed genes from five previously published independent breast cancer studies. We show that "bimodality" is a general phenomenon (at least for breast cancer), independent of a microarray platform and clinical phenotype (patient cohort). Bimodality is intrinsically associated with physiological states of the system, such as cancer vs. normal. Moreover, bimodally-expressed genes tend to cluster into groups with synchronised expression within a group. Consequently, bimodal group expression can be effectively used as an efficient and robust conditional descriptor, applicable for a variety of studies.
We also demonstrate the platform-independence of bimodality in three different microarrays used in the studies. Although compatibility between arrays can be high for certain end-points in limited size studies, as shown in the MAQCII project , in general, gene signatures are not robust and cannot be directly compared across platforms. There are several statistical methods of meta-analysis which enable direct comparison between gene expression levels in multiple experiments and allow for identification of genes with consistent signal values across the studies [20–27]. Here, we offer an approach to normalization of expression signal values into a binary mode corresponding to different conditions, which makes expression profiles on different arrays directly compatible.
Gene expression datasets used for identification of genes with bimodal expression patterns. In all five datasets, bimodality was defined by τ = 2.64 and standard deviation over 25th percentile of the distribution.a Recognized genes for each platform.
Pair-wise intersections of the sets of bimodal genes in five studies. Fisher exact tests were used to estimate p-values.
All genes intersection
Bimodal genes intersection
Bimodal genes for set Aa
Bimodal genes for set Ba
Therefore, we conclude that bimodality of gene expression is a phenomenon not limited to a specific microarray platform, a study/endpoint or a dataset/patient cohort. Bimodality of individual genes is confirmed for at least three different studies, and in some cases in four or five studies.
We believe that "bimodality" is a conditional expression property of a gene and each «mode» corresponds to a certain physiological condition, for example, a normal and a disease state. It is also possible that the two modes could correspond to different disease subtypes.
In order to clearly separate the patient samples by bimodal gene expression, we normalized the signals, so the signals could be presented in a binary manner, with one peak designated as -1 and another as 1. The original expression signals varied significantly between the genes in the same sample, and individual bimodal genes could be both over- and under-expressed in different samples. Therefore, the step of normalization was neccessary for minimizing the difference in amplitude of the expression of the genes in order to profile separate experiments in a uniform way. There can be two cases: 1. one gene from different experiments in which the intensities of its expression are different, and 2. different genes have similar intensity within one experiment. In the former case, normalization makes comparable the profiles of genes with different original intensities of expression. In the latter case, it allows one to identify truly similar genes within one set, with synchronised expression profiles for a physiological condition. The process of normalization is described in detail in "Methods," and an example of normalization of GRB7 expression from the Sorlie295 dataset is shown in Figure 2B.
The "close neighbors" groups of synchronously expressed bimodal genes for Sorlie295 data set.a Italics - breast cancer-associated genes.
As every gene in the group is bimodal, and the expression profiles of genes in each group are synchronised, each group can be used as an effective descriptor dividing patients into two clusters corresponding to the two expression modes. An average expression value for all genes in the group was used as the measure of the group's expression. For instance, ERBB2 group expression is downregulated in some patients and up-regulated in another part of the cohort (Figure 5C). It was shown that the expression group profiles are more robust descriptors than individual genes .
The expression of the "ñlose neighbors" groups is a remarkably robust descriptor between microarray platforms. "Robustness" can be defined as retained performance on larger validation datasets and «across platforms», i.e. the descriptor genes have to be synchronously expressed on different types of arrays. It is particularly important in the cases when the descriptors are deduced using a training set on one array platform and validation sets on a different platform, and especially when descriptor genes are present on the training array but are missing on the validation array [40–44, 23]. Using groups of genes (instead of individual genes in «gene signatures») and their summarized «group» expression instead of individual gene expression allows one to reduce or eliminate this problem. Thus, the gene TMIM158 from the PLAUR group is missing on Agilent arrays, but the group itself can still be used effectively as the descriptor with one gene missing. The average or summarized expression of the remaining genes in the group can be used as the group expression metric in this case.
Importantly, the pattern of group expression (i.e. an average of gene expression within a group) is remarkably stable between different studies and unique for the group, group expression profiles are essentially different and among the samples in all studies, i.e. the groups are expressed independently from each other. Therefore, the groups can be applied as robust descriptors for dividing samples (patients in the cohort) into sub-clusters (see additional file 6). The group descriptors can be applied consequently: Group 1 divided patients into two clusters, then Group 2 sub-categorizes each part into two and so on. Eventulally, every sample will be "barcoded" with 5 numbers reflecting the Group's expression mode as "1" or "2", for instance 1-1-2-1-2 (see additional file 6), and samples can be grouped together based on the matching "barcodes".
Here we described a fundamental property of certain genes to be expressed in two «modes» or expression levels depending on physiological condition/disease state. We studied this phenomenon in invasive breast cancer in five different studies using different array platforms, including cDNA arrays, Affymetrix and Agilent [28–35]. We have shown that bimodal genes are present on all arrays, and that the sets of bimodal genes statistically significantly overlap among the platforms. Therefore, we assume that bimodality is a common property of gene expression, dependent on physiological or disease states and independent of the end-points of the study or the microarray platform. In total, we identfied 866 bimodal genes shared among all platforms.
We developed and applied a computationally efficient algorithm to estimate bimodality of expression intensity distributions of genes based on maximization of the t-statistic -like measure τ (see Methods). Gene expression distribution is often modeled by a mixture of Gaussians with model parameters fit through expectation maximization (see e.g., [21, 22]). Bimodality of expression can then be deduced from testing log-likelihood ratios of two component mixture distribution versus a single component normal distribution as in , or through calculation of Bayesian information criterion as in . These approaches are computationally demanding and do not offer clear advantages over t-statistics. Also, characterization of a gene's bimodality via excess kurtosis as in  disregards bimodal distributions with unbalanced sizes of peaks, while a t-statistic still captures such unbalanced bimodality. A different approach for characterizing "hypervariable" genes was applied in , where authors searched for genes with higher variability than in a majority of genes. The F-test was used to select the genes with variances significantly higher than the variance of genes in a 'reference group'.
Conditional bimodality is an unexpected and non-trivial property of gene expression. The expected distribution of any quantitative character in biological systems is expected to be normal . Moreover, most genes in the studied datasets are not «bimodal» (Table 1). Distribution with two distinct peaks means that transcriptional regulation of some genes is conditional - breast cancer-dependent in our case. Alternatively, one could expect expression of two different conditionally prevailing splice variants for certain transcripts - a phenomenon shown for some cancers . However, observation of this case is not likely, as we see the same genes on three different platforms, and the array set does not allow us to separate different splice variants, at least for the original cDNA array.
Bimodal expression is conditional and, in our case, is linked to the complex condition known as breast cancer. The set of 866 common bimodal genes is heavily enriched in breast cancer-associated genes, participating in many pathways and processes of cancerogenesis. Thus, the «group query» genes ERBB2, ESR1, CEACAM5 and AR are well known markers of breast cancer [46–49].
According to the theory of modularity of biological processes , bimodal genes tend to cluster into synchronously expressed functional groups of «close neighbors». We described an approach for identification of such groups based on normalized gene expression, which makes it platform-independent and comparable between different arrays. We have selected 23 genes divided into 5 groups which were co-expressed within groups in all five studies on three different platforms (but the group expression was independent from one to another). The genes within the groups are functionally close. Thus, genes in each group form statistically significant protein interaction networks. Some groups, such as TCAP, PSMD3, GRB7, and ERBB2, belong to a well known amplicon . Transcription of MX1, CXCL10, PLSCR1, and ISG15 from the STAT1 group is regulated by STAT1 [37, 38]. 15 out of 23 bimodal genes in groups are known in the literature as breast cancer-associated genes, which suggests breast cancer specificity of these functional modules.
As gene expression within a group is synchronised through many studies, «group expression» can be applied as a «binary» conditional descriptor separating a patient cohort into sub-groups with «-1» and «1» expression. Consecutive application of different groups can be appled for further sub-division of the patient cohort into patient clusters, with "1s" and "2s" for each group used as a bar-code for the patent cluster. The advantage of using group expression instead of individual gene expression is in high robustness: an average per group expression fluctuates at a lower scale than dispersed expression of individual genes. The «close neighbors» gene groups can be used as prognostic descriptors for clinical end-points such as patient survival, metastases development, response to therapy, etc. Sub-categorization of cancer patients is a non-trivial problem due to high heterogeneity of expression profiles. Thus, in a well-known sub-categorisation scheme which divided invasive breast cancers into five clusters based on expression of certain "centroid" genes , over 1/3 of samples could not be categorized into any cluster and expression heterogeneity within clusters was still high, especially in validation studies with more samples, despite running the studies on essentially the same cDNA array [50, 51]. Importantly, when we applied normalized group expression as the clustering metrics, we saw not a single outlier among over 1000 samples in five studies on three different microarray platform. The heterogeneity was also much lower within the clusters (data not shown). Such high robustness makes the "close neighbors" groups potentially very promising biomarkers for clinical end-points in breast cancer and, likely, other types of cancers.
Functional grouping of genes as descriptors also deals with an important issue of reduction of dimensionality in meta-analysis. Meta-analysis can be defined as a cross-study analysis of different patient cohorts united by a clinical end-point or any other parameter [24, 52]. This type of analysis is broadly applied, for example, during comparison of a study of interest with the expression data accumulated in GEO http://www.ncbi.nlm.nih.gov/geo/ or other expression databases (ArrayExpress http://www.ebi.ac.uk/microarray-as/ae/, Stanford Microarray Database http://smd.stanford.edu/), Yale Microarray Database http://www.med.yale.edu/microarray/, etc). Platform compatibility and minimization of «dimensionality» are two major problems in meta-analysis, where «gene signatures» consisting of individual genes are notorious for poor reproducibility [40–44, 23]. Here, we offer a general solution for the problem, consisting of identification of bimodal genes, normalization of their expression and grouping of the normalized expression into synchronised clusters of «close neighbors». Normalization consists of transformation of expression signals into a binary system of «-1» and «1», and it enables comparison of otherwise incomparable expression data between platforms and studies . Lack of individual genes on a certain array platform does not prevent using the group as the descriptor.
We described the phenomenon of bimodality of gene expression in breast cancer and grouping of the bimodal genes into «close neighbor» groups. The sets of bimodal genes are non-random; they are enriched in disease markers and targets and tend to form functionally related groups with synchronised expression. These groups of «close neighbors» can be used as robust descriptors for certain sub-groups of patients and associated with clinically important phenotypes (end-points). Application of functional descriptors consisting of bimodal genes is important in the area of meta-analysis of gene expression experiments across platforms and across studies.
In a set of expression experiments (for instance, a patient cohort), each gene has a distribution of expression signals across the set. Bimodal genes feature a distribution with two distinct peaks (maximal signals) (Fig 1B). For each gene, we can set up a distinguishing expression value such that the signals lower than this value correspond to the lower peak in the bimodal distribution, and the signals higher than this value correspond to the higher peak. The characteristic value was chosen as follows: all expression values for a gene were randomly divided onto two groups, and average and sum of squared deviations were calculated for each group. The lower the sum of deviations, the better the partition.
In calculation of bimodality, we assume that distribution of expression within the cohort for a bimodal gene is a sum of two normal distributions. Let us consider - an expression value of i -s gene in the j -s experiment; L i , U i - partition of the set of all experiments onto subsets depending on i -s gene); # L i , # U i - the number of experiments in each subset; - average signal for i -s gene in each subset L i , U i . We need to find a partition with the minimum γ (L i , U i ): . We need to look at only the subsets with - the values in subset L i are lower than in subset U i . The number of possible partitions with such a property is larger by 1 than the number of experiments (including two cases with empty subsets). For the «optimal» partition, l i = ⟨ L i ⟩, u i = ⟨ U i ⟩ and γ i = γ (L i , U i ). In this case, the characteristic signal T i will be calculated as follows:
. In other words, the characteristic signal T is the border with two optimal divisions, i.e. (Figure 2A).
We can consider as the level of bimodality a relative dicrepancy in the values in sub-sets (measure of signal isolation) τ i , which is calculated as , where M - is the total number of experiments.
The list of values for i- s gene is entered into an algorithm
All signals are sorted by value: the number of signals is n, where j is the number of the signal
For all n - 1 possible partitions of the sorted list of values onto two groups (partition is defined by the number of the highest signal in the smaller by value group), Partition is defined as k ∈ [1, n - 1]):
The average for each sub-set is for the sub-set with lower signals - for the sub-set with higher signals
The sum of squares of deviations for each sub-set:
We choose K i , for which is ( )
The algorithm results in K i , , .
where divides the signals for the given partition according to . This property allows us to clearly divide signals for each peak (mode).
One of the drawbacks of the method described above is its sensitivity to outliers. For instance, if in three experiments out of 100 the expression values are significantly higher than the others, the three signals will be assigned to one peak, and the remaining 97 to another peak. This situation can be avoided if all values for a group will be considered as outliers if its relative size is small, for instance, less than 5%.
This transformation allows us to reduce all signals l i and u i to -1 and 1, correspondingly (Figure 2B). If the set contains a certain number of control experiments (for instance, normal samples among the disease samples), we can consider the expression values for the group with normal samples as 1, and the other group as -1. This allows us to compare expression profiles which are synchronised among the patients but in different directions. Also, the genes with control values belonging to different modes can be excluded.
The mean for normal patients was calculated , where N - samples from normal patients, and # N is their number
In the case of ν i < T i , the gene expression values were transformed by ; otherwise . Therefore, always.
The genes with similar signal profiling constitute a group of «close neighbors»
This study was supported by NCI grant 5R44CA112828-03 Elucidation of protein networks implicated in breast cancer
This article has been published as part of BMC Genomics Volume 11 Supplement 1, 2010: International Workshop on Computational Systems Biology Approaches to Analysis of Genome Complexity and Regulatory Gene Networks. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/11?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.