Gene expression signature of estrogen receptor α status in breast cancer

Background Estrogens are known to regulate the proliferation of breast cancer cells and to modify their phenotypic properties. Identification of estrogen-regulated genes in human breast tumors is an essential step toward understanding the molecular mechanisms of estrogen action in cancer. To this end we generated and compared the Serial Analysis of Gene Expression (SAGE) profiles of 26 human breast carcinomas based on their estrogen receptor α (ER) status. Thus, producing a breast cancer SAGE database of almost 2.5 million tags, representing over 50,000 transcripts. Results We identified 520 transcripts differentially expressed between ERα-positive (+) and ERα-negative (-) primary breast tumors (Fold change ≥ 2; p < 0.05). Furthermore, we identified 220 high-affinity Estrogen Responsive Elements (EREs) distributed on the promoter regions of 163 out of the 473 up-modulated genes in ERα (+) breast tumors. In brief, we observed predominantly up-regulation of cell growth related genes, DNA binding and transcription factor activity related genes based on Gene Ontology (GO) biological functional annotation. GO terms over-representation analysis showed a statistically significant enrichment of various transcript families including: metal ion binding related transcripts (p = 0.011), calcium ion binding related transcripts (p = 0.033) and steroid hormone receptor activity related transcripts (p = 0.031). SAGE data associated with ERα status was compared with reported information from breast cancer DNA microarrays studies. A significant proportion of ERα associated gene expression changes was validated by this cross-platform comparison. However, our SAGE study also identified novel sets of genes as highly expressed in ERα (+) invasive breast tumors not previously reported. These observations were further validated in an independent set of human breast tumors by means of real time RT-PCR. Conclusion The integration of the breast cancer comparative transcriptome analysis based on ERα status coupled to the genome-wide identification of high-affinity EREs and GO over-representation analysis, provide useful information for validation and discovery of signaling networks related to estrogen response in this malignancy.


Background
Estrogen plays essential roles in the development, growth control and differentiation of the normal mammary gland. However, it is well documented that endogenous estrogens are powerful mitogens critical for the initiation and progression of human breast and gynecological cancers [1]. This cell proliferation signal is mediated by the estrogen receptors (ER), members of the nuclear receptor family that function both as signal transducers and transcription factors to modulate expression of target genes [2]. There are two main subtypes of estrogen receptors: ERα and ERβ that generally can form homo-and heterodimers before binding to DNA. Although the DNA binding domains of these receptors are very similar, the overall degree of homology is low [3].
Transcriptional regulation of target genes in response to 17β-estradiol (E 2 ) is mediated by two main mechanisms. In one, the E 2 -ER complex binds to a specific DNA sequence called the estrogen response element (ERE), this receptor-ligand DNA bounded complex interacts with coregulatory proteins, promoting chromatin remodeling and bridging with the general gene transcription machinery thus resulting in transcription initiation [4]. Alternatively, the ligand-ER complex can interact with other DNA-bound transcription factors that in turn bind DNA sequences (e.g. via AP1, SP1 complexes) [5,6]. ERα and ERβ have different affinities for different response elements and exhibit distinct transcriptional properties. Additionally, E 2 also exerts rapid, non-genomic effects attributed to cell membrane-initiated signaling [7].
Approximately two-thirds of all breast cancers are ERα (+) at the time of diagnosis and expression of this receptor is determinant of a tumor phenotype that is associated with hormone-responsiveness. Patients with tumors that express ERα have a longer disease-free interval and overall survival than patients with tumors that lack ERα expression [8]. However, the association between ERα expression and hormonal responsiveness is not perfect: approximately 30% of ERα-positive tumors are not hormone-responsive while 5-15% of ERα-negative tumors respond to hormonal therapy [9]. The molecular basis for the association between ERα expression, hormonal responsiveness and breast cancer prognosis remains unclear.
Several studies have been carried out using cDNA and oligonucleotide microarrays identifying breast cancer subclasses possessing distinct biological and clinical properties [10][11][12][13]. Among the distinctions made to date, the clearest separation was observed between ERα (+) and ERα (-) tumors [10][11][12][13][14][15]. It has been suggested that there are sets of genes expressed in association with ERα that could play an important role in determining the hor-mone-responsive breast cancer phenotype [16]. ERα is obviously likely to be important for the E 2 induced proliferative response predominantly via the regulation of estradiol-responsive genes. Nevertheless, the expression of additional subsets of genes not necessarily directly regulated by estrogen may also be fundamental in defining the breast cancer hormone-responsive phenotype.
To further elucidate the molecular basis of estrogendependent breast carcinogenesis, we here report a comparative transcriptome profiling of invasive breast tumors based on ERα status obtained by SAGE. The SAGE method provides a statistical description of the mRNA population present in a cell without prior selection of the genes to be studied, and this constitutes a major advantage [17]. The breast cancer SAGE comparative analysis was combined with promoter sequence analysis of genes of interest using high-throughput methods of high-affinity ERE identification. In order to have an even more comprehensive picture we also performed a cross-platform comparison between SAGE and DNA microarray studies.

Biomarkers of ERα status in breast carcinomas
The primary goal of our study was to identify the most commonly deregulated genes in invasive breast carcinomas related to ERα status. To this end SAGE data was obtained from a set of primary breast carcinomas. Thus, a breast cancer SAGE database of almost 2.5 million tags was analyzed, representing over 50,000 tag species. We performed a comprehensive evaluation and comparison of gene expression profiles using a recently developed supervised method [18], to identify the most representative differentially expressed transcripts between tumors groups, i.e. ERα (+) vs. ERα (-) breast tumors. This statistical analysis revealed 520 genes differentially expressed (Fold change ≥ 2; p < 0.05) between ERα (+) and ERα (-) primary breast carcinomas (see additional data file 1). Among the 520 transcripts, 473 were up-modulated and 47 were down-modulated transcripts in ERα (+) tumors.
To validate novel ERα associated genes detected by SAGE not reported in other studies, we performed Real Time RT-PCR analysis of representative transcripts in an Real time RT-PCR validation of nine over-expressed genes in 36 invasive breast carcinomas  independent set of 36 invasive ductal breast carcinomas.
SCUBE2 (also known as EGF-like 2 or CEGP1) encodes a secreted and cell-surface protein containing EGF and CUB domains that defines a novel gene family [19]. The epidermal growth factor (EGF) motif is found in many extracellular proteins that play an important role during development, functioning as secreted growth factors, transmembrane receptors, signaling molecules, and important components of the extracellular matrix. The CUB domain is found in several proteins implicated in the regulation of extracellular process such as cell-cell communication and adhesion [20]. Expression of SCUBE2 has been detected in vascular endothelium and may play important roles in development, inflammation and perhaps carcinogenesis [19].
The CELSR2 gene (also known EGFL2) encodes a protein member of the nonclassic-type cadherins (flamingo subfamily). These 7-pass transmembrane proteins have nine cadherin domains, seven-epidermal growth factor-like repeats and two laminin A G-type repeats [21]. It is postulated that these proteins are receptors involved in cell adhesion and receptor-ligand interactions [21] playing a role in developmental processes and cell growth/ maintenance in epithelial and neuronal cells [22,23].
SYTL4 (also known as granuphilin-a or SLP4) contains an N-terminal Slp homology domain (SHD) than can specifically and directly bind the GTP-bound form of Rab27A, a small GTP-binding protein involved in granule exocytosis in cytotoxic T lymphocytes [24]. We determined that overexpression of SYTL4 is associated with ERα (+) tumors ( Figure 1b). However, the potential role of this gene in breast carcinogenesis remains unknown.
ENO2 (also known as NSE/neuron-specific gamma enolase) encodes one of three enolase isoenzymes found in mammals. This isoenzyme was described to be expressed in cells of neuronal origin. Interestingly, in a recent report Hao et al. (2004) showed high expression of ENO2 transcripts in breast cancer lymph node metastases when compared with primary breast tumors [25].
The TSPAN1 gene (also known as tetraspanin or NET1) encodes a cell-surface protein member of the transmembrane 4 superfamily (TM4SF), involved in the regulation of cell development, activation, growth and motility. A number of tetraspanins were described as tumor-specific antigens, and it was suggested that the function of some TM4SF proteins may be particularly relevant to tumor cell metastasis [26]. Sugiura and Berditchevski (1999) observed that TSPAN1 protein complexes may control the invasive migration of tumor cells and contribute to ECMinduced production of MMP2 in breast cancer cell line [27].
NR4A1, a nuclear receptor subfamily 4, group A gene (also known as steroid receptor TR3 or NUR77) encodes an orphan member of the steroid-thyroid hormone-retinoid receptor superfamily whose members mainly act as transcriptional factors to positively or negatively regulate gene expression and play roles in regulating growth and apoptosis [28,29]. A role for NR4A1 in cell proliferation has been previously reported. It was shown that its expression is rapidly induced by various mitogenic stimuli such as: serum growth factor, epidermal growth factor and fibroblast growth factor [28].
Taken together, the genes that we identified and validated appear to be involved in signaling pathways related to cell proliferation, invasion and metastatic processes, but their exact role in breast carcinogenesis remains to be elucidated.

Gene Ontology analysis
Classification of genes based on Gene Ontology (GO) terms is a powerful bioinformatics tool suited for the analysis of DNA microarray and SAGE data. Analysis of GO annotation allows one to identify families of genes that may play significant roles related to specific molecular or biological processes in expression profiles [30]. We used the Expression Analysis Systematic Explorer software (EASE) [31] to annotate the 520 deregulated genes according to the information provided by the GO Consortium [30]. The GO database provided annotation for 80% (419 out of 520) of the genes identified by SAGE. Results of this analysis are shown in Figure 2 and in detail in additional data file 2.
We observed that 31% of ERα associated transcripts are involved in biological processes related to cell growth and/ or maintenance, 21% are related to cell communication, and 16% are related to regulation of transcription. Approximately 16% of these deregulated genes are related to molecular functions associated with DNA binding and more specifically with transcription factor activity (10%) ( Figure 2). Interestingly, using the enrichment GO terms analysis, we identified statistical significant over-represen-

Identification of high-affinity Estrogen Response Elements
We used a recently reported genome-wide high-affinity ERE database [32] to identify putative EREs in the promoter regions of the SAGE-identified 473 up-modulated genes in ERα (+) breast tumors. We identified 220 EREs distributed on 163 out of the 473 genes (35%) (see additional data file 3). Seventy-two percent of these genes contain one high affinity ERE (117 out of 163) and 28% of them contain two or more EREs in proximity to the transcriptional start sites (TSS) (46 out of 163) (Figure 3a). These EREs can be located in both coding and non-coding sequences such as was described by Bourdeau et al. [32].
The observed frequency of these elements in our study was 220 EREs in 3260 kb (considering a DNA window of 20 kb for each one of the 163 up-modulated genes with EREs). Compared with the expected frequency from random distribution of high-affinity EREs found in the genome (732 EREs in 3,069334 kb 0.8 ERE in 3260 kb) (see material and methods) [32], the number of individual EREs was 270 fold higher than expected by chance (p < 0.00001).
Fifty percent (110 out of 220) of the detected EREs mapped within a 10 kb region 5' of the TSS, while the rest mapped to 3' regions ( Figure 3b). Approximately 68% of EREs mapped within the region between -5 to +5 kb from the TSS; in agreement with those observations of Bourdeau et al. [32]. However, it remains to be determined whether distantly located EREs (e.g. -10 kb from the TSS) are functional E 2 -ER binding sites related to transcriptional activation.
It is interesting to note that we were unable to identify high-affinity EREs on the majority of deregulated genes (65%) associated with a positive ER α status. The possibility exists that many of these genes are transcriptionally regulated by non-ERE mediated mechanisms such as those involving ER binding to the AP1 or SP1 transcription factors [33]. The AP1 transcription factor is a heterodimer formed by Jun and Fos family member proteins that binds to the phorbol diester (TPA) response element as well as to the AP1 consensus DNA sequence. In this pathway, ER plays a co-activator role for AP1 [6]. The ER/ AP-1 complex can confer estrogen responsiveness to additional subset of genes found in our dataset such as: lin-like growth factor binding protein-4 (IGFBP4) (Fold change: 2; p = 0.01) and heat shock protein 27 (HSPB1) (Fold change: 2; p = 0.045); four transcripts detected as over-expressed in ERα (+) tumors in our study (additional data file 1).
An additional pathway of transcription regulation by estrogen involves the ER-related receptors (ERR), nuclear orphan receptors with significant homology to ERs, which do not bind estrogen and have unknown physiological ligands. ERRs are known to bind to the steroidogenic factor 1 response element (SFRE) and also bind to classic EREs, by means of which they exert constitutive transcriptional activity [34]. We detected over-expression of the nuclear orphan receptor NR4A1 by SAGE and subsequently validated this observation by real time RT-PCR ( Figure 1g). Interestingly, and as previously mentioned, the genomic region 5' and 3' to the TSS of NR4A1 contain high-affinity EREs. Interaction between ERs and ERRs has been observed in the transcriptional regulation of certain genes such as the human breast cancer related gene TFF1/ pS2, the promoter of which is not only activated by ERs but also by ERRs [35].
As described, ERα can mediate estrogenic response through multiple genomic and non-genomic mechanisms, many of which affect proteins and pathways not necessarily directly or exclusively associated with ERα. Thus it is worth stressing that it will the totality of deregulated proteins the ones that ultimately define the phenotype of ERα (+) breast carcinomas regardless of whether a "direct association" with ER transcriptional regulation exists or not.

In vivo versus in vitro estrogen induced global gene expression findings
The SAGE profiles for E 2 -responsive genes in MCF-7 cell line, previously reported by us [36], was compared with the ER status genes expression profile found in primary breast carcinomas. Briefly, we detected 199 transcripts differentially expressed (p < 0.01) in MCF-7 treated cells, 124 were up-regulated and 75 were down-regulated transcripts. Basically and as reported Charpentier et al, we observed a general up-regulation cell cycle progressionrelated genes including: CCT2, CCND1, PES1, RAN/TC4, CALM1, CALM2; and tumor-associated genes such as: RFP, D52L1, TFF1/PS2, CAV1, and NDKA among others [36]. These together could contribute to the stimulation of proliferation and the suppression of apoptosis by E 2 -ER transcriptional regulation.
By comparing the in vitro (199 differentially expressed transcripts) and in vivo (520 differentially expressed transcripts) gene expression profiles, to our surprise we detect that only few transcripts: TFF1, CCND1, H19, SREBF1 and WWP1 behaved similarly (i.e. up-regulation) in both studies. This is similar to observations made previously by Meltzer and co-workers whom showed that the majority of genes regulated in cell culture do not predict ER status in breast carcinomas [11,37]. This result suggests that the estrogen-responsive pathways affected in vitro represent only a minor portion of the global gene expression profiles characteristic of ERα (+) breast tumors. This maybe in great part the result of the heterogenous nature of bulk tumor tissue but in addition, the in vitro response of a single cell line to E 2 , in this particular case the widely used MCF-7 cells, may not faithfully reproduce the physiological effects of ER signaling in vivo.

Cross-platform gene expression profiling comparison
In order to identify and validate the most reliable set of genes able to discriminate breast carcinomas based on their ERα status, we performed a cross-platform comparison between the described SAGE dataset with two previously reported breast cancer studies based on DNA microarray methods [12,13]. van't Veer et al. [12] reported the gene expression profile of 97 primary breast tumors based on oligonucleotide microarrays containing 24,479 elements (Agilent Technologies, Palo Alto, CA, USA). In another study, Sotiriou et al. [13] reported the gene expression profile of 99 primary breast tumors using a cDNA microarray containing 7650 elements. Only files containing differentially expressed genes associated to ERα status tumors from both microarrays studies were obtained for cross-platform comparison (see material and methods).
Among the three platforms, a total of 1686 transcripts were identified as over-expressed in ERα (+) breast tumors. One hundred and eighty-three genes were identified by more than one method (  (Table  2).
One hundred and fourteen genes were identified as overexpressed by oligonucleotide microarrays [12] and SAGE in ERα (+) tumors, representing a non-random significant number of overlapping genes based on normal approximation to the binomial distribution (p < 0.001) ( Figure 4 and Table 2). Sixty-six genes were identified as overexpressed in ERα (+) tumors by both DNA microarrays platforms (p < 0.01). The set of 25 genes overlapping between cDNA microarrays [13] and SAGE were not statistical significant (p > 0.05).
Interestingly, we found a higher number of overlapping genes between the oligonucleotide microarray and SAGE platforms (114 genes), while only 66 genes were observed overlapping when comparing both microarray platforms. It is worth noting that 96% of the 470 genes ( Figure 4) identified as overexpressed by the cDNA microarray method [13] were included within the total set of elements in the oligonucleotide microarray platform [12]. In Cross-platform comparisons of the up-modulated transcripts in ERα (+) breast carcinomas Figure 4 Cross-platform comparisons of the up-modulated transcripts in ERα (+) breast carcinomas. One hundred and eighty-three genes were identified by more than one study, eleven of which were commonly identified across the three platforms. a) Comparison between SAGE and oligonucleotide microarray platforms [12] showing a highly significant number of overlapping genes (p < 0.001) (see table 2). b) Comparison between SAGE and cDNA microarray platforms [13] (p > 0.05). c) Statistically significant number of overlapping genes identified by both DNA microarrays platforms (p < 0.01). other words, it appears that a better correlation was observed between SAGE and oligonucleotide arrays, than between both DNA microarray methods.

Conclusion
In summary, our comprehensive comparison of overlapping genes across different gene expression platforms provides validation for a significant number of transcripts identified as highly expressed in ERα (+) breast tumors.
More importantly this analysis identifies the most promising biomarkers for further evaluation as ERα associated genes in breast cancer. Furthermore, the identified proteins may be of value as breast cancer prognostic indicators analyzed either as a group or individually. It is also likely that groups of co-regulated genes in ERα (+) breast cancers may be associated to the hormonal control of mammary epithelial cells growth and differentiation. Finally, a better understanding of the signaling networks controlled or associated with the estrogen response may lead to the identification of novel breast cancer therapeutic targets.  [36,38].

Data processing and statistical analysis of SAGE libraries
SAGE tag extraction from sequencing files was performed by using the SAGE2000 software version 4.0 (a kind gift of Dr. K. Kinzler, John Hopkins University). SAGE data management, tag to gene matching as well as additional gene annotations and links to publicly available resources such as GO, UniGene, LocusLink, were performed using a suite of web-based SAGE library tools developed by us http:// spi.mdacc.tmc.edu/bitools/about/sage_lib_tool.html. In our analyses we only considered tags with single tag-togene reliable matches. To compare these SAGE libraries, we utilized a modified t-test recently developed by us [18]. This test is based on a beta binomial sampling model that takes into account both, the intra-library and the interlibrary variability, thus identifying 'common patterns' of SAGE transcript tag changes systematically occurring across samples [18].
All raw SAGE data reported as Supplementary tables in this manuscript is publicly available at http://science park.mdanderson.org/ggeg/sage_Proj_9.htm.

Real Time RT-PCR analysis
Template cDNAs were synthesized on mRNAs isolated from an independent set of 36 Stage I -Stage II human breast carcinomas (13 ERα-negative tumors and 23 ERαpositive tumors) obtained from our tumor bank. Primers and probes were obtained from the TaqMan Assays-on-Demand™ Gene Expression Products (Applied Biosystems, Foster City, CA, USA). All the PCR reactions were performed using the TaqMan PCR Core Reagents kit and the ABI Prism ® 7700 Sequence Detection System (Applied Biosystems, Foster City, CA, USA). Experiments were performed in duplicate for each data point and 18s rRNA was used as control. Results were expressed as mean ± 2 Standard Error based on Log2 transformation of normalized real time RT-PCR values of the assayed genes. We used ttest to compare the gene expression levels of validated genes between ERα (+) and ERα (-) breast tumors (p < 0.05).  non-specific antibody binding, the slides were incubated with 10% goat serum in PBS for 30 min. Primary monoclonal ERα antibody (ER-6F11, Novocastra, Newcastle, UK) was used at 1:50 dilution and detected following standard immunohistochemical techniques. DAB was used as chromogen and Mayers hematoxylin is used as counterstain. Scoring was performed by breast pathologist (AS). Cuttoff for positivity was determined at 5% of tumor cells staining positively for ER (i.e. < 5% of cells the tumor was considered negative for ERα).

Bioinformatics analysis
For automated functional annotation and classification of genes of interest based on GO terms, we used the EASE [31] available at the Database for Annotation, Visualization and Integrated Discovery (DAVID) at http:// david.niaid.nih.gov/david [39]. The EASE software calculates over-representation of specific GO terms with respect to the total number of genes assayed and annotated. Statistical measures of specific enrichment of GO terms are determined by means of an EASE score (p < 0.05). The EASE score is a conservative adjustment of the Fisher exact probability that weights significance in favor of biological themes supported by more genes and is calculated using the Gaussian hypergeometric probability distribution that describes sampling without replacement from a finite population [31]. This allows one to identify biological themes within a specific list of EASE analyzed genes.

High-affinity Estrogen Response Elements (ERE) analysis
To identify the occurrence of EREs within the promoter regions of up-modulated genes in ERα (+) breast tumors, we used a human genome-wide high-affinity ERE database http://mapageweb.umontreal.ca/maders/eredata base/ [32]. This public available database contains 71,119 EREs identified across the human genome (related to 17,353 transcriptional start sites), representing the consensus ERE (5'-Pu-GGTCA-NNN-TGACC-Py-3'), and equivalent sequences with only one or two nucleotide variations from such consensus. Based on these restrictions the expected random frequency was calculated as the total number of base pairs in the human genome divided by the frequency of occurrence of a sequence with specified base pairs at 10 positions and two base pair choices at two positions (3,069334246/4 11 = 732 high-affinity EREs) [32].

Comparison of gene expression patterns identified by different methodologies
ERα status associated genes identified in previous breast cancer studies [12,13] using DNA microarray methods were compared with our SAGE findings.
These datasets were annotated by LocusLink ID using the EASE software [26], and then compiled into one Excel spreadsheet pivotTable for comparison of overlapping genes between platforms, i.e. SAGE, Oligonucleotide and cDNA arrays. Anonymous ESTs from the microarrays platforms were excluded due to the inability to cross validate the identities between different gene expression profiles. Any combination of two lists was compared for matching gene-identity. The number and identity of genes commonly affected in two platforms (e.g. SAGE study vs. DNA microarray) was determined. We used the normal approximation to the binomial distribution as previously described [40] to calculate whether the number of matching genes derived from each cross-platform comparison was of statistical significance (p < 0.05).