A reference database for tumor-related genes co-expressed with interleukin-8 using genome-scale in silico analysis

Background The EST database provides a rich resource for gene discovery and in silico expression analysis. We report a novel computational approach to identify co-expressed genes using EST database, and its application to IL-8. Results IL-8 is represented in 53 dbEST cDNA libraries. We calculated the frequency of occurrence of all the genes represented in these cDNA libraries, and ranked the candidates based on a Z-score. Additional analysis suggests that most IL-8 related genes are differentially expressed between non-tumor and tumor tissues. To focus on IL-8's function in tumor tissues, we further analyzed and ranked the genes in 16 IL-8 related tumor libraries. Conclusions This method generated a reference database for genes co-expressed with IL-8 and could facilitate further characterization of functional association among genes.


Background
Chemokines are a large superfamily of small, structurallyrelated peptides originally discovered as neutrophil attractants. Interleukin-8 (IL-8) is a potent member of the supergene family of CXC chemokines with ELR motif (ELR+). ELR + CXC chemokines are potent angiogenic factors, whereas ELR -CXC chemokines are potent angiostatic factors. Studies have shown that these angiogenesis-related activities are correlated with tumorigenesis in many tu-mor types and are distinct from their ability to recruit neutrophils [1,2].
Interleukin-8 (IL-8) is inducible in a wide range of cells including lymphocytes, monocytes, endothelial cells, fibroblasts, hepatocytes, and keratinocytes [3][4][5]. IL-8 is also found to be constitutively expressed in several tumor tissues including bronchogenic carcinoma, non-small cell lung cancer, colorectal carcinoma, breast cancer, melanoma, prostate cancer, gastric carcinoma, and ovarian can-cer. While IL-8 has been implicated in growthpotentiation [6], angiogenesis [7], metastasis [3,8], and tumorigenesis [3,6] of various tumors, its specific role remains unclear. Several observations support the assertion that a complex interaction between IL-8 and several other growth factors, cytokines, or other proteins is responsible for these tumor-related events [5,[9][10][11][12][13]. Growth regulated protein alpha (GRO-a), beta (GRO-b), and ENA-78 have been reported to be co-induced with IL-8 in A549 cells stimulated with two proinflammatory cytokines, IL-1b and TNF-a [14]. This result is consistent with the presence of nuclear factor kappa B (NF-kB) consensus binding sites in the promoter regions of all three genes [4,15]. In addition, GRO-a and -b are also members of the ELR + CXC cytokine family that are reported to be important mediators of tumorigenesis through their angiogenic properties [2,16]. They both share one of the IL-8 receptors, CXCR2, which has been postulated to regulate the ELR + CXC chemokine-mediated angiogenesis and resulting-tumorigenesis [1]. These observations suggest that GRO-a and -b may exhibit expression profiles similar to IL-8. Coordinated expression of these and other factors with IL-8 in certain tumor tissues suggests that they may be functionally associated with IL-8 in these tumors. In order to fully understand IL-8's role in these events, it is important to investigate these coordinately expressed genes. A systematic, unbiased approach to identification and ranking of proteins related to IL-8's expression in tumor tissues/cells may help to define the scope of IL-8's role in tumorigenicity and important related interactions with other factors.
There are over 3 million human Expressed Sequence Tag (EST) records in GenBank (Table 1), which is still growing rapidly. EST sequences in GenBank are derived from cDNA libraries generated from a vast array of tissue types including normal, disease-state, and variously treated tissues. The number of EST clones is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make the cDNA library and thus the EST distribution can provide a quantitative assessment of differential expression of a gene [17]. The expanding tissue diversity and EST coverage have increased the statistical power of EST-distribution based expression analysis. An approach using EST expression data as a binary variable (present or absent in a cDNA library) has previously identified prostate cancer-associated genes [18]. We are reporting here a novel in silico approach for identification of genes whose mRNA are enriched in libraries where IL-8 is represented. We focused on ESTs from cDNA libraries in which IL-8 has been sequenced at least once and analyzed the EST frequency of occurrence for all other genes in these IL-8 related cDNA libraries. Those cDNA libraries were further catalogued into tumor and non-tumor libraries, which allowed us to identify genes whose expression profile is closely related to that of IL-8 in tumor tissues.

Genes co-expressed with IL-8
We found that IL-8 has been sequenced in 53 dbEST cDNA libraries (Table 1 and Additional file 1). Using these cDNA libraries (referred to as IL-8 tissue), we generated a reference database for all the genes and their expression profile relationship (measure by Z-score) with IL-8. The complete list of genes is provided in additional file 2, along with a distribution table based on contig size and Zscore (Table 2). A gene could be represented by multiple contigs which represent splice variants and sequencing errors. In this database, we provide the statistics for each contig, including the number of EST clones, cDNA libraries, EST clones in IL-8 tissue, and cDNA libraries in IL-8 tissue. First we evaluated the performance of this search using the distribution analysis results for some known genes ( Table 3). The co-expression of IL-8 with IL-6, GROa, and GRO-b in many tissues has been well documented [5,14,19], and those genes are identified here as IL-8 related genes with an IL-8 Z-score >= 3. The GRO-g and ENA- 78 [14] also show a high IL-8 Z-score, although the available EST clone number for these two genes is much smaller. Successful identification of these genes demonstrates the value of our in silico approach. Besides Z-score, this database can be sorted in many different ways. For example, we can adjust the stringency of correlation by using the number of IL-8 cDNA libraries as a cutoff parameter.
The significance of this study lies in the generation of a complete reference database for IL-8 co-expressed genes. We present here a list of 36 genes with high EST clone count (>= 100) and high Z-score (>= 3) (referred to as IL-8 genes, I-1, I-2, ..., I-36) ( Table 4) to illustrate the usage of this reference database. A high EST clone count is used to ensure the statistic significance of the data mining results. IL-8 tissues represent both tumor tissues/cells (IL-8tumor) and non-tumor tissues/cells (Table 1 and Additional file 1). To generate a tumor expression profile, the percentage of IL-8-tumor clones versus IL-8 clones (R tumor ) was calculated for the 36 IL-8 genes. The resulting graph is shown in Figure 1. The R tumor profile reveals a bimodal distribution. In the case of IL-8, tumor clones account for approximately 17% of all IL-8 clones (R tumor = 17%). Eleven IL-8 genes exhibit a R tumor of greater than 80%; five of these being greater than 95% (kruppel-like factor 2 (KLF2), presenilin 1 (PS1), neural proliferation and differentiation control protein-1 (NPDC1), GW112 and claudin-3). On the other hand, eight genes were not found in any IL-8-tumor related library (R tumor = 0).
For comparison with GRO-a and GRO-b, the ratios of IL-8 clones, GRO-a clones and GRO-b clones to total clones(R il8 , R GRO-a and R GRO-b ) for each IL-8 gene were analyzed. Approximately half of the IL-8 genes show similar correlation with all three chemokines. As an example, GW112 exhibits high correlation with IL-8, GRO-a andb (R il8 = 69%, R GRO-a = 71%, R GRO-b = 68%). On the other hand, KLF2 (R il8 = 71%, R GRO-a = 16%, R GRO-b = 2%) and PS1 (R il8 = 65%, R GRO-a = 5%, R GRO-b = 3%) show  high correlation with IL-8 and only slight association with GRO-a or GRO-b.  Table 5). To determine whether these genes are specifically related to IL-8-tumor tissue or whether they are also commonly found in tumors lacking IL-8 expression, the ratio between IL-8-tumor clones and general tumor clones (R il8tumor ) was plotted for IL-8-tumor genes ( Figure 2). The R il8tumor for all IL-8 tumor genes are greater than 38%, which is much higher than the expected background ratio between IL-8tumor clones (171521,

Discussion
The vast amount of available EST data allows expression analysis based solely on computational methods. It is pos- sible to construct a "virtual tissue" based on the expression pattern of a particular gene or group of genes (e.g. tumor marker genes). It is also possible to group genes by tissue type through combining genes from the same source tissue (e.g. tumor, brain, etc.). In this study we have generated "virtual tissues" based on (a) expression of IL-8 in all tissues and (b) expression of IL-8 in tumor tissues. Next we generated lists of genes that are most highly coexpressed with IL-8 in these two virtual tissues. Since the method is not limited to genes previously correlated with IL-8 expression nor to known genes, we have the opportunity to identify previously overlooked correlations with known or novel genes. In addition, the relative strength of the correlations is measured by a Z-score based on the number of clones representing IL-8 or IL-8-tumor tissue compared with the total number of clones. Assuming that co-expression is related to function, these Z-scores provide a basis for ranking genes according to potential involvement in IL-8's function. This reference database for genes co-expressed with IL-8 can be cross-referenced with other  large scale expression analyses, such as microarray experiments, to help decipher the regulation network and the functions of IL-8 and IL-8 co-expressed genes.
Among the genes most highly correlated with IL-8 in tumor tissues are KLF2, PS1, VEGF, and tumor necrosis factor receptor-1 (TNFR-1), lumican (keratin sulfate proteoglycan), claudin-3, and perlecan. Cytokines (IL-2 and IL-7) induce the expression of KLF2 in activated T cells which correlates with their survival [20]. They suggest that KLF2 may be involved in avoiding activation-induced cell death. It is possible that IL-8 may also mediate similar actions. This type of action could increase the ability of tumor cells to survive under conditions where cells normally apoptose.
Lumican expression has previously been correlated with higher tumor grade, lower estrogen levels in the tumor and younger age of patients in human breast cancer [21]. In addition, lumican is structurally related to both biglycan which is among the top IL-8-tumor related genes and the human embryonal carcinoma marker antigen TRA-1-60 [22]. These relationships support the hypothesis that lumican may be a tumor related protein involved with IL-8's tumorigenic function.
PS1 was also found to be highly correlated with IL-8 in both the entire list of tissues and specifically in tumor tissues. Based on EST distribution, presenilin 2 (PS2) is less related to IL-8-tumor libraries (Z-score = 0.3, IL-8-tumor clone/Total Clone = 10%). In the breast tumor microarray experiments, unlike PS1, the peak level of expression for PS2 is not in BT-549. PS1 is a gene involved in early-onset familial Alzheimer's disease. There is accumulating evidence that mutations in PS1 accelerate neurodegeneration and facilitate apoptosis, and some researchers [23] suggest an association with the p53 signal transduction pathway. It was suggested that down-regulation of PS1 by wildtype p53 and also p21 WAF-1 may be independent mechanisms leading to apoptosis and tumor suppression [24]. Our data correlating PS1 with IL-8-tumor related tissue also suggest a potential anti-apoptotic role for PS1 in tumorigenesis.   Vascular endothelial growth factor (VEGF), like IL-8, is a potent mediator of angiogenesis. VEGF has been shown to regulate angiogenesis and metastasis of bladder cancer [25] and several recent studies have reported correlations with IL-8 in several cell types including bladder cancer (TCC), non-small cell lung cancer (NSCLC) [26,27], human brain microvascular endothelial cells (HBMECS) [28] and monocytes [29]. These data collectively demonstrate the correlation of VEGF with IL-8 in tumor tissues.
TNF-a is a well-characterized, potent inducer of IL-8 transcription and an anti-apoptotic agent [30][31][32]. Roebuck provides a detailed analysis of IL-8 promoter structure including a promoter recruitment mechanism which involves TNF-a and the cooperativity of NF-kB and NF-IL-6 binding sites [4,32]. TNF-a exerts many of its effects through TNFR-1 and TNFR-2 receptors, thus co-expression of these receptors with IL-8 might be anticipated. Our analysis does reveal a correlation between TNFR-1 and IL-8 expression. Using another bioinformatic approach, Eidelman et al. report a close relationship between IL-8 secretion in cystic fibrosis cells and expression of genes from the TNFR-1/NFkB pathway. As is the case for several other genes found in this study, the TNFR-1/NFkB pathway is associated with p53/p21 WAF-1 tumor suppressor systems [33][34][35][36][37].
Our analysis identified several interesting genes co-expressed with IL-8 in tumor tissues which have not previously been associated with IL-8. Along with lumican, claudin-3, KLF2, PS1, VEGF and TNFR-1 mentioned above, these include several secreted proteins (secreted apoptosis related protein 1 and NPDC1) which are also likely to function in coordination with IL-8; several genes from the well-known family of complement factors (C1 esterase, human complement C4A and complement factor B); and several unknown or hypothetical genes. Interestingly, several of the genes correlate with the p53/ p21 WAF-1 tumor suppressor network. We expect that some of these genes may present important therapeutic targets at one or more levels in the network of IL-8 mediated interactions related to tumorigenesis.
Like all EST-based expression analysis, there are limitations for this in silico method. First, EST-based expression analysis is not suitable for transcripts with low abundance (no or few representative ESTs in the database), which precludes statistical analysis. Second, EST-based expression analysis can only indicate the co-expression of genes, but can not accurately measure the expression level. Third, to generate a comprehensive reference database, we decided not to exclude cDNA libraries based on the sequence depth or special manipulation (e.g. subtraction and normalization). The depth of sequencing and special manip-ulation are not consistent across cDNA libraries, which could also influence the sensitivity of the analysis.
The computational expression analysis methods developed here are not limited to the identification of IL-8 related genes, but can also be applied to many other proteins of interest. This method is complementary to other large-scale expression analysis methods (e.g. microarray) in that it is not limited by the physical presence of a gene on microarray, thus it offers an unique approach to discovering potential functional links between genes through expression profiling.

In silico expression analysis
All non-commercial software used in these studies was written in PERL 5.0. Human EST sequence and cDNA library information were retrieved from GenBank (Release 120) and an in-house relational database model (Sybase, SQL Server Release 11.0, CA, Sybase Inc.) was created to mirror the public human EST database (dbEST). The EST sequences were first binned into clusters if they share the same 21 mer tag beginning with ATG or CTG. Next, ESTs in each cluster were assembled into contigs using the PHRAP sequence assembly software (Phil Green, Unpublished). Based on the ESTs corresponding to IL-8 in its cognate contig, we generated a set of cDNA libraries in which IL-8 is represented. This set of libraries is referred to as IL-8 tissue and can be considered as a virtual tissue consisting of 53 cDNA libraries (306,888 cDNA clones) ( Table 1 and Additional file 1). We chose contigs consisting of at least 5 EST clones and present in at least 3 IL-8 libraries. The frequency of occurrence (F) and Z-score for all the contigs in these cDNA libraries were calculated as: where N IL8 is the number of EST clones in IL-8 tissue for a particular contig, N total is the total EST clone count in that contig. Contigs were collected into groups based on contig size (number of EST clones in a contig), and small contig groups (< 1000 contigs) were merged together with neighboring groups (similar in contig size) to make sure that there were at least 1000 contigs in each group. The rational for grouping contigs based on size is that the distribution pattern of F (i.e. mean and Z-score) may vary for contigs of different size. n is the total number of contigs in a contig group, SF is the sum of F in a contig group, and SDV is the standard deviation of F in a contig group. Assuming a Gaussian Distribution, the fractions of the population that are greater than Z-score SDV and also above the mean are 15.87%, 2.28%, and 0.13% for Z-scores of 1, 2, and 3 respectively. Similarly, a set of tumor cDNA libraries (based on their GenBank Annotations) in which IL-8 is represented was generated. This set of libraries is referred to as IL-8-tumor tissue, consisting of 16 cDNA libraries (171,521 cDNA clones) ( Table 1 and Additional file 1). Using the method described above, a database for IL-8-tumor candidate genes (using IL-8-tumor tissue) was generated and IL-8-tumor Z-scores were calculated. Databases for GRO-a and GRO-b related genes were generated in a similar way using the cDNA libraries where GRO-a or GRO-b were represented respectively (Table 1).