Meta-analytical biomarker search of EST expression data reveals three differentially expressed candidates
© Wu et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Researches have been conducted for the identification of differentially expressed genes (DEGs) by generating and mining of cDNA expressed sequence tags (ESTs) for more than a decade. Although the availability of public databases make possible the comprehensive mining of DEGs among the ESTs from multiple tissue types, existing studies usually employed statistics suitable only for two categories. Multi-class test has been developed to enable the finding of tissue specific genes, but subsequent search for cancer genes involves separate two-category test only on the ESTs of the tissue of interest. This constricts the amount of data used. On the other hand, simple pooling of cancer and normal genes from multiple tissue types runs the risk of Simpson's paradox. Here we presented a different approach which searched for multi-cancer DEG candidates by analyzing all pertinent ESTs in all categories and narrowing down the cancer biomarker candidates via integrative analysis with microarray data and selection of secretory and membrane protein genes as well as incorporation of network analysis. Finally, the differential expression patterns of three selected cancer biomarker candidates were confirmed by real-time qPCR analysis.
Seven hundred and twenty three primary DEG candidates (p-value < 0.05 and lower bound of confidence interval of odds ratio ≧ 1.65) were selected from a curated EST database with the application of Cochran-Mantel-Haenszel statistic (CMH). GeneGO analysis results indicated this set as neoplasm enriched. Cross-examination with microarray data further narrowed the list down to 235 genes, among which 96 had membrane or secretory annotations. After examined the candidates in protein interaction network, public tissue expression databases, and literatures, we selected three genes for further evaluation by real-time qPCR with eight major normal and cancer tissues. The higher-than-normal tissue expression of COL3A1, DLG3, and RNF43 in some of the cancer tissues is in agreement with our in silico predictions.
Searching digitized transcriptome using CMH enabled us to identify multi-cancer differentially expressed gene candidates. Our methodology demonstrated simultaneously analysis for cancer biomarkers of multiple tissue types with the EST data. With the revived interest in digitizing the transcriptomes by NGS, cancer biomarkers could be more precisely detected from the ESTs. The three candidates identified in this study, COL3A1, DLG3, and RNF43, are valuable targets for further evaluation with a larger sample size of normal and cancer tissue or serum samples.
One of the key aspects in the study of cancer is to understand the principles and mechanisms of gene expression variation contributing to cancer genesis and progression. The identification of genes differentially expressed between normal and cancer cells/tissues is not only helpful for designing diagnostic and therapeutic procedures, but also for understanding cancer biology as a whole. In this regard, DNA microarrays have been the dominating platform in the high-throughput study of cancer transcriptomes since their emergence in the mid-1990s [1, 2]. However, there are several drawbacks, which include: high background level signals resulting from cross-hybridization [3, 4]; difference in hybridization properties due to different probe sequences; limited dynamic range due to background level and saturation, and difficulty in detecting splicing isoforms and unknown genes. For these reasons, with the advancement of the next generation sequencers, we are seeing high-throughput transcriptome mapping and quantifying method, also known as RNA-Seq, to begin to supersede microarray in expression profiling. However, RNA-Seq experiments are relatively demanding in terms of time, cost, and computation equipment. Experimental differences between different sequencing platforms may complicate transcriptome analysis with multiple tissue sources. Since exploring meta-analysis from traditional digital expression data such as EST derived from cDNAs [5–8] is more feasible, this study may serve as a precursor to more complicated experiments.
Originally primarily aimed for cataloging of transcript repertoire, ESTs from large-scale cDNA sequencing projects such as Cancer Genome Anatomy Project (CGAP), Human Cancer Genome Project (HCGP), and Cancer Genome Project (CGP) also allow searching for differentially expressed genes (DEGs) in specific tissue types or in whole genomes [9–11]. Several in silico analysis tools such as NCBI Unigene cDNA xProfiler , CGAP Digital Differential Display (DDD) , and CGAP Digital Gene Expression Displayer (DGED)  are available online allowing the analysis of publicly available data. While standard statistical methods such as Fisher's exact test for finding DEGs in two-class problems (e.g. cancer vs. normal) or Pearson's correlation are commonly used , there are also specially developed methods for finding DEGs in the landscape of digital signals for two-library problems [15, 16] or for multiple libraries . The online tools as well as the statistical methods remain useful to this day in EST or even RNA-Seq projects [18–23]. Aside from searching for DEGs, the searches for gene transcript isoforms specific to particular libraries were also demonstrated and many of these attribute differentially expressed isoforms to human cancers [24–31].
In spite of the successful applications, these tools or methods are not without limitations. xProfiler reports differential expression in an all-or-none manner where only a list, but not statistical quantification, of candidates is reported. DDD allows quantification using Fisher's exact test. However, the nature of the test dictates that comparisons of three or more libraries involve multiple pair-wise comparisons, and thus there are no easy comparisons of library specific genes. DGED uses a Bayesian approach to find DEGs, but it is also pair-wise. The reported "odds ratio" is perhaps better described as "relative risk" and may be biased with unequal sampling. Another popular and useful Bayesian-based method originally developed for EST analysis by Audic and Cleverie  is also popular for RNA-Seq data. It is less conservative than Fisher's exact test, but it also does not apply to multi-class problems. The multi-class comparison method established by Stekel et al.  finds specificity in one condition out of all and is useful in application such as finding DEGs in multi-tissue libraries. However, in the search for cancer DEGs, a subsequent analysis of differential expression between cancer and normal libraries of the tissue of interest may not yield fruitful results due to the possible scarcity of EST sampling in the particular tissue type. On the other hand, the naïve method of pooling all data into the two-class problem of normal versus cancer when searching for differentially expressed genes or differentially splice variants  risks introducing bias. In extreme cases, one may encounter the fallacy of Simpson's paradox  where genes in reality more active in the normal condition appear to be more so in the cancer condition (discussed later in this paper).
We now report on the application of a computational and integrative approach to analyze cancer differentially expressed genes (DEGs). The statistical method we employed is Cochran-Mantel-Haenszel statistics (CMH)  and to the best of our knowledge has not been applied in this context. Instead of pooling all normal and all cancer ESTs from different tissue types to fit into a two-class problem as by using the 2 by 2 contingency χ2 test or the Fisher's exact test, CMH allows original stratification of libraries in their respective tissue types, yet exhaustively analyzes expression between cancer and normal conditions across all tissue types. The method is an extension to χ2 test which, in our application, measures the association between cancer and gene expressions, adjusting for the tissue confounding factor. This approach allows one to find genes that are overall differentially expressed in cancer, or multiple-cancer genes, irrespective to a specific tissue type. The method is demonstrated in this paper to exhaustively analyze ESTs from the dbEST database . To the best knowledge of the authors, such an all-inclusive, whole-transcriptome analysis has not been redone in recent years now that more ESTs than ever are available.
Our filtering of EST libraries was also more rigorous than many previous studies. Notably, we excluded the ORESTES (open reading-frame EST sequencing) libraries  on which a normalization procedure had been applied. Libraries from cell line were also excluded owning to their unrepresentativeness of primary cancer cell transcriptomes. Our analysis pipeline further focused on enrichment of the DEGs by cross examination with expression data of a different platform, i.e. the microarray data, and selecting for membrane and secretory associated protein genes since we intend to find therapeutic targets or biomarkers, and conducting STRING (The Search Tool for the Retrieval of Interacting Genes) network analysis to show the cancer enriched clusters . With real-time qPCR validation, we have identified three candidates that are inclined to express in cancer across more than one tissue types. We hope such a meta-analytical and multiple-tissue comparison can serve as an exploratory experiment for future multi-library or multi-tissue study of other digital sources such as RNA-Seq.
Human gene reference sequence preparation
The NCBI Reference Sequences (RefSeq Release 38, November 11, 2009)  were downloaded from its ftp site . Homo sapiens RefSeq records were selected and subjected to repeat masking via RepeatMasker .
Human EST sequence preparation and library filtering
Human EST data (Released on December 11, 2009) and their cDNA library information were downloaded from NCBI dbEST database  and CGAP . Program in Python language was written to mark for discard the unsuitable libraries when the keywords such as "enrichment", "subtract", "pcr", and "normalized" were found in the DESCR, UNIQUE_PROTOCOL, or KEYWORDS fields of the library information. An arbitrary cutoff of > 400 was chosen to the highly unrepresentative libraries (approximately 7,000 libraries constituting approximately 650,000 ESTs were discarded as a consequence). To curb from incorrect inclusions or exclusions, we finalized the process with manual curation. Libraries made from mixed tissues or cell lines were also discarded. The final libraries from CGAP were manually classified into 48 different tissue types and two different conditions, normal and cancer.
EST to gene assignment
The BLAT alignment tool was used to align ESTs to RefSeqs as a mean to assign ESTs to genes . The criteria of having an identity of 95% or above and the minimum length of 100 nucleotides were set for a match. The RefSeq match with the highest identity was assigned for the EST. If two RefSeq matches shared exactly the same identity, the program chose the first encountered.
EST count and summarization
The procedure attributes each transcript represented by RefSeq its expression profile across different tissue and condition types based on EST assignment counts. Each EST has its corresponding tissue type and condition type classification, based on its source clone library. For example, a transcript with an aligned EST from a lung cancer clone library is one expression count each in tissue type lung and condition type cancer. This way, after all ESTs were counted, each transcript has a profile of expression across various libraries and conditions. Expressions from different transcript variants of the same gene were pooled to obtain a single gene expression. The raw counts were thus made into transcription profile for each gene for further statistical analysis.
Statistical evaluation of cancer candidates
A hypothetical EST count table demonstrating CMH analysis and also a contrived example of Simpson's paradox.
Microarray cross reference
Human U133 Plus 2.0 GeneChip array CEL data were downloaded from Gene Expression Omnibus (GEO) . When computing power allows, the data were processed with AffyPLM  using its three-step procedure of processing background signals with GCRMA, normalizing signals with quantile normalization, and summarize probe signals with medium polish. For large experimental datasets that were computationally infeasible for us, we used justRMA from the Affy package . For experimental dataset without raw CEL data, we obtained the pre-processed matrix files via GEOQuery . Regardless of the source of array signal processing, we analyzed the genes for differential expression with Limma . Differentially expressed gene candidates with p-value < 0.05 and logFC > 1.0 were selected and crossed with genes from EST profiling with statistical evaluation. For each array, the significant genes were crossed with our EST profiling results. The union of these intersecting genes was selected for further evaluation.
Annotation of secretory proteins
To identify our differentially expressed genes with secretory annotation, a list of 3,975 proteins with secretory annotation originated from the conglomeration of data from Uniprot (1,632 unique proteins) , Human Plasma Proteome Organization (HUPO) (889 proteins), and Secreted Protein Database (SPD) (4,142 proteins) . This list was matched against DEGs to give them secretory annotation.
Annotation of membrane proteins
Membrane protein annotations were gathered from five sources - TOPDB (283 proteins) , LOCATE (2629 proteins) , PDB_TM (41 proteins) [51, 52], OPM (107 proteins) , and MPDB (23 proteins)  - to generate a unique list of 2,767 membrane proteins. Any DEGs on this list would confer it a membrane annotation.
Validation of tissue expression profiles of candidate genes
TissueScan™ Cancer Survey Panel 96-I qPCR array panel (Origene Technologies, Rockville, MD) containing the cDNAs of 3 normal and 9 cancer tissues each from 8 organs (breast, colon, kidney, liver, lung, ovarian, prostate, and thyroid) was used to examine the expression profiles of selected cancer differentially expressed gene candidates. Real-time qPCR analyses with the Taqman® Gene Expression Assay kits (Applied Biosystems, Foster City, CA) and FAM- and VIC-labeled target genes and HPRT1 internal control primers, respectively, were performed according to the manufacturer's suggested procedure on an Applied Biosystems Prism 7500 system. Relative specific gene expression was quantified by normalization against the HPRT1 with the ΔCT method. Gene expression changes were quantified as 2 - (CT gene - CT control).
Human ESTs selection and tissue distribution
The basic steps of our analysis are illustrated in Figure 1. A total of 8,296,089 human EST sequences (Dec. 11, 2009 release) were downloaded from the NCBI. Despite the size of the data, not all ESTs are relevant for our gene expression analysis. After screening the 8,907 EST libraries as described in the methods section above, 8,447 unsuitable libraries, the preparation of which involved PCR amplification, normalization, subtraction, etc. or originated from cell lines, were discarded. The remaining 460 libraries consisted of 2,386,536 EST sequences representing approximately a third of all the downloaded human ESTs.
After BLAT alignment of the 2,386,536 ESTs to 44,513 gene transcripts from RefSeqs, approximately 1,644,960 (68.92%) ESTs with at least 100 nucleotides matched to RefSeqs were detected. An examination of the sources of the matched ESTs indicated that the representativeness of each tissue is skewed and that the brain is the most represented out of all tissues. Among the 48 different tissues, brain ESTs constituted 26% of all matched ESTs, uterus (6.40%) ranked second, followed by testis (5.91%), placenta (4.33%), pancreas (3.99%), muscle (3.88%), liver (3.51%), kidney (3.52) and others each below 3% (see Additional file 1). Similarly, condition type (normal and cancer) representation was also skewed. Normal tissue type had 1,251,883 ESTs combined, and cancer tissue had 393,077 ESTs in the ratio of roughly 3 to 1. Originally before filtering out those from the cell lines, there were more cancer ESTs and the ratio of normal ESTs to cancer ESTs was roughly 1 to 3. This showed how much more rigorous our filtering was. Unfortunately, this also meant we had a much smaller dataset to work with.
The unequal distribution of the 1,644,960 matched ESTs in different tissue types caused some tissue types to be ill-represented. For example, the number of brain EST hits dominated over other tissue types. On the other hand, spinal cord had the least count with 430 EST hits. The latter had little value for our application. Therefore, we only took a tissue type into consideration when its total EST hit count was above the cut-off of 20,000. Considering that the human genome has approximately 22,000 genes, the cut-off still did not allow "deep" probe into gene expression. Nevertheless, the method we employed did not attempt to identify specific gene expression in one particular tissue; therefore, the problem was mitigated.
We also categorized ESTs according to their clone library classification, to either be from normal or from cancer. Sometimes a certain tissue-condition type was so under-represented that the information was not trustworthy. For example, adipose had 10,362 normal hits but only 440 cancer hits, and heart tissue had 22,179 normal hits but no cancer hits. For these cases, data was kept throughout the analysis. But these data did not make contribution to our analysis.
Since our EST assignments were made to transcripts represented by RefSeq sequences, when the entire assignment procedure was done, each transcript variant had its expression profile across all tissue-condition types. Due to the lack of enough ESTs data, differentiating between different splicing variants of the same gene was not feasible. We had to pool expression from different splicing variants into a single expression profile representing the gene.
Analysis of differentially expressed genes
EST counts and odd ratios of 11 well-known cancer-related genes present in our list of DEGs.
Homo sapiens brevican
Homo sapiens keratin 14
Homo sapiens keratin 16
Homo sapiens matrix metallopeptidase 11 (stromelysin 3)
Homo sapiens mucin 1, cell surface associated
Homo sapiens vascular endothelial growth factor A
Homo sapiens agrin
Homo sapiens collagen, type III, alpha 1
Homo sapiens matrix metallopeptidase 1 (interstitial collagenase)
Homo sapiens epidermal growth factor receptor (erythroblasticleukemia viral (v-erb-b) oncogene homolog, avian)
Homo sapiens alpha-fetoprotein
Five microarray projects cross referenced with our set of 723 DEGs
Since membrane and secretory proteins could be potential therapeutic target or serum biomarkers, the subcellular location of the 235 DEGs were examined against the secretory and membrane protein lists consolidated from public databases. Among these, 96 DEGs were putative membrane or secretory proteins - 57 had only secretory annotation, 27 had only membrane annotation and 12 had both.
Literature search and STRING analysis of the 96 DEGs
Three candidates had higher expression in several cancer tissues
Reported here is an integrative, meta-analytic approach for the discovery of pan-cancer differentially expressed gene candidates. Our primary enrichment included a set of 723 DEGs with cancer associations supported by GeneGO disease and pathway analysis. Further integrative evaluations with cancer differentially expressed genes suggested by microarray data narrowed the list down to 234 genes, and among these there were 96 DEGs likely belonged to either secretory and membrane protein genes. Further STRING protein network analysis and literature reviewing indicated 71% of the 96 DEGs were highly connected and many of them were associated with cancers in previous publications.
The meta-analytic nature of our study brought us the opportunities as well as challenges to study the digital signatures of various transcriptomes in a new perspective. Comparing to experimental methods that focus on a single tissue type or limited tissue types, our approach allows us to find genes inclined to express in cancer in a pan-tissue manner. An important challenge of our approach is to avoid Simpson's paradox which can occur in a meta-analysis study . Simpson's paradox is where the association between two variables may show a correlation that is reversed in direction from what is observed from stratified subgroups. A contrived example is shown in Table 1 in which gene A appears to have a higher cancer expression when pooled, but it is in fact not so under individual stratified sub-tables. This may be somewhat of an extreme case, where directionality of the ratios actually differs between the sub-tables and the pooled table. However, the tissue confounder still introduces bias, large or small, that may throw our judgment off. In this study, we used CMH to analyze the data based on stratified sub-tables to avoid running into this paradox. One could also analyze only one tissue at a time for differential expression, but this means one has a smaller dataset to work with. CMH could avoid this problem since it uses EST counts from all tissues instead of analyzing just the normal and cancer propensity under each individual tissue type.
In our actual data, the odds ratio of the pooled table is also different from that of the stratified table. For example, the gene PTRF, a polymerase I and transcript release factor, has a pooled odds ratio of 0.40 and a CMH odds ratio of 0.16 calculated from stratified sub-tables. In this particular case, both odds ratios indicated an inclination toward a higher normal expression and are both statistically significant although at different degree (the pooled has a p-value of 5.269e-15 under a 2x2 χ2 test  versus CMH's 7.16E-77). For the gene VCAN (versican), the pooled odds ratio is 1.86 and χ2 test yields a significant p-value of 1.83e-4. However, CMH gives an insignificant result for this gene with p-value of 0.25. As an extreme case, GBP6 (guanylate binding protein family, member 6) has a pooled odds ratio of 6.69 and χ2 test gives a p-value smaller than 2.2e-16 (approaching 0), whereas with CMH the odds ratio is 0.73, actually indicating a higher normal counts, although CMH p-value of 0.15 is insignificant. This indicates Simpson's paradox in action. Careful inspection showed that all cancer counts and most normal counts of GBP6 were contributed by the tongue tissue source. Out of a total of 50 cancer counts and 21 EST normal counts, tongue accounts for cancer and normal counts of 50 and 17, respectively. For this gene, the tongue cancer count 50 is not influential under a total of 29,479 cancer counts and 7,486 counts for the tongue. Thus pooling loses information in this respect and gives a false impression that its cancer expression is much higher when summing all cancer counts from all tissues. Stratifying by tissue type guards against this bias.
Heterogeneity of odds ratios
In the strictest application, the use of Cochran-Mantel-Haenszel method requires the odds ratios of the sub-tables be homogeneous. In our context, it means the ratio of gene expressions between cancer and normal tissues are probably the same among all tissue types under study and any observed variability is most likely due to sampling bias. Also, the calculated odds ratio would be the estimated common odds ratio across the tissue strata. In our case, however, not all genes had similar ratios under each tissue (based on Woolf's test for homogeneity available in Additional file 2 under the "Woolf" column label), and this was of course expected. In spite of this, we were interested in the overall expression patterns of the genes in cancer conditions. We were not interested in an estimate of common odds ratio across the strata, which often does not exist. We were interested in hypothesis testing - to give us leads to the genes that had higher cancer expression in general. In this regard, the test could be applied [58, 59]. The CMH odds ratio is a weighted average of the odds ratio in each tissue classification and can give us a summary measure , which we used to prioritize and followed up with subsequent biological analyses. In other word, an odds ratio in our data was merely a value that "average up" across all tissue types. From these ratios we were able to reveal the preferential cancer expressions, since the list covered a number of important known biomarkers, and enrichment of cancer-related genes were supported by knowledge-based GeneGO analyses and previous publications.
Lower bound of confidence interval
Another distinctive tactic we used is the selection of DEGs among the statistically significant genes (p-value < 0.05) base on lower bounds of the confidence interval of the odds ratio estimates. The popular approach to search for DEGs is to select genes base on p-value first, and then select the subset base on parameter estimators such as odds ratio or fold change values. The p-value criterion selects the statistically significant ones (those not likely to be the result of random fluctuation). The subsequent criterion is based on prior domain knowledge. However, among those with statistically significant p-values and similar parameter estimators, the ranges of the estimations can vary widely. Using our dataset as an example, the two genes TUBA1B and FAM60A both have odd values of 2.38 (Additional file 2). However, for TUBA1B, it is within the 95% confidence that its true odds ratio is between 2.26 and 2.50. Yet for FAM60A it is between 1.59 and 3.54. Based on our background knowledge and for future application, if we must select genes having odds ratios greater than 2.0, then using odds ratio as cutoff would not serve this purpose since it is quite possible that the real odds ratio (i.e., of the population) is below 2.0. Choosing genes based on their confidence intervals would be more precise, but this has not been much appreciated.
The multi-cancer approach compares genes that are overall differentially expressed among multiple cancer types comparing to their respective normal tissue types. Although many biomarker studies focus on gene differentially expressed in a particular tissue type, Wu et al. found 8 proteins in the conditioned media of 23 cell lines showing negative or weak tissue staining in the Human protein atlas, suggesting them to be potential pan-cancer markers . Sahin et al., found that claudin-18 splice variant 2 had the ectopic activations in pancreatic, esophageal, ovarian, and lung tumors while its expression in normal tissue only occurred in differentiated epithelial cells of the gastric mucosa, confirmed by RT-PCR . These studies suggested that relatively multi-cancer genes or multi-cancer splice variants exist. The three candidates COL3A1 (Collagen alpha-1(III) chain), DLG3 (Discs large homolog 3) (plasma membrane), and RNF43 (Ring finger protein 43) are putative secreted or plasma membrane proteins with the potential of developing serum diagnostic reagents. In reviewing the involvement of these genes with cancers in previous studies, hint for pan-cancer marker was surfaced as the expression of the extracellular matrix protein COL3A1 gene in brain cancer  and angiofibroma  was elevated. While secreted membrane bound RNF43 protein gene was known to be up-regulated in colorectal cancer . Interestingly, upon the real-time qPCR analysis of three cancer differentially expressed secreted protein gene candidates, COL3A1, DLG3, and RNF43 identified in this study, higher cancer expression levels of these genes in multiple cancer types were verified. This does not only indicate the usefulness of our computational approach and filtering procedure but also encourages us to devote further resources for assessing the clinical usages of these three candidates.
Pooling of gene expression
Earlier in this discussion, we mentioned that naïve pooling of data may introduce bias and at worst may produce Simpson's paradox. We also mentioned that we have tackled this problem with CMH. Nevertheless, two other occasions of pooling actually took place. We pooled expression from different splicing variants from the same gene to make one gene expression. We also pooled different libraries of the same tissue into one tissue classification. In both of these cases, we may encounter expression bias, since different splicing variants and different tissue libraries (i.e., tissues from different patients) might have differences in expression patterns. This is an unfortunate limitation in this and similar studies, since dbEST data consists of many different sources, and given the relative lack of data after the very stringent criteria we have used in our library selection compare to previous studies (Most importantly the exclusion of ESTs from cell lines, PCR amplification, subtraction, and cDNA normalization protocols). We opted for pooling since we had comparatively limited number of sequences to work with (1,644,960 out of 8,296,089 downloaded - 18.03%). Nonetheless, future digital expression profiling can be made better with the RNA-Seq methodology that offers a greater depth of coverage than ESTs obtained from traditional cDNA sequencing. It gives a much larger sampling size that makes more realistic the differentiation among isoforms and also makes pooling of different libraries of the same tissue less necessary. As for discovery of pan-cancer genes or isoforms when studying multiple tissue types, similar idea as outlined in this study would be just as applicable.
We have demonstrated that the use of the Cochran-Mantel-Haenszel statistic in the integrative approaches allowed us to identify potential biomarkers or therapeutic targets via exhaustive search of various EST libraries from dbEST. As shown in previous study, splice variant could be useful target of antibody therapy . The method can be easily extended over to searching cancer differential splicing variants had there been enough data. The issues involved in the analysis, such as the Simpson's paradox and the pan-cancer markers, may also be encountered in other multi-class digital analysis. The three targets confirmed by real-time qPCR, COL3A1, DLG3, and RNF43, are worthy of further evaluation for clinical applications.
This work was supported by grants NSC 99-3112-B-010-003 (W.V. Ng) from the National Science Council and an intramural grant derived from the Aim for the Top University Grant awarded to National Yang Ming University from the Ministry of Education, Taiwan, the Republic of China.
This article has been published as part of BMC Genomics Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S7.
- Pease AC, Solas D, Sullivan EJ, Cronin MT, Holmes CP, Fodor SP: Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proceedings of the National Academy of Sciences of the United States of America. 1994, 91 (11): 5022-5026. 10.1073/pnas.91.11.5022.PubMed CentralView ArticlePubMedGoogle Scholar
- Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science (New York, NY). 1995, 270 (5235): 467-470. 10.1126/science.270.5235.467.View ArticleGoogle Scholar
- Casneuf T, Van de Peer Y, Huber W: In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation. BMC Bioinformatics. 2007, 8: 461-10.1186/1471-2105-8-461.PubMed CentralView ArticlePubMedGoogle Scholar
- Okoniewski MJ, Miller CJ: Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics. 2006, 7: 276-10.1186/1471-2105-7-276.PubMed CentralView ArticlePubMedGoogle Scholar
- Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF, et al: Complementary DNA sequencing: expressed sequence tags and human genome project. Science (New York, NY). 1991, 252 (5013): 1651-1656. 10.1126/science.2047873.View ArticleGoogle Scholar
- Matsubara K, Okubo K: cDNA analyses in the human genome project. Gene. 1993, 135 (1-2): 265-274. 10.1016/0378-1119(93)90076-F.View ArticlePubMedGoogle Scholar
- Weinstock KG, Kirkness EF, Lee NH, Earle-Hughes JA, Venter JC: cDNA sequencing: a means of understanding cellular physiology. Curr Opin Biotechnol. 1994, 5 (6): 599-603. 10.1016/0958-1669(94)90081-7.View ArticlePubMedGoogle Scholar
- Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O, et al: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995, 377 (6547 Suppl): 3-174.PubMedGoogle Scholar
- Ewing RM, Ben Kahla A, Poirot O, Lopez F, Audic S, Claverie JM: Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome research. 1999, 9 (10): 950-959. 10.1101/gr.9.10.950.PubMed CentralView ArticlePubMedGoogle Scholar
- Vasmatzis G, Essand M, Brinkmann U, Lee B, Pastan I: Discovery of three genes specifically expressed in human prostate by expressed sequence tag database analysis. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95 (1): 300-304. 10.1073/pnas.95.1.300.PubMed CentralView ArticlePubMedGoogle Scholar
- Schmitt AO, Specht T, Beckmann G, Dahl E, Pilarsky CP, Hinzmann B, Rosenthal A: Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic acids research. 1999, 27 (21): 4251-4260. 10.1093/nar/27.21.4251.PubMed CentralView ArticlePubMedGoogle Scholar
- The Cancer Genome Anatomy Project cDNA xProfiler. [http://cgap.nci.nih.gov/Tissues/xProfiler]
- NCBI Unigene Digital Differential Display. [http://www.ncbi.nlm.nih.gov/UniGene/ddd.cgi]
- The Cancer Genome Anatomy Project Digital Gene Expression Displayer.Google Scholar
- Audic S, Claverie JM: The significance of digital gene expression profiles. Genome research. 1997, 7 (10): 986-995.PubMedGoogle Scholar
- Lal A, Lash AE, Altschul SF, Velculescu V, Zhang L, McLendon RE, Marra MA, Prange C, Morin PJ, Polyak K, et al: A public database for gene expression in human cancers. Cancer Res. 1999, 59 (21): 5403-5407.PubMedGoogle Scholar
- Stekel DJ, Git Y, Falciani F: The comparison of gene expression from multiple cDNA libraries. Genome research. 2000, 10 (12): 2055-2061. 10.1101/gr.GR-1325RR.PubMed CentralView ArticlePubMedGoogle Scholar
- Lu B, Xu J, Lai M, Zhang H, Chen J: A transcriptome anatomy of human colorectal cancers. BMC cancer. 2006, 6: 40-10.1186/1471-2407-6-40.PubMed CentralView ArticlePubMedGoogle Scholar
- Nie D, Xiang Y: Molecular cloning and characterization of a novel human testis-specific gene by use of digital differential display. Journal of genetics. 2006, 85 (1): 57-62. 10.1007/BF02728971.View ArticlePubMedGoogle Scholar
- Clepet C, Joobeur T, Zheng Y, Jublot D, Huang M, Truniger V, Boualem A, Hernandez-Gonzalez ME, Dolcet-Sanjuan R, Portnoy V, et al: Analysis of expressed sequence tags generated from full-length enriched cDNA libraries of melon. BMC genomics. 2011, 12: 252-10.1186/1471-2164-12-252.PubMed CentralView ArticlePubMedGoogle Scholar
- Tang Q, Ma XJ, Mo CM, Wilson IW, Song C, Zhao H, Yang YF, Fu W, Qiu DY: An efficient approach to finding Siraitia grosvenorii triterpene biosynthetic genes by RNA-seq and digital gene expression analysis. BMC genomics. 2011, 12 (1): 343-10.1186/1471-2164-12-343.PubMed CentralView ArticlePubMedGoogle Scholar
- Jager M, Ott CE, Grunhagen J, Hecht J, Schell H, Mundlos S, Duda GN, Robinson PN, Lienau J: Composite transcriptome assembly of RNA-seq data in a sheep model for delayed bone healing. BMC genomics. 2011, 12: 158-10.1186/1471-2164-12-158.PubMed CentralView ArticlePubMedGoogle Scholar
- Tillett RL, Ergul A, Albion RL, Schlauch KA, Cramer GR, Cushman JC: Identification of tissue-specific, abiotic stress-responsive gene expression patterns in wine grape (Vitis vinifera L.) based on curation and mining of large-scale EST data sets. BMC plant biology. 2011, 11: 86-10.1186/1471-2229-11-86.PubMed CentralView ArticlePubMedGoogle Scholar
- Kan Z, States D, Gish W: Selecting for functional alternative splices in ESTs. Genome research. 2002, 12 (12): 1837-1845. 10.1101/gr.764102.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu Q, Modrek B, Lee C: Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic acids research. 2002, 30 (17): 3754-3766. 10.1093/nar/gkf492.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Z, Lo HS, Yang H, Gere S, Hu Y, Buetow KH, Lee MP: Computational analysis and experimental validation of tumor-associated alternative RNA splicing in human cancer. Cancer Res. 2003, 63 (3): 655-657.PubMedGoogle Scholar
- Xu Q, Lee C: Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic acids research. 2003, 31 (19): 5635-5643. 10.1093/nar/gkg786.PubMed CentralView ArticlePubMedGoogle Scholar
- Hui L, Zhang X, Wu X, Lin Z, Wang Q, Li Y, Hu G: Identification of alternatively spliced mRNA variants related to cancers by genome-wide ESTs alignment. Oncogene. 2004, 23 (17): 3013-3023. 10.1038/sj.onc.1207362.View ArticlePubMedGoogle Scholar
- Kirschbaum-Slager N, Parmigiani RB, Camargo AA, de Souza SJ: Identification of human exons overexpressed in tumors through the use of genome and expressed sequence data. Physiol Genomics. 2005, 21 (3): 423-432. 10.1152/physiolgenomics.00237.2004.View ArticlePubMedGoogle Scholar
- He C, Zhou F, Zuo Z, Cheng H, Zhou R: A global view of cancer-specific transcript variants by subtractive transcriptome-wide analysis. PLoS One. 2009, 4 (3): e4732-10.1371/journal.pone.0004732.PubMed CentralView ArticlePubMedGoogle Scholar
- Valletti A, Anselmo A, Mangiulli M, Boria I, Mignone F, Merla G, D'Angelo V, Tullo A, Sbisa E, D'Erchia AM, et al: Identification of tumor-associated cassette exons in human cancer through EST-based computational prediction and experimental validation. Mol Cancer. 2010, 9: 230-10.1186/1476-4598-9-230.PubMed CentralView ArticlePubMedGoogle Scholar
- Rucker G, Schumacher M: Simpson's paradox visualized: the example of the rosiglitazone meta-analysis. BMC Med Res Methodol. 2008, 8: 34-10.1186/1471-2288-8-34.PubMed CentralView ArticlePubMedGoogle Scholar
- Mantel N, Haenszel W: Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute. 1959, 22 (4): 719-748.PubMedGoogle Scholar
- Boguski MS, Lowe TM, Tolstoshev CM: dbEST--database for "expressed sequence tags". Nature genetics. 1993, 4 (4): 332-333. 10.1038/ng0893-332.View ArticlePubMedGoogle Scholar
- Dias Neto E, Correa RG, Verjovski-Almeida S, Briones MR, Nagai MA, da Silva W, Zago MA, Bordin S, Costa FF, Goldman GH, et al: Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. Proceedings of the National Academy of Sciences of the United States of America. 2000, 97 (7): 3491-3496. 10.1073/pnas.97.7.3491.PubMed CentralView ArticlePubMedGoogle Scholar
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic acids research. 2009, D412-416. 37 DatabaseGoogle Scholar
- Pruitt KD, Tatusova T, Klimke W, Maglott DR: NCBI Reference Sequences: current status, policy and new initiatives. Nucleic acids research. 2009, D32-36. 37 DatabaseGoogle Scholar
- NCBI RefSeq FTP. [ftp://ftp.ncbi.nih.gov/refseq/release/vertebrate_mammalian]
- RepeatMasker Open-3.0.1996-2010. [http://www.repeatmasker.org]
- CGAP download site. [http://cgap.nci.nih.gov/Info/CGAPDownload]
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome research. 2002, 12 (4): 656-664.PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002, 30 (1): 207-210. 10.1093/nar/30.1.207.PubMed CentralView ArticlePubMedGoogle Scholar
- Bolstad B: Low Level Analysis of High-density Oligonucleotide Array Data: Background, Normalization and Summarization. 2004, University of California, BerkeleyGoogle Scholar
- Gautier L, Cope L, Bolstad BM, Irizarry RA: affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307-315. 10.1093/bioinformatics/btg405.View ArticlePubMedGoogle Scholar
- Sean D, Meltzer PS: GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007, 23 (14): 1846-1847. 10.1093/bioinformatics/btm254.View ArticleGoogle Scholar
- Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004, 3: Article3-PubMedGoogle Scholar
- Magrane M, Consortium U: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011, bar009Google Scholar
- Chen Y, Zhang Y, Yin Y, Gao G, Li S, Jiang Y, Gu X, Luo J: SPD--a web-based secreted protein database. Nucleic acids research. 2005, D169-173. 33 DatabaseGoogle Scholar
- Tusnady GE, Kalmar L, Simon I: TOPDB: topology data bank of transmembrane proteins. Nucleic acids research. 2008, D234-239. 36 DatabaseGoogle Scholar
- Sprenger J, Lynn Fink J, Karunaratne S, Hanson K, Hamilton NA, Teasdale RD: LOCATE: a mammalian protein subcellular localization database. Nucleic acids research. 2008, D230-233. 36 DatabaseGoogle Scholar
- Tusnady GE, Dosztanyi Z, Simon I: Transmembrane proteins in the Protein Data Bank: identification and classification. Bioinformatics. 2004, 20 (17): 2964-2972. 10.1093/bioinformatics/bth340.View ArticlePubMedGoogle Scholar
- Tusnady GE, Dosztanyi Z, Simon I: PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic acids research. 2005, D275-278. 33 DatabaseGoogle Scholar
- Lomize MA, Lomize AL, Pogozheva ID, Mosberg HI: OPM: orientations of proteins in membranes database. Bioinformatics. 2006, 22 (5): 623-625. 10.1093/bioinformatics/btk023.View ArticlePubMedGoogle Scholar
- Raman P, Cherezov V, Caffrey M: The Membrane Protein Data Bank. Cell Mol Life Sci. 2006, 63 (1): 36-51. 10.1007/s00018-005-5350-6.PubMed CentralView ArticlePubMedGoogle Scholar
- GeneGo. [http://www.genego.com/]
- Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T: Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011, 27 (3): 431-432. 10.1093/bioinformatics/btq675.PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson K: On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5. 1900, 50 (302): 157-175. 10.1080/14786440009463897.View ArticleGoogle Scholar
- Greenland S: Interpretation and estimation of summary ratios under heterogeneity. Statistics in medicine. 1982, 1 (3): 217-227. 10.1002/sim.4780010304.View ArticlePubMedGoogle Scholar
- McDonald JH: Handbook of Biological Statistics. 2009, Sparky House Publishing, Baltimore, Maryland, 2ndGoogle Scholar
- Kuritz SJ, Landis JR, Koch GG: A general overview of Mantel-Haenszel methods: applications and recent developments. Annual review of public health. 1988, 9: 123-160. 10.1146/annurev.pu.09.050188.001011.View ArticlePubMedGoogle Scholar
- Wu CC, Hsu CW, Chen CD, Yu CJ, Chang KP, Tai DI, Liu HP, Su WH, Chang YS, Yu JS: Candidate serological biomarkers for cancer identified from the secretomes of 23 cancer cell lines and the human protein atlas. Molecular & cellular proteomics : MCP. 2010, 9 (6): 1100-1117. 10.1074/mcp.M900398-MCP200.PubMed CentralView ArticlePubMedGoogle Scholar
- Sahin U, Koslowski M, Dhaene K, Usener D, Brandenburg G, Seitz G, Huber C, Tureci O: Claudin-18 splice variant 2 is a pan-cancer target suitable for therapeutic antibody development. Clinical cancer research : an official journal of the American Association for Cancer Research. 2008, 14 (23): 7624-7634. 10.1158/1078-0432.CCR-08-1547.View ArticleGoogle Scholar
- Liu Y, Carson-Walter EB, Cooper A, Winans BN, Johnson MD, Walter KA: Vascular gene expression patterns are conserved in primary and metastatic brain tumors. Journal of neuro-oncology. 2010, 99 (1): 13-24. 10.1007/s11060-009-0105-0.PubMed CentralView ArticlePubMedGoogle Scholar
- Gramann M, Wendler O, Haeberle L, Schick B: Expression of collagen types I, II and III in juvenile angiofibromas. Cells, tissues, organs. 2009, 189 (6): 403-409. 10.1159/000158754.View ArticlePubMedGoogle Scholar
- Yagyu R, Furukawa Y, Lin YM, Shimokawa T, Yamamura T, Nakamura Y: A novel oncoprotein RNF43 functions in an autocrine manner in colorectal cancer. International journal of oncology. 2004, 25 (5): 1343-1348.PubMedGoogle Scholar
- Mok SC, Bonome T, Vathipadiekal V, Bell A, Johnson ME, Wong KK, Park DC, Hao K, Yip DK, Donninger H, et al: A gene signature predictive for outcome in advanced ovarian cancer identifies a survival factor: microfibril-associated glycoprotein 2. Cancer cell. 2009, 16 (6): 521-532. 10.1016/j.ccr.2009.10.018.PubMed CentralView ArticlePubMedGoogle Scholar
- Bowen NJ, Walker LD, Matyunina LV, Logani S, Totten KA, Benigno BB, McDonald JF: Gene expression profiling supports the hypothesis that human ovarian surface epithelia are multipotent and capable of serving as ovarian cancer initiating cells. BMC medical genomics. 2009, 2: 71-10.1186/1755-8794-2-71.PubMed CentralView ArticlePubMedGoogle Scholar
- Badea L, Herlea V, Dima SO, Dumitrascu T, Popescu I: Combined gene expression analysis of whole-tissue and microdissected pancreatic ductal adenocarcinoma identifies genes specifically overexpressed in tumor epithelia. Hepato-gastroenterology. 2008, 55 (88): 2016-2027.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.