CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data
© Shoemaker et al.; licensee BioMed Central Ltd. 2012
Received: 9 December 2011
Accepted: 31 August 2012
Published: 6 September 2012
Interpreting in vivo sampled microarray data is often complicated by changes in the cell population demographics. To put gene expression into its proper biological context, it is necessary to distinguish differential gene transcription from artificial gene expression induced by changes in the cellular demographics.
CTen (c ell t ype en richment) is a web-based analytical tool which uses our highly expressed, cell specific (HECS) gene database to identify enriched cell types in heterogeneous microarray data. The web interface is designed for differential expression and gene clustering studies, and the enrichment results are presented as heatmaps or downloadable text files.
In this work, we use an independent, cell-specific gene expression data set to assess CTen's performance in accurately identifying the appropriate cell type and provide insight into the suggested level of enrichment to optimally minimize the number of false discoveries. We show that CTen, when applied to microarray data developed from infected lung tissue, can correctly identify the cell signatures of key lymphocytes in a highly heterogeneous environment and compare its performance to another popular bioinformatics tool. Furthermore, we discuss the strong implications cell type enrichment has in the design of effective microarray workflow strategies and show that, by combining CTen with gene expression clustering, we may be able to determine the relative changes in the number of key cell types.
CTen is available at http://www.influenza-x.org/~jshoemaker/cten/
Several bioinformatics tools exist to identify the cause and effect of changes in gene regulation, with gene set enrichment analysis (GSEA)  and gene ontology (GO) enrichment  being the most popular, and there are several other web-based platforms with improvements or variations of these analytical tools [6–8]. GSEA relies on a database of reference gene lists which were previously determined to be regulated under several conditions (e.g., by transcription factors, chemical and genetic perturbations, or between healthy and diseased states). GSEA determines which reference list - if any - has statistically significant, concordant regulation. Although very useful for linking gene expression to specific transcription factors or identifying similarities between diseases, this tool does not include cell specific data at this time. The other popular alternative, GO, relies on a controlled vocabulary to describe the biological role of genes and their products. It is often accurate in predicting the local phenotype from gene expression data (e.g., inflammation annotations are highly enriched in samples from inflamed tissue ). However, cell specific GO annotations are often overwhelmed by more ubiquitous terms in the GO annotation hierarchy.
Additionally, some algorithms exist to unmix cellularly heterogeneous gene expression data into expression profiles for each cell type [10, 11] but generally either the number of cell types must be known a priori or cell counts must be determined. The Gene Expression Barcode  and BioGPS  web platforms provide tissue specific gene expression data and allow researchers to compare gene expression between different tissues in their databases. However, these tools do not provide a means to relate user-generated sets of differentially expressed genes to specific cell types. Hence, to facilitate the proper interpretation of genomic regulation from in vivo microarray data, we developed CTen to determine if the observed gene regulation is the result of changes in the cellular make up of the sample.
Two principles guided the development of our highly expressed cell-specific (HECS) gene database and the CTen website's interface. First, basal gene expression levels strongly differ between cell types [3, 14]. By analyzing gene expression across several cell types and tissues, we can select genes with very high expression in a limited number of cell types. In turn, each cell type has a collection of HECS genes to act as a cell-specific signature. Thus for any user generated list of genes, we can determine if the number of HECS genes for a particular cell type is greater than the number expected by chance.
The second principle, which led us to optimize CTen's interface for gene expression clustering studies, is the observation that changes in messenger RNA levels due to cell migration or variances in sample collection techniques result in conserved expression patterns in microarray data. Several clustering strategies, including hierarchical clustering and the weighted gene corregulation network algorithm (WGCNA) , have been developed to identify gene expression patterns which are conserved temporally or across experimental groups. By combining clustering with cell type enrichment, CTen can address a major challenge in biology today; namely separating gene expression from cellularly heterogeneous RNA samples into clusters representing differential transcriptional activity and clusters representing changes in gene expression due to cell migration.
Here, we first describe the construction of the HECS database and discuss the workflow behind the CTen website's design. We then validate CTen’s ability to correctly identify the appropriate cellular signature and evaluate the benefits of users requiring increasingly strict enrichment scores. We motivate the use of CTen using genes differentially expressed in the lungs of mice infected with influenza virus, and, lastly, provide an illustrative example promoting the use of CTen for detecting changes in the cellular demographics and the critical role this plays in functional enrichment and gene network inference studies.
The HECS database construction
We downloaded from BioGPS  gene expression data from 96 mouse and 84 human tissues/cell types (Mouse MOE430 Gene Atlas and Human U133A/GNF1H Gene Atlas; a complete list of all cell types used is available in the Additional file 1). The expression values were averaged over the biological replicates (2 per cell type) and, for each cell type, a transcript was identified as a HECS gene if one of its corresponding probes had an expression value (averaged over the 2 replicates) at least 15x or 10x greater than the median expression value of the probe for all cell types in the mouse and human datasets, respectively. Next, probe identifiers were matched to their Entrez Gene IDs and official gene symbols using the Affymetrix Mouse Genome 430 2.0 Array (mouse4302 version 2.5.0) and Affymetrix Human Genome U133 Set (hgu133a version 2.5.0) annotation files available from Bioconductor . The final step was to remove redundant Entrez Gene IDs assigned as HECS genes to the same cell type (due to multiple probes mapping to the same gene). The CTen database is available for download under the "Database Info" tab on the CTen website.
At the cutoff values selected (emphasized in Figure 2A-B in blue (mouse) and orange (human)), even when applying a more stringent expression requirement, the number of HECS genes per cell type remains significantly higher in the mouse data (average number of HECS gene per cell = 794 in mouse and only 351 for human derived cell types). In terms of uniqueness of the HECS genes (emphasized in Figure 2C-D in blue (mouse) and orange (human)), we find that 55.8% of human HECS genes are exclusive to 3 or fewer cell types, while 53.3% of mouse HECS genes are limited to 4 or fewer cell types.
We emphasize that for both the mouse and human HECS databases, for values greater than 10x the median gene expression, the number of HECS genes per cell type and the identity of the HECS genes do not change significantly. Thus, cutoff selection within the ranges considered should not strongly bias any results from enrichment analysis. We validated this by showing that CTen's performance was independent of the precise threshold selected. We reconstructed the HECS databases for several different thresholds and then used the receiver operating characteristic (ROC) curve to determine if changing the threshold affected CTen's performance in terms of the true positive rate versus the false positive rate. As seen in Additional file 2 and Additional file 3, CTen's performance is robust to the precise threshold used for developing the HECS gene databases.
The HECS genes are highly unique to each cell type
Data preprocessing and calculating the enrichment score
Using the one-sided, Fisher's Exact test for enrichment, the enrichment score returned from CTen is the -log10 of the Benjamini-Hochberg (BH) adjusted P-values (all calculations are performed in R ). Although the enrichment score is a statistic in origin (indeed the enrichment scores could be used to control the false discovery rate), we advise users to consider the enrichment score to be a ranking and to not apply a strictly statistical understanding of the number. This is due to the sensitivity of the score to the size of the gene list being analyzed, and we show in detail in the Results and Discussion that ranking the results allows for easier interpretation. The appropriate contingency tables are constructed using the intersection of the user list and the HECS genes for each cell type. The gene universe (or gene background) against which the enrichment is calculated is currently defined to be all of the genes annotated in the human or mouse arrays defined above. Importantly, the enrichment scores for each gene list are calculated separately.
When only a single list is processed, a radar map of the enrichment scores is produced but in the case of multiple gene lists being supplied, P-values between each list cannot be compared since the length of the gene lists differ. So we developed a "weighted-ranking" strategy in which the enrichment scores for the 10 most enriched cell types in each list are scaled by the maximum enrichment score for that list. The enrichment scores of cell types either not present in the top 10 or present in the top 10 but with enrichment score of less than 2 are excluded. This procedure selects only the most enriched cell types for each list and allows us to visualize whether the enrichment scores of the top cell types were similar or if one cell type's enrichment score was dominant. The influenza-infected lung tissue example and the advanced use-case in the Results and Discussion illustrate CTen's output for single and multi-list analyses.
Finally, for both single- and multi-lists analyses, the final enrichment scores for all cell types can be downloaded for further processing by the users.
Results and discussion
CTen correctly identifies cell types
To assess CTen’s ability to identify the correct cell type associated with gene expression data, we used an independent database of cell-specific gene expression (GNF1M_plus_macrophage_small dataset from BioGPS; abbreviated GNFM1) to develop several lists of genes which were highly expressed in select cell types. This data set is an interesting test case for CTen because the differences in the experimental protocol tests CTen's performance when using different microarray technologies and biological conditions. In the GNF1M experiment, they used mice which were ~2 weeks older (compared to the mice used to develop the Mouse MOE430 Gene Atlas data set), used a different ratio of male and female mice, and employed custom microarray slides (GPL1037) . For several cell types (5 tissues and 3 lymphocytes; 2 lymphocytes in different cellular states), we selected the top 5% of the most highly expressed probes. Entrez Gene IDs were mapped using the annotation files available from BioGPS, and the resulting lists analyzed in CTen.
Ranking of the enrichment scores are robust
As with any analysis, small changes in experiment parameters should not greatly change the outcome. P-values from the Fisher Exact test are very sensitive to changes in the size of the gene list, but for many enrichment analyses, it has been observed that the rankings of the enrichment scores are very robust [7, 18]. Here, we asked if CTen could robustly rank the correct cell types by repeating the procedure described above - now using a list of the top 2, 3,…, 10% most highly expressed genes for the selected tissues and lymphocytes, resulting in 90 test lists. The different sizes of the lists simulate different differential expression criteria during gene expression analysis. As shown in Figure 5C, although the sizes of the gene lists (and the underlying enrichment scores) vary considerably, CTen most often ranks the appropriate cell type the most highly. CTen was also able to identify the proper cell state of the lymphocytes as well although unstimulated macrophage data was assigned to bone marrow macrophages collected 6 h after exposure to lipopolysaccharide (LPS) 4 out of 9 times. CTen performed even better for the tissue data, always ranking the appropriate tissue the highest ( Additional file 5). In all, CTen can accurately identify a broad range of cell types and very often identify the cellular state as well. The results are very robust to changes in the length of the test data, which can be equated to changes in the cutoff criteria used during microarray analysis.
Minimizing the false positive rate
CTen versus GO analysis of influenza infected lung tissue
DAVID uses modules of related biological terms to interpret large gene lists into a meaningful biological context and reports the scores of each module as the -log10 of the average P-value for each term within the module . Using the default settings, DAVID identifies the Toll-like receptor pathway (Figure 7B, Cluster #1) as the most significant cluster of annotations (Enrichment score: 12.62; full results available in Additional file 7). However, the clusters indicating enhanced macrophage presence have a low significance (Cluster #29; enrichment score: 1.74) and are very closely followed by a T-cell related cluster (Enrichment score: 1.68). Taken together, these results indicate that although both analyses can identify aspects of the cellular state of the sample, CTen is better suited to identify the known changes in the cellular demographics of the RNA samples.
Advanced use-case: distinguishing changes in lymphocyte cell count from gene transcription
The most exciting potential of CTen is that, when applied to clustering studies, cell type enrichment analysis can be used to approximate the evolution of local cellular demographics. Our laboratory's research is primarily focused on reconstructing the host response during an influenza infection ; a goal which requires us to be able to integrate local intracellular signaling (Toll-like receptor/RIG-I/NFκB pathways) with the coordinated migration, infiltration, and activity of macrophages, T-cells, B-cells, and other immune related cell-types. Being able to resolve the various cell types present in a sample from microarray data would greatly facilitate discovery in a broad range of in vivo studies.
In this case, we are illustrating potential results from using the WGCNA package , which applies color labels to each cluster. The genes for each cluster can be uploaded and analyzed in one session to identify the most enriched cell types in each cluster. In Figure 8C, we find that macrophages are highly enriched in the dark red cluster while several categories of B- and T-cells (CD8+ T-cells) are the most enriched cell types in the green and black clusters, respectively. Interestingly, the orange cluster is not enriched for any cell type, and we would conclude that transcripts in the orange cluster represent differential gene expression due to transcriptional differences between the samples (as opposed to difference in the cellular makeup of the samples) and are suitable for further analysis using traditional approaches. The dark red, green and black clusters can be further analyzed for pathway or functional enrichment to identify processes that may be coordinated with cell migration. This result may also help researchers decide the appropriateness of additional analyses. Some analyses, such as gene network inference, will have to carefully consider how to remove the effects of cell migration prior to network construction. Furthermore, the green, black and dark red clusters' gene expression is highly correlated to the corresponding lymphocyte's cell count change. Thus, we may be able to infer the relative changes in the B cell, T cell and macrophage count in the infected tissue.
In all, this example illustrates how CTen has been designed to facilitate the understanding of clustering results by identifying conserved expression patterns that are the result of changes in the numbers of a particular cell type, providing critical guidance for selecting additional analyses for each gene set and allowing users to infer changes in cellular demographics between samples. Based on the CTen enrichment platform, we propose a novel analytical workflow for in vivo microarray, as illustrated schematically in Figure 8D, which ensures that enriched biological pathways and processes identified in a set of differentially expressed genes can be interpreted in the proper cellular context.
In conclusion, CTen can effectively distinguish between active gene transcription and apparent gene expression resulting from differences in the numbers of select cell types in microarray data. Furthermore, we provide a novel research workflow which helps to ensure that gene expression is interpreted in the proper biological context. We will continuously improve the enrichment algorithm so that a larger number of gene lists can be processed simultaneously (currently, users are limited to 20 gene lists in a single session). Recently, a gene set enrichment analysis based on the degree of pairwise correlation within a given gene set was shown to successfully relate samples to their corresponding tissue . No simple interface is available yet for researchers, but it will be interesting to compare the performance between these two approaches in the near future. Additionally, we plan to introduce additional cell specific gene expression datasets so users can compare the results from different databases. And finally, while the examples focused on lymphocyte migration, CTen can be used in several other scenarios; for example, comparing excised tissue to ensure homogeneity between tissue samples.
Availability and requirements
Project name: CTen
Project home page: http://www.influenza-x.org/~jshoemaker/cten/
Operating system: Platform independent
Programming Language: PHP and R
Other requirements: None
This work was supported by the Japanese Science and Technology Agency's ERATO influenza induced host responses project.
- Hacia JG, Fan JB, Ryder O, Jin L, Edgemon K, Ghandour G, Mayer RA, Sun B, Hsie L, Robbins CM, et al: Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat Genet. 1999, 22 (2): 164-167. 10.1038/9674.View ArticlePubMedGoogle Scholar
- Shao L, Fan X, Cheng N, Wu L, Xiong H, Fang H, Ding D, Shi L, Cheng Y, Tong W: Shifting from population-wide to personalized cancer prognosis with microarrays. PLoS One. 2012, 7 (1): e29534-10.1371/journal.pone.0029534.PubMed CentralView ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA. 2004, 101 (16): 6062-6067. 10.1073/pnas.0400782101.PubMed CentralView ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29.PubMedGoogle Scholar
- Chang JT, Nevins JR: GATHER: a systems approach to interpreting genomic signatures. Bioinformatics. 2006, 22 (23): 2926-2933. 10.1093/bioinformatics/btl483.View ArticlePubMedGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009, 4 (1): 44-57.View ArticlePubMedGoogle Scholar
- Kaimal V, Bardes EE, Tabar SC, Jegga AG, Aronow BJ: ToppCluster: a multiple gene list feature analyzer for comparative enrichment clustering and network-based dissection of biological systems. Nucleic Acids Res. 2010, 38: W96-102. 10.1093/nar/gkq418.PubMed CentralView ArticlePubMedGoogle Scholar
- Baas T, Baskin CR, Diamond DL, Garcia-Sastre A, Bielefeldt-Ohmann H, Tumpey TM, Thomas MJ, Carter VS, Teal TH, Van Hoeven N, et al: Integrated molecular signature of disease: analysis of influenza virus-infected macaques through functional genomics and proteomics. J Virol. 2006, 80 (21): 10813-10828. 10.1128/JVI.00851-06.PubMed CentralView ArticlePubMedGoogle Scholar
- Schwartz R, Shackney SE: Applying unmixing to gene expression data for tumor phylogeny inference. BMC Bioinforma. 2010, 11: 42-10.1186/1471-2105-11-42.View ArticleGoogle Scholar
- Shen-Orr SS, Tibshirani R, Khatri P, Bodian DL, Staedtler F, Perry NM, Hastie T, Sarwal MM, Davis MM, Butte AJ: Cell type-specific gene expression differences in complex tissues. Nat Methods. 2010, 7 (4): 287-289. 10.1038/nmeth.1439.PubMed CentralView ArticlePubMedGoogle Scholar
- McCall MN, Uppal K, Jaffee HA, Zilliox MJ, Irizarry RA: The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Res. 2011, 39: D1011-1015. 10.1093/nar/gkq1259. Database issuePubMed CentralView ArticlePubMedGoogle Scholar
- Wu C, Orozco C, Boyer J, Leglise M, Goodale J, Batalov S, Hodge CL, Haase J, Janes J, Huss JW, et al: BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. 2009, 10 (11): R130-10.1186/gb-2009-10-11-r130.PubMed CentralView ArticlePubMedGoogle Scholar
- Eisenberg E, Levanon EY: Human housekeeping genes are compact. Trends Genet. 2003, 19 (7): 362-365. 10.1016/S0168-9525(03)00140-9.View ArticlePubMedGoogle Scholar
- Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis. BMC Bioinforma. 2008, 9: 559-10.1186/1471-2105-9-559.View ArticleGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R80-10.1186/gb-2004-5-10-r80.PubMed CentralView ArticlePubMedGoogle Scholar
- Team RDC: R: A Language and Environment for Statistical Computing. 2010, R Foundation for Statistical Computing, Vienna, AustriaGoogle Scholar
- da Huang W, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37 (1): 1-13. 10.1093/nar/gkn923.View ArticlePubMedGoogle Scholar
- Medzhitov R: Recognition of microorganisms and activation of the immune response. Nature. 2007, 449 (7164): 819-826. 10.1038/nature06246.View ArticlePubMedGoogle Scholar
- Aderem A, Ulevitch RJ: Toll-like receptors in the induction of the innate immune response. Nature. 2000, 406 (6797): 782-787. 10.1038/35021228.View ArticlePubMedGoogle Scholar
- Sun L, Liu S, Chen ZJ: SnapShot: pathways of antiviral innate immunity. Cell. 2010, 140 (3): 436-436. 10.1016/j.cell.2010.01.041. e432PubMed CentralView ArticlePubMedGoogle Scholar
- Yu WC, Chan RW, Wang J, Travanty EA, Nicholls JM, Peiris JS, Mason RJ, Chan MC: Viral replication and innate host responses in primary human alveolar epithelial cells and alveolar macrophages infected with influenza H5N1 and H1N1 viruses. J Virol. 2011, 85 (14): 6844-6855. 10.1128/JVI.02200-10.PubMed CentralView ArticlePubMedGoogle Scholar
- Reading PC, Whitney PG, Pickett DL, Tate MD, Brooks AG: Influenza viruses differ in ability to infect macrophages and to induce a local inflammatory response following intraperitoneal injection of mice. Immunol Cell Biol. 2010, 88 (6): 641-650. 10.1038/icb.2010.11.View ArticlePubMedGoogle Scholar
- da Huang W, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA: The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007, 8 (9): R183-10.1186/gb-2007-8-9-r183.View ArticlePubMedGoogle Scholar
- Fukuyama S, Kawaoka Y: The pathogenesis of influenza virus infections: the contributions of virus and host factors. Curr Opin Immunol. 2011, 23 (4): 481-486. 10.1016/j.coi.2011.07.016.PubMed CentralView ArticlePubMedGoogle Scholar
- Chang JT: Deriving transcriptional programs and functional processes from gene expression databases. Bioinformatics. 2012, 28 (8): 1122-1129. 10.1093/bioinformatics/bts112.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.