The genetic and molecular basis for many intermediate and end stage phenotypes in model systems such as C. elegans and D. melanogaster has long been known to involve pleiotropic effects and complex multigenic interactions. Gene sets are groups of genes that contribute to multiple biological or molecular phenomena. They have been used in the analysis of large molecular datasets such as microarray data, Next Generation sequencing, and other genomic datasets to reveal pleiotropic and multigenic contributions to phenotypic outcomes. Many model systems lack species specific organized phenotype based gene sets to enable high throughput analysis of large molecular datasets.
Results and discussion
Here, we describe two novel collections of gene sets in C. elegans and D. melanogaster that are based exclusively on genetically determined phenotypes and use a controlled phenotypic ontology. We use these collections to build genome-wide models of thousands of defined phenotypes in both model species. In addition, we demonstrate the utility of these gene sets in systems analysis and in analysis of gene expression-based molecular datasets and show how they are useful in analysis of genomic datasets connecting multigenic gene inputs to complex phenotypes.
Phenotypic based gene sets in both C. elegans and D. melanogaster are developed, characterized, and shown to be useful in the analysis of large scale species-specific genomic datasets. These phenotypic gene set collections will contribute to the understanding of complex phenotypic outcomes in these model systems.
Traditional experimentation in animal model systems such as the worm Caenorhabditis elegans and the fly Drosophila melanogaster often results in complex molecular and phenotypic outcomes. Frequently a targeted deletion or ectopic expression of a single gene product results in pleiotropic phenotypes. Similarly, broad high-throughput multiplex experimental strategies such as microarray based gene expression, RNA interference (RNAi) screens, or next-generation DNA and RNA sequencing, analyzing phenomena such as development, behavior, mating, diet, and life span, typically produce large datasets requiring complex analytical approaches.
Gene sets are collections of keyword terms with annotated genes derived from multiple sources of a priori information. They have been used in computational analysis of gene expression data [1–3] with the goal of identifying higher order relationships beyond simple gene list results, as well as in analysis of population based GWAS in humans [4, 5]. The most commonly used gene sets include those derived from GO annotations , biological pathways from KEGG  or BioCarta, expression modules, DNA binding sites, or other sources of molecular information [1, 3, 8]. Each collection of gene sets has its own unique qualities and features which are useful in different ways. For instance, KEGG emphasizes metabolic and biochemical pathways; GO annotations, while having some phenotypic content, emphasizes molecular function, cellular component, and biological processes, while MSigDB  emphasizes gene expression signatures. This information is often closely related, or “proximal” to gene and molecular function, rather than more “distal” information regarding phenotypic outcomes and disease susceptibility. Recently, phenotype based gene sets have been derived exclusively from genetically determined phenotypic associations for mouse phenotypes and common human disease [9, 10], resulting in gene sets for specific phenotypes, organized by a structured systematic ontology.
Here, we present gene sets for worm and fly, which use the structured ontology found in the Worm Phenotype Ontology from the C. elegans database - WormBase  and phenotypic descriptions for D. melanogaster found in FlyBase . These gene sets are derived from information on gene-phenotype relationships based on genetically determined phenotypes. We use these collections in large scale phenotypic modeling in worms and flies and demonstrate their utility in complex analysis in multiple ways, including analysis of gene expression datasets representing complex phenotypic and biological phenomena in both C. elegans and D. melanogaster. In this way, we integrate large scale genome analysis with large scale phenotypic analysis in these two model systems.
Derivation of worm gene sets
The worm gene sets presented here are derived from two lists of genes and assigned phenotypes provided by Gary Schindelman and Paul Sternberg as a component of the Worm Phenotype Ontology . These two lists originated from information curated from RNAi experiments and genetic variations (VAR) as archived in WormBase .
Two worm gene set files (CE- RNAi-GS and CE-VAR-GS) were produced by parsing each gene list separately into non-redundant lists of unique phenotypic terms with all genes assigned to their corresponding phenotypic terms. This produced two non-redundant gene set files containing 850 and 1109 gene sets for RNAi and VAR, respectively. In addition, we developed a master worm file by combining the original RNAi and VAR gene lists into a combined file (CE-Combined-GS) containing 1,385 non-redundant phenotypes and their associated gene sets.
Derivation of fly gene sets
The Drosophila gene sets described here are derived from phenotypic data provided in FlyBase (see Methods). A file containing 259,162 phenotypic descriptions with assigned Drosophila genes was collapsed and parsed resulting in a non-redundant gene set file of 11,999 unique phenotypic terms with annotated genes. This file named DM-narrow-GS was used for systems biology and gene expression analysis.
Table 1 shows representative examples of individual gene sets from the C. elegans and D. melanogaster gene set files. Official gene symbols are shown where available, locus tags (C. elegans) where gene symbols are not available. As in other gene set collections, as the number of genes in any given gene set decreases, the phenotypes progress from broad categories to more specific phenotypic descriptors. The full gene set lists consist of a wide range of developmental, structural, metabolic and behavioral phenotypes, representing a large majority of the experimentally determined phenotypes found in worms and flies. They range from broad phenotype categories such as “sterile”, “slow_growth”, or “larval_arrest” in worms and “viable”, “lethal” and “fertile” in flies; to narrow phenotypic descriptors such as “flaccid”, “DNA_synthesis_variant” or “no_posterior_pharynx” in worms and “ejaculatory_bulb”, “dorsal_vessel_primordium”, or “dense_body” in flies. In addition, there is often overlap of the genes found in related gene sets in both species, emphasizing the contributions of the same genes to multiple phenotypic traits. The complete C. elegans (Additional file 1: Table S1: Additional file 2: Table S2: Additional file 3: Table S3) and D. melanogaster (Additional file 4: Table S4) gene set files are available at this address http://www.grc.nia.nih.gov/branches/rrb/dna/index/Worm-fly_gene_sets_5-9-12.html.
General uses of phenotype based gene sets in both worm and fly
As described here, a single gene set is essentially a single phenotypic term followed by a single row of genes that have been associated with that phenotype. A collection of gene sets consists of a list of phenotypic terms with their corresponding gene sets. Gene sets can be used individually, as a collection, or compared across collections in a number of ways including network analysis, genome-wide model representations, hierarchical clustering, gene set analysis (GSA) of microarray data, and principal component analysis (PCA) of gene set values; among others. A property of this collection of gene sets is that they describe complex intermediate and end stage phenotypes as opposed to molecular function or lists of coordinately regulated genes. They can be used in a variety of bioinformatics applications to reveal higher order or emergent biological and phenotypic relationships and to provide insight into the biological relevance of complex molecular datasets.
Each individual gene set can be used to build networks to determine transcriptional regulation or protein-protein interactions. Figure 1 is a representative network of six networks showing regulatory relationships analyzed by Ingenuity Pathway Analysis (IPA) (Ingenuity® Systems, http://www.ingenuity.com) from a single 169 gene, C. elegans gene set, “life span variant”, found in the worm CE-Combined-GS 7-28-2011 file. This analysis identifies members of the gene set (shaded) as well as regulatory or transcriptional partners not found (unshaded) in the original gene set. This network highlights the central role of insulin, ERK family members, and PI3 Kinase as important contributors to longevity in worms.
An example of a network showing regulatory relationships from a single 82 gene “long_lived” gene set, found in the fly gene set file (DM-narrow-GS 9-7-2011), is also shown in Figure 2. Like in the worm, insulin is central in this fly network, as well as ERKs, AKT, and histones, demonstrating significant overlap in age related biochemical pathways between worms and flies. Each individual gene set (one phenotype with one row of annotated genes) produces multiple network diagrams showing the transcriptional neighbors and protein-protein partners of the core genes, while the entire collection of thousands of gene sets would produce many thousands of individual networks relative to phenotypic descriptions.
Genome-wide phenotypic modeling in worms and flies
In addition to analysis of a single gene set, a collection of phenotypic gene sets can be compared to itself to reveal biological relationships between all members of the collection. Figure 3 shows a dendrogram of the combined C. elegans file (CE-Combined-GS), using gene sets, having three or more genes, compared to each other based on the degree of gene sharing between individual gene sets. The overall worm tree (Figure 3) is composed of eleven large branches enriched for related biological functions. Moreover, local relationships within a specific branch suggest functional relationships between closely spaced individual gene lists. For instance, in branch 2 (Additional file 5: Figure S1) cell cycle phenotypes such as “cell cycle timing”, “cell cycle delayed” and “cell cycle variant” are closely positioned in space and close to spindle assembly phenotypes. Likewise, in branch 6 (Additional file 6: Figure S2) Dauer phenotypes are closely aligned with multiple lifespan phenotypes based on individual gene sharing within their respective gene sets. Close apposition of related phenotypes as determined by gene sharing between gene sets is a pervasive feature of these dendrogram displays and represents overlap of related phenotypes being influenced by shared genes.
The Drosophila gene set collection also produced a similar complex dendrogram of phenotypic functional groups based on gene sharing between gene sets (Figure 4). Like the worm dendrogram, individual branches of the fly dendrogram display a functional relatedness within subregions in each branch. For example, chromosome related phenotypes are grouped in branch 2 (Additional file 7: Figure S3) with mitotic and meiotic phenotypes, including meiotic telophase phenotypes, being closely aligned to each other, as well as spermatid and spermatocyte phenotypes. Behavioral, neuronal, and sensory response phenotypes are shown closely aligned in branch 11 of Figure 4 (Additional file 8: Figure S4), demonstrating overlapping genetic control of related complex phenotypes.
Phenotype Gene Set Analysis (GSA) of microarray data and Principal Components Analysis (PCA) of gene sets
C. elegans: In addition to comparisons of gene sets either individually or collectively to themselves, these phenotype gene sets are useful in analysis of microarray based gene expression datasets in worm and fly. Figure 5a illustrates statistically significant gene sets resulting from Gene Set Analysis (GSA) of a single 4 day old larva versus 15 day old whole genome gene expression comparison in a C. elegans aging microarray dataset . This dataset (GEO # GSE21784) represents a 15 day time course with incremental stages of infection with P. areuginosa. Statistically significant up-regulated gene sets include germ cell gene groups, as well as meiosis and cell division gene sets, among others. Down-regulated gene sets include gene groups involved in body vacuoles, as well as alae and cuticle formation. Figure 5b is a heat map of the significant changes across the entire time course.
Figure 6 shows changes in selected gene sets from a different aging time course in C. elegans over 24 days  (GEO # GSE12290). Aging related increases (Figure 6a) or decreases (Figure 6b) in gene groups related to locomotion, energy metabolism, and life span are highlighted.
In addition to GSA of microarray data the gene set values derived from gene expression data can be further analyzed by principal components analysis (PCA) using the Z-score values of the original gene set data output. This is in contrast to more commonly described PCA resulting from individual gene expression values. Figure 7 shows tight grouping of individual biological samples within three groups; larvae, adult day 6, and adult day 15, and dramatic separation of time points within the experiment, based solely on PCA analysis of the gene sets values from the previous gene set analysis. This demonstrates that there is useful biological information content in the aggregate gene set results, in addition to that found in any individual gene set, which can discriminate between discrete biological states.
D. melanogaster: In a similar fashion to the worm (above), microarray data from young versus aging flies was analyzed with the Drosophila gene set file DM-narrow-GS containing 11,999 gene sets. Gene set analysis was performed using the WEB-PAGE gene set analysis tool  on a dataset of gene expression values from young versus old flies  (GEO# GSE22437). The top 100 statistically significant enriched gene sets using Z ratios of the expression values from day 10 versus day 40 fly heads is shown in Figure 8. Over enriched gene sets include minute phenotypes, life span, as well as developmental growth rate phenotypes, among others. The discriminative ability of PCA using gene set Z-scores (as opposed to individual gene values) is illustrated using the individual samples of day 10 versus day 40 fly heads in Figure 8.
Here we describe genome-wide phenotypic modeling using gene sets based on gene-phenotypic assignments in C. elegans and D. melanogaster. Unlike previous gene set collections such as KEGG, GO, MSigDB, in these and other species, every gene in every gene set described here is based on genetic evidence contributing to each specific phenotype. Although very useful, these gene sets should be considered a first generation. They may not be complete. Some may describe certain phenotypes in different developmental contexts, or in particular applications and not in others. In addition, many subtleties and details were not included in deriving these gene sets including penetrance of different alleles, strain differences, and environmental modifiers. Moreover, these gene sets may produce different results depending on the statistical algorithms used in complex analysis.
However, we have demonstrated these gene sets can be used to identify complex higher order biological and genetic relationships through network analysis, whole genome phenotypic modeling, and analysis of complex molecular datasets. They will help elucidate complex multigenic relationships between genes and phenotypes in worms and flies in many experimental and biological contexts and will provide a bridge for phenotypic comparisons between model and intermediate species.
Derivation of phenotypic gene sets
Phenotype-gene lists obtained from WormBase on 4/24/11 were titled RNAi and VAR. RNAi, consisted of 34,433 gene phenotype pairs having 7,289 unique genes and 850 unique phenotypes. These phenotypes were the results of observations of phenotypes from knockdown of the gene products (RNAi experiments). The list VAR contained 8,440 records, having 2,165 unique genes, and 1,109 unique phenotypes and was the result of observations of phenotypes from genetic mutations as deposited in WormBase. The overlap between each file consists of 1,410 genes and 237 phenotypes.
Phenotype gene set files were created by parsing the original gene lists into non-redundant phenotype lists with annotated genes using a custom Perl script as previously described . This was done for RNAi and VAR independently, as well as combined to create the gene set files; CE-RNAi-GS 7-26-11, CE-VAR-GS 7-26-11, and CE-Combined-GS 7-28-11. The resultant individual Phenotype Gene set names are identical to the Phenotype descriptors found in the original WormBase Phenotype file. These files can be downloaded here: http://www.grc.nia.nih.gov/branches/rrb/dna/index/Worm-fly_gene_sets_5-9-12.html.
It should be noted that nomenclature of many phenotype gene sets in both worm and fly often have a directionality in the name which may or may not be relevant to any given microarray or other analysis. Please see Additional file 9: S7 for an explanation of directionality in gene set nomenclature and interpretation in their use.
Networks for C. elegans and D. melanogaster were produced using Ingenuity Pathway Analysis (IPA) (Ingenuity® Systems, http://www.ingenuity.com). Using the “life_span_variant” gene set in C. elegans generated on 7-26-2011, and the “long_lived” gene set in D. melanogaster generated on 12-07-2011. The input and output files can be found here for C. elegans (Additional file 10: Table S5) and D. melanogaster (Additional file 11: Table S6).
Genome-wide phenotypic modeling
Genome-wide dendrograms were produced by a unique method similar to phylogenetic classification as previously described . Briefly, the distance between each phenotypic gene set was calculated by pairwise comparison of every gene set pair by finding the number of common genes between each pair and dividing that number by the number of genes in the smallest group of the pair, resulting in a correlation value between 1 and 0 for each pair. This was done for all gene sets to produce a distance matrix. This number was then subtracted from 1because if two lists are identical (100 % match) then the resultant distance should be 0. This is represented as:
when i ≠ j. If i = j then d = 0
Where: Ck: Genes in each disease set (where k = i,j) ; N(Ck): Number of genes in each disease set (where k = i,j) ; dij is the pairwise distance ; i,j: index of genes in each disease set where; i = 1,2,3,………,n ; j = 1,2,3,………,m.
The gene set relationships were calculated from the distance matrix using the Fitch program . It calculates the relationships based on the Fitch and Margoliash method of constructing the phylogenetic trees using the following formula (from the Phylip manual):
where D is the observed distance between gene sets i and j and d is the expected distance, computed as the sum of the lengths of the segments of the tree from gene set i to gene set j. The quantity n is the number of times each distance has been replicated. In simple cases n is taken to be one. If n is chosen more than 1, the distance is then assumed to be a mean of those replicates. The power P is what distinguished between the Fitch and Neighbor-Joining methods. For the Fitch- Margoliash method P is 2.0 and for Neighbor-Joining method it is 0.0. The resulting coefficient matrix file was displayed using the Phylodraw graphics program .
Gene set analysis
This analysis used the Disease/Phenotype WEB-PAGE GSA web tool using the PAGE algorithm  with the CE-Combined-GS gene set file excluding gene sets containing over 500 and less than 3 genes. Briefly, for each gene set a Z score was computed as, In which the phenotype index i = 1,2,…,K; where K is the total number of the disease phenotypes we included in our data set; nI is the number of genes in the sub-group of phenotype i in the current sample array; σA is the standard deviation of the current gene expression changes of the sample. Diff(i): is the difference between the mean value of gene expression changes in the subgroup disease phenotype (i) (GCI ) and the mean value of the gene expression changes on the whole sample (GCA) i.e. . The empirical p-value of the disease phenotype i changes is described by: in which Φ(x) is the standard normal distribution function with the variable as X = DIFFI/σ(DIFFFI). σ(DIFFI) is the standard deviation of the difference for gene expression changes between phenotype subgroup (i) and the whole array σI is the standard deviation of the average gene expression changes in the disease phenotype (i). NA is the total number of genes in the whole sample set. The plots were drawn with R-statistical programming language (R Development Core Team 2005) using either calculated or absolute z-score values.
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34 (3): 267-273. 10.1038/ng1180.
Zhang Y, De S, Garner JR, Smith K, Wang SA, Becker KG: Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC Med Genomics. 2010, 3: 1-10.1186/1755-8794-3-1.
De S, Zhang Y, Garner JR, Wang SA, Becker KG: Disease and phenotype gene set analysis of disease-based gene expression in mouse and human. Physiol Genomics. 2010, 42A (2): 162-167. 10.1152/physiolgenomics.00008.2010.
Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, Dela Cruz N, Duong A, Fang R: WormBase 2012: more genomes, more data, new website. Nucleic Acids Res. 2012, 40: 735-741. 10.1093/nar/gkr954. Database issue
Schindelman G, Fernandes JS, Bastiani CA, Yook K, Sternberg PW: Worm Phenotype Ontology: integrating phenotype data within and beyond the C. elegans community. BMC Bioinforma. 2011, 12: 32-10.1186/1471-2105-12-32.
Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De la Cruz N, Davis P, Duesbury M, Fang R: WormBase: a comprehensive resource for nematode research. Nucleic Acids Res. 2010, 38 (Database issue): 463-467.
The authors would like to thank Gary Schindelman and Paul Sternberg from WormBase for providing gene-phenotype files and Dr Elin Lehrmann for critical reading of the manuscript. This research was supported entirely by the Intramural Research Program of the NIH, National Institute on Aging.
Authors and Affiliations
Gene Expression and Genomics Unit, Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Biomedical Research Center, 251 Bayview Boulevard, Baltimore, MD, 21224, USA
Supriyo De, Yongqing Zhang & Kevin G Becker
Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY, 10461, USA
Catherine A Wolkow
Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Biomedical Research Center, 251 Bayview Boulevard, Baltimore, MD, 21224, USA
Image Informatics and Computational Biology Unit, Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Biomedical Research Center, 251 Bayview Boulevard, Baltimore, MD, 21224, USA
The authors declare that they have no competing interests.
SD participated in study design and implemented the graphing algorithm. YZ participated in study design and developed the primary gene set files for both species. CW, SZ, and IG provided biological insights into the relevance and applicability in both C. elegans and D melanogaster. KGB conceived the study design, participated in gene set development, ran analysis, and wrote the manuscript. All authors read and approved the final manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
De, S., Zhang, Y., Wolkow, C.A. et al. Genome-wide modeling of complex phenotypes in Caenorhabditis elegans and Drosophila melanogaster.
BMC Genomics14, 580 (2013). https://doi.org/10.1186/1471-2164-14-580