A survey of putative secreted and transmembrane proteins encoded in the C. elegans genome
© Suh and Hutter; licensee BioMed Central Ltd. 2012
Received: 19 January 2012
Accepted: 25 May 2012
Published: 23 July 2012
Almost half of the Caenorhabditis elegans genome encodes proteins with either a signal peptide or a transmembrane domain. Therefore a substantial fraction of the proteins are localized to membranes, reside in the secretory pathway or are secreted. While these proteins are of interest to a variety of researchers ranging from developmental biologists to immunologists, most secreted proteins have not yet been functionally characterized.
We grouped proteins containing a signal peptide or a transmembrane domain using various criteria including evolutionary origin, common domain organization and functional categories. We found that putative secreted proteins are enriched for small proteins and nematode-specific proteins. Many secreted proteins are predominantly expressed in specific life stages or in one of the two sexes suggesting stage- or sex-specific functions. More than a third of the putative secreted proteins are upregulated upon exposure to pathogens, indicating that a substantial fraction may have a role in immune response. Slightly more than half of the transmembrane proteins can be grouped into broad functional categories based on sequence similarity to proteins with known function. By far the largest groups are channels and transporters, various classes of enzymes and putative receptors with signaling function.
Our analysis provides an overview of all putative secreted and transmembrane proteins in C. elegans. This can serve as a basis for selecting groups of proteins for large-scale functional analysis using reverse genetic approaches.
More than ten years ago the draft sequence of the Caenorhabditis elegans genome was published. At that time the genome was predicted to contain approximately 19,000 genes. Gene predictions have been refined over the past decade and the majority of C. elegans genes have now been experimentally verified. Release 210 of WormBase (http://www.wormbase.org) contains 20,242 protein-coding genes, which is close to the initial estimate. Most vertebrate genomes contain a comparable number of genes, illustrating that anatomical complexity does not correlate well with the number of genes encoded in the genome. This raises the question of why an anatomically simple animal such as C. elegans has such a large number of genes. The C. elegans genome contains several large gene families, e.g. more than 1,300 genes encoding serpentine-type G-protein coupled receptors (putative chemoreceptors), 326 F-box proteins, 278 C-type lectins, 284 nuclear hormone receptors and more than 170 cuticular collagens. These and other families contain large nematode-specific subfamilies, but altogether they correspond to a small fraction of the genes encoded in the genome. The TreeFam project groups 16,866 C. elegans genes into 7,471 families, pointing to a wide variety of different genes present in the C. elegans genome. To date only a fraction of the C. elegans genes have been functionally characterized and we are still fairly ignorant of the role that the majority of genes play in the life of C. elegans.
The domain organization of a protein provides clues to its putative biochemical function. The biological function, however, is not immediately clear from the domain structure. Additional data, such as gene expression and phenotypes associated with mutations in the corresponding genes, provide further insight into potential biological functions. Although large-scale expression studies have been published [8, 9], we still lack a detailed knowledge of the expression of most genes. Large-scale projects to generate mutations in every C. elegans gene (reviewed in ) currently provide mutations in less than half of the genes. For most genes we are left with the domain structure of the protein as the primary (and often only) source of information for a putative function.
The large number of genes in the C. elegans genome poses a significant challenge for geneticists employing genome-scale approaches to identify genes of interest. For this study, we examined putative secreted and transmembrane proteins with the goal of providing an overview and practical groupings of proteins as well as educated guesses for putative functions based on selected genome-scale expression data. This work will allow researchers to select and focus on subsets of the genome for further analysis thereby increasing efficiency of large-scale reverse genetic experiments.
According to predictions in WormBase release 210, 5,676 C. elegans genes encode proteins with a signal peptide (SP), i.e. proteins likely to enter the secretory pathway through the endoplasmic reticulum (ER). A further 5,458 proteins are predicted to have a transmembrane (TM) domain. The majority of those (3,539) lack an SP. In total 9,215 genes encode proteins with either an SP or a TM domain (see Additional file 1 for a full list and Additional file 2 for the corresponding RNAi clones).
3,757 SP-containing proteins do not contain a TM domain and are therefore either secreted or reside in various subcellular compartments originating from the ER. Direct experimental evidence for a subcellular localization does not exist for the overwhelming majority of C. elegans proteins. To identify the putative subcellular locations of C. elegans proteins, we examined homologs of yeast and mouse proteins for which experimental evidence was available. We focused on proteins residing in the ER, the Golgi apparatus, and various vesicular compartments originating from the ER (see Materials and Methods for details and Additional file 3 and Additional file 4 for the corresponding RNAi clones). We examined the resulting list of proteins and removed known extracellular proteins, as these proteins are known to pass through the secretory pathway and are therefore only transiently present in the ER or Golgi. These primary analyses resulted in 207 proteins that likely reside in one of the endomembrane compartments originating from the ER. In addition, we found that 66 proteins were predicted to be mitochondrial. As this group contained bona fide mitochondrial proteins, the entire group was classified as potentially mitochondrial and removed from the original list of 3,757 SP-containing proteins. The remaining 3,484 proteins, about 17% of the proteome, are considered in this study as putative secreted proteins.
The second group of proteins consisted of 5,458 proteins with one or more transmembrane domains (TM). 3,539 of those proteins did not contain a signal peptide. Experimental evidence for the localization of yeast and mouse homologs was used as an indicator for putative localization of C. elegans transmembrane proteins. In this way, we were able to assign a total of 481 transmembrane proteins to one or more organelles such as mitochondria, ER, Golgi, endosomes, lysosomes or peroxisomes. The overwhelming majority, 4,977 proteins, lacked any predicted subcellular localization. A substantial fraction of these proteins (43%) also lacked any recognizable domain, preventing functional predictions based on the protein sequence itself.
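The bookkeeping behind these counts can be retraced with a few subtractions. The following is only a minimal sketch using the numbers quoted in the text; the variable names are illustrative and are not taken from the original analysis.

```python
# Counts quoted in the text (WormBase release 210).
# Variable names are illustrative assumptions, not the authors' own.
protein_coding_genes = 20_242
with_sp = 5_676            # proteins with a predicted signal peptide
with_tm = 5_458            # proteins with at least one predicted TM domain
tm_without_sp = 3_539      # TM proteins lacking a signal peptide

tm_with_sp = with_tm - tm_without_sp      # TM proteins that also carry an SP
sp_or_tm = with_sp + tm_without_sp        # proteins with an SP or a TM domain
sp_without_tm = with_sp - tm_with_sp      # SP-containing proteins without TM

endomembrane_resident = 207               # ER, Golgi and vesicular residents
predicted_mitochondrial = 66
putative_secreted = (sp_without_tm
                     - endomembrane_resident
                     - predicted_mitochondrial)

tm_assigned_to_organelle = 481
tm_unassigned = with_tm - tm_assigned_to_organelle

secreted_percent = round(100 * putative_secreted / protein_coding_genes, 1)

print(sp_or_tm, sp_without_tm, putative_secreted, tm_unassigned)
print(secreted_percent)
```

The derived values agree with the figures given in the text: 9,215 proteins with an SP or TM domain, 3,757 SP-containing proteins without a TM domain, 3,484 putative secreted proteins and 4,977 TM proteins without a predicted localization.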
Upon examination of the size distribution of secreted and transmembrane proteins, we observed that smaller proteins are strongly overrepresented within the putative secreted proteins, whereas large proteins are underrepresented (Figure 1B). This effect becomes more pronounced for proteins lacking recognizable domains. Among transmembrane proteins, small proteins (<200 aa) are underrepresented while proteins in a size range of 300–400 aa are strongly overrepresented. The overrepresentation of this size group, however, is due to the fact that 1,469 putative chemoreceptors fall into this size range. Exclusion of this group generates a size distribution of TM proteins that is very similar to the overall size distribution.
Taken together, 9,215 proteins, almost half of the genome, contain either a signal peptide or at least one transmembrane domain. Currently the vast majority of these proteins have not yet been functionally characterized.
The majority of the putative secreted proteins contain only a small number of domains. Only 52 proteins contain more than ten domains and only 21 contain more than 20 domains. Eleven secreted proteins contain multiple copies of one or two different domains (Figure 3) rather than a variety of different domains (see above for complex proteins). Among the highly repetitive secreted proteins are extracellular matrix components like emb-9, let-2 or him-4. Three additional proteins (Y43F8B.3, Y55F3BR.2, ZC84.6) have a combination of DC and KU domains and are either strongly (ZC84.6) or moderately (Y43F8B.3, Y55F3BR.2) enriched in embryos raising the possibility that they are also part of the extracellular matrix. RNAi experiments indicate that all three genes are essential for survival [16–18], further supporting this idea.
Large families of secreted proteins with known domains
Columns: Upregulated upon infection; Size range (aa); Typical domain organization
Entries: small C-type lectins; proteins with a C-type lectin domain; putative cuticular collagens; proteins containing multiple copies of the ShK toxin domain; RcpL is the ligand binding domain in the EGF-receptor (let-23) and the insulin-receptor (daf-2); many members have no predicted signal peptide; grd- and grl-genes; groundhog proteins (hedgehog-related proteins); proteins containing multiple copies of a cysteine-rich repeat (Ctx); Vitelline membrane outer layer protein I; large KU/DC proteins (see Figure 3 complex proteins); proteins containing tandem copies of Kunitz and DC domains; saposin-like protein family; warthog proteins (hedgehog-related proteins)
Large families of putative secreted proteins with ‘domains of unknown function’
Columns: No of genes; Upregulated upon infection; Size range (aa); Typical domain organization
Entries: small DB proteins
Large families of transmembrane proteins with known domains
Columns: No of genes; Pfam ID, SMART ID, reference; Typical domain organization
Entries: Ligand-gated ion channel; PF02931, PF02932, lgc-*; Salkoff et al.; Major facilitator superfamily MFS-1; transmembrane region, type 1; Zhao et al.; Sugar (and other) transporter; Amino acid transporter; see Additional file 7; Guanylate and adenylate cyclase; (F) Cell adhesion; (G) ECM components; see Additional file 7
“Channels” and “Transporters” were grouped based on established definitions. The largest subgroup within channels is ligand-gated ion channels with 102 members, including 29 nicotinic acetylcholine receptors, ten ionotropic glutamate receptors as well as glycine, serotonin and GABA receptors. In addition, the C. elegans genome contains 71 potassium channels and 46 sodium channels. The largest subgroups within the transporters are 117 major facilitators (PF07690), 52 ABC transporters, 43 sugar transporters (PF00083) and 36 amino acid transporters (PF00324, PF01490).
The “Enzymes” group consists of various metabolic enzymes, whereas components of signal transduction pathways such as kinases were placed in the “Signaling” group. 71 O-acyltransferases (PF01757, PF03062, oac-genes), 67 UDP-glucuronosyltransferases (PF00201), 50 AAA ATPases (SM00382) and 49 peptidases constitute major groups within the “Enzymes”.
The “Signaling” group consists mainly of well-known receptors, for example 143 GPCRs, 69 kinase receptors, and 30 adenylate and guanylate cyclase receptors (PF00211, SM00044).
The “Trafficking” group is small and diverse. Ten t-SNARE proteins, eight synaptobrevins and five synaptotagmins are the major constituents.
“Cell adhesion” proteins were collected following the definition by Cox and Hardin. Cadherins, claudin-like proteins (PF07062) and several immunoglobulin or laminin G domain containing proteins are major families within this group.
“ECM” components are mostly composed of cuticlins and collagens. Eleven collagen proteins are included in this group, ten of which have a TM domain in close proximity to their N-terminus. These collagens may be type II TM proteins that are bound to the plasma membrane and then shed by various proteases as seen in mammalian collagens. A portion of these TM predictions by SMART could be mispredictions of SPs, especially when considering that three proteins (C34F6.2, C34F6.3, F54B11.1) are predicted to have an SP overlapping with a TM domain at their N-terminus according to WormBase 210. The majority of the remaining ECM proteins are cuticlins with a C-terminal TM domain and are expected to be cleaved from the cell surface.
A substantial number of proteins (805) do not fit into any of the above categories and also do not form a homogeneous group. This set includes proteins containing one or more domains that are not indicative of a particular biochemical or biological function, proteins with a “domain of unknown function”, and proteins for which the fragmentary information available (GO annotation, gene name or description) does not provide sufficient evidence for a conclusive placement into one of the functional groups defined above.
A significant fraction of putative secreted proteins and transmembrane proteins are upregulated in specific stages (Figure 5). Compared to the entire protein data set, a comparatively large number of transmembrane proteins are upregulated in the L4 stage. Putative secreted proteins show a higher proportion of genes upregulated in the L2 stage (Figure 5). Almost 20% of the putative secreted proteins are specifically upregulated in males, suggesting a sex-specific function. While secreted proteins constitute only about 17% of the genome, expression profiles of genes upregulated upon bacterial infections contain 31% secreted proteins. Similarly, profiles of genes upregulated upon exposure to fungal pathogens contain about 37% secreted proteins. In contrast, TM proteins are not overrepresented among the genes upregulated in these infection scenarios. As expected, many signaling proteins (110 of 354) are upregulated upon infection. A substantial fraction of the GPCRs (62 of 143; not including the putative chemoreceptors) and patched receptors (19 of 30) are upregulated upon bacterial infection. Ten of the twelve cadherins and twelve of 23 fatty acid metabolic enzymes (PF01151, PF00487, PF04116) are upregulated as well. Only a small fraction of TM proteins (307 out of 5,458) are upregulated after fungal exposure, including five of twelve ‘fungus induced’ (fip) or ‘fungus induced related’ (fipr) protein family members. Overall 58% of all proteins upregulated after pathogen exposure have either a signal peptide or a transmembrane domain (Figure 5), indicating that putative secreted and cell surface proteins make up most of the immune response. In total, more than a third of secreted proteins (1,297 out of 3,484) are upregulated in at least one of the five pathogen exposure scenarios tested, suggesting that a substantial fraction of secreted proteins are part of the nematode’s immune system.
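The overrepresentation described above can be expressed as a simple ratio between the share of secreted proteins among the upregulated genes and their share of the genome. The sketch below uses only the percentages and counts quoted in the text; the helper name fold_enrichment is an illustrative assumption, and no statistical test is implied because the sizes of the upregulated gene sets are not given here.

```python
def fold_enrichment(observed_fraction, background_fraction):
    """Ratio of the fraction seen in a gene set to the genome-wide fraction."""
    return observed_fraction / background_fraction

# Background: putative secreted proteins among all protein-coding genes.
background_secreted = 3_484 / 20_242

# Shares of secreted proteins among upregulated genes, as quoted in the text.
bacterial = fold_enrichment(0.31, background_secreted)
fungal = fold_enrichment(0.37, background_secreted)

# Fractions of each group that are upregulated, from the quoted counts.
secreted_any_pathogen = 1_297 / 3_484   # "more than a third"
tm_fungal = 307 / 5_458                 # "only a small fraction"
gpcr_bacterial = 62 / 143
patched_bacterial = 19 / 30

print(f"bacterial enrichment: {bacterial:.2f}x")
print(f"fungal enrichment:    {fungal:.2f}x")
print(f"secreted upregulated: {secreted_any_pathogen:.1%}")
print(f"TM upregulated after fungal exposure: {tm_fungal:.1%}")
```

Under these numbers secreted proteins are roughly twofold overrepresented among pathogen-induced genes, whereas only about one in twenty TM proteins responds to fungal exposure.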
These genome-wide expression profiles also reveal potential functions for some protein families characterized by various ‘domains of unknown function’ (DUF). For example 20 out of 22 DUF274 genes and nine out of ten DUF1261 are among those that are upregulated upon pathogen exposure (Table 1), pointing to a role in the immune response. This analysis also allows a more refined functional characterization of very large families. 83 of the small C-type lectins are upregulated upon pathogen exposure (Table 2), confirming a role for C-lectins in immune response. A comparable number of C-lectins (80) are more than 5-fold upregulated in L4 stage males. Nine of these genes also show a moderate upregulation (3–7 fold) in the L4 stage of hermaphrodites. A similar expression profile is identified in proteins required for sperm production such as ‘major sperm proteins’, suggesting a possible function for this particular group of C-lectins in sperm production. However, the overwhelming majority of C-lectins upregulated in males are expressed at low levels in L4 stage hermaphrodites, suggesting a role in males unrelated to the production of sperm. Only a small number of C-lectins (4–18) are upregulated in particular developmental stages, signifying a lack of stage-specific functions. Eleven of 66 ShK-proteins are upregulated in males and 26 members are upregulated after pathogen exposure, indicating this family may have a similar range of functions as C-lectins. 15 of the 58 nematode-specific peptides (nsp-genes) are also more than 5-fold upregulated in males. Notably, six of ten nspa-genes and eight of ten nspd-genes belong to this group, suggesting a male-specific role for these two families in particular.
The WSN domain (Worm-Specific N-terminal domain, another domain of unknown function found in nematodes) is present in 43 TM proteins. We found that 34 of these proteins are specifically upregulated in hermaphrodites in the L4 stage. Most were also highly expressed in L4 stage males, raising the possibility that many WSN-proteins are involved in early germline or sperm development. UDP-glucuronosyltransferases (PF00201) are known to be involved in detoxification in mammals by adding glucuronosyl groups to hydrophobic molecules, thereby allowing solubilization and secretion of toxic molecules. In C. elegans, we found these enzymes are expressed throughout all developmental stages. 41 of 67 TM domain-containing UDP-glucuronosyltransferases are upregulated after pathogen exposure, suggesting involvement in immune response. Seven of twelve cadherins are upregulated in embryos, implying involvement in early developmental processes, which has been shown previously for cdh-4, fmi-1 and hmr-1.
In summary, genome-scale expression profiles hint at a potential function for a portion of previously uncharacterized gene families. Based on these data we speculate that many putative secreted proteins have stage-specific functions and that a substantial fraction may be part of the nematode’s immune system.
The secretome is defined as the set of secreted proteins found in the proteome of an organism, i.e. proteins that reside outside the cell. These proteins are important for cell communication, cell adhesion and interactions with the environment. During development secreted proteins are essential for cell fate specification and cell migrations. Some provide structural support for cells and organs to determine the overall appearance of an animal. Others are essential for cell communication within the organism. In addition, secreted proteins mediate responses to environmental stresses such as pathogens. Furthermore, secreted proteins have enzymatic functions that range from degrading food to remodeling the body during life stage transitions. The secretome therefore is of interest to a wide variety of researchers ranging from developmental biologists to immunologists.
Bioinformatic approaches are frequently used to define the ‘secretome’. The prediction of a signal peptide, which targets proteins to the secretion pathway, is typically used as a first filter to identify secreted proteins. Various algorithms to predict subcellular localization can be used (see [34–36] for recent reviews) to distinguish bona fide secreted proteins from those that reside in various intracellular compartments. However, the ability of current prediction algorithms to accurately assign novel proteins to subcellular localizations is somewhat limited, mainly because localization and retention signals for endomembrane compartments, in particular ER, Golgi, endosomes, lysosomes or peroxisomes, are poorly understood. Thus, a reliable identification of proteins residing in endomembrane compartments by virtue of sequence features alone is currently impossible. A more conservative approach incorporates experimental evidence indicating the presence of proteins in certain subcellular compartments. Large-scale data sets of this kind are typically generated by proteomic studies, i.e. the analysis of the protein content of a purified subcellular compartment. While these studies provide direct evidence for the presence of a protein in a particular subcellular fraction (to the extent that purification is possible), proteins ‘caught in transit’ in the Golgi or the ER add another level of complexity to the identification of truly resident proteins in those compartments. Indeed, our manual inspection of the organelle data sets revealed several secreted proteins like collagens or bona fide cell surface receptors annotated as being present in the Golgi or lysosomes. In the absence of localization information for most C. elegans proteins we had to rely on localization information from yeast and mouse homologs. Identical subcellular localization across species cannot be taken for granted, especially within gene families that have expanded independently in the species.
Consequently the secretome as defined in this study will contain false positives and negatives, which should be kept in mind. Despite these limitations our analysis suggests that as in other organisms the overwhelming majority of proteins in C. elegans with a signal peptide and no transmembrane domain are likely to be secreted and do not reside in subcellular compartments.
A number of protein families are known to be secreted and can be confidently assigned to the secretome, despite the lack of a predicted signal peptide in some members. The C. elegans genome contains 39 insulins, two of which lack a predicted signal peptide. This may be due to a failure of the prediction algorithm or a false prediction of the gene structure. Predicting the boundaries of a gene is still a challenging problem and the incorrect prediction of the N-terminus of a protein would almost certainly lead to a failure of signal peptide prediction. In our analyses we found proteins lacking signal peptides in almost all families predicted to be secreted, suggesting that current estimates understate the number of secreted proteins.
Some proteins are exported without passing through the ER and consequently do not have a signal peptide. Several such ‘unconventional’ secretion pathways exist [37, 38], although we do not know how widespread their use in C. elegans is. Finally, secreted proteins or peptides can also be generated by cleavage of membrane proteins. One of the most prominent examples is the β-amyloid peptides, which are generated by cleavage of APP and thought to be causative agents in the development of Alzheimer’s disease (see  for a recent review). Despite our employment of different strategies to minimize false positives and false negatives in our secretome data set, it should be noted that defining the ‘secretome’ remains a work in progress.
Based on our analysis we estimate that approximately 17% of the genome encodes putative secreted proteins. The LOCATE database currently lists approximately 7% of the mouse and human proteins as secreted. 17% of zebrafish and human proteins have been reported to contain a signal peptide, as well as 12% of mouse proteins, again lower than the corresponding number in C. elegans (28% of proteins with an SP). The PEDANT3 database using TargetP prediction on Ensembl data sets lists 32% of C. elegans proteins as being in the secretion pathway and comparable or lower numbers (21%-27%) for Drosophila melanogaster, Tribolium castaneum, zebrafish, mouse and human proteins. This suggests that the C. elegans genome contains a larger proportion of secreted proteins as compared to other invertebrate and vertebrate model organisms. This seems counterintuitive, as C. elegans is much simpler anatomically and lacks the sophisticated vertebrate immune system as well as many secreted proteins known to have evolved within vertebrates. Nevertheless, C. elegans possesses an elaborate set of secreted proteins, illustrating that genetic complexity does not necessarily correlate with anatomical complexity.
We used data from the most recent large-scale study to identify genes potentially involved in immune response. The authors used three bacterial and two fungal pathogens and found surprisingly few genes commonly upregulated upon infection. Bacterial infections elicited a response very different from fungal infections and among any two of the infections there was limited overlap in response. We found that more than a third of the putative secreted proteins are upregulated in at least one of the five infection scenarios covered in this study. Given the limited set of potential pathogens tested in this study, this suggests that a substantial fraction of putative secreted proteins may be involved in immune response.
Several large protein families have been of particular interest among researchers since the genome sequence was made available. Among these are cuticular collagens, C-type lectins, various families of neuropeptides and known putative signaling molecules such as insulins [19, 20], warthog, groundhog and groundhog-related proteins [19, 20]. Many families of signaling molecules that are expanded in vertebrates such as TGF-βs or ephrins along with their receptors have small families in C. elegans. So, while the overall number of potential signaling molecules seems comparable, vertebrates and nematodes differ in which families of signaling molecules are expanded.
Our analysis shows an enrichment of smaller secreted proteins with a significant fraction less than 100 amino acids. The gene structure for many of these genes is experimentally confirmed by cDNA or RNA-seq expression data. Unfortunately, these smaller secreted proteins are very diverse and often contain no recognizable domain, preventing speculation on potential functions. Larger families found within this group consist of peptides like FMRF-like peptides (flp-genes), neuropeptide-like proteins (nlp-genes) and five families dubbed nematode-specific peptides (nsp-genes). It is possible that many of the remaining uncharacterized, small secreted proteins encode a variety of peptide families as well. An analysis of the mouse secretome critically evaluated signal peptide-containing proteins smaller than 100 amino acids. By applying stringent criteria for support of the gene model, the authors excluded 649 of the original 741 sequences as unlikely to be real genes. Among the criteria used were the presence of introns, recognizable domains and the presence of orthologs. Many of the small putative secreted proteins in C. elegans that have experimentally confirmed gene structures would fail this stringent test as they do not contain recognizable domains, orthologs and sometimes introns. It is therefore possible that the mouse (and human) genome also may contain a larger number of small, secreted proteins than previously predicted. For this reason it would be worthwhile to revisit this point should additional expression information become available.
We broadly grouped the C. elegans proteins into three evolutionary categories, namely ‘nematode-specific’, ‘metazoan’ and ‘eukaryotic’ (not further distinguishing between universal and truly eukaryotic genes). We found that orthology databases such as TreeFam or Inparanoid each cover only part of the C. elegans genome, so that combining the data from various databases significantly improves coverage. Similar observations were made by Shaye and Greenwald, who recently performed a meta-analysis of orthology prediction programs and assembled a list of C. elegans proteins with human orthologs. 96% of the proteins contained in this Ortholist are found in our metazoan or eukaryotic categories, confirming the validity of our classification.
The C. elegans secretome contains a number of large families with novel domains of unknown function. Of these many are found exclusively in nematodes and few of them have been characterized thus far. Members of one such family, however, have now been shown to encode enzymes, shedding light onto one of these novel families. galt-1 (DUF23 family) has been identified as a member of a novel glycosyltransferase family. 24 additional DUF23 proteins have also been added to this group of enzymes based on sequence similarity and phylogenetic analysis. It remains to be seen whether the remaining members of the DUF23 family are a more distantly related group of enzymes with similar function. The function of the remaining DUFs currently remains unknown.
Transmembrane proteins regulate cell communication and adhesion and allow controlled exchange of chemicals across membranes. The C. elegans genome contains a wide variety of different transmembrane proteins. By far the largest group consists of serpentine-type GPCRs, a C. elegans expansion of GPCRs that are separately grouped as a subset within rhodopsin-type GPCRs (http://www.gpcr.org/7tm/). Serpentine-type GPCRs are typically expressed in a subset of sensory neurons and are considered to be chemoreceptors. In contrast to vertebrates, C. elegans contains a very small number of chemosensory neurons and yet has four times more GPCRs than vertebrates. Why GPCRs are expanded in C. elegans is currently unknown. One hypothesis is that C. elegans depends to a larger extent on chemosensation, since it lacks the visual or auditory systems found in other animals.
Almost half of all C. elegans transmembrane proteins have additional domains or annotations that allowed for grouping into larger families and broad functional categories using Gene Ontology (GO) annotations. During our analysis, we noticed some anomalies, one being an unusually large number of genes (589) tagged with a “lipid_storage” phenotype annotation based on the genome-wide RNAi screen by Ashrafi et al. This group of genes contains transcription factors and signaling molecules, which are merely indirectly involved in lipid storage. Similarly, 168 genes are tagged as being involved in “receptor-mediated endocytosis” based on the large-scale RNAi screen by Balklava et al. Again, no distinction is made between genes that are directly involved and those that are indirectly involved. In these instances we validated GO annotations by manual curation. However, we could not completely solve this problem, as many proteins do not have additional information to support or disregard the individual GO annotations. The subgroups and the data used to create the groupings (Pfam, SMART groups defined by other researchers) are listed in Table 3 and Additional file 7. This format offers a simple way to re-group the proteins using existing groups or subgroups as building blocks and can be updated easily as new data become available.
The C. elegans genome is predicted to contain around 20,000 genes. The overwhelming majority of these genes are still uncharacterized. A substantial fraction of the genome encodes putative secreted and transmembrane proteins, which are of interest in the context of studying developmental processes as well as interactions of animal and environment. We grouped all proteins containing either a signal peptide or a transmembrane domain, a total of 9,215 proteins, using a number of criteria including similarity and evolutionary conservation. A substantial fraction of the 3,484 putative secreted proteins seem to be nematode-specific and therefore likely have phylum-specific functions. Putative secreted proteins are enriched for small proteins and proteins with no predicted domains limiting further analysis. Single-pass transmembrane proteins are the largest group of transmembrane proteins containing most of the known receptors for cell-cell communication and cell adhesion. The C. elegans genome contains 1,208 single-pass TM proteins with no known or predicted function, a group likely containing unidentified receptors for developmental processes and other functions.
Combined with other information, such as genome-scale expression data sets, our classification system can be used to select sets of proteins for targeted functional analysis. As the large number of genes in the C. elegans genome poses a problem for many labor-intensive screens, these groupings provide an efficient tool for focusing efforts on likely candidate genes.
We used SMART and Pfam to establish the domain organization of the C. elegans proteins. We used SMART predictions to determine signal peptides and transmembrane domains. In cases where both a signal peptide and a transmembrane domain were predicted at the N-terminus in an overlapping fashion, we considered the protein to have a signal peptide, but no transmembrane domain. Other information for the proteins and their corresponding genes, such as TreeFam and Inparanoid groupings and GO (Gene Ontology) annotations, was extracted from WormBase. The data were assembled in a MySQL database, which can be queried using the web interface GExplore (http://genome.sfu.ca/gexplore/).
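The overlap rule for N-terminal predictions can be illustrated with a minimal sketch. The `Prediction` container, the coordinate convention and the three output labels are assumptions made for illustration only; they are not part of SMART or of the GExplore database.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed convention: 1-based, inclusive residue coordinates.
Span = Tuple[int, int]


@dataclass
class Prediction:
    """Hypothetical container for per-protein predictions."""
    signal_peptide: Optional[Span]  # predicted signal peptide, or None
    tm_domains: List[Span]          # predicted transmembrane segments


def overlaps(a: Span, b: Span) -> bool:
    """True if two inclusive intervals share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def resolve_n_terminal_overlap(pred: Prediction) -> Prediction:
    """Drop TM segments that overlap the predicted signal peptide.

    This mirrors the rule in the text: an overlapping N-terminal
    prediction is counted as a signal peptide, not as a TM domain.
    """
    if pred.signal_peptide is None:
        return pred
    kept = [tm for tm in pred.tm_domains
            if not overlaps(tm, pred.signal_peptide)]
    return Prediction(pred.signal_peptide, kept)


def classify(pred: Prediction) -> str:
    """Assign an illustrative label after resolving the overlap."""
    resolved = resolve_n_terminal_overlap(pred)
    if resolved.tm_domains:
        return "transmembrane"
    if resolved.signal_peptide is not None:
        return "secreted"
    return "other"
```

In this sketch a protein with a signal peptide at residues 1-20 and a single predicted TM segment at residues 5-25 is labelled "secreted", whereas the same protein with a second TM segment further downstream remains "transmembrane".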
For most C. elegans proteins direct experimental evidence for subcellular localization is not available. To identify putative organelle proteins we therefore identified homologs of yeast and mouse organelle proteins for which corresponding experimental evidence exists. Mouse organelle proteins with manual experimental evidence were obtained from QuickGO. Yeast organelle proteins with experimental evidence were obtained from the yeast GO slim data set (http://www.yeastgenome.org). Inparanoid was used to identify the C. elegans homologs of these genes. In the case of protein families, where there is more than one C. elegans homolog for a given mouse or yeast protein, the possibility exists that family members are located in different subcellular compartments. In the absence of further information we are unable to resolve this issue. The resulting list of putative organelle proteins is available in Additional file 3.
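The homology-based transfer of organelle annotations can be sketched as follows. The function names, the identifiers and the input layout (a dictionary of reference localizations and a list of reference/worm homolog pairs) are hypothetical; the sketch does not reproduce the QuickGO, yeast GO slim or Inparanoid file formats.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def transfer_localization(
    reference_loc: Dict[str, Set[str]],
    homolog_pairs: List[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Copy organelle labels from mouse or yeast proteins to worm homologs.

    homolog_pairs holds (reference_protein, worm_protein) tuples.
    Worm proteins whose homologs carry no label are left out.
    """
    worm_loc: Dict[str, Set[str]] = defaultdict(set)
    for ref_id, worm_id in homolog_pairs:
        worm_loc[worm_id] |= reference_loc.get(ref_id, set())
    return {worm: locs for worm, locs in worm_loc.items() if locs}


def ambiguous_references(homolog_pairs: List[Tuple[str, str]]) -> Set[str]:
    """Reference proteins with more than one worm homolog.

    For these families the transferred label may not hold for every
    member, which is the unresolved issue noted in the text.
    """
    by_ref: Dict[str, Set[str]] = defaultdict(set)
    for ref_id, worm_id in homolog_pairs:
        by_ref[ref_id].add(worm_id)
    return {ref for ref, worms in by_ref.items() if len(worms) > 1}
```

Reporting the ambiguous reference proteins separately keeps the limitation visible without changing the transferred annotations themselves.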
We grouped genes into three categories according to their likely phylogenetic origin. Genes of ‘nematoda’ origin are defined as not having homologs outside the nematodes. Genes of ‘metazoa’ origin are defined as having homologs in invertebrates and vertebrates, but not in plants or fungi. Genes of ‘eukaryota’ origin are defined as having homologs in all the above groups. This last group also contains ‘universal’ genes that are found even in prokaryotes. We used data from TreeFam 7.0, Inparanoid 7.0 and best BLAST matches to proteins in other species (data extracted from WormBase 210) to place proteins into individual groups. We found that TreeFam and Inparanoid assignments generally agree, but that both databases cover only part of the proteome. Since TreeFam covered a larger fraction of the proteome (8,153 proteins are not covered by Inparanoid 7.0, whereas only 4,650 proteins are not assigned to any TreeFam family in version 7.0), we started with the TreeFam data as follows.
Genes in the ‘nematoda’ group were defined as having only nematode species and at most one non-nematode species (to allow for an outgroup) in the tree. Genes in the ‘metazoa’ group were defined as having a vertebrate or chordate species and an arthropod species, but no plant or fungal species in the tree. Genes in the ‘eukaryota’ group were defined as having a vertebrate or chordate species, an arthropod species and a plant or fungal species in the tree.
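The TreeFam-based rules above can be expressed as a small decision function. This is a sketch under stated assumptions: each species in a tree is represented by a single clade label, and the labels themselves ('nematode', 'chordate', 'arthropod', 'plant', 'fungus', 'other') are hypothetical rather than taken from TreeFam.

```python
from typing import Iterable, Optional


def classify_treefam(species_clades: Iterable[str]) -> Optional[str]:
    """Classify a tree from the clade label of each species it contains.

    Returns 'nematoda', 'metazoa', 'eukaryota', or None when none of
    the three rules applies.
    """
    clades = list(species_clades)
    non_nematode = [c for c in clades if c != "nematode"]
    has_nematode = len(non_nematode) < len(clades)

    # Only nematodes, plus at most one non-nematode species as outgroup.
    if has_nematode and len(non_nematode) <= 1:
        return "nematoda"

    # "chordate" stands in for "vertebrate or chordate" in the text.
    has_chordate = "chordate" in clades
    has_arthropod = "arthropod" in clades
    has_plant_or_fungus = "plant" in clades or "fungus" in clades

    if has_chordate and has_arthropod:
        return "eukaryota" if has_plant_or_fungus else "metazoa"
    return None
```

Trees that match none of the rules return `None` and would be passed on to the next data source.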
We then used Inparanoid data to group those proteins that remained unclassified. Inparanoid clusters are classified as InP_cae (built solely from C. elegans-C. briggsae ortholog pairs), InP_met (built from metazoan ortholog pairs) or InP_uni (built from non-metazoan, eukaryotic ortholog pairs) (see http://wiki.wormbase.org/index.php/Glossary_of_terms#inparanoid). Origin ‘nematoda’ is then simply defined as belonging to an InP_cae cluster, but not to an InP_met or InP_uni cluster; origin ‘metazoa’ is defined as belonging to an InP_met cluster, but not to an InP_uni cluster; and origin ‘eukaryota’ is defined as belonging to an InP_uni cluster. For the 2,723 proteins that still remained unclassified, we used the best BLAST matches in the following way: nematoda: best BLAST matches contain proteins from C. remanei, C. briggsae or P. pacificus, but not from M. musculus, H. sapiens, S. cerevisiae or A. thaliana; metazoa: best BLAST matches contain proteins from M. musculus or H. sapiens, but not from S. cerevisiae or A. thaliana; eukaryota: best BLAST matches contain proteins from S. cerevisiae or A. thaliana. This approach potentially overestimates the number of ‘metazoa’ or ‘eukaryota’ genes that are classified solely using the BLAST comparisons, but is conservative in defining genes as nematode-specific. This strategy allowed us to classify the overwhelming majority of the proteins and left us with only 834 unclassified proteins.
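The fallback order described above (TreeFam first, then Inparanoid, then best BLAST matches) can be sketched as a simple cascade. The function names and the input representation (a set of cluster types and a set of species names per protein) are assumptions for illustration; only the cluster labels and species names are taken from the text.

```python
from typing import Optional, Set

NEMATODE_SPECIES = {"C. remanei", "C. briggsae", "P. pacificus"}
METAZOAN_SPECIES = {"M. musculus", "H. sapiens"}
NON_METAZOAN_SPECIES = {"S. cerevisiae", "A. thaliana"}


def classify_inparanoid(cluster_types: Set[str]) -> Optional[str]:
    """Classify by the Inparanoid cluster types a protein belongs to."""
    if "InP_uni" in cluster_types:
        return "eukaryota"
    if "InP_met" in cluster_types:
        return "metazoa"
    if "InP_cae" in cluster_types:
        return "nematoda"
    return None


def classify_blast(hit_species: Set[str]) -> Optional[str]:
    """Classify by the species found among the best BLAST matches."""
    if hit_species & NON_METAZOAN_SPECIES:
        return "eukaryota"
    if hit_species & METAZOAN_SPECIES:
        return "metazoa"
    if hit_species & NEMATODE_SPECIES:
        return "nematoda"
    return None


def classify_origin(
    treefam_call: Optional[str],
    cluster_types: Set[str],
    hit_species: Set[str],
) -> Optional[str]:
    """Use the first data source that yields a call, in the stated order."""
    return (
        treefam_call
        or classify_inparanoid(cluster_types)
        or classify_blast(hit_species)
    )
```

Testing the broadest category first in each function makes the "but not" conditions implicit: a protein is only called ‘nematoda’ when no metazoan or non-metazoan evidence is present, which is what makes the nematode-specific call conservative.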
GO annotations use a standardized hierarchical vocabulary to describe the function of a gene product. We used GO annotations in cases where the domain analysis was more difficult, e.g. for proteins belonging to small families or for individual proteins not belonging to families.
We used quantitative expression data generated by high-throughput sequencing (RNA-seq) and provided by the modENCODE project to determine stage- and sex-specifically enriched genes. Expression level data were presented as depth of coverage per base per million reads (dcpm). We only considered genes with expression levels higher than 0.04 dcpm, which has been established as a reasonable threshold for true expression. We calculated enrichment in a particular developmental stage by dividing the expression in that stage by the average expression in all other stages. For male-specific enrichment we used expression in the L4 hermaphrodite as a reference. Genes more than 5-fold enriched in a particular stage or sex were considered to be substantially upregulated. Lists of genes upregulated after infection were taken from published data, using only the RNA-seq data sets. We applied the same threshold of 0.04 dcpm and only considered genes with expression levels higher than this threshold after infection.
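The enrichment calculation can be sketched as follows. The two thresholds come from the text; the function names, the dictionary input and the decision to apply the 0.04 dcpm cutoff to the stage being tested are assumptions, since the text does not specify which stage the cutoff refers to.

```python
from statistics import mean
from typing import Dict, List

MIN_DCPM = 0.04  # expression threshold from the text
MIN_FOLD = 5.0   # enrichment threshold from the text


def stage_enrichment(expr: Dict[str, float], stage: str) -> float:
    """Expression in one stage divided by the mean of all other stages."""
    others = [value for name, value in expr.items() if name != stage]
    if not others:
        raise ValueError("at least one other stage is required")
    background = mean(others)
    if background == 0:
        # Assumed handling of a zero background, not stated in the text.
        return float("inf") if expr[stage] > 0 else 0.0
    return expr[stage] / background


def enriched_stages(expr: Dict[str, float]) -> List[str]:
    """Stages above the expression cutoff and more than 5-fold enriched."""
    return [
        stage
        for stage, value in expr.items()
        if value > MIN_DCPM and stage_enrichment(expr, stage) > MIN_FOLD
    ]


def male_enriched(male: float, l4_hermaphrodite: float) -> bool:
    """Male enrichment relative to the L4 hermaphrodite reference."""
    if male <= MIN_DCPM:
        return False
    if l4_hermaphrodite == 0:
        return True
    return male / l4_hermaphrodite > MIN_FOLD
```

For a gene with 0.1 dcpm in each of three stages and 1.2 dcpm in a fourth, the fourth stage has a ratio of about 12 and is the only stage reported as enriched.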
Transmembrane proteins were grouped into broad functional categories based on domains known to be involved in specific biochemical functions. We employed various approaches to group these proteins. For clarity, we used “family” to refer to an already existing clan from other databases, e.g. acyltransferase family (PF01757) or metalloprotease family (SM00235), and “group” or “subgroup” to refer to a new clan that we generated. From our analysis described above, we had lists of proteins in each TreeFam family. Starting with TreeFam families containing ten or more members, we found a number of families whose biochemical functions were easily predicted, as they contained catalytic or critical domains for certain biochemical functions. Each TreeFam family may represent all proteins sharing the domain, or may represent only a subset that shares an additional domain or a distinct sequence feature within the domain. Therefore we extended our grouping from TreeFam families to include more proteins that shared the same catalytic or critical domains. In this way, most groups with many members were easily assembled. Smaller families with five or more members sharing the same domain were manually examined using GExplore and grouped according to their predicted domain function. In addition we used common gene names that define families for grouping; e.g. lgc genes were grouped as ligand-gated ion channels. Families with fewer than five members were not manually inspected. Instead we relied mainly on GO annotations. Finally, we included groups that have been defined by other researchers to provide a complete overview. After the initial grouping, we manually checked and reassigned proteins when there was evidence that the GO terms or other definitions were not correctly assigned.
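The size-based triage of families described above can be summarized in a short sketch. The size cutoffs are taken from the text; the route labels, function names and the protein-to-family dictionary are hypothetical and only illustrate how families would be assigned to one of the three approaches.

```python
from collections import Counter
from typing import Dict


def grouping_route(n_members: int) -> str:
    """Pick the grouping approach from the number of family members."""
    if n_members >= 10:
        return "domain-based"   # extended via shared catalytic or critical domains
    if n_members >= 5:
        return "manual"         # manually examined by domain function
    return "go-annotation"      # not inspected manually, GO terms used


def route_families(family_of: Dict[str, str]) -> Dict[str, str]:
    """Map each family to its grouping approach.

    family_of maps a protein identifier to its family identifier.
    """
    sizes = Counter(family_of.values())
    return {family: grouping_route(n) for family, n in sizes.items()}
```

The later steps (adding groups defined by other researchers and manual reassignment) depend on curator judgment and are not captured by this sketch.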
Reprolysin (M12B) family zinc metalloprotease (ADAM)
Cell adhesion molecule
Chitin binding peritrophin-A domain
Domain first found in C1r C1s, uEGF, and bone morphogenetic protein
Depth coverage per million
Domain of unknown function
Epidermal growth factor
Phe-Met-Arg-Phe (abbreviation of 4 amino acids)
Fibronectin type III-like fold
G-protein coupled receptor
Immunoglobulin cell adhesion molecule
Kunitz protease inhibitor domain
Kazal proteinase inhibitor I1
Low density lipoprotein-receptor class A domain
Laminin B domain
Laminin EGF-like domain
Laminin G domain
Laminin N-terminal domain
Leucine-rich repeat C-terminal
Leucine-rich repeat N-terminal
Low-density lipoprotein receptor YWTD repeat
Model organism ENCODE (the ENCyclopedia Of DNA Elements)
My(name of the founder’s daughter) structured query language
Nidogen N-terminal domain
Collagenase NC10 and Endostatin
Protein Extraction Description and ANalysis Tool
A database of protein families
Receptor L domain
Simple Modular Architecture Research Tool
SNAP (Soluble NSF Attachment Protein) REceptor
Thrombospondin type I
Trypsin inhibitor like cysteine rich domain
Tree families database
Thyroglobulin type-1 domain
von Willebrand factor type A domain
von Willebrand factor type D domain
Vitelline membrane outer layer protein I
Whey acidic protein 4-disulphide core
DB, EB, ET, MD: Nematode-specific domains.
We would like to thank N. Chen for help with extracting data from WormBase, D. Napier for help with hosting the local GExplore database, L. Hillier for providing the stage- and sex-specific RNA-seq data in a form that was easy to process, and members of the lab for discussion and comments on the manuscript. Work in HH’s lab is funded by grants from CIHR and NSERC. HH is a Senior Michael Smith Scholar.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.