CEG: a database of essential gene clusters
© Ye et al.; licensee BioMed Central Ltd. 2013
Received: 26 June 2013
Accepted: 5 November 2013
Published: 9 November 2013
Essential genes are indispensable for the survival of living entities. They are the cornerstones of synthetic biology, and are potential candidate targets for antimicrobial and vaccine design.
Here we describe the Cluster of Essential Genes (CEG) database, which contains clusters of orthologous essential genes. Based on the size of a cluster, users can easily decide whether an essential gene is conserved in multiple bacterial species or is species-specific. The database also contains the similarity value of every essential gene cluster against human proteins or genes. The CEG_Match tool is based on the CEG database, and was developed for prediction of essential genes according to function. The database is available at http://cefg.uestc.edu.cn/ceg.
Properties contained in the CEG database, such as cluster size, and the similarity of essential gene clusters against human proteins or genes, are very important for evolutionary research and drug design. An advantage of CEG is that it clusters essential genes based on function, and therefore decreases false positive results when predicting essential genes in comparison with using the similarity alignment method.
Keywords: CEG; Essential gene cluster; Antibacterial drug targets; Essential gene prediction
Essential genes are indispensable for the survival of living entities [1, 2], and the functions of proteins encoded by these genes are considered to be the foundation of life [1–3]. They are the cornerstones of synthetic biology [2, 4]. For example, in 2010, Venter et al. created a Mycoplasma mycoides cell based on the essential genes of M. mycoides. Furthermore, analyses of the essential genes shared by different organisms further our understanding of the basic composition of cellular life [1, 6].
Since most antibiotic target gene products are involved in basic metabolic pathways, the essential genes of pathogens constitute attractive targets for antimicrobial drug and vaccine design [1–3, 7]. Hu et al. [8] and Roemer et al. [9] identified drug targets for Aspergillus fumigatus and Candida albicans, respectively, based on the corresponding essential genes identified from mouse models. Furthermore, Barh and Kumar [10] and Amineni et al. [11] successfully identified drug and vaccine targets in silico in Neisseria gonorrhoeae and Leptospira interrogans using the BLAST method against the Database of Essential Genes (DEG) [12, 13]. Identification and study of essential genes also furthers the understanding of the origins of life and evolution, and can help determine the last universal common ancestor (LUCA) [6, 14].
Itaya et al. first investigated the essential genes of Bacillus subtilis on a large scale in 1995 [15]. Presently, essential genes have been determined at the genomic scale in over 18 bacterial genomes. Zhang and colleagues [12, 13] and Chen et al. [16] have constructed two different databases, DEG and OGEE (Online GEne Essentiality database), respectively; these contain all published essential genes. DEG [12, 13] is the first database of essential genes, and only collects genes determined by genome-wide experiments. Since its publication, the DEG database has been widely used in areas such as antibacterial drug target discovery and synthetic biology. OGEE [16] not only contains experimentally tested essential and non-essential genes, but also associates gene features such as expression profiles, duplication status, conservation across species, evolutionary origins, and involvement in embryonic development. It also stores text-mining data in addition to experimental data. Furthermore, OGEE offers tools that allow users to compare gene essentiality among different gene groups, to compare the features of essential genes with those of non-essential genes, and to visualize the results. The CEG database differs from these databases in that it deposits essential genes in orthologous groups and not as single genes. The CEG_Match tool was developed for predicting essential genes from the CEG database based on function. CEG_Match significantly decreases false positive (FP) essential gene predictions in comparison with using direct sequence alignments.
Construction and content
The original essentiality data for generating CEG were derived from DEG 6.5 [12, 13] (http://tubic.tju.edu.cn/deg/). After obtaining the data, we performed the following processes for each essential gene. First, as there are two groups of experimental data for E. coli in DEG, we removed any redundant data. For example, the gene GI:16128021 corresponds to three essentiality records in DEG, but is assigned to only one essentiality record in CEG. Second, genes were assigned to clusters based on their corresponding KEGG (Kyoto Encyclopedia of Genes and Genomes) Orthology (KO) [17] and Clusters of Orthologous Group (COG) category and function descriptions [18]. We investigated the functional descriptions of all genes, or their KO identification, within each COG. If there were two or more different functional descriptions, then the genes in a COG were separated into several smaller clusters. They were also separated based on KO. For example, genes within the code COG0008J are separated into three CEG clusters (CEG0151, CEG0208, and CEG0438), which correspond to the KOs k01885, k01886, and k09698. Data were manually curated at this step so that essential genes with synonymous functional descriptions but identical function were assigned to the same cluster. Consequently, each resultant cluster contains only genes orthologous to each other, and is regarded as one CEG cluster. Genes without COG identifications were assigned to the most probable COG identification and functional description through functional comparison and sequence alignment methodologies. Third, each protein or gene in CEG was aligned against the whole human proteome or genome in the Human Protein Reference Database (HPRD), which contains 30046 human protein sequences [20, 21], using the PSI-BLAST [22] or BLASTN [23] tools. The e-values of the best hits were recorded and named ESAHP for proteins and ESAHG for genes.
For each cluster, we give the e-value of the essential gene with the highest similarity as that of the cluster, and this provides a convenient resource for selecting targets of antibacterial drugs [2, 3, 7].
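The three construction steps described above can be sketched roughly as follows. This is only an illustrative outline, not the pipeline actually used to build CEG: the record layout, the gene and strain identifiers, and the function names are all hypothetical, the manual curation step is omitted, and clusters are keyed simply by the (COG, KO) pair. The one point taken directly from the text is that the cluster-level value is the e-value of the member most similar to a human sequence, that is, the smallest e-value.

```python
from collections import defaultdict

# Hypothetical records: (gene_id, strain, cog, ko, esahp_evalue)
records = [
    ("g1", "strainA", "COG0008J", "k01885", 1e-30),
    ("g1", "strainA", "COG0008J", "k01885", 1e-30),  # redundant essentiality record
    ("g2", "strainB", "COG0008J", "k01885", 2e-5),
    ("g3", "strainB", "COG0008J", "k01886", 0.5),
]

def deduplicate(rows):
    """Keep one record per gene identifier (the first occurrence wins)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[0], row)
    return list(seen.values())

def build_clusters(rows):
    """Group genes by (COG, KO), so a single COG can split into several clusters."""
    clusters = defaultdict(list)
    for gene_id, strain, cog, ko, evalue in rows:
        clusters[(cog, ko)].append((gene_id, strain, evalue))
    return dict(clusters)

def cluster_evalue(members):
    """Cluster-level value: the smallest member e-value (highest similarity)."""
    return min(evalue for _, _, evalue in members)

unique_records = deduplicate(records)
clusters = build_clusters(unique_records)
summary = {key: cluster_evalue(members) for key, members in clusters.items()}
```

In this toy input, the single COG code yields two clusters because its genes carry two different KO identifiers, mirroring the COG0008J example in the text.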
Database design and implementation
The CEG database is served by PHP scripts (http://php.net) running on a Linux server, which query a MySQL relational database (http://www.mysql.com). Its web interface is coded in PHP5 and HTML (http://www.w3.org).
Each CEG cluster groups all orthologous genes that are essential for their bacterial hosts. This differs from DEG [12, 13] in that the CEG database stores essential genes in the form of orthologous gene groups rather than single genes. Viewing the size of the cluster allows users to decide whether an essential gene is conserved in multiple bacterial species or is species-specific. There are two different measures of cluster size: one corresponds to the host number, and the other to the gene number. The gene number is not always identical to the host number because in some cases a genome may have multiple copies of the same gene. For example, the Mycoplasma pulmonis strain UAB CTIP has two ligA genes that belong to the cluster CEG_0001. The database also contains information on the highest similarity values for every essential gene or cluster against human genes or translated proteins. The data provided by CEG for each essential gene are very important for evolutionary research and drug design [1, 2, 7].
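The difference between the two size measures can be illustrated with a short sketch. The member list below is hypothetical apart from the M. pulmonis example taken from the text, and the function name is an assumption made for illustration only.

```python
# Hypothetical cluster members: (gene_name, strain)
members = [
    ("ligA", "Mycoplasma pulmonis UAB CTIP"),
    ("ligA", "Mycoplasma pulmonis UAB CTIP"),  # second copy in the same genome
    ("ligA", "another strain (hypothetical)"),
]

def cluster_sizes(members):
    """Return (host_number, gene_number) for one cluster."""
    gene_number = len(members)                           # every gene copy counts
    host_number = len({strain for _, strain in members})  # each strain counts once
    return host_number, gene_number

host_number, gene_number = cluster_sizes(members)
```

Here the gene number exceeds the host number because one genome contributes two copies of the same gene.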
Table: KO and COG classifications in CEG. Columns: abbreviated strain name; no. of essential genes assigned within KO groups; no. of essential genes assigned within COG groups; no. of essential genes assigned within both KO and COG groups; % of essential genes assigned within KO groups; % of essential genes assigned within COG groups; % of essential genes assigned within either of the two groups.
We developed the tool CEG_Match for prediction of essential genes based on their functions. The rationale for this method is as described by Guo et al. [25]. When using CEG_Match to predict essential genes, the end-user only needs to provide the standard names or synonymous names of genes in the query bacterial genome. For example, ‘dnaA’ is the name of a gene encoding one type of chromosomal replication initiation protein. This information is usually contained in files with the extension ptt in GenBank and RefSeq annotations. Note that CEG_Match is aware of gene name synonyms.
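A minimal sketch of name-based matching is given below. It is not the CEG_Match implementation: the synonym table, the lookup structure, and the function name are hypothetical, and the interpretation of the parameter K as the minimum number of strains in which a matching cluster is essential is an assumption made here for illustration. The strain counts for dnaA and gltD are those quoted later in the text.

```python
# Number of strains in which each cluster is essential (dnaA and gltD from the text)
CLUSTER_HOSTS = {"dnaA": 13, "gltD": 2}

# Hypothetical synonym table: lower-cased query name -> canonical cluster name
SYNONYMS = {"dnaa": "dnaA", "gltd": "gltD", "alt_dnaa": "dnaA"}

def predict_essential(gene_names, k=3):
    """Return the query names whose matching cluster covers at least k strains.

    Assumption: K is read as a threshold on the number of matching strains.
    """
    predicted = []
    for name in gene_names:
        canonical = SYNONYMS.get(name.lower())
        if canonical is not None and CLUSTER_HOSTS.get(canonical, 0) >= k:
            predicted.append(name)
    return predicted

predicted = predict_essential(["dnaA", "gltD", "unknownGene"], k=3)
```

Under this reading, a larger K trades recall for fewer false positives, since only genes essential in many strains are reported.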
User interface and database usage
We have created a freely accessible web interface for visiting the CEG database. It includes five core-page sections: “home”, “browse”, “search”, “blast”, and “predict”.
Users can access the CEG database at the URL http://cefg.uestc.edu.cn/ceg/. The home page contains an introduction to the CEG database and provides a contact email address through which users can send feedback and suggestions to the administrators. A brief user guide is also provided on this page.
This section includes an overview page, a cluster information page, and a gene information page. On the overview page, users can browse the basic information of each gene cluster (or gene) in the CEG database. This information includes CEG (or gene) id, cluster name, KO id, COG id, cluster size, enzyme id (EC), the number of strains covered, and the e-value of the similarity alignment for every cluster (or gene) compared against human proteins. The cluster information page provides the phyletic profile of a cluster, and a list of included genes. Users can easily estimate the conservation of a cluster at different clade levels (phylum, class, order, family, and genus). Links to external information on each essential gene or cluster are also given. For example, users are able to open an information page for a group in the KO website through links on the KO id, and quickly investigate whether an equivalent gene appears in a CEG cluster. For clusters that do not have COG codes in the COG database, or a KO id in KEGG, an exception-handling interface alerts the end-user that these records are not found in the KO database. In addition, a link to retrieve the original information in DEG [12, 13] is provided for each gene. Clicking the HPRD id opens a page containing information on the best HPRD hit in the human genome. The whole essential gene set in the CEG database can be sorted in ascending or descending order according to indexes such as cluster size and similarity value. This helps users mine information of interest. For example, a researcher wanting to design a broad-spectrum antibiotic would quickly be able to identify the dnaA gene as a potential drug target because it is conserved in 13 strains and is not homologous to human genes. Meanwhile, the gltD gene would quickly be dismissed as a drug target because it is only conserved in two strains, and is homologous to human genes.
Search & blast pages
End-users can search the CEG database by cluster name, cluster size, cluster function, or cluster id. CEG allows users to paste or upload sequences, and BLAST the query sequences against all clusters of essential genes contained within the database.
CEG usage case
Campylobacter jejuni is a zoonotic pathogenic bacterium. It can cause a variety of human and animal diseases, and is considered the main cause of bacterial diarrhea in humans. The complete genome of C. jejuni NCTC 11168 was sequenced in 2012. We downloaded these data from the NCBI ftp site. The genome has 1621 proteins (or genes), 803 of which have been annotated with detailed functions (with standard gene names given in the annotation file). In the following example, we predicted essential genes of C. jejuni NCTC 11168 using CEG_Match, and identified potential drug target genes in silico.
First, we collected the 803 annotated gene names, and predicted 374 genes as essential using a setting of K = 3 in CEG_Match (Additional file 1). Second, we retrieved the ESAHP values of these genes from the CEG database. Genes with an ESAHP value larger than 10e-3 were considered as potential targets of innocuous antibacterial drugs, and this led to 120 potential drug targets being predicted in C. jejuni NCTC 11168 (Additional file 1: Table S1). Fifty-seven genes, with a match number greater than seven in the CEG database, have a high potential to be broad-spectrum antibiotic drug targets (Additional file 1: Table S1).
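The two filtering rules used in this case study can be sketched as below. The gene names and values are hypothetical placeholders rather than results from Additional file 1, and the function name is an assumption; only the two criteria (an ESAHP value above the cutoff, and a match number greater than seven for broad-spectrum candidates) come from the text.

```python
# Cutoff written as in the text; as a Python float, 10e-3 equals 0.01
ESAHP_CUTOFF = 10e-3

# Hypothetical predicted essential genes: name -> (ESAHP e-value, match number)
predicted_genes = {
    "geneA": (5.0, 12),     # weak similarity to human proteins, widely conserved
    "geneB": (1e-40, 15),   # strong similarity to a human protein
    "geneC": (0.3, 4),      # weak similarity to human proteins, few matches
}

def drug_targets(genes, cutoff, min_matches=None):
    """Select genes whose ESAHP exceeds the cutoff.

    If min_matches is given, also require a match number greater than it.
    """
    hits = []
    for name, (esahp, matches) in genes.items():
        if esahp > cutoff and (min_matches is None or matches > min_matches):
            hits.append(name)
    return sorted(hits)

targets = drug_targets(predicted_genes, ESAHP_CUTOFF)
broad_spectrum = drug_targets(predicted_genes, ESAHP_CUTOFF, min_matches=7)
```

A large e-value means no significant human homolog was found, which is why the first filter keeps values above the cutoff rather than below it.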
DEG, OGEE vs. CEG
Bacterial essential genes have recently attracted increased attention because of their importance in the fields of antibacterial drug design [1–3, 7], synthetic biology [2, 4], and life origin research [1, 6]. The essential gene databases, DEG [12, 13] and OGEE [16], contain essential genes determined by experimental methods, and have been used for evolutionary research and drug design. However, they only include basic sequence information rather than a full integration of resources that are convenient for evolutionary research and drug design. The CEG database was developed with the rationale of enriching the information contained in the existing essential gene databases. In the CEG database, each cluster (gene) is provided with indices such as cluster size, and the results of similarity alignments for every cluster (gene) against human proteins. This information is not found in DEG or OGEE. The cluster size is helpful when devising broad-spectrum or specific drugs (corresponding to Rule 1 mentioned above). It is also useful for functional or evolutionary genomics research in bacteria.
Sequence homology approaches vs. CEG_Match
Sequence homology approaches are the foundation for functional inference. Although powerful, homology approaches have their limitations. For instance, they do not give information on direct functional links among non-homologous genes. Lord et al. found a strong correlation between gene annotation (function) and sequence similarity (homology); however, some protein pairs deviate from this trend. A consideration of the above factors led to the development of the CEG_Match tool for predicting essential genes contained in the CEG database. This tool and methodology make full use of the annotations of the genes. Because the annotated functions are mainly obtained through homology alignment, it follows that the tool is also based on alignment data. However, it does not use sequence alignments directly; instead, it uses the annotation information that those alignments generated.
Table: Prediction results of BLAST and CEG_Match. Columns: abbreviated strain name; BLAST (e-value <1e-10); K = 3; K = 4. Reported measures for each setting: false positive rate (%) and loss rate (%).
The CEG database will be updated when new bacterial essential genes are experimentally determined at the genome scale. In the next version, we will also take OGEE as a data source. Furthermore, phylogenetic relationships among genes in every cluster and the Gene Ontology information of CEG genes will be incorporated into the CEG database. To make the CEG database more comprehensive for drug design and related fields, the structures of the proteins encoded by each essential gene will also be incorporated into the database.
In this study, we propose the term Cluster of Essential Genes (CEG), and construct a database to deposit these essential gene clusters. The CEG database has the following features: (I) it stores essential genes in the form of orthologous groups instead of as single genes; (II) it provides an essential gene prediction tool (CEG_Match), which could greatly decrease the number of FPs when predicting essential genes in comparison to the similarity alignment method; (III) it makes it easy for the end-user to determine whether an essential gene is conserved in multiple species or is species-specific; and (IV) ESAHP and ESAHG values in the CEG database allow the end-user to easily obtain the similarity of every cluster against human proteins or genes. Features (III) and (IV) are important properties for drug target discovery [1–3, 7].
Availability and requirements
DEG: Database of essential genes
COG: Clusters of orthologous groups
HPRD: The human protein reference database
OGEE: Online gene essentiality database
ESAHP: The e-value of similarity alignment (PSI-BLAST) for every cluster against human proteins
ESAHG: The e-value of similarity alignment (BLASTN) for every cluster against human genes
Bacillus subtilis 168
Francisella novicida U112
Mycoplasma pulmonis UAB CTIP
Salmonella enterica serovar Typhi
Staphylococcus aureus N315
Staphylococcus aureus NCTC 8325
Vibrio cholerae N16961.
The authors are grateful to the reviewers for their valuable comments, which have led to improvements of this paper. This work was supported by the National Natural Science Foundation of China (grant numbers 31071109 and 60801058); the China Postdoctoral Science Foundation (grant numbers 201104687 and 2013M540705); the Key Technology Research and Development Program of Sichuan Province (grant number 2011FZ0034); and the program for New Century Excellent Talents in University (grant number NCET-11-0059).
- Juhas M, Eberl L, Glass JI: Essence of life: essential genes of minimal genomes. Trends Cell Biol. 2011, 21 (10): 562-568. 10.1016/j.tcb.2011.07.005.
- Juhas M, Eberl L, Church GM: Essential genes as antimicrobial targets and cornerstones of synthetic biology. Trends Biotechnol. 2012, 30 (11): 601-607. 10.1016/j.tibtech.2012.08.002.
- Battista JR, Juhas M, Stark M, von Mering C, Lumjiaktase P, Crook DW, Valvano MA, Eberl L: High Confidence Prediction of Essential Genes in Burkholderia Cenocepacia. PLoS ONE. 2012, 7 (6): e40064. 10.1371/journal.pone.0040064.
- Davierwala AP, Haynes J, Li Z, Brost RL, Robinson MD, Yu L, Mnaimneh S, Ding H, Zhu H, Chen Y, et al: The synthetic genetic interaction spectrum of essential genes. Nat Genet. 2005, 37 (10): 1147-1152. 10.1038/ng1640.
- Gibson DG, Glass JI, Lartigue C, Noskov VN, Chuang RY, Algire MA, Benders GA, Montague MG, Ma L, Moodie MM, et al: Creation of a bacterial cell controlled by a chemically synthesized genome. Science. 2010, 329 (5987): 52-56. 10.1126/science.1190719.
- Koonin EV: Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol. 2003, 1 (2): 127-136. 10.1038/nrmicro751.
- Haselbeck R, Wall D, Jiang B, Ketela T, Zyskind J, Bussey H, Foulkes JG, Roemer T: Comprehensive essential gene identification as a platform for novel anti-infective drug discovery. Curr Pharm Des. 2002, 8 (13): 1155-1172. 10.2174/1381612023394818.
- Hu W, Sillaots S, Lemieux S, Davison J, Kauffman S, Breton A, Linteau A, Xin C, Bowman J, Becker J, et al: Essential Gene Identification and Drug Target Prioritization in Aspergillus fumigatus. PLoS Pathog. 2007, 3 (3): e24. 10.1371/journal.ppat.0030024.
- Roemer T, Jiang B, Davison J, Ketela T, Veillette K, Breton A, Tandia F, Linteau A, Sillaots S, Marta C, et al: Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery. Mol Microbiol. 2003, 50 (1): 167-181. 10.1046/j.1365-2958.2003.03697.x.
- Barh D, Kumar A: In silico identification of candidate drug and vaccine targets from various pathways in Neisseria gonorrhoeae. In Silico Biol. 2009, 9 (4): 225-231.
- Amineni U, Pradhan D, Marisetty H: In silico identification of common putative drug targets in Leptospira interrogans. J Chem Biol. 2010, 3 (4): 165-173. 10.1007/s12154-010-0039-1.
- Zhang R, Lin Y: DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 2009, 37 (Database issue): D455-D458. 10.1093/nar/gkn858.
- Zhang R, Ou HY, Zhang CT: DEG: a database of essential genes. Nucleic Acids Res. 2004, 32 (Database issue): D271-D272.
- Koonin EV: How many genes can make a cell: the minimal-gene-set concept. Annu Rev Genomics Hum Genet. 2000, 1: 99-116. 10.1146/annurev.genom.1.1.99.
- Itaya M: An estimation of minimal genome size required for life. FEBS Lett. 1995, 362 (3): 257-260. 10.1016/0014-5793(95)00233-Y.
- Chen WH, Minguez P, Lercher MJ, Bork P: OGEE: an online gene essentiality database. Nucleic Acids Res. 2011, 40 (D1): D901-D906.
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32 (Database issue): D277-D280.
- Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28 (1): 33-36. 10.1093/nar/28.1.33.
- Deng J, Deng L, Su S, Zhang M, Lin X, Wei L, Minai AA, Hassett DJ, Lu LJ: Investigating the predictability of essential genes across distantly related organisms using an integrative approach. Nucleic Acids Res. 2010, 39 (3): 795-807.
- Kandasamy K, Keerthikumar S, Goel R, Mathivanan S, Patankar N, Shafreen B, Renuse S, Pawar H, Ramachandra YL, Acharya PK, et al: Human Proteinpedia: a unified discovery resource for proteomics research. Nucleic Acids Res. 2009, 37 (Database issue): D773-D781. 10.1093/nar/gkn701.
- Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, et al: Human Protein Reference Database--2009 update. Nucleic Acids Res. 2009, 37 (Database issue): D767-D772.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- McGinnis S, Madden TL: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004, 32 (Web Server issue): W20-W25.
- Chen F, Mackey AJ, Stoeckert CJ, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006, 34 (Database issue): D363-D368.
- Guo FB, Ning LW, Huang J, Lin H, Zhang HX: Chromosome translocation and its consequence in the genome of Burkholderia cenocepacia AU-1054. Biochem Biophys Res Commun. 2010, 403 (3–4): 375-379.
- Parkhill J, Wren BW, Mungall K, Ketley JM, Churcher C, Basham D, Chillingworth T, Davies RM, Feltwell T, Holroyd S, et al: The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature. 2000, 403 (6770): 665-668. 10.1038/35001088.
- Revez J, Schott T, Rossi M, Hanninen ML: Complete genome sequence of a variant of Campylobacter jejuni NCTC 11168. J Bacteriol. 2012, 194 (22): 6298-6299. 10.1128/JB.01385-12.
- Zhou J, Thompson DK, Xu Y, Tiedje JM: Microbial functional genomics. 2004, Hoboken, New Jersey, USA: Wiley-Liss.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003, 19 (10): 1275-1283. 10.1093/bioinformatics/btg153.
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity?. J Mol Biol. 2003, 333 (4): 863-882. 10.1016/j.jmb.2003.08.057.
- Xu Z, Hao B: CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes. Nucleic Acids Res. 2009, 37 (Web Server issue): W174-W178.
- Qi J, Luo H, Hao B: CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res. 2004, 32 (Web Server issue): W45-W47.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.