- Research article
- Open Access
Towards the identification of essential genes using targeted genome sequencing and comparative analysis
- Adam M Gustafson†1,
- Evan S Snitkin†1,
- Stephen CJ Parker1,
- Charles DeLisi1, 2 and
- Simon Kasif1, 2, 3
© Gustafson et al; licensee BioMed Central Ltd. 2006
- Received: 18 August 2006
- Accepted: 19 October 2006
- Published: 19 October 2006
The identification of genes essential for survival is of theoretical importance in the understanding of the minimal requirements for cellular life, and of practical importance in the identification of potential drug targets in novel pathogens. With the great time and expense required for experimental studies aimed at constructing a catalog of essential genes in a given organism, a computational approach which could identify essential genes with high accuracy would be of great value.
We gathered numerous features which could be generated automatically from genome sequence data and assessed their relationship to essentiality, and subsequently utilized machine learning to construct an integrated classifier of essential genes in both S. cerevisiae and E. coli. When looking at single features, phyletic retention, a measure of the number of organisms an ortholog is present in, was the most predictive of essentiality. Furthermore, during construction of our phyletic retention feature we for the first time explored the evolutionary relationship among the set of organisms in which the presence of a gene is most predictive of essentiality. We found that in both E. coli and S. cerevisiae the optimal sets always contain host-associated organisms with small genomes which are closely related to the reference. Using five optimally selected organisms, we were able to improve predictive accuracy as compared to using all available sequenced organisms. We hypothesize the predictive power of these genomes is a consequence of the process of reductive evolution, by which many parasites and symbionts evolved their gene content. In addition, essentiality is measured in rich media, a condition which resembles the environments of these organisms in their hosts where many nutrients are provided. Finally, we demonstrate that integration of our most highly predictive features using a probabilistic classifier resulted in accuracies surpassing any individual feature.
Using features obtainable directly from sequence data, we were able to construct a classifier which can predict essential genes with high accuracy. Furthermore, our analysis of the set of genomes in which the presence of a gene is most predictive of essentiality may suggest ways in which targeted sequencing can be used in the identification of essential genes. In summary, the methods presented here can aid in the reduction of time and money invested in essential gene identification by targeting those genes for experimentation which are predicted as being essential with a high probability.
- Positive Predictive Value
- Essential Gene
- Codon Bias
- Flux Balance Analysis
- Protein Size
A fundamental step in understanding how cells function is the comprehension of the minimal gene set required to sustain life. Before the core requirements for cellular life can be understood, it is necessary to identify the components of this set in diverse organisms. To date, prediction and discovery of essential genes have been supported by a significant amount of experimental work. Procedures such as single gene knockouts, RNA interference, and conditional knockouts have been used as discovery mechanisms, but each of these techniques requires a large investment of time and skill to perform. With the increase in availability of gene knockout data, many studies have been undertaken in an attempt to decipher the characteristics of essential genes. Through the analysis of essential genes in numerous organisms, fundamental evolutionary mechanisms and genomic fingerprints may be uncovered which will aid in essential gene identification in organisms lacking experimental validation.
Several studies have taken advantage of the abundance of experimental data available for model organisms in order to understand the properties of essential genes. For example, several groups have suggested that there is a relationship between degree in protein-protein interaction networks and essentiality [4, 5]. The implication is that the hubs of the networks are of increased importance because of their abundance of interaction partners. Other studies have revealed relationships between essentiality and the number of transcription factor binding sites upstream of a gene. It was demonstrated that those genes with more complex regulation are enriched in dispensable genes. High accuracy predictions of essential genes have also been made using flux balance analysis. This method has the advantage of generating hypotheses regarding which genes are likely to be essential under a wide variety of hypothetical conditions. There is little doubt that with the plethora of experimental data being generated, additional properties of essential genes will be documented in the coming years.
While genome-wide experimental data is abundant in model organisms such as S. cerevisiae and E. coli, information is often limited for newly sequenced organisms, which precludes the use of such data for identification of genes essential for survival. The ability to identify essential genes in the absence of experimental data is of added importance, because it allows for a system to rationally select possible drug targets for newly sequenced pathogens. Fortunately, in addition to the relationships between essentiality and various experimental measures, there has been a good deal of research aimed at understanding the genomic features of essential genes. Metrics such as codon bias, number of paralogs, and phyletic retention have all been shown to be distinguishing of essential genes [8–10]. As essential genes are under a unique evolutionary pressure, it is likely that they share many other characteristics which may be gleaned from genome sequence data.
With both the practical and theoretical importance of the identification of essential genes in mind, we set out to construct an effective classifier of essential genes which exploited various genomic descriptors that could be generated directly from sequence data. Previous works aimed at understanding the properties of human disease genes have taken a similar approach [11, 12]. Interestingly, many of the predictive descriptors of human disease genes identified by Kondrashov et al. were identified in our study as being predictive of microbial essential genes.
As an initial step in the construction of our classifier, we explored various experimental and genomic metrics to assess how they relate to essentiality in both E. coli and S. cerevisiae. A metric which has been shown to be highly predictive of essentiality in previous studies was the retention of genes across different phyla. In order to extract the most from this metric we identified subsets of organisms achieving the highest accuracy in prediction of essential genes. The most predictive sets contained host-associated organisms with small genomes which are closely related to the query organism. This result directly suggests how targeted sequencing of genomes can be used in the prediction of essential genes. With the rapidly decreasing investments of both money and time required for the sequencing of a microbial genome, this approach of identifying essential genes through targeted sequencing may itself become a viable alternative in the identification of essential genes.
In addition to phyletic retention, we also assessed the relative performance of other previously reported indicators of essentiality such as protein interaction degree, protein size and codon bias. In S. cerevisiae, protein interaction degree was found to be highly predictive of essentiality, while protein size and codon bias were predictive in both organisms. Additionally, in S. cerevisiae we observed that high counts of certain individual charged amino acids were more predictive than size alone, implying that high counts of these amino acids distinguish proteins in a way not completely captured by their size.
Interestingly, although the aforementioned metrics were predictive of essentiality in both organisms, their relative importance and the sets of genes identified varied. The most striking difference is the relationship between protein size and essentiality. In E. coli, small proteins are enriched in essential genes, while in S. cerevisiae essential genes are underrepresented among the smallest proteins. We hypothesize that this may be indicative of a pressure in E. coli to maintain small proteins in the absence of other functional constraints, as has been previously suggested.
After identifying the genomic features most predictive of essentiality in E. coli and S. cerevisiae, we quantified the predictive limits of our assembled genomic characteristics by integrating them using a probabilistic classifier in conjunction with a feature selection criterion that is novel to bioinformatic applications. Using only easily obtainable genomic features from S. cerevisiae and E. coli, we show that our ability to predict essential genes is competitive with classifiers which included experimental features. The fact that we were able to construct our classifier using only descriptors generated from sequence data will allow broad application of this technique to other organisms, with only gene annotation being required. This ability has the potential to impact both the understanding of essential genes in different organisms, as well as the search for drug targets in poorly understood pathogens.
Experimental definition of essentiality
During our analysis of the relationship between various genomic features and essentiality, it is important to put our results in the context of our definition of essentiality. The definition used in most experiments is based on the growth, or lack thereof, of mutants under rich media conditions. Clearly, these conditions are not representative of the wild type environments which most of these organisms inhabited as they evolved their gene content. Studying metabolic pathways in S. cerevisiae using flux balance analysis indicates that as many as two-thirds of metabolic enzymes may be essential under some condition, while experiments in the presence of rich media result in roughly 20% of genes being labeled as essential . Furthermore, analysis of the set of genes which have been labeled as dispensable in knockout experiments, which are also ubiquitously present throughout different phyla, revealed an overrepresentation of biosynthetic pathways . It is likely that these pathways are essential under wild type conditions, as suggested by their retention throughout evolution, but with a surplus of nutrients provided they are identified as dispensable. Undoubtedly, there is room for debate as to a meaningful definition of essentiality, but it seems that, in general, most genes which are required under rich media conditions will be vital under most other conditions. Therefore, we feel that despite these inconsistencies, it is still valuable to understand the properties of this artificial set of essential genes, because although the comprehensiveness of the set can be questioned, its accuracy should only be limited by experimental bias.
Selection of organisms for phyletic retention measure
The plummeting cost of genome sequencing is making comparative genomics an attractive technique, and creative bioinformatic methods which take advantage of targeted sequencing are becoming more prevalent [14–16]. In this vein, we set out to understand the evolutionary properties of the sets of organisms in which the presence of an ortholog would be most indicative of the essentiality of a gene, with the hope that by sequencing the appropriate genomes, high accuracy predictions of essential genes can be made. Retention of a gene over long evolutionary periods in a form that allows recognition using sequence similarity based techniques suggests that it is performing a critical function. In previous studies, sets of organisms varying from a few distantly related organisms to several closely related organisms have been utilized in the examination of the relationship between the retention of genes and their essentiality [8, 9]. To our knowledge, a systematic analysis with the aim of understanding the nature of a set of organisms in which presence of a gene is most predictive of essentiality has not been performed.
Table: Influence of organism composition on phyletic retention performance. Column headings: Number of Genes in Group; Number of Essential Genes. Organism sets compared: CMIM-selected set of 5; CMIM-selected set of 7; set of 27 non-parasitic Gamma Proteobacteria; set of 179 prokaryotes.
Performing a similar analysis in S. cerevisiae using a set of 26 sequenced eukaryotes returned Schizosaccharomyces pombe, Encephalitozoon cuniculi, and Eremothecium gossypii as the three most abundant organisms in the most highly predictive sets of five (Figure 3B). Both E. cuniculi and E. gossypii lead host-associated lifestyles, corroborating our interpretation of the optimal organism sets in E. coli. Furthermore, E. gossypii is among the smallest known eukaryotic genomes, with a 9.2 Mb nuclear genome.
Performance of individual features in yeast
As expected, the best performing genomic feature was the phyletic retention measure, whose construction was described above. For clarity, the term 'phyletic retention' was used to describe the presence of an ortholog in other organisms, in place of the term 'conservation,' in order to prevent confusion with measures of substitution rate. A second feature from our genomic set which was predictive of essentiality was the total upstream size of a gene. Genes with the largest upstream sizes are markedly enriched in dispensable genes. This result may be explained when considering the recent results by Yu et al. showing that genes with complex regulation are enriched in dispensable genes, in conjunction with the possibility that genes with more complex regulation may have larger upstream regions in order to accommodate an increased number of cis elements . This connection has previously been shown to be valid in Caenorhabditis elegans and Drosophila melanogaster . Our results suggest that a relationship exists between regulatory complexity and intergenic distance in S. cerevisiae and that this relationship accounts for the association between intergenic distance and essentiality. Given that the number of transcription factor binding sites present in the promoter of a gene is determined through arduous experimental procedures, it is beneficial to be able to use upstream size as a proxy for regulatory complexity.
Examining single features from the protein subset revealed several as being highly predictive of essentiality. In addition to previously discussed descriptors such as codon bias and protein size, we also identified enrichment in essential genes among proteins with an abundance of certain amino acids. Specifically, proteins with the highest counts of aspartate, glutamate, and lysine are enriched in essential genes with PPVs of 29.6%, 31.5% and 30.0% respectively in the top 10% of predictions. This trend is partially explained by observing that large proteins in general are enriched in essential genes, with a PPV of 25.8% among the largest 10% of proteins. Although this relationship in part explains high amino acid counts being predictive of essentiality, it fails to fully clarify why specific amino acids are more predictive than others. In order to gain insight into this phenomenon, we looked for enrichment in GO molecular function categories for proteins with high counts of charged amino acids that were not present among the largest 10% of proteins. Although no individual function attained a significant p-value, the functions present almost all involved either catalytic activity or an interaction with nucleic acids. Charged amino acids are often present in the active sites of enzymes, where they participate in catalytic mechanisms. Additionally, charged amino acids make associations with charged substrates more favorable due to electrostatic interactions. Based on these observations, we hypothesize that the enrichment in essential genes among those proteins with high counts of charged amino acids is in part because of the functional capabilities of these amino acids.
In addition to examining the predictive power of features from our genomic and protein set, we also measured the prediction accuracy of some experimentally derived features previously reported to be indicative of essentiality. We observed that genes with a high degree in a protein interaction network are more likely to be essential, which is in agreement with previous work [5, 24, 25]. Among those proteins in the top 5% for degree, 42% are essential, a considerable enrichment in essential genes when compared to the ~17% expected by chance alone. It should be noted that it has been stated in the literature that the relationship between degree and essentiality is at least partly due to biases in the data. Specifically, Coulomb et al. state that the protein interaction dataset from the Database of Interacting Proteins (DIP), which we used in this analysis, is biased towards essential genes due to the accumulation of interactions from small scale experiments which are partial towards essential genes. The authors go on to state that this partiality accounts for a significant component of the relationship between degree and essentiality. This contention was then substantiated by showing the disappearance of the relationship between essentiality and connectivity when using unbiased whole-genome yeast two-hybrid experiments. Others have recently stated that the lack of reliability and completeness of yeast two-hybrid data is responsible for the absence of a relationship between connectivity and essentiality.
Performance of individual features in E. coli
Despite a large evolutionary distance and fundamental biological differences, the characteristics of essential genes in S. cerevisiae and E. coli are largely similar. As seen in Figure 4B, the strongest predictor of essentiality in E. coli, other than phyletic retention, is CAI. Number of paralogs and protein size were also predictive of essentiality in both organisms. Features for all E. coli genes used in this study are available in Additional files [see Additional file 4].
Differences in feature performance between S. cerevisiae and E. coli
Although most of the features performed comparably in both organisms, there were some noticeable differences. For example, although features such as protein size and codon bias were predictive of essentiality, their accuracy as well as the sets of proteins which they identified varied between the two organisms.
Figure 5C yields insight into the lack of essential genes among the smallest S. cerevisiae ORFs. The enrichment of dispensable genes amid small ORFs in S. cerevisiae seems to be a consequence of an abundance of small species specific genes. In both organisms species specific genes are enriched in dispensable genes, as would be expected based on the predictive power of phyletic retention in identifying essential genes. Therefore, an abundance of small species specific genes leads to the apparent trend of dispensability among the smallest S. cerevisiae genes. It should be noted that the organisms used in this phyletic conservation analysis were a set of 16 sequenced fungi and protists, so as to have a more diverse set of genomes than used in the optimal predictive set.
Performance of integrated features in yeast
To determine the limits of our predictive abilities in the classification of essential genes when integrating multiple features, we utilized naïve Bayes classifiers. We assigned all of our features into three different overlapping sets, in order to assess the relative contributions of different subsets of features. The first set, which we will designate as SC_GenProt, is composed of all features which can be obtained directly from sequence data. Our second set, which is designated as SC_GenProt_No, is identical to SC_GenProt, but lacks the phyletic retention measure. We included this set in order to assess our ability to identify less conserved essential genes. Our third set, designated as SC_All, is composed of features that require extensive experimentation, in addition to all easily obtainable features, so that we could assess the impact of neglecting experimental data on our prediction accuracy. A benefit to using naïve Bayes for feature integration is that each classification is assigned a probability, making it natural to rank the predictions, which allows for direct comparison to results using individual features.
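As an illustration of how a naïve Bayes classifier turns discretized features into a rankable probability, the following is a minimal pure-Python sketch. It is not the Orange implementation used in this study; the function names, the Laplace smoothing constant `alpha`, and the data layout (one tuple of discrete feature values per gene) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies (hypothetical helper).

    rows   : list of tuples of discrete feature values, one tuple per gene
    labels : list of class labels (e.g. 1 = essential, 0 = dispensable)
    alpha  : Laplace smoothing constant (an assumption, not from the paper)
    """
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[j][c][v] = number of class-c genes whose feature j equals v
    value_counts = [defaultdict(Counter) for _ in range(n_features)]
    values = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[j][c][v] += 1
            values[j].add(v)
    return class_counts, value_counts, values, alpha

def predict_proba(model, row, positive=1):
    """Posterior probability of the positive class, assuming independent features."""
    class_counts, value_counts, values, alpha = model
    total = sum(class_counts.values())
    log_post = {}
    for c, n_c in class_counts.items():
        lp = math.log(n_c / total)  # class prior
        for j, v in enumerate(row):
            num = value_counts[j][c][v] + alpha
            den = n_c + alpha * len(values[j])
            lp += math.log(num / den)  # smoothed P(feature value | class)
        log_post[c] = lp
    # Normalize in log space for numerical stability
    m = max(log_post.values())
    norm = sum(math.exp(lp - m) for lp in log_post.values())
    return math.exp(log_post[positive] - m) / norm
```

Because every gene receives a probability rather than a hard label, the predictions can be sorted and compared directly with rankings produced by single features.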
Feature selection was accomplished by ranking features using conditional mutual information maximization (CMIM), as described in Methods [see Additional file 1 for the actual ranking]. The phyletic retention feature achieved the highest mutual information with essentiality, which is consistent with our results on single feature performance. By using the 21 most informative features in SC_All, 11 in SC_GenProt and 13 in SC_GenProt_No, we were able to improve prediction accuracy over the inclusion of all features in each set [see Additional file 2].
As SC_GenProt_No is performing significantly worse than other feature sets, it is only of use if it is identifying especially interesting genes. To assess the ability of SC_GenProt_No to identify essential genes that are less conserved, we looked at the broader conservation pattern of yeast genes in a set of 16 fungi and protists. Based on this set of 16 organisms, there were 285 essential yeast genes that had orthologs in 5 organisms or less. In the top 15% of predictions made by SC_GenProt_No, 24.6% of the 285 less conserved essential genes were identified. In contrast, only 3.5% of the 285 less conserved essential genes were identified by SC_GenProt at the same cutoff. Thus, while SC_GenProt_No has the lowest accuracy of the integrated feature sets, it is useful because of its increased ability to predict less conserved essential genes.
Performance of integrated features in E. coli
As in yeast, feature sets integrated with a naïve Bayes classifier were used to predict essentiality in E. coli. Two sets containing easily obtainable features were analyzed, EC_GenProt and EC_GenProt_No. No experimental feature set was used due to a lack of available genome wide analyses.
The phyletic retention measure was, as in yeast, found to be the most informative feature when ranking by conditional mutual information [Additional file 1]. However, where in yeast we obtained the best PPV when using 13 features, E. coli required only the top four: phyletic retention, serine, tryptophan and paralog count (9 features were found to optimally classify EC_GenProt_No). Figure 6B shows the performance of the integrated features in E. coli, where PPV is shown for the top 1, 5, 10, 15 and 20% of predictions.
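The positive predictive value (PPV) at a given cutoff can be computed as sketched below. This is a minimal illustration, assuming that each gene has a score (for example, a classifier probability) and a known essentiality label; the function name and the rounding rule used to convert a fraction into a number of genes are choices made for this example.

```python
def ppv_at_top(scores, is_essential, fraction):
    """PPV among the top `fraction` of genes ranked by descending score.

    scores       : list of numeric scores, one per gene
    is_essential : list of booleans (or 0/1), same order as scores
    fraction     : e.g. 0.05 for the top 5% of predictions
    """
    ranked = sorted(zip(scores, is_essential), key=lambda p: p[0], reverse=True)
    # At least one gene is always evaluated (an assumption for tiny inputs)
    k = max(1, int(round(fraction * len(ranked))))
    top = ranked[:k]
    return sum(1 for _, essential in top if essential) / k
```

Calling `ppv_at_top` with fractions 0.01, 0.05, 0.10, 0.15 and 0.20 would give the five values reported per feature set.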
The identification of essential genes has largely been an experimental effort, achieved through whole-genome knockout techniques. While in some organisms such as C. elegans it is possible to devise highly effective screens for essential genes using siRNAs, the cost and/or ineffectiveness of this technique in other organisms makes its broad application currently infeasible. In this paper we assessed the potential effectiveness of a methodology in which genes are first prioritized based on their likelihood of testing positive in a lethality screen, after which small-scale knockout screens can be performed on the top predictions to obtain experimentally validated genes.
We investigated the efficacy of this strategy by using available knockout experiments to assess the predictive power of features that are easily obtainable from sequence data and then integrating them using machine learning methodologies. By integrating genomic and protein characteristics of varying predictive power using a probabilistic classifier with feature selection, we were able to achieve an overall predictive accuracy in both S. cerevisiae and E. coli that was superior to the performance of any individual feature. The use of several descriptors will make our classifier more robust than using individual features whose predictive power is likely to vary a great deal among different organisms. For example, codon bias is a strong predictor of essential genes in both organisms studied here, but a study of 80 bacterial genomes revealed that 30% have no codon bias . Furthermore, we were able to classify essential genes with a reasonable accuracy even without the use of a gene conservation measure such as phyletic retention, providing the added benefit of identifying essential genes which may be organism specific. The ability to identify essential genes from sequence data alone has the potential to be of great practical importance in guiding the investigations of researchers searching for potential drug targets in newly sequenced pathogens.
In the process of constructing an integrated classifier, the relationships between various genomic characteristics and essentiality were explored. In both E. coli and S. cerevisiae, phyletic retention, protein size, and codon bias were identified as being among the single features most predictive of essentiality. Furthermore, we showed that the most predictive groups of organisms used in a phyletic retention measure contain host-associated organisms which are closely related to the reference organism. Despite the influence of our artificial definition of essentiality on the selection of our optimal genome sets, this result is still useful in suggesting how targeted sequencing can be used in the identification of essential genes in other organisms. In addition to phyletic retention and codon bias, which have been related to essentiality in previous studies, we identified a relationship between protein size and essentiality, which to our knowledge has not been explored before. Specifically, we observed that the nature of this relationship differed for E. coli and S. cerevisiae, with small protein size being indicative of essentiality in E. coli and the same being true of large proteins in S. cerevisiae. Moreover, among the largest E. coli proteins, only those which are the most conserved are essential. We hypothesize that these observations are both indicative of a pressure to maintain a small proteome in E. coli.
In summary we have made strides towards the prediction of essential genes based solely on sequence data on two fronts. First, we have gained insight into the properties of sets of organisms in which the presence of an ortholog is most predictive of essentiality. Second, we have assessed the predictive power of several sequence based features, and achieved superior prediction accuracy through integration with a probabilistic framework and intelligent feature selection.
Sets of essential genes
Essential gene definitions were taken from Giaever et al. and Gerdes et al. for S. cerevisiae and E. coli, respectively. Additionally, for E. coli all ORFs shorter than 80 amino acids were removed. For S. cerevisiae all ORFs were removed whose FASTA headers contained the keywords "transposable" or "mitochondrial". In total, 4,728 yeast genes were used, 966 of which are essential. In E. coli, 3,569 genes were used, of which 611 are essential.
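The gene counts above fix the baseline rate against which any PPV should be judged. The short sketch below simply works out that arithmetic from the numbers stated in this section; the variable and function names are illustrative.

```python
# (essential genes, total genes) as stated in the text
datasets = {
    "S. cerevisiae": (966, 4728),
    "E. coli": (611, 3569),
}

def essential_fraction(n_essential, n_total):
    """Fraction of genes labeled essential, i.e. the PPV of a random ranking."""
    return n_essential / n_total

for name, (n_essential, n_total) in datasets.items():
    print(f"{name}: {essential_fraction(n_essential, n_total):.1%}")
```

A feature or classifier is informative only to the extent that its PPV among top-ranked genes exceeds these baseline fractions.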
Selection of Features
S. cerevisiae and E. coli are both model organisms which have been very well studied over the years. We capitalized on this fact by assessing the relationship between a variety of gene properties and essentiality. Following is a list of predictors and our rationale for including them. Note that those features used only in S. cerevisiae are marked with a star, and those used only in E. coli are marked with two stars.
The following parameters require a large amount of experimental work to obtain. These features were only used in our classifiers integrating all features, and were excluded from those using only 'easily obtainable' features.
*Protein interaction network degree
Generated from the curated interactions accumulated in the Database of Interacting Proteins (DIP). The degree of a protein is computed by summing the number of unique interactions in which it participates. Degree, along with related metrics of network position, has been documented in the literature as being indicative of essentiality [5, 24, 25]. Protein interaction data was not included for E. coli due to low coverage.
It is known that proteins with GO transcriptional regulation annotations are enriched in essential genes. Based on this result we tested whether or not nuclear localization, along with other protein localization categories, is useful in predicting essentiality. Localization information was obtained from a previous large-scale study.
Clusters of essential genes are known to be in regions of the genome that are characterized by a lower recombination rate. All per-gene recombination rates were acquired from Gerton et al., and analyzed according to procedures used by Pal and Hurst.
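Degree as defined here (the number of unique interaction partners) can be computed from a list of pairwise interactions as in the following sketch. The input format (pairs of protein identifiers) and the decision to ignore self-interactions are assumptions made for this example, not details taken from the DIP data.

```python
from collections import defaultdict

def interaction_degree(interactions):
    """Number of unique partners per protein.

    interactions : iterable of (protein_a, protein_b) pairs; duplicates and
                   reversed duplicates are counted once.
    """
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions do not add to degree
        partners[a].add(b)
        partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}
```

Using sets of partners rather than raw counts ensures that an interaction reported by several experiments contributes only once.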
We consider the following parameters 'easily obtainable' in that they can be automatically generated from sequence data.
There is a trend for proteins to become larger throughout evolution. We therefore expected that gene size may be indicative of essentiality, especially in E. coli, as ancestral genes are likely essential.
Different genes in S. cerevisiae exhibit wide variation in their regulatory complexity. It has recently been documented that there is a relationship between regulatory complexity and essentiality. We measured regulatory complexity using the following parameters: upstream size, downstream size, upstream conservation and downstream conservation. All sizes were measured as the distance to the nearest gene. Conservation was measured as the number of bases among the (up to) 1000 bp upstream of the ORF start site and (up to) 300 bp downstream of the designated open reading frame that overlap with elements identified as being conserved in a seven species comparison (downloaded from the UCSC Genome Browser; most conserved track).
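The conservation count described above amounts to measuring how many bases of a flanking window are covered by conserved elements. A minimal sketch follows, assuming half-open coordinates on the plus strand and assuming that "up to" means the window is truncated at the neighboring gene; both helper names are hypothetical.

```python
def upstream_window(orf_start, nearest_upstream_gene_end, max_len=1000):
    """Half-open window [start, end) of at most max_len bp before the ORF start.

    Truncating at the nearest upstream gene is an assumption about what
    '(up to) 1000 bp' means in the text.
    """
    return (max(nearest_upstream_gene_end, orf_start - max_len), orf_start)

def conserved_bases(window, elements):
    """Number of bases in `window` covered by at least one conserved element.

    window   : (start, end), half-open
    elements : iterable of (start, end) conserved intervals, half-open
    """
    w_start, w_end = window
    # Clip every element to the window and sort by start coordinate
    clipped = sorted(
        (max(s, w_start), min(e, w_end))
        for s, e in elements
        if s < w_end and e > w_start
    )
    covered = 0
    cur_end = w_start
    for s, e in clipped:
        s = max(s, cur_end)  # skip bases already counted
        if e > s:
            covered += e - s
            cur_end = e
    return covered
```

The same function applies to the downstream window by passing a window of up to 300 bp after the end of the open reading frame.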
Phyletic retention measure
Genes that are ubiquitously present across different taxa are more likely to be essential. As detailed in the Results section, for yeast and E. coli separately, a set of five organisms was selected that optimally predicted essentiality. A count was made for each gene in the reference organism (yeast or E. coli) that represents the number of orthologs present in the five organisms. Bi-directional best BLAST hits were used to define an orthologous relationship (using an E-value cutoff of 0.1).
Number of paralogous genes
Genes that do not have duplicates are more likely to be essential. The rationale is that the duplicates may function in a backup capacity in the presence of a knockout mutation in the original gene. Paralogs were defined as those genes which were present in the same genome and which had a BLASTP E-value less than 10^-20. In addition, the ratio of the size of the larger gene to that of the smaller could not exceed 1.33.
Essential genes are more likely to be encoded on the leading strand of the circular chromosome.
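The counting procedure can be sketched as follows. This is an illustration only: it assumes BLAST results have already been parsed into (query, subject, E-value) tuples for each direction, treats the 0.1 cutoff as inclusive, and uses hypothetical function names.

```python
def best_hits(hits, max_evalue=0.1):
    """Best (lowest E-value) subject for each query, ignoring hits above the cutoff."""
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(forward_hits, reverse_hits, max_evalue=0.1):
    """Reference genes whose best hit in the other genome points back to them."""
    fwd = best_hits(forward_hits, max_evalue)
    rev = best_hits(reverse_hits, max_evalue)
    return {q for q, s in fwd.items() if rev.get(s) == q}

def phyletic_retention(genes, hits_by_organism, max_evalue=0.1):
    """Count, for each reference gene, the organisms that contain an ortholog.

    hits_by_organism : dict mapping organism name to a pair
                       (reference-vs-organism hits, organism-vs-reference hits)
    """
    counts = {g: 0 for g in genes}
    for forward_hits, reverse_hits in hits_by_organism.values():
        for g in reciprocal_best_hits(forward_hits, reverse_hits, max_evalue):
            if g in counts:
                counts[g] += 1
    return counts
```

With five comparison organisms, the resulting score for each gene is an integer from 0 to 5.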
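The two criteria for calling a paralog (an E-value threshold and a size-ratio limit) can be applied to within-genome BLASTP hits as in this sketch. The input format, the function name, and the choice to treat the relationship as symmetric are assumptions made for illustration.

```python
def count_paralogs(lengths, self_hits, max_evalue=1e-20, max_len_ratio=1.33):
    """Number of paralogs per gene from within-genome BLASTP hits.

    lengths   : dict mapping gene -> protein length
    self_hits : iterable of (query, subject, evalue) from an all-vs-all search
    """
    paralogs = {g: set() for g in lengths}
    for query, subject, evalue in self_hits:
        if query == subject or evalue >= max_evalue:
            continue  # skip self-hits and hits that are not below the cutoff
        longer = max(lengths[query], lengths[subject])
        shorter = min(lengths[query], lengths[subject])
        if longer / shorter > max_len_ratio:
            continue  # sizes too different to count as duplicates
        # Assumption: a qualifying hit in either direction makes both genes paralogs
        paralogs[query].add(subject)
        paralogs[subject].add(query)
    return {g: len(p) for g, p in paralogs.items()}
```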
These parameters fall in our 'easily obtainable' category as well because they require coding sequence only; no laboratory experiments are necessary. We represent protein characteristics in terms of the following metrics: amino acid composition, codon bias, codon adaptation index (CAI), frequency of optimal codons (FOP), isoelectric point (pI), hydropathicity score, and hydrophobicity score. For yeast all data was downloaded from the Saccharomyces Genome Database. For E. coli these metrics were generated using the CodonW software package.
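Of these metrics, amino acid composition is the simplest to derive from a translated coding sequence. The sketch below counts the 20 standard residues; the function name is hypothetical, and how non-standard symbols (such as a trailing stop character) were handled in the original data is not stated, so they are simply ignored here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_counts(protein_seq):
    """Count of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues (e.g. '*' or 'X') are ignored,
    which is an assumption made for this example.
    """
    counts = Counter(protein_seq.upper())
    return {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}
```

Each of the 20 counts can then be treated as a separate feature, which is how counts of residues such as aspartate, glutamate and lysine can be ranked individually.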
Integration of features
Classification of essential genes was done using the naïve Bayes implementation in the Orange machine learning package. All features, with the exception of the optimized phyletic retention measure and the binary localization features, were discretized using Fayyad and Irani's entropy discretization method, as implemented by Orange.
After discretization, the conditional mutual information maximization (CMIM) criterion, as described by Fleuret, was used to rank the features. Briefly, CMIM is an iterative method that selects, at each step, the feature vector with the highest mutual information with the class vector after conditioning on previously selected feature vectors.
Let Y be our class vector, X_n be a feature, I(Y; X_n) be the mutual information between Y and X_n, and I(Y; X_n | X_m) be the conditional mutual information between Y and X_n conditioned on X_m. CMIM was implemented as follows, with alternate implementations described by Fleuret. A score table s is initialized with the values I(Y; X_n). At each iteration the algorithm picks the feature X_m with the highest score, and then refreshes every score s[n] by taking the minimum of s[n] and I(Y; X_n | X_m). The algorithm is run until all features have been selected.
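The score-table procedure can be written compactly in Python. The sketch below is an illustrative reimplementation for discretized features, not the code used in the study; the function names and the toy data are assumptions, and mutual information is estimated from simple frequency counts.

```python
from collections import Counter
from math import log2


def mutual_information(y, x):
    """I(Y; X) in bits, estimated from co-occurrence counts."""
    n = len(y)
    joint, py, px = Counter(zip(y, x)), Counter(y), Counter(x)
    return sum(c / n * log2(c * n / (py[a] * px[b]))
               for (a, b), c in joint.items())


def conditional_mutual_information(y, x, z):
    """I(Y; X | Z) as the p(z)-weighted sum of I(Y; X) within each Z value."""
    n = len(y)
    total = 0.0
    for value, count in Counter(z).items():
        idx = [i for i in range(n) if z[i] == value]
        total += count / n * mutual_information([y[i] for i in idx],
                                                [x[i] for i in idx])
    return total


def cmim_rank(y, features):
    """Rank features (dict name -> discrete values) by the CMIM score table."""
    scores = {name: mutual_information(y, x) for name, x in features.items()}
    ranking = []
    while scores:
        best = max(scores, key=scores.get)  # feature with the highest score
        ranking.append(best)
        del scores[best]
        for name in scores:  # refresh s[n] = min(s[n], I(Y; X_n | X_m))
            cmi = conditional_mutual_information(y, features[name], features[best])
            scores[name] = min(scores[name], cmi)
    return ranking


# Toy data: "strong_copy" duplicates "strong", "weak" adds new information.
y = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "strong":      [0, 0, 0, 1, 1, 1, 1, 1],
    "strong_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "weak":        [0, 1, 0, 0, 1, 1, 0, 1],
}
ranking = cmim_rank(y, features)
```

In this toy example the duplicated feature has a high initial score, but its score drops to zero once its twin is selected, so the weaker but complementary feature is ranked ahead of it.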
For a given feature set, the optimal number of features to include in classification was determined empirically. Features were ranked by CMIM and iteratively removed one at a time, and classification accuracy was measured at each step. The optimal feature set was identified as that with the highest PPV for the top 5% of predictions, with the requirement that PPV for the top 1% must be higher [see Additional file 2].
To assign a probability of essentiality to all genes, the following procedure was used. Half of all essential and half of all non-essential genes were randomly chosen for the training set. The trained classifier was then applied to the remaining genes, assigning each a probability of essentiality. Training/testing was bootstrapped 100 times, and for each gene the probability of essentiality was taken as the average of all the probabilities assigned to it [see Additional file 6].
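The selection rule can be illustrated with a short Python sketch. It rests on one reading of the criterion, namely that a candidate feature count is eligible only when its top-1% PPV exceeds its own top-5% PPV; this interpretation, the function names and the example numbers are assumptions.

```python
def ppv_at_top(scores, labels, fraction):
    """PPV among the top `fraction` of genes ranked by predicted probability.

    labels are 1 for essential and 0 for non-essential genes.
    """
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def choose_feature_count(ppv_by_count):
    """ppv_by_count: dict n_features -> (ppv_top1, ppv_top5).

    Keep candidates whose top-1% PPV exceeds their top-5% PPV (assumed
    reading of the criterion), then maximize the top-5% PPV.
    """
    eligible = {n: p5 for n, (p1, p5) in ppv_by_count.items() if p1 > p5}
    return max(eligible, key=eligible.get) if eligible else None


# Hypothetical PPV values for three candidate feature counts.
chosen = choose_feature_count({10: (0.90, 0.80), 8: (0.80, 0.85), 6: (0.95, 0.82)})
```

With these made-up numbers the eight-feature set has the best top-5% PPV but is excluded because its top-1% PPV is lower, so six features are chosen.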
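The resampling procedure can be sketched in Python as below. The classifier is abstracted behind a hypothetical callback, fit_predict, so the sketch does not depend on any particular machine learning package; the function names, the seed parameter and the trivial example classifier are assumptions for illustration only.

```python
import random
from collections import defaultdict


def averaged_probabilities(genes, labels, fit_predict, rounds=100, seed=0):
    """Average held-out probabilities over repeated stratified 50/50 splits.

    fit_predict(train_genes, train_labels, test_genes) must return a dict
    mapping each test gene to its predicted probability of essentiality.
    """
    rng = random.Random(seed)
    label_of = dict(zip(genes, labels))
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(rounds):
        train = []
        for cls in sorted(set(labels)):  # half of each class goes to training
            members = [g for g in genes if label_of[g] == cls]
            train += rng.sample(members, len(members) // 2)
        train_set = set(train)
        test = [g for g in genes if g not in train_set]
        probs = fit_predict(train, [label_of[g] for g in train], test)
        for g, p in probs.items():
            totals[g] += p
            counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}


def prior_only(train_genes, train_labels, test_genes):
    """Placeholder classifier: predicts the training fraction of essential genes."""
    prior = sum(train_labels) / len(train_labels)
    return {g: prior for g in test_genes}


genes = ["g%d" % i for i in range(8)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
averaged = averaged_probabilities(genes, labels, prior_only)
```

Each gene is scored only in rounds where it falls in the held-out half, so its final value averages over roughly half of the rounds.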
GO and KEGG enrichment analysis
The GeneMerge software was used to calculate enrichment of KEGG annotations, using a background of all genes used in the given organism. P-values given are Bonferroni corrected.
The authors would like to thank Rich Roberts for his comments on the manuscript, and Joseph Mellor and Dustin Holloway for their valuable insights throughout the project. This work was supported by NIH grants 1P20GM066401, 1T32GM070409, R01 HG003367-01A1 and NSF grant ITR-048715.
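For readers unfamiliar with this kind of enrichment statistic, the Python sketch below shows a one-sided hypergeometric test followed by a plain Bonferroni correction. It is an assumption that this matches the calculation performed by GeneMerge in detail; the function names and the example counts are illustrative.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, `annotated` of which carry the annotation."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}


# Hypothetical example: all 3 selected genes carry a term that
# annotates 3 of the 10 background genes.
p_term = hypergeom_pvalue(population=10, annotated=3, sample=3, hits=3)
corrected = bonferroni({"term_a": 0.01, "term_b": 0.6})
```

The background population corresponds to all genes used for the given organism, as stated above.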
- Giaever G, Chu AM, Ni L, Connelly C, Riles L, Veronneau S, Dow S, Lucau-Danila A, Anderson K, Andre B, Arkin AP, Astromoff A, El-Bakkoury M, Bangham R, Benito R, Brachat S, Campanaro S, Curtiss M, Davis K, Deutschbauer A, Entian KD, Flaherty P, Foury F, Garfinkel DJ, Gerstein M, Gotte D, Guldener U, Hegemann JH, Hempel S, Herman Z, Jaramillo DF, Kelly DE, Kelly SL, Kotter P, LaBonte D, Lamb DC, Lan N, Liang H, Liao H, Liu L, Luo C, Lussier M, Mao R, Menard P, Ooi SL, Revuelta JL, Roberts CJ, Rose M, Ross-Macdonald P, Scherens B, Schimmack G, Shafer B, Shoemaker DD, Sookhai-Mahadeo S, Storms RK, Strathern JN, Valle G, Voet M, Volckaert G, Wang CY, Ward TR, Wilhelmy J, Winzeler EA, Yang Y, Yen G, Youngman E, Yu K, Bussey H, Boeke JD, Snyder M, Philippsen P, Davis RW, Johnston M: Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002, 418: 387-391. 10.1038/nature00935.
- Cullen LM, Arndt GM: Genome-wide screening for gene function using RNAi in mammalian cells. Immunol Cell Biol. 2005, 83: 217-223. 10.1111/j.1440-1711.2005.01332.x.
- Roemer T, Jiang B, Davison J, Ketela T, Veillette K, Breton A, Tandia F, Linteau A, Sillaots S, Marta C, Martel N, Veronneau S, Lemieux S, Kauffman S, Becker J, Storms R, Boone C, Bussey H: Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery. Mol Microbiol. 2003, 50: 167-181. 10.1046/j.1365-2958.2003.03697.x.
- Maslov S, Sneppen K: Protein interaction networks beyond artifacts. FEBS Lett. 2002, 530: 255-256. 10.1016/S0014-5793(02)03428-2.
- Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 41-42. 10.1038/35075138.
- Yu H, Greenbaum D, Xin Lu H, Zhu X, Gerstein M: Genomic analysis of essentiality within protein networks. Trends Genet. 2004, 20: 227-231. 10.1016/j.tig.2004.04.008.
- Papp B, Pal C, Hurst LD: Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature. 2004, 429: 661-664. 10.1038/nature02636.
- Fang G, Rocha E, Danchin A: How essential are nonessential genes?. Mol Biol Evol. 2005, 22: 2147-2156. 10.1093/molbev/msi211.
- Chen Y, Xu D: Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics. 2005, 21: 575-581. 10.1093/bioinformatics/bti058.
- Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li WH: Role of duplicate genes in genetic robustness against null mutations. Nature. 2003, 421: 63-66. 10.1038/nature01198.
- Smith NG, Eyre-Walker A: Human disease genes: patterns and predictions. Gene. 2003, 318: 169-175. 10.1016/S0378-1119(03)00772-8.
- Kondrashov FA, Ogurtsov AY, Kondrashov AS: Bioinformatical assay of human gene morbidity. Nucleic Acids Res. 2004, 32: 1731-1737. 10.1093/nar/gkh330.
- Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova TA: The relationship of protein conservation and sequence length. BMC Evol Biol. 2002, 2: 20. 10.1186/1471-2148-2-20.
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004, 428: 617-624. 10.1038/nature02424.
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003, 423: 241-254. 10.1038/nature01644.
- Galagan JE, Calvo SE, Cuomo C, Ma LJ, Wortman JR, Batzoglou S, Lee SI, Basturkmen M, Spevak CC, Clutterbuck J, Kapitonov V, Jurka J, Scazzocchio C, Farman M, Butler J, Purcell S, Harris S, Braus GH, Draht O, Busch S, D'Enfert C, Bouchier C, Goldman GH, Bell-Pedersen D, Griffiths-Jones S, Doonan JH, Yu J, Vienken K, Pain A, Freitag M, Selker EU, Archer DB, Penalva MA, Oakley BR, Momany M, Tanaka T, Kumagai T, Asai K, Machida M, Nierman WC, Denning DW, Caddick M, Hynes M, Paoletti M, Fischer R, Miller B, Dyer P, Sachs MS, Osmani SA, Birren BW: Sequencing of Aspergillus nidulans and comparative analysis with A. fumigatus and A. oryzae. Nature. 2005, 438: 1105-1115. 10.1038/nature04341.
- Krylov DM, Wolf YI, Rogozin IB, Koonin EV: Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res. 2003, 13: 2229-2235. 10.1101/gr.1589103.
- Fleuret F: Fast Binary Feature Selection with Conditional Mutual Information. Journal of Machine Learning Research (JMLR). 2004, 5: 1531-1555.
- Klasson L, Andersson SG: Evolution of minimal-gene-sets in host-dependent bacteria. Trends Microbiol. 2004, 12: 37-43. 10.1016/j.tim.2003.11.006.
- Moran NA: Microbial minimalism: genome reduction in bacterial pathogens. Cell. 2002, 108: 583-586. 10.1016/S0092-8674(02)00665-7.
- Pal C, Papp B, Lercher MJ, Csermely P, Oliver SG, Hurst LD: Chance and necessity in the evolution of minimal metabolic networks. Nature. 2006, 440: 667-670. 10.1038/nature04568.
- Dietrich FS, Voegeli S, Brachat S, Lerch A, Gates K, Steiner S, Mohr C, Pohlmann R, Luedi P, Choi S, Wing RA, Flavier A, Gaffney TD, Philippsen P: The Ashbya gossypii genome as a tool for mapping the ancient Saccharomyces cerevisiae genome. Science. 2004, 304: 304-307. 10.1126/science.1095781.
- Nelson CE, Hersh BM, Carroll SB: The regulatory content of intergenic DNA shapes genome architecture. Genome Biol. 2004, 5: R25. 10.1186/gb-2004-5-4-r25.
- Estrada E: Virtual identification of essential proteins within the protein interaction network of yeast. Proteomics. 2005.
- Wuchty S: Evolution and topology in the yeast protein interaction network. Genome Res. 2004, 14: 1310-1314. 10.1101/gr.2300204.
- Coulomb S, Bauer M, Bernard D, Marsolier-Kergoat MC: Gene essentiality and the topology of protein interaction networks. Proc Biol Sci. 2005, 272: 1721-1725. 10.1098/rspb.2005.3128.
- Batada NN, Hurst LD, Tyers M: Evolutionary and physiological importance of hub proteins. PLoS Comput Biol. 2006, 2: e88. 10.1371/journal.pcbi.0020088.
- Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapin A, Le Bot N, Moreno S, Sohrmann M, Welchman DP, Zipperlen P, Ahringer J: Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003, 421: 231-237. 10.1038/nature01278.
- Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE: Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 2005, 33: 1141-1153. 10.1093/nar/gki242.
- Gerdes SY, Scholle MD, Campbell JW, Balazsi G, Ravasz E, Daugherty MD, Somera AL, Kyrpides NC, Anderson I, Gelfand MS, Bhattacharya A, Kapatral V, D'Souza M, Baev MV, Grechkin Y, Mseeh F, Fonstein MY, Overbeek R, Barabasi AL, Oltvai ZN, Osterman AL: Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. J Bacteriol. 2003, 185: 5673-5684. 10.1128/JB.185.19.5673-5684.2003.
- Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D: DIP: the database of interacting proteins. Nucleic Acids Res. 2000, 28: 289-291. 10.1093/nar/28.1.289.
- Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK: Global analysis of protein localization in budding yeast. Nature. 2003, 425: 686-691. 10.1038/nature02026.
- Pal C, Hurst LD: Evidence for co-evolution of gene order and recombination rate. Nat Genet. 2003, 33: 392-395. 10.1038/ng1111.
- Gerton JL, DeRisi J, Shroff R, Lichten M, Brown PO, Petes TD: Inaugural article: global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc Natl Acad Sci U S A. 2000, 97: 11383-11390. 10.1073/pnas.97.21.11383.
- Rocha EP, Danchin A: Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nat Genet. 2003, 34: 377-378. 10.1038/ng1209.
- Saccharomyces Genome Database. [ftp://ftp.yeastgenome.org/yeast/]
- CodonW. [http://codonw.sourceforge.net/]
- Demsar J, Zupan B, Leban G, Curk T: Orange: From experimental machine learning to interactive data mining. Lect Notes Artif Int. 2004, 3202: 537-539.
- Fayyad UM, Irani KB: On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning. 1992, 8: 87-102.
- Castillo-Davis CI, Hartl DL: GeneMerge--post-genomic analysis, data mining, and hypothesis testing. Bioinformatics. 2003, 19: 891-892. 10.1093/bioinformatics/btg114.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.