How many human genes can be defined as housekeeping with current expression data?
© Zhu et al. 2008
Received: 20 December 2007
Accepted: 16 April 2008
Published: 16 April 2008
Skip to main content
© Zhu et al. 2008
Received: 20 December 2007
Accepted: 16 April 2008
Published: 16 April 2008
Housekeeping (HK) genes are ubiquitously expressed in all tissue/cell types and constitute a basal transcriptome for the maintenance of basic cellular functions. Partitioning transcriptomes into HK and tissue-specific (TS) genes relatively is fundamental for studying gene expression and cellular differentiation. Although many studies have aimed at large-scale and thorough categorization of human HK genes, a meaningful consensus has yet to be reached.
We collected two latest gene expression datasets (both EST and microarray data) from public databases and analyzed the gene expression profiles in 18 human tissues that have been well-documented by both two data types. Benchmarked by a manually-curated HK gene collection (HK408), we demonstrated that present data from EST sampling was far from saturated, and the inadequacy has limited the gene detectability and our understanding of TS expressions. Due to a likely over-stringent threshold, microarray data showed higher false negative rate compared with EST data, leading to a significant underestimation of HK genes. Based on EST data, we found that 40.0% of the currently annotated human genes were universally expressed in at least 16 of 18 tissues, as compared to only 5.1% specifically expressed in a single tissue. Our current EST-based estimate on human HK genes ranged from 3,140 to 6,909 in number, a ten-fold increase in comparison with previous microarray-based estimates.
We concluded that a significant fraction of human genes, at least in the currently annotated data depositories, was broadly expressed. Our understanding of tissue-specific expression was still preliminary and required much more large-scale and high-quality transcriptomic data in future studies. The new HK gene list categorized in this study will be useful for genome-wide analyses on structural and functional features of HK genes.
Human transcriptomes are complicated in three dimensions: diversified to perform tissue/cell-specific functions, undergone temporal regulations during cell cycle and development, and influenced by other physiological and pathological conditions. A collection of genes are expressed in all tissues/cells to maintain basic cellular functions, traditionally known as housekeeping (HK) genes, whereas others are specialized to perform unique functions in differentiated tissues/cells, known as tissue-specific (TS) genes. To characterize cell-specific human transcriptomes, it is important to define this collection of HK genes shared by all human transcriptomes. HK genes were previously considered to express at a constant level across different biological contexts and thus entitled as "control genes" that can be used to standardize quantitative expression studies. However, it has been proven later that the expression of HK genes is still under stringent regulation albeit ubiquitously expressed; their expression levels may vary significantly across different cell types [1–3]. Another related concept refers to "essential genes", the disturbances of which often lead to lethal phenotypes. A recent study has demonstrated that about 500 genes are essential to sustain bacterial life . However, ubiquitous expression does not necessarily mean essentiality and vice versa. In this study, we focused on the primary definition of HK genes — a set of genes universally expressed in diversified tissue/cell types to maintain a basal transcriptome .
Previous studies have aimed at large-scale categorization of human HK genes, largely based on microarray technology. There have been three lists of HK genes widely cited in the literature. Warrington et al.  and Hsiao et al.  pioneered the effort, and obtained 533 and 451 HK genes after sampling 11 and 19 tissues, respectively, by using Affymetrix HuGeneFL chip. Eisenberg et al.  later extended the number of HK genes to 575 based on 47 tissue samples, using data from a more advanced Affymetrix U95A platform . Depending on these HK gene lists, many following-up studies have demonstrated distinct natures of HK genes in comparison with TS genes, including gene structure [8, 10], nucleotide composition , rate of evolution [12, 13], protein domain , and other genomic characteristics [15–18]. While comparative analyses between HK and TS genes have produced many meaningful results, a consensus on the identity and number of HK genes has been long expected. Although all three microarray-defined HK gene lists arrived at an estimate of about 500 in the number of human HK genes, the overlaps among them were very low.
In this study, we used the latest microarray and EST data from the public databases to re-categorize HK and TS genes. A manually-curated benchmark of HK genes (HK408) was created as control to compare the two different data types. We demonstrated that present EST data was far from saturated and many tissues were still poorly sampled. The inadequacy of EST sampling limited our ability to identify genes and to understand TS expression. The microarray data, due to a likely over-stringent threshold, showed higher false negative rate in comparison with the EST data, leading to a significant underestimation of HK genes. Based on EST data we catalogued a new set of human HK genes, ranging from 3,140 to 6,909 in number, nearly a ten-fold increase as compared to the previous results based on microarray data. We believe that this new dataset will be useful for genome-wide analyses on structural and functional features of HK genes.
Three lists of microarray-defined HK genes have been widely cited in the literature. After updating the annotation of these datasets, there were 501, 425, and 567 HK genes in the lists put together by Warrington , Hsiao  and Eisenberg , respectively. Although all of them arrived at an estimate of approximately 500 human HK genes, the shared HK genes were found significantly low — only 155 genes were found in all three datasets despite the fact that two of them shared 340 genes due to the utilization of an identical technical platform (Additional file 1, Figure S1). The unique part of individual dataset ranged from 20% to 60%, implying both high false positive (FP) and false negative (FN) rates in these lists. Moreover, these studies were based on the old microarray platforms with only approximately 7,000 genes represented on the chip, less than half of the present annotations. To update these limited results and avoid systematic bias introduced by a single technique, in this study we analyzed both the latest EST and microarray data to reassess human HK genes.
In EST data, we observed that gene detectability in each tissue was proportional to the sampling depth (Figure 1A and 1B). According to the relationship between the number of detected genes and the number of sampled ESTs, the sampling was far from saturated for almost all of 18 tissues (Figure 1B). Many tissues were still very poorly sampled, limiting the gene detectability of current EST data. In microarray data, the number of detected genes was lower than that in EST data, even when ESTs have not been sampled deeply enough (Figure 1A). In our 18-tissue collection, 17,288 of 18,225 total genes (94.9%) were found to be expressed in at least one tissue by EST data, in contrast to 11,730 of 13,986 represented genes (83.9%) by microarray data.
We defined expression breadth as the number of unique tissues where a gene was expressed, which ranged from 1 (TS) to 18 (HK) with decreased tissue-specificity. We observed that the distribution of expression breadth showed two modes representing TS and HK genes respectively in both data types (Figure 1C). The degree of tissue specificity varied gradually and no clear-cut boundaries of both TS and HK genes were observed. However, the expression breadth distributions from the two data types showed opposite trend. In microarray data, majority of genes exhibited tissue-specific expression whereas only a small fraction showed universal expression. 1,418 (12.1%) and 1,206 (10.3%) genes were detected in only one and all 18 tissues, respectively, consistent with previous microarray-based results [8, 11]. In EST data, a large fraction of genes was found broadly expressed whereas tissue-specific expression was less notable. 885 (5.1%) genes were detected in only one tissue and 3,140 (18.2%) in all 18 tissues. This was in agreement with a recent microarray experiment on 14 mouse tissues .
We compared the expression breadths of 11,495 genes detected in at least one of 18 tissues by both microarray and EST data. The correlation was not significant (Figure 1D, r = 0.42); 71% of the genes were detected in less number of tissues by microarray data than by EST data. The above observations implied that microarray data on average detected less number of genes compared with EST data, thus underestimated the expression breadth, making the expression breadth distribution in microarray data skewed toward TS genes.
Functional classification of HK408 genes
Transcription pre-initiation complex
Basal transcription elongation factor
Capping, splicing and polyadenylation
Nuclear pore complex
Basal translation factor
Ubiquitin mediated proteolysis
We evaluated the expression breadth distribution of HK408 genes. In EST data, the expression breadths clearly peaked at the value of 18 (Figure 2C). Although only 278 (68.1%) of HK408 genes were found expressed in all 18 tissues due to 5 poorly sampled tissues, 379 (92.9%) were detected in at least 16 of 18 tissues (Table 1). This result justified HK408 genes as a qualified benchmark. In contrast, microarray data detected only 182 (45.3%) and 235 (58.5%) of HK408 genes in all 18 tissues and at lest 16 of 18 tissues, respectively (Table 1). A messy tail of expression breadth distribution across all breadth groups indicated the rather noisy nature and high FN rate in microarray data (Figure 2D).
As only 70% of HK408 genes can be identified by EST data in all 18 tissues due to several poorly sampled tissues, but 93% of HK408 genes have expression breadths enriched at value 16 to 18 (Figure 2C), we set a cutoff at 16 tissues for a less stringent definition of HK genes. As a result, we obtained 3,140 HK genes as a lower bound (expressed in all 18 tissues and with a FN rate of 31.9%) and 6,909 HK genes as an upper bound (expressed in at least 16 of 18 tissues and with a FN rate of 7.1%) according to the EST data. Similarly, we deduced 1,206 HK genes (with a FN rate of 54.7%) and 2,403 HK genes (with a FN rate of 41.5%) for the low and high numbers according to the microarray data. The detailed descriptions of the 6,909 EST-defined and 2,403 microarray-defined HK genes were given in Additional file 4. We compared 6,909 EST-defined and 2,403 microarray-defined HK genes, and found 4,921 (71.2%) and 415 (17.3%) genes unique to each group, respectively (Additional file 1, Figure S3). In addition, 6,909 EST-defined HK genes covered nearly all HK genes in the lists of Warrington (488/501), Hsiao (403/425) and Eisenberg (502/567). However, when comparing with 2,403 microarray-defined HK genes, we found that 18.6%, 17.4%, and 33.5% genes were unique to the Warrington's, Hsiao's, and Eisenberg's lists, respectively (Additional file 1, Figure S3); the consistency among the microarray-derived results was still very low.
We found 1,418 and 885 genes expressed in only one of 18 tissues from microarray and EST data, respectively. The microarray data identified more TS genes than the EST data, but many of which were actually expressed in more than one tissue according to the EST data (Figure 1D). We observed a common trend in both data: brain and testis contributed the most TS genes as compared to other tissues. In EST data, about half of TS genes appeared either brain- or testis-specific. The most important observation we had was that thyroid, the least sampled tissue, had 5,403 of 7,263 detected genes (74.4%) defined as HK genes. This indicated that for poorly sampled tissues our knowledge on their transcriptomes was still limited to the most abundant housekeeping genes, and a true definition of tissue-specific expression required much greater efforts in the future.
In our analyses, requiring only one EST for justifying positive expression was a potential source of FP, but the limited sampling depth of present EST data prevented us from using a more stringent threshold. In the least sampled tissue (such as thyroid), 2,607 of 7,263 (35.9%) detected genes were sampled only once. If we required > 1 EST to justify positive expression, these poorly-sampled tissues became non-informative and with very high FN rates. When insisting > 2 ESTs, we had 3 tissues suffered severely for the same reason. More seriously, by doing so, the expression breadths of HK genes peaked at the value 16 rather than 18 — most of HK genes can not be detected in all 18 tissues. When the parameter increased to > 4 ESTs, the peak of HK gene expression breadths disappeared — no clear HK gene group existed (Additional file 1, Table S2).
Although insisting on single EST may introduce FP, there were reasons to suggest that our processed EST data should be a reliable indicator for legitimate expression and the FP involved in our EST-defined results were very low. First, we only took account of ESTs that were reliably aligned onto human genome and clustered into RefSeq loci; most dubious ESTs originated from genomic contaminations and cloning artefacts during cDNA library construction were removed (See methods for details). If we ignore the problems in cloning and RNA isolations (actually faced by both EST and microarray methods), EST sampling is advantageous in that no empirical cutoff on signals is needed to indicate positive expression. When erroneous sequences are discarded and only reliable sequences are used, EST-based methods suffer less from FP than microarray-based ones. Second, according to our newly-established transcriptome-sampling model , transcripts with certain expression levels have finite probability to be detected at a given sampling stage. Although other high-throughput experiments such as SAGE do introduce erroneous low-frequency tags, for EST data at such a poor sampling depth even those genes detected at low sampling frequency are most likely to be moderately and even abundantly expressed.
The high FN rate of microarray data is attributable to the fact that the cutoff value of 200  for defining positive expression is quite conservative. In a parallel analysis, we relaxed the cutoff value to 100 and identified 3,058 and 5,630 genes expressed in all 18 tissues and at least 16 of 18 tissues, respectively, where 66.2% and 78.4% of HK408 genes were covered. These numbers were comparable to those derived from the EST data (3,140 and 6,909). However, as the cutoff value of 200 was determined based on negative controls on the chip to match some optimum ratio of FN to FP, and to our knowledge, almost all published works utilizing this microarray dataset used the cutoff 200, a liberty given to the lower cutoff value for reducing FN should not be encouraged.
At present time, large-scale gene expression profiling is still approached inadequately; both transcriptome sequencing and microarray technique have their own drawbacks. The most noticeable weakness of microarray technique is that it still suffers from poor detectability and reproducibility for low-copy and transiently-expressed genes ; the latter are actually very important as they are most likely enzymes and transcription factors, performing transient yet critical biological functions. Systematic noises introduced during sample processing and fluorescence scanning can be improved but are hard to avoid completely, making the cutoff of present/absent call difficult to determine and vary from experiment to experiment. However, EST and its equivalent methods suffer the most from low sampling depth although they are essentially capable of discovering novel and low-copy transcripts . Although about 8 million ESTs have been sequenced, considering the fact that human body has over 200 tissue/cell types, many tissues are still poorly sampled. Fortunately, short tag techniques and recent developments in multiplex sequencing instruments have been applied to comprehensive transcriptome sampling, allowing for effective acquisition of millions of transcript tags in a single experiment [39–41].
Since mRNAs are biological materials of transcriptomes, protocols for RNA isolation and processing are critical for transcriptome studies. Low-copy transcripts may suffer more severely from RNA degradation when lengthy protocols are required, such as making cDNA libraries (especially for SAGE libraries) and labelling RNA probes. These factors make both microarray and EST profile only approximation of in vivo expression status. In this in silicon study the EST data were generated from 2,563 cDNA libraries, and we consolidated tissue samples from similar origin into uniquely defined tissue. Although precise information of tissue/cell type, which requires advancement of micro-dissection tools and single-cell techniques, has been lost and potential FP may be introduced, this procedure is a necessary approximation considering the limitation of present expression data.
The last but the most profound issue relates to the extended definition of gene itself. Recently, genome-wide tiling array experiments and large-scale full-length cDNA sequencing have provided new insights on the transcribed content of human genome [42, 43]. Transcription is complicated by extensive overlap of transcriptional units as well as alternative initiation, splicing and termination; this complex transcriptional organization challenges the traditional definition of a "gene", suggesting that transcripts should be used as operational units of genomes . Consequently, the concept of "housekeeping" or "maintenance" should be defined in a hierarchical way related to cell types, growth stages, cell cycles as well as various physiological conditions, and in terms of specific transcript variant.
Clearly, we are still at an early stage toward precisely defining the basal and cell-specific transcriptomes. However, we believe that along with the improvement of microarray technology and saturated sequencing of transcriptomes, results from microarray and EST data will converge to a consensus. Intensive transcriptome sequencing for the identification of unknown transcript, followed by extensive microarray experiments under various biological contexts, will give us a great opportunity to precisely define cell-specific human transcriptome in the near future.
The present EST sampling data was far from adequate; many human tissues were still poorly sampled so that our ability to define TS expression was still very limited. Microarray data, due to a likely over-stringent threshold, showed higher FN rate in comparison with EST data, leading to a significant underestimation of HK genes. Based on EST data, we estimated that about 40.0% of the currently annotated human genes were actually universally expressed, nearly a ten-fold increase as compared to the previous estimates based on microarray data solely.
We aligned 24,354 human RefSeq transcripts (NCBI, June 18, 2007 update) onto human genomic sequences (UCSC, hg18) using BLAT . Requiring at least 98% base-pair identity and 95% length coverage, we acquired 24,458 gene features on the genome. Features were clustered into loci based on sharing of splicing site for multi-exon features and overlaps of exon for single-exon features. Finally, 18,225 RefSeq loci — 17,009 (93.3%) multi-exon and 1,216 (6.7%) single-exon — were used in further analyses.
Human EST sequences and their genomic alignments were downloaded from UCSC annotation database (March 11, 2007 update) . We removed 4,609 cDNA libraries with less than 100 ESTs; the number of ESTs from these libraries contributed only 2.0% (156,378) of the total EST collection. The remaining 4,026 libraries contain 7,801,123 (98.0%) ESTs. After post-processing and filtering, 6,039,131 (77.4%) ESTs can be reliably aligned with at least 96% identity and 80% coverage, revealing 3,327,959 spliced and 2,776,470 unspliced features on the genome. EST features were clustered into RefSeq loci according to the following three steps: (1) 3,186,812 (95.8%) spliced features sharing at least one splicing site with a multi-exon RefSeq locus were first clustered into corresponding locus; (2) 1,570,241 unspliced features that exactly locate in an internal exon, extend the 5'-most exon or extend the 3'-most exon of a multi-exon RefSeq locus were then added; (3) 59,173 unspliced features were finally clustered into single-exon RefSeq loci by requiring at least 1-bp overlap. The remaining 1,288,203 EST features, largely unspliced (89.0%), were regarded as unreliable and discarded from further analyses. We retrieved microarray data from Gene Expression Atlas II . The alignment of exemplar/consensus sequences of the probe sets were acquired from UCSC annotation database (April 13, 2006 update), and clustered into RefSeq loci by using similar procedure as for EST clustering. Eventually, 13,986 RefSeq loci were represented on the chip (Affymetrix U133A coupled with GNF1H) .
cDNA library information was obtained from CGAP (February 27, 2007 update)  and UniGene (March 26, 2007 update) , followed by integration and manual curations. Information for microarray samples was retrieved from NCBI's GEO database . Since most available tissue samples are anatomically heterogeneous at present time, we in silicon consolidated RNA samples from the same tissue and/or partial tissue samples from entire organs into unique tissues to avoid overlapping results. Finally, 4,026 original cDNA libraries and 79 original microarray tissue samples were categorized into 51 and 31 unique tissues respectively. We selected 18 well-studied tissues covered by both two types of data for analyses. Although previous studies reported more number of tissues assayed than our current study, bulky and pooled tissues were used for overlapping gene expression profiling and often resulted in redundant counts. The 18 tissues used in this study have represented a broad spectrum of differentiated tissue/cell types in the human body (Additional file 1, Figure S2).
For EST data, we defined a RefSeq locus as expressed in a given tissue when at least one reliably clustered EST was detected from cDNA libraries of that tissue. No empirical cutoff was enforced. This was justified as we have removed most of dubious ESTs — largely unspliced — originated from genomic contaminations and other experimental artefacts. The ESTs clustered in RefSeq loci are well consistent with the annotated gene structure, thus should reliably indicate the genuine expression. For microarray data, we retrieved the expression intensities of each probe set from NCBI's GEO database. Expression intensities from different probe sets of the same RefSeq locus and from different experiments of the same tissue were averaged. We called a RefSeq locus as expressed if its expression intensity exceeded the cutoff value of 200 as recommended by authors who carried out the experiments . We also loosened the cutoff to 100 for comparative analyses.
Pathway information was acquired from Reactome  and KEGG . As a benchmark for comparative analyses, we manually curated 408 genes with well-documented housekeeping functions, including general transcription factors [22, 23] and major components of capping and polyadenylation machinery [24–26], spliceosome [27–29], nuclear RNA export complex [30–32], translation machinery , cytosolic ribosome , and ubiquitin-proteasome proteolytic pathway  (Table 1 and Additional file 2).
408 manually curated genes that are highly proposed to play housekeeping roles
We thank Drs. Janet A. Warrington, Eli Eisenberg and four anonymous reviewers for their critical comments and helpful suggestions, which significantly improved the early version of this manuscript. We also thank all staffs at Beijing Institute of Genomics for their sincere supports. This work was supported by the National Natural Science Foundation of China (NSFC) (Grant No. 90608003) and the National Basic Research Program of China (973 Program) (Grant No. 2006CB910404).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.