Analysis of a human brain transcriptome map
BMC Genomics volume 3, Article number: 10 (2002)
Genome wide transcriptome maps can provide tools to identify candidate genes that are over-expressed or silenced in certain disease tissue and increase our understanding of the structure and organization of the genome. Expressed Sequence Tags (ESTs) from the public dbEST and proprietary Incyte LifeSeq databases were used to derive a transcript map in conjunction with the working draft assembly of the human genome sequence.
Examination of ESTs derived from brain tissues (excluding brain tumor tissues) suggests that these genes are distributed on chromosomes in a non-random fashion. Some regions on the genome are dense with brain-enriched genes while some regions lack brain-enriched genes, suggesting a significant correlation between distribution of genes along the chromosome and tissue type. ESTs from brain tumor tissues have also been mapped to the human genome working draft. We reveal that some regions enriched in brain genes show a significant decrease in gene expression in brain tumors, and, conversely that some regions lacking in brain genes show an increased level of gene expression in brain tumors.
This report demonstrates a novel approach for tissue specific transcriptome mapping using EST-based quantitative assessment.
Sequencing of Expressed Sequence Tags (ESTs) has resulted in the rapid identification of expressed genes. ESTs are single-pass, partial sequences of cDNA clones from a large number of disease and normal tissue libraries. ESTs have been used extensively for gene discovery and for transcript mapping of genes from a wide range of organisms [2–4]. Even with the working draft of the human genome completed, generating a complete and non-redundant catalog of human genes remains a major challenge for the genome research community. Full-length cDNA data are currently available for only 10,000 human genes, less than one-third of the total even under the most conservative recent estimates of human gene numbers [6, 7]. Evidence of differential expression is one of the most important criteria in prioritizing the exploitation of genes in both academic and pharmaceutical research [8–10].
While identifying individual differentially expressed genes attracts most of the interest, a genome wide transcriptome map may not only provide a tool to identify candidate genes that are over-expressed or silenced in certain disease tissue, but may also help to understand the structure and organization of the genome. Genomes are the blueprints of life and they should not be considered as a simple collection of genes. In fact, the organization of genes into operons, complex regulons, or pathogenicity islands suggests that related functions usually share physical proximity. Different types of transcriptome maps can help to identify different types of transcription domains. Those domains can now be analyzed as to how they relate to known nuclear substructures, such as nuclear speckles, PML bodies and coiled bodies [13–15].
Two strategies have been commonly used to evaluate large-scale gene expression: experimental and computational. The former is represented by DNA microarray technology. Computational methods consist of generating a large number of random ESTs from non-normalized cDNA libraries. The variation in the relative frequency of those tags, stored in databases, is then used to point out the differential expression of the corresponding genes: this is the so-called "digital Northern" comparison. Digital Northern data can be used to provide a quantitative assessment of differential expression within certain limits. Velculescu et al. introduced another digital method called serial analysis of gene expression (SAGE). The SAGE method requires tags of only nine nucleotides, therefore allowing a larger throughput. In both protocols, the number of tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make the cDNA library.
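The digital Northern comparison described above can be illustrated with a minimal sketch. It assumes each EST has already been assigned to a gene, and it compares the relative tag frequency of each gene between two libraries. The function name, gene labels and counts are hypothetical and are not taken from this study.

```python
from collections import Counter

def digital_northern(tags_a, tags_b):
    """Return {gene: (frequency in library A, frequency in library B)}.

    Each input is a list with one gene label per sequenced EST.
    Frequencies are EST counts divided by the library size.
    Both libraries are assumed to be non-empty.
    """
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    total_a, total_b = len(tags_a), len(tags_b)
    result = {}
    for gene in sorted(set(count_a) | set(count_b)):
        result[gene] = (count_a[gene] / total_a, count_b[gene] / total_b)
    return result

# Hypothetical libraries of ten ESTs each.
brain = ["geneA"] * 6 + ["geneB"] * 2 + ["geneC"] * 2
liver = ["geneA"] * 1 + ["geneB"] * 4 + ["geneC"] * 5
profile = digital_northern(brain, liver)
for gene, (f_brain, f_liver) in profile.items():
    print(gene, f_brain, f_liver)
```

In this toy example geneA accounts for a larger fraction of the brain library than of the liver library, which is the kind of difference a digital Northern comparison is meant to surface. With real libraries, sampling fluctuation at low counts has to be taken into account.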
The recently announced first draft of the human genome [19, 20] holds in it an unprecedented wealth of information, available for public study and scrutiny. How are genes organized in the human genome? Is there any distribution pattern of tissue specific genes in terms of chromosomal location? In this study, we combined the concept of digital Northern and transcript mapping for all public and Incyte LifeSeq ESTs to evaluate the tissue specific transcriptome. The goal of this paper is not to evaluate the digital expression of individual genes; instead we are looking at the tissue enriched digital expression level for a given chromosomal region. Particularly, we looked at the distribution pattern of brain-enriched genes in the genome and how that pattern changes in brain tumor tissues. We are well aware of the fact that this method and associated approaches are quite primitive. However, the tissue specific transcriptome data strongly suggest that human genome organization is correlated to the tissue type and its dynamics.
Distribution of brain-enriched genes along the chromosomes
Because a complete annotated human gene catalog is not yet available, it is not practical to document every individual gene that is expressed in brain within a given chromosome region. Since the number of sequence tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make a given cDNA library, the number of ESTs within a chromosome region should reflect the abundance of the cognate transcripts in that region. Therefore, comparing the abundance of brain tissue derived transcripts with that of transcripts from other tissues within the same chromosome region can highlight regions that have more brain-enriched gene expression.
We performed digital expression analysis of brain-enriched genes across the human genome with a window size of 5 Mbp and an interval of 1 Mbp. The transcript density factor for normal (non-tumor) brain libraries (TDFNB) was calculated as described in Methods. Figure 1 shows an example of the distribution of TDFNB over chromosome 1 using publicly available EST sequences from dbEST; it reveals a number of "peaks" that represent transcripts that appear to be preferentially expressed in brain tissues. To check the validity of these peak regions and to make sure that the difference is not due to random picking or partial sequencing of cDNA libraries (the random fluctuation commonly associated with the digital Northern approach), the analysis was repeated using ESTs and the associated library information from the Incyte Genomics LifeSeq database. Across the whole genome, the distribution patterns of TDFNB from these two data sources show an overall correlation coefficient of 0.658. If we analyze only the regions with Z-score >= 2 (i.e. peak regions), the correlation coefficient is 0.935, which suggests that the peak regions resulting from the analysis of public data are most likely not artifacts. Figure 2 compares the distributions of TDF on chromosome 1 calculated from ESTs derived from brain tissue libraries vs. ESTs derived from breast tissue libraries. The overall Pearson correlation coefficient for these two tissues is 0.113, which suggests that the peak regions observed in Figure 1 are brain specific.
There are 16 high TDFNB regions (enriched with brain specific expression) with Z-score >= 2 (Table 1). To assess the validity of this finding from public data, we repeated the analysis using Incyte LifeSeq data, which yielded the same peaks (data not shown). Table 2 summarizes all the low TDFNB regions (lacking brain specific expression) over the whole genome with Z-score <= -2. It is interesting to note that the majority of high TDFNB or low TDFNB regions have close to average gene density, indicating that those regions are not biased toward extremely high or low gene density.
Expression profile change of some chromosome regions with extreme TDFNB in brain tumor
Our analysis strongly suggests that brain-enriched genes are distributed throughout the genome in a non-random fashion. Some regions are dense with brain-enriched genes or brain specific expression. It would be interesting to know whether any of these patterns change during tumorigenesis. A similar analysis was performed using ESTs generated from brain tumor libraries, and their digital expression profile relative to the pooled tissue was plotted against the genome. The chromosomal distributions of these putative brain tumor enriched transcripts and the normal brain enriched transcripts are quite different. Table 3 lists all the chromosome regions with high TDF in non-tumor brain libraries (TDFNB) that become low TDF or neutral TDF in brain tumor (TDFTB). Chr15, 21–25 Mbp, Chr12, 85–89 Mbp, and Chr18, 45–52 Mbp (Figure 3) are some examples. While most of the low TDF regions in normal brain remain low in brain tumor, a few regions did become high TDF regions in brain tumor tissues (Table 3). Chr2, 93–99 Mbp and Chr19, 53–58 Mbp (Figure 4) are two examples. The digital expression profile in those regions was further confirmed using data from Incyte LifeSeq (data not shown).
A genome is not a simple collection of genes. It has been reported that significant correlation exists between the distribution of genes along the chromosome and the physical architecture of the cell in bacteria. The human genome is much more complex and a complete understanding of its organization awaits completion of the finished sequence as well as a definitive annotation of the human gene catalog. Considerable evidence has already shown that related genes tend to exist as clusters in the genome. For example, 80% of the over 900 olfactory genes are found in clusters of 6–138 genes. The 3.6 Mbp human major histocompatibility complex (MHC) on chromosome 6p21.3 is a critical repository for the immune response genes. Extensive analysis of the genomic organization of the MHC region has revealed that at least 27 of its resident genes possess duplicated copies in at least one of the three other restricted chromosomal regions 1q21-q25, 9q33-q34 and 19q13.1-p13.4. For another example, ABC transporter gene family members are located on 6p21.3, 1q25 and 9q34 as clusters.
The development of distinct tissue and cell types is a fundamental characteristic of growth in higher organisms. Tissue and cellular differentiation, in turn, is highly dependent on specific patterns of gene expression and transcript accumulation. Many studies have been successfully used to pinpoint genes exhibiting tissue or disease specific expression. This study suggests another approach that focuses on the tissue specific transcriptome map study and attempts to study the genomic proximity of tissue specific genes. We show here that some regions on specific chromosomes are enriched with brain-enriched gene expression. Given the chromosome window size (5 Mbp) used to calculate the TDF and the average size of the gene (~30 Kbp), each 5 Mbp region should contain on average about 170 genes. In addition, we report here only regions that have a normal level of gene density. Therefore the very high TDFNB regions are most likely contributed by the high level expression of brain-enriched gene clusters or the regions containing a high density of brain-enriched genes.
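The estimate of about 170 genes per window follows from dividing the window size by the average gene size quoted above. Written out, under the simplifying assumption that each gene occupies a footprint of roughly 30 Kbp:

```latex
\frac{5\ \text{Mbp}}{30\ \text{Kbp per gene}}
  = \frac{5{,}000{,}000\ \text{bp}}{30{,}000\ \text{bp per gene}}
  \approx 167 \approx 170\ \text{genes per window}
```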
We are aware of the importance of those brain-enriched genes that are scattered across the genome and whose brain specific or differential expression is diluted by the large number of neighboring genes within the 5 Mbp window. Another limitation of this approach is its inability to reveal regions in which up- and down-regulated expression balance each other out. The goal of this study is to pinpoint the chromosomal regions with the most significant brain-enriched gene expression and to elucidate the non-randomness of the tissue specific expression over the chromosome. Gene proximity within chromosomes is already known to be significant. Finding the neighbors of a given gene can shed light on that gene, especially when the neighbors share similar features. Therefore, this tissue specific transcriptome map study may not only help us to understand genome organization in the future, but may also provide leads to the functions of many genes for which this information is not currently available.
Resources for databases and computer programs
GenBank release 120 was downloaded from ftp://ncbi.nlm.nih.gov. ESTs and their associated tissue library information and clone information were extracted and organized in a relational database (Sybase, SQL Server Release 11.0, CA, Sybase Inc.). The EST cDNA libraries were manually curated and catalogued into non-tumor brain libraries and brain tumor libraries. We obtained 208604 dbEST ESTs from 369 non-tumor brain cDNA libraries and 67351 ESTs from 148 brain tumor cDNA libraries. We also extracted 363473 ESTs from 100 Incyte LifeSeq non-tumor brain cDNA libraries and 126538 ESTs from 23 Incyte LifeSeq brain tumor libraries. All non-commercial software used in this study was written in Perl 5.0.
Transcript mapping was done based on the October 2000 Freeze of the University of California at Santa Cruz's working draft sequence http://genome.ucsc.edu, which presents a tentative assembly of the finished and draft human genomic sequence based on the Washington University-Saint Louis clone map http://genome.wustl.edu/gsc. We mapped all the public ESTs from dbEST (2.5 million ESTs) as well as the ESTs from Incyte Genomics' LifeSeq database (5.1 million ESTs) using the local alignment software package AAT. AAT extends the BLAST algorithm by assigning a fixed penalty to long gaps. To reduce the number of undesirable matches due to interspersed repeats, the DNA sequence was screened for interspersed repeats using the RepeatMasker program (Smit, AFA & Green, P et al http://ftp.genome.washington.edu/RM/RepeatMasker.html). Only those ESTs that have over 95% identity to the genomic counterpart over at least half of the EST length or 50 bp, whichever is longer, are included.
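The inclusion criterion for mapped ESTs can be expressed as a small predicate. This is only a sketch under assumptions: the function and parameter names are hypothetical, identity is taken as a fraction between 0 and 1, "over 95%" is read as a strict inequality, and an aligned length exactly equal to the required length is treated as sufficient. It does not reproduce the AAT output format.

```python
def passes_filter(est_length, aligned_length, identity,
                  min_identity=0.95, min_bp=50):
    """Return True if an EST-to-genome alignment meets the inclusion rule.

    est_length     -- full length of the EST in bp
    aligned_length -- length of the aligned portion in bp
    identity       -- fraction of identical bases in the aligned portion
    """
    # Required span: half of the EST length or 50 bp, whichever is longer.
    required = max(est_length / 2, min_bp)
    return identity > min_identity and aligned_length >= required

print(passes_filter(400, 250, 0.97))  # long enough and similar enough
print(passes_filter(80, 45, 0.99))    # short EST: the 50 bp floor applies
print(passes_filter(400, 250, 0.95))  # identity is not over 95%
```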
Calculation of the transcript density factor (TDF)
The TDF for normal brain-enriched gene expression (TDFNB) is calculated using a 5 Mbp window moving along the chromosome at 1 Mbp intervals and is defined as:
TDFNB = ln[(RNB/R) / (TNB/T)]
RNB is the number of EST clones derived from non-tumor brain tissue within a window of 5 Mbp.
R is the number of EST clones derived from all non-tumor tissue pooled libraries within a window of 5 Mbp.
TNB is the number of total mapped EST clones derived from non-tumor brain tissues.
T is the number of total mapped EST clones derived from all non-tumor tissue pooled libraries.
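The definition above can be turned into a sliding-window calculation. The sketch below is an illustration, not the software used in this study: it assumes each mapped EST clone is represented by a single genomic coordinate, that the pooled set includes the brain-derived clones, and that the log ratio is taken of the whole quotient so that TDF is 0 when brain and pooled tissues do not differ. Windows with no brain or no pooled ESTs are reported as None because the logarithm is undefined there. All names and coordinates are hypothetical.

```python
import math

WINDOW = 5_000_000  # window size in bp (5 Mbp)
STEP = 1_000_000    # distance between window starts in bp (1 Mbp)

def tdf_profile(brain_pos, pooled_pos, chrom_length, total_brain, total_pooled):
    """Return a list of (window_start, TDF) pairs along one chromosome.

    brain_pos / pooled_pos     -- EST coordinates on this chromosome
    total_brain / total_pooled -- genome-wide totals of mapped EST clones
    """
    profile = []
    last_start = max(chrom_length - WINDOW, 0)
    for start in range(0, last_start + 1, STEP):
        end = start + WINDOW
        r_nb = sum(start <= p < end for p in brain_pos)   # RNB
        r = sum(start <= p < end for p in pooled_pos)     # R
        if r_nb == 0 or r == 0:
            profile.append((start, None))  # log ratio undefined
            continue
        tdf = math.log((r_nb / r) / (total_brain / total_pooled))
        profile.append((start, tdf))
    return profile

# Hypothetical 7 Mbp chromosome with a handful of mapped ESTs.
brain_pos = [500_000, 1_500_000, 2_500_000, 6_500_000]
pooled_pos = brain_pos + [3_000_000, 4_000_000, 5_500_000, 6_000_000]
profile = tdf_profile(brain_pos, pooled_pos, 7_000_000,
                      total_brain=4, total_pooled=8)
for start, tdf in profile:
    print(start, tdf)
```

In practice the totals would be the genome-wide counts of mapped EST clones rather than per-chromosome counts, and counting would be done with sorted coordinates rather than a full scan per window.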
The distribution of TDFNB over all the 5 Mbp regions should approximate a Gaussian distribution. Theoretically, TDFNB should approach 0 if the expression level of genes in brain tissues does not differ from that in pooled tissues within a 5 Mbp chromosomal region. We define all the regions with high TDFNB (Z-score >= 2) as brain-enriched regions. Chromosome regions that have high brain-enriched gene expression are referred to as high TDFNB regions. Chromosome regions that have low brain-enriched gene expression are referred to as low TDFNB regions.
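The Z-score classification can be sketched as follows. The text states the cutoff for high regions (Z-score >= 2); the symmetric cutoff for low regions and the use of the population standard deviation are assumptions made for this illustration, as are the function name and the toy values.

```python
from statistics import mean, pstdev

def classify_windows(tdf_values, threshold=2.0):
    """Label each window 'high', 'low' or 'neutral' by its Z-score.

    Assumes the values are not all identical (non-zero spread).
    """
    mu = mean(tdf_values)
    sigma = pstdev(tdf_values)  # population standard deviation (assumption)
    labels = []
    for value in tdf_values:
        z = (value - mu) / sigma
        if z >= threshold:
            labels.append("high")
        elif z <= -threshold:
            labels.append("low")
        else:
            labels.append("neutral")
    return labels

# Hypothetical profile: 18 unremarkable windows, one peak, one trough.
tdf_values = [0.0] * 18 + [2.0, -2.0]
labels = classify_windows(tdf_values)
print(labels[18], labels[19], labels[0])
```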
The calculation of transcript density factor for brain tumor (TDFTB) is similar to the calculation of TDFNB.
TDFTB = ln[(RTB/R) / (TTB/T)]
RTB is the number of EST clones derived from tumor brain tissue within a window of 5 Mbp.
R is the number of EST clones derived from all non-tumor tissue pooled libraries within a window of 5 Mbp.
TTB is the number of total mapped EST clones derived from tumor brain tissue libraries.
T is the number of total mapped EST clones derived from all non-tumor tissue libraries.
Correlation analysis of distribution pattern of TDF using public and Incyte data
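The Pearson correlation coefficient (r) represents the degree of similarity (strength of correlation) between two sets of data. The correlation coefficient is calculated as follows:
\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]
where Xi is the TDF value derived from public data and Yi is the TDF value derived from the Incyte data for window i, n is the number of windows compared, and \bar{X} and \bar{Y} are the respective means. Values for the Pearson correlation coefficient range from -1 to 1, where zero indicates no correlation, -1 indicates a perfect negative correlation and 1 indicates a perfect positive correlation.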
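A direct implementation of the standard Pearson coefficient is shown below as a sketch. The function name and the sample TDF values are hypothetical; the two lists are assumed to hold values for the same windows in the same order, and neither list may be constant.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each series.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical TDF values for the same four windows from two data sources.
public_tdf = [0.2, -0.1, 0.8, 0.4]
incyte_tdf = [0.3, -0.2, 0.7, 0.5]
print(pearson_r(public_tdf, incyte_tdf))
```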
The calculation of gene density
Using a 5 Mbp window moving along the chromosome at 1 Mbp intervals, the gene density ratio is defined as the UniGene (Version 5.002) [4, 25] count in each 5 Mbp region divided by the average UniGene count per 5 Mbp region. The average gene density ratio is therefore equal to 1.0.
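The gene density ratio can be sketched in a few lines. The example assumes the UniGene count for every window is already available as a list; the function name and the counts are hypothetical.

```python
def gene_density_ratios(unigene_counts):
    """Divide each window's UniGene count by the average count per window."""
    average = sum(unigene_counts) / len(unigene_counts)
    return [count / average for count in unigene_counts]

# Hypothetical UniGene counts for three windows.
ratios = gene_density_ratios([100, 200, 300])
print(ratios)
print(sum(ratios) / len(ratios))
```

By construction the ratios average to 1.0, so a window with a ratio close to 1.0 has close to average gene density.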
Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252: 1651-1656.
Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995, 377 (suppl.): 3-174.
Polymeropoulos MH, Xiao H, Sikela JM, Adams MD, Venter JC: Chromosomal distribution of 320 genes from a brain cDNA library. Nature Genet. 1993, 4: 381-386.
Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tome P, Aggarwal A, Bajorek E: A gene map of the human genome. Science. 1996, 274: 540-546. 10.1126/science.274.5287.540.
Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI's LocusLink and Refseq. Nucleic Acids Res. 2000, 28: 126-128. 10.1093/nar/28.1.126.
Ewing B, Green P: Analysis of expressed sequence tags indicates 35000 human genes. Nature Genet. 2000, 25: 232-234. 10.1038/76115.
Roest Crollius H, Jaillon O, Bernot A, Dasilva C, Bouneau L, Fischer C, Fizames C, Wincker P, Brottier P, Quetier F, Saurin W, Weissenbach J: Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 2000, 25: 235-238. 10.1038/76118.
Nowak R: Entering the postgenome era. Science. 1995, 270: 368-369.
Adams MD: Progress towards a complete set of human genes. In Genomes, molecular biology and drug discovery. Edited by: MJ Browne, PL Thurby. 1996, Academic Press, London, UK.
Bains W: Virtually sequenced: the next genomic generation. Nature Biotechnol. 1996, 14: 711-713.
Collado-Vides J: A syntactic representation of units of genetic information – a syntax of units of genetic information. J. Theor. Biol. 1991, 148: 401-429.
Finlay BB, Falkow S: Common themes in microbial pathogenicity revisited. Microbiol. Mol. Biol. Rev. 1997, 61: 136-169.
Wansink DG, Schul W, van der Kraan I, van Steensel B, van Driel R, de Jong L: Fluorescent labeling of nascent RNA reveals transcription by RNA polymerase II in domains scattered throughout the nucleus. J. Cell Biol. 1993, 122: 283-293.
Wei X, Somanathan S, Samarabandu J, Berezney R: Three-dimensional visualization of transcription sites and their association with splicing factor-rich nuclear speckles. J. Cell Biology. 1999, 146: 543-58. 10.1083/jcb.146.3.543.
Jackson DA, Iborra FJ, Manders EM, Cook PR: Numbers and organization of RNA polymerases, nascent transcripts, and transcription units in HeLa nuclei. Mol. Biol. Cell. 1998, 9: 1523-1536.
Ekins R, Chu FW: Microarrays: their origins and applications. Trends in Biotechnology. 1999, 17: 217-218. 10.1016/S0167-7799(99)01329-3.
Audic S: The significance of Digital Gene Expression Profiles. Genome Res. 1997, 7: 986-995.
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.
International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1038/35057062.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
Danchin A, Guerdoux-Jamet P, Moszer I, Nitschké P: Mapping the bacterial cell architecture into the chromosome. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 2000, 355: 179-190. 10.1098/rstb.2000.0557.
Glusman G, Yanai I, Rubin I, Lancet D: The Complete Human Olfactory Subgenome. Genome Res. 2001, 11: 685-702. 10.1101/gr.171001.
Campbell RD, Trowsdale J: Map of the human MHC. Immunol. Today. 1993, 14: 349-352. 10.1016/0167-5699(93)90234-C.
Huang X, Adams MD, Zhou H, Kerlavage AR: A Tool for Analyzing and Annotating Genomic Sequences. Genomics. 1997, 46: 37-45. 10.1006/geno.1997.4984.
Schuler GD: Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 1997, 75: 694-698. 10.1007/s001090050155.