Analysis of a human brain transcriptome map
© Qiu et al; licensee BioMed Central Ltd. 2002
Received: 21 December 2001
Accepted: 16 April 2002
Published: 16 April 2002
Genome-wide transcriptome maps can provide tools to identify candidate genes that are over-expressed or silenced in certain disease tissues, and can increase our understanding of the structure and organization of the genome. Expressed Sequence Tags (ESTs) from the public dbEST and proprietary Incyte LifeSeq databases were used to derive a transcript map in conjunction with the working draft assembly of the human genome sequence.
Examination of ESTs derived from brain tissues (excluding brain tumor tissues) suggests that these genes are distributed on chromosomes in a non-random fashion. Some regions on the genome are dense with brain-enriched genes while some regions lack brain-enriched genes, suggesting a significant correlation between distribution of genes along the chromosome and tissue type. ESTs from brain tumor tissues have also been mapped to the human genome working draft. We reveal that some regions enriched in brain genes show a significant decrease in gene expression in brain tumors, and, conversely that some regions lacking in brain genes show an increased level of gene expression in brain tumors.
This report demonstrates a novel approach for tissue specific transcriptome mapping using EST-based quantitative assessment.
Sequencing of Expressed Sequence Tags (ESTs) has resulted in the rapid identification of expressed genes. ESTs are single-pass, partial sequences of cDNA clones from a large number of disease and normal tissue libraries. ESTs have been used extensively for gene discovery and for transcript mapping of genes from a wide range of organisms [2–4]. Even with the finished working draft of the human genome, the generation of a complete and non-redundant catalog of human genes remains a major challenge facing the genome research community. Full-length cDNA data are currently available for only 10,000 human genes, less than one-third of the total using the most conservative recent estimates of human gene numbers [6, 7]. Evidence of differential expression is one of the most important criteria in prioritizing the exploitation of genes in both academic and pharmaceutical research [8–10].
While identifying individual differentially expressed genes attracts most of the interest, a genome-wide transcriptome map may not only provide a tool to identify candidate genes that are over-expressed or silenced in certain disease tissues, but may also help to understand the structure and organization of the genome. Genomes are the blueprints of life and should not be considered a simple collection of genes. In fact, the organization of genes into operons, complex regulons, or pathogenicity islands suggests that related functions usually share physical proximity. Different types of transcriptome maps can help to identify different types of transcription domains. Those domains can now be analyzed as to how they relate to known nuclear substructures, such as nuclear speckles, PML bodies and coiled bodies [13–15].
Two strategies have been commonly used to evaluate large-scale gene expression: experimental and computational. The former is represented by DNA microarray technology. Computational methods consist of generating a large number of random ESTs from non-normalized cDNA libraries. The variation in the relative frequency of those tags, stored in databases, is then used to point out the differential expression of the corresponding genes: this is the so-called "digital Northern" comparison. Digital Northern data can be used to provide quantitative assessment of differential expression within a certain limit. Velculescu et al. introduced another digital method called serial analysis of gene expression (SAGE). The SAGE method requires tags of only nine nucleotides, therefore allowing a larger throughput. In both protocols, the number of tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make the cDNA library.
The recently announced first draft of the human genome [19, 20] holds in it an unprecedented wealth of information, available for public study and scrutiny. How are genes organized in the human genome? Is there any distribution pattern of tissue specific genes in terms of chromosomal location? In this study, we combined the concept of digital Northern and transcript mapping for all public and Incyte LifeSeq ESTs to evaluate the tissue specific transcriptome. The goal of this paper is not to evaluate the digital expression of individual genes; instead we are looking at the tissue enriched digital expression level for a given chromosomal region. Particularly, we looked at the distribution pattern of brain-enriched genes in the genome and how that pattern changes in brain tumor tissues. We are well aware of the fact that this method and associated approaches are quite primitive. However, the tissue specific transcriptome data strongly suggest that human genome organization is correlated to the tissue type and its dynamics.
Distribution of brain-enriched genes along the chromosomes
Because a complete annotated human gene catalog is not yet available, it is not practical to document each individual gene that is expressed in brain within a given chromosome region. Since the number of sequence tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make a given cDNA library, the number of ESTs within a chromosome region should reflect the abundance of the cognate transcripts in that region. Therefore, comparing the abundance of brain tissue derived transcripts with that of transcripts from other tissues within the same chromosome region can highlight regions that have more brain-enriched gene expression.
Summary of chromosome regions with significant brain-enriched gene expression (high TDFNB region) (window size 5 Mbp, Z-score >= 2.0). All regions are confirmed by separate analysis using Incyte LifeSeq ESTs. Pearson Correlation Coefficient = 0.935.
Summary of chromosome regions lacking in brain-enriched gene expression (low TDFNB regions) (window size 5 Mbp, Z-score >= 2.0). All regions are confirmed by separate analysis using Incyte LifeSeq ESTs. Pearson Correlation Coefficient = 0.935.
Expression profile change of some chromosome regions with extreme TDFNB in brain tumor
High TDFNB and low TDFNB (Z-score >= 2) regions which show significant differential expression between tumor brain tissues and non-tumor brain tissues.
A genome is not a simple collection of genes. It has been reported that significant correlation exists between the distribution of genes along the chromosome and the physical architecture of the cell in bacteria. The human genome is much more complex, and a complete understanding of its organization awaits completion of the finished sequence as well as a definitive annotation of the human gene catalog. Considerable evidence has already shown that related genes tend to exist as clusters in the genome. For example, 80% of the over 900 olfactory genes are found in clusters of 6–138 genes. The 3.6 Mbp human major histocompatibility complex (MHC) on chromosome 6p21.3 is a critical repository for the immune response genes. Extensive analysis of the genomic organization of the MHC region has revealed that at least 27 of its resident genes possess duplicated copies in at least one of the three other restricted chromosomal regions 1q21-q25, 9q33-q34 and 19q13.1-p13.4. As another example, ABC transporter gene family members are located on 6p21.3, 1q25 and 9q34 as clusters.
The development of distinct tissue and cell types is a fundamental characteristic of growth in higher organisms. Tissue and cellular differentiation, in turn, is highly dependent on specific patterns of gene expression and transcript accumulation. Many studies have successfully pinpointed genes exhibiting tissue-specific or disease-specific expression. This study suggests another approach, which focuses on the tissue-specific transcriptome map and examines the genomic proximity of tissue-specific genes. We show here that some regions on specific chromosomes are enriched with brain-enriched gene expression. Given the chromosome window size (5 Mbp) used to calculate the TDF and the average size of a gene (~30 Kbp), each 5 Mbp region should contain on average about 170 genes. In addition, we report here only regions that have a normal level of gene density. Therefore, the very high TDFNB values most likely reflect either high-level expression of brain-enriched gene clusters or regions containing a high density of brain-enriched genes.
We are aware that important brain-enriched genes scattered across the genome may be missed, because their brain-specific or differential expression is diluted by the large number of neighboring genes within the 5 Mbp window. Another limitation of this approach is the inability to reveal regions in which up-regulated and down-regulated expression cancel each other out. The goal of this study is to pinpoint the chromosomal regions with the most significant brain-enriched gene expression and to elucidate the non-randomness of tissue-specific expression over the chromosome. Gene proximity within chromosomes is already known to be significant. Finding neighbors of a given gene can shed light on that gene, especially when the neighbors share similar features. Therefore, this tissue-specific transcriptome map study may not only help us to understand genome organization in the future, but may also provide leads to the functions of many genes for which this information is not currently available.
Resources for databases and computer programs
GenBank release 120 was downloaded from ftp://ncbi.nlm.nih.gov. ESTs and their associated tissue library and clone information were extracted and organized in a relational database (Sybase SQL Server Release 11.0, Sybase Inc., CA). The EST cDNA libraries were manually curated and catalogued into non-tumor brain libraries and brain tumor libraries. We obtained 208604 dbEST ESTs from 369 non-tumor brain cDNA libraries and 67351 ESTs from 148 brain tumor cDNA libraries. We also extracted 363473 ESTs from 100 Incyte LifeSeq non-tumor brain cDNA libraries and 126538 ESTs from their 23 brain tumor libraries. All non-commercial software used in this study was written in Perl 5.0.
Transcript mapping was based on the October 2000 freeze of the University of California at Santa Cruz working draft sequence (http://genome.ucsc.edu), which presents a tentative assembly of the finished and draft human genomic sequence based on the Washington University-Saint Louis clone map (http://genome.wustl.edu/gsc). We mapped all the public ESTs from dbEST (2.5 million ESTs) as well as the ESTs from Incyte Genomics' LifeSeq database (5.1 million ESTs) using the local alignment software package AAT, which extends the BLAST algorithm by assigning a fixed penalty to long gaps. To reduce the number of undesirable matches due to interspersed repeats, the DNA sequence was screened for interspersed repeats using the RepeatMasker program (Smit, AFA & Green, P et al., http://ftp.genome.washington.edu/RM/RepeatMasker.html). Only ESTs showing over 95% identity to their genomic counterpart across at least half of the EST length or 50 bp, whichever is longer, were included.
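The inclusion criterion above can be sketched as a small predicate. This is an illustrative sketch only, not the authors' Perl code: the function name, its parameters, and the choice to treat "over 95%" as a strict inequality and "half the length" as an inclusive bound are assumptions.

```python
def passes_alignment_filter(est_length, aligned_length, percent_identity,
                            min_identity=95.0, min_aligned_bp=50):
    """Return True if an EST-to-genome alignment meets the stated cutoffs.

    Hypothetical helper: est_length and aligned_length are in bp,
    percent_identity is on a 0-100 scale.
    """
    # Required aligned span: half the EST length or 50 bp, whichever is longer.
    required_span = max(est_length / 2.0, min_aligned_bp)
    # Assumption: "over 95% identity" is read as strictly greater than 95.
    return percent_identity > min_identity and aligned_length >= required_span
```

For a 400 bp EST the alignment would need to cover at least 200 bp, whereas for an 80 bp EST the 50 bp floor applies.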
Calculation of the transcript density factor (TDF)
The TDF for normal brain-enriched gene expression (TDFNB) is calculated using a 5 Mbp window moving along the chromosome in 1 Mbp steps and is defined as:
TDFNB = ln[(RNB/R)/(TNB/T)]
RNB is the number of EST clones derived from non-tumor brain tissue within a window of 5 Mbp.
R is the number of EST clones derived from all non-tumor tissue pooled libraries within a window of 5 Mbp.
TNB is the number of total mapped EST clones derived from non-tumor brain tissues.
T is the number of total mapped EST clones derived from all non-tumor tissue pooled libraries.
The distribution of TDFNB over all the 5 Mbp regions should approximate a Gaussian distribution. Theoretically, TDFNB should approach 0 if the expression level of genes in brain tissues does not differ from that in pooled tissues within a 5 Mbp chromosomal region. We define all the regions with high TDFNB (Z-score >= 2) as brain-enriched regions. Chromosome regions that have high brain-enriched gene expression are referred to as high TDFNB regions. Chromosome regions that have low brain-enriched gene expression are referred to as low TDFNB regions.
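The windowed calculation and Z-score step can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names are hypothetical, windows are assumed to start at position 0 and may be partial at the chromosome end, windows with a zero count are given no TDF, and the population standard deviation is assumed for the Z-score (the text does not specify these details).

```python
import math
from statistics import mean, pstdev

WINDOW_BP = 5_000_000  # 5 Mbp window, as stated in the text
STEP_BP = 1_000_000    # 1 Mbp interval, as stated in the text


def window_counts(positions, chrom_length, window=WINDOW_BP, step=STEP_BP):
    """Count mapped EST positions in each window sliding along one chromosome."""
    counts = []
    start = 0
    while start < chrom_length:
        end = start + window
        counts.append(sum(1 for p in positions if start <= p < end))
        start += step
    return counts


def tdf(r_tissue, r_all, t_tissue, t_all):
    """Return ln[(r_tissue/r_all)/(t_tissue/t_all)], or None if any count is zero."""
    if min(r_tissue, r_all, t_tissue, t_all) <= 0:
        return None  # assumption: the ratio is undefined for empty windows
    return math.log((r_tissue / r_all) / (t_tissue / t_all))


def z_scores(values):
    """Standardize a list of TDF values (population standard deviation assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(v - mu) / sigma for v in values]
```

A window in which brain ESTs make up the same fraction as they do genome-wide gives a TDF of 0; windows whose Z-score is at least 2 would be flagged as high TDFNB regions.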
The calculation of transcript density factor for brain tumor (TDFTB) is similar to the calculation of TDFNB.
TDFTB = ln[(RTB/R)/(TTB/T)]
RTB is the number of EST clones derived from tumor brain tissue within a window of 5 Mbp.
R is the number of EST clones derived from all non-tumor tissue pooled libraries within a window of 5 Mbp.
TTB is the number of total mapped EST clones derived from tumor brain tissue libraries.
T is the number of total mapped EST clones derived from all non-tumor tissue libraries.
Correlation analysis of distribution pattern of TDF using public and Incyte data
The Pearson correlation coefficient (r) represents the degree of similarity (strength of correlation) between two sets of data. The correlation coefficient is calculated as follows:
$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2 \, \sum_i (Y_i - \bar{Y})^2}}$$
where X_i is the TDF value derived from the public data for the i-th chromosome region, Y_i is the corresponding TDF value derived from the Incyte data, and \bar{X} and \bar{Y} are the respective means. Values for the Pearson correlation coefficient range from -1 to 1, where zero indicates no correlation, -1 indicates a perfect negative correlation and 1 indicates a perfect positive correlation.
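As an illustration, the coefficient can be computed directly from paired per-region TDF values using the standard textbook definition. The function name and the pairing of values by list position are assumptions made for this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)
```

Here `xs` would hold the TDF values from the public data and `ys` the values from the Incyte data for the same regions, in the same order.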
The calculation of gene density
Using a 5 Mbp window moving along the chromosome in 1 Mbp steps, the gene density ratio is defined as the UniGene (Version 5.002) [4, 25] count in each 5 Mbp region divided by the average UniGene count per 5 Mbp region. The average gene density ratio is therefore equal to 1.0.
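A minimal sketch of this normalization is given below. It assumes that the average is taken over all windows supplied to the function (for example, all windows genome-wide); the function name and input layout are hypothetical.

```python
def gene_density_ratios(unigene_counts):
    """Divide each window's UniGene count by the mean count over all windows."""
    if not unigene_counts:
        raise ValueError("no windows supplied")
    average = sum(unigene_counts) / len(unigene_counts)
    if average == 0:
        raise ValueError("average UniGene count is zero; ratio is undefined")
    return [count / average for count in unigene_counts]
```

By construction the ratios average to 1.0, so a window with a ratio near 1 has a typical gene density.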
- Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MHM, Xiao H, Merril CR, Wu A, Olde B, Moreno RF: Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991, 252: 1651-1656.
- Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, Kirkness EF, Weinstock KG, Gocayne JD, White O: Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature. 1995, 377 (suppl.): 3-174.
- Polymeropoulos MH, Xiao H, Sikela JM, Adams MD, Venter JC: Chromosomal distribution of 320 genes from a brain cDNA library. Nature Genet. 1993, 4: 381-386.
- Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tome P, Aggarwal A, Bajorek E: A gene map of the human genome. Science. 1996, 274: 540-546. 10.1126/science.274.5287.540.
- Maglott DR, Katz KS, Sicotte H, Pruitt KD: NCBI's LocusLink and Refseq. Nucleic Acids Res. 2000, 28: 126-128. 10.1093/nar/28.1.126.
- Ewing B, Green P: Analysis of expressed sequence tags indicates 35000 human genes. Nature Genet. 2000, 25: 232-234. 10.1038/76115.
- Roest Crollius H, Jaillon O, Bernot A, Dasilva C, Bouneau L, Fischer C, Fizames C, Wincker P, Brottier P, Quetier F, Saurin W, Weissenbach J: Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 2000, 25: 235-238. 10.1038/76118.
- Nowak R: Entering the postgenome era. Science. 1995, 270: 368-369.
- Adams MD: Progress towards a complete set of human genes. In Genomes, molecular biology and drug discovery. Edited by: MJ Browne, PL Thurby. 1996, Academic Press, London, UK.
- Bains W: Virtually sequenced: the next genomic generation. Nature Biotechnol. 1996, 14: 711-713.
- Collado-Vides J: A syntactic representation of units of genetic information – a syntax of units of genetic information. J. Theor. Biol. 1991, 148: 401-429.
- Finlay BB, Falkow S: Common themes in microbial pathogenicity revisited. Microbiol. Mol. Biol. Rev. 1997, 61: 136-169.
- Wansink DG, Schul W, van der Kraan I, van Steensel B, van Driel R, de Jong L: Fluorescent labeling of nascent RNA reveals transcription by RNA polymerase II in domains scattered throughout the nucleus. J. Cell Biol. 1993, 122: 283-293.
- Wei X, Somanathan S, Samarabandu J, Berezney R: Three-dimensional visualization of transcription sites and their association with splicing factor-rich nuclear speckles. J. Cell Biology. 1999, 146: 543-58. 10.1083/jcb.146.3.543.
- Jackson DA, Iborra FJ, Manders EM, Cook PR: Numbers and organization of RNA polymerases, nascent transcripts, and transcription units in HeLa nuclei. Mol. Biol. Cell. 1998, 9: 1523-1536.
- Ekins R, Chu FW: Microarrays: their origins and applications. Trends in Biotechnology. 1999, 17: 217-218. 10.1016/S0167-7799(99)01329-3.
- Audic S: The significance of Digital Gene Expression Profiles. Genome Res. 1997, 7: 986-995.
- Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial analysis of gene expression. Science. 1995, 270: 484-487.
- International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921. 10.1086/172716.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA: The sequence of the human genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.
- Danchin A, Guerdoux-Jamet P, Moszer I, Nitschk P: Mapping the bacterial cell architecture into the chromosome. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 2000, 355: 179-190. 10.1098/rstb.2000.0557.
- Glusman G, Yanai I, Rubin I, Lancet D: The Complete Human Olfactory Subgenome. Genome Res. 2001, 11: 685-702. 10.1101/gr.171001.
- Campbell RD, Trowsdale J: Map of the human MHC. Immunol. Today. 1993, 14: 349-352. 10.1016/0167-5699(93)90234-C.
- Huang X, Adams MD, Zhou H, Kerlavage AR: A Tool for Analyzing and Annotating Genomic Sequences. Genomics. 1997, 46: 37-45. 10.1006/geno.1997.4984.
- Schuler GD: Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med. 1997, 75: 694-698. 10.1007/s001090050155.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.