QTL global meta-analysis: are trait determining genes clustered?
© Salih and Adelson; licensee BioMed Central Ltd. 2009
Received: 24 September 2008
Accepted: 24 April 2009
Published: 24 April 2009
A key open question in biology is if genes are physically clustered with respect to their known functions or phenotypic effects. This is of particular interest for Quantitative Trait Loci (QTL) where a QTL region could contain a number of genes that contribute to the trait being measured.
We observed a significant increase in gene density within QTL regions compared to non-QTL regions and/or the entire bovine genome. By grouping QTL from the Bovine QTL Viewer database into 8 categories of non-redundant regions, we have been able to analyze gene density and gene function distribution, based on Gene Ontology (GO) with relation to their location within QTL regions, outside of QTL regions and across the entire bovine genome. We identified a number of GO terms that were significantly over represented within particular QTL categories. Furthermore, select GO terms expected to be associated with the QTL category based on common biological knowledge have also proved to be significantly over represented in QTL regions.
Our analysis provides evidence of over represented GO terms in QTL regions. This increased GO term density indicates possible clustering of gene functions within QTL regions of the bovine genome. Genes with similar functions may be grouped in specific locales and could be contributing to QTL traits. Moreover, we have identified over-represented GO terminology that from a biological standpoint, makes sense with respect to QTL category type.
Gene density has been shown to vary widely by organism and genomic region and has been measured both in terms of mean interval between genes and genes per mega base pair of DNA [1, 2]. It is known that gene density is positively correlated with G+C content  and that the heterochromatic regions surrounding centromeres and telomeres have a lower than average gene density [3–5]. In general, measurements of gene density have focused on correlations of gene density with chromosomal structure or base composition [2, 6]. However, to our knowledge no one has looked at the correlation of gene density with Quantitative Trait Locus (QTL) density over the genome. Furthermore, gene density on its own is a fairly crude measurement of the functional role of specific genomic domains. It would be more informative to combine this with quantitative information about the types of gene annotations found across the genome, but to date this has not been done. In this report we describe the correlation of gene density with chromosomal regions defined on the basis of their association with phenotypic traits (QTL regions) and we have determined if gene annotations associated with the phenotypes in question are over represented in these same regions. Our model system is the bovine genome because it has a wealth of well annotated QTL  and gene models that have been anchored to a high quality draft genome sequence assembly.
While quantifying gene annotations on the basis of gene descriptions is virtually impossible, quantitative distributions of gene function can be determined on the basis of Gene Ontology (GO) term annotations . A gene ontology is a controlled vocabulary within a structured hierarchy that describes gene products in a species independent manner. For us, GO terms provide a straightforward link from gene coordinates to phenotype. Gene ontologies have been used in many ways for the quantitative analysis of gene expression profiles, for gene set analysis and for general annotation analyses [9–11]. From our perspective, identifying over represented GO terms can provide insight into regional genomic function, and while statistical methods of measuring GO term distribution vary, we have adopted a commonly used method based on the hypergeometric distribution .
Until now, GO term analysis performed on the bovine genome has focused on very specific gene expression analysis [13–15]. We have carried out the first genome wide analysis of GO term use correlated with genomic regions known to control quantitatively regulated phenotypes (QTL). One of the challenges of mining GO terms is the large number of GO terms that are often not grouped very tightly by phenotype. One way of overcoming this problem is to use a GO slim, which is a cut down version of the GO. A GO slim contains a subset of terms in the whole GO and facilitates research by streamlining the ontologies for specific areas of interest . At the time we undertook this study there was no bovine GO slim, so we have created our own for this analysis and have deposited it with the GO consortium.
Results and discussion
QTL region breakdown
Gene density analysis
Partitioning the QTL into non-redundant classes provided the opportunity to quantify correlations between gene density and their associated ontology term frequencies. We measured the distance between the probability distributions for gene density between QTL and non-QTL regions using Welch's t-test. The advantage of this method over using the raw gene count distribution is that this approach gives very little weight to the 0 or low gene counts which would otherwise distort the result. Our analysis (see Additional File 1) demonstrated a statistically significant enrichment of gene density in QTL regions (average of 5.3 genes/5 Mbp bin) compared to non-QTL regions (average of 2.3 genes/5 Mbp bin) (p-value 0.042). This could indicate that genes are clustered in regions of the bovine genome that contribute to quantitative traits. We were not convinced that this correlation of increased gene density with QTL regions indicated functional clustering of genes in those regions. For this reason we decided to examine the quantitative distributions of gene functions with respect to QTL regions.
Enriched term (GO) t-test analysis
Second level GO terms that differed when comparing QTL to the full genome using the t-test.
QTL GO term density
Genome wide GO term density
Enriched term (GO) GeneMerge analysis
Second level GO terms that differed when comparing QTL to the entire genome using GeneMerge.
transcription regulator activity
extracellular region part
transcription regulator activity
transcription regulator activity
structural molecule activity
Lowest level GO terms that differed in QTL regions compared to the genome as a whole.
chloride channel activity
neurotransmitter receptor activity
signal recognition particle
GABA-A receptor activity
extracellular ligand-gated ion channel activity
gamma-aminobutyric acid signaling pathway
positive regulation of synaptic transmission
response to unfolded protein
positive regulation of microtubule polymerization
basal plasma membrane
negative regulation of endothelial cell proliferation
response to drug
response to peptide hormone stimulus
cytoplasmic vesicle membrane
phospholipid catabolic process
nitric-oxide synthase binding
transepithelial chloride transport
response to unfolded protein
advanced glycation end-product receptor activity
response to unfolded protein
response to nutrient
thyroid hormone receptor coactivator activity
vitamin D receptor binding
retinoic acid receptor binding
retinoid-X receptor activity
This type of analysis is biased by the nature and comprehensiveness of the annotations in the Gene Ontology and by the number of GO annotated gene models in the bovine genome. It is beyond the scope of this report to comment on the former, but since the bovine annotations depend overwhelmingly on the transfer of GO annotations from human, we know that many of the bovine gene models remain un-annotated. It is also likely that in spite of stringent sequence similarity criteria for the transfer of GO annotations that some will be incorrect.
GO slim result
During the course of this analysis a GO slim was created to reduce the large number of total GO terms to create a list of terms more specific for bovine analyses. We identified 272 terms whose meanings associated them with QTL, or terms enriched in QTL regions, or terms commonly known to be associated with physiologically/commercially important bovine traits that did not correspond to QTL.
The idea that genes are not randomly distributed throughout the genome can be traced back to R.A. Fisher , who showed that interacting genes tend to become more closely linked. More recently, tissue specific patterns of gene expression have been shown to map to chromosomal domains . Our quantitative analysis of the gene content of QTL regions should be viewed in this context, and was able to provide evidence that gene density is higher in QTL regions and that some gene functions, as reflected by GO terms are also over-represented in QTL regions. While many of the GO terms found to be associated with QTL categories were not obviously linked through a biological context, these results were consistent with the hypothesis that genes may be clustered in a manner that reflects their functional association with particular traits. This was most obvious for the 'milk yield' QTL category, where the associated GO terms were highly biologically plausible.
Placement of additional USDA Marc Markers onto Btau 3.1 assembly through BLAST and E-PCR
STS sequence data was downloaded from the NCBI website. All markers with unknown locations (i.e. not in the STS data from NCBI) were aligned to the Btau 3.1 assembly using MegaBLAST and BLASTN. Markers whose alignments were 100% identical with over 90% of their length were verified by ensuring that each marker could be placed on the same chromosome in both the linkage map and the sequence assembly. This method placed 78 sequence tagged sites onto the assembly. However, after both BLAST analyses, some markers could not be placed on the assembly because they either had no BLAST hit or none of their hits fulfilled the above criteria. There were also a number of linkage markers without NCBI accessions which not be anchored to the genome via BLAST/MegaBLAST due to a lack of sequence data . We attempted to place these markers using e-PCR, permitting 1 gap and 1 mismatch . E-PCR allowed us to place an additional 42 sequence tagged sites on the assembly.
14,354 bovine genes were annotated with Gene Ontology data from human orthologs by David Lynn. Additional non-annotated genes from the GLEAN 5 dataset were included for a total of 22,418 bovine gene models. We identified all the bovine gene models within each non-redundant QTL region.
Non-redundant QTL categories
Each QTL category was collapsed into non-redundant regions. Overlapping QTL regions for each QTL category were combined into single, contiguous non-redundant regions. Figure 1 provides an illustration of how Growth QTL were combined to create non-redundant regions. Non-QTL regions are locations of the genome in which no QTL are known to be present. There were a total of 597 QTL used in this study with the following breakdown: Body Conformation 47, Carcass 94, Disease Resistance 25, Fat 162, Growth 67, Milk Protein Yield 114, Milk Yield 61 and Reproduction 35.
Binning strategy employed for t-tests
QTL category regions and non-QTL regions of the genome were divided into sequential 5 Mb "bins." Gene counts/GO term counts were measured in each bin across the regions. For gene counts, a histogram plot of the bin counts showed that the distribution was not normal; mostly due to a large number of zero count bins. For gene density comparison, we transformed the bin counts into probability distributions of gene density, which removed the zero count bins and normalized the distribution. For GO terms similar problems were encountered, with many zero count bins. For this analysis we used a log transformation to remove the zero counts and normalize the distributions (see below).
GO analysis using counts normalized for gene density
Second level gene ontologies were counted in non-redundant QTL category regions. The structure of GO is that the child terms are to be more specific and targeted than their parent terms. Gene products associated with a GO term are expected to be loosely associated with the parent term and even more loosely associated each term up the ontology. Using a mysql database of GO terms downloaded from the Gene Ontology website http://www.geneontology.org/ we were able to cycle up terms from associated gene products, to higher parent terms. GO terms can be traversed to multiple parents. So as not to negate any possible contributing factors, all second level GO term parents of child GO terms were counted and incorporated into bin counts.
Where n = normalized bin value, x = gene count, y = GO term count
GeneMerge analysis was performed according to , using the raw GO term counts from all bins across QTL regions and the full genome. Genes within each of the 8 QTL categories were grouped and GO term frequencies associated with gene products from those genes were compared against GO term frequencies found across the genome as a whole (population). By using the second level GO terms from each QTL category regions as the subpopulation, via GeneMerge we identified second level GO terms found to be statistically significantly over represented. We compared fine level GO terms in the same manner; by grouping QTL region GO terms (subpopulation) and comparing against second level GO terms from the entire genome (population). To illustrate the results of the GeneMerge analysis integrated with the ratio of the frequency of GO terms within QTL regions to the genome, the Mayday platform  was used to create heat maps. A data file was produced and loaded into Mayday containing QTL categories, GO terms and values of GO term ratios between QTL regions and the genome. Non-statistically significant relationships were washed out by colour shading. The enhanced heat map (Fig. 3) displays which GO terms were significantly enriched, and the extent to which GO term frequency from particular QTL regions vary with respect to the entire genome.
We would like to thank Dr. David Lynn for providing GO terms associated with human orthologs to the GLEAN 5 gene models. DLA would like to thank Ian Franklin for critically reading the manuscript and providing useful feedback. Thanks also to the three anonymous reviewers for their input.
- Hey J, Kliman RM: Interactions between natural selection, recombination and gene density in the genes of Drosophila. Genetics. 2002, 160 (2): 595-608.PubMed CentralPubMedGoogle Scholar
- Gardiner K: Base composition and gene distribution: critical patterns in mammalian genome organization. Trends Genet. 1996, 12 (12): 519-524. 10.1016/S0168-9525(97)81400-X.View ArticlePubMedGoogle Scholar
- Sullivan BA, Blower MD, Karpen GH: Determining centromere identity: cyclical stories and forking paths. Nature reviews. 2001, 2 (8): 584-596. 10.1038/35084512.View ArticlePubMedGoogle Scholar
- Zoubak S, Clay O, Bernardi G: The gene distribution of the human genome. Gene. 1996, 174 (1): 95-102. 10.1016/0378-1119(96)00393-9.View ArticlePubMedGoogle Scholar
- Zakian VA: Structure and function of telomeres. Annual review of genetics. 1989, 23: 579-604. 10.1146/annurev.ge.23.120189.003051.View ArticlePubMedGoogle Scholar
- Federico C, Andreozzi L, Saccone S, Bernardi G: Gene density in the Giemsa bands of human chromosomes. Chromosome Res. 2000, 8 (8): 737-746. 10.1023/A:1026797522102.View ArticlePubMedGoogle Scholar
- Polineni P, Aragonda P, Xavier SR, Furuta R, Adelson DL: The bovine QTL viewer: a web accessible database of bovine Quantitative Trait Loci. BMC bioinformatics. 2006, 7: 283-10.1186/1471-2105-7-283.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.PubMed CentralView ArticlePubMedGoogle Scholar
- Volinia S, Evangelisti R, Francioso F, Arcelli D, Carella M, Gasparini P: GOAL: automated Gene Ontology analysis of expression profiles. Nucleic acids research. 2004, W492-499. 10.1093/nar/gkh443. 32 Web ServerGoogle Scholar
- Zhong S, Storch KF, Lipan O, Kao MC, Weitz CJ, Wong WH: GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Applied bioinformatics. 2004, 3 (4): 261-264. 10.2165/00822942-200403040-00009.View ArticlePubMedGoogle Scholar
- Xie H, Wasserman A, Levine Z, Novik A, Grebinskiy V, Shoshan A, Mintz L: Large-Scale Protein Annotation through Gene Ontology. 2002, Cold Spring Harbor Lab, 12: 785-Google Scholar
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochemical research. 2004, 29 (6): 1213-1222. 10.1023/B:NERE.0000023608.29741.45.View ArticlePubMedGoogle Scholar
- Suchyta SP, Sipkovsky S, Halgren RG, Kruska R, Elftman M, Weber-Nielsen M, Vandehaar MJ, Xiao L, Tempelman RJ, Coussens PM: Bovine mammary gene expression profiling using a cDNA microarray enhanced for mammary-specific transcripts. Physiological genomics. 2003, 16 (1): 8-18. 10.1152/physiolgenomics.00028.2003.View ArticlePubMedGoogle Scholar
- Byrne KA, Wang YH, Lehnert SA, Harper GS, McWilliam SM, Bruce HL, Reverter A: Gene expression profiling of muscle tissue in Brahman steers during nutritional restriction. Journal of animal science. 2005, 83 (1): 1-12.PubMedGoogle Scholar
- Harhay GP, Keele JW: Positional candidate gene selection from livestock EST databases using Gene Ontology. Bioinformatics (Oxford, England). 2003, 19 (2): 249-255. 10.1093/bioinformatics/19.2.249.View ArticleGoogle Scholar
- Castillo-Davis CI, Hartl DL: GeneMerge – post-genomic analysis, data mining, and hypothesis testing. Bioinformatics (Oxford, England). 2003, 19 (7): 891-892. 10.1093/bioinformatics/btg114.View ArticleGoogle Scholar
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (Oxford, England). 2005, 21 (18): 3587-3595. 10.1093/bioinformatics/bti565.View ArticleGoogle Scholar
- Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic acids research. 2003, 31 (13): 3775-3781. 10.1093/nar/gkg624.PubMed CentralView ArticlePubMedGoogle Scholar
- Veerkamp RF, Goddard ME: Covariance functions across herd production levels for test day records on milk, fat, and protein yields. J Dairy Sci. 1998, 81 (6): 1690-1701.View ArticlePubMedGoogle Scholar
- Hankins OG, Ellis NR: Fat in Relation to Quantity and Quality Factors of Meat Animal Carcasses. Journal of animal science. 1939, 314-319.Google Scholar
- Fisher RA: The genetical theory of natural selection. 1930, Oxford: Clarendon PressView ArticleGoogle Scholar
- Yamashita T, Honda M, Takatori H, Nishino R, Hoshino N, Kaneko S: Genome-wide transcriptome mapping analysis identifies organ-specific gene expression patterns along human chromosomes. Genomics. 2004, 84 (5): 867-875. 10.1016/j.ygeno.2004.08.008.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology. 1990, 215 (3): 403-410.View ArticlePubMedGoogle Scholar
- Schuler GD: Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends in biotechnology. 1998, 16 (11): 456-459. 10.1016/S0167-7799(98)01232-3.View ArticlePubMedGoogle Scholar
- Dietzsch J, Gehlenborg N, Nieselt K: Mayday – a microarray data analysis workbench. Bioinformatics (Oxford, England). 2006, 22 (8): 1010-1012. 10.1093/bioinformatics/btl070.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.