Inferring genetic architecture of complex traits using Bayesian integrative analysis of genome and transcriptome data
© Ehsani et al.; licensee BioMed Central Ltd. 2012
Received: 27 March 2012
Accepted: 24 August 2012
Published: 5 September 2012
To understand the genetic architecture of complex traits and bridge the genotype-phenotype gap, it is useful to study intermediate -omics data, e.g. the transcriptome. The present study introduces a method for simultaneous quantification of the contributions from single nucleotide polymorphisms (SNPs) and transcript abundances in explaining phenotypic variance, using Bayesian whole-omics models. Bayesian mixed models and variable selection models were used and, based on parameter samples from the model posterior distributions, explained variances were further partitioned at the level of chromosomes and genome segments.
We analyzed three growth-related traits: Body Weight (BW), Feed Intake (FI), and Feed Efficiency (FE), in an F2 population of 440 mice. The genomic variation was covered by 1806 tag SNPs, and transcript abundances were available from 23,698 probes measured in the liver. Explained variances were computed for models using pedigree, SNPs, transcripts, and combinations of these. Comparison of these models showed that for BW, a large part of the variation explained by SNPs could be covered by the liver transcript abundances; this was less true for FI and FE. For BW, the main quantitative trait loci (QTLs) are found on chromosomes 1, 2, 9, 10, and 11, and the QTLs on chromosomes 1, 9, and 10 appear to be expression quantitative trait loci (eQTLs) affecting gene expression in the liver. Chromosome 9 is a clear example of an apparent eQTL: its genomic variance disappears, and a tri-modal distribution of genomic values collapses, when gene expressions are added to the model.
With increased availability of various -omics data, integrative approaches are promising tools for understanding the genetic architecture of complex traits. Partitioning of explained variances at the chromosome and genome-segment level clearly separated regulatory and structural genomic variation as the areas where SNP effects disappeared/remained after adding transcripts to the model. The models that include transcripts explained more phenotypic variance and were better at predicting phenotypes than a model using SNPs alone. The predictions from these Bayesian models are generally unbiased, validating the estimates of explained variances.
Keywords: Bayesian; Body Weight; Feed Intake; Genome; Transcriptome; eQTL; Variance
Large amounts of genomic information generated from Single Nucleotide Polymorphism (SNP) microarrays have become available in recent years for many species [1–3]. This genomic information is used to detect polymorphisms that contribute to variation in economically important traits, such as production traits in farm animals. Microarray technology is also used to screen the expression levels of thousands of genes, i.e., the transcriptome [4, 5]. Studies have shown that genetic background can have a large impact on differential expression. Integrating genome and transcriptome information can help to elucidate the underlying biology of the genotype-phenotype map, using expression Quantitative Trait Locus (eQTL) mapping.
However, in the eQTL approach, associations between SNPs, transcript levels, and phenotypes are analyzed individually. This is likely to contribute to “missing heritability”, because corrections for multiple testing lead to a high false negative rate, and because multiple SNPs and transcript levels that jointly explain the phenotype are ignored [9, 10]. Here we propose and demonstrate Bayesian models that fit all SNPs and transcript levels simultaneously to obtain the variances explained by the whole genome and the whole transcriptome. In these models, we identify eQTLs as those SNPs whose effects disappear when transcript levels are added to the model. Genomic- and transcriptomic-explained variances are further partitioned by chromosome and genome section to offer a view of the genetic architecture at different levels of aggregation.
Bayesian variable selection (BVS) models were chosen because they separate markers with large or moderate effects from those with small effects and locate the important regions in the genome or transcriptome. This makes them a better QTL mapping method, because they produce clearer QTL signals. Furthermore, prediction based on genomic variables using BVS is more accurate, even when the prior is not correct [11–14]. Simpler methods also suffer from “missing heritability” [15, 16].
The aim of this study was to explore the contributions of various sources of variation, such as population structure, SNP variants, and gene expression levels, to a set of growth-related traits (body weight, feed intake, and feed efficiency) in mice. These traits are important both for agricultural production and for obesity in humans. Bayesian mixed models and Bayesian variable selection models were applied to model pedigree, SNPs and/or gene expressions and to derive explained variances for these components. In addition, they were used to partition the variances explained by SNPs and gene expressions by chromosome and genome section. To validate the estimates of explained variances, the predictive ability of these models was studied using cross validation.
An M16 × ICR F2 population of 440 mice was available with complete records for body weight at 8 weeks (BW) and 337 records for feed intake (FI) and feed efficiency (FE), measured during the period from 3 weeks to 8 weeks. An additional 89 pedigree records were available that described the family structure up to the F0 founder lines. Data were obtained in three batches, and the sex of the animals was recorded. At the end of the experiment, the mice were sacrificed and liver tissue was extracted for genome-wide expression profiling. RNA isolation, cDNA synthesis, array hybridization, normalization of probe level intensity, and annotation of data were performed as described in detail by. Genotypes for 1806 highly informative single nucleotide polymorphisms (SNPs) were available for each animal. These tag SNPs were used to trace the genomic variation in this F2 population. Density functions of the phenotypes are available in Additional file 1, and the complete data set is publicly available at http://gbi.agrsci.dk/~pso/BIG_genome_transcriptome/.
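The complete model for the vector of phenotypes y was

$$y = 1\mu + Xb + Zu + Wa + Qg + e \qquad (1)$$

where μ is the overall mean, b is the vector of batch and sex effects, X is the design matrix for batch and sex effects, u is the vector of polygenic effects, Z is a design matrix that links polygenic effects to the observed records, a is the vector of SNP effects, W is a matrix with 1806 SNP covariates, g is the vector of gene expression effects, Q is a matrix with 23,698 gene expression covariates, and e is the vector of residuals. The SNP and gene expression covariates were centered and scaled to unit variance.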
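The centering and scaling of the SNP and gene expression covariates can be illustrated with a minimal sketch. It uses only the Python standard library; the function name `standardize_columns`, the toy genotype codes, and the use of the population standard deviation are assumptions made for illustration and are not taken from the original analysis.

```python
import statistics


def standardize_columns(matrix):
    """Center every column to mean 0 and scale it to unit variance."""
    scaled_columns = []
    for column in zip(*matrix):
        mean = statistics.fmean(column)
        sd = statistics.pstdev(column)  # population SD; an assumption
        if sd == 0:
            raise ValueError("a constant covariate cannot be scaled")
        scaled_columns.append([(value - mean) / sd for value in column])
    # Transpose back so that rows are animals again.
    return [list(row) for row in zip(*scaled_columns)]


# Made-up genotype codes (0/1/2) for four animals and two SNPs.
W_raw = [[0, 2], [1, 1], [2, 0], [1, 1]]
W = standardize_columns(W_raw)
```

With covariates on this common scale, a single per-SNP or per-probe variance parameter has the same meaning for every column.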
Based on the work of [19–22], the Bayesian mixed model version assigns normal priors to the vectors u, a, g, and e in (1), i.e., $u \sim N(0, A\sigma^2_u)$, $a \sim N(0, I\sigma^2_a)$, $g \sim N(0, I\sigma^2_g)$, and $e \sim N(0, I\sigma^2_e)$, where $\sigma^2_u$ is the polygenic variance and A is the numerator relationship matrix based on pedigree information, $\sigma^2_a$ is the per-SNP explained variance, $\sigma^2_g$ is the per-gene expression explained variance, and $\sigma^2_e$ is the residual or environmental variance. These four variances are estimated in the model using flat prior distributions, i.e., $p(\sigma^2_u) \propto 1$, $p(\sigma^2_a) \propto 1$, $p(\sigma^2_g) \propto 1$, and $p(\sigma^2_e) \propto 1$ on the positive real line. The remaining parameters in (1), μ and b, are assigned flat prior distributions, which is the Bayesian analog of fitting “fixed effects” (unshrunken) estimates. A Markov chain Monte Carlo (MCMC) algorithm was applied in the software bayz to obtain samples from the posterior distribution of the model parameters. MCMC algorithms for sampling effects and variances in mixed models have been extensively described, for a general overview see. The Monte Carlo accuracy of the MCMC algorithm was evaluated by correlating repeated estimates for the parameter vectors u, a and g, requiring a correlation >0.999 from repeated MCMC runs, and by computing the effective sample sizes for the variance components using the R coda package.
The variance in y from (1) is partitioned as var(Zu) + var(Wa) + var(Qg) + var(e). To obtain posterior means (PMs) and posterior standard deviations (PSDs) of the explained variances for SNPs and gene expressions, var(Wa) and var(Qg) were evaluated based on the posterior samples for a and g from the MCMC, i.e., as the PM and PSD of the var(Wa_t) values over MCMC cycles, where a_t is the posterior sample for a from MCMC cycle t (and likewise var(Qg_t) for g). This procedure is not required for the polygenic variance, because Z is a design matrix, unlike W and Q, which are covariate matrices.
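The evaluation of explained variance over MCMC cycles can be sketched as follows. This is an illustrative sketch rather than the bayz implementation: the function name, the toy covariate matrix, and the three made-up "posterior samples" are assumptions, and the sample variance (divisor n − 1) is used although the original divisor is not stated.

```python
import statistics


def explained_variance_samples(W, effect_samples):
    """Return var(W a_t) for every posterior sample a_t."""
    variances = []
    for a_t in effect_samples:
        # Fitted genomic value of each animal under this sample.
        fitted = [sum(w * a for w, a in zip(row, a_t)) for row in W]
        variances.append(statistics.variance(fitted))
    return variances


# Toy standardized covariates (4 animals x 2 SNPs) and 3 made-up samples.
W = [[-1.0, 1.0], [0.0, 0.0], [1.0, -1.0], [0.0, 0.0]]
samples = [[0.5, 0.0], [0.4, 0.1], [0.6, -0.1]]

v = explained_variance_samples(W, samples)
posterior_mean = statistics.fmean(v)  # PM of the explained variance
posterior_sd = statistics.stdev(v)    # PSD of the explained variance
```

The same function applies unchanged to Q and the samples of g to obtain the transcriptomic explained variance.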
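In the variable selection version of the model, the normal priors for a and g are replaced by two-component normal mixture priors,

$$a_i \mid \gamma_{a,i} \sim \begin{cases} N(0, \sigma^2_{a1}) & \text{if } \gamma_{a,i} = 1 \\ N(0, \sigma^2_{a0}) & \text{if } \gamma_{a,i} = 0 \end{cases} \qquad g_i \mid \gamma_{g,i} \sim \begin{cases} N(0, \sigma^2_{g1}) & \text{if } \gamma_{g,i} = 1 \\ N(0, \sigma^2_{g0}) & \text{if } \gamma_{g,i} = 0 \end{cases}$$

where $\sigma^2_{a1}$ and $\sigma^2_{a0}$ are the “large” and “small” variances in the mixture distribution for a, $\sigma^2_{g1}$ and $\sigma^2_{g0}$ are the “large” and “small” variances in the mixture distribution for g, and $\gamma_a$ and $\gamma_g$ are vectors of 0/1 indicator variables for a and g, respectively, indicating whether the i th element in a or g comes from the distribution with large or small variance. The variances were all estimated from the data using unbounded flat prior distributions. The constraints $\sigma^2_{a1} > \sigma^2_{a0}$ and $\sigma^2_{g1} > \sigma^2_{g0}$ were applied using a rejection sampler, so that “large” and “small” effects remained identifiable. The priors for the indicator variables were taken as $\gamma_{a,i} \sim \mathrm{Bernoulli}(\pi_a)$ and $\gamma_{g,i} \sim \mathrm{Bernoulli}(\pi_g)$, where $\mathrm{Bernoulli}(\pi)$ means a Bernoulli distribution for a 0/1 indicator with a probability π for a 1. The parameters $\pi_a$ and $\pi_g$ were taken as known. The MCMC implementation of this model is relatively straightforward, because conditional on the indicator variables the model remains a mixed model. The updating of the mixture indicators is described in. This model was also run in the software bayz, and the Monte Carlo accuracy was evaluated in the same way as for the mixed model version.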
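To make the role of the 0/1 indicators concrete, the sketch below shows the textbook conditional update for one indicator in a two-component normal mixture: given the current effect, the indicator is drawn from a Bernoulli distribution whose probability weighs the "large" and "small" normal densities by the prior probability. This is a generic illustration and not necessarily the exact update used in bayz; the function names and the numerical values are hypothetical.

```python
import math
import random


def normal_pdf(x, variance):
    """Density of N(0, variance) evaluated at x."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2 * math.pi * variance)


def prob_large(effect, var_large, var_small, pi):
    """P(indicator = 1 | effect) for a two-component normal mixture."""
    numerator = pi * normal_pdf(effect, var_large)
    denominator = numerator + (1 - pi) * normal_pdf(effect, var_small)
    return numerator / denominator


def sample_indicator(effect, var_large, var_small, pi, rng):
    """Draw the 0/1 indicator from its conditional Bernoulli distribution."""
    return 1 if rng.random() < prob_large(effect, var_large, var_small, pi) else 0


# Hypothetical values: an effect near zero is mostly assigned to the
# "small" component, a clearly non-zero effect to the "large" one.
p_near_zero = prob_large(0.0, var_large=1.0, var_small=0.01, pi=0.5)
p_far = prob_large(2.0, var_large=1.0, var_small=0.01, pi=0.5)
draw = sample_indicator(2.0, 1.0, 0.01, 0.5, random.Random(1))
```

For very large effects or very small variances the densities can underflow, so a production sampler would typically work with log-densities instead.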
From the posterior samples for a and g in the variable selection model, explained variances were computed and partitioned by chromosome and by genome section. The variable selection model is more suited to such a partitioning because, unlike the mixed model version, it allows for different variance contributions per SNP. The explained variances were evaluated in the same way as for the mixed model, by evaluating var(Wa_t) and var(Qg_t) over MCMC cycles t, except that the a and g samples are obtained under the mixture model prior assumptions. The same expressions can be evaluated for subsets of the SNPs or gene expressions to obtain explained variances per chromosome and for small windows of SNPs within chromosomes. Variance within a chromosome was computed using a 5-SNP sliding window to obtain a genomic variance profile.
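The partitioning described above amounts to evaluating the same variance expression on a subset of columns. The sketch below does this for one posterior sample; in practice it would be repeated for every MCMC cycle and summarized by the posterior mean. The function names, the chromosome labels, the toy data, and the window step of one SNP are assumptions made for illustration.

```python
import statistics


def subset_variance(W, a_t, columns):
    """var(W a_t) restricted to the SNPs listed in `columns`."""
    fitted = [sum(row[j] * a_t[j] for j in columns) for row in W]
    return statistics.variance(fitted)


def window_profile(W, a_t, chrom_columns, width=5):
    """Variance for each run of `width` consecutive SNPs on one chromosome."""
    n_windows = len(chrom_columns) - width + 1
    return [subset_variance(W, a_t, chrom_columns[s:s + width])
            for s in range(n_windows)]


# Toy covariates: 4 animals x 6 SNPs, with one made-up posterior sample.
W = [
    [1, 0, -1, 0, 1, -1],
    [-1, 0, 1, 0, -1, 1],
    [1, 1, -1, -1, 1, 1],
    [-1, -1, 1, 1, -1, -1],
]
a_t = [0.2, 0.0, 0.0, 0.1, 0.0, 0.3]

# Hypothetical SNP-to-chromosome map (column indices in map order).
chromosomes = {"chrA": [0, 1, 2], "chrB": [3, 4, 5]}
per_chromosome = {name: subset_variance(W, a_t, cols)
                  for name, cols in chromosomes.items()}

# 5-SNP sliding window over a stretch of six SNPs gives two windows.
profile = window_profile(W, a_t, list(range(6)), width=5)
```

Because SNPs in different subsets can be correlated in the sample, these partial variances need not add up exactly to the variance computed from all SNPs together.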
It is difficult to choose an optimal window size, as it depends on the extent of LD, the marker density, and an arbitrary cut-off for what is considered important LD. In the data analyzed here, the average R² between adjacent SNPs was 0.55, and the average R² between SNPs two apart was 0.39, which we considered sufficiently high to warrant computation of variances in a 5-SNP window.

To study the relative importance of family structure, SNPs, and gene expressions, six sub-models and the complete model (1) were used. These were models that use only pedigree information (PED), only SNP data (SNP), only gene expression data (GEX), SNP + GEX, PED + GEX, PED + SNP, and the complete model PED + SNP + GEX. All models included sex and batch effects.
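The average R² values between neighbouring SNPs that motivate the 5-SNP window can be computed along the lines of the following sketch. It assumes that R² is the squared Pearson correlation between genotype codes and that all SNPs in the input lie on one chromosome in map order; the function names and toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mean_r2_at_lag(genotypes, lag):
    """Average r^2 between SNPs that are `lag` positions apart."""
    columns = list(zip(*genotypes))  # one tuple per SNP
    values = [r_squared(columns[j], columns[j + lag])
              for j in range(len(columns) - lag)]
    return sum(values) / len(values)


# Made-up genotype codes: 4 animals (rows) x 3 ordered SNPs (columns).
genotypes = [[0, 0, 0], [1, 1, 2], [2, 2, 1], [1, 1, 1]]
adjacent = mean_r2_at_lag(genotypes, lag=1)
two_apart = mean_r2_at_lag(genotypes, lag=2)
```

On real data the calculation would be run per chromosome, so that pairs spanning a chromosome boundary are not counted as neighbours.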
The predictive ability of the models was evaluated using an 11-fold cross-validation. For body weight, the 440 records were divided randomly into 11 groups, each with 40 individuals. Feed intake and feed efficiency, with 337 records in total, were randomly divided into 10 groups of 30 records and one group of 37 records. The complete model, including all variance parameters, was re-estimated on each set of 10 folds, and predictions were computed for the phenotypes in the remaining 11th fold. All predictions from the 11-fold cross-validation were collected to compute, over the whole data set, the correlations between predicted and actual phenotypes and the regressions of predicted phenotypes on actual phenotypes. The slopes of these regression lines are expected to be 1 if the model produces unbiased predictions, which would validate the estimates of explained variances.

The University of Nebraska Institutional Animal Care and Use Committee approved all procedures and protocols.
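The fold layout and the slope check used in the cross-validation can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the fixed random seed are hypothetical, and the model fitting step itself (re-estimating the model on 10 folds) is left out.

```python
import math
import random


def make_folds(n_records, n_folds, seed=0):
    """Shuffle record indices and cut them into folds.

    The first n_folds - 1 folds have equal size; the last fold takes
    whatever remains (e.g. 337 records -> 10 folds of 30 and one of 37).
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    size = n_records // n_folds
    folds = [indices[k * size:(k + 1) * size] for k in range(n_folds - 1)]
    folds.append(indices[(n_folds - 1) * size:])
    return folds


def slope_and_correlation(predicted, actual):
    """Slope of the regression of predicted on actual, and their correlation."""
    n = len(actual)
    mean_p, mean_a = sum(predicted) / n, sum(actual) / n
    spa = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    saa = sum((a - mean_a) ** 2 for a in actual)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    return spa / saa, spa / math.sqrt(spp * saa)


bw_folds = make_folds(440, 11)  # body weight records
fi_folds = make_folds(337, 11)  # feed intake / feed efficiency records

# Toy check: predictions that are twice the actual values give slope 2.
slope, corr = slope_and_correlation([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Each record appears in exactly one fold, so pooling the held-out predictions yields one prediction per animal for the correlation and regression.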
Results and discussion
[Table: Explained variance in different models for Body Weight (BW), Feed Intake (FI), and Feed Efficiency (FE, ×10,000); models listed include PED + SNP, PED + GEX, SNP + GEX, and PED + SNP + GEX]
Overall, explained variances increased when gene expression information (GEX; data from liver) was added: in the most complete model (PED + SNP + GEX), explained variances were 88%, 75%, and 71% for BW, FI, and FE, respectively. This confirms the assumption that gene expressions can explain a larger part of phenotypic variance than genetic or genomic information, by capturing environmental, and possibly non-additive, genetic effects through the gene expressions [5, 30]. Information on the genetic architecture of these traits is best judged from the relative contributions of genomic and transcriptomic data in the SNP + GEX model.
This model shows that, for these traits, the liver transcriptome contributes a larger portion of explained variance. This is most pronounced for BW, with 18% of explained variance from the genome and 63% from the liver transcriptome. Thus, in this case, the predominant model is that SNPs regulate gene expressions to exert their effect on the phenotype.
This approach is suitable for gene-level resolution. However, gene-level resolution is highly data dependent: it requires high marker density and a study population with LD blocks that span small genomic regions. In this work we used an F2 cross from outbred lines, which has large LD blocks, and this kind of data has limited resolution for fine-mapping of QTL.
One may argue that the most complete model is more interesting for investigating genetic architecture and chromosomal or sub-chromosomal variance. However, as we have shown, SNPs and pedigree are largely confounded and explain about the same variance. This confounding becomes worse when both pedigree and SNPs are in one model (the PED + SNP model), as shown by the wider confidence intervals for the variance explained by pedigree. The model with only omics information (SNP + GEX) is therefore simpler, more accurate, and as effective as the model that also uses pedigree information. This is of interest for future applications of omics technologies, because we expect that pedigree information will often be absent.
[Table: Rank correlation (Spearman) between individual values predicted from different sources of information, pedigree (PED), SNP markers (SNP), and gene expression signals (GEX), in three traits; pairs listed: PED & SNP, SNP & GEX, PED & GEX]
[Table: Correlation between predicted and actual phenotypes with different sources of information; models listed include SNP + PED, GEX + PED, SNP + GEX, and SNP + GEX + PED]
With increased availability of various -omics data, integrative approaches are promising tools for understanding the genetic architecture of complex traits. We have developed a complementary approach to the univariate “eQTL” mapping, by considering Bayesian models that fit all genome-wide SNPs and transcript abundances in one model, and that estimate and partition explained variances by chromosome and genome segments. Our results show that, using gene expressions, more of the phenotypic variance can be explained and phenotypes can be better predicted. Predictions were also shown to be unbiased, which validates the assessed explained variances. The improvement of phenotype predictions using gene expression data will be useful for several applications in agriculture and medicine, although it should be assessed on a case-by-case basis as to whether a suitable tissue can be sampled for the gene expression measurements. Partitioning of the explained genomic variance at the level of chromosomes and genome segments showed clear examples of eQTL locations as regions where genomic variance disappears when gene expressions are added to the model. Our study used only gene expressions from the liver, and an obvious further extension is to include expressions from other tissues. The QTLs that did not disappear when transcripts are added to the model may be eQTLs that affect gene expression in a tissue other than liver. The Bayesian model is quite efficient for handling large sets of covariates, and extensions to include multiple sets of expressions will be feasible. We have not provided formal statistical tests in this model, but the Bayesian approach lends itself naturally to obtaining confidence intervals for (differences between) parameter estimates. The estimates of total explained variances from the Bayesian mixed model can also be obtained by a residual maximum likelihood (REML) approach. We verified this, and the Bayesian and REML estimates generally agree. 
However, using REML it is not feasible to utilize mixture priors to better discriminate between SNPs which contribute more or less variance, and to partition the variances at the sub-chromosome level, which is all straightforward in a Bayesian approach.
Our approach can easily be scaled up to higher-density arrays, and even to whole-genome sequence data, using the same variance components analysis that was applied to the gene expression probes in this study.
Abbreviations
SNPs: Single nucleotide polymorphisms
REML: Restricted maximum likelihood
QTL: Quantitative trait loci
eQTL: Expression quantitative trait loci
This research was supported in part by the Quantomics research project, which has been co-financed by the European Commission within the 7th Framework Programme, contract No. 222664. This work is part of a PhD project funded by a scholarship from the Ministry of Science, Research and Technology of Iran.
1. Hayes B, Goddard ME: Break-even cost of genotyping genetic mutations affecting economic traits in Australian pig enterprises. Livest Prod Sci. 2004, 89 (2–3): 235-242.
2. Wong GKS, et al: A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature. 2004, 432 (7018): 717-722. doi:10.1038/nature03156.
3. Gonzalez-Recio O, et al: Nonparametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics. 2008, 178 (4): 2305-2313. doi:10.1534/genetics.107.084293.
4. Cui XG, et al: Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics. 2005, 6 (1): 59-75. doi:10.1093/biostatistics/kxh018.
5. Chesler EJ, et al: Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37 (3): 233-242. doi:10.1038/ng1518.
6. Dworkin I, et al: Genomic consequences of background effects on scalloped mutant expressivity in the wing of Drosophila melanogaster. Genetics. 2009, 181 (3): 1065-1076. doi:10.1534/genetics.108.096453.
7. Schadt EE, et al: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet. 2005, 37 (7): 710-717. doi:10.1038/ng1589.
8. Manolio TA, et al: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747-753. doi:10.1038/nature08494.
9. Zuk O, et al: The mystery of missing heritability: genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012, 109 (4): 1193-1198. doi:10.1073/pnas.1119675109.
10. Hoggart CJ, et al: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008, 4 (7): e1000130. doi:10.1371/journal.pgen.1000130.
11. Xu SZ: Estimating polygenic effects using markers of the entire genome. Genetics. 2003, 163 (2): 789-801.
12. Meuwissen THE, Hayes BJ, Goddard ME: Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001, 157 (4): 1819-1829.
13. Habier D, Fernando RL, Dekkers JCM: The impact of genetic relationship information on genome-assisted breeding values. Genetics. 2007, 177 (4): 2389-2397.
14. de los Campos G, et al: Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics. 2009, 182 (1): 375-385. doi:10.1534/genetics.109.101501.
15. Yang JA, et al: Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010, 42 (7): 565-569. doi:10.1038/ng.608.
16. Visscher PM, Yang JA, Goddard ME: A commentary on 'common SNPs explain a large proportion of the heritability for human height' by Yang et al. (2010). Twin Res Hum Genet. 2010, 13 (6): 517-524. doi:10.1375/twin.13.6.517.
17. Allan MF, Eisen EJ, Pomp D: Genomic mapping of direct and correlated responses to long-term selection for rapid growth rate in mice. Genetics. 2005, 170 (4): 1863-1877. doi:10.1534/genetics.105.041319.
18. Dobrin R, et al: Multi-tissue coexpression networks reveal unexpected subnetworks associated with disease. Genome Biol. 2009, 10 (5): R55. doi:10.1186/gb-2009-10-5-r55.
19. Habier D, Fernando RL, Dekkers JC: Genomic selection using low-density marker panels. Genetics. 2009, 182 (1): 343-353. doi:10.1534/genetics.108.100289.
20. Meuwissen TH, et al: A fast algorithm for BayesB type of prediction of genome-wide estimates of genetic value. Genet Sel Evol. 2009, 41: 2. doi:10.1186/1297-9686-41-2.
21. Luan T, et al: The accuracy of Genomic Selection in Norwegian red cattle assessed by cross-validation. Genetics. 2009, 183 (3): 1119-1126. doi:10.1534/genetics.109.107391.
22. VanRaden PM, et al: Invited review: reliability of genomic predictions for North American Holstein bulls. J Dairy Sci. 2009, 92 (1): 16-24. doi:10.3168/jds.2008-1514.
23. Janss L: bayz manual. 2011, Leiden, the Netherlands: Bayesian Solutions.
24. Sorensen D, Gianola D: Likelihood, Bayesian and MCMC methods in quantitative genetics. 2002, New York: Springer-Verlag: Statistics for biology and health, 740.
25. Plummer M, et al: CODA: Convergence Diagnosis and Output Analysis for MCMC. R News. 2006, 7-11.
26. George EI, McCulloch RE: Variable selection via Gibbs sampling. J Am Stat Assoc. 1993, 88 (423): 881-889. doi:10.1080/01621459.1993.10476353.
27. Kapell DN, et al: Efficiency of genomic selection using Bayesian multimarker models for traits selected to reflect a wide range of heritabilities and frequencies of detected quantitative traits loci in mice. BMC Genet. 2012, 13 (1): 42.
28. Rolf MM, et al: Impact of reduced marker set estimation of genomic relationship matrices on genomic selection for feed efficiency in Angus cattle. BMC Genet. 2010, 11: 24.
29. Bink MCAM, et al: Bayesian analysis of complex traits in pedigreed plant populations. Euphytica. 2008, 161 (1–2): 85-96.
30. Chesler EJ, et al: Genetic correlates of gene expression in recombinant inbred strains - a relational model system to explore neurobehavioral phenotypes. Neuroinformatics. 2003, 1 (4): 343-357. doi:10.1385/NI:1:4:343.
31. Wuschke S, et al: A meta-analysis of quantitative trait loci associated with body weight and adiposity in mice. Int J Obes. 2007, 31 (5): 829-841.
32. Keightley PD, et al: A genetic map of quantitative trait loci for body weight in the mouse. Genetics. 1996, 142 (1): 227-235.
33. Brockmann GA, et al: Quantitative trait loci affecting body weight and fatness from a mouse line selected for extreme high growth. Genetics. 1998, 150 (1): 369-381.
34. Thompson R: Variance-components and animal breeding - Vanvleck, Ld, Searle, Sr. Biometrics. 1981, 37 (1): 201-202. doi:10.2307/2530542.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.