Trait-trait dynamic interaction: 2D-trait eQTL mapping for genetic variation study
© Sun et al. 2008
Received: 16 June 2007
Accepted: 23 May 2008
Published: 23 May 2008
Skip to main content
© Sun et al. 2008
Received: 16 June 2007
Accepted: 23 May 2008
Published: 23 May 2008
Many studies have shown that the abundance level of gene expression is heritable. Analogous to the traditional genetic study, most researchers treat the expression of one gene as a quantitative trait and map it to expression quantitative trait loci (eQTL). This is 1D-trait mapping. 1D-trait mapping ignores the trait-trait interaction completely, which is a major shortcoming.
To overcome this limitation, we study the expression of a pair of genes and treat the variation in their co-expression pattern as a two dimensional quantitative trait. We develop a method to find gene pairs, whose co-expression patterns, including both signs and strengths, are mediated by genetic variations and map these 2D-traits to the corresponding genetic loci. We report several applications by combining 1D-trait mapping with 2D-trait mapping, including the contribution of genetic variations to the perturbations in the regulatory mechanisms of yeast metabolic pathways.
Our approach of 2D-trait mapping provides a novel and effective way to connect the genetic variation with higher order biological modules via gene expression profiles.
In this work, we shall investigate a different type of genetic interference that cannot be analyzed under the above one-gene one-trait formulation. Recent studies have demonstrated that transcription regulation of functionally associated genes can be dependent on the relevant cellular states such as fluctuations in the levels of nutrients, metabolites, hormones or other signaling molecules . This raises the question of whether the co-regulation pattern, hence the co-expression pattern of two genes is affected by genetic variations. Figure 1(b) suggests a scenario where a DNA polymorphism captured by marker Z affects the co-expression pattern of a pair of genes Gene2 and Gene3. Suppose Z has two alleles, A and B. For the cells with allele B, Gene2 and Gene3 are seen to be positively co-expressed. But if the genetic variation (from B to A) affects the regulatory mechanism, we may observe weaker or even sign-changing correlation between Gene2 and Gene3. We shall develop a method to map the change of co-expression/correlation pattern of two traits to genetic loci, and refer it as 2D-trait mapping.
Our 2D-trait mapping approach is different from the traditional multivariate analysis of quantitative traits [13–15]. In these statistical multi-trait models, although different traits are correlated, the covariance between traits is assumed to be unaffected by genotypes. In other words, genotypes can only affect the multi-trait mean, but not the covariance. Thus multivariate trait analysis in the literature did not address the issue to be studied here.
The rest of the paper is organized as follows. In method section, we present our 2D-trait mapping approach and the eQTL data. In result section, we first carry out the full genome study on 16.5 millions of gene pairs, and then concentrate on gene pairs from metabolic pathways. We also use 2D-trait mapping to study how a transcription factor (TF) regulates the expression of its target genes. In discussion section, we summarize the merits of our approach and discuss some directions of generalization.
We apply our 2D-trait mapping method to a yeast eQTL dataset , which includes data from a genetically variable population of 40 yeast segregants generated from a cross of two budding yeast strains: a standard laboratory strain (BY) and a wild isolate from a California vineyard (RM). Data for 6229 gene expression traits and 3313 single nucleotide polymorphism (SNP) markers are collected for each yeast segregant. The genotype profile of each marker is a binary vector, indicating from which parental strain the allele is inherited. The gene expression data is downloaded from Gene Expression Omnibus (GEO) : GDS 91 and GDS 92 in series GSE37. The genotype data are downloaded from Leonid Kruglyak's laboratory's website . The genotype profiles of neighboring markers tend to have very high correlations and some are even identical. We merge adjacent markers into marker blocks sequentially using the criterion that any two marker profiles within one block are either the same or different by only one segregant. The 3313 markers are merged into 667 marker blocks. The dichotomized centroid of all the markers within a marker block is used to represent the marker block [See section 1 of Additional file 1 for details].
An anonymous referee pointed out that, as an alternative to the marker block approach, traditional interval mapping approach where one estimates line origin at fixed intervals can be used to obtain continuous Z.
Quantifying the co-expression pattern between two genes (2D trait) is more complicated than quantifying the expression level of one gene (1D trait). We employ the statistical concept "Liquid Association" (LA)  to address this 2D problem. Suppose X, Y, Z are continuous random variables with mean 0 and variance 1. Then the correlation between X and Y is just E(XY). LA aims to describe the change of the conditional expectation g(z) = E(XY|Z = z). If Z is a continuous random variable, change of the conditional expectation can be described by its derivative. This leads to the mathematical definition of LA:
LA(X, Y|Z) = E(g'(Z)) (1)
A simple estimation of LA score is available if Z follows the standard normal distribution:
LA(X, Y|Z) = E(XYZ) . To ensure normality, the normal quantile transformation is applied to each gene. This data pre-processing step is carried out after downloading the expression data from GEO. In 2D-trait mapping to binary markers, random variables X and Y represent the normalized gene expression profiles of two genes, while Z represents a marker genotype profile which only takes two possible values 0 and 1 and indicates from which parental strain the segregant inherits Z. We define Liquid Association score specifically for binary variable Z using a proper rescaling of Z. Assume P(Z = 1) = a and P(Z = 0) = b (a + b = 1). We transform Z to Z' such that if Z = 0; if Z = 1.
Comparing equation (2) with equation (1), we see that "difference" is used to replace "derivative". The term can be viewed as a penalty for unbalance in the frequencies of two alleles of Z. It takes the maximum value if a = b = 1/2 (the case of Hardy-Weinberg equilibrium) and it gets smaller as the difference between a and b becomes larger. To compute (2), we replace E(XY|Z = 1) and E(XY|Z = 0) by the average of XY for all yeast segregants with Z = 1 and Z = 0 respectively.
LA can reflect the intuitive change in the co-expression/co-regulation of a pair of genes X and Y. Following most gene expression studies , the baseline expression of a gene is given by the average expression of all yeast segregants from the cross of RM strain and BY strain, which has been set to zero for each gene because of the normalization at the data preprocessing stage. A segregant with X > 0 and Y > 0 (X < 0 and Y < 0, respectively) is a case of an up-regulation (down-regulation, respectively) in both genes. Co-up-regulation or co-down-regulation contributes a positive value in the product XY. Likewise, contra-expression/contra-regulation (either "X > 0 and Y < 0", or "X < 0 and Y > 0") contributes a negative value in the product XY. Thus by calculating the average contribution to the product XY from segregants inheriting Z from one strain and compare it with that from segregants inheriting Z from another strain, we can quantify the change of co-regulation pattern. For the ideal scenario as depicted in Figure 1(b), E(XY|Z = 1) is positive because for Z = 1 (green dots) most cases are co-regulated; E(XY|Z = 0) is negative because for Z = 0 (red dots) most cases are contra-regulated; thus the net LA score is positive. LA can also detect patterns like Figure 1(c). But LA will not detect patterns like Figures 1(d), 1(e) where one or both genes are connected to Z only through 1D linkage. Specifically, for Figure 1(d), the expression of Gene 6 has a shift between Z = 0 and Z = 1. The contribution to E(XY|Z = 1) comes from both co-up-regulation and contra-regulation, which by and large cancels out each other because of symmetry. Similarly, the contribution to E(XY|Z = 0) comes from both co-down-regulation and contra-regulation, which again cancel out each other. Therefore LA score will be close to 0. In Figure 1(e), E(XY|Z = 1) and E(XY|Z = 0) cancel out each other because both are positive and take about the same size. Therefore LA score is close to 0 as well. In general, the size of LA score quantifies the degree of change in the co-expression pattern of a pair of genes as the genotype of a marker changes. The sign of the LA score indicates the direction of the change. Large LA scores, either positive or negative are of special interest.
For each fixed gene pair (X, Y), we find the marker blocks with the most positive and the most negative LA scores respectively and define two LA linkage scores, LA(X, Y)max = max Z LA(X, Y|Z) and LA(X, Y)min = min Z LA(X, Y|Z). We consider positive and negative LA linkage scores separately because they have different implications. Since alleles from strain BY and RM are coded as 1 and 0 respectively, a positive LA linkage score indicates that the gene pair has a higher correlation when the allele is inherited from strain BY than from strain RM; a negative LA linkage score indicates just the opposite way of correlation change. To determine if a LA linkage score is statistically significant or not, we address the issue of multiple testing across markers by a permutation p-value procedure. Specifically, we randomly permute the yeast segregant labels to generate a reference distribution for the LA linkage scores. At each permutation, we re-compute the LA scores for all markers and record the most positive and most negative values. After performing M permutations, M being a large number, we obtain the reference distribution of LA linkage scores that represents the no-linkage situation. The p-value for an observed LA linkage score is computed by dividing the number of LA linkage scores in the reference distribution that are higher (for positive score) or lower (for the negative score) than the observed LA linkage score by M. Now suppose there are altogether N gene pairs under consideration and the permutation p-value cutoff is p. We calculate the false discovery rate [20, 21]: FDR = Np/D, where D is the number of gene pairs with permutation p-value ≤ p. Note that this is a conservative estimate of FDR by assuming the proportion of the null cases is 100%. The true proportion of null cases should be somewhat smaller but is in general not easy to estimate accurately despite of various suggestions in the literature. For more discussion on the complex issues concerning the estimation of null proportion in linkage studies, see .
To gain a full genome view of 2D-trait mapping, we computed the positive LA linkage score and the negative LA linkage score for every gene pair and used them to rank the gene pairs. We examined gene pairs with exceptionally large LA scores, both positive and negative, and found many of them are functionally associated. For instance, from the first 20 gene pairs with the highest positive LA linkage scores [See Table S1 in Additional file 2], we found nine pairs linked to marker block 348. We observed that all but one gene are associated with functions of mitochondria. Among these genes, CYC1 appears 4 times, pairing with mitochondrial small (MRPS28, RSM28) and large (MRPL22) subunits and YLR168C, which encodes a putative protein of unknown function that may be involved in intra-mitochondrial sorting. CYC1 encodes cytochrome c, which serves as the electron carrier of the mitochondrial inter-membrane space that transfers electrons from ubiquinone-cytochrome c oxidoreductase to cytochrome c oxidase during cellular respiration.
We further studied the LA 2D-trait mapping results in a genome-wide scale. We obtained two genome-wide distributions of LA linkage scores, one for positive scores and one for negative scores, based on a total of 16.5 million gene pairs. Because an exceedingly large number of gene pairs are involved in the mapping and the majority of them are probably biologically unrelated, the LA linkage scores are highly susceptible to random chance. To help assess the degree of impact by chance fluctuation, we generated a reference distribution of linkage scores under the assumption of no linkage using a simulation scheme [See section A of Additional file 2]. Briefly, we first put all gene expression values into a data pool, ignoring the gene names and the segregant identities. We then simulated each gene profile by randomly sampling from the data pool with replacement. For a pair of random gene expression profiles, we computed two most extreme LA linkage scores (across all the 667 marker blocks), one for the positive value and another for the negative value. After repeating this procedure 1 million times, we obtained one reference distribution for the positive linkage scores and one for the negative LA linkage scores. We then compared the genome-wide LA score distribution with the reference distribution by quantile to quantile (Q-Q) plot. As expected, we found a global linear pattern [Figure S1 in Additional file 2]. The small but important difference is revealed by subtraction [Figure S2 in Additional file 2], where one can find an upward shifting of the quantiles for the genome-wide positive LA linkage scores and a downward-shifting for the negative scores. If the 0.001 quantile from the reference distribution is used to draw the cutoff point (i.e., p-value = 1e-3), we would detect 47,445 significant positive LA gene pairs and 49,646 significant negative LA gene pairs. The corresponding false discovery rates are 34.1% and 33.1% for positive and negative LA scores, respectively. When lowering the p-value cutoff to be 1e-5, we can reduce the FDR to 27.3% and 13.6% for positive and negative LA results respectively, which are still modest. But a clear message is that although the information content in the tail part of the LA linkage distribution is only modestly rich, useful results can still be detected. The results need to be subject to closer biological discern however [More discussion in section A of Additional file 2].
Next, we ask the question whether some of these 1818 LA 2D-trait mapping results can also be obtained by 1D-trait mapping. Specifically, each 2D-trait mapping result is a triplet (X, Y, Z) where X and Y are two genes and Z is one marker block. In order for 1D mapping to link X to Z and Y to Z, the absolute value of correlation, |corr(X, Z)| and |corr(Y, Z)|, must be large enough. However, this is not the case. In fact, 95 percents of these values are less than 0.24. As a comparison, 95 percents of the 1D linkages mapped to eight 1D-hotspots of Brem et al. have correlation 0.59 or higher [see section D of Additional file 2]. An alternative way to temper the impact of chance errors is to control the number of gene pairs in each study. We shall restrict to functionally associated gene pairs next. Note that the reported FDRs hereafter are computed under the conservative assumption that all cases are null.
The pathway information is downloaded from Saccharomyces Genome Database . Among the 139 pathways annotated by SGD, 121 of them include at least 2 genes with expression profiles in the yeast eQTL dataset . The majority of these 121 pathways (78/121) have no more than five genes. Only 11 pathways have more than ten genes [Figure S2 in Additional file 1]. There are altogether 1711 gene pairs that can be formed from genes within the same pathway. The permutation p-value of the most positive or the most negative LA score for each gene pair is calculated based on 5000 permutations. At the permutation p-value cuto3 0.005, 207 gene pairs with positive LA scores (FDR = 4.13%) and 176 gene pairs with negative LA scores (FDR = 4.86%) are found1. Because of an overlap of 34 gene pairs, in total we get 349 unique gene pairs, covering 70 pathways [Figure S3 in Additional file 1]. The full list of these gene pairs and LA results are available at LA website .
We shall report the results of four pathways next. The first case is the leucine biosynthesis pathway. Brem et al.  have reported that expression levels of several leucine biosynthesis genes are linked to LEU2 locus due to the deletion of LEU2 in RM strain. We discuss how the dynamic co-expression of the leucine biosynthesis genes complements their findings. The second case is the IMD3 locus that mediates the co-expression of gene pairs from both histidine and purine biosynthesis pathways. The third and the fourth cases concern the phospholipid and purine biosynthesis pathways. The role of transcription factors will be discussed. To save space, the discussions of these two cases are given in section 2 of Additional file 1.
Co-expression of leucine biosynthesis genes mediated by the eQTL of LEU2 (marker block 75).
Co-expression of histidine and purine biosynthesis genes mediated by IMD3 locus (marker block 473)
We further find that the co-expression patterns of two gene pairs (HIS1, IMD3) and (HIS5, IMD3) are also linked to IMD3 locus [Table S2, and Figure S4 in Additional file 1]. These liquid association results reflect well the dynamic connection between the histidine and purine biosynthesis pathways. This finding cannot be explained by 1D-mapping. The expression of histidine/purine biosynthesis genes have low correlation with the genotype profile of the IMD3 locus4.
The above two examples show that the marker block, which mediates the 2D-trait contains an enzyme within the same pathway. There are altogether 10 such cases (corresponding 8 distinct marker blocks) [see Table S2 of Additional file 1]. In section 2 of Additional file 1, we also describe a different situation where the mediating marker block contains a TF known to regulate at least one gene in the 2D-trait.
In this section, we shall discuss a more complicated situation encountered in 1D-trait mapping when loci with only trans-linkages but no cis-linkages are detected. If there is a cis-linked gene in a locus, a straightforward explanation of the trans-linkages is that the sequence polymorphism in the eQTL affects the expression of the cis-linked gene first, and then the cis-linked gene affects expression of the trans-linked genes. In this situation, we would expect to observe the overall co-expression between the cis-linked gene and the trans-linked genes. On the other hand, there are eQTL that do not harbor any cis-linked genes. This leads to a puzzling situation of where to find the likely local causative genes.
With 1D-trait mapping, we find altogether 76 genes trans-linked to the marker blocks that contain no cis-linked genes (see section 3 of Additional file 1). We further find out that there are only 7 marker blocks harboring at least 3 trans-linked genes but no cis-linked genes. We study the function of these trans-linked genes by GO Term Finder of SGD . Among the 7 marker blocks, we find 3 of them have enriched GO term annotations:
1. Marker block 391: 8 trans-linked genes, enriched GO term "ATP metabolic process" (4 of 8 genes, p-value = 1.97e-7).
2. Marker block 335: 3 trans-linked genes, enriched GO term "formate metabolic process" (3 of 3 genes, p-value = 3.87e-10).
3. Marker block 446: 4 trans-linked genes, enriched GO term "mitochondrial electron transport, ubiquinol to cytochrome c" (2 of 4 genes, p-value = 0.00041).
We shall investigate the first case in detail and leave the discussion of the other two cases in section 3 of Additional file 1.
Eight genes that are trans-linked to marker block 390-39
Subunit b of the stator stalk of mitochondrial F1F0 ATP synthase
Subunit 5 of the stator stalk of mitochondrial F1F0 ATP synthase
Subunit d of the stator stalk of mitochondrial F1F0 ATP synthase
Subunit h of the F0 sector of mitochondrial F1F0 ATP synthase
Mitochondrial elongation factor G-like protein
Mitochondrial inorganic pyrophosphatase, required for mitochondrial function and possibly involved in energy generation from inorganic pyrophosphate
Subunit of the Cdc28 protein kinase
Similar to several yeast probable membrane proteins, including YNR075W and YFL062W
Genome-wide TF binding data shows that Hap4 binds the upstream regions of ATP5, ATP7, and ATP14 . However, HAP4 is not cis-linked since this locus is a cis-null/all-trans linkage spot. Consequentially, the correlations in expressions between HAP4 and any of the 8 trans-linked genes are low (from 0.02 to 0.38, with median 0.21).
Co-expression between HAP4 and its six target genes are mediated by a locus in marker block 41.
LA score I
The goal of eQTL studies is to map complex traits to genetic loci with the aid of gene expression data. Since a single gene/protein is unlikely to affect a complex trait by itself, it would be more informative to take higher order cellular organization into consideration. The co-expression pattern of two genes may reflect the status of the regulatory mechanism. If the majority of gene pairs (or gene pairs from the rate-limiting steps) in a pathway show coherent patterns of expression, we would expect the pathway to function more effectively. Our approach of 2D-trait mapping is a novel way to connect the genetic variation with higher order biological modules via gene expression profiles.
It is important to bear in mind, however, that a significant score LA(X, Y|Z) does not necessarily implies that there must be a direct causal relationship of "Z affects X and Z affects Y ". It is feasible for Z to affect X only, but X and Y are correlated through other unspecified factors, thereby changing the conditional correlation. There are many factors, including environment, epigenetics, signaling molecules, microRNA, etc., which may have more direct influence on the correlation between X and Y. Our method should be viewed as complementary to the more traditional and multivariate QTL analysis, but certainly not as the replacement.
We have illustrated how the standard 1D-trait mapping and the 2D-trait mapping can complement each other to broaden the scope of eQTL studies. The similarity between the binary LA scoring method and the continuous LA scoring method offers an additional advantage. Let's reconsider how the locus of HAP4 and the locus of TCM62 mediate the ATP metabolism and aerobic respiration. An alternative analysis can begin with using 6299 gene expression profiles as Z to find genes with highest LA score, LA(X, Y|Z), where X = the expression profile of each of the 8 trans-linked genes, and Y = HAP4 expression profile. We find that TCM62 appears six times as one of the top 20 genes with highest LA scores. Combined with 1D-trait mapping result of cis-linkage for TCM62, we may propose a possible scenario that the DNA polymorphism in marker block 41 affects the expression of TCM62, which in turn affects the co-expression pattern of HAP4 and its target genes (Table 4).
Although we choose to study the dynamic co-expression patterns of genes belonging to one metabolic pathway, other functional categories such as gene ontology terms or protein complexes can also be employed.
Our 2D-trait mapping can be generalized in two directions. The first direction is to extend the method to the co-expression of more than 2 genes. The method of projective LA  should be applicable here. On the other hand, for 1D-trait mapping, one gene expression profile can be mapped to more than one locus. How to model and detect interactions between loci (for 1D-trait mapping) has received great attention recently. Likewise, we should allow 2D-trait to be mapped to more than one locus. The dimensionality issue is surely to escalate further if high dimensional traits with multi-loci are considered systematically in a comprehensive manner.
where SD stands for standard deviation. For the protocol case as depicted by the schematic Figure 1(b), the two measures are equivalent because E(X|Z = 1) = E(X|Z = 0), E(Y|Z = 1) = E(Y|Z = 0), SD(X|Z = 1) = SD(X|Z = 0), and SD(Y|Z = 1) = SD(Y|Z = 0). If DCC is used for other situations, then one must be aware of the different biological interpretation of what it means by co-expression/co-regression of two genes. This is because (i) two different baseline expressions are now used in defining up-regulation or down-regulation of a gene and (ii) two different scales are used in defining the strength of up-regulation or down-regulation of a gene. In contrast, LA method uses only one common baseline and only one common scale. Another difference between LA and DCC is that while LA can be applied to both discrete and continuous Z, it is not easy to obtain an implementable version of DCC for a continuous Z.
gene expression quantitative trait loci
false discovery rate
The authors would like to thank three anonymous referees for useful suggestions which improve the presentation substantially. This work is supported by NSF grants DMS-0201005, DMS-0104038 and DMS-0406091. Li and Yuan's work is also supported in part by an internal grant from Academia Sinica, Taipei, Taiwan, R.O.C and in part by the grant NSC95-3114-P-002-005-Y. Sun's work is also supported in part by an internal funding from UNC. Authors thanked the MIB (Mathematics in Biology) team at the Institute of Statistical Science, Academia Sinica and at Statistics Department UCLA for LAP website development. Conflict of Interest: none declared.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.