Causal inference of regulator-target pairs by gene mapping of expression phenotypes
© Kulp and Jagalur; licensee BioMed Central Ltd. 2006
Received: 02 December 2005
Accepted: 24 May 2006
Published: 24 May 2006
Correlations between polymorphic markers and observed phenotypes provide the basis for mapping traits in quantitative genetics. When the phenotype is gene expression, then loci involved in regulatory control can theoretically be implicated. Recent efforts to construct gene regulatory networks from genotype and gene expression data have shown that biologically relevant networks can be achieved from an integrative approach. In this paper, we consider the problem of identifying individual pairs of genes in a direct or indirect, causal, trans-acting relationship.
Inspired by epistatic models of multi-locus quantitative trait (QTL) mapping, we propose a unified model of expression and genotype to identify quantitative trait genes (QTG) by extending the conventional linear model to include both genotype and expression of regulator genes and their interactions. The model provides mapping of specific genes in contrast to standard linkage approaches that implicate large QTL intervals typically containing tens of genes. In simulations, we found that the method can often detect weak trans-acting regulators amid the background noise of thousands of traits and is robust to transcription models containing multiple regulator genes. We reanalyze several pleiotropic loci derived from a large set of yeast matings and identify a likely alternative regulator not previously published. However, we also found that many regulators can not be so easily mapped due to the presence of cis-acting QTLs on the regulators, which induce close linkage among small neighborhoods of genes. QTG mapped regulator-target pairs linked to ARN1 were combined to form a regulatory module, which we observed to be highly enriched in iron homeostasis related genes and contained several causally directed links that had not been identified in other automatic reconstructions of that regulatory module. Finally, we also confirm the surprising, previously published results that regulators controlling gene expression are not enriched for transcription factors, but we do show that our more precise mapping model reveals functional enrichment for several other biological processes related to the regulation of the cell.
By incorporating interacting expression and genotype, our QTG mapping method can identify specific regulator genes in contrast to standard QTL interval mapping. We have shown that the method can recover biologically significant regulator-target pairs and the approach leads to a general framework for inducing a regulatory module network topology of directed and undirected edges that can be used to identify leads in pathway analysis.
Recently several data sets of both whole genome genotype and expression data have been published [1–4]. In a major departure from conventional genetic trait analysis, now 105 to 106 phenotypic traits are represented by distinct gene expression measurements. Theoretically, chromosomal regions can be linked to the expression of each of the many measured genes. Thus, this data provides the basis for determining the role of genetic variation in differential gene expression and the identification of polymorphic genes that regulate, directly or indirectly, transcriptional control.
Genetics of gene expression
The three most significant studies to date include whole genome expression and genotyping for a collection of 113 segregants from a mating of two isogenic yeast strains , a collection of 111 selfed progeny from a mating of two inbred mice strains , and 32 recombinant inbred mice strains . The yeast regulatory system is simpler, recombination rates are higher, and the number of available samples is greater than in the recombinant inbred mouse dataset currently available; therefore, we focus on the work of Brem and colleagues exclusively. With sufficient sampling, the principles of our work should be applicable to other model systems.
Quantitative trait loci (QTL) are regions on a genome in which the genetic variation is significantly correlated with a phenotypic trait, here the expression of a gene. Interval mapping  models assume an additive relationship of genotype to phenotype according to a mixture model of the form:
P(T i |Q j ) = (β0 + β1Q j , σ)
where T i is the expression of the i th transcript and Q j is a numeric genotype value corresponding to the j th site in the genome. (Genotypes usually take on three values, -1, 0, and 1 for biallelic sites in diploids. Genotypes for haploid or back-crossed individuals are only two-valued, i.e. 0 and 1.) For an observed T i , all possible j are considered and the β terms fit by regression. When measured markers are sparse, Q j is usually sampled in regular intervals between the markers, in which case, Q j is estimated according to a maximum likelihood approach. Consecutive regions along a chromosome with log odds () greater than some threshold are identified as candidate QTL intervals, within which are genes believed responsible for the phenotypic variation. In our case, each QTL interval putatively contains a regulator gene. The identification of the specific regulator gene is referred to as fine mapping.
In Brem , expression levels were measured for 5727 ORFs and genotype data obtained for 2957 markers regularly spaced across the yeast chromosomes. They found that a large fraction of differential gene expression was due to genetic variation and Yvert, et al  showed that, perhaps surprisingly, the genes in the QTL intervals were not enriched for transcription factors or any particular gene function. This observation could possibly be explained by the large size of the QTL intervals, typically containing ten or more genes, however we will show that our more precise models still do not implicate transcription factors as dominant upstream regulators. Thus, one must conclude that regulatory influence is a complex process such that "upstream regulator" is interpreted in the broadest of contexts.
Polymorphisms affecting gene expression are conventionally divided into cis- and trans-acting effects, i.e. polymorphisms that are proximal to the gene, such as in the promoter or 3' end, and those in another gene (generally speaking on a different chromosome). Detecting cis-acting QTLs is straightforward using interval mapping and such QTLs tend to be highly significant . On the other hand, trans-acting variation accounts for most differential gene expression, but the variation is due to a large number of weak actors . This is consistent with a model of gene regulation in which multiple factors contribute in macromolecular complexes and in many different stages of cell regulation such as signaling, transport, and so on.
Inferring regulatory networks from correlated gene expression
Independent of the data sets described so far, large collections of gene expression over time course  or varying environmental conditions [10, 11] have been studied to reveal dependent variation among genes and thereby deduce regulatory relationships. A dominant model used in such analyses was first proposed by Friedman, et al  in which each gene is a random variable with conditional distribution dependent on a small number of parent variables according to the Bayesian network (BN) formalism.
In such a model, nodes in a graph are random variables representing gene expression and edges connect nodes in a directed acyclic graph (DAG).
In the BN modeling method, the key design factors are (1) the estimation of the conditional probability term P(T i |Pa(T i )) – abstractly a score function, where Pa(T i ) are the parents of gene T i – and (2) an efficient means of discovering the set Pa(T i ). Both parametric continuous and non-parametric discrete score functions have been considered. The discrete case is common in the literature; relative gene expression is discretized usually into categorical increased, decreased, and unchanged values and conditional probability tables (CPT) are constructed from tallied observations of the values among parents and child. A CPT model can theoretically capture complex relationships among the parents, but this power is usually limited by the binning of expression values into a few values in order to achieve adequate conditional density estimates.
Continuous models are attractive because parameters are estimated from the totality of the data, but computational efficiency concerns have conventionally limited the class of models considered to simple linear Gaussian models with a small number of parameters of the form
Each parent adds an independent contribution to T i , eliminating the potential interacting effects among parents. But interacting effects can be important under limited circumstances and we will suggest below a relaxed Gaussian model that is computationally acceptable.
The major drawback of the BN approach for analyzing gene expression data alone is that the dependencies inferred among variables do not necessarily imply causality. Indeed, while it is possible to determine causality in some cases , in general, for any BN solution there is an equivalence class represented by a partially directed acyclic graph (PDAG) of alternative solutions with different edge directions that have the same joint probability.
Pe'er, et al  describe how to infer causality under certain gene expression perturbation experiments. (In some sense, our genotype/expression data sets represent the ideal randomized perturbation experimental design, where allelic expression is perturbed by random recombination.) Inother work, transcription factor binding sites (either predicted or experimentally determined) have been added to infer direction of regulation [15, 16]. A limitation of these approaches is that strong priors over allowable structures are implicitly or explicitly applied, which limits the relationships that may be discovered. For example, Segal, et al  limit their search to downstream targets of a small set of known transcription factors. However, gene expression regulation is much more complex, involving other direct and indirect factors such as post-translational modification or protein signaling cascades.
(Note that BNs also lack cycles that exist in real biological networks. This can be overcome using Dynamic Bayesian Networks  when time course data is available, which is not the case here.)
Zhu, et al  describes a method to build comprehensive BN reconstructions of regulatory networks based on genotype/expression data sets. Their work is motivated, as we are, by the recognition that correlation between genotype and expression does imply causality. Genotype assignments represent random shuffling during meiosis, so correlations observed must be the effect of causative polymorphisms. Their approach is to weight the BNs with priors according to rules regarding the chromosomal positions of genes and the differences in QTLs between pairs of genes. A key idea in this work is that related genes will share multiple QTLs. More recently, the same group has described a protocol for inferring the causal nature between two traits – in particular, their method can be applied to the expression levels of a pair of genes . In their approach, pairs of traits that share multiple QTLs are tested for causal orientation based on independence conditioned on the genotypes of the shared QTLs.
Li, et al  also addressed this problem by filtering the set of candidate parent genes of a target gene to only those genes with coding SNPs located within QTL intervals with stringent LOD scores. With a much smaller set of possible model configurations it was possible to exhaustively search all BN configurations using a score function based on gene expression alone.
Lastly, in a similar strategy, Bing and Hoeschele  recommend an analysis protocol for genotype/expression data in which individual genes within a QTL interval are considered as parents according to their expression correlation to the target gene.
In all of the above cited works, regulatory relationships are derived roughly according to a two-step process in which standard QTL interval mapping is first applied – which serves to filter the set of parent genes – followed by a selection of regulator-target pairs according to gene expression correlation or conditional independence tests. It is conceivable that the interaction between polymorphisms in and expression of a regulator may have a significant effect, not observed by either factor alone. We propose such a simple, unified model for the scoring of candidate regulator-target pairs that considers all scenarios of cis- and trans- effects, allowing for interaction among gene expression and genotype. (Note that we do not model the interacting effects of multiple QTLs (epistasis) . While it has been estimated that at least 14% of genes are controlled by epistatic effects between two simultaneous linkages, the ability to detect such pairs in small sample sets is very limited .)
As with networks derived from gene expression alone, connectivity between genes does not necessarily imply physical interactions. Yvert, et al previously observed that genes within QTLs of gene expression traits were not enriched for transcription factors or any other function . Nevertheless, we wondered whether this lack of functional enrichment was due to the imprecise mapping of intervals that contain usually tens of candidate genes. We hypothesized that our QTG mapping method, which identifies specific candidate genes, might show enrichment for transcription factors or other functional categories.
To test this hypothesis, we analyzed the yeast set consisting of 6164 gene expression measurements and 2957 genotype markers across 113 matings between two distinct isogenic strains . We computed the pairwise dependency among all pairs of genes according to the full and reduced model scores, selecting those pairs with a p < 0.00001 based on exhaustive permutation tests (required for each pair for the full model and expression reduced-model). This resulted in 22,923 predicted interacting pairs yielding a modest false discovery rate of 1.7%. Finally, to avoid linkage disequilibrium effects, putative cis-trans-acting regulators (using a conventional 0.05 p-value cutoff) were excluded and regulator-target pairs residing on the same chromosome were removed. This filtering likely removed some true pairs, but we chose to favor conservative selection in order to detect any group-wide trends that would be obscured by noise from false positives. Our final set consisted of 4268 pairs.
Functional enrichment of regulators in GO. The set of GO terms from the "molecular function" (F) and "biological process" (P) categories showing significant enrichment among the candidate QTG. Total number of genes in yeast genome is 6164. Total number of regulators in filtered set is 823.
# in genome
# of regulators
structural constituent of ribosome
35S primary transcript processing
processing of 20S pre-rRNA
ribosomal large subunit assembly and maintenance
protein amino acid phosphorylation
regulation of carbohydrate metabolism
regulation of cell cycle
chromatin silencing at ribosomal DNA
DNA-directed RNA polymerase activity
Protein kinase activity
Even though no functional enrichment in transcription factors was found, we still examined the predicted targets of transcription factors for evidence of physical interaction. Considering all the predicted targets of each transcription factor that met the selection criteria above, we searched 500nts upstream of the target for matches to known binding site motifs (TRANSFAC ). We found no significant enrichment for targets containing known binding regardless of sequence similarity thresholds. For example, only 35 of 719 putative targets contained matches to known binding sites. And of those, only 8 were known targets of their respective transcription factor regulators.
In a final attempt to recover a bias for transcription factors, we hypothesized that QTGs associated with multiple target genes would be enriched for transcription factors. We extracted those regulators from our set of 4268 pairs that had ten or more target genes. The set included well known transcription factors FKH1, FKH2, MSN1, KSP1, and ZAP1, but there was no significant enrichment in the total set for transcriptional regulators. All these observations further confirm that regulatory behavior captured in genotype/expression networks is not likely to be physical interactions, but more complex, indirect relationships as suggested by the functional enrichment found above.
Next we wondered how well a causal relationship could be inferred when the regulator was part of a multi-factor regulon. Using the yeast data set of n = 6164 genes, we simulated an n + 1 target gene according to an additive model of k = 2 ... 5 regulators, with only one regulator having genotypic effect. Specifically, we simulated
Tn+1= β1T1 + ... + β k T k + β k' T k Q k + ε
where β k' was set at random values such that the genotypic effect between the two alleles, (μ a - μ b )/σ, was uniformly selected between 0.5 and 3.0. The other β's were selected from (0,1). Using the QTG trans model we attempted to recover the causal regulator of the simulated target among the background of the other n genes. By modifying the full model threshold for equation 2 we obtained different tradeoffs between recall and precision. We compared this approach to the standard QTL mapping approach and to a more liberal, but realistic, test in which we defined a true positive as correctly identifying the interval containing the regulator (false positives were intervals not containing the true regulator). We found that our QTG model was successful in identifying the correct regulating gene, even for larger values of n (figure 4). Not surprisingly, conventional QTL mapping alone, being a function of only the flanking markers, failed to accurately predict the precise regulating gene, but the QTL interval was typically identified with reasonable success in our simulation.
Prediction of novel regulators
We next considered six candidate QTL intervals analyzed in  representing highly pleiotropic loci. The intervals were each predicted as containing a regulator gene associated with a large number of target genes, although the precise gene was unknown. In the paper, a putative gene within each interval was predicted manually by the authors according to published gene function annotations of regulator and target genes. For the most part, QTG analysis was disappointing in these cases; as it turned out, loci 1 through 5 were coincident with cis-acting QTLs. As a result, the gene expression of most of the putative regulators are highly linked and the manually predicted genes are no better fit than the neighboring genes.
ILV6 targets. The putative targets of a novel regulator gene, ILV6, in the "group 2" loci from . The list includes five additional targets not previously identified.
Receptor for a factor receptor, transcribed in alpha cells and required for mating by alpha cells, couples to MAP kinase cascade to mediate pheromone response; ligand bound receptors are endocytosed and recycled to the plasma membrane; GPCR,
Alpha-isopropylmalate synthase (2-isopropylmalate synthase); the main isozyme responsible for the first step in the leucine biosynthesis pathway
Dihydroxyacid dehydratase, catalyzes third step in the common pathway leading to biosynthesis of branched-chain amino acids
Acetolactate synthase, catalyses the first common step in isoleucine and valine biosynthesis and is the target of several classes of inhibitors, localizes to the mitochondria; expression of the gene is under general amino acid control
Protein involved in vitamin B6 biosynthesis; member of a stationary phase-induced gene family; coregulated with SNO1; interacts with Snolp and with Yhrl98p, perhaps as a multiprotein complex containing other Snz and Sno proteins
Mitochondrial aldehyde dehydrogenase, involved in regulation or biosynthesis of electron transport chain components and acetate formation; activated by K+; utilizes NADP+ as the preferred coenzyme; constitutively expressed
Mitochondrial branched-chain amino acid aminotransferase, homolog of murine ECA39; highly expressed during logarithmic phase and repressed during stationary phase
Mitochondrial inner membrane transporter, transports oxaloacetate, sulfate, and thiosulfate; member of the mitochondrial carrier family
Putative alanine transaminase (glutamic pyruvic transaminase)
Mitochondrial dicarboxylate carrier, integral membrane protein, catalyzes a dicarboxylate-phosphate exchange across the inner mitochondrial membrane, transports cytoplasmic dicarboxylates into the mitochondrial matrix
High-affinity leucine permease, functions as a branched-chain amino acid permease involved in the uptake of leucine, isoleucine and valine; contains 12 predicted transmembrane domains
Subunit of the CBF3 complex, which binds to the CDE III element of centromeres, bending the DNA upon binding, and may be involved in sister chromatid cohesion during mitosis
Protein of unconfirmed function, involved in pyridoxine metabolism; expression is induced during stationary phase; forms a putative glutamine amidotransferase complex with Snzlp with Snolp serving as the glutaminase
Conserved protein of the mitochondrial matrix, required for synthesis of mitochondrial and cytosolic iron-sulfur proteins, performs a scaffolding function in mitochondria during Fe/S cluster assembly; isu1 isu2 double mutant is inviable
Mannose-6-phosphate isomerase, catalyzes the interconversion of fructose-6-P and mannose-6-P; required for early steps in protein mannosylation
Prediction of novel structure
Finally, we constructed a putative network from the seed gene, ARN1, as a demonstration of the characteristics of a regulatory module that might be derived from our models. Regulatory modules of the iron homeostasis pathway have been previously constructed from the gene expression data of Hughes, et al [11, 14, 28, 29].
For the purposes of this work, we consider networks constructed simply according to the pairwise relationships; that is, the network is not a BN, but its edges are derived from Markov pairs according to our derived score functions. In order to concretely demonstrate the advantage of incorporating the variation data and compare with a structure that might be derived from an approach such as by Pe'er et al, we also constructed a BN from the Markov Blanket of ARN1 . The BN structure was constructed using the K2 algorithm on discretized gene expression values .
Most genes in our module reconstruction in figure 5 are iron homeostasis and many that were found were common to previous reconstructions based on a different expression data set. Both the reductive mechanism directly associated with ARN1 (ARN2, SIT1, FTR1, SMF3, FITS) and the non-reductive transport (FRE6) mechanism were implicated in the network. Interestingly, two key genes in this pathway, FTR1 and FET3, although not directly linked to ARN1, are both involved in iron uptake and were found to have significant bi-directional causality, implying an auto-regulatory mechanism.
We are concerned with the class of trans-acting regulators in which the expression of the target is dependent on the expression of the regulating gene. We consider three sub-classes of genotypic effect: cis-, trans-, and cis-trans-acting sites. For example, a variation in the promoter region or 3' end of the target gene may have a cis-acting effect on the expression level of the target; a variation in the coding region of the regulator may have a trans-acting effect, either directly or indirectly, on the expression of a target gene, such as through the modification of a DNA-binding motif in a transcription factor; and variation in or around the regulator gene may have a cis-acting effect on the regulator's expression which indirectly affords a trans-acting effect on the target, i.e. cis-trans. We make no specific assumptions in our model regarding the precise mechanism of the allelic effect even though it is convenient to imagine examples of transcription factor binding. Variation can have direct or indirect effects on transcript abundance through a variety of mechanisms such as protein levels, RNA degradation rates, splicing, and so on.
If we consider only the genotype sites at the locations of the protein-coding genes in a fully annotated genome, then we can conveniently reference both genotypes and genes with a common index, i.e. Q i represents the genotype for the gene i with expression T i . Figure 1b naturally follows. We refer to this model as the full QTG model for a single quantitative trait gene and the process of estimating regulatory genes for a given target as "QTG mapping". The three genotype sub-classes are subgraphs of the full model shown in figure 1c–f.
Case 1: trans-acting Regulator
Therefore, to consider the possible interactions among genotype and expression, our full model is
P(T i |Q j , T j , θ) = (β0 + β1T j + β2Q j + β3T j Q j , σ) (1)
where θ is the β and σ model parameters.
As with standard interval mapping, Maximum likelihood estimation can be achieved using an expectation maximization (EM) approach in which the genotype, Q j , and the variables, θ, are alternatively estimated until convergence. But the advantage of this model over the standard mapping and multi-step approaches previously proposed is that individual loci are automatically mapped in a single step by simultaneously considering all available evidence.
Note that the strength of the genotypic effect is directly related to our ability to infer causality. That is, as the contribution of the β2 and β3 terms decreases, our confidence in the causal direction between genes i and j is reduced. We can be precise about this directionality by comparing our model with the simpler model of no genotypic effect (figure 1f). From equation (1), for each tested gene pair, i and j, we can determine the strength of a relationship (the full model score) as
and the directionality (genotype reduced-model score) according to
Moreover, if the β2 and β3 terms are weak, then it indicates that the major effect is the QTL interval and so our confidence in the specific regulator gene is correspondingly weak. Thus, confidence in the gene, T j , as the actor in the relationship is found with the expression reduced-model score
And so a hypothesis of a causal regulator-target relationship requires significant values from the full model and reduced-model scores (equations 2, 3, and 4). However, interesting results can be achieved when some, but not all, scores are significant. For example, assuming that T j controls T i , there may be a functional polymorphism in gene j' s sequence, but no observed variation in expression level, T j . In that case, the expression reduced-model score will not be significant and we cannot confidently implicate gene j versus its nearby linked neighbors. On the other hand, suppose there is no polymorphism in gene j, but expression levels T i and T j are highly correlated. Then the genotype reduced-model score will not be significant, but the expression reduced-model score will be significant. In such a case, we can infer that a relationship exists between genes i and j, but cannot make any claim about the type of relationship, which could be causal in either direction or driven by a hidden actor.
Cases 2 and 3: cis- and cis-trans-acting Regulators
Since we are interested here in fine mapping, we specifically avoid in this analysis modeling the cis control of target or regulator, directly or indirectly, i.e. the cis and cis-trans models. Such cis effects can be detrimental to fine mapping because linkage disequilibrium causes transcripts in local regions to be highly correlated. Disambiguating among different neighboring genes under cis-trans control can be accomplished, in principle, as implied by figure 1e, by conditioning on the genotype, but we don't pursue that further in this paper. Such a cis-trans local linkage problem can be framed as a kind of pleiotropy test and attempts are described elsewhere for fine mapping in such contexts .
Mapping and structure inference
In addition to basic mapping, the QTG model described above leads us to propose a seed-based partially directed graph of regulatory relationships built from confident Markov pairs, much like the regulatory modules studied by  and others. Deriving such a sub-network is computationally tractable and biologically relevant, since most biological analysis is concerned with a pathway centered around a gene of interest. Such a seed-based approach yields a network of dependent genes as nodes whose edges are directed when causal influence is significant and undirected otherwise (figure 5). (Inferring a complete network is not reasonable for any single data set where the perturbations are incomplete and the data is too sparse. In particular, with genotype/expression data sets, regardless of the number of matings, without allele-specific differential expression no characterization of a target gene can be inferred. Without polymorphisms in upstream candidate regulators, no relationships can be hypothesized.)
Discussion and Conclusion
We proposed an improved, principled method for mapping causal loci involved in transcriptional control when analyzing data sets of whole genome genotype and expression data. Our "QTG" model is a natural application of epistatic models to gene expression, allowing for interactions among gene expression and genotype. From a genetics perspective, our model is a more complex scoring function for the identification of a single, causative gene instead of the conventional multi-gene QTL interval. And from a bioinformatics perspective, it is an improved score function for BN regulatory module reconstruction. Here we considered the simplest question of detecting relationships between regulator-target pairs, but we plan to extend this to the construction of Markov blankets.
We only presented results for the sub-model describing trans-acting genes, but the generality of the QTG model allows for incorporation of cis-acting effects for targets and regulators. Of note, when all polymorphic sites are known, but the haplotype is unknown for any sample, as is the case when crossing two fully sequenced genomes, then one may choose the full model or a sub-class of the full model as appropriate. For example, if there is no polymorphic sites in the non-coding regions of the target gene then the cis-acting model parameters can be dropped since the genotype is likely to be uninfor-mative.
We contrast this work with other related efforts as follows. First, other integrative work has attempted to incorporate interactions among variables. For example, Jiang and Zeng  describe a general method to simultaneously fit multiple effects to a common QTL by introducing multiple dependent variables and leveraging the covariance structure among them. Such an approach could theoretically be used to more confidently map pleiotropic loci. In our method, we instead introduce interactions among multiple regressors, namely the simultaneous consideration of genotype and expression of a regulator gene. Second, methods such as  and  have attempted to finely map regulators by first selecting candidate QTLs and then identifying candidate genes within these QTLs by expression correlation. Our main contribution is that we show how to do this simultaneously, and thereby consider the interacting effects that allow for fine mapping. Lastly, work by Schadt and colleagues infer causality by applying conditional independence  or by placing priors on causal relationships according to the numbers of QTLs and the location of the transcripts with respect to their QTLs . We do not consider conditional independence or QTL count arguments, but do consider specifically the relative location of QTLs and transcripts as in . The main difference is that we eliminate confounding effects from cis-QTLs and close linkage and that our score function incorporates interacting effects.
There are significant limitations to our model. With small sample sizes, many relations will go undetected. Small sample size also emphasizes linkage effects, which undermine our model. Nearby genes may appear to be regulator-target pairs due to close linkage. Relatedly, regulators that are themselves under cis control have strong linkage effects on neighboring genes, which makes fine mapping impossible, as was demonstrated on the Brem candidate loci. In practice, we recommend excluding candidate regulator-target pairs sharing the same chromosome and excluding regulators that are cis-regulated themselves.
Another major source of difficulty is missing (latent) variables. We assume that all expressed transcripts are known and correctly measured by the mi-croarray. In fact, even in yeast, as much as 5% of the proteome is believed to be unannotated small protein-coding genes . Other non-coding RNA genes, including miRNAs, are believed to play a large role in transcriptional regulation. Protein expression is assumed to correspond to mRNA abundance, but overall correlation has been shown to be only 0.66 between yeast mRNA and protein products . And, of course, the true dynamics of transcriptional control are rate-based. Latent variables and kinetics can be included in BN modeling , but at the expense of additional complexity, which requires more samples for parameter estimation. On the bright side, when variables are missing from expression data, it has been shown that predicted network structures usually roughly approximate the correct solution .
With the finer mapping precision of the QTG model and judicious filtering, we found some functional enrichment among regulators that had not been observed in previous reports. However, as was reported elsewhere, we confirmed that transcription factors were not among the enriched functional classes, which indicates the degree of complexity of transcriptional control even in yeast.
We are particularly grateful to an anonymous reviewer for helpful discussions regarding cis effects that resulted in significant new results. The authors thank Gary Churchill for helpful discussions. This work was supported in part by a gift from Affymetrix, Inc. and a grant from the Department of Computer Science, University of Massachusetts, Amherst.
- Morley M, Molony C, Weber T, Devlin J, Ewens K, Spielman R, Cheung V: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430 (7001): 743-7. 10.1038/nature02797.PubMedPubMed CentralView ArticleGoogle Scholar
- Schadt E, Monks S, Drake T, Lusis A, Che N, Colinayo V, Ruff T, Milligan S, Lamb J, Cavet G, Linsley P, Mao M, Stoughton R, Friend S: Genetics of gene expression surveyed in maize, mouse and man. Nature. 2003, 422 (6929): 297-302. 10.1038/nature01434.PubMedView ArticleGoogle Scholar
- Steinmetz L, Sinha H, Richards D, Spiegelman J, Oefner P, McCusker J, Davis R: Dissecting the architecture of a quantitative trait locus in yeast. Nature. 2002, 416 (6878): 326-30. 10.1038/416326a.PubMedView ArticleGoogle Scholar
- Brem R, Yvert G, Clinton R, Kruglyak L: Genetic dissection of transcriptional regulation in budding yeast. Science. 2002, 296 (5568): 752-5. 10.1126/science.1069516.PubMedView ArticleGoogle Scholar
- Brem R, Kruglyak L: The landscape of geneticcom plexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA. 2005, 102 (5): 1572-7. 10.1073/pnas.0408709102.PubMedPubMed CentralView ArticleGoogle Scholar
- Chesler E, Lu L, Shou S, Qu Y, Gu J, Wang J, Hsu H, Mountz J, Baldwin N, Langston M, Threadgill D, Manly K, Williams R: Complex trait analysis of gene ex pression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet. 2005, 37 (3): 233-42. 10.1038/ng1518.PubMedView ArticleGoogle Scholar
- Lander E, Botstein D: Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989, 121: 185-99.PubMedPubMed CentralGoogle Scholar
- Yvert G, Brem R, Whittle J, Akey J, Foss E, Smith E, Mackelprang R, Kruglyak L: Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet. 2003, 35: 57-64. 10.1038/ng1222.PubMedView ArticleGoogle Scholar
- Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Putcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998, 9 (12): 3273-97.PubMedPubMed CentralView ArticleGoogle Scholar
- Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000, 11 (12): 4241-57.PubMedPubMed CentralView ArticleGoogle Scholar
- Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, Kidd M, King A, Meyer M, Slade D, Lum P, Stepaniants S, Shoemaker D, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend S: Functional discovery via a compendium of expression profiles. Cell. 2000, 102: 109-26. 10.1016/S0092-8674(00)00015-5.PubMedView ArticleGoogle Scholar
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7 (3–4): 601-20. 10.1089/106652700750050961.PubMedView ArticleGoogle Scholar
- Pearl J: Causality: models, reasoning, and inference. 2000, Cambridge University PressGoogle Scholar
- Pe'er D, Regev A, Elidan G, Friedman N: Inferring subnetworks from perturbed expression profiles. Bioinformatics. 2001, 17 (Suppl 1): S215-24.PubMedView ArticleGoogle Scholar
- Hartemink A, Gifford D, Jaakkola T, Young R: Combining location and expression data for principled discovery of genetic regulatory network models. Pac Symp Biocomput. 2002, 437-49.Google Scholar
- Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet. 2003, 34 (2): 166-76.PubMedView ArticleGoogle Scholar
- Friedman N, Murphy K, Russell S: Learning the structure of dynamic probabilistic networks. UAI. 1998Google Scholar
- Zhu J, Lum P, Lamb J, GuhaThakurta D, Edwards S, Thieringer R, Berger J, Wu M, Thompson J, Sachs A, Schadt E: An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenet Genome Res. 2004, 105 (2–4): 363-74. 10.1159/000078209.PubMedView ArticleGoogle Scholar
- Schadt E, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts S, Monks S, Reitman M, Zhang C, Lum P, Leonardson A, Thieringer R, Metzger J, Yang L, Castle J, Zhu H, Kash S, Drake T, Sachs A, Lusis A: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet. 2005, 37 (7): 710-7. 10.1038/ng1589.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Lu L, Manly K, Chesler E, Bao L, Wang J, Zhou M, Williams R, Cui Y: Inferring gene transcriptional modulatory relations: a genetical genomics approach. Hum Mol Genet. 2005, 14 (9): 1119-25. 10.1093/hmg/ddi124.PubMedView ArticleGoogle Scholar
- Bing N, Hoeschele I: Genetical genomics analysis of a yeast segregant population for transcription network inference. Genetics. 2005, 170 (2): 533-42. 10.1534/genetics.105.041103.PubMedPubMed CentralView ArticleGoogle Scholar
- Sen S, Churchill G: A statistical framework for quantitative trait mapping. Genetics. 2001, 159: 371-87.PubMedPubMed CentralGoogle Scholar
- Storey J, Akey J, Kruglyak L: Multiple Locus Linkage Analysis of Genomewide Expression in Yeast. PLoS Biology. 2005, 3 (8):
- Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A, Dolinski K, Dwight S, Eppig J, Harris M, Hill D, Issel-Tarver L, Kasarskis A, Lewis S, Matese J, Richardson J, Ringwald M, Rubin G, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-9. 10.1038/75556.PubMedPubMed CentralView ArticleGoogle Scholar
- Zhao Y, Sohn J, Warner J: Autoregulation in the biosynthesis of ribosomes. Mol Cell Biol. 2003, 23 (2): 699-707. 10.1128/MCB.23.2.699-707.2003.PubMedPubMed CentralView ArticleGoogle Scholar
- Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel A, Kel-Margoulis O, Kloos D, Land S, Lewicki-Potapov B, Michael H, Munch R, Renter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31: 374-8. 10.1093/nar/gkg108.PubMedPubMed CentralView ArticleGoogle Scholar
- Cullin C, Baudin-Baillieu A, Guillemet E, Ozier-Kalogeropoulos O: Functional analysis of YCL09C: evidence for a role as the regulatory subunit of acetolactate synthase. Yeast. 1996, 12 (15): 1511-8. 10.1002/(SICI)1097-0061(199612)12:15<1511::AID-YEA41>3.0.CO;2-B.PubMedView ArticleGoogle Scholar
- Pena J, Bjorkegren J, Tegner J: Growing Bayesian network models of gene networks from seed genes. Bioinformatics. 2005, 21 (Suppl 2): ii224-ii229. 10.1093/bioinformatics/bti1137.PubMedView ArticleGoogle Scholar
- Margolin A, Banerjee N, Nemenman I, Califano A: 2004, [http://www.menem.com/~ilya/digital_library/mypapers/margolin-etal-b-04.pdf]
- Tsamardinos I, Aliferis C, Statnikov A: Algorithms for Large Scale Markov Blanket Discovery. The 16th International FLAIRS Conference, St. Augustine, Florida. 2003Google Scholar
- Cooper G, Herskovits E: A Bayesian method for the induction of probabilistic networks from data. Machine Learning. 1992, 9 (4): 309-347.Google Scholar
- Lynch M, Walsh B: Genetics and analysis of quantita tive traits. 1998, Sunderland, Mass.: Sinauer, [97017666 Michael Lynch, Bruce Walsh. Includes bibliographical references (p. 891-) and indexes.].Google Scholar
- Jiang C, Zeng Z: Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics. 1995, 140 (3): llll-27.Google Scholar
- Kastenmayer J, Ni L, Chu A, Kitchen L, Au W, Yang H, Carter C, Wheeler D, Davis R, Boeke J, Snyder M, Basrai M: Functional genomics of genes with small open reading frames (sORFs) in S. cerevisiae. Genome Res. 2006, 16 (3): 365-73. 10.1101/gr.4355406.PubMedPubMed CentralView ArticleGoogle Scholar
- Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 2003, 4 (9): 117-10.1186/gb-2003-4-9-117.PubMedPubMed CentralView ArticleGoogle Scholar
- Nachman I, Regev A, Friedman N: Inferring quantitative models of regulatory networks from expression data. Bioinformatics. 2004, 20 (Suppl 1): I248-I256. 10.1093/bioinformatics/bth941.PubMedView ArticleGoogle Scholar
- Yu J, Smith A, Wang P, Hartemink A, Jarvis E: Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks. Third International Conference on Systems Biology. 2002Google Scholar