 Methodology article
 Open access
 Published:
Functional regression method for whole genome eQTL epistasis analysis with sequencing data
BMC Genomics volume 18, Article number: 385 (2017)
Abstract
Background
Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, posttranscriptional RNA editing across the entire gene, and transcription rates of the cells, RNAseq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNAseq and whole genome sequencing (WGS) data poses enormous challenges.
Methods
We develop a nonlinear functional regression model (FRGM) with functional responses where the positionlevel read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNAseq data. Instead of testing the interaction of all possible pairwises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions.
Results
By largescale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNAseq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples.
Conclusions
The proposed FRGM for epistasis analysis of RNAseq can capture isoform and positionlevel information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice of the epistatic analysis with sequencing data.
Background
Epistatic effect in gene expression, defined as the departure from additive effects in a linear model of eQTL analysis [1], plays an essential role in understanding the gene regulation and disease mechanisms [2,3,4]. One polymorphism’s effect on expression of a gene depends on other polymorphisms present in the genome [5]. Epistasis analysis of gene expressions will substantially improve the understanding of the genetic architecture of gene expression and facilitate mechanistic insights into complex traits [6, 7]. However, eQTL epistasis analysis remains fundamentally unexplored due to large computational challenges and data availability [6].
Gene expression is an intermediate phenotype that bridges the genotype and higher level phenotypes such as diseases [8, 9]. Studying the effect of epistasis on the gene expression could provide a better understanding of the genetic architecture and gene regulation. The importance of detecting the epistatic effect on the gene expression has been emphasized in many recent studies [10, 11]. However, the corresponding methods are relatively rare. The widely used statistical methods for identifying eQTL epistasis are designed for microarray expression data where an overall expression of the gene is taken as a quantitative trait and all methods for QTL epistasis analysis can be used for eQTL epistasis analysis [10, 12].
Application of next generation sequencing (NGS) techniques to the genetic analysis of gene expression involves (1) generating millions of short reads of mRNA or cDNA which are mapped to the genome and lead to a sequence of read counts at the hundreds of millions of genomic positions [13,14,15,16] and (2) generating millions or 10 millions of genetic variants. RNAseq counts vary greatly across the gene [17]. Count variations can be due to experimental bias such as fragmentation methods, reversetranscription [16], sequencespecific bias and sequencing technology variation [18]. However, count variation can also be caused by variation in splicing, transcription start sites, polyadenylation sites, posttranscriptional RNA editing across the entire gene, and transcription rates of the cells [13,14,15,16, 18]. RNAseq data can be viewed as a function or a curve of the genomic position and hence can be taken as a functionvalued trait.
Although RNAseq data are measured as a function, the widely used methods for genetic studies of the RNAseq in humans are the same as that for the traditional singlevalued quantitative traits where a single number for overall expression of the gene is taken as a quantitative trait. These methods use summary statistics to measure or represent gene expressions assayed by NGS techniques and cannot capture the expression variations across the gene due to splicing, transcription start sites, polyadenylation sites, posttranscriptional RNA editing across the entire gene, and transcription rates of the various cells. The summary statisticbased epistasis analysis of the RNAseq fails to utilize all transcripts information.
The critical barrier in epistasis analysis is to deal with rare variants. The traditional statistical methods for epistasis analysis were originally designed for testing the interaction between common variants and are difficult to apply to rare variants due to high type 1 error rates, severe multiple testing, prohibitive computational time and low power [19]. Whole genome RNAreq eQTL analysis poses a significant challenge. To meet the challenge, we developed a nonlinear functional regression model (FRGM) with functional responses where the positionlevel read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNAseq data, which allows simultaneous capture of all space information hidden in the RNAseq data and genetic variation data, but with substantially reduced dimensions. Instead of testing the interaction of all possible pairwises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and uses all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions (or genes). The proposed FRGM for epistasis analysis of the RNAseq can capture isoform and positionlevel information and will have a broad application.
The FRGM for epistasis analysis has several remarkable features. First, the FRGM accounts for the change in the positionlevel read counts, while preserving the intrinsic structure and all the positionallevel genetic information. Second, the FRGM simultaneously utilizes both correlation information among the RNAseq at different genomic positions and among all variants in a genomic region. Third, the multicollinearity problems in the FRGM which may be presented in both the RNAseq and genetic variation are alleviated. Fourth, the FRGM expands both positionlevel read count function and genotype function in terms of orthogonal eigenfunction, which leads to substantial dimension reduction in both RNAseq data and SNP data. The FRGM for epistasis analysis of functionvalued traits which capture key information in the data is expected to open a new route for epistasis analysis of RNAseq data.
To evaluate its performance for epistasis analysis of the RNAseq, we use large scale simulations to calculate the type I error rates and evaluate the power of the proposed FRGM for detecting epistasis. To further evaluate its performance, the FRGM for epistasis analysis is applied to 350 samples with both RNAseq and NGS data from the 1000 Genomes Project. An R packge for implementing the developed FRGM for epistasis analysis of RNAseq and NGS data can be downloaded from our website https://sph.uth.edu/research/centers/hgc/xiong/software.htm.
Results
Null distribution of test statistics
To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We consider three models for type 1 error rate simulations: model 1 with no marginal effects, model 2 with marginal effects at the first gene and model 3 with marginal effects at both the first and second genes.
We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 from the next generation sequencing data in the 1000 Genomes Project. The summary statistics of the variants in five genes are summarized in Additional file 1: Table S1. The marginal genetic effects will be estimated from the data. 100 genes with RNAseq data were randomly selected from GEUVADIS project. They were used to develop the models for generating RNAseq data in simulation (Detailed description were referred to Method Section).10 pairs of genes were selected from five genes : IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 with genotype data from 1000 Genome Project dataset.
The number of sampled individuals from the population ranged from 1000 to 5,000, and 5,000 simulations were repeated. We randomly selected 10% of the SNPs as causal variants from five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4. We perfume genegene interaction tests for 10 pairs of genes selected from five genes with genotypes under the three models for 5000 times. The type 1 error rates were averaged over 10 pairs of genes with genotype data and 5,000 simulations for each model. Tables 1, 2 and 3 summarized the type I error rates of the test statistics for testing the interaction between two genes with no marginal effect, marginal effect at the first gene and marginal effects at both genes consisting only of rare variants and both common and rare variants, respectively, averaged over 100 genes with RNAseq data and 10 pairs of genes with genotype data at the nominal levels α = 0.05, α = 0.01 and α = 0.001. These results clearly showed that the type I error rates of the FRGMbased test statistics for testing interaction between two genes with or without marginal effects were not appreciably different from the nominal α levels.
Power evaluation
To evaluate the performance of the functional regression model for testing the epistatic effect on gene expression, we estimated the power through simulations. We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in two genes: IRAK3 and ACSS3 from the next generation sequencing data in 1000 Genomes Project. We randomly selected 20% variants as causal variants, assumed that there were k _{ 1 } SNPs in the first gene, and k _{ 2 } SNPs in the second gene. Two thousand individuals were sampled. We assumed that both marginal effects and epistasis effects were a function of the genomic position and used the multiple regression models to generate the RNAseq data under four interaction models: Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold (See the Methods section).
We compared the power of the FRGM with both functional response and functional predictors (BFGM), FRGM with scalar response and functional predictors (SFGM) and regression on principal component analysis (PCA). For the PCA method, the PCA was performed on the RNAseq data and the number of PCs were selected to explain 80% variance of number of reads at different genomic positions. The multiple functional regression was performed to analyze the data [20]. In the BFGM, both RNAseq and genotype profiles were taken as a function of genomic position and expanded in terms of functional principal components.
Figure 1ad plotted the power curves of three statistics: BFGM, SFGM and PCA to test the interaction between two genes with rare variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. In the simulation, 20% of the rare variants were randomly selected as the causal variants. These power curves were a function of the risk parameter at the significance level α = 0.05. We observed that under all four interaction models the BFGM had the highest power, followed by the regression on PCA. Power of the SFGM was the lowest. The results demonstrated that summary statistics such as RPKM for measuring gene expression could not capture the expression variations across the gene and almost had no power to detect the interaction between two genes with rare variants.
The BFGM can also be applied to the presence of both common and rare variants. Figure 2ad plotted the power curves of three statistics for testing interaction between two genes with both common and rare variants where 10% of the common variants and 10% of the rare variants were chosen as causal variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. The power patterns of tests for the interactions between two genes with both common and rare variants were similar to that with rare variants only. The BFGM had the highest power, followed by the PCA and the SFGM. However, we noticed that the power of the SFGM for epistasis analysis in the presence of common variants increased substantially. Under some models such as the Dominant OR Dominant model, the SFGM would have enough power to detect interactions between two genes with common variants.
RNAseq data and NGS data
The BFGM was applied to the RNAseq data in the GEUVADIS RNA Sequencing Project [21] and the WGS data in the1000 Genomes Project. A total of 350 samples with European origin was shared between the GEUVADIS RNA Sequencing Project and 1000 Genomes Project, which had combined transcriptome (22,706 gene expressions measured by RNAseq) and genome sequencing data (2,708,453 SNPs in 24,519 genes). After removing singleton SNPs, repeated SNPs, and filtering out the SNPs violating HW equilibrium [22] (P value < 10^{−9} for declaring HW disequilibrium), 2,566,261 SNPs in 18,986 genes were included in the epistasis analysis. In the RNAseq data preprocessing, we removed the genes whose expressing rates were less than 30% and the genes that did not contain any SNPs. Finally, RNAseq data of the 15,656 genes were included in the analysis. We used DESeq [23] to normalize the RNAseq data.
Cistrans interactions
We considered the RNAseq curve of the target gene as a functionvalued trait. The target gene selected from the 15656 gene expressions was referred to as gene 1. We selected one of the remaining 18985 genotyping genes as gene 2. We used BFGM to test for the interactions between gene 1 and gene 2 influencing the expression of the target gene 1. The total number of gene pairs tested for interactions which included both common and rare variants was 297,229,160. A Pvalue for declaring significant interaction after applying the Bonferroni correction for multiple tests was 1.68 × 10^{−10}. To examine the behavior of the BFGM, we plotted the QQ plot of the test (Fig. 3). QQ plot showed that the false positive rate of the BFGM for detection of epistasis was controlled.
For comparisons, the SFGM was also applied to the dataset. RPKM and DESeq were used to compute the overall expression value of genes from the RNAseq data. All the expression values were processed by the rankbased inverse normal transformation [24]. For both common and rare variants, in total, 162361, 260 and 51 significant cistrans interactions regulating the gene expressions were identified by the BFGM, SFGM with the RPKM and DESeq, respectively. We observed 9,846genes whose expressions were influenced by 16,2361cistrans interactions. We found that the average number of epistasis influencing each gene was 16. A total of 3,505 gene expressions were influenced by one significant cistrans genegene interactions, 169 gene expressions were influenced by more than 100 cistrans genegene interactions. Figure 4 presented a histogram that showed a distribution of the cistrans genegene interactions.
The Pvalues of the top 20 interactions between genes ranked by the BFGM method were summarized in Table 4 where Pvalues for testing interactions between genes by the SFGM (RPKM, DESeq and RNAmin) and min Pvalues were also listed. The RNAmin denoted the minimum of Pvalues computed by the SFGM method with the number of reads at each genome position of the gene as the scalar response in the functional regression model. The min Pvalues denoted to take the minimum of all Pvalues for testing all possible pairs of SNPs between two genes using functional regression model with functional response and scalar predictors. Table 4 showed several remarkable features. First, we often observed the pairwise interaction between rare and rare variants (34.38%), and rare and common variants (59.38%). Less observed was the significant pairwise interaction between common and common variants (6.25%). Second, significant interactions between two genes often indicated that at least one significant pair of SNPs in two genes could be observed (min Pvalues were small). However, we can observe that pairs of SNPs between two genes jointly had significant interaction effects, but individually each pair of SNPs mildly contributed to the interaction effects. Third, the BFGM often had a much smaller Pvalue to detect interaction than other tests. Fourth, we observed that genes may not show even mild marginal association, but they did demonstrate significant evidence of interaction. If only the interactions between two marginally significant genes are tested, some significant interactions may be missed. The fifth, the BFGM tremendously reduced computation burden.
To further assess the validity of the BFGM for epistasis analysis with RNAseq data, we randomly selected six pairs of genes from the significant 162361 genegene interactions. The Pvalues for testing the interactions of six pairs of genes using the BFGM and SFGM were summarized in Table 5. Table 5 showed that six significant interactions identified by the BFGM significantly influenced read count variation at least at one genomic position within the gene. To explain why the BFGM had higher power to detect interaction than the SFGM, we presented Fig. 5af showing the RNAseq profiles and overall expression level of the genes PLA2G4A, PLA2G6, PLAUR, PLD4, PLD6 and PLEKHA3 of two individuals, respectively. These figures showed that the overall expression levels of the individuals were the same, but their RNAseq profiles were quite different. This demonstrated that unlike the RNAseq profiles, the overall expression levels cannot capture the expression variation across the genes. Therefore, the SFGM using summary statistics as a trait will have less power to detect the interaction than the BFGM using the RNAseq profiles as a functionvalued trait.
To investigate whether the top 20 interactions were caused by the linkage disequilibrium (LD) or not, we listed the maximum of r ^{2} between all possible SNPs in the top 20 significantly interacting pairs of genes and the Pvalues for testing their presence of LD in Table 6. We did not observe the strong LD between the interacting genes.
Interactions in the MAPK signaling pathway
To show the detailed interaction structure, we presented the results of 331 significant cistrans interactions in the MAPK signaling pathway in the Additional file 2: Table S3 where min Pvalues indicated that the functional regression model with functional response and discrete predictors was used to test for the interaction for all possible pairs of SNPs within two genes and minimum of Pvalues of the tests was listed in the Additional file 2: Table S3. The column “SNP pair” listed their corresponding pair of SNPs reaching the minimum of the Pvalues and their chromosome locations. From Additional file 2: Table S3 we had several significant observations. First, we observed that the majority of interacting genes were located in different chromosomes, which implied that interactions were not caused by the linkage disequilibrium (LD). Second, we observed that large proportions of interacting genes did not show significant evidence of marginal association. This demonstrated that if we only selected the genes with significant marginal association for epistasis analysis, many interactions would be missed. Third, in general, the functionvaluebased epistasis analysis (BFGM, min Pvalues) had much smaller Pvalues than the summary statisticbased epistasis analysis (SFGM). Fourth, we observed that the genes interacting with the genes in MAPK signaling pathway were in 147 other pathways, including cytokinecytokine receptor interaction, Cytosolic DNAsensing pathway, DNA replicationamong others. Fifth, it was interesting to observe that the interacting genes formed a large connected network with 281 nodes and 317 edges (Fig. 6). We observed hub genes IBA57AS1 with 67 connections, HIST1H2AD with 21 connections, PRR24 with 18 connections and ARL6IP4 with 14 connections. HIST1H2AD is a core component of nucleosome and plays a central role in transcription regulation. ARL6IP4 functions as a splicing inhibitor [25].
Gene ontology and KEGG pathway enrichment analysis
Gene ontology enrichment analysis was performed on the genes in the identified 162361 pairs of significant cistrans interactions influencing the transcription to discover overrepresented functional biological groupings with interactions. Our analysis was performed using the biological process, cellular component and molecular function categories of the gene ontology.
Ontology enrichment analysis found that cistrans interactions were significantly enriched in biological processes (BP) including a single organism process, single organism cellular process, single organism metabolic process and development process (Fig. 7) and molecular functions that were primarily related to catalytic and binding activity with Pvalues <10^{−4} (Fig. 8). Ontology enrichment analysis also identified that cistrans interactions were significantly enriched in the cell, intracellular, organelle, and membrane bounded organelle components (Fig. 9).
The enrichment analysis was also applied to 228 KEGG pathways to identify the pathways that were enriched with cistrans interactions. The results were summarized in Fig. 10. The cistrans interactions were enriched in metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway.
Communities in gene interaction networks
We used random walks in igraph [26] to detect 10 communities from the entire genegene interaction network. We used R package GOstats [27] to conduct gene set enrichment analysis. We have identified 29 pathways enriched in 10 communities. Figure 11 showed the 6^{th} community with 96 genes and 186 interactions enriched with metabolism (Three of four significantly enriched pathways were metabolism pathways: Glycerolipid metabolism, Nicotinate and nicotinamide metabolism, and Pyrimidine metabolism) where node represents a gene and an edge represents the interaction between the connected gene by the edge. All 10 communities with the enriched pathways (Pvalue < 0.01) are summarized in Additional file 3: Table S4.
Discussion
In the past, the statistical epistasis of gene expression is defined as variant–variant interactions that regulate gene expression and its analysis has been mainly designed for microarray gene expression data and common variants. Since the dimension of the data for epistasis analysis of gene expression is very high, all the traditional methods for epistasis analysis of gene expression have the limited application to eQTL data. The whole genome epistasis studies of gene expressions have been very limited. The genetic structure of epistasis of gene expressions has not been fully discovered.
The recently developed nextgeneration mRNA sequencing (RNAseq) assay generates dozens or even one hundred million short reads of mRNA and WGS also generates millions of SNPs. As a consequence, these genetic variation and gene expression variation data are so densely distributed across the genome that both genetic variation and expression variation can be modeled as a function of genomic location. The RNAseq profiles can be taken as a functionvalued trait. However, the standard multivariate statistical analysis often fails with functional data. The computational burden and correction for multiple tests seriously damage the feasibility of the variantvariant interaction analysis of extremely high dimensional RNAseq and WGS genotype data. The variantvariant interaction analysis is not suitable for the epistasis analysis of the functionvalued traits with NGS data as genotype data. Although the genetic study of quantitative traits has seen wide application and extensive technical development, the quantitative genetic analysis, particularly epistasis analysis of functionvalued trait is comparatively less developed. To our knowledge, no statistical methods have been developed for genetic epistasis analysis of functionvalued traits with NGS data. In the past few years we have witnessed the rapid development of novel statistical methods for association studies using NGS data. However, these methods might not be appropriate for genetic epistasis analysis of functionvalued trait. The quantitative genetic epistasis analysis of rare variants for functionvalued traits remains a huge challenge.
The widely used methods for reducing dimensionality of the RNAseq data use the Poisson distribution, binomial distribution and negative binomial distribution to summarize the RNAseq profile into a single number to represent the RNAseq curve. However, these discrete distributions cannot capture the shape and variation of the RNAseq curve. To illustrate this we presented Additional file 4: Figure S1A showing the real RNAseq curve, the data simulated by a negative distribution of the gene LMNB2 and Additional file 4: Figure S1B showing the real RNAseq curve of the gene LMNB2 and the curve estimated by the FPCA of the RNAseq data. We observed that the negative distribution failed to capture the variation of the RNAseq profile, but the FPCA approximated the RNAseq curve exceedingly well.
Emergence of the NGS techniques demands a paradigm shift in the analytic methods for eQTL epistasis analysis from standard singlevariate or multivariate data analysis to functional data analysis. The BFGM with functional response and functional predictors takes a RNAseq profile as a functional response and genetic variants across the genomic regions as functional predictors, which can be used to test the association of the entire allelic spectrum of the genetic variation with a functionvalued trait and has several remarkable features. First, unlike simple and multiple regressions that discard a large amount of information, the BFGM preserves the intrinsic structure and all the positionallevel genetic information. Second, the multiple regressions will not account for the spaceordering of the data and correlation information contained in the data. The BFGM simultaneously employs genetic information of the individual variants and correlation information contained in both RNAseq and SNP data. Third, both the sign and the size of the heterogeneity will also be incorporated into the test in the BFGM. Fourth, the multicolinearity problem in the BFGM is alleviated. Fifth, the BFGM expands both RNAseq function and genotype function in terms of orthogonal eigenfunctions, which leads to substantial dimension reduction. The BFGM for genetic epistasis analysis of a functionvalued trait which captures key information in the data is expected to open a new route for genetic epistasis analysis of RNAseq and NGS genotype data.
Conclusions
We developed a novel functional regression model with both functional response and functional predictors for detection of epistasis influencing RNAseq variations in humans, which is referred to as the BFGM. The BFGM takes genes as a basic unit of epistasis analysis and utilizes all information contained in both the RNAseq and SNP data. By large simulations and real data analysis we demonstrated the merits and limitations of the proposed new paradigm of epistasis analysis for the RNAseq and WGS data.
The new approach uses all genetic information in the genome regions and expression variation information in the target gene to collectively test the interaction between multiple SNPs within the regions influencing the RNAseq curves. Therefore, the BFGM for interaction analysis overcomes limitations inherent in pairwise interaction tests with the summary expression level as a scalar trait. By large simulations and real data analysis, we showed that the proposed BFGM substantially increased the power, dramatically reduced the computational burden and substantially outperformed the traditional variantvariant epistasis analysis of summary statistic measured quantitative traits. In real data analysis, we also clearly demonstrate that pairs of SNPs between two genes jointly have significant interaction effects, but individually each pair of SNPs makes a mild contribution to interaction effects.
The previous interaction analyses have mainly focused on the interactions between common and common variants. The distribution of the common and rare variants causing interactions is unknown. Very few genomewide interaction analyses with the RNAseq and WGS data, and very few results of significant interaction between rare and rare variants, and rare and common variants have been reported. We analyzed 350 samples of European origin with both RNAseq and whole genome sequencing data available. We observed the large proportions of pairwise interactions between rare and rare variants, and rare and common variants. The significant pairwise interactions between common and common variants were less observed. The results showed that the number of significant cistrans interactions identified by the SFGM with RPKM as overall gene expression level only accounted for 0.16% of the significant cistrans interactions identified by the BFGM with RNAseq and NGS genotype data. The majority of epistasis analysis for gene expressions used the microarray to measure gene expressions and test interactions for only common variants. Even though the RNAseq data are available they still converted variation rich RNAseq data into a single number such as RPKM or other summary statistics. Then, the variantvariant epistasis analysis is conducted on these converted data. That explains why these researches question the universe presence of significant genegene interaction influencing gene expressions.
Some researchers suggest that in genomewide interaction analysis only genes with large or mildly marginal genetic effects should be tested for interaction. However, we observed that the majority of the significantly interacting genes showed no marginal association. These results clearly demonstrated that if we tested interactions for only genes with marginal associations, then many true interactions will be missing.
We are unsure whether interaction is most often presented in isolation, or interacting genes form networks. We identified a large number of cistrans interactions and observed that the interacting genes formed large connected networks with hub genes presented. We found that some hub genes, for example histone modification genes, can globally regulate gene expressions. Enrichment analysis also showed that metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway among others were enriched with cistrans interactions.
The results in this paper are preliminary. The confounding factors that cause spurious interactions have not been investigated. The statistical methods for epistasis analysis which remove confounding factors have not been developed. The complete genomewide epistasis analysis including all cis and trans interactions have not been performed. The purpose of this paper is to stimulate further discussions regarding the great challenges we are facing in the epistasis analysis of high dimensional RNAseq and WGS data.
Methods
Functional regression with both functional response and functional predictor models for epistasis analysis
For the convenience of discussion, position level read counts are taken as the RNAseq profile and is referred to as a functionvalued trait. Let y _{ i }(τ), τ ∈ Τ _{ τ } = [0, Τ _{ τ }] be the read counts of the i ^{th} individual at the genomic position τ. Consider two genomic regions (or genes) [a _{1}, b _{1}] and [a _{2}, b _{2}]. Let x _{ i }(t) and z_{ i }(s) be genotypic functions of the i ^{th} individual defined in the regions [a _{1}, b _{1}] and [a _{2}, b _{2}], respectively. Let t and s be a genomic position in the first and second genomic regions, respectively. The genotype functions x _{ i }(t) and z_{ i }(s) are defined as
where M and m are two alleles of the marker at the genomic position t and s, P _{ M }(t) and P _{ m }(t), and P _{M}(s), P _{ m }(s) are the frequencies of the alleles M and m at the genomic positions t and s, respectively. Consider a functional regression model with functional response and functional predictors (BFGM):
where μ(τ) is an overall mean function at the genomic position τ, W _{ i } is a vector of covariates for i ^{th} individual, ω(τ) is a vector of effects associated with the covariates, α(t, τ) is a genetic additive effect function at genomic position t of the first gene and genomic position τ of the RNAseq profile, β(s, τ) is a genetic additive effect function at genomic positions s of the second gene and the genomic position τ, γ(t, s, τ) is an interaction effect function between two putative quantitative trait loci (QTLs) located at the genomic positions t and s influencing the read counts at the genomic position τ, and ε _{ i }(τ) is a residual function of the unexplained effect for the i ^{th} individual at the genomic position τ. The interaction function is measured by double integrals of the genotype function in two genes.
Estimation of interaction effect function
We assume that both position level read count function and genotype functions are centered. The genotype functions x _{ i }(t) and z _{ i }(s) are expanded in terms of the orthonormal basis function as:
where \( {\phi}_{{}_j}(t) \) and ψ _{ l }(s) are sequences of the orthonormal basis functions. The expansion coefficients ξ _{ ij } and η _{ il } are estimated by
In practice, numerical methods for the integral will be used to calculate the expansion coefficients. Substituting Eq. (2) into Eq. (1), we obtain
where
The parameters α _{ j }(τ), β _{ l }(τ) and γ _{ jl }(τ) are referred to as genetic additive effect and additive x additive effect score functions. These score functions can also be viewed as the expansion coefficients of the genetic effect functions with respect to orthonormal basis functions:
Equation (4) can be written in a vector form:
where Y(τ), μ(τ), ω(τ), α(τ), β(τ) and γ(τ) are vectors, W, ξ, η and Γ are matrices.
Expanding Y(τ), μ(τ), ω(τ), α(τ), β(τ), γ(τ) and ε(τ) in terms of the orthogonal basis functions yield
Define expansion coefficient vectors and matrices as follows.
Thus, substituting the above expansion into Eq. (5) gives
where
Since Eq. (6) holds for every genomic position τ, the coefficients on both sides of Eq. (6) should be equal. Therefore, the functional regression model (6) can be further transformed to standard multivariate multiple regression:
Let
Equation (7) can be rewritten as
The standard least square estimators of b is
The covariance matrix Σ is estimated by
Test statistic
An essential problem in genetic epistasis analysis of the functionvalued traits is to test the interaction between two genomic regions (or genes). Formally, we investigate the problem of testing the following hypothesis:
which is equivalent to testing the hypothesis:
Let vec denote the vector operation. To develop test statistics, we begin with calculating the covariance matrix of the \( v e c\left(\widehat{b}\right) \). We assume that
Recall that
Therefore, we have
Let Λ be a matrix consisting of the last JLK columns and JLK rows of the covariance matrix \( \operatorname{var}\left( vec\left(\widehat{b}\right)\right) \) and \( \widehat{\gamma} \) be the estimators of interaction which can be obtained by extracting the last JL rows of the estimators of the matrix \( \widehat{b} \). Define the test statistic for testing the interaction between two genomic regions [a _{1}, b _{1}] and [a _{2}, b _{2}] as
Then, under the null hypothesis H _{0} : γ = 0, T _{ I } is asymptotically distributed as a central χ ^{2}_{(JLK)} with degrees of freedom JLK or the rank of the matrix Λ.
Null distribution of test statistics
To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We calculated the type I error under three models. We first assumed the model with no marginal effects:
Model 1 (no marginal effect):
where μ(τ) is the overall mean at the genomic position τ, y _{ i }(τ) is the normalized number of reads at the genomic position τ of the i ^{th} individual and ε _{ i }(τ) is an error stochastic process. The errors should be correlated stochastic process. The theoretic models for the errors are unclear. They were estimated from the data. The procedures for generating mean μ(τ) and errors ε _{ i }(τ) consisted of the following steps.
Step 1: We randomly sampled 100 genes from the whole real RNAseq dataset. Let k index genes, j index the genomic positions and i index the samples. Assume that the gene k is located in the interval [a _{ k }, b _{ k }]. Let x _{ ikj } , (i = 1, …, n, k = 1, …, 100, j = 1, …, s _{ k }) be the observed count of reads of the gene k in the genomic position j of the i ^{th} individual where the length of gene k is denoted s _{ k }. For each genomic position, we define an n dimensional vector:
x _{ kj } = [x _{1kj }, …, x _{ nkj }]^{T}.
Step 2. Let m be the median length of 100 genes. In our dataset, m = 2, 456.
Step 3. Remap the original RNAseq data of 100 genes to the interval [0, 1] using transformation \( \frac{j{a}_k}{b_k{a}_k} \). Then, estimate the count of reads on position \( 0,\frac{1}{m},\frac{2}{m},\dots, 1 \) from the original RNAseq data of the 100 genes using local polynomial regression (LOESS). The estimated count of reads of the gene k in the genomic position j of the i ^{th} individual at the equally distributed new positions \( 0,\frac{1}{m},\frac{2}{m},\dots, 1 \) are denoted by y _{ ikj }. Define vector
Step 4. Compute the means of the remapped the RNAseq data over 100 genes and over n samples: \( {\overline{y}}_{ij}=\frac{1}{100}{\displaystyle {\sum}_{k=1}^{100}{y}_{ikj}}, i=1,\dots, n, j=1,\dots, m \) and \( {\overline{y}}_j=\frac{1}{100\times n}{\displaystyle {\sum}_{i=1}^n{\displaystyle {\sum}_{k=1}^{100}{y}_{i kj}}}, j=1,\dots, m \).
Define the mean vector of the remapped counts of reads: \( \overline{y}={\left[{\overline{y}}_1,\dots, {\overline{y}}_m\right]}^T \).
Step 5. Compute the mean function μ(τ). Pooling all the remapped data:
Use the pooled data to perform FPCA, which leads to functional principal component expansion:
where β _{ l }(τ) are the functional principal components.
Calculate \( {\overline{\xi}}_l=\frac{1}{n}{\displaystyle {\sum}_{i=1}^n{\xi}_{i l}}, l=1,\dots, L \). Using the averaged functional principal component score, we compute the mean μ(τ) as follows:
Step 6. Define the centralized RNAseq data matrix:
where \( {Z}_{ij}={\overline{y}}_{ij}{\overline{y}}_j, i=1,\dots, n, j=1,\dots, m \).
We perform FPCA on the centralized dataset Z where means of the RNAseq data at each genomic position over 100 genes is removed and obtain a set of functional principal components (eigenfunctions) {φ _{1}(τ), …, φ _{ T }(τ)} and functional principal component scores η _{ it }, i = 1, …, n, t = 1, …, T. Define T random variables η = [η _{1}, …, η _{ T }] with vectors of their sampling values:η _{ t } = [η _{1t }, …, η _{ nt }]^{T}, t = 1, …, T.
We then calculate the sampling covariance matrix \( \widehat{\varSigma}=\operatorname{cov}\left(\eta, \eta \right) \).Assume that the scores of the residuals follow a multivariate normal distribution \( N\left(0,\widehat{\varSigma}\right) \). Using the normal random variables to generate an n sample of vectors ε _{ i } = [ε _{ i1}, …, ε _{ iT }]. The residuals ε _{ i }(τ) will be defined as
Model 2 (a marginal effect at the first gene):
where μ(τ) is the overall mean at the genomic position τ, ε _{ i }(τ) is an error stochastic process, x _{ ij } is an indicator variable for the genotype of i ^{th} individual at the j ^{th} SNP of the first gene, y _{ i }(τ) is defined as that in model 1. The coefficient α _{ j }(τ) = r _{ j } ⋅ α(τ), where r _{ j } is randomly selected from 0.5 to 1.5, is the additive effect function of the j ^{th} SNP of the first gene, μ(τ) is obtained by randomly sampling 100 genes from the real RNAseq and WGS genotype data without interactions and α(τ) is obtained by randomly sampling 100 genes from the real RNAseq and WGS genotype data under the condition that one gene have significant main effect, the other gene do not have significant main effect, and these gene pairs are not in the list of significantly interacted gene pairs in our results. The overall mean μ(τ), effect function α(τ) and the residuals ε _{ i }(τ) were similarly simulated as that in Model 1.
Model 3 (marginal effects at both the first and the second genes):
where z _{ ik } is an indicator variable for the genotype of i ^{th} individual at the k ^{th} SNP of the second gene. The genetic additive effect function β _{ k }(τ _{ l }) is assumed to be equal to β _{ k }(τ) = s _{ k } β(τ), where s _{ k } is randomly selected from 0.5 to 1.5, other parameters are defined in Model 2. A total of 100 pairs of genes were randomly selected under the condition that both genes have significant main effect and these gene pairs are not in the list of significant interacted gene pairs in our results. The overall mean function μ(τ), main effect functions α(τ) and β(τ), and residual term ε _{ i }(τ) were similarly generated as that in Models 1 and 2.
Power evaluation
To evaluate the performance of the functional regression model for testing epistatic effects on gene expression, we estimated the power through simulations. We assumed that there were k _{ 1 } SNPs in the first gene and k _{ 2 } SNPs in the second gene. Thus, there were totally k _{ 1 } k _{ 2 } SNP pairs between these two genomic regions. For the h ^{th} pair of SNPs, let \( {Q}_{h_1} \) and \( {q}_{h_1} \) be two alleles at the SNP in the first gene, \( {Q}_{h_2} \) and \( {q}_{h_2} \) be two alleles at the SNP in the second gene. Let u ^{h}_{ ijkl } denote her/his genotypes of the h ^{th} pair of SNPs, where \( i j\in {Q}_{h_1}{Q}_{h_1},{Q}_{h_1}{q}_{h_1},{q}_{h_1}{q}_{h_1} \) and \( kl\in {Q}_{h_2}{Q}_{h_2},{Q}_{h_2}{q}_{h_2},{q}_{h_2}{q}_{h_2} \). Let \( {g}_{u_{ijkl}}^h\left(\tau \right) \) denote her/his genotypic value in the h ^{th} pair of SNPs at genomic position τ influencing gene expressions. Then we can use the following multiple regression model to generate the functionvalued trait (RNAseq) of the u ^{th} individual of the h ^{th} pair of SNPs at the genomic position τ.
\( {g}_{u_{ijkl}}^h\left(\tau \right)={\lambda}_{u_{ijkl}}^h g\left(\tau \right) \), \( {\lambda}_{u_{ijkl}}^h \) is a risk parameter which is determined by the gene interaction model (Additional file 5: Table S2), the risk parameter r varies from 0 to 1, and g(τ) is a common genotype coefficient function fitted by the real RNAseq data and ε _{ u }(τ) is the error stochastic process and estimated from the null model as that in null distribution in the test statistics section.
Abbreviations
 BFGM:

FRGM with Functional Response and Functional Predictors
 FRGM:

Functional regression model
 NGS:

Next generation sequencing
 PCA:

Principal component analysis
 SFGM:

FRGM with scalar response and functional predictors
 WGS:

Whole genome sequencing
References
Fisher RA. The correlation between relatives on the supposition of mendelian inheritance. Trans Roy Soc Edinb. 1918;52:399–433.
Lehner B. Molecular mechanisms of epistasis within and between genes. Trends Genet. 2011;27:323–31.
Phillips PC. The language of gene interaction. Genetics. 1998;149:1167–71.
Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–67.
Hemani G, Shakhbazov K, Westra HJ, Esko T, Henders AK, McRae AF, Yang J, Gibson G, Martin NG, Metspalu A. Detection and replication of epistasis influencing transcription in humans. Nature. 2014;508:249–53.
Huang Y, Wuchty S, Przytycka TM. eQTL epistasis–challenges and computational approaches. Front Genet. 2013;4:51.
Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci. 2012;109:1193–8.
Kang H, Yang X, Chen R, Zhang B, Corona E, Schadt E, Butte A. Integration of diseasespecific single nucleotide polymorphisms, expression quantitative trait loci and coexpression networks reveal novel candidate genes for type 2 diabetes. Diabetologia. 2012;55:2205–13.
Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y. Performance analysis of novel methods for detecting epistasis. BMC Bioinf. 2011;12:1.
Kang M, Zhang C, Chun HW, Ding C, Liu C, Gao J. eQTL epistasis: detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways. Bioinformatics. 2015;31:656–64.
Lappalainen T, Montgomery SB, Nica AC, Dermitzakis ET. Epistatic selection between coding and regulatory variation in human evolution and disease. Am J Hum Genet. 2011;89:459–63.
Sun X, Lu Q, Mukherjee S, Crane PK, Elston R, Ritchie MD. Analysis pipeline for the epistasis search–statistical versus biological filtering. Front Genet. 2014;5:106.
Lee J, Ji Y, Liang S, Cai G, Müller P. On differential gene expression using RNAseq data. Cancer Informat. 2011;10:205–15.
Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of nextgeneration mRNA sequencing (RNAseq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci. 2011;108:19867–72.
Martin JA, Wang Z. Nextgeneration transcriptome assembly. Nat Rev Genet. 2011;12:671–82.
Wang Z, Gerstein M, Snyder M. RNAseq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.
Finotello F, Di Camillo B. Measuring differential gene expression with RNAseq: challenges and strategies for data analysis. Brief Funct Genomics. 2015;14:130–42.
Gosik K, Kong L, Chinchilli VM, Wu R. iFORM/eQTL: an ultrahighdimensional platform for inferring the global genetic architecture of gene transcripts. Brief Bioinform. 2017;18(2):250–9.
Zhang F, Boerwinkle E, Xiong M. Epistasis analysis for quantitative traits by functional regression model. Genome Res. 2014;24:989–98.
Zhang F, Xie D, Liang M, Xiong M. Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits. PLoS Genet. 2016;12:e1005965.
Lappalainen T, Sammeth M, Friedländer MR, AC‘t Hoen P, Monlong J, Rivas MA, GonzàlezPorta M, Kurbatova N, Griebel T, Ferreira PG. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–11.
Graffelman J, Moreno V. The mid pvalue in exact tests for HardyWeinberg equilibrium. Stat Appl Genet Mol Biol. 2013;12:433–48.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:1.
Beasley TM, Erickson S, Allison DB. Rankbased inverse normal transformations are increasingly used, but are they merited? Behav Genet. 2009;39:580–95.
Li Q, Zhao H, Jiang L, Che Y, Dong C, Wang L, Wang J, Liu L. An SRprotein induced by HSVI binding to cells functioning as a splicing inhibitor of viral premRNA. J Mol Biol. 2002;316:887–94.
Csardi G, Nepusz T. The igraph software package for complex network research. Inter J Complex Systems. 2006;1695(5):1–9.
Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23(2):257–8.
Acknowledgement
The authors express appreciation to the computation sources from the Texas Advanced Computing Center and the HighEnd Computing Center at Fudan University.
Funding
Mr. Xiong was supported by Grant 1R01AR057120–01 and 1R01HL10603401, from the National Institutes of Health and NHLBI. Ms. Xu and Mr. Jin were supported by research grants from the National Natural Science Foundation of China (31330038, 31521003), National Basic Research Program (2014CB541801), and 111 Project (B13016). The funders had no role in the design of the study, data collection, analysis, or manuscript preparation.
Availability of data and materials
The R package FRGM_1.0.tar.gz is available from https://sph.uth.edu/research/centers/hgc/xiong/software.htm. All results represented in this paper could be reproduced by the examples and programs in this package. SNP data supporting the conclusions of this article are available from FTP site hosted at the EBI, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/, and RNAseq data are available in Geuvadis RNA sequencing project of 1000 Genomes samples under accession EGEUV3, http://www.ebi.ac.uk/arrayexpress/files/EGEUV3/processed/.
Authors’ contributions
KLX, JL and MMX designed this study, developed the statistical methods and wrote the manuscript. KLX wrote the R package and performed data analysis. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
This study is a retrospective analysis of the existing data. SNP data are downloaded from FTP site hosted at the EBI, and RNAseq data are downloaded from EBI ArrayExpress.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1: Table S1.
Summary statistics of genetic variants in the five genes. (XLS 25 kb)
Additional file 2: Table S3.
A list of significant cistrans interactions in the MAPK pathway. (XLS 114 kb)
Additional file 3: Table S4.
Ten communities and their significantly enrished pathways. (XLSX 12 kb)
Additional file 4: Figure S1.
A. The real RNAseq curve of the gene LMNB2 and the data simulated by the negative distribution of the gene LMNB2. B. The real RNAseq curve of the gene LMNB2 and the curve estimated by the FPCA of the RNAseq data. (ZIP 120 kb)
Additional file 5: Table S2.
The value of risk parameter in different interaction models. (XLS 30 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Xu, K., Jin, L. & Xiong, M. Functional regression method for whole genome eQTL epistasis analysis with sequencing data. BMC Genomics 18, 385 (2017). https://doi.org/10.1186/s1286401737774
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401737774