Skip to main content
  • Methodology article
  • Open access
  • Published:

Functional regression method for whole genome eQTL epistasis analysis with sequencing data

Abstract

Background

Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges.

Methods

We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions.

Results

By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples.

Conclusions

The proposed FRGM for epistasis analysis of RNA-seq can capture isoform and position-level information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice of the epistatic analysis with sequencing data.

Background

Epistatic effect in gene expression, defined as the departure from additive effects in a linear model of eQTL analysis [1], plays an essential role in understanding the gene regulation and disease mechanisms [2,3,4]. One polymorphism’s effect on expression of a gene depends on other polymorphisms present in the genome [5]. Epistasis analysis of gene expressions will substantially improve the understanding of the genetic architecture of gene expression and facilitate mechanistic insights into complex traits [6, 7]. However, eQTL epistasis analysis remains fundamentally unexplored due to large computational challenges and data availability [6].

Gene expression is an intermediate phenotype that bridges the genotype and higher level phenotypes such as diseases [8, 9]. Studying the effect of epistasis on the gene expression could provide a better understanding of the genetic architecture and gene regulation. The importance of detecting the epistatic effect on the gene expression has been emphasized in many recent studies [10, 11]. However, the corresponding methods are relatively rare. The widely used statistical methods for identifying eQTL epistasis are designed for microarray expression data where an overall expression of the gene is taken as a quantitative trait and all methods for QTL epistasis analysis can be used for eQTL epistasis analysis [10, 12].

Application of next generation sequencing (NGS) techniques to the genetic analysis of gene expression involves (1) generating millions of short reads of mRNA or cDNA which are mapped to the genome and lead to a sequence of read counts at the hundreds of millions of genomic positions [13,14,15,16] and (2) generating millions or 10 millions of genetic variants. RNA-seq counts vary greatly across the gene [17]. Count variations can be due to experimental bias such as fragmentation methods, reverse-transcription [16], sequence-specific bias and sequencing technology variation [18]. However, count variation can also be caused by variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells [13,14,15,16, 18]. RNA-seq data can be viewed as a function or a curve of the genomic position and hence can be taken as a function-valued trait.

Although RNA-seq data are measured as a function, the widely used methods for genetic studies of the RNA-seq in humans are the same as that for the traditional single-valued quantitative traits where a single number for overall expression of the gene is taken as a quantitative trait. These methods use summary statistics to measure or represent gene expressions assayed by NGS techniques and cannot capture the expression variations across the gene due to splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the various cells. The summary statistic-based epistasis analysis of the RNA-seq fails to utilize all transcripts information.

The critical barrier in epistasis analysis is to deal with rare variants. The traditional statistical methods for epistasis analysis were originally designed for testing the interaction between common variants and are difficult to apply to rare variants due to high type 1 error rates, severe multiple testing, prohibitive computational time and low power [19]. Whole genome RNA-req eQTL analysis poses a significant challenge. To meet the challenge, we developed a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data, which allows simultaneous capture of all space information hidden in the RNA-seq data and genetic variation data, but with substantially reduced dimensions. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and uses all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions (or genes). The proposed FRGM for epistasis analysis of the RNA-seq can capture isoform and position-level information and will have a broad application.

The FRGM for epistasis analysis has several remarkable features. First, the FRGM accounts for the change in the position-level read counts, while preserving the intrinsic structure and all the positional-level genetic information. Second, the FRGM simultaneously utilizes both correlation information among the RNA-seq at different genomic positions and among all variants in a genomic region. Third, the multicollinearity problems in the FRGM which may be presented in both the RNA-seq and genetic variation are alleviated. Fourth, the FRGM expands both position-level read count function and genotype function in terms of orthogonal eigenfunction, which leads to substantial dimension reduction in both RNA-seq data and SNP data. The FRGM for epistasis analysis of function-valued traits which capture key information in the data is expected to open a new route for epistasis analysis of RNA-seq data.

To evaluate its performance for epistasis analysis of the RNA-seq, we use large scale simulations to calculate the type I error rates and evaluate the power of the proposed FRGM for detecting epistasis. To further evaluate its performance, the FRGM for epistasis analysis is applied to 350 samples with both RNA-seq and NGS data from the 1000 Genomes Project. An R packge for implementing the developed FRGM for epistasis analysis of RNA-seq and NGS data can be downloaded from our website https://sph.uth.edu/research/centers/hgc/xiong/software.htm.

Results

Null distribution of test statistics

To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We consider three models for type 1 error rate simulations: model 1 with no marginal effects, model 2 with marginal effects at the first gene and model 3 with marginal effects at both the first and second genes.

We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 from the next generation sequencing data in the 1000 Genomes Project. The summary statistics of the variants in five genes are summarized in Additional file 1: Table S1. The marginal genetic effects will be estimated from the data. 100 genes with RNA-seq data were randomly selected from GEUVADIS project. They were used to develop the models for generating RNA-seq data in simulation (Detailed description were referred to Method Section).10 pairs of genes were selected from five genes : IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 with genotype data from 1000 Genome Project dataset.

The number of sampled individuals from the population ranged from 1000 to 5,000, and 5,000 simulations were repeated. We randomly selected 10% of the SNPs as causal variants from five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4. We perfume gene-gene interaction tests for 10 pairs of genes selected from five genes with genotypes under the three models for 5000 times. The type 1 error rates were averaged over 10 pairs of genes with genotype data and 5,000 simulations for each model. Tables 1, 2 and 3 summarized the type I error rates of the test statistics for testing the interaction between two genes with no marginal effect, marginal effect at the first gene and marginal effects at both genes consisting only of rare variants and both common and rare variants, respectively, averaged over 100 genes with RNA-seq data and 10 pairs of genes with genotype data at the nominal levels α = 0.05, α = 0.01 and α = 0.001. These results clearly showed that the type I error rates of the FRGM-based test statistics for testing interaction between two genes with or without marginal effects were not appreciably different from the nominal α levels.

Table 1 Average type 1 error rates of the statistic for testing interaction between two genes with no marginal effect over 10 pairs of genes
Table 2 Average type 1 error rates of the statistic for testing interaction between two genes with marginal effect at the first genes over 10 pairs of genes
Table 3 Average type 1 error rates of the statistic for testing interaction between two genes with marginal effects at two genes over 10 pairs of genes

Power evaluation

To evaluate the performance of the functional regression model for testing the epistatic effect on gene expression, we estimated the power through simulations. We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in two genes: IRAK3 and ACSS3 from the next generation sequencing data in 1000 Genomes Project. We randomly selected 20% variants as causal variants, assumed that there were k 1 SNPs in the first gene, and k 2 SNPs in the second gene. Two thousand individuals were sampled. We assumed that both marginal effects and epistasis effects were a function of the genomic position and used the multiple regression models to generate the RNA-seq data under four interaction models: Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold (See the Methods section).

We compared the power of the FRGM with both functional response and functional predictors (BFGM), FRGM with scalar response and functional predictors (SFGM) and regression on principal component analysis (PCA). For the PCA method, the PCA was performed on the RNA-seq data and the number of PCs were selected to explain 80% variance of number of reads at different genomic positions. The multiple functional regression was performed to analyze the data [20]. In the BFGM, both RNA-seq and genotype profiles were taken as a function of genomic position and expanded in terms of functional principal components.

Figure 1a-d plotted the power curves of three statistics: BFGM, SFGM and PCA to test the interaction between two genes with rare variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. In the simulation, 20% of the rare variants were randomly selected as the causal variants. These power curves were a function of the risk parameter at the significance level α = 0.05. We observed that under all four interaction models the BFGM had the highest power, followed by the regression on PCA. Power of the SFGM was the lowest. The results demonstrated that summary statistics such as RPKM for measuring gene expression could not capture the expression variations across the gene and almost had no power to detect the interaction between two genes with rare variants.

Fig. 1
figure 1

a. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of rare variants with the RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant OR Dominant model, assuming sample sizes of 2,000. b. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant AND Dominantmodel, assuming sample sizes of 2,000. c. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Recessive OR Recessive model, assuming sample sizes of 2,000. d. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Threshold model, assuming sample sizes of 2,000

The BFGM can also be applied to the presence of both common and rare variants. Figure 2a-d plotted the power curves of three statistics for testing interaction between two genes with both common and rare variants where 10% of the common variants and 10% of the rare variants were chosen as causal variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. The power patterns of tests for the interactions between two genes with both common and rare variants were similar to that with rare variants only. The BFGM had the highest power, followed by the PCA and the SFGM. However, we noticed that the power of the SFGM for epistasis analysis in the presence of common variants increased substantially. Under some models such as the Dominant OR Dominant model, the SFGM would have enough power to detect interactions between two genes with common variants.

Fig. 2
figure 2

a. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with the RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant OR Dominant model, assuming sample sizes of 2,000. b. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant AND Dominantmodel, assuming sample sizes of 2,000. c. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Recessive OR Recessive model, assuming sample sizes of 2,000. d. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Threshold model, assuming sample sizes of 2,000

RNA-seq data and NGS data

The BFGM was applied to the RNA-seq data in the GEUVADIS RNA Sequencing Project [21] and the WGS data in the1000 Genomes Project. A total of 350 samples with European origin was shared between the GEUVADIS RNA Sequencing Project and 1000 Genomes Project, which had combined transcriptome (22,706 gene expressions measured by RNA-seq) and genome sequencing data (2,708,453 SNPs in 24,519 genes). After removing singleton SNPs, repeated SNPs, and filtering out the SNPs violating HW equilibrium [22] (P value < 10−9 for declaring HW disequilibrium), 2,566,261 SNPs in 18,986 genes were included in the epistasis analysis. In the RNA-seq data pre-processing, we removed the genes whose expressing rates were less than 30% and the genes that did not contain any SNPs. Finally, RNA-seq data of the 15,656 genes were included in the analysis. We used DESeq [23] to normalize the RNA-seq data.

Cis-trans interactions

We considered the RNA-seq curve of the target gene as a function-valued trait. The target gene selected from the 15656 gene expressions was referred to as gene 1. We selected one of the remaining 18985 genotyping genes as gene 2. We used BFGM to test for the interactions between gene 1 and gene 2 influencing the expression of the target gene 1. The total number of gene pairs tested for interactions which included both common and rare variants was 297,229,160. A P-value for declaring significant interaction after applying the Bonferroni correction for multiple tests was 1.68 × 10−10. To examine the behavior of the BFGM, we plotted the QQ plot of the test (Fig. 3). QQ plot showed that the false positive rate of the BFGM for detection of epistasis was controlled.

Fig. 3
figure 3

QQ plot of P-values from the BFGM for testing the cis-trans interactions between two genes influencing transcription

For comparisons, the SFGM was also applied to the dataset. RPKM and DESeq were used to compute the overall expression value of genes from the RNA-seq data. All the expression values were processed by the rank-based inverse normal transformation [24]. For both common and rare variants, in total, 162361, 260 and 51 significant cis-trans interactions regulating the gene expressions were identified by the BFGM, SFGM with the RPKM and DESeq, respectively. We observed 9,846genes whose expressions were influenced by 16,2361cis-trans interactions. We found that the average number of epistasis influencing each gene was 16. A total of 3,505 gene expressions were influenced by one significant cis-trans gene-gene interactions, 169 gene expressions were influenced by more than 100 cis-trans gene-gene interactions. Figure 4 presented a histogram that showed a distribution of the cis-trans gene-gene interactions.

Fig. 4
figure 4

A histogram showing a distribution of the number of cis-trans gene-gene interactions on each gene expression

The P-values of the top 20 interactions between genes ranked by the BFGM method were summarized in Table 4 where P-values for testing interactions between genes by the SFGM (RPKM, DESeq and RNAmin) and min P-values were also listed. The RNAmin denoted the minimum of P-values computed by the SFGM method with the number of reads at each genome position of the gene as the scalar response in the functional regression model. The min P-values denoted to take the minimum of all P-values for testing all possible pairs of SNPs between two genes using functional regression model with functional response and scalar predictors. Table 4 showed several remarkable features. First, we often observed the pair-wise interaction between rare and rare variants (34.38%), and rare and common variants (59.38%). Less observed was the significant pair-wise interaction between common and common variants (6.25%). Second, significant interactions between two genes often indicated that at least one significant pair of SNPs in two genes could be observed (min P-values were small). However, we can observe that pairs of SNPs between two genes jointly had significant interaction effects, but individually each pair of SNPs mildly contributed to the interaction effects. Third, the BFGM often had a much smaller P-value to detect interaction than other tests. Fourth, we observed that genes may not show even mild marginal association, but they did demonstrate significant evidence of interaction. If only the interactions between two marginally significant genes are tested, some significant interactions may be missed. The fifth, the BFGM tremendously reduced computation burden.

Table 4 P-values of top 20 genes ranked by the BFGM methods

To further assess the validity of the BFGM for epistasis analysis with RNA-seq data, we randomly selected six pairs of genes from the significant 162361 gene-gene interactions. The P-values for testing the interactions of six pairs of genes using the BFGM and SFGM were summarized in Table 5. Table 5 showed that six significant interactions identified by the BFGM significantly influenced read count variation at least at one genomic position within the gene. To explain why the BFGM had higher power to detect interaction than the SFGM, we presented Fig. 5a-f showing the RNA-seq profiles and overall expression level of the genes PLA2G4A, PLA2G6, PLAUR, PLD4, PLD6 and PLEKHA3 of two individuals, respectively. These figures showed that the overall expression levels of the individuals were the same, but their RNA-seq profiles were quite different. This demonstrated that unlike the RNA-seq profiles, the overall expression levels cannot capture the expression variation across the genes. Therefore, the SFGM using summary statistics as a trait will have less power to detect the interaction than the BFGM using the RNA-seq profiles as a function-valued trait.

Table 5 The P-values of randomly selected 6 pairs of genes from the significant 162361 gene-gene interactions
Fig. 5
figure 5

a. RNA-seq profile of the gene PLA2G4A where the curve represented the number of reads as a function of the genomic position. The dotted line denoted the overall expression of the gene PLA2G4A. b. RNA-seq profile of the gene PLA2G6 where the curve represented the number of reads as a function of the genomic position. The dotted line denoted the overall expression of the gene PLA2G6. c. RNA-seq profile of the gene PLAUR where the curve represented the number of reads as a function of the genomic position. The dotted line denoted the overall expression of the gene PLAUR. d. RNA-seq profile of the gene PLD4 where the curve represented the number of reads as a function of the genomic position. The dotted line denoted the overall expression of the gene PLD4. e. RNA-seq profile of the gene PLD6 where the curve represented the number of reads as a function of genomic position. The dotted line denoted the overall expression of the gene PLD6. f. RNA-seq profile of the gene PLEKHA3 where the curve represented the number of reads as a function of the genomic position. The dotted line denoted the overall expression of the gene PLEKHA3

To investigate whether the top 20 interactions were caused by the linkage disequilibrium (LD) or not, we listed the maximum of r 2 between all possible SNPs in the top 20 significantly interacting pairs of genes and the P-values for testing their presence of LD in Table 6. We did not observe the strong LD between the interacting genes.

Table 6 The maximum of r2 between all possible SNPs in top 20 significantly interacting pairs of genes

Interactions in the MAPK signaling pathway

To show the detailed interaction structure, we presented the results of 331 significant cis-trans interactions in the MAPK signaling pathway in the Additional file 2: Table S3 where min P-values indicated that the functional regression model with functional response and discrete predictors was used to test for the interaction for all possible pairs of SNPs within two genes and minimum of P-values of the tests was listed in the Additional file 2: Table S3. The column “SNP pair” listed their corresponding pair of SNPs reaching the minimum of the P-values and their chromosome locations. From Additional file 2: Table S3 we had several significant observations. First, we observed that the majority of interacting genes were located in different chromosomes, which implied that interactions were not caused by the linkage disequilibrium (LD). Second, we observed that large proportions of interacting genes did not show significant evidence of marginal association. This demonstrated that if we only selected the genes with significant marginal association for epistasis analysis, many interactions would be missed. Third, in general, the function-value-based epistasis analysis (BFGM, min P-values) had much smaller P-values than the summary statistic-based epistasis analysis (SFGM). Fourth, we observed that the genes interacting with the genes in MAPK signaling pathway were in 147 other pathways, including cytokine-cytokine receptor interaction, Cytosolic DNA-sensing pathway, DNA replicationamong others. Fifth, it was interesting to observe that the interacting genes formed a large connected network with 281 nodes and 317 edges (Fig. 6). We observed hub genes IBA57-AS1 with 67 connections, HIST1H2AD with 21 connections, PRR24 with 18 connections and ARL6IP4 with 14 connections. HIST1H2AD is a core component of nucleosome and plays a central role in transcription regulation. ARL6IP4 functions as a splicing inhibitor [25].

Fig. 6
figure 6

Networks of 317 pairs of genes from 281 genes showing the significant evidence of cis-trans interactions as identified by BFGM

Gene ontology and KEGG pathway enrichment analysis

Gene ontology enrichment analysis was performed on the genes in the identified 162361 pairs of significant cis-trans interactions influencing the transcription to discover overrepresented functional biological groupings with interactions. Our analysis was performed using the biological process, cellular component and molecular function categories of the gene ontology.

Ontology enrichment analysis found that cis-trans interactions were significantly enriched in biological processes (BP) including a single organism process, single organism cellular process, single organism metabolic process and development process (Fig. 7) and molecular functions that were primarily related to catalytic and binding activity with P-values <10−4 (Fig. 8). Ontology enrichment analysis also identified that cis-trans interactions were significantly enriched in the cell, intracellular, organelle, and membrane bounded organelle components (Fig. 9).

Fig. 7
figure 7

Gene ontology (GO) enrichment analysis of cis-trans interactions: enriched in the category of biological processes

Fig. 8
figure 8

Gene ontology (GO) enrichment analysis of cis-trans interactions: enriched in the category of molecular functions

Fig. 9
figure 9

Gene ontology (GO) enrichment analysis of cis-trans interactions: enriched in the category of cellular components

The enrichment analysis was also applied to 228 KEGG pathways to identify the pathways that were enriched with cis-trans interactions. The results were summarized in Fig. 10. The cis-trans interactions were enriched in metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway.

Fig. 10
figure 10

The enrichment analysis of cis-trans interactions in 228 KEGG pathways

Communities in gene interaction networks

We used random walks in igraph [26] to detect 10 communities from the entire gene-gene interaction network. We used R package GOstats [27] to conduct gene set enrichment analysis. We have identified 29 pathways enriched in 10 communities. Figure 11 showed the 6th community with 96 genes and 186 interactions enriched with metabolism (Three of four significantly enriched pathways were metabolism pathways: Glycerolipid metabolism, Nicotinate and nicotinamide metabolism, and Pyrimidine metabolism) where node represents a gene and an edge represents the interaction between the connected gene by the edge. All 10 communities with the enriched pathways (P-value < 0.01) are summarized in Additional file 3: Table S4.

Fig. 11
figure 11

A subnetwork in the 6th community with 96 genes and 186 interactions enriched with metabolism where node represents a gene and an edge represents the interaction between the connected gene by the edge

Discussion

In the past, the statistical epistasis of gene expression is defined as variant–variant interactions that regulate gene expression and its analysis has been mainly designed for microarray gene expression data and common variants. Since the dimension of the data for epistasis analysis of gene expression is very high, all the traditional methods for epistasis analysis of gene expression have the limited application to eQTL data. The whole genome epistasis studies of gene expressions have been very limited. The genetic structure of epistasis of gene expressions has not been fully discovered.

The recently developed next-generation mRNA sequencing (RNA-seq) assay generates dozens or even one hundred million short reads of mRNA and WGS also generates millions of SNPs. As a consequence, these genetic variation and gene expression variation data are so densely distributed across the genome that both genetic variation and expression variation can be modeled as a function of genomic location. The RNA-seq profiles can be taken as a function-valued trait. However, the standard multivariate statistical analysis often fails with functional data. The computational burden and correction for multiple tests seriously damage the feasibility of the variant-variant interaction analysis of extremely high dimensional RNA-seq and WGS genotype data. The variant-variant interaction analysis is not suitable for the epistasis analysis of the function-valued traits with NGS data as genotype data. Although the genetic study of quantitative traits has seen wide application and extensive technical development, the quantitative genetic analysis, particularly epistasis analysis of function-valued trait is comparatively less developed. To our knowledge, no statistical methods have been developed for genetic epistasis analysis of function-valued traits with NGS data. In the past few years we have witnessed the rapid development of novel statistical methods for association studies using NGS data. However, these methods might not be appropriate for genetic epistasis analysis of function-valued trait. The quantitative genetic epistasis analysis of rare variants for function-valued traits remains a huge challenge.

The widely used methods for reducing dimensionality of the RNA-seq data use the Poisson distribution, binomial distribution and negative binomial distribution to summarize the RNA-seq profile into a single number to represent the RNA-seq curve. However, these discrete distributions cannot capture the shape and variation of the RNA-seq curve. To illustrate this we presented Additional file 4: Figure S1A showing the real RNA-seq curve, the data simulated by a negative distribution of the gene LMNB2 and Additional file 4: Figure S1B showing the real RNA-seq curve of the gene LMNB2 and the curve estimated by the FPCA of the RNA-seq data. We observed that the negative distribution failed to capture the variation of the RNA-seq profile, but the FPCA approximated the RNA-seq curve exceedingly well.

Emergence of the NGS techniques demands a paradigm shift in the analytic methods for eQTL epistasis analysis from standard single-variate or multivariate data analysis to functional data analysis. The BFGM with functional response and functional predictors takes a RNA-seq profile as a functional response and genetic variants across the genomic regions as functional predictors, which can be used to test the association of the entire allelic spectrum of the genetic variation with a function-valued trait and has several remarkable features. First, unlike simple and multiple regressions that discard a large amount of information, the BFGM preserves the intrinsic structure and all the positional-level genetic information. Second, the multiple regressions will not account for the space-ordering of the data and correlation information contained in the data. The BFGM simultaneously employs genetic information of the individual variants and correlation information contained in both RNA-seq and SNP data. Third, both the sign and the size of the heterogeneity will also be incorporated into the test in the BFGM. Fourth, the multicolinearity problem in the BFGM is alleviated. Fifth, the BFGM expands both RNA-seq function and genotype function in terms of orthogonal eigenfunctions, which leads to substantial dimension reduction. The BFGM for genetic epistasis analysis of a function-valued trait which captures key information in the data is expected to open a new route for genetic epistasis analysis of RNA-seq and NGS genotype data.

Conclusions

We developed a novel functional regression model with both functional response and functional predictors for detection of epistasis influencing RNA-seq variations in humans, which is referred to as the BFGM. The BFGM takes genes as a basic unit of epistasis analysis and utilizes all information contained in both the RNA-seq and SNP data. By large simulations and real data analysis we demonstrated the merits and limitations of the proposed new paradigm of epistasis analysis for the RNA-seq and WGS data.

The new approach uses all genetic information in the genome regions and expression variation information in the target gene to collectively test the interaction between multiple SNPs within the regions influencing the RNA-seq curves. Therefore, the BFGM for interaction analysis overcomes limitations inherent in pair-wise interaction tests with the summary expression level as a scalar trait. By large simulations and real data analysis, we showed that the proposed BFGM substantially increased the power, dramatically reduced the computational burden and substantially outperformed the traditional variant-variant epistasis analysis of summary statistic measured quantitative traits. In real data analysis, we also clearly demonstrate that pairs of SNPs between two genes jointly have significant interaction effects, but individually each pair of SNPs makes a mild contribution to interaction effects.

The previous interaction analyses have mainly focused on the interactions between common and common variants. The distribution of the common and rare variants causing interactions is unknown. Very few genome-wide interaction analyses with the RNA-seq and WGS data, and very few results of significant interaction between rare and rare variants, and rare and common variants have been reported. We analyzed 350 samples of European origin with both RNA-seq and whole genome sequencing data available. We observed the large proportions of pair-wise interactions between rare and rare variants, and rare and common variants. The significant pair-wise interactions between common and common variants were less observed. The results showed that the number of significant cis-trans interactions identified by the SFGM with RPKM as overall gene expression level only accounted for 0.16% of the significant cis-trans interactions identified by the BFGM with RNA-seq and NGS genotype data. The majority of epistasis analysis for gene expressions used the microarray to measure gene expressions and test interactions for only common variants. Even though the RNA-seq data are available they still converted variation rich RNA-seq data into a single number such as RPKM or other summary statistics. Then, the variant-variant epistasis analysis is conducted on these converted data. That explains why these researches question the universe presence of significant gene-gene interaction influencing gene expressions.

Some researchers suggest that in genome-wide interaction analysis only genes with large or mildly marginal genetic effects should be tested for interaction. However, we observed that the majority of the significantly interacting genes showed no marginal association. These results clearly demonstrated that if we tested interactions for only genes with marginal associations, then many true interactions will be missing.

We are unsure whether interaction is most often presented in isolation, or interacting genes form networks. We identified a large number of cis-trans interactions and observed that the interacting genes formed large connected networks with hub genes presented. We found that some hub genes, for example histone modification genes, can globally regulate gene expressions. Enrichment analysis also showed that metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway among others were enriched with cis-trans interactions.

The results in this paper are preliminary. The confounding factors that cause spurious interactions have not been investigated. The statistical methods for epistasis analysis which remove confounding factors have not been developed. The complete genome-wide epistasis analysis including all cis and trans interactions have not been performed. The purpose of this paper is to stimulate further discussions regarding the great challenges we are facing in the epistasis analysis of high dimensional RNA-seq and WGS data.

Methods

Functional regression with both functional response and functional predictor models for epistasis analysis

For the convenience of discussion, position level read counts are taken as the RNA-seq profile and is referred to as a function-valued trait. Let y i (τ), τΤ τ  = [0, Τ τ ] be the read counts of the i th individual at the genomic position τ. Consider two genomic regions (or genes) [a 1, b 1] and [a 2, b 2]. Let x i (t) and z i (s) be genotypic functions of the i th individual defined in the regions [a 1, b 1] and [a 2, b 2], respectively. Let t and s be a genomic position in the first and second genomic regions, respectively. The genotype functions x i (t) and z i (s) are defined as

$$ {\mathrm{x}}_i\left(\mathrm{t}\right)=\left\{\begin{array}{c}\hfill 2{P}_m\left(\mathrm{t}\right), \kern2.25em \mathrm{MM}\hfill \\ {}\hfill {P}_m\left(\mathrm{t}\right)\hbox{-} {\mathrm{P}}_{\mathrm{M}}\left(\mathrm{t}\right),\kern0.5em \mathrm{Mm}\hfill \\ {}\hfill -2{P}_M\left(\mathrm{t}\right), \kern1.5em \mathrm{mm}\hfill \end{array}\right.,\kern2em {\mathrm{z}}_i\left(\mathrm{s}\right)=\left\{\begin{array}{c}\hfill 2{P}_m\left(\mathrm{s}\right), \kern2.25em \mathrm{MM}\hfill \\ {}\hfill {P}_m\left(\mathrm{s}\right)\kern0.5em \hbox{-} \kern0.5em {\mathrm{P}}_{\mathrm{M}}\left(\mathrm{s}\right),\kern0.5em \mathrm{Mm},\hfill \\ {}\hfill -2{P}_M\left(\mathrm{s}\right), \kern1.5em \mathrm{mm}\hfill \end{array}\right. $$

where M and m are two alleles of the marker at the genomic position t and s, P M (t) and P m (t), and P M(s), P m (s) are the frequencies of the alleles M and m at the genomic positions t and s, respectively. Consider a functional regression model with functional response and functional predictors (BFGM):

$$ {y}_i\left(\tau \right)=\mu \left(\tau \right)+{W}_i^T\omega \left(\tau \right)+{\displaystyle {\int}_T{x}_i(t)\alpha \left( t,\tau \right) dt+{\displaystyle {\int}_S{z}_i(s)\beta \left( s,\tau \right) ds+{\displaystyle {\int}_T{\displaystyle {\int}_S{x}_i(t){z}_i(s)\gamma \left( t, s,\tau \right) ds dt}}}}+{\varepsilon}_i\left(\tau \right) $$
(1)

where μ(τ) is an overall mean function at the genomic position τ, W i is a vector of covariates for i th individual, ω(τ) is a vector of effects associated with the covariates, α(t, τ) is a genetic additive effect function at genomic position t of the first gene and genomic position τ of the RNA-seq profile, β(s, τ) is a genetic additive effect function at genomic positions s of the second gene and the genomic position τ, γ(t, s, τ) is an interaction effect function between two putative quantitative trait loci (QTLs) located at the genomic positions t and s influencing the read counts at the genomic position τ, and ε i (τ) is a residual function of the unexplained effect for the i th individual at the genomic position τ. The interaction function is measured by double integrals of the genotype function in two genes.

Estimation of interaction effect function

We assume that both position level read count function and genotype functions are centered. The genotype functions x i (t) and z i (s) are expanded in terms of the orthonormal basis function as:

$$ {x}_i(t)={\displaystyle \sum_{j=1}^{\infty }{\xi}_{i j}{\phi}_j(t)}\kern1em \mathrm{and}\kern1em {z}_i(s)={\displaystyle \sum_{l=1}^{\infty }{\eta}_{{}_{i l}}{\psi}_{{}_l}(s),} $$
(2)

where \( {\phi}_{{}_j}(t) \) and ψ l (s) are sequences of the orthonormal basis functions. The expansion coefficients ξ ij and η il are estimated by

$$ {\xi}_{i j}={\displaystyle \underset{T}{\int }{x}_{{}_i}(t){\phi}_{{}_j}(t) dt\kern0.75em }\kern0.5em \mathrm{and}\kern1.5em {\eta}_{{}_{i l}}={\displaystyle \underset{S}{\int }{z}_{{}_i}(s){\psi}_{{}_l}(s) ds} $$
(3)

In practice, numerical methods for the integral will be used to calculate the expansion coefficients. Substituting Eq. (2) into Eq. (1), we obtain

$$ {y}_i\left(\tau \right)=\mu \left(\tau \right)+{W}_i^T\omega \left(\tau \right)+{\displaystyle {\sum}_{j=1}^J{\xi}_{i j}}{\alpha}_j\left(\tau \right)+{\displaystyle {\sum}_{l=1}^L{\eta}_{i l}}{\beta}_l\left(\tau \right)+{\displaystyle {\sum}_{j=1}^J{\displaystyle {\sum}_{l=1}^L{\xi}_{i j}}}{\eta}_{i l}{\gamma}_{j l}\left(\tau \right)+{\varepsilon}_i\left(\tau \right), $$
(4)

where

$$ \begin{array}{l}{\alpha}_{{}_j}\left(\tau \right)={\displaystyle {\int}_T\alpha \left( t,\tau \right){\phi}_{{}_j}(t) dt,}\kern0.5em {\beta}_{{}_l}\left(\tau \right)={\displaystyle {\int}_S\beta \left( s,\tau \right){\psi}_l(s) ds}\kern0.5em \mathrm{and}\\ {}\kern0.5em {\gamma}_{{}_{j l}}\left(\tau \right)={\displaystyle {\int}_T{\displaystyle {\int}_S\gamma \left( t, s,\tau \right){\phi}_{{}_j}(t){\psi}_{{}_l}(s) dt ds}}.\end{array} $$

The parameters α j (τ), β l (τ) and γ jl (τ) are referred to as genetic additive effect and additive x additive effect score functions. These score functions can also be viewed as the expansion coefficients of the genetic effect functions with respect to orthonormal basis functions:

$$ \alpha \left( t,\tau \right)={\displaystyle \sum_j{\alpha}_j\left(\tau \right){\phi}_j(t),\beta \left( s,\tau \right)={\displaystyle \sum_l{\beta}_l\left(\tau \right){\psi}_l(s)\ \mathrm{and}\ \gamma \left( s, t\right)={\displaystyle \sum_j{\displaystyle \sum_l{\gamma}_{j l}\left(\tau \right){\phi}_j(s){\psi}_l(t)}}}}. $$

Equation (4) can be written in a vector form:

$$ Y\left(\tau \right)=\mathrm{E}\mu \left(\tau \right)+ W\omega \left(\tau \right)+\xi \alpha \left(\tau \right)+\eta \beta \left(\tau \right)+\varGamma \gamma \left(\tau \right)+\varepsilon \left(\tau \right)\kern0.5em , $$
(5)

where Y(τ), μ(τ), ω(τ), α(τ), β(τ) and γ(τ) are vectors, W, ξ, η and Γ are matrices.

Expanding Y(τ), μ(τ), ω(τ), α(τ), β(τ), γ(τ) and ε(τ) in terms of the orthogonal basis functions yield

$$ \begin{array}{l}{y}_{{}_i}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{y}_{{}_{i k}}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\kern0.5em \mu \left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\mu}_{{}_k}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\kern0.5em {\omega}_{{}_j}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\omega}_{{}_{j k}}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\\ {}\kern1em {\alpha}_{{}_j}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\alpha}_{{}_{j k}}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\kern0.5em {\beta}_{{}_j}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\beta}_{{}_{j k}}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\kern0.5em {\gamma}_{{}_{j l}}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\gamma}_{{}_{j l k}}{\theta}_{{}_k}\left(\tau \right)}\kern0.5em ,\kern1em \mathrm{and}\\ {}\kern1em {\varepsilon}_{{}_i}\left(\tau \right)={\displaystyle {\sum}_{k=1}^K{\varepsilon}_{{}_{i k}}{\theta}_{{}_k}\left(\tau \right)}.\end{array} $$

Define expansion coefficient vectors and matrices as follows.

$$ \begin{array}{l} Y=\left[\begin{array}{ccc}\hfill {y}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {y}_{{}_{1 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {y}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {y}_{{}_{n K}}\hfill \end{array}\right],\kern0.5em \mu ={\left[\begin{array}{c}\hfill {\mu}_{{}_1}\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\mu}_{{}_K}\hfill \end{array}\right]}^{\mathrm{T}},\kern0.5em E=\left[\begin{array}{c}\hfill 1\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill 1\hfill \end{array}\right],\kern1em \omega =\left[\begin{array}{ccc}\hfill {\omega}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{{}_{1 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\omega}_{{}_{d1}}\hfill & \hfill \cdots \hfill & \hfill {\omega}_{{}_{d K}}\hfill \end{array}\right],\kern0.5em \\ {}\alpha =\left[\begin{array}{ccc}\hfill {\alpha}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\alpha}_{{}_{1 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\alpha}_{{}_{J1}}\hfill & \hfill \cdots \hfill & \hfill {\alpha}_{{}_{J K}}\hfill \end{array}\right],\end{array} $$
$$ \beta =\left[\begin{array}{ccc}\hfill {\beta}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\beta}_{{}_{1 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\beta}_{{}_{L1}}\hfill & \hfill \cdots \hfill & \hfill {\beta}_{{}_{L K}}\hfill \end{array}\right],\kern1em \gamma =\left[\begin{array}{ccc}\hfill {\gamma}_{{}_{111}}\hfill & \hfill \cdots \hfill & \hfill {\gamma}_{{}_{11 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\gamma}_{{}_{JL1}}\hfill & \hfill \cdots \hfill & \hfill {\gamma}_{{}_{JL K}}\hfill \end{array}\right]\kern1em \mathrm{and}\kern0.5em \varepsilon =\left[\begin{array}{ccc}\hfill {\varepsilon}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\varepsilon}_{{}_{1 K}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\varepsilon}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {\varepsilon}_{{}_{n K}}\hfill \end{array}\right]. $$

Thus, substituting the above expansion into Eq. (5) gives

$$ Y\theta \left(\tau \right)=\mu \theta \left(\tau \right)+ W\omega \theta \left(\tau \right)+\xi \alpha \theta \left(\tau \right)+\eta \beta \theta \left(\tau \right)+\varGamma \gamma \theta \left(\tau \right)+\varepsilon \theta \left(\tau \right), $$
(6)

where

$$ \begin{array}{c} W=\left[\begin{array}{ccc}\hfill {W}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {W}_{{}_{1 d}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {W}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {W}_{{}_{n d}}\hfill \end{array}\right],\kern1.5em \xi =\left[\begin{array}{ccc}\hfill {\xi}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\xi}_{{}_{1 J}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\xi}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {\xi}_{{}_{n J}}\hfill \end{array}\right],\kern1em \eta =\left[\begin{array}{ccc}\hfill {\eta}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\eta}_{{}_{1 L}}\hfill \\ {}\hfill \vdots \hfill & \hfill \ddots \hfill & \hfill \vdots \hfill \\ {}\hfill {\eta}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {\eta}_{{}_{n L}}\hfill \end{array}\right]\kern1em \mathrm{and}\\ {}\\ {}\varGamma =\left[\begin{array}{c}\hfill {\upxi}_1^T\otimes {\upeta}_1^T\hfill \\ {}\hfill \vdots \hfill \\ {}\hfill {\upxi}_n^T\otimes {\upeta}_n^T\hfill \end{array}\right]=\left[\begin{array}{ccc}\hfill \begin{array}{ccc}\hfill {\upxi}_{{}_{11}}{\upeta}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\upxi}_{{}_{11}}{\upeta}_{{}_{1 L}}\hfill \end{array}\hfill & \hfill \cdots \hfill & \hfill \begin{array}{ccc}\hfill {\upxi}_{{}_{1 J}}{\upeta}_{{}_{11}}\hfill & \hfill \cdots \hfill & \hfill {\upxi}_{{}_{1 J}}{\upeta}_{{}_{1 L}}\hfill \end{array}\hfill \\ {}\hfill \begin{array}{ccc}\hfill \cdots \hfill & \hfill \cdots \hfill & \hfill \cdots \hfill \end{array}\hfill & \hfill \cdots \hfill & \hfill \begin{array}{ccc}\hfill \cdots \hfill & \hfill \cdots \hfill & \hfill \cdots \hfill \end{array}\hfill \\ {}\hfill \begin{array}{ccc}\hfill {\upxi}_{{}_{n1}}{\upeta}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {\upxi}_{{}_{n1}}{\upeta}_{{}_{n L}}\hfill \end{array}\hfill & \hfill \cdots \hfill & \hfill \begin{array}{ccc}\hfill {\upxi}_{{}_{n J}}{\upeta}_{{}_{n1}}\hfill & \hfill \cdots \hfill & \hfill {\upxi}_{{}_{n J}}{\upeta}_{{}_{n L}}\hfill \end{array}\hfill \end{array}\right].\end{array} $$

Since Eq. (6) holds for every genomic position τ, the coefficients on both sides of Eq. (6) should be equal. Therefore, the functional regression model (6) can be further transformed to standard multivariate multiple regression:

$$ Y= E\mu + W\omega +\xi \alpha +\eta \beta +\varGamma \gamma +\varepsilon . $$
(7)

Let

$$ A=\left[\begin{array}{cccc}\hfill \begin{array}{cc}\hfill E\hfill & \hfill W\hfill \end{array}\hfill & \hfill \xi \hfill & \hfill \eta \hfill & \hfill \varGamma \hfill \end{array}\right]\kern0.5em \mathrm{and}\kern0.5em b=\left[\begin{array}{c}\hfill \begin{array}{c}\hfill \mu \hfill \\ {}\hfill \omega \hfill \end{array}\hfill \\ {}\hfill \alpha \hfill \\ {}\hfill \beta \hfill \\ {}\hfill \gamma \hfill \end{array}\right]. $$

Equation (7) can be rewritten as

$$ Y= Ab+\varepsilon . $$
(8)

The standard least square estimators of b is

$$ \widehat{b}={\left({A}^T A\right)}^{-1}{A}^T Y. $$
(9)

The covariance matrix Σ is estimated by

$$ \widehat{\varSigma}=\frac{{\left( Y- A\widehat{b}\right)}^T\left( Y- A\widehat{b}\right)}{\left( n-\left(1+ d+ J+ L+ J L\right)\right)\mathrm{K}}. $$
(10)

Test statistic

An essential problem in genetic epistasis analysis of the function-valued traits is to test the interaction between two genomic regions (or genes). Formally, we investigate the problem of testing the following hypothesis:

$$ \gamma \left( t, s,\tau \right)=0,\forall t\in \left[{a}_1,{b}_1\right], s\in \left[{a}_2,{b}_2\right],\tau \in \left[0,{T}_{\tau}\right], $$

which is equivalent to testing the hypothesis:

$$ {H}_0:\gamma =0. $$
(11)

Let vec denote the vector operation. To develop test statistics, we begin with calculating the covariance matrix of the \( v e c\left(\widehat{b}\right) \). We assume that

$$ \operatorname{var}\left( vec\left(\varepsilon \right)\right)=\varSigma \otimes {I}_n. $$
(12)

Recall that

$$ v e c\left(\widehat{b}\right)=\left[{I}_K\otimes {\left({A}^T A\right)}^{-1}{A}^T\right] v e c(Y). $$

Therefore, we have

$$ \begin{array}{l}\operatorname{var}\left( vec\left(\widehat{b}\right)\right)=\left[{I}_K\otimes {\left({A}^T A\right)}^{-1}{A}^T\right]\left(\varSigma \otimes {I}_n\right)\left[{I}_K\otimes A{\left({A}^T A\right)}^{-1}\right]\\ {}\kern7.5em =\varSigma \otimes {\left({A}^T A\right)}^{-1}\end{array} $$
(13)

Let Λ be a matrix consisting of the last JLK columns and JLK rows of the covariance matrix \( \operatorname{var}\left( vec\left(\widehat{b}\right)\right) \) and \( \widehat{\gamma} \) be the estimators of interaction which can be obtained by extracting the last JL rows of the estimators of the matrix \( \widehat{b} \). Define the test statistic for testing the interaction between two genomic regions [a 1, b 1] and [a 2, b 2] as

$$ {T}_I= v e c{\left(\widehat{\gamma}\right)}^T{\varLambda}^{-1} v e c\left(\widehat{\gamma}\right). $$
(14)

Then, under the null hypothesis H 0 : γ = 0, T I is asymptotically distributed as a central χ 2(JLK) with degrees of freedom JLK or the rank of the matrix Λ.

Null distribution of test statistics

To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We calculated the type I error under three models. We first assumed the model with no marginal effects:

Model 1 (no marginal effect):

$$ {y}_i\left(\tau \right)=\mu \left(\tau \right)+{\varepsilon}_i\left(\tau \right), $$
(15)

where μ(τ) is the overall mean at the genomic position τ, y i (τ) is the normalized number of reads at the genomic position τ of the i th individual and ε i (τ) is an error stochastic process. The errors should be correlated stochastic process. The theoretic models for the errors are unclear. They were estimated from the data. The procedures for generating mean μ(τ) and errors ε i (τ) consisted of the following steps.

Step 1: We randomly sampled 100 genes from the whole real RNA-seq dataset. Let k index genes, j index the genomic positions and i index the samples. Assume that the gene k is located in the interval [a k , b k ]. Let x ikj  , (i = 1, …, n, k = 1, …, 100, j = 1, …, s k ) be the observed count of reads of the gene k in the genomic position j of the i th individual where the length of gene k is denoted s k . For each genomic position, we define an n dimensional vector:

x kj  = [x 1kj , …, x nkj ]T.

Step 2. Let m be the median length of 100 genes. In our dataset, m = 2, 456.

Step 3. Re-map the original RNA-seq data of 100 genes to the interval [0, 1] using transformation \( \frac{j-{a}_k}{b_k-{a}_k} \). Then, estimate the count of reads on position \( 0,\frac{1}{m},\frac{2}{m},\dots, 1 \) from the original RNA-seq data of the 100 genes using local polynomial regression (LOESS). The estimated count of reads of the gene k in the genomic position j of the i th individual at the equally distributed new positions \( 0,\frac{1}{m},\frac{2}{m},\dots, 1 \) are denoted by y ikj . Define vector

$$ {y}_{kj}={\left[{y}_{1 kj},\dots, {y}_{nkj}\right]}^T, k=1,\dots, 100, j=1,\dots, m. $$

Step 4. Compute the means of the re-mapped the RNA-seq data over 100 genes and over n samples: \( {\overline{y}}_{ij}=\frac{1}{100}{\displaystyle {\sum}_{k=1}^{100}{y}_{ikj}}, i=1,\dots, n, j=1,\dots, m \) and \( {\overline{y}}_j=\frac{1}{100\times n}{\displaystyle {\sum}_{i=1}^n{\displaystyle {\sum}_{k=1}^{100}{y}_{i kj}}}, j=1,\dots, m \).

Define the mean vector of the re-mapped counts of reads: \( \overline{y}={\left[{\overline{y}}_1,\dots, {\overline{y}}_m\right]}^T \).

Step 5. Compute the mean function μ(τ). Pooling all the re-mapped data:

$$ Y=\left[\begin{array}{ccc}\hfill {\overline{y}}_{11}\hfill & \hfill \cdots \hfill & \hfill {\overline{y}}_{1 m}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {\overline{y}}_{n1}\hfill & \hfill \vdots \hfill & \hfill {\overline{y}}_{n m}\hfill \end{array}\right]. $$

Use the pooled data to perform FPCA, which leads to functional principal component expansion:

$$ {y}_i\left(\tau \right)={\displaystyle {\sum}_{l=1}^L{\xi}_{{}_{i l}}{\beta}_{{}_l}\left(\tau \right)}, i=1,\dots, n, $$

where β l (τ) are the functional principal components.

Calculate \( {\overline{\xi}}_l=\frac{1}{n}{\displaystyle {\sum}_{i=1}^n{\xi}_{i l}}, l=1,\dots, L \). Using the averaged functional principal component score, we compute the mean μ(τ) as follows:

$$ \mu \left(\tau \right)={\displaystyle {\sum}_{l=1}^L{\overline{\xi}}_l}{\beta}_l\left(\tau \right). $$

Step 6. Define the centralized RNA-seq data matrix:

$$ Z=\left[\begin{array}{ccc}\hfill {z}_{11}\hfill & \hfill \cdots \hfill & \hfill {z}_{1 m}\hfill \\ {}\hfill \vdots \hfill & \hfill \vdots \hfill & \hfill \vdots \hfill \\ {}\hfill {z}_{n1}\hfill & \hfill \cdots \hfill & \hfill {z}_{n m}\hfill \end{array}\right], $$

where \( {Z}_{ij}={\overline{y}}_{ij}-{\overline{y}}_j, i=1,\dots, n, j=1,\dots, m \).

We perform FPCA on the centralized dataset Z where means of the RNA-seq data at each genomic position over 100 genes is removed and obtain a set of functional principal components (eigenfunctions) {φ 1(τ), …, φ T (τ)} and functional principal component scores η it , i = 1, …, n, t = 1, …, T. Define T random variables η = [η 1, …, η T ] with vectors of their sampling values:η t  = [η 1t , …, η nt ]T, t = 1, …, T.

We then calculate the sampling covariance matrix \( \widehat{\varSigma}=\operatorname{cov}\left(\eta, \eta \right) \).Assume that the scores of the residuals follow a multivariate normal distribution \( N\left(0,\widehat{\varSigma}\right) \). Using the normal random variables to generate an n sample of vectors ε i  = [ε i1, …, ε iT ]. The residuals ε i (τ) will be defined as

$$ {\varepsilon}_i\left(\tau \right)={\displaystyle {\sum}_{t=1}^T{\varepsilon}_{i t}{\phi}_t\left(\tau \right), i=1,\dots, n}. $$

Model 2 (a marginal effect at the first gene):

$$ {y}_i\left(\tau \right)=\mu \left(\tau \right)+{\displaystyle {\sum}_{j=1}^J{x}_{i j}{\alpha}_j\left(\tau \right)+{\varepsilon}_i\left(\tau \right)}, $$
(16)

where μ(τ) is the overall mean at the genomic position τ, ε i (τ) is an error stochastic process, x ij is an indicator variable for the genotype of i th individual at the j th SNP of the first gene, y i (τ) is defined as that in model 1. The coefficient α j (τ) = r j α(τ), where r j is randomly selected from 0.5 to 1.5, is the additive effect function of the j th SNP of the first gene, μ(τ) is obtained by randomly sampling 100 genes from the real RNA-seq and WGS genotype data without interactions and α(τ) is obtained by randomly sampling 100 genes from the real RNA-seq and WGS genotype data under the condition that one gene have significant main effect, the other gene do not have significant main effect, and these gene pairs are not in the list of significantly interacted gene pairs in our results. The overall mean μ(τ), effect function α(τ) and the residuals ε i (τ) were similarly simulated as that in Model 1.

Model 3 (marginal effects at both the first and the second genes):

$$ {y}_i\left(\tau \right)=\mu (t)+{\displaystyle {\sum}_{j=1}^J{x}_{i j}{\alpha}_j\left(\tau \right)+{\displaystyle {\sum}_{k=1}^K{z}_{i k}{\beta}_k\left(\tau \right)+}{\varepsilon}_i\left(\tau \right)}, $$
(17)

where z ik is an indicator variable for the genotype of i th individual at the k th SNP of the second gene. The genetic additive effect function β k (τ l ) is assumed to be equal to β k (τ) = s k β(τ), where s k is randomly selected from 0.5 to 1.5, other parameters are defined in Model 2. A total of 100 pairs of genes were randomly selected under the condition that both genes have significant main effect and these gene pairs are not in the list of significant interacted gene pairs in our results. The overall mean function μ(τ), main effect functions α(τ) and β(τ), and residual term ε i (τ) were similarly generated as that in Models 1 and 2.

Power evaluation

To evaluate the performance of the functional regression model for testing epistatic effects on gene expression, we estimated the power through simulations. We assumed that there were k 1 SNPs in the first gene and k 2 SNPs in the second gene. Thus, there were totally k 1 k 2 SNP pairs between these two genomic regions. For the h th pair of SNPs, let \( {Q}_{h_1} \) and \( {q}_{h_1} \) be two alleles at the SNP in the first gene, \( {Q}_{h_2} \) and \( {q}_{h_2} \) be two alleles at the SNP in the second gene. Let u h ijkl denote her/his genotypes of the h th pair of SNPs, where \( i j\in {Q}_{h_1}{Q}_{h_1},{Q}_{h_1}{q}_{h_1},{q}_{h_1}{q}_{h_1} \) and \( kl\in {Q}_{h_2}{Q}_{h_2},{Q}_{h_2}{q}_{h_2},{q}_{h_2}{q}_{h_2} \). Let \( {g}_{u_{ijkl}}^h\left(\tau \right) \) denote her/his genotypic value in the h th pair of SNPs at genomic position τ influencing gene expressions. Then we can use the following multiple regression model to generate the function-valued trait (RNA-seq) of the u th individual of the h th pair of SNPs at the genomic position τ.

$$ {y}_u\left(\tau \right)={\displaystyle {\sum}_{h=1}^{k_1{k}_2}{g}_{u_{ijkl}}^h}\left(\tau \right)+{\varepsilon}_u\left(\tau \right), u=1,\dots, n, $$
(18)

\( {g}_{u_{ijkl}}^h\left(\tau \right)={\lambda}_{u_{ijkl}}^h g\left(\tau \right) \), \( {\lambda}_{u_{ijkl}}^h \) is a risk parameter which is determined by the gene interaction model (Additional file 5: Table S2), the risk parameter r varies from 0 to 1, and g(τ) is a common genotype coefficient function fitted by the real RNA-seq data and ε u (τ) is the error stochastic process and estimated from the null model as that in null distribution in the test statistics section.

Abbreviations

BFGM:

FRGM with Functional Response and Functional Predictors

FRGM:

Functional regression model

NGS:

Next generation sequencing

PCA:

Principal component analysis

SFGM:

FRGM with scalar response and functional predictors

WGS:

Whole genome sequencing

References

  1. Fisher RA. The correlation between relatives on the supposition of mendelian inheritance. Trans Roy Soc Edinb. 1918;52:399–433.

    Article  Google Scholar 

  2. Lehner B. Molecular mechanisms of epistasis within and between genes. Trends Genet. 2011;27:323–31.

    Article  CAS  PubMed  Google Scholar 

  3. Phillips PC. The language of gene interaction. Genetics. 1998;149:1167–71.

    CAS  PubMed  PubMed Central  Google Scholar 

  4. Phillips PC. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet. 2008;9:855–67.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Hemani G, Shakhbazov K, Westra H-J, Esko T, Henders AK, McRae AF, Yang J, Gibson G, Martin NG, Metspalu A. Detection and replication of epistasis influencing transcription in humans. Nature. 2014;508:249–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Huang Y, Wuchty S, Przytycka TM. eQTL epistasis–challenges and computational approaches. Front Genet. 2013;4:51.

    PubMed  PubMed Central  Google Scholar 

  7. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci. 2012;109:1193–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Kang H, Yang X, Chen R, Zhang B, Corona E, Schadt E, Butte A. Integration of disease-specific single nucleotide polymorphisms, expression quantitative trait loci and coexpression networks reveal novel candidate genes for type 2 diabetes. Diabetologia. 2012;55:2205–13.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Shang J, Zhang J, Sun Y, Liu D, Ye D, Yin Y. Performance analysis of novel methods for detecting epistasis. BMC Bioinf. 2011;12:1.

    Article  Google Scholar 

  10. Kang M, Zhang C, Chun H-W, Ding C, Liu C, Gao J. eQTL epistasis: detecting epistatic effects and inferring hierarchical relationships of genes in biological pathways. Bioinformatics. 2015;31:656–64.

    Article  CAS  PubMed  Google Scholar 

  11. Lappalainen T, Montgomery SB, Nica AC, Dermitzakis ET. Epistatic selection between coding and regulatory variation in human evolution and disease. Am J Hum Genet. 2011;89:459–63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Sun X, Lu Q, Mukherjee S, Crane PK, Elston R, Ritchie MD. Analysis pipeline for the epistasis search–statistical versus biological filtering. Front Genet. 2014;5:106.

    PubMed  PubMed Central  Google Scholar 

  13. Lee J, Ji Y, Liang S, Cai G, Müller P. On differential gene expression using RNA-seq data. Cancer Informat. 2011;10:205–15.

    Google Scholar 

  14. Li JJ, Jiang C-R, Brown JB, Huang H, Bickel PJ. Sparse linear modeling of next-generation mRNA sequencing (RNA-seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci. 2011;108:19867–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011;12:671–82.

    Article  CAS  PubMed  Google Scholar 

  16. Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Finotello F, Di Camillo B. Measuring differential gene expression with RNA-seq: challenges and strategies for data analysis. Brief Funct Genomics. 2015;14:130–42.

    Article  PubMed  Google Scholar 

  18. Gosik K, Kong L, Chinchilli VM, Wu R. iFORM/eQTL: an ultrahigh-dimensional platform for inferring the global genetic architecture of gene transcripts. Brief Bioinform. 2017;18(2):250–9.

    PubMed  Google Scholar 

  19. Zhang F, Boerwinkle E, Xiong M. Epistasis analysis for quantitative traits by functional regression model. Genome Res. 2014;24:989–98.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Zhang F, Xie D, Liang M, Xiong M. Functional Regression Models for Epistasis Analysis of Multiple Quantitative Traits. PLoS Genet. 2016;12:e1005965.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Lappalainen T, Sammeth M, Friedländer MR, AC‘t Hoen P, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–11.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Graffelman J, Moreno V. The mid p-value in exact tests for Hardy-Weinberg equilibrium. Stat Appl Genet Mol Biol. 2013;12:433–48.

    PubMed  Google Scholar 

  23. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:1.

    Article  Google Scholar 

  24. Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav Genet. 2009;39:580–95.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Li Q, Zhao H, Jiang L, Che Y, Dong C, Wang L, Wang J, Liu L. An SR-protein induced by HSVI binding to cells functioning as a splicing inhibitor of viral pre-mRNA. J Mol Biol. 2002;316:887–94.

    Article  CAS  PubMed  Google Scholar 

  26. Csardi G, Nepusz T. The igraph software package for complex network research. Inter J Complex Systems. 2006;1695(5):1–9.

    Google Scholar 

  27. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics. 2007;23(2):257–8.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgement

The authors express appreciation to the computation sources from the Texas Advanced Computing Center and the High-End Computing Center at Fudan University.

Funding

Mr. Xiong was supported by Grant 1R01AR057120–01 and 1R01HL106034-01, from the National Institutes of Health and NHLBI. Ms. Xu and Mr. Jin were supported by research grants from the National Natural Science Foundation of China (31330038, 31521003), National Basic Research Program (2014CB541801), and 111 Project (B13016). The funders had no role in the design of the study, data collection, analysis, or manuscript preparation.

Availability of data and materials

The R package FRGM_1.0.tar.gz is available from https://sph.uth.edu/research/centers/hgc/xiong/software.htm. All results represented in this paper could be reproduced by the examples and programs in this package. SNP data supporting the conclusions of this article are available from FTP site hosted at the EBI, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/, and RNA-seq data are available in Geuvadis RNA sequencing project of 1000 Genomes samples under accession E-GEUV-3, http://www.ebi.ac.uk/arrayexpress/files/E-GEUV-3/processed/.

Authors’ contributions

KLX, JL and MMX designed this study, developed the statistical methods and wrote the manuscript. KLX wrote the R package and performed data analysis. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

This study is a retrospective analysis of the existing data. SNP data are downloaded from FTP site hosted at the EBI, and RNA-seq data are downloaded from EBI ArrayExpress.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Momiao Xiong.

Additional files

Additional file 1: Table S1.

Summary statistics of genetic variants in the five genes. (XLS 25 kb)

Additional file 2: Table S3.

A list of significant cis-trans interactions in the MAPK pathway. (XLS 114 kb)

Additional file 3: Table S4.

Ten communities and their significantly enrished pathways. (XLSX 12 kb)

Additional file 4: Figure S1.

A. The real RNA-seq curve of the gene LMNB2 and the data simulated by the negative distribution of the gene LMNB2. B. The real RNA-seq curve of the gene LMNB2 and the curve estimated by the FPCA of the RNA-seq data. (ZIP 120 kb)

Additional file 5: Table S2.

The value of risk parameter in different interaction models. (XLS 30 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Xu, K., Jin, L. & Xiong, M. Functional regression method for whole genome eQTL epistasis analysis with sequencing data. BMC Genomics 18, 385 (2017). https://doi.org/10.1186/s12864-017-3777-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12864-017-3777-4

Keywords