Functional regression method for whole genome eQTL epistasis analysis with sequencing data

Background Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. Methods We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions. Results By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples. Conclusions The proposed FRGM for epistasis analysis of RNA-seq can capture isoform and position-level information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice of the epistatic analysis with sequencing data. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3777-4) contains supplementary material, which is available to authorized users.


Background
Epistatic effect in gene expression, defined as the departure from additive effects in a linear model of eQTL analysis [1], plays an essential role in understanding the gene regulation and disease mechanisms [2][3][4]. One polymorphism's effect on expression of a gene depends on other polymorphisms present in the genome [5]. Epistasis analysis of gene expressions will substantially improve the understanding of the genetic architecture of gene expression and facilitate mechanistic insights into complex traits [6,7]. However, eQTL epistasis analysis remains fundamentally unexplored due to large computational challenges and data availability [6].
Gene expression is an intermediate phenotype that bridges the genotype and higher level phenotypes such as diseases [8,9]. Studying the effect of epistasis on the gene expression could provide a better understanding of the genetic architecture and gene regulation. The importance of detecting the epistatic effect on the gene expression has been emphasized in many recent studies [10,11]. However, the corresponding methods are relatively rare. The widely used statistical methods for identifying eQTL epistasis are designed for microarray expression data where an overall expression of the gene is taken as a quantitative trait and all methods for QTL epistasis analysis can be used for eQTL epistasis analysis [10,12].
Application of next generation sequencing (NGS) techniques to the genetic analysis of gene expression involves (1) generating millions of short reads of mRNA or cDNA which are mapped to the genome and lead to a sequence of read counts at the hundreds of millions of genomic positions [13][14][15][16] and (2) generating millions or 10 millions of genetic variants. RNA-seq counts vary greatly across the gene [17]. Count variations can be due to experimental bias such as fragmentation methods, reverse-transcription [16], sequence-specific bias and sequencing technology variation [18]. However, count variation can also be caused by variation in splicing, transcription start sites, polyadenylation sites, posttranscriptional RNA editing across the entire gene, and transcription rates of the cells [13][14][15][16]18]. RNA-seq data can be viewed as a function or a curve of the genomic position and hence can be taken as a functionvalued trait.
Although RNA-seq data are measured as a function, the widely used methods for genetic studies of the RNAseq in humans are the same as that for the traditional single-valued quantitative traits where a single number for overall expression of the gene is taken as a quantitative trait. These methods use summary statistics to measure or represent gene expressions assayed by NGS techniques and cannot capture the expression variations across the gene due to splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the various cells. The summary statistic-based epistasis analysis of the RNA-seq fails to utilize all transcripts information.
The critical barrier in epistasis analysis is to deal with rare variants. The traditional statistical methods for epistasis analysis were originally designed for testing the interaction between common variants and are difficult to apply to rare variants due to high type 1 error rates, severe multiple testing, prohibitive computational time and low power [19]. Whole genome RNA-req eQTL analysis poses a significant challenge. To meet the challenge, we developed a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data, which allows simultaneous capture of all space information hidden in the RNA-seq data and genetic variation data, but with substantially reduced dimensions. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and uses all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions (or genes). The proposed FRGM for epistasis analysis of the RNA-seq can capture isoform and position-level information and will have a broad application.
The FRGM for epistasis analysis has several remarkable features. First, the FRGM accounts for the change in the position-level read counts, while preserving the intrinsic structure and all the positional-level genetic information. Second, the FRGM simultaneously utilizes both correlation information among the RNA-seq at different genomic positions and among all variants in a genomic region. Third, the multicollinearity problems in the FRGM which may be presented in both the RNAseq and genetic variation are alleviated. Fourth, the FRGM expands both position-level read count function and genotype function in terms of orthogonal eigenfunction, which leads to substantial dimension reduction in both RNA-seq data and SNP data. The FRGM for epistasis analysis of function-valued traits which capture key information in the data is expected to open a new route for epistasis analysis of RNA-seq data.
To evaluate its performance for epistasis analysis of the RNA-seq, we use large scale simulations to calculate the type I error rates and evaluate the power of the proposed FRGM for detecting epistasis. To further evaluate its performance, the FRGM for epistasis analysis is applied to 350 samples with both RNA-seq and NGS data from the 1000 Genomes Project. An R packge for implementing the developed FRGM for epistasis analysis of RNA-seq and NGS data can be downloaded from our website https:// sph.uth.edu/research/centers/hgc/xiong/software.htm.

Null distribution of test statistics
To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We consider three models for type 1 error rate simulations: model 1 with no marginal effects, model 2 with marginal effects at the first gene and model 3 with marginal effects at both the first and second genes.
We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 from the next generation sequencing data in the 1000 Genomes Project. The summary statistics of the variants in five genes are summarized in Additional file 1: Table S1. The marginal genetic effects will be estimated from the data. 100 genes with RNA-seq data were randomly selected from GEUVADIS project. They were used to develop the models for generating RNA-seq data in simulation (Detailed description were referred to Method Section).10 pairs of genes were selected from five genes : IRAK3, ACSS3, SUV420H1, ETV7, and HPS4 with genotype data from 1000 Genome Project dataset.
The number of sampled individuals from the population ranged from 1000 to 5,000, and 5,000 simulations were repeated. We randomly selected 10% of the SNPs as causal variants from five genes: IRAK3, ACSS3, SUV420H1, ETV7, and HPS4. We perfume gene-gene interaction tests for 10 pairs of genes selected from five genes with genotypes under the three models for 5000 times. The type 1 error rates were averaged over 10 pairs of genes with genotype data and 5,000 simulations for each model. Tables 1, 2 and 3 summarized the type I error rates of the test statistics for testing the interaction between two genes with no marginal effect, marginal effect at the first gene and marginal effects at both genes consisting only of rare variants and both common and rare variants, respectively, averaged over 100 genes with RNA-seq data and 10 pairs of genes with genotype data at the nominal levels α = 0.05, α = 0.01 and α = 0.001. These results clearly showed that the type I error rates of the FRGM-based test statistics for testing interaction between two genes with or without marginal effects were not appreciably different from the nominal α levels.

Power evaluation
To evaluate the performance of the functional regression model for testing the epistatic effect on gene expression, we estimated the power through simulations. We generated 100,000 chromosomes by resampling from the 350 European samples with genetic variants in two genes: IRAK3 and ACSS3 from the next generation sequencing data in 1000 Genomes Project. We randomly selected 20% variants as causal variants, assumed that there were k 1 SNPs in the first gene, and k 2 SNPs in the second gene. Two thousand individuals were sampled. We assumed that both marginal effects and epistasis effects were a function of the genomic position and used the multiple regression models to generate the RNA-seq data under four interaction models: Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold (See the Methods section).
We compared the power of the FRGM with both functional response and functional predictors (BFGM), FRGM with scalar response and functional predictors (SFGM) and regression on principal component analysis (PCA). For the PCA method, the PCA was performed on the RNA-seq data and the number of PCs were   selected to explain 80% variance of number of reads at different genomic positions. The multiple functional regression was performed to analyze the data [20]. In the BFGM, both RNA-seq and genotype profiles were taken as a function of genomic position and expanded in terms of functional principal components. Figure 1a-d plotted the power curves of three statistics: BFGM, SFGM and PCA to test the interaction between two genes with rare variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. In the simulation, 20% of the rare variants were randomly selected as the causal variants. These power curves were a function of the risk parameter at the significance level α = 0.05. We observed that under all four interaction models the BFGM had the highest power, followed by the regression on PCA. Power of the SFGM was the lowest. The results demonstrated that summary statistics such as RPKM for measuring gene expression could not capture the expression variations across the gene and almost had no power to detect the interaction between two genes with rare variants. The BFGM can also be applied to the presence of both common and rare variants. Figure 2a-d plotted the power curves of three statistics for testing interaction between two genes with both common and rare variants where 10% of the common variants and 10% of the rare variants were chosen as causal variants under the Dominant OR Dominant, Dominant AND Dominant, Recessive OR Recessive and Threshold models, respectively. The power patterns of tests for the interactions between two genes with both common and rare variants were similar to that with rare variants only. The BFGM had the highest power, followed by the PCA and the SFGM. However, we noticed that the power of the SFGM for epistasis analysis in the presence of common variants increased substantially. Under some models such as the Dominant OR Dominant model, the SFGM would have enough power to detect interactions between two genes with common variants.

RNA-seq data and NGS data
The BFGM was applied to the RNA-seq data in the GEUVADIS RNA Sequencing Project [21] and the WGS data in the1000 Genomes Project. A total of 350 samples Fig. 2 a. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with the RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant OR Dominant model, assuming sample sizes of 2,000. b. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Dominant AND Dominantmodel, assuming sample sizes of 2,000. c. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Recessive OR Recessive model, assuming sample sizes of 2,000. d. Power curves of three statistics: the BFGM, regression on PCA, SFGM, for testing interaction between two genomic regions that consist of both common and rare variants with RNA-seq trait as a function of the relative risk parameter r at the significance level α = 0.05 under the Threshold model, assuming sample sizes of 2,000 with European origin was shared between the GEUVA-DIS RNA Sequencing Project and 1000 Genomes Project, which had combined transcriptome (22,706 gene expressions measured by RNA-seq) and genome sequencing data (2,708,453 SNPs in 24,519 genes). After removing singleton SNPs, repeated SNPs, and filtering out the SNPs violating HW equilibrium [22] (P value < 10 −9 for declaring HW disequilibrium), 2,566,261 SNPs in 18,986 genes were included in the epistasis analysis. In the RNA-seq data pre-processing, we removed the genes whose expressing rates were less than 30% and the genes that did not contain any SNPs. Finally, RNA-seq data of the 15,656 genes were included in the analysis. We used DESeq [23] to normalize the RNA-seq data.

Cis-trans interactions
We considered the RNA-seq curve of the target gene as a function-valued trait. The target gene selected from the 15656 gene expressions was referred to as gene 1. We selected one of the remaining 18985 genotyping genes as gene 2. We used BFGM to test for the interactions between gene 1 and gene 2 influencing the expression of the target gene 1. The total number of gene pairs tested for interactions which included both common and rare variants was 297,229,160. A P-value for declaring significant interaction after applying the Bonferroni correction for multiple tests was 1.68 × 10 −10 . To examine the behavior of the BFGM, we plotted the QQ plot of the test (Fig. 3). QQ plot showed that the false positive rate of the BFGM for detection of epistasis was controlled.
For comparisons, the SFGM was also applied to the dataset. RPKM and DESeq were used to compute the overall expression value of genes from the RNA-seq data. All the expression values were processed by the rank-based inverse normal transformation [24]. For both common and rare variants, in total, 162361, 260 and 51 significant cis-trans interactions regulating the gene expressions were identified by the BFGM, SFGM with the RPKM and DESeq, respectively. We observed 9,846genes whose expressions were influenced by 16,2361cis-trans interactions. We found that the average number of epistasis influencing each gene was 16. A total of 3,505 gene expressions were influenced by one significant cis-trans gene-gene interactions, 169 gene expressions were influenced by more than 100 cis-trans gene-gene interactions. Figure 4 presented a histogram that showed a distribution of the cis-trans gene-gene interactions.
The P-values of the top 20 interactions between genes ranked by the BFGM method were summarized in Table 4 where P-values for testing interactions between genes by the SFGM (RPKM, DESeq and RNAmin) and min P-values were also listed. The RNAmin denoted the minimum of P-values computed by the SFGM method with the number of reads at each genome position of the gene as the scalar response in the functional regression model. The min P-values denoted to take the minimum of all P-values for testing all possible pairs of SNPs between two genes using functional regression model with functional response and scalar predictors. Table 4 showed several remarkable features. First, we often observed the pair-wise interaction between rare and rare variants (34.38%), and rare and common variants (59.38%). Less observed was the significant pair-wise interaction between common and common variants (6.25%). Second, significant interactions between two  genes often indicated that at least one significant pair of SNPs in two genes could be observed (min P-values were small). However, we can observe that pairs of SNPs between two genes jointly had significant interaction effects, but individually each pair of SNPs mildly contributed to the interaction effects. Third, the BFGM often had a much smaller P-value to detect interaction than other tests. Fourth, we observed that genes may not show even mild marginal association, but they did demonstrate significant evidence of interaction. If only the interactions between two marginally significant genes are tested, some significant interactions may be missed. The fifth, the BFGM tremendously reduced computation burden.
To further assess the validity of the BFGM for epistasis analysis with RNA-seq data, we randomly selected six pairs of genes from the significant 162361 gene-gene interactions. The P-values for testing the interactions of six pairs of genes using the BFGM and SFGM were summarized in Table 5. Table 5 showed that six significant interactions identified by the BFGM significantly influenced read count variation at least at one genomic position within the gene. To explain why the BFGM had higher power to detect interaction than the SFGM, we presented Fig. 5a-f showing the RNA-seq profiles and overall expression level of the genes PLA2G4A, PLA2G6, PLAUR, PLD4, PLD6 and PLEKHA3 of two individuals, respectively. These figures showed that the overall expression levels of the individuals were the same, but their RNA-seq profiles were quite different. This demonstrated that unlike the RNA-seq profiles, the overall expression levels cannot capture the expression variation across the genes. Therefore, the SFGM using summary statistics as a trait will have less power to detect the interaction than the BFGM using the RNA-seq profiles as a function-valued trait.
To investigate whether the top 20 interactions were caused by the linkage disequilibrium (LD) or not, we listed the maximum of r 2 between all possible SNPs in the top 20 significantly interacting pairs of genes and the P-values for testing their presence of LD in Table 6. We did not observe the strong LD between the interacting genes.

Interactions in the MAPK signaling pathway
To show the detailed interaction structure, we presented the results of 331 significant cis-trans interactions in the MAPK signaling pathway in the Additional file 2: Table S3 where min P-values indicated that the functional regression model with functional response and discrete predictors was used to test for the interaction for all possible pairs of SNPs within two genes and minimum of P-values of the tests was listed in the Additional file 2: Table S3. The column "SNP pair" listed their corresponding pair of SNPs reaching the minimum of the P-values and their chromosome locations. From Additional file 2: Table S3 we had several significant observations. First, we observed that the majority of interacting genes were located in different chromosomes, which implied that interactions were not caused by the linkage disequilibrium (LD). Second, we observed that large proportions of interacting genes did not show significant evidence of marginal association. This demonstrated that if we only selected the genes with significant marginal association for epistasis analysis, many interactions would be missed. Third, in general, the functionvalue-based epistasis analysis (BFGM, min P-values) had much smaller P-values than the summary statisticbased epistasis analysis (SFGM). Fourth, we observed that the genes interacting with the genes in MAPK signaling pathway were in 147 other pathways, including cytokine-cytokine receptor interaction, Cytosolic DNA-sensing pathway, DNA replicationamong others. Fifth, it was interesting to observe that the interacting genes formed a large connected network with 281 nodes and 317 edges (Fig. 6). We observed hub genes IBA57-AS1 with 67 connections, HIST1H2AD with 21 connections, PRR24 with 18 connections and ARL6IP4 with 14 connections. HIST1H2AD is a core component of nucleosome and plays a central role in transcription regulation. ARL6IP4 functions as a splicing inhibitor [25].

Gene ontology and KEGG pathway enrichment analysis
Gene ontology enrichment analysis was performed on the genes in the identified 162361 pairs of significant cis-trans interactions influencing the transcription to discover overrepresented functional biological groupings with interactions. Our analysis was performed using the biological process, cellular component and molecular function categories of the gene ontology.
Ontology enrichment analysis found that cis-trans interactions were significantly enriched in biological  processes (BP) including a single organism process, single organism cellular process, single organism metabolic process and development process (Fig. 7) and molecular functions that were primarily related to catalytic and binding activity with P-values <10 −4 (Fig. 8). Ontology enrichment analysis also identified that cis-trans interactions were significantly enriched in the cell, intracellular, organelle, and membrane bounded organelle components (Fig. 9). The enrichment analysis was also applied to 228 KEGG pathways to identify the pathways that were enriched with cis-trans interactions. The results were summarized in Fig. 10. The cis-trans interactions were enriched in metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway.

Communities in gene interaction networks
We used random walks in igraph [26] to detect 10 communities from the entire gene-gene interaction network. We used R package GOstats [27] to conduct gene set enrichment analysis. We have identified 29 pathways enriched in 10 communities. Figure 11 showed the 6 th community with 96 genes and 186 interactions enriched with metabolism (Three of four significantly enriched pathways were metabolism pathways: Glycerolipid metabolism, Nicotinate and nicotinamide metabolism, and Pyrimidine metabolism) where node represents a gene and an edge represents the interaction between the connected gene by the edge. All 10 communities with the enriched pathways (P-value < 0.01) are summarized in Additional file 3: Table S4.

Discussion
In the past, the statistical epistasis of gene expression is defined as variant-variant interactions that regulate gene expression and its analysis has been mainly designed for microarray gene expression data and common variants. Since the dimension of the data for epistasis analysis of gene expression is very high, all the traditional methods for epistasis analysis of gene expression have the limited application to eQTL data. The whole genome epistasis studies of gene expressions have been very limited. The genetic structure of epistasis of gene expressions has not been fully discovered.
The recently developed next-generation mRNA sequencing (RNA-seq) assay generates dozens or even one hundred million short reads of mRNA and WGS also generates millions of SNPs. As a consequence, these genetic variation and gene expression variation data are so densely distributed across the genome that both genetic variation and expression variation can be modeled as a function of genomic location. The RNA-seq profiles can be taken as a function-valued trait. However, the standard multivariate statistical analysis often fails with functional data. The computational burden and correction for multiple tests seriously damage the feasibility of the variant-variant interaction analysis of extremely high dimensional RNA-seq and WGS genotype data. The variant-variant interaction analysis is not suitable for the epistasis analysis of the function-valued traits with NGS data as genotype data. Although the genetic study of quantitative traits has seen wide application and extensive technical development, the quantitative genetic analysis, particularly epistasis analysis of function-valued trait is comparatively less developed. To our knowledge, no statistical methods have been developed for genetic epistasis analysis of function-valued traits with NGS data. In the past few years we have witnessed the rapid development of novel statistical methods for association studies using NGS data. However, these methods might not be appropriate for genetic epistasis analysis of function-valued trait. The quantitative genetic epistasis analysis of rare variants for function-valued traits remains a huge challenge.
The widely used methods for reducing dimensionality of the RNA-seq data use the Poisson distribution, binomial distribution and negative binomial distribution to summarize the RNA-seq profile into a single number to represent the RNA-seq curve. However, these discrete   Figure S1A showing the real RNA-seq curve, the data simulated by a negative distribution of the gene LMNB2 and Additional file 4: Figure S1B showing the real RNA-seq curve of the gene LMNB2 and the curve estimated by the FPCA of the RNA-seq data. We observed that the negative distribution failed to capture the variation of the RNA-seq profile, but the FPCA approximated the RNA-seq curve exceedingly well. Emergence of the NGS techniques demands a paradigm shift in the analytic methods for eQTL epistasis  Fig. 10 The enrichment analysis of cis-trans interactions in 228 KEGG pathways Fig. 11 A subnetwork in the 6 th community with 96 genes and 186 interactions enriched with metabolism where node represents a gene and an edge represents the interaction between the connected gene by the edge will not account for the space-ordering of the data and correlation information contained in the data. The BFGM simultaneously employs genetic information of the individual variants and correlation information contained in both RNA-seq and SNP data. Third, both the sign and the size of the heterogeneity will also be incorporated into the test in the BFGM. Fourth, the multicolinearity problem in the BFGM is alleviated. Fifth, the BFGM expands both RNA-seq function and genotype function in terms of orthogonal eigenfunctions, which leads to substantial dimension reduction. The BFGM for genetic epistasis analysis of a function-valued trait which captures key information in the data is expected to open a new route for genetic epistasis analysis of RNA-seq and NGS genotype data.

Conclusions
We developed a novel functional regression model with both functional response and functional predictors for detection of epistasis influencing RNA-seq variations in humans, which is referred to as the BFGM. The BFGM takes genes as a basic unit of epistasis analysis and utilizes all information contained in both the RNA-seq and SNP data. By large simulations and real data analysis we demonstrated the merits and limitations of the proposed new paradigm of epistasis analysis for the RNA-seq and WGS data.
The new approach uses all genetic information in the genome regions and expression variation information in the target gene to collectively test the interaction between multiple SNPs within the regions influencing the RNA-seq curves. Therefore, the BFGM for interaction analysis overcomes limitations inherent in pair-wise interaction tests with the summary expression level as a scalar trait. By large simulations and real data analysis, we showed that the proposed BFGM substantially increased the power, dramatically reduced the computational burden and substantially outperformed the traditional variant-variant epistasis analysis of summary statistic measured quantitative traits. In real data analysis, we also clearly demonstrate that pairs of SNPs between two genes jointly have significant interaction effects, but individually each pair of SNPs makes a mild contribution to interaction effects.
The previous interaction analyses have mainly focused on the interactions between common and common variants. The distribution of the common and rare variants causing interactions is unknown. Very few genome-wide interaction analyses with the RNA-seq and WGS data, and very few results of significant interaction between rare and rare variants, and rare and common variants have been reported. We analyzed 350 samples of European origin with both RNA-seq and whole genome sequencing data available. We observed the large proportions of pair-wise interactions between rare and rare variants, and rare and common variants. The significant pair-wise interactions between common and common variants were less observed. The results showed that the number of significant cis-trans interactions identified by the SFGM with RPKM as overall gene expression level only accounted for 0.16% of the significant cis-trans interactions identified by the BFGM with RNA-seq and NGS genotype data. The majority of epistasis analysis for gene expressions used the microarray to measure gene expressions and test interactions for only common variants. Even though the RNAseq data are available they still converted variation rich RNA-seq data into a single number such as RPKM or other summary statistics. Then, the variant-variant epistasis analysis is conducted on these converted data. That explains why these researches question the universe presence of significant gene-gene interaction influencing gene expressions.
Some researchers suggest that in genome-wide interaction analysis only genes with large or mildly marginal genetic effects should be tested for interaction. However, we observed that the majority of the significantly interacting genes showed no marginal association. These results clearly demonstrated that if we tested interactions for only genes with marginal associations, then many true interactions will be missing.
We are unsure whether interaction is most often presented in isolation, or interacting genes form networks. We identified a large number of cis-trans interactions and observed that the interacting genes formed large connected networks with hub genes presented. We found that some hub genes, for example histone modification genes, can globally regulate gene expressions. Enrichment analysis also showed that metabolic pathways, MAPK signaling pathway, pathways in cancer, endocytosis, protein processing in endoplasmic reticulum and Wnt signaling pathway among others were enriched with cis-trans interactions.
The results in this paper are preliminary. The confounding factors that cause spurious interactions have not been investigated. The statistical methods for epistasis analysis which remove confounding factors have not been developed. The complete genome-wide epistasis analysis including all cis and trans interactions have not been performed. The purpose of this paper is to stimulate further discussions regarding the great challenges we are facing in the epistasis analysis of high dimensional RNA-seq and WGS data.

Functional regression with both functional response and functional predictor models for epistasis analysis
For the convenience of discussion, position level read counts are taken as the RNA-seq profile and is referred to as a function-valued trait. Let y i (τ), τ ∈ Τ τ = [0, Τ τ ] be the read counts of the i th individual at the genomic position τ. Consider two genomic regions (or genes) [a 1 , b 1 ] and [a 2 , b 2 ]. Let x i (t) and z i (s) be genotypic functions of the i th individual defined in the regions [a 1 , b 1 ] and [a 2 , b 2 ], respectively. Let t and s be a genomic position in the first and second genomic regions, respectively. The genotype functions x i (t) and z i (s) are defined as where μ(τ) is an overall mean function at the genomic position τ, W i is a vector of covariates for i th individual, ω(τ) is a vector of effects associated with the covariates, α(t, τ) is a genetic additive effect function at genomic position t of the first gene and genomic position τ of the RNA-seq profile, β(s, τ) is a genetic additive effect function at genomic positions s of the second gene and the genomic position τ, γ(t, s, τ) is an interaction effect function between two putative quantitative trait loci (QTLs) located at the genomic positions t and s influencing the read counts at the genomic position τ, and ε i (τ) is a residual function of the unexplained effect for the i th individual at the genomic position τ. The interaction function is measured by double integrals of the genotype function in two genes.

Estimation of interaction effect function
We assume that both position level read count function and genotype functions are centered. The genotype functions x i (t) and z i (s) are expanded in terms of the orthonormal basis function as: where ϕ j t ð Þ and ψ l (s) are sequences of the orthonormal basis functions. The expansion coefficients ξ ij and η il are estimated by In practice, numerical methods for the integral will be used to calculate the expansion coefficients. Substituting Eq. (2) into Eq. (1), we obtain where The parameters α j (τ), β l (τ) and γ jl (τ) are referred to as genetic additive effect and additive x additive effect score functions. These score functions can also be viewed as the expansion coefficients of the genetic effect functions with respect to orthonormal basis functions: Equation (4) can be written in a vector form: where Y(τ), μ(τ), ω(τ), α(τ), β(τ) and γ(τ) are vectors, W, ξ, η and Γ are matrices. Expanding Y(τ), μ(τ), ω(τ), α(τ), β(τ), γ(τ) and ε(τ) in terms of the orthogonal basis functions yield and ε i τ ð Þ ¼ X K k¼1 ε ik θ k τ ð Þ: Define expansion coefficient vectors and matrices as follows. and ε ¼ Thus, substituting the above expansion into Eq. (5) gives where W ¼ Since Eq. (6) holds for every genomic position τ, the coefficients on both sides of Eq. (6) should be equal. Therefore, the functional regression model (6) can be further transformed to standard multivariate multiple regression: Let Equation (7) can be rewritten as The standard least square estimators of b iŝ The covariance matrix Σ is estimated bŷ Test statistic An essential problem in genetic epistasis analysis of the function-valued traits is to test the interaction between two genomic regions (or genes). Formally, we investigate the problem of testing the following hypothesis: which is equivalent to testing the hypothesis: Recall that Therefore, we have var vecb Let Λ be a matrix consisting of the last JLK columns and JLK rows of the covariance matrix var vecb andγ be the estimators of interaction which can be obtained by extracting the last JL rows of the estimators of the matrixb . Define the test statistic for testing the interaction between two genomic regions [a 1 , b 1 ] and [a 2 , b 2 ] as Then, under the null hypothesis H 0 : γ = 0, T I is asymptotically distributed as a central χ (JLK) 2 with degrees of freedom JLK or the rank of the matrix Λ.

Null distribution of test statistics
To examine the null distribution of test statistics, we performed a series of simulation studies to compare their empirical levels with the nominal ones. We calculated the type I error under three models. We first assumed the model with no marginal effects: Model 1 (no marginal effect): where μ(τ) is the overall mean at the genomic position τ, y i (τ) is the normalized number of reads at the genomic position τ of the i th individual and ε i (τ) is an error stochastic process. The errors should be correlated stochastic process. The theoretic models for the errors are unclear. They were estimated from the data. The procedures for generating mean μ(τ) and errors ε i (τ) consisted of the following steps.
Step 1: We randomly sampled 100 genes from the whole real RNA-seq dataset. Let k index genes, j index the genomic positions and i index the samples. Assume that the gene k is located in the interval [a k , b k ]. Let x ikj , (i = 1, …, n, k = 1, …, 100, j = 1, …, s k ) be the observed count of reads of the gene k in the genomic position j of the i th individual where the length of gene k is denoted s k . For each genomic position, we define an n dimensional vector: x kj = [x 1kj , …, x nkj ] T .
Step 2. Let m be the median length of 100 genes. In our dataset, m = 2, 456.
Step 3. Re-map the original RNA-seq data of 100 genes to the interval [0, 1] using transformation j−a k b k −a k . Then, estimate the count of reads on position 0; 1 m ; 2 m ; …; 1 from the original RNA-seq data of the 100 genes using local polynomial regression (LOESS). The estimated count of reads of the gene k in the genomic position j of the i th individual at the equally distributed new positions 0; 1 m ; 2 m ; …; 1 are denoted by y ikj . Define vector y kj ¼ y 1kj ; …; y nkj h i T ; k ¼ 1; …; 100; j ¼ 1; …; m: Step 4. Compute the means of the re-mapped the RNA-seq data over 100 genes and over n samples: y ij ¼ 1 100 X 100 k¼1 y ikj ; i ¼ 1; …; n; j ¼ 1; …; m and y j ¼ 1 100Ân X n i¼1 X 100 k¼1 y ikj ; j ¼ 1; …; m. Define the mean vector of the re-mapped counts of reads: y ¼ y 1 ; …; y m ½ T .
We then calculate the sampling covariance matrix Σ ¼ cov η; η ð Þ.Assume that the scores of the residuals follow a multivariate normal distribution N 0;Σ À Á . Using the normal random variables to generate an n sample of vectors ε i = [ε i1 , …, ε iT ]. The residuals ε i (τ) will be defined as Model 2 (a marginal effect at the first gene): where μ(τ) is the overall mean at the genomic position τ, ε i (τ) is an error stochastic process, x ij is an indicator variable for the genotype of i th individual at the j th SNP of the first gene, y i (τ) is defined as that in model 1. The coefficient α j (τ) = r j ⋅ α(τ), where r j is randomly selected from 0.5 to 1.5, is the additive effect function of the j th SNP of the first gene, μ(τ) is obtained by randomly sampling 100 genes from the real RNA-seq and WGS genotype data without interactions and α(τ) is obtained by randomly sampling 100 genes from the real RNA-seq and WGS genotype data under the condition that one gene have significant main effect, the other gene do not have significant main effect, and these gene pairs are not in the list of significantly interacted gene pairs in our results. The overall mean μ(τ), effect function α(τ) and the residuals ε i (τ) were similarly simulated as that in Model 1. Model 3 (marginal effects at both the first and the second genes): where z ik is an indicator variable for the genotype of i th individual at the k th SNP of the second gene. The genetic additive effect function β k (τ l ) is assumed to be equal to β k (τ) = s k β(τ), where s k is randomly selected from 0.5 to 1.5, other parameters are defined in Model 2. A total of 100 pairs of genes were randomly selected under the condition that both genes have significant main effect and these gene pairs are not in the list of significant interacted gene pairs in our results. The overall mean function μ(τ), main effect functions α(τ) and β(τ), and residual term ε i (τ) were similarly generated as that in Models 1 and 2.