FunPat pipeline
FunPat takes as input the expression data and the functional annotations organized according to Gene Sets, which can be defined as GO terms, pathways or other sets depending on the annotation database used as input. A summary of FunPat analysis pipeline is displayed in Figure 1. First, FunPat assigns a p-value to each gene based on the Bounded-Area method described in Di Camillo et al. [19]. For each gene, the area A of the region bounded by the gene transcriptional expression profiles in two experimental conditions (e.g. treatment vs. control), or in a single condition vs. a baseline, is calculated (Figure 1A, B). A p-value is assigned to A by evaluating its statistical significance against a null hypothesis distribution. To do that, the available replicates are exploited to model the biological-plus-technical variability and its dependency on the mean gene expression level. This model is then used to build the null hypothesis distribution of A, named AH0, by a Monte Carlo resampling approach. This method requires only at least two replicates for a single time point, thus addressing lack of fully replicated time series experiments, typical of currently available RNA-seq data. Finally, different distribution models (Gamma, Log-normal, Weibull) are used to fit the entire set of AH0 values and the best model is chosen according to the goodness of fit and the parameter precision. Quantiles of the empirical distribution can be also used as an alternative to the above models.
Exploiting the p-values assigned by the Bounded-Area method, FunPat performs a gene selection simultaneously to a Gene Set-specific clustering. First, two sets of genes are defined: seeds, i.e. genes passing a FDR threshold chosen by the user, and candidates, i.e. genes passing a soft-threshold applied to not-adjusted p-values (Figure 1B). Alternatively, seeds and candidates can be provided as input by the user. To identify the main transcriptional dynamics characterizing the expression data, FunPat searches for the common temporal patterns on groups of genes belonging to the same Gene Set (Figure 1C). The core of this analysis is a linear model-based clustering [20], which searches for a cluster of genes whose time series expression profile X = <x(1),..., x(M)> with M time points can be modelled by the following equation:
where P = <p(1),..., p(M)> is the characteristic temporal pattern, k and q are the gene-specific parameters of the model and Σ is the error covariance matrix. The clustering algorithm iteratively performs an identification step of the gene-specific parameters and a temporal pattern search using an Expectation-Maximization approach. A cluster is considered significant if it contains at least one seed gene and passes both a goodness of fit and a runs test. Further details of the clustering algorithm are reported in Additional File 1.
At the end of the analysis, FunPat considers a gene as significantly differentially expressed if either it is a seed gene or it belongs to a significant pattern. Intuitively, since each pattern is required to contain at least one seed gene, all genes associated to the same pattern are likely to be differentially expressed because they are highly correlated to the same temporal profile and, since the clustering is specific for each Gene Set, they share a common biological function or pathway.
Gene Sets used in FunPat pipeline can be organized according to a hierarchical structure, such as the direct acyclic graph (DAG) in GO database. Structured annotations provide further useful information: relationships between the biological terms in a hierarchical structure codify for the specificity of some terms with respect to others and can help to associate genes to the most informative terms avoiding redundant annotations. When structured information is available, FunPat assumes that genes annotated to a Gene Set are also annotated to all its ancestors and that the farther the Gene Set is from the root node, the more specific information it conveys. FunPat performs the clustering starting from the Gene Sets associated to the most specific terms and then removes all the genes belonging to a significant pattern from the ancestor nodes, similarly to what has been proposed by Alexa et al. [21] in the context of the functional enrichment analysis.
Since similar temporal patterns can be identified for different Gene Sets, FunPat applies a second clustering step to identify the Main Patterns characterizing the data. Each Main Pattern thus represents groups of Gene Sets characterized by highly-correlated temporal patterns of DE genes (Figure 1C).
FunPat implementation
FunPat is provided in R/Bioconductor with the related documentation, it is open source under the GPL-3 License and it requires R version 3.0.3 or higher to perform the analyses. As output, FunPat reports the selected genes and Gene Sets into simple tab delimited .txt files and plots the temporal patterns on separate .pdf files. The package allows the user to easily display the results into HTML tables with sortable and filterable columns, plots and hyperlinks to other data sources such as NCBI and GO databases. Figure 2 shows an example of the final report summarizing the information on the identified Main Patterns. Other two HTML reports are generated by FunPat displaying the output of the Bounded-Area method and the temporal pattern profiles associated to each Gene Set (Additional File 2).
RNA-seq time series data simulation
In order to assess FunPat pipeline performance in terms of selection of DE genes and correct identification of the temporal patterns, its application to simulated time series data was tested. 100 time series datasets were generated, simulating a dynamical response measured in two different conditions: treatment and control. Each dataset is characterized by N = 10000 genes monitored over M = 13 time samples. 120 genes were simulated as differentially expressed, belonging to S = 6 different temporal patterns (Figure 3), representing the log fold-changes of expression levels over time induced by the treatment with respect to the control.
In particular, the expression profile θ
f
(t) of a given gene f was modelled on log scale as follows:
(2)
where P
j
(t) (j = 1,..., S) is the temporal pattern reflecting the changes in gene expression levels in response to treatment and k
f
, q
f
are gene-specific parameters. q
f
was sampled from a normal distribution N(1.5, 1.8) and k
f
was sampled from a uniform distribution in the interval [0.5, 2]. Each pattern P
j
(t) was used to simulate the expression profiles of 20 DE genes, for a total of 120 DE genes. The log fold-changes of the remaining 9880 genes were simulated as:
(3)
Assuming that the genes are single-isoform and have the same length for all the simulated transcripts, the probability that a read comes from a gene f can be computed, for each time point t, as:
(4)
π
f
(t) was used to obtain the final transcripts counts, using a Negative Binomial distribution NB(R·π
f
(t), φ), where R is the sequencing depth and φ is the dispersion parameter. The sequencing depth of each sample was sampled from a uniform distribution in the interval [106, 107] and the dispersion parameter was set to φ = 0.1. Three replicates were generated for each time point. Simulated data were finally normalized according to the TMM method [6]. In particular, the normalization factors were re-scaled by the median of the normalized library sizes and then used to obtain the normalized read counts.
Finally, each cluster of DE genes was associated to a common specific GO term. To each of these GO terms, a random number of non-DE genes was also associated, ranging between 9 and 925. The remaining not-DE genes were randomly associated to other randomly chosen GO terms. R Packages GO.db and org.Hs.eg.db were used to define the DAG structure of GO terms and the GO annotations, respectively.
Performance evaluation
FunPat was tested to evaluate its ability to: 1) recover false negatives in the selection of DE genes without decreasing the false discovery rate; 2) correctly cluster the genes associated to the same temporal pattern; 3) give reproducible results on independent replicates. The statistical significance of all the comparisons done was evaluated using two-sided Wilcoxon signed-rank test.
Selection of DE genes
Selection performance was assessed in terms of precision (number of true positives divided by the number of selected features) and recall (number of true positives divided by the number of true DE genes) in detecting the 120 simulated DE genes. FunPat selection performance was compared to edgeR and two existing methods specifically designed for time series expression data: maSigPro, using the new generalize linear model for the RNA-seq data [17], and the FPCA-based approach proposed in Wu and Wu [15]. In the comparison, we also considered the stand-alone application of the Bounded-Area method in order to evaluate if the integration of gene selection with the clustering step and the functional annotation is able to improve the recall without loss in precision.
EdgeR was applied to the data using the GLM application, by defining two factors for the model: one indicating the treatment/control samples, and the other indicating the corresponding time point, as suggested in [22].
MaSigPro applies two generalized linear regression steps to model gene expression in time series expression data. In particular, the former generates for each gene an ANOVA table and the related p-values; the latter is a stepwise regression analysis applied only to the genes with significant p-value. The goodness of fit of the obtained models, namely R2, can optionally be used to perform an additional gene selection step. In the evaluation of maSigPro on our simulated data we used the latest version adapted for RNA-seq data [17], considering the results obtained by both the first regression step (no threshold on R2) and setting a threshold on R2 equal to 0.7 (maSigPro default setting). In both regression steps the same two factors defined for edgeR were considered for the generalized linear model.
Differently from the above methods, the FPCA-based approach [15] integrates principal component analysis into an hypothesis testing framework, identifying data-driven eigenfunctions representing the expression trajectories. The related test, publicly available at the Immune Modeling Community Web Portal repository [23] was used to perform the gene selection on our data.
Identification of temporal patterns
The ability to correctly associate the expression profiles to the corresponding simulated patterns was assessed in terms of clustering precision (C-precision) and recall (C-recall), defined as described in Figure 3. The two scores were calculated by matching each identified profile to one of the simulated patterns looking at the maximum intersection between the groups of genes identified by the clustering method and those assigned to a cluster by the simulation, respectively. C-precision was calculated as true positives, i.e. the number of genes in the intersection, divided by the number of genes associated to the identified profile; the C-recall was calculated as true positives divided by the number of genes (i.e. 20) associated to the simulated pattern. Figure 3 provides a toy example.
In addition to C-precision and C-recall, we also considered the normalized mutual information (NMI) to quantify the shared information between the simulated partition and the clustering results. Specifically, mutual information was calculated using the contingency table obtained by the true partition and the clustering results; since the mutual information has no upper bounds, its normalized version, ranging between 0 and 1, was used [24].
FunPat clustering performance was compared to the hierarchical and k-means clustering approaches and to a model-based method recently adapted for RNA-seq data [12]. The number of clusters obtained as output by FunPat was used to set the number of clusters for all the other methods.
Reproducibility of DE gene lists
As a further evaluation, we considered the reproducibility of the results to assess the ability of each method to detect the same DE genes when it is applied independently on available independent replicates. We applied FunPat and the other selection methods to each single replicate and evaluated the reproducibility of the resulting lists of DE genes in terms of intersection across the three replicates divided by the minimum list size, i.e. the smallest list size among the lists of DE genes identified for each replicate. The same evaluation was also applied to two different real datasets [16, 18], focusing also on the ability of FunPat to obtain biologically interpretable results.