Data normalization in RNA-Seq
Brain RNA-Seq data were generated from post-mortem cortical samples collected from Brodmann Area 19 (BA19) in 39 control and 25 autism-affected cases (see Additional file 1: Table S1). After estimating gene expression from the sequencing reads, two methods for data normalization were assessed: Exploratory Data Analysis and Normalization for RNA-Seq (EDASeq) [16] and Conditional Quantile Normalization (CQN) [17]. The normalized gene expression values from each algorithm demonstrated method-specific biases. Examining p-values from our covariate-adjusted case–control analysis, we note that normalization by CQN leads to a marked increase in the test statistics for shorter, low-GC-content genes (gene length < 1000 bp, GC content < 35%), a problem not observed with EDASeq (see Additional file 1: Figure S1). Conversely, genes with low expression estimates that were assigned zero values by EDASeq produced more per-gene outliers in our eQTL analyses (see Additional file 1: Figure S2A), whereas CQN handled these genes better (see Additional file 1: Figure S2B). The two normalization methods were further compared by eQTL replication, our measure of biologic reproducibility (discussed below), with CQN slightly outperforming EDASeq (Figure 2). While a unified approach that directly addresses the limitations of both methods would improve results, we selected CQN for downstream analyses because of its slight advantage in eQTL replication. Nonetheless, we recommend that, until the issues presented here are directly addressed, both methods be considered as part of an analysis pipeline.
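As an illustration of this length/GC diagnostic, the sketch below (in Python, on stand-in data with hypothetical column names rather than the actual pipeline output) stratifies per-gene case–control p-values by the same thresholds (length < 1000 bp, GC < 35%) and compares flagged versus unflagged genes under each normalization; a marked shift in the flagged set points to a length/GC-driven bias.

```python
import numpy as np
import pandas as pd

# Stand-in per-gene table; in practice these columns would come from the
# annotation (length, GC) and the two case-control analyses (p_cqn, p_edaseq).
rng = np.random.default_rng(0)
genes = pd.DataFrame({
    "length": rng.integers(200, 20_000, size=5_000),
    "gc": rng.uniform(0.25, 0.65, size=5_000),
    "p_cqn": rng.uniform(size=5_000),
    "p_edaseq": rng.uniform(size=5_000),
})

# Short, low-GC genes: the class where CQN showed inflated test statistics.
flagged = (genes["length"] < 1000) & (genes["gc"] < 0.35)

# Compare p-value distributions inside vs. outside the flagged set for each
# normalization; a strong downward shift in the flagged set suggests bias.
for method in ["p_cqn", "p_edaseq"]:
    print(method,
          "flagged median:", round(genes.loc[flagged, method].median(), 3),
          "other median:", round(genes.loc[~flagged, method].median(), 3))
```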
Identifying outliers in RNA-Seq data
In large sequencing studies, specific samples, for technical or biological reasons, can be recognized as outliers and should be removed from the study [18]. To identify outlier samples, whose global gene expression pattern is not explained by known covariates, we used Principal Component Analysis (PCA), investigating the first six principal components, which together explain ~60% of the variance in the brain data. Samples greater than three standard deviations (SD) from the mean in any of the first six principal components were deemed outliers and removed from analysis (N = 4 or 6.3% of all samples) (see Additional file 1: Figure S3).
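A minimal sketch of this filter, assuming a normalized samples-by-genes expression matrix (random stand-in data here), is shown below; it projects samples onto the first six PCs and flags any sample more than three SD from the mean on any component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in expression matrix (samples x genes); in practice this is the
# normalized, log-scaled expression table for all samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(64, 5000))

# Project samples onto the first six principal components.
pca = PCA(n_components=6)
pcs = pca.fit_transform(expr)

# Flag samples lying more than 3 SD from the mean on any of the six PCs.
z = (pcs - pcs.mean(axis=0)) / pcs.std(axis=0)
outliers = np.flatnonzero(np.abs(z).max(axis=1) > 3)
print("variance explained by first six PCs:", pca.explained_variance_ratio_.sum().round(2))
print("PC outlier sample indices:", outliers)
```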
After the sample-based outlier removal described above, it was apparent that, on a gene-by-gene basis, there were samples whose expression estimates differed greatly from the rest of the samples for that particular gene (see Additional file 1: Figure S2). Using a cut-off of three SD from the mean, 20.2% [7,027/34,738] of genes tested for differential expression between cases and controls had at least one sample flagged as an outlier for gene expression level. As these sample outliers are gene-specific, they suggest a clear artifactual origin, as opposed to a problem with the sample as a whole. Comparing the 50 most significantly differentially expressed genes between cases and controls before and after outlier removal, the lists differ at 60% [30/50] of the genes present (see Additional file 2), demonstrating that inaccurate results would be reported if gene-by-gene outliers were not removed. To further ensure that this approach was biologically sound, we assessed its validity using our eQTL analysis (discussed below).
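The per-gene analogue can be sketched as a simple masking step, again on stand-in data: values more than three SD from a gene's mean are set to missing so that only that gene's test excludes the flagged sample.

```python
import numpy as np

# Stand-in samples x genes matrix (random data for illustration only).
rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 1000))

# For each gene, mask samples more than 3 SD from that gene's mean so they
# are excluded from the differential expression test for that gene only.
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
masked = np.where(np.abs(z) > 3, np.nan, expr)

n_flagged_genes = np.isnan(masked).any(axis=0).sum()
print(f"{n_flagged_genes} genes have at least one per-gene outlier sample")
```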
After flagging outlier samples for removal in the brain data set, we obtained genotypes from both DNA and RNA. As a check on our data, we verified sample identity by comparing each RNA-Seq sample against all DNA samples. Pair-wise Identity by State (IBS) distances (DSTs) were calculated in PLINK with the expectation that DNA and RNA genotypes generated from the same individual should have a DST value approaching 1.0. In all samples, DNA genotypes best matched their corresponding RNA genotypes with a DST > 0.83, indicating that our DNA and RNA samples were, in fact, from the same individual.
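A sketch of this identity check is shown below; it assumes PLINK's standard --genome output computed on the merged DNA and RNA genotype calls, with a hypothetical "_RNA" ID suffix and a placeholder file name, and reports the best-matching DNA sample (highest DST) for each RNA-Seq library.

```python
import pandas as pd

# Parse the PLINK --genome output; column names follow PLINK 1.9's .genome
# format. The "_RNA" suffix and file name are illustrative conventions only.
ibs = pd.read_csv("merged.genome", sep=r"\s+")

# Keep only DNA-vs-RNA pairs, whichever column holds the RNA-derived sample.
pairs = ibs[ibs["IID1"].str.endswith("_RNA") ^ ibs["IID2"].str.endswith("_RNA")].copy()
rna_first = pairs["IID1"].str.endswith("_RNA")
pairs["rna"] = pairs["IID1"].where(rna_first, pairs["IID2"])
pairs["dna"] = pairs["IID2"].where(rna_first, pairs["IID1"])

# For each RNA-Seq library, report the DNA sample with the highest IBS distance
# (DST); the top match should be the expected donor, with DST approaching 1.0.
best = pairs.loc[pairs.groupby("rna")["DST"].idxmax(), ["rna", "dna", "DST"]]
print(best.to_string(index=False))
```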
Despite correct identification of sample identity by IBS, three samples had borderline DST values (DSTs = 0.83-0.89), warranting further investigation. These samples showed an unexpected genotype comparison profile: all three had an elevated number of calls deemed homozygous by DNA genotyping but called heterozygous at the RNA level. As DNA genotyping by Affymetrix array has proven to be extremely accurate [19], an excess of sites where the DNA genotype indicates homozygosity but heterozygous calls are present at the RNA level indicates possible contamination. We quantified these occurrences in each sample using a metric we refer to as the Discordance Ratio (DR). For the majority of our samples, for which there is no suspected contamination, the DR approaches zero, with a value less than 0.2 indicating RNA-Seq data of sufficient quality for further analysis. The three samples in question had elevated DRs (0.32, 0.41, and 0.47), suggestive of sample cross-contamination (see Additional file 1: Figure S4).
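A toy calculation of the DR is sketched below; the genotype encoding and the exact denominator (sites called homozygous from DNA) are assumptions for illustration, since the metric is defined only qualitatively in the text.

```python
import pandas as pd

# Hypothetical merged table of calls at shared SNPs for one sample:
# 'dna_gt' and 'rna_gt' coded as 0/1/2 alternate-allele counts.
calls = pd.DataFrame({
    "dna_gt": [0, 0, 2, 2, 0, 2, 0, 2],
    "rna_gt": [0, 1, 2, 1, 0, 2, 0, 2],
})

dna_hom = calls["dna_gt"].isin([0, 2])
rna_het = calls["rna_gt"] == 1

# Discordance Ratio: among sites homozygous by DNA genotyping, the fraction
# that appear heterozygous in the RNA-Seq calls (denominator assumed here).
# Values near 0 are expected for clean samples.
dr = (dna_hom & rna_het).sum() / dna_hom.sum()
print(f"DR = {dr:.2f}")
```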
To address the possibility of contamination, we conducted a mixing experiment in which high-quality RNA-Seq samples (identified as having a DR < 0.1) were combined in controlled ratios. We performed variant calling on these intentionally mixed samples in the same manner as for the original RNA-Seq data and calculated the DR for each. The DRs of the three suspect samples were then compared with those of the simulated mixtures. This comparison suggests that, for these three libraries, 30-70% of the RNA-Seq reads originated from a different sample (Figure 3). As reads from a foreign sample would lead to inaccurate gene expression estimates, we removed these samples from downstream analysis, resulting in a final data set of 57 samples, comprising 21 controls and 36 cases.
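One way to turn the mixing experiment into a contamination estimate is to interpolate the observed DRs onto a calibration curve built from the intentionally mixed libraries; the calibration values below are hypothetical, while the three suspect DRs are those reported above.

```python
import numpy as np

# Hypothetical DRs measured on intentionally mixed libraries at known
# contamination fractions, used here as a calibration curve.
mix_fraction = np.array([0.0, 0.1, 0.3, 0.5, 0.7])
mix_dr       = np.array([0.05, 0.15, 0.30, 0.42, 0.50])

# Estimate the contaminating fraction for the suspect libraries by
# interpolating their observed DRs onto the calibration curve.
suspect_dr = np.array([0.32, 0.41, 0.47])
estimated = np.interp(suspect_dr, mix_dr, mix_fraction)
print(dict(zip(suspect_dr, estimated.round(2))))
```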
Reported brain eQTLs are reproducible in RNA-Seq data
Previously, surrogate measures of RNA quality (e.g., pH, post-mortem interval, RIN value) have been used in an attempt to predict biologic validity, but none has been uniformly successful. Using published sets of brain eQTLs – regulatory genomic loci at which gene expression levels in the brain differ by genotype – we sought to recapitulate a number of the previously reported brain eQTLs in our gene expression data. We postulated that if we could replicate these eQTLs in our data, this would indicate that the use of post-mortem brain tissue may be representative of physiological conditions. We used a list of 909 cis-eQTLs generated from two recent studies that detected brain eQTLs in multiple disease populations across a number of brain regions [7, 8] (see Additional file 3). Despite a smaller sample size and only one brain region under interrogation, we replicate 26.1% [237/909] of the tested associations (inflation-adjusted p < 0.05) when age, sex, site and principal components are included as covariates (Figure 2 & see Additional file 1: Figure S5 & Table S2).
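The replication metric can be sketched with a generic genomic-control adjustment (median chi-square inflation factor λ); the exact inflation adjustment used in this study is described in the Methods and may differ, and the p-values below are random stand-ins.

```python
import numpy as np
from scipy import stats

# Stand-in p-values: a genome-wide cis-eQTL scan (to estimate inflation) and
# the tests at the 909 previously reported eQTL pairs.
rng = np.random.default_rng(1)
genomewide_p = rng.uniform(size=100_000)
reported_p = rng.uniform(size=909)

# Genomic inflation factor lambda: median observed chi-square over its expected median.
chi2_all = stats.chi2.isf(genomewide_p, df=1)
lam = np.median(chi2_all) / stats.chi2.ppf(0.5, df=1)

# Deflate the statistics at the reported eQTLs (only if lambda > 1) and recompute p-values.
chi2_rep = stats.chi2.isf(reported_p, df=1) / max(lam, 1.0)
adj_p = stats.chi2.sf(chi2_rep, df=1)
print(f"lambda = {lam:.2f}, replication at adjusted p<0.05: {np.mean(adj_p < 0.05):.1%}")
```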
Monitoring eQTL replication to gauge quality control measures
We posit that if we are handling our data appropriately, known brain eQTLs should demonstrate improved association after each data correction step, as well as an overall increase in the number of previously reported eQTLs that replicate. We have measured the ability to replicate known cis-eQTL associations using three metrics: (1) the percentage of known eQTLs that replicate at p < 0.05 after adjusting for genome-wide inflation (see Methods), (2) π1, a statistic that estimates the proportion of significant tests [20], and (3) the percentage of known eQTLs that replicate at q < 0.05. Taken together, these three metrics offer a profile of the validity of each data handling step.
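For reference, a rough single-λ version of the Storey-style π1 estimate is sketched below; the qvalue approach cited in [20] fits π0 over a range of λ values, so this stand-alone function (run on stand-in p-values) is an approximation rather than the exact procedure used here.

```python
import numpy as np

def storey_pi1(pvalues, lam=0.5):
    """Rough Storey-style estimate of pi1 = 1 - pi0, the proportion of
    p-values drawn from the alternative (i.e., truly significant tests)."""
    pvalues = np.asarray(pvalues)
    pi0 = np.mean(pvalues > lam) / (1.0 - lam)
    return 1.0 - min(pi0, 1.0)

# Example with stand-in p-values at the previously reported eQTLs.
rng = np.random.default_rng(2)
pvals = np.concatenate([rng.beta(0.3, 4, size=200), rng.uniform(size=709)])
print(f"pi1 ~ {storey_pi1(pvals):.2f}")
```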
As part of the initial quality control, seven of the 64 samples (11% of total) were flagged as PCA outliers or contaminated samples and removed. To assess the effect of sample removal, we compared eQTL replication in three data sets: (1) prior to outlier removal (N = 64), (2) after dropping PC outliers (N = 60), and (3) after dropping likely contaminated samples (N = 57). Sample outlier removal allows for the detection of 7.4% more known eQTLs at p < 0.05 and 3.5% more at q < 0.05. Similarly, the π1 estimate of the proportion of replicating eQTLs increases dramatically, from 0.000 to 0.209. These data indicate the necessity of removing suspect samples from these data (Figure 2 & see Additional file 1: Table S2).
We further utilized eQTL replication to determine the most appropriate model for gene annotation. There is evidence suggesting that expression levels estimated from RNA-Seq data at the coding sequence (CDS) alone correspond better with qRT-PCR measurements than RNA-Seq estimates that include both the CDS and its untranslated regions (UTRs). However, recent RNA-Seq analyses have generally included gene annotation from the whole gene – that is, the CDS and its UTRs – under the argument that gene annotation gains accuracy upon UTR inclusion [21]. To address this discrepancy in the literature, we compared these two gene annotation approaches by eQTL replication. The whole-gene annotation clearly replicates known eQTLs better than the CDS alone (Figure 2 & see Additional file 1: Table S2), detecting 5% more known eQTLs at p < 0.05 and 1.9% more at q < 0.05. Replication as measured by π1 also increases (0.114 for the CDS, 0.209 for the whole-gene annotation). This improvement in eQTL detection supports the inclusion of UTRs in gene annotation for these data.
Similarly, eQTL replication was used to compare normalization methods. We note that when considering the overall number of known eQTLs detected, CQN replicates 2.7% more eQTLs (p < 0.05) than does EDASeq (Figure 2 & see Additional file 1: Table S2), further supporting its use in analyzing gene expression in this data set.
Disease-based comparisons are frequently adjusted for known covariates (age, sex, etc.). However, comparative studies are also frequently plagued by unknown covariates, or confounders within the data that are not easily attributable to any recorded measurement [10, 18]. These unknown covariates can be approximated through various data decomposition methods. We initially considered using PCA to accomplish this goal but observed that the first PC was correlated with both collection site (see Methods) and disease status, which often occurs when different sites have differing fractions of cases and controls. As this is a likely issue in many case–control studies, limiting the utility of PCs in downstream analyses, we also considered Surrogate Variable Analysis (SVA) [14] and Independent Surrogate Variable Analysis (ISVA) [15], as these approaches allow disease status to be protected during their generation. Lastly, we also considered utilizing PEER [13, 22] to account for unknown covariates, as this method has been used and performed well in previous eQTL analyses [23]. In eQTL replication analyses, performance was comparable, with ISVs, SVs, PEER factors and PCs detecting 25.1, 26.2, 26.9 and 26.1 percent of the previously reported eQTLs, respectively (p < 0.05) (see Additional file 1: Table S2). Ultimately, however, to address the case–control confounding issue, we had to decide between ISV and SV usage. To do so, we tested both methods by assessing Q-Q plots generated for disease-based comparisons. As the inclusion of SVs, but not ISVs, produced inflated p-values in these analyses (see Additional file 1: Figure S6), we moved forward with ISVs to account for unknown covariates.
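The idea of protecting disease status can be illustrated with a deliberately simplified stand-in for SVA/ISVA (not the published algorithms): known covariates and the protected disease label are regressed out of the expression matrix, and principal components of the residuals serve as rough surrogate variables for the remaining unmodeled structure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: samples x genes expression and a design matrix containing the
# protected variable (disease status) plus a known covariate.
rng = np.random.default_rng(3)
n, g = 57, 2000
expr = rng.normal(size=(n, g))
design = np.column_stack([np.ones(n),                  # intercept
                          rng.integers(0, 2, n),       # disease status (protected)
                          rng.normal(size=n)])         # e.g., age

# Remove the protected and known effects, then take principal components of
# the residuals as rough surrogate variables for unmodeled structure.
beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
residuals = expr - design @ beta
svs = PCA(n_components=5).fit_transform(residuals)
print("surrogate variable matrix shape:", svs.shape)
```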
Finally, regarding covariate inclusion, we note that certain metrics for technical artifacts of sequencing (percent coding bases, percent intronic bases, percent mRNA bases, median 3′ bias, percent UTR bases, and AT dropout) were correlated with specific ISVs (see Additional file 1: Table S4), suggesting that the unknown covariates detected by ISVA may simply be accounting for known technical artifacts of sequencing. We tested this possibility and demonstrate that, while including technical artifacts as covariates does improve eQTL detection over known covariates alone (2.4% increase at p < 0.05, increase in π1 from 0.217 to 0.308), both PCs and ISVs perform even better, demonstrating a 5.9% and 4.9% increase at p < 0.05, respectively, when compared to no covariate inclusion (Figure 2 & see Additional file 1: Table S2). These data ultimately support the inclusion of covariates, as captured by data decomposition methods, in downstream analyses, suggesting that such methods are either (a) accounting for unknown covariates beyond technical sequencing artifacts or (b) appropriately weighting the effects of the technical artifacts amongst the ISVs/PCs generated.
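A simple way to run this check is to rank-correlate each per-sample sequencing metric against each surrogate variable and flag strong correlations; the file names and column layout below are placeholders, not the tables used in this study.

```python
import pandas as pd
from scipy.stats import spearmanr

# Placeholder per-sample tables: Picard-style sequencing metrics and the
# estimated surrogate variables (ISVs), aligned on a shared sample ID column.
metrics = pd.read_csv("seq_metrics.csv", index_col="sample")
isvs = pd.read_csv("isvs.csv", index_col="sample").loc[metrics.index]

# Report metric/ISV pairs with strong rank correlation, which would suggest the
# surrogate variables are partly capturing technical sequencing artifacts.
for m in metrics.columns:
    for s in isvs.columns:
        rho, p = spearmanr(metrics[m], isvs[s])
        if abs(rho) > 0.5:
            print(f"{m} vs {s}: rho={rho:.2f}, p={p:.1e}")
```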
As noted above, sample outliers were also identified on a per-gene basis and removed from analysis. To ensure that removing these outliers was biologically sound and that they did not represent true measures of differential expression, we tested data sets in which sample outliers were removed at each gene using our eQTL replication approach. While per-gene outlier removal did not demonstrate a marked increase or decrease in eQTLs detected (Figure 2 & Additional file 1: Table S2), the presence of outlier samples led to a lack of robustness in the case–control analysis, where single samples dramatically skewed the results (see Additional file 1, Figure 2). As per-gene outlier removal helped to stabilize the case–control analyses and did not hinder our ability to detect known eQTLs, we support its inclusion in RNA-Seq data analysis.
Independent RNA-Seq data set supports use of eQTL gold standards
To bolster the results of our brain RNA-Seq data set, we set out to replicate the main findings of our initial analysis in an independent RNA-Seq data set generated from a distinct tissue source. To do this, we used 162 blood samples from the GTEx project [9], for which we had DNA genotypes as well as raw count data from RNA-Seq. In these data, four samples (2.5% of total) were identified as PC outliers, using the same criteria as in the brain data. Sample outlier removal led to a slight decrease in the number of eQTLs detected (29.6% versus 28.3% at p < 0.05); however, there was an increase in π1 (0.374 to 0.387 after outlier removal) (Figure 4 & Additional file 1: Table S3). Normalizing with CQN again led to an overall increase in eQTLs detected (3.5% increase at p < 0.05) (Figure 4 & Additional file 1: Table S3). In assessing covariate addition, a pattern similar to the brain data emerged: known covariates (age, sex, and cohort) did not improve eQTL detection, whereas the addition of PCs to account for unknown covariates again produced an improvement (9.3% increase when compared to the use of no covariates) (Figure 4 & Additional file 1: Table S3). Again, per-gene outlier removal did not hamper the ability to detect known eQTLs (Figure 4 & Additional file 1: Table S3).