A flexible Bayesian method for detecting allelic imbalance in RNA-seq data

Background One method of identifying cis regulatory differences is to analyze allele-specific expression (ASE) and identify cases of allelic imbalance (AI). RNA-seq is the most common way to measure ASE and a binomial test is often applied to determine statistical significance of AI. This implicitly assumes that there is no bias in estimation of AI. However, bias has been found to result from multiple factors including: genome ambiguity, reference quality, the mapping algorithm, and biases in the sequencing process. Two alternative approaches have been developed to handle bias: adjusting for bias using a statistical model and filtering regions of the genome suspected of harboring bias. Existing statistical models which account for bias rely on information from DNA controls, which can be cost prohibitive for large intraspecific studies. In contrast, data filtering is inexpensive and straightforward, but necessarily involves sacrificing a portion of the data. Results Here we propose a flexible Bayesian model for analysis of AI, which accounts for bias and can be implemented without DNA controls. In lieu of DNA controls, this Poisson-Gamma (PG) model uses an estimate of bias from simulations. The proposed model always has a lower type I error rate compared to the binomial test. Consistent with prior studies, bias dramatically affects the type I error rate. All of the tested models are sensitive to misspecification of bias. The closer the estimate of bias is to the true underlying bias, the lower the type I error rate. Correct estimates of bias result in a level alpha test. Conclusions To improve the assessment of AI, some forms of systematic error (e.g., map bias) can be identified using simulation. The resulting estimates of bias can be used to correct for bias in the PG model, without data filtering. Other sources of bias (e.g., unidentified variant calls) can be easily captured by DNA controls, but are missed by common filtering approaches. Consequently, as variant identification improves, the need for DNA controls will be reduced. Filtering does not significantly improve performance and is not recommended, as information is sacrificed without a measurable gain. The PG model developed here performs well when bias is known, or slightly misspecified. The model is flexible and can accommodate differences in experimental design and bias estimation. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-920) contains supplementary material, which is available to authorized users.


Poisson Synthetic Data
This supplement describes the way Poisson synthetic data were generated and analyzed.

Data Generation
Let the subindices m and p denote maternal (D. melanogaster ) and paternal (D. simulans), respectively.
Data under three simulation scenarios were generated. Each is a variation of the simulation scheme presented below. In order to describe the scheme for generating synthetic data we introduce the following notation: and for the DNA counts are, Here B denote the bias parameter (B = 1/2 is no bias). In particular, when B = 1/2, the expected maternal DNA counts and the expected paternal DNA counts are the same. Let R = µ m /µ p be the ratio of the RNA maternal mean counts and RNA paternal mean counts after correcting for bias. So R = 1 when there is no AI; and, for example, when R = 2 the RNA maternal counts have mean twice the paternal counts mean, if there were no systematic bias (B = 1/2), or after correcting for it. In all scenarios we randomly draw 10 4 exons from the real dataset. We use the RNA and DNA real counts to define RNA and DNA 1 mean counts µ m , µ p and µ DN A in the simulated datasets. We use the sampling distributions (1) and (2) to generate data for every exon. The full notation is µ m,g , µ p,g , and µ DN A,g with g = 1, 2, . . . , 10 4 , but, to simplify notation, we omit the exon subindex g. For every exon, we define the parameters of the sampling distributions above, (1) and (2), in the following way: 4. We simulate DNA data only in Scenario 1, there we describe how we do so.

Scenario 1: No AI and No Bias
We simulate a dataset with 10 4 exons and no AI, i.e. R ≡ 1, and no bias, i.e. B = 1/2, or, equivalently, with a = b = 1 in (3). In addition, we generated DNA simulated counts as we did with the RNA counts but, to mimic the real data DNA coverage, setting µ T,DN A ≡ i (x i ,DN A + y i ,DN A )/I (defined as µ T in 1 Subsection 1.1 but with the DNA, instead of RNA, data) equal to the maternal real sample DNA mean counts+paternal real sample DNA mean counts, we then set µ DN A = (1/2)µ T,DN A .
We analyze this data set with 1. Negative binomial model in [1] with the parameter p random (plays the same roll of q or φ in the PG model) .
2. PG model with φ (random). Remember that, for comparison, we consider φ assuming the same DNA model that the negative binomial model in [1].
We estimate the type I error rate (TIER) of each procedure as the proportion of exons where the procedure finds significant AI. The binomial exact test concludes an exon shows AI if its p-value is less than 0.05. The PG and Negative binomial models conclude an exon shows AI if the central 95% credible interval for the proportion of paternal reads does not contain 0.5. Table 1 in the paper shows the results. Additional file 1: Figure S1: The datasets of scenarios 2 and 3 were merged to obtain a dataset with 10 4 exons not in AI and 10 4 exons in AI. For each exon we compute the "Bayesian pvalue"= 2 min{P [α > 1 | data], P [α < 1 | data]}, this is, the minimum value, such that the central 1−"Bayesian pvalue" credible interval does not contain α = 1 (See [2]), remember α is a parameter in the PG model. We used these "Bayesian p-values" under the PG model with q = B and under the PG model with q = 1/2 as classifiers in the ROC curve. We used the p-values of the exact binomial test as classifier for a third ROC curve.