Data structure
Suppose there are D ChIP-seq datasets generated using cells from the same individual and the same cell type. Each dataset d corresponds to one TF or HM, and has J_d replicate samples (Figure 1a). Different datasets represent different TFs or HMs, or data generated by different labs. For the individual in question, assume one is interested in analyzing I heterozygous SNPs with known genotypes. We want to know whether the two alleles of each SNP behave differently in each dataset, and if possible how the AS events are correlated among datasets. For each SNP, the allele consistent with the reference genome is called the reference allele, and the other allele is called the non-reference allele.
After read mapping and data preprocessing (see Additional file 1: Supplemental Methods S.1), we count reads for each allele at each heterozygous SNP. For SNP i, dataset d and replicate sample j, let x_idj and y_idj be the read counts for the reference allele and the non-reference allele respectively. Let n_idj = x_idj + y_idj be the total read count (see Figure 1a for a toy example). Protein-DNA binding can be skewed to the reference allele (SR), skewed to the non-reference allele (SN), or not allele-specific (NS). We use a binary variable b_id to indicate whether SNP i is SR (b_id = 1) or not (b_id = 0) in dataset d. If b_id = 1, then SNP i is assumed to be SR in all replicate samples in dataset d. Similarly, we introduce another binary indicator c_id to indicate whether SNP i is SN or not in dataset d. b_id and c_id cannot both be equal to one at the same time. If b_id = 0 and c_id = 0, then SNP i is NS in dataset d. The configuration at each SNP i can be described by two vectors B_i = (b_i1, ⋯, b_iD) and C_i = (c_i1, ⋯, c_iD) (see Figure 1d for a cartoon illustration). Based on these notations, (x_idj, y_idj), or equivalently (x_idj, n_idj), are the observed data for SNP i in sample (d,j), whereas the indicators b_id and c_id are unobserved.
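As an illustration, the data layout described above can be sketched in Python. The variable names and dimensions are our own choices, and the counts are simulated rather than real data:

```python
import numpy as np

# Toy layout: D datasets, dataset d has J_d replicates, each covering I SNPs.
# x[d][i, j] = reference-allele reads for SNP i, replicate j of dataset d;
# n[d][i, j] = total reads; the non-reference count is y = n - x.
I, D = 5, 2
J = [3, 2]  # J_d replicates per dataset
rng = np.random.default_rng(0)
n = [rng.integers(5, 30, size=(I, J[d])) for d in range(D)]
x = [rng.binomial(n[d], 0.5) for d in range(D)]  # simulated reference counts
y = [n[d] - x[d] for d in range(D)]              # non-reference counts

# Each (x_idj, n_idj) pair is the observed data for SNP i in sample (d, j).
assert all((x[d] + y[d] == n[d]).all() for d in range(D))
```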
Main intuition and challenge
Our primary goal is to infer for each SNP whether there is allelic imbalance in each dataset. This is equivalent to inferring b_id and c_id. A simple solution to this problem is to analyze each individual dataset separately, but this approach has low statistical power since the counts (x_idj, n_idj) are usually small.
If one knows how different datasets are correlated in terms of allelic imbalance, this knowledge may be used to improve the data analysis. For instance, if the allelic imbalance of two proteins A and B are closely correlated, then observing skewed read counts for protein A will provide information for inferring the allelic imbalance of protein B. Integrating the data from both A and B will increase the effective number of reads available for statistical inference, which will then lead to increased statistical power.
In reality, how different proteins are correlated is usually unknown. However, one may learn it by studying the data from many SNPs. Each SNP has three possible states in each dataset: SR, SN and NS. For D datasets, there are 3^D possible configurations in total. From studying many SNPs, one can learn the relative frequencies (or mixing proportions) of these 3^D configurations. The mixing proportions tell how different datasets are correlated. For instance, let [s_1, s_2, ⋯, s_D] be the skewness configuration of a SNP in the D datasets. If the mixing proportions for the three configurations [NS,NS,⋯,NS], [SR,SR,⋯,SR] and [SN,SN,⋯,SN] are 0.9, 0.05 and 0.05, then no other configurations exist in the data and all datasets are perfectly correlated in terms of allelic imbalance. In other words, at a particular SNP, if one dataset is SR, then all the other datasets are also SR; if one is SN, then all the others are also SN. On the other hand, if other configurations have non-zero mixing proportions, then not all datasets are perfectly correlated, and at a particular SNP one allows the possibility that only a subset of datasets are correlated. For instance, if the mixing proportion for the configuration [SR,SR,NS,⋯,NS] is 0.03, then 3% of SNPs will be skewed to the reference allele in the first two datasets but not skewed in the other datasets. Therefore, knowing the mixing proportions of all 3^D configurations reveals the correlation structure in the data. This knowledge can then be used to improve statistical inference at each individual SNP by facilitating information sharing across datasets. For example, if the configuration [SR,SR,SN] has a much higher mixing proportion than [SR,SR,NS], then observing strong skewness towards the reference allele of a SNP in the first two datasets implies that, a priori, the SNP is highly likely to be skewed to the non-reference allele in the third dataset and has much lower probability of being non-skewed for both alleles. The principle here is the same as that of Bayesian hierarchical models in the statistical literature.
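A toy calculation (hypothetical mixing proportions, D = 2) shows how the configuration frequencies encode the correlation:

```python
from itertools import product

# Hypothetical mixing proportions over the 3^D configurations (D = 2 here).
# All mass sits on the three "matched" configurations, so the two datasets
# are perfectly correlated in their allele-specificity.
pi = {c: 0.0 for c in product(['NS', 'SR', 'SN'], repeat=2)}
pi[('NS', 'NS')], pi[('SR', 'SR')], pi[('SN', 'SN')] = 0.9, 0.05, 0.05

def conditional(state2, state1):
    """Pr(dataset 2 is in state2 | dataset 1 is in state1)."""
    num = sum(p for c, p in pi.items() if c == (state1, state2))
    den = sum(p for c, p in pi.items() if c[0] == state1)
    return num / den

print(conditional('SR', 'SR'))  # 1.0: SR in dataset 1 implies SR in dataset 2
```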
A limitation of this approach is that one has to enumerate all 3^D AS configurations in order to describe the correlation. As the number of datasets increases, the number of possible configurations grows exponentially, so this approach does not scale well with increasing D. Later, in our analysis of the GM12878 data, D = 40 and 3^D > 10^19. This simple approach is clearly intractable.
To circumvent the difficulty of documenting the frequencies of all 3^D configurations, iASeq employs a technique that describes the major correlation patterns in the data using a few probability vectors whose values vary continuously from 0 to 1 rather than being dichotomous (i.e., 0 or 1). This approach significantly reduces the model complexity but retains the flexibility to account for all 3^D configurations, and it scales easily to increasing dataset numbers. The correlation structure in the model can then be used to improve the statistical inference of allelic imbalance at each SNP in each individual dataset.
Probability model
iASeq is based on the Bayesian hierarchical mixture model below, which uses several probability vectors to describe the major correlation patterns among multiple datasets (Figure 1). The model assumes that SNPs can be grouped into K + 1 classes with different allele-specificity patterns (K ≪ 3^D), and the observed data are viewed as generated as follows:
● First, a class label a_i is randomly assigned to each SNP i according to a probability vector Π = (Π_0, Π_1, ⋯, Π_K). Here, Π_k = Pr(a_i = k) is the prior probability of assigning a SNP to class k.
● If the class label a_i = 0, then B_i = (0,⋯,0) and C_i = (0,⋯,0). In other words, all SNPs in class 0 are background SNPs, and they are NS in all datasets. If a_i = k and k ≠ 0, then SNP i can be skewed, and its [b_id; c_id]s in different datasets are generated independently according to the following probabilities: Pr(b_id = 1, c_id = 0 | a_i = k) = v_kd and Pr(b_id = 0, c_id = 1 | a_i = k) = w_kd. We assume v_kd + w_kd < 1, i.e., Pr(b_id = 0, c_id = 0 | a_i = k) = 1 − v_kd − w_kd > 0. The model implies that each class is associated with two vectors of probabilities V_k = (v_k1, ⋯, v_kD) and W_k = (w_k1, ⋯, w_kD). For SNPs in class k, B_i and C_i are generated according to the probabilities in V_k and W_k.
● Next, the observed read counts are generated based on the AS configurations specified by the B_is and C_is. Consider SNP i and dataset d. If b_id = 1, then (x_idj, n_idj) in each replicate sample (d,j) is generated according to a probability distribution Pr(x_idj, n_idj | b_id = 1, c_id = 0) = Pr(n_idj | b_id = 1, c_id = 0) Pr(x_idj | n_idj, b_id = 1, c_id = 0) ≡ Pr(n_idj) f_idj1(x_idj). Here we assume that the marginal distribution of n_idj does not depend on b_id and c_id, and we use f_idj1(x_idj) to denote the conditional distribution Pr(x_idj | n_idj, b_id = 1, c_id = 0). Data in different replicate samples are assumed to be generated independently. Similarly, if c_id = 1, then the (x_idj, n_idj)s are generated according to Pr(x_idj, n_idj | b_id = 0, c_id = 1) = Pr(n_idj) f_idj2(x_idj). If b_id = 0 and c_id = 0, then the (x_idj, n_idj)s are generated according to Pr(x_idj, n_idj | b_id = 0, c_id = 0) = Pr(n_idj) f_idj0(x_idj).
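The generative steps above can be sketched as a small simulation. The parameter values are hypothetical, and only the class labels and skewing indicators are simulated, not the read counts:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 2
Pi = np.array([0.90, 0.06, 0.04])       # (Pi_0, Pi_1, ..., Pi_K), sums to 1
V = np.array([[0.8, 0.7, 0.1, 0.1],     # v_kd = Pr(SR in dataset d | class k)
              [0.1, 0.1, 0.8, 0.1]])
W = np.array([[0.1, 0.1, 0.8, 0.1],     # w_kd = Pr(SN in dataset d | class k)
              [0.7, 0.6, 0.1, 0.1]])    # note v_kd + w_kd < 1 everywhere

def draw_snp():
    a = rng.choice(K + 1, p=Pi)         # class label a_i
    b = np.zeros(D, dtype=int)          # SR indicators b_id
    c = np.zeros(D, dtype=int)          # SN indicators c_id
    if a > 0:                           # class 0 stays NS in all datasets
        for d in range(D):              # conditionally independent given a_i
            u = rng.random()
            if u < V[a - 1, d]:
                b[d] = 1
            elif u < V[a - 1, d] + W[a - 1, d]:
                c[d] = 1
    return a, b, c

a, b, c = draw_snp()
assert not np.any(b & c)                # SR and SN are mutually exclusive
```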
For SNP i and dataset d, we organize the data from all replicates j = 1, ⋯, J_d into X_id = (x_id1, ⋯, x_idJ_d) and N_id = (n_id1, ⋯, n_idJ_d). For SNP i, X_i = (X_i1, ⋯, X_iD) and N_i = (N_i1, ⋯, N_iD) contain the data from all datasets. The final observed data are X = (X_1, ⋯, X_I) and N = (N_1, ⋯, N_I), the ensemble of data from all SNPs.
Let A = (a_1, ⋯, a_I) be the collection of class membership indicators of all SNPs, and let B = (B_1, ⋯, B_I) and C = (C_1, ⋯, C_I) be the SR and SN indicators for all SNPs. A, B and C are the unobserved missing data one wants to infer.
Organize the probability vectors V_k and W_k from the different classes into two matrices V = (v_kd) and W = (w_kd). V, W, and the probability vector Π that describes the class abundance are the unknown model parameters. K is assumed to be fixed. The choice of K and the specification of the data generating distributions Pr(n_idj), f_idj0(x_idj), f_idj1(x_idj) and f_idj2(x_idj) will be discussed later.
Based on this model, each SNP class k (k ≠ 0) is associated with two vectors of probabilities V_k and W_k which characterize the allelic imbalance preferences in different datasets for SNPs belonging to class k. For example, if a class has [V_k; W_k] = [(0.8, 0.7, 0.1, 0.1); (0.1, 0.1, 0.8, 0.1)], then SNPs in this class have high probability of being SR in datasets 1 and 2, high probability of being SN in dataset 3, but low probability of being allele-specific in dataset 4. Since V_k and W_k are probabilities rather than 0-1 vectors, each class k can generate all 3^D AS configurations. Therefore, SNPs in the same class are not required to have the same AS configuration (e.g., a class can contain one SNP with configuration [SR,SR,NS,NS] and at the same time another SNP with configuration [SR,NS,SR,NS]), although they usually have similar AS configurations because SNPs in the same class are all generated using the same probability vectors. Meanwhile, there are K different classes, and each class has a different [V_k; W_k] which specifies a different preference for generating the skewing configurations. Thus, whereas SNPs in the same class tend to have similar [B_i; C_i] configurations, SNPs from different classes tend to have very different configurations. Conceptually, this is similar to a model-based clustering analysis in which SNPs are grouped into K + 1 clusters based on their [B_i; C_i] configurations. An important difference here, however, is that the [B_i; C_i]s are unknown.
Our model assumes that the [b_id; c_id]s of the same SNP in different datasets are a priori independent conditional on the class membership a_i. However, the [b_id; c_id]s from different datasets are not independent marginally once one integrates out the class label a_i. For example, the marginal probability Pr([b_id; c_id] = [1;0]) = Σ_{k=1}^{K} Π_k v_kd. On the other hand, the joint probability Pr([b_id; c_id] = [1;0], [b_id′; c_id′] = [1;0]) = Σ_{k=1}^{K} Π_k v_kd v_kd′, which is clearly different from the product of the marginals (Σ_{k=1}^{K} Π_k v_kd)(Σ_{k=1}^{K} Π_k v_kd′). This explains why our model can describe the correlation among multiple datasets despite the conditional independence assumption. Intuitively, if one views the model as a clustering analysis of SNPs based on [B_i; C_i], then each cluster represents a co-occurrence pattern of allele-specificity across multiple proteins. The marginal correlation among multiple datasets is described by multiple clusters, whereas within each cluster the data in different datasets are generated independently. In real data, a small K (i.e., a small number of SNP classes) usually is sufficient to describe the major correlation structure among datasets. Using Π, V and W to describe the correlation among datasets requires only O(KD) parameters, which is significantly less complex than O(3^D) parameters. At the same time, the iASeq model still provides the flexibility to accommodate all 3^D possible [B_i; C_i] configurations, as all of them have non-zero probability of occurring.
Data generating distributions
To fully specify the model, one also needs to specify the data generating distributions Pr(x_idj, n_idj | b_id, c_id) = Pr(n_idj) Pr(x_idj | n_idj, b_id, c_id). The primary goal of iASeq is to infer whether the two alleles are different. We assume that the information on allele-specificity is contained only in Pr(x_idj | n_idj, b_id, c_id), and therefore the exact form of Pr(n_idj), i.e., the marginal probability distribution of the total read count, is irrelevant for our purpose. As such, we mainly focus on modeling the conditional distribution of x_idj given n_idj, b_id and c_id, i.e., the three distributions f_idj0(x), f_idj1(x) and f_idj2(x).
iASeq models these distributions hierarchically in two steps. First, x_idj is assumed to follow a binomial distribution x_idj | n_idj, p_idj ∼ Bin(n_idj, p_idj), where p_idj is the probability that a read generated at SNP i in sample (d,j) represents the reference allele. Next, we model p_idj depending on the values of b_id and c_id.
If b_id = 0 and c_id = 0, SNP i is NS in dataset d. In this case, we assume that p_idj follows a Beta distribution Beta(α_dj, β_dj) with mean p_dj0 = α_dj/(α_dj + β_dj). Note that a simpler model for p_idj would set it to a constant p_dj0 reflecting the background ratio of read counts between the two alleles. However, previous studies have shown that many background SNPs can have p_idj slightly different from the average background p_dj0 even though they do not have biologically meaningful allele-specificity [33]. As a result, a constant p_dj0 is not sufficient to describe the background variation. For this reason, we adopt the Beta distribution to describe p_idj instead of setting it to a constant (see the blue lines illustrating f(p_idj | b_id = 0, c_id = 0) in Figure 1c). Ideally, the mean of the Beta distribution, p_dj0, would be equal to 0.5. In reality, however, p_dj0 may deviate slightly from 0.5 due to various sources of read mapping bias. For example, allowing the same number of mismatches, reads from the reference allele are more easily mapped back to the reference genome than reads from the non-reference allele. Therefore, in iASeq p_dj0 may take values different from 0.5. It is determined by the parameters α_dj and β_dj of the Beta distribution, which are estimated from the data using a moment matching approach (see Additional file 1: Supplemental Method S.2). Once estimated, α_dj, β_dj and p_dj0 are treated as fixed and known parameters. Based on the model for p_idj, we integrate out all possible values of p_idj to obtain the distribution of x_idj conditional on b_id = 0 and c_id = 0, which is a beta-binomial distribution:
f_idj0(x_idj) = C(n_idj, x_idj) B(x_idj + α_dj, n_idj − x_idj + β_dj) / B(α_dj, β_dj). (1)
Here C(n, x) is the binomial coefficient "n choose x", and B(·,·) is the beta function.
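The beta-binomial density of formula (1) can be evaluated stably in log space. A minimal sketch using only the Python standard library, with hypothetical values for α_dj and β_dj:

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_f0(x, n, alpha, beta):
    """Log of the beta-binomial density in formula (1):
    C(n, x) * B(x + alpha, n - x + beta) / B(alpha, beta)."""
    log_comb = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return log_comb + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta)

# Sanity checks with hypothetical alpha_dj, beta_dj: the probabilities over
# x = 0..n sum to one, and the mean equals n * alpha / (alpha + beta).
n, alpha, beta = 20, 30.0, 28.0
probs = [exp(log_f0(x, n, alpha, beta)) for x in range(n + 1)]
assert abs(sum(probs) - 1.0) < 1e-10
```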
If b_id = 1 and c_id = 0, SNP i is SR in dataset d. In this case, we assume that p_idj follows a uniform distribution U[p_dj0, 1] (see the dark blue lines illustrating f(p_idj | b_id = 1, c_id = 0) in Figure 1c). Here p_dj0 = α_dj/(α_dj + β_dj) is defined as above. After integrating out p_idj, the distribution of x_idj conditional on b_id = 1 and c_id = 0 is
f_idj1(x_idj) = [C(n_idj, x_idj) / (1 − p_dj0)] ∫_{p_dj0}^{1} p^{x_idj} (1 − p)^{n_idj − x_idj} dp. (2)
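A sketch of this distribution, i.e., the binomial likelihood averaged over p ~ Uniform[p_dj0, 1], evaluated by direct numerical integration (the value of p_dj0 below is an arbitrary illustrative choice):

```python
from math import comb

def f1(x, n, p0, grid=2000):
    """f_idj1 sketched as in formula (2): Bin(n, p) averaged over
    p ~ Uniform[p0, 1], computed by trapezoidal quadrature."""
    h = (1.0 - p0) / grid
    ps = [p0 + i * h for i in range(grid + 1)]
    vals = [p ** x * (1.0 - p) ** (n - x) for p in ps]
    integral = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
    return comb(n, x) * integral / (1.0 - p0)

# Sanity check: the densities over x = 0..n sum to one.
n, p0 = 15, 0.52
total = sum(f1(x, n, p0) for x in range(n + 1))
assert abs(total - 1.0) < 1e-6
```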
If b_id = 0 and c_id = 1, SNP i is SN in dataset d, and we assume that p_idj follows a uniform distribution U[0, p_dj0] (see the light blue lines illustrating f(p_idj | b_id = 0, c_id = 1) in Figure 1c). After integrating out p_idj, the distribution of x_idj conditional on b_id = 0 and c_id = 1 is
f_idj2(x_idj) = [C(n_idj, x_idj) / p_dj0] ∫_{0}^{p_dj0} p^{x_idj} (1 − p)^{n_idj − x_idj} dp. (3)
Joint probabilities and model fitting
Based on the model above, the complete data likelihood can be derived as:

Pr(X, A, B, C | N, Π, V, W) = ∏_i Pr(a_i | Π) ∏_d Pr(b_id, c_id | a_i, V, W) ∏_j Pr(x_idj | n_idj, b_id, c_id). (4)

Define f_id0(X_id) = ∏_j f_idj0(x_idj), f_id1(X_id) = ∏_j f_idj1(x_idj) and f_id2(X_id) = ∏_j f_idj2(x_idj). Define δ(·) to be an indicator function: δ(·) = 1 if its argument is true, and δ(·) = 0 otherwise. We have

Pr(X, A, B, C | N, Π, V, W) = ∏_i ∏_{k=0}^{K} { Π_k ∏_d [v_kd f_id1(X_id)]^{δ(b_id=1)} [w_kd f_id2(X_id)]^{δ(c_id=1)} [(1 − v_kd − w_kd) f_id0(X_id)]^{δ(b_id=0, c_id=0)} }^{δ(a_i=k)}, (5)

with the convention v_0d = w_0d = 0 for the background class.
To infer Π, V and W, we employ a Bayesian approach, imposing a Dirichlet prior D(η,⋯,η) on Π and independent Dirichlet priors D(η,η,η) on all triplets (v_kd, w_kd, 1 − v_kd − w_kd). The joint posterior distribution of the unknown parameters and indicators given the observed data is:

Pr(Π, V, W, A, B, C | X, N) ∝ [∏_k Π_k^{η−1}] [∏_k ∏_d v_kd^{η−1} w_kd^{η−1} (1 − v_kd − w_kd)^{η−1}] Pr(X, A, B, C | N, Π, V, W). (6)
Conditional on the observed data, Pr(N) is a constant that does not involve the parameters of interest; therefore it is absorbed into a proportionality constant not shown in the formula above. Using this joint posterior, an EM algorithm can be derived to search for the posterior mode of (Π, V, W) in which the missing indicators A, B and C are all integrated out (see Additional file 1: Supplemental Method S.4).
For the Dirichlet prior, we use η=2 (see Additional file 1: Supplemental Method S.3 for a discussion on the choice of parameter for the Dirichlet prior). In the EM algorithm, we assume that the class number K is given. In order to choose the optimal K, we run the algorithm multiple times using different values of K. We choose the best K using the Bayesian Information Criterion (BIC) (see Additional file 1: Supplemental Method S.5).
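The model-selection step over K can be sketched as follows. The parameter count used here (K free entries of Π plus the 2KD entries of V and W) and the log-likelihood values are our own illustrative assumptions:

```python
from math import log

def bic(log_likelihood, K, D, I):
    """BIC sketch for choosing K: -2 * logL + n_params * log(I),
    where n_params counts K free entries of Pi plus 2*K*D for V and W
    (an assumed parameterization for illustration)."""
    n_params = K + 2 * K * D
    return -2.0 * log_likelihood + n_params * log(I)

# Hypothetical maximized log-likelihoods for fitted models with K = 1, 2, 3
# on I = 1000 SNPs and D = 4 datasets; pick the K minimizing BIC.
fits = {1: -5400.0, 2: -5320.0, 3: -5310.0}
best_K = min(fits, key=lambda K: bic(fits[K], K, D=4, I=1000))
print(best_K)  # 2: the gain from K = 3 does not justify the extra parameters
```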
Statistical inference of allele-specificity
The estimated Π, V and W describe the correlation patterns of allele-specificity among datasets. Given Π, V and W, one can infer whether SNP i belongs to class k based on the posterior probability Pr(a_i = k | X_i, N_i, Π, V, W) (see Additional file 1: Supplemental Method S.4, equations S.12-S.13). One can then infer whether each SNP i is skewed in each individual dataset d based on the posterior probability Pr(b_id, c_id | X_i, N_i, Π, V, W) after summing over all possible values of a_i (see Additional file 1: Supplemental Method S.4, equation S.14). Note that

Pr(b_id, c_id | X_i, N_i, Π, V, W) = Σ_k Pr(a_i = k | X_i, N_i, Π, V, W) Pr(b_id, c_id | a_i = k, X_i, N_i, Π, V, W). (7)

Define

P_id = Pr(b_id = 1, c_id = 0 | X_i, N_i, Π, V, W) + Pr(b_id = 0, c_id = 1 | X_i, N_i, Π, V, W). (8)

Using P_id, SNPs can be rank ordered for biologists to choose candidates for follow-up studies. For each top ranked SNP, one can determine its skewing direction by comparing Pr(b_id = 1, c_id = 0 | X_i, N_i, Π, V, W) and Pr(b_id = 0, c_id = 1 | X_i, N_i, Π, V, W): the larger value determines the direction. Finally, the posterior probabilities of the top N SNPs can be converted to an estimate of the false discovery rate (FDR) using FDR ≈ Σ_{top N} (1 − P_id) / N.
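The FDR conversion can be sketched directly; the posterior probabilities of allele-specificity below are made up for illustration:

```python
def estimated_fdr(post_probs, top_n):
    """FDR estimate for the top-N SNPs ranked by their posterior
    probability of allele-specificity: the average of (1 - P) over
    the N highest posterior probabilities."""
    ranked = sorted(post_probs, reverse=True)[:top_n]
    return sum(1.0 - p for p in ranked) / top_n

# Five SNPs with hypothetical posterior probabilities; reporting the top 3
# yields an estimated FDR of (0.01 + 0.05 + 0.10) / 3.
probs = [0.99, 0.95, 0.90, 0.40, 0.10]
fdr = estimated_fdr(probs, 3)
assert abs(fdr - (0.01 + 0.05 + 0.10) / 3) < 1e-9
```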
Formula (7) shows that two types of information contribute to Pr(b_id, c_id | X_i, N_i, Π, V, W): (1) Pr(a_i = k | X_i, N_i, Π, V, W), which is determined using information from all D datasets, and (2) Pr(b_id, c_id | a_i = k, X_i, N_i, Π, V, W), which uses only information specific to dataset d conditional on Π, V and W. Thus, for each particular dataset d, the dataset-specific information is weighted by information obtained from the other datasets to determine the SNP ranking. Intuitively, if allelic imbalance in two datasets is correlated, then observing an AS event in one dataset suggests that a relatively weak skewing event observed at the same SNP in the other dataset is very likely to be a true AS event. In contrast, if no AS event is observed in one dataset, then a relatively weak skewing event observed at the same SNP in the other dataset is likely to be a false positive. This is the underlying rationale for using Pr(a_i = k | X_i, N_i, Π, V, W) to re-weight the information in Pr(b_id, c_id | a_i = k, X_i, N_i, Π, V, W), and it provides the foundation for improving SNP ranking by borrowing information across datasets. In real applications, Π, V and W are unknown, and they are replaced by the posterior mode obtained from the EM algorithm.