iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

Background ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB.

the gene annotation file. The exonic ASE SNPs were determined using RNA-seq. For a given RNA-seq dataset, the naive Bayes statistic was calculated for each exonic SNP. The top 400 exonic SNPs (6.61% of the total 6051 exonic SNPs) ranked based on the naive Bayes statistic were called as exonic ASE SNPs. Subsequently, in the analysis of ChIP-seq data, a SNP was claimed to be a true positive if there was an exonic ASE SNP within its Xkb neighborhood. Results for X=10kb are shown in the main manuscript and results for X=1kb are shown in Additional File 7. Both criteria gave similar results.
For sample (d, j), each SNP has a p idj which can be roughly estimated asp idj = (x idj +2 * p is the number of SNPs in dataset d for which By matching p dj0 and v dj to the theoretical mean and variance of a Beta distribution, we obtain In principle, one may develop a more sophisticated algorithm to estimate α and β by fitting beta-binomial distributions to x idj |n idj , but the computation will be more involved. Therefore we did not pursue this solution and instead used the simple method described above to approximately estimate α and β.

S.3 Parameter choice for the Dirichlet prior
Although η = 1 can specify an uniform prior and seems to be a natural choice, it will make the EM algorithm numerically unstable. This is because the EM searches for posterior mode and is implemented on log scale. The mode of a Dirichlet distribution D(η 1 , · · · , η M ) for the m-th . It is not defined if all η m s are equal to one. As a result, if η = 1 is used as the prior, and when the expectation of the counts ∑ i δ(a i = k), in the E-step of the algorithm is close to zero, then the algorithm can easily lose its numerical stability due to issues such as log(0) and ill-defined posterior mode. These issues can be avoided by using η = 2 which still imposes a relatively non-informative prior.

S.4 The EM algorithm used in iASeq
This section presents the EM algorithm used to search for posterior mode (π,V ,Ŵ ) of the distribution P r(π, V , W |X, N ) = ∑ A,B,C P r(A, B, C, π, V , W |X, N ). In the EM algorithm, A, B and C are the missing data. The algorithm iterates between an E-step and an M-step.
In the E-step, one evaluates the Q-function Q(π, V , W |π old ,V old ,Ŵ old ) which is defined as Here the expectation is taken with respect to probability dis- andŴ old are the parameter estimates obtained from the last iteration.
When we use η = 2 in the Dirichlet priors for π, V and W , we have Therefore, In the M-step, one finds π, V and W that maximize the Q-function Q(π, V , W |π old ,V old ,Ŵ old ).
Denote them byπ new ,V new andŴ new . These will give the new parameter estimates.

S.3
By solving We haveπ In the formulas above, P r old (a i = k), P r old (a i = k, b id = 1) and P r old (a i = k, c id = 1) are computed as follows. To compute P r old (a From Equation 5 in the main manuscript, we have Therefore, P r old (a i = k) can be computed by replacing π, V and W byπ old ,V old andŴ old .
P r old (b id = 1|a i = k) and P r old (c id = 1|a i = k) can be obtained by plugging inπ old ,V old and W old into the formula above to replace π, V and W . S.4

S.5 Bayesian Information Criterion (BIC) for choosing K
We compute BIC as We calculate BIC for different values of K, and choose the K with the smallest BIC. Here K + 1 is the number of classes. K is also the number of parameters in π. 2KD is the number of parameters involved in V and W . I is the SNP number. Strictly speaking, the data likelihood also involve terms P r(N i |π, V , W ). However, based on our assumption, these terms do not depend on K, π, V and W , and can be reduced to P r(N i ). They can be viewed as constants for the purpose of choosing the optimal K. We do not include them in the BIC computation.

S.6 Data generation in simulation studies
To simulate a ASB SNP i, we first sampled a SNP from the 8166 non-background SNPs in the real GM12878 data. Here the non-background SNPs in the real data were determined by iASeq using P r(a i = 0|X i , N i , π, V , W ) < 0.5 as cutoff. Additionally, we also sampled a SNP from the 86,353 background SNPs in the real GM12878 data. Next, with these two real SNPs in hand, we went

S.7 The single dataset based EM analysis
This approach analyzes each dataset separately. Let X d = (X 1d , · · · , X Id ) and N d = (N 1d , · · · , N Id ) be the data from dataset d. We assumed that in each dataset d, a SNP i can be SR (b id = 1), SN and C d = (c 1d , · · · , c Id ) be the ensemble of all ASB indicators in dataset d. Adopting the same distributional assumption as in Equations 1-3 in the main mainuscript, the complete data likelihood can be derived as: , we obtain the posterior distribution of the unknown parameters and missing indicators: An EM algorithm can be similarly derived as in iASeq to estimate the parameters v d and w d by searching for the posterior mode of P r(v d , w d |X d , N d ).
In the E-step, we compute the Q-function Q(v d , w d |v old d ,ŵ old d ). Since We have Using the posterior mode, one can similarly compute P r(b id , c id |X d , N d , v d , w d ) andP id to detect and rank AS SNPs.