Problem formulation
Model DNA methylation profiles with beta mixture distribution
For a two-color hybridization-based array such as the Illumina Infinium array, the measurements reflect the percentage of methylated alleles; they are called "beta" values because they can be described by a mixture of beta distributions [7, 10]. Since the distribution of beta values is bimodal [11], each beta distribution component in the mixture model should be unimodal, which means each component should be equipped with large shape parameters (greater than one), as shown in Figure 1.
Consider the problem of clustering n independent DNA methylation samples. Let X = {X_1, X_2, ..., X_n} be the DNA methylation profiles of the n samples, where for each sample i, X_i = {x_{i1}, x_{i2}, ..., x_{iL}} is a vector of L continuous outcomes falling between zero and one. Suppose there exists a total of K clusters and sample i belongs to cluster class c_i ∈ {1, ..., K}. Conditional on class membership, say k, each outcome x_{il} can be viewed as an independent, identically distributed variable from a beta distribution with parameters α_{kl} and β_{kl},

f(x_{il} | α_{kl}, β_{kl}) = x_{il}^{α_{kl}−1} (1 − x_{il})^{β_{kl}−1} / B(α_{kl}, β_{kl}),    (1)

where B(·, ·) stands for the Beta function. Then, DNA methylation sample X_i can be modeled by (2),

f(X_i | Θ, π) = Σ_{k=1}^{K} π_k Π_{l=1}^{L} f(x_{il} | α_{kl}, β_{kl}),    (2)

where Θ = {α_{kl}, β_{kl} : k = 1, ..., K; l = 1, ..., L}. With the restriction to large parameters for the beta distribution components, α_{kl} > 1 and β_{kl} > 1. Note that, due to clustering in samples, θ_i and θ_l for i ≠ l may be equal; π_k represents the proportion of cluster k, and Σ_{k=1}^{K} π_k = 1. Now, in reality, the total cluster number K is not known a priori. We discuss next a model based on the Dirichlet process to address this difficulty.
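For concreteness, the per-sample likelihood defined by (1) and (2) can be evaluated as in the following minimal Python sketch (the function names and array layout are ours, for illustration only):

    import numpy as np
    from scipy.special import betaln, logsumexp

    def log_beta_pdf(x, alpha, beta):
        # Log of (1): x^(alpha-1) (1-x)^(beta-1) / B(alpha, beta), elementwise.
        return (alpha - 1) * np.log(x) + (beta - 1) * np.log(1 - x) - betaln(alpha, beta)

    def sample_log_likelihood(x_i, alphas, betas, pi):
        # Log of (2) for one sample x_i of length L; alphas and betas are
        # (K, L) arrays of shape parameters (all > 1), and pi is a (K,)
        # vector of cluster proportions summing to one.
        per_cluster = log_beta_pdf(x_i[None, :], alphas, betas).sum(axis=1)
        return logsumexp(np.log(pi) + per_cluster)

For example, with K = 2 and L = 3, alphas and betas are 2 × 3 arrays and pi = (0.4, 0.6); working on the log scale avoids underflow when L is large.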
Dirichlet process mixture model
The Dirichlet process is a nonparametric extension of the original Dirichlet distribution. Let x_i be a random sample from a distribution f with parameter θ_i. In a Bayesian formulation, the model for the parameter θ_i can be defined as

x_i | θ_i ~ f(x_i | θ_i),   θ_i ~ G,    (3)

where G is the prior distribution of θ_i. It is not always realistic to assume that G is of a known form, and nonparametric Bayesian models, including the Dirichlet process (DP), have been proposed to address this problem. Now, instead of defining a parametric form for G, G is assumed to be a draw from a Dirichlet process with a base distribution G_0 and a precision parameter τ [12]. The model for the Bayesian estimation is also built in Figure 2 following the principles of graphical models. It can be written as (4) with a DP prior,

x_i | θ_i ~ f(x_i | θ_i),   θ_i | G ~ G,   G ~ DP(G_0, τ),    (4)
where G_0 is such that E[G] = G_0 and has a parametric form, and τ measures the strength of belief in G_0. DP mixtures (DPM) have been proposed to model the clustering effect in data. Compared with other clustering models, the DPM is very attractive because it allows the cluster number K to be a priori infinite and learned from the data. To capture the clustering nature of DNA methylation samples, a beta mixture model with infinitely many classes can be built with a DPM. Let θ_i = {α_i, β_i} be the set of parameters for each sample, and note that some of them may be equal. In DPM models, each θ_i is marginally sampled from G_0, and with positive probability some of the θ_i are identical due to the discreteness of the random measure G. Therefore the new value of θ_i can either be one of the existing values θ_{−i}, or θ_i can be a new draw from G_0. Letting K in (2) be ∞, we obtain a Dirichlet process beta mixture model (DPBMM) for DNA methylation arrays.
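The clustering behavior implied by the DP prior can be illustrated with a short simulation (an illustrative Python sketch, not part of the model specification): marginally, each new θ_i either reuses an earlier value or is a fresh draw from G_0, with the balance controlled by τ.

    import numpy as np

    def polya_urn(n, tau, draw_g0, rng):
        # Sequentially draw theta_1, ..., theta_n from a DP(G0, tau) prior.
        thetas = []
        for i in range(n):
            if rng.random() < tau / (i + tau):
                thetas.append(draw_g0(rng))             # fresh draw from G0
            else:
                thetas.append(thetas[rng.integers(i)])  # reuse an earlier theta
        return thetas

    rng = np.random.default_rng(0)
    draws = polya_urn(100, tau=2.0, draw_g0=lambda r: r.normal(), rng=rng)
    print(len(set(draws)))  # distinct values grow roughly like tau * log(n)

Smaller τ concentrates the samples into fewer clusters, while larger τ favors new draws from G_0.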
Inference
Let Φ = {ϕ_1, ϕ_2, ..., ϕ_K} denote the set of distinct values of the θ_i, where K is the number of distinct elements among θ_1, ..., θ_m. Let s = {s_1, ..., s_m} denote the cluster assignment vector; that is, s_i = l if and only if θ_i = ϕ_l. Then θ = {θ_i : i = 1, ..., m} can be reparameterized as {ϕ_1, ..., ϕ_K, s_1, ..., s_m}. Let n_i, i = 1, ..., K, be the number of elements s_l equal to i, and let the subscript "−i" stand for all the variables except the i-th one. The goal from a Bayesian perspective is to calculate the posterior distribution of the unknown parameters {Θ, π, τ}. However, the analytical expression is intractable, and we instead develop a Gibbs sampling solution to obtain random samples from the posterior distribution. The key for Gibbs sampling is to derive the conditional posterior distributions of the unknown parameters. Due to the constraints on α and β, we first re-parameterize α as L_α by α = exp(|L_α|) and β as L_β by β = exp(|L_β|). Thus, we only need to sample in the range (−∞, ∞) for L_α and L_β, and the transformed α > 1 and β > 1. We can then specify G_0 as 𝒩(μ, σ²) on L_α and L_β, where 𝒩(μ, σ²) represents the Gaussian distribution with mean μ and variance σ² [13]. The prior distribution of the cluster proportion π is the Dirichlet distribution

π | τ ~ Dirichlet(τ/K, ..., τ/K).    (5)
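The re-parameterization and a draw from G_0 can be sketched as follows (illustrative Python; the hyperparameter values are placeholders):

    import numpy as np

    def draw_phi_from_g0(mu, sigma, L, rng):
        # Draw (L_alpha, L_beta) from Gaussians and map them to beta shape
        # parameters; exp(|L|) >= 1 enforces alpha > 1 and beta > 1.
        l_alpha = rng.normal(mu, sigma, size=L)
        l_beta = rng.normal(mu, sigma, size=L)
        return np.exp(np.abs(l_alpha)), np.exp(np.abs(l_beta))

    rng = np.random.default_rng(1)
    alpha, beta = draw_phi_from_g0(mu=1.0, sigma=1.0, L=5, rng=rng)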
There are several useful representations of a Dirichlet process, such as the Chinese restaurant process (CRP) [14, 15], the stick-breaking construction [16], and the Polya urn formulation [17, 18]. Blackwell showed that draws from a Dirichlet process are discrete, as they consist of countably infinite point probability masses [19]. Escobar and West [20] first showed that Markov chain Monte Carlo (MCMC) techniques, specifically Gibbs sampling, can be used for posterior density estimation if the Blackwell–MacQueen Polya urn formulation of the Dirichlet process is used. Based on the generalized Polya urn scheme, the conditional prior distributions s_i | s_1, ..., s_{i−1}, i = 1, ..., n, and θ_i | θ_{−i} have the forms (6) and (7),

P(s_i = k | s_1, ..., s_{i−1}) = n_k / (i − 1 + τ) for an existing cluster k, and P(s_i = k_{−i} + 1 | s_1, ..., s_{i−1}) = τ / (i − 1 + τ) for a new cluster,    (6)

and,

θ_i | θ_{−i} ~ (1 / (n − 1 + τ)) Σ_{j≠i} δ(θ_j) + (τ / (n − 1 + τ)) G_0,    (7)

where n_k is the number of elements among s_1, ..., s_{i−1} equal to k, and δ(θ_j) denotes a point mass at θ_j.
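As a worked instance of (6), the conditional prior over the next assignment depends only on the current cluster counts and τ (an illustrative Python sketch with made-up numbers):

    import numpy as np

    def assignment_prior(counts, tau):
        # P(s_i = k) is proportional to n_k for an existing cluster k and
        # proportional to tau for a new cluster, as in (6).
        weights = np.append(np.asarray(counts, dtype=float), tau)
        return weights / weights.sum()

    # With counts (5, 3, 2) over three clusters and tau = 1, the prior over
    # joining cluster 1, 2, 3 or opening a new one is (5, 3, 2, 1) / 11.
    print(assignment_prior([5, 3, 2], tau=1.0))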
Then the conditional posterior distribution for sampling θ_i has the form

θ_i | θ_{−i}, X_i ~ Σ_{j≠i} q_{i,j} δ(θ_j) + q_{i,0} G_i,    (8)

where q_{i,j} ∝ f(X_i | θ_j), q_{i,0} ∝ τ ∫ f(X_i | θ) dG_0(θ), and G_i is the posterior distribution of θ given X_i under the prior G_0. Thus the conditional posterior distribution for sampling ϕ_i has the form

p(ϕ_i | s, X) ∝ G_0(ϕ_i) Π_{j: s_j = i} f(X_j | ϕ_i).    (9)

It is obvious that G_0 is not conjugate with f, so the integral q_{i,0} cannot be evaluated analytically, and drawing samples from G_i is also extremely challenging [21]. To overcome this difficulty, we adopt the "no-gaps" algorithm proposed in [22] to enable sampling from (8).
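The no-gaps algorithm sidesteps the intractable integral q_{i,0} by carrying an explicit candidate parameter drawn from G_0, so the assignment step reduces to evaluating likelihoods. A minimal Python sketch of such an update follows (this is our reading of the scheme in [22]; the τ/(k_{−i} + 1) weighting and all names should be checked against that reference):

    import numpy as np

    def resample_assignment(x_i, phis, counts_minus_i, tau, log_lik, rng):
        # phis holds k parameter tuples for the existing clusters plus one
        # fresh draw from G0 at the end, representing a potential new cluster.
        k = len(phis) - 1
        log_w = np.array(
            [np.log(counts_minus_i[j]) + log_lik(x_i, phis[j]) for j in range(k)]
            + [np.log(tau / (k + 1)) + log_lik(x_i, phis[k])])
        w = np.exp(log_w - log_w.max())           # subtract max for stability
        return rng.choice(k + 1, p=w / w.sum())   # index k means "new cluster"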
As for τ, it is useful to choose a weakly informative prior in many applications. If τ is assigned a gamma prior, its posterior becomes a simple function of K, and samples can easily be drawn via an auxiliary variable method. For convenience of sampling, we adopt τ ~ Gamma(a, b) as the prior [9, 20].
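The auxiliary variable method of Escobar and West [20] can be sketched as follows (illustrative Python under a rate parameterization of Gamma(a, b); this is our transcription and should be checked against [20]):

    import numpy as np

    def sample_tau(tau, K, n, a, b, rng):
        # Draw eta ~ Beta(tau + 1, n), then tau from a two-component
        # mixture of gamma distributions, following Escobar and West.
        eta = rng.beta(tau + 1.0, n)
        rate = b - np.log(eta)
        odds = (a + K - 1.0) / (n * rate)
        shape = a + K if rng.random() < odds / (1.0 + odds) else a + K - 1.0
        return rng.gamma(shape, 1.0 / rate)  # numpy gamma takes a scale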
The final Gibbs sampling procedure can be summarized by the following steps:
Gibbs sampling for DPBMM
Iterate the following steps for the t-th iteration:
1. For each sample i, re-sample s_i according to (6) if n_{s_i} > 1, i.e., if sample i does not occupy a singleton cluster; in this case k_{−i} = K. If n_{s_i} = 1, then with probability 1 − 1/K leave s_i unchanged; with probability 1/K rearrange s such that s_i = K, then re-sample s_i according to (6), but in this case k_{−i} = K − 1.
2. For i = 1, ..., K, the posterior distribution for ϕ_i has the form (9). For i = K + 1, ..., n, both the prior and the posterior distribution for ϕ_i are G_0.
3. Sample π following (5), with the Dirichlet parameters updated by the cluster counts, i.e., π | s ~ Dirichlet(n_1 + τ/K, ..., n_K + τ/K).
4. Based on Step 1, we can get the value of K; then sample τ | K, n, where τ ~ Gamma(a, b) a priori.
Due to the large number of parameters, the initial values of the parameters should be chosen carefully.
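Because G_0 is not conjugate with the beta likelihood, the update of ϕ_i in Step 2 cannot be drawn directly from (9); one common realization is a random-walk Metropolis move on (L_α, L_β) targeting (9). The following Python sketch illustrates this under our notation (the step size and hyperparameters are placeholders, and the Metropolis choice is ours rather than a detail stated above):

    import numpy as np
    from scipy.stats import norm
    from scipy.special import betaln

    def log_target(l_alpha, l_beta, X_k, mu, sigma):
        # Log of (9) in the (L_alpha, L_beta) parameterization: Gaussian G0
        # log-prior plus the beta log-likelihood of cluster k's samples X_k,
        # an (n_k, L) array of beta values.
        alpha, beta = np.exp(np.abs(l_alpha)), np.exp(np.abs(l_beta))
        log_prior = (norm.logpdf(l_alpha, mu, sigma).sum()
                     + norm.logpdf(l_beta, mu, sigma).sum())
        log_lik = ((alpha - 1) * np.log(X_k) + (beta - 1) * np.log(1 - X_k)
                   - betaln(alpha, beta)).sum()
        return log_prior + log_lik

    def metropolis_update(l_alpha, l_beta, X_k, mu, sigma, step, rng):
        # One random-walk proposal, accepted with the usual Metropolis ratio.
        prop_a = l_alpha + step * rng.normal(size=l_alpha.shape)
        prop_b = l_beta + step * rng.normal(size=l_beta.shape)
        log_ratio = (log_target(prop_a, prop_b, X_k, mu, sigma)
                     - log_target(l_alpha, l_beta, X_k, mu, sigma))
        if np.log(rng.random()) < log_ratio:
            return prop_a, prop_b
        return l_alpha, l_beta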