Problem formulation
Modeling DNA methylation profiles with a beta mixture distribution
For a two-color hybridization-based array such as the Illumina Infinium array, the measurements are associated with the percentage of methylated alleles and are called "beta" values because they can be described by a mixture of beta distributions [7, 10]. Since the distribution of beta values is bimodal [11], each beta component in the mixture model should be convex, which means each beta component should have large parameters, as shown in Figure 1.
Consider the problem of clustering n independent DNA methylation samples. Let X = {X_{1}, X_{2}, ..., X_{n}} be the DNA methylation profiles of the n samples. For each sample i, let X_{i} = {x_{i1}, x_{i2}, ..., x_{iL}} be a vector of L continuous outcomes falling between zero and one. Suppose there exist a total of K clusters and sample i belongs to cluster class c_{i} ∈ {1, ..., K}. Conditional on class membership, say k, each outcome x_{il} can be viewed as an independent, identically distributed draw from a beta distribution with parameters α_{kl} and β_{kl}:
f\left(x_{il} \mid \alpha_{kl}, \beta_{kl}, c_{i}=k\right) = \frac{x_{il}^{\alpha_{kl}-1}\left(1-x_{il}\right)^{\beta_{kl}-1}}{B\left(\alpha_{kl}, \beta_{kl}\right)}
(1)
where B\left(\alpha, \beta\right) = \int_{0}^{1} x^{\alpha-1}\left(1-x\right)^{\beta-1} dx stands for the beta function. Then DNA methylation sample X_{i} can be modeled by (2).
p\left(X_{i} \mid \theta\right) = \sum_{k=1}^{K} \pi_{k} \prod_{l=1}^{L} \frac{x_{il}^{\alpha_{kl}-1}\left(1-x_{il}\right)^{\beta_{kl}-1}}{B\left(\alpha_{kl}, \beta_{kl}\right)}
(2)
where \theta = \left\{\alpha_{kl}, \beta_{kl}, \forall k, l\right\}. With the restriction to large parameters for the beta distribution components, α_{kl} > 1 and β_{kl} > 1. Note that, due to clustering among samples, θ_{l} and θ_{i} for i ≠ l may be equal. The π_{k}'s represent the cluster proportions, with \sum_{k=1}^{K} \pi_{k} = 1. In reality, the total cluster number K is not known a priori; we next discuss a model based on the Dirichlet process to address this difficulty.
Dirichlet process mixture model
The Dirichlet process is a nonparametric extension of the Dirichlet distribution. Let x_{i} be a random sample from a distribution f with parameter θ_{i}. In a Bayesian formulation, the model for the parameter θ_{i} can be defined as
\begin{aligned} x_{i} \mid \theta_{i} &\sim f\left(\theta_{i}\right) \\ \theta_{i} \mid G &\sim G \end{aligned}
(3)
where G is the prior distribution of θ_{i}. It is not always realistic to assume that G has a known form, and nonparametric Bayesian models, including the Dirichlet process (DP), have been proposed to address this problem. Instead of defining a parametric form for G, G is assumed to be a draw from a Dirichlet process with base distribution G_{0} and precision parameter τ [12]. The model for Bayesian estimation is also depicted in Figure 2 following the principles of graphical models. It can be written as (4) with a DP prior.
\begin{aligned} X_{i} \mid \theta_{i} &\sim f\left(\theta_{i}\right) \\ \theta_{i} \mid G &\sim G \\ G \mid \tau, G_{0} &\sim DP\left(\tau, G_{0}\right) \end{aligned}
(4)
where G_{0} is such that E[G] = G_{0} and has a parametric form, and τ measures the strength of belief in G_{0}. The DP mixture (DPM) model was proposed to capture clustering effects in data. Compared with other clustering models, the DPM is very attractive because it allows the cluster number K to be a priori infinite and learned from the data. To capture the clustering nature of DNA methylation samples, a beta mixture model with infinitely many classes can be built with the DPM. Let θ_{i} = {α_{i}, β_{i}} be the set of parameters for each sample, and note that some of them may be equal. In DPM models, each θ_{i} is marginally sampled from G_{0}, and with positive probability some of the θ_{i} are identical due to the discreteness of the random measure G. Therefore, the new value of θ_{i} can either be one of the θ_{l}'s (l ≠ i), or θ_{i} can be a new draw from G_{0}. Letting K in (2) be ∞, we assume a DPBMM for the DNA methylation array.
Inference
Let Φ = {Φ_{1}, Φ_{2}, ..., Φ_{K}} denote the set of distinct θ_{i}'s, where K is the number of distinct elements of θ_{1}, ..., θ_{m}. Let s = {s_{1}, ..., s_{m}} denote the cluster assignment vector; that is, s_{i} = l if and only if θ_{i} = Φ_{l}. Then θ = {θ_{i} : i = 1, ..., m} can be reparameterized as {Φ_{1}, ..., Φ_{K}, s_{1}, ..., s_{m}}. Let n_{i}, i = 1, ..., K, be the number of elements s_{l} equal to i, and let the subscript "-i" stand for all the variables except the ith one. The goal from a Bayesian perspective is to calculate the posterior distribution of the unknown parameters {Θ, π, τ}. However, the analytical expression is intractable, and we instead develop a Gibbs sampling solution to obtain random samples from the posterior distribution. The key for Gibbs sampling is to derive the conditional posterior distributions of the unknown parameters. Due to the constraints on α and β, we first reparameterize α as L_{α} by α - 1 = exp(L_{α}) and β as L_{β} by β - 1 = exp(L_{β}). Thus, we only need to sample L_{α} and L_{β} in the range (-∞, ∞), and the transformed α > 1 and β > 1. We can then specify G_{0} on the transformed parameters as G_{0}\left(L_{\alpha}, L_{\beta}\right) = \mathcal{N}\left(0, \sigma_{\alpha}^{2}\right) \mathcal{N}\left(0, \sigma_{\beta}^{2}\right), where \mathcal{N}\left(\mu, \sigma^{2}\right) represents the Gaussian distribution with mean μ and variance σ² [13]. With a symmetric Dirichlet prior, the conditional posterior of the cluster proportion π is the Dirichlet distribution
\pi \sim Dir\left(n_{1} + \tau/K, \ldots, n_{K} + \tau/K\right).
(5)
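The log-reparameterization above can be checked numerically. The sketch below assumes the transform α = 1 + exp(L_{α}), consistent with the requirement α > 1; the value σ_{α} = 1 is illustrative. It draws L_{α} from its Gaussian G_{0} component and verifies that every mapped α exceeds one:

```python
# Sketch: sampling L_alpha ~ N(0, sigma_a^2) unconstrained on the real line
# and mapping back via alpha = 1 + exp(L_alpha), so the convexity
# constraint alpha > 1 always holds. sigma_a = 1.0 is an illustrative value.
import math
import random

random.seed(1)
sigma_a = 1.0

def to_alpha(L_alpha):
    # inverse of the reparameterization L_alpha = log(alpha - 1)
    return 1.0 + math.exp(L_alpha)

draws = [to_alpha(random.gauss(0.0, sigma_a)) for _ in range(1000)]
print(min(draws) > 1.0)  # every draw respects alpha > 1
```

The same transform applies to β, so both shape parameters can be updated with unconstrained proposals during Gibbs sampling.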
There are several useful representations of a Dirichlet process, such as the Chinese restaurant process (CRP) [14, 15], the stick-breaking construction [16], and the Pólya urn formulation [17, 18]. Blackwell showed that draws from a Dirichlet process are discrete, as they consist of countably infinite point probability masses [19]. Escobar and West [20] first showed that Markov chain Monte Carlo (MCMC) techniques, specifically Gibbs sampling, can be used for posterior density estimation when the Blackwell-MacQueen Pólya urn formulation of the Dirichlet process is used. Based on the generalized Pólya urn scheme, the conditional prior distributions s_{i} | s_{1}, ..., s_{i-1}, i = 1, ..., n, and θ_{i} | θ_{-i} have the forms (6) and (7).
\begin{aligned} &P\left(s_{1} = 1\right) = 1 \\ &\begin{cases} P\left(s_{i} = l \mid s_{1}, \ldots, s_{i-1}\right) = \dfrac{n_{-i,l}}{\tau + i - 1}, & l = 1, \ldots, k_{i} \\ P\left(s_{i} = k_{i} + 1 \mid s_{1}, \ldots, s_{i-1}\right) = \dfrac{\tau}{\tau + i - 1} & \end{cases} \end{aligned}
(6)
and,
\begin{aligned} \theta_{1} &\sim G_{0}\left(\theta_{1}\right) \\ \theta_{i} \mid \theta_{1}, \ldots, \theta_{i-1} &\sim \frac{\tau}{\tau + i - 1} G_{0}\left(\theta_{i}\right) + \sum_{l=1}^{K} \frac{n_{l}}{\tau + i - 1} \delta_{\Phi_{l}}\left(\theta_{i}\right), \quad \text{for } i > 1. \end{aligned}
(7)
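The sequential prior (6) can be simulated directly. The sketch below (with illustrative values of n and τ, not from the paper) assigns samples one at a time: sample i joins an existing cluster l with probability proportional to its current count, or opens a new cluster with probability proportional to τ:

```python
# Sketch of the generalized Polya urn prior (6): assign s_1..s_n
# sequentially; sample i joins existing cluster l with probability
# n_l / (tau + i - 1) or starts a new cluster with probability
# tau / (tau + i - 1). n = 50 and tau = 1.0 are illustrative choices.
import random

random.seed(2)

def polya_urn_assignments(n, tau):
    counts = []          # counts[l] = current size of cluster l+1
    s = []
    for i in range(1, n + 1):
        # Total unnormalized mass is (i - 1) + tau; pick a point in it.
        r = random.uniform(0.0, tau + i - 1)
        acc, label = 0.0, None
        for l, c in enumerate(counts):
            acc += c
            if r < acc:            # landed in existing cluster l+1
                label = l
                break
        if label is None:          # remaining mass tau: open a new cluster
            counts.append(0)
            label = len(counts) - 1
        counts[label] += 1
        s.append(label + 1)
    return s

s = polya_urn_assignments(50, tau=1.0)
print("K =", max(s), "first assignments:", s[:10])
```

Note the rich-get-richer behavior: large clusters attract new samples, while τ controls how often new clusters appear, matching the role of the precision parameter in (4).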
Then the conditional posterior distribution for sampling θ_{i} has the form
\begin{aligned} p\left(\theta_{i} \mid \theta_{-i}, s_{-i}, X\right) &\propto q_{i,0} G_{i}\left(\theta_{i}\right) + \sum_{l=1, l \ne i}^{n} q_{i,l} \delta_{\theta_{l}}\left(\theta_{i}\right) \\ &= q_{i,0} G_{i}\left(\theta_{i}\right) + \sum_{l=1}^{K} n_{-i,l}\, q_{i,l} \delta_{\Phi_{l}}\left(\theta_{i}\right). \end{aligned}
(8)
Thus the conditional posterior distribution for sampling Φ_{i} has the form
\begin{aligned} p\left(\Phi_{i} \mid \Phi_{-i}, s, X, \pi\right) &\propto p\left(X_{m: s_{m} = i} \mid \Phi, s, \pi\right) p\left(\Phi_{i} \mid \Phi_{-i}, s, \pi\right) \\ &= G_{0}\left(\Phi_{i}\right) \prod_{m: s_{m} = i} \prod_{l=1}^{L} \frac{x_{ml}^{\alpha_{il}-1}\left(1-x_{ml}\right)^{\beta_{il}-1}}{B\left(\alpha_{il}, \beta_{il}\right)} \end{aligned}
(9)
It is obvious that G_{0} is not conjugate to f, so the integral q_{i,0} cannot be evaluated analytically, and drawing samples from G_{i} is also extremely challenging [21]. To overcome this difficulty, we adopt the "no-gaps" algorithm proposed in [22] to enable sampling from (8).
As for τ, it is useful to choose a weakly informative prior in many applications. If τ is assigned a gamma prior, its posterior becomes a simple function of K, and samples are easily drawn via an auxiliary-variable method. For convenience of sampling, we adopt τ ~ Gamma(a, b) as the prior [9, 20].
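The auxiliary-variable update of Escobar and West [20] can be sketched as follows: draw an auxiliary η | τ ~ Beta(τ + 1, n), then draw τ from a two-component mixture of gamma distributions whose parameters depend on η, K, and the prior. The values of a, b, K, and n below are illustrative:

```python
# Sketch of the Escobar-West auxiliary-variable update for tau with a
# Gamma(a, b) prior: eta | tau ~ Beta(tau + 1, n), then tau | eta, K is a
# mixture of Gamma(a + K, b - log eta) and Gamma(a + K - 1, b - log eta).
import math
import random

random.seed(3)

def sample_tau(tau, K, n, a, b):
    eta = random.betavariate(tau + 1.0, n)
    rate = b - math.log(eta)       # log(eta) < 0, so the rate stays positive
    # mixing weight: pi_eta / (1 - pi_eta) = (a + K - 1) / (n * rate)
    odds = (a + K - 1.0) / (n * rate)
    pi_eta = odds / (1.0 + odds)
    if random.random() < pi_eta:
        return random.gammavariate(a + K, 1.0 / rate)       # scale = 1/rate
    return random.gammavariate(a + K - 1.0, 1.0 / rate)

tau = 1.0
for _ in range(100):               # a few Gibbs scans of the tau step alone
    tau = sample_tau(tau, K=5, n=100, a=1.0, b=1.0)
print("tau =", round(tau, 3))
```

Since `random.gammavariate` takes a scale parameter, the rate b - log η is inverted before sampling.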
The final Gibbs sampler can be summarized by the following steps:
Gibbs sampling for DPBMM
Iterate the following steps; at the tth iteration:

1.
For each sample i, resample s_{i} according to (6) if n_{s_{i}} > 1; in this case k_{i} = K. If n_{s_{i}} = 1, then with probability 1 - 1/K leave s_{i} unchanged; with probability 1/K, rearrange s such that s_{i} = K and then resample s_{i} according to (6), but in this case k_{i} = K - 1.

2.
For i = 1, ..., K, the posterior distribution for Φ_{i} has the form (9).
For i = K + 1, ..., n, both the prior and the posterior distribution for Φ_{i} are G_{0}.

3.
Sample π following (5) with n_{k} = \sum_{i=1}^{n} \delta\left(s_{i}, k\right).

4.
Based on Step 1, we obtain the value of K; then sample τ | K, n, where the prior is τ ~ Gamma(a, b).
Due to the large number of parameters, the initial values of \sigma_{\alpha}^{2} and \sigma_{\beta}^{2} should be chosen carefully.