ENIGMA: an enterotype-like unigram mixture model for microbial association analysis

Background One of the major challenges in microbial studies is detecting associations between microbial communities and a specific disease. A specialized feature of microbiome count data is that intestinal bacterial communities form clusters called as “enterotype”, which are characterized by differences in specific bacterial taxa, making it difficult to analyze these data under health and disease conditions. Traditional probabilistic modeling cannot distinguish between the bacterial differences derived from enterotype and those related to a specific disease. Results We propose a new probabilistic model, named as ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), which can be used to address these problems. ENIGMA enabled simultaneous estimation of enterotype-like clusters characterized by the abundances of signature bacterial genera and the parameters of environmental effects associated with the disease. Conclusion In the simulation study, we evaluated the accuracy of parameter estimation. Furthermore, by analyzing the real-world data, we detected the bacteria related to Parkinson’s disease. ENIGMA is implemented in R and is available from GitHub (https://github.com/abikoushi/enigma).


Background
More than 100 trillion microbes live on and within human beings and form of complex microbial communities (microbiota). Most microbes cannot be cultured in laboratories, making it difficult to understand how individual microorganisms mediate vital microbiomehost interactions under health and disease conditions. However, recent important advances in high-throughput sequencing technology have enabled observation of the composition of these intestinal microbes. For each sample drawn from an ecosystem, the number of occurrences of each operational taxonomic units (OTUs) is measured and the resulting OTU abundance can be summarized at *Correspondence: shimamura@med.nagoya-u.ac.jp 4 Division of Systems Biology, Nagoya University Graduate School of Medicine, 65 Tsurumai-Cho, Showa-Ku, 466-8550 Nagoya, Japan Full list of author information is available at the end of the article any level of the bacterial phylogeny. Discovering recurrent microbial compositional patterns that are related to a specific disease is a significant challenge, as individuals with the same disease typically harbor different microbial community structures.
Recent large-scale sequencing surveys of the human intestinal microbiome, such as the US NIH Human Microbiome Project (HMP) and the European Metagenomics of the Human Intestinal Tract project (MetaHIT), have revealed considerable variations in microbiota composition among individuals [1,2]. Particularly, community clusters characterized by differences in the abundance of signature taxa, referred to as enterotypes, were first reported in humans [3]. Later, other studies identified enterotype-like clusters that may reflect features of the host-microbial physiology and homeostasis in different species [4,5] or at different human body sites [6][7][8][9]. This microbial stratification has motivated the development of methods for examining unknown clusters of microbial communities.
Probabilistic modeling of microbial metagenomics data often provides a powerful framework for characterizing the microbial community structures [10][11][12]. For example, Knights et al. [10] applied a Dirichlet prior to a single-level hierarchy and proposed a Bayesian approach for estimating the proportion of microbial communities. Holmes et al. [11] extended the Dirichlet prior to Dirichlet multinomial mixtures to facilitate clustering of microbiome samples. Shafiei et al. [12] proposed a hierarchical model for Bayesian inference of microbial communities (BioMiCo) to identify clusters of OTUs related to environmental factors of interest.
However, such models are not suitable for discovering enterotype-like clusters of microbial communities and associations between microbes and a specific disease for the following two reasons. First, the frameworks of Knights et al. [10] and Holmes et al. [11] do not explicitly address the association between the microbial compositional patterns and environmental depend on the interest. Second, the framework of Shafiei et al. [12] models the structure of each sample using a hierarchical mixture of multinomial distributions that are depends on the factors of interest. Individual host properties such as body mass index, age, or gender cannot explain the observed enterotypes [3]. Thus, such enterotype-like clusters that describes interindividual variability among humans do not always to directly affect host probabilities such as diseases ranging from localized gastroenterologic disorders to neurologic, respiratory, metabolic hepatic, and cardiovascular illnesses.
Here, we introduce a novel probabilistic model of a microbial community structures, named as ENIGMA (Enterotype-like uNIGram mixture model for Microbial Association analysis), to address these problems. ENIGMA includes the following contributions: 1. ENIGMA uses OTU abundances as input and models each sample by the underlying unigram mixture whose parameters are represented by unknown group effects and known effects of interest. The group effects are represented by baseline parameters that change with a latent group of microbial communities. One of the most important features of our model is that the group effects are independent of the effects of interest. This enables the separation of interindividual variability and fixed effects of the host properties related to disease risk. 2. ENIGMA is regarded as Bayesian learning for detecting associations between a community structure and factors of interest. Our model can be used to simultaneously learn how enterotype-like clusters of OTUs contribute to the microbial structure and how microbial compositional patterns may be related to known features of the sample. 3. We provide an efficient learning procedure for ENIGMA by using a Laplace approximation to integrate latent variables and estimate the evidence of the complete model and credible intervals of the parameters. The software package that implements ENIGMA in the R environment is available from https://github.com/abikoushi/enigma.
We describe our proposed framework and algorithm in the "Methods" section. We evaluate the performance of ENIGMA using simulated data in terms of its accuracy to estimate parameters and identify clusters in the "Simulation study" section. We apply ENIGMA to clinical metagenomics data and demonstrate how ENIGMA simultaneously identifies enterotype-like clusters and gut microbiota related to Parkinson's disease (PD) in the "Results on real data" section.

Methods
The key idea of ENIGMA is to adjust for the effects of the enterotype and evaluate the increases and decreases of bacterial abundance associated with environmental factors. Figure 1 shows a conceptual view of ENIGMA. The formal definition of the model is described in the following Mode section. Here we introduce several notations.
Suppose that we observe microbiome count data of K taxa for N samples with M individual host properties, (y nk , x nm ) (n = 1, . . . , n; k = 1, . . . , K; m = 1, . . . , M) where y nk ∈ N represents the abundance of the k-th taxa in the n-th sample and x nm represents a binary variable such that x nm = 1 if the nth sample has the m-th host property and is otherwise x nm = 0. Here the word taxa can represent any level of the bacterial phylogeny, e.g., species, genes, family, order, etc. Figure 2 shows a plate diagram of the proposed model for metagenome sequencing, where y n is the read count vector of the n-th sample, x n is the vector of the host properties of the n-th sample and z n ∈ {1, . . . , L} is a latent class of the n-th sample. Our model is a simple extension of the unigram mixture model. We assume that each sample is generated from a multinomial distribution with the parameter vector p n = (p n1 , . . . , p nK ) . The elements of p n and p nk (k = 1, . . . , K) are probabilities of the occurrence of the K taxa for the n-th sample. We also assume that p nk can be influenced independently by the environmental factor on the taxa that is common to all latent classes and the interindividual factor on the latent enterotype-like classes. More specifically, the generative process of ENIGMA is defined as follows:

Model
where γ l is baseline parameter (K-dimensional vector) that changes with the latent class, M × K matrix B = (β mk ) is effect of a environmental factor common to all enterotype-like clusters, β m is a m-th row-vector of B, π = (π 1 , . . . , π L ) is a mixing ratio of components, O K is a K-dimensional zero matrix and I K is K-dimensional identity matrix. Here, the softmax function is defined by for a vector x = (x 1 , . . . , x K ) using an element-wise exponential function and the probability function of categorical distribution is parameterized as Pr(z = l|π) = π l , l ∈ {1, . . . , L}. In a Bayesian approach, the prior distributions for π , β, and γ l must be defiend. We set a prior based on the Dirichlet distribution for π , and flat priors to the hyperparameters σ and τ for

Parameter estimation
Let us denote observed matrix by Y = (y nk ), X = (x nm ), the unknown parameters by θ = (α, B, γ 1 , . . . , γ L , σ , τ ), and their prior by φ(θ). The posterior distribution is represented as follows: First, latent variable z n must be marginalized. The likelihood is described by π l p(y n |z n = l, x n , θ ).
The posterior distribution is proportional to the product of the likelihood and prior density: Letθ be the MAP estimator of θ , found by maximizing log p(θ, Y , X).
We use a Laplace approximation [13] for parameter estimation. A Taylor expansion aroundθ gives where H(θ ) is the Hessian of log p(θ |Y , X) evaluated atθ . Eq. 4 gives where C is a normalizing constant. This relationship shows that p(θ|Y , X) can be approximated by the normal distribution N θ , H −1 θ . Credible intervals can be calculated from this multivariate normal distribution. We used the stochastic programming language Stan (http://mc-stan.org/) for its implementation. The MAP estimators were obtained by the L-BFGS method. Credible intervals were computed from the using a Stan function to compute the Hessian at the MAP estimates.
After fitting the model, the enterotype-like cluster of each sample must be classified. The conditional probability of z n = l is This is the probability that the n-th sample belongs to cluster l. Next, the n-th sample is then classified into the lth cluster that maximizes the conditional probability given by Eq. 5.

Model Selection
We also examined whether or not the whole set rather than individual bacteria is related to the environmental factors of interest. We compared between the two models when B = 0 and B = 0. We used the log marginal likelihood as the goodness of fit for model comparison. The marginal likelihood is given by From Eq. 4, we have Thus, the log marginal likelihood is approximated by the following formula: where D is the number of free parameters. In model comparison, we choose the model showing larger log marginal likelihood.

Simulation study
To demonstrate the performance of ENIGMA, we conducted several simulation experiments. The synthetic data were naturally produced via our generative process coverage probability given by Eq. 1. We set M = 2000, L = 3, π l = 1/3, and α = (1, 1, 1) T . We first generated B and γ l from the standard normal distribution. The variables x n , z n , and y n are then sampled from the Bernoulli distribution with probability of 0.5, the categorical distribution, and the multinomial distribution, respectively. For the above parameter settings, we randomly generated a count dataset of 100 taxa for 100 samples for evaluation.
• Coverage probability (CP): The coverage probability is the proportion of the time over which the interval contains the true value. A discrepancy between the coverage probability and the nominal coverage probability frequently occurs. When the actual coverage is greater than the nominal coverage, the interval is referred to as conservative. If the interval is conservative, there is no inconsistency in interpretation. To calcurate these metrics, we detrmined that we calculated the sample means and standard deviations ofB and B − B 2 from the 10,000 synthetic datasets. Figure 3 shows a comparison of the true B and the mean and standard deviation of estimatesB obtained from the 10,000 simulations. We observed that the points were arranged diagonally, indicating that the estimator of ENIGMA was unbiased. We also calculated the proportion of the time for which the 95% credible interval contains the true value of B. We found that this proportion was greater than nominal value of 0.95 for all B in Fig. 4. Table 1 shows the coverage probability (CP), bias, standard error (SE), and RMSE ofB, respectively. We observed that the bias and standard error decreased when β mk was large (i.e. the corresponding abundance was large). We also found that the accuracy of classification given by Eq. 5 was exactly 100%. Thus, these results indicate that ENIGMA can produce reasonable estimates.

Arumugam et al. (2011)'s data
We demonstrated that the enterotype-like cluster can be estimated using the data of Arumugam et al. [3]. This data is N = 33, K = 55. The data of Arumugam et al. [3] does not disclose the total read count. Thus, We used the relative abundance multiplied by 10,000 as y nk . Based on the result of Arumugam et al. [3], the number of latent classes in ENIGMA was chosen to be L = 3. We estimated the parameters using the ENIGMA and setting all β mk = 0 in Eq. 1. We set the hyperparameters of Dirichlet prior α = (1, . . . , 1) , which is equivalent to a noninformative prior. Arumugam et al. [3] showed that the enterotype is characterized by the differences in the abundance of Bacteroides, Prevotella, and Ruminococcus. Estimates of the probability of occurrence of those bacteria in three clusters are shown in the Fig. 5. Class 1 contains high-level Ruminococcus, class 2 contains high-level Bacteroides, and class 3 contains high-level Prevotella. This result is consistent with that of Arumugam et al. (2011) [3].

Parkinson's disease data
To validate the performance of ENIGMA in discovering clusters of microbial communities and associations between microbes and a specific disease, we applied ENIGMA to the real metagenomic sequencing data from Scheperjans et al. [14], Hill-Burns et al. [15], Heintz-Buschart et al. [16] and Hopfner et al. [17]. The data was analyzed by sequencing the bacterial 16S ribosomal RNA genes sampled from patients with Parkinson's disease (PD) and controls in Finland, USA, and Germany. Table 2 shows the summary statistics of the data. The OTUs were mapped to the SILVA taxonomic reference, version 132 (https://www.arb-silva. de/) and the abundances of family-level taxa were calculated.
To assess the optimal number of clusters, we used the Calinski-Harabasz (CH) Index. It is defined as:  where BC l is the between-cluster sum of squares (i.e. the squared distances between all points i and j, for which i and j are not in the same cluster) and WC l is the within-clusters sum of squares (i.e. the squared distances between all points i and j, for which i and j are in the same cluster). Here, we used Jensen-Shannon divergence (JSD) as the distance. The JSD between samples a = (a 1 , . . . , a K ) and b = (b 1 , . . . , b K ) is defined as follows: When calculating the JSD, we used the normalized abundance obtained by dividing y nk by the total read count, and 0 was replaced with pseudo count 10 −6 . We chose the number of clusters L such that CH l was maximal. To evaluate the CH Index, we use the function index.G1() from the R library clusterSim. The number of latent classes in ENIGMA was chosen to be L = 3 in Finland and Germany and L = 2 in USA by the CH indexes (Fig. 6). First, we evaluated whether the model assumption was satisfied when using this data. According to Arumugam et al. (2011) [3], the gender of the host is not related with the enterotype. The genders of the subjects were published in the Finland study. We examined the relationship between gender and enterotype-like cluster using the data from Finland. Table 3 shows there was no correlation between them. We conducted a Chi-squared test for independence as shown in Table 3 and the p-value was 0.66.
We evaluated whether the model showing that bacteria were associated with PD is better than the model without the associations in terms of marginal likelihood. Marginal   Table 4 shows that the marginal likelihood of M 1 was greater than M 0 . It is preferrd to explain the data by considering the association between the microbiota and PD. Figure 7 shows the estimated probabilities of the occurrences of bacteria for the three latent classes, p l , (l = 1, 2, 3). Bacteria detected in fewer three countries were removed. Arumugam et al. [3] showed that enterotype is characterized by the differences in the abundance of Bacteroides, Prevotella, and Ruminococcus. The results of ENIGMA showed the same tendency as the previous survey. Figure 8 shows the (γ l ) values and their credible intervals. The top three microbes in each enterotype-like cluster are shown in excerpts for this plot. According to the results of ENIGMA, the abundance of Enterobacteriaceae and Lachnospiraceae also differed greatly among clusters. Bacterial abundance differed between countries. In the USA, there was a high abundance of Verrucomicrobiaceae, while in Finland, few of these bacteria were detected. In contrast, Finland showed more Prevotellaceae, with fewer in the in USA it is less. Table 5 shows the coefficients whose 95% credible intervals did not contain zero in more than two countries. The microbes with these coefficients indicates that the corresponding microbial composition patterns were significantly related to PD. We found that at the family levels, Clostridiaceae, Comamonadaceae, Prevotellaceae, Actinomycetaceae, Bifidobacteriaceae, Enterococcaceae, Synergistaceae, Verrucomicrobiaceae and Victivallaceae, the signs of the coefficients matched in all countries. These results are consistent with those of previous studies. Hill-Burns et al. [15] reported that patients with PD contained high levels of Bifidobacteriaceae and Verrucomicrobiaceae. Scheperjans et al. [14] reported PD patients contained high levels of Verrucomicrobiaceae and low levels of Prevotellaceae. Hopfner et al. reported that patients with PD have high levels of Enterococcaceae.
We compared ENIGMA to the Wilcoxon rank sum test, a classical methods for identifying bacteria related with an environmental factor of interest [16]. Table 6 shows the bacteria significantly related to PD with pvalue < 0.05 in more than two countries. We observed that the bacteria detected by the Wilcoxon test were mostly included in those of ENIGMA (Table 5). Notably, all of the corrected p-values in Table 6 are larger than 0.05. This result shows that ENIGMA was superior to the Wilcoxon rank sum test in terms of identifying a larger number of associations between microbiota and PD.
Finally, we combined the results of ENIGMA to those of PICRUSt (version 1.1.3) [18] in order to evaluate which functions are related to PD. In the present study, PICRUSt was performed using the default settings. The Fig. 9 shows the functions exhibiting an increase and decrease from the median, which matched in all countries and clusters with respect to PD and control (CO). This result indicates that ENIGMA is a valuable tool for discovering new disease-related functions.
The analyses using real-world data thus show that ENIGMA can identify enterotype-like clusters and the associations between the gut microbiota and PD. Some of the results were strongly supported by those of previous studies.

Conclusion
We proposed a novel hierarchical Bayesian model, ENIGMA, for discovering the underlying microbial community structures and associations between microbiota and their environmental factors from microbial metagenome data. ENIGMA is based on a probabilistic model of a microbial community structures and supplied with labels for one or more environmental factors of interest for each sample. The structures of each sample are modeled by a multinomial distribution whose parameters are represented independently by group and environmental effects of each sample, which prevents mixing of individual differences and the effects of interest. This The "-" notation indicates the bacteria undetected in that country framework enables the model to simultaneously learn (i) how microbes contribute to an underlying community structures (cluster) and (ii) how microbial compositional patterns are explained by environmental factors of interest. The effectiveness of ENIGMA was evaluated through experiments involving both synthetic and read-world datasets. These newly discovered clusters and associations estimated using ENIGMA can provide insight into the the mechanisms of a microbial communities.  Fig. 9 Predicted functional trait abundance in each country and cluster The major limitation of ENIGMA is its scalability and efficiency, as the number of the parameters in the model increase proportionally with the number of taxa when the number of environmental factors of interest is large. Further studies should focus on developing a scalable probabilistic model of microbial compositions to analyze underlying microbial structures with a large number of these effects by using sparse parameter estimation [19]. We are also interested in developing a dynamic probabilistic model similar to that reported by Blei and Lafferty [20] for analyzing time-varying bacteria compositions during disease progression.