Topic model for metagenomics study
A topic model is a type of statistical model originally developed in machine learning and natural language processing to discover latent “topics” in a collection of documents [1]. The model assumes that each topic consists of highly correlated words and that each document contains several topics in proportions given by a probability distribution; the distribution of these latent topics can be inferred from the set of documents together with their word frequency representations. In particular, the Bayesian model Latent Dirichlet Allocation (LDA) [2] can be used for this inference. When the model is applied to text processing, each document follows a probability distribution over topics, and each topic follows a probability distribution over words. This hierarchical generative model assumes that a word in a document is generated in two steps: a topic is first chosen from the document’s distribution over topics, and a word is then chosen from that topic’s distribution over words. The generative process of the topic model is formulated as follows, where $\uptheta_d$ and $\varnothing_t$ are, respectively, the distribution over topics of document d and the distribution over words of topic t.
$$ {\uptheta}_d\sim Dirichlet\left(\alpha \right) $$
$$ {\varnothing}_t\sim Dirichlet\left(\beta \right) $$
Here α and β are the hyperparameters of the Dirichlet priors. To generate word i in document d, a topic $\mathrm{Z}_{d,i}$ is first sampled from the document’s distribution over topics, and the word $\mathrm{W}_{d,i}$ is then sampled from that topic’s distribution over words, according to the following distributions:
$$ {\mathrm{Z}}_{d,i}\Big|{\uptheta}_d\sim Multinomial\left({\uptheta}_d\right) $$
$$ {\mathrm{W}}_{d,i}\Big|{\mathrm{Z}}_{d,i},{\varnothing}_{{\mathrm{Z}}_{d,i}}\sim Multinomial\left({\varnothing}_{{\mathrm{Z}}_{d,i}}\right) $$
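The two-step generative process can be illustrated with a minimal simulation. The sketch below is not part of MetaTopics; the function names, the symmetric priors, and the parameter values are assumptions chosen for illustration. Each word is drawn from the word distribution of the topic sampled for it, and only the Python standard library is used.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, n_topics, vocab_size, doc_length, alpha, beta, seed=0):
    """Simulate documents from the LDA generative process (illustrative only)."""
    rng = random.Random(seed)
    # One distribution over words per topic, drawn from Dirichlet(beta).
    topic_word = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs, doc_topic = [], []
    for _ in range(n_docs):
        # One distribution over topics per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Step 1: sample the topic Z_{d,i} from the document's topic distribution.
            z = rng.choices(range(n_topics), weights=theta)[0]
            # Step 2: sample the word W_{d,i} from the word distribution of topic z.
            w = rng.choices(range(vocab_size), weights=topic_word[z])[0]
            words.append(w)
        docs.append(words)
        doc_topic.append(theta)
    return docs, doc_topic, topic_word


# Hypothetical settings: 5 documents, 3 topics, a vocabulary of 20 word ids.
docs, doc_topic, topic_word = generate_corpus(5, 3, 20, 50, alpha=0.5, beta=0.1, seed=1)
```

In the metagenomics setting, each simulated document would play the role of a sample and each word id the role of an OTU.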
In this study, the topic model is used to process our metagenomics data. We draw a direct analogy between text mining and microbial community detection: documents correspond to the samples in a metagenomics study, and the word frequencies in a document correspond to the OTU abundances of a given sample. We formed the joint probability of the bacterial taxa in each sample by integrating out the parameters $\uptheta$ and $\varnothing$, and applied collapsed Gibbs sampling to assign the bacterial taxa of each sample to topics. Details can be found in [2].
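As an illustration of this inference step, the following is a minimal sketch of collapsed Gibbs sampling for LDA applied to a sample-by-OTU count matrix. It is not the implementation used in this study, which relies on the R package topicmodels; the function name, the hyperparameter values, and the toy matrix are assumptions. With the two sets of distributions integrated out, each token is reassigned to a topic with probability proportional to (sample-topic count + α) × (topic-OTU count + β) / (topic total + Vβ), where V is the number of OTUs and all counts exclude the token being resampled.

```python
import random


def collapsed_gibbs_lda(counts, n_topics, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """counts[d][w] is the integer abundance of OTU w in sample d (illustrative)."""
    rng = random.Random(seed)
    n_docs, vocab = len(counts), len(counts[0])
    # Expand each row of counts into one token per observed read.
    tokens = [[w for w, c in enumerate(row) for _ in range(c)] for row in counts]
    n_dk = [[0] * n_topics for _ in range(n_docs)]  # sample-topic counts
    n_kw = [[0] * vocab for _ in range(n_topics)]   # topic-OTU counts
    n_k = [0] * n_topics                            # tokens assigned to each topic
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(tokens):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = assign[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional distribution of the token's topic given all others.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Point estimates of the sample-topic and topic-OTU distributions.
    sample_topic = [
        [(n_dk[d][t] + alpha) / (len(tokens[d]) + n_topics * alpha) for t in range(n_topics)]
        for d in range(n_docs)
    ]
    topic_otu = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab * beta) for w in range(vocab)]
        for t in range(n_topics)
    ]
    return sample_topic, topic_otu


# Toy matrix (hypothetical): 4 samples by 6 OTUs with two visible blocks.
toy_counts = [
    [8, 6, 7, 0, 1, 0],
    [9, 5, 6, 1, 0, 0],
    [0, 1, 0, 7, 8, 6],
    [1, 0, 0, 6, 9, 7],
]
sample_topic, topic_otu = collapsed_gibbs_lda(toy_counts, n_topics=2, n_iter=50)
```

Each row of `sample_topic` gives the estimated sub-community proportions of one sample, and each row of `topic_otu` gives the OTU distribution of one sub-community.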
R package MetaTopics implementation
MetaTopics is an R package designed specifically to support the workflow of applying topic models to metagenomics data, and it provides the following sample analysis and visualization functions (Fig. 1). Several functions visualize the abundance and diversity of the microbial profiles across individual samples. The core topic model used in MetaTopics is taken from the R package topicmodels [10], which provides LDA models and Correlated Topic Models (CTM) [2] (Fig. 1a). Each topic is viewed as a microbial sub-community, which biologically represents a group of highly correlated bacteria that function similarly in a disease status, and it can be interpreted through its probability distribution and bacterial profile. Each sample, in turn, can be represented by these sub-communities to different degrees. Several interactive visualization approaches based on ggplot2 [11] and LDAvis [12] are incorporated to show the composition of each sub-community and each sample for comparison. After the dominant microbes in each sub-community are identified, the sub-communities can be visualized by their level of overlap to indicate community interaction, which guides deeper investigation of the microbe interactions (Fig. 1b). In addition, to address the substantial need to analyze the relationship between each sub-community and a given disease status, the Quetelet Index (QI) [13] is defined to estimate the relative change in the observation frequency of a specific latent sub-community among all the samples compared with that among the samples with a given disease status (Fig. 1c). QI quantitatively describes the degree of influence of a specific topic on a given disease (see Additional file 1, section “Defining QI for topic and disease status relationship analysis”, for more details).
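The exact QI definition used by MetaTopics is given in Additional file 1 and is not reproduced here. As a rough illustration only, the sketch below assumes the standard Quetelet form, QI = P(topic | disease) / P(topic) − 1, and assumes that each sample has already been marked as containing the sub-community or not (for example, by thresholding its topic proportion, which is a hypothetical choice). The function name and the toy data are likewise assumptions.

```python
def quetelet_index(topic_present, has_disease):
    """Relative change in topic frequency among diseased samples versus all samples.

    Both arguments are equal-length sequences of booleans (or 0/1), one per sample.
    Assumes the standard Quetelet form P(topic | disease) / P(topic) - 1.
    """
    if len(topic_present) != len(has_disease):
        raise ValueError("inputs must have one entry per sample")
    n_samples = len(topic_present)
    n_disease = sum(1 for s in has_disease if s)
    n_topic = sum(1 for t in topic_present if t)
    if n_samples == 0 or n_disease == 0 or n_topic == 0:
        raise ValueError("QI is undefined without diseased samples or topic observations")
    # Frequency of the topic among all samples.
    p_topic = n_topic / n_samples
    # Frequency of the topic among samples with the disease status.
    n_both = sum(1 for t, s in zip(topic_present, has_disease) if t and s)
    p_topic_given_disease = n_both / n_disease
    return p_topic_given_disease / p_topic - 1.0


# Toy data (hypothetical): 8 samples, the first 4 with the disease status.
topic_present = [1, 1, 1, 0, 0, 0, 1, 0]
has_disease = [1, 1, 1, 1, 0, 0, 0, 0]
qi = quetelet_index(topic_present, has_disease)  # 0.75 / 0.5 - 1 = 0.5
```

Under this assumed form, a positive value indicates that the sub-community is observed more often among samples with the disease status than among all samples, and a negative value indicates the opposite.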