 Research
 Open access
 Published:
GBDR: a Bayesian model for precise prediction of pathogenic microorganisms using 16S rRNA gene sequences
BMC Genomics volume 22, Article number: 916 (2021)
Abstract
Background
Recent evidences have suggested that human microorganisms participate in important biological activities in the human body. The dysfunction of hostmicrobiota interactions could lead to complex human disorders. The knowledge on hostmicrobiota interactions can provide valuable insights into understanding the pathological mechanism of diseases. However, it is timeconsuming and costly to identify the disorderspecific microbes from the biological “haystack” merely by routine wetlab experiments. With the developments in nextgeneration sequencing and omicsbased trials, it is imperative to develop computational prediction models for predicting microbedisease associations on a large scale.
Results
Based on the known microbedisease associations derived from the Human MicrobeDisease Association Database (HMDAD), the proposed model shows reliable performance with high values of the area under ROC curve (AUC) of 0.9456 and 0.8866 in leaveoneout cross validations and fivefold cross validations, respectively. In case studies of colorectal carcinoma, 80% out of the top20 predicted microbes have been experimentally confirmed via published literatures.
Conclusion
Based on the assumption that functionally similar microbes tend to share the similar interaction patterns with human diseases, we here propose a group based computational model of Bayesian diseaseoriented ranking to prioritize the most potential microbes associating with various human diseases. Based on the sequence information of genes, two computational approaches (BLAST+ and MEGA 7) are leveraged to measure the microbemicrobe similarity from different perspectives. The diseasedisease similarity is calculated by capturing the hierarchy information from the Medical Subject Headings (MeSH) data. The experimental results illustrate the accuracy and effectiveness of the proposed model. This work is expected to facilitate the characterization and identification of promising microbial biomarkers.
Background
Researchers are increasingly aware of the critical effects of the human microorganisms on our physical condition. Microorganisms, a.k.a. microbes, are referred to viruses, bacteria, archaea, and eukaryotes (protozoa and fungi) [1]. They can inhabit and thrive in almost each kind of natural environments, of course including the human body. Human microbial communities locate in different parts of the human body, including the external (e.g. skin) and the internal (e.g. the mucosal epithelia of vagina and intestine). In the adult gut, the large majority of intestinal microbes (10^{13}–10^{14}) inhabiting the gastrointestinal tract can approach the population quantity of human cells [2]. Recently, accumulating evidences [3,4,5] show that the onset of human disorders could be attributed to the dysfunction of human microbiota.
The symbiotic relationship between the human microbiota and its host has been demonstrated to get involved in multiple important biological activities. The human microbiota can be influenced by multiple factors of its host such as the genetics, lifestyle, body site, age, health status and others (e.g. antibiotics and smoking) [6,7,8,9]. The resident microbial flora can also affect human physical conditions via multiple microbial genomeencoded metabolic functions. Such metabolic functions can strengthen the metabolic capacity of its host. Therefore, human microbes play a key role in many important biological processes, e.g., by defending against pathogens, enhancing the immune system, getting access to nutrients, as well as degrading toxic compounds [6]. Identifying the latent relationships between microbes and human diseases can provide valuable insights into understanding the pathology of human diseases. For example, butyrate as the primary energy source of intestinal epithelial cells can also function as a key component to suppress the signal transduction pathways of expressing proinflammatory cytokines. Individuals with inflammatory bowel disease (IBD) have been found to have population declines of butyrateproducing microbes, such as Clostridium leptum and Clostridium coccoides groups [10]. This phenomena could also lead to decreasing butyrate utilization [11], which implies the fact that the restoration of hostmicrobe equilibrium could cure or prevent human complex diseases.
Thanks to the high volume of genomic data by highthroughput techniques, increasing bioinformatics tools and databases have been proposed for downstream analysis and data management [12,13,14]. For examples, a 16S rDNA analysis toolkit named W.A.T.E.R.S [15]. can be used for sequence alignment, operational taxonomic units (OTUs) determination, phylogenetic tree construction, and etc. Moreover, hundreds of microbedisease associations are publicly available in Human MicrobeDisease Association Database (HMDAD) [16]. However, the known microbedisease associations are just the tip of the iceberg and far from enough towards a complete picture for clinical medicine. The routine wetlab experiments for biomarker discovery are easy to fail in clinical trials after considerable effort, money, and time have been already invested. In recent years, computational models have been proposed to prioritize seminal biomarker candidates using heterogeneous biological information in several fields including risk genedisease association prediction [17,18,19], proteinprotein interaction (PPI) prediction [20, 21], drugtarget interaction prediction [22], and etc. [14]. The successful applications of these studies motivate us to devise an effective computational model for prioritizing potential pathogenic microbes.
16S ribosomal RNA (rRNA) gene encodes the 30S small subunit of the ribosomal RNA molecules of ribosomes. The low resolution of 16S rRNA gene enables the rapid and accurate identification to establish taxonomic relationships between microbes [23]. The major difference between 16S rRNA gene sequences (~ 1500 bp) tends to fall in the nine hypervariable regions (V1V9), representing dramatic variations for alignments. Therefore, 16S rRNA sequencing analysis is widely used to capture natural speciesspecific “fingerprints” for phylogenetic comparisons.
Recently, increasing effective computational prediction models like KATZHMDA [24] and PBHMDA [25] were developed to explore the potential microbedisease associations using the known microbedisease association network as well as the calculated homologous similarity matrices. In light of a social network prediction algorithm called KATZ [26], KATZHMDA proposes a new proxy measure index to calculate the probabilities of unknown microbedisease associations by considering the number of walks within the network and their own lengths. Moreover, PBHMDA is developed as a pathbased prediction model to perform a restricted depthfirst search by traversing all possible paths between microbes and diseases. However, some limitations could limit their usage and effectiveness, for example, by introducing the systematic bias of the predicted similarity matrices, merely focusing on the known diseases and microbe, and applying the global scoring schemes. Under the hypothesis that the functionally similar microbes tend to share the similar interaction patterns with pathologically similar human diseases, the ultimate goal of this work is to facilitate the discovery of validated biomarkers for helping the early diagnosis, risk assessment, tracking progression, and drug development. Here, we present a Group based computational model of Bayesian Diseaseoriented Ranking (GBDR) for identification of potential microbedisease associations based on the HMDAD database. Heterogeneous biological information is leveraged to compute similarity matrices, including disease semantic similarity, microbe similarity based on BLAST+ scores and microbe similarity based on MEGA7 evolutionary distance scores. The proposed model obtains the supreme prediction accuracy via leaveoneout cross validation (LOOCV) and kfold cross validation (kfold CV) in comparison with other stateoftheart methods. Experiment results demonstrate that the groupbased collaboration filtering and inferred similarity matrices can contribute to the improvement of prediction performance. We conduct a case study for an important disease to manually validate those predicted pathogenic microbes ranked in the top20 list via published literatures. As a result, the reliable performance of the proposed model is fully demonstrated. It is anticipated that GBDR could be an effective computational tool to accelerate the identification of pathogenic microorganisms.
Results
Cross validation and case study
Under the frameworks of LOOCV and kfold CV, the performance of GBDR is thoroughly evaluated. Since GBDR is devised as a diseaseoriented ranking computational model, it aims to prioritize the most potential microbes for each disease. As such, we adopt a local scoring scheme for performance evaluation. As for LOOCV, each known microbedisease association is used to test the model in turns while the rest are used for training until all counterparts are selected. Similarly, in the simulations of kfold CV, the whole set of known microbedisease associations are randomly split into k groups, where (k1) groups form a training set while the remainder is a testing set until each group is tested in truns. To reduce the bias of random divisions, 50 times kfold CVs are conducted to then achieve the average results.
The microbedisease association prediction is actually a binary classification problem. The receiver operating characteristic (ROC) curve is extensively used to evaluate the performance of binary classification models. It is plotted by the true positive rate (sensitivity) versus false positive rate (1specificity). In this work, sensitivity/specificity represents what a high probability of a predicted result can be told to make a positive/negative prediction correctly. The area under ROC curve (AUC) is a numerical evaluation coefficient between 0 and 1. For a disease d, the AUC value can be defined accordingly as:
where \({\mathcal{R}}^{te}(d)=\left\{\left(i,j\right)\left(d,i\right)\in {\mathcal{R}}^{te},\left(d,j\right)\notin \mathcal{R}\cup {\mathcal{R}}^{te}\right\}\). \({\mathcal{R}}^{te}(d)\) is a test dataset of d, \({\hat{r}}_{di}\) and \({\hat{r}}_{dj}\) are predicted values, and δ() is a binary indicator. If the equation within the brackets is true, δ() =1, otherwise 0. The final AUC value can be averaged as follows:
Here, D^{te} is a disease set on testing sets. Normally, AUC = 1 represents a perfect prediction and AUC = 0.5 represents a completely random one.
First of all, GBDR is compared with PBHMDA and KATZHMDA based on the known microbedisease associations from HMDAD database via LOOCV (see Fig. 1). For a fair comparison, all compared models employ the same data resources, i.e. disease semantic similarity, microbe similarity based on BLAST+ scores and microbe similarity based on MEGA7 evolutionary distance scores. GBDR, PBHMDA and KATZHMDA achieve AUC values of 0.9456, 0.6087 and 0.6185, respectively. The proposed model performs better than the other two stateoftheart models. PBHMDA and KATZHMDA have similar prediction performance in terms of local LOOCV. Since both of them are proposed to globally predict the most potential microbedisease associations using the global scoring schemes, the class imbalance could lead to degrade their prediction performance to some extent.
Second, the proposed model is also compared with the other representative algorithms of recommender system via LOOCV (see Fig. 2), including singular value decomposition (SVD) based model, latent factor model (LFM), microbebased collaborative filtering (CF), diseasebased CF and neighborbased CF models. Since GBDR is originally proposed as a recommendation algorithm, the fundamental assumption of pairwise association is adopted to resolve the limitations of the pointwise association assumption where the unknown (unlabeled) microbedisease association are irrelevant. It is interesting to know the performance difference between the GBDR and other representative recommendation algorithms. Since the purpose of this work is to “recommend” the most possible microbes to a certain disease, it is intuitive and meaningful to use these recommendation algorithms for the performance comparison. As we can see in Fig. 2, GBDR also achieves the highest AUC of 0.9456. These representative recommendation algorithms tend to show a moderate predictive power in this case. Among these classical recommendation algorithms, the neighborbased CF model obtains the best performance achieving the AUC of 0.6393. The result suggests that the diseaseoriented ranking model with Bayesian filtering is capable of capturing latent relationships between microbes and diseases. The new and improved assumption in Bayesian diseaseoriented ranking is more effective for prediction by introducing richer interactions among microbes. Particularly, the unified effect of group preference and individual preference is linearly combined to naturally maximize the overall likelihood.
Finally, we also implement kfold CV for further evaluation (see Table 1). As a result, the proposed model yields average AUCs of 0.8266 ± 0.0805, 0.8866 ± 0.0270 and 0.8926 ± 0.0167 in 2fold CV, 5fold CV and 10fold CV, respectively. Both LOOCV and kfold CV can demonstrate the effectiveness of GBDR. Furthermore, colorectal carcinoma (CRC) is selected as an important human disease for a case study. As a result, 9 out of top10 and 16 out of top20 predicted microbes have been experimentally confirmed to have associations with the development of CRC. Detailed information is provided in Additional file 1.
Effectiveness evaluation of groupbased collaboration filtering
In this section, we conduct LOOCV to evaluate the prediction performance with or without the groupbased preference strategy (as shown in Fig. 3A). Without the group preference, the proposed model suffers a nearly 16.2% decline in prediction accuracy with an AUC value of 0.7925. It shows that the group preference strategy is efficient to aggregate the group preference for the diseaseoriented ranking through injecting richer interactions among microbes. The linear combination of pairwise preference and group preference is more effective than the simple pairwise preference.
Moreover, we further evaluate the prediction performance of the proposed model with the diseasebased or microbebased group preference respectively via LOOCV (see Fig. 3B). Firstly, without integrated similarity of microbe or disease used in Eq. (18), the proposed model only leverages the known microbedisease associations for prediction. In this scenarios, combined with microbebased group preference, the proposed model obtains an improved performance with AUC of 0.8793. And the counterpart with diseasebased group preference yields an AUC value of 0.5130. This result supports our assumption that the coordinated functions of microbial groups may pathologically influence the susceptibility to human diseases whereas the human diseases fail to have a significant group trend to affect microbial communities. Secondly, based on the known microbedisease associations, the proposed model is carried out with both microbe and disease similarity matrices using the microbebased group preference. In this way, the proposed model achieves the best prediction performance with AUC of 0.9456. On the other hands, the proposed model with the diseasebased group preference shows a significant increase (39.23%) in the prediction accuracy achieving the AUC of 0.9053. This result suggests that the inferred similarity matrices provide useful heterogeneous information to effectively discriminate seminal biomarker candidates.
Effectiveness evaluation of combining different types of similarities
As mentioned at the end of the above section, the inferred similarity matrices can effectively improve the prediction performance of our model based on the diseasebased group preference. It motivates us to conduct the performance effect analysis of different types of similarities proposed in our model via 5fold CV. The results are shown in Table 2. As a baseline, the GBDR without using any similarity matrices shows the average AUC of 0.6189 with standard deviation of 0.0561. Using the disease semantic similarity, the prediction accuracy achieves average AUC value of 0.6796+/− 0.0468 with 6.27% improvement. Moreover, when the GBDR is integrated with both microbe similarity based on BLAST+ scores and microbe similarity based on MEGA7 evolutionary distance scores, the prediction accuracy is improved by 10.02% achieving the average AUC value of 0.7171+/− 0.0427. Finally, the GBDR obtains the best average AUC value of 0.8081+/− 0.0284 using the disease semantic similarity and the integrated microbe similarity. The result demonstrates that the sequence information of gene exploited by computational approaches (BLAST+ and MEGA 7) enables the precise measure of the biological homology between microbes. Furthermore, the disease semantic similarity based on the Medical Subject Headings (MeSH) descriptors can reflects the molecular relatedness between hereditary diseases. Although the known microbedisease association network is sparse as its links are limited in number, applying diverse similarities to the proposed model is useful to provide discriminative biological information, and therefore enabling the precise prediction of pathogenic microbes.
Effect analysis of key parameters in GBDR
There are several key parameters in GBDR, e.g., regularization weights α_{u}, α_{v} and β_{v}, learning rate γ, number of later features z, and group sizes \(\left\mathbf{\mathcal{G}}\right\). Based on 5fold CV, we conduct an effect analysis to explore the potential of parameter tuning. As we can see in Fig. 4, GBDR reaches the peak of AUC values when regularization weights α_{u}, α_{v} and β_{v} are to 0.01 and learning rate γ is set to 0.001. By evaluating all possible combinations via grid search, the achieved AUC values vary from 0.8536 to 0.8866. For better demonstration, we combine the results of z and \(\left\mathbf{\mathcal{G}}\right\) as a whole in Fig. 5. Regarding of the different number of latent features z, no significant change is observed in terms of AUC values. Based on the results, we set 30 as the default value of z. Moreover, GBDR achieves the highest AUC value when \(\left\mathbf{\mathcal{G}}\right\) is set to 5. As a baseline, GBDR without group preference setting (i.e., \(\left\mathbf{\mathcal{G}}\right\) =1) suffers from degradation by 6.5% as expected. In short, the performance of GBDR is not sensitive to the key parameters tuning.
Discussion
Accumulating evidence show that different types of microbiota are associated with the mechanism of human diseases, forming a complex causal network. Although there are a number of methods having been proposed for predicting such important associations, 16 s rRNA gene sequences, the information most easily obtained in microbe research, haven’t been utilized for this task. To bridge this gay, we proposed a Bayesian prediction model called GBDR, using various types of information including 16 s rRNA sequences. GBDR is based on the computation of disease/microbe similarity assuming that similarity microbes tend to be involved in similar disease mechanism. The experimental results show that the prediction based on such an assumption is feasible and effective. We anticipate that GBDR can help the researchers find the relevant diseases for a specific type of microbe given its 16 s rRNA sequence.
Conclusion
The human microbiota has attracted the increasing attention thanks to its key role playing in human biological activities. It has even been deemed as the “forgotten organ” in the human body. Recent researches show the dysfunction of hostmicrobiota homeostatic balance can result in the onset of various human diseases. The great advance of technology, especially PCR amplification and nextgeneration sequencing, allows to generate high volumes of sequences, providing a new window for the followup downstream analysis. In this work, we leveraged 16S rRNA gene to infer microbe similarity based on BLAST+ scores and microbe similarity based on MEGA7 evolutionary distance scores. The framework of GBDR is proposed to identify the most seminal diseasespecific microorganisms on a large scale. Based on the results of simulation experiments, GBDR is demonstrated to achieve higher performance than the two stateoftheart models and representative recommendation algorithms via LOOCV. Furthermore, reliable predictive capability of GBDR is validated by kfold CV and a case study. GBDR is expected to provide valuable insights into advancing the identification of potential microbes as ideal biomarkers for evaluating and measuring human complex diseases. The prediction list of the most seminal pathogenic microbes is released in Additional file 2 sorted by various specific diseases. Based on the assumption that functionally similar microbes tend to share the similar interaction patterns with human diseases, the main goal of this work is to prioritize the most potential diseaserelated microbes. That is, the involved microbes could be affected by the involved human diseases and/or the human diseases could be caused by the involved microbes. Identifying the seminal diseasemicrobe association is the first key step to develop the full potential for further in vitro tests in therapeutics and clinical research.
Several factors can be summarized to improve the effectiveness of GBDR. Firstly, the integrated microbe similarity holds a significant potential to characterize the remarkable feature patterns between microbes. The hierarchical relevance is leveraged to measure disease semantic similarity. Secondly, the groupbased pairwise strategy is capable of extracting the fruitful information based on the group preference of microbes associating with a certain disease. Thirdly, the Bayesian approach with diseaseoriented ranking is quite suitable for the local prediction of microbedisease associations. However, GBDR adopts a local scoring scheme for diseaseoriented ranking without global normalization process. GBDR is inapplicable to globally predicting the most potential microbedisease associations like PBHMDA and KATZHMDA.
Methods
Materials
In this work, three types of biological information are utilized, i.e. the known microbedisease associations derived from HMDAD database (http://www.cuilab.cn/hmdad) [16], 16S rRNA partial or complete gene sequences downloaded from the Nucleotide Database of the National Centre for Biotechnology Information (NCBI) [27] and MeSH descriptors provided by the Nation Library of Medicine (NLM) [28]. It is noted that, Ma et al. [16] searched articles regarding human microbiomerelated research published before July 2014. The HMDAD database provides 450 nonrepetitive known microbedisease associations including 292 microbes and 39 human diseases (see Additional file 3). The numbers of microbes and diseases are denoted as nm and nd, respectively. We note that all the symbols that used in the Methods section are summarised in the Table 3. All known microbedisease associations are converted into an adjacency binary matrix as variable \(\mathbf{\mathcal{R}}\) of size nm × nd representing their association relationships. Namely, \(\mathbf{\mathcal{R}}\left({\boldsymbol{m}}_{\boldsymbol{i}},{\boldsymbol{d}}_{\boldsymbol{j}}\right)=\mathbf{1}\) indicates microbe mi is known to be associated with disease dj, otherwise \(\mathbf{\mathcal{R}}\left({\boldsymbol{m}}_{\boldsymbol{i}},{\boldsymbol{d}}_{\boldsymbol{j}}\right)=\mathbf{0}\).
Disease semantic similarity
The hierarchy system of MeSH descriptors is informative to offer semanticbased taxonomic categorization for various human diseases. For example, the MeSH ID of overnutrition (C18.654.726) shares the same prefix with its subtype obesity’s (C18.654.726.500). Accordingly, the relationships between any disease and others can be established by respective Directed Acyclic Graphs (DAGs) using the hierarchy of MeSH IDs [29]. Each disease has at least one MeSH ID which numerically represents its location in DAGs. Figure 6 illustrates the calculation process of disease semantic similarity. Empirically, the shorter path between the ancestor node d and the target node t, the higher weight value should be given. It can be formulated as follows:
where len(d, t) is the shortest path length between the ancestor node d and one of its descendant node t. For example in Fig. 6, d_{1} is the ancestor node of d_{3} and d_{6}. V_{d1}(d_{3}) = 1/(len(d_{1}, d_{3}) + 1) = 1/2 where len(d_{1}, d_{3}) = 1. Likewise, V_{d1}(d_{6}) = 1/(len(d_{1}, d_{6}) + 1) = 1/2 where len(d_{1}, d_{6}) = 1. In this way, a feature vector for d_{1} can be computed as V_{d1} = (1, 0, 1/2, 0, 0, 1/2). Then, we further calculate the semantic similarity of any two diseases d_{i} and d_{j} by cosine similarity measure:
where V_{di} and V_{dj} are the feature vectors of d_{i} and d_{j}, respectively.
Microbe similarity based on BLAST+ scores and MEGA7 evolutionary distance scores
Sequence similarity and evolutionary distancebased similarity are two effective measurements to examine the relatedness among microbes from different perspectives. The former reflects the degree of likeness between any two sequences while the latter refers to the divergence of their common ancestral sequence. Although the calculation of both are based on the same information source, i.e., 16S rRNA gene sequences, they do not have to be similar as a necessary condition. Basic Local Alignment Search Tool (BLAST) is a specific sequence similarity search program (http://www.ncbi.nlm.nih.gov/blast) [30]. We use its variant BLAST+ [31] to compare a targeted 16S rRNA gene sequence of each target microbe against the sequences of other microbes as a nucleotide sequence database in turns. Identity is an important glossary of BLAST+ to measure the extent to which two (nucleotide or amino acid) sequences are invariant in an alignment. In this work, identity is used for measuring microbe sequence similarity. In this way, we define a matrix as Iden of size nm×nm to store the identity values yielded by the alignment. Then the microbe sequence similarity denoted as MSS is calculated by normalizing Iden matrix as follows:
It is noted that, among 292 investigated microbes, five microbe have no available 16S rRNA gene sequences in NCBI (denoted as “unavailable” to “FASTA filename” in Additional file 3). We simply set their sequence similarities as the mean of the rest available.
The calculation of microbe evolutionary distancebased similarity is mainly based on the molecular evolutionary genetics analysis of MEGA 7 [32] (http://www.megasoftware.net/). The evolutionary distance between any two sequences is measured by the number of nucleotide substitutions involved. First, Clustal W is used to perform multiple sequences alignment [33]. To reduce the disturbance caused by the gaps, all sequences are trimmed down to the shortest size by removing terminal redundancy at 5′ and 3′ terminus. Then the option of complete deletion option is set to address the issues of gaps and missing data. We utilized pdistance model to measure the evolutionary distances based on substitutions (including transitions and transversions). pdistance [34] for nucleotide sequences is written as:
where n_{d} refers to the number of different nucleotides between two tested sequences and n is the total number of nucleotides examined. The higher value of evolutionary distances denotes the higher evolutionary diversity. The evolutionary distances are subtracted from 1 and the result is denoted by a matrix as ED of size nm×nm. Similarly, the microbe evolutionary distancebased similarity (MES) is also normalized as follows:
For those microbes without available 16S rRNA gene sequences, their evolutionary distancebased similarities are also set to the overall mean level. If the unavailable microbe list increases, performance degeneration is inevitable to happen. We can address this problem based on the know microbedisease associations by exploiting the implicit information from the topological network structure. According to the previous works [22, 35], Gaussian interaction profile kernel similarity and local similaritybased methods (e.g., the Jaccard index and Salton index) can be applied to calculate the biological functionbased similarity of those microbes without available 16 s rRNA gene sequences. Finally, we empirically merge MSS and MES to represent the final microbe similarity S_{m}:
Group preference based Bayesian diseaseoriented ranking
Based on the previous work [36, 37] in recommender system, the pointwise association assumption, i.e., considering all known microbedisease associations as “interactions” and unknown ones as “no interactions”, could mislead the learning process. However, the pairwise association assumption over two microbes could relax the pointwise preference assumption by treating that a disease d is more probably related to a microbe i than a microbe j represented as \({\hat{\boldsymbol{r}}}_{\boldsymbol{di}}>{\hat{\boldsymbol{r}}}_{\boldsymbol{dj}}\) where i belongs to the known association with d whereas j does not. Empirically, this assumption generates better prediction results than the pointwise assumption. Inspired by this idea [38, 39], we present group pairwise preference based Bayesian diseaseoriented ranking for prioritizing the most potential pathogenic microbes (see Fig. 7).
Based on the known microbedisease associations for a typical disease d, we first define the overall likelihood of pairwise preferences (LPP) among the whole set of microbe (denoted as \(\mathbf{\mathcal{M}}\)):
where (d, i) ≻ (d, j) means that disease d is more potentially associated with microbe i than microbe j. And δ((d, i) ≻ (d, j)) is Bernoulli distribution over the binary random variable. To better approximate the diseaseoriented pairwise preference over two microbes, the Bayesian diseaseoriented ranking method is adopted to simplify the term LPP(d) as follows [38]:
here \(\boldsymbol{i}\boldsymbol{\in }{\mathbf{\mathcal{M}}}_{\boldsymbol{d}}^{\boldsymbol{tr}}\) indicates the known microbedisease association pair (i, d) in training data and \(\boldsymbol{j}\boldsymbol{\in }{\mathbf{\mathcal{M}}}^{\boldsymbol{tr}}\backslash {\mathbf{\mathcal{M}}}_{\boldsymbol{d}}^{\boldsymbol{tr}}\) represents the microbedisease association pair (j, d) is unknown. We assume that the group preference is an overall preference score of a microbe group on a disease. If microbedisease pair (i, d) is a known association but (i, b) is not, the group preference can be represented as:
It can assume that the group preference of \(\mathbf{\mathcal{G}}\subseteq {\mathbf{\mathcal{M}}}_{\boldsymbol{d}}^{\boldsymbol{tr}}\) on a disease d is probably stronger than the individual preference of microbe i on disease b. To learn the unified effect of both individual preference and group preference, we linearly combine them as follows:
where \({\hat{\boldsymbol{r}}}_{\mathbf{\mathcal{G}}\boldsymbol{id}}=\boldsymbol{\rho} {\hat{\boldsymbol{r}}}_{\mathbf{\mathcal{G}}\boldsymbol{d}}+\left(\mathbf{1}\boldsymbol{\rho} \right){\hat{\boldsymbol{r}}}_{\boldsymbol{id}}\) is the combined preference of individual preference \({\hat{\boldsymbol{r}}}_{\boldsymbol{id}}\) and group preference \({\hat{\boldsymbol{r}}}_{\mathbf{\mathcal{G}}\boldsymbol{d}}\) and parameter ρ controls the weight of two different preferences from 0 to 1. We equally set ρ to 0.5 in this study. In this way, we can define group Bayesian diseaseoriented ranking (GBDR) analogously to how we define in Eq. (10) as follows:
where \(\mathbf{\mathcal{G}}\subseteq {\mathbf{\mathcal{M}}}_{\boldsymbol{d}}^{\boldsymbol{tr}}\). \(\boldsymbol{d}\boldsymbol{\in }{\mathbf{\mathcal{D}}}_{\boldsymbol{i}}^{\boldsymbol{tr}}\) means disease d has an interaction with microbe i in training data. Likewise, \(\boldsymbol{b}\boldsymbol{\in }{\mathbf{\mathcal{D}}}^{\boldsymbol{tr}}\backslash {\mathbf{\mathcal{D}}}_{\boldsymbol{i}}^{\boldsymbol{tr}}\) means disease b has an unknown interaction with microbe i. For given two microbes i and j, the joint likelihood is simply approximated via multiplication like GBDR(i, j) ≈ GBDR(i) × GBDR(j). Therefore, the overall likelihood for all microbes and all diseases can be formulated as:
where \(\mathbf{\mathcal{G}}\subseteq {\mathbf{\mathcal{M}}}_{\boldsymbol{d}}^{\boldsymbol{tr}}\). Given \(\boldsymbol{\Theta} =\left\{{\boldsymbol{U}}_{\boldsymbol{i}}\boldsymbol{\in}{\mathbb{R}}^{\mathbf{1}\times \boldsymbol{nd}},{\boldsymbol{V}}_{\boldsymbol{d}}\boldsymbol{\in}{\mathbb{R}}^{\mathbf{1}\times \boldsymbol{nd}},{\boldsymbol{b}}_{\boldsymbol{d}}\in \mathbb{R},\boldsymbol{i}\boldsymbol{\in }{\mathbf{\mathcal{M}}}^{\boldsymbol{tr}},\boldsymbol{d}\boldsymbol{\in }{\mathbf{\mathcal{D}}}^{\boldsymbol{tr}}\ \right\}\) is a set of model parameters to be learned, one common way to estimate the model parameters is to minimize the loglikelihood function of GBDR as follows,
We use stochastic gradient descent (SGD) algorithm to optimize the object function in Eq.(15). Before using the algorithm of SGD, a subset of microbes is randomly sampled to form a microbe group \(\mathbf{\mathcal{G}}\). In this way, for each random sampling, it includes a microbe i, a disease d, a disease b and a microbe group \(\mathbf{\mathcal{G}}\) where \(\mathbf{i}\mathbf{\in}\mathbf{\mathcal{G}}\). The objective function in Eq. (15) can be written as:
where \({\hat{\boldsymbol{r}}}_{\mathbf{\mathcal{G}}\boldsymbol{id};\boldsymbol{ib}}={\hat{\boldsymbol{r}}}_{\mathbf{\mathcal{G}}\boldsymbol{id}}{\hat{\boldsymbol{r}}}_{\boldsymbol{ib}}\), and α_{u}, α_{v} and β_{v} are the regularization weights ranging from 0.0001 to 0.1. U_{j} ∈ ℝ^{1 × z} is the latent feature vector for microbe j, where z is the number of latent features. V_{d} ∈ ℝ^{1 × z} and b_{d} are disease d’s latent feature vector and bias values, respectively. We can then update the model parameters Θ as:
where the learning rate γ is set to 0.01 in this study via parameter tuning. The learning process is repeatedly trained until it reaches the maximum iterations (default: 100). The predicted score of microbe i on disease d is calculated via \({\hat{\boldsymbol{r}}}_{\boldsymbol{d}\boldsymbol{i}}={\boldsymbol{V}}_{\boldsymbol{d}}^{\boldsymbol{T}}{\boldsymbol{U}}_{\boldsymbol{i}}+{\boldsymbol{b}}_{\boldsymbol{d}}\). The calculation procedure of GBDR is described by the pseudocode in Algorithm 1.
Then we calculate \({\hat{\boldsymbol{r}}}_{\boldsymbol{di}}\) with the integrated microbe similarity S_{m} and disease semantic similarity S_{d}. For an unknown diseasemicrobe pair (d_{i}, m_{j}), \(\boldsymbol{d}^{\prime}\mathbf{\in}{\mathbf{\mathcal{D}}}_{{\boldsymbol{m}}_{\boldsymbol{j}}}^{\boldsymbol{tr}}\) means a set of diseases having associations with microbe m_{j} in training data and \(\boldsymbol{m}^{\prime}\mathbf{\in}{\mathbf{\mathcal{M}}}_{{\boldsymbol{d}}_{\boldsymbol{i}}}^{\boldsymbol{tr}}\) indicates a set of microbes having associations with disease d_{i}. Finally, the final prediction score of d_{i} on m_{j} could be calculated by adding the mean values as follows:
where parameters α_{d} and α_{m} control the weights of S_{m} and S_{d} respectively. In this way, \({\hat{\boldsymbol{r}}}_{{\boldsymbol{d}}_{\boldsymbol{i}}{\boldsymbol{m}}_{\boldsymbol{j}}}\) is the predicted probability score of the unknown diseasemicrobe pair (d_{i}, m_{j}) ranging from − 1 to 1. The higher value \({\hat{\boldsymbol{r}}}_{{\boldsymbol{d}}_{\boldsymbol{i}}{\boldsymbol{m}}_{\boldsymbol{j}}}\), the higher probability of the potential association between disease d and microbe j. Then the model calculate \(\hat{\boldsymbol{r}}\) for each unknown microbedisease association. Finally, the potential microbedisease associations can be predicted by ranking the predicted probability scores. The total time complexity of the proposed model is \(O\left( Tnm\left\mathbf{\mathcal{G}}\rightz\right)\).
Availability of data and materials
The datasets used and/or analysed in this study can be downloaded from the following public databases: the dataset of 16S rRNA partial or complete gene sequences from the National Center for Biotechnology Information (NCBI) repository (https://www.ncbi.nlm.nih.gov/); the dataset of MeSH descriptors from the Nation Library of Medicine (NLM) repository (https://www.nlm.nih.gov/mesh/meshhome.html); the dataset of known microbedisease associations from the Human MicrobeDisease Association Database (HMDAD) repository (http://www.cuilab.cn/hmdad).
Abbreviations
 rRNA:

ribosomal RNA
 IBD:

inflammatory bowel disease
 OTUs:

operational taxonomic units
 LOOCV:

leaveoneout cross validation
 kfold CV:

kfold cross validation
 ROC:

receiver operating characteristic
 AUC:

area under ROC curve
 SVD:

singular value decomposition
 CF:

collaborative filtering
 CRC:

colorectal carcinoma
 NCBI:

National Centre for Biotechnology Information
 MeSH:

Medical Subject Headings
 NLM:

Nation Library of Medicine
 DAGs:

Directed Acyclic Graphs
 BLAST:

Basic Local Alignment Search Tool
 SGD:

stochastic gradient descent
References
Clemente JC, Ursell LK, Parfrey LW, Knight R. The impact of the gut microbiota on human health: an integrative view. Cell. 2012;148(6):1258–70.
Sender R, Fuchs S, Milo R. Are we really vastly outnumbered? Revisiting the ratio of bacterial to host cells in humans. Cell. 2016;164(3):337–40.
Wang B, Yao M, Lv L, Ling Z, Li L. The human microbiota in health and disease. Engineering. 2017;3(1):71–82.
Tilg H, Kaser A. Gut microbiome, obesity, and metabolic dysfunction. J Clin Invest. 2011;121(6):2126–32.
Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13(9):R79.
Savitz LD. The human microbiota: the role of microbial communities in health and disease. Acta Biologica Colombiana. 2016;21(1):5–15.
Donia MS, Cimermancic P, Schulze CJ, Wieland Brown LC, Martin J, Mitreva M, et al. A systematic analysis of biosynthetic gene clusters in the human microbiome reveals a common family of antibiotics. Cell. 2014;158(6):1402–14.
Davenport ER, MizrahiMan O, Michelini K, Barreiro LB, Ober C, Gilad Y. Seasonal variation in human gut microbiome composition. PLoS One. 2014;9(3):e90731.
Mason MR, Preshaw PM, Nagaraja HN, Dabdoub SM, Rahman A, Kumar PS. The subgingival microbiome of clinically healthy current and never smokers. ISME J. 2015;9(1):268–72.
Manichanh C, RigottierGois L, Bonnaud E, Gloux K, Pelletier E, Frangeul L, et al. Reduced diversity of faecal microbiota in Crohn's disease revealed by a metagenomic approach. Gut. 2006;55(2):205–11.
Thibault R, Blachier F, DarcyVrillon B, de Coppet P, Bourreille A, Segain JP. Butyrate utilization by the colonic mucosa in inflammatory bowel diseases: a transport deficiency. Inflamm Bowel Dis. 2010;16(4):684–95.
Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z. LWFQZip 2: a parallelized referencebased compression of FASTQ files. BMC Bioinformatics. 2017;18(1):179.
Huang ZA, Huang YA, You ZH, Zhu Z, Sun Y. Novel link prediction for largescale miRNAlncRNA interaction network in a bipartite graph. BMC Med Genet. 2018;11(6):113.
Huang ZA, Huang YA, You ZH, Zhu Z, Yu CQ, Huang W, Guo J: Predicting lncRNAmiRNA interaction via graph convolution autoencoder. Front Genetics 2019, 10:758.
Hartman AL, Riddle S, McPhillips T, Ludascher B, Eisen JA: Introducing W.A.T.E.R.S.: a workflow for the alignment, taxonomy, and ecology of ribosomal sequences. BMC Bioinformatics 2010, 11:317.
Ma W, Zhang L, Zeng P, Huang C, Li J, Geng B, et al. An analysis of human microbedisease associations. Brief Bioinform. 2017;18(1):85–97.
Huang YA, You ZH, Li LP, Huang ZA, Xiang LX, Li XF, et al. EPMDA: an expressionprofile based computational model for microRNAdisease association prediction. Oncotarget. 2017;8(50):87033.
Huang ZA, Zhang J, Zhu Z, Wu EQ, Tan KC: Identification of Autistic Risk Candidate Genes and Toxic Chemicals via Multilabel Learning. IEEE Transactions on Neural Networks and Learning Systems 2020.
Sun Y, Zhu Z, You ZH, Zeng Z, Huang ZA, Huang YA. FMSM: a novel computational model for predicting potential miRNA biomarkers for various human diseases. BMC Syst Biol. 2018;12(9):121.
Huang YA, You ZH, Gao X, Wong L, Wang L. Using weighted sparse representation model combined with discrete cosine transformation to predict proteinprotein interactions from protein sequence. Biomed Res Int. 2015;2015:902198.
Huang YA, You ZH, Chen X, Yan GY. Improved proteinprotein interactions prediction via weighted sparse representation model combining continuous wavelet descriptor and PseAA composition. BMC Syst Biol. 2016;10(Suppl 4):120.
Huang Y, You Z, Chen X. A systematic prediction of drugtarget interactions using molecular fingerprints and protein sequences. Curr Protein Pept Sci. 2018;19(5):468–78.
Coenye T, Vandamme P. Intragenomic heterogeneity between multiple 16S ribosomal RNA operons in sequenced bacterial genomes. FEMS Microbiol Lett. 2003;228(1):45–9.
Chen X, Huang YA, You ZH, Yan GY, Wang XS. A novel approach based on KATZ measure to predict associations of human microbiota with noninfectious diseases. Bioinformatics. 2017;33(5):733–9.
Huang ZA, Chen X, Zhu Z, Liu H, Yan GY, You ZH, et al. PBHMDA: pathbased human microbedisease association prediction. Front Microbiol. 2017;8:233.
Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43.
Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated nonredundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database issue):D61–5.
Lipscomb CE. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–6.
Wang D, Wang J, Lu M, Song F, Cui Q: Inferring the human microRNA functional similarity and functional network based on microRNAassociated diseases. Bioinformatics (Oxford, England) 2010, 26(13):1644–1650.
Johnson M, Zaretskaya I, Raytselis Y, Merezhuk Y, McGinnis S, Madden TL: NCBI BLAST: a better web interface. Nucleic acids research 2008, 36(Web Server issue):W5–9.
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421.
Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol Biol Evol. 2016;33(7):1870–4.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8.
Thomas RH. Molecular evolution and Phylogenetics. Heredity. 2001;86(86):385–5.
Huang ZA, Huang YA, You ZH, Zhu Z, Sun Y. Novel link prediction for largescale miRNAlncRNA interaction network in a bipartite graph. BMC Med Genet. 2018;11(6):17–27.
Hu Y, Koren Y, Volinsky C: Collaborative Filtering for Implicit Feedback Datasets. In: Eighth IEEE International Conference on Data Mining: 2009. 263–272.
Pan R, Zhou Y, Cao B, Liu NN, Lukose R, Scholz M, Yang Q: OneClass Collaborative Filtering. In: Eighth IEEE International Conference on Data Mining: 2008. 502–511.
Rendle S, Freudenthaler C, Gantner Z, SchmidtThieme L: BPR: Bayesian personalized ranking from implicit feedback. In: Conference on Uncertainty in Artificial Intelligence: 2009. 452–461.
Pan W, Chen L: GBPR: group preference based Bayesian personalized ranking for oneclass collaborative filtering. In: International Joint Conference on Artificial Intelligence: 2013. 2691–2697.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 22 Supplement 1, 2021: Proceedings of the 2019 International Conference on Intelligent Computing (ICIC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume22supplement1.
Funding
YAH and JQL were supported by the National Key RD Program of China under Grants 2020YFA0908700. ZAH was supported by the National Natural Science Foundation of China under Grant No. 61702424. ZHY was supported by the National Natural Science Foundation of China under Grant No. 61572506.
Author information
Authors and Affiliations
Contributions
ZAH & YAH conceived the algorithm, developed the program, and wrote the manuscript. ZHY and JQL carried out analyses, designed and performed experiments. LW, CQY & HCY prepared the data sets, carried out analyses, helped with manuscript editing and program design. All authors read and approved the final manuscript.
Authors’ information
YuAn Huang obtained his Ph.D. degree in Department of Computing at the Hong Kong Polytechnic University. He is currently an Associate Professor with Department of Information Engineering, Xijing University, Xi’an, China.
ZhiAn Huang obtained his PhD degree in the City University of Hong Kong. His current research interests mainly focus on medical image analysis, machine learning, and medical data mining. He is currently a Research Assistant Professor with Center for Computer Science and Information Technology, City University of Hong Kong Dongguan Research Institute, Dongguan, China.
JianQiang Li obtained his PhD degree in South China University of Technology. He is now working as the vice dean of College of Computer & Software Engineering, Shenzhen University, Shenzhen, China.
ZhuHong You obtained his Ph.D. degree in control science and engineering from University of Science & Technology of China (USTC). He is currently a professor with Department of Information Engineering, Xijing University, Xi’an, China.
Lei Wang received the Ph.D. degree from China University of Mining and Technology. He is currently working as a Postdoc research fellow in the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Science, Ürümqi, China.
HaiCheng Yi is currently a PhD student at the Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences. He works on machine learning, network analysis with applications in the field of bioinformatics.
ChangQing Yu is now an associate professor with Department of Information Engineering, Xijing University, Xi’an, China. His research interests include machine learning, network analysis.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
The case study of colorectal carcinoma.
Additional file 2.
The prediction list of most likely pathogenic microbes is sorted by final prediction scores.
Additional file 3.
Names of all investigated microbes and diseases, and known human microbedisease associations obtained from HMDAD database.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Huang, YA., Huang, ZA., Li, JQ. et al. GBDR: a Bayesian model for precise prediction of pathogenic microorganisms using 16S rRNA gene sequences. BMC Genomics 22 (Suppl 1), 916 (2021). https://doi.org/10.1186/s1286402208423w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286402208423w