- Open Access
Meta-network: optimized species-species network analysis for microbial communities
BMC Genomicsvolume 20, Article number: 187 (2019)
The explosive growth of microbiome data provides ample opportunities to gain a better understanding of the microbes and their interactions in microbial communities. Given these massive data, optimized data mining methods become important and necessary to perform deep and comprehensive analysis. Among the various priorities for microbiome data mining, the examination of species-species co-occurrence patterns becomes one of the key themes in urgent need.
Hence, in this work, we propose the Meta-Network framework to lucubrate the microbial communities. Rooted in loose definitions of network (two species co-exist in a certain samples rather than all samples) as well as association rule mining (mining more complex forms of correlations like indirect correlation and mutual information), this framework outperforms other methods in restoring the microbial communities, based on two cohorts of microbial communities: (a) the loose definition strategy is capable to generate more reasonable relationships among species in the species-species co-occurrence network; (b) important species-species co-occurrence patterns could not be identified by other existing approaches, but could successfully generated by association rule mining.
Results have shown that the species-species co-occurrence network we generated are much more informative than those based on traditional methods. Meta-Network has consistently constructed more meaningful networks with biologically important clusters, hubs, and provides a general approach towards deciphering the species-species co-occurrence networks.
Network-based approaches are gaining momentum as one of the most helpful tools for the analysis of microbial community structure. They offer new methodological and biological insights to investigate species interactions. Many microorganisms co-exist by interacting with each other and effectively exert various functions . In addition, due to currently insufficient understanding of the community structure, the mounting volume of metagenomics data limits the traditional network analysis to recover the real relationships in bacterial community . Hence finding out intricate yet important associations (for example, to explore the cyclical process of a substance or element in a bacterial community ) becomes increasingly challenging for traditional methods.
To address these limitations, we propose Meta-Network to establish the species-species co-occurrence network. The loose definition method is introduced first to recover more correlations before correlation calculation. Then we utilize the FS-Weight and PCA-PMI (Part Mutual Information adjusted by Path Consistency Algorithm) methods to explore the indirect and non-linear correlations, respectively. To investigate the optimized species-species co-occurrence network, systematic evaluation is investigated to discover meaningful biological implications in the species-species co-occurrence network.
Construction workflow of species-species co-occurrence network
While direct and linear correlations were calculated by Pearson and Spearman , limitations of these algorithms are developed in quantifying the relationship among species pairs: First, the correlations are calculated when pairwise species exist in all samples, which fail to recover the microbial communities owing to the sparse distribution of species. Second, Pearson and Spearman algorithms only capture direct and linear correlations for pairwise species, but cannot detect complex ecology correlations patterns . In response to these observations, we applied several association rule mining algorithms to uncover complex correlations among species (for example, indirect correlation and non-linear correlations) (Fig. 1, a). First, we propose to recover correlations filtered by traditional methods, termed as the loose definition method (Fig. 1, a.2). Second, While direct relationships have been reported in many existing works in the context of intricate bacterial community, more complex correlations can be detected using association rule mining . Hence we develop two methods to detect complex forms of correlations like indirect correlations (FS-Weight) and non-linear correlations (PCA-PMI) (Fig. 1, a.3, and a.4).
Loose definition of species-species relationship (Fig. 1, a.1)
To measure the number of co-exist samples for pairwise species, we introduce the co-occurrence probability (Fig. 1, a.1). To be precise, we first convert the original abundance matrix into a presence-absence matrix, then we calculate the ratio of co-exist samples to all the samples as co-occurrence probability for each pairwise species (Fig. 2, a and b). When co-occurrence probability reaches above a user-defined threshold, the correlation for this pairwise species will be calculated (80% is used here as default, Fig. 2, c). It is worth mentioning that some genera pairs exist in a few number of samples but yield abnormally high correlation from Pearson and Spearman algorithms. However, no known literature and functionality can support this kind of correlation. Noise like this was filtered out based on the sample quantity set of loose definition.
Indirect relationships in the network (Fig. 1, a.2)
While direct relationships have been reported in many existing works in the context of intricate bacterial community, it is possible to detect more complex correlations using association rule mining . In bacterial community, bacteria which do not interact but share interaction partners may play roles in the same functional pathway . Hence, in our context, the FS-Weight method is applied to detect the indirect correlations (Fig. 1, a.2). FS-Weight measures the overlap between the pairwise species, and is originally designed to estimate association between direct and indirect correlations based on network structure .
For pairwise species, FS-Weight value between them was calculated in two steps. First, the direct correlations were calculated using a user-defined threshold for edge value. In this work, Pearson coefficient correlation was selected and the thresholds was set as 0.5 on genus level and 0.7 on OTU level based on previous research . Second, the network constructed by FS-Weight was applied to filter out less reliable correlations and to add meaningful indirect relationships. FS-Weight threshold was set to 0.7 on OTU level and 0.5 on genus level based on the comparison between different thresholds. The benchmark result for FS-Weight on genus level was shown at Additional file 1: Figure S1, A and B.
Nonlinear associations (Fig. 1, a.3)
In microbial communities, nonlinear relationships play important roles . Thus, to explore potentially nonlinear correlations, we adopt PCA-PMI method to explore non-linear correlations in microbial  (Fig. 2, d). PCA-PMI method calculate the partial information to measure the linear and non-linear relationship for each pairwise correlations in microbial community. PMI is defined as follows: assuming X and Y represent two one-dimensional variables representing abundance distribution in all the samples for species A and B respectively, and Z means an n-2 dimensional vector (n-2 > 0) representing other species abundance distribution in all the samples, the PMI between species x and y given indirect neighbor z is defined as below:
Where the Part independence of species x and y given indirect neighbor z is defined as:
In order-one PCA-PMI, only one species was considered as indirect neighbor between two species, and the Path Consistency Algorithm (PCA)  was applied to adjust the correlation distribution using a user-defined correlation threshold (0.02 as default). This threshold was set based on the comparison between different thresholds. The benchmark result for FS-Weight on genus level was shown at Additional file 1: Figure S1, C and D. The sparse matrix is defined as a matrix owing a large number of nodes and sparse species distribution, thus it is feasible to use PCA method to construct the species-species co-occurrence network . After all linear and non-linear correlations calculated, the network was updated by detecting higher orders of PMI (more species were set as indirect neighbor for two species). Higher orders of PMI are calculated to check the correlation reliability iteratively until no more edges are changed.
Identification of clusters and hubs in the network
Detecting co-occurrence patterns and their biological significance (Fig. 1, b) supply an important material to understand the microbial network. First, we perform cluster calculation and hub nodes detection based on density clustering. We apply the MCODE cluster algorithm to detect the potential clusters . Aligning cluster members against taxonomical or functional annotation databases can predict potential cluster functions and taxonomical compositions.
To identify hub nodes, we calculate the most connected nodes as candidates for hub nodes. These nodes are selected to perform Kruskal-Wallis test (test differences in network distribution before and after deleting hub node) to decide whether they are hub nodes. Finally, we interpret the clusters and hub nodes based on known taxonomical and functional databases and literatures.
Systematic evaluation methods
In order to compare Pearson, loose definition, FS-Weight, PCA-PMI algorithm, the correlation coefficient calculated by loose definition which is implemented by the loose definition plus Pearson method.
First, we calculate the global network alignment by MAGNA++  based on genetic algorithm. The alignment quality is measured by edge correctness (EC), induced conserved structure (ICS) and symmetric substructure score (S3) for each pair of networks. Then, after all nodes compared based on Jaccard index, we construct a similarity tree measured by the global similarity between any two networks been compared.
Secondly, these four algorithms were compared based on their topological structures including network similarity, global properties and local properties by compNet software . Furthermore, we analyze the local properties for all taxon based on the R package igraph (http://igraph.org/r/), including coreness, degree, eigenvector centrality, eccentricity, betweeness and degree centrality.
Thirdly, we examined the correlation distribution using the same set of species as target. Based on top 100 abundant genera in gut microbiome datasets, four networks are constructed to investigate their edge distributions, which reflect the correlation difference. Then network motif distribution was investigated by mfinder (version 1.2) . The motif size is set to 4, the query random network size set as 100, and other parameters are set as default. Network motif distribution was calculated in all sub networks and ranked by their frequency.
Results and discussions
Network construction based on human gut datasets
We applied the Meta-Network on a dataset of health young Chinese as our representation of human gut (MGP15838 in MG-RAST database) . It consists of 314 healthy young adults, covering 20 rural and urban cohorts from 7 ethnic groups and 9 provinces throughout China. 5,102,015 high-quality sequences were generated and QIIME (version 1.91)  was applied to process this dataset. Finally, we obtained in total 24, 125 microbial OTUs. On genus level, we identified in total 2124 genera in which 102 genera process the relative abundance above 0.1%.
We compared networks constructed by all four algorithms (Pearson, Loose Definition, FS-Weight and PCA-PMI) on genus level. First, we calculated the global properties and illustrated in Fig. 3, a. Based on Pearson algorithm, 31 genera and 172 correlations were detected. Based on loose definition method, 29 genera and 287 correlations were detected. Network constructed by FS-Weight method detected 38 nodes and 252 edges. In network constructed by PCA-PMI method, 44 nodes and 235 edges were detected. Based on loose definition algorithm, the network removed 2 genera and 30 correlations, in which 81.16% correlations are among genera counted for average relative abundance less than 0.1%. On the other hand, 145 correlations are added among the genera which dominate in the microbial community (average abundance over 0.1%). This result indicates the importance roles of high-abundance species in microbial community, which is consistence with previous researches . Furthermore, comparison between network constructed by loose definition method and FS-Weight and PCA-PMI methods shows that more complex forms of correlations have been detected by Meta-Network analysis. In our analysis workflow, both FS-Weight and PCA-PMI are calculated only based on Pearson Correlation coefficient. It would be interesting to see how it performs for network construction based on other pair-wise correlation analysis method under the Meta-Network workflow. Hence we applied the same workflow to construct the networks based on CCLasso , and the analysis results were provided in Additional file 1.
Network comparison based on human gut datasets
We performed a systematic evaluation to compare the network constructed by Pearson, loose definition, FS-Weight and PCA-PMI algorithms. Global network alignment was carried to compare the network constructed by these four algorithms on OTU level (Fig. 3, a). The alignment between networks constructed by Pearson correlation (100% co-occurrence probability) and loose definition method (80% co-occurrence probability) measured the co-occurrence probability optimization. Owing to a low EC and S3 score (0.63, 0.579, respectively), the two networks had a low match. Network constructed by FS-Weight and PCA-PMI methods had the highest quality score (S3 score: 0.982), indicating that the network constructed by FS-Weight and PCA-PMI methods had the highest similarity. Network alignment results indicate that: First, both two association rule mining methods show a low match to network constructed by Pearson method. Second, two different methods show a high similarity.
Based on the similarity tree based on Jaccard index, network constructed by Pearson method had a low similarity with other three networks (Fig. 3, b). Network constructed by loose-definition and FS-Weight had the shortest distance because FS-Weight employed loose-definition abundance as input to find indirect correlations and PCA-PMI based network had a similar structural composition compared to network constructed by FS-Weight method.
Based on their motif distributions, we compared these four networks to reflect homology relationship (Fig. 3, c). Network constructed by loose definition and FS-Weight method present similar network motif distributions, and this result suggested a high homology . Motifs detected in Pearson method had a different distribution with the others (Fig. 3, d).
Correlation distribution and subnetwork comparison based on human gut datasets
To investigate the edge distribution, four algorithms were compared based on the top 100 genera (Fig. 4, a). In network constructed by Pearson and Spearman algorithms, 52 and 63 correlations were identified among low abundance genera (illustrated as small node size). However, in network constructed by FS-Weight and PCA-PMI methods, most correlations were identified among the nodes counted as high abundance (over 0.1%). Furthermore, in view of the importance of abundance and function in microbial community, we can further speculate that these genera play important role in the gut microbiome .
In the network constructed by Pearson and Spearman algorithm (Fig. 4, b), Genera Lactocuccus and Granulicatella processed low abundance (average abundance 1.25e-5, 2.58e-4, respectively) and existed in 15 and 14 samples, respectively. In addition, a strong and significant positive correlation were detected between them (correlation coefficient 0.915, 0.823, respectively, both p-value < 0.01). No convincing literature and reported function (Lactocuccus: commonly identified as produce lactic acid, Granulicatella: commonly identified as potential pathogenic bacteria) could prove their relationships and both genera [19, 20]. These two correlations might be noise correlations and do not take part in this non-digestible carbohydrates degradation pathway.
More importantly, FS-Weight and PCA-PMI methods are capable to discover correlations undetected by the Pearson and Spearman algorithms (Fig. 4, d). For example, genus Syntrophomonas is reported as a fatty-acidoxidizing genus which cooperates with methanogens such as genus Methanosphaera and Methanobrevibacter . However, Pearson method failed to detect these correlations, probably due to the large heterogeneity (different kinds of methanogens in gut samples) in human gut dataset. On the contrary, the correlations between Syntrophomonas and methanogens could be clearly identified by FS-Weight and PCA-PMI methods (FS-Weight correlation: 0.873, 0.926; PCA-PMI correlation: 0.912, 0.915). The interaction mechanism was illustrated in Fig. 4, c. First, genera such as Bacteroides and Ruminococcus degrade the non-digestible carbohydrates into the short-chain fatty acid . Second, the Syntrophomonas and methanogens cooperate to transfer the short-chain fatty acid into methane and energy. Combining the network analysis results with literature review, we speculate that Bacteroides and Ruminococcus play the role, as indirect neighbor, between the Syntrophomonas and methanogens.
In network constructed by CCLasso, the indirect correlation pattern could also be found, which could again prove the indirect pattern and advantage of FS-Weight and PCA-PMI. The result is provided in Additional file 1.
Network construction and analysis based on the Tara oceans datasets
We applied the same workflow to Tara Oceans Project (PRJEB1787, also known as Project ERP001736 on EBI Metagenomics Portal). The processed nucleotide sequences from 245 experimental runs were publicly available, of which the volume exceeded 1.3 TB. Parallel-Meta (version 3.0)  was adopted to process this dataset to calculate the taxonomical abundance distribution on genus and OTU levels. Based on the Pearson algorithm, 154 nodes and 4289 edges were detected, and 6 clusters were identified by the MCODE cluster algorithm (Fig. 5, a). The largest cluster was mainly composed with the members of phylum Proteobacteria, which was confirmed as the largest phylum in the ocean bacterial community .
Figure 5, b illustrated the network constructed by FS-Weight algorithm, in which 190 nodes and 8135 edges were detected, including 3910 indirect correlations. Ten clusters were calculated and 6 of them were composed with similar taxonomical composition compared to the network constructed by Pearson method. New cluster members were identified as common members in the ocean bacterial community . The network constructed by PCA-PMI method was composed with 237 nodes and 5861 edges, with a clustering coefficient of 0.666 (Fig. 5, c). Nitrosopumilus and Candidatus_Scalindua were detected only in network constructed by PCA-PMI method and were identified as common members in the ocean environment .
These results have again, proved that the Meta-Network workflow has the ability to discover more complex formats of correlations. Moreover, Networks constructed by two different methods (FS-Weight and PCA-PMI) shared many identical network members and clusters, indicating that both methods portrayed the complex formats of correlations in microbial communities.
Species-species co-occurrence network is becoming one of the emerging fronts for microbiome research, largely due to the important ecological patterns it could reveal. However, previous methods are limited by only detecting the direct correlations among pairwise species. This work presents the Meta-Network method that optimized the construction, analysis and interpretation of the species-species co-occurrence network, focusing on three critical processes for species-species correlation analysis: we first employ loose definition to recover correlations missed by strict co-occurrence probability before calculating correlations. Based on these candidate correlations, we also develop two association rule mining methods for the recurrence of the real bacterial community network: a graph-based method FS-Weight to detect indirect correlations, PCA-PMI method to detect indirect correlations. Therefore, we believe that Meta-Network provides a general approach towards deciphering the species-species co-occurrence networks.
This method could be improved in several ways: in line with current gene expression network inferences, we can also improve its high-level associations (i.e., associations between two clusters of species). A complete species-species network, which provides the full picture of the regulation profile in the community, should also include viruses. All of these will be considered in our future work.
Operational Taxonomic Unit
Path Consistency Algorithm
Part Mutual Information
Hurwitz BL, Westveld AH, Brum JR, Sullivan MB. Modeling ecological drivers in marine viral communities using comparative metagenomics and network analyses. Proc Natl Acad Sci U S A. 2014;111:10714–9.
Weiss S, Van Treuren W, Lozupone C, Faust K, Friedman J, Deng Y, Xia LC, Xu ZZ, Ursell L, Alm EJ, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J. 2016;10:1669–81.
Vasquez-Cardenas D, van de Vossenberg J, Polerecky L, Malkin SY, Schauer R, Hidalgo-Martinez S, Confurius V, Middelburg JJ, Meysman FJ, Boschker HT. Microbial carbon metabolism associated with electrogenic Sulphur oxidation in coastal sediments. ISME J. 2015;9:1966–78.
de Winter JC, Gosling SD, Potter J. Comparing the Pearson and Spearman correlation coefficients across distributions and sample sizes: a tutorial using simulations and empirical data. Psychol Methods. 2016;21:273–90.
Fang H, Huang C, Zhao H, Deng M. CCLasso: correlation inference for compositional data through lasso. Bioinformatics. 2015;31:3172–80.
Zhao J, Zhou Y, Zhang X, Chen L. Part mutual information for quantifying direct associations in networks. Proc Natl Acad Sci U S A. 2016;113:5130–5.
Price MN, Deutschbauer AM, Skerker JM, Wetmore KM, Ruths T, Mar JS, Kuehl JV, Shao W, Arkin AP. Indirect and suboptimal control of gene expression is widespread in bacteria. Mol Syst Biol. 2013;9:660.
Lammel DR, Barth G, Ovaskainen O, Cruz LM, Zanatta JA, Ryo M, de Souza EM, Pedrosa FO. Direct and indirect effects of a pH gradient bring insights into the mechanisms driving prokaryotic community structures. Microbiome. 2018;6:106.
Chua HN, Sung WK, Wong L. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics. 2006;22:1623–30.
Barberan A, Bates ST, Casamayor EO, Fierer N. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012;6:343–51.
Chaffron S, Rehrauer H, Pernthaler J, von Mering C. A global network of coexisting microbes from environmental and whole-genome sequence data. Genome Res. 2010;20:947–59.
Marino S, Baxter NT, Huffnagle GB, Petrosino JF, Schloss PD. Mathematical modeling of primary succession of murine intestinal microbiota. Proc Natl Acad Sci U S A. 2014;111:439–44.
Zhou C, Zhang SW, Liu F. An ensemble method for reconstructing gene regulatory network with jackknife resampling and arithmetic mean fusion. Int J Data Min Bioinform. 2015;12:328–42.
Zhang X, Zhao XM, He K, Lu L, Cao Y, Liu J, Hao JK, Liu ZP, Chen L. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2012;28:98–104.
Wang J, Zhong J, Chen G, Li M, Wu FX, Pan Y. ClusterViz: a Cytoscape APP for cluster analysis of biological network. IEEE/ACM Trans Comput Biol Bioinform. 2015;12:815–22.
Vijayan V, Saraph V, Milenkovic T. MAGNA++: maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics. 2015;31:2409–11.
Kuntal BK, Dutta A, Mande SS. CompNet: a GUI based tool for comparison of multiple biological interaction networks. BMC Bioinformatics. 2016;17:185.
Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, Alon U. Network motifs: simple building blocks of complex networks. Science. 2002;298:824–7.
Zhang J, Guo Z, Xue Z, Sun Z, Zhang M, Wang L, Wang G, Wang F, Xu J, Cao H, et al. A phylo-functional core of gut microbiota in healthy young Chinese cohorts across lifestyles, geography and ethnicities. ISME J. 2015;9:1979–90.
Kuczynski J, Stombaugh J, Walters WA, Gonzalez A, Caporaso JG, Knight R. Using QIIME to analyze 16S rRNA gene sequences from microbial communities. Curr Protoc Microbiol. 2012, Chapter 1:Unit;27:1E–5.
Rivett DW, Bell T. Abundance determines the functional role of bacterial phylotypes in complex communities. Nat Microbiol. 2018;3:767–72.
Alon U. Network motifs: theory and experimental approaches. Nat Rev Genet. 2007;8:450–61.
Si J, You HJ, Yu J, Sung J, Ko G. Prevotella as a hub for vaginal microbiota under the influence of host genetics and their association with obesity. Cell Host Microbe. 2017;21:97–105.
Hatamoto M, Imachi H, Fukayo S, Ohashi A, Harada H. Syntrophomonas palmitatica sp. nov., an anaerobic, syntrophic, long-chain fatty-acid-oxidizing bacterium isolated from methanogenic sludge. Int J Syst Evol Microbiol. 2007;57:2137–42.
Neyrinck AM, Delzenne NM. Potential interest of gut microbial changes induced by non-digestible carbohydrates of wheat in the management of obesity and related disorders. Curr Opin Clin Nutr Metab Care. 2010;13:722–8.
Jing G, Sun Z, Wang H, Gong Y, Huang S, Ning K, Xu J, Su X. Parallel-META 3: comprehensive taxonomical and functional analysis platform for efficient comparison of microbial communities. Sci Rep. 2017;7:40371.
Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, Chaffron S, Ignacio-Espinosa JC, Roux S, Vincent F, et al. Ocean plankton. Determinants of community structure in the global plankton interactome. Science. 2015;348:1262073.
Feng XM, Mo YX, Han L, Nogi Y, Zhu YH, Lv J. Qipengyuania sediminis gen. Nov., sp. nov., a member of the family Erythrobacteraceae isolated from subterrestrial sediment. Int J Syst Evol Microbiol. 2015;65:3658–65.
Shehzad A, Liu J, Yu M, Qismat S, Liu J, Zhang XH. Diversity, community composition and abundance of Anammox Bacteria in sediments of the north marginal seas of China. Microbes Environ. 2016;31:111–20.
We thank Torsten Juelich for linguistic assistance during the preparation of this manuscript.
This work is partially supported by National Science Foundation of China grant 31871334 and 31671374, Ministry of Science and Technology’s high-tech (863) grant 2018YFC0910502, and Sino-German Research Center grant GZ878. Publication of this article was sponsored by Ministry of Science and Technology’s high-tech (863) grant 2018YFC0910502.
Availability of data and materials
The source code and documentation are accessible at http://www.microbioinformatics.org/software/Meta-Network.htm.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-2.
Ethics approval and consent to participate
Consent for publication
The authors declare no competing financial interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figure S1. Threshold selection for FS-Weight and PCA-PMI method. (A) Global network properties for networks constructed by FS-Weight method. (DOCX 354 kb)