 Methodology article
 Open Access
 Published:
A highly adaptive microbiomebased association test for survival traits
BMC Genomics volume 19, Article number: 210 (2018)
Abstract
Background
There has been increasing interest in discovering microbial taxa that are associated with human health or disease, gathering momentum through the advances in nextgeneration sequencing technologies. Investigators have also increasingly employed prospective study designs to survey survival (i.e., timetoevent) outcomes, but current itembyitem statistical methods have limitations due to the unknown true association pattern. Here, we propose a new adaptive microbiomebased association test for survival outcomes, namely, optimal microbiomebased survival analysis (OMiSA). OMiSA approximates to the most powerful association test in two domains: 1) microbiomebased survival analysis using linear and nonlinear bases of OTUs (MiSALN) which weighs rare, midabundant, and abundant OTUs, respectively, and 2) microbiome regressionbased kernel association test for survival traits (MiRKATS) which incorporates different distance metrics (e.g., unique fraction (UniFrac) distance and BrayCurtis dissimilarity), respectively.
Results
We illustrate that OMiSA powerfully discovers microbial taxa whether their underlying associated lineages are rare or abundant and phylogenetically related or not. OMiSA is a semiparametric method based on a variancecomponent score test and a resampling method; hence, it is free from any distributional assumption on the effect of microbial composition and advantageous to robustly control type I error rates. Our extensive simulations demonstrate the highly robust performance of OMiSA. We also present the use of OMiSA with real data applications.
Conclusions
OMiSA is attractive in practice as the true association pattern is unpredictable in advance and, for survival outcomes, no adaptive microbiomebased association test is currently available.
Background
The human microbiota is the totality of all microorganisms living in and on the human body [1] and its role in human health and disease has been increasingly studied [2,3,4,5]. Human microbiota studies have been accelerated by the advent of nextgeneration sequencing technologies which enabled an unbiased characterization of all microorganisms, often by targeting the bacterial 16S ribosomal RNA (rRNA) gene [6, 7]. Diverse microorganisms can be identified based on sequence similarity to known 16S rRNA genes and classified into operational taxonomic units (OTUs) [8]. The OTUs are characterized by their estimated abundance (e.g., read count or relative abundance) and phylogenetic tree structure (i.e., taxonomical and evolutionary relationships). Accordingly, various microbial diversity metrics on the basis of microbial abundance and phylogenetic tree information have been surveyed in microbiomebased association studies [9]. The data are also largescale including numerous OTUs (e.g., hundreds to thousands) with the presence of a long tail of rare microorganisms. We now pursue further discovery of associated microbial taxa for more integrative assessments about the root causes of maladies.
Recently, a number of microbiomebased association tests have been introduced to survey the entire microbial community (e.g., bacterial kingdom) and microbial taxa (e.g., phyla, classes, orders, families, genera, and species). As statistical power from massive univariate analyses for individual OTUs is considerably low due to the requisite multiple testing correction, here, our focus lies on association tests for microbial groups of multiple OTUs (i.e., the entire community (e.g., bacterial kingdom) and higherlevel taxa (e.g., phyla, classes, orders, families, and genera)). A majority of existing methods (e.g., LEfSe [10], STAMP [11], DESeq2 [12], and metagenomeSeqfit Zig [13]) relate aggregated microbial abundance for each taxon with health or disease outcome [14]. However, these methods are subject to a substantial loss of power as its underlying assumption  the same effect direction for all associated OTUs  is violated (e.g., within a taxon of interest, some OTUs are symbiotic, while others are pathogenic) [14]. To explain, when OTUs in a taxon of interest are all positively (or all negatively) associated with an outcome of interest (i.e., in the case of the same effect direction), their positive (or negative) association signals are amplified in their aggregated abundance, so that we can powerfully discover the association between the aggregated abundance and the outcome of interest. However, when OTUs in a taxon of interest are in mixed effect directions (i.e., some are positively, while others are negatively associated with an outcome of interest), their positive and negative association signals are canceled out in their aggregated abundance, so that we cannot discover any (positive or negative) association between the aggregated abundance and the outcome of interest. Detailed description and simulation studies have been addressed in [14]. Moreover, those aggregatebased methods do not utilize phylogenetic tree structure which considers taxonomical and evolutionary relationships among diverse microorganisms. As an alternative, distancebased analysis is popular and, for example, microbiome regressionbased kernel association test (MiRKAT) [15] is spotlighted in this context. MiRKAT incorporates diverse ecologically informative distance metrics (e.g., unique fraction (UniFrac) distance [16,17,18] and BrayCurtis dissimilarity [19]) into its kernel machine regression framework. As different distance metrics vary in the extent of the relative contributions from microbial abundance and phylogenetic tree information, they can be most accurate in different true underlying association patterns, respectively. However, prior knowledge about the true association pattern is limited and it is thus difficult to predict which distance metric is optimal in practice. The adaptive test of MiRKAT, called Optimal MiRKAT, approximates to an optimal test adaptively among multiple MiRKAT tests using different distance metric specifications; hence, in practice, Optimal MiRKAT is attractive. OMiAT [14] is a further adaptive test which approximates to an optimal test adaptively throughout the sum of powered score tests (SPU) [20] and MiRKAT tests. By including SPU tests in the search space, OMiAT robustly discovers rare, midabundant, and abundant associated lineages along with the functionality of Optimal MiRKAT.
There has also been increasing interest in discovering microbial taxa that are associated with survival (i.e., timetoevent) outcomes on the basis of prospective study designs (e.g., randomized clinical trials and prospective cohort studies) [21,22,23]. Survival outcomes are better informed by examining health or disease progression at multiple times over a lengthy period of followup. However, all of the above methods can handle only binary or continuous outcomes at a single time point. Currently, a remarkable association testing method in microbiomebased survival analysis is microbiome regressionbased kernel association test for survival traits (MiRKATS) [24]. As with MiRKAT, MiRKATS incorporates distance metrics into its kernel machine regression framework, but is designed to handle survival outcomes. Plantinga et al. [24] also demonstrated that MiRKATS has higher power than other distancebased approaches used in prior studies, such as Cox proportional hazards regression followed by principal coordinates analysis [21] or Ward’s agglomerative hierarchical clustering method [25].
However, MiRKATS has three critical issues. First, Plantinga et al. [24] reports that MiRKATS performs poorly when associated OTUs are rare in abundance. Microbiome data usually contain mostly rare OTUs and only few OTUs representing most of the abundance, especially for gut or oral microbiota which has greater microbial diversity. This indicates that if the test works only for few dominant associated OTUs, numerous rare or midabundant taxa are simply ignored. As a remedy, we introduce a new set of association tests, namely, microbiomebased survival analysis using linear and nonlinear bases of OTUs (MiSALN), which weigh rare, midabundant, and abundant OTUs, respectively, and its adaptive testing method, Optimal MiSALN (OMiSALN), to ensure a robust performance for OTUs with low or high abundance. Second, MiRKATS handles distance metrics onebyone and no adaptive testing method is available. The cherrypicking approach from multiple itembyitem MiRKATS tests cannot correctly control type I error rate or the requisite multiple testing correction can lead to a substantial loss of power. Therefore, we also introduce an adaptive testing method for MiRKATS, namely, Optimal MiRKATS (OMiRKATS). Third, as with MiRKAT, MiRKATS can assess only the entire community and is not currently applicable to higherlevel taxa. Thus, we extend its usability as a general microbial group analytic method.
Our major proposed method is a highly adaptive test, namely, optimal microbiomebased survival analysis (OMiSA), which creates an optimal test throughout multiple MiSALN and MiRKATS tests. As a result, OMiSA performs well whether the underlying associated lineages are rare or abundant by MiSALN and phylogenetically related or not by MiRKATS.
We now present the methodological details of the approach, and then provide extensive simulations and real data applications, before discussing limitations and other feasibilities.
Methods
This section is devoted to describe the methodological details of our proposed methods. Here, we first organize our new methods separated from existing methods in Fig. 1. That is, MiRKATS [24] for the individual use of different distance metrics is an existing method (blue letters, Fig. 1), and our methods (red letters, Fig. 1) are its adaptive test, OMiRKATS, MiSALN and its adaptive test, OMiSALN, and OMiSA. Again, OMiSA is our major proposed method, and the other individual and subadaptive tests are necessary to reach our final destination, OMiSA.
Models and notations
Suppose that there are n subjects, p OTUs, and q covariates (e.g., age and sex) and the subscripts, i, j, and k, indicate a subject, an OTU, and a covariate, respectively. For each subject i, let T_{i} be a survival time, C_{i} be a censoring time, and Y_{i} be an observed time. Then, Y_{i} is defined as Y_{i} = min(T_{i}, C_{i}) and an event indicator, δ_{i}, is defined as δ_{i} = I(T_{i} ≤ C_{i}). The ordered observed event times are denoted by τ_{1}, ..., τ_{m}, where m is the number of total events (m ≤ n), and the risk set at time τ_{g}, is denoted by R_{g}, for g = 1, ..., m. In addition, for each subject i, denote a p × 1 vector, Z_{i}, for the microbial composition of the entire community or a higherlevel taxon, marked for its elements in OTUlevel relative abundance as Z_{ij} for j = 1, ..., p, and denote a k × 1 vector, X_{i}, for the covariates, marked for its elements as X_{ik} for k = 1, ..., q. Here, we assume that n subjects are identically and independently distributed (e.g., random subjects) and C_{i} is independent of T_{i} conditional on Z_{i} and X_{i}.
To relate microbial composition with survival outcomes adjusting for covariates, we consider a Cox proportional hazard model (Eq. 1) [26].
where λ_{i}(t) is the conditional hazard function given Z_{i} and X_{i}, λ_{0}(t) is the baseline hazard function, α’s are coefficients for the effect of covariates (e.g., age and sex), and h(Z_{i}) is a function which characterizes the relationship between microbial composition and survival outcomes. For example, if we specify h(Z_{i}) = \( \sum \limits_{\mathrm{j}=1}^{\mathrm{p}}{\upbeta}_{\mathrm{j}}{\mathrm{Z}}_{\mathrm{ij}} \), we can relate the linear effects of OTUs in relative abundance to the log hazard rate. The nonlinear effects of OTUs in relative abundance can also be surveyed by the use of nonlinear bases of OTUs (e.g., polynomials/splines) [27]. Moreover, we can specify h(Z_{i}) more flexibly by the use of different positive semidefinite kernel functions modeled for different distance/similarity metrics among subjects [27, 28]. In the following sections, we introduce two different machineries for the specification of h(Z_{i})  one for the use of different linear/nonlinear bases of OTUs and the other for the use of different kernel functions  and illustrate how their performance varies by different true underlying association patterns. The former is a newly introduced method, MiSALN. The latter is an existing method, MiRKATS [24]; hence, we describe its main ideas and formula and refer to its original paper for more details. Of most importance is that we introduce new adaptive testing methods which approximate to an optimal test for each of the two machineries (namely, OMiSALN and OMiRKATS, respectively) and throughout the two different machineries (namely, OMiSA).
MiSALN
Suppose that β = (β_{1}, …, β_{p})^{T} is the vector of regression coefficients for p OTUs. We are particularly interested in testing the null hypothesis of no association between the microbial composition consisting of those p OTUs and survival outcomes, H_{0}: β = (β_{1}, …, β_{p})^{T} = 0. Assuming the coefficients, β_{1}, ..., β_{p}, are random and independent with mean zero, a common variance, σ^{2}, and a pairwise correlation matrix, Ρ, (i.e., E(β) = 0 and Cov(β) = σ^{2}Ρ), the same null hypothesis and its corresponding alternative can be formulated as a variancecomponent score test with Eq. 2 [28,29,30,31,32].
Note here that any full distributional form for the coefficients, β_{1}, …, β_{p}, was not required, but we need to specify the correlation matrix, Ρ [29]. For the choice of Ρ, we can consider different choices which have greater deviations from H_{0} in directions corresponding to the larger eigenvalues of P [29]. However, we consider the p × p identity matrix, I_{p}, for no correlation over β_{j}’s for MiSALN [30].
The Cox proportional hazard model for the null hypothesis can be formulated with Eq. 3.
Based on the maximum likelihood estimates, \( \widehat{\upalpha} \)’s, the estimated cumulative hazard rate for subject i at its observed time, \( {\widehat{\Lambda}}_{\mathrm{i}} \), can be derived as in Eq. 4 [29]. We here used Efron’s approximation [33] to handle observations which have tied survival times.
Based on \( \widehat{\upalpha} \)’s and the resulting estimates, \( {\widehat{\Lambda}}_{\mathrm{i}} \)’s, Verweij et al. [29] derives a variancecomponent score test statistic to test H_{0}: σ^{2} = 0 against H_{1}: σ^{2} > 0 as in Eq. 5.
where d = (δ_{1}, …, δ_{n})^{T}, \( \widehat{\mathrm{e}} \) = \( {\left({\widehat{\Lambda}}_1,\dots, {\widehat{\Lambda}}_{\mathrm{n}}\right)}^{\mathrm{T}} \), and R is the n × n correlation matrix for n subjects that we need to specify. Here, \( \mathrm{d}\widehat{\Lambda} \) is the vector of the estimated martingale residuals under the null model (Eq. 3). Verweij et al. also derives the mean and variance of U and demonstrates that the null distribution of the standardized score test closely approximates to standard normal distribution [29]. However, since our proposed tests are based on a residual permutationbased scheme for pvalue calculation and the mean and variance of U are evaluated under the null, the unstandardized score test, U, is sufficient in our study.
Of importance is that with different specifications for R, we can survey different correlation structures among subjects. For MiSALN, our choices for R are formulated with Eq. 6.
where Z is the n × p matrix for OTU relative abundances (i.e., compositions), Z = (Z_{1}, …, Z_{n})^{T}, and γ (ϵ ℝ^{+}) powers Z and needs to be prespecified. The variancecomponent score test with this correlation matrix, R_{MiSALN(γ)}, can be simply derived as in Eq. 7.
The correlation structure, R_{MiSALN(γ)}, describes pairwise similarities in microbial abundance among subjects and the variancecomponent score test, U_{MiSALN(γ)}, represents the degree of overall association between R_{MiSALN(γ)} and the estimated martingale residuals [30, 31]. Here, only the microbial abundance information is contributed and no phylogenetic tree information is incorporated.
Note that, γ transforms OTUs to the γ’s power of the original relative abundances, so that different bases of OTUs can be surveyed. When γ = 1, the original scale of OTUs is used for testing the linear effect of OTUs. The resulting correlation matrix, ZZ^{T}, is equivalent to the linear kernel in kernel machine regression models [28] and has also been used for a geneset association testing method, namely, Global Test [30]. When γ ≠ 1, the nonlinear bases of OTUs can be surveyed. Here, we demonstrate different γ value specifications as different weighting schemes for OTUs in relative abundance as follows. As γ increases, abundant OTUs will be relatively weighted, while rare OTUs will gradually be lost, but vice versa as γ decreases. Therefore, we can expect that when abundant OTUs are associated with survival outcomes, a large value of γ can be more suitable by weighting them more, but vice versa when rare OTUs are associated. However, in practice, the true underlying association pattern is mostly unknown and we cannot presume whether rare, midabundant, or abundant OTUs are associated with survival outcomes. Therefore, we propose a datadriven approach, Optimal MiSALN (OMiSALN), which approximates to an optimal test adaptively through different γ value specifications and its test statistic is formulated with Eq. 8.
where Г is a set of candidate γ values and P_{MiSALN(γ)} is the estimated pvalue for MiSALN(γ), where γϵГ. We can observe that Q_{OMiSALN} is the minimum pvalue among different MiSALN(γ) tests, where γϵГ. Again, Q_{OMiSALN} itself is a test statistic which requires its own pvalue estimation. We use a residualbased permutation method to estimate pvalues for individual MiSALN(γ) tests, where γϵГ, and OMiSALN (see Additional file 1).
For a set of candidate γ values, we used Г = {1/4, 1/3, 1/2, 1} and it was sufficient in our simulations and real data analysis.
MiRKATS
The key idea behind MiRKATS [24] is that diverse distance metrics (e.g., UniFrac distance [16,17,18] and BrayCurtis dissimilarity [19]) can be incorporated into the kernel machine Cox proportional hazards model. Hence, we can survey the relationship between ecologically related metrics and survival outcomes on health or disease with covariate adjustments (e.g., age and sex) [24]. First, we need to specify a samplebysample pairwise distance matrix based on a chosen distance metric and transform it into a kernel (similarity) matrix using the kernel formula, Eq. 9.
where D is the n × n pairwise distance matrix and D^{2} is its elementwise square, I is the n × n identity matrix, and 1 in 11′ is the vector of n ones. To ensure the kernel matrix, K, to be positive semidefinite, negative eigenvalues are replaced with zero [24]. Then, using the resulting kernel matrix, the variancecomponent score statistic can be formulated with Eq. 10 [24, 27].
where k is an index for a particular kernel matrix based on a chosen distance metric. Plantinga et al. [24] has also proposed a modified score statistic which accounts for overdispersion, but since we calculate pvalues based on a residual permutationbased method and the dispersion parameter, \( \frac{1}{{\left(\mathrm{d}\widehat{\Lambda}\right)}^{\mathrm{T}}\left(\mathrm{d}\widehat{\Lambda}\right)} \), is evaluated under the null, the variancecomponent score test statistic of Eq. 10 is sufficient in our study.
Importantly, different distance metrics reflect different relative contributions from microbial abundance and phylogenetic tree information; as such, the performance of MiRKATS differs according to the choice of distance metric and the true underlying association pattern [16,17,18, 24]. The UniFrac distances are constructed based on phylogenetic tree information and the contribution of microbial abundance is modulated by different weighting schemes. The unweighted UniFrac distance incorporates only microbial presence/absence information so that it is most inclined to phylogenetic tree information [16], whereas the weighted UniFrac distance further incorporates microbial abundances [17]. In this context, the generalized UniFrac distance is regarded as a compromised version between the unweighted and weighted UniFrac distances [18]. In contrast, the BrayCurtis dissimilarity [19] does not incorporate any phylogenetic tree information so that it is most inclined to microbial abundance information. Accordingly, when associated OTUs are phylogenetically related, the UniFrac distances can be better choices than BrayCurtis dissimilarity, but vice versa when they are not phylogenetically related. However, we cannot predict which distance metric is optimal to our study. Therefore, here, we proposed a datadriven approach, namely, Optimal MiRKATS (OMiRKATS), which is taken adaptively through multiple distance metric specifications and its test statistic is formulated with Eq. 11.
where Ψ is a set of candidate distance metrics and P_{MiRKAT − S(k)} is the estimated pvalue for U_{MiRKAT − S(k)}, where kϵΨ. Note that, OMiRKATS is similar to Optimal MiRKAT [15], but the difference is that OMiRKATS handles survival outcomes, while Optimal MiRKAT handles binary or continuous outcomes at a time point. Here again, Q_{OMiRKAT − S} is the minimum pvalue among different MiRKATS(k) tests, where kϵΨ, and it is a test statistic that requires its own pvalue estimation. Similar to MiSALN(γ)/OMiSALN, a residualbased permutation method was used to estimate pvalues for individual MiRKATS(k) tests, where kϵΨ, and OMiRKATS (see Additional file 1).
For a set of candidate distance metrics, Ψ, we used Ψ = {unweighted UniFrac (K_{U}), generalized UniFrac(0.5) (K_{0.5}), weighted UniFrac (K_{W}), BrayCurtis (K_{BC})}, where K_{0.5} is the generalized UniFrac distance with the parameter, ϴ = 0.5, as suggested [9].
OMiSA
Our major proposed method, OMiSA, approximates to an optimal test adaptively throughout all the different variancecomponent score tests of MiSALN(γ), where γϵГ, and MiRKATS(k), where kϵΨ, and its test statistic can be simply formulated with Eq. 12.
Q_{OMiSA} is the minimum pvalue among different MiSALN(γ) tests, where γϵГ, and MiRKATS(k) tests, where kϵΨ. It is a test statistic, like Q_{OMiSALN} (Eq. 8) and Q_{OMiRKAT − S} (Eq. 11). We do not report this genuine minimum pvalue as the final pvalue to be reported for OMiSA, but estimate its pvalue using a residual permutationbased method (see Additional file 1). We emphasize that throughout the machineries of MiSALN and MiRKATS, OMiSA powerfully discovers microbial taxa whenever their underlying associated OTUs are rare or abundant by MiSALN and phylogenetically related or not by MiRKATS. Our extensive simulations in later sections also demonstrate the robust performance of OMiSA.
Assessment of higherlevel taxa
We extend all the individual and adaptive tests as general group analytic methods which can assess any higherlevel taxon (e.g., phyla, classes, orders, families, and genera), not only the entire community (e.g., bacterial kingdom), as long as they include multiple OTUs with phylogenetic tree structure. The only matter that requires attention is that when we assess higherlevel taxa, their OTU relative abundances (i.e., compositions) are computed based on total reads in the entire community (i.e., OTU relative abundances are not subcompositions which have unit sum to each higherlevel taxon). Specifically, such normalization needs to be applied to the OTU relative abundances for MiSALN and to the UniFrac distances for MiRKATS.
A graphical representation
As for visual representations of discoveries, we used an existing software tool, GraPhlAn [34], which is addressed later in our real data analysis. As GraPhlAn is flexibly customizable with beautiful circular representations of hierarchical taxonomic tree [34], here, we do not introduce any new graphical representation and suggest to use GraPhlAn after obtaining outcomes from OMiSA.
Results
Simulations
This section is devoted to simulations which evaluate individual MiSALN and MiRKATS tests and their adaptive tests, OMiSALN, OMiRKATS, and OMiSA in terms of type I error and statistical power. While the association testing methods can be applied to higherlevel taxa, as a demonstration, here, we survey the entire community.
Simulation design
We simulated microbiome data according to prior approaches [35] which are based on a Dirichletmultinomial distribution reflecting real microbial composition. We first estimated proportion means and a dispersion parameter to be inserted into the Dirichletmultinomial distribution using actual intestinal microbiome data of nonobese diabetic (NOD) mice in [23]. The complete microbiome data include NOD mice in different treatment groups and sequencing time points; however, as a demonstration, we selected 35 fecal samples from NOD mice at 6 weeks of age in the control group which had not been disturbed by antibiotic exposure. Then, 353 OTUs which have proportional mean abundance > 10^{−4} were included in the analysis. The total reads per sample was set to be 1000 [15, 24]. Based on these specifications, we simulated OTU counts for small (n = 50) and large (n = 100) samples, respectively, from the Dirichletmultinomial distribution.
Two covariates for age and sex were simulated from a normal distribution, N(50, 5^{2}) and a Bernoulli distribution, Bern(0.5), respectively. The survival time, T_{i}, was simulated through Eq. 13, assuming proportional hazards and a Weibull distribution, Weibull(2,2), for the baseline at age = 50 and sex = 0 [28, 36].
where U_{i} was randomly sampled from a uniform distribution, Unif(0,1), β_{j} is a coefficient for each OTU j = 1,…,p, and scale(Z_{ij}) = \( \frac{{\mathrm{Z}}_{\mathrm{ij}}\mathrm{mean}\left({\mathrm{Z}}_{1\mathrm{j}},{\mathrm{Z}}_{2\mathrm{j}},\dots, {\mathrm{Z}}_{\mathrm{nj}}\right)}{\mathrm{SD}\left({\mathrm{Z}}_{1\mathrm{j}},{\mathrm{Z}}_{2\mathrm{j}},\dots, {\mathrm{Z}}_{\mathrm{nj}}\right)} \), for subjects i = 1, …, n and OTUs j = 1, …, p. The censoring time, C_{i}, was simulated based on uniform distribution with two different schemes to survey different extent of censoring: 1) C_{i} ~ Unif(0,10) which is of the estimated proportions of censored outcomes, 25.78%, and 25.88%, for small (n = 50) and large samples (n = 100), respectively, and 2) C_{i} ~ Unif(0,5) which is of the estimated proportions of censored outcomes, 40.42% and 40.48%, for small (n = 50) and large samples (n = 100), respectively (Table 1). Then, the observed time, Y_{i}, and the event indicator, δ_{i}, were calculated by the formula, Y_{i} = min(T_{i}, C_{i}) and δ_{i} = I(T_{i} ≤ C_{i}), respectively.
Empirical type I error rates with the proportions of censored outcomes were estimated by setting β = (β_{1}, …, β_{p})^{′} = 0. Statistical power was estimated with four different scenarios: (i), 10 most abundant OTUs; (ii), 10 random OTUs; (iii), 10 least abundant OTUs; and (iv), OTUs in a selected cluster performing the partitioningaroundmedoids (PAM) algorithm [37], which are associated with survival outcomes, respectively. The first three scenarios evaluate discovery ability when abundant, random/midabundant, or rare OTUs are associated. For the fourth scenario, we distributed all OTUs into 10 clusters using the PAM algorithm on the basis of their cophenetic distances in the real phylogenetic tree. The 10 clusters have contained 7.8%, 8.2%, 13.9%, 6.6%, 33.3%, 8.3%, 1.1%, 1.8%, 17.2%, and 1.8% of total reads, respectively, and in each simulation iteration, the selection of one associated cluster was randomized to overcome arbitrary selection. The fourth scenario additionally reflects phylogeny, which may provide a more realistic evaluation.
We also surveyed different effect sizes and directions for the associated OTUs. Λ is denoted as a set of indices for the associated OTUs. For the same effect direction, we set β_{j ∈ Λ}as a vector of the elements randomly sampled from Unif(0,1), Unif(0,2), or Unif(0,3) and for the mixed effect direction, we set β_{j ∈ Λ}as a vector of the elements randomly sampled from Unif(− 1,1), Unif(− 2,2), or Unif(− 3,3).
Simulation results
Type I error
We can observe that empirical type I error rates are wellcontrolled at the significance level of 5% across all the individual and adaptive tests for different censoring schemes and for both small samples (n = 50) and large samples (n = 100) (Table 1).
Power
As we observed similar comparative performances of individual and adaptive tests for different sample sizes and censoring schemes, we include here only the outcomes for large samples (n = 100) and the censoring scheme, Unif(0,10), and moved all of the other outcomes to Additional material (see Additional file 2: Figure S1, Additional file 3: Figure S2, Additional file 4: Figure S3, Additional file 5: Figure S4, Additional file 6: Figure S5, Additional file 7: Figure S6). Figs. 2 and 3 report estimated powers for the same effect direction and mixed effect directions, respectively. We observe that with the increase of effect size, power increases for all the methods under any simulation setting (Figs. 2 and 3), as expected. We also observe that MiRKATS/OMiRKATS gains slightly more power than MiSALN/OMiSALN for the same effect direction, but it is vice versa for mixed effect directions (Figs. 2 and 3).
MiSALN is powerful using either a large γ value when abundant OTUs are associated with survival outcomes (Figs. 2a and 3a), or using a small γ value when rare OTUs are associated (Figs. 2c and 3c), as expected. For MiRKATS, the BrayCurtis dissimilarity gain power for the first two scenarios in which only the microbial abundances for abundant and random OTUs influence the association (Figs. 2a, b and 3a, b), while the UniFrac distances gain power for the fourth scenario in which phylogenetic tree information is reflected (Figs. 2d and 3d), as expected.
It is notable that the BrayCurtis dissimilarity is most powerful across all the tests when abundant OTUs are associated with survival outcomes, resulting in high power for OMiRKATS (Figs. 2a and 3a). However, as reported in [24], the major problem of MiRKATS is that it is underpowered using any distance metric when rare OTUs are associated with survival outcomes; as such, OMiRKATS is also underpowered (Figs. 2c and 3c). In contrast, we observed that MiSALN using a small γ value gains power when rare OTUs are associated with survival outcomes, resulting in power for OMiSALN (Figs. 2c and 3c). Overall, we observe that individual itembyitem tests fit specific scenarios, respectively, and there is no single test which fits every scenario. Remarkably, OMiSA is highly robust, approaching the most powerful performance, throughout the four scenarios in which abundant, random, and rare OTUs are associated with survival outcomes and phylogenetic tree information is present (Figs. 2ad and 3ad).
Software comparison
For individual MiRKATS tests, we also tried the software package, MiRKATS [24], to determine whether there is any disparity between software facilities. For the use of MiRKATS, we applied the permutation method for small samples (n = 50) and the analytic pvalue calculation for large samples (n = 100), as suggested [24]. We show that two software packages, OMiSA (our software) and MiRKATS [24], produce nearly identical power estimates for individual MiRKATS tests (see Additional file 8: Figure S7, Additional file 9: Figure S8, Additional file 10: Figure S9, Additional file 11: Figure S10, Additional file 12: Figure S11, Additional file 13: Figure S12, Additional file 14: Figure S13, Additional file 15: Figure S14).
Real data analysis
Earlylife interactions between microbiota and their hosts have been considered as potential key factors in immunological and metabolic development [38]. Type 1 diabetes (T1D) is an autoimmune disease which is growing in incidence with decreasing age of onset [39]. Livanos et al. performed a microbiome profiling study to survey whether antibioticmediated gut microbiome perturbation accelerates onset of T1D in mice [23]. For the study, NOD mice were exposed to different antibiotic treatments or not, and their intestinal microbial populations were sequenced over time, as described in detail [23]. In brief, fecal, cecal, or ileal specimens from NOD mice were collected and the V4 region of bacterial 16S rRNA gene was amplified by triplicate PCR (F515/R806) using barcoded fusion primers. OTUs, as well as their phylogenetic relationships, were examined using the QIIME pipeline [8].
The data are extensive and motivate diverse study orientations, but, here, we analyze whether perturbed microbial composition by antibiotic use is associated with T1D survival. This analysis is necessary as a part of mediation analysis to understand the process by which the antibiotic use causally affects T1D development [40]. As a demonstration, we used 19 fecal samples from male NOD mice at 6 weeks of age in therapeuticdose pulsed antibiotic (PAT) treatment. The mice were followed for 30 weeks, during this time, 10 mice developed T1D, while 9 mice did not. Livanos et al. [23] also performed analyses to determine differences in relative abundance of genera between the T1Dfree and T1Ddevelopment groups at 30 weeks of followup. However, those analyses were limited to surveying the binary T1D status (T1Dfree vs. T1Ddevelopment) at an arbitrary single time point. Thus, we reanalyzed the data by taking the entire survival process into account using our proposed methods. We applied a filtering rule, a proportion mean > 10^{−4}, identifying 120 OTUs in the entire community. We first conducted association testing for the entire community (i.e., bacterial kingdom) using the individual and adaptive tests. Then, we tested microbial taxa at five different taxonomic levels, phylum, class, order, family, and genus, respectively. We tested microbial taxa that have ≥2 OTUs in the data using OMiSA and microbial taxa that have only one OTU in the data using univariate Cox proportional hazards models. At each taxonomic level, we omitted OTUs that do not have any taxonomy assignment. For multiple testing correction, we applied the BenjaminiHochberg (BH) procedure [41] per taxonomic level. However, we are not restricted to the use of BH procedure for multiple testing correction and other less conservative procedures might be considered [42]. No covariate adjustments were included as with [23].
Table 2 reports adjusted pvalues for the entire community using the individual and adaptive tests. Based on either of the adaptive tests, OMiSALN, OMiRKATS, and OMiSA, we observe that the microbial composition of the entire community is significantly associated with T1D survival (Table 2). Here, we further observe that MiRKATS using the BrayCurtis dissimilarity has the smallest pvalue among all the individual tests and the pvalue for MiSALN is decreasing as the specified γ value is increasing (Table 2). This may indicate that relatively abundant OTUs in the entire community are associated with T1D survival in microbial abundance.
We created Fig. 4 using a software tool, GraPhlAn, for circular representations of hierarchical taxonomic tree [34] and Fig. 4 summarizes discovered taxa (red circles). We observe that the microbial composition of a phylum, Verrucomicrobia (adj. p: 0.005), a class, Verrucomicrobiae (adj. p: <.001), an order, Verrucomicrobiales (adj. p: <.001), two families, Verrucomicrobiaceae (adj. p: <.001) and Streptococcaceae (adj. p: 0.040), and a genus, Akkermansia (adj. p: 0.012), are significantly associated with T1D survival (Fig. 4).
Discussion
We described that our computational procedures are efficient as they are based on the scorebased tests without double permutations (see Additional file 1), but Plantinga et al.’s analytical pvalue calculation  based on a closed form asymptotic null distribution suggested for large samples (n > 100)  should be more efficient [24]. Testing all higherlevel taxa throughout all different taxonomic ranks may impose a greater computational burden on OMiSA. Thus, for such complete association mapping, we suggest to use multicore computers to simultaneously implement multiple OMiSA tests. Our software package is currently written in R to exploit existing R functions, but the use of lowerlevel languages (e.g., C or Fortran) may further enhance its computational performance. Although OMiSA approaches the smallest pvalue and thus the highest power adaptively, varying pvalues from individual candidate tests may provide some biological insights on the basis of their methodological aspects and simulation outcomes (Figs. 2 and 3). For MiSALN, we may, in reverse, infer that when the use of a small γ value gains a relatively smaller pvalue, rare OTUs might be associated, but it is vice versa when the use of a large γ value gains a relatively smaller pvalue. Similarly, for MiRKATS, when the use of UniFrac distances gains a relatively smaller pvalue than the use of BrayCurtis dissimilarity, associated OTUs might be phylogenetically related, while when the use of BrayCurtis dissimilarity gains a relatively smaller pvalue than the use of UniFrac distances, associated OTUs might not be phylogenetically related. However, “rare or abundant” and “phylogenetically related or not” are conceptual terms with no firm definition, so such interpretations might be challenging.
It is a wellrecognized issue in the microbiome research community that the compositional nature of data renders spurious correlation among microbial markers by the unit sum constraint per sample. As a remedy, diverse data transformation procedures (e.g., logratio transformation [43]) have been studied, but it is still highly debatable on which procedure is the most robust one [13, 44]. Thus, here, we did not employ any specific data transformation procedure, but would let users decide. For example, centered logratio transformation [43] can be considered for OMiSALN as it is usable for either data format in count or composition.
While we illustrated our proposed methods to be used for microbiota profiling studies via 16S rRNA gene target sequencing, they are transferrable to metagenomic studies via whole microbial genome sequencing [45]. As long as the data are organized for a group of multiple count markers with their phylogenetic tree structure, any of our proposed methods can be readily used.
Conclusions
As current itembyitem approaches have limitations due to the unknown true association pattern, we introduced three new adaptive tests, OMiSALN, OMiRKATS, and OMiSA, for microbiomebased association studies with survival outcomes. We ascertained that they are all statistically valid with wellcontrolled type I error rates for different sample sizes and censoring proportions. Among those, our major proposed method, OMiSA, is highly attractive, as it is robustly powerful through a breadth of association patterns. We also presented that it is not restricted to test the entire community, but also applicable to any taxonomic level above species, and our residualbased permutation method always produces a closed form pvalue (see Additional file 1 [14, 15, 20, 46]). Consequently, we recommend that OMiSA can extensively be used for microbiomebased survival analysis as a robust group analytic method.
Abbreviations
 BH:

BenjaminiHochberg
 MiRKAT:

Microbiome regressionbased kernel association test
 MiRKATS:

Microbiome regressionbased kernel association test for survival traits
 MiSALN:

Microbiomebased survival analysis using linear and nonlinear bases of OTUs
 NOD:

Nonobese diabetic
 OMiAT:

Optimal microbiomebased association test
 OMiRKATS:

Optimal microbiome regressionbased kernel association test for survival traits
 OMiSA:

Optimal microbiomebased survival analysis
 OMiSALN:

Optimal microbiomebased survival analysis using linear and nonlinear bases of OTUs
 OTU:

Operational taxonomic unit
 PAM:

Partitioningaroundmedoids
 PAT:

Pulsed antibiotic
 SPU:

The sum of powered score tests
 T1D:

Type 1 diabetes
 UniFrac:

Unique fraction
References
Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486:215–21.
Cho I, Yamanishi S, Cox L, Methé BA, Zavadil J, Li K, et al. Antibiotics in early life alter the murine colonic microbiome and adiposity. Nature. 2012;488:621–6.
Cox LM, Yamanish S, Sohn J, Alekseyenko AV, Leung JM, Cho I, et al. Altering the intestinal microbiota during a critical developmental window has lasting metabolic consequences. Cell. 2013;158:705–21.
Bokulich NA, Chung J, Battaglia T, Henderson N, Jay M, Li H, et al. Antibiotics, birth mode, and diet shape microbiome maturation during early life. Sci Transl Med. 2016;8:343–82.
Mahana D, Trent CM, Kurtz ZD, Bokulich NA, Battaglia T, Chung J, et al. Antibiotic perturbation of the murine gut microbiome enhances the adiposity, insulin resistance, and liver disease associated with highfat diet. Genome Med. 2016;8:48.
Woese CR, Fox GE, Zablen L, Uchida T, Bonen L, Pechman K, et al. Conservation of primary structure in 16S ribosomal RNA. Nature. 1975;254:83–5.
Hamady M, Knight R. Microbial community profiling for human microbiome projects: tools, techniques. Genome Res. 2009;11:1998–52.
Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of highthroughput community sequencing data. Nat Methods. 2010;7:335–6.
Li H. Microbiome, metagenomics, and highdimensional compositional data analysis. Annu Rev Stat Appl. 2015;2:73–94.
Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12:R60.
Parks DH, Tyson GW, Hugenholtz P, Beiko RG. STAMP: statistical analysis of taxonomic and functional profiles. Bioinformatics. 2014;30:3123–4.
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014;15:1–21.
Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013;10:1200–2.
Koh H, Blaser MJ, Li H. A powerful microbiomebased association test and a microbial taxa discovery framework for comprehensive association mapping. Microbiome. 2017;5:45.
Zhao N, Chen J, Carroll IM, RinqelKulka T, Epstein MP, Zhou H, et al. Testing in microbiomeprofiling studies with MiRKAT, the microbiome regressionbased kernel association test. Am J Hum Genet. 2015;96:797–807.
Lozupone CA, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71:8228–35.
Lozupone CA, Hamady M, Kelley ST, Knight R. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol. 2007;73:1576–85.
Chen J, Bittinger K, Charlson ES, Hoffmann C, Lewis J, Wu GD, et al. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics. 2012;28:2106–13.
Bray JR, Curtis JT. An ordination of upland forest communities of southern Wisconsin. Ecol Monogr. 1957;27:325–49.
Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014;(4):1081–95.
Han MK, Zhou Y, Murray S, Tayob N, Noth I, Lama VN, et al. Association between lung microbiome and disease progression in IPF: a prospective cohort study. Lancet Respir Med. 2014;2:548–56.
Jenq RR, Taur Y, Devlin SM, Ponce DM, Goldberg JD, Ahr KF, et al. Intestinal Blautia is associated with reduced death from graftversushost disease. Biol Blood Marrow Transplants. 2015;21:1373–83.
Livanos AE, Greiner TU, Vangay P, Pathmasiri W, Stewart D, McRitchie S, et al. Antibioticmediated gut microbiome perturbation accelerates development of type 1 diabetes in mice. Nat Microbiol. 2016;1:6140.
Plantinga A, Zhan X, Zhao N, Chen J, Jenq RR, Wu MC, et al. MiRKATS: a communitylevel test of association between the microbiota and survival times. Microbiome. 2017;5:17.
Ward Jr. JH. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58:236–44.
Cox D. Regression models and life tables (with discussion). J R Stat Soc Series B. 1972;34:187–220.
Lin X, Cai T, Wu MC, Zhou Q, Liu G, Christiani DC, et al. Kernel machine SNPset analysis for censored survival outcomes in genomewide association studies. Genet Epidemiol. 2011;35:620–31.
Chen H, Lumley T, Brody J, HeardCosta NL, Fox CS, Cupples LA, et al. Sequence kernel association test for survival traits. Genet Epidemiol. 2014;38:191–7.
Verweij PJM, Van Houwelingen HC, Stijnen T. A goodnessoffit test for Cox's proportional hazards model based on martingale residuals. Biometrics. 1998;54:1517–26.
Goeman JJ, Oosting J, CletonJansen AM, Anninga JK, Van Houwelingen HC. Testing association of a pathway with survival using gene expression data. Bioinformatics. 2005;21:1950–7.
Goeman JJ, Van De Geer SA, Van Houwelingen HC. Testing against a high dimensional alternative. J R Stat Soc Series B. 2006;68:477–93.
Li H, Chen J. Efficient unified rare variant association test by modeling the population genetic distribution in casecontrol studies. Genet Epidemiol. 2016;40:579–90.
Efron B. The efficiency of Cox’s likelihood function for censored data. J Am Stat Assoc. 1977;72:557–65.
Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012;13:R79.
Chen J, Li H. Kernel methods for regression analysis of microbiome composition data. Topics in applied statistics: 2012 symposium of the international Chinese statistical association. New York: Springer; 1998. p. 191–201.
Bender R, Augustin T, Blettner M. Generating survival times to simulate cox proportional hazards models. Stat Med. 2005;24:1713–23.
Reynolds AP, Richard G, De La Iglesia B, RaywardSmith VJ. Clustering rules: a comparison of partitioning and hierarchical clustering algorithms. J Math Model Algorithms. 2006;5:474–504.
Olszak T, An D, Zeissiq S, Vera MP, Richter J, Franke A, et al. Microbial exposure during early life has persistent effects on natural killer T cell function. Science. 2012;336:489–93.
Diamond Project Group. Incidence and trends of childhood type 1 diabetes worldwide 19901999. Diabetic Med. 2006;23:857–66.
Baron RM, Kenny DA. The moderatormediator variable distinction in social psychological research: conceptual, strategic and statistical considerations. J Pers Soc Psychol. 1986;51:1173–82.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B. 1995;57:289–300.
Sankaran K, Holmes S. structSSI: simultaneous and selective inference for grouped or hierarchically structured data. J Stat Softw. 2014;59(13)
Aitchison J. The statistical analysis of compositional data. J R Stat Soc B. 1982;44:139–77.
O’Hara RB, Kotze DJ. Do not logtransform count data. Methods Ecol Evol. 2010;2:118–22.
Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat Rev Genet. 2005;6:805–14.
Churchill GA, Doerge RW. Empirical threshold values for quantitative trait mapping. Genetics. 1994;138:963–71.
Acknowledgements
The authors are grateful to the anonymous reviewers for their insightful observations and comments.
Funding
The work was supported in part by the National Institute of Health (R01 DK090989, R01 DK110014, U01 AI22285), Juvenile Diabetes Research Foundation, Howard Hughes Medical Institute, Diane Belfer Program in Human Microbial Ecology, and C&D Fund. The funding body had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
The real microbiome data [23] we used in our real data analysis have been deposited at European Bioinformatics Institute (EBI) database (https://www.ebi.ac.uk) with accession code ERP016357 and Qiita database (https://qiita.ucsd.edu) with access code 10508. The synthetic data we used are described in the above Simulations section.
Our methods are implemented using the R package, OMiSA, which is freely available on the web at https://sites.google.com/site/huilinli09/software or https://github.com/hk1785/OMiSA together with its detailed manual which includes arguments/options, input formats, and outputs with examples. The software and tutorial for GraPhlAn are available at https://bitbucket.org/nsegata/graphlan/wiki/Home and referenced to its original paper [34].
Author information
Authors and Affiliations
Contributions
HK and HL developed the methodological ideas. HK implemented the methods, performed the simulations and real data analysis, and developed the software package. AEL and MJB provided biological insights and interpretations in the real data analysis and contributed to the acquisition of utilized real microbiome data. HK and HL wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable. All utilized data sets are publicly and freely available which do not require any ethics approval and consent to participate.
Consent for publication
Not applicable. All utilized data sets are publicly and freely available which do not require any consent for publication.
Competing interests
The authors declare that they have no competing interests.
Additional files
Additional file 1:
Computational procedures. (PDF 390 kb)
Additional file 2:
Figure S1. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,5), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 10 kb)
Additional file 3:
Figure S2. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,5), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 10 kb)
Additional file 4:
Figure S3. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,10), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 9 kb)
Additional file 5:
Figure S4. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,10), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 9 kb)
Additional file 6:
Figure S5. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,5), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 9 kb)
Additional file 7:
Figure S6. Power estimates for the individual and adaptive tests. The censoring scheme, C_{i} ~ Unif(0,5), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively, for MiRKATS [24]. (A: 10 most abundant OTUs are associated. B: 10 random OTUs are associated. C: 10 least abundant OTUs are associated. D: OTUs in a chosen cluster are associated.). (PDF 9 kb)
Additional file 8:
Figure S7. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via analytic pvalue calculation). The censoring scheme, C_{i} ~ Unif(0,10), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 9:
Figure S8. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via analytic pvalue calculation). The censoring scheme, C_{i} ~ Unif(0,10), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 10:
Figure S9. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via analytic pvalue calculation). The censoring scheme, C_{i} ~ Unif(0,5), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 11:
Figure S10. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via analytic pvalue calculation). The censoring scheme, C_{i} ~ Unif(0,5), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a large sample size (n = 100) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 12:
Figure S11. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via permutation). The censoring scheme, C_{i} ~ Unif(0,10), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 13:
Figure S12. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via permutation). The censoring scheme, C_{i} ~ Unif(0,10), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 14:
Figure S13. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via permutation). The censoring scheme, C_{i} ~ Unif(0,5), and the same effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(0,1) (blue), Unif(0,2) (yellow), or Unif(0,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Additional file 15:
Figure S14. Power estimates for individual MiRKATS tests through different software facilities, OMiSA and MiRKATS (via permutation). The censoring scheme, C_{i} ~ Unif(0,5), and the mixed effect directions, where β_{j ∈ Λ} is a vector of the elements sampled from Unif(− 1,1) (blue), Unif(− 2,2) (yellow), or Unif(− 3,3) (red), for a small sample size (n = 50) were surveyed. K_{U}, K_{0.5}, K_{W}, and K_{BC}, indicates the use of unweighted UniFrac, generalized UniFrac with ϴ = 0.5, weighted UniFrac, and the BrayCurtis dissimilarity kernels, respectively. (PDF 6 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Koh, H., Livanos, A.E., Blaser, M.J. et al. A highly adaptive microbiomebased association test for survival traits. BMC Genomics 19, 210 (2018). https://doi.org/10.1186/s1286401845998
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1286401845998
Keywords
 Microbiomebased survival analysis
 Microbiomebased association test
 Communitylevel association test
 Microbial group analysis
 Highdimensional compositional data analysis
 Phylogenetic tree