 Research
 Open Access
 Published:
A robust and transformationfree joint model with matching and regularization for metagenomic trajectory and disease onset
BMC Genomics volume 23, Article number: 661 (2022)
Abstract
Background
To identify operational taxonomy units (OTUs) signaling disease onset in an observational study, a powerful strategy was selecting participants by matched sets and profiling temporal metagenomes, followed by trajectory analysis. Existing trajectory analyses modeled individual OTU or microbial community without adjusting for the withincommunity correlation and matchedsetspecific latent factors.
Results
We proposed a joint model with matching and regularization (JMR) to detect OTUspecific trajectory predictive of host disease status. The between and withinmatchedsets heterogeneity in OTU relative abundance and disease risk were modeled by nested random effects. The inherent negative correlation in microbiota composition was adjusted by incorporating and regularizing the topcorrelated taxa as longitudinal covariate, preselected by BrayCurtis distance and elastic net regression. We designed a simulation pipeline to generate true biomarkers for disease onset and the pseudo biomarkers caused by compositionality. We demonstrated that JMR effectively controlled the false discovery and pseudo biomarkers in a simulation study generating temporal highdimensional metagenomic counts with random intercept or slope. Application of the competing methods in the simulated data and the TEDDY cohort showed that JMR outperformed the other methods and identified important taxa in infants’ fecal samples with dynamics preceding host disease status.
Conclusion
Our method JMR is a robust framework that models taxonspecific trajectory and host disease status for matched participants without transformation of relative abundance, improving the power of detecting diseaseassociated microbial features in certain scenarios. JMR is available in R package mtradeR at https://github.com/qianli10000/mtradeR.
Background
Gut microbiota profiled by 16s rRNA gene sequencing or metagenomic (i.e., wholegenome shotgun) sequencing has been frequently used in observational studies of environmental exposures, immune biomarkers, and disease onset [1,2,3,4,5]. One of the challenges in analyzing microbiota in an observational study is to incorporate the matching between participants based on certain confounding risk factors (e.g. gender, clinical site, etc.) and/or disease status (casecontrol), such as the DIABIMMUNE and TEDDY cohorts [1, 2, 5]. A matching design effectively eliminates the noise effect of sample collection, storage, shipment, sequencing batch, and environmental exposures confounding with disease outcomes, as well as reduces the sequencing costs. Statistical analyses of microbiota in matched sets included, but are not limited to, conditional logistic regression [1], nonparametric comparison PERMANOVA [6] and LDM [7] with extension to compare cases and controls within a matched set [8], which aimed to model and analyze microbiome data at independent time points.
Longitudinal profiling is a powerful strategy for the microbiome studies that aim to identify differential microbial trajectories between exposure groups or phenotypes [9, 10] or detect the time intervals of differential abundance [11]. However, most of these studies failed to test if the compositional trajectory of an operational taxonomic unit (OTU) signaled host disease status. To detect microbial trajectories predictive of disease outcome in matched sets, an intuitive method is the generalized linear mixed effect model with or without the zeroinflation component [9, 10, 12], in which a taxon’s abundance and/or presence is the outcome variable and the disease status is the covariate of interest. The ZeroInflated Beta Regression (ZIBR) model [9] tests the association between OTU and a covariate factor using a twopart model for the nonzero relative abundance and presence of each OTU, assuming the nonzero relative abundance and presence being independent. A similar framework [10] was proposed to analyze the longitudinal zeroinflated counts per OTU using a Negative Binomial distribution, without converting the raw counts to relative abundance. A semiparametric approach for longitudinal taxonspecific relative abundance is the linear mixed effect model (LMM) with asinsquareroot transformation, which has been implemented in MaAsLin 2 [12].
One concern about using generalized linear mixed model to test the association between 16S rRNA or metagenomic trajectory and disease onset is that the covariates in this model may contribute to disease risk. For example, the HLA haplogenotypes and early use of probiotics may affect infants’ gut microbiota and should be included as covariates. These factors were also found associated with islet autoimmunity among children enrolled in TEDDY [13]. One usually added interaction terms between each covariate and the disease outcome [3, 12] to adjust for the association. However, a linear model with many interaction terms may lead to overfitting and reduce the detection power [14]. A sensible choice is the joint modeling of longitudinal biomarker and survival outcomes [15, 16], but there are limitations in applying this model to microbiome data in observational studies. First, the cost of metagenomic sequencing and the availability of fecal samples in a multicenter study may restrict the metagenome profiling to a subgroup of participants selected by certain criteria [1,2,3], whose survival outcome may deviate from common statistical assumptions. Second, the classic joint modeling approach aims to address repeated measurements of biomarkers in a timetoevent analysis rather than test if a biomarker’s intercept or slope is predictive of host health condition. Third, in an observational study that selects and matches participants by certain factors, their risk of developing disease is also matched. Thus, a survival submodel may not be capable of characterizing the disease risk between matched participants.
Many of the existing methods for microbiome data are built on the transformed relative abundance, such as centered logratio or interquartile logratio. In our new method, transformation of compositional data is not considered, since transformation strategy may have profound impact on analysis result and interpretation [17]. The compositional change in true biomarkers (e.g., causal OTUs contributing to disease onset) always leads to simultaneous change in some other OTUs’ composition because of sumtoone constraint. In an observational study with matching design, it is common to collect and profile microbiota at many time points. The sumtoone constraint and latent noise effect may yield pseudo biomarkers with relative abundance associated with host disease status but not contributing to disease development. Hence, a taxonlevel model is built for relative abundance trajectory that adjusts for the dynamic interdependence between taxa and reduces pseudo biomarker rate. In addition, we illustrate the performance of our method by a simulation pipeline that mimics the negative correlation in microbial community.
The latent technical noise in microbiome was removed by converting raw counts to relative abundance, and ZeroInflated Beta density [9] was adopted to model an OTU’s nonzero relative abundance and presence, respectively. We employed a subjectlevel random effect to link the logistic regression model of disease to a twopart longitudinal submodel. The latent effect of exposures related to matched set indicator was modeled by another random effect nested with the subjectlevel random effect. The OTUdisease association was assessed by jointly testing the scaling parameters for the subjectlevel random effect in the twopart submodel. We benchmarked the robustness and power of our method by a comprehensive simulation study and an application in the TEDDY cohort. The results illustrated that our method controlled the rates of false discovery and pseudo biomarkers, as well as improved the efficacy of detecting microbial trajectories signaling disease outcome.
Results
For simplicity, the aim of present research is to link the matched longitudinal microbiome samples to hosts’ matched disease risk and incorporate the unknown dependence between taxa in an univariate trajectory framework, without modeling the compositionality. Briefly, we develop a Joint model with Matching and Regularization (JMR) to detect taxonspecific compositional trajectory associated with disease onset, adjusting for the linear correlation with other taxa and matchedsetspecific latent noises. According to the characteristics of disease risk and infantage gut microbiota in the TEDDY cohort, we designed a simulation pipeline similar to [8], generated the observed counts of temporal microbiota and compared our method to LMM and ZIBR using the simulated data. We also applied these methods to the shotgun metagenomic sequencing data profiled from the 49 months stool samples of infants enrolled in TEDDY cohort.
Overview of TEDDY microbiome study
TEDDY is an observational prospective study of children at increased genetic risk of type 1 diabetes (T1D) conducted in six clinical centers in the U.S. and Europe (Finland, Germany, and Sweden). A total of 8,676 children were enrolled from birth and followed every 3 months for blood sample collection and islet autoantibody measurement up to 4 years of age, then every 36 months based on autoantibody status until the age of 15 years or diabetes onset [18]. A primary disease endpoint in TEDDY is islet autoimmunity (IA), defined as persistently positive for insulin autoantibodies (IAA), glutamic acid decarboxylase autoantibodies (GADA), or insulinomaassociated2 autoantibodies (IA2A) at two consecutive visits confirmed by the two TEDDY laboratories [18]. The participants’ monthly stool samples were collected from 3month age until the onset of IA or censoring with random missing samples [1, 2]. Based on the sample availability and metagenomic sequencing cost, the microbiome study in TEDDY selected all the participants (cases) who developed IA by the design cutoff date May 31, 2012 and the controls at 1:1 casecontrol ratio matched by clinical center, gender, family history of T1D to profile the temporal gut microbiota, resulting in S = 418 matched sets (or pairs [19]). These matching factors are known risk factors for type 1 diabetes. Some of the matched sets are at higher risk of IA than the others due to higher risk human leukocyte antigen (HLA) genotypes, geography or having family history of T1D. Hence, the matched participants have comparable risk of IA, but heterogeneity still exists between them according to the casecontrol status by the design freeze date. The observed metagenomic counts table in TEDDY was generated by the standard procedure of DNA extraction, PCR amplification, shotgun metagenomic sequencing, assembly, annotation and quantification, as described in [1]. We visualized the top abundant species in the metagenomes of TEDDY participants who had matched IA endpoint no later than 2 years of age (Fig. 1).
Simulation
Disease outcome for the matched participants are simulated by the procedure below. The observed relative abundance per taxon were simulated by different scenarios. We first generated raw counts for a single OTU by BetaBinomial distribution to assess the robustness and power of our method JMR without covariate taxa. We also designed a shifting procedure to mimic inherent negative correlation in the true composition of microbiota and generated the temporal highdimensional raw counts table to evaluate the performance of compared methods.
Generate disease outcome in matched sets
We defined matched sets and subjects as ‘highrisk’ and ‘lowrisk’ to generate the temporal OTU counts prior to disease onset. Subjects are matched at 1:1 ratio. For participant \(j=1,2\) in matched set s \((s=1,\dots ,S)\), we first generated subjectlevel and setlevel random effects from a standard Normal distribution \(a_{s_j}\sim N(0,1)\), \(b_s\sim N(0,1)\). Each random effect was converted to a binary variable by the median value. That is \(A_{s_j}=\varvec{I}(a_{s_j}>\text {median}(a_{s_j}))\), \(B_s=\varvec{I}(b_s>\text {median}(b_s))\), where \(A_{s_j}=1\) (or \(B_s=1\)) represents a ‘highrisk’ subject (or set). Next, we simulated a host genotype \(G_{s_j}\) as disease risk factor, and the host disease status by a Bernoulli distribution \(O_{s_j}\sim B(p_{s_j})\), where \(\text {logit}(p_{s_j})=\alpha _0+\alpha _1 G_{s_j}+\alpha _2 A_{s_j}+\alpha _3 B_{s}\). We fixed \((\alpha _0,\alpha _1, \alpha _3)=(0.5,2,1)\), which is the JMR estimate from real data, and set \(\alpha _2 \in \{0.5,0.75,1,1.25,1.5\}\) to generate different datasets.
Scenario A: single OTU counts.
We first simulated the observed counts of a single OTU by BetaBinomial [14] distribution to compare the univariate trajectory methods without adjusting for covariate taxa. The true relative abundance of an OTU at the earliest time point \(t=1\) was drawn from a Beta distribution \(\mu _1\sim Beta(\mu _0,\phi _0)\), where parameters \(\mu _0,\phi _0\) were estimated by applying BetaBinomial MLE to the metagenomic raw counts of an OTU selected at a given relative abundance level in the TEDDY data. To simplify the agedependent effect, the relative abundance of this OTU at later time points \(t>1\) was generated by linearly increasing \(\mu _1\) to \(\mu _t\). The baseline relative abundance at time t in a matched set s was generated by \(\mu _{st}\sim Beta(\mu ,\phi _t)\), and was increased or decreased by \(\Delta \mu _{st}\) if the set was labeled as ‘highrisk’. The true relative abundance of this OTU for subject j in set s at time point t was simulated by \(\mu _{s_jt}\sim Beta(\mu _{st}, \phi _{st})\), and was increased or decreased by \(\Delta \mu _{s_jt}\) if the subject was ‘highrisk’. The total counts per sample, i.e., library size was drawn from a Poisson distribution \(N_{s_jt}\sim PS(100000)\), and the counts for this OTU is generated from a Binomial (BN) distribution \(Y_{s_jt}\sim BN(N_{s_jt},\mu _{s_jt})\).
Scenario B1: counts table with random intercept and pseudo biomarkers
We also generated a highdimensional counts table with \(P=1030\) OTUs to demonstrate the performance of each method, so that the covariate taxa can be used in JMR. The true composition of each microbiome sample \(\bar{\eta }_{s_jt}\) was simulated by a shifting procedure combined with Dirichlet distribution to account for the negative correlation within microbial community. The samplewise library size was generated by a Poisson distribution, and the observed raw counts were sampled from a Multinomial distribution. Details of data generation process for this scenario is available in Methods, with a visualization for dimension of \(P=4\) in Fig. 2.
For a subject labeled as ‘highrisk’, we increased \(15\%\) OTUs (denoted by \(M^{+}\)) in \(\bar{\eta }_{s_jt}\) by \(\Delta _{s_jt}\), and reduced another \(15\%\) OTUs (denoted by \(M^{}\)) by \(d\Delta _{s_jt}\) (\(0<d<1\)). The subsets \(M^{+}\) and \(M^{}\) are the true biomarkers for disease status. We randomly selected a third subset (denoted by \(M^{0}\)) from the remaining \(70\%\) OTUs in \(\bar{\eta }_{s_jt}\) and reduced the composition of \(M^{0}\) by a total of \((1d)\Delta _{s_jt}\). There may exist OTUs never selected in \(M^{+}\), \(M^{}\), or \(M^{0}\), which are the ‘null’ OTUs. The OTUs selected in \(M^{0}\) are the pseudo biomarkers due to random shift in frequency. We set the total shift \(\Delta _{s_jt}=\lambda \Delta ^0_{{s_jt}}\) at distinct effect size \(\lambda \in \{0.5,0.6,0.7,0.8\}\), where \(\Delta ^0_{{s_jt}}\) is the maximum shift restricted by sumtoone.
Scenario B2: counts table with random slope and pseudo biomarkers
Data generation process for this scenario is similar to Scenario B1, except that the shift (\(\Delta _{s_jt}\)) in microbiota true composition between ‘lowrisk’ and ‘highrisk’ subjects varies by time points. It’s worth to note that we cannot distinguish ‘false positive’ from ‘pseudo positive’ in scenarios B1 and B2. Hence, we use the sum of false positive rate and pseudo positive rate, i.e., false or pseudo positive rate (FPPR) as a performance metric for scenarios B1,B2. That is \(\text {FPPR}=\frac{ \# \text { of positives in } (M^{+}\cup M^{})^c }{\# \text { of OTUs in } (M^{+}\cup M^{})^c }\).
Scenario C: counts table without pseudo biomarkers
In this scenario we considered random intercept signaling the disease onset and fixed half OTUs in \(\bar{\eta }_{s_jt}\) as ‘null’ in order to evaluate the FPR and FDR of each method, although this scenario is not applicable to real data. Among the other half OTUs, we selected \(10\%\) OTUs in \(\bar{\eta }_{s_jt}\) as \(M^{+}\) and \(40\%\) OTUs as \(M^{}\) without a subset of pseudo biomarkers (\(M^{0}\)).
Performance of competing methods
In scenario A, we compared JMR not adjusting for correlated taxa (JMRNC) with the following methods: a) a joint model with regularization but without matching indicator and correlated taxa (JRNC); b) the ZIBR model with a Wald statistic jointly testing OTUspecific abundance or presence using either a single random effect (ZIBRS) or nested random effects (ZIBRN); c) LMM with arcsinsquareroot transformation using either a single random effect (LMMS) or nested random effects (LMMN). For LMM and ZIBR methods, we used R package gamlss and set the sample age, genotype, disease status, and genotypedisease interaction term as the fixed effect covariates. It’s worth to note that the nested random effects used in LMM and ZIBR are independent of host disease risk, different from those in JMR.
We randomly selected 6 OTUs with different relative abundance from TEDDY data and estimated the baseline parameters for each. These OTUs are Acinetobacter sp. NIPH 236, Brachyspira murdochii, Streptococcus phage YMC2011, Erysipelatoclostridium ramosum, Ruminococcus gnavus. Then we generated \(n=10000\) replicates for each OTU with \(S\in \{50,100\}\). The type I error rate and power of each method was calculated at statistical significance level \(p<0.05\), shown in Table 1. The results showed that JMRNC persistently controlled the type I error and provided higher detection power at distinct abundance levels except for the OTUs with \(\log _{10}(y) \in (2, 3]\) and (5, 6]. Type I error of the reduced model JRNC was severely inflated in some datasets and its power was lower than JMRNC. LMM consistently controlled type I error, with power lower than JMRNC in most simulated OTUs. The ZIBR method yielded inflated type I error rate and low efficacy regardless of sample size in this singleOTU scenario.
In scenarios B1, B2 and C, we generated 10 replicates for each OTU table to assess the performance of competing methods. We evaluated the performance of each method at different size of setlevel random effect \(\gamma \in \{0.6,0.7,0.8\}\). The taxa associated with disease onset in each OTU table are detected by FDR cutoff \(q<0.15\). The FPPR in scenario B1 (Fig. 3) showed that adjusting for the topcorrelated taxa in JMR successfully controlled the rate of pseudo biomarkers across different scenarios and was more powerful than LMM at larger sample size (\(N=200\)), although JMR showed lower detection power compared to JMRNC. The results in Fig. 4 also demonstrated the outperformance of JMRNC, JMR, and ZIBR in the sensitivity of detecting taxonspecific trajectory heralding disease outcome. The power of ZIBR in either intercept or slope analysis was higher than the competing methods regardless of sample size in the highdimensional scenarios, but this model yielded inflated FPPR (Fig. 3). The LMM methods were powerful in the test of intercept, but this model occasionally produced inflated FPPR (Fig. 3) regardless of the setlevel random effect size. Furthermore, the power of LMM was unstable in intercept analysis, while its power in slope analysis was nearly zero. To confirm the impact of \(\gamma\) on performance, we also compared the metrics in Figs. 3 and 5 between different values of \(\gamma\), using KruskalWallis test. A larger setlevel random effect led to significant change in FPPR, FPR, FDR for LMM and ZIBR methods (\(p<10^{5}\)), but this impact was trivial in JMR or JMRNC (\(p>0.1\)). JMR showed the best performance in slope analysis, with higher sensitivity (Fig. 4) and the lowest FPPR (Fig. 3). Results of scenario C in Fig. 5 showed that JMR and LMM effectively controlled the FPR, while LMM produced higher FPR at larger matchedsetspecific random effect (\(\gamma\)). The FDR of JMR at \(\gamma =0.6\) was relatively higher than that of LMM due to lower sensitivity. The inflated FPR and FDR of ZIBR in scenario C (Fig. 5) is consistent with the FPPR in scenario B (Fig. 3).
For each raw counts table in scenarios B and C, more than half of simulated OTUs have the observed zeroinflation probability (i.e., \(1\)prevalence) between 2% and 90%, although there are a few OTUs with the observed prevalence at \(100\%\). The overall prevalence of each OTU table in scenarios B and C is similar between different datasets, which cannot be specified in the DirichletMultinomial distribution or the shifting procedure. Hence, we assess the impact of zeroinflation on performance only in scenario A, whereas the prevalence is related to taxonspecific relative abundance. We also visualized the prevalence of six OTUs generated in scenario A (Fig. 6). The power of JMRNC (Table 1) was better for the prevalence at a medium level, i.e., replicates with \(\log _{10}(y)\) between (3, 4] or (4, 5] (Fig. 6). The OTUs with higher abundance and relatively lower prevalence (i.e., replicates with \(\log _{10}(y) \in (1,2]\) in Fig. 6) showed better efficacy in Table 1. In general, OTUspecific prevalence being too high or too low may reduce the power of JMR.
Application in TEDDY
We applied the competing methods to the longitudinal metagenomes profiled from TEDDY children’s monthly stool samples collected at the age of 49 months [1]. We included the cases developing IA between 9month and 24month age and their matched controls who remained IAnegative by the cases’ diagnosis age. For each matched pair included in the present analysis, one participant was IA positive and the other one was negative at the age of 24 months. We excluded the participant(s) matched to multiple pairs, yielding \(N=152\) subjects (\(S=76\) pairs) and \(n=672\) metagenome samples. The cases who experienced IA onset after 24 months and their matched controls were not included in this analysis.
We first filtered OTUs at genus and species level by relative abundance \(>10^{6}\) and prevalence \(>5\%\), selecting 125 out of 265 genera and 365 out of 750 species in downstream analysis. It’s worth to note that there are 1797 species in total profiled and quantified in TEDDY cohort, with 750 species detected between 3 and 9month age. The sample age and the hosts’ breastfeeding status per time point were used as longitudinal covariates, while HLA DR3 &4 haplotype was included as timeinvariant covariates. For the LMM and ZIBR methods, we used the interaction term between IA status and the binary HLA category (DR3 &4 vs. others) as a covariate to adjust for the association. We tested each OTU’s association with IA by FDR cutoff \(q<0.05\) or \(q<0.1\), individually. The HLA DR3 &4 genotype was confirmed positively and significantly (\(p<0.05\) by Wald test) associated with IA in JMR. The results in Table 2 showed that JMR identified more OTUs than LMM in both intercept and slope analysis. The LMM methods only found a small subgroup of taxa associated with IA at either genus or species level. We also visualized the overlap and difference between JMR, JMRNC, LMMN selected by \(q<0.1\) in Fig. 7 with OTU names listed in Supplementary Table S1, and then compared Akaike Information Criterion (AIC) of JMR and JMRNC for the 76 species detected by both methods. Adjusting for the correlated taxa in JMR did improve model fitting with lower mean AIC (2631.847) compared to JMRNC (2615.571). LMMN is not comparable to JMR or JMRNC in terms of information criteria, since the taxonspecific relative abundance was transformed by asinsquareroot.
The taxa with mean abundance (intercept) associated with IA onset exclusively detected by both JMR and JMRNC at \(q<0.1\) include Bifidobacterium breve, Bacteroides fragilis, Lactobacillus ruminis, Veillonella ratti. B.breve, as one of the three species dominating infantage gut microbiota in TEDDY, was less abundant in intercept (i.e. at 4 and 9month) during infancy among IA cases, with density shown in Fig. 8. The species B.fragilis as part of the normal microbiota in human colon was found more abundant among IA cases compared to their matched controls (Fig. 1). This Bacteroides species was also found differential between T1D cases and controls at only one time point in a smallsize Finnish cohort [20].
There are two more abundant species Faecalibacterium prausnitzii and Escherichia coli visualized in Fig. 1 associated with IA in slope and exclusively detected by JMR. F.prausnitzii, as one of the most abundant and important commensal bacteria of human gut microbiota that produces butyrate and shortchain fatty acids from the fermentation of dietary fiber, increased faster in IA cases after 6month of age. This rapid change and abnormally higher level of F.prausnitzii prior to IA seroconversion may be a result of the sudden change of dietary pattern during infancy.
Our method successfully detected the casecontrol difference in the slope of E.coli, which was found as an amyloidproducing bacteria with temperal dynamics heralding IA onset in a subset analysis in DIABIMMUNE cohort [21]. The relative abundance of E.coli in TEDDY smoothly decreased from 4month to 9month for both cases and controls (Fig. 1), and it was relatively more abundant in controls between 7 and 9month with stratified densities shown in Fig. 8. The temporal change of E.coli prior to IA seroconversion in TEDDY detected by JMR was consistent with the decrease of E.coli reported in DIABIMMUNE cohort [21], which was possibly due to prophage activation according to the E.coli phage/E.coli ratio prior to E.coli depletion in that research.
Discussion
We developed a joint model with nested random effects to test the association between taxa and disease risk, and adjusted for the correlated taxa screened by a preselection procedure in abundance and prevalence, individually. We implemented our method in an R package mtradeR (metagenomic trajectory analysis with disease endpoint and risk factors) with illustration examples at https://github.com/qianli10000/mtradeR. The JMR function implemented the framework in equation (1) by parallel computing. We also provided simulation functions StatSim and TaxaSim to generate (binary) disease status and temporal highdimensional metagenomic counts of matched sets. The runtime of each method for different sample size and different number of OTUs were compared on an 8core computer, with mean and standard deviation shown in Table 3. The nested random effects were utilized in each method. For the univariate models without covariate taxa, LMMN is the fastest algorithm and ZIBRN is the slowest, both implemented in gamlss R package. Although the adjustment of correlated taxa in JMR requires additional computation, the runtime of JMR is still shorter than ZIBRN in gamlss.
The simulation of single OTU demonstrated the performance of each method at different relative abundance levels, implying that LMM with either single or nested random effect is still a robust method. The simulation of highdimensional OTU tables also illustrated LMM’s overall performance in the test of intercept, but the unstableness of LMM is a concern in real data analysis. JMR yielded lower false or pseudo positive rate in the simulated datasets and higher detection power in slope analysis by adjusting for the topcorrelated taxa. The preselection of topcorrelated taxa in JMR was performed in relative abundance and presence, individually, being consistent with the twopart model strategy. According to the simulation study, a disadvantage of JMR is the limited power at small sample size and the dependence on tuning parameter. The prescreening procedure in JMR may occasionally select a true biomarker as covariate taxon, which is possibly confounding with the subjectlevel random effect. Hence, the adjustment of related taxa in JMR reduced the detection power compared to JMRNC, although this strategy controlled the pseudo biomarker rate. Adding nodes in the GH approximation may improve the power of JMR, but more nodes will also lead to additional computation costs. Hence, future work should focus on improvement of JMR in both detection power and computation efficiency. Furthermore, the simulation results in Fig. 3 also suggested the minimum number of participants or matched pairs required based on setlevel or subjectlevel random effect size. In an observational study with strong setlevel noises in the microbiota (e.g., multicenter effect), a minimum sample size of \(N=200\) participants (i.e., \(S=100\) pairs) coupled with JMR can improve the detection power and control FPPR at each level of diseaseassociated random effect.
Another limitation of our method is the potential bias in scaling parameter (\(\lambda _r, \lambda _p\)) estimation, possibly caused by the \(L_2\) regularization. Our current work only focused on the unsigned association between a taxon and host disease status by using a Wald statistic. An improvement in the estimate of scaling parameter and statistical inference should be considered in future work, such as the algorithm in ZINQ [14]. We did not use quantile regression in current research, since the performance of ZINQ required tuning of grid. But ZINQ provided an alternative approach for modeling zeroinflation in microbiota composition with fewer statistical assumptions.
The rightcensoring of longitudinal biomarker measurements or a binary disease outcome always occurs in observational studies. Our model allows random missingness or censoring of microbiome samples at any time point. In an observational study like TEDDY, the controls’ disease outcome was censored at or later than the matched cases’ endpoint, because the casecontrol matching was based on the participants’ disease status. Thus, rightcensoring is not applicable to the disease status at matched endpoint. For a study matching participants solely based on confounding risk factors (e.g., DIABIMMUNE), the rightcensoring of disease outcome should be addressed prior to the usage of JMR, such as multiple imputation. There are other important topics to be considered in the modeling of longitudinal microbiome data. One potential direction is high dimensional modeling framework, such as tensor singular value decomposition [22]. A promising extension of the current work in JMR is to exploit functional data analysis for multiple microbial trajectories. By employing a nonparametric joint modeling, we may be able to capture nonlinear trends and heterogeneous patterns of longitudinal biomarkers in microbiota, as well as negative correlations among taxa [23].
Conclusions
The proposed framework JMR successfully controlled the false or pseudo biomarkers in taxonspecific trajectory analysis with improved detection power by incorporating the matching of participants and adjusting for the dependence between taxa.
Methods
Joint model with matching and regularization
The probability for participant j \((j=1,\dots ,J)\) in matched set s \((s=1,\dots ,S)\) developing the disease of interest is \(p_{s_j}=P(O_{s_j}=1)\), where \(O_{s_j}\) is the binary disease status. There are J participants in each matched set. Let \(y_{s_jt}\) be the relative abundance of an OTU for participant j in matched set s at time point t \((t=1,\dots ,T_{s_j})\). We denote the expected nonzero abundance by \(\mu _{s_jt}=E(y_{s_jt}y_{s_jt}>0)\), and the probability of presence (or zeroinflation) by \(\pi _{s_jt}=P(y_{s_jt}>0)\), similar to [9]. For a microbiome study matching participants by the diseaseassociated factors and/or disease status (e.g., DIABIMMUNE, TEDDY), the matched participants are assumed to have comparable but distinct disease risk. Hence, we model the disease status by a logistic mixed effect model with nested random effects. A joint model for the host disease status and microbial trajectory in matched set is
The host disease status is determined by a vector of fixed effect covariates \(\varvec{u}_{s_j}\) and the independent nested random effects \(a_{s_j}\), \(b_s\). The nonzero relative abundance \(\mu _{s_j t}\) and presence \(\pi _{s_j t}\) per OTU are predicted by the same random effects rescaled by parameters \(\lambda _r, \lambda _p\) and a vector of clinical or bioinformatics technical covariates \(\varvec{z}_{s_j t}\). To model the unknown correlation between taxa, this OTU’s nonzero abundance and presence per time point also depend on the other taxa with relative abundance \(\varvec{x}^{(1)}_{s_j t}\) and presenceabsence \(\varvec{x}^{(2)}_{s_j t}\) measured at the same time point, preselected by a procedure described below. The twopart submodel of \(y_{s_j t}\) characterizes how the trajectory is affected by subject and setlevel latent factors contributing to disease risk, and how the OTU trajectory interacts with correlated taxa over time. If an OTU is a pseudo biomarker, then its relative abundance (\(y_{s_jt}\)) should be driven by the topcorrelated taxa per time point instead of the diseaseassociated random effect \(a_{s_j}\). On the other hand, the abundance of a true biomarker OTU at each time point is mainly determined by the latent risk of disease onset (\(a_{s_j}, b_{s}\)) and possibly associated with the topcorrelated taxa.
We set \(\tilde{z}_{s_jt}=1\) in equation (1) to test intercept, and \(\tilde{z}_{s_jt}=\text {age}\) to test slope. The nested random effects and parameters \(\lambda _r, \lambda _p\) provide flexibility in the modeling of betweensubjects and betweensets heterogeneity, as well as model the abundancepresence correlation in each taxon by shared nested random effects instead of assuming independence between the two processes as in [9].
Parameter estimation and hypothesis testing
To account for the sumtoone restriction on nonzero relative abundance (\(0<\mu _{s_j t}<1\)) and the binarized measurement \(\varvec{I}(y_{s_jt}>0)\) of an OTU, we intuitively employ the ZeroInflated Beta (ZIB) density function [9] to define the matchsetspecific marginal likelihood for parameter estimation. That is \(L(\theta ;\varvec{y},\varvec{O})=\prod _{s=1}^{S}L_s\), where
\(g_a( a_{s_j}), g_b( b_s)\) are the Gaussian density functions with mean 0 and variance \(\sigma ^2_a,\sigma ^2_b\), individually, and \(f(y_{s_jt}y_{s_jt}>0)\) is the Beta density function with mean \(\mu _{s_j t}\) and overdispersion \(\phi\). In the simulation study, we demonstrated that the robustness and performance of this model does not require the observed relative abundance being generated from ZIB distribution.
The estimate of overdispersion \(\hat{\phi }\) without regularization is severely inflated and also leads to bias in the estimate of other parameters. Hence, we use \(L_2\) (ridge) regularization to control the overdispersion and type I error in hypothesis testing. All the parameters \(\theta\) are estimated by maximizing a penalized marginal likelihood function \(\hat{\theta }=\arg max \tilde{L}(\theta ;\varvec{y},\varvec{O})\), where
and \(\rho\) is selected by a crossvalidation described below.
There is no closed form of the multivariate integral \(L_s\) in equation (2) because of the Beta density in \(l_{s_j}^{(1)} (a_{s_j},b_s)\). Hence, \(L_s\) can be approximated by GaussHermite (GH) quadrature, with details explained in Appendix. We test the association between OTU trajectory and host disease status with null hypothesis \(H_0: \lambda _r= \lambda _p=0\) and a Wald statistic \(W=\frac{\hat{\lambda }^2_r}{SE^2_{\lambda _r}}+\frac{\hat{\lambda }^2_p}{SE^2_{\lambda _p}}\), which follows a ChiSquare distribution \(W\sim \chi ^2(2)\). The false discovery rate (FDR) for multiple testing is corrected by the BenjaminiHochberg (BH) procedure.
Preselection of correlated taxa and tuning parameter selection
For each OTU (\(y_{s_jt}\)) in equation (1), using all the other taxa as covariates is computationally inefficient. Hence, we use a datadriven procedure to preselect \(\varvec{x}^{(1)}_{s_jt}\) and \(\varvec{x}^{(2)}_{s_jt}\), and then perform a postselection hypothesis testing. The first step screens the taxa correlated with \(y_{s_jt}\) in abundance and presence, individually, using the BrayCurtis distance less than 0.1 quantile. Our current method uses relative abundance in both preselection and modeling, since this method is developed for largescale microbiome studies and the multicenter technical batch effect can be simply normalized by relative abundance. According to the comparison of dissimilarity metrics on microbiome compositional data [24], we choose BrayCurtis dissimilarity to preselect the related taxa. This step may still result in many covariate taxa at species level in metageonmic data due to high dimensionality. Thus, we employ elastic net regression to further select the taxa with relative abundance \(\varvec{x}^{(1)}_{s_jt}\) associated with \(y_{s_j t}\) or the taxa with presence \(\varvec{x}^{(2)}_{s_jt}\) associated with \(\varvec{I}(y_{s_jt}>0)\), individually. In this preselection procedure, we model all the longitudinal metagenomes as independent samples regardless of time points (or age). One may restrict this procedure to a subcommunity such as the species or subspecies of certain genera.
To reduce the computational burden of crossvalidation for a highdimensional OTU table, we randomly select \(P_0\) OTUs from distinct relative abundance levels to represent the complexity of microbiota composition. The matched sets are divided into 5 folds, each being a validation fold for the model built on the other four (training) folds. The penalized log likelihood in equation (4) is the negative objective function in crossvalidation. For each validation fold f and the selected OTU i, the loss function is \(S_{fi}=\tilde{L}(\hat{\theta }^i_{f};\varvec{y}^i_{f},\varvec{O}_{f})\), where \(\hat{\theta }^i_{f}=\arg \max \tilde{L}(\theta ^i;\varvec{y}^i_{f},\varvec{O}_{f})\). The optimal \(\rho\) is selected by the ‘elbow point’ minimizing \(S=\sum _{f=1}^{5}\sum _{i=1}^{P_0}S_{fi}/(5P_0)\).
Data generation process for simulation scenario B1
Step 1: Estimate the baseline mean composition (or frequency) of microbiota (\(\bar{\eta }_0\)) and the overdispersion (\(\xi _0=0.04\)) at the starting time point \(t=1\) in TEDDY data by DirichletMultinomial (DM) maximum likelihood estimate (MLE) of the observed counts. Generate the mean frequency of microbiota at the first time point by Dirichlet (DL) distribution: \(\bar{\eta }_{01}\sim DL(\bar{\eta }_0, \xi _0)\).
Step 2: The mean frequency \(\bar{\eta }_{0t}\) at a later time point \(t>1\) is generated by the following shifting procedure: increase the frequency of some OTUs in \(\bar{\eta }_{01}\) (denoted by \(M^{+}_{base}\)) with a sum of \(\Delta _t\) and simultaneously reduce that of other OTUs in \(\bar{\eta }_{01}\) (denoted by \(M^{}_{base}\)) by \(\Delta _t\). The absolute shift size \(\Delta _t\) represented the age effect on microbiota. This shifting strategy characterized the inherent correlation between \(M^{+}_{base}\) and \(M^{}_{base}\) because of the simultaneous compositional change in these OTUs. All the OTUs in \(\bar{\eta }_{0t}\) are assigned to either \(M^{+}_{base}\) or \(M^{}_{base}\) to account for the impact of latent exposures across time points.
Step 3: At each time point, the heterogeneity between matched sets is the overdispersion estimated by DM MLE based on the samples per time point in TEDDY, denoted by \(\xi _t\). The overdispersion at the first time point is \(\xi _1=0.05\) and linearly decreases over time, which mimics the timedependent overdispersion observed in the infantage metagenome in TEDDY. We generated a mean frequency for each matched set s at time point t by \(\bar{\eta }_{st}\sim DL(\bar{\eta }_{0t},\xi _t)\). If a set is labeled as ‘highrisk’, we shifted all the OTUs in \(\bar{\eta }_{st}\) using the procedure in Step 2 with shift size \(\Delta _{st}\), which is a proportion of the maximum shift size, i.e., \(\Delta _{st}=\gamma \Delta ^0_{st}\).
Step 4: The betweensubject heterogeneity within each matched set was the median DM MLE of overdispersion per matched set based on the real data, that is \(\xi ^*=0.03\). Hence, we generated the true microbiota composition for a sample collected from a ‘lowrisk’ subject j in set s at time t by \(\bar{\eta }_{s_jt}\sim DL(\bar{\eta }_{st},\xi ^*)\). The shift in \(\bar{\eta }_{s_jt}\) between ‘lowrisk’ and ‘highrisk’ subjects were described in Results.
Step 5: The library size for each sample is simulated by a Poisson distribution \(N_{s_jt}\sim PS(100000)\), truncated by a minimum of 10000. The raw counts per sample is generated by Multinomial (MN) distribution \(C_{s_jt}\sim MN(N_{s_jt},\bar{\eta }_{s_jt})\).
Availability of data and materials
The TEDDY Microbiome WGS data that supports the findings of this study have been deposited in NCBI’s database of Genotypes and Phenotypes (dbGaP) with the primary accession code phs001443.v1.p1. The R package mtradeR that implements JMR and the simulation pipeline is available at https://github.com/qianli10000/mtradeR.
Abbreviations
 AIC:

Akaike information criterion
 FDR:

False discovery rate
 FPR:

False positie rate
 FPPR:

False or pseudo positive rate
 GH:

Gausshermite
 JMR:

Joint model with matching and regularization
 LMM:

Linear mixedeffect model
 OTU:

Operational taxonomic unit
 TEDDY:

The environmental determinants of diabetes in the young
 ZIBR:

Zeroinflated beta regression
References
Stewart CJ, Ajami NJ, O’Brien JL, Hutchinson DS, Smith DP, Wong MC, et al. Temporal development of the gut microbiome in early childhood from the TEDDY study. Nature. 2018;562(7728):583–8.
Vatanen T, Franzosa EA, Schwager R, Tripathi S, Arthur TD, Vehik K, et al. The human gut microbiome in earlyonset type 1 diabetes from the TEDDY study. Nature. 2018;562(7728):589–94.
Wang DD, Nguyen LH, Li Y, Yan Y, Ma W, Rinott E, et al. The gut microbiome modulates the protective association between a Mediterranean diet and cardiometabolic disease risk. Nat Med. 2021;27(2):333–43.
Schirmer M, Smeekens SP, Vlamakis H, Jaeger M, Oosting M, Franzosa EA, et al. Linking the human gut microbiome to inflammatory cytokine production capacity. Cell. 2016;167(4):1125–36.
Vatanen T, Kostic AD, d’Hennezel E, Siljander H, Franzosa EA, Yassour M, et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell. 2016;165(4):842–53.
Anderson MJ. A new method for nonparametric multivariate analysis of variance. Austral Ecol. 2001;26(1):32–46.
Hu YJ, Satten GA. Testing hypotheses about the microbiome using the linear decomposition model (LDM). Bioinformatics. 2020;36(14):4106–15.
Zhu Z, Satten GA, Mitchell C, Hu YJ. Constraining PERMANOVA and LDM to withinset comparisons by projection improves the efficiency of analyses of matched sets of microbiome data. Microbiome. 2021;9(1):1–19.
Chen EZ, Li H. A twopart mixedeffects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016;32(17):2611–7.
Zhang X, Yi N. NBZIMM: negative binomial and zeroinflated mixed models, with application to microbiome/metagenomics data analysis. BMC Bioinformatics. 2020;21(1):1–19.
Metwally AA, Yang J, Ascoli C, Dai Y, Finn PW, Perkins DL. MetaLonDA: a flexible R package for identifying time intervals of differentially abundant features in metagenomic longitudinal studies. Microbiome. 2018;6(1):1–12.
Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, et al. Multivariable association discovery in populationscale metaomics studies. PLoS Comput Biol. 2021;17(11): e1009442.
Uusitalo U, Liu X, Yang J, Aronsson CA, Hummel S, Butterworth M, et al. Association of early exposure of probiotics and islet autoimmunity in the TEDDY study. JAMA Pediatr. 2016;170(1):20–8.
Ling W, Zhao N, Plantinga AM, Launer LJ, Fodor AA, Meyer KA, et al. Powerful and robust nonparametric association testing for microbiome data via a zeroinflated quantile approach (ZINQ). Microbiome. 2021;9(1):1–19.
Luna PN, Mansbach JM, Shaw CA. A joint modeling approach for longitudinal microbiome data improves ability to detect microbiome associations with disease. PLoS Comput Biol. 2020;16(12): e1008473.
Hu J, Wang C, Blaser MJ, Li H. Joint modeling of zeroinflated longitudinal proportions and timetoevent data with application to a gut microbiome study. Biometrics. 2021. https://doi.org/10.1111/biom.13515.
Quinn TP, Erb I, Gloor G, Notredame C, Richardson MF, Crowley TM. A field guide for the compositional analysis of anyomics data. GigaScience. 2019;8(9):giz107.
Group TS. The environmental determinants of diabetes in the young (TEDDY) study. Ann N Y Acad Sci. 2008;1150(1):1–13.
Lee HS, Burkhardt BR, McLeod W, Smith S, Eberhard C, Lynch K, et al. Biomarker discovery study design for type 1 diabetes in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Diabetes Metab Res Rev. 2014;30(5):424–34.
Giongo A, Gano KA, Crabb DB, Mukherjee N, Novelo LL, Casella G, et al. Toward defining the autoimmune microbiome for type 1 diabetes. ISME J. 2011;5(1):82–91.
Tetz G, Brown SM, Hao Y, Tetz V. Type 1 diabetes: an association between autoimmunity, the dynamics of gut amyloidproducing E. coli and their phages. Sci Rep. 2019;9(1):1–11.
Han R, Shi P, Zhang AR. Guaranteed Functional Tensor Singular Value Decomposition. arXiv preprint arXiv:2108.04201. 2021.
Li C, Xiao L, Luo S. Joint model for survival and multivariate sparse functional data with application to a study of Alzheimer’s Disease. Biometrics. 2021;78(2):435–47.
Weiss S, Van Treuren W, Lozupone C, Faust K, Friedman J, Deng Y, et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J. 2016;10(7):1669–81.
Acknowledgements
The TEDDY study is funded by the National Institute of Diabetes and Digestive and Kidney Diseases, National Institute of Allergy and Infectious Diseases, National Institute of Child Health and Human Development, National Institute of Environmental Health Sciences, Centers for Disease Control and Prevention, and JDRF. We thank the TEDDY study data coordinating center at Health Informatics Institute, University of South Florida for data processing and sharing. We thank Suraj Sarvode Mothi for data cleaning support.
Funding
The work of QL was in part supported by U24DK097771 from the National Institute of Diabetes, Digestive and Kidney Diseases via the NIDDK Information Network’s (dkNET) New Investigator Pilot Program in Bioinformatics. QL and CL are in part supported by P30 Cancer Center Support Grant (CA21765) funded by National Cancer Institute; and the American Lebanese Syrian Associated Charities (ALSAC).
Author information
Authors and Affiliations
Contributions
QL proposed and implemented the model and algorithm, performed the experiments, analyzed the data, and drafted the manuscript. KV conceived the real data analysis, interpreted results, and contributed to manuscript writing. QL and YH designed the experiments. CL contributed to methodology discussion and manuscript writing. ET and LR contributed to manuscript writing. JK supervised the study design, sample collection, data generation and analysis in the TEDDY cohort. The author(s) read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Appendix. GaussHermite quadrature approximation for marginal likelihood.
Additional file 2:
Supplementary Table S1. The list of OTUs detected by JMR, JMRNC, LMMN, individually, as shown in Figure 7.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Li, Q., Vehik, K., Li, C. et al. A robust and transformationfree joint model with matching and regularization for metagenomic trajectory and disease onset. BMC Genomics 23, 661 (2022). https://doi.org/10.1186/s12864022088901
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12864022088901
Keywords
 Metagenomic trajectory
 Compositional data
 Joint model
 Zeroinflated
 Pseudo biomarker