Microbial trend analysis for common dynamic trend, group comparison, and classification in longitudinal microbiome study

Background The human microbiome is inherently dynamic and its dynamic nature plays a critical role in maintaining health and driving disease. With an increasing number of longitudinal microbiome studies, scientists are eager to learn the comprehensive characterization of microbial dynamics and their implications to the health and disease-related phenotypes. However, due to the challenging structure of longitudinal microbiome data, few analytic methods are available to characterize the microbial dynamics over time. Results We propose a microbial trend analysis (MTA) framework for the high-dimensional and phylogenetically-based longitudinal microbiome data. In particular, MTA can perform three tasks: 1) capture the common microbial dynamic trends for a group of subjects at the community level and identify the dominant taxa; 2) examine whether or not the microbial overall dynamic trends are significantly different between groups; 3) classify an individual subject based on its longitudinal microbial profiling. Our extensive simulations demonstrate that the proposed MTA framework is robust and powerful in hypothesis testing, taxon identification, and subject classification. Our real data analyses further illustrate the utility of MTA through a longitudinal study in mice. Conclusions The proposed MTA framework is an attractive and effective tool in investigating dynamic microbial pattern from longitudinal microbiome studies. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-021-07948-w).

Additional file 1: Additional file 1 has 8 sections as below.
Section S1: Simulation study based on the simulation design of the perturbation method Without loss of generality, we assume that there is only one different trend between control and case groups and 3 dominant taxa that contribute to it.As described in the "Simulation design" section, we denote the set of their indices as ∧ 1 and their corresponding deviations as β 1 .t = (β 1 1t , β 1 2t , β 1 3t ) ′ .Then β 1 .t is assigned as the following perturbations based on the perturbing values in Shields-Cutler et al. (2018), instead of equation ( 2).
where e is the magnitude of perturbation and was assigned to 1 and 2 respectively.
Section S2: Comparison between relative abundance and log-ratio transformed relative abundances The compositional nature of microbiome data has raised many debates in the field.for P = 35 taxa and they were set as the corresponding estimates from the control mice using R package dirmult (Tvedebrink, 2009).β l pt is the deviation of the p th taxon from the log-transformed baseline relative abundance β l 0p at the t th time point, which contributes to the difference between control and case groups.The deviations of 5 dominant taxa are assigned by equations (S1) and (S2) respectively, Furthermore, the remaining parameters are set as the same as those in the "Simulation design" section.
Figure S13 presents the estimated powers of the proposed MTA method using the relative abundances, the log, ALR, and CLR transformations under two simulation designs respectively.We observe that the power using the relative abundances directly is highest under any scenario and using relative abundance makes the proposed MTA method capture more dynamic information and elucidate more accurate difference between control and case groups.The reason that MTA based on the relative abundance has better performance is twofold.First, MTA evaluates the dynamic difference between cases and controls by comparing the trends extracted from a difference array G * to a noise trend which is randomly arranged around 0 across all time points, rather than directly compare the dynamic trends extracted from cases and controls, respectively.Second, MTA employs matrix factorization and regularization techniques to analyze all taxa together and identify the dominant taxa simultaneously, by assuming that the microbiota act as an integrated microbial community.This strategy releases the correlations among taxa due to compositionality to some extent (Zou et al., 2006;Maxwell et al., 2019).
We recognize that all results above are based on the simulation studies, rather than on theoretical consideration.We also include log, ALR, and CLR transformations in the MTA R package as options and let the users decide the transformation procedure based on their studies.

FIG S13
Estimated powers (%) of the proposed MTA method for testing the microbial trend difference between case and control groups using different transformed relative abundances, with sample size N = 20 and the number of time points T = 10.Specifically, the proposed MTA method was applied on the relative abundances directly (None; red bar), the log transformation (Log; blue bar), ALR (additive log-ratio transformation; darkgreen bar), and CLR (centered log-ratio transformation; orange bar), respectively under the scenarios 1-3.(A) Scenarios 1-3 were constructed with the simulation design described in the "Simulation design" section.(B) Scenarios 1-3 were constructed with the simulation design described in Additional file 1, Section S2.
Section S3: Principal trend analysis Zhang and Davis (2013) proposed the PTA method for analyzing the time-course gene expression data.Being an unsupervised approach, PTA aims to extract the grouplevel principal trends of time-course gene expression data from a group of subjects and identify genes that make dominant contributions to the principal trends.They suppose there are N subjects in a given group.Let Y npt be the gene expression of the p th gene of subject n at time point t, p = 1, . . ., P , and t = 1, . . ., T , and Y n denote a P × T matrix of observations of subject n; F = (f 1 , . . ., f M ) be a P × M matrix of factor scores; be a T × (T + 2) matrix containing the cubic spline basis; Q = CB ′ denote a M × T matrix presenting the top M common trends across T time points.The underlying model is expressed as where, E n denotes the error term for subject n.Zhang and Davis (2013) estimated the parameters F and C by solving the optimization problem below is a (T + 2) × (T + 2) matrix with ′′ j (t)dt, i, j = 1, . . ., T +2. a 1 and a 2 control the smoothness of trends and sparsity on genes, respectively.Using Langrange multipliers, the optimization problem For m = 1, 2, . . ., M factor score vector from the mean matrix Step 1: Step 2: Step 3: If both fm and ĉm are convergent, then break; Qm = B ĉm is the estimated m th common trend and fm is the corresponding estimated factor scores.
With the estimates F and Ĉ, PTA identifies the group-level time-course patterns Q = ĈB ′ shared by all subjects, and the dominant genes contributing to the common patterns which have non-zero estimated factor scores F .Besides, cross-validation is used to select the tuning parameters in the algorithm of PTA.
Section S4: Laplacian matrix construction Given phylogenetic tree, we can calculate a matrix of patristic distances for all pairs of taxa.The patristic distance between any two taxa is defined as the length sum of all the branches that connect these two taxa and it measures the evolutionary divergence.
Then the adjacency matrix A is based on the normalized patristic distance D = (d jk ) as the following: , where ϕ(x) is a decreasing function and is chosen as ϕ(x) = 1−x in this paper.Larger a jk represents OTU j and OTU k have closer evolutionary relationship.Finally Laplacian matrix L is defined as where D d = diag(d 1 , . . ., d P ) and d j = P k=1 a jk .The Laplacian penalty weights the OTUs closely linked on the phylogenetic tree to have similar effects on the extracted common trends.
Section S5: The derivation of Algorithm 1 Following Zhang and Davis (2013), the sum of squared errors L(F , C|Y n ) is illustrated as So, in the case K = 1, f and c can be estimated by solving the following problem With a fixed value f , we set the derivative of expression (S5) to 0, and solve for c: With the estimated value ĉ, we set the derivative of expression (S6) to 0, and solve for f : Note that MetaLonDA works on the time interval and provides the p-value for each time interval, rather than the overall study period.To obtain its overall significance for comparison, two strategies are considered to determine whether a taxon is significantly different or not.MetaLonDA L: the taxon is determined as significantly different if any time interval [t, t + 1], t = 1, . . ., T − 1, is significantly identified; MetaLonDA G: the taxon is determined as significantly different if the time interval [1, T ] is significantly identified.All simulations related to the MetaLonDA method were analyzed with the R package MetaLonDA, in which we used the microbial counts, the NB smoothing spline, and 1000 permutations.For the LonGP, the taxon is claimed significantly different if the interaction between time point and group is selected using the additive Gaussian process.
Note that LonGP provides the component relevance and the corresponding selection with a given threshold of the total relevance, not the testing p-value.Specifically, we employed the R package lgpr (Timonen et al., 2021) to run the LonGP, in which we used the log-transformed count data and the threshold of 0.9.
We evaluated the proposed MTA method, Permuspliner, MetaLonDA, and LonGP in terms of sensitivity and specificity in two simulation designs: (1) the same simulation design used in the main text; and (2) the simulation design based on the NB distribution used in the MetaLonDA method.In design 1, since the permutation procedure in MetaLonDA and the sampling procedure to determine the prior parameters in LonGP are both time-consuming, we simulated a small scale data with sample size N = 20 and time point T = 10 as an illustration without loss of generality.Figure S14 reports the estimated sensitivity and specificity of four methods in scenarios 1-3, respectively.As we discussed in the main Simulation results section, the proposed MTA has a superior and robust performance, while Permuspliner has the poorest performance in terms of sensitivity.For the MetaLonDA method, it is not surprising that neither MetaLonDA L nor MetaLonDA G has the comparable performance, because that MetaLonDA is based on the area under the microbial splines within a given time interval, similar to Permuspliner, where cross-over microbial trends cause the area between curves to balance out and reduce the power accordingly.LonGP does not have competitive sensitivities either, although it has close to 100% specificities.

FIG S14
The estimated sensitivity and specificity for identifying the dominant taxa which contribute to the extracted trends with sample size N = 20 and the number of time points T = 10 in scenarios 1-3, respectively.Scenarios 1-3 were constructed as in the "Simulation design" section.MetaLonDA L: the taxon is determined as significantly different if any time interval [t, t + 1], t = 1, . . ., T − 1, is significantly identified; MetaLonDA G: the taxon is determined as significantly different if the time interval [1, T ] is significantly identified.
Second, we simulated data from the NB distribution using the parameters in the MetaLonDA method (Metwally et al., 2018).Specifically, we considered the sample size N = 20 per group, the number of taxa P = 20, and the number of time points T = 20.Among 20 taxa, 4 were assumed significant in three time regions as in Metwally et al. (2018).Since the total depths varied across different samples, both the original counts and the rarefied counts were used for MetaLonDA and LonGP.The estimated sensitivity and specificity in Figure S15 show that the proposed MTA is able to identify all differentially abundant taxa successfully as well as has sufficient specificity.Permuspliner has commensurate performance.All LonGP and MetaLonDA methods suffer from low sensitivity.
In summary, the proposed MTA has robust and superior performance in the comprehensive simulation in terms of identifying the differentially abundant taxa, compared to all the competing methods including Permuspliner, MetaLonDA, and LonGP.First, when the zero inflation proportion does not change (Type 1), the proposed MTA has power close to 1 with all zero inflation proportions in both scenarios.The estimated sensitivity and specificity are presented in Figure S16.We observe that the estimated sensitivity decreases, while the specificity increases as the zero inflation proportion increases in scenario 1.In scenario 2, both sensitivity and specificity are stable no matter what the sparsity is.In summary, the magnitude of sparsity in causal taxa affects the performance of the proposed MTA, but not the sparsity in non-causal taxa.
In addition, MTA has lower sensitivities and higher specificities in scenario 1 than in scenario 2.

FIG S16
The estimated sensitivity and specificity for the MTA with various zero inflation proportions which are unchanged over time, in scenarios 1-2, respectively.
When the zero inflation proportion varies over time, the proposed MTA has power=1 with all zero inflation proportions in both scenarios.Similar to the pattern we observe in Type 1, Figure S17 shows that lower sensitivities and higher specificities in scenario 1 than in scenario 2 again.Unlike Type 1, the sensitivities are consistent with all proportion 0 s, meanwhile, the specificity decreases as proportion 0 s increases, in both scenarios.
The constant sum constraint can lead traditional statistical methods to produce spurious correlations and errant results.Log-ratio normalizations have been commonly used to address microbial compositionality and have achieved considerable progress.But the problem caused by the compositionality is still open in the longitudinal microbiome study.Here we constructed some simulation studies to illustrate the performance of the proposed MTA framework using relative abundance, and three transformed relative abundances with log, additive log-ratio (ALR), and centered log-ratio (CLR) transformations, respectively.Specifically, we consider scenarios 1-3 with sample size N = 20 and the number of time points T = 10 under two simulation designs: the first is the same as the one we used in the "Simulation design" section; the second one is constructed to be in favor of the transformed methods and based on the Dirichlet distribution with a different modeling.The mean relative abundance of the p th taxon for a subject at the t th time point is assigned as below, different from equation (1) in the first design, E[O pt ] = γ pt P p=1 γ pt , log (γ pt ) = β l 0p + β l pt , p = 1, . . ., P, t = 1, . . ., T, where β l 0 = (β l 01 , . . ., β l 0P ) ′ represents the log-transformed baseline relative abundances where λ 1 (≥ 0) and λ 2 (≥ 0) are tuning parameters that control smoothness of the common trends and sparsity on genes, respectively.Zhang and Davis (2013) demonstrated the optimization problem (S4) is not convex, but biconvex and proposed an iterative algorithm (Algorithm S1) to obtain the parameter estimates F and Ĉ.Algorithm S1 Nonlinear iterative algorithm for PTA Input: the observed data Y n (n = 1, . . ., N ), the number of common trends M , and the fixed tuning parameters λ 1 and λ 2 where Soft(a, b) = sign(a)(|a| − b) + is the soft-thresholding operator.
Soft(a, b) = sign(a)(|a|−b) + is the soft thresholding operator.Section S6: Comparisons among the proposed MTA, Permuspliner, MetaLonDA, and LonGPMetaLonDA(Metwally et al., 2018) is proposed to identify the significant time intervals of differentially abundant microbial features in the longitudinal microbiome data.For a given microbial feature, MetaLonDA models the mapped read counts by a negative binomial (NB) distribution and fits the longitudinal profiles in each phenotypic group with NB smoothing splines.The microbial group difference per unit time interval is measured by the area ratio between two spline curves fitted in groups separately and the p-value for each time interval is calculated by a permutation test.LonGP(Cheng et al., 2019) is an additive Gaussian process regression model designed for statistical analysis of longitudinal data and it incorporates multiple kernels to model time-varying random effects and non-stationary signals.

FIG
FIG S15 The estimated sensitivity and specificity for identifying the differentially abundant taxa with the sample size N = 20, the number of taxa P = 20 and the number of time points T = 20.The simulation data were generated from the NB distribution described in Metwally et al. (2018).LonGP NB: LogGP employs the original counts generated from the NB distribution; LonGP rarefication: LogGP employs the rarefied counts; MetaLonDA NB 19 L: MetaLonDA employs the original counts and the taxon is determined as significantly different if any time interval [t, t + 1], t = 1, . . ., T , is significantly identified; MetaLonDA NB 5 L: MetaLonDA employs the original counts and the taxon is determined as significantly different if at least one of five predefined time intervals in Metwally et al. (2018) is significantly identified; MetaLonDA NB G: MetaLonDA employs the original counts and the taxon is determined as significantly different if the time interval [1, T ] is significantly identified; MetaLonDA rarefication 19 L: MetaLonDA employs the rarefied counts and the taxon is determined as significantly different if any time interval [t, t + 1], t = 1, . . ., T , is significantly identified; MetaLonDA rarefication 5 L: MetaLonDA employs the rarefied counts and the taxon is determined as significantly different if at least one of five predefined time intervals in Metwally et al. (2018) is significantly identified; MetaLonDA rarefication G: MetaLonDA employs the rarefied counts and the taxon is determined as significantly different if the time interval [1, T ] is significantly identified.