Proteomics data analysis using multiple statistical approaches identified proteins and metabolic networks associated with sucrose accumulation in sugarcane

Sugarcane is the most important sugar crop, contributing > 80% of global sugar production. High sucrose content is a key target of sugarcane breeding, yet sucrose improvement in sugarcane remains extremely slow for decades. Molecular breeding has the potential to break through the genetic bottleneck of sucrose improvement. Dissecting the molecular mechanism(s) and identifying the key genetic elements controlling sucrose accumulation will accelerate sucrose improvement by molecular breeding. In our previous work, a proteomics dataset based on 12 independent samples from high- and low-sugar genotypes treated with ethephon or water was established. However, in that study, employing conventional analysis, only 25 proteins involved in sugar metabolism were identified . In this work, the proteomics dataset used in our previous study was reanalyzed by three different statistical approaches, which include a logistic marginal regression, a penalized multiple logistic regression named Elastic net, as well as a Bayesian multiple logistic regression method named Stochastic search variable selection (SSVS) to identify more sugar metabolism-associated proteins. A total of 507 differentially abundant proteins (DAPs) were identified from this dataset, with 5 of them were validated by western blot. Among the DAPs, 49 proteins were found to participate in sugar metabolism-related processes including photosynthesis, carbon fixation as well as carbon, amino sugar, nucleotide sugar, starch and sucrose metabolism. Based on our studies, a putative network of key proteins regulating sucrose accumulation in sugarcane is proposed, with glucose-6-phosphate isomerase, 2-phospho-D-glycerate hydrolyase, malate dehydrogenase and phospho-glycerate kinase, as hub proteins. The sugar metabolism-related proteins identified in this work are potential candidates for sucrose improvement by molecular breeding. Further, this work provides an alternative solution for omics data processing.


Introduction
Omics data analysis applies statistical and computational methodology to determine the key genes, proteins and metabolites that significantly regulate various plant growth and developmental processes [1,2]. A major challenge is that the omics data usually comprise thousands or even millions of features, but the sample size used can be relatively limited to detect potentially biologically-relevant proteins or its attributes. This creates the so-called p > > n problem [3], i.e., the number of features is massively larger than the number of samples needed to make reliable findings. With large high dimensional datasets, classical statistical models such as linear regression or logistic regression become oversaturated and ineffective, and thus cannot provide useful outcomes.
One strategy to address this analytical limitation is to apply a marginal regression approach or a t-test approach to analyse one bio-feature at a time. Such an approach can often identify a set of hundreds of significant features [4]. Alternatively, another recently proposed approach is penalized regression [4], which analyse multiple features simultaneously, and use a penalty function to shrink the effects of unimportant features towards zero to exclude them to keep only the important ones in the model. The penalized regression approach has also been suggested as a tool to estimate the sparse inverse covariation matrix, which can be applied to construct gene or protein interaction networks [5]. A third option is to use Bayesian regression methods. The Bayesian approaches are in fact closely connected to the penalised regression, in the sense that they interpret the penalty function as a prior distribution, and then combine the prior and the model likelihood to form a posterior distribution. Then simulation-based computational methods such as Markov Chain Monte Carlo can be used to approximate the posterior distribution. One advantage of the Bayesian regression approaches is that they can directly provide uncertainty measurements such as credible intervals or inclusion probabilities to the regression coefficients, which can be used for inference. Although these newer penalized regression and Bayesian approaches have been proposed to detect association between traits of interest and different kind of omics data, including proteomics, biologists tend to keep on using conventional two sample t-test or simple logistic regression, due to the fact that there is little efforts to illustrate the power and advantages of using the high multiple regression methods for large biological datasets [6].
Sugarcane has the capacity to store sucrose in the stem as the main reserve food [7], and it contributes to > 80% of the world's sucrose production [8]. High sucrose content is a main target of sugarcane breeding. However, the sucrose content in sugarcane remained plateaued for several decades despite sustained conventional breeding efforts for a long time. Recent success of molecular breeding in other crops suggests its potential for sucrose improvement in sugarcane [9][10][11]. However, little success has been made so far in relation to identifying key genetic elements that control sucrose accumulation, which will help develop molecular breeding strategies for sugarcane. Some previous studies in sugarcane have identified candidate genes likely to be involved in sugar accumulation, but the molecular mechanism sugar accumulation in this crop remains unknown [12][13][14][15][16].
Ethylene is an effective chemical ripener used in commercial sugarcane crops [17]. In our previous work, application of ethephon (an ethylene releasing compound) in a high-and a low-sugar genotype elicited differential sucrose accumulation responses, with low-sugar variety accumulating relatively more sugar than the high-sugar clone [18]. Using this system, proved to be an effective model for dissecting the molecular regulation of sugar accumulation in sugarcane, we established a proteomics dataset representing early stages of sugar accumulation in both clones in response to ethephon treatment. The dataset was based on 12 independent samples and had about 3,000 protein features. Differentially abundant proteins (DAPs) were identified using Mann-Whitney Test with a significance level of p < 0.05 adjusted by Benjamini-Hochberg Correction and fold change over 1.2. But, in this analysis only 25 proteins related to sugar metabolism were identified [19]. To determine whether our previous analysis detected only a fraction of proteins associated with sugar accumulation, we evaluated more sophisticated and powerful methods for analysis. Here we used three different statistical approaches, namely, the marginal logistic regression method [20], a logistic penalized regression approach named Elastic net method [21], and a logistic Bayesian stochastic search variable selection (SSVS) method [22] to re-analyse the proteomics dataset to determine the most effective analytical methods for proteomics data with < 20 sample size and to maximise the extraction of valuable information for the study. described in our previous work [19]. Briefly, two sugarcane genotypes, ROC22, a high-sugar (on average 15% sucrose content, ROC5 × ROC69-46), and GT86-877, a low-sugar (on average 6% sucrose content, GT82-10 × GT73-11) genotype were used for the experiment. Plants were grown in the experimental farm of Sugarcane Research Institute, Guangxi Academy of Agricultural Sciences, Nanning, Guangxi China in 2016. and they were treated with deionized water or 400 mg/L Ethephon solution as foliar spray till run-off from the lamina in early sucrose accumulation stage (8 months old crop), in mid-October. Developing stalk tissues, 20 cm above the node attached to the second youngest fully expanded leaf, which is highly photosynthetically active, were sampled on day 7 following ethylene treatment. Pooled sample of tissues collected from six individual plants from each replicate plot constitute a single biological replicate. Three biological replicates were collected for each clone from each treatment and they were flash-frozen in liquid nitrogen for proteome analysis by iTRAQ method.

Batch effect correction
The following linear mixed model (LMM) was applied to correct the potential batch effect caused by the data sampling process (i.e., plates in MS): where log-inten ijk represents the intensity value at the protein features k of the replicate i within the group j, g i is the design matrix of the (fixed) group effect and a j is the corresponding regression coefficients, b j is the random effect of the replicate j, s jk is the (interactive) random effect of the protein k and replicate j, and e ijk is the model residual. The random effects r j . s jk and the residuals e ijk are assumed to mutually follow normal distributions N (0, σ 2 r ) , N (0, σ 2 s ) , and N (0, σ 2 e ) , respectively. The regression parameters of LMM was estimated by the maximum likelihood method, implemented by the R package (version 4.0.3) lme4.
Principal component analysis (PCA) was conducted by the R function "prcomp" on the proteomics data to visualize the structure and variation among the samples, and check with the performance of batch effect correction. The residuals.
were considered as features being used in the follow-up statistical analyses.

Differential protein abundance analysis
The marginal regression analyzed one spectrum at a time, defined as.
(1) log −inten ijk = g j a i + b j + s jk + e ijk , (2) x ik = log −inten ijk − s jk where x ik is the (adjusted) intensity of the proteins k at the sample i, y i is the binary group indicator, z i is the design matrix of the covariate variables, β 0 is the model intercept, and β k is the effect of protein features k which is of primary interest, and e ik is the residual error which is assumed to follow a normal distribution.
A t-test was conducted on the regression coefficient β k , and p-value for the proteins was calculated. The p-values were adjusted by the conservative Bonferroni adjustment to control the multiplicity.
Additionally, a penalized logistic regression approach named Elastic net [21] was applied to simultaneously analyze all the protein features: . Variables y i and x ik , and regression parameters β 0 and β k are defined as the way as in Eq. (3), but the difference is that now in (4) the multiple protein variables are simultaneously analyzed in the same model. Another major difference from the simple regression model (3) is that in (4) the group variable y i is treated as binary response variables, while the spectra x ik are treated as explanatory variables. In (4), a penalty term combining the l 1 and l 2 norm penalties: k ] was introduced to shrink the effects of un-important features (i.e., a protein which is not associated with the group variable) to be zero, and only keep the important proteins into the model. The tuning parameter λ (λ > 0) determines the degree of shrinkage of the regression parameters, and how many proteins should be included into the model, while the weight parameter ω (0 < ω < 1) determines the relative importance of the l 1 and l 2 penalties. The advantage of using such a mixture of l 1 and l 2 penalties over using merely the l 1 penalty (which leads to theLASSO regression) is that the mixture penalty could account for the dependency structure among different features, and simultaneously select multiple correlated features into the model. The optimal values of λ and ω are determined by nested cross-validation. The proteins having non-zero regression coefficients were considered to have significant features. The numerical estimation of model (4) was conducted by the R package glmnet (version 4.0).
One disadvantage of the logistic Elastic net approach is that the uncertainty measures of the regression coefficients are not available, especially when the sample size is limited. Therefore, no multiple testing procedure could be applied to control the false positives. This motivates us to use an alternative approach named Bayesian stochastic search selection. Unlike Elastic net (4) which adds the combination of l 1 and l 2 penalty of the regression parameters to the log-likelihood function, in SSVS, the regression coefficient is assumed to follow the so-called spike and slab prior distribution as: The prior is a mixture of normal distribution with zero mean and unknown variance σ 2 j and point mass at zero. The binary indicator variable r j determines whether the regression parameter is included into the model. The indicator r j was further assigned with a Bernouli prior: and the hyper-parameter ω was specified to a small value, following the assumption that only a small proportion of the spectra was associated with the treatment effects. In this work, we use ω = 0.05.
The variance σ 2 j was assigned with an Inverse Gamma prior: where the hyper-parameters were specified as a = b = 0.1, which is non-informative.
Combining these prior distributions and the model likelihood leads to a posterior distribution of the model parameters θ = (β j , r j , σ 2 j , ω): . Bayesian estimation of a logistic regression is generally recognized as a challenging problem. To ease the computation, the posterior was first decomposed into the following two separate problems [23]: where PG (n i , v i ) represents a Polya Gamma distribution. Now the posterior has a hierarchical structure with the second layer equivalent to a normal SSVS regression instead of a logistic SSVS regression. An efficient Gibbs sampler could be used for parameter estimation.
The empirical marginal posterior distribution of the selection indictor: P (γ j = 1|y) which is often viewed as a posterior inclusion probability (PIP) [24] can be used for the posterior inference. Intuitively, one could use PIP > 0.5 as a criterion to claim the corresponding P(θ |y) ∝ P(y|θ)P(θ) features to be significant. However, such a heuristic criterion does not necessarily guarantee that the false positives due to simultaneously tests of multiple hypotheses could be effectively controlled [25]. Alternatively, we could also utilize the marginal probability of a feature not being selected: P (γ j = 0|y) = 1 −P(γ j = 1|y) , which is interpreted as a local false discovery rate (LFDR) [26][27][28]. Based on LFDR, a global level Bayesian FDR (BFDR) [29,30] could be constructed by combining LFDRs for a group of features. To eliminate the multiplicity, the BFDR should be controlled under a defined threshold, here we specified α = 0.05 guaranteeing that the overall FDR among the multiple hypotheses is less than 0.05. In more detail, the BFDR is calculated as follows. First, the LFDRs for each feature are sorted in ascending order. The average value of the LFDRs for the first T features (T = 1, … ,p) is defined as a BFDR for feature markers. We find the highest possible value of BFDR (the average of the T smallest LFDRs), which is still smaller than the given threshold, and the corresponding features are significant.

Protein validation by western blot
To verify the DAPs detected by statistical methods are indeed biologically relevant to sugar accumulation, 3 proteins selected randomly from this work and the 2 proteins validated in our previous work [19] were further validated using western blot (WB) based on the protocol we used previously [19]. Briefly, SDS-PAGE was used to separate proteins (20 µg) from each sample. They were then transferred to polyvinylidene fluoride membranes which were incubated with appropriate primary antibodies generated by Abmart and the HRP-conjugated anti-mouse IgG secondary antibody (Abmart, M21001). An enhanced chemiluminescence system (Biouniquer, China) was used to visualize Immune-reactive bands, which were exposed to X-ray film (Kodak). Following this, ImageJ program was used to quantify signal intensities which were normalized to the b-actin signal.

Protein-protein interaction network construction
Protein-Protein Interaction (PPI) Network Analysis was conducted using STRING (https:// string-db. org/) [35]. We selected the gene symbol as input of website https:// cn. string-db. org/, chose multiple proteins, used gene symbol as list of names, and selected Zea mays as organism. The output was exported in the tsv format. The PPIs in the string-db are based on experimental data or predictions by bioinformatics methods. We mapped the DAPs associated with sucrose accumulation onto the PPI network. The hub genes were determined by the website https:// cn. string-db. org/. Nodes with the greatest numbers of interactions with neighboring nodes were considered as hub nodes.

Datasets and batch effect correction
The sucrose content was increased in both genotypes after 3 days of treatment, and it was continued for 7 days [18]. Four samples were collected on 7th day after treatment, and they were labelled as RCK: samples from high-sucrose content sugarcane (on average 15% sucrose content) treated with water; MCK: samples from lowsucrose content sugarcane (on average 6.0% sucrose content) treated with water; R400: samples with high-sucrose content sugarcane treated with 400 mg/L ethephon; and M400: samples from low-sucrose content sugarcane treated with 400 mg/L ethephon. Each group comprises 3 independent biological replicates. By iTRAQ experiment, about 65,000 spectra (19%) were identified from a total of 345,000 spectra of each iTRAQ sample. After data processing, 2,983 proteins were identified [19], and they were included in the statistical analysis. The PCA (Fig. 1a) reveals that the first replicates of the four groups formed one cluster, and the rest of the replicates jointed in the other. The first and second PC explained 81% and 10%, respectively, of the total variation of the spectral data. After batch effect correction, samples from four groups were clearly separated (Fig. 1b). The two PCs explained 41% and 24% of the variation.

Identification of differentially abundant proteins
The stalk tissue protein abundance in the following treatment comparisons were analysed: (i) high-sugar genotype VS low-sugar genotype (RCK vs. MCK, R400 vs. M400), and (ii) water VS ethephon treatment (MCK vs. M400, RCK vs. R400). Protein abundance variation in different groups were analysed by three different approaches, including a logistic marginal regression, a penalized multiple logistic regression called Elastic net, and a Bayesian multiple logistic regression method namely Stochastic search variable selection (SSVS).
The simple regression analysis identified 340 proteins that are differentially abundant in the samples with low-and high-sucrose content (RCK vs. MCK, R400 vs. M400) (Table S1). While the multiple regression-based Elastic net detects 40 DAPs (Table S2), which were also detected by the simple regression analysis. The Bayesian SSVS approach identified 21 significantly different DAPs (Table S3), but only 2 of them overlapped with those detected by simple regression method, indicating that 19 new proteins (Table S4) were identified. So, a total of 359 Fig. 1 The scatter plot of first and second principal components (PCs) ofsugarcane proteomics data. PCs were calculated from the data before (a) and after (b) batch effect correction DAPs were detected from the genotype comparison. No common gene was found in the analysis outputs of these three methods ( Fig. 2A). When genotypes treated with ethephon were compared (R400 vs. M400), 162, 23 and 3 proteins were found up-regulated in the simple regression, the Elastic net and the Bayesian SSVS analyses, respectively, whereas 40, 1 and 3 proteins, were found to be down-regulated in these three methods, respectively. A similar comparison of genotypes treated with water (RCK vs. MCK), identified 173, 23 and 3 up-regulated and 26, 2 and 2 down-regulated proteins in the simple regression, the Elastic net and the Bayesian SSVS analyses, respectively (Fig. 2B).
In the water and ethephon treatment comparison (MCK vs. M400, RCK vs. R400), the simple regression approach detected 126 DAPs (Table S5), and the Elastic net method identified 60 DAPs (Table S6). Again, all the DAPs identified by Elastic net were also detected by simple regression. However, the Bayesian approach identified 26 DAPs (Table S7), and only 4 of them were common across the Bayesian and the simple regression groups, suggesting that 22 additional DAPs were detected with Bayesian analysis (Table S8). Thus, a total of 148 DAPs were detected from the comparison based on treatments, of which 4 proteins were detected in all the three methods (Fig. 2 C). Three of these proteins were annotated, including enolase (m.4058; m.159,651) and histone H1 (m.136,263). In the comparison of low sugar genotype treated with water and ethephon (MCK vs. M400), 58, 39 and 3 proteins were found up-regulated, and, 15, 7 and 13 proteins were down-regulated in the three methods combined. In a similar comparison between water vs. ethephon treatments in the high-sugar genotype (RCK vs. R400), 48, 28 and 4 up-regulated and 22, 8 and 12 downregulated proteins were detected between the 3 statistical methods, (Fig. 2D). So, totally 507 DAPs were detected between three methods, of which 306 DAPs were annotated (Table S9).

Identification of differentially abundant proteins involved in sugar metabolism
To find out the DAPs associated with sugar metabolism in sugarcane, KEGG enrichment analysis of 306 DAPs identified by the three methods was conducted (Table S9). Sugar metabolism related pathways, including carbon fixation in photosynthetic organisms, carbon metabolism, photosynthesis, amino sugar and nucleotide sugar metabolism, as well as starch and sucrose metabolism, were enriched ( Fig. 3; Table S10). A total of 48 DAPs involved in sugar metabolism were identified from KEGG enrichment analysis (Table S11).
Considering that some proteins associated with sugar metabolism related pathways were not enriched due to their very low abundance, we checked the annotation of all the DAPs detected in this work to find possible missing proteins related to sugar metabolism. We did find one protein m.146,783, annotated as chlorophyll a-b binding protein, which is a photosynthesisantenna protein, was positively correlated with sugar metabolism in plants. So, the protein m.146,783 is also considered as a candidate target participating in sucrose accumulation in sugarcane. Thus, a total of 49 proteins associated with sugar metabolism were identified in this work (Table 1). Then we compared these 49 DAPs with the DAPs involved in sugar metabolism identified in our previous work [19]. Seven DAPs were found overlapping between two studies, indicating that 42 new DAPs associated with sugar metabolism were identified in this work (  159,644/m.159,651), ribulose-phosphate 3-epimerase (m.37,775) and ribulose-1,5-bisphosphate carboxylase oxygenase (m.120,708) were also identified in this work, compared to our previous work [19]. Interestingly, m.159,651 was the common protein detected by all the three methods, implying an important role for it in the sucrose accumulation process.
While in the carbon metabolism pathway, 23 new proteins were identified in this work, including important enzymes such as, phosphoglycerate kinase (m.138,161), glucose-6-phosphate isomerase (m.101,603), phospho-glucomutase (m.128,432/m.141,239), and  Table 1 Differentially abundant proteins related to sucrose accumulation identified in this work Protein ID and names in italic bold indicate DAPs with their encoding genes showing differential expression at transcriptional level [18], and those with bold font with gray background indicate proteins overlapping between those identified in the current and our previous work [19] 6-phosphogluconate dehydrogenase (m.48,690), etc.
Genes are carriers of genetic information, while proteins execute the gene function. To get moreinsights into the molecular mechanism of sucrose accumulation, we looked for overlaps between the new DAPs and those from our previous work that used the same dataset [19], as well as DEGs reported by Chen et al. [18]. There were 7 overlapping proteins between this work and previous work [19] (Table 1). This analysis also identified genes of 12 DAPs showing differential expression at transcriptional level ( Table 1).

Validation of DAPs by western blot
To validate the reliability of DAPs identified in this work, the abundance of 5 DAPs involved in sugar metabolism were validated by western blot. Among them, 2 proteins, m.120,013 (FBA, fructose-bisphosphate aldolase) and m.97,819 (beta-glucosidase 30-like) identified both in this study and previous work, have also been validated by western blot in previous work [19]. The protein m.68,843 (phosphoenolpyruvate carboxylase) showed higher expression in CK than that in ethephon treatment in both high-and low-sugar genotypes but showed no difference between high-and low-sugar genotypes irrespective of treatment, indicating that m.68,843 was negatively affected byethephon not by genotype. While m.138,161 (phosphoglycerate kinase) and m.141,239 (phospho-glucomutase) were highly expressed in low-sugar genotype than in high-sugar clone, ethephon failed to induce them in both genotypes, suggesting that their expression is genotype-dependent. The results from western blot confirmed the DAPs detected by the new statistical analyses in this study, proving the value of the statistical approaches described here for identifying DAPs in sugarcane (Fig. 4).

Protein-protein interaction network
Almost all the biological functions are mediated by proteins. Many proteins take part in biological processes such as signal transduction, gene expression, energy and nutrient metabolism by interaction with other proteins. PPI analysis predicts the function of proteins and identify key regulatory factors. To further explore the interaction between the 49 DAPs related to sugar metabolism identified in this work, a PPI network was constructed using STRING database. The PPI network contained 42 nodes and 350 edges. In the PPI network, 4 proteins (degree ≥ 26) were selected as hub proteins, including glucose-6-phosphate isomerase, 2-phospho-D-glycerate hydrolyase (enolase), malate dehydrogenase (NADP-dependent malic enzyme) and hosphor-glycerate kinase. Moreover, malate dehydrogenase had the highest degrees, suggesting that it may play a crucial role in sucrose accumulation in sugarcane (Fig. 5).

Discussion
Sugarcane sucrose content improvement by conventional breeding programme is concerningly low due to the complex genetic background and slow introgression of exogenous genes [36]. To explore molecular breeding strategies for sugar content improvement, we are studying the molecular basis of sucrose accumulation in sugarcane [13,16,18,19], with the ultimate aim of identifying potential molecular targets for variety improvement. Recently, we have developed a proteomic dataset developed for sucrose accumulation studies, and it is used in this study. After acquiring the raw sugarcane proteomics data, and conducting routine data pre-processing such as protein feature quantification, baseline correction and normalization, we conducted three different types of statistical analyses such as (i) principal component analysis (PCA), (ii) protein differential expression Fig. 5 The protein-protein interaction network based on DAPs related to sugar metabolism. Each node represents a protein. The helical symbol in the node indicates the known 3D structure of the protein, and empty nodes indicate unknown proteins. The line between two nodes represents interaction and multiple lines represent various interactions between two proteins analysis and (iii) protein interaction network modelling, and they revealed biologically meaningful results from different perspectives. The PCA results (Fig. 1b) gave an overall picture of how the proteomics data of different samples varied. The first PC clearly separated M400 from the other three groups, indicating that when the sucrose content is low, the treatment of ethephon has a significant effect on protein abundance, which was also supported by our transcriptomic studies [18]. In that work, more DEGs were induced by ethephon in low-sugar genotypes compared to high-sugar clone [18]. On the other hand, the second PC separates the groups M400, and MCK from the other two groups R400, and RCk. Hence, the second PC explained the difference in protein expression between samples from the groups of low and high sucrose content, which was consistent with the finding that genotype had a dominant effect on gene expression than ethephon treatment [18].
We then evaluated and compared the three high dimensional statistical approaches, including the logistic marginal regression method [20], the logistic Elastic net method [21], and the logistic Bayesian stochastic search variable selection (SSVS) method [22], for detecting significantly differentially abundant proteins from this proteomics data comprising thousands of features that are associated with vital traits of interest. The marginal regression methods analyses one feature at a time, and then use a multiple testing procedure to eliminate the false positive finds. While the Elastic net and Bayesian approaches simultaneously analyse multiple protein features, and use penalty functions or prior information to conduct feature selection. The logistic Elastic net and Bayesian SSVS methods are closely related approaches and both of them analyse multiple features simultaneously and are able to conduct variable selection to select only a set of important features into the final model. However, the two approaches are philosophically different. The Elastic net, as a frequentist approach, relies on the likelihood model to make inference, and assumes all the model parameters to be fixed. But the Bayesian approach uses the posterior distribution (the likelihood combines with model prior) as a foundation to make inference, and all the model parameters have a marginal posterior distribution. Therefore, the uncertainty measures such as standard errors or credible intervals, or model inclusion probability in the Bayesian SSVS were directly controlled [37]. In the sugarcane proteomics dataset with 12 samples and about 3,000 protein features, the performance of the three methods on detecting different expressed features associated with either (i) high sugar VS low sugar content and (ii) water VS ethephon treatment was evaluated. The simple regression approach is the most liberal approach which detected 466 significant DAPs (Table   S1; S5), accounting for 91.9% of total DAPs, even using the Bonferroni adjustment which is often considered as the most conservative approach for multiple testing. So, the simple regression is the most effective method for analysis of the proteomics dataset. However,, the multiple regression-based Bayesian SSVS detected 41 more significant DAPs, while the Elastic net method detected 40 and 60 DAPs from genotype and treatment comparisons, respectively (Table S2; S6),. All these DAPs were also detected by the simple regression analysis suggesting that the two approaches could confirm the analytical reliability of each other.
A total of 306 annotated DAPs were identified by the three statistical methods used in this research (Table  S9), whereas only 139 DAPs were identified in our previous work using the same dataset [19]. By KEGG analysis and further comparison with our previous work, 42 new DAPs associated with sugar metabolism were identified in this work (Table 1). These DAPs provide new targets for further functional analysis and molecular breeding due to their potential biological functions in sugar metabolism. For instance, phospho-glucomutase (PGM) catalyzes the reversible conversion of glucose 1-phosphate (G1P) and glucose 6-phosphate (G6P), which is one of the vital enzymes promoting carbohydrate synthesis in higher plants [38,39]. The lack of the cytosolic PGM activity in Arabidopsis resulted in decreased rosette fresh weight, shorter roots, and reduced seed production, leading to reduced growth [40]. The alphaglucan phosphorylase is an important enzyme involved in carbohydrate metabolism in prokaryotes and eukaryotes. The plant alpha-glucan phosphorylase, also called starch phosphorylase, is known for the phosphorolytic degradation of starch, which is closely related to sugar metabolism [41]. In this work, 2 phospho-glucomutase (m.128,432/m.141,239) and one alpha-glucan phosphorylase (m.45,257) were newly detected, suggesting that they may play a crucial role in sugar metabolism.
The hub node in the PPT network may be the key regulatory factor in a pathway. The PPI analysis suggests that enzymes glucose-6-phosphate isomerase, malate dehydrogenase (NADP-dependent malic enzyme), enolase (2-phospho-D-glycerate hydrolyase) and phosphoglycerate kinase are hub proteins, with malate dehydrogenase scoring the highest degrees. So, these hub proteins may play a crucial role in sucrose accumulation in sugarcane. Glucose-6-phosphote isomerase catalyzes the interconversion of glucose 6-phosphate and fructose 6-phosphate, which plays an important role in sucrose synthesis. Over-expression of a wheat glucose-phosphate isomerase gene TaPGIc in Arabidopsis thaliana improved plant photosythesis, starch over-accumulation, biomass and yield [42].