### Protein mixture model

In a typical label-free MS experiment, two sample classes (e.g. control vs. treatment) are considered. Assume each class has *M* samples and all samples share up to *N*_{
pro
} possible protein species of a given proteome. Protein concentration in the pooled control sample is modeled by a Gamma distribution in accordance with the observations in [13]:

{\eta}_{l}~\mathsf{\text{Gamma(}}t\mathsf{\text{,}}\theta ),\phantom{\rule{1em}{0ex}}l=1,2,...,{N}_{pro},

(1)

where *t* = 2 and *θ* = 1000 are the shape and scale parameters. The concentration has a dynamic range of approximately 4 orders of magnitude representing typical real-world scenarios. For the pooled treatment sample, expression levels of some proteins (e.g. biomarkers) may differ from those in the control sample, which can be captured by fold change:

{f}_{l}=\left\{\begin{array}{cc}\hfill {a}_{l},& \phantom{\rule{1em}{0ex}}\mathsf{\text{ifprotein}}l\mathsf{\text{isover-expressed}}\hfill \\ \hfill \frac{1}{{a}_{l}},& \phantom{\rule{1em}{0ex}}\mathsf{\text{ifprotein}}l\mathsf{\text{isunder-expressed}}\hfill \\ \hfill 1,& \phantom{\rule{1em}{0ex}}\mathsf{\text{otherwise}}\hfill \end{array}\right.

(2)

where the fold change parameter, *a*_{
l
} > 1, is sampled from a uniform distribution, as specified in the Results section.

Sample variation of each protein is modeled by a Gaussian distribution [14], with means *η*_{
l
} and *η*_{
l
} *f*_{
l
} in the control and treatment sample classes, respectively. Considering the fact that protein expression levels are often correlated, the following multivariate Gaussian (MVG) distribution is appropriate to model the interactions among proteins and their concentrations. Let {c}_{lj}^{pro} denote the molar concentration of protein species *l* in sample *j*, then we have

{c}_{lj}^{pro}~\left\{\begin{array}{cc}\mathsf{\text{MVG}}\left(\left[{\eta}_{1},{\eta}_{2},...,{\eta}_{{N}_{pro}}\right],\Sigma \right),\hfill & \phantom{\rule{1em}{0ex}}j\in \mathsf{\text{class0}}\hfill \\ \mathsf{\text{MVG}}\left(\left[{\eta}_{1}{f}_{1},{\eta}_{2}{f}_{2},...,{\eta}_{{N}_{pro}}{f}_{{N}_{pro}}\right],\Sigma \right),\hfill & \phantom{\rule{1em}{0ex}}j\in \mathsf{\text{class1}}\hfill \end{array}\right.

(3)

where the covariance matrix Σ has a block-diagonal structure-proteins within the same block (e.g. proteins belonging to the same pathway) are correlated with correlation coefficient *ρ* and proteins of different blocks are uncorrelated [15]:

\begin{array}{l}\sum ={[{\sigma}_{lj}^{2}]}_{{N}_{pro}\times {N}_{pro}},\\ {\sigma}_{lj}^{2}={\sigma}_{ll}{\sigma}_{jj}{r}_{lj},\end{array}

(4)

where *σ*_{
ll
} is proportional to the control protein mean *η*_{
l
} by a constant factor *ϕ*_{
l
} (i.e., the coefficient of variation), and the correlation coefficient matrix is

R={[{r}_{lj}]}_{{N}_{pro}\times {N}_{pro}}=\left[\begin{array}{cccc}{R}_{\rho}& 0& & 0\\ 0& {R}_{\rho}& \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0& 0& \cdots & {R}_{\rho}\end{array}\right],

where *R*_{
ρ
} is a *D × D* matrix with 1 on the diagonal and *ρ* elsewhere. The correlation *ρ* and block size *D* are tunable parameters, with values specified in the Results section.

### Peptide mixture model

Before being analyzed by the MS instrument, proteins are usually digested into peptides. In the proposed simulation pipeline, *in-silico* tryptic digestion is performed, and retention time of peptide products is predicted using the PNNL Protein Digestion Simulator [16]. Different protein species may share the same peptide sequence. Thus, the molar concentration of peptide species *i* in sample *j*, {c}_{ij}^{pep}, is given by the following equation:

{c}_{ij}^{pep}=\sum _{k\in {\Omega}_{i}}{c}_{kj}^{pro},\phantom{\rule{1em}{0ex}}i=1,2,...,{N}_{pep,}\phantom{\rule{1em}{0ex}}j=1,2,...,2M,

(5)

where the set Ω_{
i
} comprises all proteins sharing the peptide species *i*, and *N*_{
pep
} is the number of peptide species. The concentration {c}_{ij}^{pep} is represented by ion abundance in MS data. Thus, the expected abundance readout *μ*_{
ij
} of peptide species *i* in sample *j* can be modeled as

{\mu}_{ij}={c}_{ij}^{pep}{e}_{i}\kappa ,

(6)

where *e*_{
i
} is a peptide efficiency factor similar to the one used in [17], and κ is the MS instrument response factor converting the original analyte concentration to the output ion current signal. The parameter *e*_{
i
} is affected by many factors: first, various peptides differ in hydrophobicity, which mainly determines their efficiencies in passing through the liquid chromatography column. Then, upon entering the ionization chamber, peptides demonstrate great disparities in ionization efficiency, which is affected by sample complexity, peptide concentration and characteristics such as polarity of side chains, molecular bulkiness, and so on [18]. In addition, some amino acids at the N-terminal end of peptides have destabilizing effects that can reduce the efficiency factor. Although there are methods attempting to predict *e*_{
i
} [17], they often neglect the fact that peptide efficiency and expected peptide ion abundance depend not only on the underlying peptide, but also on the combinational effects of other peptides present (e.g., LC elution competition, ion competition and suppression). In reality, it is unfeasible to predict *e*_{
i
} for all possible peptide combinations. Thus, we model *e*_{
i
} from a uniform distribution and evaluate a wide range of interval bounds in simulations -- we are not really interested in the precise value of *e*_{
i
}, but rather we aim to examine how the dispersion of *e*_{
i
} affects subsequent analysis. As for the parameter *κ*, it can be estimated through calibration and is related to the efficiency by which molecules are converted into gas-phase ions, the efficiency by which ions are transferred through various stages of the mass spectrometer, and how well experiment conditions are optimized. For a typical MS instrument, its response is linear for three to five orders of magnitude [18]. At high analyte concentration, instrument response plateaus because of detector saturation, restricted amount of excess charge, or limited space for ionization, as depicted in Figure 2. To account for instrument saturation, an upper limit, *sat*, is set for the expected abundance readout: *μ*_{
ij
} = min(*μ*_{
ij
}, *sat*).

### Peptide detection and identification

#### Peptide abundance

The actual abundance *v*_{
ij
} of peptide species *i* in sample *j* is modeled as the expected abundance plus Gaussian noise:

{v}_{ij}={\mu}_{ij}+{\epsilon}_{ij},

(7)

where

{\epsilon}_{ij}~\mathsf{\text{Gaussian(0,}}\alpha {\mu}_{ij}^{2}+\beta {\mu}_{ij}),\phantom{\rule{1em}{0ex}}i=1,2,...,{N}_{pep},\phantom{\rule{1em}{0ex}}j=1,2,...,2M.

(8)

The sources of noise include variation in experimental conditions, instrument variance, thermal noise and measurement error. It is reported that the noise variance follows a quadratic dependence on the expected abundance [19], which is reflected by Eq. (8). The two parameters in the noise model, *α* and *β*, determine the noise severity. Their value can be estimated using replication analysis, as explained in [19].

In electrospray ionization, peptides can be multiply charged. But we do not model the charge distribution, considering the following facts: (1) Peptide charge distribution and the maximum charge states are complicated by many factors such as sample composition, analyte concentration and peptide conformation [20, 21]. The distribution is hard to predict and has not been well characterized. (2) In order to get the abundance of a peptide, and further, its parent protein, the abundance of peptide charge variants will eventually be summed up. We omit the intermediate process since in reality many factors involved are not well understood.

#### Peptide detection

Peptide detection from mass spectra is not an easy task -- the observed peptide signals are corrupted by noise and may also be affected by signals of other peptides, and thus may deviate significantly from the expected pattern. The performance of a peptide detection algorithm on a specific MS instrument and the underlying signal-to-noise ratios (SNRs) ultimately affect the number of detected true positives, i.e., the true positive rate (TPR), as shown in [22–25]. The SNR is defined as the ratio of signal power to noise power, i.e., SNR\triangleq E{\left[v\right]}^{2}/\mathsf{\text{Var(}}v\mathsf{\text{)=1/(}}\alpha \mathsf{\text{+}}\frac{\beta}{\mu}), see Eqs. (7)-(8). It can be seen that SNR increases as signal strength *μ* increases. The relationship between TPR and SNR can be approximated by a polynomial function, for algorithms such as those in [22, 23, 25]:

TPR=k\times SN{R}^{p}+b,

(9)

where *b* represents the worst TPR when the SNR approaches zero.

Besides *SNR*, signal interference and mass resolving power may also have considerable impact on TPR [12, 23]. Over the years, much effort has been made towards enhancing instrument resolution, leading to improved mass accuracy, better separated MS peaks, and less convoluted peptide signals. But for complex samples, substantial overlapping of peptide signals is still frequently encountered, due to peptide isoforms or co-elution. It has been reported that if two peptides have overlapping signal regions, some detection algorithms may fail to report one of them even when the underlying SNRs are high, while other algorithms are shown to be superior in the detection of overlapping peptides [22]. To account for signal interference, we modify Eq. (9) by introducing an overlapping factor *o*_{
ij
}, so that the TPR of peptide species *i* in sample *j* becomes

TP{R}_{ij}=\left(k\times SN{R}_{ij}^{p}+b\right)\times {o}_{ij},\phantom{\rule{1em}{0ex}}{o}_{ij}\le 1.

(10)

For algorithms such as NITPICK [24], BPDA [22] and BPDA2d [23], which are effective in detecting overlapping peptides, the overlapping factor *o*_{
ij
} can be approximated by 1, whereas for algorithms that are ineffective in detecting convoluted peptides, *o*_{
ij
} is assumed to be inversely proportional to the number of overlapping peptides, which is a function of the sample composition and the mass resolution. In our simulation, two peptide species *i*_{1} and *i*_{2} are said to overlap if their mass and retention time (RT) are close, in the sense that

\frac{|mas{s}_{2}-mas{s}_{1}|}{mas{s}_{1}}<\frac{1}{\mathsf{\text{massresolution}}}\phantom{\rule{1em}{0ex}}\mathsf{\text{and}}\phantom{\rule{1em}{0ex}}\frac{|R{T}_{1}-R{T}_{2}|}{\#\mathsf{\text{scans}}}0.005.

(11)

#### Peptide identification

The output of the MS1-based peptide detection algorithm is a list of detected peptides annotated by monoisotopic mass, retention time, abundance, and so on. To obtain peptide sequence information, i.e. peptide identification, which can be used to infer the parent protein from which the peptide was digested, database searching is required. To do so, the acquired MS/MS (MS2) spectra are searched against a protein database containing theoretical MS2 spectra generated from *in-silico* digested peptide sequences by popular software such as SEQUEST [26] and Mascot [27].

Several machine learning methods have been proposed to predict the probability (i.e., identifiability) of a peptide being identified through MS2 database searching [14, 28]. These methods try to extract the common trends residing in peptide identifiability that can be explained by peptide sequence-specific properties. Their successful application may suggest that the peptide sequence largely affects the chance of a peptide getting selected for MS2 analysis, whether the peptide can be sufficiently fragmented, and the quality of its fragmentation spectra. In our simulation, the identifiability *p*_{
i
} of the true peptide species *i* is predicted by the APEX software [14], trained on the human serum proteome [29], and whether peptide species *i* in sample *j* is identified or not through database searching is determined by the outcome of a Bernoulli trial with success rate *p*_{
i
}.

#### Linking of detection and identification results

For both MS1-based and MS2-based algorithms, sources of error exist that give rise to false positives (FPs). For the former, error sources include shot noise, abundance measurement error, signal interference, and so on. For the latter, co-eluting precursor ions, spectra matching ambiguity, or post-translational modifications may all lead to false identifications. By confronting the results of the two orthogonal algorithms (i.e., a feature is treated as a true positive if it is reported by both algorithms), dubious features reported by either algorithm can be filtered out.

### High-level analysis

#### Peptide to protein abundance roll-up

As demonstrated in the previous sections, each step of the MS analysis pipeline introduces a degree of loss or distortion to the underlying true signal. Thus, "decoding" protein abundance from observed peptide abundance corrupted by noise is nontrivial. To reduce noise, three levels of filtering are applied: (1) only unique peptides that exist only in one protein of the analyzed proteome are kept; (2) peptides with large missing value rates (larger than 0.7) are filtered out, since low reproducibility may be a red flag for false identifications; (3) among the remaining peptides, those having sufficiently high correlations (larger than 0.6) with other peptides digested from the same protein are retained. The estimated abundance of protein *l* in sample *j* is then obtained by averaging the abundances of its children peptides that pass the previous filters; if less than two peptides pass the filters, the estimated protein abundance is set to zero. The estimated protein concentration is calculated by dividing the estimated protein abundance by the instrument response factor *κ*.

Quantification accuracy can be assessed by the commonly adopted mean quantification error, defined by

qerr\triangleq \frac{\sum _{l=1}^{{N}_{pro}}\sum _{j=1}^{2M}|{c}_{lj}^{prot}-{\u0109}_{lj}^{prot}|/{c}_{lj}^{prot}}{2M\phantom{\rule{0.3em}{0ex}}{N}_{pro}},

where {c}_{lj}^{prot}and {\u0109}_{lj}^{prot}are the original and estimated concentrations of protein *l* in sample *j*, respectively.

#### Differential expression analysis

Differential expression analysis is performed via a two-sample t-test with equal sample size and variance. The t statistic (or t score) is calculated as below:

{t}_{l}\triangleq \frac{|{m}_{l}^{1}-{m}_{l}^{0}|}{\sqrt{\frac{Va{r}_{l}^{1}+Va{r}_{l}^{0}}{M}}},

where the superscripts identify the two classes, and *m*_{
l
} and *Var*_{
l
} represent the estimated class mean and variance of the abundance of protein *l*, respectively. The standard 0.05 significance level is used to detect differentially expressed markers.

#### Feature selection and classification

In the simulation, t-test feature selection is first performed to reduce the data dimension, by selecting the top 20 differentially expressed features. Then two classifiers, namely K-nearest neighbor (KNN, K = 3) and linear discriminant analysis (LDA) are trained using the observed protein expression data. Classification performance is validated by independent ground-truth (testing) data sets (each with 1000 samples, generated from the same data model), and the classification error is recorded. In addition, the KNN and LDA classification error on the original protein data (before entering the MS analysis pipeline) is obtained using a similar approach. The latter may serve as a benchmark to gauge how much loss in classification performance the analysis pipeline has introduced.