The information combining peak detection (ICPD) algorithm for LC/MS
In this section, we describe the proposed ICPD algorithm. Isotope pattern matching is widely used for peak detection in MS because non-peptide contaminants do not possess similar isotope patterns as that of peptides. While it is possible to differentiate peptide peaks from noise peaks based on other parameters, such as peak intensity, they are generally not as effective for low abundance peptide identification. Isotope pattern matching has been employed as the core method in many existing LC/MS processing packages ( [6, 9]). However, isotope pattern estimation is primarily performed on a scan by scan basis at one charge state at a time. Since LC/MS is corrupted considerably by noise, the estimation of isotope pattern is often inaccurate which greatly reduces the specificity and sensitivity of peak detection.
In the following, we first propose a method to estimate isotope pattern through combining all relevant information in all MS scans and then discuss the the ICPD algorithm which performs peak detection based on the estimated isotope pattern.
Isotope pattern estimation based on information combining
Suppose that through some preprocessing steps, which we will describe in detail in the next subsection, a peptide candidate peak list is generated. This candidate list includes all probable peptides peaks. Each entry in the candidate list is a peptide candidate annotated with its mass value and chromatographic peak duration. Suppose that the chromatographic peaks of each peptide candidate have been extracted for all charge states and isotope positions from the LC/MS dataset. For a peptide candidate i in the list, we consider the following model for abundance of the chromatographic peak at a particular charge state cs and isotope position iso:
P
i
(t
r
, cs, iso) = A
i
· C
i
(tr) · f (cs) · f (iso), (1)
where A
i
stands for the total abundance of the i th peptide candidate which is also noted as the total volume of the peak, C
i
(t
r
) stands for the chromatography eluting distribution function of the i th peptide candidate at retention time t
r
, it describes the fraction of peptide i that elutes out during the t
r
th retention time period. C
i
(t
r
) sums to one. cs stands for charge state, iso stands for the number of Carbon C13s in the peptide, f(cs) stands for the charge state distribution function, and f(iso) stands for the isotope distribution. Both f(cs) and f(iso) are distribution functions that sum to one. This model describes how does a peptide elute out according to an elution distribution function and subsequently get dispersed to different charge states according to f(cs) and isotope positions according to f(iso). The observed chromatographic peak of the peptide can be expressed as
y
i
(t
r
, cs, iso) = p
i
(t
r
, cs, iso) + n(MZ(M
i
, cs, iso), t
r
), (2)
where MZ(M
i
, cs, iso) stands for the m/z value of the peptide with mass M
i
at charge state cs and isotope position iso. It is calculated as: MZ (cs, iso, M
i
) = (M
i
+ cs · w
p
+ iso · w
n
)/cs, where w
p
stands for the weight of a proton that carries one positive charge, and w
n
stands for the weight of a neutron which causes weight shift in an isotope. n(MZ(·),t
r
) stands for the noise at MZ(·), and at the elution time t
r
. Various sources could attribute to the noise term n(MZ(M
i
, cs, iso), t
r
) whose characteristics are not entirely understood and are highly dependent on the instrument [11, 12]. Although in studies like [12], the noise variance is studied, the noise distribution is still unknown. Unless an accurate model can be constructed for n(MZ(M
i
,cs,iso), t
r
) for the specific instrument studied, effective combining of y
i
(·)s can not be performed.
On the other hand, if we sum y
i
(tr, cs, iso) with respect to cs and C
i
(t
r
), the resulted signal becomes
where
is a noise corrupted copy of f(iso) with a multiplication coefficient. The noise term
is the summed noise term. Since n′ is the summation of many noise terms n(MZ(M
i
, cs, iso), t
r
) at different m/z locations and MS scans, it can be assumed to be Gaussian according to the central limit theorem. Here, it is characterized as a zero mean Gaussian noise with variance
. Given this model,
becomes a Gaussian variable with mean A
i
f(iso) and variance
, where f(iso) is the theoretically predicted isotope pattern based on “averigine” [16]. The likelihood function of the isotope pattern becomes:
which measures the likelihood that a peptide with an theoretical isotope pattern f(iso) could have caused the observed value
. The term
can be used as the isotope pattern matching score. However, it is critical to assign a reasonably good approximation of the noise variance. The noise variance generally scales with the peptide abundance with a power law
. Various p have been reported in the literature [11, 12]. The former suggested a poisson distribution on noise and thus p =1, the later suggested a quadratic form, p = 2, at high intensities for a Q-TOF MS. Note that these previous results are based on individual MS scans and are derived for specific instruments. When applying the peak picking algorithm, the selection of p should be adjusted. In the simulation section, we experimented with several possible values of p and found that p = 3 yields the best performance on the instrument that we experimented for low abundance proteins.
Note that in the past, the estimation of isotope pattern in algorithms such as [6] is based on local observations of the isotope pattern in one MS scan indexed by t
r
and at one charge state indexed by cs: y
i
(t
r
, cs, iso) = A
i
· C
i
(t
r
) · f(cs) · f(iso) + n which has a smaller SNR. Not surprisingly, the performance is a lot worse than the result based on information combining algorithm we propose here which is shown in simulation results.
ICPD algorithm
The previous section introduced a method for estimating isotope pattern by combining all relevant information in different scans and charge states. Before the method can be applied, several preprocessing steps are necessary to complete the peak detection process. The ICPD algorithm is described as the following.
Generate a peak candidate list MassList.
For each entry in the MassList, extract their chromatographic peaks at different charge state and isotope positions.
Estimate the isotope pattern by combining multiple scans and charge states, and calculate the isotope matching score.
Produce an output candidate list annotated by various peak parameters such as isotope matching score.
Apply a threshold on the matching score, and produce a list of detected peaks.
Now we explain each processing steps. 1. To generate a peak candidate list MassList, we first amass a list LMZOverall of all m/z values where a MS peak is centered. For high resolution data, this can be done by converting all MS scans to centroid data. Many software package provide this functionality and we utilized the mspeaks function provided by Matlab.
Then we group all m/z values that are within a bin of 2dmz ppm from each other. dmz is determined mostly by the mass accuracy of the instrument. Then, an elution time profile can be extracted for the m/z value within the bin. Then segments of the LC profile that contains at least s non-zero values are extracted. Here s is a lower bound for LC peak width. It is desirable to set it to a very small number (2 - 3) such that no peptides can be missed in the candidate list. Segments with a gap of g scans are combined. The tolerable gap in LC chromatographic gap depends on the instrument operating mode. If LC/MS and LC/MS/MS are collected together, ions may be directed to LC/MS/MS periodically which produces gaps in LC/MS. After segmentation and gap filtering in the LC profile process, the resulted segments are considered as peptide candidate peaks in the considered m/z value bin. The mean m/z value is recalculated within each segment. This process ensures that peptides with small mass difference within the bin that elute at different times are separated. Each peptide candidate is registered in a list called MZlist. Each entry of the MZlist contains the elution start time, end time, m/z value. Overlapping of bins of maximally dmz ppm is allowed. This could cause a peptide peak being registered twice in the MZlist. A subsequent merging step is performed if two peaks have m/z values that are within dmz ppm and the elution time period is overlapping.
Subsequently, this candidate MZlist is converted to the peptide candidate mass list MassList by assuming upto CS charge states. For example, if the maximum charge state considered is 4, then MZlist will be expanded four times by assuming each peptide candidate in MZlist has charge state 1, 2, 3 and 4. Thus, an entry in MassList contains a mass value of the peptide candidate as well as the elution time period.
2. To extract chromatographic peaks of a peptide candidate in the MassList, the m/z values at different isotope positions and different charge states are theoretically calculated first. For example, if the mass of the i th entry of MassList is M
i
, then its m/z value at charge state cs and isotope position iso can be calculated as MZ(cs, iso, M
i
) = (M
i
+ cs * w
p
+ iso * w
n
)/cs. Then MS peak centroid falling within ±dmz of the theoretically predicted values and within the elution time period of the peak candidate are extracted from the LC/MS datasets. These MS peaks are then sorted according to their elution time form the chromatographic peak of the peptide at the considered charge state and isotope position.
3. After extracting the chromatographic peaks, isotope pattern estimation can be performed as described in section .
4. An output list OutList is generated by annotating each candidate peptide in MassList with their isotope matching score as well as other parameters such as maximum peak intensity and peak volume.
5. User pick a threshold on the isotope matching score, and all entries that pass the threshold will be reported as detected peptide peaks.