A top-down MS/MS spectrum generated from a protein consists of a precursor mass, corresponding to the molecular mass of the protein, and a list of peaks, corresponding to fragment ions of the protein. Each peak represents the mass-to-charge ratio and the abundance of the fragment ion. The *residue mass* of a spectrum *S* is defined as M(*S*) = *PrecursorMass − WaterMass*, where *PrecursorMass* is the precursor mass of the spectrum, and *WaterMass* is the mass of a water molecule. Because top-down MS/MS spectra are very complex, and the charge states of most fragment ions are high, high mass resolution and high mass accuracy spectra are absolutely required. The first step in top-down spectral interpretation is usually spectral deconvolution, which converts a complex top-down spectrum to a list of monoisotopic neutral masses (a deconvoluted spectrum) [19, 20]. The neutral masses are further converted to a list of *prefix residue masses (PRMs)* corresponding to the masses of protein prefixes [21]. For a collision-induced dissociation (CID) spectrum, the PRM spectrum is generated as follows: (1) the residue mass of the experimental spectrum is added to the PRM spectrum; (2) for each neutral mass *x* extracted from the experimental spectrum, two masses *x* and *PrecursorMass − ×* are added to the PRM spectrum. If mass *x* corresponds to a protein suffix (prefix), then mass *PrecursorMass − ×* corresponds to a protein prefix (suffix) [22]. The proposed extended generating function method can be applied to all types of spectra, such as CID and electron-transfer dissociation (ETD) spectra, because all these types of spectra can be converted to PRM spectra. All masses in PRM spectra are discretized by scaling the masses with a constant and rounding the values to integers [23]. For highly accurate top-down spectra, a scaling constant 274.335215 is used (e.g. mass(*G*) = 57.021464 × 274.335215 = 15642.995586 *≈* 15643) to reduce the rounding error to 2.5 parts per million (ppm) [22]. In the following analysis, we assume that only PRM spectra with integer masses are studied and peak intensities are ignored.

### Scores of PrSMs

A PRM spectrum *S* is represented as an ordered list of integer masses, in which the largest one is M(*S*). Let
be the set of the 20 standard amino acids with integer residue masses M(*r*) for
(the residue masses of amino acids are discretized using the same discretization method for PRM spectra). The residue mass of *r* is also denoted as *‖r‖*. The *residue mass* M(*P*) of a protein *P* is the sum of the residue masses of all amino acids of the protein. It differs from the molecular mass of the protein by the mass of a water molecule. A protein *P* with *m* amino acids is associated with an ordered list of integer masses *p*
_{1}
*< p*
_{2}
*<*. . . *< p*
_{
m
}, where *p*
_{
i
} is the sum of the residue masses of the first *i* amino acids and *p*
_{
m
} = M(*P*).

If the residue masses of spectrum *S* and protein *P* are the same value *N* , the *mass count score* of *S* and *P* is the number of shared masses (except for the residue mass *N*) in *S* and *P*, denoted by CScore(*S*, *P*). The *mass shift* of a PTM is the mass difference between the modified form (with the PTM) and the unmodified form of an amino acid residue. When a PTM occurs at the *i*th amino acid of *P* and the mass shift *d* of the PTM is positive, the modified form of *P* is denoted by *Q*
_{
i
}
_{,}
_{
d
}(*P*) = {*p*
_{1}, *p*
_{2}, . . ., *p*
_{
i−
}
_{1}, *p*
_{
i
} + *d*, . . ., *p*
_{
m
} + *d*}. When the mass shift of the PTM is a negative value *−d*, *Q*
_{
i
}
_{,}
_{
−d
}(*P*) = {*p*
_{1}, *p*
_{2}, . . ., *p*
_{
i−
}
_{1}, *p*
_{
i
}
*− d*, . . ., *p*
_{
m
}
*− d*}. In addition, if a mass in *p*
_{
i
}
*− d*, . . ., *p*
_{
m
}
*− d* is negative or not greater than *p*
_{
i−
}
_{1}, the mass is removed from *Q*
_{
i
}
_{,}
_{
−d
}(*P*). Let
be the set of all modified proteins of *P* with a PTM of mass shift *d*. When the protein is not ambiguous, we use shortened notations
. To identify an experimental PRM spectrum *S* generated from protein *P* with a PTM, one can find the mass shift *d* of the PTM by comparing the residue masses of *S* and *P* , and compute the mass count score between *S* and each of the modified proteins in
to find the best match. The *PTM mass count score* of *S* and *P* is defined as
CScore(*S*, *Q*), where *d* = M(*S*) *−* M(*P*).

### Random proteins

Let Pr(

*r*) be the probability that an amino acid

is observed at a position in a random protein. In practice, the frequencies of amino acids in the Swiss-Prot database [

24] can be used to estimate Pr(

*r*). The probability that a random protein

*P* with amino acids

*r*
_{1}
*r*
_{2} . . .

*r*
_{
m
} is observed is

where *L* represents the length of the random protein. To simplify computation, a uniform probability Pr(*L* = *m*) = 1*/MaxLength* is chosen, where *MaxLength* is the length of the longest protein in the Swiss-Prot database. Despite the difference between the uniform distribution and the distribution of protein length in the target protein database, experimental results showed the uniform distribution does not introduce large errors into the computation of spectral probabilities.

### Spectral probabilities

Let

be the set of negative/positive mass shifts of allowed PTMs. Any number in

is a valid mass shift. Let

*S* be an experimental PRM spectrum and

*P* a random protein. The residue mass difference between

*S* and

*P* is a random variable

*D* = M(

*S*)

*−* M(

*P* ). The

*spectral probability* of

*S* with respect to a threshold

*t* and one PTM is the probability that the residue mass difference

and PScore(

*S*,

*P* )

*≥ t*:

where 1 in SpecProb(

*S*,

*t*, 1) represents one PTM. From the definition of PScore(

*S*,

*P*),

Computing SpecProb(*S*, *t*, 1) accurately and efficiently is a problem that has not been solved. In the following subsections, we propose two upper bounds of SpecProb(*S*, *t*, 1). The two upper bounds can be calculated accurately and efficiently using dynamic programming algorithms. The second upper bound is better than the first one and is used for estimating SpecProb(*S*, *t*, 1). Since the second upper bound is larger than SpecProb(*S*, *t*, 1), a constant *K* is introduced for correcting errors in estimated spectral probabilities. In practice, the value of *K* can be estimated from training data sets.

### The first upper bound of spectral probabilities

Based on Equation (

2) and the union bound of probabilities,

Let

*q* denote the right hand part of the above inequality. The value of

*q* is an upper bound of SpecProb(

*S*,

*t*, 1). Next, we describe a dynamic programming algorithm for computing the value of

*q*. The algorithm is an extension of the generating function method in [

8]. In this algorithm, a spectrum

*S* with a residue mass

*N* is represented as a 0/1 vector

*S* =

*s*
_{1}
*s*
_{2} . . .

*s*
_{
N
}, where

*s*
_{
i
} = 1 if the spectrum has a prefix residue mass

*i* and 0 otherwise. For example, a spectrum with a PRM list {2, 5, 8, 10} (10 is the residue mass of the spectrum) is represented as 0100100100. We first study the case where all mass shifts are positive; negative mass shifts will be discussed at the end of this subsection. A three dimensional table

*T* (

*i*,

*j*,

*k*) is computed to acquire the upper bound, where

*i* is the number of PTMs in modified proteins. Let

*S*[1 :

*j*] be the subspectrum

*s*
_{1}
*s*
_{2} . . .

*s*
_{
j
} of

*S*. The residue mass of

*S*[1 :

*j*] is

*j*. The value

*T* (0,

*j*,

*k*) is the probability that M(

*P*) =

*j* and the mass count score CScore(

*S*[1 :

*j*],

*P*) =

*k*. Let

be set of all proteins with a residue mass

*j*. We define a function:

*f*(

*S*,

*P*,

*k*) = 1 if CScore(

*S*,

*P*) =

*k*; 0 otherwise. Then,

Suppose

*P* contains

*m* amino acids and the residue mass of

*P* is

*j*. If the last amino acid of

*P* is

*r*, then

*j − ‖r‖* is the prefix residue mass of the first

*m −* 1 amino acids of

*P*, where

*‖r‖* is the residue mass of

*r*. In the vector representation of

*S*, if

*S* contains a prefix residue mass

*j − ‖r‖*,

*s*
_{
j−‖r‖
} = 1; otherwise,

*s*
_{
j−‖r‖
} = 0. The recurrence function for computing

*T*(0,

*j*,

*k*) was given in [

8]:

Let

*D*
_{
j
} =

*j −* M(

*P* ), the random variable representing the difference between

*j* and the residue mass of random protein

*P*. The value

*T*(1,

*j*,

*k*) is the sum of probabilities

Suppose the residue mass of protein

*P* is

*j − d*, that is,

. Let

*m* be the number of amino acids in

*P* and

*Q*
_{
m
}
_{,}
_{
d
} the modified protein of

*P* whose PTM is on the last amino acid. Because the first

*m−*1 masses of

*Q*
_{
m
}
_{,}
_{
d
} are unchanged compared with

*P*,

Combined with Equation (

4),

Let

*r* be the last amino acid of

*P* and

*P*' the protein containing the first

*m −* 1 amino acids of

*P*. All the

*m −* 1 masses in

*Q*
_{
l
}
_{,}
_{
d
}(

*P*'), 1

*≤ l ≤ m −* 1, are the same to the first

*m −* 1 masses in

*Q*
_{
l
}
_{,}
_{
d
}(

*P*). While the

*m −* 1th mass

*j − ‖r‖* in

*Q*
_{
l
}
_{,}
_{
d
}(

*P*) is included in the computation of mass count scores, the mass

*j − ‖r‖* in

*Q*
_{
l
}
_{,}
_{
d
}(

*P*') is not included because it is the residue mass. Thus,

Combining the fact that

*Pr*(

*P*) =

*Pr*(

*P*')

*Pr*(

*r*) and Equations (

6) and (

8),

With Equations (

6), (

7) and (

9), the recurrence function for

*T*(1,

*j*,

*k*) is

When PTMs with negative mass shifts

*d* are allowed,

*j − d* in Equation (

10) is larger than

*j*. The value

*T*(1,

*j − d*,

*k*) has not been computed when

*T*(1,

*j*,

*k*) is computed, making Equation (

10) invalid. To avoid this problem, a short amino acid sequence

*g* is introduced to guarantee that

*j − d −* M(

*g*)

*< j*. Let

be the set of all amino acid sequences

*g* =

*r*
_{1}
*r*
_{2} . . .

*r*
_{
l
} satisfying M(

*g*)

*> −d* and M(

*r*
_{1}
*r*
_{2} . . .

*r*
_{
l−
}
_{1})

*≤ −d* (

*d* is negative). Equation (

10) is modified to

where *‖g‖* is the residue mass of *g*, and
for a sequence *g* = *r*
_{1}
*r*
_{2} . . . *r*
_{
l
}. The value of *q* is
, where *N* and *n* are the residue mass and the number of masses of *S*, respectively. The time complexity for computing *T*(0, *j*, *k*) and *T*(1, *j*, *k*) is *O*(*N · t · z*), where *z* is the sum of the sizes of
and all
,
.

### The second upper bound of spectral probabilities

The only difference between two modified proteins

*Q*
_{
i
}
_{,}
_{
d
} and

*Q*
_{
i
}
_{+1,}
_{
d
} is the

*i*th mass. If

*p*
_{
i
} in

*P* (which is not changed in

*Q*
_{
i
}
_{+1,}
_{
d
}) does not equal any mass in

*S*, then CScore(

*S*,

*Q*
_{
i
}
_{+1,}
_{
d
})

*≤* CScore(

*S*,

*Q*
_{
i
},

_{
d
}). Based on this observation, if

*p*
_{
i
} does not equal any mass in

*S*,

*Q*
_{
i
}
_{+1,}
_{
d
} is removed from

. In this way, a new set

is acquired containing

*Q*
_{1,d
}and all

*Q*
_{
i
},

_{
d
} satisfying that

*p*
_{
i−
}
_{1} equals a mass in

*S*. It follows that

. From Equation (

1) and the union bound of probabilities,

Let *q*' denote the right hand part of the above inequality. Compared with *q*, the value of *q*' is a better upper bound for SpecProb(*S*, *t*, 1). Similar to the method for computing *q*, we fill out a three dimensional array *T*(*i*, *j*, *k*) for computing *q*'. The recurrence function for filling out *T*(0, *j*, *k*) is the same to Equation (5). We change the definition of *T*(1, *j*, *k*) by replacing
with
in Equation (6). To compute *T*(1, *j*, *k*), the second and third terms of the right-hand part of Equation (11) should be changed so that only the probabilities for the modified proteins in
are summed up.

Similar to the proof of Equation (7), consider a random protein
. Let *Q*
_{
m
},_{
d
} be the modified protein of *P* whose PTM is on the last amino acid, and *r* the last amino acid of *P*. If
, then *j − d − ‖r‖* is a mass in *S* or *j − d − ‖r‖* = 0 (in the extreme case that *P* contains only one amino acid, *j − d − ‖r‖* = 0), and vice versa. Therefore, if *j − d − ‖r‖* = 0 or *s*
_{
j−d−‖r‖
} = 1, then Pr(*P* ) *· f*(*S*[1 : *j*], *Q*
_{
m
},_{
d
}, *k*) is included in the computation of *T*(1, *j*, *k*).

For a positive mass shift

*d*, we define

as the set of amino acids

*r* ∈

*R* satisfying that

*j − d − ‖r‖* = 0 or the element

*s*
_{
j−d−‖r‖
} = 1. For a negative mass shift

*d*, we introduce a set

of amino acid sequences

*g* =

*r*
_{1}
*r*
_{2} . . .

*r*
_{
l
} satisfying: (1) M(

*g*)

*> −d*, (2) M(

*r*
_{1}
*r*
_{2} . . .

*r*
_{
l−
}
_{1})

*≤ −d*, and (3)

*j − d − ‖g‖* = 0 or the element

*s*
_{
j−d−‖g‖
} = 1. Then Equation (

11) is changed to:

and
. The time complexity for computing *T*(0, *j*, *k*) and *T*(1, *j*, *k*) is similar to the method in the previous subsection.

Since the scores CScore(

*S*,

*Q*) for

are not independent,

*q*' is usually larger than the spectral probability SpecProb(

*S*,

*t*, 1). To estimate SpecProb(

*S*,

*t*, 1) more accurately,

*q*' is multiplied by a constant value

*K* for correction:

###
*P*-values and *E*-values

Let

, where

*N* is the residue mass of

*S*. From table

*T*(0,

*j*,

*k*) described in the previous subsection, we can compute the probability that the residue mass difference

*D* between

*S* and

*P* is in

:

Using Equations (

13) and (

14), the

*conditional spectral probability* of

*S* with respect to threshold

*t* and one PTM is

Intact proteins may have N or C-terminal truncations, e.g., the removal of a signal peptide. If a top-down MS/MS spectrum is matched to an intact protein without N- or C-terminal truncations, the PrSM is called a *complete* PrSM. A PrSM matched to an intact protein with an N-/C-terminal truncation is called a *suffix*/*prefix* PrSM. An *internal* PrSM corresponds to an intact protein with both N- and C-terminal truncations.

Similar to the *E*-values defined in BLAST [25], the *E*-value of a PrSM describes the number of hits one can "expect" to see by chance when searching the spectrum against a protein database of a particular size. Suppose a complete PrSM contains one mass shift (PTM) in
and its PTM mass count score is *t*. We count the number *Z* of proteins in the target database with a residue mass in
. The *E*-value of the complete PrSM is estimated as *Z ·* CSP(*S*, *t*, 1). The *p*-value of the PrSM is estimated as 1 *−* (1 *−* CSP(*S*, *t*, 1))^{
Z
}.

For prefix, suffix and internal PrSMs, we count the numbers *Z*
_{
p
}, *Z*
_{
s
}, and *Z*
_{
i
} of prefixes/ suffixes/internal sub-proteins in the target database with a residue mass in
. Because some prefixes/suffixes/internal sub-proteins are not independent, a constant factor *C*
_{
p
}/*C*
_{
s
}/*C*
_{
i
} is multiplied in the computation of *E*-values of prefix/suffix/internal PrSMs for correction. For example, if a prefix PrSM contains one mass shift (PTM) in
and its PTM mass count score is *t*, the *E*-value of the PrSM is estimated as *C*
_{
p
}
*· Z*
_{
p
}
*·* CSP(*S*, *t*, 1).

### Multiple PTMs

The dynamic programming algorithm for computing the second upper bound can be extended to estimate

*E*-values of PrSMs with multiple PTMs. When multiple PTMs are allowed, we replace

*T*(0,

*j*,

*k*) and

*T*(1,

*j*,

*k*) in Equation (

12) by

*T*(

*i*,

*j*,

*k*) and

*T*(

*i −* 1,

*j*,

*k*) to estimate spectral probabilities with respect to

*i* PTMs: