Reducing the haystack to find the needle: improved protein identification after fast elimination of non-interpretable peptide MS/MS spectra and noise reduction



Tandem mass spectrometry (MS/MS) has become a standard method for identification of proteins extracted from biological samples but the huge number and the noise contamination of MS/MS spectra obstruct swift and reliable computer-aided interpretation. Typically, a minor fraction of the spectra per sample (most often, only a few %) and about 10% of the peaks per spectrum contribute to the final result if protein identification is not prevented by the noise at all.


Two fast preprocessing screens can substantially reduce the haystack of MS/MS data. (1) Simple sequence ladder rules remove spectra that cannot be interpreted as peptide sequences. (2) Modified Fourier-transform-based criteria clear background in the remaining data. On average, only about 35% of the MS/MS spectra (each reduced in size by about one quarter) have to be handed over to the interpretation software for reliable protein identification, essentially without loss of information, with a trend to improved sequence coverage and with a proportional decrease in computer resource consumption.


The search for sequence ladders in tandem MS/MS spectra with subsequent noise suppression is a promising strategy to reduce the number of MS/MS spectra from electro-spray instruments and to enhance the reliability of protein matches. Supplementary material and the software are available from an accompanying WWW-site with the URL


Liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS) is the method of choice for the identification of proteins extracted from biological samples. The standard procedure of post-MS/MS data processing involves computer-aided interpretation of the measured spectra with MASCOT [1], SEQUEST [2] or some other software for comparing theoretical spectra calculated for database sequences with the experimental ones. But modern instruments generate extremely large sets of MS/MS spectra (in the order of 10000 per sample), which are heavily contaminated with different types of background and noise. In addition to b-, y- and their derivative ions from peptides, spectra contain repeated shifted signals due to the natural isotope distribution (isotope clusters), multiply charged replicas, peaks from unknown fragmentation pathways, sample-specific or systematic chemical contaminations and random noise from the electronic detection system.

Thus, the spectra consist mostly of background; typically, only a few percent of the spectra recorded have signals from target protein fragments and just about 10% of the peaks in such a spectrum contribute to the peptide identification. Thus, computer resources in mass spectrometry departments all over the world are mostly spent on analyzing non-relevant data if the identification of the protein with significance is possible within the background at all. This strategy clashes with limitations in compute server capacity in proteomics laboratories and seriously limits the access of less generously equipped teams to the field.

With the broad availability of accurate MS/MS instruments with resolution in the order of tenths of a Dalton, automatic background removal procedures before interpretation software application became possible [3-5]. Various spectrum pre-processing rules, deconvolution of multiply charged peaks and deisotoping procedures have been described [6-15]. It should be noted that many spectra do not contain peaks from peptide fragmentations or are extremely noisy and, therefore, cannot be reliably interpreted as peptide sequences. Thus, the exclusion of non-interpretable spectra is a valid strategy for reducing the computational load. A well-performing method should remove clearly more than half, or even three quarters, of the experimental MS/MS spectra while keeping essentially all interpretable ones. At the same time, the computation time for this task should be negligible or, at least, small compared with the processing time of an interpretation program such as MASCOT that is saved by unselecting a large subset of spectra.

Published approaches to this problem differ in the criterion for spectrum selection, either with empirically defined score functions or with a classifier generated by automated learning approaches [16-23]. Although many of these methods apply quite sophisticated criteria, they either are not efficient filters or suffer from a substantial fraction of unselected but nevertheless interpretable MS/MS spectra (e.g., loss of ~10% of the interpretable spectra for removing ~75% of the total number of spectra in Figures 2 and 3 of Bern et al. [18]). Thus, substantial computational load reduction is traded for the risk of not finding the desired peptide hit. Consequently, none of the published techniques has routinely entered the laboratories so far.

In the attempt to develop an alternative methodical approach, we propose to return to ideas from the beginning of mass spectrometry of proteins. Originally, interpretation of an MS/MS spectrum meant experts trying to manually find sequence ladders (i.e., sets of peaks with amino acid mass spacing between them) among the high-intensity peaks. The concept of searching mainly among the higher intensity peaks is still reminiscent in the formulas for evaluating the significance of a peptide hit as used in MASCOT [1]. Indeed, a peptide the theoretical fragmentation spectrum of which matches exclusively low intensity peaks cannot serve as convincing explanation of the experimental data.

In this work, we explore the idea that at least some short oligopeptide segment of a significant peptide hit should be fully matched by the higher intensity peaks in the spectrum. In an efficient implementation, the computational costs are low if one just checks whether small peptide ladders of predefined length occur at all among the top fraction of most intense peaks in an MS/MS spectrum. The identity of the oligopeptide is not important in this context; the question is rather whether such an amino acid chain theoretically exists at all. It is reasonable to suggest that the spectrum is probably not interpretable into a peptide sequence with statistical significance if not even a short oligopeptide sequence is matched by this criterion.

After this unselecting procedure, the remaining spectra still contain considerable background in the typical case. In a previous publication [24], we developed an approach based on techniques from electrical signal processing. Periodical band-reject and high-frequency filters as well as correlation analyses with etalons of multiply charged clusters can successfully be used for background suppression. In this work, we describe a workflow involving sequence ladder and improved signal processing criteria on a large MS/MS dataset exemplified in the MS Cleaner version 2.0 that efficiently reduces the number and the size of spectra and, subsequently, dramatically shrinks the computing time used by the interpretation software. To emphasize, the approach described in this work is thought to increase the efficiency of protein identification. It is not considered to process MS/MS data that is intended to be screened for protein posttranslational modifications.


Mass spectrometry

Commercially acquired proteins (α-amylase, amyloglucosidase, apo-transferrin, β-galactosidase, carbonic anhydrase, catalase, phosphorylase B, glutamic dehydrogenase, glutathione transferase, immunoglobulin γ, lactic dehydrogenase, lactoperoxidase, myoglobin) were used, each in two independent preparations (each with a concentration of 100 fmol). For chromatography, an UltiMate Plus Nano-LC system (LC Packings - A Dionex Co.) was used. Chromatographic mobile phases were: loading mobile phase 0.1% TFA in water, separation mobile phase A 5% acetonitrile in 0.1% aqueous formic acid and mobile phase B 80% acetonitrile, 20% water with 0.08% formic acid. The sample was loaded for 10 min onto a reversed phase trap column (PepMap C18, 300 μm ID × 5 mm length, 5 μm particle size, 100 Å pore size, LC Packings - A Dionex Co., not online with the separation column) at a flow rate of 20 μl/min and washed free of ion pairing agents and other impurities.

The gradient for separation of analytes starts at 10 min when the trap column is switched online with the separation column (PepMap C18, 75 μm ID × 15 cm length, 3 μm particle size, 100 Å pore size) at 0.275 μl/min. The gradient used starts at 100% mobile phase A and changes to 50% mobile phase B from 10 minutes (trap column and separation column online) to 40 minutes. An additional wash step of 90% mobile phase B is incorporated in order to clean the separation column and elute hydrophobic analytes. After the separation, the trap column is switched offline and equilibrated with loading mobile phase. The analytical nano column is equilibrated with separation mobile phase A. The mass spectrometric data are only recorded for the time both columns are online.

The mass spectra were recorded with a Thermo Finnigan LTQ (positive nano-ESI mode, ionizing spray voltage: 1.5 kV, enhanced mass-spec full-scan range: 220 - 2000 amu). The much smaller datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) recorded with a 3D IT mass spectrometer (model DecaXP Thermo Finnigan) were reused from our previous work [24].

File processing and MS/MS data analysis

The MS/MS output was converted into mgf-files (MASCOT generic format). Each dataset was then separately processed using the MS Cleaner program (with default internal parameters), generating two new mgf-files with cleaned and bad (non-interpretable) spectra, respectively. The MASCOT search parameters were the same in all runs (enzyme: trypsin; fixed modifications: carbamidomethyl (at cysteines) for BSA, ADH and TRF, carboxymethyl (at cysteines) for other proteins; variable modifications: oxidation (at methionines); peptide charges: 1+, 2+ and 3+; mass values: monoisotopic; protein mass: unrestricted; peptide mass tolerance: ± 2 Da; fragment mass tolerance: ± 0.8 Da; max. missed cleavages: 1). The MASCOT search results output html-file was formatted with standard scoring, a significance threshold of p < 0.05, and an ion score cut-off for each peptide of 30. The non-redundant protein database (NCBI) was used (both for the local PC MASCOT installation and for the MASCOT Linux cluster).
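As a rough illustration of this file handling, the following Python sketch reads spectra from MASCOT generic format text and splits them into two lists, mirroring the cleaned and bad output files. It is not the MS Cleaner implementation. The names `read_mgf`, `split_spectra` and the `is_interpretable` callback are hypothetical, and only the common block layout (`BEGIN IONS`, `KEY=value` header lines, `m/z intensity` peak lines, `END IONS`) is assumed.

```python
def read_mgf(lines):
    """Yield (headers, peaks) for every BEGIN IONS ... END IONS block.

    headers is a dict of KEY=value lines, peaks a list of (mz, intensity).
    """
    headers, peaks, inside = {}, [], False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, inside = {}, [], True
        elif line == "END IONS":
            if inside:
                yield headers, peaks
            inside = False
        elif inside and line:
            if line[0].isalpha() and "=" in line:
                # Header line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                headers[key] = value
            else:
                # Peak line: m/z, intensity and optionally further columns.
                fields = line.split()
                if len(fields) >= 2:
                    peaks.append((float(fields[0]), float(fields[1])))


def split_spectra(spectra, is_interpretable):
    """Split spectra into (good, bad) lists using a caller-supplied test.

    is_interpretable is a hypothetical callback that receives the peak list.
    """
    good, bad = [], []
    for headers, peaks in spectra:
        target = good if is_interpretable(peaks) else bad
        target.append((headers, peaks))
    return good, bad
```

In such a sketch, the callback would be the place to plug in an interpretability test like the sequence ladder criterion.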

In this work, we compare the MASCOT interpretation results of non-pre-processed tandem MS datasets with those obtained after a two-step preprocessing. First, each spectrum (.dta-file) is analyzed with the sequence ladder algorithm. Only those spectra that pass this test are then processed with the background removal routines described in our previous publication [24].

The sequence ladder algorithm

For this algorithm, two parameters are critical: n (in amino acid residues), the minimal length of the sequence ladder, and s (in per cent), the fraction of peaks from the spectrum that is considered of high intensity. The number n can theoretically be just one (i.e., we would require just two high intensity peaks that are spaced by the mass difference corresponding to the mass of one of the amino acids); yet, larger values of n (for example, between two and six residues) represent stricter requirements for the sequence ladder. The other parameter s restricts the search space. For this purpose, the peaks in the spectrum considered (i.e., in one .dta-file) are sorted by intensity into a list in descending order. Only the first part of this list (the fraction s of the total set) is used for searching sequence ladders. The condition s = 100% implies that all peaks are included; yet, considerably smaller values of s are desirable since they help unselect more non-interpretable spectra. Once the set of high-intensity peaks is defined, their pair-wise mass differences are compared in a systematic enumeration with the masses of amino acid residues (to select pairs of peaks separated by the mass of any of the amino acids within a user-defined accuracy) and it is tested whether a subset of peaks forms a sequence ladder of the required minimal length. If at least one such ladder is found, the search is stopped and the procedure is restarted with the next tandem MS spectrum in the dataset.
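The test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MS Cleaner code: fragments are treated as singly charged, unmodified monoisotopic residue masses are used (modified cysteines are ignored), the default tolerance `tol` of 0.5 Da is a hypothetical stand-in for the user-defined accuracy, and the function name `has_sequence_ladder` is invented for this example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def has_sequence_ladder(peaks, n=4, s=0.20, tol=0.5):
    """Return True if the top fraction s of peaks contains a ladder of n residues.

    peaks is a list of (mz, intensity) tuples. A ladder of n residues is a
    chain of n + 1 peaks in which each consecutive m/z difference equals
    some residue mass within tol.
    """
    if not peaks:
        return False
    # Keep only the most intense fraction s of the peaks (at least one peak).
    keep = max(1, int(round(len(peaks) * s)))
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:keep]
    mzs = sorted(mz for mz, _ in top)
    masses = sorted(set(RESIDUE_MASSES.values()))

    # steps[j] = number of residue steps in the longest ladder ending at peak j.
    # Peaks are sorted by m/z, so steps[i] is final before any j > i is visited.
    steps = [0] * len(mzs)
    for j in range(len(mzs)):
        for i in range(j):
            diff = mzs[j] - mzs[i]
            if any(abs(diff - m) <= tol for m in masses):
                steps[j] = max(steps[j], steps[i] + 1)
        if steps[j] >= n:
            return True  # stop at the first ladder of sufficient length
    return False
```

Because the function returns as soon as one ladder of the required length is found, spectra that pass are cheap to accept; only spectra without any ladder require the full pairwise enumeration over the retained peaks.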

Modifications of the noise detection algorithm

If a spectrum has passed the sequence ladder test, it is handed over to a series of routines for noise and background detection. The procedures for removing multiply charged peak clusters with the etalon method and for the suppression of high-frequency noise with a low-pass filter after Fourier transformation have been described in a previous publication [24] in detail and have been applied without changes here.

The algorithm for the removal of latent periodic background (including deisotoping) received another option with respect to the determination of the base frequency of the noise. We observed that the determination of the base frequency f_B in the first power spectrum (see sections 3.3 and 3.5 in ref. [24]) is, in rare cases, not as unambiguous as in Figure 2A of ref. [24] since several almost equally intense peaks may appear in the second-level Fourier transform. Wrong detection of the base frequency f_B leads to a wrong multi-band rejection filter, and a few interpretable spectra can be lost after applying this technique. This ambiguity can be avoided by not simply choosing the frequency of the most intense peak in the second-level Fourier transform. Rather, we propose to iterate through all possible base frequencies detected in this spectrum. For each of these frequencies, the theoretical maxima and minima expected in the first-level Fourier transform are calculated. The best match between the theoretical and experimental maxima and minima (see Figure 3 in ref. [24]) confirms the right base frequency. We call this method "soft recognition" of latent periodic noise. It should be applied if minor improvements in sequence coverage (in rare cases, a single additional peptide) are more important than data size reduction; yet, it increases the computation time by about 10% compared with the previous method [24].
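The selection step of this "soft recognition" can be illustrated with a small Python sketch. It only shows the idea of scoring every candidate base frequency against the observed extrema instead of taking the most intense peak. The model of the expected extrema is an assumption made for this example (maxima at integer multiples of the candidate, minima midway between them) and does not reproduce the formulas of ref. [24]; all function names are hypothetical.

```python
def predicted_extrema(base, upper):
    """Assumed model: maxima at k * base, minima at (k - 0.5) * base, up to upper."""
    maxima, minima = [], []
    k = 1
    while k * base <= upper:
        maxima.append(k * base)
        minima.append((k - 0.5) * base)
        k += 1
    return maxima, minima


def count_matches(predicted, observed, tol):
    """Count predicted positions that have an observed position within tol."""
    return sum(1 for p in predicted if any(abs(p - o) <= tol for o in observed))


def soft_base_frequency(candidates, observed_maxima, observed_minima, tol):
    """Return the candidate whose predicted extrema best match the observed ones."""
    upper = max(list(observed_maxima) + list(observed_minima))
    best, best_score = None, -1.0
    for base in candidates:
        maxima, minima = predicted_extrema(base, upper)
        total = len(maxima) + len(minima)
        if total == 0:
            continue  # candidate predicts nothing inside the observed range
        hits = (count_matches(maxima, observed_maxima, tol)
                + count_matches(minima, observed_minima, tol))
        score = hits / total  # fraction of predicted extrema that are confirmed
        if score > best_score:
            best, best_score = base, score
    return best
```

Scoring by the fraction of confirmed predictions penalizes both harmonics and sub-harmonics of the true base frequency, since each of them predicts extrema that are not observed.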

Standalone implementation and cluster version

We created two implementations for MS Cleaner 2.0. A single-machine Windows version was used for most of the computations in this article and it is available for free download at the associated WWW site. A Unix-Port of the MS Cleaner 2.0 software is deployed in a clustered environment in order to guarantee scalability. The spectrum file is partitioned into workpackages, which are then handed over to a batch queuing system for scheduling on available nodes. Each node processes the spectra in its workpackage and transfers the results back to the controlling application where they are post-processed into the final good/bad spectra output. This version is the engine behind the MS Cleaner 2.0 WWW server.
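The partition-and-merge pattern of the cluster version can be sketched as follows. This is only an illustration in Python: the loop over packages stands in for the batch queuing system, and the names `partition`, `process_package`, `clean_in_packages` and the `is_interpretable` callback are hypothetical rather than part of MS Cleaner 2.0.

```python
def partition(spectra, package_size):
    """Split the list of spectra into consecutive work packages."""
    if package_size < 1:
        raise ValueError("package_size must be at least 1")
    return [spectra[i:i + package_size]
            for i in range(0, len(spectra), package_size)]


def process_package(package, is_interpretable):
    """Work done on one node: sort the spectra of a package into good and bad."""
    good = [spectrum for spectrum in package if is_interpretable(spectrum)]
    bad = [spectrum for spectrum in package if not is_interpretable(spectrum)]
    return good, bad


def clean_in_packages(spectra, package_size, is_interpretable):
    """Controlling application: distribute packages and merge the results."""
    good, bad = [], []
    for package in partition(spectra, package_size):
        # In a cluster, each call would be a job scheduled on an available node.
        package_good, package_bad = process_package(package, is_interpretable)
        good.extend(package_good)
        bad.extend(package_bad)
    return good, bad
```

Since each spectrum is judged independently of all others, the packages can be processed in any order and merged afterwards, which is what makes this step straightforward to scale.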

WWW Supplement

At the WWW-site, supplementary resources are available: all experimental mass-spectrometry data used in this work, the processed spectra, the user manual, default parameter datasets and a free downloadable Windows version of the program MSCleaner 2.0 as well as free access to a MSCleaner 2.0 WWW server accessing a local Linux cluster. Other implementations can be obtained on request.

Results and discussion

For the initial determination of optimal parameter ranges (sequence ladder length n and peak intensity threshold s), we used the datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) from our previous work [24] since they are quite small (less than 3000 .dta-files per set). We checked the influence of the preprocessing procedures on the spectrum interpretation with the MASCOT tool. A systematic analysis was performed; sequence ladder length was tested with values n between 2 and 6 and the high-intensity threshold s was varied from 5% to 35% (the sequence ladder was searched for only among the 5%, 10%, 15%, ..., or 35% of most intense peaks). The goal is to have as many unselected "bad" spectra as possible (the savings in computing time are about proportional to the fraction of spectra that is not handed over to the spectrum interpretation program) without losses of (i) MASCOT score, (ii) spectra giving peptide matches and (iii) sequence coverage.
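The parameter scan and the selection goal can be written down compactly. The Python sketch below enumerates the grid described above and picks the setting that removes the most spectra without losing MASCOT score, matching spectra or sequence coverage relative to the unprocessed run. The dictionary keys (`removed`, `score`, `matches`, `coverage`) and the function names are hypothetical, and the strict "no loss" rule is a simplification of the judgment applied in the text.

```python
from itertools import product

LADDER_LENGTHS = range(2, 7)       # n = 2 ... 6 residues
THRESHOLDS = range(5, 40, 5)       # s = 5%, 10%, ..., 35%


def parameter_grid():
    """All (n, s) combinations of the systematic scan."""
    return list(product(LADDER_LENGTHS, THRESHOLDS))


def pick_setting(baseline, results):
    """Choose the (n, s) pair removing the most spectra without any loss.

    baseline: dict with 'score', 'matches' and 'coverage' of the unprocessed run.
    results:  dict mapping (n, s) to a dict with the same keys plus 'removed',
              the fraction of spectra declared non-interpretable.
    Returns None if every setting loses score, matches or coverage.
    """
    acceptable = {
        setting: r for setting, r in results.items()
        if r["score"] >= baseline["score"]
        and r["matches"] >= baseline["matches"]
        and r["coverage"] >= baseline["coverage"]
    }
    if not acceptable:
        return None
    return max(acceptable, key=lambda setting: acceptable[setting]["removed"])
```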

Due to the space limitation, only the results of a parameter subset are presented (Table 1). As expected, the number of detected bad spectra increases with growing sequence ladder length n and decreasing intensity threshold s. We observe that the MASCOT score of the non-preprocessed data (586 for BSA, 224 for ADH and 588 for TRF; see rows with n = 0 and s = 0%) is considerably smaller than that of the cleaned datasets (often, by a factor of 2-5) regardless of the severity of data pre-processing. Thus, the reliability of the top protein hit in the database searches greatly increases by the background reduction, both by discarding bad spectra and by removing noise from spectra that can be interpreted in peptides. This alone is an interesting result.

Table 1 Influence of background removal on the recovery of BSA, ADH and TRF in MS/MS spectra of 100 fmol test samples. The original number of MS/MS spectra for the BSA (bovine serum albumin), ADH (yeast alcohol dehydrogenase) and TRF (human transferrin) datasets (recorded on a DecaXP machine) are 2679, 2325 and 2608 respectively. The intensity threshold s (column 3) describes the search of the sequence ladder (length n in column 2) within the 15%, 20%, 25% or 30% top peaks (100% - all peaks are considered). The following three columns show the MS Cleaner output - number of spectra with background removal, number of unselected spectra and the MS Cleaner CPU time on a single-processor Windows XP computer (Pentium IV 2.4 GHz; to get exact measurements of computation time, we did not use the cluster version). The remaining four columns present the MASCOT output - the CPU time on the same machine, the protein score, the number of spectra matching peptides in a MASCOT search and the final sequence coverage. For each dataset, the first line shows the results for the case when MS Cleaner is not used for pre-processing and the MS/MS data is immediately interpreted by MASCOT.

The sequence coverage is more sensitive to the pre-processing parameters. For a sequence ladder length of n = 5 residues, we see a trend that sequence coverage is slightly decreased with respect to that of unprocessed data (41-54% instead of 55% for BSA, 21-31% instead of 39% for ADH, 45-48% instead of 47% for TRF). Sequence coverage is about the same as or even slightly higher than for non-preprocessed data for sequence ladder lengths n = 3 and n = 4 and intensity thresholds s at and above 20%. With regard to the number of spectra that lead to a significant peptide match in the MASCOT search, the settings n = 3, s = 20%; n = 3, s = 25%; n = 4, s = 20% and n = 4, s = 25% come close to reproducing the result achieved with the unprocessed data for the BSA and TRF cases. Surprisingly, the number of peptide matches is slightly higher for s = 100% (all peaks are included in the sequence ladder search) than for the datasets without preprocessing. Thus, the number of falsely rejected spectra by the sequence-ladder algorithm is essentially zero in these two cases. For ADH, the number of spectra matching peptides is always somewhat lower if the tandem MS/MS data is pre-processed, although MASCOT score and sequence coverage do not suffer from choices of n = 3 or n = 4 and the higher values of s.

To detect a considerable fraction of the bad spectra and to reduce the time for interpretation by MASCOT, these results support the selection of a sequence ladder length of n = 4 and an intensity threshold of s = 20%. If the sequence coverage is more important than computational time savings, softer parameters can be chosen, for example an intensity threshold of s = 25%. With these parameters, it is possible to eliminate more than 80% of all spectra in the datasets BSA, ADH and TRF by declaring them non-interpretable in oligopeptides (see Table 1). Minor sequence coverage loss, if observed at all, does not affect the interpretation result. Yet, the total computing time required for interpretation shrinks to only about 20% of the original value. The computing time consumption for MS Cleaner alone in such a setting is ~2% of MASCOT time for non-preprocessed data (see Table 1); i.e., it is essentially negligible.
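The two figures quoted above can be combined into a back-of-the-envelope estimate of the overall cost. The Python sketch below assumes that MASCOT time is strictly proportional to the number of spectra handed over and ignores any further gain from the smaller size of each cleaned spectrum; the function name `relative_total_time` is hypothetical.

```python
def relative_total_time(removed_fraction, cleaner_fraction=0.02):
    """Estimated total time relative to MASCOT on the unprocessed dataset.

    removed_fraction: fraction of spectra declared non-interpretable.
    cleaner_fraction: MS Cleaner time as a fraction of the unprocessed
                      MASCOT time (about 2% according to the text).
    """
    if not 0.0 <= removed_fraction <= 1.0:
        raise ValueError("removed_fraction must lie between 0 and 1")
    mascot_fraction = 1.0 - removed_fraction  # assumed proportional to spectra kept
    return mascot_fraction + cleaner_fraction
```

With 80% of the spectra removed, the estimate is 0.20 + 0.02 = 0.22, i.e. preprocessing plus interpretation together take roughly a fifth of the original MASCOT time.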

For further analysis of the algorithm's performance, large MS/MS datasets are necessary that are recorded from samples with known protein composition. For this purpose, we used solutions of commercially available proteins at 100 fmol concentration. The behavior of the MS Cleaner algorithms was tested over this large dataset of about 270000 spectra from 26 samples of 13 proteins (Table 2) generated by an LTQ device. We used sequence ladder length n = 4 with intensity thresholds s = 20% and s = 25% and contrasted the results both (i) with the MASCOT-based interpretation of non-preprocessed data and (ii) with sequence ladder length 4 and the inclusion of all peaks (s = 100% threshold). We find that, as a rule, preprocessing reproduces or slightly improves the sequence coverage relative to the non-preprocessed data (100-110% for threshold s = 100% (columns A4 and A7), 100-108% for thresholds s = 20% and s = 25% (columns A4, A11 and A15)). Thus, the number of falsely rejected spectra by the sequence-ladder algorithm is essentially zero in these examples. This clear trend suggests that the preprocessing algorithm proposed here performs even better if it is supplied with more accurate data from the LTQ instrument as compared with those from the DecaXP. There is a trend for increased MASCOT scores (103-140% for threshold s = 100% (columns A3 and A6), 98-140% for s = 20% (A3 and A10) and 103-140% for s = 25% (A3 and A14)) with an average of 110% regardless of threshold. The reduction of the dataset by unselecting spectra is significant (on average, 11% for threshold s = 100% (column A5), 63% for threshold s = 20% (column A8) and 53% for threshold s = 25% (column A13)). This means that the interpretation time with MASCOT is reduced in a similar proportion.

Table 2 Performance of the MSCleaner version 2.0 over a large test set.

To summarize, the results support that testing spectra for interpretability in oligopeptides is a useful criterion for dataset reduction in protein mass spectrometry if a sequence ladder of a tetrapeptide segment is searched for among the 20% (or 25%) most intense peaks. This preprocessing is accompanied by an increase in MASCOT score and more significant top protein hits and it does not significantly affect sequence coverage. Running MS Cleaner 2.0 as a standard preprocessing step in peptide tandem MS data analysis for protein identification is recommended.

The idea of using short series of sequence ions (peptide sequence tags) as a specific identifier that speeds up searches for matches between spectra and sequences in databases (either by searching the database with the tag or by creating sequence tag database filters in order to reduce the size of a database via a preprocessing step) is extensively explored in the literature [25-27]. It is interesting to see that this simple idea applied to the problem of recognizing spectra non-interpretable in oligopeptides greatly reduces the complexity of analyzing protein mass spectrometry data.



collision-induced dissociation
electrospray ionization
liquid chromatography coupled with tandem mass spectrometry
mass spectrometry
tandem mass spectrometry
power spectrum


  1. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999, 20: 3551-3567. 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.

  2. Yates JR, Eng J, McCormack AL, Schieltz DM: Method to Correlate Tandem Mass Spectra of Modified Peptides to Amino Acid Sequences in the Protein Database. Anal Chem. 1995, 67: 1426-1436. 10.1021/ac00104a020.

  3. Webb-Robertson BJ, Cannon WR, Oehmen CS, Shah AR, Gurumoorthi V, Lipton MS, Waters KM: A support vector machine model for the prediction of proteotypic peptides for accurate mass and time proteomics. Bioinformatics. 2008, 24: 1503-1509. 10.1093/bioinformatics/btn218.

  4. Nesvizhskii AI, Vitek O, Aebersold R: Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat Methods. 2007, 4: 787-797. 10.1038/nmeth1088.

  5. Keller BO, Sui J, Young AB, Whittal RM: Interferences and contaminants encountered in modern mass spectrometry. Anal Chim Acta. 2008, 627: 71-81. 10.1016/j.aca.2008.04.043.

  6. Eng JK, McCormack AL, Yates JR: An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J Am Soc Mass Spectrom. 1994, 5: 976-989. 10.1016/1044-0305(94)80016-2.

  7. Ferrige AG, Seddon MJ: Maximum Entropy Deconvolution in Electrospray Mass Spectrometry. Rapid Commun Mass Spectrom. 1991, 5: 374-379. 10.1002/rcm.1290050810.

  8. Gentzel M, Kocher T, Ponnusamy S, Wilm M: Preprocessing of tandem mass spectrometric data to support automatic protein identification. Proteomics. 2003, 3: 1597-1610. 10.1002/pmic.200300486.

  9. MSMS Peak Identification and its Applications. (communication 46).

  10. Mann M, Meng CK, Fenn JB: Interpreting mass spectra of multiply charged ions. Anal Chem. 1989, 61: 1702-1708. 10.1021/ac00190a023.

  11. Reinhold BB, Reinhold VN: Electrospray ionization mass spectrometry: Deconvolution by an entropy-based algorithm. J Am Soc Mass Spectrom. 1992, 3: 207-215. 10.1016/1044-0305(92)87004-I.

  12. Sadygov RG, Eng J, Durr E, Saraf A, McDonald H, MacCoss MJ, Yates JR: Code developments to improve the efficiency of automated MS/MS spectra interpretation. J Proteome Res. 2002, 1: 211-215. 10.1021/pr015514r.

  13. Wehofsky M, Hoffmann R: Automated deconvolution and deisotoping of electrospray mass spectra. J Mass Spectrom. 2002, 37: 223-229. 10.1002/jms.278.

  14. Zhang N, Aebersold R, Schwikowski B: ProbID: A probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics. 2002, 2: 1406-1412. 10.1002/1615-9861(200210)2:10<1406::AID-PROT1406>3.0.CO;2-9.

  15. Zhang Z, Marshall A: A Universal Algorithm for Fast and Automated Charge State Deconvolution of Electrospray Mass-to-Charge Ratio Spectra. J Am Soc Mass Spectrom. 1998, 9: 225-233. 10.1016/S1044-0305(97)00284-5.

  16. Anderson DC, Li W, Payan DG, Noble WS: A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores. J Proteome Res. 2003, 2: 137-146. 10.1021/pr0255654.

  17. Baczek T, Bucinski A, Ivanov AR, Kaliszan R: Artificial neural network analysis for evaluation of peptide MS/MS spectra in proteomics. Anal Chem. 2004, 76: 1726-1732. 10.1021/ac030297u.

  18. Bern M, Goldberg D, McDonald WH, Yates JR: Automatic quality assessment of Peptide tandem mass spectra. Bioinformatics. 2004, 20 (Suppl 1): I49-I54. 10.1093/bioinformatics/bth947.

  19. Purvine S, Kolker N, Kolker E: Spectral quality assessment for high-throughput tandem mass spectrometry proteomics. OMICS. 2004, 8: 255-265. 10.1089/omi.2004.8.255.

  20. Salmi J, Moulder R, Filen JJ, Nevalainen OS, Nyman TA, Lahesmaa R, Aittokallio T: Quality classification of tandem mass spectrometry data. Bioinformatics. 2006, 22: 400-406. 10.1093/bioinformatics/bti829.

  21. Savitski MM, Nielsen ML, Zubarev RA: New data base-independent, sequence tag-based scoring of peptide MS/MS data validates Mowse scores, recovers below threshold data, singles out modified peptides, and assesses the quality of MS/MS techniques. Mol Cell Proteomics. 2005, 4: 1180-1188. 10.1074/mcp.T500009-MCP200.

  22. Xu M, Geer LY, Bryant SH, Roth JS, Kowalak JA, Maynard DM, Markey SP: Assessing data quality of Peptide mass spectra obtained by quadrupole ion trap mass spectrometry. J Proteome Res. 2005, 4: 300-305. 10.1021/pr049844y.

  23. Ning K, Leong HW: Algorithm for peptide sequencing by tandem mass spectrometry based on better preprocessing and anti-symmetric computational model. Comput Syst Bioinformatics Conf. 2007, 6: 19-30.

  24. Mujezinovic N, Raidl G, Hutchins JR, Peters JM, Mechtler K, Eisenhaber F: Cleaning of raw peptide MS/MS spectra: Improved protein identification following deconvolution of multiply charged peaks, isotope clusters, and removal of background noise. Proteomics. 2006, 6: 5117-5131. 10.1002/pmic.200500928.

  25. Bandeira N, Tsur D, Frank A, Pevzner PA: Protein identification by spectral networks analysis. Proc Natl Acad Sci USA. 2007, 104: 6140-6145. 10.1073/pnas.0701130104.

  26. Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem. 1994, 66: 4390-4399. 10.1021/ac00096a002.

  27. Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, Bafna V: InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal Chem. 2005, 77: 4626-4639. 10.1021/ac050102d.



The authors are grateful to Werner Kubina for advice in software design and for implementing MASCOT, the Mass Spectrometry group of the Institute of Molecular Pathology (Vienna) for support in carrying out mass spectrometry measurements and to Günther Raidl and Kurt Varmuza (Technical University Vienna) for advice. This work has been supported by Boehringer Ingelheim where most team members worked together until Summer 2007, Gen-AU BIN II (to F.E.) and Gen-AU APP II (to K.M.) until July 2007.

This article has been published as part of BMC Genomics Volume 11 Supplement 1, 2010: International Workshop on Computational Systems Biology Approaches to Analysis of Genome Complexity and Regulatory Gene Networks. The full contents of the supplement are available online at

Author information

Corresponding author

Correspondence to Frank Eisenhaber.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NM programmed the single-processor prototype of MSCleaner and carried out all computational experiments. NM and FE together produced the WWW site associated with this publication. GS and MW were instrumental for creating the multi-processor version and the WWW server. KM provided the wet lab part of this work and participated in the discussion of the results. FE proposed the scientific task, guided the work and wrote the article.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Mujezinovic, N., Schneider, G., Wildpaner, M. et al. Reducing the haystack to find the needle: improved protein identification after fast elimination of non-interpretable peptide MS/MS spectra and noise reduction. BMC Genomics 11 (Suppl 1), S13 (2010).


  • Sequence Coverage
  • Mascot Score
  • Trap Column
  • Sequence Ladder
  • Interpretation Software