A significant motivating factor behind the development of the PEDRo repository has been to enable informed discussion, assisted by concrete examples, of the level of detail and forms of model that are most appropriate for a proteome data repository. As the PEDRo model is being used as the starting point for the HUPO-PSI activity on models for proteome data, early validation of this model is important. The following observations have been made about the PEDRo model during the data capture process:
1. Sample description is neither very precise nor systematic. The effective description of samples is an open issue that spans different kinds of functional genomic data. For example, work is underway on the development of an ontology for characterising microarray experiments, focusing, in particular, on samples [20]. However, as the variety of organisms, genetic manipulations, extraction techniques, environmental conditions and experimental manipulations that may characterise a sample is extremely large, a mature solution to this problem may be some way off.
2. There is only limited support for relative protein abundance data (e.g. DIGE and stable isotope labelling strategies). Thus, for example, there is no place in the model to describe an expression ratio for a protein species derived from quantitative experimental strategies, only the ability to capture the 'raw' numbers. In fact, the PEDRo model was not designed to capture expression ratios, partly because such numbers are easily derived from the captured primary data, and partly because the particular method of their derivation may be contentious. It is hoped that the HUPO-PSI model will provide generic constructs for representing relationships between certain kinds of measurement (e.g. relative protein expression readings), to which can be attached the specific detail for individual techniques. However, it also seems important to avoid the pitfalls associated with overly permissive models, as these provide a less stable foundation for the developers of analytical tools than their more proscriptive counterparts.
3. The gel model is not particularly detailed. Thus, for example, there is no detailed description of the image analysis software used, the descriptions of individual spots are fairly minimal, and no details are captured on spot excision. An earlier critique of the PEDRo model for gels, and some possible extensions, is provided by [21]. It seems that, in order to provide insights for the developers of gel-based experiments, it would be appropriate for the model to be revised to provide additional details on gels.
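To illustrate the point made in observation 2, that expression ratios are easily derived from captured raw numbers but that the method of derivation may be contentious, the following minimal sketch computes a ratio for the same readings in two ways. It is not part of PEDRo: the `RawReading` record, its field names, both functions and the example values are hypothetical, and the sketch assumes that all raw abundances are positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawReading:
    """Hypothetical record of 'raw' abundance numbers for one protein species."""
    protein: str
    control: float   # raw abundance in the control sample (assumed > 0)
    treated: float   # raw abundance in the treated sample (assumed > 0)


def simple_ratios(readings):
    """Ratio of the raw numbers, with no normalisation."""
    return {r.protein: r.treated / r.control for r in readings}


def normalised_ratios(readings):
    """Ratio after scaling each reading by the total abundance of its sample."""
    total_control = sum(r.control for r in readings)
    total_treated = sum(r.treated for r in readings)
    return {
        r.protein: (r.treated / total_treated) / (r.control / total_control)
        for r in readings
    }


# Illustrative values only; these are not data from PEDRo.
readings = [
    RawReading("P1", control=100.0, treated=200.0),
    RawReading("P2", control=300.0, treated=300.0),
]

simple = simple_ratios(readings)
normalised = normalised_ratios(readings)
for r in readings:
    print(r.protein, round(simple[r.protein], 3), round(normalised[r.protein], 3))
```

For these illustrative values the two derivations disagree for both proteins, which is the kind of discrepancy that argues for storing the primary data and leaving the choice of derivation to the analyst.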
Overall, the appropriate level of detail for a proteomics repository is somewhat subjective, but can usefully be based on guiding principles; agreement as to the principles should then avoid scope-based discussions at a very fine-grained level. The current PEDRo model essentially supports the principle that enough detail should be captured about an experiment to:
i. Allow results of different experiments to be analysed/compared.
ii. Allow suitability of experiment design and implementation decisions to be assessed.
iii. Allow protein identifications to be re-run in the future with new databases or software.
There is also an additional negative principle, to the effect that the model itself should not be designed to include dependencies on characteristics relating to the configuration or properties of an individual piece of equipment. Accordingly, we have attempted to allow experimental methods and results to be described in significant detail, but without including parameters and properties that are likely to be superseded rapidly when new models of equipment are introduced, and without including parameters that can only be understood with reference to the documentation of a particular product.
The data stored in PEDRo are more comprehensive for each experiment than is the case for most existing proteome databases. For example, in the longest established experimental proteomics resource, SWISS-2DPAGE [7], the emphasis is on annotated gels, and there is much less information collected on how the annotations were arrived at. Furthermore, there is an architectural distinction: SWISS-2DPAGE follows a more federated approach, with individual sites continuing to hold their own data. These other proteome data sources can be accessed through WORLD-2DPAGE, a web resource listing sites making available experimental proteomics data [22]. An example of a database that participates in WORLD-2DPAGE is the University of Alabama (UAB) Proteomics Database [23], which provides search and browsing facilities over data from its host university. As in SWISS-2DPAGE, the emphasis is on annotated gels, and relatively few details are captured on sample processing, mass spectrometry or in silico analysis. Such design decisions are appropriate for certain categories of user of a proteomics database, but not for others. The UAB database has been designed to provide access to processed experimental results for biomedical researchers, but does not provide enough information to allow detailed comparisons of the ways in which the results were obtained.
The ProteomeWeb [24] provides a wider range of tools than PEDRo (for example, for computing theoretical maps), and supports browsing of annotated gels from several bacteria and archaea. Once again, though, the data provided for each experiment are less comprehensive than in PEDRo. ProDB [25] has a certain amount in common with UAB, in that it too provides search and browsing over a database of locally produced data. In addition, ProDB features an architecture that supports the plugging-in of data-loading and analysis tools. However, the level of detail supported by the model is not obvious from the paper, which gives only part of the model, and the database was not publicly accessible at the time of writing. In consisting of a collection of tools associated with a database, ProDB thus also has a certain amount in common with SBEAMS [26] which includes a relational database of proteomic data. The SBEAMS model emphasises the description and analysis of mass spectrometry data, but seems not to support open access to experimental data at the time of writing.
In terms of quantities of data, there are fewer data sets in PEDRo than in SWISS-2DPAGE (Release 16 of which contains 34 reference maps), reflecting the fact that PEDRo is a newly created resource, but somewhat more than in the UAB Proteomics Database. The Open Proteomics Database (OPD) supports the browsing and downloading of comparable amounts of data to those in PEDRo, and also includes mass spectrometry data, although quite a lot of the data are in flat-file format [27]. However, it is fair to say that none of the current databases is operating in the context of high-throughput experimentation, which will certainly be prevalent in the near future.