Protein disorder prediction at multiple levels of sensitivity and specificity
© Hecker et al. 2008
Published: 20 March 2008
Many protein regions and some entire proteins have no definite tertiary structure, existing instead as dynamic, disordered ensembles under different physicochemical conditions. Identification of these disordered regions is important for protein production, protein structure prediction and determination, and protein function annotation. A number of disorder prediction software packages and web services have been developed since the first predictor was designed by Dunker's lab in 1997. However, most of these packages use a predefined threshold to classify residues as ordered or disordered. In many situations, users need to select ordered or disordered residues at different levels of sensitivity and specificity.
Here we benchmark a state-of-the-art disorder predictor, DISpro, on a large protein disorder dataset created from the Protein Data Bank and systematically evaluate the relationship between sensitivity and specificity. We also extend its functionality to allow users to trade off specificity against sensitivity by setting different decision thresholds. Moreover, we compare DISpro with seven other automated disorder predictors on the 95 protein targets used in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7). DISpro is ranked as one of the best predictors.
The evaluation and extension of DISpro make it a more valuable and useful tool for structural and functional genomics.
Prediction of protein structure from sequence is one of the most fundamental tasks of structural bioinformatics and proteomics. Although most amino acids (or residues) in most proteins adopt rather rigid structures (alpha-helix, beta-sheet, and loop), some residues in some proteins are very flexible and do not adopt a fixed conformation. These regions are usually called disorder regions. Identification of protein disorder regions is important for protein production, protein function annotation, and protein structure prediction and determination. For instance, flagging disordered residues is usually an important step in structural genomics projects.
To help locate these disordered regions, a number of computational tools have been developed that predict where they occur [2–12]. Most of these tools use a predefined threshold to classify residues as ordered or disordered and do not allow users to trade off sensitivity against specificity, which is desirable in many biological contexts. Moreover, no systematic benchmark of the specificity-sensitivity relationship and of the performance of different predictors is available, although such a benchmark would be a useful guide to using these tools well.
Thus, in this paper, we first create a large disorder dataset to evaluate the specificity-sensitivity relationship of a state-of-the-art tool, DISpro. We improve DISpro by allowing users to set different thresholds to trade off the specificity and sensitivity of disorder predictions, and by adding a function for visualizing the predictions.
Second, we benchmark several disorder predictors which participated in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction [13, 14] on a common dataset. The evaluation provides a useful guide for the current state of the art of protein disorder prediction tools.
To evaluate our modified DISpro, we used the 2408-sequence dataset discussed in the second section. Each sequence was run through DISpro. Instead of applying only the default DISpro cutoff of 0.5 to the outputs, we used 99 different threshold values, from 0.01 to 0.99 in steps of 0.01, to classify residues as ordered or disordered.
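The threshold sweep described above can be sketched as follows. This is only an illustration under stated assumptions, not the DISpro code itself: the function names `threshold_grid` and `sweep` and the example probabilities are hypothetical. The grid is built from integers so that the 99 values are exactly 0.01, 0.02, ..., 0.99 rather than accumulating floating-point drift.

```python
def threshold_grid(first=1, last=99, scale=100):
    """Return the thresholds 0.01, 0.02, ..., 0.99.

    Dividing integers avoids the drift that repeated addition of 0.01 would cause.
    """
    return [k / scale for k in range(first, last + 1)]


def sweep(probabilities, thresholds):
    """For each threshold, count the residues that would be called disordered.

    A residue is called disordered when its probability is strictly greater
    than the threshold, matching the rule stated for the 0.5 default.
    """
    return {t: sum(p > t for p in probabilities) for t in thresholds}


thresholds = threshold_grid()
# Made-up per-residue disorder probabilities, for illustration only.
probs = [0.02, 0.40, 0.55, 0.97]
counts = sweep(probs, thresholds)
```

Lower thresholds call more residues disordered and higher thresholds call fewer, which is what drives the trade-off between sensitivity and specificity.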
Sensitivity and specificity over varying thresholds
DISpro and seven other disorder predictors participated in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7) [13, 14]. CASP is the biennial community-wide evaluation of protein structure prediction techniques. CASP7 sent 100 protein targets whose structures were not yet known to research groups around the world for prediction. The human and automated predictors from these groups made predictions for the targets within a predefined period (two days for servers and a couple of weeks for humans) before the structures of the targets were released. Eight automated disorder predictors, DISpro, DISOPRED, GeneSilico, MBI, BIME, DRIP-PRED, Distill, and ProfBval, took part in the disorder predictions of CASP7. Among them, GeneSilico is a meta-server that takes the outputs of other predictors as its inputs, and ProfBval was originally designed to predict the flexibility of residues in a protein sequence rather than disorder regions.
To benchmark these automated predictors, we downloaded their predictions for 95 official targets from the CASP7 web site (predictioncenter.org/casp7). We generated the states (order or disorder) for these targets using the structure files compiled by Dr. Yang Zhang (zhang.bioinformatics.ku.edu/casp7/native.html). We labeled residues without coordinates as disordered and all other residues as ordered.
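The labeling rule can be illustrated with a minimal sketch. It assumes that the set of residue positions with coordinates has already been extracted from a structure file; that parsing step is not shown, and the function name, the 1-based numbering, and the toy sequence are hypothetical.

```python
def label_residues(sequence, observed_positions):
    """Label each residue 'O' (ordered) or 'D' (disordered).

    sequence           -- full target sequence as a string
    observed_positions -- 1-based positions that have coordinates in the
                          structure file (assumed to be extracted elsewhere)
    Residues without coordinates are labeled disordered.
    """
    observed = set(observed_positions)
    return "".join(
        "O" if position in observed else "D"
        for position in range(1, len(sequence) + 1)
    )


# Toy example: a 10-residue sequence in which only residues 3..8 are resolved.
seq = "MKTAYIAKQR"
labels = label_residues(seq, range(3, 9))
```

In this toy case the two residues at each terminus lack coordinates and are therefore labeled disordered.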
One caveat is that the dataset we created may differ slightly from the official disorder dataset used in the CASP7 evaluation. For instance, CASP7 uses at most 96 targets to evaluate disorder predictors, whereas we use only 95 official targets. However, the results based on the two datasets should be largely consistent. For instance, according to the official evaluation of CASP7, DISOPRED was ranked first and DISpro second in terms of ROC score (the area under the Receiver Operating Characteristic curve). According to our evaluation on the dataset we created, both methods are also ranked among the top two methods.
For the official CASP7 assessment of these predictors, readers should refer to the CASP7 disorder assessment paper. Here we aim only to provide a complementary evaluation of these predictors based on a common disorder definition.
The ROC scores of eight predictors on the CASP7 dataset
It is worth noting that the accuracy of the automated disorder predictors is slightly lower than that of the best human predictors [9, 11] in CASP7. However, a comparison with the human predictors cannot be made here since their prediction data are not publicly available.
Installation and use of this add-on script follow the general protocol of the SCRATCH protein data mining suite. Usage requires a pre-existing installation of SSpro [18, 19], and the add-on program can either be added to the script folder of the DISpro package or installed automatically with an updated version of DISpro.
The program takes as input a text file containing the test sequence in FASTA format, a file name to be used for data output, and a threshold value. The output file is intended to be used directly with Microsoft Excel to allow for quick and easy viewing of data trends in graphical form. As with the rest of the SCRATCH suite, the DISpro add-on is designed for the Linux operating system, and all testing was done on Linux. The original DISpro package is available at: contact.ics.uci.edu/download.html. The add-on program is available at: babbage.cs.missouri.edu/~chengji/cheng_software.html.
The protein sequences used for testing were acquired from the Protein Data Bank (PDB). This dataset consisted of 3131 sequences, with a disorder residue frequency of 5.4% (54,364). As similarly noted in earlier work, the majority of disordered regions were located at the N- and C-terminal ends of the protein sequences.
To strictly evaluate the performance of DISpro, we removed all sequences previously used in the training or testing of the original DISpro. Thus, classification accuracy is based solely on data previously unseen by the network.
The overall neural network system remains unchanged from the original DISpro, but it is discussed here briefly for clarity. As in the original DISpro, the predictor uses a 1-dimensional recursive neural network, which we refer to as 1D-RNN. Please see Baldi and Pollastri (2003) for a detailed explanation of the 1D-RNN's rolling "wheel" system.
In the 1D-RNN architecture, the network is designed to accept an entire sequence at once, rather than using the more common sliding window technique, thereby allowing for variable input size. As an example, consider an input sequence I of arbitrary length, where I_i is a vector containing the 25 values used to represent residue i. Of these values, 20 represent the frequencies of the 20 amino acids from a PSI-BLAST profile, and the other five are binary values denoting secondary structure and solvent accessibility predictions [18, 19, 23].
As output, the 1D-RNN produces a vector of real numbers O, where O_i is the probability that residue i is disordered. DISpro then uses these probabilities to classify each residue as disordered or ordered, based on a decision threshold of 0.5. However, by varying this threshold (as discussed in the next subsection), we are able to investigate the relationship between the specificity and sensitivity of disorder predictions.
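To make the 25-value residue representation concrete, here is a hedged sketch of how such a vector could be assembled. The text states only that 20 values are profile frequencies and five are binary structure features; the split into three secondary structure bits (helix, strand, coil) and two solvent accessibility bits (buried, exposed), the ordering of the values, and all names below are assumptions made for illustration, not the documented DISpro encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def encode_residue(profile_row, secondary_structure, exposed):
    """Build a 25-value vector for one residue.

    profile_row         -- dict mapping amino acid letter to profile frequency
    secondary_structure -- 'H' (helix), 'E' (strand), or 'C' (coil); the
                           three-class split is an assumption
    exposed             -- True if the residue is predicted solvent exposed
    """
    if secondary_structure not in "HEC" or len(secondary_structure) != 1:
        raise ValueError("secondary_structure must be 'H', 'E', or 'C'")
    frequencies = [float(profile_row.get(aa, 0.0)) for aa in AMINO_ACIDS]
    ss_bits = [1.0 if secondary_structure == s else 0.0 for s in "HEC"]
    # Two-bit accessibility code: [buried, exposed] (assumed layout).
    sa_bits = [0.0, 1.0] if exposed else [1.0, 0.0]
    return frequencies + ss_bits + sa_bits


# Toy profile row: the residue position is dominated by alanine and glycine.
vector = encode_residue({"A": 0.7, "G": 0.3}, "H", exposed=True)
```

A whole sequence would then be a list of such vectors, one per residue, which is what allows the input length to vary from protein to protein.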
One key goal of this study is to investigate the relationship between the specificity and sensitivity of disorder prediction. The major difference between the original DISpro and our extended version lies in the final stage of data classification. The original DISpro makes its classification decision based on the default threshold of 0.5, where a probability ≤ 0.5 means ordered and a probability > 0.5 means disordered; we have now implemented the capability to vary the decision threshold as needed. As a result, users can supply their own threshold value and view the corresponding output.
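The decision rule with a user-supplied threshold can be sketched in a few lines. The function name and the 'D'/'O' output alphabet are hypothetical and do not reflect the add-on's actual interface; only the rule itself (greater than the threshold means disordered, otherwise ordered) comes from the text.

```python
def classify(probabilities, threshold=0.5):
    """Turn per-residue disorder probabilities into a string of calls.

    A residue is 'D' (disordered) when its probability is strictly greater
    than the threshold and 'O' (ordered) otherwise, so a probability equal
    to the threshold counts as ordered.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    return "".join("D" if p > threshold else "O" for p in probabilities)


# Made-up probabilities for a four-residue example.
probs = [0.50, 0.51, 0.20, 0.95]
default_calls = classify(probs)       # default threshold of 0.5
lenient_calls = classify(probs, 0.1)  # lower threshold, more disorder calls
```

Lowering the threshold can only turn ordered calls into disordered ones, never the reverse, so the set of predicted disordered residues grows monotonically as the threshold decreases.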
To measure the effect of varying the decision threshold, we compute the sensitivity and specificity at different decision thresholds. Sensitivity (TP / (TP + FN)) is the percentage of true disordered residues that are predicted as disordered, while specificity (TP / (TP + FP)) is the percentage of predicted disordered residues that are truly disordered. TP, FP, FN and TN denote the numbers of true positives, false positives, false negatives, and true negatives, respectively.
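The two measures can be computed from label strings as in the sketch below, which follows the definitions given in the text. Note that TP / (TP + FP) is the quantity often called precision elsewhere; the function is named `specificity_as_defined` to make clear that it uses this paper's definition. All names and the toy labels are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, FN, TN, treating 'D' (disordered) as the positive class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = fp = fn = tn = 0
    for truth, call in zip(true_labels, predicted_labels):
        if truth == "D" and call == "D":
            tp += 1
        elif truth == "O" and call == "D":
            fp += 1
        elif truth == "D" and call == "O":
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no true disordered residues."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity_as_defined(tp, fp):
    """TP / (TP + FP), as defined in the text; 0.0 when nothing is called disordered."""
    return tp / (tp + fp) if tp + fp else 0.0


tp, fp, fn, tn = confusion_counts("DDOOOD", "DOODOD")
sens = sensitivity(tp, fn)
spec = specificity_as_defined(tp, fp)
```

Evaluating these two functions at each threshold of a sweep gives the sensitivity-specificity curve discussed in the text.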
The other major goal of the study is to estimate the state of the art of current disorder predictors. Thus, we systematically evaluate the performance of eight disorder region predictors on the CASP7 dataset of 95 proteins. We compute the true positive rates (TP / (TP + FN)) at different false positive rates (FP / (TN + FP)) to generate the ROC curves of these predictors. We use the area under the ROC curve (ROC score) to compare the predictors.
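A ROC score of this kind can be computed as in the following sketch, which sweeps over the distinct predicted scores and integrates the curve with the trapezoidal rule. This is a generic construction, not necessarily the exact procedure used in the evaluation; the function names and toy data are hypothetical, and a residue is counted as a positive call when its score is at least the current cutoff.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels -- 1 for truly disordered residues, 0 for ordered residues
    scores -- predicted disorder probabilities, one per residue
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to draw a ROC curve")
    points = [(0.0, 0.0)]
    # Lowering the cutoff step by step admits more residues as positive calls.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Toy data: one ordered residue scores higher than one disordered residue.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
roc_score = area_under_curve(roc_points(labels, scores))
```

A score of 1.0 corresponds to a predictor that ranks every disordered residue above every ordered one, while 0.5 corresponds to random ranking.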
In this study, we have created a large dataset to systematically investigate the performance of a state-of-the-art protein disorder predictor, DISpro. We improve the predictor by allowing for variable threshold selection rather than a fixed default, and provide an easy transition from output data into graphical form. Our results demonstrate the effectiveness of a variable decision threshold, which sometimes allows for a significant increase in sensitivity with only a small drop in specificity. Users can also visualize the predicted probabilities of disorder for all residues when choosing a threshold value for their purpose.
Moreover, we benchmark DISpro against seven other protein disorder predictors on the 95 targets used in CASP7. The evaluation provides an approximate guide to the state of the art of current protein disorder prediction methods.
In the future, we plan to use machine learning ensemble (or bagging) techniques to integrate the predictors evaluated in this research with other predictors, such as IUP, to improve disorder prediction.
JC is supported by a faculty start-up grant at University of Missouri. JC is also grateful to Dr. Pierre Baldi for the support during his PhD research at the University of California Irvine.
This article has been published as part of BMC Genomics Volume 9 Supplement 1, 2008: The 2007 International Conference on Bioinformatics & Computational Biology (BIOCOMP'07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.