- Open Access
PNImodeler: web server for inferring protein-binding nucleotides from sequence data
© Im et al.; licensee BioMed Central Ltd. 2015
- Published: 29 January 2015
Interactions between DNA and proteins are essential to many biological processes such as transcriptional regulation and DNA replication. With the increased availability of structures of protein-DNA complexes, several computational studies have been conducted to predict DNA binding sites in proteins. However, little attempt has been made to predict protein binding sites in DNA.
From an extensive analysis of protein-DNA complexes, we identified powerful features of DNA and protein sequences which can be used in predicting protein binding sites in DNA sequences. We developed two support vector machine (SVM) models that predict protein binding nucleotides from DNA and/or protein sequences. One SVM model that used DNA sequence data alone achieved a sensitivity of 73.4%, a specificity of 64.8%, an accuracy of 68.9% and a correlation coefficient of 0.382 with a test dataset that was not used in training. Another SVM model that used both DNA and protein sequences achieved a sensitivity of 67.6%, a specificity of 74.3%, an accuracy of 71.4% and a correlation coefficient of 0.418.
Predicting binding sites in double-stranded DNAs is a more difficult task than predicting binding sites in single-stranded molecules. Our study showed that protein binding sites in double-stranded DNA molecules can be predicted with an accuracy comparable to that for single-stranded molecules. Our study also demonstrated that using both DNA and protein sequences resulted in better prediction performance than using DNA sequence data alone. The SVM models and datasets constructed in this study are available at http://bclab.inha.ac.kr/pnimodeler.
- Support Vector Machine
- Positive Predictive Value
- Negative Predictive Value
- Support Vector Machine Model
- Matthews Correlation Coefficient
Interactions between nucleic acids and proteins have diverse functions within a cell, and play an important role in many biological processes. For example, proteins that bind to specific regions of DNA act as transcription factors by activating or repressing gene expression of the DNA. Thus, identifying protein recognition parts in DNAs or DNA recognition parts in proteins will help understand a variety of cellular processes [1, 2]. As many structures of protein-DNA complexes have been determined, theoretical and experimental studies have been conducted in recent years to study protein-DNA interactions, but protein-DNA interactions and their mechanisms are not yet fully understood.
Several computational methods have been developed to predict DNA- or RNA-binding residues in protein sequences using machine learning methods such as support vector machines (SVM) as classifiers. BindN uses SVM to predict RNA- or DNA-binding residues in proteins based on the biochemical features of amino acids. DP-Bind predicts DNA-binding residues in proteins and uses SVM with a position specific scoring matrix (PSSM) and amino acid properties. DNABindR uses a naïve Bayes classifier to predict DNA-binding residues in proteins. MetaDBSite predicts DNA-binding residues by integrating the prediction results from six predictors (DISIS, DNABindR, BindN, BindN-rf, DP-Bind and DBS-PRED). A method developed by Liu et al. predicts RNA-binding sites in proteins using a random forest. It uses several features such as mutual interaction propensity, physicochemical characteristics, hydrophobicity, relative accessible surface area, secondary structure, conservation score and side-chain environment.
Instead of finding DNA-binding sites in proteins, some works have attempted to classify whether a given protein is DNA-binding or non-binding. iDNA-Prot classifies proteins into DNA-binding and non-binding proteins from amino acid sequence data. DBPPred also classifies whether a given protein is a DNA-binding protein or not using secondary structure, relative solvent accessibility and PSSM.
Several studies have been conducted to find effective features of proteins in predicting DNA-binding sites in proteins. For example, Yi et al. characterized DNA-binding residues on protein surface with B-factors and packing density, and Dey et al. investigated evolutionary conservation, spatial clustering, hydrogen bond donor capability and residue propensity.
Unlike the previous works that have focused on DNA- or RNA- binding proteins, in the present work we attempted to predict protein binding nucleotides using sequence data. Predicting protein binding sites in DNA is more difficult than predicting DNA binding sites in proteins for several reasons: (1) for a sequence of the same length, DNA has many fewer sequence patterns than protein, (2) in protein-DNA interactions nucleotides show less diverse interaction propensities than amino acids, and (3) predicting binding sites in a double-stranded molecule (i.e., DNA) is more complicated than predicting binding sites in a single protein sequence.
In the present work we studied key features of DNA and protein sequences and their representation to predict protein binding sites in DNA. We developed two SVM models and compared their performances with actual data. One SVM model (hereafter called DPI1) predicts binding sites in a given DNA sequence with unknown protein. Another SVM model (called DPI2) predicts binding sites in a given DNA sequence with a specified protein. Experimental results showed that the SVM model that used DNA sequence data alone predicted more binding sites than the SVM model that used both DNA and protein sequences, but the overall performance of the latter was higher than that of the former. In this paper, we present our approach to the problem of predicting protein binding nucleotides from sequence data and discuss experimental results.
Definition of a binding site
We collected protein-DNA complexes that were determined by X-ray crystallography with a resolution of 3.0 Å or better from the Protein Data Bank (PDB). As of July 2013, there were 1,654 such protein-DNA complexes, which contain 1,589 DNA sequences and 892 protein sequences.
We divided the 1,589 DNA sequences into two groups using CD-HIT-EST. The 1,416 DNA sequences with a similarity of 80% or higher were selected as a training dataset for the prediction models. The remaining 173 DNA sequences, which have a similarity lower than 80% with any sequence of the training dataset, were used as a test dataset for the prediction models. We applied the feature vector-based method to the 1,416 DNA sequences to construct a non-redundant training dataset. The feature vector-based redundancy removal method, developed in our previous study, constructs a larger training dataset of non-redundant data than the standard sequence similarity-based reduction method. The initial 1,416 DNA sequences of the training dataset form 2,658 interaction pairs with 837 protein sequences, and the 173 DNA sequences of the test dataset form 189 interaction pairs with 135 protein sequences.
Our prediction models do not assume that the structure data or sequence direction is known, so they handle each sequence in a double-stranded DNA molecule separately to predict protein-binding sites in the DNA sequence. The training dataset for the model that uses both DNA and protein sequences contains 20,588 binding nucleotides and 27,630 non-binding nucleotides. For the model that uses only DNA sequence data, the binding sites of the same DNA sequence with different protein partners were incorporated into that one DNA sequence. Thus, the training dataset for this model contains fewer binding and non-binding nucleotides (20,378 binding nucleotides and 23,950 non-binding nucleotides) than that for the model that uses both DNA and protein sequences.
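The merging step for the DNA-only model can be pictured with a small sketch. It assumes that "incorporated" means taking the union of the binding labels that one DNA sequence receives from its different protein partners; the text does not spell this out, and the function name `merge_binding_labels` is hypothetical.

```python
def merge_binding_labels(label_sets):
    """Merge per-partner binding labels of one DNA sequence.

    label_sets: list of equal-length 0/1 lists, one list per protein partner.
    A nucleotide is labelled binding (1) if it binds at least one partner.
    Assumption: the merge is a union; the paper does not state the exact rule.
    """
    return [int(any(column)) for column in zip(*label_sets)]


# One DNA sequence of length 4 observed with two different protein partners.
merged = merge_binding_labels([[1, 0, 0, 1],
                               [0, 0, 1, 1]])
```

Under this reading, several interaction pairs collapse into one labelled sequence, which is consistent with the smaller nucleotide counts reported for the DNA-only training set.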
Support Vector Machines and Feature Vectors
We implemented two prediction models using a library for support vector machines (LIBSVM). As a kernel function we used the radial basis function (RBF). Two parameters, the SVM cost and the RBF kernel parameter gamma, control the prediction performance and the training time. We found the best values of cost and gamma for each window size using grid.py, the parameter search tool distributed with LIBSVM. We assigned binding nucleotides a weight of 1.3 to balance the data size of binding nucleotides with that of non-binding nucleotides.
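As a rough illustration of what the cost and gamma search involves, the sketch below writes out the RBF kernel and enumerates a grid of candidate (cost, gamma) pairs on a log2 scale. The exponent ranges are illustrative assumptions rather than the ranges used in this study, and the helper names (`rbf_kernel`, `log2_grid`, `binding_class_cost`) are hypothetical. The class weight is modelled as a multiplier on the cost of the binding class, which is how LIBSVM's per-class weight option behaves; whether the models were trained exactly this way is an assumption.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def log2_grid(start, stop, step):
    """Return 2^start, 2^(start + step), ... up to and including 2^stop."""
    values = []
    exponent = start
    while (step > 0 and exponent <= stop) or (step < 0 and exponent >= stop):
        values.append(2.0 ** exponent)
        exponent += step
    return values


def binding_class_cost(cost, weight=1.3):
    """Effective cost for the binding class when its weight scales the cost."""
    return cost * weight


# Illustrative exponent ranges (assumed, not taken from the paper).
cost_values = log2_grid(-5, 15, 2)
gamma_values = log2_grid(3, -15, -2)

# Every (cost, gamma) pair would be scored by cross validation,
# and the best-scoring pair kept for each window size.
candidates = list(product(cost_values, gamma_values))
```

A larger gamma makes the kernel more local, so the grid trades off a smoother decision boundary against a tighter fit to the training data.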
Since SVM handles numerical data, we encode sequence information into a feature vector with numerical elements. We created feature values from three types of sequence data: the original DNA sequence, DNA sequence fragments from the original DNA sequence, and the protein sequence interacting with the DNA. The original DNA sequence is represented by its base composition. DNA sequence fragments are represented by nucleotide triplet composition, normalized position, molecular mass, molecular pKa and interaction propensity (IP) of nucleotide triplets. The protein, as the interaction partner of the DNA, is represented by the sum of the normalized positions of the 20 amino acids and by dipeptide composition.
The base composition represents the percentage of each of the four nucleotides in a DNA sequence (equation 1). The nucleotide triplet composition represents the frequency of a nucleotide triplet using a sequence encoding scheme called the n-gram extraction method. For a given sequence, the n-gram method extracts the patterns and frequencies of n consecutive nucleotides. There are 64 (= 4 × 4 × 4) possible nucleotide triplets, and they are represented by 64 features in a feature vector using equation 2. The IP represents the binding propensities between nucleotide triplets and amino acids. The normalized position of the i-th nucleotide or amino acid is its relative position in the original sequence (equation 3).
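The three sequence-derived features described here can be sketched as follows. Equations 1 to 3 are not reproduced in this text, so the exact normalizations are assumptions: base composition as a percentage, triplet composition as a relative frequency over overlapping 3-grams, and normalized position as the 1-based index divided by the sequence length. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 64 nucleotide triplets in a fixed order: AAA, AAC, ..., TTT.
TRIPLETS = ["".join(p) for p in product(BASES, repeat=3)]


def base_composition(seq):
    """Percentage of A, C, G and T in the sequence."""
    return [100.0 * seq.count(base) / len(seq) for base in BASES]


def triplet_composition(seq):
    """Relative frequency of each of the 64 overlapping triplets (3-grams)."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        if triplet in counts:  # skip triplets with non-ACGT symbols
            counts[triplet] += 1
    total = max(len(seq) - 2, 1)
    return [counts[t] / total for t in TRIPLETS]


def normalized_position(i, length):
    """Relative position of the i-th residue (1-based) in its sequence."""
    return i / length


composition = base_composition("AACG")
triplets = triplet_composition("ACGTAC")
```

Keeping the triplets in a fixed order means the same feature index always refers to the same triplet across all sequence fragments.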
The dipeptide composition is the frequency of 400 (= 20 × 20) amino acid duplet patterns. The partner feature represents the sum of the normalized positions of 20 amino acids (equation 4).
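A minimal sketch of the two protein features follows. Equation 4 is not reproduced in this text, so the sketch assumes that the partner feature holds, for each of the 20 amino acid types, the sum of the normalized positions (1-based index divided by protein length) at which that amino acid occurs. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 dipeptides in a fixed order: AA, AC, ..., YY.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(protein):
    """Relative frequency of each of the 400 overlapping dipeptides."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(protein) - 1):
        dipeptide = protein[i:i + 2]
        if dipeptide in counts:  # skip non-standard residues
            counts[dipeptide] += 1
    total = max(len(protein) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def position_sums(protein):
    """Sum of normalized positions for each of the 20 amino acid types."""
    length = len(protein)
    sums = dict.fromkeys(AMINO_ACIDS, 0.0)
    for i, residue in enumerate(protein, start=1):
        if residue in sums:
            sums[residue] += i / length
    return [sums[aa] for aa in AMINO_ACIDS]


dipeptides = dipeptide_composition("ACAC")
partner = position_sums("ACAC")
```

For the toy protein "ACAC", alanine occurs at positions 1 and 3 and cysteine at positions 2 and 4, so the two non-zero partner features are 0.25 + 0.75 and 0.5 + 1.0.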
A DNA sequence is represented by overlapping sequence fragments using a sliding window method. Part A of Figure 2 shows the process of dividing sequences with a DNA sequence of length 9 and sliding window of size 5. After we represented the sequence fragments, we removed redundant feature vectors using the feature vector-based redundancy reduction method, which was developed in our previous study.
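The sliding window step can be sketched in a few lines. The example mirrors the case mentioned above, a DNA sequence of length 9 and a window of size 5. The sketch assumes that windows advance one nucleotide at a time and that the sequence ends are not padded; the figure that would settle this is not available here, and the example sequence is made up.

```python
def sliding_windows(seq, window_size):
    """Return all overlapping fragments of length window_size, step 1, no padding."""
    if window_size > len(seq):
        return []
    return [seq[i:i + window_size] for i in range(len(seq) - window_size + 1)]


# Hypothetical 9-nucleotide sequence cut with a window of size 5.
fragments = sliding_windows("ACGTACGTA", 5)
```

A sequence of length L yields L - w + 1 fragments for window size w under these assumptions, so the 9-nucleotide example produces five fragments.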
We performed a 10-fold cross validation to train and test the prediction models. For a more rigorous evaluation, we tested them on independent datasets that were not used in training them. The performance of the prediction models was evaluated with respect to six measures: sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and Matthews correlation coefficient.
Sensitivity (SN) is the ratio of correctly predicted binding nucleotides to actual binding nucleotides (equation 5). Specificity (SP) is the ratio of correctly predicted non-binding nucleotides to actual non-binding nucleotides (equation 6). Accuracy (ACC) is the ratio of correctly predicted nucleotides to all nucleotides (equation 7). Positive predictive value (PPV) measures the ratio of correctly predicted binding nucleotides to all nucleotides that are predicted as binding (equation 8). Negative predictive value (NPV) measures the ratio of correctly predicted non-binding nucleotides to all nucleotides that are predicted as non-binding (equation 9). The Matthews correlation coefficient (MCC) is a strong indicator for multi-class problems and returns a score between -1 and 1 (equation 10).
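Equations 5 to 10 are not reproduced in this text. The sketch below computes the six measures from confusion-matrix counts using their standard definitions, which match the verbal descriptions above; the function name `binary_metrics` and the returned dictionary layout are hypothetical.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute SN, SP, ACC, PPV, NPV and MCC from confusion-matrix counts.

    tp: binding nucleotides predicted as binding
    tn: non-binding nucleotides predicted as non-binding
    fp: non-binding nucleotides predicted as binding
    fn: binding nucleotides predicted as non-binding
    """
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Toy counts, not results from the paper.
metrics = binary_metrics(tp=40, tn=30, fp=20, fn=10)
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when one class is much larger than the other.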
Sensitivity, specificity and accuracy do not provide reliable performance indicators for imbalanced data. For example, consider a data set of 30 positives and 2,000 negatives which shows a sensitivity of 67%, a specificity of 91% and an accuracy of 91%. If it has nine times more false positives than true positives, the positive predictive value (PPV) can be as low as 10% despite the seemingly reasonable values of sensitivity, specificity and accuracy. Thus, we compute PPV and negative predictive value (NPV) in addition to sensitivity, specificity and accuracy.
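The arithmetic behind this example can be written out explicitly. The counts below are one assignment consistent with the rounded percentages quoted above; they are an assumption, since the text gives only the percentages.

```latex
\begin{align*}
TP &= 20, \quad FN = 10, \quad TN = 1820, \quad FP = 180 = 9 \times TP \\
SN &= \frac{TP}{TP + FN} = \frac{20}{30} \approx 67\% \\
SP &= \frac{TN}{TN + FP} = \frac{1820}{2000} = 91\% \\
ACC &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{1840}{2030} \approx 91\% \\
PPV &= \frac{TP}{TP + FP} = \frac{20}{200} = 10\%
\end{align*}
```

Because the negatives outnumber the positives by roughly 67 to 1, a false positive rate of only 9% still produces nine false positives for every true positive.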
Comparison of two SVM models
10-fold cross validation with different window sizes (WS) from 21 to 31.
Testing on independent datasets with different window sizes (WS) from 21 to 31.
Several programs have been developed to predict DNA-binding sites in proteins, but there are very few programs available that can predict protein-binding sites in DNA. PROMO is one of the few programs that predict transcription factor (TF) binding sites in DNA sequences. For comparative purposes, we tested the two models of PNImodeler (DPI1 and DPI2) and PROMO on DNA sequences of recent TF-DNA complexes which were deposited into PDB after December 2013 (DNA chains D and E of 3WTV, DNA chains C and D of 4CHU, and DNA chains E and F of 4ON0). The model DPI2 of PNImodeler, which used both DNA and protein sequences, showed a sensitivity of 65.31%, a specificity of 75.33%, an accuracy of 71.43% and an MCC of 0.404 on average. The model DPI1 of PNImodeler, which used DNA sequences only, showed a sensitivity of 61.40%, a specificity of 77.47%, an accuracy of 70.31% and an MCC of 0.395 on average. With all listed transcription factors as candidate binding partners of the DNA sequences of the recent TF-DNA complexes, PROMO showed a sensitivity of 36.95%, a specificity of 71.42%, an accuracy of 57.08% and an MCC of 0.088 on average. These results indicate that, on these three complexes, PNImodeler outperforms PROMO on all four measures, with or without the information on protein sequences.
In general, predicting protein binding sites in a double-stranded molecule is more complex than predicting binding sites in single-stranded molecules. We developed two SVM models to predict protein binding nucleotides in DNA. One model uses DNA sequence data alone and predicts all potential binding sites with unknown protein partners. The other model uses both DNA and protein sequences to predict the nucleotides that bind a specific protein. In both 10-fold cross validation and independent testing, the second model, which uses both DNA and protein sequences, achieved better performance than the first model, which uses DNA sequence data only.
We have implemented the SVM models as a web server called PNImodeler (Protein-Nucleic acid Interaction modeler), and it is available at http://bclab.inha.ac.kr/pnimodeler. This web server will be useful to find protein-binding sites in DNA with unknown structure. To the best of our knowledge, this is the first attempt to predict protein-binding DNA nucleotides with sequence data alone.
This work was funded by the Ministry of Science, ICT and Future Planning (2012R1A1A3011982) and the Ministry of Education (2010-0020163) of the Republic of Korea. The cost of the article was funded by the Ministry of Science, ICT and Future Planning (2012R1A1A3011982).
This article has been published as part of BMC Genomics Volume 16 Supplement 3, 2015: Selected articles from the 10th International Symposium on Bioinformatics Research and Applications (ISBRA-14): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S3.
- Wang L, Brown SJ: BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res. 2006, 34 (suppl 2): W243-W248.
- Ho S, Yu F, Chang C, Huang H: Design of accurate predictors for DNA-binding sites in proteins using hybrid SVM-PSSM method. BioSystems. 2007, 90 (1): 234-241. 10.1016/j.biosystems.2006.08.007.
- Hwang S, Gou Z, Kuznetsov IB: DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics. 2007, 23 (5): 634-636. 10.1093/bioinformatics/btl672.
- Yan C, Terribilini M, Wu F, Jernigan R, Dobbs D, Honavar V: Predicting DNA-binding sites of proteins from amino acid sequence. BMC Bioinformatics. 2006, 7 (1): 262. 10.1186/1471-2105-7-262.
- Si J, Zhang Z, Lin B, Schroeder M, Huang B: MetaDBSite: a meta approach to improve protein DNA-binding sites prediction. BMC Syst Biol. 2011, 5 (Suppl 1): S7.
- Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L: Prediction of protein-RNA binding sites by a random forest method with combined features. Bioinformatics. 2010, 26 (13): 1616-1622. 10.1093/bioinformatics/btq253.
- Lin WZ, Fang JA, Xiao X, Chou KC: iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS One. 2011, 6 (9): e24756. 10.1371/journal.pone.0024756.
- Lou W, Wang X, Chen F, Chen Y, Jiang B, Zhang H: Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One. 2014, 9 (1): e86703. 10.1371/journal.pone.0086703.
- Xiong Y, Liu J, Wei D: An accurate feature-based method for identifying DNA-binding residues on protein surfaces. Proteins. 2011, 79 (2): 509-517. 10.1002/prot.22898.
- Dey S, Pal A, Guharoy M, Sonavane S, Chakrabarti P: Characterization and prediction of the binding site in DNA-binding proteins: improvement of accuracy by combining residue composition, evolutionary conservation and structural parameters. Nucleic Acids Res. 2012, 40 (15): 7150-7161. 10.1093/nar/gks405.
- Kirsanov DD, Zanegina ON, Aksianov EA, Spirin SA, Karyagina AS, Alexeevski AV: NPIDB: nucleic acid-protein interaction database. Nucleic Acids Res. 2013, 41 (D1): D517-D523. 10.1093/nar/gks1199.
- Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prlic A, Quesada M, Quinn GB, Westbrook JD, Young J, Yukich B, Zardecki C, Berman HM, Bourne PE: The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 2011, 39 (Database): D392-D401. 10.1093/nar/gkq1021.
- Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.
- Choi S, Han K: Predicting protein-binding RNA nucleotides using the feature-based removal of data redundancy and the interaction propensity of nucleotide triplets. Comput Biol Med. 2013, 43 (11): 1687-1697. 10.1016/j.compbiomed.2013.08.011.
- Chang C, Lin C: LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (3): 27.
- Huang Y, Li Y: Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics. 2004, 20 (1): 21-28. 10.1093/bioinformatics/btg366.
- Wu C, Whitson G, McLarty J, Ermongkonchai A, Chang TC: Protein classification artificial neural system. Protein Sci. 1992, 1 (5): 667-677. 10.1002/pro.5560010512.
- Farre D, Roset R, Huerta M, Adsuara JE, Rosello L, Alba MM, Messeguer X: Identification of patterns in biological sequences at the ALGGEN server: PROMO and MALGEN. Nucleic Acids Res. 2003, 31 (13): 3651-3653. 10.1093/nar/gkg605.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.