PDA: an automatic and comprehensive analysis program for protein-DNA complex structures
© Kim and Guo. 2009
Published: 7 July 2009
Knowledge of protein-DNA interactions at the structural level can provide insights into the mechanisms of protein-DNA recognition and gene regulation. Although over 1400 protein-DNA complex structures have been deposited into the Protein Data Bank (PDB), the structural details of protein-DNA interactions are generally not available. In addition, current approaches to comparing protein-DNA complexes are mainly based on protein sequence similarity, while the DNA sequences are not taken into account. With the number of experimentally determined protein-DNA complex structures increasing, there is a need for an automatic program to analyze the protein-DNA complex structures and to provide comprehensive structural information for the benefit of the whole research community.
We developed an automatic and comprehensive protein-DNA complex structure analysis program, PDA (for protein-DNA complex structure analyzer). PDA takes PDB files as inputs and performs structural analysis that includes 1) whole protein-DNA complex structure restoration, especially the reconstruction of double-stranded DNA structures; 2) an efficient new approach for DNA base-pair detection; 3) systematic annotation of protein-DNA interactions; and 4) extraction of DNA subsequences involved in protein-DNA interactions and identification of protein-DNA binding units. Protein-DNA complex structures in the current PDB were processed and analyzed with our PDA program, and the analysis results were stored in a database. A dataset useful for studying protein-DNA interactions involved in gene regulation was generated using both protein and DNA sequences as well as the contact information of the complexes. WebPDA was developed to provide a web interface for using PDA and for data retrieval.
PDA is a computational tool for structural annotations of protein-DNA complexes. It provides a useful resource for investigating protein-DNA interactions. Data from the PDA analysis can also facilitate the classification of protein-DNA complexes and provide insights into rational design of benchmarks. The PDA program is freely available at http://bioinfozen.uncc.edu/webpda.
Due to the unique structural features of DNA, protein residues may interact with DNA bases in major or minor grooves, and the protein-DNA interactions can be specific or non-specific. Although these features are important in characterizing the nature of the interactions in a protein-DNA complex, they are not available in PDB files. Currently, there are several programs and databases, such as 3DNA [13, 14], Nucleic Acid Database (NDB), Amino Acid-Nucleotide Interaction Database (AANT), and Protein-Nucleic Acid Complex Database (ProNuC), which represent previous efforts in providing some structural details of DNA or protein-DNA complex structures. However, these programs/databases only provide information on some aspects of the protein-DNA complex structures. For example, NDB and 3DNA are nucleic acid specific. AANT only has statistical information on amino acid-nucleotide interaction. While ProNuC provides a list of atom-atom contact pairs between protein and DNA, it lacks other information such as the nature of protein-DNA interactions.
Sarai and colleagues recently developed a new scheme for classification of protein-DNA complexes using a "DNA-centric" approach [12, 18]. The new viewpoint highlights the need for a comprehensive annotation of the solved protein-DNA complex structures and an automatic program for generating such information. Here we present the development of such a program, PDA (for protein-DNA complex structure analyzer), which can help us better understand the mechanism of protein-DNA interactions and should be useful in statistical potential development, protein-DNA docking, and structure-based regulatory network studies. In addition, the protein-DNA complex structures can be classified from a more holistic view by combining the "protein-centric" and "DNA-centric" approaches.
Some PDB entries provide only partial coordinates of the whole protein-DNA complex structures. We observed two kinds of incompleteness in the protein-DNA complex structures in PDB files. In the first, parts of the complex structure, such as one chain of a double-stranded DNA or one chain of a protein dimer, are missing from the original PDB file (Figure 3A). PDB files with this type of incomplete structure usually have codes, e.g. "biological molecules", embedded in the structure file, which PDA uses to generate the full structure model if any missing components are identified. In the second, the coordinates of one or more full double-stranded DNAs are missing (Figure 3C). PDA searches for such missing double-stranded DNAs by first reconstructing 3 × 3 crystal cells from the crystal symmetry information of the structure in PDB and then examining whether there is any double-stranded DNA in the crystal whose bases are in contact with the protein(s).
A DNA base is considered to be in contact with a protein if the distance between any heavy atom of the base and any heavy atom of the protein is less than a cutoff value (the default cutoff is 4.8 Å in PDA). If the contact involves a base in the major/minor groove, it is annotated as a major/minor groove contact. When the distances between both the major and minor groove atoms of a base and a protein atom are within the cutoff value, the type of the base-protein contact is determined by comparing the contact distances and the angle formed by the "major groove atom"-"protein atom"-"minor groove atom". If the angle is less than 40 degrees, the contact with the longer distance is considered to be shielded by the shorter contact and is thus discarded. If the angle is more than 40 degrees, both the major and minor groove atoms of the base are considered to be in contact with the protein atom, a situation usually observed in terminal bases and in bases that do not have base-pairing partners (for example, the DNA glycosylase-DNA complex shown in Figure 5C). We use several measures to describe the nature of protein-DNA interactions: 1) major (minor) groove contact number refers to the number of major (minor) groove DNA bases that are in contact with protein; 2) major groove contact ratio is calculated as the ratio between the number of major groove contacts and the sum of major and minor groove contacts; 3) base (backbone) contact number refers to the number of nucleotides whose base (backbone) is in contact with protein; 4) base contact ratio is calculated as the ratio between the number of base contacts and the sum of base and backbone contacts. Additional aspects of protein-DNA interaction that are analyzed by PDA include "running-into-protein" DNA (when the axis of a double-stranded DNA is blocked by a protein) (Figure 5A) and "flipped base" (if a base in a double-stranded DNA does not have a base-pairing partner and is in contact with protein) (Figure 5C).
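The groove-contact rule and the contact ratios described above can be illustrated with a minimal Python sketch. It is not PDA's code: the function names (`groove_contacts`, `contact_ratio`) and the representation of atoms as plain (x, y, z) tuples are assumptions made for illustration, and only the 4.8 Å cutoff and the 40 degree threshold come from the text.

```python
import math

CUTOFF = 4.8         # Å, default contact cutoff stated in the text
SHIELD_ANGLE = 40.0  # degrees, shielding threshold stated in the text


def angle_at(vertex, p, q):
    """Return the angle p-vertex-q in degrees."""
    v1 = [a - b for a, b in zip(p, vertex)]
    v2 = [a - b for a, b in zip(q, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def groove_contacts(major_atom, minor_atom, protein_atom):
    """Return the set of groove contacts ('major', 'minor') one protein
    atom makes with one base (hypothetical helper, not PDA's API)."""
    d_major = math.dist(major_atom, protein_atom)
    d_minor = math.dist(minor_atom, protein_atom)
    major = d_major < CUTOFF
    minor = d_minor < CUTOFF
    if major and minor:
        # Angle "major groove atom"-"protein atom"-"minor groove atom".
        # The text does not say what happens at exactly 40 degrees;
        # this sketch keeps both contacts in that case.
        if angle_at(protein_atom, major_atom, minor_atom) < SHIELD_ANGLE:
            # The longer contact is shielded by the shorter one.
            if d_major <= d_minor:
                minor = False
            else:
                major = False
    return {name for name, hit in (("major", major), ("minor", minor)) if hit}


def contact_ratio(count_a, count_b):
    """Ratio count_a / (count_a + count_b), e.g. the major groove contact
    ratio or the base contact ratio; None when there are no contacts."""
    total = count_a + count_b
    return count_a / total if total else None
```

For example, `contact_ratio(3, 1)` gives the major groove contact ratio for a complex with three major groove and one minor groove contact. Returning `None` when there are no contacts is a choice made here; the text does not say how PDA reports that case.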
Each protein-DNA complex structure is assigned to one of four functional classes ("gene regulation", "transferase", "hydrolase" and "others") based on the keywords in its PDB file. Entries with "transcription" or "gene regulation" as keywords belong to the gene regulation class. The transferase class contains structures with keywords "transferase" or "polymerase", while the hydrolase class consists of PDB entries with an annotated function of "hydrolase" or "nuclease". The protein-DNA complex structures that cannot be assigned to any of these three classes are grouped into the "others" category. In case of conflicts, the function of the complex structure is further examined by manual inspection. For example, a few PDB entries have keywords for both the transferase class and the gene regulation class. All of them were classified as transferase after manual inspection.
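The keyword-based assignment can be sketched as follows. This is an illustration under stated assumptions rather than PDA's implementation: the function name `classify` is hypothetical, keywords are matched as case-insensitive substrings (the text does not specify the matching rule), and entries matching more than one class are only flagged, since the text resolves such conflicts by manual inspection.

```python
# Keyword lists taken from the text; the matching strategy is an assumption.
CLASS_KEYWORDS = {
    "gene regulation": ("transcription", "gene regulation"),
    "transferase": ("transferase", "polymerase"),
    "hydrolase": ("hydrolase", "nuclease"),
}


def classify(keywords_text):
    """Return (label, matched_classes) for the keyword text of a PDB entry.

    label is the single matching class, "others" when nothing matches,
    or None when several classes match and manual inspection is needed.
    """
    text = keywords_text.lower()
    matched = [
        name
        for name, keywords in CLASS_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    if not matched:
        return "others", matched
    if len(matched) > 1:
        return None, matched  # conflict: left for manual inspection
    return matched[0], matched
```

Called on a keyword string such as `"TRANSFERASE, TRANSCRIPTION"`, the sketch returns `None` together with both candidate classes, mirroring the conflict case described above.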
Sequence comparison is a convenient way to determine the similarity of two macromolecules, such as two proteins or two DNA sequences. It would be useful if such sequence comparison could also be done for protein-DNA complexes. Previous studies compared only protein sequences for dataset construction. Since a protein-DNA complex can have multiple protein chains as well as multiple double-stranded DNAs, we perform an all-against-all comparison (protein vs. protein and DNA vs. DNA) of two complexes and report the lower and upper bounds of the sequence identities for protein and DNA separately. While the sequences of the entire protein chains are used for protein comparison, choosing the DNA sequences to compare is not straightforward. Some protein-DNA complexes have long DNA sequences, but only a small portion of the sequences is involved in protein-DNA interaction. On the other hand, in some protein-DNA complexes, a large percentage of the DNA participates in the binding and interaction with proteins even though the DNA sequences are short. To address this issue, we first extract the DNA subsequences that interact with proteins, since in general the DNA binding motifs are better conserved while the flanking sequences show less conservation. The protein-binding DNA fragment is defined as the longest DNA subsequence bounded by two bases that are in contact with the protein plus one flanking base on each side (5' and 3'). Within the subsequence, at most three consecutive bases are allowed to be not in contact with the protein. If there is no base-protein contact in a double-stranded DNA, the double-stranded DNA is excluded from sequence comparison. Likewise, protein chains that are not in contact with any bases of DNAs are also excluded from sequence comparison. ALIGN [26, 27] is used for protein sequence comparison, with gap opening and extension penalties of -12 and -2, respectively.
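The definition of the protein-binding DNA fragment given above can be sketched in Python. This is one reading of that definition, not PDA's code: the function name `binding_fragment`, the per-position boolean contact list, and the handling of ties (the first of several equally long candidates is kept) are assumptions, and the sketch works on a single sequence rather than on a full double-stranded DNA.

```python
def binding_fragment(sequence, in_contact, max_gap=3, flank=1):
    """Extract the protein-binding fragment from a DNA sequence.

    sequence   -- DNA sequence as a string
    in_contact -- one boolean per position, True if that base contacts protein
    max_gap    -- most consecutive non-contact bases allowed inside (text: 3)
    flank      -- flanking bases added on each side (text: 1)

    Returns None when no base is in contact with the protein.
    """
    positions = [i for i, hit in enumerate(in_contact) if hit]
    if not positions:
        return None

    best_start = best_end = positions[0]
    run_start = prev = positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            run_start = pos  # gap too long: start a new candidate run
        prev = pos
        if prev - run_start > best_end - best_start:
            best_start, best_end = run_start, prev

    # Add the flanking bases, clipped at the sequence ends.
    lo = max(0, best_start - flank)
    hi = min(len(sequence), best_end + flank + 1)
    return sequence[lo:hi]
```

With the sequence `"ACGTACGTAC"` and contacts at positions 2, 3 and 5 (zero-based), the contacting run spans positions 2 to 5 and the returned fragment, including one flanking base on each side, is `"CGTACG"`.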
As for the DNA sequence comparison, we used an in-house program to perform gapless alignments since the binding motifs are generally short. Sequence identity is defined as the number of identical residues or bases in the alignment divided by the length of the shorter sequence.
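A gapless comparison with this identity definition can be sketched as below. The in-house program is not described in detail, so this is only an assumed variant: the function name `gapless_identity` is hypothetical, and the sketch slides the shorter sequence across every offset with at least one overlapping position (including overhangs) and keeps the best match count.

```python
def gapless_identity(seq_a, seq_b):
    """Best gapless-alignment identity between two sequences.

    Identity = identical positions in the alignment divided by the
    length of the shorter sequence, as defined in the text.
    """
    short, long_ = sorted((seq_a, seq_b), key=len)
    if not short:
        raise ValueError("sequences must be non-empty")

    best = 0
    # Offset of short[0] relative to long_[0]; negative offsets let the
    # shorter sequence overhang the start of the longer one (an assumption).
    for offset in range(-(len(short) - 1), len(long_)):
        matches = 0
        for i, base in enumerate(short):
            j = offset + i
            if 0 <= j < len(long_) and long_[j] == base:
                matches += 1
        best = max(best, matches)
    return best / len(short)
```

Because the denominator is the shorter length, a short fragment fully contained in a longer one scores 1.0, which is one reason short binding motifs tend to show high identities.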
To test the efficiency of PDA, which uses only two distances (H-distance and stagger distance) for base-pair detection, we compared the performance of PDA with 3DNA, a program widely used for DNA structure analysis, on a dataset of 1077 protein-DNA complex structures that were solved by X-ray crystallography with high resolution (less than 3.5 Å) and have at least one base-pair determined by 3DNA. Two base-pairing matrices were generated for each DNA, one by PDA and one by 3DNA. Each cell has a value of 1 if two bases form a pair and 0 otherwise. The correlation of base-pair assignments between PDA and 3DNA was calculated using the Matthews Correlation Coefficient (MCC). The histogram of the MCC for the 1077 protein-DNA complex structures is shown in Figure 5. The MCC is more than 0.90 for more than 99% of the complexes, and 73% of the complexes show a perfect correlation between 3DNA and PDA. Compared with the 3DNA assignment, most of the base-pairs missed by PDA were located at the termini of DNA or in the middle of very long and wound DNA. There are some base-pairs detected only by PDA but not by 3DNA. Through manual inspection, we found that many of these "false positive" base-pairs are possibly true base-pairs. Based on the above analysis, the performance of PDA in base-pair detection is comparable to that of 3DNA. Our simple but effective approach uses fewer than five distance calculations per base-pair, while 3DNA employs a least-squares fitting procedure to obtain a reference frame for each base, followed by a comparison of six geometrical parameters from the two reference frames for a pair of bases.
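The comparison of two base-pairing matrices can be made concrete with a short sketch of the standard MCC formula, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function name `mcc`, the use of nested lists for the matrices, the choice of one matrix as the reference, and the value 0.0 returned when the denominator is zero are assumptions of this sketch, not details taken from PDA.

```python
import math


def mcc(reference, predicted):
    """Matthews Correlation Coefficient between two 0/1 base-pairing
    matrices of equal shape (e.g. reference = 3DNA, predicted = PDA)."""
    tp = tn = fp = fn = 0
    for ref_row, pred_row in zip(reference, predicted):
        for ref, pred in zip(ref_row, pred_row):
            if ref and pred:
                tp += 1  # pair reported by both programs
            elif not ref and not pred:
                tn += 1  # pair reported by neither
            elif pred:
                fp += 1  # pair only in the predicted matrix
            else:
                fn += 1  # pair only in the reference matrix
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventional fallback when a marginal count is zero (assumption).
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Two identical matrices give an MCC of 1.0, the "perfect correlation" case mentioned above, while every missed or extra base-pair lowers the value.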
PDA takes a PDB file as input and outputs the detailed analysis result to the standard output as well as files for protein-DNA binding units. Most of the PDA output is self-explanatory. Several notable features of PDA are as follows. One is the PDAgram, a text-based diagram from PDA analysis showing the organization and structure of double-stranded DNAs and the interaction patterns between protein and double-stranded DNA (Figure 6). The advantage of PDAgram over 3D visualization of protein-DNA complexes is that it provides an easy way to display the interaction pattern of a protein-DNA complex. For 3D visualization of PDA analysis data, a RasMol/Jmol [29, 30] visualization script is automatically created for each PDA analysis report, in which the protein and DNA are rendered in cartoon and space-fill formats, respectively, with a default color scheme as shown in Figure 5D.
Since the DNA sequences involved in protein-DNA interaction are generally short (Figure 9A), two unrelated DNA sequences may have high sequence similarity. In Set263, there are 49874 complex pairs in which the proteins have less than 30% sequence identity. When the corresponding DNA sequences were compared, we found that about 93% of the DNA sequences have up to 65% sequence identity even though the protein sequences are not similar (data not shown). Because the DNA sequences are short, it is not surprising that all the DNA sequence pairs showed at least 25% sequence identity with the gapless sequence alignment approach. On the other hand, there are 252 pairs of protein-DNA complexes that have less than 65% DNA sequence identity while the proteins have more than 50% sequence identity (Figure 9B).
In general, there is a trade-off between "redundancy" and "dataset size" when constructing a dataset for statistical analysis, especially if the available data are limited, as in the case of protein-DNA complex structures. For example, when only protein sequences are used for protein-DNA complex comparison, a low sequence identity cutoff (e.g. 25%) will generate a relatively small dataset. This dataset offers low redundancy but lacks power in statistical analysis. While a higher protein sequence identity cutoff increases the dataset size, the "non-redundancy" is compromised. Note that protein-DNA complexes may have dissimilar DNA sequences and interaction patterns even though the protein sequence identity is over 50% (Figure 3 and Figure 9B). Therefore, it is possible to produce a bigger dataset while keeping a low data redundancy in terms of the nature of protein-DNA interactions by increasing the cutoff of protein sequence similarity and applying DNA sequence similarity at the same time. As an application example, we clustered complex structures in Set263 into 104 groups using a sequence identity cutoff of 50% for both the protein and double-stranded DNA. Non-redundant datasets can be selected from the 104 distinct clusters and used for studying transcription factor-DNA interactions. To our knowledge, this is the first attempt that considers not only the number of base-protein contacts and the ratio of specific contacts between protein and DNA but also the double-stranded DNA sequence identity. These datasets are available at http://bioinfozen.uncc.edu/webpda.
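Applying a protein cutoff and a DNA cutoff at the same time can be illustrated with a small clustering sketch. The text does not state which clustering procedure was used, so everything beyond the 50% cutoffs is an assumption: the sketch uses single-linkage grouping with a union-find structure, links two complexes only when both their protein and DNA identities reach the cutoff, and takes one identity pair per complex pair (the text reports lower and upper bounds, and which bound was used is not specified). The function name `cluster_complexes` is hypothetical.

```python
def cluster_complexes(n, identities, cutoff=0.5):
    """Group n complexes by single linkage (assumed method).

    identities maps a pair (i, j) of complex indices to a tuple
    (protein_identity, dna_identity), each in the range 0..1.
    Two complexes are linked only if BOTH identities reach the cutoff.
    """
    parent = list(range(n))

    def find(x):
        # Find the cluster representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), (protein_id, dna_id) in identities.items():
        if protein_id >= cutoff and dna_id >= cutoff:
            parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

In this scheme, two complexes with 90% protein identity but only 30% DNA identity stay in separate clusters, which is how the DNA criterion can keep complexes with similar proteins but dissimilar binding sites apart.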
We developed an automatic and comprehensive analyzer for protein-DNA complex structures and implemented it as a computer program, PDA. PDA can restore the full atomic coordinates of protein-DNA complex structures from partial coordinates, accurately detect DNA base-pairs with a new and simple algorithm, recognize double-stranded DNA structures, analyze protein-DNA contacts and define protein-DNA binding sites. These restorations and annotations are necessary for constructing datasets that take the DNA into consideration, making them truly non-redundant sets of "complex structures", not just non-redundant in terms of proteins. PDA's analysis of protein-DNA binding modes, including major/minor groove interactions and base/backbone-protein contacts, will also help the classification of protein-DNA complex structures and the construction of contact-specific datasets for protein-DNA interaction studies.
PDA and pre-compiled PDA analysis results for protein-DNA complex structures in PDB are freely available for non-commercial use at http://bioinfozen.uncc.edu/webpda. The only requirement for running PDA is a Python interpreter (tested on Python v2.4.2). A Java virtual machine, which is available free at http://www.java.com, is required for using the precompiled analysis data at http://bioinfozen.uncc.edu/webpda. The web server was successfully tested with Firefox 2, Safari 3 and Internet Explorer 6. The PDA program and web server will be updated regularly.
The authors thank the anonymous reviewers for many helpful comments on the manuscript. This work was supported by the startup fund to JTG from the University of North Carolina at Charlotte.
This article has been published as part of BMC Genomics Volume 10 Supplement 1, 2009: The 2008 International Conference on Bioinformatics & Computational Biology (BIOCOMP'08). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/10?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.