MUFOLD-DB: a processed protein structure database for protein structure prediction and analysis
© He et al.; licensee BioMed Central Ltd. 2014
Published: 16 December 2014
Protein structure data in Protein Data Bank (PDB) are widely used in studies of protein function and evolution and in protein structure prediction. However, there are two main barriers in large-scale usage of PDB data: 1) PDB data are highly redundant in terms of sequence and structure similarity; and 2) many PDB files have issues due to inconsistency of data and standards as well as missing residues, so that automated retrieval and analysis are often difficult.
To address these issues, we have created MUFOLD-DB (http://mufold.org/mufolddb.php), a web-based database that collects and processes the weekly PDB files, thereby providing users with non-redundant, cleaned and partially predicted structure data. For each of the non-redundant sequences, we annotate the SCOP domain classification and predict structures of missing regions by loop modelling. In addition, evolutionary information, secondary structure, disordered regions, and processed three-dimensional structure are computed and visualized to help users better understand the protein.
MUFOLD-DB integrates processed PDB sequence and structure data with multiple computational results, provides a friendly interface for users to retrieve, browse and download these data, and offers several utilities that simplify working with them.
Protein structure data in the Protein Data Bank (PDB) are widely used in studies of protein function and evolution, and they serve as a basis for protein structure prediction. The number of entries in PDB has been increasing rapidly. However, there are two barriers to large-scale use of PDB data, especially in an automated fashion. The first barrier is that a large number of protein chains in PDB are highly similar in terms of sequence or structure. For example, many PDB files contain identical chains. Hence, a light version of PDB may be useful. In addition, PDB users often need to obtain a set of PDB chains satisfying criteria such as structure resolution and sequence length, or they may need to select a representative from a group of similar sequences or structures. The second barrier is that many PDB files have issues due to inconsistency of data and standards as well as missing residues, so that automated retrieval and analysis are often difficult. For example, the sequence in a PDB header is sometimes inconsistent with that in the 3D coordinate part. Another example is that some residues in PDB are modified, and the residue types cannot be easily mapped to the original amino acids. One more issue is that many PDB files have incomplete coordinates, with some residues or atoms lacking 3D coordinates. This may be due to unresolved regions of the electron density map, but it creates problems for systematic analysis of large numbers of PDB files. Furthermore, a user who wishes to perform molecular dynamics simulation or other computational analysis of a given PDB file may first need to preprocess the file to add coordinates of missing atoms. Making pre-processed PDB files readily available for download may therefore help many simulation users.
Currently, several websites are available to address the first barrier. The PDB website itself can remove similar sequences at specific levels of mutual sequence identity. Other websites such as PDB-Select, ASTRAL, PDB-REPRDB and PISCES have similar functions, all of which allow users to download a pre-defined chain list or generate a customized list with some sequence or structure criteria. However, the chain lists derived from these websites are typically not updated weekly, even though hundreds of PDB files are released each week. Release of non-redundant structure datasets is even slower. For example, the widely used protein structure classification database SCOP, which involves extensive manual annotation, was last updated years ago (1.75 release in June 2009). It would be useful to incorporate automatic SCOP classification for newly released PDB files, even if the classification quality is suboptimal. In addition, the second barrier to large-scale use of PDB data, as illustrated above, has not been addressed systematically.
Users can search a PDB sequence against several derived sequence databases by using BLAST with specified parameters and browse all the hit sequences.
Users can generate a customized list from the entire set of PDB sequences by setting the filtering parameters, which include full or partial SCOP address, experimental method (e.g., X-Ray or NMR), sequence length, structure resolution (only applied to X-Ray structures), deposit date, and mutual sequence identity level (from 90 down to 30 percent). Such a list can serve as a non-redundant template database for developing protein energy functions and for template-based protein structure prediction.
Users can input a list of chain names to browse the corresponding information and quickly get the representatives of the involved clusters after clustering with seven levels of mutual sequence identity, from 90 to 30 percent. This utility can be used to cluster a set of sequences to reduce redundancy.
MUFOLD-DB carefully processes the PDB sequence and structure to provide users with clean data that are much easier to manipulate than the original PDB files. Structures of missing regions with fewer than 7 residues in PDB chains are predicted by high-quality loop modelling using MODELLER, to help structure prediction and function analysis.
Several kinds of data are available for download, including sequence, predicted SCOP classification, cleaned PDB format file, and PDB files with loop modelling. Pre-computed sequence and SCOP representative datasets are also provided. These files can be retrieved through a command line without going through a web browser.
Users can view each chain in detail. Besides the basic information from PDB files, evolutionary information represented as a sequence logo, secondary structure, disordered regions, and three-dimensional structure visualization with JMol (http://www.jmol.org) are provided.
The database is automatically updated every week following the weekly release of PDB.
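The customized-list utility described above can be pictured as a simple filter over per-chain records. The following is a minimal sketch only: the `ChainRecord` fields, the `filter_chains` function and the example chain IDs are hypothetical and do not reflect the actual MUFOLD-DB implementation or interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ChainRecord:
    """Hypothetical per-chain record; field names are assumptions."""
    chain_id: str
    scop: str                    # predicted SCOP address, e.g. "a.1.1.2"
    method: str                  # "X-RAY" or "NMR"
    length: int
    resolution: Optional[float]  # in Angstrom; None when not applicable
    deposit_date: date


def filter_chains(chains: Iterable[ChainRecord],
                  scop_prefix: str = "",
                  method: Optional[str] = None,
                  min_length: int = 0,
                  max_length: Optional[int] = None,
                  max_resolution: Optional[float] = None,
                  deposited_from: Optional[date] = None) -> List[ChainRecord]:
    """Return the chains that satisfy every criterion that was supplied."""
    selected = []
    for c in chains:
        # A partial SCOP address matches whole levels only ("a.1" != "a.10").
        if scop_prefix and not (c.scop == scop_prefix
                                or c.scop.startswith(scop_prefix + ".")):
            continue
        if method is not None and c.method != method:
            continue
        if c.length < min_length:
            continue
        if max_length is not None and c.length > max_length:
            continue
        # The resolution cutoff is applied to X-ray structures only.
        if max_resolution is not None and c.method == "X-RAY":
            if c.resolution is None or c.resolution > max_resolution:
                continue
        if deposited_from is not None and c.deposit_date < deposited_from:
            continue
        selected.append(c)
    return selected


# Made-up example records (not real database content).
example_chains = [
    ChainRecord("1abcA", "a.1.1.2", "X-RAY", 150, 1.8, date(2010, 5, 1)),
    ChainRecord("2xyzB", "a.1.1.1", "X-RAY", 90, 3.2, date(2012, 3, 9)),
    ChainRecord("3nmrA", "b.40.4.5", "NMR", 75, None, date(2013, 7, 21)),
]
well_resolved_a1 = filter_chains(example_chains, scop_prefix="a.1",
                                 max_resolution=2.5)
```

Filtering by mutual sequence identity is left out here because it depends on clustering results rather than on per-chain attributes.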
Construction and content
PDB sequence and structure processing
The sequences from the header (reference sequence) and from the coordinate part of the PDB file are aligned for each chain, with no gaps allowed in the reference sequence. The alignment reveals the sequence positions that lack coordinates, some of which are probably disordered regions. In addition, residue indices in the "ATOM" part are renumbered starting from one, according to the alignment.
We have restored the residue codes for the majority of the modified residues through "MODRES" records from PDB files.
Atoms of a residue beyond its standard amino acid composition are simply removed, as it is difficult for various structure prediction and analysis tools, e.g., MODELLER, to process them.
PDB chains are removed if their structures contain only CA atoms, if their sequences are shorter than 30 residues, or if they contain too many unknown residues ('X's).
If alternative conformations exist for residues or atoms (e.g., residue 1 of THR in PDB 1CBN has two conformations), only the first conformation is selected.
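The first processing step above, aligning the coordinate sequence to the reference sequence without inserting gaps into the reference, can be sketched as a small dynamic programme. The source does not specify the alignment algorithm or its scoring, so the function name and the scores used here (+1 for a match, -1 for a mismatch, 0.5 per break between consecutive observed residues) are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def map_observed_to_reference(reference: str, observed: str,
                              gap_open: float = 0.5,
                              mismatch: float = -1.0
                              ) -> Tuple[List[int], List[int]]:
    """Place each observed (ATOM) residue on a reference (SEQRES) position.

    Returns (positions, missing): positions[i] is the 1-based reference
    index given to observed[i]; missing lists the reference positions
    that have no coordinates.
    """
    n, m = len(reference), len(observed)
    if m > n:
        raise ValueError("observed sequence is longer than the reference")
    if m == 0:
        return [], list(range(1, n + 1))
    neg = float("-inf")
    # score[i][j]: best score with observed[i] placed on reference[j].
    score = [[neg] * n for _ in range(m)]
    back = [[-1] * n for _ in range(m)]
    for i in range(m):
        best_prev, best_prev_j = neg, -1  # best of score[i-1][k], k <= j-2
        for j in range(i, n - (m - 1 - i)):
            s = 1.0 if observed[i] == reference[j] else mismatch
            if i == 0:
                score[i][j] = s  # leading missing residues are not penalised
                continue
            if j >= 2 and score[i - 1][j - 2] > best_prev:
                best_prev, best_prev_j = score[i - 1][j - 2], j - 2
            contiguous = score[i - 1][j - 1]
            jump = best_prev - gap_open
            if contiguous >= jump:
                score[i][j], back[i][j] = s + contiguous, j - 1
            else:
                score[i][j], back[i][j] = s + jump, best_prev_j
    # Trace back from the best placement of the last observed residue.
    j = max(range(n), key=lambda k: score[m - 1][k])
    positions = [0] * m
    for i in range(m - 1, -1, -1):
        positions[i] = j + 1
        j = back[i][j]
    used = set(positions)
    missing = [p for p in range(1, n + 1) if p not in used]
    return positions, missing


# Made-up sequences: two residues missing at the start, two in the middle.
positions, missing = map_observed_to_reference("MKTAYIAKQR", "TAYKQR")
```

The returned `positions` can be used directly as the new residue indices of the "ATOM" records, and `missing` marks the regions that are candidates for loop modelling.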
MUFOLD-DB provides a fast way to generate a subset of chains from the whole dataset at seven levels of sequence identity, from 90 to 30 percent. This is implemented by a systematic indexing scheme and pre-computed clustering results. Sequence clustering with thresholds from 90 to 40 percent is done using CD-Hit. Clustering at 30 percent mutual identity is done by all-to-all sequence comparison using PSI-BLAST, as the lowest identity cutoff of CD-Hit is 40%. The similarity between two sequences is computed as the PSI-BLAST local alignment identity divided by the average sequence length. The representative of each cluster is selected based on a combination of sequence length, structure resolution and deposit date. A longer sequence has higher priority; but if two sequences differ in length by less than 10 residues, the one with the higher resolution is selected. Here X-ray structures are always assumed to be more accurate than NMR structures. If these criteria cannot determine the priority, the sequence with the later deposit date has higher priority, as newly resolved structures are more likely to have better quality.
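The representative-selection rules above can be written as a pairwise preference. This is a minimal sketch assuming a hypothetical `Chain` record; it treats "higher resolution" as a smaller value in Å and ranks any X-ray structure above any NMR structure, as the text states. Because the pairwise rule is not guaranteed to be transitive, the outcome may depend on the order in which cluster members are compared, and the source does not say how such cases are handled.

```python
from dataclasses import dataclass
from datetime import date
from functools import reduce
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Chain:
    """Hypothetical cluster member; field names are assumptions."""
    chain_id: str
    length: int
    method: str                  # "X-RAY" or "NMR"
    resolution: Optional[float]  # in Angstrom; None for NMR
    deposit_date: date


def quality_key(c: Chain) -> Tuple[int, float]:
    """Smaller is better: X-ray before NMR, then smaller resolution value."""
    is_xray = c.method == "X-RAY"
    if is_xray and c.resolution is not None:
        return (0, c.resolution)
    return (0 if is_xray else 1, float("inf"))


def preferred(a: Chain, b: Chain) -> Chain:
    """Apply the length, resolution and deposit-date rules to one pair."""
    if abs(a.length - b.length) >= 10:
        return a if a.length > b.length else b
    ka, kb = quality_key(a), quality_key(b)
    if ka != kb:
        return a if ka < kb else b
    return a if a.deposit_date >= b.deposit_date else b


def pick_representative(cluster: Sequence[Chain]) -> Chain:
    """Fold the pairwise preference over all members of a cluster."""
    return reduce(preferred, cluster)


# Made-up cluster (not real database content).
cluster = [
    Chain("1aaaA", 200, "X-RAY", 2.5, date(2001, 4, 2)),
    Chain("2bbbA", 195, "X-RAY", 1.8, date(2005, 9, 30)),
    Chain("3cccA", 150, "NMR", None, date(2010, 1, 15)),
]
representative = pick_representative(cluster)
```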
1. Compare each new sequence in the PDB dataset against all sequences of the SCOP dataset using PSI-BLAST. Select those hits whose E-value is less than 0.01 and whose Z-Score in the corresponding CE structure alignment is greater than 4.5.
2. If no hit is found in step 1, compare the query structure to the family representatives of the SCOP dataset. Select those hits whose CE Z-Score is greater than 4.5.
3. When multiple hits are found in step 1 or 2, assign to the new protein the address of the hit with the highest CE Z-Score. When the Z-Score is identical, choose the longest sequence as the representative.
4. Check unassigned regions: if the length is greater than 30 residues, repeat steps 1 to 3 using the sub-sequence; otherwise merge the short unassigned regions into the neighboring domains.
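The decision logic of the classification steps above can be sketched as follows, assuming that the PSI-BLAST and CE results have already been computed and collected into simple records. The `Hit` type and both function names are hypothetical; running PSI-BLAST or CE is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Hit:
    """Hypothetical pre-computed hit against a SCOP entry."""
    scop_address: str
    evalue: Optional[float]  # PSI-BLAST E-value; None for structure-only hits
    ce_zscore: float         # Z-Score of the CE structure alignment
    length: int              # length of the hit sequence


def assign_scop(sequence_hits: Sequence[Hit],
                structure_hits: Sequence[Hit],
                e_cut: float = 0.01,
                z_cut: float = 4.5) -> Optional[str]:
    """Return a SCOP address following the sequence-then-structure rule."""
    # Sequence hits must pass both the E-value and the CE Z-Score cutoffs.
    candidates = [h for h in sequence_hits
                  if h.evalue is not None and h.evalue < e_cut
                  and h.ce_zscore > z_cut]
    # Fall back to structure comparison against family representatives.
    if not candidates:
        candidates = [h for h in structure_hits if h.ce_zscore > z_cut]
    if not candidates:
        return None
    # Highest Z-Score wins; ties are broken by the longest sequence.
    return max(candidates, key=lambda h: (h.ce_zscore, h.length)).scop_address


def split_unassigned(seq_len: int,
                     domains: Sequence[Tuple[int, int]],
                     min_len: int = 30
                     ) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split unassigned regions into (re-run, merge) lists.

    domains holds 1-based inclusive (start, end) pairs that do not overlap.
    Regions longer than min_len are re-classified; shorter ones are merged
    into neighboring domains.
    """
    rerun, merge = [], []
    cursor = 1
    for start, end in sorted(domains) + [(seq_len + 1, seq_len + 1)]:
        if start > cursor:
            region = (cursor, start - 1)
            (rerun if start - cursor > min_len else merge).append(region)
        cursor = max(cursor, end + 1)
    return rerun, merge


# Made-up hits (not real SCOP assignments).
sequence_hits = [
    Hit("d.58.1.1", 1e-5, 5.0, 120),
    Hit("d.58.1.2", 1e-3, 6.2, 110),
    Hit("c.37.1.8", 0.5, 9.0, 100),  # fails the E-value cutoff
]
address = assign_scop(sequence_hits, [])
rerun, merge = split_unassigned(200, [(11, 100), (141, 190)])
```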
[Table: SCOP classification accuracy. Recoverable row label: Single domain (491).]
Generate an initial model with the alignment between the sequence in the "ATOM" section and the reference sequence in the "SEQRES" section of the PDB file.
Run MODELLER to generate 500 model candidates for residues with missing coordinates (missing loop region).
Compute the Root Mean Square Deviation (RMSD) between the model and the original structure, defined as rmsdRest for the structure other than missing loop regions. If rmsdRest is greater than 0.1 Å, remove the model, in order to keep the experimental structure intact.
Compute the DOPE energy for each of the remaining models and select the model with the lowest DOPE energy as the final output.
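The filtering and selection steps of the loop modelling procedure can be illustrated without calling MODELLER itself. The sketch below assumes that each candidate model already has its rmsdRest and DOPE energy computed; the `LoopModel` type and the function names are hypothetical. The RMSD helper compares coordinates in place, without superposition, which assumes the models share the coordinate frame of the original structure.

```python
import math
from typing import NamedTuple, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """RMSD between two equally ordered coordinate lists (no superposition)."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in size")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(a, b))
    return math.sqrt(total / len(a))


class LoopModel(NamedTuple):
    """Hypothetical summary of one candidate model."""
    name: str
    rmsd_rest: float  # RMSD (Angstrom) outside the modelled loop regions
    dope: float       # DOPE energy; lower is better


def select_loop_model(models: Sequence[LoopModel],
                      max_rmsd_rest: float = 0.1) -> Optional[LoopModel]:
    """Drop models that moved the experimental part, then take lowest DOPE."""
    kept = [m for m in models if m.rmsd_rest <= max_rmsd_rest]
    return min(kept, key=lambda m: m.dope) if kept else None


# Made-up candidates; "m2" has the lowest energy but distorts the structure.
candidates = [
    LoopModel("m1", 0.05, -12000.0),
    LoopModel("m2", 0.30, -12500.0),
    LoopModel("m3", 0.02, -12100.0),
]
best = select_loop_model(candidates)
```

Applying the rmsdRest filter before ranking by energy means a low-energy model is never accepted at the cost of altering experimentally determined coordinates.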
Additional computational results
To help users better understand the protein, more computational results are integrated. We have calculated sequence profiles for all the sequences by running PSI-BLAST for three rounds against the non-redundant (NR) database with an E-value cutoff of 0.001. The sequence profile is represented as a logo image generated using WebLogo from the first 100 alignments extracted from the last round of PSI-BLAST. The secondary structure and solvent accessibility are computed by DSSP. Secondary structure is represented in three states: H (alpha helix), E (beta strand) and C (coil). Relative solvent accessibility (RSA) is computed and classified into three states: E (exposed, RSA greater than 0.37), B (buried, RSA less than 0.069) and I (intermediate, in between). In addition, a structure image is generated for each chain using Raster3D and MOLSCRIPT, and users can view the three-dimensional structure interactively with JMol.
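The two three-state encodings can be sketched in a few lines. The RSA thresholds are those given above. The mapping from the eight DSSP states to three is not specified in the text, so the commonly used reduction (H, G, I to H; E, B to E; everything else to C) is assumed here, and the function names are illustrative.

```python
# Assumed 8-to-3 state reduction; the source does not state which one it uses.
DSSP_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_secondary_structure(dssp_string: str) -> str:
    """Map a DSSP state string to H/E/C; unlisted states become coil."""
    return "".join(DSSP_TO_THREE.get(state, "C") for state in dssp_string)


def classify_rsa(rsa: float, exposed: float = 0.37,
                 buried: float = 0.069) -> str:
    """Classify relative solvent accessibility as E, B or I."""
    if rsa > exposed:
        return "E"  # exposed
    if rsa < buried:
        return "B"  # buried
    return "I"      # intermediate


three_state = reduce_secondary_structure("HHGGEEBT S-")
rsa_states = "".join(classify_rsa(v) for v in (0.50, 0.01, 0.20))
```

A value exactly at a threshold (for example an RSA of 0.37) falls into the intermediate class, since the text defines exposed and buried with strict inequalities.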
Utility and discussion
MUFOLD-DB has integrated processed protein sequence and structure data from PDB files and multiple-source information from computational results. It has web-based interfaces and utilities for users to retrieve, browse and download data. The system has some limitations. In particular, the added coordinates for missing residues and atoms are based on computational predictions and may not be reliable. Nevertheless, we believe it provides a valuable resource for the protein modelling community and for structural biology in general.
Providing a customized list of chains
Browse feature provides a list of chains
Figure 3C shows the interface for browsing the chains in a list generated online, in a pre-computed dataset such as the sequence representatives of the SCOP classification, or in an input list. The chains are listed in a table with the following attributes: ID, sequence length, structure type, deposit date, source of the protein and predicted SCOP classification. Users can browse and make selections across pages.
Cluster feature works with input list of chains
Figure 3C also provides the entry point for clustering a set of chains at different levels of mutual similarity.
Details of single chain available
Besides the download entry shown on a result page (see Figure 4), MUFOLD-DB has more options for users to get data. As shown in Figure 3D, users can download the data for an input list or for a pre-computed dataset, e.g., representatives of the SCOP classification or of sequence clustering at different levels of mutual sequence identity. As MUFOLD-DB is updated weekly, some of its past data are kept. These snapshots can be used as benchmark datasets for different dates.
[Table: Number of processed PDB files and the deposit time of the PDB files. Recoverable entries: Aug. 11, 1972; Jan. 13, 2014.]
Number of representative sequences at each threshold level of mutual sequence identity.
The database is publicly available and can be accessed at http://mufold.org/mufolddb.php
This work and its publication have been supported by National Institutes of Health grants NIH/NIGMS R21/R33-GM078601 and 5R01GM100701 to DX. The computations were mainly performed on the high-performance computing resources at the University of Missouri Bioinformatics Consortium.
This article has been published as part of BMC Genomics Volume 15 Supplement 11, 2014: Selected articles from the 2014 International Conference on Advances in Big Data Analytics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S11.
- Sussman JL, Lin D, Jiang J, Manning NO, Prilusky J, Ritter O, Abola EE: Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998, 54 (Pt 6 Pt 1): 1078-1084.
- Griep S, Hobohm U: PDBselect 1992-2009 and PDBfilter-select. Nucleic Acids Res. 2010, 38 (Database issue): D318-319.
- Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 2000, 28 (1): 254-256. 10.1093/nar/28.1.254.
- Noguchi T, Matsuda H, Akiyama Y: PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB). Nucleic Acids Res. 2001, 29 (1): 219-220. 10.1093/nar/29.1.219.
- Wang G, Dunbrack RL: PISCES: a protein sequence culling server. Bioinformatics. 2003, 19 (12): 1589-1591. 10.1093/bioinformatics/btg224.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247 (4): 536-540.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234 (3): 779-815. 10.1006/jmbi.1993.1626.
- Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005, 21 (7): 951-960. 10.1093/bioinformatics/bti125.
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11 (9): 739-747. 10.1093/protein/11.9.739.
- Tung CH, Yang JM: fastSCOP: a fast web server for recognizing protein structural domains and SCOP superfamilies. Nucleic Acids Res. 2007, 35 (Web Server issue): W438-443.
- Shen MY, Sali A: Statistical potential for assessment and prediction of protein structures. Protein Sci. 2006, 15 (11): 2507-2524. 10.1110/ps.062416606.
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.
- Xu J: Fold recognition by predicted alignment accuracy. IEEE/ACM Trans Comput Biol Bioinform. 2005, 2 (2): 157-165. 10.1109/TCBB.2005.24.
- Merritt EA, Murphy ME: Raster3D Version 2.0. A program for photorealistic molecular graphics. Acta Crystallogr D Biol Crystallogr. 1994, 50 (Pt 6): 869-873.
- Kraulis PJ: MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J Appl Cryst. 1991, 24: 946-950. 10.1107/S0021889891004399.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.