Volume 12 Supplement 4
Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns
© Pires et al; licensee BioMed Central Ltd. 2011
Published: 22 December 2011
The unforgiving pace of growth of available biological data has increased the demand for efficient and scalable paradigms, models and methodologies for automatic annotation. In this paper, we present a novel structure-based protein function prediction and structural classification method: Cutoff Scanning Matrix (CSM). CSM generates feature vectors that represent distance patterns between protein residues. These feature vectors are then used as evidence for classification. Singular value decomposition is used as a preprocessing step to reduce dimensionality and noise. The aspect of protein function considered in the present work is enzyme activity. A series of experiments was performed on datasets based on Enzyme Commission (EC) numbers and mechanistically different enzyme superfamilies as well as other datasets derived from SCOP release 1.75.
CSM was able to achieve a precision of up to 99% after SVD preprocessing for a database derived from manually curated protein superfamilies and up to 95% for a dataset of the 950 most-populated EC numbers. Moreover, we conducted experiments to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. An experiment using the whole set of domains found in the latest SCOP version yielded high levels of precision and recall (up to 95%). Finally, we compared our structural classification results with those in the literature to place this work into context. Our method significantly improved the recall of a previous study while preserving a comparable precision level.
We showed that the patterns derived from CSMs could effectively be used to predict protein function and thus help with automatic function annotation. We also demonstrated that our method is effective in structural classification tasks. These facts reinforce the idea that the pattern of inter-residue distances is an important component of family structural signatures. Furthermore, singular value decomposition provided a consistent increase in precision and recall, which makes it an important preprocessing step when dealing with noisy data.
With the increasing number of genome and metagenome projects, sequence databases have grown exponentially. On the one hand, the August 2010 release of the UniprotKB/TrEMBL database  contains about 12,000,000 protein sequences. In the last month, more than 300,000 new sequences have been added to that repository, and about 6,000,000 entry annotations have been revised. On the other hand, the Pfam database of protein families  represents about 12,000 families, and about 20% of these are domains of unknown function (DUFs), revealing that state-of-the-art sequence similarity-based and even profile-based annotation methods have had limited success in assigning functions to novel proteins.
Protein structural classification databases, such as SCOP , also present difficulties in keeping up with the increasing number of protein structures solved and deposited in public repositories. Approximately 53% of the Protein Data Bank (PDB)  entries are classified by the current release of SCOP (1.75) as of April 2011, and after removing redundancy (sequence similarity at 90%), the coverage drops to about 41%. As international structural genomics initiatives have produced a huge number of structures of unknown function, attempting to automatically assign functions to these proteins is becoming even more necessary, and significant efforts have been devoted to this task [5–8].
In this context, novel paradigms, models and methodologies for automatic annotation must be investigated. Because protein structure and function are more conserved than protein sequence , the identification of similarities between novel sequences and known structures would greatly improve the characterization of these sequences. Fold recognition refers to identifying main structural features by the connections and positions of secondary structure elements. Conversely, according to Murzin et al. , structural classification is conducted at hierarchical levels (class, fold, superfamily and family) that embody evolutionary and structural relationships. In this work, we focused on structural classification, which encompasses the problem of fold recognition. Both fold recognition and structural classification are important steps toward function prediction.
Over the years, protein fold recognition has been addressed through different approaches. The authors of  extracted a series of features from protein sequences and used support vector machines and neural network learning methods as the base classifiers in a dataset composed of SCOP folds. Later, ensemble classifiers  were applied to these same feature vectors, improving the success rate. The use of a combination of sequence and structure information brought an improvement to fold recognition, as mentioned in the information retrieval approach introduced in .
Likewise, several efforts toward structure-based protein function prediction have been made. We can cite, for instance, the search for structural motifs [13–15] and functional residues (such as DNA  and metal  binding sites), the use of 3D templates  and the comparison of protein folds by structure alignments [18, 19]. There have also been attempts to infer function from structure without the use of alignment algorithms, such as in enzyme classification [20, 21]. Similarly, in the present work, we do not use alignment techniques or any sequence information in our method, relying only on structural grounds. A primary problem faced when dealing with protein function, as pointed out in , is defining the scope and function. Protein function prediction may be understood from different perspectives. It could mean the prediction of the cellular process in which a protein is involved, its enzymatic activity or even its physiological role. For instance, a protein’s enzymatic activity could be described by EC numbers, while its physiological role might be related to its subcellular localization. In this work, the aspect of protein function considered is enzyme activity. However, the study might be extended, without loss of generality, to other functional features, like the terms of the Gene Ontology (GO)  annotation.
Even though function cannot be directly implied from the specific fold adopted by a certain protein, structural data can be used to detect proteins with similar functions whose sequences have diverged during evolution . In this context, one possible strategy is the definition of structural signatures, which are sets of features that are able to unequivocally identify a protein fold and the nature of interactions it can establish with other proteins and ligands. These feature sets are concise representations of protein structures, and we believe that their discovery and comprehension will be an important milestone in the protein function prediction field, being a step beyond sequence homology-based methods.
In this paper, we investigate a special type of feature that might be part of structural signatures: the patterns in inter-residue distances (or contacts). Proteins with different folds and functions present significant differences in the distribution of distances among residues as a consequence of the underlying interaction and packing of the atomic network, which is fundamental for defining protein folding . In , we have used these distribution distances to compare and correlate different methodologies of protein inter-residue contacts. We found, surprisingly, that the traditional cutoff-dependent approach was a simpler, more complete and more reliable technique for contact definition than other cutoff-independent methods, such as Delaunay tessellation , especially when the target is the discrimination of first-order contacts. In this work, we propose using inter-residue distance patterns for protein classification.
The structural data we used are the cumulative contact distributions based on the Euclidean distances among alpha carbons, the Cutoff Scanning Matrix (CSM). The motivation for the use of this kind of information lies in the fact that proteins with different folds and functions have significantly different distributions of distances between their residues, and protein similarity is reflected in these distance distributions, information that is captured in the CSM. After generating this structural data, we apply singular value decomposition (SVD) to reduce dimensionality and noise. The processed matrix is finally submitted to different, previously described classification algorithms. Therefore, the main innovation of this work lies in the combination of a new structural feature (inter-residue contacts used as a discriminator) with principal components selection by SVD, rather than in the creation of a new classification method per se. Indeed, we showed our methodology to be, in general, independent of the classifiers utilized, giving comparable results for different classification heuristics.
With these considerations in mind, we showed that the patterns derived from CSMs might effectively be used in automatic protein function prediction and structural classification. In the case of enzyme function prediction, the proposed method achieved (over the superfamilies) an average precision of 98.2% (sd = 1.6) and average recall of 97.9% (sd = 2.0), using a gold-standard dataset of enzymes . Using a much larger set of enzymes with their respective EC numbers (the 950 most-populated EC numbers in terms of available structures), CSM was able to achieve up to 95.1% precision and recall. For structural classification, considering the hierarchical levels of SCOP , we were able to accomplish an average precision of 93.5% (sd = 1.4) and average recall of 93.6% (sd = 1.4). In comparison to the state-of-the-art methods used in this context, such as that given by Jain and Hirst , using very similar database input (SCOP release 1.75), our methodology presented more robust and homogeneous results, with an average precision slightly below that of those authors (90.7% versus 93.6%) but with less dispersion (sd of 3.0 versus 6.4). We had remarkably better recall results: an average of 90.7% versus 77.0%, with significantly lower dispersion (sd of 2.9 versus 18.4). Further details are discussed in the next section.
Results and discussion
To test the ability of our method to successfully predict functions and recognize folds, we performed two sets of experiments with datasets designed for these different tasks.
For function prediction, as mentioned in the Methods section, we built one database based on manually curated protein superfamilies and another based on EC numbers to test if the present structure-based method could help in protein function annotation.
For structural classification, we performed experiments to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. Furthermore, to place this work into the context of the literature, we also tested a superset of the dataset used by Jain and Hirst in . As far as we know, their work presents the highest precision in protein fold recognition published thus far.
Finally, we report experiments that aimed to evaluate an SVD-based noise reduction strategy.
In the function prediction experiments, our goal was to assess how well three different classification algorithms predict protein function according to protein EC numbers and a mechanistically diverse gold-standard database of functional family classes . We used 10-fold cross validation for all the experiments.
For the dataset of the top 950 most-populated EC numbers, CSM was able to achieve 95.1% precision and recall after SVD processing using the KNN (K-Nearest Neighbors) algorithm. The four levels of the EC number were used together as the classes to train and test the classifier. Additional file 1, Figure S1 shows the variation in the performance metrics for each EC number class considered. Even though the number of proteins assigned to each EC number is very unbalanced, the majority of classes were classified properly, with high quality according to the metrics extracted.
[Table: Function prediction performance using KNN for the gold-standard dataset]
Protein structural classification
To the best of our knowledge, no test of the structural classification of very large databases, such as the entire SCOP containing about 110,000 domains, has been published. Because SVD reduces dimensionality and allows protein instances to be represented by a few significant attributes, our method can efficiently handle such a volume of data.
[Table: Structural classification performance using KNN for the Full-SCOP dataset]
[Table: Comparison of prediction performance]
Noise reduction strategy
As we mentioned, SVD-based noise reduction was able to improve the precision and recall levels. We obtained a gain of up to 10.3% with the KNN classifier, 35.0% with naive Bayes and 16.2% with random forest. Interestingly, we verified that the different classifiers achieved comparable results after the use of SVD for dimensionality reduction (all levels remained above 90%). Dimensionality reduction is important for scalability in this scenario because the number of known protein domains is growing exponentially. There are about 110,000 domains, i.e., instances to classify, in the SCOP database. Each of these instances can be represented by 151 attributes (dimensions) in the case of the CSM with a cutoff of up to 30 Å.
Function and fold prediction, while means of understanding the composition, operation, interaction and evolution of proteins, are still great challenges in the face of the explosive growth of protein data generation and storage in public databases. To keep up with the frenetic pace imposed by this increasing data availability, novel, efficient methods for automatic and semi-supervised annotation are needed. As a mechanism to exploit the close relationship between protein structure and function, we developed a structure-based method for function prediction and fold recognition based on protein inter-residue distance patterns. The motivation for this approach arose from the hypothesis that proteins with different structures would show different inter-residue distance patterns, and structural similarity would be reflected in these distances.
One of the most remarkable advantages of the CSM-based structural signature is its generality, as we successfully instantiated it in different problem domains, such as function and fold prediction. Also, as required for its application to databases that are continuously growing, it is scalable for real-world scenarios, such as whole-SCOP classification tasks, as shown in previous sections, and it shows an efficacy comparable or superior to state-of-the-art protein fold and function predictors. We would like to stress that our method is probably the first to present a full-SCOP automatic classification in acceptable time (a few hours on a quad-core machine).
The interpretation and understanding of the intrinsic distance patterns generated by CSM demand further investigation. As part of future studies, we intend to explore the generality of CSMs in other aspects of protein function, such as subcellular localization prediction and prediction of GO terms, as well as under different structural classification databases, such as CATH . We also plan to contrast SVD with feature selection as methods for discriminant information discovery in CSMs.
Furthermore, the significant gain in prediction power provided by SVD processing might imply that there is room to improve in terms of the data input, indicating that other cutoff ranges and granularities should also be tested, which is a study already in progress in our group.
After the data acquisition and filtering steps for a certain dataset (designed either for function prediction or fold recognition purposes), the CSMs are generated (the details of the procedure are explained later in this section). The CSM defines a feature vector that is then processed with SVD. To define a threshold value for dimensionality reduction, the singular values distribution is analyzed. The elbow of this distribution is used as a threshold for data approximation and recomposition (the explanation of the SVD procedure is detailed in the next subsections) and indicates that the contribution of the other singular values to describing the matrix is insignificant, and thus they might be seen as noise.
These singular values are then discarded. Finally, the processed CSM is submitted for classification tasks under different algorithms. Metrics such as precision and recall are calculated to assess the prediction power of the classifiers.
Cutoff scanning matrices
In a previous work , we conducted a comparative analysis between two classical methodologies to prospect residue contacts in proteins, one based on geometric aspects, and the other based on a distance threshold or cutoff, by varying (scanning) this distance to find a robust and reliable way to define these contacts. In the present work, we used the cutoff scanning approach for classification purposes, which is the basis of the CSMs. The motivation for the use of this kind of information lies in the fact that proteins with different folds and functions present significant differences in the distribution of distances between their residues. On the other hand, one can expect that proteins with similar structures would also have similar distance distributions between their residues, information that is captured in a CSM.
In this work, we vary the distance threshold from 0.0 Å to 30.0 Å, with a 0.2-Å step, which generates a vector of 151 entries for each protein. Together, these vectors compose the CSM. In short, each row of the matrix represents one protein, and each column represents the frequency of residue pairs within a certain distance. Alternatively, this frequency might be seen as the number of contacts in the protein for a certain cutoff distance or the edge count of the contact graph defined using that distance threshold. This step was implemented in the Perl programming language.
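The cutoff scanning step described above can be illustrated with a minimal Python sketch. This is not the authors' Perl implementation; the function name `csm_vector`, the use of an inclusive comparison (distance <= cutoff) and the small floating-point tolerance are assumptions made for illustration only.

```python
import math
from itertools import combinations


def csm_vector(ca_coords, d_max=30.0, step=0.2):
    """Cumulative contact counts for cutoffs 0.0, step, ..., d_max.

    ca_coords: list of (x, y, z) tuples, one per residue (C-alpha atoms).
    Entry i of the result is the number of residue pairs whose distance
    is <= i * step (inclusive comparison is an assumption of this sketch).
    """
    n_bins = int(round(d_max / step)) + 1  # 151 for the defaults
    counts = [0] * n_bins
    for a, b in combinations(ca_coords, 2):
        d = math.dist(a, b)
        if d <= d_max:
            # Index of the first cutoff that is >= d; the tolerance guards
            # against floating-point noise in d / step.
            idx = min(math.ceil(d / step - 1e-9), n_bins - 1)
            counts[idx] += 1
    # Turn the per-bin counts into a cumulative distribution.
    vector, running = [], 0
    for c in counts:
        running += c
        vector.append(running)
    return vector


# One row of the CSM for a toy "protein" with three residues.
row = csm_vector([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)])
print(len(row), row[-1])
```

Stacking one such vector per protein, row by row, gives a matrix with 151 columns under the default parameters.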
It is important to mention that other centroids could be chosen instead of the Cα, such as the Cβ or the last heavy atom (LHA) of the side chain. Additional file 1, Figure S3 shows the performance comparison between the Cα and Cβ for the EC number dataset. The Cα performed better in all experiments, a fact that demands further investigation.
Noise reduction with SVD
The SVD of an m x n matrix A is $A = T S D^{T}$, where T is an orthonormal matrix of dimensions m x m, S is a diagonal matrix of dimensions m x n and D is an orthonormal matrix with dimensions n x n. The diagonal values of S are the singular values of A, and they are ordered from the most to the least significant values.
The justification for using only $V_k$ is that the relationships among the columns of $A_k$ are preserved in $V_k$ because $T_k$ is a basis for the columns of $A_k$.
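The equations that define the rank-k quantities are not reproduced in this section. One reading consistent with the sentence above is sketched below, under the assumption that the reduced matrices keep only the k largest singular values and that $V_k$ denotes the coordinates of the columns of $A_k$ in the basis $T_k$; this definition of $V_k$ is an assumption, not a statement taken from the text.

```latex
% Full decomposition of the m x n matrix A
A = T S D^{T}

% Rank-k approximation: keep the k largest singular values
% T_k is m x k, S_k is k x k, D_k is n x k
A_k = T_k S_k D_k^{T}

% Assumed definition of the reduced representation
V_k = S_k D_k^{T}
\quad\Longrightarrow\quad
A_k = T_k V_k
```

Under this reading, each column of $A_k$ is a linear combination of the k columns of $T_k$, with coefficients given by the corresponding column of $V_k$. Because the columns of $T_k$ are orthonormal, inner products between columns of $A_k$ equal those between the corresponding columns of $V_k$.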
We evaluated the singular values distribution in an effort to find a good threshold to reduce the number of dimensions without losing information. This step, as well as the generation of all graphics, was performed via R programming language scripts.
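The text does not state how the elbow of the singular value distribution was located (the original analysis was done in R). As a hedged illustration, the sketch below uses one common heuristic: pick the point farthest from the straight line joining the first and last singular values. The function name `elbow_index` and the choice of heuristic are assumptions.

```python
import math


def elbow_index(singular_values):
    """Index of the elbow of a decreasing sequence of singular values.

    Heuristic (an assumption of this sketch): the elbow is the point with
    the largest perpendicular distance to the chord joining the first and
    last points of the curve (index, value).
    """
    n = len(singular_values)
    if n < 3:
        raise ValueError("need at least three singular values")
    x1, y1 = 0, singular_values[0]
    x2, y2 = n - 1, singular_values[-1]
    dx, dy = x2 - x1, y2 - y1
    chord_length = math.hypot(dx, dy)
    best_index, best_distance = 0, -1.0
    for i, s in enumerate(singular_values):
        # Perpendicular distance from (i, s) to the chord.
        distance = abs(dy * (i - x1) - dx * (s - y1)) / chord_length
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Toy spectrum: two dominant values followed by a slowly decaying tail.
print(elbow_index([100, 50, 10, 9, 8, 7, 6]))
```

Singular values after the returned index would then be treated as noise and discarded; whether the elbow point itself is kept is a further choice this sketch leaves open.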
An extensive series of experiments was designed to evaluate the efficacy of CSMs as a source of information for protein fold recognition and function prediction.
In the classification tasks, the Weka Toolkit , developer version 3.7.2 was used. For the gold-standard dataset, three classification algorithms were used, and their performances were compared: KNN, random forest and naive Bayes. For the other datasets, KNN was used. The algorithms’ parameters, when applicable, were varied and the best result computed. In all scenarios, 10-fold cross validation was applied. The classification performance was evaluated using metrics such as precision (Precision = TP/(TP + FP)), recall (Recall = TP/(TP + FN)), F1 score (the harmonic mean between precision and recall: F1 = 2 × Precision × Recall/(Precision + Recall)) and the Area Under the ROC Curve (AUC). The variation in precision was used to measure the gain obtained with SVD processing, and the recall variation was evaluated to compare the results with those for the dataset derived from .
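The precision, recall and F1 definitions above can be made concrete with a short Python sketch for a single class. The evaluation in the paper was carried out in Weka; the function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Returning 0.0 for an empty denominator is a convention of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example with two classes; class "a" has TP = 2, FP = 1, FN = 1.
truth = ["a", "a", "a", "b", "b"]
predicted = ["a", "a", "b", "a", "b"]
print(precision_recall_f1(truth, predicted, "a"))
```

For a multi-class problem such as EC number assignment, these per-class values would be computed for each class and then averaged; the averaging scheme is not specified here.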
We also correlated the precision obtained by the classifiers and the number of singular values considered and compared it with the results using the whole CSM.
Our datasets consisted of protein structures available in the Protein Data Bank . The domains covered by SCOP release 1.75 were obtained through the ASTRAL compendium . The protein structures were grouped according to the purpose of the experiment, namely, function prediction or fold recognition. For structures solved by NMR, we only considered the first model. The chains were split into separate files and the Cα coordinates extracted using the PDBEST toolkit.
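The extraction step above (first model only, one coordinate set per chain, Cα atoms) can be sketched in Python for files in the fixed-column PDB format. This is an illustration and not a description of PDBEST; the function name `ca_coords_by_chain` and the handling of alternate locations (keeping only blank or "A") are assumptions.

```python
from collections import defaultdict


def ca_coords_by_chain(pdb_lines):
    """Collect C-alpha coordinates per chain from the first model only.

    pdb_lines: iterable of lines in the fixed-column PDB format.
    Returns a dict mapping chain ID to a list of (x, y, z) tuples.
    """
    chains = defaultdict(list)
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # stop after the first model (relevant for NMR entries)
        if record != "ATOM" or line[12:16].strip() != "CA":
            continue
        # Alternate location indicator (column 17); keeping only the blank
        # or "A" conformer is an assumption of this sketch.
        if line[16] not in (" ", "A"):
            continue
        chain_id = line[21]
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        chains[chain_id].append(xyz)
    return dict(chains)
```

A typical call would pass an open file handle, for example `ca_coords_by_chain(open(path))` with a hypothetical `path`, and each returned coordinate list could then be used to build one CSM row.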
The first dataset concerns a gold-standard set of mechanistically diverse enzyme superfamilies . We consider six superfamilies (amidohydrolase, crotonase, enolase, haloacid dehalogenase, isoprenoid synthase type I and vicinal oxygen chelate), comprising 47 families distributed among 566 different chains. The list of PDB IDs as well as the family and superfamily assignments are available in Additional file 2.
The second dataset contains enzymes with EC numbers. We considered the top 950 most-populated EC numbers in terms of available structures, with at least 9 representatives per class, in a total of 55,474 chains, which covered 95% of the reviewed enzymes from Uniprot , i.e., the experimentally validated annotations from that database.
The third dataset originated from SCOP version 1.75 for fold recognition tasks. We selected all PDB IDs covered by SCOP with at least 10 residues and 10 representatives per node in the SCOP classification hierarchy. These IDs represented a total of 110,799, 108,332, 106,657 and 102,100 domains at the class, fold, superfamily and family levels, respectively. We would like to emphasize that this is a very large dataset and that we found no other paper reporting the use of such a complete dataset in structural classification tasks. The last dataset was derived from  for comparison in fold recognition tasks. We selected all domains described in its additional files with a minimum of 10 representatives per node in the SCOP classification hierarchy. It was not possible to identify exactly the domains they used from the additional files, and only those pairs of domains with a sequence identity below 35% were retained. It is important to stress that the work of Jain and colleagues only considers structures with 3, 4, 5 or 6 secondary structure elements.
List of abbreviations used
CSM: Cutoff Scanning Matrix
DUF: Domain of Unknown Function
SVD: Singular Value Decomposition
PDB: Protein Data Bank
SCOP: Structural Classification of Proteins
LHA: Last Heavy Atom
AUC: Area Under the ROC Curve
This work was supported by the Brazilian agencies: CAPES, CNPq, FAPEMIG and FINEP. The EC number dataset was kindly provided by Elisa Lima.
This article has been published as part of BMC Genomics Volume 12 Supplement 4, 2011: Proceedings of the 6th International Conference of the Brazilian Association for Bioinformatics and Computational Biology (X-meeting 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S4
- The UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Research. 2010, 38 (Database issue): D142-D148.
- Finn RD, Mistry J, Coggill P, Heger A, Pollington J, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Research. 2010, 38 (Database issue): D211-D222.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995, 247 (4): 536-40.
- Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C: The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 2002, 58 (Pt 6 No 1): 899-907.
- Laskowski RA, Watson JD, Thornton JM: Protein function prediction using local 3D templates. Journal of Molecular Biology. 2005, 351 (3): 614-626. 10.1016/j.jmb.2005.05.067.
- Laskowski RA, Watson JD, Thornton JM: ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Research. 2005, 33 (Web Server issue): W89-93.
- Watson JD, Roman AL, Thornton JM: Predicting protein function from sequence and structural data. Current Opinion in Structural Biology. 2005, 15 (3): 275-284. 10.1016/j.sbi.2005.04.003.
- Watson JD, Sanderson S, Ezersky A, Savchenko A, Edwards A, Orengo C, Joachimiak A, Laskowski RA, Thornton JM: Towards fully automated structure-based function prediction in structural genomics: a case study. Journal of Molecular Biology. 2007, 367 (5): 1511-1522. 10.1016/j.jmb.2007.01.063.
- Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5 (4): 823-6.
- Ding CH, Dubchak I: Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. 2001, 17 (4): 349-58. 10.1093/bioinformatics/17.4.349.
- Shen HB, Chou KC: Ensemble classifier for protein fold pattern recognition. Bioinformatics. 2006, 22 (14): 1717-22. 10.1093/bioinformatics/btl170.
- Cheng J, Baldi P: A machine learning information retrieval approach to protein fold recognition. Bioinformatics. 2006, 22 (12): 1456-63. 10.1093/bioinformatics/btl102.
- Barker JA, Thornton JM: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics. 2003, 19 (13): 1644-9. 10.1093/bioinformatics/btg226.
- Goyal K, Mohanty D, Mande SC: PAR-3D: a server to predict protein active site residues. Nucleic Acids Research. 2007, 35 (Web Server issue): W503-5.
- Stark A, Russell RB: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Research. 2003, 31 (13): 3341-4. 10.1093/nar/gkg506.
- Shazman S, Celniker G, Haber O, Glaser F, Mandel-Gutfreund Y: Patch Finder Plus (PFplus): a web server for extracting and displaying positive electrostatic patches on protein surfaces. Nucleic Acids Research. 2007, 35 (Web Server issue): W526-30.
- Babor M, Gerzon S, Raveh B, Sobolev V, Edelman M: Prediction of transition metal-binding sites from apo protein structures. Proteins. 2008, 70: 208-217.
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology. 1993, 233: 123-38. 10.1006/jmbi.1993.1489.
- Kolodny R, Koehl P, Levitt M: Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. Journal of Molecular Biology. 2005, 346 (4): 1173-88. 10.1016/j.jmb.2004.12.032.
- Dobson PD, Doig AJ: Predicting enzyme class from protein structure without alignments. Journal of Molecular Biology. 2005, 345: 187-199. 10.1016/j.jmb.2004.10.024.
- Alvarez MA, Yan C: Exploring structural modeling of proteins for kernel-based enzyme discrimination. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). 2010, 1-5.
- Punta M, Ofran Y: The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS Computational Biology. 2008, 4 (10): e1000160. 10.1371/journal.pcbi.1000160.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics. 2000, 25: 25-29. 10.1038/75556.
- Lee D, Redfern O, Orengo C: Predicting protein function from sequence and structure. Nature Reviews: Molecular Cell Biology. 2007, 8 (12): 995-1005. 10.1038/nrm2281.
- Soundararajan V, Raman R, Raguram S, Sasisekharan V, Sasisekharan R: Atomic interaction networks in the core of protein domains and their native folds. PLoS One. 2010, 5 (2): e9391. 10.1371/journal.pone.0009391.
- da Silveira CH, Pires DE, Minardi RC, Ribeiro C, Veloso CJ, Lopes JC, Meira W, Neshich G, Ramos CH, Habesch R, Santoro MM: Protein cutoff scanning: a comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins. 2009, 74 (3): 727-743. 10.1002/prot.22187.
- Delaunay B: Sur la sphere vide. A la memoire de Georges Voronoi. Izv Akad Nauk SSSR. 1934, 7: 793-800.
- Brown SD, Gerlt JA, Seffernick JL, Babbitt PC: A gold standard set of mechanistically diverse enzyme superfamilies. Genome Biology. 2006, 7: R8. 10.1186/gb-2006-7-1-r8.
- Jain P, Hirst JD: Automatic structure classification of small proteins using random forest. BMC Bioinformatics. 2010, 11 (364): 1-14.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH - a hierarchic classification of protein domain structures. Structure. 1997, 5 (8): 1093-108. 10.1016/S0969-2126(97)00260-8.
- Eldén L: Matrix Methods in Data Mining and Pattern Recognition (Fundamentals of Algorithms). Society for Industrial and Applied Mathematics. 2007.
- Eldén L: Numerical linear algebra in data mining. Acta Numerica. 2006, 15: 327-384.
- Berry MW, Dumais ST, O’Brien GW: Using linear algebra for intelligent information retrieval. SIAM review. 1995, 37 (4): 573-595. 10.1137/1037127.
- del Castillo-Negrete D, Hirshman SP, Spong DA, D’Azevedo EF: Compression of magnetohydrodynamic simulation data using singular value decomposition. Journal of Computational Physics. 2007, 222: 265-286. 10.1016/j.jcp.2006.07.022.
- Deerwester SC, Dumais ST, Furnas GW, Harshman RA, Landauer TK, Lochbaum KE, Streeter LA: Computer information retrieval using latent semantic structure. 1989.
- Witten IH, Frank E: Data Mining: Practical Machine Learning Tools and Techniques. 2005, Morgan Kaufmann, second edition.
- Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research. 2000, 28: 254-256. 10.1093/nar/28.1.254.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.