HIPPI: highly accurate protein family classification with ensembles of HMMs

Background: Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.

Results: We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy.

Conclusion: HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile hidden Markov models can better represent multiple sequence alignments than a single profile hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-3097-0) contains supplementary material, which is available to authorized users.


Commands
To run blastp [1], we built a BLAST database from the sequences in the training set. We then scored the test set against this database using blastp. All commands were run using BLAST 2.2.31+ and are given below.
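The two steps described above (build a protein database from the training sequences, then search the test sequences against it) can be sketched as follows. This is only an illustration: the file names `train.fasta`, `test.fasta`, `train_db`, and `hits.tsv` are hypothetical, the helper functions are not part of BLAST, and the tabular output format is an assumption, since the exact options used in the study are not recoverable from this text. The functions only assemble argument lists; passing them to `subprocess.run` would require a local BLAST+ installation.

```python
import shlex
from typing import List


def makeblastdb_command(train_fasta: str, db_name: str) -> List[str]:
    """Arguments for building a protein BLAST database from the training set."""
    return ["makeblastdb", "-in", train_fasta, "-dbtype", "prot", "-out", db_name]


def blastp_command(test_fasta: str, db_name: str, out_path: str) -> List[str]:
    """Arguments for scoring test sequences against the training database.

    "-outfmt 6" (tabular output) is an assumption made for this sketch.
    """
    return ["blastp", "-query", test_fasta, "-db", db_name,
            "-outfmt", "6", "-out", out_path]


if __name__ == "__main__":
    # Hypothetical file names, used only to show the assembled command lines.
    for cmd in (makeblastdb_command("train.fasta", "train_db"),
                blastp_command("test.fasta", "train_db", "hits.tsv")):
        print(shlex.join(cmd))
```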
The HIPPI module acts as a wrapper around the HMMER suite of tools and builds an ensemble of profile HMMs from a given alignment and tree. The module takes as input an alignment, a tree, and the stopping parameters. The HMMER profile HMMs were collected into a single HMM file, and the query sequences were then scored against this file using hmmsearch. All HMMER commands were run using HMMER version 3.1b2. The ML tree was estimated using FastTree-2 version 2.1.7 [4].
We provide the commands for each method below, as well as the internal calls to HMMER.
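As a rough sketch of the HMMER steps described above, the following assembles an `hmmbuild` call per alignment, concatenates the resulting profile files into a single HMM file, and assembles an `hmmsearch` call against that combined file. The helper names and all file names are hypothetical, and no options beyond `--tblout` are shown because the options used in the study are not recoverable from this text. Concatenation relies on HMMER3 accepting multiple profiles in one file.

```python
from pathlib import Path
from typing import Iterable, List


def hmmbuild_command(alignment: str, hmm_out: str) -> List[str]:
    """Arguments for building one profile HMM: hmmbuild <hmmfile_out> <msafile>."""
    return ["hmmbuild", hmm_out, alignment]


def concatenate_profiles(profile_paths: Iterable[str], combined_path: str) -> int:
    """Write all profile files into one HMM file; return how many were combined."""
    count = 0
    with open(combined_path, "w") as out:
        for path in profile_paths:
            text = Path(path).read_text()
            # Make sure each profile ends with a newline before the next begins.
            out.write(text if text.endswith("\n") else text + "\n")
            count += 1
    return count


def hmmsearch_command(combined_hmm: str, queries: str, tblout: str) -> List[str]:
    """Arguments for scoring query sequences against every profile in the file."""
    return ["hmmsearch", "--tblout", tblout, combined_hmm, queries]
```

With a local HMMER installation, each returned list could be passed to `subprocess.run(cmd, check=True)`; the sketch deliberately stops at building the arguments.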
We provide the commands for running the HHsearch-HHpred pipeline [5] below. To run the pipeline, we first ran each query sequence against the UniProt20 database using HHblits. We then converted the HHblits results into a FASTA alignment file using the hhr2fas.py Python script.
In the event that the HHblits output contained no hits (i.e., no detected homology to any sequence within the UniProt20 database), the hhr2fas.py script returned an empty alignment; in this case we replaced the empty alignment with an alignment containing only the query sequence. Next, we built HMMs on each Pfam seed alignment using HHmake. We compiled all the HMMs into an HMM database using hhsuitedb.py. Finally, we scored the quality of the HMM-HMM alignment of the HMMs created from the HHblits results against the HMM database using HHsearch. All programs and scripts listed below are from the HH-suite software version 3.0.0, with the exception of the script hhr2fas.py, which was a custom script provided to us by the authors of HH-suite.

Figure A2: Precision-recall curves for HIPPI, HMMER, and blastp, evaluated on all four cross-fold subsets of the data. The curves are estimated by varying an inclusion threshold parameter for the particular method and producing five to seven distinct points, with intermediate values interpolated linearly. Note that the scales for both axes vary between panels due to the significant impact of sequence fragmentation.
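The empty-alignment fallback in the HHblits step of the pipeline can be illustrated with a minimal sketch. It assumes the converted alignment is a FASTA file and treats a missing file, or a file with no `>` header line, as empty; the function names, file path, and query name are hypothetical and are not part of HH-suite or of the hhr2fas.py script.

```python
from pathlib import Path


def has_fasta_record(path: str) -> bool:
    """Return True if the file exists and contains at least one FASTA header."""
    p = Path(path)
    if not p.exists():
        return False
    return any(line.startswith(">") for line in p.read_text().splitlines())


def ensure_nonempty_alignment(alignment_path: str, query_name: str,
                              query_seq: str) -> bool:
    """Replace an empty alignment with one containing only the query sequence.

    Returns True if the fallback was applied, False if the file was left as-is.
    """
    if has_fasta_record(alignment_path):
        return False
    Path(alignment_path).write_text(f">{query_name}\n{query_seq}\n")
    return True
```

Running this check on every converted alignment before building the query HMM guarantees that each query contributes at least a single-sequence alignment to the HHsearch step.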