Performance of rotation forest ensemble classifier and feature extractor in predicting protein interactions using amino acid sequences

Bustamam, Alhadi; Musti, Mohamad I. S.; Hartomo, Susilo; Aprilia, Shirley; Tampubolon, Patuan P.; Lestari, Dian

doi:10.1186/s12864-019-6304-y

Volume 20 Supplement 9

18th International Conference on Bioinformatics

Research
Open access
Published: 24 December 2019

Alhadi Bustamam (ORCID: orcid.org/0000-0002-7408-074X), Mohamad I. S. Musti, Susilo Hartomo, Shirley Aprilia, Patuan P. Tampubolon & Dian Lestari

BMC Genomics volume 20, Article number: 950 (2019)

Abstract

Background

Predicting protein-protein interactions from amino acid sequences poses two significant problems. The first is representing each sequence as a feature vector, and the second is designing a model that can identify the protein interactions. Effective feature extraction methods can therefore lead to improved model performance. In this study, we used two feature extraction methods, global encoding and pseudo-substitution matrix representation (PseudoSMR), to represent the amino acid sequences of human proteins and Human Immunodeficiency Virus type 1 (HIV-1) proteins and thereby address the classification problem of predicting protein-protein interactions. We also compared principal component analysis (PCA) with independent principal component analysis (IPCA) as transformation methods within Rotation Forest.

Results

The results show that global encoding and PseudoSMR, used as feature extraction methods, successfully represent amino acid sequences for the Rotation Forest classifier with either PCA or IPCA. This can be seen from the evaluation metrics, which were >73% across the six different parameter values. The accuracy of both methods was >74%. The results for the other model performance criteria, such as sensitivity, specificity, precision, and F1-score, were all >73%. The data used in this study can be accessed using the following link: https://www.dsc.ui.ac.id/research/amino-acid-pred/.

Conclusions

Both global encoding and PseudoSMR can successfully represent the sequences of amino acids. Rotation Forest (PCA) performed better than Rotation Forest (IPCA) in predicting protein-protein interactions between HIV-1 and human proteins. Both the Rotation Forest (PCA) classifier and the Rotation Forest (IPCA) classifier performed better than other classifiers, such as Gradient Boosting, K-Nearest Neighbor, Logistic Regression, Random Forest, and Support Vector Machine (SVM). Rotation Forest (PCA) and Rotation Forest (IPCA) have accuracy, sensitivity, specificity, precision, and F1-score values >70%, while the other classifiers have values <70%.

Background

Proteins are polymers composed of amino acid monomers linked by peptide bonds, and they are essential for the survival of an organism. According to [1], a protein is a linear, chain-like polymer molecule comprising 10 to thousands of monomer units that are connected like beads in a necklace, with each monomer being one of the 20 natural amino acids. Proteins play an important role in forming the structural components of organisms, and they also carry out the metabolic reactions needed to sustain life [2]. As essential macromolecules, proteins rarely act as isolated agents; instead, they must interact with other proteins to perform their functions properly [3]. Protein interactions play a central role in the many cellular functions carried out by all organisms. Thus, when irregularities occur in protein interactions, bodily malfunctions, such as autoimmune conditions, cancer, or even virus-borne diseases, can arise.

Widespread recognition of the participation of proteins in all cellular processes has led researchers to predict protein function from amino acid sequences or protein structures on the basis of their interactions. Because most protein functions are driven by interactions with other proteins, developing a better understanding of protein structures should lead to a clearer picture of the impact and benefits of protein interactions [4]. Protein interactions also play a central role in medical research, as it is often necessary to understand them when developing drugs designed to prevent or break the interactions between proteins that can result in disease.

The study of protein interactions generally involves the use of either experimental or computational methods. Experimental methods, such as Yeast Two-Hybrid (Y2H), Tandem Affinity Purification, and Mass Spectrometric Protein Complex Identification (MS-PCI), are known to have a number of disadvantages, including the substantial time required to identify protein interactions and the ability to identify only a small part of the overall set of protein interactions, which can lead to significant errors in research outcomes [5]. Protein-protein interactions (PPIs) are usually represented as a graph in which the nodes represent the proteins and the edges represent the interactions between them [6]. However, a graph representation can only group known interactions into clusters. To predict new interactions, we have to use the amino acid sequences.
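
As a minimal sketch of the graph view described above, the snippet below stores a few interactions as an undirected adjacency map and groups the proteins into connected clusters. The protein labels (`P1` to `P5`) and the helper names `build_graph` and `clusters` are hypothetical placeholders for illustration, not data or code from this study.

```python
from collections import defaultdict, deque

def build_graph(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def clusters(graph):
    """Return the connected components, i.e. clusters of linked proteins."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(sorted(component))
    return result

# Hypothetical placeholder interactions, for illustration only.
known = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
ppi = build_graph(known)
groups = clusters(ppi)
```

The graph can report that `P1` and `P3` fall in the same cluster, but it carries no information about a protein that has no recorded edges, which is why sequence-derived features are needed for prediction.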

When identifying protein-protein interactions from amino acid sequences, computational methods must solve two major problems: effectively representing a sequence as a feature vector that can be analyzed, and designing a model that can identify protein interactions accurately and quickly. To solve these problems, computational methods generally apply a two-stage approach involving feature extraction followed by machine learning [7].
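
To make the first stage concrete, the sketch below turns a sequence into a fixed-length numeric vector using plain amino acid composition. This is a deliberately simple stand-in, not global encoding or PseudoSMR, and both the toy sequences and the choice to concatenate the two per-protein vectors into one pair vector are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Return the 20-dimensional relative frequency of each residue."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]

def pair_features(seq_a, seq_b):
    """Concatenate the two per-protein vectors into one pair vector."""
    return composition(seq_a) + composition(seq_b)

# Hypothetical toy sequences, not taken from the study's dataset.
x = pair_features("MKTAYIAKQR", "GAVLIPFMWG")
```

Whatever the length of the input sequences, every protein pair maps to a vector of the same size, which is what the second (machine learning) stage requires.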

Effective feature extraction methods are required so that an amino acid sequence can represent a whole protein. An effective feature extraction method improves model performance by extracting the relevant information from an amino acid sequence and representing it as a feature vector for further analysis via machine learning [7]. Feature extraction has therefore become one of the most important factors in the successful classification of proteins based on their constituent amino acids. The success or failure of a classification method in identifying protein interactions from amino acid sequences depends not only on whether the classifier itself is effective but also on how well the feature extraction method represents each sequence in the input feature vectors that the classifier analyzes. Many studies have focused on developing feature extraction methods for amino acid sequences for use in further machine learning analysis. Sharma et al. [8] recognized protein folds using bi-gram features extracted from the position-specific scoring matrix (PSSM), with a Support Vector Machine (SVM) as the classifier. Dehzangi et al. [9] used the bi-gram feature technique to predict protein subcellular localization in prokaryotic microorganisms, i.e., Gram-positive and Gram-negative bacteria. Huang et al. [7] developed a successful feature extraction approach called global encoding, which they combined with a weighted sparse representation classifier to predict protein interactions from amino acid sequences. In a related study, pseudo-substitution matrix representation (PseudoSMR) features were also found to be useful when the weighted sparse representation method was applied to the identification of interactions between proteins [3].

Machine learning methods adopt algorithms or mathematical models to perform classification, and they have been used to develop multiple classifier systems (MCSs). An MCS can be implemented either by applying multiple classification methods to a given dataset or by applying a single method to several different data subsets. Commonly used classifiers include Gradient Boosting, K-Nearest Neighbor, Logistic Regression, Random Forest, and SVM. For example, SVM and Naïve Bayes classifiers have been used to analyze the texture of 3D brain MRI images [10]. In 2006, Rodriguez et al. [11] proposed Rotation Forest, an ensemble classifier and a type of MCS that trains multiple decision trees on several data subsets. The method draws on the bagging and Random Forest algorithms: it applies principal component analysis (PCA) to the data subsets and then rotates the dataset with the resulting matrix before the decision trees are built. The rotation process produces decision trees that are mutually independent. Although PCA is applied, all principal components (PCs) are retained to build the decision trees, which ensures the completeness of the data. This method has been shown to perform well in identifying protein interactions based on amino acid sequences [5, 12].
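
The rotation step can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the function name `rotation_matrix` and the parameters `n_subsets` and `sample_frac` are hypothetical, the data are synthetic, the class-subset sampling used in the original Rotation Forest algorithm is omitted, and the decision tree training itself is left out.

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng, sample_frac=0.75):
    """Build one p x p rotation matrix from PCA on disjoint feature subsets."""
    n, p = X.shape
    if not 1 <= n_subsets <= p:
        raise ValueError("n_subsets must be between 1 and the number of features")
    # Randomly partition the feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(p), n_subsets)
    R = np.zeros((p, p))
    for idx in subsets:
        # Bootstrap sample of rows, restricted to this feature subset.
        rows = rng.choice(n, size=max(2, int(sample_frac * n)), replace=True)
        block = X[np.ix_(rows, idx)]
        block = block - block.mean(axis=0)
        # PCA via SVD; full_matrices=True keeps every principal axis,
        # so no component is discarded.
        _, _, vt = np.linalg.svd(block, full_matrices=True)
        R[np.ix_(idx, idx)] = vt.T
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))      # synthetic stand-in for feature vectors
R = rotation_matrix(X, n_subsets=2, rng=rng)
X_rotated = X @ R                 # rotated input for one decision tree
```

Because every principal axis is kept, each block is orthogonal and the assembled matrix only rotates the data without discarding information. Repeating the procedure with a fresh random partition for each tree gives every tree a different view of the same data.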

The success of feature extraction methods, such as global encoding and PseudoSMR, in extracting the features of amino acid sequences for use as input data, together with the usefulness of the Rotation Forest method as a classification method for predicting amino acid sequences, suggests that these methods could be combined into a system to successfully predict PPIs, which was the goal of this study. We also assessed the performance of the Rotation Forest classifier under two different transformation methods: PCA and independent principal component analysis (IPCA). Yao et al. introduced IPCA as a method for successfully combining the respective advantages of PCA and independent component analysis (ICA) for uncovering independent principal components (IPCs) [13].

Kuncheva and Rodriguez [14] demonstrated that PCA could be successfully applied as a Rotation Forest transformation method, and that it was more accurate than random projection and nonparametric discriminant analysis. The higher accuracy of PCA is due to its ability to produce rotation matrices with very small correlations, characterized by a reduced cumulative proportion of matrix diversity, which enables the formation of mutually independent decision trees within an ensemble system. Thus, PCA guarantees a diversity of decision trees under the Rotation Forest method in the same manner as a random separation of the independent variables of the data. This prevents the production of large numbers of conjectures that can cause the model to make inconsistent decisions. Therefore, PCA can play an important role in improving the accuracy of the Rotation Forest method while ensuring the diversity of the established ensemble systems.

As mentioned earlier, Yao et al. [13] developed a dimensionality reduction method that works in a manner similar to PCA. Their method transforms an initial dataset to reduce its dimensionality while retaining transformed components that can represent the data as a whole. The method applies PCA in an initial stage to produce a loading matrix, which contains the coefficients of the linear combinations of the original variables used to produce the PCs, for input into an ICA stage [13]. Because the PCA loading matrix for biological data will still contain a large amount of noise, ICA is used to generate a new loading matrix that contains little or no noise and from which the relevant information can be extracted. ICA is used in this process because of its known ability to find hidden (latent) variables in noisy data [15]. The IPCA process thus produces an independent loading vector matrix, which is then applied as a rotation matrix to the initial dataset to produce a set of IPCs.
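
The two stages described above (PCA loadings first, then ICA on those loadings) can be sketched as follows. This is an illustrative simplification, not the published IPCA procedure: the function names are hypothetical, the ICA step is a bare-bones symmetric FastICA with a tanh contrast written here for self-containment, any ordering or rescaling of the resulting vectors is omitted, and the data are synthetic. The sketch also assumes fewer components than variables, because the simple whitening step breaks down when the centred loading matrix is rank-deficient.

```python
import numpy as np

def pca_loadings(X, n_components):
    """Return the p x n_components PCA loading matrix (rows of X = samples)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:n_components].T

def _sym_decorrelate(W):
    """W <- (W W^T)^(-1/2) W, which makes the rows of W orthonormal."""
    s, u = np.linalg.eigh(W @ W.T)
    return u @ np.diag(1.0 / np.sqrt(s)) @ u.T @ W

def independent_loadings(V, rng, n_iter=200, tol=1e-6):
    """Apply symmetric FastICA to the columns of the loading matrix V (p x k)."""
    p, k = V.shape
    Vc = (V - V.mean(axis=0)).T                  # k x p, one row per loading vector
    d, e = np.linalg.eigh(Vc @ Vc.T / p)         # k x k covariance of the loadings
    Z = np.diag(1.0 / np.sqrt(d)) @ e.T @ Vc     # whitened loadings
    W = _sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w for every row w.
        W_new = G @ Z.T / p - np.diag((1.0 - G**2).mean(axis=1)) @ W
        W_new = _sym_decorrelate(W_new)
        delta = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return (W @ Z).T                             # p x k independent loading vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                     # synthetic data, 8 variables
V = pca_loadings(X, n_components=3)
S = independent_loadings(V, rng)
ipcs = (X - X.mean(axis=0)) @ S                  # independent principal components
```

Projecting the centred data onto the independent loading vectors in `S` plays the role that projecting onto the PCA loadings `V` plays in ordinary PCA.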

The IPCA method is often used for clustering and for dimensionality reduction. In the present study, IPCA was not used to perform these tasks; instead, it was applied within the Rotation Forest method to transform the original variables into new variables using an independent loading vector matrix in which all of the PCs in the PCA loading matrix were retained. This use of IPCA as a transformation method under Rotation Forest for predicting protein interactions based on amino acid sequences represents a novel approach in the literature; accordingly, it was tested further by comparing the performance of the Rotation Forest method under both PCA and IPCA, with global encoding used for feature extraction. The proposed method was then applied to the amino acid sequences of Human Immunodeficiency Virus type 1 (HIV-1) and human proteins to identify new human proteins that can interact with HIV-1 proteins, based on a comparison between the respective sequences in both organisms.

HIV

Although viruses are the smallest reproductive structures, they have a substantial range of abilities. A virus generally consists of four to six genes that are capable of taking over the biological processes within a host cell during its reproductive process [16]. The virus forces the host cell to produce new viruses by inserting its genetic information, in the form of DNA and viral RNA, into the cell. This process compromises the host cell to the point that it dies when the virus reproduction process is complete.

HIV attacks the human immune system. The virus is often also referred to as an intracellular obligate retrovirus because of its ability to convert single-stranded RNA into double-helix DNA within infected cells, and then merge it with the target cell’s DNA, forcing it to replicate into new viruses [16]. The targets are cells that can express CD4 receptors, which play an important role in maintaining immune system cells, such as T-lymphocytes. In fact, damage to or destruction of even one T-lymphocyte cell can lead to the failure of the entire specific immune response to attacks from harmful pathogens, even, ironically, from HIV itself [16].

HIV infects the human body through protein interactions. The HIV glycoprotein 120 binds to specific T-cell receptors to bond the virus to the target cell. This bond is then reinforced by a co-receptor, a transmembrane receptor such as CC Chemokine Receptor 5 (CCR5) or CXC Chemokine Receptor 4 (CXCR4), which binds through 100 interactions between the viral proteins and the target cells. Once binding has occurred, HIV glycoprotein 41 allows the virus to enter the target cell membrane, and its reverse transcriptase enzyme converts the single-stranded RNA into double-stranded viral DNA that is carried into the target cell nucleus and inserted into the cell’s DNA by an integrase enzyme. Once this occurs, the integrated viral DNA is referred to as a provirus.

The joined viral and host DNA is transcribed by a polymerase enzyme to produce genomic RNA and mRNA. The RNA is exported from the cell nucleus, and the mRNA is translated into a polypeptide, which is then incorporated with the RNA into a new viral core and assembled at the surface of the target cell. Protease enzymes then cleave the polypeptide into new proteins and other functional enzymes. This process results in new HIV particles that are ready to infect other target cells that express the CD4 receptor. The reproduction of HIV gradually causes the immune system to fail, leaving the body unable to fight various types of diseases and infections, known as opportunistic infections; ultimately, this can result in full-blown Acquired Immunodeficiency Syndrome.

Results

In this study, we used R = 2, 3, 4, 5, 6, and 7 for global encoding and Lg = 2, 3, 5, 6, 8, and 10 for PseudoSMR. The values of R and Lg differ because we wanted to compare feature vectors whose dimensions are not too different, and the dimension depends on the value of each parameter. We also used K = 1, 5, 10, 15, 20, and p/3 and L = 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 as the parameters in the Rotation Forest (PCA) and Rotation Forest (IPCA) methods. Tables 1 and 2 show the performance evaluation results obtained from Rotation Forest (PCA) and Rotation Forest (IPCA), respectively, each combined with global encoding, for various values of L, K, and R. For both methods, the best scores tended to occur for K = p/3 at various values of L and R. The results presented in both tables indicate that global encoding successfully represents sequences of amino acids as a feature extraction method; this is seen from the evaluation metric results, which were >73% across the six distinct values of the global encoding parameter.
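
The parameter sweep for one classifier variant with global encoding can be enumerated as in the sketch below. It assumes six R values running from 2 to 7, rounds p/3 down to an integer, and takes the feature dimension p for each R from a lookup table; the dimensions in `HYPOTHETICAL_DIMS` are made-up placeholders, since the actual dimensions follow from the feature extraction method.

```python
R_VALUES = [2, 3, 4, 5, 6, 7]              # global encoding parameter
L_VALUES = list(range(10, 101, 10))        # 10, 20, ..., 100

# Made-up feature dimensions per R, for illustration only.
HYPOTHETICAL_DIMS = {r: 60 * r for r in R_VALUES}

def k_values(p):
    """K settings for a feature dimension p; the last entry is p/3, rounded down."""
    return [1, 5, 10, 15, 20, p // 3]

def grid(dim_for_r):
    """List every (R, K, L) combination for one classifier variant."""
    settings = []
    for r in R_VALUES:
        for k in k_values(dim_for_r[r]):
            for l in L_VALUES:
                settings.append((r, k, l))
    return settings

settings = grid(HYPOTHETICAL_DIMS)
```

Under these assumptions each of Rotation Forest (PCA) and Rotation Forest (IPCA) is evaluated on 6 x 6 x 10 = 360 parameter combinations.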

Table 1 Performance of Rotation Forest (PCA) combined with global encoding
