The parasite specific substitution matrices improve the annotation of apicomplexan proteins
© Ali et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
A number of apicomplexan genomes have been sequenced successfully in recent years, which should help in understanding the biology of apicomplexan parasites. The members of the phylum Apicomplexa are important protozoan parasites (Plasmodium, Toxoplasma, Cryptosporidium, etc.) that cause deadly diseases in humans and animals. In our earlier studies, we showed that the standard BLOSUM matrices are not suitable for compositionally biased apicomplexan proteins. We therefore developed a novel series of substitution matrices (SMAT and PfFSmat60) that performed better than the standard BLOSUM matrices, and we developed ApicoAlign, a sequence search and alignment tool for apicomplexan proteins. In this study, we demonstrate the higher specificity of these matrices and attempt to improve the annotation of apicomplexan kinases and proteases.
The ROC curves showed that, in terms of detecting true positives, SMAT80 performs best for apicomplexan proteins, followed by compositionally adjusted BLOSUM62 (PSI-BLAST searches), BLOSUM90 and BLOSUM62. The poor E-values and/or bit scores given by the SMAT80 matrix for the experimentally identified coccidia-specific oocyst wall proteins against hematozoan (non-coccidian) parasites further supported the higher specificity of SMAT80. SMAT80 uniquely detected orthologs (missed by BLOSUM) for 1374 apicomplexan hypothetical proteins against the SwissProt database and predicted 70 kinases and 17 proteases. Further analysis confirmed the conservation of the functional residues of the kinase domain in one of the SMAT80-detected kinases. Similarly, one of the SMAT80-detected proteases was predicted to be a rhomboid protease.
The parasite-specific substitution matrices have higher specificity for apicomplexan proteins and help to detect orthologs missed by the BLOSUM matrices. They thereby improve the annotation of apicomplexan proteins that are hypothetical or of unknown function.
One of the most important and challenging tasks of the post-genomic era is to improve the annotation of newly sequenced genomes in general and of parasite genomes in particular. The members of the phylum Apicomplexa are important protozoan parasites that cause deadly diseases in humans and animals [1, 2]. They include parasites such as Plasmodium, Toxoplasma, Eimeria, Neospora, Cryptosporidium, Babesia and Theileria. Apicomplexan genomics started with the completion of the Plasmodium falciparum genome sequence, and no homology was detected for approximately 60% of its genes. Later, a number of apicomplexan parasite genomes were sequenced successfully, followed by genome annotation projects intended to help in understanding the biology of these parasites [4–8]. The amino acid substitution and composition in P. falciparum proteins were unusual, and the standard matrices (BLOSUM and PAM) did not detect orthologs and/or gave poor alignments for many P. falciparum proteins [9–11]. To address this issue we developed an alternative, a novel series of substitution matrices (SMAT and PfFSmat60), and demonstrated their superior performance over the standard matrices (BLOSUM and PAM) for P. falciparum proteins in particular and for apicomplexan proteins in general. We further demonstrated that the amino acid compositions of the proteins of nine apicomplexan parasites (Toxoplasma gondii, Neospora caninum, Theileria parva, Cryptosporidium parvum, P. berghei, P. chabaudi, P. knowlesi, P. vivax and P. yoelii yoelii) were similar to that of P. falciparum. Because of this shared unusual amino acid composition, these matrices (originally developed for P. falciparum) performed better than the standard BLOSUM and PAM matrices for other apicomplexan proteins as well.
Moreover, to give researchers working on apicomplexan parasites access to this novel series of matrices, a web server, ApicoAlign (http://www.cdfd.org.in/apicoalign/), was developed to detect orthologs and align apicomplexan proteins. In the present study, we compare the performance of these matrices with that of compositionally adjusted matrices (sensitive PSI-BLAST searches) in terms of the detection of true and false positives, an important aspect missing from our earlier studies [9, 10]. Many protein families, such as kinases, are under-represented in apicomplexan parasites, probably because the standard matrices (BLOSUM and PAM) could not detect them during genome annotation. SMAT80 uniquely detected (i.e. BLOSUM matrices missed) completely or partially annotated orthologous proteins for 1374 apicomplexan hypothetical proteins against the SwissProt database.
Table. Average bit scores for C. parvum proteins of purified oocysts (columns: average bit score with SMAT80, average bit score with BLOSUM62; row: Plasmodium yoelii yoelii).
BLOSUM62 (the default option in BLAST) is the most commonly used matrix for detecting orthologs. However, our previous studies [9, 10] and the present study show that the choice of matrix can significantly improve ortholog detection. SMAT80 uniquely detected orthologs for 16, 166, 11, 21, 32, 72, 31, 185, 717, 7, 3, 5, 291, 20 and 191 proteins of Babesia bovis, Theileria annulata, Theileria parva, P. berghei, P. chabaudi, P. falciparum, P. knowlesi, P. vivax, P. yoelii yoelii, Cryptosporidium hominis, Cryptosporidium muris, Cryptosporidium parvum, Eimeria tenella, Neospora caninum and Toxoplasma gondii respectively (Figure 4). For these 1768 apicomplexan proteins, BLOSUM62 and BLOSUM90 could not identify any ortholog against the SwissProt database, and 1374 of the 1768 are labeled as hypothetical proteins in EuPathDB version 2.14. The list of these proteins and their subject hits, along with % identity, E-value and score, is provided in Additional File 5. The annotations of the SMAT80 hits (BLAST hits detected using the SMAT80 matrix) for these apicomplexan proteins include 70 kinases, 14 phosphatases, 3 heat shock proteins, 17 proteases and several other proteins.
The eukaryotic protein kinases (ePKs) belong to a very extensive family of proteins that play crucial roles in most cellular pathways [16, 17]; apicomplexan kinases therefore represent potential drug targets. Ward and coworkers carried out an exhaustive analysis of the P. falciparum kinome and found only 65 typical ePKs. This is surprising because the Saccharomyces cerevisiae genome is half the size of the P. falciparum genome but encodes approximately twice as many ePKs. We speculate that the standard BLOSUM matrices were not able to detect orthologs for many malarial protein kinases because of the unusual amino acid composition [9, 10] of apicomplexan proteins. Indeed, a novel family (FIKK) of protein kinases was reported, Schneider and coworkers detected many other kinases of the same family, and it was considered an Apicomplexan-specific protein kinase family [18, 19].
Several studies [20–24] have suggested that proteases are important for invasion by apicomplexan parasites. Wu and coworkers revealed hidden families of proteases in the malaria parasite genome, and the completion of apicomplexan genomes provides a basis for identifying new proteases. The SwissProt hits uniquely detected by SMAT80 for 17 apicomplexan hypothetical proteins (Additional File 11) have protease annotations, i.e. SMAT80 predicts these hypothetical proteins to be proteases. A conserved domain search in batch mode at the NCBI site was carried out for these 17 proteins but found hits for only 8 of them. In this search, PVX_114890 (presently labeled as a conserved hypothetical protein in PlasmoDB version 9.0) gave hits for the rhomboid superfamily of proteases (Additional File 12). The GO terms of PVX_114890 for molecular function and cellular component were GO:0004252 (serine-type endopeptidase activity) and GO:0016021 (integral to membrane) respectively. SMAT80 therefore correctly predicted it to be a protease, and we conclude that it is a putative rhomboid protease. A complete list of apicomplexan hypothetical proteins whose subject hits (against SwissProt using any of the three matrices) were probable or known proteases is provided in Additional File 13. The GRAVY (grand average of hydropathy) values were calculated for these SMAT80-detected proteases (described in Methods). Of the 17 proteases, four, namely TP01_0999 (1.041), ETH_00005295 (0.301), TA05135 (0.244) and ETH_00042245 (0.049), had positive GRAVY values, indicating a hydrophobic nature. The remaining 13 probable proteases had negative values ranging from -0.013 (PVX_114890) to -1.421 (PY06720), indicating a hydrophilic nature (Additional File 14). Rhomboid proteases are integral membrane proteins, so we expect them to have positive GRAVY values or negative values very close to zero.
Six SMAT80-predicted proteases (TP01_0999, ETH_00005295, TA05135, ETH_00042245, PVX_114890 and ETH_00006170) with positive or slightly negative GRAVY values are more likely than the others to be rhomboid proteases (Additional File 14).
In our previous study, we showed that the amino acid compositions of the proteins of nine apicomplexan species (P. berghei, P. chabaudi, P. knowlesi, P. vivax, P. yoelii yoelii, T. gondii, C. parvum, T. parva and N. caninum) were similar to that of P. falciparum proteins. We carried out a similar amino acid composition study for all 15 apicomplexan genomes and observed that, in comparison to Mycobacterium tuberculosis proteins, all of them have an unusual amino acid composition like that of P. falciparum (data not shown). As discussed earlier, SMAT80 uniquely detected orthologs for 1374 apicomplexan hypothetical proteins and predicted 70 kinases and 17 proteases among them. We compared the amino acid composition of these SMAT80-predicted kinases and proteases with that of yeast kinases and proteases respectively in terms of p-values (described in Methods). The apicomplexan proteins and their yeast counterparts had a very similar content of positively charged amino acids, with p-values of 0.88 and 0.90 for kinases and proteases respectively (Additional File 15). In contrast, the SMAT80-predicted apicomplexan kinases and proteases differed significantly from the yeast kinases and proteases respectively in their content of non-polar and negatively charged amino acids (Additional File 15). We think that this is one of the reasons why the BLOSUM matrices could not detect orthologs for these proteins.
The available genomes of apicomplexan parasites contain a significant number of hypothetical proteins, and improving the annotation of these proteins is one of the most important and challenging tasks of the post-genomic era. We think one probable reason for this is that the standard matrices (BLOSUM and PAM) could not detect orthologs for many compositionally biased apicomplexan proteins [9, 10]. Using the SMAT80 matrix in BLAST searches against the SwissProt database, we were able to find orthologs for 1374 such apicomplexan hypothetical proteins. The subject annotations for these 1374 proteins included 70 kinases, 14 phosphatases, 3 heat shock proteins, 17 proteases and several other important proteins, so SMAT80 assigned probable functions to these hypothetical proteins. The conserved domain search at the NCBI site did not find a kinase domain in any of the 70 SMAT80-predicted kinases but found one rhomboid protease among the 17 SMAT80-predicted proteases. However, further analysis of one of the predicted kinases (PY07003) revealed that the key functional residues of the kinase domain were conserved in this protein. Similarly, one of the proteases (PVX_114890) was integral to the membrane and had serine-type endopeptidase activity, two features characteristic of rhomboid proteases. SMAT80 therefore correctly predicted it to be a protease, and we conclude that it is a putative rhomboid protease. The hydrophobicity/hydrophilicity of these SMAT80-predicted apicomplexan kinases and proteases was also calculated in terms of GRAVY values. These probable apicomplexan kinases and proteases had significantly different contents of non-polar and negatively charged amino acids in comparison to yeast kinases and proteases respectively, and we think this was one of the reasons why the BLOSUM matrices could not detect orthologs for these proteins.
We also studied the performance of the apicomplexan parasite-specific matrices in terms of ROC curves, an important aspect missing from our earlier studies [9, 10]. These ROC curves indicated the higher specificity of the SMAT80 matrix even against PSI-BLAST searches using the compositionally adjusted BLOSUM62 matrix, which underlines the role of these parasite-specific matrices in BLAST searches for apicomplexan proteins. This higher specificity was also examined in a biological context: for the experimentally identified coccidia-specific oocyst wall proteins, SMAT80 gave BLAST hits with very poor E-values and/or bit scores (compared to BLOSUM62) against hematozoan parasites, which are not expected to have oocyst wall proteins. We have provided, in the supplementary material, the lists of apicomplexan hypothetical proteins to which SMAT80 could assign some function. We hope that these data will be useful for researchers working on apicomplexan parasites in general and for those working on apicomplexan kinases and proteases in particular.
PiroplasmaDB version 1.1 data were used for B. bovis, T. annulata and T. parva; PlasmoDB release 8.0 [27, 28] data for P. berghei, P. chabaudi, P. falciparum, P. knowlesi, P. vivax and P. yoelii yoelii; ToxoDB release 7.0 data for E. tenella, N. caninum and T. gondii; and CryptoDB release 4.3 data for Cryptosporidium hominis, C. muris and C. parvum. The whole protein datasets from the NCBI ftp site were used for the other organisms in this study, and the SwissProt/UniProt database was downloaded from the EBI ftp site.
The pairwise alignments using the BLOSUM62 and PfFSmat60 matrices were carried out using the ApicoAlign web server (http://www.cdfd.org.in/apicoalign/) developed by us. The blastp program of the standalone BLAST software (ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/old/20051206) was used for local BLAST searches. The blastp program accepted the SMAT series of matrices after some modifications to its source code [9, 10]. The default gap open and extension penalties were used for BLOSUM62, while for BLOSUM90 and SMAT80 the gap open and extension penalties were 10 and 1 respectively (the best parameters for matrices with entropies similar to that of BLOSUM90). Shell scripts were written using awk, sed and perl to find the Best Bidirectional Hits between two organisms, to find the best non-self hits common to two matrices, and for other small tasks. The two-tailed P-values for amino acid fractions (as correlated samples) were calculated using VassarStats (http://vassarstats.net/), a website for statistical computation. R (version 2.10.1, http://www.r-project.org/) was used for various calculations and for making graphs.
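The Best Bidirectional Hit step can be illustrated with a short sketch. The original scripts were written in awk, sed and perl and are not reproduced here; the Python below is only an illustration that assumes each BLAST hit has already been parsed into a (query, subject, bit score) tuple. The function names `best_hits` and `bidirectional_best_hits` are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query_id, subject_id, bit_score) tuples,
    an assumed pre-parsed form of tabular BLAST output.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy example with made-up identifiers and scores.
a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0)]
b_vs_a = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 80.0)]
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy input only a1 and b1 pick each other as best hit, so a2 and b2 are not reported as a pair.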
The BLAST searches (blastp program) were carried out for all the proteins of the 15 apicomplexan parasites against the SwissProt database using the SMAT80, BLOSUM90 and BLOSUM62 matrices. The hits were classified into eight categories according to how SMAT80 compared with BLOSUM90: (1) better or similar E-values, better or similar scores and better or similar % identity; (2) better or similar E-values, better or similar scores and poorer % identity; (3) better or similar E-values, poorer scores and better or similar % identity; (4) better or similar E-values, poorer scores and poorer % identity; (5) poorer E-values, better or similar scores and better or similar % identity; (6) poorer E-values, better or similar scores and poorer % identity; (7) poorer E-values, poorer scores and better or similar % identity; and (8) poorer E-values, poorer scores and poorer % identity. Only the best non-self hits were considered when calculating the percentage of proteins in each category for the 15 apicomplexan parasites.
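The eight categories amount to three yes/no comparisons, which the following sketch makes explicit. It assumes that "better or similar" means a lower-or-equal E-value, or a higher-or-equal score or % identity, with no tolerance band; the text does not state how "similar" was defined, so this threshold and the function name `classify_hit` are assumptions.

```python
def classify_hit(smat80, blosum90):
    """Return the category number (1 to 8) for one protein's best non-self hit.

    Each argument is an (e_value, score, percent_identity) tuple.
    Assumption: "better or similar" is taken as <= for E-values and
    >= for scores and % identity.
    """
    e_ok = smat80[0] <= blosum90[0]
    score_ok = smat80[1] >= blosum90[1]
    identity_ok = smat80[2] >= blosum90[2]
    # Categories are ordered like a 3-bit counter:
    # E-value is the most significant bit, % identity the least.
    return 1 + 4 * (not e_ok) + 2 * (not score_ok) + (not identity_ok)


# Made-up values: better E-value and score, but lower % identity.
category = classify_hit((1e-20, 100.0, 30.0), (1e-10, 90.0, 35.0))
```

Counting the returned category over all best non-self hits of a genome and dividing by the number of proteins gives the per-category percentages.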
A unique dataset of all P. berghei and P. yoelii proteins with an assigned gene ontology was constructed, and all P. berghei vs. all P. yoelii BLAST searches were carried out using the BLOSUM62, BLOSUM90, SMAT80 and compositionally adjusted BLOSUM62 (BL62adj) matrices. The standalone PSI-BLAST searches were performed using the blastpgp program of the NCBI BLAST software, with option -t 2 for the compositionally adjusted BLOSUM62 matrix. The BLAST hits (E-value cut-off 1e-10), ranked by bit score, were compared using the GO identifiers of each pair of query and subject sequences. Only those hits where the query and the subject proteins shared gene ontologies were considered true positives (TP); the remaining hits were considered false positives (FP). The numbers of false positives and true positives were used to make ROCn curves, and for every curve we calculated the area under the curve (AUCn). Here, n was chosen to be 162, as this was the maximum number of false positives present in all searches (BLOSUM62, BLOSUM90, SMAT80 and BL62adj).
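A minimal sketch of the ROCn calculation follows. It assumes the hits are already sorted by decreasing bit score and labelled as true or false positives by shared GO identifiers. The normalisation used here (sum of true positives counted at each of the first n false positives, divided by n times the total number of true positives) is a common convention and an assumption, since the text does not give the exact formula. The function names are hypothetical.

```python
def is_true_positive(query_go, subject_go):
    """A hit counts as a true positive if query and subject share a GO identifier."""
    return bool(set(query_go) & set(subject_go))


def roc_n(labels, n):
    """Return (tp_at_each_fp, auc_n) for hits ranked by decreasing bit score.

    `labels` holds True for a true positive and False for a false positive.
    Assumption: if fewer than n false positives occur, the curve is padded
    with the final true-positive count.
    """
    labels = list(labels)
    total_tp = sum(1 for is_tp in labels if is_tp)
    tp_seen = 0
    tp_at_each_fp = []
    for is_tp in labels:
        if len(tp_at_each_fp) == n:
            break
        if is_tp:
            tp_seen += 1
        else:
            tp_at_each_fp.append(tp_seen)
    tp_at_each_fp += [tp_seen] * (n - len(tp_at_each_fp))
    auc_n = sum(tp_at_each_fp) / (n * total_tp) if total_tp else 0.0
    return tp_at_each_fp, auc_n


# Toy ranking: TP, TP, FP, TP, FP, evaluated up to the second false positive.
curve, auc = roc_n([True, True, False, True, False], n=2)
```

With the value reported in the text, the same function would be called with n=162 on each matrix's ranked hit list.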
The average hydropathy values for the SMAT80-detected apicomplexan protein kinases and proteases were calculated using the "Sequence Manipulation Suite" (http://www.bioinformatics.org/sms2/protein_gravy.html), which gives "Protein GRAVY" (grand average of hydropathy) values for protein sequences. A GRAVY value is calculated by summing the hydropathy values of all amino acids in a sequence and dividing the sum by the length of the sequence. The algorithm is based on the method developed by Kyte and Doolittle. The grand average hydropathicity index of a protein indicates its solubility, with a positive GRAVY indicating hydrophobicity and a negative GRAVY indicating hydrophilicity. The hydrophobicity profiles in Figure 5B were constructed using the AlignMe tool (http://www.bioinfo.mpg.de/AlignMe/index.html).
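The GRAVY calculation described above can be sketched in a few lines. This is an illustration using the published Kyte and Doolittle hydropathy scale, not the Sequence Manipulation Suite code. Skipping non-standard residue symbols is an assumption made here for simplicity, and the function name `gravy` is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.

    Assumption: symbols outside the 20 standard amino acids (e.g. X, *)
    are ignored and do not count towards the length.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


# A made-up hydrophobic stretch and a made-up charged stretch.
hydrophobic = gravy("LIVLLAVI")
hydrophilic = gravy("DEKRDEKR")
```

A positive result, as for the first toy sequence, points to a hydrophobic protein, and a negative result, as for the second, to a hydrophilic one.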
The amino acid compositions in terms of P-values for the 15 apicomplexan parasites used in this study were calculated using the same methodology described earlier by us. The amino acids were grouped into four categories: non-polar, polar with no charge, positively charged and negatively charged amino acids (see  for details). The protein sequences of yeast kinases and proteases were downloaded in FASTA format from AmiGO version 1.8. The amino acid composition of the 70 SMAT80-predicted apicomplexan kinases was compared with that of yeast kinases, and similarly the composition of the 17 SMAT80-predicted apicomplexan proteases was compared with that of yeast proteases.
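The four-category composition can be sketched as follows. The assignment of individual amino acids to categories is given in the cited earlier study and is not restated here, so the grouping below is a common textbook grouping used as an assumption. The function name `group_fractions` is hypothetical, and the P-value step (done with VassarStats) is not reproduced.

```python
from collections import Counter

# Assumed textbook grouping; the grouping actually used may differ.
GROUPS = {
    "non-polar": set("GAVLIMFWP"),
    "polar, no charge": set("STCYNQ"),
    "positively charged": set("KRH"),
    "negatively charged": set("DE"),
}


def group_fractions(sequences):
    """Return the fraction of residues in each category over a set of proteins."""
    counts = Counter()
    for sequence in sequences:
        for aa in sequence.upper():
            for name, members in GROUPS.items():
                if aa in members:
                    counts[name] += 1
                    break
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {name: counts[name] / total for name in GROUPS}


# Two made-up sequences standing in for a protein set.
fractions = group_fractions(["MKKLDE", "GAVS"])
```

Computing these fractions for the apicomplexan set and for the yeast set gives the paired values that a correlated-samples test would then compare.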
We acknowledge Umadevi Paila (present address: Centre for Public Health Genomics, University of Virginia, Charlottesville, VA - 22908, USA), who started the work on substitution matrices in our laboratory. JA is registered as a PhD student (Registration number: 060100015) with Manipal University, though all the research work was carried out at the Centre for DNA Fingerprinting and Diagnostics (CDFD), Hyderabad, India.
Funding: We acknowledge CDFD for payment of open access charges. JA acknowledges UGC (University Grants Commission, India) and CDFD for a Senior Research Fellowship, SRT acknowledges DBT (Department of Biotechnology, India) for a postdoctoral fellowship, and AR acknowledges a DBT research grant. JA also acknowledges travel support from APBioNet and the Department of Science & Technology, India (Ref No. SR/ITS/3160/2012-2013) to attend InCoB 2012.
This article has been published as part of BMC Genomics Volume 13 Supplement 7, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S7.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.