ApicoAlign: an alignment and sequence search tool for apicomplexan proteins
© Ali et al; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
In recent years, a number of genomes have been successfully sequenced, followed by genome annotation projects to help understand the biological capabilities of the newly sequenced genomes. To improve the annotation of Plasmodium falciparum proteins, we earlier developed parasite-specific substitution matrices (PfSSM) and demonstrated that two of them, Smat80 and PfFSmat60, perform better than the standard matrices (BLOSUM and PAM). Here we extend that study to nine apicomplexan species other than P. falciparum and develop a web application, ApicoAlign, for improving the annotation of apicomplexan proteins.
For apicomplexan proteins, the SMAT80 and PfFSmat60 matrices perform better than BLOSUM: SMAT80 in detecting orthologs, and PfFSmat60 in improving the alignment of these proteins with their potential orthologs. Database searches against the non-redundant (nr) database have shown that SMAT80 gives superior performance compared to the BLOSUM series in terms of E-values, bit scores, percent identity, alignment length and mismatches for most of the apicomplexan proteins studied here. Using these matrices, we were able to find orthologs in Arabidopsis thaliana for the rhomboid proteases of P. berghei, P. falciparum and P. vivax and for the large subunit of the U2 snRNP auxiliary factor of Cryptosporidium parvum. We also show improved pairwise alignments of proteins from Apicomplexa, viz. Cryptosporidium parvum and P. falciparum, with their orthologs from other species using the PfFSmat60 matrix.
The SMAT80 and PfFSmat60 substitution matrices perform better for apicomplexan proteins than the BLOSUM series. Since they can help improve the annotation and functional characterization of apicomplexan genomes, we have developed a web server, ApicoAlign, for finding orthologs and aligning apicomplexan proteins.
One of the important goals of the post-genomic era is to develop tools and services that help in the annotation of hypothetical or putative proteins of newly sequenced genomes. In the case of Plasmodium falciparum, approximately 60% of its genes did not show sequence similarity to known genes. This organism shows an unusual amino acid composition and substitution pattern in its proteins due to its extremely AT-rich genome [2, 3]. As a result, many of its proteins show little or no sequence match to known proteins in the databases, posing a major difficulty in genome annotation. To address this issue we developed the symmetric Smat series and the asymmetric PfFSmat60 and demonstrated their better performance over standard matrices (BLOSUM and PAM). Here we extend the use of these matrices to better annotate the proteins of other apicomplexans: Plasmodium berghei, Plasmodium chabaudi, Plasmodium knowlesi, Plasmodium vivax, Plasmodium yoelii yoelii, Toxoplasma gondii, Cryptosporidium parvum, Theileria parva and Neospora caninum. After benchmarking the performance of these matrices for apicomplexan proteins, we developed ApicoAlign, a web server for finding orthologs and aligning apicomplexan proteins using a novel series of matrices.
ApicoAlign is a web-based application written in Perl/CGI. The web server has five applications for apicomplexan proteins: (1) Search Database, (2) Search a genome, (3) Reciprocal Hit, (4) Best Bidirectional Hit and (5) Pairwise Alignment. Sample input buttons are provided for some apicomplexan species; they automatically load sample protein sequences into the required fields for each option. The parasite-specific symmetric matrices (Smat series), consisting of Smat50, Smat60, Smat70, Smat80 and Smat90, are provided for the first four applications. Smat matrices have earlier been demonstrated to work best for database searches of P. falciparum, and here we show their superior performance for other apicomplexans, which increases the utility of these matrices. For comparison, the standard BLOSUM62 matrix and the similar-entropy matrix BLOSUM90 are provided in the drop-down menu. For the first four applications, the default gap open and gap extension penalties are set to 10 and 1 respectively, the values defined as best for the standard matrices with entropy similar to the Smat series. A few other combinations of gap open and extension penalties are also provided for the user to try. The E-value cut-off may be defined by the user.
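The effect of a user-defined E-value cut-off can be illustrated with a minimal Python sketch. It assumes that search hits are available in the common 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score); whether ApicoAlign exposes its results in this form is an assumption, and the identifiers and values in the example rows are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    evalue: float
    bit_score: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated hit lines; skip blanks and comments."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), int(f[4]),
                        float(f[10]), float(f[11])))
    return hits


def filter_by_evalue(hits: Iterable[Hit], cutoff: float) -> List[Hit]:
    """Keep only hits whose E-value is at or below the cut-off."""
    return [h for h in hits if h.evalue <= cutoff]


# Hypothetical rows, for illustration only (not results from the study).
EXAMPLE_LINES = [
    "apiA\tathX\t35.2\t210\t120\t4\t1\t210\t5\t214\t2e-10\t61.2",
    "apiA\tathY\t28.0\t90\t60\t2\t10\t99\t3\t92\t0.31\t23.4",
]

kept = filter_by_evalue(parse_tabular(EXAMPLE_LINES), cutoff=1e-5)
print([h.subject for h in kept])
```

Running the same filter on the hit lists produced with two different matrices gives a simple way to compare how many hits each matrix retains at a given cut-off.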
Search Database
The non-redundant (nr), Swiss-Prot and PDB databases are provided for finding orthologs of apicomplexan proteins using parasite-specific and standard matrices. The input should be a single protein sequence in FASTA format, which can be pasted in the text box provided or uploaded as a file.
Search a genome
This option is provided for finding hits for apicomplexan proteins across the different genomes listed in the drop-down menu. The input is protein sequences in FASTA format, which can be pasted in the text box or uploaded as a file (up to 5 MB).
Reciprocal Hit
This option is provided for finding reciprocal hits for apicomplexan proteins across the different genomes listed in the drop-down menu. The input is protein sequences in FASTA format, which can be pasted in the text box or uploaded as a file (up to 5 MB).
Best Bidirectional Hit
The BBH (Bidirectional Best Hit) method is employed to search for potential orthologs of apicomplexan proteins across a range of organisms. The input for bidirectional ortholog detection is a protein sequence file for the query genome and one for the subject, both in FASTA format. The subject proteome may either be selected from the list of organisms provided on the web page or, for a user-specific sequence file, uploaded through the file upload option. Large sequence files may take longer to run, and the size of the uploaded query and subject sequence files is limited to 25 MB.
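The bidirectional best hit criterion itself can be sketched in a few lines of Python. This is a minimal illustration rather than the server's implementation: it assumes each search direction has already been reduced to (query, subject, bit score) triples, it ranks hits by bit score, and the identifiers and scores are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

HitRow = Tuple[str, str, float]  # (query_id, subject_id, bit_score)


def best_hits(rows: Iterable[HitRow]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward: Iterable[HitRow],
                            reverse: Iterable[HitRow]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical hits: genome A searched against genome B, and B against A.
forward = [("apiA", "athX", 120.0), ("apiA", "athY", 80.0), ("apiB", "athY", 60.0)]
reverse = [("athX", "apiA", 118.0), ("athY", "apiA", 75.0), ("athY", "apiB", 50.0)]

print(bidirectional_best_hits(forward, reverse))
```

In this toy data apiB's best hit is athY, but athY's best hit is apiA, so only the apiA/athX pair satisfies the criterion in both directions.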
Pairwise Alignment
The pairwise alignment option uses the water program (EMBOSS package, version 6.3.1) to perform local alignments of the apicomplexan query protein and its potential ortholog. The asymmetric parasite-specific matrix PfFSmat60 is provided for performing these alignments, along with the standard matrices EBLOSUM62, EBLOSUM90 and EPAM200. PfFSmat60 has been demonstrated to perform best for pairwise alignments where the alignments span motif-like regions of the protein. PfFSmat60 is a scaled version of a unique asymmetric matrix, used here for improving the alignment of an apicomplexan protein with its strongly suspected ortholog. Hence, users are discouraged from using this matrix indiscriminately for non-orthologous proteins. The input is a single protein sequence in FASTA format for the query as well as for the subject. The user may provide the gap open and extension penalties for the pairwise alignment or use the default values. PfFSmat60 was developed in the context of Plasmodium falciparum and represents unidirectional substitutions; in this study we extend its usage to other apicomplexans. Because the matrix is asymmetric, one limitation of the pairwise alignment is that the query sequence must be the apicomplexan protein, so the order of the query and subject proteins should not be reversed.
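Why the query/subject order matters with an asymmetric matrix can be seen in a small Python sketch of local alignment scoring with affine gaps (a Smith-Waterman/Gotoh recursion). This is not the EMBOSS water code: the two-letter matrix below is a hypothetical toy, not PfFSmat60, and the gap convention (the first gap position costs the open penalty, each further position the extension penalty) is an assumption of this sketch.

```python
from typing import Dict, Tuple

NEG_INF = float("-inf")


def local_alignment_score(query: str, subject: str,
                          matrix: Dict[Tuple[str, str], float],
                          gap_open: float = 10.0,
                          gap_extend: float = 1.0) -> float:
    """Best local alignment score; matrix is keyed by (query_res, subject_res)."""
    cols = len(subject)
    h_prev = [0.0] * (cols + 1)      # best scores ending at the previous row
    f_prev = [NEG_INF] * (cols + 1)  # previous row, alignment ending in a vertical gap
    best = 0.0
    for i in range(1, len(query) + 1):
        h_cur = [0.0] * (cols + 1)
        f_cur = [NEG_INF] * (cols + 1)
        e = NEG_INF                  # current row, alignment ending in a horizontal gap
        for j in range(1, cols + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            diag = h_prev[j - 1] + matrix[(query[i - 1], subject[j - 1])]
            h_cur[j] = max(0.0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# Hypothetical asymmetric toy matrix: a query N opposite a subject K is
# rewarded, while a query K opposite a subject N is penalised.
TOY = {("N", "N"): 4, ("K", "K"): 5, ("N", "K"): 2, ("K", "N"): -3}

print(local_alignment_score("NNK", "KKK", TOY))  # query = first argument
print(local_alignment_score("KKK", "NNK", TOY))  # same pair, order reversed
```

With a symmetric matrix the two calls would return the same score; with the asymmetric toy matrix they differ, which is the reason the apicomplexan protein has to be supplied as the query.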
Results and discussion
To check whether the Plasmodium falciparum Specific Substitution Matrices (SMAT and PfFSmat) perform better for other apicomplexan species, we carried out database searches against the non-redundant (nr) database and identified best bidirectional hits across different bacterial and eukaryotic genomes using the BLOSUM and SMAT series of matrices.
Amino acid composition of different apicomplexan species
Best Bidirectional Hits
Examples and applications
Best Bidirectional Hit for apicomplexan rhomboid proteases
The proteins PBANKA_110650 (P. berghei), PFE0340c (P. falciparum) and PVX_097905 (P. vivax) do not give any BBH in Arabidopsis thaliana using the BLOSUM62 and BLOSUM90 matrices. The present annotation (in PlasmoDB version 7.2) of PBANKA_110650 and PFE0340c is rhomboid protease, while that of PVX_097905 is conserved hypothetical protein. Using the SMAT80 matrix, we obtained a single BBH in Arabidopsis thaliana for all three proteins: ATRBL5 (Arabidopsis Rhomboid-like protein-5, gi:15219034). All four proteins (PBANKA_110650, PFE0340c, PVX_097905 and gi:15219034) have the same molecular function (serine-type endopeptidase activity, GO:0004252) and are integral to the membrane (GO:0016021). Therefore we can safely consider these four proteins to be true orthologs of each other and predict that PVX_097905 (presently labelled as a conserved hypothetical protein in PlasmoDB version 7.2) is a rhomboid protease.
Best Bidirectional Hit for splicing factor subunit of Cryptosporidium parvum
The protein cgd2_1480 is the large subunit of the U2 snRNP auxiliary factor of Cryptosporidium parvum; it does not give any BBH in Arabidopsis thaliana using the BLOSUM62 and BLOSUM90 matrices. The SMAT80 matrix gives a BBH for this protein in Arabidopsis thaliana (gi: 30696485) with the same annotation. The E-values are 2e-10 when Cryptosporidium parvum is used as the query and 5e-12 when it is used as the subject.
Alignment of experimentally characterized glutathione S-transferase from an apicomplexan
Pairwise alignments of plasmodia GSTs with yeast GST. Plasmodium proteins listed: P. berghei (PBANKA_102390), P. chabaudi (PCHAS_102470), P. falciparum (PF14_0187), P. knowlesi (PKH_132970), P. vivax (PVX_085515) and P. yoelii (PY05088).
Alignment of experimentally characterized protein kinase
The eukaryotic protein kinases (ePKs) are a large family of enzymes with crucial roles in most cellular processes; hence malarial ePKs represent potential drug targets. In Plasmodium falciparum, PF11_0220 (Molecular Function GO:0004672, protein kinase activity, evidence code IDA, source: http://www.plasmodb.org) is a known protein kinase that aligns poorly with a known yeast protein kinase when the BLOSUM series of matrices is used. The pairwise alignment of PF11_0220 against the PIK-related protein kinase and rapamycin target of Saccharomyces cerevisiae (gi: 6322526) was performed with standard matrices and PfFSmat60 using the fasta program (FASTA package, version 3) (Additional File 10: Supplementary Figure 8) and the water program (EMBOSS package, version 6.3.1) (data not shown). We observed an improvement in alignment with PfFSmat60 over the other matrices irrespective of the program used. With the fasta program, BLOSUM50 gave an alignment score of 23.4 bits at an E-value of 0.31 with an overlap of only 71 amino acid residues, and BLOSUM100 gave an alignment score of 18.1 bits at an E-value of 1 with an overlap of only 16 amino acid residues. PfFSmat60 gave an alignment score of 4872.5 bits at an E-value of 0.0 with an overlap of 1990 amino acid residues. PAM2, a similar-entropy matrix, gave an insignificant alignment with a score of 22.8 bits and an E-value of 0.43 for an overlap of only 6 amino acid residues.
Alignment of two experimentally known Acyl CoA binding proteins
Acyl CoA Binding Proteins (ACBPs) are generally small (10 kD), highly conserved proteins found in all four eukaryotic kingdoms (Animalia, Plantae, Fungi and Protista) and in only eleven eubacterial species; they have not so far been found in any other known bacterial species or in archaea. The long-type ACBPs containing ankyrin repeats have been characterized experimentally in Cryptosporidium parvum and Arabidopsis thaliana. Zeng et al. characterized CpACBP1 (Cryptosporidium parvum Acyl CoA binding protein), which contains an acyl CoA binding domain and ankyrin repeats, while Xiao et al. [12, 13] studied a similar type of ACBP with ankyrin repeats, ACBP1 from Arabidopsis thaliana. Therefore we can safely consider Arabidopsis ACBP1 to be a true ortholog of CpACBP1. We performed pairwise alignments of these two ACBPs using different matrices with the fasta and water programs. PfFSmat60 improved the alignment, although the statistical gain was small; this was expected, because ACBPs are highly conserved and align well even with standard matrices. Our purpose here was to compare the alignment of an apicomplexan protein with its experimentally known ortholog using standard matrices and PfFSmat60 (Additional File 11: Supplementary Table 5).
Bi-functional enzyme of shikimate pathway across apicomplexan genomes
Example of a missing metabolic enzyme - Acylglycerol lipase
Recently Mohanty and Srinivasan attempted to identify some of the missing enzymes from the parasite genome using multiple profiles for every protein domain family. One of the predicted missing enzymes was a conserved Plasmodium protein of unknown function, MAL7P1.156, which they predicted to be an acylglycerol lipase associated with the glycerol biosynthesis pathway. Saccharomyces cerevisiae has four experimentally characterized triglyceride lipases: tgl2p (gi:6320263), tgl3p (gi:6323973), tgl4p (gi:6322942) and tgl5p (gi:6324655). We submitted MAL7P1.156 as the query and compared it with all four yeast lipases; the best match was observed with tgl5p (gi:6324655) (BLOSUM50 E-value 0.15; BLOSUM100 E-value 1.0; PfFSmat60 E-value 8.4e-156; PAM2 E-value 1.0) (Additional File 17: Supplementary Table 3). PfFSmat60 gives a much longer and qualitatively better alignment than the standard matrices, as it covers the whole of the functionally important patatin domain (residues 183-388) of yeast lipase tgl5p (gi:6324655) (Additional File 18: Supplementary Figure 13).
The PfSSM (Plasmodium falciparum Specific Substitution Matrices) were developed for P. falciparum, particularly for those proteins that do not find orthologs in other eukaryotes or that align very poorly with their orthologs. In this study, database searches, best bidirectional hits and improved pairwise alignments of apicomplexan proteins have shown that these matrices also perform better for apicomplexan species other than Plasmodium falciparum and can thus help improve their annotation. To give researchers working on apicomplexan species access to these matrices, we developed a web server, ApicoAlign, for detecting orthologs and aligning apicomplexan proteins. The tool will be most useful for those apicomplexan proteins that do not give any ortholog in other eukaryotes, or that align poorly at the sequence level, when matrices of the BLOSUM and PAM series are used.
Amino acid composition of different apicomplexan species
We compared the amino acid composition of all the proteins of ten apicomplexan genomes (Plasmodium berghei, Plasmodium chabaudi, Plasmodium falciparum, Plasmodium knowlesi, Plasmodium vivax, Plasmodium yoelii yoelii, Toxoplasma gondii, Cryptosporidium parvum, Neospora caninum and Theileria parva) with that of the non-apicomplexan Mycobacterium tuberculosis genome. Proteins annotated with the terms “hypothetical”, “putative” or “unknown function” were removed from all the genomes. For each genome, a matrix was generated in which each row represents a protein and each of the 20 columns an amino acid, by calculating the fraction of each amino acid in each protein. The mean was calculated for each column (amino acid), giving 20 means for 20 amino acids per genome. The amino acids were divided into four categories based on their properties: non-polar amino acids (glycine, alanine, valine, leucine, isoleucine, methionine, proline, phenylalanine and tryptophan), polar amino acids with no charge (serine, tyrosine, threonine, cysteine, asparagine and glutamine), positively charged amino acids (arginine, histidine and lysine) and negatively charged amino acids (aspartate and glutamate). The means of the amino acids in each category were then used to calculate the p-value of Student's t-test between any two genomes. In addition, p-values for a two-tailed t-test for correlated samples were calculated for each individual amino acid fraction obtained from the ortholog set of these organisms. The higher the p-value of the t-test, the closer the two genomes are in terms of amino acid composition.
The complete protein datasets of Plasmodium berghei, Plasmodium chabaudi, Plasmodium falciparum, Plasmodium knowlesi, Plasmodium vivax and Plasmodium yoelii yoelii were downloaded from PlasmoDB release 7.0, those of Toxoplasma gondii and Neospora caninum from ToxoDB release 6.2, and that of Cryptosporidium parvum from CryptoDB release 4.3. For all other organisms used in this study and for the web server, the whole protein datasets were downloaded from the NCBI ftp site.
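The composition calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: proteins are passed as plain sequence strings, residues outside the 20 standard amino acids are ignored, and the function names and example sequences are hypothetical. The t-test step is left out; a statistics library routine could be applied to the per-category lists of two genomes.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The four property-based categories, in one-letter code.
CATEGORIES = {
    "non-polar": "GAVLIMPFW",
    "polar uncharged": "STYCNQ",
    "positive": "RHK",
    "negative": "DE",
}


def composition(sequence: str) -> Dict[str, float]:
    """Fraction of each of the 20 amino acids in one protein (one matrix row)."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def genome_means(proteins: List[str]) -> Dict[str, float]:
    """Column means of the protein-by-amino-acid matrix: 20 means per genome."""
    rows = [composition(p) for p in proteins]
    return {aa: mean(row[aa] for row in rows) for aa in AMINO_ACIDS}


def category_means(means: Dict[str, float]) -> Dict[str, List[float]]:
    """Group the 20 per-genome means by amino acid category."""
    return {name: [means[aa] for aa in members]
            for name, members in CATEGORIES.items()}


# Hypothetical two-protein "genome", for illustration only.
toy_genome = ["NNKK", "NKKK"]
toy_means = genome_means(toy_genome)
print(toy_means["N"], toy_means["K"])
print(category_means(toy_means)["positive"])
```

The per-category lists returned by `category_means` for two genomes are the samples that would then be compared with a t-test.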
Availability and requirements
Project name: ApicoAlign
Project home page: http://www.cdfd.org.in/apicoalign/
Operating system(s): Platform independent; not web browser specific.
Other requirements: An internet connection and any web browser, such as Firefox or Internet Explorer.
Any restrictions to use by non-academics: No restriction.
We thank S.F. Altschul, Tom Madden and Kevin Brick for help with the BLAST source code compilation, S.F. Altschul for providing the Island program, and W. Pearson for the fasta package.
Funding: CSIR-NMITLI and DBT (to A.R.); CSIR fellowship (to U.P.); UGC fellowship (to J.A.).
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419: 498-511. 10.1038/nature01097.
- Paila U, Kondam R, Ranjan A: Genome bias influences amino acid choices: analysis of amino acid substitution and re-compilation of substitution matrices exclusive to an AT-biased genome. Nucleic Acids Res. 2008, 36: 6664-6675. 10.1093/nar/gkn635.
- Brick K, Pizzi E: A novel series of compositionally biased substitution matrices for comparing Plasmodium proteins. BMC Bioinformatics. 2008, 9: 236. 10.1186/1471-2105-9-236.
- Hulsen T, Huynen MA, Vlieg J, Groenen PMA: Benchmarking ortholog identification methods using functional genomics data. Genome Biol. 2006, 7: R31. 10.1186/gb-2006-7-4-r31.
- Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
- Aurrecoechea C, Brestelli J, Brunk BP, Dommer J, Fischer S, Gajria B, et al: PlasmoDB: a functional genomic database for malaria parasites. Nucleic Acids Res. 2009, 37: D539-D543. 10.1093/nar/gkn814.
- PlasmoDB: Plasmodium Genomics Resource. [http://plasmodb.org/plasmo/]
- Ward P, Equinet L, Packer J, Doerig C: Protein kinases of the human malaria parasite Plasmodium falciparum: the kinome of a divergent eukaryote. BMC Genomics. 2004, 5: 79. 10.1186/1471-2164-5-79.
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA. 1988, 85: 2444-2448. 10.1073/pnas.85.8.2444.
- Burton M, Rose TM, Faergeman NJ, Knudsen J: Evolution of the acyl-CoA binding protein (ACBP). Biochem J. 2005, 392: 299-307. 10.1042/BJ20050664.
- Zeng B, Cai X, Zhu G: Functional characterization of a fatty acyl-CoA binding protein (ACBP) from the apicomplexan Cryptosporidium parvum. Microbiology. 2006, 152: 2355-2363. 10.1099/mic.0.28944-0.
- Xiao S, Chye ML: An Arabidopsis family of six acyl-CoA-binding proteins has three cytosolic members. Plant Physiol Biochem. 2009, 47: 479-84. 10.1016/j.plaphy.2008.12.002.
- Xiao S, Gao W, Chen QF, Ramalingam S, Chye ML: Overexpression of membrane-associated acyl-CoA-binding protein ACBP1 enhances lead tolerance in Arabidopsis. Plant J. 2008, 54: 141-51. 10.1111/j.1365-313X.2008.03402.x.
- Dieckmann A, Jung A: Mechanisms of sulfadoxine resistance in Plasmodium falciparum. Mol. Biochem. Parasitol. 1986, 19: 143-147. 10.1016/0166-6851(86)90119-2.
- Roberts F, Roberts CW, Johnson JJ, Kyle DE, Krell T, Coggins JR, Coombs GH, Milhous WK, Tzipori S, Ferguson DJP, Chakrabarti D, McLeod R: Evidence for the shikimate pathway in apicomplexan parasites. Nature. 1998, 393: 801-805. 10.1038/31723.
- McConkey GA, Pinney JW, Westhead DR, Plueckhahn K, Fitzpatrick TB, Macheroux P, Kappes B: Annotating the Plasmodium genome and the enigma of the shikimate pathway. Trends Parasitol. 2004, 20: 60-65. 10.1016/j.pt.2003.11.001.
- Mohanty S, Srinivasan N: Identification of “missing” metabolic proteins of Plasmodium falciparum: a bioinformatic approach. Protein Pept Lett. 2009, 16: 961-8. 10.2174/092986609788923257.
- ToxoDB: Toxoplasma Genomics Resource. [http://toxodb.org/toxo/]
- CryptoDB: Cryptosporidium Genomics Resource. [http://cryptodb.org/cryptodb/]
- EuPathDB: The Eukaryotic Pathogen Genome Resource. [http://eupathdb.org/eupathdb/]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215: 403-410.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.