In silico secretome analysis approach for next generation sequencing transcriptomic data
© Garg and Ranganathan; licensee BioMed Central Ltd. 2011
Published: 30 November 2011
Excretory/secretory proteins (ESPs) play a major role in parasitic infection as they are present at the host-parasite interface and regulate the host immune system. In the case of parasitic helminths, transcriptomics has been used extensively to understand the molecular basis of parasitism and to develop novel therapeutic strategies against parasitic infections. However, none of the transcriptomic studies have extensively covered ES protein prediction for identifying novel therapeutic targets, especially as parasites adopt non-classical secretion pathways.
We developed a semi-automated computational approach for the prediction and annotation of ES proteins using transcriptomic data from next generation sequencing platforms. For the prediction of non-classically secreted proteins, we have used an improved computational strategy, together with homology matching to a dataset of experimentally determined parasitic helminth ES proteins. We applied this protocol to analyse 454 short reads of the parasitic nematode Strongyloides ratti. From 296231 reads, we derived 28901 contigs, which were translated into 20877 proteins. Based on our improved ES protein prediction pipeline, we identified 2572 ES proteins: 407 (1.9%) proteins have classical N-terminal signal peptides, 923 (4.4%) were computationally identified as non-classically secreted, and 1516 (7.26%) were identified by homology to experimentally identified parasitic helminth ES proteins. Out of 2572 ES proteins, 2310 (89.8%) had homologues in the free-living nematode Caenorhabditis elegans and 2220 (86.3%) in parasitic nematodes. We could functionally annotate 1591 (61.8%) ES proteins with protein families and domains and establish pathway associations for 691 (26.8%) proteins. In addition, we have identified 19 representative ES proteins, which have no homologues in the host organism but are homologous to lethal RNAi phenotypes in C. elegans, as potential therapeutic targets.
We report a comprehensive approach using freely available computational tools for the secretome analysis of NGS data. This approach has been applied to S. ratti 454 transcriptomic data for in silico excretory/secretory protein prediction and analysis, providing a foundation for developing new therapeutic solutions for parasitic infections.
The secretome of an organism is defined as the subset of proteins secreted by the cell. This subset of proteins, usually known as excretory/secretory (ES) proteins, plays an important role in producing clinical infections in the host organism. ES proteins are the targets of choice for new therapeutic solutions for different clinical infections, especially in the case of parasitic infections [3, 4], because these proteins are present at the host-parasite interface and act as immunoregulators of host immune recognition, allowing the parasite to survive inside the host organism.
Transcriptomic data is the representation of actively expressed genes in a cell at any given time. Earlier transcriptomic studies were based on expressed sequence tags (ESTs) generated at different stages of an organism using traditional Sanger sequencing. These studies were restricted to the analysis of a few thousand ESTs at a time. Recent technological improvements in cDNA sequencing, using next generation sequencing (NGS) platforms, make it possible to generate millions of reads that record the transcript profile of an organism at a given developmental stage. The read length generated through NGS is quite short (50-400 bases) as compared to traditional Sanger sequencing (800-1000 bases). Thus, the assembly of shorter reads is challenging in terms of the computational power and resources needed. These reads are assembled into long consensus sequences (clusters) known as contigs using assemblers such as ABySS, Velvet and MIRA, which have been reviewed in a recent study. ABySS and Velvet provide good results for genome assembly, while MIRA is very well tested for handling de novo transcriptome assembly. Since the genomes of only a very few parasitic nematodes are currently available, de novo assemblers such as MIRA are the only option for NGS data from these neglected organisms.
Recently, NGS platforms have been used to generate large amounts of transcriptomic data for different organisms, including several helminth parasites such as Fasciola gigantica, Fasciola hepatica, Trichostrongylus colubriformis, Oesophagostomum dentatum, Haemonchus contortus, Dictyocaulus viviparus, Necator americanus, Clonorchis sinensis, Opisthorchis viverrini and Teladorsagia circumcincta. In these studies, NGS data has been assembled with CAP3 alone [14, 16] or with MIRA followed by CAP3 [12, 18], based on a recent study in which combinations of assemblers performed better. However, none of these studies have extensively covered ES protein prediction and further analysis for identifying therapeutic targets.
ES proteins were once considered to be secreted only through conventional secretion pathways, using N-terminal signal peptide signatures, but many proteins are now known to be secreted by non-classical secretory pathways. Non-classical secretory proteins are usually predicted with SecretomeP, which is the most widely used tool for this purpose. However, in the case of parasites, SecretomeP is not able to predict all non-classical secretory proteins, as shown in the study of Brugia malayi. Hence, a novel approach to identifying non-classically secreted proteins is required for comprehensive secretome analysis.
Transcriptomic data has been used extensively for the prediction of ES proteins in parasitic helminth studies. EST2Secretome, a computational prediction and annotation pipeline for ES proteins from our group, was designed to handle ESTs from Sanger sequencing and currently has limitations in the following areas: (i) assembly of short reads, (ii) prediction of non-classical secretory proteins and (iii) pathway mapping using KOBAS [24, 25], which contains pathways that are not regularly updated.
In the present study, we have developed an updated computational approach for the prediction and annotation of ES proteins using NGS transcriptomic data, overcoming the limitations of the earlier EST2Secretome pipeline. We have developed a robust assembly protocol for NGS data. In order to identify non-classically secreted proteins that are missed by SecretomeP, we have also compiled a dataset of experimentally determined ES proteins of parasitic helminths for homology-based prediction (details in the Methods section). Additionally, we have replaced KOBAS with KAAS, for efficient and up-to-date pathway identification.
We applied our approach to ~0.3M 454 transcriptomic reads from the parasitic nematode Strongyloides ratti, a gastrointestinal nematode that infects rats (comprehensively reviewed by Viney) and a Clade IV parasite. Genome data are available only for the free-living nematodes C. elegans and C. briggsae from Clade V, which is adjacent to Clade IV, and for the parasite Brugia malayi from Clade III, which is not similar to Clade IV parasites, whereas only limited transcriptomic and proteomic data from experimental studies are available for several helminth parasites. As such, a BLASTX search against a reference organism, as proposed recently, will not provide comprehensive annotation results unless the fully annotated proteome of a very similar organism is available.
In the adult phase, S. ratti is present in both parasitic (females only) and free-living forms (male and female). Eggs produced by parasitic females develop into free-living males, free-living females and parasitic females through different larval stages. Our dataset is derived from the adult nematode, which includes parasitic and free-living forms (sequencing details in the Methods section). The NGS data has been clustered and translated into proteins, and ES proteins have been predicted using a series of computational tools, augmented by homology matching to our in-house dataset of experimentally determined parasitic helminth ES proteins. Predicted ES proteins have been annotated functionally in terms of protein families, domains and biochemical pathways. ES proteins have also been compared with proteomic data of the host (rat) and other nematodes, with an emphasis on the best characterized nematode, C. elegans. Such annotation techniques have enabled us to identify 19 novel targets, matching lethal RNAi phenotypes in C. elegans, which could be considered in the development of future therapeutic strategies.
For this study, S. ratti cDNA sequencing data from the University of Liverpool is used. cDNA libraries were prepared from adult helminths, comprising a mixture of parasitic females, free-living males and free-living females. Sequencing was performed using the 454-FLX platform (Roche Diagnostics). The pyrosequencing procedure used to prepare this dataset is described elsewhere.
FASTA and associated quality files were extracted from the SFF file, with clipping of sequence adapters, using the sff_extract software. The extracted data is first assembled with the MIRA (V3.2.0rc1) assembler using the quality information. MIRA is our preferred assembler as it is an open source tool which is considered reliable for data from different NGS platforms and it has been very well tested in other parasitic helminth transcriptomic studies [12, 18]. For this dataset, we compared MIRA, ABySS and Velvet with Newbler (data not shown), and MIRA gave the longest contigs. Contigs generated by MIRA are further passed to the Contig Assembly Program (CAP3) to extend the MIRA assembly. This is in accord with an earlier study which suggests that serial assembly with two assemblers can improve the quality of the assembly. Second order contigs generated using CAP3 are combined with MIRA contigs, to be conceptually translated into putative proteins using ESTScan.
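The assembler comparison above rests on contig length. As a minimal sketch of how such a comparison could be made, the following Python computes the contig count, longest contig and N50 from FASTA-formatted lines. The function names and the use of N50 are illustrative assumptions; the study itself only states that MIRA gave the longest contigs.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {sequence id: sequence length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first whitespace-delimited token of the header.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths: Dict[str, int]) -> Dict[str, int]:
    """Basic assembly statistics for comparing the output of two assemblers."""
    values = list(lengths.values())
    return {
        "contigs": len(values),
        "longest": max(values, default=0),
        "n50": n50(values),
    }


# Toy input standing in for an assembler's contig file (hypothetical data).
toy_fasta = [">c1 example", "ACGTACGT", "ACGT", ">c2", "ACGT", ">c3", "AC"]
toy_summary = summarize(read_fasta_lengths(toy_fasta))
```

Running `summarize` on the contig files produced by each assembler would give comparable figures for each assembly.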
ES proteins were predicted using a combination of four tools: SecretomeP, SignalP, TargetP and TMHMM. SignalP is used for the prediction of classical secretory proteins, while SecretomeP predicts non-classical secretory proteins. TargetP is used for the prediction of mitochondrial proteins and TMHMM identifies transmembrane proteins. Firstly, the proteins generated by ESTScan are passed to SignalP for prediction of classical secreted proteins. All the proteins predicted as non-secretory (proteins having a D score and a signal peptide probability of less than 0.5) are then passed to SecretomeP for prediction of non-classical secretory proteins. Proteins which obtain a neural network (NN) score greater than or equal to 0.9 are considered non-classical secretory proteins. All the classical and non-classical secretory proteins are merged and then scanned by TargetP. Proteins predicted as mitochondrial by TargetP are omitted from the set of predicted ES proteins, and the remaining proteins are passed to TMHMM. Finally, the proteins which are predicted to have no transmembrane helices are considered ES proteins.
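The filtering cascade described above can be sketched as a small decision function. This is an illustration only: it assumes that the scores from all four tools have already been parsed into one record per protein (in the actual workflow, SecretomeP is run only on SignalP-negative proteins), the record fields are hypothetical names, and the SignalP rule is one reading of the stated criterion, in which a protein is non-secretory only when both the D score and the signal peptide probability are below 0.5.

```python
from dataclasses import dataclass
from typing import Optional

SIGNALP_CUTOFF = 0.5      # D score / signal peptide probability threshold from the text
SECRETOMEP_CUTOFF = 0.9   # SecretomeP NN score threshold from the text


@dataclass
class Prediction:
    """Per-protein tool outputs (hypothetical field names)."""
    protein_id: str
    signalp_d_score: float
    signalp_sp_probability: float
    secretomep_nn_score: float
    targetp_mitochondrial: bool
    tmhmm_helices: int


def classify(p: Prediction) -> Optional[str]:
    """Return 'classical', 'non-classical', or None if not predicted as an ES protein."""
    # Step 1: SignalP. Assumed reading: non-secretory only if both scores are below 0.5.
    if p.signalp_d_score >= SIGNALP_CUTOFF or p.signalp_sp_probability >= SIGNALP_CUTOFF:
        route = "classical"
    # Step 2: SecretomeP on the SignalP-negative proteins.
    elif p.secretomep_nn_score >= SECRETOMEP_CUTOFF:
        route = "non-classical"
    else:
        return None
    # Step 3: TargetP removes predicted mitochondrial proteins.
    if p.targetp_mitochondrial:
        return None
    # Step 4: TMHMM removes proteins with one or more transmembrane helices.
    if p.tmhmm_helices > 0:
        return None
    return route
```

Proteins for which `classify` returns `None` at step 2 are the ones that would go on to the homology search against the experimentally determined ES protein dataset.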
In addition to standard computational approaches for the prediction of ES proteins, we compiled a list of 1080 ES protein sequences of parasitic helminths (Brugia malayi, Teladorsagia circumcincta, Schistosoma mansoni, Ancylostoma caninum, Schistosoma japonicum, Clonorchis sinensis and Fasciola hepatica) from the literature [22, 41–49]. A homology-based search with BLASTP is used to further extract ES proteins from the proteins which are predicted to be non-secretory by SecretomeP.
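As a hedged sketch of the homology step, the following Python selects SecretomeP-negative proteins that have a significant BLASTP hit against the compiled ES protein dataset. It assumes the search was written in the standard 12-column BLAST tabular format (query id in column 1, E-value in column 11). The E-value cutoff of 1e-5 is a placeholder assumption, since the threshold used in the study is not stated here, and the function name is hypothetical.

```python
import csv
from typing import Iterable, Set

EVALUE_CUTOFF = 1e-5  # assumed placeholder; the study's cutoff is not given in this text


def homology_es_proteins(
    blast_rows: Iterable[str],
    non_secretory_ids: Set[str],
    evalue_cutoff: float = EVALUE_CUTOFF,
) -> Set[str]:
    """Return ids of non-secretory proteins with a significant hit to a known ES protein."""
    hits: Set[str] = set()
    for row in csv.reader(blast_rows, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank, comment or malformed lines
        query_id, evalue = row[0], float(row[10])
        if query_id in non_secretory_ids and evalue <= evalue_cutoff:
            hits.add(query_id)
    return hits


# Toy tabular rows (hypothetical ids and values).
toy_rows = [
    "p1\tES_001\t45.0\t200\t100\t2\t1\t200\t1\t198\t1e-30\t150",
    "p2\tES_002\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
    "p3\tES_003\t60.0\t300\t120\t1\t1\t300\t1\t300\t1e-50\t400",
]
toy_hits = homology_es_proteins(toy_rows, non_secretory_ids={"p1", "p2"})
```

In the toy data, `p2` is rejected on E-value and `p3` is ignored because it is not in the non-secretory set.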
The results from computational tools are combined with those from BLAST searches, for functional annotation and analysis in Phase III.
All the predicted ES proteins are annotated using a number of tools. We used InterProScan for protein domain and family classification. KAAS is used for mapping ES proteins to KEGG pathways and to KEGG BRITE objects [52–54]. ES proteins are searched for sequence similarity against the Wormpep database (WS224) to find proteins similar to those of C. elegans. ES proteins are also searched for sequence similarity against rat (host) proteins and parasitic nematode proteins using the BLASTP algorithm, to identify parasite-specific proteins. The similarity of ES proteins to rat, parasitic nematode and C. elegans proteins is compared using Simitri. Proteins not homologous to the host (rat) proteome are further screened for RNAi phenotypes in C. elegans.
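The final screening step amounts to a set difference followed by a lookup. The sketch below assumes the BLASTP results have already been reduced to three inputs: the set of ES proteins with a rat homologue, a mapping from each ES protein to its best C. elegans hit, and the set of C. elegans genes with a lethal RNAi phenotype. All names and ids are hypothetical.

```python
from typing import Dict, Set


def shortlist_targets(
    es_proteins: Set[str],
    host_homologues: Set[str],
    celegans_best_hit: Dict[str, str],
    lethal_rnai_genes: Set[str],
) -> Dict[str, str]:
    """Map each candidate target to the lethal-phenotype C. elegans gene it matches."""
    targets: Dict[str, str] = {}
    # Keep only ES proteins with no homologue in the host (rat) proteome.
    for protein_id in sorted(es_proteins - host_homologues):
        gene = celegans_best_hit.get(protein_id)
        # Retain the protein if its C. elegans homologue has a lethal RNAi phenotype.
        if gene in lethal_rnai_genes:
            targets[protein_id] = gene
    return targets


# Toy example with placeholder ids.
toy_targets = shortlist_targets(
    es_proteins={"es1", "es2", "es3", "es4"},
    host_homologues={"es2"},
    celegans_best_hit={"es1": "gene-A", "es2": "gene-B", "es3": "gene-C"},
    lethal_rnai_genes={"gene-A", "gene-B"},
)
```

Here `es2` is dropped because it has a host homologue, `es3` because its homologue is not lethal, and `es4` because it has no C. elegans hit.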
All the programs used in this study were installed on a 16 CPU Linux cluster (2.4 GHz, Intel(R) Xeon(R) E5530, 32 RAM) running the Ubuntu server operating system. The compute-intensive steps are sequence assembly (MIRA, CAP3) and protein functional annotation mapping (InterProScan). All other programs will run efficiently on current desktop systems.
A semi-automated computational approach, incorporating three key components, was constructed. The different components of the workflow system (Figure 1) are linked using Perl, Python and bash shell scripts. This approach was applied to the S. ratti 454 transcriptomic dataset to show its efficacy and utility.
ES protein prediction is carried out in Phase II of the pipeline (Figure 1). Firstly, 407 (1.9%) proteins were predicted as classical secreted proteins using SignalP. The remaining 20470 (98.05%) proteins, which were predicted as non-secretory by SignalP, were processed by SecretomeP for prediction of non-classical secretory proteins. A total of 923 (4.4%) proteins were predicted as non-classical secretory proteins using SecretomeP. The classical and non-classical secretory proteins (1330, 6.3%) from these two programs were analyzed by TargetP for mitochondrial proteins. Only 18 proteins were predicted as mitochondrial proteins using TargetP at 95% specificity. These 18 proteins were removed from the set of 1330 secreted proteins, and the remaining 1312 secretory proteins were passed to TMHMM for the prediction of transmembrane proteins. The 256 proteins predicted to have one or more transmembrane helices were removed from the secretory protein dataset. A total of 1056 (5.05%) proteins were finally predicted as ES proteins by the computational prediction pipeline.
Proteins that were considered non-secretory by SecretomeP were matched to our in-house dataset of 1080 non-redundant, experimentally determined parasitic helminth ES proteins using a BLASTP similarity search. We found an additional 1516 (7.26%) proteins similar to known ES proteins by this homology search approach. Thus, for annotation and analyses in Phase III, we compiled a total of 2572 ES proteins, which is 12.3% of our putative proteins. This dataset is a more comprehensive collection of ES proteins of S. ratti than those reported by other S. ratti secretome studies [57, 58].
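The counts reported for Phase II can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
TOTAL_PROTEINS = 20877   # putative proteins translated by ESTScan

classical = 407          # SignalP positives
non_classical = 923      # SecretomeP positives among SignalP negatives
mitochondrial = 18       # removed by TargetP
transmembrane = 256      # removed by TMHMM
homology_only = 1516     # recovered by BLASTP against known helminth ES proteins

signalp_negative = TOTAL_PROTEINS - classical        # passed to SecretomeP
merged = classical + non_classical                   # passed to TargetP
after_targetp = merged - mitochondrial               # passed to TMHMM
computational_es = after_targetp - transmembrane     # ES proteins from the tool cascade
all_es = computational_es + homology_only            # ES proteins entering Phase III


def pct(count: int, total: int = TOTAL_PROTEINS) -> float:
    """Percentage of the putative proteome, rounded to two decimals."""
    return round(100 * count / total, 2)


for label, count in [
    ("SignalP negative", signalp_negative),
    ("classical + non-classical", merged),
    ("after TargetP", after_targetp),
    ("computational ES", computational_es),
    ("all ES", all_es),
]:
    print(f"{label}: {count} ({pct(count)}%)")
```

The intermediate values reproduce the 20470, 1330, 1312, 1056 and 2572 proteins quoted above, and the final set corresponds to about 12.3% of the 20877 putative proteins.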
Table 2. Top 15 most represented protein domains found in ES proteins using InterProScan
Number of ES proteins (%)
Protein Kinase like domain
Protein kinase, catalytic domain
Serine/threonine-protein kinase like domain
Serine/threonine-protein kinase domain
Serine/threonine-protein kinase active site
WD40 repeat like domain
WD40 repeat subgroup
WD40/YVTN repeat like domain
WD40 repeat domain
Tyrosine-protein kinase catalytic domain
WD40 repeat 2
Table 3. Top 15 most represented KEGG pathways found in ES proteins predicted by KAAS
Number of ES proteins represented (%)
Protein processing in endoplasmic reticulum
Ubiquitin mediated proteolysis
Wnt signalling pathway
Glycolysis / Gluconeogenesis
Circadian rhythm - mammal
TGF-beta signalling pathway
Table 4. Top 15 most represented KEGG BRITE objects found in ES proteins predicted by KAAS
Number of ES proteins represented (%)
Chaperones and folding catalysts
DNA repair and recombination proteins
DNA replication proteins
We demonstrated the utility of our new computational approach for the comprehensive prediction and analysis of ES proteins from transcriptomic data generated by NGS. The protocol will be implemented as a web server in the future, after extensive testing of different assembly programs and consideration of the choice of specific assemblers for a given transcriptomic dataset, as proposed by Kumar and Blaxter. All the programs used in our approach are freely available under academic licence and can be easily installed on Linux platforms. Our use of MIRA followed by CAP3 for the assembly of NGS data is simpler than the assembler combinations proposed by Kumar and Blaxter. The same combination was used in studies on Fasciola hepatica, Clonorchis sinensis and Opisthorchis viverrini, in which second order contigs were generated by CAP3 from MIRA contigs containing open reading frames. The whole assembly for the current dataset was performed in approximately 3 hours of CPU time using both MIRA and CAP3, whereas the use of CAP3 alone was not possible due to memory overflow with the current dataset on the hardware specified in the Methods section. Although all the studies discussed here are more comprehensive in terms of transcriptome coverage (more than 0.5M 454 reads were generated, compared with ~0.3M in our current dataset), none of them have comprehensively studied ES proteins. For example, the 454 transcriptomic study on Fasciola hepatica reported only 1812 ES proteins (only 4%) from 44597 putative protein sequences generated by ESTScan, with ES protein prediction based on signal peptide identification by SignalP.
Millions of people globally suffer from strongyloidiasis, caused by the parasitic nematode Strongyloides stercoralis. S. ratti is a common gastrointestinal parasite of the rat, which is used as a model to study strongyloidiasis. Here, we have analysed S. ratti transcriptomic data from parasitic females, free-living males and free-living females for the prediction and analysis of ES proteins. Of the dataset of 2572 ES proteins, 2310 (89.8%) had homologues in the free-living nematode C. elegans, which is similar to earlier reported findings in Strongyloides EST analysis studies. Many predicted ES proteins map to protein kinase domains, as shown in Table 2, which are reported to be essential for parasitic activity in parasitic nematodes. Protein kinases play a central role in signal transduction and hence are considered druggable targets. Other representative InterPro protein domains among S. ratti ES proteins were WD40 repeat domains (7.5%), which are associated with signal transduction pathways. These domains were also found among the top 20 most represented InterPro protein domains of O. dentatum putative proteins. ES proteins also map to ribosomal protein InterPro domains such as IPR000589 (Ribosomal protein S15), which is associated with ageing in S. ratti. All the most represented KEGG pathways mapped to ES proteins, shown in Table 3, are required for parasite survival inside the host, as the secretome of a parasite is representative of its genome in the host environment. The major ES proteins map to enzymes, which are essential for the functioning of metabolic pathways; this is also well reflected in our protein domain mapping. Other KEGG pathways found in this study, such as purine metabolism and glutathione metabolism, were also found in analyses of the excretory/secretory proteins of other parasitic nematodes. 22 (0.85%) ES proteins were mapped to the circadian rhythm - mammal pathway in C. elegans.
This pathway is unexpected in the case of ES proteins of nematodes; however, three proteins found among our ES proteins, S-phase kinase-associated protein 1 (K03094), cullin 1 (K03347) and F-box and WD-40 domain protein 1/11 (K03362), are also components of ubiquitin mediated proteolysis in C. elegans. These components shared between several pathways have led to this unexpected result. KEGG BRITE objects (representative objects shown in Table 4) reflect the presence among ES proteins of essential proteins such as protein kinases, peptidases and the proteasome, which are needed for S. ratti survival inside the host organism. 44 (1.71%) ES proteins map to chaperones, which are responsible for host immune system modulation, such as the recently characterised S. ratti heat shock protein 10. Along with well known protein families found in ES proteins, we found some protein categories, such as chromosome, DNA replication proteins and DNA repair and recombination proteins, which are expected to be localized in the nucleus but were found among S. ratti ES proteins. This pattern of exporting nuclear proteins to the secretome of a parasitic nematode was also observed in Meloidogyne incognita, in which 66 secreted proteins were identified with putative nuclear localization, such as DNA and RNA binding proteins including helicases. Correspondingly, we observed the presence of the helicase C domain in 35 (1.36%) S. ratti ES proteins. Contig 1289 and Contig 428 map to the metalloproteinase precursor in S. stercoralis, which is also a well characterized protein in Trichinella spiralis. Expression of an S. stercoralis metalloproteinase homologue was also found in the recent transcript analysis of another intestinal nematode, Strongyloides venezuelensis. Many of these potential therapeutic targets map to hypothetical proteins that are present in C. elegans, C. briggsae and B. malayi and have lethal phenotypes according to C. elegans RNAi phenotype mapping, and could be considered parasitism central genes of S. ratti.
Many of the putative proteins from S. ratti could be examined further after the publication of the S. ratti genome, which is expected soon.
Integrated approaches similar to the one discussed in this paper have been applied to several socio-economically important parasites. Where limited data is available for the subject organism, these approaches rely on data available for the reference organism of its taxonomic order. For example, C. elegans is the most studied organism among nematodes. C. elegans data was used to create the translation matrix used by ESTScan to translate potential coding regions in the assembled contigs into protein sequences. These translated coding regions were then used for ES protein prediction. The use of reference organism data for the translation matrix, instead of information from the actual organism, may lead to false positives in peptide prediction as well as in ES protein prediction. Another limiting factor is that we annotate protein function in terms of primary sequence alone, rather than the 3D structure. Therefore, all the therapeutic targets predicted in this study are preliminary predictions which need to be further validated by additional computational analysis, such as structural modelling, and by experimental assays.
In this paper we demonstrate how different computational tools can be used together to extract useful information on ES proteins from transcriptomic data. All the programs used in our approach are open source tools that are freely available for academic purposes. With the advent of NGS technologies there has been a massive increase in sequence data, but this data is extremely fragmented and, as raw output from the sequencer, of no use for information extraction. Our methodology will help in rapid assembly, fast annotation and reliable prediction of ES proteins. The approach is a generalized method which can be applied to any organism, although its main application is for neglected organisms whose genomes are not yet sequenced and for which functional knowledge is limited. Although we have used 454 transcriptomic data in this study, this methodology can be applied to transcriptomic data from other NGS platforms with slight modifications to pre-processing, as the data output formats of different NGS platforms differ. Thus, this system will help us to carry out secretome studies for other parasitic organisms in future.
SR directed the study. GG did the analysis. SR and GG contributed to writing the manuscript.
BRITE: Biomolecular Relations in Information Transmission and Expression
KEGG: Kyoto Encyclopedia of Genes and Genomes
KAAS: KEGG automatic annotation server.
We would like to thank Dr. Steve Paterson for providing the Strongyloides ratti cDNA sequencing data. We are thankful to Prof. Minoru Kanehisa for providing us with a stand-alone copy of the KAAS program. GG would like to acknowledge Macquarie University for an Australian Post-graduate Award scholarship.
This article has been published as part of BMC Genomics Volume 12 Supplement 3, 2011: Tenth International Conference on Bioinformatics – First ISCB Asia Joint Conference 2011 (InCoB/ISCB-Asia 2011): Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/12?issue=S3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.