Oomycete transcriptomics database: A resource for oomycete transcriptomes
© Tripathy et al.; licensee BioMed Central Ltd. 2012
Received: 17 April 2012
Accepted: 30 May 2012
Published: 6 July 2012
Oomycete pathogens have attracted significant attention in recent years due to their economic impact. With improving sequencing technologies, large amounts of oomycete transcriptomics data of great biological utility are now available. A known bottleneck with next generation sequencing (NGS) data, however, lies in their analysis, interpretation, organization, storage and visualization. A number of efforts have been made in this respect, resulting in the development of a myriad of resources. Most of the existing NGS browsers work as standalone applications that need processed data to be uploaded to the browser locally for visualization. At the same time, several oomycete EST databases, such as PFGD, ESTAP and SPC, are no longer available, so there is an immediate need for a database resource that can store and disseminate this legacy information in addition to NGS data.
The Oomycete Transcriptomics Database (OTD) is an integrated transcriptome and EST data resource for oomycete pathogens. The database currently stores processed ABI SOLiD transcript sequences from Phytophthora sojae and its host soybean (P. sojae mycelia, healthy soybean and P. sojae-infected soybean) as well as Illumina transcript sequences from five Hyaloperonospora arabidopsidis libraries. In addition to those resources, it also holds a complete set of Sanger EST sequences from P. sojae, P. infestans and H. arabidopsidis grown under various conditions. A web-based transcriptome browser was created for visualization of assembled transcripts, their mapping to the reference genome, expression profiling and depth of read coverage for particular locations on the genome. The transcriptome browser merges EST-derived contigs with NGS-derived assembled transcripts on the fly and displays the consensus. OTD possesses strong query features, and the database interacts with the VBI Microbial Database as well as the Phytophthora Transcriptomics Database.
The Oomycete Transcriptomics Database provides access to NGS transcript and EST data for oomycete pathogens and soybean. The OTD browser is a lightweight transcriptome browser that displays the raw read alignment as well as the transcript assembly and quantitative expression information. The query features offer a wide variety of options, including querying data from the VBI Microbial Database and the Phytophthora Transcriptomics Database. The database is publicly available at http://www.eumicrobedb.org/transcripts/.
Oomycete pathogens cause devastation to a wide range of hosts belonging to both the plant and animal kingdoms. Superficially, oomycete pathogens resemble fungi, but in fact they belong to a kingdom of life called Stramenopila, which also contains algae such as kelp and diatoms. Hence, conventional fungal control measures often fail against these pathogens. Phytophthora species and many other members of the order Peronosporales cause destructive diseases in an enormous variety of crop plant species as well as in forests and native ecosystems. The potato pathogen, P. infestans, was responsible for the Irish potato famine and is still a destructive pathogen of concern for biosecurity. In the past few years, whole genomes, transcriptomes and ESTs have been sequenced for many oomycete species [1, 5–7]. With the rapid growth of next generation sequencing (NGS) technologies such as those of 454 Life Sciences, Illumina and ABI SOLiD, informatics tools and resources have increasingly become a bottleneck. Several oomycete EST databases described previously are no longer available, including the Syngenta Phytophthora Consortium (SPC) EST sequence databases at https://xgi.ncgr.org/spc; the Phytophthora Functional Genomics Database at http://www.pfgd.org; and ESTAP (EST Analysis Pipeline) at http://staff.vbi.vt.edu/estap/. Recently, a new transcriptomics database, the Phytophthora Transcriptomics Database (PTD), was created at Nanjing Agricultural University; it contains digital gene expression information from Phytophthora sojae. We previously created the VBI Microbial Database (VMD), which served as a data warehouse for several oomycete genome sequences. However, the schema of VMD did not readily accommodate NGS transcriptomic data. Therefore, we have created the Oomycete Transcriptomics Database (OTD) to store oomycete transcriptomics data and to interface easily with VMD and PTD.
One challenging aspect of presenting transcriptomics data produced by next generation sequencing (NGS) methods is data visualization in a browser. Most of the existing NGS browsers are standalone applications that require users to upload processed data into the browser for visualization. This is a significant drawback, since users must either have access to the processed information or run the analysis pipelines themselves. We have created a web-based transcript browser that displays EST transcripts, NGS transcripts, their alignment to the reference genome, genome annotation features and merged EST and NGS transcripts.
OTD is a relational database with a backend that uses MySQL version 5.1.49 and a front end that uses PHP version 5.3.3 and Perl CGI version 5.12.1. All the visualization tools were created using Perl GD, GnuPlotter and ImageMagick.
The P. sojae ABI SOLiD sequences were obtained from mycelial transcripts and from soybean hypocotyls 12 h after inoculation with P. sojae (four replicates each). The soybean ABI SOLiD sequences were obtained from mock-inoculated samples and from samples taken 12 h after infection with P. sojae. Details of the production of these data will be published elsewhere. The H. arabidopsidis Illumina reads were obtained from Arabidopsis leaves 7 days after inoculation with the pathogen. The EST sequences of P. sojae were generated from six different cDNA libraries, and the P. infestans EST sequences were downloaded from GenBank.
Raw SOLiD, Illumina and EST sequences were preprocessed and analyzed prior to uploading them into the database.
RNAseq read alignment statistics
P. sojae mycelia (4 replicates)
P. sojae-infected soybean (4 replicates)
H. arabidopsidis-infected Arabidopsis (5 replicates)
We carried out a de novo assembly of the reads to generate contigs using the ABySS assembler. The assembled contigs were then mapped onto the genome assembly using BLAT. A number of assembled contigs that did not match the genome assembly were annotated and stored in the database.
Assembled Transcripts Supported by Predicted models and EST libraries
Total Number of assembled Transcripts
Number of Transcripts with both EST and Gene Model support
Number of Transcripts with no match to
Number of Transcripts with No match to
Number of Transcripts having no match to gene models or ESTs
The raw P. sojae and H. arabidopsidis EST sequences were obtained as chromatograms. The sequence files and qual files were extracted using PHRED with the command-line option -trim_alt and a cutoff parameter of 0.1. As part of the cleaning protocol, the sequences were quality trimmed using an in-house algorithm. The maximum number of low-quality bases (quality score < 20 for the 5′ end and < 15 for the 3′ end) allowed in a window of size 25 was 6 for both the 3′ and 5′ ends. Windows having > 6 low-quality bases were shifted by one base, and the process was repeated.
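The windowed trimming rule described above can be illustrated with a short Python sketch. This is not the in-house algorithm itself, whose details are not given here; the function name `trim_window`, the half-open return convention, and the decision to discard reads shorter than one window are assumptions made for illustration.

```python
from typing import List, Tuple


def count_low(quals: List[int], threshold: int) -> int:
    """Number of bases in a window whose quality is below the threshold."""
    return sum(1 for q in quals if q < threshold)


def trim_window(quals: List[int], window: int = 25, max_low: int = 6,
                min_q5: int = 20, min_q3: int = 15) -> Tuple[int, int]:
    """Return the retained region as half-open (start, end) coordinates.

    Assumption: (0, 0) is returned when no window passes or the read is
    shorter than one window; the source does not say how such reads are handled.
    """
    n = len(quals)
    if n < window:
        return (0, 0)

    # 5' end: shift the window one base at a time until it passes.
    start = None
    for i in range(n - window + 1):
        if count_low(quals[i:i + window], min_q5) <= max_low:
            start = i
            break
    if start is None:
        return (0, 0)

    # 3' end: same idea, walking inward from the end with the 3' threshold.
    end = None
    for j in range(n, window - 1, -1):  # j is the exclusive end of the window
        if count_low(quals[j - window:j], min_q3) <= max_low:
            end = j
            break
    if end is None or end <= start:
        return (0, 0)
    return (start, end)
```

For a read with ten poor bases at each end and fifty good bases in between, the sketch keeps the region starting at the first window that contains at most six low-quality bases and ending at the last such window.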
For vector removal, CrossMatch was used with the -minmatch and -minscore parameters set to 10 and 20, respectively. For adaptor removal, both of these parameters were lowered to 8, so that smaller adaptors could be removed. Internal poly A/T tracts (indicating chimeric cDNA fragments) were removed and the sequence cleaved if the tract length was > 18 bases. For terminal poly A/T tracts, the tract length parameter was removed.
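As a rough illustration of the internal poly A/T rule, the following Python sketch splits a sequence at any internal run of more than 18 A or T bases and drops the run. The function name `cleave_internal_poly` is hypothetical, and leaving terminal tracts untouched here is an assumption, since terminal tracts are handled by a separate rule in the text.

```python
import re
from typing import List

# Runs of A or T longer than 18 bases (i.e. 19 or more).
POLY_TRACT = re.compile(r"A{19,}|T{19,}")


def cleave_internal_poly(seq: str) -> List[str]:
    """Split seq at internal poly A/T tracts and discard the tracts.

    Assumption: tracts touching either end of the read are terminal and are
    left in place for a separate cleaning step.
    """
    pieces = []
    prev = 0
    for m in POLY_TRACT.finditer(seq):
        if m.start() == 0 or m.end() == len(seq):
            continue  # terminal tract, not handled here
        pieces.append(seq[prev:m.start()])
        prev = m.end()
    pieces.append(seq[prev:])
    return [p for p in pieces if p]
```

A read containing one internal tract yields two fragments, which would then be treated as separate sequences in downstream clustering.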
Contaminating sequences with very strong (95%) similarity to vector or any other sequence database were removed prior to clustering. The ESTs from infection libraries were initially assigned to host or pathogen by the procedure shown in Additional file 1: Figure S1. Later, when the genome sequences of the pathogen and host became available, the assignments were checked and, if necessary, corrected. The soybean ESTs recovered from the analysis of the infection libraries were submitted to GenBank and can be found under accession numbers CF805618 to CF809370.
The clean EST sequences were clustered and assembled using the TGICL wrapper. TGICL uses megablast for clustering and CAP3 for assembly. The analysis was run on a Sun server with two 3-GHz Xeon processors and 4 GB RAM running Slackware Linux (i486). For CAP3 alignment, the minimum percent identity for overlaps was set to 94, the minimum overlap length to 30, and the maximum length of unmatched overhangs to 30. Finally, 7,863 unigenes from P. sojae, 2,292 unigenes from soybean (derived from P. sojae-infected tissue), 14,754 unigenes from P. infestans and 13,363 unigenes from H. arabidopsidis were obtained.
We identified the protein coding sequences from the unigenes using a modified log-likelihood algorithm. This algorithm calculates the coding potential across a sliding window of user-defined size (we used 120) for all six frames and determines the most likely coding frame. It then compares the islands of higher coding potential with known sequence patterns such as start and stop codons. If a start/stop pattern is found around the window where the coding potential is high, that region is called a coding sequence. Cases of frameshift sequence errors, chimeric sequences and contamination were easily detected using this algorithm and marked accordingly. Once protein coding regions were marked, the sequence annotation steps required much less processing time.
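The idea of scoring coding potential in a sliding window over six frames can be sketched generically in Python as follows. This is a plain codon log-likelihood ratio, not the modified algorithm used for OTD; the uniform background, the floor probability for unseen codons, and the names `window_scores` and `best_frame` are all assumptions for illustration, and the start/stop pattern check is omitted.

```python
import math
from typing import Dict, List, Optional, Tuple

BACKGROUND = 1.0 / 64.0  # assumed uniform background codon probability
FLOOR = 1e-3             # assumed probability for codons missing from the table


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def window_scores(seq: str, offset: int, coding_freq: Dict[str, float],
                  window: int = 120) -> List[float]:
    """Summed codon log-likelihood ratios for each window in one frame."""
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    llr = [math.log(coding_freq.get(c, FLOOR) / BACKGROUND) for c in codons]
    k = window // 3  # window size in codons
    return [sum(llr[i:i + k]) for i in range(len(llr) - k + 1)]


def best_frame(seq: str, coding_freq: Dict[str, float],
               window: int = 120) -> Optional[Tuple[str, int]]:
    """Return (strand, offset) of the frame with the highest-scoring window."""
    best_key, best_score = None, -math.inf
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            scores = window_scores(s, offset, coding_freq, window)
            if scores and max(scores) > best_score:
                best_key, best_score = (strand, offset), max(scores)
    return best_key
```

In practice `coding_freq` would be a codon usage table estimated from known genes of the organism; the sketch only shows how a window size of 120 bases translates into 40-codon windows scored in each of the six frames.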
More than 95% identity and no query gaps.
More than 95% identity. Query gaps exist and can be explained by the presence of plausible genomic sequence gaps.
More than 95% identity. Query gaps exist that cannot be explained by genomic sequence gaps but are less than 10 bases, and/or end mismatches are present but are less than 10 bases.
More than 95% identity. Query gaps that cannot be explained by genomic sequence gaps are more than 10 bases, and/or end mismatches are present and are more than 10 bases.
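The four match categories listed above can be expressed as a small decision function. The sketch below is illustrative only: the numeric labels 1 to 4, the label 0 for matches at or below 95% identity, and the treatment of a gap or end mismatch of exactly 10 bases as belonging to the last category are assumptions, since the list does not state them.

```python
from typing import List, Tuple


def classify_match(identity_pct: float,
                   query_gaps: List[Tuple[int, bool]],
                   end_mismatch: int = 0) -> int:
    """Assign a match to one of the four categories (hypothetical labels 1-4).

    query_gaps holds (gap_length, explained_by_genomic_gap) pairs.
    Returns 0 when identity is not above 95%.
    """
    if identity_pct <= 95:
        return 0
    unexplained = [length for length, explained in query_gaps if not explained]
    worst = max(unexplained + [end_mismatch])
    if worst >= 10:   # assumption: exactly 10 bases counts as the last category
        return 4
    if worst > 0:
        return 3
    if query_gaps:    # all gaps are explained by genomic sequence gaps
        return 2
    return 1
```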
The primary annotation of the sequences was done with tera-BLASTX against a non-redundant protein database, accelerated on TimeLogic’s DeCypher system. The BLAST outputs were parsed, and up to 10 significant BLAST hits with associated HSP data were stored in the database. For functional annotation of the protein sequences, we used InterProScan. We sent smaller chunks of sequences to the server to optimize resource usage. The data were parsed and stored in the database. Secretory and membrane proteins were predicted by running SignalP and TMHMM on the protein sequences. The annotations are updated every six months and were last updated in November 2011.
Read depth of coverage of NGS data
Assembled reads generated from mapped and unmapped NGS reads
Levels of existing and novel transcripts expressed as FPKM [Fragments Per Kilobase of exons per Million fragments mapped]
Cleaning, clustering, assembly, and annotation information for EST data.
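For readers unfamiliar with the FPKM unit mentioned in the list above, the standard definition can be written as a one-line Python function. The function and parameter names are illustrative and are not taken from the OTD pipeline.

```python
def fpkm(fragments: int, exon_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon per million fragments mapped.

    FPKM = fragments * 1e9 / (exon_length_bp * total_mapped_fragments)
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)
```

For example, 100 fragments mapped to a transcript with 2,000 bases of exon in a library of 10 million mapped fragments gives an FPKM of 5.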
The transcriptomics browser is the central component of the resource. It enables users to walk along the genome assembly and discover important transcribed elements that may be missing from the annotation. Users can switch from one organism to another by selecting the organism from the drop-down box in the top panel of the main transcriptomics browser page [Figure 2A, B].
We have created a web-based text alignment viewer for the reference genome. This viewer can also be used for SNP viewing and for correcting gene models based on the alignment of the transcript reads to the reference genome. Links to the text-based viewer are provided from the main transcript assembly page, which is based on the assembly of reads on the reference strand. The topmost row is the genomic reference, followed by the mapped reads arranged in rows. As the number of reads increases, the page needs to be scrolled down and to the right to view the alignment. We have used JavaScript to fix the vertical position of the reference strand on the screen, so that users can always line up reference bases with read bases (Figure 4). This greatly helps in detecting substitutions, intron-exon locations and false assemblies.
Each component EST sequence of a unigene, if present, is provided with a link, so that the user can reach the EST details with a click. The individual EST page lists the quality trimming protocols, other ESTs that overlap with the sequence and other relevant information [Figure 7B]. An on-the-fly BLAT option is also available for EST sequences against the respective reference genome [Figure 7C].
By fold change in treated versus untreated samples.
By expression value.
By names of the unigenes or ESTs or contigs.
By primary and secondary annotation.
By number of ESTs present in a unigene.
Another useful query feature is the ability to retrieve transcripts that show a given fold change between two conditions. For example, in the case of the P. sojae V1.0 assembly, one can query and find all the transcripts that show a certain fold change (e.g. two-fold) between infection and non-infection conditions. Similar search options are also available for the soybean datasets. Due to the data size and query time, options are currently restricted to searching by individual scaffolds.
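The kind of filter behind a fold-change query can be sketched in Python as follows. This is not OTD's query code; the function name `by_fold_change`, the use of a pseudocount to avoid division by zero, and counting changes in either direction are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def by_fold_change(expr: Dict[str, Tuple[float, float]],
                   fold: float = 2.0,
                   pseudocount: float = 1.0) -> List[str]:
    """Return IDs whose expression changes at least `fold`-fold in either direction.

    expr maps a transcript ID to (untreated, treated) expression values.
    Assumption: a pseudocount is added to both values so that zero
    expression does not cause a division by zero.
    """
    threshold = math.log2(fold)
    hits = []
    for transcript_id, (untreated, treated) in expr.items():
        log_ratio = math.log2((treated + pseudocount) / (untreated + pseudocount))
        if abs(log_ratio) >= threshold:
            hits.append(transcript_id)
    return sorted(hits)
```

With hypothetical values, a transcript going from 10 to 40 and one going from 40 to 10 would both be returned for a two-fold query, while one going from 10 to 12 would not.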
EST-derived unigenes and contigs can be searched by exact ID or by a regular expression. For example, most of the EST contigs begin with CL1, so users can query the database with CL1* [Figure 6A]. If the user chooses to query by a single contig name, the primary contig page with primary annotation and quality scores is displayed. If a contig has an overlapping gene model, the gene_id is provided along with its VMD link.
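A pattern such as CL1* can be matched against identifiers as sketched below, assuming the asterisk is treated as a shell-style wildcard (any run of characters) rather than a strict regular-expression quantifier. The contig names in the usage note are hypothetical, and the function `match_ids` is illustrative rather than part of OTD.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_ids(ids: Iterable[str], pattern: str) -> List[str]:
    """Return the IDs that match an exact name or a shell-style wildcard pattern."""
    return [name for name in ids if fnmatchcase(name, pattern)]
```

For instance, `match_ids(["CL1Contig1", "CL12Contig2", "CL2Contig1"], "CL1*")` keeps the first two names, while passing an exact name returns only that entry.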
In addition to the utilities described above, there are a number of miscellaneous items available from the home page. Sequence statistics, cluster statistics, metadata information and library construction methods are accessible from this page. For P. sojae EST datasets, cluster statistics and details of the sequence distribution in EST clusters are listed with proper links to the main annotation pages.
The download site currently provides 39 curated data types. Users can request additional information, if necessary, through the requisition form provided on the page.
OTD, with its numerous visualization tools and backend processing pipelines, is a valuable resource for the oomycete community to browse and retrieve transcriptomics information. OTD is also linked with VMD and PTD for additional information on genome sequences and expression data. As additional genome and transcriptome data become available, they will be imported into the database.
The database is publicly available at http://www.EuMicrobeDB.org/transcripts. The database and associated software are open source and will be made available upon request.
EST: Expressed sequence tags
ORF: Open reading frames
HSP: High-scoring Segment Pair
Qual: Quality score files
OTD: Oomycete Transcriptomics Database
FPKM: Fragments per Kilobase of exons per Million fragments mapped
NGS: Next generation sequencing data
PTD: Phytophthora Transcriptomics Database
VMD: VBI Microbial Database
This work was supported by grants to BMT from the Agriculture and Food Research Initiative of the USDA National Institute of Food and Agriculture, grant numbers 00-52100-9684, 2004-35600-15055, 2005-35604-15525, and 2007-35600-18530 and from the US National Science Foundation, numbers EF-0412213, MCB-0731969.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.