Volume 9 Supplement 2
Batch Blast Extractor: an automated blastx parser application
© Pirooznia et al; licensee BioMed Central Ltd. 2008
Published: 16 September 2008
BLAST programs are very efficient at finding sequence similarities. However, for large datasets such as ESTs, the relevant information has to be extracted manually from the batch BLAST output, which can be time consuming, inefficient, and inaccurate. A parser application would therefore be extremely useful for extracting information from BLAST output.
We have developed a Java application, Batch Blast Extractor, with a user-friendly graphical interface to extract information from BLAST output. The application generates a tab-delimited text file that can be easily imported into packages such as Excel or SPSS for further analysis. For each BLAST hit, the program obtains and saves the essential features from the BLAST output file that are needed for further analysis. The program was written in Java and is therefore OS independent. It works on both Windows and Linux with Java 1.4 and higher. It is freely available from: http://mcbc.usm.edu/BatchBlastExtractor/
The NCBI BLAST database search tool is one of the most popular programs designed to solve single query problems. BLAST (Basic Local Alignment Search Tool) is the heuristic search algorithm employed by the programs blastp, blastn, blastx, tblastn, and tblastx. The BLAST programs were tailored for sequence similarity searching, for example to identify homologs of a given query sequence.
The five common BLAST programs perform the following tasks: 1) blastp compares an amino acid query sequence against a protein sequence database; 2) blastn compares a nucleotide query sequence against a nucleotide sequence database; 3) blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database; 4) tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands); and 5) tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
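To make the "six-frame conceptual translation" used by blastx concrete, the following is a minimal Python sketch, not code from Batch Blast Extractor itself. It assumes the standard genetic code and labels the three forward frames +1 to +3 and the three reverse-complement frames -1 to -3. The function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

Each of the six strings is what a blastx-style search would compare against the protein database. The frame label of the best-matching translation corresponds to the Frame value reported for a hit.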
The application generates a tab-delimited text file that can be easily imported into packages such as Excel or SPSS for further analysis. For each BLAST hit, the program derives and saves the following features: Query ID, Query Length, Accession version and GI number, Alignment Length, Score, Bits, E-value, Identities, Positives, Gaps, Frame, Organism, and Description.
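As an illustration of the tab-delimited layout described above, here is a minimal Python sketch that writes one row per hit using the standard `csv` module. It is not the application's Java code. The column names follow the feature list in the text, but their exact spelling, the split of accession and GI into two columns, and the helper name `hits_to_tsv` are assumptions.

```python
import csv
import io

# Column names follow the features listed in the text; the exact headers
# written by Batch Blast Extractor may differ (assumption).
COLUMNS = [
    "Query ID", "Query Length", "Accession", "GI", "Alignment Length",
    "Score", "Bits", "E-value", "Identities", "Positives", "Gaps",
    "Frame", "Organism", "Description",
]


def hits_to_tsv(hits):
    """Serialize a list of per-hit dicts as tab-delimited text with a header."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=COLUMNS,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop keys that are not columns
    )
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit)  # missing columns are written as empty fields
    return buf.getvalue()


if __name__ == "__main__":
    example = [{"Query ID": "EST001", "Score": 450, "E-value": "1e-66", "Frame": "+3"}]
    print(hits_to_tsv(example), end="")
```

A file written this way should open directly in a spreadsheet or statistics package, with one row per BLAST hit and one column per feature.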
The extracted information includes the following:
▪ Query: headers of sequences to analyze
▪ Subject: headers of sequences found in the database
▪ Score: a number representation (e.g. 550)
▪ Score Text: full text representation plus BITS (e.g. 235 bits (450))
▪ Expect: the E-Value as number (e.g. 1e-166)
▪ Identities %: a number representation (e.g. 85)
▪ Identities Text: full text representation plus characters matching (e.g. 110/130 (90%))
▪ Positives %: a number representation (e.g. 92)
▪ Positives Text: full text representation (e.g. 110/130 (90%))
▪ Gaps %: a number representation (e.g. 11)
▪ Gaps Text: full text representation plus voids (e.g. 9/102 (9%))
▪ Frame: orientation of the translated ORF (e.g. +3)
▪ Length Query: the number of nucleotides or amino acids (e.g. 400)
▪ Length Subject: the number of nucleotides or amino acids (e.g. 500)
▪ Position Query: as text representation plus the length of the frame (e.g. 328–600 (360))
▪ Position Subject: as text representation plus the length of the frame (e.g. 1–110 (120))
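The kind of extraction listed above can be sketched with a few regular expressions. This is a hedged Python illustration rather than the application's actual Java parser. It assumes the statistics lines of the plain-text blastx report (`Score = ... bits (...), Expect = ...`, `Identities = ...`, `Frame = ...`), and the function name and dictionary keys are hypothetical.

```python
import re

# Patterns for the per-alignment statistics lines of a text BLAST report.
SCORE_RE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*"
    r"Expect(?:\(\d+\))?\s*=\s*([0-9.eE+-]+)"
)
FRACTION_RE = re.compile(r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")
FRAME_RE = re.compile(r"Frame\s*=\s*([+-][123])")


def parse_hsp_stats(block):
    """Extract score, E-value, identity/positive/gap counts and frame."""
    stats = {}
    m = SCORE_RE.search(block)
    if m:
        stats["score_text"] = f"{m.group(1)} bits ({m.group(2)})"
        stats["bits"] = float(m.group(1))
        stats["score"] = int(m.group(2))
        expect = m.group(3)
        if expect[0] in "eE":  # some reports print "e-100" for 1e-100
            expect = "1" + expect
        stats["expect"] = float(expect)
    for name, matched, total, pct in FRACTION_RE.findall(block):
        key = name.lower()
        stats[key + "_text"] = f"{matched}/{total} ({pct}%)"
        stats[key + "_pct"] = int(pct)
    m = FRAME_RE.search(block)
    if m:
        stats["frame"] = m.group(1)
    return stats


if __name__ == "__main__":
    sample = (
        " Score = 235 bits (450), Expect = 1e-66\n"
        " Identities = 110/130 (84%), Positives = 120/130 (92%), Gaps = 2/130 (1%)\n"
        " Frame = +3\n"
    )
    print(parse_hsp_stats(sample))
```

Keeping both the numeric value and the original text for each statistic mirrors the paired "%" and "Text" fields in the list above. The numeric form is convenient for sorting and filtering, while the text form preserves what the report actually printed.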
The program was written in Java. It is OS independent and works on both Windows and Linux with Java 1.4 and higher. It is freely available to noncommercial users from: http://mcbc.usm.edu/BatchBlastExtractor/ (Figures 2 and 3).
Currently the application works with blastx results. Efforts to extend functionality to other BLAST programs such as blastp and blastn are in progress.
This work was supported by the Mississippi Functional Genomics Networks (MFGN), Mississippi Computational Biology Consortium (MCBC) (NSF Grant # EPS-0556308), and the Army Environmental Quality Program of the US Army Corps of Engineers under contract #W912HZ-05-P-0145. Permission was granted by the Chief of Engineers to publish this information.
This article has been published as part of BMC Genomics Volume 9 Supplement 2, 2008: IEEE 7th International Conference on Bioinformatics and Bioengineering at Harvard Medical School. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2164/9?issue=S2
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25 (1): 25-29. 10.1038/75556.
- Liang C, Wang G, Liu L, Ji G, Liu Y, Chen J, Webb JS, Reese G, Dean JF: WebTraceMiner: a web service for processing and mining EST sequence trace files. Nucleic Acids Res. 2007, 35 (Web Server issue): W137-142. 10.1093/nar/gkm299.
- Nagaraj SH, Deshpande N, Gasser RB, Ranganathan S: ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform. Nucleic Acids Res. 2007, 35 (Web Server issue): W143-147. 10.1093/nar/gkm378.
- Pirooznia M, Gong P, Guan X, Inouye LS, Yang K, Perkins EJ, Deng Y: Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida. BMC Bioinformatics. 2007, 8 (Suppl 7): S7. 10.1186/1471-2105-8-S7-S7.
- Mangalam H: The Bio* toolkits – a brief overview. Brief Bioinform. 2002, 3 (3): 296-302. 10.1093/bib/3.3.296.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.