Today, many genetics laboratories have access to modern capillary DNA sequencing machines, such as the ABI PRISM 3100, 3700 or 3730. These machines generate vast amounts of raw sequence data with little user intervention. Consequently, the amount of data to be analyzed has expanded and the bottleneck now is the analysis capacity. Data analysis capacity can be increased by higher levels of automation. Investments in infrastructures to process the raw sequencing data in sophisticated but rigid pipelines might be justified for larger laboratories and larger projects but might be too costly for smaller laboratories. In addition, rigid pipelines are too impractical if different projects share run-time on the same machine while requiring (slightly) different analysis procedures (e.g. vector trimming is needed in plasmid sequencing, but needless when sequencing PCR products).
Nucleotide sequence analysis can be performed with a variety of software tools. Although the number of console and web-based software tools has grown rapidly, the routine use of data input, output and storage may be inconvenient. Furthermore, for performing a series of analyses with different software tools, the sequence data need to be reformatted to the required data structure. Alternatively, sophisticated software suites that provide an integrated environment often are expensive.
Several of the available software solutions are designed to facilitate automated DNA sequence analysis at low cost. Well-known solutions are the Staden package and Bioperl.
The Staden Package contains pregap4 and gap4, full-featured applications with an intuitive graphical user interface [1]. These programs handle a list of raw sequence reads method-by-method. The programs in the Staden Package typically require a degree of user intervention and thus hands-on time.
Alternatively, Bioperl is a group of perl modules describing many genetics and genomics concepts [2]. For example, it includes the Bio::Seq::SeqWithQuality object that provides some of the basic properties of a raw sequence (i.e. its nucleotide sequence and quality values); the Bio::Tools::Primer3 object provides methods to work with primer3 input and output. However, to build custom DNA sequencing data pipelines, basic programming skills are needed to combine all these modules.
Smaller laboratory sites, however, often need to implement versatile pipelines that can be adjusted for any research question that suits the project best; at the same time, they often also do not have dedicated programmers available.
Although (semi-)automated procedures have been published by other groups [3, 4], these are mostly focused on one particular pipeline and environment.
Here, we present POSA, a set of two new perl objects (Read.pm and Contig.pm) that describe a raw sequence and a Phrap contig in detail and are easily implemented in perl-based pipelines. Because these objects provide building blocks for sequencing data analysis pipelines and the actual pipelines are built using perl-scripts, the POSA objects can be used in very diverse settings.