The production of large molecular and clinical datasets in modern biological research centers has posed major new challenges for information processing systems. In a typical laboratory routine, many tests are run in parallel on a single piece of equipment; for instance, a thermal cycler (DNA amplifier) can perform several different PCR reactions simultaneously. This high degree of parallelism demands accurate procedure, reagent, and result management. New tests are continually being developed, and users have to be guided to perform the right task at the appropriate time. Incompatibilities between new and old processes, together with new data requirements, make it difficult to integrate and analyze all the available information. Moreover, the process of scientific knowledge discovery involves frequent process updates for the refinement of scientific hypotheses. Since new data are generated and processed automatically, manual approaches have become very expensive or even infeasible. As a result, a long-term solution to the scientific data integration problem requires formally validated and automated information processing tools.
An illustrative example is genome databases, which are characterized by large amounts of data that are frequently updated from a variety of data sources. Integrating “new” genome data is highly time-consuming, especially because the processes that create these data keep changing and software tools must be continuously updated to reflect these changes. As a result, delayed software evolution may slow down scientific progress.
Challenges in the CEGH environment
The Human Genome Research Center (CEGH) at the University of São Paulo is the largest center in Latin America dedicated to the study of human Mendelian genetic disorders. Since its establishment about 40 years ago, more than 100,000 patients and their relatives have been referred to the center and examined by different research groups. CEGH’s main mission is to gain an understanding of gene function, with a focus on neuromuscular, craniofacial, and brain development, through the study of genetic disorders.
The CEGH offers around 40 different genetic tests, which are performed by several technicians under the supervision of six researchers. All samples are recorded and sent to specialized technicians for analysis. It is crucial to maintain flawless control of the flow of each sample through every step of the analysis.
One of the tests performed is the polymerase chain reaction (PCR), with primers and amplification conditions specific to the segments (i.e., exons or introns) of the particular gene to be tested. The results are obtained through the analysis of PCR products. Most tests involve sequencing several exons of the gene, which usually consists of a three-step procedure: PCR, DNA purification, and sequencing reaction. Note that the amplification conditions and reagents usually vary per exon and per gene. Another set of tests requires a screening step using DHPLC (denaturing high-performance liquid chromatography) or a different strategy, and only exons with an altered pattern in the first step are sequenced.
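As a minimal illustration, the per-exon procedure could be represented in software roughly as follows (hypothetical names and structures only; this is not the CEGH system):

```python
from dataclasses import dataclass, field

# Hypothetical sketch: one way to represent a per-exon sequencing test as an
# ordered list of atomic steps with exon-specific reagents.

@dataclass
class Step:
    name: str                         # e.g. "PCR", "DNA purification", "sequencing reaction"
    reagents: dict = field(default_factory=dict)

@dataclass
class ExonProcedure:
    gene: str
    exon: str
    steps: list                       # executed strictly in this order

def sequencing_procedure(gene: str, exon: str, primer_pair: str) -> ExonProcedure:
    """Build the three-step procedure (PCR, purification, sequencing) for one exon."""
    return ExonProcedure(
        gene=gene,
        exon=exon,
        steps=[
            Step("PCR", {"primers": primer_pair}),   # primers vary per exon and gene
            Step("DNA purification"),
            Step("sequencing reaction"),
        ],
    )
```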
Genetic disorders are very heterogeneous, and their phenotypes can be produced by several genetic mechanisms involving more than one gene. Genetic testing is required for accurate estimates of recurrence risks. A cost-reduction strategy is to first perform the genetic test that covers the most frequent cause of the disorder being investigated. Once a negative result is obtained, a second test is performed, targeting the second most frequent cause of the disorder. At times, three or more different tests need to be performed.
Besides the heterogeneity of diseases, several gene mutations can cause a single disease. For example, more than 1,500 mutations have been described so far for cystic fibrosis, the most common autosomal recessive disorder in Caucasians. As previously mentioned, testing starts in the regions most likely to harbor mutations; other segments of the gene are tested only if negative results are obtained.
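In software terms, this hierarchical strategy amounts to running tests in order of decreasing frequency of the causal mechanism and stopping at the first positive result, roughly as in the following sketch (assumed helper functions; not part of any cited system):

```python
# Hypothetical sketch of the most-frequent-cause-first testing strategy:
# each test function is assumed to return True (mutation found) or False.

def run_test_battery(sample, ordered_tests):
    """ordered_tests: list of (test_name, test_fn) pairs, sorted from the most
    to the least frequent cause of the disorder under investigation."""
    for name, test_fn in ordered_tests:
        if test_fn(sample):
            return name          # causal mutation identified; stop testing
    return None                  # all tests negative; further investigation needed
```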
Figure 1 shows patients, tests and their atomic steps, as well as the order in which the steps are performed; in other words, the figure illustrates how a test is represented. This genetic test is used to identify mutations associated with spinal muscular atrophy (SMA) [1], a neuromuscular disease that causes progressive muscle degeneration. Since most SMA patients carry a deletion of exons 7 and 8, the SMA test probes the SMN1 gene for this type of mutation in exons 7 and 8. First, DNA is extracted from a patient sample. Then the PCR of exon 7 and the PCR of exon 8 can be performed; because they are independent procedures, they can be performed simultaneously. Once both PCRs are complete, the analysis procedure is released for execution, and it analyzes the previous results to produce a test result. Each procedure can be repeated as needed, as indicated by the return arrows.
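The control flow of Figure 1 can be sketched as follows (a hypothetical illustration only, with assumed procedure functions; it is not the implementation described later in this paper):

```python
from concurrent.futures import ThreadPoolExecutor

def repeat_until_ok(procedure, *args, max_attempts=3):
    """Re-run a procedure that may fail, mirroring the return arrows in Figure 1."""
    for _ in range(max_attempts):
        result = procedure(*args)
        if result is not None:
            return result
    raise RuntimeError(f"{procedure.__name__} failed after {max_attempts} attempts")

def sma_test(sample, extract_dna, pcr_exon7, pcr_exon8, analysis):
    """DNA extraction, then the two independent PCRs in parallel, then analysis."""
    dna = repeat_until_ok(extract_dna, sample)
    # The two PCRs are independent, so they may run simultaneously.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f7 = pool.submit(repeat_until_ok, pcr_exon7, dna)
        f8 = pool.submit(repeat_until_ok, pcr_exon8, dna)
        product7, product8 = f7.result(), f8.result()
    # Analysis is released only after both PCRs have completed.
    return analysis(product7, product8)
```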
During laboratory routine, several tests are run at the same time on a single instrument, which requires accurate procedure, reagent, and result management. Since new genetic tests are constantly being developed from existing ones, a software system is required to handle multiple versions of procedures and tests. Moreover, users have to be guided to perform the right task at the appropriate time.
Related work
Previous research efforts in scientific data management have adopted several foundational concepts and tools to address the challenges of biological testing management. A popular approach is to include ontology evaluation in the design and specification of biological system rules [2–4]. However, ontologies alone cannot capture the semantics of tests or effectively address dynamic changes in genome testing routines. Several successful efforts have been documented in which scientific workflows are generated automatically through planning techniques based on ontology descriptions [5–7]. As a downside, these approaches do not support the control of long transactions. An alternative approach is to use a software architecture for knowledge discovery in biology [8], but its focus is solely on the data analysis of the process under study.
Although there are several tools for the management of workflows and business processes, some of them focus on providing resources for process or workflow simulation. In other cases, processes can be managed with software tools that lack formal validation, which becomes especially critical as process complexity increases. Some popular tools for the control and implementation of scientific workflows and business processes have been developed on the basis of formal approaches such as colored Petri nets [9–12] and process algebra [13–15].
In fact, workflow approaches have been applied to the development of Laboratory Information Management Systems (LIMS) as an efficient means of handling the tests performed in a laboratory.
The Protein Information Management System (PIMS) [16] and WIST, a toolkit for rapid and customized LIMS development [17], are examples of the programming-language approach, which offers an easy interface for defining laboratory routines based on specific patterns and for managing workflow requirements. PIMS is a specialized LIMS for managing protein testing methods, with a customized notation to define workflows for protein analysis. WIST provides a set of application programming interfaces and a web application to support LIMS development. PIMS and WIST belong to a group of software approaches that use specific workflow definition languages and workflow management systems to support laboratory management requirements. These approaches have been used successfully in several information systems, offering customized workflow solutions.
Another important development of the LIMS approach is based on integrating software tools to reduce the time, complexity, and cost of software development. A representative example is SIGla, an adaptive open-source LIMS for multiple laboratories [18], which provides integration between a generic information system and several available workflow engines through standard files. SIGla provides workflow features using the open-source Enhydra Shark editor [19], a graphical tool for designing workflows. Once a workflow is designed, SIGla creates an XPDL file [20] that allows the workflow to be executed in a workflow engine called Shark. In addition to the use of workflow management tools, the Shark framework in SIGla has been extended to support important features, including sequential chaining (where the output of one activity can be used as the input of another) and the repetition of procedures for exception handling. The SIGla authors presented an example in which the integration of workflow tools was not a simple task: in some cases it required not only adaptation but also the development of new features.
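The two extensions mentioned above, sequential chaining and repetition for exception handling, can be illustrated generically as follows (an assumed sketch, not the actual SIGla/Shark API):

```python
# Generic illustration (not the SIGla/Shark API) of sequential chaining, where
# one activity's output feeds the next, and repetition of a failing activity
# for exception handling.

def chain(activities, initial_input, max_retries=2):
    data = initial_input
    for activity in activities:
        attempts = 0
        while True:
            try:
                data = activity(data)      # output becomes the next activity's input
                break
            except Exception:
                attempts += 1
                if attempts > max_retries:
                    raise                  # give up after repeated failures
    return data
```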
Both LIMS approaches (i.e., programming language and workflow engine integration) provide scalability, since there are many ways to integrate and adapt laboratory management systems. However, they do not explicitly support algebraic control-flow properties for workflow systems. The lack of algebraic properties is a disadvantage when complex workflows and mission-critical laboratory routines are involved.
In this paper, we present an alternative LIMS approach, based on process algebra, that allows formal validation of the workflows generated by the CEGH interface. This formal approach not only supports evaluation but can also be used to execute workflow applications.
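As a rough illustration of the kind of algebraic specification such an approach enables (using a generic notation for sequential and parallel composition, which may differ from the notation adopted later in this paper), the SMA test of Figure 1 could be written as:

```latex
% Generic sketch only: ";" denotes sequential composition, "\parallel" parallel composition.
\[
\mathit{SMA} \;=\; \mathit{extraction} \;;\;
\bigl(\mathit{PCR}_{\mathrm{exon7}} \parallel \mathit{PCR}_{\mathrm{exon8}}\bigr) \;;\;
\mathit{analysis}
\]
```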