Advances in whole-genome profiling technologies have revolutionized the field of cancer research. These technologies have facilitated the discovery of potential biomarkers for disease development and progression and have advanced our understanding of the complex, underlying molecular mechanisms that lead to cancer. Reductions in cost have spurred the adoption of next-generation sequencing (NGS) platforms, which offer greater resolution and sensitivity than traditional microarray profiling. At the same time, NGS raises new bioinformatics challenges, both practical (e.g., data storage, computational costs) and theoretical (e.g., defining appropriate statistical measures).
A promising application of NGS is the whole-genome profiling of epigenetic modifications, including DNA methylation. The addition of methyl groups to the 5-carbon position of cytosine bases is a major mechanism of epigenetic regulation that participates in reorganizing chromatin structure and silencing gene expression. Epigenetic alterations, such as tumor suppressor gene hypermethylation and oncogene hypomethylation, are hallmarks of cancer and play a pivotal role in tumorigenesis and disease progression [3, 4].
The DNA methylation profiling approach used in our lab, MethylCap-seq, involves the in vitro capture of methylated DNA with the high-affinity methyl-CpG binding domain of human MBD2 protein and subsequent analysis of the enriched fragments by massively parallel sequencing [5–8]. Benchmarking has shown that MethylCap-seq is more effective at interrogating CpG islands than antibody-based methylated DNA immunoprecipitation sequencing (MeDIP-seq). While optimizing this experimental technique, we recognized two potential issues affecting subsequent data analysis. First, unsuccessful or incomplete capture reactions can result in the sequencing of non-methylated DNA fragments, so that methylation enrichment in a sample is inconsistent or absent altogether. Second, poor sequencing library complexity and CpG coverage limit the statistical power to call differential methylation and, ultimately, the reproducibility of the dataset. Conventional sequencing analysis pipelines often do not include assay-dependent quality control assessments, so spurious samples pass through undetected, reducing analytical power and adding "noise" to downstream analyses.
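The two quantities named above, library complexity and CpG coverage, can be illustrated with a minimal sketch. The definitions used here are assumptions made for illustration only (complexity as the fraction of aligned reads with a distinct position and strand, coverage as the fraction of CpG sites overlapped by at least one fragment); they are not necessarily the metrics computed by the workflow, and the function names are hypothetical.

```python
from bisect import bisect_left


def library_complexity(reads):
    """Fraction of aligned reads that are distinct.

    `reads` is a list of (chrom, start, strand) tuples. Treating identical
    tuples as duplicates is an assumption made for this illustration.
    """
    if not reads:
        return 0.0
    return len(set(reads)) / len(reads)


def cpg_coverage(cpg_positions, fragments):
    """Fraction of CpG sites overlapped by at least one fragment.

    `cpg_positions` is a sorted list of CpG coordinates on one chromosome;
    `fragments` is a list of half-open (start, end) intervals on that
    same chromosome.
    """
    if not cpg_positions:
        return 0.0
    covered = set()
    for start, end in fragments:
        # Indices of CpG sites with start <= position < end.
        lo = bisect_left(cpg_positions, start)
        hi = bisect_left(cpg_positions, end)
        covered.update(range(lo, hi))
    return len(covered) / len(cpg_positions)
```

Under these assumed definitions, a sample with a low distinct-read fraction or a low covered-CpG fraction would be a candidate for closer inspection before differential methylation is called.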
Beyond assay quality, data analysis itself poses practical challenges. The numerous options for file processing and genome alignment mean that any particular strategy requires extensive troubleshooting and optimization. Large file sizes make data visualization exceedingly difficult without expensive commercial software packages or publicly available programs that are intensive in their use of system resources. MethylCap-seq projects in particular would benefit greatly from rapid feedback on overall experimental quality. There is also a lack of workflows for the efficient analysis of large MethylCap-seq datasets containing multiple sample groups. To address these issues, we have developed a scalable, flexible workflow for MethylCap-seq quality control and secondary data analysis that facilitates tertiary analysis of multiple experimental groups and data visualization.
The automated MethylCap-seq workflow has been developed over the course of 200 sequencing runs. The workflow is scalable in that it handles studies of disparate sample sizes. It is flexible in that unique experimental considerations (genome alignment, read bin sizes, test statistics) can be addressed by modifying a small number of operational parameters, which are kept independent of the scripts responsible for automating the workflow. Automation is imperative because of the large number of intermediate steps and temporary files required. The workflow incorporates proven, existing tools where applicable, e.g., for raw read processing and short-read alignment, along with the R environment and third-party libraries. It further takes advantage of high-performance computing systems for parallel batch job submissions, a feature that is important for scalability and computational feasibility. Data visualization is supported by Anno-J, a genome annotation visualization program and web service viewport.
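The separation of operational parameters from automation logic, together with per-sample parallelism, can be sketched as follows. This is an illustration under stated assumptions and not the workflow's actual code: `WorkflowParams`, `bin_counts`, and `run_samples` are hypothetical names, the default values are placeholders, and a local thread pool stands in for the batch job submissions to a high-performance computing system described above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowParams:
    """Operational parameters kept apart from the processing code.

    All defaults are hypothetical placeholders chosen for this sketch.
    """
    genome: str = "hg19"
    bin_size: int = 500
    max_workers: int = 4


def bin_counts(read_starts, params):
    """Count reads per fixed-width genomic bin for one sample."""
    return Counter(start // params.bin_size for start in read_starts)


def run_samples(samples, params):
    """Process every sample independently and in parallel.

    `samples` maps a sample name to a list of read start coordinates.
    The thread pool is a local stand-in for cluster batch submission.
    """
    names = list(samples)
    with ThreadPoolExecutor(max_workers=params.max_workers) as pool:
        results = pool.map(lambda name: bin_counts(samples[name], params), names)
        return dict(zip(names, results))
```

Changing the bin size then means constructing a different `WorkflowParams` instance, for example `WorkflowParams(bin_size=1000)`, with no change to the processing functions. Because each sample is processed independently, the same code applies to studies of any size.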