Methods for high-throughput MethylCap-Seq data analysis
- Benjamin AT Rodriguez†1,
- David Frankhouser†1,
- Mark Murphy†1,
- Michael Trimarchi1,
- Hok-Hei Tam1,
- John Curfman1,
- Rita Huang2,
- Michael WY Chan3,
- Hung-Cheng Lai2,
- Deval Parikh1,
- Bryan Ball1,
- Sebastian Schwind1,
- William Blum1,
- Guido Marcucci1,
- Pearlly Yan1Email author and
- Ralf Bundschuh4Email author
© Rodriguez et al.; licensee BioMed Central Ltd. 2012
Published: 26 October 2012
Advances in whole genome profiling have revolutionized the cancer research field, but at the same time have raised new bioinformatics challenges. For next generation sequencing (NGS), these include data storage, computational costs, sequence processing and alignment, delineating appropriate statistical measures, and data visualization. Currently there is a lack of workflows for efficient analysis of large, MethylCap-seq datasets containing multiple sample groups.
The NGS application MethylCap-seq involves the in vitro capture of methylated DNA and subsequent analysis of enriched fragments by massively parallel sequencing. The workflow we describe performs MethylCap-seq experimental Quality Control (QC), sequence file processing and alignment, differential methylation analysis of multiple biological groups, hierarchical clustering, assessment of genome-wide methylation patterns, and preparation of files for data visualization.
Here, we present a scalable, flexible workflow for MethylCap-seq QC, secondary data analysis, tertiary analysis of multiple experimental groups, and data visualization. We demonstrate the experimental QC procedure with results from a large ovarian cancer study dataset and propose parameters which can identify problematic experiments. Promoter methylation profiling and hierarchical clustering analyses are demonstrated for four groups of acute myeloid leukemia (AML) patients. We propose a Global Methylation Indicator (GMI) function to assess genome-wide changes in methylation patterns between experimental groups. We also show how the workflow facilitates data visualization in a web browser with the application Anno-J.
This workflow and its suite of features will assist biologists in conducting methylation profiling projects and facilitate meaningful biological interpretation.
Advances in whole genome profiling technologies have revolutionized the field of cancer research. These technologies have facilitated the discovery of potential biomarkers for disease development and progression as well as our understanding of the complex, underlying molecular mechanisms that lead to cancer. Reduction in costs have spurred the adoption of next generation sequencing (NGS) platforms which offer greater resolution and sensitivity compared to traditional microarray profiling . At the same time, NGS raises new bioinformatics challenges, both practical (e.g. data storage, computational costs) and theoretical (e.g. defining appropriate statistical measures).
A promising application of NGS is the whole-genome profiling of epigenetic modifications, including DNA methylation. The addition of methyl groups to the 5' carbon position of cytosine bases is a major mechanism of epigenetic regulation which participates in reorganizing chromatin structure and silencing gene expression , Epigenetic alterations, such as tumor suppressor gene hypermethylation and oncogene hypomethylation, are hallmarks of cancer and play a pivotal role in tumorgenesis and disease progression [3, 4].
The DNA methylation profiling approach used in our lab, MethylCap-seq involves the in vitro capture of methylated DNA with the high affinity methyl-CpG binding domain of human MBD2 protein and subsequent analysis of enriched fragments by massively parallel sequencing [5–8]. Benchmarking has shown MethylCap-seq is more effective at interrogating CpG islands than antibody-based methylated DNA immunoprecipitation sequencing (MeDIP-seq) . While optimizing this experimental technique, we recognized two potential issues affecting subsequent data analysis. First, unsuccessful or incomplete capture reactions can result in the sequencing of non-methylated DNA fragments, leading to inconsistencies in or the absence of methylation enrichment in a sample. Second, poor sequencing library complexity and CpG coverage limit the statistical power to call differential methylation, and ultimately the reproducibility of the dataset. Conventional sequencing analysis pipelines often do not include assay-dependent quality control assessments. Spurious samples reduce analytical power and lead to excess "noise" in downstream analyses.
The challenges to data analysis are real. The numerous options for file processing and genome alignment mean any particular strategy requires extensive troubleshooting and optimization. Large file sizes make data visualization exceedingly difficult without the use of expensive commercial software packages or system resource-intensive publicly available programs. In more practical terms, MethylCap-seq projects, in particular, would greatly benefit from the ability to receive rapid feedback of overall experimental quality. There is also a lack of workflows for efficient analysis of large, MethylCap-seq datasets containing multiple sample groups. To address these pertinent issues, we have developed a scalable, flexible workflow for MethylCap-seq Quality Control and secondary data analysis which facilitates tertiary analysis of multiple experimental groups and data visualization.
The automated MethylCap-seq workflow has been developed over the course of 200 sequencing runs. The workflow is scalable in terms of handling studies of disparate sample sizes. It is flexible in that unique experimental considerations (genome alignment, read bin sizes, test statistics) can be addressed by simple modification of several operational parameters independent of the scripts responsible for automating the workflow. Automation is imperative because of the large number of intermediate steps and temporary files required. The workflow incorporates proven, existing tools where applicable: e.g., raw read processing, the short read aligner, the R environment and third party libraries. It further takes advantage of high performance computing systems for parallel batch job submissions. This feature is important for scalability and computational feasibility. Data visualization is supported by Anno-J, a genome annotation visualization program and web service viewport.
A total of 71 ovarian cancer samples, 6 adjacent normal tissues as well as 20 tissues collected from patients during surgery for benign gynaecological disease were obtained from Triservice General Hospital, Taipei, Taiwan. All studies involving human ovarian cancer samples were approved by the Institutional Review Boards of Triservice General Hospital and National Defense Medical Center.
Methylated-DNA capture (MethylCap-seq)
Enrichment of methylated DNA was performed with the Methylminer kit (Invitrogen) according to the manufacturer's protocol. Briefly, one microgram of sonicated DNA was incubated at room temperature on a rotator mixer in a solution containing 3.5 micrograms of MBD-Biotin Protein coupled to M-280 Streptavidin Dynabeads. Non-captured DNA was removed by collecting beads on a magnet and washing three times with Bind/Wash Buffer. Enriched, methylated DNA was eluted from the bead complex with 1 M NaCl and purified by ethanol precipitation. Library generation and 36-bp single-ended sequencing were performed on the Illumina Genome Analyzer IIx according to the manufacturer's standard protocol. Each sample was sequenced on a single lane, for a total of 97 lanes. Additional data sets are presented for demonstration purposes only.
MethylCap-seq experimental QC
Sequence file processing and alignment
Global methylation analysis workflow
The methylation analysis workflow is outlined in Figure 2. Chromosomal coordinates of sequence reads are parsed from the final alignment output, then counted using a specified bin size and read extension length (reflecting average fragment size) in order to generate a binary file containing bin counts and scaled count values (reads per million - rpms). The bin size determines the computational resolution of the analysis. We find that a bin size of 500 bp provides sufficient analysis resolution while smoothing the data statistically. The binary counts file is next interrogated by genomic feature (e.g., CpG islands, CpG shores, Refseq genes) to generate feature-specific count files. The workflow is compatible with custom feature files listing genomic loci of interest in BED format. In addition, aggregate read count summaries can be compiled for each type of genomic feature. Our approach of binning aligned reads, scaling read count values, and/or generating genomic feature-specific count files could prove applicable to other enrichment-based sequencing methods. For instance, the process responsible for filtering counts by genomic features might be modified to accept ChIP-seq peak calling values.
Once the samples are binned and genomic features are extracted, they are grouped based on biological factors, such as known genotype difference, and statistical tests are performed to discern if there are significant differences in methylation counts among predefined groups of samples. One locus from a genomic feature in one group is tested against the same locus in the other group for all loci in that genomic feature. The statistical test used is dependent on the number of groupings. For two groups a Wilcoxon rank-sum test is employed to test the distribution of methylation counts for each locus across the two groups. We then select significant differentially methylated loci by applying a multiple test corrected p-value cutoff (False Discovery Rate, FDR). Similarly for groupings of more than two biological factors, the Kruskal-Wallis test is employed. Statistical testing of genomic features is a custom workflow implemented in R which utilizes the predefined Wilcoxon and Kruskal-Wallis test functions. The output of the workflow is a list of loci from each genomic feature that passes significance testing. Boxplots are also created for the list of significant features for visualization of their differential methylation.
To assess genome-wide changes in methylation patterns between experimental groups, we calculate a Global Methylation Indicator (GMI) for each individual sample in different groups. First, the sample's methylation distribution, an average rpm for each CpG content classification, is determined. The distribution is obtained as follows: each 500 base bin is classified by the CpG content (# of CG base sequences, counting any CG base sequences straddling the end of the bin and the beginning of the next) within the 500 bases it covers. Then within each CpG content classification, the average rpm per bin is calculated by summing the rpms and dividing by the number of bins. The difference between two groups is calculated by using a t test on the estimation of the area under the curve for each individual sample.
To identify novel classifications of samples independently of predefined biological factors, unsupervised clustering of the data can be implemented. Clustering of the data is a workflow that takes a data matrix of the samples and the rpm value of each locus for a given genomic feature. The workflow is implemented in R and utilizes various R libraries for matrix manipulation, flashClust, and pvclust for unsupervised clustering. Adjusted p-values are obtained via multiscale bootstrap resampling of the data. Many combinations of correlation calculations and clustering methods can be implemented. Our clustering workflow uses the Pearson correlation distance measure. It takes as input the "raw" rpm data values or rescaled rpms, depending on the features of interest in the dataset. Rescaling the rpms involves dividing the rpms of each locus by the average rpm for that locus. This allows Pearson correlation to evenly weight both the low and high rpm values. Using the raw rpms causes Pearson correlation to more heavily weight the high rpms. The default clustering method of the workflow is that of McQuitty, but R provides any number of additional choices. Our workflow also implements data selection criteria that enforce a minimum coefficient of variation (CV) threshold in combination with minimum average rpm threshold for each locus. In tandem with the dendrograms, heatmaps are also produced to help visualize the relationship between the clustering sample members. This entire workflow, including all combinations of selection criteria and all genomic features of interest, is completed in a single script.
Because we produce a variety of dendrograms through the use of various genomic features and loci selection criteria, it is useful to see if the membership of a significant group is conserved throughout the dendrograms that were created using other genomic features and even within genomic features analyzed with varying selection criteria. To easily visualize the location of a certain sample group's membership in other dendrograms, we use different colors to track the membership of that group through alternative dendrograms that are produced for different genomic features and selection criteria. Tracking the membership of a group is accomplished by supplying the membership of that group to a color function that can be applied to subsequent dendrograms through the dendrapply function in R.
In our workflow, we have incorporated Anno-J, a REST-based Web 2.0 application for the visualization of deep sequencing information and other genomic annotations . Anno-J is capable of streaming all necessary applets and scripts to the user, providing immediate and installation-less viewing within a user's web browser. This facilitates the fast, real-time and interactive visualization of multiple data sets by users with access to any server hosting Anno-J. Data visualization within Anno-J uses tracks, discrete rows of graphs, each of which corresponds to a particular set of data. Our workflow incorporates a number of custom scripts which allow quick conversion of binary and raw text read counts and SAM files to various Anno-J track formats, including standard mask and read tracks. These scripts extract from read count files the location and rpm, and from SAM files the location, sequence count and strand identifier, and generate Anno-J read track format files. With minor modification, the scripts could be used to generate data tracks compatible with the UCSC Genome Browser. For the Anno-J experiment tracks, a scheduled service loads any new files from a shared folder into our database using a prescribed data format. Each track is assigned a unique identifier and properties for experiment type (e.g., methylation, small RNA) and track type (e.g., read, mask). The Anno-J web application will configure the browser with specified tracks based on these properties. The browser calls web services which return formatted data for each track, filtered by the currently viewable portion of the chromosome.
Additionally, we have incorporated a custom algorithm which allows conversion of binary and raw text read counts files to a custom discretized methylation heatmap track format. The heatmap track format modifies constraints and features of the Anno-J mask track format to allow generation of individual rows of heatmap data. Discretized methylation heatmap tracks are created by percentile ranking binned rpm values from binary or raw text read counts files, and then assigning color gradient based upon rank. Generation of the final discretized heatmap is a matter of stacking multiple heatmap tracks together.
Results and discussion
Experimental quality control
The automated MethylCap-seq workflow has been developed over the course of 200 sequencing runs. It has been applied to human solid tumors (e.g., breast, ovarian, endometrial, and hepatocellular carcinoma) and blood cancers (e.g., acute myeloid leukemia, chronic lymphocytic leukemia) as well a number of mouse cancer models. Though untested in that context, our analysis workflow should be applicable to other enrichment-based methylation assays such as MeDIP-seq studies.
Differential methylation analysis of multiple sample groups
Assessing genome-wide methylation patterns
MethylCap-seq data visualization
In this paper, we presented a scalable, flexible workflow for performing MethylCap-seq Quality Control, secondary data analysis, tertiary analysis of multiple experimental groups, and data visualization in the web service viewport, Anno-J. As the cancer epigenetics field further expands into next generation sequencing, our workflow should assist biologists in conducting methylation profiling projects and facilitate meaningful biological interpretation.
Based on “A scalable, flexible workflow for MethylCap-seq data analysis”, by Benjamin AT Rodriguez, Hok-Hei Tam, David Frankhouser, Michael Trimarchi, Mark Murphy, Chris Kuo, Deval Parikh, Bryan Ball, Sebastian Schwind, John Curfman, William Blum, Guido Marcucci, Pearlly Yan and Ralf Bundschuh which appeared in Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on. © 2011 IEEE .
This work was supported by NCI Comprehensive Cancer Center Support Grant P30 CA016058 (PY and GM) and CA102031 (GM), as well as 5 P50 CA140158-03 (GM and RB). This work was supported in part by an allocation of computing time from the Ohio Supercomputer Center.
This article has been published as part of BMC Genomics Volume 13 Supplement 6, 2012: Selected articles from the IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS) 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/13/S6.
- Hurd PJ, Nelson CJ: Advantages of next-generation sequencing versus the microarray in epigenetic research. Brief Funct Genomic Proteomic. 2009, 8: 174-183. 10.1093/bfgp/elp013.View ArticlePubMedGoogle Scholar
- Bird A: DNA methylation patterns and epigenetic memory. Genes Dev. 2002, 16: 6-21. 10.1101/gad.947102.View ArticlePubMedGoogle Scholar
- Esteller M: Epigenetics in Cancer. New England Journal of Medicine. 2008, 358: 1148-1159. 10.1056/NEJMra072067.View ArticlePubMedGoogle Scholar
- Baylin SB, Ohm JE: Epigenetic gene silencing in cancer - a mechanism for early oncogenic pathway addiction?. Nat Rev Cancer. 2006, 6: 107-116. 10.1038/nrc1799.View ArticlePubMedGoogle Scholar
- Rauch TA, Pfeifer GP: DNA methylation profiling using the methylated-CpG island recovery assay (MIRA). Methods. 2010, 52: 213-217. 10.1016/j.ymeth.2010.03.004.PubMed CentralView ArticlePubMedGoogle Scholar
- Serre D, Lee BH, Ting AH: MBD-isolated Genome Sequencing provides a high-throughput and comprehensive survey of DNA methylation in the human genome. Nucleic Acids Research. 2010, 38: 391-399. 10.1093/nar/gkp992.PubMed CentralView ArticlePubMedGoogle Scholar
- Brinkman AB, Simmer F, Ma K, Kaan A, Zhu J, Stunnenberg HG: Whole-genome DNA methylation profiling using MethylCap-seq. Methods. 2010, 52: 232-236. 10.1016/j.ymeth.2010.06.012.View ArticlePubMedGoogle Scholar
- Harris RA, Wang T, Coarfa C, Nagarajan RP, Hong C, Downey SL, Johnson BE, Fouse SD, Delaney A, Zhao Y, et al: Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol. 2010, 28: 1097-1105. 10.1038/nbt.1682.PubMed CentralView ArticlePubMedGoogle Scholar
- Bock C, Tomazou EM, Brinkman AB, Muller F, Simmer F, Gu H, Jager N, Gnirke A, Stunnenberg HG, Meissner A: Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat Biotechnol. 2010, 28: 1106-1114. 10.1038/nbt.1681.PubMed CentralView ArticlePubMedGoogle Scholar
- Chavez L, Jozefczuk J, Grimm C, Dietrich J, Timmermann B, Lehrach H, Herwig R, Adjaye J: Computational analysis of genome-wide DNA methylation during the differentiation of human embryonic stem cells along the endodermal lineage. Genome Research. 2010, 20: 1441-1450. 10.1101/gr.110114.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10: R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo Q-M, et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009, 462: 315-322. 10.1038/nature08514.PubMed CentralView ArticlePubMedGoogle Scholar
- Ruike Y, Imanaka Y, Sato F, Shimizu K, Tsujimoto G: Genome-wide analysis of aberrant methylation in human breast cancer cells using methyl-DNA immunoprecipitation combined with high-throughput sequencing. BMC Genomics. 2010, 11: 137-10.1186/1471-2164-11-137.PubMed CentralView ArticlePubMedGoogle Scholar
- Rodriguez BAT, Tam H, Frankhouser D, Trimarchi M, Murphy M, Kuo C, Parikh D, Ball B, Schwind S, Curfman J, Blum W, Marcucci G, Yan P, Bundschuh R: A scalable, flexible workflow for MethylCap-seq data analysis. Genomic Signal Processing and Statistics (GENSIPS), 2011 IEEE International Workshop on: 4-6 December 2011. 2011, 1-4. 10.1109/GENSiPS.2011.6169426.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.