Patient samples
A total of 71 ovarian cancer samples, 6 adjacent normal tissues, and 20 tissues collected from patients during surgery for benign gynaecological disease were obtained from Tri-Service General Hospital, Taipei, Taiwan. All studies involving human ovarian cancer samples were approved by the Institutional Review Boards of Tri-Service General Hospital and the National Defense Medical Center.
Methylated-DNA capture (MethylCap-seq)
Enrichment of methylated DNA was performed with the MethylMiner kit (Invitrogen) according to the manufacturer's protocol. Briefly, one microgram of sonicated DNA was incubated at room temperature on a rotator mixer in a solution containing 3.5 micrograms of MBD-Biotin protein coupled to M-280 Streptavidin Dynabeads. Non-captured DNA was removed by collecting the beads on a magnet and washing three times with Bind/Wash Buffer. Enriched, methylated DNA was eluted from the bead complex with 1 M NaCl and purified by ethanol precipitation. Library generation and 36-bp single-end sequencing were performed on the Illumina Genome Analyzer IIx according to the manufacturer's standard protocol. Each sample was sequenced on a single lane, for a total of 97 lanes. Additional data sets are presented for demonstration purposes only.
MethylCap-seq experimental QC
The quality control module identifies technical problems in the sequencing data and flags potentially spurious samples. The module is based on MEDIPS [10], an enrichment-based DNA methylation analysis package, and provides rapid feedback to investigators regarding dataset quality, facilitating protocol optimization before resources are committed to a larger-scale sequencing project. Figure 1 illustrates the automated QC workflow. For each aligned sequencing file (e.g., the default output of Illumina's CASAVA pipeline), duplicate reads are removed (a correction for potential PCR artifacts), and a stripped, uniquely aligned sequence BED file is loaded into an R workspace for processing by the MEDIPS library. Three analyses are performed on the data: saturation analysis, CpG enrichment calculation, and CpG coverage analysis. Saturation analysis estimates sequencing library complexity and potential reproducibility by computing the Pearson correlation coefficient between coverage profiles generated from subsets of the reads. The CpG enrichment score is the relative CpG dinucleotide frequency of the aligned sequence reads divided by the relative frequency of CpG dinucleotides in the reference genome. The CpG coverage rate (5×) is the fraction of CpG dinucleotides in the reference genome sequenced at least five times. The automated workflow produces a QC summary file containing the MEDIPS results and sequencing output metrics from the Illumina CASAVA pipeline. The QC module utilizes the parallel processing capability of a supercomputing environment to greatly decrease the time required for analysis.
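The two CpG metrics above can be sketched on toy data as follows. This is an illustrative stand-in only: the actual workflow computes these metrics with the MEDIPS R package, and the function names and inputs here are assumptions.

```python
# Illustrative sketch of the CpG enrichment and coverage QC metrics.
# The real workflow computes these with the MEDIPS R package.

def relative_cpg_freq(seqs):
    """Relative CpG dinucleotide frequency across a set of sequences."""
    cg = sum(s.count("CG") for s in seqs)
    dinucs = sum(max(len(s) - 1, 0) for s in seqs)
    return cg / dinucs if dinucs else 0.0

def cpg_enrichment(read_seqs, genome):
    """Reads' relative CpG frequency divided by the genome's."""
    return relative_cpg_freq(read_seqs) / relative_cpg_freq([genome])

def cpg_coverage(read_intervals, genome, min_depth=5):
    """Fraction of genomic CpG sites covered by >= min_depth reads."""
    depth = [0] * len(genome)
    for start, end in read_intervals:
        for i in range(max(start, 0), min(end, len(genome))):
            depth[i] += 1
    cpg_sites = [i for i in range(len(genome) - 1) if genome[i:i + 2] == "CG"]
    if not cpg_sites:
        return 0.0
    return sum(depth[i] >= min_depth for i in cpg_sites) / len(cpg_sites)
```

An enrichment score well above 1 indicates successful capture of CpG-rich fragments, which is the property the QC module checks.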
Sequence file processing and alignment
The ability to use multiple custom sequence alignment policies facilitates analysis of various genomic regions and features. Bowtie, a short read aligner, provides many alignment policies and options that allow a great deal of customization of the alignment output [11]. While our focus and workflow center on reporting uniquely aligned reads, alternative alignment options can be used for more customized data analysis. To prepare for a uniquely aligned Bowtie run, the qseq files are first converted to FASTA format. The converted file is then aligned by Bowtie with options that optimize for uniquely aligned reads and produce SAM output. Post-processing uses various SAMtools [12] commands to convert the alignment to BAM format and remove all duplicate reads before converting back to a final SAM alignment. The workflow, illustrated in Figure 2, is handled concisely by a single script, which passes each intermediate stage of the alignment process to the subsequent stage and outputs a single SAM alignment file along with a report of the number of reads aligned and the number flagged as duplicates. Speed is increased by Bowtie's multithreading options and by performing the alignment in a supercomputing environment. To achieve alternative alignments, Bowtie options can be changed, and different genomic sequences or subsets of genomic sequences may be used for alignment. With minor modification, our workflow can be run with other short read aligners that generate SAM output.
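A minimal sketch of how a driver script might assemble this pipeline's commands is shown below. The tool names come from the text, but the specific Bowtie and SAMtools options shown are assumptions and may differ from those used in the published workflow.

```python
# Hypothetical command assembly for the alignment workflow; the exact
# Bowtie/SAMtools options in the original pipeline may differ.

def bowtie_command(index, fasta_in, sam_out, threads=4):
    # -f: FASTA input; -m 1: discard reads with more than one reportable
    # alignment (keeps only uniquely aligned reads); -S: SAM output.
    return ["bowtie", "-f", "-m", "1", "--best", "--strata",
            "-p", str(threads), "-S", index, fasta_in, sam_out]

def samtools_commands(sam_in, prefix):
    # SAM -> BAM -> sorted BAM -> single-end duplicate removal -> final SAM
    # (legacy samtools 0.1.x command syntax, matching the era of the tools).
    return [
        ["samtools", "view", "-bS", "-o", prefix + ".bam", sam_in],
        ["samtools", "sort", prefix + ".bam", prefix + ".sorted"],
        ["samtools", "rmdup", "-s", prefix + ".sorted.bam", prefix + ".rmdup.bam"],
        ["samtools", "view", "-h", "-o", prefix + ".final.sam", prefix + ".rmdup.bam"],
    ]
```

Each command list would be handed to a process runner (e.g., `subprocess.run`), with the output of each stage feeding the next, mirroring the single-script design described above.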
Global methylation analysis workflow
The methylation analysis workflow is outlined in Figure 2. Chromosomal coordinates of sequence reads are parsed from the final alignment output, then counted using a specified bin size and read extension length (reflecting the average fragment size) to generate a binary file containing bin counts and scaled count values (reads per million, rpm). The bin size determines the computational resolution of the analysis; we find that a bin size of 500 bp provides sufficient resolution while statistically smoothing the data. The binary counts file is next interrogated by genomic feature (e.g., CpG islands, CpG shores, RefSeq genes) to generate feature-specific count files. The workflow is compatible with custom feature files listing genomic loci of interest in BED format. In addition, aggregate read count summaries can be compiled for each type of genomic feature. Our approach of binning aligned reads, scaling read count values, and generating genomic feature-specific count files could prove applicable to other enrichment-based sequencing methods. For instance, the process that filters counts by genomic features might be modified to accept ChIP-seq peak calls.
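The binning and rpm-scaling step can be sketched as follows. This is a simplified illustration: the real workflow writes binary count files and works on full chromosome coordinates, and the strand-extension logic here is an assumption.

```python
from collections import defaultdict

def bin_reads(reads, bin_size=500, extension=150):
    """Count strand-extended reads into fixed-size bins and scale to rpm.

    reads: list of (chrom, pos, strand) with pos taken as the 5' alignment
    position; each read is extended to the average fragment length (a
    simplification of the real read-extension logic) and credited to every
    bin it overlaps.
    """
    counts = defaultdict(int)
    for chrom, pos, strand in reads:
        lo = pos if strand == "+" else max(pos - extension, 0)
        hi = pos + extension if strand == "+" else pos
        for b in range(lo // bin_size, max(hi - 1, lo) // bin_size + 1):
            counts[(chrom, b)] += 1
    scale = 1e6 / len(reads)  # reads-per-million scaling factor
    rpm = {key: n * scale for key, n in counts.items()}
    return counts, rpm
```

Feature-specific count files then amount to selecting the bins whose coordinates fall inside each BED-format locus of interest.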
Once the samples are binned and genomic features are extracted, the samples are grouped based on biological factors, such as known genotype differences, and statistical tests are performed to determine whether there are significant differences in methylation counts among the predefined groups. Each locus of a genomic feature in one group is tested against the same locus in the other group, for all loci in that genomic feature. The statistical test used depends on the number of groups: for two groups, a Wilcoxon rank-sum test is employed to compare the distributions of methylation counts at each locus, and significant differentially methylated loci are selected by applying a multiple-testing-corrected p-value cutoff (False Discovery Rate, FDR); for more than two groups, the Kruskal-Wallis test is employed. Statistical testing of genomic features is a custom workflow implemented in R that utilizes the predefined Wilcoxon and Kruskal-Wallis test functions. The output of the workflow is a list of loci from each genomic feature that pass significance testing. Boxplots are also created for the significant loci to visualize their differential methylation.
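For the two-group case, the per-locus test and FDR correction can be sketched with a normal-approximation rank-sum test. This is a standard-library stand-in for R's wilcox.test and p.adjust, not the workflow's implementation; it omits the tie-variance and continuity corrections that R applies.

```python
from math import erf, sqrt

def midranks(values):
    """Ranks 1..n with tied values given their average (mid) rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value, normal approximation."""
    n1, n2 = len(x), len(y)
    w = sum(midranks(list(x) + list(y))[:n1])  # rank sum of group x
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = abs(w - mu) / sigma
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, prev = [0.0] * m, 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        prev = min(prev, pvals[i] * m / (rank + 1))
        adjusted[i] = prev
    return adjusted
```

Loci whose adjusted p-value falls below the chosen FDR cutoff would be reported as differentially methylated.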
To assess genome-wide changes in methylation patterns between experimental groups, we calculate a Global Methylation Indicator (GMI) for each individual sample. First, the sample's methylation distribution, the average rpm for each CpG content classification, is determined. The distribution is obtained as follows: each 500-base bin is classified by its CpG content (the number of CG dinucleotides within the 500 bases it covers, counting any CG that straddles the boundary between the bin and the next). Then, within each CpG content classification, the average rpm per bin is calculated by summing the rpms and dividing by the number of bins. The difference between two groups is assessed with a t-test on the estimated area under the curve for each individual sample.
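The per-sample distribution and its area under the curve can be sketched as below. This is illustrative only: the bin size is shrunk for the example, and the trapezoidal rule stands in for whatever AUC estimator the workflow actually uses.

```python
from collections import defaultdict

def methylation_profile(genome, bin_rpm, bin_size=500):
    """Average rpm per CpG-content class.

    bin_rpm maps bin index -> rpm; each bin's CpG content is the number of
    CG dinucleotides in its sequence, with one extra base included so a CG
    straddling the boundary to the next bin is counted.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for b, rpm in bin_rpm.items():
        cg = genome[b * bin_size:(b + 1) * bin_size + 1].count("CG")
        sums[cg] += rpm
        counts[cg] += 1
    return {cg: sums[cg] / counts[cg] for cg in sums}

def profile_auc(profile):
    """Area under the (CpG class, mean rpm) curve, trapezoidal rule."""
    xs = sorted(profile)
    return sum((profile[xs[i]] + profile[xs[i + 1]]) / 2 * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))
```

Each sample yields one AUC value, and the group comparison is then a t-test on those per-sample values.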
Clustering
To identify novel classifications of samples independently of predefined biological factors, unsupervised clustering of the data can be implemented. The clustering workflow takes a data matrix of the samples and the rpm value of each locus for a given genomic feature. It is implemented in R and utilizes several R libraries: matrix-manipulation utilities, flashClust, and pvclust for unsupervised clustering. Adjusted p-values are obtained via multiscale bootstrap resampling of the data. Many combinations of correlation measures and clustering methods can be implemented; our clustering workflow uses the Pearson correlation distance measure. It takes as input either the "raw" rpm values or rescaled rpms, depending on the features of interest in the dataset. Rescaling divides the rpms of each locus by the average rpm for that locus, which allows Pearson correlation to weight low and high rpm values evenly; using raw rpms causes Pearson correlation to weight high rpms more heavily. The default clustering method of the workflow is McQuitty's (WPGMA), but R provides a number of additional choices. Our workflow also implements data selection criteria that enforce a minimum coefficient of variation (CV) threshold in combination with a minimum average rpm threshold for each locus. In tandem with the dendrograms, heatmaps are produced to help visualize the relationships among the clustered samples. The entire workflow, including all combinations of selection criteria and all genomic features of interest, is completed by a single script.
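The rescaling, distance, and locus-selection steps can be sketched as follows; this is a standard-library illustration of the preprocessing only, since the clustering itself is done in R with flashClust and pvclust.

```python
from math import sqrt

def rescale(matrix):
    """Divide each locus (column) by its mean rpm across samples so that
    low- and high-rpm loci are weighted comparably by Pearson correlation."""
    n, m = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(m)]
    return [[row[j] / means[j] if means[j] else 0.0 for j in range(m)]
            for row in matrix]

def pearson_distance(a, b):
    """1 - Pearson correlation between two sample profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return 1 - cov / sqrt(var_a * var_b)

def passes_filter(locus_values, min_cv, min_mean):
    """Selection criterion: minimum coefficient of variation and mean rpm."""
    n = len(locus_values)
    mean = sum(locus_values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in locus_values) / (n - 1))
    return mean >= min_mean and mean > 0 and sd / mean >= min_cv
```

Only loci passing the CV and mean-rpm thresholds enter the distance matrix, which is then fed to the hierarchical clustering routine.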
Because a variety of dendrograms is produced through the use of different genomic features and locus selection criteria, it is useful to check whether the membership of a significant group is conserved across dendrograms created with other genomic features, or within a genomic feature analyzed under varying selection criteria. To visualize a sample group's membership in other dendrograms, we use distinct colors to track that group through the alternative dendrograms produced for different genomic features and selection criteria. Tracking is accomplished by supplying the group's membership to a color function that is applied to subsequent dendrograms through the dendrapply function in R.
Data visualization
In our workflow, we have incorporated Anno-J, a REST-based Web 2.0 application for the visualization of deep sequencing data and other genomic annotations [13]. Anno-J streams all necessary applets and scripts to the client, providing immediate, installation-free viewing within the user's web browser. This facilitates fast, interactive visualization of multiple data sets by users with access to any server hosting Anno-J. Data visualization within Anno-J uses tracks, discrete rows of graphs, each of which corresponds to a particular data set. Our workflow incorporates a number of custom scripts that quickly convert binary and raw text read counts and SAM files to various Anno-J track formats, including standard mask and read tracks. These scripts extract the location and rpm from read count files, and the location, sequence count, and strand identifier from SAM files, and generate Anno-J read track format files. With minor modification, the scripts could generate data tracks compatible with the UCSC Genome Browser. For the Anno-J experiment tracks, a scheduled service loads any new files from a shared folder into our database using a prescribed data format. Each track is assigned a unique identifier and properties for experiment type (e.g., methylation, small RNA) and track type (e.g., read, mask). The Anno-J web application configures the browser with specified tracks based on these properties. The browser calls web services that return formatted data for each track, filtered by the currently viewable portion of the chromosome.
Additionally, we have incorporated a custom algorithm that converts binary and raw text read counts files to a custom discretized methylation heatmap track format. The heatmap track format modifies constraints and features of the Anno-J mask track format to allow generation of individual rows of heatmap data. Discretized methylation heatmap tracks are created by percentile-ranking the binned rpm values from binary or raw text read counts files and then assigning a color from a gradient based upon rank. The final discretized heatmap is generated by stacking multiple heatmap tracks together.
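The percentile-ranking and color-assignment step can be sketched as below. The Anno-J heatmap track syntax itself is not reproduced, and the two-color gradient and tie handling here are assumptions.

```python
def percentile_ranks(values):
    """Rank each value as a fraction in 0..1 (ties take the rank of
    their first occurrence in sorted order)."""
    s = sorted(values)
    n = len(values)
    return [s.index(v) / (n - 1) if n > 1 else 0.0 for v in values]

def lerp_hex(low, high, t):
    """Linear interpolation between two '#rrggbb' colors."""
    a = [int(low[i:i + 2], 16) for i in (1, 3, 5)]
    b = [int(high[i:i + 2], 16) for i in (1, 3, 5)]
    return "#" + "".join(f"{round(x + (y - x) * t):02x}" for x, y in zip(a, b))

def heatmap_colors(rpms, low="#ffffff", high="#0000ff"):
    """One gradient color per bin, by percentile rank of its rpm."""
    return [lerp_hex(low, high, t) for t in percentile_ranks(rpms)]
```

Stacking one such color row per sample yields the discretized heatmap described above.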