EpiMOLAS: an intuitive web-based framework for genome-wide DNA methylation analysis

Background DNA methylation is a crucial epigenomic mechanism in various biological processes. Using whole-genome bisulfite sequencing (WGBS) technology, methylated cytosine sites can be revealed at the single nucleotide level. However, the WGBS data analysis process is usually complicated and challenging. Results To alleviate the associated difficulties, we integrated the WGBS data processing steps and downstream analysis into a two-phase approach. First, we set up the required tools in Galaxy and developed workflows to calculate the methylation level from raw WGBS data and generate a methylation status summary, the mtable. This computation environment is wrapped into the Docker container image DocMethyl, which allows users to rapidly deploy an executable environment without tedious software installation and library dependency problems. Next, the mtable files were uploaded to the web server EpiMOLAS_web to link with the gene annotation databases that enable rapid data retrieval and analyses. Conclusion To our knowledge, the EpiMOLAS framework, consisting of DocMethyl and EpiMOLAS_web, is the first approach to include containerization technology and a web-based system for WGBS data analysis from raw data processing to downstream analysis. EpiMOLAS will help users cope with their WGBS data and also conduct reproducible analyses of publicly available data, thereby gaining insights into the mechanisms underlying complex biological phenomenon. The Galaxy Docker image DocMethyl is available at https://hub.docker.com/r/lsbnb/docmethyl/. EpiMOLAS_web is publicly accessible at http://symbiosis.iis.sinica.edu.tw/epimolas/.


System Requirements
DocMethyl is available at https://hub.docker.com/r/lsbnb/docmethyl/. We have tested the whole process and provided a demonstration dataset for the Galaxy Docker container on an Ubuntu (16.04 64-bit) server with four-core CPU, 16 GB of RAM, and 400 GB of data storage. The elapsed time on single thread mapping mode for the demonstration dataset (raw data size 5 GB) is about 10 hours and ~50 GB of intermediate files are generated through the workflow. We recommend more data storage for large datasets.

Installation Steps
Before starting, please have the Docker engine ready and note that all the descriptions here are the command line instructions on a daemon launched session. Check the following list: Step 3. DocMethyl now is launched and the Galaxy Server will start at localhost (DOCKER), accessible through web browser at port 8080 ( Figure S1). (http://docker_IP:8080).
Step 4. Login to the Galaxy using the default user account 'epimolas@galaxy.org' and password 'epimolas', and run the built-in workflows. For Galaxy administration purposes, login to the server using the account 'admin@galaxy.org' and password 'admin@galaxy'. Please note that any manipulation of the Galaxy settings will not be carried over to a restart session of the Docker container.
Trimmed pairs containing reads less than 20 bp in with length will also be excluded.

Quality Control (QC) of Raw Reads: FastQC
(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) -It provides a trimmed read quality report that users can access quickly on screen for the read data, such as GC content, length distribution, and overrepresented sequences.  About the mtable 5-methylcytosine (5mC) is the best known modified nucleotides in the genome. To catch the essence of the genome methylation status and to meet the efficiency for performing the analysis online, we introduced a straightforward method to measure the methylation landscapes of genes and promoters with regard to the sequence contexts.

Map Reads on Genome and
The DNA methylation level for an individual cytosine is estimated using Equation (1.1).
The methylation landscape of a promoter or a gene body is scored by the average of each observed C in all sequence contexts, as calculated using Equation (1.

Execute the Workflow by Selection of a Suitable Dataset
We built two Galaxy workflows to meet the need to process raw data in paired-end format (DocMethyl-PE) or single end (DocMethyl-SE) format respectively. These two workflows can be found in "WGBS" of the menu (left panel) in Galaxy/ DocMethyl. To run the workflow, users should specify read files and the target genome information (the genome sequences in fasta and genome's annotation in gtf) for the run ( Figure S5) in the dialog box and submit the job to Galaxy server. Please make sure that the raw files are in a normal FASTQ format. Compressed read files in format *.gz or *.bz2 are acceptable. Once a DocMethyl job starts, the steps of the query will be listed in right panel on the Galaxy web interface ( Figure S6).
A galaxy workflow can take files from client desktop; however, this is not applicable in most cases. To use big files in Galaxy and DocMethyl, please refer to the next Section (1.5 How to upload files) for guidance.

How to upload files
There are complicated and diverse settings to make the Docker-wrapped Galaxy default FTP server work in various network environments. Therefore, we provided two shortcut solutions to use large files in Galaxy/ DocMethyl without manipulating the system configurations. Considering that most WGBS analyses require a deep read coverage of the genome, which is above the limit of file size in a browser uploading session (around 2 GB), we suggest users can go straight to Solution B, especially B2, to use files uploaded in the same server that hosts the DocMethyl Docker.
Solution A. For files with a size less than 2 GB, find "WGBS" in the Galaxy menu panel and use "upload file/ choose local file" to send the file to the server. B2. Connect to the Linux server that hosts the DocMethyl Docker container via a FTP software tool. Create a file directory "galaxy_guest" in the same folder the Docker image being deployed. For example, if "docker pull" command is executed in the path "$~/home/user/test_docker/", the correct path of the FTP folder will be "$~/home/user/test_docker/galaxy_user/". This folder will be mounted as the deployed DocMethyl/ Galaxy FTP default file folder. Now, from the "upload file -> choose FTP file" in "WGBS" on the Galaxy menu (the left panel of the browser interface), you can find the files that were uploaded into this /galaxy_user/ folder ( Figure S7). Figure S7. Find files in a local directory that is mounted as a FTP directory.

Transfer Outputs to EpiMOLAS_web
As shown in Figure S6, a DocMethyl job triggers thirteen continuous steps and produces the output mtable in the last step. One mtable file will be derived from one submitted BS-Seq/ WGBS read dataset. For example, if an experimental design includes two conditions with three replicates each, there will be six mtables in total derived from six DocMethyl jobs of six WGBS datasets. Users can download the mtable of each job from Galaxy right panel; these files are compatible with the EpiMOLAS web application, EpiMOLAS_web. For users who run the DocMethyl workflow on a genome other than human, mouse, or Arabidopsis, the bisulfite mapping reports from Bismark are available in the previous step (i.e., step 12 in Figure S6). Please note that these files may be large and will take a longer time to download.

EpiMOLAS_web system
EpiMOLAS_web (http://symbiosis.iis.sinica.edu.tw/epimolas) is a web service that links the summary of WGBS data (mtable) with a rich annotation databases for human, mouse, and Arabidopsis ( Figure S8). The data uploading process is a wizard guided method that leads users to create a private or open-accessible website in EpiMOLAS_web. One important issue is the compatibility of user's data. The format mtable is described in Section 1. 3 is also available upon request. Figure S8. The web portal of EpiMOLAS_web.

Browse existing Projects
Before users' data becomes ready, there are Demo Sites (human, mouse, and Arabidopsis) that allow users to try the analysis, or users can find other established projects in the system if the project owners (data submitters) set the website release status as "open to public" (Figure S9 & S10).

Create a Project
Mtables generated by the DocMethyl workflow in the Galaxy/DocMethyl Docker or by EpiMolas.jar alone can be submitted to the EpiMOLAS_web server to create a project which is a website that joins the data to the annotation database and the analysis toolkit. The six measurements are summarized by the combination of three sequence contexts (CG, CHG, and CHH) and two gene regions (promoter and gene body) ( Figure   S6). The general principle is depicted in Figure S11. Figure S11. Illustration of throughput analysis from raw data to creating a project.
The option "New Submission" can be found on the navigator bar. Users should choose the compatible genome and load files first. Clicking on the "demo" icon beside the genome items will start a short tutorial about the data uploading process ( Figure S12).
Alternatively, users can try the "Load example data" button beside the dataset uploading box and use the demo dataset to run through the data deployment process to the resulting website. Gene IDs in each uploaded mtable will be matched to the EpiMOLAS_web database and return the matching rate upon the file uploading process; the sample label should be assigned here. Click on "submit" and the server will generate a brief summary of this submission ( Figure S13). For a registered user, we require the access authority in the next step, then users will be lead to subsequent steps to fill in all the required information, e.g., registered user's email, a password (for validating this registered user), the website access policy (i.e., open to public, private use, or restricted to people with a secret word shared by the project or website creator). For non-registered users, less information is required, however, there is also no detailed website policy options available; therefore, the created website will be held for one month only. When all the required information has been provided and submitted, system will start processing the data linkage and construct a web portal for this submission ( Figure S14 & S15).    For more details about new project creation, please visit our online documentation.

Generate Gene Sets
A general strategy to explore biological meaning in high-throughput, genome-wide scale experiments is to find a set of genes that is associated with a particular function.
The gene set may be derived from a quantitative assay based on a comparison, for example, to find genes that meet the criteria of "having difference between two experiment groups", such as a cut-off ratio or a subtraction result. Other ways to define gene sets are canonically defined sets of genes such as gene ontology (GO) terms,  Figure S16).

Find Genes According to User's Interests
In the module "Full-Text Search", users can search the annotation database by Gene ID, gene symbol, keywords in the gene description, and by KEGG pathway name. In the modules "KEGG GlobalView", users can browse KEGG pathway lists and find genes on the interactive map. The function "save as a list" in the pathway list view is available to grasp the genes in a KEGG map as a gene list for other analyses ( Figure S17).
In the module "Import Genelist", users can copy and paste a list into the query form or can upload a plain text file containing a gene list in Ensembl gene IDs or official gene symbols. Figure S17. An example of a full text search on the annotation database.

Find Genes by an Arithmetic Calculation
Module DMGs (differentially methylated genes) is a pairwise comparison workflow between two data pools, where single or multiple datasets can be assigned to one data pool (e.g., an experimental condition). Through customized and flexible parameter settings, genes that fulfill the criteria are selected. Module mC Threshold is used to select genes above or below a cutoff in at least one dataset, or among all datasets.
These two modules provide different ways to extract the methylation signatures of the six combinatorial sequence contexts and regions. C D

Find Genes by KEGG GlobalView
KEGG Pathway maps provide users with a knowledge-based view of biological processes. Users can find a KEGG map by a pathway name search or by browsing. In a KEGG pathway map, each KEGG component box highlighted in green means that the content of this KEGG component matches the user's uploading data. Clicking on the component will lead to a summary page of the methylation landscapes of the genes (gene body and promoter for CG, CHG, CHH) for all mapped genes in this component ( Figure S19). In addition, users can store gene sets in specific KEGG maps and perform versatile gene set analyses as described previously (See 2.

Find Genes According to
User's Interests). Figure S19. Finding genes by a specific pathway view. A partial list shown in the module "KEGG GlobalView". Click on the pathway link, such as "00051 Fructose and mannose metabolism" in this case, and system will return the corresponding KEGG map.
Users can examine the methylation landscapes of genes for each map component. Figure S21. Enriched GO terms in biological process. Figure S22. Enriched KEGG pathways for the given gene sets.

Protein-Protein Interaction Network analysis
Here, we implemented a protein-protein interaction network (PPIN) viewer for integrating, visualizing, and analyzing gene list members in the protein network context in the system ( Figure S23). This java plugin is based on Cytoscape.js [1], which supports network analysis and visualization. We use the BioGRID protein interaction data (version 3.4.159) to build the network topology. As a protein network graph, each node represents a protein, and each edge represents a known interaction between two nodes. We provide various network layout options and use the node size as a key In addition, the option "shortest path level" will expand the subnetwork to include connected paths that require only one or two extra stepping nodes (neighboring proteins) between any two nodes in the original input list. Recruited neighboring genes and edges for this expansion are discriminated in red.  Export: The selected gene list or network can be exported as a Cytoscape JSON file, a text file of binary protein interactions, or an image in PNG or JPG format.

Hierarchical Clustering Heatmap
The heatmap function is used to display a measurement from each sample with unsupervised hierarchical clustering in both samples and the selected gene lists. We integrated the interactive clustered heatmap visualization generated by Clustergrammer [3] into EpiMOLAS_web. The number of genes used to draw a heatmap is limited to 3000 because of the efficiency and rationality. The row annotations are the gene symbols with the dendrogram, while the column annotations are the samples with the dendrogram showing clusters between samples. We downloaded several human primordial germ cell WGBS datasets from this paper [4] for the following demonstration ( Figure S24).

Venn diagram and Circos plot
Venn diagram is a graphical way to manipulate gene lists as sets. It is an interactive and intuitive visual way to find subsets that are either overlapping or exclusive to the gene sets of interest. It can produce a diagram to compare up to four gene sets ( Figure S25).
The Circos plot visualization module [5] is used to label gene locations (the chromosomal coordination) of all genes in the selected list(s). This circular genome data visualization supports a variety of plot types. In Figure S26, chromosomes are shown in the outermost circle, and the innermost circle shows the genomic coordinates of selected genes. The middle circle with a color gradient represents the density of the coding genes. Each bin size for counting the coding genes is 1 MB base pairs.