Skip to main content
  • Methodology article
  • Open access
  • Published:

BisQC: an operational pipeline for multiplexed bisulfite sequencing

Abstract

Background

Bisulfite sequencing is the most efficient single nucleotide resolution method for analysis of methylation status at whole genome scale, but improved quality control metrics are needed to better standardize experiments.

Results

We describe BisQC, a step-by-step method for multiplexed bisulfite-converted DNA library construction, pooling, spike-in content, and bioinformatics. We demonstrate technical improvements for library preparation and bioinformatic analyses that can be done in standard laboratories. We find that decoupling amplification of bisulfite converted (bis) DNA from the indexing reaction is an advantage, specifically in reducing total PCR cycle number and pre-selecting high quality bis-libraries. We also introduce a progressive PCR method for optimal library amplification and size-selection. At the sequencing stage, we thoroughly test the benefits of pooling non-bis DNA library with bis-libraries and find that BisSeq libraries can be pooled with a high proportion of non-bis DNA libraries with minimal impact on BisSeq output. For informatics analysis, we propose a series of optimization steps including the utilization of the mitochondrial genome as a QC standard, and we assess the validity of using duplicate reads for coverage statistics.

Conclusion

We demonstrate several quality control checkpoints at the library preparation, pre-sequencing, post-sequencing, and post-alignment stages, which should prove useful in determining sample and processing quality. We also determine that including a significant portion of non-bisulfite converted DNA with bisulfite converted DNA has a minimal impact on usable bisulfite read output.

Background

DNA methylation has a role in the development of eukaryotic organisms [1, 2] and may represent the interface between genome and environment [3]. Currently, the most widely used method for detecting 5-methylcytosine is the treatment of DNA with sodium bisulfite [4], which results in the deamination of all non-methylated cytosine to uracil. Since sodium bisulfite does not convert 5-methylcytosine bases to uracil, this approach allows for the direct interpretation of where methylation has or has not occurred in the genome. This interpretation is complicated by the need to generate percentage methylation statistics per cytosine residue because the same locus can have different methylation levels across cells. This complication underscores the need to have careful metrics to assess experimental procedures.

Bisulfite treatment combined with Next-Generation sequencing (NGS) is the method of choice for the NIH Epigenomics Roadmap [5], a project with a stated goal to map methylation patterns in multiple tissue types. This experimental design underlies the expectation that there are different methylation patterns in different tissues, and there is a high likelihood that cells that make up these tissues themselves have different methylation patterns, even at the same genomic loci. This variation complicates analysis and can lead to high levels of noise, making data interpretation challenging. One major issue for all bis-DNA Next Generation sequencing (BisSeq) experiments is read alignment to a reference genome. This is due to the loss of unmethylated C residues (observed as T residues after sequencing) leading to decreased complexity. Specifically, the longer and more diverse a sequenced read, the more likely it is to align to the genome. Loss of base diversity from the decrease in C bases means that individual reads may appear to align to multiple regions of the genome. A second major reason for alignment difficulties is the diversity of methylation patterns at cytosine loci. In a read containing many C residues, methylation patterns at the same base could be different between reads. This means that reads from the same genomic locus could align to different genomic regions.

Bisulfite conversion of DNA followed by massively parallel sequencing is likely to be the most practical approach to map methylation in the coming years, whether in reduced or complete genomic space [6, 7]. Reduced-Representation bisulfite sequencing (RRBS) uses an MspI digestion (cutting at CVCGG) prior to bisulfite conversion and library preparation to reduce genomic space, resulting in the sequencing of ~2.5% of the genome; however, a thorough analysis of its limitations, and methods to cope with these inadequacies, has not been fully addressed. Early protocols for bisulfite sequencing [8ā€“10] were described for the Illumina Genome Analyzer IIx sequencer, where library preparation was singleplexed. Later protocols [11] switched to multiplexed indexing approaches using Illumina TrueSeq DNA sample preparation. Notably, early protocols focused on library construction, while recent protocols have explored strategies for successful sequencing and analysis [12]; however, there currently lacks a simple, detailed description of multiplex sequencing using conventional tools; complex, hard-to access strategies are exemplified by the proposal to use ā€˜Darkā€™ sequencing [11] in bisulfite sequencing experiments.

The purpose of the current work is to provide a simple BisSeq protocol, with several QC checks throughout that can be used in standard laboratories and to test two main questions: 1) What is the function of pooling non-bisulfite DNA with bisulfite converted DNA in a single lane, and 2) What is the validity of using duplicate reads to calculate coverage statistics? In this work we carefully and completely lay out experimental procedures and QC parameters to assess BisSeq experiments and we suggest that pooling 30% non-bisulfite converted DNA with bisulfite converted DNA has a minimal effect on bisulfite read output, meaning that sequencing non-bisulfite DNA samples from unrelated experiments is practical. We find also that using duplicate reads to calculate coverage is legitimate, but that a simple test to assess whether the total read pool is representative of a read pool with no duplicates should be applied first.

Methods

Library preparation

All tissue samples used in this study were provided by the Brain Endowment Bankā„¢ following protocols approved by the research ethics board of the University of Miami Miller School of Medicine. Brain samples (anterior caudate nucleus) were obtained at autopsy following the principles of the Helsinki declaration and next-of-kin gave written informed consent. We used NEBNext Illumina Library Prep Master Mix kit for the library construction work. All experiments herein use reduced representation bisulfite sequencing; however, most optimizations can be applied to standard bisulfite sequencing. FigureĀ 1 shows a flow chart of library preparation procedures for multiplexed Reduced-Representation Bisulfite Sequencing (mRRBS), which takes on average 7ā€“10 days. FigureĀ 2 shows a molecular level illustration of library preparation after the adaptor ligation stage.

Figure 1
figure 1

Flow chart of library preparation steps.

Figure 2
figure 2

Major RRBS library construction steps. This figure demonstrates adaptor ligation (step 1) and barcode indexing (step 2) for Illumina two-step library preparation, as well as in between steps including bisulfite treatment. We show how the 2-step procedure affects DNA inserts when used for RRBS directional sequencing. First, DNA inserts (underlined) with ā€˜Aā€™ overhangs are ligated to methylated Illumina adaptors (methylated cytosines are marked in bold), meC-PE1 and meC-PE2. Next, adaptor ligated-DNA inserts are bisulfite treated and amplified using primer indPEPCR1F and indPEPCR2R. All unmethylated cytosines deaminate to uracil. We show two cycles of the PCR reaction toamplify bisulfite fragments to show how DNA inserts change after bisulfite treatment and amplification, as well as to track original top (OT) and original bottom (OB) strands. After an appropriate number of cycles (appropriate is defined by the visualization of bands shown in this manuscript in the library preparation stage), bisulfite treated libraries can be indexed, then sent for sequencing. Note that for directional sequencing all sequencing reads are either from the original top (OT) or the original bottom (OB) strands. The first three bases of almost all RRBS reads are either CGG or TGG, depending on their genomic methylation state and this applies to reads generated from both OT and OB strand. Therefore almost every read in a directional RRBS sequencing experiment that use MspI digestion contains at least one CpG at the 2nd and 3rd base positions, plus any internal CpGs (provided they are not in CCGG or CCGG sequences). Internal CpGs can be in CCGG sequence where MspI does not cut when the first C is methylated. Abbreviations: C (Bold): methylated C; p: phosphate; s: phosphorothioate bond. Illustrated insert DNA is underlined. P5 (5ā€² AATGATACGGCGACCACCGA 3ā€²) and P7 (5ā€² CAAGCAGAAGACGGCATACGA 3ā€²) are flow cell attachment sites.

Enzyme digestion

After DNA purification using the QIAamp genomic DNA micro isolation kit from human brain, 2-5 Ī¼g of genomic DNA was used to carry out the MspI (New England Biolabs) digestion at 37Ā°C for 7 hours using 20 units of enzyme per Ī¼g of DNA. Digested DNA was purified by phenol/chloroform (p/c) extraction (49:49:2; phenol:chloroform:isoamyl alcohol). The aqueous top layer containing genomic DNA was precipitated in the presence of NaCl (0.3 M final) and glycogen (25 Ī¼g final). The precipitated DNA was pelleted via centrifugation and washed with 500 Ī¼l 80% ethanol (rinsing the pellet instead of re-suspending) and then centrifuged again at 12000 rpm for 20 minutes to solidify the pellet. The pellet containing MspI fragmented DNA was resuspended in 100 Ī¼l dH20 for the end-repair reaction. Five-ten percent of the purified MspI fragmented DNA was then run on a precast 4-20% gradient polyacrylamide TBEx1 gel and stained with ethidium bromide (EtBr; Invitrogen). This is an important step to verify the enzymatic digestion of DNA; a complete MspI digestion will produce visible satellite bands in a smearing background (FigureĀ 3). A high sensitivity DNA chip to check the completion of MspI digestion can also be used (FigureĀ 3C).

Figure 3
figure 3

Standards for MspI digestion and progressive PCR. A) MspI digestion of human genomic DNA isolated from human post-mortem brain tissues. DNA (200 ng) was digested by MspI and run on a 4ā€“20% precast polyacrylamide gel and stained with EtBr. Arrows show three satellite DNA bands characteristic of this enzymatic digestion. B) Agilent 2100 Bioanalyzer chromatogram of MspI digested genomic DNA. C) Bioanalyzer 2100 image of a single library from an MspI digested DNA sample. Notice that the satellite bands (indicated by arrows) are still visible on the Bioanalyzer image. D) Progressive PCR amplification combined with limited PCR extension time allows for size selection and amplification of six bisulfite converted libraries (Lanes 1ā€“5 are distinct RRBS libraries; lane 6 (ā€˜Cā€™) is a negative control). After different progressive PCR cycles (18X, 22X, 24X, or 26X ā€“ the same libraries are shown for each cycle number) band intensity increases as cycle number increases. Arrows indicate the three satellite DNA bands that are still visible in these libraries.

End repair (Filling-in and dA-tailing)

MspI recognizes double stranded DNA at 5ā€²-C^CGG-3ā€² and cleaves the phosphodiester bonds upstream of CpG dinucleotide. This reaction results in DNA fragments with 5ā€² overhangs, so end repair is necessary to fill-in the 3ā€² termini of each fragment. This way, all MspI digested library fragments should contain a CpG dinucleotide on both ends of the fragment. The NEBNext DNA Library Prep Master Mix Set for Illumina separates the filling-in step from dA-tailing reaction.

Filling- in of MspI rragmented DNA

Using the MspI fragmented DNA (40ā€“85 Ī¼L) described above, the NEBNext End Repair Reaction Buffer (10X; 10 Ī¼L), and the NEBNext End Repair Enzyme Mix (5 Ī¼L), we incubated the solution (final volume of 100 Ī¼l) at 20Ā°C for 30 minutes. After incubation, the reaction mixture was diluted to 200 Ī¼L with dH2O. Next, we added 200 Ī¼L of p/c at room temperature. After ethanol precipitation, DNA was resuspended in 50 Ī¼L dH2O in preparation for the dA-tailing reaction.

dA-tailing of blunt end MspI fragment

The enzymatic process that adds an extra adenosine (A) to both the plus and minus strand 3ā€² termini is referred to as dA-Tailing and is necessary for ligation of the adaptors (which contain a 3ā€² dT overhang). Following the NEBNext DNA library Prep Master Mix Set for Illumina kit manual, we carried out the dA-Tailing reaction. We used 42 Ī¼L blunt-end DNA, 5 Ī¼L of NEBNext dA-Tailing Reaction Buffer (10X), and 3 Ī¼L Klenow Fragment (3ā€²-ā€‰>ā€‰5ā€² exo-), for a total volume of 50 Ī¼L and incubated at 37Ā°C for 30 minutes. Next, p/c extraction and ethanol precipitation were performed following the same steps as outlined above, without the inclusion of glycogen (no glycogen is needed, since previously added glycogen is co- precipitated with genomic DNA). The final volume of dA-tailed MspI fragment is 30 Ī¼l in dH2O.

Methylated adaptor design

We used the two-step Illumina adaptor design for the adapters and the PCR indexing primers because it decouples the indexing reaction from library amplification (FigureĀ 2), and allows for more efficient bisulfite treated DNA library amplification and size selection. We synthesized published Illumina paired-end adaptor oligonucleotides, and had all cytosines replaced with 5ā€²methyl-cytosines in order to prevent the deamination of the adaptor cytosines in the bisulfite conversion reaction. All adaptors and indexing primers are listed in TableĀ 1.

Table 1 Sequences and specific modifications of oligonucleotides used in the BisQC protocol

Illumina methylated Y-adaptor annealing

Prior to adaptor ligation, we carried out an adaptor Y fork annealing reaction by combining equal molar ratios of methylated PE1 and methylated PE2 adaptors. With this, annealed adaptor oligonucleotides can be kept at -20Ā°C for many months before use, provided high temperature and other denaturing conditions are avoided. To perform this reaction, we mixed 50 Ī¼L each of mC-PE1 (25 Ī¼M) and mC-PE2 (25 Ī¼M) in PCR well-plates and carried out the following denaturing and annealing reaction on the thermal cycler: 95Ā°C, 120 s; 80Ā°C, 60 s; 70Ā°C, 60 s; 60Ā°C, 60 s; 50Ā°C, 60 s; 40Ā°C, 60 s; 30Ā°C, 60 s; 4Ā°C indefinitely.

Ligation of methylated Y-adaptor to dA-tailed DNA fragments

For the ligation reaction, we used 25 Ī¼L dA-tailed DNA, 10 Ī¼L NEB Quick ligation Reaction Buffer (5X), 10 Ī¼L pre-annealed Illumina methylated Y-adaptors, and 5 Ī¼L NEB Quick T4 ligase for a total volume of 50 Ī¼L. This was then incubated for 1 hour at 16Ā°C, followed by a second incubation for 30 minutes at 20Ā°C.

Purification of adaptor- ligated library

We used the QiaQuick PCR purification kit for the purification of the adaptor-ligated libraries. We used 250 Ī¼L of QIA buffer PE to carry out the purification process for 50 Ī¼L of the ligation mixture. The purified libraries were then eluted in 100 Ī¼L of dH2O.

Bisulfite conversion

We used the Qiagen EpiTect Fast 96 Bisulfite Kit to carry out the bisulfite conversion of adaptor-ligated library. This single conversion step is reported to be sufficient to achieve a ā‰„99% conversion rate (discussed in more detail in the results section). We used 50% of the purified adaptor ligated library for bisulfite conversion. To purify DNA, the reaction mixture was transferred into a 96-well plate, with a high affinity membrane on the bottom of each well (Qiagen). Buffer BL, bisulfite conversion mixture and ethanol (96%) were added sequentially, mixed, then left to stand for 2 minutes. After spinning, single stranded, bis-converted DNA was bound to the membrane. We then washed twice with buffer BW, than twice with buffer BD to achieve a complete on-membrane desulfonation. This was followed by two more washes with BW then a final elution with buffer EB.

PCR amplification of bisulfite converted libraries

Agilent Pfu turbo Cx Hotstart DNA polymerase has the property of uracil-tolerance and high fidelity DNA polymerization. Amplification of bisulfite-converted libraries was carried out simultaneously with the library size selection process by using the progressive PCR method. IndPEPCR_F (33 nt) and _R (32 nt) (TableĀ 1 and FigureĀ 2) were used as PCR primers.

The detailed PCR reaction mixture (200 Ī¼L volume) is as follows:

10X pfu Turbo Cx Rxn Buffer (Agilent)-20 Ī¼L

dNTP (10 mM each)-4 Ī¼L

IndPEPCR_F (25 M)-1 Ī¼L

IndPEPCR_R (25 M)-1 Ī¼L

Pfu Turbo Cx Hot Start DNA Polymerase-2 Ī¼L

Bisulfite converted RRBS library-50 Ī¼L

dH2O -122 Ī¼L

The Pfu Turbo Cx Hot Start DNA Polymerase should be added last. The detailed amplification cycle is as following:

  1. 1.

    95Ā°C-90 s

  2. 2.

    95Ā°C-30 s

  3. 3.

    60Ā°C-30 s

  4. 4.

    72Ā°C-30 s

  5. 5.

    4Ā°C indefinitely (in 4Ā°C fridge, do not freeze the PCR reaction)

Repeat steps 2ā€“4 for 18 cycles

The final PCR products represent the minimally-amplified and size-selected non-indexed RRBS libraries. We confirmed the correct size amplification by running a 2.0 % HR Agarose Gel (100 mL 1XTAEā€‰+ā€‰2.0 g of HR Agaroseā€‰+ā€‰2.5 Ī¼L of 10mg/mL EtBr) or an Invitrogen 4-20% gradient polyacrylamide gel (1XTAE), or Invitrogen E-gel 2% with SYBR Safe, all after 18 cycles of PCR. Samples showing faint but visible 150ā€“400 base pair (bp) smearing on the gel have the optimal amplification PCR cycles. Satellite DNA bands should also be visible in the smearing background for a well-constructed and optimally amplified RRBS library (FigureĀ 3). Samples that give very faint smearing require additional PCR amplification cycles. Samples requiring extra PCR cycles (using the PCR reaction tubes kept at 4Ā°C) can be returned to the thermal cycler for 2ā€“4 more cycles of amplification following the above progressive PCR protocol. Ten Ī¼L of these ā€˜additional cycleā€™ amplified PCR reaction mixtures can be run on a gel and checked for smearing. Thus, final library selection is determined by visual inspection of gel images for appropriate smearing and satellite band patterns (FigureĀ 3). This is required because of the large variation observed across libraries, even with identical starting DNA concentrations and enzyme digestion times. Finally, we perform a cleanup step with the remaining PCR products from all libraries using AMPure XP SPRI beads (Agencourt, Beckman-Coulter) using 60 Ī¼l/55 Ī¼l (beads/DNA) ratio. After purification, 1 Ī¼L of the purified library was used for quality control using an Agilent Bioanalyzer 2100 High Sensitivity DNA Chips. A good amplified size-selected and purified library produces a smear covering 150ā€“400 bp with visible satellite bands, similar to that seen in FigureĀ 4, with very little primer-dimers.

Figure 4
figure 4

Agilent 2100 Bioanalyzer images from final reduced representation bisulfite libraries. A) High Sensitivity DNA Chip from 10 RRBS libraries. Notice that the satellite bands are still visible on the Bioanalyzer gel image B) Chromatogram representation of panel A showing high quality RRBS libraries.

Indexing

Using 50 Ī¼L of purified, non-indexed library, we performed the following PCR indexing reaction for each library. The PCR primers used for indexing reaction are IndPEPCR_F (33 nt) and Index_#R primer (43 nt)

10X pfu Turbo Cx Rxn Buffer (Agilent)-20 Ī¼L

dNTP (10 mM each)-4 Ī¼L

IndPEPCR_F (25 M)-1 Ī¼L

IndPEPCR_#R (25 M)-1 Ī¼L

Pfu Turbo Cx Hot Start DNA Polymerase-2 Ī¼L

Non-indexed RRBS library-50 Ī¼L

dH2O-122 Ī¼L

The Pfu Turbo Cx Hot Start DNA Polymerase should be added last.

Using the following PCR steps:

  1. 1.

    95Ā°C-90 s

  2. 2.

    95Ā°C-30 s

  3. 3.

    65Ā°C-30 s

  4. 4.

    72Ā°C-30 s

Repeat 2ā€“4 times

  1. 5.

    4Ā°C indefinitely

We used 60Ī¼l/55Ī¼l (beads/DNA) ratio. AMPure SPRI bead-purified library was eluted in 60 Ī¼L of dH2O. Purified libraries can be screened on an Agilent 2100 BioAnalyzer (FigureĀ 4A and B).

Molecular cloning and sanger sequencing

5 Ī¼l of AMPure bead-purified library (total 60 Ī¼l) was used for cloning experiments. The single band product was cloned using chemically competent E.coli cells and the pCRā„¢4-TOPOĀ® TA vector (Invitrogen Cat#450030) using 4 Ī¼l of fresh PCR product, 1 Ī¼l of vector, 1 Ī¼l of salt solution, and ligated for 5 minutes at room temperature. Four Ī¼l of ligated product was added to 50 Ī¼l of competent cells and incubated on ice for 30 minutes, 240 Ī¼l of S.O.C. medium (Invitrogen Cat#15544-034) was added to the cells incubated in a 37Ā°C shaker for 1 hour. 50 Ī¼l of transformed cells was plated on ampicillin agar plates and incubated for 16 hours at 37Ā°C. Prior to sequencing, we assessed some band sizes by an EcoRI digestion of plasmids (FigureĀ 5A). Sequencing was done using rolling circular amplification, a service provided by Genewiz, Inc (South Plainfield, NJ). M13F (-21) sequenced colonies were aligned using LasergeneĀ® SeqMan Pro (DNASTAR, Inc. Madison, WI).

Figure 5
figure 5

Pre-sequencing quality control of bisulfite-converted DNA libraries. A) Cloning and subsequent EcoRI release of library inserts reveals a random size distribution as revealed on an agarose gel. The extra band in some lanes represents enzyme cut sites present in the inserts. B) qPCR dissociation curve of a bisulfite- converted DNA library using primers directed at adaptors to the bisulfite library (blue peak) and PhiX standard DNA library (green peak). Bisulfite converted DNA libraries have a lower melting temperature due to loss of cytosine residues. The qPCR dissociation curve only serves as a general tool to verify bisufite conversion and is not able to distinguish a minor bisulfite conversion problem. C) Sanger sequencing reaction of a single clone insert from A). Plasmid is detectable by presence of cytosine residues; insert begins at CGG residue marked by the arrow. Note the lack of cytosine residues in the insert, except at CpG loci (blue ā€˜Cā€™ peaks under black arrow). Base colors are: C: Blue, G: Black, T: Red, A: Green.

For methylation assessment of a single, targeted locus, a 250 bp amplicon was isolated from the original RRBS library preparation for sample G12. Two Ī¼l of template was amplified using High Fidelity Platinum Taq (Invitrogen Cat #11304-102) and the PCRx enhancer system to facilitate amplification of CG rich template (Cat# 11495ā€“017), with the following thermocycling conditons: 95Ā° for 5 minutes, (95Ā° for 30s, 58Ā° for 30s and 72Ā° for 45 seconds) x 45 cycles, 72Ā° for 7 minutes, 4Ā° indefinitely. Bisulfite primers were designed using Methyl Primer ExpressĀ® (Applied Biosystems), forward 5ā€²- GGG AAG AGT TGG TTA GAG AGA -3ā€² and reverse 5ā€²-AAA ACC CCC TAT AAA AAA ACC C-3, corresponding to (HG19) Chromosome 3: 75, 718, 452ā€“75, 718, 701.

Massively parallel (next generation) sequencing

We used the Illumina HiSeq2000 platform, with single-end 50 cycle sequencing. Sequencing was performed by the McGill University and GĆ©nome QuĆ©bec Innovation Centre in Montreal, Quebec. Data was downloaded onto our servers in FASTQ format. Illumina de-indexed data was first processed using FastQC tool v0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), which provides a user-friendly overview of raw sequencing data. The programs fastx_clipper, fastx_trimmer and fastx_collapser [part of the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)] were used to remove adaptors, filter low quality reads, and remove readsā€‰<ā€‰20 bp in length.

Alignment

There are many excellent aligners for bisulfite sequencing data including Bismark [13], BSMAP [14, 15], and RMAP [16], BS-Seekeer2 [17] and some of these have been recently assessed [18]. We opted to use Bismark [14] for its flexibility and compatibility with downstream processing tools such as MethylKit [19], which we use routinely for post-alignment processing. All user-set features were set to default, except for: --best, --n 2, and ā€“directional. For data visualization, we wrote a custom Perl script to convert Bismark output alignment files into the SAM (Sequence Alignment/Map) format for the visualization of the alignment reads into the Integrated Genomic Viewer [20]. This script is available for download at: http://www.mcgill.ca/psychiatricgenetics/tools-0. Further description of bioinformatic details can be found in the results section.

Results

Post-bisulfite indexing for increased flexibility and improved library purification

Currently, most NGS bisulfite DNA protocols take advantage of the ā€˜one-stepā€™ design, where indexing and adaptor ligation occur in one step, but this creates longer products (Adaptors with indexes are >100 nucleotides for the one-step design as opposed to 64 nucleotides in the two-step design) which influences clean-up procedures and allows less flexibility in library amplification ā€“ an important factor in bisulfite DNA projects because of the variability in DNA concentration after bisulfite treatment. Pre-indexing also requires decisions to be made about what samples to pool together prior to knowing the quality of each library. To devise a post-library indexing strategy, we used the Illumina two-step paired end design (FigureĀ 2) which uses a third indexing primer (TableĀ 1 lists the sequences for all oligonucleotides used in these experiments).

A major benefit to using the two-step Illumina adaptor/indexing technology beyond the increased flexibility, is that they form dimers less than 100 bp, allowing for the use of spin columns rather than AMPure beads; adaptor dimers using this strategy are 65 bp in contrast to the 126 bp dimers generated by the TruSeq protocol. Spin columns recover all fragments >100 bps, whereas AMPure bead purification functions best at high DNA concentrations or with small volume operations. This means that a greater proportion of the library is retained after purification. A disadvantage of using the two-step approach is that fusion products between the adaptor and indexing primer form, and this product needs to be removed (see Additional file 1 for details).

Simultaneous amplification and size selection of bisulfite libraries by progressive PCR

Progressive PCR amplification combined with limited PCR extension time achieves library amplification and size selection at the same time (FiguresĀ 3, 4 and 5). Size selection is achieved because standard Taq DNA polymerase extension speed is on the order of 700ā€“800 nucleotides per minute at 72Ā°C. Instead of time-consuming gel-cutting for size selection, optimized progressive PCR can be performed to amplify and size-select libraries. We set the 72Ā°C extension time to 30 seconds for 150ā€“400 bp libraries. We used 18 cycles of PCR amplification cycles, then transferred plates immediately to 4Ā°C (and never froze samples because extra PCR cycles may be required without adding any fresh enzyme or dNTPs). Next, we ran a small aliquot of the PCR product (5 Ī¼L) and verified the products on an agarose gel with EtBr. When a smearing range from 200ā€“500 bp is nearly visible a minimal amplification condition has been achieved (FigureĀ 3D). If there is no smear visible at all, the complete PCR reaction is returned to the thermal cycler and amplified for an additional 2ā€“4 cycles, hence progressive PCR.

Quality control check of final RRBS libraries

We used cloning and Sanger sequencing to verify the quality of bisulfite conversion as well as library quality (FigureĀ 5). qPCR is used for absolute quantification of the final library and also serves as a check-point for the incomplete bisulfite conversion. As illustrated in FigureĀ 5B, the dissociation curve from a bisulfite-converted RRBS libraries demonstrates a much lower melting temperature (5ā€“8Ā°C lower) as compared to a non-RRBS library (in this case, the qPCR standard PhiX library). This drop in the template melting temperature is caused by the excessive presence of T and A bases in the RRBS libraries. This is not meant as a quantification of conversion rate, but rather as a ā€˜sanityā€™ check - a large shift should be observed in the melting temperature at this step when bis-DNA versus non-bis libraries are compared. We performed real-time PCR using an ABI7000 SDS system (Applied Biosystems) with SYBR Green Master Mix according to the manufacturerā€™s instructions. Primers used were PE-qPCR_F and R (TableĀ 1). FigureĀ 6 demonstrates a raw Illumina QC output from one representative sample. Note the general quality of the library (FigureĀ 6A; mean phred scores are above 30 at all positions), and the low cytosine content and approximately equal cytosines at positions 40ā€“50 as compared to positions 1ā€“10 (FigureĀ 6B). Note also that >95% of reads have a large spike at base position 2 and 3 of a Guanidine dinucleotide. This is caused by the MspI digestion which is expected in any MspI RRBS library.

Figure 6
figure 6

Illumina HiSeq2000 QC of one representative sample. A) Raw quality scores (FASTQ) of reads from one bisulfite treated library (G12; lane 6). B) Graph showing the representation of each base at each position in a 50 base read. Notice the very low level of Cytosine residues compared to the high content of Adenosine residues; non-bisulifte converted DNA shows approximately equal base composition at each site. One metric of C-T conversion rate is the ratio of C residues at bases 1ā€“10 and 40ā€“50. In the graph above, this ratio is about 1, as expected. In RRBS libraries, a large increase in Guanidine at positions 2 and 3 (~95% of reads have GG at these positions) should also be observed.

Descriptive characteristics of methylation patterns across samples post-sequencing but pre- alignment

Prior to beginning any experiment, a visual inspection of CpG distribution can be done. Even if an explicit hypothesis is being tested, the distribution of methylated to unmethylated CpGs is expected to be similar, and major deviations from this can be flagged in BisSeq experiments. For example, FigureĀ 7A, C, E, G suggest that 55-65% of all CpGs have <5% methylation while 25-30% are methylated >95%, at least in DNA derived from human brain tissue. There is also a relatively consistent but low frequency of methylation between CpG residues methylated at 10%-89.99%, with a small but consistent increase when methylation data is binned in 5% blocks in the 45.0-49.99% methylated range. Methylation in this bin is usually 2- to 3-fold higher than neighboring bins (FigureĀ 7A, C, E, G). Finally, coverage per CpG should be uniform across independent samples; compare FigureĀ 7B, D, F, H (the log10 value of the coverage per CpG is consistent across independent experiments).

Figure 7
figure 7

Frequency of methylation per CpG and mean CpG coverage as QC metrics. A, C, E, G) The frequency of methylation at any CpG region has very similar patterns across samples (A, C, E, G). In adult human brain tissue, >50% of all CpGs show 0% methylation and >20% of CpGs are fully methylated. A minority of CpGs lies within the extremes. B, D, F, H) Histograms denote average coverage per CpG in bins and where the X-axis is on the log10 scale (1ā€‰=ā€‰10X coverage). The purpose of performing these analyses is to assess if any sample deviates substantially from other samples. If so, these samples should be carefully re-investigated or re-processed. Here, the histogram pattern across samples B, D, F, and H are very similar.

Calculation of C-T conversion rate

The calculation of C-T conversion rate is a contentious issue in any bisulfite experiment, but is fundamental to accurate calling of methylation. Previous NGS studies using bisulfite converted DNA have reported conversion rates from 97-99% [12]. C-T conversion rates in the current study are 98.7-99.2%ā€‰Ā±ā€‰0.1% across all 12 independent libraries. We calculate this conversion rate by adding the CHH and CHG variables from Bismark, where ā€˜Hā€™ refers to any non-G base in the genome. It may be that methylation occurs at non-CpG sites in the human genome; convincing reports of this have begun to emerge [21, 22] from mammalian genome studies. Normally an all ā€˜Cā€™ DNA fragment could be included as a control in bisulfite sequencing studies, but this downplays the importance of the diversity of fragments in the genome which may effect conversion rates and has its own potential for errors being a homo-polymer. There are several checks that can be performed to assess C-T conversion: 1) an initial cloning experiment should be performed prior to NGS (FigureĀ 3C) to determine how well C-T conversion has functioned, 2) an assessment of the C ratio in bases 1ā€“10 and 40ā€“50 in sequenced reads (the ratio should be approximately 1), and finally 3) an internal negative control can be used to assess C-T conversion (see Visualizing data in IGV and using the mitochondrial (MT) genome to assess C-T conversion).

Visualizing data in IGV and using the mitochondrial (MT) genome to assess C-T conversion

Visualization of aligned reads in the Integrated Genome Viewer (IGV) is a simple method to assess initial accuracy of alignment. Reads should begin at CGG or TGG (in both forward and reverse directions ā€“ see FigureĀ 2), there should be equal balance between forward and reverse reads, and the number of non-methylated related errors should be what was set in the alignment program (usually <2 mismatches). FigureĀ 8 shows an example of what this may look like for one region, but simply looking at different regions is inefficient, so we propose that the mitochondrial genome be used for initial visualization. Mammalian MT genomes are not methylated [23, 24] and have extremely low CpG content [25], thereby providing an internal negative control for DNA methylation experiments. Methylation in the MT genome has also been recently confirmed to be absent from brain tissue [26] which is the tissue source used here. Specifically, alignment to the 16.6 Kb human MT genome allows for an assessment of C-T conversion rate (no Cā€™s are expected to remain post-bisulfite conversion). Analyzing reads that align to the MT genome provides an isolated genomic locus with which to assess how well the alignment procedure has worked.

Figure 8
figure 8

Use of the Integrated Genome Viewer (IGV) showing an example of data at one locus in one subject. A) IGV view of a 45 kb window on chromosome 5. Reads should map to distinct regions (due to MspI digestion), and there should be an equivalent balance of forward and reverse reads. Gray shading represents perfect alignment with the HG19 standard human reference genome. B) Close-up of the region from A) showing that forward reads have a high proportion of red ā€˜Tā€™s and reverse reads have a high proportion of green ā€˜Aā€™s; IGV shows colored bases in gray reads when an error is detected. In this case, the read is perfectly aligned to the standard reference genome, but, due to bisulfite conversion, most ā€˜Cā€™s have converted to ā€˜Tā€™s (forward strand), and this is outputted in the IGV window. Diversity at a locus is partially generated by different methylation levels at specific CpG sites and sequencing errors.

Using imprinted regions and cloning to finalize alignment and library quality assessment

There are many known imprinting clusters in the human genome, though the tissue specificity and extent of methylation at each site are not precisely known [27, 28]. To provide reference maps for expected methylation levels in human brain tissue and to demonstrate how these regions can be used as positive controls for a BisSeq experiment, we began by selecting known imprinted CpG islands previously documented in sperm [29]. Next, we assessed the same CpG islands in our sample using reads across all lanes and combining the data for each subject, though we included two CpG islands from genes that should not be imprinted, in other words, two regions where we did not expect to observe any methylation. TableĀ 2 shows the results from this experiment and we note the relatively narrow ranges across subjects for each locus. To demonstrate this narrow range across subjects we show the mean and standard deviation for each locus (IGF2R:50.8%ā€‰Ā±ā€‰4.7%, GRB10:26.3%ā€‰Ā±ā€‰1.9%, PEG3:36.7%ā€‰Ā±ā€‰6.3%, MEST:35.0%ā€‰Ā±ā€‰1.5%, RB1:59.7%ā€‰Ā±ā€‰2.2%, KCNQ1:37.4%ā€‰Ā±ā€‰1.3%, GAPDH:1.3%ā€‰Ā±ā€‰0.5%, ACTB:1.0%ā€‰Ā± 0.5%). TableĀ 3 shows the number of CpGs in the CpG island that were coveredā€‰ā‰„ā€‰1X and the mean coverage per CpG.

Table 2 Average methylation percentage across several known imprinted genes, with GAPDH and ACTB serving as comparison regions
Table 3 Number of CpG sites within a given CpG island observed and mean coverage per sequenced CpG

Assessing imprinted regions for methylation status is an important check to ensure appropriate mapping (i.e., if percentages deviate significantly form an expected norm, data should be re-assessed), but a gold standard experiment to determine if methylation mapping and percentages are consistent with informatics analysis can also be done. To perform this experiment, we searched for high coverage regions with a mid level of methylation. We avoided regions with either 100% or 0% methylation because these likely would not show the variation across cell DNA that would be needed to test if the computational algorithm was able to accurately detect methylation patterns. To validate the accuracy of our RRBS results we cloned a 250 bp amplicon from the original library preparation of sample G12, which was the only subject sequenced across all lanes. We sequenced 40 individual clones, of those 35 were high quality and aligned to the reference sequence. This region gave us information for 15 individual CpG pairs. The average methylation level across all CpGs was not different between any of the 8 independent lanes of Illumina sequencing, or the cloning and Sanger sequencing (Kruskal-Wallis (9, 4.58) pā€‰=ā€‰0.8013; FigureĀ 9).

Figure 9
figure 9

Comparison of a single locus from one subject (G12) in reads from 8 different lanes and from an independent cloning experiment. Average methylation per CpG in Illumina output data and cloned reads suggest the same methylation levels. This is one method to assess methylation calling in BisSeq computational pipelines.

Significantly increasing the proportion of non-bisulfite converted DNA has a minimal impact on read count from bisulfite-converted libraries

Spike-in of non-bisulfite converted libraries might improve the quality of sequencing reads of bisulfite treated libraries. The reason for adding non-bisulfite converted DNA is that it can increase the performance of an MspI-based RRBS library by increasing the diversity of the first three bases. In theory, all MspI based RRBS library reads start with CGG or TGG (see FigureĀ 2 for why this is) and this can cause a diversity issue on the Illumina HiSeq platform and result in incorrect base calls and lower QC scores. One way to address this issue is to diversify the first 3 bases in DNA fragments by the addition of non-bisulfite converted libraries which have a balanced A, C, G, T ratio.

To determine a reasonable concentration of non-bis DNA to incorporate we used identical libraries across 5 lanes with varying levels of non-bisulfite-converted PhiX library in 1%, 10%, 20%, 30%, or 50% increments. TableĀ 4 shows the experimental design and TablesĀ 5 and 6 show the output data from this experiment for all samples across all lanes. Additional file 1: Table S1 shows clustering information. We first asked if the quality call per base was different when different amounts of PhiX spike-in were used. The mean quality score for all bases from bisulfite converted libraries (Nā€‰=ā€‰4) across each of the 5 lanes was (1% PhiX: 37.58ā€‰Ā±ā€‰0.03, 10% PhiX: 37.56ā€‰+ā€‰0.08, 20% PhiX: 37.44ā€‰Ā±ā€‰0.1, 30% PhiX: 37.18ā€‰Ā±ā€‰0.07, 50% PhiX: 36.42ā€‰Ā±ā€‰0.07), which corresponds to a linear, highly significant (pā€‰<ā€‰0.01 for all comparisons) decrease in quality as PhiX spike-in concentration increased; Importantly, all of these quality scores are excellent (mean phred score >35).

Table 4 Pooling and spike-in design for the analysis of the effects of non bisulfite-converted DNA pooled with bisulfite-converted DNA
Table 5 Output data from sequencing experiment using non-duplicate reads
Table 6 Output data when including duplicate reads

We next determined what the total number of achievable reads was for an experiment; here we used one full lane to sequence one bisulfite-converted library (TableĀ 5; lane 6). We found 4.38 M unique alignable reads in this analysis, with a mean coverage of 4.93X per CpG (coverage definition to follow), using only non-duplicate reads. Given that there wasā€‰~ā€‰150 M reads per lane, sequencing one subject per lane is very inefficient; however, there is a logarithmic relationship between number of reads and number of unique alignable reads in an RRBS experiment. That is, there are diminishing returns as one continues to sequence the same sample ā€“ largely due to the high number of duplicates in each library.

We estimate that approximately >2 M unique reads is the range where ā€˜returnsā€™ (unique reads) significantly diminishes. This is best demonstrated by the samples that were fixed across multiple lanes, but with decreasing concentration as the proportion of non-bis DNA increased. We take one sample from TableĀ 5 to demonstrate this: sample G12 yielded 2.31 M, 2.35 M, 2.23 M, 2.14 M, and 2.14 M reads in lanes 1 to 5, respectively, despite the fact that the proportion of this library represented just 24.75%, 22.5%, 20%, 17.5%, and 12.5% of each lane, respectively. One would normally expect a decrease in unique, alignable reads between the first and last points to be 50%, yet in fact the decrease is just 7.5%. This relationship was confirmed in three other independent libraries (TableĀ 5), suggesting that this is not an artifact of a single library or of liquid handling. Our data further suggest that one major reason for this is the increase in cluster density as non-bis increases (Additional file 1: Table S1). Cluster density is directly proportional to the total number reads, meaning that the more clusters there are the more reads there will be.

Given that the total number of usable reads does not change substantially (even when the proportion of a lane comprising bisulfite-converted libraries is decreased by 50%) with increasing proportion of non-bisulfite converted DNA, we would suggest that bisulfite converted libraries can be run with non-bisulfite converted libraries at minimal cost to sequencing output. This has significant implications for cost savings and strongly encourages genome centers or individual labs to save sequencing costs by pooling bis DNA and standard DNA libraries in the same lane. We currently use this approach (i.e., using non bisulfite-converted DNA not from PhiX but from the complete human genome) for on-going projects.

Calculation of coverage and use of duplicates in coverage statistics

A genomic library with reduced representation from the genome (due to targeted enzymatic digestion) and reduced read complexity (due to bisulfite-mediated C-T conversion) leads to a situation where many identical reads may be observed but which come from independent DNA sources (i.e., not from PCR amplification). To highlight this, we demonstrate coverage over one locus in FigureĀ 10. First, note that unique read coverage comes from 1 of 4 possibilities; 1) from read ā€˜pile-upā€™ over a given region where MspI digestion sites are close, so that the same CpG is covered by different fragments, 2) by reads with identical MspI digestion sites but which have acquired errors, so map to the locus but are derived from a different read cluster, 3) by reads with identical MspI digestion sites, but with different methylation patterns, so map to the same locus, and 4) sequencing reads from the reverse strand. Reads with a small number of errors (usually set at 2 in Bismark) are highly useful in this context. In standard sequencing libraries, coverage is determined by pile-up, where reads have different start and end regions. Here, reads at a given locus mostly have identical start and end points, due to targeted enzymatic digestion and Illumina-determined read size (here, 50 bases). This suggests that regions of very high or low methylation will have lower coverage in these libraries (extremely high or low methylation patterns will lead to identical reads).

Figure 10
figure 10

A method to validate the inclusion of duplicate reads for coverage statistics. Individual reads where no duplicates are present (A) and where duplicates are included (B). Duplicates can be used in coverage statistics if the proportion of methylation in the duplicate read pool is significantly correlated to the non-duplicate read pool. In the locus shown here, there are four sites that are differentially methylated and the proportion of methylation in the non-duplicate and duplicate pools are similar.

How should coverage of a bisulfite-converted library be determined, given that enzymatic digestion gives identical fragment sizes when representation of the genome is reduced? A conservative approach is to use unique reads only and to calculate coverage over CpG sites only. In our data this approach gives a mean coverage of approximately 2-4X (TableĀ 5). Because this is largely seen as inadequate to call methylation differences, different groups have included duplicates in their coverage calls, but duplicates are not normally used in coverage statistics from massively parallel sequencing experiments. The importance of utilizing duplicates is that they function as proxies to the true number of fragment reads, implying that simply filtering out duplicates removes valid data; however, as of yet there is no metric to determine when or how duplicates should be used. TableĀ 6 shows how output data from the experiment shown in TableĀ 5 changes when duplicates are included - coverage increases by ~10-fold (from 2-4X to 20-30X). The relationship between methylation frequency calls and coverage may also be a limitation of RRBS more generally; that is, that low sampling of unique reads is inevitable.

We reasoned that if duplicates are valid to use they should be representative of the pool of unique reads. To determine if duplicate reads are representative of the unique aligned reads, we wanted to assess a sample of genomic loci where there are many unique reads (for power) and where a given read has a high number of CpGs. ā€˜Representativenessā€™ can be assessed in different ways, but the issue we most wanted to test was whether PCR bias (from library amplification) led to poor representation. PCR bias (where different fragments are amplified preferentially over others) occurs for two independent reasons in BisSeq libraries: 1) fragments derived from the same locus can have different methylation patterns; increased methylation leads to the preservation of cytosine residues which have different melting temperatures, different interactions with polymerase, and different annealing times than fragments with ā€˜Tā€™ residues; 2) 50 bp SE sequencing does not alter the fact that libraries were selected at 150ā€“400 bp in the progressive PCR step. Shorter fragments may be amplified preferentially than longer ones because annealing and extension times are shorter. Because the polymerase extension time is fixed in the library amplification step, longer fragments may not amplify in the initial PCR stages, but all short fragments will. PCR amplification can lead to a BisSeq library that includes duplicates in the mapping statistics being unrepresentative of the initial BisSeq library.

To assess this idea, we calculated a Pearson correlation between methylation sites in all unique aligned reads compared to all reads at the same genomic locus. FigureĀ 10 shows an example region to demonstrate this concept. FigureĀ 10 and TableĀ 7 show four sites with differential methylation at one locus, (10 A no duplicates; 10 B with duplicates). Using just four CpG sites at this region selected for high CpG coverage in uniquely aligned reads, we found an r2 value of 0.95 between reads from the uniquely aligned reads and all reads, which has a p-value of 0.03 with a CpG N of four. While all the reads present in the uniquely aligned group are present in the all read group, this Pearson calculation helps to guard against potential PCR bias (i.e., over-amplification of certain library fragments over others). If one DNA fragment happens to amplify at a significantly higher rate than any other, the Pearson correlation coefficient would drop and duplicates would then be less representative of non-duplicate, uniquely aligned reads.

Since pile-up of non-duplicate reads is low at any given locus and this QC metric requires differential methylation at CpG sites in the same read, we recommend assessing all loci with several unique reads (>10) and differential methylation (>four differentially methylated sites). Significant r2 values at these regions would suggest that the library constructed is of good quality, and that duplicates can be used for assessing coverage statistics for the library.

Table 7 Correlation of the frequency of methylation at a single locus between reads with duplicates and reads without duplicates

Discussion

We have proposed a set of improvements, from library construction to bioinformatic analysis, for any experiment using bisulfite converted DNA for the purpose of methylation mapping, and addressed two issues in bisulfite sequencing experiments, namely the effect of including non-bis DNA in BisSeq experiments, and the extent to which duplicate reads can be used to calculate coverage statistics. Our detailed approach and suggestions for sample pooling and coverage calculations should be useful in any experiment that uses massively parallel sequencing to understand DNA methylation states.

We have several quality control checkpoints at each stage of the process, including library construction, pre-sequencing, post-sequencing, and post-alignment. At the library construction stage, we propose a multiplexing approach that uses a post-library indexing step, which allows the user to select an indexing strategy after library synthesis. Because this technique also involves smaller adaptors, it allows for simple sample clean-up using inexpensive, efficient spin-columns. The progressive PCR step for library amplification allows fragment size selection by altering PCR extension time while allowing the user to achieve the lowest possible cycle number per library. Together, these improvements simplify and speed-up the library synthesis process. At the pre-sequencing stage, we recommend cloning a small number of fragments to ensure fragment diversity (there should be no identical clones) and to assess C-T conversion. A qPCR experiment can also be done using primers directed at adaptor ends and compared to melting curves for standard libraries. While this is a gross measure ā€“ 5% differences bisulfite conversion are unlikely to be detected for example, it is a simple check to reassure, particularly since qPCR experiments are often required for accurate quantification of library concentration prior to sample pooling (multiplexing).

There are several assessments that can be done immediately after massively parallel sequencing but before alignment. Assessing base composition in sequenced reads will give a further reassurance about bisulfite conversion, for example. If there is a higher proportion of cytosine bases at position 40ā€“50 in sequenced read compared to bases 1ā€“10, this may be evidence of incomplete conversion or sequencing error; one expects equal representation of all bases across all positions, with significantly lower cytosine residues compared to the other bases. In an RRBS experiment, the strong majority of reads should have CGG or TGG at base position 2, 3, and 4. If this is not observed, it suggests a problem with the enzymatic digestion. Post-alignment, we generated a novel computational tool to make the Bismark output readable in the IGV. We suggest assessing all reads in the <17Kb MT genome to determine library quality: are reads the size that is expected? Is there equal representation from forward and reverse reads? Are all C residues converted to T? Beyond visual assessments, we also generated data at several imprinted regions that can be used by any other group using brain DNA to determine library quality. Imprinted regions are expected to have a specific level of methylation and if these regions show a large deviation from what is expected, it may suggest alignment or other problems. We also used molecular cloning to determine if our methylation calling parameters were accurate. Having two different technologies (NGS and cloning) show strongly similar methylation patterns in a given region suggests that computational calling parameters are accurate.

A further goal of this project was to understand the role of pooling non-bis DNA with bis-DNA. We found that increasing the proportion of non-bis DNA libraries with bis-DNA libraries led to a minimal decrease in uniquely aligned reads and significantly increased cluster density, suggesting that pooling unrelated samples that have not undergone bisulfite conversion is not only viable but recommended to decrease sequencing costs. The increased proportion of reads that were sequenced when bis DNA was decreased in a lane was due both to sequencing less duplicates from a library and to increasing clustering density presumably created by the addition of more diverse DNA fragments. Because duplicates are desirable in a well-performed BisSeq experiment, we recommend a 30% spike-in of non-Bis DNA. A 30% spike-in will give almost identical CpG coverage as at 1% spike-in, and allow for the sequencing of unrelated non-bis DNA samples.

We also addressed the validity of using duplicates in calculating coverage for BisSeq experiments because duplicate reads are traditionally discarded in massively parallel sequencing experiments. We suggested that PCR bias can be assessed by using a correlation assessment where a single locus has many CpGs and many unique reads. By comparing CpG methylation frequency in the non-duplicate read pool to the all-read pool, one expects to observe a strong correlation between pools. A low correlation might be evidence of over-amplification in the library preparation stage which will be biased towards shorter DNA fragments and fragments with less cytosine residues.

Where does the future lie for methylation mapping? A move from RRBS to whole genome bisulfite sequencing (WGBS) has already begun, though these whole genome experiments are much more expensive and still have significant alignment issues. The fractionation of DNA instead of the enzymatic digestion improves things considerably because most fragments will be generated from different genomic locations instead of from CCGG locations. Still, it is not clear that WGBS is worth the cost given that sites rich in CpGs continue to be assessed, and these are covered to a wide extent in RRBS, though recent evidence from mouse suggests that the most functionally relevant DNA methylation sites might be located in CpG poor distal regulatory regions with low methylation levels. In this case, WGBS will be preferable because RRBS will miss many of these sites because of the absence of CCGG at the CpG poor distal regulatory regions [30]. It is clear that bisulfite sequencing is not an ideal methodology but it is currently the best available option to give reproducible data. Bisulfite treatment of DNA causes DNA damage and recent reports suggest that hydroxymethylation may also be an important epigenetic mark [31] and this may confound BisSeq experiments because hydroxymethylation and methylation are indistinguishable after bisulfite treatment and PCR. Single molecule sequencers [32], whereby unamplified and untreated ā€˜nativeā€™ DNA is sequenced are not yet ready for prime-time; however, it is foreseeable that all methylation (or indeed epigenetic mark-mapping) is done on single molecule sequencers in the not-to-distant future [33].

Conclusion

These data suggest improved ways to process and analyze bisulfite sequencing data. We provide new methods for library construction and indexing, as well as several new options for QC assessment as well as recommendations for generating high quality data.

Availability of supporting data

All supporting data is available within this manuscript and at the Ernst lab website at:http://www.mcgill.ca/psychiatricgenetics.

Abbreviations

Bp:

Base pair

EtBr:

Ethidium Bromide

MT:

Mitochondrial

NGS:

Next Generation Sequencing

RRBS:

Reduced Representative Bisulfite Sequencing

WGBS:

Whole Genome Bisulfite Sequencing.

References

  1. Scarano E, Iaccarino M, Grippo P, Winckelmans D: On methylation of DNA during development of the sea urchin embryo. J Mol Biol. 1965, 14 (2): 603-607. 10.1016/S0022-2836(65)80211-X.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  2. Reik W, Dean W, Walter J: Epigenetic reprogramming in mammalian development. Science. 2001, 293 (5532): 1089-1093. 10.1126/science.1063443.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  3. Caldji C, Hellstrom IC, Zhang TY, Diorio J, Meaney MJ: Environmental regulation of the neural epigenome. FEBS Lett. 2011, 585 (13): 2049-2058. 10.1016/j.febslet.2011.03.032.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  4. Clark SJ, Harrison J, Paul CL, Frommer M: High sensitivity mapping of methylated cytosines. Nucleic Acids Res. 1994, 22 (15): 2990-2997. 10.1093/nar/22.15.2990.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  5. Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR, Farnham PJ, Hirst M, Lander ES, Mikkelsen TS, Thomson JA: The NIH Roadmap epigenomics mapping consortium. Nat Biotechnol. 2010, 28 (10): 1045-1048. 10.1038/nbt1010-1045.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  6. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R: Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 2005, 33 (18): 5868-5877. 10.1093/nar/gki901.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  7. Gifford CA, Ziller MJ, Gu H, Trapnell C, Donaghey J, Tsankov A, Shalek AK, Kelley DR, Shishkin AA, Issner R, Zhang X, Coyne M, Fostel JL, Holmes L, Meldrim J, Guttman M, Epstein C, Park H, Kohlbacher O, Rinn J, Gnirke A, Lander ES, Bernstein BE, Meissner A: Transcriptional and epigenetic dynamics during specification of human embryonic stem cells. Cell. 2013, 153 (5): 1149-1163. 10.1016/j.cell.2013.04.037.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  8. Gu H, Smith ZD, Bock C, Boyle P, Gnirke A, Meissner A: Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat Protoc. 2011, 6 (4): 468-481. 10.1038/nprot.2010.190.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  9. Gu H, Bock C, Mikkelsen TS, Jager N, Smith ZD, Tomazou E, Gnirke A, Lander ES, Meissner A: Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods. 2010, 7 (2): 133-136. 10.1038/nmeth.1414.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  10. Smith ZD, Gu H, Bock C, Gnirke A, Meissner A: High-throughput bisulfite sequencing in mammalian genomes. Methods. 2009, 48 (3): 226-232. 10.1016/j.ymeth.2009.05.003.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  11. Boyle P, Clement K, Gu H, Smith ZD, Ziller M, Fostel JL, Holmes L, Meldrim J, Kelley F, Gnirke A, Meissner A: Gel-free multiplexed reduced representation bisulfite sequencing for large-scale DNA methylation profiling. Genome Biol. 2012, 13 (10): R92-10.1186/gb-2012-13-10-r92.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  12. Chatterjee A, Rodger EJ, Stockwell PA, Weeks RJ, Morison IM: Technical considerations for reduced representation bisulfite sequencing with multiplexed libraries. J Biomed Biotechnol. 2012, 2012: 741542-

    ArticleĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  13. Krueger F, Andrews SR: Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011, 27 (11): 1571-1572. 10.1093/bioinformatics/btr167.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  14. Xi Y, Bock C, Muller F, Sun D, Meissner A, Li W: RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics. 2012, 28 (3): 430-432. 10.1093/bioinformatics/btr668.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  15. Xi Y, Li W: BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinforma. 2009, 10: 232-10.1186/1471-2105-10-232.

    ArticleĀ  Google ScholarĀ 

  16. Smith AD, Chung WY, Hodges E, Kendall J, Hannon G, Hicks J, Xuan Z, Zhang MQ: Updates to the RMAP short-read mapping software. Bioinformatics. 2009, 25 (21): 2841-2842. 10.1093/bioinformatics/btp533.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  17. Guo W, Fiziev P, Yan W, Cokus S, Sun X, Zhang MQ, Chen PY, Pellegrini M: BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data. BMC Genomics. 2013, 14 (1): 774-10.1186/1471-2164-14-774.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  18. Chatterjee A, Stockwell PA, Rodger EJ, Morison IM: Comparison of alignment software for genome-wide bisulphite sequence data. Nucleic Acids Res. 2012, 40 (10): e79-10.1093/nar/gks150.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  19. Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, Mason CE: methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. 2012, 13 (10): R87-10.1186/gb-2012-13-10-r87.

    ArticleĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  20. Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29 (1): 24-26. 10.1038/nbt.1754.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  21. Ichiyanagi T, Ichiyanagi K, Miyake M, Sasaki H: Accumulation and loss of asymmetric non-CpG methylation during male germ-cell development. Nucleic Acids Res. 2013, 41 (2): 738-745. 10.1093/nar/gks1117.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  22. Arand J, Spieler D, Karius T, Branco MR, Meilinger D, Meissner A, Jenuwein T, Xu G, Leonhardt H, Wolf V, Walter J: In vivo control of CpG and non-CpG DNA methylation by DNA methyltransferases. PLoS Genet. 2012, 8 (6): e1002750-10.1371/journal.pgen.1002750.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  23. Dawid IB: 5-methylcytidylic acid: absence from mitochondrial DNA of frogs and HeLa cells. Science. 1974, 184 (4132): 80-81. 10.1126/science.184.4132.80.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  24. Nass MM: Differential methylation of mitochondrial and nuclear DNA in cultured mouse, hamster and virus-transformed hamster cells. In vivo and in vitro methylation. J Mol Biol. 1973, 80 (1): 155-175. 10.1016/0022-2836(73)90239-8.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  25. Cardon LR, Burge C, Clayton DA, Karlin S: Pervasive CpG suppression in animal mitochondrial genomes. Proc Natl Acad Sci U S A. 1994, 91 (9): 3799-3803. 10.1073/pnas.91.9.3799.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  26. Guo JU, Ma DK, Mo H, Ball MP, Jang MH, Bonaguidi MA, Balazer JA, Eaves HL, Xie B, Ford E, Zhang K, Ming GL, Gao Y, Song H: Neuronal activity modifies the DNA methylation landscape in the adult brain. Nat Neurosci. 2011, 14 (10): 1345-1351. 10.1038/nn.2900.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  27. Reik W, Walter J: Genomic imprinting: parental influence on the genome. Nat Rev Genet. 2001, 2 (1): 21-32.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  28. Davies W, Isles AR, Wilkinson LS: Imprinted gene expression in the brain. Neurosci Biobehav Rev. 2005, 29 (3): 421-430. 10.1016/j.neubiorev.2004.11.007.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  29. Das R, Lee YK, Strogantsev R, Jin S, Lim YC, Ng PY, Lin XM, Chng K, Yeo G, Ferguson-Smith AC, Ding C: DNMT1 and AIM1 Imprinting in human placenta revealed through a genome-wide screen for allele-specific DNA methylation. BMC Genomics. 2013, 14: 685-10.1186/1471-2164-14-685.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

  30. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Scholer A, van Nimwegen E, Wirbelauer C, Oakeley EJ, Gaidatzis D, Tiwari VK, SchĆ¼beler D: DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011, 480 (7378): 490-495.

    CASĀ  PubMedĀ  Google ScholarĀ 

  31. Weichenhan D, Plass C: The evolving epigenome. Hum Mol Genet. 2013, 22 (R1): R1-R6. 10.1093/hmg/ddt348.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  32. Chen YS, Lee CH, Hung MY, Pan HA, Chiou JC, Huang GS: DNA sequencing using electrical conductance measurements of a DNA polymerase. Nat Nanotechnol. 2013, 8 (6): 452-458. 10.1038/nnano.2013.71.

    ArticleĀ  CASĀ  PubMedĀ  Google ScholarĀ 

  33. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW: Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods. 2010, 7 (6): 461-465. 10.1038/nmeth.1459.

    ArticleĀ  CASĀ  PubMed CentralĀ  PubMedĀ  Google ScholarĀ 

Download references

Ackowledgements

This work was funded by a research grant RGPIN 435512ā€“2013 from the Natural Science and Engineering Research Council of Canada (CE) and PHS NIDA grant DA033684 (DM and GT). CE is supported by the Canada Research Chairs Program.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Carl Ernst.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authorsā€™ contributions

CE designed experiments and conceived of the study. GC carried out experiments. CE and GC wrote the manuscript. AD, RJ and CE performed bioinformatic analyses. AS, CN, KV, PEL, and VO assisted with experiments. DM and GT supplied reagents. All authors read and approved the final manuscript.

Electronic supplementary material

12864_2014_6012_MOESM1_ESM.docx

Additional file 1: Clustering informations, as well as adaptor and indexing primers fusion products, of BisSeq libraries(DOCX 111 KB)

Authorsā€™ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( https://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, G.G., Diallo, A.B., Poujol, R. et al. BisQC: an operational pipeline for multiplexed bisulfite sequencing. BMC Genomics 15, 290 (2014). https://doi.org/10.1186/1471-2164-15-290

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2164-15-290

Keywords