Genomic DNA
P. falciparum 3D7 genomic DNA was a gift from Prof Chris Newbold, University of Oxford, UK. Bordetella pertussis ST24 genomic DNA was a gift from Craig Cummings, Stanford University School of Medicine, CA. Staphylococcus aureus TW20 genomic DNA was a gift from Jodi Lindsay, St George’s Hospital Medical School, University of London. S. Pullorum S449/87 genomic DNA was prepared at the Wellcome Trust Sanger Institute, UK.
Illumina library construction
DNA (0.5 μg in 120 μl of 10 mM Tris–HCl pH8.5) was sheared in an AFA microtube using a Covaris S2 device (Covaris Inc.) with the following settings: Duty cycle 20, Intensity 5, cycles/burst 200, 45 seconds.
Sheared DNA was purified by binding to an equal volume of Ampure beads (Beckman Coulter Inc.) and eluted in 32 μl of 10 mM Tris–HCl, pH8.5. End-repair, A-tailing and paired-end adapter ligation were performed (as per the protocols supplied by Illumina, Inc. using reagents from New England Biolabs- NEB) with purification using a 1.5:1 ratio of standard Ampure to sample between each enzymatic reaction. PCR-free libraries were constructed according to Kozarewa et al. [4]. After ligation, excess adapters and adapter dimers were removed using two Ampure clean-ups, first with a 1.5:1 ratio of standard Ampure to sample, followed by a 0.7:1 ratio of Ampure beads. PCR free libraries were then used as is. Libraries prepared with amplification were diluted to 2 ng/μl and 1 μl was used as template for PCR amplification with Kapa HiFi [5] 2 x mastermix (KK2601, Kapa Biosystems). PCR reactions were performed in 0.2 μl thin-wall microtubes on an MJ tetrad thermal cycler with the following conditions: 94°C −2 minutes; 14 cycles of 94°C −20 seconds, 65°C −30 seconds, 72°C −30 seconds; 72°C - 3 minutes with 200nM final concentration of standard PE1.0 and modified multiplexing PE2.0 primers [5].
After PCR, excess primers and any primer dimer were removed using two Ampure clean-ups, with a 1.5:1 ratio of Ampure then with a 0.8:1 ratio of Ampure beads. All libraries were quantified by real-time PCR using the SYBR Fast Illumina Library Quantification Kit (Kapa Biosystems) and pooled so as to give equal genome coverage from each library.
Illumina sequencing
Each multiplexed library pool was sequenced on: i) an Illumina GAIIx instrument for 76 cycles from each end plus an 8 base-index sequence read, using version 2 chemistry, ii) an Illumina MiSeq for 151 cycles from each end plus an 8 base-index sequence read, iii) an Illumina HiSeq 2000 instrument for 75 cycles from each end plus an 8 base-index sequence read, using version 3 chemistry. Summary sequencing statistics are given in Additional file 1: Table S1. All runs had error rates, and associated sequence quality, that surpassed the minimum Illumina specifications.
Ion torrent library preparation sequencing
For the B. pertussis, S. aureus and P. falciparum genomes, library preparation was carried out using the Ion Xpress™ Fragment Library Kit, with 100 ng of DNA. Adapter ligation, size selection, nick repair and amplification (8 cycles for B. pertussis and S. aureus, 6 cycles for P. falciparum) were performed as described in the Ion Torrent protocol associated with the kit (Ion Xpress™ Fragment Library Kit - Part Number 4469142 Rev. B). For the S. Pullorum genome, library preparation was undertaken using the Ion Fragment Library Kit with 5 μg of DNA. The DNA was fragmented by adaptive focused acoustics using a Covaris S2 (Covaris Inc.) with AFA tubes as described in the protocol (Part Number 4467320 Rev. A). End repair, adapter ligation, nick repair and amplification (8 cycles) were also performed as described in the Ion Torrent protocol. Size selection was performed using the LabChip XT (Caliper LifeSciences) and the LabChip XT DNA750 Assay Kit (Caliper LifeSciences), with collection between 175 bp and 220 bp.
The Agilent 2100 Bioanalyzer (Agilent Technologies) and the associated High Sensitivity DNA kit (Agilent Technologies) were used to determine quality and concentration of the libraries. The amount of library required for template preparation was calculated using the Template Dilution Factor calculation described in the protocol.
Emulsion PCR and enrichment steps were carried out using the Ion Xpress™ Template Kit and associated protocol (Part Number 4469004 Rev. B). Ion Sphere Particle quality assessment was carried out as outlined in an earlier version of this protocol (Part Number 4467389 Rev. B) for all samples because no benefit was seen with using the Ion Sphere Quality Control Kit as recommended in the later version of the protocol. The oligos used for this analysis were purchased from IDT (Integrated DNA Technologies Inc.). Assessment of the Ion Sphere Particle quality was undertaken between the emulsion PCR and enrichment steps only.
Ion torrent sequencing
Sequencing was undertaken using 316 chips in all cases and barcoding was not used for these samples. The Ion Sequencing Kit v2.0 was used for all sequencing reactions, following the recommended protocol (Part Number 4469714 Rev. B) and Torrent Suite 1.5 was used for all analyses. Summary sequencing statistics are given in Additional file 1: Table S2.
PacBio library construction
DNA (2.0-10 μg in 200 μl 10 mM Tris–HCl pH8.5) was sheared in an AFA clear mini-tube using a Covaris S2 device (Covaris Inc.) with the following settings: Duty cycle 20, Intensity 0.1, cycles/burst 1000, 600 seconds. Sheared DNA was purified by binding to 0.6X volume of pre-washed AMPure XP beads (Beckman Coulter Inc.), as per PacBio protocol 000-710-821-DRAFT (five times in purified water, one time in EB, reconstituted in original supernatant) and eluted in EB to >60 ng/μl. The sheared DNA was quantified on an Agilent 2100 Bioanalyzer using the 7500 kit. 1 μg of sheared DNA was end-repaired using the PacBio DNA Template Prep Kit 1.0 (Part Number 001-322-716) and incubated for 15 min at 25°C prior to another 0.6X AMPure XP clean up, eluting in 30 μl EB. Blunt adapters were ligated before exonuclease incubation was carried out in order to remove all un-ligated adapters and DNA. Finally, two 0.6X AMPure bead clean ups are performed - removing enzymes and adapter dimers – the final “SMRT bells” being eluted in 10 μl EB. Final quantification was carried out on an Agilent 2100 Bioanalyzer with 1 μl of library.
Using the SMRT bell concentration (ng/μl) and insert size previously determined, the PacBio-provided calculator was used to calculate the amounts of primer and polymerase used for the binding reaction. Using the PacBio DNA/Polymerase Binding Kit 1.0 (Part Number 001-359-802), primers are annealed and the proprietary polymerase is bound forming the “Binding Complex”. The Binding Complex can be stored as a long-term storage mix at −20°C or diluted for immediate sequencing. The quantity of SMRT bell determines whether a long-term storage mix can be used. In this instance, there was ample genomic DNA from the four test genomes to allow long-term storage.
PacBio sequencing
Long-term storage mixes were diluted to the required concentration and volume with the provided dilution buffer and loaded into 96-well plates. These are loaded onto the instrument, along with DNA Sequencing Kit 1.0 (Part Number 001-379-044) and a SMRT Cell 8Pac. In all sequencing runs, 2x45 min movies were captured for each SMRT Cell loaded with a single binding complex. Primary filtering analysis was performed on the Blade Center server provided with the RS instrument, before this data was transferred off the Blade Center for secondary analysis in SMRT Portal using the SMRT analysis pipeline version 1.2.0.1.81002. Summary sequencing statistics are given in Additional file 1: Table S3.
Reference genomes
Each reference genome was created using capillary sequence data with manual finishing and are available to download from http://www.sanger.ac.uk/resources/downloads/. The methods used to sequence the genomes of P. falciparum[14] and S. aureus TW20 [29] have been published.
Data processing
After sequencing, reads were mapped to each genome reference sequence using the manufacturers’ alignment tools, tmap for PGM and blasr for PacBio (http://www.pacificbiosciences.com/products/software/algorithms). BWA [30] was used for mapping reads from the Illumina GAIIx, MiSeq and HiSeq. SAMtools [31] was then used to generate pileup and coverage information from the mapping output.
Genome coverage
We counted the number of bases in the genome that were not covered by any reads (Coverage = 0) and those with less than 5x read coverage (Coverage <5x). SAMtools was used to generate coverage plots and bash/awk scripts were used for coverage counting.
Evenness of coverage metrics
We extracted genome coverage information from the pileup data derived by SAMtools from mapped reads after randomly down sampling to a uniform depth of 15x (this down sampling was achieved by taking “from the top” the number of reads required to give 15x coverage of each genome). As reads are randomly allocated evaluation of uniformity of coverage was based on cumulative distributions over the overall average depth.
GC-content analysis
To evaluate the coverage uniformity in different genome regions, a GC profile was calculated for each data set. All mapped reads were shredded into 50-mers and the GC-percentage in each 50-mer was calculated. The proportions of 50-mers containing a given GC-percentage were plotted against their GC percentage. A theoretical curve for each genome was also produced in the same way from its reference sequence for comparison. The difference from the theoretical curve gives an indication of GC bias.
Alignment base error analysis
The aligned error rate for data generated on the different sequencing platforms was taken from the report generated by the program SMALT [9], after aligning the S. aureus dataset against its reference sequence. The error rate is calculated as the per-base error within a mapped region divided by the total mapped bases in that region. An average error rate was calculated from all mapped reads for each data set.
To quantify errors associated with specific motifs, we took the fastq file and searched all the reads for the presence of that motif. The three bases (triplets) after the motif were tabulated, and the mean quality of the following base was calculated. We did this analysis for GGC, GCC and a neutral motif - ATG.
SNP detection
SNP detection was performed using a random selection of reads to give an average depth of coverage of 15x for all platforms, except PacBio where this coverage depth was insufficient and the full dataset representing 190x coverage was used.
SNPs from the PacBio reads were called using PacBio’s SMRT Portal software version 1.2.3. Each SMRT cell was selected for analysis against the S. aureus USA300_FPR3757 reference genome (accession number CP000255), imported into the PacBio secondary analysis protocol; the parameters can be altered for filtering, mapping, and consensus calling. SFilter.1.xml was used for filtering with a minimum allowed read length of 50 bases and a minimum read quality of 0.75 (on a PacBio-developed scale specific to RS-generated reads). BLASR.1.xml was used for mapping with the maximum number of hits per read being set to 1, a maximum divergence of 30% and minimum anchor size of 8. Finally, EviCons.1.xml was used for consensus and SNP calling. Reads from the Illumina and Ion Torrent platforms were mapped against the S. aureus USA300_FPR3757 reference using SMALT [9].
SNPs were called using the default parameters for SAMtools mpileup followed by bcftools and the SAMtools vcfutils.pl varFilter script, as described on the SAMtools webpage (http://samtools.sourceforge.net/mpileup.shtml). SNPs were also called for the Ion Torrent data using the Torrent Suite variant calling parameters for SAMtools mpileup and bcftools followed by the Torrent Suite vcf_filter.pl script.
A set of reference SNPs was created by aligning the complete S. aureus USA300_FPR3757 genome sequence with a high-quality draft sequence for S. aureus TW20 using Mugsy [32]. A single contiguous whole-genome alignment was generated by extracting aligned blocks from the Mugsy output and then manually curating. In order to control for the effects of software-specific mis-mapping, we identified and removed from our alignment regions sequences corresponding to mobile genetic elements (MGEs) in the S. aureus USA300_FPR3757 genome, along with regions with no homologue in S. aureus. MGEs were manually identified from the S. aureus USA300_FPR3757 genome annotation SNPs called from the resulting alignment provided a high-quality reference set for comparison with the SNPs identified by each sequencing platform. True SNPs are those that agree with the SNPs found in this reference set.
All datasets have been deposited in the ENA read archive under accession number ERP001163.