Mutagenesis with sodium azide
Seeds of O. sativa ssp. japonica (cv. Kitaake), a variety closely related to the reference cv. Nipponbare, were pre-soaked in ultrapure water for 20 hours at 25°C prior to sodium azide treatment. For mutagenesis, sodium azide solutions of 1 mM, 5 mM, and 10 mM were made in 0.1 M sodium phosphate buffer, pH 3. Batches of 100 seeds were treated with 27 ml of sodium azide solution in a 50 ml tube at RT (22-24°C) for 3 hours. Sodium azide was decanted and seeds were washed 3 times with 30 ml of ultrapure water for 5 minutes each time. Seeds were then transferred to germination paper in a standard plastic petri dish for germination at 25°C. Control Kitaake seeds (neither presoaked nor treated with sodium azide) were plated at the same time. After 7 days, germinated seeds were transplanted to UC soil mix C and grown in greenhouse to produce M2 seeds. M2 seeds were planted directly in UC soil mix C and leaf tissue was harvested for DNA isolation.
DNA extraction
Total genomic DNA was isolated from frozen leaf tissues that were mechanically ground prior to extraction using a potassium acetate-SDS method [35].
RESCAN library construction
Method development
Approximately 1000 ng of DNA was digested with the restriction enzyme MseI (T↓TAA, NEB, Ipswich, Massachusetts, cat. no. R0525) for 1 to 6 hour at 37°C. After the completion of digestion was verified by agarose gel electrophoresis, the DNA was purified with a Qiaquick PCR purification minicolumn (Qiagen, Germantown, Maryland, cat. no. 28104) and resuspended in 40 μl of 10 mM Tris buffer. Alternatively, the desired size range of genomic fragments was excised from agarose gel after electrophoretic separation, extracted with a Qiaquick gel extraction kit (cat. no. 28704) and resuspensed in 20 μl of 10 mM Tris buffer. For ligation, 20 μl of genomic DNA, either from the total digestion, or the size cut, were combined in a final volume of 44 μl with T4 DNA ligase (various manufacturers), the ligase buffer provided by the manufacturer, 0.5 μl of T4 DNA ligase, 0.5 μl of MseI (except for agct barcode, see below), and 1 μl of 0.05 μM premixed adapter oligonucleotides to form the single end sequencing Illumina Y adapter. The use of MseI during ligation depended on the sequence of the adapter and was employed whenever possible to minimize ligation between genomic fragments. The sequence of the adapter oligonucleotides is shown below, with the barcode in lower case. Additional barcodes used were (shown as the adA2 oligonucleotide strand sequence): gata, cacc, tagc, agct, ctag. Note that all, except "agct", cause loss of the MseI site upon ligation to an MseI fragment. The adapter sequences are shown for the record, but are no longer suited for the 2012 and later Illumina sequencing platform. The oligonucleotides, prepared at desalted quality, were obtained from Life Technologies (http://www.invitrogen.com).
adA2_GGTG: P-TAggtgAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
adB2_GGTG: ACACTCTTTCCCTACACGACGCTCTTCCGATCTcacc
Ligated DNA was purified on a Qiaquick column, enriched by PCR amplification with Illumina PCR amplification primers for 16 to 18 cycles and examined by analytical gel electrophoresis. A slightly diffuse band in the target range of molecular weight (insert size + ~100b of adapters) was diagnostic of the desired outcome. The presence of excessive adapter dimer (a fragment in the 100-150 bp) was undesirable. If found, it could be removed by preparative gel electrophoresis, or it could minimized by repeating the procedure from the ligation step using a lower concentration of adapters.
Illumina primers:
pr1: AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
pr2: CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Sequencing of the RESCAN libraries started with 25b reads on the original (2007) Genome Analyzer and progressively employed the improvements in chemistry and apparatus.
Standard method
Approximately 500 ng of DNA was digested with the restriction enzyme NlaIII (NEB, Ipswich, Massachusetts, cat. no. R0125) for 1 to 6 hour at 37°C. After the completion of digestion was verified by agarose gel electrophoresis, the DNA was purified selecting the desired molecular weight range. For this purpose AMPure SPRI Beads (Beckman Coulter Genomics, Danvers, Massachusetts, cat. no. A50850) were added in the suitable ratio to DNA (see Figure 3) and used to manufacturer's instructions. For example, to remove fragments smaller than 100 b the recommended ratio (1.8:1) was used. To remove fragments higher than a certain amount (top cut) SPRI beads in the desired ratio were applied and the unbound fraction was saved. Because the bulk of the fragments are always found in the bottom fraction (Figure 3), we commonly used conditions (i.e. ratios) that would remove all fragments below a certain molecular weight and directly proceeded with ligation of the remaining fragments to the adapter. Although this represented a range of sizes, the higher frequency of smaller fragments and their advantage during the amplification steps resulted in a de facto enrichment of the near-bottom class of fragments. This simplification shortened the protocol and made the method cheaper, important for high level multiplexing. Oligonucleotides for the barcoded Illumina adapters used in the mutation detection experiment were as follows (barcode in lower case):
Control, adA: P-atcacAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
adB: ACACTCTTTCCCTACACGACGCTCTTCCGATCTgtgatCATG
The oligonucleotides were ordered as in desalted quality and were not modified except for phosphorylation of adA. The same oligonucleotides were used for control, T1, T2 and T3 samples, but for the barcode (respectively: atcac, ctctc, cgaat, and gagca). The oligonucleotides were mixed in a 1:1 ratio to form the adapter, which was stored and used as needed without any annealing pre-incubation. One μl of a 0.05 μM dilution of the adapter was added to the SPRI bead-fractionated DNA in a 44 μl ligation reaction employing the Quick Ligation Kit (NEB, cat. no. M2200). After 15 minute incubation at room temperature, the ligation reaction was cleaned with AMPure SPRI Beads, utilizing a 0.8:1 v/v ("bead in binding buffer":sample) ratio to remove smaller fragments (less than 250-bp) and unligated adapters. The libraries were enriched using a mix of 10 μl of template, 15 μl of Phusion 2x HF Master Mix (NEB, cat. no. F531), 1 μl of 5 μM premixed paired-end Illumina primers and 4 μl of water and the following amplification protocol: 30 sec at 98°C; 14 cycles of 10 sec at 98°, 30 sec at 65°, and 30 sec at 72°; and a final extension with 5 min at 72°. PCR product was purified using AMPure SPRI Beads with a 0.8:1 v/v (bead in buffer:sample) ratio. Libraries were quantified using the Agilent 2100 Bioanalyzer, and were sequenced according to manufacturer's instructions on one and one half lanes (3/8 lane per individual) of the Illumina GAII (Illumina, San Diego, California) with 85-bp paired-end reads.
Computational analysis
Analysis during method development employed the Illumina alignment pipeline using the Eland program with the relevant updates of year 2007, 2008 and 2009. For the mutation analysis, the Illumina 1.5+ format (fastq) reads were filtered using a custom informatic pipeline (http://tinyurl.com/barcode-tool) that divided them based on barcode. Additionally, it removed the barcode sequences, adapter and primer sequences, reads shorter than 25-bp, and reads containing bases with Phred quality scores less than 20. Quality scores were converted to Sanger scale, which is compatible with most alignment programs.
Mutation detection: For referenced discovery: BWA (http://bio-bwa.sourceforge.net/) was used to align reads to the reference (Os 6.1, http://rice.plantbiology.msu.edu/) [16] genome with default mismatch allowance, producing an mpileup file with Samtools [36] (http://samtools.sourceforge.net/). The mpileup file contains the base calls at each position, for each library. Basecalls can be A, T, C, G, * (deleted base) or insertions (+AAT for example). This file was parsed in the following manner. First, any basecall with a sequence quality lower than 20 or a read with a mapping quality < 20 were discarded. Next, the basecalls for the four libraries were pooled and only positions that were collectively covered not more than 200 times were retained (to avoid repeated sequences). Positions that were not covered at least once in each of the four libraries were further discarded. If all basecalls were the same, that position was classified as homozygous and further classified as "ref" if the basecall was the same as in the reference genome or "SNP" if it was not. If there were more than 1 basecall, the following criteria were applied: If the least frequent basecall was found in more than 1 library and accounted for > 10% of the basecalls, the base was called heterozygous. If the least frequent basecall was found in more than 1 library and accounted for < 10% of the basecalls, the base was called homozygous.
This latter subset of positions was further assayed for the presence of potential mutations. Potential mutations were detected as follows: i) if there were only 2 different basecall (one dominant "WT" basecall and an another): if the non-WT base was observed at least twice from a single library and never from the other three, and there were no other basecalls for that library, the mutation was classified as potential homozygous mutation. If the non-WT base was observed at least 5 times but there were other basecalls for that library, it was classified as a potential heterozygous mutation. ii) positions for which more than two different basecalls were observed were dealt with in the following manner: If the least frequent basecalls were each only found once, the position was ignored for mutation detection purposes. Similarly, if the least frequent basecall was only found once, it was ignored and that base was processed as if there were only 2 basecalls (see above). If all basecalls were each found at least twice, that base was classified as "ambiguous" and removed from further analysis. This analysis was performed on reads aligned to the reference genome and reads aligned to the pseudo-reference.
The number of positions obtained in each of the categories described above are summarized as follows for O. sativa-based reference and for de novo reference, respectively: million bases in pileup table = 118.6, 101.9; total covererage > 200 and covered in all four samples (Mbases) = 84.7, 89.2; % SNP (as defined above) = 0.028, 0.0004; % heterozygotes = 0.019, 0.036.
To assess mutation rates, the number of putative mutations was divided by the number of assayed bases for each library. For homozygous scoring, these had to meet the following requirements: i) covered at least once in each of the four libraries, ii) covered at least twice in the putative mutant, iii) covered less than 200 times cumulatively in all libraries. For heterozygous scoring, the requirements were as follows: i) covered at least one in each of the four libraries, ii) covered less than 200 times cumulatively in all libraries and iii) mutant allele covered at least five times in the mutant library and never in the other three libraries. Positions covered less than 15 times were adjusted for random sampling effects of non mutant bases in a heterozygous background (see Discussion).
In order to determine how many potential mutations were found in both types of analysis, the k-mers containing potential mutations were aligned to the reference genome using BWA (as described above). The position of the potential mutation in the reference genome was extracted from the alignment and the position of the mutation in the k-mer. This set of positions were compared to the set of positions obtained from the referenced-analysis.
Statistical tests
The Fisher Exact test was used to compare observed and expected (from sequencing errors) SNP type ratios. We derived an expected fraction of GC > AT in sequencing errors of 0.59 based on Figure 2c of Tsai et al.[2]
Data, software and further information
Sequence reads used for the mutation analysis are available at NCBI Sequence Read Archive with the following accession number: [Sequence Read Archive:SRA049884.2]. Software and additional information on the RESCAN method are available at http://comailab.genomecenter.ucdavis.edu/index.php/RESCAN.