We demonstrate the use of a simple method to identify extremely rare SNP in the genome of individuals and estimate the connected mutation rate, independent of availability of a reference genome. The method takes advantage of the facile construction of libraries for Illumina sequencers using restriction-digested genomes. Construction of reduced representation libraries using restriction enzymes was first described for Sanger sequencing . The method has since been applied to high throughput sequencing library construction for the 454 platform , and for Illumina [7, 8, 21–23]. These approaches have been reviewed recently . Restriction site associated DNA tags (RAD) sequencing, described four years ago, has found multiple successful applications [22, 24–31]. Our method differs from RAD sequencing because a genomic fragment in the sequencing library is defined by two symmetric restriction sites instead of asymmetric combination of a restriction site with a randomly fragmented and flushed end, thus being most similar to that of Andolfatto et al.  and of Elshire et al. . The one-step ligation for library construction used by all three methods provides a simple, robust and easily implemented step. Our method differs from the first because the terminal portion of the Illumina adapter sequence is left intact allowing the use of the standard Illumina sequencing primers and combined sequencing with regular libraries. It differs from the second by the use of a single Y-adapter instead of two fully double-stranded ones and the placement of the same barcode on both sequencing reads derived from paired-end sequencing (Figure 1). Double barcoding identifies potential chimeric products in multiplexed libraries that are sequenced as paired reads. The method employs oligonucleotides with desalted purity and no phosphorothioate modification, enabling considerable savings in setting up 96-barcode multiplexing. Similarly to the two methods above, different types of overhangs can be used. When used with a two-base overhang producing enzyme such as MseI (T↓TAA), the adapter can be ligated to the insert in the presence of the restriction enzyme because adapter-insert ligation eliminates the restriction site. When used with a four-base overhang producing enzyme such as NlaIII, the restriction enzyme must be removed or inactivated.
The complexity of the sequencing library can be modified by the choice of restriction enzyme and by size fractionation of DNA. Six-base cutters reduce complexity compared to four-base cutters. For example, the six-base cutter SphI (GCATG↓C) can be used with the NlaIII (CATG↓) adapters achieving satisfactory reduction in complexity of large genomes (> 10 Gb; Monson-Miller and Comai, unpublished results). SPRI bead cleanup before and after ligation of the digested DNA can be tailored to produce different molecular size cuts. Although size selection can be achieved by excision of DNA from agarose gel after electrophoresis, we found that the latter method is less efficient, more laborious and not easily scalable. A drawback of the SPRI bead-based fractionation is that removal of the abundant smaller fragments can be incomplete, resulting in their capture through intrafragmental ligation and chimeric library products. These instances complicate analyses based on assumed contiguity of the paired end reads. Often, however, the two reads are queried independently and this is not a problem.
Analysis of reduced complexity libraries starts by mapping quality-filtered reads to a reference genome using software such as ELAND or BWA. BWA outputs two file types useful in this application: SAM files and pileup files. The first provides the entry point of each read identifying restriction sites common to the input and the reference and restriction sites unique to the input and not present in the reference. Interestingly, the latter sites can be highly predictive of SNP even with very low coverage (one or two reads, Figure 4). For example, using MseI (T↓TAA) we demonstrated that sites where the fourth base (TTAA) is verified in the read and the corresponding reference proto site is TTAB, are confirmed in over 80% of the tested cases (Figure 4). Of 3000 SNP inferred B > A SNP, 1200 were confirmed in the databases. We believe that the remaining SNP are likely to be real as well.
More commonly, SNP discovery employs the sequence of the read beyond the restriction site, a method that requires higher coverage, but is more productive because it queries more sequence (currently 100 bases vs the 4 of a restriction site) and can provide codominant information for any SNP discovered. Detection of these SNP is achieved by parsing the pileup table produced by BWA, where calls for each position are listed with the corresponding qualities . We searched for changes induced by NaAzide, which compared to changes derived from natural variation represent a considerable challenge because they are much rarer than variation SNP [1, 15]. Typical mutations are present in mutagenized diploid genomes with frequencies ranging from less than 0.5 to 10 per 106 bp of diploid DNA. Furthermore, while natural variation SNP are shared by multiple individuals and can thus be confirmed through biological replication, most mutations affect a single individual. We were able to detect induced changes by applying a common sense strategy (Figure 5, see methods). We were helped by the specificity of the induced changes: the mutations conformed the NaAzide mutagenic spectrum detected in barley and consisted almost exclusively of G:C to A:T transitions  in contrast to the 28% G:C to A:T (accompanied by 28% A:T to G:C) transitions expected from natural variation .
Reference-independent discovery using k-mers found more SNP, including 70-80% of those found in the O. sativa-referenced discovery. The fact that 20 to 30% of the potential mutations found in the reference-analysis were not found in the k-mer analysis can partially be explained by the fact that the k-mers were trimmed (from 75 bps to 70 bps) and the reads were further trimmed (from 75 bps to 65 bps, resulting in an effective loss of 14% of the sequence information). Another factor differentiating the two approaches is the potentially different treatment of repeated regions. SNP that are in known repetitive regions would be excluded in the referenced search. However, if only one of the repeat was represented in the RESCAN library, it would behave as single copy and be scored in the reference-independent discovery. Such cases may contribute to the efficiency of de novo-referenced discovery.
A considerable challenge in the analysis is constituted by the presence of heterozygous mutations, which in M2s are expected to be 2/3 of the mutant sites. One difficulty lays in distinguishing homozygous from heterozygous sites: for example, a base position for which three calls are all variant could be homozygous mutant or heterozygous with associated probabilities of 0.875 vs 0.125 (0.53), respectively. Similarly, an heterozygous site could yield three wild-type calls with a 0.125 probability. A second difficulty, connected to the first, lies in estimating the covered genome because each coverage level has an associate probability of detection. In order to reduce noise, we set our algorithm to call heterozygous sites only if they carry a minimum of 5 mutant calls. For example, if a 100b DNA was sequenced to a coverage of ten, any heterozygous site would yield call ratios according to the binomial distribution resulting in a connected probability of detection of 0.62 (X ≥ 5, p = 0.5). Because heterozygotes are detected with lower efficiency, estimating the mutation density under these conditions would require adjusting the number of bases effectively assayed to 62 bases instead of 100. In practical terms, this is laborious and may require the careful construction of an adequate statistical model. A simpler solution to the heterozygous problem would be increasing the coverage, which can be achieved as discussed above. For the purpose of our estimate, we derived a mutation rate using a simplified calculation with the homozygous mutants. We estimate that this might be between half and one third the real mutation rate, depending on the fraction of putative homozygous sites that are actually heterozygous.
Number of SNPs consistent with NaAzide mutagenesis was higher in all 3 treated individuals than in the control, and mutation density peaked in the intermediate treatment according to the homozygous calls. This is not the case when considering the heterozygous calls. If we consider the homozygous analysis to be more accurate, a plausible outcome, this behavior requires potential explanations. The lethality of the mutagenic treatments increased from low to a very high 96% in the 10 mM NaAzide treatment. It is possible that the severity of the 10 mM treatment may be counterproductive and that, for example, survivors may have escaped the full treatment. A similar observation has been reported in barley using sterility as a proxy for mutation density . Alternatively, variation may result from the limited sampling. It is possible, for example, that different cell types in the embryo may respond differently to the mutagen and subsequently enter stochastically the transient germ line that gives rise to plant gametes . Therefore, the extent of individual variability remains to be assessed.
The method described here should be applicable to more studies than just those focusing on mutagenesis. For example, it will allow mapping and backcrossing of induced mutations in the background of the same accession used for mutagenesis by providing markers that allow discrimination of the mutagenized genome from the wild type. It should also allow comparison of substrains of the same variety and, if sufficient SNP are found, genetic characterization of diverged traits. We have also applied the RESCAN to natural SNP discovery and mapping in rice (Tai et al., unpublished results), Arabidopsis suecica (Henry and Comai, unpublished results) and Arabidopsis thaliana . In all these systems, RESCAN proved robust in its application and analysis.