Comprehensive evaluation of SNP identification with the Restriction Enzyme-based Reduced Representation Library (RRL) method
- Ye Du†1,
- Hui Jiang†1,
- Ying Chen†1,
- Cong Li1, 2,
- Meiru Zhao1,
- Jinghua Wu1,
- Yong Qiu1,
- Qibin Li1 and
- Xiuqing Zhang1Email author
© Du et al; licensee BioMed Central Ltd. 2012
Received: 8 June 2011
Accepted: 16 February 2012
Published: 16 February 2012
Restriction Enzyme-based Reduced Representation Library (RRL) method represents a relatively feasible and flexible strategy used for Single Nucleotide Polymorphism (SNP) identification in different species. It has remarkable advantage of reducing the complexity of the genome by orders of magnitude. However, comprehensive evaluation for actual efficacy of SNP identification by this method is still unavailable.
In order to evaluate the efficacy of Restriction Enzyme-based RRL method, we selected Tsp 45I enzyme which covers 266 Mb flanking region of the enzyme recognition site according to in silico simulation on human reference genome, then we sequenced YH RRL after Tsp 45I treatment and obtained reads of which 80.8% were mapped to target region with an 20-fold average coverage, about 96.8% of target region was covered by at least one read and 257 K SNPs were identified in the region using SOAPsnp software.
Compared with whole genome resequencing data, we observed false discovery rate (FDR) of 13.95% and false negative rate (FNR) of 25.90%. The concordance rate of homozygote loci was over 99.8%, but that of heterozygote were only 92.56%. Repeat sequences and bases quality were proved to have a great effect on the accuracy of SNP calling, SNPs in recognition sites contributed evidently to the high FNR and the low concordance rate of heterozygote. Our results indicated that repeat masking and high stringent filter criteria could significantly decrease both FDR and FNR.
This study demonstrates that Restriction Enzyme-based RRL method was effective for SNP identification. The results highlight the important role of bias and the method-derived defects represented in this method and emphasize the special attentions noteworthy.
SNPs are the most abundant markers across the genome. Their relatively uniform distribution and high density make them ideal markers in genome wide association studies (GWAS), comparative or evolutionary genomics study and marker-assisted molecular breeding research. Due to the outcome of HapMap Project [1, 2] and the proceeding of 1000-genomes Project , over 30 M SNPs in human genome have been genotyped and reported in the dbSNP database .
Several genome-wide genotyping technologies have been developed and commercialized, aiming at detecting common SNPs or tagSNPs in parallel  (e.g. Illumina BeadArray based on primer extension , Affymetrix SNP arrays based on differential hybridization  etc.). Although these technologies have obvious advantages such as low costs, whole genome sequencing (WGS) is the most straightforward method for genome-wide identification of SNPs and other types of variants. So far genotyping hundreds to thousands of individuals by WGS is still not affordable for many investigators even considering the dramatic cost decrease due to innovation and update of the technology. Therefore, to fill the gap between current methods, considerable efforts have been made to develop the RRL methods, which have great advantage of reducing the complexity of a genome by orders of magnitude. Recently, lots of RRL strategies have been proposed and proved, such as target enrichment technologies including multiplex PCR, restriction enzyme digestion, selective sequence capture on array  or in solution , and others (reviewed by Mamanova L ).
Compared with other methods, RRL based on restriction enzyme digestion is relatively feasible and flexible, especially for those species without the reference genome . The first RRL using restriction enzyme was described over ten years ago, subsequently many successful cases have been performed in human , soybean , cattle , swine  and other species . Recently reported Restriction-site associated DNA (RAD, with similar experimental procedure as RRL) tag method is able to identify and score thousands of genetic markers in the flank region of enzyme recognition sites which randomly distribute across the target genome. RAD method can be widely used in large population studies, enabling not only genotyping and SNP discovery, but also more complex analysis such as quantitative genetic and phylogeographic studies [15–17]. However, RAD method involving multiple steps was labour intensive and typically requires a large volume of starting genomic DNA, moreover, the validation rate was only 85% under high stringency of SNP calling condition in some species like soybean . In a recent study, the researchers utilized RRL of enzyme Hae III to identify up to 47 K SNPs with validation rate of 48% in a rainbow trout genome . Although the low validation rate of RAD in soybean and rainbow trout were conferred by polyploidy and a whole genome duplication event related to target genome, the main cause was due to the poor accuracy of the method, indicating the potential possibility of improvement by further optimizing the library building or SNP calling in human beings. Consequently, the low validation rate of RRL method greatly hinders its widespread application. Until now comprehensive evaluation (e.g. power and efficacy of SNP discovery, genome coverage, etc.) is still unavailable, the comprehensive evaluation should be carried out in target genomes with detailed genomic coordinates and available genotyping information. Here we proposed restriction enzyme based RRL construction method and performed comprehensive evaluation for this method, especially for the accuracy of SNPs discovery.
In silico digestion and enzyme selection
In silico digestion of the human reference genome (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips) was performed with nine commercially available and methylation-insensitive restriction enzymes. These enzymes were selected upon following criterions: (1) predicted fragments length ranged from 200 to 700 base pairs (bp); (2) proportion of target region overlapped with the repetitive elements; (3) number of putative SNPs in dbSNP database v129 (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/) covered by target region. The repetitive elements were determined using RepeatMasker software v3.2.7 (http://hgdownload.cse.ucsc.edu/) .
Library construction and DNA sequencing
SNPs calling and validation
The raw sequencing data were masked from adapter using our own software application Mlinker. Parameters were set up for Mlinker as following: (1) overlapped length between read and adapter was at least 5 bp; (2) minimum of 90% sequence identity between read and adapter; (3) length of reads after adapter-masked was at least 35 bases. The clean reads were mapped to the human reference (hg18) using SOAPaligner v2 (http://soap.genomics.org.cn/) with at most 4 mismatches and at least 40 perfectly matched bases.
SOAPsnp v2 [19, 20] based on Bayesian model was used to call SNPs on target regions. Five filter steps were used to remove unreliable part of the consensus sequences: (1) both the major and minor alleles should be detected at least twice; (2) the overall sequencing depth, including randomly placed repetitive hits, was less than 200-fold; (3) the copy number of flanking sequences should be less than 2; (4) the genotype quality and average quality score of best nucleotide were at least 20, and average quality score of second best nucleotide was at least 20 if the genotype was heterozygous; (5) SNPs ought to be at least 5 bp away from each other. After filtering, the number of YH consensus sequence was calculated, and discrepancies between the YH genome and hg18 reference genome were considered SNPs.
SNP validation was processed in two aspects including detection of SNP calling accuracy and genotyping accuracy. We used high confident SNPs set previously generated by whole genome sequencing (WGS) to evaluate our SNPs calling results. Novel SNPs identified in our results were defined as false discovery loci, and missing SNPs in our results were defined as false negative. The FDR and FNR indicated the accuracy of SNP calling. To evaluate the genotyping accuracy, we calculated the concordance of these SNPs genotyped by RRL method and Illumina 1 M Duo_v3.B BeadArray.
Genome coverage and Minor allele frequency (MAF) distribution
To compare with commercially available genotyping platform, we set some parameters including genomic coverage (r2 ≥ 0.8), MAF (mean/median), and spacing (mean/median) between markers. Allele frequency and linkage disequilibrium (LD) data for the four HapMap populations were obtained from HapMap database (http://www.hapmap.org, phase 2). The value of genomic coverage was calculated according to the method previously described , In brief, we performed the naive estimation of coverage using HapMap release24 as reference set of SNPs, which was defined as R. The sets of SNPs covered by Illumina genotyping Beadchip or RRL were represented by T. The sets of SNPs located in same LD blocks with T sets of SNPs with r2 ≥ 0.8 were represented by L. Then the value of genome coverage was calculated as (L+T)/R. Human 1 M Duo_v3.B Beadchip from Illumina which cover about 1.1 M SNPs was selected to perform the comparison.
Enzyme selection, RRL construction and sequencing
Summary of in silico digestion results
Fragments (200 -700 bp)
Distance between two adjacent reads
#Total selected fragments
#Total length of target regionsa
% Percent of coverage
#Length of repetitive on target regions
%percent of repetitive contents
Hin d III
Summary of sequencing and alignment results
Total bases (Gb)
Mapped bases (Gb)
On target region (Gb)
Target region with depth ≥ 1 (Mb)
To assess the quality of RRL construction and sequencing, we compared the actual insert size distribution calculated according to the paired end information of sequence reads to that of simulation result. Figure 2B indicated that the length distribution of inserted fragments in Tsp 45I RRL was approximately from 200 to 700 bp. Compared with the simulative curve, the small fragments were over-presented in the final sequencing results, probably indicating the bias in the process of library construction and cluster generation. Furthermore these peaks in the distribution with consistent patterns were contributed by the accumulation of repetitive elements.
SNP identification and validation
Validation of SNP calling
On target regions of RRL
SNP dataset generated by WGS
Detailed interpretations for high False Discovery Rate and False Negative rate
False Discovery Rate (FDR) class
Intersection of the reasons
Number of loci (percentage)
In dbSNP v129
Reasonable interpretations for SNPs filtered out in YH WGS results
1. Low depth (<2)
2. Low quality (< 20)
3. High copy number (> 2)
4. High depth (> 200)
Overcalled for unknown reasons in RRL sequencing
False Negative rate (FNR) class
Intersection of the reasons
Number of loci (percentage)
In dbSNP v129
Reasonable interpretations for SNPs filtered out in RRL sequencing results
1. Low depth (<2)
2. Low quality (< 20)
3. High copy number (> 2)
4. High depth (> 200)
Allele dropout in RRL sequencing
Comparison of RRL sequencing and Illumina Beadchip genotyping results
Tsp 45I RRL sequencing
Genome coverage and MAF distribution
SNPs are the most abundant markers that are evenly distributed throughout genome. In human genome over 30 M SNPs are identified by HapMap Project [1, 2] and 1000-genomes Project . In this study we comprehensively accessed Restriction Enzyme-based RRL method with YH genome, which was the first complete Asian genome, including over 30 × whole genome sequencing data.
Compared with YH genome we observed high FDR and FNR of SNP calling, however, our further analysis found that 78.65% SNPs were already present in the dbSNP database, implying that these loci were possibly newly identified by RRL method. Repeat sequences and base quality contributed a great portion: 84.2% of FDR loci were confirmed to locate on repeat regions. We assumed to some extent that it was due to the different sequencing platform and sequencing strategy, YH WGS was performed on the Illumina Genome Analyzer that generated average 35 bp length reads obviously shorter than RRL sequencing performed on Illumina Hiseq 2000 with average read length of 90 bp. It was consistent with the fact that longer read length could improve the accuracy of mapping into reference and consequent lower FDR; Meanwhile, longer read length also leaded to lower quality score and consequent higher FDR, therefore we set up highly stringent filter parameter to reduce the FDR. We concluded that stringent filter criteria and repeat masking were necessary for increasing the accuracy of SNP calling.
FNR is another important parameter to access the power of this method, and due to lack of large-scale validation approach, published reports seldom include the evaluation of FNR. The error ratio of sequencing increases with read length, which may be one of the contributors for high FNR. Generally speaking, poor quality score of sequencing would not increase FNR, but FDR; however the major cause of the FNR in our study was due to the low quality score of sequencing. We consider false discovery worse than false negative in our RRL method, to confine the FDR well, we set relatively stringent parameters of SNP calling, which leaded to the lost of SNPs with low quality score as a consequence. Given restriction enzyme-based RRL methods always generate the fragments from the same start and end position, it formed low quality blocks at the 3'-end of enzyme fragments. The density distribution of FNR loci from six to ninety base along read indicated the number of FNR loci increased with position nearer to the end of reads (Figure 3) and significant increment of FNR was observed after 55 sequencing cycles. Moreover, part of false negative loci occurred in first five positions corresponding to enzyme recognition site because all of these candidate SNPs located in recognition site would be missed by RE digestion-based methods. From the comparison with genotyping results from Illumina 1 M Beadchip, the concordance rate of heterozygotes loci was only 92%, extremely lower than that of homozygotes. Over 99.7% of inconsistent heterozygotes were under-called to homozygotes partially because of low quality and inadequate depth. Moreover, 57.1% of these under-called loci were due to the disruption of recognition site by SNPs incorporation mentioned above. The discordance rate indicated the actual impact of disruption of recognition site. The disruption of the restriction site by a SNP can never totally be ruled out and merited careful attentions.
Given Restriction Enzyme-based RRL method identified SNPs around the restriction enzyme recognition site, only part of them was tag SNPs to represent a region of the genome with high LD. A strong correlation between the number of SNPs and coverage was observed in our simulative results. Clearly, there was a trade-off between SNPs density and genome coverage.
RRL method combined with high-throughput sequencing is demonstrated to be an effective way for SNP discovery in individuals or populations. Our YH RRL data displayed high coverage and specificity of target region and identified over 257 K SNPs with about 8G sequencing data. Comprehensive evaluation with this method clarified the factors contributing to FDR and FNR of SNP identification and also presented the potential solution to improve the SNP calling accuracy. Our study extended the scope of this method and highlighted its application in the future.
This study was supported in part by grants from the National High Technology Research and Development Program of China (863 Program, 2006AA02A302 to H.M.-Y, 2009AA022707 to X.Q.-Z), and a National Natural Science Foundation grant of China (30811130531 to H.M.-Y).
- The International HapMap Project. Nature. 2003, 426 (6968): 789-96. 10.1038/nature02168.Google Scholar
- A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-320. 10.1038/nature04226.Google Scholar
- Kuehn BT, et al: 1000 Genomes Project promises closer look at variation in human genome. Nucleic Acids Res. 2001, 29 (1): 308-11. 10.1093/nar/29.1.308.View ArticleGoogle Scholar
- Sherry ST, et al: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-11. 10.1093/nar/29.1.308.PubMed CentralView ArticlePubMedGoogle Scholar
- Kim S, Misra A: SNP genotyping: technologies and biomedical applications. Annu Rev Biomed Eng. 2007, 9: 289-320. 10.1146/annurev.bioeng.9.060906.152037.View ArticlePubMedGoogle Scholar
- Shen R, et al: High-throughput SNP genotyping on universal bead arrays. Mutat Res. 2005, 573 (1-2): 70-82. 10.1016/j.mrfmmm.2004.07.022.View ArticlePubMedGoogle Scholar
- Matsuzaki H, et al: Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat Methods. 2004, 1 (2): 109-11. 10.1038/nmeth718.View ArticlePubMedGoogle Scholar
- Gnirke A, et al: Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009, 27 (2): 182-9. 10.1038/nbt.1523.PubMed CentralView ArticlePubMedGoogle Scholar
- Mamanova L, et al: Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010, 7 (2): 111-8. 10.1038/nmeth.1419.View ArticlePubMedGoogle Scholar
- Wiedmann RT, Smith TP, Nonneman DJ: SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet. 2008, 9: 81-PubMed CentralView ArticlePubMedGoogle Scholar
- Altshuler D, et al: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407 (6803): 513-6. 10.1038/35035083.View ArticlePubMedGoogle Scholar
- Wu X, et al: SNP discovery by high-throughput sequencing in soybean. BMC Genomics. 2010, 11: 469-10.1186/1471-2164-11-469.PubMed CentralView ArticlePubMedGoogle Scholar
- Van Tassell CP, et al: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5 (3): 247-52. 10.1038/nmeth.1185.View ArticlePubMedGoogle Scholar
- Sanchez CC, et al: Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics. 2009, 10: 559-10.1186/1471-2164-10-559.PubMed CentralView ArticlePubMedGoogle Scholar
- Baird NA, et al: Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 2008, 3 (10): e3376-10.1371/journal.pone.0003376.PubMed CentralView ArticlePubMedGoogle Scholar
- Hohenlohe PA, et al: Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet. 2010, 6 (2): e1000862-10.1371/journal.pgen.1000862.PubMed CentralView ArticlePubMedGoogle Scholar
- Emerson KJ, et al: Resolving postglacial phylogeography using high-throughput sequencing. Proc Natl Acad Sci USA. 2010, 107 (37): 16196-200. 10.1073/pnas.1006538107.PubMed CentralView ArticlePubMedGoogle Scholar
- Smit A, Hubley R, Green P: RepeatMasker Open-3.0. 2004Google Scholar
- Li R, et al: SNP detection for massively parallel whole-genome resequencing. Genome Res. 2009, 19 (6): 1124-32. 10.1101/gr.088013.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Li R, et al: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-7. 10.1093/bioinformatics/btp336.View ArticlePubMedGoogle Scholar
- Li M, Li C, Guan W: Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet. 2008, 16 (5): 635-43. 10.1038/sj.ejhg.5202007.View ArticlePubMedGoogle Scholar
- Wang J, et al: The diploid genome sequence of an Asian individual. Nature. 2008, 456 (7218): 60-5. 10.1038/nature07484.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.