Genome-wide SNP discovery in walnut with an AGSNP pipeline updated for SNP discovery in allogamous organisms
© You et al.; licensee BioMed Central Ltd. 2012
Received: 14 February 2012
Accepted: 5 July 2012
Published: 31 July 2012
A genome-wide set of single nucleotide polymorphisms (SNPs) is a valuable resource in genetic research and breeding and is usually developed by re-sequencing a genome. If a genome sequence is not available, an alternative strategy must be used. We previously reported the development of a pipeline (AGSNP) for genome-wide SNP discovery in coding sequences and other single-copy DNA without a complete genome sequence in self-pollinating (autogamous) plants. Here we updated this pipeline for SNP discovery in outcrossing (allogamous) species and demonstrated its efficacy in SNP discovery in walnut (Juglans regia L.).
The first step in the original implementation of the AGSNP pipeline was the construction of a reference sequence and the identification of single-copy sequences in it. To identify single-copy sequences, multiple genome equivalents of short SOLiD reads of another individual were mapped to shallow genome coverage of long Sanger or Roche 454 reads making up the reference sequence. The relative depth of SOLiD reads was used to filter out repeated sequences from single-copy sequences in the reference sequence. The second step was a search for SNPs between SOLiD reads and the reference sequence. Polymorphism within the mapped SOLiD reads would have precluded SNP discovery; hence both individuals had to be homozygous. The AGSNP pipeline was updated here for using SOLiD or other type of short reads of a heterozygous individual for these two principal steps. A total of 32.6X walnut genome equivalents of SOLiD reads of vegetatively propagated walnut scion cultivar ‘Chandler’ were mapped to 48,661 ‘Chandler’ bacterial artificial chromosome (BAC) end sequences (BESs) produced by Sanger sequencing during the construction of a walnut physical map. A total of 22,799 putative SNPs were initially identified. A total of 6,000 Infinium II type SNPs evenly distributed along the walnut physical map were selected for the construction of an Infinium BeadChip, which was used to genotype a walnut mapping population having ‘Chandler’ as one of the parents. Genotyping results were used to adjust the filtering parameters of the updated AGSNP pipeline. With the adjusted filtering criteria, 69.6% of SNPs discovered with the updated pipeline were real and could be mapped on the walnut genetic map. A total of 13,439 SNPs were discovered by BES re-sequencing. BESs harboring SNPs were in 677 FPC contigs covering 98% of the physical map of the walnut genome.
The updated AGSNP pipeline is a versatile SNP discovery tool for a high-throughput, genome-wide SNP discovery in both autogamous and allogamous species. With this pipeline, a large set of SNPs were identified in a single walnut cultivar.
KeywordsBAC Physical map BAC end sequence Infinium Single nucleotide polymorphism Genome sequence SOLiD Walnut AGSNP
Walnut (Juglans regia L., 2n = 32, ~606 Mb per 1C genome, Horjales et al. 2003 inhttp://data.kew.org/cvalues/) is an economically important tree widely cultivated for its nuts and timber. A long reproductive cycle limits its genetic improvement. Walnut breeding would therefore greatly benefit from the development of molecular markers that could be used for gene discovery, marker-assisted selection, and other breeding applications that would accelerate breeding progress.
Walnut genetic markers are currently inadequate to satisfy these needs. Only a limited number of amplified polymorphic DNA (RAPD) markers, RAPD-derived sequence characterized amplified regions (SCAR), restriction fragment length polymorphisms (RFLP), and amplified fragment length polymorphisms (AFLP) have been developed[2–4]. Only a few simple sequence repeats (SSR) have been developed in walnut[5–9], although the recently reported bacterial artificial chromosome (BAC) end sequences (BESs) provide an opportunity for the development of a larger number of them.
Single nucleotide polymorphism (SNP) is the most abundant type of DNA variation in most species. The advent of massively parallel next generation sequencing (NGS), coupled with high throughput genotyping technology, makes it relatively easy to identify and use SNPs[11–15]. An example of a high-throughput SNP genotyping platform is Illumina’s Infinium SNP oligonucleotide assay, which can simultaneously assay between 3,000 and 1 million SNPs[16, 17]. The assay has been deployed in high-throughput SNP genotyping of animals, such as cattle (50 K BeadChip) and swine (60 K BeadChip). A prerequisite for the development of an Infinium SNP assay is the availability of a large number of genome-wide SNPs.
Genome-wide SNP discovery utilizing NGS is predicated on bioinformatic tools facilitating mapping NGS reads to reference sequences[20–22] and variant calling. Pipelines for processing of billions of short NGS reads for the purpose of discovery of genome-wide SNPs have been reported[24, 25]. Standard approaches to genome-wide SNP discovery are searches for variants in transcriptome assemblies of multiple individuals or mapping of NGS genomic reads of multiple individuals to a complete genome sequence. This approach is limited in many species by the absence of a complete genome sequence, and alternative strategies for genome-wide SNP discovery are therefore needed.
One such strategy is to substitute shallow genome coverage of long sequence reads, or their assemblies, generated with the Roche 454 or Sanger sequencing technology for a complete genome sequence. This strategy was implemented in the AGSNP pipeline for genome-wide SNP discovery in self-pollinating (autogamous) plants without a reference genome sequence. In AGSNP, deep genome coverage of NGS reads from one homozygous individual was mapped to shallow 454 reads of another homozygous individual. SNPs were discovered between the two sets of reads. The assumption of homozygosity limits the universal utility of the pipeline because many plants and most animals are allogamous and hence heterozygous. We report here an update of AGSNP for applications in allogamous species.
BAC end sequences are one of several possible sources of shallow coverage, genome-wide, long DNA reads. BESs are often developed from BAC libraries and are used for marker development and genome sequence composition surveys before whole genome sequencing. BESs can also be used for anchoring FPC contigs on a genetic map by searching for homology between BESs and marker sequences on the genetic map. The BES-based anchoring strategy is an alternative to contig anchoring via hybridization of radioactive probes with BAC library screening membranes or screening of multidimensional BAC pools by PCR or Illumina’s Golden Gate[27–29]. The deployment of BESs in SNP discovery can therefore serve multiple objectives.
To develop markers for the construction of walnut genetic map and anchoring walnut FPC contigs on it, a total of 54,912 BESs from walnut cv ‘Chandler’, a major walnut scion cultivar grown in California, have been generated by Sanger sequencing. SNPs were identified in BESs with the updated AGSNP pipeline by mapping deep sequence coverage of walnut NGS reads to ‘Chandler’ BESs. An Infinium assay for 6,000 SNPs was developed and used to genotype a ‘Chandler’ x ‘Idaho’ F1 mapping population. The genotyping results were analysed to validate the SNPs and to improve the SNP discovery rate with the pipeline by adjusting the SNP filtering criteria.
Updating the AGSNP pipeline was based on the following rationale. Reads forming a stack may vary either due to heterozygosity or sequencing and mapping errors. If heterozygosity was the cause of variation, two variants were expected at a nucleotide position in a stack of reads, and the expected frequency of each variant was 0.5. If a sequencing or mapping error was the cause of variation, two or more variants were expected at a nucleotide position in a stack of reads, and the frequency of one of the variants was expected to be minor. Variables ‘variant frequency’ (VF), defined as the number of SOLiD reads in a stack having a nucleotide at a specific nucleotide position that was different from the reference sequence divided by the total number of reads in the stack and ‘folded variant frequency’ (FVF), derived from VF were used to discriminate between these two possibilities. FVF is 1-VF if VF > 0.5 and equal to VF if VF ≤ 0.5. The domain of VF was 0 to 1 and that of FVF was 0 to 0.5. If FVF was 0, the locus was homozygous. If FVF was minor, read variation was likely a sequencing or mapping error and if it was near 0.5, read variation was likely caused by SNP.
Numbers of putative SNPs in relation to read variant frequency ( VF )
No of all SNPs
No of SNPs after removing SNPs with quality score < 35
Discovery and characterization of SNPs
A total of 48,661 BESs of an average read length of 721 bp and totaling about 35 Mbp were used as reference sequences. Of them, 42,022 (86%) were located in 804 of the 916 FPC contigs assembled from fingerprinted BAC clones (not shown), which indicated that the BESs were distributed across the entire walnut genome. To annotate these BESs, homology was searched between the 48,661 BESs and walnut cDNA sequence contigs at 1E-10. A total of 29,223 BESs showed homology to cDNA sequences. Those BES were called genic BESs whereas the remaining 19,438 were called non-genic BESs.
A total of 395,528,231 high-quality SOLiD reads 50 bp long were retained after removing low quality reads with an average read quality score < 20. The total length of the reads was 19,776 Mbp, which translated to ~32.6X walnut genome equivalents. The filtered SOLiD reads were mapped to the 48,661 reference BESs using the BWA program[20, 21], and SNPs were called in the mapped SOLiD reads using SAMtools.
SNP filtering criteria used in walnut SNP discovery
Minimum read depth mapped to the reference sequences (Minimum RMD)
Maximum read depth mapped to the reference sequences (maximum RMD)
≤ 25 ( + 0.5- s)(a)
Folded variant frequency in SOLiD reads (FVF)
Statistically no deviation from 0.5(b)
Mapping quality score in SAMtools (MQS)
Reference SNP base quality
SNP base ≥40 for genic BESs and ≥45 for non-genic BESs
Removing homopolymer SNPs
SNP base string length ≥ 3 bp
Removing very close SNPs
> 3 bp between two contiguous SNPs
Removing SNPs at the right side of Sanger reads
> 30 bp away from the right side
Illumina genotyping quality
≥ 60 bp between two contiguous SNPs
Putative SNPs identified from the walnut BESs and their Infinium types and ADT design scores for Infinium genotyping
No of BESs
No of BESs with SNPs
Infinium I SNPs
Infinium II SNPs
Infinium II SNP Design score ≥ 0.9 (%)
Infinium II SNP Design score ≥ 0.7 (%)
SNPs are divided into two categories in the Infinium HD assay, Infinium I type (A/T, C/G) and Infinium II type (A/C, A/G, T/C, T/G), according to probe or bead type design. The Infinium II probe design employs one probe per SNP (single probe for both alleles) whereas Infinium I probe design employs two probes (one probe for each allele). As the pricing and ordering of the custom BeadChip product are determined by the number of bead types, rather than the number of SNPs, using only Infinium II SNPs increases the number of SNP loci genotyped per constant number of probes and is therefore more economical. In this study, 88% (20,092/22,799) of SNPs were of Infinium II type (Table 3). SNPs of Infinium II type were present in 11,247 BAC clones present in 683 FPC contigs containing 107,262 BAC clones, which accounted for 94.9% of the walnut physical map.
All of the 22,799 putative SNPs identified in genic and non-genic BESs were evaluated using Illumina’s ADT software. In the 20,092 SNPs of Infinium II type, 17,019 SNPs (85.2%) had a score ≥0.9, and 19,891 (99%) of SNPs had a score ≥ 0.7 (Table 3). The design score of 0.7 was used as a cutoff. After removing SNPs with design score < 0.7, 16,216 SNPs located in 682 FPC contigs were retained. A total of 6,000 of them, 3,866 from genic BESs and 2,134 from non-genic BESs, were chosen for designing a 6 K Infinium SNP assay and genotyping of 352 F1 walnut plants making up the mapping population from the ‘Chandler’ x ‘Idaho’ cross.
Comparison of several SNP identification criteria and ADT design scores in scorable and unscorable SNPs
Number of SNPs
ADT design score
Average quality score of reference
SNP quality score of reference
SNP mapping quality score
Read mapping depth
Folded variant frequency
0.973 ± 0.052(a)
54.6 ± 9.8
59.1 ± 6.8
34.0 ± 1.8
24.9 ± 15.7
0.35 ± 0.10
0.977 ± 0.047
54.5 ± 9.5
59.2 ± 6.9
34.0 ± 1.8
23.1 ± 14.1
0.35 ± 0.10
P value of t test
Comparison of several SNP filtering criteria and ADT design scores in true-positive and false-positive SNPs
No of SNPs
ADT design score
Average quality score of reference
SNP quality score of reference
SNP mapping quality score
Read mapping depth
Folded variant frequency
0.974 ± 0.050(a)
54.9 ± 9.4
59.2 ± 6.6
34.3 ± 1.6
18.4 ± 9.3
0.39 ± 0.11
0.971 ± 0.053
54.3 ± 10.1
59.1 ± 7.1
33.7 ± 2.0
31.7 ± 18.0
0.34 ± 0.11
P value of t test
Optimization of the pipeline
Out of the 2,655 false-positive SNPs, 1,850 SNPs were from genic BESs and 805 SNPs were from non-genic BESs, while of the 2,765 true-positive SNPs, 1,638 SNPs were from genic BESs and 1,127 SNPs were from non-genic BESs. There was a highly significant relationship between SNP source and true/false-positive SNP outcome (Fisher’s exact test, p < 0.0001); SNPs from non-genic single-copy BESs had a higher chance of being true-positive (58.3%) than those from genic BESs (47.0%).
Logistic regression model of polymorphism of SNPs (true-positive and false-positive) and several related SNP filtering criteria: log ( p /(1- p )) = b0 + b1* X1 + b2* X2 + b3* X3 , where p is the probability of an SNP belonging to a false-positive
Estimates of coefficients
Prob > χ2
High-quality SNPs identified from the walnut genome after applying the new filters of read mapping depth (≤ 25), SNP mapping quality score (≥ 30), and folded variantfrequency (statistically not different from 0.5)
No of BESs
No of BESs with SNPs
Number of contigs with SNPs
Predicted true SNP rate(a)
Updating of the AGSNP pipeline for SNP discovery in cross-fertilizing species
The AGSNP pipeline was originally designed as a high-throughput bioinformatic tool for large-scale, genome-wide SNP discovery in large and complex genomes using sequences of two inbred, and hence homozygous, lines. Sequences assembled from long reads, such as those produced by the Sanger or Roche 454 sequencing platforms, of one inbred line, and annotated using 20 cDNA libraries, served as a reference. Short reads of a great depth, such as those produced by the SOLiD or Illumina NGS platforms, of the other line were used to further annotate the assembled sequences and discover SNPs between the two lines. This pipeline was successfully used for SNP discovery in the 4.02 Gbp genome of self-pollinating Ae. tauschii. Approximately half of million SNPs with a validation rate of over 85.9% were identified in genic regions, single or low copy repeat regions, and uncharacterized low copy number sequences.
To discover SNPs in walnut, which is a wind-pollinated, out-crossing species, the pipeline had to be modified to accommodate heterozygosity. To generate SNPs for genotyping a mapping population, SNP discovery can be limited only to a single out-crossed parent. Each SNP is detected as variation within the stack of mapped short NGS reads at a locus, rather than a difference between an invariant stack of mapped NGS reads and reference sequence. In this application of AGSNP, the role of the reference sequence is to filter SNPs. Therefore, the reference sequence can be derived either from the same genotype as the mapped reads or a different genotype (Figure 1). In this study, ‘Chandler’ BESs were used as a reference sequence and SOLiD ‘Chandler’ reads were mapped to them to identify SNPs in ‘Chandler’.
SNP filtering is a critical step for removing false-positive SNPs from the pool of putative SNPs during SNP discovery. FVF is one of the most important variables used in SNP filtering. It is used to set a boundary between variation caused by sequencing or mapping errors and that caused by true SNPs. Ideally, FVF of true SNPs should be close to 0.5 but it is difficult to find a fixed FVF cutoff that reasonably balances false-positive and false-negative SNP rates. As the counts of variable reads at a nucleotide position follow a binomial distribution and the expected FVF for true SNPs is 0.5, we used the binomial probability of deviation between observed FVF and 0.5 to set the cutoff between true or false SNPs. The benefit of this approach to setting the cutoff value is seen using the following example. When the 5,420 genotyped SNPs were declared true or false on the basis of a fixed FVF cutoff of ≥ 0.3, the false-positive SNP rate insignificantly increased from 57.4% (1,525/2,655) to 61.7% (1,638/2,655). But when cutoff was set on the basis of the binomial test, the false-negative SNP rate significantly decreased from 23.1% (640/2,765) to 9.4% (260/2,765).
The SNP discovery in Ae. tauschii used a cutoff value of+ 2 s to identify single copy reference sequences or to set a maximum read mapping depth. In contrast, in walnut,+ 0.5 s turned out to be an optimal cutoff value. Similarly, more stringent SNP filtering cutoff values were required for the SNP mapping quality score, average reference quality score, and reference SNP quality score. After adjusting the cutoff values, a 69.6% true-positive SNP rate was obtained in walnut, which was much higher than 51.0% with the initial cutoff values. This validation rate was lower than that obtained in self-pollinating species[25, 30–33] but higher than that obtained using other SNP discovery strategies in outcrossing maritime pine, loblolly pine, and sugar pine, in which SNP validation rates ranged from 36.0% to 61.5%[34–36].
Factors reducing the rate of false-positive SNPs in SNP discovery
The analysis of validated SNPs showed that factors such as read mapping depth, SNP mapping quality score, and folded variant frequency were closely related to the rate of true-positive SNPs in the updated AGSNP pipeline. All those factors were directly or indirectly associated with a fundamental issue: mismapping of NGS reads to a reference sequence. Because reference sequences and NGS reads are derived from heterozygous loci, mismapping can easily result in a large proportion of false-positive SNPs. Focusing SNP discovery on genic regions and single-copy non-genic sequences, increasing the stringency of mapping depth, increasing the SNP mapping quality score, and increasing FVF will decrease false-positive SNP rate. Using paired NGS reads would probably also help since it will increase the likelihood of mapping reads to their correct locations. In addition, more stringent mapping parameters in the mapping software, e.g., the number of mismatched bases and the number of gaps, should be applied if reads from heterozygous genomes are used for SNP discovery with the AGSNP pipeline.
A total of 6,000 SNPs scattered along most of the FPC contigs were selected to generate a 6 K iSelect Infinium BeadChip. Of them, 90.3% produced genotyping data and 86.1% were converted to potential SNP markers for genetic mapping. This conversion rate of SNP sequence to SNP markers is higher than rates in an outcrossing tree species, maritime pine, (63.6%–74.8%) using the custom Golden Gate assay but lower than the conversion rates evaluated using a custom Infinium assay in animal species, such as pig (97.5%) and cattle (97.6%). The final genotyping success rate is the product of a combination of the conversion rate and the true-positive SNP rate. In this study, 86.1% conversion rate and 51.0% true-positive rate yielded a final Infinium genotyping rate of 43.9%, still higher than the rate obtained in maritime pine (25.7%) using the custom Golden Gate assay. Increasing true-positive SNP rate in SNP discovery will increase the final genotyping rate. Overall, genome-wide SNP discovery using BESs and short NGS sequence reads resulted in successful SNP genotyping strategy in the heterozygous walnut genome.
General utility of identified SNPs
The AGSNP pipeline was updated here and is now applicable to genome-wide SNP discovery in all species, irrespective of their mating system, although the error rates of SNP discovery with the pipeline are higher in autogamous species (81.3 to 88%) than in allogamous species (69.6%). The greater fidelity of SNP discovery in autogamous species is undoubtedly related to greater efficiency with which the pipeline is able to separate SNPs from sequencing and mapping errors in autogamous species. The updated pipeline can be downloaded athttp://avena.pw.usda.gov/wheatD/agsnp.shtml.
Shotgun SOLiD sequencing
A fragment library was constructed from genomic DNA isolated from the walnut cultivar ‘Chandler’ using the Applied Biosystems Fragment Library Construction Kit (Life Technologies, Inc.). Templated beads were prepared from the fragment library using the ePCR kit v.2 and the Bead Enrichment Kit from Applied Biosystems for SOLiD4+. Workflow Analysis was done after the first round of templated bead preparation for each library using the Workflow Analysis kit from Applied Biosystems to check library quality and the amount of templated beads generated per ePCR. Additional Templated beads were deposited on slides using the Bead Deposition kit from Applied Biosystems. One full slide of the fragment library was sequenced. Greater details of SOLiD library preparation and sequencing were published earlier.
A total of 20 tissue-specific mRNA-Seq libraries were constructed and sequenced on the Illumina GAII platform to characterize the walnut transcriptome. Over 1 billion RNA-Seq reads were generated and trimmed for quality with a custom script. The trimmed reads derived from each sample were assembled using velvet v1.12/oases v1.15 and tgicl/CAP3. Assemblies at least 200 bp long were saved and redundancy among the contigs and singletons (128,286 sequences) was removed by mapping raw Illumina RNA-Seq reads to all assembled contigs and singletons from CAP3 with BWA[20, 21]. A threshold of 10 reads per kilobase mapped was set to arrive at a final set of transcriptome contigs (85,045 sequences, with a total of 137,069,830 bp and an average contig length of 1,612 bp). These sequences were used for the identification of genic BES.
BAC contig assembly
FPC BAC contigs were constructed from 113,063 fingerprinted Hin dIII and Mbo I BAC clones of walnut cultivar Chandler. A total of 917 contigs and 4,830 singletons were obtained from 108,233 clones suitable for contig assembly. Contigs can be found at (http://probes.pw.usda.gov:8080/walnut/Database) but details of contig assembly and the construction of walnut physical map will be published elsewhere.
BAC end sequences
The development of BESs for Mbo I and Hin dIII BAC clones has been described in detail previously. A total of 54,912 BESs were produced and used here as reference sequences for SNP discovery. BES quality scores were used for SNP quality checks in the AGSNP pipeline.
The AGSNP pipeline was updated for the discovery of SNPs in genomic sequences of a heterozygous individual. The following strategy for SNP discovery with the updated pipeline was followed. (1) Walnut BESs were annotated as genic and non-genic using blast searches against the walnut cDNA sequences. (2) SOLiD reads were mapped to the annotated BESs using the BWA program package and potential SNPs were found using SAM tools. (3) The maximum mapping depth cutoff value was computed according to the extreme value distribution to find high-quality SNPs. (4) SNP filtering criteria were adjusted and applied to SNP discovery. The details of implementation of the updated pipeline were described in Results.
The criteria used for SNP filtering are listed in Table 2. In the previous application of AGSNP for SNP discovery in an inbred line of Aegilops tauschii, SNPs located in repeated sequences and those due to mapping errors were removed using the average read mapping depth (RMD) () and standard deviation (s) estimated from the fitted extreme value distribution of mapping depths of all mapped sequences used in SNP discovery. A cutoff value of of mapped sequences was used as a boundary between single copy reference sequences and multi-copy reference sequences. Reference sequences of RMD less than this value were assumed to be single copy and those greater than this value were assumed to be repeated. The same strategy was used in this study. The filtered SOLiD reads were mapped to the 29,223 genic BESs using the updated AGSNP pipeline. The RMD () of SOLiD reads and standard deviation (s) were estimated to be 15.9 reads and 19.1 reads, respectively (Figure 4). However, as walnut is heterozygous, no data is available to suggest that a cut off of+2 - s is applicable as a boundary separating single copy sequences from repeated sequences in the reference sequences. To evaluate the relationship of RMDs with true-positive SNPs, we did not limit the maximum RMD in the initial SNP discovery. A more relevant RMD cutoff value was determined later, after SNP validation.
Folded variant frequency (FVF) is one of the criteria to filter out potentially false-positive SNPs among reads generated for an outcrossing individual. FVF represents the frequency of minor read variants in a stack of reads, which follows a binomial distribution and is expected to be 0.5 in a random mating population. A t-test was used to test whether or not the FVF of an SNP statistically deviated from 0.5 (H0). If the FVF is significantly different from 0.5 at the 0.05 probability level, the SNP was inferred to be false-positive. The test statistics is as follows:
, if RMD × 0.5 < 30, or, if RMD × 0.5 ≥ 30. If t ≥ t0.05, RMD-1, the FVF of an SNP is significantly different from 0.5 and thus is discarded, where t0.05, RMD-1 is a critical value of the t distribution at the 0.05 probability level with a degree of freedom of RMD-1.
Infinium iSelect construction and genotyping
All SNPs identified in BESs were submitted to Illumina for evaluation using Illumina’s Assay Design Tool (ADT). A total of 6,000 SNPs were selected for iSelectInfinium genotyping. To obtain a dense, genome-wide genetic map, SNP markers should be distributed evenly across the entire genome or be present in all FPC contigs. To maximize the likelihood of that, SNP selection was based on the following criteria: (1) only one SNP was chosen per BES, (2) at least one SNP marker was chosen per FPC contig, (3) the number of selected SNPs per FPC contig was proportional to contig size, and SNPs were evenly distributed along the contig, (4) if the same gene was in multiple BESs only one BES was chosen, and (5) only SNPs of Infinium II type were used.
A mapping population consisting of 428 F1 progeny produced from across between cultivars ‘Chandler’ and ‘Idaho’ was used for genotyping. The F1 individuals, along with their parents, were grafted on to ‘Paradox’ rootstock or grown on their roots in the field. The mapping population was segregating for a number of phenological and metric traits. A set of 20 microsatellite loci were used to confirm that the individuals in the mapping population were true hybrids.
The standard PCR-based approach of SNP validation by designing primers flanking an SNP, sequencing amplicons, and comparing them with expected genotype is not strictly applicable for a heterozygous genome due to possible amplification artifacts. SNPs were therefore validated indirectly through Infinium genotyping. SNPs were declared to be false positive if no polymorphism was observed at that nucleotide in cv ‘Chandler’ and its F1 progeny from the cross with cv ‘Idaho’.
Pedigrees of walnut cultivars including ‘Chandler’ released by the UC Davis walnut breeding program were collected from annual breeding progress reports and published papers[4, 9]. The pairwise coefficients of parentage (COP) between cultivars in pedigrees were calculated based on the definition in Malecot and Kempthorne. Due to the outcrossing nature of walnut, the inbreeding coefficient (F) of a cultivar was set to 0. A Perl program ‘calculate_COP.plx’ was written for pairwise COP calculations.
Estimation of heterozygosity and pairwise dissimilarity
In order to assess the utility of SNPs discovered in the single cultivar ‘Chandler’, a total of 32 walnut cultivars including ‘Chandler’ and ‘Idaho’ were genotyped with the 6 K Infinium genotyping assay. The heterozygosity percentage for each was calculated as the number of heterozygous loci divided by the total number of SNP markers. The pairwise dissimilarity coefficients were computed based on heterozygosity data using the improved coefficient definition and calculation methods for diploids and codominant markers.
All statistical analyses, including significance test, correlation analysis and logistic regression modeling, were performed using JMP 7.0 (SAS Institute Inc.) and Microsoft Excel (Microsoft).
Single nucleotide polymorphism
Bacterial artificial chromosome
BAC end sequence
Random amplified polymorphic DNA
Sequence characterized amplified region
Amplified fragment length polymorphism
Simple sequence repeats
Next generation sequencing
Polymerase chain reaction
- RMD :
Read mapping depth
- VF :
- FVF :
Folded variant frequency
Assay design tool
- MQS :
Mapping quality score
- RTP :
Rate of true-positive SNPs.
The authors thank the editors and the anonymous reviewers for their constructive comments and suggestions. This work was supported by the California Walnut Marketing Board (106–10162) and UC Discovery Grants (IT106-10162).
- Pijut PM, Woeste KE, Veugadesan G, Michler CH: Technological advances in temperate hardwood tree improvement including breeding and molecular marker applications. In Vitro Cell Dev-Pl. 2007, 43 (4): 283-303. 10.1007/s11627-007-9026-9.View Article
- Bayazit S, Kazan K, Gulbitti S, Cevik V, Ayanoglu H, Ergul A: AFLP analysis of genetic diversity in low chill requiring walnut (Juglans regia L.) genotypes from Hatay, Turkey. Sci Hortic-Amsterdam. 2007, 111 (4): 394-398. 10.1016/j.scienta.2006.11.006.View Article
- Fjellstrom RG, Parfitt DE: Walnut (Juglans Spp) genetic diversity determined by restriction-fragment-length-polymorphisms. Genome. 1994, 37 (4): 690-700. 10.1139/g94-097.View ArticlePubMed
- Nicese FP, Hormaza JI, McGranahan GH: Molecular characterization and genetic relatedness among walnut (Juglans regia L.) genotypes based on RAPD markers. Euphytica. 1998, 101 (2): 199-206. 10.1023/A:1018390120142.View Article
- Foroni I, Rao R, Woeste K, Gallitelli M: Characterisation of Juglans regia L. with SSR markers and evaluation of genetic relationships among cultivars and the ‘Sorrento’ landrace. J Hortic Sci Biotech. 2005, 80 (1): 49-53.
- Gunn BF, Aradhya M, Salick JM, Miller AJ, Yang YP, Liu L, Hai X: Genetic variation in walnuts (Juglans regia and J. sigillata; juglandaceae): species distinctions, human impacts, and the conservation of agrobiodiversity in Yunnan, China. Amer J Bot. 2010, 97 (4): 660-671. 10.3732/ajb.0900114.View Article
- Woeste K, Burns R, Rhodes O, Michler C: Thirty polymorphic nuclear microsatellite loci from black walnut. J Heredity. 2002, 93 (1): 58-60. 10.1093/jhered/93.1.58.View Article
- Zhang R, Zhu AD, Wang XJ, Yu J, Zhang HR, Gao JS, Cheng YJ, Deng XX: Development of Juglans regia SSR markers by data mining of the EST database. Pl Mol Biol Reporter. 2010, 28 (4): 646-653. 10.1007/s11105-010-0192-2.View Article
- Dangl GS, Woeste K, Aradhya MK, Koehmstedt A, Simon C, Potter D, Leslie CA, McGranahan G: Characterization of 14 microsatellite markers for genetic analysis and cultivar identification of walnut. J Am Soc Hortic Sci. 2005, 130 (3): 348-354.
- Wu JJ, Gu YQ, Hu YQ, You FM, Dandekar AM, Leslie CA, Aradhya M, Dvorak J, Luo MC: Characterizing the walnut genome through analyses of BAC end sequences. Pl Mol Biol. 2012, 78 (1–2): 95-107.View Article
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE: A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. Plos One. 2011, 6 (5): e19379-10.1371/journal.pone.0019379.PubMed CentralView ArticlePubMed
- Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, Wang X, Ott F, Muller J, Alonso-Blanco C, Borgwardt K, Schmid KJ, Weigel D: Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 2011, 43 (10): 956-963. 10.1038/ng.911.View ArticlePubMed
- Huang XH, Wei XH, Sang T, Zhao QA, Feng Q, Zhao Y, Li CY, Zhu CR, Lu TT, Zhang ZW, Li M, Fan DL, Guo YL, Wang A, Wang L, Deng LW, Li WJ, Lu YQ, Weng QJ, Liu KY, Huang T, Zhou TY, Jing YF, Li W, Lin Z, Buckler ES, Qian QA, Zhang QF, Li JY, Han B: Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet. 2010, 42 (11): 961-967. 10.1038/ng.695.View ArticlePubMed
- Lam HM, Xu X, Liu X, Chen WB, Yang GH, Wong FL, Li MW, He WM, Qin N, Wang B, Li J, Jian M, Wang JA, Shao GH, Wang J, Sun SSM, Zhang GY: Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection (vol 42, pg 1053, 2010). Nat Genet. 2011, 43 (4): 387-387. 10.1038/ng0411-387.View Article
- Tian F, Bradbury PJ, Brown PJ, Hung H, Sun Q, Flint-Garcia S, Rocheford TR, McMullen MD, Holland JB, Buckler ES: Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat Genet. 2011, 43 (2): 159-162. 10.1038/ng.746.View ArticlePubMed
- Fan JB, Chen XQ, Halushka MK, Berno A, Huang XH, Ryder T, Lipshutz RJ, Lockhart DJ, Chakravarti A: Parallel genotyping of human SNPs using generic high-density oligonucleotide tag arrays. Genome Res. 2000, 10 (6): 853-860. 10.1101/gr.10.6.853.PubMed CentralView ArticlePubMed
- Fan JB, Oliphant A, Shen R, Kermani BG, Garcia F, Gunderson KL, Hansen M, Steemers F, Butler SL, Deloukas P, Galver L, Hunt S, McBride C, Bibikova M, Rubano T, Chen J, Wickham E, Doucet D, Chang W, Campbell D, Zhang B, Kruglyak S, Bentley D, Haas J, Rigault P, Zhou L, Stuelpnagel J, Chee MS: Highly parallel SNP genotyping. Cold Spring HarborSymp Quant Biol. 2003, 68: 69-78. 10.1101/sqb.2003.68.69.View Article
- Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O’Connell J, Moore SS, Smith TP, Sonstegard TS, Van Tassell CP: Development and characterization of a high density SNP genotyping assay for cattle. PLoS One. 2009, 4 (4): e5350-10.1371/journal.pone.0005350.PubMed CentralView ArticlePubMed
- Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, Hansen MS, Hedegaard J, Hu ZL, Kerstens HH, Law AS, Megens HJ, Milan D, Nonneman DJ, Rohrer GA, Rothschild MF, Smith TPL, Schnabel RD, Van Tassell CP, Taylor JF, Wiedmann RT, Schook LB, Groenen MAM: Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One. 2009, 4 (8): Artn E6524-View Article
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMed
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595. 10.1093/bioinformatics/btp698.PubMed CentralView ArticlePubMed
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10 (3): Artn R25-View Article
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMed
- Ratan A, Zhang Y, Hayes VM, Schuster SC, Miller W: Calling SNPs without a reference sequence. BMC Bioinformatics. 2010, 11: 130-10.1186/1471-2105-11-130.PubMed CentralView ArticlePubMed
- You FM, Huo NX, Deal KR, Gu YQ, Luo MC, McGuire PE, Dvorak J, Anderson OD: Annotation-based genome-wide SNP discovery in the large and complex Aegilops tauschii genome using next-generation sequencing without a reference genome sequence. BMC Genomics. 2011, 12: Artn 59-View Article
- Romanov MN, Price JA, Dodgson JB: Integration of animal linkage and BAC contig maps using overgo hybridization. Cytogenet Genome Res. 2003, 102 (1–4): 277-281.View ArticlePubMed
- Klein PE, Klein RR, Cartinhour SW, Ulanch PE, Dong JM, Obert JA, Morishige DT, Schlueter SD, Childs KL, Ale M, Mullet JE: A high-throughput AFLP-based method for constructing integrated genetic and physical maps: progress toward a sorghum genome map. Genome Res. 2000, 10 (6): 789-807. 10.1101/gr.10.6.789.PubMed CentralView ArticlePubMed
- Luo MC, Xu K, Ma Y, Deal KR, Nicolet CM, Dvorak J: A high-throughput strategy for screening of bacterial artificial chromosome libraries and anchoring of clones on a genetic map constructed with single nucleotide polymorphisms. BMC Genomics. 2009, 10: 28-10.1186/1471-2164-10-28.PubMed CentralView ArticlePubMed
- You FM, Luo MC, Xu K, Deal KR, Anderson OD, Dvorak J: A new implementation of high-throughput five-dimensional clone pooling strategy for BAC library screening. BMC Genomics. 2010, 11: 692-10.1186/1471-2164-11-692.PubMed CentralView ArticlePubMed
- Deschamps S, Rota ML, Ratashak JP, Biddle P, Thureen D, Farmer A, Luck S, Beatty M, Nagasawa N, Michael L, Llaca V, Sakai H, May G, Lightner J, Campbell MA: Rapid genome-wide single nucleotide polymorphism discovery in soybean and rice via deep resequencing of reduced representation libraries with the Illumina genome analyzer. The Plant Genome. 2010, 3 (1): 16-View Article
- Hyten DL, Song Q, Fickus EW, Quigley CV, Lim JS, Choi IY, Hwang EY, Pastor-Corrales M, Cregan PB: High-throughput SNP discovery and assay development in common bean. BMC Genomics. 2010, 11: 475-10.1186/1471-2164-11-475.PubMed CentralView ArticlePubMed
- Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, Cregan PB: High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics. 2010, 11: 38-10.1186/1471-2164-11-38.PubMed CentralView ArticlePubMed
- Nelson JC, Wang SC, Wu YY, Li XR, Antony G, White FF, Yu JM: Single-nucleotide polymorphism discovery by high-throughput sequencing in sorghum. BMC Genomics. 2011, 12: 352-10.1186/1471-2164-12-352.PubMed CentralView ArticlePubMed
- Chancerel E, Lepoittevin C, Le Provost G, Lin YC, Jaramillo-Correa JP, Eckert AJ, Wegrzyn JL, Zelenika D, Boland A, Frigerio JM, Chaumeil P, Garnier-Gere P, Boury C, Grivet D, Gonzalez-Martinez SC, Rouze P, Van de Peer Y, Neale DB, Cervera MT, Kremer A, Plomion C: Development and implementation of a highly-multiplexed SNP array for genetic mapping in maritime pine and comparative mapping with loblolly pine. BMC Genomics. 2011, 12: 368-10.1186/1471-2164-12-368.PubMed CentralView ArticlePubMed
- Jermstad KD, Eckert AJ, Wegrzyn JL, Delfino-Mix A, Davis DA, Burton DC, Neale DB: Comparative mapping in Pinus: sugar pine (Pinus lambertiana Dougl.) and loblolly pine (Pinus taeda L.). Tree Genetics & Genomes. 2011, 7 (3): 457-468. 10.1007/s11295-010-0347-1.View Article
- Lepoittevin C, Frigerio JM, Garnier-Gere P, Salin F, Cervera MT, Vornam B, Harvengt L, Plomion C: In vitro vs in silico detected SNPs for the development of a genotyping array: What can we learn from a non-model species?. PLoS One. 2010, 5 (6): ARTN e11034-10.1371/journal.pone.0011034.View Article
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMed
- Huang XQ, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9 (9): 868-877. 10.1101/gr.9.9.868.PubMed CentralView ArticlePubMed
- Malecot G: Les Mathematiques de l’Hérédité. 1948, Masson, Paris
- Kempthorne O: An introduction to genetic statistics. 1969, Iowa State University Press
- Kosman E, Leonard KJ: Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Mol Ecol. 2005, 14 (2): 415-424. 10.1111/j.1365-294X.2005.02416.x.View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.