Digital genotyping of sorghum – a diverse plant species with a large repeat-rich genome
© Morishige et al.; licensee BioMed Central Ltd. 2013
Received: 8 January 2013
Accepted: 28 June 2013
Published: 5 July 2013
Skip to main content
© Morishige et al.; licensee BioMed Central Ltd. 2013
Received: 8 January 2013
Accepted: 28 June 2013
Published: 5 July 2013
Rapid acquisition of accurate genotyping information is essential for all genetic marker-based studies. For species with relatively small genomes, complete genome resequencing is a feasible approach for genotyping; however, for species with large and highly repetitive genomes, the acquisition of whole genome sequences for the purpose of genotyping is still relatively inefficient and too expensive to be carried out on a high-throughput basis. Sorghum bicolor is a C4 grass with a sequenced genome size of ~730 Mb, of which ~80% is highly repetitive. We have developed a restriction enzyme targeted genome resequencing method for genetic analysis, termed Digital Genotyping (DG), to be applied to sorghum and other grass species with large repeat-rich genomes.
DG templates are generated using one of three methylation sensitive restriction enzymes that recognize a nested set of 4, 6 or 8 bp GC-rich sequences, enabling varying depth of analysis and integration of results among assays. Variation in sequencing efficiency among DG markers was correlated with template GC-content and length. The expected DG allele sequence was obtained 97.3% of the time with a ratio of expected to alternative allele sequence acquisition of >20:1. A genetic map aligned to the sorghum genome sequence with an average resolution of 1.47 cM was constructed using 1,772 DG markers from 137 recombinant inbred lines. The DG map enhanced the detection of QTL for variation in plant height and precisely aligned QTL such as Dw3 to underlying genes/alleles. Higher-resolution NgoMIV-based DG haplotypes were used to trace the origin of DNA on SBI-06, spanning Ma1 and Dw2 from progenitors to BTx623 and IS3620C. DG marker analysis identified the correct location of two miss-assembled regions and located seven super contigs in the sorghum reference genome sequence.
DG technology provides a cost-effective approach to rapidly generate accurate genotyping data in sorghum. Currently, data derived from DG are used for many marker-based analyses, including marker-assisted breeding, pedigree and QTL analysis, genetic map construction, map-based gene cloning and association studies. DG in combination with whole genome resequencing is dramatically accelerating all aspects of genetic analysis of sorghum, an important genetic reference for C4 grass species.
The acquisition of high quality genotyping information is essential for the assessment of genetic diversity, pedigree analysis, genetic map construction, QTL (Quantitative Trait Locus) analysis, marker-assisted breeding and association studies. Since development of RFLP (Restriction Fragment Length Polymorphism) technology , numerous methods for the analysis of DNA polymorphism have been developed including AFLPs (Amplified Fragment Length Polymorphism; ), SSRs (Simple Sequence Repeats), chip-based genotyping [2, 3], GOOD , Taqman , TILLING (Targeting Induced Local Lesions in Genomes; ), SBE (Single-Base Extension; ) and SNPWave technologies . Most DNA marker systems require the discovery and validation of SNPs (single nucleotide polymorphism) and INDELs (insertion- deletion) that are then targeted for high throughput assay. This approach allows the most informative and reproducible DNA markers to be utilized, but also can introduce ascertainment bias into analyses. Moreover, in diverse species with high rates of polymorphisms (>1/100 bp), indirect assays for DNA polymorphisms can be inefficient [9, 10].
New high capacity DNA sequencing platforms provide an opportunity to transition from indirect assays of DNA polymorphism to genotyping by sequencing . For many small genome species, complete genome resequencing is a feasible approach for genotyping. However, for species with large and highly repetitive genomes, the acquisition of whole genome sequences for the purpose of genotyping is inefficient and too expensive to be done on a routine basis. Moreover, most genotyping applications, such as marker-assisted breeding, require detection of only a subset of the extant genetic diversity among individuals to be effective. For these species and applications, targeted resequencing of specific sub-regions or ‘reduced representations’ of genomes provides sufficient information for genetic analysis.
Methods for acquiring ‘reduced representations’ of genomes for genotyping include the capture of DNA by hybridization to oligonucleotide arrays , skimming randomly sheared genomic DNA , and by using restriction enzymes . The use of restriction enzymes for analysis of DNA polymorphism originated with RFLP analysis and is embedded in numerous DNA marker assays such as AFLP technology  and the related CRoPS (Complexity Reduction of Polymorphic Sequences) resequencing method for SNP discovery . Baird and colleagues  successfully utilized resequencing of ‘restriction site associated DNA’ (RAD-tags) for SNP discovery and genetic mapping in stickleback species. A further increase in efficiency was achieved through the use of barcoding to enable multiplex sequencing of DNA pooled from numerous individuals. More recently, Genotyping-By-Sequencing (GBS) was described and tested on maize and barley, two grass species with large and highly repetitive genomes . GBS utilizes multiplex sequencing of DNAs generated by a single restriction enzyme, ApeKI, a methylation insensitive restriction enzyme that recognizes the sequence GCWCG. One challenge noted in most of the prior genotyping-by-sequencing methods is variation in depth of sequencing among multiplexed samples, as well as site-to-site variation within the genome of a single genotype. This situation reduces efficiency and accuracy, requiring either greater depth of overall sequencing to obtain accurate information at a high portion of sites containing DNA polymorphisms, or indirect methods for haplotype reconstruction by imputation (e.g. ).
Our group is working on Sorghum bicolor, a diploid C4 grass that has a genome size of ~820 Mbp determined by flow cytometry  and encodes approximately ~30,000 genes, spanning ~150 Mbp of ‘gene space’ that is not highly methylated [17, 18]. The remainder of the sorghum genome is largely composed of highly methylated repetitive DNA, preferentially localized in pericentromeric heterochromatic regions that have low rates of recombination . Related members of sorghum can have much larger and more complex genomes that are more similar to polyploid grass species . The sorghum genome sequence enables in silico testing of various genotyping by resequencing options, aids analysis of acquired sequences and the validation of results.
The sorghum research and public breeding community is small, therefore the development of chip-based methods for genotyping has been hindered by the start-up costs for this technology. Thus, we began developing a restriction enzyme guided genotyping-by-sequencing method termed Digital Genotyping (DG), when the 454 genome sequencing platform became available  and later transitioned this technology onto the Illumina GAIIx and HiSeq2000 to take advantage of their increased sequencing capacity . DG was designed to enable analysis of sorghum genotypes at different levels of complexity (number of sites per genome), using a set of methylation sensitive restriction enzymes that have nested cut sites, so that information from all assays can be easily cross-referenced. Additionally, we report some of the reasons why variation in sequencing depth per site occurs within the same genome and how to minimize this source of inefficiency. Digital Genotyping was validated through genetic map reconstruction, QTL analysis, and haplotype/pedigree analysis.
The DG method provides an efficient means to produce accurate sequence-based genotype information for SNP discovery and genetic map construction within large populations in a short period of time. In the current iteration of the method using FseI, index or barcode sequences incorporated into the adapters used for template synthesis permits combining DNA from up to 48 individual lines into a common pool for downstream processing. After sequencing on the Illumina GAIIx platform, the raw sequencing reads are processed and deconvoluted into individual groups by barcode. After parsing for reads containing proper bar codes and partial restriction site sequences, an efficiency of 80-90% is normally attained. Absence of a proper barcode and/or restriction site in a DG sequence is usually the result of improper purification of the products after the first ligation step or off-site PCR amplification. Initial DNA quality and accurate DNA quantitation also ensures higher yields of final useable sequence.
The restriction enzymes used for DG were selected based on six criteria: (1) sensitivity to DNA methylation to reduce template generation from repetitive regions of the sorghum genome; (2) GC-rich digestion sites that preferentially cut in or near genes; (3) lack or a limited number of sites of digestion in plastid DNA; (4) digestion at nested 4, 6, or 8 base restriction sites to enable varying depth of analysis; (5) generation of over-hanging termini that facilitate adapter ligation; and (6) presence of unique polymorphic sequences flanking restriction sites that provide good coverage of the genome. Restriction enzymes with nested 4, 6, and 8 bp recognition sites were sought so that information from analysis at different numbers of sites in the genome could be cross-referenced, enabling internal validation and coherent information sharing among different types of analysis (i.e., marker-assisted breeding, genetic maps, association studies). Restriction enzymes that met the criteria listed above were screened in silico to eliminate enzymes that cut preferentially in repetitive sequences and to confirm that sites of digestion would provide good coverage of the sorghum genome.
Restriction enzymes used for digital genotyping
No. of potential sequences RE sites x 2 ( In silico )
DG sequences unique, 33 bp ( In silico )
DG sequences unique, 33 bp (Sequenced >3x)
Only unique DG sequences that were sequenced a minimum of three times and that mapped to a single location in the genome were used for genotyping. In this study, ‘unique’ DG sequences were defined as genomic sequences of a specified length adjacent to restriction enzyme recognition sites used for DG template preparation that mapped to either one location in the sorghum genome only or when mapped to more than one location, the top alignment differed from the second best alignment by at least 2 bp. The requirement for a two base difference among alignments was used so that a SNP allele in one DG sequence would not be confused with a DG sequence that maps to a different site. Most of the analysis presented here utilizes 33 bp of genomic DNA sequence for DG analysis. However, as DNA sequencing platforms have improved in accuracy, we have increased read lengths used for genotyping from 33 bp to 72 bp on the Illumina GAIIx and to 100 bp on the HiSeq2000. The number of 33 bp DG sequences in the sorghum genome obtained from BTx623 at a sequencing depth of 3x or greater ranged from 22,272 (FseI) to ~572,000 (HpaII), depending on the enzyme used to generate DG template (Table 1). Only 19,894 of the 22,272 unique FseI DG sequences were used for genotyping. When two FseI sites were located in close proximity and mapped to a unique overlapping genome sequence, only one of the DG sequences was used for analysis, thereby eliminating 2,378 sequences.
The substantial number of repetitive sequences adjacent to any set of restriction enzyme sites in the sorghum genome represents a potential source of inefficiency. For example, approximately 50% of the sequences flanking FseI sites are repetitive. Repetitive DNA in plant genomes is highly methylated, therefore utilization of restriction enzymes sensitive to DNA methylation such as FseI, NgoMIV and HpaII should reduce the acquisition of repetitive DG sequences. This expectation was confirmed. The ratio of unique/repetitive 33 bp DNA sequences obtained by sequencing DG templates generated using FseI was ~6.3:1, compared to a ratio of 1:1 in the reference genome sequence determined by in silico analysis. Approximately 90% of the possible unique DG sequences flanking FseI sites were represented in DG templates, indicating that ~5-10% of the FseI sites flanked by a unique DG sequence were methylated. A small number of the FseI derived DG sequences were also not identified due to the close proximity of two FseI sites. Taken together, these results indicate that ~90% of the FseI sites flanked by repetitive sequences were methylated and not represented in DG templates. A similar enrichment of unique sequences was obtained using the methylation sensitive restriction enzymes NgoMIV and HpaII in which approximately 81% and 37% of the in silico identified sequences, respectively, were actually sequenced. These data demonstrate that the use of methylation sensitive restriction enzymes significantly increases the efficiency of DG.
When sequencing DG templates from different genotypes, the number of identified DNA polymorphisms will depend on several factors: (1) the number of ‘unique’ DG sequences derived from two or more genotypes that initially can be compared; (2) the length of the unique genome sequence acquired from DG templates, excluding the restriction enzyme partial site and barcode; and (3) the density of polymorphism among the genotypes analyzed in unique DG sequence space. If two parental lines used for genetic map construction have a polymorphism density of 1/500 bp in DG sequences, then analysis of 20,000 unique DG sequences 33 bp in length would be predicted to yield 1,320 DG markers for genetic analysis. This prediction was tested and the DG process further optimized through analysis of BTx623 and IS3620C, a pair of inbred sorghum genotypes, and 137 recombinant inbred lines (RILs) derived from these genotypes [25–27]. A precise alignment between the reference BTx623 genome sequence  and DG sequences derived from these genotypes were possible. To obtain sufficient information for analysis, an average of ~908,000 (+/− 278,200) sequences were obtained from DG templates prepared for each RIL. The templates were prepared and sequenced in pools of 24 RILs per lane on the Illumina GAIIx. The resulting range of sequencing depth per RIL varied from 335,000 to 1.9 M, indicating non-uniform pooling of DG templates from RILs constituting the pools. Approximately 88% of the reads acquired from the Illumina GAIIx contained the expected FseI restriction site and barcode sequences and could therefore be used for further analysis.
We observed a sequencing error frequency of ~0.5-1% on the Illumina GAIIx. At this error rate up to one-third of the 33 bp DG sequences will contain sequencing errors. However, if the errors are random, the probability that the same error-containing sequence will occur several times in a specific DG sequence sequenced <100 times is low. To exclude this type of sequencing error, genetic analysis was based on DG sequences obtained three or more times from BTx623 or IS3620C.
BTx623 x IS3620c recombinant inbred digital genotyping map statistics
Chromosome Length (Mbp)
Total Genetic Distance (cM)
1 marker / cM
1 marker / Kbp
Largest interval (cM)
The fidelity of DG marker identification and the validity of the criteria for assigning genotypes were evaluated by comparing DG genotypes obtained from the 137 RILs to SSR genotypes previously collected from this population . The location of DG markers on each chromosome was determined by alignment of DG sequences with the reference genome. SSR markers previously used for genetic map construction were also aligned to the genome sequence based on their oligonucleotide sequences and in an order consistent with prior genetic analysis . The genotypes of 16 SSR markers on LG-01 and a DG marker located within 50 kbp of each SSR marker were compared in the 137 RILs. There was 99% agreement between genotypes assigned by the two types of markers in homozygous regions of the genome (data not shown). All but one of the 21 differences in genotype assignment out of 2,105 loci examined were due to SSR genotypes that interrupted haplotypes, possibly caused by double recombination events flanking these SSR markers, or more likely, due to genotyping errors associated with the SSR markers.
Genotyping accuracy was further quantified and criteria for assigning genotypes improved through analysis of pairs of DG markers derived from the same restriction site. Approximately 10% of the time, DG sequences flanking a common site of digestion were ‘unique’ and contained DNA sequence polymorphisms that distinguish BTx623 and IS3620C. Because the polymorphisms in these ‘pairs’ of DG markers are within 100 bp, genotypes assigned using data derived from the DG markers should be the same except in rare circumstances when recombination occurs within this interval. Therefore, the overall accuracy of DG genotype assignment was assessed by determining the consistency of the genotypes assigned by 40 pairs of these DG markers using data obtained from the 137 RILs. In homozygous regions of the RIL genomes, all 5,200 genotypes assigned using data from pairs of DG markers were identical (12 missing data points), indicating a high degree of genotyping accuracy in these regions of the RIL genomes (data not shown).
Criteria for assigning DG marker genotypes was further refined through analysis of sequences obtained from DG markers located in homozygous haplotypes. Only one allele sequence should be present in these regions of the RIL genomes. Therefore a count of the number of times the expected allele was sequenced compared to the alternative allele was used to estimate of the accuracy of genotype assignment based on DG marker data (average depth of sequencing/DG marker = 41). Overall, the expected DG allele was obtained 99.7% of the time (246 alternative allele sequences out of 76,032 sequences analyzed, data not shown). For 1,516 of the 1,728 DG markers analyzed, the correct allele was the only sequence obtained. One alternative allele was found in 183 DG marker sequences, two alternative alleles were found 25 times, three alternative alleles were found three times and four alternative alleles was found once. For DG markers where three or four alternative alleles were sequenced, the ratio of expected to unexpected allele sequence was >20:1. Thus, the assignment of DG genotypes in homozygous regions of a genome was very accurate.
Analysis of DG sequences from heterozygous (HET) regions of the RIL genomes (e.g. Figure 2) revealed that accurate assignment of DG genotypes in these regions is more challenging for several reasons. First, the two DG alleles from HET regions are sequenced on average only 50% as deeply as DG marker alleles from homozygous regions of the genome. Second, accurate assignment of HET genotypes requires more sequence reads per DG marker in order to be certain sufficient reads from both alleles have been acquired, if present, and an accurate ratio of allele sequences has been obtained. This potential source of false negative error can be reduced by greater depth of overall sequencing and by setting criteria that require more reads per DG marker for genotype assignment within heterozygous haplotypes. Third, pooled DNA from RIL progeny used for genotyping is occasionally enriched in alleles from one of the two parental genotypes due to non-uniform tissue pooling or progeny growth. This source of error can be minimized by pooling of equal amounts of tissue or DNA from large numbers of progeny. A fourth source of error occurs when alleles are sequenced with different efficiency (discussed below). A consideration of these factors and allele sequencing data described above led us to assign HET genotypes when both allele sequences corresponding to a DG marker are obtained three times or more and the ratio of allele sequences is <15:1. In addition, DG markers located within heterozygous haplotypes were only assigned genotypes when they are sequenced a total of 15 times or more. The criteria were independently assessed by carrying out DG analysis on F1 plants derived from a cross of Hegari x 100 M that would be expected have HET DG genotypes at all loci. Approximately 1,200 DG markers were analyzed and 99% of the genotypes called were HETs (data not shown). The remaining genotypes (A or B calls) occurred when too few reads were obtained from a given site (<10) and a few sites with highly skewed ratios of allele sequences (i.e. 26:1). We conclude that the empirical method developed here for identifying HET DG genotypes is reasonably accurate although further refinements will be possible in the future.
Genome coordinates of DG mapped super contigs
Super contig marker
Start positon (bp)
End position (bp)
41,820,858 - 44,466,741
51,294,508 - 51,595,270
57,928,418 - 58,404,926
42,789,974 - 44,674,214
60,223,905 - 60,421,993
49,298,803 - 49,486,219
0 - 1,083,522
Height QTL based on DG markers and phenotype data from Hart et al. (2001)
A comparison of BTx623 and IS3620C DG genotypes showed that their genomes were highly polymorphic at the beginning of SBI-06 and from ~31.5 Mbp to the end of the long arm of SBI-06. However, DG sequences located between ~0.5 Mbp to ~31.5 Mbp showed a limited amount of polymorphism. The origin of this block of DNA in IS3620C was investigated by comparison with BTx406, the genotype used in the conversion program. The haplotype of the region of IS3620C from ~0.5-43 Mbp was nearly identical to BTx406, consistent with introgression of this block of DNA into IS3620 during the conversion process. BTx406 was derived from a cross of BTx398 and SA403 . The haplotype of the region in BTx406 from 0.5-30.4 Mbp was identical to BTx398, indicating that during construction of BTx406, this genomic region was inherited from BTx398 (Figure 4). BTx398 and CK60, the immediate progenitor of BTx623 were developed during the period from 1920 to 1950 from a limited number of Kafir/Milo genotypes. BTx623 and CK60 have identical DG genotypes across this entire region of SBI-06, therefore, it is not surprising that the region from 0.5-30.4 Mbp of SBI-06 from BTx398, BTx406 and IS3620C is similar to BTx623. The low diversity of this region in IS3620C and BTx623 explains why there were so few DG markers in this portion of the genetic map derived from these lines.
The haplotype of IS3620C was nearly identical to BTx406 from ~32 Mbp to 43.5 Mbp of SBI-06, consistent with introgression of this region of BTx406 into IS3620C (Figure 4). A prior study showed that the recessive dw2 allele in BTx406 was derived from Double Dwarf Yellow Milo, whereas the recessive ma1 allele in BTx406 was derived from Early White Milo . SM100, an early flowering (ma1) and short (dw2) genotype, was also derived from a cross of Double Dwarf Yellow Milo and Early White Milo ; therefore, the genotype of SM100 was compared to BTx406. The haplotype of SM100 from 31.5 Mbp to 45 Mbp was nearly identical to BTx406 and IS3620C, consistent with these regions being identical by descent and recessive for both ma1 and dw2. The region of IS3620C from 45 Mbp to the end of SBI-06 was genetically distinct from BTx406, consistent with its origin from IS3620 (data not shown).
Variation in DG template length also affects the relative frequency of read acquisition on the Illumina GAIIx platform. The influence of template size on the relative sequencing efficiency was analyzed by generating templates with a wide range of fixed sizes using FseI and MseI, a methylation insensitive restriction enzyme that recognizes the four base sequence TvTAA. Barcoded adapters were ligated to FseI-generated termini as usual, but the second adapter was ligated to the DNA termini created after digestion with MseI instead of blunt-end termini generated by shearing. After sequencing on an Illumina GAIIx the full length of each sequenced DG template between the FseI site and the nearest MseI site was determined in silico. The relationship between template length and sequence frequency was determined. From this analysis it was evident that genomic sequences were obtained at very different frequencies from templates that ranged from 43 bp to 350 bp in length. Templates less than 65 bp were rarely sequenced because DNA purification during template generation preferentially removes small DNA fragments. Templates sequenced 40–72 times averaged 109 bp in length; templates sequenced 20–40 times averaged 148 bp; and templates sequenced 5–20 times were an average of 286 bp (data not shown). Reduced depth of sequencing of longer templates likely reflects a combination of decreased efficiency of amplification in PCR steps used in template preparation and less efficient bridge amplification of longer templates. These results indicate that a more uniform depth of sequencing among DG markers is achieved using randomly sheared DG templates that are within the same size distribution. To minimize read-depth variation due to template size, DNA from individual lines was pooled following ligation of barcoded adapters and sheared en masse, followed by ligation of the second adapter. DNA templates generated by shearing had an initial size range of 100–500 bp (data not shown). Templates of an optimal size for sequencing (~150-250 bp) were selected by sizing DNA on agarose gels, followed by excision and elution of DNA.
Digital Genotyping was developed to aid in the genetic analysis of sorghum and other grass species that have large repeat rich genomes. This general approach to genotyping is now feasible due to the rapidly decreasing cost of DNA sequencing over the past decade. Genotyping by resequencing is also efficient for species that lack array-based genotyping platforms because it combines polymorphism discovery and analysis and has the added benefit of reducing ascertainment bias. The acquisition of genotypes by sequencing is very accurate and rapid once template preparation is multiplexed using barcodes, informatics pipelines are established, and criteria for assigning genotypes have been validated.
One of the central design principles embedded in DG is flexible depth of analysis and coherent cross-referencing of information derived from different applications. To accomplish this, we selected a set of restriction enzymes for DG template generation that recognize a nested set of 4, 6 or 8 bp sequences enabling analysis of ~520 K, 120 K and 20 K unique DG sequences, respectively, depending on the amount of information required. Further variation in depth of analysis and cost per assay can be obtained by changing sequence read length from 33 to 100 bp. This flexibility, combined with multiplex sample analysis, increases DG efficiency by allowing depth of analysis to be varied to match the information requirement of each application. For example, we currently multiplex 48 genotypes prepared with FseI per lane on the Illumina GAIIx for genetic map construction and 12–24 genotypes per lane prepared with NgoMIV for haplotyping, but use lower depth of analysis for marker-assisted breeding applications. With this design, FseI-derived DG markers are a subset of the DG markers generated by NgoMIV, and NgoMIV DG markers are a subset of DG sequences derived from analysis with HpaII, allowing coherent cross referencing between different levels of analysis. One important benefit of this transition has been greatly improved alignment and fine mapping of QTL relative to the underlying genes/alleles and precise inter-map alignment of QTL identified in populations derived from different parental genotypes.
The three restriction enzymes selected for DG have other useful properties including: (1) sites of digestion that are GC-rich and preferentially located in or near genes; (2) DNA methylation sensitivity that results in an ~6-fold enrichment of unique versus repetitive sequences; (3) lack of sites of digestion in plastid DNA for FseI, eliminating a potential source of background due to the high copy number of the plastid genome; (4) generation of termini with overhangs that improve adapter ligation efficiency; and (5) high proportion of sequences flanking sites of digestion that are unique and polymorphic in sorghum germplasm. In silico analysis indicated that the distribution of restriction sites recognized by these enzymes would provide good coverage within the genic portion of chromosomes, a fact confirmed by sequencing DG templates. DG templates generated by restriction enzymes that are methylation sensitive have relatively low coverage in pericentromeric heterochromatic regions of sorghum chromosomes. These regions also have low rates of recombination. However, for some studies analysis of these regions using DG is of interest. This can be accomplished by preparing DG templates using restriction enzymes that are not sensitive to DNA methylation. For example, high-resolution DNA methylation mapping and analysis of pericentromeric heterochromatic regions of the sorghum genome can be carried out using the isoschizomers HpaII (methylation sensitive) and MspI (methylation insensitive).
DG was very accurate once parameters for allele identification and assignment of genotypes were optimized. During DG allele discovery most of the sequences containing random sequencing errors were eliminated with the requirement that DG sequences used for genetic analysis are obtained at least three times from parental lines. This criterion is useful during allele sequence discovery and eliminates most random sequencing errors that would occur when conducting DG analysis of species that lack reference genome sequences. DG markers, identified by comparing DG sequences derived from parental lines, genetically mapped with high fidelity to locations in the genome predicted by alignment of DG sequences to the reference genome. Comparison of genotypes assigned by pairs of DG markers derived from the same site of digestion confirmed the high level of accuracy of allele identification and genotype assignment in homozygous regions of RILs. An overall accuracy of allele identification of 99.7% was obtained. Genotype assignment in heterozygous regions was more challenging than in homozygous regions, requiring greater depth of analysis and more stringent criteria to assign genotypes accurately.
The overall efficiency of DG analysis has been enhanced by continuous improvements in sequencing, sample barcoding/multiplexing, and use of methylation sensitive restriction enzymes that access different numbers of DG sequences, depending on the information requirements of genotyping applications. On the other hand, the frequency of random sequencing errors on the Illumina GAIIx reduces efficiency by requiring a 20-40X average depth of sequencing and a minimum of 2-4X deep sequencing per DG marker during allele discovery and validation. However, following allele validation, lower depths of sequencing can be used to obtain high quality genotypes by designing informatics pipelines that preferentially search for validated allele sequences. We found that 99.7% of the time the expected DG allele sequence was obtained rather than the alternative allele. Overall, we routinely collect ~1,000 DG marker genotypes (15X average depth) from 400 samples per run on the GAIIx for ~ $8,000, excluding capital costs. Moreover, the estimated cost of genotyping is expected to be ~5-fold lower on the HiSeq2000 due to the increased number of templates sequenced per run and more uniform amplification of DG templates.
The main source of inefficiency in DG analysis identified in this study is variation in sequencing depth per genotype and among different DG markers. Variation in sequencing depth among multiplexed genotypes has been documented previously . In the present analysis, the average depth of sequencing of individual genotypes that comprise pools of 24 RILs was 908 K (+/−278 K) but overall, sampling depth ranged from 335 K to 1.9 M. This results partly from variation in the amount of starting DNA from each genotype that is subjected to digestion and ligation of barcoded adapters in the first step of the protocol. In addition to careful quantitation of input genomic DNA, q-PCR could be used following digestion and ligation of bar-coded adapters to normalize template numbers derived from different genotypes at the template pooling stage . Variation in template copy number among pooled genotypes can be compensated for by deeper sequencing, imputation of missing data, or by rerunning samples sequenced at low depth.
Variation in the relative depth of sequencing exhibited by different DG markers is a more significant issue. We found greater than 40-fold variation in sequencing depth for different DG markers from a given RIL and from pairs of DG markers derived from the same restriction site. The variation in the relative efficiency of sequencing among DG markers was consistently observed in different RILs and was associated with differences in the GC-content of DG templates. This result is consistent with the observation that DNA templates with high GC content are less efficiently bridge-amplified on the Illumina GAIIx leading to under representation of these sequences. New cluster generation kits that amplify templates of varying GC-content more uniformly for sequencing on the HiSeq2000 should reduce this source of variation.
Template length was another source of significant variation in relative efficiency of DG sequencing on the Illumina GAIIx. This was discovered when analyzing template sequences generated using FseI and MseI. Sequencing depth varied > 40-fold overall among templates of varying length regardless of GC content. For example, templates with an average length of 109 bp were sequenced 6-fold more frequently than templates that averaged 286 bp in length and DNA templates 350 bp or longer were rarely sequenced. This source of variation will be similarly present in template populations generated by ApeKI  or when using restriction enzymes such as HpaII and MseI as implemented in CRoPs technology . Variation due to differences in template length can be overcome in part by greater depth of sequencing or by imputation of missing data [10, 15]. However, this dynamic combined with variation in number of sequences/genotype in multiplexed samples could potentially result in significant amounts of missing data or overall loss of genotyping efficiency. This led us to utilize random shearing to generate DG templates of a more uniform size from all DG markers.
The utility of DG was tested and demonstrated through genetic map construction, improved genome sequence assembly, QTL mapping, and haplotype analysis. A DG genetic map was constructed by scoring 1,772 DG markers in 137 RILs derived from BTx623 and IS3620C. The resulting genetic map spanned 1,233 cM with an average resolution of 1.47 cM, a 6-fold improvement over a previous genetic map based on data from SSR/RFLP markers . The DG map was used to reanalyze QTL for variation in plant height, using the original height phenotype values collected by Hart and coworkers . The new QTL analysis identified the same four height QTL identified previously, but with higher LOD scores. More importantly, because DG map density is higher, and DG markers are located on the reference sequence, alignment between QTL and the underlying causative alleles is more precise. For example, the QTL peak corresponding to Dw3 was aligned with the gene known to cause variation in height at this locus. Higher resolution NgoMIV-depth DG haplotype analysis of IS3620C and BTx623 and their progenitors clarified the nature and origin of DNA present in SBI-06 in these genotypes. The analysis showed that IS3630C inherited DNA from approximately 0.5-32 Mbp from BTx398 via BTx406 during the conversion of IS3620 to a short, early flowering genotype. The common origin of BTx398 and BTx623 explained the low number of DG markers in the interval from 0.5 Mbp to 32 Mbp in the BTx623 x IS3620C genetic map. The results also confirmed that ma1 and dw2 in IS3620C traced back to recessive alleles present in BTx406, originally found in Milo genotypes as reported by Quinby .
DG analysis also helped improve the assembly of the sorghum reference genome sequence. DG marker mapping identified two regions of the reference sequence that were probably miss-assembled due to the high repeat content in these regions of the sorghum genome. Three DG markers that aligned to a region of the reference sequence on SBI-06 mapped in a cluster on SBI-07 and a DG marker aligned to the reference sequence on SBI-02 was mapped to SBI-03, indicating that the sequence assembly in these regions should be reexamined. Furthermore, DG markers aligned to sequences present in seven super-contigs that are not currently merged with the 10 pseudomolecules that comprise the reference sequence. Genetic analysis of these DG markers allowed the super-contigs to be placed in their approximate positions in the reference sequence. These results indicate that the construction of additional DG maps from other diverse parental genotypes will improve the quality and coverage of the sorghum reference sequence.
Our sorghum genomics and breeding group has transitioned to Digital Genotyping for nearly all genotyping applications. We utilize genotyping information derived from DG to quantify genetic relationships among accessions in the sorghum germplasm collection (n = ~40,000), for marker-assisted breeding and pedigree analysis, genetic map construction, QTL analysis, map-based gene cloning and association studies. Digital Genotyping in combination with whole genome resequencing is dramatically accelerating all aspects of genetic analysis of sorghum, an important genetic reference for C4 grass species.
Digital Genotyping was developed to aid in the genetic analysis of sorghum and other grass species possessing large repeat-rich genomes. DG technology provides a cost-effective approach to rapidly generate highly accurate genotyping data. Our sorghum genomics and breeding group has transitioned to DG for nearly all genotyping applications. Restriction enzymes used for DG template generation recognize a nested set of 4, 6 and 8 bp restriction sites, providing a flexible depth of analysis and coherent cross-referencing of information derived from different genotyping applications. Currently, we utilize genotyping information derived from DG to quantify genetic relationships among accessions in the sorghum germplasm collection (n = ~40,000), for marker-assisted breeding, pedigree and QTL analysis, genetic map construction, map-based gene cloning and association studies. DG in combination with whole genome resequencing is dramatically accelerating all aspects of genetic analysis of sorghum, an important genetic reference for C4 grass species.
A collection of 137 F6-8 recombinant inbred lines (RILs) derived by single-seed descent from an initial cross between BTx623, an elite inbred line, and IS3620C, a converted inbred line highly divergent from BTx623, was used for genetic map construction [25–27].
Sorghum seeds were geminated in Sunshine MVP growing media (Sun Gro Horticulture) in a greenhouse for seven to ten days under normal sunlight supplemented with sodium halide lights. Temperatures varied from 24°C (night) to 30°C (day). Total genomic DNA was isolated from leaf tissue from 10–12 seedlings using a FastDNA Spin kit, according to the protocol provided by the manufacturer (MP Biochemicals). Purified genomic DNA was quantitated fluorometrically using a Qubit Fluorometer (Invitrogen).
For the index adapter, additional bases were added to the 5′-end of the core Illumina Y-adapter sequence to facilitate two rounds of PCR. The PCR primers used in the initial and final amplification steps (see below) anneal to two unique regions within the adapter to reduce the potential for generation of amplification products from random sequences in the genome complementary to the PCR primers and not associated with an FseI site. A unique four base pair index sequence is located immediately downstream from the sequencing primer binding site. Twenty-four unique index sequences were designed and incorporated into separate index adapters (Additional file 4). The index sequences were designed such that the FseI site is not re-generated upon adapter ligation. A four base 3′-overhang sequence in the adapter immediately follows the index sequence to accommodate annealing to FseI digested DNA. The same basic design was used for NgoMIV index adapters except for addition of an NgoMIV-specific four base 5′-overhang placed at the end of the adapter.
Adapter and PCR oligonucleotides (Additional file 4) were synthesized by Integrated DNA Technologies. For synthesis of adapters, oligonucleotides were resuspended to a final concentration of 100 μM in annealing buffer (10 mM Tris pH 8.0, 50 mM NaCl, 1 mM EDTA). Equal volumes of a complementary pair of oligonucleotides were combined. Annealing was accomplished by first heating the combined oligonucleotide solution to 94°C for 1 min and then allowing the mixture to slowly cool to room temperature (~1 hr). Freshly annealed adapters were diluted to a 5 μM final concentration for FseI Index adapters and 25 μM for T-tailed adapters in annealing buffer. Oligonucleotides for PCR were diluted to 10 μM in 10 mM Tris pH 8.5.
After size selection, a 3′-fill-in reaction of the 5′-overhang in the adapter was carried out in a reaction containing 20 units Bst DNA polymerase, large fragment and 200 μM each dNTP, at 65°C for 30 min. The DNA was purified using a QIAquick PCR purification kit and eluted in 40 μl EB buffer. DNA ends were blunt-ended using a quick blunting kit, following the directions provided by the manufacturer (New England Biolabs). DNA was purified using a QIAquick PCR purification kit and eluted in 35 μl EB buffer. Blunt-ended DNA fragments were A-tailed in a reaction containing 10 units Klenow polymerase, 3′-5′ exo-, and 50 μM dATP at 37°C for 30 min. DNA was purified using a QIAquick PCR purification kit and eluted in 45 μl EB buffer. A T-tailed adapter was ligated to the DNA fragments by addition of 25 pmol index adapter and 3U T4 DNA ligase and incubated for 4 hours to overnight at 20°C. DNA was purified and concentrated with AMPure XP beads.
The resulting pool of DNA fragments contains a relatively small population of molecules with the two different adapters ligated to opposite ends of the genomic DNA fragments mixed with a much larger population of fragments with the second adapter ligated to both ends. To enrich for the former population of fragments, PCR was carried out (20 cycles) using Phusion DNA polymerase (New England Biolabs) with primers complementary to the two adapter sequences. The PCR primer complementary to the FseI adapter sequence contains a biotin at its 5′-end. PCR reactions were purified using the QIAquick PCR purification kit.
Biotin-containing PCR products were isolated using streptavidin conjugated magnetic beads (Dynal M-280; Invitrogen). Briefly, ~2.0 μg PCR products in QIAGEN EB buffer were mixed with an equal volume of 2x Binding and Wash buffer (1x buffer: 5 mM Tris, pH7.5, 0.5 mM EDTA, 1.0 M NaCl). DNA was allowed to bind to the beads for 20 min with gentle shaking. After binding, beads were concentrated on a magnetic stand and the supernatant was discarded. Magnetic beads were washed with 300 μl 1x Binding and Wash buffer four times, followed by three washes with distilled water. To release the DNA fragments bound to the magnetic beads, the beads were washed once in 200 μl 2x SSC (1x: 0.15 M NaCl and 0.015 M Na Citrate) and resuspended in 50 μl 2x SSC. Beads were incubated at 95°C for 5 min. After collection of beads on a magnet, the supernatant was saved. The denaturation process was repeated and the supernatants were pooled. Single-stranded DNA was purified using the QIAquick PCR purification kit.
Single-stranded DNA products were used in a final PCR step (14 cycles) with primers containing sequences complementary to the flow cell, according to the protocol provided by Illumina. Amplified products were purified using a QIAquick PCR purification kit. The final products were quantitated by UV spectroscopy and diluted to 10 nM. The template was used in cluster generation (Single Read Cluster Generation Kit, version 4) and sequencing (36-cycle sequencing kit, version 4), according to standard Illumina protocols. Single-end sequencing was carried out for 38 cycles on an Illumina Genome Analyzer GAIIx.
For experiments utilizing the restriction enzymes NgoMIV or MseI, the index adapter or T-tailed adapter, respectively, was modifed with a 5′-overhang complementary to the sequence created by the specific restriction enzyme. All other downstream processing steps were identical to those used to synthesize FseI-derived DG templates. Up to twelve individual lines were pooled for NgoMIV-derived template.
Base calling was performed using Illumina’s Real Time Analysis (RTA) software. Sequence text files were ge-nerated using GERALD in Illumina’s CASAVA v1.7 software package. The GERALD configuration file was set to trim 1 base at the 3′ end of each read since prephasing correction in CASAVA cannot be applied to the last base resulting in a slight increase in error at that position. Reads were then sorted into individual files by barcode, and filtered for 100% identity to the individual barcode plus partial restriction enzyme site. Identical reads were collapsed and read depth recorded and each sequence given a unique name that included the genotype from which it was generated and read depth using a series of custom python scripts. For genetic mapping with FseI, unique sequences from the parental lines with a read depth of 3 or higher were aligned to the sorghum genome by BLASTN analysis using a Word size of 7 and match/mismatch scores of 2 and −3, respectively. The sorghum genome sequence was downloaded from http://ftp.jgi-psf.org/pub/JGI_data/phytozome/v6.0/Sbicolor/assembly/) and BLASTN performed on a local Linux workstation. After parsing the BLAST output files from the two parental lines, the results were manually inspected to remove sequences that aligned to two or more sites within the sorghum genome at the same e-value or percent identity. The output files from the parental lines were then combined and putative polymorphisms between the two identified using a custom python script. This script identified sequences from both parents that aligned at the same FseI site within the genome and then performed pairwise comparison between the two to detect putative polymorphisms (SNPs and INDELs). To map putative polymorphisms identified between the two parents through the mapping population, a second python script was written that searched for each parental sequence from a given FseI site in each progeny line and then recorded the appropriate parental allele (A, B or H) to produce a file suitable for input into either Mapmaker/EXP ver. 3.0b  or JoinMap V4.0  for genetic map construction. The script was written to allow the user to specify the minimum read depth/sequence required to call an allele as well as the fold difference required to call a heterozygous loci in a line that contained both parental alleles at a given FseI site.
For pedigree analysis with NgoMIV template, following GERALD analysis and separation of the individual sequences into separate files based on the 4 bp barcode and partial RE site, the barcodes were trimmed from each sequence and the sequences uploaded to the CLC Genomics Workbench (CLC Bio) for sequence alignment and SNP detection. For alignment to the sorghum genome, the mismatch, insertion and deletion costs were set to 3 and sequences that matched more than one location identically were ignored. Using these alignment parameters, ~90% of the sequences generated from BTx623 aligned to the BTx623 sequenced genome. Following alignment, the CLC SNP Detection tool was used to identify potential SNPs in each genotype. Parameters for SNP detection included: a minimum read coverage at the potential SNP of 6, window length of 9, minimum average quality score of 15 and minimum central quality (i.e. quality of the SNP base) of 20. Once each genotype was processed using the SNP Detection Tool within the CLC Genomics Workbench, the individual files were exported in csv format and custom python scripts were used to combine the results and reformat the data for input into downstream analysis software (i.e. PowerMarker, STRUCTURE, FlapJack).
The DG genetic linkage map was constructed using genotypes assigned by analysis of markers from 137 RILs derived from the BTx623 x IS3620C RI population [25–27]. Initial marker order was predicted based on alignment of DG marker sequences to the reference genome sequence . Recombination frequencies of DG markers were determined using Mapmaker/EXP ver. 3.0b . The command ‘map’ was used to calculate genetic distance between markers, using the Kosambi mapping function. When two or more DG markers mapped to identical locations, all but one of the markers were removed prior to the next step in mapping. The order of the remaining DG markers was confirmed using ‘order’ and ‘ripple’ functions. DG markers with LOD scores >3.0 were retained in the final DG genetic map.
QTL analysis was carried out using Windows QTL Cartographer version 2.5 . Composite Interval mapping (model 6) was used for mapping QTLs. Threshold significance levels were determined by permutation analysis (1,000 permutations). Height measurements from Hart and coworkers  were used.
The sequences generated in this study have been deposited in the NCBI Sequence Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra) under accession number [NCBI: SRX207965].
The authors thank Dr. Eun-Gyu No for cluster generation and operation of the Illumina GAIIx sequencer and Susan R. Hall for expert technical assistance. This research was supported by the Perry L. Adkisson Chair in Agricultural Biology, USDA-NIFA award 67009–21507, and Ceres Inc.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.