Next-generation sequencing has been a powerful tool for deletion identification . A variety of computational algorithms have been developed to use NGS sequence data to search for deletions. Notably, these methods attempt to collect comprehensive details about genomic structural variants. For example, a recent study surveyed structural changes (ranging in size from single base pairs to several Mbp) in two personal genomes using the de novo assembly of short reads . However, comprehensive methods require high sequence coverage (>30× genome size), which drives up costs, requires a large amount of data storage space, necessitates long analysis time and creates heavy computational demands. In select studies such as evolutionary genomics, it is not necessary to achieve comprehensiveness; instead, a limited amount of information is sufficient. In recent years, several restriction-based NGS methods have been developed to sequence partial genomes [16–19]. Most of these methods aim for SNP discovery, not the detection of structural changes. We modified the method of Chen et al. by simplifying its experimental procedures and developing a computational program. In this study, we showed that sequencing both ends of the restriction fragments generated by a medium-frequency enzyme can be an accurate method for identifying medium-sized deletions, even with sequence coverage as low as one-fold genome size.
The deletion resolution can be controlled by selecting restriction enzymes with different cutting frequencies depending on the research objectives. The selected restriction enzyme determines what target regions will be sequenced, as well as the length distribution of the restriction fragments (Figure 3B). In this study, very low sequencing coverage (3 Gb or 1.0× human genome size) could be concentrated within the tag regions to reach a sufficient depth for deletion identification. The high rate of overlap between the two separate datasets used in our study, both of which had low genomic coverage, demonstrates the reproducibility of this method (Additional file 2: Figure S2).
The number of detected deletions can also be adjusted by the coverage. Library 1 contained 0.21× paired reads and detected 51 deletions, whereas Library 2 had 0.79× read coverage and detected 150 deletions (Table 1). Importantly, most of the deletions found with Library 1 were also identified with Library 2 (Additional file 2: Figure S2B).
In addition to high flexibility and efficiency, this method also displayed high accuracy. The use of in silico PCR significantly increased the specificity of the detected deletions by eliminating the noisy sequences that were produced by experimental errors, such as randomly broken fragment ends, star activity of the restriction enzyme, sequencing errors and false mapping.
The population of target deletions can be fixed once the restriction enzyme is determined, and the size of the deletion population can be adjusted by selecting different enzymes and coverage according specific needs (Figure 1
A). The flexible choice of the fixed target enabled comparative genomic studies on a subset of deletions across different samples because these deletions were randomly distributed across the genomes (Additional file 3
: Table S2) and could be accessed repetitively without heavy sequencing input. Thus, our method is applicable to a variety of fields, including:
Detecting the deletions across multiple genomes, especially for the species with large, difficult-to-sequence genomes in population or comparative genomic studies. For example, a recent survey of the structural variants in an individual gorilla genome required a 60 Gb sequence input . At this scale, our method can examine the genetic diversity of deletions in a population of 5–20 gorillas. Although the Genome STRiP can also examine deletions in multiple large genomes, as it did with 1000 Genome data , it cannot deal with single genome data nor identify singletons from pooled data as we did in this study.
Detecting the deletions in paired samples. For example, rapid identification of residual alleles of cancer cells which usually exist in trace amount in circulating DNA . By sequencing the ditags of original tumor DNA and normal DNA in the same individual, several somatically-acquired, tumor specific deletions could be identified. PCR primers could be designed based on these deletions to amplify the tumor DNA specifically.
Massive validation of deletions found by other comprehensive methods or massive genotyping of known deletions.