Integrating dilution-based sequencing and population genotypes for single individual haplotyping

Background Haplotype information is useful for many genetic analyses and haplotypes are usually inferred using computational approaches. Among such approaches, the importance of single individual haplotyping (SIH), which infers individual haplotypes from sequence fragments, has been increasing with the advent of novel sequencing techniques, such as dilution-based sequencing. These techniques could produce virtual long read fragments by separating DNA fragments into multiple low-concentration aliquots, sequencing and mapping each aliquot, and merging clustered short reads. Although these experimental techniques are sophisticated, they have the problem of producing chimeric fragments whose left and right parts match different chromosomes. In our previous research, we found that chimeric fragments significantly decrease the accuracy of SIH. Although chimeric fragments can be removed by using haplotypes which are determined from pedigree genotypes, pedigree genotypes are generally not available. The length of reads cluster and heterozygous calls were also used to detect chimeric fragments. Although some chimeric fragments will be removed with these features, considerable number of chimeric fragments will be undetected because of the dispersion of the length and the absence of SNPs in the overlapped regions. For these reasons, a general method to detect and remove chimeric fragments is needed. Results In this paper, we propose a general method to detect chimeric fragments. The basis of our method is that a chimeric fragment would correspond to an artificial recombinant haplotype and would differ from biological haplotypes. To detect differences from biological haplotypes, we integrated statistical phasing, which is a haplotype inference approach from population genotypes, into our method. We applied our method to two datasets and detected chimeric fragments with high AUC. AUC values of our method are higher than those of just using cluster length and heterozygous calls. We then used multiple SIH algorithm to compare the accuracy of SIH before and after removing the chimeric fragment candidates. The accuracy of assembled haplotypes increased significantly after removing chimeric fragment candidates. Conclusions Our method is useful for detecting chimeric fragments and improving SIH accuracy. The Ruby script is available at https://sites.google.com/site/hmatsu1226/software/csp. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-733) contains supplementary material, which is available to authorized users.


Cluster length and heterozygous calls 1.Evaluation of heterozygous calls in a reads cluster
To detect CF by using the heterozygous calls in a reads cluster, we defined three measurements to evaluate the heterozygosity of SNP fragment f i .
Firstly, we defined the total number of reads which cover minority allele (total heterozygosity) as follows: ∑ j∈X(fi) min(n(r i,j = 0), n(r i,j = 1)) , where n(r i,j = 0) and n(r i,j = 1) are the number of reads, which are contained in a reads cluster which corresponds to a SNP fragment f i and whose base at j-th locus are major allele and minor allele, respectively. Secondary, we defined maximum of the rate of the minority allele (maximum heterozogosity) as follows: max j∈X(fi) min(n(r i,j = 0), n(r i,j = 1)) n(r i,j = 0) + n(r i,j = 1) .
With these measurements, we detected CFs candidates by selection the fragments whose values are larger than a threshold.   Figure 1 shows the ROC curves of total heterozygosity, maximum heterozygosity, and average heterozygosity. In Kaper's data, ROC curves stops at sensitivity is around 0.7. This is because there are many CFs which do not show heterozygous, and this could be caused when the coverage is low and only one origin of reads which derived from the same haplotype exist. In Duitama's data, the ROC curve of maximum heterozygosity and averaged heterozygosity are below that of total heterozygosity. This is because maximum and average heterozygosity overestimate the effect of sequencing error. Therefore, we concluded that total heterozygosity is appropriate to evaluate heterozygosity in a reads cluster. Figure 2 shows the distribution of the length of reads clusters for each dataset. The length of reads cluster which correspond to CFs tend to be larger because reads with different long DNA fragments origins are merged into one reads cluster. Although the cluster length of CFs tend to be larger than that of NFs, there are considerable overlapping between NFs and CFs, especially in the Kaper's data. Because the sliding window width would affect the result, we examined the impact of sliding window width (W ) on accuracy and running time. We used SNP fragments of chromosome 1 from Kaper's data for the AUC calculation, and used 100 randomly generated SNP fragments of size 30 for the running time calculation. Figure 3 shows the AUC values and running times for W =3, 5, 7, 9. AUC increases roughly in line with the increase of W . This is because the difference between haplotypes becomes clearer when we consider more SNPs. However, difference between AUC values for W =3 and W =5 is larger than that for W =5 and W =7, which suggests that AUC would roughly saturate for low W . Running time also increases with increasing W . This is because the possible haplotypes and combinations of haplotypes increase exponentially as W increases. In view of these accuracy and running time results, we use W =5 as the default setting.

Effect of error rate α
We included an error term in CSP to represent sequencing and PHASE errors. To examine the effect of the error rate parameter α, we calculated AUC values for various values of α. We used chromosome 1 from Kaper's data and Duitama's data for the AUC calculation. Table 1 shows the AUC values for each α. The AUC for α = 0.0 is lowest because CSP with α = 0.0 cannot deal with the inconsistency between inferred haplotypes and the context of a fragment which is caused by the sequencing and PHASE errors. The AUC values for 0.001 ≤ α ≤ 0.1 are almost equal. These results suggest that including α in CSP is important but the absolute value of α is unimportant. Based on these results, we use α=0.01 as the default value.

Effect of the number of individual genotypes
The accuracy of PHASE should increase with the number of individual genotypes. To examine the effect of changing the number of individual genotypes, we calculated the AUC of CF detection using chromosome 1 from Kaper's data and selecting N =5, 10, 20, 40, 60 individuals randomly from 60 unrelated individuals in the CEU population. We ran PHASE for randomly selected genotypes and the NA12878 genotype, and calculated AUC using the result of PHASE (Figure 4). AUC increases with the number of individuals. However, the rate

Recovering SNP fragments from CF candidates
CSP might regard NFs as CF candidates when NFs differ from population haplotypes because of rare variants or spontaneous recombination. As CFs are generated because an aliquot occasionally contains multiple DNA fragments which cover the same region, CFs would be distributed randomly. Therefore, if there are many CF candidates which cover the same region, they would be misidentified NFs. Because some CFs remain with only the threshold coverage, we removed fragments using a SIH-based measure. The detailed process is as follows: 1. Calculate the coverage of CF candidates for each heterozygous site.
2. Exclude sites whose coverage is lower than 3 and recover the SNP fragments which correspond to the remaining sites (P1).

Run MixSIH for recovered SNP fragments.
4. Calculate the chimerity-like measure 'SIH-chimerity' 5. Remove the fragments which satisfy SIH-chimerity ≥ 2 ln(α 0 /(1 − α 0 )) (P2). Table 2 shows the numbers of all fragments, NF, and CF before and after recovery. The numbers of all fragments are larger than sums of NFs and CFs because trio-based haplotyping is partial and the chimerity of fragments which cover unphased regions cannot be calculated. The rates of CF for Kaper's data are 61.2%, 2.5%, and 2.1%, and the rates of NF for Duitama's data are 31.3%, 24.5%, and 24.2%. For both of datasets, the rates of CF decrease and we successfully recover NFs from CF candidates with high precision. The recovered fragments rates are 4.4% (235/5,375) and 16.1% (2,692/16,715) for Kaper's data and Duitama's data, respectively. The rate of recovered fragments for Duitama's data is larger than that for Kaper's data because the coverage of Duitama's data is higher than that of Kaper's data. High coverage might result in a larger CF rate in recovered fragments for Duitama's data.
In summary, NFs could be recovered from the CFs candidates by using the coverage information and SIH based chimerity. The coverage threshold should be determined according to the purpose of the analysis because there is a tradeoff between sensitivity and specificity.

Calculation of SNP fragment error rate
The SNP fragment error rate was calculated by comparison with the results of trio-based haplotyping. Because we were interested in the SNP fragment errors which were caused by sequencing and mapping errors, and CFs might disrupt the error rate calculation, we used only SNP fragments whose chimerity was under 2 ln(α 0 /(1 − α 0 )) for the calculation. The SNP fragment error rate is where X (f i ) is the set of sites which are covered by f i and whose phases are determined by trio-based haplotyping, |X (f i )| is the number of sites in X (f i ), and and 0 otherwise.

The number of NFs and CFs of Duitama's SNP fragments
The number of NFs and CFs of Duitama's SNP fragments are 245,772 and 8,247, respectively, while the number of NFs and CFs of our processed Duitama's data are 384,857 and 6,381, respectively ( Table 3). The number of NFs of Duitama's SNP fragments is lower than that of our data. This difference could be caused by the mapping tools, the reads cluster detection algorithm, and the filtering step. We used bfast for mapping SOLiD reads instead of BioScope which was used by Duitama et al. because the original bfast paper suggested that bfast has robustness against the sequence variants, and BioScope was not easily available. Concerning that the number of CFs of our data is lower than that of Duitama's SNP fragments, our processing method turns out to be more strict processing method. Some reads clusters will be divided into smaller reads clusters with the strict processing method, and this results in the increase of the number of NFs.
The SIH accuracy was shown to decrease with the presence of CFs. Therefore, our processing method which generates less CFs will be better than Duitama's processing method in terms of SIH accuracy.

SIH accuracy of Duitama's SNP fragments after removing suspicious CFs by using CSP
The SNP fragments data, in which long reads cluster and heterozygous calls are already filtered, is open by Duitama's group and we examined the pairwise accuracies of original Duitama's SNP fragments and processed Duitama's SNP fragments, in which fragments with CSP > 7 are removed ( Figure 5 (A)). For comparison, the pairwise accuracies our processed Duitama's data that are already shown in the main text are shown again ( Figure 5 (B)  removed. The precision of MixSIH increased from 0.875 to 0.925 at (CP+IP) = 1.4 × 10 8 . The precision of other algorithm increased likewise. Thus, CSP is an efficient measure to detect the CFs which are undetected with cluster length and heterozygous calls, and useful for improving SIH accuracy. In addition, (CP+IP) for Duitama's SNP fragments is larger than that for our processed Duitama's data, while the precision of each algorithm for Duitama's SNP fragments are lower than those for our data. These differences are caused by the difference of the processing methods (as discussed in the above section). With the strict processing method, the length of SNP fragments become smaller owing to the division of the reads cluster, and hence the length of assembled haplotypes is smaller. On the other hand, the strict processing method generates less CFs and the precision of assembled haplotypes increase.

Precision of MixSIH and PHASE
The precision of MixSIH was calculated as follows.
1. Select 10,000 regions in chromosome 1 randomly such that each region has five SNP sites and the haplotypes of the regions are determined by trio-based haplotyping. 3. Calculate the precision for MC value, which is defined by where mc is the target MC value, and CP mc and IP mc are the number of consistent pairs and inconsistent pairs in the regions for which MC value satisfy mc ≤ MC < mc + 0.5. The precision for each ln(1.001 − max P ), where max P is the maximum PHASE probability, was calculated as follows.