In this study, we compared SNP detection rates from resequencing of pooled samples from two genetically divergent chicken lines with allele frequency estimates from 60 K SNP genotyping of individual samples. Using a SNP detection threshold of three non-reference reads for each pool, we found approximately 3.7 SNPs/kb in the resequencing data. This is lower than the roughly 5 SNPs/kb found in previous chicken comparisons. Furthermore, fewer than 50% of the 60 K SNPs that showed variation within or between the chicken lines were detected in the resequencing data. The reason for this low detection rate is the limited depth of coverage (about 5X) combined with a low non-reference allele frequency among the undetected SNPs. On the other hand, when there is a large allele frequency difference between lines, one line will have a high frequency of a non-reference allele, which increases the probability of reaching the SNP detection threshold of three non-reference reads in resequencing (Figure 1). In many studies, the most interesting SNPs will be those with a large allele frequency difference, or fixation for alternative alleles, in the studied populations, as these are putative causative mutations or represent opposite selective sweeps. Here we have shown, however, that resequencing of pools of individuals at approximately 5X depth of coverage failed to identify 33% of the SNPs that were fixed for alternative alleles in the two lines according to the SNP chip data (Figure 1). Thus, a higher depth of coverage in genome resequencing, and additional resequencing of targeted regions, should be considered to increase detection sensitivity.
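The link between coverage, allele frequency and the three-read threshold can be illustrated with a simple calculation. The following is a minimal sketch, not the analysis used in this study: it assumes that the number of non-reference reads at a site follows a Poisson distribution with mean equal to the mean depth times the non-reference allele frequency in the pool, and it ignores mapping errors, uneven coverage and sampling of individuals. The function name and parameters are hypothetical.

```python
import math

def detection_probability(mean_depth, allele_freq, min_reads=3):
    """Probability of observing at least `min_reads` non-reference reads.

    Assumption (not from the study): non-reference read counts are
    Poisson distributed with mean = mean_depth * allele_freq.
    """
    lam = mean_depth * allele_freq
    # P(X < min_reads) for X ~ Poisson(lam)
    p_below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below

if __name__ == "__main__":
    for freq in (0.1, 0.25, 0.5, 0.75, 1.0):
        p = detection_probability(5, freq)
        print(f"allele frequency {freq:.2f}: detection probability {p:.3f}")
```

Under these idealized assumptions, even an allele fixed in one pool is missed in a noticeable fraction of sites at 5X, and the detection probability drops quickly as the non-reference allele frequency decreases. The observed miss rate can be higher than this idealized value because real coverage is more variable than the Poisson assumption allows.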
We have proposed a new measure for identifying informative SNPs from low-coverage resequencing data, which takes into account variation in flanking SNPs (FSV). We have also demonstrated a positive correlation between FSV and the difference in allele frequency between lines estimated from the 60 K SNP chip data. The correlation was strongest on the macrochromosomes when FSV was calculated with a 62 kb interval. For the microchromosomes, the optimal interval size was considerably shorter (38 kb) and the correlation weaker, probably due to a higher recombination rate and/or a higher mutation rate [11, 12]. However, despite a presumed higher mutation rate on microchromosomes, the overall density of SNPs detected in resequencing was approximately 5% lower on the microchromosomes than on the macrochromosomes. This is probably due to the considerably higher gene density on the microchromosomes and, consequently, a larger proportion of conserved sequence under purifying selection. The optimal FSV interval size, that is, the size giving the highest correlation between FSV and SNP variation between lines, may approximately correspond to the average LD block size in these populations.
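The choice of an optimal interval size can be sketched as a simple search over candidate sizes. The example below assumes that FSV values have already been computed for each candidate interval size at the same set of chip SNPs, and it uses the Pearson correlation coefficient, which is an assumption about the type of correlation. The function names and the toy numbers in the usage note are hypothetical and are not data from this study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def best_interval(fsv_by_interval, freq_diff):
    """Return the interval size whose FSV values correlate best with
    the allele frequency differences from the SNP chip.

    fsv_by_interval: dict mapping interval size (kb) -> list of FSV
                     values, one per chip SNP (precomputed elsewhere).
    freq_diff:       list of allele frequency differences between
                     lines, in the same SNP order.
    """
    return max(fsv_by_interval,
               key=lambda size: pearson(fsv_by_interval[size], freq_diff))
```

As a usage example with made-up values, `best_interval({20: [0.5, 0.2, 0.6, 0.4], 60: [0.1, 0.3, 0.6, 0.9]}, [0.1, 0.4, 0.7, 1.0])` selects the 60 kb interval, because its FSV values track the frequency differences most closely. Running the search separately for macrochromosomes and microchromosomes would mirror the comparison described above.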
No correlation was observed between FSV and the difference in allele frequency between lines when that difference was lower than 0.4. A plausible explanation is that, when the difference in allele frequencies is small, FSV is dominated by random sampling effects among the few sequencing reads covering each SNP. For example, if both lines have a high frequency of a non-reference allele, there will be many cases where only one of the lines reaches the SNP detection threshold, whereas the other line will be assumed to be fixed for the RJF allele. This would incorrectly increase the FSV and lower the correlation with the true SNP variation between lines. In other words, most random sampling effects among sequencing reads from a pool of DNA samples would increase the FSV and thereby reduce the correlation with SNP variation between lines if that variation is small. Another possible explanation is that a large difference in allele frequencies between the two lines at a biallelic locus entails less sensitivity to such random sampling effects, because each line has one predominant allele that is relatively rare or absent in the other line.
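The sampling effect described here can be quantified with a rough calculation. The sketch below is an illustration under stated assumptions rather than the method of this study: non-reference read counts in each pool are modelled as independent Poisson variables with mean equal to mean depth times allele frequency, and a SNP is called in a pool when at least three non-reference reads are seen. All names are hypothetical.

```python
import math

def detection_probability(mean_depth, allele_freq, min_reads=3):
    """P(at least `min_reads` non-reference reads), assuming Poisson counts."""
    lam = mean_depth * allele_freq
    p_below = sum(math.exp(-lam) * lam**k / math.factorial(k)
                  for k in range(min_reads))
    return 1.0 - p_below

def discordant_call_probability(mean_depth, freq_line1, freq_line2, min_reads=3):
    """Probability that the SNP is called in exactly one of the two pools,
    assuming the two pools are sequenced independently."""
    q1 = detection_probability(mean_depth, freq_line1, min_reads)
    q2 = detection_probability(mean_depth, freq_line2, min_reads)
    return q1 * (1.0 - q2) + q2 * (1.0 - q1)

if __name__ == "__main__":
    # Both lines carry the non-reference allele at the same frequency,
    # so the true allele frequency difference is zero.
    for freq in (0.3, 0.5, 0.7):
        p = discordant_call_probability(5, freq, freq)
        print(f"frequency {freq:.1f} in both lines: "
              f"called in exactly one pool with probability {p:.3f}")
```

Under this simplified model, a SNP with identical intermediate allele frequency in both lines is called in only one pool in a large share of cases at 5X, which would make the lines appear more different than they are.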
In summary, our results show that the depth of coverage and the pooling of samples limit SNP detection sensitivity in resequencing. The described measure, however, makes better use of the available sequence information than approaches using only individual SNP positions, and can thereby provide a valuable indication of the SNP variation between resequenced populations, even when the depth of coverage is quite low and pooled samples are sequenced. This could, for example, facilitate the selection of the most informative SNPs for a genotyping panel among millions of putative candidates detected in resequencing. In selecting SNPs, it is also important to consider how many reads covered the SNP position and the individual SNP score indicating the fraction of reads in agreement with the reference. When there is a very large number of SNPs to choose from, it would be possible to stringently select SNPs with a high FSV, a large number of covering reads and a high proportion of non-reference reads.
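Such a stringent selection could be sketched as a simple filter over candidate SNPs. The record layout, field names and thresholds below are hypothetical and chosen only for illustration; in particular, the read counts are assumed to refer to the pool in which the SNP was detected, and suitable cut-offs would depend on the data set.

```python
from dataclasses import dataclass

@dataclass
class CandidateSNP:
    name: str          # hypothetical identifier
    fsv: float         # flanking SNP variation score
    depth: int         # number of reads covering the position
    nonref_reads: int  # number of those reads carrying the non-reference allele

def select_snps(candidates, min_fsv, min_depth, min_nonref_fraction):
    """Keep candidates passing all three criteria, ordered by decreasing FSV."""
    selected = []
    for snp in candidates:
        if snp.depth == 0:
            continue  # no reads, so the non-reference fraction is undefined
        nonref_fraction = snp.nonref_reads / snp.depth
        if (snp.fsv >= min_fsv
                and snp.depth >= min_depth
                and nonref_fraction >= min_nonref_fraction):
            selected.append(snp)
    return sorted(selected, key=lambda snp: snp.fsv, reverse=True)

if __name__ == "__main__":
    # Made-up example records, not data from the study.
    candidates = [
        CandidateSNP("snp_a", fsv=0.9, depth=8, nonref_reads=8),
        CandidateSNP("snp_b", fsv=0.9, depth=3, nonref_reads=3),
        CandidateSNP("snp_c", fsv=0.2, depth=9, nonref_reads=9),
        CandidateSNP("snp_d", fsv=0.7, depth=10, nonref_reads=4),
    ]
    for snp in select_snps(candidates, min_fsv=0.6, min_depth=6,
                           min_nonref_fraction=0.8):
        print(snp.name)
```

With these example thresholds only `snp_a` passes: `snp_b` has too few covering reads, `snp_c` has a low FSV, and `snp_d` has too small a proportion of non-reference reads.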