Cost-effective genome-wide estimation of allele frequencies from pooled DNA in Atlantic salmon (Salmo salar L.)

Background New sequencing technologies have tremendously increased the number of known molecular markers (single nucleotide polymorphisms; SNPs) in a variety of species. Concurrently, improvements to genotyping technology have now made it possible to efficiently genotype large numbers of genome-wide distributed SNPs enabling genome wide association studies (GWAS). However, genotyping significant numbers of individuals with large number of SNPs remains prohibitively expensive for many research groups. A possible solution to this problem is to determine allele frequencies from pooled DNA samples, such ‘allelotyping’ has been presented as a cost-effective alternative to individual genotyping and has become popular in human GWAS. In this article we have tested the effectiveness of DNA pooling to obtain accurate allele frequency estimates for Atlantic salmon (Salmo salar L.) populations using an Illumina SNP-chip. Results In total, 56 Atlantic salmon DNA pools from 14 populations were analyzed on an Atlantic salmon SNP-chip containing probes for 5568 SNP markers, 3928 of which were bi-allelic. We developed an efficient quality control filter which enables exclusion of loci showing high error rate and minor allele frequency (MAF) close to zero. After applying multiple quality control filters we obtained allele frequency estimates for 3631 bi-allelic loci. We observed high concordance (r > 0.99) between allele frequency estimates derived from individual genotyping and DNA pools. Our results also indicate that even relatively small DNA pools (35 individuals) can provide accurate allele frequency estimates for a given sample. Conclusions Despite of higher level of variation associated with array replicates compared to pool construction, we suggest that both sources of variation should be taken into account. This study demonstrates that DNA pooling allows fast and high-throughput determination of allele frequencies in Atlantic salmon enabling cost-efficient identification of informative markers for discrimination of populations at various geographical scales, as well as identification of loci controlling ecologically and economically important traits.

Despite the recent technical advances, genotyping large numbers of individuals with thousands of SNPs remains prohibitively expensive for many research groups. Furthermore, many population genetic studies are based on population allele frequency rather than individual genotype data. Therefore, determination of allele frequencies from pooled DNA samples, i.e. 'allelotyping' , has been suggested more than 30 years ago as a cost-effective alternative to individual genotyping (reviewed by Sham et al. [16]). Several studies have successfully used this approach in genomewide association studies that compare the allele frequencies between cases and controls e.g. [17][18][19][20][21][22][23]. These studies have demonstrated satisfactory accuracy and repeatability, and the DNA pooling approach can reduce costs by as much as 100-fold depending on the number of samples [16,21,23].
While the allelotyping of DNA pools can substantially reduce the costs compared to individual sample by sample genotyping, this approach is not without disadvantages. First, various sources of error occur during the allele frequency estimation from DNA pools. According to Earp et al. [23], variation introduced to allele frequency estimates can be divided into four categories: (i) within array; (ii) between arrays; (iii) between independently constructed identical pools, and (iv) between pools constructed from different individuals of the same population (biological replicates). Therefore, in order to obtain reliable allele frequency estimates using DNA pooling it is important to evaluate the magnitude and relative importance of different sources of error [23,24]. In addition, DNA pooling generally does not provide information about haplotype frequency and despite recent computational improvements [25,26] resolving the phase ambiguity remains a challenge for large number of loci [27]. However, despite the popularity of DNA pooling in genetic association studies, only few studies to date have utilized allelotyping approach to characterize inter-population variation e.g. [28].
Here, we tested the usefulness of DNA pooling for a first time using an Atlantic salmon (Salmo salar L.) Illumina SNP-chip to obtain accurate allele frequency estimates for multiple Atlantic salmon populations and evaluated the importance of different sources of errors arising from allelotyping. First, we assessed the effect of DNA pool construction and between-array variations on allele frequency estimates. Subsequently, the effect of cluster separation scores (parameter that summarizes the separation of three genotype classes in the theta dimension), two alternative sources of theta (a value between 0 and 1 which defines the genotype; 0 = AA, 1 = BB, 0.5 = AB) and DNA pool size on allele frequency estimation were evaluated. Finally, two alternative quality control (QC) filters were tested to select optimal sets of SNP loci for subsequent population genetic analysis.

Results and discussion
In total, 56 Atlantic salmon DNA pools from 14 populations were analyzed using an Atlantic salmon SNP-chip [29,30] carrying probes for 5568 SNP markers 3928 of which were bi-allelic. After excluding 1640 non biallelic markers and 31 bi-allelic loci due to low call rate (< 95%) (see Additional file 1, Figure S1a) the repeatability of allelotyping from DNA pools was tested for 3897 loci.

Array-vs. pool-construction variation
The experimental design described in Table 1 provided 56 estimates of array-variation and 52 of pool-construction variation in the theta value. The mean array-variation per SNP varied from theta 0.000 to 0.089, whereas the mean pool-construction variation of theta ranged from 0.000 to 0.069. The estimated variation of theta between different arrays (i.e. array-variation using identical DNA pools) was 20% higher compared to variation arising from DNA pool construction (median array = 0.012 vs. median pool-construction = 0.010, non-parametric Mann-Whitney U-test, P < 0.0001) ( Figure 1). These results suggest that it is more important to consider variation arising from different arrays than variation associated with pool construction [22][23][24]31]. This is in line with the earlier studies suggesting that running the same DNA pool in multiple arrays should be preferred over construction and analysis of multiple DNA pools within the same array [18,22,32]. However, considering the relatively similar levels of variation associated with the array and pool replicates, future studies should incorporate both sources of variation in the experimental design for reliable estimation of allele frequencies from DNA pools.

Estimation of allele frequencies from DNA pools
The allele frequencies for 3631 SNPs that passed the quality control (see below) were estimated from DNA pools using reference values of theta provided by CIGENE and reference values of theta derived from the genotyping of 106 individuals used in pool construction. Comparison of the two sets of theta values revealed a small, but significant, difference in allele frequency estimates. Using individual genotypes from this study to derive reference values of theta provided slightly higher accuracy in allele frequency estimates compared to the larger (n = 300) but unrelated dataset provided by CIGENE (median error 106 = 0.020 -0.023 vs. median error CIGENE = 0.025 -0.028; Mann-Whitney U-test, all tests, P < 0.0001; Figure 2). Errors associated with allele frequency estimations using reference values of theta from two different sources were significantly correlated (Pearson's r = 0.640 -0.684, P < 0.0001) suggesting that small number of SNPs suffer from larger error irrespective of the source of reference values of theta while the majority of loci have relatively low error rates. Taken together, these results suggest that even relatively small number of individuals (~100) is sufficient to generate reliable reference values of theta. However, because all three genotype classes are needed for accurate estimation of allele frequencies, using relatively small number of individuals resulted in loss of SNPs as not all genotypes were observed in the reference datasets (3631 vs. 3138 SNPs based on CIGENE and our data, respectively).
We observed very high concordance between allele frequency estimates derived from DNA pools and from individual genotyping (Pearson's r = 0.991 -0.992, all tests, P < 0.0001, Figure 3). This demonstrates the accuracy of the DNA pooling approach in Atlantic salmon and is consistent with earlier studies in other species using Illumina bead-array platform. For example, high correlation between allele frequency estimates derived from individual genotyping and DNA pools have been observed in humans (Pearson's r = 0.969) and cattle (Pearson's r = 0.992 -0.994) [18,33]. The number of individuals in the DNA pool had only a minor effect on the allele frequency estimation ( Figure 3) as the error between true and estimated allele frequencies was small and similar for all three pool sizes (median error = 0.023 -0.025, Figure 2). Therefore, our results suggest that it is possible to obtain accurate allele frequency estimates using DNA pools consisting of relatively small number of individuals (n ≥ 35). However, larger pool sizes should be always preferred over small ones as small number of individuals may not be representative of the whole population.

Quality control
One of the important parameters for accurate determination of genotypes and subsequent allelotyping is cluster separation score that quantifies the discrimination between genotype clusters for particular SNP (see Additional file 1: Figure S1b, c, d). Since the heterozygous cluster can be indistinguishable from one or both homozygous clusters for SNP with low cluster separation score, exclusion of loci demonstrating low cluster separation scores has been often applied [34,35]. To date, most of the studies have used a cluster separation score cut-off <0. 35 to exclude low quality SNPs e.g. [36,37].  3897 markers for subsequent analysis. As expected, the error in allele frequency estimates of SNPs having cluster separation score < 0.4 was higher compared to SNPs with cluster separation score > 0.4 (Mann-Whitney U test, both for array and pool replicates, P < 0.0001) (see Additional file 1: Figure S2a, b). Moreover, the correlation between allele frequency estimates derived from three DNA pools and from individual genotyping for SNPs demonstrating low cluster separation scores (< 0.4) was lower than for markers with cluster separation scores > 0.4 (Pearson's r = 0.960 -0.969 vs. Pearson's r = 0.991 -0.992). In addition, the estimated variation of theta was negatively correlated with the cluster separation score both for array (Pearson's r = − 0.346, P < 0.0001) and pool construction (Pearson's r = − 0.246, P < 0.0001) replicates (see Additional file 1: Figure S3a, b). While application of QC filter based on cluster separation excludes SNPs having low quality genotypes, it is not able to remove all loci showing relatively high variation in allele frequency estimates (see Additional 1: Figure S3a, b). Therefore, application of additional QC filters, e.g. based on comparisons between 'true' and estimated allele frequencies or based on combination of variation in allele frequency estimates and heterozygosity have been suggested e.g. [28,36,37].
Here, we tested two alternative QC filters (uniform and spherical cut-off ) that use heterozygosity and variation in allele frequency estimates (Figure 4). This resulted selection of 2879 vs. 2880 loci for uniform and spherical cut-off, respectively ( Table 2). Majority of loci (2777) that passed both filters were the same ( Figure 4). However, spherical filtering is expected to be more useful than uniform cut-off as it retains larger proportion of polymorphic loci with mean allele frequency 0.2 -0.8 across   populations, while uniform filter increases the proportion of less variable loci (Table 2, Figure 4, Additional 1: Figure S4). Therefore, for identification of reliable and informative SNPs, application of spherical filter is preferable over uniform since it effectively excludes loci with relatively high error rate compared to the information content.

Conclusions
This study tested the effectiveness of DNA pooling to obtain accurate allele frequency estimates for large number of Atlantic salmon populations using an Illumina SNP-chip. We demonstrated that pooled DNA approach provides a reliable, accurate and cost-effective means for obtaining genome-wide allele frequency estimates for multiple populations. We proposed a novel quality control filter based on spherical cut-off which enables efficient exclusion of loci showing high error rate and minor allele frequency close to zero. Our results indicate that even relatively small DNA pools (35 individuals) provide accurate allele frequency estimates for a given sample. Despite of higher levels of variation associated with array replicates compared to pool construction we suggest that both sources of variation should be taken into account. Taken together, this study demonstrates that DNA pooling allows fast and high-throughput determination of allele frequencies in Atlantic salmon enabling cost-efficient identification of informative markers for discrimination of salmon populations at various geographical scales, as well as identification of loci controlling ecologically and economically important traits. Moreover, the main findings of our study based on Atlantic salmon SNPchip were in line with those observed for human SNPchips, and thus the technical approaches described herein are encouraging for employing allelotyping approach in other species using Illumina SNP-chips or other SNP genotyping systems and arrays.

DNA samples
In total, 927 Atlantic salmon individuals representing 19 populations from Northern Europe were used for individual genotyping and/or construction of DNA pools (Table 1). Tissue samples (fin clips) were collected from juveniles during 2006 -2010 and preserved in ethanol. Total genomic DNA was extracted according to Elphinstone et al. [38] or using Qiagen DNeasy 96 Blood & Tissue kits (Qiagen™) following manufacturer's recommendations.

Quality control of DNA extracts
Prior to pool construction, quality control of individual DNA extracts was performed in two steps. First, samples were examined for degradation by visual inspection on 1% agarose gels. Samples containing low molecular weight DNA (indicative of degradation) were excluded from further analysis. Each extract was then tested for contamination (the presence of DNA from multiple individuals) by screening individual samples using 18 microsatellite loci [39]; V. Wennevik, unpublished data] and only non-contaminated Atlantic salmon samples were selected for further analysis.

Construction of DNA pools and SNP genotyping
In total, 56 DNA pools were constructed using individuals from 14 Atlantic salmon populations ( Table 1). The adjustment of DNA concentration was carried out in two steps. The initial concentration of DNA samples was first adjusted to 20 ng/μl, measured in duplicate with the NanoDrop™ 1000 (Thermo Scientific) and subsequently diluted to 10 ng/ul. Individual DNA samples were pooled (50 ng per individual) and subsequently concentrated using a DNA concentrator Eppendorf 5301. The final concentration of the pools was adjusted to 50 ng/μl. Constructed DNA pools were analyzed using an Atlantic salmon Illumina SNP-chip [29,30] at the Centre for Integrative Genetics (CIGENE), Norway. In addition, 106 salmon samples used in pool construction were genotyped individually to guide cluster positioning and to obtain the 'true' allele frequency for each locus for the population from the River Kola (Table 1).

Quality control
Genotyping of the 106 individual samples was performed using Genotyping module v. 1.9.4 (Genome Studio software v. 2011.1, Illumina Inc.), only those samples with > 97% call rates were included when calculating 'true' allele frequencies. SNPs with call rates < 95% (i.e. the proportion of individual samples successfully genotyped in a locus) were eliminated from the data set. Thresholds for quality control (QC) filtering were determined as in Murray et al. [37] and for estimation of allele frequencies from DNA pools, SNPs with cluster separation scores ≤ 0.4 were excluded.

Estimation of allele frequencies in a pooled DNA samples
In Illumina genotyping, the genotype is assigned after converting raw color signal data into a theta value which ranges from 0 to 1 and reflects the relative signal contribution for the 2 alternate alleles. In theory, an individual homozygous for the B allele would have a theta value close to 1, an individual homozygous for the A allele a value close to 0 and a value of 0.5 would indicate a heterozygous genotype. However, in reality a SNP's theta for genotype clusters (AA, AB and BB) may vary from 0, 0.5 and 1, therefore for estimation of allele frequency in a pooled sample, the theta value for each SNP is compared to the mean theta values for AA, AB and BB genotypes calculated by genotyping individual samples, i.e. the allele frequency of the DNA pool can be derived by applying correction algorithms from comparing pool-specific value of theta with the reference values of theta from individual genotyping data e.g. [40,41].
To obtain the allele frequency estimate for allele B in the pool B pool Sample position of each pool along the axis of normalized theta values were compared to the reference values of AA, AB and BB genotype cluster positions for each SNP (reference values of theta) as in Janicki & Liu [41].
The following equations were applied [41]: θ pool is the sample position and θ AA , θ AB , θ BB are means of the cluster positions of the corresponding reference genotypes along the axis of normalized theta values. The frequency of allele A was calculated as Reference values for AA, AB and BB genotype positions along the axis of normalized theta values were obtained from individual genotyping of 300 Atlantic salmon specimens genotyped in previous studies by CIGENE. As this data did not include samples from all the populations used to construct the DNA pools, the mean cluster position values were also derived from the genotype classes of 106 individuals originating from 8 populations across the study area (Rivers: Alta, Laukhelle, Iesjoki, Kola, Varzuga, Onega, Pechora Unya and Narva). For subsequent analyses, however, reference values of theta provided by CIGENE were used.
The accuracy of allele frequency estimates was quantified as an absolute difference between allele frequencies derived from individual genotypes (referred to as 'true') and allele frequencies estimated from DNA pools from the River Kola population (35,50 and 70 individuals per pool).

Estimation of array-and pool-construction variation
To estimate the within-pool variation of theta, replicates of the same DNA pool were run on different arrays (array replicates, as in Earp et al. [23]) (Table 1). To assess the variation in theta values introduced by pool construction, independently constructed pools consisting the same DNA extracts were run on same array (pool construction replicates, as in Earp et al. [23]) (Table 1). To evaluate the effect of number of individuals in the DNA pool on allele frequency estimation, DNA pools with varying number of individual DNA extracts were constructed ( Table 1).
Variation of theta within a SNP locus was estimated similar to Macgregor [31]. The array-variation was calculated as the mean difference of all possible pair-wise comparisons of theta values among technical replicates of the same pool allelotyped on different arrays. The pool-construction variation was calculated as the mean difference of all possible pair-wise comparisons of theta values among technical replicates of the independently constructed DNA pools containing same individuals allelotyped on the same array.

Additional file
Additional file 1: Figure S1. Example of SNP loci failed to pass QC: a) call rate < 95%; b) cluster separation < 0.40; and SNP loci met QC requirements: c) cluster separation = 0.41 and d) call rate 100%, cluster separation = 1.00. Figure S2. Box-plot showing estimated variation of theta in two sets of SNPs with cluster separation score < 0.4 and > 0.4 for (a) array and (b) pool construction replicates (both tests, Mann-Whitney U test, P < 0.0001). Horizontal line, grey square, whiskers, open circles, and stars indicate median, 25th and 75th quartiles, non-outlier range, outliers and extreme outliers, respectively. Figure S3. A significant negative correlation between (a) array-(Pearson's r = − 0.346, P < 0.0001) and (b) pool-construction (Pearson's r = − 0.246, P < 0.0001) variation and cluster separation scores. Figure S4. Proportion of loci remained in each allele frequency class after application of (a) uniform and (b) spherical filter.