A unified approach for allele frequency estimation, SNP detection and association studies based on pooled sequencing data using EM algorithms
 Quan Chen^{1} and
 Fengzhu Sun^{1, 2}Email author
https://doi.org/10.1186/1471216414S1S1
© Chen and Sun; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Abstract
Background
Genomewide association studies (GWAS) have identified many common polymorphisms associated with complex traits. However, these associated common variants explain only a small fraction of the phenotypic variances, leaving a substantial portion of genetic heritability unexplained. As a result, searches for "missing" heritability are drawing increasing attention, particularly for rare variant studies that often require a large sample size and, thus, extensive sequencing effort. Although the development of next generation sequencing (NGS) technologies has made it possible to sequence a large number of reads economically and efficiently, it is still often cost prohibitive to sequence thousands of individuals that are generally required for association studies. A more efficient and costeffective design would involve pooling the genetic materials of multiple individuals together and then sequencing the pools, instead of the individuals. This pooled sequencing approach has improved the plausibility of association studies for rare variants, while, at the same time, posed a great challenge to the pooled sequencing data analysis, essentially because individual sample identity is lost, and NGS sequencing errors could be hard to distinguish from low frequency alleles.
Results
A unified approach for estimating minor allele frequency, SNP calling and association studies based on pooled sequencing data using an expectation maximization (EM) algorithm is developed in this paper. This approach makes it possible to study the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and the sequencing depth on the estimation accuracy of minor allele frequencies. We show that the naive method of estimating minor allele frequencies by taking the fraction of observed minor alleles can be significantly biased, especially for rare variants. In contrast, our EM approach can give an unbiased estimate of the minor allele frequency under all scenarios studied in this paper. A SNP calling approach, EMSNP, for pooled sequencing data based on the EM algorithm is then developed and compared with another recent SNP calling method, SNVer. We show that EMSNP outperforms SNVer in terms of the fraction of dbSNPs among the called SNPs, as well as transition/transversion (Ti/Tv) ratio. Finally, the EM approach is used to study the association between variants and type I diabetes.
Conclusions
The EMbased approach for the analysis of pooled sequencing data can accurately estimate minor allele frequencies, call SNPs, and find associations between variants and complex traits. This approach is especially useful for studies involving rare variants.
Introduction
Finding genomic variants associated with complex traits is one of the most important problems in modern genomics. Genomewide association studies (GWAS) based on common variants have been the dominant approach to achieve this objective [1]. However, the genomic variants identified in GWAS studies often explain only a small portion of the phenotypic variation related to heritable human diseases, a phenomenon known as "missing heritability" in genomics literature [2]. This missing heritability problem has led to increasingly skeptical views of the common diseasecommon variant (CDCV) hypothesis which predicts that common diseasecausing alleles, or variants, will be found in all human populations that manifest a given disease. On the other hand, interest in studies on rare variants with minor allele frequencies less than 1% is growing [3, 4].
Studies of rare variants are complicated by the low minor allele frequencies of rare variants. The development of next generation sequencing (NGS) technologies such as Illumina and Roche 454 has made it possible to sequence a large number of reads economically. Despite such important progress, sequencing a large number of individuals separately is still costly for most biological laboratories. One frequently adopted approach to reduce sequencing cost in the search of rare variants is pooled sequencing, where mixtures of genetic materials from several individuals are grouped together to form a pool for a single sequencing. While this design greatly lowers the sequencing cost, it also makes it hard to distinguish true genetic polymorphisms from sequencing errors, estimate minor allele frequencies at the polymorphic loci, and perform association studies on the rare variants.
Several research groups have used pooled sequencing to detect rare variants that are associated with complex traits such as retinitis pigmentosa, diabetes, cancer, and inflammatory bowel disease [5–8]. There are generally two types of pooling designs. One is pooling of tagged samples with each individual tagged by a unique short barcode. In this design, the genomic origins of the reads can be identified. However, barcoding many individuals and distinguishing these barcodes from each other can still be a challenging task. The second type of pooling is to mix the genetic materials from different individuals without tagging, and then generate reads from the mixture of genetic materials using NGS. With this design, the identities of the individuals from whom the reads originate cannot be identified. In this paper, we concentrate on the second type of pooling design.
Several groups have developed SNP calling methods based on pooled sequencing data [7, 9–12]. Out et al.[7] modeled the number of sequencing errors as a Poisson random variable and identified SNPs by comparing the number of minor alleles within the reads with the Poisson distribution. For rare variants with minor allele frequencies similar to or lower than the sequencing error rate, this approach could miss many true variants if the pool size is relatively large. Druley et al.[9] developed a SNP identification method, SNPSeeker, that can be applied to large pools by using control sequences without SNPs. In many studies, control sequences may not be available, making this approach impractical. Also, the program can only be used to analyze Illumina data. Bansal et al.[10] developed a method called CRISP to identify rare variants by comparing the minor allele frequencies across multiple pools using contingency tables. It was shown that CRISP outperforms SNPSeeker in terms of accuracy, but CRISP is more computationally demanding. Altmann et al.[13] improved the computational speed of CRISP and identified SNPs as the variants with different minor allele frequencies across at least two pools. Wei et al.[11] developed a statistical tool, called SNVer, for variant identification. For each pool, SNVer first defined a pvalue by testing the hypothesis that the minor allele frequency is above a given threshold and then combined the pvalues for individual pools to give an overall pvalue using Simes methods as in [12]. This algorithm makes it possible to rank the observed variants so that the top ranked ones are more likely to be true SNPs. However, the algorithm needs a prespecified sequencing error rate which can be difficult to do because the sequencing error rates can be position dependent. In the above studies, the investigators are mainly concerned with the detection of SNPs; they do not aim to estimate minor allele frequencies.
In order to estimate minor allele frequencies in pooling studies, several groups developed statistical models for the sampling of individuals and the sampling of reads from the individuals in the pools [14, 15]. These studies assumed a predefined constant sequencing error rate across different loci. However, sequencing error rates can vary for different loci depending on the nucleotide contents of the surrounding genomic regions. The effects of misspecifying the sequencing error rate on minor allele frequency estimation, SNP detection and power of association studies are not clear. Using similar models, Lee et al. [16] studied an optimal design in pooling studies. This involved the number of individuals in each pool and the number of pools. Recently, Chen et al.[17] considered more complex issues such as uneven sampling of individuals, different coverage of the minor and major alleles due to either PCR amplification or reads mapping, and reads quality scores.
In this paper, we develop new methods for estimating minor allele frequencies, SNP detection, and association studies using pooled sequencing data based on the models in [14–16]. In contrast to methods developed in previous studies, we do not assume that the sequencing error rate is constant. Instead, we estimate the sequencing error rate for each position together with the minor allele frequency based on the minor and major allele counts for all the pools using an expectationmaximization (EM) algorithm. We show that the naive estimation of the allele frequency by the fraction of minor alleles in the reads can be significantly inflated, especially for rare variants, while the EM approach can give an unbiased estimate of the minor allele frequency in all situations we studied. The estimation accuracy of the EM algorithm increases with the number of reads and the number of pools, but decreases with the number of chromosomes in each pool. Based on the allele frequency estimation, we develop a SNP calling method, EMSNP, and an association test using likelihood ratio statistics. The likelihood ratio statistic used in EMSNP is then used to rank candidate polymorphic loci to determine if they are true polymorphisms. Using a real resequencing dataset, we show that, for rare variants with minor allele frequencies lower than 1%, the fraction of dbSNPs (http://www.ncbi.nlm.nih.gov/projects/SNP/) among the SNPs called by EMSNP is higher than that of SNVer. Similarly, the transition/transversion ratio of rare variants called by EMSNP is higher than that of SNVer. These observations show that EMSNP outperforms SNVer at calling rare variants with minor allele frequencies less than 1%.
Materials and methods
Notation
Assume that a total of G pools of individuals are sequenced and each pool contains K/ 2 individuals (K chromosomes). For each pool g, a total of n_{ g } reads are mapped to this locus, with ${n}_{{0}_{g}}$ reads carrying the major allele and ${n}_{{1}_{g}}$ reads carrying the minor allele. Thus ${n}_{g}={{n}_{0}}_{{}_{g}}+{n}_{{1}_{g}}$.
Given the above likelihood expression and the data $\left\{({n}_{{0}_{g}},{n}_{{1}_{g}}),g=1,2,\cdot \cdot \cdot ,G\right\}$, our objectives are as follows

Find the maximum likelihood estimate of (f, α, β).

Determine whether an observed variant is a true SNP or not, i.e. SNP calling.

Find the variants associated with a phenotype of interest.
Computational methods
An expectationmaximization (EM) approach for allele frequency estimation
Based on the likelihood function, an approximate solution to the maximum likelihood estimation of the parameters can be obtained using the EM algorithm. We consider the following missing data:

C_{ g }: the number of chromosomes carrying the minor allele in the gth pool;

I_{ gi }: the true underlying allele state (major (0) or minor (1)) of read i in the gth pool;

r_{ gi }: the observed allele state (major (0) and minor (1)) of the ith read in the gth pool.
Note the expectation E_{(t)}is taken when the parameters are at Θ^{(t)}.
where all the parameters in the equations are of the values taken at the tth step.
From Equations 3 and 4, we are able to obtain the recursive formula for f.
Similarly, we can derive the formulas for E_{(t)}(T_{10}Data), E_{(t)}(T_{01}Data) and E_{(t)}(T_{00}Data), and the recursive formulas for α and β can be derived from them.
SNP identification using EM
Due to sequencing errors, the observed variants may contain a significant amount of false positives, i.e. loci that are not truly polymorphic. Thus, before testing for associations with phenotypes, we need to determine the true polymorphic sites. This step is especially important in the case of rare variants since the sequencing error rates for NGS could be close to or even higher than the minor allele frequencies.
where l_{f} is the maximum loglikelihood of the observed data for both the cases and the controls. Note that the null hypothesis f = 0 is on the boundary of the region of the parameters of interest. Therefore, the asymptotic distribution of Λ is $\frac{1}{4}{I}_{0}+\frac{1}{2}{\chi}_{1}^{2}+\frac{1}{4}{\chi}_{2}^{2}$ when the number of pools is large according to [18], where I_{0} is the point mass at 0 and ${\chi}_{i}^{2}$, i = 1, 2 are the chisquare distributions with i degrees of freedom. When the number of pools is relatively small, simulation approaches for the null distribution of Λ are needed to obtain the asymptotic distribution.
to test each hypothesis, where ${l}_{{f}_{\mathsf{\text{1}}}}$ and ${l}_{{f}_{\mathsf{\text{0}}}}$ are the maximum loglikelihood of the data for the cases and controls, respectively. Because the null hypothesis f_{ i } = 0 is on the boundary of parameter region f_{ i } > 0, the statistic Λ_{ i } has an asymptotic distribution $\frac{1}{2}{I}_{0}+\frac{1}{2}{\chi}_{1}^{2}$ when the number of pools is large according to [18]. We refer to the above method for SNP identification as EMSNP.
Testing for associations between a SNP and a phenotype in casecontrol studies
This statistic has an asymptotic chisquare distribution with 1 degree of freedom.
Simulation studies
We use simulations to evaluate our approaches for allele frequency estimation, SNP detection and test for association. A large range of parameter space is considered to see how different parameters affect the performance of our methods. These parameters include minor allele frequency (f), sequencing error rate (α), the number of chromosomes in each pool (K), the number of pools (G) and the relative risk for a disease (λ).
Pooled data generation
In our simulations, we set α = β and choose four starting values of α = β = 0.05%, 0.1%, 0.5%, 1% corresponding to different sequencing error rates ranging from low to high. The sequencing error rate for current NGS technologies is around 1% and we expect that it will decrease as the technologies improve. Therefore, we also consider much lower sequencing error rate in our studies. For the allele frequency, we choose four values f = 0.1%, 0.5%, 1%, 5%. Loci with minor allele frequencies above 5% are considered to be common. We want to study if it is possible to estimate the minor allele frequency even if it is lower than the sequencing error rate. We let the read number n = 1000 and 3000, which is around the sequencing depth in [8]. To study the effect of the number of chromosomes, we let K = 50, 100, 200. The number of pools is set at G = 10, 20, 50.
Since the sequencing error rate can vary from locus to locus and from one pool to another, we generate 1000 α_{ i }(= β_{ i }, i = 1, · · ·, 1000) values from a normal distribution with mean equal to the starting values of α, and variance equal to 0.1 times the starting values. Finally, we generate 1000 pooled data sets with each combination of the five parameters (K, G, n, f, α) based on α_{ i }(= β_{ i }), i = 1, ⋯, 1000.
Measuring the accuracy of the allele frequency estimation
 1.In the ith simulation, we use the EM algorithm derived above to estimate the parameters (f, α, β), denoted as $\left({\hat{f}}_{\mathsf{\text{em}}}^{\left(i\right)},\hat{{\alpha}_{i}},\hat{{\beta}_{i}}\right)$. We also consider a naive estimate for the minor allele frequency as the fraction of observed minor alleles in the observed reads,${\hat{f}}_{\mathsf{\text{avg}}}^{\left(i\right)}=\frac{{\sum}_{g=1}^{G}{n}_{{1}_{g}}}{{\sum}_{g=1}^{G}{n}_{g}}.$(9)
 2.
Repeat Step 1) for R = 1000 times.
 3.Compute the mean squared error (MSE) of ${\widehat{f}}_{\mathsf{\text{em}}}$ and ${\widehat{f}}_{\mathsf{\text{avg}}}$ from the true population minor allele frequency f,$\begin{array}{c}MSE\left({\widehat{f}}_{\mathsf{\text{em}}}\right)=\frac{1}{R}\sum _{i=1}^{R}{\left({\hat{f}}_{\mathsf{\text{em}}}^{\left(i\right)}f\right)}^{2},\\ MSE\left({\widehat{f}}_{\mathsf{\text{avg}}}\right)=\frac{1}{R}\sum _{i=1}^{R}{\left({\hat{f}}_{\mathsf{\text{avg}}}^{\left(i\right)}f\right)}^{2}.\end{array}$
 4.Compute the MSE of ${\widehat{f}}_{\mathsf{\text{em}}}$ and ${\widehat{f}}_{\mathsf{\text{avg}}}$ from the fraction of chromosomes carrying minor alleles f_{frac} = C/(KG) in the pools,$\begin{array}{c}Cg\left({\widehat{f}}_{\mathsf{\text{em}}}\right)=\frac{1}{R}\sum _{i=1}^{R}{\left({\hat{f}}_{\mathsf{\text{em}}}^{\left(i\right)}{f}_{\mathsf{\text{frac}}}\right)}^{2},\\ Cg\left({\widehat{f}}_{\mathsf{\text{avg}}}\right)=\frac{1}{R}\sum _{i=1}^{R}{\left({\hat{f}}_{\mathsf{\text{avg}}}^{\left(i\right)}{f}_{\mathsf{\text{frac}}}\right)}^{2}.\end{array}$
We use both MSE and Cg to compare the accuracy of the EM algorithm with the naive approach of estimating f by the fraction of reads carrying the minor allele.
Generating casecontrol data to study the power of SNP identification and association studies using EM
In order to evaluate the power of SNP identification using EMSNP and test for association, we simulate casecontrol data as follows. When generating the control data, we assume that the minor allele frequency is f and that the locus is under HardyWeinberg equilibrium. When generating the genotypes of the case individuals, we assume that the penetrance (the probability an individual is affected) of the genotypes 00, 01 and 11 are g_{0} = 0.01, λg_{0}, and λ^{2}g_{0}, respectively. In our simulations, we choose λ = 1.2, 2, and 4.
We can use the case or control samples separately or combine them for SNP detection as in the "SNP identification using EM" subsection. For example, we consider both the cases and controls jointly. The loglikelihood ratio statistic Λ (or Λ_{ i } if we use case or control samples separately) defined in equation 6 is used to test if an observed variant is a true polymorphic locus or not. For a given type I error γ (= 0.05 in our study), we claim that the variant is truly polymorphic if Λ >t_{ γ }, where t_{ γ } is the threshold corresponding to type I error γ. If the threshold can not be found theoretically, we can do parametric bootstrap to find the threshold. Firstly, assume that the variant site is not polymorphic, estimate the allele frequency and error rate using the maximum likelihood approach. Secondly, generate the reads data as in our model a large number of times (R = 1000), and obtain the empirical distribution of Λ. For a given type I error γ, we find the upper γ percentile t_{ γ }. Finally, the null hypothesis is rejected if the value of Λ is at least t_{ γ }. For a given relative risk λ, we repeat these steps 1000 times and the power is the fraction of times that the locus is called as polymorphic.
Similar approaches can be used to study the power of association studies using the pooling design. For details, see additional file 1.
A pooled sequencing data set related to type 1 diabetes [8]
We use our method to study the pooled sequencing data related to type 1 Diabetes dataset (T1D) in [8] and compare the results with current methods for SNP identifiction [11]. The data was generated using DNA samples of 480 T1D patients and 480 healthy controls, arranged in 20 DNA pools, with 48 patients/controls in each pool. Roche 454 sequencing was used to sequence 144 target regions that cover exons and splice sites of 10 candidate genes. We use MOSAIK (http://bioinformatics.bc.edu/marthlab/Mosaik) to map the reads to the human reference genome (hg19) with parameters hs 15 p 12 mmp 0.05 act 26  mhp 100 bw 51 as recommended in its documentation. MOSAIK is a widely used referenceguided assembler that hashes the whole reference genome and locate information in the hash table using a 'jump database' [19–21]. Then we use SAMtools (http://samtools.sourceforge.net/) [22] to pileup the reads onto the target regions. We also remove indels and keep loci that are covered by at least one read in each pool. Finally, we use ANNOVAR [23] to annotate the identified SNPs.
Results
We first present our results on the effects of various parameters on the estimation accuracy of the minor allele frequency using the EM algorithm. We then present the results on the power of SNP detection and association studies. Finally, we present our results on the analysis of the data in [8] using the approaches in the "Materials and Methods" section.
The effects of minor allele frequency, sequencing error rate, number of individuals in the pools and number of pools on the accuracy of allele frequency estimation
Comparison of ${\widehat{\mathit{f}}}_{\mathbf{em}}$ and ${\widehat{\mathit{f}}}_{\mathbf{avg}}$ in terms of mean squared error
α= 0.05%  α= 0.1%  α= 0.5%  α= 1%  

MSE  Cg  MSE  Cg  MSE  Cg  MSE  Cg  
f = 0.1%  0  0  0  0  0  0  0  0 
f = 0.5%  9  5  4  3  0  0  0  0 
f = 1%  13  10  9  7  0  0  0  0 
f = 5%  17  16  17  16  12  12  7  7 
The relative errors of ${\widehat{\mathit{f}}}_{\mathbf{avg}}$ and ${\widehat{\mathit{f}}}_{\mathbf{em}}$ in estimating minor allele frequency f
Comparison of ${\widehat{\mathit{f}}}_{\mathbf{em}}$ and ${\widehat{\mathit{f}}}_{\mathbf{avg}}$ in terms of average relative error
α= 0.05%  α= 0.1%  α= 0.5%  α= 1%  

f  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{avg}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{em}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{avg}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{em}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{avg}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{em}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{avg}}}$  $\mathbf{R}{\mathbf{E}}_{{\widehat{\mathit{f}}}_{\mathbf{em}}}$ 
0.1%  52.0  9.4  102.0  15.6  502.0  72.0  1000.0  146.0 
0.5%  10.3  4.0  20.2  4.9  99.5  13.3  199.0  26.5 
1%  5.0  3.3  10.0  3.7  49.2  5.7  98.3  9.5 
5%  1.0  5.4  1.9  6.0  9.1  6.7  18.1  6.3 
Next we present our results for the effects of (K, G, n, f, α) on the estimation accuracy of minor allele frequency f using the EM algorithm. We note that the estimation of β = f(0 1) is highly unreliable (data not shown). This phenomenon can be explained as follows. When minor allele frequency f is low, the expected number of chromosomes having the minor allele in each pool is also low. When the number of pools G is small, the estimation of β can be difficult with a small number of chromosomes carrying the minor allele. Thus, we do not show detailed results on the estimation of β. Despite the fact that β cannot be reliably estimated, the other two parameters f and α can be reliably estimated using the EM approach.
The effects of minor allele frequency f and sequencing error rate α on the estimation accuracy of ${\widehat{\mathit{f}}}_{\mathbf{em}}$
${\widehat{\mathit{f}}}_{\mathbf{em}}$ as an estimator of f or f_{frac}
α= 0.05%  α= 0.1%  α= 0.5%  α= 1%  

f  MSE  Cg  MSE  Cg  MSE  Cg  MSE  Cg 
0.1%  9.8e7  4.3e8  9.7e7  1.2e7  3.2e6  2.8e6  1.1e5  1.1e5 
0.5%  5.5e6  1.9e7  5.4e6  1.9e7  6.7e6  1.5e6  1.0e5  5.8e6 
1%  1.2e5  8.6e7  1.2e5  9.6e7  1.3e5  2.3e6  1.6e5  5.6e6 
5%  5.5e5  1.3e5  5.9e5  1.7e5  8.3e5  4.3e5  1.0e4  7.0e5 
To reduce the effect of a few outliers of ${\widehat{f}}_{\mathsf{\text{em}}}$ on the MSE and Cg calculation, we also modified the definitions of MSE and Cg by removing the top and bottom κ% of its values and recalculate the values of MSE and Cg. The results on the modified measures are presented in additional file 1 and the qualitative results on the performance of ${\widehat{f}}_{\mathsf{\text{em}}}$ continue to hold.
We also studied the effects of (K, G, n) on the estimation accuracy of ${\widehat{f}}_{\mathsf{\text{em}}}$ and the details of the simulation results are given in additional file 1. It was observed that the accuracy increases with G and n as expected. However, the accuracy decreases with the number of individuals K in each pool.
Results on the power of SNP calling using the likelihood ratio test
It can be seen from Figure 2 that at a type I error rate of 0.05, the power is consistently well above 0.9 in all scenarios demonstrated here except for the extremely rare variant case of f = 0.1%. The power tends to increase with the minor allele frequency or the number of pools, while it decreases with sequencing error rate or number of individuals in each pool. The power also increases with the number of reads in each pool. We observe that the power using the case samples is slightly higher than that using the control samples. This observation can be explained by the fact that the frequency of the minor allele in the cases is higher than that in the controls, resulting in higher power of SNP detection.
Results on the power of association studies using the likelihood ratio test
It can be seen from Figure 3 that at a type I error rate of 0.05, the power increases with λ and approaches 1 as λ goes up to 4, which happens in all scenarios demonstrated here except for the extremely rare variant scenario of f = 0.1%. The power increases with allele frequency, pool size or number of pools, while it seems robust with respect to changes in sequencing error rate.
Results on the analysis of the type 1 diabetes data in [8]
Allele frequency estimation and SNP calling in the control samples
We apply our approaches to analyze the pooled sequencing data in [8]. First, we conduct SNP calling using both our method EMSNP and SNVer (parameter setting bq 20 a 0 f 0 p 1 t 0) [11], a program that has been shown to outperform several other programs for SNP calling including CRISP, SAMtools, and GATK. Unlike many previous programs calling variants as SNPs or not, SNVer ranks variants according to their potential of being true SNPs using the pvalue defined in the program. As a likelihood ratio test, EMSNP can also rank the variants by the magnitude of the likelihood ratio. The estimated allele frequency spectrum of the top 100 called SNPs by either EMSNP or SNVer is given in Figure S9 of the additional file 1. Note that 5 variants have dominant nonreference alleles and they are excluded from both lists for a fair comparison. In the SNVer list, we also exclude the variants that are removed in the preprocessing stage of EMSNP. Both frequency spectrums of the variants called by EMSNP and SNVer tend to concentrate on the lower frequency range.
Evaluation of the SNP calling results
In terms of the dbSNP ratio for the top 100 called variants, EMSNP consistently outperforms SNVer under all allele frequency thresholds, and EMSNP displays significant superiority especially for low frequency variants. In Table S7 of the additional file 1, we give an example of the dbSNP ratio among the top 100 SNP calls with f_{em} ≤ 0.2% for the two methods. Using a total of 480 control samples, EMSNP identifies variants with minor allele frequency less than 0.2% with a high dbSNP ratio, which serves as an evidence of its superior performance in rare variant scenarios. On the other hand, the upper left panel of Figure S10 in the additional file 1 shows that the relative performance of EMSNP and SNVer based on dbSNP ratio depends on minor allele frequency. EMSNP detects more rare variants and has higher dbSNP ratio at minor allele frequency less than 1%. Whereas this relative performance of EMSNP and SNVer is reversed for minor allele frequency above 1%. Thus, EMSNP is most useful in detecting rare variants.
Another criterion to evaluate SNP calls is the transitiontransversion (Ti/Tv) ratio. It is well known that transitions are much more frequent than transversions in evolution, and the number of transitions over the number of transversions, referred to as Ti/Tv ratio, in known SNPs is expected to be between 2 and 4 [25]. Figure 4 shows that EMSNP gives a consistently higher Ti/Tv ratio throughout the entire allele frequency range for both the whole set of called variants and the novel set. For the known variants, the Ti/Tv ratio trends of the two methods are similar to each other. Table S8 in the additional file 1 gives an example of the Ti/Tv ratio among the top 100 SNP calls with f_{EM} < 0.2% by EMSNP and SNVer. The effect of minor allele frequency on the relative performance of EMSNP and SNVer in terms of Ti/Tv ratio is similar to that in terms of dbSNP ratio (Figure S10 in the additional file 1).
We also consider the top 150 ranked SNPs and the corresponding figures and tables are shown as Figure S11S12 and Tables S7S8 in the additional file 1. The qualitative conclusions are the same.
Identifying SNPs associated with type 1 diabetes
Association results
SNP  Gene  ${\widehat{\mathit{f}}}_{\mathbf{0}}$  n _{0}  ${\widehat{\mathit{f}}}_{\mathbf{1}}$  n _{1}  Fisher's pvalue  EM pvalue 

rs3184504  SH2B3  0.52  499  0.41  394  1.9e6  8.4e7 
rs7076103  IL2RA  0.19  178  0.10  93  4.5e8  2.7e7 
rs2476601  PTPN22  0.09  86  0.16  151  8.1e6  9.2e6 
The SNP rs3184504 residing within gene SH2B3 has an EM pvalue of 8.4e  7. This SNP was also found to be associated with a preliminary pvalue of 5e  7 in the original study [8] and the corresponding gene was previously identified to be associated with T1D. The fractions of observed minor alleles in controls and cases in the original study were 53% and 41.5%, respectively, and our mapping results are consistent with the estimates. The EM estimated minor allele frequencies in the controls and cases are 52% and 41%, which are slightly smaller than the observed values since we took sequencing errors into account. The SNP rs7076103 within the gene IL2RA has a preliminary Fisher's pvalue of 4.5e8 and an EM pvalue of 2.7e07. This SNP was not reported to be associated with any phenotypes according to the catalog of GWAS studies (http://www.genome.gov/gwastudies/) to date. Nevertheless, other SNPs within the IL2RA gene were found to be associated with T1D by Cooper et al. [26] before the publication of [8] and by two other studies [27, 28] after the publication of [8] using different approaches. Barrett et al. [27] used genomewide association studies giving a pvalue of 1.0e13 while Huang et al. [28] used imputation of the genotypes based on the 1000 genome projects yielding a pvalue of 5e9. These new studies support our significant result on the association of rs7076103 with T1D. The estimated minor allele frequencies in the controls and cases were 18.8% and 14.8% in [8] giving a pvalue of 0.02 which is not significant after adjusting for multiple testing. This example shows that even for common polymorphisms, the EM approach can help to find likely associations that the naive approach can not. The SNP rs2476601 was found to be associated with T1D in several studies before the publication of [8] and was confirmed in a recent study in [29] that was published after the publication of [8]. All these studies support our results for the association of common polymorphisms with T1D.
For rare polymorphisms $\left({\widehat{f}}_{0}<\mathsf{\text{1}}\%\right)$ within the controls, we first use the naive approach described above to obtain a preliminary Fisher's pvalue for every SNP. Due to the low minor allele frequencies of the rare variants, none of the pvalues is smaller than 0.001. We did not pursue the association of individual variants with T1D further.
Discussion
In this paper, we developed an EM algorithm based unified approach for minor allele frequency estimation, SNP calling and association studies, applicable to pooled sequencing data where genetic materials of multiple individuals are pooled together. This study differs from previous studies in that we estimate sequencing error rate for each position while previous studies generally assume a prespecified sequencing error rate across all sequenced regions. Since sequencing error rate depends on the genomic context, it is essential that the sequencing error rate be estimated specifically for different loci. In a pooling design without tagging, the origin of the reads is not known, and it is impossible to obtain the individual genotypes from the pooled data. Therefore, we modelled the pooled sequencing data as a "missing value" problem and designed an EM algorithm to estimate the minor allele frequency and sequencing error rate.
We first studied the effects of minor allele frequency, sequencing error rate, number of pools, number of individuals in each pool, and the sequencing depth in each pool, on the estimation accuracy of the minor allele frequency. It was shown that the naive approach, which estimates the minor allele frequency by the fraction of observed minor alleles in the reads, can significantly overestimate the true minor allele frequency, and that the effect is most severe for rare variants. The EM based algorithm, on the other hand, can estimate the minor allele frequency in a relatively unbiased manner. Although the variation of this estimation seems to be relatively large, a major part of the variation comes from the sampling of individuals from the population rather than the algorithm itself. We also show that the estimation accuracy of the EM algorithm increases with the number of pools and sequence depth as expected. However, the estimation accuracy decreases with the number of individuals in each pool, most likely because a more extensive pooling induces greater loss of information. Secondly, we used a likelihood ratio statistic based on the estimated parameters from EM to call SNPs. With the real data from [8], in terms of the dbSNP ratio, we showed that EMSNP outperforms SNVer for rare variants with minor allele frequency less than 1%. We also showed that the transition/transversion ratio of the called SNPs for rare variants based on EMSNP is higher than that of the called SNPs by SNVer. These two independent pieces of evidence demonstrate that EMSNP is superior to SNVer in the discovery of rare variants. However, the extent of this advantage decreases as minor allele frequency increases due to the tradeoff between EMSNP's bias adjustment for the estimation of minor allele frequencies and extra variation introduced in the EM algorithm. Finally, we applied our approach to reanalyze the casecontrol data from [8] and showed that we can find the associated common SNPs. Unfortunately we did not find any significantly associated rare variants. One possible explanation is that the power of finding rare variants associated with complex traits is generally low as a consequence of the low frequencies of minor alleles.
We made several simplifying assumptions in our study. First and foremost, we did not consider errors introduced by mapping the reads to the reference genome. The mapping of Roche 454 data still has many challenges, in particular, in regions around homopolymers, and further development of algorithms for mapping is needed. Secondly, although we assumed that the amount of genetic materials from each individual is the same for each pool, this assumption can be violated. To overcome this problem, one approach would assume that the fractions of genetic materials from individuals follow a Dirichlet distribution [17]. Thirdly, the called SNPs by EMSNP still have many false positives since the Ti/Tv ratio for the called novel SNPs is still low compared to the known SNPs. Further improvements in SNP calling are needed. Finally, the computational speed of the EM based approach can be relatively slow, and the method cannot be applied to whole genome association studies although this is not a problem for targeted sequencing studies as in [8]. These are the topics for future research.
Software
Software can be downloaded from http://wwwrcf.usc.edu/~fsun/Programs/EMSNP/EMSNP.html.
Declarations
The publication costs for this article were funded by US NIH 1 U01 HL108634.
This article has been published as part of BMC Genomics Volume 14 Supplement 1, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/14/S1.
Declarations
Acknowledgements
This research was supported by National Institutes of Health (P50HG002790 and 1 U01 HL108634). Q Chen was partially supported by the Viterbi Fellowship. F.S. is also supported by National Natural Science Foundation of China (60928007 and 60805010) and Tsinghua National Laboratory for Information Science and Technology (TNLIST) Crossdiscipline Foundation.
Authors’ Affiliations
References
 Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM: Common SNPs explain a large proportion of the heritability for human height. Nature Genetics. 2010, 42: 565569. 10.1038/ng.608.PubMed CentralView ArticlePubMedGoogle Scholar
 Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TFC, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461: 747753. 10.1038/nature08494.PubMed CentralView ArticlePubMedGoogle Scholar
 Nelson MR, Wegmann D, Ehm MG, Kessner D, Jean PS, Verzilli C, Shen J, Tang Z, Bacanu SA, Fraser D, Warren L, Aponte J, Zawistowski M, Liu X, Zhang H, Zhang Y, Li J, Li Y, Li L, Woollard P, Topp S, Hall MD, Nangle K, Wang J, Abecasis G, Cardon LR, Zöllner S, Whittaker JC, Chissoe SL, Novembre J, Mooser V: An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012, 337: 100104. 10.1126/science.1217876.PubMed CentralView ArticlePubMedGoogle Scholar
 Tennessen JA, Bigham AW, O'Connor TD, Fu W, Kenny EE, Gravel S, McGee S, Do R, Liu X, Jun G, Kang HM, Jordan D, Leal SM, Gabriel S, Rieder MJ, Abecasis G, Altshuler D, Nickerson DA, Boerwinkle E, Sunyaev S, Bustamante CD, Bamshad MJ, Akey JM: Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science. 2012, 337: 6469. 10.1126/science.1219240.PubMed CentralView ArticlePubMedGoogle Scholar
 Benaglio P, McGee TL, Capelli LP, Harper S, Berson EL, Rivolta C: Next generation sequencing of pooled samples reveals new SNRNP200 mutations associated with retinitis pigmentosa. Human Mutation. 2011, 32: E22462258. 10.1002/humu.21485.View ArticlePubMedGoogle Scholar
 Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, Boucher G, Ripke S, Ellinghaus D, Burtt N, Fennell T, Kirby A, Latiano A, Goyette P, Green T, Halfvarson J, Haritunians T, Korn JM, Kuruvilla F, Lagacé C, Neale B, Lo KS, Schumm P, Törkvist L, Dubinsky MC, Brant SR, Silverberg MS, Duerr RH, Altshuler D, Gabriel S, Lettre G, Franke A, D'Amato M, McGovern DPB, Cho JH, Rioux JD, Xavier RJ, Daly MJ: Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nature Genetics. 2011, 43: 10661073. 10.1038/ng.952.PubMed CentralView ArticlePubMedGoogle Scholar
 Out AA, van Minderhout IJHM, Goeman JJ, Ariyurek Y, Ossowski S, Schneeberger K, Weigel D, van Galen M, Taschner PEM, Tops CMJ, Breuning MH, van Ommen GJB, den Dunnen JT, Devilee P, Hes FJ: Deep sequencing to reveal new variants in pooled DNA samples. Human Mutation. 2009, 30: 17031712. 10.1002/humu.21122.View ArticlePubMedGoogle Scholar
 Nejentsev S, Walker N, Riches D, Egholm M, Todd JA: Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science. 2009, 324: 387389. 10.1126/science.1167728.PubMed CentralView ArticlePubMedGoogle Scholar
 Druley TE, Vallania FLM, Wegner DJ, Varley KE, Knowles OL, Bonds JA, Robison SW, Doniger SW, Hamvas A, Cole FS, Fay JC, Mitra RD: Quantification of rare allelic variants from pooled genomic DNA. Nature Methods. 2009, 6: 263265. 10.1038/nmeth.1307.PubMed CentralView ArticlePubMedGoogle Scholar
 Bansal V: A statistical method for the detection of variants from nextgeneration resequencing of DNA pools. Bioinformatics. 2010, 26: i318i324. 10.1093/bioinformatics/btq214.PubMed CentralView ArticlePubMedGoogle Scholar
 Wei Z, Wang W, Hu P, Lyon GJ, Hakonarson H: SNVer: a statistical tool for variant calling in analysis of pooled or individual nextgeneration sequencing data. Nucleic Acids Research. 2011, 39: e13210.1093/nar/gkr599.PubMed CentralView ArticlePubMedGoogle Scholar
 Li H, Durbin R: Fast and accurate longread alignment with BurrowsWheeler transform. Bioinformatics. 2010, 26: 589595. 10.1093/bioinformatics/btp698.PubMed CentralView ArticlePubMedGoogle Scholar
 Altmann A, Weber P, Quast C, RexHaffner M, Binder EB, MüllerMyhsok B: vipR: variant identification in pooled DNA using R. Bioinformatics. 2011, 27: i77i84. 10.1093/bioinformatics/btr205.PubMed CentralView ArticlePubMedGoogle Scholar
 Kim SY, Li Y, Guo Y, Li R, Holmkvist J, Hansen T, Pedersen O, Wang J, Nielsen R: Design of association studies with pooled or unpooled nextgeneration sequencing data. Genetic Epidemiology. 2010, 34: 479491. 10.1002/gepi.20501.View ArticlePubMedGoogle Scholar
 Wang T, Lin CY, Rohan TE, Ye K: Resequencing of pooled DNA for detecting disease associations with rare variants. Genetic Epidemiology. 2010, 34: 492501. 10.1002/gepi.20502.PubMed CentralView ArticlePubMedGoogle Scholar
 Lee JS, Choi M, Yan X, Lifton RP, Zhao H: On optimal pooling designs to identify rare variants through massive resequencing. Genetic Epidemiology. 2011, 35: 139147. 10.1002/gepi.20561.PubMed CentralView ArticlePubMedGoogle Scholar
 Chen X, Listman JB, Slack FJ, Gelernter J, Zhao H: Biases and errors on allele frequency estimation and disease association tests of next generation sequencing of pooled samples. Genetic Epidemiology. 2012, (Epub ahead of print)Google Scholar
 Self SG, Liang KY: Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. Journal of the American Statistical Association. 1987, 82: 605610. 10.1080/01621459.1987.10478472.View ArticleGoogle Scholar
 Quinlan AR, Stewart DA, Strömberg MP, Marth GT: Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods. 2008, 5: 179181. 10.1038/nmeth.1172.View ArticlePubMedGoogle Scholar
 Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nature Methods. 2009, 6: S6S12. 10.1038/nmeth.1376.View ArticlePubMedGoogle Scholar
 Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER: Wholegenome sequencing and variant discovery in C. elegans. Nature Methods. 2008, 5: 183188. 10.1038/nmeth.1179.View ArticlePubMedGoogle Scholar
 Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25: 20782079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
 Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from highthroughput sequencing data. Nucleic Acids Research. 2010, 38: e164e164. 10.1093/nar/gkq603.PubMed CentralView ArticlePubMedGoogle Scholar
 Sherry ST, Ward M, Sirotkin K: dbSNPDatabase for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Research. 1999, 9: 677679.PubMedGoogle Scholar
 DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Angel Gd, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using nextgeneration DNA sequencing data. Nature Genetics. 2011, 43: 491498. 10.1038/ng.806.PubMed CentralView ArticlePubMedGoogle Scholar
 Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC, Warram JH, Todd JA: Metaanalysis of genomewide association study data identifies additional type 1 diabetes risk loci. Nature Genetics. 2008, 40: 13991401. 10.1038/ng.249.PubMed CentralView ArticlePubMedGoogle Scholar
 Barrett JC, Clayton DG, Concannon P, Akolkar B, Cooper JD, Erlich HA, Julier C, Morahan G, Nerup J, Nierras C, Plagnol V, Pociot F, Schuilenburg H, Smyth DJ, Stevens H, Todd JA, Walker NM, Rich SS: Genomewide association study and metaanalysis find that over 40 loci affect risk of type 1 diabetes. Nature Genetics. 2009, 41: 703707. 10.1038/ng.381.PubMed CentralView ArticlePubMedGoogle Scholar
 Huang J, Ellinghaus D, Franke A, Howie B, Li Y: 1000 Genomesbased imputation identifies novel and refined associations for the Wellcome Trust Case Control Consortium phase 1 Data. European Journal of Human Genetics. 2012, 20: 801805. 10.1038/ejhg.2012.3.PubMed CentralView ArticlePubMedGoogle Scholar
 Plagnol V, Howson JMM, Smyth DJ, Walker N, Hafler JP, Wallace C, Stevens H, Jackson L, Simmonds MJ, Bingley PJ, Gough SC, Todd JA, Consortium TDG: Genomewide association analysis of autoantibody positivity in type 1 diabetes cases. PLoS Genetics. 2011, 7: e100221610.1371/journal.pgen.1002216.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.