Finding genomic variants associated with complex traits is one of the most important problems in modern genomics. Genome-wide association studies (GWAS) based on common variants have been the dominant approach to achieve this objective . However, the genomic variants identified in GWAS studies often explain only a small portion of the phenotypic variation related to heritable human diseases, a phenomenon known as "missing heritability" in genomics literature . This missing heritability problem has led to increasingly skeptical views of the common disease-common variant (CD-CV) hypothesis which predicts that common disease-causing alleles, or variants, will be found in all human populations that manifest a given disease. On the other hand, interest in studies on rare variants with minor allele frequencies less than 1% is growing [3, 4].
Studies of rare variants are complicated by the low minor allele frequencies of rare variants. The development of next generation sequencing (NGS) technologies such as Illumina and Roche 454 has made it possible to sequence a large number of reads economically. Despite such important progress, sequencing a large number of individuals separately is still costly for most biological laboratories. One frequently adopted approach to reduce sequencing cost in the search of rare variants is pooled sequencing, where mixtures of genetic materials from several individuals are grouped together to form a pool for a single sequencing. While this design greatly lowers the sequencing cost, it also makes it hard to distinguish true genetic polymorphisms from sequencing errors, estimate minor allele frequencies at the polymorphic loci, and perform association studies on the rare variants.
Several research groups have used pooled sequencing to detect rare variants that are associated with complex traits such as retinitis pigmentosa, diabetes, cancer, and inflammatory bowel disease [5–8]. There are generally two types of pooling designs. One is pooling of tagged samples with each individual tagged by a unique short barcode. In this design, the genomic origins of the reads can be identified. However, barcoding many individuals and distinguishing these barcodes from each other can still be a challenging task. The second type of pooling is to mix the genetic materials from different individuals without tagging, and then generate reads from the mixture of genetic materials using NGS. With this design, the identities of the individuals from whom the reads originate cannot be identified. In this paper, we concentrate on the second type of pooling design.
Several groups have developed SNP calling methods based on pooled sequencing data [7, 9–12]. Out et al. modeled the number of sequencing errors as a Poisson random variable and identified SNPs by comparing the number of minor alleles within the reads with the Poisson distribution. For rare variants with minor allele frequencies similar to or lower than the sequencing error rate, this approach could miss many true variants if the pool size is relatively large. Druley et al. developed a SNP identification method, SNPSeeker, that can be applied to large pools by using control sequences without SNPs. In many studies, control sequences may not be available, making this approach impractical. Also, the program can only be used to analyze Illumina data. Bansal et al. developed a method called CRISP to identify rare variants by comparing the minor allele frequencies across multiple pools using contingency tables. It was shown that CRISP outperforms SNPSeeker in terms of accuracy, but CRISP is more computationally demanding. Altmann et al. improved the computational speed of CRISP and identified SNPs as the variants with different minor allele frequencies across at least two pools. Wei et al. developed a statistical tool, called SNVer, for variant identification. For each pool, SNVer first defined a p-value by testing the hypothesis that the minor allele frequency is above a given threshold and then combined the p-values for individual pools to give an overall p-value using Simes methods as in . This algorithm makes it possible to rank the observed variants so that the top ranked ones are more likely to be true SNPs. However, the algorithm needs a pre-specified sequencing error rate which can be difficult to do because the sequencing error rates can be position dependent. In the above studies, the investigators are mainly concerned with the detection of SNPs; they do not aim to estimate minor allele frequencies.
In order to estimate minor allele frequencies in pooling studies, several groups developed statistical models for the sampling of individuals and the sampling of reads from the individuals in the pools [14, 15]. These studies assumed a pre-defined constant sequencing error rate across different loci. However, sequencing error rates can vary for different loci depending on the nucleotide contents of the surrounding genomic regions. The effects of mis-specifying the sequencing error rate on minor allele frequency estimation, SNP detection and power of association studies are not clear. Using similar models, Lee et al.  studied an optimal design in pooling studies. This involved the number of individuals in each pool and the number of pools. Recently, Chen et al. considered more complex issues such as uneven sampling of individuals, different coverage of the minor and major alleles due to either PCR amplification or reads mapping, and reads quality scores.
In this paper, we develop new methods for estimating minor allele frequencies, SNP detection, and association studies using pooled sequencing data based on the models in [14–16]. In contrast to methods developed in previous studies, we do not assume that the sequencing error rate is constant. Instead, we estimate the sequencing error rate for each position together with the minor allele frequency based on the minor and major allele counts for all the pools using an expectation-maximization (EM) algorithm. We show that the naive estimation of the allele frequency by the fraction of minor alleles in the reads can be significantly inflated, especially for rare variants, while the EM approach can give an unbiased estimate of the minor allele frequency in all situations we studied. The estimation accuracy of the EM algorithm increases with the number of reads and the number of pools, but decreases with the number of chromosomes in each pool. Based on the allele frequency estimation, we develop a SNP calling method, EM-SNP, and an association test using likelihood ratio statistics. The likelihood ratio statistic used in EM-SNP is then used to rank candidate polymorphic loci to determine if they are true polymorphisms. Using a real re-sequencing dataset, we show that, for rare variants with minor allele frequencies lower than 1%, the fraction of dbSNPs (http://www.ncbi.nlm.nih.gov/projects/SNP/) among the SNPs called by EM-SNP is higher than that of SNVer. Similarly, the transition/transversion ratio of rare variants called by EM-SNP is higher than that of SNVer. These observations show that EM-SNP outperforms SNVer at calling rare variants with minor allele frequencies less than 1%.