Next-generation sequencing (NGS) technologies have dramatically increased sequencing throughput. The new technologies, including those currently under development, improve on many aspects of DNA sequencing, but accuracy higher than that of traditional Sanger sequencing does not appear to be one of them. Each technology produces specific types of sequencing errors inherent in its process. In general, the new sequencing methods have an error rate between 0.1% and 1.0%. Because errors are distributed non-randomly across sites, with some sites 10 times more error-prone than the average, single nucleotide polymorphism (SNP) calling can often be difficult [2–4].
In this study, we are concerned with estimating a fundamental parameter of natural populations, namely, Watterson's θ of DNA polymorphism. Briefly, θ is the expected number of nucleotide differences between two sequences of the same locus chosen at random from the population. It is a good measure of genetic diversity and a basic parameter in population genetic analysis (e.g. tests of positive selection, [6–8]). Because polymorphism in natural populations is dominated by low-frequency variants, which are often indistinguishable from sequencing errors, using the new sequencing technologies to estimate polymorphism will remain a challenge in the near future. A number of methods have been proposed to separate errors from rare polymorphisms [10–14]. Among them, the approach of Nielsen et al. is the most direct, as it filters out errors from the raw read data. However, since error signals may vary from operation to operation, its general applicability will need to be evaluated.
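To make the parameter concrete, the following is a minimal sketch of the two textbook estimators of θ for an error-free sample of aligned haploid sequences: Watterson's estimator, which divides the number of segregating sites S by the harmonic sum a_n, and the average number of pairwise differences. The function names and the toy alignment are illustrative assumptions, and the sketch makes no attempt to handle sequencing errors, which are the difficulty addressed in this study.

```python
from itertools import combinations


def harmonic_sum(n):
    """a_n = sum of 1/i for i = 1 .. n-1, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def segregating_sites(seqs):
    """Count alignment columns in which more than one nucleotide is observed."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S / a_n for n aligned haploid sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    return segregating_sites(seqs) / harmonic_sum(n)


def pairwise_diversity(seqs):
    """Average number of nucleotide differences over all pairs of sequences."""
    if len(seqs) < 2:
        raise ValueError("at least two sequences are required")
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy alignment of four haploid sequences (assumed error-free).
    sample = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
    print("S       =", segregating_sites(sample))
    print("theta_W =", watterson_theta(sample))
    print("pi      =", pairwise_diversity(sample))
```

Both variants in this toy sample are singletons, which is exactly the class of variant that a sequencing error would mimic; a single miscalled base would inflate S and hence Watterson's estimate.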
There are two ways to prepare samples for sequencing and polymorphism estimation. In the first, sequencing is done on individual samples, or at least on pooled samples in which each sample is individually barcoded. We call this type of data “single-line data”. In the second, DNA samples from multiple individuals are pooled in equal quantity and sequenced without individual identification. We refer to this type as “pooled-line data”. We should note that sequencing each diploid sample individually is in fact a pooled-line approach, as two haploid genomes are sequenced together. In order to call SNPs accurately for both haploid genomes, the diploid has to be sequenced to a sufficient depth (e.g. 20X). Since individual samples are generally not sequenced to such a depth (e.g. in the 1000 Genomes Project), most methods cited above examine the aggregate properties of these individual sequences. In other words, although individuals may be sequenced separately, the data are pooled in the analysis. Hence, for many population genetic questions, little information would be lost by sequencing pooled samples, and efficiency would be greatly improved when the number of samples is large. It would then be possible to sequence each pool with greater accuracy in order to filter out errors from the data.
We now propose a method that minimizes the confounding effects of sequencing errors by combining two different sequencing applications. Dual sequencing applications have previously been carried out on the Illumina GA and SOLiD platforms for the same samples [16, 18, 19]. It has been shown that the two technologies have nearly non-overlapping error distributions. Dual-platform validation is in fact standard practice, as NGS results, on whichever platform, need to be backed up by another method, usually Sanger sequencing or other genotyping tools [4, 20, 21]. Dual applications on two NGS platforms are simply a more systematic and large-scale method of error correction. Such dual applications can also be expected on newer and very different technologies such as HiSeq, Ion Proton, PacBio and MspA nanopore. When dual-platform sequencing is not feasible, dual applications of the same platform to the same DNA sample, independently prepared for sequencing, may serve the same purpose. The correlation of error distributions between two applications on the same platform is slightly higher than that between different platforms, but it is often low enough for error correction.
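The intuition behind dual applications can be sketched as follows. This is only an illustration under simplifying assumptions, not the estimator developed in this study: errors in the two applications are treated as fully independent, an erroneous base is assumed to be equally likely to be any of the three alternative nucleotides, and candidate variants are represented as hypothetical (position, allele) pairs.

```python
def concordant_variants(calls_a, calls_b):
    """Keep only candidate variants reported by both sequencing applications.

    Each call is a hypothetical (position, allele) tuple.
    """
    return set(calls_a) & set(calls_b)


def joint_error_rate(rate_a, rate_b):
    """Probability that both applications miscall a site as the same wrong base.

    Assumes independent errors and a uniform choice among the three
    alternative nucleotides, so the second error matches the first
    with probability 1/3.
    """
    return rate_a * rate_b / 3.0


if __name__ == "__main__":
    # Hypothetical candidate variants from two applications on the same sample.
    application_a = {(101, "A"), (250, "T"), (777, "C")}
    application_b = {(101, "A"), (300, "G"), (777, "C")}
    print(sorted(concordant_variants(application_a, application_b)))

    # With a 1% per-base error rate in each application, matching errors
    # are far rarer than errors in either application alone.
    print(joint_error_rate(0.01, 0.01))
```

Under these assumptions, two applications that each err at 1% of bases would produce the same erroneous base at a given position with a probability of roughly 3 in 100,000. Any correlation between the two error distributions raises this figure, which is why the degree of overlap between platforms matters.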
In this study, we first investigate a simple single-line method that extracts haploid information from individual diploids. We then propose dual sequencing applications to improve the pooled-line method for analyzing pooled samples of diploids.