Generation of SNP datasets for orangutan population genomics using improved reduced-representation sequencing and direct comparisons of SNP calling algorithms
- Maja P Greminger1Email author,
- Kai N Stölting2,
- Alexander Nater1,
- Benoit Goossens3, 4, 5,
- Natasha Arora1,
- Rémy Bruggmann6, 7,
- Andrea Patrignani6,
- Beatrice Nussberger8,
- Reeta Sharma9,
- Robert H S Kraus10,
- Laurentius N Ambu5,
- Ian Singleton11, 12,
- Lounes Chikhi9, 13, 14,
- Carel P van Schaik1 and
- Michael Krützen1
© Greminger et al.; licensee BioMed Central Ltd. 2014
Received: 8 April 2013
Accepted: 21 December 2013
Published: 10 January 2014
High-throughput sequencing has opened up exciting possibilities in population and conservation genetics by enabling the assessment of genetic variation at genome-wide scales. One approach to reduce genome complexity, i.e. investigating only parts of the genome, is reduced-representation library (RRL) sequencing. Like similar approaches, RRL sequencing reduces ascertainment bias due to simultaneous discovery and genotyping of single-nucleotide polymorphisms (SNPs) and does not require reference genomes. Yet, generating such datasets remains challenging due to laboratory and bioinformatical issues. In the laboratory, current protocols require improvements with regards to sequencing homologous fragments to reduce the number of missing genotypes. From the bioinformatical perspective, the reliance of most studies on a single SNP caller disregards the possibility that different algorithms may produce disparate SNP datasets.
We present an improved RRL (iRRL) protocol that maximizes the generation of homologous DNA sequences, thus achieving improved genotyping-by-sequencing efficiency. Our modifications facilitate generation of single-sample libraries, enabling individual genotype assignments instead of pooled-sample analysis. We sequenced ~1% of the orangutan genome with 41-fold median coverage in 31 wild-born individuals from two populations. SNPs and genotypes were called using three different algorithms. We obtained substantially different SNP datasets depending on the SNP caller. Genotype validations revealed that the Unified Genotyper of the Genome Analysis Toolkit and SAMtools performed significantly better than a caller from CLC Genomics Workbench (CLC). Of all conflicting genotype calls, CLC was only correct in 17% of the cases. Furthermore, conflicting genotypes between two algorithms showed a systematic bias in that one caller almost exclusively assigned heterozygotes, while the other one almost exclusively assigned homozygotes.
Our enhanced iRRL approach greatly facilitates genotyping-by-sequencing and thus direct estimates of allele frequencies. Our direct comparison of three commonly used SNP callers emphasizes the need to question the accuracy of SNP and genotype calling, as we obtained considerably different SNP datasets depending on caller algorithms, sequencing depths and filtering criteria. These differences affected scans for signatures of natural selection, but will also exert undue influences on demographic inferences. This study presents the first effort to generate a population genomic dataset for wild-born orangutans with known population provenance.
KeywordsNext-generation sequencing Single-nucleotide polymorphisms Reduced-representation libraries Bioinformatics GATK SAMtools CLC genomics workbench Great apes
The availability of high-throughput sequencing has revolutionized the fields of population genetics and molecular ecology . Early genomic work focused mainly on broad comparative analyses between species [2–6] and was limited to one or a few individuals per species. The emergent field of population genomics , including conservation  and landscape genomics , investigates genomic allele-frequency patterns at the species level, i.e. among and within natural populations. Main interests revolve around exploring patterns of genetic diversity, differentiation and admixture, inferring demographic population histories, and studying signals of local adaptations in wild populations [7–9].
To date, population-genomics studies have mainly focused on humans [10, 11], some model species [12–14] and others relevant to agricultural production [15, 16]. Other taxa, particularly those with large genomes, have remained largely unexplored because of significant challenges in the laboratory and during bioinformatical analyses [8, 17, 18]. Sequencing of complete genomes of many individuals is usually still prohibitive because of associated costs and bioinformatical complexities, especially in species where a reference genome is unavailable. Yet, many biological questions can be addressed by describing polymorphisms from a subset of genomic regions, provided that these regions are approximately evenly distributed throughout the genome.
In the laboratory, several strategies have recently been developed enabling so-called ‘reduced genome complexity sequencing’, i.e. sampling only a small fraction of the genome in several individuals. These strategies include sequencing of reduced-representation libraries (RRLs) , restriction-site-associated DNA sequencing [7, 20], and other sequence-based-genotyping approaches [21, 22]. Essentially, all of these methods are based on the same key principle: reducing genome complexity by digestion of genomic DNA with one or several restriction enzymes followed by a selection of resulting restriction fragments, and high-throughput sequencing of the final set of fragments.
One of the key characteristics of the aforementioned methods is that, at least in theory, read mapping can be carried out regardless of the availability of a reference genome by constructing a reference sequence from overlapping sequence stacks [21, 23–26]. Moreover, the similarity among sequence stacks of different individuals allows the direct estimation of allele frequencies by simultaneous identification of polymorphisms and genotype calling (genotyping-by-sequencing). This reduces the major issue of ascertainment bias, which arises when markers are identified in a small subset of individuals and subsequently genotyped in an extended sample set [1, 17, 27].
One popular reduced-genome complexity approach is RRL sequencing. RRLs were first used to generate single-nucleotide polymorphisms (SNP) maps of the human genome using classical Sanger sequencing . Since Van Tassel et al. first adapted the approach to high-throughput sequencing, it has been applied in a number of SNP discovery studies (e.g. [26, 29–32]). In the RRL approach, the number of restriction fragments subjected to high-throughput sequencing is reduced via size-selection before sequencing library preparation. RRLs allow the degree of complexity reduction to be customized by defining the selected fragment-size range. By providing easy access to flanking sequences necessary to design SNP genotyping assays when a reference genome is unavailable, RRLs are superior to other reduced-complexity approaches [7, 20]. In the RRL approach, long DNA stretches can be sequenced by simply size-selecting for longer fragments (up to several kb possible) and complete sequencing of these fragments independent of the platform read length through shearing of fragments prior to high-throughput sequencing library preparation followed by assembly of the resulting sequence fragments .
Although the RRL principle is highly promising for generating population genomic SNP data, current protocols must be improved so as to i) facilitate library construction for individual samples, and most importantly, ii) maximize the number of homologous fragments generated during library construction. In the past, RRL sequencing has usually been performed on pools of DNA samples from multiple individuals for practical reasons [23, 26, 32, 33]. However, pooling leads to the loss of major biological information as it prohibits the assignment of individual genotypes (i.e. genotyping-by-sequencing). Because of this, many biological questions, such as investigating admixture or linking phenotypes with genotypes in studies of natural selection, cannot be addressed when samples are pooled. Furthermore, pooling strongly increases the risk of missing rare alleles, especially if there are many individuals in the pool . In addition, pooling is highly sensitive to variation in DNA concentration among samples, which will inadvertently lead to an over – or underrepresentation of certain alleles . Thus, current protocols need to be improved to facilitate RRL generation of individual samples.
Analyzing individual samples requires improvements to minimize DNA loss during purification steps, which is particularly important if sample-DNA quantity is limited. Moreover, genome complexity needs to be reduced in a reproducible manner (i.e. homologous sites must be sequenced) across samples as this primarily determines the effectiveness of the genotyping-by-sequencing principle and reference-free mapping . Non-overlapping sequences will lead to a high number of missing genotypes. The accurate sequencing of homologous sites is also of particular importance when working with pooled samples, as the true number of sequenced individuals at a particular SNP site cannot be determined. In the most extreme case, only alleles of one individual would be sequenced. In such a case, however, allele frequencies would nonetheless be estimated under the assumption that all allele copies in the pool had been sampled.
From a bioinformatical point of view, the amounts of raw data produced by high-throughput sequencing platforms are vast and many computational steps are required to translate raw outputs into high-quality SNP calls . Thus, accurately identifying SNPs and calling genotypes from high-throughput sequencing data while filtering out sequencing errors remains a challenge. Various SNP calling programs have been introduced and algorithms are under constant development [18, 35–37].
One of the most widely used commercial software suites for genomic data analysis is the CLC Genomics Workbench (CLC bio, Aarhus, Denmark). The software contains a basic SNP caller (hereafter referred to as ‘CLC’) that detects SNPs based solely on applying quality thresholds to sequencing, mapping and base quality. Genotypes are determined using hard-filter criteria, i.e. by simply counting the number of sequencing reads for each allele and applying arbitrary custom cut-off rules. For instance, a genotype would be called heterozygous if an alternative allele is present in 20-80% of the reads. However, for low sequencing depths this way of genotype calling tends to underestimate the number of heterozygous genotypes .
Arguably, two of the most popular non-commercial software suites are the Genome Analysis Toolkit (Broad Institute) [37, 38] and SAMtools. Both SAMtools and the Unified Genotyper of the Genome Analysis Toolkit (hereafter referred to as ‘GATK’), incorporate uncertainty in a probabilistic framework, in order to call SNPs and genotypes simultaneously [36–38]. Both SAMtools and GATK allow the joint analysis of all samples from one population (multi-sample calling). A major strength of the Bayesian framework is the potential to incorporate prior information, such as previous observations of alternative alleles, heterozygosity, and allele frequencies. Ideally, additional information such as representative reference SNPs or linkage-disequilibrium patterns could be incorporated [18, 37, 39]. Unfortunately, such information so far limited to a few model species (e.g. Arabidopsis ) and humans . It has been proposed that in contrast to CLC, GATK (and potentially also SAMtools) might have the tendency to overestimate the number of heterozygous genotypes . This is because GATK aggressively calls alternative alleles in favor of high sensitivity, resulting in a high number of false-positive calls which require extensive post-filtering.
Despite the fact that accurate SNP and genotype calling is fundamental for precise population parameter estimation in downstream analyses [17, 40, 41], to our knowledge direct comparisons of different SNP callers in the aforementioned context are still scarce. To date, most studies employ only one SNP caller, although it is conceivable that different callers will produce different datasets. In previous studies, validations were often restricted to confirming and comparing the polymorphic state of SNPs (e.g. [26, 31, 42]), but not actual genotypes at the individual level.
Here, we provide a comprehensive framework to obtain high-quality SNP data in population genomics, addressing both laboratory and bioinformatic challenges. First, we refined and improved an RRL protocol (iRRL), which maximizes the generation of homologous DNA fragments across individuals, thus achieving high genotyping-by-sequencing efficiency. Our protocol also contains modifications for economical handling of DNA during library preparation. All modifications support the establishment of single-sample libraries. Second, we directly compared three popular SNP callers (GATK, SAMtools and CLC) using our iRRL data generated for two orangutan populations (Genus: Pongo).
Orangutans are the only great apes found outside Africa and the phylogenetically most distant great apes to humans, which makes them particularly interesting to study in terms of the evolution of the hominid lineage [43, 44]. In contrast to humans (e.g. the International HapMap Project; the 1000 Genomes Project), in non-human great apes large-scale population genomic data from wild-born individuals with known population origin are scarce (but see [45, 46]). Rather, most genomic data were generated from a small number of zoo animals with mostly unknown population origins [47–50], thus providing a limited perspective for population genomic analyses of wild populations. Genome-wide data in orangutans will enable the investigation of the genetic basis of local adaptations among orangutan populations . Moreover, population genomic data will shed more light on the particularly complex demographic history of orangutans, as shaped by volcanic eruptions and recurrent sea level changes connecting the islands of Borneo and Sumatra during the Pleistocene [52–56].
Improved reduced-representation sequencing
We developed a protocol to construct improved RRLs (referred to as iRRLs) that maximizes efficiency and repeatability of genome complexity reduction. We applied several key modifications to the method outlined in van Tassel et al. including: (i) high-resolution fragment-size-selection down to an accuracy of one base pair to increase precision of isolating homologous fragments (Additional file 1: Figure S2), (ii) modifications to minimize DNA loss during purification steps, achieving DNA recovery rates of >95%, and (iii) adjustments to establish single-sample libraries to avoid the necessity of sample pooling. In order for restriction enzymes to generate homologous fragments across samples, our protocol includes recommendations for suitable sample handling and DNA isolation to avoid DNA strand breaking prior to digestion.
Overview of the sequencing of improved reduced-representation libraries (iRRLs) for the West Alas (WA) and South Kinabatangan (SK) orangutan study populations
No. of individuals
iRRL stacks per individual (predicted)a
Median iRRL target efficiencyb
Total no. of beads per population
Total no. of mapped reads per population
Median no. of mapped reads per individual
% reads mapped F3/F5 (mappability)c
Mean no. of bphiqual per individuald
Median sequence coverage per individuale
To assess the performance of our iRRL protocol, we estimated the iRRL target efficiency as the percentage of obtained bphiqual sites which were predicted by the in-silico digest (= target sites). iRRL efficiency varied among individuals but was very high with a median of 97% for West Alas individuals and 86% for the South Kinabatangan individuals (Table 1 and Additional file 3: Table S2). Thus, the vast majority of sequenced high quality bases were target sites, i.e. predicted by the in-silico digest of the orangutan reference genome.
Comparison of SNP discovery and genotype calling
Overview of SNP discovery and genotype calling using three different callers
No. of SNPs
No. of private SNPs
Median site heterozygositya
Median coverage per individual
SAMtools- GATK intersect
No. of SNPs
No. of private SNPs
Median site heterozygositya
Median coverage per individualb
Compared to similar studies (e.g. [30, 33, 42, 57]), median sequence coverage at SNP sites across all individuals in our datasets was extremely high (82× for GATK, 48× for CLC, and 27× for SAMtools), although coverage counts differed drastically among datasets. This discrepancy in coverage counts could be attributed to a different treatment of quality scores in read counting and/or different default parameters among the callers, since we applied identical quality thresholds to the data. The considerably lower read counts in the SAMtools dataset and potentially different prior probabilities in the Bayesian framework may be causal for the strikingly lower number of total SNPs in our SAMtools dataset.
Many of the non-overlapping sites were present in initial SNP discoveries, but were removed because less than eight individuals per population had a genotype call meeting all high-quality filter criteria (population-based filter). For the CLC and SAMtools dataset, genotype calls often failed the minimum coverage requirement of 10 reads. For the GATK dataset, many genotype calls did not have a sufficiently high genotype quality score.
Median genotype concordance between designated SNP callers for overlapping SNP sites assessed at the individual level
% same genotype called Pop_WA
% same genotype called Pop_SK
% same genotype called overall
% same genotype called overall (range)
We also created three intersect datasets by accepting only identically assigned genotypes between pairs of SNP callers (at the individual level prior to the population-based filtering). This procedure has been suggested to reduce caller-specific errors and increase specificity [18, 58]. We retained 37,085 SNPs for the GATK-CLCintersect dataset, 18,933 SNPs for the SAMtools-GATKintersect dataset, and 17,163 SNPs for the SAMtools-CLCintersect dataset (Table 2).
Impact on biological inferences
Over all six datasets, there were more sites segregating in the Sumatran West Alas population compared to the Bornean South Kinabatangan population (Table 2). The vast majority of SNPs (70-80% depending on the dataset) were private. In addition, we observed a large percentage of singletons (Table 2). The highest number of singletons was obtained in the CLC dataset (26%) followed by SAMtools (22%) and GATK (12%). Median site heterozygosity was always higher for the South Kinabatangan population than for the West Alas population.
To determine genotype accuracy, we validated 63 genotypes from a subset of 58 SNPs overlapping among datasets by classical Sanger sequencing. We picked SNPs with the only requirements that a minimum of ten individuals per population had an assigned genotype and that at least one individual showed a conflicting genotype call between GATK/SAMtools and CLC. Because all validated genotypes were identical between GATK and SAMtools, we did not distinguish between the two for this analysis, but rather focused on the difference between probabilistic (GATK and SAMtools) and hard-filtering (CLC) callers.
Overview of genotype validations at overlapping SNP sites
Discordant calls a
Singleton site determined by GATK/SAMtoolsb
Singleton site determined by CLCb
Homozygote with GATK/SAMtools but heterozygote with CLC
Heterozygote with GATK/SAMtools but homozygote with CLC
Concordant calls c
Characteristics of SNP callers
GATK seemed to be conservative in calling singletons and low frequency alleles in our dataset, as it exhibited the lowest proportion of singletons among all SNP datasets. Yet, among all datasets, GATK had the highest medium site heterozygosity. It appears that GATK slightly overestimates mid-frequency alleles, because our genotype validations revealed that in 30% of the cases where GATK called a heterozygous and CLC a homozygous genotype, CLC was correct. Thus, our results suggest that with increasing minor allele frequency, GATK starts calling alternative alleles more aggressively due to the population prior in multi-sample analysis.
It appears that CLC generally underestimates heterozygosity. The CLC dataset consisted of an excess of singletons, suggesting that CLC called sequence errors as a genetic variants to a greater extent. Thus, the CLC dataset contained the lowest overall site heterozygosities among all datasets. Detailed investigation of discordantly called genotypes revealed that almost all of these genotypes were homozygous with CLC, but heterozygous with the other callers. To our surprise, CLC largely miscalled genotypes as heterozygous which were correctly assigned as homozygous by GATK (89% correct by GATK).
Our results indicate that SAMtools is more restrictive in SNP calling than GATK and CLC. The SAMtools dataset consisted of considerably fewer SNPs than GATK and CLC, but the degree of overlap with the other datasets was much higher than for the other datasets. SAMtools showed the highest tendency to assign heterozygous genotypes in cases of discordantly called genotypes among callers. For example, the few discordantly called genotypes were strongly biased in that 84% were heterozygous with SAMtools, but homozygous with GATK.
Our study provides a framework for the generation of genome-wide SNP datasets for population genomic studies, from laboratory procedures to bioinformatics, which is widely applicable in non-model species. We present an improved protocol for highly efficient and more precise reduced genome complexity sequencing that simultaneously allows discovery of novel SNPs and genotyping. Using data generated from 31 wild-born orangutans from two populations, we observed significant inconsistencies among three commonly used SNP callers (CLC Genomics Workbench, GATK Unified Genotyper and SAMtools). These inconsistencies among the SNP datasets led to strong disagreement in outliers detected in scans for signatures of natural selection. This shows the potential impact on downstream biological analyses and emphasizes the need to critically evaluate the accuracy of SNP and genotype calling in population genomic studies.
We present a refined iRRL method presenting an improvement of the approach by van Tassel et al. . Several key modifications greatly enhanced the effectiveness of genotyping-by-sequencing, as measured by target sequence efficiency. Target sequence efficiency was high because we focused on laboratory procedures to obtain homologous sequences across individuals, i.e. reproducible fragment generation and precise size selection. To our knowledge, these procedures do not seem to have received sufficient attention in the literature, probably because most studies pooled individuals to develop SNP markers [26, 32, 33] without the direct aim of estimating allele-frequencies.
The importance of uniform fragment selection is well illustrated by our in-silico digests of the orangutan reference genome. An imprecise isolation of fragments would have led to a substantial change in the overall composition of fragment libraries across samples. This in turn would have caused a substantial increase in missing genotypes because of significantly reduced overlap of homologous fragments. Thus, accurate size selection and generation of uniform fragments to achieve high sequences homology are paramount in producing high-quality RRLs that maximize the amount of biological information.
The higher and more constant target sequence efficiencies for Sumatran West Alas individuals (median 97%) compared to Bornean South Kinabatangan individuals (median 86%) were most likely caused by carrying out the initial in-silico digest, which predicted our target sites, on the Sumatran reference genome. Since Sumatran and Bornean orangutans diverged more than 400,000 years ago, [48, 53, 56], Bornean orangutans will inevitably exhibit more mutations at restriction sites.
We also improved previous RRL approaches by minimizing the loss of DNA during purification steps, thus facilitating single-sample library construction. Economical handling of DNA is particularly relevant when studying species for which sample quantity is a limiting factor, which is the case for most wild animal populations. A high DNA recovery rate during purification steps is especially important when dealing with low template amounts (<100 ng), where DNA loss will be disproportionately higher for technical reasons, and/or targeting only a small fraction of the genome. So far, these problems have been circumvented by pooling samples. Our DNA recovery rate of >95% in the purification steps is considerably higher than obtained through conventional methods using extractions from gels and/or silica columns [<80%; QIAquick Spin Handbook Qiagen].
From a bioinformatical perspective, we demonstrate that different SNP callers lead to substantially different SNP datasets, in spite of applying rather conservative quality filters. For example, we applied a phred-scaled mapping and base quality threshold of ≥ Q30, corresponding to an error probability of ≤0.1%. In contrast, other studies only apply Q20 (1.0% error probability) [18, 26, 59, 60]. Furthermore, our median sequence coverage of 41× (minimal cut-off of 10 reads) is substantially higher than that found in other studies, in which sequencing depth is usually between 6-16× with lower cut-off values than used in this study [30, 32, 33, 42, 57].
There are three main reasons for the conspicuous differences among the SNP datasets. First, the SNPs dropping out because of our population-based filtering were different among the GATK, SAMtools and CLC datasets. Second, although we used identical mapped short reads and filtering criteria on the raw data to call SNPs and genotypes, we cannot exclude a potential influence of the poor F5 sequence read performance due to specific internal filters of SNP callers. Third and most importantly, some differences will arguably be related to the conceptually very different methods of SNP identification and genotype assignment [18, 36–38].
Intersect strategies have been proposed to reduce caller-specific errors [18, 58]. The estimated genotype accuracy of 96.5% of intersected genotypes is higher than in comparable studies that use only one caller (e.g. 47-84% [26, 42, 61]; 89-95% [19, 31, 32, 62]). Yet, most of these studies actually only verified the polymorphic state of SNPs but not individual genotype calls. Thus, the true genotype error rate in these studies is almost certainly higher than estimated.
The intersect strategy seems to be appealing because false-positive assignments should be minimized. However, it is inevitably less sensitive towards SNP discovery . The appropriate strategy and filter stringencies for each study depend on the specific needs of downstream analyses. Nonetheless, apart from higher false-negatives rates, as observed in our dataset, intersecting genotype calls might also introduce non-random biases. More detailed investigations will be required to fully appreciate the consequences of intersecting strategies.
Among all datasets, the general patterns tend to agree with previous detailed studies on orangutan population genetics and demographic history. For instance, the higher number of singletons and low-frequency alleles we observe in the Sumatran West Alas population is in agreement with previous studies using conventional genetic markers (mitochondrial DNA, microsatellites) [52–54]. Furthermore, the slightly higher site heterozygosities in the South Kinabatangan population are also in agreement with previous studies using conventional genetic markers [52, 63–65].
Many downstream analyses in population genomics, such as selection tests or demographic inferences rely on the allele-frequency spectra . Thus, biological conclusions drawn from such analyses may well change depending on which SNP caller has been used. This possibility is illustrated by the extremely low overlap of identified outlier regions in our sliding-window analyses to detect selective sweeps based on population differentiation.
Apart from reliable SNP analysis, the accurate characterization of the allele-frequency spectra is mainly influenced by three sources of bias. First, allele frequencies will not be representative of the population if there is a sampling bias [9, 66, 67]. To address this issue and reduce this bias, we carefully selected study animals and verified population origins. By contrast, genomic studies often rely on zoo animals with unknown population provenance (if wild-born) or apply a limited sampling schema (e.g. ), and thus there are likely inherent sampling biases.
Second, the discovery of SNPs in a subset of individuals for subsequent genotype calling in an extended sample set will lead to ascertainment bias [1, 17, 27]. The degree of ascertainment bias depends on the representativeness of the sampling scheme of individuals used for the initial SNP discovery . Especially in population and conservation genomics, ascertainment bias is a serious problem when assessing, for instance, genetic diversity. Low-frequency variants will be underestimated and a systematic bias will be introduced . The key strength of reduced genome complexity approaches is that this form of ascertainment bias can be minimized by the genotyping-by-sequencing principle.
Third, it is biologically relevant to also capture rare alleles, which is the reason why we established individual libraries (i.e. no pooling of samples). Low-frequency alleles are important in estimates of demographic parameters  and studies of positive  and purifying selection .
The framework provided in this study will be valuable to generate genome-wide SNP datasets in the emerging fields of population, conservation and landscape genomics. Our iRRL protocol is part of a growing suite of sequencing methods, which have completely changed study designs and hold great promise for studies of ecology and evolution in diverse species. The strength of reduced-genome-complexity RRL methods is that they can be applied to any DNA-based life form, opening up the field of population genomics to smaller research groups studying organisms for which large-scale genetic data is not yet available. Until high-throughput sequencing becomes more affordable and bioinformatical advances allow routine whole-genome re-sequencing of populations, we expect that reduced-genome-complexity approaches will remain essential for population genomic studies particularly in non-model organisms with large genomes.
We generated SNP datasets for 31 wild-born orangutans from two populations representing the first effort of large-scale SNP discovery and genotyping of orangutans with known population provenance. In the field of population genomics, researchers need to exert caution when generating genome-wide SNP datasets. We show that accurate generation of homologues fragments in reduced-genome-complexity sequencing is paramount, especially for pooled samples with no control for missing genotypes in the estimation of allele frequencies. We present an improved RRL protocol (iRRLs), which allows sampling only a fraction of the genome with maximized sequence overlap among individuals. The scale and efficiency achieved with our iRRL protocol demonstrates its suitability to generate genome-wide SNP datasets. Our direct comparison of three popular SNP callers demonstrated that depending on the calling algorithm, sequence depths and filtering criteria, substantially different SNP datasets are obtained that will affect downstream analyses and thus might have a substantial effect on biological conclusions. When only applying a single SNP caller, we advise to use a probabilistic algorithm and call genotypes in a multi-sample mode. In our study, the Bayesian framework of the Unified Genotyper of the GATK showed a higher sensitivity in discovering SNPs than the framework of SAMtools with similar genotype calling accuracy.
We sampled two orangutan populations, one from Borneo and one from Sumatra (Figure 1). To obtain sufficient amounts of high-quality DNA, we collected blood samples from rehabilitant wild-born orangutans. We sampled 15 individuals from the West Alas population (WA, Pongo abelii, northwestern Sumatra) at the Batu Mbelin Quarantine Center of the Sumatran Orangutan Conservation Programme, and 16 individuals from the South Kinabatangan population (SK, Pongo pygmaeus morio, northeastern Borneo) at the Sepilok Orangutan Rehabilitation Centre, Shangri-La's Rasa Ria Resort Sanctuary and Lok Kawi Wildlife Park in Sabah. Whole blood samples were taken during routine veterinary examinations and stored in EDTA blood collection tubes at −20°C. The collection and transport of samples were conducted in strict accordance with Malaysian, Indonesian and international regulations. Samples were exported from Malaysia and Indonesia to Switzerland under the Convention on International Trade of Endangered Species in Fauna and Flora (CITES) permit numbers 4872/2010 (Sabah, Malaysia) and 06968/IV/SATS-LN/2005 (Indonesia), respectively. Detailed information on the sampled individuals is provided in Additional file 2: Table S1. We verified the individual’s population origin by genetic assignment tests and Bayesian clustering algorithms as described in the Additional file 1.
To minimize DNA shearing, we avoided repeated thawing and freezing of samples and used only wide-bore tips and avoided vortexing during DNA extraction. Genomic DNA was extracted using the Gentra Puregene Kit (Qiagen) according to the manufacturer’s instructions, including RNase treatment, but with the following modifications for clotted blood: we added twice the amount of Cell Lysis Solution as well as 7 μl of Proteinase K (20 mg/ml, Promega) per 100 mg blood clot to the samples, followed by incubation for 3 hours at 55°C in a slowly revolving overhead rotator. If the solution still appeared to be viscous after this treatment, we increased incubation time and added more Proteinase K as required until complete liquefaction. We also used twice the recommended amount of Protein Precipitation Solution and incubated samples on ice for 10 minutes after addition of the solution to promote protein precipitation. DNA pellets were eluted in ddH2O instead of DNA Hydration Solution (Qiagen) to facilitate DNA concentration using a SpeedVac vacuum centrifuge (Savant).
Reduced-representation libraries construction
We performed in-silico digests of the orangutan reference genome (ponAbe2) to evaluate a suitable restriction enzyme to construct iRRLs using custom perl scripts. We tested 23 commercially available Type II DNA blunt-end cutters (4–6 bp recognition sites) in multiple combinations (Additional file 7: Table S6). Selection criteria were: (i) target size range 70–200 bp, (ii) number of fragments predicted in size range corresponding to ~1% of the genome, and (iii) low repetitive element content. We chose Hae III because in the size range of 104–123 bp, Hae III did not produce obvious repetitive elements based on visual inspection of the fragment distribution profile (Figure 2), and covered ~1% of the genome. The enzyme Hae III has also been selected in previous studies [19, 31, 60], and thus might be a good candidate enzyme for reduced-genome-complexity sequencing in general.
In cases where there is no reference genome available, the evaluation for a suitable enzyme could also be carried out in the laboratory, for example by analyzing the fragment distribution of digested genomic DNA using high resolution electrophoresis (e.g. Agilent 2100 Bioanalyzer). These instruments offer tools to estimate the represented genome proportion of fragments within a given size range.
We established iRRLs for each individual by digesting 20 μg of genomic DNA with 200 units of Hae III (50,000 U/ml, New England Biolabs) in a total volume of 32 μl. Digests were run on high-resolution Spreadex EL400 Wide Mini S-2 × 13 gels with M3 size marker in a SEA 2000 electrophoresis chamber (all Elchrom Scientific, Switzerland) in 1× TAE buffer at 120 Volt for 147 min, keeping temperature constant at 55°C to ensure reproducibility of fragment migration. The running time was the evaluated optimum for the target size range using the ELQuant Software (http://www.elchrom.com). Each digest was equally distributed in two separate wells to avoid DNA overloading. We stained gels with GelRed (Biotium) and excised fragments between 104 bp and 123 bp on a UV-transilluminator using a long-bladed sharp knife, keeping UV exposure as short as possible.
DNA fragments were recovered by electro elution to achieve high DNA recovery rate (>95%). For this, we prepared dialysis membranes (Carl Roth, 1785.1 Dialysierschlauch Visking) of approximately 5 cm width, which we sealed on one side with a plastic clip (Carl Roth, H277.1 Verschlussklammer). We filled each dialysis membrane with 1 ml of 1× TAE buffer and placed gel slices in the membrane in the same running orientation as in the electrophoresis run (illustrated in Additional file 1: Figure S3). We closed the dialysis membrane with a second plastic clip and avoided trapping any air bubbles inside the membrane. Packages were then placed in an SEA 2000 electrophoresis chamber filled with 1× TAE buffer. We applied 90 Volts for 100 minutes, followed by 1 minute of reverse polarity to detach DNA from the wall of the membrane. We gently massaged the packages to mix the eluted DNA in the buffer. After this, we carefully opened one of the clips to gently pipet out the buffer containing the eluted DNA. The DNA was purified using the MinElute PCR Purification Kit (Qiagen). This way, we obtained between 2 and 20 ng of DNA per sample. Individual barcoding of iRRLs and SOLiD sequencing library preparation was performed according to the SOLiD ChiP-Seq protocol step 11 (Applied Biosystems, 2010), which had been optimized for low template quantities (e.g. Agencourt AMPure XP beads for purification steps). We restricted library amplification to six PCR cycles only, so as to minimize the risk of over-amplification. After library quality control on an Agilent Bioanalyzer 2000, we normalized samples and sent pooled libraries to the Functional Genomics Center Zurich, Switzerland (FGCZ) for sequencing on a SOLiD 4™ System with paired-end (50/35) chemistry (Life Technologies).
SNP discovery and genotype calling
Raw sequence reads were processed and mapped to the orangutan reference genome ponAbe2 using the SOLiD LifeScope v.2.5.1 package (Life Technologies) according to their guidelines. We used Picard v.1.57 [http://picard.sourceforge.net/] to merge mapping files for each individual from different SOLiD runs and adjust read group headers. We called SNPs using three different programs as described below.
We performed simultaneous multi-sample SNP and genotype calling with the Unified Genotyper of the GATK v.2.5-0 [37, 38] with the following thresholds: phred-scaled mapping and base qualities ≥ 30 (‘-mmq 30 -mbq 30’). We filtered out low-quality genotypes (GQ < 10) and genotypes covered by less than 10 or more than 1000 reads (‘-minGQ 10 -minDP 10 -maxDP 1000’) using VCFtools v.0.1.9 . Sites which were homozygous after this filtering were removed. Finally, we disregarded sites with more than two alleles and only retained sites with a genotype call for a minimum of eight individuals per population that had passed all quality filters applying custom R scripts.
As a second probabilistic caller, we used SAMtools v.0.1.19  to call SNPs and genotypes in all individuals simultaneously. We applied the same filter thresholds as for the GATK dataset and used defaults settings otherwise (except for deactivating the base alignment quality realignment with the -B parameter: ‘samtools mpileup -q 30 -Q 30 -B’). Post-filtering of SNP and genotype calls was conducted as for the GATK dataset.
As an alternative non-probabilistic approach, we discovered SNPs with the quality-based variant detection tool of the CLC Genomics Workbench v.5.0.1 (CLC bio) following the same quality requirements as applied in the GATK/SAMtools calls. Since the CLC version we used did not offer multi-sample calling (i.e. analyzing all individuals simultaneously) at the time of this study, we detected SNPs for each individual separately and merged the SNP data subsequently using R scripts. In this merged dataset, a missing call for an individual for a certain SNP position could arise either because this individual is homozygous for the reference allele or because this site was not sequenced. To obtain this information for all missing genotypes, we used SAMtools v.0.1.12a . We called genotypes according to common practice, applying fixed cut-off rules based on read counts  with ad-hoc R scripts. Sites with an alternative allele frequency between 0-15% were called homozygous for the reference allele, sites with an alternative allele frequency between 20-80% as heterozygous, and sites between 85-100% alternative allele frequency as homozygous for the alternative allele. To be conservative, we denoted sites with borderline alternative allele frequencies (i.e. 15-20% and 80-85%) as ‘N’. We only accepted sites with a maximum of two alleles and covered by minimal eight individuals per study population, as we had done for the GATK and SAMtools datasets.
Finally, we used custom R scripts to intersect the GATK, SAMtools and CLC genotype calls for each individual at all sites (without the population-based filters) only retaining identical genotype calls. After merging the individual data, we again excluded sites with more than two alleles and genotypes in less than eight individuals per population as performed with the other datasets (population-based filtering). Note that further filters could be applied for SNP and genotype calling from high-throughput sequencing data such as filtering clusters of SNPs (for a list see Supporting Information of Auton et. al.)
SNP and genotype validation
To assess genotype accuracy between the probabilistic callers and CLC and estimate the error rate of identical genotype calls, we validated 180 genotypes by classical Sanger sequencing. We randomly picked 58 overlapping SNPs with the only requirements that a minimum of 10 individuals per population had a genotype called, and that at least one individual showed a conflicting genotype call. We validated genotypes of several individuals at those SNPs, which appeared to belong to three different classes: (i) homozygote genotype with GATK/SAMtools but heterozygote with CLC, (ii) heterozygote genotype with GATK/SAMtools but homozygote with CLC, (iii) identical genotype call. We also validated (iv) singleton sites (only one alternative allele called in the entire dataset) that were determined by only one of the callers through Sanger-sequencing of the individual that exhibited the singleton. Additionally, to specifically investigate the genotype accuracy of SNPs present in the intersect data of all three callers, we randomly picked an additional 14 SNPs from this dataset. BEDtools v.2.16.2  was used to extract the DNA sequences 400 bp downstream and upstream of the targeted SNPs from the orangutan reference genome ponAbe2. PCR primers flanking the SNPs were designed with Primer3  (Additional file 8: Table S7). We verified genotypes by sequencing PCR products on a 3730 DNA Analyzer (Applied Biosystems). Details on PCR conditions, cycle sequencing and data analyses are provided in the Additional file 1.
We considered all sites with mapping and base quality phred scores of ≥30, and a minimal sequence depth of 10 as high-quality base pairs (bphiqual). We estimated the target efficiency of our iRRL protocol by calculating which percentage of the actually sequenced bphiqual was predicted by our in-silico digest of ponAbe2 with Hae III (=target sites). Furthermore, for each SNP and population we calculated the observed site heterozygosity as the number of individuals carrying both alleles divided by the total number of called genotypes in this population. Kernel density plots of the minor allele frequency and site heterozygosity distributions were drawn in R with the ‘sm’ package . We assessed the significance of equality of the density estimates among the different datasets with the ‘sm.density.compare’ function with 10,000 permutations.
In addition, we performed sliding-window analyses for each dataset to detect selective sweeps in the genome based on population differentiation using custom R-scripts. For all SNPs we estimated population differentiation using allele-frequency differentials, defined as: D = ∑ [abs(pSK − pWA) + abs(qSK − qWA)]/2, where p and q denote the frequencies of the two alleles for each SNP. We scanned each chromosome ('chrXY_random' excluded) and calculated for each window (100 kb window size, 25 kb step size) the average D of all SNPs. We arbitrarily defined outlier regions as windows with an average population differentiation D > 0.95 (covered by at least 2 SNPs).
Availability of supporting data
The raw sequence data to this article is available in the NCBI Sequence Read Archive, BioProject: PRJNA230877; BioSamples: SAMN02439270-SAMN02439300.
Reduced representation library
Improved reduced-representation library
West Alas population (Sumatra)
South Kinabatangan population (Borneo)
Sequenced bases with mapping and base quality phred scores of ≥30 and a minimal sequence depth of 10
Unified Genotyper of the Genome Analysis Toolkit (Broad Institute)
SNP caller of the CLC Genomics Workbench (CLC bio)
Allele frequency differential
Minor allele frequency.
We thank the Functional Genomics Centre Zurich and Gerrit Kuhn from Life Technologies for their support on sequencing and bioinformatics. We are grateful to Giada Ferrari for assisting in the CLC SNP analyses, David Marques who genotyped some of the WA individuals, Glauco Camenisch for help with in-silico digests, and Erik Willems for providing the map of Borneo and Sumatra. We thank Christian Lexer for general support. We are indebted to the staff at the Sepilok Orangutan Rehabilitation Centre, the Shangri-La's Rasa Ria Resort, the Lok Kawi Wildlife Park and the Sumatran Orangutan Conservation Programme who helped collecting samples. Furthermore, we thank the following institutions for supporting our research: Sabah Wildlife Department (SWD), Indonesian State Ministry for Research and Technology (RISTEK), Indonesian Institute of Sciences (LIPI), and Leuser International Foundation (LIF). Financial support was provided by the Forschungskredit University of Zurich (to MPG), A.H. Schultz Foundation (to MK and MPG), Swiss National Science Foundation (grant no. 3100A-116848 to MK and CPvS), Julius-Klaus Foundation (to MK), Leakey Foundation (to MPG), and the Anthropological Institute & Museum at the University of Zurich.
- Seeb JE, Carvalho G, Hauser L, Naish K, Roberts S, Seeb LW: Single-nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms. Mol Ecol Resour. 2011, 11: 1-8.PubMedView ArticleGoogle Scholar
- Bakewell MA, Shi P, Zhang J: More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc Natl Acad Sci. 2007, 104 (18): 7489-7494. 10.1073/pnas.0701705104.PubMed CentralPubMedView ArticleGoogle Scholar
- Enard D, Depaulis F, Roest Crollius H: Human and Non-human primate genomes share hotspots of positive selection. PLoS Genet. 2010, 6 (2): e1000840-10.1371/journal.pgen.1000840.PubMed CentralPubMedView ArticleGoogle Scholar
- Kosiol C, Vinař T, da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A: Patterns of positive selection in six mammalian genomes. PLoS Genet. 2008, 4 (8): e1000144-10.1371/journal.pgen.1000144.PubMed CentralPubMedView ArticleGoogle Scholar
- Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ, Fledel-Alon A, Tanenbaum DM, Civello D, White TJ, et al: A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 2005, 3 (6): e170-10.1371/journal.pbio.0030170.PubMed CentralPubMedView ArticleGoogle Scholar
- Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK, et al: Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007, 316 (5822): 222-234.PubMedView ArticleGoogle Scholar
- Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Johnson EA, Cresko WA: Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet. 2010, 6 (2): e1000862-10.1371/journal.pgen.1000862.PubMed CentralPubMedView ArticleGoogle Scholar
- Steiner CC, Putnam AS, Hoeck PEA, Ryder OA: Conservation genomics of threatened animal species. Annu Rev Anim Biosci. 2013, 1: 18.11-18.21.View ArticleGoogle Scholar
- Schoville SD, Bonin A, François O, Lobreaux S, Melodelima C, Manel S: Adaptive genetic variation on the landscape: methods and cases. Annu Rev Ecol Evol Syst. 2012, 43 (1): 23-43. 10.1146/annurev-ecolsys-110411-160248.View ArticleGoogle Scholar
- Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner S, Yu F, Bonnen PE, De Bakker P, Deloukas P, Gabriel SB: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 52-58. 10.1038/nature09298.PubMedView ArticleGoogle Scholar
- Altshuler DM, Lander ES, Ambrogio L, Bloom T, Cibulskis K, Fennell TJ, Gabriel SB, Jaffe DB, Shefler E, Sougnez CL: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.PubMedView ArticleGoogle Scholar
- Weigel D, Mott R: The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 2009, 10 (5): 107-10.1186/gb-2009-10-5-107.PubMed CentralPubMedView ArticleGoogle Scholar
- Cao J, Schneeberger K, Ossowski S, Günther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C: Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 2011, 43 (10): 956-963. 10.1038/ng.911.PubMedView ArticleGoogle Scholar
- Kumar S, Schiffer PH, Blaxter M: 959 Nematode genomes: a semantic wiki for coordinating sequencing projects. Nucleic Acids Res. 2012, 40 (D1): D1295-D1300. 10.1093/nar/gkr826.PubMed CentralPubMedView ArticleGoogle Scholar
- Gibbs RA, Taylor JF, Van Tassell CP, Barendse W, Eversole KA, Gill CA, Green RD, Hamernik DL, Kappes SM, Lien S, et al: Genome-wide survey of SNP variation uncovers the genetic structure of cattle breeds. Science. 2009, 324 (5926): 528-532.PubMedView ArticleGoogle Scholar
- Elferink MG, Megens H-J, Vereijken A, Hu X, Crooijmans RPMA, Groenen MAM: Signatures of selection in the genomes of commercial and non-commercial chicken breeds. PLoS One. 2012, 7 (2): e32720-10.1371/journal.pone.0032720.PubMed CentralPubMedView ArticleGoogle Scholar
- Helyar SJ, Hemmer-Hansen J, Bekkevold D, Taylor MI, Ogden R, Limborg MT, Cariani A, Maes GE, Diopere E, Carvalho GR, et al: Application of SNPs for population genetics of nonmodel organisms: new opportunities and challenges. Mol Ecol Resour. 2011, 11: 123-136.PubMedView ArticleGoogle Scholar
- Nielsen R, Paul JS, Albrechtsen A, Song YS: Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011, 12 (6): 443-451. 10.1038/nrg2986.PubMed CentralPubMedView ArticleGoogle Scholar
- Van Tassell CP, Smith TP, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nat Methods. 2008, 5 (3): 247-252. 10.1038/nmeth.1185.PubMedView ArticleGoogle Scholar
- Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, Selker EU, Cresko WA, Johnson EA: Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One. 2008, 3 (10): e3376-10.1371/journal.pone.0003376.PubMed CentralPubMedView ArticleGoogle Scholar
- Truong HT, Ramos AM, Yalcin F, de Ruiter M, van der Poel HJA, Huvenaars KHJ, Hogers RCJ, van Enckevort LJG, Janssen A, van Orsouw NJ, et al: Sequence-based genotyping for marker discovery and co-dominant scoring in germplasm and populations. PLoS One. 2012, 7 (5): e37565-10.1371/journal.pone.0037565.PubMed CentralPubMedView ArticleGoogle Scholar
- Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE: A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One. 2011, 6 (5): e19379-10.1371/journal.pone.0019379.PubMed CentralPubMedView ArticleGoogle Scholar
- Kerstens H, Crooijmans R, Dibbits B, Vereijken A, Okimoto R, Groenen M: Structural variation in the chicken genome identified by paired-end next-generation DNA sequencing of reduced representation libraries. BMC Genomics. 2011, 12 (1): 94-10.1186/1471-2164-12-94.PubMed CentralPubMedView ArticleGoogle Scholar
- Young AL, Abaan HO, Zerbino D, Mullikin JC, Birney E, Margulies EH: A new strategy for genome assembly using short sequence reads and reduced representation libraries. Genome Res. 2010, 20 (2): 249-256. 10.1101/gr.097956.109.PubMed CentralPubMedView ArticleGoogle Scholar
- Senn H, Ogden R, Cezard T, Gharbi K, Iqbal Z, Johnson E, Kamps-Hughes N, Rosell F, McEwing R: Reference-free SNP discovery for the Eurasian beaver from restriction site-associated DNA paired-end data. Mol Ecol. 2013, 22 (11): 3141-3150. 10.1111/mec.12242.PubMedView ArticleGoogle Scholar
- Van Bers NEM, Van Oers K, Kerstens HHD, Dibbits BW, Crooijmans RPMA, Visser ME, Groenen MAM: Genome-wide SNP detection in the great tit Parus major using high throughput sequencing. Mol Ecol. 2010, 19: 89-99.PubMedView ArticleGoogle Scholar
- Seeb LW, Templin WD, Sato S, Abe S, Warheit K, Park JY, Seeb JE: Single nucleotide polymorphisms across a species’ range: implications for conservation studies of Pacific salmon. Mol Ecol Resour. 2011, 11: 195-217.PubMedView ArticleGoogle Scholar
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407 (6803): 513-516. 10.1038/35035083.PubMedView ArticleGoogle Scholar
- Amaral A, Megens H-J, Kerstens H, Heuven H, Dibbits B, Crooijmans R, den Dunnen J, Groenen M: Application of massive parallel sequencing to whole genome SNP discovery in the porcine genome. BMC Genomics. 2009, 10 (1): 374-10.1186/1471-2164-10-374.PubMed CentralPubMedView ArticleGoogle Scholar
- Kerstens H, Crooijmans R, Veenendaal A, Dibbits B, Chin-A-Woeng T, den Dunnen J, Groenen M: Large scale single nucleotide polymorphism discovery in unsequenced genomes using second generation high throughput sequencing technology: applied to turkey. BMC Genomics. 2009, 10 (1): 479-10.1186/1471-2164-10-479.PubMed CentralPubMedView ArticleGoogle Scholar
- Hyten D, Cannon S, Song Q, Weeks N, Fickus E, Shoemaker R, Specht J, Farmer A, May G, Cregan P: High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics. 2010, 11 (1): 38-10.1186/1471-2164-11-38.PubMed CentralPubMedView ArticleGoogle Scholar
- Kraus R, Kerstens H, Van Hooft P, Crooijmans R, Van Der Poel J, Elmberg J, Vignal A, Huang Y, Li N, Prins H, et al: Genome wide SNP discovery, analysis and evaluation in mallard (Anas platyrhynchos). BMC Genomics. 2011, 12 (1): 150-10.1186/1471-2164-12-150.PubMed CentralPubMedView ArticleGoogle Scholar
- Wiedmann R, Smith T, Nonneman D: SNP discovery in swine by reduced representation and high throughput pyrosequencing. BMC Genet. 2008, 9 (1): 81-10.1186/1471-2156-9-81.PubMed CentralPubMedView ArticleGoogle Scholar
- Cutler DJ, Jensen JD: To pool, or Not to pool?. Genetics. 2010, 186 (1): 41-43. 10.1534/genetics.110.121012.PubMed CentralPubMedView ArticleGoogle Scholar
- Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z: A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform. 2013, Jan 21. [Epub ahead of print]Google Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralPubMedView ArticleGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43 (5): 491-498. 10.1038/ng.806.PubMed CentralPubMedView ArticleGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, et al: The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20 (9): 1297-1303. 10.1101/gr.107524.110.PubMed CentralPubMedView ArticleGoogle Scholar
- Le SQ, Durbin R: SNP detection and genotyping from low-coverage sequencing data on multiple diploid samples. Genome Res. 2011, 21 (6): 952-960. 10.1101/gr.113084.110.PubMed CentralPubMedView ArticleGoogle Scholar
- Pool JE, Hellmann I, Jensen JD, Nielsen R: Population genetic inference from genomic sequence variation. Genome Res. 2010, 20 (3): 291-300. 10.1101/gr.079509.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Nielsen R: Molecular signatures of natural selection. Annu Rev Genet. 2005, 39 (1): 197-218. 10.1146/annurev.genet.39.073003.112420.PubMedView ArticleGoogle Scholar
- Sanchez C, Smith T, Wiedmann R, Vallejo R, Salem M, Yao J, Rexroad C: Single nucleotide polymorphism discovery in rainbow trout by deep sequencing of a reduced representation library. BMC Genomics. 2009, 10 (1): 559-10.1186/1471-2164-10-559.PubMed CentralPubMedView ArticleGoogle Scholar
- Groves C: Primate Taxonomy. 2001, Washington, DC (USA): Smithsonian BooksGoogle Scholar
- Delgado AR, van Schaik CP: The behavioral ecology and conservation of the orangutan (pongo pygmaeus): a tale of two islands. Evol Anthropol. 2000, 9 (5): 201-218. 10.1002/1520-6505(2000)9:5<201::AID-EVAN2>3.0.CO;2-Y.View ArticleGoogle Scholar
- Prado-Martinez J, Sudmant PH, Kidd JM, Li H, Kelley JL, Lorente-Galdos B, Veeramah KR, Woerner AE, O’Connor TD, Santpere G: Great ape genetic diversity and population history. Nature. 2013, 499 (7459): 471-475. 10.1038/nature12228.PubMed CentralPubMedView ArticleGoogle Scholar
- Hvilsom C, Qian Y, Bataillon T, Li Y, Mailund T, Sallé B, Carlsen F, Li R, Zheng H, Jiang T: Extensive X-linked adaptive evolution in central chimpanzees. Proc Natl Acad Sci. 2012, 109 (6): 2054-2059. 10.1073/pnas.1106877109.PubMed CentralPubMedView ArticleGoogle Scholar
- Auton A, Fledel-Alon A, Pfeifer S, Venn O, Ségurel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J, et al: A fine-scale chimpanzee genetic map from population sequencing. Science. 2012, 336 (6078): 193-198. 10.1126/science.1216872.PubMed CentralPubMedView ArticleGoogle Scholar
- Locke DP, Hillier LW, Warren WC, Worley KC, Nazareth LV, Muzny DM, Yang S-P, Wang Z, Chinwalla AT, Minx P, et al: Comparative and demographic analysis of orang-utan genomes. Nature. 2011, 469 (7331): 529-533. 10.1038/nature09687.PubMed CentralPubMedView ArticleGoogle Scholar
- Prufer K, Munch K, Hellmann I, Akagi K, Miller JR, Walenz B, Koren S, Sutton G, Kodira C, Winer R, et al: The bonobo genome compared with the chimpanzee and human genomes. Nature. 2012, 486 (7404): 527-531.PubMed CentralPubMedGoogle Scholar
- Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T, et al: Insights into hominid evolution from the gorilla genome sequence. Nature. 2012, 483 (7388): 169-175. 10.1038/nature10842.PubMed CentralPubMedView ArticleGoogle Scholar
- van Schaik CP, Marshall AJ, Wich SA: Orangutans - Geographic Variation in Behavioral Ecology and Conservation. 2009, Oxford (UK): Oxford University PressGoogle Scholar
- Nater A, Arora N, Greminger MP, van Schaik CP, Singleton I, Wich SA, Fredriksson G, Perwitasari-Farajallah D, Pamungkas J, Krützen M: Marked population structure and recent migration in the critically endangered Sumatran orangutan (Pongo abelii). J Hered. 2013, 104 (1): 2-13. 10.1093/jhered/ess065.PubMedView ArticleGoogle Scholar
- Nater A, Nietlisbach P, Arora N, van Schaik CP, van Noordwijk MA, Willems EP, Singleton I, Wich SA, Goossens B, Warren KS, et al: Sex-biased dispersal and volcanic activities shaped phylogeographic patterns of extant orangutans (genus: Pongo). Mol Biol Evol. 2011, 28 (8): 2275-2288. 10.1093/molbev/msr042.PubMedView ArticleGoogle Scholar
- Steiper ME: Population history, biogeography, and taxonomy of orangutans (Genus: Pongo) based on a population genetic meta-analysis of multiple loci. J Hum Evol. 2006, 50 (5): 509-522. 10.1016/j.jhevol.2005.12.005.PubMedView ArticleGoogle Scholar
- Arora N, Nater A, van Schaik CP, Willems EP, van Noordwijk MA, Goossens B, Morf N, Bastian M, Knott C, Morrogh-Bernard H, et al: Effects of Pleistocene glaciations and rivers on the population structure of Bornean orangutans (Pongo pygmaeus). Proc Natl Acad Sci. 2010, 107 (50): 21376-21381. 10.1073/pnas.1010169107.PubMed CentralPubMedView ArticleGoogle Scholar
- Nater A: Dissertation. Processes Underlying Genetic Differentiation and Speciation in Orangutans (Pongo spp.). 2012, Switzerland: University of ZurichGoogle Scholar
- Esteve-Codina A, Kofler R, Himmelbauer H, Ferretti L, Vivancos AP, Groenen MAM, Folch JM, Rodriguez MC, Perez-Enciso M: Partial short-read sequencing of a highly inbred Iberian pig and genomics inference thereof. Heredity. 2011, 107 (3): 256-264. 10.1038/hdy.2011.13.PubMed CentralPubMedView ArticleGoogle Scholar
- 1000 Genomes project. [http://www.1000genomes.org/analysis
- Amaral AJ, Ferretti L, Megens H-J, Crooijmans RPMA, Nie H, Ramos-Onsins SE, Perez-Enciso M, Schook LB, Groenen MAM: Genome-wide footprints of pig domestication and selection revealed through massive parallel sequencing of pooled DNA. PLoS One. 2011, 6 (4): e14782-10.1371/journal.pone.0014782.PubMed CentralPubMedView ArticleGoogle Scholar
- Jonker RM, Zhang Q, Van Hooft P, Loonen MJJE, Van der Jeugd HP, Crooijmans RPMA, Groenen MAM, Prins HHT, Kraus RHS: The development of a genome wide SNP set for the Barnacle Goose (Branta leucopsis). PLoS One. 2012, 7 (7): e38412-10.1371/journal.pone.0038412.PubMed CentralPubMedView ArticleGoogle Scholar
- Everett MV, Grau ED, Seeb JE: Short reads and nonmodel species: exploring the complexities of next-generation sequence assembly and SNP discovery in the absence of a reference genome. Mol Ecol Resour. 2011, 11: 93-108.PubMedView ArticleGoogle Scholar
- Sharma R, Goossens B, Kun-Rodrigues C, Teixeira T, Othman N, Boone JQ, Jue NK, Obergfell C, O'Neill RJ, Chikhi L: Two different high throughput sequencing approaches identify thousands of de novo genomic markers for the genetically depleted Bornean elephant. PLoS One. 2012, 7 (11): e49533-10.1371/journal.pone.0049533.PubMed CentralPubMedView ArticleGoogle Scholar
- Sharma R, Arora N, Goossens B, Nater A, Morf N, Salmona J, Bruford MW, Van Schaik CP, Krützen M, Chikhi L: Effective population size dynamics and the demographic collapse of Bornean orang-utans. PLoS One. 2012, 7 (11): e49429-10.1371/journal.pone.0049429.PubMed CentralPubMedView ArticleGoogle Scholar
- Goossens B, Chikhi L, Jalil M, Ancrenaz M, Lackman‒Ancrenaz I, Mohamed M, Andau P, Bruford MW: Patterns of genetic diversity and migration in increasingly fragmented and declining orang‒utan (Pongo pygmaeus) populations from Sabah, Malaysia. Mol Ecol. 2005, 14 (2): 441-456.PubMedView ArticleGoogle Scholar
- Kanthaswamy S, Smith DG: Population subdivision and gene flow among wild orangutans. Primates. 2002, 43 (4): 315-327. 10.1007/BF02629605.PubMedView ArticleGoogle Scholar
- Excoffier L, Hofer T, Foll M: Detecting loci under selection in a hierarchically structured population. Heredity. 2009, 103 (4): 285-298. 10.1038/hdy.2009.74.PubMedView ArticleGoogle Scholar
- Garvin MR, Saitoh K, Gharrett AJ: Application of single nucleotide polymorphisms to non-model species: a technical review. Mol Ecol Resour. 2010, 10 (6): 915-934. 10.1111/j.1755-0998.2010.02891.x.PubMedView ArticleGoogle Scholar
- Thornton K, Andolfatto P: Approximate bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics. 2006, 172 (3): 1607-1619.PubMed CentralPubMedView ArticleGoogle Scholar
- Przeworski M: The signature of positive selection at randomly chosen Loci. Genetics. 2002, 160 (3): 1179-1189.PubMed CentralPubMedGoogle Scholar
- Charlesworth B, Morgan MT, Charlesworth D: The effect of deleterious mutations on neutral molecular variation. Genetics. 1993, 134 (4): 1289-1303.PubMed CentralPubMedGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.PubMed CentralPubMedView ArticleGoogle Scholar
- Rozen S, Skaletsky H: Primer3 on the WWW for General Users and for Biologist Programmers. Bioinformatics Methods and Protocols, Volume 132. Edited by: Misener S, Krawetz S. 1999, Totowa, New Jersey, USA: Humana Press, 365-386.View ArticleGoogle Scholar
- Bowman AW, Azzalini A: R package ‘sm’: nonparametric smoothing methods (version 2.2-4). 2010Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.