Estimating the information value of polymorphic sites using pooled sequences
© Malde; licensee BioMed Central Ltd. 2014
Published: 17 October 2014
High-throughput sequencing is a cost effective method for identifying genetic variation, and it is currently in use on a large scale across the field of biology, including ecology and population genetics. Correctly identifying variable sites and allele frequencies from sequencing data remains challenging, in large part due to artifacts and biases inherent in the sequencing process. Selecting variants that are diagnostic is commonly done using diversity statistics like F ST , but these measures are not ideal for the task.
Here, we develop a method that directly calculates the expected amount of information gained from observing each variant site. We then develop and implement a conservative estimator that takes into account uncertainity introduced by sampling bias and sequencing error. This estimator is applied to simulated and real sequencing data, and we discuss how it performs compared to the commonly used existing methods for identifying diagnostic polymorphisms.
The expected information content gives an easy to interpret measure for the usefulness of variant sites. The results show that we achieve a clear separation between true variants and noise, allowing us to select candidate sites with a high degree of confidence.
A large part of the genetic variation in a species come in the form of single nucleotide polymorphisms (SNPs) . Technological advances in high-throughput sequencing have made it possible to detect variations on genome-wide scales, also for non-model species. With current developments in high resolution genotyping technologies like SNP arrays and high-throughput mass spectrometry, SNP analysis is quickly becoming an indispensable tool in many fields of biology.
In spite of improvements to technology, SNP analysis is still limited by genotyping cost and capacity. It therefore remains an important challenge to find a set of SNP markers that is as effective and efficient as possible. To be precise, we want to identify the minimal set of SNPs that must be examined in order to draw conclusions with an acceptable certainty - viz., the SNPs that are most informative for the task at hand. For instance, when selecting SNPs that are diagnostic (i.e., that can be used to identify individuals as belonging to one of two or more groups), we would like to pick a small set of sites that provide the most information about the individual's group affiliation. Although one could achieve the same certainty (at a somewhat higher cost) using a larger set of individually less informative SNPs, this would also increase the risk of overfitting the model to the data. Careful selection of SNPs is therefore not just an issue of economy and expedience, but also of accuracy.
Diagnostic SNP identification
In practice, diagnostic SNPs are usually identified and ranked or selected using some variation of the following procedure:
First, samples are collected from individuals from the populations of interest, and DNA is extracted and sequenced to a depth deemed reasonable in terms of cost and benefit. Sequencing each sample individually is advantageous for reliably detecting rare alleles and to ensure a more complete SNP discovery , and is less sensitive to variation in molarity in the samples [3, 2]. In spite of these advantages, collecting multiple samples in pools before sequencing can still be more cost effective, in particular for novel SNP discovery in less well-studied species and when sample material is abundant . The sequence data is typically filtered for quality and contamination, mapped to a reference genome sequence using a short read alignment program, and putative SNPs are identified when reads differ from each other or from the reference.
The set of putative SNPs are then evaluated using some diversity statistic (e.g., F ST ), or statistical confidence in allele frequency difference (e.g., using Fisher's exact test, ). Often several measures are used, and candidates are typically filtered on one criterion (e.g., p-values), and then ranked using the other (e.g., F ST ). Sites can also be excluded based on coverage and more specific error estimates using base quality or mapping quality.
In practice, some additional care is often taken in the selection of candidate sites. For instance, one might require a certain minimum distance between sites in the genome in order to avoid unwanted correlations, or exclude sites in regions with low average mapping quality.
Challenges with this approach
There are many statistics that could be used to identify diagnostic SNPs (the properties of several such statistics are reviewed by Rosenberg , other options are discussed by Zhou et al. ), but F ST is perhaps most commonly used , and is readily calculated from identified allele frequencies.
Unfortunately, F ST is less than ideal for several reasons. It is a population genetics statistic, and must be calculated using some estimator. There exist several different options (e.g., reviews by Weir and Hill  and Holsinger and Weir , others are suggested by Karlsson et al.  and Fumagalli et al. ) which can give different results, and thus F ST statistics may not be directly comparable between studies. F ST is not robust to errors in the data, something that becomes a challenge with the relatively high number of errors and large number of candidate sites that typically arise from sequencing data. When coverage is low, a low number of sequencing errors can shift statistics substantially, and the highest F ST scores tend to come from sites with low coverage. To counter this, coverage thresholds can be used, but this excludes a substantial fraction of candidate sites. And, although commonly used in this role, F ST is controversial as a measure of differentiation. In particular, where heterozygosity within populations is high, F ST will be lowered, regardless of the differences between populations .
An alternative (or complimentary) approach is to use p-values for allele difference, usually calculated using Fisher's exact test, but other options are also possible . One challenge here is that although we are usually interested in the magnitude of the allele differences, this is only taken indirectly into account by p-values. Variation in sequencing coverage means that sites with high coverage will tend to have higher confidence, even if the actual allele difference is small . Even with no difference between population, sampling will introduce artificial differences, which will result in significant p-values if the coverage is sufficiently high. In addition, Fisher's exact test does not generally take into account the possibility of errors - the observation of a single allele will exclude an underlying frequency of zero for the observed allele, even if that may well be the case if the observation is an error.
In the following, we derive a method for calculating the expected information to be gained from genotyping a specific site, and argue that this is a more intuitive and useful measure for evaluating diagnostic SNPs than the commonly used alternatives. We will first describe how to calculate the expected informatino given a priori knowledge of allele frequency, we will then proceed to develop a method to make a conservative estimate for this statistic, taking into account sampling bias and uncertainty in the data. Finally, we provide an implementation, and discuss the results from applying it to both simulated and real data sets.
Expected site information
Given the drawbacks to using F ST discussed above, it is perhaps tempting to instead use some other measures, like nucleotide diversity or absolute difference in allele frequency. However, it is easy to see that nucleotide diversity per site (defined as the probability of samples having different alleles, i.e., p(1 − q) + (1 − p)q where p and q are the major allele frequencies in the two populations) fails to measure divergence when one of the populations has an allele frequency of 0.5 - substituting p = 0.5 in the formula above results in 0.5(1 − q) + (1 − 0.5)q, and it is easy to see that nucleotide diversity will be 0.5 regardless of q.
Absolute difference in allele frequencies (|p − q|) is perhaps better, but consider populations where one allele's frequencies in the two populations are 0.4 and 0.6, respectively. Assigning an individual to a population based on observing this allele not inspire a lot of confidence in the result, they are roughly equally likely. Although the difference between allele frequency is the same for a site with allele frequencies 0.05 and 0.25, observing this allele is here five times as likely in one population as in the other, intuitively making this a much more useful site to observe.
That is, the ratio of this allele's frequencies in each of the populations. For practical reasons, we take the logarithm of the odds. This gives us scores that are additive and symmetric (so that switching the two populations gives us the same score with the opposite sign). Specifically, base two logarithms will give us the score in bits.
Returning to the example at the start of the section, we now find that a site with allele frequencies of 0.4 and 0.6 contributes 0.58 bits of expected information, while 0.05 and 0.25 contributes 2.32 bits. Unlike measures like F ST , measures of I is additive (assuming independence between sites), so the information gained from observing multiple sites is readily calculated, and observing with an ESI of 2.32 bits is equivalent to observing four sites with ESI 0.58.
It may also be instructive to compare this procedure to sequence alignment and position specific score matrices (PSSMs). In sequence alignment, a sequence of nucleotides or amino acids are scored by comparing its match to a target sequence to its match to some base model using log odds scores. The base model to compare against is often implicit (typically using sequences of random composition), but more elaborate models is also possible (). Similarly, position specific frequency matrices are often converted to position specific score matrices using log odds. Calculating the information value from a set of observed alleles is then analogous to scoring an "alignment" of the set of observed alleles to two different sets of allele frequencies.
Allele frequency confidence intervals
In order to apply the above method in practice, we need to measure the allele frequencies in the population. This is problematic for two reasons: First, we do not have precise knowledge of the allele frequencies, we can only estimate them from our sample, which introduces a sampling bias. Second, the sequencing process introduces additional artifacts that add nose and bias to the data. For instance, sequencing errors often result in substitutions, which are observed as apparent alleles. In addition, sequences can be incorrectly mapped, contain contamination, the reference genome can contain collapsed repeats, and the chemistry of the sequencing process is usually also biased - for instance, coverage is often biased by GC content. These artifacts often give the false appearance of variant positions.
One challenge with calculating site information from sequencing data (as opposed to using allele frequencies directly), is that such errors in the data can vastly overestimate the information content. For instance, an allele that appears to be fixed in one population means that any other observed allele will assign the individual to the alternative population - regardless of any other alleles. It is easy to see that an allele frequency of zero results in the odds going either to zero or infinity, and thus the log odds will go to either positive or negative infinity.
For diagnostic SNP discovery, it is more important to ensure that identified SNPs are informative, than to precisely estimate the information content. Thus, we take a conservative approach and use upper and lower limits for the allele frequencies by calculating confidence intervals using the method by Agresti and Coull . In addition, the limits are also adjusted by a factor ∈, corresponding to sequencing error rate. In the following, we will refer to the resulting measure as conservative site information, or CSI.
Statistics for the variants from haplotypes H1, H2, and H3, as mixed in populations P1 and P2.
Simulated reads were then generated with genome coverages of 10x, 20x, and 40x from each of the populations, using substitution rates of 0.002, 0.01 and 0.02. To simplify analysis, the indel rate was held constant at 0.001.
The reads were mapped to the reference genome using the BWA short read mapper , and analyzed using Samtools' mpileup command . In addition to the methods described here, Popoolation  was used to calculate F ST and p-values from Fisher's exact test.
CSI scores for divergent and non-divergent sites
Comparing CSI and traditional statistics
Comparing CSI and FSTfor divergent and non-divergent sites
One important use for SNPs, is to assign individuals to their respective populations or subpopulations. For instance, the quantity of Norwegian farmed salmon exceeds the wild river populations by a large factor. As salmon occasionally escape from sea farms, the ability to effectively identify escapees is important both to identify the farms responsible, as well as quantifying the ecological effects of introgression. Here SNPs will play an important role by providing a low-cost, high resolution data .
Below, we examine pooled salmon sequencing data from rivers Flekkeelva and Suldalselva, and investigate the resulting CSI distributions. From each of the rivers, two pools were sequenced using Illumina HiSeq, resulting in datasets F1, F2, S1 and S2, each containing between 346 and 397 million aligned reads, corresponding to coverages of 11.5x to 13.2x, assuming a genome size of 3 gigabases. The data sets were then merged by river (combining F1 with F2 and S1 with S2), and by replicate (combining F1 with S1 and F2 with S2, to provide a model for false discoveries).
Filtering by coverage
The number of sites with non-zero expected site information, both in absolute numbers and in percent of an estimated 3 gigabase genome, before and after filtering for coverage.
Between rivers, filtered
Between replicates, filtered
Statistics, coverage, and sequencing errors
It is striking that p-values for the non-divergent sites increase with coverage. For instance, out of the 36000 non-divergent sites, we expect approximately 36 sites by chance to have a p-value less than 10−3. For 10x coverage, we find 9, for 20x, we find 35, and for 40x we find 70. This indicates that p-values are biased upwards with increasing coverage, and must be consequently be interpreted with care . The expectation and variance of F ST similarly depends on coverage. In contrast, low coverage in combination with sequencing errors and incorrectly mapped reads here result in a large number of high-scoring non-divergent sites. Using a combination of these measures may be effective, but also effectively narrows the data set, much like a stringent filtering for coverage.
Simulated data is by definition a simplification of reality. For instance, here the data assumes uniform probability of reads across the genome, and unbiased and context independent sequencing errors. Also, divergent and non-divergent positions occur in similar numbers in the simulated data, in reality, there will be a continuous spectrum of allele frequencies, and it will depend globally on the degree of divergence between populations, and locally on selection and other non-random evolution pressures. Results from simulated data must, as always, be interpreted as optimistic. In practice, coverage will vary substantially across a sequenced genome. In general, high variance regions tend to have lower mapping , but other factors are bias caused by GC-content, misassembly and collapsed repeats, copy-number and other structural variations, incorrect mapping, sampling bias (including from variation of molarity in DNA samples). Real data sets must therefore be expected to contain a wide range of coverages, mapping reliability, and sequencing error rates.
Other information theoretic measures
Although not commonly applied, information theoretic measures have been used previously in analyzing genetic variation. Expected site information is related to Kullback-Leibler divergence , but differs in that it is symmetric and extended to multiple alleles. Rosenberg  gives a summary of several alternative statistics, and also develops an information theoretic measure that contrasts individual populations with an average of all population. This measure is then used to infer ancestry, and applied to microsatellite data. Here, we develop an information theoretic measure in a Bayesian context, and apply it to high-throughput sequencing data.
Dealing with sequencing errors and artifacts
Based on the assumption that most sequencing errors will be singletons, Achaz  developed variants of several estimators for Θ which avoids taking singletons into account. Achaz' formulas were later adapted to high-throughput sequencing experiments, and given a more generalized (but approximate) form that allowed an arbitrary lower bound on number of observed alleles . However, much of the genetic diversity is in the form of low frequency alleles, and as singletons also have a high impact on many statistics , these estimators have lower power [24, 2]. It is also possible to attempt to quantify the errors more precisely by leveraging characteristics of the data .
Here, we have focused on the expected information content. As this is an additive measure, it is straightforward to sum over multiple sites to get the expected information for a set of SNPs. Since rare alleles yield more information than common ones, a natural extension might be to consider instead the minimum information content from a set of loci, ensuring that we can reach a conclusion even if we are unlucky with the actual alleles observed. Yet another option is to calculate a confidence interval for the information.
When selecting diagnostic SNPs, we want to find sites that provide the most information regarding our current problem. Although this is commonly measured using statistics like F ST , these are indirect measurements, proxies for the actual information. In addition, we have seen that it and other commonly used statistics have intrinsic biases when applied to sequencing data, due to coverage variation, sequencing artifacts, and mapping errors.
As an alternative, we have derived a direct calculation of the expected site information from allele frequencies, using a Bayesian framework. In addition to being a direct measurement of the value of interest, it has a clear interpretation, and desirable properties, like additivity. We have further developed a conservative estimator for this statistic, and provide an implementation.
The method as described above was implemented in a program, 'varan', which parses read alignments in the standard "mpileup" format as output by the samtools mpileup command. It can currently output several different statistics and estimators, including conservative expected site information (CSI). The software is distributed under the General Public License, and the source code can be downloaded from http://malde.org/~ketil/biohaskell/varan. Further information and documentation is available from http://biohaskell.org/Applications/Varan.
Simulation data, tables, and scripts used in this paper is available from http://malde.org/~ketil/papers/varan. The salmon louse genome used to generate the simulated reads is available from http://sealouse.imr.no/.
I am grateful for the support of my colleagues, both in the IBIS group at Helmholz Zentrum, Munich, Germany, and at the Institute of Marine Research in Bergen, Norway. In particular, I would like to thank Sapna Sharma and François Besnier for helpful discussion and comments. I also appreciate Anna Wargelius and the SALMAT project allowing me access to the sequencing data.
Publication costs for this article were funded by the Institute of Marine Research.
This article has been published as part of BMC Genomics Volume 15 Supplement 6, 2014: Proceedings of the Twelfth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/15/S6.
- Collins FS, Brooks LD, Chakravarti A: A DNA polymorphism discovery resource for research on human genetic variation. Genome research. 1998, 8 (12): 1229-1231.PubMedGoogle Scholar
- Cutler DJ, Jensen JD: To pool, or not to pool?. Genetics. 2010, 186 (1): 41-43. 10.1534/genetics.110.121012.PubMedPubMed CentralView ArticleGoogle Scholar
- Altmann A, Weber P, Quast C, Rex-Haffner M, Binder EB, Müller-Myhsok B: vipR: variant identification in pooled DNA using R. Bioinformatics [ISMB/ECCB]. 2011, 27 (13): 77-84. 10.1093/bioinformatics/btr205.View ArticleGoogle Scholar
- Futschik A, Schlötterer C: The next generation of molecular markers from massively parallel sequencing of pooled DNA samples. Genetics. 2010, 186 (1): 207-218. 10.1534/genetics.110.114397.PubMedPubMed CentralView ArticleGoogle Scholar
- Bansal V, Harismendy O, Tewhey R, Murray SS, Schork NJ, Topol EJ, Frazer KA: Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome research. 2010, 20 (4): 537-545. 10.1101/gr.100040.109.PubMedPubMed CentralView ArticleGoogle Scholar
- Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. The American Journal of Human Genetics. 2003, 73 (6): 1402-1422. 10.1086/380416.PubMedView ArticleGoogle Scholar
- Zhou N, Wang L: Effective selection of informative SNPs and classification on the hapmap genotype data. BMC Bioinformatics. 2007, 8 (1): 484-10.1186/1471-2105-8-484.PubMedPubMed CentralView ArticleGoogle Scholar
- Fumagalli M, Vieira FG, Korneliussen TS, Linderoth T, Huerta-Sánchez E, Albrechtsen A, Nielsen R: Quantifying population genetic differentiation from next-generation sequencing data. Genetics. 2013, 195 (3): 979-992. 10.1534/genetics.113.154740.PubMedPubMed CentralView ArticleGoogle Scholar
- Weir BS, Hill W: Estimating F-statistics. Annual Review of Genetics. 2002, 36 (1): 721-750. 10.1146/annurev.genet.36.050802.093940.PubMedView ArticleGoogle Scholar
- Holsinger KE, Weir BS: Genetics in geographically structured populations: defining, estimating and interpreting F ST . Nature Reviews Genetics. 2009, 10 (9): 639-650. 10.1038/nrg2611.PubMedView ArticleGoogle Scholar
- Karlsson EK, Baranowska I, Wade CM, Hillbertz NHS, Zody MC, Anderson N, Biagi TM, Patterson N, Pielberg GR, Kulbokas EJ, et al: Efficient mapping of mendelian traits in dogs through genome-wide association. Nature genetics. 2007, 39 (11): 1321-1328. 10.1038/ng.2007.10.PubMedView ArticleGoogle Scholar
- Jost L: GST and its relatives do not measure differentiation. Molecular Ecology. 2008, 17 (18): 4015-4026. 10.1111/j.1365-294X.2008.03887.x.PubMedView ArticleGoogle Scholar
- Lin M, Lucas HC, Shmueli G: Research commentary-too big to fail: Large samples and the p-value problem. Information Systems Research. 2013, 24 (4): 906-917. 10.1287/isre.2013.0480.View ArticleGoogle Scholar
- Malde K: The effect of sequence quality on sequence alignment. Bioinformatics. 2008, 24 (7): 897-900. 10.1093/bioinformatics/btn052.PubMedView ArticleGoogle Scholar
- Agresti A, Coull BA: Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician. 1998, 52 (2): 119-126.Google Scholar
- Balzer S, Malde K, Lanzén A, Sharma A, Jonassen I: Characteristics of 454 pyrosequencing data--enabling realistic simulation with FlowSim. Bioinformatics. 2010, 26 (18): 420-425. 10.1093/bioinformatics/btq365.View ArticleGoogle Scholar
- Malde K: Simulating a population genomics data set using FlowSim. BMC Research Notes. 2014, 7 (1): 68-10.1186/1756-0500-7-68.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMedPubMed CentralView ArticleGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al: The sequence alignment/map format and samtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMedPubMed CentralView ArticleGoogle Scholar
- Kofler R, Orozco-terWengel P, De Maio N, Pandey RV, Nolte V, Futschik A, Kosiol C, Schlötterer C: Popoolation: a toolbox for population genetic analysis of next generation sequencing data from pooled individuals. PLoS One. 2011, 6 (1): 15925-10.1371/journal.pone.0015925.View ArticleGoogle Scholar
- Karlsson S, Moen T, Lien S, Glover KA, Hindar K: Generic genetic differences between farmed and wild atlantic salmon identified from a 7K SNP-chip. Molecular Ecology Resources. 2011, 11 (s1): 247-253.PubMedView ArticleGoogle Scholar
- Wang W, Wei Z, Lam TW, Wang J: Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions. Scientific reports. 2011, 1: 55-10.1038/srep00055.PubMedPubMed CentralGoogle Scholar
- Kullback S, Leibler RA: On information and sufficiency. The Annals of Mathematical Statistics. 1951, 79-86.Google Scholar
- Achaz G: Testing for neutrality in samples with sequencing errors. Genetics. 2008, 179 (3): 1409-1424. 10.1534/genetics.107.082198.PubMedPubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.