The theory of discovering rare variants via DNA sequencing
 Michael C Wendl^{1}Email author and
 Richard K Wilson^{1}
DOI: 10.1186/1471216410485
© Wendl and Wilson; licensee BioMed Central Ltd. 2009
Received: 20 March 2009
Accepted: 20 October 2009
Published: 20 October 2009
Abstract
Background
Rare population variants are known to have important biomedical implications, but their systematic discovery has only recently been enabled by advances in DNA sequencing. The design process of a discovery project remains formidable, being limited to ad hoc mixtures of extensive computer simulation and pilot sequencing. Here, the task is examined from a general mathematical perspective.
Results
We pose and solve the population sequencing design problem and subsequently apply standard optimization techniques that maximize the discovery probability. Emphasis is placed on cases whose discovery thresholds place them within reach of current technologies. We find that parameter values characteristic of rarevariant projects lead to a general, yet remarkably simple set of optimization rules. Specifically, optimal processing occurs at constant values of the persample redundancy, refuting current notions that sample size should be selected outright. Optimal projectwide redundancy and sample size are then shown to be inversely proportional to the desired variant frequency. A second family of constants governs these relationships, permitting one to immediately establish the most efficient settings for a given set of discovery conditions. Our results largely concur with the empirical design of the Thousand Genomes Project, though they furnish some additional refinement.
Conclusion
The optimization principles reported here dramatically simplify the design process and should be broadly useful as rarevariant projects become both more important and routine in the future.
Background
Technological developments continue to dramatically expand the enterprise of DNA sequencing. In particular, the emergence of socalled "nextgeneration" instruments (NGIs) is opening a new chapter of genomic research [1]. If we characterize sequencing economy by the ratio of project speed to total project cost, NGIs are orders of magnitude superior to their traditional Sangerbased predecessors. Indeed, they are the first systems to demonstrate the economic feasibility of sequencing individual genomes on a large scale [2].
Future efforts will undoubtedly use NGIs to address issues in medical sequencing and personal genomics [3], but these instruments are also poised for major contributions at the population level [4, 5]. For example, the Thousand Genomes Project (TGP) is focusing on comprehensive identification of variants in the human population through cohortlevel wholegenome sequencing using NGIs [6, 7]. One of its main goals is to discover and characterize rare single nucleotide alleles, basically those present at minor allele frequencies around 1% or less. This region was not accessible to the earlier HapMap Project [8]. Rarer instances are obviously much more difficult to find and necessitate gathering enormously larger amounts of data. Such demands will obviously extend to any future such projects one might envision, including those for model organisms, agriculturally important species, cancer genomes, infectious agents, etc.
The success of such variation projects depends upon adequately understanding the relevant process engineering issues and subsequently crafting a suitable project design. One concern in traditional singlegenome sequencing is the socalled "stopping problem" [9–11]], which is the proposition of estimating what redundancy will suffice for a desired level of genomic coverage. Variation projects similarly require specification of a total, projectwide redundancy, R. Yet, because they necessarily involve multiple genomes, an essentially new design question also emerges. That is, how does one optimize the number of samples, σ, versus the redundancy allotted per sample, ρ, such that the probability of finding a rare variant, P_{ v }, is maximized? The existence of such optima is intuitively clear. Heavily sequencing only a few samples will tend to miss a variant because it is unlikely to be present in the original sample set. Conversely, light sequencing of too many samples may overlook the variant by virtue of insufficient coverage for any samples actually harboring it. Somewhere between these extremes lie optimum combinations of parameters.
Variables in a MultiGenome Variant Detection Project
variable^{†}  meaning 

P _{ v }  probability of finding a rare variant 
P _{v, min}  minimum acceptable value of P_{ v }for a project 
ρ  haploid persample sequence redundancy 
R  total, projectwide redundancy 
ϕ  frequency of variant in population 
σ  number of samples sequenced in project 
τ  minimum read coverings for detection 
N  minimum variant observations to declare discovery 
Here, we examine optimization from a more focused mathematical perspective. Our treatment accounts for sequence errors via the proxy of a variable read covering count [3, 13], but it omits secondary, projectspecific details like software idiosyncrasies [14], instrumentspecific biases [15], and alignment issues [16]. The solution leads to a set of general, though unexpectedly simple optimization principles, which correct some earlier speculation [17] and are useful as first approximations for actual projects. Because these rules appreciably narrow the solution space, they also offer good starting points for even more targeted numerical and empirical searches that might account for secondary effects, if such are deemed necessary.
Results
The term "rare variant" is routinely taken to mean a rare allele, although it can also mean a rare SNP genotype. Take ϕ to be the variant frequency, i.e. the minor allele frequency or the rare homozygous genotype frequency, as appropriate. We assume the TGP convention whereby samples are sequenced separately to uniform depths [6, 7], instead of being pooled first. The general theory then encompasses the multiplegenome population sequencing problem and its subsequent design optimization.
Analytical Characterization of Discovery in Multiple Genomes
and its discovery probability is again given by Eq. 3, except where D_{ G }, replaces D_{ A }.
Statement of the Optimization Problem
In practical terms, we want to most efficiently discover a variant at the lowest possible cost, as represented by R.
Although the problem is framed in terms of finding a single variant, actual projects are apt to be specified according to discovering a certain average number of rare variants. These scenarios are equivalent, as Eq. 6 also quantifies the expected fraction of variants that will be found in the project. For example, P_{v, min}= 0.95 indicates finding 95%, on average, of the variants occurring at some value of ϕ.
Optimizing for Single and Double Variant Observations
Leaving aside the optimization of ρ versus σ for a moment, the least obvious of the projectspecific parameters to specify is arguably N. Higher values may exceed the actual number of instances in the sample set, resulting in a priori failure of the project. We will therefore concentrate on the experimentally relevant special cases N = 1 and N = 2. The former is clearly a minimum requirement, while the latter serves to better discern between a rare population variant and a SNP that is unique to an individual sample (a "private SNP").
Because we have an explicitlydefined equality constraint in the form of Eq. 5, the number of design variables can be reduced by one [18]. Specifically, substituting ρ = R/σ into Eq. 2 allows us to write a constrained form of the coverage probability, which in turn furnishes constrained expressions for the probabilities of events D_{ A }and D_{ G }. It is expedient at this point to switch from the eventbased notation of probability used up until now to the Eulerian (functional) convention for the calculusbased aspect of optimization. Specifically, let f_{τ, i}with i ∈ {A, G} represent the nowconstrained probabilities of D_{ A }and D_{ G }. (A detailed explanation of the switch in notation appears in "Mathematical Preliminaries".) We now state the following important optimization conditions.
In particular, the roots of these equations in σ indicate maxima in P_{ v }for rare alleles and genotypes. Each setting of the independent variables has one such optimum, σ*, which is necessarily a global optimum.
Discussion
Finding rare variants is clearly an important aspect of both population and medical genetics [19]. The discovery process was not feasible before the advent of NGIs, but is now being actively prototyped through efforts like the TGP [6, 7] and will likely become more routine in the future. This eventuality motivates examination of the problem from a general perspective, similar in spirit to theoretical treatments of single genomes [20]. The following sections report on both some of the broad trends across the design variable spectrum, as well as optimal conditions for the important special cases of N = 1 and N = 2.
General Trends
Fig. 1 also shows that curves are not symmetric with respect to σ*. The rate of dropoff of P_{ v }for a given projectwide redundancy is much more severe for σ <σ*, implying that it is better to err in sequencing too many samples rather than too few. It is interesting to examine one of the TGP sequencing pilot phases in this context, which specifies 2× data collection for each of σ = 180 samples [6, 7]. Here, R = 2·180 = 360, which is one of the curves plotted in Fig. 1. Using the above thresholds, this design yields P_{ v }≈ 61%, whereas the optimal configuration returns P_{ v }≈ 66% for only about 100 samples. Despite using almost twice as many samples as is optimal, this design remains relatively good, precisely because of the nonsymmetric behavior.
Constant SampleSize Designs and the Stalling Effect
which is asymptotically identical to what is obtained if coverage is not considered at all [5]. The basic problem associated with constant samplesize designs is immediately apparent in this equation. Given small ϕ, the exponential term decays very slowly and can only be compensated for by increasing σ. The challenge, of course, is to do this such that P_{ v }attains a maximum.
Remarks on Optimization Methods
We commented above that empirical prototyping and numerical simulation are unlikely to give complete insights to the general optimization problem because of the size of the solution space. Consider that the relationship between two parameters requires only a single curve on an XY plot, three parameters require a family of curves on one plot, four a textbook of familytype plots, and so forth. Richard Bellman, who developed the optimization technique of dynamic programming, called this phenomenon the "curse of dimensionality". Table 1 shows that we have 8 variables in our particular problem, however, even this is somewhat misleading because it does not consider the probabilistic nature of the problem. That is, P_{ v }can only be established as an expected value through a sufficient number of repeated trials for each particular combination of the independent variables. This is the basic tactic used in simulation.
The population model in Thms. 1 and 2 improves matters considerably, furnishing P_{ v }explicitly in terms of (τ, R, σ, ϕ, N). One could march through every combination of these variables, evaluating P_{ v }for each, and log maxima that attain given levels of P_{v, min}. Though this approach would be enormously more efficient than naïve bruteforce simulation, the calculations needed to adequately survey the floatingpoint "continuum" of the realvalued variables remain basically infeasible. Consequently, we still might not expect to discern any latent general laws.
The Weak Optimization Problem
We resort instead to Thm. 3, whose roots for N = 1 and N = 2 represent optimal sample sizes, σ*. Let us first describe some unexpected properties found among the independent variables. These are important in that they furnish a direct solution to what might be called the weak optimization problem. This is the proposition that relaxes the condition defined by P_{v, min}. In effect, weak optimization provides the optimal probability, , subject to a predetermined R rather than a given P_{v, min}> 0.
Constants Associated with Optimum Designs
rare variant  τ  ρ*(τ) 



genotype  1  2.5  0.512  2.5 
genotype  2  6.4  0.690  6.4 
allele  2  3.6  0.537  1.8 
allele  1  special case, see Eq. 10 
Remarks on the Special Case of τ = 1 for Rare Alleles
Here, we have the seemingly contradictory implication that we should spread a finite amount of sequence resources over the largest number of samples, each of which will then have a vanishingly small ρ. Mathematically speaking, the rate by which the persample f_{1, A}decreases precisely offsets the favorable rate of increasing sample size, whereby P_{ v }does not asymptotically vanish. However, there will usually be good secondary reasons for limiting σ in practice, e.g. cost of sample procurement. Moreover, conditions approach the limiting value rather quickly, for example setting ρ = R/σ ≤ 0.1 brings P_{ v }very close to the expression in Eq. 10. R is the main factor governing discovery under these conditions and its value can be calculated for a desired P_{ v }by simply inverting Eq. 10.
Optimal Designs for Single and Double Variant Observations
Optimization Coefficients for 95% Discovery Probability
variant  τ  N  C 

allele  2  1  10.0 
allele  2  2  15.8 
genotype  1  1  14.7 
genotype  1  2  23.3 
genotype  2  1  27.8 
genotype  2  2  44.1 
Consider the example of the TGP, whose sizable ad hoc design effort was already mentioned above. For N = τ = 2 at the 95% level, Table 3 indicates C = 15.8. Assuming 1% rare allele discovery, optimal processing calls for roughly 440 samples sequenced to 3.6× each, for a project total of R = 1580×. Given the longstanding convention of specifying ρ in whole units, these results largely confirm the TGP design, though in a more precise fashion. That is, TGP has only winnowed the sample size to 400500 per population cluster, with each sample sequenced to 4× [6, 7]. The associated P_{ v }curve is relatively flat in 400 ≤ σ ≤ 500, but this imprecision, coupled with a round value of ρ, still imposes a degree of cost liability. For instance, on the outer end, the project would expend 4·500 = 2000× in data, roughly 25% more than that required for 95% confidence. Project modifications are readily analyzed, for example, reaching alleles down to ϕ = 0.5% would simply require doubling the project: about 880 samples with R = 3160×. Analysis of genotypes is now similarly trivial.
Conclusion
Sequence variation is often called the "currency" of genetics [4] and wholegenome sequence variation projects, enabled by continuing advances in technology, will likely become both increasingly important and routine in biomedical research. Although finding common occurrences is no longer considered to be very difficult, rare ones remain challenging because of the significantly larger amounts of data that must be gathered. Process optimization has to be considered much more carefully here. We have reported a general, though remarkably simple set of optimization principles based on analyzing the population sequencing problem. Results largely confirm the design of a special case, that of the TGP, but also permit immediate analysis of a much broader set of possible project requirements. Derivation of optimal conditions for even higher N and/or τ would be a mechanical, albeit not entirely trivial extension of the mathematics shown here, but the experimental feasibility of such designs for future projects remains unclear.
Population structure is another consideration, as rare variants are likely to be associated with particular geographic regions and their subpopulations [4]. A few issues are relevant here. First, some studies target the variation underlying specific phenotypes [21], but variant discovery projects do not place strong preference on the kinds of variation that are sought. Second, ρ* is not a function of rareness (Fig. 4), meaning that latent populationrelated differences in frequency will not ruin optimality. One should simply treat each desired subpopulation separately, making no differential adjustments to persample redundancies. This strategy assures discovery of populationspecific variants and, incidentally, is precisely what the TGP is following.
Methods
Mathematical Preliminaries
This section expands on some of the mathematical esoterica involved in establishing the theory.
Chain Rule
This principle enables one to find the derivative of a function that itself depends on another function [22]. In essence, it establishes a product rule for the respective derivatives. For example, if y = z^{3} and z = x^{2} + 1, then dy/dx = dy/dz·dz/dx = 3z^{2}·2x = 6x(x^{2} + 1)^{2}. Chain Rule is used in the logarithmic differentiation process described below.
Independently and Identically Distributed (IID)
This term means that all random variables in a collection are independent of one another, i.e. they have no mutual influences or relationships, and that each has the same probability as all the others [23]. Coin flipping is a simple example. The current flip is not influenced by past ones, nor does it influence future ones, and each flip has the same probability of showing, say, "heads". This concept is the basis of ultimately establishing the binomial nature of P_{ v }in Theorem 1.
Logarithmic Differentiation
This mathematical device employs the Chain Rule (see above) to differentiate functions whose forms render them difficult to handle using more basic rules. Proof of Theorem 3 (below) requires this treatment because the independent variable being differentiated against appears in the exponent. An illustrative example having precisely the same issue is y = e^{ x }, which is readily shown by this procedure to be its own derivative. Applying Chain Rule to the logarithmic form, ln y = x, yields y^{1}·dy/dx = 1, from which dy/dx = y = e^{ x }immediately follows.
Notation
This aspect is complicated by the fact that the theory straddles two different branches of mathematics: probability and calculus. In the former case, notation is primarily concerned with specifying configurations of events, while in the latter, Euler's convention is used to describe functional dependence on a set of independent variables. This necessitates a change in notation as we move from the probabilistic discovery model in Thms. 1 and 2 to the calculusbased optimization process in Thm. 3.
which now depends upon τ, R, and σ. In turn, this expression is substituted into Eqs. 1 and 4 to obtain constrained probabilities for events D_{ A }and D_{ G }, respectively, with dependence now extending to ϕ, as well. From here on, let us consider these event probabilities simply as mathematical functions. For example, f_{1, G}is the expression obtained by setting τ = 1 in Eq. 11, squaring it, and multiplying by ϕ, i.e. it is the constrained probability of the event D_{ G }originally introduced in Eq. 4. Under this notation, we can then easily represent all such functions universally by writing them in a form f_{τ, i}= f_{τ, i}(ϕ, R, σ), where i ∈ {A, G}. This is the convention we follow in both Thm. 3 (above) and its proof (below).
Roots of a Function
Roots are values of the independent variable which cause a function to vanish, i.e. to be equal to zero. For example, y = x^{2}  9 can be factored as y = (x + 3) (x  3), showing that x = ±3 are the roots for which y = 0. This concept is relevant to the proof of Theorem 3 (below) because maxima within the P_{ v }family of functions occur at roots in σ of the first derivatives. Roots play a similar role in solving Eqs. 15 and 16.
Proofs of Theorems 1 to 3
Theorem 1: Let A_{ j }and C_{ j }be the events, respectively, that an allele variant exists on chromosome j in a sample at location x and that x is spanned (covered) by at least τ sequence reads. The detection event is D_{ A }= (A_{1} ∩ C_{1}) ∪ (A_{2} ∩ C_{2}). Given the presumed IID (see "Mathematical Preliminaries") nature of alleles and coverage with respect to chromosomes, ϕ = P(A_{1}) = P(A_{2}) and P(C) = P(C_{1}) = P(C_{2}), from which Eq. 1 follows directly. Eq. 2 is a corollary of diploid covering theory [24]. Finally, with respect to any given sample, D_{ A }is a Bernoulli process: an allele is either detected, or it is not. Given uniform ρ for each sample and the random selection of presumably independent genomes, the process is IID. The distribution of detected variants is then binomial [23], from which Eq. 3 follows directly.
Theorem 2: Let G represent the existence of a rare genotype in a sample. Since both alleles must be discerned, the detection event is D_{ G }= G ∩ C_{1} ∩ C_{2}. Because coverage of x is not a function of whether the genotype is actually present and vice versa, G and C_{1} ∩ C_{2} are independent, whereby Eq. 4 follows directly.
for the special cases of interest, N = 1 and N = 2, respectively.
for N = 1 and N = 2, respectively. In general, P_{ v }≠ 1 in either case, so the conditions must instead be satisfied by the terms in square brackets. Eqs. 7 and 8 follow directly.
The fact that there is only a single, global optimum, σ*, for each case is a consequence of P_{ v }being a unimodal function in σ. In general, P_{ v }vanishes monotonically for σ > σ* because P(C) → 0, and consequently f_{τ, i}→ 0, as σ is increased under finite values of R. The exception is f_{1, A}, for which P_{ v }asymptotically approaches a maximum (Eq. 10).
Solution of the General Optimization Problem
That is, given τ, R, and P_{v, min}the values of ϕ under which the process is optimal are the roots of Eqs. 15 and 16.
Derivatives and Numerical Method
Note that an equation for f_{1, A}is absent because the case of τ = 1 for rare alleles does not have a welldefined optimum (Eq. 10).
Eqs. 7, 8, 15, and 16 all depend upon the concept of finding the roots of an equation. (See "Mathematical Preliminaries" above.) Although none is readily factorable, they can be solved by the bisection algorithm, which is straightforward to apply, has reasonably good convergence behavior, and is extremely robust [25].
Abbreviations
 NGI:

nextgeneration sequencing instrument
 TGP:

Thousand Genomes Project.
Declarations
Acknowledgements
The authors wish to thank Ken Chen, Li Ding, Elaine Mardis, and John Wallis for comments on the draft manuscript, as well as the referees for their suggestions related to making the mathematical content more accessible. This work was partially supported by grant HG003079 from the National Human Genome Research Institute (R. K. Wilson, PI).
Authors’ Affiliations
References
 Mardis ER: The impact of nextgeneration sequencing technology on genetics. Trends in Genetics. 2008, 24: 133141.View ArticlePubMedGoogle Scholar
 Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome center's improvements to the Illumina sequencing system. Nature Methods. 2008, 5: 10051010. 10.1038/nmeth.1270.PubMed CentralView ArticlePubMedGoogle Scholar
 Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling D, DunfordShore BH, McGrath S, Hickenbotham M, et al: DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature. 2008, 456: 6672. 10.1038/nature07485.PubMed CentralView ArticlePubMedGoogle Scholar
 Chakravarti A: Population genetics  Making sense out of sequence. Nature Genetics. 1999, 21: 5660. 10.1038/4482.View ArticlePubMedGoogle Scholar
 Zwick ME, Cutler DJ, Chakravarti A: Patterns of genetic variation in Mendelian and complex traits. Annual Review of Genomics and Human Genetics. 2000, 1: 387407. 10.1146/annurev.genom.1.1.387.View ArticlePubMedGoogle Scholar
 Kaiser J: A plan to capture human diversity in 1000 genomes. Science. 2008, 319: 39510.1126/science.319.5862.395.View ArticlePubMedGoogle Scholar
 Siva N: 1000 genomes project. Nature Biotechnology. 2008, 26: 256PubMedGoogle Scholar
 Gibbs RA, Belmont JW, Boudreau A, Leal SM, Hardenbol P, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, et al: A haplotype map of the human genome. Nature. 2005, 437: 12991320. 10.1038/4371233a.View ArticleGoogle Scholar
 Wendl MC, Waterston RH: Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. Genome Research. 2002, 12: 19431949. 10.1101/gr.655102.PubMed CentralView ArticlePubMedGoogle Scholar
 Wendl MC, Barbazuk WB: Extension of LanderWaterman theory for sequencing filtered DNA libraries. BMC Bioinformatics. 2005, 6: article no. 24510.1186/147121056245.View ArticleGoogle Scholar
 Wendl MC: Random covering of multiple onedimensional domains with an application to DNA sequencing. SIAM Journal on Applied Mathematics. 2008, 68: 890905. 10.1137/06065979X.View ArticleGoogle Scholar
 Li B, Leal SM: Discovery of rare variants via sequencing: Implications for association studies [abstract]. Genetic Epidemiology. 2008, 32: 702Google Scholar
 Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, et al: The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008, 452: 872876. 10.1038/nature06884.View ArticlePubMedGoogle Scholar
 Chen K, McLellan MD, Ding L, Wendl MC, Kasai Y, Wilson RK, Mardis ER: PolyScan: An automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Research. 2007, 17: 659666. 10.1101/gr.6151507.PubMed CentralView ArticlePubMedGoogle Scholar
 Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, et al: Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology. 2009, 10: article no. R3210.1186/gb2009103r32.View ArticleGoogle Scholar
 Whiteford N, Haslam N, Weber G, PrügelBennett A, Essex JW, Roach PL, Bradley M, Neylon C: An analysis of the feasibility of short read sequencing. Nucleic Acids Research. 2005, 33: article no. e17110.1093/nar/gni170.View ArticleGoogle Scholar
 Gibbs R: Deeper into the genome. Nature. 2005, 437: 12331234. 10.1038/4371233a.View ArticlePubMedGoogle Scholar
 Vanderplaats GN: Numerical Optimization Techniques for Engineering Design. 1984, New York NY: McGrawHillGoogle Scholar
 Fearnhead NS, Wilding JL, Winney B, Tonks S, Bartlett S, Bicknell DC, Tomlinson IPM, Mortensen NJM, Bodmer WF: Multiple rare variants in different genes account for multifactorial inherited susceptibility to colorectal adenomas. Proceedings of the National Academy of Sciences. 2004, 101: 1599215997. 10.1073/pnas.0407187101.View ArticleGoogle Scholar
 Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 1988, 2: 231239. 10.1016/08887543(88)900079.View ArticlePubMedGoogle Scholar
 Halushka MK, Fan JB, Bentley K, Hsie L, Shen NP, Weder A, Cooper R, Lipshutz R, Chakravarti A: Patterns of singlenucleotide polymorphisms in candidate genes for bloodpressure homeostasis. Nature Genetics. 1999, 22: 239247. 10.1038/10297.View ArticlePubMedGoogle Scholar
 Courant R: Differential and Integral Calculus. 1937, New York NY: Interscience, I:Google Scholar
 Feller W: An Introduction to Probability Theory and Its Applications. 1968, New York NY: John Wiley & Sons, 3Google Scholar
 Wendl MC, Wilson RK: Aspects of coverage in medical DNA sequencing. BMC Bioinformatics. 2008, 9: article no. 23910.1186/147121059239.View ArticleGoogle Scholar
 Hamming RW: Numerical Methods for Scientists and Engineers. 1962, New York NY: McGrawHillGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.