Short tandem repeat number estimation from pairedend reads for multiple individuals by considering coalescent tree
 Kaname Kojima^{1},
 Yosuke Kawai^{1},
 Naoki Nariai^{2},
 Takahiro Mimori^{1},
 Takanori Hasegawa^{1} and
 Masao Nagasaki^{1}Email author
Published: 31 August 2016
Abstract
Background
Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from highthroughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the difference between the insert size inferred from aligned pairedend reads and the actual insert size. Although the accuracy of repeat numbers estimated with the former approaches is high, the size of target STR regions is limited to the length of sequence reads. On the other hand, the latter approaches can handle STR regions longer than the length of sequence reads. However, repeat numbers estimated with the latter approaches is less accurate than those with the former approaches.
Results
We proposed a new statistical model named coalescentSTR that estimates repeat numbers from pairedend read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by handling coalescent trees as hidden variables, and the summation of the hidden variables is taken on coalescent trees sampled based on phased genotypes located around a target STR region with Markov chain Monte Carlo. In the sampled coalescent trees, repeat number information from insert size data is propagated, and more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads.
For finding the repeat numbers maximizing the likelihood of the model on the estimation of repeat numbers, we proposed a stateoftheart belief propagation algorithm on sampled coalescent trees.
Conclusions
We verified the effectiveness of the proposed approach from the comparison with existing methods by using simulation datasets and real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.
Keywords
Background
The progress of highthroughput sequencing (HTS) technologies enables the variant detection of each individual in genomewide scale in practical time and with reasonable cost. From HTS data, various types of single nucleotide variant (SNV) calling methods have been proposed [1–4], and SNVs for more than a thousand of individuals were accurately detected [5]. However, unlike SNVs, we still have difficulty in accurately detecting structural variations such as genome insertions, genome deletions, short tandem repeat (STR) number polymorphisms, and copy number variations, especially from low coverage HTS data [6].
Some repeat number polymorphisms are associated with various disease phenotypes such as CAG repeats in the Huntingtin gene with Huntington’s disease [7]. From HTS data, several approaches such as lobSTR and RepeatSeq [8, 9] have been proposed for the estimation of repeat numbers in STR regions by directly counting repeat patterns in sequence reads spanning the regions. In these approaches, the accuracy on both the detection of STR variants and estimated repeat numbers is high. Another strategy is to use pairedend reads aligned to the flanking regions of the target STR region in the reference genome [10]. Insert size inferred from the aligned pairedend reads is longer than its actual size if the repeat number is smaller than that in the reference genome. On the other hand, the inferred insert size is shorter if the repeat number is larger. By detecting the difference between the inferred and actual insert size, repeat numbers are estimated. Since insert size is generally longer than sequence reads, this strategy can be used for estimating repeat numbers for relatively long STR regions that cannot be handled by the strategy counting repeat patterns in sequence reads. However, repeat numbers estimated from insert size data are less accurate than those from the strategy counting repeat patterns directly in the sequence reads, especially for low coverage HTS data.
We proposed a new statistical model named coalescentSTR that estimates repeat numbers for multiple individuals simultaneously from pairedend read distances by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented with coalescent trees, which describe the ancestral history of multiple individuals on a local genome region backwards in time [11–13]. By considering the change in repeat numbers in coalescent trees in a natural manner, more accurate estimation of repeat numbers is expected. For the estimation of repeat numbers in the model, coalescent trees handled as hidden variables are sampled with Markov chain Monte Carlo (MCMC) according to phased genotypes around a target STR region. We proposed a new belief propagation method that calculates the loopy belief propagation [14] and the mixedproduct belief propagation [15] by taking the summation on the sampled coalescent trees. By using the proposed belief propagation, approximated maximum configuration of repeat numbers in the model are searched for the estimation of repeat numbers.
In a simulation study, we used synthetically generated HTS data for STR regions mostly longer than read length, and showed the effectiveness of our model from the comparison with other existing methods, especially in handling more individuals. The effectiveness of our approach is also verified from the analysis of real whole exome data of HapMap JPT individuals and whole genome sequencing (WGS) data of HapMap CEU and GBR individuals analyzed in the 1000 Genomes Project (1KGP).
Method
We describe a model considering insert size of pairedend reads for one individual and its extension to consider multiple individuals based on their unobserved genealogy. Procedures for the repeat number estimation are then explained.
Repeat number estimation from pairedend read distance
We consider a statistical model that estimates repeat number in an STR region from pairedend read distance for one individual. We hereafter call the model a basic model. Let s ^{(d)} be the start position of the forward read of the dth aligned read pair. We also let e ^{(d)} be the end position of the reverse read of the dth aligned read pair. The insert size of the dth read pair or the length of the DNA fragment from which the read pair was generated is given by e ^{(d)}−s ^{(d)}, and we denote the insert size e ^{(d)}−s ^{(d)} as l ^{(d)}.
By using the above recurrence formulae, N(s,n) is calculated for s∈{s _{ m }−K,s _{ m }−1} and \(n \in \{n_{\min }, \dots, n_{\max }\}\), where s _{ m } is the start position of the STR region. Since the repeat pattern size u is usually less than or equal to four and can be considered as a constant, the calculation of N(s,n) requires \(O((n_{\max }n_{\min }+1)K)\) time, which is smaller than that required in the naïve way.
Repeat number estimation considering genealogy of multiple individuals
DNA sequences are inherited from parents to offspring, and single base substitutions occur in the inheritance with mutation rate of around 2.0×10^{−8} [16]. Repeat numbers in STR regions also change or mutate in the inheritance from a parent to its offspring with rate ranging usually from 1.0×10^{−4} to 1.0×10^{−3} [17]. From the phased genotypes around an STR region of interest for multiple individuals, we consider their genealogy around the region by using coalescent tree [11–13]. Coalescent tree is a binary tree in which leaves represent the current haplotypes and internal nodes represent past coalescent events of the haplotypes. For each coalescent event, two linages are involved, and cases involving more than two lineages are not considered in our model. The length of each edge in the tree represents time between coalescent events.
For sampling with MCMC, burnin period, period between samples, and the number of samples are respectively set to 50,000, 100, and 100 in our study.
Estimation of repeat numbers in coalescentSTR
where \(\boldsymbol {n}_{i_{1}}\) is a set of \(n_{i_{1}}\) maximizing \(m_{i_{2} \rightarrow i_{1}}(n_{i_{1}})\!\!\!\!\cdot \sum _{g \in \mathcal {G}} m_{p_{i_{1} \mid g} \rightarrow i_{1}}(n_{i_{1}})\). After some iterations of mixedproduct BP, \(\hat {n}_{i_{1}} \in \hat {N}\) is obtained by \(\arg \max _{n_{i_{1}}} m_{i_{2} \rightarrow i_{1}}(n_{i_{1}}) \cdot \sum _{g \in \mathcal {G}} m_{p_{i_{1} \mid g} \rightarrow i_{1}}(n_{i_{1}})\). We first calculate loopy BP in ten cycles and then calculate mixedproduct BP in ten cycles using messages from loopy BP as initial values. Empirically, the above procedure provides better \(\hat {N}\) than only considering mixedproduct BP.
Selection of STR mutation rate
where \(\bar {m}_{i_{2} \rightarrow i_{1}}(n_{i_{1}})\) is \(m_{i_{2} \rightarrow i_{1}}(n_{i_{1}})\) normalized to have a sum of one and \(\bar {m}_{\mathcal {G} \rightarrow i_{1}}(n_{i_{1}})\) is a message to \(n_{i_{1}}\) from coalescent trees \(\sum _{g \in \mathcal {G}} m_{p_{i_{1} \mid g}}(n_{i_{1}})\) normalized to have a sum of one. We consider that messages from pairedend reads to \(n_{i_{1}}\) or \(n_{i_{2}}\) and from coalescent trees to \(n_{i_{1}}\) or \(n_{i_{2}}\) are similar to each other if the STR mutation rate is proper. The value in Eq. (7) is designed to take higher value if those messages are more similar to each other. We consider several STR mutation rates and select the rate with the highest value given by Eq. (7).
Results and discussion
Simulation analysis

Generate a coalescent tree for 2I haplotypes with an algorithm in [18] under the assumption of a constant effective population size.

Obtain phased genotypes at 1,000 bp upstream and downstream positions of the STR region based on the generated coalescent tree and a specified single base substitution rates.

Obtain repeat numbers based on the generated coalescent tree by considering the stepwise model with a specified STR mutation rate.

Five types of the numbers of individuals: 5, 10, 20, 50, and 100.

Two types of STR mutation rates: 1.00×10^{−3} and 2.73×10^{−4}. The former rate is an estimated STR mutation rate for tetranucleotide repeats, and the latter for dinucleotide repeats in human [17].

Directions of pairedend reads are concordant.

The start position of the forward read in each aligned read pair is located before the start position of the STR region.

The end position of the reverse read in each aligned read pair is located after the end position of the STR region.
Comparison of estimated repeat numbers in terms of RMSE for simulation datasets with STR mutation rates of 2.73×10^{−4} and 1.00×10^{−3} and read coverage of 20 ×
STR mutation rate  2.73×10^{−4}  1.00×10^{−3}  

No. of samples  5  10  20  50  100  5  10  20  50  100 
CoalescentSTR  2.30  2.46  2.17  1.25  1.09  4.09  3.22  2.24  1.96  1.81 
CoalescentSTR (shuffled)  2.37  2.42  2.30  1.38  1.96  4.02  3.52  3.05  2.73  3.33 
Basic Model  4.18  5.37  5.00  5.15  4.91  5.39  5.11  5.41  4.94  5.03 
lobSTR  9.09  7.33  6.02  7.47  5.94  10.1  7.04  5.13  6.67  6.20 
RepeatSeq  9.09  7.33  6.02  7.47  5.91  10.1  7.03  5.09  6.66  6.18 
STRViper  8.38  6.87  5.57  6.91  5.60  9.37  6.59  4.90  6.23  5.75 
Comparison of estimated repeat numbers in terms of RMSE for simulation datasets with STR mutation rates of 2.73×10^{−4} and 1.00×10^{−3} and read coverage of 40 ×
STR mutation rate  2.73×10^{−4}  1.00×10^{−3}  

No. of samples  5  10  20  50  100  5  10  20  50  100 
CoalescentSTR  2.47  2.20  1.86  1.11  1.00  3.28  2.75  2.18  1.74  1.61 
CoalescentSTR (shuffled)  2.58  2.32  2.16  1.38  1.94  3.66  3.36  2.91  2.73  3.03 
Basic Model  4.18  4.48  4.39  4.18  4.07  4.74  4.09  4.33  4.17  4.19 
lobSTR  9.09  7.32  6.02  7.47  5.94  10.10  7.05  5.12  6.67  6.20 
RepeatSeq  9.10  7.34  6.02  7.47  5.94  10.08  7.03  5.11  6.65  6.18 
STRViper  7.93  6.55  5.32  6.57  5.37  8.88  6.32  4.81  5.97  5.51 
If no STR variant was detected, the reference repeat number, 17, was assigned as the estimated repeat number. STRViper reports only one repeat number for each individuals although each individual has two repeat numbers. Thus, two repeat numbers in each individual were set to the same value in results of STRViper. CoalescentSTR gives the best result in most of the conditions and coalescentSTR (shuffled) gives the best result in some conditions with sample sizes of 5 and 10. If the sample size considered for estimation is small, the improvement of the performance by considering multiple individuals in coalescentSTR is limited. Thus, coalescentSTR (shuffled) can provide better results than coalescentSTR for some conditions with small sample sizes.
Since some repeat numbers are longer than or equal to the read length, the results from pairedend read distance based methods (coalescentSTR, basic model, and STRViper) are better than those from methods counting repeat numbers in sequences reads (lobSTR and RepeatSeq). The RMSE value is smaller for considering more individuals on coalescentSTR. In addition, the performance of coalescentSTR with the shuffled haplotypes is worse than that of coalescentSTR with correct haplotypes. These observations support the effectiveness of considering the genealogy. The RMSE value for each method on the dataset with read coverage of 40 × is smaller than that on the dataset with read coverage of 20 × in most of the cases.
In the plot for coalescentSTR, points are around the diagonal line. Points in the plot for coalescentSTR (shuffled) are also located around the diagonal line, but scattered in larger area than those in the plot for coalescentSTR. In addition, points in the plot for the basic model are scattered in larger area around the diagonal line than those in plots for coalescentSTR and coalescentSTR (shuffled). There exists a horizontal line with the value twice as much as the reference repeat number in plots for lobSTR and RepeatSeq The line is due to points for cases where these methods failed to STR variants and provided the reference repeat number as estimated repeat numbers. For cases with STR variants which can be detected by RepeatSeq, the corresponding points are located around the diagonal line tightly. On the other hand, points associated with STR variants which can be detected by lobSTR are scattered around the diagonal line. In the plot for STRViper, the sum of estimated diploid repeat numbers is correlated with the sum of true diploid repeat numbers, but differences between estimated repeat numbers and the reference repeat number are underestimated.
Real data analysis
We evaluated the performance of coalescentSTR, basic model, lobSTR, RepeatSeq, and STRViper using exome sequencing data of JPT individuals and WGS data of CEU and GBR individuals.
Performance evaluation with exome sequencing data
Comparison of estimated repeat numbers in terms of RMSE for real exome data for JPT individuals in 1KGP
Method  CoalescentSTR  CoalescentSTR  Basic model  lobSTR  RepeatSeq  STRViper 

(shuffled)  
RMSE  1.33  2.38  9.44  1.63  1.63  1.63 
In the plot for coalescentSTR, points are located around the diagonal line. Points in the plot for coalescentSTR (shuffled) are also located around the diagonal line, but scattered in larger area than those in the plot for coalescentSTR. In the plot for the basic model, the sum of estimated repeat numbers is not correlated with the sum of true repeat numbers because the amount data for each individual in this experiment is not sufficient for estimating repeat numbers correctly. Since lobSTR, RepeatSeq, and STRViper could not detect STR variants for any sample, only a horizontal line with the value twice as much as the reference repeat number is observed in the plots for lobSTR, RepeatSeq, and STRViper.
Performance evaluation with WGS data
We applied coalescentSTR and other existing methods to WGS data of a HapMap CEU individual, NA12878 from HiSeq 2000 to estimate numbers of CAC repeats at chr1:2020057320200666 in GRCh37 for NA12878. Read length and average insert size of the WGS data are respectively 101 bp and 300 bp, and its read coverage is 50 ×. The data was provided by the Illumina Platinum Genomes Project through the European Nucleotide Archive under the study accession PRJEB3381 (http://www.ebi.ac.uk/ena/data/view/ERP001960). In addition to the WGS data of NA12878, we used WGS data for 35 HapMap CEU and 35 HapMap GBR individuals released in May, 22, 2012 by 1KGP [5]. Read length and read coverage of the WGS data of these 70 individual is 100 bp and 5 ×, respectively. Sequence reads in the WGS data of NA12878 were aligned with BWAMEM while those in the WGS of others were aligned with BWA [24]. Phased genotypes around the STR region for coalescentSTR were obtained from the 1KGP Phase3 imputation panels released in October, 12, 2014 [5]. For sampling coalescent trees, phased genotypes at 3,000 bp upstream and downstream positions of the STR region were used. An STR mutation rate was selected from {0.0001,0.0005,0.001,0.005,0.01} based on the value given by Eq. (7). \(n_{\max }\) and \(n_{\min }\) were set to 40 and zero, respectively.

Errorcorrected reads with Falcon (https://github.com/PacificBiosciences/FALCON) in FASTA format were aligned to GRCh37 with BWAMEM.

The number of bases aligned in the STR region was counted for each read spanning the region.

Twocomponent Gaussian mixture model was applied to the set of numbers of bases obtained the above, and estimated means for two components divided by the size of the repeat pattern were adopted as estimated repeat numbers.
Estimated repeat numbers from WGS data from HiSeq 2000 for NA12878 and their corresponding RMSE values with the repeat numbers estimated from PacBio sequencing data for NA12878
Method  Estimated repeat numbers  RMSE 

CoalescentSTR  28/26  0.44 
CoalescentSTR (shuffled)  25/25  2.15 
Basic Model  33/32  5.46 
lobSTR  31/31  4.50 
RepeatSeq  31/31  4.50 
STRViper  31.12/31.12  4.61 
Comparison of computational time
Comparison of computational time on the simulation dataset with 100 individuals, STR mutation rate of 2.73 × 10^{4}, and read coverage of 40 × and the real dataset for HapMap JPT individuals
Method  Computational time  Computational time 

(simulation data)  (real data)  
CoalescentSTR (sampling)  13372.96 [s]  846.60 [s] 
CoalescentSTR (estimation)  452.59 [s]  66.86 [s] 
Basic Model  49.58 [s]  16.18 [s] 
lobSTR  2.72 [s]  8.17 [s] 
RepeatSeq  10.24 [s]  14.78 [s] 
STRViper  407.03 [s]  83.45 [s] 
Conclusions
We proposed a statistical approach named coalescentSTR to estimate repeat numbers in an STR region for multiple individuals from insert size data obtained by pairedend reads in HTS data. We considered the genealogy of the multiple individuals and used the genealogy for propagating repeat number information from insert size among individuals to achieve more accurate estimation of repeat numbers. We evaluated the performance of coalescentSTR, the basic model, lobSTR, RepeatSeq, and STRViper from simulation data and real data from 1KGP and verified the effectiveness of coalescentSTR for STR regions longer than or equal to the read length.
For computational time, coalescentSTR requires the most computational time from the comparison with other existing methods, and its computational time is mainly taken by sampling coalescent trees with MCMC. The use of MCMC with approximate Bayesian computation (ABC) [26] is a solution addressing this issue because the calculation of likelihood for each sampled tree is avoided with ABC and the calculation mainly requires the computational time for sampling. For larger size of genome structural variations such as large size copy number variations, the recombination of genomes needs to be considered although the recombination is basically not considered in coalescent theory. We are considering to extend the proposed model in future work in order to use ancestral recombination graph, which can handle the recombination in the genealogy of multiple individuals unlike coalescent tree.
Declarations
Acknowledgements
This research was supported by grants from the Reconstruction Agency, from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and from the Japan Agency for Medical Research and Development (AMED). All computational resources were provided by the ToMMo supercomputer system (http://sc.megabank.tohoku.ac.jp/en).
Declarations
This article has been published as part of BMC Genomics Volume 17 Supplement 5, 2016. Selected articles from the 11th International Symposium on Bioinformatics Research and Applications (ISBRA ’15): genomics. The full contents of the supplement are available online https://bmcgenomics.biomedcentral.com/articles/supplements/volume17supplement5.
Funding
Publication costs for this work were funded by the Tohoku Medical Megabank Project (Special Account for reconstruction from the Great East Japan Earthquake).
Availability of data and materials
Sequence reads from HiSeq 2000 for NA12878 are available at the European Nucleotide Archive under the study accession number PRJEB3381 (http://www.ebi.ac.uk/ena/data/view/ERP001960). Sequence reads from PacBio for NA12878 are available in GIAB Reference Materials and Data through NCBI FTP site (ftp://ftptrace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NA12878_PacBio_MtSinai). Sequence reads for 33 HapMap JPT individuals, 35 CEU individuals, and 35 GBR individuals are available from the 1000 Genomes Project through EBI FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/).
Authors’ contributions
KK proposed the statistical model and implemented the program for evaluation. YK provided the knowledge and advice about coalescent theory and computational techniques on the field. KK, NN, TM, and MN developed fundamental environments for the evaluation using simulation and real data studies. KK, YK, NN, TM, TH, MN carefully checked equations and other contents in this manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 DePristo MA, Banks E, Poplin R., Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using nextgeneration dna sequencing data. Nat Genet. 2011; 43:491–8.View ArticlePubMedPubMed CentralGoogle Scholar
 Kojima K, Nariai N, Mimori T, Takahashi M, YamaguchiKabata Y, Sato Y, Nagasaki M. A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads. Bioinformatics. 2013; 29(22):2835–43.View ArticlePubMedGoogle Scholar
 Kojima K, Nariai N, Mimori T, YamaguchiKabata Y, Sato Y, Kawai Y, Nagasaki M. Hapmonster: a statistically unified approach for variant calling and haplotyping based on phaseinformative reads. Lect Nodes Comput Sci. 2014; 8542:107–18.View ArticleGoogle Scholar
 Li H, Ruan J, Durbin R. Mapping short dna sequencing reads and calling variants using mapping quality scores. Genome Res. 2008; 18(11):1851–8.View ArticlePubMedPubMed CentralGoogle Scholar
 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491(7422):56–65.View ArticleGoogle Scholar
 Mimori T, Nariai N, Kojima K, Takahashi M, Ono A, Sato Y, YamaguchiKabata Y, Kawai Y, Nagasaki M. iSVP: an integrated structural variant calling pipeline from highthroughput sequencing data. BMC Syst. Biol. 2013; 7(Suppl 6):S8.View ArticlePubMedPubMed CentralGoogle Scholar
 Walker FO. Huntington’s disease. Lancet. 2007; 369(9557):2185–228.View ArticleGoogle Scholar
 Gymrek M, Golan D, Rosset S, Erlich Y. lobstr: A short tandem repeat profiler for personal genomes. Genome Res. 2012; 6:1154–62.View ArticleGoogle Scholar
 Highnam G, Franck C, Martin A, Stephens C, Puthige A, Mittelman D. Accurate human microsatellite genotypes from highthroughput resequencing data using informed error profiles. Nucleic Acids Res. 2013;41(1).Google Scholar
 Cao MD, Tasker E, Willadsen K, Imelfort M, Vishwanathan S, Sureshkumar S, Balasubramanian S, Boden M. Inferring short tandem repeat variation from pairedend short reads. Nucleic Acids Res. 2014;42(3).Google Scholar
 Kingman JFC. On the genealogy of large populations. J Appl Probab. 1982; 19(A):27–43.View ArticleGoogle Scholar
 Tajima F. Evolutionary relationship of dna sequences in finite populations. Genetics. 1983; 105:437–60.PubMedPubMed CentralGoogle Scholar
 Wakeley J. Coalescent Theory: An Introduction. Greenwood Village: Roberts and Company Publishers; 2008.Google Scholar
 Yedidia JS, Freeman WT, Weiss Y. Constructing freeenergy approximations and generalized belief propagation algorithms. IEEE Trans Inf. 2005; 51(7):2282–312.View ArticleGoogle Scholar
 Liu Q, Ihler A. Variational algorithms for marginal map. J Mach Learn Res. 2013; 14:3165–200.Google Scholar
 Nachman MW, Crowell SL. Estimate of the mutation rate per nucleotide in humans. Genetics. 2000; 156(1):297–304.PubMedPubMed CentralGoogle Scholar
 Sun JX, Helgason A, Masson G, Ebenesersdóttir SS, Li H, Mallick S, Gnerre S, Patterson N, Kong A, Reich D, Stefansson K. A direct characterization of human mutation based on microsatellites. Nat Genet. 2012; 44:1161–5.View ArticlePubMedPubMed CentralGoogle Scholar
 Hudson RR. Gene genealogies and the coalscent process In: Harris TJR, editor. Oxford Surveys in Evolutionary Biology. Stoner G(Series Editor): Methods and Perspectives in Cell Biology, vol 1. New York: Academic Press: 1990. p. 1–44.Google Scholar
 Zhivotovsky LA, Feldman MW. Microsatellite variability and genetic distances. Proc Natl Acad Sci. 1995; 92(6):11549–52.View ArticlePubMedPubMed CentralGoogle Scholar
 Tavaré S. Part I: Ancestral inference in population genetics. Lect Nodes Math. 2004; 1837:1–188.View ArticleGoogle Scholar
 Yu N, Chen FC, Ota S, Jorde LB, Pamilo P, Patthy L, Ramsay M, Jenkins T, Shyue SK, Li WH. Larger genetic differences within africans than between africans and eurasians. Genetics. 2002; 161:269–74.PubMedPubMed CentralGoogle Scholar
 Benson G. Tandem repeats finder: a program to analyze dna sequences. Nucleic Acids Res. 1999; 27(2):573–80.View ArticlePubMedPubMed CentralGoogle Scholar
 Li H. Aligning sequence reads, clone sequences and assembly contigs with bwamem. 2013. https://arxiv.org/abs/1303.3997.
 Li H, Durbin R. Fast and accurate short read alignment with burrowswheeler transform. Bioinformatics. 2009; 25(14):1754–60.View ArticlePubMedPubMed CentralGoogle Scholar
 Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. Integrating human sequence data sets provides a resource of benchmark snp and indel genotype calls. Nat Biotechnol. 2014; 32:246–51.View ArticlePubMedGoogle Scholar
 Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov chain monte carlo without likelihoods. Proc Natl Acad Sci. 2003; 100(26):15324–8.View ArticlePubMedPubMed CentralGoogle Scholar