SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data
© Lee et al.; licensee BioMed Central Ltd. 2014
Received: 25 September 2013
Accepted: 18 February 2014
Published: 26 February 2014
Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data.
We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline.
Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
KeywordsPolymorphisms Linkage disequilibrium Maximum likelihood
Since the Arabidopsis genome was completed , advanced sequencing technology has facilitated the whole genome sequencing of many plants of commercial or experimental importance [2–4]. Reference genome sequences and high-throughput data analysis also provide the basis for resequencing whole genomes or transcriptomes to answer questions about variations between cultivars, populations, and taxa. In a variation study, the distribution of single nucleotide polymorphisms (SNPs) and/or short insertions and deletions (indels) is the prime concern.
A variety of studies have begun to utilize and illustrate how to deal with extensive SNP data [5–10]. Particularly, phylogenetic trees have been used in many evolutionary studies to depict evidence about evolutionary relationships between or within organisms, and to study the evolution and functional innovation of genes [6, 7]. However, there has been no easy-to-use pipeline to determine phylogenetic trees with the huge number of variants obtained from sequencing projects. One typical method to determine trees has been: 1) calculating p-distance from all SNP data between two samples, 2) making the p-distance matrix for all samples, 3) constructing a neighbor-joining tree with the matrix by a program such as ‘neighbor’ in the PHYLIP package  and 4) drawing the phylogenetic tree image by a program such as MEGA4 . However, there are at least three points to be methodologically improved: 1) there is no consideration of LD (Linkage Disequilibrium) blocks which can cause bias of variants, 2) statistical tests need be improved to evaluate the level of confidence, and 3) users are required to manipulate large data sets step-by-step to obtain a phylogenetic tree. The snpTree server  provided solutions for the second and third points. However, the target of this web server was bacterial genomes which are much smaller than eukaryotic genomes and seldom if ever have LD blocks.
We developed a pipeline, SNPhylo (Additional file 1), permitting users to construct a phylogenetic tree from a file containing SNP data in VCF (Variant Call Format), HapMap format or GDS (Genomic Data Structure) format . Here we introduce the pipeline with three examples that show the applicability of the pipeline.
All the steps are automated by one Bash shell script, snphylo.sh, though the pipeline includes additional components implemented in Python and R. Thus, by the script, users can obtain from a SNP data file a phylogenetic tree file and other informative files such as multiple alignment results file in PHYLIP format, which can be used for additional analysis such as a parallel bootstrap analysis by PhyML . The pipeline also generates a tree image in PNG format with R packages [16, 18] so the user easily interprets the results of analysis. In addition, the tree file in Newick format is provided as well so users can make more informative tree image by other programs such as MEGA4  and Newick utility  depending on the demands of users.
Results & discussion
Phylogenetic tree with soybean SNP data
Most branches in the tree correspond to those inferred in the original report  though our tree was easily determined by our pipeline in a relatively short time. Interestingly, in one case, our tree was more consistent with the Bayesian clustering result of the original report (Figure 2B) rather than the tree of the original report. Specifically, in the original report, the three wild soybeans (W03, W13, and W14) were clustered together in Bayesian clustering (red box in Figure 2C), while phylogenetic analysis separated W03 from the others (two red ellipses in Figure 2B). The tree determined by SNPhylo shows the three wild soybeans included in same cluster (red ellipse in Figure 2A), consistent with the Bayesian clustering result (red box in Figure 2C). In addition, we constructed a phylogenetic tree by the neighbor-joining method used in the original report using only the SNP data filtered by LD information, and obtained the same tree constructed by SNPhylo for the three wild soybeans (data not shown). Thus, the consistency with the Bayesian clustering result of both our tree and a phylogenetic tree based on LD-filtered data may indicate that using LD information improves interpretation of phylogenetic relationships from genomic data.
Rapid construction of a tree with rice SNP data
The tree constructed by SNPhylo had three evident clusters representing the three rice groups, japonica, indica and aus, and the results was consistent with the previous tree of the original report . Interestingly, the previous tree (Figure 3B) and the SNPhylo tree showed different branch lengths between the three rice group clusters. Specifically, the branch between japonica and the other two clusters was much longer in the previous tree, with the branch lengths being more similar to one another in the SNPhylo tree. The relatively long edge in the previous tree may be caused by the higher LD level of japonica groups than other rice groups . SNP bias due to high levels of LD in japonica might lead to overestimation of distances between clusters. The inclusion of a step to decrease this bias may permit SNPhylo to construct a more accurate tree.
Construction of a phylogenetic tree with Arabidopsis SNP data
There are three major clusters in the phylogenetic tree (Figure 4A). The accessions in each cluster show high consistency regarding geographic origins (Figure 4B). The origins of accessions circled in red, blue, and orange are located in the geographical west, middle, and east of Europe, respectively. For example, the origins of Edi-0 and Bur-0 in the same cluster are Scotland and Ireland, respectively. In addition, the relationship between geographical location and the cluster in the phylogenetic tree are consistent with the East–West gradient in clustering results of 96 Arabidopsis genotypes which is likely caused by post-glaciation colonization routes .
Dependence of SNPhylo run time on amount of SNP data
Using SNPhylo, users can easily produce a phylogenetic tree from large SNP data derived from various detection technologies such as genome wide resequencing  and resequencing microarrays . Consequently, this pipeline can help a researcher focus more on interpretation of a reliable tree generated by maximum likelihood analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
Availability and requirements
Project name: SNPhylo
Project home page: http://chibba.pgml.uga.edu/snphylo/
Operating system(s): Linux, UNIX and OS X
Programming language: Python, R and BASH
License: GNU GPLv2
Any restrictions to use by non-academics: None
Variant call format
Genomic data structure
Minor allele frequency.
We thank Barry Marler and the Georgia Advanced Computing Resource Center for IT support.
A.H.P. appreciates funding from the National Science Foundation (NSF: DBI 0849896, MCB 0821096, MCB 1021718). This study was supported in part by resources and technical expertise from the University of Georgia, Georgia Advanced Computing Resource Center, a partnership between the Office of the Vice President for Research and the Office of the Chief Information Officer.
- Initiative TAG: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- International Rice Genome Sequencing Project: The map-based sequence of the rice genome. Nature. 2005, 436 (7052): 793-800. 10.1038/nature03895.View ArticleGoogle Scholar
- Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, et al: Genome sequence of the palaeopolyploid soybean. Nature. 2010, 463 (7278): 178-183. 10.1038/nature08670.PubMedView ArticleGoogle Scholar
- Paterson AH, Wendel JF, Gundlach H, Guo H, Jenkins J, Jin D, Llewellyn D, Showmaker KC, Shu S, Udall J, Yoo MJ, Byers R, Chen W, Doron-Faigenboim A, Duke MV, Gong L, Grimwood J, Grover C, Grupp K, Hu G, Lee TH, Li J, Lin L, Liu T, Marler BS, Page JT, Roberts AW, Romanel E, Sanders WS, Szadkowski E, et al: Repeated polyploidization of Gossypium genomes and the evolution of spinnable cotton fibres. Nature. 2012, 492 (7429): 423-427. 10.1038/nature11798.PubMedView ArticleGoogle Scholar
- The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.PubMed CentralView ArticleGoogle Scholar
- Guo S, Zhang J, Sun H, Salse J, Lucas WJ, Zhang H, Zheng Y, Mao L, Ren Y, Wang Z, Min J, Guo X, Murat F, Ham BK, Zhang Z, Gao S, Huang M, Xu Y, Zhong S, Bombarely A, Mueller LA, Zhao H, He H, Zhang Y, Zhang Z, Huang S, Tan T, Pang E, Lin K, Hu Q, et al: The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat Genet. 2013, 45 (1): 51-58.PubMedView ArticleGoogle Scholar
- Lam HM, Xu X, Liu X, Chen W, Yang G, Wong FL, Li MW, He W, Qin N, Wang B, Li J, Jian M, Wang J, Shao G, Wang J, Sun SS, Zhang G: Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat Genet. 2010, 42 (12): 1053-1059. 10.1038/ng.715.PubMedView ArticleGoogle Scholar
- Xu X, Liu X, Ge S, Jensen JD, Hu F, Li X, Dong Y, Gutenkunst RN, Fang L, Huang L, Li J, He W, Zhang G, Zheng X, Zhang F, Li Y, Yu C, Kristiansen K, Zhang X, Wang J, Wright M, McCouch S, Nielsen R, Wang J, Wang W: Resequencing 50 accessions of cultivated and wild rice yields markers for identifying agronomically important genes. Nat Biotechnol. 2012, 30 (1): 105-111.View ArticleGoogle Scholar
- Cao J, Schneeberger K, Ossowski S, Gunther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, Wang X, Ott F, Muller J, Alonso-Blanco C, Borgwardt K, Schmid KJ, Weigel D: Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 2011, 43 (10): 956-963. 10.1038/ng.911.PubMedView ArticleGoogle Scholar
- McNally KL, Childs KL, Bohnert R, Davidson RM, Zhao K, Ulat VJ, Zeller G, Clark RM, Hoen DR, Bureau TE, Stokowski R, Ballinger DG, Frazer KA, Cox DR, Padhukasahasram B, Bustamante CD, Weigel D, Mackill DJ, Bruskiewich RM, Ratsch G, Buell CR, Leung H, Leach JE: Genomewide SNP variation reveals relationships among landraces and modern varieties of rice. Proc Natl Acad Sci USA. 2009, 106 (30): 12273-12278. 10.1073/pnas.0900992106.PubMed CentralPubMedView ArticleGoogle Scholar
- Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989, 5 (2): 163-166.View ArticleGoogle Scholar
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-1599. 10.1093/molbev/msm092.PubMedView ArticleGoogle Scholar
- Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM: snpTree - a web-server to identify and construct SNP trees from whole genome sequence data. BMC Genomics. 2012, 13 (Suppl 7): S6-10.1186/1471-2164-13-S7-S6.PubMed CentralPubMedView ArticleGoogle Scholar
- Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS: A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics. 2012, 28 (24): 3326-3328. 10.1093/bioinformatics/bts606.PubMed CentralPubMedView ArticleGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.PubMed CentralPubMedView ArticleGoogle Scholar
- Schliep KP: phangorn: phylogenetic analysis in R. Bioinformatics. 2011, 27 (4): 592-593. 10.1093/bioinformatics/btq706.PubMed CentralPubMedView ArticleGoogle Scholar
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59 (3): 307-321. 10.1093/sysbio/syq010.PubMedView ArticleGoogle Scholar
- Paradis E, Claude J, Strimmer K: APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004, 20 (2): 289-290. 10.1093/bioinformatics/btg412.PubMedView ArticleGoogle Scholar
- Junier T, Zdobnov EM: The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell. Bioinformatics. 2010, 26 (13): 1669-1670. 10.1093/bioinformatics/btq243.PubMed CentralPubMedView ArticleGoogle Scholar
- Gan X, Stegle O, Behr J, Steffen JG, Drewe P, Hildebrand KL, Lyngsoe R, Schultheiss SJ, Osborne EJ, Sreedharan VT, Kahles A, Bohnert R, Jean G, Derwent P, Kersey P, Belfield EJ, Harberd NP, Kemen E, Toomajian C, Kover PX, Clark RM, Ratsch G, Mott R: Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature. 2011, 477 (7365): 419-423. 10.1038/nature10414.PubMedView ArticleGoogle Scholar
- Nordborg M, Hu TT, Ishino Y, Jhaveri J, Toomajian C, Zheng H, Bakker E, Calabrese P, Gladstone J, Goyal R, Jakobsson M, Kim S, Morozov Y, Padhukasahasram B, Plagnol V, Rosenberg NA, Shah C, Wall JD, Wang J, Zhao K, Kalbfleisch T, Schulz V, Kreitman M, Bergelson J: The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. 2005, 3 (7): e196-10.1371/journal.pbio.0030196.PubMed CentralPubMedView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.