Application of massive parallel sequencing to whole genome SNP discovery in the porcine genome
© Amaral et al; licensee BioMed Central Ltd. 2009
Received: 23 February 2009
Accepted: 12 August 2009
Published: 12 August 2009
Although the Illumina 1 G Genome Analyzer generates billions of base pairs of sequence data, challenges arise in sequence selection due to the varying sequence quality. Therefore, in the framework of the International Porcine SNP Chip Consortium, this pilot study aimed to evaluate the impact of the quality level of the sequenced bases on mapping quality and identification of true SNPs on a large scale.
DNA pooled from five animals from a commercial boar line was digested with Dra I; 150–250-bp fragments were isolated and end-sequenced using the Illumina 1 G Genome Analyzer, yielding 70,348,064 sequences 36-bp long. Rules were developed to select sequences, which were then aligned to unique positions in a reference genome. Sequences were selected based on quality, and three thresholds of sequence quality (SQ) were compared. The highest threshold of SQ allowed identification of a larger number of SNPs (17,489), distributed widely across the pig genome. In total, 3,142 SNPs were validated with a success rate of 96%. The correlation between estimated minor allele frequency (MAF) and genotyped MAF was moderate, and SNPs were highly polymorphic in other pig breeds. Lowering the SQ threshold and maintaining the same criteria for SNP identification resulted in the discovery of fewer SNPs (16,768), of which 259 were not identified using higher SQ levels. Validation of SNPs found exclusively in the lower SQ threshold had a success rate of 94% and a low correlation between estimated MAF and genotyped MAF. Base change analysis suggested that the rate of transitions in the pig genome is likely to be similar to that observed in humans. Chromosome X showed reduced nucleotide diversity relative to autosomes, as observed for other species.
Large numbers of SNPs can be identified reliably by creating strict rules for sequence selection, which simultaneously decreases sequence ambiguity. Selection of sequences using a higher SQ threshold leads to more reliable identification of SNPs. Lower SQ thresholds can be used to guarantee sufficient sequence coverage, resulting in high success rate but less reliable MAF estimation. Nucleotide diversity varies between porcine chromosomes, with the X chromosome showing less variation as observed in other species.
The Sanger DNA sequencing technique has been and still is the method of choice for de novo sequencing of complete genomes [1, 2]. However, whole genome sequencing using the Sanger method is relatively expensive, labor intensive, and time consuming.
Several methods for ultra high-throughput DNA sequencing that reduce the cost and labor demands of Sanger sequencing are currently available [3, 4]. The Illumina 1 G Genome Analyzer (ILLUMINA, San Diego, CA, USA) uses a sequencing by synthesis method, during which millions of DNA fragments are sequenced in parallel (massive parallel sequencing). With this method, costly and often problematic procedures, such as cloning are eliminated. Another advantage is that accuracy is independent of sequence context because a discrete signal is generated per each base. Thus, this method is very accurate in cases of homopolymeric sequences and generates quality values that are analogous to Phred scores . Sequence lengths generally range from 25–50 bp (short sequences), which is sufficient for unique alignment to a reference genome . Because millions of fragments are sequenced in parallel, a fragment can be sequenced even if it exists in low abundance in the sample, thereby increasing sequencing depth and enabling identification of single nucleotide polymorphisms (SNPs) with high accuracy [7–9].
Sequencing of reduced representation libraries (RRLs), which are reproducible subsets of the genome, allows cost-effective genome-wide SNP discovery with accurate estimations of minor allele frequencies (MAF) . Because the cost of large-scale sequencing of RRLs is still prohibitive for individual samples, DNA samples can be pooled to further reduce sequencing costs and simultaneously infer the frequencies of two SNP alleles with high accuracy . Previous studies have shown that large-scale SNP discovery can be accurate using massive parallel sequences of RRLs prepared from pooled DNA [8, 12].
Despite the efficiency of massive parallel sequencing for providing large amounts of sequencing data, a sequence selection stage is still required. Previous studies have applied various rules for sequence selection: sequences must start with a restriction motif , sequences must be aligned to a unique location in the genome , and sequences must have a minimum average sequence quality (SQ) score of 20 [8, 9]. An effective sequence selection stage can decrease noise in the data that can compromise alignment and SNP identification. Therefore, the effect of different levels of SQ in identifying SNPs needs to be evaluated.
With increasing attention being paid to genomic selection by animal breeders, there is a need for high-density SNP maps of the genomes of farm animals. Experimental evidence has shown that linkage disequilibrium extends from 0.1 to 2 cM in European commercial pig breeds ; thus, an SNP assay should contain a minimum of 30 k informative SNPs. To achieve this goal, we designed a cost-effective strategy for large scale identification of SNPs in the porcine genome that could be applied to other species.
In this study, an RRL generated from a DNA pool of a boar line was sequenced using the Illumina 1 G Genome Analyzer. The two main goals of our study were (a) to develop rules for decreasing sequence ambiguity (sequence alignment to several locations in the genome), which would decrease noise and increase efficiency in sequence alignment and SNP identification, and (b) to evaluate the effects of different SQ filtering strategies for cost-effective, large-scale identification of SNPs.
Results and discussion
Sequencing and filtering the RRL
An RRL was produced from a DNA pool of five boars from a crossbred (Large White vs. Pietrain) commercial boar line (PW), using the restriction enzyme Dra I, which recognizes pattern "TTTAAA" and generates blunt-ended fragments starting with AAA. Fragments 150–250-bp long were selected and end-sequenced using the Illumina 1 G Genome Analyzer. Short sequences will align to a unique genomic location (target), creating groups (clusters) with a number of sequences (target depth) sufficient for SNP identification. The in silico digest of pre-EnsEMBL Sus scrofa build 7  indicated an expected sequence coverage of ~1% of the reference genome, that is, 11,089,914 bp uniquely aligned to the porcine genome.
In total, 70,348,064 sequences were generated during three different runs. In addition to sequence information, this second generation platform generates quality scores that are analogous to Phred scores (which assign a probability to the four possible nucleotides for each base in the sequence) . Levels of base quality varied between runs and along the sequence length, and decreased considerably at the 3'end [see Additional file 1]. This variation in base quality along the sequence has been reported in previous studies using short sequencing . Base quality for the first three bases is poor, but increases before decreasing again at the end of the 31-bp sequence. Poor base quality at the first three bases is due to the properties of the algorithm implemented in the Illumina base caller BUSTARD® (ILLUMINA, San Diego, CA, USA). The quality score of a base is calculated by comparing the fluorescence signals of the previous and following bases. The algorithm does not expect a repeated motif in the beginning of a sequence (AAA) and therefore estimates poor quality scores. The severe decrease in base quality at the 3' end of the sequence indicates the existence of a higher level of sequencing errors at the 3' end. The proportion of unique sequences (sequences occurring only once) ranged from 35% when considering a sequence length of 29 bp to 55% when considering a sequence length of 35 bp, which again indicates an increase in sequencing errors with an increase in sequence length. Therefore, sequences were trimmed at 33 bp and filtering rules were applied.
We applied a number of rules to select sequences for further analysis: (a) screening for properties of the RRL (i.e., discarding sequences without the restriction motif "AAA" in the 5' end); (b) filtering for sequence ambiguity, and (c) filtering for SQ. Filtering for sequence ambiguity was performed by removing sequences with homopolymers and removing sequences with a high rate of re-sampling. Sequences with a re-sampling rate above an expected level were discarded as potentially paralogous sequences. Because pre-EnsEMBL Sus scrofa build 7  comprises approximately 70% of the porcine genome, unique alignment of reads does not guarantee that there is no other similar (duplicated) sequence in the remainder of the genome. Therefore, potential paralogous sequences should be eliminated from the data set to avoid identification of false-positive SNPs. This was done by estimating the ratio between the total number of fragments obtained after filtering for the restriction motif, unknown bases, homopolymers, and the numbers of fragments generated from an in silico digest. This ratio, the estimated average level of sequence re-sampling, was estimated at 35×, and sequences with a frequency approximately 3-fold greater (>100×) were removed.
Sequence production and filtering for the three strategies used to identify SNPs.
Total sequences after filtering
Total number of SNPs
Comparison of strategies for SNP identification
RRL sequence coverage along the pig genome
Summary of in silico digest of reference genome and analysis of consensus sequences.
Total sequence length (bp)
Total observed sequence length (bp)
Total observed sequence length (bp) per 1-Mb window
GC content (%)
Sequence polymorphism in the pig genome
Although this study covered only ~1% of the porcine genome, using an RRL allowed estimation of genome-wide nucleotide diversity. Variation in nucleotide diversity and length of sequence coverage for the remaining chromosomes are shown in [see Additional file 1]. The pattern of variation in nucleotide diversity along chromosomes varies between chromosomes; SSC4, SSC8, SSC9, SSC10, SSC13, SSC15, and SSCX have higher levels of nucleotide diversity towards the telomeric regions and lower levels of nucleotide diversity in the centromeric region. On SSCX, large areas flanking the centromere were devoid of nucleotide diversity. Reduced variability in the X chromosome relative to the autosomes has been described for other species, including humans [21, 22], Drosophila , and mice , and is explained mainly by the fact that this chromosome has a lower mutation rate and a smaller effective population size .
SNP genotyping and validation
A SNP chip assay was conducted to validate a sample of 3,230 SNPs in the original SNP discovery panel. Of the 3,230 SNPs, 3% failed as a result of the assay design. The validation assay included 68 assayable SNPs exclusively found in Data 12, 147 assayable SNPs exclusively shared between Data 12 and Data 15, 48 assayable SNPs exclusively found in Data 20, and 2,879 assayable SNPs shared between Data 12, Data 15, and Data 20. The correlation between the estimated MAF (calculated from the analysis of short sequences) and the genotyped MAF in the animals used in the discovery panel was calculated for the 2,879 SNPs shared between Data 12, Data 15, and Data 20. The observed correlation of 0.32 was somewhat lower than that reported by Wiedmann . In order to investigate this result, MAF obtained from short sequence data and from genotyping were simulated and correlations were calculated. Results from simulations showed that the correlation between MAF estimated from short sequences and MAF estimated from genotype data can range from 0.1 to 0.5. Of the total number of SNPs tested, 4% appeared to be monomorphic and 85% had an MAF above 0.2, showing that our strategy yielded a high proportion of informative SNPs useful for whole genome association assays. The fact that the correlation between estimated and observed MAFs was not high could restrict the type of measures used to evaluate genomic variation in population genomics studies using short sequences. For example, estimation of pairwise nucleotide diversity (π) requires accurate estimates of MAF, and a correlation of 0.3 shows that the π estimated from short sequences can be biased.
Percentage of monomorphic SNPs and average minor allele frequencies (MAF) by breed for 3,142 SNPs.
# SNPs = 68
Data 12 & Data 15
# SNPs = 147
# SNPs = 48
# SNPs = 2,879
# Genotyped Animals
We presented a strategy for using short sequences derived from second generation sequencing technology to efficiently identify large numbers of SNPs with MAF estimates at a low false discovery rate. These results show that by lowering the SQ it is possible to identify SNPs while still keeping the false discovery rate low, although the cost is a lower correlation between the estimated and true MAFs. Finally, our data show that nucleotide diversity is quite variable among porcine chromosomes and is particularly low on SSCX.
Library construction and sequencing
DNA was extracted from five individual boars produced from a cross between Landrace and Pietrain. Extracted DNA was pooled (60 ng) and digested with Dra I (100 units; New England Biolabs, Ipswich, MA, USA) at 37°C for 16 hours. The fragments were separated by 1.2% agarose gel electrophoresis (2 hours; 60 volts), and 150–250 bp-fragments were eluted (yielding 1069 ng), end-repaired, and ligated to Illumina's oligonucleotide adapters. Fragments were end-sequenced using the Illumina 1 G Genome Analyzer.
The Illumina 1 G Genome Analyzer generates image information that is translated by BUSTARD® into sequences and base quality scores similar to Phred . Using PERL scripts prepared by the authors, quality scores were converted into PHRED scores.
Sequence mapping and SNP identification
Filtering rules were applied to select sequences that were mapped to the pig genome  using MAQ 0.6.6 . The algorithm implemented in MAQ calculates a mapping quality for each sequence that measures the probability that a sequence belongs to a specific target . Mapping was performed allowing two mismatches and a mutation rate of 0.001. To generate consensus sequences, the algorithm implemented in MAQ estimated CQ, which is the value at which the probability of each genotype is maximized . Consensus sequences were generated allowing a maximum of four mismatches since the expected SNP frequency in pigs is 1/336 bp ; therefore, the percentage of clusters with more than one SNP should be low.
Nucleotide diversity analysis
The total aligned consensus sequence was obtained using option cns.view of MAQ . Full output was filtered using the same rules applied to SNP identification. A PERL script was developed by the authors to calculate mapping density, nucleotide diversity , and GC content over 1-Mb windows.
A total of 3,142 SNPs located in Sus scrofa build 7  were validated using the Illumina Infinium® Genotyping assay on an Illumina® BeadStation.
Oligonucleotides were designed, synthesized, and assembled into oligo-pooled assays by Illumina Inc.
Individual DNA samples from the animals used for the SNP identification panel (PW) were genotyped, plus one more animal from PW, 20 samples of Wild Boar, 136 samples of Large White, 80 samples of Landrace, 82 samples of Duroc, 90 samples of Pietrain, 67 samples of Berkshire, and 39 samples of Hampshire.
Additional files and information
The sequences and nucleotide variations have been submitted to GenBank dbSNP database with accession numbers ranging from 120015817 to 120032847.
This work was conducted as part of the SABRETRAIN Project funded by the Marie Curie Host Fellowships for Early Stage Research Training and as part of the 6th Framework Programme of the European Commission. Sequencing costs were funded by the Institute for Pig Genetics, the Netherlands. The authors thank Luca Ferretti (Universidade Autonoma de Barcelona) for his help with the MAF simulations.
- Venter JC, et al: The Sequence of the Human Genome. Science. 2001, 291: 1304-1351. 10.1126/science.1058040.View ArticlePubMedGoogle Scholar
- Schook LB, Beever JE, Rogers J, Humphray S, Archibald A, Chardon P, Milan D, Rohrer G, Eversole K: Swine Genome Sequencing Consortium (SGSC): A Strategic Roadmap for Sequencing The Pig Genome. Comparative and Functional Genomics. 2005, 6: 251-255. 10.1002/cfg.479.PubMed CentralView ArticlePubMedGoogle Scholar
- Ahmadian A, Ehn M, Hober S: Pyrosequencing: History, biochemistry and future. Clinica chimica acta. 2006, 363: 83-10.1016/j.cccn.2005.04.038.View ArticleGoogle Scholar
- Metzker ML: Emerging technologies in DNA sequencing. Genome research. 2005, 15 (12): 1767-10.1101/gr.3770505.View ArticlePubMedGoogle Scholar
- Bentley DR: Whole-genome re-sequencing. Current opinion in genetics & development. 2006, 16 (6): 545-10.1016/j.gde.2006.10.009.View ArticleGoogle Scholar
- Whiteford N, Haslam N, Weber G, Prugel-Bennett A, Essex JW, Roach PL, Bradley M, Neylon C: An analysis of the feasibility of short read sequencing. Nucleic Acids Research. 2005, 33 (19):
- Shendure J, Mitra RD, Varma C, Church GM: Advanced sequencing technologies: Methods and goals. Nature Reviews Genetics. 2004, 5 (5): 335-344. 10.1038/nrg1325.View ArticlePubMedGoogle Scholar
- Van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature methods. 2008, 5 (3): 247-10.1038/nmeth.1185.View ArticlePubMedGoogle Scholar
- Hillier LDW, Marth GT, Quinlan AQ, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham MT, Huang W, et al: Whole-genome sequencing and variant discovery in C. elegans. Nature methods. 2008, 5 (2): 183-10.1038/nmeth.1179.View ArticlePubMedGoogle Scholar
- Altshuler D, Pollara VJ, Cowles CR, Van Etten WJ, Baldwin J, Linton L, Lander ES: An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000, 407 (6803): 513-516. 10.1038/35035083.View ArticlePubMedGoogle Scholar
- Sham P, Bader JS, Craig IMOD, Owen M: DNA Pooling: a toolfor large-scale association studies. Nature reviews Genetics. 2002, 3 (11): 862-10.1038/nrg930.View ArticlePubMedGoogle Scholar
- Wiedmann RT, Smith TPL, Nonneman DJ: SNP discovery inswine by reduced representation and throughput pyrosequencing. BMC Genetics. 2008, 9: 81-10.1186/1471-2156-9-81.PubMed CentralView ArticlePubMedGoogle Scholar
- Amaral AJ, Megens H-J, Crooijmans RPMA, Heuven HCM, Groenen MAM: Linkage disequilibrium decay and haplotype bock structure in the pig. Genetics. 2008, 179: 569-579. 10.1534/genetics.107.084277.PubMed CentralView ArticlePubMedGoogle Scholar
- pre-EnsEMBL Sus scrofa build 7. [ftp://ftp.sanger.ac.uk/pub/S_scrofa/assemblies/PreEnsembl_Sscrofa7]
- Ewing B, Hillier L, Wendl MC, Green P: Base calling of automated sequencer traces using PhredI. Accuracy and assessment. Genome Research. 1998, 8: 175-185.View ArticlePubMedGoogle Scholar
- Li H, Ruan J, Durbin R: Maq: Mapping and Assembly with Qualities. Genome Research. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang DG, Fan JB, Siao CJ, Berno A, Young P, Sapolsky R, Ghandour G, Perkins N, Winchester E, Spencer J, Kruglyak L, Stein L, Hsie L, Topaloglou T, Hubbell E, Robinson E, Mittmann M, Morris MS, Shen N, Kilburn D, Rioux J, Nusbaum C, Rozen S, Hudson TJ, Lipshutz R, Chee M, Lander ES: Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998, 280: 1077-1082. 10.1126/science.280.5366.1077.View ArticlePubMedGoogle Scholar
- Kerstens HHD, Kollers S, Kommadath A, Rosario M, Dibbits B, Kinders SM, Crooijmans RP, Groenen MAM: Mining for SingleNucleotide Polymorphisms in Pig genome sequence data. BMC Genomics. 2009, 10: 4-10.1186/1471-2164-10-4. doi:10.1186/1471-2164-10-4PubMed CentralView ArticlePubMedGoogle Scholar
- Watterson GA: On the number of segregation sites. Theoret Popul Biol. 1975, 7: 256-276. 10.1016/0040-5809(75)90020-9.View ArticleGoogle Scholar
- Galtier N, Piganeau G, Mouchiroud D, Duret L: G-C content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics. 2001, 159: 907-911.PubMed CentralPubMedGoogle Scholar
- Payseur BA, Nachman MW: Natural Selection at linked sites in humans. Gene. 2002, 300: 31-42. 10.1016/S0378-1119(02)00849-1.View ArticlePubMedGoogle Scholar
- Lu J, Wu C, Ohta T: Weak selection revealed by the whole-genome comparison of the X chromosome and autosomes of human and chimpanzee. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102: 4063-4067. 10.1073/pnas.0500436102.PubMed CentralView ArticlePubMedGoogle Scholar
- Kauer M, Zangeri B, Dieringer D, Schlotterer C: Chromosomal pattern of microsatellite variability contrastsharply in African and non-African populations of Drosophila melanogaster. Genetics. 2002, 160: 247-256.PubMed CentralPubMedGoogle Scholar
- Baines JF, Harr B: Reduced X-linked diversity in derivedpopulations of house mice. Genetics. 2007, 175: 1911-1921. 10.1534/genetics.106.069419.PubMed CentralView ArticlePubMedGoogle Scholar
- Betancourt AJ, Kim Y, Orr HA: A pseudohitchhivking model of X vs. autosomal diversity. Genetics. 2004, 168: 2261-2269. 10.1534/genetics.104.030999.PubMed CentralView ArticlePubMedGoogle Scholar
- Megens H-J, Crooijmans RPMA, SanCristobal M, Hui X, Ning L, Groenen MAM: Biodiversity of pig breeds from China and Europe estimated from pooled DNA samples: differences in microsatellite variation between two areas of domestication. Genetics Selection and Evolution. 2008, 40: 103-128.Google Scholar
- Jungerius BJ, Gu J, Crooijmans RPMA, Poel JJvd, Groenen MAM, Oost BAv, Pas MFWt: Estimation of the extent of linkage disequilibrium in seven regions of the porcine genome. Anim Biotechnol. 2005, 16: 41-54. 10.1081/ABIO-200053402.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.