Evaluation of variant identification methods for whole genome sequencing data in dairy cattle
© Baes et al.; licensee BioMed Central Ltd. 2014
Received: 2 June 2014
Accepted: 14 October 2014
Published: 1 November 2014
Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays.
The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio.
Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.
Practical application of genomic technologies, such as large-scale use of single nucleotide variant (SNV) arrays in animal and plant breeding, has become routine in many areas of the life sciences. Taking both polygenic additive (pedigree) effects and genomic (SNV) effects into account, between 71 and 85% of the genetic variance observed in phenotypic traits of interest in cattle can be explained solely by SNV effects  and the number of genotyped animals in cattle populations worldwide is increasing steadily . Fueled by decreasing costs, advances in next-generation sequencing (NGS) technologies enable identification of more complex forms of genetic variation (e.g. short insertions and deletions (InDels), copy number variations (CNVs), etc.). These advances will inevitably foster our ability to partition the genetic variance underlying traits of interest. While some applications of NGS require de novo sequencing of an individual organism (sample), re-sequencing may also be possible if a reference genome for the species of interest is available.
The translation of raw NGS reads into tangible variants (SNVs, InDels, CNVs, etc.) via re-sequencing is a specific, delicate and computationally demanding task  and comprises three steps. First, short reads of DNA are aligned to an existing reference genome (referred to as alignment). Second, sequence differences between the sample being sequenced and the reference genome are identified (referred to as variant calling ). A myriad of alignment (see  for a review) and variant identification software programmes are available (e.g. the UnifiedGenotyper (UG) or the HaplotypeCaller (HC) of the genome analysis toolkit (GATK) [6, 7]; Platypus (PL) ; SAMtools (SAM) ; etc.), the majority of which can be obtained free of charge. As a final step, variants are screened and filtered to remove potential false positives common in most NGS technologies (see ).
As sequencing costs decline, reference genomes are becoming available for an increasing number of organisms, including agriculturally important species such as cattle (see  for a review). The Bos taurus reference genome UMD3.1 is similar in size to the human genome and contains ~2.8 billion base pairs, approximately 10% of which are not assigned to any chromosome . The N50 size can be used to compare the quality of genome assemblies of similar size: it represents contig size such that 50% of the genome is contained in contigs of length N or greater . Because the N50 size for the UMD3.1 reference genome (UMD3.1 accession number GCA_000003055.3: N50 = 96,955b) is much shorter than that of the current human reference (GRCh38 accession number GCA_000001405.15: N50 = 56,413,054b), the UMD3.1 reference genome will not likely allow the same accuracy in alignment, variant identification and further downstream analysis as the human reference allows. Nevertheless, algorithms and software developed for alignment and variant identification of human NGS data provide an excellent resource for translating NGS data of other non-human organisms, such as cattle, into genetic variants for application in genome wide association studies and genomic selection programmes.
Several approaches to variant identification are possible. The simplest variant detection methods identify variants on a per-sample basis, one position at a time. Once a variant locus is found, the most likely genotype for that locus is determined stochastically based on a consensus of aligned reads. If multiple samples are analyzed simultaneously, an a priori likelihood of finding a variant locus given the observed data is derived, and the most likely genotype at a given position is determined. Either single or multi sample variant identification methods can be implemented in the UG [6, 7]; and SAM . More advanced haplotype-based methods incorporate the correlation between adjacent variants within the variant detection procedure. Such methods use linkage disequilibrium between nearby variants to further enrich variant identification. Haplotype-based methods are implemented in PL  and the GATK HC [6, 7]. The haplotype-based variant detection approach can also be conducted in either single or multi sample settings.
Read chimerism, base pair tautomerisms and signal intensity issues can contribute to false positive variant detection by causing stochastic inaccuracies, general sequencing errors, and misalignments in NGS data [7, 10]. Aside from the variant identification approach itself, a number of optional auxiliary steps have been recommended to improve the quality of NGS-derived variants. These steps are conducted before (preparatory steps) or after variant identification (filtering). The first generally recommended preparatory step is the identification of falsely duplicated reads (mainly artifacts caused by PCR), which reduces bias in variant detection . Secondly, local realignment around single or multiple bases that are either missing in the reference (insertion) or missing in the DNA sequence being analyzed (deletion) is also commonly recommended . Realignment cleans up spurious SNVs that result from misalignment of reads around known alignment gaps and helps detect false negative SNVs in the near vicinity of InDels (insertions and deletions). Furthermore, the full alignment context is used to determine whether the reported divergence from the reference (i.e. the insertion or deletion) actually exists . Finally, flow cell lane, machine cycle (base position within the read), sequencing context (preceding and current nucleotide) or other technical aspects may influence base quality scores, which help characterize the quality of the bases in the individual reads. Base quality score recalibration is recommended to lower the number of falsely identified SNVs and to lower false confidence in identified bases . After variant identification, further filters can be implemented according to the individual dataset under consideration.
Variant genotypes compared for concordance between the array-based and sequence based methods to determine concordance, sensitivity and discrepancy between the two assays (a) and measures of concordance (b)
a) NGS-based genotypes
Array-based genotypes (gold standard)
Genotype not identified
Measures of concordance
Aside from measuring concordance between NGS data and array data, the ratio of transitions (pyrimidine-pyrimidine or purine-purine mutations) to transversions (pyrimidine-purine or purine-pyrimidine mutations; Ti/Tv ratio) can be used as a convenient diagnostic to measure the quality of NGS data (e.g. [7, 21]). The genome-wide Ti/Tv ratio is reported between 2.0 and 2.2 in human whole-genome sequence data [7, 21], whereby this ratio is higher in exomes due to the increased presence of methelated cytosine in CpG dinucleotides in exonic regions . Because this bias in favour of mutations between bases with similar biochemical properties (transitions) over those with dissimilar properties (transversions) is dependent on both CpG and GC content of the region, the Ti/Tv ratio is a useful diagnostic to measure quality across the genome .
The objective of this study was to investigate which methods and software work best for detection of high quality genetic variants using NGS data in cattle, with a specific focus on single nucleotide variants. Using whole-genome sequence information from 65 individuals, we a) explore the implications of preparatory steps commonly recommended in human analysis, b) compare results of single and multi sample variant detection achieved using four publicly available variant detection software programmes, c) provide a comparison of computational processing time, and d) compare accuracy and completeness of SNVs identified in NGS data by comparing them to genotypes from the same individuals generated with either high- or medium-density SNV arrays, as well as to analyse genome-wide Ti/Tv ratios. Through benchmarking different variant detection methods in cattle, preliminary recommendations for variant identification in other organisms can be extrapolated. Our findings can serve as a reference for choosing variant identification software and can help assess the implication of preparatory steps in NGS pipelines for species with lower-quality or unfinished reference genomes.
Results and Discussion
Alignment and coverage
Approximately 24 billion paired-end reads were obtained for the 65 sequenced animals. An average of 96.8% of these reads (range 86.6% - 98.2%) were mapped to 30 chromosomes (autosomes 1 – 29, X) of the bovine reference genome assembly UMD3.1 . Approximately 1.7 billion (PCR and optical) duplicate reads (average 7.3%, range 4.2% - 10.6%) were marked and excluded from further analysis. Average coverage was 12.1 reads per base; average coverage per-animal ranged from 10.1 - 17.5. See Additional file 1: Table S1 for individual alignment and coverage information.
Single sample variant detection
Total number of single nucleotide variants (SNVs), insertions and deletions (InDels), and Transition/Transversion Ratios found using single and multi sample calling methods with HaplotypeCaller (HC), Platypus (PL), Platypus results after multi-nucleotide variants were split into allelic primitives (PL_PRIM), Samtools (SAM), and the UnifiedGenotyper (UG) for 65 animals
Total number of SNVs
Total number of InDels
Single sample calling, combined
Multi sample calling
Single sample calling, combined
Multi sample calling
Single sample calling, combined
Multi sample calling
Base quality score recalibration
Base quality score recalibration reduced the average number of SNVs identified in PL (-127,292) and UG (-181,119), but not in SAM (+182,943) (Figure 1a). The number of SNVs identified after InDel realignment and base quality score recalibration is expected to decrease, as quality scores are initially overestimated and more SNVs fall below the cut-off after recalibration. By lowering base quality scores through recalibration, confidence in weak variants should decrease and, as a result, the number of false positives is expected to drop . The observed increase in SNVs identified with SAM after base quality score recalibration likely corresponds with the effective removal of false positive SNVs by SAM prior to IR and BQSR and the less stringent default quality scores in SAM, though the increase is minimal. Recalibration of base quality scores slightly increased the number of InDels for PL and SAM (no change in UG, Figure 1b).The effect of IR + BQSR on the number of SNVs identified was more obvious than the effect of IR alone for PL and SAM. Depending on variant detection software used, the total average number of SNVs identified per animal increased (SAM) or decreased (PL and UG) by around 3.0 – 4.0%. In contrast, the number of InDels identified after IR + BQSR did not change markedly compared to the number of InDels identified after IR alone. The Ti/Tv ratio decreased slightly for PL_PRIM, SAM and UG (no change in PL, Figure 2a) after BQSR, indicating that BQSR using default settings and the current available resources may actually decrease overall variant quality for these variant identification methods.
Liu et al.  analysed the effect of preparatory steps in whole exome sequencing data and found no clear effect of InDel realignment or base quality score recalibration in five whole exome sequencing samples of breast cancer patients. The authors state that the relative contribution of each preparatory step to the accuracy of variant identification is dependent on read depth; the lack of a sufficient number of reads in a low-coverage scenario limits the power of local multiple sequence alignment. In contrast, local realignment can benefit from consistent alignment among reads when coverage is high, thus effectively reducing the number of false positive SNVs. Li  focused on deep Illumina sequencing data from two human cell lines (55–100 fold coverage) and found “only” a 0.1% difference in the number of variants before and after InDel realignment and base quality score recalibration using both SAM and GATK. Although the authors regarded this difference as negligible considering the increased computational costs, a 0.1% difference in the number of variants may represent a good proportion of false positives being eliminated. Unfortunately, the resulting variants were not further analysed for quality (e.g. Ti/Tv ratio) leaving the key question regarding the effect of InDel realignment or base quality score recalibration unanswered.
Both InDel realignment and base quality score recalibration rely on a reference set of high-quality known InDels and SNVs. Existing resources for human sequence (e.g. HapMap , the Omni family of arrays from Illumina  or results from the 1000 genomes project ) provide qualitatively solid references, whereas the quality of bovine resources such as dbSNP  is notably lower. Although there are currently close to 70 million bovine SNVs included in dbSNP, only very few of them are validated (i.e. at least one clustered SNV determined using a non-computational technique or both population frequency data and genotype data included in the entry). In contrast, approximately 28 million human SNV are validated. High-density array information, such as that from the Illumina (BovineHD BeadChip®) or Affymetrix (Axiom® Genome-Wide BOS 1 Array), provides higher-quality information, but for only a limited number of SNVs (there are a total of 908,866 mapped SNVs on these two chips combined). As the number and quality of known InDels and SNVs in the bovine genome increase and reference information improves, we can expect better and more dependable effects of InDel realignment and base quality score recalibration in variant identification pipelines.
Multi sample variant detection
The differences between single and multi sample variant identification methods were generally slight in terms of number of SNVs and Ti/Tv ratios, and depended on software. Results of multi sample variant detection (conducted after InDel realignment and base quality score recalibration) are shown in Table 2. Variant detection with UG (multi sample) resulted in the highest number of variants (22,048,382), followed by HC (19,901,885) and SAM (18,767,273). As observed in single sample variant identification, PL originally had the lowest number of variants (16,894,054), because some variants were “hidden” in multi-SNV replacements as described previously. After splitting multi-nucleotide variants into their allelic primitives, the number of variants identified with PL increased with PL_PRIM to 19,759,134. Le Roex et al.  compared the number of SNVs identified with SAM and GATK in African buffalo and identified considerably more SNVs with GATK than with SAM using multi sample variant identification methods. Though not as pronounced, this agrees with both our single sample and multi sample results.
The total number of SNVs identified by combining single sample results of all 65 animals was higher for PL and SAM than when multi sample variant identification was carried out on all 65 animals simultaneously (Table 2). This was not the case for the UG, although the difference was very slight. Similarly, Liu et al.  found that the UG multi sample pipeline resulted in 16.6% more raw SNVs than single sample results, although they found no difference in the number of SNVs found in single and multi sample SAM pipelines.In terms of variant quality, Ti/Tv ratios for single sample calling with PL, PL_PRIM and UG were higher than those observed in multi sample calling (Figure 2b), whereby the Ti/Tv ratio for single sample calling with SAM was lower than that for multi sample calling (no single sample results available for HC). This difference was most prominent in UG, in which the Ti/Tv ratio dropped by approximately 6% when multi sample calling was applied.
Surprisingly, the total number of InDels identified using single sample variant detection methods was also higher for PL (+4%) and SAM (+34%) than when multi sample methods were used. Again, this was not the case for the UG (-10%). In contrast, Liu et al.  analysed human data and found that multi sample analysis increased the number of InDels identified considerably (SAM: +88.2%, UG: +92.6%) compared to single sample variant detection. It is possible that the number of InDels is inherently lower in cattle populations because their effective population size is much smaller than it is in humans, however a more likely reason for this discrepancy could also be the quality of the reference genome. The percentage of variants identified as InDels in the human genome has been estimated at up to 18%  whereas the number of InDels in cattle has been estimated at only 5.65%  of the total variants identified, although this may also be only a difference in reference quality. Depending on software and methods used, we found 7.19 – 12.25% of the variation observed was due to InDels.
By using default parameters, we did not fine-tune all possible options available in the individual software applications. Nevertheless, using default settings in both single and multi sample variant identification yielded good performance while maintaining output quality. Our goal was to provide an initial overview of methods using the default settings recommended; it should be noted that each dataset must be treated uniquely and alternative parameter settings may deliver more accurate results.
If possible, we recommend a consensus approach for variant identification using SNVs identified by all software, which resulted in the highest SNV quality and should be considered the “golden standard” for variant identification in organisms with lower-quality reference genomes. If computational constraints do not allow a consensus approach to variant identification, the tradeoff between quality and quantity of SNVs must be faced (computation time is discussed in the next section). The UG identified the highest number of SNV in both single and multi sample methods, however the number of false positive SNVs was also highest. SNVs identified with PL had the highest quality of single sample methods, however the number of SNV identified by PL may appear to be low if “hidden” SNVs are not split into allelic primitives. SAM identified a good number of SNVs, which were of comparable quality to those identified by PL. Both PL and SAM are likely a good choice of software for organisms with lower-quality reference genomes, as the built-in InDel realignment algorithm seems to efficiently remove false positives, making the use of lower quality resources (i.e. lower-quality bovine dbSNP information) superfluous.
InDel realignment and base quality score recalibration had only slight effects on the number of SNVs, InDels and multi-allelic sites identified. The effect of InDel realignment on Ti/Tv ratio was only positive for UG, and the effect of base quality score recalibration on Ti/Tv ratio was negligible (PL) or even slightly negative (SAM, UG). Given that computational costs in terms of time are very high, we recommend InDel realignment only in combination with UG. The use of BQSR for organisms with lower-quality resource information seems superfluous until better resources become available.
Concordance, measured as non-reference sensitivity (NRS), non-reference discrepancy (NRD), SNV concordance and genotype concordance, was calculated by comparing variants identified in NGS data to array information from the Illumina BovineSNP50 v1 DNA Analysis BeadChip® (n = 17 samples) and the Illumina BovineHD BeadChip® (n = 48 samples; Table 1). Detailed results of NGS concordance with the Illumina BovineSNP50 v1 DNA Analysis BeadChip® are shown in Additional file 3: Table S3; results of NGS concordance with the Illumina BovineHD BeadChip® are given in Additional file 4: Table S4. In this section we discuss concordance results with the high-density array (medium-density array results mirrored those of the high-density analysis and are not discussed in detail).
The use of multi sample variant detection to identify SNVs improved NRS but worsened NRD compared to single sample variant detection. SNV and genotype concordance improved when multi sample methods were applied. This effect was most pronounced in SNV concordance of homozygous reference genotypes and less pronounced in heterozygous genotypes, whereas both single and multi sample methods identified homozygote alternative genotypes equally well.
The objective of this study was to investigate which methods and software work best for detection of high quality genetic variants using NGS data in cattle. We conclude that InDel realignment and base quality score recalibration have only slight effects on the number and quality of variants identified with the currently available resources for cattle and are costly with respect to computation time. The SNV concordance between variants identified using NGS data and array-based data was higher for multi sample methods than for single sample calling methods, although this was due mainly to the lack of homozygous reference genotypes in single sample results. The quality of SNVs identified (measured as the Ti/Tv ratio) using single sample methods was higher than that of multi sample calling for PL and UG and slightly lower for SAM, whereby a consensus approach using results of different software generally provides the highest variant quality. Computation time for single and multi sample methods was similar when calculated on a per-sample basis. These findings can serve as a reference for variant detection pipeline development in various organisms and help assess the value of preparatory steps in NGS pipelines for species with lower-quality reference genomes.
DNA preparation, sequencing and alignment
Sequencing was done at the Helmholtz Center in Munich, Germany (German Research Center for Environmental Health Center) in collaboration with the Technical University of Munich. Genomic DNA was extracted from semen samples and sequenced using an Illumina HiSeq2000 (Illumina Inc., San Diego, CA, USA). Individual samples were sequenced on individual lanes of the flow cell. The bases of the resulting paired-end reads (101 bp) were identified with the Illumina BaseCaller; FASTQ files  were produced for downstream analysis of the sequence data.
Sequence alignment was done according to the sequence alignment guidelines for producing binary alignment mapping (BAM) files for the 1000 bull genomes project . Briefly, the Burrows-Wheeler aligner (BWA version 0.6.1-r104 ) was used for read alignment to the University of Maryland Bovine reference assembly UMD3.1 build 137 . Conversion from sequence alignment map format to sorted, indexed BAM files was done using SAMtools (version 0.1.18 ). PCR-duplicates were flagged using the MarkDuplicates option of the Picard software tools (version 1.61, ) and the MD5 message-digest algorithm values were examined to ensure correct data transfer from the sequencing lab to the computation center.
Both single and multi sample methods for variant detection were applied. Single sample variant detection was performed with three different software applications: 1) SAM (MPileup / bcftools, version 0.1.18 ), 2) PL (version 0.5.2 ) and 3) the GATK UG (version 2.7-4-g6f46d11 [6, 7]) using default settings. In some cases, PL identified multi-SNV replacements (long alleles containing multiple bases) instead of SNVs. Because multiple SNVs may be embedded within a multi-SNV replacement, the number of SNVs recognised as such for PL was slightly lower than in other software. This problem was alleviated by applying the VariantsToAllelicPrimitives walker of the GATK, which splits multi-SNVs into their allelic primitive states.
For single sample variant identification, three levels of quality recalibration were compared for each animal and software application: a) no quality recalibration (RAW), b) local realignment around insertions and deletions using the GATK IndelRealigner walker (IR ) and c) local realignment around insertions and deletions using the GATK IndelRealigner walker followed by base quality score recalibration using the GATK BaseRecalibrator walker (IR + BQSR ). For IR and IR + BQSR, a set of 256,831 known InDels mapped uniquely to UMD3.1 and that passed the Ensembl quality control process was used to decrease computation time (Ensembl release 74, 2013). Reads surrounding InDels annotated by BWA but not included in the known InDel list were also used to identify further targets for realignment. Pipeline commands are available in Additional file 5. Multi sample variant detection was performed with the same three software applications above as well as with the GATK HC (version 2.7-4-g6f46d11 [6, 7]) using default settings. The HC is recommended by the GATK , but was not included in single sample variant identification due to excessive computation time. Preparatory local InDel realignment and base quality score recalibration were conducted for all 4 multi-sample analyses as described above (IR + BQSR). A schematic overview of methods and preparatory steps investigated in this study is given in Additional file 6: Figure S1.
Computation time required for preparatory steps and variant detection of a 5Mbp portion of BTA24 (single and multi sample methods) was tested on an Intel Xeon E5-2650 machine with two eight-core processors and a total of 256Gb RAM. No parallelism options were used for time calculations. For the whole genome analysis, single sample variant identification was done on a 24-node cluster at Iowa State University. Each computational node has two six-core processors, 64Gb of RAM, and 1 TB scratch space, for a total of 288 processors and 1536Gb RAM. Chromosomes were split into regions of similar size for parallelization; one chromosome was run per core and optimal threading options in GATK were implemented. Multi sample variant identification was done on the CyEnce cluster of Iowa State University, which includes 288 nodes with 128Gb RAM each.
Concordance with high and medium density SNV arrays
The accuracy and completeness of SNVs identified in NGS data were assessed by comparing them to genotypes of the same animals generated with either the Illumina BovineHD BeadChip® (48 animals) or the Illumina BovineSNP50 v1 DNA Analysis BeadChip® (17 animals) (Illumina Inc., San Diego, CA, USA). The high-density array contained 774,660 SNVs mapped to the 29 autosomal chromosomes and the X chromosome (3,302 mitochondrial, Y-chromosomal and non-positioned SNVs were excluded). For the high-density array, 772,821 SNVs were assigned refSNP cluster ID numbers from dbSNP build 137  using SNPchiMp . After passing the Ensembl quality control process (includes plausibility checks on map positions, alleles of reference SNVs, alleles in dbSNP submissions, external failure classifications, etc., ), reference and alternative allele information for 673,396 remaining SNVs was included using the variant calling format files (VCF) available from the National Center for Biotechnology Information . The medium density 50 K array originally contained 54,001 SNVs; after exclusion of unmapped SNVs, assignment of refSNP cluster IDs  and Ensembl quality control , reference and alternative allele information for 34,419 SNVs remained for analysis. The forward-forward coding from the Illumina final reports was used to compare all available array genotypes with the sequence-derived genotypes for single and multi sample variant detection.
Four measures of concordance and discrepancy between array data and NGS data were calculated as described in Table 1. The set of 673,396 SNVs on the Illumina BovineHD BeadChip® (or the set of 34,419 SNVs on the BovineSNP50 v1 DNA Analysis BeadChip®) was considered the total sample space. No differentiation was made between homozygous reference sites and those sites not identified due to low coverage (i.e. all non-polymorphic (non-variant) sites were considered homozygous reference). SNV concordance was calculated by adding the number of homozygote reference, heterozygote and homozygous alternative genotypes identified in the NGS-based data and dividing this sum by the number of SNVs in the array-based data (the total sample space). Genotype concordance was calculated as the number of correctly identified NGS-based genotypes divided by the number of homozygote reference, heterozygote and homozygous alternative genotypes identified in the NGS-based data. NRS and NRD were calculated as proposed by  and applied in . NRS measures the proportion of variant loci identified in the NGS-based data also identified as variants in the array-based data. An NRS of one indicates perfect concordance of variant loci found in the NGS variant set and in the array. NRD represents the proportion of differing genotypes between the NGS variant set and the array. As the NRD represents discrepancy between the NGS variant set and the array, the NRD value should be as close to zero as possible. These measures were calculated chromosome-wise for both single and multi sample results.
Availability of supporting data
All DNA references used were taken from the publically available bovine assembly UMD3.1 available for download from http://www.1000bullgenomes.com/. The identified variants were submitted to the Database of Single Nucleotide Polymorphisms (dbSNP) and are available from: http://www.ncbi.nlm.nih.gov/SNP/.
No animal experiments were performed.
The authors thank 2 anonymous reviewers whose comments and suggestions improved this manuscript considerably. C. Baes thanks M. Berner for excellent technical support, P. von Rohr and U. Schuler for fruitful discussions and critical comments on the manuscript and A. Rimmer and D. Bickhart for very insightful thoughts on variant identification methods. We would like to thank the developers of all methods compared in this paper for making their software available. Financial support from the Swiss Commission for Technology and Innovation and the Swiss Cattle Breeders Federation is gratefully acknowledged. S. Jansen was supported by the German Federal Ministry of Education and Research (BMBF) within the AgroClustEr “Synbreed –Synergistic plant and animal breeding” (FKZ: 0315528A). This research was partially computed on the HPC equipment at Iowa State University (NSF MRI grant number CNS 1229081, NSF CRI grant number 1205413).
- Jensen J, Su G, Madsen P: Partitioning additive genetic variance into genomic and remaining polygenic components for complex traits in dairy cattle. BMC Genet. 2012, 13: 44-PubMed CentralPubMedView ArticleGoogle Scholar
- Van Raden P, O’Connell JR, Wiggans GR, Weigel KA: Genomic evaluations with many more genotypes. Gen Sel Evol. 2011, 43 (1): 10-10.1186/1297-9686-43-10.View ArticleGoogle Scholar
- Horner DS, Pavesi G, Castrignano T, D’Onorio De Meo P, Liuni S, Sammeth M, Picardi E, Pesole G: Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinformatics. 2009, 11: 181-197.PubMedView ArticleGoogle Scholar
- Stratton M: Genome resequencing and genetic variation. Nat Biotechnol. 2009, 26: 65-66.View ArticleGoogle Scholar
- Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009, 6: S6-S12. 10.1038/nmeth.1376.PubMedView ArticleGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA: The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20: 1297-1303. 10.1101/gr.107524.110.PubMed CentralPubMedView ArticleGoogle Scholar
- DePristo M, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Ricas MA, Hanna M, McKenna A, Fennel TJ, Kernytsky AM, Sicachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011, 43: 491-498. 10.1038/ng.806.PubMed CentralPubMedView ArticleGoogle Scholar
- Rimmer A, Mathieson I, Lunter G, McVean G: Platypus: an integrated variant caller. http://www.well.ox.ac.uk/platypus,
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralPubMedView ArticleGoogle Scholar
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.PubMed CentralPubMedView ArticleGoogle Scholar
- Ellegren H: Genome sequencing and population genomics in non-model organisms. Trends Ecol Evol. 2014, 29: 51-63. 10.1016/j.tree.2013.09.008.PubMedView ArticleGoogle Scholar
- Zimin A, Delcher A, Florea L, Kelley DR, Schatz MC, Puiu D, Hanrahan F, Pertea G, Van Tassel CP, Sonstegard TS, Marçais G, Roberts M, Subramanian P, Yorke JA, Salzberg S: A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol. 2009, 10: R42-10.1186/gb-2009-10-4-r42.PubMed CentralPubMedView ArticleGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner D: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G + C)-biased genomes. Nat Methods. 2009, 6: 291-295. 10.1038/nmeth.1311.PubMed CentralPubMedView ArticleGoogle Scholar
- Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M: From FastQ data to high-confidence variant calls: the genome analysis toolkit best practices pipeline. Curr Protoc Bioinform. 2013, 43: 11.10.1-11.10.33.Google Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008, 18: 1851-1858. 10.1101/gr.078212.108.PubMed CentralPubMedView ArticleGoogle Scholar
- Li H: Towards better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics. 2014, 30: 2843-2851. 10.1093/bioinformatics/btu356.PubMed CentralPubMedView ArticleGoogle Scholar
- Liu Q, Guo Y, Li J, Long J, Zhang B, Shyr Y: Steps to ensure accuracy in genotype and SNP calling from Illumina Sequencing data. BMC Genomics. 2012, 13 (Suppl 8): S8-PubMed CentralPubMedGoogle Scholar
- Cheng AY, Teo YY, Ong RT: Assessing single nucleotide variant detection and genotype calling on whole-genome sequenced individuals. Bioinformatics. 2014, 30 (12): 1707-1713. 10.1093/bioinformatics/btu067.PubMedView ArticleGoogle Scholar
- The 1000 Genomes Project Consortium: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.PubMed CentralView ArticleGoogle Scholar
- The International HapMap Project. http://hapmap.ncbi.nlm.nih.gov/thehapmap.html.en,
- Ebersberger I, Metzler D, Schwarz C, Pääbo S: Genomewide comparison of DNA sequences between humans and chimpanees. Am J Hum Gen. 2002, 70 (6): 1490-1497. 10.1086/340787.View ArticleGoogle Scholar
- Hodges E, Smith AD, Kendall J, Xuan Z, Ravi K, Rooks M, Zhang MQ, Ye K, Bhattacharjee A, Brizuela L, McCombie WR, Wigler M, Hannon GJ, Hicks JB: High definition profiling of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. GenomeRes. 2009, 19 (9): 1593-1605.Google Scholar
- Omni Array Family. http://www.illumina.com/applications/genotyping/human-genotyping-arrays/omni-arrays.ilmn,
- dbSNP Bovine Assembly Bos_taurus_UMD_3.1. ftp://ftp.cbcb.umd.edu/pub/data/assembly/Bos_taurus/Bos_taurus_UMD_3.1/Google Scholar
- Le Roex N, Noyes H, Brass A: Novel SNP Discovery in African Buffalo, Syncerus caffer. Using High-Throughput Sequencing. PLoS One. 2012, 7 (11): e48792-10.1371/journal.pone.0048792.PubMed CentralPubMedView ArticleGoogle Scholar
- Liu X, Han S, Wang Z, Gelernter J, Yang B: Variant callers for next-generation sequencing data: a comparison study. PLoS One. 2013, 8 (9): e75619-10.1371/journal.pone.0075619.PubMed CentralPubMedView ArticleGoogle Scholar
- Mullaney JM, Mills RE, Pittard S, Devine S: Small insertions and deletions (INDELs) in human genomes. Hum Mol Genet. 2010, 19 (R2): R131-R136. 10.1093/hmg/ddq400.PubMed CentralPubMedView ArticleGoogle Scholar
- Daetwyler HD, Capitan A, Pausch H, Stothard P, van Binsbergen R, Brøndum RF, Liao X, Djari A, Rodriguez SC, Grohs C, Esquerré D, Bouchez O, Rossignol M, Klopp C, Rocha D, Fritz S, Eggen A, Bowman PJ, Coote D, Chamberlain AJ, Anderson C, VanTassell CP, Hulsegge I, Goddard ME, Guldbrandtsen B, Lund MS, Veerkamp RF, Boichard DA, Fries R, Hayes BJ: Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nat Gen. 2014, 46: 858-865. 10.1038/ng.3034.View ArticleGoogle Scholar
- Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo M, Handsacker RE, Lunter G, Marth GT, Sherry S, McVean G, Durbin R and 1000 Genomes Project Analysis Group: The variant call format and VCFTools. Bioinformatics. 2011, 27 (15): 2156-2158. 10.1093/bioinformatics/btr330.PubMed CentralPubMedView ArticleGoogle Scholar
- Jansen S, Aigner B, Pausch H, Wysocki M, Eck S, Benet-Pagès A, Graf E, Wieland T, Strom TM, Meitinger T, Fries R: Assessment of the genomic variation in a cattle population by re-sequencing of key animals at low to medium coverage. BMC Genomics. 2013, 14: 446-10.1186/1471-2164-14-446.PubMed CentralPubMedView ArticleGoogle Scholar
- Abecasis Lab GLF Tools. http://www.sph.umich.edu/csg/abecasis/glfTools/,
- Goddard ME, Hayes BJ: Genomic selection based on dense genotypes inferred from sparse genotypes. Proc Assoc Advmt Anim Breed Genet. 2009, 18: 26-29.Google Scholar
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, MAller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analysis. Am J Hum Gen. 2007, 81 (3): 559-575. 10.1086/519795.View ArticleGoogle Scholar
- Cock PJA, Fields CJ, Goto N, Heuer ML, Rice PM: The sanger FASTQ file format for sequences with quality scores, and the solexa/illumina FASTQ variants. Nucleic Acids Res. 2010, 38 (6): 1767-1771. 10.1093/nar/gkp1137.PubMed CentralPubMedView ArticleGoogle Scholar
- Hayes B, Daetwyler H, Fries R, Stothard P, Pausch H, van Binsbergen R, Veerkamp R, Capitan A, Fritz S, Lund M, Boichard D, Van Tassell C, Guldbrandtsen B, Liao X, and the 1000 bull genomes consortium: Sequence Alignment Guidelines for producing bam files for the 1000 bull genomes project Version: 15.07.2013. http://www.1000bullgenomes.com/,
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralPubMedView ArticleGoogle Scholar
- Picard Version 1.61. http://picard.sourceforge.net/,
- Nicolazzi EL, Picciolini M, Strozzi F, Schnabel RD, Lawley C, Pirani A, Brew F, Stella A: SNPchiMp: a database to disentangle the SNPchip jungle in bovine livestock. BMC Genomics. 2014, 15: 123-10.1186/1471-2164-15-123.PubMed CentralPubMedView ArticleGoogle Scholar
- Ensembl release 74. http://www.ensembl.org/info/genome/variation/data_description.html#quality_control,
- NCBI Resource Coordinators: Database resources of the National Center for Biotechnology information. Nucleic Acids Res. 2013, 41: D8-D20.PubMed CentralView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.