Hybridization arrays are widely used in research, clinical, and commercial applications involving humans, mice and other organisms to genotype single nucleotide polymorphisms (SNPs). Software is used to infer discrete genotypes using continuous intensity data from bi-allelic probes. However, existing methods are imperfect, leading to incorrect calls and uncalled genotypes ("no-calls").
The Mouse Diversity Array is a high-density genotyping platform similar to the Affymetrix Genome-Wide Human SNP 6.0 array. It contains probes for 623,124 SNPs and 916,269 invariant genomic probes (IGPs) designed to broadly sample diversity within the Mus musculus species. SNP probes occur in sets of eight: two forward-strand and two reverse-strand probes for both an A allele, which corresponds to the reference sequence, and a B allele, which corresponds to the known variant. Opposite-strand probes are potentially offset by up to 10 bp. IGPs occur in pairs, with one probe targeting each strand, and are not offset. The Mouse Diversity Array and similar platforms use genome-wide sampling to reduce genomic complexity by size-selective amplification of restriction fragments. Efficient hybridization requires genomic DNA targeted by a probe set to fall within at least one restriction enzyme fragment in the selected size range (50 bp to 1 kb). The Mouse Diversity Array was designed to use a combination of two restriction enzymes, NspI and StyI, and fragment sizes were predicted based on the mouse reference genome (NCBI mouse genome Build 36).
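The fragment-size requirement can be illustrated with a minimal sketch. It assumes that each enzyme is evaluated independently, that cut positions are already available as sorted coordinates, and that a probe target is a half-open interval; the function names, the data layout, and the example coordinates are hypothetical and are not taken from the array design pipeline.

```python
from bisect import bisect_right

# Size range stated in the text for efficiently amplified fragments.
MIN_FRAGMENT_BP = 50
MAX_FRAGMENT_BP = 1000


def enclosing_fragment(cut_sites, start, end):
    """Return the (left, right) cut positions of the fragment that fully
    contains the target interval [start, end), or None if no such fragment
    exists (a cut falls inside the target, or the target lies outside the
    outermost cuts).

    cut_sites is a sorted list of cut coordinates for ONE enzyme.
    """
    i = bisect_right(cut_sites, start)  # index of first cut strictly > start
    if i == 0 or i == len(cut_sites):
        return None
    left, right = cut_sites[i - 1], cut_sites[i]
    if end > right:
        return None
    return left, right


def is_amplifiable(cuts_by_enzyme, start, end):
    """True if the target lies within at least one enzyme's fragment whose
    length is in the selected size range (assumption: enzymes are treated
    independently)."""
    for cut_sites in cuts_by_enzyme.values():
        fragment = enclosing_fragment(cut_sites, start, end)
        if fragment is None:
            continue
        size = fragment[1] - fragment[0]
        if MIN_FRAGMENT_BP <= size <= MAX_FRAGMENT_BP:
            return True
    return False


# Hypothetical cut coordinates, for illustration only.
example_cuts = {
    "NspI": [100, 2500, 2900],
    "StyI": [0, 1200, 1800, 5000],
}
```

With these made-up coordinates, a 25 bp target at 1500 to 1525 lies in a 600 bp StyI fragment and passes, whereas a target at 3000 to 3025 lies only in a 3200 bp StyI fragment and fails. A variant that creates or destroys a cut site changes `cut_sites` for that sample, which is why restriction-site variation can alter hybridization intensity even when the probe target itself is unchanged.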
Genotype calling programs use a variety of methods to infer discrete genotypes from continuous intensity data. Many methods, including the BRLMM-P algorithm developed by Affymetrix, employ clustering of multiple samples based on the contrast between allelic probe intensities. Samples belonging to the two clusters with a large absolute contrast are called as homozygous genotypes and samples with low contrast are called heterozygous. Samples that do not fall within any of the three clusters in the contrast dimension remain uncalled.
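The logic of contrast-based calling can be sketched as follows. This is a deliberately simplified illustration and not the BRLMM-P algorithm: it assumes a normalized-difference definition of contrast and replaces multi-sample clustering with fixed thresholds. The function names and threshold values are hypothetical.

```python
def contrast(a_intensity, b_intensity):
    """Normalized difference between allelic intensities, in [-1, 1].

    Assumption: this simple ratio stands in for whatever contrast
    transformation a real genotype caller applies.
    """
    total = a_intensity + b_intensity
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return (a_intensity - b_intensity) / total


def call_genotype(a_intensity, b_intensity,
                  hom_threshold=0.6, het_threshold=0.2):
    """Assign AA, AB, BB, or NC (no-call) from one sample's intensities.

    The fixed thresholds are placeholders for cluster boundaries that a
    real caller would estimate from many samples at the same SNP.
    """
    c = contrast(a_intensity, b_intensity)
    if c >= hom_threshold:
        return "AA"   # large positive contrast: homozygous A
    if c <= -hom_threshold:
        return "BB"   # large negative contrast: homozygous B
    if abs(c) <= het_threshold:
        return "AB"   # low contrast: heterozygous
    return "NC"       # between clusters: left uncalled
```

For example, intensities of (1000, 100) give a contrast of about 0.82 and an AA call, (500, 450) give a contrast near 0.05 and an AB call, and (700, 300) give a contrast of 0.4, which falls between the thresholds and remains uncalled.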
In an earlier study we genotyped 162 laboratory mouse strains using the Mouse Diversity Array. We used these data to determine the subspecific origin and haplotype diversity of the laboratory mouse. As a model organism, the laboratory mouse has several distinct features that are key for this study. Laboratory strains sample the genetic variation present in genetically divergent species and subspecies. Most strains are fully homozygous as a result of dozens to hundreds of generations of inbreeding. F1 hybrids obtained by crossing two inbred strains have genotypes that can be accurately predicted from the parental genotypes. Finally, the mouse has a whole-genome reference sequence based on a single inbred strain (C57BL/6J) and 17 additional strains have been sequenced recently as part of the Sanger Institute's Mouse Genomes sequencing project (henceforth referred to as the Sanger strains).
Contrary to our expectation of homozygosity at all SNPs in inbred mouse strains, we observed a substantial number of heterozygous genotype calls [1, 4]. Furthermore, the rates of both no-calls and unexpected heterozygous calls were positively correlated with divergence from the reference genome. The highest rates were observed in strains derived from species of the Mus genus other than Mus musculus, such as M. spretus and M. spicilegus, followed by strains derived from the M. m. musculus and M. m. castaneus subspecies (whereas the C57BL/6J genome is primarily M. m. domesticus in origin). These findings illuminate problems affecting all hybridization arrays, genotype calling software and studies that use these genotype data for a variety of goals. Our studies of well-characterized inbred strains have brought these issues to the forefront and provide an opportunity for investigating the underlying causes of genotyping errors.
At any given time, only a subset of the genetic variation within a species is known. This creates a bias in SNPs available for array designs in favor of variants present in the best-studied individuals, populations or clades. Furthermore, many arrays are designed using an iterative process that selects only probes that perform well across a screening set of samples. This is done to ensure low miscall and no-call rates, but can introduce further bias. Miscall and no-call rates can vary greatly depending on the composition of samples. As we have observed, and as noted in other studies, miscall and no-call rates are positively correlated with genetic divergence from the reference sequence used to design the array. Furthermore, when SNP probes are excluded from analyses due to post-hoc filtering based on no-call rate, unexpected heterozygosity or departure from Hardy-Weinberg equilibrium, important information is lost (discussed below) and further bias is introduced. In a recent genome-wide analysis of a large number of dog breeds, over 50% of SNPs were excluded for such reasons. The cumulative effect of these SNP selection procedures can potentially skew the interpretation of experimental results and limit researchers' ability to effectively study genetically divergent samples. The Mouse Diversity Array was designed with attention to the phylogenetic origin of SNPs, but SNP selection will still introduce some biases, especially in studies that include wild-derived strains or wild-caught mice.
Essentially, a no-call or incorrect genotype call is the result of abnormal hybridization intensity for a sample at a given SNP and may be due to technical or biological causes. Technical issues, such as problems in array manufacturing or DNA processing, would result in systematic errors that affect either all samples at a given SNP (such as an incorrect probe sequence on the array) or all SNPs in a single sample (such as incomplete digestion). Errors of this class should be detectable. In addition, non-systematic stochastic errors may affect a small subset of genotype calls.
An additional source of genotype calling errors is biological in origin and can be attributed to previously uncharacterized variation in genomic DNA, either in the sequence targeted by a probe set or in the proximal or distal restriction sites used for genome-wide amplification. These variants can reduce hybridization intensity sufficiently to eliminate or reverse the contrast between allelic probes such that an incorrect genotype call (or no-call) is made. We term such variants "off-target variants" (OTVs) to distinguish them from the expected variant targeted by the SNP probe set. We term probe sets affected by OTVs variable intensity oligonucleotides (VINOs) due to the dynamic effect of OTVs on hybridization intensity.
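How reduced hybridization can erode allelic contrast may be seen in a toy model. The model is an assumption made purely for illustration: observed intensity is taken to be allele-specific signal scaled by an attenuation factor, plus a constant background shared by both probes. The function names and all numbers are hypothetical and do not describe measured array behavior.

```python
def observed_intensities(signal_a, signal_b, attenuation, background):
    """Toy model (assumption): an OTV scales both allelic signals by the
    same attenuation factor, while background is added unchanged."""
    a = signal_a * attenuation + background
    b = signal_b * attenuation + background
    return a, b


def contrast(a_intensity, b_intensity):
    """Normalized difference between allelic intensities."""
    return (a_intensity - b_intensity) / (a_intensity + b_intensity)


# A homozygous A sample at a hypothetical SNP.
signal_a, signal_b, background = 1000.0, 100.0, 80.0

# No OTV: full hybridization, contrast stays strongly positive.
no_otv = contrast(*observed_intensities(signal_a, signal_b, 1.0, background))

# OTV present: hybridization reduced to 5% of normal, so background
# dominates both probes and the contrast collapses toward zero.
with_otv = contrast(*observed_intensities(signal_a, signal_b, 0.05, background))
```

Under these assumed values the contrast falls from roughly 0.71 to roughly 0.21, so a truly homozygous sample drifts toward the low-contrast region where it could be called heterozygous or left uncalled. The sketch shows only loss of contrast; it does not attempt to model a reversal of contrast.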
Genotyping errors due to uncharacterized sequence variation have been observed in microsatellite genotyping (termed "null alleles") and were recently subjected to systematic analysis [9, 10]; however, they have gone largely unaddressed in SNP genotyping studies. In this study, we investigated the effect and extent of OTVs in a diverse collection of inbred strains and intercrossed mice using the Mouse Diversity Array. We conclude that OTVs are the primary cause of miscalls and no-calls. Furthermore, we determined that a substantial fraction of VINOs can be reliably identified, and we have developed MouseDivGeno, a novel genotype-calling algorithm implemented as a package for the R language. We demonstrate the accuracy of our algorithm by comparison with other genotype calling software and with whole-genome sequence of the Sanger strains. The ability to recognize VINOs and treat samples having OTVs as a distinct genotype class will enable SNP discovery, increase the power of evolutionary and association studies, and lend itself to potential clinical applications. Finally, we investigated the extent of off-target variation and its effect on genotype calls. Our findings suggest ways to improve both array design and genotype calling algorithms.