Efficient single nucleotide polymorphism discovery in laboratory rat strains using wild rat-derived SNP candidates

Background The laboratory rat (Rattus norvegicus) is an important model for studying many aspects of human health and disease. Detailed knowledge on genetic variation between strains is important from a biomedical, particularly pharmacogenetic point of view and useful for marker selection for genetic cloning and association studies. Results We show that Single Nucleotide Polymorphisms (SNPs) in commonly used rat strains are surprisingly well represented in wild rat isolates. Shotgun sequencing of 814 Kbp in one wild rat resulted in the identification of 485 SNPs as compared with the Brown Norway genome sequence. Genotyping 36 commonly used inbred rat strains showed that 84% of these alleles are also polymorphic in a representative set of laboratory rat strains. Conclusion We postulate that shotgun sequencing in a wild rat sample and subsequent genotyping in multiple laboratory or domesticated strains rather than direct shotgun sequencing of multiple strains, could be the most efficient SNP discovery approach. For the rat, laboratory strains still harbor a large portion of the haplotypes present in wild isolates, suggesting a relatively recent common origin and supporting the idea that rat inbred strains, in contrast to mouse inbred strains, originate from a single species, R. norvegicus.


Background
Genetic variation exists between individuals (or strains) of all organisms and makes up the genetic basis for phenotypic differences between individuals. In addition, genetic variation functions as a valuable resource for mapping phenotypic traits in model organisms. Single Nucleotide Polymorphisms (SNPs) are the most abundant form of genetic variation and therefore dominate high-resolution genetic mapping strategies. Moreover, numerous well-performing high-throughput SNP detection technologies have been developed, such as oligonucleotide array-based technology, mass-spectrometry-based technology (MALDI-TOF), and sequence-based technology (pyrosequencing, DHPLC) [1], which makes automated SNP detection preferable to the more labor-intensive detection of microsatellite markers [2].
Since the availability of its genome, the laboratory rat is gaining influence as a genetic model organism [3]. In addition, over 200 well-characterized inbred strains that are models for a wide variety of human diseases are available [4,5]. However, the availability of genetic tools, such as a dense genome-wide SNP marker set, still lags behind that for other commonly used model organisms. This is illustrated by the number of entries in dbSNP, the central SNP repository of NCBI [6]: the numbers of human (>10,000,000), chicken (>3,000,000), and mouse (>500,000) entries far exceed the number of rat entries (>43,000). In the search for rat SNPs, experimental [7,8] and computational [9] approaches have been employed, but these efforts primarily resulted in SNPs associated with coding regions. For genetic mapping purposes, a much denser marker set, preferably evenly distributed over the genome, is required.
Laboratory rat strains are thought to be established from a limited number of founder animals originating from a domesticated wild population [10,11]. The value of inbred strains emanates from the close genetic uniformity that facilitates phenotyping and genotyping. In principle, inbred strains are selectively bred for certain traits from a genetically diverse pool, comprising diverse genetic information about the trait. However, since many of the current rat strains were derived from common ancestral stocks and simply inbred to increase genetic uniformity, inbred strains clearly share alleles [12]. Although such simplified models are essential for biomedical research, modulating effects on the clinical manifestation of a trait resulting from genetic heterogeneity in a population can only be studied to a limited extent in F1 hybrids. The use of a carefully chosen selection of inbred strains may address this issue, but the choice depends on knowledge on the relationship between the strains and hence the degree of genetic variation. Alternatively, wild-derived strains may be good alternatives to introduce sufficient genetic variation in laboratory experiments [13,14].
Based on a preliminary observation that alleles from laboratory rat strains are frequently detected in wild-derived samples, we developed a wild rat-based SNP discovery approach. The method consists of shotgun sequencing of a wild rat-derived genomic library followed by comparison with the published rat genome (strain Brown Norway). Genotyping commonly used rat strains for newly identified SNPs revealed that 84% of SNP-alleles (and 87% of all genetic variation) occurring between BN and a single wild individual is also represented in one or more laboratory strains. A user-friendly webtool allows exploration of the genetic variation between any arbitrary combinations of two strains that were used in this study, making all information directly available for experimental use.

Wild rat-based SNP discovery
It is generally believed that commonly used rat strains originate from a wild-derived founder population of limited size [10]. To examine whether polymorphisms found in laboratory strains are still represented in individuals of the wild population, we typed two wild-derived samples for confirmed SNPs of the CASCAD database [9]. Interestingly, about 53% of alleles (n = 147), which were confirmed to exist in laboratory strains, were also represented in wild 1, wild 2 or both (not shown). Hence, a preselection of highly likely candidate SNPs could potentially be made by genotyping wild individuals and comparing the sequences to the rat genome sequence (Brown Norway).
Accordingly, we performed random shotgun sequencing on a genomic library of a wild rat (wild 1). We generated shotgun traces (814 Kbp) by bidirectional sequencing of about 1,600 colonies (Table 1). Of the reads, 85.5% (2545/2975; Table 1) could be mapped to a unique location in the Brown Norway rat genome using BLAT [15], resulting in the automated identification of nearly 5,000 ambiguous nucleotide positions (potential polymorphisms). Manual inspection of the sequencing reads reduced this set of potential polymorphisms to a set of 746 real SNPs and 122 indels. The average SNP rate between BN (BN/SsNMcw; genome sequencing project) and this single wild rat is estimated to be about 1 per 900 bp and, hence, discovery of a novel SNP can be expected every second shotgun read. A subset of the discovered SNPs was verified and genotyped in 36 commonly used strains (including BN). To this end, we designed primers for 451 SNP-containing amplicons (about 300 bp) of which 416 (92.2%) were successfully read by unidirectional sequencing of the PCR products, resulting in roughly 119 Kbp high quality sequence per strain or individual (Table 1).

Wild rat-derived SNP characteristics
The verification of 746 candidate SNPs by amplicon-based resequencing in 36 inbred rat strains and three wild-derived samples (wild 1, 2, and 3) revealed 960 polymorphisms, consisting of 90 indels, seven 2-bp substitutions, one 3-bp substitution, one 5-bp substitution, and 861 SNPs, of which only one was tri-allelic. The amplicons are randomly distributed over the genome (Fig. 1). We observed heterozygous positions in the outbred strains, but unexpectedly some were also found in the inbred strains (for detailed information: [see Additional file 1] or [6]). For our analysis, we considered these loci to be polymorphic as compared to the BN genome sequence.
Of the 746 shotgun-based candidate SNPs, 685 were located in the 416 PCR amplicons that worked, and 485 (71%) were reconfirmed by resequencing (shotgun-based; Table 2). Strikingly, for 408 (84%) of the confirmed SNPs, the wild rat allele is also present in one or more commonly used strains, with only 36 (7.4%) being specific to BN (Table 2). For the remaining 77 (16%) SNPs, the wild rat allele is not present in any of the 36 selected strains and could be considered wild rat-specific. These results illustrate that shotgun sequencing one wild individual efficiently identifies shared polymorphisms among commonly used rat strains.
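The percentages quoted above follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check; the variable names and the helper pct are ours and purely illustrative.

```python
# Counts as reported in the text (shotgun-based SNPs; Table 2).
candidates_in_working_amplicons = 685  # candidate SNPs in the 416 amplicons that worked
confirmed = 485                        # reconfirmed by resequencing
shared_with_lab_strains = 408          # wild rat allele seen in one or more strains
wild_specific = 77                     # wild rat allele seen in none of the 36 strains
bn_specific = 36                       # SNPs described in the text as specific to BN


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


confirmation_rate = pct(confirmed, candidates_in_working_amplicons)
shared_rate = pct(shared_with_lab_strains, confirmed)
wild_specific_rate = pct(wild_specific, confirmed)
bn_specific_rate = pct(bn_specific, confirmed)

print("confirmed:", confirmation_rate)
print("shared with laboratory strains:", shared_rate)
print("wild rat-specific:", wild_specific_rate)
print("BN-specific:", bn_specific_rate)
```

The shared and wild rat-specific classes together account for all 485 confirmed SNPs, which is why their percentages sum to 100%.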
While genotyping by resequencing, 358 novel SNPs were discovered that were not identified in the shotgun sequencing experiment (genotyping-based; Table 2). About 39% (139) of this set can be accounted for by differences in sequence coverage between the shotgun reads and the resequencing genotyping reads (Table 2), whereas the remainder of this set is strongly biased towards SNPs that are not polymorphic between BN and wild rat 1 and thus could not have been discovered in the shotgun experiment. Interestingly, about 37% of the newly discovered SNPs are polymorphic between the shotgun-sequenced wild rat and any of the inbred strains (Table 2). When considering all SNPs that are polymorphic in the set of 36 commonly used laboratory strains, for the majority (66%) the wild rat allele is also found in one of the strains (total; Table 2), and this percentage increases only slightly (to 70%) when two other wild individuals (wild 2 and 3) are included in the analysis. This indicates that wild rat-based SNP discovery is already highly efficient using a single wild sample.
The three wild rats resulted in 438 polymorphic positions, whereas the most polymorphic combination of inbred strains in this experiment (BN, BH, and SHR) yielded 427 SNPs. This indicates that three random, but potentially related, Dutch wild rats are about as polymorphic as three carefully selected inbred strains. Inclusion of wild isolates from other locations worldwide may increase the efficiency of the SNP discovery approach.

Intraspecific phylogenetic network
Relationships among different rat strains have been determined previously by phylogenetic tree reconstruction based on microsatellite markers [16,17]. However, intraspecific relationships for laboratory strains are often very challenging to determine, due to small genetic distances and complex gene flow. The resulting multitude of plausible trees is best expressed by a network, which displays alternative potential evolutionary paths in the form of cycles [18]. We used Network software (v4.111 Reduced-Joining, [19]) to construct a spatial network, based on 861 SNP markers in 36 rat strains and three wild rat individuals (Fig. 2). The three wild individuals are grouped together, possibly due to the geographic and possibly genetic relation between the samples, but in accordance with the last paragraph of the previous section, they appear relatively unrelated as compared to the set of inbred strains.
The majority of the SNPs (485 of 861) was selected for being polymorphic between wild 1 and BN. As a result, different BN substrains (BN/Ztm, BN/Crl), depicted as a double-sized end node because of high similarity, and different wild rat individuals (wild 1, wild 2, and wild 3) are grouped together as the outliers. Several strains that are known to be closely related (source RGD-strains: [20]) are also grouped together, like DA and COP or SS and SR. Interestingly, WKY is also an outlier, indicating that besides BN, this strain can be utilized as an alternative mapping strain. WKY is already commonly used as a normotensive control strain in genetic mapping of blood pressure quantitative trait loci [21]. WKY is known to be closely related to SHR and these strains are indeed grouped together (Fig. 2). Additionally, BDII and BDIX are related and BDE is an RI strain from E3. These strain combinations are also grouped together. Wistar is contributing to a large subset of these strains, like WKY, WC, BDII, MWF, LEW, and WF, which contributes to the complexity of the network structure.

Data availability
Genetic markers have long been used for mapping traits in rat strains. Current marker sets in rats are mostly limited to microsatellites [22,23], which are not abundantly available and are commonly detected in a more laborious way than SNPs. In this study, we have determined a total of about 35,000 genotypes (about 960 loci in 36 inbred strains), the vast majority of which are SNPs. These data are accessible via a versatile webtool [24]. Pairs of strains of interest can be selected and explored for the presence of verified genetic variation. Besides a graphical representation of the location of the SNPs on a genome map, primer sequences that were successfully used in our experiments are also provided. In a pairwise comparison matrix (Table 3), we plotted the absolute number of polymorphic positions for each of the (sub-)strains or individuals used. Interestingly, for some strains different alleles are observed in substrains (e.g. BN/Crl differs from BN/Ztm at 4 positions), in line with previous observations [8].
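To illustrate how a pairwise comparison matrix of this kind can be derived from genotype data, the following minimal Python sketch counts, for each pair of strains, the loci at which both strains have a call and the calls differ. The strain names are taken from this study, but the genotype values and the function name pairwise_differences are hypothetical and serve only as an example; this is not the code behind the webtool [24].

```python
from itertools import combinations


def pairwise_differences(genotypes):
    """Count differing loci for every pair of strains.

    genotypes maps a strain name to a list of allele calls, one per locus,
    in the same locus order for all strains; None marks a missing call.
    Loci with a missing call in either strain are skipped.
    """
    counts = {}
    for a, b in combinations(sorted(genotypes), 2):
        counts[(a, b)] = sum(
            1
            for x, y in zip(genotypes[a], genotypes[b])
            if x is not None and y is not None and x != y
        )
    return counts


# Hypothetical calls at five loci (illustration only, not data from this study).
toy = {
    "BN/Crl": ["A", "C", "G", "T", "A"],
    "BN/Ztm": ["A", "C", "G", "T", "G"],
    "SHR": ["G", "C", "A", "T", None],
}
matrix = pairwise_differences(toy)
for pair, n_diff in sorted(matrix.items()):
    print(pair, n_diff)
```

Skipping loci with a missing call keeps failed reads from being counted as differences, at the cost of comparing slightly different locus sets per pair.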

Simulation experiment wild rat-based SNP discovery
To gain insight into the benefits of using wild rats in SNP discovery studies, we simulated larger scale experiments based on the results obtained in the experiments described above. Shotgun sequencing of 814 Kbp resulted in the identification of 485 SNPs. For 408 of those, the wild rat allele was also represented in laboratory rat strains and hence of interest for research purposes. The maximum number of SNPs that can be discovered by fully sequencing this single rat is calculated by multiplying the SNP frequency (408/814,440) by the rat genome size (2.48 Gbp), which is 1,252,911 SNPs. Since none of our shotgun reads were overlapping, we can calculate the relation between the number of shotgun sequencing reads of the wild rat and the number of SNPs that will be found by scaling up this methodology, assuming random distribution of 400 bp shotgun reads over the genome (Fig. 3a). One million shotgun reads of a single wild rat would already result in the discovery of 200,000 novel SNPs that are polymorphic in commonly used rat strains. This simulation indicates that a relatively small sequencing effort could potentially result in a vast expansion of the amount of documented genetic variation for the rat.
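The scaling argument above can be approximated in closed form. The sketch below is our own illustration, not the simulation used for Fig. 3a: it assumes reads are placed uniformly and independently, so that the expected uncovered fraction of the genome after N reads of length L is approximately exp(-N·L/G), and it ignores chromosome ends. The function names are hypothetical.

```python
import math

GENOME_BP = 2.48e9        # rat genome size used in the text
READ_BP = 400             # shotgun read length assumed in the text
SNP_RATE = 408 / 814_440  # informative SNPs per sequenced bp (from the text)


def expected_covered_bp(n_reads, read_bp=READ_BP, genome_bp=GENOME_BP):
    """Expected distinct bases covered by n_reads uniformly placed reads.

    Uses the Poisson approximation: uncovered fraction = exp(-n * L / G).
    """
    uncovered_fraction = math.exp(-n_reads * read_bp / genome_bp)
    return genome_bp * (1.0 - uncovered_fraction)


def expected_snps(n_reads):
    """Expected informative SNPs = polymorphism rate x covered genome size."""
    return SNP_RATE * expected_covered_bp(n_reads)


for n in (100_000, 1_000_000, 10_000_000):
    print(f"{n:>10,d} reads -> about {expected_snps(n):,.0f} SNPs")
```

Under these assumptions one million reads give a figure of the same order as the roughly 200,000 SNPs quoted above, and the curve saturates at the rate multiplied by the genome size (about 1.2 million) as overlapping reads stop adding new sequence.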
Because shotgun sequencing was only done in wild rat 1, we cannot make a direct comparison between wild rat-based SNP discovery and SNP discovery based on rat strains separately. However, a similar simulation experiment can be performed by treating the genotyping resequencing reads as shotgun reads. For wild 1, this would result in the identification of 577 SNPs as compared to the BN genome sequence. For 539 of those, the wild rat allele is also found in one of the inbred strains. For the combination of three strains most polymorphic as compared to BN in this experiment, the latter number would be 304, 292, and 287 for AUG, SHR, and WF, respectively. Simulations based on these numbers show that nearly twice as much shotgun sequencing in different inbred strains separately is required to discover the same number of SNPs that can be found using the wild rat shotgun sequencing approach. It should be mentioned that parallel shotgun sequencing of all 36 inbred strains until saturation has the potential to yield 1.6 times as many SNPs as the wild-derived approach (Fig. 3b). An advantage of using inbred strains for SNP discovery is that the genotype of the strain is immediately known. Nevertheless, reconfirmation of the SNP or genotyping of other strains of interest may be necessary anyway, minimizing the relevance of this advantage.

Discussion
An increase in the amount of documented genetic variation for the rat will be essential to allow for high-resolution genetic mapping of the many inherited traits that have now been described for a wide variety of rat inbred strains. In addition, insight into genetic variation between rat strains provides valuable information on genetic relationships between strains, which can be instrumental to dissect the genetic basis of phenotypic differences. The wild rat-based shotgun sequencing method described here provides an efficient approach to generate such a dense map of genetic variation. To be able to benefit from haplotype-based mapping approaches [25-28], a high marker density is needed to first reliably define haplotype blocks in strains of interest [29]. For the mouse, it has recently been announced that 15 inbred strains will be fully resequenced to achieve this goal [30]. With extremely dense genotype maps, it may even become possible to clone traits by haplotype-based in silico mapping [25], but to achieve this, it is estimated that complete sequences of over 50 strains are needed [29]. Although the densities needed for these approaches have not been reached, we do show here that wild rat-based SNP discovery is potentially much more effective than shotgun sequencing different inbred strains. We propose that the most effective SNP discovery strategy for the rat would be one based on shotgun sequencing of a single wild-derived sample and subsequent low-cost high-throughput genotyping of the resulting candidates in the laboratory strains of interest. Many other model organisms are currently undergoing full coverage sequencing and SNP discovery in these organisms will become increasingly important, especially for those organisms that are selectively bred for specific traits, such as cow and pig. Pilot experiments using, for example, wild-derived swine samples could be performed to test whether the wild isolate-based SNP discovery strategy can be transferred efficiently to other organisms.
Figure caption: Strain relationships in a network structure.
Table caption: The matrix is built from genotyping data of 960 polymorphisms in 36 strains and three wild individuals. Two inbred strains are represented by two substrains (BN and DA) and outbred SD is represented by two individuals from different stocks. Sets of polymorphisms, including a graphical representation, can be retrieved from [24].
Our results also provide insight into the genetic descent of the laboratory rat. It is generally accepted that current rat strains underwent two major genetic bottlenecks. First, they originate from a small founder population of domesticated wild rats and second, they were selectively inbred to obtain homogeneity [11]. The three Dutch wild rats used in this study are potentially relatively closely related as compared to wild rats from different parts of the world, but the genetic variation between them is mostly larger than, or occasionally equal to, that of any combination of three inbred strains, indeed suggesting the existence of a common genetic bottleneck for laboratory strains. In addition, the laboratory rat does not show an extensive polymorphism rate in the MHC (major histocompatibility complex) as compared to other species [31], such as human and cattle. Cramer et al. analyzed the MHC of wild rats and compared the data with those from inbred strains [32]. In line with our observation, there were not many new haplotypes.
We observed that wild rat genetic variation is to a large extent represented in the inbred strains, which is in sharp contrast to genetic variation in wild-derived mouse strains, which is mostly unique [33]. In contrast to classical mouse inbred strains, where multiple subspecies contribute to the genetic make-up [13,34], and recent mouse strains, derived from different Mus species [35], laboratory rat strains most likely descend from a single rat species, Rattus norvegicus [10].
An independent study using 42 microsatellites in German and Japanese wild-derived samples showed that the genetic profiles were quite divergent, partially owing to different geographic locations [36]. Our study involved only Dutch wild rats, suggesting that the inclusion of wild rats from different parts of the world could result in even more efficient SNP discovery, although it also remains to be demonstrated what proportion of the additional discovered alleles is present in the inbred strains and if a geographic bias for this exists.
When multiple SNPs are present per locus/amplicon, independent haplotypes can be discerned. The genetic variation identified here is mostly organized in a limited number of haplotypes per locus (Table 4). Theoretically, an amplicon containing two or three SNPs can be represented by four and eight haplotypes, respectively, but in our dataset the vast majority of amplicons harboring multiple SNPs are represented by only two or three haplotypes (Table 4). Again, these observations suggest the existence of a common and small founding population with very limited haplotype diversity and/or a very narrow genetic bottleneck before inbred strain selection. The observed small genetic basis in a wide selection of laboratory rat strains does not mimic genetic variation in the human population and as a result, studies and pharmacological tests in rat models neglect potential modulatory effects caused by genetic variation. Although the use of F1 crosses and mosaic populations [37] could address this issue, our data suggest that wild-derived rats may be very useful to this end, since a large part of the genetic variation present in a large selection of inbred strains is already represented in a limited number of individuals. Therefore, it would be very interesting to investigate genetic variation in recently domesticated inbred [38] and outbred rats such as wild-type Groningen rats (WTG) [39]. Alternatively, careful selection of inbred strains based on genotyping data and subsequent random breeding may also expose the wild side of laboratory rats.
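The comparison between theoretical and observed haplotype numbers can be made explicit with a small sketch. Assuming homozygous calls and bi-allelic SNPs, an amplicon with n SNPs allows at most 2^n haplotypes; the Python example below counts the distinct haplotypes actually observed per amplicon. The amplicon names, genotypes, and the function haplotype_summary are hypothetical and only illustrate the counting; they are not taken from Table 4.

```python
def haplotype_summary(amplicons):
    """Summarise haplotype diversity per amplicon.

    amplicons maps an amplicon name to a non-empty dict {strain: haplotype},
    where each haplotype is a string with one allele character per SNP.
    Assumes homozygous calls and bi-allelic SNPs.
    """
    summary = {}
    for name, calls in amplicons.items():
        haplotypes = set(calls.values())
        n_snps = len(next(iter(haplotypes)))
        summary[name] = {
            "snps": n_snps,
            "observed": len(haplotypes),
            "theoretical_max": 2 ** n_snps,
        }
    return summary


# Hypothetical genotypes (illustration only, not data from this study).
toy = {
    "amplicon_1": {"BN": "AC", "SHR": "AC", "WKY": "GT", "wild 1": "GT"},
    "amplicon_2": {"BN": "ACT", "SHR": "GCT", "WKY": "GTC", "wild 1": "ACT"},
}
result = haplotype_summary(toy)
for name, stats in result.items():
    print(name, stats)
```

In this toy example the two-SNP amplicon shows two of four possible haplotypes and the three-SNP amplicon three of eight, mirroring the pattern described above.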

Conclusion
We describe a SNP discovery platform for the rat that is based on two steps. First, candidate SNPs are discovered by shotgun sequencing a wild rat; second, the candidates are genotyped in laboratory strains of interest. We show that 84% of alleles in wild rats as compared to the sequenced Brown Norway rat genome are also represented in a set of 36 laboratory strains. Hence, the approach described here would be an efficient strategy for the discovery of novel informative SNPs in the laboratory rat. Inclusion of other wild samples, preferably from different locations in the world, could result in an even more effective SNP discovery platform, as the three wild rats in our study, caught in relatively close proximity to each other, were already more polymorphic than the most polymorphic combination of carefully selected inbred strains. Based on the more than 34,000 genotyping datapoints obtained in this study, we postulate two things. First, laboratory rats originate from a single rat species, and inbred strains are relatively closely related with a limited number of haplotypes, reflecting known genetic bottlenecks in strain establishment. Second, wild rats have the potential to represent the degree of genetic variation present in the human population much more efficiently than a random selection of inbred strains. This makes them or wild-derived strains potentially well-suited for studying modulatory effects of genetic background variation on specific phenotypes, such as behavior or responses to drug treatment.

Sequencing reactions, purification, and analysis
PCR products were diluted with 25 µl water and 1 µl was directly used as template for the sequencing reactions. Sequencing reactions, containing 0.25 µl BigDYE (v3.1; Applied Biosystems, Foster City, CA, USA), 3.75 µl 2.5× dilution buffer (Applied Biosystems) and 0.4 µM universal M13 primer in a total volume of 10 µl, were performed using cycling conditions recommended by the manufacturer (40 cycles of 92°C for 10 sec, 50°C for 5 sec and 60°C for 120 sec). Of sequencing products, 5 µl was purified by ethanol precipitation in the presence of 40 mM sodium-acetate and analyzed on 96-capillary 3730XL DNA analyzers (Applied Biosystems), using the standard RapidSeq protocol. Sequences were analyzed for presence of heterozygous mutations using PolyPhred [40], followed by manual inspection of the polymorphic positions.

Automation
All PCR and sequencing reactions were set up on a Tecan Genesis RSP200 liquid handling workstation, with a robotic arm and an 8-channel pipetting arm, an integrated 96-channel pipetting head (TEMO96, Tecan), and four integrated dual-384 well PCR blocks (Applied Biosystems).

Mapping of shotgun reads and SNP discovery
Shotgun reads were assigned to positions in the RGSC 3.1 rat genome assembly using BLAT [15]. Shotgun reads that complied with our mapping criteria, namely at least 80 identical bp for the best hit and no more than 60 identical bp for the second-best BLAT hit, were retained for further analysis. BLAST nucleotide sequence alignments between each shotgun read and the corresponding genomic segment were used for discovery of single base variations (including single base indels). A site was treated as polymorphic only when its 5'- and 3'-flanks were identical over at least 5 bp. A custom-designed web application was employed for manual chromatogram inspection and confirmation of a correct shotgun base-call for every polymorphic SNP locus. Primer design for resequencing was performed using a local web-interface [41] to the PRIMER3 program [42].
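The two filters described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the pipeline used in this study: hits are represented only by their number of identical bp, the alignment is given as two equal-length strings (read and reference) with '-' for gaps, and the function names are hypothetical.

```python
def passes_mapping_criteria(hit_identities, min_best=80, max_second=60):
    """Keep a read if its best hit has >= min_best identical bp and its
    second-best hit (if any) has <= max_second identical bp.

    hit_identities: identical-bp counts of all hits for one read, any order.
    """
    if not hit_identities:
        return False
    ranked = sorted(hit_identities, reverse=True)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0
    return best >= min_best and second <= max_second


def has_identical_flanks(read_aln, ref_aln, pos, flank=5):
    """True if the `flank` aligned bases on each side of column `pos`
    are identical between read and reference."""
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    if pos < flank or pos + flank >= len(read_aln):
        return False  # not enough sequence on one side of the site
    left_ok = read_aln[pos - flank:pos] == ref_aln[pos - flank:pos]
    right_ok = read_aln[pos + 1:pos + 1 + flank] == ref_aln[pos + 1:pos + 1 + flank]
    return left_ok and right_ok


# Hypothetical example: a single mismatch at column 5 with clean flanks.
ref = "ACGTAGCATGC"
read = "ACGTATCATGC"
print(passes_mapping_criteria([250, 40]), has_identical_flanks(read, ref, 5))
```

Requiring clean flanks discards variant calls that sit next to other mismatches or gaps, which in practice are often alignment or base-calling artefacts rather than true polymorphisms.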

Simulation model for wild rat-based SNP discovery
To estimate the number of SNPs to be discovered by the wild rat resequencing approach, we performed computer simulations using the observed sample-specific polymorphism frequencies and the rat genome size of 2.48 Gbp as input. We used a Monte-Carlo method to place N 400-bp shotgun reads on the genome and calculated the total size of the genome covered by the N shotgun reads. To obtain a conservative estimate, we assumed low heterozygosity in the wild-derived sample, so that the estimated number of SNPs is given by the product of the covered genome size and the polymorphism rate.
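A simulation of this kind can be sketched in a few lines of Python. The version below is a minimal illustration under our own assumptions, not the original implementation: the genome is treated as a single linear sequence, read start positions are drawn uniformly, and the covered size is obtained by merging overlapping reads. The function names and the seed parameter are hypothetical.

```python
import random


def simulate_covered_bp(n_reads, read_bp, genome_bp, rng):
    """Place n_reads reads of read_bp bases uniformly at random on a linear
    genome of genome_bp bases; return the number of distinct bases covered."""
    starts = sorted(rng.randrange(genome_bp - read_bp + 1) for _ in range(n_reads))
    covered = 0
    block_start = block_end = None
    for start in starts:
        end = start + read_bp
        if block_end is None or start > block_end:
            # Close the previous block of overlapping reads, open a new one.
            if block_end is not None:
                covered += block_end - block_start
            block_start, block_end = start, end
        else:
            # Read overlaps (or touches) the current block: extend it.
            block_end = max(block_end, end)
    if block_end is not None:
        covered += block_end - block_start
    return covered


def estimate_snps(n_reads, snp_rate, read_bp=400, genome_bp=2_480_000_000, seed=0):
    """Estimated SNP count = covered genome size x polymorphism rate."""
    rng = random.Random(seed)
    return snp_rate * simulate_covered_bp(n_reads, read_bp, genome_bp, rng)


rate = 408 / 814_440  # informative SNPs per bp, from the shotgun experiment
for n in (10_000, 100_000):
    print(n, round(estimate_snps(n, rate)))
```

Sorting the start positions first lets the covered size be computed in a single pass, so the cost is dominated by the sort rather than by a per-base coverage array of genome size.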