High-throughput polymorphism detection and genotyping in Brassica napus using next-generation RAD sequencing

Background The complex genome of rapeseed (Brassica napus) is not well understood despite the economic importance of the species. Good knowledge of sequence variation is needed for genetics approaches and breeding purposes. We used a diversity set of B. napus representing eight different germplasm types to sequence genome-wide distributed restriction-site associated DNA (RAD) fragments for polymorphism detection and genotyping. Results More than 113,000 RAD clusters with more than 20,000 single nucleotide polymorphisms (SNPs) and 125 insertions/deletions were detected and characterized. About one third of the RAD clusters and polymorphisms mapped to the Brassica rapa reference sequence. An even distribution of RAD clusters and polymorphisms was observed across the B. rapa chromosomes, which suggests that there might be an equal distribution over the Brassica oleracea chromosomes, too. The representation of Gene Ontology (GO) terms for unigenes with RAD clusters and polymorphisms revealed no signature of selection with respect to the distribution of polymorphisms within genes belonging to a specific GO category. Conclusions Considering the decreasing costs for next-generation sequencing, the results of our study suggest that RAD sequencing is not only a simple and cost-effective method for high-density polymorphism detection but also an alternative to SNP genotyping from transcriptome sequencing or SNP arrays, even for species with complex genomes such as B. napus.


Background
Rapeseed (Brassica napus) is an important oilseed and energy crop. Its allotetraploid genome fully comprises the two genomes of the species Brassica rapa (A genome) and Brassica oleracea (C genome), and the estimated genome size of B. napus is about 1.2 Gbp [1]. The large and complex nature of the B. napus genome makes the species a challenging target for genomic research.
The knowledge of sequence variation and especially the availability of a high number of molecular markers for a species serve as tools for various plant genetics approaches, such as marker-assisted backcrossing or map http://www.biomedcentral.com/1471-2164/13/281 merely provides information from coding regions, where nucleotide diversity is lower compared to non-coding regions [6,7].
An alternative approach to reduce the complexity of a genome is the use of restriction enzymes. This principle is the basis for the restriction-site associated DNA (RAD) sequencing technology, which allows parallel sequencing of millions of DNA fragments flanking individual restriction enzyme sites. The RAD approach has been successfully applied in a number of organisms, including crop species like barley [8], perennial ryegrass [9], maize [10], eggplant [11], and artichoke [12]. Nelson et al. [13] sequenced a library of eight diverse sorghum accessions to detect SNPs and study their genomic distribution, and they found 283,000 SNPs based on an alignment to the reference genome. To our knowledge, the RAD method has not been applied in earlier studies to identify sequence polymorphisms in an allotetraploid species like B. napus.
The DNA polymorphisms identified by RAD sequencing need to be characterized to facilitate subsequent applications. Due to the availability of the B. rapa reference sequence and the annotated 95K Brassica unigene (UG) dataset (http://brassica.bbsrc.ac.uk/) it is possible to carry out BLAST searches against these datasets, examine the functions of those genes with DNA polymorphisms for which BLAST hits were obtained, and also check the genome-wide distribution of DNA polymorphisms across the B. rapa chromosomes. Furthermore, the categorization of genes with DNA polymorphisms by Gene Ontologies (GO) helps to understand signatures of selection.
Here, we describe the construction of a comprehensive, well-characterized set of DNA polymorphisms in B. napus based on RAD sequencing. The objectives of our study were to (i) establish genomic RAD tags from eight diverse B. napus inbred lines for high-density polymorphism detection and genotyping, (ii) validate a subset of the identified polymorphisms by Sanger sequencing, and (iii) characterize the RAD tags according to BLAST searches against genomic databases of B. rapa and B. napus.

Results
Sequencing the KpnI library generated more than 26 million single-end reads of 120 bp length (Table 1). Sequence data have been deposited in the NCBI Sequence Read Archive under submission number SRA052686. More than 99% of the reads contained the barcode, and more than 95% contained the restriction overhang, where one mismatch was allowed (Table 1). After clustering reads to RAD tags and removing RAD tags with more than 200 reads, the number of RAD tags per inbred ranged from 100,420 to 127,681. Altogether 636,179 RAD tags were assigned to 113,221 RAD clusters. RAD clusters with one to four RAD tags occurred most frequently, and there was a gradual decrease in the number of RAD clusters with larger numbers of RAD tags ( Figure 1). After the filtering steps to remove intergenomic polymorphisms, nearly 33,000 polymorphisms were detected (Table 1). SNPs and insertions/deletions (InDels) identified at positions 7 to 88, and 7 to 80 at a read, respectively, occurred with equal frequency at each position ( Figure 2), and only those were used for all further analyses. Polymorphisms outside the aforementioned read positions were more numerous due to a higher number of sequencing errors and were therefore discarded. Afterwards, altogether 20,835 SNPs were detected from 9,552 RAD clusters, and 125 InDels were detected from 59 RAD clusters (Table 1, Additional file 1). A decrease in the number of reads per RAD cluster also resulted in a decreasing number of polymorphisms per RAD cluster ( The correlation between pairwise modified Roger's distance (MRD) estimates among all inbreds based on simple sequence repeat (SSR) data from a previous study [3] and based on DNA polymorphism data from RAD sequencing was ρ = 0.92 ( Figure 5). The 16 RAD clusters selected for validation comprised 31 SNPs, of which 26 (83.9%) were verified to be polymorphic. Out of the five nonpolymorphic SNPs, four belonged to one specific RAD cluster. RAD and Sanger sequencing information did not agree for 13.1% of the inbred-allele combinations observed for the 31 SNPs.

RAD sequencing and representation of the B. napus genome
In our study, 113,221 RAD clusters were sequenced (  useful to apply paired-end sequencing methods. Also, sequencing libraries constructed with restriction enzymes cutting more frequently than KpnI would result in a higher genome representation. Alternatively, more than one library prepared with different restriction enzymes could be sequenced on separate lanes of a flow cell to improve the read yield. The latter two approaches have been demonstrated to be useful in sorghum [13], and explain why a substantially larger percentage of the genome could be targeted by RAD sequencing in that study. However, when using more than one enzyme and sequencing a number of different libraries, the necessity for enzyme-specific adapters will increase the total costs. This problem can be circumvented by using double digest RAD sequencing [14], where DNA is digested with different restriction enzymes simultaneously.

Genome-wide distribution of RAD polymorphisms
The RAD clusters detected in our study for B. napus were tested for their presence in the known B. rapa genome sequence. We found only about one third of the RAD clusters in the B. rapa sequence, and slightly less in the B. rapa  chromosome data (Figure 6(a)). The B. rapa genome has been reported to have a size of about 529 Mbp [15,16], whereas the size of the B. oleracea genome has been described to be larger, namely 599-696 Mbp [1,15]. Hence, it is expected that the number of RAD clusters mapping to the B. oleracea genome is higher compared to the number of RAD clusters mapping to the B. rapa genome, given that the G/C content, which influences the number of restriction fragments and consequently also the number of RAD clusters, is identical for the two species. Results from earlier studies show that this is the case, as they found that the G/C content was 35.4% in B. rapa [17], and 36.0% in B. oleracea [18]. Despite the aforementioned genome size of B. rapa, the published B. rapa reference sequence [19] has a size of 284 Mbp. We therefore conclude that a considerable part of the B. rapa genome sequence is not part of the reference sequence, which, together with the smaller genome size when compared to B. oleracea, explains the fraction of RAD clusters we found in the B. rapa sequence. In addition, the BLAST searches might have been affected by matches against low complexity regions. We observed an equal distribution of RAD clusters across the B. rapa chromosomes (Figure 7). Our finding suggests that we can expect a similar distribution of RAD clusters with regard to the chromosomes of the unknown C genome. This result in turn suggests that polymorphisms detected from RAD clusters are also uniformly distributed across the B. napus genome, which makes them an important resource not only for GWAS but also other applications like high-resolution linkage mapping.

Polymorphism detection and genotyping
In the examined eight B. napus inbreds, we observed for 113,221 RAD clusters a total of 20,835 SNPs and 125 InDels ( Table 1). Considering that for SNPs, 82 bp (positions 7-88) and for InDels, 74 bp (positions 7-80) of each RAD cluster were regarded, we detected SNPs from a total of 9,284,122 bp and InDels from a total of 8,378,354 bp. Hence, we found one SNP every 446 bp and one InDel every 67,027 bp for a very diverse set of B. napus. The study by Westermeier et al. [6] on six B. napus winter oilseed rape varieties observed with one SNP every 247 bp a slightly higher polymorphism frequency. This might be due to sampling effects because candidate sequences with a total length of 21.4 kb of the B. napus genome were investigated. However, with one InDel every 3,583 bp, the study by Westermeier et al. [6] observed a drastically higher InDel frequency. This is because the relatively short reads of 102 bp in our study in combination with the unavailability of a reference sequence are not powerful for the detection of InDels.
Trick et al. [5] estimated the overall sequence polymorphism rate between the transcriptomes of the two cultivars 'Tapidor' and 'Ningyou 7' to be one SNP per 2,130 bp based on a minimum read depth of eight, or 1,195 bp based on a minimum read depth of four. The reason for the lower SNP frequency in that study when compared to ours is most likely the derivation of SNPs from the less polymorphic coding region. Moreover, only two inbreds were investigated in the study by Trick et al. [5], which leads to an underestimation of the SNP frequency in a species.

Characterization of polymorphisms
The overall representation of different transitions (58.2% (total), 49.7% (A/G), 50.3% (C/T)) and transversions (41.8% (total), 26.5% (A/C), 29.7% (A/T), 17.0% (C/G), and 26.8% (G/T) (Figure 4)) we observed was in good accordance with those detected by Barchi et al. [11] in eggplant through RAD sequencing (transitions: 49.7% (A/G), 50.3% (C/T), transversions: 24.0% (A/C), 28.5% (A/T), 19.9% (C/G), 27.6% (G/T)). Both studies showed a preponderance of transitions over transversions, which has been observed earlier in various species [20,21]. However, the ratio of total transitions/transversions was with 1.39 lower in our work compared to 1.65 in the study by Barchi et al. [11]. On the other hand, in a study on DNA polymorphism in B. rapa [22], more than 21,000 SNPs were discovered and characterized in eight diverse genotypes, and a transition/transversion ratio of 1.03 was observed. We therefore conclude that the ratio we found fits into the range of observations from comparable studies. The aforementioned research [22] furthermore found a transition/transversion ratio of 1.03 in exons and introns versus a ratio of 1.63 in exons only. Hence, our observation of a lower ratio in the B. rapa sequence data (1.45) compared to that in the coding sequence data (1.60) is in line with earlier findings from the species B. rapa.
The BLAST search against the UG set allowed us to identify the fractions of unigenes with RAD clusters and polymorphisms. For this search it was useful to apply a large sample size, therefore we refrained from a differentiation between the B. napus A and C genomes for this part of the study. Consequently, also the BLASTX search for each UG against the UniProtKB/Swiss-Prot dataset was based on the B. napus data. We observed a tight correlation between the GO term representations of all UG and UG with RAD clusters and polymorphisms, except for an apparent over-and underrepresentation of GO terms for UG with InDels ( Figure 8). This observation is due to the small number of that polymorphism type in this work. Hence, the results of our study revealed no signature of selection with respect to the distribution of http://www.biomedcentral.com/1471-2164/13/281 polymorphisms within genes belonging to a specific GO category.

Verification of polymorphisms
In our study, 26 out of 31 SNPs (84%) were verified to be polymorphic according to the Sanger sequencing information. However, four out of the five nonpolymorphic SNPs in our work were from one specific RAD cluster. If this RAD cluster was disregarded, the percentage of verified polymorphisms would be close to 100%. It is not obvious why the SNPs from this RAD cluster could not be validated. The RAD cluster was based on a low number of reads (data not shown), but so were other RAD clusters from which SNPs could be validated. A possible reason might be that this RAD cluster comes from a region which is similar between the B. rapa and B. oleracea genomes. The Illumina sequencing approach does not allow the assignment of the reads to the two different genomes and, thus, hemi-SNPs were considered as SNPs. In contrast, with Sanger sequencing, only the SNPs in one of the genomes have been targeted. However, this requires further research.
The high correlation (ρ=0.92) of MRD determined with SSR markers and RAD polymorphisms between pairs of the eight inbreds examined in our study ( Figure 5) indicated that the RAD polymorphisms identified are likely to be true polymorphisms. Furthermore, this observation is strongly supported by the verification data of 31 SNPs by Sanger sequencing, where only 13.1% of the inbred-allele combinations observed for the 31 SNPs did not agree. Trick et al. [5] also validated candidate SNPs in their study. Out of nine SNPs that had been PCR amplified previously, eight had been called as hemi-SNPs according to the transcriptome sequencing data. Four of the hemi-SNPs (44.4%) were confirmed, four of the hemi-SNPs (44.4%) were uninformative, and the ninth putative SNP (11.1%) was contradicted. Moreover, Trick et al. [5] found eight out of nine SNPs (88.9%) in the aligned regions to be polymorphic in both data sets. Therefore, the number of validated polymorphic RAD clusters and inbred-allele combinations in our study provides a good rate of correctly called polymorphisms.

Conclusions
We applied the RAD sequencing strategy in B. napus, which allowed us to detect more than 20,000 SNPs and simultaneously genotype eight inbred lines. Having sequenced about 1% of the B. napus genome, we managed to carry out comparative sequencing of a large part of the genome in a diverse set of the species. The equal dispersal of identified polymorphisms across the genome underlines the suitability of RAD sequencing for obtaining markers useful for GWAS and linkage mapping purposes. RAD sequencing reduces the complexity of the genome and detects sequence variation from coding and non-coding regions across the genome. Finally, the successful verification of SNP information demonstrates that our RAD sequencing experiment detected true polymorphisms with a high rate of correctly called alleles.

RAD library preparation and sequencing
The RAD library of the eight inbreds was prepared for single-end sequencing according to Baird et al. [23] with the following modifications: Barcodes were 5 bp long, being at least two mutational steps separated from each other with regard to the first four bases, followed by the fifth checksum base. A total of 2 μg genomic DNA of each inbred was digested for 30 min at 37 • C in a 50 μl reaction with 20 units of KpnI (New England Biolabs). Samples were inactivated by purification with Qiaquick spin columns (Qiagen). The RAD library was sequenced on one lane of the Illumina GAIIx system with 120 bp reads at the Charité, Berlin, following the manufacturer's protocol.

Sequence analysis and polymorphism detection
Reads were separated by barcode if applicable and trimmed by 17 bases at the 3' end. Reads with more than ten bases showing a quality score below 30 were discarded. Due to the unavailability of a B. napus reference sequence, we applied the following de novo RAD analysis procedure. The software RADtags from the software package RADtools [24] was used to cluster the reads of each of the eight inbreds into RAD tags. The options for this clustering were (i) bases with quality scores below 20 were ignored, (ii) the number of mismatches allowed for a RAD tag to be added to a RAD cluster was three, and (iii) RAD tags with less than three or more than 200 reads were discarded. Afterwards, RAD tags were merged across all inbreds into RAD clusters if the maximum http://www.biomedcentral.com/1471-2164/13/281 number of mismatches was twelve. For this step, the software tool RADmarkers was used.
The RAD tags present at RAD clusters were screened for SNPs and InDels. The principle was that polymorphisms between inbreds were considered, whereas polymorphisms within inbreds were only considered if one of the alleles was specific to one inbred. In detail, the following filters, which take the allotetraploid nature of the B. napus genome into account, were applied: (i) polymorphisms in RAD clusters were discarded if at least one inbred carried more than two RAD tags, because this is higher than what would be expected based on the allotetraploid genome of B. napus, (ii) the sequence information of inbreds was discarded for a RAD cluster if more than one inbred carried two RAD tags, because in this case no assignment of SNPs to one homoeologue is possible, and (iii) if in one RAD cluster one inbred carried two RAD tags, the polymorphisms were only considered if the alleles were specific to one inbred (cf. [5]). Furthermore, all polymorphisms were discarded from further analyses if the corresponding allele information was missing for more than two inbreds. Pairwise MRD estimates between all eight inbreds were calculated according to Wright [25].

Sequence and polymorphism characterization
A BLASTN search was performed for the consensus sequences of all RAD clusters as well as only for those that contain SNPs or InDels against two data sources using an expectation cutoff value of 1e-15. Further settings were chosen as follows: cost to open a gap = 25, cost to extend a gap = 10, reward for a nucleotide match = 2, penalty for a nucleotide mismatch = -3, word size = 56, megablast search = T, number of alignments to show = 1. The first search was against the entire B. rapa genome sequence (which includes sequence data that were not assigned to the ten pseudochromosomes), the chromosome data, and the CDS data of the genotype Chiifu-401 [26]. Similarly, a search against the 95K Brassica species UG dataset (http://brassica.bbsrc.ac.uk/) was performed. A BLASTX search was carried out for each UG against the UniProtKB/Swiss-Prot dataset, adopting an E-value of 1e-10 in order to assign the UGs a function. The GO terms of different subsets of UG were mapped to higher level categories (plant GO slim) using GOSlimViewer [27] according to the three principal GO categories. If not stated otherwise, all analyses were performed using the software R [28].