Development and evaluation of the first high-throughput SNP array for common carp (Cyprinus carpio)

Background A large number of single nucleotide polymorphisms (SNPs) have been identified in common carp (Cyprinus carpio) but, as yet, no high-throughput genotyping platform is available for this species. C. carpio is an important aquaculture species that accounts for nearly 14% of freshwater aquaculture production worldwide. We have developed an array for C. carpio with 250,000 SNPs and evaluated its performance using samples from various strains of C. carpio. Results The SNPs used on the array were selected from two resources: the transcribed sequences from RNA-seq data of four strains of C. carpio, and the genome re-sequencing data of five strains of C. carpio. The 250,000 SNPs on the resulting array are distributed evenly across the reference C.carpio genome with an average spacing of 6.6 kb. To evaluate the SNP array, 1,072 C. carpio samples were collected and tested. Of the 250,000 SNPs on the array, 185,150 (74.06%) were found to be polymorphic sites. Genotyping accuracy was checked using genotyping data from a group of full-siblings and their parents, and over 99.8% of the qualified SNPs were found to be reliable. Analysis of the linkage disequilibrium on all samples and on three domestic C.carpio strains revealed that the latter had the longer haplotype blocks. We also evaluated our SNP array on 80 samples from eight species related to C. carpio, with from 53,526 to 71,984 polymorphic SNPs. An identity by state analysis divided all the samples into three clusters; most of the C. carpio strains formed the largest cluster. Conclusions The Carp SNP array described here is the first high-throughput genotyping platform for C. carpio. Our evaluation of this array indicates that it will be valuable for farmed carp and for genetic and population biology studies in C. carpio and related species.


Background
Common carp (Cyprinus carpio) is naturally distributed across Europe and Asia. It was domesticated about 2,000 years ago, and is cultured in over 100 countries worldwide with over 3 million metric tons of global annual production [1,2]. As a result of selection and breeding efforts over the past centuries, many domesticated strains have been established with distinct economic traits or phenotypes adapted to local environments and to meet consumer demands. China is the largest C. carpio producer, and there are abundant domesticated strains and populations in China, including Sonpu mirror carp, Hebao red carp, Xingguo red carp, Yellow River carp, and Oujiang color carp, as well as many hybrid strains, all of which are the basis and genetic resources for selective breeding using modern genetic tools.
Because of the economic importance of C. carpio for the global aquaculture industry, as well as its importance as a model species for ecology, physiology, and evolutionary studies, over the past decade, researchers have developed a variety of genetic and genomics tool and resources. A large number of genetic markers have been developed, including microsatellites [3,4], and single nucleotide polymorphisms (SNPs) [5,6]. A number of genetic linkage maps have been constructed based on these markers [7][8][9][10]. The markers have also been used to identify quantitative trait loci (QTLs) associated with economically important traits including growth rate, body shape, and meat quality [4,11,12]. A large set of expressed sequence tags (ESTs) have been generated using traditional cloning and Sanger sequencing methods, or next-generation transcriptome sequencing, and a cDNA microarray has been designed and constructed [13][14][15][16][17]. A bacterial artificial chromosome (BAC) library has been built [18], a BAC-based physical map has been constructed, and a large set of BAC-end sequences (BES) have been generated [19,20]. The complete mitochondrial genomes of several strains and populations have been sequenced [21][22][23]. Whole genome exome data were generated for a comparative study with the Danio rerio genome [24] and, recently, the C. carpio genome consortium has completely sequenced and assembled a draft genome sequence of C. carpio [25].
A major gap in the C. carpio toolkit is the lack of a highthroughput SNP genotyping platform for genetic research. Such a platform is essential for whole genome association studies (GWAS) of important traits, as well as for genomeassisted selection in breeding programs. Genome-scale SNP genotyping is most efficiently performed using SNP arrays or chips. Arrays of this type have been used widely in genetic studies in humans, as well as in important model organisms and agriculture species.
The reductions in the cost of acquiring sequence data using next-generation sequencing technologies has led to the development of genotyping by sequencing (GBS) approaches, which use whole genome sequencing, reduced representative genome sequencing, or target-enriched DNA sequencing data to determine genotypes. The most popular GBS protocol is restriction-site-associated DNA (RAD) tag sequencing in which DNA fragments flanking particular restriction sites are targeted for sequencing, thereby allowing the discovery and genotyping of SNPs at these targeted locations [26]. Although GBS methods have some advantages for genome-wide SNP discovery and genotyping, especially for species for which a reference genome has not been established, they also have limitations, which include the requirements for complicated DNA library preparation procedures and intensive bioinformatics pipelines. GBS is not suitable for genotyping the very large numbers of individuals or SNP loci that are used commonly in GWAS and genomic selection. In addition, GBS genotyping results are not shared easily among different research groups because the same SNP loci are not assayed in all individuals. Therefore high-density SNP genotyping arrays remain the tools of choice for high-resolution genetics analysis. Many SNP arrays or chips have been developed for either Illumina or Affymetrix platforms, including the human 500 K array, the Genome-Wide Human SNP Array 5.0 and 6.0, the porcine 60 K SNP array [27], the bovine 50KSNP array [28], the chicken 60 K [29] and 600 K SNP arrays [30], the canine 22 k SNP array [31], and the equine 50 K SNP array [32]. These arrays have been used widely for research on selective sweeps, phylogeny, population structure, copy number variations, GWAS, and other aspects [32][33][34][35][36], boosting genome and genetic studies as well as breeding programs of these species.
Although the importance of high-density SNP genotyping arrays has been recognized widely, as yet there are only a few such SNP genotyping arrays for aquaculture species. After the submission of this manuscript, an Affymetrix Axiom® myDesign Custom Array containing 132,033 Atlantic salmon SNPs was developed [37]. Meanwhile, an Affymetrix Axiom Array containing 204,437 putative catfish SNPs was also developed [38]. Although a large research community is working on C. carpio and other closely related Cyprinid species, and genotyping is performed intensively for diverse purposes, no SNP genotyping array is available for C. carpio.
Here, we report the design and validation of the first high-density C. carpio SNP array, the Carp SNP array, based on the Affymetrix Axiom platform. The Carp SNP array was validated with 1,072 samples from various C. carpio populations and strains. To assess its potential use in closely related Cyprinids, we also validated the array in 80 individuals from eight related species. A pilot study was conducted to demonstrate the accuracy and efficiency of the genome-scale genotyping and linkage disequilibrium (LD) decay was analyzed in all samples and in several domesticated strains. Identity by state (IBS) clustering of all samples was conducted, which demonstrated the reliability of the Carp SNP array.

Results and discussion
The pipeline and design parameters described below are summarized in Figure 1.

Sequencing and alignment of sequence reads
In previous studies, over 700,000 SNPs have been identified in transcript sequences and classified [5]. All these SNPs were mapped to the reference genome and assigned to genomic positions. However, because these SNPs are from transcribed sequences, their numbers are limited and represent only the SNPs in coding sequences. To improve on this situation, we selected 18 representative carps for genome re-sequencing, including seven accessions of two wild populations from the Yellow and Heilongjiang rivers, and 11 accessions of three domesticated strains (Songpu, Oujiang color, and Hebao). Re-sequencing of these 18 accessions generated a total of 2,281 million paired-end reads that were 101 bp long (228.1 Gb). All raw sequencing data have been deposited in the NCBI Sequence Read Archive [SRA: SRP026407]. The short reads were mapped to the reference genome, with an average sequencing depth of six genome equivalent per animal. The mapping coverage rate was an average of 87.6% (Table 1).

SNP identification
SNP identification was performed separately within each strain. The criteria used for calling SNPs were as following: (1) mapping quality score ≥ 20; (2) relevant base quality score ≥ 20; (3) SNP quality score ≥ 20 and SNP position must be covered by at least 10 reads; and (4) (Table 2). Overall, a total of 24,272,905 non-redundant SNPs were identified, of which 802,209 were shared by all strains, and 13,811,200 were strain-specific. Together with the SNPs identified previously in the transcript sequences, we had a pool of 15,366,108 SNPs from which to select SNPs for the carp array. An abundant source of candidate SNPs is essential for designing SNP arrays, especially for large genomes like the C. carpio genome. When the dog SNP array was developed, more than 2.5 million potential SNPs were identified, with one SNP per 0.9 kb between breeds and one SNP per 1.5 kb within breeds. In other studies, 2.8 million SNPs were detected in chicken [9], and 1.1 million SNPs were discovered in horse [36]. Thus, based on these previous studies, it is evident that we had gathered a   [39]. Based on advice from Affymetrix scientists, we removed SNPs that were within 10 bp of each other or there were more than two variants within 35 bp. After these steps, 3,719,260 SNPs remained in the final pool for selection. Priority was given to SNPs in coding sequences, and then the genome re-sequencing SNPs were selected on the basis of their quality scores and spacing on the genome. Finally, a total of 378,815 SNPs were submitted for probe design.

SNP reduction based on in-silico analysis of conversion values
The 378,815 selected SNPs were submitted to Affymetrix for in-silico analysis to predict their reproducibility on the Axiom platform. The p -convert value, which is calculated using a random forest model, is designed to predict the probability that the SNP will convert on the array. The random forest model considers many factors, such as probe sequence, binding energies, unexpected non-specific binding and probability of hybridization to multiple genomic regions [30]. P-convert values were generated for the forward and reverse probes and p-convert values ≥ 0.58 were considered to be qualified. As shown in Figure 2, a high proportion of the 378,815 SNPs (347,712; 91.8%) had a p-convert value ≥ 0.58.

SNP selection for the final Carp array
In this final step, we selected 250,000 SNPs in the following order: (1) 8,204 non-synonymous SNPs and 5,219 SNPs in UTR regions with each SNP at least 100 bp from any adjacent SNP; (2) 133,603 SNPs in transcribed sequences that were at least 1.8 kb from any adjacent selected SNP; (3) 100,974 SNPs from the genome re-sequencing data that were shared between strains and separated by at least 10 kb from any adjacent selected SNP; and (4) 2,000 strain-specific SNPs that were at least 17 kb from any other SNP on the array ( Table 3). As shown in Figure 3, the average interval between the final 250,000 SNPs was 6.6 kb, and the intervals between most SNPs ranged from 3 to 8 kb. When the SNP densities on the assembled C. carpio chromosomes were calculated, we found that the SNP densities ranged from 137 sites/Mb to 187 sites/Mb. Scaffolds that have not been assigned to one of the 50 chromosomes were joined to form a pseudo 'P' chromosome, which had a SNP density of 122 sites/Mb ( Figure 4)

Accuracy of genotyping for the SNP array
High accuracy is a vital parameter for a genotyping platform. In this study, we assessed the genotyping accuracy of our Carp array using data from a family comprising two parents and 80 offspring. PLINK software was applied with the 'Mendel' parameter. Any genotypes not concordant between parents and offspring were regarded as genotyping errors. We estimated the accuracy to be 99.6% on average, and after excluding one sample because of multiple inconsistencies with the inheritance pattern expected on the basis of the declared pedigree, the genotyping accuracy increased to 99.8% on average, showing the high genotyping quality of the Carp array. Thus, in subsequent research, this array will be of great importance in trait association analysis, QTL mapping, and marker assisted selection.

Extensive assessment of the SNP array in Cyprinids
We evaluated the SNP array in 80 samples from the C.   collected, more of the SNPs on the array may pass the call-rate threshold. Among these eight species, D. rerio is the only species for which a genome assembly has been reported.

Linkage disequilibrium (LD) analysis
The extent of LD across the SNPs that are on the array was analyzed for all the samples of C. carpio and for three of the domesticated strains, Yellow River carp, Hebao carp, and Xingguo red carp. Pairwise r 2 was calculated using 82,113 SNP markers with MAFs over 0.05 for 120,395 samples for Yellow River carp, 73,703 for Hebao carp, and 86,517 for Xingguo red carp. The average r 2 within each kilo base pair was calculated and plotted against the physical distance ( Figure 5). A similar trend of LD decay was observed in all samples and in each strain, showing that the LD blocks in C. carpio are shorter than most other species [40][41][42][43][44][45]. On the other hand, the LD blocks in these three strains are relatively longer than the LD blocks in all the samples tested, probably because of simpler genetic background within each strain. Similar results have been reported in other species; for example, the domestic dog in which much longer LD blocks have been reported in each breed compared with in mixed samples [44]. In a future study, we will use larger samples of each strain for LD analysis and construct haplotypes, which will be useful for the design of medium or low density SNP panels. As observed previously in several domesticated animals [46,47], lower density SNP panels can be designed and applied for genomic selection and breeding, with fewer tag markers selected on interesting traits.

Population structure analysis through identity by state (IBS) clustering
Population structure analyses have commonly been conducted before GWAS analyses [48,49] and several methods for population stratification have been developed, such as IBS and principle component analysis (PCA). In this study, genotyping was performed on 1,072 samples of C. carpio and on 80 samples of another eight related species. After quality control 73,377 markers and 1,152 samples passed all the criteria. Multidimensional scaling analysis of an IBS matrix revealed the substructure of the samples ( Figure 6). All the samples were divided into three clusters. All the C. carpio samples (except Oujiang color carp and Heilongjiang carp) formed the largest cluster, within which different   [50][51][52], indicating that the Carp SNP array is reliable and potentially has applications in breeding.

Conclusions
We developed the Carp SNP array which is the first high-throughput genotyping platform for C. carpio.
After evaluation with large samples, nearly three fourths of the designed 250,000 SNPs proved to be polymorphic in C. carpio. Besides, the Carp SNP array was also evaluated in related species. LD was calculated and longer haplotype blocks were observed in domesticated strains. IBS was conducted and most of the samples were assigned to different clusters. This study indicates that the Carp SNP array will be valuable for farmed carp and for genetic and population biology studies in C. carpio and related species.

Ethics statement
This study was approved by the Animal Care and Use Committee (ACUC) of the Centre for Applied Aquatic

SNP identification
The paired-end reads from each accession were aligned to the reference genome using BWA [53] to generate sequence alignment/map SAM files. After mapping, SNPs were identified on the basis of the mpileup files generated by SAMtools [54]. The variant call format (VCF) files were manipulated further using custom-made scripts for primary filtration based on depth and quality.

SNP selection
SNP selection was carried out in multiple steps using different criteria. All the filtration parameters were set to minimize the risk of false positive sites and to select SNPs that were relatively evenly distributed across the genome. All the original SNPs were classified to six different databases and selected in a certain order. First, non-synonymous SNPs and SNPs in UTR regions were selected; then other transcriptome SNPs were added; and finally, strain-shared and strain-specific SNPs were added to the pool of candidate SNPs. During the SNP selection steps, several custom-made scripts were used to qualify flanking sequences. To ensure an even distribution of SNPs over the genome, a custom-made algorithm (described below) was used. When a new SNP was introduced into the final pool, a threshold of t bases was set and SNPs within the t bases were excluded. For SNPs that originated from the transcriptome data, t was set lower than 2 kb so that all the cSNPs were included in the final pool. For SNPs from the genome re-sequencing data, t was set over 10 kb because most of these SNPs were from non-coding regions.

Evaluation of the SNP array
To evaluate the Carp SNP array, 1,072 samples from C. carpio and 80 samples from carp-related species were collected. Genomic DNA was extracted from blood using a DNeasy 96 Blood & Tissue Kit (Qiagen). All the DNA samples were quantified by NanodropND-1000 spectrophotometer (NanoDrop Technologies Inc., Wilmington, DE, USA) and sent to GeneSeek (Lansing, MI) for genotyping. The genotype data were extracted and converted to Ped/Map format. PLINK software [55] was used to classify the SNPs and extract the data for the different species. Mendelian analysis and LD decay were also conducted with PLINK using the "-mendel" and "-r2" parameters. Mendelian analysis was conducted on family data for two parents and 80 offspring, following the procedure reported previously [56]. X-Y plots were drawn using the average r 2 values (Y axis) and the physical distances (X axis) for each pair of SNPs each kilo base-pair. IBS clustering was conducted with PLINK using the "-mds-plot 2", "-cluster", and "-genome" parameters, with a P-value threshold of 1E-3. The PLINK MDS file was extracted and a scatter plot was drawn using d$C1 (X axis) and d$C2 (Y axis) in the R software package (version 3.0.2, Vienna, Austria).