Copy number variations (CNVs) identified in Korean individuals

Background: Copy number variations (CNVs) are deletions, insertions, duplications, and more complex variations ranging from 1 kb to sub-microscopic sizes. Recent advances in array technologies have enabled researchers to identify a number of CNVs from normal individuals. However, the identification of new CNVs has not yet reached saturation, and more CNVs from diverse populations remain to be discovered.
Results: We identified 65 copy number variation regions (CNVRs) in 116 normal Korean individuals by analyzing Affymetrix 250 K Nsp whole-genome SNP data. Ten of these CNVRs were novel and not present in the Database of Genomic Variants (DGV). To increase the specificity of CNV detection, three algorithms, CNAG, dChip and GEMCA, were applied to the data set, and only those regions recognized by at least two algorithms were identified as CNVs. Most CNVRs identified in the Korean population were rare (<1%), occurring just once among the 116 individuals. When CNVs from the Korean population were compared with CNVs from the three HapMap ethnic groups (African, European, and Asian), our Korean population showed the highest degree of overlap with the Asian population, as expected. However, the overlap was less than 40%, implying that more CNVs remain to be discovered from the Asian population as well as from other populations. Genes in the novel CNVRs from the Korean population were enriched for genes involved in regulation and development processes.
Conclusion: CNVs are recently recognized structural variations among individuals, and more CNVs need to be identified from diverse populations. Until now, CNVs from Asian populations have been studied less than those from European or American populations. In this regard, our study of CNVs from the Korean population will contribute to the full cataloguing of structural variation among diverse human populations.


Background
Understanding variations in the human genome is the key to unraveling the phenotypic diversity among individuals and understanding various human diseases. Genomic variations exist at various levels, from differences in single nucleotides to microscopic chromosome-level variation [1]. Copy number variations (CNVs), a new type of genomic variation that has recently received considerable attention, are deletions, insertions, duplications, and more complex variations ranging from 1 kb to submicroscopic sizes [1][2][3][4]. Recent advances in array technologies, such as BAC arrays, oligonucleotide array CGH, and whole-genome SNP arrays, have finally enabled researchers to identify this new type of variation, which had gone unnoticed for a long time [5].
Since Sebat et al. [6] and Iafrate et al. [7] first reported large-scale CNVs among normal human individuals in 2004, many researchers have identified novel CNVs using diverse technical and computational approaches [8][9][10][11][12][13][14][15][16][17]. These reported CNVs are collected and maintained in a curated database, the Database of Genomic Variants (http://projects.tcag.ca/variation/), which contained more than 15,000 CNVs obtained from 48 publications as of April 2008. However, the discovery of new CNVs has not yet reached saturation, and many challenges remain for the standardization of CNV discovery [18,19]. The global map of CNVs from the 270 normal individuals in the HapMap collection is an important advance in the field, yet genomes from more individuals from diverse populations should be studied to achieve a full cataloging of human CNVs [11].
Whole-genome SNP arrays such as Affymetrix 500 K or Illumina 300 K arrays, which are widely used for whole-genome association studies, are also useful for CNV discovery since the intensity of the probes can be exploited to detect CNV gains and losses [20][21][22][23]. A few recent studies successfully utilized whole-genome SNP data from control populations in North American and European countries for the detection of novel CNVs [19,22,24,25]. Here, we report the identification of 10 novel CNVRs from 116 normal Korean individuals by analyzing Affymetrix 250 K Nsp SNP array data. Our work will be valuable in expanding our knowledge of CNVs across diverse populations and ethnicities.

CNVRs from the Korean population
Commonly used algorithms for CNV detection from SNP arrays can produce widely different results from the same data because they differ both in the way reference samples are prepared and in their calling criteria [19,26]. A stringent criterion that selects only regions identified by at least two different algorithms is currently recommended to increase confidence in the identified CNVs [19]. In this work, we applied three algorithms, CNAG [21], dChip [27] and GEMCA [20], to our data set of 116 normal Korean individuals genotyped using Affymetrix 250 K Nsp arrays. We identified a total of 65 CNVRs, among which 10 CNVRs (15.4%) were novel and not present in the Database of Genomic Variants. Many novel CNVs were likely missed by our approach, but we chose to be conservative in our selection of CNVs to reduce false positives. The true proportion of novel CNVRs in the Korean population is probably higher than 15.4%, because a recent study showed that most CNV loci are actually smaller than currently recorded in the Database of Genomic Variants [28].
As expected, there were significant differences in the numbers and positions of CNVs identified by the three methods (Figure 1). In most cases, the dChip algorithm identified more CNVs than CNAG and GEMCA. On average, 6.7, 3.5 and 2.6 CNVs per individual were found by dChip, CNAG and GEMCA, respectively (Additional file 1). In total, 772, 403 and 302 CNVs were found by the dChip, CNAG and GEMCA algorithms, respectively. Detailed information for each identified CNV is shown in Additional file 2. A total of 141 CNVs were identified by our criterion of selecting CNVs supported by at least two algorithms. When we compared the size distributions of the 84 duplicated and 57 deleted CNVs (Additional file 2), we found that duplicated regions tended to be longer than deleted regions (p < 0.0009664, t-test). When we plotted each CNV in the genome, we found that most CNVs were located near the band of each chromosome (Figure 2). Finally, we defined 65 CNVRs from the 141 CNVs by merging overlapping CNVs from different individuals (Additional files 3 and 4).
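The merging step that turns individual CNVs into CNVRs can be illustrated with a minimal sketch. It assumes each CNV is a (chromosome, start, end) tuple and that two CNVs belong to the same CNVR whenever their coordinates overlap or touch; the function name and the example coordinates are hypothetical and are not taken from the study data.

```python
from collections import defaultdict


def merge_cnvs_into_cnvrs(cnvs):
    """Merge overlapping CNV calls (chrom, start, end) into CNVRs.

    Assumption: CNVs from any individual that overlap (or touch) on the
    same chromosome are collapsed into one region spanning their union.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in cnvs:
        by_chrom[chrom].append((start, end))

    cnvrs = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlaps the region being built: extend it.
                cur_end = max(cur_end, end)
            else:
                # Gap found: close the current region and start a new one.
                cnvrs.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        cnvrs.append((chrom, cur_start, cur_end))
    return cnvrs


# Made-up calls pooled from several individuals.
example_cnvs = [
    ("chr14", 100_000, 250_000),
    ("chr14", 200_000, 400_000),
    ("chr14", 900_000, 950_000),
    ("chr8", 50_000, 80_000),
]
example_cnvrs = merge_cnvs_into_cnvrs(example_cnvs)
```

In this toy input the first two chr14 calls collapse into one region, so four CNVs yield three CNVRs, in the same way that the 141 CNVs reported here reduce to 65 CNVRs.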

Size and occurrence of CNVs in the Korean population
The sizes of the 141 CNVs ranged from several kb to several megabases (Table 1). The smallest CNV was 15,723 bp, and the largest 2,262,135 bp. Many CNVs were in the range of 10 kb to 300 kb. We also compared the size distributions of the CNVs identified by each method. The smallest, median, and largest CNVs were 998, 153,137 and 2,264,086 bp for the GEMCA algorithm; 1,184, 267,962 and 23,992,731 bp for the CNAG algorithm; and 641, 67,372 and 5,035,303 bp for the dChip algorithm. In general, the dChip algorithm detected smaller CNVs (it had the smallest minimum and median sizes), whereas the CNAG algorithm produced the largest calls and the widest size range.
Most CNVRs (75%) from the Korean population were rare (<1%), occurring just once among the 116 individuals (Table 2). However, a few previously reported CNVs occurred in a significant proportion of the Korean population. For instance, one CNV on chromosome 14 was present in 31 individuals. Generally, there were more CNV gains than losses, and 5 (31%) of the 16 CNVRs had mixed gains and losses among different individuals. Among all autosomal chromosomes, CNVs were detected most frequently on chromosomes 14, 15 and 8.

Comparison by ethnicity
Affymetrix 500 K CEL files from the 270 HapMap individuals were obtained from the Affymetrix web site and analyzed with the CNAT algorithm to identify CNVs at an individual level. In addition, individual-level CNV data from the 269 HapMap samples obtained by the array CGH method were downloaded from the copy number variation project at the Wellcome Trust Sanger Institute web site (http://www.sanger.ac.uk/humgen/cnv/data/cnv_data/display/). The 270 individuals were divided into three ethnic groups: Asian (JPT + CHB), European (CEU), and African (YRI). The overlap of CNVs between the Korean population and each of the three ethnic groups was then investigated (Table 3). Overall, there was a 23-40% overlap in counts and a 23-79% overlap in actual nucleotides in CNVs between the Korean population and the three ethnic groups. The Korean population showed the highest degree of CNV overlap with the Asian population, as expected, but the overlap was less than 40%, implying that many more CNVs remain to be identified from the Asian population beyond those identified in the 90 Asian HapMap individuals.
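The two overlap measures reported above (overlap in counts and overlap in nucleotides) can be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the study: intervals are half-open (chromosome, start, end) tuples, the query regions do not overlap one another, and a query region counts as overlapping if it shares at least one base with the reference set. All names and coordinates are hypothetical.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_stats(query, reference):
    """Return (fraction of query regions hit, fraction of query bases shared)."""
    ref_by_chrom = {}
    for chrom, start, end in reference:
        ref_by_chrom.setdefault(chrom, []).append((start, end))
    # Merge the reference first so that shared bases are not counted twice.
    ref_by_chrom = {c: merge(v) for c, v in ref_by_chrom.items()}

    hit_count = 0
    shared_bp = 0
    total_bp = 0
    for chrom, start, end in query:
        total_bp += end - start
        covered = sum(
            max(0, min(end, r_end) - max(start, r_start))
            for r_start, r_end in ref_by_chrom.get(chrom, [])
        )
        shared_bp += covered
        if covered > 0:
            hit_count += 1
    return hit_count / len(query), shared_bp / total_bp


# Made-up regions standing in for Korean CNVRs and one HapMap group.
korean = [("chr1", 0, 100), ("chr1", 200, 300), ("chr2", 0, 50)]
hapmap_group = [("chr1", 50, 250), ("chr3", 0, 10)]
count_fraction, bp_fraction = overlap_stats(korean, hapmap_group)
```

The two fractions can differ considerably, which is consistent with the different ranges reported for counts and for nucleotides: one long region that is mostly covered raises the nucleotide overlap much more than the count overlap.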

Novel CNVRs from the Korean population
Among the 10 novel CNVRs identified from the Korean population, 3 CNVRs contained a total of 5 genes (Additional file 5). The total length of the novel CNVRs was 1,788,129 bp, or 0.06% of the human genome. The total length of the 55 known CNVRs is 14,280,140 bp (0.48% of the human genome). Twenty-four of these CNVRs contained 52 genes.
Figure 1. Distribution of CNV counts identified using CNAG, dChip and GEMCA algorithms. Distribution of CNV counts in each individual. The Y-axis represents the CNV count and the X-axis represents each individual.
Of the three novel CNVRs that contained genes, we validated two by Q-PCR (Figure 3). One case sample, which had a gain of two copies in a novel CNVR encompassing the SYNPR gene, showed a 3.59-fold increase in DNA copy number in comparison to five samples with normal copy number (Figure 3A). The other validated region was a CNVR containing the KRR1 gene. In this case, the case sample, which had a gain of one copy, showed a 1.86-fold increase in DNA copy number in comparison to five samples with normal copy number (Figure 3B).
We analyzed the functional enrichment of genes contained in the CNVRs from the Korean population using the GOstat tool (Tables 4 and 5) [29]. The novel CNVRs were enriched with genes involved in regulation and development processes (Table 4). Genes in the previously known CNVRs were mainly related to processes such as cell adhesion, multicellular development, and regulation of gene expression (Table 5). Our results are in agreement with the work of Nguyen et al., which showed the over-representation of secreted, cell adhesion, and immunity-related proteins in CNV-associated genes [30].
Figure 2. Distribution and frequencies of CNVs identified in the Korean population in the human genome.
The fact that 15% (10/65) of CNVRs in the Korean population were novel implies that current CNV discovery has not yet plateaued, and that the genomes of more individuals should be examined to fully understand CNVs in the general population. Until recently, CNV studies have mainly focused on populations in North America and Europe [19,25]. More individuals from other continents, such as Asia, Africa, and South America, need to be studied to enrich our understanding of the diversity of CNVs in the human population. We stress that the Korean population had less than a 40% overlap in CNVRs with the 90 Asian HapMap individuals, which suggests that more individuals should be studied to fully represent the pattern of CNVs among East Asian populations. In this regard, our work on 116 Korean individuals will be useful.
Figure 3. Validation of two novel CNVRs by Q-PCR.

Conclusion
Recent studies have shown that CNVs are as important as single nucleotide polymorphisms (SNPs) or microscopic variations. Many studies have reported the identification of novel CNVs, but more CNVs from diverse populations should be identified until we have a full catalogue of the structural variations among human populations. Until now, the CNVs of Asian populations have not been as thoroughly studied as those of European or American populations, and in this regard our study of CNVs from the Korean population will contribute to the full cataloguing of structural variations among diverse human populations.

DNA samples
Blood specimens were obtained from normal, healthy subjects who visited the Korean Institute of Oriental Medicine (KIOM) and collaborative hospitals. The internal review board at KIOM approved the study protocols, and informed consent was obtained from all enrolled study subjects. Genomic DNA was extracted from blood samples using the QIAamp DNA Blood Maxi Kit (Qiagen, Valencia, CA) according to the manufacturer's instructions. DNA concentration and purity were determined using the NanoDrop ND-1000 spectrophotometer (NanoDrop Technologies, Rockland, DE).
In the CNAG analysis, a reference data set of 48 normal individuals (obtained from the Affymetrix website) was used in the non-paired reference analysis with default parameters, and CNVs were inferred from more than two consecutive SNPs. In the GEMCA analysis, a reference data set of 10 normal individuals was used in the non-paired reference analysis with default parameters. The boundaries of CNVs were determined using 90% density borders [20]. In the dChip analysis, data were normalized at the probe intensity level with the invariant set normalization method [27]. A signal value was calculated for each SNP using the average model method (PM/MM difference). The inferred copy number was estimated from the raw copy numbers using a hidden Markov model (HMM) with the 10% sample trimming option, and CNVs were inferred from more than two consecutive SNPs.
Finally, for each individual, a CNV was defined as a region identified by at least two algorithms (overlap rate >= 50%, length >= 1000 bp). This strategy is likely to increase confidence in the detected CNVs, although many novel CNVs may be missed [19]. Considering the current lack of standards in CNV discovery methods, we think that a more stringent approach like ours is appropriate. NCBI genome build 36 (hg18) was used to map each CNV to its genomic position.
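The consensus criterion (support from at least two algorithms, overlap rate >= 50%, length >= 1000 bp) can be sketched as follows. The text does not define how the overlap rate is computed, so this sketch assumes it is the fraction of a call's own length that is covered by a call from a different algorithm in the same individual. The function names and example coordinates are hypothetical, and the sketch returns the supported calls per algorithm rather than a single merged region.

```python
MIN_OVERLAP = 0.5   # overlap rate threshold stated in the text
MIN_LENGTH = 1000   # minimum CNV length in bp stated in the text


def overlap_fraction(a, b):
    """Fraction of call a = (chrom, start, end) that is covered by call b."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(0, shared) / (a[2] - a[1])


def consensus_calls(calls_by_algorithm):
    """Keep calls from one individual that a second algorithm supports.

    calls_by_algorithm maps an algorithm name to its list of calls.
    Assumption: support means >= 50% of the call is covered by a single
    call from any other algorithm.
    """
    supported = []
    for algo, calls in calls_by_algorithm.items():
        for call in calls:
            if call[2] - call[1] < MIN_LENGTH:
                continue  # too short to be reported
            others = (
                c
                for other, other_calls in calls_by_algorithm.items()
                if other != algo
                for c in other_calls
            )
            if any(overlap_fraction(call, c) >= MIN_OVERLAP for c in others):
                supported.append((algo, call))
    return supported


# Made-up calls for a single individual.
individual_calls = {
    "CNAG": [("chr14", 10_000, 30_000)],
    "dChip": [("chr14", 15_000, 32_000), ("chr8", 5_000, 5_500)],
    "GEMCA": [("chr2", 1_000, 9_000)],
}
supported_calls = consensus_calls(individual_calls)
```

Here the CNAG and dChip calls on chr14 support each other, the short dChip call on chr8 fails the length filter, and the GEMCA call on chr2 has no support from another algorithm, so only the two chr14 calls are kept.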

Determination of novel CNVRs and functional annotation analysis
CNVs identified in our Korean population were compared with the 11,966 CNVs in the Database of Genomic Variants (downloaded in February 2008). The GOstat web service was used for gene ontology (GO) term analysis to study the enrichment of GO terms in the known and novel CNVs [29]. This analysis was performed with the default options for biological processes, and the GO term candidates were ordered by p-value.
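As an illustration of how GO terms can be tested for enrichment and then ordered by p-value, the sketch below uses a generic one-sided hypergeometric test. This is an assumption made for illustration and is not necessarily the exact statistic or correction applied by GOstat with its default options. The GO term labels and gene counts are made up.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    k: genes in the CNVR gene list annotated with the GO term
    n: size of the CNVR gene list
    K: genes in the background annotated with the GO term
    N: size of the background gene set
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts per term: (k, n, K, N).
term_counts = {
    "term_A (hypothetical)": (2, 2, 5, 10),
    "term_B (hypothetical)": (1, 2, 5, 10),
    "term_C (hypothetical)": (0, 2, 5, 10),
}

# Order the GO term candidates by p-value, smallest first.
ranked_terms = sorted(
    ((enrichment_pvalue(*counts), term) for term, counts in term_counts.items())
)
```

With gene lists as small as those in the novel CNVRs (5 genes in total), p-values from such a test are sensitive to single genes, so the ranking is best read as a descriptive summary.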

Authors' contributions
SL and HYP collected blood samples and prepared DNA. HJK and JHK performed genotyping experiments. JHK and JYP performed RT-PCR experiments. TWK, YJJ, and SYK performed bioinformatics analyses. TWK, YJJ, SYK, JYK and YSK wrote the manuscript. All authors read and approved the manuscript.