Genetic variation in the human genome takes many forms, ranging from large chromosome anomalies to single nucleotide polymorphisms (SNPs). Deletion, insertion and duplication events giving rise to copy number variations (CNVs) are found genome-wide in humans [1–8] and other species [9–12]. Genomic variants can affect both somatic and germ-line genetics. The link between CNVs and inherited diseases is now solidly established (e.g. [13–15]), and copy number plasticity is typical of cancer cells. Such genomic variability was identified more than a decade ago using array-based comparative genomic hybridization [17, 18], but had been known for much longer from cytogenetic studies and Southern blots. It has been demonstrated that CNVs near oncogenes or tumor suppressor genes can affect gene expression levels or result in the expression of chimeric fusion genes [18, 19]. However, the number and positions of rare CNVs in the human genome are still likely to be underestimated, and their contribution to common complex diseases such as diabetes or obesity is unclear. Very recent results demonstrate that rare variants can have very high penetrance in the etiology of morbid obesity [20, 21].
The CoLaus (Cohorte Lausannoise) study is a population-based survey started in 2003 to study risk factors for hypertension and cardiovascular diseases. A total of 6188 Caucasian individuals (35 to 75 years old) from the Lausanne area in Switzerland participated in the study. Of these, 5612 individuals were genotyped on Affymetrix 500K SNP chips, and a fraction of them were also genotyped on Illumina 550K SNP chips. A number of SNP-based genome-wide association studies (GWAS) that employed the CoLaus data have already been reported [24–30]. Although many other large cohorts including thousands of individuals have been genotyped for SNPs [24, 25, 31], very few have reported CNV maps [32, 33].
It is important to emphasize that most SNP arrays used so far in GWAS of clinical cohorts were designed not for CNV (dosage) detection but only to call the three possible genotypes of each SNP. Nevertheless, by combining the intensities of the two alleles of a given SNP, it is also possible to obtain information on the copy number state of the SNP locus. This is challenging for several reasons. First, when analyzing very large datasets (with several thousand individuals), it is likely that experiments were conducted at different times and/or by different laboratories, which often introduces strong batch effects in the raw intensities. Thus the first challenge in CNV calling is to ensure proper normalization of these raw data. Second, because of the high noise in the SNP probe intensities on these arrays (even after batch effects have been corrected), copy number estimates for a single locus (SNP) are not very reliable. More reliable predictions can therefore only be made by integrating intensities from several neighboring loci, a strategy employed by many different CNV detection methods [34–40]. However, this approach makes CNV detection difficult (and sometimes impossible) in regions with low SNP density. To overcome this limitation, the Illumina (1M) and Affymetrix arrays (Affymetrix 5.0 and 6.0) include more SNP markers as well as non-polymorphic probes to cover CNV-rich regions. These arrays have also received considerable attention from the community and now benefit from a variety of freely available and efficient CNV detection methods [41–47]. These methods also make use of the ratio of allelic intensities, which can improve CNV prediction. Until very recently, little had been done for Affymetrix 500K arrays, which were analyzed with software such as dChip, CNAG, GEMCA and CNAT.
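To make the two ideas above concrete (summing the two allele intensities to obtain a dosage signal, and pooling neighboring SNPs to reduce noise), the following is a minimal Python sketch. It is not the procedure of any of the cited methods: the function names, the per-SNP reference totals, the window size and the toy intensities are all assumptions chosen for illustration, and normalization for batch effects is assumed to have been done beforehand.

```python
import math
from statistics import median

def log2_ratios(a_intensity, b_intensity, reference_total):
    """Per-SNP log2 ratio of the summed allele intensities (A + B)
    to a reference total (hypothetical: e.g. a cohort median per SNP)."""
    return [math.log2((a + b) / ref)
            for a, b, ref in zip(a_intensity, b_intensity, reference_total)]

def running_median(values, half_window=1):
    """Smooth a per-SNP signal along the chromosome with a centered
    running median; windows are truncated at the chromosome ends."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Toy intensities for nine consecutive SNPs of one individual.
# SNPs 2..6 lie in a simulated one-copy deletion; SNP 4 is a noisy probe
# that, taken alone, looks like a normal two-copy locus.
a = [1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 1.0, 1.0]
b = [1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 0.5, 1.0, 1.0]
ref = [2.0] * 9  # assumed reference total for two copies

ratios = log2_ratios(a, b, ref)
smoothed = running_median(ratios, half_window=1)
print(ratios)
print(smoothed)
```

In this toy example the single-SNP ratio at the noisy probe suggests two copies, whereas the smoothed signal follows its neighbors. The same pooling is what breaks down where SNP density is low, because too few neighbors fall inside a region of interest.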
All but CNAT are restricted to the Windows operating system and are thus poorly suited to the analysis of large cohorts and to distributed computing on UNIX-based clusters. Software packages initially developed for Illumina arrays [45, 47, 50] were modified to allow the analysis of Affymetrix arrays (in particular Affymetrix 5.0 and 6.0 arrays). However, the performance of these tools on Affymetrix 500K data has not been extensively tested, and some of them are tedious to use. For example, PennCNV is considered a very efficient tool for CNV analysis, but analyzing Affymetrix 500K data with it requires several pre-processing steps. These steps rely on external applications (the Affymetrix APT tools), which in their current release no longer support this pre-processing. Maintaining dependencies is challenging in any software development project, but such gaps make it difficult for the user to decide which software to use. In addition, whilst there are now several performance benchmarks for the newest array generation [51–53], an assessment of the Affymetrix 500K arrays in large cohorts is still needed.
Finally, while some methods take advantage of the signals from a single SNP or a group of SNPs across the population to predict CNV regions for each individual [41, 54, 55], there are very few methods for merging individual CNV predictions into regions at the population level: Redon et al. merged CNVs based on the extent of their overlap, whereas Itsara et al. manually annotated complex regions.
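As an illustration of what overlap-based merging can look like, here is a minimal Python sketch. It is not the procedure of Redon et al. and not the merging strategy developed in this study: the greedy left-to-right pass, the overlap measure (overlap length divided by the length of the shorter interval), the 0.5 threshold and the function names are all assumptions. Intervals are (start, end) pairs on a single chromosome with end > start.

```python
def overlap_fraction(a, b):
    """Length of the overlap between intervals a and b, divided by the
    length of the shorter of the two (assumes end > start for both)."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    if inter <= 0:
        return 0.0
    return inter / min(a[1] - a[0], b[1] - b[0])

def merge_cnvs(cnvs, min_overlap=0.5):
    """Greedily merge individual CNV calls into population-level regions.

    Calls are visited in order of start position; a call is absorbed into
    the current region if it overlaps that region sufficiently, otherwise
    it opens a new region.
    """
    regions = []
    for start, end in sorted(cnvs):
        if regions and overlap_fraction(regions[-1], (start, end)) >= min_overlap:
            region_start, region_end = regions[-1]
            regions[-1] = (region_start, max(region_end, end))
        else:
            regions.append((start, end))
    return regions

# Hypothetical CNV calls from several individuals on one chromosome.
calls = [(500, 600), (100, 200), (255, 300), (150, 260)]
merged = merge_cnvs(calls)
print(merged)
```

A threshold of this kind is a design choice with visible consequences: a low threshold chains partially overlapping calls into long regions, while a high one splits a single complex locus into several regions, which is one reason why complex regions have sometimes been annotated by hand.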
In the current study we pursued two main goals. First, we performed an extensive survey of candidate CNVs in the CoLaus study as detected by SNP genotyping microarrays. We provide a large dataset that can serve as a resource for other studies elucidating human structural variants, and for future association studies of CNVs with the clinical phenotypes measured in CoLaus. Second, since methods for detecting individual CNV profiles and merging them into consensus regions are not yet well established, we developed new algorithms for CNV calling and merging, and devised novel techniques to evaluate them and compare them with existing methods. Specifically, we compared three existing CNV detection methods with our new method (GMM), which uses a Gaussian Mixture Model to estimate the copy number dosage at each SNP of each individual. GMM models the signal intensity at each SNP across the entire population (cohort), in contrast to HMM approaches such as CNAT, CNAG, dChip, PennCNV and QuantiSNP, which model the signal sample by sample along each chromosome. Other GMM implementations have been used successfully in the past for BAC and CGH array analyses [56–58], but all these methods (GMMs and HMMs) provide a discrete copy number value (e.g. 0, 1, 2, 3 or 4). It has also been proposed to integrate both the CNV classification and the association with binary traits in a single statistical model. The EM algorithm of that approach estimates the copy number state probabilities, but uses them only internally for the association testing. Similar to that approach, our GMM implementation produces (continuous) copy number dosage values that account for uncertainty in the prediction (e.g. due to sample contamination or tumor cell heterogeneity). However, our algorithm couples the calling with CNV merging and focuses explicitly on the copy number region (CNR) calls.
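The notion of a continuous copy number dosage can be illustrated with a minimal Python sketch: a one-dimensional Gaussian mixture is fitted by EM to the intensities of one SNP across a cohort, and the dosage of an individual is taken as the posterior-weighted mean of the copy numbers. This is not the GMM implementation of this study. The function names, the starting means, the fixed set of copy number classes (1, 2 and 3), the floor on the standard deviation and the toy log2 ratios are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, weights, means, sds):
    """Posterior probability of each mixture component given x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return [d / total for d in dens]

def fit_gmm(values, init_means, init_sd=0.2, n_iter=50, min_sd=0.05):
    """Fit a 1-D Gaussian mixture to one SNP across the cohort with EM.
    min_sd is a floor that keeps tight clusters from degenerating."""
    k = len(init_means)
    weights, means, sds = [1.0 / k] * k, list(init_means), [init_sd] * k
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each individual.
        resp = [posteriors(x, weights, means, sds) for x in values]
        # M-step: update weight, mean and standard deviation per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # empty component: keep its previous parameters
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)
    return weights, means, sds

def dosage(x, weights, means, sds, copy_numbers):
    """Continuous dosage: posterior-weighted mean of the copy numbers."""
    post = posteriors(x, weights, means, sds)
    return sum(p * c for p, c in zip(post, copy_numbers))

# Toy log2 ratios at a single SNP for 13 individuals (hypothetical data).
cohort = [-1.05, -0.95, -1.0, -0.02, 0.0, 0.03, -0.04, 0.05,
          0.01, -0.01, 0.02, 0.55, 0.6]
copy_numbers = [1, 2, 3]
params = fit_gmm(cohort, init_means=[-1.0, 0.0, 0.58])

clear_call = dosage(0.0, *params, copy_numbers)       # close to 2
ambiguous_call = dosage(-0.5, *params, copy_numbers)  # between 1 and 2
print(clear_call, ambiguous_call)
```

An intensity lying between two clusters yields a dosage between the corresponding integer copy numbers rather than a forced discrete call, which is how uncertainty in the prediction can be carried forward to downstream analyses.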
Our GMM was successfully applied to both Affymetrix and Illumina arrays, and it is not restricted to SNP array analysis (i.e. it is also applicable to CGH and qPCR analyses). We also developed two merging strategies, which were applied to create a map of CNV regions for each of the four CNV detection methods. We studied how CNVs predicted by the various algorithms coincided with previously reported variants. We also investigated the concordance of CNV predictions in a subsample of individuals who were additionally genotyped on Illumina 550K SNP chips. Finally, we compared the sensitivity and specificity of the different approaches using related CoLaus individuals, who are expected to share more CNVs than unrelated individuals. Based on these criteria, we demonstrate that our new method outperforms two established CNV detection algorithms and has higher sensitivity than a third method.