Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort
- Armand Valsesia1, 2, 3,
- Brian J Stevenson2, 3,
- Dawn Waterworth4,
- Vincent Mooser4, 5,
- Peter Vollenweider5,
- Gérard Waeber5,
- C Victor Jongeneel2, 3, 6,
- Jacques S Beckmann1, 7,
- Zoltán Kutalik†1, 2 and
- Sven Bergmann†1, 2
© Valsesia et al.; licensee BioMed Central Ltd. 2012
Received: 28 November 2011
Accepted: 15 June 2012
Published: 15 June 2012
Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets.
Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs.
Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits.
Genetic variation in the human genome takes many forms ranging from large chromosome anomalies to single nucleotide polymorphisms (SNPs). Deletion, insertion and duplication events giving rise to copy number variations (CNVs) are found genome-wide in humans [1–8] and other species [9–12]. Genomic variants can impact both somatic and germ-line genetics. The link between CNVs and inherited diseases is now solidly established (e.g. [13–15]), and copy number plasticity is typical of cancer cells. Such genomic variability, which was identified more than a decade ago using array-based comparative hybridization [17, 18], was known for much longer from cytogenetic studies or Southern blots. It has been demonstrated that CNVs near oncogenes or tumor suppressor genes can affect gene expression levels or result in the expression of chimeric fusion genes [18, 19]. However, the number and positions of rare CNVs in the human genome are still likely to be underestimated and their contribution to common complex diseases such as diabetes or obesity is unclear. Very recent results demonstrate that rare variants can have very high penetrance in the etiology of morbid obesity [20, 21].
The CoLaus (Cohorte Lausannoise) study is a population-based survey started in 2003 to study risk factors for hypertension and cardiovascular diseases. In total, 6188 Caucasian individuals (35-75 years old) from the Lausanne area in Switzerland participated in the study. Of these, 5612 individuals were genotyped on Affymetrix 500K SNP chips, and a fraction of these were also genotyped on Illumina 550K SNP chips. A number of SNP-based genome-wide association studies (GWAS) that employed the CoLaus data have already been reported [24–30]. Although many other large cohorts including thousands of individuals have been genotyped for SNPs [24, 25, 31], very few have reported CNV maps [32, 33].
It is important to emphasize that most SNP arrays used so far in GWAS of clinical cohorts were not designed for CNV (dosage) detection, but only to call the three possible genotypes of SNPs. Nevertheless, by combining the intensities of the two alleles for a given SNP, it is also possible to obtain information on the copy number state of the SNP locus. However, this is challenging for several reasons. First, when analyzing very large datasets (with several thousands of individuals), it is likely that experiments were conducted at different times and/or by different laboratories, which often introduces strong batch effects in the raw intensities. Thus the first challenge in CNV calling is to ensure proper normalization of these raw data. Second, due to the large noise in the SNP probe intensities in these arrays (even after batch effects have been corrected for), the estimates of copy numbers for a given locus (SNP) are not very reliable. Thus more reliable predictions can only be made by integrating intensities from several neighboring loci, a strategy that is employed by many different CNV detection methods [34–40]. However, this approach makes CNV detection difficult (and sometimes makes it fail completely) in regions with low SNP density. To overcome this limitation, the Illumina (1M) and Affymetrix arrays (Affymetrix 5.0 and 6.0) include more SNP markers and non-polymorphic probes to cover CNV-rich regions. These arrays also received considerable attention from the community and now benefit from a variety of freely available and efficient CNV detection methods [41–47]. These methods also make use of the ratio of allelic intensities, which can improve CNV prediction. Until very recently, little has been done for Affymetrix 500K arrays, which were analyzed with software such as dChip, CNAG, GEMCA and CNAT.
All but CNAT are restricted to the Windows operating system and thus are inappropriate for the analysis of large cohorts and for distributed computing on UNIX-based clusters. Software initially developed for Illumina arrays [45, 47, 50] was modified to allow the analysis of Affymetrix arrays (in particular Affymetrix 5.0 and 6.0 arrays). However, the performance of these tools on Affymetrix 500K data has not been intensively tested, and for some the software implementation is tedious to use. For example, PennCNV is considered a very efficient tool for CNV analysis. However, to analyse Affymetrix 500K data, several pre-processing steps are needed. These steps rely on external applications (the Affymetrix APT tools) which in their current release no longer support the pre-processing. While supporting dependencies is very challenging in any software development project, such issues make it difficult for the user to decide which software to use. In addition, whilst there are now several performance benchmarks for the newest array generation [51–53], assessment of the Affymetrix 500K arrays in large cohorts is still needed.
Finally, while some methods take advantage of the signals from a single or a group of SNPs across the population to predict CNV regions for each individual [41, 54, 55], there are very few methods to merge individual CNV predictions into regions at the population level: Redon et al. merged CNVs based on the extent of their overlap, whereas Itsara et al. manually annotated complex regions.
In the current study we followed two main goals: First, we performed an extensive survey of candidate CNVs in the CoLaus study as detected by SNP genotyping microarrays. We provide a large dataset that can serve as a resource for other studies elucidating human structural variants, and for future association studies of CNVs with the clinical phenotypes measured in CoLaus. Second, since the methods for detecting individual CNV profiles and merging those into consensus regions have not yet been well established, we developed new algorithms for CNV calling and merging, and devised novel techniques to evaluate and compare them with existing methods. Specifically, we compared three existing CNV detection methods with our new method (GMM) that uses a Gaussian Mixture Model to estimate the copy number dosage at each SNP of each individual. GMM models the signal intensity at each SNP across the entire population (cohort), which differs from HMM approaches like CNAT, CNAG, dChip, PennCNV and QuantiSNP that model the signal sample by sample along each chromosome. Other GMM implementations have been successfully used in the past for BAC and CGH array analyses [56–58], but all these different methods (GMMs and HMMs) provide a discrete copy number value (e.g. 0, 1, 2, 3 and 4). It was also proposed to integrate in a single statistical model both the CNV classification and the association with binary traits. Their EM algorithm estimates the copy number state probabilities, but only to use them internally for the association testing. Similar to their approach, our GMM implementation produces (continuous) copy number dosage values that account for uncertainty in the prediction (e.g. due to sample contamination or tumor cell heterogeneity). However, our algorithm couples the calling with CNV merging and focuses explicitly on the copy number region (CNR) calls.
Our GMM was successfully applied to both Affymetrix and Illumina arrays; and is not restricted to SNP array analysis (i.e. is applicable to CGH and qPCR analyses). We also developed two merging strategies, which were applied to create a map of CNV regions for each of the four CNV detection methods. We studied how CNVs predicted by the various algorithms coincided with previously reported variants. We also investigated the concordance in predicting CNVs in a subsample of individuals that were additionally genotyped on the Illumina 550K SNP chips. Finally we compared the sensitivity and specificity of the different approaches using related CoLaus individuals which are expected to share more CNVs than unrelated individuals. Based on these criteria, we demonstrate that our new method outperforms two established CNV detection algorithms and has higher sensitivity than a third method.
Identification of copy number variants in CoLaus
To detect CNVs in CoLaus, we applied four different CNV detection algorithms to the data from 5612 Caucasians generated with Affymetrix 500 K microarrays: two implementations of the Copy Number Analysis Tool (CNAT ) that integrate the SNP intensities by summing their raw (CNAT.total) or log-transformed (CNAT.allelic) values; Circular Binary Segmentation (CBS [36, 37]) and our own algorithm based on a Gaussian Mixture Model, to which we refer subsequently as GMM. We restricted our analysis to autosomes allowing us to use a mixture of males and females as the reference panel. Using these four methods, we assigned copy number values to each probe and each CoLaus individual. (The CBS method only returns segments and their mean signal intensity, which we used to identify SNPs within candidate regions for CNVs if the corresponding ratio was below (loss) or above (gain) a certain threshold, see Methods for more details.)
In the second step we attempted to reduce the complexity of these CNV profiles by merging adjacent SNPs that contained highly redundant information into CNV regions. The first method (“simple merge”) joins neighboring SNPs (on the same chromosome) that have identical copy number values across all CoLaus participants (see Additional file 1: Figure S1A for illustration). This simple approach already significantly reduced the number of SNPs (for example, it compresses 490K autosomal SNPs into 8000 regions for CNAT.total and into 40K for CBS). However, by nature, this simple scheme leaves the boundaries of CNVs fragmented: if two adjacent SNPs differ in copy number for at least one subject, they will not be merged together (see Additional file 1: Figure S1B). Thus we devised a refined method, which is based on a principal component analysis (PCA) and self-organizing maps (SOMs). The PCA identifies orthogonal components explaining a significant (e.g. 90%) fraction of the variance. Restricting clustering or multivariate analyses to these components allows us to discard components that are likely driven by noise and to concentrate on those which together explain most of the data variability (e.g. 90%). We then used SOMs to cluster SNPs with similar ‘eigenvector profiles’ into CNV regions (see Methods for details, and Additional file 1: Figure S2 for illustration). For convenience, we refer to this approach as the ‘PCA merge’.
The fraction of the (autosomal) genome effectively covered by these regions is reported in Additional file 1: Table S1 (details per chromosome are provided in Additional file 1: Figure S5). Although GMM produces many more CNPs than the other methods, they only cover about 2.4% of the autosomes. CNAT.allelic predictions for CNPs cover 12.4% of the autosomes, while CBS and CNAT.total cover only 1.5% and 0.7% respectively. We also checked the coverage by rare variants (CNVRs): GMM had the lowest autosomal coverage, at only 9.8%, whereas CBS had the highest, at 42.4%. CBS predictions for CNPs are rather conservative in the sense that CNPs found with other methods are found for fewer individuals when using CBS (hence the much higher genome coverage for CNVRs). Additional file 1: Figure S6 shows the CNV profile on chromosome 1 as predicted by the different methods. This illustrates the dramatic differences between methods and the limited ability of CBS to detect CNPs (despite using optimized thresholds when classifying CBS segments; see Methods for details).
To further investigate the differences between the four methods, we computed the intersection using CN predictions from 60K independent autosomal SNPs (SNPs that were not in LD in the CEU population, see Supplementary Methods). Only 2.3% of the SNPs composing CNPs were validated with at least three methods (10% with at least two methods) (see Additional file 1: Table S2 and Venn diagrams in Additional file 1: Figure S7). By contrast, 23.5% of the SNPs in CNVRs were found by at least three methods and this number reached 55.3% for at least two methods. Next, we performed pair-wise comparisons between the CNV methods (Additional file 1: Table S3). The maximal intersection between two methods is 47% and corresponds to the comparison between all CNVs from GMM and CBS. Such relatively low overlaps are not uncommon in CNV analysis from SNP genotyping arrays and underline the need for proper replication of any CNV predictions [51, 52, 60].
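The kind of "found by at least k methods" summary reported above can be sketched as follows. This is an illustrative sketch only: the function name, the input layout (one set of SNP identifiers per method) and the choice of denominator (the union of SNPs called by any method) are assumptions, not the procedure used for the supplementary tables.

```python
from collections import Counter

def support_fractions(calls_by_method):
    """Fraction of called SNPs supported by at least k methods.

    calls_by_method: dict mapping a method name to the set of SNP ids
    that this method places in a CNV (hypothetical input layout).
    Returns {k: fraction}, with the union of all calls as denominator.
    """
    counts = Counter()
    for snps in calls_by_method.values():
        counts.update(snps)          # each method votes once per SNP
    total = len(counts)
    if total == 0:
        return {}
    n_methods = len(calls_by_method)
    return {
        k: sum(1 for c in counts.values() if c >= k) / total
        for k in range(1, n_methods + 1)
    }

# Toy example with made-up SNP ids
calls = {
    "GMM": {"rs1", "rs2", "rs3"},
    "CBS": {"rs2", "rs3"},
    "CNAT.total": {"rs3"},
    "CNAT.allelic": {"rs3", "rs4"},
}
fractions = support_fractions(calls)
```

In the toy example, half of the called SNPs are supported by at least two methods and a quarter by at least three.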
In order to evaluate the different detection and merging algorithms, we used three different approaches: (i) A comparison with known CNVs from a public database, (ii) A cross-platform validation using a subset of samples that were also genotyped on the Illumina platform, and (iii) similarity of related individuals with respect to their CNV profiles.
Comparison with known CNVs
The Database of Genomic Variants (DGV) is a curated catalogue of structural variation in the human genome. We downloaded its content (release 7) and kept only CNVs discovered from SNP or CGH arrays (BAC and ROMA arrays were excluded). We complemented this dataset with predictions from Itsara et al. and predictions from the high resolution CNV project. This dataset of “known” CNVs included 17804 autosomal CNVs, whose size ranged from 1 kb to 3 Mb.
Validation with Illumina arrays
From our overlap analysis, we found that CNAT.allelic predictions were not significantly different from random predictions (according to the controls using reshuffled data). This indicates that CNAT.allelic is too permissive and that the vast majority of its predictions are likely to be false positives. In contrast, CNAT.total had a better specificity than CNAT.allelic but identified much fewer CNVs compared to other methods (CBS and GMM). Both CBS and GMM performed well (showing depletion of CNVs unique to the Affymetrix data and enrichment of common CNVs). Interestingly, GMM predicted many more CNVs than CBS and the bias with respect to predictions from reshuffled data was much stronger than for all the other methods (Additional file 1: Table S5). We also performed the above analyses independently for CNPs and CNVRs (both against DGV and the Illumina data, see Additional file 1: Figure S8) and arrived at the same conclusions.
Predicting relatedness between individuals based on their CNV profile
In this work, we analyzed CNPs and rare CNV regions within the CoLaus population using four different copy number detection methods and applying two different merging procedures. We also devised various validation strategies to compare the performance of these methods.
Properties of the PCA merging technique
The simple merging approach is able to concatenate about half a million SNPs into a few thousand regions. Yet, this naïve technique requires discrete copy number predictions and leaves CNV edges fragmented into regions of few or even single SNPs. Therefore we developed a novel merging method, based on PCA and SOMs, which provides a strong improvement over the simple approach as it significantly reduces the number of single SNPs by re-attributing them to larger regions. Also, small regions (<1 kb) were extended either by incorporating single SNPs or by merging them with other small regions.
This new method provides a powerful alternative to the so-called “merge by overlap” method (MbO), commonly used in CNV studies. An inherent limitation of the MbO method arises when the underlying CNV is predicted as two distinct regions (i.e. when the predicted CNV locus is disrupted by a few probes). The MbO also requires CNV predictions to be discretized (i.e. any region with CN < 2 is treated as a deletion and any region with CN > 2 as a duplication), which results in a significant loss of information (especially in cancer studies, where homozygous deletions and focal amplifications often play a critical role in tumorigenesis). Our PCA-merging method allows 1) reconciliation of ‘disrupted’ CNVs, 2) use of the predicted (continuous) copy number value without the loss of information caused by subsequent discretization and 3) exclusion of (outlier) variation likely induced by noise in the measurement. Our PCA-merge can thus be useful to process the copy number dosage data matrix (of dimension #subjects by #SNPs) and obtain a smaller matrix (#subjects by #CNV regions) for subsequent association studies with a given clinical trait.
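To make the limitation of merging by overlap concrete, here is a minimal sketch of a generic interval merge. It merges any two calls that overlap or touch, which is a simplifying assumption; published MbO variants typically apply an overlap threshold, and the function name and coordinates below are hypothetical.

```python
def merge_by_overlap(intervals):
    """Merge (start, end) CNV calls that overlap or touch.

    intervals: iterable of (start, end) tuples on one chromosome.
    Returns the merged regions sorted by start position.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two overlapping calls plus a third separated by a short gap
calls = [(100, 200), (150, 300), (320, 400)]
regions = merge_by_overlap(calls)
```

The third call stays a separate region because of the gap between positions 300 and 320, even if that gap corresponds to only a few noisy probes inside one underlying CNV. This is the 'disrupted CNV' situation described above.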
Comparison of the different CNV prediction methods
We demonstrated that CNAT.allelic predicts a large number of CNVs. Yet only a relatively small fraction of these could be replicated, indicating that most of the predicted CNVs are likely to be false positives. This is also supported by the fact that CNV profiles generated by CNAT.allelic performed worse in predicting kinship. In contrast, CNAT.total appears to be overly conservative and is likely to miss subtle, but real CNV events. HMMs are very popular for CNV analysis but our findings underline the difficulty of using parameters that are applicable to different datasets. Ideally, the HMM parameters would need re-evaluation with each novel dataset, which can become tedious in the absence of a ground truth. An obvious improvement of CNAT would be to refine the HMM transition parameters with Bayesian methods and to co-analyse multiple samples, thus improving parameter estimation by combining data across individuals. In addition, summing allelic intensities in the log space (as in CNAT.allelic) adds considerable noise to the CN ratios and thus should be avoided.
Based on our comparative analyses we find that CBS is a robust segmentation algorithm, confirming reports by several independent studies [35, 63, 64]. Although our GMM method does not explicitly account for probe auto-correlation or allelic intensity ratios, it performs much better than the two CNAT implementations: it recalls more Illumina CNVs (CNPs and rare CNVs) while being depleted in ‘novel CNVs’ with respect to the shuffled controls. GMM and CNAT.total also perform equally well at predicting relatedness between individuals. In addition, GMM does not need pre-estimated parameters; the mean and variance of each mixture component (i.e. CN class) are updated from the data using constrained nonlinear optimization. Finally, we observed that our model was able to detect many more CNPs than CBS, suggesting higher sensitivity.
Currently our model only considers four states: deletion, copy-neutral, and one or two additional copies. Since very few homozygous deletions were observed with the other algorithms applied, we did not include a dedicated component for them in our analysis. Nevertheless, our GMM implementation allows for such an extension.
Validation of CNVs in a large clinical cohort
Validation is an essential part of any CNV discovery project. PCR, Southern blot and many other targeted techniques are useful to accurately determine the copy number at a given locus, but low throughput is a severe limitation when large numbers of CNVs need to be validated. The Database of Genomic Variants is a valuable resource and is useful to compare the ‘known’ (published) CNVs that can be recalled in a large cohort using different methods. However, due to the high heterogeneity between studies (e.g. different populations, methods and platforms, unknown false positive rates etc.) and to the absence of medical ascertainment of the subjects, DGV cannot be used to ‘validate’ CNVs (in discovery studies) and must not be used to assess the clinical impact of a given CNV. Instead, for large-scale CNV discovery studies, replicating a number of individuals (e.g. a few hundred) on an independent array platform is a viable option. With the recent reduction in the cost of microarrays, such large-scale replication now becomes affordable. In the context of CNV association with clinical traits, further validations are necessary and would include replication of the association signal in independent cohort(s) (with appropriate clinical ascertainment) as well as CNV validation (e.g. with MLPA or PCR approaches) in probands. As a complement to replication experiments, one can take advantage of the relatedness between individuals. Deciphering relatedness (if not already known) can easily be achieved by applying simple Method-of-Moments approaches [66–68] to the SNP genotypes. We show that assessing how well the relatedness can be predicted based on the CNV profiles is a powerful technique to gauge the quality of a CNV calling and merging method.
The combination of our GMM and PCA merging algorithms is a useful tool to identify CNVs. They have been successfully applied to a large clinical cohort. The techniques involved here are not limited to data from SNP arrays: they require as input only a matrix of hybridization ratios (for the former) or copy number values (for the latter). Thus they can be applied to data from other platforms such as CGH arrays. Although GMM-like approaches are simplified versions of HMMs, they are simpler to optimize (as opposed to applying pre-trained HMM parameters to a new dataset) and remain powerful tools for the analysis of both large cohorts (e.g. CoLaus) and complex datasets, as we recently demonstrated with melanoma.
Despite significant improvements in CNV detection and analysis when using the most recent SNP arrays (e.g. new generation Affymetrix arrays [41, 54]), there are still many large medical cohorts where SNP data have been collected but CNV analysis has not been reported. This concerns both complex diseases (e.g. [28, 69–71]) and cancer (e.g. [72–74]). Hundreds of thousands of individuals have already been genotyped on 500K Affymetrix or 550K Illumina SNP chips, but the corresponding data have not been used for CNV analysis, simply because it is a much more challenging task due to the lack of well-established algorithms and protocols. We hope that the present work will make it easier for researchers to make better use of their data for CNV calling.
GWAS have demonstrated that the genetic variance cannot fully be attributed to SNPs. For example, for a highly heritable trait such as height (studied in 13665 individuals), SNPs explain only 10% of the variance. It has also been shown that, for common traits, a large fraction of the heritability cannot be accounted for by CNPs. Thus the identification of rare CNVs with stronger clinical impact, as we recently demonstrated for obesity [20, 21], can open up new avenues to explore. Meta-analysis of existing cohorts for CNVs gives more power to detect rare CNVs because unique CNVs in a single cohort can then be supported by different cohorts. But such meta-analyses cannot be used to identify small variants due to the poor SNP density. In such cases, individuals with rare variants should be investigated further with higher density arrays or with genomic sequencing.
With the recent cost reduction in next generation sequencing (NGS), full-genome and exome sequencing become possible even for large cohorts (a few hundred subjects). Data from several large studies can already be retrieved [76–80] and many different algorithms have been developed to mine indels and CNVs [81–87]. Although our GMM method might be applied to predict copy number from sequencing read-depth, it was not developed for this purpose. The current Matlab implementation may not be optimal (i.e. not fast enough) and Gaussian modeling may not be the best option for such analysis (detection methods based on the Poisson distribution would be more appropriate). Nevertheless, our PCA merge could be useful for NGS data analyses. These analyses generate massive numbers of variants, among which there can be a high number of false positives. Also, the predicted variants differ greatly in size (from small indels to larger CNVs) and their boundaries (start and end positions) change between subjects. To some extent, this is similar to the challenges that arose in our CoLaus analyses. Therefore our PCA-merge method, which is designed to identify consensus CNV regions in large and complex datasets, could be of use in the post-processing of NGS structural variants.
The implementation of the Gaussian Mixture Model is publicly available at http://www2.unil.ch/cbg/index.php?title=GMM. The algorithm has been implemented in Matlab; both the source code and a compiled version for UNIX 64-bit operating systems are available. The PCA-merging algorithm has also been written in Matlab and the source code is available at http://www2.unil.ch/cbg/index.php?title=PCAmerge.
The source code of the PCA-merging algorithm requires the Matlab Neural Network toolbox, whereas the GMM source code requires the Optimization Toolbox (the compiled GMM version does not have any prerequisites and can be run as a standalone). Both methods require the Statistical Toolbox.
PCAmerge results can be retrieved from http://www2.unil.ch/cbg/index.php?title=File:Colaus_PCAmerge_results.zip.
The CoLaus study was approved by the institutional review boards of the University of Lausanne, and written consent was obtained from all participants.
The CoLaus design has been previously described . Nuclear DNA was extracted from whole blood and genotyping was performed using Affymetrix 500K SNP chips. Genotype experiments were performed by Affymetrix, Santa Clara CA, following their standard protocol.
Copy number analysis tool
In the above equations, S refers to the intensity of the test sample (of an individual) and R to the (mean) intensity of the reference panel; A and B refer to the SNP alleles.
The CNAT.allelic approach uses the sum of the logs of the allelic signals and is more sensitive to subtle allelic CN changes than CNAT.total.
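The equations referred to above are not reproduced in this text. One form consistent with the verbal description (raw allelic intensities summed for CNAT.total, logs of the allelic signals summed for CNAT.allelic) would be the following. The exact expressions, including the logarithm base and any scaling, are assumptions and should be checked against the CNAT documentation.

```latex
\begin{align}
\text{CNAT.total:}\quad   & \log_2\!\left(\frac{S_A + S_B}{R_A + R_B}\right) \\
\text{CNAT.allelic:}\quad & \log_2\!\left(\frac{S_A}{R_A}\right) + \log_2\!\left(\frac{S_B}{R_B}\right)
\end{align}
```

Under this reading, the second expression responds to a change in either allele separately, which would explain both its greater sensitivity to allelic changes and its greater susceptibility to noise.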
Through QC analyses, we discovered an important batch effect related to the fact that these samples were processed by four distinct centers (with 615, 1666, 1618 and 1736 samples respectively). These batches differed in variance, as revealed by a PCA (Additional file 1: Figure S11). Therefore we normalized data from each genotyping center independently and tested the improvement as a function of the number of references used (see Supplementary Data and Additional file 1: Figure S12). Although Affymetrix suggests that 25 samples are enough for normalization (see CNAT manual http://www.affymetrix.com), we established that in the presence of strong experimental biases, using many more references performed significantly better (see Supplementary Methods). Thus we re-applied the two CNAT implementations to ratios normalized within each genotyping center and using 280 references, producing much more reliable results than the initial normalization (with 30 references). PCA of the renormalized ratios did not reveal significant differences either between the genotyping centers (Additional file 1: Figure S13) or between the array sets (NSP, STY) (Additional file 1: Figure S14).
In parallel to the normalizations performed using GTYPE, we normalized the CoLaus data with the Aroma.Affymetrix framework. Normalizations were done independently for datasets from each genotyping center with at least 336 individuals. Since Aroma.Affymetrix requires many I/O operations, which can severely degrade computational performance on shared network disks, this number of references was chosen to optimize computational performance while remaining large enough for batch effect correction (see Supplementary Methods). Normalization steps included Allelic Cross-talk calibration [91, 92] to correct for differences between SNP alleles, intensity summarization using Robust Median Average, and correction for any PCR amplification bias inherent to the Affymetrix SNP platform. To estimate the copy number ratio (CNR) for a given sample at a given SNP probe, we computed the log2 ratio of the normalized intensity of this probe divided by the median across all the samples from the same batch.
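The last step, a log2 ratio against the batch median, can be written in a few lines. This is a minimal sketch for a single probe; the function name and the toy intensities are hypothetical, and the upstream normalization steps are not reproduced here.

```python
import math
from statistics import median

def log2_ratios(intensities):
    """Log2 ratio of each sample's intensity to the batch median.

    intensities: normalized intensities of ONE probe for all samples
    of ONE batch (genotyping center); values must be positive.
    """
    reference = median(intensities)
    return [math.log2(value / reference) for value in intensities]

# Toy batch of three samples for one probe
ratios = log2_ratios([100.0, 200.0, 400.0])
```

A sample at the batch median gets a ratio of 0, while a sample with half or twice the median intensity gets -1 or +1 respectively.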
Circular binary segmentation
Circular Binary Segmentation (CBS) has been described as a state-of-the-art segmentation algorithm [36, 37]; it identifies change points using maximal t-statistics and assesses segment significance with permutations. We applied CBS, with its default parameters, to the CNRs obtained with the Aroma.Affymetrix framework. It should be noted that CBS only reports segments of probes (with their mean log2 ratios) and does not classify them into gains or losses. To obtain such a classification, we investigated the distribution of the segments’ log2 ratios (Additional file 1: Figure S15). This distribution revealed that segments with log2 ratios greater than 0.25 or lower than -0.25 were outliers (i.e. ratios greater than 3rd quartile + 1.5 * interquartile range or lower than 1st quartile - 1.5 * interquartile range). A clustering using a three-component Gaussian Mixture Model confirmed this separation of the data. Thus we decided to classify regions having a mean log2 ratio greater than 0.25 as gains (CN = 3) and regions with mean log2 ratios lower than -0.25 as losses (CN = 1).
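The outlier rule and the resulting classification can be sketched as follows. The function names are hypothetical, the quartile convention is that of Python's `statistics.quantiles` (which may differ slightly from the one used in the original analysis), and segments lying exactly on a threshold are treated as copy-neutral, which is an assumption.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Lower and upper outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify_segment(mean_log2, loss=-0.25, gain=0.25):
    """Map a segment's mean log2 ratio to a copy number call.

    Returns 3 for a gain, 1 for a loss and 2 otherwise, using the
    thresholds quoted in the text as defaults.
    """
    if mean_log2 > gain:
        return 3
    if mean_log2 < loss:
        return 1
    return 2

# Toy values: fences for a small sample, calls for three segments
fences = tukey_fences([1, 2, 3, 4, 5, 6, 7])
calls = [classify_segment(x) for x in (0.31, -0.40, 0.05)]
```

In practice the fences would be computed on the full distribution of segment means and then compared with the fixed thresholds of ±0.25.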
Gaussian mixture models
Raw copy number ratios were smoothed along physical position using Loess filtering with a 41-probe window size (producing the same resolution, ~100 kb, as the smoothing done in CNAT). This Loess smoothing corrects for spatial autocorrelation artifacts due to GC effects. Next, a four-component Gaussian mixture model (one component for each of the following copy number states: deletion, copy-neutral, 1 and 2 additional copies) was fitted to the smoothed copy number ratios with a constraint on the differences between the mixture means. Separation between the mixture components is obtained using the simplex search method from Lagarias et al. The means of the mixture components were not fixed, as the population mean may not necessarily be two copies. Then, for each individual we determined the probabilities for each of these copy number states (see Additional file 1: Figure S16). The expected copy number was finally assigned as the weighted sum of individual dosage probabilities; for example, a SNP with probabilities of 1% for CN = 1, 9% for CN = 2, 85% for CN = 3 and 5% for CN = 4 would have a CN dosage value equal to 2.94 (1*0.01 + 2*0.09 + 3*0.85 + 4*0.05). Evaluation of the GMM performance, using simulated data, is detailed in the Supplementary Methods (see also Additional file 1: Figures S17 and S18).
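The dosage calculation can be sketched in Python as follows. This is not the published Matlab implementation: the mixture fitting (constrained simplex search) is omitted, and the function names and the mixture parameters in the example are hypothetical values chosen only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, weights, means, sds):
    """Posterior probability of each mixture component given ratio x."""
    dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return [d / total for d in dens]

def expected_dosage(probs, states=(1, 2, 3, 4)):
    """Probability-weighted copy number (continuous dosage)."""
    return sum(cn * p for cn, p in zip(states, probs))

# Worked example from the text: 1%, 9%, 85% and 5% for CN = 1..4
dosage = expected_dosage([0.01, 0.09, 0.85, 0.05])

# Hypothetical fitted mixture (weights, means, sds are made up)
probs = posteriors(0.0,
                   weights=(0.02, 0.94, 0.03, 0.01),
                   means=(-0.5, 0.0, 0.35, 0.6),
                   sds=(0.1, 0.1, 0.1, 0.1))
```

The first call reproduces the dosage of 2.94 from the worked example. The second shows how posterior probabilities would be obtained once the mixture parameters have been estimated from the population.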
Illumina CNV analysis
A subset of 239 CoLaus individuals was analyzed on Illumina arrays (550K versions 1 and 3, and 1M). Only SNPs from the 550K version 1 and 1M arrays that could be remapped to the 550K version 3 array (genome assembly build NCBI 36) were used for the analysis. Intensities were normalized within BeadStudio using 120 HapMap samples. Copy number ratios (LRR as exported in the Final Report files) were then smoothed using Loess smoothing, and copy number estimation was performed using the GMM. Subsequently, CNV predictions were merged into CNVRs with the PCA approach (see below). CNVRs found in only one sample were excluded.
Our raw CN data can be represented as a matrix in which each element is the copy number status of one individual (row) at one SNP (column). The “simple merge procedure” consists of combining adjacent SNPs, on the same chromosome, that share the same CN profile across the whole population (see illustration in Additional file 1: Figure S1). This is equivalent to merging strictly identical SNP columns. That is, to define a CNV region, all the corresponding SNPs from the same subject must have the same predicted copy number, although different subjects can differ in copy number (profile). To avoid creating CNV regions that would encompass long genomic regions with low SNP density, we required that two SNPs in the same CNV region be no further than 500 kb from each other. This rule did not apply to regions where all SNPs were copy neutral. To perform this merge with the GMM predictions, we rounded the CN values to the nearest integer.
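A minimal Python sketch of this procedure is given below. It assumes that the 500 kb rule applies to the gap between consecutive SNPs (the text could also be read as a limit on the total span of a region), that CN values have already been rounded to integers, and that the input covers a single chromosome with SNPs sorted by position. The function name and data layout are hypothetical.

```python
def simple_merge(cn_matrix, positions, max_gap=500_000, neutral=2):
    """Merge adjacent SNPs with identical CN columns into regions.

    cn_matrix: list of rows (individuals), each a list of integer CN
               values, one per SNP.
    positions: SNP positions (bp) on one chromosome, sorted ascending.
    Returns a list of (first_snp_index, last_snp_index), inclusive.
    """
    n_snps = len(positions)
    if n_snps == 0:
        return []
    # One tuple per SNP: its CN profile across all individuals.
    columns = [tuple(row[j] for row in cn_matrix) for j in range(n_snps)]
    regions = []
    start = 0
    for j in range(1, n_snps):
        same_profile = columns[j] == columns[j - 1]
        all_neutral = all(cn == neutral for cn in columns[j])
        close_enough = positions[j] - positions[j - 1] <= max_gap
        # The gap rule is waived when every individual is copy neutral.
        if not (same_profile and (close_enough or all_neutral)):
            regions.append((start, j - 1))
            start = j
    regions.append((start, n_snps - 1))
    return regions


# Toy data: 3 individuals, 6 SNPs (illustrative values only).
cn_matrix = [
    [2, 2, 3, 3, 3, 2],
    [2, 2, 2, 2, 2, 2],
    [2, 2, 1, 1, 2, 2],
]
positions = [100, 900_000, 1_000_000, 1_050_000, 1_700_000, 1_800_000]
regions = simple_merge(cn_matrix, positions)
```

In the toy data, the first two SNPs are merged despite being almost 900 kb apart because they are copy neutral in all individuals, whereas the fifth SNP forms its own region because the third individual's copy number changes there.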
The PCA merge is a novel merging algorithm for CNV profiles. It includes four steps: (1) each chromosome is partitioned into CNV regions, whose boundaries are long stretches of SNPs (e.g. 1 Mb in size) that are in the diploid state for all CoLaus subjects; (2) for each of these CNV regions, a principal component analysis is performed on the regional (clipped) CNV profiles (Additional file 1: Figure S2); (3) we then apply a principal component (PC) decomposition of the expected CNV dosage matrix (of size #individuals by #probes). Only the m largest components that together explain at least 90% of the total variance are used to derive a (filtered) matrix of SNP eigenvectors (of size m by #SNPs), which is subsequently used to cluster together SNPs with similar eigenvector profiles. Clustering is done using Self-Organizing Maps (see Supplementary Methods for details about SOMs); (4) strictly adjacent SNPs within the same SOM cluster are merged into final CNV regions.
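Two of these steps lend themselves to a short Python sketch: choosing m in step (3) and merging adjacent SNPs in step (4). The sketch assumes that m is the smallest number of leading components whose cumulative variance reaches 90%, and it takes the cluster labels as given rather than running a Self-Organizing Map. Function names and example values are hypothetical.

```python
def n_components_for_variance(eigenvalues, min_fraction=0.90):
    """Smallest m such that the m largest eigenvalues explain at least
    min_fraction of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for m, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= min_fraction:
            return m
    return len(ordered)


def merge_adjacent_by_cluster(cluster_labels):
    """Merge strictly adjacent SNPs that share a cluster label.

    cluster_labels: one label per SNP, in genomic order.
    Returns a list of (first_snp_index, last_snp_index), inclusive.
    """
    if not cluster_labels:
        return []
    regions = []
    start = 0
    for j in range(1, len(cluster_labels)):
        if cluster_labels[j] != cluster_labels[j - 1]:
            regions.append((start, j - 1))
            start = j
    regions.append((start, len(cluster_labels) - 1))
    return regions


# Illustrative inputs only.
m = n_components_for_variance([6.0, 3.5, 0.3, 0.2])
final_regions = merge_adjacent_by_cluster([0, 0, 1, 1, 0, 2, 2])
```

Note that SNPs with the same cluster label that are not adjacent (indices 0-1 and 4 in the example) end up in separate regions, as required by step (4).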
Pairwise IBS analysis
Pairwise identity-by-state (IBS) analysis was performed using PLINK. We used a sliding window of 50 SNPs, shifted in increments of 5 SNPs. SNPs with a variance inflation factor (VIF) greater than 2 were pruned from each window.
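The window, step and VIF values above correspond to the three arguments of PLINK's `--indep` option. The Python sketch below only assembles command lines that would be one plausible way to run such an analysis; the exact commands used in the study are not given in the text, and the file prefixes and function name are hypothetical.

```python
def plink_ibs_commands(bfile, out_prefix, window=50, step=5, vif=2):
    """Build two PLINK command lines: VIF-based pruning, then pairwise IBS.

    bfile and out_prefix are hypothetical file prefixes. The commands are
    returned as argument lists and are not executed here.
    """
    prune = [
        "plink", "--bfile", bfile,
        "--indep", str(window), str(step), str(vif),
        "--out", out_prefix,
    ]
    # --indep writes the retained SNPs to <out_prefix>.prune.in
    ibs = [
        "plink", "--bfile", bfile,
        "--extract", out_prefix + ".prune.in",
        "--genome",
        "--out", out_prefix,
    ]
    return prune, ibs


prune_cmd, ibs_cmd = plink_ibs_commands("colaus", "colaus_pruned")
```

Each list could be passed to `subprocess.run` on a system where PLINK is installed.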
We acknowledge Yolande Barreau, Mathieu Firmann, Vladimir Mayor, Anne-Lise Bastian, Binasa Ramic, Martine Moranville, Martine Baumer, Marcy Sagette, Jeanne Ecoffey, and Sylvie Mermoud for data collection. Parts of the computations were performed on the Vital-IT cluster. We also thank Bastian Peter for system administration and for meeting our storage needs. We would like to acknowledge Richard Redon for his valuable advice at an early stage of the study and Toby Johnson for statistical discussions.
AV, BJS and CVJ are funded by the Ludwig Institute for Cancer Research. JSB is supported by a grant from the Swiss National Foundation (310000-112552). The CoLaus study was supported by grants from GlaxoSmithKline, the Faculty of Biology and Medicine of Lausanne and by the Swiss National Foundation (33CSCO-122661).
- Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C: Detection of large-scale variation in the human genome. Nat Genet. 2004, 36: 949-951. 10.1038/ng1416.
- Feuk L, Carson AR, Scherer SW: Structural variation in the human genome. Nat Rev Genet. 2006, 7: 85-97.
- Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W: Global variation in copy number in the human genome. Nature. 2006, 444: 444-454. 10.1038/nature05329.
- Tuzun E, Sharp AJ, Bailey JA, Kaul R, Morrison VA, Pertz LM, Haugen E, Hayden H, Albertson D, Pinkel D: Fine-scale structural variation of the human genome. Nat Genet. 2005, 37: 727-732. 10.1038/ng1562.
- Sharp AJ, Locke DP, McGrath SD, Cheng Z, Bailey JA, Vallente RU, Pertz LM, Clark RA, Schwartz S, Segraves R: Segmental duplications and copy-number variation in the human genome. Am J Hum Genet. 2005, 77: 78-88. 10.1086/431652.
- Jakobsson M, Scholz SW, Scheet P, Gibbs JR, VanLiere JM, Fung HC, Szpiech ZA, Degnan JH, Wang K, Guerreiro R: Genotype, haplotype and copy-number variation in worldwide human populations. Nature. 2008, 451: 998-1003. 10.1038/nature06742.
- Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M: Large-scale copy number polymorphism in the human genome. Science. 2004, 305: 525-528. 10.1126/science.1098918.
- Freeman JL, Perry GH, Feuk L, Redon R, McCarroll SA, Altshuler DM, Aburatani H, Jones KW, Tyler-Smith C, Hurles ME: Copy number variation: new insights in genome diversity. Genome Res. 2006, 16: 949-961. 10.1101/gr.3677206.
- Perry GH, Tchinda J, McGrath SD, Zhang J, Picker SR, Caceres AM, Iafrate AJ, Tyler-Smith C, Scherer SW, Eichler EE: Hotspots for copy number variation in chimpanzees and humans. Proc Natl Acad Sci U S A. 2006, 103: 8006-8011. 10.1073/pnas.0602318103.
- Perry GH, Yang F, Marques-Bonet T, Murphy C, Fitzgerald T, Lee AS, Hyland C, Stone AC, Hurles ME, Tyler-Smith C: Copy number variation and evolution in humans and chimpanzees. Genome Res. 2008, 18: 1698-1710. 10.1101/gr.082016.108.
- Lee AS, Gutierrez-Arcelus M, Perry GH, Vallender EJ, Johnson WE, Miller GM, Korbel JO, Lee C: Analysis of copy number variation in the rhesus macaque genome identifies candidate loci for evolutionary and human disease studies. Hum Mol Genet. 2008, 17: 1127-1136. 10.1093/hmg/ddn002.
- Henrichsen CN, Vinckenbosch N, Zollner S, Chaignat E, Pradervand S, Schutz F, Ruedi M, Kaessmann H, Reymond A: Segmental copy number variation shapes tissue transcriptomes. Nat Genet. 2009, 41: 424-429. 10.1038/ng.345.
- Lupski JR, Stankiewicz P: Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet. 2005, 1: e49. 10.1371/journal.pgen.0010049.
- de Cid R, Riveira-Munoz E, Zeeuwen PL, Robarge J, Liao W, Dannhauser EN, Giardina E, Stuart PE, Nair R, Helms C: Deletion of the late cornified envelope LCE3B and LCE3C genes as a susceptibility factor for psoriasis. Nat Genet. 2009, 41: 211-215. 10.1038/ng.313.
- Beckmann JS, Estivill X, Antonarakis SE: Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nat Rev Genet. 2007, 8: 639-646.
- Cowell JK, Hawthorn L: The application of microarray technology to the analysis of the cancer genome. Curr Mol Med. 2007, 7: 103-120. 10.2174/156652407779940387.
- Kallioniemi A, Kallioniemi OP, Sudar D, Rutovitz D, Gray JW, Waldman F, Pinkel D: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors. Science. 1992, 258: 818-821. 10.1126/science.1359641.
- Kallioniemi A: CGH microarrays and cancer. Curr Opin Biotechnol. 2008, 19: 36-40. 10.1016/j.copbio.2007.11.004.
- Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nat Genet. 2005, 37 (Suppl): S11-S17.
- Jacquemont S, Reymond A, Zufferey F, Harewood L, Walters RG, Kutalik Z, Martinet D, Shen Y, Valsesia A, Beckmann ND: Mirror extreme BMI phenotypes associated with gene dosage at the chromosome 16p11.2 locus. Nature. 2011, 478: 97-102. 10.1038/nature10406.
- Walters RG, Jacquemont S, Valsesia A, de Smith AJ, Martinet D, Andersson J, Falchi M, Chen F, Andrieux J, Lobbens S: A new highly penetrant form of obesity due to deletions on chromosome 16p11.2. Nature. 2010, 463: 671-675. 10.1038/nature08727.
- Vollenweider P, Hayoz D, Preisig M, Pecoud A, Warterworht D, Mooser V, Paccaud F, Waeber G: [Health examination survey of the Lausanne population: first results of the CoLaus study]. Rev Med Suisse. 2006, 2: 2528-2530. 2532-2523.
- Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, Najjar SS, Zhao JH, Heath SC, Eyheramendy S: Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet. 2009, 41: 666-676. 10.1038/ng.361.
- Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS: Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008, 40: 575-583. 10.1038/ng.121.
- Kolz M, Johnson T, Sanna S, Teumer A, Vitart V, Perola M, Mangino M, Albrecht E, Wallace C, Farrall M: Meta-analysis of 28,141 individuals identifies common variants within five new loci that influence uric acid concentrations. PLoS Genet. 2009, 5: e1000504. 10.1371/journal.pgen.1000504.
- Prokopenko I, Langenberg C, Florez JC, Saxena R, Soranzo N, Thorleifsson G, Loos RJ, Manning AK, Jackson AU, Aulchenko Y: Variants in MTNR1B influence fasting glucose levels. Nat Genet. 2009, 41: 77-81. 10.1038/ng.290.
- Loos RJ, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS: Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet. 2008, 40: 768-775. 10.1038/ng.140.
- Sandhu MS, Waterworth DM, Debenham SL, Wheeler E, Papadakis K, Zhao JH, Song K, Yuan X, Johnson T, Ashford S: LDL-cholesterol concentrations: a genome-wide association study. Lancet. 2008, 371: 483-491. 10.1016/S0140-6736(08)60208-1.
- Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S: Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature. 2010, 467: 832-838. 10.1038/nature09410.
- Willer CJ, Speliotes EK, Loos RJ, Li S, Lindgren CM, Heid IM, Berndt SI, Elliott AL, Jackson AU, Lamina C: Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nat Genet. 2009, 41: 25-34. 10.1038/ng.287.
- Itsara A, Cooper GM, Baker C, Girirajan S, Li J, Absher D, Krauss RM, Myers RM, Ridker PM, Chasman DI: Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet. 2009, 84: 148-161. 10.1016/j.ajhg.2008.12.014.
- Shaikh TH, Gai X, Perin JC, Glessner JT, Xie H, Murphy K, O'Hara R, Casalunovo T, Conlin LK, D'Arcy M: High-resolution mapping and analysis of copy number variations in the human genome: a data resource for clinical and research applications. Genome Res. 2009, 19: 1682-1690. 10.1101/gr.083501.108.
- Clevert DA, Mitterecker A, Mayr A, Klambauer G, Tuefferd M, De Bondt A, Talloen W, Gohlmann H, Hochreiter S: cn.FARMS: a latent variable model to detect copy number variations in microarray data with a low false discovery rate. Nucleic Acids Res. 2011, 39: e79. 10.1093/nar/gkr197.
- Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24: 309-318. 10.1093/bioinformatics/btm601.
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23: 657-663. 10.1093/bioinformatics/btl646.
- Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004, 5: 557-572. 10.1093/biostatistics/kxh008.
- Komura D, Shen F, Ishikawa S, Fitch KR, Chen W, Zhang J, Liu G, Ihara S, Nakamura H, Hurles ME: Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006, 16: 1575-1584. 10.1101/gr.5629106.
- Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH: Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics. 2004, 1: 287-299.
- Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005, 65: 6071-6079. 10.1158/0008-5472.CAN-05-0465.
- McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A: Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet. 2008, 40: 1166-1174. 10.1038/ng.238.
- Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S: PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010, 11: 164-175. 10.1093/biostatistics/kxp045.
- Scharpf RB, Ruczinski I, Carvalho B, Doan B, Chakravarti A, Irizarry RA: A multilevel model to address batch effects in copy number estimation using SNP arrays. Biostatistics. 2010, 12: 33-50.
- Ritchie ME, Carvalho BS, Hetrick KN, Tavare S, Irizarry RA: R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009, 25: 2621-2623. 10.1093/bioinformatics/btp470.
- Coin LJ, Asher JE, Walters RG, Moustafa JS, de Smith AJ, Sladek R, Balding DJ, Froguel P, Blakemore AI: cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs. Nat Methods. 2010, 7: 541-546. 10.1038/nmeth.1466.
- Wang K, Li M, Hadley D, Liu R, Glessner J, Grant SF, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17: 1665-1674. 10.1101/gr.6861907.
- Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, Bassett AS, Seller A, Holmes CC, Ragoussis J: QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007, 35: 2013-2025. 10.1093/nar/gkm076.
- LaFramboise T, Weir BA, Zhao X, Beroukhim R, Li C, Harrington D, Sellers WR, Meyerson M: Allele-specific amplification in cancer revealed by SNP array analysis. PLoS Comput Biol. 2005, 1: e65. 10.1371/journal.pcbi.0010065.
- Li C: Automating dChip: toward reproducible sharing of microarray data analysis. BMC Bioinformatics. 2008, 9: 231. 10.1186/1471-2105-9-231.
- Wang K: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res. 2007, 17: 1665-1674. 10.1101/gr.6861907.
- Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, Lionel AC, Thiruvahindrapuram B, Macdonald JR, Mills R: Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol. 2011, 29: 512-520. 10.1038/nbt.1852.
- Zhang D, Qian Y, Akula N, Alliey-Rodriguez N, Tang J, Gershon ES, Liu C: Accuracy of CNV Detection from GWAS Data. PLoS One. 2011, 6: e14511. 10.1371/journal.pone.0014511.
- Tsuang DW, Millard SP, Ely B, Chi P, Wang K, Raskind WH, Kim S, Brkanac Z, Yu CE: The effect of algorithms on copy number variant detection. PLoS One. 2010, 5: e14456. 10.1371/journal.pone.0014456.
- Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E, Veitch J, Collins PJ, Darvishi K: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40: 1253-1260. 10.1038/ng.237.
- Pique-Regi R, Ortega A, Asgharzadeh S: Joint estimation of copy number variation and reference intensities on multiple DNA arrays using GADA. Bioinformatics. 2009, 25: 1223-1230. 10.1093/bioinformatics/btp119.
- Valsesia A, Rimoldi D, Martinet D, Ibberson M, Benaglio P, Quadroni M, Waridel P, Gaillard M, Pidoux M, Rapin B: Network-guided analysis of genes with altered somatic copy number and gene expression reveals pathways commonly perturbed in metastatic melanoma. PLoS One. 2011, 6: e18369. 10.1371/journal.pone.0018369.
- Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET: Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biol. 2007, 8: R228. 10.1186/gb-2007-8-10-r228.
- Broet P, Richardson S: Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model. Bioinformatics. 2006, 22: 911-918. 10.1093/bioinformatics/btl035.
- Barnes C, Plagnol V, Fitzgerald T, Redon R, Marchini J, Clayton D, Hurles ME: A robust statistical method for case-control association testing with copy number variation. Nat Genet. 2008, 40: 1245-1252. 10.1038/ng.206.
- Dellinger AE, Saw SM, Goh LK, Seielstad M, Young TL, Li YJ: Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays. Nucleic Acids Res. 2010, 38: e105. 10.1093/nar/gkq040.
- Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P: Origins and functional impact of copy number variation in the human genome. Nature. 2010, 464: 704-712. 10.1038/nature08516.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
- Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics. 2005, 21: 3763-3770. 10.1093/bioinformatics/bti611.
- Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics. 2005, 21: 4084-4091. 10.1093/bioinformatics/bti677.
- Lagarias JC, Reeds JA, Wright MH, Wright PE: Convergence Properties of the Nelder–Mead Simplex Method in Low Dimensions. SIAM J on Optimization. 1998, 9: 112-147. 10.1137/S1052623496303470.
- Anderson AD, Weir BS: A maximum-likelihood method for the estimation of pairwise relatedness in structured populations. Genetics. 2007, 176: 421-440. 10.1534/genetics.106.063149.
- Milligan BG: Maximum-likelihood estimation of relatedness. Genetics. 2003, 163: 1153-1167.
- Csillery K, Johnson T, Beraldi D, Clutton-Brock T, Coltman D, Hansson B, Spong G, Pemberton JM: Performance of marker-based relatedness estimators in natural populations of outbred vertebrates. Genetics. 2006, 173: 2091-2101. 10.1534/genetics.106.057331.
- van Es MA, Veldink JH, Saris CG, Blauw HM, van Vught PW, Birve A, Lemmens R, Schelhaas HJ, Groen EJ, Huisman MH: Genome-wide association study identifies 19p13.3 (UNC13A) and 9p21.2 as susceptibility loci for sporadic amyotrophic lateral sclerosis. Nat Genet. 2009, 41: 1083-1087. 10.1038/ng.442.
- Soranzo N, Spector TD, Mangino M, Kuhnel B, Rendon A, Teumer A, Willenborg C, Wright B, Chen L, Li M: A genome-wide meta-analysis identifies 22 loci associated with eight hematological parameters in the HaemGen consortium. Nat Genet. 2009, 41: 1182-1190. 10.1038/ng.467.
- Rivadeneira F, Styrkarsdottir U, Estrada K, Halldorsson BV, Hsu YH, Richards JB, Zillikens MC, Kavvoura FK, Amin N, Aulchenko YS: Twenty bone-mineral-density loci identified by large-scale meta-analysis of genome-wide association studies. Nat Genet. 2009, 41: 1199-1206. 10.1038/ng.446.
- Gudmundsson J, Sulem P, Rafnar T, Bergthorsson JT, Manolescu A, Gudbjartsson D, Agnarsson BA, Sigurdsson A, Benediktsdottir KR, Blondal T: Common sequence variants on 2p15 and Xp11.22 confer susceptibility to prostate cancer. Nat Genet. 2008, 40: 281-283. 10.1038/ng.89.
- Eeles RA, Kote-Jarai Z, Al Olama AA, Giles GG, Guy M, Severi G, Muir K, Hopper JL, Henderson BE, Haiman CA: Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nat Genet. 2009, 41: 1116-1121. 10.1038/ng.450.
- Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A: Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet. 2008, 40: 310-315. 10.1038/ng.91.
- Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P: Origins and functional impact of copy number variation in the human genome. Nature. 2009, 464: 704-712.
- 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.
- Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Guo Y: The diploid genome sequence of an Asian individual. Nature. 2008, 456: 60-65. 10.1038/nature07484.
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK: Mapping copy number variation by population-scale genome sequencing. Nature. 2011, 470: 59-65. 10.1038/nature09708.
- Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, Eichler EE: Diversity of human copy number variation and multicopy genes. Science. 2010, 330: 641-646. 10.1126/science.1197005.
- Levy S, Sutton G, Ng PC, Feuk L, Halpern AL, Walenz BP, Axelrod N, Huang J, Kirkness EF, Denisov G: The diploid genome sequence of an individual human. PLoS Biol. 2007, 5: e254. 10.1371/journal.pbio.0050254.
- Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, Hochreiter S: cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012, 40: e69. 10.1093/nar/gks003.
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M: A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2010, 43: 491-498.
- Yoon S, Xuan Z, Makarov V, Ye K, Sebat J: Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009, 19: 1586-1592. 10.1101/gr.092981.109.
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009, 25: 2865-2871. 10.1093/bioinformatics/btp394.
- Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009, 6: S13-S20. 10.1038/nmeth.1374.
- Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC: Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009, 19: 1270-1278. 10.1101/gr.088633.108.
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677-681. 10.1038/nmeth.1363.
- Firmann M, Mayor V, Vidal PM, Bochud M, Pecoud A, Hayoz D, Paccaud F, Preisig M, Song KS, Yuan X: The CoLaus study: a population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc Disord. 2008, 8: 6. 10.1186/1471-2261-8-6.
- The International HapMap Consortium: The International HapMap Project. Nature. 2003, 426: 789-796. 10.1038/nature02168.
- Bengtsson H: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. 2008, Berkeley: Tech Report, Department of Statistics, University of California, 745.
- Bengtsson H, Irizarry R, Carvalho B, Speed TP: Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008, 24: 759-767. 10.1093/bioinformatics/btn016.
- Bengtsson H, Ray A, Spellman P, Speed TP: A single-sample method for normalizing and combining full-resolution copy numbers from multiple platforms, labs and analysis methods. Bioinformatics. 2009, 25: 861-867. 10.1093/bioinformatics/btp074.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.