Sample selection
The genomes analyzed in this study were selected from the 1000 Genomes Project [3] and previously from the International HapMap Project [16]. The test sample, NA12878, was selected due to extensive prior characterization of its genomic variation by the 1000 Genomes Project, ultra-high resolution array Comparative Genome Hybridization [17], and its use as a standard for the study of human genome variation by the National Institute of Standards and Technology’s Genome in a Bottle project [18]. NA12878 is a Utah resident of European Ancestry (CEU) and is the daughter in one of the two trios sequenced at high coverage in the 1000 Genomes Pilot Project. The control sample NA10851 was also chosen because of extensive genomic characterization including its use as the control in ultra-high resolution aCGH and as a 1000 Genomes Project low coverage sample [3, 17]. NA10851 is a male of European Ancestry from Utah (CEU). Genomic DNA for these samples as well as appropriate approval for this study was obtained from the Coriell Institute for Medical Research.
NA12878 Gold Standard CNVs
The gold standard CNVs utilized in this study were a subset of those CNVs released by the 1000 Genomes Project from 8-fold coverage Illumina paired-end population-scale sequencing data (available at 1000genomes.org) and analysis of the genomes of 2,504 individuals [4]. CNV discovery was conducted using the following tools: Delly, VariationHunter, BreakDancer, CNVnator, GenomeStrip, Pindel, and SSF. CNVs were merged by taking into account the confidence intervals around estimated boundaries. The merged set was genotyped across the entire population by GenomeStrip and filtered to remove redundancy and low quality sites based on genotype information. Genotypes were then updated though the integrated imputation of all variants (CNVs, SNPs, indels, etc.) with the MNV tool. The resulting genotypes were estimated to have a very low 3.1% false positive rate. Array-based experimental validation of CNV calls and PCR-based validation of CNV breakpoints were carried out by different contributors to the 1000 Genomes Project. CNVs genotyped as existing in NA12878 and not within 1000 Genomes Project described regions of VDJ recombination were selected to comprise our NA12878 gold standard CNV call set [Additional file 5: Spreadsheet 1].
A gold standard set of CNVs was similarly generated for the genome of NA10851, the aCGH control genome used in this study. 876 CNVs are common to the genomes of both NA12878 and NA10851 and are therefore are not expected to be detectable by aCGH platforms in this study.
Additionally, we generated an unfiltered CNV call set for NA12878 using1000 Genome Project deep sequencing data (2×250 bp paired-end sequencing to 60× coverage on Illumina HiSeq available at ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase3/data/NA12878/high_coverage_alignment) and read-depth analysis by the CNVnator algorithm [14]. This set contained 4293 total CNV calls and 4047 autosomal CNV calls [Additional file 5: Spreadsheet 1]. Of these CNVnator calls, 609 total and 587 autosomal calls overlapped a gold standard call by 50% reciprocally. A further 137 autosomal CNVnator calls overlapped a gold standard call by less than 50% reciprocally.
Generation of NA12878 CNV call sets
For each array design raw data from two technical replicate experiments using NA12878 DNA were obtained directly from the manufacturer or a manufacturer-recommended service provider and analyzed independently. Two CNV call sets were generated for each array design; one based on the manufacturer recommended platform specific software and another based on the platform agnostic software Nexus Copy Number 7.5 (BioDiscovery, Hawthorne CA 90250, U.S.A.) [Additional file 5: Spreadsheet 1]. The details of each analysis are described below. All chromosomal coordinates for the resulting CNV calls are based on hg19.
Affymetrix arrays
Raw data from hybridizations carried out on the Affymetrix SNP 6.0 and CytoScan HD arrays was obtained in the form of.cel files from published data (first replicate of SNP 6.0), from Affymetrix public data (second replicate of SNP6.0), and from Affymetrix (Santa Clara, CA 95051, U.S.A.) using NA12878 DNA. The platform specific software CNV call set for the first replicate of the SNP 6.0 array was obtained from published data [19]. The second replicate of the SNP 6.0 array was only analyzed using Nexus software as per the Affymetrix recommendation for analysis of this array. The platform specific software CNV call sets for the CytoScan HD array were obtained using the ChAS software (Affymetrix Inc. Santa Clara, CA 95051, U.S.A.) with default settings against a reference composed of a female and male HapMap sample (NA10847 and NA10851 respectively).
The Nexus call sets were generated using the SNP-FASST2 Segmentation Algorithm. The significance threshold for segmentation was set at 10−9 also requiring a minimum of three probes per segment and a maximum probe spacing of 1000 kb between adjacent probes before breaking a segment. The log ratio thresholds for single copy gain and single copy loss were set at 0.2 and −0.2, respectively. The log ratio thresholds for two or more copy gain and homozygous loss were set at 0.7 and −1.1 respectively. The Homozygous Frequency Threshold was set to 0.9. The Homozygous Value Threshold was set to 0.8. The Heterozygous Imbalance Threshold was set to 0.4. The minimum LOH length was set at 1 kb and minimum SNP probe density, at 0 probes/Mb. The final CNV call sets were generated by filtering the total set of predicted genomic aberrations for each array in the Nexus software and exporting the CNVs. This was done because Nexus reports LOH and Allelic Imbalance separately from losses and gains. Thus, LOH regions that are also copy number loss, allelic imbalance regions that are also copy number events, LOH regions that are not covered by copy number events, and allelic imbalance regions that are not covered by copy number events were removed.
Agilent arrays
Raw data from aCGH experiments carried out on the Agilent 1×1M-CGH, 1×1M-High Resolution, 2×400K-CNV, 2×400K-CGH, and 4×180K-CGH arrays using NA12878 as the test DNA and NA10851 as the control DNA were obtained from service providers in the form of feature extraction files.
The platform specific software CNV call sets were obtained by generating Interval Based Reports in the Agilent Genomic Workbench 7.0.4.0 software package (Agilent Technologies, Santa Clara CA 95051, U.S.A.) using default settings.
The Nexus call sets were generated using the FASST2 Segmentation Algorithm. The significance threshold for segmentation was set at 10−7, 10−6, 10-5 for the 1×1M, 2×400K and 4×180K arrays respectively, also requiring a minimum of 3 probes per segment and a maximum probe spacing of 1000 kb between adjacent probes before breaking a segment. The log ratio thresholds for single copy gain and single copy loss were set at 0.2 and −0.2, respectively. The log ratio thresholds for two or more copy gain and homozygous loss were set at 0.8 and −1.1 respectively. The final call sets were exported as text files from the software.
Illumina arrays
Raw data from experiments carried out on the HumanOmni5Exome v1, HumanOmni5-4v1, HumanOmni25Exome 1, HumanOmni25-8v1-1, HumanOmniExpressExome 1.2, HumanOmniExpress-24v1-0, HumanCoreExome v1.1, CytoSNP-850 k, and PsychArray using NA12878 DNA were obtained from Illumina Inc. (San Diego CA 92122, U.S.A.) in the form of.idat files.
The platform specific software CNV calls were obtained using the CNVpartition 3.2.0 algorithm with default settings in the Genome Studio 2011.1 software package (Illumina, San Diego CA 92122, U.S.A.). The Illumina-supplied analysis files used in Genome Studio are specified in Additional file 6: Table S3 for each array. Genomic regions flagged as having loss of heterozygosity and copy number of two, were removed from the final list of CNV calls from CNVpartition output.
Final reports were also generated in Genome Studio for use in Nexus CNV calling. The Nexus call sets were generated using the SNP-FASST2 Segmentation Algorithm. The significance threshold for segmentation was set at 10−10 for the Omni5, 10−9 for the Omni5Exome, Omni2.5Exome, and Omni2.5, 10−8 for the Omni1, and 10−6 for all other arrays, also requiring a minimum of 3 probes per segment and a maximum probe spacing of 1000 kb between adjacent probes before breaking a segment. The log ratio thresholds for single copy gain and single copy loss were set at 0.2 and −0.2, respectively. The log ratio thresholds for two or more copy gain and homozygous loss were set at 0.7 and −1.1 respectively. The Homozygous Frequency Threshold was set to 0.9. The Homozygous Value Threshold was set to 0.8. The Heterozygous Imbalance Threshold was set to 0.4. The minimum LOH length was set at 1 kb and the minimum SNP probe density at 0 probes/Mb. The final CNV call sets were produced by invoking two separate filtering schemes on the total set of predicted copy number aberrations for each array in the Nexus software, exporting the CNVs of each individual filtering scheme, and concatenating the two resulting CNV lists. This was done to ensure that only CNVs that were supported by both log ratio and B allele frequency evidence were included. In the first filtering scheme the following were removed: LOH regions that are also copy number loss, losses not covered by an allelic event, allelic imbalance regions that are also copy number events, LOH regions that are not covered by copy number events, allelic imbalance regions that are not covered by copy number events, and gains not covered by an allelic event. In the second filtering scheme the following were removed: LOH regions that are also copy number loss, allelic imbalance regions that are also copy number events, one copy loss, LOH regions that are not covered by copy number events, allelic imbalance regions that are not covered by copy number events, and one copy gains.
Analysis of Illumina Infinium Multi-Ethnic Global-8 v1.0 array
Using our standard procedure, i.e. analysis with both the Nexus and CNVPartition algorithms, on data obtained from the Illumina Infinium Multi-Ethnic Global-8 v1.0 array (MEGA-EX), no CNVs were called in the genome of NA12878. This 1.7 million SNP array is designed to provide insights from diverse populations by powerfully detecting common and rare variants across the 5 most commonly studied subpopulations including African, Admixed American, East Asian, European, and South Asian. Using the QuantiSNP algorithm, 11 and 16 high confidence CNVs (Max Log BF > 10) were called in two replicates. Of these 4 and 5 CNVs respectively met the 50% reciprocal overlapping criteria with a gold standard CNV. This limited performance of the Multi-Ethnic Global array for CNV calling on the genome of the model European sample, NA12878, may be due to the number and distribution of contiguously informative SNPs on this array for European samples. We found that 12% of all SNPs on this array were found to be heterozygous in NA12878 compared to 19% and 21% of the SNPs on the similarly-sized Illumina HumanOmni2.5 and HumanOmni1Quad arrays. In any case, we did not pursue a deeper analysis of the performance of the Multi-Ethnic Global array for CNV calling in in this study. It is possible that alternative software solutions could be found to achieve more increased CNV detection with this array.