Whole genome sequencing data
To characterize the landscape of structure variations, we use whole genome sequencing data from three publicly available genomes of Caucasian (CEU, NA12891; CEf, NA12892 [4]), African (YRI1, NA19239 [6]; YRI2 NA 18507 [10]), and Asian (Yh1 [11]; KOR [12]) individuals, as well as two currently unpublished genomes of two individuals (UG1, and UG2) from a population distinct from the six public genomes. All were sequenced using the Illumina GAII sequencer using paired-end and single-end reads differing primarily in the read length - the first three sets genomes are based mostly on 36-base reads, whereas the 2 additional genomes are using 115 base reads. All genomes were processed and aligned for this analysis using the Burrows Wheelers Aligner [13] and Samtools [14] to 25x coverage or higher using a total number of aligned bases ranging from 100 to 275 million bases. Identical assembly pipeline was applied to align the genomes to the Hg19 human genome reference. For validation, we use genic regions as defined in RefGene [15], as well as a 8,599 validated CNV segments from Conrad et al. [5] comparing genotyping information from several populations.
Estimating copy number variation with FREEC
We run FREEC with 3 Kb window size and other default parameters except those under study (detailed below). We compared the CNV estimates from the output of FREEC, which applies a sliding window strategy with GC content normalization to make absolute copy number predictions. FREEC first maps reads from a given sequencing run to non-overlapping windows spanning the entire reference genome. The raw copy number of a given genomic region is assumed to be proportional to the number of reads that align to the windows spanning that region. The algorithm then normalizes these raw read counts to account for sequence characteristics of the genomic region which influence the number of reads within each window. A segmentation algorithm is applied to the normalized read counts to identify contiguous windows that make up a genomic region with a unique copy-number value. The final step in the algorithm estimates the copy-number value of the segmented genomic regions, thus resulting in a genome-wide copy number profile.
FREEC can be run in three configurations corresponding to different normalizations of read counts within a window [see Additional File 1]. We compare FREEC results from the different configurations with CNV estimation based on normalization from: (i) GC content, (ii) mappability [16] (76-base segment length for UG1 and UG2, and 36-base segment length for the remaining genomes), and (iii) control genome.
Mappability characterizes the degree to which a region of certain length is distinct and hence uniquely mappable to the reference genome. We use FREEC's default parameter which, given mappability information, considers regions for which 85% of the bases are mappable. We use three different criteria to select a control genome: (i) in-population where all genomes use as control the other genome from the same population, (ii) single control genome YRI1 (African, similar coverage profile as most genome), and (iii) single control genome YRI2 (African, high coverage genome, >50X). For in-population controls, the genomes are paired as follows: UG1 and UG2, YRI1 and YRI2, CEU and CEf; and Yh1 and KOR.
Estimating copy number variation with CNV-seq
CNV-seq [8] uses a fixed length sliding window and normalization of the analyzed (test) genome using a control genome. Differences in read counts mapped between the control and test genomes for a given segment determine the relative copy number change. We use CNV-seq with two modifications of the default parameters for increased stringency: lower p-value for the CNV calls (10-4 instead of the default 10-3), and higher number of consecutive windows required to call a variation (6 instead of the default 4). We present results using a 3 kilobase window and the trends remain the same with smaller or larger window sizes (data not shown).
We use two sets of seven CNV-seq comparisons, where CNV analysis was done using YRI1 in the first and YRI2 in the second set. These sets are analogous to the FREEC configurations using the African genomes as control, namely (ii) and (iii) in the previous subsection.
Merging CNV calls into comparative CNV calls
To enable comparative analysis of CNV segments output by multiple tool configurations, we consolidate the outputs of each of the three FREEC configurations and one CNV-seq configuration into a set of CNV segments that characterize variations collectively. The steps in the method for merging the CNV segment are given in Figure 1A, and are also illustrated with an example in Figure 1B. Briefly, a merged list of CNV segments is generated by aligning all detected CNV segments in a set of genomes and extracting all unique segment start and end positions to generate a new, more granular set of CNV regions. CNV values are assigned each to these regions for each genome, based on the overlap of the new segments with the original CNV calls. The merging of segments generates an aggregated view of a set of CNVs, and effectively adds a "second dimension" to the genome resulting from a combined set of genomes, represented with a vector containing the original CNV values.
In the example in Figure 1B, the CNV calls from three configurations are broken into five merged segments MSn - MSr each of which is characterized with a vector of three values. We call these segments comparative CNV (cCNV) regions. In the example in Figure 1, two cCNV are formed (MSn-r and MSp-q) as a result of this step.
The creation of cCNV segments tends to "pad" the segments in one or more CNV estimates as the segments are merged to meet the outer boundaries of overlapping CNV segments. In our current approach, we focus on encompassing all bases in a shared CNV region. In a more conservative approach, one may consider a more restrictive method to prune the ends of cCNV segments and narrow the overlapping segment. In our approach, due to the tendency of FREEC to merge very large adjacent segments into very long segments (several million bases), we introduce a pre-processing step to fragment segments into 10,000-base segments and avoid superficial extension of cCNV segments into configurations where a small region was originally detected. In our example, this would remedy a case where the length of Sc is relatively small, and Sd very large. With the fragmentation step, Sd is split into k segments: Sd1, Sd2,... Sdk and only a subset of these overlapping with Sc are returned as an overlapping cCNV region for CNV2 and CNV3. The remaining Sdi are separately considered as a variation only in CNV3.
In a subsequent step, all cCNV segments are annotated for their overlap with the gene regions in refSeq and the CNV regions in the [5], referred to as 'Sanger' in the rest of the text. Based on the overlap, each cCNV region is assigned two values between 0 and 1 ranging between no overlap (0), partial overlap (between 0 and 1), and full overlap (1).
To obtain the number of bases in genic regions, we use the following formula: where for each cCNV region c, we scale its length given in number of bases with the fraction of the segment overlapping with the gene (refSeq_overlap). Non-genic regions are simply the inverse of genic regions: . Finally, to obtain the number of bases overlapping Sanger regions, we use: , where Sanger_overlap indicates the fraction of the segment overlapping with Sanger regions.
Comparison of configurations and estimating concordance
Figure 2 depicts our analysis flow which compares the results from the execution of FREEC and CNV-seq. After Alignment, the CNV step applies a selected FREEC or CNV-seq configuration resulting in CNV calls. The cCNV step merges all the output segments from the configurations. cCNV segments are characterized in the Overlap step to quantify the intersection with genic regions, and 8,599 validated segments from the Sanger set. In the last step, Concordance Assessment, we assess concordance between the three FREEC normalizations. We measure the similarity using the Jaccard index (JI) on any two sets of output comparing the sets at the level of individual base pairs. Given two configurations A and B, we use the following to compute Jaccard Index: . It computes the ratio of the total number of cCNV bases in the intersection of configurations A and B over total number of bases in the union of the of the two sets JI is reported for each genome and four CNV segment sets: all regions, genic, non-genic, and Sanger.
Variation profiles of cCNV
Additionally, for the output of FREEC, the number of bases falling under a range of copy number amplifications and deletions is reported for individual genomes and within segment sets. The output of this characterization is binned using the following breakpoints: less than 2 copies, 2 to 6 copies, and more than 6 copies.