Skip to main content

Copy number variations in the genome of the Qatari population

Abstract

Background

The populations of the Arabian Peninsula remain the least represented in public genetic databases, both in terms of single nucleotide variants and of larger genomic mutations. We present the first high-resolution copy number variation (CNV) map for a Gulf Arab population, using a hybrid approach that integrates array genotyping intensity data and next-generation sequencing reads to call CNVs in the Qatari population.

Methods

CNVs were detected in 97 unrelated Qatari individuals by running two calling algorithms on each of two primary datasets: high-resolution genotyping (Illumina Omni 2.5M) and high depth whole-genome sequencing (Illumina PE 100bp). The four call-sets were integrated to identify high confidence CNV regions, which were subsequently annotated for putative functional effect and compared to public databases of CNVs in other populations. The availability of genome sequence was leveraged to identify tagging SNPs in high LD with common deletions in this population, enabling their imputation from genotyping experiments in the future.

Results

Genotyping intensities and genome sequencing data from 97 Qataris were analyzed with four different algorithms and integrated to discover 16,660 high confidence CNV regions (CNVRs) in the total population, affecting ~28 Mb in the median Qatari genome. Up to 40 % of all CNVs affected genes, including novel CNVs affecting Mendelian disease genes, segregating at different frequencies in the 3 major Qatari subpopulations, including those with Bedouin, Persian/South Asian, and African ancestry. Consistent with high consanguinity levels in the Bedouin subpopulation, we found an increased burden for homozygous deletions in this group. In comparison to known CNVs in the comprehensive Database of Genomic Variants, we found that 5 % of all CNVRs in Qataris were completely novel, with an enrichment of CNVs affecting several known chromosomal disorder loci and genes known to regulate sugar metabolism and type 2 diabetes in the Qatari cohort. Finally, we leveraged the availability of genome sequence to find suitable tagging SNPs for common deletions in this population.

Conclusion

We combine four independently generated datasets from 97 individuals to study CNVs for the first time at high-resolution in a Gulf Arab population.

Background

The Qatari peninsula, located on the eastern coast of the Arabian Peninsula, is at a major crossroads of human migration [1]. This geographical location has led to several waves of settlement over the past millennia, creating unique ethnic ancestries that form the present day’s population. We previously described 3 major genetic subgroups of Qataris, including those of Bedouin (Q1), Persian-South Asian (Q2), and African ancestry (Q3) [24]. Although all three share a common environment, there is a significant level of segregation among the populations, with a higher level of consanguinity observed in Q1 and to a certain extent Q2, over Q3, leading to the formation of 3 distinguishable genetic pools. Exome sequencing has shown that each subpopulation has different predispositions to various Mendelian diseases, an observation with consequences for public health planning in the context of pre-existing premarital screening programs [5].

Whereas classic Mendelian disorders are generally defined by single nucleotide polymorphism’s (SNPs), there is increasing evidence that copy number variation (CNV) also play a significant role in disease [69]. CNVs in the 0.5 Kb to several Mb size range account for up to 4 % of the human genome, and generally behave like SNPs – having a detectable minor allele frequency, existing in linkage disequilibrium (LD) with neighboring SNPs on variable length haplotypes and being inherited in Mendelian fashion [7, 10]. Unlike SNPs, however, CNVs could be extremely multi-allelic (>3 segregating alleles), and may therefore contribute to continuous traits by causing a spectrum of variation in gene dosage [11]. Though common CNVs have traditionally shown relatively poor association with common disease [12], intermediate frequency (1-10 %) and rare (<1 %) but recurrent CNVs cause a variety of disorders, including extreme obesity, congenital heart disease, and a wide spectrum of neurological and developmental disorders [1324].

Because of the high degree of consanguinity in the Qatari population, it is likely that CNVs, alongside SNPs and indels, play a role in the inherited disease risk burden in this population [5, 10, 25]. Further, because of the cultural segregation of the major genetic groups (Q1, Q2, and Q3) within the Qatari population, it is likely that there are novel CNV Regions (CNVRs) specific to each group, representing distinct subpopulation histories and risk for disease. In the context of these considerations, we have used complete genome sequencing along with SNP microarray analysis of 97 Qataris (57 Q1, 20 Q2, 20 Q3) to assess the spectrum of CNVs and CNVRs in the Qatari population, representing the first high-resolution, genome-wide assessment of the burden of small to medium-sized chromosomal deletions and duplications in a Middle Eastern Arab population. This work on the CNV class of mutations is complementary to a study from our group describing ancestry, polymorphisms and disease susceptibility from the single nucleotide variant class of mutations from these same individuals [26].

Results

Detection of CNVs

To identify CNVs in the Qatari population, primary data was obtained from arrays and whole-genome sequencing sources, and then called and integrated as described in methods (Fig. 1). Briefly, CNV calls were first generated in the 100 individuals from Illumina’s Omni2.5 M array intensity data using both cnvPartition (Illumina’s proprietary Beadstudio plug-in) and QuantiSNP [27], and, separately, from NGS using cn.MOPS [28], in addition to CNV calls provided by Illumina from WGS data. Altogether, there were two primary datasets called by 2 independent algorithms each, and all 4 subsequently combined for final CNV calls as described below.

Fig. 1
figure1

CNV analysis strategy. CNV detection in Qataris was assessed at two tiers. First, CNVs were called in 100 individuals using two algorithms each, on two primary input datasets: genotyping array (OMNI2.5 M) and next-generation whole genome sequencing reads (Illumina PE 100 bp, Mean Depth: 37X). A size cut-off of at least 5 consecutive probes for genotyping data and at least 5 consecutive windows for whole genome sequencing data was used to increase specificity (see Methods). Three samples with an unusually high number of CNVs were first removed from the population (see Additional file 1: Figure S1). In the second step, high-quality CNVs from the remaining 97 subjects called by all 4 platforms were distributed into 97 individual files. CNVs were first compared intra-individuals and retained if observed by more than one algorithm. If no overlap was detected within the individual, the CNV was compared inter-individuals to detect a second occurrence in the remaining 97 individuals. CNVs observed only once in the entire sample were discarded. CNVs passing these filters were merged across the population to generate population level CNV regions (CNVRs), which were taken into the detailed analysis steps. *Denotes data was provided as-is from proprietary Illumina Genome Network sequencing pipeline without the ability of the user to alter parameters

Preliminary qualitative inspection of the raw distribution of all CNVs in the 100 individuals revealed 3 outlier samples with a large excess of CNV calls (Additional file 1: Figure S1). These individuals significantly skewed the average number of CNVs in the population (Additional file 1: Table S1) and were therefore removed from further consideration, yielding a cohort of 97 individuals in whom all subsequent analysis was conducted.

The 4 platforms initially identified a total of 536,889 CNVs from all 4 algorithms in the 97 individuals, including 119,236 putative deletions [copy number class (CN) 0 or 1] and 417,653 putative duplications (CN 3 or 4+; Table 1). The excess of duplications over deletions is largely a result of CNVs provided by Illumina’s proprietary WGS calls, which reported 314,656 duplications and 49,177 deletions, with no homozygous deletions (CN 0) called in the 97 samples.

Table 1 Copy Number Variations in the Qatari Populationa

In order to enhance specificity, we devised an approach to integrate CNV calls across all 4 platforms, requiring a CNV to be observed at least twice to be retained (Fig. 1). In this step, each of the 536,889 raw, high quality CNVs was first compared to all other CNVs detected by any of the four platforms within the same individual file, and those observed twice (detected by >1 algorithm in the same individual) were included in the ‘final’ variant file for that individual. All CNVs observed only once were then compared across all other individuals to look for a second occurrence in another individual. If found, that CNV was retained in the individual’s ‘final’ variant file, or otherwise discarded. This allowed for significant refinement of the list of CNVs in the population, eliminating all singleton CNV occurrences in the population – usually the most enriched for spurious calls. Using this approach, the average individual’s ‘final’ genome had 1824 high-confidence CNVs, comprising 120 homozygous deletions, 628 single-copy deletions, 801 single-copy duplications and 275 amplifications (Table 1).

During this filtration and integration process, all CNVs were concurrently curated to re-define breakpoints based on the source of CNVs. Briefly, whenever NGS-derived CNVs overlapped array-CNVs, we used the NGS’s higher-resolution breakpoints to define the start and/or end coordinate of the duplication or deletion. Wherever two CNVs detected from the same platform were observed to overlap, the narrower breakpoints were chosen, yielding a shorter, more conservative CNV call. After this curation, the ‘final’ CNV content in the average Qatari genome affected a total of 29.9 × 106 non-redundant bases (Table 1). This is slightly lower than previously published estimates [22] and may reflect the strict filtration and breakpoint definition thresholds applied in this study.

The median Qatari genome, based on the four different algorithms and platforms, contained 1815 high-resolution CNVs, covering an estimated total of 27,991,857 bp (Table 1). These were distributed into the 4 CN classes: CN–0 (homozygous deletions), 121 CNVs, affecting 1.1 Mb; CN–1 (single-allele deletions), 622 CNVs, affecting 5.8 Mb; CN–3 (single-allele duplication), 801, affecting 16.8 Mb; and CN–4+ (amplification), 271, affecting 4.1 Mb. The excess number and larger size of duplications could be explained by a higher proportion of duplications detected by our NGS-CNV callers (Table 1), and may reflect a combination of reads mapping to segmental duplications and the fact that we included all multi-allelic CNVs with more than 4 copies in the same amplification group [11]. Of 1815 CNVs, 1381 (76.1 %) were detected by NGS alone, 122 (6.7 %) were detected by array technology only, and 312 (17.2 %) were detected by both, suggesting that the two datasets may both be beneficial in representing the total CNV content within an individual, and relying on only one may not be sufficient to cover all variation (Additional file 1: Figure S2).

CNV distribution in Qatari subpopulations

The 97 individuals were examined in the context of the three Qatari ancestral subpopulations (57 Q1 – Bedouin ancestry, 20 Q2 – Persian ancestry, and 20 Q3 – African ancestry). In order to evaluate the accuracy of CNV calls, we initially used the CNVs detected across all 97 individuals and performed principle component analysis. This analysis separated individuals previously known to belong to Q1, Q2 and Q3 from genotyping data into their three respective subpopulations based on CNV sharing. The PCA plot showed some level of overlap between Q1 and Q2 clusters, which could be a result of admixture and our assignment of ethnicity based on only 65 % of 48 SNPs (Additional file 1: Figure S3 and details in methods), with Q3 (with the exception of 1 individual) being the most clearly distinct subpopulation. These results are similar to those obtained from a PCA plot using only SNPs, as published in [29]. The similarity of clustering using PCA on CNV and genotyping data in 97 Qataris is consistent with a previous report demonstrating that PCA analysis based on high quality CNVs yields similar clusters to one based on SNPs from the same individuals [30].

We then inspected the distribution of CNVs by frequency in each class per individual (Additional file 1: Figure S4), and observed that, on average, individuals from all three subpopulations had a similar range of CNVs in all four classes. However, in order to detect if the three genetic subpopulations may have differences in the distribution by number or size of CNVs in each CN class, probability curves were generated of CN number (Fig. 2a-d) and total size affected (Fig. 2e-h) within each CN class for each of the 3 subpopulations (as described in Methods). For CN class 0 (homozygous deletions), these occurred at a significantly higher frequency in Q1 and Q2 over Q3 (p = 1.8x10−6 and 1.2 × 10−4, respectively). However, this trend was reversed in amplifications (CN 4+), which were found at a higher rate in Q3 than either Q1 or Q2 (p = 1.5 × 10−5 and 0.006, respectively). These observations may reflect higher consanguinity rates in recent generations within Q1 and Q2, where enrichment in homozygous deletions (Fig. 2a) but depletion of amplifications vs Q3 (Fig. 2d) suggests that homozygous deletions are more harmful than multi-allelic, runaway duplications, and may therefore have been purged from Q3 by purifying selection over population history but only recently arisen in Q1 and Q2. This possibility is supported by two further observations. First, for single-copy deletions (CN 1), we observed a significantly higher number in Q3 (p = 3 × 10−7 and 1 × 10−7 vs Q1 and Q2, respectively) despite the depletion of homozygous deletions relative to the other two subpopulations, suggesting higher diversity and less consanguinity in recent generations among Q3 Qataris vs Q1 or Q2. Second, for Q1, we observe a slightly longer tail in the size of the genome affected by single copy deletions (Fig. 2f) despite reduced number of CNVs in that class compared to Q3, suggesting these alleles are larger in size and possibly more recent or more deleterious, causing this tail of large CNVs to be absent in the homozygous subset of CNVs in Q1 (Fig. 2e).

Fig. 2
figure2

Probability distributions of CNVs by frequency and size in each copy number class in 97 Qataris. Density curves showing the probability (y-axis) of a given individual from each of the 3 subpopulations having a certain number of CNVs (a-d) or a certain cumulative size of the genome affected by CNVs (e-h) in each copy number class (a, e. CN = 0; b, f. CN = 1; c, g. CN = 3; d, h. CN = 4+). All p -values are calculated using the ANOVA-Tukey method. Black trace – Q1, Blue trace - Q2, Red trace – Q3

Genomic impact of CNVRs in the genetic subpopulations

In order to evaluate the impact of duplications and deletions on each subpopulation individually, we first separately merged deletions and duplications within each group to detect subpopulation-specific CNV Regions (CNVRs). There were a total of 16,660 CNVRs in the 3 subpopulations; 12,709 (76.2 %) came from NGS data only, 1976 (11.9 %) from array only, and 1975 (11.9 %) from both platforms combined (Additional file 1: Figure S2B; see Additional file 1: Additional Data). When deletions and duplications at the same locus (polymorphic CNVRs) were combined, there were a total of 14,058 CNVRs, including 7092 deletions, 4885 duplications, and 2081 polymorphic CNVRs (Table 1).

In the Q1 subpopulation, there were a total of 5241 CNVRs of all CN classes, affecting 85.7 Mb of genomic content; in Q2, 4176 CNVRs affecting 65.8 Mb, and in Q3 4641 CNVRs affecting 65.8 Mb (Table 1). The excess number and cumulative size of CNVRs in Q1 is likely due to the ~3-fold higher number of individuals studied. As expected, the majority of CNVRs were sub-population specific, with 3624, 3242 and 3633 CNVRs at low-frequency (affecting 1 to 20 % of individuals) in Q1, Q2 and Q3 respectively, vs only 2657, 1715 and 1789 that were common (affecting >20 %).

Functional effect of CNV-affected genes in Q1, Q2 and Q3

In order to evaluate the functional effect of deletions and duplications separately on the entire population, the polymorphic CNVRs were separated into their respective CN classes (Table 2). In total, 16,660 CNVRs were observed in all four CN classes in the three subpopulations, including 6281 in Q1, 4957 in Q2 and 5422 in Q3. In all three subpopulations, ~39-40 % of all CNVRs were genic (2491 in Q1, 1995 in Q2 and 2085 in Q3), 4-5 % affected microRNA loci (229 in Q1, 183 in Q2 and 180 in Q3), 13-15 % affected promoter sites (831 in Q1, 647 in Q2 and 660 in Q3) and ~38-40 % affected transcription factor binding sites (2573 in Q1, 1879 in Q2 and 2065 in Q3). We focused on genic CNVs in subsequent analysis to determine the extent of CNV impact on genes and pathways and population burden for genetic disease.

Table 2 Functional Annotation of CNV Regions in the Qatari Genetic Subpopulationsa

Genic pathway enrichment

The genes affected by CNVRs in all Qataris were evaluated by standard pathway analysis against the KEGG pathway database using the DAVID bioinformatics suite [3133]. Among the top 15 pathways enriched for by genes affected by all CNVs in Qataris, we observed several of potential concern for public health (Table 3). These included genes involved in starch and sugar metabolism, in the insulin signaling pathway, and in type I and type II diabetes mellitus (Additional file 1: Figure S5A-E). Among these genes was the amylase enzyme AMY1, for which decreased copy number was previously shown to be associated with obesity [34]. Of interest, 47 of 97 individuals in the cohort had type 2 diabetes (27 Q1, 10 Q2 and 10 Q3), but there was no statistical enrichment for any of these CNVs in obese or diabetic individuals vs controls. This may be due to the low power in small sample size, combined with the possibility that individuals labeled as controls have yet to develop diabetes due to their young age at time of assessment (cohort average age 42 years, with >50 % <40 year). We also observed nominal enrichment in other medically relevant pathways, including drug metabolism and non-small cell lung cancer (Table 3). Together, these observations suggest that CNVs in this population may affect public health by contributing to the burden of chronic disease in the population and should be assessed systematically in a larger cohort to establish power and assess significance.

Table 3 Top 15 KEGG Pathways Enriched in Genes Affected by CNVs in Qatarisa

CNVs affecting Mendelian disease genes

In order to determine whether CNVs may also play a role in rare disease in Qataris, we compared all genes affected by CNVRs to the database of Online Mendelian Inheritance in Man (OMIM). In all three subpopulations, approximately 10 % of all genic CNVRs affected at least 1 OMIM gene (Table 2). The OMIM database contains a combination of disease causing genes, as well as disease-associated genes and genes affecting polymorphic traits. Because we were most interested in genes that have sufficient evidence of disease-causality from the literature, we re-annotated all CNV-encompassed OMIM genes based on their published role in causing disease, and then manually curated all putative OMIM-gene-containing CNVRs to determine the exact number of exons that were likely to be disrupted by each CNV (contained within the CNVs’ breakpoints).

The focus was on the subset of CNVRs most likely to have a functional impact on a gene. These include deletions affecting any number of exons and duplications that either encompass at least one entire gene (increased dosage) or are internal to the gene (possibly disrupting protein translation frame). We therefore eliminated from consideration all intronic events as well as duplications that were partially genic (one breakpoint extending past the first or last exon with the other breakpoint inside the gene). We then split the list of OMIM-affected genes into two groups: (1) genes in which CNVRs had been previously reported; and (2) genes affected by novel, Qatari-specific CNVRs. In the former group, we found a total of 46 disrupted disease-causing genes (13 in deletions and 33 in duplications) affected by 40 unique CNVR loci (13 deletions and 27 duplications) (Table 4). These CNVRs had variable distribution among the 3 subpopulations, with most being specific to one or two populations while only 9 were observed in all 3 subpopulations. Thus, the majority of Mendelian-disease-gene containing CNVRs is population-specific, and may predispose to disease due to high levels of intra-population mating. Further, though these CNVRs are marked as previously reported due to overlap with CNVRs in the database of genomic variants (DGV) [35], it is possible due to the variable breakpoints of CNVs deposited in the DGV that these Qatari CNVs affect different exons or occur at a higher frequency in this population than the rest of the world.

Table 4 Qatari Genetic Subpopulation-specific Distribution of Known CNV Regions Deletions Affecting Known Mendelian Disease Genesa

We also examined OMIM-gene-containing CNVRs that were novel to Qataris. To determine novelty here, Qatari CNVRs were compared to CNVRs reported in the 1000 Genomes Phase I [36] study that were detected through next-generation sequencing with high-resolution breakpoints. Only 14 Qatari CNVRs passed this filter, reflecting the high diversity of populations represented in the 1000 Genomes data. These CNVRs included 9 deletions and 5 duplications (Table 5). Five of these CNVRs were Qatari sub-population-specific, while nine were shared by 2 or more sub-populations. Of the shared CNVRs, there were four deletions – one of exon 47 in the Chediak-Higashi syndrome gene LYST (lysosomal trafficking regulator gene) observed in one Q1 and one Q2 individuals, one in the glutaric acidemia gene ETFDH (electron transfer flavoprotein dehydrogenase) in one Q1 and one Q2 individuals, one in exons 2 to 3 of the alpha-methyl acetoacetic aciduria gene ACAT1 (acetyl-coA acetyltransferase 1) in one Q1 and one Q3 individuals, and one in exons 1-7 of the Gitelman Syndrome gene solute carrier 12, family member 3 (SLC12A3) observed in one Q1 and two Q2 individuals. All of these disorders are autosomal recessive and these deletions putatively truncate the genes and therefore predispose these subpopulations to these diseases if present in homozygous state. Additionally, there was one disease-gene affecting CNVR that was present in 7 individuals from all three subpopulations (5 Q1, 1 Q2 and 1 Q3), a 3 kb internal duplication of exons 13-14 of PMS2 (post-meiotic segregation increased in S. cerevisiae 2), a gene in which mutations in both alleles are observed in patients with hereditary nonpolyposis cancer and mismatch repair cancer syndrome. Additionally, 3 other individuals (2 Q1, 1 Q2) had a smaller (2.7 kb) deletion affecting the same exons. In total, 10 individuals (10.3 % of the cohort) had a CNV not present in public databases that putatively disrupts PMS2. Of note, colorectal cancer is the second most common cancer in Qatari males and third most common in females [37]; whether this gene contributes to the burden of colorectal cancer in this population is currently not known.

Table 5 Novel Qatari-specific CNVRs Affecting OMIM Disease Genesa

Qatari CNVs affecting known disease cytobands

There has been substantial evidence implicating CNV mutations in a range of diseases, including obesity, congenital heart disease and a variety of neuropsychiatric disorders [1324, 31]. In particular, there is a growing body of literature suggesting rare but recurrent CNVs at several loci are responsible for a proportion of these diseases in sporadic cohorts [14, 18, 20, 21, 23, 38]. We sought to determine the burden of CNVs by chromosomal cytoband to detect any enrichment in Qatari Arabs over global cohorts. Because the database of genomic variants (DGV) contains >200,000 CNVRs from >200 studies [35] detected using a wide variety of low- and high-resolution platforms, we limited this comparison to CNVRs detected by an equally high-resolution platform (next-generation sequencing) in the diverse 1000 Genomes Project phase I study (1000Gp1) [36]. All CNVRs reported in the 1000Gp1 dataset and from our study in each of the 3 Qatari subpopulations were annotated by cytoband. Of 862 cytobands in the 24 human chromosomes, 769 contained CNVRs in the 1000 genomes samples; of these, 741 had CNVRs in Q1, 708 in Q2, and 735 in Q3. There were several cytobands observed in which unique CNVRs were observed at a much higher frequency (1.5 to 10 times more non-overlapping CNVRs per cytoband) in any one of the Qatari subpopulations than in the phase I data. Among the top 10 cytobands (Table 6) with the highest enrichment were several disease-associated hotspot loci, including: 1q21.1 (Q1 p = 1.6 × 10−16, Q2 p = 9.67 × 10−22 and Q3 p = 2.4 × 10−20), in which recurrent CNVs have been observed in patients with schizophrenia and congenital heart disease; 5q13.2 (p = 3.4 × 10−11, 8.2 × 10−14 and 1.8 × 10−11 in Q1, Q2 and Q3 respectively), a locus associated with neurological disorders and alcohol dependence, 16p11.2 (p = 7.1 × 10−9, 8.1 × 10−12; and 1.2 × 10−11 in Q1, Q2 and Q3 respectively), a locus highly associated with autism, schizophrenia and childhood obesity; and 1p36.33 (p = 2.87 × 10−7 in Q3 only), a locus associated with disorders of sexual development and obesity [3943]. CNVRs in these loci may contribute to the collective burden of these disorders in the Qatari population.

Table 6 Top 10 Cytobands in Which Qatari Genetic Subpopulations’ CNVRs were Observed at a Significantly Higher Frequency than in 1000 Genomes Phase I CNV Dataa

Determining tagging probes for Qatari CNVRs

The Illumina OMNI2.5 M array was developed as a global population-based genotyping tool to integrate variation down to minor allele frequency (MAF) >2.5 % from the diverse populations represented in the 1000 Genomes Project [36]. We first sought to estimate the utility of the Illumina OMNI 2.5 M array for studying the Qatari population. To assess the amount of “informative” SNPs on the OMNI2.5 M array in this population, we used genotypes from a total of 108 Qataris genotyped on this platform. The OMNI 2.5 M SNPs were first pruned for those that had a call in at least 90 % of the 108 individuals (i.e., at most 11 individuals with “no calls”), which resulted in 2,368,880 (>99.5 % of all SNPs). Surprisingly, of these, 412,839 SNPs (17.4 %) were monomorphic (MAF = 0 %) in 108 Qataris. Additionally, 676,116 (28.5 %) had a MAF of less than 1 % in Qataris, and 1,028,842 (43.4 %) have a minor allele frequency of less than 5 %. Therefore, in total less than 60 % of the SNPs on the OMNI 2.5 M array adequately sample common variants in Qataris (Additional file 1: Table S2).

Nevertheless, we attempted to assess whether a subset of these SNPs tag common CNVs in this population, which could be useful for imputation of Arab-population CNVs from genetic data using this or similar arrays in future studies. In order to increase specificity, we focused on 1193 common deletions (deleted allele observed at least 4 times) across the 97 individuals in this study, and investigated the pairwise correlation between the deletion CNV and all SNPs within 500 kb either side of the deletion’s breakpoints. As expected, while 1168 CNVs (98 %) had at least 1 SNP within 500 kb of either breakpoint, the majority of SNPs from the OMNI 2.5 M array neighboring CNVs did not adequately tag the deletion allele, with the majority of SNPs (~62 %) having a maximum r2 < 0.5 (Fig. 3a). In fact, only 318 of 1193 deletions were tagged by at least one SNP at a Pearson correlation of r2 > 0.7, of which only 195 (~16 %) were tagged at r2 > 0.9 (Additional file 1: Table S3). Further, of the 422 genic subset of CNVs within the 1193 deletions, only 35 % were tagged by an array SNPs with a correlation r2 > 0.5, and less than 7 % appeared in complete LD (r2 > 1.0). Therefore, the majority of deletions common to the Qatari population were poorly tagged by the high density OMNI 2.5 M array, which could pose significant challenges to using this or other commercial arrays to genotype CNVs in Arab populations.

Fig. 3
figure3

All SNPs within 500 kb of start and end breakpoints of 1,193 deletions were used to detect for each deletion a SNP with the maximum pairwise LD correlation. This was done both for a. all 1193 CNVs and b. for only 422 Genic CNVs. In both cases, the WGS SNVs significantly outperformed the OMNI2.5 M SNPs, especially at higher r2 values. WGS-SNVs: Whole genome sequencing detected variants (). OMNI2.5 M-SNPs: SNPs present on the OMNI2.5 M array (Ο)

In order to rectify this issue, we sought to determine a set of SNPs that could better tag these CNVs by relying on genotypes obtained from the whole genome sequencing of these 97 individuals (described in methods). All ~21 million high-quality variants detected in 97 individuals were first pruned for those within 500 kb upstream and downstream of each CNV breakpoint, and then LD measured between each CNV and all neighboring SNPs within this window. There was a highly significant improvement of up to 250 % for all CNVs and almost 1.5 times that (367 %) for genic CNVs at r2 = 1 (Fig. 3b and Additional file 1: Table S3). With whole genome sequencing SNPs, we observed >70 % of all deletions being tagged by at least one SNP at an r2 > 0.5, and over 50 % at r2 > 0.8, suggesting these could be imputed in future experiments from sequence data. In order to facilitate the design of new genotyping arrays that tag CNVs in this population, we include a list of deletion-tagging genotypes at SNPs tagging 806 CNVs at r2 > 0.5 (Additional file 1: Table S4). We also include this information at greater detail in the accompanying Additional file 2 containing the complete CNV dataset with all functional annotation in 97 Qataris.

Discussion

This study uses two primary datasets called by four separate algorithms to generate the first catalog of high-resolution copy number variants within Qataris, a population that shares significant genetic ancestry with the neighboring populations of the Arabian Gulf who remain under represented in public databases. Our analysis reveals several notable features of copy number changes in this region of the world. First, CNV distribution patterns among members of the different Qatari subpopulations is concordant with their cultural and demographic histories, where higher consanguinity in Q1 and Q2 populations has led to a significantly higher number of homozygous deletions vs the more diverse Q3 individuals. Notably, these deletions are larger in size, and may be useful for studies of the effect of gene-loss on individual fitness, similar to studies of loss of function mutations in large but seemingly healthy cohorts. Further, a large proportion of CNVRs across all 3 Qatari subpopulations affect coding or functional elements (with slight depletion in Q3 vs either Q1 or Q2), including known rare, severe disease genes and loci. Of particular interest is the increased burden of CNVs in certain KEGG pathways relevant to population health in Qatar and in the region, such as diabetes, insulin signaling and metabolism.

While all 97 individuals analyzed here are phenotypically “control” adults, the possibility that the CNVs they carry might be associated with disease cannot be ruled out. CNVs have been shown to play a role both in population diversity and in pathophysiology, where increased or decreased gene dosage may be responsible for human phenotypic variability as well as complex behavioral traits and disease (reviewed in [17, 4446]). Indeed, we observe a significantly higher burden of CNVs in cytobands linked to known, rare syndromic disorders. Thus, the Qatari population could be at risk for any of these CNV-related disorders under a two-hit model, where incomplete penetrance and variable expressivity may depend on a second insult, possibly a point mutation, at an interacting locus [47]. Further, we found a number of CNVs that were shared within the Qatari population (but novel to public databases) affecting severe Mendelian disease causing genes, including recurrent hits (both deletions and duplications) in PMS2, a gene that is often mutated in colon cancer, the second most prevalent cancer in Qatar [37]. These observations will be informative in the design of the next generation of clinical copy number arrays for use with this population.

A distinctive feature of this study is in combining high-resolution CNV calls from whole genome sequencing with the traditional CNV calls from genotyping array data, which demonstrated a non-trivial number of CNVs detected exclusively by only one platform. This held true despite high specificity thresholds implemented to reduce spurious calls, suggesting that relying on only one of these platforms risks eliminating a substantial amount of the underlying variation. This hybrid strategy uncovered approximately 1815 CNVs of size range 2.5 kb to 2 Mb in the median Qatari individual, affecting a total of ~28 Mb of genomic DNA. This is slightly lower than previously published high-resolution studies that estimate CNVs affect up to 40 Mb of genomic DNA [10, 25], and may be due to a combination of our higher minimum size thresholds (2.5 kb vs 500 bp used in other studies) and the strict specificity criteria we employed, which eliminated all singletons. Nevertheless, we observed that up to 40 % of CNVRs affected genic content, suggesting these could contribute to key traits yet to be fully understood in the general population.

As an additional analysis, we determined the usefulness of the OMNI2.5 M array to genotype CNVs in this population. The performance of this array was relatively poor, with <35 % of CNVs tagged at r2 > 0.5. We decided instead to leverage the genotypes obtained from whole genome sequencing of these same individuals and found substantial improvements to discovering tagging SNPs for these population-level deletions (included in the whole CNV dataset released as Additional Data accompanying this study). These SNPs could be included in future arrays designed to genotype Arab cohorts, and used to impute these deletions in Qatari or ancestrally similar Gulf Arabs.

This study, with its small population size, is underpowered to discover any significant effect of CNVRs on common disease in the population. However, it provides the first step in creating a database of Qatari-specific CNVRs that sets the landscape for future research to assess rare and common CNVs in a much larger Qatari cohort. We suspect that the Qatari population’s unique burden of rare and chronic disease will provide a strong platform for discovery of functional CNVs in future studies. This is especially critical, for example, in the burden of deletions disrupting known disease genes (e.g. those in OMIM), which may appear in homozygous state in the highly consanguineous Qatari population. Indeed, of particular interest for human disease studies is the significantly higher proportion of homozygous deletions in Q1s and Q2s, which could shed light on the subset of the genome that is “dispensable” for normal human development into adulthood due to loss of gene function [4850]. In this study, we find 200 genes that are affected by homozygous deletions, yet appear to cause no overt phenotypic abnormalities in this group of 97 seemingly healthy adults.

In recent years, each of genotyping arrays, array-CGH microarrays, and next-generation sequencing have all been implemented successfully to detect structural variation in human cohorts (reviewed in [45]). While the majority of studies continue with HapMap cohorts, there has been a recent increase in the number of reports from non-HapMap populations, the majority of which are still ancestrally similar to pre-existing HapMap populations (e.g., East Asian populations) [5153]. To date, however, there has been no systematic characterization of CNVs in Gulf Arabs, a population witnessing a growth in clinical genomics testing but whose ancestry is not adequately represented in current HapMap and 1000 Genomes populations. We present this study as the first systematic approach that makes available data on a large number of CNVs detected from both genotyping and sequencing platforms in this previously under-explored population, and believe the frequency of CNVs reported in this study could be used to assess the pathogenicity of clinically detected CNVs in Gulf Arab patients that appear novel to public databases. Thus, this high-resolution map of CNVs in Qataris is likely representative of ethnically similar Gulf-Arabs, and the accompanying database will inform future studies with larger cohorts aimed at understanding the unique population history in this region and the interpretation and analysis of clinically-observed CNVs in patients.

Conclusions

Overall, there is little doubt that some proportion of disease burden in certain patient cohorts can be explained by CNVs [17, 4446]. As detection methods continue to improve, so will the discovery of new genes and loci whose deletion or duplication could lead to morbid clinical phenotypes in specific populations. This kind of population specificity will have detrimental effects on the design of population-specific clinical CHG-arrays, and on the interpretation and assignment of pathogenicity of CNV findings in individuals from different ethnic populations [25, 5154]. This issue is of special relevance to Arabs in general and Qataris specifically, for whom there is a paucity of databases of CNVs observed in healthy controls.

In this study, we generate a highly dense catalog of 16,660 CNVRs in a cohort of 97 Qataris, by employing strict QC criteria for detection of CNVs and by integrating data from both SNP arrays and NGS technologies to achieve high-resolution breakpoint annotation for most variants. We find that ~40 % of these CNVRs affect genic and functional content, and that >5 % of all CNVs discovered are novel when compared to the well-populated Database of Genomic Variants [35], and thus represent polymorphisms in this population that may be misinterpreted as disease-causing in clinical cases in the absence of population-matched controls. Interestingly, despite the small sample size, we observe genes affected by these CNVRs that function in biological pathways relevant to population health, including Mendelian disease-causing genes reported in OMIM.genes involved in diabetes and carbohydrate metabolism, genes implicated in cancer, as well as homozygous deletions affecting up to 200 unique genes in seemingly healthy individuals. These are likely underestimates resulting from our strict quality filtration criteria (e.g., eliminating all true singletons and annotating breakpoints conservatively whenever two CNVs overlapped). Nevertheless, these findings pave the way for follow-up studies in larger cohorts with sufficient power and phenotypic information to determine their true contribution to disease burden in this population. Finally, we leverage the genotypes obtained from next-generation sequencing to identify SNPs that are at high linkage disequilibrium with sub-population-specific deletions, allowing their imputation in future studies of matched cohorts. This study therefore complements pre-existing and future next-generation sequencing work in Qataris, and presents findings which improve our understanding of CNV prevalence as a class of previously underexplored variation in this population, and their contribution to disease in Qataris, and possibly in ethnically similar Gulf Arabs.

Methods

Study population sample preparation and sequencing

In order to catalog CNVs in the Qatari population, the genomes of 108 Qataris were deeply sequenced (mean depth 37X) on the Illumina platform. Human subjects were recruited and written informed consent was obtained at Hamad Medical Corporation (HMC) and HMC Primary Health Care Centers in Doha, Qatar under protocols approved by the Joint Institutional Review Boards of Hamad Medical Corporation and Weill Cornell Medical College in Qatar (protocol # 13-00063). The population of Qatar includes over 2 million inhabitants, comprised of approximately 300,000 nationals with roots in Qatar predating the discovery of oil and gas, and establishment of an independent nation in 1970 and the over 1.7 million immigrants who mostly arrived in the past decade [55]. As selection criteria for this study, we required that subjects be third generation Qataris where all ancestors were Qatari citizens born in Qatar, as assessed by questionnaires.

Cohort selection criteria

Qataris with three or more generations of ancestry can be divided into 3 genetic subpopulations that reflect the historical migration patterns in the region: Q1 (Bedouin), Q2 (Persian-South Asian) and Q3 (Sub-Saharan African) [24]. Proportions of Q1, Q2, and Q3 ancestry were determined for each Qatari based on TaqMan genotypes (Life Technologies, Carlsbad, CA) for a panel of 48 ancestry informative SNPs [4]. A STRUCTURE analysis of population structure with k = 3 applied to the genotypes was used to estimate the proportion of ancestry in each of 3 groups (Q1, Q2, Q3). 100 individuals >65 % Q1, Q2 or Q3 were selected for genome sequencing. An additional 8 admixed Qataris who could not be cleanly placed in one of the three groups were also selected for sequencing.

Next-generation sequencing

Next-generation sequencing was conducted using a sequencing library preparation method that eliminates the need for size selection after shearing and PCR amplification before ligation of sequencing adapters. Sequencing was conducted at the Illumina Genome Services sequencing facility using the HiSeq 2500. Sufficient paired-end 100 bp reads were generated in order to produce a median of 112 GB of sequence data passing filters and aligned to the hg19/GRCh37 human reference genome with a median insert size of 301 bp, where at least 85 % of bases ≥ Q30, passed filtering steps and were aligned.

Single nucleotide variant calling

In order to identify SNPs in LD with CNVs in Qatar, SNP genotypes were generated for each Qatari genome. The Illumina Genome Network generated variant calls for the autosomal chromosomes for each of the 108 Qatari genomes using the ELAND/CASAVA v1.9 pipeline. In order to maximize confidence in the observed variants, sequence reads were re-mapped and genotypes were re-called using an in-house population genotyping pipeline [5]. Reads were realigned to the 1000 Genomes Project version of the hg19/GRCh37 human reference genome using BWA 0.5.9 [56] (maximum insert size 3 kb), and mapped reads were prepared for variant calling using GATK best practices, including PCR duplicate removal using SAMTOOLS [57], producing an average of 37x depth in autosomal chromosomes, with a mean of 98 % of mappable sites covered per genome. In order to maximize computational efficiency and integrated call set quality, calling for the autosomes was restricted to the biallelic SNPs in the combined set of 21 million segregating autosomal SNPs observed at least once in 108 Qatari genomes by the CASAVA pipeline. SNP genotypes were called for all 108 Qatari genomes using GATK [58] as described in the “Best practices for variant detection v3” Broad Institute pipeline [http://www.broadinstitute.org/gatk/]. Genotypes for each Qatari were also generated using the Illumina OMNI2.5 M array using Illumina’s GenomeStudio application.

Copy number variant calling and analysis

Copy number was estimated from two independent primary data sets generated from the same set of 100 samples: (1) array hybridization on Illumina’s OMNI2.5 M platform; and (2) whole genome sequencing. See Additional file 1: Supplemental Methods for details on data preparation and quality control.

CNVs called from genotyping and Next Generation Sequencing (NGS) data

To comprehensively assess the copy number content of the 100 genomes, CNVs were assessed using four separate algorithms, two from genotyping and two from NGS data (Fig. 1). To identify CNVs from SNP genotype intensities, two calling platforms were used: Illumina GenomeStudio’s proprietary cnvPartition software, and QuantiSNP v2.0 [27], reviewed in Pinto et al. [59]. To identify CNVs from NGS, CNV calls were provided by Illumina Genome Network, and also independently called using cn.MOPS, which is suited for multi-sample CNV detection in population-level data [28]. Preliminary QC and minimum specificity thresholds for CNV calls are described in greater detail in Additional file 1: Supplementary Methods. In total, 97 samples passed the strict quality thresholds and were used for all downstream analyses.

Generating probability density curves of CNV distributions in the 3 subpopulations

In order to generate a probability density curve for each of the three subpopulations, we first plotted a histogram with the number of individuals having a number of CNVs or cumulative size of CNVs within a given bin. Then, using the “density” package in R (by calling the function density(x), where x is the CNV frequency or size bin), the empirical histogram was transformed into a curve that best fits the shape of the data, and accurately approximates the probability that a given individual has a certain number or size of CNVs. The empirical data is therefore transformed into a probability density curve, in a manner similar to the results obtained using the histogram function (hist(x,probability = TRUE) for getting a probability histogram and then fitting a curve to it.

Merging population-level CNV regions

To determine population-level CNV Regions (CNVRs), the population of 97 Qataris were first separated into their respective subpopulations (57 Q1, 20 Q2, and 20 Q3) using a panel of 48 SNPs previously as described [24]. The CNVs from individuals within each subpopulation were then merged separately to generate subpopulation-level CNVRs. Details of merging criteria and breakpoint assignment are found in Additional file 1: Supplementary Methods.

CNVR annotation and data analysis

Annotating CNVR features

Subpopulation-specific CNVRs were annotated using Anntools (version 1.1), in which the following genomic features were annotated against genome build 37 (hg19): cytobands, genes (and affected exons), microRNAs, transcription factor binding sites and promoter sites. All genes were also custom-annotated against the Online Mendelian Inheritance in Man (OMIM) database and manually curated by critically evaluating the primary literature to differentiate disease-causing genes from disease associated genes and polymorphisms. Disease-causing genes were further annotated manually to check exactly which exons were encompassed within each CNV’s breakpoints.

Determining CNVR novelty

CNVR novelty was assessed by comparison to the database of genomic variants (DGV). In order to compare CNVRs discovered in Qataris with those in the DGV, DGV CNVRs were first pruned to keep only high-resolution studies, matching or exceeding the resolution of CNVs discovered in this study (2.5 M SNP arrays or next-generation sequencing). This allowed for higher confidence in breakpoint assignment and in deciding whether a CNV in Qataris was observed previously or not.

CNVR cytoband enrichment

For assessment of cytoband enrichment, CNVRs in each Qatari subpopulation were compared to CNVRs detected from the 1000 Genomes Phase I study, representing a very high-resolution dataset. CNVRs from that study were first annotated by the cytoband in which they occurred, and then a basic count was done for the number of unique CNVRs in each cytoband. Enrichment in the number of CNVRs affecting a given cytoband in Qataris vs in the 1000Genomes dataset was calculated by Fisher’s test.

CNVR tagging analysis

In order to determine which SNPs on the OMNI 2.5 M array adequately tag CNVRs observed in the Qatari population, we selected CNV deletions that were observed at least 4 times in the population of 97 (MAF ~2 %). The CNV genotypes were converted to binary format along with all genotypes for use by PLINK [60]. The PLINK-pairwise-ld command was used to calculate Pearson’s correlation between the genotype of each CNV and all neighboring SNPs ± 500 kb of the CNV’s breakpoints across 97 individuals. For each CNV, the closest SNP with the highest r2 value within the 500 kb window was identified.

Similarly, tagging SNPs from the whole genome sequencing data were identified for these same deletions. Whole genome sequencing data from all 97 individuals were obtained from Illumina PE 100 Sequencing (as described above). These were pruned for variants within 500 kb upstream and downstream of each CNV, and then plink was used as before to determine SNPs with the highest Pearson correlation (maximum r2 value) within these windows.

Availability of supporting data

The data sets supporting the results of this article are included within the article and its additional files.

Abbreviations

Array-CGH:

Array-comparative genome hybridization

CN:

Copy number class

CNV:

Copy number variant

CNVR:

Copy number variable region

LD:

Linkage disequilibrium

MAF:

Minor allele frequency

NGS:

Next-generation sequencing

PCA:

Principle component analysis

SNP:

Single nucleotide polymorphism

Q1:

2, 3, Qatari subpopulation 1, 2 and 3

WGS:

Whole genome sequencing

1000G:

One thousand genomes project

DAVID:

Database for annotation, visualization and annotated discovery

DGV:

Database of genomic variants

HapMap:

International haplotype map project

KEGG:

Kyoto encyclopedia of genes and genomes

OMIM:

Online mendelian inheritance in man

References

  1. 1.

    Oppenheimer S. Out-of-Africa, the peopling of continents and islands: tracing uniparental gene trees across the map. Philos Trans R Soc Lond B Biol Sci. 2012;367:770–84.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  2. 2.

    Hunter-Zinck H, Musharoff S, Salit J, Al-Ali KA, Chouchane L, Gohar A, et al. Population genetic structure of the people of Qatar. Am J Hum Genet. 2010;87:17–25.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  3. 3.

    Omberg L, Salit J, Hackett N, Fuller J, Matthew R, Chouchane L, et al. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations. BMC Genet. 2012;13:49.

    PubMed Central  Article  PubMed  Google Scholar 

  4. 4.

    Rodriguez-Flores JL, Fuller J, Hackett NR, Salit J, Malek JA, Al-Dous E, et al. Exome sequencing of only seven qataris identifies potentially deleterious variants in the qatari population. PLoS One. 2012;7:e47614.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  5. 5.

    Rodriguez-Flores JL, Fakhro K, Hackett NR, Salit J, Fuller J, Gosto-Perez F, et al. Exome sequencing identifies potential risk variants for Mendelian disorders at high prevalence in Qatar. Hum Mutat. 2014;35:105–16.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  6. 6.

    Feuk L, Marshall CR, Wintle RF, Scherer SW. Structural variants: changing the landscape of chromosomes and design of disease studies. Hum Mol Genet. 2006;15 Spec No 1:R57–66.

    Article  PubMed  Google Scholar 

  7. 7.

    McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat Genet. 2007;39:S37–42.

    CAS  Article  PubMed  Google Scholar 

  8. 8.

    Scherer SW, Lee C, Birney E, Altshuler DM, Eichler EE, Carter NP, et al. Challenges and standards in integrating surveys of structural variation. Nat Genet. 2007;39:S7–15.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  9. 9.

    Zhang YB, Li X, Zhang F, Wang DM, Yu J. A preliminary study of copy number variation in Tibetans. PLoS One. 2012;7:e41768.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  10. 10.

    Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, et al. Origins and functional impact of copy number variation in the human genome. Nature. 2010;464:704–12.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  11. 11.

    Handsaker RE, Van DV, Berman JR, Genovese G, Kashin S, Boettger LM, et al. Large multiallelic copy number variations in humans. Nat Genet. 2015;47:296–303.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  12. 12.

    Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature. 2010;464:713–20.

    CAS  Article  PubMed  Google Scholar 

  13. 13.

    Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, et al. The influence of CCL3L1 gene-containing segmental duplications on HIV-1/AIDS susceptibility. Science. 2005;307:1434–40.

    CAS  Article  PubMed  Google Scholar 

  14. 14.

    Aitman TJ, Dong R, Vyse TJ, Norsworthy PJ, Johnson MD, Smith J, et al. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans. Nature. 2006;439:851–5.

    CAS  Article  PubMed  Google Scholar 

  15. 15.

    Carson AR, Feuk L, Mohammed M, Scherer SW. Strategies for the detection of copy number and other structural variants in the human genome. Hum Genomics. 2006;2:403–14.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  16. 16.

    Fellermann K, Stange DE, Schaeffeler E, Schmalzl H, Wehkamp J, Bevins CL, et al. A chromosome 8 gene-cluster polymorphism with low human beta-defensin 2 gene copy number predisposes to Crohn disease of the colon. Am J Hum Genet. 2006;79:439–48.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  17. 17.

    Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet. 2006;7:85–97.

    CAS  Article  PubMed  Google Scholar 

  18. 18.

    Sebat J, Lakshmi B, Malhotra D, Troge J, Lese-Martin C, Walsh T, et al. Strong association of de novo copy number mutations with autism. Science. 2007;316:445–9.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  19. 19.

    Sharp AJ, Hansen S, Selzer RR, Cheng Z, Regan R, Hurst JA, et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat Genet. 2006;38:1038–42.

    CAS  Article  PubMed  Google Scholar 

  20. 20.

    Walsh T, McClellan JM, McCarthy SE, Addington AM, Pierce SB, Cooper GM, et al. Rare structural variants disrupt multiple genes in neurodevelopmental pathways in schizophrenia. Science. 2008;320:539–43.

    CAS  Article  PubMed  Google Scholar 

  21. 21.

    International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature. 2008;455:237–41.

    Article  Google Scholar 

  22. 22.

    Pang AW, Macdonald JR, Pinto D, Wei J, Rafiq MA, Conrad DF, et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biol. 2010;11:R52.

    PubMed Central  Article  PubMed  Google Scholar 

  23. 23.

    Fakhro KA, Choi M, Ware SM, Belmont JW, Towbin JA, Lifton RP, et al. Rare copy number variations in congenital heart disease patients identify unique genes in left-right patterning. Proc Natl Acad Sci U S A. 2011;108:2915–20.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  24. 24.

    Coe BP, Witherspoon K, Rosenfeld JA, van Bon BW, Vulto-van Silfhout AT, Bosco P, et al. Refining analyses of copy number variation identifies specific genes associated with developmental delay. Nat Genet. 2014;46:1063–71.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  25. 25.

    Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.

    Article  PubMed  Google Scholar 

  26. 26.

    Rodriguez-Flores JL, Fakhro K, Robay A, Mezey JG, Crystal RG. The Qatari genome: novel, rare and population specific genetic variation in the Middle East of potential relevance to human disease. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory 2014 Meeting - The Biology of Genomes. 2014.

  27. 27.

    Colella S, Yau C, Taylor JM, Mirza G, Butler H, Clouston P, et al. QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–25.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  28. 28.

    Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, et al. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012;40:e69.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  29. 29.

    Rodriguez-Flores JL, Fakhro K, Hackett NR, Salit J, Fuller J, Malek JA, et al. Deep-coverage whole genome sequencing of 100 Qatari reveals elevated prevalence of population-specific variants linked to disease. San Francisco, CA: Presented at the 62nd Annual Meeting of the American Society of Human Genetics; 2012.

    Google Scholar 

  30. 30.

    Wang C, Szpiech ZA, Degnan JH, Jakobsson M, Pemberton TJ, Hardy JA, et al. Comparing spatial maps of human population-genetic variation using Procrustes analysis. Stat Appl Genet Mol Biol. 2010;9:13.

    CAS  Google Scholar 

  31. 31.

    Dennis Jr G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, et al. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol. 2003;4:3.

    Article  Google Scholar 

  32. 32.

    Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57.

    CAS  Article  Google Scholar 

  33. 33.

    Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40:D109–14.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  34. 34.

    Falchi M, El-Sayed Moustafa JS, Takousis P, Pesce F, Bonnefond A, Andersson-Assarsson JC, et al. Low copy number of the salivary amylase gene predisposes to obesity. Nat Genet. 2014;46:492–7.

    CAS  Article  PubMed  Google Scholar 

  35. 35.

    Macdonald JR, Ziman R, Yuen RK, Feuk L, Scherer SW. The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 2014;42:D986–92.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  36. 36.

    1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–73.

    Article  Google Scholar 

  37. 37.

    Gulf Center for Cancer Control and Prevention Ten-Year Cancer Incidence Among Nationals of the GCC States 1998-2007. http://www.moh.gov.bh/pdf/publications/GCC%20Cancer%20Incidence%202011.pdf; [last accessed 3/25/2015]

  38. 38.

    Bochukova EG, Huang N, Keogh J, Henning E, Purmann C, Blaszczyk K, et al. Large, rare chromosomal deletions associated with severe early-onset obesity. Nature. 2010;463:666–70.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  39. 39.

    Dolcetti A, Silversides CK, Marshall CR, Lionel AC, Stavropoulos DJ, Scherer SW, et al. 1q21.1 Microduplication expression in adults. Genet Med. 2013;15:282–9.

    Article  PubMed  Google Scholar 

  40. 40.

    Horev G, Ellegood J, Lerch JP, Son YE, Muthuswamy L, Vogel H, et al. Dosage-dependent phenotypes in models of 16p11.2 lesions found in autism. Proc Natl Acad Sci U S A. 2011;108:17076–81.

    PubMed Central  Article  PubMed  Google Scholar 

  41. 41.

    Lin P, Hartz SM, Wang JC, Agrawal A, Zhang TX, McKenna N, et al. Copy number variations in 6q14.1 and 5q13.2 are associated with alcohol dependence. Alcohol Clin Exp Res. 2012;36:1512–8.

    PubMed Central  Article  PubMed  Google Scholar 

  42. 42.

    McCarthy SE, Makarov V, Kirov G, Addington AM, McClellan J, Yoon S, et al. Microduplications of 16p11.2 are associated with schizophrenia. Nat Genet. 2009;41:1223–7.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  43. 43.

    Tannour-Louet M, Han S, Corbett ST, Louet JF, Yatsenko S, Meyers L, et al. Identification of de novo copy number variants associated with human disorders of sexual development. PLoS One. 2010;5:e15392.

    PubMed Central  Article  PubMed  Google Scholar 

  44. 44.

    Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–55.

    CAS  Article  PubMed  Google Scholar 

  45. 45.

    Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12:363–76.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  46. 46.

    Zarrei M, Macdonald JR, Merico D, Scherer SW. A copy number variation map of the human genome. Nat Rev Genet. 2015;16:172–83.

    CAS  Article  PubMed  Google Scholar 

  47. 47.

    Girirajan S, Rosenfeld JA, Cooper GM, Antonacci F, Siswara P, Itsara A, et al. A recurrent 16p12.1 microdeletion supports a two-hit model for severe developmental delay. Nat Genet. 2010;42:203–9.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  48. 48.

    Macarthur DG, Tyler-Smith C. Loss-of-function variants in the genomes of healthy humans. Hum Mol Genet. 2010;19:R125–30.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  49. 49.

    Macarthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science. 2012;335:823–8.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  50. 50.

    Yngvadottir B, Xue Y, Searle S, Hunt S, Delgado M, Morrison J, et al. A genome-wide survey of the prevalence and evolutionary forces acting on human nonsense SNPs. Am J Hum Genet. 2009;84:224–34.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  51. 51.

    Moon S, Jung KS, Kim YJ, Hwang MY, Han K, Lee JY, et al. KGVDB: a population-based genomic map of CNVs tagged by SNPs in Koreans. Bioinformatics. 2013;29:1481–3.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  52. 52.

    Park H, Kim JI, Ju YS, Gokcumen O, Mills RE, Kim S, et al. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nat Genet. 2010;42:400–5.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  53. 53.

    Xu H, Poh WT, Sim X, Ong RT, Suo C, Tay WT, et al. SgD-CNV, a database for common and rare copy number variants in three Asian populations. Hum Mutat. 2011;32:1341–9.

    CAS  Article  PubMed  Google Scholar 

  54. 54.

    Gautam P, Jha P, Kumar D, Tyagi S, Varma B, Dash D, et al. Spectrum of large copy number variations in 26 diverse Indian populations: potential involvement in phenotypic diversity. Hum Genet. 2012;131:131–43.

    Article  PubMed  Google Scholar 

  55. 55.

    Qatar Statistics Authority. Results of the 2010 Census of Population, Housing and Establishments. http://www.qsa.gov.qa/QatarCensus/. Accessed October 19 2015.

  56. 56.

    Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  57. 57.

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.

    PubMed Central  Article  PubMed  Google Scholar 

  58. 58.

    DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43:491–8.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  59. 59.

    Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, et al. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol. 2011;29:512–20.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  60. 60.

    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559–75.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

Download references

Acknowledgments

We thank the 1000 Genomes Project for helpful advice on analysis methods; Y Strulovici-Barel, A Al Shakaki, OM Chidiac, R Mathew, and the WCMC-Q Bioinformatics Core, which is supported by the Biomedical Research Program funds at Weill Cornell Medical College in Qatar, a program funded by the Qatar Foundation, for help with the study; and N Mohamed for help in preparing this manuscript. These studies were supported, in part, by Qatar National Research Fund, NPRP 09-741-3-193; NPRP 09-740-3-192 and 5-436-3-116; The Qatar Foundation and the Weill Cornell Medical College in Qatar. JLR-F was supported, in part, by T32HL094284.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Ronald G. Crystal.

Additional information

Competing interests

The authors declared that they have no competing interests.

Authors’ contributions

KAF conceived the study, managed and prepared all primary data obtained from Illumina Genome Network sequencing and genotyping; carried out data analysis, wrote the manuscript; NAY conceived the study; carried out all data analysis; edited the manuscript for content; JLR-F conceived the study; managed and prepared all primary data obtained from Illumina Genome Network sequencing and genotyping; wrote the manuscript, wrote the manuscript; AR managed cohort recruitment, consent and clinical phenotyping; MRS managed and prepared all primary data obtained from Illumina Genome Network sequencing and genotyping; FA-P managed and prepared all primary data obtained from Illumina Genome Network sequencing and genotyping; JS managed and prepared all primary data obtained from Illumina Genome Network sequencing and genotyping; JAM provided genomic and bioinformatics core support for data analysis; KS provided genomic and bioinformatics core support for data analysis; AJ managed cohort recruitment, consent and clinical phenotyping; MZ managed cohort recruitment, consent and clinical phenotyping; DS managed cohort recruitment, consent and clinical phenotyping; JGM conceived the study, wrote the manuscript; RGC conceived the study, overall supervision and responsibility for the study, wrote the manuscript. All authors read and approved the final manuscript.

Khalid A. Fakhro and Noha A. Yousri contributed equally to this work.

Additional files

Additional file 1:

Additional Methods, Tables, Figures and Legends. (PDF 1401 kb)

Additional file 2:

Complete dataset (with full annotation) of all CNV regions detected in 97 Qataris. (XLSX 3202 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Fakhro, K.A., Yousri, N.A., Rodriguez-Flores, J.L. et al. Copy number variations in the genome of the Qatari population. BMC Genomics 16, 834 (2015). https://doi.org/10.1186/s12864-015-1991-5

Download citation

Keywords

  • Copy number variation
  • Next-generation sequencing
  • Genotyping
  • Genomics
  • Mendelian disease
  • Qatar