Genomic structural variants, and CNVs in particular, are thought to play an important role in phenotypic variation and in the development of many complex diseases. In the last few years, several calling algorithms have been developed to identify CNVs at the whole-genome scale using the same SNP chips used to perform GWAS. However, studies that have evaluated the available tools have concluded that they lack sensitivity, leading to a large number of false-negative calls [14–16]. Although PennCNV was found to perform best in previous comparisons, here we demonstrate its lack of sensitivity in a particular scenario. In the well-characterized GSTM1 region, PennCNV did not detect any deletion in a large sample of bladder cancer cases and controls in which the homozygous deletion was known, from Taqman and MLPA assays, to have a frequency of 50%. Because PennCNV was designed to identify unknown CNV regions, we also applied the cnvHap algorithm, which was designed to genotype known CNV regions. As expected, cnvHap did not detect any deletion in the GSTM1 region in our sample either. Notably, had we relied on CNV calls derived from the Illumina 1M platform, GSTM1 would never have been associated with bladder cancer. However, when individual probe LRR values are compared between cases and controls, the association can be detected, with results similar to those obtained with Taqman or MLPA. This observation clearly shows that PennCNV lacks sensitivity to detect CNVs in the GSTM1 region.
A possible explanation for the lack of sensitivity of PennCNV (and cnvHap) is the high frequency of the GSTM1 deletion in the study population. CNV calling relies on the LRR, which depends on both the observed (Robs) and the expected (Rexp) R values. Rexp is determined from the genotype clusters. For GSTM1, where the homozygous deletion is very frequent, these clusters include a large number of subjects with a homozygous deletion (GSTM1-null genotype). Thus, Robs and Rexp are expected to be similar in a GSTM1-null individual and, accordingly, the LRR value is close to 0. The normalization process could also play a role: it aims to find three genotype clusters, which is not possible at the GSTM1 locus because the BAF of homozygously deleted samples is uniformly distributed between 0 and 1, so normalization also affects the intensity values.
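The effect of a frequent deletion on the LRR can be illustrated with a minimal sketch. It uses the standard definition LRR = log2(Robs/Rexp) and, as a simplifying assumption, takes Rexp to be the median intensity of a reference sample rather than a value derived from genotype clusters. The intensity values and deletion frequencies below are hypothetical and chosen only for illustration.

```python
import math
import statistics


def log_r_ratio(r_obs, r_exp):
    """Standard Log R Ratio: log2 of observed over expected intensity."""
    return math.log2(r_obs / r_exp)


def expected_r(reference_intensities):
    """Simplified stand-in for Rexp (assumption): the median intensity of
    the reference sample, not the cluster-based value used in practice."""
    return statistics.median(reference_intensities)


# Hypothetical total intensities for a two-copy and a GSTM1-null sample.
NORMAL_R = 1.0
NULL_R = 0.2

# Reference in which the deletion is rare (5% null samples).
rare_reference = [NORMAL_R] * 95 + [NULL_R] * 5
# Reference in which null samples form the majority (60% null samples).
common_reference = [NORMAL_R] * 40 + [NULL_R] * 60

# LRR of the same GSTM1-null individual against each reference.
lrr_rare = log_r_ratio(NULL_R, expected_r(rare_reference))
lrr_common = log_r_ratio(NULL_R, expected_r(common_reference))

print(f"LRR with rare deletion in reference:   {lrr_rare:.2f}")
print(f"LRR with common deletion in reference: {lrr_common:.2f}")
```

Under these assumptions, the null individual shows a strongly negative LRR when the deletion is rare in the reference, but an LRR of 0 when null samples dominate it, which is the pattern a calling algorithm would read as a normal copy number.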
The fact that the association between the GSTM1 CNV and bladder cancer can be detected with LRR values, without applying a CNV calling algorithm, confirms the utility of this measure as a complementary screening strategy to test for association at the genome-wide level, as already suggested [17–19]. Indeed, the LRR is a continuous measure that approximates, and correlates well with, the actual discrete number of copies. Nonetheless, it is affected by the noise in the intensity measurements of both alleles obtained from the hybridization experiments. Thus, using the LRR in the association test may reduce the power of some probes to detect the association, compared with using an accurate call of the discrete number of copies. This loss of power would explain why two of the five probes located in the GSTM1 locus failed to show association with bladder cancer risk in our study, and why the three significant probes showed only moderately significant p-values (between 8 × 10⁻⁴ and 0.019, Table 1). Nevertheless, even though the significance of these probes was moderate, we observed an excess of significant p-values compared with what would be expected under the null hypothesis of no association in that region. Thus, methods that work at the genome-wide level and search for regions with an excess of significant probes could have identified the GSTM1 region in our study. Alternatively, CNVtools, which performs joint calling and association testing, might also be considered, although it is more difficult to apply than the LRR-based approach and takes longer to run. The main caveat with CNVtools and equivalent methods is the definition of the regions of interest.
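As a rough illustration of what an excess of significant probes means, the sketch below computes the binomial probability of observing at least three significant probes out of five at a nominal 0.05 threshold. It assumes independent probes, which is a simplification: neighbouring probes within a CNV are correlated, so the resulting value is anti-conservative and should be read only as an order of magnitude. The function name is hypothetical and this is not the test used in the study.

```python
from math import comb


def binomial_tail(n_probes, n_significant, alpha):
    """P(X >= n_significant) for X ~ Binomial(n_probes, alpha).

    Assumes independent probes (a simplification for illustration).
    """
    return sum(
        comb(n_probes, k) * alpha**k * (1 - alpha) ** (n_probes - k)
        for k in range(n_significant, n_probes + 1)
    )


# Five probes in the locus, three of them significant at the 0.05 level.
p_excess = binomial_tail(n_probes=5, n_significant=3, alpha=0.05)
print(f"P(at least 3 of 5 probes significant under the null) = {p_excess:.6f}")
```

A permutation of case and control labels would be one way to account for the correlation between probes that this sketch ignores.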
The GSTM1 deletion is located in a region of high sequence homology flanked by a segmental duplication, which might explain why its breakpoints vary slightly and, in turn, why it is difficult to call. In this case, however, the locus is well defined because the GSTM1 deletion was already known, and probe-based approaches are able to identify it. Nevertheless, there might be other, still unknown CNVs in the genome with similar characteristics that might likewise be difficult to call. To increase sensitivity in CNV identification at the whole-genome scale, we propose performing a genome-wide screen for association using the LRR values at each probe and then applying CNVtools for fine-tuning in the most promising regions.
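The screening step of this proposal could be sketched as a sliding-window scan over probe-level p-values that flags windows containing several significant probes as candidate regions for follow-up. The function name, window size, thresholds, and p-values below are all hypothetical, and the follow-up with CNVtools (an R package) is not shown.

```python
def candidate_regions(p_values, window=5, alpha=0.05, min_hits=3):
    """Return merged (start, end) probe-index ranges, end exclusive, in which
    at least `min_hits` of `window` consecutive probes have p < alpha."""
    regions = []
    for start in range(len(p_values) - window + 1):
        hits = sum(p < alpha for p in p_values[start:start + window])
        if hits < min_hits:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # Overlapping or adjacent window: extend the previous region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Hypothetical per-probe p-values from an LRR-based association test,
# ordered by genomic position.
p_values = [0.60, 0.41, 0.83, 0.0008, 0.30, 0.019, 0.004, 0.52, 0.77, 0.90, 0.35]

regions = candidate_regions(p_values)
print(regions)
```

Regions flagged in this way would then be passed to a joint calling and association method for fine-tuning, which sidesteps the need to define regions of interest in advance.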