Research | Open Access
A distance-type measure approach to the analysis of copy number variation in DNA sequencing data
BMC Genomics volume 20, Article number: 195 (2019)
Abstract
Background
The next generation sequencing technology allows us to obtain a large amount of short DNA sequence (DNA-seq) reads at a genome-wide level. DNA-seq data have been increasingly collected in recent years. Count-type data analysis is a widely used approach for DNA-seq data. However, the related data preprocessing is based on the moving window method, in which a window size needs to be defined in order to obtain count-type data. Furthermore, useful information can be lost during this preprocessing step.
Results
In this study, we propose to analyze DNA-seq data based on the related distance-type measure. Distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. Our simulation study based on experimental data confirms the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we propose artificial censoring for the distance data so that distances larger than a given value are considered potential outliers. Our purpose is to simplify the preprocessing of DNA-seq data. Statistically, we consider a mixture of right censored geometric distributions to model the distance data. Additionally, to reduce the GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). The model can be estimated by the Newton-Raphson algorithm together with the Expectation-Maximization (EM) algorithm. We have conducted simulations to evaluate the performance of our approach. Based on an inverse normal transformation of the distance data, we can obtain the related z-values for a follow-up analysis. For an illustration, an application to the DNA-seq data from a pair of normal and tumor cell lines is presented with a change-point analysis of z-values to detect DNA copy number alterations.
Conclusion
Our distance-type measure approach is novel. It does not require either a fixed or a sliding window procedure for generating count-type data. Its advantages have been demonstrated by our simulation studies and its practical usefulness has been illustrated by an experimental data application.
Background
The next generation sequencing technology has advanced significantly during the recent decade. It allows us to obtain a large amount of short DNA sequence (DNA-seq) reads at a genome-wide level [1]. DNA-seq data have been increasingly collected in recent years [2, 3]. Count-type data analysis is a widely used approach for DNA-seq data [4]. However, for data preprocessing, a moving window with the related window size needs to be defined in order to obtain count-type data [5]. During the data preprocessing, given a window of genomic region, the observations within the region are summarized. However, useful information from the original data can be lost during this preprocessing.
DNA-seq has been widely used for the detection of copy number variation/alteration [1, 4]. Both copy number variation (CNV) analysis and copy number alteration (CNA) analysis are based on the assumption that the observed number of short reads in a local genomic region is proportional to the underlying copy number of DNA. CNV/CNA can also be discovered by other biomedical techniques such as fluorescent in situ hybridization, comparative genomic hybridization, array comparative genomic hybridization, and by virtual karyotyping with SNP arrays [2, 6, 7]. The next generation sequencing technology allows us to obtain millions of short DNA reads in a relatively short amount of time [8,9,10]. Digital karyotyping is a simple and powerful method [6]. Many statistical methods for CNV/CNA analysis are based on count-type DNA-seq data [2, 3, 5, 11]. It is also necessary to consider GC-content bias in the DNA-seq data [12]. The GC content can be calculated based on the Guanine (G) and Cytosine (C) bases in a reference genome. GC-content bias refers to the dependence between the sequencing data and the related GC content.
Due to the limitations of current sequencing techniques, sequencing data cannot be obtained for certain genomic regions. There, the related count-type measure is simply zero. For a DNA sequence containing such a genomic region, a large gap without any short reads is observed, and the related distance-type measure can be very large (statistically, such values can be considered outliers; see below for the definition of the distance-type measure), so an artificial censoring can be considered.
In this study, we consider a distance-type measure. Distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. Our distance-type measure approach is novel: it does not require either a fixed or a sliding window procedure for generating count-type data. Furthermore, we propose artificial censoring for the distance data so that distances larger than a given value are considered potential outliers. Our purpose is to simplify the preprocessing of DNA-seq data.
Statistically, we consider a mixture of right censored geometric distributions to model the distance data. Additionally, to reduce the GC-content bias, we extend the mixture model to a mixture of generalized linear models (GLMs). Our approach can be considered as an alternative to the method proposed by Shen and Zhang [11]. The model can be estimated by the Newton-Raphson algorithm together with the Expectation-Maximization (EM) algorithm. Then, based on an inverse normal transformation of the distance data, we can obtain the related z-values for a follow-up analysis.
In the following, we first demonstrate how to conduct a simulation study based on experimental data to confirm the advantages of the distance-type measure approach in both detection power and detection accuracy. Then, we introduce our statistical models and the related estimation procedure. The model performance is demonstrated by simulation studies. The practical usefulness of our approach is illustrated by an application to the DNA-seq data from a pair of normal and tumor cell lines [1] as well as a follow-up change-point analysis of z-values for the detection of DNA copy number alterations.
Methods
We consider a novel distance-type measure approach to the analysis of DNA-seq data. Distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. An illustration is shown in Fig. 1. A clear advantage is that the moving window method is no longer required for data preprocessing.
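The distance-type measure described above can be sketched in a few lines. This is an illustrative sketch only (the function name and the representation of reads by their mapped start positions are our assumptions, not part of the original method):

```python
# Sketch: derive distance-type data from mapped read start positions.
# Reads are represented by their start coordinates on the reference genome;
# the name read_distances is illustrative.

def read_distances(start_positions):
    """Distances in bps between each pair of adjacent mapped reads."""
    pos = sorted(start_positions)
    return [b - a for a, b in zip(pos, pos[1:])]

# Example: four reads mapped at these positions on a reference chromosome
positions = [100, 250, 260, 900]
print(read_distances(positions))  # [150, 10, 640]
```

Note that n mapped reads yield n - 1 distances, so the overall size of the original data is essentially preserved, unlike binning.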
We first introduce how to compare distance-type vs. count-type data so that more advantages of distance-type data can be confirmed. Then, we introduce a finite mixture [13, 14] of generalized linear models (GLMs) for modeling the distance between two adjacent reads with an adjustment for the GC-content. Artificial censoring of distance-type data is considered so that potential outliers can be accommodated in the model. The basic distribution component for our method is the right censored geometric distribution. In the following, we briefly describe our method and the related estimation procedure. The details are given in Additional file 1.
Comparing distance-type vs. count-type data
The widely used coverage depth (or count-type) based approach can be briefly described as follows. A genomic region is divided into non-overlapping smaller regions (windows or bins) and the number of sequencing reads is counted for each bin. For example, if a genomic region is divided into 100 bins, then 100 count-type data points can be obtained from these bins. The original sequencing read data are thus simplified to count-type data by this approach. (Our proposed distance based approach maintains the overall structure as well as the overall size of the original sequencing read data.)
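For comparison, the count-type preprocessing just described can be sketched as follows (a minimal sketch; the function name and interface are our illustrative assumptions):

```python
# Sketch of the coverage-depth (count-type) preprocessing: divide a region
# into non-overlapping bins of width W and count reads per bin.

def bin_counts(start_positions, region_start, region_end, W):
    n_bins = (region_end - region_start) // W
    counts = [0] * n_bins
    for p in start_positions:
        b = (p - region_start) // W  # index of the bin containing this read
        if 0 <= b < n_bins:
            counts[b] += 1
    return counts

# Example: 4 reads, 10 bins of width 10 over the region [0, 100)
positions = [5, 15, 25, 95]
print(bin_counts(positions, 0, 100, 10))
# [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
```

Here the choice of W determines the resolution, which is exactly the tuning decision the distance based approach avoids.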
In this study, the DNA sequencing data were collected for the detection of copy number variations. A certain type of data (distance or count) is generated from the original sequencing read data (by the distance based approach or the coverage depth based approach, respectively). Between these two types of data, our choice is decided based on the statistical power for detecting changes. To demonstrate the advantage of our approach, we perform a simulation study to evaluate whether the distance based approach can be more statistically powerful than the coverage depth based approach. We consider a relatively simple scenario: if two genomic regions are known to be different in copy numbers, then which approach is more likely to detect the change? Furthermore, can accurate detection of the change location be achieved when the distance-type data are used in the copy number variation/alteration analysis?
To address which type of data is more likely to detect the change (detection power), we consider the following simulation. Our simulation data were generated based on real experimental sequencing data (instead of simulating data completely from statistical distributions). We generated data based on the region at position 10-30 Mbps from the normal sample (as presented in the section Application to Sequencing Data, but we used the original sequencing read data). In this region, we randomly selected two bins with length L each. Each of the original sequencing reads in the second bin was removed randomly with 50% probability. (The purpose was to simulate a deletion of 50% of the copy number.) We kept all the original sequencing reads in the first bin. In this way, we simulated sequencing data for two genomic regions with different copy numbers. The distance-type data can be obtained as described at the beginning of the Methods section. To obtain count-type data, we divided each region into bins with length W each (and then counted the number of sequencing reads in each bin). To achieve an unbiased comparison between the distance-type data and the count-type data, we used the non-parametric Kolmogorov-Smirnov two-sample test to compare the empirical distributions of the simulated data for the two genomic regions. Notice that either the distance based approach or the coverage depth based approach is used to generate a certain type of data (distance-type or count-type, respectively). It is necessary to compare which type of data is more statistically powerful in the detection of changes. Our evaluation strategy does not depend on the data type, as it does not require any parametric or statistical distribution assumptions.
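One repetition of this power comparison can be sketched as below. This is a simplified stand-in, not the authors' pipeline: the real study thinned actual sequencing reads, whereas here uniformly placed synthetic reads are used only to make the sketch self-contained.

```python
# Illustrative sketch of one simulation repetition (assumes numpy and scipy).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Synthetic read positions in two bins of length L (uniform placement is a
# simplifying assumption made only for this sketch).
L = 50_000
bin1 = np.sort(rng.uniform(0, L, size=2000))
bin2 = np.sort(rng.uniform(0, L, size=2000))
bin2 = bin2[rng.random(bin2.size) < 0.5]  # drop each read w.p. 50% (deletion)

# Distance-type data: gaps between adjacent reads in each bin
d1, d2 = np.diff(bin1), np.diff(bin2)

# Non-parametric two-sample comparison, as in the text
res = ks_2samp(d1, d2)
print(f"KS statistic = {res.statistic:.3f}, p-value = {res.pvalue:.2e}")
```

With half the reads removed, the adjacent-read distances in the second bin roughly double, and the KS test detects this shift.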
We considered different values for L: 50,000, 200,000 and 1,000,000 bps for different lengths of copy number variations, as well as different values for W: 5000, 10,000 and 25,000 bps for different bin sizes. (We considered the bin size range of 5-25 kbps to avoid zero counts or very low count data.) For each setting, we repeated our simulation and analysis 5000 times.
To address whether accurate detection of the change location can be achieved (detection accuracy), we obtained more analysis results from the above simulation setting. In each simulation repetition, the data for the two genomic regions were considered to be concatenated. We evaluated how accurately the boundary (change) between the two regions could be detected. As the data for the two regions were ordered, the change location could be between any two consecutive data points (for both distance-type and count-type). Therefore, we screened all the possible locations and considered the location with the lowest p-value as the identified change location. In this way, from each simulation repetition, we obtained the identified change location and its related p-value. Although the lowest p-value was considered, some cutoff value would still usually be applied in practice. As all the possible change locations were screened, a cutoff value lower than the conventional 0.05 would usually be considered. Therefore, we considered the following cutoff values for the lowest p-values: 0.01, 0.001, 0.0001, 0.00001, 0.000001 and 0.0000001. Based on the results of the detection power comparison (see also Table 1), the advantage of distance-type data was clear for short L and the advantage of count-type data became clear for long L. Therefore, we considered several moderate L values: 150,000, 300,000 and 450,000 bps for different lengths of copy number variations. The values 5000, 10,000 and 25,000 bps were still considered for W (bin sizes). For each setting, we again repeated our simulation and analysis 5000 times.
For each simulation repetition, if the returned p-value was lower than the cutoff value, then we calculated the difference between the identified and true change location; otherwise, we recorded the difference as no-detection (ND). Then, we evaluated whether the overall difference based on the distance-type data was lower than the overall difference based on the count-type data. (Notice that a lower difference means a more accurate detection.) Statistically, the traditional Wilcoxon-Mann-Whitney rank sum test is an appropriate choice (for testing "distance-type based difference" < "count-type based difference"). In the test, the ND differences were set to a value larger than any observed difference, as no-detection (ND) is the worst-case scenario.
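The change-location screen described above (try every split point, keep the one with the lowest p-value) can be sketched as follows. The data generator and the minimum-segment-size guard are illustrative assumptions added to keep the sketch short and stable:

```python
# Sketch: screen all split points of an ordered sequence and return the
# split with the lowest KS p-value as the identified change location.
import numpy as np
from scipy.stats import ks_2samp

def best_change_location(data, min_size=20):
    best_loc, best_p = None, 1.0
    for k in range(min_size, len(data) - min_size):
        p = ks_2samp(data[:k], data[k:]).pvalue
        if p < best_p:
            best_loc, best_p = k, p
    return best_loc, best_p

# Toy data: a change in the geometric success probability at position 500
rng = np.random.default_rng(7)
data = np.concatenate([rng.geometric(0.02, 500), rng.geometric(0.01, 500)])
loc, p = best_change_location(data)
print(loc, p)  # the identified split should fall near the true change at 500
```

The difference between `loc` and the true change position 500 is then the accuracy measure compared across data types.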
Right censored geometric distribution
Suppose Y is a random variable (e.g. a distance-type measure) following a right censored geometric distribution with p as the probability of success. Notice that a geometric distribution can be considered as a model for the number of failures until the first success. Let T be a given constant for the censoring value (e.g. the value for artificial right censoring). Define the indicator function δ = 1 if y ≤ T (uncensored) and δ = 0 if y > T (censored). The probability distribution function of the right censored geometric distribution is then given by:

\( f\left(y;p\right)={\left[p{\left(1-p\right)}^y\right]}^{\delta }{\left[{\left(1-p\right)}^{T+1}\right]}^{1-\delta }, \) (1)

where \( {\left(1-p\right)}^{T+1}=P\left(Y>T\right) \) is the probability mass assigned to the censored observations.
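A minimal numerical sketch of this censored probability function, assuming the convention of Eq. (1) (the function name is illustrative):

```python
# Sketch of the right censored geometric probability function in Eq. (1):
# observations are recorded exactly when y <= T and censored above T.

def rc_geom_pmf(y, p, T):
    if y <= T:                    # uncensored: geometric pmf p(1-p)^y
        return p * (1 - p) ** y
    return (1 - p) ** (T + 1)     # censored mass: P(Y > T)

# Sanity check: the probabilities over 0..T plus the censored mass sum to one
p, T = 0.1, 50
total = sum(rc_geom_pmf(y, p, T) for y in range(T + 1)) + rc_geom_pmf(T + 1, p, T)
print(round(total, 12))  # 1.0
```

Note that every censored value receives the same probability mass, which is why the model is insensitive to the exact magnitude of extreme distances.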
Mixture of right censored geometric distributions
Based on the above right censored geometric distribution, a mixture of right censored geometric distributions can be described. Let Y_{1}, … , Y_{n} denote a random sample of size n. Each mixture component is a right censored geometric distribution given by Eq. (1), where p_{j} is the success probability for the jth component. The probability distribution function of a mixture of g components is given by:

\( f\left(y;\Psi \right)={\sum}_{j=1}^g{\pi}_j\;{f}_j\left(y;{p}_j\right), \) (2)
where f_{j}(y; p_{j}) is the right censored geometric probability function in Eq. (1) with success probability p_{j}, and π_{1}, π_{2}, … , π_{g} are the component proportions subject to the constraint \( {\sum}_{j=1}^g{\pi}_j=1 \) with all π_{j} ∈ [0, 1] and all p_{j} ∈ [0, 1]. The indicator function δ = 1 if y ≤ T and δ = 0 if y > T, as above. All the sample data are (artificially) censored at the same value T.
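The mixture in Eq. (2) is simply a proportion-weighted sum of component probabilities, as this short sketch shows (function names are illustrative):

```python
# Sketch of the g-component mixture in Eq. (2), reusing the single-component
# right censored geometric probability function.

def rc_geom_pmf(y, p, T):
    if y <= T:
        return p * (1 - p) ** y
    return (1 - p) ** (T + 1)

def mixture_pmf(y, pis, ps, T):
    # Weighted sum over components; pis must sum to one
    return sum(pi * rc_geom_pmf(y, pj, T) for pi, pj in zip(pis, ps))

# Example: 3 components (proportions and success probabilities as in the
# first simulation scenario reported later)
pis = [0.3, 0.5, 0.2]
ps = [0.002, 0.02, 0.2]
print(mixture_pmf(0, pis, ps, 5000))
```

At y = 0 the mixture value is 0.3·0.002 + 0.5·0.02 + 0.2·0.2 ≈ 0.0506.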
Mixture of right censored geometric distribution based GLMs
Based on the above mixture of right censored geometric distributions, a mixture of right censored geometric distribution based GLMs can be described. Eq. (2) can be extended to a finite mixture of g-component GLMs. Let y_{1}, … , y_{n} be n independent observations of the response variable. For each y_{i}, there is a covariate x_{i}. π_{1}, … , π_{g} are the component proportions as defined above. The mixture is given by:

\( f\left(y;x,\Psi \right)={\sum}_{j=1}^g{\pi}_j\;{f}_j\left(y;x\right), \) (3)
where f_{j}(y; x) = (p_{j}(x)(1 − p_{j}(x))^{y})^{δ}((1 − p_{j}(x))^{(T + 1)})^{(1 − δ)}. Notice that each p_{j} is a function of x. We first define μ_{j} = (1 − p_{j})/p_{j} and then we use the link function η_{j} = log(μ_{j}) = β_{j0} + β_{1}x for j = 1, … , g. Here, β_{1} is common for all different components since it represents the GC-content effect. Furthermore, the component proportions π_{1}, … , π_{g} do not depend on the covariate since they are considered as global parameters (not local). Therefore, the vector Ψ of parameters is given by Ψ = (π_{1}, … , π_{g − 1}, β_{10}, … , β_{g0}, β_{1}). The intercepts (β_{10}, … , β_{g0}) are different for different distribution components.
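The link function above can be inverted to recover each component's success probability from its linear predictor. A minimal sketch (the function name is illustrative):

```python
# Sketch of the link used above: eta_j = log(mu_j) = beta_j0 + beta_1 * x,
# with mu_j = (1 - p_j)/p_j, so p_j(x) = 1 / (1 + exp(eta_j)).
import math

def p_component(beta_j0, beta_1, x):
    eta = beta_j0 + beta_1 * x    # linear predictor on the log-mean scale
    mu = math.exp(eta)            # mu_j = (1 - p_j)/p_j >= 0
    return 1.0 / (1.0 + mu)      # invert to the success probability

# With x = 0 the intercept alone sets the component's mean distance:
# beta_j0 = log(499) gives mu = 499, i.e. p = 1/500
print(round(p_component(math.log(499), 2.0, 0.0), 6))  # prints 0.002
```

Larger GC-content effects (via β₁x) inflate μ_j, the expected distance, and correspondingly shrink p_j(x).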
Estimation procedure for mixture GLM based on geometric distribution
The log-likelihood based on Eq. (3) is given by

\( l\left(\Psi \right)=\log L\left(\Psi \right)={\sum}_{i=1}^n\log \left[{\sum}_{j=1}^g{\pi}_j\;{f}_j\left({y}_i;{x}_i\right)\right]. \)
The EM algorithm [15] can be used to obtain the maximum likelihood estimate of Ψ. For each y_{i}, missing component information can be considered. Accordingly, the indicators {z_{ij}} are introduced: z_{ij} = 1 if y_{i} belongs to the jth component of the mixture model; z_{ij} = 0 otherwise (j = 1, … , g; i = 1, … , n). Then, the log-likelihood of the "complete data" is given by:

\( {l}_c\left(\Psi \right)={\sum}_{i=1}^n{\sum}_{j=1}^g{z}_{ij}\left[\log {\pi}_j+\log {f}_j\left({y}_i;{x}_i\right)\right]. \)
For the E-step, we calculate:

\( {\tau}_{ij}=\mathrm{E}\left({z}_{ij}|{y}_i\right)={\pi}_j\;{f}_j\left({y}_i;{x}_i\right)/\left[{\sum}_{h=1}^g{\pi}_h\;{f}_h\left({y}_i;{x}_i\right)\right]. \)
For the M-step, we first calculate: \( {\pi}_j={\sum}_{i=1}^n{\tau}_{ij}/n. \)
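The E- and M-steps can be illustrated compactly in the simpler no-covariate, uncensored case, where the M-step for each p_j has a closed form (the full model instead solves for the β coefficients numerically, as described next). This sketch is a simplified stand-in, not the authors' estimation procedure:

```python
# Compact EM sketch for an uncensored geometric mixture without covariates.
# With no censoring and no GC covariate, the weighted MLE for p_j is
# closed-form: p_j = sum_i tau_ij / sum_i tau_ij (1 + y_i).
import numpy as np

def em_geometric_mixture(y, g, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    pis = np.full(g, 1.0 / g)
    ps = np.sort(rng.uniform(0.001, 0.5, g))[::-1]
    for _ in range(n_iter):
        # E-step: posterior component memberships tau_ij
        logf = np.log(ps) + np.outer(y, np.log1p(-ps))  # log f_j(y_i)
        w = np.log(pis) + logf
        w -= w.max(axis=1, keepdims=True)               # numerical stability
        tau = np.exp(w)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: closed-form updates
        pis = tau.mean(axis=0)
        ps = tau.sum(axis=0) / (tau * (1 + y[:, None])).sum(axis=0)
    return pis, ps

# Toy data: two components with p = 0.02 and p = 0.2
rng = np.random.default_rng(3)
y = np.concatenate([rng.geometric(0.02, 3000) - 1, rng.geometric(0.2, 2000) - 1])
pis, ps = em_geometric_mixture(y.astype(float), g=2)
print(np.round(pis, 2), np.round(ps, 3))
```

The estimated success probabilities should recover roughly 0.02 and 0.2, and the proportions roughly 0.6 and 0.4.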
The calculation for (β_{10}, … , β_{g0}, β_{1}) requires a numerical optimization procedure, which is used to solve the following equation system:

\( {\sum}_{i=1}^n{\tau}_{ij}\;\partial \log {f}_j\left({y}_i;{x}_i\right)/\partial {\beta}_{j0}=0,\kern0.5em j=1,\dots, g, \)

and

\( {\sum}_{i=1}^n{\sum}_{j=1}^g{\tau}_{ij}\;\partial \log {f}_j\left({y}_i;{x}_i\right)/\partial {\beta}_1=0. \)
Based on the Newton-Raphson method, we have the following iterative equation for β = (β_{10}, … , β_{g0}, β_{1}):

\( {\beta}^{\left(k+1\right)}={\beta}^{(k)}-{\left[H\left({\beta}^{(k)}\right)\right]}^{-1}S\left({\beta}^{(k)}\right), \)

where S(β) is the score vector defined by the above equation system and H(β) is the related Hessian matrix.
The details for this iterative equation as well as the EM algorithm are given in the Additional file 1.
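The Newton-Raphson update has the generic form beta_new = beta - H(beta)^(-1) S(beta). The sketch below demonstrates this update on a small stand-in objective, not the actual mixture likelihood (the callables and their names are our illustrative assumptions):

```python
# Generic Newton-Raphson sketch: iterate beta <- beta - H^{-1} S until the
# step is small, where S is the score (gradient) and H the Hessian.
import numpy as np

def newton_raphson(score, hessian, beta0, tol=1e-10, max_iter=50):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hessian(beta), score(beta))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy example: maximize -(b1 - 1)^2 - (b2 + 2)^2, i.e. solve S(beta) = 0
score = lambda b: np.array([-2 * (b[0] - 1), -2 * (b[1] + 2)])
hessian = lambda b: np.array([[-2.0, 0.0], [0.0, -2.0]])
print(newton_raphson(score, hessian, [0.0, 0.0]))  # converges to [1, -2]
```

For a quadratic objective the update converges in one step; in the mixture GLM the same update is run inside each M-step.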
Number of components
The number of components in the mixture model is determined by a likelihood ratio test (LRT) based approach. Consider two integers g_{0} < g_{1}. We test the null hypothesis H_{0} : g = g_{0} against H_{1} : g = g_{1}. We first use the observed data to fit a mixture model with g_{0} components and a mixture model with g_{1} components. An LRT score can be calculated for the observed data. Then, we can simulate data based on the fitted model with g_{0} components (parametric bootstrap). For each set of simulated data, we can calculate the related bootstrapped LRT score. After B rounds of parametric bootstrap repetitions, the p-value of the observed LRT score can be approximately calculated based on the B bootstrapped LRT scores. (If the p-value is less than 0.05, the null hypothesis can be rejected.)
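The bootstrap procedure above can be sketched with the model fitting abstracted behind user-supplied callables (the interface and names are our illustrative assumptions; the real `fit`, `simulate` and `loglik` would involve the censored geometric mixtures):

```python
# Sketch of the parametric-bootstrap LRT p-value calculation: fit under H0
# and H1, simulate B data sets from the H0 fit, and count how often the
# bootstrapped LRT score reaches the observed one.
import numpy as np

def bootstrap_lrt_pvalue(data, fit, simulate, loglik, g0, g1, B=200, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = fit(data, g0), fit(data, g1)
    observed = 2 * (loglik(data, theta1) - loglik(data, theta0))
    exceed = 0
    for _ in range(B):
        boot = simulate(theta0, len(data), rng)  # draw from the H0 fit
        t0, t1 = fit(boot, g0), fit(boot, g1)
        if 2 * (loglik(boot, t1) - loglik(boot, t0)) >= observed:
            exceed += 1
    return (exceed + 1) / (B + 1)  # standard add-one bootstrap p-value
```

The (exceed + 1)/(B + 1) form avoids reporting an exact zero p-value from a finite number of bootstrap repetitions.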
Results
Distance-type data vs. count-type data
For the detection power based comparison, our evaluation results are summarized in Table 1. As different p-value cutoff values may be used in different scenarios of copy number variation analysis, we considered 0.05, 0.01, 0.001, 0.0001, 0.00001, and 0.000001. Given a cutoff value, a better approach should show a higher proportion of p-values less than the cutoff value. When the regions are as long as 200 kbps, the advantage of the distance based approach (distance-type data) can be clearly observed. It is more difficult to detect changes as the regions become shorter. However, the advantage of the distance based approach becomes even clearer when the regions are as short as 50 kbps. When the regions are as long as 1 Mbps, the two approaches perform similarly overall. From Table 1, it is also clear that the performance of the coverage depth based approach (count-type data) depends on the choice of bin size. This has already been discussed in the literature. As our focus in this study is on the choice between distance-type and count-type data, we do not discuss the choice of bin size for count-type data further.
For the detection accuracy based comparison, our evaluation results are summarized in Table 2. The proportion of the lowest p-values less than a given cutoff value is presented for each type of data. The proportion based on the distance-type data is always higher. The one-sided p-value for each pairwise comparison test is also presented in Table 2. (Notice that the relationship between one-sided and two-sided p-values for the Wilcoxon-Mann-Whitney rank sum test is simple.) When L is relatively short (150,000 bps), the detection accuracy based on the distance-type data is always better than that based on the count-type data. When L is increased to 300,000 bps, the detection accuracy based on the distance-type data is still clearly better when the cutoff value is 0.001 or lower. When L is relatively long (450,000 bps), the detection accuracy based on the count-type data is better overall, but the detection accuracy based on the distance-type data is still better when the cutoff value is 0.0001 or lower. (Notice that low cutoff values are necessary in many practical situations.)
Estimation and testing performance for mixture of right censored geometric distributions
We performed a comprehensive simulation study to evaluate the estimation and testing performance of our proposed mixture of right censored geometric distributions. To understand estimation when artificial right censoring is applied, we simulated a mixture of 3-component geometric distributions with the parameters π_{1} = 0.3, π_{2} = 0.5, π_{3} = 0.2 and p_{1} = 0.002, p_{2} = 0.02, p_{3} = 0.2 (Fig. 2 and Additional file 2: Figures S1-S5). Another simulation was based on a 3-component geometric mixture with the parameters π_{1} = 0.008, π_{2} = 0.754, π_{3} = 0.238 and p_{1} = 0.999, p_{2} = 0.011, p_{3} = 0.0006 (Fig. 3 and Additional file 2: Figures S6-S10). For the estimation performance evaluation, we ran 1000 simulations with 100, 250, 500, 1000 and 5000 samples each, with no censoring and with censoring at 250, 500 and 1000. We evaluated the bias, variance and root mean square error and used boxplots to summarize the findings (most figures are provided as supplementary materials).
The standard deviation and the root mean square error (RMSE, Figs. 2 and 3) decrease with increasing sample size and censoring value. The RMSE combines both bias and variance; it shows a steady decline as the sample size increases and also decreases as the censoring value for the observations increases. However, in the second simulation, for p_{1} = 0.999 we observed that the median estimate consistently underestimated the true parameter value. Therefore, we need to be cautious when the true parameter value is close to 1. The interquartile range also becomes narrower with increasing sample size and artificial censoring value. Overall, both simulation studies support the use of the right censored geometric distribution for DNA-seq data.
Furthermore, we have evaluated the effect of censoring on the performance of hypothesis testing. A data set can be simulated as above and we can test the hypothesis of 3 components vs. 4 based on 500 parametric bootstrap p-values. These steps can be repeated 500 times for each case (sample size and censoring value) to generate the empirical distribution of p-values. The empirical distribution plots for these p-values show uniform-distribution-like patterns (results not shown). Therefore, the hypothesis testing p-values can be appropriately calculated by the parametric bootstrap procedure.
Estimation and testing performance for mixture GLMs where the response variable has a right censored geometric distribution
To understand estimation and testing performance when artificial right censoring is applied in the case of a mixture GLM, we simulated a 3-component mixture GLM with the parameters π_{1} = 0.3, π_{2} = 0.5, π_{3} = 0.2 and β_{10} = 6.0, β_{20} = 3.6, β_{30} = 1.1, β_{1} = 2.0. The choice of the component proportions is the same as above and the choice of β_{10}, β_{20}, β_{30} was based on setting β_{1} = 2.0 and using p_{1} = 0.002, p_{2} = 0.02, p_{3} = 0.2 as above (consistent with a previous simulation). For the estimation performance evaluation, we ran 1000 simulations with 10,000 samples each, with no censoring (NC) and with censoring at 570, 937, and 1309, respectively (Additional file 2: Figures S11-S12). The censoring values were set such that on average about 10, 5 and 2.5% of the data are censored (consistent with our previous simulations; also about 1-3% in our application). We generated the x-variable from the Uniform [0, 0.25] distribution. A second simulation was based on a 3-component mixture GLM with the parameters π_{1} = 0.3, π_{2} = 0.5, π_{3} = 0.2 and β_{10} = 6.2, β_{20} = 3.9, β_{30} = 1.4, β_{1} = 0.1. The intercepts were set according to the same strategy as explained above (based on β_{1} = 0.1). For the estimation performance evaluation, we ran 1000 simulations with 10,000 samples each, with no censoring (NC) and with censoring at 1241, 895 and 549 (Additional file 2: Figures S13-S14). The censoring values were also set according to the same strategy as explained above. We evaluated the bias, variance and root mean square error (results not shown) and used boxplots to summarize the findings (most figures are provided as supplementary materials).
The standard deviation and the root mean square error (RMSE) decrease as the artificial censoring value for the observations increases. The median also gets closer to the true value, and the interquartile range becomes narrower, with increasing censoring value. Overall, both simulation studies support the use of right censoring in the mixture GLM.
Furthermore, we have evaluated the effect of censoring on the testing performance. A data set can be simulated as above and we can test the hypothesis of 3 components vs. 4 based on 100 parametric bootstrap p-values. These steps can be repeated 200 times for the first simulation scenario or 100 times for the second simulation scenario (for each censoring value) to generate the empirical distribution of the p-values. (The number of repetitions was limited to 200 or 100 because this evaluation required much more computing time.) Additional file 2: Figure S15 (for the first simulation scenario) and Figure S16 (for the second simulation scenario) show the empirical distribution plots for these p-values. Uniform-distribution-like patterns can be observed. Therefore, the hypothesis testing p-values can be appropriately calculated by the parametric bootstrap procedure.
Application to sequencing data
We considered an application to the DNA-seq data from Chiang et al. [1]. The data set was made publicly available and we selected a normal cell line and the related tumor cell line (HCC1954) for breast carcinoma. Furthermore, we focused on chromosome 9 for an illustration of our method. The value 5000 (bps) was used for the artificial right censoring (we observed that at most 1-3% of the distance data were greater than 5000 bps, so most of the distance data were relatively short).
A 4-component mixture of right censored geometric distribution based generalized linear models was estimated for the normal sample and then for the tumor sample (Table 3). GC-content adjustment was considered. (We downloaded the 50 Kbps-window based GC-content data from http://bioinfo-out.curie.fr/projects/freec .) Notice that, as the GC-content data were based on a reference genome, the model coefficient for GC-content was first estimated based on the normal sample data and then fixed in the model for the tumor sample data (all the other model parameters were estimated separately for each sample). This approach (pseudo-maximum likelihood estimation) was based on the work by Gong and Samaniego [16]. (Regarding the number of mixture components, we could not perform the related hypothesis testing because the required computation was not affordable. However, we conducted a preliminary data analysis based on the mixture of right censored geometric distributions without considering any covariates, in which we could perform hypothesis testing on the number of mixture components, and four distinct components were confirmed. Therefore, we reasonably assumed 4 components for the mixture of GLMs.)
Inverse normal transformations (INTs) have been widely considered in practice. The purpose is to make the (transformed) data more similar to normal distributions. Both rank based INTs and non-rank based INTs can be considered [17,18,19,20,21,22,23,24,25,26]. We considered the following INT. Based on our estimated model (for the normal or tumor sample, see Table 3), the cumulative distribution function (CDF) for an observation y (not artificially right censored) was calculated by:
where F_{j} is the CDF based on f_{j} given in Eq. (3). Then, the inverse of the standard normal distribution function was applied. Through this INT, we obtained the distance-type measure related INT values.
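The two-step transform (mixture CDF, then normal quantile) can be sketched as follows. For brevity this sketch uses an uncensored geometric mixture without the GC covariate; the function names are illustrative, and scipy's `geom` with `loc=-1` matches the failures-before-success convention used in the paper:

```python
# Sketch of the model-based INT: map each uncensored distance through the
# fitted mixture CDF, then through the standard normal quantile function.
import numpy as np
from scipy.stats import norm, geom

def mixture_cdf(y, pis, ps):
    # CDF of a mixture of (uncensored) geometric components; scipy's geom
    # counts trials, so loc=-1 shifts it to the failures-before-success form.
    return sum(pi * geom.cdf(y, p, loc=-1) for pi, p in zip(pis, ps))

def int_values(y, pis, ps):
    u = np.array([mixture_cdf(v, pis, ps) for v in y])
    return norm.ppf(u)  # z-values for the follow-up analysis

pis, ps = [0.3, 0.5, 0.2], [0.002, 0.02, 0.2]
zs = int_values([10, 100, 1000], pis, ps)
print(np.round(zs, 3))
```

Short distances map to negative z-values and long distances to positive ones, so a deletion (longer distances) shifts the z-values upward.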
The comparison between the distance data (log2 of (distance + 1)) and the related INT values (inverse normal transformed values; artificially right censored distance data excluded) is shown in Fig. 4. Some uneven patterns can be visualized in the plots of the distance data. These were not observed in the plots of the z-values. This indicates that the GC-content adjustment has worked well in our method.
Then, for the INT values related to the artificially right censored distance data, we simply assigned a value of 10 (larger than any INT values in Fig. 4). In Fig. 5, more details can be visualized for the INT values. Notice that the censored observations are highlighted by red vertical lines (there are 31 instances for the normal sample and 35 instances for the tumor sample). Some differences in the region around 200 K-38 M bps (chromosome 9) can be observed and we performed a change-point analysis for this region. (Normality was checked for the INT values. We randomly selected regions with length 100 Kbps for both the normal and tumor samples and generated the related quantile-quantile plots [QQ plots, Additional file 2: Figures S17-S18]. Then, we assumed that the INT values were normally distributed, which was helpful to avoid the permutation based p-value calculation [27] in the subsequent change-point analysis.)
For a change-point analysis of the above selected region (as an illustration), we considered a recursive combination approach [28]. For each round of recursion, at a given level α_{C}, any two adjacent blocks of INT values with two-sample test p-values larger than α_{C} were selected. Among them, the combination that generated the largest overall likelihood was applied. The recursion stopped when all the test p-values were less than α_{C}. Considering the large number of two-sample comparisons, we used 50 Kbps bins to group the INT values before the change-point analysis. We set α_{C} = 0.05/(m(m − 1)/2), where m is the number of bins (767 for the normal sample and 772 for the tumor sample).
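The recursive combination idea can be sketched as below. This is a rough stand-in, not the method of [28]: merging the adjacent pair with the largest two-sample p-value approximates the likelihood-based choice, and a t-test on bin means replaces the actual two-sample test; all names are illustrative.

```python
# Rough sketch of recursive combination: start from fine bins and repeatedly
# merge the adjacent pair whose two-sample test is least significant,
# stopping once every remaining boundary tests below alpha_c.
import numpy as np
from scipy.stats import ttest_ind

def recursive_combination(bins, alpha_c):
    blocks = [np.asarray(b, dtype=float) for b in bins]
    while len(blocks) > 1:
        pvals = [ttest_ind(blocks[i], blocks[i + 1]).pvalue
                 for i in range(len(blocks) - 1)]
        i = int(np.argmax(pvals))
        if pvals[i] <= alpha_c:      # all boundaries significant: stop
            break
        blocks[i] = np.concatenate([blocks[i], blocks.pop(i + 1)])
    return blocks

# Toy data: 12 bins of 50 z-values with a mean shift after the 6th bin
rng = np.random.default_rng(5)
bins = [rng.normal(0, 1, 50) for _ in range(6)] + \
       [rng.normal(2, 1, 50) for _ in range(6)]
out = recursive_combination(bins, alpha_c=1e-4)
print(len(out))  # ideally two blocks remain, split at the true change
```

Each boundary surviving the recursion corresponds to a detected change-point.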
The change-point analysis results are shown in Fig. 6. For both the normal and tumor samples, there is a change-point around 6-7 M bps. However, the magnitude of change is small, so this detection is unlikely to be biologically relevant. For the tumor sample, there is an additional change-point around 18-19 M bps, which implies a clear deletion (a larger distance is related to a lower DNA copy number). Cytobands have been widely considered in biomedical studies [29]. A search in the UCSC database (http://www.genome.ucsc.edu/cgi-bin/hgTracks) shows that the cytobands p13.2, p13.3, p21.1, p21.2, p21.3 and p22.1 are related to this detected region. Regarding the biological significance of this detected deletion, the relationship between these cytobands and breast cancer has been investigated by many studies [30, 31].
Discussion
Count-type data analysis is a widely used approach for DNA-seq data. It requires either a fixed or a sliding window procedure for generating count-type data, and useful information can be lost during this preprocessing. In this study, we proposed to analyze DNA-seq data based on the related distance-type measure. Our simulation study based on experimental data confirmed the advantages of the distance-type measure approach in both detection power and detection accuracy. Furthermore, we proposed a mixture of generalized linear models for analyzing recent DNA-seq data. Our model was based on the right censored geometric distribution. According to the mathematical details provided in Additional file 1, our method is linear in terms of the number of sequencing reads; therefore, our method scales well. The validity and usefulness of this approach were demonstrated by the estimation and testing performance in our simulation study as well as an application to experimental data. The selection of the censoring value was based on the observation that 1-3% of the distance data were larger than 5000 bps. The optimization of the censoring value is an interesting research topic and we will investigate it in a future study. Our distance based approach is novel with a clear advantage: it does not require either a fixed or a sliding window procedure for generating count-type data.
Similar to the exponential distribution (for continuous measurements), which is widely used to model waiting-time processes, the geometric distribution (for discrete measurements) is also memoryless. Artificial right-censoring was proposed to overcome the difficulty of modeling extreme distance values. (For example, it is difficult to obtain sequencing reads near centromere regions.) Because different genomic regions have different biological properties, which result in different distance distributions, different geometric components are necessary to model the data. Since a single geometric distribution does not fit the distance data well, a mixture of right-censored geometric distributions was proposed. Understanding the connection between a probability distribution component and its underlying biological properties will require further statistical and biological investigation, which is an interesting topic for our future research.
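The mixture estimation described above can be sketched with a standard EM iteration. The following is an illustrative implementation under our own parameterization, not the authors' code; the GC-content covariates of the full GLM extension are omitted for brevity, and the closed-form M-step uses the weighted censored-geometric MLE p_k = n_unc,k / (n_unc,k + T_k), where n_unc,k is the responsibility mass on uncensored points and T_k the responsibility-weighted total distance:

```python
import numpy as np

def em_censored_geom_mixture(d, c, K=2, iters=200, seed=0):
    """EM for a K-component mixture of geometric distributions
    right-censored at c. Illustrative sketch (no GLM covariates)."""
    rng = np.random.default_rng(seed)
    d = np.minimum(np.asarray(d), c)           # apply artificial censoring at c
    cens = d >= c                              # censoring indicators
    p = rng.uniform(0.05, 0.5, K)              # component success probabilities
    w = np.full(K, 1.0 / K)                    # mixing proportions
    for _ in range(iters):
        # E-step: responsibilities from component log-densities; a censored
        # point contributes log P(D >= c) = c * log(1 - p_k) only.
        logf = d[:, None] * np.log1p(-p) + np.where(cens[:, None], 0.0, np.log(p))
        logr = np.log(w) + logf
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs with censoring (closed form).
        w = r.mean(axis=0)
        n_unc = (r * (~cens)[:, None]).sum(axis=0)
        p = n_unc / (n_unc + (r * d[:, None]).sum(axis=0))
    return w, p
```

With a single component and negligible censoring, the estimate reduces to the usual geometric MLE 1 / (1 + mean distance).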
In our application, interesting change-points were observed in the tumor cell line data. The impact of GC-content was considered in our analysis. In this study, we chose a set of relatively simple DNA-seq data to illustrate our approach. Our approach can also be applied to paired-end DNA-seq data. In practice, either paired-end or single-end reads can be considered in sequencing experiments. If only the starting position of each paired-end read is considered, then the analysis scenario is equivalent to the analysis of single-end reads. However, this strategy may not fully utilize the advantage of paired-end reads. (For example, a longer region can be covered by a paired-end read.) To apply the distance-based approach while utilizing the advantage of paired-end reads, we will need to investigate which modifications or extensions of the proposed method are necessary. This is an interesting topic for our future research.
Conclusion
With next-generation sequencing technology, a large number of short DNA sequence reads can be obtained at a genome-wide level. DNA-seq data have been increasingly used in different areas of biomedical research. Although count-type approaches have been widely used for DNA-seq data analysis, they require a moving-window (or similar) preprocessing step with a pre-specified window size, and useful information can be lost in this preprocessing. In this study, we have developed a distance-type measure based method, in which distances are measured in base pairs (bps) between two adjacent alignments of short reads mapped to a reference genome. This is a novel approach. Its advantages have been demonstrated by our simulation studies, and its practical usefulness has been illustrated by an application to experimental data.
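The distance-type measure itself is simple to compute: sort the alignment start positions and take successive differences. A minimal sketch (with hypothetical input positions) is:

```python
import numpy as np

def adjacent_distances(start_positions):
    """Distances in bps between adjacent alignment start positions of
    short reads mapped to a reference genome (illustrative sketch)."""
    pos = np.sort(np.asarray(start_positions))
    return np.diff(pos)
```

No window size needs to be specified, which is the practical advantage of the distance-type measure over count-type preprocessing.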
Abbreviations
CNA: copy number alteration

CNV: copy number variation

DNA-seq: short DNA sequence

EM algorithm: Expectation-Maximization algorithm

GLM: generalized linear model

INT: inverse normal transformation

LRT: likelihood ratio test

QQ plot: quantile-quantile plot

RMSE: root mean square error
References
 1.
Chiang DY, et al. High-resolution mapping of copy-number alterations with massively parallel sequencing. Nat Methods. 2009;6:99–103.
 2.
Miller CA, et al. ReadDepth: a parallel R package for detecting copy number alterations from short sequencing reads. PLoS One. 2011;6(1):e16327.
 3.
Xie R, et al. Detecting structural variations in the human genome using next-generation sequencing. Brief Funct Genomics. 2011;9:405–15.
 4.
Kim TM, et al. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics. 2010;11:432.
 5.
Krishnan NM, et al. COPS: a sensitive and accurate tool for detecting somatic copy number alterations using short-read sequence data from paired samples. PLoS One. 2012;7(10):e47812.
 6.
Korbel JO, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318(5849):420–6.
 7.
Sudmant PH, et al. Diversity of human copy number variation and multicopy genes. Science. 2010;330(6004):641–6.
 8.
Wang TL, et al. Digital karyotyping. Proc Natl Acad Sci U S A. 2002;99:16156–61.
 9.
Leary RJ, et al. Digital karyotyping. Nat Protoc. 2007;2:1973–86.
 10.
Morozova O, et al. Applications of next-generation sequencing technologies in functional genomics. Genomics. 2008;95:255–64.
 11.
Shen JJ, Zhang N. Change-point model on non-homogeneous Poisson process with application in copy number profiling by next-generation DNA sequencing. Ann Appl Stat. 2012;6:476–96.
 12.
Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 2012;40(10):e72.
 13.
Aitkin M, Rubin DB. Estimation and hypothesis testing in finite mixture models. J R Stat Soc Ser B. 1985;47(1):67–75.
 14.
McLachlan G, Peel D. Finite Mixture Models. New York: Wiley; 2002.
 15.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977;39(1):1–38.
 16.
Gong G, Samaniego FJ. Pseudo maximum likelihood estimation: theory and applications. Ann Stat. 1981;9(4):861–9.
 17.
Ashton GC, Borecki IB. Further evidence for a gene influencing spatial ability. Behav Genet. 1987;17(3):243–56. https://doi.org/10.1007/BF01065504.
 18.
Silverman EK, et al. Biochemical intermediates in alpha 1-antitrypsin deficiency: residual family resemblance for total alpha 1-antitrypsin, oxidized alpha 1-antitrypsin, and immunoglobulin E after adjustment for the effect of the pi locus. Genet Epidemiol. 1990;7(2):137–49. https://doi.org/10.1002/gepi.1370070204.
 19.
Tzou GG, et al. Classification of beef calves as protein-deficient or thermally stressed by discriminant analysis of blood constituents. J Anim Sci. 1991;69:864–73.
 20.
Martin LJ, Crawford MH. Genetic and environmental components of thyroxine variation in Mennonites from Kansas and Nebraska. Hum Biol. 1998;70(4):745–60.
 21.
Basrak B, et al. Copulas in QTL mapping. Behav Genet. 2004;34(2):161–71. https://doi.org/10.1023/B:BEGE.0000013730.63991.ba.
 22.
Przybyla-Zawislak BD, et al. Identification of rat hippocampal mRNAs altered by the mitochondrial toxicant, 3-NPA. Ann N Y Acad Sci. 2005;1053:162–73. https://doi.org/10.1196/annals.1344.014.
 23.
Li M, et al. Quantitative trait linkage analysis using Gaussian copulas. Genetics. 2006;173(4):2317–27. https://doi.org/10.1534/genetics.105.054650.
 24.
Hicks BM, et al. Genes mediate the association between P3 amplitude and externalizing disorders. Psychophysiology. 2007;44(1):98–105. https://doi.org/10.1111/j.1469-8986.2006.00471.x.
 25.
Kraja AT, et al. Rheumatoid arthritis, item response theory, Blom transformation, and mixed models. BMC Proc. 2007;1(Suppl. 1):S116.
 26.
Knoll J, Ejeta G. Marker-assisted selection for early-season cold tolerance in sorghum: QTL validation across populations and environments. Theor Appl Genet. 2008;116(4):541–53. https://doi.org/10.1007/s00122-007-0689-8.
 27.
Lai Y. Change-point analysis of paired allele-specific copy number variation data. J Comput Biol. 2012;19(6):679–93.
 28.
Lai Y. On the adaptive partition approach to the detection of multiple change-points. PLoS One. 2011;6(5):e19754.
 29.
Furey TS, Haussler D. Integration of the cytogenetic map with the draft human genome sequence. Hum Mol Genet. 2003;12(9):1037–44.
 30.
Brenner AJ, Aldaz CM. Chromosome 9p allelic loss and p16/CDKN2 in breast cancer and evidence of p16 inactivation in immortal breast epithelial cells. Cancer Res. 1995;55(13):2892–5.
 31.
Nessling M, et al. Candidate genes in breast cancer revealed by microarray-based comparative genomic hybridization of archived tissue. Cancer Res. 2005;65(2):439–47.
Acknowledgements
Not applicable.
Funding
This work was partially supported by the NIH grant GM092963 (Y.Lai). The publication costs were funded by the Department of Statistics at The George Washington University. The U. S. Food and Drug Administration (FDA) and the sponsors (NIH and The George Washington University) had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
Not applicable.
About this supplement
This article has been published as part of BMC Genomics Volume 20 Supplement 2, 2019: Selected articles from the 17th Asia Pacific Bioinformatics Conference (APBC 2019): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-20-supplement-2.
Author information
Contributions
Bipasa Biswas and Yinglei Lai conceived of the study, developed the methods, performed the statistical analysis, and drafted the manuscript. Both authors have read and approved the final manuscript.
Corresponding author
Correspondence to Yinglei Lai.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Both authors consent to the publication of this manuscript.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Mathematical details. (PDF 271 kb)
Additional file 2:
Supplemental figures. (PDF 1697 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Biswas, B., Lai, Y. A distance-type measure approach to the analysis of copy number variation in DNA sequencing data. BMC Genomics 20, 195 (2019). https://doi.org/10.1186/s12864-019-5491-x
Keywords
 Genomewide sequencing
 DNA
 Copy number variation
 Distance-type measure
 Geometric distribution
 Mixture model