Low-cost and clinically applicable copy number profiling using repeat DNA

Background Somatic copy number alterations (SCNAs) are an important class of genomic alteration in cancer. They are frequently observed in cancer samples, with studies showing that, on average, SCNAs affect 34% of a cancer cell’s genome. Furthermore, SCNAs have been shown to be major drivers of tumour development and have been associated with response to therapy and prognosis. Large-scale cancer genome studies suggest that tumours are driven by somatic copy number alterations (SCNAs) or single-nucleotide variants (SNVs). Despite the frequency of SCNAs and their clinical relevance, the use of genomics assays in the clinic is biased towards targeted gene panels, which identify SNVs but provide limited scope to detect SCNAs throughout the genome. There is a need for a comparably low-cost and simple method for high-resolution SCNA profiling. Results We present conliga, a fully probabilistic method that infers SCNA profiles from a low-cost, simple, and clinically-relevant assay (FAST-SeqS). When applied to 11 high-purity oesophageal adenocarcinoma samples, we obtain good agreement (Spearman’s rank correlation coefficient, rs=0.94) between conliga’s inferred SCNA profiles using FAST-SeqS data (approximately £14 per sample) and those inferred by ASCAT using high-coverage WGS (gold-standard). We find that conliga outperforms CNVkit (rs=0.89), also applied to FAST-SeqS data, and is comparable to QDNAseq (rs=0.96) applied to low-coverage WGS, which is approximately four-fold more expensive, more laborious and less clinically-relevant. By performing an in silico dilution series experiment, we find that conliga is particularly suited to detecting SCNAs in low tumour purity samples. At two million reads per sample, conliga is able to detect SCNAs in all nine samples at 3% tumour purity and as low as 0.5% purity in one sample. Crucially, we show that conliga’s hidden state information can be used to decide when a sample is abnormal or normal, whereas CNVkit and QDNAseq cannot provide this critical information. Conclusions We show that conliga provides high-resolution SCNA profiles using a convenient, low-cost assay. We believe conliga makes FAST-SeqS a more clinically valuable assay as well as a useful research tool, enabling inexpensive and fast copy number profiling of pre-malignant and cancer samples. Supplementary Information The online version contains supplementary material available at (10.1186/s12864-022-08681-8).


Next-generation sequencing
We calculated the per sample cost based on using an Illumina HiSeq 4000 sequencer which, for our purposes, produces 350 million single end (SE) 150 base pair (bp) reads from a single lane of sequencing. We factored in that our library would include 20% PhiX to increase diversity for sequencing, and as such, we would expect 280 million reads per lane to originate from our FAST-SeqS amplicons. Aiming for approximately 2 million reads per sample, this would mean multiplexing 140 FAST-SeqS on a single lane.
There is considerable variation in the cost of sequencing services, depending on sector, location, and relationship with the customer. Moreover, of those services that display their costs up front, extremely few provide a direct comparison of prices for Single-End 50bp reads and Single-End 150bp reads on an Illumina HiSeq 4000 machine. Thus, we will make use here of the Stanford Medicine Genome Servicing Sequence Centre prices, roughly converted to Sterling, obtained from http://med.stanford.edu/gssc/rates.html on 17/08/18. These costs do not differ substantially from our own experience. Thus, the cost of a single lane 150 bp SE sequencing on the HiSeq 4000 we take to be approximately £1400 at the time of writing. As such, the sequencing cost equates to £10 per sample.

DNA Shearing
Prior to library preparation, the input DNA needs to be sheared to a desired length distribution. This is often achieved by the use of sonification, for example using Covaris microTUBE strips. At the time of writing, 12 x 8 microTUBE strips (i.e. for 96 samples) can be purchased for £413.10 and is therefore approximately £4.30 per sample.

Library preparation
To process the DNA and prepare the library for sequencing, library preparation is required. This generally consists of end-repair, adapter ligation, and is sometimes followed by PCR amplification to generate sufficient quantities of the library for sequencing. Library preparation kits can be purchased from a variety of manufacturers, with varying costs and time to prepare each sample. Examples of library preparations include:

DNA Quantification and Quality Control
Similarly to FAST-SeqS, quantification of DNA is performed using Bioanalyzer 1000 DNA kit on an Agilent 2100 Bioanalyzer instrument, for example. 300 Bioanalyzer 1000 DNA chips can be purchased at approximately £500, which equates to £1.67 per sample.

Next-generation sequencing
We calculated the per sample cost based on using an Illumina HiSeq 4000 sequencer, which can be used for lowcoverage WGS to produce 350 million single end (SE) 50 base pair (bp) reads from a single lane of sequencing.
PhiX should not be required as the library should not be low complexity. To achieve approximately 9 million reads per sample in order to obtain approximately 0.1X coverage as per Scheinin et al. [1], would mean multiplexing 38 samples on a single lane. The cost of a single lane 50 bp SE sequencing on the HiSeq 4000 we take to be £1000 (justification as in the previous section). As such, the sequencing cost equates to ∼£26 per sample.

Supplementary Note 2: Aspects of FAST-SeqS data
We explored the loci counts of normal samples (which were assumed to be predominantly diploid) and observed various aspects of the data which led to the model. We observed (1) a technical bias in the number of reads This could be extended to include all J non-control samples, so that β, π 0 , π u ,ĉ u , z r,l ,s,θ r,l , y r,l andñ are sample specific and include an index j. However, here we drop the j index and consider only one non-control sample for simplicity.
m r,l | ϕ c,r,l , ϕ d,r,l ∼ Beta(ϕ c,r,l , ϕ d,r,l ) θ r,l,k | s k , m r,l ∼ Beta(s k m r,l , s k (1 − m r,l )) x r,l,k | θ r,l,k , n k ∼ Binomial(n k , θ r,l,k ) θ r,l | {ĉ u } ∞ u=1 , z r,l ,m r,l ,s ∼ Beta(sĉ z r,lm r,l ,s(1 −ĉ z r,lm r,l )) y r,l |θ r,l ,ñ, ∼ Binomial(ñ,θ r,l ) See Supplementary Table 1 and Methods for an explanation of the variables. Note that this model is not implemented but is shown for completeness (see below for how we split this generative model into two separate generative models in practice). Notice that we assume that the sample inverse dispersion parameters are drawn from the same distribution, reflecting our belief that normal and tumor count observations should have the same level of noise. Here we assume fixed values for ω. If we were to use this model as a basis for inference of the latent variables, we may wish to place priors over the values of ω and infer them from the data.

Split generative model (for more efficient inference)
We decided to split the full model in two parts. The first part describes the generation of the data for the control samples. The second part describes the generation of the copy number profile and loci count observations for a non-control sample, using the maximum a prior (MAP) estimate of m. We did this to allow for simpler implementation of the inference algorithms. In this way, we could infer the MAP estimates of m and the nuisance parameters s using the counts of the control samples. Once MAP estimates were obtained, we could infer the copy number profile for each non-control sample in independent MCMC chains, allowing the inference of each sample to be run in parallel.

Part 1: Generative model for control counts
x r,l,k | θ r,l,k , n k ∼ Binomial(n k , θ r,l,k ) (2) Part 2: Generative model for the counts of a sample with a relative copy number profile , z r,l ,m r,l ,s ∼ Beta(sĉ z r,lm r,l ,s(1 −ĉ z r,lm r,l )) y r,l |θ r,l ,ñ, ∼ Binomial(ñ,θ r,l ) Note that here we allow the distribution of sample specific parameters to vary between the normals (in part 1) and the sample with relative copy number profile z (in part 2). This distinction is made so that we can use the posterior distribution of s k to update our prior distribution overs.

Supplementary Note 4: Notation used in MCMC algorithms
We denote the probability mass function (pmf) of the compound Beta-Binomial distribution as follows: where B represents the Beta function, n represents the total number of Bernoulli trials (counts), k represents a count observation (k ∈ { 0, . . . , n }), a and b are parameters (a > 0, b > 0). This is used as our likelihood function in algorithm 1 where k = x r,l,k , n = n k , a = s k m r,l , and b = s k (1 − m r,l ). It is also used as a likelihood function in algorithm 2 where k = y r,l , n =ñ, a =sm r,l , and b =s(1 −m r,l ).
We denote the probability density function (pdf) of the Beta distribution as follows: where x represents an observation (x ∈ [0, 1]), and c and d are the shape parameters of the Beta distribution We denote the pdf of the Gamma distribution as: where Γ represents the Gamma function, k is the shape parameter (k > 0), θ is the scale parameter (θ > 0) and observation x (where x ≥ 0).
The normal distribution is denoted N (µ, σ) and the uniform distribution is denoted U (a, b). We use the follow notation: M·,·+γ η ∼ Beta γ + 1,M ·,· Then sample γ as follows: If not provided as a fixed value, sample ρ: