Conditional probability models for the central base
The base at position i (chromosome, coordinate) in the reference genome is called xi and the symmetric sequence context around it is called
$$ \begin{aligned} & s_{i}(k) = x_{i-k},x_{i-k+1},\ldots,x_{i-1},x_{i+1},x_{i+2},\ldots,x_{i+k}. \end{aligned} $$
(1)
When it is clear from the context which k is meant, we write si to ease notation. To estimate the conditional probability of base b at position i, we use the counts n(b|si) of its occurrences in the same context throughout the reference genome (on both strands):
$$ \begin{aligned} & P(b|s_{i}) = \frac{n(b|s_{i})-\delta_{b,x_{i}}}{N(s_{i})-1}, \end{aligned} $$
(2)
where
$$N(s_{i}) = \sum_{b} n(b|s_{i}). $$
We use the Kronecker delta \(\delta _{b,x_{i}}\), which is 1 if xi=b and 0 otherwise, to ensure that the position itself is left out when estimating the probabilities at position i. This corresponds to leave-one-out cross-validation and is discussed further below.
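As an illustration, the leave-one-out estimate of Eq. (2) can be computed as in the following Python sketch, assuming the context counts n(b|s) have already been collected genome-wide (data layout and names are illustrative, not those of our C implementation):

```python
import numpy as np

BASES = "ACGT"

def loo_probability(counts, s, x_i):
    """Leave-one-out estimate P(b|s) of Eq. (2) for all four bases b at a
    position with context s and reference base x_i."""
    n = np.array(counts[s], dtype=float)   # n(b|s) for b = A,C,G,T
    N = n.sum()                            # N(s) = sum_b n(b|s)
    n[BASES.index(x_i)] -= 1.0             # subtract the Kronecker delta term
    return n / (N - 1.0)                   # the four probabilities sum to 1

# Example with made-up counts for one (hypothetical) 8-letter context:
counts = {"ACGTTGCA": [120, 30, 25, 80]}
print(loo_probability(counts, "ACGTTGCA", "A"))
```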
For large contexts, the counts become small and thus the probabilities cannot be reliably estimated. To interpolate between different orders of the model, we use regularization by pseudo-counts obtained from the k−1 model. Specifically, for order k, we define pseudo-counts
$$r(b|s_{i}(k)) = \gamma P(b|s_{i}(k-1)), $$
where γ is the strength of pseudo-counts. Now the model of order k is estimated as before, but using the actual counts plus pseudo-counts,
$$P(b|s_{i}(k)) = \frac{n(b|s_{i}(k))-\delta_{b,x_{i}}+r(b|s_{i}(k))}{N(s_{i}(k))-1+\gamma}. $$
The advantage of pseudo-counts is that they have little influence when there is plenty of data (actual counts are high), but a strong effect when counts are low. With k=4, counts are on average \(6\cdot 10^{9}/4^{9} \simeq 23000\), so we assume that pseudo-counts are not needed at this order. Therefore, our interpolated model starts with unregularized estimates for k=4 and then applies the pseudo-counts iteratively for k=5 to k=7. We used a strength of γ=100 for the pseudo-counts (a few experiments showed that the model is relatively robust to changes in γ, see below).
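A minimal Python sketch of this iterative interpolation, assuming per-order count tables as in the sketch above (names and data layout are again illustrative):

```python
import numpy as np

BASES = "ACGT"

def interpolated_probability(counts_by_k, context_by_k, x_i,
                             k_min=4, k_max=7, gamma=100.0):
    """Interpolated model: unregularized leave-one-out estimate at k_min,
    then pseudo-counts r(b|s(k)) = gamma * P(b|s(k-1)) for higher orders."""
    n = np.array(counts_by_k[k_min][context_by_k[k_min]], dtype=float)
    n[BASES.index(x_i)] -= 1.0            # leave-one-out; n now sums to N(s)-1
    p = n / n.sum()
    for k in range(k_min + 1, k_max + 1):
        n = np.array(counts_by_k[k][context_by_k[k]], dtype=float)
        n[BASES.index(x_i)] -= 1.0
        p = (n + gamma * p) / (n.sum() + gamma)   # actual plus pseudo-counts
    return p
```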
Markov models
In a Markov model of order k, the probability of a base is conditioned on the k previous bases. If we redefine the k-context in (1) to be the k previous bases,
$$s_{i}(k) = x_{i-k},x_{i-k+1},\ldots,x_{i-1}, $$
we can use exactly the same formulation as above. In this case, however, the context size is not 2k letters as above, but only k letters. Therefore, Markov models can be estimated up to orders of around k=14 for the human genome, and we used a model interpolated from k=8 to k=14, analogously to the central interpolated model described above.
Due to the interpolation, larger k are possible, and we performed a small experiment with k ranging from 10 to 20 and with four different values of the interpolation constant γ; the results are shown in Supplementary Fig. S2. These tests were done only on chromosome 20, with a model estimated from all chromosomes except 20. Although small gains can be obtained with larger k values and different γ, we decided to keep our initial choice of k=14 and γ=100.
Estimating a “forward” Markov model from both strands of the human genome will automatically make it strand-symmetric. For a given position in the genome, the model can therefore give two sets of base probabilities: one for the forward strand and one for the reverse strand. Our final Markov probabilities are the average between the two as described in the main text and referred to as bidirectional.
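For illustration, the bidirectional averaging can be sketched as follows, assuming a hypothetical helper markov_prob(seq, i) that returns the forward-strand probabilities P(b | k previous bases) at position i:

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMP)[::-1]

def bidirectional_probability(markov_prob, seq, i):
    """Average the forward-strand prediction and the reverse-strand
    prediction mapped back to forward-strand bases."""
    p_fwd = markov_prob(seq, i)            # dict: base -> probability
    j = len(seq) - 1 - i                   # the same site on the reverse strand
    p_rev = markov_prob(revcomp(seq), j)
    # Base b on the forward strand pairs with its complement on the reverse.
    return {b: 0.5 * (p_fwd[b] + p_rev[b.translate(COMP)]) for b in "ACGT"}
```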
Cross-validation
Our way of estimating the conditional probability of seeing one of the four bases given the surrounding context can be seen as a leave-one-out procedure. In particular, the estimate depends on the reference base at the considered position as well as the context. To obtain an estimate that is independent of the reference base at the position, a natural way to proceed is to consider the average of the four base-dependent estimates over all occurrences of the given context. This average turns out to be equal to the estimate that includes all positions. To see this, averaging (2) over all sites (suppressing the k dependence for clarity) gives the probability of a base b:
$$\bar P(b|s) = \frac{1}{N(s)} \sum_{b'} n(b'|s)\frac{n(b|s)-\delta_{b,b'}}{N(s)-1}. $$
Here the base we are summing over is called b′ to distinguish it from the base b in question. Since \(\sum _{b'} n(b'|s) \delta _{b,b'} = n(b|s)\), we get
$$\bar P(b|s) = \frac{1}{N(s)(N(s)-1)}\left(N(s)n(b|s)-n(b|s)\right) = \frac{n(b|s)}{N(s)}. $$
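The identity is easy to verify numerically, as in this small Python check with arbitrary counts:

```python
import numpy as np

n = np.array([7.0, 3.0, 2.0, 8.0])   # arbitrary counts n(b|s) for b = A,C,G,T
N = n.sum()
avg = np.zeros(4)
for bp in range(4):                  # b' = reference base of each occurrence
    loo = n.copy()
    loo[bp] -= 1.0                   # leave the occurrence itself out
    avg += n[bp] * loo / (N - 1.0)   # weight by the number of such occurrences
avg /= N
assert np.allclose(avg, n / N)       # recovers n(b|s)/N(s)
```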
We also assessed our models by cross-validation over chromosomes: one chromosome was used as test data and the remaining chromosomes as training data. We repeated this 24 times to calculate the fraction of correct predictions for each chromosome.
Substitution models
A simple model estimates mutability as the fraction of all sites with context \(\hat s\) having a specific mutation. More specifically,
$$ \begin{aligned} & P_{\text{Simple}} (a \rightarrow b| \hat s) = \frac{n(a\rightarrow b|\hat s)}{n(a|\hat s)}. \end{aligned} $$
(3)
Here \(n(a\rightarrow b|\hat s)\) is the number of observed mutations a→b in context \(\hat s\) and \(n(a|\hat s)\) is the number of times we see reference base a in context \(\hat s\) (as above). We use \(\hat s\) to indicate that the context may be different from the context s for the genome model above. We have used this model with a symmetric context of three bases to each side, which we call the simple model.
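In code, the simple model is just a ratio of two tallies; a sketch, assuming mutation and reference counts keyed by context (hypothetical names):

```python
def simple_mutability(n_mut, n_ref, s_hat, a, b):
    """P_simple(a -> b | s_hat) of Eq. (3).
    n_mut[(s_hat, a, b)]: observed a->b mutations in context s_hat.
    n_ref[(s_hat, a)]: occurrences of reference base a in context s_hat."""
    return n_mut[(s_hat, a, b)] / n_ref[(s_hat, a)]
```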
We will now derive a continuous-time Markov model with context-dependent substitution rates μab|s that takes the nucleotide distribution into account. We also assume a constant evolutionary time that is small compared to the inverse rates, so we can approximate the substitution probability by the first-order term in the Taylor expansion of the matrix exponential
$$P(a\rightarrow b|s) \simeq \delta_{a,b} + \mu_{ab|s}, $$
where time is set to 1. The diagonal rates are \(- \sum _{b \neq a} \mu _{ab|s}\), so in the following we will not write the diagonal terms. For a stationary, reversible Markov model with P(a|s) as equilibrium probabilities the rates can be written as
$$P(a\rightarrow b|s) \simeq \mu_{ab|s} = \alpha_{ab|s} P(b|s) \text{~~~~~(\(a\neq b\))} $$
with a symmetric matrix \(\alpha_{ab|s}\). This is the general time-reversible six-parameter model (see, e.g., [19]). Inspired by this model, we assume that mutability is given by the same equation, but without requiring that the nucleotide distribution is the equilibrium distribution and without requiring that α is symmetric.
The above expression factorizes the rates into the nucleotide distribution and the α-term that encapsulates the mutations. Now we assume the αs depend on a smaller context \(\hat s\) than the context s for the genome model P(a|s), so the above can be written as
$$ \begin{aligned} & P(a\rightarrow b|s) \simeq \mu_{ab|s} = \alpha_{ab|\hat s} P(b|s) \text{~~~~~(\(a\neq b\))} \end{aligned} $$
(4)
In analogy with (3), P(a→b|s)=n(a→b|s)/n(a|s) with s instead of \(\hat s\), so combining with the above
$$n(a\rightarrow b|s) \simeq n(a|s) \alpha_{ab|\hat s} P(b|s) \text{~~~~~(\(a\neq b\))} $$
To estimate the αs, we sum over all contexts that contain \(\hat s\), which we write as \(s|\hat s \subseteq s\), so
$$n(a\rightarrow b|\hat s) = \sum_{s|\hat s \subseteq s} n(a\rightarrow b|s) \simeq \alpha_{ab|\hat s} \sum_{s|\hat s \subseteq s} n(a|s) P(b|s) $$
The last sum depends only on the nucleotide distribution. It can be rewritten as a sum over all positions i in the genome where the reference base, ri, equals a and the context is \(\hat s\). We call this term \(Z_{ab|\hat s}\):
$$Z_{ab|\hat s} = \frac{1}{n(a|\hat s)} \sum_{s|\hat s \subseteq s} n(a|s) P(b|s) = \frac{1}{n(a|\hat s)} \sum_{i|r_{i} = a \land \hat s \subseteq s_{i}} P(b|s_{i}). $$
For convenience, it is normalized by \(n(a|\hat s)\), so it is the average probability of base b over all positions with reference base a and context \(\hat s\). As an estimate of α we then have
$$\alpha_{ab|\hat s} = \frac{1}{Z_{ab|\hat s}} \frac{n(a\rightarrow b|\hat s)}{n(a|\hat s)} = \frac{P_{\text{Simple}} (a\rightarrow b|\hat s)}{Z_{ab|\hat s}} $$
Note that we can rewrite the original probability (4) in terms of the simple model as
$$P(a\rightarrow b|s) \simeq \frac{P(b|s)}{Z_{ab|\hat s}} P_{\text{Simple}} (a \rightarrow b| \hat s) $$
for \(\hat s \subseteq s\). The factor is 1 when \(\hat s=s\), so the models are identical as they should be when they use the same context. The equation directly shows how the wider context from the genome model can modulate the simpler estimate. If the probability of base b in context s is larger than the mean \(Z_{ab|\hat s}\), the mutability becomes larger than in the simple model, and if it is smaller, the mutability becomes smaller.
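Putting the pieces together, a sketch of the modulated estimate, assuming a hypothetical genome-model function genome_prob(i) returning the four base probabilities at site i and precomputed simple-model values:

```python
import numpy as np

def z_factor(genome_prob, sites, b_idx):
    """Z_{ab|s_hat}: average genome-model probability of base b over all
    positions `sites` with reference base a and small context s_hat."""
    return np.mean([genome_prob(i)[b_idx] for i in sites])

def mutability(p_simple, genome_prob, i, b_idx, z):
    """P(a -> b | s) ~= P(b|s) / Z_{ab|s_hat} * P_simple(a -> b | s_hat)."""
    return genome_prob(i)[b_idx] / z * p_simple
```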
The first-order approximation assumes the rates are small. When calculating the total mutability of a site, we therefore use the approximation \(1-P(a\rightarrow a|s) \simeq 1-e^{\mu _{aa|s}}\). For small α’s it makes little difference whether the exponentiated form is used or not.
Data
The human reference genome, GRCh38.p13, was downloaded from NCBI (released March 2019 by the Genome Reference Consortium). We considered only the primary assembly of chromosomes 1 to 22, X, and Y. Genomic annotation bed files were downloaded from the UCSC Table Browser. These are 3’-UTR, 5’-UTR, CDS, introns, genes, and repeats. The conservation scores file (PhastCons100way) was downloaded from UCSC as well.
Variants were downloaded from the 1000 Genomes Project (1KGP; released March 2019, phased 20190312_biallelic_SNV_and_INDEL) in VCF format. INDELs were filtered out of the 1KGP dataset.
ClinVar (clinvar_20200310.vcf) [27, 28] and somatic mutations (CosmicCodingMuts.vcf and CosmicNonCodingVariants.vcf) [29] data were obtained from NCBI and COSMIC, respectively.
The genomes and GFF files of Arabidopsis thaliana (TAIR10.1), Caenorhabditis elegans (WBcel235), Escherichia coli (str. K-12 substr. MG1655), Saccharomyces cerevisiae (R64) were downloaded from NCBI.
Data analysis
Model implementation Counting of k-mers and estimation of probabilities are implemented in the C programming language. The program counts the contexts for each site using a Burrows-Wheeler transform (BWT) [30] rather than storing the k-mers, because this is much more efficient for the interpolated models. The program is called predictDNA and relies on an index built with the program makeabwt.
One program, called makeabwt, is used to construct an index from a fasta file containing the genome sequences. If there are multiple sequences, they are concatenated with termination symbols in between and the suffixes are sorted. The BWT is constructed from the sorted suffixes and saved. An FM-index [31] is constructed to ease the search of the BWT. To limit memory usage, the values are stored as long integers (8 bytes) in first-level checkpoints for every \(2^{16}\) positions, and for every 256 positions the difference from the nearest first-level checkpoint is stored as a short integer (2 bytes). We used an index containing both the forward and reverse complement strands of the genome.
Another program, called predictDNA, uses the index to look up k-mers. This is done using the standard backward search of the BWT/FM-index [31]. The size of the resulting suffix interval equals the number of occurrences of the k-mer in the genome, and these counts are used for calculating the conditional probabilities.
The advantage of using a BWT is that the same index can be used for any k, which facilitates the interpolated models. A naive approach using table lookup would require a new table for each value of k, and a table of \(4^{15}\simeq 10^{9}\) integers for k=14, which corresponds to 4GB of memory; this would become 16GB for k=15, etc. The index used for this work takes around 8GB of memory.
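As an illustration of the lookup, the standard backward search can be sketched as follows, with C[c] and occ(c, i) denoting the usual FM-index structures (function names are illustrative, not those of predictDNA):

```python
def count_kmer(C, occ, kmer, n_bwt):
    """Count occurrences of `kmer` by backward search on a BWT/FM-index.
    C[c]: number of suffixes starting with a symbol smaller than c.
    occ(c, i): number of occurrences of c in BWT[0:i].
    The size of the final suffix interval is the occurrence count."""
    lo, hi = 0, n_bwt                 # start from the full suffix interval
    for c in reversed(kmer):          # extend the match one symbol at a time
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0                  # k-mer does not occur in the text
    return hi - lo
```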
Model Performance We calculated the probabilities of the four bases for every position in the human genome using our predictDNA software. We tested different values of k, but used the same interpolation constant, γ=100, for all models. We counted the sites at which the reference base had the highest of the four probabilities to calculate the fraction correct for each chromosome.
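With the predicted probabilities and reference bases in arrays, the fraction correct reduces to a one-liner (a sketch, not our actual pipeline code):

```python
import numpy as np

def fraction_correct(probs, ref_idx):
    """probs: (n_sites, 4) array of predicted base probabilities;
    ref_idx: index (0-3) of the reference base at each site. A site counts
    as correct when the reference base has the highest probability."""
    return float(np.mean(np.argmax(probs, axis=1) == ref_idx))
```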
Furthermore, we overlapped the bed files with the models’ outputs via bedtools [32, 33] to obtain the feature-specific fraction correct and predicted probabilities. These were used to assess the performance of our models in different regions of the human genome.
Based on the CDS bed file and the human genome fasta file, we calculated average probabilities for the positions around the human 3’ and 5’ splice sites. We included 500 nucleotides before and 100 after the 3’ splice site and, similarly, 500 before and 100 after the 5’ splice site. In addition, we extracted the PhastCons100Way conservation scores for the same regions [34]. These results are shown in Fig. 5.
SNP Variants Analysis We kept only single-nucleotide bi-allelic variants in the 1KGP, ClinVar and COSMIC databases for the following analyses, filtering out INDELs. Based on the central model and BM14 results, reference and alternative allele probabilities were extracted for each SNP site in these three databases. The triangle plots (Fig. 6) were made by plotting the reference probabilities against the alternative probabilities of all SNPs in the 1KGP database.
In order to understand the possible asymmetry shown by the cluster of many sites in the corners of the triangle plot, we separated SNPs with allele frequencies greater than 0, 0.01, 0.1 and 0.2. To show the different types of SNPs in coding and non-coding regions, we also made density plots of Pref minus Palt for SNPs in the 1KGP, ClinVar and COSMIC databases. Additionally, we used the ANNOVAR software [35] to annotate benign and damaging SNPs in 1KGP, as predicted by PolyPhen2 [17]; these are sites associated with single genetic diseases.
We developed the substitution model to estimate the mutability of SNVs as described above. We estimated the α matrices for k=0, 1, 2, 3 from all 1KGP SNPs outside of chromosome 1. The model was applied to chromosome 1, where we calculated the probability of a mutation from BM14 and the α matrices. These were compared to the observed SNVs in 1KGP, ClinVar, and COSMIC on chromosome 1.
Test Bi-directional Markov Model on Other Species The bi-directional Markov model was tested on the chosen species as well as on the human genome. We used k=10, γ=100, and interpolated from k=6, rather than the same parameters as BM14, because of the smaller genome sizes of these species. The densities of the reference base probabilities were plotted (Supplementary Fig. S4A). We separated the CDS and non-coding regions of A. thaliana, C. elegans and S. cerevisiae according to the GFF files and made a density plot showing the distributions for CDS and non-coding regions of these three species.
Software
Our software is open source and available at GitHub: https://github.com/AndersKrogh/abwt/releases/tag/v1.2.1a. We wrote several scripts in Perl and Python for data analysis, and these are all available in the GitHub release. The usage of these scripts is described in README files. All figures were made in R, and this code is also available.