Revealing the core single-nucleotide substitution rates
The repetitive occurrence of mobile DNA elements in different regions within the same genome [34] provides the opportunity to obtain the \(r_{i,j}^{core}\) (2) rate constants that account for the \(\delta r_{i,j}^{sr}\) immediate effects of neighbouring nucleotides. After the initial inactivation at different time epochs [36–39], individual remnants of many transposon subfamilies within a genome have been subjected to largely the same overall mutagenic and repair conditions as the rest of the genome [40], hence can also serve as markers of \(r_{i,j}^{core}\) neutral substitution rates applicable to genomic sites that share the immediate sequence-context. For the purpose of this study, we have used the hominoid lineage of the L1 (long interspersed nuclear element 1, LINE-1) retrotransposons, spanning 3.1 to 20.4 myr (million years) of age [37]. The constituent subfamilies of the lineage are L1PA5, L1PA4, L1PA3, L1PA2 and the most recent L1Hs. Their respective age and the number of insertions in the human genome are presented in Table S1 in Additional file 1. The choice was made through the following reasoning. The L1 elements have a long (∼6 k nt) sequence without extended repeats like in the LTR (long terminal repeat) elements [34]. This enables their robust mapping on a chosen template and provides essential local sequence variability around different nucleotide positions within L1 elements. There are distinct L1 subfamilies that were active at different time epochs, with detailed molecular clock analyses available [36–39] to reveal and, importantly, validate the age of each subfamily. They are well-represented and, unlike other classes of transposable elements, are uniformly scattered across mostly the intergenic regions of the human genome [34, 41, 42], and are less prone to recombination and context bias [43, 44]. The young L1 subfamilies have most of their remnants coming from the genomic regions with G+C content close to the genomic average value (Fig. 1, see also [41]). Unlike SINEs (small interspersed nuclear elements) and LTRs, LINE sites show a very low level of RNA polymerase enrichment, as a marker of transcriptional association, in normal tissues [45]. The selected most-recent subfamilies are sufficiently young [37] a) to enable an unambiguous identification of the genomic coordinates of the borders for the remnants; b) to assume that each position in those elements would be unlikely to undergo repeated substitutions over the studied period of their existence as remnants in the human genome (see Methods); c) to attribute a time-invariance to the rates during the analysed period of mutation accumulation [20, 46, 47]. Finally, many matching positions in our studied five L1 representatives share the same consensus bases, hence, such positions are not polymorphic due to adaptive pressure and can serve as internal references for inferring the \(r_{i,j}^{core}\) rates.
The Trek methodology of obtaining \(r_{i,j}^{core}\) rates, along with the considerations for filtering out the possible selection and non-neutral substitution sites, is presented in Fig. 2 with further details in Methods, Figures S1 and S2 in Additional file 1. The acquired data on the full set of position-specific substitution rates are presented in Fig. 3, to highlight the revealed variation per substitution type, along with the fully averaged values for the rate constants.
A total of 661 positions, at the 3’ side of the L1 elements, passed our robustness checks (see Methods) and were thus employed to infer the corresponding \(r_{i,j}^{core}\) values from the analysis of all the young L1 remnants in the human genome. We recorded the data in the Trek database that contains a set of well-defined \(r_{i,j}^{core}\) constants (see below for the extent of sequence context coverage in the Trek database) capturing the influence of the unique arrangement of neighbouring nucleotides at those positions. Owing to the nature of the selected L1 elements, as discussed above, and the Trek procedure design (Fig. 2), we expect the absence of the \(\delta r_{i,j}^{gene}\) contribution, the elimination of \(\delta r_{i,j}^{lr}\) at the averaging stage (Fig. 2
b) and the removal of the \(\delta r_{i,j}^{CpG}\) and \(\delta r_{i,j}^{spec}\) effects through our robustness checks embedded within the Trek procedure (see Fig. 2
c, d and Methods). Therefore, our method provides the \(r_{i,j}^{core}\)=\(r_{i,j}^{sb}+\delta r_{i,j}^{sr}\) core variation (Fig. 3) of the substitution rates at around the \(r_{i,j}^{sb}\) genomic average values for each i→j base substitution. If the above is correct and Trek indeed results in \(r_{i,j}^{core}\) values, further averaging of the core \(r_{i,j}^{sb} + \delta r_{i,j}^{sr}\) rates (median values shown in Fig. 3) should give us the single-base \(r_{i,j}^{sb}\) genomic average substitution rates, cancelling out the remaining \(\delta r_{i,j}^{sr}\) contribution. In fact, the comparisons of our Trek-derived \(r_{i,j}^{sb}\) with two published datasets that reflect on the genomic average \(r_{i,j}^{sb}\) rates [4, 20] show an excellent correlation (Figure S3 in Additional file 1, Pearson’s R > 0.99) confirming the absence of any biased averaging and unusual substitution rates in the time-accumulated substitutions at the L1 sites that pass the Trek procedure. The genome simulation, described later in this work, provides an additional validation for our rate constants. The \(r_{i,j}^{core}\) values (Eqs. 1 and 2) for all possible i→j neutral substitutions inferred for each of the eligible individual L1 positions are thus assumed to be common for any other sites in the human genome that share the short-range sequence context.
The influence range of neighbour nucleotides
To apply the \(r_{i,j}^{core}\) constants to the human genome, we first established the optimal length of a DNA sequence (k-mer, where k is the length of the sequence) capturing most of the influences that modulate substitution rates of the base at the centre. For this, we evaluated the power of the knowledge of the neighbouring arrangement of nucleotides in predicting the \(r_{i,j}^{core}\) constants for each of the twelve i→j substitution types, where i and j are the four DNA bases. We built test predictors for individual substitution types via a tree-based gradient boosting machine (GBM, machine learning technique) [48, 49], while using varying lengths of sequences centred at the positions where the rate constants were to be predicted (see Methods). The aim of the machine learning procedure was to establish the optimal sequence length to minimise the error in the predicted rate constants (Figures S4 and S5 in Additional file 1). In agreement with prior evidence [9, 10, 44, 50], but now obtained for each individual i→j substitution type from Trek data, the optimal window was found to be 5-7-nt (both 5- and 7-nt resulting in comparable results for many substitution types) and was subsequently used as guidance for the direct mapping of the Trek rate constants from the L1 sequence onto any given human nuclear DNA sequence for the \(r_{i,j}^{core}\) assignment.
Mapping the Trek \(r_{i,j}^{core}\) data on any DNA sequence
The upper 7-nt size window for determining the single-nucleotide substitution rate constants at the central base accounts for three upstream and three downstream bases relative to each nucleotide position. Our neutral substitution positions that pass the Trek criteria capture 636 unique 7-mers out of the possible 16,384 (47). Therefore, for many loci in the human genome we need to use a smaller window (< 7-mer) as a match criterion to assign to one of the Trek rate constant sets. By trimming the size of the k-mer to five, hence accounting for two upstream and two downstream bases, we cover 404 unique sequences out of possible 1024 (45). Further reduction of the size to three, allows having data for 56 unique triads out of 64 (leaving out only the CpG containing triads, see below). For the single-base case (1-mers), where we average out all short-range neighbour effects and longer-range sequence variability, we obtain data for all the four bases and 4×3 possible substitutions as shown in Fig. 3 (the median values on the top of the figure). The coverage of the longer k-mers is, however, increased nearly twice when we account for the strand-symmetry, as described in Methods. Please note, that for each unique k-mer we obtained three \(r_{i,j}^{core}\) constants via the described analysis of a large pool of L1 remnants from different genomic loci.
With the above considerations, we created a program (Trek mapper, Methods, Note S1 in Additional file 1) to produce \(r_{i,j}^{core}\) core substitution rate constants for any sequence, accounting for the context information within up to the 7-mer window and pulling the matching core data from the Trek database. Should a representative match be absent with the full 7-nt long sequence, the window around the given position in a query sequence is shortened into the longest variant possible (out of the 5-nt, 3-nt or 1-nt lengths) with a full match in the Trek database (Methods, Figure S6 in Additional file 1). In this way, for all the possible 16,384 7-mers, our Trek database reports 49,152 rate constants (3 ×16,384), of which 3168 (6.4%) account for the 7-mer context, 23,232 (47.3%) account for the nested 5-mer context, 17,120 (34.8%) for 3-mer and only 5632 (11.5%, CpG containing sequences) constants do not account for any context effect on the central base (since we eliminate those by design, due to the \(\delta r_{i,j}^{CpG}\) contribution). Our full dataset reports and makes publicly available (Additional file 2), the time-dependent \(r_{i,j}^{core}\) rates for all individual i→j substitutions accounting for the context effects beyond the 64 triads [35]. If we consider only the unique values in the Trek database, we report 2078 unique rate constants (taking into account different extent of averaging, where multiple entries are present for the different context ranges), of which 1208 (58.1%), 782 (37.6%), 85 (4.1%) and 3 (0.1%) entries account for 7-, 5-, 3- and 1-mer contexts respectively. The 1-mer averaged data were used for only the k-mers that contain either C or G bases of a CpG dyad at the centre, to assign the overall substitution rate constants by the Trek mapper. This was done since none of the CpG sites in the L1 remnants passed our robustness checks (Methods), due to the targeted, epigenetic- \((\delta r_{i,j}^{CpG})\) or APOBEC-affected single-nucleotide substitutions there [12–14]. The latter effects were likely to be non-uniform with time (active targeting, \(\delta r_{i,j}^{spec}\), while in the viable epoch for each L1 subfamily, at both DNA and RNA levels) and may had been present in order to silence the active retrotransposons.
Our current data are for the human nuclear genome. However, the general approach for obtaining \(r_{i,j}^{core}\) constants is applicable to any organism where the genome contains a set of well-characterised and related young mobile elements silenced at different time epochs and without notable genomic context bias.
\(r_{i,j}^{core}\) rates and the oligomeric composition of the human genome
The full set of sequence-dependent human \(r_{i,j}^{core}\) substitution rates (all three constants per position) enabled us to perform a sophisticated in silico evolution of a random DNA sequence, guided solely by our \(r_{i,j}^{core}\) values. We started from a random sequence of 5 million (mln) nt with a G+C content of 60% (substantially greater than the 40.45% G+C content for the human genome). We performed random nucleotide substitutions weighted by Trek-inferred probabilities (Methods, Figure S7 in Additional file 1), where, after each cycle, the substitution rate constants were updated for the sequence positions that were either mutated or fell within the influence zone of the performed substitutions. The simulation was continued until the overall G+C content of the simulated sequence became constant (see Fig. 4
a–c).
The simulation converged to generate a sequence with the A, T, G and C compositions of 30.91, 30.90, 19.06 and 19.13% respectively. Note, that these values are close to the A, T, G and C compositions of the repeat-masked human genome of 29.75, 29.79, 20.24 and 20.22% respectively (Methods), being slightly AT rich. Furthermore, the simulated sequence captures the contents of different individual oligomers (k-mers) in the human genome. The data for all the possible 16 dyads, 64 triads, 1,024 pentads and 16,384 heptads are presented in Fig. 4
d–g and show a significant (see the correlation coefficients on the plots) correlation between the compositional landscapes of the Trek-simulated sequence and the actual human genome. Regardless of the starting composition of the initial DNA sequences, our simulations always equilibrated to a state with similar oligomer (up to 7-mer) content. The k-mer contents shown in Fig. 4
d–g for the actual human genome were calculated from the repeat-masked version of the RefSeq human genome, where all the identified repeat elements, including the L1, were disregarded. This assured the removal of a potential bias due to the presence of L1 elements (Methods), used to infer the rate constants, in the human genome. As \(r_{i,j}^{core}\) constants are free of the \(\delta r_{i,j}^{CpG}\) contribution (see above), the simulated genome produced higher alterations in representing the k-mer contents that have CpGs (red points in Fig. 4
d–g). These alterations visually demonstrate the role of \(\delta r_{i,j}^{CpG}\) in the background compositional landscape of the human genome. The correlations in Fig. 4 are from simulations where the rate constants were symmetrised according to the inherent strand-symmetry in double-helical DNA (see Methods). The results without such equalisation are still significant, though producing slightly worse correlation coefficients (Figure S8 in Additional file 1).
To confirm that the observed correlations for different k-mer contents (Fig. 4
d–g) present an improvement due to our sequence-context-dependent rates, rather than being a side effect, by a pure chance, in a sequence where the simulation makes only the single-base composition converge to that of the real human genome (such as in a sequence generated using an ideal 4×4 single-nucleotide substitution rate matrix), we calculated the expected distribution of different k-mers in a genome with fully random base arrangement but with the exact human A, T, G and C overall base composition. In the complete absence of any sequence-context effects, the probability of the occurrence (fraction) of any k-mer in a sufficiently long sequence is equal to the product of the occurrence probabilities of their constituent bases. For instance, the probability of observing the AGT triad is the p
AGT
=p
A
p
G
p
T
product, where the individual p
i
probabilities are the base contents expressed in fractions. The comparison of the k-mer fractions obtained in this way with the human genome data (Fig. 5) shows a substantially reduced correlation (for the genomic 7-mer content, Pearson’s R = 0.59 compared to 0.74 using Trek rates). The discrepancies are minimal while accounting for only the dyad and therefore singleton contents, however, while using context-invariant singleton substitution rates in the simplistic in silico simulations descried above, we observe slight but systematic underestimation of the overall G+C content in the sequence at compositional equilibrium (Fig. 6). Although excluded via the specificities of the Trek methodology (Fig. 2), to completely rule out the presence of any circularity in the reproduction of the human genome higher k-mer (up to 7 nt) content through the Trek rate constants, we also demonstrated the overall poor agreement between the k-mer contents of L1 elements used to infer the rate constants and the human nuclear genome (Figure S9 in Additional file 1).
The described simulations therefore support the attribution of a contributory role that the \(r_{i,j}^{core}\) variability plays in shaping the compositional landscape of the human nuclear genome [51]. Importantly, our study demonstrates that the non-specific core substitution rates are capable of producing apparent selection or depletion patterns in higher k-mers, beyond dyads, in the human genome. To this end, the 7-mer content from our in silico equilibrated sequences, obtained solely based on the set of \(r_{i,j}^{core}\) constants, can serve as a background standard to reveal specific selection [52] for or against different sequence motifs in the human genome.
Similar link between the context-dependent mutation pressure and the nucleotide composition has recently been shown for the triad counts in the bacterial genome of Mesoplasma forum [53] and in the human mitochondrial genome [54]. Interestingly, the latter study employed the somatic mutation propensities found by analysing cancer genomes to reproduce the triad count of the human mitochondrial genome, as an evidence of the link between the somatic and germline mutation rates in human mitochondria. That link is further visible in the cancer analysis presented here (vide infra) for the human nuclear genome.
Basal substitution propensity profile of the human genome, substitution prone and resistant motifs
Trek mapper provides the full set of \(r_{i,j}^{core}\) constants for each position in the whole human genome. Such data enables us to calculate the germline context-dependent basal substitution propensity (BSP) by taking the sum of the individual rate constants for the three possible substitutions at each base position, thus producing the core \(r_{i,N}^{core}\) constant for the substitution of a given base i by any other base N. Figure S10 in Additional file 1 shows the BSP profiles calculated for the individual chromosomes (red) as compared with the whole genome profile (green), where most of the chromosomes exhibit the same overall distribution as the whole genome. Further grouping and analysis [55] of the unique sequences found in regions of different BSP for the whole human genome reveals motifs with varying substitution propensities of the bases at the centre of 7-mers (Fig. 7). In particular, adhering to the standard nucleobase notation in small letters (a=A, c=C, g=G, k={G, T}, m={A, C}, n={A, C, G, T}, r={A, G}, t=T, w={A, T}, y={C, T}), the comparative examinations of the sequences reveal the overall stability of C in the wntCnwn context, and analogously, G in the nwnGanw context. Those bases become prone to substitutions, independently from the well-studied CpG context, in the nmrCarn and nytGykn motifs for the C and G bases respectively. Furthermore, A and T bases become prone to more frequent substitutions in the ncwAtnn and analogous ngwTann motifs (Fig. 7).
Basal substitution propensity profile and cancer-linked somatic mutations
A recent study, which correlated the cancer and cell division frequencies, suggested that cancers, with their multi-etiologic nature, are linked to random mutation events upon cell division/DNA replication [56], in addition to expressing unique type-dependent mutational signatures [57]. To this end, genomic sites with higher intrinsic BSP (lower stability) may potentially exhibit a higher prevalence of cancer-related genome alterations, as compared to sites of lower intrinsic BSP (higher stability), should the germline and cancer-linked somatic mutations share common mechanisms [54]. Although the sequence context signatures of cancer mutations and their variation across different cancer types is out of the scope of the present work and is covered in detail elsewhere [57–62], here we examined the simple relationship between our calculated germline BSP values and the observed cancer-associated somatic mutations accessed via the annotated COSMIC database of somatic mutations in cancer [63] (Methods). Since the Trek data are for the core neutral substitutions, we restricted the analysis to the non-coding and non-polymorphic (not identified as SNP) point mutations (6 mln) in cancer. By mapping these sites to the human genome and retrieving the sequence-context information (7-nt long sequences centred at the mutation points), we processed the data with Trek mapper and obtained the BSP (\(r_{i,N}^{core}\)) profile for the non-coding sites detected in human cancer. The outcome in Figure S11 in Additional file 1, overlapped with the whole-genome BSP profile, shows that stable sites in the human genome, assigned by the Trek mapper to have \(r_{i,N}^{core}\) below 1.13 byr −1, are significantly less likely to undergo somatic mutations in cancer. Like many other disease-causing mutation sites [64], most of the sites that are highly enriched in cancer (Figure S12a in Additional file 1) are CpGs [65], which, even without accounting for the methylation driven increase [12–14] of the mutation rates, show high basal mutability [66]. However, Figure S12b, c in Additional file 1 demonstrates the discussed trend in the 7-mer cancer enrichment ratio (Methods) vs. BSP dependence even when all the CpG sites are removed from the analysis. Furthermore, while investigating the same relationship in different varieties of cancer (as classified based on the primary tissue and primary cancer types, see Methods) we can see that the trend is mostly in place for the 11 cancer types where we have enough data on non-coding somatic mutations (Fig. 8), with the only exception being the oesophageal carcinoma (Fig. 8
e, l). The latter deviation might stem from the greater role of carcinogen driven mechanisms of somatic mutations in a tissue (oesophageal) more exposed to external carcinogens.
Overall, the results show that the intrinsic BSP of different sites in DNA may contribute to their absence/presence in pathological genotypes. In particular, we observe that 7-mers with low germline BSPs of the central base are relatively depleted in cancer-linked somatic mutation data (Figure S12 in Additional file 1 and Fig. 8). They present a cancer enrichment ratio that is smaller than 1, whereas for the unstable 7-mers, the enrichment ratio, on average, tends to 1, meaning that their presence in cancer is overall comparable to the one in the whole genome. Focusing on the average trend (red lines in Fig. 8) that does not expand to the k-mers with > 1 cancer enrichment ratio, we cannot comment about the behaviour of the sites of higher mutability in cancer by the present work. Those could potentially be better reflected in the cancer mutational signature analyses done by others [57, 60, 61]. Our results, however, further outline the potential role of the general imbalance in repair machinery in determining the accumulation of somatic mutations in cancer, where the mutations are originally caused by errors in replication that possibly emerge via mechanisms more or less common in somatic and germline cells [54].
The relation between the Trek \(r_{i,j}^{core}\) and germline mutation rates
Since the rates of point mutations can be indirectly estimated by determining the rates at which neutral substitutions accumulate in a genome [5, 51, 67], (also see Box 1 in [68]), and, we have applied stringent criteria for filtering out L1 positions with a potential to be non-neutral, our Trek \(r_{i,j}^{core}\) substitution rate constants may serve as contributory estimates for germline mutation rates (especially for the relative mutation rates across individual base conversions). The Trek method complements previous studies on the germline mutation rates [5], and retrieves rate constants (per site, in time domain, byr −1) in a single genome manner, where all the detected substitutions in L1 remnants (their neutral positions only) retrieved from a single genome have accumulated while evolving in multiple (39,894) copies within a lineage of a single organism. This may have a potential to minimise fixation biases and further eliminate biases attributed to the change (mutations) in background mutagenic and reparatory machineries, from an individual to individual, when comparing data from multiple genomes. Taking into account the singleton composition of the unmasked human genome (Additional file 2) and the \(r_{i,j}^{sb}\) constants for the individual transitions and transversions brought in Fig. 3, as estimated through Trek method, we can provide a new estimate for the overall germline mutation rate for the human genome to be 1.176 mutations per site per billion year for the used reference human genome. This makes 2.35×10−8 mutations per site per generation, assuming 20 years of average generation span [2]. Interestingly, the value is slightly lower than the ∼ 2.5×10−8 estimate [2, 5, 67, 69, 70] based on phylogenetic-based approaches, which can be attributed to the more aggressive elimination of biases built in the Trek design (see above). However, the estimate is still greater than the more recent 1.2×10−8 to 1.45×10−8 evaluations from pedigree-based studies [71–74], known to be free from background recombination influence but highly dependent on the paternal age [75]. Less so for the relative values of the rates for individual base-conversions, we expect the absolute values in Trek to be dependent on the absolute timing of the used five subfamilies of L1 elements. Although the latter reliance may be the cause of the relatedness of the Trek estimate to the phylogeny-based ones, our estimate is still a useful addition to the ongoing debate on calibrating the average germline mutation rates [76–78].