Thanks to our new Sequence Reads Archive-based genomic analysis pipeline, we explored the M. tuberculosis CRISPR sequences diversity in 198 clinical isolates representative of the MTC excluding M. canettii, which deserve new specific studies [39, 40]. These data show that M. tuberculosis CRISPR locus contains at most 69 spacers (68 + one duplication), is not prone to inversions, evolves by duplication and deletions through recombination between DR, but also and primarily through insertion/deletions implicating IS6110, by homologous recombination, and independently of lineage. We detail below the support for these different kinds of mutations and inferences that can be drawn concerning the functionality of CRISPR-Cas locus.
Evolutionary mechanisms of MTC CRISPR locus expansion
Despite the absence of acquisition of new spacers, MTC CRISPR locus is of relative long size in many isolates (for instance, 4589 bp between Rv2813 and Rv2816c/cas2 in H37Rv). This relates to its ability to continue to expand using mechanisms other than classical CRISPR adaptation.
A first mechanism of MTC CRISPR size expansion, when considered as the distance between its two borders, is the integration of IS6110 insertion sequences (1355 bp). The most frequent insertion is found between spacers 34 and 35 as in H37Rv genome. Other IS6110 insertions were found along the whole MTC CRISPR locus, with up to two insertions in the CRISPR locus and three when considering the whole CRISPR-Cas locus. Other similar IS Sequences right next to or farther away, might be responsible for other homologous recombination mechanisms involving CRISPR.
The second CRISPR expansion mechanism identified in this overall review concerns duplications of DVR (DR + spacer). These duplications are of two main types. First of all, duplications can concern a single DVR and occur in tandem which was observed in 11 independent cases throughout our 198 samples. This type of tandem duplication concerns also several adjacent DVRs such as DVR1–2 in M. bovis or DVR14–15–16-17-18-19-20 in L1.1.1.7. Such multiple DVR duplications were observed 5 times in our sample, so that in total 16 independent events of tandem duplications were observed. The second type of duplications concerns DVR that are far away from their original position, a type we call “rearrangement duplications”. This first concerns DVR35 located between DVR41 and DVR42 as already mentioned above and supposedly in MTC MRCA CRISPR. Other examples include a second copy of DVR3 found between DVR12 and 13 found in ERR036187 (L4.3.4.1), while in ERR234197 (L1.1.3.1), there is an additional copy of DVR38 between DVR55 and 56. In one instance, this concerned several adjacent DVRs: a second copy of DVR50–51–52-53 is found between DVR3 and 4 in ERR2245409 (L3.1.1). Altogether, this made a total of 4 independent rearrangement-duplications. The fact that rearrangement duplications are less common than standard duplications suggests that they occur less frequently and/or that they are less stable. If the stability of rearrangement duplications was low, there should be several cases of deletions between the two copies of DVR35 as they were likely already present in MTC MRCA. Yet, we observed no case where a deletion concerned solely the DVR between these two copies.
Overall, the proportion of genomes containing either several copies of IS6110 or a duplication of one of the forms listed above is important, showing that MTC CRISPR is much more variable than what could be derived from a standard 43 spacers spoligotyping analysis. This is true not only for the in vitro but also for the in Silico-based acquisition of the spoligotype, as the blast procedure used in the current analytic tools (Spolpred, SpoTyping) only provides information on the presence or absence of a given spacer: there is nothing quantitative or location-related in these approaches [41, 42]. Hence, on one hand, the representation of the CRISPR locus through a simple barcode of presence/absence of individual spacers hides these quantitative and localization information, whereas on another hand, a more extensive description of the CRISPR locus including duplications, insertions, point mutations, provides useful information to classify and/or cluster clinical isolates. Such an information is advantageously correlated with the current SNPs based taxonomical system of MTC genomes and enhances our understanding of isolates evolution [20, 21, 26, 27, 43].
Combined mechanism of CRISPR locus reduction: how does IS6110 contributes to the evolution of CRISPR locus in MTC?
In addition to the undeniable expansion mechanisms mentioned above, CRISPR reduction mechanisms also coexist, which -to some extent- explain some of the spacer block deletions in MTC spoligotypes.
The first potential mechanism is the simple loss of spacer, for instance by recombination between two adjacent DRs. For instance, clinical isolate ERR1203071 of L4.8 lacks spacer 1. In place, it harbors a one nucleotide variant of the beginning sequence, a DR0 and spacer 2. The principle of parsimony here tends to suggest that a recombination between the DR0 bordering spacer 1 led to this genotype. The same kind of recombination seems to occur on slightly higher number of DVR such as the DVR54-DVR61 deletion typical of L2–3–4-7. Recombination between normal, standard DRs would be favored compared to recombination between different DRs (one standard, one mutated).
We can now confidently argue that the second highly frequent mechanism that is at play for the largest suppressions of consecutive spacers, is an IS-linked three steps mechanism: (1) insertion or prior presence of a first copy of IS6110 (for instance that after spacer 34), (2) insertion of a second IS6110 copy at another location (e.g. in csm6 in the ancestor of L2, also seen in SRR1710060, see Supplementary file 2), and (3) recombination between the two IS6110 copies. This IS-mediated mechanism, that has been described in previous studies is a general mechanism, i.e. it happens independently of the lineage and is responsible of convergence in IS6110 copy numbers [44]. The final result is the change from x to x-1 copies of IS6110, with the loss of all spacers between the two copies. This mechanism can be observed independently of lineages, for example, in lineage 4, in Haarlem (4.1.2.1): L4 ancestor has a single copy of IS between 34 and 35, then a second copy occurred in the ancestor of Haarlem L4.1.2 isolates as seen in ERR552680, between 41 and 35, and finally a deletion occurred leading to the loss of spacers 35 to 41 for some isolates such as ERR234259. It therefore seems reasonable to think that after the insertion after spacer 41, this copy of IS6110 has recombined with the one upstream of spacer 35. This mechanism is also at work elsewhere in the Haarlem isolates between csm5 and spacer 34 and between csm5 and spacer 41 (Supplementary file 2).
IS6110 insertions can take place in spacers or in DR and it is not necessary for an IS to be in a DR to be able to recombine. For instance, in many L4.3 (LAM) clinical isolates where spacers 31 to 34 (#21-#24) are missing, the successive sequences of interest are: the beginning of spacer 31 (#21), an IS6110c, DRb1 and spacer 35. The last three sequences of interest are found in the exact same order in undeleted isolates such as H37Rv. This suggests that an IS6110 copy was first inserted at the end of spacer 31, and that it later recombined with the one located between spacers 34 and 35. This recombination did not modify the flanking sequences.
The orientation of the two IS6110 copies that recombined cannot always be derived due to the lack of the ancestral versions. Still in several cases, we could identify isolates related to the deleted ones, that carry the two IS6110 flanking the future deletion. This is true for the IS6110 insertions having led to the deletion described in Fig. 4. In that case, both insertions were in the reverse sense as compared to H37Rv orientation and can be called IS6110c. In another case, the isolate with two IS6110 insertions is SRR5073887 (L4.4.1): it carries not only the standard IS6110c insertion between spacers 34 and 35 but also an IS6110 insertion in the sense direction at the 439th nucleotide of csm6. The deletion in ERR2653229 (also L4.4.1) flanked by the beginning of csm6 and DRb1 and spacer 35 with a sense IS6110 sequence in its middle (Supplementary file 2 [IS6110 sheet]) likely occurred through the recombination of these two IS although they lie in opposite orientations. This phenomenon was recently observed in several cases of IS6110 mediated deletions in L2 [45].
Variants and problems in spoligotyping
How does the CRISPR sequence diversity impact spoligotyping data? When performed in vitro, spoligotyping consists first in the amplification of the CRISPR locus using primers facing the outside of DR region, referred to as DRa and DRb, and second in the hybridization to probes attached at a specific position on a membrane or another support. CRISPR sequences variants may reduce the efficiency of the process, whether at the amplification or at the hybridization step. The presence of intermediate signals in spoligotyping or discrepant results between in silico and in vitro-based spoligotypes has been documented by several authors [46, 47]. We looked for intermediate signals corresponding to variants. In the case of L6 clinical isolates that carry a variant of spacer 4 (spacer 3 in spoligo-43 nomenclature), we found no evidence of such report in the literature and in our own data (data not shown). The same was true for spacer 38 (spacer 28 in spoligo-43 nomenclature) found in L1.1.1 clinical isolates even if the mutation is relatively central in the probe (Supplementary file 5).
Asymmetric variations affecting of MTC CRISPR-Cas locus
As described above, we identified punctual nucleotide mutations, duplications, IS insertions and deletions along CRISPR-Cas locus. CRISPR are oriented loci that acquire new spacers at the 5′ end relative to their transcription direction [12, 48]. It may therefore be expected that variations do not affect symmetrically this locus. To explore and understand the consequences of this possibility, it is important to identify the orientation of the CRISPR locus in question. Using RNAseq data on H37Rv, Wei et al. showed that transcription occurs from spacer 1 towards spacer 68 [49]. We independently confirmed this observation by the exploration of independent RNAseq data from [50, 51] (Refregier et al. unpublished results). The orientation presented in this study is thus the functional one. According to classical CRISPR expansion mechanism, the introduction of new spacers occurs at the 5′ end of the locus, so that the most ancient DVR lies at its 3′ end.
In contradiction with the remarkable feature that most ancient DR carry mutations in all isolates, no subregion exhibited a significantly higher punctual mutation rate (Supplementary file 6). The fact that the most ancient part of CRISPR locus does not carry a significantly higher number of punctual mutations as compared to parts that are more recent (spacer block deletions), may suggest that the time during which the locus expanded from spacer 68 to spacer 1 may be negligible as compared to the time between MTC MRCA and present, or that the CRISPR locus was transferred by lateral gene transfer in one single block from another environmental organism. Alternatively, the time of CRISPR locus expansion could have been quite long, however the pace of CRISPR locus SNPs mutations acquisition was very slow because of an extremely slow pace of MTC transmission. Demography and genetic drift could have been much more important for MTC evolution than selection in human populations [52]. Yet, the presence of mutations in several DR at the 3′ end of the locus could also play a role in its stability.
In contrast, we detected an asymmetry concerning the loss of flanking sequences: it was apparently more frequent to have a loss of the beginning sequences of CRISPR, on the side of the cas genes (several independent isolates from L2 and from L4) than to have a loss of the ending sequences, i.e. on the side of Rv2813. All deletions implicating flanking sequences were bordered by an IS6110 sequence. Altogether, the asymmetry in deletion suggests either a more crucial role of the end of the CRISPR i.e. of gene Rv2813 and/or its neighbors, or asymmetric mechanisms favoring deletion on the cas gene side. This second possibility relates to IS6110 insertion frequency as IS are always involved in large deletions. Saying that IS6110 insertions are more likely on the cas gene side suggests either their lower impact on bacterial fitness, or a DNA superstructure that would favor IS insertions. Other IS exist in the genome that could also insert in a favorable region. Their presence in CRISPR region would be a sign that it is an integration hot spot. However, our script was designed to look only for insertion in cas gene that also lead to a deletion in the CRISPR in at least one of the explored sample.
IS other than IS6110 are unlikely to lead to any deletion. Even if our script may have overlooked non-IS6110 insertions, we did not encounter it in around 500 randomly sampled genomes. The question of cas gene locus being an integration hotspot of IS sequences needs other studies to be completely solved.
Functionality of MTC CRISPR-Cas locus
CRISPR-Cas loci are involved in two mechanisms: 1) adaptation by the integration of new spacers, usually taken from foreign DNA, at the 5′ end of CRISPR with the help of Cas1 and Cas2 proteins, and 2) immunity by the transcription of CRISPR locus, processing with the help of Cas6 protein in the case of type III-A CRISPRs, and degradation of DNA and/or RNA carrying protospacers, with the help of the crRNP (CRISPR RiboNucleoProtein complex), a complex involving the crRNA and other Cas proteins. By exploring the diversity of many genomes at the CRISPR locus, we are able to infer the effectivity of adaptation process. Regarding immunity, we can only state whether the necessary genes are present or not.
In the whole M. tuberculosis complex sensu stricto, we could find only the 68 spacers already present in the MRCA [34]. We found no evidence that a single clinical isolate has acquired a new spacer in the course of MTC evolution. This seems particularly surprising as most currently spreading isolates apart those from L2 still carry the full set of Cas genes including Cas1 and Cas 2 involved in CRISPR adaptation in other type III-A systems. This could be due to a mutation in M. tuberculosis ancestor that has abolished Cas1 and/or Cas2 functionality in the ancestor. Another reason could be that MTC, given its intra-cellular life-style, does simply not have the chance anymore to encounter foreign DNA such as phages or plasmids. These two phenomena could also be linked: a loss of functionality of Cas1 and Cas2 in the MRCA of all MTC could have fostered an adaptative change in life-style of the bacterium, i.e. from an environmental extracellular to a host-specialized intracellular life-style. Such an hypothesis could be supported by the evolution of the CRISPR locus of Vibrio cholerae, with observations that the recent pandemic strains have lost their ancestral CRISPR locus [53] and (FX Weill, personal communication). Hence, the functionality of Cas1 and Cas2 of MTC remains to be explored.
Regarding immunity, this study only focused on the full presence or absence of cas genes without exploring in detail SNP variations. As stated previously, 23/198 (12%) lacked at least part of the cas genes. Among these yet, all isolates still carried the cas6, cas10/csm1, csm2, and csm3 genes. This observation matches that made previously on CRISPR- clinical isolates [28]. Cas6 protein is involved in pre-crRNA processing. Cas10/Csm1 and Csm3 are the enzymes responsible for the catalytic activity of the crRNP [54, 55]. Hence, regarding immunity, even if the spatial structure of the crRNP may be impaired by the absence of csm4 and/or csm5 in some isolates, it could remain possible that immunity occurs in all MTC isolates through the consecutive actions of Cas6 to process pre-crRNA and of Cas10/Csm1 and Csm3 to degrade DNA and/or RNA. The fact that none of the spacer is conserved in all isolates implies that, if immunity occurs, it does not always target the same DNA and/or RNA sequences.
Global implication of CRISPR diversity for the understanding of MTC clinical isolates evolution
In MTC, the CRISPR locus is a likely witness of a previous yet unknown evolutionary history of phage DNA invaders defense, whereas IS6110 is a specific MTC element that belongs to the IS3 family that, through transposition, also plays a permanent role in shaping MTC genomes [56]. The link between the two in evolutionary genomics remains poorly investigated until now. MTC genome actually contains a lot of other IS and transposases (88 genes retrieved in mycobrowser,https://mycobrowser.epfl.ch/(https://mycobrowser.epfl.ch/) such as IS1081, IS1533, IS1547, IS1560), but IS6110 is the one with the largest number of copies in most isolates and especially in the reference isolate H37Rv [57]. IS1547 was previously shown to play a role in MTC evolution however it remains poorly investigated [58]. IS6110-RFLP was the golden standard to define epidemiological clusters at the end of the nineties and stayed so during around 20 years, until it was replaced by MIRU-VNTRFootnote 1 and more recently by Whole-Genome-Sequencing [4, 5, 59,60,61] (for a recent review on evolution of TB molecular epidemiological methods, see also [1]). Previous results on IS6110 insertion sites have shown that independent IS6110 copy acquisition through transposition into hot-spots was a common mechanism explaining convergence in IS6110 copy number in some of the MTBC sublineages [44, 62]. A recent paper on the micro- and macro-evolution of Lineage 2 of MTC in relation to IS6110 transposition also stresses the interest of such studies using WGS [45]. The role of the ipl (Insertion Preference Locus) was also stressed long time ago and showed consequences on the CRISPR locus [58, 63, 64], however no generalized observations on IS-CRISPR genomics dynamics had been done so far before this study.