Reference genome and comparative genome analysis for the WHO reference strain for Mycobacterium bovis BCG Danish, the present tuberculosis vaccine

Background Mycobacterium bovis bacillus Calmette-Guérin (M. bovis BCG) is the only vaccine available against tuberculosis (TB). In an effort to standardize the vaccine production, three substrains, i.e. BCG Danish 1331, Tokyo 172–1 and Russia BCG-1 were established as the WHO reference strains. Both for BCG Tokyo 172–1 as Russia BCG-1, reference genomes exist, not for BCG Danish. In this study, we set out to determine the completely assembled genome sequence for BCG Danish and to establish a workflow for genome characterization of engineering-derived vaccine candidate strains. Results By combining second (Illumina) and third (PacBio) generation sequencing in an integrated genome analysis workflow for BCG, we could construct the completely assembled genome sequence of BCG Danish 1331 (07/270) (and an engineered derivative that is studied as an improved vaccine candidate, a SapM KO), including the resolution of the analytically challenging long duplication regions. We report the presence of a DU1-like duplication in BCG Danish 1331, while this tandem duplication was previously thought to be exclusively restricted to BCG Pasteur. Furthermore, comparative genome analyses of publicly available data for BCG substrains showed the absence of a DU1 in certain BCG Pasteur substrains and the presence of a DU1-like duplication in some BCG China substrains. By integrating publicly available data, we provide an update to the genome features of the commonly used BCG strains. Conclusions We demonstrate how this analysis workflow enables the resolution of genome duplications and of the genome of engineered derivatives of the BCG Danish vaccine strain. The BCG Danish WHO reference genome will serve as a reference for future engineered strains and the established workflow can be used to enhance BCG vaccine standardization. Electronic supplementary material The online version of this article (10.1186/s12864-019-5909-5) contains supplementary material, which is available to authorized users.


Background
The BCG live attenuated TB vaccine is one of the oldest and most widely used vaccines in human medicine. Each year, BCG vaccines are administered to over 100 million newborns (i.e. 75% of all newborns on the planet). The original BCG strain was developed at the Pasteur Institute, through attenuation of the bovine TB pathogen M. bovis, by 231 serial passages on potato slices soaked in glycerol-ox bile over a time-span of 13 years [1]. After its release for use in 1921, this BCG Pasteur strain was distributed to laboratories around the world and different laboratories maintained their own daughter strains by passaging. Over the years, different substrains arose with different protective efficacy [2,3]. The establishment of a frozen seed-lot system in 1956 and the WHO (World Health Organization) recommendation of 1966 that vaccines should not be prepared from cultures that had undergone > 12 passages starting from a defined freezedried seed lot, halted the accumulation of additional genetic changes [1]. In an effort to further standardize the vaccine production and to prevent severe adverse reactions related to BCG vaccination, three substrains, i.e. BCG Danish 1331, Tokyo 172-1 and Russia BCG-1 were established as the WHO reference strains in 2009 and 2010 [4]. Of these, the BCG Danish 1331 strain is the most frequently used one, and it also serves as a basis of most current 'next-generation' engineering efforts to improve the BCG vaccine or to use it as a 'carrier' for antigens of other pathogens [5,6].
Complete genome elucidation of BCG strains is challenging by the occurrence of large genome segment duplications and a high GC content (65%). Therefore, no fully assembled reference genome is yet available for BCG Danish, only incomplete ones [7,8], which hinders further standardization efforts. In this study, we set out to determine the completely assembled genome sequence for BCG Danish and meanwhile, to establish a currentgeneration sequencing-based workflow to analyze genomes of BCG Danish-derived engineered strains.

Results
General genomic features of the whole genome sequence for BCG Danish 1331 (07/270) The BCG Danish 1331 (07/270) strain genome sequence was assembled by combining second (Illumina) and third (PacBio) generation sequencing technologies in an integrated bioinformatics workflow (Fig. 1, see Methods) . Ambiguous regions were locally reassembled and/or experimentally verified (Additional file 1: Table S1). In all cases, the experimental validation confirmed the assembly, demonstrating that this integration of sequencing data types and bioinformatics workflow is adequate for high-GC mycobacterial genomes. The single circular chromosome is 4,411,814 bp in length and encodes 4084 genes, including 4004 genes encoding for proteins, 3 genes for rRNA (5S, 16S and 23S), 45 genes for tRNA, 1 tmRNA gene (ssrA), 1 ncRNA gene (rnpB) and 30 pseudogenes (Fig. 2a). Compared to the reference genome sequence of BCG Pasteur 1173P2, 42 SNPs were identified, including 24 non-synonymous SNPs, 9 synonymous SNPs and 9 SNPs in the intergenic region (Additional file 1: Table S2). For all the genes containing missense and/or nonsense SNPs, we attempted to validate the SNPs via PCR and Sanger sequencing (26 SNPs affecting 19 genes) (Additional file 1: Table S3). In all cases where the validation experiment yielded interpretable quality results (i.e. not hindered by highly repetitive and/or highly GC-rich regions), these mutations were all validated (15 SNPs affecting 15 genes), demonstrating that the generated genome has very high per-base accuracy. Genetic features determinative for the BCG Danish substrain, as described by Abdallah et al. [8], were identified, including the region of difference (RD) Denmark/Glaxo and the DU2 type III, that was completely resolved in the assembly ( Fig. 2a-b). Additionally, a 1 bp deletion in Mb3865 and a 465 bp insertion in PE_PGRS54 compared to BCG Pasteur were found. The organization of 2 repeats (A and B) in PE_PGRS54 has been reported to differ between the BCG strains [9]. We report a A-A-B-B-B-B organization for BCG Danish in contrast to BCG Tokyo (A-A-B-B-B) and BCG Pasteur (A-B-B-B-B). Previously, two separate genetic populations for BCG Danish 1331 have been described, which differ in the SenX3-RegX3 region (having 2 or 3 repeats of 77 bp) [10]. For BCG Danish 1331 07/270, we documented only 3 repeats of 77 bp (Additional file 1: Figure S1). Two features described by Abdallah et al. [8] to be determinative for BCG Danish were not identified, namely the rearrangement of the fadD26-pssA gene region and a 894 bp deletion in Mb0096c-Mb0098c. In addition, a 399 bp instead of a 118 bp insertion was detected in leuA, giving 12 direct repeats of 57 bp, as in the Pasteur strain (previously denoted as S-RD13 [11]). These three regions were characterized by the presence of inherent repeat structures. Furthermore, these genome regions contained assembly gaps in the assembly for BCG Danish published with the study of Abdallah et al. [8,12], so it is likely that our long-read based genome is more accurate in these challenging regions.

The DU1 in BCG strains
Two large tandem chromosomal duplications characterize the BCG strains; the DU2 and DU1. While four different forms of the DU2 exist, the DU1 is supposed to be exclusively present in BCG Pasteur [11,13,14]; it spans the chromosomal origin of replication or oriC (dnaA-dnaN region) and encodes key components of the replication initiation and cell division machinery. Surprisingly, we detected a DU1-like duplication of 14,577 bp in BCG Danish (Fig. 2). This finding was validated by performing a copy-number analysis of genes in and surrounding the DU1-like duplication (Fig. 2d). To adapt an unambiguous terminology, we considered all duplications spanning the oriC as DU1, while specifying the strain in which the duplication was found. Investigation of other publicly available data for BCG Danish did not show presence of a DU1 (Figs. 2c and 3), indicating that only the Danish 1331 substrain deposited as the WHO reference at the National Institute for Biological Standards and Control (NIBSC) contains this duplication. Additional inconsistencies in DU1 presence/absence were detected by reanalyzing publicly available data [12,[15][16][17][18][19][20] (Figs. 2c and 3): in contrast to what is concluded in the literature, we found that the public data show that there are BCG Pasteur substrains with a DU1 (data [15]) and others without a DU1 (data [12,20]). Similarly, experimental analysis of our in-house Pasteur strains (1721, 1173 ATCC 35734) showed absence of a DU1 (Fig. 2d). Additionally, a DU1-China was detected in some data sources [15,16], but not in others [12], which is likely explained by the use of two different substrains of BCG that are both named BCG China [8]. DU1-Birkhaug Characterization of a derivative of BCG Danish 1331, the sapM KO Using the same genome analysis methodology, we determined the complete genome assembly for a KO mutant in the SapM secreted acid phosphatase. Since the sapM gene is located in the DU2, the sapM locus is present twice in WT cells. The assembly for the sapM KO strain did not contain a DU2 repeat, as the KO engineering entirely out-recombined one of the copies of the DU2 to form a single sapM KO locus (Fig. 4a). The absence of the DU2 was unequivocally validated by performing a copy-number analysis of multiple genes in and surrounding the DU2 (Fig. 4b).
Furthermore, we detected one SNP compared to the parental BCG Danish WT strain, a missense SNP in BCG_3966 or BCGDan_4053 (encoding a conserved hypothetical protein), which was validated by Sanger  sequencing (Additional file 1: Table S2 and S3). The single DU2 sapM KO is a useful chassis for further vaccine engineering, as another target gene for improving BCG vaccine efficacy (sigH ( [22]) is novo haploid in this strain, facilitating its future knockout to generate a sapM/sigH double knockout.

Discussion
All BCG strains originate from a common ancestor [23], but since then, they have incorporated many gene deletions and evolved gene amplifications (DU1 and DU2), that differentiate the different BCG strains from each other. Several studies on BCG vaccine strains have mapped these genomic changes using a variety of comparative genomic techniques, starting from subtractive genomic hybridization [24] to whole genome sequencing [7,8,25], enabling the deciphering of a genealogy of the BCG strains. The study of Abdallah and others used short-read Illumina sequencing data for 14 of the most widely used BCG strains in combination with a large-indel detection pipeline to identify a number of previously unknown deletions and insertions [8]. Most genetic signatures identified for BCG Danish by that study were also found in the complete long read/short read hybrid genome assembly that we generated for BCG Danish 1331. However, some RDs could not be found. We hypothesize that inherent repeat structures in these regions triggered the undue assignment of these regions as RD in the short-read Illumina sequencing dataset. Unequivocal assembly of repeat-containing sequences, clearly requires long sequencing reads, as generated for example by PacBio SMRT sequencing in this study. In 2001, Bedwell and others identified two substrains admixed in a Copenhagen commercial preparation of the BCG vaccine (a.k.a. BCG Danish 1331) [10]. These two genetic populations differed in the senX3-regX3 region, having 2 or 3 repeats of 77 bp. We documented only one version for the senX3-regX3 region, with 3 repeats of 77 bp for the BCG Danish 1331 WHO reference reagent strain. In contrast, Magdalena et al. reported the presence of 2 repeats for a M. bovis BCG Danish vaccine strain provided by M. Lagranderie (Institut Pasteur, Paris, France) [26]. These data indicate that different substrains of BCG Danish are in circulation, and that this region probably is genetically drifting. Extensive genomic characterization of the WHO reference reagent for BCG Danish (as provided by this study) will facilitate the identity assurance of the genomic integrity of new lots of the BCG Danish vaccine.
Similarly, we document the presence of a DU1-like duplication in this WHO reference BCG strain (DU1-Danish), that has never been reported on before, as the DU1 was thought to be exclusively restricted to BCG Pasteur [11,23]. Furthermore, we showed that not all BCG Pasteur strains contain the DU1-Pasteur, based on experimental analysis of in-house Pasteur strains and based on reanalysis of publicly available sequencing data. In addition, we detected a DU1-China in one of the two different substrains of BCG that are both named BCG China [8]. Seemingly the oriC is prone for duplication, as DU1-like duplications were observed for BCG Pasteur, BCG Birkhaug, BCG China and BCG Danish. The genealogy of BCG strains is thus further complicated by the genomic instability of the oriC during in vitro cultivation (Fig. 5, Additional file 2: Table S8). A DU1-like duplication has also been identified in a 'non-vaccine' strain; in a clinical isolate (3281), identified as BCG, a 7kb region that covered six genes and crossed the oriC was repeated three times [27], further indicating that this region is prone to (possibly reversible) duplication. Together, these data underline the importance of the genomic characterization of the BCG vaccine strains, including their dynamic duplications. Furthermore, they demand for the specification of the exact origin of the BCG strain(s) used in studies on this vaccine and the determination of the presence of the RD documented for that strain. The implementation of copy number analysis via qPCR as described here, could allow for easy discrimination whether a certain strain contains a DU1-like duplication or not, instead of requiring next-generation sequencing (more expensive) and bioinformatics analyses (requires expert knowledge).
Until now, no driving factor for the DU1 has been identified, as the DU1 in BCG Pasteur contains 31 genes and none of these genes are expected to give an obvious in vitro growth advantage upon duplication [13]. Perhaps, this could now be elucidated by examining the  [15] and Illumina sequencing data (b) for BCG Danish 1331 (this study) as well as published genome data from Pan et al. 2011 [16][17][18][19], Abdallah et al. 2015 [12] and Festjens et al. 2019 [20] were reanalyzed for the presence of a DU1 in the region of the oriC. These references were chosen as they contain BCG Danish or BCG Pasteur genome sequencing data. The graphs in (a) depict the ratio of the reference (M. tb H37Rv) probe intensity (Cy5) divided by the test (BCG strain) probe intensity as originally presented in Leung et al. 2008 [14]. The graphs in (b) depict the ratio of mean whole genome read coverage divided by the mean read coverage in 500 bp window size. Detection of a DU1-like duplication in BCG Pasteur 1173P2 [15], Birkhaug [12,15], Danish 1331 07/270 (this study) [21] and BCG China [15,16] sequencing data, indicated in grey. No detection of DU1-duplication for other BCG Pasteur [12,20], Danish [12,17] and China [12] sequencing data  (Table 1). It remains however difficult to speculate about the impact of two copies of oriC (dnaA-dnaN region) on the biology of BCG strains [13]. Bacteria carefully regulate the activity of the initiator protein DnaA and its interactions with the oriC to assure correct timing of the chromosome duplication [30]. Therefore, one has assumed that multiple copies of the oriC are deleterious, as they can provoke uncoordinated replication [13,31]. It is known that M. smegmatis transformants with two functional DnaA gene copies cannot be obtained [31], as observed in both B. subtilis [32] and S. lividans [33]. However, such an inhibitory effect was not observed when a complete dnaA gene was transformed to M. smegmatis [34], although Salazar and others questioned whether the construct did not acquire a point mutation or small deletion that inactivated dnaA [31]. Until now, no sequence differences were observed between the different copies of the dnaA-dnaN region, suggesting that both copies of the origin are functional in vivo. It has been speculated that BCG 3281 (containing 3 copies of the dnaA-dnaN region) would likely be capable of enduring greater gene expression burdens in replication [27]. Indeed, as DnaA and oriC are so closely genetically linked, duplication of this genomic region is not necessarily the same as just increasing the gene copy number or overexpressing DnaA. It could be envisioned that selection for rapid growth on rich medium may favor or tolerate more rapid genomic replication initiation, but also that this selective advantage may collapse in the face of e.g. nutrient limitation or prolonged stationary phase cultivation. Possibly this is at the heart of the observed unpredictable behavior of this genomic duplication. Confirmation of this hypothesis awaits experimental confirmation.
To demonstrate how the genome analysis methodology, developed in this study, contributes to full characterization of improved BCG-derived engineered vaccines, we applied it to a KO for the SapM secreted acid phosphatase, located in the analytically challenging  Table S8). The blue dashed squares indicate the different DU2-forms, which classify the BCG strains into four major lineages. When the DU1 is not found in all substrains of a certain strain, this is indicated on the scheme. According to the literature, two different substrains of BCG are named BCG China or Beijing [8]. Therefore, the scheme contains two 'BCG China' strains: BCG China [8] and BCG China* [7,14]. Adapted from references [8,11,14,28,29]. Concerning reference [8], only the RD and deleted genes that could be verified on the assembled genomes [12] are included long duplication region DU2 [11]. Our BCG genome analysis workflow unequivocally demonstrated that the KO engineering had inadvertently out-recombined one of the copies of this DU2 and had furthermore given rise to a single SNP. The out-recombination of the DU2 will most probably not have a dramatic impact on the phenotype of the sapM KO, as all the genes are still present as a single copy. One could perhaps expect slower growth of the sapM KO in glycerol-containing media, as the DU2 probably arose due to inadvertent selection for increased growth rate on glycerol [11]. GlpD2, encoding glycerol-3-phosphate dehydrogenase, is one of the three genes present in all DU2 versions and higher levels of glpD2 probably gave a growth advantage to strains with duplications [11]. We did not observe a decreased growth rate in the Middlebrook 7H9 standard medium for the sapM KO. Perhaps, the growth advantage attributed to the DU2 would only be apparent in Calmette's glycerol-containing medium, traditionally used to subculture the BCG strains before the introduction of a frozen seed-lot system in 1956 [37]. The effect of the SNP in BCG_3966 (or Rv3909) is hard to estimate. The mutated gene encodes for a conserved hypothetical protein of 802 amino acids and is predicted to be an outer membrane protein [38]. The missense SNP converts the asparagine (located at the end of the protein) in the WT to a threonine in the sapM KO (pAsn737Thr). However, as the gene has been found to be essential for in vitro growth of M. tb H37Rv [39,40], we suspect that the protein function is retained. Such unexpected genomic alterations may be more common than thought in engineered live attenuated TB vaccines, but may have so far gone largely unnoticed due to lack of a complete reference genome and/or suitable genome analysis methodology.
The implementation of both short (Illumina) and long (PacBio) sequencing reads in one genome analysis methodology allowed for the straightforward generation of completely assembled genomes of BCG strains. These included the decomposition of the analytically challenging long duplication regions DU1 and DU2, thanks to the inclusion of long sequencing reads, whereas one formerly needed many additional experimentation (Table 2). Furthermore, the generated genome assemblies were highly polished at base level, due to the incorporation of reliable Illumina sequencing reads (single-pass error rate of 0.1%), in addition to the more error-prone PacBio sequencing reads (single-pass error rate of 10-15%) [41,42]. This methodology is thus currently the most cost-effective strategy that allows to create high-quality BCG genomes, solely based on nextgeneration sequencing strategies.  information pathways oriC Sequence in the genome at which replication is initiated. Contains non-overlapping MtrA-and DnaA-binding boxes. Is located in the dnaA-dnaN genomic region [35].
-dnaN DNA polymerase III (β-chain) DnaN (DNA nucleotidyltransferase). DNA polymerase III is a complex, multichain enzyme responsible for most of the replicative synthesis in bacteria. This DNA polymerase also exhibits 3′ to 5′ exonuclease activity. The β-chain is required for initiation of replication. Once it is clamped onto DNA, it slides freely (bidirectionally and ATP-independently) along duplex DNA.
information pathways recF DNA replication and repair protein RecF (ssDNA binding protein) is involved in DNA metabolism and recombination; it is required for DNA replication and normal SOS inducibility. Binds preferentially to linear ssDNA. It also seems to bind ATP.

information pathways
Gene information was extracted from Mycobrowser [36] Conclusions Our data highlight the importance of characterizing our BCG vaccine strains, as more variability exists among these strains than was thought. The availability of the complete reference genome for BCG Danish 1331 as well as the associated genome analysis workflow, now permits full genomic characterization of (engineered) TB vaccine strains, which should contribute to more consistent manufacturing of this highly cost-effective vaccine that protects the world's newborns from disseminated TB, and that is used as a basic chassis for improved TB vaccine design.

Mycobacterial strains, gDNA and reference genomes
The strains used include the M.  For each strain, we indicated the used method to create the assembled genome, the type of DU2 present in the strain, whether the DU2 was resolved in the genome assembly, the BioProject and genome assembly accession number, the reference to the study in which genome assembly method was published and the year of publication. In the 'method' description, we have put labor/capital-intensive aspects in bold, illustrating that our approach using solely massive parallel sequencing, is the only one that provides both high per-bp accuracy (allowing for SNP calling) and complete resolution of the assembly across large repeat regions sequenced on an Illumina MiSeq instrument (MiSeq Reagent Kit v2 Nano, PE250 (paired end 250 bp), 500 Mb), with an average of 55-56x coverage per genome.

Genome assembly and analysis
Illumina reads were quality-filtered and adapter sequences were trimmed (Trimmomatic v0.36 [54]), after which overlapping paired-end reads were merged into single reads (BBMerge v36.69 [55]). PacBio read sequences were corrected using the high quality Illumina reads (Lordec v0.6 [56]). The unmerged and merged Illumina reads were assembled into a draft assembly (SPAdes v3.9.0 [57]). The draft assembly was scaffolded using the corrected PacBio reads (SSPACE-LongRead v3.0 [58]). Finally, gaps in the scaffold were closed (Gap-Filler v1.10 [59]) and the assembly was improved (Pilon v1.20 [60]), both using the trimmed Illumina reads. The exact sequence of the DU1 region was based on a second round of local de novo assembly (SPAdes v3.9.0 [57]) using soft-clipped Illumina reads surrounding the draft DU1 region where the Illumina read coverage is more than two times higher than the background coverage. The DU2 repeat was resolved by comparing the SPAdes assembly with the assembly from HINGE (v201705) [61], where the R1 and R2 regions have been separated. The junction sequences of DU1 and DU2 were further confirmed by aligning uniquely mapped PacBio reads and the results were always consistent with PCR and Sanger sequencing.
A probabilistic variant analysis was performed by mapping the BBmerged Illumina reads to the BCG Pasteur reference genome (BWA-MEM [70]) and calling variants by GATK UnifiedGenotyper [71] (Count ≥10 & Variant Probability > 0.9), whereafter variant annotations and functional effect prediction were carried out with SnpEff and SnpSift [72]. The orthologous relationships between M. tb, M. bovis BCG Pasteur and BCG Danish WT and sapM KO were investigated, the proteins of strains (M. tb H37Rv [51], BCG Pasteur 1173P2 [53], BCG Danish WT and sapM KO (this study)) were searched using all-against-all with BLASTP [64], after which the result was analyzed by TribeMCL [73] and i-ADHoRe 3.0 [74] based on the genome synteny information (Additional file 3: Table S9).
To validate the detection of the DU1, the DU1 duplication region was reanalyzed in published genome data [12,[15][16][17][18][19][20]. Probes on tiling array or Illumina short sequencing reads were mapped to the M. tb reference strain [48] (BWA-MEM [70]). The tilling array data were directly compared by the intensity ratio between H37Rv and the sampled strains (ratio = strain / H37Rv). A ratio larger than one was considered as a duplication in the sampled strain. The DU1 duplications in the Illumina data were detected by cn.mops [75]. In brief, cn.mops first took all aligned BAM files (BWA-MEM) and normalized the mappable read counts to make it compatible across all samples in the comparison. A mixture of Poisson model was then used to compare read counts for each genomic position (bin size 500 bp) across all samples. A mixture of Poisson model will not be affected by read count variations along the chromosomes caused by technical or biological noise, since a separate model is constructed at each position. Using a Bayesian approach, read counts and the noise across samples were decomposed by an expectation maximization algorithm into integer copy numbers (with confidence intervals).
In Fig. 1 a graphical overview of the performed genome analysis pipeline is given. All presented nextgeneration sequencing data were integrated in an online genome browser (JBrowse) [76].
PCR analysis, gel electrophoresis and sanger sequencing PCR (GoTaq®Green, Promega) was performed on gDNA using primers listed in Additional file 1: Table S1 and S4. PCR products were run on a 1.2% agarose gel, stained with Midori Green and visualized under ultraviolet light. To confirm the single nucleotide polymorphisms (SNPs), regions of interest were amplified (Phusion High-Fidelity DNA Polymerase, NEB) from gDNA with primers listed in Additional file 1: Table S5. The resulting PCR products were purified (AMPure XP beads) and Sanger sequenced with (a) nested primer(s) (Additional file 1: Table S1 and S5).

Copy number profiling via qPCR
Real-time quantitative PCR was done on a LightCycler 480 (Roche Diagnostics) using the SensiFast SYBR-NoRox kit (Bioline) in quadruplicate for each gDNA sample using primers listed in Additional file 1: Table  S6. Determination of the average relative quantities was performed using the qbasePLUS software (Biogazelle). All results were normalized using the reference genes 16S rRNA, nuoG and mptpB.