Genome sequencing of Mycobacterium pinnipedii strains: complete genetic characterization and evidence of superinfection in a South American sea lion (Otaria flavescens)

Background: Mycobacterium a is capable of infecting several host species, including humans. Recently, ancient DNA from this organism was recovered from pre-Columbian mummies of Peru, sparkling debate over the origin and frequency of tuberculosis in the Americas prior to European colonization. Results: We present the first comparative genomic study of this bacterial species, building from the complete genome sequencing of two M. pinnipedii isolates (MP1 and MP2) obtained from different organs of a stranded South American sea lion. Our results indicate that MP1 and MP2 differ by 113 SNPs (single nucleotide polymorphisms) and 46 short deletions, constituting the first report of a mixed-strain infection in a sea lion. SNP annotation and protein networks indicate that genes of the VapBC family, a toxin-antitoxin system, are under evolutionary pressure for protein sequence change in these strains. OrthoMCL analysis with seven modern isolates of M. pinnipedii shows that these strains have highly similar proteomes. Gene variations was only marginally associated with hypothetical proteins and PE/PPE (proline-glutamate and proline-proline-glutamic, respectively) gene families. We also detected 28 large deletions in ancient and modern M. pinnipedii strains, including 21 never described and 15 occurring only in modern strains, indicating a process of genome reduction occurring over the past 1,000 years. Our phylogenomic analyses suggest the existence of two modern clusters of M. pinnipedii associated with geographic location, and possibly host species, and one basal node associated with the ancient M. pinnipedii strains. Previously described MiD3 and MiD4 or RD2Seal deletions may have occurred independently, twice, over evolutionary course of the MTC. Conclusion: The presence of superinfection (i.e. mixed-strain infection) in this sea lion suggests that M. pinnipedii is highly endemic in this population. In contrast to the human-adapted M. tuberculosis strains, M. pinnipedii proteomes of the studied isolate showed high degree of conservation, despite being under genomic decay when compared to M. tuberculosis. This finding indicates that further genomes need to be sequenced and analyzed to increase the chances of finding variably present genes among strains or that M. pinnipedii genome remodeling occurred prior to bacterial speciation.

Introduction reports of infected cattle were associated with a contaminated water canal connecting directly to the ocean or beach grazing areas where pinnipeds were found [18]. The actual impact, however, of tuberculosis in pinniped species remains largely unknown, mainly because its prevalence in free ranging seals and sea lions are completely unexplored.
The carcass of a South American sea lion (Otaria flavescens) has recently been recovered from the Southern coast of the state of Rio Grande do Sul, Brazil. As reproductive colonies of these animals are absent in Brazil, specimens found on the coast are probably from Uruguay. At the animal necropsy, lesions compatible with tuberculosis in different organs were found, from which M. pinnipedii isolates were obtained, constituting the first report of this pathogen in the Brazilian coast [19]. Unfortunately, genome sequencing of the isolates was not undertaken at the time, precluding opportunities to better explore the genetic makeup of these bacteria, including the possibility of intra-host bacterial clonal variants (i.e. microevolution) or mixed-strain infection (i.e. superinfection). Until now, M. pinnipedii genome sequences have never been fully characterized by comparative genomics. Genomic reads of modern M. pinnipedii strains described by Bos et al. [5] were solely used to compare to the ancient M.
pinnipedii-like genomes. During the development of this study, another strain of M. pinnipedii was sequenced, but only applied to evaluate the phylogenomic position of this species within the MTC.
Therefore, the objectives of the present study were to sequence and compare two M. pinnipedii isolates obtained from different organs of this sea lion carcass [19], and to perform a comparative genomic analysis with other M. pinnipedii strains available from public databases.  Table S1). The main functional annotations were those related to energy production and conversion (11.95%, 19/159), lipid transport and metabolism (9.43%, 15/159) and cell wall/membrane/envelope biogenesis (7.55 %, 12/159), which are known to play crucial roles in the pathogenicity of MTC species. Interestingly, the majority of detected SNPs are non-synonymous (86.72%, 98/113), which means that related CDSs are under selective pressure for protein sequence change, having the potential to promote functional diversity. When analyzing CDSs of known function with non-synonymous SNPs, we identified strong protein interactions between specific members of the VapBC protein family (Figure 2). The fact that multiple proteins interacting with each other have non-synonymous SNPs indicates that this interaction, in particular, is under evolutionary pressure for change. In M. tuberculosis, the VapBC family accounts for more than half of the proportion of toxinantitoxin (AT) systems (47/88 putative AT systems) [20] and genomic regions containing VapBC are strongly associated with virulence and pathogenicity factors [21]. The identified CDSs are part of a type II AT system related to mRNA translation regulation. More specifically, VapC25 and VapC29 are ribonucleases that have been shown to inhibit M. smegmatis cell growth when induced (i.e. have bacteriostatic effects) [22]. Taken together, these results suggest that MP1 and MP2 may have different survival capacities. numbers. Variation at several loci normally indicates mixed-strain infections [24]. Therefore, based on current parameters of distinction between microevolution and superinfection regarding SNPs/indels [25] and MIRU-VNTR [24], the sea lion analyzed herein was infected with two different strains of M.

pinnipedii.
This is the first report of M. pinnipedii superinfection in a sea lion. As these animals normally live in dense aggregations of specimens, they are highly vulnerable to epizootics of infectious diseases that can be transmitted by direct animal-to-animal contact. Superinfection caused by M. tuberculosis in humans is normally reported in highly endemic countries, in which people are exposed to multiple strains of M. tuberculosis throughout their lives and HIV plays an important role in shaping the disease incidence [24]. By tracing a parallel, it is possible that M. pinnipedii is highly endemic in the population from which this sea lion came from, represented by different circulating strains, which may have unprecedented effects on the conservation of this species. Adverse environmental conditions, insufficient nutrition and chronic stress due to disturbance or competition can act to suppress the immune system and, therefore, pinnipeds may be predisposed to diseases caused by several pathogens. In Brazil, sea lion stranding occurs normally due to a compromise in animal health when they travel long distances during the migratory period [26]. The weakened state with which these animals arrive at the coast, often infected by large bacterial loads, is the factor that most commonly lead these animals to death [27,28]. The contribution of M. pinnipedii infection in this context remains to be elucidated.
Although M. pinnipedii has been described in both captive and free-ranging animals, systematic population surveillance studies are still lacking. In 2011, a captive South American sea lion housed in a Czech Republic zoo that was imported from a German zoo (not specified) was found to be infected with M. pinnipedii. The animal's parents also died of tuberculosis in Germany and they were captured as juveniles from the coastal waters of Uruguay in 1992 [16], suggesting that the infection came from wildlife. Outbreaks of the disease in wild and/or captive-born South American sea lions housed at Heidelberg Zoo, Germany and Le Pal Zoo, France have also been described [29]. Added to this report of superinfection, it is thus clear that M. pinnipedii is endemic in wild populations of South American sea lions and is being introduced into zoological gardens, warranting an urgent need to evaluate the extent of tuberculosis in free-ranging pinnipeds of the Southern hemisphere.
Strains of M. pinnipedii have a highly conserved proteome   Table S3. Cluster size varied from two to forty-three proteins.
There are no protein clusters (i.e. groups of ≥ 2 proteins) unique to each M. pinnipedii strain ( Figure   3A). However, a number of proteins not categorized into any protein cluster was considered unique of each M. pinnipedii strain (MP1 = 32, MP2 = 37, ATCC BAA-688 = 58, G01498 = 103, G01492 = 68, G01491 = 52, G01222 = 80). As the annotation tool used herein (RAST) does not differentiate between pseudogenes/broken genes and true genes, we used BLASTp and/or tBLASTn to search the NCBI's PGAP annotation of MP1, MP2 and ATCC BAA-688 strains using these "unique proteins" as input. From those with annotated function (i.e. not hypothetical or PE/PPE genes), our BLAST search revealed that all functional CDSs are considered pseudogenes, are non-existent or are present in contigs smaller than 700 bp (with doubtful sequencing/assembly quality) based on PGAP annotation (Additional file 4: Table S4). As RAST reports broken CDSs at the end of contigs as true genes, these CDS fragments (i.e. artefacts of draft genomes) failed to cluster into orthologous groups.
Nevertheless, there is still a number of hypothetical and PE/PPE genes that we were not able to distinguish between pseudogenes and true genes using the proposed tools, indicating the possibility of distinct phenotypes within M. pinnipedii strains. These variations may be a consequence of gene truncation or pseudogenization, which may lead to neofunctionalization or gene loss, respectively, as well as duplication events, phenomena commonly observed in the MTC [30,31].
When considering non-core protein clusters (n= 292) shared between two and up to six M. pinnipedii strains, similar results were obtained, with all functional CDSs identified as pseudogenes or nonexistent based on PGAP annotation, and an unknown number of possible hypothetical and PE/PPE genes that may be variably present among M. pinnipedii strains. Taken together, our results indicate that the analyzed M. pinnipedii strains present a high level of proteome conservation, which is in contrast with a recent pangenome analysis of M. tuberculosis strains that detected at least 1,122 CDSs in the accessory genome and 964 strain-specific CDSs in 36 genomes [32]. However, the M. tuberculosis pangenome in that study was also analyzed using RAST, and pseudogenes were not  tuberculosis genome [33] and show a high degree of variation across members of the MTC and between strains of the same species. PPEs and PEs proteins often interact with each other and are found closely associated with type VII secretion system ESX domains in mycobacterial genomes, acting as important virulence factors [34]. As with the ESAT-6 family, the loss of genes encoding PE/PPE proteins may play an important role in the attenuation process of M. pinnipedii for some hosts [35].
Specific deleted regions of M. microti named MiD3 and MiD4 were found only in the modern strains of M. pinnipedii ( Figure 5). This observation suggests that MiD3 and MiD4 or RD2seal deletions may have occurred independently, twice, over evolutionary course. Mycobacterium microti has been first described in voles [36], but has since been isolated from pigs, llamas, cats, wild boars and immunosuppressed humans, being restricted to Eurasia. This raises questions about the true host range of these closely related pathogens, which may or may not have changed less than 1,000 years ago.  Table S5). In addition, strains of M. pinnipedii appeared distributed among three main clusters, with the ancient strains constituting the most basal node ( Figure 5). Interestingly, modern M. pinnipedii strains appeared divided into two groups according to geographical locations, and possibly host species: the modern cluster 1, comprising isolates of South American origin, and the modern cluster 2 comprising isolates from Australia. These results are in accordance with findings of M. tuberculosis and M. bovis lineages/clonal complexes that are also associated with distinct geographical locations [31,37,38].
Whether or not these are also associated with different virulence phenotypes is unknown.

Concluding Remarks
This is the first report of mixed-strain infection of M. pinnipedii in pinnipeds. This finding, coupled with previous reports in the literature, suggest that M. pinnipedii infection is endemic in free-ranging populations of South American sea lions. Genetic differences between these strains were found to be associated with virulence factors and enzymes necessary for intracellular maintenance and membrane shaping of mycobacteria. The actual effect of these phenomena for the disease outcome and conservation of the animal species, and potential population-level implications are unknown.
Nevertheless, it prompts an urgent need to evaluate the extent of the disease in these animals and to consider tuberculosis in pinnipeds as one of the main health concerns for these species. showed high degree of conservation, indicating that further genomes need to be sequenced and analyzed to increase the chance of finding significant differences or that M. pinnipedii genome remodeling occurred prior to bacterial speciation. In addition, particular attention should be given to possible "pseudogenes" and/or truncated CDSs, as to standardize genome annotations and guide future gene-based analyses. And finally, M. pinnipedii strains, as with other MTC species, are likely to cluster based on geographical occurrence, which will coincide with pinniped species, as these animal populations are already segregated by geography.  [19] were kept at -80 ºC in 7H9 broth with 20% glycerol. For this study, both isolates were reactivated in Stonebrink media and a single colony from each isolate was sub-cultured for DNA extraction as described previously [40,41]. The quality and quantity of the DNA was measured by Nanodrop 2000c Assembly and annotation of the genome The 100 bp paired-end reads from each library were first filtered by quality and presence of adaptors using Trimmomatic software version 0.36 [42]. Mycobacterium pinnipedii MP1 and MP2 genomes were de novo assembled using SPAdes software [43]. Gene identification and annotation were performed automatically by the NCBI's PGAP and draft genomes deposited as GCA_003027795.2 (MP1) and GCA_003027895.2 (MP2).

Circular map of M. pinnipedii genomes
In order to visualize the similarity among the predicted proteins of different M. pinnipedii strains, the genomes of strains G01222, G01491, G01492 and G01498 were assembled using CLC Genomics WorkBench (Qiagen, Netherlands) or SPAdes software [43] and annotated with RAST [44]. Resulting contigs from strains G01222, G01491, G01492 and G01498 and contigs from M. pinnipedii strains MP1, MP2 and ATCC BAA-688 deposited in GenBank were reordered with MAUVE [45] using M. tuberculosis H37Rv (GenBank NC000962.3) as the reference genome. Subsequently, a circular genomic map was constructed using the Proteome Comparison toll implemented in Patric 3.5.18 web resources [46].

Mapping and variant calling of reads
Burrows-Wheeler Aligner (BWA) [47] was used to map reads against the reference genome M. tuberculosis H37Rv. The BWA outputs were analyzed with SAMtools [48]

Protein interaction networks
CDSs identified with non-synonymous mutations when comparing MP1 and MP2 strains were analyzed using STRING database with default settings for the prediction of network associations between proteins [51]. Results were used to identify specific metabolic pathways or protein interactions under evolutionary pressure. When needed, protein sequences were searched against the Virulence Factor Database (VFDB) [52] to identify virulence factors.
In silico spoligotyping and Mycobacterial Interspersed Repetitive Unit-Variable Number of Tandem Repeat (MIRU-VNTR) For the identification of spoligotypes, the reads of M. pinnipedii isolates were analyzed in SpoTyping [53]. The identified patterns were submitted to the Spoligotyping database of M. bovis (www.mbovis.org) for the identification of the pattern number. In order to identify the variable number tandem repeats (VNTRs) of genetic elements named mycobacterial interspersed repetitive units (MIRUs), the 24 MIRU-VNTR loci were individually identified in silico using primers previously described [54] and the resulted patterns were analyzed in the MIRU-VNTRplus database [55,56] Mycobaterium pinnipedii orthologous and paralogous protein clusters To identify groups of orthologous genes, seven M. pinnipedii strains were used (G01222, G01491, G01492, G01498, MP1, MP2 and ATCC BAA-688). Assembled contigs were annotated or re-annotated with the RAST program [44] and predicted CDS ≤150 bp were excluded. Predicted proteins were clustered using OrthoMCL [57] available from the KBase platform [58]. Briefly, homologous pairs of sequences were found using the all-against-all BLASTp algorithm with an E-value <1e-5. OrthoMCL then converted the BLASTp results into a normalized similarity matrix that was analyzed by a Markov Cluster algorithm (MCL) for clustering of orthologous sequences. The inflation index of 1.5 was used to regulate cluster tightness. Since the RAST algorithm reports pseudogenes and broken genes located at the end of contigs as true genes, there may be an overestimation of the number of strainspecific CDSs. This overestimation was manually checked using BLASTp and/or tBLASTn [59]  When mapping the ancient samples against the reference genome, special treatment was given to the reads prior and during mapping. First, reads were trimmed based on quality parameters of 0.05 base-calling error probability and presence of two or more ambiguous nucleotides at the ends using modified-Mott trimming algorithm. Second, reads mapping in more than one region of the genome were excluded.
Mapped regions with less than 10 aligned reads and/or containing low-quality reads were considered absent in the query genome. LSPs were called when this region spanned more than 500 bp.

Phylogenetic analysis
Members of MTC and eleven of the M. pinnipedii strains described above were used to construct a phylogenetic tree. Briefly, all read sets were mapped against the reference M. tuberculosis H37Rv genome and true variants were called as described above, with exclusion of SNPs located in repetitive regions. A customized script in Python language was then used to build a positional matrix of SNPs identified in genomes of all strains/species. The matrix was used as input for the RAxML [62] software for the construction of the phylogenetic tree using Maximum Likelihood (ML) algorithm with GTRCAT model and autoMRE for best-scoring ML tree and a maximum of 1,000 bootstrap inferences.

Declarations
Availability of data and material     Table S3. Cluster size varied from two to forty-three proteins.  MiD3 and MiD4; red triangle: deletion RD2seal. Mycobacterium tuberculosis H37Rv was used as outgroup. A customized script in Python was used to build a matrix of SNPs, which was used to infer a ML tree using RAxML with GTRCAT model and autoMRE for best-scoring ML tree and a maximum of 1,000 bootstrap inferences using RaxML software.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.