Whole genome sequencing and phylogenetic analysis of six SARS-CoV-2 strains isolated during COVID-19 pandemic in Tunisia, North Africa

Background In Tunisia a first SARS-CoV-2 confirmed case was reported in March 03, 2020. Since then, an increase of cases number was observed from either imported or local cases. The aim of this preliminary study was to better understand the molecular epidemiology and genetic variability of SARS-CoV-2 viruses circulating in Tunisia and worldwide. Methods Whole genome sequencing was performed using NGS approach on six SARS. CoV-2 highly positive samples detected during the early phase of the outbreak. Results Full genomes sequences of six Tunisian SARS-CoV-2 strains were obtained from imported and locally transmission cases during the COVID-19 outbreak. Reported sequences were non-identical with 0.1% nucleotide divergence rate and clustered into 6 different clades with worldwide sequences. SNPs results favor the distribution of the reported Tunisian sequences into 3 major genotypes. These SNP mutations are critical for diagnosis and vaccine development. Conclusions These results indicate multiple introductions of the virus in Tunisia and add new genomic data on SARS-CoV-2 at the international level.

In December 2019, a novel Human Coronavirus (HCoV) associated to severe acute respiratory syndrome was discovered in the city of Wuhan, Hubei province, China and was later named SARS-CoV-2 [2,3]. The outbreak of the coronavirus disease (COVID-19) spread further worldwide and the World Health Organization (WHO) officially declared the pandemic aspect of the COVID-19 epidemic on March 12th, 2020 [4]. The WHO report dated on June 21st, 2020 confirmed 8,708, 008 cases of SARS-CoV-2 with 461715deaths from 215 countries and territories [5].
The Tunisian government, through the Ministry of Health (MoH) and the National Observatory of New and Emerging Diseases, reviewed and initiated multisectoral measures to deal with the COVID-19 epidemic. The national strategy aims early detection of imported cases, isolation of confirmed cases as well as suspected cases. Moreover, the MoH commit to develop diagnostic and treatment procedures in order to limit the spread of the SARS-CoV-2 within the population and avoid exceeding of the health system capacity. These measures allowed the Tunisian authorities to identify the first SARS-CoV-2 in March 03, 2020 [6]. As of June 7th, a total of 1087 COVID-19 cases, including 286 imported and 801 locally transmitted cases, have been diagnosed and tested positive for SARS-CoV-2 with 49 deaths [7].
This work reports the first whole genome sequences of the SARS-CoV-2 viruses isolated in Tunisia during the early phase of the outbreak. Phylogenetic comparison with whole genome sequences reported from other countries and single-nucleotide polymorphism (SNPs) were assessed.

Results
The six Tunisian sequences obtained were 29,541 to 29, 825 nucleotide-long and covered the whole coding region and more than 99.1% of the genome. The totality of complete genome sequences available at writing time, of SARS-CoV-2 were downloaded from the GISAID (http://www.gisaid.org/) and the NCBI (http://www.ncbi. nlm.nih.gov/genbank) databases. All the repeated sequences as well as the sequences with missing regions have been excluded. The 268 selected sequences originated from 61 different countries worldwide. The phylogenetic tree in Fig. 1, shows that Tunisian SARS-CoV-2 sequences were scattered independently of each other in different clusters including strains mainly from Europe and Asia (Fig. 1a). Three out of six Tunisian sequences clustered with sequences from France: one case came from France (COV0425, Fig. 1g) and two were locally infected (COV1482 and COV1663, Fig. 1d and e). The sequence COV1339 was isolated from a case coming from Turkey, it grouped with sequences originated from Taiwan, Egypt and Finland (Fig. 1f). Sequence COV0010 was from a case with recent travel to Vietnam and clustered with sequences from Thailand, Vietnam, China and Japan (Fig. 1b). The last identified sequence COV0880 was from a locally infected individual and grouped with sequences from Turkey, Belarus and Kuwait (Fig. 1c).
The Tunisian SARS-CoV-2 retrieved full genome sequences alignment with the Wuhan reference sequences (NC045512) were used for the investigation of the SNPs and the resulting variable amino-acids composition throw genome detective web server (https://www. genomedetective.com/); the obtained results are shown in Table 1.
The average and the Min/Max depth of coverage for the six described isolates was as follows, Cov0010: 12545 (0/35320), Cov0880: 9339 (0/41049), Cov1339: 10456 (0/ Fig. 1 Phylogenetic tree comparing the six Tunisian SARS-CoV-2 with 268 previously published sequences. The tree was performed using the neighbor joining method and the kimura-2 parameter model. Topology was supported by 1000 bootstrap replicates. For better clarity of the tree, lineages from the same country were collapsed and bootstrap values lower than 50 were not indicated (a). Phylogenetic clusters including SARS-CoV-2 Tunisian genome sequences were detailed in (b to g). The sequences reported in this study are shown in red bold, and indicated by the accession number, laboratory code, district, country code, week and year of isolation. The sequences downloaded from GISAID and NCBI database are indicated by their accession number followed by the country code and the year of isolation Cov1339 SNPs Cov0425 SNPs To properly present the mutants described in Table 1, confirmed mutations with more than 2500 reads are written in bold and underlined, and validated mutations with less than 2500 reads are written in italics. The six Tunisian SARS-CoV-2 sequences were non identical with 0.1% nucleotide divergence between each other. Compared to the Wuhan (NC045512) reference, the reported Tunisian SARS-CoV-2 sequences showed several synonymous and non-synonymous mutations, distributed over the entire genome. A total of 43 SNP mutations were identified with 27 and 16 SNPs in the non-structural and structural region respectively. Each sequence showed a different SNPs profile; the number of detected SNPs in each sequence ranged between 6 and 13. Most of them were located in the non-structural region affecting 08 out of the 16 non-structural genes ( Table 1). In the structural region, most of SNPs were detected in the S gene encoding for the spike protein.

Discussion
In this work, the full genomes sequences of six SARS-CoV-2 strains were obtained from RT-PCR positive samples collected in Tunisian patients between March and April 2020. Three cases had a recent travel history, one coming from Vietnam, one from Turkey and one from France; while the three remaining cases were infected locally through horizontal transmission. The six sequences were non-identical with 0.1% nucleotide divergence between each other and had several nucleotide changes compared to the first published COVID-19 sequence from Wuhan, China (NC045512). Phylogenetic analysis including representative sequences from 61 countries split the six sequences from Tunisia into 6 different clusters, together with sequences from various other countries. These results indicate multiple introductions of the SARS-CoV-2 virus in Tunisia.
According to recently published results by Yin in 2020, four major genotypes were suggested depending on the positions of SNPs [8]. The SNPs here reported in the positions 3037, 8782, 11,083, 14,408, 23,403 and 28, 144 allowed us to classify the Tunisian sequences into 3 distinct genotypes among the four major genotypes described: Genotype I (COV0880), genotype III (COV0010) and Genotype IV (COV0425, COV1339, COV1482 and COV1663). Indeed, these SNPs were identified as the most frequent mutations described since of the beginning of the pandemic in comparison with the Wuhan reference SARS-CoV-2 genome NC045512.2 [8,9]. The highly frequent SNP mutations was previously described in the S protein, RNA polymerase, RNA primase, and nucleoprotein, which are critical proteins for diagnosis and vaccine development [8,10].
Further complementary studies including clinical status of COVID-19 cases will be of a great interest to better understand the impacts of these SNP mutations on the transmissibility and pathogenicity of SARS-CoV-2.

Conclusions
Analysis of SARS-CoV-2 sequences from the maximum of affected countries and in different phases of the COVID-19 outbreak is crucial to understand the virus transmission and its genetic evolution through time. This preliminary study reports the first SARS-CoV-2 whole genome sequences from Tunisia, North Africa. Further sequences from different clinical and epidemiological settings are still needed to monitor the molecular epidemiology of SARS-CoV-2 in the country and worldwide.

Methods
Detection of SARS-CoV-2 infection was performed on nasopharyngeal swabs samples by specific real time reverse transcription polymerase chain reaction (RT-PCR) according to the WHO approved protocol published by Corman et al. and detecting specific sequences in the E, N and RdRp genes [11]. A total of 6 SARS-CoV-2 highly positive samples were assessed for whole genome sequencing using the NGS approach. The respective COVID-19 cases were detected between March and April 2020. They originated from six different districts of Tunisia. Three cases had recent travel history and three were infected locally.
A total of 140 μl of each nasopharyngeal sample was used for RNA extraction via the Qiamp viral RNA mini kit (Qiagen, Hilden, Germany). Reverse transcription of the whole genome was performed by The ProtoScript II First Strand cDNA synthesis kit (New England Biolabs, USA) using random hexamers. For the multiplex PCR approach, the general method described in (https://artic. network/nCoV-2019) was applied using Version 3 of amplicon set. After 40 amplification cycles, the PCR products were cleaned-up with AMPure XP magnetic beads (Beckman Coulter, USA) and quantified with the Qubit 2.0 fluorometer (ThermoFisher Scientific, USA).
The Nextera DNA Flex Prep kit (Illumina, USA) was used for library preparation and thereafter qualified on an Agilent Technologies 2100 Bioanalyser using a highsensitivity DNA chip following the manufacturer's instructions. Generated libraries were sequenced using the MiSeq System (Illumina, USA) providing 2 × 250 bp read length data.
To remove low quality reads, trim off low-quality and contaminant residues and filter out duplicated reads, fqCleaner v.0.5.0 was used, with Phred quality score of 28. Filtered reads were mapped against SARS-CoV-2 reference (NC045512) using Burrows-Wheeler Aligner MEM algorithm (BWA-MEM) (v0.7.7). SAM tools were used to sort BAM files, to generate alignment statistics and to obtain coverage data. Whole genome sequences of the six SARS-CoV-2 Tunisian strains were submitted to NCBI database under accession number MT499215 to MT499220.
Multiple sequences alignment was performed using Multiple Sequence Comparison by Log-Expectation (MUSCLE) software implemented in Molecular Evolutionary Genetics Analysis software (MEGA) version 7 [12]. Phylogenetic analysis was performed using the neighbor-joining analysis method and the Kimura 2parameter model with 1000 bootstrap replications. Genetic distances were calculated using the p-distance method.