Assembly and comparative analysis of the complete mitochondrial genome of Suaeda glauca

Background Suaeda glauca (S. glauca) is a halophyte widely distributed in saline and sandy beaches, with strong saline-alkali tolerance. It is also admired as a landscape plant with high development prospects and scientific research value. The S. glauca chloroplast (cp) genome has recently been reported; however, the mitochondria (mt) genome is still unexplored. Results The mt genome of S. glauca were assembled based on the reads from Pacbio and Illumina sequencing platforms. The circular mt genome of S. glauca has a length of 474,330 bp. The base composition of the S. glauca mt genome showed A (28.00%), T (27.93%), C (21.62%), and G (22.45%). S. glauca mt genome contains 61 genes, including 27 protein-coding genes, 29 tRNA genes, and 5 rRNA genes. The sequence repeats, RNA editing, and gene migration from cp to mt were observed in S. glauca mt genome. Phylogenetic analysis based on the mt genomes of S. glauca and other 28 taxa reflects an exact evolutionary and taxonomic status of S. glauca. Furthermore, the investigation on mt genome characteristics, including genome size, GC contents, genome organization, and gene repeats of S. gulaca genome, was investigated compared to other land plants, indicating the variation of the mt genome in plants. However, the subsequently Ka/Ks analysis revealed that most of the protein-coding genes in mt genome had undergone negative selections, reflecting the importance of those genes in the mt genomes. Conclusions In this study, we reported the mt genome assembly and annotation of a halophytic model plant S. glauca. The subsequent analysis provided us a comprehensive understanding of the S. glauca mt genome, which might facilitate the research on the salt-tolerant plant species. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07490-9.


Background
Chenopodiaceae is among the large families of angiosperms that mainly include Spinacia oleracea, Chenopodium quinoa Willd, and Beta vulgaris [1][2][3]. Chenopodiaceae plants are mostly annual herbs, half shrubs, shrubs, living in the desert, and saline soil areas. Therefore, they often show xerophytic adaptation. As an annual herb of Chenopodiaceae, S. glauca grows in saline-alkali land and beaches. It displays a strong salt tolerance and drought tolerance capacity and has high value as medicine and food material [4][5][6]. Moreover, S. glauca possesses immense ecological importance as it can tolerate heavy metals at higher levels and could be used as a super accumulator of heavy metals. The environmental protection and remediation of contaminated soil make it a natural resource with significant economic and ecological importance [7].
Plant mt is involved in numerous metabolic processes related to energy generation and the synthesis and degradation of several compounds [8]. Margulis' endosymbiosis theory suggests that mt originated from archaea living in nucleated cells when eukaryotes swallowed the bacteria. Later it evolved into organelles with special functions during the long-term symbiosis [9][10][11], incorporated as an additional mt genome. Mitochondria convert biomass energy into chemical energy through phosphorylation and provide energy for life activities. Besides, it is involved in cell differentiation, apoptosis, cell growth, and cell division [12][13][14][15]. Therefore, mitochondria play a crucial role in plant productivity and development [16]. For most seed plants, nuclear genetic information is inherited from both parents, while cp and mt are inherited from the maternal parent. This genetic mechanism eliminates the paternal lines' influence, thus reducing the difficulty of genetic research and facilitating the study of genetic mechanisms [17].
With the development of sequencing technology, an increasing number of mt genomes have been reported. Up to Jan. 2021, 351 complete mt genomes have been deposited in GenBank Organelle Genome Resources. Long periods of mutualism leave mitochondria with some of their original DNA lost, and some of them transferred, leaving only the DNA that codes for it [18,19]. Mt DNA has long been recognized as tending to integrate DNA from various sources through intracellular and horizontal transfer [20]. Therefore, the mt genome in plants has significant differences in length, gene sequence, and gene content [21]. The mt genome length of the smallest known terrestrial plant is about 66 Kb, and the largest terrestrial plant mt genome length is 11.3 Mb [22,23]. As a result, the amount of genes in terrestrial plants varies widely, typically between 32 and 67 [24]. In this study, we sequenced and annotated the mt genome of S. glauca and compared it with the genomes of other angiosperms (as well as gymnosperms), which provides additional information for a better understanding of the genetics of the halophyte S. glauca.

Results
Genomic features of the S. glauca mt genome The S. glauca mt genome is circular with a length of 474,330 bp. The base composition of the genome is A (28.00%), T (27.93%), C (21.62%), G (22.45%). There are 61 genes annotated in the mt genome, including 27 protein-coding genes, 29 tRNA genes, and 5 rRNA genes. The functional categorization and physical locations of the annotated genes were presented in Fig. 1. According to our findings, the mt genome of S. glauca encodes 26 different protein (nad7 has two copies) that could be divided into 9 classes (Table 1): NADH dehydrogenase (7 genes), ATP Synthase (5 genes), Cytochrome C Biogenesis (4 genes), Cytochrome C oxidase (3 genes), Ribosomal proteins (SSU) (3 genes), Ribosomal proteins (LSU) (1 gene), Transport membrane protein (1 gene), Maturases (1 gene), and Ubiquinol Cytochrome c Reductase (1 gene). The homologs of S. glauca mt genes in the mt genomes of H. sapiens, S. cerevisiae, and A. thaliana were identified and listed in Table S1. All of the protein-coding genes used ATG as starting codon, and all three stop codons TAA, TGA, and TAG were found with the following utilization rate: TAA 44.4%, TGA 37.04%, and TAG 18.52% (Table S2). It is reported that the mt genomes of land plants contain variable number of introns [25]. In the mt genome of S. glauca, there are 8 intron-containing genes (nad2, nad5, nad7 with two copyies, cox2, ccmFc, trnA-UGC, and trnV-AAC) harboring 15 introns in total with a total length of 16,743 bp. The intron lengths varied from 105 bp (trnV-AAC) to 2103 bp (nad2). The gene nad7 has two copies in the mt genome, and each copy contains 4 introns, which is the highest intron number. The trnV-AAC, instead, contains only one intron with a length of 105 bp, which is the smallest intron.
It has been reported that most land plants contain 3 rRNA genes [9,11]. Consistently, three rRNA genes rrn5 (119 bp), rrnS (1303 bp), and rrnL (1369 bp) were annotated in S. glauca mt genome. Besides, 20 different transfer RNAs were identified in S. glauca mt genome transporting 18 amino acids, since more than one transfer RNAs might transport the same amino acid for different codons. For example, trnS-UGA and trnS-GCU transport Ser for synonymous codons UCA and AGC, respectively. Moreover, we observed that transfer RNA trnF-GAA, trnM-CAU, and trnN-GUU have two different structures with the same anticodon. Taking trnM-CAU as an example, both A and B structures share the same anticodon CAU transporting amino acid Met ( Figure S1).

Repeat sequences anaysis
Microsatellites, or simple sequence repetitions (SSRs), are DNA fragments consisting of short units of sequence repetition of 1-6 base pairs in length [26]. The uniqueness and the value of microsatellites are due to their polymorphism, codominant inheritance, relative abundance, extensive genome coverage, and simplicity in PCR detection [27]. SSRs in the mt genome of S. glauca were identified with Tandem Repeats Finder software [28]. As a result, 361 SSRs were found in the mt genome of S. glauca, and the proportion of different forms were shown in Figure S2. SSRs in monomer and dimer forms accounted for 78.67% of the total SSRs present. Adenine (A) monomer repeats represented 46.28% (56) of 121 monomer SSRs, and AT repeat was the most frequent type among the dimeric SSRs, accounting for 58.15%.
There are only two hexameric SSRs presented in S. glauca mt genome, located between nad4L and cox2, and between trnQ-UUG and trnM-CAU. The specific locations of pentamer and hexamer are shown in Table 2. Tandem repeats, also named satellite DNA, refer to the core repeating units of about 1 to 200 bases, repeated several times in tandem. They are widely found in eukaryotic genomes and in some prokaryotes [29]. As shown in Table 3, a total of 12 tandem repeats with a matching degree greater than 95% and a length ranging from 13 bp to 38 bp were present in the mt genome of S. glauca. The non-tandem repeats in S. glauca mt genome were also detected using REPuter software [30]. As a result, 928 repeats with the length equal to or longer than 20 were observed, of which 483 were direct, and 445 were inverted. The longest direct repeat was 30,706 bp,  while the longest inverted repeat was 12,556 bp (Supplementary data sheet 1). The length distribution of the direct and inverted repeats are shown in Fig. 2. It is shown that the 20-29 bp repeats are most abundant for both repeat types.

The prediction of RNA editing
RNA editing refers to the addition, loss, or conversion of the base in the coding region of the transcribed RNA [31], found in all eukaryotes, including plants [32]. In chloroplast and mitochondrion, the conversion of specific cytosine into uridine alters the genomic information [33]. This process improves protein preservation in plants by modifying codons. Without the support of the proteomics data, it is impossible to detect accurate RNA editing. However, Mower's software PREP could be used to computationally predict the RNA edit site [34]. In this analysis, 216 RNA editing sites within 26 protein-coding genes ( Table 4) were predicted in the mt genome of S. glauca, using PREP-MT program (Fig. 3). Among those protein-coding genes, cox1 does not have any editing site predicted, while ccmB has the most editing sites predicted (29). Of those editing sites, 35.19% (76) were located at the first position of the triplet codes, 63.89% (138) occurred with the second base of the triplet codes.
And there was a particular editing case in which the first and second positions of the triplet codes were edited, resulting in an amino acid change from the original proline (CCC) to phenylalanine (TTC). After the RNA editing, the hydrophobicity of 42.13% of amino acids did not change. However, 45.83% of the amino acids were were predicted to change from hydrophilic to hydrophobic, while 11.11% were predicted to change from hydrophobic to hydrophilic. The RNA editing might lead to the premature termination of protein-coding genes, and this phenomenon is likely to occur with atp4 and atp9 in S. glauca mt genome. Our results also showed that the amino acids of predicted editing codons showed a leucine tendency after RNA editing, which is supported by the fact that the amino acids of 47.69% (103 sites) of the edits were converted to leucine (Table 4).

DNA migration from chloroplast to mitochondria
Thirty-two fragments with a total length of 26.87 kb were observed to be migrated from cp genome to mt genome in S. glauca, accounting for 5.18% of the mt  genome. There are 8 annotated genes located on those fragments, all of which are tRNA genes, namely trnA-UGC, trnF-GAA, trnH-GUG, trnI-GAU, trnR-ACG, trnM-CAU, trnN-GUU, and trnV-GAC. Our data also demonstrate that some chloroplast protein-coding genes, i.e. atpA, rrn16, rrn23, rpoC2, ndhA, psaB, and psbB migrated from cp to mitochondrion, even though most of them lost their integrities during evolution, and only partial sequences of those genes could be found in the mt genome nowadays ( Table 5). The different destinations of transferred protein-coding genes and tRNA genes suggested that tRNA genes are much more conserved in the mt genome than the protein-coding genes, indicating their indispensable roles in mitochondria.

Phylogenetic analysis within higher plant mt genomes
To understand the evolutionary status of S. glauca mt genome, the phylogenetic analyses was performed on S. glauca together with other 28 species, including 22 eudicots, 4 monocots, and 2 gymnosperms (designated as outgroups). Abbreviations and the accession number of mt genomes investigated in this study are listed in Table S3. A phylogenetic tree was obtained based on an aligned data matrix of 23 conserved protein-coding genes from these species, as shown in Fig. 4. The phylogenetic tree strongly supports the separation of eudicots from monocots and the separation of angiosperms from gymnosperms. Moreover, the taxa from 13 families (Leguminosae, Cucurbitaceae, Apiaceae, Apocynaceae, Solanaceae, Rosaceae, Caricaceae, Brassicaceae, Salicaceae, Chenopodiaceae, Gramineae, Cycadaceae, and Ginkgoaceae) were well clustered. The order of taxa in the phylogenetic tree was consistent with the evolutionary relationships of those species, indicating the consistency of traditional taxonomy with the molecular classification. Based on the phylogenetic relationships among the 29 species, different groups of plants were selected for further comparative analysis.
The comparison of mt genome size and GC content between S. glauca and other species The size and GC content are the primary characteristics of an organelle genome. We compared the size and GC content of S. glauca with other 35 green plants, including 4 phycophyta, 3 bryophytes, 2 gymnosperms, 4 monocots, and 22 dicots. The abbreviations of species names of those plants and the accession numbers of their mt genomes are listed in Table S3. As shown in Fig. 5, the sizes of mt genomes varied from 15,758 bp (C. reinhardtii) to 1,555,935 bp (C. sativus). The sizes of mt genomes of phycophyta and bryophytes were generally smaller compared to land plants, while that of S. glauca (474,330 bp) has an average size. Similarly, the GC contents of the mt genomes were also variable, ranging from 32.24% in S. palustre to 50.36% in G. biloba.
In general, the GC contents of angiosperms, including monocots and dicots, are larger than those of bryophytes but smaller than those of gymnosperms, suggesting that the GC contents frequently changed after the divergence of angiosperms from bryophytes and gymnosperms. Interestingly, our results also showed that the GC contents fluctuate widely in phycophyta. In contrast, the GC contents in angiosperms were much conserved during the evolution, although their genome sizes varied tremendously.
Comparison of genome organization with ten green plant mt genomes The S. glauca mt genome organization was extensively investigated for protein-coding genes, cis-spliced introns, rRNAs tRNAs, and non-coding regions. It was further compared with 10 other taxa, including 3 plants from Chenopodiaceae. As shown in Table 6, protein-coding genes and cis-introns regions represent 5.00% and 3.92% of the whole S. glauca mt genome sequence, respectively. In comparison, the proportions of rRNA and tRNA regions represent only 1.17% and 0.47%, respectively. The other three plants from Chenopodiaceae have similar proportions of protein-coding genes, slightly higher than that of S. glauca. However, the proportions of coding regions were significantly different across families, probably due to the different mt genome sizes.

Gene duplication and lost in mt genomes of Chenopodiaceae plants
With the rapid development of sequencing technology, an increasing number of complete plant mt genomes were assembled and reported recently, facilitating the comparison analysis of the mt genome features among multiple plant species [35]. As described by Richardson et al., the mt genomes in plants vary considerably in size, gene content, and gene order [21]. The Chenopodiaceae plants have a relatively strong tolerance to biotic stress, especially to salt. Four mt genomes from this family: C. quinoa willd, S. oleracea, B. vulgaris, and S. glauca are already available. To understand whether those four plants have the same gene contents, the protein-coding genes from those 4 mt genomes were compared. As shown in Table S4, the specific gene duplication and gene loss were observed in different species. For example, nad7 was duplicated in S. glauca mt genome, and nad1 and rps7 were duplicated in B. vulgaris mt genome. The C. quinoa has the most intact mt genome, with only one gene (sdh4) loss, while atp4 and ccmC from B. vulgaris ssp, and nad1 and shh4 from S. oleracea were also lost. However, with five genes, nad4, nad6, rps4, rps13, and tatC, gene loss appears more frequent in the mt genome of S. glauca.

The substitution rates of protein-coding genes
The calculation of non-synonymous substitutions (Ka) and synonymous substitutions (Ks) is of great significance for the reconstruction of phylogeny and the understanding of evolutionary dynamics of protein-coding sequences in closely related species [36]. In genetics, Ka/ Ks value could be used to determine whether selective pressure existed on a specific protein-coding gene during evolution: Ka/Ks > 1, positive selection; Ka/Ks = 1, neutral selection; and Ka/Ks < 1, negative selection [37].  Fig. 6, the Ka/Ks values of S. glauca ccmB compared to G. max, S. suchowensis, A. thaliana, N. tabacum, and C. papaya were higher than 1, suggesting a positive selection occurred during evolution. However, the Ka/Ks values of most proteins in S. glauca were less than 1 compared to the other plant species, indicating the negative selections of those genes during evolution. Taken together, we conclude that the mt genes are highly conserved during the evolutionary process in green plants.

Discussion
Mitochondria are the powerhouse of the plants that produce the required energy to carry out life processes. Plant mitochondria possess more complex genomes than animals, with extensive size variations, sequence arrangements, repeat content, and a highly conserved coding sequence [38]. Understanding the mt genome structure is required to unravel its function, replication, inheritance, and evolutionary trajectories [38]. In the current study, we studied the characteristics of the mt genome of S. glauca, a crucial salt tolerance plant with great value as a food source and phytoremediation agent. According to the reported data, most of the mt genome is circular, and few mt genomes are linear such as the mt genome of Polytomella parva [39,40]. The mt genome of S. glauca reported in this study is circular with 474,330 bp in size. The repeat sequences widely exist in the mt genome, and these repeats include tandem, short, and large repeats [41,42]. Previous studies have shown that repeats in mitochondria are vital for  intermolecular recombination. For this reason, the repeat sequences play a pivotal role in shaping the mt genome [43]. In this study, the SSRs, longer tandem repeats, and non-tandem repeats were intensively investigated (Fig. 2). The mt genome of S. glauca harbors abundant repeat sequences that might indicate that the intermolecular recombination frequently happens in the mt genome, which dynamically changes the sequence and conformation during the evolution. We also investigated the genome structure and organization of S. glauca in comparison with other land plants. Conclusively, the mt genome characteristics of S. glauca were consistent with those of other terrestrial green plants. RNA-editing is a posttranscriptional process that occurs in the cp and mt genomes of higher plants, contributing to the better folding of proteins [44]. Investigating the RNA-editing sites helps to understand the gene expression of the cp and mt genes in plants. Previous studies reported approximately 441 RNA-editing sites within 36 genes in Arabidopsis and 491 RNA-editing sites within 34 genes in rice [39,45]. In this study, 216 RNAediting sites within 26 genes were identified. The identification of RNA editing sites provides essential clues for predicting gene functions with novel codons. As the cytoplasmic genome, migration of cp DNA to the mt genome occurred during the plant evolution. We found that 32 fragments were transferred from the cp genome to mt with 8 integrated genes, which are all tRNA genes ( Table 5). Transfer of tRNA genes from cp to mt is common in angiosperms [44].
Further, we have analyzed the phylogenetic relationship of S. glauca with representative taxa based on the mt genome information. The resulted phylogenetic tree reflected a clear taxonomic relationship among the taxa. We also analyzed GC content of the mt genome in S. glauca along with other green plants. The result supports the conclusion that GC content is highly conserved in higher plants. The Ka/Ks analysis and the comparison of genome features with other plant's mt genomes provide a comprehensive understanding of plant mt evolution. Generally, most of the results in this study were consistent with previous reports. The genes that undergone neutral and negative selections were also identified in S. glauca. However, most of the protein-coding genes in S. glauca mt had negative selection compared with other selected species, which is consistent with the previous studies, indicating that the protein-coding genes in the mt genome are conserved across the land plants. The ccmB gene is the only gene that underwent positive selection during the evolution.
In crop plants, deciphering and understanding the mt genome is essential for plant breeding. Understanding of mt genome will set a foundation for the evolutionary analysis, cytoplasmic male sterility, and molecular biological information for plant breeding. Even though S. glauca is not a crop plant, its biological significance and edible values are being examined. As a halophytic model plant with prominent salt-tolerance, whose mt genome has not been reported, the accomplishment of the mt genome provides an opportunity to conduct further genomic studies in S. glauca. Therefore, our study provides essential background information for future understanding of this plant [44].

Conclusion
In this study, we assembled and annotated the mt genome of S. glauca and performed extensive analyses based on the DNA sequences and amino acid sequences of the annotated genes. The S. glauca mt genome is circular, with a length of 474,330 bp. 61 genes, including 27 protein-coding genes, 29 tRNA genes, and 5 rRNA genes, were annotated in the genome. The repeats sequences and RNA editing in S. glauca mt genome were analyzed subsequently. The gene conversation between mt and cp genome was also observed in S. glauca by detecting gene migration. Moreover, our result also indicates consistency in molecular and taxonomic classification, besides GC contents in angiosperms, were also found conserved despite their genome sizes that varied tremendously. The Ka/Ks analysis based on code substitution revealed that most of the coding genes had undergone negative selections, indicating the conservation of mt genes during the evolution. This study provides extensive information about the mt genome for S. glauca, facilitating deciphering the salt resistance mechanism in plants.  [49]. The output genebank format file was manually confirmed, and the mitochondrial circular map was drawn using Organellar Genome DRAW (OGDRAW) [50].
Chloroplast to mitochondrion DNA transformation and RNA editing analyses DNA migration is common in plants and varies from species to species [52]. This phenomenon occurs during autophagy, gametogenesis, and fertilization [53]. The cp genome of S. glauca (NC_045302.1) was downloaded from NCBI Organelle Genome Resources Database.  Blastn software on NCBI was used to identify the protein-coding and tRNA genes transferred from chloroplasts to mitochondria. Screening criteria were set as the matching rate ≥ 70%, E-value ≤ 1e − 10 , and length ≥ 40. The editing sites in the mitochondrial RNA of S. glauca were revealed using the mt gene encoding proteins of plants as references. The analysis was conducted on the Plant Predictive RNA Editor (PREP) suite [34] (http:// prep.unl.edu/) with a cut off value of 0.2.

Phylogenetic tree construction and Ka/Ks analysis
The conserved protein-coding genes from mt genomes of S. glauca and other 28 taxa were used for phylogenetic tree construction. The mt genomes were downloaded from NCBI, and the conserved protein-coding genes (atp1, atp4, atp6, atp8, atp9, ccmB, ccmC, ccmFc, ccmFn, cob, cox1, cox2, cox3, matR, nad1, nad2, nad3, nad4L, nad5, nad6, nad7, and nad9) were extracted using TBtool software [54], and then aligned using Muscle software [55]. Subsequently, a Neighbor-joining (NJ) tree was constructed by Mega 7.0 software using the Poisson model with a bootstrap of 1000 [49]. C. taitungensis and G. biloba were designated as the outgroup in this analysis. The synonymous (Ks) and nonsynonymous (Ka) substitution rates of the proteincoding genes in S. glauca mt genome were analyzed using ten representative species (Table S3) as references.