Newly reported chloroplast genome of Sinosenecio albonervius Y. Liu & Q. E. Yang and comparative analyses with other Sinosenecio species

Sinosenecio B. Nordenstam (Asteraceae) currently comprises 44 species. Several chloroplast markers, including ndhC-trnV, rpl32-trnL, matK, and rbcL, have been used to investigate interspecific relationships and the phylogeny of Sinosenecio. However, the chloroplast genomes of this genus have not been thoroughly investigated. We sequenced and assembled the Sinosenecio albonervius chloroplast genome for the first time and performed a detailed comparative analysis with the previously reported chloroplast genomes of three other Sinosenecio species. The results showed that the chloroplast genomes of the four Sinosenecio species exhibit a typical quadripartite structure. The annotated genomes have equal numbers of total genes, protein-coding genes and RNA genes. Per genome, 49–56 simple sequence repeats and 99 repeat sequences were identified. Based on 54 protein-coding genes, thirty codons in the chloroplast genome of S. albonervius had RSCU values greater than 1, indicating biased usage. Among 18 protein-coding genes, 46 potential RNA editing sites were discovered. Comparison of the genome structures showed that the inverted repeat regions and coding regions are more conserved than the single-copy and non-coding regions. The junctions between the inverted repeat and single-copy regions differed only slightly. Several hotspots of genomic divergence were detected, which can be used as new DNA barcodes for species identification. Phylogenetic analysis of the whole chloroplast genomes showed that the four Sinosenecio species are closely related. This study reports the complete chloroplast genome of Sinosenecio albonervius, together with a comparison of chloroplast genome structure and variation in Sinosenecio and a phylogenetic analysis including related species. These results will help future research on Sinosenecio taxonomy, identification, origin, and evolution to some extent.


Background
The sophisticated oxygenic photosynthesis performed by chloroplasts is the most remarkable function of modern plastids. As photosynthetic organelles capable of supplying energy to green plants, chloroplasts play an important role in photosynthetic oxygen production, secondary metabolism, and the biosynthesis of starch, fatty acids, pigments, and amino acids. Chloroplasts and their complex signaling pathways provide a fine regulatory mechanism for plant development, metabolism, and environmental response, forming a major genetic system together with the nucleus and mitochondria [1][2][3].
*Correspondence: zhouqiang@jsu.edu.cn
Chloroplasts also have their own independent genomes. Most angiosperm chloroplast genomes are highly conserved and exhibit a typical quadripartite structure, comprising a large single-copy region (LSC), a small single-copy region (SSC), and two inverted repeat regions (IRs); they usually contain 110-130 genes and range in size from 120 to 160 kb [4]. Owing to its highly conserved nature, slow nucleotide substitution rate, and maternal inheritance, chloroplast DNA is an important source of information for taxonomic and phylogenetic research and has been widely used to study plant phylogeny [5].
Sinosenecio B. Nordenstam (1978) (Asteraceae) contains 44 species that are primarily found in central and southwestern China [6][7][8][9]. The genus is distinguished by stems that are subscapiform or leafy, leaves that are palmately or rarely pinnately veined, capitula that range from solitary to numerous, and involucres that are ecalyculate or calyculate, among other characters. Sinosenecio is divided into two species assemblages based on chromosome number and endothecial cell wall thickening patterns, namely the Sinosenecio s.s. group and the S. oldhamianus group [10][11][12][13]. These two groups also differ in geographical distribution. The former is restricted to mountainous regions around the Sichuan Basin, southwestern China, and the latter is widely distributed in central and southern China, with two species extending to Indochina.
Previously, several chloroplast markers, including ndhC-trnV, rpl32-trnL, matK, and rbcL, were used to determine the relationship of Sinosenecio species. However, the chloroplast genomes of this genus have not been thoroughly investigated. Here, we sequenced and assembled the chloroplast genome of Sinosenecio albonervius Y. Liu

Simple sequence repeats (SSRs) and repeat sequences
The S. albonervius chloroplast genome contained 53 simple sequence repeats (SSRs), including 26 mononucleotide repeats, seven dinucleotide repeats, eight trinucleotide repeats, and 12 tetranucleotide repeats (Fig. 2A). We counted the number of SSRs in the SC and IR regions (Fig. 2B) and the different types of SSRs in each chloroplast genome (Fig. 2C, Table S1). SSRs mainly occur in the LSC, and no SSRs were detected in the IR regions of S. baojingensis and S. albonervius. S. albonervius, S. jishouensis, S. baojingensis, and S. oldhamianus contain 53, 55, 49, and 56 SSRs, respectively. Notably, in S. baojingensis and S. oldhamianus the mononucleotide repeats outnumber all other types combined. The most common SSRs are mononucleotide repeats composed of A or T (Fig. 2D); S. oldhamianus has the most (35 mononucleotide repeats), whereas S. albonervius, S. jishouensis and S. baojingensis each have 26. Furthermore, we identified repeat sequences (> 10 bp) in the chloroplast genomes (Fig. 3, Table S2). Palindromic and forward repeats are more common than the other repeat types. For S. albonervius, 99 repeat sequences were identified, comprising 37 forward (F), 21 reverse (R), 37 palindromic (P), and four complement (C) repeats; the largest is a palindromic repeat of 48 bp.
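The four repeat categories reported above differ only in how the second copy relates to the first. The following minimal sketch illustrates that relationship on a toy motif and re-adds the reported S. albonervius counts; the function names are hypothetical and this is not the REPuter algorithm itself.

```python
# Illustrative sketch of the four repeat categories (F, R, C, P).
# Function names are hypothetical; this is not REPuter's implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def repeat_partner(seq: str, kind: str) -> str:
    """Return the sequence a second copy must equal for the given repeat type."""
    if kind == "F":  # forward: identical copy in the same orientation
        return seq
    if kind == "R":  # reverse: the copy read backwards
        return seq[::-1]
    if kind == "C":  # complement: base-wise complement, same orientation
        return seq.translate(COMPLEMENT)
    if kind == "P":  # palindromic: reverse complement
        return seq[::-1].translate(COMPLEMENT)
    raise ValueError(f"unknown repeat type: {kind}")

def classify_pair(first: str, second: str) -> list:
    """List every repeat type under which `second` is a copy of `first`."""
    return [k for k in "FRCP" if repeat_partner(first, k) == second]

# Counts reported in the text for S. albonervius.
reported = {"F": 37, "R": 21, "P": 37, "C": 4}
total_repeats = sum(reported.values())

print(classify_pair("AAGCT", "AGCTT"))  # ['P']
print(total_repeats)                    # 99
```

The tally confirms that the four reported categories add up to the 99 repeats stated for S. albonervius.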

Codon usage and RNA editing sites
The codon usage frequency and relative synonymous codon usage (RSCU) were calculated using 54 protein-coding sequences from the chloroplast genome of S. albonervius (Table 4). These protein-coding sequences contain 21,301 codons. Leu (2281 codons) and Cys (238 codons) are the most and the least frequently used amino acids, respectively. The RSCU analysis (Fig. 4) showed that 30 codons have an RSCU value greater than one, indicating biased usage of these codons. In contrast, Met and Trp are each encoded by a single codon (RSCU = 1) and therefore show no biased usage. Additionally, among the codons with RSCU > 1, only the Leu codon UUG is G-ending; the other 29 codons are A- or U-ending.
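RSCU is conventionally defined as the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below computes it for a toy coding sequence; the function name and the toy sequence are illustrative assumptions, and it is not the CodonW implementation. It assumes the standard codon assignments, which the plastid genetic code shares apart from start codons.

```python
from collections import Counter
from itertools import product

# Standard codon assignments in TCAG order (assumed to apply to plastid genes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(cds_list):
    """Return {codon: RSCU} for the sense codons of amino acids that occur."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        # Read complete codons only; skip triplets with ambiguous bases.
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for aa, codons in families.items():
        if aa == "*":
            continue  # stop codons are excluded
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # equal usage within the family
        for c in codons:
            values[c] = counts[c] / expected
    return values

# Toy sequence (hypothetical): Met, 3x Leu(TTA), 1x Leu(CTT), Trp, stop.
toy_values = rscu(["ATG" + "TTA" * 3 + "CTT" + "TGG" + "TAA"])
print(toy_values["TTA"], toy_values["CTT"], toy_values["ATG"])  # 4.5 1.5 1.0
```

In the toy case Leu has six synonymous codons and four observations, so the expected count per codon is 4/6 and TTA scores 3 / (4/6) = 4.5. Single-codon amino acids such as Met and Trp always score exactly 1, which is why they cannot show bias.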
A total of 46 potential RNA editing sites were found in 18 protein-coding genes from the chloroplast genome of S. albonervius (Table 5). The ndhB gene contains the most RNA editing sites (9 sites), while several genes (atpI, psbF, rpl20, rpoA, rpoB, and rps2) include only one editing site. All RNA editing sites are C-T conversions at the first (21.7%) or second (78.3%) codon position, indicating that editing at the third codon position is less frequent than at the first or second position. Furthermore, serine codons were edited more frequently than the codons of other amino acids, and the conversion from serine to leucine occurred most frequently.
Fig. 1 Gene map of the chloroplast genomes of S. albonervius. Genes inside the circle are transcribed clockwise, and those on the outside are transcribed counter-clockwise. Genes belonging to different functional groups have been colour-coded. The darker grey area in the inner circle corresponds to GC content, whereas the lighter grey corresponds to AT content
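The effect of a C-to-U edit depends on the codon position at which it occurs. The minimal sketch below applies a single edit to a codon and reports the amino acid before and after. The function name and example codons are illustrative assumptions, not sites from Table 5. The site counts of 10 and 36 are inferred from the reported percentages of the 46 sites and are not stated in the text.

```python
from itertools import product

# Standard codon assignments in TCAG order (assumed to apply to plastid genes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def edit_codon(codon: str, position: int):
    """Apply a C-to-U edit (written C->T at the DNA level) at a 1-based
    codon position and return (amino acid before, amino acid after)."""
    codon = codon.upper()
    if codon[position - 1] != "C":
        raise ValueError("no C at the requested codon position")
    edited = codon[:position - 1] + "T" + codon[position:]
    return CODON_TABLE[codon], CODON_TABLE[edited]

# Hypothetical example codons, chosen only to show the position effect.
print(edit_codon("TCA", 2))  # ('S', 'L')  serine -> leucine
print(edit_codon("CAT", 1))  # ('H', 'Y')  histidine -> tyrosine
print(edit_codon("CTC", 3))  # ('L', 'L')  third-position edit, synonymous

# Inferred counts: 10 and 36 of 46 sites reproduce the reported percentages.
first_pct = round(100 * 10 / 46, 1)
second_pct = round(100 * 36 / 46, 1)
print(first_pct, second_pct)  # 21.7 78.3
```

The third example shows why third-position edits are often silent at the protein level: many third-position C-to-U changes leave the encoded amino acid unchanged.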

Comparative genomic and nucleotide diversity analyses
The chloroplast genomes of the Sinosenecio species were compared to determine the level of divergence, with S. oldhamianus as the reference (Fig. 5). The IR regions and the coding regions are more conserved than the SC and non-coding regions. The coding region of the ycf1 gene, however, is the most divergent, with greater diversity than the coding regions of other genes. We also compared the IR, SC, and junction sites of the Sinosenecio species (Fig. 6). The size of the IR regions in the different chloroplast genomes ranges from 24,848 to 24,853 bp. Nucleotide diversity (Pi) values across the chloroplast genomes are shown in Fig. 7 and range from 0.00083 to 0.02611. The highest Pi value (0.02611) occurs in the accD-psaI region, and other high peaks (Pi > 0.013) are found in the following regions: trnK_UUU-rps16 (0.01583), ycf1 (0.01444), ccsA-ndhD (0.01333) and trnT_UGU-trnL_UAA (0.01306). These regions are primarily concentrated in the LSC, implying that the LSC contains the most highly diverse regions.

Phylogenetic analysis
A maximum likelihood (ML) phylogenetic tree was constructed using the chloroplast genome sequence alignments of 14 Asteraceae species (Fig. 8). All nodes have high support values, and Senecioneae of Asteraceae contains three major clades. The first clade includes the four species of Sinosenecio from subtribe Tephroseridinae, and the other two clades consist of eight species from subtribe Senecioninae. Within Sinosenecio, S. oldhamianus diverges first, followed by S. albonervius, and finally S. baojingensis and S. jishouensis. From the perspective of whole chloroplast genomes, Sinosenecio is phylogenetically close to Farfugium and Ligularia.

Basic characteristics of Sinosenecio species chloroplast genome
We assembled the complete chloroplast genome of S. albonervius and deposited it in GenBank (OL678114).
Comparing the chloroplast genomes of S. albonervius and the other three Sinosenecio species revealed that they share a typical quadripartite structure, with the same numbers of total genes, protein-coding genes and RNA genes as well as consistent GC content. They differ only slightly in the size of the SC and IR regions, which to some extent reflects the high degree of conservation of angiosperm chloroplast genomes. Eighteen genes in S. albonervius contain introns; introns can significantly affect RNA stability, regulation of gene expression, and alternative splicing [14]. Additionally, some genes are occasionally absent from plant chloroplast genomes. The loss of the rps7 gene is unique to gymnosperms, while the loss of at least seventeen genes (accD, ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, psaJ, rpl23, rpl32, rps15 and rps16) was found to be common in angiosperms. It is noteworthy that the four Sinosenecio species retain all seventeen of these loss-prone genes, most of which are related to NADPH-quinone oxidoreduction [15,16].

SSRs and repeat sequences
Simple sequence repeats (SSRs) are tandem DNA repeats with short motifs found in plant nuclear, mitochondrial and chloroplast genomes, and they exhibit polymorphism and a codominant inheritance pattern. These sequences have been widely used to assess genetic variation among plant genotypes and as DNA markers in population genetic research [17,18]. SSR abundance varies among species [19]. Different numbers of SSRs were detected in the chloroplast genomes of the Sinosenecio species, and most of the SSRs occur in the SC regions, especially the LSC region. We found that A or T mononucleotide repeats are the most common repeat type, and all mononucleotide repeats are composed of A or T. These results are consistent with previous reports that A and T repeats are the most abundant in most angiosperm chloroplast genomes, which rarely contain tandem G or C repeats [20]. Furthermore, we discovered 99 repeat sequences in the S. albonervius chloroplast genome, the largest of which is a 48-bp palindromic repeat. Repeat sequences are essential genetic resources that play a significant role in phylogenetic studies. Larger and more complex repeat sequences may significantly impact chloroplast genome rearrangement and sequence divergence [21][22][23][24].

Codon usage analysis and RNA editing sites
In many organisms, synonymous codons, which encode the same amino acid, are used at different frequencies, a phenomenon known as codon bias. The genetic code is usually conserved between organisms, but the frequency of codon usage for each amino acid differs. The selection of which codons are frequent and which are rare is generally consistent within each genome [25][26][27][28]. In our study, the RSCU values of 30 codons are greater than one, indicating codon bias for the corresponding amino acids. Twenty-nine of these codons end in A or T, as in most chloroplast genomes, most likely because of the compositional bias towards a high A/T ratio [29]. Codon usage bias is a common characteristic of eukaryotic genomes and is critical for regulating gene expression [30]. Previous research has revealed that RNA editing is a universal phenomenon in higher plants, with the exception of a subclass of liverworts, the complex thalloid marchantiid liverworts [31]. It is a process that converts specific RNA nucleotides from C to U, and less frequently from U to C, in mitochondria and plastids, thereby altering the RNA sequence encoded by the genome [32,33]. In our study, the 46 potential RNA editing sites in 18 protein-coding genes of the S. albonervius chloroplast genome were all C-T conversions at the first or second codon position (21.7 vs. 78.3%). According to previous research, the editing site is usually at the first or second base of a codon, resulting in a hydrophilic amino acid being transformed into a hydrophobic one [1,32].

Genomes comparison and nucleotide diversity
We found that the chloroplast genomes of the Sinosenecio species are highly conserved, with high sequence similarity and conserved gene order. The IR and coding regions are more conserved than the SC and non-coding regions, in agreement with previous findings [34,35]. The expansion and contraction of boundary regions are evolutionary events that influence the size of chloroplast genomes [36]. The length of the IR regions ranges from 24,848 to 24,853 bp in the Sinosenecio genomes. Two mechanisms have been proposed: small IR expansion and movement are due to gene conversion, while double-stranded DNA breaks and recombination cause major IR expansion [37,38]. Furthermore, IRs can stabilize plastomes, and plastomes with IRs are more stable in terms of genomic alignment than those lacking one or both IRs [5]. Nucleotide diversity analysis identified hotspot regions of genome divergence, which can be used as new DNA barcodes in species identification [39]. These high-Pi loci (accD-psaI, trnK_UUU-rps16, ycf1, ccsA-ndhD, trnT_UGU-trnL_UAA) are mostly found in the LSC region. Some of these regions, such as ycf1, ccsA-ndhD, and trnT_UGU-trnL_UAA, have been reported in previous studies of chloroplast genomes [40]. The IR regions are more conserved than the SC regions, which may be due to copy correction between IR sequences by gene conversion [41].

Phylogenetic relationships
Chloroplast genome sequences with sufficient variable loci have been successfully used for classification and phylogenetic studies [42]. To determine the phylogenetic relationships of Sinosenecio, we assembled a dataset of chloroplast genome sequences. The interspecific relationships within Sinosenecio are strongly supported by the phylogenetic analysis, and this result is essentially consistent with their taxonomy. However, Sinosenecio is a large genus with 44 species, and only four species' chloroplast genome sequences were used in this analysis, which precludes a more comprehensive comparison with phylogenetic results inferred from other chloroplast fragments (ndhC-trnV, rpl32-trnL) or nuclear genes. In addition, according to Liu 2010, S. albonervius, S. baojingensis, S. jishouensis, and S. oldhamianus were considered, based on chromosome number and patterns of endothecial cell wall thickenings, to be members of the S. oldhamianus group. This group is closely related to Nemosenecio (Kitam.) B. Nord of subtribe Tephroseridinae and may represent a new genus or should be merged into Nemosenecio [10,43,44]. At present, however, there are not enough molecular data on Nemosenecio to test this conclusion at the level of the chloroplast genome. Therefore, broader taxon sampling and a more comprehensive analysis of chloroplast genomes are necessary to understand the relationships within Sinosenecio more deeply.

Conclusions
The complete chloroplast genome of S. albonervius was assembled and compared with those of other Sinosenecio species. The Sinosenecio chloroplast genomes share structural characteristics such as strict gene order, stable GC content, and relatively conserved IR and coding regions, while boundary region expansion and contraction influence genome size. Some codons in S. albonervius show usage bias, which is critical for regulating gene expression. Forty-six RNA editing sites were detected in 18 protein-coding genes, with editing events occurring at the first and second positions of the codon. Furthermore, the phylogenetic analysis strongly supported the interspecific relationships within Sinosenecio, and some of the hotspot regions of genome divergence in this genus can be used as new DNA barcodes in species identification. Our study provides valuable information for future research on taxonomy, identification, and systematic evolution in Sinosenecio.

Plant materials, DNA extraction and sequencing
Fresh S. albonervius leaves were collected from Hupingshan Natural Reserve in Hunan Province, China, and dried with silica gel. The voucher specimen was deposited at the herbarium of Jishou University. Plant Genomic DNA Kit DP305 (Beijing, China) was used to extract high-quality total DNA from the silica-dried leaves. Whole-genome sequencing was performed on the Illumina HiSeq platform by Guangdong Mercells Cell Biotechnology Co., Ltd. (Foshan, China).

Assembly and annotation
The clean data were used to assemble the complete chloroplast genome sequence of S. albonervius with the program GetOrganelle [45], and this sequence was annotated with the web tool GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html) [46]. The results were checked and manually adjusted in Geneious 9.0.2 using S. jishouensis as a reference. Finally, the S. albonervius chloroplast genome was uploaded to NCBI (GenBank: OL678114). The chloroplast genome map of S. albonervius was drawn using OGDraw (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) [47].

Chloroplast genome analysis
Simple sequence repeats (SSRs) were detected using the MISA online tool (https://webblast.ipk-gatersleben.de/misa/) [48], with the minimum number of repeats set to ten, five, and four for mononucleotide, dinucleotide, and trinucleotide motifs, respectively, and to three for tetranucleotide, pentanucleotide, and hexanucleotide motifs [49]. REPuter was used to analyze forward, palindromic, reverse, and complement repeats with a minimum repeat length of 10 bp and a minimum sequence identity greater than 90% [1,50].
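The repeat-number thresholds above can be illustrated with a simplified regular-expression scan. This is a sketch under stated assumptions, not the MISA implementation: the function names and the toy sequence are hypothetical, only perfect repeats are reported, and compound SSRs are not merged.

```python
import re

# Minimum repeat numbers per motif length, as stated in the text:
# mono 10, di 5, tri 4, tetra/penta/hexa 3.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

def is_primitive(unit: str) -> bool:
    """True if the motif is not itself a repetition of a shorter motif
    (e.g. 'AA' is rejected as a dinucleotide motif)."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k)
                   for k in range(1, n))

def find_ssrs(seq: str):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // size))
    return sorted(hits)

# Toy sequence (hypothetical) with one SSR of each of four motif lengths.
toy = ("CGT" + "A" * 12 + "CGC" + "AT" * 6 + "CGC"
       + "AAG" * 4 + "C" + "AGAT" * 3 + "G")
toy_hits = find_ssrs(toy)
print(toy_hits)
# [(3, 'A', 12), (18, 'AT', 6), (33, 'AAG', 4), (46, 'AGAT', 3)]
```

The primitive-motif check keeps a run of 12 A's from also being reported as a dinucleotide, trinucleotide or tetranucleotide repeat.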
The expansion and contraction of the IR regions in the Sinosenecio chloroplast genome sequences were studied using the IRscope online program (https://irscope.shinyapps.io/irapp/) [51]. The codon usage of the S. albonervius chloroplast genome was analyzed using CodonW in MEGA [52]; protein-coding genes shorter than 300 nucleotides and duplicated gene sequences were removed to reduce bias in the results. In addition, the putative RNA editing sites of 18 protein-coding genes were predicted with the PREP-Cp web server (http://prep.unl.edu/cgi-bin/cp-input.pl), with a cutoff value of 0.8 [53].
The Sinosenecio chloroplast genomes obtained from GenBank were compared with that of S. albonervius using the mVISTA online program in Shuffle-LAGAN mode [54], with S. oldhamianus as the reference.
For the nucleotide diversity analysis, the complete Sinosenecio chloroplast genome sequences were aligned using MAFFT [55]. A sliding window analysis with a window length of 600 bp and a step size of 200 bp was performed in DnaSP to estimate the nucleotide diversity values [5,56].
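The sliding window estimate can be sketched as the mean number of pairwise differences per site within each window. The code below is a simplified illustration, not the DnaSP implementation: the function names are hypothetical, and dropping columns that contain gaps or ambiguous bases is an assumption made here for simplicity.

```python
from itertools import combinations

def window_pi(aligned, start, end):
    """Mean pairwise differences per site over columns [start, end).
    Columns containing gaps or ambiguous bases are skipped (an assumption)."""
    pairs = list(combinations(range(len(aligned)), 2))
    valid_sites = 0
    differences = 0
    for col in range(start, end):
        column = [seq[col].upper() for seq in aligned]
        if any(base not in "ACGT" for base in column):
            continue
        valid_sites += 1
        differences += sum(column[i] != column[j] for i, j in pairs)
    if valid_sites == 0:
        return 0.0
    return differences / (len(pairs) * valid_sites)

def sliding_pi(aligned, window=600, step=200):
    """Return (window midpoint, Pi) tuples along an alignment of equal-length
    sequences. Defaults follow the window and step stated in the text."""
    length = len(aligned[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append(((start + end) // 2, window_pi(aligned, start, end)))
    return results

# Toy alignment (hypothetical): four sequences, one variable site, one gap.
toy_alignment = ["AAAAAAAAAA", "AAAAAAAAAA", "AAATAAAAAA", "AAATAAA-AA"]
toy_pi = sliding_pi(toy_alignment, window=4, step=2)
print([(mid, round(pi, 4)) for mid, pi in toy_pi])
# [(2, 0.1667), (4, 0.1667), (6, 0.0), (8, 0.0)]
```

In the toy alignment the variable column separates two sequences from the other two, giving 4 differing pairs out of 6. A 4-column window containing that site therefore has Pi = 4 / (6 x 4), about 0.167.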

Phylogenetic analysis
Thirteen complete chloroplast genome sequences, comprising three Sinosenecio species and ten other Asteraceae species, were downloaded from GenBank to clarify the phylogenetic position of S. albonervius and its relationship with related species. The genus Aster was selected as the outgroup. All sequences were aligned using MAFFT, and maximum likelihood analysis was performed with RAxML 8.2.12 on the CIPRES Portal (https://www.phylo.org/portal2) under the GTRGAMMA model with 1000 bootstrap replicates [57].