Comparative and phylogenetic analysis of the complete chloroplast genomes of 10 Artemisia selengensis resources based on high-throughput sequencing

Wang, Yuhang; Wei, Qingying; Xue, Tianyuan; He, Sixiao; Fang, Jiao; Zeng, Changli

doi:10.1186/s12864-024-10455-3

Research
Open access
Published: 05 June 2024

Yuhang Wang¹,
Qingying Wei¹,
Tianyuan Xue¹,
Sixiao He¹,
Jiao Fang² &
…
Changli Zeng¹

BMC Genomics volume 25, Article number: 561 (2024) Cite this article

Abstract

Background

Artemisia selengensis, classified within the genus Artemisia of the Asteraceae family, is a perennial herb recognized for its dual utility in culinary and medicinal domains. There are few studies on the chloroplast genome of A. selengensis, and the phylogeographic classification is vague, which makes phylogenetic analysis and evolutionary studies very difficult.

Results

The chloroplast genomes of the 10 A. selengensis materials in this study were highly conserved in gene content, gene order, and intron number. The genome lengths ranged from 151,148 to 151,257 bp, and all genomes had a typical quadripartite structure with a total GC content of approximately 37.5%. Each chloroplast genome encoded 133 genes, including 88 protein-coding genes, 37 tRNA genes, and 8 rRNA genes. Owing to contraction and expansion of the inverted repeats (IR), the ycf1 and ndhF genes overlapped at the boundary between inverted repeat B (IRB) and the small single-copy (SSC) region. Codon usage analysis showed that the third codon position in the A. selengensis chloroplast genome was most frequently A/T. The number of SSRs was 42–44 per genome, most of which were mononucleotide A/T repeats. Sequence alignment showed that variable regions were mainly distributed in the single-copy regions; sliding-window analysis gave nucleotide diversity values of 0 to 0.009 and detected 8 mutation hotspot regions, and coding regions were more conserved than non-coding regions. Analysis of non-synonymous (Ka) and synonymous (Ks) substitution rates revealed that the accD, rps12, petB, and atpF genes were under positive selection and that no genes were under neutral selection. Based on the findings of the phylogenetic analysis, Artemisia selengensis was sister to the genus Artemisia Chrysanthemum and formed a monophyletic group with other Artemisia genera.

Conclusions

This study systematically compared the chloroplast genomic features of A. selengensis and provides important information for research on the chloroplast genome of A. selengensis and on the evolutionary relationships among Asteraceae species.

Background

Asteraceae is the largest family of dicotyledonous plants. It currently comprises about 1000 genera and 25,000–30,000 species, of which about 200 genera and more than 2000 species occur in China, where they are distributed throughout the country [1, 2]. As the largest genus in Asteraceae [3], Artemisia has about 300 species. It is mainly found in temperate, cold-temperate, and subtropical regions of Asia, Europe, and North America. In many countries, most Artemisia plants are used as herbal medicines. For example, A. annua is used to treat malaria because of its high artemisinin content [4], and A. argyi is used during the Dragon Boat Festival, a traditional Chinese festival, to repel insects and kill viruses. A. selengensis is a perennial herb of the genus Artemisia in the family Asteraceae. It has rhizomes; its young stems are green or purple, its young leaves are mostly light green, and its old leaves are dark green. The leaves are mostly oval or lance-shaped, with white tomentum on the back, and the whole plant grows upright or obliquely upward. The plant has a clear fragrance, and its stalks are crisp and tender and rich in protein, fatty acids, and trace elements [5]. With its delicious flavor and rich nutrition, it is widely grown in China, mainly as a vegetable. A. selengensis contains various chemical substances such as flavonoids, chlorogenic acid, and reducing sugars. The polysaccharides, chlorogenic acid, and other bioactive components of the plant have demonstrated anti-tumor, antioxidant, and free radical scavenging effects [6,7,8], and can improve liver function [8] and lower blood sugar [9]. It is also used in tea making, yogurt fermentation, functional shampoos, and cosmetics development [10, 11].

Chloroplasts are important organelles that carry their own genetic material and perform photosynthesis. They are commonly found in terrestrial plants, algae, and a few protists [12, 13], and are maternally inherited in most angiosperms. Chloroplast genomes are relatively conserved in structure [14]. They are typically circular and quadripartite, with a genome size ranging from 120 to 160 kb [15], and consist of a large single-copy region (LSC) and a small single-copy region (SSC) separated by two inverted repeat regions (IR), which are a pair of sequences of equal length and opposite orientation [16]. Chloroplast genomes are therefore commonly used in angiosperms, gymnosperms, and ferns for phylogenetic and comparative genomic investigations [17]. Chloroplast genomes are inherited uniparentally, either paternally or maternally, and may be used as a legitimate barcode for species identification as well as for the development of additional candidate identification markers [18].

Artemisia species are diverse and genetically complex, and their morphology-based taxonomic relationships are ambiguous. As organelle genomes with highly conserved genetic information, chloroplast genomes are widely used for genome evolution studies [19, 20]. Many researchers have used single-gene data (accD, ycf1, rbcL, matK, ndhF, rps11) and IGS data (psbA-trnH, trnS-trnC, trnS-trnfM, trnL-trnF) for phylogenetic analysis of Artemisia [21,22,23,24,25,26,27]. However, these chloroplast single-gene molecular markers do not work for all plant taxa and supply only limited information at the subspecies level [28]. Moreover, there is little information on the chloroplast genomes of A. selengensis in current databases, and it is unclear whether the chloroplast genomes of A. selengensis resources differ from one region to another. Therefore, in this work, we sequenced, assembled, and annotated the complete chloroplast genomes of 10 A. selengensis materials from 6 regions using second-generation sequencing technology. This not only enriches the existing genetic information on the chloroplast genome of A. selengensis, which is helpful for phylogenetic and taxonomic studies, but also provides genetic information for the conservation of A. selengensis germplasm resources. To better understand the evolution of the chloroplast genome structure of A. selengensis and to clarify the evolutionary relationships between A. selengensis and other Artemisia species, genome structure analysis and comparative genomic research were also carried out.

Methods

Samples collection

Ten A. selengensis germplasm resources were collected from 6 provinces in China. The number, source, and GenBank accession number of each material are listed in Table 1. The 10 A. selengensis materials were labeled HWB, HWS, HQ, HY, HC, JN1, JN2, AC, JS, and YN.

Table 1 Ten germplasm resources of A. selengensis from six provinces

Chloroplast genome sequencing

More than 0.5 g of fresh leaves was taken from each material separately, kept in color-indicating silica gel, and then sequenced on an Illumina high-throughput sequencing platform by Beijing Novogene Biotechnology Co., Ltd. Sequencing generated a total of 56.5 G of raw data and 56.15 G of filtered clean data. The clean data were used to assemble the complete chloroplast genomes. For all samples, more than 97% of bases reached Q20 and more than 92% reached Q30 (Supplementary Table S1).
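To illustrate what the Q20 and Q30 percentages mean, the following is a minimal sketch that computes the fraction of bases at or above a Phred quality threshold. It assumes Phred+33 encoded quality strings, as used in standard Illumina FASTQ output; the function name `q_fraction` and its parameters are hypothetical and are not taken from the sequencing provider's pipeline.

```python
def q_fraction(quality_strings, threshold, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    A Phred score Q corresponds to an error probability of 10 ** (-Q / 10),
    so Q20 means a 1% error rate and Q30 a 0.1% error rate.
    Assumes Phred+33 ASCII encoding (an assumption, not stated in the text).
    """
    total = 0
    passed = 0
    for quals in quality_strings:
        for ch in quals:
            total += 1
            if ord(ch) - offset >= threshold:
                passed += 1
    return passed / total if total else 0.0


# Toy example: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
reads = ["II5+"]
q20 = q_fraction(reads, 20)
q30 = q_fraction(reads, 30)
```

A sample would satisfy the reported thresholds if `q20 > 0.97` and `q30 > 0.92`.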

Genome assembly and annotation

In this study, the chloroplast genomes of A. selengensis were assembled using GetOrganelle v1.7.5.3 [29]. The published complete chloroplast genome of A. selengensis downloaded from NCBI [30] (GenBank accession: NC_039647) was used as the reference for annotating the chloroplast genomes of the 10 A. selengensis materials with the online software CPGAVAS2 (http://www.herbalgenomics.org/cpgavas) and PGA [31]. Using Geneious v8.0.4 [32], we compared the numbers of annotated genes, manually added missing genes, rigorously verified the CDS sequences, and manually corrected the start and stop codons of misannotated genes. A gene was deemed a pseudogene if it was present as a shortened partial copy of another gene or had an internal stop codon compared with homologous genes. The annotated GenBank files were converted into five-column tab-delimited annotation files using GB2Sequin [33]; the annotation files and the complete FASTA sequence files for the 10 materials were submitted to GenBank via BankIt, and accession numbers were obtained (Table 1). The annotated chloroplast genomes were visualized using the online software Chloroplot (https://irscope.shinyapps.io/Chloroplot).

Structural characterization and comparative chloroplast genome analysis

Geneious v8.0.4 software was used to calculate the whole genome length, length of each region (large single-copy region, small single-copy region, inverted repeats), gene composition and position distribution, base composition, and GC (AT) content to analyze the characteristics of the chloroplast genomes of A. selengensis.
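The base composition statistics described above were obtained with Geneious; as a rough illustration of the underlying calculation, the sketch below computes GC content for a whole sequence and for named regions. The function names and the 1-based inclusive coordinate convention are assumptions made for this example only.

```python
def gc_content(seq):
    """GC content (%) over unambiguous bases A, C, G, and T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def region_gc(genome, regions):
    """GC content per region.

    `regions` maps a region name (e.g. "LSC") to (start, end) coordinates,
    1-based and inclusive (a hypothetical layout for illustration).
    """
    return {name: gc_content(genome[start - 1:end])
            for name, (start, end) in regions.items()}


# Toy sequence, not real chloroplast data.
toy = "GGCCAATT"
per_region = region_gc(toy, {"first_half": (1, 4), "second_half": (5, 8)})
```

With real data, `regions` would hold the LSC, SSC, IRA, and IRB coordinates taken from the genome annotation.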

Boundary regions and comparative analysis

Variations in gene sequences at the boundary junctions of the 4 regions are observed across different plant species. The main reason for the variation in chloroplast genome length is the expansion and contraction of the IR region. We used the CPJSdraw-boundary map drawing tool (http://cloud.genepioneer.com:9929) of the JSHYCloud Platform to analyze and compare the boundary regions between the large single-copy region (LSC) and the IR region and between the small single-copy region (SSC) and the IR region.

Codon usage analysis

Codon usage bias (CUB) arises from the degeneracy of the genetic code: one amino acid can be encoded by several synonymous codons, and within or between species some of these codons are used more often than others [34]. Analysis of CUB is useful for understanding genetic and evolutionary processes and can help determine the origin and evolutionary history of genes. In this study, CodonW software was used to analyze codon preferences, and the results were plotted using R. Using CodonW and CUSP, we calculated the effective number of codons (ENC), relative synonymous codon usage (RSCU), and overall GC content (GCall) for each gene. The GC content at the three codon positions was recorded as GC1, GC2, and GC3, respectively, and the GC content at the third position of synonymous codons as GC3s. To reduce errors, protein-coding sequences were screened: each CDS had to have a length that was a multiple of 3 and ≥ 300 bp, contain a start codon and a stop codon, and have no internal stop codon; duplicate sequences were removed. In total, 53 CDS sequences were retained for codon analysis.
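The screening rules and the RSCU statistic can be sketched as follows. This is not the CodonW implementation; it is a minimal illustration that assumes the standard codon assignments and, as a simplification, treats ATG as the only start codon (chloroplast genes can use alternative starts). RSCU for a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid. The names `passes_filter` and `rscu` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard codon assignments in TCAG order (assumed to apply here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {c for c, a in CODON_TABLE.items() if a == "*"}


def passes_filter(cds, min_len=300):
    """Apply the screening rules: multiple of 3, >= min_len bp,
    start codon (ATG assumed), terminal stop codon, no internal stop."""
    cds = cds.upper()
    if len(cds) % 3 or len(cds) < min_len:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (codons[0] == "ATG" and codons[-1] in STOPS
            and not any(c in STOPS for c in codons[1:-1]))


def rscu(cds_list):
    """RSCU per sense codon, pooled over all supplied coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        counts.update(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    result = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the input
        for c in family:
            result[c] = counts[c] * len(family) / total
    return result


# Toy CDS: Met, Ala (GCT), Ala (GCT), Ala (GCA), stop.
values = rscu(["ATGGCTGCTGCATAA"])
```

An RSCU above 1 indicates that a codon is used more often than expected under equal usage of synonymous codons, and a value below 1 indicates the opposite.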

Scattered repeats sequence and SSRs analysis

Forward, reverse, complementary, and palindromic repeats in the chloroplast genome of A. selengensis were detected using REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer) with a Hamming distance of 3, a maximum of 50 computed repeats, and a repeat size > 30 bp. Simple sequence repeats (SSRs) were detected using MISA (https://webblast.ipk-gatersleben.de/misa/index.php) with motif lengths of 1–6 nucleotides and otherwise default parameters; the minimum numbers of repeats for mono-, di-, tri-, tetra-, penta-, and hexanucleotide motifs were set to 10, 6, 5, 5, 5, and 5, respectively.
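The SSR thresholds above can be illustrated with a short regular-expression scan. This is a simplified sketch, not MISA itself: it reports perfect repeats only, does not handle compound SSRs, and the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples; start is 1-based."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. 'AA' runs already found as 'A'
                hits.append((m.start() + 1, motif, len(m.group(0)) // k))
    return sorted(hits)


# Toy sequence: an A(12) run and an (AT)6 run.
toy = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CC"
ssrs = find_ssrs(toy)
```

On the toy sequence, the scan reports the mononucleotide run and the dinucleotide run and ignores the shorter `GCGC` stretch, which falls below the dinucleotide threshold of 6 repeats.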

Comparative genomic analysis

Comparing chloroplast genome sequences provides a reference for discovering sequence variants and identifying mutation hotspot regions, as well as detecting gene loss and duplication events. Mutation hotspot regions obtained from chloroplast genome sequences can also be used as effective molecular markers for species identification and population genetics [35, 36]. mVISTA is an online tool for multiple DNA sequence alignment that allows sequence similarity to be assessed by comparing coding and non-coding regions, introns, and exons [37]. In this study, the whole chloroplast genomes of the 10 A. selengensis materials were compared and visualized using mVISTA (http://genome.lbl.gov/vista/index.shtml). The published genome of A. selengensis (NC_039647) was selected as the reference, and the input files were the original FASTA-format nucleotide sequence files and GFF3-format annotation files. Nucleotide diversity (Pi) was calculated using DnaSP v6 software [38], with the window length set to 600 bp and the step size set to 200 bp.
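As an illustration of the sliding-window calculation, the sketch below computes nucleotide diversity as the average number of pairwise differences per site within each window of an alignment. It assumes pre-aligned sequences of equal length and excludes columns containing gaps or ambiguous bases; DnaSP's exact handling of such sites may differ, and the function names are hypothetical.

```python
from itertools import combinations


def window_pi(aln, start, end):
    """Average pairwise differences per site for alignment columns [start, end)."""
    columns = [col for col in zip(*(seq[start:end].upper() for seq in aln))
               if all(base in "ACGT" for base in col)]
    pairs = list(combinations(range(len(aln)), 2))
    if not columns or not pairs:
        return 0.0
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))


def sliding_pi(aln, window=600, step=200):
    """Return (window_start, window_end, pi) tuples with 1-based coordinates."""
    length = len(aln[0])
    return [(start + 1, start + window, window_pi(aln, start, start + window))
            for start in range(0, length - window + 1, step)]


# Toy alignment with a small window so the arithmetic is easy to follow.
toy_alignment = ["AAAA", "AAAT", "AAAT"]
profile = sliding_pi(toy_alignment, window=2, step=2)
```

In the toy alignment, the second window has one variable column with 2 differing pairs out of 3, giving 2 / (3 × 2) = 1/3. Windows with elevated values in a real alignment would be candidate mutation hotspots.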

Ka/Ks analysis

We calculated Ka, Ks, and Ka/Ks ratios of homologous protein-coding genes in the chloroplast genomes of A. selengensis and eight other Asteraceae species, including intra-genus species (A. argyi, A. annua, A. absinthium, A. borotalensis) and intergeneric species (other genera of Asteraceae: A. carlinoides, Chrysanthemum vestitum, Aster albescens, Helianthus carnosus). The GenBank file was downloaded from NCBI, the protein-coding sequences in the GenBank file were extracted, and the homologous protein sequences were obtained by comparing other protein sequences with the reference protein sequences using BlastN (v2.10.1) to find the best match; then the homologous protein sequences were automatically aligned using MAFFT (v7.427) software [39], and the aligned protein sequence was mapped to the coding sequence to obtain the aligned coding sequence. Finally, the KaKs_Calculator2 software [40] was used to calculate the non-synonymous substitution rate (Ka), the synonymous substitution rate (Ks), and their ratios using the YN method. Ka/Ks > 1 indicates positive selection, Ka/Ks < 1 indicates purifying selection, and Ka/Ks = 1 denotes neutral selection.
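The interpretation rule for Ka/Ks can be written as a small helper. This is only a sketch of the decision rule stated above, not part of KaKs_Calculator2; the function name, the handling of Ks = 0, and the optional tolerance `tol` around 1 are assumptions added for illustration.

```python
def classify_selection(ka, ks, tol=0.0):
    """Classify selection from Ka and Ks using the Ka/Ks ratio.

    Returns "undefined" when Ks is 0, because the ratio cannot be computed.
    `tol` is a hypothetical tolerance for treating ratios near 1 as neutral.
    """
    if ks == 0:
        return "undefined"
    ratio = ka / ks
    if abs(ratio - 1.0) <= tol:
        return "neutral"
    return "positive" if ratio > 1.0 else "purifying"


# Toy values, not results from the study.
labels = [classify_selection(0.02, 0.01), classify_selection(0.01, 0.05)]
```

Because estimated ratios are rarely exactly 1, a nonzero tolerance or a statistical test is usually needed before calling a gene neutral.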

Phylogenetic analysis

The complete chloroplast genome GenBank records of 27 published chloroplast genomes from Artemisia and other genera of the Asteraceae family were downloaded from NCBI and analyzed phylogenetically together with the 10 A. selengensis materials from this study; the species names and GenBank accession numbers are listed in Supplementary Table S2. The protein-coding sequences and non-coding regions shared among the chloroplast genomes were extracted using PhyloSuite [41], and sequence alignment was performed using MAFFT. The aligned sequences were then trimmed and concatenated, and finally imported into IQ-TREE to select the best-fit model and construct a phylogenetic tree using the maximum likelihood method.

Results

Chloroplast genome structure and features

The chloroplast genomes of the 10 A. selengensis materials in this study were highly conserved in gene content, gene order, and intron number. The genome lengths ranged from 151,148 to 151,257 bp, and all genomes had a typical circular quadripartite structure comprising four regions: LSC, SSC, IRA, and IRB (Fig. 1). The LSC region was 82,888 to 82,956 bp in length, the SSC region was 18,338 to 18,390 bp, and the IR region was 24,961 to 24,964 bp. The total GC content of the chloroplast genomes of the 10 materials was about 37.5%, showing a high degree of similarity. However, GC content differed among the regions: the IR region (43.1%) had a higher GC content than the LSC region (35.6%) and the SSC region (30.8%) (Supplementary Table S3).

Each chloroplast genome contained 133 genes: 88 protein-coding genes, 37 tRNA genes, and 8 rRNA genes (Table 2, Table S3). Seventeen genes contained introns, including 11 protein-coding genes and 6 tRNA genes; 15 of them contained one intron and 2 (ycf3 and clpP) contained two introns (Table 3). The LSC region included 68 protein-coding genes and 28 tRNA genes, the IR region contained 8 protein-coding genes, 8 tRNA genes, and 8 rRNA genes, and the SSC region contained 12 protein-coding genes and 1 tRNA gene. Genes in the LSC region accounted for 72.2% of the genes in the chloroplast genome, those in the IR region for 18.0%, and those in the SSC region for 9.8%.
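The regional percentages follow directly from the gene counts given above, as this short arithmetic check shows. The dictionary layout and variable names are chosen for illustration; the counts themselves are those reported in the text.

```python
# Gene counts per region as reported in the text.
genes_by_region = {
    "LSC": {"protein_coding": 68, "tRNA": 28, "rRNA": 0},
    "IR": {"protein_coding": 8, "tRNA": 8, "rRNA": 8},
    "SSC": {"protein_coding": 12, "tRNA": 1, "rRNA": 0},
}

# Total genes per region and overall.
region_totals = {region: sum(counts.values())
                 for region, counts in genes_by_region.items()}
total_genes = sum(region_totals.values())

# Share of all genes located in each region, rounded to one decimal place.
shares = {region: round(100 * n / total_genes, 1)
          for region, n in region_totals.items()}

# Totals per gene category across regions.
category_totals = {
    category: sum(counts[category] for counts in genes_by_region.values())
    for category in ("protein_coding", "tRNA", "rRNA")
}
```

The regional totals are 96, 24, and 13 genes, which sum to 133 and give 96/133 ≈ 72.2%, 24/133 ≈ 18.0%, and 13/133 ≈ 9.8%, matching the reported values.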

Table 2 Predicted genes in the chloroplast genome of A. selengensis

Table 3 Location and length of exon and intron genes in the chloroplast genome of A. selengensis

IR boundary analysis

The IR region is one of the most conserved regions in the plant chloroplast genome, and its contraction and expansion is the main cause of changes in chloroplast genome size and gene number, as well as a common evolutionary event in the chloroplast genome [35, 36]. We therefore analyzed the boundary regions of the chloroplast genomes of the 10 A. selengensis materials. The boundaries of the four regions were relatively conserved, and the types and numbers of genes in the boundary regions were highly consistent (Fig. 2). The contraction and expansion of the inverted repeat regions were highly similar at the LSC/IRB, IRB/SSC, SSC/IRA, and IRA/LSC junctions. The LSC/IRB boundary was located within the rps19 gene, with 212–218 bp in the LSC region and 61–67 bp in the IRB region. The ycf1 and ndhF genes were located at the SSC/IR boundaries; the IRB/SSC boundary was located within the ycf1 gene, which extended 558 bp into the IRB region and 36–108 bp into the SSC region. The trnN-GUU gene was located entirely in the IRA region near the SSC/IRA boundary, the rpl2 gene was located entirely in the IRB region near the LSC/IRB boundary, and the trnH-GUG gene was located in the LSC region. In addition, the ycf1 and ndhF genes overlapped at the IRB/SSC boundary in the A. selengensis materials from Baishazhou (HWB) and Shamao (HWS) in Wuhan City, Hubei Province. The location and order of the genes in the boundary regions were largely constant in all materials, showing that the IR region is highly conserved.