Complete chloroplast genome sequence of Betula platyphylla: gene organization, RNA editing, and comparative and phylogenetic analyses

Wang, Sui; Yang, Chuanping; Zhao, Xiyang; Chen, Su; Qu, Guan-Zheng

doi:10.1186/s12864-018-5346-x

Research article
Open access
Published: 20 December 2018

BMC Genomics volume 19, Article number: 950 (2018)

Abstract

Background

Betula platyphylla is a common tree species in northern China that has high economic and medicinal value. Our laboratory has been devoted to genome research on B. platyphylla for approximately 10 years. As primary organelle genomes, complete chloroplast genome sequences are important for studying species divergence, RNA editing and phylogeny. In this study, we sequenced and analyzed the complete chloroplast (cp) genome sequence of B. platyphylla.

Results

The complete cp genome of B. platyphylla was 160,518 bp in length and included a pair of inverted repeats (IRs) of 26,056 bp each, which separated a large single copy (LSC) region of 89,397 bp from a small single copy (SSC) region of 19,009 bp. The annotation contained a total of 129 genes, including 84 protein-coding genes, 37 tRNA genes and 8 rRNA genes. Three genes used alternative initiation codons. Comparative genomics showed that the cp genome sequences of Fagales species were relatively conserved, but some highly variable regions remained that could be used as molecular markers. The IR expansion event in B. platyphylla resulted in a larger cp genome and the formation of an rps19 pseudogene. Simple sequence repeat (SSR) analysis identified 105 SSRs in the cp genome of B. platyphylla. Identification of RNA editing sites indicated that at least 80 RNA editing events occurred in the cp genome. Most of the substitutions were C to U, while a small proportion were not. In particular, three editing loci on the rRNA were converted to more than two other bases, which had not been reported previously. Most synonymous conversions increased the relative synonymous codon usage (RSCU) value of the codons. The phylogenetic analysis suggested that B. platyphylla had a closer evolutionary relationship with B. pendula than with B. nana.

Conclusions

In this study, we not only obtained and annotated the complete cp genome sequence of B. platyphylla, but we also identified new RNA editing sites and predicted the phylogenetic relationships among Fagales species. These findings will facilitate genomic, genetic engineering and phylogenetic studies of this important species.

Background

Betula platyphylla, or Asian white birch, is a broad-leaved deciduous hardwood tree species that belongs to the genus Betula in the family Betulaceae. It is a pioneer tree species that can rapidly colonize open ground, especially in secondary successional sequences. It grows in the temperate or subarctic regions of Asia, including Japan, China, Korea, and Siberia. The grayish bark of this tree is marked with long, horizontal lenticels and often separates into thin, papery plates, which is the most typical characteristic of this tree species [1, 2]. B. platyphylla is often used as a wayside or landscape tree because of its graceful shape. It is also a valuable commercial tree species that is harvested for lumber and for pulpwood used in paper production [3]. Recent studies have indicated that birch bark contains numerous triterpenoids and has substantial medicinal value [4, 5].

As primary plastids found only in plant cells and eukaryotic algae, chloroplasts are semiautonomous organelles that not only perform photosynthesis but also participate in a range of biochemical processes. Chloroplasts are believed to have arisen from an endosymbiotic event and have their own genomes, which are often abbreviated as cp or ct [6, 7]. Since the first cp genomes were sequenced in 1986, more than 2500 complete cp genome sequences have been released in the National Center for Biotechnology Information (NCBI) organelle genome database as of March 2018 [8, 9]. The advent of next-generation sequencing (NGS) technologies has facilitated rapid progress in the field of cp genomics [10]. In the future, as third-generation sequencing becomes more widely used, longer average read lengths will make it easier to assemble cp genomes [11, 12, 13]. For most land plants, cp genomes have highly conserved structures and are circular DNA molecules that comprise two inverted repeats (IRs), which separate a large single copy (LSC) region and a small single copy (SSC) region. Chloroplast genome sizes vary between species, ranging from 107 kb (Cathaya argyrophylla) to 218 kb (Pelargonium × hortorum), with an average size of approximately 150 kb [14, 15]. There are approximately 120–130 genes in the cp genome, which participate primarily in photosynthesis, transcription, and translation [16]. RNA editing, which is a posttranscriptional modification phenomenon, occurs in some transcripts of these cp genes. Editing by insertion, deletion or substitution of bases, such as cytidine (C) to uridine (U), is an essential repair mechanism, and many mutations at the cp genome level may lead to strong deleterious phenotypes [17]. Because they have fairly stable structures, moderate evolutionary rates and uniparental inheritance in most angiosperms, cp genomes have made significant contributions to phylogenetic studies [16, 18].

In this study, we aimed to determine the complete cp genome sequence of B. platyphylla and to characterize its genome structure, gene content and other characteristics. Furthermore, we identified RNA editing sites across the whole cp genome of B. platyphylla using RNA-Seq data. We also inferred its phylogenetic relationships through a comparative analysis with the cp sequences of other Fagales species.

Materials and methods

Plant materials and sequencing

Tender leaves were collected from an adult B. platyphylla plus tree located on the Northeast Forestry University campus. Total genomic DNA was extracted from the tender leaves using the CTAB method [19]. Three paired-end (insert sizes = 200 bp, 500 bp and 800 bp) and three mate-pair (insert sizes = 2 kbp, 5 kbp and 10 kbp) Illumina libraries were prepared and sequenced on the HiSeq 2000 platform (Illumina, USA) at BGI (Shenzhen, Guangdong, China).

Data filtration and cp DNA sequence extraction

To obtain high-quality and vector/adaptor-free reads, raw paired-end reads were filtered using the NGSQC Toolkit v2.3.3 (cut-off read length for HQ = 70%, cut-off quality score = 20, trim reads from 5′ = 3, trim reads from 3′ = 7) [20]. The quality of the clean reads was checked using FastQC (v0.11.5). To identify the cp sequences, all of the clean reads, which included sequences from both the nucleus and organelles, were mapped using BWA (v0.7.13) [21] to the complete cp genome sequences of 2670 plant species, which were downloaded from the NCBI Organelle Genome Resources database (www.ncbi.nlm.nih.gov/genome/organelle/). Finally, we extracted cp sequences from the SAM files and obtained three files of paired-end reads.
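The filtering thresholds above can be illustrated with a minimal Python sketch. It is not the NGSQC Toolkit code: it assumes that "cut-off read length for HQ = 70%" means a read is kept when at least 70% of its bases have a Phred score of 20 or more, and that quality strings use the Phred+33 encoding. The function names are hypothetical.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is an assumption)."""
    return [ord(ch) - offset for ch in qual]


def is_high_quality(qual, min_score=20, min_percent=70):
    """Keep a read if at least min_percent of its bases reach min_score."""
    scores = phred_scores(qual)
    if not scores:
        return False
    good = sum(1 for q in scores if q >= min_score)
    # Integer comparison avoids floating-point rounding at the threshold.
    return 100 * good >= min_percent * len(scores)


def trim_read(seq, qual, left=3, right=7):
    """Remove `left` bases from the 5' end and `right` bases from the 3' end."""
    end = len(seq) - right
    return seq[left:end], qual[left:end]


if __name__ == "__main__":
    seq = "ACGTACGTACGTACG"
    qual = "I" * 12 + "#" * 3  # 12 bases at Q40, 3 bases at Q2
    print(is_high_quality(qual))
    print(trim_read(seq, qual))
```

The order in which filtering and trimming are applied is not stated in the text, so the two steps are kept as independent functions here.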

Genome assembly and annotation

For de novo cp genome assembly, the Edena assembler (v3.131028) with default parameters was used to assemble all the paired-end sequences into contigs [22]. Next, neighboring contigs with paired-end or mate-pair support for continuity were merged into scaffolds using SSPACE (v3.0) [23]. Then, using the cp genome sequences of two other reference Fagales plants, Betula nana (KX703002.1) and Ostrya rehderiana (KT454094.1), a single cp sequence with gaps was assembled. After that, GapCloser (v1.12) was used to close most of the gaps, and Sanger sequencing was used to fill the residual gaps. The complete cp genome sequence was further checked using BWA.

Except for tRNA genes, which were verified using tRNAscan-SE 2.0, the B. platyphylla cp genome sequence was annotated using the online Chloroplast Genome Annotation, Visualization, Analysis and GenBank Submission Tool (CpGAVAS) [24, 25]. First, AnnotateGenome was utilized to obtain the primitive annotation results in the GFF3 format. Second, we used AnnotateGene and Apollo Genome Annotation and Curation Tool (v1.11.8) to manually correct the abnormal features based on the reference database of CpGAVAS and the tRNA genes annotated by tRNAscan-SE. Last, OrganellarGenomeDRAW was used to directly generate a corrected cp circular map [26].

Codon usage and alternative start codons statistics

Codon usage was determined for all protein-coding genes (RNA sequences without editing). To examine the deviation in synonymous codon usage while avoiding the influence of the amino acid composition, the relative synonymous codon usage (RSCU) was calculated with MEGA 7 software (version 7.0.18).
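For reference, the RSCU of a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is a minimal Python illustration of that definition, not the MEGA 7 implementation. It uses the standard codon assignments, which to our knowledge match the plastid code apart from start codons, and it treats the three stop codons as one synonymous family; both choices are assumptions of this sketch.

```python
from collections import Counter, defaultdict

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds_list):
    """Return {codon: RSCU} for coding sequences given as DNA or RNA strings."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for pos in range(0, usable, 3):
            codon = cds[pos:pos + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        for c in codons:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(codons) / total
    return values


if __name__ == "__main__":
    for codon, value in sorted(rscu(["ATGCTTCTTCTGTAA"]).items()):
        print(codon, round(value, 2))
```

An RSCU above 1 indicates a codon used more often than expected under equal synonymous usage, and a value below 1 indicates the opposite.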

Three cp genes (rps19, psbC and ndhD) were annotated with non-ATG start codons in the B. platyphylla cp genome. We retrieved these genes from 30 model and representative plant species selected according to the Angiosperm Phylogeny Group (APG) IV system (Additional file 1: Table S1). Then, sequence logos of the first 10 bp of the three genes across the species were created using the WebLogo 3 application (http://weblogo.threeplusone.com/). We also visualized the RNA-Seq mapping of these sites and aligned them with the sequence logos.

Genome comparison

The complete cp genome sequences of B. platyphylla and four other closely related species, B. pendula (LT855378.1), B. nana (KX703002.1), Corylus chinensis (KX814336.2) of Betulaceae and Juglans sigillata (KX424843.1) of Juglandaceae, were compared using the program mVISTA. EMBOSS Stretcher, a modification of the Needleman-Wunsch algorithm that allows larger sequences to be aligned globally, was used to align these cp genome sequences to obtain accurate identity and similarity.
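The identity values come from a global alignment. As a rough illustration, the sketch below implements the plain quadratic-memory Needleman-Wunsch algorithm and reports identity as matching columns divided by alignment length. It is not EMBOSS Stretcher, which uses a memory-efficient variant suited to genome-length sequences, and the scoring values (match = 1, mismatch = -1, linear gap = -2) are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    top, bottom = global_align("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print(round(percent_identity(top, bottom), 2))
```

For sequences of about 160 kb, the full score matrix used here would be impractically large, which is why a linear-space tool is used in practice.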

IR expansion and contraction

Based on the classification system for Fagales taxa, four species, Betula platyphylla, Juglans regia (MF167463.1), Morella rubra (KY476637.1) and Castanea mollissima (KY951992.1), were selected to represent the families Betulaceae, Juglandaceae, Myricaceae and Fagaceae, respectively.

SSR analysis

Simple sequence repeats (SSRs) were detected using the Perl script MISA (MIcroSAtellite identification tool) by setting the minimum number of repeats to 10, 5, 4, 3, 3 and 3 for mono-, di-, tri-, tetra-, penta- and hexanucleotides, respectively. Meanwhile, CandiSSR was used to identify polymorphic SSRs (PolySSRs) and to automatically design primer pairs for each identified PolySSR in the three Betula species [27].
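The repeat thresholds above can be illustrated with a small regular-expression search. This is a simplified Python sketch, not the MISA script: it only finds perfect repeats, it does not merge neighboring repeats into compound SSRs, and the function names are hypothetical.

```python
import re

# Minimum number of repeat units for motif lengths 1 to 6, as given in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (1-based start, motif, repeat count) tuples."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One unit captured, then at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip e.g. "AA" repeats, which are reported as mononucleotide SSRs.
            if is_primitive(motif):
                hits.append((m.start() + 1, motif, len(m.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    demo = "G" + "A" * 10 + "C" + "AT" * 5 + "G" + "AGC" * 4 + "T"
    for start, motif, count in find_ssrs(demo):
        print(start, motif, count)
```

Because the regular-expression scan is non-overlapping, a repeat that begins inside a rejected non-primitive match can be reported with a shifted start position; a production tool would handle such cases more carefully.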

Recognition of RNA editing sites

In this study, an RNA-Seq experiment with 3 individual leaf samples was used to identify RNA editing events. Total RNA was extracted from mature foliage using an Extract kit (RP3301, BioTeke, China). The RNA-Seq library construction and sequencing were performed at Novogene Bioinformatics Technology Co., Ltd. (Beijing, China). The filtered paired-end reads obtained from an Illumina HiSeq 2000 were aligned to the B. platyphylla cp genome using HISAT2 (v2.1.0) software with strict alignment conditions. SAMtools (v1.9), bedtools (v2.25.0) and ChloroSeq were used to call and analyse precise RNA editing sites [28]. Because SNPs or mismatches may interfere with the results, we also mapped the set of 100-bp paired-end reads that was used to assemble the B. platyphylla cp genome back to the cp genome sequence using bowtie2 (v2.3.4.1) software and then checked for SNPs. Finally, we designed several pairs of primers using Primer Premier 6.0 software (PREMIER Biosoft International, Canada) and amplified the target sequences by PCR from genomic DNA (gDNA) and complementary DNA (cDNA). The representative target editing sites were confirmed by Sanger sequencing. The relevant primer information is summarized in Additional file 1: Table S2.
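Conceptually, a candidate editing site is a position where RNA-Seq reads disagree with the genomic reference base. The sketch below shows that idea in Python for a single site, given per-base read counts. It is not the ChloroSeq procedure: the depth and rate thresholds, the function name and the input format are hypothetical, strand orientation is ignored, and the separate gDNA SNP check described above is not included.

```python
def call_editing(ref_base, rna_counts, min_depth=10, min_rate=0.05):
    """Return (change, rate) for a candidate editing site, or None.

    ref_base   -- genomic base at the site (A, C, G or T)
    rna_counts -- dict of RNA-Seq read counts per base at the site
    min_depth, min_rate -- illustrative thresholds, not values from the paper
    """
    depth = sum(rna_counts.values())
    if depth < min_depth:
        return None  # too few reads to judge

    # Most frequent base that differs from the reference.
    alternatives = {b: n for b, n in rna_counts.items() if b != ref_base and n > 0}
    if not alternatives:
        return None
    alt_base = max(alternatives, key=alternatives.get)

    rate = alternatives[alt_base] / depth
    if rate < min_rate:
        return None  # could be sequencing error rather than editing

    to_rna = {"T": "U"}  # write the change in RNA notation
    change = "%s>%s" % (to_rna.get(ref_base, ref_base), to_rna.get(alt_base, alt_base))
    return change, rate


if __name__ == "__main__":
    print(call_editing("C", {"C": 20, "T": 80}))
    print(call_editing("C", {"C": 5, "T": 2}))
```

The editing rate computed this way is the fraction of reads that carry the edited base, which is why sites with incomplete editing still show a mixture of both bases.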

Phylogenetic analysis and character evolution

The whole cp genome sequences of 21 species of Fagales were used to build a phylogenetic tree to confirm the genetic relationships between B. platyphylla and closely related species. In this phylogenetic tree, Nicotiana tabacum was used as the out-group. Nucleotide sequences were aligned using MAFFT (version 7.294b). All alignments were checked and adjusted manually. The program MEGA-CC (version 7.0.26–1) was employed to find an optimal substitution model and to build a maximum likelihood (ML) phylogenetic tree. Bootstrap resampling with 500 replicates was used to evaluate the branch support. More information is summarized in Additional file 1: Table S3.

Results

Chloroplast genome assembly

Based on the NCBI Organelle Genome Resources database, we extracted approximately 128.8 Mbp of paired-end reads for cp genome assembly. With the help of Edena, a first assembly consisting of 35 contigs was obtained (Table 1). Further scaffolding with all of the paired-end and mate-pair reads resulted in a single scaffold under the guidance of the reference sequences. After using GapCloser to close most of the gaps, only two gaps remained. Finally, with the aid of Sanger sequencing, we filled the gaps, identified both ends of the sequence and obtained a circular cp genome.

Table 1 Statistics for the contigs

The whole cp genome of B. platyphylla had a length of 160,518 bp. Like that of most land plants, the circular cpDNA had a typical quadripartite structure. An LSC region of 89,397 bp and an SSC region of 19,009 bp were separated by a pair of IR regions of 26,056 bp each. The overall GC content of the B. platyphylla cp genome was 36.06%, and the GC contents of the LSC and the SSC regions were 33.66% and 29.76%, respectively. Because each IR region contained relatively abundant GC-rich rRNA and tRNA genes, the GC content of the IR regions was 42.48%, which was much higher than that of the LSC and SSC regions.
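The reported figures are internally consistent, which can be checked with a few lines of Python. The sketch below sums the region lengths and computes the length-weighted mean GC content from the values stated above. The `gc_content` helper is a generic illustration and is not applied to the actual sequence here.

```python
def gc_content(seq):
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


# (length in bp, GC percentage) for each region, taken from the text.
REGIONS = {
    "LSC": (89397, 33.66),
    "SSC": (19009, 29.76),
    "IRa": (26056, 42.48),
    "IRb": (26056, 42.48),
}

total_length = sum(length for length, _ in REGIONS.values())
weighted_gc = sum(length * gc for length, gc in REGIONS.values()) / total_length

if __name__ == "__main__":
    print(total_length)           # should equal the reported genome length
    print(round(weighted_gc, 2))  # should match the reported overall GC content
```

The weighted mean shows how the two GC-rich IR copies, which together make up about a third of the genome, raise the overall GC content above that of the single copy regions.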

Chloroplast genome annotation

A total of 129 genes were predicted to be encoded in the B. platyphylla cp genome, including 84 protein-coding genes, 37 tRNA genes and 8 rRNA genes. Among them, 95 genes were unique, and 17 genes were duplicated in the IR regions. By calculating the GC content of the genes, we found that it was higher in the rRNAs (54.89%) and tRNAs (53.20%) than in the protein-coding genes (36.93%). The majority of the 112 distinct genes were single-exon genes, while 18 genes (12 protein-coding genes and 6 tRNA genes) contained 2 exons and only 4 protein-coding genes contained 3 exons. All of the genome and annotation information is shown in Fig. 1.

Among the B. platyphylla cp genes, several were notable. The rps12 gene was a trans-spliced gene that consisted of 3 exons coding for ribosomal protein S12. The C-terminal exons 2 and 3 of rps12 were located in each IR region, whereas exon 1 was located in the LSC, approximately 28 kbp downstream of the nearest copy of exon 2 in one IR region and 61 kbp away from the other copy of exon 3 in the other IR region. Prediction of B. platyphylla cp gene function was based on homology; these genes code for a variety of proteins, mostly involved in photosynthesis and other metabolic processes. Regarding photosynthesis, a subset of the genes encodes the large Rubisco subunit and thylakoid proteins. In addition, other genes encode subunits of a protein complex that mediates redox reactions to recycle electrons. Table 2 shows the gene functions and groups in the B. platyphylla cp genome.

Table 2 Group of genes within the B. platyphylla chloroplast genome

Codon usage and alternative initiation codons statistics

It is generally acknowledged that codon biases reflect a balance between mutational biases and natural selection for translational optimization. We therefore analyzed the codon usage frequency and RSCU values in the B. platyphylla cp genome. Because some regions were not covered by any reads in our experiment and the editing rates were below 100%, it was not clear whether RNA editing occurred throughout the coding regions. We therefore used the unedited RNA sequences to compute the codon usage and RSCU values, and we estimated that this would not have a large impact on the results. The 84 protein-coding genes in the B. platyphylla cp genome comprised 26,298 codons in total. Among the encoded amino acids, the three most abundant were leucine (10.49%), isoleucine (8.97%) and serine (7.49%). Excluding the stop codons, cysteine (1.16%) was the least abundant amino acid (Additional file 1: Table S4, Figure S1). Codon usage was biased towards A and U at the third codon position, which is similar to the trend observed in the majority of angiosperm cp genomes [29].

Unlike most genes, which use ATG as the initiation codon, several cp genes use other codons. In the B. platyphylla cp genome, three genes were annotated with non-ATG start codons: GTG was used by rps19 and psbC, and ACG was used by ndhD. These three genes are involved in translation, photosynthesis and respiration, respectively. As shown in Fig. 2a, these start sites were relatively conserved across species. GTG was the dominant initiation codon in rps19 but not in psbC, and about half of the species used ACG as the start codon in ndhD. Figure 2b shows that, at the transcript level, the initiation codons of the rps19 and psbC transcripts of B. platyphylla did not change significantly. However, editing of ACG to AUG at the ndhD start codon was evident and restored a canonical start codon.

Comparison of chloroplast genome sequences with those of other species

To investigate the similarities and differences between the cp genome sequences of B. platyphylla and other species of Fagales, a global alignment program was used to align these sequences. The result was plotted using the mVISTA tools with B. platyphylla as a reference (Fig. 3). Overall, these closely related species differed little in cp genome size, which ranged from 160,320 bp to 161,148 bp. The overall sequence similarity among these genomes was very high, especially among the Betulaceae species, which shared over 99% identity. As shown in Fig. 3, the structures of these cp genomes were conserved, and neither translocations nor inversions were detected in the sequences. As expected, coding regions were more conserved than noncoding regions. More concretely, most highly polymorphic regions were located in the intergenic regions (such as trnR-TCT–atpA, trnE-TTC–trnT-GGT, psbE–petL and rpl32–trnL-TAG), although the ycf1 gene also contained highly variable regions. These regions may be undergoing more rapid nucleotide substitution at the species level, which indicates their potential application as molecular markers for phylogenetic analyses and plant identification in Fagales.