Comparative chloroplast genomes of the tea plants and the implications for the different origins of the two Assam teas

Tea plants belong to the genus Camellia, whose species are taxonomically complex due to frequent hybridization and polyploidy nature. The genetic genealogy of Camellia has always been a focus of botanical and ecological research, including a debate about whether Assam tea has two different domestication origins (Chinese Assam type and Indian Assam type). The chloroplast genome resources were able to provide useful data for the analysis of the plastome evolutionary relationship and species classication. Here, we determined the rst chloroplast genome of the natural triploid tea plant (Camellia sinensis cv. Wuyi Narcissus) and conducted the genome comparison with Chinese type tea (Camellia sinensis var. sinensis), Chinese Assam type tea (Camellia sinensis var. assamica) and Indian Assam type tea (Camellia assamica) to improve our understanding of the evolutionary mechanism and the taxonomic classication of Camellia. This study presented detailed sequences and structural variations of chloroplast genomes of four tea plants. The chloroplast genome of the natural triploid tea showed no obvious sequence difference from that of other two types of Chinese teas, while that of Chinese tea and Indian tea was signicant sequence difference. The natural selection probably dominated in shaping the codon bias of the chloroplast genome in tea plant, and the codon usage distribution of genome in Indian tea was obviously different from that in Chinese tea. The phylogenetic status of Chinese and Indian Assam teas was in the different branches of the tea plant. Phylogenetic tree clustering was not consistent with the current some taxonomy of Camellia. and contraction of border regions SC), which sequence repeat and mutation phylogenetic Assam type and Assam type different current


Introduction
Tea is the most popular non-alcoholic beverage with huge economic values in the world [1]. Botanically, cultivated tea is a member of the camellia family of angiosperms, which mainly comes from two different regions: China and India, so it can be mainly classi ed as Chinese type tea (Camellia sinensis var. sinensis) and Assam type tea (Camellia sinensis var. assamica) [2][3][4]. On the origin of cultivated tea sequenced using PacBio technology at PacBio Sequel platform. For Illumina: total DNA was used to generate libraries with the Illumina Novaseq 6000 platform. PacBio and Mate-pair reads were mapped against three published cp genomes, including two types of China tea tree (CSS, KJ806281.1; CSA, MH019307.1) and one India tea tree (CIA, MH460639.1) [7,8,13], using CLC Genomics Workbench 11.0.1 (CLC Bio, Arhus, Denmark) to lter out the chloroplast reads. The extracted chloroplast reads of C. Wuyi Narcissus were de-novo assembled by guidance-based assembly approach using CLC genomics workbench. The assembled contigs were ordered with reference cp genomes using ABACAS v1.3.1 (http://abacas.sourceforge.net/) to generate draft chloroplast assembly of CWN. The orientation and inverted repeat (IR) regions were checked with cpGAVAS (http://www.herbalgenomics.org/cpgavas) followed by manual curation and gap lling with extracted chloroplast reads to obtained the nal cp genome of C. Wuyi Narcissus. To annotate the cp genome, we used initial annotation by cpGAVAS and veri ed the sequence coordinates of each of the annotated genes using BLAST search against annotated chloroplast genes available at NCBI. Annotation errors were manually corrected. The nal annotated cp genome sequence of C. Wuyi Narcissus was subjected to OGDRAW software (http://ogdraw.mpimpgolm.mpg.de) to generate the circular cp genome map and deposited to NCBI GenBank. For the accuracy of the cp genome-wide comparison, the inverted repeat (IR) regions and the annotated data of three published cp genomes (CSS, CSA and CIA) was veri ed with the above methods.

Comparative analysis of cp genomes
Four cp genome sequences of Camellia species (Tab. 1) were aligned using MAFFT Version 7.017 [14] and adjusted manually where necessary. The cp genome sequence divergences between four Camellia species were compared and plotted using mVISTA program [15]. Genetic divergence parameter (pdistance) was calculated by MEGA 6.0 [16]. A sliding window analysis was conducted to compare π among the chloroplast genomes, using DnaSP v5.0 [17]. The window length was 600 bp with a 200 bp step size. The percentage of variable characters for coding and noncoding regions in the genome was calculated as described previously [18].

Codon Usage Bias Analyses
In order to avoid sampling bias, each CDS in cp genome were checked for being full-length and for the presence of proper start and stop codons. CDS shorter than 300 bp in length were excluded in codon usage calculations [21]. GC3s, ENc, CAI and RSCU for CDS were calculated using CodonW v1.4.4 [22].
ENc value is a measure of general non-uniformity of usage within synonymous groups of codons, ranging from 20 (extreme bias where only one codon is used in each amino acid) to 61 (random codon usage) [23]. ENc plot analysis (ENc vs GC3s) was used to examine whether the codon usages were affected only by mutation or other factors. If codon usage is constrained only by mutation pressure, ENc value lie on or slightly below the expected curve, and if codon usage is subject to natural selection, ENc value will lie considerably below the expected curve [24].
Neutrality plot (GC12 vs. GC3) was used to investigate the effects of mutation pressure and natural selection on codon use patterns. GC12 and GC3 were calculated by Perl script. GC3 was calculated excluding the three termination codons (TAA, TAG and TGA) and the three codons for Ile (ATT, ATC and ATA). Meanwhile, two single codons for Met (ATG) and Trp (TGG) were also excluded in all three patterns [25]. The slope of the plot regression is zero which indicates that there was no effect on directional mutation pressure (complete selection constraint). Slope 1 indicates that the codon usage bias is completely affected by the directional mutation pressure, and represented complete neutrality [26].
RSCU value for a particular codon refers to the ratio of its actual usage frequency to expected frequency when it is used without bias. The preferred codons with RSCU > 1.0 occur when they are used with higher frequencies than random, and the rare codons with RSCU < 1.0 means the opposite [27]. The distribution of codon usage for the four species was shown in the form of a heatmap using HemI 1.0 [28], according to the RSCU value.
CAI value is widely used to evaluate the gene expression level and ranges from 0 to 1. The larger the CAI value, the stronger the codon usage bias, otherwise, the weaker the codon usage bias [29]. The chloroplast genes in the upper and lower 5% of CAI values were respectively de ned as the high-and lowexpression gene datasets. A statistical chi-squared test was performed with SPSS 18.0 to compare the RSCU values between two datasets. If a codon whose frequency in the high-expression genes was signi cantly higher (p < 0.05) than in the low-expression genes, it will be classi ed as an optimal codon [30].

Phylogenomic Analyses
The cp genome sequences of 37 Camellia species and one outgroup (Apterosperma oblata) were aligned with the program MAFFT version and adjusted manually when necessary. Maximum likelihood (ML) analyses were implemented in RAxML version 7.2.6 [31]. RAxML searches relied on the general time reversible (GTR) model of nucleotide substitution with the gamma model of rate heterogeneity. Nonparametric bootstrapping test was implemented in the "fast bootstrap" algorithm of RAxML with 1000 replicates. Bayesian analyses were performed using the program MrBayes version 3.1.2 [32]. The besttting models were determined by the Akaike Information Criterion [33] as implemented in the program Modeltest 3.7 [34]. In all analyses, A. oblata was set as an outgroup.

Results And Discussion
Cp genome sequencing and assembly PacBio (10838 long reads, >5kb) and Mate-pair reads (7.04G of raw data were produced with 2×150 bp pair-end read lengths) were used to assemble the cp genome of CWN into a circular contig of 156,762 bp length. Circular genome maps were drawn with OGDRAW software (Fig. 1). Raw reads have been deposited in the NCBI Sequence Read Archive (SRA, SRR12002624). Assembled cp genome sequences and accompanying gene annotations of C. sinensis cv. Wuyi Narcissus have been deposited in the NCBI GenBank (Accession numbers: MT612435). Four plastomes are highly conserved in gene content, gene order, and intron number. Each cp genome contained a total of 137 genes, including 92 protein-coding genes, 37 transfer RNA (tRNA) genes and 8 ribosomal RNA (rRNA) (Supplementary Tab. S1). Of them, 60 protein-coding and 22 tRNA genes were located within LSC, 16 protein-coding genes, 14 tRNA coding genes and eight rRNA coding genes were located within IRs and 11 protein-coding and one tRNA gene were located within SSC. The rps12 gene was a divided gene with the 5′ end exon located in the LSC region while two copies of 3′ end exon and intron were located in the IRs. The ycf1 was located in the boundary regions between IRa/SSC, leading to incomplete duplication of the gene within IRs. The rps19 genes in CSS, CSA, and CWN were crossed the LSC/IR region while in CIA was located within LSC. There were 18 genes containing introns, including 6 tRNA genes and 12 protein-coding genes. Except for two introns in the ycf3 and clpP genes, all other genes contained only one intron. MatK gene was located within the intron of trnK-UUU with the largest intron (2,489 bp). Overlaps of adjacent genes were found in the complete genome, rps3-rpl22, atpB-atpE, and psbD-psbC had a 16 bp, 4 bp, and 53 bp overlapping region, respectively. Unusual initiator codons were observed in rps19 with GTG and orf42 with ATC in four cp genomes. The initiation codon of ndhD in CIA was ATG, while that of other three cp genomes was GTG.

Sequence variation and IR expansion/contraction
To elucidate the level of the genome divergence, the sequences were plotted to check their identity using the program mVISTA by aligning the four Camellia cp genomes with CWN (Fig. 2) as a reference. The whole aligned sequences showed high similarities with only a few regions below 90%, suggesting that tea plant plastomes were rather conserved ( Figure 4).
Sequence divergence analyses of four cp genomes revealed Pi values in the range from 0 to 0.01917 with an average of 0.00093, indicating moderate genetic divergence existed within the four cp genomes. However, four regions (including rp12/trnH-UGU, psaA/ycf3, atpB/rbcL and psbT/psbH) had relatively higher divergence values (Pi > 0.006) (Fig. 3). These four regions were all located within LSC, indicating that the LSC region had more gene mutations than the IR region and the SSC region.
Mutations may cause changes in the length of the coding gene sequence, leading to changes in the coding and non-coding regions. Therefore, the number and distribution patterns of variable characters in coding and non-coding regions among four cp genomes were further analyzed. The result showed that the proportion of variability in non-coding regions was with a mean value of 1.82%, while in the coding regions was 1.15%. Five coding genes had over 4% variability proportion, such as rps19, ndhF, ndhD, ndhI and ycf1. Five non-coding regions had over 10% variability proportions, such as rpl12/trnH-GUG, trnE-UUC/trnT-GGU, ndhD/psaC, ndhI/ndhA and rps15/ycf1 (Fig. 4). These divergence hotspot regions might provide information for marker development in phylogenetic analyses of tea plants, but further veri cation is needed. Among them, the coding gene rps19, ndhF, ycf1 and non-coding region rpl2/trnH-GUG, rps15/ycf1 were located in the junctions of IR/SC region, which supported that the length of angiosperm cp genomes was variable primarily due to the expansion and contraction of IR/SC boundary regions [35].
To further elucidate the potential contraction and expansion of IR regions, the gene variation at the IR/SSC and IR/LSC boundary regions of the four plastomes was compared (Fig. 5). The genes rps19, ycf1-5'end/ndhF, ycf1 and rp12/trnH-GUG were located in the junctions of LSC/IR and SSC/IR regions. The rps19 gene in CSS, CSA, and CWN was 279 bp, and crossed the LSC/IRa region by 46 bp while the rps19 gene in CIA was just 150 bp, and all located in the LSC region, 1bp away from the IRa region. The ycf1-5'end gene in CSS, CSA, and CWN was 1071 bp, and crossed the IRa/SSC region by 2 bp while in CIA was 1065 bp, and crossed the IR/SSC region by 33 bp. The ndhF gene in all four cp genomes was located in the SSC region. The ndhF gene in CSA, CIA, and CWN was 2247 bp while in CSS was 2139. The ndhF gene in CSS was 165 bp away from the IRa region, in CSA or CWN was 57 bp away from the IRa region while in CIA was 88 bp away from the IRa region. The ycf1 gene in CSS or CWN was 5622 bp, in CSA was 5628 bp while in CIA was only 1038 bp. The ycf1 genes in all four cp genomes crossed the IR/SSC region. The ycf1 gene in CSS or CWN was with 4553 bp located in the SSC region and 1069 bp in IRb region, in CSA was with 4559 bp located in the SSC region and 1069 bp in IRb region while in CIA was with only 6 bp located in the SSC region and 1032 bp in IRb region. The rpl2 gene in CSS, CSA or CWN was 107 bp away from the LSC region while in CIA was 82 bp away from the LSC region. The trnH-GUG gene in CSS, CSA or CWN was 2 bp away from the IRb region while in CIA was 637 bp away from the IRb region.
Among four species, three Chinese tea varieties (CSS, CSA, and CWN) were similar in both gene sequence and IR/LSC boundary pattern, except for length variations in ndhF and ycf1. The triploid CWN was closer to CSS, indicating that triploidy did not cause signi cant changes in cp genome. However, there were relatively obvious differences in IR/LSC boundary of cp genomic between Chinese tea and Indian tea, suggesting that environmental selection may be responsible for the differences.

Repeat and indel sequence analyses
SSRs are small repeating units of cpDNA, and have been widely used to characterize genetic variation among plant genotypes [36][37][38]. A total of 671 SSRs were identi ed in four cp genomes, of which 57% were in IGS, 34% were in CDS, and 9% were in Intron (Fig. 6C). 74.0% of these SSRs were monomers, 19.3% of dimers, 0.5% of trimers, 5.3% of tetramers, 0.9% of hexamers and no pentamers found (Fig. 6A). Comparing the four genomes, except for 167 SSRs of CIA, the other three were all 168. A total of 128 SSRs were fully shared among four cp genomes (Fig. Additional le 1: Table). There were 47 loci with different SSR types, most of which existed in the LSC region. Among them, CSS had 7 unique types, CSA had 18 unique types, CIA had 9 unique types, and CWN had 14 unique types. (Fig. 6B, Supplementary Tab. S2).
Indels played an important role in sequence evolution [41]. In indels analysis, the indel events in simple sequence repeats were ltered out. By comparing four cp genomes, a total 67 indels were found. Indels ranged in size from 1 to 637 bp (Fig. 8A), and most of the Indels events occurred in IGS regions (72%), with 24% in CDS regions and only 4% in Intron regions (Fig. 8B). As expected, single-nucleotide Indels (1 bp) were the most common, but 16 long Indels (>10bp) were found. The longest one was an insertion of 637 bp in CIA (intergenic rp12/trnH-GUG), followed by a 335 bp deletion in CWN (intergenic trnE-UUC/trnT-GGU) and a 107 bp deletion in CIA (gene rps19). The largest number of long inserts were found in intergenic psaA/ycf3, where CSA had a unique 10 bp deletion, CIA had a unique 12 bp insertion, and CSA and CWN had a common 17 bp deletion. Secondly, two deletions occurred in the gene coding region of nhdA in CIA, which were respectively 53bp and 66bp deletions (Fig. 8C, Supplementary Tab. S4).

Codon usage pattern analyses
Codon use bias can be used to re ect the origin, evolution and mutation mode of species or genes [44].
The ENc plots showed only a few points lie near the curve, however, most of the genes with lower ENc values than expected values lay below the curve (Fig. 9). Therefore, the codon usage bias of the chloroplast genome was slightly affected by the mutation pressure, but natural selection and other factors play an important role.
To further investigate the extent of in uence between mutation pressure and natural selection on the codon usage patterns, Neutrality plot (GC12 vs. GC3) was performed. The correlation between GC1 and GC2 was strong (CSS: r = 0.445; CSA: r = 0.453; CIA: r = 0.445; CWN: r = 0.464, p <0.01). However, no signi cant correlation was found for GC1 with GC3 (CSS: r =0.141; CSA: r = 0.139; CIA: r = 0.078; CWN: r = 0.141) or GC2 with GC3 (CSS: r = 0.146; CSA: r = 0.143; CIA: r = 0.078; CWN: r = 0.152), which suggested mutation pressure had a minor effect on the codon usage bias. Moreover, the slope of Neutrality plot showed that mutation pressure accounts for only 0.52% -8.42% on the codon usage patterns in four cp genomes, while natural selection accounts for 91.58% -99.48% (Fig. 10). These results further suggested that natural selection played an important role in the codon usage patterns.
The patterns of codon usage are strongly correlated with GC content [45]. The overall GC content was almost identical with each other among the four cp genomes, about 37.3%, indicating a higher AU content. Furthermore, the distributions of codon usage in the heatmap for 4 Camellia species showed that about half of the codons with low RSCU values were infrequently used and almost all codons with RSCU >1 ended with A/U (Fig. 11). These data indicated that the four cp genomes tended to use A/U bases and A/U-ending codons. In addition, we also found that the codon usage patterns of the three Chinese teas were more similar, and the cp genome of CWN has a closer relationship with that of CSS in RSCU cluster analysis. In contrast, the codon usage patterns of Indian tea (CIA) were quite different from that of Chinese teas. The RSCU values of the 36 codons (36/64, 56.25%) were identical in the three Chinese teas, but different from those in Indian tea (Fig. 11, Tab. 2), indicative of a different selection on codon usage in the cp genomes between Chinese tea and Indian tea. This also suggested that the two Assam tea plants (Chinese Assam type and Indian Assam type) might undergo different domestications.

Phylogenetic analysis
To further understand the evolutionary mechanism and taxonomic classi cation of Camellia, a most comprehensive phylogenetic analyse was performed based on complete cp genome from 37 Camellia species with Apterosperma. oblata as an outgroup so far, including Chinese type tea (CSS), Chinese Assam type tea (CSA) and Indian Assam type tea (CIA) and a nature triploid species (CWN) (Fig. 12). Phylogenetic trees were generated by Bayesian inference (BI) and Maximum likelihood (ML) based on the complete cp genome sequence have similar topologies, and almost all relationships have high bootstrap values. Both trees showed that natural triploid CWN was formed into a monophyletic clade with a bootstrap value of 100%, suggesting a polyploid event in the evolution of tea plants. Further, our results clearly indicated that the ancestors of tea plants differentiated into two branches at a certain node, with one branch continuing to differentiate into Chinese Assam type tea and the other branch continuing to differentiate into Indian Assam type tea and Chinese type tea, respectively. This result suggested that these three species might have different domestication origins, which supported an earlier idea that there are three independently domesticated tea species. The one is Chinese type tea, and the other two are distinct Assam tea: a Chinese tea from the southwestern province of Yunnan (Chinese Assam type tea) and an Indian tea from the Assam region (Indian Assam type tea) [46]. The existence of two distinct domestication origins of Assam tea had long been a major topic of debate [5]. Our study further suggested that there were two distinct domesticated origins of Assam tea.
Because of the frequent hybridization and other factors, the classi cation of the genus Camellia based on morphology had been controversial. Some previous studies had shown that cp genomes can provide useful information for solving complex taxonomic problem of Camellia species [7,47].  [49,50]. Corresponding to Ming's classi cation, our results showed that not all subgenus tea or subgenus camellia were individually clustered, suggesting that this classi cation needed to be reconsidered to revise: (i). at the subgenus level, both BI tree and ML tree (BS=100%) showed that C. danzaiensis should belong to subgenus camellia, rather than subgenus tea, which was consistent with a previous study by Huang et al. 2014. In addition, (ii). C. nitidissima and C. petelotii of sect. Archecamellia of subgenus tea (Syn. sect. Chrysantha Chang) were clustered with C. szechuanensis of sect. Heterogenea of subgenus Camellia in BI tree (BS=88%), and not all species of sect. Heterogenea clusters came together. This result supported a previous study which pointed out that sect. Chrysantha Chang should not be merged into sect. Archecamellia, and sect. Heterogenea should not be recognized in taxonomic treatments of Camellia species by analyzing the secretory structure [51]. Moreover, (iii). The positions of C. yunnanensis, C. luteo ora and C. tachangensis were with low bootstrap values in ML tree, which were also in doubt and need further veri cation. More cp genomes of Camellia would be needed for further studies to solve these problems as phylogenomic analysis tends.

Conclusion
The rst complete cp genome of natural triploid tea tree and its phylogenetic status were determined. The speci c genome features, including repeat sequences, SSRs, and indels were compared among the natural triploid tea, Chinese type tea, Chinese Assam type tea and Indian Assam type tea. The identi ed divergence regions could be used as potential molecular markers for further analysis. This study is also the rst to provide information on the domestication origin of tea plant based on the complete cp genome. Our results supported that Chinese Assam type tea and Indian Assam type tea might have different domestication origins. Moreover, phylogenetic tree analysis also suggested that the current taxonomy of Camellia species might need further revision. Our results might facilitate germplasm resources protection for these economically important tea plants, and offer useful genetic information for the purposes of phylogenetics, taxonomy and species identi cation in the genus Camellia.

Availability of data and materials
The chloroplast genomes generated during the current study were deposited in NCBI with accession number MT612435 (C. sinensis cv. Wuyi Narcissus), and Raw reads have been deposited in the NCBI Sequence Read Archive (SRA, SRR12002624). CSA, C. sinensis var. assamica; CIA, C. assamica Table 3. The relative synonymous codon usage (RSCU) values of four chloroplast genomes. The optimal codon, marked with * (p < 0.05) and @ (p < 0.01), was de ned as a codon whose usage frequency in the high-expression genes was signi cantly higher than in the low-expression genes by the chi-squared test.
The RSCU values are identical in all three Chinese teas (Blue background), but different in Indian teas (Yellow background).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.