De novo sequencing and comparative analysis of holy and sweet basil transcriptomes

Background Ocimum L. of family Lamiaceae is a well known genus for its ethnobotanical, medicinal and aromatic properties, which are attributed to innumerable phenylpropanoid and terpenoid compounds produced by the plant. To enrich genomic resources for understanding various pathways, de novo transcriptome sequencing of two important species, O. sanctum and O. basilicum, was carried out by Illumina paired-end sequencing. Results The sequence assembly resulted in 69117 and 130043 transcripts with an average length of 1646 ± 1210.1 bp and 1363 ± 1139.3 bp for O. sanctum and O. basilicum, respectively. Out of the total transcripts, 59648 (86.30%) and 105470 (81.10%) from O. sanctum and O. basilicum, and respectively were annotated by uniprot blastx against Arabidopsis, rice and lamiaceae. KEGG analysis identified 501 and 952 transcripts from O. sanctum and O. basilicum, respectively, related to secondary metabolism with higher percentage of transcripts for biosynthesis of terpenoids in O. sanctum and phenylpropanoids in O. basilicum. Higher digital gene expression in O. basilicum was validated through qPCR and correlated to higher essential oil content and chromosome number (O. sanctum, 2n = 16; and O. basilicum, 2n = 48). Several CYP450 (26) and TF (40) families were identified having probable roles in primary and secondary metabolism. Also SSR and SNP markers were identified in the transcriptomes of both species with many SSRs linked to phenylpropanoid and terpenoid pathway genes. Conclusion This is the first report of a comparative transcriptome analysis of Ocimum species and can be utilized to characterize genes related to secondary metabolism, their regulation, and breeding special chemotypes with unique essential oil composition in Ocimum. Electronic supplementary material The online version of this article (doi:10.1186/1471-2164-15-588) contains supplementary material, which is available to authorized users.


Background
Ocimum L., belonging to family Lamiaceae is one of the best known genus for its medicinal properties and economically important aromatic oils. Some Ocimum species are also constituents of Ayurvedic and indigenous medicines. This genus is highly variable and possesses wide range of intra-and inter-specific genetic diversity comprising at least 65 [1] to more than 150 species [2] distributed all over the world. Among these, Ocimum sanctum L. (Ocimum tenuiflorum L.) and Ocimum basilicum L.
are the two important species used extensively for their medicinal and industrial importance. O. sanctum, known as "the holy basil" is native to Asian tropics [3], whereas O. basilicum L. or "the sweet basil" is described to be of African origin as per the Germplasm Resources Information Network [4] of United States Department of Agriculture. While holy basil is revered for its spiritual sanctity and medicinal potential [5], the sweet basil is widely used as culinary herb and for fragrance [6]. Both of the two Ocimum species are rich reservoirs of innumerable phytochemicals, which comprises predominantly phenylpropanoids and terpenoids with various medicinal and aromatic properties. Most of these phytochemicals are sequestered in specialized anatomical structures, termed glandular trichomes, on the surface of the aerial parts of the plants [7]. O. sanctum is known to possess antibacterial, antianaphylactic, antihistaminic, wound healing, radioprotective, antidiabetic, larvicidal, anti-genotoxic, neuro-protective, cardio-protective and mast cell stabilization activity [8]. The leaves and stem of holy basil contain a variety of biologically active constituents like saponins, flavonoids, triterpenoids, and tannins [9]. Urosolic acid (UA) from O. sanctum L. is reported to have cardioprotective effect [10]. Some active phenolics like rosmarinic acid, apigenin, cirsimaritin, isothymusin and isothymonin exhibit antioxidant and anti-inflammatory activities [10]. The most important aroma components are described to be 1, 8 cineole, linalool, methyl chavicol (estragole) and to a lesser degree, eugenol [11]. Similarly, the essential oil of sweet basil (O. basilicum) is described to be having antifungal, antimicrobial and insect-repelling activities [12]. O. basilicum, contains primarily phenolic derivatives, such as eugenol, methyl eugenol, chavicol, estragole, and methyl cinnamate, often combined with various amounts of linalool [13]. This is also reported to be clinically useful for prevention of stroke, and exhibiting anticarcinogenic, antituberculosis and hypoglycemic activities [14,15]. Thus, the uses of Ocimum sp. for therapeutic purposes in addition to their industrial importance for aromatic properties reinforce the importance of ethno-botanical approach as a potential source of bioactive substances.
Despite spiritual, pharmacological, and industrial importance, very little transcriptomic and genomic data of Ocimum sp. is available limiting the studies on biosynthetic pathways of important phytochemicals [7]. National Center for Biotechnology Information (NCBI) shows a record of 312 entries in nucleotide database and 23336 EST sequences of O. basilicum compared to only 61 entries in nucleotide database and 108 EST sequences of O. sanctum. In recent years, several studies have successfully reported the generation of transcriptome data and its analysis as an effective tool to study gene expression in specific tissues at specific time, and also provide a platform to address comparative genomics for gene discovery in non-model plants for which no reference genome sequences are available [16]. Due to the availability of quick, low cost sequencing [17] and high quality annotation using different assembly tools [18] it has become possible to analyze and understand the genome of non model plant like Ocimum. Hence, O. sanctum and O. basilicum were selected for next generation sequencing (NGS) and analysis with the main objective to establish the basic understanding about genes involved in various pathways and the factors involved in the regulation and channelling of the secondary metabolites like phenylpropanoids and terpenoids. So, leaf transcriptome data of O. sanctum (CIM Ayu-eugenol rich variety) and O. basilicum (CIM Saumya-methylchavicol rich variety) [19] was generated using paired-end (PE) Illumina NGS sequencing platform and genes involved in phenylpropanoids/ terpenoids biosynthesis pathway were identified. This study also reports EST collection of leaf tissues from O. sanctum and O. basilicum with a number of differentially expressed cytochrome P450s, transcription factors and pathway genes with probable involvement in differential metabolite biosynthesis in O. sanctum and O. basilicum leaf tissues. Using these datasets, molecular markers of EST-SSRs were also analyzed to facilitate the marker-assisted breeding of these species. Overall, this data set will be a significant advancement in terms of genomic resources in the diverse Ocimum genus.

Results and Discussion
Transcriptome sequencing, de novo assembly and functional annotation of contigs In recent years, Illumina sequencing platform has been widely used for transcriptome analysis of plants devoid of reference genomes [20][21][22] Figure 1A). In root transcriptome of Ipomoea batatas, 65.76% unigenes were in the range of 101-500 bp length followed by 20.79% of transcripts of 501-100 bp length [20], similarly in the case of Medicago sativa, Boehmeria nivea, Apium graveolens and C. longa, Centella asiatica the highest number of transcripts/unigenes were reported with length between 75-500 bp [21][22][23]. Further, transcripts from both Ocimum samples were clustered using CD-HIT-v4.5.4 at 95% identity and query coverage resulting in a total of 130996 transcripts. Blastx search was conducted for assembled sequences of O. sanctum and O. basilicum against uniprot databases and GO terms were assigned for each unigene based on the GO terms annotated to its corresponding homologue in the uniprot database with the proteins of Arabidopsis, rice and lamiaceae family (

Functional classification of Ocimum transcriptome by GO
Gene Ontology (GO) is an international standardized gene functional classification system offering an updated and a strictly defined concept to comprehensively describe the properties of genes and gene products in any organism [24]. In order to assign putative functions, transcripts from O. sanctum and O. basilicum were compared against the NR protein sequences of Arabidopsis, rice and lamiaceae family available at uniprot database using blastx algorithm. The associated hits were searched for their respective GO. Based on sequence homology, 59380 sequences from O. sanctum and 104856 sequences from O. basilicum were categorized into 51 functional groups under three main categories: biological process (BP), cellular component (CC) and molecular function (MF) (Figure 2). Highest percentages of genes were
Recently, presence of pentacyclic triterpenoids like ursolic, oleanolic and betulinic acids has been reported in Ocimum spp. [28]. This non-aromatic class of compounds have pharmacological importance such as anti-HIV, antibacterial, antiviral, anticancer and anti-inflammatory activities [29]. Like other sesquiterpenoids these triterpenoids also share their origin to farnesyl diphosphate (FDP). FDP is converted to squalene and then to oxidosqualene respectively by squalene synthase (SQS) and squalene epoxidase (SQE) enzymes. Subsequently, oxidosqualene in presence of multifunctional oxidosqualene cyclases (OSCs) viz.αamyrin synthase (aAS), β-amyrin synthase (bAS) or lupeol synthase (LUP) which are then converted to α-amyrin, β-amyrin or lupeol, respectively. OSCs catalyzing the formation of α-amyrin, also produce β-amyrin, finally synthesizing diverse triterpenoids with the help of CypP450s members. Hence, the bAS expression cannot be directly correlated with the triterpene accumulation. Similar reports of triterpenoids biosynthesis from these OSCs are available from Catharanthus roseus and O. basilicum [30,31]. In this investigation a total of 12 transcripts from O. basilicum and 8 transcripts from O. sanctum were homologous to βamyrin synthase as per the Arabidopsis annotation. HPLC analysis from the dried leaves of both the Ocimum species detected oleanolic and ursolic acids however betulinic acid remained undetected. O. sanctum was observed to be having higher content of oleanolic and ursolic acids as compared to O. basilicum ( Figure 7A).
Ocimum spp. is also known to accumulate rosmarinic acid (an ester of caffeic acid and 3,4-dihydroxyphenyllactic acid), which has various pharmacological properties including antioxidant, antibacterial, antiviral and antiinflammatory activities [32]. Both transcriptomes contained several (32 in O. sanctum; 37 in O. basilicum) transcripts annotated as rosmarinic acid synthase with average RPKM values of 13.6 and 6.3, respectively. To validate differential digital gene expression, 8 genes were randomly selected for quantitative real time PCR (qPCR). These genes (PAL, CCR, CS3′H, EGS, CVOMT, HPPR, BAS and PMK) showed up-regulation in O. basilicum compared to O. sanctum ( Figure 7B). All the genes described in this investigation shows up-regulation for O. basilicum in digital gene expression results ( Figure 7C). This indicates higher accumulation of metabolites in O. basilicum compared to O. sanctum which is in coherence with the cytological study (Additional file 6). As also discussed earlier, O. basilicum is rich in phenylpropanoids with higher content and array of related compounds, which is also in coherence with the observation on upregulation of the phenylpropanoid pathway genes like PAL, CCR, CS3′H, EGS, CVOMT and HPPR in O. basilicum.

Discovery of candidate CYP450s and transcription factors with probable involvement in phenylpropanoid/terpenoid biosynthesis
Cytochrome P450s (CYP450s) are reported to be nature's most versatile biological catalysts forming the biggest gene families in plants accounting for more than 1% of the total gene annotations in individual plant species [33]. These are generally involved in the biosynthesis of terpenoids, sterols, lignins, hormones, fatty acids, pigments, and phytoalexins in plants [34]. These genes are also the subject of analysis in many of the de novo     [36]. Additionally, transcripts of CYP716A class were also identified to be the members of multifunctional oxidases involved in triterpenoids (ursolic, oleanolic and betulinic acids) biosynthesis [37]. Transcription factors (TFs) are sequence specific DNAbinding proteins interacting with the promoter regions of target genes to modulate their expression. In plants, these  proteins play a very important role in regulation of plant development, reproduction, intercellular signalling, response to environment, cell cycle and are also important in the modulation of secondary metabolites biosynthesis [38]. In recent years, many studies have been reported on the involvement of various TF families like bHLH, bZIP, Zinc fingers, MYB, ARF, HSF, WRKY, HB and NAC in regulation of secondary metabolites and plant stress responses [25,39]. As phenylpropanoids and terpenoids are the main determinants of aroma and flavour in Ocimum, it becomes important to investigate the transcriptional regulation of the genes involved their biosynthesis, which can further be used to modulate the pathway and develop phenylpropanoid or terpenoid enriched chemotypes. A few transcription factors from other plants, eg. EMISSION OF BENZENOIDS I (EOBI), EMISSION OF BENZENOIDS II (EOBII), and ODORANT 1 (ODO 1), MYB4, members of R2R3-MYB family regulate benzenoid/phenylpropanoid volatile biosynthesis in Petunia hybrida [40,41]. ORCA2 and AP2 family member, MYC2, a bHLH family member and WRKY1 regulate indole alkaloid and terpenoid biosynthesis pathway in Catharanthus roseus [42,43]. Similarly, a wound inducible WRKY transcription factor from Papaver somniferum was suggested to be involved in benzylisoquinoline biosynthetic pathway [44]. Also, in Lamiaceae family plants like Salvia miltiorrhiza and Perilla frutescens, TFs belonging to bHLH family are reported to be involved in phenypropanoid biosynthesis pathway [45,46]. In the present investigation TFs were classified according to uniprot annotation for Arabidopsis family. A total of 3489 (5.9%) and 6074 (5.8%) transcripts in O. sanctum and O. basilicum, respectively were grouped into 40 TF families (Figure 8). Those which were annotated to have sequence specific transcription factor activity but cannot be grouped among any family were included in 'other' TFs category, following Arabidopsis transcription factor database (http://Arabidopsis.med.ohio-state. edu/AtTFDB/) and Plant transcription factor database (http://planttfdb.cbi.pku.edu.cn/) [47] classification. A systematic analysis of these transcription factors would help in understanding differential regulation of terpenoid and phenypropanoid pathways.

Cytogenetic characterization of O. sanctum and O. basilicum
There have been discrepancies regarding the chromosome number of Ocimum in literature. Darlington and Wylie [48] and Mehra and Gill [49] considered x = 8 as basic chromosome number for the genus Ocimum as a whole, while some other authors suggested that Ocimum species are characterized by the different basic chromosome numbers x = 8, 10, 12, or 16 [50]. In order to establish the actual chromosome numbers for the two varieties used in this study, fast growing roots emerging from stem cuttings were examined for somatic chromosome number. Observations recorded from root-tip mitosis reveal somatic chromosome count of 2n = 16 for O. sanctum and 2n = 48 for O. basilicum and chromosome size below 1 μm (Additional file 6). As the Stilbene, coumarine and lignin biosynthesis essential oil of the genus Ocimum is the reservoir of secondary metabolites, there may be a probable correlation between the chromosome numbers of species and its essential oil yield, which may in turn be affected by expression of related genes. Indeed, DGE and real-time expression analyses showed higher expression of pathway genes in O. basilicum compared to O. sanctum (Figures 4, 5, 7). Moreover, the ploidy level has been shown to enhance the accumulation of secondary metabolites in Cymbopogon [51]. As reported earlier, O. basilicum (var: CIM-Saumya) shows more vigorous growth and higher oil content (0.99%) compared to O. sanctum (var: CIM-Ayu) with 0.70% oil content [19,26].

Analysis of GC content and identification of SSR Markers
Next generation sequencing also offers an opportunity for the analysis of GC content among transcripts and expands the scope for molecular markers such as SSRs. GC content is an important indicator of the genomic composition including evolution, gene structure (intron size and number), gene regulation and stability of DNA [52]. Average GC contents of O. sanctum and O. basilicum transcripts were analyzed to be 47.12% and 46.39%, respectively (Additional file 7), which is in the range of GC levels of coding sequences in dicots (44-47%) [53]. Simple sequence repeats (SSRs) markers have proven to be valuable tools for various applications in genetics and breeding for the better understanding of genetic variation. As described, more than 150 species [1,2] of Ocimum are reported around the world and hence, polymorphic SSR markers are important for investigations related to genetic diversity, relatedness, evolution, linkage mapping, comparative genomics and gene-based association studies. Transcriptome SSR markers also exhibit high inter-specific transferability [54]. Genus Ocimum is highly prone to cross pollination and hence the seed raised population will have variability in metabolite content [10]. The identification of SSRs in Ocimum sp. will help in distinguishing closely related individuals and will also provide useful criteria for enriching and analyzing variation in the gene pool of both the plants. Even though SNPs serve as excellent markers especially for high-throughput mapping and studying complex genetic traits, SSRs provide a number of advantages over other marker systems. SSRs with their moderate density still serve as the best co-dominant marker system for construction of basilicum. The gene specific identification of SSRs in both the Ocimum sp. will help in distinguishing closely related individuals and will also provide useful criteria for enriching and analyzing variation in the gene pool of the plant. Similarly, mining of SNPs from NGSgenerated transcripts mainly involves clustering and assembling the sequence reads, followed by SNP identification by means of in silico approaches [56]. In this investigation, a total of 3245 (66.16%) transitions and 1660 (33.84%) transversions were observed by the SNP finder tool with O. sanctum as anchor (Table 9 and Additional file 10).

Conclusion
Terpenoids and phenylpropanoids are the predominant secondary metabolites in Ocimum species. These metabolites are synthesized through metabolic divergence from the mevalonate/non-mevalonate and shikimate pathways, respectively, and accumulate in the specialized glandular trichomes on the leaves [7]. So, this study was undertaken with the objective of enriching the existing limited set of genomic resources in Ocimum, and to provide a comparative analysis of transcriptomes of two Ocimum species having contrasting essential oil composition. To this end, high quality transcriptome database was established for O. sanctum and O. basilicum by using NGS technology. This is the first report of a comprehensive transcriptome analysis of Ocimum species. Genes encoding pathway enzymes related to aromatic components such as volatile terpenoids, phenylpropanoids and non-volatile medicinal compounds such as pentacyclic triterpenes and rosmarinic acid were identified in the transcriptome database; indicating the importance of exploring Ocimum species as a source of both medicinal and aromatic compounds. Moreover, several putative CYPs and transcription factors with probable involvement in the biosynthesis and regulation of terpenoids and phenylpropanoids were identified. Further investigations on these putative CYPs and TFs may reveal the reasons behind differential accumulation of phenylpropanoids/terpenoids, along with the similarity/difference in biosynthetic pathways operating in different species of Ocimum. Additionally, several SNPs and SSRs were identified in both the transcriptomes which will assist in breeding of Ocimum for developing distinct chemotypes. Overall, Ocimum transcriptome databases presented here, both individually and collectively, can be exploited to characterize genes related to phenylproanoid and terpenoid metabolism and their regulation, as well as for breeding chemotypes with unique essential oil composition in this largely cross-pollinating species.

Methods
Plant material, library preparation and sequencing The cDNA was cleaned up using Agencourt Ampure XP SPRI beads (Beckman Coulter, USA) followed by ligation of "Illumina Adapters" to the cDNA molecules, after end repair and addition of "A"base. Following SPRI cleanup after ligation, the library was amplified using 11 cycles of PCR, for enrichment of adapter ligated fragments. The prepared library was quantified using Nanodrop and validated for quality by running an aliquot on High Sensitivity Bioanalyzer Chip (Agilent).

De novo assembly and sequence clustering
Raw reads obtained after sequencing were subjected to adapter, B-block and low quality base filtering to obtain the processed reads. De novo assembly of the processed reads was carried out using Velvet_1.2.10 for different hash lengths (45-73) [57]. Velvet takes in short reads and assembles them into contigs using paired-end information. This assembly was used by "observed-insertlength.pl" and "estimate-exp_cov.pl" (from Velvet package) to estimate insert length and expected coverage parameters, which were then used to generate a final assembly. The resulting contigs were assembled into transcripts by Oases-0.2.01 for the same (45-73) hash lengths [58], using the assembly from Velvet and clustering them into small groups (loci). It then uses paired end information to construct transcript isoforms. Transcript  Tot. no. of SNPs assembly was selected for the best hash length based on the assembly statistics and the transcripts from both the samples were clustered together using CD-HIT-v4.5.4 at 95% identity and 95% query coverage [59]. The transcriptome data for both the species was submitted to the NCBI under SRA Study accession number SRP039008 for O. sanctum and SRP039533 for O. basilicum).

Sequence annotation and functional characterization
Assembled transcripts were blasted against UniProt databases and GO (Gene Ontology) terms were assigned for each unigene based on the GO terms annotated to its corresponding homologue in the UniProt database with the proteins of Arabidopsis, Rice and Lamiaceae family. Each annotated sequence may have more than one GO term, assigned either for different GO categories (Biological Process, Molecular Function and Cellular Component) or in the same category [60]. To gain an overview of gene pathway networks, the assigned polypeptides encoded by unigenes from O. sanctum and O. basilicum transcriptome were mapped to metabolic pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG) [61]. The output of KEGG analysis includes KEGG orthology (KO) assignments using KEGG automated annotation server, KAAS (http://www. genome.jp/kaas-bin/kaas_main?mode = partial).
Read mapping and transcript abundance measurement RPKM (Reads Per Kilobase per Million) measurement is a sensitive approach by which expression level of even poorly expressed transcripts can be detected using read count as the fundamental basis. For RPKM measurement, reads were first aligned using "Bowtie tool" [62] and "Awk scripting" was used to generate the read count profile from the output file (.sam) of Bowtie alignment. RPKM values were calculated applying the approach adopted by Mortazvi and co-workers [63], to measure the expression level of each assembled transcript sequence. The clustered transcripts were used as the master reference for carrying out the digital gene expression (DGE) analysis by employing a negative binomial distribution model (DESeq v1.8.1 package (http://www-huber. embl.de/users/anders/DESeq/) [64].

Cytological analysis
Stem cuttings of the O. sanctum (var. CIM Ayu) and O. basilicum (var. CIM Saumya) were transplanted in moist sand. The fast growing 1 cm long young roots emerging from the stem cuttings were excised and pre-treated for 2.5 h in saturated aqueous solution of p-dichloro benzene at 12-14°C, washed thoroughly in water and quickly transferred to Carnoy's mixture (6:3:1) for fixation overnight at room temperature. Next day the fixed roots were transferred to 45% acetic acid for 10 minutes, and thereafter stained in 2% acetocarmine for 2 hrs at 60°C and then overnight at room temperature. The stained root tips were squashed in 45% acetic acid and permanent chromosome preparations were made by removing the cover glass by quick-freeze method followed by dehydration in tertiary butyl alcohol series and mounting in DPEX.

Real-time PCR analysis
Total RNA was isolated from both O. sanctum and O. basilicum leaves of same stage and cDNAs were prepared using RevertAid first strand cDNA synthesis Kit (ThermoScientific, USA). Expression of selected pathway genes and cytochrome P450s was analyzed through qPCR using Fast Real Time PCR System (7900HT Applied Biosystems, USA) and Maxima SYBR Green PCR Master Mix (2X) (ThermoScientific, Waltham MA, US) to validate Illumina sequencing data. Each PCR reaction was set up in 15 μl volume containing 7.5 μl of Maxima SYBR Green PCR master mix, 50 ng of cDNA sample prepared using RevertAid first strand cDNA synthesis Kit (ThermoScientific) and gene-specific primers (Additional file 11). The specificity of the reactions was verified by melting curve analysis with the thermal cycling parameters: initial hold (50°C for 2 min); initial denaturation (95°C for 10 min); and 40 amplification cycles (95°C for 15 s; and 60°C for 1 min) followed by additional steps (60°C for 15 s, 95°C for 15 s and 37°C for 2 min). Relative mRNA levels were quantified with respect to the reference gene 'actin' of O. sanctum (SO_2009_tran-script16212) [65]. Sequence Detection System (SDS) software version 2.2.1 was used for relative quantification of gene transcripts using the ΔΔCQ method. Threshold cycle (Cq) values obtained after real-time PCR were used for calculation of the ΔCq value (target-reference). The quantification was carried out by calculating ΔΔCq to determine the fold difference in gene expression [ΔCq target -ΔCq calibrator]. The RQ was determined as 2 -ΔΔCQ . All the experiments were repeated using three biological replicates and the data were analyzed statistically (±Standard Deviation).