Multi-tissue transcriptome analysis using hybrid-seq revealed potential genes and biological pathways associated with azadirachtin A biosynthesis in neem (Azadirachtin indica)

Background: Azadirachtin A is a triterpenoid from neem tree exhibiting excellent activities against over 600 insect species in agriculture. The manufacture of azadirachtin A depends on extraction from neem tissues, which is not ecofriendly and sustainable. The low yield and discontinuous supply impeded the further application. The biosynthetic pathway of azadirachtin A is still unknown. Results: We attempted to explore azadirachtin A biosynthetic pathway and identied key involved genes by analyzing transcriptome data of ve neem tissues through hybrid-seq (Illumina HiSeq and Pacic Biosciences Single Molecule Real Time (PacBio SMRT)) approach. Candidates were rstly screened by comparing expression level within ve tissues. After phylogenetic analysis, domain prediction and molecular docking, 22 candidates encoding 2,3-oxidosqualene cyclase (OSC), alcohol dehydrogenase (ADH), cytochrome P450 (CYP450), acyltransferase (ACT) and esterase (EST) were proposed to be potential genes involved in azadirachtin A biosynthesis. Among them, two unigenes encoding homolog of MaOCS1 and MaCYP71CD2 were identied. An unigene encoding complete homolog of MaCYP71BQ5 was rst reported. Accuracy of the assemblies were veried by Quantitative Real-Time PCR (qRT-PCR) and full-length cloning PCR. Conclusions: By integrating and analysis transcriptome data from hybrid-seq technology, 22 DEGs were nally selected as candidates involved in azadirachtin A pathway. The obtained reliable and accurate sequencing data provided important novel information for understanding neem genome. Our data shed new light on the understanding of other triterpenoids biosynthesis in neem trees and provide reference for exploring other valuable natural product biosynthesis in plants. unigene, two ADH unigenes, twelve CYP450 unigenes, two ACT unigenes and five EST unigenes after phylogenetic analysis, domain prediction and molecular docking. Among them, three transcripts encoding complete AiOSC1, homologs of MaCYP71CD2 and MaCYP71BQ5[18] were also found in our study. Unigene contains complete ORF encoding homolog of MaCYP71BQ5 was first reported. Qualitative real-time PCR and full-length cloning PCR verifying of the reasons for slow progress in azadirachtin pathway exploration although numerous available neem genome and transcriptome data. transcriptome ve of in azadirachtin neem tree genome of terpene or limonoid biosynthetic genes. 22 unigenes encoding enzymes include OSC, ADH, CYP450, ACT and EST were selected as candidates involved in azadirachtin A downstream pathway. This is the rst report on hybrid-seq transcriptome proling analysis of A. indica. The obtained unigenes may provide a valid and diverse candidate pool for the study of the selective modication by the functional groups at triterpenoid or limonoids skeleton as well as the convergent evolution in secondary metabolism.

Although the genome and transcriptome data were easy to download from SRA database, the submitted files were all raw data without assembly or annotation. This caused difficulties for people who are not familiar with the processing of high-throughput genome or transcriptome data. Besides, samples for transcriptome or genome analysis in previous report were growing in India. Genetic information within same species in different living area or developmental stage were different and this caused the differences in gene expression. Hens, five neem tissues (fruit, leaf, stem, flower and root) were sampled for transcriptome sequencing. Among five tissues, fruit with the green hard seed was reported to contain the highest azadirachtin A in all the developing stage of fruit [14,19]. Leaf from neem tree in China contained azadirachtin A at the value of 969.9 μg g −1 [20]. The rank of azadirachtin content in different tissues was consistent with that in previous report (seed kernels, 0.03%; leaves, 0.9x10 -3 %; bark, 0.5x10 -3 %; root, 0.3x10 -3 %; stem 0.2 x10 -3 %) [21].
They were chosen as the higher azadirachtin A group and used for mining synthetic genes.
With the advances in high-throughput sequencing technology, the third-generation sequencing represented by Biosciences Single Molecule Real Time (PacBio SMRT [22]) and Oxford Nanopore was applied in academic researches. Owing to its high error rate of sequencing in its longer reads (15%) as well as the accuracy of reads from Illumina, a new method called hybrid-seq [23] was generated and it bring together the best of two sequencing technologies. The obtained longer reads was corrected by short but accurate reads from Illumina. As reported by Koren [24], only 41 (0.1%) of the PacBio reads exactly matched the annotated exon structure before correction in genome assembly. This number will rise to 12, 065 (24.1%) after correction with short reads. Error correction method was used to produce de novo assembly of the Saccharomyces cerevisiae. The generated contig N50 length is more than ten times greater than an Illumina-only assembly (678 kb versus 59.9 kb), and has >99.88% consensus identity when compared to the reference genome [25].
The successful applications of hybrid-seq in genome refining, isoform identification laid strong foundations for our study.
While three steps involved in azadirachtin A biosynthesis had been figured out, the rest part of the downstream pathway was still unexplored. Based on the metabolites in Neem Metabolite Structure Database (http://www.vmsrfdatabas e.org/index.php) as well as the distribution of metabolites (Table S1) in neem tissues, a putative biosynthetic pathway for azadirachtin A (Figure 1) were proposed. Five kinds of putative enzymes (OSC, CYP450 [26], ADH [27], ACT and EST [28]) participating in were proposed based on the putative biosynthetic pathway. Candidate mining were performed through the work flow of gene mining ( Figure S1). Extensive bioinformatics analysis of unigenes further provided twentytwo candidates (Table 1) including one OSC unigene, two ADH unigenes, twelve CYP450 unigenes, two ACT unigenes and five EST unigenes after phylogenetic analysis, domain prediction and molecular docking. Among them, three transcripts encoding complete AiOSC1, homologs of MaCYP71CD2 and MaCYP71BQ5 [18] were also found in our study. Unigene contains complete ORF encoding homolog of MaCYP71BQ5 was first reported. Qualitative real-time PCR and full-length cloning PCR were used for verifying unigenes expression level (FPKM) and transcript sequence accuracy. The obtained candidates could be used as an important resource to investigate the catalysts responsible for essential biochemical reactions in azadirachtin A biosynthesis as well as triterpenoids metabolism in close species of neem. Homologs with * represent the functionally identified genes in the previous report.

Plant materials
All fresh and healthy tissues (root, leaf, stem, ower and fruit containing seed from neem (Azadirachtin indica, A. indica) tree were randomly sampled from the park of Hainan University, followed by transcriptome analysis on September 26, 2018. All samples for transcriptome sequencing were harvested from three plants for RNA extraction. Tissues were gently rinsed and subsequently cut into small pieces. All materials were immediately frozen in liquid nitrogen and stored at -80℃ before use. The A. indica was authenticated by Prof. Zhiwei Wang and Dr. Yaqing Wei.

RNA extraction
Total RNA was extracted using the RNA plant Plus Reagent (TianGen, Beijing, China) according to the manufacturer's protocol. The extracted RNA concentration and integrity were assessed using the RNA Nano 6000 Assay Kit with the Agilent Bioanalyzer 2100 system (Agilent, CA, USA). For PacBio sequencing, the extracted RNA concentration and integrity were assessed using the Fragment Analyzer system (Agilent, CA, USA). The A260/A280 ratio ranging from 1.9 to 2.0, concentration above 285 ng/μL, and RNA integrity number (RIN) greater than 8.0 were used for subsequent cDNA library construction.
cDNA library construction and hybrid-sequencing A total amount of 5 μg RNA was used for cDNA library construction. For Illumina HiSeq sequencing, oligo (dT) beads were used to isolate poly(A) + mRNA. The paired-end libraries with an insert size of approximately 250 bp were constructed. All libraries were sequenced commercially to generate paired-end reads of average length 150 bp on the Illumina HiSeq 2000 sequencing platform (Hiseq 2000 V3) according to the manufacturer' protocol by Beijing Genomics Institute (BGI-Shenzhen, China).
For PacBio SMRT sequencing, 1000 ng of mRNA from each tissue was pooled for cDNA library construction. Double-strand cDNA was synthesized according to SMARTer PCR cDNA Synthesis Kit (Clontech). DNAs were selected by BluePippin TM and ranged into four sizes: 1-2, 2-3, 3-6 and 5-10 kb. DNAs after the second large-scale PCR were used as template prepared for SMRTbell library for sequencing. The amount of generated base was about 12 GB to cover all transcripts in sample.
HiSeq reads were ltered by discarding the reads with adaptor sequences, reads with ambiguous "N" bases larger than 5% and low-quality reads. and the ltered reads were then assembled using Trinity (v2.0.6) with default parameters to generate contigs. These contigs were then processed by sequencing clustering software TGICL (v2.0.6) for redundant Trinity assembled contig removal. Raw PacBio SMRT reads were processed using SMRT Analysis Server (v2.3) for full-length transcripts generation. The obtained transcripts were corrected with ltered HiSeq reads using LSC [29] software and subsequently ltered with CD-HIT-EST for the removal of redundant Trinity generated fragments. Finally, the calibrated transcripts were assembled into unique putative transcripts (including contigs and singletons) and unigenes were characterized for subsequent analysis.

Annotation and differential gene expression analysis
The unigenes were annotated based on sequence similarity using BLASTX against ve databases, including non-redundant protein database (Nr), SwissProt, Clusters of Orthologous Groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes protein database (KEGG). Pfam annotation for the unigenes was nished using the HMMER 3.0 package. Sequence description for each unigene were transferred from homologous BLAST hits with E-value<10 -5 . Gene Ontology (GO) terms were assigned based on the top BLAST hit using Blast2GO. Genes were obtained by BLASTn using non-redundant nucleotide sequence database (Nt). Functional enrichment of the assigned GO terms was calculated and analyzed via WEGO software. The distribution of gene functions was illustrated by biological process, cellular component and molecular function.
Clean reads were mapped to unigenes using Bowtie2 (v2.2.5), and then gene expression level was calculated with RSEM (v1.1.12). To compare the difference of gene expression among different samples, the FPKM (Fragments per kilobase per transcript per million mapped reads) method was used [30]. DEseq2 was used to identify DEGs (absolute value of log 2 fold-change≥1) after correction of p-values (adjusted 0.05) using Benjamini-Hochberg (false discovery rate, FDR≤0.001). Higher expressional unigenes characterized from leaf and fruit (azadirachtin A stimulating tissues) libraries were used for candidate mining.

Analysis of phylogeny, domain architecture of unigenes
The SwissProt database was queried to retrieve all reviewed sequences of alcohol dehydrogenase, CYP450, acyltransferase, and esterase.
These sequences were downloaded in FASTA format and aligned with all four kinds of candidate using Clustal W algorithm with default parameters. Phylogenetic analysis based on multiple alignments of protein sequences was done using the Neighbor Joining [31] method as implemented in MEGA7 and the phylogenetic trees were visualized on iTOL (https://itol.embl.de/). Accessions of these protein sequences used in phylogenetic analysis were provided in Additional le 1. The protein sequences of the candidates were also searched against Pfam database in order to get domain architecture information complementary to that provided by SwissProt.

Molecular modelling and docking for enzyme-substrate analysis
Modelling of ve CYP450 proteins encoded by unigenes was performed by using Phyre2 web portal using fold recognition method [32].
Docking analysis was performed to predict the interactions of four triterpenoids (tirucalla-7,24-dien-3β-ol, azadirone nimbin and nimbolide) as substrate for the CYP450 proteins using software AutoDock 4.0 (http://autodock.scripps.edu). For covering acting domains present in CYP450 protein, grid spacing was maintained at 0.375 Å. Genetic algorithm (GA) was applied as searching parameter with 10 GA runs and population size was set to 150, energy evaluations was set to maximum 25,00,000 with considering the maximum number of generations to 27,000. The most favorable docking conditions were in the form of lowest binding energy conformations with H-bonds in cluster. Phymol 2.3 software was used for better analysis of interaction in protein-ligand complexes obtained from AUTODOCK4 [33] software.
Validation of hybrid-seq by quantitative real-time PCR and full-length cloning PCR Ten transcripts were selected to validate their expression from hybrid-seq by quantitative real-time PCR (qRT-PCR). Total RNA was processed with RNase-free-DNase I (TianGen, Beijing, China) following the manufacturer's instructions to eliminate potential DNA contamination. First strand cDNA was synthesized using GoScript TM Reverse Transcription System (Promega, Canada). The reactions were performed in triplicate using 2 μL diluted cDNA template in a 20 μL total volume. qRT-PCR was performed in 96-well plates on a Bio-Rad CFX96 real-time PCR system (Bio-Rad, CA, USA) using SYBR Green Mix (Bio-Rad, CA, USA). A two-step cycling program was performed, comprising an initial 95℃ polymerase activation for 3 min, followed by 40 cycles of 95℃ for 10s, then 60℃ for 30s. The melting curve was obtained by heating the amplicon from 65℃ to 95℃ at increments of 0.5℃ per 5s. The actin gene was used as an internal control to normalize all data. The relative quantitation (ΔΔCt) method was used to evaluate differences between the tissues for each gene examined.
Data analysis was performed using GraphPad Prism version 5 for Windows (GraphPad Software, Inc. La Jolla, CA, USA). The primers for qRT-PCR reactions were listed in Additional le 2.
Ten unigenes were selected to verify the accuracy of sequence obtained from hybrid-seq by full-length cloning PCR. Obtained cDNA (2 μL) was used in a 50 μL PCR reaction containing 2 μL of forward primer (10 μM), and 2 μL of reverse primer (10 μM) and 25 μL of I-5™ 2×High-Fidelity Master Mix (TsingKe Biotech, Beijing, China). The PCR product was puri ed using GeneJET PCR Puri cation Kit (Thermo, USA) and assembled into pJET vector using CloneJET PCR Clone Kit (Thermo, USA). The constructs were then transformed into Trelief 5α Chemically competent cell (TsingKe Biotech, Beijing, China) and the ampli ed constructs were sequenced by Genewiz Company (GENEWIZ, Beijing, China). The primers for full-length cloning PCR were listed in Additional le 2.

Results And Discussions
Hybrid-seq transcriptome analysis and data validation To generate a comprehensive overview of neem transcriptome, total RNAs were extracted from leaves, fruits (containing seeds), roots, stems and owers. The approach we used to obtain the transcriptome data was hybrid-seq technology which combine the Illumina Hiseq and PacBio SMRT for sequencing and correct the errors in long reads with short reads [24]. Different neem tissues were sequenced separately using Illumina Hiseq platform and generated 41.14, 41.35, 40.60, 40.59 and 40.64 million clean reads, respectively. PacBio SMRT platform produced 6.75 million clean reads. After calibration with short reads from Illumina platform, the assembled unigenes were corrected with an N50 of 5076 bp and mean length of 3607 bp ( Table 2). The obtained assembly were 2.5 times and 3 times longer than in the previous report [14]. The length of these unigenes ranged from 500 to 6001bp, and the majority (over 55.5%) were distributed in 4501 bp and above ( Figure 2a).
To con rm the accuracy of the hybrid-seq (FPKM) results, we selected 10 unigenes and used qRT-PCR to determine their relative expression ( Figure S2). Good consistency between the qRT-PCR and FPKM were observed except for transcript/19882 and transcript/16577. Their expression from RNA-seq (FPKM) and relative expression level by qRT-PCR in root and stem were inconsistent. The inconsistency between qRT-PCR and FPKM in transcriptome analysis have also been reported by other researchers. Eight out of fty-eight genes expression level were inconsistent with their FPKM when Zhang validated by qRT-PCR from his transcriptome [34]. In another case, when comparing gene expression fold changes between MAQCA and MAQCB samples, about 85% of the genes showed consistent results between RNAsequencing and qRT-PCR data [35]. The inconsistency between FPKM and qRT-PCR may result from reasons below: Although realtime-PCR and RNA-seq are both used to measure gene expression, the unit of measurement [36] as well as the computing method are different for FPKM and qRT-PCR. Many factors affect the accuracy of FPKM and qRT-PCR. Bias in PCR ampli cation [37] and RNA-seq library preparation [38]and sequencing add noises in RNA-seq. The quality of the mRNA, ampli cation e ciency and the choice of reliable internal controls referred to as reference genes affect the accuracy of qRT-PCR [39]. Therefore, the consistency of transcript/16577 and transcript/19882 between qRT-PCR and FKPM was acceptable. More further examinations need to do on these two unigenes.
Ten unigenes were cloned by PCR used primers listed in Additional le 2. According to the sequencing results of the cloned unigenes, every of them was 100% identical to the sequences obtained from hybrid-seq platform. Among them, four unigenes attracted our attentions since they contained complete ORF of the genes. One of them(transcript/14554) was annotated as NADPH-cytochrome P450 reductase 2 of neem. The cloned transcript/14449 was found to encode an OSC consist of 760 amino acids. The length of two CYP450 unigenes (transcript/16971 and transcript/16742) was 1536bp and 1527bp, and they encoded protein consist of 511 and 508 amino acids, respectively. Detailed sequencing results indicated that the generated unigenes in our study were of good accuracy and therefore reliable for further analysis.
COG and GO classi cation were used to further evaluate the completeness and effectiveness of the neem annotation. All 17634 unigenes (87.3% of 20201 unigenes) were classi ed into 25 functional Clusters of Orthologous Groups (COG) clusters (Figure 2b). Of those, 2610 unigenes (14.8% of the total 17634 classi ed unigenes) were categorized into general function prediction only cluster, which formed the largest group, whereas the clusters for replication, transcription, recombination and repair followed closely. Although only 378 unigenes were categorized into the "Secondary metabolites biosynthesis transport and catabolism" cluster, they may play important roles in providing precursors in secondary metabolite biosynthesis.
In total, the 74905 assembled unigenes (66.3% of 113008 unigenes obtained in Hiseq sequencing analysis) were assigned at least one of the 55 GO terms (Figure 2c). Of those, these unigenes were predominantly assigned to the metabolic process (GO:0008152) and cellular process (GO:0009987). The unigenes categorized in the molecular function category were predominantly associated with catalytic activity (GO:0003824) and binding functions (GO:0005488). The unigenes categorized in the cellular component category were predominantly associated with membrane (GO: 0016020), cell part (GO: 0044464) and cell (GO: 0005623). These ndings showed that the main COG and GO classi cations for the fundamental biological processes were identi ed.
The KEGG pathway database was used to systematically evaluate the gene biological functions participated in different pathways. A total of 16778 unigenes were matched to the database and assigned to 135 KEGG pathways (Table S3). Metabolic pathways (3590 unigenes, 21.4%) and biosynthesis of second metabolites (1560 unigenes, 9.3%) were the two dominant categories. In the biosynthesis of secondary metabolites category (Table S4) in neem, subcategories of avonoid biosynthesis, terpenoid backbone biosynthesis (Table S5), steroid biosynthesis, sesquiterpenoid and triterpenoid biosynthesis (Table S6), diterpenoid metabolism and carotenoid biosynthesis were included. There were 242 unigenes involved in the metabolism pathways of terpenoids and polyketides.

Differential expression analysis of unigenes in neem
In order to nd candidate genes in azadirachtin A biosynthesis pathway, all transcripts had been identi ed and annotated and mapped in different pathways. Genes expressed higher in azadirachtin A stimulating-organ fruit and leaf would be more likely to be the involved-genes in azadirachtin A biosynthesis. The expression level of unigenes was calculated by FPKM. The number of up-regulated DEGs in leaf and fruit comparing to other tissues were 219 and 397, respectively. All these DEGs were used for mining candidate involved in azadirachtin A biosynthesis.

First screening of candidate genes involved in azadirachtin A biosynthesis
According to the putative azadirachtin A pathway in Figure 1, tirucalla-7,24-dien-3β-ol was assumed as the scaffold formed from 2,3oxidisqualene. A few steps such as hydroxylation and furan ring formation occurred after scaffold formation. The hydroxyl groups were then either oxidized to acid or acylated or esteri ed to esters forming limonoid compounds like azadirone or nimbin. Azadirachtin A was nally obtained after modi cations on azadirone or nimbin. ADH, CYP450, ACT and EST were supposed to be involved in azadirachtin downstream pathway and their encoding-unigenes were chosen for candidates.
As a triterpenoid, the rst step of azadirachtin A biopathway was the formation of its scaffold catalyzed by OSC. Among all unigenes involved in terpenoids biosynthesis, eight detected unigenes were annotated as OSC, including transcript/1784, transcript/1866, transcript/8176, transcript/8892, transcript/9751, transcript/14584 and transcript/19700 and transcript/14449. Among them, only transcript/14449 expressed higher in fruit. While after phylogenetic analysis (Figure 3) of transcript/14449 with several characterized OSCs from other plants, transcript/14449 was grouped with AiOSC1 [18], a newly characterized OSC from neem catalyzing the formation of tirucalla-7,24-dien-3β-ol, while other unigenes were grouped with cycloartenol synthase. After DNA sequence analysis with AiOSC1( Figure  S3), transcript/14449 is 100% identical to AiOSC1, which means that these two genes were the same gene. It indicated that transcript/14449 could be a candidate gene for producing azadirachtin A scaffold.
Among all the DEGs in fruit and leaf transcriptome data, DEGs encoding ADH, CYP450, ACT and EST were discovered. There were sixteen DEGs encoding ADH. Four up-regulated DEGs encoding ADH were selected for further screening. As for CYP450, sixteen DEGs encoded CYP450 and ten DEGs were selected. Similarly, thirteen and nineteen DEGs encoded ACT and EST respectively and there were four and twelve DEGs up-regulating in leaf and fruit tissues, respectively. DEGs with same sequence or sequences within 150 amino acids were excluded in further screening.

Further screening of the other four enzymes through phylogenetic analysis and domain prediction
Phylogenetic trees ( Figure 4) and protein domains prediction were used for further screening of these DEGs. Transcript/22186, transcript/18833 and transcript/19291 were respectively grouped with CADH4 [40] and CADH1, which catalyzed biosynthesis of cinnamaldehyde from cinnamyl alcohol. Transcript/18482 was grouped with ADHX [41] which had activity to primary and secondary alcohols. Transcript/18833 and transcript/19291 contained PLN02514 domain which was also found in cinnamyl-alcohol dehydrogenase [42]. Transcript/22186 contained nsLTP2 [43] domain which was contained in non-speci c lipid-transfer protein. Transcript/18482 contained GxGxxG motif [44] found in S-(hydroxymethyl) glutathione dehydrogenase. Transcript/18833 and transcript/19291 were selected as candidates for further examination.
For all EST DEGs after phylogenetic analysis, transcript/19998 and transcript/16750 grouped with KAI2 [61]. KAI2 was reported to be involved in seed germination and didn't show activity as esterase. Transcript/19188 was divided into a subclade with A. thaliana GDL15, which belonged to GDSL [62]-like lipase/acylhydrolase superfamily and displayed hydrolytic activity with esters. Transcript/18100 was grouped with PME3, a pectinesterase catalyzed the hydrolysis of (1,4)-α-D-galacturonosyl methyl ester [63]. Transcript/19748 formed a tight subclade with TGL1, which was a sterol esterase mediating the hydrolysis of steryl esters [64]. Transcript/19882 and transcript/19697 were in a group with HIDH [65] and CXE18, respectively and these two enzymes shew activity to carboxylic esters. The conserved domain analysis of EST candidates presented transcript/19188 contained Ser-His-Asp (Glu) triad found in a SNGH plant lipase [66]. Transcript/19882, transcript/19697, transcript/19748 had AES domain [67] which contained by acetyl esterase/lipase. PAE domain [68] in transcript/18100 was also found in pectin acetylesterase while transcript/16750 and transcript/19998 contained the plant pectinesterase inhibitor domain PLN02201. Therefore, transcript/16750 and transcript/19998 were removed from the list of candidates because of the prediction from phylogenetic analysis and domain analysis.

Molecular Docking analysis of CYP450s
From the phylogenic analysis of CYP450s, ve CYP450s were grouped with the members in CYP71 and CYP72 clades. Molecular docking of ve CYP proteins (CYP16057 and CYP16577, CYP16950, CYP16777 and CYP17001) was performed with four triterpenoids (ligands) ( Figure 5 and Figure S4). Analysis revealed that binding energy was lowest in case of CYP16057 docked with tirucalla-7,24-dien-3β-ol forming zero hydrogen bond. However, azadirone and nimbolide formed stable complexes with CYP16057 with one hydrogen bond with -7.20 and-6.40 kcal/mol of binding energies ( Figure 5 and Table S7), respectively. The docking analysis for CYP16577 revealed that, among all the ligands, binding energy was lowest for tirucalla-7,24-dien-3β-ol and nimbin with -10.07 and -9.83 kcal/mol forming zero and two hydrogen bonds respectively. Docking of CYP16577 with triterpenoids showed interaction through only one hydrogen bond with binding energy of -9.42 and-9.16 kcal/mol for azadirone and nimbolide ( Figure 5 and Table S8). The binding energy of CYP16777 docked with azadirone and tirucalla-7,24-dien-3β-ol was -9.96 and -9.75 kcal/mol forming one hydrogen bond respectively. However, nimbolide and nimbin formed stable complexes with three and two hydrogen bonds respectively, indicating that the conformation of nimbolide is best suitable for CYP16777 when number of hydrogen bonds formed between protein and ligand is set as criteria. The hydrogen bonds between nimbolide and CYP16777 are formed at PHE354, ARG355 and ARG419 with ligand moiety at different positions (Table S9) with varying bond length ( Figure 5). The docking analysis for CYP16950 revealed that, among all the ligands, binding energy was lowest for tirucalla-7,24-dien-3β-ol and azadirone with -8.31 and -7.90 kcal/mol without forming hydrogen bond. However, nimbin and nimbolide formed stable complexes with CYP16950 with one hydrogen bond with -6.32 and-6.66 kcal/mol of binding energies ( Figure 5 and Table S10). Docking of CYP17001 with triterpenoids showed interaction through only one hydrogen bond with the lowest binding energy of -10.11kcal/mol for tirucalla-7,24-dien-3β-ol. Azadirone and nimbolide respectively formed stable complexes with CYP17001 through two hydrogen bonds. The hydrogen bonds between nimbin and CYP17001 are formed at ARG355, ARG419 and GLY423 with the binding energy is -9.82kcal/mol ( Figure 5 and Table S11). Tirucalla-7,24-dien-3β-ol was con rmed to be the scaffold of azadirachtin A. Nimbin, nimbolide and azadirone are three important compounds isolated from neem. They were proposed as intermediates in azadirachtin A pathway. According to the docking of ve CYP450s with them, CYP16057, CYP16577, CYP16950 and CYP17001 showed strongest binding with tirucalla-7,24-dien-3β-ol. CYP16777 was more easily to bind with azadirone. Three residues in CYP16777 and CYP17001 respectively formed stable hydrogen bonds with nimbolide and nimbin. All the docking results indicate the priority of reactions between ve CYP450s and four ligands. It also provided a theoretical basis for further functional assay of these CYP450s. The residues in proteins forming hydrogen bond with ligands also provided sites for mutation when we need to improve catalytic ability of these CYP450s.
Farnesyl diphosphate synthase (FDS) catalyzed formation of farnesyl pyrophosphate (FPP) from IPP and unigene encoding FDS expressed highest in fruit followed by in root and ower. Solavetivol pathway is one of sesquiterpenoid pathways detected and unigene encoding solavetivol synthase (CYP71D55) expressed higher in fruit and leaf. In triterpenoids biosynthesis, two FPPs formed to 2,3-oxidosqualene under continuously catalysis of squalene synthase (SQS) and squalene epoxidase (SQLE). SQS and SQLE encoded unigenes both expressed the highest in leaf. Unigenes encoding cycloartenol synthase (CAS1) was identi ed and its expression in different tissues was in the order of fruit> stem>root. Methylsterol monooxygenase (SMO1) and sterol-4alpha-carboxylate 3-dehydrogenase (NSDHL) expressed the highest in ower. Unigenes encoding delta14-sterol reductase (TM7SF2) and CYP51G1 were highly-expressed in leaf and fruit respectively.
The expression level of unigenes involved in putative azadirachtin A downstream pathway had been presented. Transcript/14449 encoding the rst enzyme in azadirachtin A downstream pathway expressed highest in fruit followed by in leaf. After scaffold synthesis, the methyl group at C14 was removed. It was catalyzed by enzyme encoded by transcript/16198, which was highly-expressed in leaf. Alcohol at C3 was continuously oxidized into C3 ketone group by transcript/18725 and transcript/17679 and formed common compounds [69] isolated from Meliaceae family. These two unigenes expressed highest in root and fruit respectively. Next important step involved in azadirachtin A was the formation of furan ring, and two CYP450s catalyzing the formation were isolated from M. azedarach, and C. sinensis [18]. Two transcripts (transcript/16971 and transcript/16742) in our study were found to be homologs of the identi ed two CYP450s and might produce melianone [70] from the precursor. Transcript/16971 and transcript/16742 expressed highest in leaf and fruit, respectively. With some unknown enzymes, melianone was further modi ed into compound with furan ring and C7-OH. The insecticidal C7-hydroxylased compound [71] was esterized by transcript/18100 (higher-expressed in fruit and leaf) and further formed compounds like nimbin or nimbolide after some modi cations. However, reactions between nimbin or nimbolide and azadirachtin A were still unclear.
After the bioinformatic analysis of these DEGs, two transcripts (transcript/18833 and transcript/19291) encoding ADH, twelve transcripts encoding CYP450, two ACT transcripts (transcript/17792 and transcript/18214) and ve transcripts encoding EST were selected as candidates in azadirachtn A downstream pathway. Some DEGs were removed from the library after phylogenetic analysis and domain prediction, and some non-encoding DEGs were also deleted. DEGs were also selected although they were not higher expressed in fruit and leaf, for example transcript/16198, transcript/16742 and transcript/16950. Transcript/16198 was annotated to encode sterol 14demethylase, which removes methyl group from C14 of sterol. Transcript/16742 encoded protein with 509 amino acids and it was 98% identical to MaCYP71BQ5( Figure S5). MaCYP71BQ5 is the CYP450 found to be involved in melianol formation. Researchers could only get the fragment of its homolog (AiCYP71BQ5) from neem while our transcript/16742 contained the complete ORF of AiCYP71BQ5.
As a kind of terpenoid, the increase of precursor leads to the higher production of terpenoid. In the case of artemisinic acid production, improvement of terpenoid precursor by engineering MVA pathway made 500 time increases in yield [72]. Thus, the up-regulated DEGs in MVA or MEP pathway and 2,3-oxidosqualene biosynthetic pathway would be used as building blocks in the construction of azadirachtin A precursor biosynthetic pathway in the future.
Although reaction types and key enzymes were partially proposed based on the structural differences between intermediates in our putative azadirachtin A pathway, some information was still missing. For instance, we couldn't nd the enzyme catalyzes the hydroxylation reaction at C7 site. What's more, the order of reactions during the azadirachtin A downstream was not clear. Neither the numbers of reactions nor the catalysis type were characterized. These limitations of pathway lead to insu cient mining on neem transcriptome data. This might be one of the reasons for slow progress in azadirachtin pathway exploration although numerous available neem genome and transcriptome data.

Conclusions
In conclusion, the multi-tissue transcriptome analysis revealed ve types of genes potentially involved in azadirachtin A downstream pathway and their respective transcription levels. It also indicated that the neem tree genome encodes a high number of terpene or limonoid biosynthetic genes. Finally, 22 unigenes encoding enzymes include OSC, ADH, CYP450, ACT and EST were selected as candidates involved in azadirachtin A downstream pathway. This is the rst report on hybrid-seq transcriptome pro ling analysis of A. indica. The obtained unigenes may provide a valid and diverse candidate pool for the study of the selective modi cation by the functional groups at triterpenoid or limonoids skeleton as well as the convergent evolution in secondary metabolism.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.
Additional Files Table S1. Concentration of metabolites in leaf, bark and seed extract of A. indica.        Q20 and Q30 on Illumina platform correspond to the predicted base call error rate of 1 % and 0.1%, respectively.
N50: The minimum contig length needed to cover 50% of the transcriptome. Figure 3 Phylogenetic tree of candidate OSC from neem transcriptome. Functionally characterized OSCs from other plant species are included, with the previously characterized tirucalla-7,24-dien-3β-ol synthases from A. thaliana (PEN3) (bold, marked with pink circle) and AiOSC1(bold, marked with green triangle) identi ed previously. Candidate OSC chosen for further analysis is boldly displayed and marked with purple star. The phylogenetic tree was constructed by MEGA V7 and formatted using iTOL.