Overview of the 67 cDNA libraries
NCBI Sequence Read Archive contains 67 M. sexta RNA-seq datasets (Fig. 1), representing different tissues and life stages of this insect: P for 33 paired-end (read length: 100 bp) and S for 19 single-end (51 bp)] samples sequenced as a part of the genome project [28], H for 8 head samples (single-end, 51 bp) to study sex-biased gene expression [14], and A for 7 antenna samples (single-end, 94 bp) to examine chemosensory receptor genes [27]. Names and descriptions of the libraries are shown in Fig. 1 and (Additional file 1: Table S1). The four groups of P (33), S (19), H (8), and A (7) libraries are analyzed and compared. While there is no biological replicate for libraries 1–52, four samples (i.e. G-L5-W, M-L5–12 h, M-L5-preW, M-L5-W) were analyzed by both single- and paired-end sequencing (Fig. 1). The numbers of total reads in these libraries vary greatly (Fig. 2a), ranging from 4.2 million in G-L5-preW-S (library 32) to 73 million in F-L5-preW (library 14). In general, there are many more reads in group P than in group S (Fig. 2c), with an average of 37 million versus 7.7. The variation of read numbers is the highest in group P and smallest in groups H and A, as these 15 libraries include biological replicates from the same sample types.
We generated more accurate Cufflinks4.0 gene models based on a newer version of the genome assembly Msex1.0 (with new IDs for scaffolds and three mitochondrial sequences) and quality-controlled reads from the 67 RNA-seq libraries. In the first step, reads with low quality bases were removed by Trimmomatic and reads longer than 50 were kept for further studies (e.g. unmapped read analysis) to reduce the chance of non-specific matching. Higher percentages of read survival were detected in group S (average: 96%) than in groups H (90%), P (88%), and A (82%) (Fig. 2b and d). Rates for libraries 66 and 67 (group A) were as low as 61%. We then mapped the remaining reads to the genome using TopHat and obtained Cufflinks4.0 gene models (Additional file 3 for coding transcripts; Additional file 4 for noncoding transcripts) [33]. Interested in unmapped reads, we also used STAR [34] to map the reads to the genome. Assisted by the GTF file generated from the Cufflinks program and running in the 2-pass mapping mode, STAR mapped nearly 10% more reads to the genome than TopHat did (Fig. 2e).
As anticipated, mRNA reads account for the largest portion of total reads in each library (Fig. 2f). The rRNA and mitochondrial gene reads are substantial in libraries 1–52, especially when their low gene counts are considered. This indicates that poly-A-minus RNA was not completely removed in these libraries (e.g. H-A-D7, G-L5–1~3 h–S). Higher percentages of rRNA reads were often accompanied by lower percentages of mitochondrial gene reads (Fig. 2f−h). In comparison, libraries 53–67 were mostly mRNA and noncoding RNA reads. The 33,378 genes in Cufflinks4.0 include 3 mitochondrial, 4 rRNA, 14,532 mRNA and 18,839 noncoding genes. Even though noncoding gene number is higher than coding genes in Cufflinks4.0, noncoding genes were generally shorter than coding genes [26] and their contribution to the total reads is only about 10% of that by coding genes in each library (Fig. 2f).
Genome transcription
Genes in disparate parts of the genome are differentially transcribed in various tissues or life stages and their RNA products can be detected by RNA-seq technology. Based on the Cufflinks4.0 models, 51.7% of the M. sexta genome contains genes, 17.1% is transcribed to exons of genes, and 5.3% is protein-coding regions. The percentages are 41.6%, 9.2% and 5.1% based on OGS2.0 models for M. sexta, 65.4%, 24.9% and 15.9% for D. melanogaster, and 52.1%, 4.1% and 1.2% for human. The percentage of CDS for D. melanogaster is much higher than M. sexta, but the total DNA length is 22.8 million, similar to 22.3 million of M. sexta (Cufflinks4.0). Interestingly, the transcribed portion goes up to 63.9% based on the mapped reads which is much higher than 17.1% by Cufflinks4.0, and the ratios differ strikingly in different libraries, ranging from 1.5% in library 45 (M-L5-preW-S) to 23.2% in library 49 (T-P-D15~18) (Additional file 1: Table S1). While temporospatial expression certainly affects the mapped portion of genome, more RNA-seq bases generally lead to higher genome coverage (Fig. 3a). The linear regression analyses were performed for both single (S) and paired (P) end libraries to identify significant outliers. Libraries below the lines have relatively lower ratios or more highly transcribed bases; libraries above the lines have higher than usual genome coverage or are less biased toward using various parts of the genome. The testis libraries (e.g. library 49) are much higher above the line, consistent with the observation that they have fewer genes expressed at high levels [28]. In general, the mapped portion in library group P is higher than in group S, partly due to the higher number of aligned bases (Fig. 3b).
Portions of mapped genome vary dramatically across the libraries. In addition to total numbers of aligned bases, their distribution across the genome may also contribute to the variations. To test this hypothesis, we first obtained sequencing depths for each base of the genome in all libraries. As rRNA and mitochondrial reads were overrepresented in some libraries (see above), we removed the scaffolds shorter than 200 kilobases before the analysis. We then normalized the sequence depth by using the BPKM values, sorted the bases according to their sequencing depth, divided them into 19 BPKM groups from high to low, and calculated average BPKM values and z-scores for each group for comparison across the 67 libraries (Fig. 3c). Figure 3d shows the ratio of aligned bases in each group. As expected, in libraries below the linear regression line in Fig. 3a generally, more RNA-seq bases were aligned to top transcribed base groups (Fig. 3d), and the average BPKM values for top groups were higher than other libraries (more red in top base groups in Fig. 3c). For instance, top base groups of libraries 14 and 15 have higher BPKM values than most other libraries (more red than for groups 1 to 6 in Fig. 3c). In fact, 12,800 top transcribed bases (0.004% of the studied scaffolds) in groups 1 to 6 contribute 64.8% and 49.9% of aligned RNA-seq bases, respectively (Fig. 3d). For libraries 5 and 16, which have similar number of aligned bases (0.5 billion) but very different percentages of mapped genome (2.5% and 7.7%), average BPKM value of library 5 is higher than library 16 (BPKM: 3.7 × 105 vs. 1.3 × 105) in highly transcribed groups 1–5 (ratio: 23.6% vs. 8.5% of the mapped bases) but lower (BPKM: 3.8 vs. 6.4) in lowly transcribed groups 10–19 (ratio: 39.9% vs. 68.5%). In the midgut libraries 25–37, a clear change of high z score in BPKM ranking is observed from larva to adult (Fig. 3c). In testis and ovary libraries 48–52, base groups 1 to 10 have lower than average BPKM values (green color in Fig. 3c), while groups 13 to 19 have higher BPKM values (red color in Fig. 3c), which is consistent with their higher positions in Fig. 3a. There are large variations in the ratio of each highly-transcribed groups: Groups 1 to 4 are top 3200 transcribed bases, which may come from a few genes, as the average length of transcripts in OGS2.0 is about 2000 bp. In other words, such genes may contribute to 20–40% of the total mRNA bases. Groups 1–12 cover top 819,200 transcribed bases, which may represent 400 genes or 2.6% of the total OGS2.0 set and, on average, they account for >63% of the aligned bases. Groups 13–16 have contributed another 32% or about 6000 genes. These results confirm the large variation in expression levels, with a few highly expressed genes contributing to a major part of the sequenced RNAs [28]. Additionally, the observed variations among the libraries suggest that lists of highly expressed genes are quite different and worth exploring.
Unmapped reads
During the initial gene modeling by the standard protocol of Cufflinks [33], we noticed that percentages of mapped reads were only around 60% (data not shown). Curious about ratio and composition of the unmapped reads, we controlled the read quality using Trimmomatic, mapped the reads using TopHat and then STAR in the 2-pass mode [34] with the splicing site information from Cufflinks, and improved the mapping ratio to 92.6% on average. For unknown reasons, library 11 still had a low ratio of 69% (Fig. 4a). Since unmapped reads may come from un-sequenced regions of the genome, symbionts, viruses or other sources, we searched NCBI nr/nt database (M. sexta OGS2.0 sequences not yet included) with the STAR-unmapped reads using BLASTN. Sequences with more than 10 reads matched were summarized in (Additional file 5: Table S2). There was a positive correlation between ratios of STAR-unmapped and BLASTN matching. Paired-end libraries had higher ratios in both than single-end ones, with the exceptions of libraries 48–51. Reads without BLASTN match, accounting for 22% of the total unmapped reads, might be low complexity ones (e.g. AT-rich sequences) or from not sequenced parts of the genome.
We divided the BLASTN hits to seven groups after initial analysis (Fig. 4b). rRNA reads accounted for 80.2% of the total (Additional file 2: Table S3), indicating their genes were incomplete in the OGS2.0. Correlating with the higher mapped ratios (Fig. 2g), library group P had higher ratio of unmapped rRNA reads (Fig. 4c), except for library 19. Libraries 19, 66 and 67 contained phage φ174 reads accounting for 62.6, 79.6, and 80.5% of the total unmapped reads – the phage DNA was used as internal positive control in DNA sequencing. While percentages for the phage reads varied a lot in the libraries, the average was 6.9% (Table S3). The mitochondrial reads (1.8% of the total) were mapped to a more complete version of M. sexta mitochondrion in NCBI. The Oryza reads (1.1%), linked with the midgut libraries (26, 28–33) of feeding larvae, probably represented plant RNA in the artificial diet; E. coli reads (1.7%), correlated with the midgut and certain head libraries (26, 28–33, 53–60), may represent microbiota of the midgut and foregut, as part of the head. The other head libraries (1–11) apparently were less contaminated by foregut tissues. Likely due to allelic variations, 2.8% of the total were mapped to the M. sexta sequences previously deposited at NCBI, including highly expressed lysozyme, apolipophorin and other genes. Other reads (5.5%) matched sequences of lepidopteran insects and other bacteria. No viral sequence was detected in this lab strain which had been established for a long period of time [28]. Insects caught in the field are often infected by viruses that are detectable by RNA-seq analyses [42].
Comparisons of different libraries and their gene expression
FPKM values and gene names of OGS2.0 are enlisted in (Additional file 6: Table S4). So are FPKM values of Cufflinks4.0 genes in (Additional file 7: Table S5) and MCOT1.0 genes in (Additional file 8: Table S6). Based on the definition of library-correlated genes (FPKM >1, z-score > 1.5) [39], 15,289 of the 15,543 OGS2.0 genes are correlated with at least one library, and the rest are expressed at low levels (FPKM <1). The correlated genes range from 200 to 3000 in each library. Early embryo, pupa, and adult generally had more correlated genes, but lower than testis (libraries 48–50) (Fig. 5b). Percentages of the highly expressed ones (FPKM >100) vary to some extent.
We compared the 67 libraries based on mapping scores using the same strategy described by Li et al. [39]. Digitized scores in the color gradient clearly showed the interlibrary relationships (Fig. 5a). As anticipated, mapping scores close to the diagonal line were much higher, indicating that the libraries with closer developmental stages from the same tissue type are more similar to each other. Some square-shaped regions of different sizes along the diagonal line overlap each other (e.g. midgut libraries 27–32, 32–35 and 34–37); others do not (e.g. head libraries 1–3 and 4–5). Gene expression (e.g. digestive enzymes) in midgut of the feeding larvae, wandering larvae, and pupal − adult stage changed progressively, with each library more similar to its neighboring libraries or developmental stages [23, 28]. Libraries from the same tissue/organ/body part but distant in life stage share fewer correlated genes and behave independently in the heatmap. Conversely, those with close developmental stages share more correlated genes (e.g. libraries 53–60 vs. 9–10, head, young adults). For antenna libraries 61–67, larvae and adults are somewhat similar, female adults (64–66) resemble female head (53–56) and, surprisingly, the male antenna library 67 is similar to female head (53–56). The whole body libraries 20–24 show higher similarity with libraries of midgut from feeding larvae, larval head and muscle. The ovary libraries 51–52 (O-P-D15~18 and O-A-D1) are highly similar to fat body libraries 17 (F-P-D15~18) and 19 (F-A-D7~9). Surprisingly, the fat body library 18 (F-A-D1~3) is most similar to the Malpighian tubule libraries 39 (MT-A-D1) and 40 (MT-A-D3). Sequencing methods seem to affect library similarity (i.e. mapping scores) for unclear reasons. Single-end muscle libraries 43, 45 and 47, but not the corresponding paired-end ones (42, 44 and 46), are highly similar to single-end head libraries 5–7.
Highly expressed genes in the libraries
Genes with high FPKM values contribute a major portion of the total reads in all libraries. For the current study, we selected top 3 expressed genes in each library, removed the redundant ones, and examined expression patterns of the remaining 69 (Fig. 6). Among them, housekeeping genes (e.g. ribosomal proteins) are highly expressed in nearly all libraries. Expression of odorant-binding/chemosensory proteins are high in head and antenna. Some cuticle proteins are abundantly made by epithelial cells in the head and whole body samples.
Very high read numbers of serine protease-102 in libraries 17 and 51 (F-P-D15~18 and O-P-D15~18) suggest its role in tissue remodeling during metamorphosis in the late pupal stage (Fig. 6). Other tissue-specific proteins include titin in young adults (MT-A-D1 and MT-A-D3), histones H2B and H4 in embryo (W-E-3 h-S and W-E-Late-S), circadian clock-controlled gene products in head (H-L5-preW-S for OGS2.0 genes Msex2.07414 and Msex2.14769); adult head libraries 9–11 and 53–60 for gene Msex2.04015), lysozyme in fat body and midgut, and diapausin-4 and -13 (Msex2.13832 and Msex2.13057) in head, fat body, and other tissues. Insect development is controlled by diverse clocks, including the circadian clock [43]. The fact that circadian clock-controlled genes (Msex2.07414 and Msex2.14769) are expressed in very high levels in head of pre-wandering larvae suggests vital roles for their proteins in this and the wandering stage. The founding member of the diapausin protein family in the leaf beetle was named diapausin on the basis of stage-specific synthesis [44]. A group of 14 diapausins was later identified in the M. sexta genome, with diapausin-1 shown to be an antifungal peptide [25, 45]. While the high mRNA levels for diapausin-4 and -13 in fat body may contribute to the antifungal activity of hemolymph, it is interesting to note that their expression levels were even higher in head.
Library-correlated expression of genes
In theory, FPKM values are proportional to mRNA levels inside cells, especially for genes with high FPKM values. To acquire an overview of library-correlated gene expression, we prepared a heatmap of z-scores for genes with at least one FPKM value >100 and divided these 6108 genes into 18 clusters based on their expression patterns (Fig. 7). The 896 genes in cluster 1 are (more) highly expressed in the testis libraries, some highly expressed in all three and others either in T-P-D3 or in T-P-D15~18 and T-A-D1~3. Cluster 3 genes are (more) highly expressed in adult ovary (O-A-D1); cluster 7 in 3 h embryo; cluster 9, including many digestive enzyme genes, in larva midgut (data not shown); cluster 13 in pre-wandering head; clusters 16 and 17 in adult and larva antenna, respectively. Noncoding genes and MCOT specific genes show similar expression patterns (Additional file 2: Figure S1). The original z-scores for Fig. 7 and Additional file 2: Figure S1 can be found in (Additional file 9: Tables S7−S9).
To describe general features of the genes in different clusters, we performed a GO enrichment analysis (Fig. 7). As expected, the enriched GO terms were well correlated with the expression pattern of gene clusters. For example, the most significantly enriched terms of cluster 1 include microtubule binding, microtubule motor activity, and protein kinase activity under molecular function (Fig. 7). These terms suggest that meiosis and sperm generation in testis heavily use kinase cascades and microtubule-binding for chromosome movement. The upregulation of microtubule-related proteins was also observed in D. melanogaster [46]. Cluster 3 genes are more specifically expressed in adult ovaries, and the enriched GO terms, including chorion and eggshell formation, were tightly linked with female reproduction [47]. Similar clusters of genes more specifically expressed in testes and ovaries were also identified in D. melanogaster [38]. Cluster 7 genes are highly expressed in early embryo, and the enriched terms, including DNA replication, transcription and methylation, were tied to early embryonic development, same as observed in D. melanogaster [48].
tRNA genes and codon usage
Different organisms have different codon preferences, which is associated with the age of genes and influenced by natural selection [49, 50]. Most codon frequency tables are calculated based on a limited number of protein sequences and might be inaccurate [51]. To complete this transcriptome study, we took advantage of the OGS2.0 and transcriptome data to examine a conceivable relationship between codon preferences and tRNA genes (Additional file 2: Table S10). For some codons, the predicted tRNA gene number is 0 (e.g. CTC), while its codon frequency in the genome and transcriptome is 15.2 and 17.7 per thousand, respectively. In other words, there is no direct correlation with tRNA gene numbers due to codon-anticodon wobbling. However, the tRNA gene number and codon frequency of different amino acids are highly correlated both in genome and transcriptome as shown in (Additional file 2: Figure S2). This suggests that the number of tRNAs might be regulated at the amino acid level instead of codon level. We also found the codon usages in genome and transcriptome generally agree. By calculating codon preferences in different libraries based on the RNA-seq data, we realized that codon usage is mainly influenced by highly expressed genes and there are no obvious global changes in codon preference.