RNA-Seq improves annotation of protein-coding genes in the cucumber genome
© Li et al; licensee BioMed Central Ltd. 2011
Received: 10 March 2011
Accepted: 2 November 2011
Published: 2 November 2011
Skip to main content
© Li et al; licensee BioMed Central Ltd. 2011
Received: 10 March 2011
Accepted: 2 November 2011
Published: 2 November 2011
As more and more genomes are sequenced, genome annotation becomes increasingly important in bridging the gap between sequence and biology. Gene prediction, which is at the center of genome annotation, usually integrates various resources to compute consensus gene structures. However, many newly sequenced genomes have limited resources for gene predictions. In an effort to create high-quality gene models of the cucumber genome (Cucumis sativus var. sativus), based on the EVidenceModeler gene prediction pipeline, we incorporated the massively parallel complementary DNA sequencing (RNA-Seq) reads of 10 cucumber tissues into EVidenceModeler. We applied the new pipeline to the reassembled cucumber genome and included a comparison between our predicted protein-coding gene sets and a published set.
The reassembled cucumber genome, annotated with RNA-Seq reads from 10 tissues, has 23, 248 identified protein-coding genes. Compared with the published prediction in 2009, approximately 8, 700 genes reveal structural modifications and 5, 285 genes only appear in the reassembled cucumber genome. All the related results, including genome sequence and annotations, are available at http://cmb.bnu.edu.cn/Cucumis_sativus_v20/.
We conclude that RNA-Seq greatly improves the accuracy of prediction of protein-coding genes in the reassembled cucumber genome. The comparison between the two gene sets also suggests that it is feasible to use RNA-Seq reads to annotate newly sequenced or less-studied genomes.
As new sequencing technologies develop, thousands of eukaryotic genomes across all kingdoms of life will be sequenced during the next decade [1, 2], and this trend will spark an improvement in our knowledge of evolutionary biology and functional genomics. Genome annotation is a stepping stone to bridge the gap between genomic sequences and the biology of organisms . It can be stated that the quality of genome annotations represents the value of genome sequences.
Gene prediction, within the process of genome annotation, is a complex endeavor. In eukaryotic species, it is usually carried out by integrating multiple sources of evidence , such as complementary DNA (cDNA), proteins in closely related species, and de novo predictions . Representing the integral sequences of messenger RNAs (mRNAs), full-length cDNAs (FL-cDNAs) are recognized as the gold-standards for discovering and annotating gene structures in eukaryotic genomes [5, 6]. Additionally, even incomplete cDNAs, i.e. expressed sequence tags (ESTs), provide more accurate evidence than other sources. Nevertheless, until recently, the sequencing of cDNA was a laborious and capital-intensive task.
Thanks to the massively parallel cDNA sequencing (RNA-Seq) technologies , scientists can obtain cDNA fragments from transcriptomes with reasonably complete coverage in a reduced time scale and at a lower cost . With its informative content, RNA-Seq is expected to revolutionize the prediction of genes . RNA-Seq has been used to improve the genome annotations, including: (i) correcting predicted gene structures ; (ii) detecting new alternative splicing isoforms ; and (iii) discovering new genes and new transcripts [12, 13]. However, most of these applications focused on species with well-annotated genomes, such as human, mouse, yeast, Arabidopsis thaliana, and rice. Among these studies, Trapnell, Williams and Pertea et al. and Guttman, Garver and Levin et al. correctly reconstructed full-length transcripts for most known expressed genes in specific mouse tissues [12, 13]; nevertheless, their procedures still need to be tested in other eukaryotic genomes, because of varied genome characteristics . For less-studied genomes, Denoeud, Aury and Da Silva et al. used the short RNA-Seq reads to build thousands of gene models for the grape genome ; however, fewer genes were predicted than in the public annotation .
Although far from perfect, the considerable potential demonstrated in these studies for the applicability of RNA-Seq in gene predictions encourages us to update the original gene prediction of the cucumber genome (Cucumis sativus var. sativus line 9930), which was annotated and published in 2009 . Therefore, based on EVidenceModeler (EVM) , we built a genome annotation pipeline in which we incorporated analyses of Solexa/Illumina RNA-Seq reads. In an attempt to provide a high-quality gene set for the scientific community and for further study, we reassembled and reannotated the cucumber genome. We subsequently compared the two versions of the gene predictions to evaluate any improvements brought about by RNA-Seq. The comparison presented here supports the hypothesis that RNA-Seq has a positive impact on gene prediction of the cucumber genome.
Using the improved SOAPdenovo program  (Release 1.04), we reassembled the cucumber genome by integrating additional large insert paired-end Illumina GA reads from Cucumis sativus var. sativus (7.4-fold genome coverage, 5 Kb insert size) and from Cucumis. sativus var. hardwickii (3.8-fold, 5 Kb insert size; 3.2-fold, 10 Kb insert size; see Additional file 1, Table S1 for details). The final assembly (assemVer 2.0) spans 197 Mb and contains 12, 845 scaffolds (see Additional file 1, Table S2 for details). This is approximately 46 Mb less than the previous assembly (assemVer 1.0) and this difference mostly represents redundant repetitive sequences and contaminating sequences. The N50 and N90 contig sizes of assemVer 2.0 are 37.9 Kb and 8.9 Kb, respectively, and 90% of the assembly falls into 153 scaffolds larger than 281 Kb. Compared to assemVer 1.0, assemVer 2.0 is more contiguous, thus facilitating genome annotation.
Number of de novo assemblies and "align-then-assembled" transcripts.
# RNA-Seq reads
# De novo assembled transcripts
# "Align-then-assembled" transcripts
19, 247, 768
17, 656, 392 (91.7%)
18, 466, 067
17, 047, 763 (92.3%)
19, 111, 746
17, 394, 685 (91.0%)
18, 732, 466
17, 162, 238 (91.6%)
24, 535, 215
22, 789, 659 (92.9%)
26, 400, 675
24, 405, 569 (92.4%)
26, 050, 858
24, 531, 662 (94.2%)
23, 818, 868
21, 886, 487 (91.9%)
22, 472, 146
20, 585, 234 (91.6%)
Base part of tendril
21, 653, 855
19, 556, 866 (90.3%)
Two different approaches, de novo assembly and 'align-then-assemble' , were used to reconstruct transcripts from these RNA-Seq reads. The de novo assembly was carried out by Inchworm, a de novo assembler of RNA-Seq in Trinity , which reconstructed 802, 216 de novo contigs from the 10 tissues (Table 1). We applied CD-HIT  to remove some de novo contigs, such as assembled artifacts with low-coverage or redundancies from different tissues. Finally, 258, 876 de novo contigs assembled by RNA-Seq reads remained for gene prediction. In the 'align-then-assemble' approach, we mapped and generated spliced alignments of the RNA-Seq reads from each tissue to the reassembled cucumber genome using Bowtie  and TopHat  (Table 1; Additional file 1, Table S3 for mapping details of reads). Cufflinks  was then used to reconstruct 220, 590 transcripts belonging to 59, 481 transcriptional units from the alignments of 10 tissues. However, a complete open reading frame (ORF) could be found in only 9, 964 (4.5%) transcripts reconstructed by Cufflinks using getorf in EMBOSS .
Summary statistics and annotation comparison of cucumber genome.
243, 568, 484
197, 271, 687
197, 271, 687
annotVer 1.0 (mapped)
Number of Genes
Number of Genes on Plus Strand
Number of Genes on Minus Strand
Mean Gene Length (bp)
Gene density (Kb/gene)
Number of Transcripts
Percent of Transcripts with Introns
Mean Transcript Length (bp)
Mean CDS Length
annotVer 1.0 (mapped)
Mean Number per Transcript
Mean Length (bp)
Total Length (bp)
27, 991, 662
22, 988, 520
36, 686, 879
annotVer 1.0 (mapped)
Mean Number per Transcript
Mean Length (bp)
Total Length (bp)
43, 647, 564
39, 074, 873
48, 152, 435
annotVer 1.0 (mapped)
Number of Genes Having UTRs
Mean UTR Length (bp)
Number of 5' UTRs
Mean 5' UTR Length (bp)
Number of 3' UTRs
Mean 3' UTR Length (bp)
Compared with the published annotation of the cucumber genome (labeled as annotVer 1.0, Table 2), annotVer 2.0 contains 3, 434 fewer protein-coding genes, mostly because of the reduced size of the reassembly and the removal of some contaminating bacterial segments implied by about 2, 000 bacterial genes in annotVer 1.0. Consistent with the reduction of gene number in annotVer 2.0, there is an increase in the number of multi-exon genes, which indicates an improvement of the protein-coding prediction to some extent, because the prediction of single-exon genes is still unreliable in eukaryotic genomes.
Two other improvements resulting from the incorporation of RNA-Seq are the prediction of untranslated regions (UTRs) and alternative splicing isoforms. Of the 23, 248 protein-coding genes in annotVer 2.0, 18, 690 genes have UTRs and 1, 935 genes appear to have alternative splicing isoforms. In general, incorporating RNA-Seq reads offers overwhelming evidence for the prediction of these two features. The prediction of UTRs was uncertain before the appearance of RNA-Seq, because of the incompleteness of ESTs and the difficulty of collecting bona fide FL-cDNAs. Furthermore, because they are not well conserved across species, comparative predictive techniques are not suited to UTR detection. However, the use of high-throughput RNA-Seq from the same species naturally removes both of these difficulties. Similarly, RNA-Seq provides evidence pointing to the potential for alternative splicing, though it is still quite difficult to determine full-length isoforms from these short reads. With the help of de novo assemblies and PASA assemblies , 2, 352 full-length isoforms of 1, 935 genes were identified. RNA-Seq provides an opportunity to comprehensively study alternative splicing events in cucumber, as in other species.
For the 18, 580 multi-exon genes in annotVer 2.0, we inspected different sources of evidence for them, and the results suggested that most of the multi-exon genes were supported by reliable evidence, such as transcript evidence or protein evidence . In fact, there are three sources of evidence in our pipeline: transcript evidence from RNA-Seq or ESTs, proteins from related species, and predictions from de novo predictors (see Methods), which actually provided introns in the final gene structures . We did not include the predictions of Augustus  and Geneid  in this analysis because the two predictors had used RNA-Seq information and homologous proteins, respectively.
De novo predictions are also necessary for gene predictions. Although RNA-Seq has a high coverage, de novo predictions actually support more multi-exon genes than do transcripts (Figure 1.A), because of the large number of genes generated by the three de novo predictors, for example, GeneMark.hmm-ES predicted more than 40, 000 genes. Furthermore, about one tenth of the multi-exon genes (2, 003/10.8%) are supported only by de novo predictions, which indicates that de novo predictions are indispensable to the completeness of the final gene sets.
As transcript evidence plays a considerable role in the multi-exon gene prediction, we examined the contributions of RNA-Seq to the final gene set, especially when compared with ESTs. Figure 1.C shows that most of the ESTs are covered by RNA-Seq and none of the multi-exon genes are supported only by ESTs. Despite the fact that one gene in Figure 1.D is full-length supported by one EST, we maintain that RNA-Seq could replace ESTs in the process of protein-coding gene prediction.
Evidential support for multi-exon genes suggests that RNA-Seq has an innate capability for high coverage in protein-coding gene predictions. Transcript evidence is taken as the most valuable evidence in protein-coding gene prediction, as it often identifies exact intronic boundaries . RNA-Seq, among all the transcript evidence that affects gene prediction, is the one that could increase the number of genes supported by transcript evidence and improve the structural predictions. RNA-Seq could also be used in place of ESTs as the major transcript evidence, which liberates scientists from time-consuming work in traditional cDNA sequencing projects. Although RNA-Seq still could not replace the role of FL-cDNA in gene discovery, sophisticated methods of transcript reconstruction through RNA-Seq in the near future may help us to reconstruct more full-length transcripts.
The second analysis focused on genes at the same locus but with different structures (see Additional file 1, Figure S3 for example). There are 5, 824 pairs of genes, each of which was composed of an annotVer 1.0 and annotVer 2.0 gene that only map to each other, but are structurally different. We then launched a Pfam domain search by InterProScan  and performed the global pairwise alignments by stretcher in EMBOSS  on each pair (see Methods). The search of InterProScan found 1, 817 different kinds of Pfam domains in 4, 297 (73.8%) genes in annotVer 1.0, whereas, 1, 861 different Pfam domains were found in 4, 399 (75.5%) genes in annotVer 2.0. In the same way, when identity, similarity, score, and gaps in alignments are compared, global pairwise alignments also suggests that genes in annotVer 2.0 are more optimal than genes in annotVer 1.0 (Figure 3.B).
In the third analysis, the presence of non-overlapped locus protein-coding genes implies specific genes in different predicted gene sets. To measure the reliability of version specific genes in the two sets, we compared the percentages of BLASTP hits to UniProt plant proteins and multi-exon genes between annotVer 1.0 and annotVer 2.0. The BLASTP results indicated that the prediction of annotVer 2.0 produces more genes and a higher percentage of hits (3, 134, 59.3%) to UniProt than annotVer 1.0 (684, 26.4%). Meanwhile, the specific genes in annotVer 2.0 contain a significantly higher percentage of multi-exon transcripts (4, 385, 77.8%) than those in annotVer 1.0 (1, 216, 46.9%).
In all four analyses, annotVer 2.0 shows a better performance than annotVer 1.0. Although the processes of generation of the two gene sets are different (Glean for annotVer 1.0 ), they are comparable, because the principles of evidence combination are nearly the same. In fact, annotVer 2.0 used less protein evidence and homolog EST evidence than did annotVer 1.0 ; however, using RNA-Seq compensates for this minor deficiency and actually obtains a better gene set. Thus, adopting the RNA-Seq technique proved to be vital to the quality of protein-coding gene prediction in the reassembled cucumber genome.
A genome project requires continual refinement, even after the publication of its genome sequence. Some problems, such as bacterial DNA contamination during genome sequencing and redundancy of repetitive DNA sequences, were found in the first assembly of the cucumber genome; therefore, we reassembled the cucumber genome. After RNA-Seq evidence of transcription was generated, we improved the prediction of protein-coding genes in the reassembled cucumber genome, based upon the RNA-Seq reads. In the new assembly, about 8, 700 protein-coding gene structures are modified and about 5, 200 genes are newly predicted. Based upon the comparison of the gene sets of the two versions, we conclude that the considerable improvement in protein-coding gene prediction is largely due to the use of the RNA-Seq technique. We also suggest that, for newly sequenced or less-studied eukaryotic genomes, RNA-Seq is a good choice for providing evidence for prediction of protein-coding genes, as it reduces the necessity for EST sequencing and increases the utility of each round of genome annotation.
To link more contigs, we sequenced additional long insert sized (5 kb) paired-end Illumina GAII reads of Cucumis sativus var. sativus, representing approximately 7.4-fold genome coverage.
We first assembled paired-end short reads with short insert sizes (insert size < 1 kb) into contigs. To increase the assembly accuracy, only high quality reads were considered. These contigs were further linked into scaffolds by paired-end relationships (300-550 bp insert size), mate-pair reads (2-10 kb), fosmid ends (~40 kb), and BAC ends (~100 kb). We then filled gaps in all the reads generated by both Illumina GAII and Sanger methods.
During the process of linking contigs to scaffolds, paired-end reads with long insert sizes (approximately 3.8-fold genome coverage, 5 Kb insert size; approximately 3.2-fold, 10 Kb insert size) from wild cucumber (Cucumis sativus var. hardwickii) were also used.
Cucumis sativus var. sativus line 9930 was used in all experiments. A total of 10 tissues were collected: root, stem, leaf, male flower, female flower, ovary, expanded ovary under fertilization (7 days after flowering), expanded ovary not fertilized (7 days after flowering), base part of tendril, and tendril. In accordance with the manufacturer's instructions, total RNA was isolated with TRIzol (Invitrogen, Carlsbad, CA, USA) from each sample. Samples were treated with RNase-free DNase I for 30 minutes at 37°C (New England BioLabs, Ipswich, MA, USA) to remove residual DNA. The OligoTex mRNA mini kit (QIAGEN, Hilden, Germany) was used to isolate poly(A) mRNA from the total RNA samples. The first cDNA strand was synthesized using random hexamer primers and reverse transcriptase (Invitrogen). The second strand cDNA was synthesized using RNase H (Invitrogen) and DNA polymerase I (New England BioLabs). The sequencing library was constructed following the manufacturer's instructions (Illumina, San Diego, CA, USA). Fragments of approximately 200 bp were excised and enriched by 18 cycles of PCR. The fragments were loaded onto flow cell channels at a concentration of 2 pM to generate paired-end reads with lengths of 75 bp. The Illumina GA processing pipeline v0.2.2.6 was used for image analysis and base calling. The data is obtainable with the accession number SRA046916 in the Sequence Read Archive (SRA) at NCBI.
De novo assembly was carried out by Inchworm , which utilizes the Kmer graph method to assemble Illumina RNA-Seq reads. Although it prefers strand-specific RNA-Seq reads, Inchworm can also deal with the non-strand-specific RNA-Seq reads generated from the RNA-Seq experiments. Low-coverage artifacts or redundancies from different tissues were removed by CD-HIT , with an identity threshold of 95%.
In the 'align-then-assemble' approach, we firstly mapped the RNA-Seq reads from each tissue to the reassembled cucumber genome using Bowtie  and the spliced aligner TopHat . Cufflinks  assembled the results of TopHat into transcript assemblies, followed by the integration of transcript assemblies from different tissues. Transcripts that were shorter than 150 bp were deemed as false positives and removed before gene prediction. We used getorf in EMBOSS  to find ORFs in the transcripts. Only ORFs with start and stop codons were regarded as complete ORFs.
RepeatMasker masked the repeat elements in the newly assembled genome using a custom library. The custom library included: (i) Repbase ; (ii) TIGR plant repeat database ; and (iii) a cucumber de novo transposable element library built in-house. Three types of de novo software, PILER-DF , RepeatScout , and LTR_Finder  were used to predict species-specific transposable element sequences in the cucumber genome. PILER-DF and RepeatScout were used for the repeat sequences in cucumber assembly. Based on the cucumber assembly, full-length LTR retrotransposons were identified using LTR_Finder. We filtered repeat elements belonging to rRNA, satellites, and organellar sequences by BLASTN. Elements belonging to high-copy number genes were filtered by BLASTX searching of UniProt-SwissProt (release 2010_07). After removing redundant repeat elements by all-versus-all BLASTN and manual curation, the de novo TE library for cucumber was obtained.
We used spaln  and PASA  to align 90, 307 cucumber ESTs sequenced by Guo, Zheng and Joung et al , 260 cucumber FL-cDNAs downloaded from NCBI, and transcripts reconstructed by Inchworm . The result of 'align-then-assemble' procedure was also directly used as transcript evidence. PASA strictly aligns EST or cDNA sequences to the genome and assembles the aligned sequences into transcripts called 'PASA assemblies'. ORFs are found from these PASA assemblies as a training set. We selected genes with complete structures and removed some redundant genes with 70% identity at the amino acid level by CD-HIT .
Five de novo gene predictors were used on the masked genome. GlimerHMM , SNAP , and Augustus  were trained with the training set generated by PASA; Geneid  used the parameter of Cucumis spp.; and GeneMark.hmm-ES  only used unmasked genomic data and was self-trained.
The dataset used for protein homology alignment included: (i) UniProt-SwissProt plant proteins (release 2010_07); (ii) Arabidopsis thaliana proteins (TAIR9, Augustus 2009 release); and (iii) Oryza sativa proteins (TIGR Release 5.0, January 2007 release). We used spaln , TBLASTN  and BLAT  to search for nucleotide homology in the cucumber genome. Scipio  made use of the BLAT result to identify intron-exon boundaries. Proteins with the highest score in TBLASTN were processed by BLAST2GENE  to predict gene structures by GeneWise .
EVM, which is an effective automated annotation combiner , computed the gene structures for the reassembled genome of cucumber as a weighted consensus of all available evidence obtained above. The gene models generated by EVM were updated by PASA with ESTs and de novo assembled transcripts. This process modified exons or gene models, added UTRs, and found alternatively spliced isoforms. Finally, we removed genes encoding proteins with less than 50 amino acids and incomplete genes without start and stop codons. Gene models and the different evidence were visualized by GBrowse .
Three non-coding RNA gene predictors were used independently to identify different types of non-coding RNA genes in the cucumber genome. tRNA-SE  was used to identify tRNA genes. Snoscan  was used to identify C/D-box small nucleolar RNAs. INFERNAL  searches against the Rfam  database identified miRNAs, small nuclear RNAs, and H/ACA-box small nucleolar RNAs.
In annotVer 1.0, 26, 882 CDSs were aligned to the reassembled cucumber genome by spaln  and gene structures that have less than 50 amino acids or without start and/or stop codons were removed.
During the comparison, only the coding regions were considered, because the UTRs had more changes between the two versions. Genes with at least one base pair overlapping the coding region were assumed to occupy the same gene locus. If genes occupying the same gene locus had different structures in all alternative spliced isoforms, they were viewed as genes with different structures. We filtered genes with alternative spliced isoforms to simplify further analyses.
When genes mapped to the same locus but with different numbers in the two gene sets (i.e. genes that were merged/split into one or more genes in the other version), we grouped each locus as a group. We then used BLASTP  to search each group against UniProt plant proteins (release 2010_07). A group was treated as false positive when no hits were found in UniProt. If a merged gene in annotVer 2.0 and more than two of its counterparts in annotVer 1.0 had the same best hit, the merged event was regarded to be optimal. On the other hand, if the split genes in annotVer 2.0 had more than one best hit, the split structures were considered to be better than the merged structure in annotVer 1.0. In exceptional cases, where only one of the split genes has a best hit, we were confused as to which was better, because there are two conditions we have to consider. If the best hit was longer than the aligned split gene without connection to other split ones, the merged gene would not seem to be better than the split ones. On the other hand, if the best hit was as long as the aligned split gene without other hits to the remaining split ones, then the split genes would also not seem to be better than the merged one.
If different structural genes in the two versions were mapped to the same locus, methods developed by Lorenzi, Puiu and Miller et al.  were modified and used to describe the structural changes. First, we searched the Pfam domain for each pair of genes with different structures by InterProScan . Then, we used the proteins of each pair to search UniProt plant proteins (release 2010_07). The proteins in the pairs with the same best hit were aligned to the matching proteins in UniProt by stretcher in EMBOSS . Gene structures with higher identity, similarity, score, and fewer gaps were considered as better structures.
Comparison of non-overlapped locus protein-coding genes in the two versions was carried out by BLASTP  searches against UniProt plant proteins (release 2010_07), with an E value threshold of 10-5. The percentage of multi-exon genes was also used as an index to evaluate the gene set for the inaccuracy of single-exon gene prediction.
A dataset of 33 cucumber WRKY genes and 35 assembled WRKY gene cDNAs generated in a previous experimental study  were aligned to the reassembled cucumber genome by spaln , followed by manual checking of the differences between the alignment results and gene predictions in annotVer 1.0 and annotVer 2.0.
We thank three anonymous reviewers for their invaluable comments and suggestions. This work was supported by the Chinese Ministry of Agriculture (the 948 program) to SH, the Ministry of Science and Technology (2010AA10A108) to SH, the National Natural Science Foundation of China (30972011) to ZZ, and the Beijing Municipal Commission of Education (YB20101002702) to ZL, PY, and KL.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.