cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome

Abstract

Background

The completion of the Plasmodium falciparum genome represents a milestone in malaria research. The genome sequence allows for the development of genome-wide approaches such as microarray and proteomics that will greatly facilitate our understanding of the parasite biology and accelerate new drug and vaccine development. Designing and applying these genome-wide assays, however, requires accurate information on gene prediction and genome annotation. Unfortunately, the genes in the parasite genome databases were mostly identified using computer software that could make some erroneous predictions.

Results

We aimed to obtain cDNA sequences to examine the accuracy of gene prediction in silico. We constructed cDNA libraries from mixed blood stages of P. falciparum parasite using the SMART cDNA library construction technique and generated 17332 high-quality expressed sequence tags (EST), including 2198 from primer-walking experiments. Assembly of our sequence tags produced 2548 contigs and 2671 singletons versus 5220 contigs and 5910 singletons when our EST were assembled with EST in public databases. Comparison of all the assembled EST/contigs with predicted CDS and genomic sequences in the PlasmoDB database identified 356 genes with predicted coding sequences fully covered by EST, including 85 genes (23.6%) with introns incorrectly predicted. Careful automatic software and manual alignments found an additional 308 genes that have introns different from those predicted, with 152 new introns discovered and 182 introns with sizes or locations different from those predicted. Alternatively spliced and antisense transcripts were also detected. Matching cDNA to predicted genes also revealed silent chromosomal regions, mostly at subtelomere regions.

Conclusion

Our data indicated that approximately 24% of the genes in the current databases were predicted incorrectly, although some of these inaccuracies could represent alternatively spliced transcripts, and that more genes than currently predicted have one or more additional introns. It is therefore necessary to annotate the parasite genome with experimental data, although obtaining complete cDNA sequences from this parasite will be a formidable task due to the high AT nature of the genome. This study provides valuable information for genome annotation that will be critical for functional analyses.

Background

Malaria parasites infect and kill millions of people in the tropics each year [1, 2]. Efforts to develop vaccines have so far failed to produce any effective vaccine. Additionally, drug-resistant parasites are spreading quickly, particularly parasites resistant to chloroquine, leading to a recent resurgence of malaria in many developing countries [3, 4].

To facilitate our understanding of parasite molecular biology and development of drugs and vaccines, the genome of the malignant human malaria parasite Plasmodium falciparum was sequenced and published in 2002 [5]. The genome sequence provides a basis for various genome-wide approaches such as microarray and proteomic analyses [6–9]. Unfortunately, the majority of the genes in the P. falciparum genome were predicted using computer software, with ~60% of the predicted genes encoding hypothetical proteins [5]. Although software 'trained' with well characterized genes and improved strategies have provided relatively accurate gene prediction [10, 11], the accuracy of gene prediction of this organism is unknown. It is therefore necessary to verify the predictions with complementary DNA (cDNA) sequences, particularly for eukaryotic organisms that have introns in their genes. Indeed, full-length cDNA clones from many species from Drosophila to human have been collected and characterized [12–16], providing important information for verification of genes in a genome and for studying gene functions. Recently, when a high-density array was used to survey transcribed exons, up to 30% of the detected transcripts were found to be unannotated even in the well characterized Drosophila genome [17].

P. falciparum has a unique genome with a very high AT content (~82% AT) [5] that presents various difficulties for studying gene structure and gene function. The extremely high AT content in non-coding regions (up to 99%) is often an obstacle to obtaining sequences from introns, 5' and 3' untranslated regions (UTR), and intergenic sequences. P. falciparum DNA is often unstable in bacteria, making it almost impossible to obtain full cDNA clones from genes larger than 5 kb for expression or other analyses. Approximately 50% of the genes in the P. falciparum genome were predicted to have introns flanked by the conserved eukaryotic GT-AG intron-exon splice sites [18, 19]. The parasite genome also has many large open reading frames (ORF) that likely encode large transcripts; however, introns embedded in the ORF cannot be ruled out [20]. The elements regulating gene expression such as promoters and polyA recognition sites seen in other eukaryotic cells may not function properly in this parasite due to the high AT content in noncoding regions [21].

Expressed sequence tags (EST) from malaria parasites, particularly P. falciparum, have been obtained previously [19, 22–27]. The first survey of P. falciparum EST produced 389 tags from 550 random cDNA clones [22]; and the number of EST was later increased to ~2,500 [23]. More recently, 2490 single random sequences were obtained from a library enriched for full-length cDNA [19], which were updated to 11424 sequences covering 1357 predicted genes [27]. cDNA sequences from the full-length cDNA clones (mostly sequences from 5' UTR) identified new genes and multiple transcript initiation sites in some genes, but it appeared that no efforts were made to obtain complete cDNA sequences from full-length cDNA clones. In this report, we constructed various cDNA libraries from mixed blood stages, including three cDNA libraries with different sized inserts enriched for full-length transcripts and sublibraries that contain smaller clones after digestion of the initial inserts with restriction enzymes. We also used synthetic oligonucleotides to extend sequences deep into coding regions. We obtained a total of 17332 clean EST. Comparison of our EST, the EST in public databases, the predicted coding sequences (CDS), and genomic DNA sequences identified 393 genes that may be incorrectly predicted.

Results and Discussion

cDNA libraries and DNA sequencing

Collection of EST from P. falciparum has been reported previously, and searches of public databases found 21305 P. falciparum EST in PlasmoDB [28, 29] and GenBank, contributed by various research groups [19, 23, 27] (Washington University, unpublished). The majority of EST collected previously were short sequences from single sequencing reads. To obtain longer cDNA sequences, we used two different approaches–primer-walking and construction of sublibraries of restriction enzyme-digested DNA clones–to extend sequence reads into the cloned DNA. Three different libraries, each with three sublibraries of different insert sizes, were constructed using polymerase chain reaction (PCR) products after 11 cycles of amplification (Additional file 1A and 1B). The first library contained cDNA clones directly from 5'-enriched cDNA inserts, which were divided into groups of large (> 3 kb), medium (1–3 kb), and small (< 1 kb) insert sizes (Additional file 1B). Unfortunately, we were not able to obtain sequences from either 5' or 3' ends of many clones from this library, probably due to polyA or polyT sequences in non-coding regions, suggesting that these clones may contain full coding sequences. We then constructed sublibraries with DNA inserts digested with restriction enzymes Bam HI or Sau3A before cloning into the vector (Additional file 1A).

Sequence trimming and contig assembly

A total of 28416 sequence runs–including 10656 from 'full-length' libraries, 10368 from Bam HI-restricted libraries, 7392 from Sau3A-digested libraries, and 4800 runs from primer walking–were performed. From the sequence runs, we obtained 17332 EST 100 base pairs (bp) or longer [GenBank EL492722-EL510074] after trimming and vector sequence cleaning (see Methods). Because of difficulty in obtaining sequences from AT-rich sequences in non-coding regions and sequences with polyA tails, most of the sequences were from digested libraries or from the 5' ends of the undigested libraries. The trimmed EST from our libraries were assembled into 2548 contigs and 2671 singletons with an average size of 473.4 bp and an average quality value of 64.7. When our EST were assembled with EST in public databases, we obtained 5220 contigs and 5910 singletons with an average size of 520 bp.

Genome-wide cDNA coverage

To determine patterns of genome-wide gene expression and locations of EST on chromosomes, we assembled our EST and the public EST with 5485 predicted CDS in PlasmoDB (version 5.2) and displayed them on the physical chromosomes (Figure 1). When assembled using CAP3 [30] (21 bp overlap and 85% identity), 3857 CDS were assembled with EST contigs. When the sequences were aligned using Blast and methods described previously [31], 3792 CDS were identified by the same EST with cutoff values of at least 100-bp long and 95% identity. The two methods produced almost identical numbers of hits on predicted CDS. This percentage of genes (~70% of total predicted genes) with EST coverage is a little higher than those detected using a 70mer oligonucleotide array (~60%) [6]. Among those EST matching CDS, approximately 42% (or ~1700 genes) were matched by EST from both our collection and those in public databases.
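As a rough illustration of the Blast-based cutoff described above (matches at least 100 bp long with at least 95% identity), the following Python sketch filters tabular hits and reports which CDS would count as covered. It assumes BLAST-style 12-column tabular output with EST contigs as queries and predicted CDS as subjects; the identifiers and values in `TOY_HITS` are invented for illustration and are not data from this study, and the function name `covered_cds` is hypothetical.

```python
# Toy hits in BLAST tabular layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
# All identifiers and numbers below are invented for illustration.
TOY_HITS = """\
contig_001\tCDS_A\t98.50\t412\t6\t0\t1\t412\t10\t421\t0.0\t700
contig_002\tCDS_A\t96.00\t80\t3\t0\t1\t80\t500\t579\t1e-30\t130
contig_003\tCDS_B\t91.20\t350\t30\t1\t1\t350\t1\t349\t1e-90\t400
contig_004\tCDS_C\t95.00\t100\t5\t0\t1\t100\t20\t119\t1e-40\t160
"""

TOTAL_CDS = 5485  # predicted CDS count quoted in the text


def covered_cds(tabular_text, min_length=100, min_identity=95.0):
    """Return the set of subject (CDS) IDs with at least one hit passing both cutoffs."""
    covered = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject = fields[1]
        identity = float(fields[2])
        length = int(fields[3])
        if length >= min_length and identity >= min_identity:
            covered.add(subject)
    return covered


print(sorted(covered_cds(TOY_HITS)))  # ['CDS_A', 'CDS_C']

# Coverage percentages implied by the counts reported in the text.
for label, n_covered in (("CAP3", 3857), ("Blast", 3792)):
    print(label, round(100 * n_covered / TOTAL_CDS, 1))  # 70.3 and 69.1
```

Both reported counts work out to roughly 70% of the 5485 predicted CDS, which is the figure quoted in the text.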

Figure 1

Diagram of the 14 P. falciparum chromosomes showing positions of potentially expressed genes. Expressed sequence tags (EST) from our libraries or from public databases were assembled against predicted coding sequences in PlasmoDB; genes that matched our EST only (green), EST already in public databases (red), or both (yellow) are displayed according to gene order on the chromosomes. Those in white are CDS that were not covered by any EST. Approximately 70% of the 5485 predicted CDS were matched with one or more EST.

Alignment of cDNA to predicted genes on physical chromosomes allowed us to identify chromosomal regions that are transcriptionally active or silent. Our results show that genes located at telomere or subtelomere regions of many chromosomes (for example, genes at ends of chromosomes 7 and 10) do not have matching cDNA or are largely silent (Figure 1). The chromosome ends of P. falciparum are highly variable, consisting of many multigene families such as rifin, stevor, and var [5]. Although the functions of the proteins encoded by rifin and stevor are still uncertain, the var gene family has been shown to encode variant proteins (PfEMP1) that can mediate parasite adhesion to receptors on host endothelial cells [32–34]. Different observations on the expression of the genes at chromosomal ends have been reported using microarray hybridization, with one reporting silent chromosome ends [6] and another suggesting expression of genes from chromosome ends [7]. Because microarray is based on probe-target hybridization, cross hybridization among probes from members of gene families could produce false-positive signals under some hybridization conditions. Our data are consistent with results showing that RNA transcripts from only a small subset of these genes could be detected in intraerythrocytic stages [6]. Additionally, there are regions in the middle of the chromosomes with genes that do not have cDNA coverage (Figure 1).

Full-length cDNA sequences and discovery of new introns

One of our goals was to collect complete cDNA clones and sequences from the P. falciparum genome. Unfortunately, we encountered difficulties in sequencing highly AT-rich regions, mostly 5' and 3' UTR, and obtained only 199 contigs that cover the entire ORF of 87 predicted genes, with predicted ORF sizes ranging from 126 to 2709 bp (Additional file 2). Among the 87 genes, 21 (~24%) were predicted incorrectly (or mismatched), with 18 genes having 23 additional introns and 3 genes with cDNA sequences running into predicted introns. Of the 23 new introns, 21 were found 5' of the predicted ATG, suggesting either additional exons or introns in the predicted non-coding regions. Assembly of our EST and those in public databases increased the number of genes 'fully' covered by EST to 356, with 85 (~24%) genes having mismatched introns (Table 1; Additional file 2). If we assume an error rate of gene prediction for the whole genome similar to that seen in the 356 fully covered genes, we would expect 1316 genes (24% of 5485 genes) to have been predicted erroneously. This is a large number of predicted genes that may have to be re-annotated, which argues for efforts to experimentally annotate the genome using full-length cDNA sequences.
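The extrapolation above is simple arithmetic, and a short Python sketch makes the rounding explicit. It uses only counts quoted in the text; the variable names are illustrative, and the assumption that the fully covered genes are representative of the whole genome is the one stated in the paragraph, not something the code can check.

```python
# Counts quoted in the text.
FULLY_COVERED_GENES = 356   # genes whose predicted CDS is fully covered by EST
MISMATCHED_GENES = 85       # of those, genes with mismatched introns
PREDICTED_CDS = 5485        # predicted CDS in the genome

observed_rate = MISMATCHED_GENES / FULLY_COVERED_GENES  # about 0.2388
rounded_rate = round(observed_rate, 2)                   # 0.24, i.e. "~24%"

# Extrapolation with the rounded rate, as in the text.
expected_with_rounded_rate = round(rounded_rate * PREDICTED_CDS)
# Extrapolation with the unrounded rate, for comparison.
expected_with_exact_rate = round(observed_rate * PREDICTED_CDS)

print(rounded_rate)                # 0.24
print(expected_with_rounded_rate)  # 1316
print(expected_with_exact_rate)    # 1310
```

The two estimates differ only by rounding of the error rate; either way the extrapolated number is on the order of 1300 genes.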

Table 1 Predicted coding regions that were covered fully by cDNA and their mismatched introns

Approximately half of the P. falciparum genes (53.9%) were predicted to contain introns [5]. Our data suggest that the percentage of genes with introns will be higher than the predicted 54%. Among the 21 genes found to have new introns in cDNA, 10 were predicted to have no introns, and one gene predicted to have only one intron actually had none. This represents a net gain of 9 genes with introns among the 87 genes (or ~10%). Among the 85 genes with mismatched introns from the 356 genes with full coverage of predicted coding sequences, 21 genes gained introns (~5.9%). Based on these data, we can predict that about 60% to 65% of the genes in the P. falciparum genome will have one or more introns. Of interest, the majority (> 90%) of the new introns were found at 5' and 3' UTR or within 100 bp from a predicted ATG or stop codon, suggesting additional exons or changes of start or stop codons. It is also possible that the proposed genome sequence contains insertion/deletion errors causing apparent frameshifts. Automatic prediction algorithms would then have to find an intron/exon border, adding one spurious intron.
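One way to arrive at the 60% to 65% range is to add the two observed net gains to the predicted 53.9%. The Python sketch below does exactly that; treating the gains as simply additive is an assumption made here for illustration, and the text does not spell out how the range was derived.

```python
PREDICTED_PCT_WITH_INTRONS = 53.9  # percentage of genes predicted to contain introns [5]

# Net gains in intron-containing genes, as percentages of each gene set.
gain_87_gene_set = 100 * 9 / 87     # net gain of 9 among the 87 fully sequenced genes
gain_356_gene_set = 100 * 21 / 356  # 21 genes gained introns among the 356 covered genes

# Assumption: the gain observed in each set applies additively genome-wide.
lower_estimate = PREDICTED_PCT_WITH_INTRONS + gain_356_gene_set
upper_estimate = PREDICTED_PCT_WITH_INTRONS + gain_87_gene_set

print(round(gain_356_gene_set, 1), round(gain_87_gene_set, 1))  # 5.9 10.3
print(round(lower_estimate, 1), round(upper_estimate, 1))       # 59.8 64.2
```

Under that assumption the two gene sets bracket the quoted range, with the smaller, fully sequenced set giving the higher estimate.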

Alignment of our cDNA contigs with predicted CDS also identified 78 genes, although not fully covered by our cDNA sequences, with 88 introns either missed by computer prediction or predicted incorrectly (Additional file 3). Among them, 26 genes have 38 introns missed by computer prediction; 25 genes have falsely predicted introns (i.e., they do not exist); 22 genes have 25 introns larger than predicted; and 11 genes have 13 introns smaller than predicted. There are also three predicted genes (PFA0175w, PFB0610c, and PFL2160c) that have cDNA sequences extending into their neighboring genes (PFA0180w, PFB0605w, and PFL2155w, respectively). These predicted gene pairs are 200 bp or less apart on the chromosomes. It is likely that the 3' UTR of the genes will be longer than 200 bp, particularly for gene pairs PFB0610c/PFB0605w and PFL2160c/PFL2155w with ORF in opposite orientations. Similarly, assembly of our and public EST with predicted CDS and genomic DNA increased the number of genes having incorrectly predicted introns to 305, with 152 new introns found and 182 introns having sizes different from those predicted (Table 2; Additional file 3). These genes will require further experimental verification with complete cDNA sequences.

Table 2 Genes having introns that do not match those predicted in public databases

Confirmation of conserved GT-AG intron splicing sites and alternatively spliced introns

All the introns confirmed by our cDNA sequences have typical eukaryotic GT-AG splicing sites except a few genes that have potential 'introns' lacking GT-AG. These atypical 'introns' could be due to deletion during cloning in bacteria. For example, a 497 bp gap was found at 32 bp 5' of the ATG in gene MAL13P1.130, but no GT-AG sites were found in the gap. Gaps without GT-AG sites can be due to either deletion during cloning in bacteria or sequencing errors, although it cannot be ruled out that some introns may not have the conserved GT-AG sites. To investigate this possibility, we designed PCR primers flanking the 497-bp gap in MAL13P1.130 and confirmed the absence of the 497 bp gap (Table 3 and data not shown). Similarly, gene PFL0290w has a gap of 287 bp without GT-AG sites within the predicted ORF; we could not confirm the gap, either. It is clear that gaps without GT-AG sites are unlikely to be true introns. This observation also shows that sequences, including coding regions with relatively high GC content, can be deleted during cloning in bacteria.
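The GT-AG test applied to each gap can be sketched in a few lines of Python. This is a minimal illustration on an invented toy sequence, assuming the gap coordinates (0-based, end-exclusive) have already been taken from a cDNA-to-genome alignment; the function name `has_gt_ag` is hypothetical, and the sketch does not handle the boundary ambiguity that real alignments in AT-rich sequence often show.

```python
def has_gt_ag(genomic, gap_start, gap_end):
    """Return True if genomic[gap_start:gap_end] starts with GT and ends with AG.

    Coordinates are 0-based and end-exclusive, on the strand of the transcript.
    """
    gap = genomic[gap_start:gap_end].upper()
    return len(gap) >= 4 and gap.startswith("GT") and gap.endswith("AG")


# Invented toy gene: two exons separated by a 16-bp canonical intron.
exon1 = "ATGGCTAAAG"
intron = "GTAAATATATTTTTAG"
exon2 = "GATCCAAATAA"
genomic = exon1 + intron + exon2

# A gap that coincides with the intron has GT-AG boundaries.
print(has_gt_ag(genomic, len(exon1), len(exon1) + len(intron)))  # True

# A gap at arbitrary coordinates (as a cloning deletion might produce) does not.
print(has_gt_ag(genomic, 5, 15))  # False
```

Gaps that fail this check are the candidates the text attributes to deletion during cloning or to sequencing errors, which is why they were followed up by PCR.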

Table 3 PCR verification of selected introns that were alternatively spliced

Alternative splicing has been well documented in many organisms [35, 36] including malaria parasites [37–39]. We noticed that many predicted introns were covered with EST contigs that may or may not have the predicted introns, suggesting potential alternatively spliced introns (Table 3; Additional files 2 and 3), in addition to some cDNA that showed introns of different sizes; however, we could not rule out that those cDNA contigs without introns were from contaminated genomic DNA sequences. To verify these introns, we synthesized primers to amplify some alternatively spliced introns suggested by the cDNA sequences (Table 3). The majority of these introns (except four that have different intron sizes) were either present or absent in sequence alignments, e.g., contigs with some sequences running into the predicted introns. Results from PCR confirmed 29 alternatively spliced introns out of 42 genes tested, including genes with more than two forms of transcripts (Figure 2; Table 3).

Figure 2

PCR products confirming alternatively spliced introns. Oligonucleotide primers flanking selected predicted introns that might be alternatively spliced were used to amplify products from genomic DNA (G lanes), reverse-transcribed mRNA of mixed asexual stages (C lanes), and mRNA controls of mixed asexual stages (without reverse transcriptase, R lanes). Genes with alternatively spliced introns are as marked; M, 100 bp DNA ladder. Note that more than two bands were amplified from PFE1540w, PF13_0220, and PF13_0224.

Antisense transcripts

Antisense transcripts are present in the cDNA collections. Because of our cDNA cloning strategies (digestion with restriction enzymes), the orientation of our cDNA clones was not preserved; however, there were transcripts with introns that had conserved GT-AG intron splice sites in the orientation opposite to the predicted genes (Table 2; Additional file 3). These transcripts matched the genomic DNA sequences but with introns having the conserved GT-AG sites in the opposite direction, suggesting antisense transcripts. Of interest, DNA sequence encoding gene PFL1420w (predicted as human macrophage migration inhibitory factor homolog) was matched by two cDNA contigs, one in sense and the other in antisense orientation. The sense sequence had an intron that matches the predicted intron with conserved GT-AG splicing sites. The antisense contig also had an intron with conserved GT-AG sites, but was 121 bp smaller than the predicted sense intron (Figure 3). Translation of the antisense sequence produced a polypeptide with 84 amino acids that had good homology with N-terminal sequence of myosin IXA protein, which could represent a new gene. The presence of these antisense cDNA is consistent with previous reports of antisense transcripts in the parasite [40, 41], but the functions of these transcripts are largely unknown.
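Because clone orientation was not preserved, the strand of a spliced transcript has to be inferred from its splice sites: an intron transcribed from the minus strand reads CT...AC when viewed on the plus strand. The Python sketch below illustrates that inference on an invented toy intron; the function names are hypothetical, and it assumes the intron boundaries from the alignment are exact.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def intron_strand(plus_strand_intron):
    """Infer the transcript strand from an intron given as plus-strand sequence.

    Returns "+" for GT...AG, "-" for CT...AC (GT...AG on the minus strand),
    and None when neither canonical pattern is present.
    """
    seq = plus_strand_intron.upper()
    if seq.startswith("GT") and seq.endswith("AG"):
        return "+"
    if seq.startswith("CT") and seq.endswith("AC"):
        return "-"
    return None


sense_intron = "GTAAATATATTTTTAG"                         # invented toy intron
antisense_on_plus = reverse_complement(sense_intron)      # same intron, minus strand

print(antisense_on_plus)                 # CTAAAAATATATTTAC
print(intron_strand(sense_intron))       # +
print(intron_strand(antisense_on_plus))  # -
print(intron_strand("ATATATATAT"))       # None
```

A contig whose introns all return "-" relative to a gene annotated on the plus strand would be a candidate antisense transcript in the sense described above.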

Figure 3

Diagram of exon/intron structures of predicted gene PFL1420w and cDNA contigs covering the gene. FC (forward contig) is a sense transcript with an intron matching the predicted intron. RC (reverse contig) is an antisense transcript having a smaller intron with GT-AG sites in the opposite direction. The line on top represents plus strand genomic DNA. Dashed lines are introns; heavy lines are predicted exons or ORF.

Functional classification

The EST contigs matching CDS predictions were grouped as functional categories according to GO molecular functions. As expected, the majority of the genes with functional assignments were housekeeping genes (Figure 4; Additional file 4). Almost all genes with functional assignment among the 356 genes fully covered with EST (likely representing genes relatively small and highly transcribed) were housekeeping genes encoding proteins related to transcription, translation, and other basic cell functions such as ribosomal proteins (41), histone proteins (7), or proteasome proteins (7) (Additional file 2). Based on this observation, we can predict that the majority of the 171 hypothetical genes in Additional file 2 are likely housekeeping genes.

Figure 4

Functional categories of expressed genes covered by all EST. A total of 3862 genes matched by EST were sorted according to GO molecular functions with P values < 0.0001 on sequence matches. The majority of the genes encode housekeeping proteins involved in DNA/RNA and protein binding, enzyme catalytic activities, transcription, translation, signal transduction, and transport activities.

Potential new genes

There were also contigs and EST sequences that match neither the nuclear genome nor the mitochondrial and plastid genomes (Additional file 4). Some of these sequences might be parasite DNA sequences that were not represented in the finished P. falciparum genome. Similarly, there were sequences that match genomic DNA but not the predicted CDS. These sequences could represent new genes or non-coding sequences of intergenic/intron/UTR that require further investigation. For sequence information, linked files and detailed annotation for all the EST contigs, please visit [42].

There are also many predicted ORF larger than 5 kb in the P. falciparum genome. The sizes of these large ORF/genes are probably beyond the limits of cloning stability in bacteria and of the in vitro extension capability of reverse transcriptase. In addition, high AT content in the DNA is an obstacle for obtaining good-quality DNA sequences from PCR products. More efforts with new strategies will be required for obtaining full cDNA sequences for the large genes.

Conclusion

Although our EST data are still limited, this work obtained 17332 high-quality cDNA sequences that almost double the current EST collection in public databases. Our effort to extend sequences into cDNA clones allows us to assemble some relatively long cDNA sequences and to correct some erroneously predicted introns. Our data suggest that a considerable number of genes in this parasite genome may have incorrect intron/exon predictions, arguing for more efforts to collect complete cDNA sequences and reannotate the genome with cDNA sequences. This study also confirms the conserved eukaryotic intron splice site (GT-AG) at the parasite introns, shows the presence of relatively large numbers of alternatively spliced and antisense transcripts, and reveals silent loci at subtelomeric regions of many chromosomes. The cDNA sequences presented here will provide useful resources for genome annotation and analyses of gene expression.

Methods

Parasite culture and RNA extraction

P. falciparum isolate 3D7 was cultured as described [43, 44]. Parasite mRNA was extracted from mixed asexual stages using the Micro-Fast Track mRNA isolation kit (Invitrogen).

Construction of cDNA libraries

PCR-based cDNA libraries were constructed using a SMART cDNA library kit (BD-Clontech) as previously described [45]. After reverse transcription using polyT primer, the cDNA were amplified for 11 cycles with primers attached to the 5' capping sequences (5'-GCAGTTGTATCAACGCAGAGTGGCCATTACGGCCGGG-3') and 3' polyT tail. After separation of the PCR products on a 1% agarose gel, DNA inserts of large (> 3 kb), medium (1–3 kb), and small (< 1 kb) sizes were eluted from the gel and cloned into TriplEx2 vector for transfection of XL1-Blue cells (BD-Clontech). Additional libraries with inserts digested with Bam HI and Sau3A were constructed similarly (Additional file 1).

Sequencing cDNA clones

Plaques were randomly picked and transferred to a 96-well PCR plate (PGC Scientifics) containing 43 μl of SM buffer per well. Each phage sample (5 μl) was used as a template in PCR amplification of the insert using 5' primer PT2F1 (5'-AAGTACTCTAGCATTGTGAGC-3') and 3' primer PT2R1 (5'-CTCTTCGCTATTACGCCAGCTG-3') flanking the cloning sites. For libraries restricted with Bam HI or Sau3A, PBKF (5'-ACGGCCAGTGAATTGTAATACGAC-3') and PBKR (5'-ACAGGAAACAGCTATGACCTTGAT-3') were used in PCR amplification. PCR setups included 30 μl H2O, 4.0 μl of 10× buffer, 0.4 μl dNTP (10 mM), 0.15 μl (5 U/μl) Taq polymerase, 0.25 μl of each primer (50 μM), and 5 μl phage solution. The amplification conditions were: 94°C for 5 min; 35 cycles of 94°C for 1 min, 56°C for 10 s, 52°C for 10 s, 60°C for 2 min; and a final extension at 60°C for 5 min. PCR products were treated with 1 μl of ExoSAP-IT (United States Biochemical) at 37°C for 15 min and 80°C for another 15 min. Treated PCR products (5 μl) were used in cycle-sequencing reactions using BigDye terminator chemistry. The primers for sequencing were PT2F3 (5'-TCTCGGGAAGCGCGCCATTGT-3'), T719 (5'-TAATACGACTCACTATAGGG-3'), or T320 (5'-GAAATTAACCCTCACTAAAG-3'). Sequencing cycles were as follows: denaturing at 94°C for 2 min; 25 cycles at 94°C for 20 s, 52°C for 5 s, 50°C for 5 s, and 60°C for 3 min; and a final extension at 60°C for 5 min. After cleaning with Sephadex 50 beads packed in a multiscreen 96-well cleaning plate (Millipore), the products were analyzed on an ABI 3730xl automatic DNA sequencer. To extend the cDNA sequences, 4800 oligonucleotide primers were synthesized based on sequences obtained and used to extend sequences that could not be reached using primers from the vector.
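For readers setting up the insert-amplification PCR, the listed volumes can be converted to a total reaction volume and final concentrations with a short calculation. The Python sketch below uses only the volumes and stock concentrations given above; whether the 10 mM dNTP stock refers to each nucleotide or to the total is not stated, so the dNTP figure should be read with that caveat, and the helper name `final_concentration` is illustrative.

```python
# Component volumes in microlitres, as listed in the protocol above.
components_ul = {
    "water": 30.0,
    "10x buffer": 4.0,
    "dNTP (10 mM stock)": 0.4,
    "Taq polymerase (5 U/ul)": 0.15,
    "forward primer (50 uM stock)": 0.25,
    "reverse primer (50 uM stock)": 0.25,
    "phage template": 5.0,
}

total_ul = sum(components_ul.values())  # about 40.05 ul


def final_concentration(stock_conc, volume_ul):
    """Dilute a stock into the full reaction: C_final = C_stock * V_added / V_total."""
    return stock_conc * volume_ul / total_ul


primer_um = final_concentration(50.0, 0.25)  # each primer, in uM
dntp_mm = final_concentration(10.0, 0.4)     # dNTP, in mM
buffer_x = final_concentration(10.0, 4.0)    # buffer strength, in "x"
taq_units = 5.0 * 0.15                       # enzyme units per reaction

print(round(total_ul, 2))   # 40.05
print(round(primer_um, 3))  # 0.312
print(round(dntp_mm, 3))    # 0.1
print(round(buffer_x, 2))   # 1.0
print(round(taq_units, 2))  # 0.75
```

The numbers correspond to a roughly 40 μl reaction at 1× buffer, with each primer near 0.3 μM and 0.75 U of polymerase.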

DNA sequence trimming and assembly

Sequence runs were first base called and assigned quality scores using Phred [46, 47] and then trimmed using Lucy [48] to remove sequences shorter than 100 bp or with Phred quality scores lower than 20. Vector sequences and polyA/T were also removed. The trimmed sequences were assembled using CAP3 [30] with 21-bp overlap and 85% identity; the quality of the assembled sequences was inspected visually using Sequencher 4.5 (Gene Codes) and Blast [49]. For sequences having mismatches with predicted CDS (indicating potential incorrect intron/exon predictions), genomic sequences covering the whole predicted coding region plus 1 kb 5' of the start codon and 1 kb 3' of the stop codon were downloaded and assembled with EST and CDS. After assembly, the intron/exon junctions were visually inspected and adjusted to ensure proper alignments, particularly for intron splice sites, as software frequently fails to align the AT-rich sequences properly. For Bam HI- and Sau3A-digested libraries, some artificial clones from ligation of unrelated DNA fragments were identified and trimmed accordingly after Blast search of the mismatched sequences against the parasite genome sequence.
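The effect of the two trimming thresholds (Phred quality of at least 20, length of at least 100 bp) can be illustrated with a deliberately simplified Python sketch. This is not Lucy's algorithm, which is more elaborate; it only clips low-quality bases from both ends of a read and then applies the length cutoff. The function name `trim_read` and the toy reads are invented for illustration.

```python
def trim_read(bases, quals, min_quality=20, min_length=100):
    """Clip low-quality ends, then drop the read if it is too short.

    bases: read sequence as a string
    quals: Phred quality scores, one per base
    Returns the trimmed sequence, or None if fewer than min_length bases remain.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start, end = 0, len(bases)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < min_quality:
        start += 1
    # Retreat past low-quality bases at the 3' end.
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    trimmed = bases[start:end]
    return trimmed if len(trimmed) >= min_length else None


# Toy read 1: 10 poor bases, 110 good bases, 10 poor bases -> kept, 110 bp.
good = trim_read("A" * 130, [10] * 10 + [30] * 110 + [8] * 10)
# Toy read 2: only 50 good bases in the middle -> discarded.
short = trim_read("A" * 130, [10] * 40 + [30] * 50 + [10] * 40)

print(len(good))  # 110
print(short)      # None
```

Note that this sketch keeps low-quality bases in the interior of a read, which a window-based trimmer would handle differently.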

Locations of each cluster on the assembled chromosomes and the relationships of clusters with each computer-predicted CDS were displayed with Artemis [50]. Sequence annotation, comparison, classification, and functional annotations were performed as described [31] using various software and databases.

Abbreviations

bp:

base pair(s)

cDNA:

complementary DNA(s)

CDS:

coding sequence(s)

EST:

expressed sequence tag(s)

ORF:

open reading frame(s)

PCR:

polymerase chain reaction

UTR:

untranslated region(s).

References

  1. WHO: WHO Expert Committee on Malaria. World Health Organ Tech Rep Ser. 2000, 892: 1-74.

  2. Snow RW, Guerra CA, Noor AM, Myint HY, Hay SI: The global distribution of clinical episodes of Plasmodium falciparum malaria. Nature. 2005, 434: 214-217. 10.1038/nature03342.

  3. White N: Antimalarial drug resistance and combination chemotherapy. Philos Trans R Soc Lond B Biol Sci. 1999, 354: 739-749. 10.1098/rstb.1999.0426.

  4. Wootton JC, Feng X, Ferdig MT, Cooper RA, Mu J, Baruch DI, Magill AJ, Su X-z: Genetic diversity and chloroquine selective sweeps in Plasmodium falciparum. Nature. 2002, 418: 320-323. 10.1038/nature00813.

  5. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002, 419: 498-511. 10.1038/nature01097.

  6. Bozdech Z, Llinas M, Pulliam BL, Wong ED, Zhu J, DeRisi JL: The Transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum. PLoS Biol. 2003, 1: E5. 10.1371/journal.pbio.0000005.

  7. Le Roch KG, Zhou Y, Blair PL, Grainger M, Moch JK, Haynes JD, De La Vega P, Holder AA, Batalov S, Carucci DJ: Discovery of gene function by expression profiling of the malaria parasite life cycle. Science. 2003, 301: 1503-1508. 10.1126/science.1087025.

  8. Florens L, Washburn MP, Raine JD, Anthony RM, Grainger M, Haynes JD, Moch JK, Muster N, Sacci JB, Tabb DL: A proteomic view of the Plasmodium falciparum life cycle. Nature. 2002, 419: 520-526. 10.1038/nature01107.

  9. Lasonder E, Ishihama Y, Andersen JS, Vermunt AM, Pain A, Sauerwein RW, Eling WM, Hall N, Waters AP, Stunnenberg HG: Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature. 2002, 419: 537-542. 10.1038/nature01111.

  10. Huestis R, Cloonan N, Tchavtchitch M, Saul A: An algorithm to predict 3' intron splice sites in Plasmodium falciparum genomic sequences. Mol Biochem Parasitol. 2001, 112: 71-77. 10.1016/S0166-6851(00)00347-9.

  11. Huestis R, Fischer K: Prediction of many new exons and introns in Plasmodium falciparum chromosome 2. Mol Biochem Parasitol. 2001, 118: 187-199. 10.1016/S0166-6851(01)00376-0.

  12. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H: Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002, 420: 563-573. 10.1038/nature01266.

  13. Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y: Functional annotation of a full-length Arabidopsis cDNA collection. Science. 2002, 296: 141-145. 10.1126/science.1071006.

  14. Stapleton M, Liao G, Brokstein P, Hong L, Carninci P, Shiraki T, Hayashizaki Y, Champe M, Pacleb J, Wan K: The Drosophila gene collection: identification of putative full-length cDNAs for 70% of D. melanogaster genes. Genome Res. 2002, 12: 1294-1300. 10.1101/gr.269102.

  15. Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H: Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science. 2003, 301: 376-379. 10.1126/science.1081288.

  16. 16.

    Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, Wakamatsu A, Hayashi K, Sato H, Nagai K: Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004, 36: 40-45. 10.1038/ng1285.

  17.

    Manak JR, Dike S, Sementchenko V, Kapranov P, Biemar F, Long J, Cheng J, Bell I, Ghosh S, Piccolboni A: Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat Genet. 2006, 38: 1151-1158. 10.1038/ng1875.

  18.

    Weber JL: Molecular biology of malaria parasites. Exp Parasitol. 1988, 66: 143-170. 10.1016/0014-4894(88)90087-2.

  19.

    Watanabe J, Sasaki M, Suzuki Y, Sugano S: FULL-malaria: a database for a full-length enriched cDNA library from human malaria parasite, Plasmodium falciparum. Nucleic Acids Res. 2001, 29: 70-71. 10.1093/nar/29.1.70.

  20.

    Adams JH, Fang X, Kaslow DC, Miller LH: Identification of a cryptic intron in the Plasmodium vivax Duffy binding protein gene. Mol Biochem Parasitol. 1992, 56: 181-183. 10.1016/0166-6851(92)90166-H.

  21.

    Golightly LM, Mbacham W, Daily J, Wirth DF: 3' UTR elements enhance expression of Pgs28, an ookinete protein of Plasmodium gallinaceum. Mol Biochem Parasitol. 2000, 105: 61-70. 10.1016/S0166-6851(99)00165-6.

  22.

    Chakrabarti D, Reddy GR, Dame JB, Almira EC, Laipis PJ, Ferl RJ, Yang TP, Rowe TC, Schuster SM: Analysis of expressed sequence tags from Plasmodium falciparum. Mol Biochem Parasitol. 1994, 66: 97-104. 10.1016/0166-6851(94)90039-6.

  23.

    Carlton JM, Muller R, Yowell CA, Fluegge MR, Sturrock KA, Pritt JR, Vargas-Serrato E, Galinski MR, Barnwell JW, Mulder N: Profiling the malaria genome: a gene survey of three species of malaria parasite with comparison to other apicomplexan species. Mol Biochem Parasitol. 2001, 118: 201-210. 10.1016/S0166-6851(01)00371-1.

  24.

    Kappe SH, Gardner MJ, Brown SM, Ross J, Matuschewski K, Ribeiro JM, Adams JH, Quackenbush J, Cho J, Carucci DJ: Exploring the transcriptome of the malaria sporozoite stage. Proc Natl Acad Sci USA. 2001, 98: 9895-9900. 10.1073/pnas.171185198.

  25.

    Watanabe J, Sasaki M, Suzuki Y, Sugano S: Analysis of transcriptomes of human malaria parasite Plasmodium falciparum using full-length enriched library: identification of novel genes and diverse transcription start sites of messenger RNAs. Gene. 2002, 291: 105-113. 10.1016/S0378-1119(02)00552-8.

  26.

    Merino EF, Fernandez-Becerra C, Madeira AM, Machado AL, Durham A, Gruber A, Hall N, del Portillo HA: Pilot survey of expressed sequence tags (ESTs) from the asexual blood stages of Plasmodium vivax in human patients. Malar J. 2003, 2: 21. 10.1186/1475-2875-2-21.

  27.

    Watanabe J, Suzuki Y, Sasaki M, Sugano S: Full-malaria 2004: an enlarged database for comparative studies of full-length cDNAs of malaria parasites, Plasmodium species. Nucleic Acids Res. 2004, 32: D334-338. 10.1093/nar/gkh115.

  28.

    Kissinger JC, Brunk BP, Crabtree J, Fraunholz MJ, Gajria B, Milgram AJ, Pearson DS, Schug J, Bahl A, Diskin SJ: The Plasmodium genome database. Nature. 2002, 419: 490-492. 10.1038/419490a.

  29.

    PlasmoDB. [http://www.plasmodb.org/plasmo/home.jsp]

  30.

    Huang X, Madan A: CAP3: A DNA sequence assembly program. Genome Res. 1999, 9: 868-877. 10.1101/gr.9.9.868.

  31.

    Ribeiro JM, Alarcon-Chaidez F, Francischetti IM, Mans BJ, Mather TN, Valenzuela JG, Wikel SK: An annotated catalog of salivary gland transcripts from Ixodes scapularis ticks. Insect Biochem Mol Biol. 2006, 36: 111-129. 10.1016/j.ibmb.2005.11.005.

  32.

    Baruch DI, Pasloske BL, Singh HB, Bi X, Ma XC, Feldman M, Taraschi TF, Howard RJ: Cloning the P. falciparum gene encoding PfEMP1, a malarial variant antigen and adherence receptor on the surface of parasitized human erythrocytes. Cell. 1995, 82: 77-87. 10.1016/0092-8674(95)90054-3.

  33.

    Smith JD, Chitnis CE, Craig AG, Roberts DJ, Hudson-Taylor DE, Peterson DS, Pinches R, Newbold CI, Miller LH: Switches in expression of Plasmodium falciparum var genes correlate with changes in antigenic and cytoadherent phenotypes of infected erythrocytes. Cell. 1995, 82: 101-110. 10.1016/0092-8674(95)90056-X.

  34.

    Su XZ, Heatwole VM, Wertheimer SP, Guinet F, Herrfeldt JA, Peterson DS, Ravetch JA, Wellems TE: The large diverse gene family var encodes proteins involved in cytoadherence and antigenic variation of Plasmodium falciparum-infected erythrocytes [see comments]. Cell. 1995, 82: 89-100. 10.1016/0092-8674(95)90055-1.

  35.

    Campbell MA, Haas BJ, Hamilton JP, Mount SM, Buell CR: Comprehensive analysis of alternative splicing in rice and comparative analyses with Arabidopsis. BMC Genomics. 2006, 7: 327. 10.1186/1471-2164-7-327.

  36.

    Johnson JM, Castle J, Garrett-Engele P, Kan Z, Loerch PM, Armour CD, Santos R, Schadt EE, Stoughton R, Shoemaker DD: Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science. 2003, 302: 2141-2144. 10.1126/science.1090100.

  37.

    Knapp B, Nau U, Hundt E, Kupper HA: Demonstration of alternative splicing of a pre-mRNA expressed in the blood stage form of Plasmodium falciparum. J Biol Chem. 1991, 266: 7148-7154.

  38.

    Muhia DK, Swales CA, Eckstein-Ludwig U, Saran S, Polley SD, Kelly JM, Schaap P, Krishna S, Baker DA: Multiple splice variants encode a novel adenylyl cyclase of possible plastid origin expressed in the sexual stage of the malaria parasite Plasmodium falciparum. J Biol Chem. 2003, 278: 22014-22022. 10.1074/jbc.M301639200.

  39.

    Singh N, Preiser P, Renia L, Balu B, Barnwell J, Blair P, Jarra W, Voza T, Landau I, Adams JH: Conservation and developmental control of alternative splicing in maebl among malaria parasites. J Mol Biol. 2004, 343: 589-599. 10.1016/j.jmb.2004.08.047.

  40.

    Patankar S, Munasinghe A, Shoaibi A, Cummings LM, Wirth DF: Serial analysis of gene expression in Plasmodium falciparum reveals the global expression profile of erythrocytic stages and the presence of anti-sense transcripts in the malarial parasite. Mol Biol Cell. 2001, 12: 3114-3125.

  41.

    Gunasekera AM, Patankar S, Schug J, Eisen G, Kissinger J, Roos D, Wirth DF: Widespread distribution of antisense transcripts in the Plasmodium falciparum genome. Mol Biochem Parasitol. 2004, 136: 35-42. 10.1016/j.molbiopara.2004.02.007.

  42.

    Additional file 4. [http://www.ncbi.nlm.nih.gov/projects/omes/P_falciparum_2007/Sup_table3/Sup-table3.xls]

  43.

    Trager W, Jensen JB: Human malaria parasites in continuous culture. Science. 1976, 193: 673-675. 10.1126/science.781840.

  44.

    Haynes JD, Diggs CL, Hines FA, Desjardins RE: Culture of human malaria parasites Plasmodium falciparum. Nature. 1976, 263: 767-769. 10.1038/263767a0.

  45.

    Valenzuela JG, Belkaid Y, Rowton E, Ribeiro JM: The salivary apyrase of the blood-sucking sand fly Phlebotomus papatasi belongs to the novel Cimex family of apyrases. J Exp Biol. 2001, 204: 229-237.

  46.

    Ewing B, Hillier L, Wendl MC, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998, 8: 175-185.

  47.

    Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.

  48.

    Chou HH, Holmes MH: DNA sequence quality trimming and vector removal. Bioinformatics. 2001, 17: 1093-1104. 10.1093/bioinformatics/17.12.1093.

  49.

    Blast. [http://www.ncbi.nlm.nih.gov/BLAST/]

  50.

    Artemis. [http://www.sanger.ac.uk/]

Acknowledgements

This work was supported by the Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health. We thank Dr. Tetsuya Furuya for help in PCR and sequencing and NIAID intramural editor Brenda Rae Marshall for assistance.

Author information

Corresponding author

Correspondence to Xin-zhuan Su.

Additional information

Authors' contributions

FL participated in library construction and sequencing; HJ and JD participated in sequence trimming and analysis; JM participated in sequence alignment and PCR; JGV participated in library construction; JMCR participated in sequence clustering, database search, analysis, and manuscript writing; X-zS conceived the project design and participated in analysis and manuscript writing.

Electronic supplementary material

Additional file 1: Construction of cDNA libraries. The procedures for cDNA library construction and sequencing are summarized in diagram (A). PCR products were separated on a 1% agarose gel (B), and DNA fragments >3 kb, 1.5–3 kb, and <1.5 kb were eluted from gel blocks. M, molecular weight marker; lane numbers 8–18 on the gel indicate products of PCR amplification with 8 to 18 cycles. Eluted DNA fragments were cloned into the TriplEx2 vector, which was transfected into bacteria. For construction of sub-libraries, the DNA was first digested with BamHI or Sau3A and cloned into the same vector. DNA amplified from the vector was sequenced directly. (PDF 81 KB)

Additional file 2: Genes with predicted coding regions fully covered by EST and confirmation of predicted introns. Aligned cDNA and predicted CDS sequences can be viewed by double-clicking the hyperlinked gene names in the Excel file. (ZIP 997 KB)

Additional file 3: Genes with cDNA sequences not matching predicted CDS perfectly. Aligned cDNA and predicted CDS sequences can be viewed by double-clicking the hyperlinked gene names in the Excel file. (ZIP 1015 KB)

Additional file 4: EST contigs that do not match P. falciparum CDS and genomic DNA. For all EST contigs and linked files in additional file 4, please go to [42]. (ZIP 4 MB)

About this article

Cite this article

Lu, F., Jiang, H., Ding, J. et al. cDNA sequences reveal considerable gene prediction inaccuracy in the Plasmodium falciparum genome. BMC Genomics 8, 255 (2007). https://doi.org/10.1186/1471-2164-8-255

Keywords

  • Antisense Transcript
  • Parasite Genome
  • Falciparum Genome
  • Complete cDNA Sequence
  • Predict Code Sequence