Gene copy number variation throughout the Plasmodium falciparum genome
© Cheeseman et al. 2009
Received: 5 January 2009
Accepted: 4 August 2009
Published: 4 August 2009
Skip to main content
© Cheeseman et al. 2009
Received: 5 January 2009
Accepted: 4 August 2009
Published: 4 August 2009
Gene copy number variation (CNV) is responsible for several important phenotypes of the malaria Parasite Plasmodium falciparum, including drug resistance, loss of infected erythrocyte cytoadherence and alteration of receptor usage for erythrocyte invasion. Despite the known effects of CNV, little is known about its extent throughout the genome.
We performed a whole-genome survey of CNV genes in P. falciparum using comParative genome hybridisation of a diverse set of 16 laboratory culture-adapted isolates to a custom designed high density Affymetrix GeneChip array. Overall, 186 genes showed hybridisation signals consistent with deletion or amplification in one or more isolate. There is a strong association of CNV with gene length, genomic location, and low orthology to genes in other Plasmodium species. Sub-telomeric regions of all chromosomes are strongly associated with CNV genes independent from members of previously described multigene families. However, ~40% of CNV genes were located in more central regions of the chromosomes. Among the previously undescribed CNV genes, several that are of potential phenotypic relevance are identified.
CNV represents a major form of genetic variation within the P. falciparum genome; the distribution of gene features indicates the involvement of highly non-random mutational and selective processes. Additional studies should be directed at examining CNV in natural Parasite populations to extend conclusions to clinical settings.
Many classes of genome sequence variation occur in the malaria Parasite P. falciparum that may contribute to variation in pathogenicity, including single nucleotide polymorphism (SNP) , insertion and deletion of short sequences (indels) , large scale deletions , amplifications , inversions , and translocations . Analysis of diversity in the ~22.8 Mb Parasite genome has focused predominantly upon the detection of SNPs and indels [7–9] by DNA re-sequencing methods, although the power of oligonucleotide microarray hybridisation for detecting sequence polymorphism has also been applied [10, 11]. This has shown that P. falciparum genes encoding surface proteins possess a higher than average level of sequence diversity, whereas genes encoding proteins with mitochondrial, metabolic and cell growth functions have low diversity [8, 9]. Recombination rate has also been shown to be variable along the length of each chromosome, being particularly high in the sub-telomeres [8, 12].
Genome sequencing can be utilised for detecting gene deletions and amplifications [13, 14] although alternative methods such as DNA microarray platforms also allow copy number variable regions of the genome to be robustly detected on a large scale. These CNV regions play a prominent role in the adaptive biology of the Parasite. For example, sub-telomeric deletions on chromosomes 2 and 9 have been linked to a loss of infected erythrocyte cytoadherence through deletion of genes encoding adhesion molecules [3, 15]. Conversely, mefloquine resistance is conferred through amplification of a multi-genic locus on chromosome 5 including the multi-drug transporter pfmdr1 gene . Hybridisation of labelled genomic DNA to microarrays can detect amplified or deleted sequences, and depending upon probe density the spectrum of DNA sequence variation that can be detected ranges from SNPs up to copy number variants in the Mb size range [16, 17]. Early studies using microarray platforms [18, 19] demonstrated polymorphism within the sub-telomeres of chromosomes in the Hb3 strain and a panel of sub-clones from a single P. falciparum isolate (IT/FCR-3) respectively, and more recent studies identified novel CNV genes by using arrays to assay small panels of different P. falciparum isolates [10, 11, 20].
CNVs have been shown to be correlated with repetitive sequences in the genomes of humans and model organisms, highlighting the role of non-allelic homologous recombination in the genesis of variants [21–24]. Consistent with this, amplicon breakpoints in the P. falciparum mefloquine resistance locus on chromosome 5 and surrounding the GTP cyclohydrolase amplicon are within microsatellite regions or repetitive monomeric A/T tracts [25, 26]. In comparison with SNP mutation rates in eukaryotic genomes in the order of ~10-8-10-9 per nucleotide site per generation, it has been estimated that large deletion or amplification mutations occur at a rate ~10-4 per nucleotide site per generation [22, 27]. It has been postulated that gene duplications safeguard against deleterious mutations , and allow more rapid rates of evolution within coding sequences , as one gene copy retains gene function whilst the other has reduced constraint to accrue new mutations. Conversely, deletions are suggested to be more often under purifying selection, and thus purged from populations .
To identify commonly amplified and deleted genes, and test for associations with factors that could contribute to their generation and selection, a genome-wide survey of CNV genes was performed on diverse cultured lines of P. falciparum, using a high-density oligonucleotide microarray platform.
Properties of CNV genes compared with non-CNV genes in P. falciparum
(n = 87)
(n = 99)
Proximity to centromeresb
Proximity to telomeresb
Proximity to segmental duplications
Rodent malaria Parasite orthologuea
P. knowlesi orthologuea
The sub-telomeres are typified by repeat regions, large multi-gene families (var, rifin, stevor) and segmental duplications and share considerable homology among chromosomes. There was no association of CNV genes with internal var clusters on chromosomes 4, 6, 7, 8 and 12, nor with identified segmental duplications, and only a weak association with centromeres was shown (Table 1). A strong association (p < 0.0001) with proximity to segmental duplications was however observed.
Mefloquine resistance is caused by amplification of a multi-genic locus surrounding the pfmdr1 gene on chromosome 5 (PFE1150w) and has been shown to be highly copy number dependant . The Dd2 and W2mef Parasites are related clones, derived from a single progenitor selectively grown under mefloquine selection to produce drug resistant Parasites . Here, hybridisation indicated an increased signal for pfmdr1 in Dd2 and W2mef (2.2 fold and 3.5 fold increase in signal intensity respectively, Figure 2B). The extent of the amplification in Dd2 encompassed 14 genes from PFE1095w to PFE1160w, in complete agreement with previously published microarray data . Here the exact same boundaries of the amplification are found for W2mef.
The distribution of size polymorphism in the P. falciparum genome and the number of consecutively identified CNV genes was deTermined, revealing relatively few runs of two or more CNV genes (Figure 3B). There was also a trend towards small genes (<1 kb) being more frequently variable than larger genes (p < 0.0001, Kolmogorov-Smirnov test; Table 1, Figure 3C).
Validation of deletions in the Dd2 and Hb3 genomes was performed using bioinformatic searches of published sequence data aligned to the 3D7 genome. Eleven of 14 putative deletions in Dd2 from the microarray data were supported by the absence of the gene in the Dd2 genome sequence (see Additional file 3), the exceptions being PF14_0040 and PF11_0061 (containing polymorphisms overlapping 27% and 33% of probe target sequences respectively), and PFE0070w (showing extensive re-arrangements along its length due to interspersed repeats). Five of 7 putative deletions in the Hb3 genome were confirmed by the absence of the genes in the Hb3 sequence (see Additional file 3), the exceptions being PFF0860c (containing differences in the Hb3 sequence that overlapped 44% of probe target sequences) and PFI0060c (in which all probes were affected by sequence polymorphism). As a control comparison, a random set of 500 genes (~10% of the genome) that were classed as 'present' in both the Dd2 and Hb3 hybridisations were examined. 97.8% of these genes were identified in the Hb3 sequence suggesting that deletions confirmed by bioinformatic methods are unlikely to be a product of lack of genome coverage, whereas a smaller majority (76.6%) was identified in the Dd2 sequence, consistent with the higher number of unclosed gaps in the current draft Dd2 genome compared to Hb3.
This high level of confirmation of results from the array platform allowed the identification of novel deletions to be catalogued in the P. falciparum genome. For example, the pfsbp1 (PFE0065w) gene was deTermined as 'deleted' along with an adjacent gene (interspersed repeat antigen, PFE0070w) in three strains (Dd2, Fcc-2 and 7G8). Examination of the Dd2 genome sequence revealed a truncation of the pfsbp1 gene (contains only the first 444 bp, out of 1012 bp in 3D7) explaining the low hybridisation signal observed on the microarray. Gene knock-out studies on pfsbp1 (PFE0065w) encoding the skeleton binding protein, have previously shown it as important in the trafficking of PfEMP1 to the erythrocyte cell surface [36, 37]. Deletion of two glycophorin binding protein genes with high (>93%) sequence identity on different chromosomes was also identified: gbph (PF14_0010) in Dd2, Fcc-2 and 7G8 and gbph2 (PF13_0010) in D10, Fcc-2 and T9/102. As deletion of both genes was detected in a single strain (Fcc-2) it is likely that function is either complemented by gene Paralogues (i.e. PF10_0159, the glycophorin-binding protein 130 precursor) or redundant for in vitro cultured Parasites.
Overall, the putative CNV genes included many hypothetical genes (n = 96/3371, 2.9% of all hypothetical genes). The proportion of CNVs was elevated in the rifin (n = 20/180, 11%) and stevor (n = 3/12, 25%) gene families. Other gene families also showed a high proportion of CNV genes, pfmc-2tm (n = 6/12, 50%), large ribosomal subunit protein (n = 5/48, 10.4%), histone (including variant and putative histone proteins) (n = 3/8, 38%), FIKK protein kinase (n = 2/20, 10%) and etramp (n = 2/11, 18%, see Additional file 1). However, a high proportion of CNV was not detected among var genes (n = 1/59, 1.8%), though truncated and pseudogenic var genes exhibited high levels of CNV (n = 5/22, 22.7%; 8 genes had low or no probe coverage), confirming that the polymorphism observed in PfEMP1 proteins is most probably due to recombination events between members of the var genes rather than amplification and/or deletion [38, 39].
There was a deficit in CNV genes with orthologues to other Plasmodium species (Table 1) with 67.2% of common CNVs and 56.2% of rare CNV genes being P. falciparum specific. A large number of these species-specific genes in the P. falciparum genome are exported to the infected erythrocyte cell surface and possess transmembrane domains, though we found no excess of CNV among genes containing the major signal peptide associated with export to the cell surface (the PEXEL motif), and only a weakly significant excess of genes containing transmembrane domains (Table 1).
Duplicated genes have been hypothesised to possess a higher than neutral level of polymorphism due to relaxation of selective constraint. We used a published dataset of 3,539 values  of per gene nucleotide diversity (π) to deTermine if the CNVs detected in this study showed increased sequence polymorphism, compared to non-CNV genes. A significant excess of polymorphism was detected for both the distribution of π values (p = 0.018, Kolmogorov-Smirnov test), and the proportion of CNV genes with values of π greater than the overall average (4.83 × 10-4, p = 0.011, chi squared test, Pearson corrected). There was a slight excess of deleted genes, with 57% of CNV genes above the average pairwise π being deleted and 43% being amplified, although this was not significant (p = 0.07, two-tailed, Fisher's exact test, see Additional file 4).
No association was seen in the distribution of monomeric A/T tracts in CNV genes and 1 kb of surrounding intergenic sequence up- and down-stream of the gene coding sequences, compared to non-CNV genes (p = 0.08, Kolmogorov-Smirnov test, see Additional file 5). This was consistent when CNV in subtelomeric genes or non-subtelomeric genes were considered seParately (non-CNV vs. CNV in sub-telomeric regions, p = 0.10; non-CNV vs. CNV in non-subtelomeric regions p = 0.98, Kolmogorov-Smirnov test). However, sequence encompassing amplified genes (n = 80) showed a significantly skewed distribution toward shorter A/T tract lengths when compared to deleted genes (n = 71, p = 4.6 × 10-6, Kolmogorov-Smirnov test) and non-CNV genes (n = 5176, p = 1.3 × 10-5, Kolmogorov-Smirnov test).
SNPs located within probe sequences have been shown to reduce binding of target DNA hybridised to the array, in the case of exceptionally highly polymorphic genes. To deTermine if the modest excess in polymorphism observed in deleted CNVs was due to an artefact of lower hybridisation, the relationship between sequence diversity and hybridisation signal was examined. SNPs relative to the 3D7 genome were downloaded from whole genome sequencing projects for the Parasite clones Dd2 and Hb3 http://www.PlasmoDB.org. Overall, no association between the gene level hybridisation signal and the density of SNPs in each gene (compared to 3D7) was observed (see Additional file 6 and 7).
The distribution of CNV genes in Plasmodium falciparum shows consistency with what has been seen in the genomes of higher eukaryotes [29, 30]. The excess of smaller sized CNV genes is in accordance with general predictions [30, 40]. The small proportion of consecutive CNV genes indicates that CNV regions in P. falciparum are mostly very small, although there are notable exceptions previously known that were also detected here. Sub-telomeric regions of P. falciparum have evolved as hotspots of genetic diversity, typified by a high proportion of species specific genes, segmental duplications, nucleotide diversity, high mutation and recombination rates. Results here show that CNV genes are also common in sub-telomeric regions, and have highlighted features relevant to the mutational origin and selective pressures of CNV genes in P. falciparum. Given the pivotal role of segmental duplications in CNV generation in the genomes of humans and mice [21, 22], it seems likely that a similar role could occur in malaria Parasites. No complete segmental duplications were detected as either amplified or deleted in this study, though it has been previously shown that segmental duplications are copy number variable in the P. falciparum genome resulting in transcriptional up- and down-regulation of their gene complement [20, 31]. The sub-telomeres possess large gene families containing regions of high sequence homology, and we have identified a number of these as being putative CNV genes. This suggests that mechanisms such as non-allelic homologous recombination (NAHR) may be key to the generation of copy number variation, through recombination of regions of homology either intra- or inter chromosomally .
The function of deleted genes may be frequently complemented by closely related genes, suggesting that most CNV has no adverse effect on phenotype, and is thus nearly neutral . However, in vitro culture of Parasites might allow the maintenance of a greater number of mildly deleterious alleles than in natural populations due either to the reduction in effective population size that would generally increase the effect of genetic drift, or to the relaxation of some specific selective constraints in vitro. This is highlighted by the deletion in vitro of pfsbp1, an important transporter of PfEMP1 to the erythrocyte cell surface, in three of the Parasite lines, a loss that may be non-viable in natural infection due to a failure to cytoadhere to the microvasculature and evade immune responses. An unknown proportion of the variation presented here may not have originated in the natural environment, so Parasites sampled directly from patients will need to be examined to exclude the effect of culture adaptation.
Genomic organisation between rodent malaria Parasites and P. falciparum is well conserved, and arranged in large blocks of syntenic genes, between which species specific genes have evolved . The deficit of CNV genes with orthologues to rodent or macaque malaria Parasites suggests two possible causes that are not mutually exclusive: 1) there is strong negative (purifying) selection on the "core" Plasmodium genes and disruption of these has a deleterious effect. 2) P. falciparum specific genes tend to be under strong positive (diversifying) selection and this favours the maintenance of population-wide CNV.
Many of the P. falciparum specific genes encode known antigens and exported proteins from the intra-erythrocytic stage of the Parasite lifecycle. The presence of multiple copies of these genes allows the generation of antigenic variants within a single genome, greatly facilitating immune escape of the Parasite. A significant association with increased nucleotide diversity in the CNV genes identified in this study is consistent with this. The higher values of π detected in loss CNVs may be attributable to technical limitations of the comparison, there was poor sequence coverage of CNV genes in the dataset used (29% of CNV genes with values for π, compared to 67% of all genes in the genome) . High coverage sequencing that is underway on a large number of P. falciparum isolates will further refine our understanding of evolutionary constraints and modifications on the genome.
Our results demonstrate the existence of gene amplifications and deletions (copy number variations, CNVs) throughout the Plasmodium falciparum genome by comparing diverse culture-adapted isolates. These gene CNVs correlate to subtelomeric location, short gene length, nucleotide diversity and low orthology to closely related Plasmodium species. We have discussed CNV as an important evolutionary mechanism, allowing the adaptive expansion of diverse gene families whilst copy number of core "housekeeping" genes are maintained by negative selection. In addition to this we have noted the occurrence of previously undetected gene CNVs and highlighted those with a potential phenotypic importance to clinical disease. However, host selective pressures are relaxed in vitro and may allow for some deleterious rearrangements than would not be seen in vivo, so studies directly on clinical isolates will also be important.
P. falciparum culture-adapted Parasite lines were obtained either from the MR4 repository or were kindly provided by Prof David Walliker, Prof Richard Carter (University of Edinburgh) or Dr Livia Vivas (LSHTM). Parasite culture in erythrocytes was performed using standard methods , and genomic DNA (gDNA) was extracted using the QIAquick whole blood extraction kit (QIAGEN) and eluted into 30 μl of a low salt buffer.
Genomic DNA from each cultured isolate was first examined by sequencing the highly polymorphic block 2 region of the msp1 gene as a check for identity. Two pairs of cultured lines (MAD20/D10 and T9/94/Wellcome) had identical sequences at msp1, and subsequent sequencing showed each pair was identical at 24 other polymorphic genes. It was thus considered that these identical pairs were the product of cross-contamination, though the exact timing of this was unknown as all isolates were cultured independently in our study. The panel of Parasite lines nevertheless represents a geographically diverse set of isolates, with 3 nominally from Africa (D6, RO33, Palo Alto; apart from 'Wellcome' that was originally supposedly African but was identical to T9/94 here), 7 from South East Asia (T9/96, T9/102, K1, Fcc-2, T9/94, and Dd2 and W2mef that are almost identical clones from the same isolate), 2 from Papua New Guinea (MAD20 and D10 with apparently shared clonal history in culture), one isolate from South America (7G8), one from Central America (HB3), and one (3D7) which was cloned from isolate NF54 derived from a case of airport malaria in The Netherlands presumed to be acquired from the bite of an infected African mosquito.
12 μg genomic DNA from each of the 16 cultured lines was fractionated by digestion with ApoI for 2 h at 50°C, followed by incubation for 20 min at 80°C. Digested DNA was then extracted by addition of phenol-chloroform-isoamyl alcohol volume/volume, and precipitated in 0.1 × volume 3 M sodium acetate at pH 5.2 and 2 × volume 100% ethanol. Pelleted DNA was washed in 70% ethanol then air dried and re-suspended in 20 μl de-ionised H2O. End labelling of the DNA was then performed in Terminal deoxynucleotidyl transferase buffer and 5 mM CoCl2 (Roche) by addition of 1 nM biotin-N11-ddATP (Molecular Probes) and 400 units Terminal deoxynucleotidyl transferase (Roche) for an hour at 37°C in preParation for hybridisation.
The Affymetrix PFSANGER GeneChip is a tiling-like array, with oligonucleotide probes that can overlap across unique sequence regions of the genome [11, 43]. It is composed of 2,439,646 perfect match-only probes (PM) designed using the January 2005 sequence data of the 3D7 P. falciparum strain as a reference. 1,219,157 of the probes map to predicted gene coding sequences with representation of most of all annotated and hypothetical genes. Hybridization was performed after equilibration of the array at room temperature and incubation with pre-hybridization buffer in an Affymetrix oven at 45°C for 10 min rotating at 60 rpm. A mixture containing the labelled genomic DNA of a particular P. falciparum isolate, eukaryotic hybridization controls (Affymetrix), oligoB2 control (Affymetrix), Herring sperm DNA (Promega), acetylated BSA (Invitrogen), in pre-hybridization buffer was incubated at 99°C for 5 min, and then at 45°C for 5 min. This mixture was then used for hybridisation at 45°C for 16 h under constant rotation at 60 rpm.
After hybridization, the arrays were washed and stained in an Affymetrix wash station following Affymetrix Eukaryotic protocol, in a series of washes in non-stringent buffer (6× SSPE, 0.01% Tween-20) followed by stringent washes (100 mM MES, 0.1 M NaCl, 0.01% Tween-20). The GeneChips were stained by addition of 10 μg ml-1 R-phycoerythrin streptavidin in staining buffer (100 mM MES, 1 M NaCl, 0.05% Tween-20) with 2 mg ml-1 acetylated BSA for 5 min at 35°C then washed in non-stringent buffer. Amplification of the staining signal was performed by addition of 3 μg ml-1 biotinylated antistreptavidin antibody (Vector Laboratories) in staining buffer, 2 mg ml-1 acetylated BSA and 0.1 mg ml-1 normal goat IgG (Sigma). This was followed by a final staining step in R-phycoerythrin streptavidin. The stained GeneChips were scanned on an Affymetrix GeneChip Scanner 3000 7G at an emission wavelength of 570 nm at a 3 pixels resolution and stored as binary CEL files. Prior to further data analysis CEL files were background corrected, quantiles normalised and median-polished summarised using the robust multi-array averaging algorithm  with R/Bioconductor Software .
A chip definition file (CDF) was generated in house after mapping each probe onto the version of the 3D7 genome sequence available on 31st October 2007 http://www.genedb.org, and specifying probesets consisting solely of probes mapping to individual genes and excluding all intergenic probes. For statistical purposes, only probesets with a minimum of 10 probes were retained for analysis. Log2 ratios were generated using median probeset hybridisation intensity signals from all hybridised strains as a comParator for individual strain probeset hybridisation intensities, the resultant values representing relative copy number on a Log2 scale and referred to as Log2 ratio hybridisation signal intensity. Cut-off values of +1/-1 represent a doubling or halving of Log2 hybridisation signal, probesets outside this threshold were classed as amplified or deleted respectively. This threshold limited detection of CNV genes to those >4 standard deviations from the mean, this criteria has previously been demonstrated to result in very low false positivity when detecting copy number variants [16, 46]. The gene annotation used was from the 3D7 genome annotation as listed in PlasmoDB http://www.PlasmoDB.org on 31st October 2007.
Sub-telomeric regions indicated in the analysis were defined as those distal to chromosomal boundaries as previously defined [31, 32], based upon homologous gene organisation between P. falciparum and rodent malaria Parasite chromosomes. Proximity to telomeres and centromeres and possession of orthologues, transmembrane domains and signal peptides were derived from publically available datasets on http://www.PlasmoDB.org. Nucleotide diversity in 3,539 predicted genes among 5 different P. falciparum isolates was calculated previously . Differences in distributions of values of π between groups was deTermined using the Kolmogorov-Smirnov test applied in the stats package in R. Average pairwise nucleotide diversity was used as a cut-off to group genes into those with values of π above or below the genome-wide average, statistically significant differences between observed and expected proportions of genes with a higher than average value of π was calculated using a Pearson's corrected chi-squared test using the on-line calculator at http://faculty.vassar.edu/lowry/odds2x2.html. Heatmaps were plotted using the Heatplus package in R/Bioconductor.
Whole genome sequence data was browsed using the Broad Institute website for Parasite sequence http://www.broad.mit.edu/annotation/genome/plasmodium_falciparum_spp/Regions.html. The 3D7 sequence of genes considered as 'deleted' in Hb3 or Dd2 on the basis of microarray hybridisation analysis was BLASTed against Hb3/Dd2 genomic sequence using the default settings (E-value > 1 × 10-3, using gapped alignments and unfiltered sequence). As a comparison to test the sensitivity of this validation approach, 500 genes that were not considered to be deleted were selected and similarly BLASTed against the Hb3 and Dd2 genome sequences (this sub-set of 500 genes had a similar distribution of probe number, gene size and AT content to the genome-wide distribution). Member of the var, rifin and stevor genes families were also removed prior to analysis due to the greater likelihood of misalignment in these multi-copy gene families.
copy number variation
single nucleotide polymorphism
skeleton binding protein
ring exported protein
cytoadherence linked asexual protein
multi-drug resistance protein 1
glycophorein binding protein homologue
glycophorin binding protein homologue 2
glycophorein-binding protein 130 precursor
merozoite surface protein 1
maurer's cleft 2 transmembrane protein
This work was supported by Wellcome Trust programme grant number 074695/Z/04/B to DJC, an MRC PhD studentship for IHC, and core funding to the MRC Gambia Unit. We are grateful to Chris Newbold, Dominic Kwiatkowski, Matt Berriman, Daniel Jeffares and Neil Abernathy for helpful discussions.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the Terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.