Chromatin-driven de novo discovery of DNA binding motifs in the human malaria parasite
© Harris et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2011
Accepted: 13 December 2011
Published: 13 December 2011
Skip to main content
© Harris et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2011
Accepted: 13 December 2011
Published: 13 December 2011
Despite extensive efforts to discover transcription factors and their binding sites in the human malaria parasite Plasmodium falciparum, only a few transcription factor binding motifs have been experimentally validated to date. As a consequence, gene regulation in P. falciparum is still poorly understood. There is now evidence that the chromatin architecture plays an important role in transcriptional control in malaria.
We propose a methodology for discovering cis-regulatory elements that uses for the first time exclusively dynamic chromatin remodeling data. Our method employs nucleosome positioning data collected at seven time points during the erythrocytic cycle of P. falciparum to discover putative DNA binding motifs and their transcription factor binding sites along with their associated clusters of target genes. Our approach results in 129 putative binding motifs within the promoter region of known genes. About 75% of those are novel, the remaining being highly similar to experimentally validated binding motifs. About half of the binding motifs reported show statistically significant enrichment in functional gene sets and strong positional bias in the promoter region.
Experimental results establish the principle that dynamic chromatin remodeling data can be used in lieu of gene expression data to discover binding motifs and their transcription factor binding sites. Our approach can be applied using only dynamic nucleosome positioning data, independent from any knowledge of gene function or expression.
One of the major challenges in molecular biology is to characterize the mechanisms governing the regulation of transcription. Mechanisms of regulation can be broadly classified in three classes: (1) interaction of a control factor with DNA, (2) interaction of a control factor with the transcriptional complex, and (3) epigenetic factors. In this paper we are interested in elucidating mechanisms that belong to the first class, in which transcription factor proteins modulate expression levels by binding to one or more specific sites in the promoter region of a gene. The problem of identifying in silico transcription factor binding sites (TFBS) has been studied intensively. As a result a variety of algorithms and tools have been developed (see  for a review). Typically, these methodologies involve obtaining a set of genes which are known to be co-expressed or functionally-related and searching for common (over-represented) short "motifs" in their promoter regions. The underlying hypothesis is that co-expressed or functionally-related genes are expected to share common TFBS.
Algorithms for motif discovery can be enumerative or alignment-based. Enumerative techniques involve the enumeration of all the possible motifs in the promoters, the assignment of an appropriate score based on probabilistic models, and a criteria to select the most statistically significant motifs. Alignment-based methods use probabilistic modeling and combinatorial optimization, e.g., expectation maximization or Gibbs sampling, to identify sequence patterns that are over-represented in the context of the promoter regions. Commonly used tools for motif discovery include MEME , Weeder , Gibbs Motif Sampler  and AlignACE . In general, the discovery of very short or highly degenerated motifs remains statistically challenging. Therefore, these tools have only a limited success when used alone, especially when the ~80% AT-rich genome of the human malaria parasite is considered. In response to an urgent need to understand how P. falciparum regulates its genes, several ad-hoc techniques to discover cis-regulatory elements have been developed specifically for the malaria parasite. Young et al.'s Gene Enrichment Motif Searching (GEMS)  uses a hypergeometric-based scoring scheme and a position-specific weight matrix optimization procedure to identify putative motifs. The input to GEMS are twenty-one clusters of functionally related or co-expressed genes in P. falciparum. The ouput is 34 putative TFBS in promoter sequences and 21 TFBS in introns-derived sequences. The method proposed by Wu et al.  compares evolutionarily related species of Plasmodium and uses orthologous sequences to identify conserved TFBS. The method identified 38 TFBS that partially overlap previously reported putative TFBS in P. falciparum [2, 8]. Elemento et al. [9, 10] propose an algorithm called Finding Informative Regulatory Elements (FIRE) that measures the mutual information between sequences and gene expression profiles, which is used to select the most statistically significant motifs. While most of the experiments are carried out in Saccharomyces cerevisiae, the authors report 21 putative TFBS in P. falciparum. Iengar and Joshi  combined the strenght of MEME , AlignACE  and Weeder , to identify putative TFBS in promoters of P. falciparum co-expressed genes. The authors used strict cut-offs and selected only motifs that were found by all three software tools. This study resulted in 27 sets of putative TFBS.
The success of computational TFBS discovery methods critically depends on the a priori knowledge about which genes are co-regulated or functionally-related and therefore more likely to share common TFBS. Carrying out this step can be challenging because (1) gene functional annotation are often incomplete or inaccurate, (2) gene expression profiles are often limited to a subset of the genes or specific events during the cell cycle and (3) genome-wide expression data can be incomplete. For instance, P. falciparum has 5418 protein-coding genes of which about 2500 have no known function and only ~3200 have stable and constitutive expression profile. Furthermore, gene clustering rely on the measurement of mRNA steady state levels that are not a direct measure of transcriptional activity but reflect both mechanisms of transcriptional and post-transcriptional regulation. As a consequence, the resulting clusters are likely to be incomplete or incorrect.
Recent alternative approaches that exploit the chromatin structure information to identify TFBS are based on the observation that active TFBS are usually nucleosome-depleted. Nucleosome occupancy data has already shown to improve the discovery of TFBS employed in a few model organisms (e.g., [12–15]). Nucleosome information is typically used in combination with sequence sets generated by ChIP experiments and occasionally mRNA expression profiles rather than alone. Collecting all nucleosome-depleted regions of the genome would indeed generate a very high number of sequence sets without transcription-related specificity. To circumvent this problem, we propose to use dynamic changes of nucleosome occupancy (e.g., across time, such as the duration of a cell cycle) to build sets of genes that are likely to be co-regulated. We demonstrate the power of our approach in the context of the search for potential TFBS in P. falciparum's genome using nucleosome positioning data obtained at seven time points during the erythrocytic asexual cycle. This approach is general, and can be applied to any organism for which nucleosome occupancy data at various time points or under varying conditions are available. Here, we identify 129 potential DNA binding motifs in the human malaria parasite's genome, most of them being novel. These results represent a major resource for the human malaria parasite, with significant implications for future investigations.
In a previous analysis of the chromatin architecture in P. falciparum , we used FAIRE (formaldehyde-assisted isolation of regulatory elements) coupled with next generation sequencing, or FAIRE-seq, to study the variations of nucleosome occupancy across its intra-erythrocytic cycle. Samples of P. falciparum-infected erythrocytes were collected with six-hours increment for 36 hours (seven time points), which is the duration of one cycle of growth, replication, and maturation into multiplied invasive parasites. The parasites' chromatin status at all seven time points was analyzed by FAIRE-seq. Briefly, the method involves the chemical cross-linking of the chromatin followed by shearing, protein-free DNA purification, and sequencing. FAIRE-seq therefore isolates and reveals nucleosome-free regions of the genome. In general, we found that nucleosomes are more abundant within malaria gene bodies whereas promoters are relatively nucleosome-depleted (i.e., with high FAIRE-seq read coverage). In that context, we analyzed the variations of chromatin availability (i.e., the variations of FAIRE-seq read coverage) at the loci of two validated TFBS that are specific to apicomplexan AP2-related transcription factors . We observed that the chromatin availability of these TFBS varies significantly across time ; the presence/absence of nucleosomes masks/reveals the TFBS and thereby modulates the binding of the ApiAP2 transcription factors. These elements indicate that one should be looking for TFBS within the regions of the genome with variable FAIRE-seq coverage. Here, we analyze the regions of the P. falciparum genome covered by FAIRE-seq sequenced reads with the highest variance of FAIRE-seq coverage across the seven time points of the erythrocytic cycle - i.e., nucleosome-depleted regions that are potentially accessible to transcription factors in a cycle-dependent manner - to discover novel putative TFBS.
The FAIRE-seq coverage within the functional window is then averaged for each time point, creating a seven-point chromatin availability profile for the considered gene These profiles are used for k-means clustering of the genes. Using k = 15, we obtained clusters of 33 to 841 genes, for an average of 364 genes per cluster (data not shown). The sequence sets from the functional windows within the same cluster are used as input for the discovery of putative TFBS, hereafter called motifs. Since the only experimentally-validated motifs for P. falciparum are either six or eight bp-long, we restricted the present analysis to motifs of six, seven and eight nucleotides in length. Our method can however be used with motifs of any length.
For each cluster, all possible motifs of size six to eight nucleotides in length were searched within the functional windows of each gene. Frequencies of each motif were then modelled according to a hypergeometric probability distribution that measures the chances that the frequency of a given motif would be observed if the input sequences would have been selected randomly. The hypergeometric enrichment score or HES captures the statistical significance of the over-representation of a given motif in a given group (e.g., significance threshold set at HES = 2 ⇔ p = 0.01). Motifs and their variants were selected according to their HES values. This step selected 2727 6-mers, 8813 7-mers, and 19,435 8-mers in the P. falciparum genome.
Finally, since motifs are usually degenerated to various extents, they were regrouped according to their similarities measured by the Tanimoto distance [3, 6] and the Pearson coefficient of position weight matrix (PWM) . Ultimately, the final list of degenerated motifs consisted of 129 putative sets, i.e., 21 6-mers, 46 7-mers, and 62 8-mers (this lists of motifs can be found in Additional File 1). Our tool can provide different representations for a motif: as the list of mutants, as a position weight matrix (PWM), as a sequence logo or as a regular expression. This flexibility allows users to make an informed decision about which particular sequence of a motif to select in experimental validation.
We evaluated the performances of our method by comparison with previously published work. We analyzed the distribution of our motifs within the P. falciparum genome, looking for positional biases relative to transcription start sites and predicted promoters, and for enrichment in functional sets of genes built from gene ontology and gene expression profiles information.
We analyzed the distribution of the motif sets identified by our method within functionally relevant groups of genes. We used (1) clusters of genes inferred from GO annotations, (2) 15 previously published functional clusters derived from the analysis of mRNA profiles , and (3) a view of the previous 15 clusters reduced to four gene expression (GE) groups based on the morphological stage at which genes are expressed (see  and the Methods section). These GE groups were previously published  and correspond to clusters of genes grouped according to variations in FAIRE-seq within their promoters across seven time points. In the following we generically refer to any of these three sets above as a functional gene set. We estimated motif enrichment within each functional gene set using hypergeometric statistics.
Percentage of motifs with positional bias relative to the TSS
Window width [bases]
Percentage of our motifs having p-value of 0.01 or lower
Percentage of motifs in having p-value of 0.01 or lower
The human malaria parasite has highly skewed nucleotide distribution (~80% of A+T) that together with the fact that over 40% of all genes have no known function makes in silico TFBS discovery challenging. Here we have demonstrated that chromatin structure data can be used in the context of an enumerative motif approach for successful in silico discovery of regulatory elements.
In our previous analysis of the dynamic chromatin structure in the human malaria parasite  we observed that FAIRE-seq coverage surrounding a validated TFBS varies drastically throughout the erythrocytic cycle. Another important observation was that most of the genes had well-defined FAIRE-seq peaks within their promoters, and while the intensity of the peaks changed across the time points, their locations relative to the start codons remained unchanged. Given these observations, we formulated the hypothesis that the most likely regions containing TFBS are the windows within the promoters with the highest variance in the FAIRE-seq coverage.
We tested this hypothesis by developing a new methodology for in silico motif discovery using FAIRE-seq coverage. Our approach resulted in the finding of 129 putative motifs, including many of the motifs proposed by previous studies [17, 6, 19, 20]. Moreover, half of the motifs that we propose are over-represented within particular functional gene sets, especially genes involved in specialized stages of the parasite such as its sexual or invasive form. Our motifs also showed stronger positional bias relative to the promoter region compared to previously proposed set of motifs. These putative motifs together with their associative target clusters can serve as starting points in future research on characterization of unannotated proteins and regulatory mechanisms. In summary, our data confirm the importance of chromatin structural changes to regulate gene expression in the human malaria parasite. Precise knowledge of gene regulation pathways in the human malaria parasite will be essential for developing novel therapeutic strategies.
FAIRE-seq and alignment data were obtained from . Using sequenced reads that align uniquely to the P. falciparum reference genome (downloaded from http://www.plasmoDB.org, version 5.5), FAIRE-seq read coverages were computed by extending the mapped reads up to 200 bases (i.e., the average size of the sequencing libraries) as previously described . The raw counts were then added at each position of the genome, and normalized per million of mapped reads and per percentage of area covered (see ). The functional window of a gene was identified in a 1000 bp-long region upstream of the start codon was identified as follows. The average FAIRE-seq coverage inside a sliding 146 bp-long (nucleosome size) window was computed for each region and for each time point. The window with the highest variance of average FAIRE-seq coverage across the seven time points was declared to be the functional window for that gene. The seven average values of FAIRE-seq coverage in the functional windows were used to generate the initial k-means clustering of the genes
Clustering via k-means was performed using the FAIRE-seq coverage profiles in the functional windows of all protein-coding genes. Clustering was performed for k = 5, k = 10, k = 15, and k = 20. The k-means procedure for the initial clustering by FAIRE-seq coverage was carried out several times and always resulted in the same final assignment of genes to clusters. The underlying metric was Euclidean. Patterns for putative TFBS were searched for, using the clusters generated for each choice of k. The sets of motifs for each k were compared using Tanimoto distance and Pearson correlation coefficient. With a threshold on the Tanimoto distance of 0.5 and Pearson coefficient of 0.6, the sets of motifs according to various values of k were very similar. The choice k = 15 seemed to offer the best tradeoff between cluster size and the quality of cluster separation (data not shown). We therefore retained k = 15 in the rest of our analysis.
The HES measures the statistical significance of a motif m in the positive set M. The smaller is the hypergeometric p-value, the smaller is the probability that the observed number of occurrences is due to random factors, and the higher is the HES.
Where S[i]=max1≤j<i HES(over all k-mers in LN[j] and L[i]). Observe that LN has only one k-mer, namely L; hence, MH is simply HES(L), and our claim that HES over k-mers in LN is maximized for top one k-mers is obviously true. To choose LN[i], we consider each LN[j] one at a time for all choices of 1≤j<i: we calculate HES over previously chosen k-mers in LN[j] together with L[i], and choose the k-mers in LN[j] ∪ L[i], whose HES is maximum over all choices for j. Then we compare the resulted maximum HES with HES of just one k-mer, namely L[i]: if HES(L[i]) is greater, then the final choice for LN[i] is L[i], otherwise LN[i] = LN[j]∪L[i]. Since LN[j] (for 1≤j<i) is a subset of top j k-mers that maximizes HES over all such subsets that include j th k-mer, our choice of LN[i] increases the likelihood that it is a subset of top i k-mers that maximizes HES over all such subsets that include i th k-mer. After the MH vector is calculated, the value of i corresponding to maxMH[i] is determined and the set of k-mers in LN[i] becomes the mutant set of t. Observe that when we compute HES over L[i]∪LN[j], only r and y are affected in the formula for HES (both are increased). While this algorithm does not guarantee to always find the optimal subset of mutants with the lowest p-value, it works very well in practice and significantly decreases the computation time compared to the brute force approach.
Each k-mer t in M was represented by this HES-maximal subset that is called the mutant set of t. Since in practice the size of the mutant set is very small (less than ten), we only considered the top ten k-mers with the highest HES in N(t,1). To ensure that the selection of motifs followed a probabilistic model without replacement, we did not consider motifs whose occurrences of corresponding mutants overlapped each other.
For all identified motifs and their target gene clusters, we applied a final pipeline of filtration steps described next aimed at reducing the false positives.
where G is the number of sequences in the positive set M.
The first threshold on the minimum number of distinct sequences in which k-mers from N(t,1) must occur has the effect of filtering out motifs that have low probabilities of being over-represented in the positive set M. The second threshold filters out motifs that are abundant in the genome in general, and therefore an over-representation of these motifs in the positive set M is more likely random.
Each identified potential motifs was then analyzed in the context of their respective FAIRE-seq based preliminary cluster. For each k-mer t, we found the subset of mutants in N(t,1) that maximizes HES for t in the cluster. Then, we identified all genes that had exact occurrences of the mutants of t within the selected nucleosome-sized windows. These genes constituted a new expanded cluster for which the set of mutants for t was recomputed. This final set of mutants is a putative motif and the expanded cluster is its target gene cluster. Since the only three validated motifs for P. falciparum are of length 6 and 8 bases, we restricted our analysis to motif lengths 6 to 8 bases.
Since we calculated the HES based on the number of mutant occurrences rather than over the number of distinct genes where each mutant occurs, additional filtration steps are necessary to ensure that a high HES is not due to multiple occurrences within a few genes of the cluster. We required the number of distinct genes where the motif occured to be close to the number of occurrences of the motif in the target genes cluster. More specifically, we filtered out motifs whose ratio between the number of distinct genes where the motif occured and the number of occurrences of the motif in the cluster was smaller than 0.8. Since the HES of a motif was calculated over sequences of about 150 bp in length, we wanted the number of exact occurrences of mutants inside a single sequence to be close to one. This threshold was used to distinguish between motifs and tandem repeats. Finally, we only considered motifs with HES in their target clusters greater than or equal to the average HES in the distribution of HES for all motifs of length 6 to 8 bases.
We used phylogenetic conservation information to filter out non-conserved motifs that are likely to be false positives. Similarly to , we calculated HES for the putative motifs in four orthologous species, namely P. berghei, P.chabaudi, P. vivax and P. yoelii. For each putative motif, we computed the HES for genes orthologous to the genes in the motif's target cluster within 1000 bp upstream of the start codon. Motifs with a HES of at least 2 (i.e., p-value of 0.01) in at least one of the four orthologous species were kept for further analysis. To avoid artificially high orthologous HES, we required the ratio between the number of distinct genes where the mutants occurred and the number of exact occurrences of motifs in the set of orthologous genes corresponding to the target cluster to be at least 0.5 (this threshold is lower than in the previous step due to the increased length of input sequences). This step ensured that HES was supported with enough orthologous genes and that all occurrences did not fall into the promoters of a few genes.
To confirm that the statistical significance of our motifs in their corresponding orthologous clusters are not due to a random chance, we conducted an additional randomized analysis. Particularly, we investigated the distribution of HES calculated for our motifs in randomly chosen orthologous clusters. For each motif we calculated its HES in 100 randomly chosen orthologous clusters of the same size that was used in our original analysis. Then we calculated the average and the standard deviation of HES over the random clusters and used these values to find a new threshold on the orthologous HES corresponding to a significance level of 0.01. A motif was kept for further analysis if it had orthologous HES of at least two and this HES was higher than the threshold calculated over the random orthologous clusters. This analysis ensures that orthologous HES is statistically significant for each motif and the given cluster size.
Finally, we adjusted the orthologous p-values of each final motif to correct for multiple testing. Our permutation test gives us the distribution of the orthologous HES when the enrichment of a motif in an orthologous cluster is solely due to a random chance. For each enriched motif, we computed its HES in 1000 randomly-selected orthologous clusters in the four orthologous species, where the size of a random orthologous cluster was the same as the size of the original orthologous target gene cluster. Since the statistical significance test requires that the orthologous HES must exceed the threshold in at least one out of four species, we computed the HES in all four species at each iteration, but used in the analysis only the maximum score. In order to be able to compare HES across different motifs (which might have different cluster sizes), we normalized all HES (both for real and random clusters) using their mean μ HES and the standard deviation σ HES , calculated over 1000 random clusters. The normalized HES is (HES -μ HES )/σ HES . The resulting distribution of 1000N normalized HES (where N is 1587, 2471, and 2745 for 6-, 7-, and 8-mers respectively), was used to correct for multiple testing, as follows. The normalized HES for both real and random clusters were aggregated, then sorted in decreasing order. For each final motif we calculated the p-value (corrected for multiple testing) as the proportion of the top values in this joint distribution that are greater than or equal to the real orthologous normalized HES of the motif . Out of the final 129 motifs reported previously, a total of 113 have adjusted p-values corresponding to a false discovery rate of less than 5%. Additional File 1 contains the p-values for all 129 motifs before and after correction for multiple testing.
In order to account for sequence degeneration of motifs we added a motif clustering step to filter out duplicates and highly similar motifs. We selected unique motifs with the highest HES and clustered the remaining putative motifs by similarity using the Tanimoto distance [3, 6] and Pearson correlation coefficients of their position weight matrices (PWMs) .
where the intersection of A and B includes all overlapping positional occurrences.
Then, motifs were clustered according to the Pearson coefficients of their PWMs. This step filtered out motifs that had very similar content, e.g., motifs that might be shifted versions of each other. We define two motifs shifted version if they shared at least half of a motif's length. We considered two motifs similar if the Pearson coefficient between their corresponding PWMs or of the PWMs built on their shifted versions was greater than or equal to 0.75 [7, 18]. To build PWMs for two motifs that are shifted versions of each other, we first built a PWM for the motif with the highest HES and then we constructed a PWM for the other motif that we shifted to align with the highest-HES-motif.
Orthologous gene maps between P. falciparum and P. berghei, P. falciparum and P. chabaudi, P. falciparum and P. vivax, and P. falciparum and P. yoelii were obtained from the OrthoMCL database http://www.orthomcl.org/]. A total of 1915 orthologous genes of P. berghei, 1247 of P. chabaudi, 4126 of P. vivax, and 2685 of P. yoelii were used in this study.
In order to study the motifs enrichment in functional gene sets, we used gene ontology (GO) functions from P. falciparum v6.0 together with ontology-based pattern identification (OPI) clusters from [http://carrier.gnf.org/publications/OPI/] to retrieve a total of 13,859 GO/gene pairs, with 1288 distinct GO names. As an alternative for GO annotation, we used 15 functional clusters of genes that were previously obtained based on mRNA profiles . These 15 clusters were used as is or regrouped into four gene expression (GE) groups as follows. Gene expression group I (GEI) contains genes expressed in sporozoites and gametocytes (clusters 1-3), GEII contains genes corresponding to ring, schizont and trophozoite stages (clusters 4-7), GEIII contains genes expressed at the throphozoite stage (clusters 8-13) and group GEIV contains genes expressed at sporozoite, gametocyte and schizont stages and involved in red blood cell invasion (clusters 14-15).
This project was supported in part by NSF CAREER IIS-0447773, NSF DBI-1062301 and NIH R01 AI85077-01A1.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.