FAIRE-seq read processing and functional window selection
FAIRE-seq and alignment data were obtained from [16]. Using sequenced reads that align uniquely to the P. falciparum reference genome (downloaded from http://www.plasmoDB.org, version 5.5), FAIRE-seq read coverages were computed by extending the mapped reads up to 200 bases (i.e., the average size of the sequencing libraries) as previously described [28]. The raw counts were then added at each position of the genome, and normalized per million of mapped reads and per percentage of area covered (see [16]). The functional window of a gene was identified in a 1000 bp-long region upstream of the start codon was identified as follows. The average FAIRE-seq coverage inside a sliding 146 bp-long (nucleosome size) window was computed for each region and for each time point. The window with the highest variance of average FAIRE-seq coverage across the seven time points was declared to be the functional window for that gene. The seven average values of FAIRE-seq coverage in the functional windows were used to generate the initial k- means clustering of the genes
k-means clustering of coverage profiles of the functional windows
Clustering via k-means was performed using the FAIRE-seq coverage profiles in the functional windows of all protein-coding genes. Clustering was performed for k = 5, k = 10, k = 15, and k = 20. The k-means procedure for the initial clustering by FAIRE-seq coverage was carried out several times and always resulted in the same final assignment of genes to clusters. The underlying metric was Euclidean. Patterns for putative TFBS were searched for, using the clusters generated for each choice of k. The sets of motifs for each k were compared using Tanimoto distance and Pearson correlation coefficient. With a threshold on the Tanimoto distance of 0.5 and Pearson coefficient of 0.6, the sets of motifs according to various values of k were very similar. The choice k = 15 seemed to offer the best tradeoff between cluster size and the quality of cluster separation (data not shown). We therefore retained k = 15 in the rest of our analysis.
Motifs scoring using a hypergeometric probabilistic model
All possible k-mers of length six to eight bases were searched within the functional windows of each gene. Frequencies of each k- mer were then modelled according to a hypergeometric distribution. For each cluster of genes, let M be the set of all k- mers inside the functional windows of the genes in the cluster. Let n denote the size of M and y denote the number of occurrences of a k- mer m in M. The frequency of each k- mer m of M was compared to its frequency observed in the set S of all k- mers found within 1000 bp upstream of genes (which also includes the defined functional windows). Let N denote the size of S and r denote the number of occurrences of m in S. The hypergeometric distribution was used to measure the probability that the frequency of a given k- mer of M would be observed if the input sequences would have been selected randomly within the entire promoter regions (1000 bp upstream of genes) rather than functional windows (see [29] for more details). For each k- mer m of M, a hypergeometric p-value was defined as:
The corresponding hypergeometric enrichment score (HES) is then defined as:
The HES measures the statistical significance of a motif m in the positive set M. The smaller is the hypergeometric p-value, the smaller is the probability that the observed number of occurrences is due to random factors, and the higher is the HES.
Motif identification and representation
Given a k- mer t, let N(t,1) denote the one-mismatch neighborhood of t, i.e., a k- mer s belongs to N(t,1) if the number of mismatches between s and t is at most one. Given any k- mer t in the set M, the k- mers in N(t,1) were first sorted in decreasing order of their HES, and then a dynamic programming algorithm was used to select the set of k-mers in N(t,1) that maximized the HES. The HES for a set of k-mers is calculated using the formula in the previous section where r is the total number of occurrences of all k-mers in the set S and y is the total number of occurrences of all k-mers in the set M (M and S are also defined above). The dynamic programming algorithm works as follows. Let L be the ordered list of k-mers from N(t,1) by decreasing order of their HES. Each k-mer in L is considered one at a time in that order starting with L[1].We use LN[i] to denote a subset of top i k-mers in L whose overall HES is maximized over all possible such subsets that include the ithk-mer, where 1 ≤ i ≤ min{10, |L|}. Let MH[i] denote the HES calculated over k-mers in LN[i]. The recurrence relation for MH[i] is
Where S[i]=max1≤j<iHES(over all k-mers in LN[j] and L[i]). Observe that LN[1] has only one k-mer, namely L[1]; hence, MH[1] is simply HES(L[1]), and our claim that HES over k-mers in LN[1] is maximized for top one k-mers is obviously true. To choose LN[i], we consider each LN[j] one at a time for all choices of 1≤j<i: we calculate HES over previously chosen k-mers in LN[j] together with L[i], and choose the k-mers in LN[j] ∪ L[i], whose HES is maximum over all choices for j. Then we compare the resulted maximum HES with HES of just one k-mer, namely L[i]: if HES(L[i]) is greater, then the final choice for LN[i] is L[i], otherwise LN[i] = LN[j]∪L[i]. Since LN[j] (for 1≤j<i) is a subset of top j k-mers that maximizes HES over all such subsets that include jthk-mer, our choice of LN[i] increases the likelihood that it is a subset of top i k-mers that maximizes HES over all such subsets that include ithk-mer. After the MH vector is calculated, the value of i corresponding to maxMH[i] is determined and the set of k-mers in LN[i] becomes the mutant set of t. Observe that when we compute HES over L[i]∪LN[j], only r and y are affected in the formula for HES (both are increased). While this algorithm does not guarantee to always find the optimal subset of mutants with the lowest p-value, it works very well in practice and significantly decreases the computation time compared to the brute force approach.
Each k-mer t in M was represented by this HES-maximal subset that is called the mutant set of t. Since in practice the size of the mutant set is very small (less than ten), we only considered the top ten k- mers with the highest HES in N(t,1). To ensure that the selection of motifs followed a probabilistic model without replacement, we did not consider motifs whose occurrences of corresponding mutants overlapped each other.
For all identified motifs and their target gene clusters, we applied a final pipeline of filtration steps described next aimed at reducing the false positives.
Motif identification and target genes clustering
Occurences of each exact k- mer were counted inside the 1000 bp promoter windows. The probability of occurrence of each k-mer was calculated under an independent and identically distributed (i.i.d.) model using individual nucleotide frequencies inferred from the genome composition. For each group of genes obtained by k- means clustering based on FAIRE-seq information, we selected potential k- mers for further analysis as follows. Given a k- mer t in the set M of all k- mers found within functional windows (see Motifs Scoring), first its one-mismatch neighborhood N(t, 1) was computed. A k-mer t was selected if (1) t occured in at least five distinct input sequences in a given cluster and (2) the expected number of distinct sequences in which any k- mer from N(t, 1) occured was smaller than the actual number of distinct sequences with k- mers from N(t, 1). Regarding the first condition, we chose to require five distinct input sequences because the smallest preliminary cluster was 33, and square root of 33 is approximately 5 [1]. The expected number of occurrences and the expected number of distinct sequences were computed on the positive set M of k- mers within the given sequences of a cluster of genes. Let i be a k- mer from N(t,1), then the expected number of occurrences of i in the positive set M is , where p
i
is the probability of one occurrence of i in the genome, and n is the total number of k- mers in the positive set M. Then the expected number of distinct sequences containing i is
and the expected number of distinct sequences for a motif t can be calculated as
where G is the number of sequences in the positive set M.
The first threshold on the minimum number of distinct sequences in which k- mers from N(t,1) must occur has the effect of filtering out motifs that have low probabilities of being over-represented in the positive set M. The second threshold filters out motifs that are abundant in the genome in general, and therefore an over-representation of these motifs in the positive set M is more likely random.
Each identified potential motifs was then analyzed in the context of their respective FAIRE-seq based preliminary cluster. For each k- mer t, we found the subset of mutants in N(t,1) that maximizes HES for t in the cluster. Then, we identified all genes that had exact occurrences of the mutants of t within the selected nucleosome-sized windows. These genes constituted a new expanded cluster for which the set of mutants for t was recomputed. This final set of mutants is a putative motif and the expanded cluster is its target gene cluster. Since the only three validated motifs for P. falciparum are of length 6 and 8 bases, we restricted our analysis to motif lengths 6 to 8 bases.
Additional filtration steps
Since we calculated the HES based on the number of mutant occurrences rather than over the number of distinct genes where each mutant occurs, additional filtration steps are necessary to ensure that a high HES is not due to multiple occurrences within a few genes of the cluster. We required the number of distinct genes where the motif occured to be close to the number of occurrences of the motif in the target genes cluster. More specifically, we filtered out motifs whose ratio between the number of distinct genes where the motif occured and the number of occurrences of the motif in the cluster was smaller than 0.8. Since the HES of a motif was calculated over sequences of about 150 bp in length, we wanted the number of exact occurrences of mutants inside a single sequence to be close to one. This threshold was used to distinguish between motifs and tandem repeats. Finally, we only considered motifs with HES in their target clusters greater than or equal to the average HES in the distribution of HES for all motifs of length 6 to 8 bases.
Phylogenetic conservation analysis
We used phylogenetic conservation information to filter out non-conserved motifs that are likely to be false positives. Similarly to [6], we calculated HES for the putative motifs in four orthologous species, namely P. berghei, P.chabaudi, P. vivax and P. yoelii. For each putative motif, we computed the HES for genes orthologous to the genes in the motif's target cluster within 1000 bp upstream of the start codon. Motifs with a HES of at least 2 (i.e., p-value of 0.01) in at least one of the four orthologous species were kept for further analysis. To avoid artificially high orthologous HES, we required the ratio between the number of distinct genes where the mutants occurred and the number of exact occurrences of motifs in the set of orthologous genes corresponding to the target cluster to be at least 0.5 (this threshold is lower than in the previous step due to the increased length of input sequences). This step ensured that HES was supported with enough orthologous genes and that all occurrences did not fall into the promoters of a few genes.
To confirm that the statistical significance of our motifs in their corresponding orthologous clusters are not due to a random chance, we conducted an additional randomized analysis. Particularly, we investigated the distribution of HES calculated for our motifs in randomly chosen orthologous clusters. For each motif we calculated its HES in 100 randomly chosen orthologous clusters of the same size that was used in our original analysis. Then we calculated the average and the standard deviation of HES over the random clusters and used these values to find a new threshold on the orthologous HES corresponding to a significance level of 0.01. A motif was kept for further analysis if it had orthologous HES of at least two and this HES was higher than the threshold calculated over the random orthologous clusters. This analysis ensures that orthologous HES is statistically significant for each motif and the given cluster size.
Finally, we adjusted the orthologous p-values of each final motif to correct for multiple testing. Our permutation test gives us the distribution of the orthologous HES when the enrichment of a motif in an orthologous cluster is solely due to a random chance. For each enriched motif, we computed its HES in 1000 randomly-selected orthologous clusters in the four orthologous species, where the size of a random orthologous cluster was the same as the size of the original orthologous target gene cluster. Since the statistical significance test requires that the orthologous HES must exceed the threshold in at least one out of four species, we computed the HES in all four species at each iteration, but used in the analysis only the maximum score. In order to be able to compare HES across different motifs (which might have different cluster sizes), we normalized all HES (both for real and random clusters) using their mean μ
HES
and the standard deviation σ
HES
, calculated over 1000 random clusters. The normalized HES is (HES -μ
HES
)/σ
HES
. The resulting distribution of 1000N normalized HES (where N is 1587, 2471, and 2745 for 6-, 7-, and 8-mers respectively), was used to correct for multiple testing, as follows. The normalized HES for both real and random clusters were aggregated, then sorted in decreasing order. For each final motif we calculated the p-value (corrected for multiple testing) as the proportion of the top values in this joint distribution that are greater than or equal to the real orthologous normalized HES of the motif [30]. Out of the final 129 motifs reported previously, a total of 113 have adjusted p-values corresponding to a false discovery rate of less than 5%. Additional File 1 contains the p-values for all 129 motifs before and after correction for multiple testing.
Motifs clustering
In order to account for sequence degeneration of motifs we added a motif clustering step to filter out duplicates and highly similar motifs. We selected unique motifs with the highest HES and clustered the remaining putative motifs by similarity using the Tanimoto distance [3, 6] and Pearson correlation coefficients of their position weight matrices (PWMs) [18].
The Tanimoto distance measures pairwise similarity of the motifs. Motifs with pairwise distances smaller than 0.5 were grouped, and the one with the highest HES was selected to represent the group. In other words, if two motifs shared at least half of their positions then they belonged to the same group. Since our method does not rely on co-regulated clusters, we assumed that for a given motif not all genes in the target genes cluster were co-regulated. That is why we allowed a lower threshold for the Tanimoto distance by comparison with the threshold of 0.8 used [3, 6]. Given the positional occurrences for two motifs represented in sets A and B respectively, we defined two occurrences (one from A and one from B) to be overlapping if they shared at least one base. The Tanimoto distance was calculated as follows
where the intersection of A and B includes all overlapping positional occurrences.
Then, motifs were clustered according to the Pearson coefficients of their PWMs. This step filtered out motifs that had very similar content, e.g., motifs that might be shifted versions of each other. We define two motifs shifted version if they shared at least half of a motif's length. We considered two motifs similar if the Pearson coefficient between their corresponding PWMs or of the PWMs built on their shifted versions was greater than or equal to 0.75 [7, 18]. To build PWMs for two motifs that are shifted versions of each other, we first built a PWM for the motif with the highest HES and then we constructed a PWM for the other motif that we shifted to align with the highest-HES-motif.
Gene orthology and gene functional sets
Orthologous gene maps between P. falciparum and P. berghei, P. falciparum and P. chabaudi, P. falciparum and P. vivax, and P. falciparum and P. yoelii were obtained from the OrthoMCL database http://www.orthomcl.org/]. A total of 1915 orthologous genes of P. berghei, 1247 of P. chabaudi, 4126 of P. vivax, and 2685 of P. yoelii were used in this study.
In order to study the motifs enrichment in functional gene sets, we used gene ontology (GO) functions from P. falciparum v6.0 together with ontology-based pattern identification (OPI) clusters from [http://carrier.gnf.org/publications/OPI/] to retrieve a total of 13,859 GO/gene pairs, with 1288 distinct GO names. As an alternative for GO annotation, we used 15 functional clusters of genes that were previously obtained based on mRNA profiles [21]. These 15 clusters were used as is or regrouped into four gene expression (GE) groups as follows. Gene expression group I (GEI) contains genes expressed in sporozoites and gametocytes (clusters 1-3), GEII contains genes corresponding to ring, schizont and trophozoite stages (clusters 4-7), GEIII contains genes expressed at the throphozoite stage (clusters 8-13) and group GEIV contains genes expressed at sporozoite, gametocyte and schizont stages and involved in red blood cell invasion (clusters 14-15).