MP3 has four components: reference promoter set (RPS) preparation from sequenced prokaryotic genomes (Fig. 1a), candidate binding region (CBR) detection by motif voting strategy and peak finding (Fig. 1b), candidate binding region clustering based on a graph model (Fig. 1c), and motif profile identification through curve fitting (Fig. 1d).
Preparation of reference promoter set (RPS) of a given gene in MP3
Collection of orthologous promoters: The traditional strategy for orthologous gene collection in phylogenetic footprinting relies on choosing several species in advance [15, 25, 31, 32]. This can limit the quantity and quality of available orthologous genes. MP3 collects the orthologous genes from a large set of references genomes, i.e. “big data source”. Specifically, (i) we used the recent orthology detection tool, GOST [33] to identify the orthologous genes of any given prokaryotic gene in the reference genomes. These genomes belong to the same phylum, but a different genus than that of the target gene, and we took only one genome into consideration for each genus to avoid redundancy. We (ii) then extended the orthologous relationship from gene to operon level. Thus, for a given gene, its host operon is denoted as o
0 = {g
1, g
2,…, g
r
}(r ≥ 1) and the operons in the reference genomes that contain orthologous genes of any g
i
in o
0 (i = 1, …, r) are considered as orthologous operons of o
0, denoted as {o
1, o
2, …, o
n
}. Their promoter sequences are defined as corresponding upstream regulatory regions (up to 300 bp), denoted as p
0 and {p
1, p
2, …, p
n
}, respectively. Then iii), we define the promoter set P = {p
1, p
2, …, p
n
} as the orthologous promoters of p
0.
Reference Promoter Set (RPS): The preliminary orthologous promoter set obtained above could not be directly used to predict motifs, as the large data set size and unconsidered phylogenetic relationships can overpower the conserved motif signal. MP3 polished the preliminary promoter set to generate a reference promoter set (RPS), which was of reasonable size and with conserved significant motifs, i.e. “reduced final set”. Our selection strategy was partly inspired by McCue et al., who claimed that three well-selected reference promoters might be sufficient to identify a motif on a given human gene [15]. We improved this model for application in prokaryotes by selecting three groups of orthologous sequences instead of just three sequences. In addition, rather than using existing phylogenetic tree based on species, phylogenetic trees were assembled for each group of orthologous promoters. Before selection, the phylogenetic tree of orthologous promoter sequences was built by ClustalW [18], and the distance scores of this tree were used to represent the distance between any pair of orthologous promoter sequences. MP3 then divided P into three groups, P
1, P
2, and P
3, corresponding to highly similar to, relatively similar to, and distant from p
0, according to the thresholds obtained by analyzing the distribution of distance scores between orthologous promoters (Additional file 1: Method S1 and Figure S2). MP3 first selected three reference promoters from each group, and then added three more from P
3, because P
3 has many more orthologous promoters. In this selection, we considered the additional following factors: (i) The promoters whose operons had the same leading orthologous genes with O
0 had higher priority to be chosen. (ii) The promoters were re-ranked based on a genomic similarity score (GSS) [33], which was calculated as the fraction of genes in the target genome, which have orthologous genes in the reference genome. We selected promoters with higher GSS based on the assumption that the genome with higher GSS tends to have regulatory mechanism more similar to that of the target genome [15]. (iii) Any two selected promoters were required to have a mutual distance score greater than 0.05 to avoid redundant promoters. Finally, the selected reference promoters, along with p
0 itself, composed a reference promoter set (RPS), which was expected to contain key motif signals and have a reasonable size with the consideration of computational efficiency. More details about RPS generation are provided Additional file 1: Method S1.
Pruning promoter to identify Candidate Binding Region (CBR)
For a given gene, the RPS can be used to prune its corresponding promoter p
0 and identify rough TF binding regions through a voting strategy by integrating multiple motif finding tools (Fig. 1b). Six widely used de novo motif finding tools, Biprospector, BOBRO, MDscan, MEME, CUBIC, and CONSENSUS [4, 5, 8–11], were applied to the RPS to identify conserved motifs with lengths ranging from 5 to 30, and for each length, we kept the top ten predicted motifs (if available). The predictions for a specific program can be denoted as
$$ S={\displaystyle \underset{l=5}{\overset{30}{\cup }}}{\displaystyle \underset{t=1}{\overset{10}{\cup }}}\kern0.5em {S}_{lt} $$
(1)
where S
lt
represents the t-th motif in the prediction with length l. If S
lt
contains an instance from p
0, denoted as s, its contribution will be added to the voting score C
i
(set to 0 initially) using the following formula (Fig. 1b),
$$ {C}_i\kern0.5em =\kern0.5em {C}_i\kern0.5em +\kern0.5em {V}_s,\kern0.5em \mathrm{f}\mathrm{o}\mathrm{r}\kern0.5em i\kern0.5em \in \kern0.5em \left\{i\Big|{b}_s\le i\le {e}_s\right\}; $$
(2)
where b
s
and e
s
represent the starting and ending positions of s along p
0, and
$$ {V}_s\kern0.5em =\kern0.5em \frac{1}{\left|{S}_{l\bullet}\right|\left(1\kern0.5em +\kern0.5em \log t\right)},\kern0.5em {S}_{l\bullet}\kern0.5em =\kern0.5em \underset{t=1}{\overset{10}{\cup }}{S}_{lt} $$
(3)
where t is the rank of motif profile, which motif instance s belongs to, in prediction results for input length l. Intuitively, such voting scores are reliable and informative as different tools do have complementary effects [6, 14] while the false positive noise tend to randomly distribute in p
0. The voting scores generally represent the support obtained from multiple predictions. The larger a score, the higher probability that the site overlaps true TFBSs. Additionally, we normalized the contribution of different predictions by introducing S
l
., instead of directly counting the number of predicted segment covering each site, since the output size of motif finding tools may be very different.
Application of a pick calling strategy to the voting scores allows a set of CBRs to be identified, each of which is recognized as a continuous genomic segment of p
0, containing nucleotides with significant higher voting scores than the surrounding sequence. Additional details can be found in Additional file 1: Method S2. The CBRs, as primary output of MP3, can be used by researchers directly in genetic engineering to locate the functional regulatory regions of a promoter.
Clustering of correlated CBR set
The CBR sets identified in the target and reference promoters are used to build motif profiles (Fig. 1c). A similarity graph G with all CBRs represented as vertices and edges connecting every pair of vertices was constructed. The weight of edges are set as the correlation scores between two corresponding CBRs as follows: (i) p
0 and p
1 are the target promoter and a reference promoter, respectively; (ii) a CBR c
0 in p
0 begins at b
0 and ends at e
0 (−|p
0| ≤ b
0 < e
0 ≤ −1) and another CBR c
1 begins at b
1 and ends at e
1 in p
1 (the start of coding regions as the origin position 0). (iii) the correlation score W(c
0, c
j
) between the two CBRs was evaluated:
$$ W\left({c}_0,{c}_1\right)=\left(1-\frac{\left|{b}_0-{b}_1\right|}{ \max \left\{\left|{b}_0\right|,\left|{b}_1\right|\right\}}\right)\times S\left({c}_0,{c}_1\right) $$
(4)
where S(c
0, c
1) was the sequence similarity score, calculated by aligning c
0 and c
1. The weight of the edge that connects CBRs of the same promoter will be set as 0. Clearly, the higher a weight, the more correlated the two corresponding CBRs were. The relative location of CBR pairs S(c
0, c
1) was also considered as the position of many TFBSs tend to be conserved in evolution [34].
Intuitively, a set of highly correlated CBRs should be connected by large weights producing a subgraph of G, i.e. subgraph with large edge weight, because these correlations should make the weight of each involved edge larger. It should also be noted that identifying all heavy subgraphs in a weighted graph itself was NP-hard. Hence, we identified the CBR clusters in a heuristic way: (i) we sorted the edges in G in decreasing order of their weights and only keep the top 1/3. One third was absolutely enough because the graph with only real connections should be sparse. However, the random cliques have little chance to survive because graph G is a multi-partite graph; (ii) we obtained the induced sub-graph of a CBR in target promoter and its neighbors in other promoters; and (iii) we detected the maximal clique in induced sub-graph and then expanded it by including the highly connected vertex. The CBRs corresponding to the vertex in each cluster composed the correlated CBR set in which the motif profile identification will be carried out.
Identification of candidate motif profiles
Building Motif profiles from correlated CBR set. We applied our motif finding tool, BOBRO [5] on the identified CBR sets to generate candidate motif profiles. Outstanding motif instances were identified using the support from several motif finding tools (Fig. 1d).
It was still very challenging to evaluate motif profiles with different widths. Although BOBRO and MEME are capable of detecting motif width on co-regulated promoters, they may fail on phylogenetic footprinting data, because the flanking regions of motifs in orthologous promoters are usually conserved to some extent. In MP3, a curve fitting method was designed to detect the motif profiles with an optimized width for phylogenetic footprinting. The BOBRO predicted motif profiles have a width from 6 to 22 and corresponding IC (information content) scores, which are calculated by the formula:
$$ IC(w)={\displaystyle \sum_{j=1}^w}{\displaystyle \sum_{i=1}^4}{f}_{ij}lo\mathit{\mathsf{g}}\frac{f_{ij}}{b_i} $$
(5)
where (f
ij
) is the probability of nucleotide type i appearing at position j in the motif profile, and b
i
is the probability of i appearing in the background sequence which is calculated on all input promoter sequences. However, IC cannot be directly used to compare different motif profiles, because they are width-dependent. MP3 regresses the correlation function between the IC and the width of motif profile by minimizing
$$ \underset{w=\kern0.5em 6}{\overset{22}{\varSigma }}{\left[IC(w)\kern0.5em -\kern0.5em f(w)\right]}^2 $$
(6)
on the conjectured function:
$$ f(w)\kern0.5em =\kern0.5em a\kern0.5em \cdot \kern0.5em {e}^{\beta w}\kern0.5em +\kern0.5em \gamma $$
(7)
where α, β and γ are fitting coefficients. Then, we took the difference between the real IC scores and fitting scores for each profile, i.e. the residual of above regression,
$$ r(w)\kern0.5em =\kern0.5em IC(w)\kern0.5em \mathit{\hbox{-}}\kern0.5em f(w) $$
(8)
as the criterion to select the best motif profile. Basically, the motif profiles whose r(w) are local maximum are ranked in the decreasing order of r(w).
MP3 application and performance evaluation using E. coli genome
Data Acquisition. We used E. coli K12 as the target genome and another 216 selected prokaryotic genomes from the Proteo-bacteria phylum as references to test MP3 methods and the applications. The genome data were downloaded from the NCBI database (released as of November 2011). The 216 reference genomes were obtained from 216 different genera (a general principal for orthologous data for MP3) to avoid potential selection bias in comparative genomics studies [33]. The operons of these genomes were retrieved from the DOOR2.0 operon database [27, 35], and the documented motifs in E. coli were obtained from RegulonDB [28]. We linked the documented TFBSs in E. coli to their target operons and then to corresponding promoters in the identified 2,252 RPSs. Figure 2d showed that 583 of the 2,379 operons have experimentally confirmed TFBSs (solid bars in black) in their regulatory regions. Twenty of these 583 operons and their corresponding TFBSs were removed since they did not have enough orthology. The remaining 563 promoter sequences, containing 2,048 binding sites, were used to evaluate the performance of MP3. Besides, we downloaded Sigma 70 binding promoters of E. coli from the RegulonDB and conducted analysis to see the correlation between orthology and Sigma 70 binding in E. coli.
Performance evaluation. To conduct performance comparison, we applied six de novo motif finding tools previously mentioned, i.e., Biprospector, CONSENSUS, MDscan, MEME, CUBIC, BOBRO and a phylogenetic footprinting pipeline MicroFootprinter [4–13, 21, 25, 30, 36] on the same genome and compared with MP3. We followed Tompa’s method [14] and assessed the predictions both at nucleotide level and at the binding site level. Specifically, we calculated the sensitivity (nSN), positive prediction value (nPPV), specificity (nSP), performance coefficient (nPC) and correlation coefficient (nCC) at nucleotide level, and calculated the sensitivity (sSN), positive prediction value (sPPV), and average site performance (sASP) at site level. In addition, we added the widely used F-score (sFS) at site level for better evaluation. The calculation details for these measures can be seen in Additional file 1: Method S3. We followed Tompa’s criterion to indicate that a predicted site overlaps a known TFBS if they overlapped by at least 1/4 the length of known site [14].
Functional enrichment analysis according to the KEGG database
For a set of operons in E. coli, we did functional enrichment analysis of the corresponding genes with DAVID [37]. Specifically, given a set of operons, their genes were picked from the DOOR2 database [27] and submitted to DAVID as the input gene list with this genome as background genome. The p-values were calculated in terms of a Bonferroni-corrected modified Fisher's exact test under the null hypothesis that this set of genes was not enriched with certain biological functions.