Binding predictions
We tested the PSSMs mentioned above and used ChIP-seq derived sequences binding to TFs as positive controls. Results were assessed with ROC analysis (see Methods for details).
The various binding models from the multiple sources yielded different results in many cases: While only 3 of the 19 PBM-derived models from hPDI (16 %) reached an AUC score of ≥0.7, this was the case for 46 % of the HT-SELEX and 60 % of the JASPAR-models. The average AUC values generated by the different binding model sources were highly divergent, ranging from 0.53 (hPDI) to 0.72 (JASPAR).
When tested on the ‘high-confidence’ dataset, 4 out of 15 models (26.6 %) from hPDI yielded AUC scores of ≥0.7, which resulted in an average AUC score of 0.57. The HT-SELEX-models reached AUC values ≥0.7 in 70 % of the cases (average AUC of all PSSMs: 0.76), whereas the JASPAR-matrices generated an average AUC score of 0.83. An overview of these findings is depicted in Fig. 1.
We were able to directly compare the prediction quality of models for 28 different TFs that were represented in both the JASPAR and the HT-SELEX data (26 for the high confidence test set). We found that in general, the manually curated matrices from JASPAR enabled a slightly better distinction between positive and negative sequences (average AUC 0.74 versus 0.70). In the ‘high-confidence’ data set, these values increased for both data model sources: The HT-SELEX models generated an average AUC of 0.80, whereas for JASPAR this value was 0.84 (Figs. 1a and 2).
These findings can help to estimate the utility of TFBS models from different sources, enabling researchers to better circumvent the high false-positive rate, a problem which is well known for in silico TF binding analyses [16], and thus to make the most of the available data. In general, we found manually curated models provided by JASPAR and HT-SELEX-derived matrices to be much more successful in recognising in vivo TFBSs than PBM derived models.
Previously, the success of TF binding prediction has been found to be similar for the majority of matrices, with degenerate matrices yielding the most accurate predictions [17]. Hence researchers currently often use TF binding models from various sources indiscriminately. The notion that TF binding models obtained through different experimental methods can be variable has just recently entered the scientific discourse in a novel publication [13]. There, the authors compared PBM- and HT-SELEX-derived models and found both models to be reliable with a slight advantage of HT-SELEX models in the prediction of in vivo TFBSs. Our study on a larger test set (up to several thousand binding events per TF with direct comparison of models from different sources for a subset of the tested TFs) showed JASPAR to be the most reliable source for TFBS matrices in this data set.
To check if the higher accuracy of JASPAR might be the result of an overfitting of the JASPAR models to the ENCODE data, we manually checked for an overlap between the tested matrices and the ENCODE data set. For each PSSM, we studied the accompanying publication and found that this was not the case for any of the binding models included in our study. Hence we can safely assume that the current JASPAR models are indeed the most successful ones to recognise in vivo TFBSs.
In this study, we decided to use exonic sequences as a negative test set. Because the 5’ regions of genes occasionally contain functional binding sites [18, 19], we decided to categorically exclude the first exon of each transcript from our analyses. Thus, the percentage of functional TFBSs in the negative test set should be sufficiently decreased in order to allow computational predictions. We preferred this approach to alternative analyses, such as the usage of artificially generated random DNA sequences without TFBSs, as they might be very different from naturally occurring sequences and are also likely to increase error rates [20]. Furthermore, the usage of random exonic sequences as a negative set closely mimics the real-life research situation, therefore allowing an important insight into the limitations of current in silico TFBS analysis methods.
We found that the usage of only high-confidence ENCODE data leads to an increased reliability in the computerised recognition of TFBSs. However, this filtering step also introduces a potential drawback: Strong TF binding does not necessarily implicate strong TF function, as has been shown for D. melanogaster [21], where low-affinity binding sites can be of great importance for normal gene activation. Moreover, a researcher looking for a functional binding site in silico does not necessarily have any previous knowledge on the binding strength. Therefore, the decision to admit only high-confidence data into analyses – which is common practice in TF binding research [13, 17, 22]—in fact excludes a large amount of relevant information and should therefore be taken with care.
Conservation analyses
In a real-life research situation, TFBS predictions are usually supplemented by additional analyses such as the determination of sequence conservation. We therefore further tested whether the ENCODE TFBSs (positive set) show a higher degree of conservation than the genetic background. In detail, we measured phyloP [23] and phastCons [24] scores for each position of the ENCODE TFBSs and of negative test cases (length-matched random sequences from chromosomes 1 and 2). For each TF, we then calculated the average scores of the highest conservation as well as of the average conservation score reported for each sequence. As we did not know where exactly in the ENCODE sequence the actual TFBS was located, we chose to focus on the highest conservation scores, assuming that functional TFBSs would show a high degree of conservation. However, we found that, while there was a higher degree of conservation for the ENCODE sequences, the variability in the tested sequences was very high. Moreover, we did not find a link between strong TF binding and sequence conservation because the mean maximum phyloP and phastCons scores for the high-confidence test set were not significantly higher than for all ENCODE sequences. We hence conclude that, while this may sometimes be the case, a functional binding site does not necessarily have to be highly conserved. As a representative example, Fig. 3 shows example images of conservation readings for ZBTB33 and BCL11A.
Our findings indicate that while there seems to be a certain degree of increased conservation for ChIP-seq verified TFBSs in comparison to random genetic sequences (Fig. 3), the variability between individual sites is too high to draw any reliable conclusions. These results support previous findings, which suggest that functional TFBSs can, but do not have to be evolutionarily conserved with regard to their primary nucleotide sequence [25–27]. More precisely, some researchers argue that TFBSs may fluctuate quite rapidly between species. For instance, it has been found in a comparative genomics study that the likelihood of a TFBS to be conserved between S. cerevisiae and its two closest relatives (S. paradoxus and S. mikatae) lies below 5 % for a majority of the sites [26]. In addition, rapid evolution of combinatorial transcription networks was found for the transcriptional regulator MCM1 in closely related fungi species [28]. Moreover, a simulation of the evolution of TFBSs by introducing local point mutations showed TFBSs and combinations of binding sites to evolve quickly [29]. Considering the fact that TFBSs are known to be relatively variable, even within the same species, it would be surprising if all functional binding sites were conserved across larger evolutionary distances [27, 30, 31]. It has also been shown in comparative studies between various species that the functionality of TFBSs can still be preserved despite underlying sequence changes [32, 33]. The three-dimensional structure of the DNA at a TFBS, including the histones, is likely to play an important role for TF binding, a notion which is slowly being acknowledged for the analysis of TFBSs [34]. As such three-dimensional structures can be well conserved while the underlying linear sequence may evolve rapidly, sequence conservation scores might not be of major importance for the assessment of TFBSs. In our study, we found that the degree of variation between individual TFBSs was too high to infer TFBSs conservation in general. Hence, measurements such as genomic location or three-dimensional conservation might be better suited for the assessment of the validity of a TFBS.
Regulatory SNPs (rSNPs)
We compiled a set of 52 rSNPs known to directly impact TF binding or gene expression (included as Additional file 1). rSNPs are single-nucleotide polymorphisms occurring in regulatory regions such as promoters or enhancers. They are likely to be of functional significance as they do affect transcription initiation, elongation, and translational characteristics of mRNA but are more difficult to study than alterations in coding regions and are hence under-represented in the literature [35–37]. We only included alterations in this test set which have been shown to directly interfere with a TFBS’s function in in vitro or in vivo experiments. In this test set, we found differences between wild-type and mutant variants with respect to the evaluated maximum binding scores in 76 % of all observations, but they rarely exceeded 20 %. For a subset of the rSNPs, which linked to 15 different TFs (represented by 58 PSSMs from JASPAR, hPDI, and HT-SELEX), we were able to directly compare the computational predictions to the experimental findings. If solely considering maximum binding score changes, we found agreement between computational and experimental results in less than 50 % (48.1 %) of the tested matrices. However, when analysing the data with ePOSSUM (see below), we obtained a correct prediction in 62 % of the cases. Interestingly, the different binding models from the various sources again yielded quite divergent results, thus confirming strong variability in the quality of the tested matrices (see Additional file 1).
We are aware that the rSNP test set used in this study was not large enough to provide comprehensive insight. This is due to the fact that experimental data on rSNPs in TFBSs are not readily available. Hence, we decided to make the best of the available data and to provide an initial glimpse into the reliability of the state-of-the-art methods for TFBS prediction.
ePOSSUM
Our study demonstrates that many of the existing PSSMs offer only limited reliability, while others do indeed allow reasonable predictions. In order to make our findings readily available to the scientific community, we created software called ePOSSUM that predicts the effect of genetic variation on TF binding using the matrices included in our study. It also includes a measure of significance of the assessments to indicate how reliable a prediction by a certain PSSM would be.
ePOSSUM is a web-based application that can be freely accessed at http://mutationtaster.charite.de/ePOSSUM/. Users can either enter genetic variants in the common VCF format or wild-type sequences along with the variant sequences. In the first case, the position of the variant is known and ePOSSUM scans the ENCODE ChIP-seq data used in our study for known TFBSs around this position. The existence of TFBSs at the position under study is included in the output. In total, ePOSSUM offers predictions for 81 TFs which are based on 171 different PSSMs. To speed up analysis, users can choose which TFs to include into the search for binding sites. If they submit a genomic position, they can limit the search to those TFs that are known to bind at this position.
Unlike similar applications such as motifbreakR [38], ePOSSUM does not merely subject sequences to different PSSMs, but uses their binding scores as input for a Bayes classifier. More precisely, this classifier uses the quotient of the maximum binding scores calculated for wild-type and variant sequence to determine whether the genetic alteration leads to the (predicted) gain or loss of a TFBS. As most of these quotients do not show a normal distribution, we use the R module klaR [39] for the classifier, which considers the kernel-density, i.e. the real distribution of values, to generate predictions.
Due to the poor performance of many PSSMs in discriminating real TFBSs from exonic sequences as indicated by similar maximum binding scores for both, the Bayes probability for both outcomes (TF binding versus no TF binding) is often far from 100 %. We therefore include the negative or positive predictive value (NPV/PPV) for each matrix for the given Bayes probability as a measure of the reliability of the classifier’s prediction. This value is also used to colour-code the likelihood of increased/new or decreased/lost TF binding. Darker colours represent more likely changes in TF binding. ePOSSUM uses all available PSSMs for a TF to generate a final assessment—here, only PPVs/NPVs of 70 % or above are considered. The assessment also includes whether or not a binding site for the TF was reported by ENCODE at the respective genomic location.
Different Bayes classifiers were trained for each PSSM with the quotients of all possible pairs of positive versus negative cases for the respective TF using all available sequences from our ENCODE high-confidence set as positive and exonic sequences as negative cases. The number of pairs was deliberately limited to 1,000,000 per matrix, even if more pairs were possible.
ePOSSUM is limited to human TFs and, if genomic positions were used, to genome version GRCh37. A weakness lies in the recognition of known TFBSs—as each TFBS positive ChIP-seq snippet from ENCODE comprises several hundred bases, we do not know whether or not the “real” TFBS is truly located within the fragment screened in the analysis.