Ensemble approach combining multiple methods improves human transcription start site prediction
© Dineen et al; licensee BioMed Central Ltd. 2010
Received: 21 April 2010
Accepted: 30 November 2010
Published: 30 November 2010
The computational prediction of transcription start sites is an important unsolved problem. Some recent progress has been made, but many promoters, particularly those not associated with CpG islands, are still difficult to locate using current methods. These methods use different features and training sets, along with a variety of machine learning techniques, and consequently produce different prediction sets.
We demonstrate the heterogeneity of current prediction sets, and take advantage of this heterogeneity to construct a two-level classifier ('Profisi Ensemble') using predictions from 7 programs, along with 2 other data sources. Support vector machines using 'full' and 'reduced' data sets are combined in an either/or approach. We achieve a 14% increase in performance over the current state-of-the-art, as benchmarked by a third-party tool.
Supervised learning methods are a useful way to combine predictions from diverse sources.
The field of in-silico promoter prediction has developed greatly in recent years. Machine learning techniques, such as support vector machines and self-organising maps, and new features, especially those associated with structural properties of the DNA molecule, have led to progressive improvements in accuracy. The realization that the majority of the genome is transcribed [1–3], and that most promoters have diffuse clusters of multiple transcription start sites (TSS), has led to a move away from discrete predictions, and towards scores for all base pairs of the genome. There is greater consensus on the correct way to evaluate predictions, reducing the biases inherent in the plethora of methods previously used.
One obvious way of improving performance is to combine several existing methods using an ensemble learning approach. Ensembles combining results from multiple programs have seen some use in promoter prediction [11, 12]. They have also been successfully used in several other computational biology problem areas [13–15]. High diversity in individual methods is considered predictive of good ensemble accuracy. It can be difficult, however, to improve on the performance of the best individual method. In this paper, our aim was to explore whether the set of prediction methods was indeed diverse, and to improve predictive performance across the genome and at all thresholds.
Description of features used.
DNA melting temperature, as calculated with Fixman & Freire's method
Custom SVM kernel using both sequence and structural information
HMM gene predictor - start of 5' UTR defines TSS
Decision tree using k-mers, GC and CpG content
RVM using mixture of Gaussian distributions of position weight matrices
Self-organising map trained on base stacking energy
Base stacking energy
Experimentally determined CpG methylation profiles
17-way vertebrate conservation scores
As not all of our features are prediction scores, we did not use an averaging or voting-based ensemble method. Instead we used a support vector machine for aggregation. This also gave us the opportunity to use a non-linear kernel to increase separability of promoter and non-promoter classes.
MetaProm and EnsemPro are both programs that use ensemble methods for promoter prediction. Although we were unable to obtain predictions for these programs, we could evaluate Profisi Ensemble using the evaluation rules described in the original papers in an attempt to make some comparison with them. MetaProm is based on an artificial neural network, and makes predictions in an area covering around 30% of the human genome, using a combination of dbTSS and RefSeq as the reference set. Multiple methods are discussed in the EnsemPro paper, but the most successful is weighted majority voting. It restricts its predictions to an area 1,150 base pairs either side of 400 TSS drawn from the Eukaryotic Promoter Database (EPD). The EPD is known to be strongly biased towards TATA box-containing promoters, which only comprise a small fraction of human promoters as a whole.
Results and Discussion
Overlap between whole genome predictions as measured by ∩/∪ (1000 bp tolerance).
Overlap between whole genome predictions as measured by ∩/∪ (1000 bp tolerance), considering only true positive predictions.
Overlap between whole genome predictions as measured by ∩/∪ (1000 bp tolerance), considering only false positive predictions.
Information gain-based ranking of features based on analysis of the training set with Weka 3.6 using default parameters.
ARTS (+ strand)
ARTS (- strand)
Methylation (differentiated cells)
Methylation (stem cells)
Whole genome predictions were evaluated using pppBenchmark 1.3. pppBenchmark evaluates predictions versus cap analysis of gene expression (CAGE) and RefSeq annotations, using both binning and distance-based protocols, for an accurate overall view of predictive power. The best performer in the original benchmarking was ARTS.
We also performed comparisons with MetaProm and EnsemPro. As we did not have access to predictions from these programs, we evaluated Profisi Ensemble using their evaluation rules.
EnsemPro uses the EPD as its reference set. As mentioned above, the EPD is not considered a representative set of human promoters. Only an area 1.5 kbp in size around the TSS was examined. Predictions within 200 bp (upstream) or 100 bp (downstream) were counted as true positives. The results of the evaluation are shown in Figure 7b. Profisi Ensemble shows roughly equivalent performance to EnsemPro in this evaluation, although results may not be exact due to variations in the dataset (see Methods).
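The true-positive rule used in this evaluation can be written down in a few lines. The following is a minimal sketch, not the evaluation code itself: the function name and the strand handling are assumptions made for illustration, since the text only states the 200 bp upstream and 100 bp downstream limits.

```python
def is_true_positive(prediction, tss, strand="+", upstream=200, downstream=100):
    """Return True if a point prediction falls inside the accepted window.

    Assumption: 'upstream' means lower coordinates on the + strand and
    higher coordinates on the - strand.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Signed offset relative to the TSS, positive in the direction of transcription.
    offset = prediction - tss if strand == "+" else tss - prediction
    return -upstream <= offset <= downstream
```

For a TSS at position 1000 on the + strand, a prediction at 900 would be accepted and one at 790 rejected under this reading.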
Profisi Ensemble uses a two-layer approach to prediction. Two SVMs are trained using scores from existing prediction programs as features. The predictions from these SVMs are then combined in an either/or manner.
In this work, we have demonstrated the substantial heterogeneity of promoter predictions from current methods. We showed that this heterogeneity enables performance improvements via an ensemble approach. Finally, we have shown that high-sensitivity and high-specificity classifiers may be combined using a "punting" approach to guarantee higher performance across a range of thresholds.
In many fields, diverse predictors for the same task exist, often of broadly similar performance. If these predictors are sufficiently heterogeneous, there is merit in exploring an ensemble-based approach. If high specificity/precision is required, consideration should be given to using feature ranking to ensure that only useful features are included.
The same technique we have used for human predictions could be extended to any other genome, as long as sufficiently diverse predictions are available for it. Detailed instructions on applying our method to other organisms are included in Additional file 2. Many prediction programs are able to output predictions for multiple genomes. EP3, for example, contains models for ten model organisms. As we have used a supervised approach, however, a high quality training set (preferably based on experimental data, like the dbTSS) is essential.
5 bp resolution probability scores for genome builds hg17 and hg18 are available from http://mlg.ucd.ie/profisiensemble. 1 bp resolution scores are available on request. Source code is available in Additional file 3.
To assess the overlap between predictions from different programs, whole genome predictions were downloaded from the UCSC Genome Browser and from the websites associated with the programs. Where multiple predictions existed around a single locus (2000 base pairs), only the prediction with the highest score was kept. Programs giving discrete predictions (N-SCAN, FirstEF, and Eponine) had roughly 20,000 predictions each. The remaining programs gave continuous scores for the whole genome. These scores were thresholded to also leave ~20,000 predictions per program. Overlap between sets was measured by dividing set intersection by set union for each pair of programs. Overlap was measured for (a) all predictions, (b) true positive predictions only, and (c) false positive predictions only. Predictions within 1,000 bp of the 5' end of a RefSeq first exon were counted as true positives.
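The ∩/∪ overlap measure with a distance tolerance can be illustrated as follows. This is a minimal sketch under stated assumptions: the text does not say how predictions are paired when a tolerance is allowed, so the one-to-one greedy matching used here, the function name, and the restriction to positions on a single chromosome are all illustrative choices.

```python
def overlap_ratio(positions_a, positions_b, tolerance=1000):
    """Intersection over union for two sets of point predictions.

    Two predictions count as shared when they lie within `tolerance` bp
    of each other; each prediction is matched at most once (assumption).
    """
    a = sorted(positions_a)
    b = sorted(positions_b)
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matched += 1          # shared prediction, consume both
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # a[i] has no partner in b
        else:
            j += 1                # b[j] has no partner in a
    union = len(a) + len(b) - matched
    return matched / union if union else 0.0
```

Running the same function on only the true positive or only the false positive subsets of each program gives the two restricted overlap measures.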
N-SCAN, FirstEF, and Eponine predictions were downloaded from the UCSC Genome Browser. These point predictions were converted to continuous scores using a 1000 base pair window, with the central 200 base pairs getting the full score, linearly falling to 0 at the edges, giving a trapezoid-type distribution. These parameters were determined using small-scale tests on the ENCODE regions. The remaining features had scores for all base pairs. ProSOM and EP3 predictions were obtained using the Java executables available online. ARTS predictions were downloaded from the ARTS website. Profisi melting temperatures were downloaded from the human genome melting map. Methylation scores were obtained from a whole-genome methylation map of 15 cell lines (Island methylation scores from Supplementary Table 1b). Cell lines were split into pluripotent and differentiated categories, and averaged. Scores for the two sperm cell lines were ignored due to the large differences in DNA packing and methylation in these lines. PhastCons 17-way vertebrate conservation scores were downloaded from the UCSC Genome Browser.
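The trapezoid conversion of a point prediction into per-base scores can be sketched as below. The function name and parameter names are hypothetical, and the sketch covers a single prediction only; how overlapping windows from neighbouring predictions are merged is not specified in the text and is left out.

```python
def trapezoid_score(offset, peak, window=1000, plateau=200):
    """Score at `offset` bp from a point prediction with score `peak`.

    The central `plateau` bp receive the full score, which then falls
    linearly to 0 at the edges of the `window`.
    """
    half_window = window / 2
    half_plateau = plateau / 2
    distance = abs(offset)
    if distance <= half_plateau:
        return peak
    if distance >= half_window:
        return 0.0
    # Linear ramp between the plateau edge and the window edge.
    return peak * (half_window - distance) / (half_window - half_plateau)
```

With the default parameters, a base 300 bp from the prediction lies halfway down the ramp and receives half of the original score.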
Training examples were drawn from the 44 ENCODE regions, which together comprise about 1% of the human genome. Positive examples were taken from the dbTSS, an experimentally verified database which is already used as the training set for  and . There were 519 TSS from the database in the ENCODE regions. Five times as many negative examples were selected, to account for the greater variety of negative examples (intergenic, exons, introns, non-promoter regulation such as enhancers, insulators, etc.). These negative examples were all at least 1000 base pairs from the nearest TSS.
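The selection of negative examples can be illustrated with a short sketch. The text gives only the 5:1 ratio and the 1000 bp minimum distance, so uniform random sampling within a region, the function name, and all parameter names here are assumptions.

```python
import bisect
import random


def sample_negatives(tss_positions, region_start, region_end,
                     ratio=5, min_distance=1000, seed=0, max_attempts=100000):
    """Draw `ratio` negatives per TSS, each at least `min_distance` bp from any TSS.

    Assumption: negatives are drawn uniformly at random by rejection sampling;
    duplicates are possible and are not filtered in this sketch.
    """
    tss = sorted(tss_positions)
    rng = random.Random(seed)
    wanted = ratio * len(tss)
    negatives = []
    attempts = 0
    while len(negatives) < wanted and attempts < max_attempts:
        attempts += 1
        pos = rng.randrange(region_start, region_end)
        # The nearest TSS is one of the two neighbours in sorted order.
        i = bisect.bisect_left(tss, pos)
        neighbours = [tss[k] for k in (i - 1, i) if 0 <= k < len(tss)]
        if all(abs(pos - t) >= min_distance for t in neighbours):
            negatives.append(pos)
    return negatives
```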
Principal components analysis was performed in Weka 3.6 using the default parameters, giving five principal components. Information gain-based feature selection was also performed in Weka using the default parameters.
LibSVM 2.9 was used to train the models and generate predictions, due to its speed, stability, and availability for multiple platforms. It is not multithreaded, but was easily parallelizable as each chromosome was a separate test file. The default kernel, the radial basis function (RBF), was used. Weights were used to compensate for the uneven class sizes. Features were normalized in the range 0-1 to maximize sparsity. LibSVM was set to output a probability rather than a margin score. The error penalty (C) and the tightness parameter (γ) were chosen using the supplied grid.py.
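Two of the preprocessing steps, scaling features to the 0-1 range and weighting the classes, can be sketched in plain Python. This is illustrative only: the helper names are hypothetical, and the inverse-frequency weighting shown is an assumption, as the text does not give the weights that were actually used.

```python
from collections import Counter


def min_max_scale(values):
    """Rescale one feature column to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def class_weights(labels):
    """Weight each class by (largest class size / its own size).

    Assumption: with five negatives per positive this gives the positive
    class a weight of 5 and the negative class a weight of 1.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}
```

The scaling constants would need to be computed on the training set and reused unchanged on each test chromosome so that the features remain comparable.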
Figure 3 shows performance on the mapped portions of human chromosome 22 (~1% of the genome). The reduced set outperforms the full set above a certain crossover point. The probability from the reduced set at this crossover was 0.94. To ensure that this was not due to the SVM parameters resulting in optimization of different areas of the curve, a wide range of values of C and γ were tried. In all cases, the area under the curve was reduced, but the shape of the curve stayed the same. Based on this, we decided to combine predictions from both with an either/or approach. Reduced model predictions below 0.94 were discarded and replaced with predictions from the full set, which were scaled so that the highest value remaining was 0.94.
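The either/or combination can be expressed compactly. The sketch below follows one reading of the description: reduced-model scores at or above the crossover are kept, and every other position takes the full-model score, rescaled so that the largest substituted value equals the crossover. The function name is hypothetical, and the choice to compute the scaling factor over the substituted positions only is an assumption.

```python
def combine_either_or(reduced, full, crossover=0.94):
    """Combine per-position probabilities from the reduced and full models."""
    if len(reduced) != len(full):
        raise ValueError("score lists must have the same length")
    # Full-model scores at positions where the reduced model is below the crossover.
    substituted = [f for r, f in zip(reduced, full) if r < crossover]
    top = max(substituted, default=0.0)
    scale = crossover / top if top > 0 else 0.0
    return [r if r >= crossover else f * scale
            for r, f in zip(reduced, full)]
```

Under this reading, all confident reduced-model predictions rank above every full-model prediction, while the relative order within the full-model predictions is preserved.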
As the evaluations for both MetaProm and EnsemPro are based on the older system of point predictions, the continuous scores from Profisi Ensemble had to also be converted to point predictions. We did this using a combination of thresholding and clustering. Thresholding meant throwing away all predictions below a certain level. Clustering meant finding the location with the highest score, and discarding all locations within n base pairs of it, then finding the location with the next highest score and doing the same, etc., until the last location was reached. For both the MetaProm and EnsemPro evaluations, we performed a grid search on the thresholding and clustering parameters, and kept the best performing ones. Cluster sizes were 50-2000 for EnsemPro and 500 for MetaProm. Thresholds were 0-1 for both. Predictions in areas not examined by MetaProm and EnsemPro were discarded.
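The thresholding and clustering procedure described above amounts to a greedy selection of local maxima. A minimal sketch follows; the function name and the dictionary input format are assumptions, and the quadratic scan over kept positions is chosen for clarity rather than speed.

```python
def to_point_predictions(scores, threshold, cluster_size):
    """Convert continuous scores into point predictions.

    `scores` maps a genomic position to its score (assumed input format).
    Positions scoring below `threshold` are dropped; the remainder are
    visited from highest to lowest score, and a position is kept only if
    no already kept position lies within `cluster_size` bp of it.
    """
    candidates = sorted(
        ((score, pos) for pos, score in scores.items() if score >= threshold),
        key=lambda item: (-item[0], item[1]),
    )
    kept = []
    for score, pos in candidates:
        if all(abs(pos - other) > cluster_size for other, _ in kept):
            kept.append((pos, score))
    return kept
```

A grid search over `threshold` and `cluster_size` would then call this function once per parameter pair and score each resulting set of point predictions with the relevant evaluation rule.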
42,536 TSS in 14,566 sequences were obtained from the MetaProm authors, along with sensitivity-specificity curves. MetaProm CpG and non-CpG scores were combined.
The EnsemPro evaluation describes discarding EPD TSS where there were missing bases within 1,150 bp of the TSS, leaving 400 TSS from 1871. As we were unable to find any missing bases, we used all TSS. EnsemPro figures were obtained from Table 2 of the EnsemPro paper. Weighted majority voting figures were used, as this was the best performing method.
Predictions were made for genome build hg17, and reduced to 5 base pair resolution and converted to build hg18 for testing with pppBenchmark.
This work was funded by Science Foundation Ireland (SFI) grants 08/SRC/I1407 and 07/IN.1/B1783. The authors would like to thank Andreas Wilm and Junwen Wang for their help.
- The FANTOM Consortium, Carninci P, Kasukawa T, et al: The Transcriptional Landscape of the Mammalian Genome. Science. 2005, 309: 1559-1563. 10.1126/science.1112014.
- Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR: Transcriptional Maps of 10 Human Chromosomes at 5-Nucleotide Resolution. Science. 2005, 308: 1149-1154. 10.1126/science.1108625.
- Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, Bell I, Cheung E, Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR: RNA Maps Reveal New RNA Classes and a Possible Function for Pervasive Transcription. Science. 2007, 316: 1484-1488. 10.1126/science.1138341.
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CAM, Taylor MS, Engstrom PG, Frith MC, Forrest ARR, Alkema WB, Tan SL, Plessy C, Kodzius R, Ravasi T, Kasukawa T, Fukuda S, Kanamori-Katayama M, Kitazume Y, Kawaji H, Kai C, Nakamura M, Konno H, Nakano K, Mottagui-Tabar S, Arner P, Chesi A, Gustincich S, Persichetti F, Suzuki H, Grimmond SM, Wells CA, Orlando V, Wahlestedt C, Liu ET, Harbers M, Kawai J, Bajic VB, Hume DA, Hayashizaki Y: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006, 38: 626-635. 10.1038/ng1789.
- Abeel T, Van de Peer Y, Saeys Y: Toward a gold standard for promoter prediction evaluation. Bioinformatics. 2009, 25: i313-320. 10.1093/bioinformatics/btp191.
- Saxonov S, Berg P, Brutlag DL: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proceedings of the National Academy of Sciences of the United States of America. 2006, 103: 1412-1417. 10.1073/pnas.0510310103.
- Roider HG, Lenhard B, Kanhere A, Haas SA, Vingron M: CpG-depleted promoters harbor tissue-specific transcription factor binding signals--implications for motif overrepresentation analyses. Nucl Acids Res. 2009, 37: 6305-6315. 10.1093/nar/gkp682.
- Goni JR, Perez A, Torrents D, Orozco M: Determining promoter location based on DNA structure first-principles calculations. Genome Biology. 2007, 8: R263. 10.1186/gb-2007-8-12-r263.
- Sonnenburg S, Zien A, Ratsch G: ARTS: accurate recognition of transcription starts in human. Bioinformatics. 2006, 22: e472-480. 10.1093/bioinformatics/btl250.
- Dineen DG, Wilm A, Cunningham P, Higgins DG: High DNA melting temperature predicts transcription start site location in human and mouse. Nucl Acids Res. 2009, gkp821.
- Wang J, Ungar L, Tseng H, Hannenhalli S: MetaProm: a neural network based meta-predictor for alternative human promoter prediction. BMC Genomics. 2007, 8: 374. 10.1186/1471-2164-8-374.
- Won H, Kim M, Kim S, Kim J: EnsemPro: An ensemble approach to predicting transcription start sites in human genomic DNA sequences. Genomics. 2008, 91: 259-266. 10.1016/j.ygeno.2007.11.001.
- Wan J, Kang S, Tang C, Yan J, Ren Y, Liu J, Gao X, Banerjee A, Ellis LBM, Li T: Meta-prediction of phosphorylation sites with weighted voting and restricted grid search parameter selection. Nucl Acids Res. 2008, gkm848.
- Boulesteix A, Porzelius C, Daumer M: Microarray-based classification and clinical predictors: on combined classifiers and additional predictive value. Bioinformatics. 2008, 24: 1698-1706. 10.1093/bioinformatics/btn262.
- Kedarisetti KD, Kurgan L, Dick S: Classifier ensembles for protein structural class prediction with varying homology. Biochemical and Biophysical Research Communications. 2006, 348: 981-988. 10.1016/j.bbrc.2006.07.141.
- Polikar R: Ensemble Based Systems in Decision Making. IEEE Circuits and Systems Magazine. 2006, 6: 21-45. 10.1109/MCAS.2006.1688199.
- Ženko B: Is Combining Classifiers Better than Selecting the Best One. Machine Learning. 2004, 54: 255-273. 10.1023/B:MACH.0000015881.36452.6e.
- Bajic VB, Brent MR, Brown RH, Frankish A, Harrow J, Ohler U, Solovyev VV, Tan SL: Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment. Genome Biol. 2006, 7: S3. 10.1186/gb-2006-7-s1-s3.
- Straussman R, Nejman D, Roberts D, Steinfeld I, Blum B, Benvenisty N, Simon I, Yakhini Z, Cedar H: Developmental programming of CpG island methylation profiles in the human genome. Nat Struct Mol Biol. 2009, 16: 564-571. 10.1038/nsmb.1594.
- Aloui A, Tagourti J, El May A, Joseleau Petit D, Landoulsi A: The effect of methylation on some biological parameters in Salmonella enterica serovar Typhimurium. Pathologie Biologie. Corrected Proof.
- Doerfler W: The Significance of DNA Methylation Patterns: Promoter Inhibition by Sequence-Specific Methylation is One Functional Consequence. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 1990, 326: 253-265. 10.1098/rstb.1990.0009.
- King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC: Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res. 2005, 15: 1051-1060. 10.1101/gr.3642605.
- Jin VX, Singer GAC, Agosto-Pérez FJ, Liyanarachchi S, Davuluri RV: Genome-wide analysis of core promoter elements from conserved human and mouse orthologous pairs. BMC Bioinformatics. 2006, 7: 114. 10.1186/1471-2105-7-114.
- Abeel T, Saeys Y, Rouzé P, Van de Peer Y: ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles. Bioinformatics. 2008, 24: i24-31. 10.1093/bioinformatics/btn172.
- Abeel T, Saeys Y, Bonnet E, Rouzé P, Van de Peer Y: Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res. 2008, 18: 310-23. 10.1101/gr.6991408.
- Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet. 2001, 29: 412-7. 10.1038/ng780.
- Down TA, Hubbard TJP: Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res. 2002, 12: 458-61. 10.1101/gr.216102.
- Gross SS, Brent MR: Using multiple alignments to improve gene prediction. J Comput Biol. 2006, 13: 379-93. 10.1089/cmb.2006.13.379.
- Melvin I, Weston J, Leslie C, Noble W: Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics. 2008, 9: 389. 10.1186/1471-2105-9-389.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12: 996-1006.
- Suzuki Y, Yamashita R, Nakai K, Sugano S: DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs. Nucl Acids Res. 2002, 30: 328-331. 10.1093/nar/30.1.328.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explor Newsl. 2009, 11: 10-18. 10.1145/1656274.1656278.
- Chang Chih-chung, Lin Chih-jen: LIBSVM: a library for support vector machines. 2001.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.