GPMiner: an integrated system for mining combinatorial cis-regulatory elements in mammalian gene group

Background Sequence features in promoter regions are involved in regulating gene transcription initiation. Although numerous computational methods have been developed for predicting transcriptional start sites (TSSs) or transcription factor (TF) binding sites (TFBSs), they lack annotations for do not consider some important regulatory features such as CpG islands, tandem repeats, the TATA box, CCAAT box, GC box, over-represented oligonucleotides, DNA stability, and GC content. Additionally, the combinatorial interaction of TFs regulates the gene group that is associated with same expression pattern. To investigate gene transcriptional regulation, an integrated system that annotates regulatory features in a promoter sequence and detects co-regulation of TFs in a group of genes is needed. Results This work identifies TSSs and regulatory features in a promoter sequence, and recognizes co-occurrence of cis-regulatory elements in co-expressed genes using a novel system. Three well-known TSS prediction tools are incorporated with orthologous conserved features, such as CpG islands, nucleotide composition, over-represented hexamer nucleotides, and DNA stability, to construct the novel Gene Promoter Miner (GPMiner) using a support vector machine (SVM). According to five-fold cross-validation results, the predictive sensitivity and specificity are both roughly 80%. The proposed system allows users to input a group of gene names/symbols, enabling the co-occurrence of TFBSs to be determined. Additionally, an input sequence can also be analyzed for homogeneity of experimental mammalian promoter sequences, and conserved regulatory features between homologous promoters can be observed through cross-species analysis. After identifying promoter regions, regulatory features are visualized graphically to facilitate gene promoter observations. Conclusions The GPMiner, which has a user-friendly input/output interface, has numerous benefits in analyzing human and mouse promoters. The proposed system is freely available at http://GPMiner.mbc.nctu.edu.tw/.

369, 300, 310 and 246, respectively. The information of first exon end position is used by the system to define the known gene promoter regions.

CpG Island
In vertebrate genomes, the CpG Islands (CGIs) are involved in DNA methylation of gene transcription. 50-60% of the human genes exhibit a CGI over the transcription start site (TSS) but not all the CGIs are associated with promoter regions (Larsen et al., 1992). The CGIs associated with promoters can be, a priori, identified from their structural characteristics (greater size, higher G+C content and CpG o/e ratio; Ioshikhes and Zhang, 2000; Ponger et al., 2001). CpGProD can detect the CGIs in the promoter region with prediction specificity ~ 70%, which is integrated by GPMiner to search the CGIs for input sequence. The CGIs are defined as DNA regions longer than 500 nucleotides, with a moving average C+C frequency above 0.5 and a moving average CpG observed/expected (o/e) ratio greater than 0.6. The information of CpG islands can help improving the prediction of gene promoter regions.

G + C Content
The C + C content represent a frequency of nucleotide G and C occurrence in a given window. The default window size is 15 nt sliding 1 nt each time. The representation of G + C content can help observing the CpG islands and GC box in the promoter region. It is found that most genes had high G + C content in promoter regions.

Transcription Factor Binding Site
The experimentally identified TF bind sites were obtained from TRANSFAC (professional 8.1), which contains 5,711 transcription factors and 14,406 binding sites. In the system, 4,206 known binding sites are matched to upstream regions of human, mouse, rat, chimpanzee, and dog genes. A program, namely MATCH, was implemented to match the consensus patterns of the TRANSFAC known binding sites to the input sequences.
Two important parameters of MATCH, core score and matrix score, represent the sequence matching score of core region and whole region of binding site, respectively. For high specificity of transcription factor binding site matching, the system set 1.0 (perfect match) for core score and 0.95 for matrix score. The known TF binding sites are used to scan the input sequence in both strands, and the positions of each known site homolog are then displayed in the graphical visualization.

TATA box, CCAAT box, and GC box
Narang et al. used computational method to reveal several important core and proximal promoter elements such as TATA box, CCAAT box, GC box, etc., along with their expected locations around the TSS. These oligonucleotides are kinds of transcription factor binding site and located near the transcription start site. As shown in Table S8, the lists of TATA box, CCAAT box, and GC box with positional densities are used by GPMiner to help the annotation of promoter region.

Over Represented (OR) Oligonucleotide
The system applies a statistical method to discover statistically significant oligonucleotides in promoter region, the so called over-represented (OR) Oligonucleotide, which is identified by comparing their frequencies of occurrence in the promoter regions to their background frequencies of occurrence throughout whole genome. If . Let n be the frequency of the considered oligonucleotide S occurring in the promoter regions; the Z-score is given by Z = (n -u) / σ. The probability of observing at least n successes, as given by Chebyshev's theorem, is less than or equal to If Z > 0, then a lower p-value corresponds to a more over-represented oligonucleotide. If Z < 0, then a lower p-value corresponds to a more under-represented oligonucleotide. By statistical significance, we selected the oligonucleotide with the p-value (cumulative hypergeometric probability) < 0.01 to be the OR aligonucleotides.

Repeats
The repeats such as SINE, LINE, Alu, L1, and so on, are extracted from Ensembl database by using Ensembl core libraries. Previous study (Batzer et al.) found that repeats such as Alu and L1 elements can alter the distribution of methylation in the genome, and possibly in gene transcription. These repeats are represented only for known gene promoter sequences. To find the tandem repeats in promoter region, the system integrates a program namely tandem repeat finer. The parameters such as period size, copy number, consensus size, score, etc.
are set corresponding to the default value of tandem repeat finer.

DNA Stability
Aditi Kanhere et al. [1] devised a novel regulatory feature, DNA stability, for prokaryotic promoter prediction.
DNA stability is the structural property of the fragment of DNA duplex, and calculates the minimum free energy based on the hydrogen bond of A-T and C-G pairs. The standard free energy change ( 0 37 G ∆ ) corresponding to the melting transition of an 'n' nucleotide (or 'n-1' dinucleotides) long DNA molecule, from double strand to single strand, is calculated as follows [2]: denotes two types of initiation free energy : "initiation with terminal G⋅C" and "initiation with terminal A⋅T"; 0 sym G ∆ is +0.43 kcal/mol and is applicable if the duplex is self-complementary, and 0 , j i G ∆ represents the standard free energy change for type ij dinucleotide. Table S5 lists the standard free energy changes for ten Watson-Crick types ij.
In the present calculation, each promoter sequence is divided into overlapping windows of 15 bp (or 14 dinucleotide steps), and for each window the free energy is calculated as shown above. This study used the equation of standard free energy change (mentioned in the supplementary materials) to calculate the stability of DNA duplex with window size = 15 nt, sliding from -1000 to +201 of TSS in the DBTSS human and mouse experimentally determined promoters. Figure S1 shows the distributions of average free energy of DNA duplex formation, and reveals a peak near the TSS, lying between -10 and -30 region, which corresponds to the TATA box in the eukaryotic promoter sequences. Aditi Kanhere et al. [1] demonstrated that the change in DNA stability appears to provide a much better clue than the usual sequence motifs.

Homologous Gene Promoter Analysis
For cross species analysis of homologous gene promoter sequences, the Ensembl core libraries [3] were used to identify homologous genes among human, mouse, rat, chimpanzee, and dog. These homologous genes are analyzed based on gene sequence similarity, and the paired homologous genes with both sequence coverage and sequence identity exceeding 80% are further analyzed in homologous gene promoter sequences. The statistics of pair of homologous genes among five species considered in this work are given in Table S9. Following the determination of the paired homologous gene sequences among those five mammals, the multiple sequences alignment tool, CLUSTALW, was used to analyze the promoter sequences of the paired homologous genes. This work found that certain pairs of promoter sequences were not conserved while their homologous gene sequences 5 were highly conserved. Based on the conservation of homologous gene promoter sequences among the five mammals, the conserved regulatory features should have a greater influence on gene transcriptional regulation.     Figure S1. The average distributions of occurring rate of mono-, di-, and tri-mer nucleotides in human promoter training sequences (-3000 ~ +3000 of TSSs (+1)). Figure S2. The average distributions of DNA stability in human promoter training sequences (-3000 ~ +3000 of TSSs (+1)). Near the TSS (+1), a peak lying between in the -10 to -30 bps region corresponds to the TATA box in the eukaryotic promoter sequences.