Comparative promoter region analysis powered by CORG

Dieterich, Christoph; Grossmann, Steffen; Tanzer, Andrea; Röpcke, Stefan; Arndt, Peter F; Stadler, Peter F; Vingron, Martin

doi:10.1186/1471-2164-6-24

Database
Open access
Published: 21 February 2005

BMC Genomics volume 6, Article number: 24 (2005)

Abstract

Background

Promoters are key players in gene regulation. They receive signals from various sources (e.g. cell surface receptors) and control the level of transcription initiation, which largely determines gene expression. In vertebrates, transcription start sites and surrounding regulatory elements are often poorly defined. To support promoter analysis, we present CORG http://corg.molgen.mpg.de, a framework for studying upstream regions including untranslated exons (5' UTR).

Description

The automated annotation of promoter regions integrates information of two kinds. First, statistically significant cross-species conservation within upstream regions of orthologous genes is detected. Pairwise as well as multiple sequence comparisons are computed. Second, binding site descriptions (position-weight matrices) are employed to predict conserved regulatory elements with a novel approach. Assembled EST sequences and verified transcription start sites are incorporated to distinguish exonic from other sequences.

As of now, we have included 5 species in our analysis pipeline (man, mouse, rat, fugu and zebrafish). We characterized promoter regions of 16,127 groups of orthologous genes. All data are presented in an intuitive way via our web site. Users are free to export data for single genes or access larger data sets via our DAS server http://tomcat.molgen.mpg.de:8080/das. We illustrate the benefits of our framework with two applications: phylogenetic profiling of transcription factor binding sites and detection of microRNAs close to transcription start sites of our gene set.

Conclusion

The CORG platform is a versatile tool to support analyses of gene regulation in vertebrate promoter regions. Applications for CORG cover a broad range from studying evolution of DNA binding sites and promoter constitution to the discovery of new regulatory sequence elements (e.g. microRNAs and binding sites).

Background

Comparative sequence analysis has been a powerful tool in bioinformatics for addressing a variety of issues. Applications range from grouping of sequences (e.g. protein sequences into families) to de novo pattern discovery of functional signatures.

With regard to gene regulation, it has long been known that there is considerable sequence conservation between species in non-coding regions of the genome. A comprehensive explanation of this observation is still elusive. However, sequence conservation within promoter regions of genes often stems from transcription factor binding sites that are under selective pressure (see [1] for a review and [2] for a systematic assessment of binding site conservation in man and mouse comparisons).

Conserved sequence elements of other types have recently attracted much attention. Not all non-coding conserved DNA in the vicinity of a gene's transcription start site necessarily functions at the level of transcriptional regulation. For example, most known methylation-guide snoRNAs are intron-encoded and processed from transcripts of housekeeping genes [3]. A few microRNAs are apparently linked to protein coding genes, most notably mir-10 and mir-196, which are located in the (short) intergenic regions in the Hox gene clusters of vertebrates [4–7].

A second class of conserved sequence elements exerts its function as regulatory motifs in the untranslated region (UTR) of the primary transcript or the mature mRNA. The UTRsite database [8], for example, lists about 30 distinct functional motifs including the Histone 3'UTR stem-loop structure (HSL3) [9], the Iron Responsive Element (IRE) [10], the Selenocysteine Insertion Sequences (SECIS) [11], and the Internal Ribosome Entry Sites (IRES) [12]. Most of these elements are contained in CORG since short intergenic regions or introns upstream of the translation start site are entirely covered by our definition of an upstream region.

Phylogenetic footprinting

The CORG framework aims at detecting and describing regulatory elements that are proximal to the transcription start site. In this context, the comparison of upstream regions of orthologous genes is particularly valuable. This concept is called "phylogenetic footprinting" and an overview of this approach can be found in [13].

Phylogenetic footprinting in a strict sense is carried out on orthologous promoter regions. Local sequence similarities can then be directly interpreted as related regions harboring conserved functional elements. We denote these similarities as Conserved Non-coding Blocks (CNBs).

Multi-species sequence conservation

Comparative approaches gain power from the inclusion of sequences from more than two species [14]. Multi-species comparisons help to increase specificity at the expense of intra-species sensitivity since supporting evidence (conservation) stems from many observations. For example, man-mouse-rat comparisons enhance the detection of transcription factor binding sites since the rat genome is more divergent from the mouse genome than anticipated [15]. A useful property of vertebrate microRNAs is their high degree of sequence conservation, which is found in alignments of man, mouse and fish microRNAs [16]. Both types of comparisons are available in CORG. We consider cross-species conservation between promoter regions from 5 vertebrate genomes, namely Homo sapiens, Mus musculus, Rattus norvegicus, Danio rerio and Fugu rubripes. Multiple alignments are built from pairwise CNBs as described in the subsequent section.

Construction and content

Groups of orthologous genes

In this work, we take a gene-centered view of phylogeny. Homology among proteins and thus genes is often concluded on the basis of sequence similarity. The EnsEMBL database [17] makes it possible to distinguish orthologous from merely homologous genes by taking information on conserved synteny into account. We employed single linkage clustering on the graph of EnsEMBL orthologous gene pairs to define the CORG gene groups.
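Single linkage clustering on a graph of gene pairs amounts to collecting its connected components: two genes fall into the same group whenever a chain of orthology pairs connects them. The following Python sketch illustrates this with a union-find structure. The gene identifiers and the function name are hypothetical and chosen for illustration only; this is not the CORG pipeline code.

```python
from collections import defaultdict


def single_linkage_groups(ortholog_pairs):
    """Group genes that are connected by a chain of orthology pairs."""
    parent = {}

    def find(gene):
        # Return the representative of the gene's group (with path halving).
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in ortholog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for gene in list(parent):
        groups[find(gene)].add(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical orthologous gene pairs (identifiers are made up).
pairs = [
    ("hs_SRF", "mm_Srf"),
    ("mm_Srf", "rn_Srf"),
    ("hs_SRF", "dr_srf"),
    ("hs_X", "mm_X"),
]
groups = single_linkage_groups(pairs)
print(groups)
```

Note that single linkage is transitive by construction: one spurious orthology pair is enough to merge two otherwise separate groups.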

Genomic mapping of validated promoter regions

Various recent experimental efforts supply information about the position of transcriptional start sites in the human and mouse genome. Table 1 gives an overview of the resources that were employed in CORG.

Table 1 Resources for validated transcription start sites


Some repositories offer genomic coordinates for their start site entries. Existing genomic mapping information was incorporated unless the underlying genome assembly build differed. The remaining data were projected onto the genome with SSAHA (Sequence Search and Alignment by Hashing Algorithm), a rapid near-exact alignment algorithm [18].

Sequence retrieval

The notion of "promoter region" deserves some further explanation in the context of our approach. Typically, though not exclusively, we expect conserved regulatory regions to appear in the vicinity of the transcription start site of a gene. Since we do not know the precise location of the start of transcription for each and every gene, we chose to compare the sequence regions upstream of the start of translation from orthologous genes. If verified transcription start sites are known, we define a sequence window that is large enough to hold both translation and transcription start sites, plus 5 kb of upstream sequence. If we lack this information, our observations on known transcription start sites indicate that most promoter regions should be captured in a sequence window of 10 kb (Additional File 1). The size of a promoter region may be bounded by the size of the corresponding intergenic region. If an annotated gene happens to lie within the primary sequence window, the promoter region is shortened to exclude exonic sequence.
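The window rules can be summarized in a few lines of code. The sketch below is one possible reading of the rules, not the CORG implementation: the function name, the 1-based inclusive coordinates, the inclusion of the start codon position itself and the treatment of a neighboring gene as a single boundary coordinate are all assumptions made for illustration.

```python
def upstream_window(atg, strand, tss=None, neighbor=None, flank=5000, default=10000):
    """Return (start, end), start <= end, of the upstream window in genomic coordinates.

    atg      -- position of the translation start
    strand   -- "+" or "-"
    tss      -- verified transcription start site, if any
    neighbor -- closest coordinate of an annotated upstream gene, if any (assumed input)
    """
    direction = 1 if strand == "+" else -1
    if tss is not None:
        # The window holds both start sites plus `flank` bases upstream of the outermost one.
        outermost = min(atg, tss) if direction == 1 else max(atg, tss)
        far = outermost - direction * flank
    else:
        # No verified start site: fall back to a fixed-size window upstream of the ATG.
        far = atg - direction * default
    if neighbor is not None:
        # Shorten the window so that it stops just short of the neighboring gene.
        limit = neighbor + direction
        far = max(far, limit) if direction == 1 else min(far, limit)
    return (max(far, 1), atg) if direction == 1 else (atg, far)


print(upstream_window(20000, "+"))                             # no verified TSS
print(upstream_window(20000, "+", tss=19000))                  # verified TSS
print(upstream_window(20000, "+", tss=19000, neighbor=16000))  # bounded by a neighboring gene
print(upstream_window(20000, "-", tss=21000))                  # gene on the reverse strand
```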

Detection of pairwise local sequence similarities

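Significant local sequence similarities (phylogenetic footprints) in two sequences are computed with an implementation of the Waterman-Eggert algorithm. We have already given an account of the algorithm and statistics in [19, 20]. The underlying alignment scoring scheme is the general reversible model [21], in which the off-diagonal elements of the rate matrix have the form

$$Q_{ij} = \alpha_{ij}\,\pi_j \quad (i \neq j;\; i, j \in \{A, C, G, T\}), \qquad \alpha_{ij} = \alpha_{ji},$$

where Q is the transition rate matrix and the α_ij are six symmetric rate parameters, one per unordered pair of nucleotides. We left out the elements on the diagonal, which are constrained by the requirement that the sum of all elements in a row equals zero.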

The π_i are the stationary nucleotide frequencies; their sum is constrained to be one. Although the two genomes under consideration are in general not in their stationary state with respect to the substitutional process, we take the mean of the two observed nucleotide frequencies, $\pi_i = (\pi_i^{(1)} + \pi_i^{(2)})/2$ (the superscripts index the two genomes), to be the best estimate of the stationary base composition.

From other studies we have further knowledge about the relative rates of transversions, the transition A:T→G:C, and the transition G:C→A:T, which occur roughly in the ratio 1:3:5 along vertebrate lineages [22]. These ratios of rates would generate sequences with 40% GC in their stationary state. To accommodate the observed nucleotide frequencies π_i we have to allow for deviation from those ratios. We do this by choosing, for example, for the symmetric parameter α of the A↔T pair α ∝ (R(A → T)/π_T + R(T → A)/π_A)/2, where R(i → j) is either 1, 3, or 5 depending on the process under consideration. Finally, we scale the matrix Q such that the PAM distance [23] of the substitution model equals the observed degree of divergence between the two species under comparison.

Since we were mainly interested in highly conserved regulatory elements, we demanded an average similarity level at least as high as the average exon conservation between the species under comparison.

The score for aligning two nucleotides i and j is then s(i, j) = log(P(i, j)/(π_iπ_j)) where P(i, j) is the probability of finding the pairing of i and j under the above substitution model [21].
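To make the construction concrete, the following Python sketch builds a reversible rate matrix from the 1:3:5 rate ratios and a set of stationary frequencies, and derives log-odds scores s(i, j) from it. It is a minimal illustration under stated assumptions, not the code used for CORG: the nucleotide frequencies and the divergence t = 0.1 are made-up example values, Q is scaled to one expected substitution per site per unit of time, and the matrix exponential is approximated by a truncated power series. All function names are hypothetical.

```python
import math
from itertools import combinations

NUC = "ACGT"
PURINES = {"A", "G"}


def relative_rate(src, dst):
    """Relative rate R(src -> dst) following the 1:3:5 ratios quoted in the text."""
    is_transition = (src in PURINES) == (dst in PURINES)
    if not is_transition:
        return 1.0                       # any transversion
    return 3.0 if src in "AT" else 5.0   # A:T->G:C transitions vs. G:C->A:T transitions


def exchange_parameters(pi):
    """Symmetric parameter per nucleotide pair: mean of R(i->j)/pi_j and R(j->i)/pi_i."""
    return {
        frozenset((i, j)): 0.5 * (relative_rate(i, j) / pi[j] + relative_rate(j, i) / pi[i])
        for i, j in combinations(NUC, 2)
    }


def rate_matrix(pi):
    """Q[i][j] = parameter(i, j) * pi[j] off the diagonal; rows sum to zero; mean rate is 1."""
    ex = exchange_parameters(pi)
    q = [[ex[frozenset((a, b))] * pi[b] if a != b else 0.0 for b in NUC] for a in NUC]
    for i in range(4):
        q[i][i] = -sum(q[i])
    mean_rate = -sum(pi[a] * q[i][i] for i, a in enumerate(NUC))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transition_probabilities(q, t, terms=40):
    """exp(Q t) by a truncated power series; adequate for the small t used here."""
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x * t / k for x in row] for row in mat_mul(term, q)]
        result = [[result[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return result


def score_matrix(pi, t):
    """s(i, j) = log(P(i, j) / (pi_i pi_j)) with P(i, j) = pi_i * exp(Q t)[i][j]."""
    p = transition_probabilities(rate_matrix(pi), t)
    return {
        (a, b): math.log(pi[a] * p[i][j] / (pi[a] * pi[b]))
        for i, a in enumerate(NUC)
        for j, b in enumerate(NUC)
    }


pi = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # illustrative frequencies, not from the paper
scores = score_matrix(pi, t=0.1)               # t = 0.1 substitutions per site (assumed)
for a in NUC:
    print(a, " ".join(f"{scores[(a, b)]:6.2f}" for b in NUC))
```

Because the rate matrix is reversible, the resulting scores are symmetric, matches score positively, and transitions are penalized less than transversions.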

Joining pairwise into multiple alignments

All CNBs from pairwise sequence alignments are split up into groups as defined by gene homology. For each group a graph O = (V, E) with vertices V and edges E is constructed, which represents the species-internal overlap of CNBs on the genomic coordinate level. Each vertex a ∈ V represents a footprint, which is a pairwise local alignment between two species. An undirected edge is placed between two vertices if the corresponding CNBs have only one species in common and show an overlap of at least 10 bp on the sequence level.

In our graph O, cliques of minimal size three are detected with an implementation of the Bron-Kerbosch algorithm [24]. Only those cliques are selected whose species count is equal to their size. This restriction prevents multiple alignments from arising through similarity of multiple short CNBs to a single long CNB. Multiple alignments are then computed based on all cliques that meet the outlined criteria. We chose to employ the multiple alignment method of Lee et al. [25], which applies partial order graphs (POG) to the multiple alignment problem.
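The graph construction and clique selection can be sketched as follows. This is an illustrative Python example, not the CORG implementation: each CNB is represented as a mapping from a species code to a coordinate interval, the coordinates are invented, and the basic (non-pivoting) form of the Bron-Kerbosch recursion is used.

```python
def overlap(a, b):
    """Number of shared positions of two inclusive intervals (negative or zero if disjoint)."""
    return min(a[1], b[1]) - max(a[0], b[0]) + 1


def build_overlap_graph(cnbs, min_overlap=10):
    """Connect two CNBs if they share exactly one species and overlap there by >= min_overlap bp."""
    adj = {i: set() for i in range(len(cnbs))}
    for i in range(len(cnbs)):
        for j in range(i + 1, len(cnbs)):
            shared = set(cnbs[i]) & set(cnbs[j])
            if len(shared) == 1:
                species = next(iter(shared))
                if overlap(cnbs[i][species], cnbs[j][species]) >= min_overlap:
                    adj[i].add(j)
                    adj[j].add(i)
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield all maximal cliques of the undirected graph `adj`."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}


def select_cliques(cnbs, adj, min_size=3):
    """Keep cliques of at least min_size whose number of species equals their size."""
    selected = []
    for clique in bron_kerbosch(adj):
        species = set().union(*(cnbs[i].keys() for i in clique))
        if len(clique) >= min_size and len(species) == len(clique):
            selected.append(clique)
    return selected


# Hypothetical CNBs: species code -> (start, end) within the upstream region.
cnbs = [
    {"hsa": (100, 180), "mmu": (50, 130)},
    {"hsa": (120, 200), "rno": (10, 90)},
    {"mmu": (60, 140), "rno": (20, 100)},
    {"hsa": (500, 560), "dre": (5, 65)},
]
adj = build_overlap_graph(cnbs)
selected = select_cliques(cnbs, adj)
print(selected)
```

In this toy input the first three CNBs overlap pairwise in man, mouse and rat and form a clique spanning three species, whereas the fourth CNB remains isolated and is discarded.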

Partial order graphs belong to the class of directed acyclic graphs (DAGs). A DAG is a graph consisting of a set of nodes N and edges E, which are one-way edges and form no cycles.

The multiple alignment problem is then reduced to successive alignment steps of individual sequences to a growing multiple alignment graph. If the sequences to be aligned share substantial sequence similarity, the number of bifurcation points within the POG stays low and allows rapid computation of the multiple alignment.

Alignment results are subsequently trimmed to encompass the leftmost and rightmost ungapped block of at least 6 nucleotides.
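The trimming step can be illustrated with a short Python sketch. It assumes that the alignment is given as equal-length strings with "-" as the gap character; the function name and the toy alignment are hypothetical.

```python
def trim_alignment(rows, min_block=6, gap="-"):
    """Trim to the span from the leftmost to the rightmost ungapped block of >= min_block columns.

    Returns None if the alignment contains no such block.
    """
    ncol = len(rows[0])
    ungapped = [all(row[c] != gap for row in rows) for c in range(ncol)]

    blocks = []   # half-open column ranges (start, end) of sufficiently long ungapped runs
    start = None
    for c, ok in enumerate(ungapped + [False]):   # sentinel closes a run at the right edge
        if ok and start is None:
            start = c
        elif not ok and start is not None:
            if c - start >= min_block:
                blocks.append((start, c))
            start = None

    if not blocks:
        return None
    left, right = blocks[0][0], blocks[-1][1]
    return [row[left:right] for row in rows]


alignment = [
    "AC-ACGTACG--TTGACCA-G",
    "A-GACGTACGTATTGACCATG",
]
trimmed = trim_alignment(alignment)
print(trimmed)
```

Gapped columns between the two outermost blocks are retained; only the flanks outside these blocks are removed.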

Annotation of promoter regions

Exon detection with assembled EST clusters

Promoter regions in CORG always extend upstream from the most downstream coding start (ATG). As a consequence, promoter regions may contain exons that are not translated. Our way of detecting such exons is a similarity search of man-mouse footprints versus GENENEST [26], a database of assembled EST clusters. Database searches are carried out for human and mouse footprints with the BLASTN program [27]. An E-value cut-off of 10^-4 is applied.

Annotation with predicted binding sites

The TRANSFAC database [28] is a repository of experimentally verified binding site sequences and representations thereof. These representations are used for querying the collection of man-mouse CNBs for known binding site patterns.

Potential binding sites are detected with TRANSFAC weight matrices by the method of [29]. Here, the intuition is that there are two random models for a given sequence S: one is given by the signal profile F and the other one by the background model B. Under both models the distribution of weight matrix scores can conveniently be calculated by convolution, since the score is a sum of independent random variables. Probability mass distributions of P_F(Score(S)) as well as P_B(Score(S)) can be computed by dynamic programming if column scores are reasonably discretized. This allows a fine tuning of the proportions of false positives and negatives for each TRANSFAC weight matrix. Both error levels were set to be equal. All details are given in [29].
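The convolution idea can be shown on a toy example. The Python sketch below computes the exact score distributions under a signal profile and a uniform background by dynamic programming over discretized column scores, and then picks the threshold at which the two error rates are as close as possible. It is a simplified illustration and not the method of [29] itself: the three-column profile, the scaling factor of 10 and all function names are assumptions.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def integer_log_odds(profile, background, scale=10):
    """Discretized column scores: round(scale * log2(f / b)) for every base."""
    return [
        {b: round(scale * math.log2(col[b] / background[b])) for b in BASES}
        for col in profile
    ]


def score_distribution(scores, column_probs):
    """Exact distribution of the summed score for independent columns (convolution by DP)."""
    dist = {0: 1.0}
    for col_scores, probs in zip(scores, column_probs):
        nxt = defaultdict(float)
        for s, p in dist.items():
            for b in BASES:
                nxt[s + col_scores[b]] += p * probs[b]
        dist = dict(nxt)
    return dist


def balanced_threshold(dist_signal, dist_background):
    """Threshold t where P_B(score >= t) and P_F(score < t) are as close as possible."""
    def gap(t):
        false_pos = sum(p for s, p in dist_background.items() if s >= t)
        false_neg = sum(p for s, p in dist_signal.items() if s < t)
        return abs(false_pos - false_neg)
    return min(sorted(dist_signal), key=gap)


# Toy signal profile F (three columns) and uniform background B; values are made up.
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {b: 0.25 for b in BASES}

scores = integer_log_odds(profile, background)
dist_f = score_distribution(scores, profile)
dist_b = score_distribution(scores, [background] * len(profile))
threshold = balanced_threshold(dist_f, dist_b)
print(threshold)
```

Discretizing the column scores keeps the number of distinct score values small, so the dynamic programming table stays manageable even for long matrices.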

Utility and discussion

We now present an overview of the web interface of the database and several example applications.

Interface

The CORG database is accessible via its home page http://corg.molgen.mpg.de and offers a redesigned web interface. From the search page one can quickly jump to gene loci via EnsEMBL or other standard identifiers (e.g. HUGO symbol, LocusLink identifier, ...). The search query is processed according to the chosen reference source and a list of all matching database entries is returned to the user. This list serves as a springboard to a summary page where the genomic context of the selected gene and its similarities to other upstream regions are visualized as in Figure 1.

Pairwise as well as multiple comparisons are displayed on demand at this stage with a JAVA applet that complies with the JDK 1.1 standard. Alternatively, upstream region sequence and corresponding annotation can be exported in EMBL format (sequence data also in FASTA format). The JAVA applet should run on all JAVA-compatible web browsers. Detailed information about the conserved non-coding block structure is shown simultaneously for multiple upstream regions of different species. If available, annotation information on putative binding sites of transcription factors and EST matches is displayed for the query sequence. The applet facilitates zooming into sequence and annotation. In addition, web links are assigned to sequence features that relate external data sources to the corresponding annotation.

CORG data may also be embedded into other viewers or programs via the distributed annotation system (DAS, [30]). DAS facilitates the display of distributed data sources in a common framework with respect to a reference sequence. Our DAS server http://tomcat.molgen.mpg.de:8080/das constitutes such an external data source. Position information on all conserved non-coding blocks and mapped promoters is accessible from this DAS server. Each DAS sequence feature provides a link to the corresponding CORG database entry. New DAS sources can be easily added to the ENSEMBL display. A small tutorial on installing external DAS data sources is available on our web page http://corg.molgen.mpg.de/DAS_tutorial.htm.

Additionally, tools for on-site batch retrieval of CORG data will be added to the web portal in the near future.

Phylogenetic profiling of binding sites

One potential application of CORG is phylogenetic profiling of promoter regions. We define phylogenetic profiling in the context of gene regulation as comparative analysis of presence/absence patterns of binding sites in promoter regions. Here, we consider conserved predicted binding sites and contrast them with validated ones.

Serum Response Factor (SRF) promoter

SRF, a MADS-box transcription factor, regulates the expression of immediate-early genes, genes encoding several components of the actin cytoskeleton, and cell-type specific genes, e.g. smooth, cardiac and skeletal muscle or neuronal-specific genes [31, 32]. Mouse embryos lacking SRF die before gastrulation and do not form any detectable mesoderm [33, 34]. SRF mediates transcriptional activation by binding to CArG box sequences (consensus pattern: CC(A/T)₆GG) in target gene promoters and by recruiting different co-factors. SRF regulates transcription downstream of MAPK signaling in association with ternary complex factors (TCFs) (for a review see [35]). TCFs bind to ets binding sites present adjacent to CArG boxes in many SRF target gene promoters.
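For illustration, strict matches to the CArG box consensus (two C, six positions that are each A or T, two G) can be located with a regular expression. The Python sketch below is a simple consensus scan, not the weight-matrix approach used for the CORG annotation, and the example sequence is invented.

```python
import re

# CC, then six positions that are each A or T, then GG. The lookahead reports overlapping hits.
CARG_BOX = re.compile(r"(?=(CC[AT]{6}GG))")


def find_carg_boxes(sequence):
    """Return (0-based start, matched site) for every strict consensus match."""
    return [(m.start(), m.group(1)) for m in CARG_BOX.finditer(sequence.upper())]


example = "GGATGTCCATATTAGGACATCT"  # made-up sequence containing one CArG box
hits = find_carg_boxes(example)
print(hits)
```

Since the reverse complement of any sequence matching this pattern matches the pattern as well, scanning one strand is sufficient for the strict consensus.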

Figure 1 gives an overview of the genomic context of human SRF. As expected, the upstream region of SRF shows substantial conservation to its rodent orthologs. Additionally, significant alignments were found in comparisons with fish homologs (one from zebrafish and two from fugu). The same data are presented in the multiple alignment view of the JAVA applet in Figure 2. This view gives a better idea of the location of alignments in the corresponding source sequences. Note that the spacing between translation start and alignment is greater in fish than in mammals, which hints at a different extent of the promoter region in the two subgroups.

We get a better idea of the cause of sequence conservation by browsing the multiple alignment. Textual information can be obtained by clicking on the alignment boxes. The alignment then appears in a pop-up window and may be copied to another destination. In Figure 3, we used CLUSTAL X [36] to render the conservation structure at the nucleotide level. Here, a striking observation is the conservation of the regulatory feedback loop of SRF to its own promoter in all species under consideration. So far, this feedback loop has been experimentally verified only in the mouse system [37], but it could exist in all other species under comparison.

Non-coding RNAs

Non-coding RNAs can be classified as transcribed regulatory elements. They are also accessible to the user via the CORG database. Since we were primarily interested in non-coding RNAs rather than small mRNA motifs, we restricted our search here to long CNBs. A blast search of our multiple alignments with length L ≥ 50 against the Rfam database [38] and the microRNA Registry [39] identifies 21 alignments as 7 distinct microRNAs and a single snoRNA (Table 2).

Table 2 Rfam non-coding RNAs in CORG. A + sign indicates that a sequence fragment from the corresponding species (hsa Homo sapiens, mmu Mus musculus, rno Rattus norvegicus, dre Danio rerio, tru Takifugu rubripes) is contained in the CORG CNB; ∅ indicates that a blast search for an orthologous sequence in the Ensembl database was unsuccessful; n.d. means no descriptive Ensembl gene annotation. The CNBs containing mir-196a-2 are shifted compared to the known microRNA sequences, preventing the detection of the correct stem-loop structure. The B column marks whether a candidate was identified by a blast search against Rfam or the microRNA Registry; the A column shows whether a hairpin structure was identified by RNAalifold. p_RNAz is the p-value returned by RNAz for being an evolutionarily conserved RNA secondary structure element.


The snoRNA U93 is an unusual mammalian pseudouridylation guide RNA that accumulates in Cajal (coiled) bodies and is predicted to function in pseudouridylation of the U2 spliceosomal snRNA [40]. It appears to be specific for mammals. The genomic copy of the human U93 RNA is located in an intron of a series of reported spliced expressed sequence tags (ESTs); furthermore, it has been verified experimentally that U93 is indeed spliced from an intron [40]. It was detectable in the CORG footprint dataset because of its location upstream of a conserved putative gene, C14orf87, of unknown function.

The known microRNAs belong to four different groups. The mir-10 and mir-196 precursors are located at specific positions in the Hox gene clusters [4–7]. The mir-196 family regulates Hox8 and Hox7 genes; the function of mir-10 is unknown.

Substitution pattern of non-coding RNAs

For a microRNA we expect a subsequence of about 20 nt that is almost absolutely conserved among vertebrates (the mature miRNA) and a well-conserved complementary sequence forming the other side of the stem from which the mature microRNA is excised. In contrast, the substitution rate should be much larger in the loop region of the hairpin [41]. mir-10 is a good example of this typical substitution pattern, which gives rise to a hairpin structure. The pairwise correlation structure of nucleotides is depicted on top of the multiple alignment in Figure 4. A different pattern is observed for the Iron Responsive Element in the 5'UTR of SLCA1, a member of the sodium transporter family. Here, the conserved stretch does not reach the minimal length expected for a microRNA as defined above. Nevertheless, it is conserved across all vertebrate species as shown in Figure 5.

Conclusion

We have improved and extended our framework of comparative analysis and annotation of vertebrate promoter regions over previous releases (see [20]). The following features have been added to the CORG framework:

• Mapping of validated promoter regions and proper adjustment of the extent of upstream regions.

• Multiple alignments from significant local pairwise alignments.

• Novel approach to predict transcription factor binding sites.

• The web site now offers a genomic context view (as in Figure 1) and an option to export sequence and annotation data.

The CORG database is accessible via our web site. The user is guided step-by-step through the process of selecting and analyzing her promoter region of choice. CORG features an interactive viewer based on JAVA technology, which is tailored to detailed promoter analysis. Large-scale studies make direct use of our DAS service or the MySQL implementation of CORG in conjunction with an application interface (contact authors for details).

We presented selected application examples from the realm of vertebrate gene regulation. Conserved regulatory elements of different kinds (binding sites, microRNAs and UTR elements) are readily accessible to CORG users. New genomes and annotation will be continuously added to CORG.

Availability and requirements

The database is freely accessible through the website http://corg.molgen.mpg.de. Programs, scripts and MySQL database dumps are available from the authors upon request.

References

Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000, 16 (9): 369-72. 10.1016/S0168-9525(00)02081-3.
Article PubMed CAS Google Scholar
Liu Y, Liu XS, Wei L, Altman RB, Batzoglou S: Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res. 2004, 14 (3): 451-458. 10.1101/gr.1327604.
Article PubMed CAS PubMed Central Google Scholar
Bachellerie JP, Cavaillé J, Hüttenhofer A: The expanding snoRNA world. Biochimie. 2002, 775-790. 10.1016/S0300-9084(02)01402-5.
Google Scholar
Lagos-Quintana M, Rauhut R, Lendeckel W, Tuschl T: Identification of novel genes coding for small expressed RNAs. Science. 2001, 294: 853-858. 10.1126/science.1064921.
Article PubMed CAS Google Scholar
Lagos-Quintana M, Rauhut R, Meyer J, Borkhardt A, Tuschl T: New microRNAs from mouse and human. RNA. 2003, 9: 175-179. 10.1261/rna.2146903.
Article PubMed CAS PubMed Central Google Scholar
Yekta S, Shih Ih, Bartel DP: MircoRNA-directed cleavage of HoxB8 mRNA. Science. 2004, 304: 594-596. 10.1126/science.1097434.
Article PubMed CAS Google Scholar
Tanzer A, Amemiya CT, Kim CB, Stadler PF: Evolution of MicroRNAs Located Within Hox Gene Clusters. J Exp Zool: Mol Dev Evol. 2004,
Google Scholar
Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, Gissi C, Saccone C: UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs: Update 2002. Nucl Acids Res. 2002, 30: 335-340. 10.1093/nar/30.1.335.
Article PubMed CAS PubMed Central Google Scholar
Williams AS, Marzluff WF: The sequence of the stem and flanking sequences at the 3'end of histone mRNA are critical determinants for the binding of the stemm-loop binding protein. Nucl Acids Res. 1995, 23: 654-662.
Article PubMed CAS PubMed Central Google Scholar
Hentze MW, Kuhn LC: Molecular control of vertebrate iron metabolism: mRNA based regulatory circuits operated by iron, nitric oxide, and oxidative stress. Proc Natl Acad Sci USA. 1996, 93: 8175-8182. 10.1073/pnas.93.16.8175.
Article PubMed CAS PubMed Central Google Scholar
Walczak R, Westhof E, P C, Krol A: A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA. 1996, 2: 367-379.
PubMed CAS PubMed Central Google Scholar
Le SY, Maizel JV: A common RNA structural motif involved in the internal initiation of translation of cellular mRNAs. Nucl Acids Res. 1997, 25: 362-369. 10.1093/nar/25.2.362.
Article PubMed CAS PubMed Central Google Scholar
Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol. 1997, 7 (3): 399-406. 10.1016/S0959-440X(97)80058-9.
Article PubMed CAS Google Scholar
McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 2002, 12 (10): 1523-32. 10.1101/gr.323602.
Article PubMed CAS PubMed Central Google Scholar
Mullins LJ, Mullins JJ: Insights from the rat genome sequence. Genome Biol. 2004, 5 (5): 221-10.1186/gb-2004-5-5-221.
Article PubMed PubMed Central Google Scholar
Lim LP, Glasner ME, Yekta S, Burge CB, Bartel DP: Vertebrate microRNA genes. Science. 2003, 299 (5612): 1540-10.1126/science.1080372.
Article PubMed CAS Google Scholar
Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T, Clamp M: An overview of Ensembl. Genome Res. 2004, 14 (5): 925-928. 10.1101/gr.1860604.
Article PubMed CAS PubMed Central Google Scholar
Ning Z, Cox AJ, Mullikin JC: SSAHA: a fast search method for large DNA databases. Genome Res. 2001, 11 (10): 1725-1729. 10.1101/gr.194201.
Article PubMed CAS PubMed Central Google Scholar
Dieterich C, Cusack B, Wang H, Rateitschak K, Krause A, Vingron M: Annotating regulatory DNA based on man-mouse genomic comparison. Bioinformatics. 2002, 18 (Suppl 2): S84-90.
Article PubMed Google Scholar
Dieterich C, Wang H, Rateitschak K, Luz H, Vingron M: CORG: a database for Comparative Regulatory Genomics. Nucleic Acids Res. 2003, 31: 55-57. 10.1093/nar/gkg007.
Article PubMed CAS PubMed Central Google Scholar
Lio P, Goldman N: Models of molecular evolution and phylogeny. Genome Res. 1998, 8 (12): 1233-44.
PubMed CAS Google Scholar
Arndt PF, Petrov DA, Hwa T: Distinct changes of genomic biases in nucleotide substitution at the time of Mammalian radiation. Mol Biol Evol. 2003, 20 (11): 1887-96. 10.1093/molbev/msg204.
Article PubMed CAS Google Scholar
States D, Gish W, Altschul S: Improved sensitivity of nucleic acid database searches using application- specific scoring matrices. Methods: A companion of Methods in Enzymology. 1991, 3: 66-70.
Article CAS Google Scholar
Bron C, Kerbosch J: Algorithm 457. Finding all cliques of an undirected graph. Commun ACM. 1973, 16: 575-10.1145/362342.362367.
Article Google Scholar
Lee C, Grasso C, Sharlow MF: Multiple sequence alignment using partial order graphs. Bioinformatics. 2002, 18 (3): 452-64. 10.1093/bioinformatics/18.3.452.
Article PubMed CAS Google Scholar
Krause A, Haas SA, Coward E, Vingron M: SYSTERS, GeneNest, SpliceNest: exploring sequence space from genome to protein. Nucleic Acids Res. 2002, 30: 299-300. 10.1093/nar/30.1.299.
Article PubMed CAS PubMed Central Google Scholar
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-402. 10.1093/nar/25.17.3389.
Article PubMed CAS PubMed Central Google Scholar
Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel A, Kel-Margoulis O: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Research. 2003, 31: 374-378. 10.1093/nar/gkg108.
Article PubMed CAS PubMed Central Google Scholar
Rahmann S, Mueller T, Vingron M: On the Power of Profiles for Transcription Factor Binding Site Detection. Statistical Applications in Genetics and Molecular Biology. 2003, 2: 7-
Article Google Scholar
Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L: The distributed annotation system. BMC Bioinformatics. 2001, 2: 7-10.1186/1471-2105-2-7.
Article PubMed CAS PubMed Central Google Scholar
Miano JM: Serum response factor: toggling between disparate programs of gene expression. J Mol Cell Cardiol. 2003, 35 (6): 577-93. 10.1016/S0022-2828(03)00110-X.
Article PubMed CAS Google Scholar
Treisman R: Journey to the surface of the cell: Fos regulation and the SRE. EMBO J. 1995, 14 (20): 4905-13.
PubMed CAS PubMed Central Google Scholar
Arsenian S, Weinhold B, Oelgeschlager M, Ruther U, Nordheim A: Serum response factor is essential for mesoderm formation during mouse embryogenesis. EMBO J. 1998, 17 (21): 6289-99. 10.1093/emboj/17.21.6289.
Article PubMed CAS PubMed Central Google Scholar
Weinhold B, Schratt G, Arsenian S, Berger J, Kamino K, Schwarz H, Ruther U, Nordheim A: Srf(-/-) ES cells display non-cell-autonomous impairment in mesodermal differentiation. EMBO J. 2000, 19 (21): 5835-44. 10.1093/emboj/19.21.5835.
Article PubMed CAS PubMed Central Google Scholar
Buchwalter G, Gross C, Wasylyk B: Ets ternary complex transcription factors. Gene. 2004, 324: 1-14. 10.1016/j.gene.2003.09.028.
Article PubMed CAS Google Scholar
Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003, 31 (13): 3497-500. 10.1093/nar/gkg500.
Article PubMed CAS PubMed Central Google Scholar
Spencer JA, Misra RP: Expression of the SRF gene occurs through a Ras/Sp/SRF-mediated-mechanism in response to serum growth signals. Oncogene. 1999, 18 (51): 7319-27. 10.1038/sj.onc.1203121.
Article PubMed CAS Google Scholar
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy S: Rfam: an RNA family database. Nucl Acids Res. 2003, 31: 439-441. 10.1093/nar/gkg006.
Article PubMed CAS PubMed Central Google Scholar
Griffiths-Jones S: The microRNA Registry. Nucl Acids Res. 2004, 32: D109-D111. 10.1093/nar/gkh023. [Database Issue]
Article PubMed CAS PubMed Central Google Scholar
Kiss AM, Jády BE, Darzacq X, Verheggen C, Bertrand E, Kiss T: Cajal body-specific pseudouridylation guide RNA is composed of two box H/ACA snoRNA-like domains. Nucl Acids Res. 2002, 30: 4643-4649. 10.1093/nar/gkf592.
Article PubMed CAS PubMed Central Google Scholar
Lai EC, Tomancak P, Williams RW, Rubin GM: Computational identification of Drosophila microRNA genes. Genome Biol. 2003, 4: R42-10.1186/gb-2003-4-7-r42. (20 pages)
Article PubMed PubMed Central Google Scholar
Spencer JA, Major ML, Misra RP: Basic fibroblast growth factor activates serum response factor gene expression by multiple distinct signaling mechanisms. Mol Cell Biol. 1999, 19 (6): 3977-88.
Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of noncoding RNAs. PNAS. 2005, 10.1073/pnas.0409169102. [http://www.pnas.org/cgi/content/abstract/0409169102v1]
Schmid CD, Praz V, Delorenzi M, Perier R, Bucher P: The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucleic Acids Res. 2004, 32 (Database issue): D82-5. 10.1093/nar/gkh122.
Suzuki Y, Yamashita R, Sugano S, Nakai K: DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Res. 2004, 32 (Database issue): D78-81. 10.1093/nar/gkh076.
Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, Yura K, Miyazaki S, Ikeo K, Homma K, Kasprzyk A, Nishikawa T, Hirakawa M, Thierry-Mieg J, Thierry-Mieg D, Ashurst J, Jia L, Nakao M, Thomas MA, Mulder N, Karavidopoulou Y, Jin L, Kim S, Yasuda T, Lenhard B, Eveno E, Suzuki Y, Yamasaki C, Takeda J, Gough C, Hilton P, Fujii Y, Sakai H, Tanaka S, Amid C, Bellgard M, Mde FBM, Bono H, Bromberg SK, Brookes AJ, Bruford E, Carninci P, Chelala C, Couillault C, Souza SJ, Debily MA, Devignes MD, Dubchak I, Endo T, Estreicher A, Eyras E, Fukami-Kobayashi K, Gopinath GR, Graudens E, Hahn Y, Han M, Han ZG, Hanada K, Hanaoka H, Harada E, Hashimoto K, Hinz U, Hirai M, Hishiki T, Hopkinson I, Imbeaud S, Inoko H, Kanapin A, Kaneko Y, Kasukawa T, Kelso J, Kersey P, Kikuno R, Kimura K, Korn B, Kuryshev V, Makalowska I, Makino T, Mano S, Mariage-Samson R, Mashima J, Matsuda H, Mewes HW, Minoshima S, Nagai K, Nagasaki H, Nagata N, Nigam R, Ogasawara O, Ohara O, Ohtsubo M, Okada N, Okido T, Oota S, Ota M, Ota T, Otsuki T, Piatier-Tonneau D, Poustka A, Ren SX, Saitou N, Sakai K, Sakamoto S, Sakate R, Schupp I, Servant F, Sherry S, Shiba R, Shimizu N, Shimoyama M, Simpson AJ, Soares B, Steward C, Suwa M, Suzuki M, Takahashi A, Tamiya G, Tanaka H, Taylor T, Terwilliger JD, Unneberg P, Veeramachaneni V, Watanabe S, Wilming L, Yasuda N, Yoo HS, Stodolsky M, Makalowski W, Go M, Nakai K, Takagi T, Kanehisa M, Sakaki Y, Quackenbush J, Okazaki Y, Hayashizaki Y, Hide W, Chakraborty R, Nishikawa K, Sugawara H, Tateno Y, Chen Z, Oishi M, Tonellato P, Apweiler R, Okubo K, Wagner L, Wiemann S, Strausberg RL, Isogai T, Auffray C, Nomura N, Gojobori T, Sugano S: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004, 2 (6): E162. 10.1371/journal.pbio.0020162.
Bono H, Kasukawa T, Furuno M, Hayashizaki Y, Okazaki Y: FANTOM DB: database of Functional Annotation of RIKEN Mouse cDNA Clones. Nucleic Acids Res. 2002, 30: 116-118. 10.1093/nar/30.1.116.
Pruitt KD, Maglott DR: RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 2001, 29: 137-40. 10.1093/nar/29.1.137.

Acknowledgements

We gratefully acknowledge funding by the EU (BioSapiens Network of Excellence). Steffen Grossmann is supported by the Deutsche Forschungsgemeinschaft (DFG) as a member of the Sonderforschungsbereich (SFB) 618 – Theoretical Biology: Robustness, Modularity and Evolutionary Design of Living Systems.

Author information

Authors and Affiliations

Computational Molecular Biology Department, Max Planck Institute for Molecular Genetics, Ihnestrasse 73, 14195, Berlin, Germany
Christoph Dieterich, Steffen Grossmann, Stefan Röpcke, Peter F Arndt & Martin Vingron
Institute for Theoretical Chemistry and Structural Biology, University of Vienna, Währingerstrasse 17, A-1090, Wien, Austria
Andrea Tanzer & Peter F Stadler
Bioinformatics Group, Department of Computer Science, University of Leipzig, Kreuzstraße 7b, D-04103, Leipzig, Germany
Andrea Tanzer & Peter F Stadler

Corresponding author

Correspondence to Christoph Dieterich.

Additional information

Authors' contributions

Christoph Dieterich built the entire pipeline and some parts of the web interface. Steffen Grossmann annotated transcription factor binding sites and provided parts of the web interface. Andrea Tanzer analyzed known and novel RNA elements in the multiple alignments of the CORG database. Stefan Röpcke set up our database of binding site descriptions. Peter F. Arndt worked on an appropriate alignment scoring scheme. Peter F. Stadler and Martin Vingron initiated this work and provided all necessary infrastructure.

Electronic supplementary material

12864_2004_225_MOESM1_ESM.pdf

Additional File 1: Distribution of distance between start of transcription and translation. Histogram of observed genomic distances between start sites of transcription and translation in man for 1,700 entries from the EPD. The red and blue lines indicate the 90% and 95% quantiles, respectively. Distances greater than 10⁶ bp were excluded from the analysis as they mostly occur due to mismappings in the ENSEMBL database. (PDF 4 KB)

About this article

Cite this article

Dieterich, C., Grossmann, S., Tanzer, A. et al. Comparative promoter region analysis powered by CORG. BMC Genomics 6, 24 (2005). https://doi.org/10.1186/1471-2164-6-24

Received: 26 October 2004
Accepted: 21 February 2005
Published: 21 February 2005
DOI: https://doi.org/10.1186/1471-2164-6-24
