Cross-species mapping of bidirectional promoters enables prediction of unannotated 5' UTRs and identification of species-specific transcripts
- Helen Piontkivska†2,
- Mary Q Yang†1,
- Denis M Larkin3,
- Harris A Lewin3, 4,
- James Reecy5 and
- Laura Elnitski1
© Piontkivska et al; licensee BioMed Central Ltd. 2009
Received: 06 December 2008
Accepted: 24 April 2009
Published: 24 April 2009
Bidirectional promoters are shared regulatory regions that influence the expression of two oppositely oriented genes. This type of regulatory architecture is found more frequently than expected by chance in the human genome, yet many specifics underlying the regulatory design are unknown. Given that the function of most orthologous genes is similar across species, we hypothesized that the architecture and regulation of bidirectional promoters might also be similar across species, representing a core regulatory structure and enabling annotation of these regions in additional mammalian genomes.
By mapping the intergenic distances of genes in the human, chimpanzee, bovine, murine, and rat genomes, we show an enrichment for pairs of genes whose adjacent 5' ends lie no more than 1,000 bp apart ("head-to-head") compared to pairs of genes that fall in the same orientation ("head-to-tail") or whose 3' ends are side-by-side ("tail-to-tail"). A representative set of 1,369 human bidirectional promoters was mapped to orthologous sequences in other mammals. We confirmed predictions for 5' UTRs in nine of ten manual picks in bovine based on comparison to the orthologous human promoter set and in six of seven predictions in human based on comparison to the bovine dataset. The two predictions that did not have orthology as bidirectional promoters in the other species resulted from unique events that initiated transcription in the opposite direction in only those species. We found evidence supporting the independent emergence of bidirectional promoters from the family of five RecQ helicase genes, which gained their bidirectional promoters and partner genes independently rather than through a duplication process. Furthermore, by expanding our comparisons from pairwise to multispecies analyses we developed a map representing a core set of bidirectional promoters in mammals.
We show that the orthologous positions of bidirectional promoters provide a reliable guide to directly annotate over one thousand regulatory regions in sequences of mammalian genomes, while also serving as a useful tool to predict 5' UTR positions and identify genes that are novel to a single species.
The completed sequences of numerous vertebrate genomes have enabled rapid gene annotation across species using orthologous relationships. This approach is feasible because purifying selection, acting on the open reading frames of coding exons and aimed at preserving encoded protein sequences, minimizes the sequence divergence that can occur. The sequences of these protein-coding genes generally change more slowly over millions of years than do non-coding sequences. Similarity at the nucleotide level is reflected in the likeness of structure and function of the gene products produced in different species. Additional features, such as non-coding functional elements, are also maintained as conserved sequences across species through the action of purifying selection. Enhancer elements are often predicted from their distinctive sequence conservation. Other functional classes, such as promoters, show more plasticity in their composition and do not lend themselves to identification in this manner. Given that precise computational methods have not yet been developed for predicting promoter regions in newly assembled genomes, their annotation lags behind that of coding genes and enhancers.
We hypothesized that promoter regions could be reliably mapped across species using a unique class of promoter that is flanked by genes on each side. These promoters, known as bidirectional promoters, would be useful for annotating promoter regions across mammals because the genes on both the left and right sides of the promoter change slowly. Thus, the promoter region is maintained as a recognizable, intergenic, architectural region that is amenable to computational discovery. Furthermore, if no repetitive elements were inserted at the bidirectional promoter region in either species, the intergenic distances should be maintained across species. In support of this hypothesis, Takai and Jones (2004) showed the exclusion of repetitive elements from bidirectional promoters of human chromosomes 20, 21, and 22.
Bidirectional promoters were originally defined as the regulatory regions present in the intergenic space of two oppositely oriented genes whose transcription start sites (TSSs) were separated by no more than 1,000 bp. Such genes appear in a head-to-head arrangement, i.e. facing away from one another, and are transcribed from opposite strands of DNA. The closely spaced arrangement of the TSSs flanking the bidirectional promoter was recognized as a non-random event, as shown by the fact that a greater-than-expected number of genes had this architecture. Up to 10% of human protein-coding genes were initially identified with bidirectional promoters. We subsequently identified thousands of additional, putative, bidirectional promoters by analyzing divergently transcribed, spliced EST data. The methodology of mapping bidirectional promoters across species used here treats the genes on each side of a promoter as anchors that delimit the intergenic, orthologous regulatory region. If the genome of the other species contains conserved gene order and orientation at the orthologous location, then the intergenic promoter region must have evolved from the ancestral sequence at that location. If the intergenic distance of the annotated transcripts in the other species is also maintained as ≤ 1,000 bp, the orthologous bidirectional promoter is declared validated. As an added benefit, this method does not depend on the level of nucleotide sequence conservation in the promoter regions, which can vary extensively.
The enrichment of bidirectional promoters in the human genome evokes questions about their evolution. In some cases, chromosomal rearrangements could have conjoined promoter regions of two genes. Those genes would remain united through all subsequent speciation events due to selective pressure against change. Any breakage of the union (within or near the bidirectional promoter) could disrupt the normal regulation of both genes, potentially having profound (disadvantageous) effects on cellular function. If true, bidirectional promoters should provide an evolutionary timestamp of rearrangement events across mammalian genomes. Alternatively, some unidirectional promoters could have lost control of their regulated transcription, enabling RNA polymerase to load and traverse in the opposite direction. This scenario could serve as a mechanism for generating new genes in the genome, which would occur in a rare and species-specific manner.
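The definition above can be illustrated with a minimal Python sketch that classifies adjacent gene pairs by orientation and reports head-to-head pairs whose TSSs lie within 1,000 bp. The `Gene` record, the function names, and the decision to skip overlapping pairs are assumptions made for illustration only; they are not taken from the software used in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # leftmost coordinate on the + strand
    end: int     # rightmost coordinate on the + strand
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        # The TSS is the 5' end: the start of a + gene, the end of a - gene.
        return self.start if self.strand == "+" else self.end


def classify_pair(left: Gene, right: Gene) -> str:
    """Classify two neighbouring genes; `left` has the smaller start coordinate."""
    if left.strand == "-" and right.strand == "+":
        return "head-to-head"   # 5' ends adjacent, transcribed away from each other
    if left.strand == "+" and right.strand == "-":
        return "tail-to-tail"   # 3' ends adjacent
    return "head-to-tail"       # both genes in the same orientation


def bidirectional_pairs(genes, max_dist=1000):
    """Return (left name, right name, TSS distance) for head-to-head neighbours."""
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene.chrom, []).append(gene)

    hits = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.start)
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            if classify_pair(left, right) != "head-to-head":
                continue
            dist = right.tss - left.tss
            # Assumption: overlapping pairs (negative distance) are skipped.
            if 0 <= dist <= max_dist:
                hits.append((left.name, right.name, dist))
    return hits
```

Raising `max_dist` corresponds to relaxing the distance requirement, which is one way to think about lower-stringency predictions when 5' UTR annotations are incomplete.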
Building on our previous computational infrastructure, we utilize updated human genome annotations to compare bidirectional promoters in human and bovine genomes to test the hypothesis that long-term evolutionary histories of these promoters could be identified and used to annotate the bovine genome. We used these data to create a detailed regulatory map of orthologous promoter regions across 5 placental mammals (human, chimp, cow, mouse and rat). As an outcome of the analysis, we have shown that the "locked" arrangement of genes around these promoters enables prediction of unannotated 5' UTRs using cross-species comparisons. Furthermore, we identified bidirectional promoters that lack orthologous counterparts in all other species, supporting the conclusion that species-specific genes can be identified from rigorous, cross-species comparisons of this dataset. One human-specific example was from the family of five RecQ helicase paralogs (WRN, BLM, RECQL, RECQL4, and RECQL5), all of which have bidirectional promoters that developed independently.
Mapping bidirectional promoters in the cow genome
Bidirectional promoters were predicted in the cow genome from in silico analyses of gene order, orientation and intergenic distances, analogous to our studies in the human and mouse genomes [4, 7]. The official bovine gene set (OGS v1, available at the Bovine Genome Database) contained 4,948 bidirectional gene pairs. Those predictions were made without the normal requirement for an intergenic distance of ≤ 1,000 bp due to incomplete annotation of 5' UTRs in the cow genome. Thus, we labeled the data as "low-stringency" predictions. In comparison, our previous studies in the human and mouse genomes identified 5,000–6,000 bidirectional promoters whose intergenic distances were limited to 1,000 bp (i.e., "high-stringency") [4, 9], suggesting that the number of bidirectional promoters possible in the cow genome was as high as 6,000. We concluded that our low-stringency bovine set captured a majority of actual bidirectional promoters, but that limited annotations contributed false positive and false negative predictions.
To further assess the bidirectional promoters in the cow genome, we examined additional transcript evidence. The OGS v1 dataset, which lacked information regarding most 5' UTRs of bovine genes, was supplemented with RefSeq annotations from GenBank and spliced EST data from the UCSC Cow Genome Browser. Together these datasets identified 1,574 bovine bidirectional promoters, all of which met the requirement of no more than 1,000 bp separating any pair of TSSs (Supplemental Figure S13 in ).
Human bidirectional promoters have orthologs in cow
To define a core set of orthologous bidirectional promoters in mammals, we mapped positions of the regulatory regions in the human and cow genomes. We noted that fewer than 25% (1,369) of the human bidirectional promoters controlled expression of genes that encode proteins. The remainder regulated combinations of protein-coding genes with RNA gene partners or pairs of RNA genes. We proceeded by mapping only these 1,369 human promoters from the protein-coding set, because of the limited annotations of RNA genes available in most mammalian species.
Cross-species UTR prediction
Updates to the cow bidirectional promoter annotations predicted from cross-species comparisons
Updates to the human bidirectional promoter annotations predicted from cross-species comparisons
A unique bidirectional promoter in cow
Genomic rearrangements that displaced one gene from an orthologous bidirectional gene pair occurred in less than 1% of the genes analyzed. An example of such an event was found when the bidirectional promoter for cow CYB5R4 (cytochrome b5 reductase 4) did not validate in human. This promoter represents an alternative promoter of CYB5R4; through alternative splicing, its first exon is also used for a different gene, RIPPLEY2. The human region contained the ortholog for CYB5R4, but not the partner gene from cow (GenBank Accession DV834581). This partner was expressed in numerous cow expression libraries from the brain (Bovine Genome Sequencing Program: Full-length cDNA Sequencing, unpublished) and although it had a minimal open reading frame, it showed strong evidence for RNA secondary structure (Supplemental Fig. S15 in ). The unique appearance of the transcript in cow was explained by a 43 Mb chromosomal inversion on cow chromosome 9 (, Supplemental Fig. S14 in ). The transcript DV834581 crossed this rearrangement breakpoint. None of the other sequenced mammalian genomes showed evidence of the 43 Mb inversion. A duplicated MIR3 SINE was found flanking the inversion on both sides, with one copy embedded within exon 3 of the DV834581 transcript. Although the repetitive element was implicated in the inversion, no clear model explained its role mechanistically. Conserved synteny between the human, macaque, chimp, mouse, rat, dog, and pig genomes indicated that no other genome could reconstitute this transcript because it spanned the unique chromosomal junction. Therefore, DV834581 was identified as a bovine-specific transcript, transcribed from a promoter that was not bidirectional in any other species.
Parallel evolution of bidirectional promoters regulating RecQ helicases
Table: Pairwise sequence distances between human and cow orthologs. The extent of pairwise nucleotide and amino acid sequence divergence between human and cow orthologs of RecQ partner genes. (Table body not recovered; the surviving cells read "no significant sequence similarity".)
Comparative Vertebrate Analyses
Bidirectional promoter analyses contribute many benefits to genomic annotations including predictions of unannotated genes and 5' UTRs in mammalian genomes. No other methods exist to predict UTRs across species, and conventional techniques to align ESTs from other species perform poorly in the divergent UTR sequences. In addition to 5' UTR annotations, the conserved architecture of bidirectional promoters enables annotation of orthologous promoters as well as identification of missing coding annotations in other mammals. Our previous work reported that similar intergenic distances were present at orthologous positions of bidirectional promoters in pairwise comparisons between human and mouse or human and chimp. Thus, the finding that the human and cow comparison showed fewer bidirectional promoters in the bovine genome was attributed to the early stage of the bovine annotation effort, in which 5' UTRs of genes have not been fully characterized. As the depth and coverage of transcribed regions in the cow genome increase, both the in-species annotations and cross-species validations of bidirectional promoters will benefit. Consistent with this idea, our analyses strongly advocate continued efforts towards defining the 5' ends of genes, in order to expedite the annotation of adjacent regions, which contain promoter sequences. The resulting information will enable downstream analyses of conservation of regulatory networks that act through the same transcription factor binding sites. Despite the limitations imposed by the shallow depth of transcripts available in the bovine genome, our data provide strong evidence that bidirectional promoters can be mapped across species using conservation of gene synteny to rapidly annotate these functional regions in non-human genomes. Furthermore, the benefit of mapping these promoters independently in other species is that new bidirectional promoters can be reciprocally predicted and validated in the human genome. 
We were able to use bidirectional promoters to identify chromosomal rearrangements, which harbor species-specific transcripts with robust evidence and biological intrigue. For example, RecQ genes originated via duplication events that created 5 paralogs early in the evolution of metazoans, but their bidirectional partners are unrelated to each other. Therefore, the emergence of bidirectional promoters in this gene family was not a passive result of duplication of the original gene pair bringing the promoter along with it. This conclusion is apparent from the lack of similarity among the partner genes. Moreover, dN/dS analyses show that the partners are under strong purifying selection and have not undergone accelerated change, which would have masked their original similarities if they had been paralogs. Any tendency towards rearrangement by these genes is no longer apparent, as each gene partner has remained stably associated with its RecQ paralog for more than 80 million years. Retrotransposition could be another mechanism bringing RecQ genes near their partners or vice versa, but it does not explain the BLM gene pairing. The partner of the BLM gene is unique to Catarrhini and has only recently evolved; however, the underlying genomic sequence is orthologous to opossum sequence, precluding a recent introduction of the partner gene into the region. Given these data, we hypothesize that elements in or near the RecQ promoters are responsible for initiating transcription in the opposing direction. Recent work by Core et al. (2008) and Seila et al. (2008) demonstrates that promoters can load RNA Pol II in both the forward and reverse directions to maintain a short region of open chromatin. We propose that in some cases, this phenomenon could provide a mechanism for generating novel, full-length transcripts that are spliced and polyadenylated in a species-specific manner.
We have produced a record of orthologous bidirectional promoters in 5 placental mammals (human, chimpanzee, bovine, murine, and rat). Furthermore, we addressed the evolution of new genes that can be identified by mapping bidirectional promoters across species. Continued work on the development of a cross-species regulatory map for these promoters is likely to reveal additional information about transcripts that are not only unique to individual species, but also functionally relevant.
The bovine gene annotations from OGS v1 are available at http://genomes.arc.georgetown.edu/bovine/bovine_genome_consortium/datasets.html. All other annotations were obtained from the UCSC Human Genome Browser. PhyloP scores were also obtained from the UCSC Human Genome Browser.
A multi-stage approach to mapping orthology at bidirectional promoters was developed. Because orthology assignments are strongest in coding regions, we began by mapping single human genes regulated by bidirectional promoters from the Known Genes annotations of the UCSC human genome assembly hg18. Orthology assignments were determined using multiz alignment information and the "chains and nets" data from the UCSC Human Genome Browser MySQL tables. Chains in the Genome Browser represent sequences of gapless aligned blocks. Nets provide a hierarchical ordering of those chains. Level 1 chains, which contain the longest, best-scoring sequence chains that span any selected region, were the only ones considered in this analysis. Given a human gene, our approach examined whether it fell within an orthologous region defined by level 1 alignment data without knowledge of the exact position within an alignment or relative to a gap. In a subsequent step, we intersected the positions of gaps and exons of each gene to ensure that the exons fell into alignable positions across species.
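The gap-and-exon intersection described above might be sketched as a simple interval calculation. This is only an illustration: the half-open `(start, end)` coordinate convention, the function names, and the `min_fraction` cut-off are assumptions, since the thresholds applied in the study are not stated here.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def aligned_fraction(exon, gaps):
    """Fraction of exon bases that fall outside alignment gaps.

    Assumes `gaps` are non-overlapping intervals on the same chromosome.
    """
    length = exon[1] - exon[0]
    gapped = sum(overlap(exon, gap) for gap in gaps)
    return (length - gapped) / length


def exons_alignable(exons, gaps, min_fraction=1.0):
    """True if every exon meets the (hypothetical) min_fraction cut-off."""
    return all(aligned_fraction(exon, gaps) >= min_fraction for exon in exons)
```

With the default cut-off, a gene passes only if none of its exon bases fall inside a gap of the level 1 alignment; a lower value would tolerate partially gapped exons.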
After determining the orthology assignments using the UCSC chains and nets data, we used the RefSeq, spliced ESTs, or OGS v1 annotations from cow to validate predictions from the human dataset. RefSeq genes represent mostly protein-coding genes and therefore were verified by chains and nets alignments, followed by confirmation of protein identity in both species. Spliced ESTs carried less descriptive information than protein-coding genes and therefore were validated in the second species by their presence in an orthologous region, showing conserved synteny of the two genes within that gene pair, and meeting the criterion of less than 1,000 bp of intergenic distance between those transcripts. Our method for mapping bidirectional promoters in spliced EST datasets is described in more detail in a previous publication. If the program verified evidence for orthology and conserved-syntenic gene arrangement, then the orthologous bidirectional promoter was confirmed. After orthologous assignments were confirmed for pairs of human genes, the reciprocal assignments were analyzed from cow to human, using a similar process.
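The validation step can be summarized as a predicate over the two transcripts found in the second species. The sketch below is a simplified illustration, not the program used in the study: the `Transcript` record and function names are hypothetical, the orthology lookup is assumed to have been done already, and the cut-off is written as ≤ 1,000 bp to match the definition given earlier in the text.

```python
from collections import namedtuple

# Hypothetical minimal transcript record; coordinates are on the + strand.
Transcript = namedtuple("Transcript", "name chrom start end strand")


def tss(t):
    """5' end of a transcript: start for + strand, end for - strand."""
    return t.start if t.strand == "+" else t.end


def is_validated(ortho_a, ortho_b, max_dist=1000):
    """Check two orthologous transcripts for a conserved head-to-head arrangement.

    ortho_a and ortho_b are the annotations found in the second species for
    the two human genes, or None when no annotation was found.
    """
    if ortho_a is None or ortho_b is None:
        return False
    if ortho_a.chrom != ortho_b.chrom:
        return False  # synteny of the pair is not conserved
    left, right = sorted((ortho_a, ortho_b), key=lambda t: t.start)
    if not (left.strand == "-" and right.strand == "+"):
        return False  # the pair is not divergently transcribed
    return 0 <= tss(right) - tss(left) <= max_dist
```

Running the same predicate with the roles of the two species exchanged corresponds to the reciprocal cow-to-human analysis.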
Heat maps were generated to represent the orthologous positions of bidirectional promoters. The scale of the map is designated by the number of human genes that were evaluated in a linear distance on the map, due to the fine gradation of the illustrated data. Circumstances in which orthologous bidirectional promoters were not identified included: (A) the presence of one flanking gene but not both, (B) no annotations for either gene in the orthologous genomic region, or (C) no orthologous genomic region was identified. Additional scenarios were possible, but were not presented here. Bidirectional promoters that were validated at the same orthologous position across multiple species are presented as columns of purple color. The heatmap is clustered by similar color groups, while maintaining the column of information at each position.
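One simple way to arrange such a heat map is to treat each promoter as a column of per-species outcome codes and to sort the columns so that identical patterns sit next to each other. The sketch below assumes hypothetical codes ("V" for validated, "A", "B", and "C" for the categories listed above) and uses a plain lexicographic sort as a stand-in; the clustering procedure actually used for the published figure is not described here.

```python
# Hypothetical outcome codes: "V" = validated, "A"/"B"/"C" = categories in the text.
ORDER = {"V": 0, "A": 1, "B": 2, "C": 3}


def cluster_columns(outcomes, species):
    """Order promoters so that columns with the same outcome pattern are adjacent.

    outcomes: dict mapping promoter id -> dict mapping species -> outcome code.
    Returns (ordered promoter ids, matrix with one row per species).
    Each column still holds all of the outcomes for a single promoter.
    """
    def pattern(promoter_id):
        return tuple(ORDER[outcomes[promoter_id][sp]] for sp in species)

    ordered = sorted(outcomes, key=pattern)
    matrix = [[outcomes[pid][sp] for pid in ordered] for sp in species]
    return ordered, matrix
```

The resulting matrix can be passed to any plotting library, with one color per code; promoters validated in every species then appear as uninterrupted columns of a single color.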
The phylogenetic tree of protein sequences showing 5 members of the RecQ gene family from human and cow was reconstructed using the neighbor-joining method based on the Dayhoff distance implemented in the MEGA4 program. The RMI1 sequences from human and cow were used as an outgroup to root the tree. The reliability of the internal branches was evaluated using 1,000 bootstrap replications. The numbers of synonymous (dS) and nonsynonymous (dN) substitutions per synonymous and nonsynonymous site, respectively, were computed using the Nei-Gojobori method.
MQY and LE are sponsored by the Intramural program of the National Human Genome Research Institute, U.S. National Institutes of Health. HP was partially supported by an Ohio Board of Regents/Kent State research challenge grant. We thank Dr. Jen Harrow at the Wellcome Trust Sanger Institute, who supervised the re-annotation of human transcripts in the Vega pipeline. Editorial assistance was provided by the NIH Fellows Editorial Board.
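For readers unfamiliar with the final step of the Nei-Gojobori calculation, the sketch below converts counts of differences and sites into dS, dN, and their ratio. It assumes the Jukes-Cantor correction from the original Nei-Gojobori formulation; the text does not state which MEGA4 variant was used, and counting the synonymous and nonsynonymous sites and differences from a codon alignment is taken as already done. The function names are illustrative.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor correction: d = -(3/4) * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds(syn_diffs, syn_sites, nonsyn_diffs, nonsyn_sites):
    """Return (dN, dS, dN/dS) from pre-computed difference and site counts."""
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ratio = d_n / d_s if d_s > 0 else float("nan")
    return d_n, d_s, ratio
```

A ratio well below 1 is the usual signature of purifying selection, which is how the dN/dS results for the RecQ partner genes are interpreted in the text.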
- Frazer KA, Elnitski L, Church DM, Dubchak I, Hardison RC: Cross-species sequence comparisons: a review of methods and available resources. Genome Res. 2003, 13: 1-12. 10.1101/gr.222003.
- Takai D, Jones PA: Origins of bidirectional promoters: computational analyses of intergenic distance in the human genome. Mol Biol Evol. 2004, 21: 463-467. 10.1093/molbev/msh040.
- Adachi N, Lieber MR: Bidirectional gene organization: a common architectural feature of the human genome. Cell. 2002, 109: 807-809. 10.1016/S0092-8674(02)00758-4.
- Yang MQ, Elnitski L: A computational study of bidirectional promoters in the human genome. Springer Lecture Series: Notes in Bioinformatics. 2007
- Dermitzakis ET, Clark AG: Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol Biol Evol. 2002, 19: 1114-1121.
- Whitehouse I, Rando OJ, Delrow J, Tsukiyama T: Chromatin remodelling at promoters suppresses antisense transcription. Nature. 2007, 450: 1031-1035. 10.1038/nature06391.
- Yang MQ, Koehly L, Elnitski L: Comprehensive annotation of human bidirectional promoters identifies co-regulatory relationships among somatic breast and ovarian cancer genes. PLOS Computational Biology. 2007, 3 (4): e72. 10.1371/journal.pcbi.0030072.
- Elsik C, Guigo R, Reymond A, Antonarakis S, Alioto T, Weinstock G: Bovine Consensus Gene Set.
- Kawaji H, Kasukawa T, Fukuda S, Katayama S, Kai C, Kawai J, Carninci P, Hayashizaki Y: CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis. Nucleic Acids Res. 2006, 34 (Database issue): D632-636. 10.1093/nar/gkj034.
- Consortium BGSaA: The Genome Sequence of Taurine Cattle: a window to ruminant biology and evolution. Science. 2009, 324: 522-528. 10.1126/science.1169588.
- Searle SM, Gilbert J, Iyer V, Clamp M: The otter annotation system. Genome Res. 2004, 14: 963-970. 10.1101/gr.1864804.
- Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL: The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008, 36 (Database issue): D753-760.
- Hedges B, Kumar S: Genomic clocks and evolutionary timescales. Trends Genet. 2003, 19: 200-206. 10.1016/S0168-9525(03)00053-2.
- Kusano K, Berres ME, Engels WR: Evolution of the RECQ family of helicases: A drosophila homolog, Dmblm, is similar to the human bloom syndrome gene. Genetics. 1999, 151: 1027-1039.
- Siepel APK, Haussler D: New methods for detecting lineage-specific selection. Proc 10th Int'l Conf on Research in Computational Molecular Biology (RECOMB '06). 2006
- Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science. 2008, 322: 1845-1848. 10.1126/science.1162228.
- Seila ACCJ, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA, Sharp PA: Divergent transcription from active promoters. Science. 2008, 322: 1849-1851. 10.1126/science.1162253.
- Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D: The UCSC Known Genes. Bioinformatics. 2006, 22: 1036-46. 10.1093/bioinformatics/btl048.
- Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004, 14: 708-715. 10.1101/gr.1933104.
- Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D: Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 2003, 100: 11484-11489. 10.1073/pnas.1932072100.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
- Tamura K, Dudley J, Nei M, Kumar : MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24: 1596-1599. 10.1093/molbev/msm092.
- Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985, 39: 783-791. 10.2307/2408678.
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3: 418-426.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.