Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes
© Mignone et al; licensee BioMed Central Ltd. 2008
Received: 18 October 2007
Accepted: 11 June 2008
Published: 11 June 2008
The accurate detection of genes and the identification of functional regions is still an open issue in the annotation of genomic sequences. This problem affects new genomes but also those of very well studied organisms such as human and mouse where, despite the great efforts, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to address this problem. Unfortunately it is limited by the computational requirements needed to perform genome-wide comparisons and by the problem of discriminating between conserved coding and non-coding sequences. This discrimination is often based (thus dependent) on the availability of annotated proteins.
In this paper we present the results of a comprehensive comparison of human and mouse genomes performed with a new high-throughput grid-based system which allows the rapid detection of conserved sequences and an accurate assessment of their coding potential. By detecting clusters of coding conserved sequences, the system can also accurately identify potential gene loci.
Following this analysis we created a collection of human-mouse conserved sequence tags and carefully compared our results to reliable annotations in order to benchmark the reliability of our classifications. Strikingly we were able to detect several potential gene loci supported by EST sequences but not corresponding to as yet annotated genes.
Here we present a new system which allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and to identify potential gene loci. Our system does not require the availability of any annotated sequence and is thus suitable for the analysis of new or poorly annotated genomes.
One of the main challenges of the post-genomic era is the accurate annotation of genes and the improvement of our knowledge of the mechanisms of gene expression through the identification of cis-acting non-coding regulatory regions. Comparative genomics has been one of the most successful approaches used to address this task. Indeed, it is well known that sequences with functional activity, such as coding sequences or regulatory regions, are subject to selective pressures that prevent the fixation of mutations and conserve sequences during evolution.
Conserved non-coding sequences have been shown to act as tissue-specific enhancers of gene expression and, in particular, of genes involved in the control of development. Evolutionarily conserved sequences have also been used successfully to identify new genes.
Given the great interest in this area of research and thanks to the availability of the almost complete genome sequences of many organisms, several tools to identify and collect conserved sequences have been proposed [4, 5].
The identification of a conserved sequence is only the first step in the identification of functional elements, which requires further information. The most obvious is the assessment of its coding potential, i.e. whether the conserved sequence is likely to be part of a coding region. Discriminating between coding and non-coding conserved sequences is of great importance, as the discovery of novel coding sequences may help the detection of unannotated genes or coding exons and the identification of splice variants. Conversely, the study of non-coding conserved sequences may lead to the identification of regions with regulatory activity at either the DNA or the mRNA level, which affect transcription or translation and thus modulate gene expression.
The usual approach is to classify conserved sequences as coding or noncoding by comparison with annotated protein sequences: if a conserved sequence is not supported by (does not align to) a known protein it is labelled as non-coding.
This approach makes classification heavily dependent on the quality of the annotation of the genomes under analysis and it is obviously less applicable to new – poorly annotated – genomes.
We previously developed CSTminer, a tool that does not suffer from these limitations, as it identifies conserved sequences (Conserved Sequence Tags, CSTs) and classifies them as coding or non-coding by evaluating the presence of evolutionary dynamics specific to coding sequences [6, 7].
We recently applied CSTminer to an extensive analysis of human chromosomes 15, 21 and 22 and the corresponding mouse syntenic regions. We identified more than 37,000 CSTs, of which 9,500 were labelled as coding and used to benchmark a novel methodology, based on the identification of clusters of coding CSTs, for detecting genomic regions that are likely to contain genes. One striking result of that work was that, despite the large efforts made towards the annotation of the human genome, we were able to identify 25 loci potentially containing unannotated genes using a relatively simple comparative approach. Interestingly, 11 of the 25 predicted genes were confirmed by updated genome annotation at the time of publication, supporting the reliability of our approach.
The computational problem posed by the comprehensive comparison of two genomes of the size of human and mouse is not trivial. Indeed, although CSTminer is fairly fast, it is limited by its alignment step, which implements a BLAST-like algorithm that is not suitable for comparing very long sequences.
In this paper we propose a highly parallelized system to perform a complete comparison of large genomes. This system also allows the submission of precomputed genomic alignments (such as blastz) to further improve the speed of the analysis.
A preliminary study of sequences conserved between human and mouse genomes allowed the identification of several clusters of coding CSTs that do not correspond to any annotated gene, and the creation of a collection of non-coding CSTs possibly endowed with some functional activity.
The first operation performed by the CSTminer algorithm is the identification of high scoring segment pairs (HSPs) through a BLAST-like sequence comparison. One solution to the computational problem posed by whole-genome comparisons is to split long sequences into smaller fragments; these cannot be too short, however, or too many CSTs would be interrupted at the fragment boundaries. We empirically established a length of 100 Kbp (with an overlap of 1 Kbp) as a good compromise (data not shown). Given the size of the human (~3 Gbp) and mouse (~2.6 Gbp) genomes, an exhaustive comparison between all human and mouse 100 Kbp sequences would require nearly 800 M comparisons. Considering an average computation time of 2 sec for each comparison, the whole analysis would require about 50 years of computation on a single CPU. We took advantage of the high level of parallelization offered by grid technology and developed a system suitable to perform all comparisons in a "reasonable" time.
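The estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the genome sizes are the approximate figures quoted in the text, and the tiling rule (each slice starts 99 Kbp after the previous one) is an assumption about how a 1 Kbp overlap would be applied.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def n_slices(genome_bp, slice_bp=100_000, overlap_bp=1_000):
    """Number of slices of slice_bp needed to tile genome_bp when
    consecutive slices share overlap_bp bases (assumed tiling rule)."""
    step = slice_bp - overlap_bp
    return max(1, math.ceil((genome_bp - overlap_bp) / step))


def all_vs_all_cost(human_bp, mouse_bp, sec_per_comparison=2.0):
    """Return (number of slice pairs, single-CPU time in years)."""
    comparisons = n_slices(human_bp) * n_slices(mouse_bp)
    cpu_years = comparisons * sec_per_comparison / SECONDS_PER_YEAR
    return comparisons, cpu_years


# Approximate genome sizes taken from the text (~3 Gbp and ~2.6 Gbp).
comparisons, cpu_years = all_vs_all_cost(3_000_000_000, 2_600_000_000)
print(f"{comparisons:,} comparisons, about {cpu_years:.0f} CPU-years")
```

Under these assumptions the count comes to roughly 796 million slice pairs, consistent with the "nearly 800 M" figure, and the single-CPU time to about 50 years.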
The number of tasks to be performed is very large, and in a distributed environment there are many reasons why individual jobs may fail (Worker Node problems, site configuration, problems due to middleware failures, etc.). Accurate job management is therefore essential, and to this end a fully automated procedure based on the MySQL DBMS was developed to launch and monitor jobs, re-run failed jobs and collect the results of the analysis.
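As a rough illustration of this kind of bookkeeping, the sketch below tracks job status in a database table and re-runs failed jobs. It is not the procedure used in this work: it uses Python's built-in sqlite3 module as a stand-in for MySQL, runs jobs sequentially rather than on a grid, and the table layout, function names and retry limit are all hypothetical.

```python
import sqlite3


def init_db(conn, job_ids):
    """Create a hypothetical job table with one 'pending' row per job."""
    conn.execute(
        "CREATE TABLE jobs ("
        "job_id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "attempts INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, 'pending', 0)", [(j,) for j in job_ids]
    )


def run_pending(conn, run_job, max_attempts=3):
    """Run every job that is not yet done; failed jobs are retried
    until they succeed or reach max_attempts."""
    while True:
        rows = conn.execute(
            "SELECT job_id FROM jobs WHERE status != 'done' AND attempts < ?",
            (max_attempts,),
        ).fetchall()
        if not rows:
            break
        for (job_id,) in rows:
            ok = run_job(job_id)  # stand-in for submitting one grid job
            conn.execute(
                "UPDATE jobs SET status = ?, attempts = attempts + 1 "
                "WHERE job_id = ?",
                ("done" if ok else "failed", job_id),
            )
    conn.commit()
```

Keeping the state in a database rather than in memory means that the monitoring process itself can be restarted without losing track of which comparisons are still outstanding.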
Genome-wide detection of human-mouse Conserved Sequence Tags (CSTs)
CST distribution among human (A) and mouse (B) chromosomes.
The minimum CST length has been limited to 60 nt as shorter sequences would not allow a reliable computation of the coding potential score (see below). Chromosomes 2 and 13 show some uncommonly long CSTs (17,543 and 10,176 nt respectively). These two CSTs are characterized by a high coding potential and correspond to conserved sequences of the long coding exons of TTN and SACS genes.
We observed that CSTs labelled as undefined often overlap both coding and non-coding regions corresponding to CDS or to UTR or intron sequences, respectively (data not shown). The global coding potential score assigned by CSTminer is influenced by both subregions and does not allow a clear classification of the sequence.
Human-mouse conservation
Previous observations showed that the majority of genes on human chromosome 17 have their homologues on mouse chromosome 11. Interestingly our data show that these chromosomes are the most conserved, with 6.9% and 7.2% of conserved nucleotides respectively.
Conversely, both human and mouse Y chromosomes are poorly conserved (about 0.5%). This observation could be explained by the degeneration process undergone by the Y chromosome. Moreover, Y chromosomes are unusually rich in repetitive elements, which we masked before running our analysis.
Only 0.94% of the human genome is labelled as coding in our comparison with the mouse genome.
CSTs and annotated mRNAs
To identify regions with a high density of coding conserved sequences, which are likely gene loci, we applied an improved version of a previously described clustering procedure (see also Material and Methods) and detected 25,296 clusters containing 141,001 CSTs.
Clusters of coding CSTs have been identified as described in text and have been compared to annotated RefSeq mRNAs.
Table columns: Hs confirmed clusters (fully confirmed); Mm confirmed clusters (fully confirmed); total confirmed clusters; total clusters unconfirmed by RefSeq.
Moreover, given that our approach does not require the previous availability of annotated features, it seems reasonable to think that it could prove to be a powerful tool in the annotation of genomes lacking a well curated gene annotation.
As 668 clusters were confirmed neither by human nor by mouse mRNAs, we compared their chromosome coordinates with those of human ESTs to find evidence of their expression. Indeed, 551 (82%) of these clusters overlapped ESTs (432 clusters were fully supported). Only 117 clusters (comprising 775 CSTs) did not show any overlap with known transcribed sequences (14 of these correspond to pseudogenes according to the human pseudogene database).
Several lines of evidence point to a critical role of non-coding conserved sequences in the regulation of gene expression and, in particular, in the regulation of genes involved in the control of development.
We are aware that, at this stage, our data are limited to the comparison of human and mouse and may not allow the precise localization of short functional motifs. Nonetheless, the identification of core sequence elements shared by several non-coding nrCSTs might represent a powerful approach for detecting conserved sequences involved in chromatin remodelling or in regulating the expression of many genes, whereas unique non-coding nrCSTs might be expected to include elements with more gene-specific functions.
To investigate the hypothesis that noncoding conserved sequences (ncCSTs) might correspond to functional regions we made a comparison with specialized databases containing known regulatory elements. The enrichment of known functional regions in our dataset of conserved non-coding sequences would support the possibility that the same dataset could contain new regulatory elements.
In particular we considered "Presta-promoter", which contains a curated non-redundant set of human promoter sequences, the Rfam ncRNA database, and the "ORegAnno" database, which contains manually curated known regulatory elements. We found that 306 ncCSTs matched 238 (43%) sequences in the Presta-promoter dataset, 347 ncCSTs matched 269 (21%) ORegAnno sequences and 513 ncCSTs matched 1069 (3%) Rfam elements. The finding that a sizable proportion of known functional elements are represented in our conserved non-coding set suggests that additional, still unknown, regulatory elements are also present in our ncCST dataset.
CST comparison with blastz chains
Pre-computed genome alignments are already available for several genomes, including human and mouse and it may make sense to take advantage of this data – provided that information loss is minimal.
We compared the CSTs obtained from our full genome analysis with the results obtained by comparing only the genome regions corresponding to blastz chain tracts. We adapted the grid-based system developed for the full genome comparison to accept blastz alignments (or any other "query-target" coordinate pairs) so as to limit the analysis to these regions. However, many blastz chains are longer than 100 Kbp (with lengths up to about 80 Mbp) and cannot be analyzed efficiently by direct comparison. These sequences were split into 100 Kbp slices (with 1 Kbp overlap) and, to further reduce the computational load, only slices sharing at least 3 identical sequences of 10 nt on the same diagonal were compared. This procedure, which is similar to the one employed by the BLAT algorithm, reduced the number of CSTminer comparisons to about 1% of all possible 100 Kbp comparisons.
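A minimal sketch of such a pre-filter is given below. It is an illustration and not the code used in this work: the function name is hypothetical, only the forward strand is considered, and overlapping 10-mers are counted as separate hits, which is an assumption, since the text does not specify how the three matches are counted.

```python
from collections import defaultdict


def passes_diagonal_filter(seq_a, seq_b, k=10, min_hits=3):
    """Return True if seq_a and seq_b share at least min_hits exact
    k-mer matches lying on the same diagonal (same i - j offset)."""
    # Index every k-mer start position in seq_a.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)

    # Scan seq_b and count exact k-mer hits per diagonal.
    hits = defaultdict(int)
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            diagonal = i - j
            hits[diagonal] += 1
            if hits[diagonal] >= min_hits:
                return True
    return False
```

With these counting rules a shared exact 12-mer is already enough to pass, since it contains three overlapping 10-mers on one diagonal; a stricter variant could require non-overlapping matches.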
Despite this striking reduction in the number of comparisons, only 1% of all CSTs were completely missed by blastz chains, while an additional 4% escaped detection because they did not pass the filter described above, which requires three decamers on the same diagonal. Overall, our data suggest that the use of blastz chains provides an acceptable reduction in the complexity of the analysis with a limited (about 5%) loss of information.
The main advantage of blastz chains in this context, however, is their availability as pre-computed features (available, for instance, at the UCSC genome browser website). Indeed, their computation is rather time consuming (a reported 481 CPU days for the human-mouse comparison).
On the other hand, the reduction procedure based on the identification of exact matches on the same diagonal provides a significant speed-up of the process, as the computational requirements for identifying those regions are limited (data not shown).
As highlighted previously, the CSTminer algorithm measures the coding potential by evaluating evolutionary dynamics unique to coding sequences, and does not require the availability of any annotated feature. In contrast, many analyses of conserved coding or non-coding sequences have classified a sequence as coding (or non-coding) following a comparison with protein databases [1, 17, 18]. It is clear that the reliability of such an approach depends on the availability of annotated proteins. If a sequence is not supported by a protein, it is difficult to decide whether the sequence is really non-coding or whether the corresponding protein has simply not been identified.
Moreover very few annotated proteins have been physically sequenced and the vast majority of them are conceptual translation products of available mRNA sequences. This introduces a vicious circle as the hypothetical codingness of a sequence is inferred by the alignment to putative proteins.
We have developed and implemented a high-performance grid-based system to perform exhaustive full genome comparisons with the CSTminer algorithm, in order to identify conserved sequences and discriminate between coding and non-coding ones. Besides the speed of the whole procedure, even when entire large genomes are compared, one of the main advantages of our system is that it does not require any annotated feature to assess the putative coding potential of the identified conserved tracts, which makes it useful for the comparison of poorly annotated genomes (i.e. when few or no cDNA sequences are available). It is thus possible to identify "interesting" sequences, such as putative genes or regulatory regions, and to use these data to drive subsequent experimental analysis or to strengthen the reliability of independent computational data (e.g. de novo gene finding data).
We demonstrated, using the well-annotated human and mouse genomes as a reference, that the observation of clusters of coding CSTs is a good indicator of the existence of a gene locus. This information can be incorporated in a gene prediction pipeline where several gene prediction tools are combined and their results compared to limit the rate of false positives and to strengthen the significance of predictions.
Collections of conserved non-coding sequences can also support specific studies on sequences that might regulate, for example, gene expression or chromatin structure. Indeed, these data might also facilitate the identification of novel non-coding RNAs, whose importance and prevalence are currently the subject of much debate.
Comparison of publicly available noncoding datasets to coding CSTs and currently annotated RefSeq coding sequences.
Table columns: non-coding elements dataset; coding CST/RefSeq CDS. Row label: Penn State (*).
As pointed out by Couronne et al., local alignment tools, besides identifying orthologous segments, also detect paralogous relationships and sequence repetitions. This information is often considered "noise" and is thus removed. This seems reasonable if the primary goal is to align genomes to find large-scale orthologous regions; nonetheless, repetitive elements can have functional relevance in the regulation of gene expression and warrant further inspection. We used repeat-masked sequences, thus purging known repetitive elements, before CSTminer analysis. Even so, we observed many highly repetitive conserved non-coding elements that we believe to be interesting and that may represent novel lineage-specific repetitive elements.
Our analysis system has been implemented on a grid facility, taking advantage of the high parallelization achievable and allowing full genome comparisons in a reasonable amount of time (15 days of computation to compare the mouse and human genomes). Nonetheless, many genomes have already been aligned with very sensitive algorithms like blastz, and it is possible to take advantage of this information by limiting the CSTminer comparison to those regions only.
It is important to note that, although blastz chains are alignments of genomic sequences, the CSTminer alignment step is still required to detect local similarities. Indeed, the average length of CSTs detected by CSTminer is 190 bases, while blastz tries to extend matches to find large synteny tracts, often resulting in very long alignments.
Using blastz chains alone leads to the complete loss of only about 1% of all CSTs. However, chains longer than 100 Kbp must be split into shorter tracts and an exhaustive comparison of all tracts must be performed. To further improve the speed of the analysis, it is possible to limit the CSTminer comparison of chain fragments to pairs that show at least 3 identical matches of 10 nt on the same diagonal. This restriction drastically limits the number of comparisons (to about 1% of all comparisons), thus reducing the computation time to slightly more than 24 hours (in the case of the human and mouse genomes) with a further loss of 4% of CSTs.
In this paper we describe a grid-based system devised to perform full genome comparisons with the CSTminer algorithm. The main advantage of this system is that the assessment of the coding potential of conserved sequences does not require any annotated feature, which makes it useful for the comparative analysis of poorly annotated genomes.
The system has been benchmarked on the well-annotated human and mouse genomes where it proved its reliability.
The CSTminer algorithm has been described previously and was later slightly modified when a web interface to run CSTminer was made available. A further automatic web system to compare a single sequence with several genomes has also been implemented.
Briefly, given a pair of sequences, CSTminer identifies high scoring segment pairs (HSPs) through a Blast-like sequence comparison. The coding capacity of each CST delimited by an HSP is then assessed by assigning a coding potential score (CPS) which corresponds to the maximum score value obtained from each of the possible reading frames in the forward and reverse orientation.
CSTminer also allows the display of the highest-scoring triplet window (default minimum length of 60 nt) by scanning each detected CST. This approach facilitates the detection of potential coding regions located in longer CSTs which might contain both coding and non-coding tracts (through the presence of untranslated mRNA or intronic regions).
Following an accurate benchmark on controlled coding and non-coding datasets, CPS thresholds for coding and non-coding CSTs were established. Each CST was then labeled as coding (Cod) if CPS ≥ coding_threshold, or as non-coding (NonCod) if CPS ≤ non_coding_threshold, or if CPS < coding_threshold and the CPS of the highest-scoring triplet window < coding_threshold. CSTs whose CPS did not fulfil these requirements were labeled as Undefined. Finally, CSTs with more than 95% similarity were labelled as Ultraconserved and no CPS was computed, as the low divergence would not allow the computation of a significant score.
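The labelling rule can be sketched as a small decision function. This is one reading of the rule, not the CSTminer source code: the function and parameter names are hypothetical, similarity is expressed as a fraction, no real threshold values are implied, and the non-coding condition is read as "(CPS ≤ non-coding threshold) or (window CPS < coding threshold)" once the coding test has failed.

```python
def classify_cst(similarity, cps, window_cps, coding_threshold,
                 non_coding_threshold):
    """Label a CST as Ultraconserved, Cod, NonCod or Undefined.

    similarity  -- fraction of identical positions (0.0 to 1.0)
    cps         -- coding potential score of the whole CST
    window_cps  -- CPS of the highest-scoring triplet window
    """
    # Ultraconserved CSTs are set aside first: no CPS is computed for them.
    if similarity > 0.95:
        return "Ultraconserved"
    if cps >= coding_threshold:
        return "Cod"
    # From here on cps < coding_threshold is already known to hold.
    if cps <= non_coding_threshold or window_cps < coding_threshold:
        return "NonCod"
    # Intermediate overall score but a coding-like window: mixed signal.
    return "Undefined"
```

Under this reading, Undefined CSTs are those with an intermediate overall score that nevertheless contain a window scoring above the coding threshold, which would fit a CST spanning both coding and non-coding tracts.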
We empirically determined that for large scale comparisons CSTminer gives optimal results with sequences of 100 Kbp with 1 Kbp overlap. Indeed this value allows a good balance between computational speed and the occurrence of CST fragmentation at the border of the submitted sequences.
In order to reduce the time needed to execute this large number of comparisons, we took advantage of grid technology, using many machines in parallel. Indeed, each CST comparison is an independent computation, so we split the 800 M comparisons into smaller subsets and ran them on the EGEE grid infrastructure.
In order to maximize the level of parallelization, the comparisons were grouped in sets of 1000 (10 human 100 Kbp slices vs. 100 mouse 100 Kbp slices). The number of comparisons was chosen so that each task runs for approximately one hour, giving a good ratio between the time spent setting up the environment and the CPU time spent running the comparisons. This approach also provides something similar to a check-point: if a job fails, less than one hour of computation is lost.
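The grouping of slice pairs into jobs can be illustrated as follows. This is a sketch under assumptions: slices are identified by integer indices, the function name is hypothetical, and blocks at the edge of the grid simply contain fewer pairs.

```python
def job_blocks(n_human, n_mouse, human_per_job=10, mouse_per_job=100):
    """Yield (human_indices, mouse_indices) blocks, one per job.
    A full block holds human_per_job * mouse_per_job slice pairs."""
    for h0 in range(0, n_human, human_per_job):
        h_block = range(h0, min(h0 + human_per_job, n_human))
        for m0 in range(0, n_mouse, mouse_per_job):
            yield h_block, range(m0, min(m0 + mouse_per_job, n_mouse))


# Toy example: 25 human slices against 250 mouse slices.
blocks = list(job_blocks(25, 250))
pairs_in_first_job = len(blocks[0][0]) * len(blocks[0][1])
print(len(blocks), "jobs;", pairs_in_first_job, "pairs in the first job")
```

Because each block is defined only by two index ranges, a failed job can be resubmitted from its block description alone, without reference to any other job.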
The clustering procedure is an improvement of a procedure described previously. The basic idea is to identify genomic regions with a significant concentration of coding CSTs.
Given N coding CSTs sorted by genomic start, we computed $t3_i$, the genomic span of three consecutive CSTs centered on CST i, for i ∈ (2, N - 1). We labeled three consecutive coding CSTs centered on CST i as a pre-cluster if $2 \cdot t3_i \le \overline{t3}_i$, where $\overline{t3}_i$ is the average of the genomic spans $t3_k$ for k ∈ (i - 30, i + 30). When i < 30 or i > N - 30, $\overline{t3}_i$ is computed for k ∈ (2, 60) or k ∈ (N - 61, N - 1), respectively. Overlapping pre-clusters are then merged into clusters.
The main difference from the previous clustering procedure is that the clustering parameters (the average density of surrounding CSTs) are now computed dynamically over a 60-CST window in the genomic region under analysis, thus accounting for regions with different gene density.
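The pre-cluster rule and the merging step can be sketched as follows. This is an illustrative reimplementation under assumptions, not the original code: CSTs are given as (start, end) tuples on one chromosome, indices are 0-based, the span of a triplet runs from the start of the first CST to the furthest end among the three, and near the ends of the list the averaging window is shifted inwards rather than using the exact index ranges quoted in the text.

```python
from statistics import mean


def cluster_coding_csts(csts, half_window=30):
    """csts: list of (start, end) tuples sorted by start.
    Returns clusters as (first_index, last_index) tuples."""
    n = len(csts)
    if n < 3:
        return []
    first, last = 1, n - 2  # indices that can be the centre of a triplet

    # t3[i]: genomic span covered by CSTs i-1, i and i+1.
    t3 = {
        i: max(csts[i - 1][1], csts[i][1], csts[i + 1][1]) - csts[i - 1][0]
        for i in range(first, last + 1)
    }

    width = 2 * half_window
    pre_clusters = []
    for i in range(first, last + 1):
        # Local window of centres around i, shifted inwards at the edges.
        lo = min(max(i - half_window, first), max(first, last - width))
        hi = min(lo + width, last)
        local_mean = mean(t3[k] for k in range(lo, hi + 1))
        if 2 * t3[i] <= local_mean:
            pre_clusters.append((i - 1, i + 1))

    # Merge pre-clusters that share at least one CST.
    clusters = []
    for start, end in pre_clusters:
        if clusters and start <= clusters[-1][1]:
            clusters[-1] = (clusters[-1][0], max(clusters[-1][1], end))
        else:
            clusters.append((start, end))
    return clusters
```

For example, five CSTs spaced 200 bp apart and embedded among CSTs spaced 100 Kbp apart are reported as a single cluster, because the spans of their triplets are far below half of the local average span.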
Given that each CST has human and mouse genomic coordinates, the clustering procedure is applied both to human and mouse genomes. Only CSTs belonging to a cluster in both organisms are considered. As already pointed out clusters are computed on coding CSTs only, but syntenic CSTs of all classes (non-coding, undefined and ultraconserved) are included in clusters following a post-processing step.
This work was supported by FIRB (Ministero Università e Ricerca), EU STREP project TRANSCODE and by AIRC. We thank David Horner for valuable comments on the manuscript and Matteo Re' for helpful discussions about clustering procedure.
- Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006, 444: 499-502. 10.1038/nature05295.
- Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G: Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3: e7. 10.1371/journal.pbio.0030007.
- Pennacchio LA, Olivier M, Hubacek JA, Cohen JC, Cox DR, Fruchart JC, Krauss RM, Rubin EM: An apolipoprotein influencing triglycerides in humans and mice revealed by comparative sequencing. Science. 2001, 294: 169-173. 10.1126/science.1064852.
- Boccia A, Petrillo M, di Bernardo D, Guffanti A, Mignone F, Confalonieri S, Luzi L, Pesole G, Paolella G, Ballabio A, Banfi S: DG-CST (Disease Gene Conserved Sequence Tags), a database of human-mouse conserved elements associated to disease genes. Nucleic Acids Res. 2005, 33: D505-10. 10.1093/nar/gki011.
- Loots G, Ovcharenko I: ECRbase: database of evolutionary conserved regions, promoters, and transcription factor binding sites in vertebrate genomes. Bioinformatics. 2007, 23: 122-124. 10.1093/bioinformatics/btl546.
- Mignone F, Grillo G, Liuni S, Pesole G: Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis. Nucleic Acids Res. 2003, 31: 4639-4645. 10.1093/nar/gkg483.
- Castrignano T, Canali A, Grillo G, Liuni S, Mignone F, Pesole G: CSTminer: a web tool for the identification of coding and noncoding conserved sequence tags through cross-species genome comparison. Nucleic Acids Res. 2004, 32: W624-7. 10.1093/nar/gkh486.
- Re M, Mignone F, Iacono M, Grillo G, Liuni S, Pesole G: A new strategy to identify novel genes and gene isoforms: Analysis of human chromosomes 15, 21 and 22. Gene. 2006, 365: 35-40. 10.1016/j.gene.2005.09.041.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.
- Zheng D, Gerstein MB: A computational approach for identifying pseudogenes in the ENCODE regions. Genome Biol. 2006, 7: 1-10. 10.1186/gb-2006-7-s1-s13.
- Mach V: PRESTA: associating promoter sequences with information on gene expression. Genome Biol. 2002, 3: research0050. 10.1186/gb-2002-3-9-research0050.
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 2005, 33: D121-4. 10.1093/nar/gki081.
- Montgomery SB, Griffith OL, Sleumer MC, Bergman CM, Bilenky M, Pleasance ED, Prychyna Y, Zhang X, Jones SJ: ORegAnno: an open access database and curation system for literature-derived promoters, transcription factor binding sites and regulatory variation. Bioinformatics. 2006, 22: 637-640. 10.1093/bioinformatics/btk027.
- Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.
- Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D, Kent WJ: The UCSC genome browser database: update 2007. Nucleic Acids Res. 2007, 35: D668-73. 10.1093/nar/gkl928.
- Prabhakar S, Noonan JP, Paabo S, Rubin EM: Accelerated evolution of conserved noncoding sequences in humans. Science. 2006, 314: 786. 10.1126/science.1130738.
- Bird CP, Stranger BE, Liu M, Thomas DJ, Ingle CE, Beazley C, Miller W, Hurles ME, Dermitzakis ET: Fast-evolving noncoding sequences in the human genome. Genome Biol. 2007, 8: R118. 10.1186/gb-2007-8-6-r118.
- Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005, 21: 3596-3603. 10.1093/bioinformatics/bti609.
- Mattick JS, Makunin IV: Non-coding RNA. Hum Mol Genet. 2006, 15 Spec No 1: R17-29. 10.1093/hmg/ddl046.
- Woolfe A, Goode DK, Cooke J, Callaway H, Smith S, Snell P, McEwen GK, Elgar G: CONDOR: a database resource of developmentally associated conserved non-coding elements. BMC Dev Biol. 2007, 7: 100. 10.1186/1471-213X-7-100.
- Engstrom PG, Fredman D, Lenhard B: Ancora: a web resource for exploring highly conserved noncoding elements and their association with developmental regulatory genes. Genome Biol. 2008, 9: R34. 10.1186/gb-2008-9-2-r34.
- Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I: Strategies and tools for whole-genome alignments. Genome Res. 2003, 13: 73-80. 10.1101/gr.762503.
- EGEE grid infrastructure. [http://www.eu-egee.org]
- Castrignano T, De Meo PD, Grillo G, Liuni S, Mignone F, Talamo IG, Pesole G: GenoMiner: a tool for genome-wide search of coding and non-coding conserved sequence tags. Bioinformatics. 2006, 22: 497-499. 10.1093/bioinformatics/bti754.