A new measurement of sequence conservation
© Cai et al; licensee BioMed Central Ltd. 2009
Received: 15 July 2009
Accepted: 22 December 2009
Published: 22 December 2009
Understanding sequence conservation is important for the study of sequence evolution and for the identification of functional regions of the genome. Current studies often measure sequence conservation based on every position in contiguous regions. Therefore, a large number of functional regions that contain conserved segments separated by relatively long divergent segments are ignored. Our goal in this paper is to define a new measurement of sequence conservation such that both contiguously conserved regions and discontiguously conserved regions can be detected based on this new measurement. Here and in the following, conserved regions are those regions that share similarity higher than a pre-specified similarity threshold with their homologous regions in other species. That is, conserved regions are good candidates of functional regions and may not be always functional. Moreover, conserved regions may contain long and divergent segments.
To identify both discontiguously and contiguously conserved regions, we proposed a new measurement of sequence conservation, which measures sequence similarity based only on the conserved segments within the regions. By defining conserved segments using the local alignment tool CHAOS, under the new measurement, we analyzed the conservation of 1642 experimentally verified human functional non-coding regions in the mouse genome. We found that the conservation in at least 11% of these functional regions could be missed by the current conservation analysis methods. We also found that 72% of the mouse homologous regions identified based on the new measurement are more similar to the human functional sequences than the aligned mouse sequences from the UCSC genome browser. We further compared BLAST and discontiguous MegaBLAST with our method. We found that our method picks up many more conserved segments than BLAST and discontiguous MegaBLAST in these regions.
It is critical to have a new measurement of sequence conservation that is based only on the conserved segments in one region. Such a new measurement can aid the identification of better local "orthologous" regions. It will also shed light on the identification of new types of conserved functional regions in vertebrate genomes.
The identification of the conserved regions of a genome is fundamentally important, because conserved regions are often functional. For instance, many studies have shown that conserved regions correspond to coding genes, non-coding RNAs, enhancers, and other functional regions [2–4]. Because much of the human genome is still of unknown function, identifying the conserved regions is critical to accelerating our understanding of the function of the human genome. Note that, in this paper, conserved regions are the regions that share at least a certain degree of sequence similarity with their homologous regions in other species [2, 3]. Therefore, conserved regions are good candidates for functional regions but are not necessarily functional. Moreover, unlike in previous studies, conserved regions here may contain long, divergent segments.
There are many methods available for the identification of conserved regions. Early methods require conserved human regions to be at least 70% identical over an ungapped alignment of human and mouse sequences at least 100 base pairs (bp) long [2, 3]. Later, more sophisticated methods were developed that define conserved regions from given pairwise or multiple sequence alignments [4–6]. All of these methods are based on contiguous sequence similarity between or among aligned sequences: a contiguous region under study must be similar to its aligned region in other species in order to be claimed as a conserved region. There may be a few short divergent segments in such conserved regions. However, the overall sequence similarity of the conserved regions compared with their aligned regions still needs to be high. Here and in the following, the overall sequence similarity is defined as the percentage of aligned identical nucleotides in the alignments of the entire region. Note that these methods identify conserved regions from pre-aligned sequences, which makes them vulnerable to the quality of the pre-aligned sequences.
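The contiguous criterion used by the early methods can be illustrated with a minimal sketch. It assumes two equal-length, already aligned, ungapped sequences; the function name `conserved_windows` and its parameters are hypothetical and are not taken from any of the cited tools.

```python
def conserved_windows(human, mouse, window=100, min_identity=0.70):
    """Return (start, identity) for every ungapped window that meets the threshold.

    Assumes `human` and `mouse` are pre-aligned, equal-length strings without gaps.
    """
    if len(human) != len(mouse):
        raise ValueError("an ungapped alignment requires equal-length sequences")
    n = len(human)
    if n < window:
        return []
    # 1 where the aligned bases are identical (ambiguous N never counts), else 0
    matches = [int(a == b and a != "N")
               for a, b in zip(human.upper(), mouse.upper())]
    hits = []
    count = sum(matches[:window])
    for start in range(n - window + 1):
        if start > 0:
            # slide the window by one position
            count += matches[start + window - 1] - matches[start - 1]
        identity = count / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits
```

Under this criterion, two perfectly conserved 30 bp blocks separated by 140 divergent positions produce no hit at all, which is the kind of region the contiguous methods overlook.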
Besides CRMs, many DNase I hypersensitive regions are shown below to contain ICS. There could be new types of functional regions with ICS as well. Note that in such conserved regions, it is possible that all the ICS work together to perform a function. The above methods either consider individual ICS separately, or neglect them. Therefore, such conserved regions with ICS are missed completely or partially, because any individual ICS may not be statistically significant. It is thus critical to have a new measurement of sequence conservation that considers the conservation of all ICS in a region simultaneously.
To identify both types of conserved non-coding regions, we proposed a new measurement of sequence conservation. Under this measurement, the sequence similarity is calculated by using only the conserved segments in the regions under consideration (see Methods). Therefore, two orthologous regions with a low overall sequence similarity could be detected as conserved. By developing a local alignment-based procedure with the new measurement, we analyzed the conservation of 1642 human functional regions from the ENCyclopedia Of DNA Elements (ENCODE) project in the mouse genome (see Methods). These 1642 functional regions include 172 regions defined from chromatin immunoprecipitation followed by microarray experiments (TFBS-clustered regions) and 1470 regions defined by DNase I hypersensitivity-related experiments (DHS regions).
We found that there are two or more ICS in 70.3% of TFBS-clustered regions and 17.0% of DHS regions. Moreover, at least 11% of human functional regions that contain multiple ICS would be neglected based on contiguous sequence similarity. We also noticed that for more than 72.9% of the 1642 human regions, our procedure identifies mouse regions that are more similar to the human regions than the mouse regions aligned in the UCSC (University of California, Santa Cruz) genome browser. We also compared the homologous regions obtained from our procedure with those obtained from BLAST and MegaBLAST. We found that the mouse regions identified by our procedure contain the best BLAST and MegaBLAST hits for all functional regions with significant hits (smallest E-value less than 1E-10). However, BLAST and MegaBLAST missed several conserved segments in more than 29.3% of the regions, whereas our procedure identifies all of the BLAST/MegaBLAST hits in all of the regions. Our observations from the study of the conservation of these functional regions may change the way people define sequence conservation and may shed light on the identification of new types of functional regions.
Our new measurement of sequence conservation calculates the sequence similarity based on conserved segments. To obtain the conserved segments, we apply the local alignment software CHAOS to a pair of human-mouse orthologous non-coding sequences (see Methods for details). The aligned human-mouse segments output by CHAOS are called conserved segments. Note that these conserved segments may not be in the same order as in the input human-mouse sequences (Figure 1(B)). We use CHAOS instead of other local alignment software because CHAOS has been shown to correctly align regulatory elements in distant species in long sequences. We do not use global alignment methods to define conserved segments because a recent study has shown that the three most popular methods cannot align a portion of coding regions consistently. Moreover, conserved regions with ICS are difficult to align well by one single alignment.
With this definition of conserved segments, we implement the following three-step procedure to calculate the conservation score for an m-kilobase (kb) long human region. Assume this human region is in the non-coding region of the gene H1. The ortholog of H1 in the mouse genome is M1. First, we apply the CHAOS software to identify conserved segments in the non-coding region of M1, by comparing this m-kb long human region with the non-coding region of M1. Here and throughout the paper, the non-coding sequence of a gene includes the upstream sequences up to the closer endpoint of the 5' adjacent gene, the intron sequences of this gene, and the downstream sequences up to the closer endpoint of the 3' adjacent gene. Depending on which codon is closer to the gene under consideration, the endpoint could be the start codon or the stop codon of the adjacent gene. Second, for any m-kb long mouse region starting from a conserved mouse segment, we calculate the score of the mouse region and the m-kb long human region, by summing the scores of the aligned conserved segments within this pair of m-kb long regions. We obtain the score of a pair of aligned conserved segments from the CHAOS output. Third, we define the conservation score of the m-kb long human region as the best score obtained at the second step. The corresponding m-kb long mouse region with the best score is claimed as the mouse homologous region of the human region. If H1 has multiple mouse orthologs, we use the non-coding regions of all of the orthologs to carry out the above three-step procedure.
We applied the above procedure to the 172 TFBS-clustered regions and 1470 DHS regions. For each human region, we identified the best mouse homologous region. In the following, we describe our observations regarding these homologous regions and compare them with the homologous regions defined by the UCSC genome browser and those defined by BLAST and MegaBLAST.
More than 17.0% human functional regions contain ICS
We also investigated whether these discontiguously conserved regions are biologically meaningful. We scanned the discontiguously conserved regions using the known motifs in the TRANSFAC database and using stringent score cutoff to define TFBSs (p-value < 0.0001). We found that ICS in both the TFBS-clustered regions and the DHS regions contain conserved TFBSs. On the other hand, we did not find conserved TFBSs in the sequences between adjacent ICS in these regions. For instance, we found seven conserved TFBSs in the two ICS in the DHS region id-1244 (chr11:130824648-130824895). In another example, we found more than 3 TFBSs on average in each of the seven ICS in the TFBS-clustered region id-211591 (chr1:149712431-49714909). These putative TFBSs in the ICS support that these ICS may be responsible for the function of these regions.
Our procedure provides mouse homologous regions that are more similar to the human regions
From the above analyses, we already know that there exist a large number of conserved regions with ICS. Here we show that our procedure provides mouse homologous sequences that are more similar to the human functional sequences compared with the aligned mouse sequences from genome alignments, another advantage of the new measurement.
We found that CHAOS sequences are often more similar to the corresponding human sequences than the UCSC sequences are to the human sequences. In 139 out of 172 (80.8%) TFBS-clustered regions and in 842 out of 1470 (57.3%) DHS regions, the CHAOS sequences are more similar to the corresponding human sequences than the UCSC sequences. For these 139 TFBS-clustered regions and 842 DHS regions, the CHAOS sequences have on average 22.2% and 29.9% more identities than the UCSC sequences, respectively. This shows that the new measurement may be a better way to measure the similarity of orthologous regions for non-coding sequences. It also implies that the current conservation studies may have missed many conserved regions by calculating conservation scores based on genome alignments.
Besides the above 139 TFBS-clustered regions and 842 DHS regions, we found 2 TFBS-clustered regions and 61 DHS regions for which the UCSC sequences are as similar to the human sequences as the CHAOS sequences. Moreover, in 31 (18.0%) TFBS-clustered regions and 567 (38.6%) DHS regions, the UCSC sequences are more similar to the corresponding human sequences. Note that the CHAOS sequences may not be as similar to the human sequences as the UCSC sequences, since the CHAOS sequences are identified based on the conserved segments only. In the following, we investigated whether this was the case.
In summary, in at least 90.1% (139+16 out of 172) TFBS-clustered regions and 72.9% (842+229 out of 1470) DHS regions, the CHAOS sequences are more similar to the human sequences than the UCSC sequences are to the human sequences, in terms of percent identity in the sequence alignments. Such a dominant performance from the new measurement confirms that genome alignments based on contiguous sequence similarity may misalign many conserved regions. For the remaining regions, although the UCSC sequences are the same or more similar to the human sequence, they often misaligned the most conserved sub-regions. Thus, it is questionable that UCSC sequences provide better counterparts in the remaining regions.
Genome alignments misaligned many functional regions
In the previous section, we have shown that many UCSC sequences are not from the non-coding regions of the orthologous genes. We have also shown that the UCSC sequences are not as similar to the human sequences as the CHAOS sequences in most regions. Moreover, in the regions where the UCSC sequences are more similar, we found that the most conserved segments in the UCSC sequences may be misaligned. We found two factors can contribute to this.
First, the genome alignment considers contiguous sequence similarity, which makes it difficult to align some local regions. For instance, due to genome rearrangements during evolution, some parts of a functional region are kept in the original 5'-3' direction while other parts are inverted to 3'-5' directions. Thus, the overall sequence similarity based on genome alignments for true orthologous regions is too low to be identified. Therefore, genome alignments may poorly align these regions across species. For instance, the DHS region id-2404 (chr16:61153038-61153304) shares 75.4% identities with the CHAOS sequence (chr8:102219277-102219500) and shares 49.8% identities with the UCSC sequence (chr8:102870981-102871214). The much lower percent identity from the UCSC sequence is due to the fact that the segment (chr16:61153109-61153157) and the segment (chr16:61153242-61153294) in this DHS region are inverted in the mouse genome. In the CHAOS sequence, the two segments are aligned with two segments chr8+:102219348-102219395 and chr8-:102219440-102219395, which occur in the positive strand and negative strand, respectively ("+" and "-" following the chromosome name mean the positive and negative strand, respectively). In the UCSC sequence, the two segments are aligned with two segments, chr8+:102871048-102871095 and chr8+:102871168-102871203, in the positive strand.
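The effect of an inverted segment on percent identity can be illustrated with a small sketch. It assumes two equal-length segments compared without gaps; the helper names (`reverse_complement`, `percent_identity`, `best_strand`) are hypothetical and are not part of CHAOS or the UCSC pipeline.

```python
# Translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def percent_identity(a, b):
    """Percent of identical positions in an ungapped, equal-length comparison."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    same = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * same / len(a)


def best_strand(human_seg, mouse_seg):
    """Compare the human segment with both strands of the mouse segment.

    Returns ("+", identity) or ("-", identity) for the better strand.
    """
    forward = percent_identity(human_seg, mouse_seg)
    reverse = percent_identity(human_seg, reverse_complement(mouse_seg))
    return ("+", forward) if forward >= reverse else ("-", reverse)
```

A method that only scores the forward strand would report the low forward identity for an inverted segment, whereas a strand-aware local comparison recovers the match on the negative strand.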
Second, the genome alignments are targeting genome scale sequence similarity and thus may sacrifice the alignment quality of short functional regions. For instance, for the DHS region id-5225 (chr5:142205165-142205746), we found that there is a conserved segment of 101 bp long with 83% identity to its orthologous region in the mouse genome. The genome alignment at UCSC aligned this region with all gaps. It is clear that, to provide better genome scale matches, the genome browser cannot guarantee to align the corresponding sequences for short regions.
BLAST and MegaBLAST neglect many conserved segments
The comparison of the CHAOS sequences with the UCSC sequences in previous sections shows that the aligned sequences from the UCSC genome browser may be misleading when considering the evolution of a local region. Because we are trying to identify the most similar regions around an orthologous mouse gene for a human query sequence, it is also necessary to determine the difference between our approach and BLAST, the basic tool for the same purpose using contiguous sequence similarity.
Table 1. The comparison of the CHAOS sequences with the BLAST hits.
Columns: #TFBS-clustered regions (172); #DHS regions (1470).
Rows: E-value < 1; E-value < 1E-10; E-value < 1 & Overlap; E-value < 1 & Non-overlap; E-value < 1E-10 & Overlap; E-value < 1E-10 & Non-overlap.
To show the benefit of measuring the sequence similarity based on conserved segments without considering divergent sequences in a region, we further examined the human regions with significant mouse BLAST hits. We found that in 71 out of 76 (93.4%) TFBS-clustered regions and 167 out of 270 (61.9%) DHS regions, there were one or more CHAOS segments that were missed by BLAST (Table 1). In the remaining regions, the CHAOS segments had a one-to-one correspondence with the BLAST hits, including the hits with E-values larger than 1E-10. Note that the human query sequences are experimentally verified to be functional. It is most likely that all the ICS in such regions, especially in the short DHS regions, are working together to perform functions. Therefore, BLAST missed many ICS by considering conserved segments individually. It also implies that many significantly conserved regions could be missed by BLAST if there is no individual significant hit. On the other hand, the identified ICS in a region together may tell us new functions of the region.
The comparison of the CHAOS sequences with the discontiguous MegaBLAST hits.
Columns: #TFBS-clustered regions (172); #DHS regions (1470).
Rows: E-value < 1; E-value < 1E-10; E-value < 1 & Overlap; E-value < 1 & Non-overlap; E-value < 1E-10 & Overlap; E-value < 1E-10 & Non-overlap.
At least 12.8% human functional regions are conserved in mouse
In the previous sections, we have shown that it is necessary to extend the current conservation measurements to consider only the conserved segments in a region. Here we want to estimate the percentage of human functional regions conserved in the mouse based on our new conservation measurement and the functional regions mentioned above.
If we consider the contiguous sequence similarity for the TFBS-clustered conserved regions, the percentage of sequence identity is from 46.1% to 78.0%, with a median of 67.4%. For the DHS conserved regions, the percentage of sequence identity is from 23.8% to 92.1%, with a median of 69.8%. It is thus evident that many conserved regions are neglected by the current conservation studies.
We proposed a new measurement of sequence conservation. Compared with current measurements based on contiguous sequence similarity in local or global alignments, this new measurement considers interspersed sequence similarity. Therefore, the conserved regions based on the new measurement will include the conserved regions defined by the existing methods. Moreover, the conserved regions based on the new measurement will also include the conserved regions with ICS that are missed by current measurements, such as some conserved CRMs [9, 10] and many DHS regions.
The advantage of the new measurement over the current measurement is demonstrated in the functional regions we studied. First, many functional regions that are easily missed by the current conservation studies are identified by our method based on the new measurement. We found that 121 (70.3%) TFBS-clustered functional regions and 250 (17.0%) DHS functional regions contain two or more ICS. If we consider contiguous sequence similarity, 112 of the 121 (92.6%) TFBS-clustered regions and 162 of the 250 (64.8%) DHS regions have an overall sequence identity of less than 70% compared with their homologous regions. Therefore, at least 11% (17%*64.8%) of the functional regions, namely those that contain multiple ICS and have low overall identity, are neglected by the current conservation methods. Second, our procedure based on the new conservation measurement provides homologous regions that are more similar to the human regions than the aligned sequences in the genome alignments. Third, our procedure identifies a larger number of conserved segments in homologous regions than BLAST and MegaBLAST.
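The arithmetic behind the "at least 11%" figure can be checked with a short sketch that uses only the counts quoted in this paragraph. Reading the 11% as the lower of the two per-set fractions (the DHS set) is our interpretation of the text; the function name `missed_fraction` is hypothetical.

```python
from fractions import Fraction

# Counts quoted in the text: total regions, regions with two or more ICS,
# and multi-ICS regions whose overall identity is below 70%
COUNTS = {
    "TFBS-clustered": {"total": 172, "multi_ics": 121, "below_70": 112},
    "DHS": {"total": 1470, "multi_ics": 250, "below_70": 162},
}


def missed_fraction(total, multi_ics, below_70):
    """Fraction of all regions that have multiple ICS and low overall identity."""
    return Fraction(multi_ics, total) * Fraction(below_70, multi_ics)


for name, counts in COUNTS.items():
    fraction = missed_fraction(**counts)
    print(f"{name}: {float(fraction):.1%} of regions")
```

The product reduces to below_70 / total, so the DHS value is simply 162/1470.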
The new conservation measurement is similar to the normalized sequence similarity. Both methods normalize the sequence similarity by the sequence length. However, the normalized sequence similarity is aimed at identifying regions with a percentage of identities larger than a pre-specified threshold, and it still considers every bp in a region to measure the sequence similarity. The new measurement considers only the conserved segments to calculate the sequence similarity.
It is understandable that conserved CRMs may only contain several ICS compared with their orthologous CRMs. We notice that many DHS regions shorter than 400 bp long also contain ICS, which shows that there may be functional regions other than CRMs that also share ICS with their counterparts. We thus need to adopt the new measurement of sequence conservation in order to have better understanding of conservation and to perform novel comparative genomics analyses.
The new measurement of sequence conservation proposed in this paper will significantly affect how people study evolution. Our study here shows that the classical measurement will miss 11% of conserved functional regions between human and mouse. This has two implications. First, there may be many more sequences conserved between human and mouse than we currently estimate, which is consistent with the argument in a recent paper. Second, with more divergent species, the percentage of missed conserved regions by the classical measurement may be even larger, given the fact that orthologous sequences are more divergent and orthologous sequences contain more ICS.
Note that the conserved functional regions defined above may not be functional in mouse. Although a functional human region shares ICS with a mouse region and the conservation is significant compared with that of random sequences, the function of the mouse region needs to be experimentally verified. Moreover, in this study, we implemented a procedure based on the local alignment software CHAOS, which may still miss some conserved segment candidates. Future studies independent of alignments should detect even more conserved regions. With the verification of the function of these mouse regions and further improvements of the method, we may finally estimate how many conserved regions are functional.
We have proposed a new measurement of sequence conservation. By studying the human functional regions, we found that the new measurement is necessary since the functional regions with ICS are not rare and these regions are not considered as conserved regions under the current measurement. Moreover, for most human regions under study, the homologous mouse regions identified under the new measurement have better overall sequence similarities to the human regions than the corresponding regions identified using the current measurements. That is, there could be many conserved regions missed by using the current measurement. Thus, to apply the new measurement to identify conserved regions and to understand the function of the ICS in the conserved regions may change the way people study comparative genomics and may enable the identification of new types of functional elements.
Collection of functional regions
We collected two sets of functional regions. The first set contained 689 TFBS-clustered functional regions based on chromatin immunoprecipitation followed by microarray experiments for 29 transcription factors. The second set contained 8217 DHS regions based on quantitative chromatin profiling, massively parallel signature sequencing and DNase-chip. Both types of functional regions are the non-coding regions from the published results of the ENCODE project [22, 23].
We further selected the functional regions based on two criteria. First, the functional regions fell into the 30 random regions selected by the ENCODE project. We did not use the functional regions from the other 14 manually selected ENCODE regions in order to draw more unbiased conclusions. Second, the functional regions fell into the non-coding regions of the 13628 human RefSeq genes that have mouse orthologs defined in the MGI database. We did not consider rat orthologs because there are only 6991 human genes with rat orthologs in MGI. Our method can nevertheless be extended to multiple species, in a similar way to extending pairwise alignments to multiple alignments (the conservation score would be defined as the sum of the pairwise conservation scores). In this manner, we obtained 172 TFBS-clustered functional regions and 1470 DHS functional regions in the human genome. The start positions, the end positions, and the original ID number of these regions are listed in additional files 1 and 22.
We downloaded the human and mouse genome sequences from the UCSC genome browser website (version hg18 and mm8). The repeats in these sequences are already masked with lowercase letters. To define the conservation score, C(R), of a human region R of m-kb long, we implemented the three-step procedure below. For simplicity, assume R is in the non-coding region of the human gene H1. The mouse ortholog of H1 is M1 at MGI. Then $C(R) = \max_{R' \subset nc(M1)} S(R, R') / m$, where nc(M1) is the non-coding region of M1 and R' is one m-kb long region in nc(M1), and S(R, R') is the sum of local alignment scores of all pairs of aligned segments in R and R' output from CHAOS. CHAOS is a local alignment program for pairwise alignments. The basic idea of CHAOS is to identify similar k-mers (DNA segments of k bp) shared by two sequences and then to extend these k-mers to generate local alignments. We used CHAOS as the local alignment tool because CHAOS is able to correctly align regulatory elements in distant species. The details of the calculation of the C(R) are in the following sections.
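The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not CHAOS itself: it uses exact k-mer matches and ungapped extension only, whereas CHAOS chains seeds and supports degenerate matches. The function names `shared_kmer_seeds` and `extend_seed` are hypothetical.

```python
from collections import defaultdict


def shared_kmer_seeds(seq_a, seq_b, k=6):
    """Return (pos_a, pos_b) pairs where the same k-mer occurs in both sequences."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k].upper()].append(i)
    seeds = []
    for j in range(len(seq_b) - k + 1):
        # .get avoids inserting empty entries for k-mers absent from seq_a
        for i in index.get(seq_b[j:j + k].upper(), ()):
            seeds.append((i, j))
    return seeds


def extend_seed(seq_a, seq_b, i, j, k=6):
    """Extend an exact seed in both directions while bases match (no gaps).

    Returns (start_a, start_b, length) of the extended match.
    """
    a, b = seq_a.upper(), seq_b.upper()
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + k, j + k
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a
```

Overlapping seeds on the same diagonal extend to the same match, so a real implementation would merge them before scoring.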
First, we identified the conserved segments by using CHAOS to align the human region R with the non-coding sequences of M1. Here the non-coding sequence includes the upstream sequences, introns and downstream sequences of M1. The upstream sequence of M1 is the sequence from the closer endpoint of the 5' adjacent gene of M1 to the start codon of M1. The downstream sequence of M1 is the sequence from the stop codon of M1 to the closer endpoint of the 3' adjacent gene of M1. Here the endpoint is either the start codon or the stop codon of the adjacent genes, depending on the orientation of the adjacent genes. When applying CHAOS, we set the word length parameter k to 6 bp and the degeneracy parameter to 0, and left the other parameters at their CHAOS default values. Since CHAOS can identify any significant local matches in two sequences, by setting the parameters in this way we expected that multiple corresponding pairs of functional segments within a region in the two species would be aligned in some local alignments. A long conserved region under the current conservation measurement will also be aligned by CHAOS. We call these aligned segments output from CHAOS conserved segments.
Second, we calculated the sequence similarity of R and every m-kb long mouse region that starts from a conserved segment in the non-coding region of M1. The similarity was defined as the sum of the scores of the local alignments of the conserved segments within this pair of m-kb long regions. Note that the local alignments and the score of local alignments were provided by CHAOS.
Third, we calculated the conservation score of R. The conservation score is defined as the best sequence similarity of R compared with an m-kb long mouse region R' in the non-coding sequence of M1, divided by m. Therefore, a long contiguously conserved region would have a high conservation score. On the other hand, some short regions with ICS will also have a high score. Note that it takes O(n log n) time to identify the best sequence similarity and the corresponding m-kb long mouse region R', if there are n mouse segments aligned with the human region R in the CHAOS output. For the 13628 human-mouse gene pairs we used, n is in the range of 0 to 259276 for m = 1.
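The second and third steps can be sketched as a window scan over the mouse segments reported for R. This is a minimal illustration under stated assumptions: each segment is given as a hypothetical tuple (mouse_start, mouse_end, score) with half-open coordinates, scores are non-negative, and a segment counts only if it lies entirely inside the window. For clarity the scan below is a simple quadratic loop, not the O(n log n) procedure mentioned in the text.

```python
def conservation_score(segments, m_kb):
    """Return (C(R), start of the best mouse window) for a human region of m_kb kb.

    `segments` is a list of (mouse_start, mouse_end, score) tuples, one per
    aligned segment pair; the tuple layout is an assumption of this sketch.
    """
    if m_kb <= 0:
        raise ValueError("m_kb must be positive")
    window = int(m_kb * 1000)
    ordered = sorted(segments)  # sort by mouse start coordinate
    best_sum, best_start = 0.0, None
    for idx, (start, _, _) in enumerate(ordered):
        limit = start + window  # window is [start, limit)
        total = 0.0
        for k in range(idx, len(ordered)):
            seg_start, seg_end, score = ordered[k]
            if seg_start >= limit:
                break  # later segments start beyond the window
            if seg_end <= limit:
                total += score  # segment lies entirely inside the window
        if best_start is None or total > best_sum:
            best_sum, best_start = total, start
    # normalize the best summed score by the region length in kb
    return best_sum / m_kb, best_start
```

Because only segment scores are summed, divergent sequence between the segments does not lower the score, which is the point of the new measurement.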
Conserved functional regions
In order to define the conserved functional regions, we generated the distribution of the conservation score of a random human region. We obtained this distribution by calculating the conservation score of every 1 kb long human non-coding region that starts with an aligned CHAOS segment in the local alignments of non-coding sequences of orthologous human-mouse genes.
With this background distribution of the conservation score, we defined a functional human region as a conserved functional region if the conservation score of this region is within the top 3.5% of the background distribution. We used 3.5% as a cutoff because it is estimated that 3.5% of the non-coding sequences in the human genome are under constraint, and we assumed that constrained sequences should be conserved. This assumption could be incorrect, but the analysis here should give a rough estimate of conserved regions.
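The cutoff step can be sketched as an empirical quantile on the background scores. The function names and the tie handling (a score equal to the cutoff counts as conserved) are assumptions of this sketch, not details given in the text.

```python
def top_fraction_cutoff(background_scores, fraction=0.035):
    """Return the lowest score that still lies in the top `fraction` of the background."""
    if not background_scores:
        raise ValueError("background distribution is empty")
    ranked = sorted(background_scores, reverse=True)
    # number of background scores that make up the top fraction (at least one)
    k = max(1, round(fraction * len(ranked)))
    return ranked[k - 1]


def is_conserved(score, background_scores, fraction=0.035):
    """True if `score` reaches the top-`fraction` cutoff of the background."""
    return score >= top_fraction_cutoff(background_scores, fraction)
```

With 1000 background scores, for example, the cutoff is the 35th highest score.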
We thank the anonymous referees for their insightful comments and suggestions, which have led to an improved article. This project was supported by a NHGRI grant R01HG004359.
- Chen X, Zheng J, Fu Z, Nan P, Zhong Y, Lonardi S, Jiang T: Assignment of orthologous genes via genome rearrangement. IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM. 2005, 2 (4): 302-315. 10.1109/TCBB.2005.48.View ArticlePubMedGoogle Scholar
- Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, et al: Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature. 2002, 420 (6915): 578-582. 10.1038/nature01251.
- Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA: Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science (New York, NY). 2000, 288 (5463): 136-140.
- Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome research. 2003, 13 (12): 2507-2518. 10.1101/gr.1602203.
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, et al: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome research. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.
- Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A: Distribution and intensity of constraint in mammalian genomic sequence. Genome research. 2005, 15 (7): 901-913. 10.1101/gr.3577405.
- Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, Birney E, Keefe D, Schwartz AS, Hou M, et al: Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome research. 2007, 17 (6): 760-774. 10.1101/gr.6034307.
- Davidson E: The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. 2006, Burlington, MA: Academic Press, 1
- Arnone M, Davidson EH: The hardwiring of development: organization and function of genomic regulatory systems. Development. 1997, 124 (10): 1851-1864.
- Yuh C, Bolouri H, Davidson EH: Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science. 1998, 279 (5358): 1896-1902. 10.1126/science.279.5358.1896.
- Shashikant CS, Bolanowsky SA, Anand S, Anderson SM: Comparison of diverged Hoxc8 early enhancer activities reveals modification of regulatory interactions at conserved cis-acting elements. Journal of experimental zoology Part B. 2007, 308 (3): 242-249. 10.1002/jez.b.21143.
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome research. 2002, 12 (6): 996-1006.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology. 1990, 215 (3): 403-410.
- Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England). 2002, 18 (3): 440-445. 10.1093/bioinformatics/18.3.440.
- Brudno M, Chapman M, Gottgens B, Batzoglou S, Morgenstern B: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics. 2003, 4: 66. 10.1186/1471-2105-4-66.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome research. 2003, 13 (4): 721-731. 10.1101/gr.926603.
- Blake JA, Eppig JT, Richardson JE, Bult CJ, Kadin JA: The Mouse Genome Database (MGD): integration nexus for the laboratory mouse. Nucleic acids research. 2001, 29 (1): 91-94. 10.1093/nar/29.1.91.
- Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420 (6915): 520-562. 10.1038/nature01262.
- Pheasant M, Mattick JS: Raising the estimate of functional human sequences. Genome research. 2007, 17 (9): 1245-1253. 10.1101/gr.6406307.
- Arslan AN, Egecioglu O, Pevzner PA: A new approach to sequence comparison: normalized sequence alignment. Bioinformatics. 2001, 17 (4): 327-337. 10.1093/bioinformatics/17.4.327.
- Zhang ZD, Paccanaro A, Fu Y, Weissman S, Weng Z, Chang J, Snyder M, Gerstein MB: Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions. Genome Res. 2007, 17 (6): 787-797. 10.1101/gr.5573107.
- King DC, Taylor J, Zhang Y, Cheng Y, Lawson HA, Martin J, Chiaromonte F, Miller W, Hardison RC: Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res. 2007, 17 (6): 775-786. 10.1101/gr.5592107.
- Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, et al: Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006, 3 (7): 511-518. 10.1038/nmeth890.
- Crawford GE, Holt IE, Whittle J, Webb BD, Tai D, Davis S, Margulies EH, Chen Y, Bernat JA, Ginsburg D, et al: Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 2006, 16 (1): 123-131. 10.1101/gr.4074106.
- Crawford GE, Davis S, Scacheri PC, Renaud G, Halawi MJ, Erdos MR, Green R, Meltzer PS, Wolfsberg TG, Collins FS: DNase-chip: a high-resolution method to identify DNase I hypersensitive sites using tiled microarrays. Nat Methods. 2006, 3 (7): 503-509. 10.1038/nmeth888.