Genomic divergences among cattle, dog and human estimated from large-scale alignments of genomic sequences
© Liu et al. 2006
Received: 09 January 2006
Accepted: 07 June 2006
Published: 07 June 2006
Approximately 11 Mb of finished high quality genomic sequences were sampled from cattle, dog and human to estimate genomic divergences and their regional variation among these lineages.
Optimal three-way multi-species global sequence alignments for 84 cattle clones or loci (each >50 kb of genomic sequence) were constructed using the human and dog genome assemblies as references. Genomic divergences and substitution rates were examined for each clone and for various sequence classes under different functional constraints. Analysis of these alignments revealed that the overall genomic divergences are relatively constant (0.32–0.37 change/site) for pairwise comparisons among cattle, dog and human; however, substitution rates vary across genomic regions and among different sequence classes. A neutral mutation rate (2.0–2.2 × 10⁻⁹ change/site/year) was derived from ancestral repetitive sequences, whereas the substitution rate in coding sequences (1.1 × 10⁻⁹ change/site/year) was approximately half of the overall rate (1.9–2.0 × 10⁻⁹ change/site/year). Relative rate tests also indicated that cattle have a significantly faster rate of substitution than dog and that this difference is about 6%.
This analysis provides a large-scale and unbiased assessment of genomic divergences and regional variation of substitution rates among cattle, dog and human. It is expected that these data will serve as a baseline for future mammalian molecular evolution studies.
Many mammalian species have long served human society by providing food, materials, and labor, by offering companionship as pets, and by serving as model organisms for biological studies. Besides the seven mammals (human, mouse, rat, chimpanzee, macaque, dog and cattle) whose genomic sequence data are already available, 16 eutherian mammals have been proposed for low-coverage genome sequencing efforts. Comparative genomics has proven to be a powerful strategy for identifying important evolutionary changes among these mammalian species. Evolutionary changes that have shaped mammalian genomes include both small-scale events (point mutations, microsatellite slippage, insertions/deletions) and large-scale events (transpositions, genomic rearrangements and segmental duplications). Knowledge of mutation rates is critical for building evolutionary timescales, discovering conserved noncoding functional elements, identifying evolutionary processes like positive selection, and understanding heritable diseases.
Earlier studies on mammalian evolution were limited by the lack of large-scale genomic sequence data and depended on PCR cross-amplification of limited numbers of mitochondrial and nuclear genes. Therefore, the sampled sequences were often limited to closely related species and were biased towards conserved unique regions. As a result, repetitive sequences were excluded from genomic divergence calculations in these earlier studies. As the remnants of transposition events, repetitive sequences are one of the most predominant features of mammalian genomes (for example, 40–50% of the human genome consists of repeats) [4, 5]. Repeats have been shown to play an important role in mammalian genome evolution [6, 7]. Depending on their time of origin, repeats can be divided into ancestral repeats (AR: arrived before a speciation event, and thus shared by both species) and lineage-specific repeats (arrived after a speciation event). Recently it has been shown that virtually all ancient repeats evolve neutrally. As one class of nonfunctional neutral sequences, ancient repeats have been used to estimate neutral mutation rates [9–11]. Several recent studies have indicated that neutral mutation rates (as distinct from substitution rates, which reflect the combined effects of mutation and selection) in mammals have been relatively constant, except for the discrepant results from rodents, which were shown to mutate as much as 2-fold faster than other mammals [9, 11].
With the availability of the human, mouse, rat and chimpanzee genome assemblies, genome-wide comparisons and analyses have been generated using primates and rodents (such as human vs. non-human primates, mouse, and rat) [4, 5, 9, 13–15]. Targeted comparative sequencing efforts (ENCODE, the ENCyclopedia Of DNA Elements Project) have also generated megabases of high-quality genomic sequence for dozens of mammalian species [2, 16, 17]. Recent studies have also measured mutation rates, their regional variation [19, 20] and their covariation with other genomic events in human, mouse and rat [11, 21]. A local alignment algorithm, blastz, has been used to align the human, mouse and rat genomes [9, 14, 21]. On the other hand, a global alignment algorithm, mlagan, has been used to generate multiple alignments in the "greater CFTR region". A comparison of results derived from local versus global alignment algorithms would be of interest.
With the dog draft assembly (July 2004, canFam1), the cattle draft assembly (March 2005, bosTau2) and cattle BAC library resources  now available, a large-scale genomic comparison was initiated to assess the nature and pattern of genomic variation among other mammalian orders; i.e. artiodactyls (Cattle, Bos taurus) and carnivores (Dog, Canis familiaris) as compared to primates (Human, Homo sapiens). To avoid any potential genome assembly artifacts, the project began with high-quality finished genomic sequences from cattle BAC clones, rather than the cattle draft assembly. The three-way multi-species global alignments (ranging in alignment length from 67 to 491 kb) were generated from the orthologous sequences of cattle, dog and human using an optimized global alignment algorithm to provide a platform for analyzing genomic variation. The lineage, which led to the last common ancestor (LCA) of cattle and dog, was estimated to have diverged from human approximately 92 million years ago (mya) followed by the estimated separation of cattle and dog 83 million years ago [27, 28]. The overall objective of this study was to assess patterns of single-nucleotide mutations across genomic regions and among different sequence classes in the mammalian lineages.
A total of 84 ortholog trios were identified through a sequence similarity search; they comprised 10.5 Mb of cattle sequences, 9.3 Mb of dog sequences and 11.1 Mb of human sequences. The putative ortholog trios were further confirmed by reciprocal blast. These ortholog trios were placed on all human chromosomes (chr) except for chr 9, 15, 19 and Y (see Additional file 1 Table S3).
Table 1. Nucleotide divergence versus sequence class.
(The Total length (bp) and Aligned length (bp) columns are missing from this copy.)

Branch length (change/site)
Sequence class       Cattle             Dog                Human
Overall              0.1681 ± 0.0003    0.1547 ± 0.0003    0.2036 ± 0.0003
Overall-CG           0.1595 ± 0.0003    0.1451 ± 0.0002    0.1938 ± 0.0003
Coding               0.0644 ± 0.0010    0.0647 ± 0.0010    0.0595 ± 0.0009
UTR                  0.1223 ± 0.0016    0.1472 ± 0.0018    0.1161 ± 0.0015
Unique noncoding     0.1676 ± 0.0003    0.1538 ± 0.0003    0.2021 ± 0.0004
Repetitive           0.1830 ± 0.0006    0.1668 ± 0.0006    0.2221 ± 0.0007
Repetitive-CG        0.1749 ± 0.0006    0.1581 ± 0.0006    0.2129 ± 0.0007

Substitution rate* (× 10⁻⁹ change/site/year)
Sequence class       Cattle             Dog                Human
Overall              2.026 ± 0.003      1.864 ± 0.003      2.016 ± 0.003
Overall-CG           1.921 ± 0.003      1.748 ± 0.003      1.919 ± 0.003
Coding               0.776 ± 0.012      0.780 ± 0.012      0.589 ± 0.009
UTR                  1.473 ± 0.019      1.773 ± 0.022      1.059 ± 0.010
Unique noncoding     2.019 ± 0.004      1.853 ± 0.003      2.001 ± 0.004
Repetitive           2.205 ± 0.007      2.010 ± 0.007      2.199 ± 0.007
Repetitive-CG        2.108 ± 0.007      1.905 ± 0.007      2.108 ± 0.007
Despite the optimization of alignment parameters, suboptimal or ectopic alignments occasionally occurred. Suboptimal alignments, defined as those that exceeded 3 standard deviations of the mean pairwise K2 divergence in a sliding window analysis (see Methods), were removed using a post-alignment filter. Although such suboptimal alignments made up less than 5% of aligned bases, they were excluded from our analysis to avoid overestimating genomic divergence.
Comparative genomic analyses were performed on these 84 three-way multi-species global alignments. The branch lengths and substitution rates of cattle, dog and human are shown in Table 1. The average overall branch lengths were 0.1681 ± 0.0003, 0.1547 ± 0.0003, and 0.2036 ± 0.0003 change/site for cattle, dog and human, respectively. Similar branch lengths were reported in previous studies [12, 30]. The genomic divergence between cattle and dog was the smallest, with a value of 0.3228 ± 0.0005 change/site. The dog-human divergence was 0.3583 ± 0.0006 change/site, which was less than the cattle-human divergence of 0.3717 ± 0.0006 change/site. As expected, these results confirm that artiodactyls and carnivores are the closest relatives, with primates being the most distant.
Mutations at CpG dinucleotides occur frequently due to spontaneous deamination of methylated cytosines. To remove any variation caused by differences in levels of methylation, substitution rates were estimated after removing CpG dinucleotides (Overall-CG, Repetitive-CG). The overall branch lengths decreased 5.1% (cattle), 6.2% (dog) and 4.8% (human) after removing CpG dinucleotides from all sequences within alignments (Table 1, Overall-CG).
Alignments were further sorted into four sequence classes based on NCBI RefSeq and RepeatMasker coordinates using the software MaM. The total of 5.5 Mb of aligned sequence included 133 kb, 115 kb, 4.0 Mb and 1.2 Mb of aligned bases from coding, UTR, unique noncoding (i.e. not annotated), and repetitive regions, respectively. Coding regions of 193 well-annotated RefSeq genes excluded both 3' and 5' UTR. Branch lengths in coding regions (cattle 0.0644 ± 0.0010, dog 0.0647 ± 0.0010, and human 0.0595 ± 0.0009 change/site) were only half of the overall branch lengths, reflecting strong purifying selection.
The branch lengths in UTR regions (cattle 0.1223 ± 0.0016, dog 0.1472 ± 0.0018, and human 0.1161 ± 0.0015 change/site; Table 1) were significantly larger than the coding branch lengths (t-test, for each species P <0.0001). The branch lengths in unique noncoding portions (cattle 0.1676 ± 0.0003, dog 0.1538 ± 0.0003, and human 0.2021 ± 0.0004 change/site) were slightly less than the overall branch lengths. In contrast, the aligned repetitive portions possessed the longest branch lengths (cattle 0.1830 ± 0.0006, dog 0.1668 ± 0.0006, and human 0.2221 ± 0.0007 change/site). These branch lengths decreased 4.4% (cattle), 5.2% (dog) and 4.1% (human) when CpG dinucleotide sites were excluded, suggesting higher substitution rates at CpG sites (Table 1, Repetitive-CG). The differences between the branch lengths in unique noncoding and repetitive portions were significant both before and after removing CpG dinucleotides from repetitive elements (one-way ANOVA, cattle P = 0.0006, dog P = 0.0116, and human P <0.0001) for all 83 autosomal alignments.
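The pairwise divergences and CpG effects quoted above can be checked against the branch lengths with simple arithmetic. The sketch below assumes that, on the unrooted three-species tree, the divergence between two species is the sum of their two branch lengths; the reported values are consistent with this, but the function names are illustrative and not taken from the authors' pipeline.

```python
# Overall branch lengths (change/site) as reported in the text and Table 1.
BRANCH = {"cattle": 0.1681, "dog": 0.1547, "human": 0.2036}
# Overall branch lengths after removing CpG dinucleotides (Overall-CG).
BRANCH_NO_CPG = {"cattle": 0.1595, "dog": 0.1451, "human": 0.1938}


def pairwise_divergence(branch, a, b):
    """Assumed relationship: on an unrooted three-taxon tree, the path
    between two tips is the sum of their terminal branch lengths."""
    return branch[a] + branch[b]


def percent_decrease(before, after):
    """Relative decrease, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for a, b in [("cattle", "dog"), ("dog", "human"), ("cattle", "human")]:
        print(f"{a}-{b}: {pairwise_divergence(BRANCH, a, b):.4f} change/site")
    for sp in BRANCH:
        drop = percent_decrease(BRANCH[sp], BRANCH_NO_CPG[sp])
        print(f"{sp}: {drop:.1f}% shorter without CpG sites")
```

Under this assumption the sums reproduce the reported cattle-dog, dog-human and cattle-human divergences to four decimal places.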
Substitution rates were calculated from the LCA of cattle and dog assuming branch times of 83, 83 and 101 million years for the cattle, dog and human lineages, respectively [27, 28]. A dramatic variation of substitution rates was observed between and within chromosomes according to the human placement. Table S4 (see Additional file 1) summarizes the substitution rates of AR for each individual clone or locus on each chromosome. Chromosome X accumulated fewer substitutions than autosomal chromosomes (cattle 1.771 ± 0.045, dog 1.680 ± 0.043, and human 2.083 ± 0.049 × 10⁻⁹ change/site/year), supporting the existence of a higher mutation rate in the male than in the female germline. Among autosomal chromosomes, HSA10 (human chromosome 10) showed higher substitution rates (cattle 2.372 ± 0.057, dog 2.417 ± 0.058, and human 2.583 ± 0.059 × 10⁻⁹ change/site/year) compared to rates in chromosome 11 (cattle 2.151 ± 0.028, dog 1.916 ± 0.025, and human 2.022 ± 0.025 × 10⁻⁹ change/site/year). Substitution rates for HSA10 and HSA16 were significantly higher, while those for HSA14, HSA12 and HSA7 were significantly lower, when compared to the average substitution rates in repetitive regions (t-test, all P <0.0001, see Additional file 1 Table S4).
Similarly, substitution rates varied significantly among individual clones or loci within one chromosome (see Additional file 1 Fig. S2, Table S4). For example, contig 01.01 (mapped to HSA7:30,585,342-30,707,957 and CFA14:46,029,765-46,135,257) showed high substitution rates (cattle 2.371 ± 0.080, dog 1.788 ± 0.065, and human 2.020 ± 0.068 × 10⁻⁹ change/site/year), while contig 33.39 (mapped to HSA7:114,308,522-114,473,710 and CFA14:56,758,901-56,922,307) demonstrated low substitution rates (cattle 1.998 ± 0.081, dog 2.092 ± 0.083, and human 2.264 ± 0.086 × 10⁻⁹ change/site/year), even though both belonged to the same chromosomes (HSA7 and CFA14).
Histograms of substitution rates in non-overlapping 3-kb sliding windows for overall (A) and repetitive (B) sequences (with and without CpG sites) are shown in Fig. 1. ANOVA tests were performed on variation in branch lengths of 3-kb nonoverlapping windows between and within autosomal chromosomes for each species. These included 6 types of sequences: Overall, Overall-CG, Unique noncoding, Unique noncoding-CG, Repetitive, and Repetitive-CG. The overall sequence comprised 83 autosomal alignments containing 1761 windows; the unique noncoding regions comprised 83 autosomal alignments containing 1290 windows; and the repetitive regions comprised 83 autosomal alignments containing 347 windows. All tests were statistically significant at P <0.0001.
Loci with lower overall divergences were inspected for the presence of underlying RefSeq genes. As expected, many protein coding genes were under functional constraints. Such constraints, for example those on the FOXP2, MET and SCAP2 genes within the greater CFTR region, may explain the low overall divergences observed within that part of HSA7. Interestingly, a few protein coding genes were also detected when loci with high overall divergences were examined. These included CSMD2 (contig 38.45, HSA1:33,820,824-33,883,038), FDFT1 [40, 41] and CTSB (contig 03.03, HSA8:11,698,086-11,762,835) [40, 42, 43]. These loci retained higher substitution rates even if only the AR regions were considered (see Additional file 1 Table S4).
The overall substitution rates were estimated to be 2.026 ± 0.003, 1.864 ± 0.003, and 2.016 ± 0.003 × 10⁻⁹ change/site/year for cattle, dog and human, respectively (Table 1). Estimates of neutral mutation rates using ancient repeats (cattle 2.205 ± 0.007, dog 2.010 ± 0.007, and human 2.199 ± 0.007 × 10⁻⁹ change/site/year) were comparable to those of previous studies (2.1–3.7 × 10⁻⁹ change/site/year) and agreed almost perfectly with the estimates from the human-mouse comparisons (i.e. 2.2 × 10⁻⁹ and 4.5 × 10⁻⁹ change/site/year in the human and mouse lineages). In all cases in Fig. 1 (Overall, Overall-CG, Repetitive and Repetitive-CG), the distributions of dog substitution rates (green) were shifted slightly to the left of those of cattle rates (blue), consistent with the faster rate of substitution in the cattle branch compared with the dog branch.
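The overall substitution rates above follow from dividing each branch length by its assumed branch time (83, 83 and 101 million years). The following minimal sketch reproduces that conversion; the function name and dictionary layout are illustrative assumptions, and small differences in the last digit arise because the published branch lengths are rounded.

```python
# Assumed branch times in millions of years, as stated in the text.
BRANCH_TIME_MY = {"cattle": 83, "dog": 83, "human": 101}
# Overall branch lengths (change/site) from Table 1.
OVERALL_BRANCH = {"cattle": 0.1681, "dog": 0.1547, "human": 0.2036}


def substitution_rate(branch_length, time_my):
    """Substitution rate in change/site/year for a branch of the given
    length (change/site) spanning `time_my` million years."""
    return branch_length / (time_my * 1e6)


def rates_e9(branch_lengths, times=BRANCH_TIME_MY):
    """Rates for every species, expressed in units of 1e-9 change/site/year."""
    return {sp: substitution_rate(bl, times[sp]) * 1e9
            for sp, bl in branch_lengths.items()}


if __name__ == "__main__":
    for species, rate in rates_e9(OVERALL_BRANCH).items():
        print(f"{species}: {rate:.3f} x 10^-9 change/site/year")
```

The human branch time of 101 million years is consistent with the unrooted path from the cattle-dog ancestor (83 mya) up to the split from human (92 mya) and down to present-day human, i.e. 9 + 92 million years, although this decomposition is an inference rather than a statement in the text.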
Relative rate tests were performed on a single merged alignment and on each of the 84 multiple alignments using Tajima's method [44, 45]. Differences in mutation counts were assessed using the χ² test based on the assumption that mutation would not show a species preference. When human was used as an outgroup, cattle had faster rates of substitution than dog. Although the difference was relatively small (6%), it was significant by the χ² test (P <0.0001) when the merged alignment was tested. Almost two-thirds (54 out of 84) of the individual alignment rate tests supported that cattle had faster rates, while 11 of these rate tests supported that dog had faster rates (including 5 from the greater CFTR region). The remaining 19 out of 84 tests supported the molecular clock hypothesis for the cattle and dog lineages (including 3 from the greater CFTR region).
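For readers unfamiliar with Tajima's relative rate test, the sketch below illustrates its simplest form: count the sites at which exactly one ingroup sequence differs from the outgroup while the other ingroup sequence matches it, then compare the two counts with a one-degree-of-freedom χ² statistic. This is a standalone illustration, not the MEGA3 implementation used in the study; the function names and the toy sequences are hypothetical.

```python
import math

VALID = set("ACGT")


def tajima_counts(seq1, seq2, outgroup):
    """Count sites unique to each ingroup lineage.

    m1: seq1 differs from the outgroup while seq2 matches it.
    m2: seq2 differs from the outgroup while seq1 matches it.
    Columns with gaps or ambiguous bases are skipped.
    """
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if a not in VALID or b not in VALID or o not in VALID:
            continue
        if a != o and b == o:
            m1 += 1
        elif b != o and a == o:
            m2 += 1
    return m1, m2


def tajima_test(m1, m2):
    """Return (chi-square, p-value) for the null hypothesis m1 == m2."""
    if m1 + m2 == 0:
        raise ValueError("no informative sites")
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Toy aligned sequences (hypothetical, for illustration only).
    cattle, dog, human = "ACGTTCGA", "ACGTACGA", "ACGTACGT"
    m1, m2 = tajima_counts(cattle, dog, human)
    print(m1, m2, tajima_test(m1, m2))
```

A significant result rejects equal rates in the two ingroup lineages; a non-significant result is compatible with a local molecular clock.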
One of the fundamental challenges in large-scale comparative genomic analysis is to build biologically meaningful multiple sequence alignments [18, 46]. A variety of biological events are known to create insertions/deletions, including lineage-specific amplification of tandem repeats, homology-mediated genomic deletions and transposition events. Local alignment algorithms, combined with the removal and reinsertion strategy of repeat elements, have been shown to reduce the number of gaps in DNA alignments and increase sensitivity [22, 47]. This is particularly important for aligning species such as rodents, which have high genome-wide substitution rates. However, the aligned ancient repeats may be enriched for those in more slowly changing regions, while the fast changing repeats may be too divergent for detection and alignment. On the other hand, global alignment algorithms seem appropriate for species with low substitution rates like cattle, dog and human. Comparative gene mapping and chromosome painting studies have indicated that a remarkably slow rate of chromosomal change exists within several mammalian orders. Artiodactyls and carnivores are more conserved relative to humans than rodents are [48–53]. In terms of genomic divergence, previous data also suggest that cattle and dog are more conserved relative to human. However, global alignment algorithms assume colinearity between sequences and do not specifically handle synteny-breaking events like transpositions, rearrangements (such as microinversions) or duplications. For example, global alignment algorithms may handle closely matched lineage-specific repeats, such as young SINEs and LINEs, poorly, creating suboptimal alignments. These suboptimal alignments may lead to less accurate estimates of sequence divergence. Therefore, in this study, alignment parameters were optimized and a post-alignment filter was applied to overcome this limitation of the global alignment algorithm.
The post-alignment filter effectively removed the suboptimal alignments from the mlagan output. Such suboptimal alignments appeared abnormal because they had extreme fluctuations in genomic divergence compared to their flanking sequences and were always associated with multiple gaps. The similarity between the genomic divergences obtained in the current study and those in earlier reports [10, 12, 30] confirms that our sequence datasets were representative and our alignment strategies were successful.
Our orthologous sequence datasets, comprising 10.5 Mb of cattle sequences, 9.3 Mb of dog sequences and 11.1 Mb of human sequences, were placed on all human chromosomes except for chr 9, 15, 19 and Y (see Additional file 1 Table S3). As a control for sample bias and rate variation among these genomic regions, we mapped randomly selected cattle BAC end sequences (73,728 BES from CHORI-240) onto the human genome assembly Build35. A comparison of these BES alignments to our large-scale genomic alignments showed comparable results (G.E. Liu et al., unpublished results). Therefore, it is reasonable to believe that these datasets are sufficiently representative and robust to draw sound conclusions regarding rates and properties of mammalian genomic mutation.
However, our estimates were consistently larger than those in an earlier study of the greater CFTR region and revealed significant rate differences between the cattle and dog lineages. Reanalysis of the alignments in that study (116 kb cattle, 122 kb dog, and 332 kb human sequences, 68 kb aligned bases) indicated that the dog-human divergence (0.3335 ± 0.0046 change/site) was significantly higher than the cattle-human divergence (0.3237 ± 0.0045 change/site) (relative rate test, P <0.001). Comparable divergences were derived from our AR regions (369 kb cattle, 369 kb dog, and 485 kb human sequences, 157 kb aligned bases) from the same region (dog-human: 0.3856 ± 0.0035 change/site and cattle-human: 0.3870 ± 0.0035 change/site). In our study, no significant rate difference was detected between cattle and dog in this region (relative rate test, P = 0.251). One possible explanation is that the global alignment algorithm mlagan was used to create multiple alignments in the current study, while pairwise alignments were constructed by the local alignment algorithm blastz in the earlier study. As discussed above, local alignment algorithms are known to be less efficient in identifying fast changing ancient repeats, which may be too divergent to detect and align. This could lead to the underestimation of genomic divergences. On the other hand, a global alignment algorithm can recover the fast changing orthologous ancient repeats by taking into consideration the conservation of nearby unique flanking sequences. Discrepancies in the significance of rate variation between the small and large datasets further highlight the importance of a large-scale sampling strategy.
As expected, different sequence classes were under different purifying selection pressures. Coding regions were under the strongest functional constraints with substitution rates at only half that of the overall substitution rates. It is interesting to note that substitution rates in unique noncoding portions were slightly less than overall substitution rates suggesting they may be under weak negative selection due to unidentified functional regions, regulatory domains, or unknown genes. Significantly higher substitution rates in repetitive elements before or after removing CpG dinucleotides indicate that CpG content is only partially the reason for high substitution rates. In addition, other factors like increased rates of gene conversion, relaxed purifying selection and unequal crossover among repeats may contribute to our observations.
The quadratic relationships between substitution rate, branch length, K2 divergence, indel rate per 10 kb and GC% were derived to explain regional variation. These results suggest that fluctuations in GC% predict an appreciable amount of the regional variation that was observed in mutation and indel rates, but leave the majority of the variation unexplained. Additional causes beyond GC%, including CpG content, recombination and other as of yet unknown factors are needed to explain the variation among mutation rates. Significant variation in mutation rates across genomic regions and among sequence classes strongly demonstrates that future studies of genomic variation should include multiple regions from different chromosomes. Another important observation is that regional variation in mutation rate is correlated among cattle, dog and human lineages over time. Regional correlations of mutation rates have been demonstrated and quantified genome-wide in human-chimpanzee, human-mouse and human-rat comparisons [9, 14, 20].
It is also interesting to note that a handful of protein coding genes were detected within a few cattle BAC clones with high neutral mutation rates. Several possible nonexclusive explanations for this phenomenon exist. For instance, the sequences compared may not have been orthologous: within one gene family, paralogous genes could be confused with orthologous genes. Gene conversion may have occurred, which could considerably increase the genomic divergence. In addition, high mutation rates or relaxed purifying selection could have occurred due to gene duplication [56, 57]. These possibilities warrant further investigation. However, such rare events are unlikely to change our estimates of mutation rates significantly.
Measurement of the neutral mutation rate is crucial for validating molecular clock and neutral evolution theories [58, 59]. The neutral mutation rate has been approximately estimated from neutral or close to neutral non-functional sites such as introns, pseudo-genes, unique noncoding intergenic regions, four-fold degenerate sites (4D sites) in coding regions (i.e. third codon position) and shared ancestral repeats. One way to identify regions under positive selection is to focus on DNA segments with significantly higher mutation rates . Genomic regions that are changing significantly slower than the neutral rate because of purifying selection contain potentially conserved noncoding functional elements [11, 21].
Estimates of the neutral mutation rates in this study, which are in agreement with many previous reports [2, 12, 30], show that mutation rates in the cattle and dog lineages are slower than those in rodents. However, our estimates of around 2.0–2.2 × 10⁻⁹ change/site/year are at the lower end of the reported range (2.1–3.7 × 10⁻⁹ change/site/year). These differences could result from the use of 4D sites in the earlier studies, as nucleotides in coding regions may not be an ideal dataset because of codon usage bias and potential weak selection. Regions that harbored large, low copy repeat sequences were excluded in this study to unambiguously determine the orthologous relationship. Such segmentally duplicated regions may significantly inflate estimates of divergence due to non-orthologous sequence relationships [46, 60] or gene conversion.
The dataset presented here, though much larger than those used previously [2, 12], is still a small part (0.4%) of the cattle, dog and human genomes. It is also worth noting that a number of the common assumptions made about neutral mutation, genetic drift, generation time and population size can affect these estimates [34, 61], and rate calculations could be confounded by incorrect estimates of species divergence times. More comprehensive genome sequences and polymorphism data will be required to further clarify the important role of mutation rates in mammalian evolution. Further study of the molecular mechanisms behind mutation will be essential to understand the causes of mutation rate variation. Additional analyses will become feasible as the bovine genome approaches the finishing stage.
After the completion of this study, a comprehensive comparative analysis of the domestic dog genome reported similar genomic divergence estimates between dog and human.
The unique features of this study are that 1) optimal multiple (not pairwise) alignments were carefully constructed using a global (not local) alignment algorithm; 2) the scale was considerably larger than in earlier reports, which used small datasets of protein coding sequences or targeted genomic regions; and 3) our results were statistically significant and unbiased, as supported by the mapping results of genome-wide randomly selected cattle BAC end sequences.
Therefore, this analysis provides a large-scale and unbiased assessment of genome divergences and regional variations of substitution rates among cattle, dog and human. Cattle had faster average rates of substitution as compared to dog and the difference was 6%. The global molecular clock needs to be adjusted to fit rates among mammalian species. These data will serve as a valuable baseline for future molecular evolution studies, especially in cattle and other livestock like sheep and pig.
The comparative analyses performed in the current study were similar to those previously published. However, several improvements were made to the previous analyses: 1) the use of three-way multiple sequence alignments instead of comparison of several pairwise alignments; 2) the application of REV rate matrices and ML methods using the PAML package in addition to the simple K2 calculation; and 3) the optimization of alignment parameters and filter thresholds to deal with larger sequence divergences.
Large finished genomic sequences from cattle BAC libraries (CH240 and RP42) were retrieved from GenBank. Cattle sequence segments longer than 50 kb were then extracted and masked for common repeat elements [63, 64]. Orthologous dog and human sequences were identified by sequence similarity searches of cattle sequence queried against a formatted version of the assembled dog (canFam1, July 2004) and human (hg17, May 2004) genomes using the following options (blastall -p blastn -U T -e 1e-05 -q -2 -r 1 -W 11 -G 3 -E 1 -b 25). Overlapping sequences within a species were excluded based on the genome assembly coordinates and sequence identity. We excluded any accession located within a known duplicated region of the human genome, because duplicated regions of the genome complicate identification of orthologous segments and confound genomic divergence estimates [18, 46]. Because the assembly of the dog genome is based on only seven-fold "shotgun" sequence coverage, our analysis was limited to genomic sequences that were completely finished and contained no gaps or internal ambiguous bases. A total of 84 cattle clones and subclones (see Additional file 1 Table S3) met these criteria: 69 were generated by the Baylor College of Medicine Human Genome Sequencing Center; 12 were generated at the National Institutes of Health Intramural Sequencing Center as a part of a targeted comparative sequencing effort (ENCODE, the ENCyclopedia Of DNA Elements Project) [2, 16, 17]; and the remaining 3 were generated at the University of Oklahoma, Advanced Center for Genome Technology. A complete list of all accessions, their consensus assemblies, their map locations with respect to the genomes of dog and human and their sequence attributes is provided (see Additional file 1 Table S3).
Orthologous sequences were extracted using parasight visualization software (J.A. Bailey, unpublished results). The mlagan algorithm was used to construct all three-way multiple sequence global alignments. A subset of gap opening and gap extension penalties was chosen to minimize the frequency of both single-nucleotide substitution and insertion/deletion events in order to provide the most biologically meaningful optimal global alignment (see Results and Discussion). Among equally parsimonious gap parameters, the selected parameters (gap opening penalty of -1,000 and gap extension penalty of -10) were used so that known "young" transposition events were treated as a single insertion/deletion event. A total of 84 three-way multiple alignments for cattle, dog, and human (total alignment length ~15.5 Mb) were constructed with mlagan using ~10 Mb of genomic sequence from each species. All alignments were manually inspected for extreme fluctuations in genomic divergence. A suboptimal alignment was defined as any alignment which exceeded 3 standard deviations of the mean pairwise genomic divergence (window size 2 kb, slide 100 bp). These regions were considered separately in the analysis (Table 1). A total of 89 such subalignments were classified as suboptimal for cattle (732 kb), dog (619 kb), and human (822 kb). Only a small fraction (<5%) of all aligned bases was classified as suboptimal.
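The sliding window criterion described above can be sketched as follows. This is an illustration only: it assumes that "exceeded 3 standard deviations of the mean" means a window divergence above the mean plus 3 standard deviations, and it uses the raw proportion of mismatches per window as a stand-in for the K2 divergence. The function names are hypothetical and do not correspond to the authors' filter.

```python
from statistics import mean, pstdev

BASES = "ACGT"


def window_divergences(seq_a, seq_b, window=2000, step=100):
    """Return (start, divergence) for each sliding window over two aligned
    sequences. Divergence here is the proportion of mismatched columns
    among columns where both sequences carry an unambiguous base."""
    results = []
    last_start = max(len(seq_a) - window, 0)
    for start in range(0, last_start + 1, step):
        pairs = [(x, y)
                 for x, y in zip(seq_a[start:start + window],
                                 seq_b[start:start + window])
                 if x in BASES and y in BASES]
        if pairs:
            mismatches = sum(x != y for x, y in pairs)
            results.append((start, mismatches / len(pairs)))
    return results


def flag_suboptimal(divergences, n_sd=3.0):
    """Return window starts whose divergence exceeds mean + n_sd * SD
    (assumed reading of the 3-standard-deviation threshold)."""
    values = [d for _, d in divergences]
    cutoff = mean(values) + n_sd * pstdev(values)
    return [start for start, d in divergences if d > cutoff]


if __name__ == "__main__":
    # Synthetic example: 20 ordinary windows and one extreme window.
    divs = [(i * 100, 0.30) for i in range(20)] + [(2000, 0.90)]
    print(flag_suboptimal(divs))
```

Overlapping flagged windows would then be merged into contiguous subalignments and set aside, as the text describes.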
The branch lengths were calculated by maximum likelihood using version 3.15 of the PAML package, which accommodates base frequency change, exchangeability among all bases and rate heterogeneity across sites (Table 1). The most general reversible substitution model (REV) was used (model = 7), rate variation among sites was allowed (fix_alpha = 0 and ncatG = 5), no molecular clock was assumed (clock = 0), unrooted trees were used, and ambiguity characters were discarded (cleandata = 1). Kimura's two-parameter (K2) method, which corrects for multiple events and transversion/transition mutational biases, was used to estimate genomic divergences in pairwise comparisons. Genomic divergences or branch lengths were always reported as the means ± their standard deviations. Insertion/deletion events were not factored into these calculations. Coding, UTR, unique noncoding and repetitive regions from the sequence alignments were extracted using MaM (Multiple Alignment Manipulator) [36, 72]. Repeat coordinates were identified using the slow option of RepeatMasker v3.0.5 with an updated RepBase library for cattle. Five major classes of repeats were considered in this analysis (LINEs, SINEs, LTR, DNA transposons, and others). In order to eliminate the possibility that more divergent or novel common repeats may not have been effectively masked by RepeatMasker, intraspecific sequence-similarity searches were performed. Exon definition was limited to well-annotated human genes (NCBI RefSeq) [35, 73]. Among these, a total of 1,909 exons corresponding to 193 genes were analyzed. The coding regions were extracted from exonic sequences between CDS start and end sites. The UTR regions contained both the 5'-UTR (between transcription start and CDS start sites) and the 3'-UTR (between CDS end and transcription end sites). Unique noncoding regions excluded both exonic and repetitive regions. Non-overlapping sliding window analyses (3-kb in Fig. 1 and Additional file 1 Fig. S5A, and 500-bp in Fig. S5B) were performed using align_slider (J.A. Bailey, unpublished results). Substitution rates were calculated from the LCA of cattle and dog as branch length/time, assuming branch times for the cattle, dog and human lineages of 83, 83 and 101 million years, respectively [27, 28]. All alignment attributes were maintained within a MySQL database, which facilitated cross-referencing with various properties of the genomic sequence. Tajima's relative rate tests were performed on multiple alignments using MEGA3. ANOVA was performed to test variation in branch lengths of whole alignments or 3-kb nonoverlapping windows between and within autosomal chromosomes in cattle, dog, and human. Quadratic regression fits were implemented using the SigmaPlot software package.
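As a concrete illustration of the K2 calculation mentioned above, the sketch below computes Kimura's two-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. It is a minimal standalone version written for this explanation, not the code used in the study, and it skips gapped or ambiguous columns, in line with the statement that insertion/deletion events were not included.

```python
import math

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def k2_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing gaps or ambiguous bases are ignored. math.log
    raises ValueError when the sequences are too divergent (saturated)
    for the correction to be defined.
    """
    sites = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in BASES or y not in BASES:
            continue
        sites += 1
        if x == y:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    # Toy example: one transition (A>G) and one transversion (A>C) in 10 sites.
    print(round(k2_distance("AACCGGTTAA", "GACCGGTTAC"), 4))
```

In this toy example the corrected distance is larger than the raw proportion of differences (0.2), which is the effect of correcting for multiple substitutions at the same site.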
LCA: last common ancestor
mya: million years ago
BES: BAC end sequences
BTA: Bos taurus autosome
CFA: Canis familiaris autosome
HSA: Homo sapiens autosome
We thank four anonymous reviewers for helpful comments on the manuscript. We thank M.D. Adams, E.E. Connor, and L.C. Gasbarre for critical reading of the manuscript, and R.A. Gibbs, S.M. Kappes, and G.M. Weinstock for helpful comments in the preparation of this manuscript. We thank M. Brudno for insights into the lagan and mlagan alignment score matrices. This work was supported in part by USDA CRIS Project No. 1265-31000-090-00D and 1265-31000-081-00D. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.