Identification of mammalian orthologs using local synteny
© Jun et al; licensee BioMed Central Ltd. 2009
Received: 23 June 2009
Accepted: 23 December 2009
Published: 23 December 2009
Accurate determination of orthology is central to comparative genomics. For vertebrates in particular, very large gene families, high rates of gene duplication and loss, multiple mechanisms of gene duplication, and high rates of retrotransposition all combine to make inference of orthology between genes difficult. Many methods have been developed to identify orthologous genes, mostly based upon analysis of the inferred protein sequence of the genes. More recently, methods have been proposed that use genomic context in addition to protein sequence to improve orthology assignment in vertebrates. Such methods have been most successfully implemented in fungal genomes and have long been used in prokaryotic genomes, where gene order is far less variable than in vertebrates. However, to our knowledge, no explicit comparison of synteny and sequence based definitions of orthology has been reported in vertebrates, or, more specifically, in mammals.
We test a simple method for the measurement and utilization of gene order (local synteny) in the identification of mammalian orthologs by investigating the agreement between coding sequence based orthology (Inparanoid) and local synteny based orthology. In the 5 mammalian genomes studied, 93% of the sampled inter-species pairs were found to be concordant between the two orthology methods, illustrating that local synteny is a robust substitute to coding sequence for identifying orthologs. However, 7% of pairs were found to be discordant between local synteny and Inparanoid. These cases of discordance result from evolutionary events including retrotransposition and genome rearrangements.
By analyzing cases of discordance between local synteny and Inparanoid we show that local synteny can distinguish between true orthologs and recent retrogenes, can resolve ambiguous many-to-many orthology relationships into one-to-one ortholog pairs, and might be used to identify cases of non-orthologous gene displacement by retroduplicated paralogs.
The accurate determination of orthology is central to comparative genomics. Pinpointing the origin of new genes, understanding the evolution of new gene families, and assessing the impact of gene and genome duplication events all require the accurate assignment of orthology between genes in distinct genomes. In complex genomes with large gene families this task requires differentiating between genes that have diverged through a speciation event (orthologs) and those derived through duplication events within a species (paralogs). Determination of orthology and paralogy is especially challenging in vertebrates species such as the mammals. Very large gene families, high rates of gene duplication and loss, multiple mechanisms of gene duplication, and high rates of retrotransposition all combine to make the determination of orthology between vertebrate genes difficult.
Given the importance of accurate orthology assignment, many methods have been developed to identify orthologous genes. Most of these methods rely upon analysis of the inferred protein sequence of the genes in question by clustering the results of protein sequence comparisons results in the classification of putative orthologs. Examples of this approach include reciprocal best BLAST hits and more inclusive BLAST based clustering methods. Splitting of these clusters based on relative similarity can distinguish between older and newer duplication events and is implemented in the widely used Inparanoid algorithm  and related approaches . While these methods are robust and easily implemented, they rely upon a single character, the protein sequence, for classifying genes into orthologous groups.
More recently, methods have been proposed that use genomic context in addition to protein sequence to improve orthology assignment. These methods have been most successfully implemented in fungal genomes [3, 4], and have long been used in prokaryotic genomes [5, 6], where gene order is far less variable than in eukaryotes. An interesting implementation of this approach is found in the SOAR and MSOAR algorithms [7, 8], which seek to assign orthology by minimizing the recombination distance between two genomes. In most of these approaches, synteny blocks covering some percentage of the genome are used hierarchically with protein coding information to assign orthology between similar genes. Approaches that exploit synteny information can be particularly useful in resolving ambiguous sequence based matches between putative orthologs. Recently, Han and Hahn  used local synteny information to identify parent-daughter relationships among duplicated genes. However, it is worth noting that "phylogenetic shadowing"  approaches used in genome assembly might lead to a lack of independence between sequence and synteny information.
In this study, we evaluate a simple method for using gene order (local synteny) in the identification of mammalian orthologs. We explicitly compare the relative performance of local synteny and Inparanoid for inferring orthology within this clade and show that local synteny alone is sufficient to identify orthologs with an accuracy comparable to that of Inparanoid. By analyzing cases of discordance between local synteny and Inparanoid we show that local synteny can distinguish between true orthologs and recent retrogenes, can resolve ambiguous many-to-many orthology relationships into one-to-one ortholog pairs, and can highlight possible cases of parental gene displacement by retrogenes.
Using Local Synteny to Infer Orthology
Theoretically two non-orthologous genes should not share homologous neighboring genes. However, there is a small probability of homology matches occurring by chance. Moreover rearrangements, insertions, and deletions will lead to loss of local synteny between orthologous genes. In order to account for these events, we wanted to determine an optimal window size and match percentage that could reliably identify orthologs based on local synteny. In Figure 2b, we test the impact of the number of matches between six neighboring genes to determine local synteny. Student's t tests comparing Protdist means illustrate that 0, 1, and >1 matches have significantly different Protdist means (p-value ≤ 0.05). In order to choose the threshold of homologous matches, we calculate the false positive rate (FP) and false negative rate (FN) to Inparanoid orthologs and Ensembl orthologs then choose the threshold that minimizes the sum of FP and FN events. For six neighbors, the pairs with more than one homologous match minimizes FP and FN rates (0.152 to Inparanoid orthologs, and 0.151 to Ensembl orthologs (see Additional file 1)). Increasing the window size to 10 or 20 flanking genes does not show a significant difference in detecting orthology (see Additional file 1). Based on these results we define orthology by local synteny when the number of maximum unique homologous matches between the six neighboring genes is greater than one. We will refer to those pairs as syntenic from now on.
False Positive/False Negative Rates Estimated by LCA (Latent Class Analysis)
Since many orthology detection methods use more complicated algorithms than just coding sequence similarity, a high correlation between Protdist and local synteny (Figure 2) is not sufficient evidence that local synteny captures true orthology. For a more rigorous analysis we compared local synteny based orthology to the orthology relationships inferred by six well-known orthology detection methods: Inparanoid , OrthoMCL , RBH (Reciprocal Best Hit), SBH (Single-way or One-way Best Hit), BLASTP, and orthology data from Ensembl . Since there is no gold standard of orthology, we performed Latent Class Analysis (LCA) [15, 16] to estimate the accuracy (sensitivity and specificity) in the absence of a reliable standard. LCA estimates false positive (FP) and false negative rates (FN) based on agreement and disagreement between various ortholog definitions. To minimize sampling bias 10 LCA's were performed on random samples with a size of 1,000 orthologous genes from five mammalian genomes using the same sampling method described in the previous section. For one sample 1,000 Pfam families were randomly selected, then one cross-species protein pair was selected from each Pfam family for analysis by the seven compared orthology inference methods. For all methods based on coding sequence similarity, the longest proteins were used as the representative proteins of the genes. A more detailed description can be found in the methods section.
Discordance between Inparanoid Orthology and Local Synteny Based Orthology
Figure 4 shows ICR histograms for pairs of genes falling in each of the four classes of agreement/disagreement between Inparanoid and local synteny. In both concordance cases, namely for syntenic Inparanoid orthologs (Figure 4a) and non-syntenic non-Inparanoid orthologs (Figure 4d), ICR is in strong agreement with the orthology assignments made by the two methods. Indeed, most of the syntenic Inparanoid orthologs have ICR of 1 (Figure 4a), and the majority of non-syntenic non-Inparanoid orthologs have ICR < 0.5 (Figure 4d). In the two discordant cases (7% of the evaluated gene pairs) intron evidence can be used to resolve the conflicting assignments made by Inparanoid and local synteny. About 3/4 of syntenic non-Inparanoid orthologs (Figure 4b) have ICR > 0.5 and half of non-syntenic Inparanoid orthologs (Figure 4c) have ICR = 0. This suggests that in these cases local synteny based orthology assignments are more often concordant with gene structure evidence (ICR) than those based on coding sequence similarity. However, the ICR histogram of non-syntenic Inparanoid orthologs (Figure 4c) has a bimodal distribution, which might arise from a mixture of FNs from local synteny (pairs with high ICR) and FPs from Inparanoid (pairs with low ICR). We further investigate these cases in the following subsections.
Non-syntenic Inparanoid orthologs with zero ICR: Retrotransposed copies
The rest of non-syntenic 0-ICR Inparanoid orthologs (1.0%) of all samples are classified as one-to-many orthologs by Inparanoid. In many of these one-to-many orthologs, syntenic high-ICR pairs are chosen as "core orthologs" (based on the Inparanoid scores) with additional RT copies added to the orthology group because their protein sequences are still close to those of the original copies. By using local synteny, the RT copies in one-to-many orthologs would be distinguished from the other members. Thus, local synteny based orthology separates RT copies from those generated by speciation or other duplication mechanisms and can be more informative in recovering the evolutionary history of a gene family.
Non-syntenic Inparanoid orthologs with non-zero ICR: Loss of local synteny
All non-syntenic Inparanoid orthologs with an ICR > 0 (1.3%) are likely to result from the loss of local synteny. They each have one homologous match between neighboring genes (lower than the threshold of being syntenic) and are selected from distant species pairs (i.e. not human-chimp or mouse-rat). Each of these distant pairs is part of larger orthology groups (5-species), in which the counterparts in closer species pairs have higher local synteny (more than 1 match). This loss of local synteny likely results from rearrangements and gene insertions or gene losses in more distant species. See Additional file 3 for an example.
Syntenic non-Inparanoid orthologs: Distant paralogs
There are 2.3% syntenic non-Inparanoid orthologs with an ICR ≥ 0.5 (Figure 4b). The majority (2.1/2.3%) of syntenic non-Inparanoid orthologs with high ICR are likely distant paralogs: both genes are in different syntenic Inparanoid orthology groups with high ICR. This may result from old DNA-mediated duplication events followed by speciation events without significant local genome rearrangements. Local synteny cannot distinguish between orthologs created by large-scale segmental duplications and by polyploidy events. Moreover, the syntenic non-Inparanoid pairs with an ICR < 0.5 (1.1% of all tested pairs) are also likely distant paralogs, by old tandem duplications with different gene structures resulting in lower ICR.
Non-syntenic non-Inparanoid orthologs with ICR of 1: More distant paralogs
Summary of disagreement among three measures: Inparanoid orthology, local synteny based orthology, and intron conservation ratio (ICR).
Non-syntenic Inparanoid orthologs
ICR = 0
RT miscalls (0.2%)
Part of 1-many (1.0%)
ICR > 0
Loss of local synteny (1.3%)
Syntenic non-Inparanoid orthologs
ICR ≥ 0.5
RT miscalls (0.2%)
Distant paralogs (2.1%)
ICR < 0.5
Distant paralogs (1.1%)
Non-syntenic non-Inparanoid orthologs
ICR = 1
Distant paralogs (2.2%)
Local Synteny Breaks the Tie
Since local synteny is able to differentiate some speciation and duplication events, a phylogenetic tree (or duplication-speciation history) of ambiguous many-to-many orthologs may be determined using local synteny information. In mouse-to-rat ortholog definitions from Inparanoid, there are 131 many-to-many ortholog groups. The majority (~75%) of many-to-many groups are comprised of DNA-mediated duplicated copies (usually from tandem duplications combined with one or two distant segmental duplication events) while ~20% have true orthologs (confirmed by local synteny and ICR) as well as non-syntenic intronless copies (probably from RT events).
FP and FN Estimates by Using LCA
The accurate determination of orthology is critically important to comparative genomics. However it has been a challenge to compare the various orthology determination methods without a reliable gold-standard orthology dataset . The statistical technique of Latent Class Analysis allows estimates of false positive and false negative rates from data based on agreement and disagreement between various ortholog definitions. Because error rates estimated in this way may be affected by which methods are included in the analysis, we must consider the FP and FN rates estimated here as relative error rates. Furthermore, these error rates might not reflect rates obtained from a genome-wide implementation of these methods. There is an ongoing effort on standardizing protein datasets for benchmarking orthology determination methods  which would help resolve this issue.
In Figure 3, Inparanoid and RBH agree more closely than any other pair of orthology definitions. This is in accord with the fact that Inparanoid uses the reciprocal best hit to start core ortholog identification. SBH and BLASTP have zero FN rates and higher FP rates than Inparanoid and RBH due to less stringent conditions used for ortholog detection in these methods. The order of FP rates (BLASTP > SBH > RBH) is concordant with the stringency of each method. The FP rate of local synteny based orthology falls between SBH and RBH. This is reasonable considering the fact that SBH and local synteny based orthology cannot distinguish close paralogs from ortholog pairs, but local synteny can separate RT close paralog copies from orthologs.
The FN rate of local synteny based orthology is higher than those of Blastp based orthologies (Inparanoid, RBH, SBH, and BLASTP). This may be due to distant paralogs retaining flanking genes, but diverging in their coding sequences enough to be distinguishable by Blastp. RT miscalls by Inparanoid (e.g. Figure 5) are also likely to have contributed to the relative FN rate of local synteny based orthology.
Due to the fact that OrthoMCL detected the smallest number of orthologs in any sample (data not shown) the estimated FN rates from OrthoMCL are the highest in this experiment (approximately 0.3). This is opposite from the result of Chen et al.  where OrthoMCL and Inparanoid were shown to have lowest estimated FP and FN rates and OrthoMCL has lower FN rates than Inparanoid. The apparent disparity between these results could be explained by the fact that LCA is designed to estimate consensus FP and FN rates without any guarantee that the estimated rates are close to absolute values, and by the difference in the species sets used in the two experiments: our dataset of five mammalian genomes and the comparatively distant seven eukaryotic genomes used in Chen et al.
Gene Order as a Measure of Conservation
Synteny information has been used to cluster synteny blocks between two related genomes in order to detect orthologous gene pairs [8, 21] and to reconstruct phylogenetic trees . These synteny blocks are generally used as "genomic anchors"  or to place gene loss/deletion events on phylogenetic trees [23, 24], not as a definitive measure to distinguish close paralogs from distant paralogs. In Zheng et al. , one of three methods to define orthologs between human and mouse used a genomic anchor approach. They identified synteny anchors and synteny blocks [25, 26] then introduced a local synteny approach for the anchor-poor regions by accepting the pairs of genes flanked by previously identified ortholog pairs. With this approach they found 11% more orthologs than by RBH alone.
Another similar approach to local synteny was used to reconstruct phylogenetic gene trees from coding sequence similarity and local gene order . In the algorithm SYNERGY, Wapinski et al. measure a synteny similarity score for a pair of genes by counting the neighboring genes in syntenic blocks. However, SYNERGY has not been tested in mammalian genomes.
Finally, MSOAR  uses combinatorial optimization on global gene order to identify orthologs based on minimal rearrangement scenarios. However, Fu et al. point out that the global optimization might lead to false ortholog links in some scenarios. Because our local synteny exploits proximate synteny information we expect lower false positive ratios than those obtained with MSOAR.
Most importantly, when the orthology is defined by codon sequence similarity, testing any hypothesis of selective pressures on orthologous gene presents tautological challenges. Since orthology detection by local synteny is not based on the comparison of coding sequence information between candidate orthologs, testing the selective pressure between orthologs becomes a valid comparison of largely independent variables. However, gene order also degrades over evolutionary time. How well synteny will be able to effectively identify ancient orthologs remains to be seen.
Gene Duplication Mechanisms and Orthologs
Gene duplications are a major force in genome evolution . Genes are duplicated through two main duplication mechanisms; DNA-mediated and RNA-mediated [27, 28]. DNA-mediated duplications can include multiple genes and associated intergenic sequences and introns. On the other hand, RNA-mediated duplication, or retrotransposition (RT), only copies coding sequences in the duplication event. Retrotransposed genes had been considered mostly "dead on arrival", but recent studies [29, 30] show that there are many functional RT copies in the human and mouse genomes.
Sometimes the difference between coding sequences of parental genes and RT copies is not large enough to distinguish the RT copy from the parent. RT copies, however, do not share introns or flanking genes with the parental paralog. Therefore, local synteny and gene structure can often separate the RT copy from the original gene. However, when neither has an intron, only local synteny will distinguish parental and RT copies. Based on our random sampling, approximately 8~10% of Pfam orthologs are intronless gene pairs.
Gene Order Helps Illuminate Gene Family Evolution
Reconstructing phylogenetic trees informs our understanding of the evolutionary history of gene families. Using tree reconciliation between a species tree and gene tree we can identify duplication and lost events on the tree. However by distinguishing two duplication mechanisms, DNA-mediated and RNA-mediated, not just identifying duplication events, we can sometimes place duplication events in an appropriate phylogenetic context. For example, as Figure 6 and Figure 7 show, when coding sequence does not distinguish paralogs, local synteny can determine whether DNA or RNA-mediated duplication occurred first. Local synteny can also help place RT duplication events before or after speciation (Figure 7). Even when the coding sequence of RT duplicates drifts apart, pre-speciation RT genes often retain local synteny. Conversely, when RT duplicates are young enough to be indistinguishable by coding sequence comparison, synteny can discriminate between pre-speciation duplications, and independent RT duplication events in parallel lineages. Finally, local synteny information can often resolve the order of iterative DNA-mediated duplication events in large gene families (see also ).
In this paper, we investigate the concordance between coding sequence based orthology (Inparanoid) and local synteny based orthology. In the 5 mammalian genomes studied, 93% of the sampled inter-species pairs were found to be concordant between the two orthology methods, illustrating that local synteny can largely substitute for coding sequence in identifying orthologs. However, 7% of pairs were found to be discordant. Discordance is often associated with evolutionary events like retrotransposition, iterative DNA-mediated duplication, and genome rearrangement. Analysis of discordant cases between local synteny and Inparanoid shows that local synteny can differentiate between true orthologs and recent retrogenes, can split ambiguous many-to-many orthology groups into more precise one-to-one ortholog pairs, and, when employed in a genome-wide screen, might help in highlighting possible cases of non-orthologous gene displacement by retrocopied paralogs in mammalian genomes.
Datasets for Ortholog Definitions
Five species analyzed (human, chimp, mouse, rat and dog) were obtained from Ensembl release 48 . We only used protein coding genes in the Ensembl database. For genes with multiple alternative transcripts we used the longest transcripts as the representative ones. We used Pfam families  to choose ortholog candidates. Since there are more than hundreds of millions possible protein pairs among five genomes in Pfam families, we sampled our datasets in the following way. First, we randomly selected 1,000 families from 3,418 Pfam families which have at least two representative proteins from different genomes. One cross-species protein pair was selected from each Pfam family in order to avoid a bias from big families. We used 10 sample datasets for the LCA experiment and one of them was used in the discordance analysis.
Local synteny is measured by homology between the neighboring genes of two genes of interest. In this study the maximum unique homology matches between two sets of six neighboring genes (three upstream and three downstream neighbors) was used. The matches do not need to be co-linear or between genes on the same strand/orientation either (see Figure 1), which allows for genome micro-rearrangements as well as gene losses and insertions in the flanking region. The homology between neighboring genes is decided by pre-calculated Blastp  results in the Ensembl Compara database . To avoid having high local synteny due to proximate tandem array genes, we considered the tandem array genes as one neighboring gene. Within a tandem array, each gene was counted separately.
Orthology Definitions and LCA
BLASTP: homology detection using E-value cutoff (= 1e-5)
SBH: Single-way or One-way Best Hit. 'Best-hit' is defined as the hit (or multiple hits tied) with the highest E-value (E-value cutoff = 1e-5)
RBH: Reciprocal Best Hit. 'Best-hit' is defined as same as SBH (E-value cutoff = 1e-5)
Inparanoid (v2.0): bit score cutoff = 50 bits and sequence overlap cutoff = 0.5
OrthoMCL (v1.4): E-value cutoff = 1e-5 and MCL inflation index = 1.5; MCL package (v02-063) was used.
The frequency table of agreements and disagreements between orthology detection methods was calculated and used for LCA. LCA was performed using the LEM package  with default parameters to estimate the false positive and false negative rates. We used a basic latent model to produce Figure 3 assuming independence between various orthology detection methods. However, all methods we considered are solely or partially based on protein sequences. In order to account for these dependencies, we applied another latent model with an extra latent variable. With such a model, called latent class model with random effects or a continuous factor (CFactor) model, the responses of different tests are assumed to be independent [15, 33]. Although the estimated error rates from the CFactor model (Additional file 6) are less tightly distributed than ones from the basic model, the relative values are not changed significantly compared with Figure 3. The looser distribution of values in the CFactor model is likely due to a lack of convergence in these runs. For detailed description of two LCA models and the FP/FN graph by the CFactor model, see Additional file 6.
Intron Conservation Ratio (ICR)
Gene structure similarity is measured by the intron conservation ratio (ICR) between two intron-bearing genes . For genes with multiple alternative transcripts we developed a collapsed gene model that incorporates all potential exons of that gene. Resulting exon coordinates were used to obtain the protein alignments and also to align the positions of introns. ICR between two homologous genes was calculated as the ratio of the number of positionally homologous introns divided by the total number of intron positions from the protein/intron alignment, similar to the method in Rogozin et al. . Introns with less than 40BP were ignored in ICR calculation.
the other Inparanoid orthologs in all 5 species in order to confirm rearrangement history in those families (with ICR > 0 in Figure 4c), and
Mouse-rat many-to-many Inparanoid ortholog groups were collected and analyzed to reconstruct their evolutionary history by considering the genomic location and intron content of their member genes. Two example ortholog groups (Figure 66 and Figure 7) were chosen due to their unambiguous evolutionary history: for most groups it is difficult to unambiguously reconstruct the full evolution due to the presence of intermingled events.
We thank Asav Dharia for critical reading of the manuscript and valuable suggestions.
- Berglund A, Sjolund E, Ostlund G, Sonnhammer ELL: InParanoid 6: eukaryotic ortholog clusters with inparalogs. Nucl Acids Res. 2008, 36: D263-266. 10.1093/nar/gkm1020.PubMed CentralView ArticlePubMedGoogle Scholar
- Li L, Stoeckert CJ, Roos DS: OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13: 2178-2189. 10.1101/gr.1224503.PubMed CentralView ArticlePubMedGoogle Scholar
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004, 428: 617-624. 10.1038/nature02424.View ArticlePubMedGoogle Scholar
- Wapinski I, Pfeffer A, Friedman N, Regev A: Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007, 449: 54-61. 10.1038/nature06107.View ArticlePubMedGoogle Scholar
- Lemoine F, Lespinet O, Labedan B: Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evolutionary Biology. 2007, 7: 237-10.1186/1471-2148-7-237.PubMed CentralView ArticlePubMedGoogle Scholar
- Lemoine F, Labedan B, Lespinet O: SynteBase/SynteView: a tool to visualize gene order conservation in prokaryotic genomes. BMC Bioinformatics. 2008, 9: 536-10.1186/1471-2105-9-536.PubMed CentralView ArticlePubMedGoogle Scholar
- Fu Z, Chen X, Vacic V, Nan P, Zhong Y, Jiang T: A parsimony approach to genome-wide ortholog assignment. Research in Computational Molecular Biology. 2006, 3909: 578-594. 10.1007/11732990.View ArticleGoogle Scholar
- Fu Z, Chen X, Vacic V, Nan P, Zhong Y, Jiang T: MSOAR: a high-throughput ortholog assignment system based on genome rearrangement. J Comput Biol. 2007, 14: 1160-75. 10.1089/cmb.2007.0048.View ArticlePubMedGoogle Scholar
- Han MV, Hahn MW: Identifying parent-daughter relationships among duplicated genes. Proceedings of the Pacific Symposium on Biocomputing. 2009, 14: 114-125.Google Scholar
- Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science. 2003, 299: 1391-1394. 10.1126/science.1081331.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genetics, University of Washington, Seattle, [http://evolution.genetics.washington.edu/phylip.html]
- Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucl Acids Res. 2006, 34: D247-251. 10.1093/nar/gkj149.PubMed CentralView ArticlePubMedGoogle Scholar
- Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Gräf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kähäri A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Slater G, Smedley D, Spudich G, Trevanion S, Vilella AJ, Vogel J, White S, Wood M, Birney E, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Hubbard TJ, Kasprzyk A, Proctor G, Smith J, Ureta-Vidal A, Searle S: Ensembl 2008. Nucl Acids Res. 2008, 36: D707-714. 10.1093/nar/gkm988.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen F, Mackey AJ, Vermunt JK, Roos DS: Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE. 2007, 2: e383-10.1371/journal.pone.0000383.PubMed CentralView ArticlePubMedGoogle Scholar
- Hui SL, Zhou XH: Evaluation of diagnostic tests without gold standards. Statistical Methods in Medical Research. 1998, 7: 354-70. 10.1191/096228098671192352.View ArticlePubMedGoogle Scholar
- Bandyopadhyay S, Sharan R, Ideker T: Systematic identification of functional orthologs based on protein network comparison. Genome Res. 2006, 16: 428-435. 10.1101/gr.4526006.PubMed CentralView ArticlePubMedGoogle Scholar
- Su Z, Wang J, Yu J, Huang X, Gu X: Evolution of alternative splicing after gene duplication. Genome Res. 2006, 16: 182-189. 10.1101/gr.4197006.PubMed CentralView ArticlePubMedGoogle Scholar
- Sui SJ, Fulton DL, Arenillas DJ, Kwon AT, Wasserman WW: oPOSSUM: integrated tools for analysis of regulatory motif over-representation. Nucl Acids Res. 2007, 35: W245-252. 10.1093/nar/gkm427.View ArticleGoogle Scholar
- Gabaldón T, Dessimoz C, Huxley-Jones J, Vilella AJ, Sonnhammer EL, Lewis S: Joining forces in the quest for orthologs. Genome Biology. 2009, 10: 403-10.1186/gb-2009-10-9-403.PubMed CentralView ArticlePubMedGoogle Scholar
- Zheng XH, Lu F, Wang ZY, Zhong F, Hoover J, Mural R: Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs. Bioinformatics. 2005, 21: 703-710. 10.1093/bioinformatics/bti045.View ArticlePubMedGoogle Scholar
- Wapinski I, Pfeffer A, Friedman N, Regev A: Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics. 2007, 23: i549-558. 10.1093/bioinformatics/btm193.View ArticlePubMedGoogle Scholar
- Chen K, Durand D, Farach-Colton M: NOTUNG: A program for dating gene duplications and optimizing gene family trees. J Comput Biol. 2000, 7: 429-47. 10.1089/106652700750050871.View ArticlePubMedGoogle Scholar
- Poptsova MS, Gogarten JP: BranchClust: a phylogenetic algorithm for selecting gene families. BMC Bioinformatics. 2007, 8: 120-10.1186/1471-2105-8-120.PubMed CentralView ArticlePubMedGoogle Scholar
- Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420: 520-562. 10.1038/nature01262.View ArticleGoogle Scholar
- Mural RJ, Adams MD, Myers EW, Smith HO, Miklos GL, Wides R, Halpern A, Li PW, Sutton GG, Nadeau J, Salzberg SL, Holt RA, Kodira CD, Lu F, Chen L, Deng Z, Evangelista CC, Gan W, Heiman TJ, Li J, Li Z, Merkulov GV, Milshina NV, Naik AK, Qi R, Shue BC, Wang A, Wang J, Wang X, Yan X, Ye J, Yooseph S, Zhao Q, Zheng L, Zhu SC, Biddick K, Bolanos R, Delcher AL, Dew IM, Fasulo D, Flanigan MJ, Huson DH, Kravitz SA, Miller JR, Mobarry CM, Reinert K, Remington KA, Zhang Q, Zheng XH, Nusskern DR, Lai Z, Lei Y, Zhong W, Yao A, Guan P, Ji RR, Gu Z, Wang ZY, Zhong F, Xiao C, Chiang CC, Yandell M, Wortman JR, Amanatides PG, Hladun SL, Pratts EC, Johnson JE, Dodson KL, Woodford KJ, Evans CA, Gropman B, Rusch DB, Venter E, Wang M, Smith TJ, Houck JT, Tompkins DE, Haynes C, Jacob D, Chin SH, Allen DR, Dahlke CE, Sanders R, Li K, Liu X, Levitsky AA, Majoros WH, Chen Q, Xia AC, Lopez JR, Donnelly MT, Newman MH, Glodek A, Kraft CL, Nodell M, Ali F, An HJ, Baldwin-Pitts D, Beeson KY, Cai S, Carnes M, Carver A, Caulk PM, Center A, Chen YH, Cheng ML, Coyne MD, Crowder M, Danaher S, Davenport LB, Desilets R, Dietz SM, Doup L, Dullaghan P, Ferriera S, Fosler CR, Gire HC, Gluecksmann A, Gocayne JD, Gray J, Hart B, Haynes J, Hoover J, Howland T, Ibegwam C, Jalali M, Johns D, Kline L, Ma DS, MacCawley S, Magoon A, Mann F, May D, McIntosh TC, Mehta S, Moy L, Moy MC, Murphy BJ, Murphy SD, Nelson KA, Nuri Z, Parker KA, Prudhomme AC, Puri VN, Qureshi H, Raley JC, Reardon MS, Regier MA, Rogers YH, Romblad DL, Schutz J, Scott JL, Scott R, Sitter CD, Smallwood M, Sprague AC, Stewart E, Strong RV, Suh E, Sylvester K, Thomas R, Tint NN, Tsonis C, Wang G, Wang G, Williams MS, Williams SM, Windsor SM, Wolfe K, Wu MM, Zaveri J, Chaturvedi K, Gabrielian AE, Ke Z, Sun J, Subramanian G, Venter JC, Pfannkoch CM, Barnstead M, Stephenson LD: A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002, 296: 1661-1671. 10.1126/science.1069193.View ArticlePubMedGoogle Scholar
- Ohno S: Evolution by gene duplication. 1970, London, United Kingdom: Allen and UnwinView ArticleGoogle Scholar
- Zhang J: Evolution by gene duplication: an update. Trends in Ecology & Evolution. 2003, 18: 292-298. 10.1016/S0169-5347(03)00033-8.View ArticleGoogle Scholar
- Sakai H, Koyanagi KO, Imanishi T, Itoh T, Gojobori T: Frequent emergence and functional resurrection of processed pseudogenes in the human and mouse genomes. Gene. 2007, 389: 196-203. 10.1016/j.gene.2006.11.007.View ArticlePubMedGoogle Scholar
- Vinckenbosch N, Dupanloup I, Kaessmann H: Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci USA. 2006, 103: 3220-3225. 10.1073/pnas.0511307103.PubMed CentralView ArticlePubMedGoogle Scholar
- Ensembl Release 48. [http://dec2007.archive.ensembl.org/index.html]
- Vermunt JK: LEM 1.0: A general program for the analysis of categorical data. Department of Methodology and Statistics, Tilburg University, [http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html]
- Qu Y, Tan M, Kutner MH: Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics. 1996, 52: 797-810. 10.2307/2533043.View ArticlePubMedGoogle Scholar
- Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV: Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Current Biology. 2003, 13: 1512-1517. 10.1016/S0960-9822(03)00558-X.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.