Controversies in modern evolutionary biology: the imperative for error detection and quality control
- Francisco Prosdocimi†1, 2,
- Benjamin Linard†1,
- Pierre Pontarotti3,
- Olivier Poch1 and
- Julie D Thompson1Email author
© Prosdocimi et al; licensee BioMed Central Ltd. 2012
Received: 30 June 2011
Accepted: 4 January 2012
Published: 4 January 2012
The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes.
We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events.
Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.
High throughput genomics technologies are now providing the raw data for genome-level or systems-level studies . At the same time, the avalanche of data also poses many new challenges. The shift to genome scale studies in evolutionary biology, for instance, has led to many interesting, but often controversial studies. Many branches in the Tree of Life are still the subject of intense discussions, and simply adding more sequences has not resolved the inconsistencies . In prokaryotes, phylogenetic incongruencies are often assumed to be the result of lateral gene transfers, but the frequency of these events has been challenged recently [3, 4]. In eukaryotes, the ancestral relationships between the major eukaryotic kingdoms [5–8], as well as many more recent clades such as fish or mammalian [9–11], are also hotly debated. It has been suggested that at least some of the conflicting results from evolutionary analyses are due to differences in the models and methodologies used to test the original hypotheses, e.g. [12, 13], as well as errors in the input sequences .
High throughput biological datasets are notoriously incomplete [14–16], noisy and inconsistent and DNA or protein sequences are no exception. The DNA sequences produced by next generation sequencing (NGS) technologies or low-coverage assemblies pose particular problems [17, 18]. A number of recent studies have investigated the rate of errors in these new genome sequences and their impact on the accuracy of downstream analyses [19–22]. In the context of proteome studies, the DNA sequencing errors are further confounded by inaccuracies in the delineation of the protein-coding genes. Coding regions are mostly predicted by automatic methods, but the relationship between genes, transcripts and proteins is complex and automated genome annotation is not completely accurate. Thus, ten years after the publication of the human genome, the exact number of human protein-coding genes is still unknown . Furthermore, recent analyses have shown that, even for those genes that have been identified, the complete exon/intron structure is correctly predicted for only about 50-60% of them [24–26]. In eukaryotic genomes, the situation is also complicated by widespread alternative splicing events, which affects more than 92-94% of multi-exon human genes .
To what extent do these quality issues affect our understanding of the evolutionary events shaping modern organisms? Although sequence errors are essentially ignored in most genome-scale analyses, some studies have addressed certain aspects of this question. For example, Hubisz and coworkers  investigated the impact of DNA sequencing errors in low-coverage genome assemblies on inferred rates and patterns of insertion/deletion and substitution on the mammalian phylogeny. Schneider et al.  showed that the estimated amount of positively selected genes in genome scale analyses may be inflated by the presence of unreliable sequences.
Here, we have investigated the impact of erroneous protein sequences, resulting from either DNA sequencing errors or inaccurate prediction of exon/intron structures, on evolutionary analyses and the detection of important genetic events. We concentrated specifically on duplication events, which are known to be an important source of functional diversity [29–32] and where there has been a great deal of debate about the long term fate of duplicated genes. Two main models have been proposed for the evolution of novel gene function associated with gene duplication. The neofunctionalization model predicts the evolution of a new function in one of the duplicates, with accelerated evolution of the deconstrained copy compared with the copy that retains the ancestral function. The subfunctionalization model implies the division of the ancestral functions among the duplicates and does not make any prediction about the symmetry or asymmetry of sequence evolution. Although individual cases of both modes of evolution have been reported, the relative frequency of the different scenarios in nature is still not clear [12, 33, 34].
To some extent, the evolutionary fate of duplicated genes depends on the duplication mechanism. After tandem duplications or large-scale (e.g. whole-chromosome or whole-genome) duplications, both gene copies retain the same genome context. In contrast, after segmental duplications or retrotranspositions, one of the gene copies retains the ancestral genome position while the other copy is relocated elsewhere. It is generally expected that the gene copy that retains the genome context will be more conserved, and thus will be more likely to retain the ancestral functions . The hypothesis is that newly duplicated genes that have been transposed to new chromosomal locations experience a new genomic and epigenetic environment, modifying the expression and/or function of the genes.
Estimation of sequence error rates
Ensembl genomes used in this study
No. of genes
No. of proteins
The observed error rates were not homogeneous across the different species. Lower rates were observed for the human and mouse proteomes, with 30-31% erroneous sequences, as might be expected for these well studied organisms. Among the non-human primate proteomes considered here, lower error rates were estimated for the orangutan (Pongo pygmaeus), compared to the chimpanzee (Pan troglodytes) and especially the Rhesus macaque (Macaca mulatta). The relatively high error rate for the macaque is not surprising since the macaque genome in Ensembl version 51 is a preliminary assembly using whole genome shotgun (WGS) reads from small and medium insert clones. On the other hand, the relative error rates in chimpanzee and orangutan are more surprising. Both the chimpanzee and orangutan genomes have been sequenced to 6x coverage, but in a recent study of primate genome assembly quality, the chimpanzee genome assembly was estimated to be of higher quality .
Nevertheless, the same study found that about 70% of inferred errors in the orangutan genome were clustered in the 3.2% of the assembly that is of low quality, implying that > 96% of the assembly could be considered of high fidelity. We found the highest error rates in the opossum, chicken and fish proteomes, with > 45% erroneous sequences. Although these genomes have all been sequenced to high coverage, the lack of a well annotated reference genome from a closely related model organism may result in lower quality protein sequence prediction.
The predicted protein sequence errors were then characterized according to two different factors: (i) the nature of the error, i.e. insertion, deletion or suspicious segment and (ii) the position in the sequence, i.e. at the N/C-terminus or within the sequence. Figure 2B shows the proportion of the different errors observed. The most commonly found error was the presence of a suspicious sequence segment, possibly representing a mispredicted exon. At the N- and C- termini, deletions were observed more frequently than extensions. Although this may be due in part to the protocol used to detect sequence errors, it may also reflect the difficulty of predicting the first and last coding exons. In contrast, internal insertions were more common than internal deletions, suggesting that more internal errors were due to the over-prediction of introns as coding sequences, rather than the under-prediction of exons.
Comparison of similarity and synteny based homologs
We then investigated whether the gene that is most similar on the sequence level is also the gene that shares the same gene-neighbourhood (Figure 3 and Table S2 in Additional file 1). Of the 212,409 similarity homologs identified in the 13 vertebrate genomes, 113,517 were found in locally syntenic regions. In mammals, this represents 69% of the highest similarity homologs. This is less than that estimated in a previous study , where 97.5% of Inparanoid orthologs in human, mouse, rat and dog were found in syntenic regions, most likely due to our stricter definition of local synteny. On the other hand, only 1% of the identified syntenic homologs (1,157 out of 114,680) were not identified by the similarity-based approach. As expected, a generally higher level of disagreement was observed for more divergent genome pairs. Nevertheless, in human-chicken comparisons, the synteny method identified the same homolog as the similarity approach in 98.8% of the cases. Fewer consistencies were observed in human-fish comparisons (84-87% of syntenic homologs were also the highest similarity homologs), possibly due in part to the whole genome duplications in the fish lineage, resulting in a larger number of paralogs.
Asymmetric evolution events
Number of syntenic homologs with significantly faster evolutionary rates compared to the remote similarity homolog
No. of syntenic homologs
No. of inconsistencies: syntenic versus highest similarity homologs
Significant asymmetric evolution events (AED)
Effect of erroneous sequences on prediction of asymmetrical evolution
As might be expected, the proportion of artifactual events varies with the different genomes studied, depending on the percentage of erroneous sequence detected (Figure 4B). For example, 19% of chimpanzee and 24% of mouse predicted events are due to artifacts, while this figure increases significantly for the draft macaque and chicken genomes (69% and 88% respectively). It is interesting to note that a larger proportion of artifacts are observed in the orangutan genome than in the chimpanzee, even though the orangutan genome is predicted to contain less sequence errors than the chimpanzee (see above).
Validation of putative protein sequence errors
Putative protein errorsa
% FP errorg
FP error predictionf
Putative protein errors
% FP error
FP error prediction
Detailed analysis of sequence errors leading to artifactual AED events
When all the sequences in the gene triplets were pooled, no significant enrichment was observed in the frequency distribution of the different error types causing artifactual events, compared to the background distribution observed in all the sequences (as shown in Figure 2). The goodness-of-fit was measured using a likelihood ratio chi-square statistic (chi-square = 3.12, p-value = 0.79). Nevertheless, different error types were observed when the syntenic and highest similarity homologs were considered separately. For example, artifactual events were observed more frequently if the syntenic homolog, i.e. the gene copy that retained the genome neighbourhood after duplication, contained suspicious segments. In contrast, N- and C-deletions in the highest similarity homolog, i.e. the gene copy that was relocated, were more likely to cause artifacts.
The orthology prediction method used in the Ensembl project, based on a phylogenetic gene tree approach, finds the correct 1-to-1 orthology relationship between the human and macaque COPG proteins. Unfortunately, many other orthology databases are less successful. For example, in the Inparanoid database (inparanoid.sbc.su.se), the Ensembl human COPG and macaque COPG2 sequences are in the same orthologous cluster, while no human ortholog is found for the macaque COPG sequence.
Functional analysis of asymmetrical evolution events
GO term enrichment analysis for artifactual and putative AED events
GO enrichment for all events
GO enrichment for true events only
GO biological process
GO biological process
multicellular organismal process
multicellular organismal process
anatomical structure development
response to stimulus
anatomical structure development
regulation of primary metabolic process
cellular component organization at cellular level
regulation of biological process
regulation of macromolecule metabolic process
cellular component organization at cellular level
regulation of nitrogen compound metabolic process
RNA biosynthetic process
regulation of cellular macromolecule biosynthetic process
cellular component movement
nucleobase, nucleoside, nucleotide and nucleic acid metabolic process
cell surface receptor linked signaling pathway
purine nucleotide catabolic process
purine ribonucleoside triphosphate catabolic process
Gene list 1 was enriched in 24 key GO terms, including a number of vertebrate specializations (e.g. anatomical structure development), but also some fundamental eukaryotic processes (e.g. regulation of metabolic processes, gene expression, axon guidance). For example, the term 'RNA biosynthetic process' is found with a P-value of 5E-16, involving 101 (20%) of the 607 genes in the list. However, only 6 of these 24 key GO terms are associated with the true events in gene list 2. Thus, the remaining 18 (75%) enriched GO terms are probably false positives resulting from the artifactual events. Furthermore, and perhaps more importantly, important key GO terms associated with the true events are not enriched in gene list 1, notably neurogenesis related functions. After filtering of gene triplets with erroneous sequences, gene set 2 was enriched in 10 key terms, including neuron differentiation functions, and response to the environment.
Several recent studies have highlighted the prevalence of errors in genes predicted from genome sequences [24–26, 44], particularly in eukaryotic genes. The situation is further complicated by the fact that multiple transcript variants are often expressed by the same gene. Nevertheless, orthology and paralogy, which are fundamental concepts for most evolutionary analyses, are generally defined at the gene level. Many systems, including Ensembl compara , simply select the longest transcripts to represent a gene, although there is no guarantee that the longest predicted transcripts in different organisms are equivalent. Some authors have specifically addressed these issues by defining relationships at the transcript level [46, 47] or by using processed transcription units, i.e. a combination of all overlapping sequence variants in the genomic region . Nevertheless, these remain partial solutions only and do not resolve all problems.
These quality issues may lead to inaccurate or erroneous conclusions if they are integrated indiscriminately in downstream evolutionary or functional analyses. As an example, when annotating a new genome, gene structure data is often transferred from the genome of a closely related species, e.g., many chimpanzee genes in the Ensembl database were predicted based on comparisons with human transcript data. These gene sequences were then used to perform genome-wide scans for positive selection . Although more positively selected genes were identified in chimpanzees compared to human, it has been suggested that the majority of the signals may be due to errors in the original sequences or in the gene alignments . Thus, we have a vicious circle, where the gene sequences that provide the starting point for most evolutionary analyses are themselves generally predicted based on evolutionary information.
Protein sequence error rates
We detected erroneous protein sequences based on discrepancies in the conservation of vertebrate protein MSAs. The sequence errors may result from (i) DNA sequencing errors, (ii) badly predicted introns/exons, (iii) different splicing variants predicted in different organisms. We estimated the frequency of erroneous sequences to be at least 41%, although some genomes are more error-prone than others, depending on factors such as sequencing coverage or the availability of a well annotated genome from a closely related organism.
In this study, we only considered sequences from the Ensembl database and we used cross-comparisons between species to identify discrepancies. However, Ensembl may produce predictions that are consistent across organisms, i.e. may reproduce the same errors in different genomes or propagate intron/exon structures. Thus, our estimate of the average sequence error rate is probably conservative. Another recent study  showed that the Ensembl compara sequence prediction method correctly identified only 55% of coding transcripts exactly.
Identification of evolutionary events
Our main goal was to determine to what extent these erroneous sequences affect subsequent evolutionary analyses. We focused on a specific event: gene duplication and the evolutionary fate of paralogs, since gene duplication is often assumed to be the most important source of new functions.
Since duplication events where the local copy has evolved more rapidly may indicate unusual evolutionary scenarios, innovations or adaptations, we specifically searched for examples of such asymmetric evolution events. Our approach involved the identification of reliable AED events that could be used as a test set for estimating the impact of sequence errors. We therefore designed a stringent protocol where we included only high coverage genomes and used the well studied human genome as a reference. We then identified putative orthologs in 13 vertebrate genomes, based on either sequence similarity or local synteny conservation. The similarity-based method used a very simple model of sequence evolution, in order to avoid bias towards one particular model. Nevertheless, this model clearly oversimplifies the complex evolutionary processes involved, and in the future, it would be interesting to investigate the effect of a more realistic model of sequence evolution on AED detection, once sequencing/annotation errors have been removed. We also used a strict definition of local synteny, which led to lower genome coverage in the ortholog prediction step. For the detection of asymmetric evolution, we used a simple measure of amino acid divergence and specified a high significance threshold that would ensure only reliable predictions. Nevertheless, 688 putative AED events were identified that were then used to perform an in-depth investigation of the effect of sequence errors.
Impact of sequence errors
We compared the syntenic and highest similarity homologs and identified cases where significantly faster evolutionary rates were observed in the syntenic homolog, i.e. the gene copy that retained the genome neighbourhood after duplication, compared to the relocated highest similarity homolog. Initially, 688 AED events were identified, of which 81 similarity homologs were potential retropseudogenes with a reduced exonic map. The majority (57%) of the remaining detected events corresponded to erroneous sequences and only 250 represented putative true AED events. Thus, we conclude that care should be taken when performing genome-wide scans to search for genes with unusual patterns, since outlying genes are more likely to be due to artifacts in the input sequences than the result of true evolutionary events. Furthermore, our in-depth study revealed some of the mechanisms by which errors in the input sequences are propagated during the event prediction. For example, a badly predicted internal segment in one of the homologs results in an increased evolutionary distance to the human reference sequence, while a loss in the more variable N/C-terminal regions artificially reduces the distance. These observations provide guidelines for future error detection and correction strategies that will hopefully allow us to reduce the impact of the sequencing errors.
In asymmetric evolution, one duplicate evolves or degrades faster than the other and often becomes functionally or conditionally specialized. In this context, the accurate detection of the 'functional' homologs, i.e. protein pairs that play functionally equivalent roles , is critical. We have shown that orthology assignment and the detection of important genetic events are severely impacted by the high proportion of errors in the initial set of protein sequences, even in high coverage genomes. The errors in the initial data are accumulated and amplified in the higher-level analyses. Our estimated rate of 41% erroneous protein sequences leads to 57% errors in AED event prediction and, in the subsequent Gene Ontology (GO) functional analysis, 75% of the enriched terms are in fact false positives.
The false positive terms in the functional analysis can be very costly to investigate experimentally and a reduction in the false discovery rate is clearly desirable. They are also sufficient to mask some of the true functional enrichments. After filtering the artifactual events corresponding to erroneous sequences, the remaining AED events were enriched in a number of GO categories, including neuron differentiation and response to external stimuli. Interestingly, human-specific duplicates evolving under adaptive natural selection also include genes involved in neuronal and cognitive functions, as well as response to inflammation or stress . Similarly, gene families involved in copy number variations (CNVs) are enriched for similar categories, including interactions with the environment, neurophysiological processes and brain development . A recent study suggested that the relationship between CNVs and positive selection may play an important role in the emergence and evolution of species-specific traits in primates . Genes in many of these categories are thus thought to be important in evolutionary adaptation and to be particular targets of natural selection.
Up to half of all protein sequences in today's genome databases contain erroneous insertions, deletions or suspicious segments. The high error rates have profound implications, not only for the analysis of protein functions, interaction networks, biochemical pathways or disease phenotypes, but also for our understanding of life's evolution.
The putative sequence errors identified here lead to a significant number of false positives in the detection of asymmetric evolution events, which, if ignored, are sufficient to obscure their true functional significance. We have looked at one important event, asymmetric evolution after duplication, but the effect of protein sequence errors is likely to be similar for other types of events. This might explain many of the contradictions observed in many recent evolutionary studies, aggravating the effects of differences in source data, methodology and planning of experiments .
Exploitation of the new genome data is clearly challenging, due to the size of the data sets, their complexity and the high level of noise, and the situation is not likely to improve with low coverage genomes becoming the norm. As a consequence, data cleaning tools and robust statistical analyses will be essential for its reliable interpretation. With as many as 50% erroneous sequences, the simple removal of this data will result in the loss of too much information. It will be necessary to validate and correct the sequence errors and ideally, propagate these corrections to the public databases. Some recent efforts have been undertaken to address these issues [19, 26, 47], but additional work will be essential to reduce the impact of error and to extract the true meaning hidden in the data.
The alternative is an escalating process where systematic errors are accumulated at each level of the analysis, generating artificially high rates of unusual event predictions and eventually leading to an 'error catastrophe', where the noise overwhelms the true signal.
Protein sequence data sets
Human protein coding genes were retrieved from the Human Proteome Initiative (HPI) and Swiss-prot databases , resulting in a total of 19,778 human sequences. Each gene was then used as a query for a BlastP  search in a database consisting of the proteomes of 14 vertebrates (Table 1) with almost complete genomes from the Ensembl (version 51) database . The Ensembl human protein sequence with the highest similarity to the HPI query was designated as the reference protein sequence. For each of the 19,778 human reference sequences, potential orthologs were then identified using two different, complementary approaches: sequence similarity and local synteny.
Putative orthologs based on sequence similarity
For each human reference sequence, a modified version of the PipeAlign  protein analysis pipeline was used to construct a multiple sequence alignment (MSA) for all sequences detected by the BlastP search with E < 10-3 (maximum sequences = 500). PipeAlign integrates several steps, including post-processing of the BlastP results, construction of a MSA of the full-length sequences with DbClustal , verification of the MSA with RASCAL  and removal of unrelated sequences with LEON . In this modified version, DbClustal was replaced by the MAFFT  program, since the computational speed of MAFFT is better suited to high throughput projects. The MSAs obtained from this pipeline were then annotated with structural and functional information using MACSIMS , an information management system that combines knowledge-based methods with complementary ab initio sequence-based predictions. MACSIMS integrates several types of data in the alignment, in particular Gene Ontology annotations, functional annotations and keywords from Swiss-prot, and functional/structural domains from the Pfam database .
where d is the pairwise distance and p is the proportion of different amino acids aligned (dissimilarity).
Then, for each human reference sequence, Hi, the sequences from the 13 vertebrate organisms with the highest similarity (i.e. the smallest distance) to Hi were identified and denoted Vn_Simi, where Vn refers to one of the 13 vertebrate organisms (Figure S2A in Additional file 1).
Putative orthologs based on local synteny
The chromosomal localization of all genes coding for protein sequences was obtained from the Ensembl database. Locally developed software was used to identify regions on the human chromosomes where local synteny was conserved between the human genome and each of the other 13 vertebrate genomes. The chromosomes in each genome are thus represented as a linear sequence of genes. For each human reference sequence, the local syntenic homolog was defined as outlined in (Figure S2B in Additional file 1). For the coding gene, Hi, at position i on the human genome, its neighbours (Hi-1 and Hi+1) were identified. For each of the 13 vertebrate genomes, the sequences with the highest similarity to Hi-1 and Hi+1 were selected from the MSA as described above, and denoted Vn_Simi-1 and Vn_Simi+1 respectively, where Vn refers to one of the 13 vertebrate genomes. A local synteny homolog, Vn_Syni exists for Hi and genome Vn if: (i) homologs were found in Vn for Hi-1 and Hi+1, (ii) the separation between the highest similarity homologs, denoted Vn_Simi-1 and Vn_Simi+1, on the genome was less than 5 genes and (iii) a homolog of Hi was found on the genome between Vn_Simi-1 and Vn_Simi+1. The homolog of Hi localized between V_Simi-1 and V_Simi+1 with the highest similarity (smallest evolutionary distance) to the human reference sequence was then defined as the syntenic homolog.
Genes with ambiguous genomic locations, such as scaffolds etc., were discarded since the synteny relationship could not be reliably established. In addition, local or tandem duplications were excluded since the genome contexts of the two gene copies were similar. Although tandem duplicates should be adjacent to each other on one chromosome, extensive gene inversions may insert irrelevant genes into the tandem arrays. We therefore used a stringent threshold and excluded cases where Vn_Simi and Vn_Syni were separated on the genomes by less than 10 genes.
Automatic detection of potential sequence errors
Each error was then classified in one of 7 different classes: internal insertions, internal deletions, suspicious sequence segments, extensions at the N- or C-terminus, and deletions at the N- or C-terminus.
Validation of potential sequence errors
The errors in the protein sequences were estimated by analysing the corresponding DNA gene sequences from the Ensembl database. First, if the gene sequence contained a run of 'N' characters, we assumed that the predicted protein sequence error was the result of a DNA sequencing or assembly error. Second, the gene sequences with no 'N' characters were searched for the missing protein sequence fragments. For errors corresponding to internal deletions, deletions at the N- or C- terminus or suspicious sequence segments, the missing protein fragment was extracted from a closely related sequence in the multiple alignment. The protein fragment was then aligned to the gene sequence from the ENSEMBL database using the PairWise software . The fragment was considered to be present in the gene sequence if the percent identity of the protein and translated gene sequences was greater than a given threshold. The threshold used here was specific to the pair of organisms compared and was defined as the lower quartile of the protein sequence identities for the complete proteomes of the two organisms. A similar protocol was used for the errors corresponding to insertions in a given protein sequence, except that, in this case, the protein fragment corresponding to the insertion was aligned to the gene sequence of another closely related sequence. Finally, the transcript evidence for the protein sequences in the Ensembl database was searched manually for known transcripts and splicing variants.
Prediction of asymmetrical evolutionary rates
It has been suggested that, after a gene duplication event, one duplicate generally maintains the ancestral function while the other is free to evolve and acquire novel functionality. This scenario implies that the protein with conserved functionality will undergo less sequence evolution than the one exploring new functionalities. To determine which of the two homologs described above (highest sequence similarity or syntenic) was more likely to share the same function as the human reference sequence, we estimated the difference between the two evolutionary distances: human reference to similarity homolog and human reference to syntenic homolog. Thus, for each of the 13 vertebrate genomes considered in this study, we have a triplet of homologs, Hi, Vn_simi, Vn_syni, and we want to estimate the difference Δ between two distances d(H i , Vn_sim i ) and d(H i , Vn_syn i ).
where p is the proportion of amino acid differences and n is the total number of amino acids compared.
In this work, the parameter k was set to 1.96, reflecting the 95% confidence level. Thus, we would expect 5% of the tested gene triplets to falsely reject the hypothesis of asymmetrical evolution.
We would like to thank the members of the Laboratory of Integrative Bioinformatics and Genomics for fruitful discussions, and the members of the Strasbourg Bioinformatics Platform for their support. This work was funded by the ANR (EvolHHuPro: BLAN07-1-198915) project, the AFM Décrypthon programme and Institute funds from the CNRS, INSERM, and the Université de Strasbourg.
- Mardis ER: A decade's perspective on DNA sequencing technology. Nature. 2011, 470 (7333): 198-203. 10.1038/nature09796.View ArticlePubMedGoogle Scholar
- Philippe H, Brinkmann H, Lavrov DV, Littlewood DT, Manuel M, Worheide G, Baurain D: Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol. 2011, 9 (3): e1000602-10.1371/journal.pbio.1000602.PubMed CentralView ArticlePubMedGoogle Scholar
- Soria-Carrasco V, Castresana J: Estimation of phylogenetic inconsistencies in the three domains of life. Mol Biol Evol. 2008, 25 (11): 2319-2329. 10.1093/molbev/msn176.View ArticlePubMedGoogle Scholar
- Stiller JW: Experimental design and statistical rigor in phylogenomics of horizontal and endosymbiotic gene transfer. BMC Evol Biol. 2011, 11 (1): 259-10.1186/1471-2148-11-259.PubMed CentralView ArticlePubMedGoogle Scholar
- Koonin EV: The origin and early evolution of eukaryotes in the light of phylogenomics. Genome Biol. 2011, 11 (5): 209-View ArticleGoogle Scholar
- Pace NR: Mapping the tree of life: progress and prospects. Microbiol Mol Biol Rev. 2009, 73 (4): 565-576. 10.1128/MMBR.00033-09.PubMed CentralView ArticlePubMedGoogle Scholar
- Parfrey LW, Lahr DJ, Knoll AH, Katz LA: Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc Natl Acad Sci USA. 2011, 108 (33): 13624-13629. 10.1073/pnas.1110633108.PubMed CentralView ArticlePubMedGoogle Scholar
- Desmond E, Brochier-Armanet C, Forterre P, Gribaldo S: On the last common ancestor and early evolution of eukaryotes: reconstructing the history of mitochondrial ribosomes. Res Microbiol. 2011, 162 (1): 53-70. 10.1016/j.resmic.2010.10.004.View ArticlePubMedGoogle Scholar
- Negrisolo E, Kuhl H, Forcato C, Vitulo N, Reinhardt R, Patarnello T, Bargelloni L: Different phylogenomic approaches to resolve the evolutionary relationships among model fish species. Mol Biol Evol. 2010, 27 (12): 2757-2774. 10.1093/molbev/msq165.View ArticlePubMedGoogle Scholar
- Campbell V, Lapointe FJ: An application of supertree methods to Mammalian mitogenomic sequences. Evol Bioinform Online. 2010, 6: 57-71.PubMed CentralView ArticlePubMedGoogle Scholar
- Agnarsson I, Kuntner M, May-Collado LJ: Dogs, cats, and kin: a molecular species-level phylogeny of Carnivora. Mol Phylogenet Evol. 2010, 54 (3): 726-745. 10.1016/j.ympev.2009.10.033.View ArticlePubMedGoogle Scholar
- Studer R, Robinson-Rechavi M: How confident can we be that orthologs are similar, but paralogs differ?. Trends Genet. 2009, 25: 210-216. 10.1016/j.tig.2009.03.004.View ArticlePubMedGoogle Scholar
- Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K: Statistics and Truth in Phylogenomics. Mol Biol Evol. 2011Google Scholar
- Sanderson MJ, McMahon MM, Steel M: Phylogenomics with incomplete taxon coverage: the limits to inference. BMC Evol Biol. 2010, 10: 155-10.1186/1471-2148-10-155.PubMed CentralView ArticlePubMedGoogle Scholar
- Aittokallio T: Dealing with missing values in large-scale studies: microarray data imputation and beyond. Brief Bioinform. 2010, 11 (2): 253-264. 10.1093/bib/bbp059.View ArticlePubMedGoogle Scholar
- Berthoumieux S, Brilli M, de Jong H, Kahn D, Cinquemani E: Identification of metabolic network models from incomplete high-throughput datasets. Bioinformatics. 2010, 27 (13): i186-195.View ArticleGoogle Scholar
- Pop M, Salzberg SL: Bioinformatics challenges of new sequencing technology. Trends Genet. 2008, 24: 142-149. 10.1016/j.tig.2007.12.006.PubMed CentralView ArticlePubMedGoogle Scholar
- Hayden EC: Genome builders face the competition. Nature. 2011, 471 (7339): 425-10.1038/471425a.View ArticlePubMedGoogle Scholar
- Hubisz M, Lin M, Kellis M, Siepel A: Error and error mitigation in low-coverage genome assemblies. PLOS One. 2011, 6: e17034-10.1371/journal.pone.0017034.PubMed CentralView ArticlePubMedGoogle Scholar
- Vilella AJ, Birney E, Flicek P, Herrero J: Considerations for the inclusion of 2x mammalian genomes in phylogenetic analyses. Genome Biol. 2011, 12 (2): 401-PubMed CentralView ArticlePubMedGoogle Scholar
- Hoff KJ: The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009, 10: 520-10.1186/1471-2164-10-520.PubMed CentralView ArticlePubMedGoogle Scholar
- Milinkovitch M, Helaers R, Depiereux E, Tzika A, Gabaldon T: 2X genomes - depth does matter. 2010, 11: R16-Google Scholar
- Pertea M, Salzberg SL: Between a chicken and a grape: estimating the number of human genes. Genome Biol. 2011, 11 (5): 206-View ArticleGoogle Scholar
- Brent MR: Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet. 2008, 9 (1): 62-73. 10.1038/nrg2220.View ArticlePubMedGoogle Scholar
- Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE, Guigo R: Identifying protein-coding genes in genomic sequences. Genome Biol. 2009, 10 (1): 201-10.1186/gb-2009-10-1-201.PubMed CentralView ArticlePubMedGoogle Scholar
- Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Banyai L, Patthy L: Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008, 9: 353-10.1186/1471-2105-9-353.PubMed CentralView ArticlePubMedGoogle Scholar
- Hallegger M, Llorian M, Smith CW: Alternative splicing: global insights. Febs J. 2010, 277 (4): 856-866. 10.1111/j.1742-4658.2009.07521.x.View ArticlePubMedGoogle Scholar
- Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH, Graur D: Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol. 2009, 1: 114-118.PubMed CentralView ArticlePubMedGoogle Scholar
- Ohno S: Evolution by gene duplication. 1970, Berlin (Germany): Springer VerlagView ArticleGoogle Scholar
- Semon M, Wolfe KH: Consequences of genome duplication. Curr Opin Genet Dev. 2007, 17 (6): 505-512. 10.1016/j.gde.2007.09.007.View ArticlePubMedGoogle Scholar
- Kassahn KS, Dang VT, Wilkins SJ, Perkins AC, Ragan MA: Evolution of gene function and regulatory control after whole-genome duplication: comparative analyses in vertebrates. Genome Res. 2009, 19 (8): 1404-1418. 10.1101/gr.086827.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Levasseur A, Pontarotti P: The role of duplications in the evolution of genomes highlights the need for evolutionary-based approaches in comparative genomics. Biol Direct. 2011, 6: 11-10.1186/1745-6150-6-11.PubMed CentralView ArticlePubMedGoogle Scholar
- Durand D, Hoberman R: Diagnosing duplications--can it be done?. Trends Genet. 2006, 22 (3): 156-164. 10.1016/j.tig.2006.01.002.View ArticlePubMedGoogle Scholar
- Conant GC, Wolfe KH: Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet. 2008, 9 (12): 938-950. 10.1038/nrg2482.View ArticlePubMedGoogle Scholar
- Jun J, Ryvkin P, Hemphill E, Nelson C: Duplication mechanism and disruptions in flanking regions determine the fate of Mammalian gene duplicates. J Comput Biol. 2009, 16 (9): 1253-1266. 10.1089/cmb.2009.0074.View ArticlePubMedGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S: Ensembl 2011. Nucleic Acids Res. 2011, D800-806. 39 DatabaseGoogle Scholar
- Thompson JD, Linard B, Lecompte O, Poch O: A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One. 2011, 6 (3): e18093-10.1371/journal.pone.0018093.PubMed CentralView ArticlePubMedGoogle Scholar
- Meader S, Hillier L, Locke D, Ponting C, Lunter G: Genome assembly quality: Assessment and improvement using the neutral indel model. Genome Res. 2010, 20: 675-684. 10.1101/gr.096966.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Boyer F, Morgat A, Labarre L, Pothier J, Viari A: Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functional data. Bioinformatics. 2005, 21 (23): 4209-4215. 10.1093/bioinformatics/bti711.View ArticlePubMedGoogle Scholar
- Rodelsperger C, Dieterich C: Syntenator: multiple gene order alignments with a gene-specific scoring function. Algorithms Mol Biol. 2008, 3: 14-10.1186/1748-7188-3-14.PubMed CentralView ArticlePubMedGoogle Scholar
- Jun J, Mandoiu I, Nelson C: Identification of mammalian orthologs using local synteny. BMC Genomics. 2009, 10: 630-10.1186/1471-2164-10-630.PubMed CentralView ArticlePubMedGoogle Scholar
- Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S: AmiGO: online access to ontology and annotation data. Bioinformatics. 2009, 25 (2): 288-289. 10.1093/bioinformatics/btn615.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang X, Li J, Lee Y, Lussier YA: GO-Module: functional synthesis and improved interpretation of Gene Ontology patterns. Bioinformatics. 2011, 27 (10): 1444-1446. 10.1093/bioinformatics/btr142.PubMed CentralView ArticlePubMedGoogle Scholar
- Ranwez V, Harispe S, Delsuc F, Douzery EJ: MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons. Plos One. 2011, 6 (9): e22594-10.1371/journal.pone.0022594.PubMed CentralView ArticlePubMedGoogle Scholar
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T: Ensembl 2007. Nucleic Acids Res. 2007, D610-617. 35 DatabaseGoogle Scholar
- Zambelli F, Pavesi G, Gissi C, Horner DS, Pesole G: Assessment of orthologous splicing isoforms in human and mouse orthologous genes. BMC Genomics. 2010, 11: 534-10.1186/1471-2164-11-534.PubMed CentralView ArticlePubMedGoogle Scholar
- Goodstadt L, Ponting CP: Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput Biol. 2006, 2 (9): e133-10.1371/journal.pcbi.0020133.PubMed CentralView ArticlePubMedGoogle Scholar
- Ho MR, Jang WJ, Chen CH, Ch'ang LY, Lin WC: Designating eukaryotic orthology via processed transcription units. Nucleic Acids Res. 2008, 36 (10): 3436-3442. 10.1093/nar/gkn227.PubMed CentralView ArticlePubMedGoogle Scholar
- Bakewell MA, Shi P, Zhang J: More genes underwent positive selection in chimpanzee evolution than in human evolution. Proc Natl Acad Sci USA. 2007, 104 (18): 7489-7494. 10.1073/pnas.0701705104.PubMed CentralView ArticlePubMedGoogle Scholar
- Mallick S, Gnerre S, Muller P, Reich D: The difficulty of avoiding false positives in genome scans for natural selection. Genome Res. 2009, 19 (5): 922-933. 10.1101/gr.086512.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006, 7 (Suppl 1): S2 1-31.View ArticlePubMedGoogle Scholar
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314 (5): 1041-1052. 10.1006/jmbi.2000.5197.View ArticlePubMedGoogle Scholar
- Han MV, Demuth JP, McGrath CL, Casola C, Hahn MW: Adaptive evolution of young gene duplicates in mammals. Genome Res. 2009, 19 (5): 859-867. 10.1101/gr.085951.108.PubMed CentralView ArticlePubMedGoogle Scholar
- de Smith AJ, Walters RG, Froguel P, Blakemore AI: Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease. Cytogenet Genome Res. 2008, 123 (1-4): 17-26. 10.1159/000184688.View ArticlePubMedGoogle Scholar
- Gokcumen OO, Babb PL, Iskow R, Zhu Q, Shi X, Mills RE, Ionita-Laza I, Vallender EJ, Clark AG, Johnson WE, et al: Refinement of primate CNV hotspots identifies candidate genomic regions evolving under positive selection. Genome Biol. 2011, 12 (5): R52-10.1186/gb-2011-12-5-r52.PubMed CentralView ArticlePubMedGoogle Scholar
- UniProt: Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, D214-219. 39 DatabaseGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Plewniak F, Thompson JD, Poch O: Ballast: blast post-processing based on locally conserved segments. Bioinformatics. 2000, 16: 750-759. 10.1093/bioinformatics/16.9.750.View ArticlePubMedGoogle Scholar
- Thompson J, Plewniak F, Thierry J, O P: DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 2000, 28: 2919-2926. 10.1093/nar/28.15.2919.PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Thierry JC, Poch O: RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics. 2003, 19: 1155-1161. 10.1093/bioinformatics/btg133.View ArticlePubMedGoogle Scholar
- Thompson JD, Prigent V, Poch O: LEON: multiple aLignment Evaluation Of Neighbours. Nucleic Acids Res. 2004, 32: 1298-1307. 10.1093/nar/gkh294.PubMed CentralView ArticlePubMedGoogle Scholar
- Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform. 2008, 9: 286-298. 10.1093/bib/bbn013.View ArticlePubMedGoogle Scholar
- Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O: MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics. 2006, 7: 318-10.1186/1471-2105-7-318.PubMed CentralView ArticlePubMedGoogle Scholar
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K: The Pfam protein families database. Nucleic Acids Res. 2010, D211-222. 38 DatabaseGoogle Scholar
- Birney E, Thompson J, Gibson T: PairWise and SearchWise: finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res. 1996, 24 (14): 2730-2739. 10.1093/nar/24.14.2730.PubMed CentralView ArticlePubMedGoogle Scholar
- Dessimoz C, Gil M, Schneider A, Gonnet G: Fast estimation of the difference between two PAM/JTT evolutionary distances in triplets of homologous sequences. BMC Bioinformatics. 2006, 7: 529-10.1186/1471-2105-7-529.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.