Comparison of RefSeq protein-coding regions in human and vertebrate genomes
© Fong et al.; licensee BioMed Central Ltd. 2013
Received: 12 February 2013
Accepted: 22 August 2013
Published: 25 September 2013
Skip to main content
© Fong et al.; licensee BioMed Central Ltd. 2013
Received: 12 February 2013
Accepted: 22 August 2013
Published: 25 September 2013
Advances in high-throughput sequencing technology have yielded a large number of publicly available vertebrate genomes, many of which are selected for inclusion in NCBI’s RefSeq project and subsequently processed by NCBI’s eukaryotic annotation pipeline. Genome annotation results are affected by differences in available support evidence and may be impacted by annotation pipeline software changes over time. The RefSeq project has not previously assessed annotation trends across organisms or over time. To address this deficiency, we have developed a comparative protocol which integrates analysis of annotated protein-coding regions across a data set of vertebrate orthologs in genomic sequence coordinates, protein sequences, and protein features.
We assessed an ortholog dataset that includes 34 annotated vertebrate RefSeq genomes including human. We confirm that RefSeq protein-coding gene annotations in mammals exhibit considerable similarity. Over 50% of the orthologous protein-coding genes in 20 organisms are supported at the level of splicing conservation with at least three selected reference genomes. Approximately 7,500 ortholog sets include at least half of the analyzed organisms, show highly similar sequence and conserved splicing, and may serve as a minimal set of mammalian “core proteins” for initial assessment of new mammalian genomes. Additionally, 80% of the proteins analyzed pass a suite of tests to detect proteins that lack splicing conservation and have unusual sequence or domain annotation. We use these tests to define an annotation quality metric that is based directly on the annotated proteins thus operates independently of other quality metrics such as availability of transcripts or assembly quality measures. Results are available on the RefSeq FTP site [http://ftp.ncbi.nlm.nih.gov/refseq/supplemental/ProtCore/SM1.txt].
Our multi-factored analysis demonstrates a high level of consistency in RefSeq protein representation among vertebrates. We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by these metrics. The process flow described provides specific information on the scope and degree of conservation for the analyzed protein sequences and annotations and will be used to enrich the quality of RefSeq records by identifying targets for further improvement in the computational annotation pipeline, and by flagging specific genes for manual curation.
The large number of genomes that have been sequenced in recent years has been followed by an unprecedented amount of data at all levels of biological systems, from genomic assemblies to gene annotations and mRNA and protein sequences. Collections of high-quality biological data include curated genomic, transcript, and protein records in the National Center for Biotechnology Information’s (NCBI) Reference Sequences (RefSeq) database ; consistently annotated human and mouse protein-coding regions (CDS) in the Consensus Coding Sequence database (CCDS) ; and curated protein data in Swiss-Prot/UniProtKB . Outside of the best-studied species such as human and mouse, much of the available annotation data is predicted using computational pipelines or high-throughput techniques, with or without supplemental manual curation, and may therefore include more frequent errors . At the same time, the increasing quantity and expanding scope of biological data enables assessment of conservation, evolutionary histories, and functional importance using comparative genomics and proteomics. Comparative genomics is becoming indispensable and has recently been applied toward problems in gene annotation and evolution, such as distinguishing coding from non-coding genes , identifying functional elements , and modeling the evolution of vertebrate exons and introns .
With the exception of human and mouse, most vertebrate RefSeq transcripts and proteins were predicted using NCBI’s eukaryotic annotation pipeline component. Gnomon is the core computational tool which integrates analysis of transcript and protein alignments and ab initio data to generate a set of annotation models which are further filtered before being selected for final RefSeq genomic annotation . Although this method tends to produce gene and protein annotations that match known ones, it also results in annotation differences that are supported by additional data available for one genome over another, or differences that are influenced by insufficient same-species transcript evidence, genome assembly issues, inexact exon definitions based on protein alignments, or limitations of the prediction method.
The Vertebrate RefSeq project has developed a conservative protocol for comparative analysis of proteins in order to assess computational annotation of RefSeq proteins and to supplement quality assurance measures. Our protocol identifies orthology at the level of annotated RefSeq vertebrate genomes. It leverages the sizable collection of genomic, transcript, and protein sequences in the RefSeq database to assess consistency and conservation of protein sequences, domain annotations, and annotated protein-coding sequence (CDS) regions on RefSeq genome sequences across sets of orthologs, while accommodating for diversity in splicing products across genes and wide differences in data quality across existing annotations. As a critical part of our study, we evaluate conservation at two orthogonal levels: protein sequence and gene structure; that is, protein-coding regions on the gene. Changes in amino acid sequence and the translated coding region and exons of the respective genes are driven by different molecular mechanisms (largely mutations vs. exon shuffling). Consequently, integrating analysis of sequence similarity and coding regions helps to detect and distinguish changes involving whole exons from localized mutations and indels. Conservation at the splice level has been used toward novel gene finding, particularly to detect homologs and to predict intron-exon structure [9, 10]. Our work extends these previous studies by including many other vertebrates and by integrating evaluation of all splicing isoforms rather than a selected protein for each gene. We determine “splicing orthologs”  (namely, isoforms with the same pattern of protein-coding exons) in vertebrates by aligning protein-coding regions in protein sequence space.
In this report, we describe the application of our protocol towards three specific aims. First, we identify those proteins with sequences and splice patterns consistent with its orthologs, indicating correct annotation. When these proteins are present in a large number of species across some taxonomic scope (here, across mammals or vertebrates), we also designate them as “core” proteins that are expected to be consistently found in novel genomes and may have high functional importance over their conserved lineage [11, 12]. Second, we search for proteins with inconsistent amino acid sequences compared to their orthologs in order to identify targets for improvements to the computational annotation pipeline and/or for curatorial review. We describe a suite of computational screens which assesses sequence lengths over whole protein and terminal regions, sequence similarity, domain composition, and closest neighbors in other species. Lastly, we explore the effectiveness of combining our measurements to infer the quality of predicted annotations and assemblies.
We take particular care to address challenges related to data quality, availability, and scalability in dynamic databases. First, requiring only sequence data allows the process to be applied to the widest number of proteins, transcripts, and genome annotations available in NCBI databases. In particular, it avoids using external data sources such as whole-genome alignments from the UCSC Genome Browser , which are less frequently updated. Cross-species alignments are efficiently calculated over protein space using BLAST software. Subsequently, transcript sequences and annotated CDS may be mapped onto their respective protein sequence alignments to determine splice conservation across orthologs. Second, our protocol minimizes re-calculations in the course of database updates and takes advantage of pre-computed domain assignments and protein-coding regions when available. Additionally, all proteins are benchmarked against a small set of reference species believed to have higher quality assembly and protein data. These species have been selected to be somewhat well distributed across sequenced vertebrates and are estimated to be of higher quality according to the number of ESTs, contig N50, scaffold N50, curated data, and by our own evaluation.
Comparison of RefSeq orthology vs. HomoloGene
Avg. synteny support1
Orthologs that agree with HG
Orthologs in conflict with HG
Percent that agree with HG
RefSeq orthologs not in HG
In HG, not in RefSeq orthologs
Ortholog dataset used in current analysis
Annotation release date
Genes in ortholog dataset
Annotated protein-coding genes
% Pipeline prediction
western clawed frog
*african savanna elephant
*Bos taurus (bovine)
*domestic guinea pig
gray short-tailed opossum
*bolivian squirrel monkey
northern white-cheeked gibbon
All proteins from each set of orthologous genes are partitioned into clusters containing at most one protein from each species so that each protein is grouped with the most similar proteins from other species according to sequence similarity and length. Limiting comparisons of proteins to those within the same cluster is intended to reduce spurious differences from comparing homologous proteins that are not the closest relative to one another. Figure 3 shows that the largest clusters are slightly smaller than the full ortholog sets; the median number of species in each largest cluster is 25, however again 92-96% of sets contain 10 and 6 species, respectively.
The 34 species reported here represent four taxonomy subgroups: primates (11 species), rodents (5 species), other mammals (11 species), and non-mammalian vertebrates (7 species). From these species, a subset of 12 species believed to have higher-quality data and representing all four subgroups were selected to use as reference genomes: human and Bolivian squirrel monkey (primates); mouse, rat, and guinea pig (rodents); dog, elephant, horse, and cow (mammals); and zebrafish, chicken, and anole (vertebrates). The reference species contain 205,670 proteins. For full assembly information for the organisms evaluated here, see Table 2. This table also describes the version of the assembly and annotation that is represented in the reported dataset and additional details on the protein-coding annotation results, statistical metrics that are commonly used to evaluate the quality of the assembly, and one measure (EST count) of the amount of same-species transcript data that was available for that annotation run.
The Conserved CDS database (CCDS)  contains human and mouse protein coding regions that are consistently annotated in NCBI and Ensembl genome browsers and provides a gold standard for coding region locations. Comparison of human genes in our orthology dataset and in CCDS shows that the intersection of the two dataset contains 92% of all human proteins in orthology sets and 86% of human proteins in CCDS. For mouse, the respective values are 93% and 80%. This sizable overlap confirms that nearly all human and mouse genes in the orthology dataset are likely to have valid CDS. The CCDS proteins that do not overlap with our ortholog dataset relates to our method of determining ortholog sets, which tends to omit paralogous gene clusters and large gene families with notable species expansion differences (e.g., olfactory receptors).
To measure splice-level conservation in protein-coding regions, the splice boundaries of transcripts were obtained from genome annotations or mRNA-genome alignments calculated using the Splign program . Coding regions and their splice boundaries are compared in protein space in order to use protein-protein alignments to determine corresponding genomic positions, similarly to . We define two proteins as splicing orthologs if all protein-coding exons in the two proteins can be paired with 90% overlap in lengths of both exons. Our approach ensures that splicing orthologs exhibit sufficient sequence similarity at the level of every individual protein-coding exon and very similar CDS splice patterns, allowing more flexibility for insertions and deletions than sequence-independent methods such as Exalign . Unlike methods that depend on pre-calculated whole-genome alignments to assess intra- or intergenic regions across species , our software can be applied to any valid sequence, in parallel to revision of existing sequence records and newly deposited proteins.
Protein and conserved splicing attributes, numbers of proteins in evaluation dataset for splicing conservation satisfying different conditions
Number of proteins
All coding exons aligned
1-to-1 mapping between exons
All exons have 90% overlap
All exons have length +/− 15
Proteins with > =3 coding exons
All splice junctions aligned
Protein and conserved splicing attributes, specific differences accounting for absence of conserved splicing as we measure it
Proteins without 90% overlap
Not all coding exons aligned
All exons 1-to-1 but lengths vary
Inspecting the human-mouse, human-cow, and human-zebrafish ortholog pairs that lack conserved splicing provides some insights into why some orthologous proteins are not splicing orthologs. First, 37-58% of ortholog pairs have all coding exons paired one-to-one yet at least one pair of exons differs by over 90% in length. An additional 28-42% of ortholog pairs have protein alignments that exclude at least one protein-coding exon. This may occur due to data quality issues in the genome assembly, lack of high quality transcript or protein evidence, or low sequence similarity. For example, lower sequence similarity in terminal regions may exclude those regions from the protein alignment and consequently the corresponding CDS. Also, certain mechanisms such as exonization, exon shuffling, or intronization are known to create novel coding exons or to merge or split exons. By searching for exons with no counterpart in the ortholog or split exons in one transcript mapped to a single exon in the other (including both single-exon and multi-exon genes), we find that 20-30% of non-splicing orthologs may have undergone these mechanisms. However, these differences may also be due to errors from our annotation pipeline or in the genome assembly at that gene location.
The above results do not change if we restrict the evaluation set to the 4247 clusters that contain one protein from each of the four species; in that case, a very slightly higher fraction of proteins show conserved CDS (data not shown). We also verify that splicing conservation and sequence conservation are complementary measures. Over pairs of orthologous proteins from human and each of the other species, and excluding protein pairs with perfectly conserved splicing which contribute a large number of tied scores, the Pearson correlation coefficient between sequence identity (number of identical residues over alignment length) and fraction of exons conserved (with 90% overlap) is a weak 20-42%.
The number of genes with conserved splicing orthologs appears to be fairly stable across mammals when requiring a splice match to at least 3, 6, or 9 orthologs from reference species; however, the choice of threshold greatly impacts the number of genes labeled as conserved CDS. The median number of genes (across all species) with splice orthologs from 9 out of 12 reference taxa is 1707 compared to 8197 at the 3-reference threshold. Note that reference species exhibit fewer conserved genes at the 6- and 9-ortholog thresholds because each reference gene may be compared to 11 reference species, while all non-reference species were compared to 12 reference species. However, this bias against reference organisms has little impact at the 3-reference threshold. Accordingly, we use the 3-reference threshold as a more inclusive approach to measure the number of genes supported by conserved splicing.
Looking at individual organisms, for human, 70% of genes have evidence of conserved splicing to at least 3 other reference species, a fraction comparable to the human-mouse conservation rate from the previous section. Among all the organisms in our evaluation set, human, mouse, and chimpanzee have the highest splice conservation rates and numbers of conserved genes, possibly reflecting higher annotation quality for human and mouse which have undergone extensive curation efforts. The similarity between chimpanzee and human is expected to have improved annotation of chimpanzee in the NCBI eukaryotic annotation process. Overall, 20 species have over 50% of the genes in ortholog sets with conserved splicing with respect to 3 reference species. Considering the large number of conserved genes and the diversity among its orthologous proteins in both sequence and splicing conservation, we expect that providing information on the scope of conserved splicing will be an interesting addition to RefSeq records.
Examples of human genes with splice orthologs across taxonomic subgroups
Distribution of reference proteins among taxonomic subgroups
Sets with conserved splicing
Please note that splicing conservation of the CDS is determined in pairwise fashion, includes 90% length criteria, and is not necessarily transitive. If gene A has a coding region 90% the length of the corresponding coding region in gene B, and gene B likewise has length 90% compared to gene C, then genes A-B, and B-C are conserved, but genes A and C don’t meet the 90% length criteria and do not have conserved splicing, by our definition. This explains how in Figure 5B, the number of ortholog sets with 1 to 8 genes that are conserved to 9 reference genes is slightly higher than zero, and likewise for the other conservation thresholds in both Figures 5A and 5B.
Figure 2 illustrates 10 orthologous alpha-2 macroglobulin proteins from human (GeneID 2) and its orthologs. This gene has a relatively small set of orthologs each encoding a single, richly annotated protein product. Although this orthology set contains similar proteins, the degree of conservation differs when terminal sequence regions and splicing conservation are assessed. Thus, among the reference taxa included in this set of orthologs only human and dog are splicing orthologs. The computationally predicted turkey protein is N-terminally truncated due to a gap in the turkey assembly. The N-terminus of the computationally predicted cat and opossum proteins exhibit greater divergence in length and sequence similarity, respectively. Genome annotation for both is primarily based on protein alignments coupled with ab initio, as there is minimal same-species transcript data available. In contrast, the computationally predicted galago protein is of high quality having a conserved N-terminal sequence as well as conserved splicing with 90% overlap in all protein-coding exons compared to human, dog, and guinea pig (all reference taxa); however, human and guinea pig are not splicing counterparts due to length variation in exon 18 (115 bp and 133 bp, respectively). By defining core proteins using a low threshold for the number of reference proteins with conserved CDS, we are able to identify sets of proteins with conserved CDS to at least a few other orthologs, typically from the closest species, without requiring such high level of conservation over all pairs of proteins.
Frequency of outlier proteins by category
To measure variation in domain composition, one protein with maximum similarity to orthologs is selected for each gene, that is, the protein with maximum average Jaccard score of domain content [20, 21]. A score of 1 indicates that two proteins compared have the same domain composition while a score of 0 indicates no domain in common. Across all species in this study, the average domain score (when calculated) falls within a narrow range of 0.79-0.82. These values are significantly lower than found by  which may be due to differences in ortholog identification, domain assignment, and calculation of domain score over only sizable sets of orthologs. Over all genes, 51% had score 1, 13% had score 0, and 6.6% had no score calculated. Using average domain scores for reference species proteins as a sample distribution, a Z-score is calculated for each protein. There are 4065 proteins of interest with a Z-score greater than +/− 2 yields. We also search directly for proteins with extra, missing, or truncated domains compared to all but one of the reference proteins. Unsurprisingly, missing domains are 8-fold more common than extra domains. Some sequence divergence or even a small mis-annotated region may be sufficient to disrupt alignment between a domain PSSM and the sequence, but the presence of an extra domain may point to genuine domain shuffling or long mis-annotated regions.
We identify unusual protein length over the whole protein and within the N-terminal, C-terminal, and conserved regions. First, N-terminal regions are defined as the first 30 and 100 amino acids in each protein (selected to represent short regions and the upper bound on known lengths of mitochondrial transit peptides). A multiple sequence alignment is calculated for each protein cluster, allowing length differences between each protein and all other aligned proteins to be compared. We also define each region based on indel content in the columns of the MSA. Protein lengths are the most common unusual feature detected (see Table 6). However, this is due to a relaxed definition of length outliers that allow length differences as short as 15 amino acids, in order to provide detailed information for review.
Finally, we searched for two types of errors in the protein N-terminus. First, we looked for alignment of the initial Methionine (Met) amino acid to a downstream Met in multiple proteins, which may point to the less common initial Met being an incorrect translation start position. Requiring either the upstream or downstream Met to be conserved in at least half of the comparisons to proteins from the reference species proteins dataset returns 30,953 proteins. Our protocol has already clustered alternative splicing products to their closest counterparts however the majority of proteins in the dataset are inferred from predicted gene annotations and for many genes, only one protein product is predicted. Consequently, our results indicate that some of these predicted proteins may have incorrect translation start positions, while other genes may encode additional products with the more conserved translation start .
A second type of error at N-terminal involves exons annotated at the incorrect genomic location. N-terminal coding exons are frequently more distant from the remaining coding exons and more challenging to annotate in computational pipelines when there is scant same-species transcript data available to specifically define the exon boundaries and when homologous protein alignments do not extend to the N-terminus due to cross-species sequence differences, masking of genomic sequence, or indels or larger gaps in the assembled genomic sequence. We attempt to identify such errors using sequence similarity: proteins with particularly poor sequence similarity at the N-terminal compared to its orthologs and compared to whole-sequence similarity are candidates for incorrect N-terminal coding exons. Testing on N-terminal regions defined as initial 30-residue or 100-residue regions identifies 1267 proteins that need curator review.
Our results provide a summary of specific, consistent differences in particular proteins, which may be valuable for internal review to improve the manually curated dataset and to identify targets for improvement of NCBI’s genome annotation pipeline.
Estimating annotation quality using protein features and sequence data
Domain score outlier
Conserved downstream Methionine
No protein in main cluster
Negative sum-of-logs score
*African savanna elephant
*domestic guinea pig
*bolivian squirrel monkey
northern white-cheeked gibbon
gray short-tailed opossum
western clawed frog
Correlation Contig N50
Correlation EST count
To estimate whether these criteria can be used to help gauge annotation quality, we calculate the Pearson correlation coefficient for each method. The 6 individual methods have an inverse correlation with contig N50 ranging from −0.29 to −0.49 while the sum-of-logs score has higher correlation 0.72 (p-value < 0.00001). (The respective values calculated using EST counts are very similar.) Interestingly, among the individual methods, the test of conserved upstream/downstream Met has strongest correlation compared to all other criteria, and its Spearman correlation coefficient is even stronger (−0.7). This likely indicates a deficiency in the NCBI eukaryotic annotation pipeline specific to correctly annotating N-terminal regions. Correlations are higher when calculated separately for each taxonomic subgroup Excluding human, mouse, and zebrafish which have outlier contig N50 values, the Pearson correlation coefficient was 0.88 for primates, 0.95 for rodents, 0.67 for mammals, and 0.87 for vertebrates (all p-values < 0.03). These results confirm that the combined “error” score may be valuable for estimating quality, especially by comparing scores between close species.
We note that the number of core proteins in each species is only weakly correlated with contig N50s (correlation 0.32) and did not boost performance of the combined score with respect to a stronger correlation coefficient. Here, core proteins were defined as the splice-conserved proteins in the 7,577 gene (ortholog) sets that each contain at least 17 proteins having conserved splicing to 3 or more reference proteins (as described in a previous section). Nevertheless, we may use the number of core proteins to supplement the combined score as the former is independent to the combined score and more easily interpreted, as a direct measurement of the extent of gene conservation. We note that in contrast the total number of protein-coding genes in each species bears no correlation with contig N50; this could imply that a sizable number of protein-coding genes are species specific or erroneous.
Within the primate subgroup, the correlation between the number of core proteins and contig N50s for all species besides human was particularly strong at 0.82 (p-value < 0.003). Indeed, both the combined score and the number of core genes are low for macaque which is known to have a poorer quality genome assembly , and likewise predicted (based on a lower contig N50) for orangutan. In contrast, gorilla, which has a new assembly based on Sanger and Solexa sequencing, has a larger number of core proteins and higher combined score despite a lower contig N50. This indicates that our approach is a more sensitive metric for annotation quality than N50 or EST count alone. For mammals, platypus, Tasmanian devil, and pig have low combined score while cat and panda have higher scores despite scarcity of contig N50s or ESTs. These results exemplify how directly evaluating conservation across orthologous genes provide more sensitive measures of overall annotation quality.
RefSeq is in a unique position to provide orthology and comparative analysis results, as a repository of genome-wide high-quality gene, transcript, and protein records, and having access to resources hosted by NCBI and other sites. An efficient hybrid method for orthology identification has recently been put into production to provide expanded quality assurance for curated RefSeq proteins and identify areas to target improvements in the genome annotation pipeline. These results supplement the data available in HomoloGene. Taking advantage of the extensive orthology data available, we have developed a computational pipeline to perform several orthogonal analyses on these orthology sets. The process described here has been incorporated into regular RefSeq processing: it is run regularly in response to newly annotated genome assemblies, changes in the gene membership of ortholog sets, and changes (updates and additions) to the protein products of each gene. Employing parallel processing resources enables results for the 568,459 proteins in our dataset to be calculated within hours, and this process can be adapted to scale linearly to accommodate growth in the number of genomes.
Using our suite of methods, we demonstrated a high level of consistency in RefSeq protein representation among vertebrates. Independent assessment measures that include considerations of protein sequence similarity, exon coverage, and splice similarity provide similar results. Previous comparisons of human and mouse orthologs have reported identical splicing in 32% of transcript pairs, and 64% highly conserved splice orthologs with a relaxed criteria that tolerates exon length differences of up to 5 codons , and identical lengths in 73% of corresponding human-mouse exons within a smaller data set . Our results of 71% splice orthologs between human and mouse and 68% splice orthologs between human and cow are consistent with these previous reports but we offer a considerably expanded dataset scope. These results lend support to conclusions of the quality of RefSeq proteins for organisms beyond human and mouse and provide specific information about the most conserved protein annotations. These results suggest that within a relatively narrow taxonomic scope such as mammals, many orthologs can be expected to have similar structure in their protein-coding exons, and that comparison of splicing is a reasonable metric for distinguishing counterparts among the various isoforms in orthologous genes.
We find that the majority of the RefSeq vertebrate proteins for which we have calculated orthology are good as measured by several orthogonal metrics (number of orthologs in sets, splice conservation, protein tests), and we find particular concern in N-terminal sequence definitions. Furthermore, our results suggest that evaluating annotation results for unusual sequence qualities is a reasonable metric for annotation quality that is independent of available transcript data and contig N50. Our findings agree with previous reports of lower quality annotation for rhesus and our aggregate error score may be a generally useful measure of overall annotation quality for a given genome (a more direct and granular metric than contig N50 although there is a correlation with contig N50).
Novel genomes of interest may contain few highly-conserved genes compared to the organisms in our evaluation set, particularly organisms have been shown to be genomic “singletons” with few close relatives . We have attempted to assuage this issue by selecting a representative set of organisms related to the most commonly analyzed mammals. We also showed here that thousands of genes in mammals that are relatively distant from primates and rodents are highly conserved compared to 3 or more reference species. Consequently we expect to be able to reuse this reference set of genomes to evaluate a sizable fraction of genes in a variety of mammals. Further, our computational pipeline may be applied to a different set of organisms. For a finer-level evaluation of novel genomes, we can further refine our process flow to identify genome neighbors  and apply the process described here using that customized set of species for comparison. Note that our method relies on coverage of proteins from those organisms in Swiss-Prot and availability of accurate assembly data, so such an approach would still have some shortcomings.
The process flow described here is being incorporated into the suite of RefSeq analysis protocols and results will be used multiple ways including: a) identify outliers needing the attention of RefSeq curation staff; b) provide additional public information about proteins with higher conservation as well as protein isoforms that are predicted to be more functionally relevant (or of uncertain function) based on the annotation signatures of signal peptide and domain content; c) as a quality assurance benchmark for annotation of new vertebrate (especially mammalian) genomes in that the most conserved protein dataset should reasonably be expected to be annotated; and d) to further improve the NCBI eukaryotic genome annotation pipeline.
Orthologous genes in human and 33 additional vertebrates (Table 2) were identified using a hybrid method of protein homology and local synteny. RefSeq proteins and Gnomon models for each taxa were queried against the Swiss-Prot database  using BLASTp with default parameters, and the best hit selected based on bit score with at least 50% query coverage. Proteins with best hit to Swiss-Prot proteins with the same name (for example “Alpha-2-macroglobulin”) are considered potential homologs to one another. At present all sets are “anchored” around a human gene: each set of homonymous proteins must contain a human protein with respectively named counterpart in Swiss-Prot.
Their respective genes are confirmed as orthologs if at least two of the six flanking genes (three on each side) for each gene are in the expected order and orientation or if the genes are single copy in both the human and the target species and share at least one of the six flanking genes. For this study, only genes that can be mapped to a reference genome assembly are retained, to ensure that exon annotations can be calculated against the same reference assembly. For simplicity, only non-redundant proteins are retained for this evaluation. One gene can encode multiple, identical proteins in the database when they correspond to different transcripts with the same coding regions.
Genes in this dataset, primarily from human and mouse, may encode one or more proteins (due to alternate splicing). Protein sequence similarity was used to cluster proteins in each ortholog set to identify the most similar proteins among alternative splicing variants. Within each set of orthologous genes, pairwise alignments were calculated between all proteins from reference species and all proteins in the set using BLAST  with parameters E-value 1e-05, no composition-adjusted statistics, and no masking, and BLAST scores were used to determine reciprocal best hit proteins for each pair of genes (species) with alignments. Ties are broken by selecting the protein with length closer to the query protein. Among identical proteins for one gene, as may occur when the gene has UTR-specific splice variants, one protein is randomly selected to be the representative protein. Proteins are then clustered into sets of “orthologous proteins” such that no set contains more than one protein from each gene, according to a simple greedy algorithm: Analyzing reciprocal best hits in order of descending BLAST score, the two proteins are merged into the same cluster (or rather, their respective clusters are merged) if this operation does not create a cluster containing two proteins from the same gene.
Exon junctions for each transcript are extracted from pre-existing mRNA feature annotations on the RefSeq record when they exist, or computed through cDNA-genome alignments using Splign . Genomic locations for each gene are obtained from database for the current reference assembly. The pairwise protein-protein alignments described in the previous section define the corresponding nucleotide positions between their respective mRNAs as well. Exon boundaries are then mapped onto the transcript sequences providing inferred alignments between protein-coding exons, enabling us to test conservation across exon positions and splice junctions. A few adjustments are made to reduce the impact of incomplete protein and mRNA-genome alignments: The alignments between the exons at the N- and C-terminals of the protein alignment are extended gap-free to the beginning and end of the respective exons. When a transcript cannot be wholly aligned to the selected genomic region, nucleotide positions outside of a defined exon are labeled as part of the previous exon, except for positions preceding the first exon.
For each proteins, domain from the Conserved Domain Database  are assigned from protein sequence using the CD-Search tool . These values are pre-calculated and contained in the protein record viewable in NCBI’s Protein resource . We use all domains in CDD v3.08 that have been clustered into superfamilies, which include all NCBI-curated and imported models except certain models that span multiple other domains. Additionally, signal peptides are predicted for all sequences using SignalP4.0 .
Z-scores for different scores, including Jaccard indices, were calculated using reference proteins to determine mean and standard deviation.
Second, proteins with extra, missing, and truncated domains compared to at least r-1 reference proteins (where r is the number of reference proteins) are flagged. We use r-1 to enable testing of the reference proteins themselves. Since CDD superfamily annotations are based on alignments between the protein and a CDD domain in the superfamily cluster, we define domain truncation as the maximum hit length fraction (length of alignment between PSSM and protein sequence, divided by PSSM length) is less than 60% of the PSSM length over one or more occurrences of that superfamily on the protein; this maximum hit length fraction is also required to be less than 60% of the hit length fraction for this superfamily in all other proteins in its cluster.
For clusters of proteins with at least 5 members from reference species, multiple sequence alignments were calculated using COBALT . Then, we performed a suite of calculations for length, N-terminal, and sequence outliers for each protein compared to reference proteins:
Average ungapped sequence identity over the first 30 and 100 residues, flagging proteins with identity < 50%, Z-score of identity greater than +/−2. To aid in screening out cases with low sequence identity over the whole sequence, we also require the Z-score of log( identity over N-terminal region / identity over whole sequence) to be greater than +/− 2.
Average length difference at N-terminal compared to reference protein, flagging proteins with length difference greater than 15 and Z-score of length difference greater than +/− 3.
Identify proteins with initial methionine codon aligned to a downstream methionine in another protein, or vice versa, where this difference is preserved across more than (r-1)/2 proteins.
To increase sensitivity in detecting length outliers that might be masked by similar whole-protein length, we also label regions of each multiple sequence alignment as conserved, N-terminal, C-terminal, or intermediate regions by requiring at most 20% gap content across reference proteins within conserved regions. Trailing terminal sequence regions not present in the MSA are included in the calculations. Then, we report proteins with length difference across any region or whole protein with Z-score greater than +/− 3 or absolute length difference greater than 15 compared to r-1 reference proteins.
The RefSeq proteins used in this analysis are publicly available at NCBI . A supplemental data file listing the protein identifiers and related information is provided on the RefSeq ftp site .
JF worked at the National Institutes of Health during the period of time that this work was carried out and the manuscript was written. She was not affiliated with the NIH at the time of manuscript submission.
Basic local alignment search tool
Conserved domain database
Consensus coding sequence
Constraint-based multiple protein alignment tool
Insertion or deletion
National Center for Biotechnology Information
Position specific scoring matrix
NCBI taxonomy identifier
This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. We would like to acknowledge technical support provided by Andrei Shkeda and Olga Ermolaeva.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.