Krishanpal Anamika1, Juliette Martin2, Suruchi Bakshi3 & Narayanaswamy Srinivasan
Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
Current address: 1Department of Functional Genomics, Institute of Genetics and Molecular and Cellular Biology, 1 rue Laurent Fries/BP 10142/67404 Illkirch Cedex, Strasbourg, France. 2Université de Lyon, Lyon, France; Université Lyon 1; IFR 128; CNRS, UMR 5086; IBCP, Institut de Biologie et Chimie des Protéines, 7 passage du Vercors, Lyon, F-69367, France. 3Doctoral Training Centre, University of Oxford, Wolfson Building, Oxford, OX1 3QD, UK
The article by Robison [11] raises an important point about the validity of gene predictions for the currently available assembly of the chimpanzee genome. Unfortunately, the discussion is misplaced in the context of the analysis of domain combinations in protein kinases encoded in the chimpanzee genome reported by Anamika et al [12]. In general, such problems stem not only from the quality of gene prediction but also from the quality of the genomic data and assembly. Although the discussion is misplaced within the highly specific context of domain combinations in chimpanzee kinases, it is certainly important to discuss and understand the quality of the chimpanzee genome assembly and of the gene predictions, as they could have a significant impact on the results of any analysis of the chimpanzee proteome.
Anamika et al [12] used the ENSEMBL database (release 46) in their comparative analysis of human and chimpanzee kinases. In our opinion, ENSEMBL has set quite high standards for the quality of the datasets it disseminates, including gene annotation [13]. ENSEMBL achieves this quality in a series of stages. Of particular relevance to the comparison of human and chimpanzee proteomes, we would like to draw attention to the following steps employed by ENSEMBL. In the first stage of the gene build, species-specific proteins are aligned to the genome and a transcript structure is built for each protein on the genome. This step is followed by the similarity stage, in which proteins from closely related species are used to build transcript structures in regions where a targeted transcript structure is absent. For species with a significant number of experimentally identified protein sequences, this stage has a major impact on the gene structures in the genebuild. However, according to the ENSEMBL documentation, for species with fewer species-specific protein sequences, this stage plays an even more important role in predicting gene structures. The next stage in ENSEMBL involves alignment of species-specific cDNA and EST sequences to the genome and a careful treatment of the non-translated regions of cDNA. The final set of predictions corresponds to multi-transcript genes, as it is derived by combining identical transcripts corresponding to different protein sequences. For every transcript model, the protein and mRNA sequences serve as supporting evidence.
Based on these steps taken in ENSEMBL, it is clear that even automatically annotated genes are supported by strong mRNA and cDNA evidence from other species. It is therefore difficult to accept that there are serious problems with many of the genes predicted and listed in the ENSEMBL database.
Robison points out differences between the ENSEMBL and RefSeq databases in terms of annotation and gene prediction. This matter is certainly beyond the scope of the analysis of domain combinations in kinases; it is rather a comparison of two different databases, each of high quality in its own right. As stated by Anamika et al in their paper [12], the human proteome data used in their analysis is based on the NCBI 36 assembly of the human genome, released in November 2005. It is composed of 43,570 protein sequences: 3,614 "novel peptides", 18,642 "known peptides" and 21,314 "known-ccds peptides". The annotation "known-ccds" indicates that these sequences are part of a core set of consistently annotated, high-quality data in the frame of the Consensus Coding Sequence (CCDS) project. The phrase "known peptides" indicates that these proteins were mapped to entries in Swiss-Prot, RefSeq or SPTrEMBL during the annotation process. Given the high proportion of proteins that are flagged as "known-ccds" or "known peptides", we are of the impression that the quality of the data used by Anamika et al is generally very good.
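The proportion referred to above follows directly from the three counts quoted in the text. The short sketch below only redoes that arithmetic; the function and variable names are illustrative and are not part of the original analysis.

```python
# Counts of human protein sequences by annotation class, as quoted in the text.
COUNTS = {
    "novel peptides": 3614,
    "known peptides": 18642,
    "known-ccds peptides": 21314,
}


def flagged_share(counts):
    """Return (total, percentage flagged as 'known' or 'known-ccds')."""
    total = sum(counts.values())
    flagged = counts["known peptides"] + counts["known-ccds peptides"]
    return total, 100.0 * flagged / total


total, share = flagged_share(COUNTS)
print(f"{total} sequences, {share:.1f}% flagged as known or known-ccds")
```

Under these counts, the three classes add up to the stated total of 43,570 sequences, and a little over nine in ten are flagged as "known" or "known-ccds".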
Although we believe that the ENSEMBL dataset is of high quality, we agree with Robison that, in general, the quality of the genome sequence and assembly, as well as the quality of gene prediction, can suggest dubious dissimilarities between chimpanzee and human proteins. However, our more detailed comparative analysis of the entire proteomes of human and chimpanzee indicates that there are genuine, significant differences between human and chimpanzee proteins that cannot easily be explained on the basis of genome sequence quality and gene prediction. As examples, we consider the cases discussed by Robison in the later sections of this article.
Unfortunately, Robison's article expresses a few interpretations that differ from what Anamika et al intended. For example, Robison states: "...they (Anamika et al) concluded that chimps possess many kinases with unprecedented domain structures". It should be noted that Anamika et al report 587 putative protein kinases in chimp and only 3 cases of chimp-specific architectures, displayed in Figure 3a and discussed in the text. We believe that Robison's description of 3 out of 587 (0.5%) as "many" is inappropriate. On the contrary, 584 chimp kinases do not exhibit any difference in domain structure compared to human kinases.
In another case, Robison seems to have understood that Anamika et al claim to identify a chimp kinase (ENSPTRP00000001150) whose closest relative in human has 31% identity. Anamika et al did not imply such a point. They did not state that, when searching for orthologues of ENSPTRP00000001150 in the human proteome, the closest human sequence is found to have 31% sequence identity. Indeed, in Table 2 of the Anamika et al paper, the closest human orthologue of ENSPTRP00000001150 is listed as ENSP00000361275, which is described as PLK3_HUMAN and shown in Figure 1 of Robison's paper. The difference between the sequence identity reported by Robison (>90%) and by us (65.5%) comes from the formula used to compute the percentage: it can be expressed with respect to the length of the sequence or the length of the alignment. Thus, in this example the apparent contradiction stems from a difference in understanding. Indeed, both Anamika et al and Robison concur on the nearest human protein to ENSPTRP00000001150.
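The effect of the denominator on percentage identity can be illustrated with a minimal sketch. The function name, the choice of denominators and the toy sequences below are assumptions made for illustration only; they are not the actual ENSPTRP00000001150/ENSP00000361275 alignment, nor necessarily the exact formulas used by either group.

```python
def percent_identity(aligned_a, aligned_b, denominator="alignment"):
    """Percentage identity of two gapped, equal-length aligned sequences.

    denominator="alignment": columns where both sequences have a residue.
    denominator="longer":    ungapped length of the longer sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    pairs = list(zip(aligned_a, aligned_b))
    # Identical residues, ignoring gap characters.
    matches = sum(1 for x, y in pairs if x == y and x != "-")

    if denominator == "alignment":
        length = sum(1 for x, y in pairs if x != "-" and y != "-")
    elif denominator == "longer":
        length = max(len(aligned_a.replace("-", "")),
                     len(aligned_b.replace("-", "")))
    else:
        raise ValueError("unknown denominator: " + denominator)

    return 100.0 * matches / length


# Toy example: a short sequence that aligns to only part of a longer one.
long_seq = "MKTAYIAKQR"
short_seq = "MKSAY-----"
print(percent_identity(long_seq, short_seq, "alignment"))  # over aligned columns
print(percent_identity(long_seq, short_seq, "longer"))     # over full length
```

In this toy case the same four identical residues give 80% identity over the aligned region but 40% over the full length of the longer sequence, which is the kind of discrepancy that arises when one of the two sequences is truncated or incomplete.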
Robison gives the impression that Anamika et al consider this kinase to be a casein kinase 1 with a polo box. Again, Anamika et al did not mean to give such an impression. On the contrary, Anamika et al state on page 3 of their paper that the catalytic domain of the sequence ENSPTRP00000001150 was classified in the casein kinase 1 subfamily because it has 31% identity and an e-value of 2 × 10^-16 with the Position Specific Scoring Matrix (PSSM) of the catalytic domain of casein kinase 1 in the Reverse BLAST analysis. In no way does this mean that the closest relative in human is 31% identical. The point made by Anamika et al, which seems to have been missed by Robison, is that the amino acid sequence pattern of the catalytic region of ENSPTRP00000001150 is similar to the casein kinase 1 PSSM. As stated in our paper, we used multiple PSSMs created for the 55 protein kinase subfamilies represented in the Hanks and Hunter classification scheme [13] as queries to search against the chimpanzee proteome using RPS-BLAST. The kinase domain of ENSPTRP00000001150 was shown to have the highest similarity with a PSSM representative of the casein kinase 1 subfamily. This is therefore an interesting case in which the kinase catalytic region is more similar to casein kinase 1, while the full-length protein sequence is more similar to that of a POLO kinase owing to the occurrence of a POLO box in ENSPTRP00000001150. We contend that this is a difficult case to classify purely from sequence analysis. It will be interesting to understand the role of ENSPTRP00000001150 in signaling processes from experimental studies when they become available.
Robison provides an interesting analysis of whole-genome shotgun sequence, comparing the genomic regions of ENSPTRP00000001150 and ENSP00000361275, and speculates that the finished product will probably contain the currently missing exonic regions. We concur with this speculation. This is a good possibility to look for when genomic data of better accuracy becomes available. For now, however, it remains speculative. Robison identified the ATP-binding site in human ENSP00000361275/PLK3_HUMAN, which is missing in the chimp sequence, and thus concludes that this chimp protein is either incomplete or non-functional. We agree that the corresponding protein, if complete and shown to lack the ATP-binding site, might be non-functional as a kinase. It should be emphasized that Anamika et al discussed this example as a chimp protein kinase with interesting differences compared to its closest human homologue.
Anamika et al reported that the chimp kinase ENSPTRP00000000076 contains a PB1 domain followed by a protein kinase C-like domain, and further reported that this domain architecture is seen only in a sea squirt kinase. Although Anamika et al used Pfam database version 22.0, the architecture PB1/catalytic domain/PKC terminal domain is in fact reported only in a single sea squirt kinase in Pfam version 23.0. The kinases discussed in references 5 and 6 of Robison's paper possess a DAG-binding domain between the PB1 and catalytic domains. Anamika et al also identified these cases in their analysis. On page 7 of the Anamika et al paper it is stated that "Our analysis identified two chimpanzee PKCs and a human PKC with a similar architecture, in which a phorbol esters/diacylglycerol binding domain is inserted between the PB1 and the protein kinase domain." The Robison article reports that three chimpanzee ESTs deposited in October 2007 (DC524857, DC519886 and CD524857) are consistent with the presence of the DAG-binding domain in chimpanzee PKC-zeta. In this context it should be pointed out that the report by Anamika et al identified the PB1/DAG-binding/catalytic domain/PKC terminal domain architecture in two chimp kinases. Indeed, in Table 2 of Anamika et al the authors had already listed ENSP00000367830 (human PKC-zeta) as 87% identical to the chimp kinase ENSPTRP00000000076, which Robison mentions as a point missed by Anamika et al. Robison provides an interesting analysis of the exonic structure of ENSPTRP00000000076 and human PKC-zeta, showing that the deletion in ENSPTRP00000000076 corresponds precisely to the sixth exon of human PKC-zeta isoform 1, and concludes that the sequence of ENSPTRP00000000076 is incomplete. This remains to be confirmed when better genomic data becomes available.
We therefore disagree with Robison's statement that none of the domain architectures proposed by Anamika et al appear to be both novel and well-supported. Indeed, many of the points raised by Robison seem to stem from a misunderstanding of the statements made by Anamika et al. We agree with Robison's suggestion that a list of all the kinase sequences analyzed should be made publicly available. Indeed, all the sequences used in the Anamika et al analysis are publicly available on the KinG web site http://hodgkin.mbu.iisc.ernet.in/~king as mentioned in their paper.
In discussing how similar a human protein sequence can be to its nearest homologue in chimpanzee, Robison appropriately recalls the very high nucleotide sequence identity between the two genomes. We agree with this point. However, the Chimpanzee Genome Sequencing and Analysis Consortium [14] has also pointed out, when comparing chimpanzee and human genomic data, that "Insertion and deletion (indel) events are fewer in number than single-nucleotide substitutions, but result in ~1.5% of the euchromatic sequence in each species being lineage-specific". Also, it is not clear whether the possibility of recombination, horizontal gene transfer, insertion-deletion or frameshift events in either of the genomes can be excluded.
Finally, the chimpanzee proteome has been generated by an automatic annotation system based on biological evidence [14]. The human genome data, which is of relatively high quality, was used as a guide for annotation, by projecting the human gene models onto the chimpanzee genome. Further, ENSEMBL carefully determines transcripts on a case-by-case basis (see http://www.ensembl.org/info/docs/genebuild/genome_annotation.html) [15]. It should be noted that all ENSEMBL transcripts are also based on experimental evidence from EMBL, UniProtKB, and RefSeq. Clearly, one cannot do better than the data one is working with. We contend that, in general, the chimpanzee proteome dataset in ENSEMBL is of good quality and that not all the differences noted between human and chimpanzee proteins can be explained solely on the basis of the quality of gene prediction. It is premature to conclude so at this stage. While Anamika et al focused only on differences in domain combinations of human and chimpanzee kinases, a recent and very careful analysis [16] identified, even more significantly, a few genes that are completely unique to human compared to chimpanzee, and provides useful guidelines for the analysis of chimpanzee genomic data. Also, as pointed out by Robison, the work of Volfovsky et al [17] reports numerous experimentally validated indels in the sequence data of chimp and human. The work of Knowles and McLysaght [16] and Volfovsky et al [17] suggests that more concrete conclusions can only be reached when such experimental validation has been achieved on a much larger scale.