Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs
© Oduru et al 2003
Received: 31 January 2003
Accepted: 3 June 2003
Published: 3 June 2003
Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences and direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes.
735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic location of each sequence was mapped in human, mouse, rat and pufferfish, where applicable, and the structure of the cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque.
The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that these genes may play roles in the development and/or function of testicular cells.
The initial publication of two draft versions of the human genome led to intense debate over the exact number of genes in the human genome [1, 2]. Current estimates suggest that the human genome encodes approximately 35,000 to 38,000 genes, although the final number must await the complete annotation of each genome sequence. The search for additional genes not discovered during early annotation attempts has involved several different approaches. These have included the sequencing of randomly selected cDNAs from various tissue sources, the development of computer-based prediction programs of ever-increasing accuracy, and the direct comparison between the human genome and the genome sequences of other vertebrates and invertebrates [3–9]. Using these approaches, fully annotated genomes of numerous species will be available within a relatively short time.
We approached the problem of gene identification by using a combination of experimental and in silico techniques. Specifically, we initiated a project designed to sequence expressed sequence tags from the hamster testis and used these sequences to identify unannotated, or incompletely annotated, genes in the human and other vertebrate genomes. The hamster has not been used extensively in genomics research; however, it has been used widely in other areas of investigation, including circadian rhythm research and several areas of reproductive biology. For example, the study of hamster gametes has revealed significant information concerning the mechanisms underlying species-specific sperm-egg interactions [11–13] and the deleterious effects of endocrine disruptors on male and female reproductive development [14–17]. The hamster, mouse and rat are all members of the family Muridae; however, mice and rats belong to the subfamily Murinae, whereas hamsters belong to the subfamily Cricetinae. Three hamster species that are commonly used in research are Mesocricetus auratus (Syrian golden hamster), Cricetulus griseus (Chinese hamster) and Phodopus sungorus (Siberian hamster). Therefore, sequence information from any hamster species should complement information gained from other closely related species.
The testis was chosen for these studies as it represents a viable source for the identification of novel genes. The adult testis is a complex organ consisting of numerous different somatic cell types as well as germ cells at all stages of spermatogenesis, from the gonocyte stem cells to the mature sperm cells. Consequently, several unique gene populations, including those involved in the regulation of meiosis, as well as those specific to the various testicular cell types, are expressed in the testis. A recent gene discovery study performed in the testis of Drosophila melanogaster found that 47% of more than 1500 sequenced cDNAs did not match ESTs previously identified in this organism. Likewise, the testis of the cynomolgus monkey has yielded several novel gene sequences [8, 9]. Therefore, we reasoned that the sequencing of ESTs from hamster testis might reveal the existence of novel genes conserved in other species that may function in controlling testicular development and/or function. In this report, we describe our initial results from the sequencing of randomly selected cDNAs from the testes of male Syrian golden hamsters. In particular, we identified eight cDNAs that appear to be derived from genes that were not previously annotated in the human genome. We describe the detailed analysis of two of these genes, which encode a new member of the kinesin superfamily of microtubule-based molecular motors and a protein likely to be involved in chromatin remodeling.
Table 1. Classification of sequenced hamster cDNAs by functional category, with the number of clones in each category.
We next considered the overall complexity of the sequences in our database as our cloning strategy made it possible to obtain multiple fragments from the same cDNA. We identified only 40 instances where more than one fragment was derived from the same gene (data not shown). In addition, only 17 fragments appeared more than two times in the database and only 6 fragments appeared more than three times. The most common sequences were derived from genes that are known to be highly expressed in the testis in other organisms and include heat shock protein 90A, chitobiase, outer dense fiber of sperm tails and kinectin. Therefore, the majority of sequences were represented only once or at most two times in the data set. The fact that most sequences were not identified multiple times suggests that we have not exhaustively sequenced all of the DpnII fragments from cDNAs expressed in the hamster testis. This is not surprising as conservative estimates suggest that there are many thousands of unique transcripts expressed in the various somatic and germ cell lineages in the testis.
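The redundancy tally described above can be sketched in a few lines. This is only an illustration of the counting logic, not the procedure used in the study: the function name `redundancy_summary` and the toy gene assignments are hypothetical, and the real analysis was based on BLAST matches that are not reproduced here.

```python
from collections import Counter

def redundancy_summary(gene_per_fragment):
    """Count how many genes were hit once, or more than 1, 2 or 3 times."""
    hits = Counter(gene_per_fragment)
    counts = list(hits.values())
    return {
        "genes": len(hits),
        "seen_once": sum(1 for n in counts if n == 1),
        "seen_more_than_once": sum(1 for n in counts if n > 1),
        "seen_more_than_twice": sum(1 for n in counts if n > 2),
        "seen_more_than_three_times": sum(1 for n in counts if n > 3),
    }

# Hypothetical toy assignments (one gene label per sequenced fragment);
# these are NOT data from the study.
toy = ["HSP90A", "HSP90A", "HSP90A", "HSP90A",
       "kinectin", "kinectin", "geneX", "geneY"]
summary = redundancy_summary(toy)
print(summary)
```

A data set in which most genes fall into the `seen_once` bin, as reported here, is what one would expect when sampling is far from saturation.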
The largest group of sequences identified in this project falls into the functionally unclassified group (Table 1). This group includes a small number of named genes for which no function has yet been assigned but primarily contains the results of full-length cDNA sequencing projects or genes predicted by gene finding software. Approximately one third of these clones (78 out of 239) were originally isolated from testis libraries, suggesting that a large number of genes remain to be functionally characterized in the testis.
Putative novel genes identified by sequencing of random hamster testis cDNAs
AL832216 AK095517 XM_086235
In this report, we describe the sequencing and initial characterization of more than 700 randomly selected ESTs from the hamster testis. This represents the first such study carried out in hamster, as dbEST listed just twenty-seven entries from hamster in January 2003 (release 012403). It has been widely speculated that the sequencing of additional mammalian genomes will aid in the annotation of the human genome, particularly in the identification of previously unidentified coding regions through the mapping of conserved regions in different genomes. We describe here our initial characterization of eight genes that were not annotated on human genome sequences at the beginning of our study. Although predicted structures for some of these genes have appeared recently, our data represent the first experimental verification of their existence in several cases, particularly in the testis. We were unable to predict specific functions for the proteins encoded by six of the genes; however, the other two genes encoded a new member of the kinesin superfamily (KIF27) and a protein predicted to play a role in chromatin remodeling.
Our studies revealed the existence of a new member of the kinesin superfamily of microtubule-based molecular motors, which we have named KIF27. KIF27 RNA was detected primarily in hamster testis but weaker signals were also present in several other tissues, suggesting that this protein may function in numerous cell types. Significant characteristics of the KIF27 genes mapped in human, mouse and rat included a conserved 18-exon arrangement and the existence of at least three mRNAs that resulted from alternative splicing, at least in human. Although partial cDNAs existed for the human and macaque genes in public databases, we were able to construct full-length cDNA sequences for human, mouse and rat from genomic comparisons, and to assemble a corresponding sequence from macaque by joining two previously reported cDNAs. Kinesins are characterized by a conserved motor domain of approximately 350 amino acids that may be located at the amino terminus (KIN-N), at the carboxy terminus (KIN-C) or within the polypeptide (KIN-I), and the motor domain of KIF27 is located in the N-terminus of the protein. Based on this arrangement and the sequence of the adjacent neck domain, we assigned KIF27 to the N-5 phylogenetic group of kinesins. The N-5 subgroup is defined by the human KIF4 protein, which is primarily localized to the nuclear matrix and associates with chromosomes during mitosis. In cell division, nuclear kinesins of this chromokinesin class appear to be important for the maintenance of sister chromatids on the metaphase plate. However, additional functions for nuclear kinesins have recently been uncovered. For example, KIF17a, a member of the N-4 subgroup that is predominantly expressed in germ cells, was recently shown to possess transcriptional regulatory properties by controlling access of the transcriptional activator protein CREM to a coactivator, ACT.
Clearly, functional analysis of the KIF27 polypeptide will be needed to establish the subcellular location of this protein and to determine whether it adds to the growing number of kinesins that function in the nucleus. In this regard, although the carboxy-terminal regions of KIF27 and KIF4 display little significant sequence similarity, both contain a putative topoisomerase domain that may be important for nuclear functions (figure 2). In addition, several clusters of basic amino acids that may function as nuclear localization signals are located in the KIF27 polypeptide.
The final clone isolated in our search encodes a 433 amino acid protein whose sequence was highly conserved from human to pufferfish. This protein, now named SMYD2, contains SET and MYND domains that are characteristic of proteins with chromatin remodeling capabilities. The SET domain is a common feature of proteins with histone lysine methyltransferase (HKMT) activity and has been identified in hundreds of proteins in organisms ranging from bacteria and viruses to humans. The SMYD2 SET domain is separated into two parts (i.e. a S-ET domain) and is followed by a short cysteine-rich region that is common in many SET domain proteins (figures 2 and 12). The MYND domain is located between the two halves of the SET domain, and similar domains have been identified in a number of proteins that function as transcriptional repressors, including the ETO protein that is fused to the AML-1 transcription factor in the t(8;21) translocation in acute myeloid leukemias. The MYND domain contains two zinc finger motifs and is a protein-protein interaction interface responsible for the recruitment of corepressors [39–41].
This domain organization is conserved in several proteins in public databases, including the recently described SMYD1 (also known as BOP) family of proteins. Three isoforms of SMYD1 have been reported thus far (referred to by their original names, m-BOP1, m-BOP2 and t-BOP), which are products of a single gene and result from either alternative splicing or alternative promoter usage [31–33]. m-BOP is essential for cardiac differentiation and morphogenesis while t-BOP is expressed in cytotoxic T lymphocytes. Studies are currently underway to examine the function(s) of SMYD2 in testis and heart.
The impetus for this study arose from the need to perform microarray experiments to investigate the molecular changes elicited by environmental toxicants on male reproductive function in hamster. Comparisons between the limited number of hamster sequences in public databases and orthologous sequences from the mouse suggested that, despite the close taxonomic relationship between hamster and mouse, evolutionary divergence in coding sequences was sufficiently great in certain cases that reagents developed for the mouse would be of limited use for genomic studies in the hamster. This conclusion was supported by experimental observations in our laboratories indicating that probes derived from rat and mouse cDNAs yield inconsistent results in Northern blotting under stringent hybridization conditions (data not shown). The reagents described here will now permit the initiation of genomic studies in the hamster.
Total RNA was prepared from testes of adult Syrian golden hamsters (Mesocricetus auratus). Poly A+ RNA was isolated using the poly A Spin™ mRNA isolation kit (New England Biolabs, Beverly, MA). 5 μg of poly A+ RNA was converted to double-stranded cDNA using the cDNA Synthesis Kit (Life Technologies, Gaithersburg, MD). The cDNA was then digested to completion with DpnII and electrophoresed on a 1.5% agarose gel. Five populations of digested cDNAs in size ranges between 100 and 800 bp were excised from the gel and purified using Qiaex II resin (Qiagen, Valencia, CA). Each population of cDNA was ligated into a pBluescript vector that had been digested with BamHI and treated with alkaline phosphatase to decrease the rate of self-ligation. The ligations were transformed into DH5α supercompetent cells (Life Technologies) and positive clones were identified by blue-white selection. The success of the cloning procedure was initially monitored by picking 5 clones from each group, preparing plasmid DNA and sequencing. These preliminary studies indicated that the most useful clone sets were those derived from cDNAs in the 100–300 bp range. White colonies from these two sets were carefully picked and used to inoculate single wells of a 96-well culture block, with each well containing 1.2 ml of TB (1.2% Tryptone, 2.4% yeast extract, 0.4% glycerol) supplemented with 50 μg/ml carbenicillin. The bacteria were grown for 20 hours at 37°C and glycerol stocks were subsequently prepared using a Biomek 3000 robot (Beckman-Coulter, Inc., Fullerton, CA). Fresh cultures (in 1 ml of 2X LB (2% Tryptone, 1% yeast extract, 1% NaCl) plus 50 μg/ml carbenicillin) were inoculated from the glycerol stocks using a 96-well needle transfer device and grown as before. Plasmid DNA was prepared using the full lysate protocol of the Montage™ Plasmid Miniprep96 Kit (Millipore, Bedford, MA). Typical yields of plasmid DNA ranged from 5–10 μg in a 50 μl volume.
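The cloning strategy relies on DpnII (recognition site GATC) leaving the same 5'-GATC overhang as BamHI, so that DpnII fragments ligate directly into the BamHI-cut vector. A minimal in silico sketch of the digestion and size-selection steps is given below. It is an illustration only: the function names and the toy cDNA are hypothetical, and the sketch assumes complete digestion and keeps only fragments flanked by two DpnII sites, since only those carry a compatible overhang at both ends.

```python
import re

DPNII_SITE = "GATC"  # DpnII cuts immediately 5' of the G, leaving a GATC overhang

def dpnii_fragments(cdna):
    """Return top-strand fragments lying between consecutive DpnII sites."""
    seq = cdna.upper()
    cuts = [m.start() for m in re.finditer(DPNII_SITE, seq)]
    # Terminal pieces (before the first site, after the last) are discarded
    # because they have a DpnII end on one side only.
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

def size_select(fragments, min_bp=100, max_bp=300):
    """Mimic gel excision of a size range (defaults: the 100-300 bp window)."""
    return [f for f in fragments if min_bp <= len(f) <= max_bp]

# Hypothetical toy cDNA with three DpnII sites; not a real hamster sequence.
toy_cdna = "AAAA" + "GATC" + "A" * 150 + "GATC" + "T" * 20 + "GATC" + "CC"
fragments = dpnii_fragments(toy_cdna)
selected = size_select(fragments)
print([len(f) for f in fragments], [len(f) for f in selected])
```

Because one cDNA can contain several DpnII sites, the same transcript can contribute more than one clone to the library, which is why the redundancy of the data set has to be assessed at the gene level.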
Plasmid DNA was diluted four-fold in water and 2 μl of the diluted sample was sequenced in 96-well format using the Dye Terminator Cycle Sequencing Kit (Beckman Coulter, Inc.). The reactions were cleaned up by an ethanol precipitation step in which 35 μl of 10 mg/ml glycogen, 70 μl of 0.5 M EDTA and 70 μl of 1.5 M sodium acetate (pH 4.8) were first mixed with the sequencing reaction (6.6 μl). 20 μl of 95% ethanol was added and the plate was centrifuged at 4,000 rpm for 45 minutes. The pellets were washed twice with 100 μl of 70% ethanol, air dried and dissolved in 25 μl of molecular biology grade formamide. A drop of mineral oil was overlaid on each reaction and these were then run on a CEQ2000 capillary array sequencer (Beckman Coulter, Inc.).
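As a rough check on the amount of template entering each sequencing reaction, the typical miniprep yield (5–10 μg in 50 μl), the four-fold dilution and the 2 μl sample volume can be combined as follows. The helper `template_ng` is hypothetical, and the calculation assumes that the stated yield refers to the whole 50 μl eluate.

```python
def template_ng(yield_ug, elution_ul=50.0, dilution_factor=4.0, volume_ul=2.0):
    """Nanograms of plasmid in the volume taken for one sequencing reaction."""
    stock_ng_per_ul = yield_ug * 1000.0 / elution_ul   # ug -> ng, per ul of eluate
    diluted_ng_per_ul = stock_ng_per_ul / dilution_factor
    return diluted_ng_per_ul * volume_ul

low = template_ng(5.0)    # lower end of the typical yield range
high = template_ng(10.0)  # upper end of the typical yield range
print(f"{low:.0f}-{high:.0f} ng of plasmid per reaction")
```

Under these assumptions each reaction would receive roughly 50 to 100 ng of plasmid DNA.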
Raw sequence data were first imported into the Contig Express component of the Vector NTI suite of sequence analysis programs (InforMax, Inc., Bethesda, MD). Each clone was named according to its position in the original 96-well plate; for example, clone 1030 came from position 30 in plate 1. Sequences were first scanned in batch mode for the presence of vector-derived sequences using Contig Express, and these were trimmed before proceeding. In some cases, manual processing of the sequences was necessary using the VecScreen program at the National Center for Biotechnology Information (NCBI) http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html. The presence of DpnII sites at either end of the clone simplified the removal of vector sequences. The sequences were concatenated in batches of 96 into a text file in FASTA format and submitted for a batch BLASTN search using the interface program BLASTCL3. This search returned the three top matches into an output file. No further analysis was performed in those cases where a match with a known gene was clearly established, and the clone was annotated as the hamster orthologue of a known gene. A series of additional comparisons was performed if a clear relationship between a hamster clone and a known sequence was not established by this initial search. The most useful additional comparison was a BLASTX search, which compares the translated sequence of the input clone in all six frames to the peptide sequences in GenBank. This search clarified the identity of several additional hamster clones. Two hundred clones did not match known sequences and were analyzed to determine whether they might represent potentially novel genes. First, the output sequences were translated in all frames to determine whether a complete open reading frame (ORF) was present. If so, the clones were resequenced on both strands to ensure that the sequence was correct.
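The clone naming scheme and the batching of sequences into FASTA files can be sketched as follows. This is a minimal illustration rather than the software actually used (Contig Express and BLASTCL3 handled these steps in the study). The function names are hypothetical, and the sketch assumes that the last three digits of a clone name encode the well position (1 to 96) and the leading digits the plate number, which is consistent with the example of clone 1030 but is not stated explicitly in the text.

```python
def clone_id(plate, position):
    """Build a clone name such as '1030' from plate 1, position 30."""
    if not 1 <= position <= 96:
        raise ValueError("position must be between 1 and 96")
    return f"{plate}{position:03d}"

def parse_clone_id(name):
    """Split a clone name back into (plate, position)."""
    return int(name[:-3]), int(name[-3:])

def fasta_batches(records, batch_size=96):
    """Yield FASTA text for consecutive batches of (name, sequence) pairs."""
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        yield "".join(f">{name}\n{seq}\n" for name, seq in chunk)

# Hypothetical trimmed reads: one full plate plus the first well of plate 2.
records = [(clone_id(1, p), "GATCACGT") for p in range(1, 97)]
records.append((clone_id(2, 1), "GATCTTGA"))
batches = list(fasta_batches(records))
print(len(batches), parse_clone_id("1030"))
```

Each string in `batches` could then be written to its own text file and submitted as one batch query.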
Promising sequences were compared to the public version of the annotated human sequence (v 9.30a.1) at Ensembl http://www.ensembl.org to find matches with human chromosomal sequences. Further comparisons were performed against the mouse (v 9.3a.1), rat (v 9.1.1), zebrafish (v 9.08.1), pufferfish (v 9.1.1) and mosquito (v 9.1a.1) genome sequences in the same database. When matching genomic loci were identified, approximately 100,000 bp of genome sequence surrounding the match was imported into Vector NTI and submitted to gene prediction software programs to determine if the hamster sequences were located within exons of predicted genes. Two gene prediction programs were used: FGENESH http://www.softberry.com/berry.phtml and GENEMARK http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi?org=H.sapiens. Predicted cDNA sequences were then assembled into contiguous files and subjected to further comparisons against cDNA and EST databases at NCBI. Genome structures were further examined using PipMaker, a program that supports comparisons of large DNA fragments and identifies short conserved regions of similarity, such as exons. The predicted protein sequences of the derived cDNAs were analyzed for the presence of functional domains using the conserved domain function of BLAST at NCBI as well as the Simple Modular Architecture Research Tool at http://smart.embl-heidelberg.de. Protein structures were annotated in Vector NTI and published in Canvas v7.0 (Deneba Systems, Inc., Miami, FL). Sequences were submitted to the appropriate databases at the National Center for Biotechnology Information (NCBI). Specifically, EST sequences were submitted to dbEST under accession numbers BI431001-BI431008 and CB884447-CB885166. Human KIF27 sequences were submitted to GenBank under accession numbers AY237536-AY237538.
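Selecting the genomic region to pass to the gene prediction programs amounts to choosing a window of about 100,000 bp around each match. The sketch below shows one way to compute such a window. It is an assumption-laden illustration: the text does not say how the window was positioned, so the function `flanking_window` simply centres it on the match and clips it at the chromosome ends, and the coordinates used are hypothetical.

```python
def flanking_window(match_start, match_end, chrom_length, window_bp=100_000):
    """Return 1-based inclusive (start, end) of a window centred on a match."""
    if not 1 <= match_start <= match_end <= chrom_length:
        raise ValueError("match coordinates must lie within the chromosome")
    centre = (match_start + match_end) // 2
    half = window_bp // 2
    start = max(1, centre - half)           # clip at the chromosome start
    end = min(chrom_length, centre + half)  # clip at the chromosome end
    return start, end

# Hypothetical match coordinates on a hypothetical 5 Mb sequence.
internal = flanking_window(1_000_000, 1_000_200, 5_000_000)
near_start = flanking_window(10_000, 10_200, 5_000_000)
print(internal, near_start)
```

The resulting coordinates could be used to export the corresponding region from a genome browser before running FGENESH or GENEMARK on it.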
Sequences defined by annotation of previously available sequences were submitted to the Third Party Annotation database under accession numbers BK001053-BK001057 and BK001326-BK001332. The Human Genome Organization (HUGO) nomenclature committee has approved proposed gene names.
Total RNA was purified from various hamster tissues using TriZol reagent (Life Technologies, Inc.). 20 μg of each RNA was electrophoresed through a 1% agarose-formaldehyde gel and transferred to a Nylon membrane (Micron Separations, Westborough, MA). The RNA was cross-linked to the membrane by exposure to UV light and hybridized with specific probes labeled with 32P-dCTP (Perkin Elmer Life Sciences, Boston, MA) using the Prime-It random prime labeling kit (Stratagene, La Jolla, CA). The hybridization procedure was performed as described previously. Several duplicate membranes were prepared and the radioactive probe was stripped after each round of hybridization by boiling for 10 minutes in 10 mM Tris-HCl (pH 7.5), 1 mM EDTA, 1% SDS. The specific probes were prepared by digesting the appropriate pBluescript plasmid with XbaI and EcoRI and separating the fragment on a 1.5% agarose gel, followed by extraction using the Qiaex II extraction kit (Qiagen).
PCR primers based on the predicted human KIF27 genomic sequence were designed for PCR cloning of overlapping regions of human KIF27 cDNAs. PCR reactions were performed on human testis Marathon RACE-Ready cDNA (Clontech, Palo Alto, CA) using conditions described previously. PCR products were subcloned into the pGEM-T Easy plasmid and sequenced. The sequences were then assembled into full-length cDNAs using the Vector NTI sequence analysis suite of programs. The primers used for the reaction shown in figure 2 were: 5' AACTAGATGTAGAAGTCGTTCATGGATTC 3' and 5' TTCCAGTAAGTTCAGGCGAGTTG 3'.
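Basic properties of the two primers listed above (length, GC content and reverse complement) can be checked with a short script. This is a convenience sketch only: the helper functions are hypothetical, the labels "first" and "second" simply follow the order in which the primers are listed, and no claim is made about which primer is the forward or reverse one.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Primer sequences as given in the text, written 5' to 3'.
PRIMERS = {
    "first": "AACTAGATGTAGAAGTCGTTCATGGATTC",
    "second": "TTCCAGTAAGTTCAGGCGAGTTG",
}

for name, primer in PRIMERS.items():
    print(name, len(primer), f"{gc_fraction(primer):.2f}",
          reverse_complement(primer))
```

The reverse complement is useful when locating the second primer's binding site on the top strand of the predicted KIF27 cDNA.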
We thank Natalya Klueva of the Texas Tech University Center for Biotechnology and Genomics for assistance with sequencing and sequence analysis. We thank Dan Hardy for providing human testicular cDNA for PCR cloning of human KIF27. We also thank Curt Pfarr and Demet Nalbant for critical reading of the manuscript. This project was supported by National Institutes of Health, NIEHS grant number ES 10232 to S.A.K.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.