TMC and EVER genes belong to a larger novel family, the TMC gene family encoding transmembrane proteins

Background Mutations in the transmembrane cochlear expressed gene 1 (TMC1) cause deafness in humans and mice. Mutations in two homologous genes, EVER1 and EVER2, increase the susceptibility to infection with certain human papillomaviruses, resulting in a high risk of skin carcinoma. Here we report that TMC1, EVER1 and EVER2 (now TMC6 and TMC8) belong to a larger novel gene family, which is named TMC for transmembrane channel-like gene family. Results Using a combination of iterative database searches and reverse transcriptase-polymerase chain reaction (RT-PCR) experiments, we assembled contigs for cDNA encoding human, murine, puffer fish, and invertebrate TMC proteins. TMC proteins of individual species can be grouped into three subfamilies A, B, and C. Vertebrates have eight TMC genes. The majority of murine TMC transcripts are expressed in most organs; some transcripts, however, in particular the three subfamily A members, are rare and more restrictively expressed. Conclusion The eight vertebrate TMC genes are evolutionarily conserved and encode proteins that form three subfamilies. Invertebrate TMC proteins can also be categorized into these three subfamilies. All TMC genes encode transmembrane proteins with intracellular amino- and carboxyl-termini and at least eight membrane-spanning domains. We speculate that the TMC proteins constitute a novel group of ion channels, transporters, or modifiers of such.


Background
The complete sequencing of the human genome led to the conclusion that the number of human genes is approximately 30,000-40,000 [1,2]. This initial estimate proved to be incorrect: a comparison of the predicted genes identified in public and commercial sequencing projects with a third cluster of known genes revealed little overlap among the three groups [3]. Although many of the discrepancies between the different human genome assemblies are caused by fundamental disparities in the compilation of the sequence information [4], it is still evident that the human genome probably harbors significantly more than 40,000 genes and that these genes are not easily identified. This is particularly true for genes that are expressed transiently, at low levels, or in very few specific cell types. Such elusive genes are underrepresented in cDNA libraries and consequently also in expressed sequence tag (EST) databases. Computational annotation of such genes from genomic DNA becomes additionally challenging when the genes consist of many scattered small exons.
Two pairs of previously unannotated genes encoding transmembrane proteins have recently been identified based on their linkage with inherited disorders. The first pair of genes, transmembrane cochlear expressed genes 1 and 2 (TMC1 and TMC2) [5], was found in a search for mutations causing dominant and recessive deafness in human and mouse [5,6]. Mutations in the second pair of genes, EVER1 and EVER2, are linked to epidermodysplasia verruciformis, which is associated with an increased susceptibility to infection with some human papillomaviruses, causing a high risk of skin carcinoma [7]. We noticed that the proteins encoded by these genes are homologous, which raised our interest in testing whether the genome accommodates additional related genes.

The mammalian TMC gene and protein family
We used human TMC1 and TMC2, murine Tmc1 and Tmc2, and human EVER1 and EVER2 sequences in conjunction with database search algorithms to identify homologous sequences in genomic and expressed sequence tag (EST) databases. We iteratively assembled contiguous theoretical coding sequences for individual human and mouse proteins. Regions of individual murine cDNAs that could not be predicted unequivocally from database information were verified by RT-PCR and by sequencing. This strategy resulted in the identification of the murine orthologues of EVER1 and EVER2, and four hitherto uncharacterized human and mouse proteins (Figure 1, Table 1). All identified proteins have eight predicted membrane-spanning domains (TM1-TM8) and share the completely conserved amino acid triplet C (cysteine)-W (tryptophan)-E (glutamic acid), predicted to be located in the extracellular loop upstream of TM6 (Figure 2). The presence of the surrounding motif CWETXVGQEly(K/R)LtvXD, which we name the TMC signature sequence motif, is our defining criterion for this novel TMC protein family (Figures 1, 2, additional file 2).
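As an illustration of how the signature motif could be used as a screening criterion, the following minimal sketch converts the motif string into a regular expression and scans a protein sequence for it. It rests on assumptions that are not stated in the text: `X` and the lowercase (less strictly conserved) positions are treated as wildcards, and `(K/R)` as a two-residue choice. The function names and the toy sequence are hypothetical.

```python
import re

# Signature motif as written in the text.
MOTIF = "CWETXVGQEly(K/R)LtvXD"

def motif_to_regex(motif: str) -> str:
    """Translate the motif notation into a regular expression.

    Assumptions (not from the paper): uppercase letters are fixed residues,
    'X' and lowercase letters match any residue, '(A/B)' is a choice.
    """
    parts = []
    i = 0
    while i < len(motif):
        ch = motif[i]
        if ch == "(":
            end = motif.index(")", i)
            choices = motif[i + 1:end].split("/")
            parts.append("[" + "".join(choices) + "]")
            i = end + 1
            continue
        if ch == "X" or ch.islower():
            parts.append(".")   # wildcard position
        else:
            parts.append(ch)    # strictly conserved residue
        i += 1
    return "".join(parts)

TMC_SIGNATURE = re.compile(motif_to_regex(MOTIF))

def find_signature(protein: str) -> list:
    """Return 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in TMC_SIGNATURE.finditer(protein.upper())]

# Toy sequence invented for illustration; it is not a real TMC protein.
toy_protein = "MAAA" + "CWETAVGQELYKLTVAD" + "GGG"
hits = find_signature(toy_protein)
```

A stricter screen could require the lowercase residues as well; treating them as wildcards errs on the side of sensitivity.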
The eight mammalian TMC proteins can be grouped into three subfamilies A, B, and C, based on sequence homology and on similarities of the genomic structure of their respective genes (Figures 1, 3, and 4). The murine TMC protein subfamily A consists of three proteins, Tmc1, Tmc2, and the novel Tmc3. All TMC subfamily A proteins are between 757 and 1130 amino acids in length. Tmc3 bears a long carboxyl-terminal tail that is unlike that of any other TMC protein (Figure 1). The overall identity within the TMC subfamily A is 36-56%; the positions of >73% of the genes' introns within the conserved core region are conserved (Figure 3A, 3C, additional file 1).
The murine TMC subfamily B consists of Tmc5 and Tmc6 (mouse orthologue of EVER1), proteins of 810 and 757 amino acid residues that are 31% identical and share a >92% conservation of the corresponding genes' intron locations within the conserved core region (Figure 3A, 3C, additional file 1). A significant structural difference between subfamily A and subfamily B proteins is that the long presumptive extracellular loop between TM5 and TM6 of TMC subfamily A proteins is much shorter in subfamily B proteins, where it mainly consists of the TMC signature sequence motif (Figures 1, 2).
Finally, the three members of the murine TMC subfamily C, Tmc4, Tmc7, and Tmc8 (mouse orthologue of EVER2), are the shortest TMC proteins with 694, 726, and 722 amino acid residues. The overall identity within the murine TMC protein subfamily C is 29-33% with a common gene structure of >92% conserved intron locations within the conserved core region (Figure 3A, 3C, additional file 1).
Analysis of the human TMC gene and protein family yielded principally identical results, likely because of the high degree of conservation between human and murine TMC proteins [5] (Figure 4). The new TMC classification now designates TMC6 for EVER1, and TMC8 for EVER2 (Table 1).
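The identity ranges quoted for each subfamily can be thought of as the minimum and maximum over all pairwise comparisons of aligned sequences. The sketch below shows one common way to compute this. The text does not state how identity was calculated, so the convention used here (identical residues divided by the number of alignment columns without a gap in either sequence) is an assumption, and all names and sequences are hypothetical.

```python
from itertools import combinations

def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two rows taken from the same alignment.

    Assumption: columns with a gap ('-') in either row are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

def identity_range(alignment: dict) -> tuple:
    """Lowest and highest pairwise identity among the aligned sequences."""
    values = [
        percent_identity(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    ]
    return min(values), max(values)

# Toy alignment invented for illustration; these are not TMC sequences.
toy_alignment = {
    "protA": "MKT-LLVAC",
    "protB": "MRTALIVAC",
    "protC": "MKTALLV-C",
}
low, high = identity_range(toy_alignment)
```

Other conventions (for example, dividing by the length of the shorter sequence) give different percentages, so values computed this way are only comparable when the same convention is used throughout.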
The TMC genes map to six chromosomal locations in the human and mouse (Table 1). Two chromosomal locations in both species harbor two neighboring TMC genes, Tmc5 and Tmc7 on murine chromosome 7, and Tmc6 and Tmc8 on murine chromosome 11; the human orthologues are located on the syntenic regions of chromosomes 16 and 17, respectively [7]. An additional murine locus on chromosome 8 represents a partial gene fragment, likely a pseudogene of Tmc2.

Expression of murine TMC transcripts
To demonstrate that all predicted murine TMC genes are transcribed, we performed RT-PCR experiments. Primer pairs specific for each TMC transcript amplified products of predicted length and sequence (Figure 5).
Tmc1 and Tmc3 mRNAs were detectable in most neuronal organs, and we also found expression in some non-neuronal organs. Tmc2 transcripts were detectable only in testis; RT-PCR did not reveal Tmc2 expression in the cochlea. However, we were able to verify that Tmc2 is expressed in the cochlea by using an organ of Corti cDNA library as a template for PCR, which corroborates the results presented by Kurima et al. [5].
We observed that mRNAs encoding Tmc5, Tmc6, Tmc4 and Tmc7 are expressed in most murine organs tested. Tmc8 mRNA is detectable in thymus and lung; we also found expression of Tmc8 mRNA in spleen (not shown). We did not detect Tmc8 transcripts in any other organs investigated.
Our expression analysis results are corroborated by an analysis of applicable TMC ESTs obtained from GenBank (http://www.ncbi.nlm.nih.gov/dbEST/) (Figure 5).

Non-mammalian vertebrate and invertebrate TMC genes and proteins
The high degree of similarity between the corresponding human and murine TMCs suggests a conserved function of these proteins. This role of TMC proteins may also be conserved among other species. We therefore decided to investigate the TMC genes of other vertebrates and invertebrates.
We identified eight TMC loci in the Japanese pufferfish (Torafugu, Fugu rubripes, Fr) genomic database (http://fugu.hgmp.mrc.ac.uk/). Whereas the genome of Fugu rubripes is about one order of magnitude smaller than the human genome, the total number of genes is estimated to be approximately the same [8,9]. Because of the low homology of amino- and carboxyl-terminal sequences, we were not able to determine the complete coding sequences of the eight pufferfish TMC proteins unequivocally; nevertheless, we obtained sufficient sequence information of the central parts of the proteins bearing the transmembrane domains to classify the eight pufferfish TMCs into the three subfamilies. This subfamily assignment of individual TMC proteins was further substantiated by an analysis of the degree of conservation of intron positions within the Fugu rubripes TMC gene family (Figure 3B). The Fugu rubripes genome contains three TMC subfamily A genes, Tmc2-rs1 (Tmc2-related sequence 1), Tmc2-rs2 (Tmc2-related sequence 2), and Tmc3; three TMC subfamily B genes, Tmc5, Tmc6-rs1 (Tmc6-related sequence 1), and Tmc6-rs2 (Tmc6-related sequence 2); and two subfamily C genes, Tmc4 and Tmc7 (Table 1). The nomenclature of the pufferfish TMC genes and proteins is derived from the phylogenetic relation of the corresponding sequences with the mammalian TMCs (Figure 4). It is interesting that the pufferfish genome lacks orthologues of mammalian TMC1 and TMC8. Fugu rubripes Tmc5 and Tmc7 are clustered, equivalent to the clustering of their mammalian orthologues [7].
TMC genes also exist in invertebrates. In GenBank, we identified two mRNA sequences encoding Caenorhabditis elegans (Ce) TMC proteins. These mRNAs are transcribed from TmcAh1 (Tmc subfamily A homologue 1) and TmcAh2 (Tmc subfamily A homologue 2) (Table 1).
Whereas the C. elegans genome appears to lack TMC subfamily B and C genes, some insects have genes for all three TMC subfamilies. For example, the mosquito Anopheles gambiae has three TMC genes, TmcAh (Tmc subfamily A homologue), TmcBh (Tmc subfamily B homologue), and TmcCh (Tmc subfamily C homologue) (Figure 4, Table 1). We did not find deposited cDNA sequences for TmcAh and TmcCh, but TmcBh appears to be transcribed (ESTs BM645887, BM621478, BM605758, and BM636384). A search of the Drosophila melanogaster genome database revealed only a single TMC gene, TmcAh (Figure 4, Table 1).
We did not find evidence for TMC genes in genomes and cDNA databases of yeast and plants.

Discussion
In an effort to define a novel gene family, we set out to identify genes related to TMC1, TMC2, EVER1, and EVER2 [5-7]. We obtained the coding sequences of additional homologues in human and mouse, which form the TMC protein family (Figures 1, 2). We subdivided the TMC protein family into three subcategories A, B, and C. This subfamily classification is based on two major observations. First, the sequence homologies among the different TMC protein sequences of individual species cluster into three groups (Figures 1, 4, and additional file 1). Second, our analysis of the organization of vertebrate TMC genes shows that intron locations within the conserved core region are largely conserved among the members of each subfamily (Figure 3).
On the basis of comparative structural predictions of all mammalian TMCs, we propose a basic topology of the generic TMC protein with variable intracellular amino- and carboxyl-termini, a conserved core with eight membrane-spanning domains, and some variability in the length of several intra- and extracellular loops (Figure 2). This prediction further refines the previously published TMC protein features [5,7]. The structural predictions for the region flanked by TM6 and TM7 were somewhat ambiguous, despite the region's high conservation among all TMC proteins and its high proportion of apolar amino acids (Figure 1). We consider it unlikely that some members of the TMC family display an atypical topology; we therefore propose that the lipophilic intracellular loop between TM6 and TM7 bears some flexibility, which may enable this domain to integrate into the inner surface of the phospholipid bilayer or even to pass through the plasma membrane (Figure 2).
The high amino acid sequence conservation of 75-96% identity among the individual human and mouse TMC proteins implies that mutations that alter the coding sequence in the corresponding genes are subject to significant selective pressure, suggesting that TMC proteins have important cellular roles. Mutations in TMC1 are responsible for the autosomal dominant human hearing disorder DFNA36 and for the recessive deafness DFNB7/11 [5]; mutations in the corresponding murine gene Tmc1 also cause deafness in the mutant mouse strains Beethoven (Bth) and deafness (dn) [6]. Bth/Bth mice display altered potassium currents in early postnatal cochlear hair cells; in particular, the Ik,n and Ik,f currents appear to be depressed [10]. This observation led to the hypothesis that Tmc1 may participate directly or indirectly in regulating the permeability of potassium channels [10]. Because of the high level of conservation of the amino acid sequences among the eight mammalian TMC proteins (Figure 1, Additional file 1), we speculate that other TMC proteins may likewise be modifiers of ion channels or transporters.
Our assembly of TMC protein sequences allowed us to analyze their phylogenetic relationships (Figure 4). Because our analysis did not reveal any TMC genes in yeast and plants, we conclude that TMC proteins are a specific trait of animals. The C. elegans genome harbors two genes encoding presumptive TMC subfamily A-like proteins, both of which branch off close to the center of the phylogenetic tree, which may indicate similarity with the primordial TMC protein sequence. It is interesting that some of the intron locations of the two C. elegans genes are conserved when compared with vertebrate TMC genes, in particular with the TMC subfamily A (Additional file 3).
It is likely that the TMC family diversified into three subfamilies before the Protostomia and the Deuterostomia diverged, because the genome of the mosquito Anopheles gambiae contains one putative member of each TMC subfamily. It is interesting in this regard that the fruit fly genome has only a single TMC subfamily A-like gene. The vertebrate TMC gene family is diversified, with each subfamily represented by multiple members.

Conclusions
The recently identified genes encoding the cochlear transcripts for TMC1 and TMC2, and encoding the proteins EVER1 and EVER2, belong to a novel gene family that is conserved in animals. The TMC protein family has eight members in vertebrates and forms three subfamilies, A, B, and C. TMCs are proteins with a conserved core of eight membrane-spanning domains. Most murine TMC genes are widely expressed at relatively low transcription levels; some TMC genes, however, display more restricted expression patterns.

Database-aided and experimental assessment of TMC cDNA and protein sequences
We created and continuously refined contigs of novel TMC cDNAs by aligning fragmented sequence information obtained from public and commercial EST databases (GenBank/NCBI, http://www.ncbi.nlm.nih.gov; Celera Discovery System, http://www.celeradiscoverysystem.com). We started this procedure by using sequence fragments homologous to the known mammalian TMC proteins. Individual fragments were assembled into longer contigs using multiple alignment tools such as ClustalW [11] (http://www.es.embnet.org/Doc/phylodendron/clustal-form.html) or Multalin [12] (http://prodes.toulouse.inra.fr/multalin/multalin.html).
Our sequence determination was continuously refined with the results of repeated BlastN and tBlastN database searches with the updated contigs, and finally we extended our search by using genomic databases. The majority of Blast searches were conducted with the default parameter settings. With this strategy we were able to generate presumptive "framework" contigs for the TMC genes that are less abundantly represented in databases. Because of the inter-species conservation of protein-coding sequences, we were able to predict the presumptive DNA sequence encoding amino- and carboxyl-termini of various TMC proteins by comparing human and murine genomic TMC sequence.
We constantly refined individual TMC sequences by using genomic sequence information and comparing gene structures. In addition to iterative database searches, we also took into account the results of gene prediction algorithms using genomic DNA harboring TMC genes as inputs (GENSCAN [13], GenomeScan [14]; http://genes.mit.edu). Some invertebrate TMC genes were already annotated as hypothetical proteins. We used human and murine genome databases (GenBank/NCBI, http://www.ncbi.nlm.nih.gov) to determine the chromosomal locations of all mammalian TMC genes.
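The iterative character of these searches, in which every newly found sequence becomes a query for the next round until nothing new turns up, can be sketched as a simple fixed-point loop. This is only an illustration under stated assumptions: the `search` callable is a hypothetical stand-in for a BlastN or tBlastN query (no BLAST program is actually invoked), and the toy "database" is invented.

```python
from typing import Callable, Iterable, Set

def iterative_search(
    seeds: Iterable[str],
    search: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Repeat searches with every new hit until no new sequences appear.

    `search` is a hypothetical placeholder for a database query that
    returns identifiers of sequences homologous to the query.
    """
    found = set(seeds)
    pending = list(found)
    while pending:
        query = pending.pop()
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                pending.append(hit)  # new hits are searched in turn
    return found

# Toy homology relations invented for illustration only.
toy_database = {
    "seqA": ["seqB"],
    "seqB": ["seqA", "seqC"],
    "seqC": ["seqD"],
    "seqD": [],
    "seqE": ["seqF"],  # unrelated to the seed, never reached
}

family = iterative_search(["seqA"], lambda q: toy_database.get(q, []))
```

In practice each round would also involve manual inspection and contig re-assembly, which this loop does not model.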
In a second phase, we used polymerase chain reaction and 5' and 3' rapid amplification of cDNA ends (RACE) experiments to obtain experimental proof for our cDNA predictions and to determine murine TMC sequence that was only ambiguously predicted.
Secondary structure predictions were done with PSORTII [15] (http://psort.ims.u-tokyo.ac.jp). The phylogenetic distances between various TMC protein sequences were calculated with ClustalW [11] (http://www.es.embnet.org/Doc/phylodendron/clustal-form.html), and the resulting distance matrix (additional file 1 (A)) was visualized as a tree diagram with centered node position using the Phylodendron phylogenetic tree printer available at http://iubio.bio.indiana.edu/treeapp. Evolutionary distance between two gene structures was defined as the reciprocal of the ratio of conserved intron locations with respect to the coding sequences of the TMC genes. The distances were determined for each pair of murine TMC genes (additional file 1 (B)). We compared the relationship of the TMC gene structures by performing cluster analysis of distances using the kitsch algorithm of the PHYLIP Phylogeny Inference Package [16] (http://bioweb.pasteur.fr/seqanal/phylogeny/intro-uk.html).
The results of the cluster analysis were visualized as a tree diagram with centered node position using the Phylodendron phylogenetic tree printer.
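The gene-structure distance defined above can be made concrete with a short sketch. The text gives the distance as the reciprocal of the ratio of conserved intron locations but does not spell out the denominator; the version below assumes that intron positions are expressed in a shared coordinate system (positions in aligned coding sequences) and that the ratio is the number of shared positions divided by the number of distinct positions in either gene. Function names and intron coordinates are hypothetical.

```python
import math
from itertools import combinations

def conserved_intron_ratio(introns_a, introns_b) -> float:
    """Fraction of intron positions shared by two genes.

    Assumption: positions refer to the same aligned coding-sequence
    coordinates, and the ratio is shared / (distinct positions in either).
    """
    a, b = set(introns_a), set(introns_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def gene_structure_distance(introns_a, introns_b) -> float:
    """Reciprocal of the conserved-intron ratio; infinite if nothing is shared."""
    ratio = conserved_intron_ratio(introns_a, introns_b)
    return math.inf if ratio == 0 else 1.0 / ratio

def distance_matrix(genes: dict) -> dict:
    """Pairwise distances keyed by (gene_x, gene_y), names in sorted order."""
    return {
        (x, y): gene_structure_distance(genes[x], genes[y])
        for x, y in combinations(sorted(genes), 2)
    }

# Toy intron coordinates invented for illustration; not real TMC gene data.
toy_genes = {
    "geneA": [10, 25, 40, 55],
    "geneB": [10, 25, 40, 70],
    "geneC": [5, 90],
}
distances = distance_matrix(toy_genes)
```

With this definition, identical gene structures have a distance of 1 and the distance grows as fewer intron positions are shared. A matrix of this kind is the sort of input that a distance-based clustering program such as kitsch operates on, although the exact input format is not shown here.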