A genome-wide survey of Major Histocompatibility Complex (MHC) genes and their paralogues in zebrafish

Background The genomic organisation of the Major Histocompatibility Complex (MHC) varies greatly between different vertebrates. In mammals, the classical MHC consists of a large number of linked genes (e.g. more than 200 in humans) with predominantly immune function. In some birds, it consists of only a small number of linked MHC core genes (e.g. fewer than 20 in chickens) forming a minimal essential MHC and, in fish, the MHC consists of an as yet unknown number of genes including non-linked MHC core genes. Here we report a survey of MHC genes and their paralogues in the zebrafish genome. Results Using sequence similarity searches against the zebrafish draft genome assembly (Zv4, September 2004), 149 putative MHC gene loci and their paralogues have been identified. Of these, 41 map to chromosome 19 while the remaining loci are spread across essentially all chromosomes. Despite the fragmentation, a set of MHC core genes involved in peptide transport, loading and presentation is still found in a single linkage group. Conclusion The results extend the linkage information of MHC core genes on zebrafish chromosome 19 and show the distribution of the remaining MHC genes and their paralogues to be genome-wide. Although based on a draft genome assembly, this survey demonstrates an essentially fragmented MHC in zebrafish.


Background
The human Major Histocompatibility Complex (MHC) is a gene-dense region on chromosome 6p21.3, and comprises a group of genes that are functionally involved in the adaptive and innate immune systems. From centromere to telomere, it is divided into five regions: extended class II, classical class II, class III, classical class I and extended class I. The classical MHC contains 224 genes, many of which are pseudogenes. Because two in every five expressed genes have a potential immune function and a role in disease resistance [1], this region has become a focal locus in comparative genomics. Antibody and T cell mediated immune responses against invading pathogens are initiated through MHC class I and class II molecules [2]. These main components are missing not only from invertebrates but also from primitive jawless fish, such as hagfish and lamprey [3]. MHC class I and II molecules do, however, exist in all jawed vertebrates, including the cartilaginous fish. The gene loci in the class III region encode a variety of proteins with both immune and non-immune functions [4].
Genomic sequences encompassing the MHC are currently available for human, chimpanzee, macaque, rat, mouse, cat, pig, horse, quail, chicken, frog, teleost fish and shark [5]. While the genomic architecture of the mammalian MHC is conserved, the number of genes between species can differ greatly. In birds, for example, the minimal essential MHC in chicken consists of 19 genes [6] while gene duplications have expanded this region in quail [7] and sparrows [8]. The linkage of class I, II, and III region genes can be traced back to cartilaginous fish, which are the earliest jawed vertebrates known to have diverged from a common ancestor with humans [9]. MHC loci, however, do not always exist in a single tightly linked cluster as generally observed in mammals. A large scale inversion has separated the class IIB cluster from the MHC-linked class IIA cluster in cattle [10]. The class I and class II loci in zebrafish (Brachydanio rerio) are found on different chromosomes [11,12]. A similar organisation is present in trout, stickleback, common guppy and cichlid fish [13,14]. This observation demonstrates that the separation of the MHC class I and class II loci is characteristic of teleost fish, which represent half of all vertebrates.
Since the genes of the immune system were present in the common ancestor of tetrapods and teleosts, the differences in their genomic organisation may be the result of lineage-specific chromosomal events such as duplications, inversions, deletions and translocations.
Table 1: List of human MHC class I and extended class I (xI) region genes used against the zebrafish whole-genome assembly. Genes with the suffix "like" are either gene fragments and/or highly similar to their human counterparts. Genes not identified by sequence similarity searches are marked as not found (NF). The official gene nomenclature for zebrafish is shown in brackets.
MHC region   Human gene   Accession    Zebrafish chromosome [gene]
I            C6orf15      NM_014070    NF
I            CDSN         NM_001264    NF
I            PSORS1C1     NM_014068    NF
I            PSORS1C2     NM_014069    NF
I            C6orf18      NM_019052    NF
I            TCF19        NM_007109    1 [tcf19-like]
I            POU5F1       NM_002701    16 [pou5f1]
I            MICA         NM_000247    NF
I            HCP5         NM_006674    NF
I            MICB         NM_005931    NF
The genome of the zebrafish is currently being sequenced at the Wellcome Trust Sanger Institute. The availability of sequence data will provide insight into the evolution of the immune system in fish. To date, small regions containing major histocompatibility genes have been mapped by radiation hybrid mapping and sequencing of selected genomic clones [15,16]. With the ongoing whole-genome sequencing efforts and the compilation of physical maps, it is now possible to examine larger genomic regions and assess the degree of shared synteny between mammals and fish on a genome-wide scale. Here we have utilised a comprehensive whole-genome shotgun assembly data set to analyse the zebrafish loci that are related to the human MHC. The human genome contains at least three regions that are paralogous to the MHC [17]. These are thought to be the result of two rounds of duplications that occurred early during vertebrate evolution. In this study, we examine the genome-wide distribution of paralogous genes in zebrafish.
Figure 1: Map of major histocompatibility genes and their paralogues in zebrafish (not to scale). Only chromosomes and two unmapped contigs (NA17761 and NA17767) that harbour MHC-related genes are shown. Orthologues or paralogues of human MHC class I region genes are shown in red, of class II region genes in blue, and class III region genes in green. Genes with the suffix "L" for like are either gene fragments and/or highly similar to their human counterparts. Genes that have more than one copy in the genome are shown by a greater than (>) symbol. Similarities and differences between the whole-genome assembly and mapping information present in the ZFIN database are shown by plus (+) and minus (-) signs, respectively. The core MHC has been compiled from five genomic clones spanning this region [EMBL:AL672216, EMBL:AL672151, EMBL:AL672164, EMBL:AL672176].

Analysis of the MHC class I region in zebrafish
The MHC class I region in human contains gene clusters coding for HLA class I molecules, histones, solute carriers, vomeronasal receptors, olfactory receptors and zinc fingers [18]. These clusters have undergone a large-scale expansion and each has paralogues throughout the genome. Ten representative members of the extended class I region, which are expressed and have paralogues in the human genome, and more than 30 genes that reside in the classical class I region were chosen for this analysis (Table 1). Genes that were excluded from this study included pseudogenes, zinc fingers with the tripartite motif (TRIM), and RING finger proteins. Approximately half of the human genes were identified in the zebrafish whole-genome assembly, predominantly on chromosome 19. Because multigene families evolve by a birth-and-death process [19], orthologous genes are usually difficult to identify through sequence similarity searches.

Analysis of the MHC class II region in zebrafish
MHC class II molecules are heterodimers comprising A (alpha) and B (beta) chains that present peptides to CD4+ T cells via the endosomal pathway. One class IIA (mhc2daa) and six class IIB (mhc2dab, mhc2dbb, mhc2dcb, mhc2ddb, mhc2deb, mhc2dfb) genes have previously been identified by screening a zebrafish genomic BAC library [21]. Only the mhc2dab and mhc2daa genes are known to be expressed [22]. They are closely linked and were identified on chromosome 8 (Table 2). Analysis of approximately 1 Mb of contiguous DNA surrounding the functional class II region in zebrafish demonstrates the presence of 23 flanking genes mapping to various human chromosomal locations, including two or more genes mapping to human 6q (C6orf117, LOC557721, ppil4), 12q (slc15a4, KIAA1944), 20q (rnpc2, slc12a5), 22q (slc7a4, sf3a1) and Xp (pqbp1l, t541, jarid1c). The lack of genes mapping to the human MHC, in addition to the low gene density of this region, indicates that the functional zebrafish class II region is the result of a translocation event [12].
Mapping to chromosome 19, approximately 22 Mb telomeric of the class I region, is the class IIB gene mhc2dcb-rs (Q95HJ7), with a similar sequence on contig NA4006. Also within this segment of DNA there are two predicted class IIA chain-encoding gene fragments, consisting of only the alpha2 domain and cytoplasmic tail-encoding parts. It is therefore unlikely that this locus is functional. Assuming that these gene fragments are not due to errors in the whole-genome assembly, this is evidence for linkage of remnants of MHC-related class II genes with the core MHC region containing the class I peptide-presenting genes.
The previously identified mhc2dbb and mhc2dfb genes are predicted to be located in currently unmapped sequence, contig NA14232 and scaffold 2399. However, mhc2deb (found in clone U08874) was not identified in the current assembly; its absence may be attributed to gaps in the sequence data or to allelic or haplotypic differences. Several fragmented sequences resembling class IIA and/or B chain-encoding genes were also identified in contigs NA17244 and NA15003, and on chromosomes 5, 15 and 20. Only contig NA6696 harbours putative complete class IIA and class IIB genes. Another gene found in this 15 kb contig resembles mhc2ddb, consisting of exons 3 and 4 only.
The two TAP transporters in human, namely TAP1 and TAP2, are both located in the MHC class II region, and are closely linked to genes encoding the proteasome subunits and the class II molecules. In zebrafish, however, a TAP1-like sequence is found on chromosome 9 and three TAP2-like genes are found in the class I region [15] on chromosome 19. The latter, named abcb3, abcb3l1 and abcb3l2, form part of the zebrafish core MHC region. The trout tap1 gene is not linked to the major class IA region either [23]. Likewise, in Fugu, the tap1 gene is found on an isolated scaffold that is not linked to the main class I region [24].

Analysis of the MHC class III region in zebrafish
The human class III region is the most gene-dense region within the MHC and contains few or no pseudogenes. Unlike the class I and class II loci, which are evolutionarily and functionally related [25], the class III region genes are not. The class III region does, however, include immune-related genes such as those encoding complement components, tumour necrosis factors and heat shock proteins. The search for the MHC class III region in zebrafish was first initiated by the identification of the BF/C2 gene using degenerate PCR [26]. This was then followed by the identification of several, but not all, of the zebrafish homologues of the human MHC class III region genes [16].
In the zebrafish, the MHC class III genes are found distributed throughout the genome (Table 3). Several zebrafish genes (zgc:63773, tnf, bat2-like, ck2b, ddah1, skiv2l, C6orf31-like, ppt2, bat3-like) map to chromosome 19, which also encompasses the largest stretch of MHC-related genes. Two unmapped scaffolds also harbour several class III genes: neu1, zgc:64108, bat8 and zbtb12 are found in NA17767; and bf/c2, rdbp, hsp70 and skiv2l are in scaffold NA17761. These are syntenic to the human MHC class III region with the conservation of both gene order and content (Figure 1).
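Conservation of gene content and order between a scaffold and a reference region, as described above, can be checked programmatically. The following is a minimal sketch only, not the procedure used in this survey: the function name `shared_gene_order` and the placeholder gene labels (`g1`, `x1`, and so on) are hypothetical, and the sketch assumes each gene appears once in the reference list.

```python
def shared_gene_order(reference, query):
    """Return the genes shared with the reference and whether their order is conserved.

    Order counts as conserved if the shared genes appear in the same relative
    order as in the reference, in either orientation (a scaffold may be
    reported on the opposite strand).
    """
    position = {gene: i for i, gene in enumerate(reference)}
    shared = [gene for gene in query if gene in position]
    ranks = [position[gene] for gene in shared]
    pairs = list(zip(ranks, ranks[1:]))
    ascending = all(a < b for a, b in pairs)
    descending = all(a > b for a, b in pairs)
    return shared, ascending or descending


# Hypothetical gene orders, for illustration only (not data from this study).
reference_block = ["g1", "g2", "g3", "g4", "g5"]
inverted_scaffold = ["g4", "x1", "g3", "g1"]  # same order, opposite orientation
shuffled_scaffold = ["g2", "g5", "g1"]        # shared content, rearranged order

inverted_result = shared_gene_order(reference_block, inverted_scaffold)
shuffled_result = shared_gene_order(reference_block, shuffled_scaffold)
print(inverted_result)
print(shuffled_result)
```

Treating both orientations as conserved is a design choice of this sketch; a stricter test could also require the shared genes to be adjacent in the reference.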
Approximately half of the human MHC class III region genes were not identified in the zebrafish assembly. Among these were the Ly6 family members, which may therefore be mammalian-specific. Alternatively, being involved in the immune response, Ly6 genes may have evolved more rapidly than others and diverged sufficiently to escape recognition by sequence similarity searches. For eight human genes, ATP6V1G2, AIF1, CLIC1, CREBL1, DDAH2, AGPAT1, PBX2 and NOTCH4, the paralogues but not the orthologues of the genes in the human MHC class III region have been identified by sequence similarity in zebrafish. This observation extends to IER3, RING1 and B3GALT4 found in the class I and extended class II regions. The presence or absence of other genes may be attributed to lineage-specific evolution. For example, lamprey [27], zebrafish [26] and medaka [28] possess genes equally similar to both mammalian BF and C2, while a Xenopus clone has clearly been identified as encompassing BF. This BF/C2 ancestral gene has further duplicated in zebrafish [29], copies of which were identified on chromosome 21 and contig NA17761. Similarly, a recent survey of hsp70 genes in Xiphophorus maculatus (platyfish) [30] has revealed that a single HSP70 gene gave rise to four distinct groups of genes: mammalian testis-specific HST70, mammalian MHC-linked HSP70, mammalian HSP70B' and the fish HSP70. Human class III HSPA1A, HSPA1B and HSPA1L genes are intronless and encode identical or near-identical proteins. Intronless zebrafish hsp70 genes were identified on chromosome 8 and scaffold NA17761. Three additional copies of HSP70-like genes were identified on chromosome 3, although these contained one, two or three introns in the 3' end of their sequence. The identified zebrafish genes cluster phylogenetically within the fish subgroup, apart from the sequence on scaffold NA17761, which appears to be similar to the mammalian MHC-linked heat shock protein sequences (data not shown).

Construction of in-silico gene maps
The zebrafish MHC gene map (Figure 1) was constructed primarily using mapping information from the whole-genome assembly displayed in the Ensembl platform. The position of at least 49 genes mapped in this survey could be verified by comparison to experimental data stored in the ZFIN database, which also yielded information for genes residing in unmapped contigs. Mapping data from these two sources did not coincide for several genes, including bat3, clica, vars2, c3, pou5f1, rgl2, tubb5, ddx39, col11a1, pbx1a, and stk19-like. These discrepancies may be attributed to the high levels of polymorphism and to regions of misassembly, both arising because the source DNA used for the whole-genome assembly was collected from approximately 1000 embryos. Until the genome sequence is complete, it will not be possible to accurately predict the position and number of all human MHC orthologues in zebrafish. There are also a number of genes, notably CD1, MOG and HFE, that could not be identified in the draft assembly used here (these are listed as 'not found' in Tables 1, 2 and 3).
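The comparison of assembly positions with an independent mapping source can be expressed compactly. The sketch below is illustrative only and assumes each source has been reduced to a dictionary from gene symbol to chromosome or contig label; the function name `compare_locations` and the gene labels and locations are hypothetical, not values from this survey. It marks agreement with a plus sign and disagreement with a minus sign, mirroring the notation used in Figure 1.

```python
def compare_locations(assembly, independent_map):
    """Mark each gene present in both sources with '+' (same location) or '-' (different).

    Genes present in only one source are skipped, since no comparison is possible.
    """
    marks = {}
    for gene in sorted(assembly.keys() & independent_map.keys()):
        marks[gene] = "+" if assembly[gene] == independent_map[gene] else "-"
    return marks


# Hypothetical locations, for illustration only.
assembly_locations = {"gene_a": "19", "gene_b": "5", "gene_c": "NA17761"}
independent_locations = {"gene_a": "19", "gene_b": "8", "gene_d": "3"}

marks = compare_locations(assembly_locations, independent_locations)
print(marks)
```

Comparing at the level of chromosome labels is a simplifying assumption; a finer comparison would need positions expressed in a common coordinate system.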
There were a number of difficulties associated with assessing the degree of synteny between the zebrafish whole-genome assembly and the human MHC. Although BLAST is a heavily used analysis tool for identifying related sequences, it does not discriminate between members of large gene families. To maximise the identification of MHC-related loci in zebrafish, only the highest-scoring BLAST sequences were chosen for further analysis (many of which were unique reciprocal BLAST hits), and searches were conducted using blastp or tblastx. Orthology can be confirmed by obtaining mapping data of surrounding genes, with the assumption that groups of syntenic genes would remain in close proximity through evolution, and therefore be maintained in similar segments of DNA. The physical linkage between numerous MHC-related loci is apparent on zebrafish chromosome 19. Genes mapping to zebrafish chromosome 5 have also been observed to be syntenic with human chromosome 9 [31]. Gene blocks, in particular those present in contigs NA17761 and NA17767, map closely together in the human MHC class III region. This criterion is more difficult to apply when genes are dispersed, as seen in Figure 1. The Ensembl platform [32] used to map the MHC in zebrafish provides a combination of alignment data, genomic location and detailed transcript structures to compare functional domains of orthologous proteins, in conjunction with multi-species comparisons. In combination with further phylogenetic analyses when duplicates of one gene were identified, the distinction between orthologues and paralogues was ascertained. Nevertheless, it is very difficult to assign orthologous relationships within multigene families until all members of the family have been sequenced in both organisms. This might be the case in human, but the zebrafish genome is yet to be completed.
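The reciprocal-hit criterion mentioned above can be illustrated with a short sketch. It assumes that search results have already been parsed into (query, subject, score) tuples; the function names and all scores are hypothetical and do not come from this survey. A pair is reported only when each sequence is the top-scoring hit of the other.

```python
def best_hits(hits):
    """Return the top-scoring subject for each query.

    `hits` is an iterable of (query, subject, score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Made-up scores, for illustration only.
human_vs_zebrafish = [
    ("TCF19", "tcf19-like", 210.0),
    ("TCF19", "other_gene", 95.0),
    ("NOTCH4", "notch3", 480.0),
    ("NOTCH4", "notch1", 450.0),
]
zebrafish_vs_human = [
    ("tcf19-like", "TCF19", 205.0),
    ("notch3", "NOTCH3", 900.0),
    ("notch3", "NOTCH4", 470.0),
    ("notch1", "NOTCH1", 950.0),
]

pairs = reciprocal_best_hits(human_vs_zebrafish, zebrafish_vs_human)
print(pairs)
```

In this toy input the NOTCH4 query finds notch3 as its best hit, but notch3 in turn matches NOTCH3 best, so the pair is not reciprocal. That is the pattern expected when only a paralogue of the query is present in the target genome.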

Duplicated genes and MHC paralogous regions
In human, it was observed that four regions, the MHC on 6p21.3 as well as 9q33-q34, 1q21-q25/1p11-p32 and 19p13.1-p13.4, are paralogous regions [33] that share members of the same gene families. Further analysis has shown that paralogues of human MHC genes are also scattered over essentially all chromosomes [18]. Likewise, genome-wide duplications have recently been examined in zebrafish [31], confirming that an additional round of whole-genome duplication occurred early in the teleost lineage after it split from the tetrapod lineage. Evidence that paralogous genes exist in zebrafish became apparent when many duplicates associated with the MHC, and mainly the class III region, were identified (Figure 1).
Here we discuss five MHC-encoded gene families in more detail.
The Notch gene family members encode evolutionarily conserved transmembrane receptors that regulate cell fate determination. Four Notch paralogues (NOTCH1-4) have been identified in human, and only three of their orthologues are found in zebrafish (Table 3). The orthologue of human NOTCH4, found in the MHC, is absent in the fish whole-genome assembly, indicating that gene loss has followed the block duplication events in teleosts. This is also seen in Fugu [24] and may be due to it being the most divergent member of the Notch family [34]. Two duplicates of notch3 are found in scaffolds 1523 and 285. Until the assembly is complete, it is not clear whether there are two copies of this gene or if this is due to the status of the current assembly. Only one surrounding rdh8-like sequence is found in common between the two scaffolds. In addition, notch1 and notch2 have been mapped to chromosome 5 and contig NA15389.
Retinoid receptors are soluble nuclear proteins belonging to the steroid/thyroid hormone receptor superfamily of transcriptional regulators. The RXR subfamily consists of three polypeptide chains, namely alpha, beta and gamma, encoded by separate loci. The human RXRB gene is found within the MHC, while the RXRA and RXRG paralogues are located on chromosomes 9 and 1, respectively. Five rxr loci have been identified in the zebrafish assembly in contig NA16779 and chromosomes 5, 9, 17 and 21. Two semi-orthologues of human RXRB appear to have arisen from a fish-specific duplication, and duplicates of rxra are also present in the zebrafish genome. Mapping data from the current assembly and the ZFIN linkage maps are contradictory, which may highlight problems in both approaches to mapping genes. Interestingly, neither of the rxrb sequences maps to chromosome 19 in the Zv4 whole-genome assembly as originally thought [35].
The human PBX2 gene encodes a homeodomain-containing protein. Three paralogues are located at chromosomes 1q23, 9q33 and 19p12, named PBX1, PBX3 and PBX4, respectively. Four PBX-like sequences were identified in the zebrafish: pbx1a on scaffold NA14559 and chromosome 11; pbx3b on chromosome 18; and pbxy on scaffold NA11844. These are related to human PBX1, PBX3 and PBX4, respectively [36]. The orthologous sequence to human PBX2, which is present in the MHC, has not been identified in the current assembly.
The complement component C4 gene encodes a protein that plays a central role in the innate immune response. Structurally, C4, C5 and C3 belong to the α2 macroglobulin (A2M) protein family and are derived from a single common ancestor. In human, they are found on four different chromosomes: 6p21 (C4), 9q33 (C5), 19p13 (C3) and 12p13 (A2M), and a similar situation exists in all tetrapods studied to date. In zebrafish, the a2m and c4 genes are both found on chromosome 15 as previously shown [37]. Multiple copies of a2m have also been observed in zebrafish with a duplicate being present in NA17328. Two copies of the c3 gene were identified on chromosome 22 and NA14479.
Six members of the chloride intracellular channel (CLIC) gene family (CLIC1-CLIC6) have been described in humans. They are involved in chloride ion transport within various subcellular compartments [38]. Four clic genes were identified in zebrafish on chromosomes 14, 16, 17 and 18. On phylogenetic analysis, they cluster with mammalian CLIC2, CLIC3, CLIC4 and CLIC5 genes, respectively. The orthologue of the human MHC-embedded CLIC1 was not found.

Conclusion
Comparative genomics reveals that the organisation of the MHC in more distantly-related organisms varies from the human model [39]. Teleost fish are particularly unusual in their organisation in comparison to mammals, chicken, Xenopus and shark in that their class I and class II loci are found on different chromosomes [14]. The core class I region in zebrafish, medaka, Fugu and rainbow trout [13,40-42] comprises genes involved in class I peptide presentation and processing: the classical class I molecules, the immunoproteasome subunits, ATP-binding cassette transporters and tapasin. The whole-genome shotgun data for zebrafish have allowed a genome-wide analysis of major histocompatibility genes and their paralogues, and highlight that one or more copies of MHC-related genes exist in fish, as in humans. The results obtained thus far extend the linkage information regarding major histocompatibility genes on chromosome 19 in zebrafish and the classical mammalian MHC, and further support previous findings that the functional class II loci are found on a different chromosome. The distribution of the remaining MHC-related genes and their paralogues is genome-wide, confirming a fragmented MHC in fish.

Methods
The genome of the zebrafish is currently being sequenced at the Sanger Institute [43]. The analysis was carried out on the fourth Ensembl whole-genome assembly, Zv4, released in September 2004 [32]. This assembly integrates the whole-genome shotgun assembly with data from the physical map [44]. Protein sequences encoded in 117 human MHC genes were chosen for BLAST sequence similarity searches against the zebrafish Ensembl assembly. Previously identified MHC-related zebrafish cDNAs [45] were also used in the analysis. Gene annotations were verified using VEGA [46]. A number of potentially duplicated genes were identified. Their predicted amino acid sequences were then aligned with other vertebrate gene family members present in the UniProt database [47] using CLUSTAL [48], and the multiple sequence alignments were used for phylogenetic analysis by the neighbour joining method [49] via the PHYLO_WIN interface [50]. Official gene symbols assigned by the Zebrafish Nomenclature Committee (ZNC) and HUGO Gene Nomenclature Committee (HGNC) have been used in the annotation. Zebrafish and human gene names are given in lower and upper case, respectively.
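Neighbour joining operates on a matrix of pairwise distances derived from the multiple sequence alignment. As a minimal sketch of that first step only, the code below computes the uncorrected p-distance (the fraction of differing sites) between aligned sequences. This distance measure, the handling of gaps, the function names and the toy sequences are all assumptions made for illustration; the text does not specify which distance was used with PHYLO_WIN.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {
        (n1, n2): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy alignment, for illustration only (not real protein sequences).
toy_alignment = {
    "seq_a": "MKV-LTAGG",
    "seq_b": "MKVALTAGG",
    "seq_c": "MRV-LSAGA",
}
distances = distance_matrix(toy_alignment)
print(distances)
```

The resulting pairwise distances are the kind of input a neighbour joining implementation would then cluster into a tree.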
The zebrafish source DNA for the whole-genome assembly was acquired from approximately 1000 five-day-old embryos. This has resulted in possible misassemblies due to polymorphisms between sequences, different haplotypes, and difficulties with the assembly of duplicated genes and regions, in particular with many highly homologous MHC genes that have arisen by duplication. Where possible, finished clones [e.g. EMBL:AL672216, EMBL:AL672151, EMBL:AL672185, EMBL:AL672164, EMBL:AL672176] were used to study the short-range linkage of genes in preference to the Ensembl whole-genome assembly. Alternative mapping information, for comparison with and validation of the in-silico approach used during this study, was obtained from the Zebrafish Information Network (ZFIN) [45] and published scientific literature.