Research article Comparative genomics of vertebrate Fox cluster loci

Background
Vertebrate genomes contain numerous duplicate genes, many of which are organised into paralogous regions, indicating duplication of linked groups of genes. Comparison of genomic organisation in different lineages can often allow the evolutionary history of such regions to be traced. A classic example of this is the Hox genes, where the presence of a single continuous Hox cluster in amphioxus and four vertebrate clusters has allowed the genomic evolution of this region to be established. Fox transcription factors of the C, F, L1 and Q1 classes are also organised in clusters in both amphioxus and humans. However, in contrast to the Hox genes, only two clusters of paralogous Fox genes have so far been identified in the human genome, and the organisation in other vertebrates is unknown.
Results
To uncover the evolutionary history of the Fox clusters, we report on the comparative genomics of these loci. We demonstrate two further paralogous regions in the human genome, and identify orthologous regions in mammalian, chicken, frog and teleost genomes, timing the duplications to before the separation of the actinopterygian and sarcopterygian lineages. An additional Fox class, FoxS, was also found to reside in this duplicated genomic region.
Conclusion
Comparison of loci identifies the pattern of gene duplication, loss and cluster break-up through multiple lineages, and suggests FoxS1 is a likely remnant of Fox cluster duplication.

Lines connecting genes on different chromosomes indicate paralogy inferred by BLAST and subsequently confirmed by molecular phylogenetic analysis.

Gene Family introductions and analyses
The Fox gene family
In the human genome, FOXQ1 lies at one end of a 300kb FOX cluster on chromosome 6 that consists of FOXQ1, FOXF2 and FOXC1, separated by 59kb and 215kb respectively (see Figure 1). Only one predicted gene appears in the cluster between FOXF2 and FOXC1. e-PCR suggests that LOC642910 is a pseudogene with similarity to E74-like factor 2 isoform 2. The human chromosome 16 cluster spans 70kb, with three genes found between FOXF1 and FOXC2. The first, LOC401865, is a probable pseudogene with similarity to 60S ribosomal protein L7a. The second, FLJ12998, is a novel gene with a conserved 5-formyltetrahydrofolate cyclo-ligase family domain. Homologous genes are also located next to FoxF1 in mammals, Gallus gallus, Xenopus tropicalis and the teleost genomes. The third, FLJ30679, shows no homology to other genes. In this cluster FOXC2 is 54kb from FOXF1 and 10kb from FOXL1. Mazet et al. (2003) resolved the phylogeny of the Fox gene family to show a close relationship between the FoxL1 and FoxC subclasses and between the FoxF and FoxQ1 subclasses [1]. The human gene FREAC10 (now called FOXS1) resolved basally to the FoxC subclass in these analyses, but did not group with any other Fox genes.
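The intergenic spacings quoted for the chromosome 6 cluster are the kind of figure that can be derived directly from gene coordinates. The following is a minimal sketch of that arithmetic. The coordinates are hypothetical, chosen only so that the gaps reproduce the 59kb and 215kb spacings given in the text; they are not the real genomic positions, and the `Gene`, `intergenic_gaps` and `cluster_span` names are illustrative.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based start coordinate in bp
    end: int    # inclusive end coordinate in bp


def intergenic_gaps(genes: List[Gene]) -> List[Tuple[str, str, int]]:
    """Return (left gene, right gene, gap in bp) for each adjacent pair."""
    ordered = sorted(genes, key=lambda g: g.start)
    return [
        (left.name, right.name, right.start - left.end - 1)
        for left, right in zip(ordered, ordered[1:])
    ]


def cluster_span(genes: List[Gene]) -> int:
    """Return the distance in bp from the first gene start to the last gene end."""
    return max(g.end for g in genes) - min(g.start for g in genes) + 1


# Hypothetical coordinates (NOT real positions), built to give 59kb and 215kb gaps.
CHR6_CLUSTER = [
    Gene("FOXQ1", 1_000_001, 1_003_000),
    Gene("FOXF2", 1_062_001, 1_068_000),
    Gene("FOXC1", 1_283_001, 1_287_000),
]

if __name__ == "__main__":
    for left, right, gap in intergenic_gaps(CHR6_CLUSTER):
        print(f"{left} to {right}: {gap / 1000:.0f}kb")
    print(f"cluster span: {cluster_span(CHR6_CLUSTER) / 1000:.0f}kb")
```

The same functions could be applied to the chromosome 16 cluster once real coordinates for FOXF1, FOXC2 and FOXL1 are supplied.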
Maximum likelihood analysis of the Fox genes FoxF, FoxC, FoxL1, FoxQ1 and FoxS1 was carried out using the forkhead domains of these genes. The tree was rooted using a human FOXK1 sequence. All subfamilies include the amphioxus representative except for FoxS1 (where no gene has been identified).
The results essentially confirm previous studies [1], with the FoxL1, FoxS1 and FoxC subfamilies most closely related to each other, as are the FoxQ1 and FoxF subfamilies. A chicken gene previously predicted to be FoxC1 (automated computational analysis using GNOMON) branches as a FoxS1 gene. Vertebrate FoxF and FoxC duplicates resolve within the FoxF and FoxC subfamilies as expected; however, the high sequence identity between the duplicates in the forkhead domain reduces the resolution of these parts of the tree. To clarify this, separate trees were produced for each subset of genes using extended alignments. The Fox duplicates then resolved with high bootstrap values, supporting the paralogy and synteny of the genomic regions identified in this study.
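Bootstrap values such as those mentioned above come from resampling alignment columns with replacement and asking how often a grouping is recovered. The sketch below illustrates only that resampling idea. It is not the maximum likelihood procedure used in this study: as a stated simplification, "recovery" is approximated by two sequences being mutual nearest neighbours under Hamming distance, and the toy sequences and all function names are invented for illustration.

```python
import random
from typing import Dict


def resample_columns(alignment: Dict[str, str], rng: random.Random) -> Dict[str, str]:
    """Build one bootstrap replicate by drawing columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def hamming(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbour(alignment: Dict[str, str], name: str) -> str:
    """Return the closest other sequence; ties are broken alphabetically."""
    others = [n for n in alignment if n != name]
    return min(others, key=lambda n: (hamming(alignment[name], alignment[n]), n))


def pair_support(alignment: Dict[str, str], a: str, b: str,
                 replicates: int = 100, seed: int = 0) -> float:
    """Percentage of replicates in which a and b are mutual nearest neighbours."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replicates):
        rep = resample_columns(alignment, rng)
        if nearest_neighbour(rep, a) == b and nearest_neighbour(rep, b) == a:
            hits += 1
    return 100.0 * hits / replicates


# Invented toy alignment; these are not real forkhead domain sequences.
TOY_ALIGNMENT = {
    "seqA": "MKRPYSYIALIAMAI",
    "seqB": "MKRPYSYIALIAMAL",
    "seqC": "MQRPFSYNALITMAI",
    "seqD": "MQRPFSYNSLITMSI",
}

if __name__ == "__main__":
    print(pair_support(TOY_ALIGNMENT, "seqA", "seqB"))
```

With nearly identical sequences, few columns are informative, so support for any particular pairing depends on which columns happen to be drawn. This is the same effect that reduces resolution when only the highly conserved forkhead domain is used, and it is why extended alignments help.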
G. gallus FOXC1 was previously identified from a cDNA library [2] and is not represented in the FoxC tree due to doubtful 5' sequence. RACE used to extend the 5' end of the cDNA gave what is likely to be partial FOXC2 sequence. Trees produced using more 3' regions of this gene confirm its placement as a FOXC1. No FoxC2 genes have been found in the genomes of teleost fish. However, in this study foxc1 and foxc2 genes have been cloned from Amia calva, a sister group to the teleosts [3], and are included in the FoxC tree. This finding suggests that the foxc2 loss is teleost specific.
Primers used to clone foxc1 and foxc2 from A. calva genomic DNA.
Accession numbers for the A. calva sequences are AM402970 and AM402971.

The IRF gene family
The IRF (interferon regulatory factor) family is a group of transcription factors that share homology in the DNA-binding domain [4,5]. IRFs are mostly involved in the regulation of immune responses, in particular against viral infection. It has been suggested that the IRF genes arose early in vertebrate evolution from a prototypical protein with a myb-like DNA binding motif [6,7]. Interestingly, the IRF genes also encode a winged helix DNA binding domain with similarity to that encoded by the Fox genes; however, the significance of this is unclear.
Maximum likelihood analysis of IRF sequences recovered previously described relationships among the IRF genes [6], and extended these through expansion in the number and phylogenetic range of Irf sequences used. The tree is rooted with a Strongylocentrotus purpuratus (purple sea urchin) Irf sequence, though it is unclear if this gene is an outgroup to all vertebrate IRF genes. Nehyba (2002) split the IRF genes into four subfamilies: the IRF-1 subfamily contains the Irf1 and Irf2 genes; the IRF-5 subfamily contains the Irf5 and Irf6 genes; the IRF-3 subfamily contains the Irf3 and Irf7 genes; and the IRF-4 subfamily contains the Irf4, Irf8, Irf9 and Irf10 genes, as indicated on the tree.
Human IRF10 has diverged to such an extent that it cannot produce a full length protein [6] and no mouse Irf10 has been found (hence their omission from this analysis). However, definitive Irf10-like sequences were recovered from the genomes of other mammals (not shown) and from other vertebrate genomes. Genes from the IRF-4 subfamily map to the genomic regions under consideration here, and in our analysis form a distinct group supported by a value of 85, indicating they are more closely related to each other than to the other IRF subfamilies. This supports our interpretation of Irf4, Irf8, Irf9 and Irf10 as paralogous genes derived from block duplications.
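The four-subfamily classification described above can be written down as a simple lookup table, which is convenient when annotating tree tips or genomic maps. This is a minimal sketch: the subfamily memberships are taken from the text, while the names `IRF_SUBFAMILIES`, `GENE_TO_SUBFAMILY` and `subfamily_of` are illustrative.

```python
from typing import Dict, Optional, Tuple

# Subfamily membership as listed in the text (after Nehyba, 2002).
IRF_SUBFAMILIES: Dict[str, Tuple[str, ...]] = {
    "IRF-1": ("Irf1", "Irf2"),
    "IRF-3": ("Irf3", "Irf7"),
    "IRF-4": ("Irf4", "Irf8", "Irf9", "Irf10"),
    "IRF-5": ("Irf5", "Irf6"),
}

# Reverse index keyed on lower-case gene names, so that species-specific
# capitalisation (IRF9, Irf9, irf9) does not affect the lookup.
GENE_TO_SUBFAMILY: Dict[str, str] = {
    gene.lower(): subfamily
    for subfamily, genes in IRF_SUBFAMILIES.items()
    for gene in genes
}


def subfamily_of(gene: str) -> Optional[str]:
    """Return the IRF subfamily for a gene name, or None if it is not an IRF gene."""
    return GENE_TO_SUBFAMILY.get(gene.lower())


if __name__ == "__main__":
    for name in ("IRF9", "irf10", "Irf7", "FoxC1"):
        print(name, subfamily_of(name))
```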

The Dusp gene family
The Dusp (Dual specificity phosphatases) genes are a subclass of the protein tyrosine phosphatase (PTP) superfamily involved in the dephosphorylation of threonine and tyrosine residues in MAP kinase [8]. This is a large gene family, with over 20 representatives in the human genome. Maximum likelihood analysis of approximately 160 amino acids around the Dusp active site was used to produce the phylogenetic tree, which is rooted with a C. elegans Dusp sequence. Though some parts of the tree are supported by low bootstrap values, it does show the three genes under consideration here, Dusp22, Dusp15 and DuspF1, to be closely related to each other, with a support value of 97. Sequences from D. melanogaster and D. pseudoobscura group basally to these genes, suggesting that they are orthologous to the vertebrate genes. A subfamily tree drawn using these genes resolves the relationship with high bootstrap values.

The COX4 gene family
COX4 (Cytochrome c oxidase subunit IV) forms part of the electron transport chain responsible for aerobic energy metabolism. Two types have been identified, named isoform 1 (I1) and isoform 2 (I2). In humans COX4I1 is located on chromosome 16 and shares a promoter with NOC4 ([9]; see below), while COX4I2 is on chromosome 20. Expression analysis in rats has shown Cox4I1 to be expressed ubiquitously, while Cox4I2 shows high expression in adult lung with lower expression in all other tissues investigated [10]. No Noc related gene has been identified linked to Cox4I2. In the human genome a hypothetical protein (LOC646365) has been annotated overlapping with COX4I1; however, this is likely a mis-annotation, as some of the protein sequence encoded by this gene is that of the COX4I1 protein.
Cox4 orthologues were identified in C. elegans, A. gambiae, S. purpuratus and various vertebrates. Maximum likelihood analysis shows Cox4I1 and Cox4I2 to be paralogous genes, with the duplication event producing them occurring somewhere between the sea urchin and teleost lineages. No Cox4I2 was identified from the genomes of G. gallus or teleosts. The cox4 gene found on chromosome 23 in the D. rerio genome falls outside both established vertebrate groups. As it is situated 44.65Mb from irf10, its presence on chromosome 23 cannot be considered informative.

The NOC4 gene family
NOC4 (Neighbour of COX4) is a novel gene of unknown function that shares its promoter with COX4I1 on human chromosome 16 and whose expression appears to be ubiquitous [9]. Bachman et al. (1999) found no significant homologies to known proteins; however, searches of the human genome in our study revealed a second NOC homologue 19kb from IRF9 on chromosome 14, which we provisionally name NOC9 to signify its linkage to IRF9.
NOC orthologues were identified in C. elegans (used to root the tree), D. melanogaster, A. gambiae, S. purpuratus and various vertebrates. Maximum likelihood analysis of these genes shows Noc4 and Noc9 to be paralogous genes, with the duplication event producing them occurring somewhere between the sea urchin and teleost lineages. In D. rerio a third noc gene is found on chromosome 12 and is not linked to an irf9 gene. No NOC9 or IRF9 was found in the genome of the chicken.
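The judgement that a gene 44.65Mb away from irf10 is uninformative rests on physical separation along the chromosome. The sketch below shows one way such a check might be expressed. The 10Mb cut-off is an arbitrary assumption made purely for illustration (the text does not state a threshold), the example positions are hypothetical values chosen to be 44.65Mb apart, and the function names are invented.

```python
def separation_mb(pos_a_bp: int, pos_b_bp: int) -> float:
    """Return the absolute distance between two positions in megabases."""
    return abs(pos_a_bp - pos_b_bp) / 1_000_000


def is_informative_linkage(chrom_a: str, pos_a_bp: int,
                           chrom_b: str, pos_b_bp: int,
                           max_mb: float = 10.0) -> bool:
    """Treat two genes as informatively linked only if they share a
    chromosome and lie within max_mb of each other.

    max_mb defaults to 10.0, an arbitrary illustrative cut-off that is
    an assumption of this sketch and not a value taken from the study.
    """
    if chrom_a != chrom_b:
        return False
    return separation_mb(pos_a_bp, pos_b_bp) <= max_mb


if __name__ == "__main__":
    # Hypothetical positions on D. rerio chromosome 23, set 44.65Mb apart.
    irf10_pos = 1_000_000
    cox4_pos = 45_650_000
    print(separation_mb(irf10_pos, cox4_pos))
    print(is_informative_linkage("23", irf10_pos, "23", cox4_pos))
```

Any real cut-off would need to reflect chromosome size and the rate of rearrangement in the lineage concerned, so the default here should be read as a placeholder.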

The MLCK gene family
The MLCK (Myosin light chain kinase) gene family encodes enzymes involved in muscle contraction and myosin regulation [11][12][13][14][15]. Sequences were collected from D. melanogaster (known as twitchin), C. elegans (known as stretchin) and various vertebrates. Maximum likelihood analysis of these sequences reveals three vertebrate gene groups. Mlck2 has been named previously; here we call the others Mlckf1 and Mlckf2 to signify their linkage to FoxF1 and FoxF2 respectively. However, although Mlckf1 is linked to the FoxF1-FoxC2-FoxL1 cluster in mammals and G. gallus, it lies several Mb away. Mlckf1 was not identified in the genomes of X. tropicalis, F. rubripes or T. nigroviridis. The D. rerio gene has not been mapped. Mlck2 is linked to Irf10 in amniotes but has not been mapped in D. rerio or identified in the other teleosts. The relationship of these vertebrate Mlck genes supports the paralogy and synteny of the genomic regions identified in this study.

Other genes
Several other genes for which molecular phylogenetics were not undertaken are shown in Figure 1. Molecular phylogenetics were not undertaken either because of insufficient sequence information or, more typically, because there appears to be only one homolog associated with the paralogous gene families described above.
EXOC2. The exocyst complex component 2 protein is a component of the exocyst complex, which targets exocytic vesicles to docking sites on the plasma membrane (Entrez gene). A BLAST search with the overlapping LOC642335 protein hits EXOC2 in P. troglodytes.
HUS1B. HUS1 checkpoint homolog b (S. pombe). HUS1B is a paralog of human HUS1 with suggested roles in regulating cell cycle checkpoints and genomic integrity [16].
BCL. Members of the BCL-2 family act as anti- or pro-apoptotic regulators and form hetero- or homodimers involved in a wide variety of cellular activities (Entrez gene).
TPX2 is a microtubule-associated homologue with a role in spindle assembly [18].
PSF2. Partner of Sld five 2 is a component of the GINS multiprotein complex involved in the initiation of DNA replication [19][20][21]. A second PSF2-like gene is found on chromosome 20.
FBXO31 is F-box protein 31, a ubiquitin ligase specificity factor [22].
PSME is proteasome activator subunit 1 and 2. These subunits are regulators of the immunoproteasome, an altered 26S proteasome that processes class I MHC peptides (Entrez gene).
RNF31 is ring finger protein 31. The gene contains a ring finger motif known to be involved in protein-protein and protein-DNA interactions (Entrez gene).

Pseudogenes
LOC401863 is similar to ribosomal protein L10a; LOC401865 is similar to chloride intracellular channel 1. Pseudogenes are also found in each FOX cluster (detailed above).