Gene duplication in an African cichlid adaptive radiation

Background Gene duplication is a source of evolutionary innovation and can contribute to the divergence of lineages; however, the relative importance of this process remains to be determined. The explosive divergence of the African cichlid adaptive radiations provides both a model for studying the general role of gene duplication in the divergence of lineages and also an exciting foray into the identification of genomic features that underlie the dramatic phenotypic and ecological diversification in this particular lineage. We present the first genome-wide study of gene duplication in African cichlid fishes, identifying gene duplicates in three species belonging to the Lake Malawi adaptive radiation (Metriaclima estherae, Protomelas similis, Rhamphochromis “chilingali”) and one closely related species from a non-radiated riverine lineage (Astatotilapia tweddlei). Results Using Astatotilapia burtoni as reference, microarray comparative genomic hybridization analysis of 5689 genes reveals 134 duplicated genes among the four cichlid species tested. Between 51 and 55 genes were identified as duplicated in each of the three species from the Lake Malawi radiation, representing a 38%–49% increase in number of duplicated genes relative to the non-radiated lineage (37 genes). Duplicated genes include several that are involved in immune response, ATP metabolism and detoxification. Conclusions These results contribute to our understanding of the abundance and type of gene duplicates present in cichlid fish lineages. The duplicated genes identified in this study provide candidates for the analysis of functional relevance with regard to phenotype and divergence. Comparative sequence analysis of gene duplicates can address the role of positive selection and adaptive evolution by gene duplication, while further study across the phylogenetic range of cichlid radiations (and more generally in other adaptive radiations) will determine whether the patterns of gene duplication seen in this study consistently accompany rapid radiation.


Background
Adaptive radiation, the evolution of genetic and ecological diversity leading to species proliferation in a lineage, is thought to be the result of divergent selection for resource specialization [1][2][3]. Differential selection in heterogeneous environments can result in adaptive radiation when there is a genetic basis for variability in organisms' success in exploiting alternative resources [1][2][3][4][5]. Examples of such radiations include the Cambrian explosion of metazoans [6], the diversification of Darwin's finches in the Galapagos [7], variations in amphipods and cottoid fishes in Lake Baikal [8], the Caribbean anoles [9], the Hawaiian Silverswords [10] and the explosive speciation of the cichlid fishes in the African Great Lakes [11].
The cichlid fishes are the product of an incredible series of adaptive radiations in response to the local physical, biological and social environment. While cichlids can be found on several continents [12], the most dramatic radiations are those of the haplochromine cichlids in the great lakes of East Africa. This speciose clade exhibits unprecedented diversity in morphological and behavioral characteristics [13] and accounts for~10% of the world's teleost fish. Interestingly, this clade also includes lineages that have remained in a riverine environment and have not radiated [14].
Classic work by Ohno [15] proposed a prominent role for gene duplication events in evolutionary expansion, despite their frequent loss due to drift [16]. Duplication makes extra gene copies available for dosage effects, subfunctionalization, or neofunctionaliztion [17], with the resultant phenotype potentially contributing to an organism's fitness (for review see [18]). Current genomic research (e.g. primates: [19,20]) supports this, but the ability to compare closely related cichlid lineages that have and have not undergone an evolutionary radiation provides a critical tool for testing the association of gene duplication with adaptive radiation.
We used array-based comparative genomic hybridization (aCGH) to identify gene duplications among 5689 genes for three Lake Malawi radiation species, which began accumulating molecular diversity approximately 5 million years ago [21] (Metriaclima estherae, Protomelas similis, Rhamphochromis "chilingali") and one closely related riverine species from a non-radiated lineage (Astatotilapia tweddlei). While previous mitochondrial data suggested a bifurcation that separated the Lake Malawi radiation from the riverine species (Figure 1), more recent data based on ALFP data and single nucleotide polymorphisms derived from low coverage whole genome sequence [22][23][24] suggest that the Malawi flock is not monophyletic and that some of the riverine lineages may have contributed to Malawi genomes. These insights further support the use of A. burtoni as a reference to the three approximately equidistant test species. This is the first genome-wide study of gene duplication among haplochromine cichlids.

aCGH identification of duplicated genes
Microarray features, representing a total of 5689 genes, passed quality control measures in all four test species. Among these, 145 array features (representing 134 genes) were determined to have an increased genomic content (i.e. copy number) for one or more heterologous species relative to A. burtoni (P < 0.1 FDR corrected) ( Tables 1, 2). This included duplications of 54 genes in M. estherae, 51 in P. similis, and 55 in R. "chilingali", compared to only 37 in A. tweddlei, the species from the non-radiated lineage ( Figure 2). The number of duplicated genes identified for the species from the radiated lineage represents a 38%-49% increase relative to the number of duplicated genes identified in A. tweddlei. Consistent with their shared evolutionary history, shared duplications were prevalent among the three Lake Malawi species, with 11 duplications shared among all three and 16 duplications shared between two of the three species ( Figure 2). Five Figure 1 Maximum likelihood phylogeny illustrating the positions of experimental (stars) and reference (circle) taxa. The maximum likelihood tree is based on 1785 bp mitochondrial ND2. Nodes not supported by 50% maximum likelihood SH values collapsed and Lake Victoria, Lake Makgadikgadi, and Lake Tanganyika radiations are represented by triangles. The tree is rooted with Oreochromis and the scale bar indicates the mean number of nucleotide substitutions per site (DRYAD doi:10.5061/dryad.7vs2c). genes had greater gene copy number in all four species relative to A. burtoni. Genes found duplicated in only one of the four species were also identified. This included 27 genes in M. estherae, 20 in P. similis, 24 in R. "chilingali" and 27 in A. tweddlei. In twenty cases, the gene identified as duplicated was represented on the array by multiple features. Five of these instances showed complete concordance among the two or three array features representing that gene such that all showed the same significant pattern across species. However, for some genes found to be duplicated, only one of the two (n = 8), one of the three (n = 4), two of three (n = 2) or in one case four of the eight array features representing that gene reached statistical significance. In most cases, those features that did not reach statistical significance followed a similar pattern (Additional file 1 Figure S1). However this was not always the case which may be due to different parts of the gene sequence being represented by the different features, high variance or poor quality for one of the features, miss-annotation of the array, or other technical reasons.
BLAST comparison of array feature sequence similarity to the nucleotide database allows annotation and predicted function for discussion of possible adaptive processes. Based on these annotations, several candidate genes were identified as duplicated in and among lineages. Repeated similarities of functional annotations were noticed, particularly for genes involved in immune response, ATP metabolism and detoxification.

Quantitative PCR verification
Four loci found to be duplicated in one or more test species according to aCGH were chosen for quantitative PCR (qPCR) validation for their observed duplication patterns-one duplicated in all species relative to A. burtoni, two duplicated in all three Lake Malawi radiation species and one species-specific duplication ( Table 2). Primer pairs that were designed to A. burtoni sequence successfully amplified product with a similar or slightly reduced efficiency in each heterologous species tested ( Table 2). We estimated the copy number relative to A. burtoni for these loci based on the array hybridization ratio, and compared that to the copy number estimated from the qPCR results. Each duplication of a given locus as identified by the microarray analysis also showed significantly increased copy number of that locus according to the qPCR analysis ( Figure 3). Furthermore, the pattern of relative copy number among test species observed in the qPCR analysis, reflected, with few exceptions, the pattern of relative copy number observed in the microarray analysis. The only notable discrepancy was an increased genomic content for gene DY631898 detected for M. estherae that was not found by microarray analysis.

Discussion
Gene duplication is an important source of functional novelty and has a demonstrated role in adaptive evolution [18]. Such adaptations can allow for niche diversification, as has been suggested for thermal adaptation (plants: [25], Antarctic ice fish: [26]) and for metabolic   novelty (C-4 photosynthesis: [27]). The adaptive radiations of the African cichlid fishes exhibit remarkable niche exploitation in the presence of low levels of sequence divergence (reviewed by [13,21]). However, little is known regarding the relative number of duplicated genes, nor the identity of duplicated genes, within this group. If there is an increased rate of gene duplication or gene duplicate retention in radiated lineages, or if  particular duplications are associated with these lineages, then their pattern and identity could provide insight into the processes facilitating the rapid expansion of the African cichlids. The patterns reported and validated here indicate shared and increased gene duplication within the Lake Malawi radiation compared to a close non-radiating lineage. While three of the identified gene duplicates were annotated as mobile elements (retrotransposons or SINE element), the majority of the genes could be assigned functional annotation based on a manually curated homology search to UniProtKB/Swiss-Prot for those genes found to be duplicated. Based on individual gene names and functional annotations, several candidate genes, including those that are involved in immune response, ATP metabolism and detoxification, are identified as duplicated in and among lineages ( Table 1). Some of these gene duplicates may underlie adaptive phenotypic change.

Immune response
The evolution of immune response is a potent factor contributing to the divergence of lineages, resulting from strong selection on certain loci [28][29][30]. A greater number of genes associated with immune response (4-9) are found to be duplicated in the Lake Malawi lineage as compared to the riverine species (2). This list includes two finTRIM genes (one duplicated in P. similis and the other in both P. similis and R. "chilingali"), a gene family that is known to play a role in immunity against viral infection, and several finTRIM paralogs have been found  in teleost fishes, resulting from duplication and positive selection (70 in trout, 84 in zebrafish) [31]. There are also five major histocompatibility complex (MHC) genes-two MHC class I, two MHC class II, and kinesinlike protein 2-found duplicated in one or more of the species from the radiated Lake Malawi lineage. The MHC gene family, in addition to being involved in immunity (salmon: [32]), has a history of expansion and contraction through duplication and deletion [33]. MHC gene families vary in size among teleosts, with particularly large families in cichlids [34][35][36][37][38]. Additional immune related genes duplicated in the Lake Malawi radiation include an immunoglobulin light chain, small inducible cytokine (associated with the MHC region in stickleback: [39]), and sestrin 3. In A. tweddlei, the test species from the non-radiated lineage, two immune genes, kallikrein-8 and natural killer cell lecin-type receptor, are also found to be duplicated. The identification of several duplicated immune function genes is consistent with previous work documenting size variability and rapid expansion of immune function gene families (Drosophila: [28], silkworm: [40]) that may allow species to invade new niches or better adapt to existing ones.

ATP metabolism
ATP metabolism and function is critical to many physiological processes. Two ATP synthases and one ATP transporter are found duplicated among the four species. Subunits G and E of vacuolar ATPases, which couple the energy of ATP hydrolysis to proton transport across intracellular and plasma membranes, are duplicated in A. tweddlei and M. estherae, respectively. In R. "chilingali", the adenine nucleotide translocator (ANT) s598 is found duplicated. This mitochondrial transmembrane protein is the most abundant mitochondrial protein and is integral in the exchange of ADP and ATP between the mitochondria and the cytoplasm. Increased expression of mitochondrial ATP synthase has been found in cold acclimated carp [41] and ANT genes are being studied for their potential adaptive role in thermal acclimation (fugu: [42]). Given that these ATP synthase and transport genes are found duplicated in all 4 species of this study rather than showing enrichment only within the Lake species, they may represent an ancestral duplication, or deletion in A. burtoni, nonetheless, their retention may be associated with adaptation to ecological conditions.

Detoxification
Selection on duplicated detoxification genes (those involved in the breakdown of toxic compounds) can determine survival in particular environments or can contribute to expansion into new niches. One example is seen in plant-herbivore interactions, where gene duplication has been implicated in the ability of herbivores to detoxify plant defense compounds and prevent exclusion of the herbivore from that food source [43,44]. We detect duplication of detoxification genes in all three species from the radiated lineage. In P. similis and R. "chilingali", the sulfotransferase (SULT) gene cytosolic sulfotransferase 3 is found duplicated. SULT genes are detoxifying enzymes that catalyze the transfer sulfonate groups to endogenous compounds and xenobiotics. Once sulfated, compounds may become more easily excreted from the body. In zebrafish, ten SULT proteins have been cloned, two of which show strong activity towards environmental estrogens [45]. Zebrafish SULTs have also been found to act on other xenobiotics [46]. In Atlantic cod, a SULT gene was found to be upregulated in response to polluted water [47]. In R. "chilingali", two other genes involved in detoxification, arsenic methyltransferase and ferritin (heavy subunit), are found duplicated. Arsenic methyltransferase converts inorganic arsenic into less harmful methylated species, and ferritin is an iron storage protein that is essential for iron homeostasis, keeping iron concentrations at non-toxic levels. Another iron-related protein, the iron-sulfur cluster assembly enzyme, was also duplicated in R. "chilingali". It is possible that some of these gene duplicates have been retained due to a selective advantage for metabolic breakdown of environmental compounds and toxins. Such duplicates may allow novel physiological interactions with the chemical, physical and pathogenic environment that may play a role in adaptive divergence as a lineage radiates to inhabit new niches such as those associated with the African Great Lakes.

Gene family membership
Gene families by their very nature reveal a propensity for duplication and duplicate retention of certain genes. One study estimated that 38% of known human genes can be assigned to gene families, based on amino acid sequence similarity [48]. These gene families typically consist of two genes, but the largest gene families can have more than 100 members. In the present study, several of the genes found to be duplicated were members of large gene families, comprised of multiple known genes. These include 40 S and 60 S ribosomal proteins (duplicated in R. "chilingali" and M. estherae), claudin 29a (M. estherae), GTPase IMAP family member 7 (P. similis), C-type lectin domain family 4 (M. estherae), high-mobility group 20B (HMG20B) from HMG-box superfamily (A. tweddlei), and hox gene cluster genes (all species). Hox genes are important in the regulation of development, and have been found to be associated with differential jaw development in cichlid fishes [49,50]. An immunoglobulin light chain gene belonging to the largest gene family represented in this study was found duplicated in P. similis. Since large gene families are comprised of multiple paralogs and may possess a greater tendency for expansion, it is not surprising that large gene families are well represented in our list of duplicated regions.

qPCR verification
The robust validation of aCGH results using quantitative PCR not only verifies the increased genomic content for all four loci analyzed in test species relative to A. burtoni, it also provides a complementary approach that may prove to be a more efficient means to survey candidate loci in future population level analyses. For each locus except DY631898, the pattern of copy number among the four test species relative to A. burtoni is similar to that found by aCGH. However, the copy number estimated by qPCR differs from that estimated with array results. This is particularly true of the DY626766 and DY632057 loci, which showed greater qPCR copy number than predicted, despite the underestimation bias possible for those loci. Similarly, in M. estherae, the DY631898 locus appeared to be substantially higher in copy number than predicted by the array results. This discrepancy could result from three factors. First, it may be due to the fact that aCGH will produce an underestimate of true copy number when there is sequence divergence of the heterologous species relative to the platform provided the primers are in a conserved sequence region. Second, while qPCR and microarray analyses both provide relative rather than absolute measures, the scale of the relationship measurements may differ due to the difference in normalization techniques applied to the raw data. Finally, particularly for the case of the DY631898 locus in M. estherae, the micorray analysis includes only two replicates for each species and is thus sensitive to technical error where technical failure of qPCR is more easily replicated. Nonetheless, even for the two instances in which reduced primer efficiency in the tested heterologous species would have been expected to result in an underestimate rather than an overestimate of copy number, the pattern identified by aCGH was upheld. Regardless of discrepancies in magnitude, our quantitative PCR results demonstrate, with the exception of one data point, both qPCR,and aCGH are valid techniques for estimation of relative copy number in heterologous species. While aCGH allows one to survey a greater number of genes, the qPCR technique may provide an efficient means to assess copy number variation (CNV) of candidate loci within a larger population in order to illuminate the role of gene duplication on a microevolutionary scale.

Technical considerations
The use of aCGH was initially developed for cancer studies and has been applied to several within species studies, but has less frequently been used to assess between species patterns of gene duplication. Careful consideration of the technical biases and conservative interpretation of the results are warranted [51,52]. The array features analyzed represent only 5689 genes, a fraction (25-30% of a standard vertebrate genome) of predicted total gene content for these species. Furthermore, because genomic content for each gene has been assessed relative to the array platform species A. burtoni, any gene that has equivalent copy number (even if greater than 1) in both the platform and the heterologous species will go undetected. Similarly, those genes that appear to be duplicated in all heterologous species may actually represent a reduction in genomic content in A. burtoni due to gene deletion events. Furthermore, aCGH with spotted cDNA arrays does not allow quantification among different genes and it is therefore impossible to provide absolute copy numbers. We identify five such genes, two annotated as Hox gene cluster genes, one as a Ras-related C3 botulinum toxin substrate gene and two that lack annotation, that appear to be duplicated in all four test species, but which may in fact be deleted in A. burtoni. In our study we do not attempt to distinguish between these two scenarios. The hybridization bias due to sequence divergence of the heterologous species from the platform species is another an important consideration for the interpretation of aCGH results. Diverged sequences will hybridize less well to the array feature than A. burtoni DNA. Therefore, it follows that duplicated genes for which the paralog is highly diverged will be less likely to be detected as duplicated than duplicated genes with paralogs that are less diverged from the platform species, as found by Machado and Renn [52]. Therefore, older gene duplication events, those with very little purifying selection pressure, and those with strong positive selection in the gene region represented on the array are less likely to be identified, while recent duplication events or highly conserved duplicates are more likely to be identified. Therefore, the results presented here represent a subset of the total gene duplicates that may differ from the subset of gene duplicates identified by other techniques such as sequence assembly or depth of coverage. Gene number and gene copy number identified by short read sequencing technology is prone to overestimation of copy number variation [53]. Nonetheless, the numbers reported here are clearly an under-representation of the total and may present a different phylogenetic pattern of retention than other subsets of gene duplicates.
In this study, we use a recent adaptive radiation so that, whilst strong positive selection on duplicates might be overlooked by the aCGH technique, the majority of very recent duplications are likely to be identified. We find a pattern of increased gene duplication in these Lake Malawi haplochromines, with 38-49% more genes duplicated than in the non-radiated lineage. Care must be taken in interpreting this increase in the context of adaptive radiation, with four primary considerations. First, only a subset of genes (i.e. those present on the array with available sequence) was tested. Second, gene duplicates may have become fixed in ancestral populations due to neutral processes such as founder events, genetic bottlenecks or drift during the relatively recent evolutionary past. Sequence data from multiple species will be necessary to distinguish neutral vs. adaptive evolutionary processes. Third, due to the shared evolutionary history of the three Lake Malawi species, they cannot be considered independent. Fourth, the ecology of the species, lake versus riverine, is confounded with the tendency to radiate. Therefore, as tantalizing as these results are, our single comparison of radiated versus nonradiated lineages requires further support before general patterns associated with adaptive radiation can be rigorously discussed. Fortunately, the African cichlids provide such a system with which to undertake this [14].

Conclusions
Only recently have studies begun to examine the patterns of gene duplication and copy number polymorphism across species in natural systems, beyond primates (e.g. [26,[54][55][56]). While other studies have examined specific genes (e.g. [57][58][59]), we present the largest analysis thus far of genome wide patterns of gene duplication across lineages of the African cichlid radiations. We identify several candidate gene duplicates in four cichlid species and find a pattern of increased gene duplication within the Lake Malawi radiation. While our inference regarding the adaptive value of candidate gene duplicates must be tempered, the results of this study support the hypothesis that gene duplication, particularly of genes related to immune response, ATP metabolism and detoxification, is a characteristic of the Lake Malawi adaptive radiation. Assessment across a greater phylogenetic range of cichlid radiations will identify consistent patterns of gene duplication correlated with radiated and non-radiated lineages, and comparative sequence analysis will reveal the potential contribution of natural selection to gene duplicate evolution.
Microarray data (GEO series GSE19368) were filtered by omitting features with a lack of sequence information, known ribosomal content, or that had faint array signal (<2 SD above background). Only features that survived this quality control for all eight microarrays were analyzed. Data were corrected for background intensity ("minimum") and were loess normalized within array using 250 conserved features [60]. This corrects for bias introduced by sequence divergence under standard normalization [61]. Duplicated genes were identified as those with increased fluorescence according to the "lmFit" statistical model with "eBayes" correction and FDR adjustment for P < 0.1 significance level [62]. The reported results are underestimates of duplication levels, due to the fact that diverged duplicates are less likely to be detected [52]. GEL 50 measurements [63] indicated that experiments were of similar statistical power (M. estherae: 1.80, P. similis: 1.95, R. "chilingali": 1.61, A. tweddlei: 1.89). The automated annotations available from DFCI were not used in this study because many proved to be uninformative. Instead, functional annotations for genes were gathered only for identified duplicates using BLASTn to compare EST sequences to the UniProtKB/Swiss-Prot database. The top 100 hits were returned in order to identify informative annotations and infer function based on homology. Bit scores are reported for these annotations. No filtering or masking was applied during the BLASTn thus annotations for repetitive sequences and transponsons are included.

Quantitative PCR
Genomic content was validated for four genes using qPCR (  Table S2).