A critical assessment of cross-species detection of gene duplicates using comparative genomic hybridization
© Machado and Renn. 2010
Received: 4 February 2010
Accepted: 13 May 2010
Published: 13 May 2010
Skip to main content
© Machado and Renn. 2010
Received: 4 February 2010
Accepted: 13 May 2010
Published: 13 May 2010
Comparison of genomic DNA among closely related strains or species is a powerful approach for identifying variation in evolutionary processes. One potent source of genomic variation is gene duplication, which is prevalent among individuals and species. Array comparative genomic hybridization (aCGH) has been successfully utilized to detect this variation among lineages. Here, beyond the demonstration that gene duplicates among species can be quantified with aCGH, we consider the effect of sequence divergence on the ability to detect gene duplicates.
Using the X chromosome genomic content difference between male D. melanogaster and female D. yakuba and D. simulans, we describe a decrease in the ability to accurately measure genomic content (copy number) for orthologs that are only 90% identical. We demonstrate that genome characteristics (e.g. chromatin environment and non-orthologous sequence similarity) can also affect the ability to accurately measure genomic content. We describe a normalization strategy and statistical criteria to be used for the identification of gene duplicates among any species group for which an array platform is available from a closely related species.
Array CGH can be used to effectively identify gene duplication and genome content; however, certain biases are present due to sequence divergence and other genome characteristics resulting from the divergence between lineages. Highly conserved gene duplicates will be more readily recovered by aCGH. Duplicates that have been retained for a selective advantage due to directional selection acting on many loci in one or both gene copies are likely to be under-represented. The results of this study should inform the interpretation of both previously published and future work that employs this powerful technique.
It is well established that gene duplication and the subsequent evolution of duplicates is an important source of functional novelty [for review see ]. For example, gene duplications are known to be involved in adaptive evolution in response to diet [2–4], chemical challenge [5, 6], and reproductive incompatibility [7, 8]. Such adaptations can allow diversification into new niches, as has been suggested for cold adaptation [plants: , Antarctic ice fish: ] novel metabolic processes [C-4 photosynthesis: ]. Appreciation for the pervasive nature of gene duplication has been reinforced by genomic studies that identify dramatic variation in gene copy number between individuals and between species [, human: , mouse: [14, 15], comparative mammals: , Drosophila : , Arabidopsis : ].
One genomic technique for identifying gene duplications among lineages is array-based comparative genomic hybridization (aCGH). This technique can identify duplicates that may be collapsed during shotgun sequence assembly . Furthermore, unlike next-generation DNA sequencing technologies, this technique does not rely on a full genome assembly as a reference [, e.g. ]. In addition to the assessment of copy number variation within a species [e.g. ] and between closely related lineages [Drosophila : , D. discoideum : , experimental evolution in yeast: ], array CGH has been applied to the identification of chromosomal aberrations underlying cancer [for review see ] and genotyping of individuals within and between populations according to single nucleotide polymorphisms [e.g. Arabidopsis : , stickleback fish: ]. For each of these applications, when gDNA isolated from one sample is competitively hybridized against gDNA isolated from another sample, genomic regions that have been deleted (or are highly diverged) in the genome of the first sample will fail to hybridize to the array features resulting in a log ratio less than zero. Conversely, genomic regions that have been duplicated in the first sample will hybridize at a ratio of 2:1 (or greater), resulting in a log ratio greater than zero. When using aCGH to compare genome content between two different species, only one of these gDNA samples is isolated from the species for which the microarray was constructed, and the other gDNA sample is isolated from a heterologous species of interest. Through repetition with multiple heterologous species, a phylogenetic study can be performed that can address genome content in a broad range of related organisms. The aCGH technique has been used to reveal genomic regions likely involved in an organism's ability to inhabit a specific environment [Chlamydia trachomatis tissue specificity: , Sinorhizobium meliloti root symbiont: , Clostridium difficile host specificity: ], pathogenicity [, Yersinia pesits : , Mycobacterium tuberculosis : , Vibrio cholerae : ], genomic duplications and deletions associated with population divergence and speciation [Anopheles gambiae : [36, 37]], and genomic regions that differentiate humans from other primate species [38, 39].
Array-based genomic comparisons can also identify orthologs with high sequence divergence because an increased number of basepair differences between the platform species and the heterologous species will cause a detectable reduction in hybridization strength for the heterologous species [40, 41]. We have shown a consistent linear relationship between hybridization ratio and sequence divergence. While 40% of the variation in hybridization ratio was accounted for by variation in sequence identity of the heterologous sample to the platform sample, other characteristics of the DNA sequence, such as GC content and alignment characteristics, also contributed to variation in hybridization ratio . The extent to which reduced hybridization due to sequence divergence compromises the ability to accurately identify gene duplications has not been rigorously addressed, nor have the resultant biases in types of gene duplicates been identified. Here we quantify this effect by using genomic content as a model for gene duplication by specifically focusing on X-linked genes, such that these genes are "duplicated" in female individuals (XX) relative to male individuals (XY). Using three Drosophilid species for which complete genome assemblies are available , we survey thousands of orthologs over a range of sequence divergence. We quantify the ability to accurately detect increased genomic content of D. simulans and D. yakuba relative to D. melanogaster males using CGH on a D. melanogaster microarray. We find a decreased ability to identify genomic content with increased sequence divergence, suggesting that array CGH will be biased toward the identification of recent duplicates or otherwise conserved duplicates. We further discuss other potential confounding factors that may affect duplicate detection.
Identification of genomic excess for array features that represent X chromosome genes.
True Positives (rate)
False Positives (rate)
False Negatives (rate)
1000 conserved genes
100 conserved genes
The false positive rate (the percent of array features found to be in genomic excess that map to autosomes) was greater in the heterologous species than in D. melanogaster (D. melanogaster : 16%, D. simulans : 18%, D. yakuba : 45%). If the entire complement of genes on the array had been used for normalization, rather than using a set of 1000 genes with sequence conservation, then there would have been an even greater increase in the false positive rate in D. simulans (39%) and D. yakuba (67%) relative to the platform species, D. melanogaster (12%). We also tested this conserved gene normalization strategy with a set of only 100 conserved genes. Normalizing with this reduced set of genes, there is still significant reduction in false positives, but with reduced true positive rate (D. simulans : 10% false positives, 29% true positive; D. yakuba : 21% false positives, 17% true positive).
Explanatory chromosomal factors for false negatives.
Melhit > hetf
Explanatory chromosomal factors for false positives.
HetHits > meld
Uncovering incidences and patterns of gene duplication can increase our understanding of this important source of functional novelty [e.g. [43, 44]]. It is well documented that aCGH can be used to identify gene dosage, as seen in tumor studies for cancer diagnosis [reviewed by ], and in studies of within-species copy number variation [e.g. ]. There have also been multiple studies that have successfully used aCGH to identify duplications between species [12, 20, 39]. Although both hybridization biases resulting from copy number variation for within-species duplicate detection [45, 46] and hybridization biases resulting from sequence variation for between-species analysis of single genes [29, 40, 47] have been addressed, the complexities of duplicate detection under conditions of sequence divergence have not been well addressed. Among the between-species studies of copy number differences in primates, no technical or computational assessment of the influence of sequence divergence has been made. Instead, the result that more lineage-specific copy number increases were found relative to decreases has been taken to indicate that sequence divergence does not significantly contribute to copy number estimates . While full genome sequence does exist for primate species such that a computational validation of aCGH results could be conducted [e.g. ], we instead chose an empirical test. We used X-linked array features as a model for duplication and studied three Drosophilid species for which full genome sequence was available. The thousands of X-linked orthologs allowed us to address systematic biases of aCGH duplicate detection that could not have been addressed by the lesser number of known duplicates among Drosophilids [17, 48, 49]. These systematic biases are introduced by sequence divergence in heterologous the species and by other confounding genomic characteristics related to species divergence.
Consistent with previous aCGH surveys of gene duplication, the 93% true positive rate for D. melanogaster X-linked genes demonstrates a strong ability of aCGH to detect copy number variation among individuals of a species. The fact that approximately half of the false negatives had BLAST hits to one or more autosomal sequences reflects an ability to quantify relative genomic content other than straightforward duplication. The significant correlation between the number of similar autosomal sequences and the hybridization ratio reflects the ability to estimate relative copy number. Such a quantitative relationship between copy number and hybridization ratio is integral to cancer diagnostics . Such within-species correlations have been validated repeatedly [26, 51]. For example, male and female samples mixed in different amounts were used to assess the ability to identify tumor cells in heterogeneous (tumor and normal) tissues samples . Our within-species results from D. melanogaster add evidence that this quantitative relationship persists even when the additional copies do not share perfect sequence identity. As such, the existence of large gene families can interfere with the ability to detect specific gene duplicates with aCGH.
Because aCGH relies on sequence similarity for DNA hybridization, sequence dissimilarity of sample DNA to a microarray probe is expected to decrease hybridization of that sample to the array when competitively hybridized with DNA of greater sequence similarity. A roughly linear relationship between sequence divergence and hybridization ratio has been demonstrated repeatedly for single copy genes [29, 40, 47]. Variation in sequence divergence explains ~ 45-60% of the variation in hybridization ratio [40, 53], and our results demonstrate that this will affect our ability to detect gene duplicates in a heterologous species. Successful detection of X-linked genes decreased for heterologous species, and sequence similarity to the array feature had a strong impact on this success. At successively greater sequence divergence there was a lower true positive rate for X chromosome orthologs. When translated to non-model studies of gene duplication among evolutionarily interesting lineages, this predicts of a discovery rate biased toward highly similar gene duplicates.
The biased discovery of highly similar gene duplicates means that many of those recovered by aCGH are likely to be the products of evolutionarily recent events having occurred between closely related species. Therefore, the current results indicate that fewer duplicates will be detected in more distantly related taxa in general, a conclusion that should impact experimental design and phylogenetic inference. Older gene duplicates will be recovered only if they are highly conserved. Such conserved duplicates are thought to be retained when there is a selective advantage for greater protein production of a particular gene product [for review see ] as suggested for cold adaptation genes in Antarctic ice fish . Similarly, a selective advantage for spatially or temporally divided expression can produce highly conserved protein coding regions (a type of subfunctionalization) due to mutations in the enhancer regions . Such changes in enhancer regions have been reported to occur in recent duplicates . In some cases, novel function may come about with only a small level of sequence divergence of the protein coding region. Such highly similar duplicates, which can result from a limited number of point mutations, will be retained when the closely related gene products confer a selective advantage, as suggested for the evolution of olfactory receptor family [reviewed by ] and opsin genes [e.g. ]. Highly conserved duplicates may also be the product of gene conversion [yeast: , roundworm: ]. Such highly similar duplicates could be recovered with reasonable success by aCGH.
It is important to note that sequence divergence among duplicates is likely to be a complex process, not completely modeled here with the use of the X chromosome. Here we detect a duplication of 1N to 2N and both copies of the gene in the heterologous species exhibit the same percent sequence identity to the array feature. Because competitive hybridization relies on ratios rather than absolute levels, the technique should work equally well for duplications of 2N to 4N, as would occur on autosomes. However, in a natural, between-species comparisons, the gene duplicates present will include those for which two copies are diverged to varying degrees. From the data presented here, it is unclear what the success rate would be for a gene duplicate pair in which one copy was conserved and the other had diverged. This is exactly the case in the proposed processes described by "neofunctionalization" . The rapid evolution of one or both copies following gene duplication has been suggested to accompany adaptive evolution in several instances [e.g. [61, 62]]. While theoretical hypotheses regarding the adaptive significance of gene duplicate function or the selective forces that have maintained gene duplicates are tempting, it should be noted that aCGH will also recover gene duplicates that have acquired psuedogene status  or that have been fixed in a population due to non-adaptive processes. In all cases, the individual sequence characteristics (GC content, distribution of mismatches, presence of indels, etc.) will influence the hybridization kinetics [40, 47, 64] and therefore the duplicate discovery rate using aCGH.
Genomic factors other than sequence divergence can affect duplicate detection in heterologous species. The seven factors that we took into consideration account for a large portion of the false positives and false negatives of the D. simulans and D. yakuba analyses. If we omit these sets genes from the calculations, our true positive rate for duplicate detection increases to 53% in D. simulans and 32% in D. yakuba, with the false positive rate reduced to 14% in D. simulans and 32% in D. yakuba. The remaining false negatives are due to sequence divergence, microarray technical error, or a variable that we did not quantify. However, for gene duplicate discovery in non-model organisms, such detailed sequence information is unlikely to be available and as such would not factor into the analysis. The remaining false positives detected in this study potentially represent actual duplications that were not identified by the BLAST queries due to improper sequence assembly. Algorithms for genome assembly cluster together similar sequences. This legitimately collapses alleles into a single physical location, but also potentially collapses duplicated loci, thus reducing duplications identified by BLAST . However, by determining depth of coverage from raw sequence reads such errors can be addressed and compared to the current array results [e.g. ].
When detecting duplication levels in heterologous species, it is important to use a normalization method that accounts for hybridization bias [40, 41]. Multiple techniques have been proposed for the normalization of aCGH data in order to account for biases due to dramatic sequence divergence in a heterologous test species  and the large biases due to extreme copy number, or large segmental duplications associated with cancer [e.g. [45, 66]]. In this study, we find support for the use of a set of conserved genes for normalization, such as proposed by van Hijum et al. . In the cross-species experiments, this normalization technique provided a substantial decrease in the false positive rate.
For non-model species lacking substantial genomic DNA sequence data, the set of conserved genes to be used for normalization can be selected according to high sequence conservation across more distantly related and sequenced organisms. Here we use a gene set of 1000 conserved Drosophila features for normalization (4.5% of the array). However, we also test a reduced set of only ~100 conserved genes (0.5% of the array features), which represents a gene set size that would be more easily obtainable for species lacking substantial sequence information. This reduced gene set still provides significant reduction in false positives. Van Hijum et al.  noted "satisfactory" results using 1.2% array features for normalization, or 20 features per block for print-tip normalization.
Array CGH can be used to effectively identify gene duplication and genome content; however, certain biases are present due to sequence divergence and other genome characteristics resulting from the divergence between lineages. Using the X chromosome as a proof-of-concept, we demonstrated high true positive rates for genome content in heterologous species. However, we do find a strong negative effect of gene sequence divergence on the ability to identify X-linked genes. X-linked orthologs with less than 90%-95% identity were much less likely to be detected. The false negative rate for these diverged genes should be taken into consideration when making phylogenetic inferences with aCGH because both false positive and false negative rates increase with phylogenetic distance. Furthermore, a biased set of duplicates will be recovered such that those with high sequence similarity will be over-represented. This means that aCGH will be more likely to recover gene duplicates that have been retained due to a selective advantage that relies on conserved gene function, such as increased gene product. Duplicates that have been retained for a selective advantage due to directional selection acting on many loci in one or both copies will be under-represented. Due to this differential representation of gene duplicate classes, one must be cautious when evaluating the relative contribution of different evolutionary processes to the interspecific diversity under study. The aCGH technique is strongly applicable to the growing number of non-model species groups for which a single microarray platform is available. Sequence information provided by EST analysis can be used to select the appropriate set of conserved genes for array normalization that will substantially reduce the false positive rate. Through sequence analysis and functional testing of aCGH-identified gene duplicates, researchers will be able to further address the role of gene duplication in adaptation, speciation, and population dynamics.
We used a Drosophila melanogaster microarray with ~22,000 features containing PCR products (~500 base pairs long) generated from custom primers designed to predict open reading frames  (GEO platform number GPL6056). The microarray was printed on poly-L-lysine slides (Thermo Scientific) in a 48 pin format using an OmniGrid-100 arrayer (GeneMachines). Following hydration, snap drying and UV cross-linking, the slides were blocked with succinic anhydride and sodium borate in 1-Methyl-2-Pyrrolidinone, rinsed, dried according to standard procedure  and stored dry until used.
Isogenic Drosophila melanogaster, D. simulans and D. yakuba (strain # 14021-0231.36; 14021-0251.195; 14021-0261.01) were obtained from the UCSD Drosophila stock center. Genomic DNA from D. melanogaster males and virgin females of each species was extracted according to a standard ProteinaseK/Phenol:Chloroform protocol and quantified (Nanodrop 1000). The DNA was diluted to 0.3 μg/μl and size reduced using the Hydroshear (Genome Solutions/Digilab) standard orifice to an average size of 1.5 - 2 Kb, verified by gel electrophoresis. Three micrograms of genomic DNA were fluorescently labeled with Alexa-Fluors conjugated dCTP by Klenow fragment polymerization (Invitrogen, Bio-Prime), the efficiency of which was quantified (Nanodrop 1000) such that competitive hybridizations were matched for DNA concentration. Male D. melanogaster samples were used in competitive hybridizations with six female D. melanogaster samples, six female D. simulans samples, and four female D. yakuba samples, in dye swaps to account for dye bias. Hybridizations proceeded for ~16 hours at 65°C in a 3.4× SSC, 0.15% SDS, 1 mM DTT hybridization buffer, or at 48°C in Ambion Hyb Buffer 1, blocked by Cot-1DNA (Invitrogen).
Hybridized arrays were scanned with an Axon 4000B scanner (Axon Instruments) using Genepix 5.0 software (Axon Instruments). Features of poor quality (fluorescence < 2 standard deviations above background) or known technical error (poor PCR, improper primer design, unexpected PCR length etc.) were flagged and excluded from the analysis. Features that survived this quality control on less than two arrays per species were excluded from the analysis. Raw data from Genepix was imported into R. LIMMA [Linear Models for Microarray Data, ] was used to apply a background correction ("minimum") and each of three (see below) within-array intensity normalization ("loess") procedures. Raw and normalized data are submitted to the Gene Expression Omnibus (GEO) repository under the series identifier GSE19584. A linear model was fitted to the data using "lmFit", and "eBayes" provided a correction of variation by borrowing information across genes. Significance was assessed after a FDR multiple testing correction at P < 0.1. In order to not confound statistical power and phylogenetic distance within our study, GEL50 measurements of statistical power  were held equivalent such that the D. yakuba analysis (GEL50 = 0.456) employed 4 hybridizations and the D. simulans analysis (GEL50 = 0.455) employed 6 hybridizations. This difference in required technical replication was likely due to array quality.
Since we expect the normalization of cross-species arrays to be affected by a substantial number of diverged genes in the non-platform species (and in this case, also a substantial number of duplicated loci), we compared our preferred normalization technique, using a set of ~1000 genes that are highly conserved among the three Drosophila species (determined with NCBI megaBLAST 1/2009), to a normalization based on a set of 100 conserved genes and to a traditional normalization using the full complement of array features. In each case, statistical analysis for hybridization bias was performed (see above).
Features representing X chromosome genes were identified by the top BLAST hit of the feature sequence to D. melanogaster genome sequence. Sequence divergence of X chromosome orthologs was assessed as the percent sequence identity (%ID) of the feature sequence to the top BLAST hit in D. simulans and D. yakuba genome sequence (E < 10-14, NCBI megaBLAST 1/2009).
Additional autosomal regions of high sequence similarity in each species were identified (E < 10-14) and a measure of the relative number of occurrences of a sequence in the two genomes of interest was determined as the ratio of the number of BLAST hits to the heterologous species to number of BLAST hits to the reference species (E < 10-10). Movement off of the X chromosome was predicted by the lack of any BLAST hit of the feature sequence to the heterologous X chromosome. Conversely movement onto the X chromosome was predicted by the presence of any BLAST hit of an autosomal feature sequence to the heterologous X chromosome. Furthermore, any hits to heterochromatin or to telomeric regions (25 Kb from the end of a chromosome) were also recorded. These factors represent between-genome chromosomal differences that can confound duplicate detection for cross-species aCGH analyses.
The authors thank Molly Schumer and Julia Carleton for helpful comments on earlier drafts of the manuscript. Microarray data contributing to this manuscript were produced by eleven Reed College students in the 2009 Biology 431 course on Comparative Functional Genomics. This work was supported by grant from the M.J. Murdock Charitable Trust.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.