Uncovering incidences and patterns of gene duplication can increase our understanding of this important source of functional novelty [e.g. [43, 44]]. It is well documented that aCGH can be used to identify gene dosage, as seen in tumor studies for cancer diagnosis [reviewed by ], and in studies of within-species copy number variation [e.g. ]. There have also been multiple studies that have successfully used aCGH to identify duplications between species [12, 20, 39]. Although both hybridization biases resulting from copy number variation for within-species duplicate detection [45, 46] and hybridization biases resulting from sequence variation for between-species analysis of single genes [29, 40, 47] have been addressed, the complexities of duplicate detection under conditions of sequence divergence have not been well addressed. Among the between-species studies of copy number differences in primates, no technical or computational assessment of the influence of sequence divergence has been made. Instead, the result that more lineage-specific copy number increases were found relative to decreases has been taken to indicate that sequence divergence does not significantly contribute to copy number estimates . While full genome sequence does exist for primate species such that a computational validation of aCGH results could be conducted [e.g. ], we instead chose an empirical test. We used X-linked array features as a model for duplication and studied three Drosophilid species for which full genome sequence was available. The thousands of X-linked orthologs allowed us to address systematic biases of aCGH duplicate detection that could not have been addressed with the smaller number of known duplicates among Drosophilids [17, 48, 49]. These systematic biases are introduced by sequence divergence in the heterologous species and by other confounding genomic characteristics related to species divergence.
Within-species duplicate detection
Consistent with previous aCGH surveys of gene duplication, the 93% true positive rate for D. melanogaster X-linked genes demonstrates a strong ability of aCGH to detect copy number variation among individuals of a species. The fact that approximately half of the false negatives had BLAST hits to one or more autosomal sequences reflects an ability to quantify relative genomic content other than straightforward duplication. The significant correlation between the number of similar autosomal sequences and the hybridization ratio reflects the ability to estimate relative copy number. Such a quantitative relationship between copy number and hybridization ratio is integral to cancer diagnostics . Such within-species correlations have been validated repeatedly [26, 51]. For example, male and female samples mixed in different amounts were used to assess the ability to identify tumor cells in heterogeneous (tumor and normal) tissue samples . Our within-species results from D. melanogaster add evidence that this quantitative relationship persists even when the additional copies do not share perfect sequence identity. As such, the existence of large gene families can interfere with the ability to detect specific gene duplicates with aCGH.
Duplicate detection in heterologous species will decrease with sequence divergence
Because aCGH relies on sequence similarity for DNA hybridization, sequence dissimilarity of sample DNA to a microarray probe is expected to decrease hybridization of that sample to the array when competitively hybridized with DNA of greater sequence similarity. A roughly linear relationship between sequence divergence and hybridization ratio has been demonstrated repeatedly for single copy genes [29, 40, 47]. Variation in sequence divergence explains ~45-60% of the variation in hybridization ratio [40, 53], and our results demonstrate that this will affect our ability to detect gene duplicates in a heterologous species. Successful detection of X-linked genes decreased for heterologous species, and sequence similarity to the array feature had a strong impact on this success. At successively greater sequence divergence there was a lower true positive rate for X chromosome orthologs. When translated to non-model studies of gene duplication among evolutionarily interesting lineages, this predicts a discovery rate biased toward highly similar gene duplicates.
The biased discovery of highly similar gene duplicates means that many of those recovered by aCGH are likely to be the products of evolutionarily recent events having occurred between closely related species. Therefore, the current results indicate that fewer duplicates will be detected in more distantly related taxa in general, a conclusion that should impact experimental design and phylogenetic inference. Older gene duplicates will be recovered only if they are highly conserved. Such conserved duplicates are thought to be retained when there is a selective advantage for greater protein production of a particular gene product [for review see ] as suggested for cold adaptation genes in Antarctic ice fish . Similarly, a selective advantage for spatially or temporally divided expression can produce highly conserved protein coding regions (a type of subfunctionalization) due to mutations in the enhancer regions . Such changes in enhancer regions have been reported to occur in recent duplicates . In some cases, novel function may come about with only a small level of sequence divergence of the protein coding region. Such highly similar duplicates, which can result from a limited number of point mutations, will be retained when the closely related gene products confer a selective advantage, as suggested for the evolution of the olfactory receptor family [reviewed by ] and opsin genes [e.g. ]. Highly conserved duplicates may also be the product of gene conversion [yeast: , roundworm: ]. Such highly similar duplicates could be recovered with reasonable success by aCGH.
It is important to note that sequence divergence among duplicates is likely to be a complex process, not completely modeled here with the use of the X chromosome. Here we detect a duplication of 1N to 2N, and both copies of the gene in the heterologous species exhibit the same percent sequence identity to the array feature. Because competitive hybridization relies on ratios rather than absolute levels, the technique should work equally well for duplications of 2N to 4N, as would occur on autosomes. However, in natural between-species comparisons, the gene duplicates present will include those for which the two copies have diverged to varying degrees. From the data presented here, it is unclear what the success rate would be for a gene duplicate pair in which one copy was conserved and the other had diverged. This is exactly the case in the proposed processes described by "neofunctionalization" . The rapid evolution of one or both copies following gene duplication has been suggested to accompany adaptive evolution in several instances [e.g. [61, 62]]. While theoretical hypotheses regarding the adaptive significance of gene duplicate function or the selective forces that have maintained gene duplicates are tempting, it should be noted that aCGH will also recover gene duplicates that have acquired pseudogene status  or that have been fixed in a population due to non-adaptive processes. In all cases, the individual sequence characteristics (GC content, distribution of mismatches, presence of indels, etc.) will influence the hybridization kinetics [40, 47, 64] and therefore the duplicate discovery rate using aCGH.
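The ratio argument made in this paragraph (that a 1N to 2N duplication and a 2N to 4N duplication should give the same signal) can be illustrated with a minimal sketch. It assumes an idealized array in which signal is strictly proportional to copy number and both samples hybridize with equal efficiency, that is, no sequence divergence; the function name `expected_log2_ratio` is hypothetical and not part of the analysis reported here.

```python
from math import log2

def expected_log2_ratio(test_copies, reference_copies):
    """Idealized log2 hybridization ratio for a feature.

    Assumes signal is proportional to copy number and that test and
    reference DNA hybridize equally well (no sequence divergence).
    """
    return log2(test_copies / reference_copies)

# X-linked model used in this study: 2 copies versus 1 copy.
x_linked = expected_log2_ratio(2, 1)

# Autosomal duplication: 4 copies versus 2 copies.
autosomal = expected_log2_ratio(4, 2)
```

Under these assumptions both cases give the same expected log2 ratio, because only the two-fold difference matters. Any divergence between the copies, as discussed above, would pull the observed ratio away from this ideal.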
Additional factors affecting duplicate detection
Genomic factors other than sequence divergence can affect duplicate detection in heterologous species. The seven factors that we took into consideration account for a large portion of the false positives and false negatives of the D. simulans and D. yakuba analyses. If we omit these sets of genes from the calculations, our true positive rate for duplicate detection increases to 53% in D. simulans and 32% in D. yakuba, with the false positive rate reduced to 14% in D. simulans and 32% in D. yakuba. The remaining false negatives are due to sequence divergence, microarray technical error, or a variable that we did not quantify. However, for gene duplicate discovery in non-model organisms, such detailed sequence information is unlikely to be available and as such would not factor into the analysis. The remaining false positives detected in this study potentially represent actual duplications that were not identified by the BLAST queries due to improper sequence assembly. Algorithms for genome assembly cluster together similar sequences. This legitimately collapses alleles into a single physical location, but also potentially collapses duplicated loci, thus reducing the number of duplications identified by BLAST . However, by determining depth of coverage from raw sequence reads, such errors can be addressed and compared to the current array results [e.g. ].
Use of conserved genes for normalization
When detecting duplication levels in heterologous species, it is important to use a normalization method that accounts for hybridization bias [40, 41]. Multiple techniques have been proposed for the normalization of aCGH data in order to account for biases due to dramatic sequence divergence in a heterologous test species  and the large biases due to extreme copy number, or large segmental duplications associated with cancer [e.g. [45, 66]]. In this study, we find support for the use of a set of conserved genes for normalization, as proposed by van Hijum et al. . In the cross-species experiments, this normalization technique provided a substantial decrease in the false positive rate.
For non-model species lacking substantial genomic DNA sequence data, the set of conserved genes to be used for normalization can be selected according to high sequence conservation across more distantly related, sequenced organisms. Here we use a gene set of 1000 conserved Drosophila features for normalization (4.5% of the array). However, we also test a reduced set of only ~100 conserved genes (0.5% of the array features), which represents a gene set size that would be more easily obtainable for species lacking substantial sequence information. This reduced gene set still provides a significant reduction in false positives. Van Hijum et al.  noted "satisfactory" results using 1.2% of array features for normalization, or 20 features per block for print-tip normalization.
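The idea of normalizing against a conserved gene set can be sketched as follows. This is a simplified illustration only: it centers all log2 ratios on the median of the conserved features, which is an assumption made for clarity and not necessarily the procedure of van Hijum et al. or the exact pipeline used in this study. The function name, feature names, and ratio values are hypothetical.

```python
from statistics import median

def normalize_by_conserved(log2_ratios, conserved_ids):
    """Center log2 ratios on the median of a conserved feature set.

    log2_ratios: dict mapping feature name -> log2(test / reference).
    conserved_ids: features assumed to be single copy and highly
    conserved, so their true log2 ratio is taken to be zero.
    """
    conserved_values = [log2_ratios[f] for f in conserved_ids if f in log2_ratios]
    if not conserved_values:
        raise ValueError("no conserved features found on the array")
    offset = median(conserved_values)
    return {feature: value - offset for feature, value in log2_ratios.items()}

# Illustrative values only (not data from this study).
raw = {
    "conserved1": 0.30,
    "conserved2": 0.20,
    "conserved3": 0.25,
    "diverged1": -0.60,     # single copy but diverged from the array feature
    "candidate_dup": 1.25,  # possible duplicate in the test species
}
conserved = ["conserved1", "conserved2", "conserved3"]
normalized = normalize_by_conserved(raw, conserved)
```

Anchoring on conserved features, rather than on the median of all features, keeps the many diverged single-copy genes from shifting the baseline downward, a shift that would otherwise make conserved single-copy genes look like copy number gains.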