Novel approaches to the estimation of homologous genes in non-sequenced species
Our in vitro homology search array estimated more than five thousand homologous genes between squid and human, which represents 68% of the human genes tested in this study. This indicates that a large-scale comparative genomic homology search is feasible for non-sequenced species. We obtained this result from a single experiment, and we expect the number of homologous genes identified to increase with further experiments. As only 85 nucleotide sequences and 9,079 ESTs for pygmy squid had been submitted to GenBank as of May 2010, our approach is a very efficient technique for identifying homologous genes. Our method will be particularly useful for sibling species of genome-sequenced species.
Several cross-species microarrays have been designed for various purposes, and a number of papers have provided thoughtful insights into the design and analysis of microarray experiments [11–13]. Other approaches have focused on the genomic and transcriptomic diversity within mammals or primates [14, 15]. Specific targets, such as nitrogen availability and toxicogenomics, could be pursued through the use of cross-species arrays [16, 17]. Heterologous or duplicated regions are hard targets to detect with microarrays because of cross hybridization; however, careful assessment and analysis of such data have been reported recently [18, 19]. Unlike these previous studies, our approach has a specialized target: the identification of homologous genes between distant species.
Large-scale genomics based on next-generation sequencers is currently in vogue [20], but the advantage of our method lies in its low experimental cost. Microarray-based experiments are 10 to 20 times cheaper than current next-generation sequencing methods, and this reason alone is sufficient to encourage the application of this method to millions of species to promote the study of biodiversity. In silico homology search methods, including standard homology searches such as BLAST, are by contrast powerful tools for detecting both close and distant homologies, but they require query sequences, which are usually only partially available for non-sequenced species.
Limitations and solutions
There are some difficulties in detecting homologous sequences by experimental methods. First, compared with amino acid sequences, DNA sequences accumulate mutations easily over long periods. In silico, we often use nucleotide-nucleotide homology searches for closely related species, and translated searches of amino acid databases for distantly related species. As it is not possible to perform translated searches in experiments, we can only detect homologous regions if their nucleotide sequences are well conserved and protected from mutations by slow molecular clocks. We are now assessing the efficiency of hybridization by conducting artificial cross-hybridization microarray experiments (in preparation).
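The first limitation can be illustrated with a small sketch. The two 15-nucleotide sequences below are invented for illustration and are not taken from the squid or human data; they differ only at synonymous third-codon positions. The helper names `translate` and `identity` are likewise hypothetical. The sketch shows why a nucleotide-level comparison, which is all that hybridization can offer, sees less similarity than a translated comparison would.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented example sequences that differ only at synonymous sites.
seq_a = "ATGGCTCTGTCTGGT"
seq_b = "ATGGCACTTTCAGGA"

nucleotide_identity = identity(seq_a, seq_b)
protein_identity = identity(translate(seq_a), translate(seq_b))
print(f"nucleotide identity: {nucleotide_identity:.2f}")
print(f"protein identity:    {protein_identity:.2f}")
```

In this toy case the encoded peptides are identical while the nucleotide sequences are not, which is the situation in which an in silico translated search would succeed but a hybridization signal could be weakened.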
Second, we need to focus on gene-coding regions for homology detection because intergenic regions are likely to be too highly mutated for experimental detection. We therefore used mRNAs to search for homologous genes between two different species. Even with mRNAs, problems still arise from the distribution of gene expression intensities.
Third, probe design in the microarray is also problematic. If probes with similar sequences are placed on the same microarray, a single expressed gene may be detected by two different probes through cross hybridization. To avoid this problem, we designed probe sequences that differ by at least 70%, in terms of edit distance, from the most similar probe. We are currently investigating the effect of cross hybridization by designing two different microarrays with a few artificial mutation sites (in preparation). The potential for cross hybridization should be considered carefully when designing microarrays.
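A minimal sketch of such a dissimilarity filter is given below. It is not the procedure used in this study: the normalization (edit distance divided by the length of the longer probe), the greedy first-come selection, and the function names are all assumptions made for illustration, and the example probes are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance as a fraction of the longer sequence (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def select_dissimilar_probes(candidates, min_fraction=0.7):
    """Greedily keep probes that differ enough from every probe kept so far."""
    kept = []
    for probe in candidates:
        if all(normalized_distance(probe, other) >= min_fraction for other in kept):
            kept.append(probe)
    return kept


# Invented toy probes: the second is nearly identical to the first.
probes = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
print(select_dissimilar_probes(probes))
```

Because the selection is greedy, the result depends on the order of the candidates; a production design would likely rank candidates first, for example by predicted hybridization quality.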
Fourth, duplicated genes and gene families pose a problem. These genes can have hundreds of homologous siblings, which hamper experimental homology searches through cross hybridization. To avoid false-positive homolog estimates, we removed duplicated genes from the array. Because our method is aimed at detecting homologous genes between species, the search proceeds more efficiently if we focus on single-copy genes. Detection of homologs of multi-copy genes would require a different probe design strategy.
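The restriction to single-copy genes can be sketched as a simple filter, assuming that each candidate gene has already been assigned to a gene family by some prior clustering step. The mapping, the identifiers, and the function name below are hypothetical and are not the annotation used for the array.

```python
from collections import Counter


def single_copy_genes(gene_to_family):
    """Return the genes whose family contains exactly one member."""
    family_sizes = Counter(gene_to_family.values())
    return sorted(
        gene for gene, family in gene_to_family.items()
        if family_sizes[family] == 1
    )


# Hypothetical assignment of genes to families.
assignment = {
    "gene_a": "family_1",
    "gene_b": "family_2",
    "gene_c": "family_2",  # shares a family with gene_b, so both are dropped
    "gene_d": "family_3",
}
print(single_copy_genes(assignment))
```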
Possible applications to other species, including more closely related species
We have tested this method on humans and squid, which are distant from each other in terms of nucleotide sequence conservation. We could still identify 95 homologous genes, and we estimated more than 5,400 candidate genes in squid that could have homologs in humans. This indicates that if we apply this method with more probes from humans, we may identify a larger number of homologous genes in squid. This method may also be applied to other species groups, such as primates, rodents, or Diptera (flies), thus allowing larger-scale and cheaper genomic comparisons.