Transcriptome screen for fast evolving genes by Inter-Specific Selective Hybridization (ISSH)
© Montoya-Burgos et al. 2010
Received: 24 September 2009
Accepted: 22 February 2010
Published: 22 February 2010
Fast evolving genes are targets of an increasing panel of biological studies, from cancer research to population genetics and species specific adaptations. Yet, their identification and isolation are still laborious, particularly for non-model organisms. We developed a method, named the Inter-Specific Selective Hybridization (ISSH) method, for generating cDNA libraries enriched in fast evolving genes. It utilizes transcripts of homologous tissues of distinct yet related species. Experimental hybridization conditions are monitored in order to discard transcripts that do not find their homologous counterparts in the two species sets as well as transcripts that display a strong complementarity between the two species. Only heteroduplexes that disanneal at low stringency are used for constructing the resulting cDNA library.
We demonstrate the efficiency of the ISSH method by generating a brain cDNA library enriched in fast evolving transcripts of a non-model catfish species, as well as a non-enriched control library. Our results indicate that the enriched library indeed contains more fast evolving sequences than the control library. Gene annotation analyses also indicate enrichment in genes with low expression levels and in non-ubiquitously expressed genes, two categories that encompass the majority of fast evolving genes. Furthermore, most of the identified transcripts show higher sequence divergence between two closely related catfish species than recognized fast evolving DNA markers.
The ISSH method offers a simple, inexpensive and efficient way to screen the transcriptome for isolating fast evolving genes. This method opens new opportunities in the investigation of biological mechanisms that include fast evolving genes, such as the evolution of lineage specific processes and traits responsible for species adaptation to their environment.
Fast evolving DNA sequences are used for answering a broad range of biological questions relative to population processes and phylogeography [e.g. ], species diversification [e.g. [2, 3]], conservation biology  and also genome or phenotype mapping [e.g. ]. However, due to the very same intrinsic quality for which they are looked for, i.e. their high evolutionary rate, fast evolving DNA sequences display "lineage specific" changes and therefore require de novo development each time a new group of non-model organisms is being investigated. Despite various methodologies targeted toward the isolation of unspecific polymorphic DNA fragments [e.g. [6–8]] the identification and the isolation of fast evolving DNA sequences in non-model organisms is still laborious and expensive, making it a major impediment to the routine analysis of multiple loci on many taxa.
The isolation of fast evolving genes has gained new motivation and attention because genes involved in several actively investigated processes display high substitution rates: the evolution of species specific traits such as the human brain [e.g. [9, 10]], speciation genes [e.g. [11, 12]], reproduction genes [e.g. [13, 14]] or genes governing the evolution of adaptive traits [e.g. ]. Theoretical approaches suggest that adaptation genes should be fast evolving so that selection has a substrate on which to act . Furthermore, speciation genes, those that are directly or indirectly involved in the establishment of the genetic barrier between closely related species, consistently display high divergence rates . At present, fast evolving genes, which often evolve under positive selection, can be identified either through large genomic comparisons, which are feasible only for model organisms like Drosophila species [e.g. [17, 18]] or human-chimpanzee comparisons , or via long term experimental approaches such as the one that led to the discovery of the hybrid inviability gene Hmr in Drosophila . The increasing interest in biological mechanisms driven by fast evolving genes calls for a more efficient and cost effective method for isolating such genes across closely related species, one that does not require prior genetic or genomic information.
The ISSH method (Figure 1) confronts in solution complementary transcriptomes of two closely related species with the aim of rescuing transcripts of fast evolving genes. The property of evolving fast implies that such transcripts will disanneal at low stringencies from the heteroduplexes formed by homologous complementary sequences of the two species. Our method was applied to build a cDNA library enriched in fast evolving transcript fragments of brain tissue of the catfish Ancistrus temminckii. We used as the selector species its close relative Ancistrus dolichopterus. To assess the efficiency of the ISSH method we prepared a non-enriched control cDNA library of brain tissue of A. temminckii using standard protocols. The two libraries were sequenced with the FLX Genome Sequencer technology (Roche). We then "blasted" the enriched and control libraries against the complete genome of the zebrafish and analyzed the differences. We also annotated the transcripts producing significant matches and examined their characteristics to highlight the effectiveness of our method. As the zebrafish is not a close relative to our catfish and because the sequences of interest display high sequence divergence, a substantial proportion of the enriched library yielded no significant Blast matches. Therefore, we prepared an EST library of a close catfish relative, Hypostomus gr. plecostomus, belonging to the same subfamily (Loricariidae: Hypostominae), for refining the analyses.
Table 1. Analysis of sequence divergence for the enriched and the control libraries.
Columns: Blast against zebrafish; Blast against Hypostomus catfish
When using the zebrafish genome as reference, the fastest evolving sequences may not find their homologous counterparts due to the distant evolutionary relationship between the zebrafish and our non-model catfish. Thus, performing the same analysis with an evolutionarily closer reference, our EST database of the catfish Hypostomus gr. plecostomus, may allow a better understanding of the efficiency of the ISSH method. The sequence divergence comparisons (Table 1) again show a systematic and significant enrichment in fast evolving sequences in the enriched library as compared to the control library. The difference between the two libraries is generally higher than when using the zebrafish as reference. This is likely explained by the inclusion of a set of faster evolving genes which can now find their homologues in the evolutionarily closer Hypostomus reference.
Tentatively annotated fast evolving transcript fragments and their sequence divergence as compared to the closest ortholog in the teleost mRNA refseq database and in the Hypostomus gr. plecostomus EST dataset.
Columns: mRNA refseq annotation according to closest teleost ortholog; A. temminckii vs closest teleost ortholog in mRNA refseq; A. temminckii vs H. gr. plecostomus
interferon regulatory factor 6 (irf6)
NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, assembly factor 2 (ndufaf2), nuclear gene encoding mitochondrial protein
similar to porcupine homolog (LOC100148644)
single-minded homolog 2 (sim2)
similar to pol polyprotein (LOC796496)
similar to ORF1-encoded protein (LOC100004717)
RMD5 homolog B (rmd5b)
similar to ORF1-encoded protein (LOC100004764)
ras-related C3 botulinum toxin substrate 1 (rho family, small GTP binding protein Rac1) like (rac1l)
similar to NLR family, pyrin domain containing 3 (LOC100002061)
monoacylglycerol O-acyltransferase 2
similar to G protein-coupled receptor 128 (LOC100148710)
similar to Uromodulin precursor (Tamm-Horsfall urinary glycoprotein) (THP) (LOC100007639)
hypothetical protein LOC100150258
similar to CG6639 CG6639-PA (LOC100000002)
hypothetical protein LOC100149782
similar to zymogen granule membrane glycoprotein 2 (LOC100005977)
Reference fast evolving sequences
cytochrome oxidase subunit I (COI). Columns: A. brevipinnis vs Danio rerio; A. brevipinnis vs H. boulengeri
reticulon 4 (RTN4) introns 1 & 2. Columns: A. cirrhosus vs Danio rerio; A. cirrhosus vs H. boulengeri
We emphasize that the sequences of the transcripts annotated using the mRNA refseq database likely represent the most conserved regions of the isolated transcripts dataset, as faster evolving regions will not find their sequence counterparts in the refseq database, which comprises no closely related catfish sequences.
The isolation of fast evolving genes can be easily accomplished in model organisms for which abundant genomic and transcriptomic knowledge exists. Bioinformatic routines and experimental procedures (micro-array technology) are available for this purpose. At present, however, there is no efficient method for doing so in non-model organisms. The ISSH method presented here is a fast and cost-effective procedure for enriching a cDNA library in fast evolving genes. The various tests we performed demonstrate the efficiency of our method. We have shown that the overall sequence divergence was significantly increased in the enriched library as compared to the control when blasting these libraries against the zebrafish genome or against our Hypostomus catfish EST library. Moreover, the results of the ISSH method fulfilled the three predictions derived from the general properties of fast versus slowly evolving genes. Briefly, the enriched library displayed (1) a higher proportion of fast evolving mitochondrial genes, (2) a higher fraction of genes with low expression level, and (3) proportionally more non-ubiquitously expressed genes. Furthermore, the fast evolving transcripts with orthologous sequences in the two catfish species and in the mRNA refseq fish database generally displayed higher sequence divergence than recognized fast evolving DNA markers.
The proportion of contigs annotated via Uniprot/Swissprot comparisons is rather small, particularly in the enriched library. This can be explained firstly by the relatively poor representation of fish genes in the Uniprot/Swissprot database, combined with the likely high sequence divergence between the genes of the non-fish organisms in the database and our catfish. Secondly, not all contigs may contain coding sequence; they may be composed mainly of UTR sequence. However, the enriched library shows no marked bias toward UTR sequences, which generally evolve faster than their contiguous coding sequences. Indeed, about 68% of contigs longer than 240 bp display putative open reading frames (ORF) longer than 80 aa (criterion of the H-invitational annotation project), and 51% of contigs longer than 300 bp display putative ORFs longer than 100 aa (criterion of the Functional Annotation of Mouse (FANTOM) project). These ORF lengths correspond to four and five times, respectively, the sequence length without stop codons expected in non-coding sequence with the same base frequencies. Similar proportions are observed in the control library (70% and 54%, respectively), indicating no strong enrichment in UTR sequences. Furthermore, a significant part of the isolated transcripts may be non-coding RNAs. It has been shown, for instance, that non-coding RNAs constitute more than half of the mammalian transcriptome. As the annotation of the isolated fast evolving transcripts is difficult due to the lack of sequence similarity with distant reference species, we are currently unable to assess the proportion of fast evolving non-coding RNAs in our dataset.
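The ORF length criteria mentioned above can be illustrated with a minimal Python sketch. It assumes that a putative ORF is simply the longest stop-free run of codons in any of the six reading frames; the function names are hypothetical, and this is not the annotation pipeline used in the study.

```python
# Minimal sketch (assumption: an "ORF" is the longest run of codons
# without a stop codon in any of the six reading frames).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_open_stretch(seq):
    """Longest stop-free run over the six frames, counted in codons."""
    seq = seq.upper()
    best = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            run = 0
            for i in range(frame, len(strand) - 2, 3):
                if strand[i:i + 3] in STOP_CODONS:
                    run = 0  # a stop codon closes the current stretch
                else:
                    run += 1
                    best = max(best, run)
    return best


def passes_orf_criterion(seq, min_contig_bp, min_orf_aa):
    """True if the contig is longer than min_contig_bp and carries
    a stop-free stretch longer than min_orf_aa codons."""
    return len(seq) > min_contig_bp and longest_open_stretch(seq) > min_orf_aa
```

Under these assumptions, the first criterion quoted in the text would be checked with `passes_orf_criterion(contig, 240, 80)` and the second with `passes_orf_criterion(contig, 300, 100)`.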
Our method has the advantage of being theoretically very versatile in terms of the evolutionary divergence between the species of interest and its selector species. The fastest evolving genes will already show detectable sequence divergence between closely related species, while using a more distant selector species will allow the isolation of a wider set of fast evolving genes. Likewise, one can modify the hybridization temperature to fine-tune the degree of sequence divergence sought between the species of interest and the selector species. Our method may also be applicable for screening intra-species gene-associated polymorphism. Only for that objective may the ISSH method be compared with the In-Gel Competitive Reassociation and EST Array Hybridization method, which exploits the property that the vast majority of RFLP fragments between two strains or populations share the same electrophoretic size. Deviation from this property generates false positives and, therefore, the method of Gotoh and Oishi (2003) loses its interest if more distantly related groups are used.
Interestingly, the ISSH method can also be used for isolating the fraction of genes that are highly conserved between species. This is achieved by rescuing the fraction of ss cDNA that disanneals only under very stringent conditions, which guarantees an almost perfect complementarity between the probed and selector pools of transcripts. Moreover, the species from which the selector pool of mRNA is extracted may be chosen in order to increase the level of conservation of the enriched cDNA library: the more evolutionarily distant the selector species, the more conserved the isolated transcripts will be.
The ISSH method is not linked to a specific sequencing technology. In this study we used the long-read 454 FLX technology (Roche) to ensure a minimum sequence length for downstream sequence analyses. However, this argument is currently less valid, as the Illumina short-read sequencing technology, which produces many more reads at a lower cost per base, has recently been shown to be useful and accurate in de novo transcriptome assembly of non-model organisms. Traditional Sanger sequencing can also be used provided that the PCR amplified fast evolving transcripts are cloned before sequencing.
We demonstrated that the ISSH method efficiently enriches a cDNA library in fast evolving genes. As this new method does not rely on prior knowledge of sequence information, it can be performed on any non-model organism, and is therefore of wide use. Although the improvements and reduced cost of next-generation sequencing technologies may lead to ever more complete transcriptome assemblies, and may have the potential to be used for identifying fast evolving transcripts with bioinformatic tools, the ISSH method will still have an interesting role to play. First, the ISSH method is inexpensive, requires little labor, and leads directly to the set of transcripts of interest. Second, as fast evolving genes are often expressed at low levels, they may be hard to retrieve using next-generation sequencing technologies unless very deep sequence coverage is achieved, at high cost. Therefore, the ISSH method opens new possibilities in screening transcriptomes in search of genes involved in lineage specific processes and traits, a field of growing interest in evolutionary biology.
Total RNA was extracted from fresh brain tissue of Ancistrus temminckii (probed species) and its close relative Ancistrus dolichopterus (selector species) using TRIzol reagent (Gibco). We also extracted total RNA from our catfish outgroup reference Hypostomus gr. plecostomus. After quantification and quality verification of the total RNA, mRNA was isolated using the mRNA Isolation Kit (Roche Diagnostics). The SuperScript double-stranded cDNA synthesis kit (Invitrogen) was used to prepare the brain control library of Ancistrus temminckii and also the outgroup reference Hypostomus gr. plecostomus, starting with 1 μg of brain mRNA and the option of oligo(dT) anchor priming for the first strand synthesis step.
The selector pool of mRNA, extracted in this work from Ancistrus dolichopterus, is biotinylated to allow subsequent separation by magnetic particles coated with streptavidin. Biotinylation of 5 μg mRNA was done using the BIO-ULS labeling kit (Kreatech); the final volume was reduced to 7 μl using a Speedvac concentrator. The probed pool of mRNAs extracted from the species of interest, Ancistrus temminckii, is reverse-transcribed into ss cDNA using a short-tailed random hexamer primer (5'-AGGA-(N)6-3'). We used 1 μg of mRNA (one fifth of the selector's mRNA amount) and 200 ng of the short-tailed random primer in a total volume of 12 μl. The reverse transcription was performed using the SuperScript II RT (Invitrogen) following the manufacturer's protocol for random priming; the final volume was 20 μl. The RNA template is destroyed by alkaline hydrolysis (0.35 N NaOH; 0.35 M EDTA) at 65°C for 15 min. The solution is then neutralized with 0.35 N HCl and first strand cDNAs are purified using the Mini Elute PCR Purification Kit (Qiagen) following the manufacturer's protocol but with an additional washing step and two rounds of elution. The final volume was reduced to 7 μl using a Speedvac concentrator.
The pool of biotinylated selector mRNA (7 μl) and the pool of first strand cDNA of the species of interest (7 μl) are mixed and the total volume is adjusted to 15 μl. An equal volume (15 μl) of 2× hybridization buffer is added (10 mM EDTA pH 8, 1.5 M NaCl, 40 mM sodium phosphate buffer pH 7.2, 10× Denhardt's, 0.2% SDS). The solution is heated at 90°C for 2 min and quickly placed in a rotary shaker located inside a hybridization oven preheated to 55°C. The hybridization is carried out for 60 hours at 55°C. At the end of the hybridization step, 75 μl of 1 M NaCl is added to the hybridization mixture, which is kept at RT.
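The working concentrations implied by this step follow from simple dilution arithmetic, sketched below in Python. The sketch assumes that the 15 μl nucleic acid mix contributes no salt of its own (an assumption, since its exact composition is not given), and the variable names are hypothetical.

```python
# Dilution arithmetic for the hybridization mix.
# Assumption: the 15 ul nucleic acid mix itself contains no salts.
BUFFER_2X = {
    "EDTA_mM": 10.0,
    "NaCl_M": 1.5,
    "sodium_phosphate_mM": 40.0,
    "Denhardt_fold": 10.0,
    "SDS_percent": 0.2,
}
SAMPLE_UL = 15.0   # selector mRNA + probed cDNA, adjusted volume
BUFFER_UL = 15.0   # equal volume of 2x hybridization buffer
NACL_ADD_UL = 75.0  # 1 M NaCl added after hybridization
NACL_ADD_M = 1.0

hyb_volume = SAMPLE_UL + BUFFER_UL

# Each buffer component is diluted by BUFFER_UL / hyb_volume (here 1/2).
working = {name: conc * BUFFER_UL / hyb_volume
           for name, conc in BUFFER_2X.items()}

# NaCl concentration once 75 ul of 1 M NaCl has been added.
final_volume = hyb_volume + NACL_ADD_UL
nacl_after_addition = (working["NaCl_M"] * hyb_volume
                       + NACL_ADD_M * NACL_ADD_UL) / final_volume

for name, conc in working.items():
    print(f"{name}: {conc:g}")
print(f"NaCl after addition: {nacl_after_addition:.3f} M in {final_volume:g} ul")
```

Under this assumption the hybridization takes place in 0.75 M NaCl, and the NaCl addition brings the mixture to roughly 0.93 M in 105 μl before it is bound to the magnetic particles.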
The selector-probed hybridization mix is sequentially denatured to separate two fractions of cDNAs with increasing denaturation stringencies, the first fraction containing the non-hybridized or non-specifically hybridized probed cDNAs and the second fraction being the one enriched in fast evolving transcripts. First, streptavidin magnetic particles (Roche Diagnostics) are prepared according to the manufacturer's instructions (1200 μg) and resuspended in 100 μl of TEN 1000 buffer. The hybridization mixture is then transferred into the tube containing the streptavidin magnetic particles and placed in a rotary shaker for 45 min at RT. In this step the biotinylated selector mRNAs, which may or may not be hybridized with a complementary probed ss cDNA, are linked to the streptavidin magnetic particles. The non-hybridized probed cDNAs are discarded by placing the tube in a magnetic separator (Qiagen) and removing the supernatant. The magnetic particles with their attached molecules are washed three times at 55°C for 15 min in 600 μl of preheated 5× SSC, then resuspended in 50 μl of 0.1× SSC and incubated at 65°C for 15 min. In this last step the fast evolving probed cDNAs disanneal from their selector counterparts, and this fraction of interest is recovered in the supernatant after a magnetic separation. This step is repeated once. The fraction enriched in fast evolving cDNAs is purified by ethanol precipitation in the presence of ammonium acetate and glycogen. The pellet is rinsed once in 70% ethanol and resuspended in 20 μl water.
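To give a rough sense of how much the change from 5× SSC to 0.1× SSC alters duplex stability, the sketch below evaluates only the salt-dependent term of the widely used empirical melting temperature formula, 16.6 × log10[Na+]. Several assumptions apply: 1× SSC is taken as 0.15 M NaCl plus 0.015 M trisodium citrate, the formula was derived for DNA:DNA duplexes in solution and at moderate salt, and the mRNA:cDNA hybrids bound to particles in this protocol may behave differently. The numbers are therefore indicative only, and the function names are hypothetical.

```python
import math

# Assumption: 1x SSC = 0.15 M NaCl + 0.015 M trisodium citrate,
# i.e. 0.15 + 3 * 0.015 = 0.195 M sodium ions.
SSC_1X_SODIUM_M = 0.15 + 3 * 0.015


def sodium_molar(ssc_fold):
    """Sodium ion concentration (M) of an n-fold SSC solution."""
    return ssc_fold * SSC_1X_SODIUM_M


def salt_term_shift(from_ssc_fold, to_ssc_fold):
    """Change in the 16.6 * log10[Na+] term of the empirical Tm formula
    when moving from one SSC concentration to another (degrees C)."""
    ratio = sodium_molar(to_ssc_fold) / sodium_molar(from_ssc_fold)
    return 16.6 * math.log10(ratio)


shift = salt_term_shift(5.0, 0.1)
print(f"Salt term change, 5x SSC to 0.1x SSC: {shift:.1f} degrees C")
```

Under these assumptions the salt term drops by about 28°C, in addition to the 10°C increase in incubation temperature between the washes (55°C) and the elution (65°C). Mismatched heteroduplexes, which are less stable than perfectly matched ones, are thus expected to be the first to disanneal.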
The ss cDNAs are transformed into double stranded (ds) cDNAs using short-tailed random hexamer primers (CCAC-(N)6) and the DNA polymerase I Klenow fragment (Promega), according to the manufacturer's random priming protocol. cDNAs are then blunt ended using T4 DNA Polymerase (Promega), extracted with phenol/chloroform/isoamylalcohol (25:24:1) and recovered by ethanol precipitation with ammonium acetate. Double strand EcoRI adapters (Invitrogen) are ligated to the ds cDNA ends according to the manufacturer's instructions. The final volume is adjusted to 100 μl with water and the cDNAs are purified using the High Pure PCR Product Purification Kit (Roche Diagnostics).
A first PCR amplification is performed using a single primer (5'-GTCGACGCGGCCGCGAATT-3') targeting the EcoRI adapter ligated at both ends. The PCR reaction is done in 50 μl final volume with 10 μl of cDNA as template and with the following profile: 1 min initial denaturation at 94°C followed by 35 cycles with 30 s at 94°C, 30 s at 62°C, 2.5 min at 72°C and a final elongation step of 5 min at 72°C. The PCR product is checked on a 1.5% agarose gel. A nested PCR is performed using specific primers overlapping the end of the EcoRI adapters and the tails of the two short-tailed random primers used for the synthesis of the first strand and then for the synthesis of the second strand (EcoRI-AGGA: 5'-TCGCGGCCGCGTCGACAGGA-3'; EcoRI-CCAC: 5'-TCGCGGCCGCGTCGACCCAC-3'). The PCR conditions are as described above but the amount of template DNA is adjusted according to the result of the first PCR. The PCR products are checked on a 1.5% agarose gel and then purified using the High Pure PCR Product Purification Kit (Qiagen Diagnostics).
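The relationship between the three primer sequences can be checked with a few lines of Python. The sketch below only compares the strings given in the text: it splits each nested primer into an adapter-derived part and its 4-nt tail, and tests whether the adapter-derived part occurs in the reverse complement of the first-round primer. The helper names are hypothetical.

```python
# String-level check of the primer sequences quoted in the text.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


FIRST_PCR_PRIMER = "GTCGACGCGGCCGCGAATT"
NESTED_PRIMERS = {
    "EcoRI-AGGA": ("TCGCGGCCGCGTCGACAGGA", "AGGA"),  # first strand tail
    "EcoRI-CCAC": ("TCGCGGCCGCGTCGACCCAC", "CCAC"),  # second strand tail
}


def adapter_part(primer, tail):
    """Nested primer without its 3' tail; raises if the tail is absent."""
    if not primer.endswith(tail):
        raise ValueError("primer does not end with the expected tail")
    return primer[:-len(tail)]


first_rc = reverse_complement(FIRST_PCR_PRIMER)
overlap_ok = {
    name: adapter_part(primer, tail) in first_rc
    for name, (primer, tail) in NESTED_PRIMERS.items()
}
print(first_rc)
print(overlap_ok)
```

Both nested primers share the same 16-nt adapter-derived part, which is contained in the reverse complement of the first-round primer, and they differ only in the 3' tail that matches one of the two short-tailed random primers.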
For the Ancistrus control and the outgroup reference Hypostomus gr. plecostomus, shotgun DNA libraries were prepared with a starting amount of 4 μg DNA. The mean fragment size was about 500 bp, obtained using nebulizers and chemicals from the GS DNA Library Preparation Kit (Roche Diagnostics) according to the manufacturer's manual. This step was not needed for the Ancistrus library enriched in fast evolving transcripts, as the ISSH method results in a PCR product containing fragmented transcripts, generally in the range of 300 to 1000 bp. After DNA purification, the DNA end repair step and the ligation of the barcoding adaptors were performed following established protocols. The adapter-ligated DNA from the three libraries was pooled and prepared for 454 sequencing according to standard protocols, using the GS DNA Library Preparation Kit with Titanium reagents, and following the instructions of the GS FLX manuals (Roche Diagnostics). The library was sequenced on one 16th region of a full GS FLX sequencing plate with a prior titration run. Upon completion, sequences were screened for primer concatemers, weak signal, poly A/T sequences, and barcodes for assigning them to one of the three samples. The average read length was 180 bp. cDNA assemblies were performed with the SeqMan software from DNAStar. The cDNA library enriched in fast evolving genes and the control library of Ancistrus temminckii were deposited in the Short Read Archive (SRA) of NCBI under the accession number SRA009346.1
The Blast search against the zebrafish sequences of all the EMBL sub-divisions (Expressed Sequence Tag; High Throughput cDNA sequencing; High Throughput Genome sequencing; mRNA of Standard; Whole Genome Shotgun) was performed on the Vital-IT high-performance computing facility of the Swiss Institute of Bioinformatics (http://www.vital-it.ch). We used Blast parameter values suitable for comparing divergent sequences (word size = 7; match score = +1; mismatch score = -1; initial penalty for opening a gap = 1; penalty for extending a gap = 2). The local Blast search against our Hypostomus gr. plecostomus brain EST database was performed using the software blast-2.2.19 developed by NCBI. Perl scripts for parsing the Blast outputs were built using Eclipse SDK 3.4.1.
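The kind of parsing involved in comparing the two libraries can be sketched as follows. This is an illustration in Python rather than the Perl scripts used in the study, and it rests on several assumptions: the Blast output is in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), the best hit is the one with the highest bit score, and divergence is approximated as 100 minus percent identity. The function names and the default E-value cut-off are hypothetical.

```python
import csv
from statistics import mean


def best_hit_divergence(lines, max_evalue=1e-8):
    """Map each query to the divergence (100 - % identity) of its best hit.

    `lines` is any iterable of tab-separated Blast tabular records.
    Hits with an E-value above `max_evalue` are ignored.
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip comments and malformed records
        query = row[0]
        identity = float(row[2])
        evalue = float(row[10])
        bitscore = float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, 100.0 - identity)
    return {query: div for query, (_, div) in best.items()}


def mean_divergence(lines, max_evalue=1e-8):
    """Mean best-hit divergence of a library, or None without hits."""
    divergences = best_hit_divergence(lines, max_evalue)
    return mean(divergences.values()) if divergences else None


# Made-up records, for illustration only.
example = [
    "contig1\tref_a\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
    "contig1\tref_b\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t90",
    "contig2\tref_c\t70.0\t60\t18\t0\t1\t60\t1\t60\t1e-3\t40",
]
print(best_hit_divergence(example))
```

Applied to the tabular outputs of the enriched and control libraries, `mean_divergence` would give one summary value per library. Contigs without a significant hit, such as `contig2` in the example, are not counted, which matters here because the fastest evolving sequences are the most likely to find no homologue in a distant reference.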
The proportion of contigs with low complexity sequences or sequence repeats was assessed using RepeatMasker open-3.2.8 (Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996-2004; http://www.repeatmasker.org). Transcripts were annotated according to their best Blast hit against Swissprot/Uniprot databases, with a minimal E-score of 10e-8. The translations into the six frames were performed using BCM Search Launcher and blasting was done with Blastp at NCBI. The expected frequency of stop codons in non-coding sequences was calculated by multiplying the three single nucleotide frequencies taken from the sequence data of the corresponding library, and summing the frequencies of the three possible stop codons. Gene transcription levels in specific tissues were taken from the Unigene database and are expressed as the number of ESTs of the gene under consideration divided by the total number of ESTs of the tissue library, multiplied by 10,000. Gene ontology classification was performed on Panther (Protein ANalysis THrough Evolutionary Relationships; http://www.pantherdb.org), complemented with ontology information given in the Uniprot database. We used only the top categories of the classification hierarchy, as given in Panther. Fast evolving transcripts found in Ancistrus temminckii, in the Hypostomus gr. plecostomus EST dataset, and in the mRNA reference sequences database of NCBI, restricted to the Teleostei (Blastn threshold E-score < 10e-8), were used to assess the sequence divergence. A tentative annotation was given according to the best hit against the mRNA refseq database. For direct comparison purposes, sequence divergence was calculated on the sequence region present in all three taxa. Sequences of the reference fast evolving markers were obtained from GenBank: Ancistrus brevipinnis COI: EU359402; Hypostomus boulengeri COI: EU359422; Danio rerio complete genome: NC_002333; Ancistrus cirrhosus RTN4 introns: EU817562; Hypostomus boulengeri RTN4 introns: EU817560.
The RTN4 introns from Danio rerio were retrieved from Ensembl http://www.ensembl.org/, locus: chromosome: Zv8:1:42092991:42094205:1.
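The stop codon calculation described above can be written out as a short Python sketch. The base frequencies below are uniform placeholders chosen for illustration, not the frequencies measured in either library, and the function names are hypothetical. Taking the reciprocal of the stop codon frequency as the expected stop-free length is also an assumption about how the comparison with ORF length was made.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_frequency(base_freq):
    """Expected frequency of stop codons in non-coding sequence:
    product of the three single-nucleotide frequencies of each stop
    codon, summed over the three stop codons."""
    return sum(
        base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]
        for codon in STOP_CODONS
    )


def expected_open_length(base_freq):
    """Approximate expected spacing between stop codons, in codons,
    taken as the reciprocal of the stop codon frequency."""
    return 1.0 / stop_codon_frequency(base_freq)


# Placeholder frequencies (assumption: equal base composition).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

p_stop = stop_codon_frequency(uniform)
spacing = expected_open_length(uniform)
print(f"stop codon frequency: {p_stop:.4f}")
print(f"expected spacing: {spacing:.1f} codons")
print(f"80 aa = {80 / spacing:.1f}x, 100 aa = {100 / spacing:.1f}x")
```

With equal base frequencies a stop codon is expected about once every 21 codons, so ORFs of 80 and 100 aa are roughly four and five times longer than expected by chance. The library-specific base frequencies would shift these values somewhat.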
Raw read data is available at the Short Read Archive (SRA) of NCBI under the accession number SRA009346.1
No ethical approval was required for this study.
We thank Dr. L. Falquet who performed the Blast calculations against the zebrafish transcript databases on the Vital-IT high-performance computing facility of the Swiss Institute of Bioinformatics (http://www.vital-it.ch). We acknowledge P.-Y. Pettina and A. Coulot for their help in preliminary laboratory tests of feasibility. We thank Dr. Y. Surget-Groba for bioinformatic advice and for revising the manuscript, and two anonymous reviewers for helpful comments and suggestions. This work was supported by funds from the Canton de Genève; the Swiss National Research Fund [grant number 3100A0-122303/1]; and the G & A Claraz Foundation.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.