- Research article
- Open Access
Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote
© Liu et al.; licensee BioMed Central Ltd. 2013
- Received: 1 March 2012
- Accepted: 9 August 2012
- Published: 5 November 2012
Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication. Because of gene duplications, one cannot establish orthologies simply by homology comparisons. Rather intense phylogenetic analysis or structural analysis of orthologies is required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length transcripts are essential. Generation of large numbers of full-length transcripts using traditional transcript sequencing is very difficult and extremely costly.
In this work, we took advantage of a doubled haploid catfish, which has two sets of identical chromosomes and in theory there should be no allelic variations. As such, transcript sequences generated from next-generation sequencing can be favorably assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed using Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly allowed identification of 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assembly of the corresponding transcripts harbored PSVs or MSVs (in the form of pseudo-SNPs in the assembly). Of the 25,144 contigs with unique protein hits, around 20,000 contigs matched 50% length of reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. The characterization of consensus sequences surrounding start codon and the stop codon confirmed the correct assembly of the full-length transcripts.
The large set of transcripts assembled in this study is the most comprehensive set of genome resources ever developed from catfish, which will provide the much needed resources for functional genome research in catfish, serving as a reference transcriptome for genome annotation, analysis of gene duplication, gene family structures, and digital gene expression analysis. The putative set of duplicated genes provide a starting point for genome scale analysis of gene duplication in the catfish genome, and should be a valuable resource for comparative genome analysis, genome evolution, and genome function studies.
- Transcriptome Assembly
- Iron Responsive Element
- Protein Query Sequence
- Average Contig Length
- Catfish Genome
Recent advances in next-generation sequencing enabled an array of whole genome sequencing or re-sequencing projects in both model and non-model species. Such efforts have produced a wealth of genome resources. However, thorough genome analysis thereafter is essential to associate genome sequences with biological meanings [1, 2]. An important step in genome analysis is to decipher the complete protein coding sequence (CDS) region of each gene. In eukaryotes, prediction of CDS regions in genomic sequence is complicated by the intron interruptions and the low proportion of protein coding regions in the genome. It is still problematic at present to predict the correct distribution of CDS regions solely based on genomic sequences . To obtain information about the portion of a genome that is transcribed as RNAs and then translated into proteins, a comprehensive set of full-length transcripts is needed .
In teleost fish, genome annotation is further seriously hindered by genome duplication. Because of gene duplications, it’s unable to establish orthologies simply by homology comparisons. Rather intense phylogenetic analyses or syntenic analyses of orthologies are required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length coding regions of transcripts are essential. Furthermore, the nature of highly diversified and duplicated genome of fish species hindered sequence assembly and complicated the genome annotation as well as SNP identification. Previous efforts aiming at SNP discovery for catfish rendered the issue about discrimination of false positive SNPs derived from paralogous sequences and multisite sequences (PSVs/MSVs) .
Obtaining large numbers of full-length transcripts is not an easy task. In most cases, full-length transcripts were obtained by Sanger-based sequencing of full-length cDNA clones. This strategy works well for many highly abundantly expressed and short transcripts because in such cases: 1) clones containing full-length transcripts can be readily identified, and 2) complete sequencing can be achieved through sequencing from both ends of the clones. However, such a task for rarely expressed genes and large transcripts can be troublesome because the identification of full-length cDNA containing clones is itself a huge challenge, and even if the clones are identified, complete sequencing of long inserts using Sanger sequencing can still be time-consuming and expensive. Recently, next-generation sequencing has been recognized as one solution for transcriptome sequencing at a reasonable cost [6, 7]. High-throughput sequencing of cDNA (RNA-Seq) does not rely on prior knowledge, enabling interrogation of all transcripts including potentially novel transcripts uncaptured in Sanger-based sequencing . RNA-Seq can provide sufficient sequencing coverage on whole transcriptome scale to ensure the precision of each single base and integrality of full-length transcripts [7, 9]. However, huge amount of short reads generated from RNA-Seq make the transcriptome assembly difficult, which is not only impeded by repeats but also by alternatively spliced transcripts. Moreover, while genomic sequencing coverage is generally uniform across the genome, transcriptome sequencing coverage is highly variable, depending on gene expression levels, excluding the use of coverage information to resolve repeated motifs . Since a transcriptome assembly with good quality is essential for all the downstream analysis, extra efforts are required to improve the transcriptome assembly. An optimized assembly strategy can be obtained by combinatory use of different assembly softwares, especially the ones using multiple k-mers to resolve the problem of biased coverage of transcriptome sequencing. A higher k-mer length will theoretically results in a more contiguous assembly of highly expressed transcripts. On the other hand, poorly expressed transcripts will be better assembled if lower k-mer lengths are used . Therefore, multiple k-mer approach can be used to increase the assembly sensitivity and contiguity. With the fast development of assemblers able to efficiently handle a greater number of sequence reads [10, 11], short-reads can be of considerable utility for assembling transcriptomes of non-model organisms [9, 12, 13]. The most significant issue with RNA-Seq for the assembly of transcriptome, however, is the allelic variation. Most vertebrate species are diploid organisms and therefore two sets of chromosomes are involved in the generation of transcripts. Even if only one individual is used for RNA-Seq, the assembly of related transcripts from the two sets of chromosomes creates “haplotypes” that do not exist, and this alone prohibits assembly of full-length transcripts from multiple sequences. To overcome this huge problem, the ideal approach is to create an individual with doubled haploid genome and therefore there are no allelic variations such that the short reads from RNA-Seq can be assembled, not only technically feasible by the software packages, but also biologically meaningful as they are transcribed from the same sequences.
In catfish, years of efforts have resulted in nearly 500,000 quality ESTs , but a limited number of full-length transcripts were obtained through the traditional clone-by-clone approach using Sanger sequencing . Recent efforts using RNA-Seq allowed the identification of a large number of transcripts in catfish , but the assembly of full-length transcripts was hindered by the reasons discussed above. In order to circumvent this problem and generate a large set of full-length coding sequences to support the genome sequencing and annotation in catfish, doubled haploid catfish was produced that has been demonstrated to harbor two sets of homozygous chromosomes . Such homozygous catfish is ideal material for the generation and assembly of full-length transcripts using RNA-Seq. Here we report deep sequencing of catfish transcriptome by RNA-Seq using this gynogenetic homozygous catfish. The two main objectives of this study were to develop a comprehensive set of reference transcript sequences for genome-scale gene discovery and expression studies in catfish; and to obtain a large number of full-length transcripts for whole genome annotation, duplicate gene identification, and facilitating detection of false SNPs derived from PSVs/MSVs.
Summary of data generated for catfish transcriptome using Illumina Hiseq 2000 sequencing
No. of tissues*
No. of raw reads
Read length (bp)
No. of reads after trim
Avg. length after trim (bp)
No. of bases after trim (Gbp)
In order to obtain a comprehensive and reliable assembly, three different assemblers were used including CLC Genomics Workbench (version 4.2), ABySS (version 1.2.6) and Velvet (version 1.1.02) for de novo assembly. Although all these three assemblers use the same de Bruijn graph algorithm, they are different in how to treat sequencing errors, resolve ambiguities and utilize read pair information . Furthermore, multiple k-mer approach applied in ABySS and Velvet has the capability of producing a better assembly for transcripts from both highly and lowly expressed genes. Therefore, we believe that the combinatory use of these three assemblers could improve the assembly.
Summary of assemblies generated using various de novo assemblers
No. of contigs ≥200bp
No. of contigs with length ≥N50
No. of contigs with length ≥1kb
Avg. contig length (bp)
Total size (Mbp)
The multiple k-mer assemblies encompass both higher k-mer lengths to result in a more contiguous assembly of highly expressed transcripts, and lower k- mer lengths to better assemble poorly expressed transcripts to increase assembly sensitivity. In order to compare the ability of each assembler to reconstruct protein-coding gene transcripts on sensitivity and contiguity, each assembly was searched against annotated protein database using BLASTX (Additional file 1). The significant unique protein hits were identified from both zebrafish RefSeq protein and Uniprot/Swiss-Prot database with E-value cutoff of 1e-10. The largest number of unique protein hits was obtained by Velvet assembly in comparison to the assemblies generated by CLC and ABySS. The contigs generated by ABySS showed higher N50 and average lengths, but a smaller number of unique protein hits was obtained (Table 2; Additional file 1). By comparison, the contigs generated from CLC offered larger number of unique protein hits than ABySS, while providing higher N50 than Velvet.
A total of 370,798 non-redundant transcript-derived contigs were obtained as the final assembly based on the three sets of assemblies. The combined final assembly had a N50 length of 1,395 bp and average contig length of 743 bp, including 68,569 contigs with length greater than 1 kb (Table 2). Such assembly took advantage of each assembler to achieve both highest assembly sensitivity and contiguity. In order to assess the transcriptome capture obtained by the current work, we aligned all the channel catfish ESTs currently available in NCBI (354,488) with transcripts of the final assembly reconstructed in this study. The results showed that the vast majority (97% of the total) of the ESTs were represented in our data set showing ≥ 90% identity over a length of ≥ 100 bp and E-value of ≤ 1e-10. All the 370,798 transcript-derived contigs have been submitted to the NCBI Transcriptome Shotgun Assembly (TSA) database under accession numbers: JT123347-JT494144.
Assessment of transcriptome assembly
Assessment of transcriptome assembly
Total previous full-length cDNAs
Completely assembled with same ORFs
BLASTX annotation based on homology search
Summary of BLASTX searches to annotated protein databases for final assembly
No. of contigs with hits
No. of unique accessions
No. of unique protein hits
Analysis of BLAST top hit species and potential presence of non-catfish transcripts
However, it’s noteworthy to mention that a good fraction (1,972 sequences, 8%) of our sequences had their top BLASTX hits with bacterial sequences, followed by those had top hits with fungi (424 sequences, 2%), and a few (33 sequences) even had top hits to viral sequences. Of these 2,429 sequences with top hits to other species than vertebrates, the top BLAST hit e-values were larger than those with top hits with vertebrates, in the range of 10-10 to 10-40, whereas those with top hits to vertebrate species had a smaller e-value (often < 10-80), suggesting the presence of xenobiotic species in the catfish samples. The top BLAST hits to bacteria were from a variety of species, of which 56% were Escherichia coli, and 29% were Shigella species (Figure 1D). Since our sequences were derived from multiple tissues, some of the bacterial sequences could belong to commensal microorganisms in digestive organs, although careful treatments had been taken to clean the tissues during collection process. Further investigation is warranted to fully understand these sequences most similar to xenobiotic species, which could provide information concerning symbionts and other microbial communities in catfish.
Conserved domain (CD) search for contigs without BLAST protein hits
In order to identify potential protein-coding genes from those contigs without significant blast hits from protein databases, we conducted ab initio prediction of the potential ORFs for the 276,322 catfish contigs that did not have significant hits to known proteins. A total of 260,793 contigs have putative ORFs detected with minimum length of 30 amino acids, and 16,688 of which possess putative ORFs with minimum length of 100 amino acids (Additional file 2). To determine the putative functions of these ORFs, the CD-Search tool in NCBI was used to identify conserved domains, or functional units, within the protein query sequences. The specific hit found by a CD-Search indicates a high confident association between the protein query sequence and a conserved domain, resulting in a high confidence level for the inferred function of the protein query sequence. A total of 4,984 ORFs were identified with conserved domains (Additional file 3), suggesting that such ORF-harboring contigs were derived from functional protein-coding genes as well.
Comparison of catfish transcripts with model fish species
Reciprocal BLAST comparison between catfish and five model fish with sequenced genome
#Protein coding genes
#Proteins have hits in catfish
#Genes have hits in catfish
#Unique protein hits by catfish
#Unique gene hits by catfish
Identification of gene duplicates
To evaluate the detected gene duplicates, we aligned the transcripts to the preliminary assembly of the catfish genome (255,858 contigs with mean length of 2,996 bp and N50 of 6,027 bp, unpublished data). Duplicated genes were expected to be present in different genome locations (generally, i.e., genome contigs), and thus on different genomic contigs. As summarized in Additional file 5, the vast majority (92%, 2,446/2,659) of transcript-derived contigs hit more than one genomic contigs, suggesting their potential involvement in duplication. Of the remaining 213 transcript contigs, 196 contigs had only one catfish genome contig hits, indicating their uniqueness in the catfish genome. However, the possibility that it was because of the lack of corresponding genome contigs in the preliminary catfish genome assembly cannot be excluded. A total of 17 gene-derived transcripts did not get any significant match with genome contigs, suggesting the incompleteness of the current catfish preliminary assembly (Additional file 5). Additional analysis of paralogous relationships may be strengthened by examination of physical locations in the genome for tandom duplications and flanking regions for segmental duplications when catfish genome scaffolds become available in the near future. It is understood that the same gene can still be placed into two or more genomic contigs with the preliminary assembly, but this problem should be overcome soon when the catfish genome assembly is completed.
Identification of catfish full-length transcripts
It has been shown that transcriptome sequencing using RNA-Seq can be a cost-effective approach for reconstruction of full-length transcripts in species without a reference genome . In the context of this work, we took advantage of an individual homozygous catfish which provided a haploid transcriptome to facilitate the transcriptome assembly. A large fraction of transcripts were reconstructed and a large portion of these reconstructed transcripts were expected to be full-length transcripts. To identify full-length transcripts in our transcript collection, we utilized the functional annotation results with all the contigs in the final assembly being searched against zebrafish RefSeq and Uniprot database. The ORFs were predicted by using BLASTX-aided method which detects ORFs by finding the starting methionine and stop codon in catfish transcripts relative to the same features in the most closely related reference proteins identified by BLASTX.
Summary of full-length transcript identification after manual inspection
Potential full-length transcripts
Transcripts lack at 5' end
Transcripts lack at 3' end
Transcripts with incorrect ORFs
Manual inspection of completeness of CDS in each transcript sequence was determined by the BLASTX alignment. In this procedure, we consider a full-length transcript to contain a complete CDS if the ORF revealed a start codon and stop codon in agreement with the matched protein sequence in the database. As summarized in Table 6, a total of 14,240 were identified as full-length transcripts with complete coding sequences, 1,725 contigs and 711 contigs were partial transcripts lacking of 5’ end or 3’ end, respectively, and 193 contigs had incorrect open reading frames predicted due to either partial sequences or incorrect bases in start codon regions.
Examination of sequences surrounding the start codon and stop codon of the catfish full-length transcripts
To our best knowledge, there were no systematic studies on the sequence contexts surrounding the stop codons in fish species. However, we reexamined our previously identified 1,087 full-length cDNAs generated from clone sequencing to reconfirm the results we provided here. As shown in Additional file 6, the contexts around stop codons were highly consistent with the present results in Figure 9. The frequency of usage of the three stop codons were 45.4%, 38.6% and 16% for UGA, UAA and UAG, respectively, also similar as frequencies found from this study. This also provided additional support for the proper identification of the ORFs in this study.
Polyadenylation signal (PAS) in the catfish transcripts
The polyadenylation signal (PAS) plays an important role in polyadenylation by determining the site for addition of a polyA tail to pre-mRNA. The PAS in mammals has been widely investigated and has been identified as the canonical hexamer AAUAAA which is located 10–30 nucleotide upstream of the cleavage site [35, 36]. Although the AAUAAA signal is considered to be highly conserved, studies have shown that different variants of the PAS are certainly present in the 3’ ends of transcripts, and that the frequency distribution of the most common PAS versus alternate signals is species- and tissue-dependent [36, 37]. The high diversity of gene expression of limited number of genes can be due to the 3’ end polyadenylation of pre-mRNA . The choice of which element is used as PAS might play important roles in gene regulation, possibly by including or excluding down-stream regulatory motifs situated between such alternate PAS, as well as mRNA stability [35–37].
Identification and analysis of the catfish polyadenylation signal (PAS)
Number of transcripts
Found in catfish1
Found in salmon2
Found in human3
Regulatory motifs in UTR regions of catfish full-length transcripts
Regulatory motifs are short nucleotide sequences that regulate gene expression. Most of the regulatory motifs are thought to be embedded in the non-coding part of the genomes. Among non-coding regions, the 5' and 3' UTRs of eukaryotic mRNAs have often been experimentally demonstrated to contain sequence elements crucial for many aspects of gene regulation and expression . The large set of full-length transcripts generated in this study provided the opportunity for the identification of conserved regulatory motifs in the UTR regions of catfish transcripts. All 5’ and 3’ UTRs from catfish full-length transcripts were searched against UTRsite collection by using the pattern match program UTRscan. The UTRsite is a collection of functional sequence patterns located in 5' or 3' UTR sequences whose function and structure have been experimentally determined and published .
Identification of conserved regulatory motifs from catfish untranslated regions (UTRs)
Number of UTRs
Iron Responsive Element
5' and 3’
Internal Ribosome Entry Site
SXL binding site
5' and 3’
Terminal Oligopyrimidine Tract
UNR binding site
5' and 3’
Upstream Open Reading Frame
15-Lipoxygenase Differentiation Control Element
Alcohol dehydrogenase 3'UTR downregulation control element
AU-rich class-2 Element
Bruno 3'UTR responsive element
Cytoplasmic polyadenylation element
Elastin G3A 3'UTR stability motif
Glusose transporter type-1 3'UTR cis-acting element
Insulin 3'UTR stability element
Musashi binding element
Selenocysteine Insertion Sequence - type 1
Selenocysteine Insertion Sequence - type 2
TGE translational regulation element
The transcriptome sequencing enables various structural and functional genomic studies of an organism. Although a lot of Sanger-based EST sequencing projects had been carried out for comprehensive characterization of transcriptomes, expressed sequence data are still limited resources, specifically in non-model species. The next-generation sequencing technologies provide a low cost, labor-saving and rapid means for transcriptome sequencing and characterization. However, the de novo assembly of short reads without a known reference is still difficult . High throughput 454 sequencing which generates longer reads has been widely used in many transcriptome sequencing studies in non-model species previously [48, 49]. Recently, more and more studies have shown the feasibility of transcriptome assembly by using Illumina short reads , especially with the combination of paired-end reads sequencing technology to facilitate the assembly. However, the differential gene expression results in variable coverage in transcriptome sequencing, the choice of a single k-mer value usually used in genome assembly cannot generate an assembly with emphasis on both transcript diversity and contiguity. Performing multiple assemblies with various k-mer lengths and to retain the best part of each one to form the final assembly has been shown effective for de novo transcriptome assembly . Furthermore, as each assembler utilizes different approaches to deal with sequencing errors and paired-end information, the assemblers may differ in their abilities to capture different portions of the transcriptome with accuracy. It is reasonable that merging assemblies from multiple assemblers might yield a combined assembly with higher accuracy. More comprehensive transcripts would be obtained with combinatory use of several de novo assemblers.
In this study, we reported an efficient assembly and annotation of the catfish transcriptome by applying several combined strategies. Firstly, we took advantage of a doubled haploid channel catfish to reduce the complexity of transcriptome. Most importantly, the doubled haploid allows biologically meaningful assembly of transcripts without artificially creating “haplotypes” that do not exist in nature. The sequence assembly was facilitated since there are no allelic variations. Various tissues were collected with the aim to cover a comprehensive transcriptome. The paired-end reads were generated to resolve the assembly problem caused by repetitive regions. Secondly, we generated a final assembly with a combinatory use of three different widely used de novo assemblers. Multiple k-mer method was also used to enable the assembly sensitivity and contiguity. A higher N50 length and average length are considered as a benchmark for better assembly on contiguity. Our results showed that N50 length and average length of contigs varied greatly as a function of k-mer length, and also varied greatly between different assemblers. To get the optimum results, the validation of different assembly programs was conducted by comparing sequence similarity with closely related species. The ABySS generated contigs with higher N50 length and average length indicating its strength in generating assembly with better contiguity, while the Velvet generated sequences with more number and percentage of contigs showed significant similarity with zebrafish proteins. The CLC Genomics Workbench performed as intermediate between ABySS and Velvet according to both contiguity and sensitivity. The final assembly obtained by merging all these three sets of assembly provided a more comprehensive and accurate assembly.
The transcriptome coverage and completeness achieved by these sub-assemblies were then evaluated by matching with annotated known genes. Due to the limited number of catfish genes in the public database, we used the zebrafish RefSeq protein sequences in the NCBI database to conduct this evaluation. There are a total of 27,239 zebrafish annotated protein sequences available which were searched against the assembled catfish contigs for homologous match using TBLASTN. The numbers of detected zebrafish proteins at different levels of sequencing depth are presented in Additional file 8. There were 25,795 zebrafish proteins observed in the catfish RNA-Seq assembly accounting for 94.7% of all annotated proteins in the database. As shown in Figure 10B, the number of observed genes increased almost linearly with lower sequencing depth. However, the number of observed genes started to plateau when the sequencing depth reached 124 million reads. From 124 million reads to 308 million reads, the percent of gene hits increased from 94% to 95%; while the percentage of gene hits with greater than 90% length homology stayed essentially unchanged, suggesting that the sequencing depth was sufficient to provide a good coverage of the transcriptome.
The analysis of sequence conservation comparing with other model fish species helps in transfer of knowledge from model species to catfish for both structural and functional genomic studies. A large number of catfish transcripts showed significant similarity with model fish at protein level as expected, suggesting that their function might also be conserved. Interestingly, a large number of the transcripts did not show significant homology with any other reference sequences, which may be novel and transcribed from catfish-specific genes. The study of these genes will be very important to dissect the species-specific cellular process, ‘catfish-specific’ gene duplication and divergence and study evolutionary processes of speciation and adaptation.
The doubled haploid fish used in this work provided an opportunity to evaluate genome-scale gene duplication in catfish. There were a total of 2,659 unique genes detected as putative duplicated genes. Ultimately, duplicated genes will have different genome coordinates as to their locations. However, the catfish genome sequencing is still in progress. Nonetheless, the evaluation based on the catfish preliminary genome assembly (unpublished) supported that the majority of these genes had duplicate copies. Clearly, the use of doubled haploid catfish was not only important for the transcript assembly, but also important for the initial identification of putative gene duplications in catfish. Additional analysis and validation are needed to demonstrate that the putative duplicated genes are indeed duplicated in the future.
In conclusion, we have demonstrated the use of short-reads sequence data to efficiently and comprehensively characterize a draft transcriptome of an organism without sequenced genome. The strategy of de novo assembly described here can be potentially used for other species. The advantages offered by the use of a homozygote are applicable to most teleost species where doubled haploid can be made. Our study contributed a significant non-redundant set of 370,798 transcripts including 14,240 full-length transcripts in catfish. The detailed analyses of these sequences has provided several important features of catfish transcriptome such as length distribution, sequence patterns around translation initiation and termination codons, conserved regulatory motifs, conserved genes across fishes, and functional annotation. It is anticipated that the results from this study will contribute significantly towards assembly and annotation of the catfish genome. Such resources will likely be important for structural and functional genomics studies in other teleosts and related species as well.
Sample and RNA isolation
To better characterize the catfish transcriptome and improve discovery of variations derived from duplicated genes, a haploid transcriptome is needed. Doubled haploid channel catfish, that have two identical copies of each chromosome, were established using gynogenesis . Tissues were collected from a single doubled haploid female channel catfish adult for this study. The fish were euthanized with tricaine methanesulfonate (MS 222) at 300 mg/l before tissue collection. Samples of 19 tissues including head kidney, fin, pancreas, spleen, gill, brain, trunk kidney, adipose, liver, stomach, gall bladder, ovary, intestine, thymus, skin, eye, swim bladder, muscle, and heart were collected. Tissues were flash-frozen in liquid nitrogen and shipped on dry ice then stored at −80°C until RNA extraction. Tissue samples were ground to a fine powder with mortar and pestle in the presence of liquid nitrogen and thoroughly mixed. A fraction of the tissue samples was used for RNA isolation. Total RNA was isolated using the RNeasy plus Mini Kit (Qiagen, USA) followed by DNase I (Invitrogen, USA) treatment according to the manufacturer's protocol as in previous study . Equal amount of total RNA from each tissue was combined and sent out for commercial sequencing.
Sequencing was conducted commercially in HudsonAlpha Genomic Services Lab (Huntsville, AL, USA) similarly as in previous study . Briefly, 100 ng of total RNA was used to prepare amplified cDNA using Ovation RNA-Seq, a commercially available kit optimized for RNA sequencing (NuGEN Technologies, San Carlos, CA). The produced double-stranded cDNA was subsequently used as the input to the Illumina library preparation protocol starting with the standard end-repair step. The end-repaired DNA with a single ‘A’-base overhang is ligated to the adaptors in a standard ligation reaction using T4 DNA ligase and 2 μM-4 μM final adaptor concentration, depending on the DNA yield following purification after the addition of the ‘A’-base. Following ligation, the samples were purified and subjected to size selection via gel electrophoresis to isolate 350 bp fragments for ligation-mediated PCR (LM-PCR). Twelve cycles of LM-PCR were used to amplify the ligated material in preparation for cluster generation. The prepared cDNA library was sequenced for 100-bp paired-end reads on three flow cell lanes using the Hiseq 2000 platform, but one of the three lanes was partitioned for including three other samples, generating less number of sequences as in other two lanes for catfish. The image analysis, base calling and quality score calibration were processed using the Illumina Pipeline Software v1.4.1 according to the manufacturer's instructions. Reads were exported in FASTQ format and has been deposited at the NCBI Sequence Read Archive (SRA) under accession number SRA047025.
Assembly of expressed short reads
The raw reads were cleaned by trimming of adaptor sequences, ambiguous nucleotides (‘N’ in the end of reads) and low quality sequences with average quality scores less than 20. Reads less than 15 bp after trim were also discarded, the remaining reads were used in subsequent assembly. In order to obtain a comprehensive and reliable assembly, three different assemblers including CLC Genomics Workbench (version 4.2; CLC bio, Aarhus, Denmark), ABySS (version 1.2.6) and Velvet (version 1.1.02) were used for de novo assembly. In brief, the CLC de novo assembly was performed with a choice of an optimized k-mer length based on the input data by default settings. In case of ABySS and Velvet assembly, multiple k-mer approach with every other k-mer values from 21 to 95 for ABySS and from 45 to 95 for Velvet were used so as to maximize assembly contiguity and sensitivity. Subsequently, the multiple k-mer assemblies from ABySS and Velvet were merged by running the first stage of the trans-ABySS analysis pipeline (version 1.2.0) , respectively. Afterwards, these three assemblies were combined to produce the final non-redundant assembly. As anticipated, some identical contigs were generated from more than one assemblies introducing duplicates. The CD-HIT-EST  was used to remove redundancy and retain the longest possible contigs. The short redundant contigs were removed, and the remaining contigs composed the final assembly of non-redundant transcripts.
Functional annotation and identification of putative full-length transcripts
All the non-redundant transcripts from final assembly were searched against NCBI zebrafish RefSeq protein database and Uniprot/Swiss-Prot database using BLASTX with E-value ≤ 1e-10. The ORFs were predicted with the software orfPredictor  by using BLASTX as a guide for the prediction. The BLASTX-aided method detects ORFs by finding the starting methionine and stop codon in catfish transcripts relative to the same features in the most closely related species identified by BLASTX. In the cases where the catfish transcripts did not show high similarity to reference protein, ORF identified by finding the longest stretch of uninterrupted sequence between a start codon and a stop codon in both strand orientations. The completeness of ORFs in each transcript sequence was determined by the BLASTX alignment. We considered a full-length transcript to contain a complete CDS if the ORF revealed a start codon and stop codon in agreement with the match in the database. In the context of this work, the full-length transcript was defined as a consensus sequence containing the complete CDS and at least partial 5’ and 3’ UTR sequence. The start and stop codons of CDSs were used to define the boundary between the CDSs and the 5' and 3' UTRs. If a significant match did not contain a start codon in the 5' or a stop codon in the 3' end of the coding sequence and the pairwise alignment indicated that the transcript lacked some 5' or 3' coding sequence, it was considered to be a transcript with a partial coding sequence.
Detection of catfish putative gene duplicates
Since the doubled haploid channel catfish were used, there should be no allelic variations, and the gene-derived transcripts showing signs of “SNPs” would be assembled from duplicate gene copies. In order to detect the PSVs or MSVs derived from the duplicated genes, the catfish transcripts that had unique protein hits were used as reference, and all the short reads were mapped with the similarity of 99%. The “SNPs” (actually the PSVs or MSVs) were detected as SNP detection in previous work . Briefly, at least four short reads were required for “SNP” detection at each position, the minimum number of variant alleles was required as at least two, and minor allele frequency was required as at least 10%. Putative catfish duplicated genes were assessed by aligning the transcripts with the preliminary catfish whole genome assembly (unpublished data) using BLASTN with the E-value cutoff of 1e-10 and minimum alignment length of 100 bp.
Analysis of UTRs of full-length transcripts
The catfish Kozak consensus sequences were examined in the 5’ UTR analysis. Eight-base sequences spanning from position −4 to position +4 of transcripts were selected. The extracted sequences were used as input into WebLogo  to assess the common Kozak consensus sequence in catfish transcripts.
For 3’ UTR analysis, the TEIRESIAS-based pattern discovery tool  was used to search the most frequently occurring motifs. A search for putative polyadenylation signals (PAS) in full-length transcripts was performed using 35 bp sequence immediate upstream of the polyA tail as input. Pattern discovery tool conditions in the program were: “exact discovery”, L=6 and W=6. To elucidate the sequence patterns that could affect the efficiency of translation termination, the bases around the stop codons (−6 to +12) in the catfish full-length transcripts were extracted and illustrated using WebLogo .
To identify evolutionarily conserved regulatory motifs in catfish transcripts, we searched the UTR database collection (UTRdb)  using the 5’ and 3’ UTRs as queries with the pattern match program UTRscan.
This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2009-35205-05101 and 2010-65205-20356 from the USDA National Institute of Food and Agriculture (NIFA). The authors want to thank Ludmilla Kaltenboeck for lab assistance during RNA preparation. Thanks are also given to Alabama Supercomputer Center for providing the computer capacity for the bioinformatic analysis. Shikai Liu is supported by a scholarship from the China Scholarship Council (CSC) for studying abroad.
- Adamidi C, Wang Y, Gruen D, Mastrobuoni G, You X, Tolle D, Dodt M, Mackowiak SD, Gogol-Doering A, Oenal P, et al: De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics. Genome Res. 2011, 21 (7): 1193-1200. 10.1101/gr.113779.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Bruno VM, Wang Z, Marjani SL, Euskirchen GM, Martin J, Sherlock G, Snyder M: Comprehensive annotation of the transcriptome of the human fungal pathogen Candida albicans using RNA-seq. Genome Res. 2010, 20 (10): 1451-1458. 10.1101/gr.109553.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Furuno M, Kasukawa T, Saito R, Adachi J, Suzuki H, Baldarelli R, Hayashizaki Y, Okazaki Y: CDS annotation in full-length cDNA sequence. Genome Res. 2003, 13 (6B): 1478-1487.PubMed CentralView ArticlePubMedGoogle Scholar
- Denoeud F, Aury JM, Da Silva C, Noel B, Rogier O, Delledonne M, Morgante M, Valle G, Wincker P, Scarpelli C, et al: Annotating genomes with massive-scale RNA sequencing. Genome Biol. 2008, 9 (12): R175-10.1186/gb-2008-9-12-r175.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu S, Zhou Z, Lu J, Sun F, Wang S, Liu H, Jiang Y, Kucuktas H, Kaltenboeck L, Peatman E, et al: Generation of genome-scale gene-associated SNPs in catfish for the construction of a high-density SNP array. BMC Genomics. 2011, 12: 53-10.1186/1471-2164-12-53.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10 (1): 57-63. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
- Sandmann T, Vogg MC, Owlarn S, Boutros M, Bartscherer K: The head-regeneration transcriptome of the planarian Schmidtea mediterranea. Genome Biol. 2011, 12 (8): R76-10.1186/gb-2011-12-8-r76.PubMed CentralView ArticlePubMedGoogle Scholar
- Torres TT, Metta M, Ottenwalder B, Schlotterer C: Gene expression profiling by massively parallel sequencing. Genome Res. 2008, 18 (1): 172-177.PubMed CentralView ArticlePubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng QD, et al: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29 (7): 644-U130. 10.1038/nbt.1883.PubMed CentralView ArticlePubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.PubMed CentralView ArticlePubMedGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19 (6): 1117-1123. 10.1101/gr.089532.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Gibbons JG, Janson EM, Hittinger CT, Johnston M, Abbot P, Rokas A: Benchmarking next-generation transcriptome sequencing for functional and evolutionary genomics. Mol Biol Evol. 2009, 26 (12): 2731-2744. 10.1093/molbev/msp188.View ArticlePubMedGoogle Scholar
- Mizrachi E, Hefer CA, Ranik M, Joubert F, Myburg AA: De novo assembled expressed gene catalog of a fast-growing Eucalyptus tree produced by Illumina mRNA-Seq. BMC Genomics. 2010, 11: 681-10.1186/1471-2164-11-681.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang S, Peatman E, Abernathy J, Waldbieser G, Lindquist E, Richardson P, Lucas S, Wang M, Li P, Thimmapuram J, et al: Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies. Genome Biol. 2010, 11 (1): R8-10.1186/gb-2010-11-1-r8.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen F, Lee Y, Jiang YL, Wang SL, Peatman E, Abernathy J, Liu H, Liu SK, Kucuktas H, Ke CH, et al: Identification and Characterization of Full-Length cDNAs in Channel Catfish (Ictalurus punctatus) and Blue Catfish (Ictalurus furcatus). PLoS One. 2010, 5 (7): e11546-10.1371/journal.pone.0011546.PubMed CentralView ArticlePubMedGoogle Scholar
- Waldbieser GC, Bosworth BG, Quiniou SM: Production of viable homozygous, doubled haploid channel catfish (Ictalurus punctatus). Mar Biotechnol (NY). 2010, 12 (4): 380-385. 10.1007/s10126-009-9221-2.View ArticleGoogle Scholar
- Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat Methods. 2009, 6 (11 Suppl): S6-S12.View ArticlePubMedGoogle Scholar
- Hale MC, McCormick CR, Jackson JR, Dewoody JA: Next-generation pyrosequencing of gonad transcriptomes in the polyploid lake sturgeon (Acipenser fulvescens): the relative merits of normalization and rarefaction in gene discovery. BMC Genomics. 2009, 10: 203-10.1186/1471-2164-10-203.PubMed CentralView ArticlePubMedGoogle Scholar
- Vera JC, Wheat CW, Fescemyer HW, Frilander MJ, Crawford DL, Hanski I, Marden JH: Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol Ecol. 2008, 17 (7): 1636-1647. 10.1111/j.1365-294X.2008.03666.x.View ArticlePubMedGoogle Scholar
- Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ: Complex SNP-related sequence variation in segmental genome duplications. Nat Genet. 2004, 36 (8): 861-866. 10.1038/ng1401.View ArticlePubMedGoogle Scholar
- Gidskehaug L, Kent M, Hayes BJ, Lien S: Genotype calling and mapping of multisite variants using an Atlantic salmon iSelect SNP array. Bioinformatics. 2011, 27 (3): 303-310. 10.1093/bioinformatics/btq673.View ArticlePubMedGoogle Scholar
- Kozak M: An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res. 1987, 15 (20): 8125-8148. 10.1093/nar/15.20.8125.PubMed CentralView ArticlePubMedGoogle Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14 (6): 1188-1190. 10.1101/gr.849004.PubMed CentralView ArticlePubMedGoogle Scholar
- Harhay GP, Sonstegard TS, Keele JW, Heaton MP, Clawson ML, Snelling WM, Wiedmann RT, Van Tassell CP, Smith TPL: Characterization of 954 bovine full-CDS cDNA sequences. BMC Genomics. 2005, 6 (1): 166-10.1186/1471-2164-6-166.PubMed CentralView ArticlePubMedGoogle Scholar
- Andreassen R, Lunner S, Hoyheim B: Characterization of full-length sequenced cDNA inserts (FLIcs) from Atlantic salmon (Salmo salar). BMC Genomics. 2009, 10: 502-10.1186/1471-2164-10-502.PubMed CentralView ArticlePubMedGoogle Scholar
- Frolova LY, Merkulova TI, Kisselev LL: Translation termination in eukaryotes: polypeptide release factor eRF1 is composed of functionally and structurally distinct domains. RNA. 2000, 6 (3): 381-390. 10.1017/S135583820099143X.PubMed CentralView ArticlePubMedGoogle Scholar
- Kisselev LL, Buckingham RH: Translational termination comes of age. Trends Biochem Sci. 2000, 25 (11): 561-566. 10.1016/S0968-0004(00)01669-8.View ArticlePubMedGoogle Scholar
- Tate WP, Poole ES, Horsfield JA, Mannering SA, Brown CM, Moffat JG, Dalphin ME, McCaughan KK, Major LL, Wilson DN: Translational termination efficiency in both bacteria and mammals is regulated by the base following the stop codon. Biochem Cell Biol. 1995, 73 (11–12): 1095-1103.View ArticlePubMedGoogle Scholar
- McCaughan KK, Brown CM, Dalphin ME, Berry MJ, Tate WP: Translational termination efficiency in mammals is influenced by the base following the stop codon. Proc Natl Acad Sci U S A. 1995, 92 (12): 5431-5435. 10.1073/pnas.92.12.5431.PubMed CentralView ArticlePubMedGoogle Scholar
- Bonetti B, Fu L, Moon J, Bedwell DM: The efficiency of translation termination is determined by a synergistic interplay between upstream and downstream sequences in Saccharomyces cerevisiae. J Mol Biol. 1995, 251 (3): 334-345. 10.1006/jmbi.1995.0438.View ArticlePubMedGoogle Scholar
- Cassan M, Rousset JP: UAG readthrough in mammalian cells: effect of upstream and downstream stop codon contexts reveal different signals. BMC Mol Biol. 2001, 2: 3-10.1186/1471-2199-2-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Cavener DR, Ray SC: Eukaryotic start and stop translation sites. Nucleic Acids Res. 1991, 19 (12): 3185-3192. 10.1093/nar/19.12.3185.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Q: Comparative analysis of base biases around the stop codons in six eukaryotes. Biosystems. 2005, 81 (3): 281-289. 10.1016/j.biosystems.2005.05.005.View ArticlePubMedGoogle Scholar
- Mottagui-Tabar S, Tuite MF, Isaksson LA: The influence of 5' codon context on translation termination in Saccharomyces cerevisiae. Eur J Biochem. 1998, 257 (1): 249-254. 10.1046/j.1432-1327.1998.2570249.x.View ArticlePubMedGoogle Scholar
- Beaudoing E, Freier S, Wyatt JR, Claverie JM, Gautheret D: Patterns of variant polyadenylation signal usage in human genes. Genome Res. 2000, 10 (7): 1001-1010. 10.1101/gr.10.7.1001.PubMed CentralView ArticlePubMedGoogle Scholar
- Graber JH, Cantor CR, Mohr SC, Smith TF: In silico detection of control signals: mRNA 3'-end-processing sequences in diverse species. Proc Natl Acad Sci U S A. 1999, 96 (24): 14055-14060. 10.1073/pnas.96.24.14055.PubMed CentralView ArticlePubMedGoogle Scholar
- MacDonald CC, Redondo JL: Reexamining the polyadenylation signal: were we wrong about AAUAAA?. Mol Cell Endocrinol. 2002, 190 (1–2): 1-8.View ArticlePubMedGoogle Scholar
- Tupler R, Perini G, Green MR: Expressing the human genome. Nature. 2001, 409 (6822): 832-833. 10.1038/35057011.View ArticlePubMedGoogle Scholar
- Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics. 1998, 14 (1): 55-67. 10.1093/bioinformatics/14.1.55.View ArticlePubMedGoogle Scholar
- Grillo G, Turi A, Licciulli F, Mignone F, Liuni S, Banfi S, Gennarino VA, Horner DS, Pavesi G, Picardi E, et al: UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res. 2010, 38 (Database issue): D75-D80.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaldy P, Menotti E, Moret R, Kuhn LC: Identification of RNA-binding surfaces in iron regulatory protein-1. EMBO J. 1999, 18 (21): 6073-6083. 10.1093/emboj/18.21.6073.PubMed CentralView ArticlePubMedGoogle Scholar
- Thomson AM, Rogers JT, Leedman PJ: Iron-regulatory proteins, iron-responsive elements and ferritin mRNA translation. Int J Biochem Cell Biol. 1999, 31 (10): 1139-1152. 10.1016/S1357-2725(99)00080-1.View ArticlePubMedGoogle Scholar
- Walczak R, Westhof E, Carbon P, Krol A: A novel RNA structural motif in the selenocysteine insertion element of eukaryotic selenoprotein mRNAs. RNA. 1996, 2 (4): 367-379.PubMed CentralPubMedGoogle Scholar
- Fagegaltier D, Hubert N, Carbon P, Krol A: The selenocysteine insertion sequence binding protein SBP is different from the Y-box protein dbpB. Biochimie. 2000, 82 (2): 117-122. 10.1016/S0300-9084(00)00192-9.View ArticlePubMedGoogle Scholar
- Fischer A, Pallauf J, Gohil K, Weber SU, Packer L, Rimbach G: Effect of selenium and vitamin E deficiency on differential gene expression in rat liver. Biochem Biophys Res Commun. 2001, 285 (2): 470-475. 10.1006/bbrc.2001.5171.View ArticlePubMedGoogle Scholar
- Bjornstedt M, Kumar S, Bjorkhem L, Spyrou G, Holmgren A: Selenium and the thioredoxin and glutaredoxin systems. Biomed Environ Sci. 1997, 10 (2–3): 271-279.PubMedGoogle Scholar
- Schuster SC: Next-generation sequencing transforms today's biology. Nat Methods. 2008, 5 (1): 16-18. 10.1038/nmeth1156.View ArticlePubMedGoogle Scholar
- Meyer E, Aglyamova GV, Wang S, Buchanan-Carter J, Abrego D, Colbourne JK, Willis BL, Matz MV: Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx. BMC Genomics. 2009, 10: 219-10.1186/1471-2164-10-219.PubMed CentralView ArticlePubMedGoogle Scholar
- Hahn DA, Ragland GJ, Shoemaker DD, Denlinger DL: Gene discovery using massively parallel pyrosequencing to develop ESTs for the flesh fly Sarcophaga crassipalpis. BMC Genomics. 2009, 10: 234-10.1186/1471-2164-10-234.PubMed CentralView ArticlePubMedGoogle Scholar
- Surget-Groba Y, Montoya-Burgos JI: Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Res. 2010, 20 (10): 1432-1440. 10.1101/gr.103846.109.PubMed CentralView ArticlePubMedGoogle Scholar
- Robertson G, Schein J, Chiu R, Corbett R, Field M, Jackman SD, Mungall K, Lee S, Okada HM, Qian JQ, et al: De novo assembly and analysis of RNA-seq data. Nat Methods. 2010, 7 (11): 909-912. 10.1038/nmeth.1517.View ArticlePubMedGoogle Scholar
- Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.View ArticlePubMedGoogle Scholar
- Min XJ, Butler G, Storms R, Tsang A: OrfPredictor: predicting protein-coding regions in EST-derived sequences. Nucleic Acids Res. 2005, 33 (Web Server issue): W677-W680.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.