A pilot study for channel catfish whole genome sequencing and de novo assembly
© Jiang et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2011
Accepted: 22 December 2011
Published: 22 December 2011
Skip to main content
© Jiang et al; licensee BioMed Central Ltd. 2011
Received: 30 June 2011
Accepted: 22 December 2011
Published: 22 December 2011
Recent advances in next-generation sequencing technologies have drastically increased throughput and significantly reduced sequencing costs. However, the average read lengths in next-generation sequencing technologies are short as compared with that of traditional Sanger sequencing. The short sequence reads pose great challenges for de novo sequence assembly. As a pilot project for whole genome sequencing of the catfish genome, here we attempt to determine the proper sequence coverage, the proper software for assembly, and various parameters used for the assembly of a BAC physical map contig spanning approximately a million of base pairs.
A combination of low sequence coverage of 454 and Illumina sequencing appeared to provide effective assembly as reflected by a high N50 value. Using 454 sequencing alone, a sequencing depth of 18 X was sufficient to obtain the good quality assembly, whereas a 70 X Illumina appeared to be sufficient for a good quality assembly. Additional sequencing coverage after 18 X of 454 or after 70 X of Illumina sequencing does not provide significant improvement of the assembly. Considering the cost of sequencing, a 2 X 454 sequencing, when coupled to 70 X Illumina sequencing, provided an assembly of reasonably good quality. With several software tested, Newbler with a seed length of 16 and ABySS with a K-value of 60 appear to be appropriate for the assembly of 454 reads alone and Illumina paired-end reads alone, respectively. Using both 454 and Illumina paired-end reads, a hybrid assembly strategy using Newbler for initial 454 sequence assembly, Velvet for initial Illumina sequence assembly, followed by a second step assembly using MIRA provided the best assembly of the physical map contig, resulting in 193 contigs with a N50 value of 13,123 bp.
A hybrid sequencing strategy using low sequencing depth of 454 and high sequencing depth of Illumina provided the good quality assembly with high N50 value and relatively low cost. A combination of Newbler, Velvet, and MIRA can be used to assemble the 454 sequence reads and the Illumina reads effectively. The assembled sequence can serve as a resource for comparative genome analysis. Additional long reads using the third generation sequencing platforms are needed to sequence through repetitive genome regions that should further enhance the sequence assembly.
Channel catfish, Ictalurus punctatus, is the major aquaculture species in the United States, accounting for over 60% of all U.S. aquaculture production. Channel catfish is regarded as one of the best characterized species serving as a model for teleost immune studies , and an important model species for study of toxicology and reproductive physiology . The channel catfish genome is estimated to be 1 Gb in size http://www.genomesize.com and is highly AT-rich, with 60.7% A+T . The catfish genome contains one main type of tandem repeats named as Xba elements  and several types of dispersed repetitive elements with the mariner/Tc1 DNA transposons as the leading type of dispersed repetitive elements (4-5% of the genome), followed by retrotransposons (3-4% of the genome), Mermaid, Merman, and other SINE elements (~1.5% of the genome), LINES (~1.5% of the genome) and various types of short sequence repeats such as microsatellites (~3% of the genome) [3, 5–8].
At present, a number of genomic tools and resources have been developed in catfish, including bacterial artificial chromosome (BAC) libraries [9, 10], BAC-based physical maps [11, 12], genetic linkage maps [13–15], a large number of ESTs [2, 16], over 1700 unique full length cDNAs , over 60,000 BAC end sequences [3, 7], and a large number of identified molecular markers such as microsatellites and single nucleotide polymorphisms [2, 18]. Whole genome sequencing of catfish is underway, and this project was conducted as a pilot study to define the parameters important for the generation of the whole genome sequence assembly.
A major limitation of eukaryotic genome sequencing is the costs involved in sequencing. In recent years, however, advances in sequencing technologies have allowed drastic reduction in sequencing costs. Among many sequencing platforms, the second generation of sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing are the most commonly used sequencing platforms. A common feature of these sequencers is their relatively short sequencing reads, making subsequent sequence assembly a great challenge. Such challenges become even more significant when dealing with large and complex eukaryotic genomes. Teleost genome, known to have gone through a third round of whole genome duplication , poses additional challenge when coupled with the short sequencing reads. In consideration of such complexities, Quinn et al.  conducted a pilot study with eight pooled BAC clones covering approximately 1 Mb of the Atlantic salmon genome with 454 GS FLX pyrosequencing, and concluded that it was difficult to achieve good levels of genome sequence assembly with 454 sequencing with the tetraploid genome. However, with the diploid European sea bass, Kuhl et al.  was able to generate large superscaffolds (13.2-17.5 Mb) with 17-39X coverage of pooled BACs using pyrosequencing. Apparently, genome complexity as well as existing genome resources can influence the outcomes of assembly of sequences generated from next generation sequencing. In this study, based on the existing catfish BAC-based physical map, 24 pooled BAC clones covering around 1 Mb catfish genome were sequenced by using both pyrosequencing and Illumina sequencing technologies. In addition with the existing BAC end sequences, we aim to take full advantages of multiple sequencing technologies and existing genetic resources for the upcoming catfish whole genome sequencing.
Several de novo assembly software packages have been developed for assembling genome sequences generated from the next-generation sequencing platforms. Two basic graph based algorithms exist for assemblers: The first one is based on overlap-layout-consensus graphs, such as Newbler  and MIRA . In these packages, the algorithm computes all pair-wise overlaps between reads to build an overlap graph. Then, the overlap graph is used to compute a layout of reads and consensus sequence of contigs. The second algorithm is based on de-Bruijn graphs, such as the algorithm used in Velvet , ABySS , and several other packages in which sequence reads are broken into smaller sequences of DNA, referred to as K-mers, where K denotes the length of these smaller sequences . The de-Bruijn graph is built based on the overlaps of length K-1 between these K-mers rather than the actual reads. The assembly algorithms and their implementations become typically complex with large volumes of sequence data.
The objectives of this pilot study was to determine the framework for proper levels of genome coverage using Illumina sequencing, 454 sequencing, and a combination of 454 and Illumina sequencing when both technical factors and economic factors were considered, to compare the assembly of sequences using different software packages with various parameters, and to develop a cost-effective de novo assembly strategy for the upcoming catfish whole genome sequencing. Here we report sequencing and assembly of twenty-four BAC clones of the largest contig of catfish physical map using 454 sequencing, Illumina sequencing, and a combination of these sequencing technologies, and compare their assembly using several de novo assembly softwares. Based on the assembly, we attempted to identify conserved syntenies among catfish and several fish species whose genome sequences are available using BLAST  sequence similarity comparisons.
Sequencing statistics using 454 and Illumina sequencing platforms.
Primary sequence data
Total number of reads
Total number of reads after trimmed
Average trimmed read-length (bp)
Total accumulative length of all trimmed reads (Mb)
A number of sequence assemblers have been developed recently to cope with sequences generated from the next generation sequencers. To determine the most appropriate assembler and its associated K-values (as appropriate) to use for the de novo assembly of 454 reads and Illumina reads, we compared several existing assemblers: Newbler, Velvet, ABySS, MIRA and a commercial software, CLC Genomics Workbench http://www.clcbio.com. The Celera Assembler V6.1 was also considered, but our Illumina sequences generated a while ago was not long enough for the assembler. The metrics used to evaluate the assembly included N50, the average contig size, the maximum contig size and the number of contigs.
Comparison of assembly output with 454 reads and Illumina reads using different assemblers with different K-values.
Maximum contig size (bp)
Average contig size (bp)
Number of contig
CLC Genomics Workbench
CLC Genomics Workbench
The hybrid assembly is attempted by combining all reads from both 454 and Illumina platform. Using the combination of two types of sequencing reads, we tested Velvet, ABySS, Newbler and CLC for hybrid assembly. Using a single assembler, ABySS with K-value 60 provided a good quality assembly, with N50 value of approximately 9.1 Kb and a maximum contig size of approximately 26 Kb. Newbler v2.6 performed better for the assembly of the same sequence data that generated a N50 value of approximately 12.5 Kb and a maximum contig size of 39 Kb. However, assembly derived from 454 reads plus Illumina reads by using a single assembler did not provide much advantage as compared with the assembly of using only 454 reads. Rather than using a single assembler for the hybrid assembly, it is crucial to develop a suitable strategy by using several assemblers for the assembly of combination of different types of sequencing reads.
Several studies have utilized two-step assembly strategy for the assembly of Illumina and 454 sequences: In an effort to sequence the whole genome of a model fungi Sordaria macrospora, Nowrousian et al.  pre-assembled 454 reads first, and then constructed an assembly by using the combined raw data from both Illumina and 454 reads using Velvet assembler; In a different study, the whole genome sequence reads from a bacteria, Geobacter sulfurreducens, Illumina reads were alone pre-assembled first and the contigs generated from this assembly plus the singletons and the 454 reads were then assembled using Newbler .
Comparison of assemblies from Illumina paired-end reads (PE) and single-end reads (SE).
Number of contig
The importance of paired reads is not limited to contig assembly. They are even more important in scaffolding by bringing separate contigs together into larger scaffolds. For instance, when the paired-end reads of Illumina sequences were used, the 193 contigs were brought together into 68 scaffolds. The scaffolding capacity of paired-end reads is under estimated in this study because we used a library with small insert size of 350 bp. The scaffolding capacity of paired-end reads should be much greater if larger insert libraries were used. The BAC end sequences helped in scaffolding, but were less effective than expected because of the small number of the available mate paired BAC end sequences within this contig. Of the 75 BAC end sequences falling within this contig, only 29 were mate paired reads. Along with the paired-end reads of Illumina sequences, these BAC end sequences allowed assembly of 193 contigs into 61 scaffolds with the largest scaffold of 684,936 bp.
Comparison of assembly statistics of fully sequenced fish genomes1 and some pilot studies2.
Number of contigs
Maximum contig/scaffold3 size (bp)
Average contig size (bp)
Atlantic cod 1
Atlantic salmon 2
Summary of repetitive elements in the assembled region of catfish genome.
Low complexity repeats
% of sequences
Overall, 98.5% of all reads were assembled. That is to say, 1.5% of the generated sequences were not assembled. Apparently, many unassembled sequences were very short reads, but they are probably also repetitive in nature so that prohibit themselves to be assembled. BLAST was used to assess the nature of the unassembled sequences against the nr database with a cutoff E-value of 1e-5. Most hits were repetitive elements such as transposable elements (~18% of unassembled reads), retrotransposons reverse transcriptase-like sequences (~8%), zinc-finger protein-like (~1%), recombinase-like protein (~1%), among others. In addition, the unassembled sequences were also used for BLAST search against themselves to determine if they are highly repetitive sequences themselves. Approximate 79% of the unassembled reads can hit more than 10 other reads, 55% of which hit more than 100 other reads. Taken together, these results indicated that the unassembled reads are located in the repetitive region of the genome, resulting in gaps when conducting de novo assembly. Therefore, long reads are crucially important to go through such repetitive regions of the genome.
After gene annotation, we attempted to identify conserved syntenies in the assembled region with the genomes of four fish model species, medaka, stickleback, Tetraodon and zebrafish. The order of the catfish genes was oriented by anchoring the scaffolds to the physical map by using BAC end sequences. One syntenic block containing seven genes (designated as J-K-L-M-N-I-H in Figure 5) and a second syntenic block containing four genes (designated as O-P-Q-R in Figure 5) are highly conserved, across most of these fish species (Figure 5). Several other smaller syntenic blocks were also somewhat conserved that contained 2-4 genes (Figure 5). However, the distance spanning these genes is different. All these 18 genes were found in catfish physical map contig of approximately 1 Mb, but these genes were found to be located in a genomic region of approximately 7 Mb in stickleback, 8 Mb in Tetraodon, over 24 Mb in medaka, and 44 Mb in zebrafish. Nonetheless, gene arrangements in the catfish genome are most similar to that in zebrafish, consistent with their phylogenetic relations .
The results of this study demonstrated that for the catfish genome, the hybrid sequencing strategy using both 454 and Illumina is more effective than either alone, especially when sequencing costs were also considered, as demonstrated with the potato genome project . Initially, the effectiveness of sequence assembly was almost linearly correlated with the depth of sequence coverage, but additional sequencing after a certain level (18 × 454 and 70 X Illumina) provided no additional power for effective sequence assembly. The best assembly software for 454 reads appears to be Newbler as assembly statistics resulted in the lowest number of contigs and highest values for the contig size as well as N50 value. Using the same set of assembly criteria, both ABySS and Velvet appear to be suitable for the assembly of Illumina reads. A two-step strategy, initially using Newbler for 454 reads and Velvet for Illumina reads followed by using MIRA, seemed to provide highly effective sequence assembly. Sequencing of this genomic region allowed identification of 18 protein encoding genes. Their genomic arrangements are highly conserved among catfish, zebrafish, medaka, Tetraodon, and stickleback.
It should be noted that this work dealt with only one million base pair region of the catfish genome. Therefore, while this work provide framework for planning of whole genome sequencing of the catfish genome, extension of technical parameters from this work to whole genome sequencing requires additional work. The combination of 454 and Illumina sequencing may not be effective in dealing with fish whose genomes are polyploidy or contain complex and long repeat structures. Even for the catfish genome, sequence assembly was attenuated with repeats, and therefore, an appropriate level of long reads, e.g., those produced by PacBio sequencing [35, 39], to pass through the repeat regions may prove to be very useful.
Twenty four clones ensuring the coverage of a minimum tiling path of the contig0241 from the CHORI-212 BAC library  were selected for sequencing. The BAC DNA isolation was conducted as previously described , with modifications. Briefly, BAC clones were transferred from 384-well plates to a 96-well culture block, which contained 1.5 ml of 2X YT medium with 12.5 μg/ml chloramphenicol and grown at 37°C overnight with shaking at 300 rpm. The block was centrifuged at 2000 × g for 10 min in an Eppendorf 5804R bench top centrifuge to collect bacteria. The culture supernatant was decanted and the block was inverted and tapped gently on paper towels to remove remaining liquid. BAC DNA was isolated using the Perfectprep™ BAC 96 kit (Eppendorf North America, Westbury, NY) according to the manufacturer's specifications. An equal amount of 1 μg DNA per BAC clone was pooled, followed by purification with phenol/chloroform. This DNA was used for 454 and Illumina sequencing in the Genomic Services Lab at HudsonAlpha Institute for Biotechnology (Huntsville, AL).
All raw 454 reads and Illumina reads were trimmed of BAC vector sequences and low quality reads were filtered before assembly. CLC Genomics Workbench was used to trim raw sequences with quality score limit of 0.01 (Q20). Illumina 72-bp paired-end reads and 454 reads shorter than 15 bp were discarded. Assembly of 454 reads alone was performed by Newbler v. 2.6, Velvet v. 1.0.01, MIRA v. 3.0.0 and CLC Genomics Work bench v. 4.0.0 (CLC Bio, Cambridge, MA). Newbler with seed length from 10 to 16, and Velvet with K-value from 25 to 95 were tested. For Illumina data, assembly of Illumina reads alone was performed by Velvet, ABySSv.1.2.1 and CLC Genomics Workbench. Velvet with K-value from 21 to 61, and ABySS with K-value from 30 to 64 were tested. The hybrid assembly with both 454 and Illumina data included three major steps: First, 454 reads alone was pre-assembled by Newbler with seed length 16, minimum overlap length 40 and minimum overlap identity 95%; second, Illumina reads were pre-assembled alone using Velvet with K-value 29. Although Velvet is able to generate scaffolded contigs, this function was turned off in this step to prevent influence of scaffolded contigs on the subsequent hybrid assembly step; third, pre-assembled 454 and Illumina contigs along with singeltons were assembled using MIRA, with minimum overlap 40.
In order to compare the effects of both paired-end and single-end reads on de novo assembly, the Illumina paired-end sequencing data set was assembled using ABySS with a K-value 60. The same paired-end reads data set was treated as single-end reads for the comparison, by neglecting the paired-end information.
Repetitive elements were masked using RepeatMasker against zebrafish repeat database. Repeat masked sequences were used for gene and syntenic identification by using both gene prediction algorithms and sequence similarity searches. GENSCAN gene model prediction algorithm was used to predict introns and exons. The resulting predictions were searched against NCBI nr database by using BLASTX with an E-value cutoff of 1e-10. The identified gene sequences were used for TBLASTX search against medaka, stickleback, Tetraodon and zebrafish peptide database with an E-value cutoff of 1e-10. The chromosomal positions of the homologous genes were identified from Ensembl database.
This project was supported by Agriculture and Food Research InitiativeCompetitive Grant no. 2010-65205-20356, and Grant no. 2009-35205-05101 from the USDA National Instituteof Food and Agriculture.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.