A pilot study for channel catfish whole genome sequencing and de novo assembly

Background Recent advances in next-generation sequencing technologies have drastically increased throughput and significantly reduced sequencing costs. However, the average read lengths in next-generation sequencing technologies are short as compared with that of traditional Sanger sequencing. The short sequence reads pose great challenges for de novo sequence assembly. As a pilot project for whole genome sequencing of the catfish genome, here we attempt to determine the proper sequence coverage, the proper software for assembly, and various parameters used for the assembly of a BAC physical map contig spanning approximately a million of base pairs. Results A combination of low sequence coverage of 454 and Illumina sequencing appeared to provide effective assembly as reflected by a high N50 value. Using 454 sequencing alone, a sequencing depth of 18 X was sufficient to obtain the good quality assembly, whereas a 70 X Illumina appeared to be sufficient for a good quality assembly. Additional sequencing coverage after 18 X of 454 or after 70 X of Illumina sequencing does not provide significant improvement of the assembly. Considering the cost of sequencing, a 2 X 454 sequencing, when coupled to 70 X Illumina sequencing, provided an assembly of reasonably good quality. With several software tested, Newbler with a seed length of 16 and ABySS with a K-value of 60 appear to be appropriate for the assembly of 454 reads alone and Illumina paired-end reads alone, respectively. Using both 454 and Illumina paired-end reads, a hybrid assembly strategy using Newbler for initial 454 sequence assembly, Velvet for initial Illumina sequence assembly, followed by a second step assembly using MIRA provided the best assembly of the physical map contig, resulting in 193 contigs with a N50 value of 13,123 bp. Conclusions A hybrid sequencing strategy using low sequencing depth of 454 and high sequencing depth of Illumina provided the good quality assembly with high N50 value and relatively low cost. A combination of Newbler, Velvet, and MIRA can be used to assemble the 454 sequence reads and the Illumina reads effectively. The assembled sequence can serve as a resource for comparative genome analysis. Additional long reads using the third generation sequencing platforms are needed to sequence through repetitive genome regions that should further enhance the sequence assembly.


Background
Channel catfish, Ictalurus punctatus, is the major aquaculture species in the United States, accounting for over 60% of all U.S. aquaculture production. Channel catfish is regarded as one of the best characterized species serving as a model for teleost immune studies [1], and an important model species for study of toxicology and reproductive physiology [2]. The channel catfish genome is estimated to be 1 Gb in size http://www.genomesize. com and is highly AT-rich, with 60.7% A+T [3]. The catfish genome contains one main type of tandem repeats named as Xba elements [4] and several types of dispersed repetitive elements with the mariner/Tc1 DNA transposons as the leading type of dispersed repetitive elements (4-5% of the genome), followed by retrotransposons (3-4% of the genome), Mermaid, Merman, and other SINE elements (~1.5% of the genome), LINES (~1.5% of the genome) and various types of short * Correspondence: liuzhan@auburn.edu 1 The Fish Molecular Genetics and Biotechnology Laboratory, Department of Fisheries and Allied Aquacultures, Program of Cell and Molecular Biosciences, Aquatic Genomics Unit, 203 Swingle Hall, Auburn University, Auburn, AL 36849, USA Full list of author information is available at the end of the article sequence repeats such as microsatellites (~3% of the genome) [3,[5][6][7][8].
At present, a number of genomic tools and resources have been developed in catfish, including bacterial artificial chromosome (BAC) libraries [9,10], BAC-based physical maps [11,12], genetic linkage maps [13][14][15], a large number of ESTs [2,16], over 1700 unique full length cDNAs [17], over 60,000 BAC end sequences [3,7], and a large number of identified molecular markers such as microsatellites and single nucleotide polymorphisms [2,18]. Whole genome sequencing of catfish is underway, and this project was conducted as a pilot study to define the parameters important for the generation of the whole genome sequence assembly.
A major limitation of eukaryotic genome sequencing is the costs involved in sequencing. In recent years, however, advances in sequencing technologies have allowed drastic reduction in sequencing costs. Among many sequencing platforms, the second generation of sequencing technologies such as 454 sequencing, Illumina sequencing, and SOLiD sequencing are the most commonly used sequencing platforms. A common feature of these sequencers is their relatively short sequencing reads, making subsequent sequence assembly a great challenge. Such challenges become even more significant when dealing with large and complex eukaryotic genomes. Teleost genome, known to have gone through a third round of whole genome duplication [19], poses additional challenge when coupled with the short sequencing reads. In consideration of such complexities, Quinn et al. [20] conducted a pilot study with eight pooled BAC clones covering approximately 1 Mb of the Atlantic salmon genome with 454 GS FLX pyrosequencing, and concluded that it was difficult to achieve good levels of genome sequence assembly with 454 sequencing with the tetraploid genome. However, with the diploid European sea bass, Kuhl et al. [21] was able to generate large superscaffolds (13.2-17.5 Mb) with 17-39X coverage of pooled BACs using pyrosequencing. Apparently, genome complexity as well as existing genome resources can influence the outcomes of assembly of sequences generated from next generation sequencing. In this study, based on the existing catfish BACbased physical map, 24 pooled BAC clones covering around 1 Mb catfish genome were sequenced by using both pyrosequencing and Illumina sequencing technologies. In addition with the existing BAC end sequences, we aim to take full advantages of multiple sequencing technologies and existing genetic resources for the upcoming catfish whole genome sequencing.
Several de novo assembly software packages have been developed for assembling genome sequences generated from the next-generation sequencing platforms. Two basic graph based algorithms exist for assemblers: The first one is based on overlap-layout-consensus graphs, such as Newbler [22] and MIRA [23]. In these packages, the algorithm computes all pair-wise overlaps between reads to build an overlap graph. Then, the overlap graph is used to compute a layout of reads and consensus sequence of contigs. The second algorithm is based on de-Bruijn graphs, such as the algorithm used in Velvet [24], ABySS [25], and several other packages in which sequence reads are broken into smaller sequences of DNA, referred to as K-mers, where K denotes the length of these smaller sequences [26]. The de-Bruijn graph is built based on the overlaps of length K-1 between these K-mers rather than the actual reads. The assembly algorithms and their implementations become typically complex with large volumes of sequence data.
The objectives of this pilot study was to determine the framework for proper levels of genome coverage using Illumina sequencing, 454 sequencing, and a combination of 454 and Illumina sequencing when both technical factors and economic factors were considered, to compare the assembly of sequences using different software packages with various parameters, and to develop a cost-effective de novo assembly strategy for the upcoming catfish whole genome sequencing. Here we report sequencing and assembly of twenty-four BAC clones of the largest contig of catfish physical map using 454 sequencing, Illumina sequencing, and a combination of these sequencing technologies, and compare their assembly using several de novo assembly softwares. Based on the assembly, we attempted to identify conserved syntenies among catfish and several fish species whose genome sequences are available using BLAST [27] sequence similarity comparisons.

Results and discussion
Generation of short sequencing reads from pooled catfish BAC clones To assess the de novo assembly strategy for the catfish whole genome sequencing project, twenty-four BAC clones were selected from the largest contig (con-tig0241) of the BAC-based catfish physical map. Individual clones included in the contig were pooled together and sequenced by using both 454 and Illumina sequencing technologies. Based on the physical map, the estimated length of the contig spanned approximately 1 Mb long. As shown in Table 1, a total of 200,901 from 454 single-end reads, 29,067,404 from Illumina 72-bp paired-end reads from a 350-bp insert library were generated. After adaptor and BAC vector trimming, and low quality reads filtering, a total of 41 Mb 454 sequences and 1,245 Mb Illumina paired-end sequences were obtained (Table 1). These represented approximately 40 X and 1,200 X sequence coverage of the contig for 454 and Illumina sequencing, respectively. All the sequences have been deposited to the Catfish Genome Database cBARBEL [ [28], http://www.animalgenome. org/catfish/cbarbel/contig0241].

Comparison of de novo assemblers for 454 reads and Illumina reads
A number of sequence assemblers have been developed recently to cope with sequences generated from the next generation sequencers. To determine the most appropriate assembler and its associated K-values (as appropriate) to use for the de novo assembly of 454 reads and Illumina reads, we compared several existing assemblers: Newbler, Velvet, ABySS, MIRA and a commercial software, CLC Genomics Workbench http:// www.clcbio.com. The Celera Assembler V6.1 was also considered, but our Illumina sequences generated a while ago was not long enough for the assembler. The metrics used to evaluate the assembly included N50, the average contig size, the maximum contig size and the number of contigs.
As highlighted in Table 2, for 454 sequences, Newbler allowed the most effective assembly, compared to Velvet, MIRA, or CLC Genomics Workbench. With Newbler, the best assembly was achieved at seed length (i.e. "K" value of the K-mer matching) of 16 at which 282 of contigs were assembled, with N50 of 8,674 bp, average contig size of 3,486 bp, and maximum contig length of 28,736 bp. For Illumina sequences, ABySS and Velvet provided more effective assemblies than CLC Genomics Workbench ( Table 2). The best assembly using ABySS was achieved at K-value of 60 at which 386 contigs were assembled with N50 of 6,870, average contig size of 2,496 bp, and maximum contig size of 22,594 bp. Velvet at K-value of 29 generated a similar assembly to that of ABySS with N50 of 6,319 bp, average contig size of 1,703 bp, and maximum contig size of 35,339 bp, but with significantly greater number of contigs. Similar results were found in a recent study using ABySS [29].

Assembly strategy for the combination of 454 and Illumina reads
The hybrid assembly is attempted by combining all reads from both 454 and Illumina platform. Using the combination of two types of sequencing reads, we tested Velvet, ABySS, Newbler and CLC for hybrid assembly. Using a single assembler, ABySS with K-value 60 provided a good quality assembly, with N50 value of approximately 9.1 Kb and a maximum contig size of approximately 26 Kb. Newbler v2.6 performed better for the assembly of the same sequence data that generated a N50 value of approximately 12.5 Kb and a maximum contig size of 39 Kb. However, assembly derived from 454 reads plus Illumina reads by using a single assembler did not provide much advantage as compared with the assembly of using only 454 reads. Rather than using a single assembler for the hybrid assembly, it is crucial to develop a suitable strategy by using several assemblers for the assembly of combination of different types of sequencing reads.
The strategy for the hybrid assembly is shown in Figure  1. Initially, 454 sequences were assembled using Newbler and Illumina sequences were assembled using Velvet. Velvet rather than AbySS was used for the assembly of Illumina reads because some of the contigs generated by ABySS contained internal gaps since ABySS has the capability to merge contigs using paired-end information, and these internal gaps may influence the subsequent assembly attempts. After the initial separate assembly, the contigs assembled and the remaining unassembled sequences were subjected to a secondary assembly using MIRA. This additional assembly step with MIRA maximized the complementary advantages of 454 and Illumina reads, resulting in the assembly of 193 contigs. The maximum contig size was 71,434 bp. All 75 BAC end sequences from the same physical map contig generated in our previous studies [3,7] were aligned to the hybrid assembled contigs, suggesting that the sequence coverage is near completion and gaps are small, as expected from the size estimation. The contigs sequences have been deposited to Catfish Genome Database cBARBEL http:// www.animalgenome.org/catfish/cbarbel/contig0241.
Several studies have utilized two-step assembly strategy for the assembly of Illumina and 454 sequences: In an effort to sequence the whole genome of a model fungi Sordaria macrospora, Nowrousian et al. [30] preassembled 454 reads first, and then constructed an assembly by using the combined raw data from both Illumina and 454 reads using Velvet assembler; In a different study, the whole genome sequence reads from a bacteria, Geobacter sulfurreducens, Illumina reads were alone pre-assembled first and the contigs generated from this assembly plus the singletons and the 454 reads were then assembled using Newbler [31].
Depth of sequence coverage required for effective de novo assembly The depth of sequence coverage needed for effective sequence assembly depends on the genome size, the GC content [32] and repetitive elements content of the organism, and the sequence read lengths. In addition, error correction algorithms used in assemblers can be utilized effectively when the depth of coverage is high [33]. The Beijing Genomics Institute sequenced the Giant Panda genome using 75 bp Illumina reads at a coverage of 50 X [34] leaving a small number of gaps that may be filled by Sanger sequencing [33]. In this pilot project, we evaluated the influence of depth of sequence coverage of different types of sequencing data sets on the overall assembly. As shown in Figure 2, the N50 values increased almost linearly until the 454 sequences reached 18 X coverage. After 18 X sequence coverage, additional 454 sequencing generated slightly greater N50 values, but was not cost effective any longer. We understand that our 454 sequences were generated with older chemistry and overall sequence length was relatively short, but the point is that with relatively short 454 sequences, assembly quality is initially positively correlated with sequence coverage. However, when the sequence coverage reached a certain level (here 18 X), additional sequence coverage was no longer effective ( Figure 2). Similarly, with Illumina sequencing, the N50 values increased almost linearly with the sequence coverage up to 70 X sequence coverage ( Figure 3). However, additional Illumina sequencing was ineffective in providing a better sequence assembly, as reflected in the plateau N50 values after 70 X sequence coverage ( Figure 3). Apparently, the assembly that can be achieved may be related to sequence read Figure 1 Hybrid assembly strategy pipeline. All 454 reads were assembled by using Newbler assembler with seed length 16, minimum overlap length 40 and minimum overlap identity 95%. All Illumina reads were assembled by using Velvet assembler with K-value 29. MIRA assembler was then used to generate the hybrid contigs with minimum overlap 40. lengths, e.g., the higher level of genome coverage may not help if the sequence read length is short. An effective sequencing strategy is to use a hybrid sequencing strategy of Illumina plus 454 sequencing. In order to determine if a hybrid approach is cost effective, we evaluated assembly of the contig using various 454 and Illumina sequence coverage combinations. As shown in Figure 4, inclusion of 454 sequencing very effectively increased N50 of the assemblies. Interestingly, a small sequence coverage of 454 sequencing provided almost as effective sequence assembly as a large 454 sequence coverage. Only when the 454 sequences reached a very high sequence coverage (over 30 X coverage), the overall sequence assembly was more effective with much greater N50. N50 values of 454 sequences at 6-30 X provided similar assemblies. Similar but different assembly patterns were observed with various sequence coverages of Illumina sequences: with or without 454 sequences, the N50 values of the assemblies increased almost proportional to Illumina sequence coverages up to 70 X sequence coverage; thereafter, additional Illumina sequences were not effective in increasing the N50 values. Considering the cost for the two sequencing platforms, 454 sequencing is much more expensive than Illumina sequencing, e.g., 2 X catfish whole genome sequencing with 454 sequencing technology will cost $25,000, which is almost the cost to get 140 X catfish whole genome sequences using Illumina sequencing. Taking the cost into consideration, it appears that the assembly of 2 × 454 sequences and 70 X Illumina sequences was the most effective ( Figure 4). The importance of longer reads should be noted. In the catfish genome, and perhaps many fish genomes as well, there are long repetitive elements such as the mariner-like DNA transposons. Long reads helps to get through such long repetitive regions. Working with the genome of Rhodopseudomonas palustris, Schadt et al. [35] demonstrated that long reads generated from PacBio sequencing can help resolve assembly difficulties in a repetitive region spanning 1.5 kb, which otherwise cannot be assembled by using Illumina sequencing. In our case with the catfish genome segment here, the failure of contig assembly was mostly caused by repetitive sequences, and therefore, the adoption of PacBio, or similar long sequencing platforms, should enhance the contig assembly.

Comparison of assembly with Illumina paired-end and single-end reads and scaffolding
Paired-end reads are the reads in pairs such that each pair is of a known distance from each other. Although the cost associated with library construction of singleend reads is much cheaper than that of paired-end reads, because of their linked nature, paired-end reads are important for de novo assembly because they can improve the assembly by spanning repeat regions, anchoring reads to the sequence within repeat regions, and ordering of contigs into larger scaffolds [32]. To investigate the advantage of paired-end reads over single-end reads, we compared the assembly between Illumina paired-end and single-end reads with the same set of sequence data using ABySS with a K-value of 60. Table 3 clearly shows the advantage of paired-end reads over single-end reads for de novo assembly. The N50 value with Illumina paired-end reads is approximately 3.6 Kb, which is 5.6 times larger than the N50 value of single-end reads assembly (652 bp). Similarly, the maximum contig size with Illumina paired-end reads is almost 21 Kb, while the maximum contig size of singleends reads assembly was only 3.5 Kb (Table 3). The difference in the number of assembled contigs is more dramatic, with 716 contigs from paired-end reads assembly but 2,230 contigs with single-end reads assembly. In a recent study, the giant panda whole genome draft was completed by using multiple paired-end reads libraries with different insert size [34], which initiated the eukaryotic genome sequencing and assembly based mostly on the next generation sequencing platform. Clearly, use of multiple libraries in various sizes would increase the effectiveness of sequence assemblies, but that was not tested in this study.
The importance of paired reads is not limited to contig assembly. They are even more important in scaffolding by bringing separate contigs together into larger scaffolds. For instance, when the paired-end reads of Illumina sequences were used, the 193 contigs were brought together into 68 scaffolds. The scaffolding capacity of paired-end reads is under estimated in this study because we used a library with small insert size of 350 bp. The scaffolding capacity of paired-end reads should be much greater if larger insert libraries were used. The BAC end sequences helped in scaffolding, but were less effective than expected because of the small number of the available mate paired BAC end sequences within this contig. Of the 75 BAC end sequences falling within this contig, only 29 were mate paired reads. Along with the paired-end reads of Illumina sequences, these BAC end sequences allowed assembly of 193 contigs into 61 scaffolds with the largest scaffold of 684,936 bp.
In order to have a sense of comparison of the sequence assembly in this pilot study with similar studies or those achieved through whole genome sequencing, we summarized contig size information in Table 4. In terms of N50 and average contig sizes, the assemblies presented here with catfish is similar to those of the pilot study with Atlantic salmon [20]. A direct comparison of the assemblies here with only one million base Figure 4 Comparison of N50 values assembled with various sequencing depth combination of 454 and Illumina reads. The assemblies were achieved by using the two step approach (Newbler for 454 data, Velvet for Illumina data for initial assembly followed by assembly with MIRA, for details, see materials and methods), with various sequencing depth combination of 454 sequence reads and Illumina sequence reads. Red circle indicates the sequence depths at which additional sequencing started to loss power for effective sequence assembly. pairs and those with the whole genome sequences is difficult because the sequence complexities and sequencing strategies were different. However, if the similar results from this pilot study could be extended to the genome scale with similar genome coverage, the results to be achieved with catfish whole genome sequencing would be similar to those achieved with Tetraodon, Fugu, or Atlantic cod.

Repetitive content analysis and annotation of the assembled region
Repetitive content of the assembly were assessed by using RepeatMasker http://www.repeatmasker.org against zebrafish repeat database. A total of 140,019 bp (13.6%) sequences were masked. The content of various types of repetitive elements in the assembled region is presented in Table 5. The most abundant type of repeat was found to be DNA transposons totaling 8.38% of the assembly, which confirmed our previous findings that DNA transposons were the most abundant type of repetitive elements in the catfish genome [3,6,7]. However, the 8.38% of DNA transposons found here was almost twice the amount as previously assessed [6]. Overall, 98.5% of all reads were assembled. That is to say, 1.5% of the generated sequences were not assembled. Apparently, many unassembled sequences were very short reads, but they are probably also repetitive in nature so that prohibit themselves to be assembled. BLAST was used to assess the nature of the unassembled sequences against the nr database with a cutoff E-value of 1e-5. Most hits were repetitive elements such as transposable elements (~18% of unassembled reads), retrotransposons reverse transcriptase-like sequences (~8%), zinc-finger protein-like (~1%), recombinase-like protein (~1%), among others. In addition, the unassembled sequences were also used for BLAST search against themselves to determine if they are highly repetitive sequences themselves. Approximate 79% of the unassembled reads can hit more than 10 other reads, 55% of which hit more than 100 other reads. Taken together, these results indicated that the unassembled reads are located in the repetitive region of the genome, resulting in gaps when conducting de novo assembly. Therefore, long reads are crucially important to go through such repetitive regions of the genome.
After gene annotation, we attempted to identify conserved syntenies in the assembled region with the genomes of four fish model species, medaka, stickleback, Tetraodon and zebrafish. The order of the catfish genes  was oriented by anchoring the scaffolds to the physical map by using BAC end sequences. One syntenic block containing seven genes (designated as J-K-L-M-N-I-H in Figure 5) and a second syntenic block containing four genes (designated as O-P-Q-R in Figure 5) are highly conserved, across most of these fish species ( Figure 5).
Several other smaller syntenic blocks were also somewhat conserved that contained 2-4 genes ( Figure 5). However, the distance spanning these genes is different. All these 18 genes were found in catfish physical map contig of approximately 1 Mb, but these genes were found to be located in a genomic region of

Conclusions
The results of this study demonstrated that for the catfish genome, the hybrid sequencing strategy using both 454 and Illumina is more effective than either alone, especially when sequencing costs were also considered, as demonstrated with the potato genome project [38]. Initially, the effectiveness of sequence assembly was almost linearly correlated with the depth of sequence coverage, but additional sequencing after a certain level (18 × 454 and 70 X Illumina) provided no additional power for effective sequence assembly. The best assembly software for 454 reads appears to be Newbler as assembly statistics resulted in the lowest number of contigs and highest values for the contig size as well as N50 value. Using the same set of assembly criteria, both ABySS and Velvet appear to be suitable for the assembly of Illumina reads. A two-step strategy, initially using Newbler for 454 reads and Velvet for Illumina reads followed by using MIRA, seemed to provide highly effective sequence assembly. Sequencing of this genomic region allowed identification of 18 protein encoding genes. Their genomic arrangements are highly conserved among catfish, zebrafish, medaka, Tetraodon, and stickleback.
It should be noted that this work dealt with only one million base pair region of the catfish genome. Therefore, while this work provide framework for planning of whole genome sequencing of the catfish genome, extension of technical parameters from this work to whole genome sequencing requires additional work. The combination of 454 and Illumina sequencing may not be effective in dealing with fish whose genomes are polyploidy or contain complex and long repeat structures. Even for the catfish genome, sequence assembly was attenuated with repeats, and therefore, an appropriate level of long reads, e.g., those produced by PacBio sequencing [35,39], to pass through the repeat regions may prove to be very useful.

DNA preparation and sequencing
Twenty four clones ensuring the coverage of a minimum tiling path of the contig0241 from the CHORI-212 BAC library [10] were selected for sequencing. The BAC DNA isolation was conducted as previously described [3], with modifications. Briefly, BAC clones were transferred from 384-well plates to a 96-well culture block, which contained 1.5 ml of 2X YT medium with 12.5 μg/ml chloramphenicol and grown at 37°C overnight with shaking at 300 rpm. The block was centrifuged at 2000 × g for 10 min in an Eppendorf 5804R bench top centrifuge to collect bacteria. The culture supernatant was decanted and the block was inverted and tapped gently on paper towels to remove remaining liquid. BAC DNA was isolated using the Perfectprep™ BAC 96 kit (Eppendorf North America, Westbury, NY) according to the manufacturer's specifications. An equal amount of 1 μg DNA per BAC clone was pooled, followed by purification with phenol/chloroform. This DNA was used for 454 and Illumina sequencing in the Genomic Services Lab at Hudso-nAlpha Institute for Biotechnology (Huntsville, AL).

Assembly strategy
All raw 454 reads and Illumina reads were trimmed of BAC vector sequences and low quality reads were filtered before assembly. CLC Genomics Workbench was used to trim raw sequences with quality score limit of 0.01 (Q20). Illumina 72-bp paired-end reads and 454 reads shorter than 15 bp were discarded. Assembly of 454 reads alone was performed by Newbler v. 2.6, Velvet v. 1.0.01, MIRA v. 3.0.0 and CLC Genomics Work bench v. 4.0.0 (CLC Bio, Cambridge, MA). Newbler with seed length from 10 to 16, and Velvet with K-value from 25 to 95 were tested. For Illumina data, assembly of Illumina reads alone was performed by Velvet, ABySSv.1.2.1 and CLC Genomics Workbench. Velvet with K-value from 21 to 61, and ABySS with K-value from 30 to 64 were tested. The hybrid assembly with both 454 and Illumina data included three major steps: First, 454 reads alone was pre-assembled by Newbler with seed length 16, minimum overlap length 40 and minimum overlap identity 95%; second, Illumina reads were pre-assembled alone using Velvet with K-value 29. Although Velvet is able to generate scaffolded contigs, this function was turned off in this step to prevent influence of scaffolded contigs on the subsequent hybrid assembly step; third, pre-assembled 454 and Illumina contigs along with singeltons were assembled using MIRA, with minimum overlap 40.

Comparison of assembly of paired-end reads and singleend reads
In order to compare the effects of both paired-end and single-end reads on de novo assembly, the Illumina paired-end sequencing data set was assembled using ABySS with a K-value 60. The same paired-end reads data set was treated as single-end reads for the comparison, by neglecting the paired-end information.

Gene prediction and comparative analysis
Repetitive elements were masked using RepeatMasker against zebrafish repeat database. Repeat masked sequences were used for gene and syntenic identification by using both gene prediction algorithms and sequence similarity searches. GENSCAN gene model prediction algorithm was used to predict introns and exons. The resulting predictions were searched against NCBI nr database by using BLASTX with an E-value cutoff of 1e-10. The identified gene sequences were used for TBLASTX search against medaka, stickleback, Tetraodon and zebrafish peptide database with an E-value cutoff of 1e-10. The chromosomal positions of the homologous genes were identified from Ensembl database.