Hybrid assembly of the 454 contigs and public ESTs of common carp
All experimental procedures were conducted in conformity with institutional guidelines for the care and use of laboratory animals in Chinese Academy of Fishery Science, Beijing, China, and conformed to the National Institutes of Health Guide for Care and Use of Laboratory Animals. The animal work was approved by the academic committee in the Centre for Applied Aquatic Genomics (approval ID: 01/2011). Tissue samples were excised from brain, muscle and live of three mature German mirror carps. The tissues were cut into small pieces and immediately pooled into liquid nitrogen. Total RNA in all three tissues was purified using TRIZOL (Invitrogen, Carlsbad, CA, USA) following the manufacturer's instructions. Total RNA quality and quantity were checked by agarose gel electrophoresis containing formaldehyde and using a spectrophotometer. Poly(A) + RNA was purified from total cellular RNA using oligo dT primer. Full-length cDNA was synthesized from 2 μg of poly(A) + RNA using the Clontech SMART cDNA Library Construction Kit (Clontech, Mountain View, CA, USA) according to manufacturer's protocol. The cDNA was amplified using PCR Advantage II Polymerase in 16 thermo cycles with the following thermal profile: 7 s at 95°C, 20 s at 66°C, and 4 mins at 72°C. The amplified cDNA was subsequently purified using the QIAquick PCR Purification Kit (Qiagen, Valencia, CA, USA) to remove fragments of less than 300 bp.
Preparation of the 454 library was performed according to the supplier's instructions (454 Life Sciences, Branford, CT, USA). In summary, approximately 3 μg of amplified cDNA was nebulized and selected for lengths that ranged from 300 to 800 bp. The FLX specific adapters, adapter A (GCCTCCCTCGCGCCATCAG) and adapter B (GCCTTGCCAGCCCGCTCAG), were added to each fragmented cDNA, resulting in adapter A-DNA fragment-adapter B constructs. The DNA fragments were then denatured to generate single-stranded DNA which was then amplified by emulsion PCR for sequencing. The sequencing of the library was performed in one half-plate run on the 454 GS FLX machine. The entire set of reads used for the assembly was submitted to the NCBI Sequence Read Archive under the accession SRA009366 (Submission: SRA009366 by CAFS).
Before assembling the 454 sequencing reads, the adapter sequences were removed. If the ends of one read contained parts of either adapter A (GCCTCCCTCGCGCCATCAG) or adapter B (GCCTTGCCAGCCCGCTCAG), these nucleotides were removed. Next, the Solexa QA package  was used to filter out low-quality bases with the following parameters: -probcutoff 0.05 (the quality cutoff score below which the base-calling error was considered to be too high) and -454 (to trim 454 reads). Because homopolymers of poly(A/T) have low quality scores in 454 sequencing , this process filters out poly(A/T) sequences. The resulting high-quality (HQ) reads were then assembled using the Celera Assembler 6.1  on a single multiprocessor computer. Most of the parameters were set to the default values; for example, overlapper = mer (a seed and extend overlap algorithm), unitigger = bog (a best overlap graph approach for building unitigs), and doOverlapTrimming = 1 (for overlap-based trimming). To avoid the putative mis-assembly of paralogous genes into chimera contigs, we manually collected 20 pairs of published common carp paralogous genes and analyzed their sequence identity using BLASTN [see Additional file 1: Table S1]. We found that the highest identity between these paralogous gene sequences was 97%. Therefore, the Assembler parameters that might influence sequence assembly were set as: utgErrorRate = 0.029 (the error rate above which the unitigger discards overlaps), ovlErrorRate = 0.029 (overlaps above this limit will not be detected), and cnsErrorRate = 0.029 (he error rate below which consensus finds alignments).
To increase transcriptome coverage, we downloaded 34,067 common carp ESTs from the UniGene database  and 989 common carp mRNAs from GenBank. Any vector contamination of the public ESTs/mRNAs was removed using the seqclean program . The 454 contigs/singletons and cleaned ESTs/mRNAs were assembled into contigs using the CAP3 software with default parameters except that we used -p 98 because the highest identity that we found between the carp paralogous genes was 97%.
To avoid the identification of redundant genes as a result of alternative splicing, all-against-all BLASTN searches were performed using CAP3 contigs. If the alignment of two sequences had 100% identity over 100 bps, then they were considered as spliced variants and the longest contigs was selected to represent this gene.
To identify the common carp protein-coding genes, we used BLASTX with an e-value of 1e-5 to run our assembled sequences against an in-house fish protein database of protein sequences from Zebrafish, Fugu, Stickleback, Tetraodon and Medaka that were downloaded from the Ensembl database . The sequences that had no BLASTX hits to the sequences in the fish protein database were searched against the UniProt database  and the NCBI nonredundant protein database  using BLASTX with an e-value of 1e-5. Contigs that had no matches to either of these protein databases were aligned against UTRdb  because these sequences might represent the UTRs of protein-coding genes. Contigs that had hits to UTR sequences were considered to be protein-coding genes. These remaining unmatched contigs were searched against NCBI nonredundant nucleotide collection using BLASTN with an e-value of 0.05.
Because common carp might have some species-specific protein-coding genes, the contigs that had no matches to any of the known proteins or nucleotide sequences were run through the Coding Potential Calculator (CPC)  to predict their coding potential. If CPC predicted that the contig was a coding gene or if it predicted that the contig was non-coding but had an intact open reading frame over 100 amino acids long, then the contig was considered to be a protein-coding gene.
To compare the common carp contigs with the zebrafish genes, we downloaded 45,646 zebrafish protein-coding transcripts from the Ensembl database  and selected the longest transcript to represent each gene so as to avoid redundant GO comparison and ortholog identification.
Estimation of Ks in orthologs and paralogs
To identify putative orthologs between common carp and zebrafish, the sequences from common carp and zebrafish were aligned using the reciprocal BLAST (BLASTN) hit method of Blanc et al.  with an e-value of 1e-20. Two sequences were defined as orthologs if each of them was the best hit of the other and if the sequences were aligned over 300 bp.
The approach used to estimate the Ks of orthologous pairs was adapted from previous studies [15, 16]. Briefly, we aligned the common carp contigs to its orthologous zebrafish protein sequence using BLASTX. The longest alignment was selected for analysis. The corresponding common carp sequences were extracted using their aligned coordinates and translated with the getorf program from the EMBOSS package . The translated carp amino acid sequences were aligned against the orthologous zebrafish protein using Clustalw . The corresponding codon alignments were produced using PAL2NAL . Finally, Ks were estimated using a maximum likelihood method in the CODEML program (runmode-2) of the PAML package .
Paralogs within the common carp and zebrafish sequences were identified by all-against-all BLASTN searches. Because of length variations between the sequences, we modified the procedure of Ding et al.  and defined two sequences as paralogs if the aligned regions were over 70% of the shorter sequences. To estimate the Ks for common carp paralogs, we first aligned the sequence pairs using TBLASTX and then followed the same steps that were used in the Ks estimation pipeline in orthologous pairs. Because zebrafish genes have well-annotated protein sequences, the protein sequences of two paralogous genes were aligned with Clustalw  and the corresponding codon alignments were produced using PAL2NAL . Ks was estimated based on the codon alignments.
Detection of genome speciation and duplication events
The detection of genome speciation and the duplication event was carried out according to the procedure described by Blanc et al.  which were based on the Ks distribution of orthologous and paralogous pairs, respectively. For detection, we used only those alignments that were longer than 30 amino acids and had Ks < 2 to minimize statistical artifacts caused by short alignments and the saturation of Ks .
The Ks frequencies of orthologous pairs were plotted and the Ks peak was used as a potential indicator of the genome speciation event. To detect genome duplication event, the common carp paralogous pairs were organized into gene families according to previous studies [15, 16]. One contig was selected to represent one gene cluster, obviating possible redundant Ks from multiple entries of the same gene. A gene family of n members was assumed to be the results of n - 1 gene duplication events. A hierarchical clustering method described previously , was used to reconstruct the tentative phylogeny of each gene family. Briefly, in one gene family, all contigs were treated first as a separate cluster. Then, the two clusters (A and B) with the smallest Ks values were grouped into a new cluster containing all the contigs. The median Ks, obtained for all possible pairs between a contig in cluster A and a contig in cluster B, was used as the Ks for the new cluster. Every clustering was assumed to represent one gene duplication event. These steps were repeated until all the contigs were grouped into a single cluster. The obtained Ks values that were obtained in each clustering were plotted and the Ks peak was assumed to indicate the genome duplication event.
Biological process and pathway enriched in common carp compared with zebrafish
To study the biological processes enriched in common carp, we annotated each contig by assigning the GO terms associated with the top hit in the fish protein database to the common carp contig. If a contig could not be annotated with a homologous assignment, then Interproscan  was used with the default settings to annotate the contigs with GO terms. To study the pathways involved, we used the KOBAS software  to map common carp contigs to KEGG pathways based on sequence similarity.
We retrieved the GO terms for the zebrafish protein-coding genes from Ensembl database and kept non-redundant GO terms for every zebrafish gene because redundant GO terms might be assigned to the gene spliced variants. We used WEGO  to identify significantly enriched GO terms in the common carp contigs using zebrafish genes as the background. WEGO uses the Pearson Chi-Square test to indicate significant relationships between two input datasets. The biological process terms with p-value ≤ 0.05 were considered to be statistically enriched in common carp contigs. To study the enriched pathways by common carp, we assigned KEGG pathways to the zebrafish genes using KOBAS  and then used the software to compare the proportion of common carp contigs in each pathway against the proportion of zebrafish genes in the same pathway. KOBAS used fisher-exact test to identify significant pathways and then performed an FDR correction to reduce Type-1 errors. KEGG pathways with corrected p values ≤ 0.05 were considered to be statistically enriched in common carp.