Common Minke whale genome sequencing and assembly
Four individual common minke whale samples (S30, S34, S35, and S37) were collected from the Whale Research Institute at the National Fisheries Research & Development Institute (NFRDI), Korea. The samples were caught incidentally by fishing net in Hupo, Ganggu, Pohang, and off the east coast of Korea, and were donated to NFRDI for research purposes. The process was guided and investigated by the coast guard, Cetacean Research Institute and Fisheries Cooperation Association. DNA was extracted from the muscle tissue of each common minke whale, and paired-end libraries were constructed with insert sizes of about 270 bp and 480 bp. Then 101 cycle paired-end sequencing was conducted using the Illumina HiSeq 2000 sequencer. The data are listed in Additional file 1: Table S11.
FastQC  was used to check the quality of the raw read data, and sequencing errors were discarded using the error-correction module of Allpaths-LG . Fq2fa was used to merge error-corrected paired-end reads of each sample into one shuffled-form fasta file, with a filter option for filtering N bases in the reads. We assembled error-corrected paired-end reads using IDBA_UD  with the option of pre-correction and kmin = 40. Gaps (N bases) in assembled sequences were filled using Gapcloser  with parameter k value = 31. We carried out a genome assembly for the S30 sample using CLC Assembly with minimum contig lengths = 2000, similarity = 0.85, length fraction = 0.5, insert cost = 3, deletion cost = 3, and mismatch cost = 2.
Before gene prediction of the assembled sequence, sequence patterns including repeats were screened using RepeatMasker  with mammal species and the no-low, and no-is options. Augustus  was run across the repeat masked sequence for gene prediction, and the results were used as the input for a BLASTP  search. The results of the BLASTP search were filtered by length (peptides of more than 100 amino acids) and gene coverage (over 70% without gap).
The overall processes of genome assembly and gene prediction are shown in Additional file 1: Figure S4A. To maximize the gene contents of the genome assembly and construct the representative genome assembly of the four common minke whale genomes, we combined the assembly results from each sample. The process is described in further detail in Additional file 1: Figure S4B. All predicted gene sequences (>100 amino acid sequences) were merged and used as the BLASTP DB. Next, we queried the gene sequences of each sample against the DB and filtered BLASTP results (identity > 95%, 70% < q.cov and s.cov < 130%). Every contig from each sample was classified into four groups (contigs without genes, contigs with sample-specific genes, contigs with only one gene, and contigs with multiple genes). Based on the genome assembly of the S30 sample, which showed the best result from among the four samples (N50 length, N contents, coverage, maximum contig length), we added the contigs to the sample-specific genes from the other samples. We conducted multiple sequence alignments for clustered contigs from each group (contig with only one gene and contig with multiple genes) using ClustalW . Contig extension and bridging were conducted based on the S30 contigs, or consensus sequence (if there was no S30 contig) in the cluster. The contig extension and bridging process is described in Additional file 1: Figure S5. The combined genome sequence was masked, and genes were predicted using the same process as in Additional file 1: Figure S4A.
The short reads were mapped to the combined assembled genome using Bowtie2  with the default option. Alignment of the SAM file and removal of duplicated reads were conducted using Picard (http://picard.sourceforge.net) and SAMtools . Local realignment was conducted using the Genome Analysis Toolkit (GATK)  and SNPs were extracted from the reads alignment file using UnifiedGenotyper, based on multi sample calling. Detected variants (QUAL < 30, QD < 5, FS > 200, MQ0 > 4, MQ0/DP > 0.1) and missing variants (which were found in one sample) were discarded from further analysis. The overall variant-calling process using GATK is shown in Additional file 1: Figure S4B.
Resequencing analysis based on the previously reported draft genome was performed to compare SNP genotypes of each sample that were called upon using reported draft genome and our assembled scaffolds as reference. Using LAST , we redefined the coordinate of our assembled scaffolds according to the reported draft genome. The following parameter options were adopted: ‘-c –m1111110’ and ‘-q3 –e35.’ We then isolated the loci that were exactly matched and calculated SNP genotype concordance for each sample.
The data matrix for the phylogenomic analyses consisted of four newly determined and 18 published mitochondrial genome sequences of baleen whales: 2 eschrichtiids, 14 balaenopterids, 2 neobalaenids, and 4 balaenids. Here, Kogia breviceps (Odontoceti, Kogiidae) was used as an outgroup. The mitochondrial genome sequences were initially aligned using MAFFT v6  and then corrected by visual inspection. The final alignment included 16,435 nucleotides.
The phylogenomic analysis was carried out using the BI, ML, and NJ methods. We chose the best-fit model of nucleotide substitution with the standard ModelTest PAUP block in PAUP 4.0b10  and Akaike’s information criterion (AIC) in ModelTest 3.7; GTR + I + G was selected as the best evolutionary model. The uniformed BI analysis was implemented using MrBayes 3.2.1 with the GTR + I + G model. For the partitioned model approach of BI analysis, mitochondrial genomes were divided into 18 partitions and the following models were applied: GTR + I + G for the 12S rRNA, 16S rRNA, COX1, and 22 tRNAs regions; SYM + I for the 2 STS region; TVM + I + G for the NADH1, COX3, NADH4, and Control regions; HKY + G for the NADH2, ATPase8, and NADH4L regions; TVM + G for the COX2 region; HKY + I + G for the ATPase6 and NADH6 regions; TVM + I for the NADH3 region; K81uf + I + G for the NADH5 region; TrN + G for the Cytb region (Additional file 1: Table S12). Each analysis consisted of 20,000,000 generations with a burn-in of 20,000 and a sample frequency of 500. Bayesian posterior probability (BPP) values are shown on internal nodes to indicate the robustness of the phylogenomic analysis.
ML analysis was performed using PHYML 3.0 with a BIONJ starting tree under the GTR model and nonparametric bootstrap analysis was conducted with 500 pseudoreplicates. The proportion of invariable sites and gamma shape parameter were estimated from the dataset and the number of substitution rate categories was set to 6. The tree topology optimization was chosen.
NJ analysis  was conducted using the PHYLIP package 3.69 , based on Kimura’s  two-parameter distance. Ts/Tv ratios (10.10) were estimated from the data set using PUZZLE 4.0.2  and then were used as inputs for the SEQBOOT, DNADIST, NEIGHBOUR, and CONSENS programs of the PHYLIP package. A bootstrap test (with 1,000 pseudoreplicates)  was performed to determine the statistical support for each node of the NJ tree.
Co-estimation of evolutionary rates, TMRCA
To co-estimate the evolutionary rates and times to the most recent common ancestor (TMRCA), Bayesian coalescent approaches were implemented in BEAST 1.6.2 . Crown Cetacea was calibrated based on the oldest mysticete fossil Llanocetus [50,51] (34 Mya, 35 mean, 1.0 SD. The age of the basal of the crown Mysticeti was calibrated based on an unnamed balaenid from New Zealand  (28 Mya, 29.0 mean, 1.0 SD). Kogia breviceps (Odontoceti; Kogiidae) was used as an outgroup. The analysis was conducted under the GTR + I + G model, nst = 6, and rates = gamma derived from AIC in ModelTest 3.7 . We employed relaxed uncorrelated lognormal for molecular clock model and Yule process for tree topology prior. The data sets were each run for 20,000,000 generations to ensure convergence of all parameters (ESSs > 200) with discarded burn-in of 10%. The resulting convergence was analyzed using Tracer 1.5 (http://beast.bio.ed.ac.uk/Tracer) and the statistical uncertainties were summarized in the 95% highest probability density (HPD) intervals. Trees were summarized as maximum clade credibility trees using the TreeAnnotator program, which forms part of the BEAST package, and were visualized using FigTree 1.4.0 (http://tree.bio.ed.ac.uk/software/figtree/).
dN/dS ratio of orthologous genes
We used protein and reference cDNA sequences of human, mouse, dog, pig, cow, horse, and dolphin from Ensembl  and the common minke whale from our results. We used Hcluster_sg  to generate clusters based on BLAST 2.2.27 +  results. Then we generated multiple alignments for input into Mestortho  using PRANK31 . Mestortho was used to identify the 1:1 orthologs of all eight species. As a result, 5,539 1:1 orthologs were identified and used to estimate the synonynous and nonsynonymous substitution rates. We obtained phylogenetic tree information from Timetree (www.timetree.org). Orthologous gene sets were aligned using PRANK31 , and poorly aligned sites were eliminated using Gblocks . We used the ML method from codeml of PAML 4  to estimate the dN (the rate of non-synonymous substitution), dS (the rate of synonymous substitution), and ω (the ratio of non-synonymous substitutions to the rate of synonymous substitutions) with F3X4 codon frequencies under the branch (model = 2, NSsites = 0) and basic models (model = 0, NSsites = 0). Results from the branch model were filtered with dS > 3 or dN/dS > 5. A log likelihood ratio test (LRT) was performed to compare these models; FDR adjustments for multiple testing corrections were applied , and a significance level of P < 0.05 was used.
We used the protein and reference cDNA sequences of common minke whales and dolphins. We used the RBH method from BLAST 2.2.27 + . As a result, 11,698 1:1 orthologs for the two species were identified. We generated multiple alignments using PRANK31  and eliminated poorly aligned sites using Gblocks  before performing a standard McDonald-Kreitman test .