Library preparation and sequencing
All animal experiments and housing were conducted in accordance with guidelines approved by the RIKEN Animal Experiments Committee (Approval IDs AH25-05-1 and AH14-05-47). Embryos of the Madagascar ground gecko, Paroedura picta, were provided by Animal Resource Development Unit, RIKEN CLST. Eggs of 9 and 30 dpo were collected from natural ovipositions, respectively, and an egg of four days before the estimated date of oviposition [−4 days post oviposition (dpo)] was extracted from a egg-bearing female. Total RNA was extracted using TRIzol reagent (Life Technologies), and RNA-seq libraries for non-stranded paired ends were prepared by Illumina TruSeq Total RNA Sample Prep Kit according to its standard protocol unless specifically described below. After confirming the reproducibility in a different species (data not shown), we shortened RNA fragmentation time from eight to two (Library B, C, and E) or four (Library F—H) minutes. In DNA purification, we applied x0.7 (Library B, C, and E) or x1.0 (Library B, C, and E) volume of Agencourt AMPure XP, instead of x1.6. These conditions were summarized in Table 1. The libraries were sequenced with HiSeq 1500 (Illumina inc.) operated by HiSeq Control Software v2.0.12.0 using Rapid SBS kit v1 as well as MiSeq using MiSeq Reagent kit v2, in the read lengths designated in Table 1.
Quality control of sequenced reads
Raw nucleotide bases were called with RTA 1.17.21.3 and converted to the Fastq-format files with bcl2fastq 1.8.3 (Illumina Inc.). The short reads were deposited in the DDBJ Short Read Archive (DRA) database under the accession numbers PRJDB4004. Initially, qualities of sequence reads were checked with FastQC v0.10.1 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and no marked abnormalities were observed for all of the samples. Adapter sequences and low quality bases (<Q30) were trimmed from the 3′-ends by trim_galore (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), in which cutadapt is implemented [33], discarding the reads of shorter than 50 bp after adapter and quality trimming. Low quality reads in which proportion of the bases ≥ Q30 was less than 80 % were discarded by the program fastq_quality_filter in FASTX Toolkit 0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/index.html). Fragments retaining paired reads were used for de novo assembly.
De novo assembly
The assemblies for individual libraries (Assembly 1–8 in Table 2) were built by Trinity r20131110 [6] which employs a single k-mer length (k = 25). The two all-in-one assemblies were built using the reads of all the libraries by Trinity (Assembly 9) and SOAPdenovo-trans v1.03 [8] based on multiple k-mer lengths (k = 21, 25, 31, 41, 51, 61, 71, 81, and 91) (Assembly 10). For Trinity, we employed default parameters except the ‘group_pairs_distance’ option setting at 1000 for assembling the fragments with long inserts (Assembly 2, 3, 5, 6, 7, 8, 11, and 12). For SOAPdenovo-trans, we set parameters as follows: max_rd_len = 250, rd_len_cutof = 250, avg_ins = 300, reverse_seq = 0, asm_flags = 3, and map_len = 32. The contigs shorter than 200 bp were discarded from Assembly 10, as the default setting of Trinity. In order to remove redundancy of multiple sequences derived presumably from identical transcripts, contigs assembled by SOAPdenovo-trans with multiple k-mer lengths were merged by cd-hit-est v4.6.1 [34] with the similarity threshold of 99 % and word size at eight nucleotides, followed by clustering using gicl v0.0.1 (http://sourceforge.net/projects/gicl/) [35] with the similarity threshold of 95 % and the overlap length threshold of 50 bp. In silico normalizations were performed using the reads of all the libraries by two different programs in order to produce sequences showing unimodal k-mer coverage distributions with averages around 25: khmer 0.2 with options of “-K 20 –C 20” [27] and normalize_by_kmer_coverage.pl implemented in Trinity with options of “--KMER_SIZE 25 --max_cov 50” [13]. The resultant reads from the in silico normalizations were assembled with Trinity into Assembly 11 and Assembly 12. The assemblies for the individual libraries were merged into Assembly 13, as performed for Assembly 10. For Assembly 10 and Assembly 13, which underwent a post-assembly merge, we reassigned gene-transcript relationships by performing single linkage clustering of the contigs derived from the same locus. In order to remove the contigs with minimal read depths that possibly resulted from so-called ‘leaky’ transcription [31], all assemblies were subjected to further modification as follows. The reads were mapped to the contigs in each assembly with Bowtie2 version 2.2.2 end-to-end mode [36]. Using the mapping results, we excluded from the assemblies the contigs on which fewer than five reads were mapped, based on read counts with eXpress v1.5.1 [37]. The mapping and read count were processed using the wrapper align_and_estimate_abundance.pl in Trinity with default parameters except the ‘max_ins_size’ option set at 1500 for Assembly 2-3, and 5-13: the option corresponds to the ‘maxin’ option of Bowtie2 and is set at 800 as default in the wrapper.
In order to confirm that the discarded contigs contain little substantial information for transcriptome analyses, those contig sequences were subjected to searches for possible protein-coding regions with homologs in annotated vertebrate protein databases. For this purpose, similarity searches were carried out using BLASTX [38] in protein sequences of 13 vertebrates with an E-value cutoff of 1E-10: human (Homo sapiens), dog (Canis familiaris), opossum (Monodelphis domestica), chicken (Gallus gallus), zebra finch (Taeniopygia guttata), Chinese alligator (Alligator sinensis), Chinese softshell turtle (Pelodiscus sinensis), green sea turtle (Chelonia mydas), green anole (Anolis carolinensis), Burmese python (Python molurus bivittatus), Xenopus tropicalis, stickleback (Gasterosteus aculeatus), and elephant shark (Callorhinchus milii). Protein sequences of Chinese alligator and green sea turtle were obtained from GigaDB, and those of Burmese python and elephant shark were obtained from NCBI Genbank and Elephant Shark Genome Project web page (http://esharkgenome.imcb.a-star.edu.sg/), respectively. Protein sequences of the other species were downloaded from Ensembl release 75.
Selection of components of CVG
The core vertebrate genes (CVG) were defined as one-to-one orthologs selected based on eggNOG [26] as follows. Initially, we extracted 463 chordate ortholog groups of eggNOG v4.0 (ChorNOGs) [26] that were composed of one-to-one genes of the 26 ‘core’ vertebrates defined by eggNOG, zebrafish (Danio rerio), and sea lamprey (Petromyzon marinus). From them, 292 groups possessing at least one ortholog of either Ciona intestinalis or C. savignyi were selected. Orthologs of elephant shark, whose genome assembly was released later, were added to the gene sets based on the BLASTP [38] reciprocal best-hit approach, and one-to-one elephant shark orthologs were identified in 270 groups out of the 292. The one-to-one orthology of 233 gene sets were systematically validated by gene trees produced by Ensembl release 70 [25], and examined with manual curation when necessary (also see Fig. 3a). Finally we extracted the one-to-one orthologs of eight species, human, platypus (Ornithorhynchus anatinus), chicken, Xenopus tropicalis, zebrafish (Danio rerio), stickleback, elephant shark and sea lamprey from each of the gene sets and defined the selected gene set as 233 CVGs for CEGMA (Additional file 4).
Assessment of assembly
N50 lengths were computed by TrinityStats.pl embedded in Trinity [13]. Short reads were mapped to contigs by Bowtie2 [36], and mapping rates were obtained from its summary output. For completeness assessment, we applied CEGMA version 2.4 [17] based on the 248 CEGs and 233 CVGs separately. Using the CVG gene set for CEGMA, containing the eight vertebrates, HMMER profiles [39] were generated by HMMER 3.0 based on the multiple amino acid alignments processed by MAFFT v7.158b [40] following format conversion into HMMER 2.X. The HMMER bit score cutoffs of the CVGs were computed according to the criterion proposed by Parra et al. [16, 17]: the cutoff values for the standard CEGMA and the completeness analysis, corresponding to the ‘profiles_cutoff.tbl’ and ‘completeness_cutoff.tbl’ files in the original CEGMA package, respectively. We computed the cutoff values of the completeness analysis as maximum hmmsearch bit scores between the HMMER profiles and proteins from any transcripts of the eight species instead of the proteins from the representative transcripts of the genes. In order to conduct the complete assessment using custom gene sets, we modified the scripts of ‘cegma’ and ‘completeness’ implemented in the CEGMA.
We identified false positives in the CEGMA results for the gecko transcriptome assemblies based on a BLASTP search [38]. Using a gecko protein that was predicted as an ortholog of a CVG/CEG by CEGMA, we searched for its best-hit homolog in the human proteins. If this human best-hit was a paralog to the human protein of the CVG/CEG and if these paralogs were duplicated before the split of mammals and sauropsids, the protein predicted by CEGMA was recognized as a misidentified ortholog to the gene group. In this analysis, the human protein sequences and inferred timings of gene duplications were obtained from Ensembl release 70.
In addition to CEGMA, we conducted completeness assessment using BUSCO v1.1 [28] referring to the CVG, and the CVG dataset for BUSCO was prepared as follows. Using the whole sequence set (29 species) of the CVG, HMMER profiles [39] were generated by HMMER 3.1b2 based on the multiple amino acid alignments processed by MAFFT v7.158b [40]. Protein profiles of the CVG for Augustus were generated with msa2prfl.pl in Augustus 3.1 [41] based on the multiple alignments. The consensus sequence of each CVG was inferred by hmmemit in the HMMER. The cutoff values of sequence lengths were computed according to the criterion described previously [28]. As for cutoff values of HMMER bit scores, we used the values for CEGMA described above, instead of those according to the original criterion by BUSCO.
Mapping rates of the assemblies were computed by SAMtools version 0.1.19 [42] using the mapping files that were made for counting mapped reads to the contigs in the previous subsection. Insert lengths of the fragments were estimated with CollectInsertSizeMetrics in Picard Tools version 1.90 (http://broadinstitute.github.io/picard/). For this purpose, in mapping, we used first 50 nucleotides of each paired read so that paired-reads were not mapped overlapping each other. These reads were mapped to the assemblies using Bowtie2 with the same parameters described in the subsection “De novo assembly” in Methods.
Molecular phylogenetic analysis
Peptide sequences of G6PD and H6PD were collected from the gene set of KOG0563 in CEGMA as well as NCBI Genbank and Ensembl release 70 with the assistance of aLeaves [43]. The homologous peptides were aligned with six different approaches: forward and reverse directions by MAFFT v7.158b [40], Clustal Omega 1.2.0 [44], and T-Coffee 10.00.r1613A [45]. The consensus multiple alignment from the six procedures was made by M-Coffee [46] implemented in the T-Coffee package. Unambiguous alignment sites were selected based on trimAl version 1.4 with the automated1 option following removal of gapped sites. Molecular phylogenetic trees were reconstructed based on RAxML version 7.5.7 [47] assuming the PROTCATWAG model with 1,000 bootstrap replicates ("−f a" option) and PhyloBayes 3.3f assuming the CAT-GTR model [48].
Comments
View archived comments (3)