Table 2 Summary of assessment of the de novo and integrated transcriptome assemblies

From: A high-quality annotated transcriptome of swine peripheral blood

| Transcriptome assembly | Type of assessment | Purpose | Reference data | Software | Results |
| --- | --- | --- | --- | --- | --- |
| The de novo transcriptome assembly | RNA-seq read representation of the assembly | To determine representation of RNA-seq reads | Normalized RNA-seq reads | Trinity [43, 44] | 66.2% of normalized RNA-seq reads could be mapped back to the de novo assembly |
| The de novo transcriptome assembly | Representation of full-length assembled protein-coding transcripts | To assess the number of full-length PTs | All protein sequences in the Swiss-Prot database | BLASTX [83] | 22,831 (nearly) full-length PTs covered more than 80% of the full length of 10,097 protein sequences in the Swiss-Prot database |
| The de novo transcriptome assembly | Representation of full-length assembled transcripts | To assess the number of full-length PTs | NCBI pig RefSeq mRNAs | DC-megaBLAST [83] | 16,010 (nearly) full-length PTs covered more than 80% of the full length of 9228 pig RefSeq mRNAs |
| The de novo transcriptome assembly | Origin of assembled transcripts | To assess whether the assembled PTs were of porcine genomic origin | Pig reference genomes: SSC10.2 and USMARCv1.0 | GMAP [49] | 94.2% and 99.4% of the PTs could be mapped to SSC10.2 and USMARCv1.0, respectively |
| The de novo transcriptome assembly | Similarity-based assessment | To annotate the assembled PTs with known sequences of significant similarity | Sequences in the NCBI NT and NR databases | DC-megaBLAST and BLASTX [83] | 69.42% and 21.9% of the PTs shared significant similarities to sequences in the NCBI NT and NR databases, respectively |
| The integrated transcriptome assembly | Similarity-based assessment | To annotate the assembled PTs with known sequences of significant similarity | Sequences in the NCBI NT and NR databases | DC-megaBLAST and BLASTX [83] | ~90% and 63% of the PTs shared significant similarities to sequences in the NCBI NT and NR databases, respectively |
| The integrated transcriptome assembly | Correctness of exon-intron splicing junctions of PTs | To validate the exon-intron splicing junctions of PTs | Porcine IsoSeq full-length cDNA read data from the liver, spleen and thymus, SSC10.2 transcripts and NCBI RefSeq mRNAs | Bedtools [48] and custom Perl scripts | 15,303 PTs and 106,483 IsoSeq sequences had the same exon-intron junctions; 63,845 uniquely mapping, spliced PTs shared at least one intron or exon with 390,943 IsoSeq reads; 4155 and 6641 PTs shared the same exon-intron junctions as 4010 SSC10.2 annotated transcripts and 6418 RefSeq mRNA sequences, respectively; 54,402 and 60,180 PTs shared at least one intron or one exon with 18,437 SSC10.2 transcripts and 33,870 RefSeq mRNA sequences, respectively |
| The integrated transcriptome assembly | Completeness of 5′ termini of PTs | To validate the completeness of 5′ termini of PTs | FANTOM5 CAGE data for human and mouse, and porcine macrophage CAGE data | CAGEr [55], Bedtools [48] and custom Perl scripts | Completeness of the 5′ termini of 37,569 PTs was verified by 43,845 proximal promoters determined by CAGE data |
| The integrated transcriptome assembly | Length extension of existing transcripts | To determine to what extent the assembled PTs improved over the existing porcine annotation | SSC10.2 transcripts and NCBI pig RefSeq mRNAs | Bedtools [48] and custom Perl scripts | 12,262 PTs had both longer 5′ and 3′ termini than the maximally overlapping SSC10.2 transcripts; 9764 PTs had only longer 3′ termini; and 14,650 PTs had only longer 5′ termini |
| The integrated transcriptome assembly | Novelty of PTs | To determine novel PTs | SSC10.2 transcripts and NCBI pig RefSeq mRNAs | Bedtools [48] and custom Perl scripts | 41,838 and 35,738 spliced PTs that did not overlap any spliced, uniquely mapping SSC10.2 transcript or any spliced, uniquely mapping pig RefSeq mRNA sequence, respectively, were potential novel transcripts relative to the two reference sets |