A platform independent RNA-Seq protocol for the detection of transcriptome complexity
- Claudia Calabrese†1,
- Marina Mangiulli†2,
- Caterina Manzari2,
- Anna Maria Paluscio3,
- Mariano Francesco Caratozzolo3,
- Flaviana Marzano3,
- Ivana Kurelac1,
- Anna Maria D’Erchia2,
- Domenica D’Elia3,
- Flavio Licciulli3,
- Sabino Liuni3,
- Ernesto Picardi2,
- Marcella Attimonelli2,
- Giuseppe Gasparre1,
- Anna Maria Porcelli4,
- Graziano Pesole2, 5,
- Elisabetta Sbisà3 and
- Apollonia Tullo3Email author
© Calabrese et al.; licensee BioMed Central Ltd. 2013
Received: 27 June 2013
Accepted: 26 November 2013
Published: 5 December 2013
Recent studies have demonstrated an unexpected complexity of transcription in eukaryotes. The majority of the genome is transcribed and only a little fraction of these transcripts is annotated as protein coding genes and their splice variants. Indeed, most transcripts are the result of antisense, overlapping and non-coding RNA expression. In this frame, one of the key aims of high throughput transcriptome sequencing is the detection of all RNA species present in the cell and the first crucial step for RNA-seq users is represented by the choice of the strategy for cDNA library construction. The protocols developed so far provide the utilization of the entire library for a single sequencing run with a specific platform.
We set up a unique protocol to generate and amplify a strand-specific cDNA library representative of all RNA species that may be implemented with all major platforms currently available on the market (Roche 454, Illumina, ABI/SOLiD). Our method is reproducible, fast, easy-to-perform and even allows to start from low input total RNA. Furthermore, we provide a suitable bioinformatics tool for the analysis of the sequences produced following this protocol.
We tested the efficiency of our strategy, showing that our method is platform-independent, thus allowing the simultaneous analysis of the same sample with different NGS technologies, and providing an accurate quantitative and qualitative portrait of complex whole transcriptomes.
Next Generation Sequencing (NGS) technology, initially used almost exclusively to sequence and/or re-sequence whole genomes and to determine single nucleotide polymorphisms (SNP) , is nowadays increasingly exploited in an effective manner for many other applications, including identification of biological markers and pathogenic agents in biodiversity studies and metagenomics, in biomedical, agro-food, environmental and industrial sectors [2–4].
Interestingly, quite recently the ENCODE Project (Encyclopedia of DNA Elements) has revealed that the vast majority of human genome is pervasively transcribed and almost the totality of the DNA is associated to some biochemical RNA-associated function . Indeed, the transcriptome captures a level of complexity that the simple genome sequence does not . In this frame, the RNA-seq technology is an effective method to map functional elements across the human genome and a suitable tool for the analysis of genome-wide differential RNA expression, underlining the wide range of physical, epigenetics, biochemical, and developmental differences observed among various cells and tissues, that may play a role in determining the balance between health and disease [7, 8].
Among next generation sequencing platforms, the latest series of Roche 454 GS Sequencer, the GS FLX Titanium FLX+, allows to obtain over a million reads in each run, with a length up to 700 bases . Such long reads provide connectivity information among splicing sites, in addition to enabling accurate mapping and relative quantification of mRNAs, hence they are particularly suitable for the characterization of full-length splicing variants that may be differently expressed in physiopathological conditions [7, 10]. On the other hand, the higher throughput, although with shorter reads, of the Illumina and ABI/SOLiD platforms makes them particularly suitable to better quantify the levels of transcripts and for small RNAs sequencing [11–13].
Irrespectively of the NGS platform used, the first step required for high throughput transcriptome sequencing is the construction of a cDNA library representative of all RNA species.
Several protocols have been developed so far, but their applicability remains tightly linked to the specific platform used. These protocols for cDNA library construction generally involve the following steps: rRNA removal or poly(A) RNA selection, RNA retrotranscription, synthesis of the second strand and, in some cases, amplification of the double-stranded cDNA species. However, this strategy has the drawback of producing a population of fragments smaller than 2 Kb; as a result, the fragmentation process used in the standard 454 GS FLX (Roche) shotgun procedure (nebulization), may not give optimal results since it is more suitable for high molecular weight molecules.
Alternative protocols account for RNA fragmentation followed by retrotranscription. The nebulization step is therefore omitted, albeit the crucial RNA fragmentation step may easily result in RNA degradation.
One recurrent problem in the cDNA library construction is represented by the starting amount of RNA, which is often exiguous. In such conditions, low-expressed transcripts may be scarcely represented despite their considerable biological significance.
Another pivotal element for cDNA libraries preparation, often overlooked in current protocols, is the preservation of the strand specificity information, needed to distinguish either sense and antisense transcripts, or the correct strand of synthesis of other non-coding RNAs, useful to define more precisely the exact boundaries of adjacent genes transcribed from opposite strands, and to determine the levels of expression of coding and non-coding RNA molecules transcribed from overlapping sequences. Indeed, strand-specificity information is useful to correctly map reads that do not contain polyA tails or splice junctions and represent a challenge in the annotation step of RNA-seq reads.
In this work we describe a new unique protocol (Patent pending RM2010A000293-PCT/IB2011/052369) to generate and amplify a representative and strand-specific cDNA library for RNA-Seq applications that may be implemented with all the major platforms currently available (Roche 454, Illumina, Solid). This method is reproducible, fast, easy-to-perform and allows to start from low input total RNA. Moreover, we also provide a suitable bioinformatics pipeline for the analysis of the sequences thereby produced.
Robustness of the strategy here presented was tested by analyzing the transcriptome of two well characterized xenograft tumor masses (OST), sharing the same nuclear genome, but carrying a different status of mitochondrial heteroplasmy  and derived from the injection in nude mice of an osteosarcoma cell line. The transcriptome sequencing of the two samples was performed by using two of the main sequencing platforms available on the market: the 454 Roche platform and MiSeq Illumina platform .
To illustrate the validity of our method, we prepared two cDNA libraries starting from two cybrid-derived xenograft tumor masses, which shared the same nuclear genome and were characterized by a different status of mitochondrial heteroplasmy (OST-83 and OST-78 samples, see Methods for details) .
500 ng of total RNA of OST-83 and OST-78 samples were initially depleted of ribosomal RNA component. The efficiency of removal of rRNAs from the total starting RNA in both samples was evaluated with the Agilent 2100 bioanalyzer using the specific chip for RNA (Figure 1B). We verified that, starting from 400–500 ng of total RNA, it is possible to obtain around 30–40 ng of RNA depleted of the ribosomal fraction, which is sufficient for the subsequent generation and amplification of a representative and strand-specific cDNA library. Starting from an amount of cleaned RNA <30 ng increased the risk that transcripts with a low number of copies might be only partially represented or absent. This in turn may not ensure high reproducibility in the cDNA amplification (data not shown).
Even poor quality RNA samples with low RNA Integrity Number (RIN) (8 < RIN > 4) could be used for sequencing (data not shown).
5′-phosphorylated Tag-random-octamer custom designed primers (5′-TCGCGATCGTCGNNNNNNNN-3′) were used for the retrotranscription. The trinucleotide TCG, upstream and downstream of the rare-cutter enzyme Pvul restriction site sequence (underlined) ensured that strand information would be preserved. Such trinucleotide was selected because it had the rarest occurrence in the RefSeq Homo Sapiens collection, representative of the human transcriptome. Analogously, the 12nt long sequence, composed by the Pvul site flanked at both sides by the trinucleotide 5′-TCG-3′ also resulted to be considerably rare. The need for a tag sequence at the 5′ of the random primer is related not only to the retrieval of the strand information, but also to the protocol-specific step of ligation of single-stranded cDNA molecules into ~ 20 kb long concatenamers (Figure 1A-c). Moreover, the presence of the random nucleotides at the 3′ end of the Tag-random-octamers primers ensured retrotranscription of all RNA species, including those lacking the poly(A) tail, such as replication-dependent histones mRNAs .
High molecular weight DNA species for both samples (OST-83 and OST-78) were hence efficiently obtained, quantified with Quant-iT PicoGreen dsDNA Reagent kit (Invitrogen) and visualized by agarose electrophoresis (Figure 1C). An average of 30 μg of ds cDNA were obtained for each library and were suitable for the subsequent fragmentation process required by both 454 Roche protocol or by Illumina Nextera XT protocol.
454 and Miseq RNA-seq yield
Yield of OST samples sequencing and quantification of tagged reads
sample (sequencing technology)
average read length (nt)
n.of reads with ≥ 1 tag
n. of reads after the tags elimination ≥ 40nt
n.of reads after the tags elimination ≥ 100nt
n.of reads after duplicates elimination ≥ 40nt
OST-78  (MiSeq)
OST-78  (MiSeq)
OST-83  (MiSeq)
OST-83  (MiSeq)
Tags identification and Read strand orientation
Comparison between the number of simulated tagged reads expected and observed
N.of reads with 1 Tag
N.of reads with 2 Tags
N.of reads with 3 Tags
N.of reads with 4 Tags
N.of reads with 5 Tag
N. of total tags
Tag Find was hence applied on 454 and MiSeq datasets of reads, choosing a read length cut-off to 40 nt. All the four type of tags were identified, namely the fw5-3 and rv5-3, referred to 5′→3′ directed sequences derived respectively from the 2nd and from the 1st cDNA strand, respectively (Figure 1A-a,c) as well as the fw3-5 and rv3-5 tag, 3′→5′ directed. With both the two sequencing platforms Tag Find recognized the same percentage (11%) of reads carrying at least one tag (Table 1 and Additional file 1: Table S1). This percentage will undoubtedly increase with longer reads (up to 700 bp) that will be generated by the latest series of Roche 454 GS Sequencer, the GS FLX Titanium FLX+. The read length filter applied did not significantly reduce the initial yield of reads obtained (Table 1), as the majority of the first cDNA strands produced were at least 100 nucleotides long, thus suggesting that the retro-transcriptase produced long cDNAs. The majority of tagged reads carried only one tag per sequence with a maximum of 6 tags per 454 read and 10 tags per MiSeq read (Additional file 1: Table S1). Tags identical, or at least with one mismatch with respect to the correct tag sequence, were the most represented in both the read sets (Additional file 1: Figure S2). In general, 5′→3′ oriented tags were the most represented, whereas 0.1% of 454 tagged-reads and 0.6% of tagged-MiSeq reads was composed by chimeric artefacts identified by the presence of at least one 3′→5′ tag (Additional file 1: Figure S2). Boxplot distributions of quality scores (QS) for 454 sequencing showed that the median QS value per base was above 20, up to the first 350 nucleotides in each read. Therefore they were trimmed up to an optimal length sufficient to increase the median QS value per base to at least 25.
Spectrum of different RNA species identified within the OST transcriptomes
genes with poly(A) mRNAs
histone genes with non-poly(A) mRNAs
n.of genic loci with at least 1 read mapped in antisense
On average > 95% of 454 and Illumina tag-removed reads could be mapped on the human genome but only 71% and 87% of all 454 and Illumina reads, respectively, passed our strict quality filters (see Methods). Coding and non-coding genic loci were identified by at least one read covering at least 70nt of their genic locus. A mild increase in total number of genes detected was observed with MiSeq data: 13,389 OST-78 genes and 13,609 OST-83 genes identified with pyrosequencing versus 13,931 OST-78 genes and 14,365 OST-83 genes recognized with Illumina technology (Table 3 and Additional file 2b-c). Overall, this result was perfectly in line with those reported in the literature on the average number of genes expressed in most human and mouse tissues, which varies from 11,000 to 13,000 . Interestingly, the average number of genes detected by 454 and Illumina RNA-Seq data represented 87% and 91% of the full set of expressed genes found by Klevebring et al. in U2 osteosarcoma cell line based on the Ensembl gene annotation (21,146 genes) .
Given that our protocol is able to generate reads with strand-specific orientation we identified a number of antisense reads for each gene and their exonic or intronic localization (Additional file 2f). On average, 1.4% of all reads mapping in genic loci in each dataset was found in antisense orientation (see Table 3), mostly of them mapping at level of exonic regions (83%).
\We also calculated the number of all coding and non-coding RNA species detected, by counting the number of RNA-specific NCBI RefSeq accession IDs associated to all RNA isoforms (Table 3). Furthermore, to test the ability of our strategy to identify also non-polyadenylated mRNA species, we downloaded the whole set of 87 human genes encoding for replication-dependent histones (http://www.genenames.org/genefamilies/histones), known to be transcribed in non-poly(A) mRNA species. This set of genes was used to query OST-78 and OST-83 genes in order to retrieve all histone genes covered by at least one read. Intriguingly, with both the two technologies we were able to detect almost the half of the whole set of histone genes with non-poly(A) mRNAs in each OST sample (Table 3 and Additional file 2d-e).
Differential expression and functional categories enrichment results
The key aims of the RNA-seq is to catalogue all the RNA species present in a cell, including mRNA and non-coding RNAs, to identify splicing isoforms and to quantify the changing expression level of each transcript under different pathophysiological conditions. Ideally, a method for cDNA library construction should be easy to carry out, quick, high-yield and suitable to obtain a library representative of all RNA species.
All cDNA library preparation protocols currently available provide the utilization of the entire library for a single sequencing run on a single dedicated platform. We here present the development of a unique method that allows the generation of a representative and high-yield cDNA library that can be used for multiple RNA-Seq application with all NGS platforms available on the market, starting from exiguous amounts of total RNA (450-500 ng), not always provided by all protocols. Indeed, the standard Roche cDNA library preparation protocol recommends to start with 200 ng of RNA depleted of rRNA. This implies that it is necessary to have at least 8-10 μg of total RNA, not always available in particular with human samples. Moreover our method allowed to obtain reads in average longer than those obtained with the Roche standard protocol (Additional file 1: Figure S3).
In addition, thanks to the high yields of amplified cDNA library obtained with our protocol (~30-40 μg), it is also possible to reuse the same library also for further subsequent validation experiments as we have here shown. This is of paramount importance because, especially considering the high complexity of the transcriptome disclosed by the ENCODE Project , the use of different methodological approaches enabling the qualitative and the quantitative assessment of the transcriptome complexity is presently highly needed. We do also believe that the combined use of Illumina and Roche 454 sequencing platforms may be very effective for transcriptome profiling. In fact, the latter yields longer reads and may therefore provide critical information to assess expressed full-length alternative transcripts, whose quantification may be instead better obtained by using Illumina data. In this context, only a universal protocol such as the one we propose may allow to easily carry out a dual sequencing strategy with a single library efficiently generated even from compromised or partially degraded samples.
An added value is that the Tag-random-octamer primer used in the retro-transcription step, which allows to retain the strand information, can help to optimize the annotation of unspliced reads, and to detect all RNA species, including non-polyadenylated RNAs, which appear to constitute a significant fraction (>25%) of long transcripts present in human cells [21, 22]. Even more importantly, the Tag-random-octamer primers can also be used for metatranscriptomic approaches in order to identify bacterial or viral RNA species in human cells in some physiological or pathological conditions, such as, for example, in the case of multiple sclerosis .
Furthermore, the faithful amplification of the library is ensured by singular properties of the Phi29 polymerase: i) strand displacement activity, which enables effective and uniform amplification also in the presence of secondary structures and sequences hardly amplifiable; ii) proof-reading activity among the highest currently available in the realm of high-fidelity Taq Polymerases; iii) high processivity without disassociation from the template, which allows the generation of very long synthesized products, ideal for the nebulization step; iv) high yields of amplified product starting from small amounts of template [15–17]. In addition, the very low percentage of chimeric artefacts observed within the in silico analysis and maybe due to the multidisplacement Phi29 amplification, confirms the high proof-reading activity of this enzyme.
The bioinformatics analysis with Tag Find allows the user to quantify tagged reads and record all information about their cDNA strand of origin and tags position across the concatenamer. Moreover, the information provided by tags may be used to quantify reads mapped in a discordant orientation with the coding strand, thus belonging to an antisense RNA molecule.
The efficiency of our strategy was tested by using two of the main NGS platforms available: the 454 Roche platform and MiSeq Illumina platform. We sequenced the transcriptome of the two xenograft tumor masses, previously well characterized  and derived from the injection in nude mice of an osteosarcoma cell line with a nearly-homoplasmic mitochondrial Complex I disruptive mutation (m.3571insC) in the MT-ND1 gene. We were able to obtain two representative cDNA libraries of the two tumor transcriptomes, as we identified both poly(A) and non-poly(A) mRNAs, as well as non coding RNAs and even a smaller fraction of antisense RNAs, demonstrating that our method is highly effective in detecting the complexity of gene expression. It is noteworthy that using the cDNA libraries prepared with our protocol, we obtained excellent sequencing results with both the two sequencing platforms. Indeed, the number of detected genes was consistent with the ones estimated in most human and mouse tissues  and in the specific U2 human osteosarcoma cell line . To further substantiate these findings, we tried to confirm our results by exploiting the biological significance of the transcriptional profile observed in xenograft samples, therefore we searched only for differentially expressed genes encoding for Respiratory Chain subunits. We found that the expression profile of RC genes well correlated with and may explain the bioenergetic features that we previously investigated in osteosarcoma cells with different loads of m.3571insC MT-ND1 mutation [14, 24, 25], that influence mitochondrial Complex I assembly. It was therefore not surprising to observe such a distinct pattern of over-expression between nuclear and mitochondrial encoded genes, considering also that the lack of an accompanying up-regulation of nuclear-encoded respiratory genes in the Complex I-depleted OST-83 sample may account for a ‘black-out’ in the nucleus-mitochondrion crosstalk.
The platform-independent method we proposed for the generation of a representative cDNA library in RNA-seq experiments, combined with a feasible downstream bioinformatics pipeline, provides a powerful tool for whole transcriptome interrogation using single or multiple NGS platforms from which an accurate quantitative and qualitative portrait of complex transcriptome can be traced.
Two xenograft tumor masses (OST-83 and OST-78), previously characterized in Gasparre et al., 2011 , were obtained from the injection of an human osteosarcoma cell line (OSC-83) with a nearly-homoplasmic mitochondrial Complex I disruptive mutation (m.3571insC) in the MT-ND1 gene, in nude mice. The OST-83 and OST-78 samples carried respectively 83% and 78% of heteroplasmy of the MT-ND1 frameshift mutation. OST-83 tumor, deprived of Complex I (CI), showed a low-proliferative phenotype, whereas OST-78 tumors, carrying a mutation load below the heteroplasmy threshold for the previously described oncojanus anti-tumorigenic effect , showed a proliferative, aggressive phenotype and a partial CI assembly.
RNA isolation and removal of ribosomal RNA (rRNA)
RNeasy Plus Mini kit (Qiagen) was used to extract RNA from snap-frozen tissues. The removal of the ribosomal component from the total RNA was performed using the RiboMinus Eukaryote Kit for RNA Seq (Invitrogen), according to the manufacturer’s instructions.
Primer design and cDNA first strand synthesis
The RNAs, depleted of the ribosomal component, were converted into cDNAs using the “SuperScript III (SSIII) First-strand synthesis system for RT-PCR” (Invitrogen). The associated protocol was modified in relation to some components, times and reaction conditions. In summary, 30-40 ng of post-Ribominus RNA was incubated at 75°C for 5 min with 0.5 mM dNTP and 50 μM 5′-phosphorylated Tag-random-octamers (rv5-3 tag) custom designed (5′-TCGCGATCGTCGNNNNNNNN-3′) in a volume of 10 μl. Samples were placed immediately on ice for 5 min and 1X RT Buffer SSIII, 5 mM MgCl2, 10 mM DTT, 40U RNaseOUT, 2.5%DMSO and 200U of SuperScript III Reverse Transcriptase were added. The mixtures were incubated at 25°C for 10 min, followed by 45°C for 90 min and 95°C for 5 min. After blocking the reaction on ice, 1 μl of RNase H was added and incubation performed at 37°C for 20 min.
Synthesized cDNAs were purified using the MinElute PCR Purification kit (QIAGEN) to eliminate primers, nucleotides, enzymes and salts. The cDNA mixtures were eluted in 10 μl of EB Buffer (10 mM Tris∙Cl, pH 8.5).
Second strand synthesis, cDNA ligation and library amplification
The purified cDNAs, resuspended in EB Buffer, were ligated together in long concatenamers, double-stranded converted and subsequently amplified following a modified protocol of the QuantiTect Whole Transcriptome kit (QIAGEN).
Since the Invitrogen kit was used for the synthesis of the first strand, the first step of the kit Qiagen was skipped. 10 μl of purified cDNA were mixed and incubated at 22°C for 2 h with 6 μl of Ligation Buffer, 2 μl of Ligation Reagent, 1 μl of Ligation Enzyme 1 and 1 μl of Ligation Enzyme 2. Hence, 29 μl of Amplification buffer and 1 μl of Phi29 DNA polymerase were added and the mixtures were incubated at 30°C for 4 h and then at 95°C for 5 min.
The ligated cDNAs are amplified by REPLI-g DNA polymerase, which moves along the cDNA template strand displacing the complementary strand. The displaced strand becomes a template for replication, allowing high yields of high-molecular-weight cDNA to be generated. This enzyme works as many engineered systems on linearized template .
Library preparation for 454 GS FLX sequencing platform
Sequencing libraries were prepared following the Rapid Library Preparation Method Manual (Roche, GS FLX Titanium series) starting from 1 μg of the ligated ds cDNAs. The libraries thus obtained were amplified in the emulsion PCR system. The enriched-DNA beads of the two samples (OST-83 and OST-78) were deposited onto the wells of a full Roche 454 FLX Titanium Picotiter Plate device and pyrosequenced according to the Sequencing Manual.
Library preparation for Illumina MiSeq sequencing platform
Sequencing libraries were prepared following the Illumina Nextera DNA Sample Preparation Method Guide (Part 15027987 Rev. B, October 2012) starting from 50 ng of ligated ds cDNA. The two indexed libraries thus obtained were applied to cluster generation and a paired-end 250 bp sequencing on Illumina MiSeq platform to obtain about 4 M reads/sample.
Identification and evaluation of tag strand-specificity
An ad hoc python script, named Tag Find (freely available at http://sourceforge.net/projects/tagfind/files/?source=navbar), was developed to clean up the reads from long tags (12 bases) used to identify point of ligations between adjacent cDNA molecules. Tag Find uses a fastq file as input, and was implemented to discard reads shorter than a user-defined value, after tags elimination. Tag Find scans nucleotide windows with length ranging from 11 to 13 nucleotides, starting from the 5′ end of each read, allowing at most one variation (deletion, insertion or mismatch) in the tag sequence. It recognizes the tag sequence in both orientations and saves results. In particular, since the direction of sequence synthesis is 5′→3′, all sequence reads with tags inversely directed (3′→5′) were considered chimeric artefacts produced during the multidisplacement amplification activity of the Phi29 DNA polymerase [26, 27] and can be quantified. Such script was tested on a 454 simulated dataset of 1 K fastq reads (available at http://sourceforge.net/projects/tagfind/files/Test) obtained by using the Metasim software . The dataset was built with reads 350nt long on average and containing at least three tags per sequence (see more details in Additional file 3).
Reads mapping and genes annotation
Tag removed reads were subjected to sequence quality control with the FastQC program (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Reads longer than or equal to 40 nucleotides were used for downstream analyses exclusively. Since a drop in median quality score (QS) per base was observed only within the 3′end of MiSeq reverse-reads in each sample, these datasets were trimmed using custom python scripts. In particular, upon the observation of quality metrics generated by the FastQC program, OST-78 and OST-83 reverse-reads were trimmed starting from the180th and the 209th (OST-83) nucleotide of the sequence, respectively, in order to maintain a median QS value per base ≥ 25, overall the reads. MiSeq SAM files of alignment, generated with the GMAP aligner software  using the human reference genome (hg18/NCBI_36), were further subjected to the elimination of duplicated sequences, known to be prevalently produced during the pre-amplification sequencing step of the Illumina workflow . The MarkDuplicates function of the Picard tools suite (v.1.68) (http://sourceforge.net/projects/picard/files/picard-tools/1.68) was used to this aim. Fastq sequences deprived of duplicates were extracted from SAM files with the SamToFastq function of Picard tools and re-mapped onto the human genome through the NGS-Trex platform  (http://www.ngs-trex.org). A read coverage of at least 70 bases and a minimum identity of 95% were admitted in this step. In addition, each read was assigned to the gene of origin if at least 50 bases were in overlap with the genomic region annotated as ‘Genic’ in NGS-Trex. Gene annotations were performed by the NGS-Trex platform as described in . At least 95% of nucleotide identity was required to link individual read to specific splicing isoforms, tolerating a trim of 15nt at most. Reads with ambiguous gene assignments were discarded. The presence of rRNAs within each library was further verified considering the amount of reads mapping onto nuclear and mitochondrial human rRNAs identified by NGS-Trex. All coding/non-coding RNA species were identified among the whole set of transcripts mapped, downloaded from the NGS-Trex system, by searching for NM and XM RefSeq NCBI prefixes for coding and NR and XR prefixes for non-coding isoforms.
Custom python scripts were used to check on the presence of antisense reads by scanning the FLAG field of the SAM file in order to recognize sequences oriented by their tag on the genomic strand opposite to that annotated as ‘Genic’ (see Figure 1A-d for more details).
Differential gene expression analysis and search of genes coding for the Respiratory Chain subunits
The total reads coverage per gene was used as digital expression value for the differential expression analysis, performed separately on 454 and MiSeq data. All statistical analyses to calculate differentially expressed (DE) genes were conducted within the R environment (http://www.r-project.org). Only genes expressed in both samples were considered and analyzed with the edgeR Bioconductor package . In order to eliminate compositional biases of the two cDNA tumor samples compared in the experiment, CalcNormFactors function of the edgeR package was used to estimate the normalization factors implemented in the recalculation of each unbiased library size value per sample. A stringent adjusted (Benjamini-Hochberg correction) p-value (≤0.01) was chosen to consider a gene differentially expressed. Fold changes of gene expression were calculated with edgeR as ratio of digital expression values (OST-78/OST-83). The identification of DE Respiratory Chain (RC) genes was conducted by intersecting the set of DE genes found (data not shown) with the list of RC genes downloaded from the HUGO database (http://www.genenames.org/genefamilies/mitocomplex). RPKM (reads per kilobase per million reads) values  of gene expression, calculated with both the sequencing platforms and associated to four nuclear (NDUFAB1, UQCRQ, ATP5G3, NDUFB5, COX5B) and one mitochondrial-encoded (MT-ND5) RC genes, were downloaded from the NGS-Trex platform in order to calculate a Spearman’s rank correlation coefficient and the related p-values of significance.
Validation of RNA-Seq data was performed on Phi29 amplified cDNA library (1:50 dilution). Primer sequences were designed on exon-exon boundaries considering the exon regions covered by the reads and using Primer3 software . The presence of 3′ intra/inter primer similarity was ruled out using IDT OligoAnalyzer tool (http://eu.idtdna.com/analyzer/Applications/OligoAnalyzer/) and SNP positions were verified based on the RNA-seq data. The availability of the target sequence was evaluated by prediction of the cDNA secondary structure using Mfold web server . Primer sequences and concentrations used for the Real-time PCR, together with the primer-specific PCR efficiencies are listed in (Additional file 1: Table S3). The PCR reaction was performed with GoTaq® qPCR Master Mix (Promega) and run in 7500 Fast Real-Time PCR System (Applied Biosystems), using the following conditions: 95°C 5 min; 45 cycles of 95°C 15 sec and 60°C 45 sec. The calculations were performed following 2-ΔΔCTmethod. The normalization was performed with geometric average of the 2-ΔCT of two reference genes, namely U2AF2 and H2AFV, which were selected based on RNA-seq data. The normalization of the mtDNA transcription was performed using geometric average of the 2-ΔCT of U2AF2 and TUBG1, the latter being designed on the 3′UTR, in order to normalize for the lack of introns in MT-ND5. In order to compare RNA-seq and qReal-Time PCR data, log2 fold changes were calculated as ratio OST-78/OST-83. The R coefficients and related CI with a significance level of 0.05 were calculated between log2 fold changes of gene expression by applying the Pearson’s correlation.
Availability of supporting data
We thank A. Valletti for helpful discussion and critical reading the manuscript. We would like to thank F. Mignone and I. Boria for their support in mapping data revision.
This work was supported by Ministero dell’Istruzione, Università e Ricerca (projects Micromap [PON01_02589], Virtualab [PON01_01297], Futuro in Ricerca 2008 [J31J10000040001], and Progetto di ricerca M.I.U.R. [prot. n.718/RIC.] 2005 “Realizzazione di un laboratorio internazionale di bioinformatica” , by the Associazione Italiana Ricerca sul Cancro - AIRC [IG8810]; Fondazione Umberto Veronesi - DISCO TRIP project to G.G.; by ‘Fondazione della Cassa di Risparmio di Puglia’ to M.A.; I.K. is supported by a triennial fellowship ‘Borromeo’ from AIRC.
- Horner DS, Pavesi G, Castrignanò T, De Meo PD, Liuni S, Sammeth M, Picardi E, Pesole G: Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform. 2010, 11: 181-197. 10.1093/bib/bbp046.View ArticlePubMedGoogle Scholar
- Shendure J, Aiden EL: The expanding scope of DNA sequencing. Nat Biotechnol. 2012, 30: 1084-1094. 10.1038/nbt.2421.PubMed CentralView ArticlePubMedGoogle Scholar
- Shokralla S, Spall JL, Gibson JF, Hajibabaei M: Next-generation sequencing technologies for environmental DNA research. Mol Ecol. 2012, 21: 1794-1805. 10.1111/j.1365-294X.2012.05538.x.View ArticlePubMedGoogle Scholar
- Simon C, Daniel R: Achievements and new knowledge unraveled by metagenomic approaches. Appl Microbiol Biotechnol. 2009, 85: 265-276. 10.1007/s00253-009-2233-z.PubMed CentralView ArticlePubMedGoogle Scholar
- ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.View ArticleGoogle Scholar
- Forrest AR, Carninci P: Whole genome transcriptome analysis. RNA Biol. 2009, 6: 107-112. 10.4161/rna.6.2.7931.View ArticlePubMedGoogle Scholar
- Dong L, Jensen RV, De Rienzo A, Gordon GJ, Xu Y, Sugarbaker DJ, Bueno R: Differentially expressed alternatively spliced genes in Malignant Pleural Mesothelioma identified using massively parallel transcriptome sequencing. BMC Medical Genetics. 2009, 10: 149-10.1186/1471-2350-10-149.PubMed CentralView ArticlePubMedGoogle Scholar
- Sajnani MR, Patel AK, Bhatt VD, Tripathi AK, Ahir VB, Shankar V, Shah S, Shah TM, Koringa PG, Jakhesara SJ, Joshi CG: Identification of novel transcripts deregulated in buccal cancer by RNA-seq. Gene. 2012, 507: 152-158. 10.1016/j.gene.2012.07.036.View ArticlePubMedGoogle Scholar
- Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M: Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012, 2012: 251364-PubMed CentralPubMedGoogle Scholar
- Valletti A, Anselmo A, Mangiulli M, Boria I, Mignone F, Merla G, D'Angelo V, Tullo A, Sbisà E, D'Erchia AM, Pesole G: Identification of tumor-associated cassette exons in human cancer through EST-based computational prediction and experimental validation. Mol Cancer. 2010, 9: 230-10.1186/1476-4598-9-230.PubMed CentralView ArticlePubMedGoogle Scholar
- Ramsköld D, Kavak E, Sandberg R: How to analyze gene expression using RNA-sequencing data. Methods Mol Biol. 2012, 802: 259-274. 10.1007/978-1-61779-400-1_17.View ArticlePubMedGoogle Scholar
- Yang KC, Ku YC, Lovett M, Nerbonne JM: Combined deep microRNA and mRNA sequencing identifies protective transcriptomal signature of enhanced PI3Kα signaling in cardiac hypertrophy. Journal of Molecular and Cellular Cardiology. 2012, 53: 101-112. 10.1016/j.yjmcc.2012.04.012.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu ZB, Han SP, Bai YF, Zhu C, Pan Y, Guo XR: microRNA expression profiling in fetal single ventricle malformation identified by deep sequencing. Int J Mol Med. 2012, 29: 53-60.PubMedGoogle Scholar
- Gasparre G, Kurelac I, Capristo M, Iommarini L, Ghelli A, Ceccarelli C, Nicoletti G, Nanni P, De Giovanni C, Scotlandi K, Betts CM, Carelli V, Lollini PL, Romeo G, Rugolo M, Porcelli AM: A mutation threshold distinguishes the antitumorigenic effects of the mitochondrial gene MTND1, an oncojanus function. Cancer Res. 2011, 71: 6220-6229. 10.1158/0008-5472.CAN-11-1042.View ArticlePubMedGoogle Scholar
- Dean FB, Nelson JR, Giesler TL, Lasken RS: Rapid amplification of plasmid and phage DNA using Phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res. 2001, 11: 1095-1099. 10.1101/gr.180501.PubMed CentralView ArticlePubMedGoogle Scholar
- Rodríguez I, Lázaro JM, Blanco L, Kamtekar S, Berman AJ, Wang J, Steitz TA, Salas M, de Vega M: A specific subdomain in φ29 DNA polymerase confers both processivity and strand-displacement capacity. Proc Natl Acad Sci U S A. 2005, 102: 6407-6412. 10.1073/pnas.0500597102.PubMed CentralView ArticlePubMedGoogle Scholar
- Paez JG, Lin M, Beroukhim R, Lee JC, Zhao X, Richter DJ, Gabriel S, Herman P, Sasaki H, Altshuler D, Li C, Meyerson M, Sellers WR: Genome coverage and sequence fidelity of ϕ29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Research. 2004, 32: e71-10.1093/nar/gnh069.PubMed CentralView ArticlePubMedGoogle Scholar
- Marzluff WF, Wagner EJ, Duronio RJ: Metabolism and regulation of canonical histone mRNAs: life without a poly(A) tail. Nat Rev Genet. 2008, 9: 843-854. 10.1038/nrg2438.PubMed CentralView ArticlePubMedGoogle Scholar
- Smibert P, Miura P, Westholm JO, Shenker S, May G, Duff MO, Zhang D, Eads BD, Carlson J, Brown JB, Eisman RC, Andrews J, Kaufman T, Cherbas P, Celniker SE, Graveley BR, Lai EC: Global patterns of tissue-specific alternative polyadenylation in Drosophila. Cell Rep. 2012, 1: 277-289. 10.1016/j.celrep.2012.01.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Klevebring D, Fagerberg L, Lundberg E, Emanuelsson O, Uhlén M, Lundeberg J: Analysis of transcript and protein overlap in a human osteosarcoma cell line. BMC Genomics. 2010, 11: 684-10.1186/1471-2164-11-684.PubMed CentralView ArticlePubMedGoogle Scholar
- Cheng ZF, Deutscher MP: An important role for RNase R in mRNA decay. Mol Cell. 2005, 17: 313-318. 10.1016/j.molcel.2004.11.048.View ArticlePubMedGoogle Scholar
- Wu Q, Kim YC, Lu J, Xuan Z, Chen J, Zheng Y, Zhou T, Zhang MQ, Wu CI, Wang SM: Poly A- transcripts expressed in HeLa cells. PLoS One. 2008, 3: e2803-10.1371/journal.pone.0002803.PubMed CentralView ArticlePubMedGoogle Scholar
- Kriesel JD, Hobbs MR, Jones BB, Milash B, Nagra RM, Fischer KF: Deep Sequencing for the Detection of Virus-Like Sequences in the Brains of Patients with Multiple Sclerosis: Detection of GBV-C in Human Brain. PLoS One. 2012, 7: e31886-10.1371/journal.pone.0031886.PubMed CentralView ArticlePubMedGoogle Scholar
- Bonora E, Porcelli AM, Gasparre G, Biondi A, Ghelli A, Carelli V, Baracca A, Tallini G, Martinuzzi A, Lenaz G, Rugolo M, Romeo G: Defective oxidative phosphorylation in thyroid oncocytic carcinoma is associated with pathogenic mitochondrial DNA mutations affecting complexes I and III. Cancer Res. 2006, 66: 6087-6096. 10.1158/0008-5472.CAN-06-0171.View ArticlePubMedGoogle Scholar
- Porcelli AM, Ghelli A, Ceccarelli C, Lang M, Cenacchi G, Capristo M, Pennisi LF, Morra I, Ciccarelli E, Melcarne A, Bartoletti-Stella A, Salfi N, Tallini G, Martinuzzi A, Carelli V, Attimonelli M, Rugolo M, Romeo G, Gasparre G: The genetic and metabolic signature of oncocytic transformation implicates HIF1alpha destabilization. Hum Mol Genet. 2010, 19: 1019-1032. 10.1093/hmg/ddp566.View ArticlePubMedGoogle Scholar
- Zhang K, Martiny AC, Reppas NB, Barry KW, Malek J, Chisholm SW, Church GM: Sequencing genomes from single cells by polymerase cloning. Nature Biotechnology. 2006, 24: 680-686. 10.1038/nbt1214.View ArticlePubMedGoogle Scholar
- Lasken RS, Stockwell TB: Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC Biotechnol. 2007, 7: 19-10.1186/1472-6750-7-19.PubMed CentralView ArticlePubMedGoogle Scholar
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One. 2008, 3: e3373-10.1371/journal.pone.0003373.PubMed CentralView ArticlePubMedGoogle Scholar
- Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005, 21: 1859-1875. 10.1093/bioinformatics/bti310.View ArticlePubMedGoogle Scholar
- Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of GC-biased genomes. Nat Methods. 2009 April, 6 (4): 291-10.1038/nmeth.1311.PubMed CentralView ArticlePubMedGoogle Scholar
- Boria I, Boatti L, Pesole G, Mignone F: NGS Trex – Next Generation Sequences Transcriptome profile explorer. BMC Bioinformatics, in press. BMC Bioinformatics, in pressGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010, 26: 139-140. 10.1093/bioinformatics/btp616.PubMed CentralView ArticlePubMedGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Rozen S, Skaletsky HJ: Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000, 132: 365-86.PubMedGoogle Scholar
- Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003, 31: 3406-3415. 10.1093/nar/gkg595.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.