Quantitative measurement of RNA-cDNA hybridization kinetics demonstrates that the eukaryotic transcriptome is dominated by a relatively small number of highly expressed genes. 20% of the yeast transcriptome is expressed at hundreds of copies per cell, while most other transcripts are present in tens of copies per cell or fewer . These predictions from hybridization data are consistent with SAGE experiments in yeast . Full characterization of gene expression by sequencing requires sufficient data to accurately measure transcripts with low abundance.
We found that a large fraction of Illumina 1 G reads originated from highly abundant ribosomal rRNAs. Between 30.9%–60.4% of the total number of reads in our six samples originated from rRNAs. Even though we depleted each cDNA sample of ribosomal RNA before priming cDNA synthesis using the mRNA polyA tail, the remaining highly structured ribosomal RNAs self-primed a significant amount of reverse transcription, so that the final sample contained a large amount of cDNA originating from ribosomal RNA. This problem of limited mRNA coverage can be addressed in one of three ways: by combining data from several lanes of the sequencer (at increased cost), by developing new techniques in order to selectively and efficiently deplete abundant rRNAs, and through advances in sequencing technology that increase the number of reads per run.
The effect of the limited number of reads was observed in our model of sampling error. The most striking similarity between the simulated and real correlation plots is the vertical cluster centered at a log2 value of 0 (Fig. 1B and Fig. 1D). This vertical cluster suggests that small gene expression ratios were less reliably measured by the Illumina 1 G than by the Agilent microarray platform. The majority of the genes in this vertical cluster were low-abundance transcripts with few reads. Reflecting the low reliably of these measurements, filtering genes based on a threshold number of reads in all six samples improves the correlation between array and sequencing data (Fig. 1E).
The correlation between array measurements and RNA sequencing has been addressed by a number of recent studies and in most cases ranges from 0.6 to 0.85[1, 2]. Similar correlations were observed when comparing microarrays to massively parallel signature sequencing (MPSS) and SAGE [19–25]. This suggests that the disparity may arise through differences between the array and sequence technologies rather than noise specific to new sequencing technologies. While correlation is a useful comparison method, a key metric for comparing expression technologies is the accuracy of detecting differential gene expression.
Counting error limits the ability of cDNA sequencing to accurately identify changes in mRNA levels for low abundance transcripts. In our analysis of the RMg and RMe samples, only 566 genes were detected as differentially expressed by both platforms, about half of the total genes found differentially expressed by each platform individually. The subset of genes that were called significantly differentially expressed by only arrays had significantly lower number of reads than the set of all genes examined, as well as lower average array intensity. Because of the limited number of filtered reads for these genes, low abundance transcripts were less accurately measured by the Illumina 1 G. In contrast, the subset of genes that were called differentially expressed by sequencing was enriched for genes with both higher read counts and higher array intensities. This indicates that additional sequencing data would improve quantification of genes with lower expression levels. Furthermore, for highly transcribed genes having more than 300 read counts and for which differential expression was assessed by arrays, sequencing, and qPCR, qPCR ratios were more strongly correlated with sequencing results than with array results (R = 0.907 vs R = 0.850). This suggested that the Illumina 1 G was less subject to saturation effects than microarray analysis.
Based on our qPCR assessment of transcript levels for a large number of genes, we find both sequencing and array technologies have similar overall accuracy for measuring differential gene expression. One report has observed very high (up to R = 0.98) correlations between cDNA sequencing and qPCR, but selected just 34 genes spanning a 6000-fold range of expression . Another study used qPCR only to address discrepancies in 11 genes identified as differentially expressed , observing that sequencing made 4 of 5 correct calls of differentially expressed genes while arrays correctly identified 2 of 6 differentially expressed genes. Unlike our study, these studies used either much smaller sample sizes or were biased in selecting genes to be interrogated by qPCR. Because of the large sample of genes we measured by qPCR assays, we were able to examine subsets of these genes with differential expression more closely than previous studies[1, 2]. We found that the subset of genes recognized as significantly differentially expressed by arrays only was highly correlated with qPCR (R = 0.925, bootstrap 95% CI: 0.8621 - 0.9648), whereas as the subset of genes recognized as differentially expressed only by Illumina sequencing was moderately correlated with qPCR (R = .518, bootstrap 95% CI: 0.3227 - 0.7069). Although caution must be exercised in distinguishing between the effectiveness of the statistical method to evaluate differential expression and the accuracy of the technology, our results suggests that arrays may have an advantage in measuring differential gene expression for low-abundance transcripts.
For 9 genes that were significantly differentially expressed in both technologies but where the technologies disagreed on the direction of the change, the qPCR results were inconclusive. All 9 genes were measured with fewer than 300 reads. The majority, six out of 9, showed no differential expression by qPCR. They were likely false positives for both sequencing and arrays or false negatives for qPCR. Two of the qPCR assayed genes agreed with the array prediction, while the third agreed with the sequencing data. These numbers were too small to draw definitive conclusions.
Our analysis, both in our comparison to arrays and qPCR, focused on the differential gene expression within the RM11-1a strain across two environmental conditions (RMg and RMe). This represents a realistic situation in which the strain of interest has several polymorphisms compared to a reference sequence. Moreover, gene expression differs greatly between these conditions . The use of a non-reference strain allowed us to assess the accuracy of SNP detection in cDNA using the Illumina 1 G. Using alignments of the ORFs from a draft assembly of RM11-1a, we catalogued SNPs from genomic sequence and assessed how well these could be detected from cDNA. Simply flagging every site of sequence discrepancy from reference would have generated 180,628 predicted SNPs. The majority of these discrepancies most likely represent Illumina 1 G sequencing errors, as the sequence of RM11-1A has been established with 10× coverage Sanger sequencing . Calculating a likelihood ratio statistic allowed SNP detection with high sensitivity and specificity. 11,608 known SNPs were detected, and the false discovery rate was 3.8 × 10-2. cDNA sequencing shows great promise for generating a coding SNP map in any genome for which a reference sequence is available.
In our assessment of heterozygous site detection in a diploid, the total number of reads available limited predictive power. In the in silico data set, generated by combining reads from RMg and BYg, we are able to detect 9,848 known sites of heterozygosity while falsely calling 654 sites (6.2 × 10-2 false discovery rate). This performance was notably worse than detecting SNPs from haploid cDNA. Heterozygous site detection requires more reads than homozygous SNP detection because bases from each allele must be present to provide sufficient evidence for the existence of the site. This is further complicated by the presence of strong allele specific expression (ASE) for many genes, resulting in fewer reads from the less expressed allele. Consequently, our results with an actual BY4716/RM11-1a heterozygous diploid indicated that at high specificity only a smaller percentage, approximately 15%, of heterozygous sites could be detected. The discrepancy between the ability to detect heterozygosity for the in silico diploid and the actual BYg/RMg diploid can also be partially attributed to the fact that the in silico diploid was modeled by combining data from two sequencing runs and so had twice the data generated for the actual diploid.
Similarly, in our assessment of the quantitative measurement of ASE in a diploid, the total number of reads available limited the accuracy of the results. Correlation of ASE as determined by sequencing with ASE as measured by RT-PCR was weak (n = 22, R = 0.313, p < 0.1551). In contrast, ASE calculated for the in silico diploid showed significantly higher correlation with gene expression (R = .69, bootstrap 95% CI: 0.6504 - 0.7322). Because the in silico diploid contained the number of reads obtained from two lanes of a sequencer, this result suggests that the total number of reads was the limiting factor.