Illumina Read Alignment
The basis of our comparison was a DNA microarray array experiment in which the gene expression in two different haploid strains of S. cerevisiae, RM11-1a and BY4716, in two different growth media was measured using six biological replicates (GEO accession number GSE9376) [11]. Two color Agilent arrays were used, with a common reference consisting of a mixture of all of the original conditions and strains. Using the same RNA samples profiled in the array replicates, we sequenced full length cDNA corresponding to each strain and condition with a single lane of the Illumina 1 G sequencer.
Differential gene expression information from the Illumina 1 G sequencer was obtained by aligning reads to a reference genome and counting the number of reads that overlap each ORF. Samples were denoted BYg, BYe, RMg, RMe where BY indicates strain BY4716 and RM indicates RM11-1a. The lower-case e and g denote two nutrient conditions: ethanol and glucose as carbon sources. We used a simple alignment strategy that assumed a low error rate and determined the alignment with the entire reference genome with the fewest number of differences (insertions, deletions, and substitutions) of the reads. Between 91.6% and 96.2% of the reads obtained from a single lane of the Illumina 1 G sequencer aligned with the reference S. cerevisiae sequence [see Additional file 1, Table S1]. The BY4716 strain has few differences from the S288C reference strain. Using the unfinished assembly of the RM11-1a genome sequence from the Broad Institute, we found 5974 genes that were reciprocal best matches with at least 98% sequence identity between the two strains across alignments of ORFs, including 1 kb upstream and 1 kb downstream sequence[13]. Expression values were only computed for these 5974 genes using the Illumina 1 G data.
We separated reads into unique reads that aligned to a single location in the genome and reads that aligned to multiple locations. Among the reads that mapped to a unique position, we separated reads into those that mapped to an ORF, a ribosomal RNA (rRNA) gene, or a transfer RNA (tRNA) gene. We observed that the vast majority of the unique reads originated from ORFs [see Additional file 1, Table S2]. The distribution among multiple-mapping reads was different (see Table S3). Among these reads, most originated from rRNA genes. Even though we reduced the population of rRNA in our samples by depleting ribosomal small and large subunit RNAs with a LNA probe and by using reverse transcription with an oligo-dT primer, between 30.9%–60.4% of the total number of reads originated from rRNAs [see Additional file 1, Table S3]. Furthermore, one of the consequences of the short read length of the Illumina 1 G is that reads from rRNAs might also align to ORFs of protein-coding genes. In the five samples, we found that between 4.6%–10.4% of the total number of reads matched both an rRNA gene and an ORF. A smaller fraction of reads, 1.9%–3.3%, aligned to two or more ORFs. In all further analyses we used only reads that uniquely align to an ORF, including the few reads that mapped to more than one location within a single ORF.
Statistically Significant Differential Gene Expression
To carefully assess the ability of the sequencer to detect differential expression we focused our analysis on the RMe and RMg samples, which show significant differential expression between the two environmental conditions, had similar data quality, and contain differences from the reference sequence. Analysis of the other samples is summarized in [see Additional file 1, Table S4]. We obtained 1,919,687 reads for RMe and 2,719,827 reads for RMg that aligned to ORFs and passed our filtering criteria, resulting in 5186 genes covered by at least one read. For each ORF a digital measurement of expression was obtained by counting the total number of reads that align to it (Fig. 1A). Differential gene expression between RMe and RMg was determined by taking the log2 ratio of RMe to RMg counts, normalizing for the different total number of read counts between the two samples. For the 4697 genes that passed quality criteria for both the microarrays and sequencer, the resulting ratio was highly correlated (R = 0.75356, bootstrap 95% CI: 0.7236–0.7851) between microarrays and the sequencing data (Fig. 1B).
We identified the subset of differentially expressed genes for each platform. We detected significant differential expression for 1027 genes by arrays and 1303 genes by sequencing, with 566 genes detected as differentially expressed by both platforms (Fig. 1C). The subsets of genes detected as differentially expressed in a single technology had distinct array intensities and read coverage. The subset of genes that were called significantly differentially expressed only by arrays had significantly lower read counts in the sequencing data than the set of all genes examined (t = -5.2306, df = 1132.873, p < 2.012 × 10-7) as well as significantly lower average array intensity (t = -5.3667, df = 1137.305, p < 9.175 × 10-8), suggesting that differential expression of genes with low expression levels is detected more readily by arrays. The subset of genes that were called significantly differentially expressed by sequencing alone had higher read counts than the set of all genes examined (t = 20.3979, df = 254.083, p < 2.2 × 10-16) as well as higher array intensities (t = 12.7456, df = 237.751, p < 2.2 × 10-16), indicating that differential expression of genes with high expression is detected more frequently by sequencing than by arrays.
Modeling Sampling Error for Low Abundance Transcripts
We modeled the effect of counting noise on sequencing-based measurements of differential expression (see Methods). The scatter plots comparing differential gene expression determined from arrays and from simulated sequencing results predicted from our noise model were similar to what was experimentally observed (Fig. 1B and Fig. 1D). Both the observed results and simulations show similar increases in correlation when thresholding on the number of reads, indicating that modeling counting error captures most of the noise in the cDNA sequencing data ([see Additional file 1, Table S4] and Fig. 1E). In addition to this comparison between the RMe and RMg samples discussed above, we computed all between-strain-within-condition, and between-condition-within-strain Pearson correlations for the BYg, RMg, RMe, and RMg samples [see Additional file 1, Table S4]. The correlations showed a similar dependence on a threshold on the number of reads obtained from a gene.
Investigating Differences by qPCR
To better understand the disagreement between the cDNA sequencing and array results, we performed qPCR on a large subset of genes. We randomly selected 192 genes from genes significantly differentially expressed in the following categories: in the Illumina 1 G sequencing data only, in microarrays only, and as measured by both technologies. Of the 192 candidates, 12 genes that showed non-specific amplification or unexpected size amplification products were removed from the analysis. qPCR results are highly correlated with both microarrays (Fig. 2A R = 0.86, bootstrap 95% CI: 0.7043 – 0.953) and sequencing results (Fig. 2B R = 0.82, bootstrap 95% CI: 0.7031 – 0.8917). Of the 9 genes that were significantly differentially expressed in both technologies but where the technologies disagreed on the direction of the change, 6 out of 9 showed no differential expression by qPCR. Because of the number of genes we measured, we were able to examine the discrepant subsets more closely than previous studies[1, 2]. Specifically, we found that the subset of genes recognized as significantly differentially expressed by arrays only was highly correlated with qPCR (Fig. 2C R = 0.925, bootstrap 95% CI: 0.8621 – 0.9648) whereas as the subset of genes recognized as differentially expressed only by Illumina sequencing was moderately correlated with qPCR measurements (Fig. 2D R = 0.518, bootstrap 95% CI: 0.3227 – 0.7069).
Simultaneous Discovery of Heterozygosity and Allele-Specific Expression
As a prelude to detecting heterozygosity and measuring allele specific expression (ASE), we examined the capability of the Illumina 1 G sequencer to locate cDNA single nucleotide polymorphisms (SNPs) in our haploid RM11-1a strain samples. Alignments of ORFs from the reference sequence with the RM11-1a unfinished genome sequence provided the known position of 24,751 SNPs. We tested how accurately these known RM11-1a SNPs were rediscovered using the Illumina 1 G reads aligned with the divergent S288C reference sequence. We tested a total of 5,926,474 bases covered by at least one read. 180,628 bases showed at least one discrepancy between an Illumina base-call and the reference sequence. Considering all discrepancies gives more than a sevenfold overestimate of the known SNPs, so we calculated a likelihood ratio statistic for each discrepant base to more accurately identify SNPs (see methods). With this additional filter we were able to detect 11,608 known SNPs while only falsely calling SNPs for 457 bases (3.8 × 10-2 false discovery rate).
Next, we focused on detecting heterozygosity and quantifying ASE simultaneously in data from diploid organisms. First, we simulated the results that could be expected from a diploid strain by combining the haploid BYg and RMg data sets together. This in silico diploid had twice the total number of reads that we obtained from a single lane for an actual diploid. Thus, it has greater power to detect sites of heterozygosity and sets an upper limit to what we could expect from an actual diploid strain. We tested a total of 6,681,784 bases covered by at least one read. 236,975 bases showed at least one discrepancy between an Illumina base-call and the reference sequence. After applying the same statistical filter described above, we were able to detect 9,848 known sites of heterozygosity while falsely calling 654 bases (6.2 × 10-2 false discovery rate) for this in silico data set. We calculated simulated ASE expression by separating reads directly aligning over the SNP base into two groups based on which base they contain, counting the total reads in each group, and calculating the log2 ratio to determine differential gene expression. ASE measurements from the in silico BYg/RMg diploid for genes with over 10 sequencing reads were significantly correlated (R = .69, bootstrap 95% CI: 0.6504 – 0.7322) with the differential expression values calculated by summing all reads aligning to an ORF from the haploid samples.
Last, we used sequencing data obtained from an actual BYg/RMg diploid S. cerevisiae strain grown in glucose to discover sites of heterozygosity and simultaneously quantify allele specific expression. We tested 5,072,257 bases covered by at least one read, and 270,898 bases showed at least one discrepancy with the reference sequence. Using the likelihood ratio statistic, we were able to detect 3,691 known sites of heterozygosity while falsely calling 868 bases (1.9 × 10-1 false discovery rate). We calculated ASE at these heterozygous sites by calculating the ratio of the number of reads containing the BY base and the number of reads containing the RM base. Considering only the heterozygous sites assessed by at least 10 sequencing reads, we observed limited correlation (n = 22, R = 0.313, p < 0.155) with the ASE previously quantified for those transcripts in a BY/RM diploid by allele-specific RT-PCR [14]. Although the allele-specific RT-PCR results for the BY and RM strains were obtained in a separate experiment, the RNA samples were prepared from cells grown under similar conditions.
As another assessment of ASE, we compared our ASE predictions to results from genetic linkage analysis of gene expression levels. Using 112 haploid segregants from a cross of the BY and RM strains, expression levels were linked to both local and distant loci [14]. Local linkages can be mechanistically explained by polymorphisms in promoters, 3' untranslated regions, or coding sequence [15]. Approximately 25% of all transcripts show local linkage. In many cases, these types of local linkages alter gene expression in cis, such that local-linkage effect sizes in a sample of transcripts have been observed to correlate with ASE as quantified by allele-specific qPCR [14]. Therefore, we compared our ASE estimates to genes with known local linkages and found that ASE ratios determined by cDNA sequencing data from the BYg/RMg diploid were weakly correlated with average effect sizes of genes with local linkages (R = 0.13, bootstrap 95% CI: 0.08 – 0.22). Thresholding on only those genes with at least 10 sequencing reads at the site of heterozygosity improves this correlation (R = 0.332, bootstrap 95% CI: 0.20 – 0.46).
Comments
View archived comments (1)